If you recall, I wrote last week about my program that must have had some sort of error in it, because it worked for small numbers of processors, but not for 32 processors. I've never dealt with this sort of debugging before, so I got lots of advice from my friends. They scrutinized my files, but couldn't see anything obviously wrong with my program. One suggested that I just take everything out of the program, and get it down to the simplest program that did not work correctly.
I stripped it down to a simple "for" loop that printed out the numbers 1 to 5000. At that point I realized it was definitely not a problem with my program. So I asked my advisor about it today, and he got on the phone and asked the guy who runs the cluster to help me. The cluster guy looked at it, couldn't see anything fundamentally wrong with it, and invited me to come over to his office so he could help me figure it out.
I went over to his office right after lunch, and sat there for almost an hour as he tried to debug it. Eventually he put it all together and realized what the problem was. It was a problem with the way the job was spawned, and had nothing to do with my program. My program was actually running just fine, but the output was not finding its way back to the output file. I felt so relieved when he figured out the problem! He hasn't implemented a solution to it yet, but I was just relieved to know that it wasn't an error I had made. A huge weight was lifted from my shoulders (and placed on his!). This means I can continue running my program on fewer processors until they get it fixed.
Monday, April 25, 2005
1 comment:
YAAY!!! I'm so glad you got it figured out, and extra bonus it wasn't even your fault. Good work, sweetie!