Bugs? - Just don't write buggy programs! - Simple! Of course, it will clearly never happen that a program written in this class would ever have any sort of problems, but, if, for some reason, a program that you write were to crash unexpectedly, there's something to watch out for.
An MPI program that contains parallelism may start simultaneously on all (or at least many) of the machines on the network. If one process crashes and MPI dies, it is quite possible that some of the other processes will go on living and, cut off from their MPI connection, just hang around using up CPU time. This is a great way to lose friends!
In fact, sit back for a while and imagine the lab computers, filled to the brim with students, all of them running their programs together on all the machines. One student's program crashes, leaving nine other copies of his program treading water.
Then a second person's program crashes. And a third.
These people try to fix their bugs, recompile, and run their programs again. The twenty-seven floundering processes from their first attempts are still around.
Some other people's programs crash, adding more dead weight. After a second compile-and-run attempt, the computers are host to sixty-three floundering processes, each potentially using up a unit of CPU load.
Inexplicably, they start to feel sluggish.
Tempers flare. People start getting out their knives.
Not a good scene!
Soooo, for just such an eventuality, I have provided the commands (which are actually scripts) psmachines.sh and cleanmachines.sh.
When you type psmachines.sh, a remote shell is started on each of the machines via ssh and a ps command is issued to display the current status of all processes on that machine associated with your username.
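To make the idea concrete, here is a minimal sketch of the kind of loop a script like psmachines.sh could run. It is not the actual course script. The function names `list_hosts` and `ps_on_machines`, the `REMOTE_SHELL` override, and the assumed machine file format (one host per line, optionally written as `host:ncpus`) are all illustrative assumptions.

```shell
#!/bin/sh
# Illustrative sketch only; NOT the real psmachines.sh.
# Assumed machine file format: one host per line, optionally "host:ncpus".
# Blank lines and '#' comments are ignored.

# list_hosts FILE: print the host names found in FILE, one per line.
list_hosts() {
    sed -e 's/#.*//' -e 's/:.*//' -e 's/[[:space:]]//g' -e '/^$/d' "$1"
}

# ps_on_machines FILE: show the current user's processes on every host in FILE.
# REMOTE_SHELL is a hypothetical override; it defaults to ssh.
ps_on_machines() {
    user=$(id -un)
    for host in $(list_hosts "$1"); do
        echo "== $host =="
        # pid, CPU share, elapsed time, and full command line for each process
        ${REMOTE_SHELL:-ssh} "$host" ps -u "$user" -o pid,pcpu,etime,args
    done
}
```

With these functions loaded, `ps_on_machines my_machines` would print one block of ps output per host, where `my_machines` stands for whatever machine file you passed to mpirun. This assumes you can ssh to each machine without typing a password every time.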
cleanmachines.sh automatically kills all your processes on the remote machines. It skips the machine from which you ran the command; otherwise your login shell, editors, and everything else would be killed too.
These programs accept a number of command-line switches. The most important is the "-machinefile" option, which lets you specify which machines to check. After all, you only really need to kill processes on machines you have been using, so use the same machine file that you used with mpirun. Both programs also accept an optional string to match, so that only the processes matching the string are affected rather than everything. The most useful choice for this string is the name of your MPI program.
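The behaviour described above (read a machine file, skip the local machine, kill only the processes that match a string) can be sketched roughly as follows. This is a guess at the general shape, not the real cleanmachines.sh. The function name `clean_machines`, the `REMOTE_SHELL` override, the use of `pkill` on the remote side, and the one-host-per-line machine file are all assumptions.

```shell
#!/bin/sh
# Illustrative sketch only; NOT the real cleanmachines.sh.
# Assumes: one host name per line in the machine file, pkill installed on
# the remote machines, and PATTERN being a plain program name with no
# spaces or shell metacharacters.

# clean_machines FILE PATTERN: on every host in FILE except this one,
# kill the current user's processes whose name matches PATTERN.
clean_machines() {
    machinefile=$1
    pattern=$2
    user=$(id -un)
    this_host=$(hostname)   # may differ from the machine file if one is fully qualified
    while read -r host; do
        [ -n "$host" ] || continue              # skip blank lines
        [ "$host" != "$this_host" ] || continue # never clean the local machine
        echo "cleaning $host"
        # </dev/null stops the remote shell from swallowing the host list.
        ${REMOTE_SHELL:-ssh} "$host" pkill -u "$user" "$pattern" </dev/null ||
            echo "  nothing matched (or connection failed) on $host"
    done < "$machinefile"
}
```

For example, `clean_machines my_machines my_mpi_prog` would only touch processes named like `my_mpi_prog` (both names are placeholders), leaving any other work you have running on those machines alone.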
Type psmachines.sh -h or cleanmachines.sh -h to get a more complete list of options.
These programs are available to you if your path includes /usr/local/units/aca319/bin, which it should if you followed the instructions in the "Startup file modifications" section here.
Be careful using these programs
Note that cleanmachines.sh will not kill any processes on the local machine. You will need to do this manually using the "kill" command. Type "psme" to get a listing of all of your processes on the local machine and then go through killing them using "kill -9 pid" where pid is the process id of the process you want to kill.
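If you would rather not type "kill -9" by hand for each process, the manual routine above can be wrapped in a small helper. This is only a sketch: `kill_stray` is a hypothetical name, not a course command, and sending a polite TERM signal before falling back to "kill -9" is a design choice added here, not something the course scripts are known to do.

```shell
#!/bin/sh
# Illustrative sketch; kill_stray is a hypothetical helper.

# kill_stray PID: ask the process to exit, then force it if it is still there.
kill_stray() {
    pid=$1
    kill -TERM "$pid" 2>/dev/null || return 0   # no such process (or not yours)
    sleep 1                                     # give it a moment to exit cleanly
    if kill -0 "$pid" 2>/dev/null; then         # kill -0 only checks existence
        kill -9 "$pid" 2>/dev/null
    fi
    return 0
}
```

If "psme" is not available, `ps -u "$(id -un)" -o pid,args` should give a comparable listing of your local processes. You can then call `kill_stray` with each process id you want to remove.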
Whenever you run an MPI program on a large number of computers and it crashes in a way that suggests floundering processes may be left over, run psmachines.sh to check your suspicions and cleanmachines.sh to kill any processes you still have hanging around.
It is strongly suggested that you issue a cleanmachines.sh command immediately before logging off to help keep the peace. All of your processes on the local machine should automatically be killed when you log out.