Troubleshooting Condor Job Problems
General suggestions
- Make sure you're giving the full path to the executable in your submit file, unless the executable is in the same directory you're running condor_submit in.
- Make sure the directory your logfile is in exists.
- Make sure the input and output file paths in your submit file are correct.
- If you're running a script as your executable, make sure its execute bit is set and the shebang line is correct. Try running it from the command line to make sure it works.
- Check the job's logfile for useful error messages.
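The checks in the list above can be sketched as a small shell function to run before condor_submit. This is only an illustration: the function name check_submit_inputs and its two arguments (the script path and the log file path) are hypothetical and not part of Condor, and the shebang test only makes sense when the executable is a script rather than a compiled binary.

```shell
# Hypothetical pre-submit check; not part of Condor.
# Usage: check_submit_inputs /full/path/to/script /path/to/logfile
check_submit_inputs() {
    script="$1"
    logfile="$2"

    # The executable must exist and have its execute bit set.
    if [ ! -f "$script" ]; then
        echo "executable not found: $script"
        return 1
    fi
    if [ ! -x "$script" ]; then
        echo "execute bit not set: $script"
        return 1
    fi

    # A script must start with a shebang line ("#!").
    if [ "$(head -c 2 "$script")" != "#!" ]; then
        echo "no shebang line: $script"
        return 1
    fi

    # The directory that will hold the log file must already exist.
    if [ ! -d "$(dirname "$logfile")" ]; then
        echo "log directory does not exist: $(dirname "$logfile")"
        return 1
    fi

    echo "ok"
}
```

The function stops at the first problem it finds and returns a non-zero status, so it could be chained in front of a submit command if that suits your workflow.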
Guidelines for specific situations
Job immediately goes into the Held state
Often this means Condor is having trouble executing the job. Try using condor_q -long and examining the HoldReason attribute. For example:
condor_q -global -long 13281 | grep 'HoldReason'
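As a rough illustration of what that grep is doing, the sketch below runs a similar filter over a made-up fragment of condor_q -long output rather than a live queue. The sample attribute values are invented for the example and the function name hold_reasons is hypothetical; anchoring the pattern with ^ simply restricts the match to attributes whose names begin with HoldReason.

```shell
# Hypothetical helper: keep only the HoldReason* attributes from
# "condor_q -long" output supplied on standard input.
hold_reasons() {
    grep '^HoldReason'
}

# Illustrative sample only; real output has many more attributes and
# the values will differ.
sample='ClusterId = 13281
HoldReason = "Error from slot1@node: Failed to execute job: Exec format error"
HoldReasonCode = 6
JobStatus = 5'

printf '%s\n' "$sample" | hold_reasons
```

With a live queue, the same function could be fed from condor_q -long followed by the job ID.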
One particularly common error that some users may find puzzling is "Exec format error." This usually means you've forgotten to include the "shebang" line in a script. While a shebang line is not necessarily required when running a script interactively, Condor needs to see one so it knows what shell or interpreter to use to run the script.
Job runs for a while, then bogs down or gets killed
This often means the job has exceeded its memory request without Condor noticing, and has gotten so large that the machine it's running on has run out of RAM. Check the SIZE column of condor_q and compare it to what you've specified on your request_memory line. (The default if you don't specify is 1024 MB.) See BigMemoryCondor and the section below for more information.
Job runs for a while, then goes idle (or gets evicted)
This is usually because your job is consuming more memory than you requested for it on the request_memory line of your submit description file. You can verify this by looking at the SIZE column of condor_q and comparing it to your memory request. (If you didn't include a request_memory line, the default is 1024 MB.) See BigMemoryCondor for more information on request_memory and how to use it.
If you don't want to have to re-submit the job, you can use condor_qedit to change its memory requirements on the fly. The format is condor_qedit <jobid> RequestMemory <memory in MB>. For example, to request 5 GB of RAM:
condor_qedit 123456 RequestMemory "5*1024"
The quote marks prevent the shell from interpreting the * in the multiplication as a file wildcard.
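To see why the quotes matter, here is a small self-contained demonstration that uses echo in a scratch directory instead of condor_qedit. The file name 5_old_1024 is invented purely to trigger a match; when no matching file exists, most shells pass the unquoted pattern through unchanged, which is why the problem can go unnoticed.

```shell
# Work in a scratch directory so nothing real is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Hypothetical file whose name happens to match the pattern 5*1024.
touch 5_old_1024

# Unquoted: the shell expands the pattern to the matching file name.
unquoted=$(echo 5*1024)

# Quoted: the expression reaches the command exactly as written.
quoted=$(echo "5*1024")

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

In the unquoted case the command receives the file name 5_old_1024 instead of the expression, so a memory request passed that way would not be what you intended.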
Job sometimes works and sometimes fails
Sometimes this can be caused by problems with a particular node -- either a misconfiguration, or a temporary problem such as memory pressure from jobs with incorrect memory specifications. Check your job log file and see if the IP address on the "Job executing on host:" line is the same for all the failed jobs. If so, you should email linghelp so the situation can be fixed. As a temporary fix, you can avoid the problematic node by excluding it from your requirements, e.g.:
Requirements = ( Machine != "patas-n1.ling.washington.edu" )
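One way to compare hosts across several failed jobs is to tally the "Job executing on host:" lines from their log files. The sketch below does this with standard tools; the function name count_hosts and the sample log lines (addresses included) are made up for illustration, and real log lines may carry extra fields.

```shell
# Hypothetical helper: count how often each host appears on a
# "Job executing on host:" line in the given job log files.
count_hosts() {
    grep -h 'Job executing on host:' "$@" |
        sed 's/.*Job executing on host: *//' |
        sort | uniq -c | sort -rn
}

# Illustrative log files; the addresses are invented.
log_dir=$(mktemp -d)
printf '001 (101.000.000) 05/01 10:00:00 Job executing on host: <192.0.2.11:9618>\n' > "$log_dir/job1.log"
printf '001 (102.000.000) 05/01 10:05:00 Job executing on host: <192.0.2.11:9618>\n' > "$log_dir/job2.log"
printf '001 (103.000.000) 05/01 10:09:00 Job executing on host: <192.0.2.12:9618>\n' > "$log_dir/job3.log"

count_hosts "$log_dir"/*.log
```

If you pass only the logs of failed jobs, a host that dominates the tally is a candidate for the Requirements exclusion shown above.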