Troubleshooting Condor Job Problems
- Make sure you're giving the full path to the executable in your submit file (unless the executable is in the same directory you run condor_submit from). See the example submit file after this list.
- Make sure the directory your log file will be written to exists; Condor won't create it for you.
- Make sure the paths to your input and output files are correct.
- If you're running a script as your executable, make sure its execute bit is set and its shebang line is correct. Try running it from the command line to make sure it works.
- Check the job's logfile for useful error messages.
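For reference, here is a minimal sketch of a submit description file that follows these guidelines. The universe, paths, and script name are hypothetical; substitute your own, and note that the logs directory must already exist:

universe   = vanilla
executable = /home/you/myproject/run.sh
log        = /home/you/myproject/logs/run.log
output     = /home/you/myproject/logs/run.out
error      = /home/you/myproject/logs/run.err
request_memory = 2048
queue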
Guidelines for specific situations
Job immediately goes into the Held state
Often this means Condor is having trouble executing the job. Try using condor_q -long
and examining the HoldReason attribute. For example:
condor_q -global -long 13281 | grep 'HoldReason'
One particularly common error that some users may find puzzling is "Exec format error." This usually means you've forgotten to include the "shebang"
line in a script. While a shebang line is not necessarily required when running a script interactively, Condor needs to see one so it knows what shell or interpreter to use to run the script.
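As a quick sanity check, assuming your executable is a shell script named run.sh (a hypothetical name), you can verify both the shebang line and the execute bit before submitting:

head -1 run.sh    # should print the interpreter line, e.g. #!/bin/bash
chmod +x run.sh   # set the execute bit if it is missing
./run.sh          # run it directly to confirm it works outside Condor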
condor_submit fails with "no such directory"
Sometimes, usually when working on group projects outside your home directory, condor_submit will fail with an error like
ERROR: No such directory: /projects/foo/bar/biz
This happens when the directory is not owned by you, even if you have full access to it. (This may be a bug in condor_submit.) If you encounter this, either move the submit script and log file to a directory you own, or contact linghelp@uw
to have the directory chown'd to you.
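For example, assuming a hypothetical home directory /home/you, the submit file and log can live in a directory you own while the executable and data stay under the shared project directory:

executable = /projects/foo/bar/biz/run.sh
log        = /home/you/condor_logs/biz.log

Then run condor_submit from /home/you/condor_logs (or wherever you placed the submit file).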
Job runs for a while, then bogs down or gets killed
This often means the job has exceeded its memory request without Condor noticing, and has grown so large that the machine it's running on has run out of RAM. Check the job's MemoryUsage attribute and compare it to what you've specified on your request_memory line. (The default if you don't specify one is 1024 MB.) See BigMemoryCondor and the section below for more information.
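One way to compare the two, assuming job ID 13281 as above (attribute names can vary slightly between Condor versions):

condor_q -long 13281 | grep -E 'MemoryUsage|RequestMemory'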
Job runs for a while, then goes idle (or gets evicted)
This is usually because your job is consuming more memory than you requested for it on the request_memory line of your submit description file. You can verify this by looking at the job's MemoryUsage attribute and comparing it to your memory request. (If you didn't include a request_memory line, the default is 1024 MB.) See BigMemoryCondor for more information on request_memory and how to use it.
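For example, to request 4 GB in the submit description file (the value is in MB):

request_memory = 4096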
If you don't want to have to re-submit the job, you can use condor_qedit to change its memory requirements on the fly. The format is:

condor_qedit <jobid> RequestMemory <memory in MB>

For example, to request 5 GB of RAM:
condor_qedit 123456 RequestMemory "5*1024"
The quote marks prevent the shell from interpreting the asterisk as a filename wildcard.
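You can confirm the change took effect by querying the job's ClassAd again:

condor_q -long 123456 | grep RequestMemory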
Job sometimes works and sometimes fails
Sometimes this can be caused by problems with a particular node -- either a misconfiguration, or a temporary problem such as memory pressure from jobs with incorrect memory specifications. Check your job log file and see if the IP address on the "Job executing on host:" line is the same for all the failed jobs. If so, you should email linghelp so the situation can be fixed. As a temporary fix, you can avoid the problematic node by excluding it from your requirements, e.g.:
Requirements = ( Machine != "patas-n1.ling.washington.edu" )
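To find out which hosts your failed jobs ran on, assuming your log file is named job.log (a hypothetical name):

grep 'Job executing on host' job.log

Multiple nodes can be excluded by joining several such clauses with &&.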