Troubleshooting Condor Job Problems
General suggestions
- Make sure your submit file gives the full path to the executable, unless the executable is in the same directory you're running condor_submit in.
- Make sure the directory your logfile is in exists.
- Make sure your input and output files are correct.
- If you're running a script as your executable, make sure its execute bit is set and its shebang (#!) line is correct. Try running it from the command line to make sure it works.
- Check the job's logfile for useful error messages.
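Several of the checks above can be scripted before you submit. The following is a minimal sketch, not part of Condor: the function name preflight and its two arguments (the executable path and the log file path from your submit file) are hypothetical, and it assumes a head that supports -c.

```shell
# Pre-flight checks for a Condor job (hypothetical helper, not a Condor tool).
# $1 = path to the executable, $2 = path to the log file.
preflight() {
    exe=$1
    log=$2
    status=0

    # The executable must exist and have its execute bit set.
    if [ ! -f "$exe" ]; then
        echo "missing executable: $exe"
        status=1
    elif [ ! -x "$exe" ]; then
        echo "execute bit not set: $exe"
        status=1
    fi

    # If the file is a script, the interpreter named on its shebang line must exist.
    if [ -f "$exe" ] && [ "$(head -c 2 "$exe")" = "#!" ]; then
        interp=$(head -n 1 "$exe" | sed 's/^#![[:space:]]*//' | awk '{print $1}')
        if [ ! -x "$interp" ]; then
            echo "bad shebang interpreter: $interp"
            status=1
        fi
    fi

    # The directory that will hold the log file must already exist.
    logdir=$(dirname "$log")
    if [ ! -d "$logdir" ]; then
        echo "missing log directory: $logdir"
        status=1
    fi

    return $status
}
```

Calling preflight with the two paths prints one line per problem found and returns a non-zero status if anything is wrong.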
Guidelines for specific situations
Job immediately goes into the Held state
Often this means Condor is having trouble executing the job. Try using condor_q -long and examining the HoldReason attribute. For example:
condor_q -long 13281 | grep 'HoldReason'
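Note that a plain grep for HoldReason also matches the related HoldReasonCode and HoldReasonSubCode attributes. If you only want the message text, a small filter can help. This is a sketch under the assumption that condor_q -long prints the attribute as HoldReason = "..." on one line; hold_reason is a hypothetical helper name.

```shell
# Print only the text of the HoldReason attribute from ClassAd output on
# standard input. hold_reason is a hypothetical helper, not a Condor command.
hold_reason() {
    # Match the exact attribute name so HoldReasonCode and
    # HoldReasonSubCode are skipped, then strip the surrounding quotes.
    sed -n 's/^HoldReason = "\(.*\)"$/\1/p'
}

# Usage (requires Condor, so it is not run here):
#   condor_q -long 13281 | hold_reason
```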
Job runs for a while, then gets killed
This is usually because your job is consuming too much memory. The compute nodes each have 4 gigabytes of RAM and 4 gigabytes of swap, shared between up to two jobs running on the node. If a node runs out of memory, it will begin killing off large processes until the situation is resolved.
In other words, if your job requires more than 4 gigabytes of RAM, it will not perform well and may end up being killed off. You may need to process smaller data sets or restructure your code to store less data in RAM.
You can see Condor's estimate of your job's memory footprint, in megabytes, by running the condor_q command and looking at the SIZE column.
Job runs for a while, then goes idle (or gets evicted)
This usually means the job has grown beyond what its requirements allow, so no machine in the pool matches it any more. You can check this with the condor_q -better-analyze command. For example:
brodbd@patas:~$ condor_q -better-analyze 3727
-- Submitter: patas.ling.washington.edu : <192.168.100.50:44689> : patas.ling.washington.edu
---
3727.000: Run analysis summary. Of 44 machines,
44 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
0 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 are available to run your job
Last successful match: Tue Feb 5 20:55:24 2008
Last failed match: Wed Feb 6 09:36:38 2008
Reason for last match failure: no match found
WARNING: Be advised:
No resources matched request's constraints
The Requirements expression for your job is:
( target.Arch == "X86_64" ) && ( target.OpSys == "LINUX" ) &&
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )
    Condition                                             Machines Matched  Suggestion
    ---------                                             ----------------  ----------
1   ( ( 1024 * target.Memory ) >= 2570000 )               0                 REMOVE
2   ( target.Arch == "X86_64" )                           44
3   ( target.OpSys == "LINUX" )                           44
4   ( target.Disk >= 10000 )                              44
5   ( TARGET.FileSystemDomain == "ling.washington.edu" )  44
Note condition 1: Condor has calculated that the size of this job is 2,570,000 KB. Our nodes have 4 GB of RAM each, but Condor divides this by two and allocates 2 GB (about 2,097,000 KB) to each job slot, so no slot is large enough to match. The easiest way around this is to override Condor's calculation by putting something like this in your submit file:
requirements = (Memory > 1000)
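This tells Condor to let the job run on any system with more than 1000 MB of RAM. (The Memory attribute is in megabytes, which is why the default expression multiplies it by 1024 before comparing it with the job size in KB.)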
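The failed match in condition 1 can be reproduced by hand. The sketch below mirrors the default clause ( Memory * 1024 ) >= ImageSize, with Memory in MB and ImageSize in KB. The helper name slot_fits is hypothetical, and the figure of 2048 MB per slot is an assumption based on the 2 GB described above; the value a node actually advertises may be slightly lower.

```shell
# Mirror Condor's default memory check: ( Memory * 1024 ) >= ImageSize.
# $1 = slot memory in MB, $2 = job image size in KB.
# slot_fits is a hypothetical helper, not part of Condor.
slot_fits() {
    memory_mb=$1
    image_kb=$2
    if [ $((memory_mb * 1024)) -ge "$image_kb" ]; then
        echo "match"
    else
        echo "no match"
    fi
}

slot_fits 2048 2570000   # 2 GB slot: 2097152 KB is less than 2570000 KB, prints "no match"
slot_fits 4096 2570000   # whole node: 4194304 KB is enough, prints "match"
```

This is why overriding the memory clause lets the job match again, and also why a job of this size may still struggle if a second large job shares the node.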
If you don't want to have to re-submit the job, you can use condor_qedit to change the requirements on the fly. First you need the current requirements line; you can get this by running condor_q -long with the job number, e.g., condor_q -long 3727. Then cut and paste the Requirements line into a condor_qedit command line, removing the equal sign and editing the Memory portion, like this:
brodbd@patas:~$ condor_qedit 3727 Requirements '(Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory) >= 1000) && (TARGET.FileSystemDomain == MY.FileSystemDomain)'
Be sure to put the requirements in single quotes; otherwise the shell will try to interpret the && signs and strip the double quotes, and the command won't work.
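The cut-and-paste step could also be scripted. The following is only a sketch: relax_memory is a hypothetical helper, and the exact text of the memory clause it looks for, ((Memory * 1024) >= ImageSize), is an assumption inferred from the example above. Check what condor_q -long prints for your own job and adjust the pattern to match.

```shell
# Read condor_q -long output on standard input and print the Requirements
# expression with the memory clause relaxed to (Memory >= 1000).
# relax_memory is a hypothetical helper; the clause text it replaces is an
# assumption and may differ in your pool.
relax_memory() {
    # Keep only the Requirements line, dropping the "Requirements = " prefix,
    # then swap the size-based memory clause for a fixed threshold.
    sed -n 's/^Requirements = //p' |
        sed 's/((Memory \* 1024) >= ImageSize)/(Memory >= 1000)/'
}

# Usage (requires Condor, so it is not run here):
#   new_req=$(condor_q -long 3727 | relax_memory)
#   condor_qedit 3727 Requirements "$new_req"
```

Passing the expression as a double-quoted variable has the same effect as the single quotes in the manual command: the shell hands the whole expression to condor_qedit as one argument without interpreting the && signs.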
--
brodbd - 29 Aug 2008