
| Line: 1 to 1 | ||||||||
|---|---|---|---|---|---|---|---|---|
Troubleshooting Condor Job Problems
General suggestions | ||||||||
| Line: 7 to 7 | ||||||||
| ||||||||
| Added: | ||||||||
| > > |
| |||||||
Guidelines for specific situations | ||||||||
| Line: 1 to 1 | ||||||||
|---|---|---|---|---|---|---|---|---|
Troubleshooting Condor Job Problems
General suggestions | ||||||||
| Line: 10 to 10 | ||||||||
Guidelines for specific situations | ||||||||
| Added: | ||||||||
| > > | Job immediately goes into the Held state
Often this means Condor is having trouble executing the job. Try running condor_q -long and examining the HoldReason attribute. For example: condor_q -long 13281 | grep 'HoldReason' | |||||||
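The grep above prints every line that mentions HoldReason, including related attributes such as HoldReasonCode. As a minimal sketch, assuming the usual `Attribute = "value"` layout of `condor_q -long` output, a small shell function can print just the reason text. The name `hold_reason` is a hypothetical helper, not a Condor command.

```shell
# Print only the HoldReason text from `condor_q -long` output read on stdin.
# hold_reason is a hypothetical helper name, not part of Condor.
hold_reason() {
    # Matches lines of the form: HoldReason = "some text"
    # HoldReasonCode and similar attributes do not match this pattern.
    sed -n 's/^HoldReason = "\(.*\)"$/\1/p'
}

# Usage on a submit host (13281 is the example job number from the text):
#   condor_q -long 13281 | hold_reason
```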
Job runs for a while, then gets killed
This is usually because your job is consuming too much memory. The compute nodes each have 4 gigabytes of RAM and 4 gigabytes of swap, shared by up to two jobs running on the node. If a node runs out of memory, the system begins killing large processes until the situation is resolved. | ||||||||
| Line: 69 to 73 | ||||||||
| Be sure to put the requirements in single quotes; otherwise the shell will try to interpret the && signs and the command won't work. | ||||||||
| Changed: | ||||||||
| < < | -- brodbd - 21 Feb 2008 | |||||||
| > > | -- brodbd - 29 Aug 2008 | |||||||
| Line: 1 to 1 | ||||||||
|---|---|---|---|---|---|---|---|---|
Troubleshooting Condor Job Problems
General suggestions | ||||||||
| Line: 10 to 10 | ||||||||
Guidelines for specific situations | ||||||||
| Added: | ||||||||
| > > | Job runs for a while, then gets killed
This is usually because your job is consuming too much memory. The compute nodes each have 4 gigabytes of RAM and 4 gigabytes of swap, shared by up to two jobs running on the node. If a node runs out of memory, the system begins killing large processes until the situation is resolved. In other words, if your job requires more than 4 gigabytes of RAM, it will not perform well and may end up being killed. You may need to process smaller data sets or restructure your code to store less data in RAM. You can see Condor's estimate of your job's memory footprint, in megabytes, by running the condor_q command and looking at the SIZE column. | |||||||
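The SIZE column works for a quick look. To check many jobs at once, the following sketch filters a list of job IDs and image sizes against a limit. It assumes the input is one `job_id image_size_kb` pair per line, with the size in kilobytes as in the ImageSize job attribute. The name `big_jobs` and the 2048 MB limit are illustrative assumptions.

```shell
# Read "job_id image_size_kb" pairs on stdin and print the jobs whose
# image size exceeds a limit given in megabytes.
# big_jobs is a hypothetical helper name, not part of Condor.
big_jobs() {
    limit_mb=$1
    awk -v limit="$limit_mb" '{
        mb = $2 / 1024                      # convert kilobytes to megabytes
        if (mb > limit) printf "%s %d MB\n", $1, mb
    }'
}

# Usage on a submit host, listing jobs larger than 2048 MB:
#   condor_q -format '%d.' ClusterId -format '%d ' ProcId \
#            -format '%d\n' ImageSize | big_jobs 2048
```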
Job runs for a while, then goes idle (or gets evicted)
This usually means the job has exceeded the requirements Condor set for it. You can check this with the condor_q -better-analyze command. For example: | ||||||||
| Line: 61 to 69 | ||||||||
| Be sure to put the requirements in single quotes; otherwise the shell will try to interpret the && signs and the command won't work. | ||||||||
| Changed: | ||||||||
| < < | -- brodbd - 06 Feb 2008 | |||||||
| > > | -- brodbd - 21 Feb 2008 | |||||||
| Line: 1 to 1 | ||||||||
|---|---|---|---|---|---|---|---|---|
| Added: | ||||||||
| > > | Troubleshooting Condor Job Problems
General suggestions
Guidelines for specific situations
Job runs for a while, then goes idle
This usually means the job has exceeded the requirements Condor set for it. You can check this with the condor_q -better-analyze command. For example:
    brodbd@patas:~$ condor_q -better-analyze 3727

    -- Submitter: patas.ling.washington.edu : <192.168.100.50:44689> : patas.ling.washington.edu
    ---
    3727.000:  Run analysis summary.  Of 44 machines,
         44 are rejected by your job's requirements
          0 reject your job because of their own requirements
          0 match but are serving users with a better priority in the pool
          0 match but reject the job for unknown reasons
          0 match but will not currently preempt their existing job
          0 are available to run your job
            Last successful match: Tue Feb 5 20:55:24 2008
            Last failed match: Wed Feb 6 09:36:38 2008
            Reason for last match failure: no match found

    WARNING:  Be advised:
       No resources matched request's constraints

    The Requirements expression for your job is:

    ( target.Arch == "X86_64" ) && ( target.OpSys == "LINUX" ) &&
    ( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) &&
    ( TARGET.FileSystemDomain == MY.FileSystemDomain )

        Condition                                                Machines Matched    Suggestion
        ---------                                                ----------------    ----------
    1   ( ( 1024 * target.Memory ) >= 2570000 )                  0                   REMOVE
    2   ( target.Arch == "X86_64" )                              44
    3   ( target.OpSys == "LINUX" )                              44
    4   ( target.Disk >= 10000 )                                 44
    5   ( TARGET.FileSystemDomain == "ling.washington.edu" )     44

Note condition 1: Condor has calculated that the size of this job is 2,570,000 KB. Our nodes have 4 GB of RAM each, but Condor divides this by two and allocates 2 GB to each job slot, so no machine satisfies the condition. The easiest way around this is to override Condor's calculation by putting something like this in your submit file:

    requirements = (Memory > 1000)
This tells Condor to let the job run on any system whose Memory value is greater than 1000. Memory is measured in megabytes, so this is roughly 1 GB of RAM.
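To see why condition 1 fails, here is a worked check of the default memory test, (Memory * 1024) >= ImageSize, with Memory in megabytes and ImageSize in kilobytes. It assumes a 2 GB slot advertises Memory as exactly 2048; the real value on a node may be slightly different. The name `slot_fits` is a hypothetical helper.

```shell
# Evaluate the default memory condition: (Memory * 1024) >= ImageSize.
# Memory is in megabytes, ImageSize is in kilobytes.
# slot_fits is a hypothetical helper name, not part of Condor.
slot_fits() {
    slot_memory_mb=$1
    image_size_kb=$2
    if [ $((slot_memory_mb * 1024)) -ge "$image_size_kb" ]; then
        echo "match"
    else
        echo "no match"
    fi
}

# Assumed 2048 MB slot: 2048 * 1024 = 2097152 KB, less than 2570000 KB.
slot_fits 2048 2570000   # prints "no match"
```

Overriding the requirements with a Memory test that the 2 GB slots can pass replaces this default check, which is why the job can then be matched.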
If you don't want to have to re-submit the job, you can use condor_qedit to change the requirements on the fly. First you need the current requirements line; you can get this by running condor_q -long with the job number, e.g., condor_q -long 3727. Then cut and paste the Requirements line into a condor_qedit command line, removing the equal sign and editing the Memory portion, like this:
    brodbd@patas:~$ condor_qedit 3727 Requirements '(Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory) >= 1000) && (TARGET.FileSystemDomain == MY.FileSystemDomain)'

Be sure to put the requirements in single quotes; otherwise the shell will try to interpret the && signs and the command won't work.
-- brodbd - 06 Feb 2008 | |||||||
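The cut-and-paste step can also be scripted. The sketch below assumes `condor_q -long` prints the expression on a single line starting with `Requirements = `. It strips the attribute name and equal sign so the rest can be edited and handed to condor_qedit. The name `requirements_expr` is a hypothetical helper, not a Condor command.

```shell
# Print the bare Requirements expression from `condor_q -long` output on
# stdin, dropping the leading "Requirements = " so that the result can be
# used as the value argument of condor_qedit.
# requirements_expr is a hypothetical helper name, not part of Condor.
requirements_expr() {
    sed -n 's/^Requirements = //p'
}

# Usage on a submit host (3727 is the example job number from the text):
#   expr=$(condor_q -long 3727 | requirements_expr)
#   echo "$expr"        # review it and edit the Memory portion by hand
#   condor_qedit 3727 Requirements "$expr"
```

Passing the expression as a quoted variable serves the same purpose as the single quotes above: the shell hands the && signs to condor_qedit untouched.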