Difference: TroubleshootingCondor (1 vs. 13)

Revision 13 (2013-05-11) - brodbd

Line: 1 to 1
 
META TOPICPARENT name="kill a job"

Troubleshooting Condor Job Problems

Line: 28 to 28
 condor_qedit 123456 RequestMemory "5*1024"

The quote marks prevent the shell from interpreting the multiplication as a file wildcard.
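If you want to confirm the change took effect, you can look at the job's ClassAd again; for example (the job ID here is just a placeholder):

condor_q -long 123456 | grep RequestMemory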

Added:
>
>

Job sometimes works and sometimes fails

Sometimes this can be caused by problems with a particular node -- either a misconfiguration, or a temporary problem such as memory pressure from jobs with incorrect memory specifications. Check your job log file and see if the IP address on the "Job executing on host:" line is the same for all the failed jobs. If so, you should email linghelp so the situation can be fixed. As a temporary fix, you can avoid the problematic node by excluding it from your requirements, e.g.:

Requirements = ( Machine != "patas-n1.ling.washington.edu" )
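If your submit file already has a Requirements line, you can combine the exclusion with the existing expression using &&. As a sketch (the memory clause below is only an illustration, not part of the original example):

Requirements = ( Machine != "patas-n1.ling.washington.edu" ) && ( Memory >= 4096 )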

Revision 12 (2012-12-07) - brodbd

Line: 1 to 1
 
META TOPICPARENT name="kill a job"

Troubleshooting Condor Job Problems

Line: 13 to 13
 

Job immediately goes into the Held state

Changed:
<
<
Often this means Condor is having trouble executing the job. Try using condor_q -long and examining the HoldReason attribute. For example: condor_q -long 13281 | grep 'HoldReason'.
>
>
Often this means Condor is having trouble executing the job. Try using condor_q -long and examining the HoldReason attribute. For example: condor_q -global -long 13281 | grep 'HoldReason'.
  One particularly common error that some users may find puzzling is "Exec format error." This usually means you've forgotten to include the "shebang" line in a script. While a shebang line is not necessarily required when running a script interactively, Condor needs to see one so it knows what shell or interpreter to use to run the script.
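As a sketch, a wrapper script only needs an interpreter line at the very top; the file name and contents below are hypothetical:

#!/bin/bash
# Hypothetical wrapper script (myscript.sh). The first line is the shebang;
# without it, Condor reports "Exec format error".
python3 myprogram.py "$@"

Remember to set the execute bit on the script (chmod +x myscript.sh) before submitting it.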

Job runs for a while, then bogs down or gets killed

Revision 11 (2011-09-09) - brodbd

Line: 1 to 1
 
META TOPICPARENT name="kill a job"

Troubleshooting Condor Job Problems

Line: 19 to 18
 One particularly common error that some users may find puzzling is "Exec format error." This usually means you've forgotten to include the "shebang" line in a script. While a shebang line is not necessarily required when running a script interactively, Condor needs to see one so it knows what shell or interpreter to use to run the script.

Job runs for a while, then bogs down or gets killed

Changed:
<
<
This is usually because your job is consuming too much memory. The majority of our compute nodes have 4 gigabytes of RAM and 4 gigabytes of swap. This is shared between all the jobs running on that node -- up to one job per CPU core. If the system runs out of memory, it will begin killing off large processes until the situation is resolved.

In other words, if your job requires more than 4 gigabytes of RAM and is assigned to one of our 4 GB nodes, it will not perform well and may end up being killed off. You may need to process smaller data sets or restructure your code to store less data in RAM. You can also try steering the job toward nodes that have more memory available; see BigMemoryCondor for details.

You can see Condor's estimate of your job's memory footprint, in megabytes, by running the condor_q command and looking at the SIZE column.

>
>
This often means the job has exceeded its memory request without Condor noticing, and has gotten so large that the machine it's running on has run out of RAM. Check the SIZE column of condor_q and compare to what you've specified on your request_memory line. (The default if you don't specify is 1024 MB.) See BigMemoryCondor and the section below for more information.
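As a rough sketch, a submit description file with an explicit memory request might look like this (every value is a placeholder, and 2048 MB is just an example):

universe       = vanilla
executable     = myjob.sh
log            = myjob.log
output         = myjob.out
error          = myjob.err
request_memory = 2048
queue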
 

Job runs for a while, then goes idle (or gets evicted)

Changed:
<
<
This usually means the job has exceeded the requirements Condor set for it. You can check this with the condor_q -better-analyze command. For example:
brodbd@patas:~$ condor_q -better-analyze 3727


-- Submitter: patas.ling.washington.edu : <192.168.100.50:44689> : patas.ling.washington.edu
---
3727.000:  Run analysis summary.  Of 44 machines,
     44 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
   Last successful match: Tue Feb  5 20:55:24 2008
   Last failed match: Wed Feb  6 09:36:38 2008
   Reason for last match failure: no match found

WARNING:  Be advised:
   No resources matched request's constraints

The Requirements expression for your job is:

( target.Arch == "X86_64" ) && ( target.OpSys == "LINUX" ) &&
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ( 1024 * target.Memory ) >= 2570000 )0                   REMOVE
2   ( target.Arch == "X86_64" )       44                   
3   ( target.OpSys == "LINUX" )       44                   
4   ( target.Disk >= 10000 )          44                   
5   ( TARGET.FileSystemDomain == "ling.washington.edu" )
                                      44                   

Note condition 1 -- Condor has calculated that the size of this job is 2,570,000 KB. On many of our nodes Condor allocates 2 GB to each job slot. The easiest way around this is to override Condor's calculation by putting something like this in your submit file:

requirements = (Memory > 1000)

This tells Condor to let the job run on any system with at least 1000 MB of RAM.

>
>
This is usually because your job is consuming more memory than you requested on the request_memory line of your submit description file. You can verify this by looking at the SIZE column of condor_q and comparing it to your memory request. (If you didn't include a request_memory line, the default is 1024 MB.) See BigMemoryCondor for more information on request_memory and how to use it.
 
Changed:
<
<
See also the BigMemoryCondor page for more suggestions on handling large jobs.
>
>
If you don't want to have to re-submit the job, you can use condor_qedit to change its memory requirements on the fly. The format is condor_qedit <jobid> RequestMemory <memory in MB>. For example, to request 5 GB of RAM:
 
Changed:
<
<
If you don't want to have to re-submit the job, you can use condor_qedit to change the requirements on the fly. First you need the current requirements line; you can get this by running condor_q -long with the job number; e.g., condor_q -long 3727. Then cut and paste the Requirements line into a condor_qedit command line, removing the equal sign and editing the Memory portion, like this:
brodbd@patas:~$ condor_qedit 3727 Requirements '(Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory) >= 1000) && (TARGET.FileSystemDomain == MY.FileSystemDomain)' 
>
>
condor_qedit 123456 RequestMemory "5*1024"
 
Changed:
<
<
Be sure to put the requirements in single quotes, otherwise the shell will try to interpret the && signs and it won't work.
>
>
The quote marks prevent the shell from interpreting the multiplication as a file wildcard.

Revision 10 (2010-06-29) - brodbd

Line: 1 to 1
 
META TOPICPARENT name="kill a job"

Troubleshooting Condor Job Problems

Line: 19 to 19
 One particularly common error that some users may find puzzling is "Exec format error." This usually means you've forgotten to include the "shebang" line in a script. While a shebang line is not necessarily required when running a script interactively, Condor needs to see one so it knows what shell or interpreter to use to run the script.

Job runs for a while, then bogs down or gets killed

Changed:
<
<
This is usually because your job is consuming too much memory. The majority of our compute nodeshave 4 gigabytes of RAM and 4 gigabytes of swap. This is shared between up to two jobs running on the system. If the system runs out of memory, it will begin killing off large processes until the situation is resolved.
>
>
This is usually because your job is consuming too much memory. The majority of our compute nodes have 4 gigabytes of RAM and 4 gigabytes of swap. This is shared between all the jobs running on that node -- up to one job per CPU core. If the system runs out of memory, it will begin killing off large processes until the situation is resolved.
  In other words, if your job requires more than 4 gigabytes of RAM and is assigned to one of our 4 GB nodes, it will not perform well and may end up being killed off. You may need to process smaller data sets or restructure your code to store less data in RAM. You can also try steering the job toward nodes that have more memory available; see BigMemoryCondor for details.
Line: 75 to 75
 
brodbd@patas:~$ condor_qedit 3727 Requirements '(Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory) >= 1000) && (TARGET.FileSystemDomain == MY.FileSystemDomain)' 

Be sure to put the requirements in single quotes, otherwise the shell will try to interpret the && signs and it won't work.

Deleted:
<
<
-- brodbd - 4 Dec 2009

Revision 9 (2010-01-28) - gfra

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="kill a job"
 

Troubleshooting Condor Job Problems

General suggestions

Revision 8 (2009-12-04) - brodbd

Line: 1 to 1
 

Troubleshooting Condor Job Problems

General suggestions

Line: 15 to 15
  Often this means Condor is having trouble executing the job. Try using condor_q -long and examining the HoldReason attribute. For example: condor_q -long 13281 | grep 'HoldReason'.
Added:
>
>
One particularly common error that some users may find puzzling is "Exec format error." This usually means you've forgotten to include the "shebang" line in a script. While a shebang line is not necessarily required when running a script interactively, Condor needs to see one so it knows what shell or interpreter to use to run the script.
 

Job runs for a while, then bogs down or gets killed

This is usually because your job is consuming too much memory. The majority of our compute nodes have 4 gigabytes of RAM and 4 gigabytes of swap. This is shared between up to two jobs running on the system. If the system runs out of memory, it will begin killing off large processes until the situation is resolved.

Line: 26 to 27
 

Job runs for a while, then goes idle (or gets evicted)

This usually means the job has exceeded the requirements Condor set for it. You can check this with the condor_q -better-analyze command. For example:

Changed:
<
<
brodbd@patas:~$ condor_q -better-analyze 3727

>
>
brodbd@patas:~$ condor_q -better-analyze 3727

 
Changed:
<
<
-- Submitter: patas.ling.washington.edu : <192.168.100.50:44689> : patas.ling.washington.edu
>
>
-- Submitter: patas.ling.washington.edu : <192.168.100.50:44689> : patas.ling.washington.edu
 
3727.000: Run analysis summary. Of 44 machines, 44 are rejected by your job's requirements
Line: 49 to 49
 The Requirements expression for your job is:

( target.Arch == "X86_64" ) && ( target.OpSys == "LINUX" ) &&

Changed:
<
<
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) &&
>
>
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) &&
 ( TARGET.FileSystemDomain == MY.FileSystemDomain )

Condition Machines Matched Suggestion --------- ---------------- ----------

Changed:
<
<
1 ( ( 1024 * target.Memory ) >= 2570000 )0 REMOVE
>
>
1 ( ( 1024 * target.Memory ) >= 2570000 )0 REMOVE
 2 ( target.Arch == "X86_64" ) 44 3 ( target.OpSys == "LINUX" ) 44
Changed:
<
<
4 ( target.Disk >= 10000 ) 44
>
>
4 ( target.Disk >= 10000 ) 44
 5 ( FileSystemDomain == "ling.washington.edu" ) 44
Line: 74 to 75
  Be sure to put the requirements in single quotes, otherwise the shell will try to interpret the && signs and it won't work.
Changed:
<
<
-- brodbd - 17 Sep 2009
>
>
-- brodbd - 4 Dec 2009

Revision 7 (2009-09-17) - brodbd

Line: 1 to 1
 

Troubleshooting Condor Job Problems

General suggestions

Line: 15 to 15
  Often this means Condor is having trouble executing the job. Try using condor_q -long and examining the HoldReason attribute. For example: condor_q -long 13281 | grep 'HoldReason'.
Changed:
<
<

Job runs for a while, then gets killed

>
>

Job runs for a while, then bogs down or gets killed

 
Changed:
<
<
This is usually because your job is consuming too much memory. The compute nodes each have 4 gigabytes of RAM and 4 gigabytes of swap. This is shared between up to two jobs running on the system. If the system runs out of memory, it will begin killing off large processes until the situation is resolved.
>
>
This is usually because your job is consuming too much memory. The majority of our compute nodes have 4 gigabytes of RAM and 4 gigabytes of swap. This is shared between up to two jobs running on the system. If the system runs out of memory, it will begin killing off large processes until the situation is resolved.
 
Changed:
<
<
In other words, if your job requires more than 4 gigabytes of RAM, it will not perform well and may end up being killed off. You may need to process smaller data sets or restructure your code to store less data in RAM.
>
>
In other words, if your job requires more than 4 gigabytes of RAM and is assigned to one of our 4 GB nodes, it will not perform well and may end up being killed off. You may need to process smaller data sets or restructure your code to store less data in RAM. You can also try steering the job toward nodes that have more memory available; see BigMemoryCondor for details.
  You can see Condor's estimate of your job's memory footprint, in megabytes, by running the condor_q command and looking at the SIZE column.
Line: 61 to 61
 5 ( FileSystemDomain == "ling.washington.edu" ) 44
Changed:
<
<
Note condition 1 -- Condor has calculated the size of this job is 2,570,000 KB. Our nodes have 4 GB of RAM each, but Condor divides this by two and allocates 2 GB to each job slot. The easiest way around this is to override Condor's calculation by putting something like this in your submit file:
>
>
Note condition 1 -- Condor has calculated the size of this job is 2,570,000 KB. On many of our nodes Condor allocates 2 GB to each job slot. The easiest way around this is to override Condor's calculation by putting something like this in your submit file:
 
Changed:
<
<
requirements = (Memory > 1000)
>
>
requirements = (Memory > 1000)
 This tells Condor to let the job run on any system with at least 1000 MB of RAM.
Changed:
<
<
See also the BigMemoryCondor page for more suggestions.
>
>
See also the BigMemoryCondor page for more suggestions on handling large jobs.
  If you don't want to have to re-submit the job, you can use condor_qedit to change the requirements on the fly. First you need the current requirements line; you can get this by running condor_q -long with the job number; e.g., condor_q -long 3727. Then cut and paste the Requirements line into a condor_qedit command line, removing the equal sign and editing the Memory portion, like this:
Changed:
<
<
brodbd@patas:~$ condor_qedit 3727 Requirements '(Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory) >= 1000) && (TARGET.FileSystemDomain == MY.FileSystemDomain)'
>
>
brodbd@patas:~$ condor_qedit 3727 Requirements '(Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory) >= 1000) && (TARGET.FileSystemDomain == MY.FileSystemDomain)' 
  Be sure to put the requirements in single quotes, otherwise the shell will try to interpret the && signs and it won't work.
Changed:
<
<
-- brodbd - 29 Aug 2008
>
>
-- brodbd - 17 Sep 2009

Revision 6 (2009-02-23) - brodbd

Line: 1 to 1
 

Troubleshooting Condor Job Problems

General suggestions

Line: 67 to 67
 This tells Condor to let the job run on any system with at least 1000 MB of RAM.
Added:
>
>
See also the BigMemoryCondor page for more suggestions.
 If you don't want to have to re-submit the job, you can use condor_qedit to change the requirements on the fly. First you need the current requirements line; you can get this by running condor_q -long with the job number; e.g., condor_q -long 3727. Then cut and paste the Requirements line into a condor_qedit command line, removing the equal sign and editing the Memory portion, like this:
brodbd@patas:~$ condor_qedit 3727 Requirements '(Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory) >= 1000) && (TARGET.FileSystemDomain == MY.FileSystemDomain)'

Revision 5 (2008-08-29) - brodbd

Line: 1 to 1
 

Troubleshooting Condor Job Problems

General suggestions

Line: 7 to 7
 
  • Make sure the directory your logfile is in exists.
  • Make sure your input and output files are correct.
  • If you're running a script as your executable, make sure its execute bit is set and the Shebang line is correct. Try running it from the command line to make sure it works.
Added:
>
>
  • Check the job's logfile for useful error messages.
 

Guidelines for specific situations

Revision 4 (2008-08-29) - brodbd

Line: 1 to 1
 

Troubleshooting Condor Job Problems

General suggestions

Line: 10 to 10
 

Guidelines for specific situations

Added:
>
>

Job immediately goes into the Held state

Often this means Condor is having trouble executing the job. Try using condor_q -long and examining the HoldReason attribute. For example: condor_q -long 13281 | grep 'HoldReason'.

 

Job runs for a while, then gets killed

This is usually because your job is consuming too much memory. The compute nodes each have 4 gigabytes of RAM and 4 gigabytes of swap. This is shared between up to two jobs running on the system. If the system runs out of memory, it will begin killing off large processes until the situation is resolved.

Line: 69 to 73
  Be sure to put the requirements in single quotes, otherwise the shell will try to interpret the && signs and it won't work.
Changed:
<
<
-- brodbd - 21 Feb 2008
>
>
-- brodbd - 29 Aug 2008
 

Revision 3 (2008-02-21) - brodbd

Line: 1 to 1
 

Troubleshooting Condor Job Problems

General suggestions

Line: 10 to 10
 

Guidelines for specific situations

Added:
>
>

Job runs for a while, then gets killed

This is usually because your job is consuming too much memory. The compute nodes each have 4 gigabytes of RAM and 4 gigabytes of swap. This is shared between up to two jobs running on the system. If the system runs out of memory, it will begin killing off large processes until the situation is resolved.

In other words, if your job requires more than 4 gigabytes of RAM, it will not perform well and may end up being killed off. You may need to process smaller data sets or restructure your code to store less data in RAM.

You can see Condor's estimate of your job's memory footprint, in megabytes, by running the condor_q command and looking at the SIZE column.

 

Job runs for a while, then goes idle (or gets evicted)

This usually means the job has exceeded the requirements Condor set for it. You can check this with the condor_q -better-analyze command. For example:

Line: 61 to 69
  Be sure to put the requirements in single quotes, otherwise the shell will try to interpret the && signs and it won't work.
Changed:
<
<
-- brodbd - 06 Feb 2008
>
>
-- brodbd - 21 Feb 2008
 

Revision 2 (2008-02-14) - brodbd

Line: 1 to 1
 

Troubleshooting Condor Job Problems

General suggestions

Line: 10 to 10
 

Guidelines for specific situations

Changed:
<
<

Job runs for a while, then goes idle

>
>

Job runs for a while, then goes idle (or gets evicted)

  This usually means the job has exceeded the requirements Condor set for it. You can check this with the condor_q -better-analyze command. For example:

Revision 1 (2008-02-06) - brodbd

Line: 1 to 1
Added:
>
>

Troubleshooting Condor Job Problems

General suggestions

  • Make sure you're giving the full path to the executable, in your submit file. (Unless the executable is in the same directory you're running condor_submit in.)
  • Make sure the directory your logfile is in exists.
  • Make sure your input and output files are correct.
  • If you're running a script as your executable, make sure its execute bit is set and the Shebang line is correct. Try running it from the command line to make sure it works.

Guidelines for specific situations

Job runs for a while, then goes idle

This usually means the job has exceeded the requirements Condor set for it. You can check this with the condor_q -better-analyze command. For example:

brodbd@patas:~$ condor_q -better-analyze 3727


-- Submitter: patas.ling.washington.edu : <192.168.100.50:44689> : patas.ling.washington.edu
---
3727.000:  Run analysis summary.  Of 44 machines,
	  44 are rejected by your job's requirements
		0 reject your job because of their own requirements
		0 match but are serving users with a better priority in the pool
		0 match but reject the job for unknown reasons
		0 match but will not currently preempt their existing job
		0 are available to run your job
	Last successful match: Tue Feb  5 20:55:24 2008
	Last failed match: Wed Feb  6 09:36:38 2008
	Reason for last match failure: no match found

WARNING:  Be advised:
	No resources matched request's constraints

The Requirements expression for your job is:

( target.Arch == "X86_64" ) && ( target.OpSys == "LINUX" ) &&
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )

	 Condition								 Machines Matched	 Suggestion
	 ---------								 ----------------	 ----------
1	( ( 1024 * target.Memory ) >= 2570000 )0						 REMOVE
2	( target.Arch == "X86_64" )		 44						 
3	( target.OpSys == "LINUX" )		 44						 
4	( target.Disk >= 10000 )			 44						 
5	( TARGET.FileSystemDomain == "ling.washington.edu" )
												  44						 
Note condition 1 -- Condor has calculated that the size of this job is 2,570,000 KB. Our nodes have 4 GB of RAM each, but Condor divides this by two and allocates 2 GB to each job slot. The easiest way around this is to override Condor's calculation by putting something like this in your submit file:

requirements = (Memory > 1000)

This tells Condor to let the job run on any system with at least 1000 MB of RAM.

If you don't want to have to re-submit the job, you can use condor_qedit to change the requirements on the fly. First you need the current requirements line; you can get this by running condor_q -long with the job number; e.g., condor_q -long 3727. Then cut and paste the Requirements line into a condor_qedit command line, removing the equal sign and editing the Memory portion, like this:

brodbd@patas:~$ condor_qedit 3727 Requirements '(Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory) >= 1000) && (TARGET.FileSystemDomain == MY.FileSystemDomain)'

Be sure to put the requirements in single quotes, otherwise the shell will try to interpret the && signs and it won't work.

-- brodbd - 06 Feb 2008

 