Main.TroubleshootingCondor r1.5 - 29 Aug 2008 - 21:55 - Main.brodbd

Troubleshooting Condor Job Problems

General suggestions

  • Make sure your submit file gives the full path to the executable (unless the executable is in the directory you run condor_submit from).
  • Make sure the directory your logfile is in exists.
  • Make sure your input and output files are correct.
  • If you're running a script as your executable, make sure its execute bit is set and its shebang line is correct. Try running it from the command line to make sure it works.
  • Check the job's logfile for useful error messages.
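
Several of the checks above can be automated before submitting. The following is a minimal sketch, not part of Condor: the function name preflight and its two arguments (the executable path and the logfile path from your submit file) are made up for illustration.

```shell
#!/bin/sh
# Hypothetical pre-submit check; "preflight" is not a Condor command.
# Usage: preflight /full/path/to/executable /path/to/logfile
preflight() {
    exe=$1
    log=$2
    status=0

    if [ ! -f "$exe" ]; then
        echo "missing executable: $exe"
        status=1
    elif [ ! -x "$exe" ]; then
        echo "execute bit not set: $exe"
        status=1
    elif [ "$(head -c 2 "$exe")" = '#!' ]; then
        # It is a script: the interpreter named on the shebang line must exist.
        interp=$(sed -n '1s/^#![[:space:]]*\([^[:space:]]*\).*/\1/p' "$exe")
        if [ ! -x "$interp" ]; then
            echo "shebang interpreter not found: $interp"
            status=1
        fi
    fi

    # The directory that will hold the logfile must already exist.
    logdir=$(dirname "$log")
    if [ ! -d "$logdir" ]; then
        echo "log directory does not exist: $logdir"
        status=1
    fi

    return $status
}
```

The function prints one line per problem it finds and returns a non-zero status if there were any, so it can be chained in front of a submit, for example preflight /path/to/job.sh /path/to/job.log && condor_submit job.cmd (all three names here are placeholders).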

Guidelines for specific situations

Job immediately goes into the Held state

Often this means Condor is having trouble executing the job. Try using condor_q -long and examining the HoldReason attribute. For example: condor_q -long 13281 | grep 'HoldReason'.

Job runs for a while, then gets killed

This is usually because your job is consuming too much memory. The compute nodes each have 4 gigabytes of RAM and 4 gigabytes of swap, shared between up to two jobs running on the node. If the node runs out of memory, it will begin killing off large processes until the situation is resolved.

In other words, if your job requires more than 4 gigabytes of RAM, it will not perform well and may end up being killed off. You may need to process smaller data sets or restructure your code to store less data in RAM.

You can see Condor's estimate of your job's memory footprint, in megabytes, by running the condor_q command and looking at the SIZE column.
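
As a rough aid, the SIZE value can be compared against the node figures given above. This is only a sketch: classify_size is a hypothetical helper, it expects a whole number of megabytes (round the SIZE value first if it has a decimal part), and it ignores whatever a second job on the same node is using.

```shell
#!/bin/sh
# Hypothetical helper: classify a job's SIZE (in MB, as shown by condor_q)
# against the per-node figures from the text: 4 GB RAM and 4 GB swap.
NODE_RAM_MB=4096     # 4 GB of RAM per compute node
NODE_SWAP_MB=4096    # 4 GB of swap per compute node

classify_size() {
    size_mb=$1
    if [ "$size_mb" -gt $((NODE_RAM_MB + NODE_SWAP_MB)) ]; then
        echo "exceeds RAM plus swap"
    elif [ "$size_mb" -gt "$NODE_RAM_MB" ]; then
        echo "exceeds RAM"
    else
        echo "fits in RAM"
    fi
}
```

For example, classify_size 5000 prints "exceeds RAM", which is the situation where the job will run poorly and may be killed.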

Job runs for a while, then goes idle (or gets evicted)

This usually means the job has grown beyond what its requirements allow, so no machine in the pool matches it any more. You can check this with the condor_q -better-analyze command. For example:

brodbd@patas:~$ condor_q -better-analyze 3727


-- Submitter: patas.ling.washington.edu : <192.168.100.50:44689> : patas.ling.washington.edu
---
3727.000:  Run analysis summary.  Of 44 machines,
     44 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
   Last successful match: Tue Feb  5 20:55:24 2008
   Last failed match: Wed Feb  6 09:36:38 2008
   Reason for last match failure: no match found

WARNING:  Be advised:
   No resources matched request's constraints

The Requirements expression for your job is:

( target.Arch == "X86_64" ) && ( target.OpSys == "LINUX" ) &&
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ( 1024 * target.Memory ) >= 2570000 )
                                      0                   REMOVE
2   ( target.Arch == "X86_64" )       44                   
3   ( target.OpSys == "LINUX" )       44                   
4   ( target.Disk >= 10000 )          44                   
5   ( TARGET.FileSystemDomain == "ling.washington.edu" )
                                      44

Note condition 1: Condor has calculated that the size of this job is 2,570,000 KB (about 2.45 GB). Our nodes have 4 GB of RAM each, but Condor divides this by two and allocates 2 GB to each job slot, so no slot is large enough for the job. The easiest way around this is to override Condor's calculation by putting something like this in your submit file:

requirements = (Memory > 1000)

This tells Condor to let the job run on any machine that advertises more than 1000 MB of memory (the Memory attribute is measured in megabytes, not kilobytes).
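
To see why condition 1 matched no machines, the comparison can be reproduced with shell arithmetic. This is a sketch under one assumption: that each slot advertises a Memory value of 2048 (half of a 4 GB node); the real figure on a given node may be somewhat lower. The function name slot_matches is made up for illustration.

```shell
#!/bin/sh
# Reproduce the default memory test from the Requirements expression:
#   ( target.Memory * 1024 ) >= ImageSize
# Memory is in MB, ImageSize in KB, hence the factor of 1024.
slot_matches() {
    image_size_kb=$1    # job size as calculated by Condor, in KB
    slot_memory_mb=$2   # Memory advertised by one slot, in MB (assumed value)
    if [ $((slot_memory_mb * 1024)) -ge "$image_size_kb" ]; then
        echo "match"
    else
        echo "no match"
    fi
}

# Job 3727 from the analysis above against an assumed 2048 MB slot:
# 2048 * 1024 = 2097152 KB, which is less than 2570000 KB.
slot_matches 2570000 2048
```

With these numbers the script prints "no match". Replacing the default test with (Memory > 1000) removes the comparison against the job size altogether, which is why the job can then be scheduled on the same slots.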

If you don't want to have to re-submit the job, you can use condor_qedit to change the requirements on the fly. First you need the current requirements line; you can get this by running condor_q -long with the job number; e.g., condor_q -long 3727. Then cut and paste the Requirements line into a condor_qedit command line, removing the equal sign and editing the Memory portion, like this:

brodbd@patas:~$ condor_qedit 3727 Requirements '(Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory) >= 1000) && (TARGET.FileSystemDomain == MY.FileSystemDomain)'

Be sure to put the requirements in single quotes; otherwise the shell will try to interpret the parentheses, the && signs and the double quotes itself, and the command won't work.
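
The effect of the single quotes can be checked without touching the queue. In this sketch, show_args is a hypothetical stand-in for condor_qedit that only reports what the shell passed to it, and the expression is shortened for readability.

```shell
#!/bin/sh
# Hypothetical stand-in for condor_qedit: report the number of arguments
# received and print the last one exactly as the shell delivered it.
show_args() {
    echo "$# argument(s)"
    for arg in "$@"; do
        last=$arg
    done
    echo "last: $last"
}

# Single quotes keep the whole expression, including && and the inner
# double quotes, together as one argument.
show_args 3727 Requirements '(Arch == "X86_64") && (Memory >= 1000)'
```

This prints "3 argument(s)" followed by the expression unchanged, which is what condor_qedit needs to receive. Without the single quotes the shell would reject the line or split it at the && signs before the command ever ran.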

-- brodbd - 29 Aug 2008
Copyright © 1999-2008 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.