Difference: BigMemoryCondor (1 vs. 10)

Revision 10 (2011-10-10) - brodbd

Line: 1 to 1
 
META TOPICPARENT name="TroubleshootingCondor"

Running Condor jobs with large memory requirements

Line: 9 to 10
 

Running jobs larger than 1 GB

Changed:
<
<
If you have a job with processes that consume more than 1 GB of memory, you can tell Condor how much RAM they require by adding the require_memory keyword to your submit file. This value should be specified in megabytes.
>
>
If you have a job with processes that consume more than 1 GB of memory, you can tell Condor how much RAM they require by adding the request_memory keyword to your submit file. This value should be specified in megabytes.
  Here's an example submit script for an executable called hugejob, which requires at least 7 GB of memory to run:
executable = hugejob
Line: 18 to 19
output = hugejob.out
error = hugejob.err
log = hugejob.log
Changed:
<
<
require_memory = 7*1024
>
>
request_memory = 7*1024
 queue
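
Putting the pieces shown above together, a complete submit file using request_memory might look something like this (a minimal sketch assembled from the fragments in this revision):

executable = hugejob
getenv = true
input = hugejob.in
output = hugejob.out
error = hugejob.err
log = hugejob.log
# request_memory is given in megabytes, so 7*1024 MB is 7 GB
request_memory = 7*1024
queue

Since 7*1024 works out to 7168, writing request_memory = 7168 directly should be equivalent.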

Revision 9 (2011-09-09) - brodbd

Line: 1 to 1
 
META TOPICPARENT name="TroubleshootingCondor"

Running Condor jobs with large memory requirements

By default, Condor assigns each process you launch 1 GB of RAM. If your job grows too large, one of two things will happen.

Changed:
<
<
  • Condor may evict it, causing it to return to the queue and stay in the idle ("I") state. This is more likely to happen when there are a lot of other jobs queued. The job may then get relaunched on another machine, but all of its progress up to that point will be lost.
>
>
  • Condor may evict it, causing it to return to the queue and stay in the idle ("I") state.
 
  • If the combination of jobs on a machine exceeds the amount of available RAM and swap, the kernel out of memory killer will kill processes until memory becomes available.
Added:
>
>
Both of these problems can be avoided by giving Condor a realistic idea of how much memory your job needs.
 

Running jobs larger than 1 GB

If you have a job with processes that consume more than 1 GB of memory, you can tell Condor how much RAM they require by adding the require_memory keyword to your submit file. This value should be specified in megabytes.

Here's an example submit script for an executable called hugejob, which requires at least 7 GB of memory to run:

Changed:
<
<
universe = vanilla
executable = hugejob
>
>
executable = hugejob
getenv = true
input = hugejob.in
output = hugejob.out

Revision 8 (2011-08-22) - brodbd

Line: 1 to 1
 
META TOPICPARENT name="TroubleshootingCondor"

Running Condor jobs with large memory requirements

Changed:
<
<
Normally Condor assigns one job to each CPU on a node, dividing up the memory equally. On most of our current systems this results in 2 GB of RAM per slot. Ideally, you should structure your jobs to stay within this amount of memory; this uses the cluster efficiently.

If your job grows too large, one of two things will happen.

  • If it exceeds 2 GB, Condor may evict it, causing it to return to the queue and stay in the idle ("I") state. This is more likely to happen when there are a lot of other jobs queued.
>
>
By default, Condor assigns each process you launch 1 GB of RAM. If your job grows too large, one of two things will happen.
  • Condor may evict it, causing it to return to the queue and stay in the idle ("I") state. This is more likely to happen when there are a lot of other jobs queued. The job may then get relaunched on another machine, but all of its progress up to that point will be lost.
 
  • If the combination of jobs on a machine exceeds the amount of available RAM and swap, the kernel out of memory killer will kill processes until memory becomes available.
Changed:
<
<

Running jobs larger than 2 GB

If you have jobs that consume more than 2 GB of memory, you can tell Condor to claim an entire machine instead of one slot, so all of the system's memory is available to your job. To do this, add

+RequiresWholeMachine = True
>
>

Running jobs larger than 1 GB

 
Changed:
<
<
to your submit file. (Note the plus sign, which is required. Also, note that this attribute is a custom one for our site and may not be available on other Condor clusters.) You will also want to tell Condor not to check your job's memory use, so it won't be evicted when it grows larger than 2 GB. This is easily done by adding your own memory constraint to your job's submit file; for example:
Requirements = (Memory > 0)

Finally, you may want to specify a minimum amount of total memory for the machine. This can be done by adding a TotalMemory requirement. (Both TotalMemory and Memory are measured in megabytes. Memory is the memory available per slot, while TotalMemory is the total amount of memory for the whole machine.)

>
>
If you have a job with processes that consume more than 1 GB of memory, you can tell Condor how much RAM they require by adding the require_memory keyword to your submit file. This value should be specified in megabytes.
  Here's an example submit script for an executable called hugejob, which requires at least 7 GB of memory to run:
universe = vanilla
Line: 25 to 17
output = hugejob.out
error = hugejob.err
log = hugejob.log
Deleted:
<
<
+RequiresWholeMachine = True
Requirements = ( Memory > 0 && TotalMemory >= (7*1024) )

Note: Be careful about being too specific with TotalMemory constraints. For various reasons (memory consumed by the OS, etc.) the TotalMemory constraint will probably be stricter than you expect. For example, our 4 gigabyte nodes actually report their total memory as 3950 MB, so a constraint of (TotalMemory >= (4*1024)) will exclude them.
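
Pulled together, the whole-machine version of the submit file from this earlier revision would have looked roughly like this (a sketch reassembled from the fragments above; the closing queue statement is assumed):

universe = vanilla
executable = hugejob
getenv = true
input = hugejob.in
output = hugejob.out
error = hugejob.err
log = hugejob.log
+RequiresWholeMachine = True
Requirements = ( Memory > 0 && TotalMemory >= (7*1024) )
queue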

Interaction with other jobs

Jobs with +RequiresWholeMachine set are subject to the following rules:

  1. RequiresWholeMachine jobs will only start on Slot 1. Once the job is running, other slots will be marked as having "Owner" status, to prevent single-slot jobs from running in them and consuming memory. If no machines that match the job's requirements have Slot 1 available, the job will remain idle in the queue until Slot 1 opens up on a machine.
  2. If a machine with no slots taken that matches the job's requirements is available, the job will start there.
  3. If no machines are completely free, but an otherwise occupied machine has Slot 1 open, the job will start there and immediately go into the "Suspended" state. It will remain suspended until all the single-slot jobs on the machine complete, and then it will continue. (This more or less causes the job to claim "dibs" on a slot that might otherwise go to a single-slot job.) If the job remains suspended for at least two hours without running, it will become eligible for preemption and may return to the queue to wait for a new slot assignment. You can force this to happen at any time with the condor_vacate command.

I'm still tweaking these rules, so if you see any pathological behavior, or have an idea for a way to allocate slots more fairly, email linghelp@u and let me know.

-- brodbd - 09 Apr 2009

Added:
>
>
require_memory = 7*1024
queue

Revision 7 (2009-11-24) - brodbd

Line: 1 to 1
 
META TOPICPARENT name="TroubleshootingCondor"

Running Condor jobs with large memory requirements

Changed:
<
<
Normally Condor assigns one job to each CPU on a node, dividing up the memory equally. On all of our current systems this results in 2 GB of RAM per slot. Ideally, you should structure your jobs to stay within this amount of memory; this uses the cluster efficiently.
>
>
Normally Condor assigns one job to each CPU on a node, dividing up the memory equally. On most of our current systems this results in 2 GB of RAM per slot. Ideally, you should structure your jobs to stay within this amount of memory; this uses the cluster efficiently.
If your job grows too large, one of two things will happen.
  • If it exceeds 2 GB, Condor may evict it, causing it to return to the queue and stay in the idle ("I") state. This is more likely to happen when there are a lot of other jobs queued.
Line: 10 to 10
 

Running jobs larger than 2 GB

If you have jobs that consume more than 2 GB of memory, you can tell Condor to claim an entire machine instead of one slot, so all of the system's memory is available to your job. To do this, add

Changed:
<
<
+RequiresWholeMachine = True
to your submit file. (Note the plus sign, which is required. Also, note that this attribute is a custom one for our site and may not be available on other Condor clusters.) You will also want to tell Condor not to check your job's memory use, so it won't be evicted when it grows larger than 2 GB. This is easily done by adding your own memory constraint to your job's submit file; for example:
Requirements = (Memory > 0)
Finally, you may want to specify a minimum amount of total memory for the machine. This can be done by adding a TotalMemory requirement. (Both TotalMemory and Memory are measured in megabytes. Memory is the memory available per slot, while TotalMemory is the total amount of memory for the whole machine.)
>
>
+RequiresWholeMachine = True

to your submit file. (Note the plus sign, which is required. Also, note that this attribute is a custom one for our site and may not be available on other Condor clusters.) You will also want to tell Condor not to check your job's memory use, so it won't be evicted when it grows larger than 2 GB. This is easily done by adding your own memory constraint to your job's submit file; for example:

Requirements = (Memory > 0)

Finally, you may want to specify a minimum amount of total memory for the machine. This can be done by adding a TotalMemory requirement. (Both TotalMemory and Memory are measured in megabytes. Memory is the memory available per slot, while TotalMemory is the total amount of memory for the whole machine.)

  Here's an example submit script for an executable called hugejob, which requires at least 7 GB of memory to run:
universe = vanilla
Line: 21 to 26
error = hugejob.err
log = hugejob.log
+RequiresWholeMachine = True
Changed:
<
<
Requirements = ( Memory > 0 && TotalMemory >= (7*1024) ) Note: Be careful about being too specific with TotalMemory constraints. For various reasons (memory consumed by the OS, etc.) the TotalMemory constraint will probably be stricter than you expect. For example, our 4 gigabyte nodes actually report their total memory as 3950 MB, so a constraint of (TotalMemory >= (4*1024)) will exclude them.
>
>
Requirements = ( Memory > 0 && TotalMemory >= (7*1024) )

Note: Be careful about being too specific with TotalMemory constraints. For various reasons (memory consumed by the OS, etc.) the TotalMemory constraint will probably be stricter than you expect. For example, our 4 gigabyte nodes actually report their total memory as 3950 MB, so a constraint of (TotalMemory >= (4*1024)) will exclude them.
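
One way to allow for that is to build some headroom into the constraint rather than using an exact power-of-two figure; for example (3900 is purely an illustrative value, not a site recommendation):

# Accept machines with "about 4 GB", allowing for memory reserved by the OS
Requirements = ( Memory > 0 && TotalMemory >= 3900 )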

 

Interaction with other jobs

Revision 6 (2009-04-09) - goodmami

Line: 1 to 1
 
META TOPICPARENT name="TroubleshootingCondor"

Running Condor jobs with large memory requirements

Line: 21 to 21
error = hugejob.err
log = hugejob.log
+RequiresWholeMachine = True
Changed:
<
<
Requirements = ( Memory > 0 && TotalMemory >= (7*1024) )

Note: Be careful about being too specific with TotalMemory constraints. For various reasons (memory consumed by the OS, etc.) the TotalMemory constraint will probably be stricter than you expect. For example, our 4 gigabyte nodes actually report their total memory as 3950 KB, so a constraint of (TotalMemory >= (4*1024)) will exclude them.
>
>
Requirements = ( Memory > 0 && TotalMemory >= (7*1024) )

Note: Be careful about being too specific with TotalMemory constraints. For various reasons (memory consumed by the OS, etc.) the TotalMemory constraint will probably be stricter than you expect. For example, our 4 gigabyte nodes actually report their total memory as 3950 MB, so a constraint of (TotalMemory >= (4*1024)) will exclude them.
 

Interaction with other jobs

Revision 5 (2009-04-09) - brodbd

Line: 1 to 1
 
META TOPICPARENT name="TroubleshootingCondor"

Running Condor jobs with large memory requirements

Line: 10 to 10
 

Running jobs larger than 2 GB

If you have jobs that consume more than 2 GB of memory, you can tell Condor to claim an entire machine instead of one slot, so all of the system's memory is available to your job. To do this, add

Changed:
<
<
+RequiresWholeMachine = True
to your submit file. (Note the plus sign, which is required. Also, note that this attribute is a custom one for our site and may not be available on other Condor clusters.) You will also want to tell Condor not to check your job's memory use, so it won't be evicted when it grows larger than 2 GB. This is easily done by adding your own memory constraint to your job's submit file; for example:
Requirements = (Memory > 0)
Finally, you may want to specify a minimum amount of total memory for the machine. This can be done by adding a TotalMemory requirement. (Both TotalMemory and Memory are measured in kilobytes. Memory is the memory available per slot, while TotalMemory is the total amount of memory for the whole machine.)
>
>
+RequiresWholeMachine = True
to your submit file. (Note the plus sign, which is required. Also, note that this attribute is a custom one for our site and may not be available on other Condor clusters.) You will also want to tell Condor not to check your job's memory use, so it won't be evicted when it grows larger than 2 GB. This is easily done by adding your own memory constraint to your job's submit file; for example:
Requirements = (Memory > 0)
Finally, you may want to specify a minimum amount of total memory for the machine. This can be done by adding a TotalMemory requirement. (Both TotalMemory and Memory are measured in megabytes. Memory is the memory available per slot, while TotalMemory is the total amount of memory for the whole machine.)
  Here's an example submit script for an executable called hugejob, which requires at least 7 GB of memory to run:
universe = vanilla
Line: 32 to 32
  I'm still tweaking these rules, so if you see any pathological behavior, or have an idea for a way to allocate slots more fairly, email linghelp@u and let me know.
Changed:
<
<
-- brodbd - 30 Mar 2009
>
>
-- brodbd - 09 Apr 2009

Revision 4 (2009-03-30) - brodbd

Line: 1 to 1
 
META TOPICPARENT name="TroubleshootingCondor"

Running Condor jobs with large memory requirements

Line: 10 to 10
 

Running jobs larger than 2 GB

If you have jobs that consume more than 2 GB of memory, you can tell Condor to claim an entire machine instead of one slot, so all of the system's memory is available to your job. To do this, add

Changed:
<
<
+RequiresWholeMachine = True
to your submit file. (Note the plus sign, which is required. Also, note that this attribute is a custom one for our site and may not be available on other Condor clusters.) You will also want to tell Condor not to check your job's memory use, so it won't be evicted when it grows larger than 2 GB. This is easily done by adding your own memory constraint to your job's submit file; for example:
Requirements = (Memory > 0)
Finally, you may want to specify a minimum amount of total memory for the machine. This can be done by adding a TotalMemory requirement. (Both TotalMemory and Memory are measured in kilobytes. Memory is the memory available per slot, while TotalMemory is the total amount of memory for the whole machine.)
>
>
+RequiresWholeMachine = True
to your submit file. (Note the plus sign, which is required. Also, note that this attribute is a custom one for our site and may not be available on other Condor clusters.) You will also want to tell Condor not to check your job's memory use, so it won't be evicted when it grows larger than 2 GB. This is easily done by adding your own memory constraint to your job's submit file; for example:
Requirements = (Memory > 0)
Finally, you may want to specify a minimum amount of total memory for the machine. This can be done by adding a TotalMemory requirement. (Both TotalMemory and Memory are measured in kilobytes. Memory is the memory available per slot, while TotalMemory is the total amount of memory for the whole machine.)
 
Changed:
<
<
Here's an example submit script for an executable called hugejob, which requires at least 8 GB of memory to run:
>
>
Here's an example submit script for an executable called hugejob, which requires at least 7 GB of memory to run:
 
universe = vanilla
executable = hugejob
getenv = true
Line: 24 to 21
error = hugejob.err
log = hugejob.log
+RequiresWholeMachine = True
Changed:
<
<
Requirements = ( Memory > 0 && TotalMemory >= (8*1024) )
>
>
Requirements = ( Memory > 0 && TotalMemory >= (7*1024) )

Note: Be careful about being too specific with TotalMemory constraints. For various reasons (memory consumed by the OS, etc.) the TotalMemory constraint will probably be stricter than you expect. For example, our 4 gigabyte nodes actually report their total memory as 3950 KB, so a constraint of (TotalMemory >= (4*1024)) will exclude them.
 

Interaction with other jobs

Jobs with +RequiresWholeMachine set are subject to the following rules:

Line: 35 to 32
  I'm still tweaking these rules, so if you see any pathological behavior, or have an idea for a way to allocate slots more fairly, email linghelp@u and let me know.
Changed:
<
<
-- brodbd - 26 Mar 2009
>
>
-- brodbd - 30 Mar 2009

Revision 3 (2009-03-27) - brodbd

Line: 1 to 1
 
META TOPICPARENT name="TroubleshootingCondor"

Running Condor jobs with large memory requirements

Changed:
<
<
Normally Condor assigns one job to each CPU on a node, dividing up the memory equally. On all of our current systems this results in 2 GB of RAM per slot. Ideally, you should structure your jobs to stay within this amount of memory.
>
>
Normally Condor assigns one job to each CPU on a node, dividing up the memory equally. On all of our current systems this results in 2 GB of RAM per slot. Ideally, you should structure your jobs to stay within this amount of memory; this uses the cluster efficiently.
If your job grows too large, one of two things will happen.
  • If it exceeds 2 GB, Condor may evict it, causing it to return to the queue and stay in the idle ("I") state. This is more likely to happen when there are a lot of other jobs queued.
Changed:
<
<
  • If the combination of jobs on a machine exceeds the amount of available RAM and swap (about 8 GB, for our 2-CPU nodes), the kernel out of memory killer will kill processes until memory becomes available.
>
>
  • If the combination of jobs on a machine exceeds the amount of available RAM and swap, the kernel out of memory killer will kill processes until memory becomes available.

Running jobs larger than 2 GB

 
Changed:
<
<
Eventually I will implement a custom submit file attribute to allow jobs to claim the entire machine, but this requires a newer version of Condor than we're currently running. My current target for this upgrade is the spring '09 term break. However, there are some stop-gap techniques that can help.

By adding the requirement "VirtualMachineID == 1" to your job, it will only run on the first CPU slot of any machine. This will not prevent other jobs from occupying other slots, but it will ensure that only one copy of your job (or any similarly flagged job) will run on each machine. Note: The name of this parameter changed to SlotID in condor 7.x, so when we upgrade in the spring any submit files that use this parameter will need to be changed.

>
>
If you have jobs that consume more than 2 GB of memory, you can tell Condor to claim an entire machine instead of one slot, so all of the system's memory is available to your job. To do this, add
+RequiresWholeMachine = True
to your submit file. (Note the plus sign, which is required. Also, note that this attribute is a custom one for our site and may not be available on other Condor clusters.) You will also want to tell Condor not to check your job's memory use, so it won't be evicted when it grows larger than 2 GB. This is easily done by adding your own memory constraint to your job's submit file; for example:
Requirements = (Memory > 0)
Finally, you may want to specify a minimum amount of total memory for the machine. This can be done by adding a TotalMemory requirement. (Both TotalMemory and Memory are measured in kilobytes. Memory is the memory available per slot, while TotalMemory is the total amount of memory for the whole machine.)

Here's an example submit script for an executable called hugejob, which requires at least 8 GB of memory to run:

universe = vanilla
executable = hugejob
getenv = true
input = hugejob.in
output = hugejob.out
error = hugejob.err
log = hugejob.log
+RequiresWholeMachine = True
Requirements = ( Memory > 0 && TotalMemory >= (8*1024) ) 

Interaction with other jobs

 
Changed:
<
<
By adding an explicit Memory requirement to your job, Condor will allow it to run on any slot with at least that amount of RAM, and will not evict it if it grows larger than 2 GB. (It's still vulnerable to the out-of-memory killer if it grows too large, however.) This requirement is measured in kilobytes and, for our purposes, can be set to any arbitrary number that's less than the smallest slot in the cluster -- currently 1975 KB.
>
>
Jobs with +RequiresWholeMachine set are subject to the following rules:
  1. RequiresWholeMachine jobs will only start on Slot 1. Once the job is running, other slots will be marked as having "Owner" status, to prevent single-slot jobs from running in them and consuming memory. If no machines that match the job's requirements have Slot 1 available, the job will remain idle in the queue until Slot 1 opens up on a machine.
  2. If a machine with no slots taken that matches the job's requirements is available, the job will start there.
  3. If no machines are completely free, but an otherwise occupied machine has Slot 1 open, the job will start there and immediately go into the "Suspended" state. It will remain suspended until all the single-slot jobs on the machine complete, and then it will continue. (This more or less causes the job to claim "dibs" on a slot that might otherwise go to a single-slot job.) If the job remains suspended for at least two hours without running, it will become eligible for preemption and may return to the queue to wait for a new slot assignment. You can force this to happen at any time with the condor_vacate command.
 
Changed:
<
<
Combining these two requirements, we end up with the following, which can be added to the submit file of your large job:
Requirements = (VirtualMachineID == 1 && Memory > 1024) 
>
>
I'm still tweaking these rules, so if you see any pathological behavior, or have an idea for a way to allocate slots more fairly, email linghelp@u and let me know.
 
Deleted:
<
<
-- brodbd - 23 Feb 2009
Added:
>
>
-- brodbd - 26 Mar 2009

Revision 2 (2009-02-24) - brodbd

Line: 1 to 1
 
META TOPICPARENT name="TroubleshootingCondor"

Running Condor jobs with large memory requirements

Line: 12 to 12
  By adding the requirement "VirtualMachineID == 1" to your job, it will only run on the first CPU slot of any machine. This will not prevent other jobs from occupying other slots, but it will ensure that only one copy of your job (or any similarly flagged job) will run on each machine. Note: The name of this parameter changed to SlotID in condor 7.x, so when we upgrade in the spring any submit files that use this parameter will need to be changed.
Changed:
<
<
By adding an explicit Memory requirement to your job, Condor will allow it to run on any slot with at least that amount of RAM, and will not evict it if it grows larger than 2 GB. (It's still vulnerable to the out-of-memory killer if it grows too large, however.) This requirement is measured in kilobytes and, for our purposes, can be set to any arbitrary number that's less than the smallest slot in the cluster -- currently 1976 KB.
>
>
By adding an explicit Memory requirement to your job, Condor will allow it to run on any slot with at least that amount of RAM, and will not evict it if it grows larger than 2 GB. (It's still vulnerable to the out-of-memory killer if it grows too large, however.) This requirement is measured in kilobytes and, for our purposes, can be set to any arbitrary number that's less than the smallest slot in the cluster -- currently 1975 KB.
  Combining these two requirements, we end up with the following, which can be added to the submit file of your large job:
Requirements = (VirtualMachineID == 1 && Memory > 1024) 

Revision 1 (2009-02-23) - brodbd

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="TroubleshootingCondor"

Running Condor jobs with large memory requirements

Normally Condor assigns one job to each CPU on a node, dividing up the memory equally. On all of our current systems this results in 2 GB of RAM per slot. Ideally, you should structure your jobs to stay within this amount of memory.

If your job grows too large, one of two things will happen.

  • If it exceeds 2 GB, Condor may evict it, causing it to return to the queue and stay in the idle ("I") state. This is more likely to happen when there are a lot of other jobs queued.
  • If the combination of jobs on a machine exceeds the amount of available RAM and swap (about 8 GB, for our 2-CPU nodes), the kernel out of memory killer will kill processes until memory becomes available.

Eventually I will implement a custom submit file attribute to allow jobs to claim the entire machine, but this requires a newer version of Condor than we're currently running. My current target for this upgrade is the spring '09 term break. However, there are some stop-gap techniques that can help.

By adding the requirement "VirtualMachineID == 1" to your job, it will only run on the first CPU slot of any machine. This will not prevent other jobs from occupying other slots, but it will ensure that only one copy of your job (or any similarly flagged job) will run on each machine. Note: The name of this parameter changed to SlotID in condor 7.x, so when we upgrade in the spring any submit files that use this parameter will need to be changed.

By adding an explicit Memory requirement to your job, Condor will allow it to run on any slot with at least that amount of RAM, and will not evict it if it grows larger than 2 GB. (It's still vulnerable to the out-of-memory killer if it grows too large, however.) This requirement is measured in kilobytes and, for our purposes, can be set to any arbitrary number that's less than the smallest slot in the cluster -- currently 1976 KB.

Combining these two requirements, we end up with the following, which can be added to the submit file of your large job:

Requirements = (VirtualMachineID == 1 && Memory > 1024) 
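
In context, a complete submit file using those two stop-gap requirements might have looked something like this (an illustrative sketch; the executable and file names are made up, and VirtualMachineID is the pre-7.x attribute name discussed above):

universe = vanilla
executable = bigjob
getenv = true
input = bigjob.in
output = bigjob.out
error = bigjob.err
log = bigjob.log
# Run only in the first slot of each machine, and accept any reported slot memory
# so the job is not evicted for outgrowing its nominal 2 GB share
Requirements = (VirtualMachineID == 1 && Memory > 1024)
queue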

-- brodbd - 23 Feb 2009

 