Ensuring That SGE Does Not Oversubscribe Processors or Memory

SGE, by default, will happily oversubscribe processors when multiple queues target the same nodes (nice one, SGE). Furthermore, even if jobs specify a memory limit, each individual job can stay under its own limit while the combined memory usage of the jobs assigned to a machine exceeds that machine’s memory. That exhausts the memory of the entire node and sends it off into limbo (cunning, SGE, very cunning). The solution to both of these issues is to specify processors and memory as consumable resources at the host level. This post details the procedure.

Creating the Consumable Resource Attributes

• Call up the complex attribute modification editor:

$ qconf -mc

• Edit the “slots” entry, so it looks like the following:

slots s INT

• Edit the “virtual_free” entry, so it looks like the following:

virtual_free vf MEMORY

• Save and exit (“:wq”)

Set the Host-Level Limits for the Resources

This is a pain. The configuration for each host has to be specified individually. (Really, SGE? Really?) So, for each host with the name/address “<hostname>”, type:

$ qconf -me <hostname>

and either add or edit the existing “complex_values” entry so that it looks like:

complex_values        slots=8, virtual_free=16G
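Once both edits are in place, it may be worth checking that they took effect. The following is a minimal sketch, not part of the original procedure: the two helper functions are hypothetical names, and they assume that "qconf -sc" prints its usual columns (name, shortcut, type, relop, requestable, consumable, default, urgency) and that "qconf -se" prints one setting per line. For a resource to behave as a consumable, the consumable column should read YES.

```shell
#!/bin/bash

# Succeeds if the named attribute is marked consumable in `qconf -sc`
# output read from stdin. Assumes the consumable flag is column 6.
is_consumable() {
    awk -v attr="$1" '
        $1 == attr { found = ($6 == "YES") }
        END { exit(found ? 0 : 1) }
    '
}

# Prints the complex_values setting found in `qconf -se <hostname>`
# output read from stdin.
host_complex_values() {
    awk '$1 == "complex_values" { $1 = ""; sub(/^[ \t]+/, ""); print }'
}

# Example use on a machine with SGE installed (host name is illustrative):
#   qconf -sc | is_consumable slots        || echo "slots is not consumable"
#   qconf -sc | is_consumable virtual_free || echo "virtual_free is not consumable"
#   qconf -se compute-0-0 | host_complex_values
```

Both functions only read text from standard input, so they can be tried out on saved output before being pointed at a live cluster.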

This page has a script that helps by scripting out some of the pain. It creates a temporary file for each host that describes the limits, and then sets each host’s configuration using the file. It obviously requires careful customization for each individual set-up. Modified for my case:

#! /bin/bash

host_prefix='compute-0-'
# put the host indices for your cluster after "in"
for i in ; do
    n=$(printf "%d" "$i")
    host=$host_prefix$n
    echo "$host"
    file_name=sge_$host.conf
    # write this host's configuration to a temporary file
    cat > "$file_name" << EOF
hostname $host
load_scaling NONE
complex_values slots=8, virtual_free=15G
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
EOF
    # load the configuration from the file
    qconf -Me "$file_name"
done
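If the hosts do not all have the same number of cores or the same amount of memory, the same idea can be parameterized. The following is a sketch under my own assumptions rather than anything from the original script: the function names and the APPLY variable are hypothetical, and by default it only writes the files so they can be inspected before "qconf -Me" is run.

```shell
#!/bin/bash

# Print an execution-host configuration suitable for `qconf -Me`.
# Arguments: host name, slot count, virtual_free limit (e.g. 15G).
host_conf() {
    local host=$1 slots=$2 mem=$3
    cat << EOF
hostname $host
load_scaling NONE
complex_values slots=$slots, virtual_free=$mem
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
EOF
}

# Write sge_<host>.conf; only call qconf when APPLY=1 (dry run otherwise).
apply_host_limits() {
    local file_name="sge_$1.conf"
    host_conf "$@" > "$file_name"
    if [ "${APPLY:-0}" = "1" ]; then
        qconf -Me "$file_name"
    else
        echo "dry run: wrote $file_name"
    fi
}

# Example (host name and limits are illustrative):
#   apply_host_limits compute-0-3 8 15G
#   APPLY=1 apply_host_limits compute-0-3 8 15G
```

Keeping the dry run as the default means a typo in a host name or a limit shows up in a file first, not in the live configuration.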

Set/Customize the Default Resources Used by a Job

Edit “$SGE_ROOT/default/common/sge_request” and add:

# default memory limit
-l h_vmem=1.8G
# default memory usage
-l virtual_free=1.8G
# default to general.q
#-q "general.q"
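
With the values used in this post, the defaults and the host limits line up: eight jobs at 1.8G each come to 14.4G, which fits under the 15G virtual_free limit of an 8-slot host. The following sketch does that arithmetic for other combinations. The function name is hypothetical, and it assumes the scheduler admits a job only while both consumables (slots and virtual_free) have enough left.

```shell
#!/bin/bash

# Print how many jobs fit on one host, limited by both slots and memory.
# Arguments: per-job virtual_free request (GB), host virtual_free (GB),
# host slots.
max_jobs() {
    awk -v job="$1" -v host="$2" -v slots="$3" 'BEGIN {
        by_mem = int(host / job)            # jobs that fit by memory alone
        n = (by_mem < slots + 0) ? by_mem : slots + 0
        print n
    }'
}

max_jobs 1.8 15 8   # default-sized jobs on a 15G, 8-slot host
max_jobs 4 15 8     # jobs that each request 4G
```

The first call prints 8 (all slots usable) and the second prints 3 (memory runs out before slots do). A job that needs more than the default can request it at submission time, for example with "qsub -l h_vmem=4G -l virtual_free=4G job.sh", where job.sh stands in for your own script.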