allows the temporary addition of a grid resource to a local Condor pool. The addition is accomplished by installing and executing some of the Condor daemons on the remote grid resource, such that it reports in as part of the local Condor pool.
accomplishes two separate tasks: set up and execution. These separated tasks allow flexibility, in that the user may use
to do only one of the tasks or both, in addition to customizing the tasks.
The set up task generates a script that may be used to start the Condor daemons during the execution task, places this script on the remote grid resource, composes and installs a configuration file, and installs the
daemons on the grid resource.
The execution task runs the script generated by the set up task. The goal of the script is to invoke the
daemon. The Condor job
appears in the queue of the local Condor pool for each invocation of
. To remove the grid resource from the local Condor pool, use
to remove the
The Condor jobs to do both the set up and execute tasks utilize Condor-G and Globus protocols (gt2 or gt4) to communicate with the remote resource. Therefore, an X.509 certificate (proxy) is required for the user running
Specify the remote grid machine with the command line argument
takes one of 4 forms:
Globus contact string
specifies the full path and file name of a file that contains Globus contact strings. Each of the resources given by a Globus contact string is added to the local Condor pool.
The set up task of
copies the binaries for the correct platform from a central server. To obtain access to the server, or to set up your own server, follow the instructions on the Glidein Server Setup page, at http://www.cs.wisc.edu/condor/glidein. Set up need only be done once per site, as the installation is never removed.
By default, all files installed on the remote grid resource are placed in the directory $(HOME)/Condor_glidein. $(HOME) is evaluated and defined on the remote machine using a grid map. This directory must be in a shared file system accessible by all machines that will run the Condor daemons. By default, the daemons' log files will also be written in this directory. Change this directory with the
option to make Condor daemons write to local scratch space on the execution machine. For debugging initial problems, it may be convenient to have the log files in the more accessible default directory. If using the default directory, occasionally clean up old log and execute directories to avoid running out of space.
To have 10 grid resources running PBS at a grid site with a gatekeeper named gatekeeper.site.edu join the local Condor pool:
If you try something like the above and
is not able to automatically determine everything it needs to know about the remote site, it will ask you to provide more information. A typical result of this process is something like the following command:
The Condor jobs that do the set up and execute tasks will appear in the queue for the local Condor pool. As a result of a successful glidein, use
to see that the remote grid resources are part of the local Condor pool.
A list of common problems and solutions is presented in this manual page.
Generate File Options
Create a local copy of the configuration file that may be used on the remote resource. The file is named glidein_condor_config.<suffix>. The string defined by <suffix> defaults to the process id (PID) of the
process or is defined with the
command line option. The configuration file may be edited for later use with the
Create a local copy of the script used on the remote resource to invoke the
. The file is named glidein_startup.<suffix>. The string defined by <suffix> defaults to the process id (PID) of the
process or is defined with the
command line option. The file may be edited for later use with the
Generate submit description files, but do not submit. The submit description file for the set up task is named glidein_setup.submit.<suffix>. The submit description file for the execute task is named glidein_run.submit.<suffix>. The string defined by <suffix> defaults to the process id (PID) of the
process or is defined with the
command line option.
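The naming scheme for the generated files can be sketched as follows. This is only an illustration of the file names listed above, not code from the tool itself. The function name and dictionary keys are hypothetical, and os.getpid() stands in for the PID of the real process.

```python
import os


def generated_file_names(suffix=None):
    """Return the generated file names described above for a given suffix.

    Hypothetical helper for illustration only; it is not part of Condor.
    """
    # The text says the suffix defaults to the process id (PID).
    # Here the PID of this Python process is used as a stand-in.
    if suffix is None:
        suffix = str(os.getpid())
    return {
        "config": f"glidein_condor_config.{suffix}",
        "startup": f"glidein_startup.{suffix}",
        "setup_submit": f"glidein_setup.submit.{suffix}",
        "run_submit": f"glidein_run.submit.{suffix}",
    }
```

Supplying an explicit suffix makes the file names predictable, which helps when a generated file is edited and reused later.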
Set Up Task Options
Do only the set up task of
. This option cannot be run simultaneously with
Do the set up task on the local machine, instead of at a remote grid resource. This may be used, for example, to do the set up task of
in an AFS area that is read-only from the remote grid resource.
During the set up task, force the copying of files, even if this overwrites existing files. Use this to push out changes to the configuration.
The set up task copies the specified configuration file, rather than generating one.
The set up task copies the specified startup script, rather than generating one.
Identifies the jobmanager on the remote grid resource to receive the files during the set up task. If a reasonable default can be discovered through MDS, this is optional.
is a string representing any gt2 name for the job manager. The correct string in most cases will be
. Other common strings may be
Execute Task Options
Starts execution of the Condor daemons on the grid resource. If any of the necessary files or executables are missing,
exits with an error code. This option cannot be run simultaneously with
directly rather than submitting a Condor job that causes the remote execution. To instead generate a script that does this, use
in combination with
. This may be useful for running Condor daemons on resources that are not directly accessible by Condor.
Display brief usage information and exit.
Specifies the base directory on the remote grid resource used for placing files. The default directory is $(HOME)/Condor_glidein on the grid resource.
Specifies the directory on the remote grid resource for placement of the Condor executables. The default value for
is based upon version information on the grid resource. It is of the form <basedir>/<condor-version>-<Globus canonicalsystemname>. An example of the directory (without the base directory) for Condor version 6.1.13 running on a Sun Sparc machine with Solaris 2.6 is 6.1.13-sparc-sun-solaris-2.6.
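As a minimal sketch of how the default executable directory name is composed, the following joins a base directory with the version and system name in the form given above. The function name and the base directory used in the usage note are hypothetical.

```python
import posixpath


def default_executable_dir(basedir, condor_version, canonical_system_name):
    """Compose <basedir>/<condor-version>-<Globus canonicalsystemname>.

    Hypothetical helper for illustration only; it is not part of Condor.
    """
    # posixpath is used because the path refers to the remote grid resource.
    leaf = f"{condor_version}-{canonical_system_name}"
    return posixpath.join(basedir, leaf)
```

For example, default_executable_dir("/home/user/Condor_glidein", "6.1.13", "sparc-sun-solaris-2.6") produces a path ending in 6.1.13-sparc-sun-solaris-2.6, matching the example in the text. The /home/user prefix is an assumed home directory.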
Specifies the directory on the remote grid resource in which to create log and execution subdirectories needed by Condor. If limited disk quota in the home or base directory on the grid resource is a problem, set
to a large temporary space, such as /tmp or /scratch. If the batch system requires invocation of Condor daemons in a temporary scratch directory, '.' may be used for the definition of the
Identifies the platform of the required tarball containing the correct Condor daemon executables to download and install. If a reasonable default can be discovered through MDS, this is optional. A list of possible values may be found at http://www.cs.wisc.edu/condor/glidein/binaries. The architecture name is the same as the tarball name without the suffix tar.gz. An example is 6.6.5-i686-pc-Linux-2.4.
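The relationship between a tarball name and its architecture name can be illustrated with a short sketch. The function name is hypothetical, and the sketch assumes the suffix is attached with a leading dot (.tar.gz).

```python
def architecture_name(tarball_name):
    """Derive the architecture name by dropping the .tar.gz suffix.

    Hypothetical helper for illustration only; it is not part of Condor.
    """
    suffix = ".tar.gz"  # assumption: the suffix includes the leading dot
    if not tarball_name.endswith(suffix):
        raise ValueError(f"not a .tar.gz tarball name: {tarball_name!r}")
    return tarball_name[: -len(suffix)]
```

Under that assumption, a tarball named 6.6.5-i686-pc-Linux-2.4.tar.gz corresponds to the architecture name 6.6.5-i686-pc-Linux-2.4 from the example above.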
is a string used at the grid resource to identify a job queue.
is a string used at the grid resource to identify a project name.
The maximum memory size in megabytes to request from the grid resource.
-count CPU count
The number of CPUs requested to join the local pool. The default is 1.
-slots slot count
For machines with multiple CPUs, the CPUs may be divided up into slots.
is the number of slots that results. By default, Condor divides multiple-CPU resources such that each CPU is a slot, each with an equal share of RAM, disk, and swap space. This option configures the number of slots, so that multi-threaded jobs can run in a slot with multiple CPUs. For example, if 4 CPUs are requested and
is not specified, Condor will divide the request up into 4 slots with 1 CPU each. However, if
is specified, Condor will divide the request up into 2 slots with 2 CPUs each, and if
is specified, Condor will put all 4 CPUs into one slot.
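The arithmetic behind the slot examples can be sketched as below. This only mirrors the cases described in the text and is not Condor's own partitioning logic. The function name is hypothetical, and rejecting a CPU count that does not divide evenly is an assumption, since the text does not describe that case.

```python
def cpus_per_slot(cpu_count, slot_count=None):
    """Return a list with the number of CPUs in each slot.

    Hypothetical helper for illustration only; it is not part of Condor.
    """
    # Default described in the text: each CPU becomes its own slot.
    if slot_count is None:
        slot_count = cpu_count
    if cpu_count < 1 or slot_count < 1:
        raise ValueError("cpu_count and slot_count must be positive")
    # Assumption: only even divisions are handled in this sketch.
    if cpu_count % slot_count != 0:
        raise ValueError("cpu_count is not evenly divisible by slot_count")
    return [cpu_count // slot_count] * slot_count
```

With 4 CPUs, the default gives four 1-CPU slots, a slot count of 2 gives two 2-CPU slots, and a slot count of 1 gives one 4-CPU slot, as in the examples above.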
The amount of time that a remote grid resource will remain in an idle state before the daemons shut down. A value of 0 (zero) means that the daemons never shut down due to remaining in the idle state. In this case, the
option defines when the daemons shut down. The default value is 20 minutes.
The maximum amount of time the Condor daemons on the remote grid resource will run before shutting themselves down. This option is useful for resources with enforced maximum run times. Setting
to be a few minutes shorter than the enforced limit gives the daemons time to perform a graceful shut down.
Sets the Condor START expression for the added remote grid resource to True. This permits any user's job which can run on the added remote grid resource to run. Without this option, only jobs owned by the user executing
can execute on the remote grid resource. WARNING: Using this option may violate the usage policies of many institutions.
Where to send e-mail with problems. The default is the login of the user running
at the UID domain of the local Condor pool.
Suffix to use when generating files. Default is process id.
Includes and enables GSI authentication in the configuration for the remote grid resource. The argument is the GSI certificate name that the daemons will use to authenticate themselves.
The argument identifies the directory containing the trusted CA certificates that the daemons are to use (for example, /etc/grid-security/certificates). The contents of this directory will be installed at the remote site in the directory <basedir>/grid-security.
The argument is the file name of the GSI-specific X.509 map file that the daemons will use. The file will be installed at the remote site in <basedir>/grid-security. The file contains entries mapping certificates to user names. At the very least, it must contain an entry for the certificate given by the command-line option
If other Condor daemons use different certificates, then this file will also list any certificates that the daemons will encounter for the
. See section for more information.
will exit with a status value of 0 (zero) upon complete success, or with non-zero values upon failure. The status value will be 1 (one) if
encountered an error making a directory, was unable to copy a tar file, encountered an error in parsing the command line, or was not able to gather required information. The status value will be 2 (two) if there was an error in the remote set up. The status value will be 3 (three) if there was an error in remote submission. The status value will be -1 (negative one) if no resource was specified in the command line.
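A wrapper script might translate these status values into messages along the following lines. This is a sketch, not part of the tool, and the names are hypothetical. It also handles 255, because a process that exits with -1 is reported as 255 on POSIX systems, where only the low 8 bits of the status are kept.

```python
# Meanings taken from the exit status description above.
EXIT_STATUS_MEANINGS = {
    0: "complete success",
    1: "error making a directory, copying a tar file, parsing the "
       "command line, or gathering required information",
    2: "error in the remote set up",
    3: "error in remote submission",
    -1: "no resource was specified in the command line",
}


def describe_exit_status(status):
    """Return a short description for an exit status value.

    Hypothetical helper for illustration only; it is not part of Condor.
    """
    # POSIX keeps only the low 8 bits, so -1 is observed as 255.
    if status == 255:
        status = -1
    return EXIT_STATUS_MEANINGS.get(status, "unrecognized non-zero failure status")
```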
Common problems are listed below. Many of these are best discovered by looking in the StartLog log file on the remote grid resource.
WARNING: The file xxx is not writable by condor
This error occurs when
is run in a directory that does not have the proper permissions for Condor to access files. An AFS directory does not give Condor the user's AFS ACLs.
One common cause of this problem is that the remote grid resources are in a different file system domain, and the submitted Condor jobs have an implicit requirement that they must run in the same file system domain. See section for details on using Condor's file transfer capabilities to solve this problem. Another cause of this problem is a communication failure. For example, a firewall may be preventing the
daemons from connecting to the
on the remote grid resource. Although work is being done to remove this requirement in the future, it is currently necessary to have full bidirectional connectivity, at least over a restricted range of ports. See page for more information on configuring a port range.
Glideins run but fail to join the pool
This may be caused by the local pool's security settings or by a communication failure. Check that the security settings in the local pool's configuration file allow write access to the remote grid resource. To avoid modifying the security settings for the pool, run a separate pool specifically for the remote grid resources, and use flocking to balance jobs across the two pools of resources. If the log files indicate a communication failure, then see the next item.
The startd cannot connect to the collector
This may be caused by several things. One is a firewall. Another is when the compute nodes do not have even outgoing network access. Configuration to work without full network access to and from the compute nodes is still in the experimental stages, so for now you must have at least a range of open (bidirectional) ports and set up the configuration file as described on page . Use the option
, edit the generated configuration file, and then do the glidein execute task with the option
Another possible cause of connectivity problems may be the use of UDP by the
to register itself with the
. Force it to use TCP as described on page .
Yet another possible cause of connectivity problems is when the remote grid resources have more than one network interface, and the default one chosen by Condor is not the correct one. One way to fix this is to modify the glidein startup script using the
options. The script needs to determine the IP address associated with the correct network interface, and assign this to the environment variable _condor_NETWORK_INTERFACE.
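The logic such a script needs can be sketched as follows. The real startup script is not necessarily written in Python, so this only illustrates the two steps: finding an IP address and exporting it. The function names are hypothetical. Picking the address from the route toward a known host (for example, the local pool's central manager) is one possible heuristic and an assumption, not the documented method.

```python
import os
import socket


def local_ip_toward(host, port):
    """Return the local IP address the OS would use to reach host:port.

    Hypothetical helper; assumes the route toward the given host goes
    through the correct network interface.
    """
    # Connecting a UDP socket sends no packets; it only selects a route.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect((host, port))
        return sock.getsockname()[0]


def environment_with_interface(ip_address, base_env=None):
    """Return a copy of the environment with _condor_NETWORK_INTERFACE set."""
    env = dict(os.environ if base_env is None else base_env)
    env["_condor_NETWORK_INTERFACE"] = ip_address
    return env
```

The returned environment would then be passed to whatever launches the daemons. If the route-based heuristic selects the wrong interface at a given site, the address can be supplied directly to environment_with_interface instead.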
NFS file locking problems
option uses files on NFS (not recommended, but sometimes convenient for testing), the Condor daemons may have trouble manipulating file locks. Try inserting the following into the configuration file:
IGNORE_NFS_LOCK_ERRORS = True
Condor Team, University of Wisconsin-Madison
Copyright (C) 1990-2009 Condor Team, Computer Sciences Department, University of Wisconsin-Madison, Madison, WI. All Rights Reserved. Licensed under the Apache License, Version 2.0.