Add Distributed Backend
In realistic cases the simulator you wish to model is computationally expensive to run (one sample evaluation may take many minutes or hours). In this case it is impractical to run the toolbox in the default, sequential model. In order to speed up things you will want to run simulations in parallel on a cluster or grid.
There are many different ways to evaluate samples in parallel, exactly how this happens depends on the grid middleware you are using (the software layer that takes care of authentication, job submission, etc) and how the middleware is accessed (locally, through a remote front node, through SOAP, ...). Since we cannot accommodate for every possible situation, we have provided a number of primitives (java classes) that you can extend to implement support for your particular setup. A number of examples are provided which you can refer to. Of these the Sun Grid Engine (SGE) backend is the most tested.
This page will guide you through the necessary steps to implement your own backend. So note that this page does NOT discuss how to run multiple, separate, standalone instances of the toolbox in parallel. How you do that is up to you.
Standalone executable or script
Of course you will need the code that actually does the simulation. This can be one of the following things:
- a single executable binary (compiled for OS/architecture that your setup uses)
- a single shell script
How the toolbox interfaces with this code is explained on the Interfacing with the toolbox page.
In addition each of the above can be accompanied by a number of files (not directories!!) that it depends upon (e.g., configuration files, input files, ...). All this is specified in the simulator xml file. In order for the toolbox to be aware of these files, they should be placed in the project directory, or alternatively, absolute paths should be used for their locations.
Each backend will automatically copy the the executable and dependency files to the correct location on the submission host.
Complete program or environment
Often however, your code is not a single executable + some dependencies but a whole program or environment (e.g., CADENCE, CST or some complex CFD program). In this case, to be able to do sample evaluations in parallel you should ensure that:
- the program is installed and correctly configured on every node of the cluster or grid you are going to use
- you have written a single, self-contained, shell script that interfaces between the toolbox and the simulator code (see the previous section). When run on one of the compute nodes this script should successfully produce the simulator output given a set of input parameters.
The grid/cluster you use is administered by a middleware (e.g., Sun Grid Engine). You will have to make sure you have a working account and you understand how this system works (authentication, job submission, job monitoring, file stating, ...).
Access to the middleware can happen through a number of ways:
- locally, you can submit jobs from the same machine the toolbox is running on
- remotely, jobs can only be submitted from a remote frontnode (= headnode), which may, for example, only be accessible though SSH or through a web browser.
In both cases, you will have to take care of implementing the communication between the toolbox and the submission system (e.g., tunnel job submission commands through SSH when dealing with a SSH accessible headnode).
The toolbox comes with 2 existing backends (APST was removed):
- SGE: submission through a remote, SSH accessible headnode
- LCG: submission through a remote, SSH accessible headnode
Of the 2, SGE is the most well tested.
Adding a new backend
Firstly make sure you understand the submission system and how to communicate with it. Lets call this new submission system X.
You will have to provide the following Java classes:
- XSampleEvaluator: should subclass BasicSampleEvaluator and implement the ResultProcessor interface. Use the prefix 'Remote' (i.e., RemoteXSampleEvaluator) if the submission system is accessed remotely. Its tasks are:
- read the configuration options from the toolbox configuration file
- perform necessary initialization (e.g., setup an SSH session, initialize the backend)
- copy the executables and dependencies to the correct location on the local/remote machine
- construct the Job objects and send them to the backend class
- update the sample queues when jobs are finished (ResultProcessor interface)
- XDistributedBackend (again adding the prefix Remote if necessary) which subclasses DistributedBackend or RemoteDistributedBackend and implements JobFinishedEventListener. Its tasks are:
- initialize the submission system (e.g., set up proxy, authenticate, ...)
- submit the actual job
- deal with failed jobs
- start and control the Poller object
- retrieve the output when the job has completed and notify the result processor object
- XPoller: should subclass Poller. Its tasks are:
- once a job is submitted it monitors the job state and informs the backend once a job has completed
- NB: LocalResultPoller and SSHPoller classes are already available and should be re-used