Ben is a simple and versatile batch job scheduler. It lets you run a queue of jobs, in parallel, on multiple machines. It comes as a single executable, ben, and does not require any configuration, only a working ssh setup.
By default, ben relies on Unix-domain sockets and piggy-backs on file permissions for security. It uses ssh forwarding under the hood (ssh -L and ssh -R) for networking. As an alternative, ben can also use TCP/IP sockets directly1. Ben is free software (GPLv3) written in C with no dependencies.
Say we want to transcode a hundred mp4/H.264 videos (v00.mp4 through v99.mp4) into Ogg/Theora videos (v00.ogv through v99.ogv). We want to run commands like:
$ ffmpeg -i v00.mp4 v00.ogv
Now, transcoding takes time, and we assume for this example that ffmpeg’s encoder is single-threaded (this is actually true for the Theora encoder in the current default build of ffmpeg). Therefore, we want to run multiple instances in parallel.
We can do it as follows. First, we start the ben server. It will maintain the job queue, dispatch jobs and handle coordination.
$ ben server -d
The -d option puts the server “in the background”, so that we get our command prompt back. If we omit it, we will see log information displayed in the terminal. Next, we start a client. The client will take jobs off the queue (first-in-first-out) and run them. The server and client are separate processes because, as we will see later, they can be run on distinct machines.
$ ben client -n 4 -d
The -d option has the same meaning as for the server: it puts the client “in the background”. The -n 4 option specifies that we allow up to 4 jobs to run simultaneously. We are now ready to transcode. Let us start with the first six videos:
$ ben add -c 'ffmpeg -i $job.mp4 $job.ogv' -j v00 v01 v02 v03 v04 v05
The -c option specifies the command to run. Each command is run with the environment variable job set to the job name, so we can use $job to refer to it. The output of each job (the output of ffmpeg) is stored in a file named after the job: $job.log. So, together with the six .ogv outputs, we will get 12 files.
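The $job substitution is ordinary shell expansion: because the command is single-quoted, it is the job's shell, not ours, that expands $job from its environment. A quick simulation with a plain shell invocation (echo standing in for ffmpeg; v03 is just an example job name):

```shell
# The client exports job=<job name> before running the command, so the
# single-quoted $job is expanded by the job's own shell, not by ours.
# (echo stands in for ffmpeg; v03 is just an example job name.)
job=v03 sh -c 'echo ffmpeg -i $job.mp4 $job.ogv'
# ffmpeg -i v03.mp4 v03.ogv
```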
Here, we added all six jobs in a single command, but we could have added them equivalently with multiple ben add commands, for example one at a time, or two by two. Jobs can be added to the queue at any time. As soon as some client is available, it will run jobs from the queue. In this example, the first four will start running immediately. As soon as one of them is finished, the fifth will start.
It is often more convenient to store the job command and the job names in files instead of passing them as arguments to ben add. Many ben options have a capital-letter variant that lets you do exactly that. The above ben add command can thus be replaced by:
$ echo 'ffmpeg -i $job.mp4 $job.ogv' > command.sh
$ echo 'v00 v01 v02 v03 v04 v05' > job-list.txt
$ ben add -C command.sh -J job-list.txt
Of course, we want job-list.txt to contain all 100 videos, not just the first six. For example, we could generate the appropriate file as follows:
$ basename -s .mp4 v*.mp4 > job-list.txt
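To see what basename -s does here, run it on a few explicit names (shown instead of the glob for illustration; the files need not exist, since basename only manipulates the strings):

```shell
# basename -s removes the given suffix from each argument and prints
# one bare job name per line:
basename -s .mp4 v00.mp4 v01.mp4 v02.mp4
# v00
# v01
# v02
```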
Four parallel jobs is still not much. Assume that all the previous commands were run as user user-A on machine-A.host.com. Furthermore, we have ssh access as user user-B to a larger computer, machine-B.host.com. We can start a client there, either remotely:
[user-A@machine-A ~]$ ben client -r user-B@machine-B.host.com -n 48 -d
or by first logging into it, then contacting machine-A from there:
[user-A@machine-A ~]$ ssh user-B@machine-B.host.com
[user-B@machine-B ~]$ ben client -f user-A@machine-A.host.com -n 48 -d
Note that the latter is more robust, because remote-starting a client needs ben to be installed on the remote host in a specific way2. Ben uses ssh under the hood, so a password may be prompted. We can now check that everything worked fine with the command
[user-A@machine-A ~]$ ben nodes
#  node                  R  P
0  machine-A             0  4
1  machine-B             0  48
2  machine-A: ben nodes  -  C
We can see that we have two clients connected, which can run up to 4 and 48 simultaneous jobs, respectively. The last line indicates the connection that ben nodes is currently using to communicate with the server; it is a “control” client that does not execute jobs. If we proceed to add the first 6 jobs, we will see that the clients get busy (the R column gives the number of running jobs).
[user-A@machine-A ~]$ ben nodes
#  node                  R  P
0  machine-A             3  4
1  machine-B             3  48
4  machine-A: ben nodes  -  C
We can also display the queue with the command
[user-A@machine-A ~]$ ben list
# id  dir  job  S  node         duration
0          v00  r  0 machine-A  >00:00:27
2          v02  r  0 machine-A  >00:00:27
4          v04  r  0 machine-A  >00:00:27
1          v01  r  1 machine-B  >00:00:27
3          v03  r  1 machine-B  >00:00:27
5          v05  r  1 machine-B  >00:00:27
After adding jobs, one can remove them with ben rm. If we remove a job that is currently running, it is stopped and its (partial) output files are removed. If it has already completed, the output is preserved.
One can dynamically change the maximum number of simultaneous jobs on a client with ben scale. If the current number of running jobs on that client exceeds the new maximum, some of them will be interrupted (and re-queued3), unless the --retire option is specified. In the latter case, no jobs are interrupted, but the client will not accept new jobs until its load falls below the new maximum. Similarly, ben kill, which disconnects a client, also has a --retire option, letting currently running jobs finish beforehand.
Another useful command is ben exec. It queues one special “sync” job per client. Each of those sync jobs can only run on its specified client, and cannot run simultaneously with any other job. Sync jobs are useful for scheduling code updates or recompilations, or for getting notified when all previously queued jobs are completed. Unlike ben add, ben exec hangs until the sync jobs are run, and it displays their output. However, it can be interrupted (for example with control+c) without affecting the sync jobs: only their output is discarded.
See the manual for more detailed info.
Using TCP/IP sockets directly is possible, but comes with no authentication or encryption. It is then strongly advised to bind to localhost and manually set up ssh port forwardings. This is still not perfect (everyone with access to localhost needs to be trusted), but better than having open public ports. It is the only option with old versions of ssh that do not support Unix-domain sockets.↩
This requires ben to be installed in the default $PATH on machine-B. Because of the way ssh works, entries added to the $PATH by user-profile scripts (e.g. .bash_profile) will not work. Furthermore, when ben client is called with option -r, it is essentially a wrapper for ssh -R remote_socket:local_socket "ben client ...". By default, remote_socket is a Unix-domain socket placed in a directory that ssh will not create on its own. As a result, the first time one uses ben client -r on a machine, multiple round-trips will be necessary to first create the directory, then create the socket.↩
Interrupted jobs are requeued at the beginning of the queue: they will be run before the other jobs currently pending in the queue. Note that the same happens if a client is interrupted unintentionally (with SIGKILL, for example): its running jobs are requeued.↩