Paralleljobs takes a job specification from a database, executes the
corresponding command, collects and stores information from the job’s output,
and repeats until all jobs have been processed. Multiple instances may be run in
parallel against the same database.
Download

You may obtain a copy of the git repository of paralleljobs by cloning it with
git.

Usage
Say you have a fancy new method to test whether a given integer is a prime
number. You have cast your algorithm into an implementation that behaves like
this:
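(The transcript below is only a sketch; the name isprime, its command line, and
the exact output format are assumptions made for illustration.)

    $ ./isprime 982451653
    982451653 is a prime number
    prime: 1
    runtime: 0.62
    $ echo $?
    0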
(The idea is that isprime (or a wrapper script) produces information that will
be collected by paralleljobs.)
We would like to launch thousands of such invocations in parallel on our 12
core machine. Some invocations may take very long, others may be quite fast.
And we would like to collect data from our invocations, not only the exit code,
but also runtime, memory used, and application-specific data. Paralleljobs
allows us to do this in a simple way. We first create a file with a list of
jobs:
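One line per job, for example (the ulimit values and the integers to test are
of course only illustrative):

    ulimit -t 60 -v 1048576; ./isprime 982451653
    ulimit -t 60 -v 1048576; ./isprime 982451654
    ulimit -t 60 -v 1048576; ./isprime 982451655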
We do not know whether our code may use a huge amount of memory or may not
terminate at all. Hence, we limit resources with ulimit. We then tell
paralleljobs to create a sqlite database with a jobs table and we let it import
jobs from ‘jobs.sh’. (Instead of importing the jobs from a file, we could also
fill the ‘jobs’ table in the database by other means.) Furthermore, we would
like paralleljobs to collect data (key-value pairs) from the jobs’ output, so
we specify in SQL which keys it should expect.
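The invocation below is only a sketch: the option names (-d, --create-db,
--import-jobs, --create-properties) are assumptions made for illustration and
may differ from the actual interface of paralleljobs, so please check the
tool's own documentation. The SQL fragment lists the property columns our jobs
are expected to report:

    # option names are illustrative, not necessarily those of paralleljobs
    paralleljobs -d jobs.db --create-db --import-jobs jobs.sh \
        --create-properties 'prime INTEGER, runtime REAL, memory INTEGER'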
And now we can launch, say, 12 parallel processes that do all the work.
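For example, with a plain shell loop (again assuming a hypothetical -d option
that names the database):

    # start 12 workers on the same database and wait for all of them
    for i in $(seq 12); do
        paralleljobs -d jobs.db &
    done
    wait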
After some time you may wonder how many jobs have already been processed.
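One way to check is to query the database directly; the column name done used
below to mark finished jobs is an assumption about the schema of the jobs
table:

    sqlite3 jobs.db 'SELECT count(*) FROM jobs WHERE done = 1;'
    sqlite3 jobs.db 'SELECT count(*) FROM jobs;'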
If you have an estimate of the workload of each job, you can adjust the
per-job estimates in the database (see table ‘jobs’) to get a better picture
of how much of the total workload has been finished.
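For instance, if we expect larger integers to take longer, we could raise the
estimate for those jobs; the column names workload and cmd are assumptions
about the schema:

    sqlite3 jobs.db "UPDATE jobs SET workload = 10 WHERE cmd LIKE '%98245%';"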
When all jobs have been processed, you can extract data from the properties
table and, for example, produce a plot with gnuplot.
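A sketch of such a post-processing step, assuming the properties table has one
row per job with a runtime column (the actual schema may differ):

    # dump the collected runtimes into a plain data file ...
    sqlite3 jobs.db 'SELECT runtime FROM properties;' > runtime.dat
    # ... and plot them with gnuplot
    gnuplot -e "set terminal png; set output 'runtime.png'; plot 'runtime.dat' with points title 'runtime per job'"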