bioscons Documentation

This page contains the documentation for the bioscons package.

Subpackages

The fast Module

Some tips from http://www.scons.org/wiki/GoFastButton

bioscons.fast.fast(env)[source]

Given an SCons.Script.Environment, set some flags for faster builds:

  • Caching of implicit dependencies
  • Use MD5-timestamp decider (if timestamp unchanged, don’t checksum)
  • Clear the default environment. Requires use of env.Command(...), rather than bare Command(...)
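The flags above can be sketched as an SConstruct fragment. This is an illustrative approximation based on the GoFastButton tips, not bioscons' actual implementation, and must be run under scons rather than as a standalone script:

```python
# SConstruct -- a sketch of settings similar to those fast() applies,
# based on the GoFastButton tips; bioscons internals may differ.
from SCons.Script import DefaultEnvironment, Environment, SetOption

env = Environment()
env.Decider('MD5-timestamp')        # skip checksumming when the timestamp is unchanged
SetOption('implicit_cache', True)   # cache scanned implicit dependencies
SetOption('max_drift', 1)           # trust cached content signatures for recent files
DefaultEnvironment(tools=[])        # clear the default environment

# With the default environment cleared, use env.Command(...) rather
# than the bare global Command(...).
```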

The fileutils Module

class bioscons.fileutils.Targets(objs=None)[source]

Bases: object

Provides an object with methods for identifying objects in the local namespace that represent build targets, and for comparing this list to the contents of a directory to identify extraneous files.

Example usage:

from bioscons.fileutils import Targets
targets = Targets()
# build some targets
some_target = env.Command(
    target='outfile.txt', source='infile.txt',
    action='some_action $SOURCE $TARGET')

targets.update(locals().values())
targets.show_extras("outdir")
show_extras(directory, one_line=True)[source]

Given a relative path directory, search recursively for files and print a list of those not found among self.targets. One path is printed per line if one_line is False.

update(objs)[source]

Given a list of objects (eg, the output of locals().values()), update self.targets with the set containing the relative path to each target (ie, those objects with a “NodeInfo” attribute).

bioscons.fileutils.check_digest(fname, dirname=None)[source]

Return True if the stored hash exists and is identical to the signature of the file fname. Hash is saved to a file named fname.md5 in either the same directory or in dirname if provided.

bioscons.fileutils.rename(fname, ext=None, pth=None)[source]

Replace the directory or file extension in fname with pth and ext, respectively. fname may be a string, an object coercible to a string using str(), or a single-element list of either.

Example:

from bioscons.fileutils import rename
stofile = 'align.sto'
fastafile = env.Command(
    target = rename(stofile, ext='.fasta'),
    source = stofile,
    action = 'seqmagick convert $SOURCE $TARGET'
    )
bioscons.fileutils.split_path(fname, split_ext=False)[source]

Returns file name elements given an absolute or relative path fname, which may be a string, an object coercible to a string using str(), or a single-element list of either. If split_ext is True, the name of the file is further split into a base component and the file suffix, ie, (dir, base, suffix), and (dir, filename) otherwise.
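The splitting behavior described above can be sketched in plain Python with os.path. This is an illustrative reimplementation, not bioscons' actual code:

```python
import os

def split_path_sketch(fname, split_ext=False):
    """Illustrative sketch of the split_path behavior described above."""
    # Accept a string, something coercible with str(), or a one-element list.
    if isinstance(fname, list):
        fname = fname[0]
    fname = str(fname)
    dirname, filename = os.path.split(fname)
    if split_ext:
        base, suffix = os.path.splitext(filename)
        return (dirname, base, suffix)
    return (dirname, filename)

print(split_path_sketch('data/align.sto'))                  # ('data', 'align.sto')
print(split_path_sketch('data/align.sto', split_ext=True))  # ('data', 'align', '.sto')
```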

bioscons.fileutils.write_digest(fname, dirname=None)[source]

Save the md5 checksum of fname as fname.md5 in either the same directory as fname or in dirname if provided.
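The checksum-file convention used by write_digest and check_digest can be sketched with hashlib. This is an illustrative reimplementation of the behavior described above, not bioscons' actual code:

```python
import hashlib
import os

def write_digest_sketch(fname, dirname=None):
    """Save the md5 of fname as <fname>.md5, optionally in dirname."""
    with open(fname, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()
    base = os.path.basename(fname) + '.md5'
    target = os.path.join(dirname or os.path.dirname(fname) or '.', base)
    with open(target, 'w') as f:
        f.write(digest)
    return target

def check_digest_sketch(fname, dirname=None):
    """Return True if a stored hash exists and matches the file contents."""
    base = os.path.basename(fname) + '.md5'
    stored = os.path.join(dirname or os.path.dirname(fname) or '.', base)
    if not os.path.exists(stored):
        return False
    with open(fname, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()
    with open(stored) as f:
        return f.read().strip() == digest
```

A typical use is to skip a rebuild when the stored digest still matches, and to rewrite the digest after regenerating the file.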

The slurm Module

Functions for dispatching to SLURM (https://computing.llnl.gov/linux/slurm/) from scons.

class bioscons.slurm.SlurmEnvironment(use_cluster=True, slurm_queue=None, all_precious=False, time=False, **kwargs)[source]

Bases: SCons.Script.SConscript.SConsEnvironment

Mostly a drop-in replacement for an SCons Environment, in which all calls to Command are executed via srun using a single core.

The SRun and SAlloc methods can be used to use multiple cores for multithreaded and MPI jobs, respectively.

Command(target, source, action, use_cluster=True, time=True, **kw)[source]

Dispatches action (and extra arguments) to SRun if use_cluster is True.

Local(target, source, action, **kw)[source]

Run a command locally, without SLURM

SAlloc(target, source, action, ncores, timelimit=None, **kw)[source]

Run action with salloc.

This method should be used for MPI jobs only. Combining an salloc call with mpirun (with no arguments) will use all nodes allocated automatically.

Optional arguments:

  • slurm_args: additional arguments to pass to salloc
  • timelimit: value to use for environment variable SALLOC_TIMELIMIT
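An MPI build step using SAlloc might look like the following SConstruct fragment. The program name and file names are hypothetical, and the fragment must be run under scons on a host with SLURM available:

```python
# SConstruct -- hypothetical MPI job dispatched via SAlloc.
import os
from bioscons.slurm import SlurmEnvironment

env = SlurmEnvironment(ENV=os.environ)

# salloc reserves 16 cores; mpirun with no arguments uses the
# entire allocation automatically.
env.SAlloc(
    target='output/result.txt',
    source='input.txt',
    action='mpirun my_mpi_program $SOURCE $TARGET',
    ncores=16,
    timelimit='1:00:00',
)
```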

SRun(target, source, action, ncores=1, timelimit=None, slurm_queue=None, **kw)[source]

Run action with srun.

This method should be used for multithreaded jobs on a single machine only. By default, calls to SlurmEnvironment.Command use srun. Specify a number of processors with ncores, which provides a value for srun -c/--cpus-per-task.

Optional arguments:

  • slurm_args: additional arguments to pass to srun
  • timelimit: value to use for environment variable SLURM_TIMELIMIT

SetCpusPerTask(cpus_per_task)[source]

Set the number of CPUs used by tasks launched from this environment with srun. Equivalent to srun -c/--cpus-per-task.

SetPartition(partition)[source]

Set the partition to be used. Subsequent calls to SRun and SAlloc will use this partition.

SetTimeLimit(timelimit)[source]

Set a limit on the total run time for jobs launched by this environment.

Formats:

  • minutes
  • minutes:seconds
  • hours:minutes:seconds
  • days-hours
  • days-hours:minutes
  • days-hours:minutes:seconds
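The accepted time-limit formats can be made concrete with a small parser that normalizes each form to total minutes. This helper is illustrative only and is not part of bioscons:

```python
def timelimit_to_minutes(s):
    """Convert a SLURM-style time limit string to total minutes."""
    if '-' in s:                        # days-hours[:minutes[:seconds]]
        days, rest = s.split('-', 1)
        parts = [int(x) for x in rest.split(':')]
        h, m, sec = (parts + [0, 0, 0])[:3]
        return int(days) * 1440 + h * 60 + m + sec / 60
    parts = [int(x) for x in s.split(':')]
    if len(parts) == 1:                 # minutes
        return parts[0]
    if len(parts) == 2:                 # minutes:seconds
        return parts[0] + parts[1] / 60
    h, m, sec = parts                   # hours:minutes:seconds
    return h * 60 + m + sec / 60

print(timelimit_to_minutes('90'))       # 90
print(timelimit_to_minutes('1:30:00'))  # 90.0
print(timelimit_to_minutes('1-12'))     # 2160.0
```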
bioscons.slurm.check_srun()[source]

Return the absolute path to the srun executable.

Notes on parallelization

The combination of scons and slurm provides a powerful mechanism for rate-limiting parallel tasks to maximize resource utilization while preventing oversubscription. Here’s an example. Consider the following SConstruct:

import os
from bioscons.slurm import SlurmEnvironment

vars = Variables()
vars.Add('ncores', 'number of cores per task', default=1)

env = SlurmEnvironment(
    ENV=os.environ,
    variables=vars,
    use_cluster=True,
    SHELL='bash',
    out='output'
)

print(env.subst('--> ncores=$ncores'))

for i in range(10):
    AlwaysBuild(env.Command(
        target='$out/{}.txt'.format(i),
        source=None,
        action=('date > $TARGET; sleep 2'),
        ncores=env['ncores'],
    ))

Running this as 10 jobs in parallel with a single core allocated to each task results in the following:

% scons -j10
scons: Reading SConscript files ...
--> ncores=1
scons: done reading SConscript files.
scons: Building targets ...
srun -J "date" bash -c 'date > output/0.txt; sleep 2'
srun -J "date" bash -c 'date > output/1.txt; sleep 2'
srun -J "date" bash -c 'date > output/2.txt; sleep 2'
srun -J "date" bash -c 'date > output/3.txt; sleep 2'
srun -J "date" bash -c 'date > output/4.txt; sleep 2'
srun -J "date" bash -c 'date > output/5.txt; sleep 2'
srun -J "date" bash -c 'date > output/6.txt; sleep 2'
srun -J "date" bash -c 'date > output/7.txt; sleep 2'
srun -J "date" bash -c 'date > output/8.txt; sleep 2'
srun -J "date" bash -c 'date > output/9.txt; sleep 2'
scons: done building targets.
% cat output/*.txt | sort | uniq -c
10 Wed Nov 29 22:13:50 PST 2017

The final line demonstrates that all tasks were dispatched simultaneously. Now let's imagine that the action is actually a multi-threaded process requiring 20 cores, and that dispatching 10 simultaneous jobs would exceed the number of CPUs or amount of memory per CPU available. By increasing the number of cores per task, we can force slurm to rate-limit the jobs:

% scons ncores=20 -j10
scons: Reading SConscript files ...
--> ncores=20
scons: done reading SConscript files.
scons: Building targets ...
srun -J "date" bash -c 'date > output/0.txt; sleep 2'
srun -J "date" bash -c 'date > output/1.txt; sleep 2'
srun -J "date" bash -c 'date > output/2.txt; sleep 2'
srun -J "date" bash -c 'date > output/3.txt; sleep 2'
srun -J "date" bash -c 'date > output/4.txt; sleep 2'
srun -J "date" bash -c 'date > output/5.txt; sleep 2'
srun -J "date" bash -c 'date > output/6.txt; sleep 2'
srun: job 26896 queued and waiting for resources
srun -J "date" bash -c 'date > output/7.txt; sleep 2'
srun: job 26897 queued and waiting for resources
srun: job 26898 queued and waiting for resources
srun -J "date" bash -c 'date > output/8.txt; sleep 2'
srun -J "date" bash -c 'date > output/9.txt; sleep 2'
srun: job 26899 queued and waiting for resources
srun: job 26900 queued and waiting for resources
srun: job 26901 queued and waiting for resources
srun: job 26896 has been allocated resources
srun: job 26897 has been allocated resources
srun: job 26898 has been allocated resources
srun: job 26899 has been allocated resources
srun: job 26900 has been allocated resources
srun: job 26901 has been allocated resources
scons: done building targets.
% cat output/*.txt | sort | uniq -c
      4 Wed Nov 29 22:24:44 PST 2017
      4 Wed Nov 29 22:24:46 PST 2017
      2 Wed Nov 29 22:24:48 PST 2017

Rate-limiting job dispatch is of course the whole purpose of slurm; scons brings the additional benefit of parallelization. This pattern provides a mechanism for specifying an arbitrarily large value for -j to maximize the number of tasks run in parallel without exceeding system resources.

The utils Module

bioscons.utils.getvars(config, secnames, indir=None, outdir=None, fmt_in='%(sec)s-infiles', fmt_out='%(sec)s-outfiles', fmt_params='%(sec)s-params')[source]

Return a tuple of configuration variables that may be passed to vars.AddVariables().

  • config - an instance of ConfigParser.SafeConfigParser()
  • secnames - a list of section names from which to import variables
  • indir - path to a directory to prepend to filenames if not found in cwd
  • outdir - path to a directory to prepend to output files.
  • fmt_* - format strings defining the names of the sections containing infiles, outfiles, and params for each name in secnames

Example:

vars = Variables()
varlist = utils.getvars(config, ['ncbi','placefiles'],
                        indir=output, outdir=output)
vars.AddVariables(*varlist)
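The section naming convention implied by the default fmt_* arguments can be illustrated with a hypothetical config file: for the section name 'ncbi', getvars would look for sections named 'ncbi-infiles', 'ncbi-outfiles', and 'ncbi-params'. The section contents below are invented for illustration, shown here with the Python 3 configparser:

```python
from configparser import ConfigParser

# Hypothetical config illustrating the default fmt_* naming convention.
config = ConfigParser()
config.read_string("""
[ncbi-infiles]
taxonomy = taxonomy.db

[ncbi-outfiles]
named_seqs = named.fasta

[ncbi-params]
min_length = 1200
""")

for sec in ['ncbi-infiles', 'ncbi-outfiles', 'ncbi-params']:
    print(sec, dict(config[sec]))
```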
class bioscons.utils.verbose(f)[source]

Bases: object

Decorator class to provide more verbose progress messages.
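The decorator-class pattern used here can be sketched as follows. This is a generic illustration of such a decorator, not the actual bioscons.utils.verbose implementation, and the messages it prints are invented:

```python
import functools

class verbose_sketch:
    """Illustrative decorator class printing progress around a call;
    the real bioscons.utils.verbose may differ in detail."""

    def __init__(self, f):
        self.f = f
        functools.update_wrapper(self, f)

    def __call__(self, *args, **kwargs):
        print('starting %s' % self.f.__name__)
        result = self.f(*args, **kwargs)
        print('finished %s' % self.f.__name__)
        return result

@verbose_sketch
def align(seqs):
    return sorted(seqs)

align(['b', 'a'])   # prints "starting align" then "finished align"
```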