bioscons Documentation

This page contains the bioscons package documentation.

Subpackages
The fast Module

Some tips from http://www.scons.org/wiki/GoFastButton
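The settings recommended on that page can be applied directly in an SConstruct. The sketch below shows plain SCons calls rather than the bioscons.fast API (which is not documented here), so treat it as an illustration of what "going fast" typically means, not as this module's interface:

# "Go fast" settings from the page linked above, applied directly in an
# SConstruct; this is an illustration, not the bioscons.fast interface.
env = Environment()
env.Decider('MD5-timestamp')    # re-hash content only when the timestamp has changed
SetOption('max_drift', 1)       # trust file timestamps after a very short drift window
SetOption('implicit_cache', 1)  # cache implicit dependency scans between runs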
The fileutils Module
class bioscons.fileutils.Targets(objs=None)
Bases: object

Provides an object with methods for identifying objects in the local namespace that represent build targets, and for comparing this list to the contents of a directory to identify extraneous files.
Example usage:
from bioscons.fileutils import Targets

targets = Targets()

# build some targets
some_target = env.Command(
    target='outfile.txt',
    source='infile.txt',
    action='some_action $SOURCE $TARGET')

targets.update(locals().values())
targets.show_extras("outdir")
bioscons.fileutils.check_digest(fname, dirname=None)

Return True if the stored hash exists and is identical to the signature of the file fname. The hash is saved to a file named fname.md5, either in the same directory or in dirname if provided.
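A minimal usage sketch based on the signature above; the file and directory names are illustrative:

from bioscons.fileutils import check_digest

# 'results.csv' and 'hashes' are hypothetical; the results.csv.md5 signature
# is stored under 'hashes' here because dirname is given.
if not check_digest('results.csv', dirname='hashes'):
    print('results.csv is new or has changed since its hash was stored')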
bioscons.fileutils.rename(fname, ext=None, pth=None)

Replace the directory or file extension in fname with pth and ext, respectively. fname may be a string, an object coercible to a string using str(), or a single-element list of either.

Example:
from bioscons import rename

stofile = 'align.sto'
fastafile = env.Command(
    target=rename(stofile, ext='.fasta'),
    source=stofile,
    action='seqmagick convert $SOURCE $TARGET'
)
bioscons.fileutils.split_path(fname, split_ext=False)

Returns file name elements given an absolute or relative path fname, which may be a string, an object coercible to a string using str(), or a single-element list of either. If split_ext is True, the name of the file is further split into a base component and the file suffix, i.e. (dir, base, suffix); otherwise (dir, filename) is returned.
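A quick illustration of the return shapes described above; the path is made up, and whether the suffix includes the leading dot is not specified here:

from bioscons.fileutils import split_path

# Hypothetical path; the tuple shapes follow the description above.
d, filename = split_path('/data/run1/align.sto')                      # (dir, filename)
d, base, suffix = split_path('/data/run1/align.sto', split_ext=True)  # (dir, base, suffix)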
The slurm Module
Functions for dispatching to SLURM (https://computing.llnl.gov/linux/slurm/) from scons.
class bioscons.slurm.SlurmEnvironment(use_cluster=True, slurm_queue=None, all_precious=False, time=False, **kwargs)
Bases: SCons.Script.SConscript.SConsEnvironment

Mostly a drop-in replacement for an SCons Environment, in which all calls to Command are executed via srun using a single core. The SRun and SAlloc methods can be used to allocate multiple cores for multithreaded and MPI jobs, respectively.
Command(target, source, action, use_cluster=True, time=True, **kw)

Dispatches action (and extra arguments) to SRun if use_cluster is True.
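For example, a single step can be forced to run locally by overriding the default. This is a sketch; the program and file names are hypothetical, and the environment created here is reused in the sketches below:

import os
from bioscons.slurm import SlurmEnvironment

env = SlurmEnvironment(ENV=os.environ)

# 'summarize' and the file names are illustrative.
env.Command(
    target='output/summary.txt',
    source='output/aligned.fasta',
    action='summarize $SOURCE > $TARGET',
    use_cluster=False,  # run this step locally rather than via srun
)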
SAlloc(target, source, action, ncores, timelimit=None, **kw)

Run action with salloc.

This method should be used for MPI jobs only. Combining an salloc call with mpirun (with no arguments) will automatically use all allocated nodes.

Optional arguments:

- slurm_args: additional arguments to pass to salloc
- timelimit: value to use for the environment variable SALLOC_TIMELIMIT
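A sketch of an MPI step using the env from above; the MPI program, core count, and time limit are assumptions:

# 'my_mpi_tool' is hypothetical; mpirun with no arguments uses all of the
# nodes allocated by salloc, as noted above.
env.SAlloc(
    target='output/result.txt',
    source='input/data.txt',
    action='mpirun my_mpi_tool $SOURCE $TARGET',
    ncores=64,
    timelimit='2:00:00',
)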
SRun(target, source, action, ncores=1, timelimit=None, slurm_queue=None, **kw)

Run action with srun.

This method should be used for multithreaded jobs on a single machine only. By default, calls to SlurmEnvironment.Command use srun. Specify a number of processors with ncores, which provides a value for srun -c/--cpus-per-task.

Optional arguments:

- slurm_args: additional arguments to pass to srun
- timelimit: value to use for the environment variable SLURM_TIMELIMIT
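A sketch of a multithreaded step, again using the env from above; the aligner, its thread flag, and the queue name are assumptions:

# 'some_aligner' and its '--threads' flag are illustrative.
env.SRun(
    target='output/aligned.fasta',
    source='input/seqs.fasta',
    action='some_aligner --threads 8 $SOURCE $TARGET',
    ncores=8,               # passed to srun as -c/--cpus-per-task
    slurm_queue='campus',   # hypothetical partition name
)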
SetCpusPerTask(cpus_per_task)

Set the number of CPUs used by tasks launched from this environment with srun. Equivalent to srun -c.
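For instance (the value is illustrative):

# Subsequent srun-dispatched commands from this environment request 4 CPUs.
env.SetCpusPerTask(4)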
Notes on parallelization

The combination of scons and slurm provides a powerful mechanism for rate-limiting parallel tasks to maximize resource utilization while preventing oversubscription. Here's an example. Consider the following SConstruct:
import os
from bioscons.slurm import SlurmEnvironment
vars = Variables()
vars.Add('ncores', 'number of cores per task', default=1)
env = SlurmEnvironment(
ENV=os.environ,
variables=vars,
use_cluster=True,
SHELL='bash',
out='output'
)
print(env.subst('--> ncores=$ncores'))
for i in range(10):
AlwaysBuild(env.Command(
target='$out/{}.txt'.format(i),
source=None,
action=('date > $TARGET; sleep 2'),
ncores=env['ncores'],
))
Running this as 10 jobs in parallel with a single core allocated to each task results in the following:
% scons -j10
scons: Reading SConscript files ...
--> ncores=1
scons: done reading SConscript files.
scons: Building targets ...
srun -J "date" bash -c 'date > output/0.txt; sleep 2'
srun -J "date" bash -c 'date > output/1.txt; sleep 2'
srun -J "date" bash -c 'date > output/2.txt; sleep 2'
srun -J "date" bash -c 'date > output/3.txt; sleep 2'
srun -J "date" bash -c 'date > output/4.txt; sleep 2'
srun -J "date" bash -c 'date > output/5.txt; sleep 2'
srun -J "date" bash -c 'date > output/6.txt; sleep 2'
srun -J "date" bash -c 'date > output/7.txt; sleep 2'
srun -J "date" bash -c 'date > output/8.txt; sleep 2'
srun -J "date" bash -c 'date > output/9.txt; sleep 2'
scons: done building targets.
% cat output/*.txt | sort | uniq -c
10 Wed Nov 29 22:13:50 PST 2017
The final line demonstrates that all tasks are dispatched simultaneously. Now let's imagine that the action is actually a multi-threaded process requiring 20 cores, and that dispatching 10 simultaneous jobs would exceed the number of CPUs or the amount of memory per CPU available. By increasing the number of cores per task, we can force slurm to rate-limit the jobs:
% scons ncores=20 -j10
scons: Reading SConscript files ...
--> ncores=20
scons: done reading SConscript files.
scons: Building targets ...
srun -J "date" bash -c 'date > output/0.txt; sleep 2'
srun -J "date" bash -c 'date > output/1.txt; sleep 2'
srun -J "date" bash -c 'date > output/2.txt; sleep 2'
srun -J "date" bash -c 'date > output/3.txt; sleep 2'
srun -J "date" bash -c 'date > output/4.txt; sleep 2'
srun -J "date" bash -c 'date > output/5.txt; sleep 2'
srun -J "date" bash -c 'date > output/6.txt; sleep 2'
srun: job 26896 queued and waiting for resources
srun -J "date" bash -c 'date > output/7.txt; sleep 2'
srun: job 26897 queued and waiting for resources
srun: job 26898 queued and waiting for resources
srun -J "date" bash -c 'date > output/8.txt; sleep 2'
srun -J "date" bash -c 'date > output/9.txt; sleep 2'
srun: job 26899 queued and waiting for resources
srun: job 26900 queued and waiting for resources
srun: job 26901 queued and waiting for resources
srun: job 26896 has been allocated resources
srun: job 26897 has been allocated resources
srun: job 26898 has been allocated resources
srun: job 26899 has been allocated resources
srun: job 26900 has been allocated resources
srun: job 26901 has been allocated resources
scons: done building targets.
% cat output/*.txt | sort | uniq -c
4 Wed Nov 29 22:24:44 PST 2017
4 Wed Nov 29 22:24:46 PST 2017
2 Wed Nov 29 22:24:48 PST 2017
Rate-limiting job dispatch is of course the whole purpose of slurm; scons brings the additional benefit of parallelization. This pattern provides a mechanism for specifying an arbitrarily large value for -j to maximize the number of tasks run in parallel without exceeding system resources.
The utils Module
bioscons.utils.getvars(config, secnames, indir=None, outdir=None, fmt_in='%(sec)s-infiles', fmt_out='%(sec)s-outfiles', fmt_params='%(sec)s-params')

Return a tuple of configuration variables that may be passed to vars.AddVariables().
- config - an instance of ConfigParser.SafeConfigParser()
- secnames - a list of section names from which to import variables
- indir - path to a directory to prepend to filenames if not found in cwd
- outdir - path to a directory to prepend to output files.
- fmt_* - formatting strings defining the names of the sections containing infiles, outfiles, and params for each section name in secnames
Example:
vars = Variables()
varlist = utils.getvars(config, ['ncbi', 'placefiles'], indir=output, outdir=output)
vars.AddVariables(*varlist)
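The section names passed to getvars are expanded through the fmt_* patterns, so for the default patterns the config file is expected to contain sections such as ncbi-infiles, ncbi-outfiles, and ncbi-params. The sketch below only illustrates that naming convention; the option names and values are made up:

# Hypothetical config for the 'ncbi' section name under the default fmt_*
# patterns; Python 2 imports, matching the ConfigParser.SafeConfigParser
# reference above. Option names and values are illustrative.
import ConfigParser
from StringIO import StringIO

conf_text = """
[ncbi-infiles]
taxonomy = taxonomy.db

[ncbi-outfiles]
seqs = ncbi_seqs.fasta

[ncbi-params]
retries = 3
"""

config = ConfigParser.SafeConfigParser()
config.readfp(StringIO(conf_text))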