CLI reference

This page is generated during the Sphinx build from each command module's docstring and from the output of dawgtools <subcommand> -h.

extract_batch

Extract features from one or more input files.

Environment

  • OPENAI_API_KEY must be set.

  • OPENAI_BASE_URL can be used to set a custom API base URL.

Set environment variables:

export OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="https://api.openai.com/v1"  # optional

Example

Given a schema file (e.g., developed using toolbuilder) and a directory of text files named input_texts, extract features into features.csv:

dawgtools extract_batch schema.json -d input_texts -o features.csv

Caching

A cache directory (extract_batch_cache by default; set with --cache-dir, or bypassed with -n/--no-cache) stores intermediate results so that files that have already been processed are not re-sent to the model. Changing the schema file invalidates the cache, so new model queries are performed.

Schema format

The schema file should be a JSON file defining a tool compatible with the OpenAI function calling API. See:

https://platform.openai.com/docs/guides/function-calling

Example schema (from the OpenAI documentation):

{
  "type": "function",
  "name": "extract_features",
  "description": "Extract features from text",
  "parameters": {
    "type": "object",
    "properties": {
      "feature1": {
        "type": "string",
        "description": "Description of feature1"
      },
      "feature2": {
        "type": "integer",
        "description": "Description of feature2"
      }
    },
    "required": ["feature1", "feature2"]
  }
}
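Before a long batch run, it can be worth sanity-checking the schema file. The sketch below follows the key names in the example schema above; it is a lightweight illustration, not the official OpenAI schema validator and not part of dawgtools:

```python
import json

# The example schema from above, as it might be loaded from schema.json.
EXAMPLE = {
    "type": "function",
    "name": "extract_features",
    "description": "Extract features from text",
    "parameters": {
        "type": "object",
        "properties": {
            "feature1": {"type": "string", "description": "Description of feature1"},
            "feature2": {"type": "integer", "description": "Description of feature2"},
        },
        "required": ["feature1", "feature2"],
    },
}

def check_schema(schema: dict) -> None:
    """Raise ValueError if the schema lacks the fields used in the example above."""
    for key in ("type", "name", "parameters"):
        if key not in schema:
            raise ValueError(f"schema missing key: {key!r}")
    params = schema["parameters"]
    if params.get("type") != "object" or "properties" not in params:
        raise ValueError("'parameters' must be an object with 'properties'")
    unknown = set(params.get("required", [])) - set(params["properties"])
    if unknown:
        raise ValueError(f"'required' lists unknown properties: {sorted(unknown)}")

# Round-trip through JSON, as the schema would arrive from schema.json.
check_schema(json.loads(json.dumps(EXAMPLE)))
```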

dawgtools extract_batch -h

usage: dawgtools extract_batch [-h] [-i INFILE] [-d DIRNAME] [-p PROMPT]
                               [-o OUTFILE] [-m MODEL] [--cache-dir CACHE_DIR]
                               [-n]
                               schema

Extract features from one or more input files.

Environment
-----------

- ``OPENAI_API_KEY`` must be set.
- ``OPENAI_BASE_URL`` can be used to set a custom API base URL.

Set environment variables::

  export OPENAI_API_KEY="sk-..."
  export OPENAI_BASE_URL="https://api.openai.com/v1"  # optional

Example
-------

Given a schema file (e.g., developed using toolbuilder) and a directory of text
files named ``input_texts``, extract features into ``features.csv``::

  dawgtools extract_batch schema.json -d input_texts -o features.csv

Caching
-------

A cache directory is created to store intermediate results and avoid re-querying
the model for files that have already been processed. New model queries are
performed each time the schema file changes.

Schema format
-------------

The schema file should be a JSON file defining a tool compatible with the OpenAI
function calling API. See:

https://platform.openai.com/docs/guides/function-calling

Example schema (from the OpenAI documentation):

.. code-block:: json

   {
     "type": "function",
     "name": "extract_features",
     "description": "Extract features from text",
     "parameters": {
       "type": "object",
       "properties": {
         "feature1": {
           "type": "string",
           "description": "Description of feature1"
         },
         "feature2": {
           "type": "integer",
           "description": "Description of feature2"
         }
       },
       "required": ["feature1", "feature2"]
     }
   }

positional arguments:
  schema                json file with feature schema

options:
  -h, --help            show this help message and exit
  -i, --infile INFILE   A single input file
  -d, --dirname DIRNAME
                        A directory of input files
  -p, --prompt PROMPT   Optional file with additional prompt content
  -o, --outfile OUTFILE
                        Output file
  -m, --model MODEL     Model name [gpt-5.2]
  --cache-dir CACHE_DIR
                        Directory containing cached results
                        [extract_batch_cache]
  -n, --no-cache

query

Execute an sql query.

Renders a query template string into a parameterized sql query.

Use a combination of python string formatting directives (for variable substitution) and jinja2 expressions (for conditional expressions).
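The two-stage rendering described above can be sketched as follows. This is an illustration of the approach, not the actual dawgtools code; the template, table, and column names are made up:

```python
import jinja2

# A query template mixing jinja2 (conditional clauses) with
# %(name)s placeholders (bound later by the database driver).
TEMPLATE = """\
select mrn, note_date
from notes
{% if since %}where note_date >= %(since)s{% endif %}
"""

def render(template: str, params: dict) -> tuple[str, dict]:
    # Stage 1: jinja2 decides which clauses appear at all.
    sql = jinja2.Template(template).render(**params)
    # Stage 2: %(name)s placeholders are left intact for the database
    # driver to bind safely as query parameters.
    return sql, params

sql, params = render(TEMPLATE, {"since": "2020-01-01"})
```

With "since" set, the rendered query keeps the where clause and its %(since)s placeholder; with "since" unset, the clause is omitted entirely.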

For example:

$ dawgtools -v query -q "select 'foo' as col1, %(barval)s as col2" -p barval=bar
{"col1": "foo", "col2": "bar"}

The command may be preceded by the creation and loading of a temporary table containing mrns that can be referenced in the query. For example:

$ cat mrns.txt
fee
fie
fo
fum
$ dawgtools query --mrns mrns.txt -q 'select * from #mrns'
{"mrn": "fee"}
{"mrn": "fie"}
{"mrn": "fo"}
{"mrn": "fum"}
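Results are emitted as jsonl by default (one JSON object per line, as in the examples above; see the -f option). A minimal Python sketch for consuming that output downstream (the helper name is ours, not part of dawgtools):

```python
import json
from typing import Iterable

def read_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse jsonl query output: one JSON object per line, blank lines skipped."""
    return [json.loads(line) for line in lines if line.strip()]

rows = read_jsonl(['{"mrn": "fee"}', '{"mrn": "fie"}'])
```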

dawgtools query -h

usage: dawgtools query [-h] [-q QUERY] [-i INFILE] [-n {path_reports,notes}]
                       [-p PARAMS] [-P PARAMS_FILE] [--mrns FILE]
                       [--temp-schema FILE] [--temp-data FILE] [-o OUTFILE]
                       [-f {jsonl,json,json-rows,csv}] [-x]

Execute an sql query.

Renders a query template string into a parameterized sql query.

Use a combination of python string formatting directives (for variable
substitution) and jinja2 expressions (for conditional expressions).

For example:

  $ dawgtools -v query -q "select 'foo' as col1, %(barval)s as col2" -p barval=bar
  {"col1": "foo", "col2": "bar"}

The command may be preceded by the creation and loading of a temporary table
containing mrns that can be referenced in the query. For example:

  $ cat mrns.txt
  fee
  fie
  fo
  fum
  $ dawgtools query --mrns mrns.txt -q 'select * from #mrns'
  {"mrn": "fee"}
  {"mrn": "fie"}
  {"mrn": "fo"}
  {"mrn": "fum"}

options:
  -h, --help            show this help message and exit
  -x, --dry-run         Print the rendered query and exit

inputs:
  -q, --query QUERY     sql command
  -i, --infile INFILE   Input file containing an sql command
  -n, --query-name {path_reports,notes}
                        name of an sql query
  -p, --params PARAMS   One or more variable value pairs in the form -p
                        var=val; these are used as parameters when rendering
                        the query.
  -P, --params-file PARAMS_FILE
                        json file containing parameter values

temptable:
  --mrns FILE           A file containing whitespace-delimited mrns to be
                        loaded into a temporary table '#mrns(mrn
                        varchar(102))' before the query.
  --temp-schema FILE    File containing schema for a temporary table to be
                        created before running the query.
  --temp-data FILE      CSV file with columns corresponding to the schema
                        containing data to load into the temporary table
                        before running the query. Requires --temp-schema.
                        Columns not in the schema are ignored.

outputs:
  -o, --outfile OUTFILE
                        Output file name; uses gzip compression if ends with
                        .gz or stdout if not provided.
  -f, --format {jsonl,json,json-rows,csv}
                        Output format [jsonl]