PipelineJobs Agave Proxy

This Reactor provides a generalized proxy for running Agave API jobs such that their inputs, parameterization, and outputs are connected to (and thus discoverable from within) the Data Catalog.

Register Agave App as a Pipeline

Before an Agave App can be run by this proxy, three things must happen:

It must be architected to fit the PipelineJobs workflow
It must be public or shared with user sd2eadm
It must be registered as a Data Catalog Pipeline

App Architecture

The app must generate filenames that are distinguishable between runs. This is enforced to prevent accidental overwriting of files when multiple jobs share an archiving destination. Furthermore, the app definition and any interior runtime logic must use fully qualified Agave files URLs to define inputs. Finally, the app’s id must be unique not only in the Agave Apps Catalog (this is automatically enforced) but also in the Data Catalog Pipelines collection.

Registering a Pipeline

Coming soon…

Launching a Managed Agave Job

Construct and send a message including the following components to the PipelineJobs Agave Proxy Reactor.

An Agave job definition
A metadata linkage parameter
Optional control parameters

Note

The agave_pipelinejob format is documented in JSON Schemas.

Agave Job Definition

The Agave job definition must be included as a subdocument in the message. To illustrate, start with a basic Agave job definition. Here is an example for an imaginary Agave app, tacobot9000-0.1.0u1.

{
  "appId": "tacobot9000-0.1.0u1",
  "name": "TACObot job",
  "inputs": {"file1": "agave://data.tacc.cloud/examples/tacobot/test1.txt"},
  "parameters": {"salsa": true, "avocado": false, "cheese": true},
  "maxRunTime": "01:00:00"
}

To launch this job directly via Agave, this document would be sent to the /jobs endpoint. To send it to the proxy instead, nest it under the key job_definition in a JSON document.

{
    "job_definition": {
            "appId": "tacobot9000-0.1.0u1",
            "name": "TACObot job",
            "inputs": {
                    "file1": "agave://data.tacc.cloud/examples/tacobot/test1.txt"
            },
            "parameters": {
                    "salsa": true,
                    "avocado": false,
                    "cheese": true
            },
            "maxRunTime": "01:00:00"
    }
}
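
As a minimal sketch, the wrapping step can be done in Python using only the standard library. The function name wrap_job_definition is hypothetical and is not part of any PipelineJobs client; it only illustrates the nesting described above.

```python
import json


def wrap_job_definition(job_definition):
    """Nest an Agave job definition under the job_definition key.

    Hypothetical helper for illustration; not part of the PipelineJobs API.
    """
    # Copy the top level so the caller's dict is not shared with the message.
    return {"job_definition": dict(job_definition)}


agave_job = {
    "appId": "tacobot9000-0.1.0u1",
    "name": "TACObot job",
    "inputs": {"file1": "agave://data.tacc.cloud/examples/tacobot/test1.txt"},
    "parameters": {"salsa": True, "avocado": False, "cheese": True},
    "maxRunTime": "01:00:00",
}

message = wrap_job_definition(agave_job)
print(json.dumps(message, indent=4))
```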

Metadata Linkage Parameter

An explicit linkage to objects in the Data Catalog must be established. This is done via the parameters key, which must contain a valid value for one of the following:

experiment_id
sample_id
measurement_id

Either a single value or an array of values may be passed, and each value may be given as either the readable text value or the corresponding UUID.

Which Parameter to Pass

A PipelineJob is always linked to a set of measurements by way of the linkage parameter. The job’s archive path is also determined by the linkage parameter. To illustrate:

If a job’s measurement_id=['measurement.tacc.1234', 'measurement.tacc.2345'], it will be linked to these two measurements and its archive path will end with a hash of the two measurement_id values.

Assuming those measurements are children of sample.tacc.abcde and the only linkage parameter sent was sample_id='sample.tacc.abcde', the job will still be linked to all the child measurements of that sample. Its archive path will end with a hash of sample.tacc.abcde. However, if both measurement_id and sample_id are passed, the linkages are made to the specified measurement(s) while the archive path is a function of the sample_id value(s).

For experiment_id, the specific samples are linked to the job and the archive path is a function of experiment_id value(s).

This design allows files generated by the job to be linked to only one level of the metadata hierarchy, while allowing collection of outputs at higher levels of organization in the file system.
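
The rules above can be sketched as a small function that reports which linkage parameter the archive path depends on. This is an illustration under stated assumptions, not the system's implementation: the helper name archive_path_source is hypothetical, and treating the highest level present as the winner (experiment over sample over measurement) is a generalization of the cases described in the text. The hash function itself is not documented here, so the sketch returns the parameter name and values rather than computing a path.

```python
# Linkage parameters ordered from highest to lowest level of the
# metadata hierarchy. Using this order as a precedence rule is an
# assumption generalized from the examples in the text.
LINKAGE_LEVELS = ("experiment_id", "sample_id", "measurement_id")


def archive_path_source(parameters):
    """Return (name, values) for the parameter the archive path depends on.

    Hypothetical helper: the real system hashes these values into the
    archive path, but that hash function is not reproduced here.
    """
    for name in LINKAGE_LEVELS:
        if name in parameters:
            value = parameters[name]
            # Accept either a single value or a list of values.
            values = list(value) if isinstance(value, (list, tuple)) else [value]
            return name, values
    raise ValueError("one of %s is required" % ", ".join(LINKAGE_LEVELS))


both = {
    "measurement_id": ["measurement.tacc.1234", "measurement.tacc.2345"],
    "sample_id": "sample.tacc.abcde",
}
print(archive_path_source(both))
```

When both measurement_id and sample_id are present, the sketch selects sample_id, mirroring the case where linkages go to the measurements but the archive path follows the sample.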

Here is the example job request as it stands so far, now including a linkage parameter:

{
    "parameters": {
            "sample_id": "sample.tacc.abcde"
    },
    "job_definition": {
            "appId": "tacobot9000-0.1.0u1",
            "name": "TACObot job",
            "inputs": {
                    "file1": "agave://data.tacc.cloud/examples/tacobot/test1.txt"
            },
            "parameters": {
                    "salsa": true,
                    "avocado": false,
                    "cheese": true
            },
            "maxRunTime": "01:00:00"
    }
}

Additional Control Parameters

Job behavior can be refined with additional control parameters.

instanced

Each PipelineJob has a distinct archive path derived from its Pipeline UUID, the data dictionary passed at job init() and/or setup(), and a function of its linkage parameters to experiments, samples, or measurements. To avoid inadvertent over-writes, the archive path is extended with an instancing directory named in the form adjective-animal-YYYYMMDDTHHmmssZ. To avoid use of the instancing directory, include instanced: false in the job request message.

Example: "instanced": false
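
To make the naming form concrete, here is a rough sketch that builds a directory name shaped like adjective-animal-YYYYMMDDTHHmmssZ. The word lists and the function name instance_directory are invented for illustration; the vocabulary and generator used by the actual system are not documented here.

```python
import random
import re
from datetime import datetime, timezone

# Tiny illustrative word lists; the real system's vocabulary is an unknown.
ADJECTIVES = ["eligible", "casual", "mellow"]
ANIMALS = ["awk", "bass", "heron"]


def instance_directory(now=None):
    """Build a name shaped like adjective-animal-YYYYMMDDTHHmmssZ."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return "-".join([random.choice(ADJECTIVES), random.choice(ANIMALS), stamp])


name = instance_directory()
# The generated name should match the documented shape.
assert re.fullmatch(r"[a-z]+-[a-z]+-\d{8}T\d{6}Z", name)
print(name)
```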

index_patterns

The default behavior of the PipelineJobs System is to index every file found under a job’s archive path to be linked to that specific job. To subselect only specific files, it is possible to include one or more Python regular expressions in index_patterns. Only files matching these patterns will be linked to the job.

Example: "index_patterns": []
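
The effect of index_patterns can be sketched with Python's re module. This is an approximation, not the indexer's code: the helper name select_indexed_files is hypothetical, matching with re.search (a match anywhere in the name) is an assumption, and so is treating an empty list as "no filtering".

```python
import re


def select_indexed_files(filenames, index_patterns):
    """Return the filenames that would be linked to the job.

    An empty pattern list is treated as "no filtering". Using re.search
    rather than re.match is an assumption of this sketch.
    """
    if not index_patterns:
        return list(filenames)
    compiled = [re.compile(p) for p in index_patterns]
    return [f for f in filenames if any(c.search(f) for c in compiled)]


# Hypothetical output files under a job's archive path.
outputs = ["counts.csv", "plots/summary.png", "logs/run.err"]
print(select_indexed_files(outputs, [r"\.csv$", r"\.png$"]))
```

With the two patterns shown, only the CSV and PNG files are selected; the log file is left out.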

processing_level

The default behavior of the PipelineJobs System is to index files under a job’s archive path as processing level “1”. To change this, an alternative processing_level may be passed in the job request message.

Example: "processing_level": "2"

Note

Only one automatic indexing configuration can be active for a given job. Additional indexing actions with other configurations may be initiated by sending a message directly to PipelineJobs Indexer.

Job Life Cycle

Here is a complete record from the Pipelines system showing how the information from job creation and subsequent events is stored and made discoverable. A few key highlights:

The top-level data field holds the original parameterization of the job
Three events are noted in the history: create, run, finish
The actor and execution for the managing instance of PipelineJobs Agave Proxy are available under agent and task, respectively

{
    "agent": "https://api.tacc.cloud/actors/v2/G46vjoAVzGkkz",
    "archive_path": "/products/v2/103f877a7ab857d182807b75af4eab6e/106bd127e2d257acb9be11ed06042e68/eligible-awk-20181127T173243Z",
    "archive_system": "data-sd2e-community",
    "data": {
        "appId": "urrutia-novel_chassis_app-0.1.0",
        "archivePath": "",
        "inputs": {
            "file1": "agave://data.tacc.cloud/examples/tacobot/test1.txt"
        },
        "maxRunTime": "01:00:00",
        "name": "TACObot job",
        "parameters": {
            "avocado": false,
            "cheese": true,
            "salsa": true
        }
    },
    "derived_from": [
        "1022efa3-4480-538f-a581-f1810fb4e0c3"
    ],
    "generated_by": [
        "106bd127-e2d2-57ac-b9be-11ed06042e68"
    ],
    "history": [
        {
            "data": {
                "appId": "tacobot9000-0.1.0u1",
                "inputs": {
                    "file1": "agave://data.tacc.cloud/examples/tacobot/test1.txt"
                },
                "maxRunTime": "01:00:00",
                "name": "TACObot job",
                "parameters": {
                    "avocado": false,
                    "cheese": true,
                    "salsa": true
                }
            },
            "date": "2018-12-08T00:08:32.000+0000",
            "name": "create"
        },
        {
            "data": {
                "appId": "tacobot9000-0.1.0u1",
                "archive": true,
                "archivePath": "/products/v2/103f877a7ab857d182807b75af4eab6e/106bd127e2d257acb9be11ed06042e68/eligible-awk-20181127T173243Z",
                "archiveSystem": "data-tacc-cloud",
                "batchQueue": "normal",
                "created": "2018-12-07T18:08:37.000-06:00",
                "endTime": null,
                "executionSystem": "hpc-tacc-stampede2",
                "id": "7381691026605150696-242ac11b-0001-007",
                "inputs": {
                    "file1": "agave://data.tacc.cloud/examples/tacobot/test1.txt"
                },
                "lastUpdated": "2018-12-07T18:09:40.000-06:00",
                "maxRunTime": "01:00:00",
                "memoryPerNode": 1,
                "name": "TACObot job",
                "nodeCount": 1,
                "outputPath": "tacobot/job-7381691026605150696-242ac11b-0001-007-TACObot-job",
                "owner": "tacobot",
                "parameters": {
                    "avocado": false,
                    "cheese": true,
                    "salsa": true
                },
                "processorsPerNode": 1,
                "startTime": null,
                "status": "RUNNING",
                "submitTime": "2018-12-07T18:09:40.000-06:00"
            },
            "date": "2018-12-08T00:10:12.000+0000",
            "name": "run"
        },
        {
            "data": {
                "appId": "tacobot9000-0.1.0u1",
                "archive": true,
                "archivePath": "/products/v2/103f877a7ab857d182807b75af4eab6e/106bd127e2d257acb9be11ed06042e68/eligible-awk-20181127T173243Z",
                "archiveSystem": "data-tacc-cloud",
                "batchQueue": "normal",
                "created": "2018-12-07T18:08:37.000-06:00",
                "endTime": null,
                "executionSystem": "hpc-tacc-stampede2",
                "id": "7381691026605150696-242ac11b-0001-007",
                "inputs": {
                    "file1": "agave://data.tacc.cloud/examples/tacobot/test1.txt"
                },
                "lastUpdated": "2018-12-07T18:53:20.000-06:00",
                "maxRunTime": "01:00:00",
                "memoryPerNode": 1,
                "name": "TACObot job",
                "nodeCount": 1,
                "outputPath": "tacobot/job-7381691026605150696-242ac11b-0001-007-TACObot-job",
                "owner": "tacobot",
                "parameters": {
                    "avocado": false,
                    "cheese": true,
                    "salsa": true
                },
                "processorsPerNode": 1,
                "startTime": "2018-12-07T18:09:49.000-06:00",
                "status": "FINISHED",
                "submitTime": "2018-12-07T18:09:40.000-06:00"
            },
            "date": "2018-12-08T00:53:45.000+0000",
            "name": "finish"
        }
    ],
    "last_event": "finish",
    "pipeline_uuid": "106bd127-e2d2-57ac-b9be-11ed06042e68",
    "session": "casual-bass",
    "state": "FINISHED",
    "task": "https://api.tacc.cloud/actors/v2/G46vjoAVzGkkz/executions/Myp6wvklV0zgQ",
    "updated": "2018-12-08T00:53:45.000+0000",
    "uuid": "10743f9e-f5ae-5b4c-859e-6774ef4ab08b"
}
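
Once such a record has been loaded into a Python dict, its history can be summarized in a few lines. The sketch below assumes only the field names visible in the record above; the helper summarize_history is hypothetical, and the sample record is an abbreviated copy that keeps just the fields the helper reads.

```python
def summarize_history(record):
    """Return (event name, date, Agave status) for each history entry.

    Field names follow the record shown above; the helper itself is
    hypothetical.
    """
    summary = []
    for event in record.get("history", []):
        # The "create" event carries no Agave status, so this may be None.
        status = event.get("data", {}).get("status")
        summary.append((event["name"], event["date"], status))
    return summary


# Abbreviated copy of the record above, keeping only the fields used here.
record = {
    "state": "FINISHED",
    "history": [
        {"name": "create", "date": "2018-12-08T00:08:32.000+0000", "data": {}},
        {"name": "run", "date": "2018-12-08T00:10:12.000+0000",
         "data": {"status": "RUNNING"}},
        {"name": "finish", "date": "2018-12-08T00:53:45.000+0000",
         "data": {"status": "FINISHED"}},
    ],
}

for name, date, status in summarize_history(record):
    print(name, date, status)
```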

JSON Schemas

agave_pipelinejob

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "$id": "https://schema.catalog.sd2e.org/schemas/agave_pipelinejob.json",
    "title": "AgavePipelineJob",
    "description": "Launch an Agave job as a PipelineJob",
    "type": "object",
    "definitions": {
        "datacatalog_field": {
            "type": "object",
            "anyOf": [
                {
                    "required": [
                        "sample_id"
                    ]
                },
                {
                    "required": [
                        "experiment_design_id"
                    ]
                },
                {
                    "required": [
                        "experiment_id"
                    ]
                },
                {
                    "required": [
                        "measurement_id"
                    ]
                }
            ],
            "properties": {
                "sample_id": {
                    "type": "string",
                    "pattern": "^sample.(uw_biofab|transcriptic|ginkgo|emerald)."
                },
                "experiment_design_id": {
                    "type": "string",
                    "$ref": "experiment_reference.json"
                },
                "experiment_id": {
                    "type": "string",
                    "pattern": "^experiment.(uw_biofab|transcriptic|ginkgo|emerald)."
                },
                "measurement_id": {
                    "type": "string",
                    "pattern": "^measurement.(uw_biofab|transcriptic|ginkgo|emerald)."
                }
            }
        }
    },
    "properties": {
        "parameters": {
            "$ref": "#/definitions/datacatalog_field"
        },
        "job_definition": {
            "description": "An Agave API job definition",
            "type": "object"
        },
        "archive_path": {
            "description": "Optional Agave URN defining the job's archive path",
            "$ref": "agave_files_uri.json"
        },
        "instanced": {
            "description": "Whether the generated archive path should be instanced with a randomized session",
            "type": "boolean",
            "value": true
        },
        "data": {
            "description": "Optional dict-like object describing the job's run-time parameterization",
            "type": "object"
        },
        "index_patterns": {
            "type": "array",
            "description": "List of Python regular expressions defining which output files to associate with the job. Omit entirely if you do not want to apply filtering.",
            "items": {
                "type": "string"
            }
        },
        "processing_level": {
            "description": "Defaults to '1' if not provided",
            "$ref": "processing_level.json"
        }
    },
    "required": [
        "job_definition",
        "parameters"
    ],
    "additionalProperties": false
}
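
The schema's top-level rules can be checked by hand before a message is sent. The following is a partial sketch using only the standard library: it covers the required keys, the ban on additional properties, and the need for at least one linkage field in parameters. It does not check the id patterns or the referenced schemas (for example agave_files_uri.json), which are not reproduced here. The function name check_message is hypothetical.

```python
ALLOWED_KEYS = {
    "parameters", "job_definition", "archive_path", "instanced",
    "data", "index_patterns", "processing_level",
}
REQUIRED_KEYS = {"job_definition", "parameters"}
LINKAGE_KEYS = {
    "sample_id", "experiment_design_id", "experiment_id", "measurement_id",
}


def check_message(message):
    """Return a list of problems found; an empty list means none found.

    Covers only the top-level rules of the schema, not patterns or
    $ref targets.
    """
    problems = []
    missing = REQUIRED_KEYS - set(message)
    if missing:
        problems.append("missing required keys: %s" % sorted(missing))
    extra = set(message) - ALLOWED_KEYS
    if extra:
        problems.append("unexpected keys: %s" % sorted(extra))
    parameters = message.get("parameters", {})
    if not isinstance(parameters, dict) or not LINKAGE_KEYS & set(parameters):
        problems.append("parameters needs one of: %s" % sorted(LINKAGE_KEYS))
    if not isinstance(message.get("job_definition", {}), dict):
        problems.append("job_definition must be an object")
    return problems


good = {
    "parameters": {"sample_id": "sample.tacc.abcde"},
    "job_definition": {"appId": "tacobot9000-0.1.0u1"},
}
print(check_message(good))
```

Returning a list of problems instead of raising on the first one lets a caller report everything wrong with a message in a single pass.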