The following examples demonstrate the commands in the BaseSpace CLI tool. For more information about the CLI and a list of commands, see CLI Overview.
Note that where the output is very long, it has been truncated to keep this document manageable. You are encouraged to follow these examples, trying the commands on your own data, to see the full output on your own system.
See main instructions.
$ bs auth
Please go to this URL to authenticate:
https://basespace.illumina.com/oauth/device?code=jfHSG
Created config file /home/username/.basespace/default.cfg
You can review the resulting settings, including the API host and the scopes granted, with bs whoami:
$ bs whoami
+----------------+----------------------------------------------------+
| Name | User Name |
| Id | 1234567 |
| Email | myemail@domain.com |
| DateCreated | 2014-09-25 16:29:21 +0000 UTC |
| DateLastActive | 2021-09-16 10:38:17 +0000 UTC |
| Host | https://api.basespace.illumina.com |
| Scopes | READ GLOBAL, CREATE GLOBAL, BROWSE GLOBAL, |
| | CREATE PROJECTS, CREATE RUNS, START APPLICATIONS, |
| | MOVETOTRASH GLOBAL, WRITE GLOBAL |
+----------------+----------------------------------------------------+
$ bs list projects
+--------------------------------------------------+----------+---------------+
| Name | Id | TotalSize |
+--------------------------------------------------+----------+---------------+
| NovaSeq: TruSeq Nano 550 (Replicates of NA12878) | 36080093 | 2233311909088 |
+--------------------------------------------------+----------+---------------+
$ bs list datasets
+--------------------------+-------------------------------------+--------------------------------------------------+---------------------+
| Name | Id | Project.Name | DataSetType.Id |
+--------------------------+-------------------------------------+--------------------------------------------------+---------------------+
| NA12878-I13_L002 | ds.184ba3d796f343f4886b4aa7fb43c496 | NovaSeq: TruSeq Nano 550 (Replicates of NA12878) | illumina.fastq.v1.8 |
| NA12878-I13_L001 | ds.c805113ed9884caa8912dafdf8edd63d | NovaSeq: TruSeq Nano 550 (Replicates of NA12878) | illumina.fastq.v1.8 |
| NA12878-I54_L001 | ds.dc5657d91983479eb0dd6abb53b9d60f | NovaSeq: TruSeq Nano 550 (Replicates of NA12878) | illumina.fastq.v1.8 |
| NA12878-I85_L001 | ds.0a7781b4d7684113a4c64c1f2ca3c175 | NovaSeq: TruSeq Nano 550 (Replicates of NA12878) | illumina.fastq.v1.8 |
...
$ bs dataset headers
[
"Id",
"Name",
"AppSession.Id",
"AppSession.Name",
"AppSession.Application.AppFamilySlug",
"AppSession.Application.AppVersionSlug",
"AppSession.Application.Id",
"AppSession.Application.VersionNumber",
...
You can select custom columns from the headers list by using the -F option:
$ bs list datasets -F Name -F QcStatus -F TotalSize -F AppSession.Application.Name
+--------------------------+-----------+-------------+---------------------------------+
| Name | QcStatus | TotalSize | AppSession.Application.Name |
+--------------------------+-----------+-------------+---------------------------------+
| NA12878-I13_L002 | Undefined | 4047606231 | FASTQ Generation |
| NA12878-I13_L001 | Undefined | 4118853581 | FASTQ Generation |
| NA12878-I54_L001 | Undefined | 3496655065 | FASTQ Generation |
| NA12878-I85_L001 | Undefined | 2462111619 | FASTQ Generation |
| NA12878-I87_L001 | Undefined | 2271497152 | FASTQ Generation |
...
BaseSpaceCLI provides several options for filtering the entities that are output by a list command.
The option to filter results on an entity field is --filter-term. By default, this filters on the Name field.
$ bs list projects --filter-term=examples
+---------------+---------+-----------+
| Name | Id | TotalSize |
+---------------+---------+-----------+
| data_examples | 5472467 | 26510 |
+---------------+---------+-----------+
The filter term is specified as a regular expression:
$ bs list appsessions --filter-term=" .* "
+------------------------------------------+----------+-----------------+
| Name | Id | ExecutionStatus |
+------------------------------------------+----------+-----------------+
| Illumina's Uploader 2012-11-19 22:06:29Z | 1313312 | Complete |
| Illumina's Uploader 2012-11-19 22:06:29Z | 1306305 | Complete |
| BaseSpaceCLI 2017-04-04 11:07:54Z | 10743733 | Complete |
+------------------------------------------+----------+-----------------+
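Because the filter term is a regular expression, you can prototype a pattern against sample names with grep before using it (this is only an illustration of the matching; the CLI does the filtering itself, and its regex engine may differ from grep's in edge cases):

```shell
# ' .* ' requires a space, then anything, then another space,
# so only names containing at least two spaces match.
printf '%s\n' 'data_examples' 'My Run' \
  "Illumina's Uploader 2012-11-19 22:06:29Z" | grep ' .* '
# prints only the Uploader line
```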
You can also specify the field on which to filter by using the --filter-field option:
$ bs list datasets --filter-field=Project.Name --filter-term=data
+-----------+-------------------------------------+---------------+---------------------+
| Name | Id | Project.Name | DataSetType.Id |
+-----------+-------------------------------------+---------------+---------------------+
| valid | ds.5c4200d4f52e4a9dae86fd8b166e296d | data_examples | illumina.fastq.v1.8 |
| test_data | ds.46c118551d51497789ddaf84bbc9bff0 | data_examples | common.files |
+-----------+-------------------------------------+---------------+---------------------+
This is necessary for entities that do not have a "Name" field, like biosamples:
$ bs list biosamples --filter-term=demo
ERROR: *** Name "Name" not found in object ***
$ bs list biosamples --filter-term=demo --filter-field=BioSampleName
+-------------------------------+---------+---------------+-------------------+--------+
| BioSampleName | Id | ContainerName | ContainerPosition | Status |
+-------------------------------+---------+---------------+-------------------+--------+
| HiSeq_2500_NA12878_demo_2x150 | 2280211 | | | New |
+-------------------------------+---------+---------------+-------------------+--------+
Note that only one --filter-field and --filter-term pairing can be used per command. For additional filtering, consider post-processing the CLI results with tools such as grep (see BaseSpaceCLI filtering vs. POSIX).
You can also specify the age of entities to be displayed by using --older-than and --newer-than.
$ bs list datasets -F Name -F DateModified --newer-than=400d
+------------------------------------------+-------------------------------+
| Name | DateModified |
+------------------------------------------+-------------------------------+
| HiSeq 2500 NA12878 demo 2x150 | 2017-06-30 22:17:17 +0000 UTC |
| BWA GATK - HiSeq 2500 NA12878 demo 2x150 | 2017-07-04 03:09:15 +0000 UTC |
+------------------------------------------+-------------------------------+
By default, date filtering applies to the DateModified field. You can change this to another date field by using the --date-field option.
$ bs list datasets --date-field=DateCreated --older-than=1y -F Name -F DateCreated
+-----------+-------------------------------+
| Name | DateCreated |
+-----------+-------------------------------+
| valid | 2017-04-04 11:07:56 +0000 UTC |
| test_data | 2017-04-04 11:20:09 +0000 UTC |
+-----------+-------------------------------+
The --filter-term and --filter-field options use client-side filtering: the API returns all entities, which are filtered before they are displayed. This means that even if you only end up listing a handful of results, the command can take a long time on a large account.
Some entities have specific filtering options that make use of server-side filtering, where the API does the filtering and only returns the matching entities. These are available on an entity-specific basis:
$ bs list appsessions --exec-status=Complete
+------------------------------------------+----------+-----------------+
| Name | Id | ExecutionStatus |
+------------------------------------------+----------+-----------------+
| test_data | 10743734 | Complete |
| Illumina's Uploader 2012-11-19 22:06:29Z | 1313312 | Complete |
| Illumina's Uploader 2012-11-19 22:06:29Z | 1306305 | Complete |
| BaseSpaceCLI 2017-04-04 11:07:54Z | 10743733 | Complete |
+------------------------------------------+----------+-----------------+
$ bs list datasets --is-type=common.files
+------------------------------------------+-------------------------------------+-------------------------------+----------------+
| Name | Id | Project.Name | DataSetType.Id |
+------------------------------------------+-------------------------------------+-------------------------------+----------------+
| test_data | ds.46c118551d51497789ddaf84bbc9bff0 | data_examples | common.files |
| BWA GATK - HiSeq 2500 NA12878 demo 2x150 | ds.2f03151b6c9b4a909d05b1af729a6fc2 | HiSeq 2500 NA12878 2x150 Demo | common.files |
+------------------------------------------+-------------------------------------+-------------------------------+----------------+
You can discover the server-side filtering options for each entity by using --help:
$ bs list datasets --help
[dataset command options]
--like-type= Filter DataSets that are LIKE this type
--is-type= Filter DataSets that are this type
--not-type= Filter DataSets that are NOT this type
--input-biosample= Filter by Input BioSample
--project-name= Name of parent project
--project-id= ID of parent project
You can also combine server-side and client-side filtering:
$ bs list datasets --is-type=common.files --filter-term=data
+-----------+-------------------------------------+---------------+----------------+
| Name | Id | Project.Name | DataSetType.Id |
+-----------+-------------------------------------+---------------+----------------+
| test_data | ds.46c118551d51497789ddaf84bbc9bff0 | data_examples | common.files |
+-----------+-------------------------------------+---------------+----------------+
An alternative to using the BaseSpaceCLI filter options is to use the standard POSIX tools such as grep and cut. The advantage of using the BaseSpaceCLI filters is that you can still combine them with other options, such as column selection and output formatting, which can be more convenient:
# using POSIX tools:
$ bs list datasets -f csv | grep common.files | cut -d, -f2
ds.46c118551d51497789ddaf84bbc9bff0
ds.2f03151b6c9b4a909d05b1af729a6fc2
# the equivalent, with BSCLI filters:
$ bs list datasets -f csv --terse --is-type=common.files
ds.46c118551d51497789ddaf84bbc9bff0
ds.2f03151b6c9b4a909d05b1af729a6fc2
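Since only one filter pairing is allowed per command, further conditions can also be applied to CSV output with awk. Here is a sketch, with the CSV written inline to stand in for the output of bs list datasets -f csv (the column order is taken from the tables above):

```shell
# Stand-in for: bs list datasets -f csv > /tmp/datasets.csv
cat > /tmp/datasets.csv <<'EOF'
Name,Id,Project.Name,DataSetType.Id
valid,ds.5c4200d4f52e4a9dae86fd8b166e296d,data_examples,illumina.fastq.v1.8
test_data,ds.46c118551d51497789ddaf84bbc9bff0,data_examples,common.files
EOF
# keep rows where Project.Name contains "data" AND the type is common.files;
# print the Id column
awk -F, 'NR > 1 && $3 ~ /data/ && $4 == "common.files" { print $2 }' /tmp/datasets.csv
# prints: ds.46c118551d51497789ddaf84bbc9bff0
```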
The BaseSpace CLI downloader downloads files incrementally. If the connection is interrupted, re-running the download command will, by default, check for files that have already been downloaded successfully and will avoid downloading them again unnecessarily.
The example below downloads all files from a run; these can be used to generate FASTQ files locally as well as to inspect run metrics.
$ bs download run -i <RunID> -o <output>
Multiple runs can be downloaded in succession by iterating through a list of run IDs. The example below generates a list of all run IDs associated with an account and downloads each run in the list to a folder named after the numerical run ID.
$ bs list runs --terse > download.txt
$ while read run; do bs download run -i $run -o $run; done < download.txt
or in a more compact form:
$ bs list run --terse | xargs -I@ bs download run -i @ -o @
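If you have many runs, the same xargs pattern can fan out several downloads at once with -P. Here is a sketch with echo standing in for bs, so the generated commands are visible without needing an authenticated CLI; the run IDs are made up:

```shell
# Stand-in for: bs list runs --terse
# -P4 runs up to four download jobs in parallel
printf '%s\n' 196710761 196710762 196710763 |
  xargs -P4 -I@ echo bs download run -i @ -o @
# prints one "bs download run" command per run ID (order may vary)
```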
The example below can be used to download all files associated with a project from FASTQ files to analysis results.
$ bs download project -i <ProjectID> -o <output>
A subset of files can be downloaded from a project by specifying the desired file extension. The example below can be used to download all FASTQ files in a project and only the FASTQ files.
$ bs download project -i <ProjectID> -o <output> --extension=fastq.gz
The example below will download all datasets associated with a biosample, even if the datasets are spread across multiple projects and aggregated under the single biosample name.
$ bs download biosample -i <BiosampleID> -o <output>
This example is from the VCAT app:
$ bs get dataset -i ds.f45e4fcccbce4fb18dd91bdad7dcb272
+---------------------------------------------------+----------------------------------------------------------------------------------------+
| Id | ds.f45e4fcccbce4fb18dd91bdad7dcb272 |
| Name | NA12878-R1S1vcf-38337470 |
| AppSession.Id | 42463886 |
| AppSession.Name | NA12878-R1_S1.vcf.gz_2 |
| AppSession.Application.AppFamilySlug | basespace-labs.variant-calling-assessment-tool |
...
This example is from a FASTQ app:
$ bs list attributes dataset -i ds.2f5b56dddc0440858943246ba4ac9d11
+---------------------+---------------+
| Name | Value |
+---------------------+---------------+
| TotalReadsPF | 4.1119628e+07 |
| MaxLengthIndexRead1 | 8 |
| MaxLengthRead1 | 151 |
| MaxLengthRead2 | 151 |
| IsPairedEnd | true |
| TotalClustersPF | 2.0559814e+07 |
| TotalClustersRaw | 2.6711606e+07 |
| TotalReadsRaw | 5.3423212e+07 |
| MaxLengthIndexRead2 | 8 |
+---------------------+---------------+
This is a VCAT example:
$ bs contents dataset -i ds.f45e4fcccbce4fb18dd91bdad7dcb272
+------------+-----------------------------------------------------------------------------------------+
| Id | FilePath |
+------------+-----------------------------------------------------------------------------------------+
| 7240583239 | happy/NA12878-R1_S1-vcf-38337470__NA12878-Platinum-Genomes-v2016-1-0-hg38-.vcf.gz.tbi |
| 7240583238 | happy/NA12878-R1_S1-vcf-38337470__NA12878-Platinum-Genomes-v2016-1-0-hg38-.vcf.gz |
| 7240583237 | happy/NA12878-R1_S1-vcf-38337470__NA12878-Platinum-Genomes-v2016-1-0-hg38-.summary.csv |
| 7240583236 | happy/NA12878-R1_S1-vcf-38337470__NA12878-Platinum-Genomes-v2016-1-0-hg38-.metrics.json |
| 7240583235 | happy/NA12878-R1_S1-vcf-38337470__NA12878-Platinum-Genomes-v2016-1-0-hg38-.extended.csv |
| 7240583234 | happy/NA12878-R1_S1-vcf-38337470__NA12878-Platinum-Genomes-v2016-1-0-hg38-.counts.json |
| 7240583233 | happy/NA12878-R1_S1-vcf-38337470__NA12878-Platinum-Genomes-v2016-1-0-hg38-.counts.csv |
| 7240583232 | report.log |
| 7240583231 | report.json |
+------------+-----------------------------------------------------------------------------------------+
Simple filtering of files to download by their extension can be done using the --extension flag. For more sophisticated filtering, see the Selective filtering of uploads and downloads section.
# will download into directory /tmp/vcat
$ bs download dataset -i ds.2f5b56dddc0440858943246ba4ac9d11 --extension=json -o /tmp/vcat
NA12878-R1_S1-vcf-38337470__NA12878-Platinum-Genomes-v2016-1-0-hg38-.metrics.json 27.44 KB / 27.44 KB [============] 100.00% 348.12 KB/s 0s
happy/NA12878-R1_S1-vcf-38337470__NA12878-Platinum-Genomes-v2016-1-0-hg38-.metrics.json
NA12878-R1_S1-vcf-38337470__NA12878-Platinum-Genomes-v2016-1-0-hg38-.counts.json 5.02 KB / 5.02 KB [================] 100.00% 22.04 MB/s 0s
happy/NA12878-R1_S1-vcf-38337470__NA12878-Platinum-Genomes-v2016-1-0-hg38-.counts.json
report.json 2.53 KB / 2.53 KB [=====================================================================================] 100.00% 13.41 MB/s 0s
report.json
NA12878-R1S1vcf-38337470.ds.f45e4fcccbce4fb18dd91bdad7dcb272.json 2.71 KB / 2.71 KB [===============================] 100.00% 61.58 MB/s 0s
NA12878-R1S1vcf-38337470.ds.f45e4fcccbce4fb18dd91bdad7dcb272.json
Many BSSH entities can be tagged with properties: key/value pairs that label those entities. Some entities, like appsessions, come tagged with properties automatically, but properties can also be added manually. BSCLI lets you inspect existing properties of any type and create string properties.
BSSH entities that can be labelled with properties include projects, runs, biosamples, appsessions, appresults and datasets.
The command to see all the properties of an entity is property list:
$ bs appsession property list -i 46664618
+-------------------------------+----------------------------+-------------+------------------------------------+
| Name | Description | Type | Content |
+-------------------------------+----------------------------+-------------+------------------------------------+
| Output.Projects | | project[] | <use `bs get` to obtain more info> |
| Output.Datasets | | dataset[] | <use `bs get` to obtain more info> |
| Input.snp_vqsr | SNP VQSR sensitivity | string | "99.5" |
| Input.sample-id.attributes | Sample Attributes | map[] | <use `bs get` to obtain more info> |
| Input.reference_genome | Reference genome | string | "b37_decoy" |
| Input.Projects | | project[] | <use `bs get` to obtain more info> |
| Input.project-id.attributes | Save Results To Attributes | map[] | <use `bs get` to obtain more info> |
| Input.project-id | Save Results To | project | <use `bs get` to obtain more info> |
| Input.indel_vqsr | Indel VQSR sensitivity | string | "99.5" |
| Input.Datasets | | dataset[] | <use `bs get` to obtain more info> |
| Input.BioSamples | | biosample[] | <use `bs get` to obtain more info> |
| Input.app-session-name | Analysis Name | string | "Sentieon [LocalDateTime]" |
| BaseSpace.Private.IsMultiNode | | string | "True" |
+-------------------------------+----------------------------+-------------+------------------------------------+
To see an individual property, use property get with the --property-name switch:
$ bs appsession property get -i 46664618 --property-name="Input.snp_vqsr"
"99.5"
Note that many of the properties of this appsession are themselves BSSH entities, which can be listed directly by using property get:
$ bs appsession property get -i 46664618 --property-name="Output.Projects"
+---------------+----------+----------------+
| Name | Id | TotalSize |
+---------------+----------+----------------+
| sgdp_sentieon | 38827790 | 32372028734815 |
+---------------+----------+----------------+
This is particularly useful for finding the inputs and outputs of an app:
$ bs appsession property get -i 46664618 --property-name="Input.Datasets"
+------------+-------------------------------------+---------------------------------+---------------------+
| Name | Id | Project.Name | DataSetType.Id |
+------------+-------------------------------------+---------------------------------+---------------------+
| ERR1347692 | ds.4fa74a92d9b04a69a5cb53f603d965fa | Simons Genome Diversity Project | illumina.fastq.v1.8 |
+------------+-------------------------------------+---------------------------------+---------------------+
$ bs appsession property get -i 46664618 --property-name="Output.Datasets"
+----------+-------------------------------------+---------------+----------------+
| Name | Id | Project.Name | DataSetType.Id |
+----------+-------------------------------------+---------------+----------------+
| 36701821 | ds.e599c516419e4470b03f26c480fce45d | sgdp_sentieon | common.files |
+----------+-------------------------------------+---------------+----------------+
We can use some shell features to see the list of files that were output in a single command:
$ bs contents dataset -i $(bs appsession property get -i 46664618 --property-name="Output.Datasets" --terse)
+------------+--------------------------------------------+
| Id | FilePath |
+------------+--------------------------------------------+
| 7776871946 | .basespace/ERR1347692_128_hs37d5.cov.gz |
| 7776871945 | .basespace/ERR1347692_128_Y.cov.gz |
| 7776871944 | .basespace/ERR1347692_128_X.cov.gz |
| 7776871943 | .basespace/ERR1347692_128_NC_007605.cov.gz |
| 7776871942 | .basespace/ERR1347692_128_MT.cov.gz |
| 7776871941 | .basespace/ERR1347692_128_GL0002491.cov.gz |
| 7776871940 | .basespace/ERR1347692_128_GL0002481.cov.gz |
...
Projects by default do not have properties set:
$ bs projects properties list -i 27932921
$
We can add string properties:
$ bs projects properties set -i 27932921 --property-name="MyNamespace.TestProperty" --property-content="TestValue"
$ bs projects properties list -i 27932921
+--------------------------+-------------+--------+-------------+
| Name | Description | Type | Content |
+--------------------------+-------------+--------+-------------+
| MyNamespace.TestProperty | | string | "TestValue" |
+--------------------------+-------------+--------+-------------+
Note that a namespace prefix for a property name is compulsory:
$ bs projects properties set -i 27932921 --property-name="TestProperty" --property-content="TestValue"
ERROR: *** BASESPACE.PROPERTIES.NAME_INVALID: Property name: TestProperty must contain 2 or more segments split by a period. Each segment may contain letters, numbers, '-', and '_'. First segment must start with a letter or number. ***
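The naming rule in that error message can be approximated with a regular expression, which is handy for pre-checking names locally before calling the API. This is a sketch; the regex is one reading of the stated rule, and the server's exact validation may differ:

```shell
# valid NAME - report whether NAME looks like a valid property name:
# two or more dot-separated segments of letters, digits, '-' and '_',
# with the first segment starting with a letter or digit.
valid() {
  if printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9][A-Za-z0-9_-]*(\.[A-Za-z0-9_-]+)+$'; then
    echo "valid: $1"
  else
    echo "invalid: $1"
  fi
}
valid MyNamespace.TestProperty   # valid
valid TestProperty               # invalid: only one segment
valid .bad.name                  # invalid: first segment must start with a letter or number
```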
The following creates a new configuration, laneqc, with the CONFIGURE QC scope; we will refer to it in subsequent commands.
$ bs auth --scopes 'READ GLOBAL,CREATE GLOBAL,CONFIGURE QC' -c laneqc
Please go to this URL to authenticate: https://basespace.illumina.com/oauth/device?code=HrACj
Created config file /Users/basespaceuser/.basespace/laneqc.cfg
Welcome, BSSH CLI TestUser
$ bs -c laneqc lane threshold export
Name,Group,Operator,ThresholdValues
$ cat > /tmp/thresholds.txt
Name,Group,Operator,ThresholdValues
PercentGtQ30,SequencingRead1,GreaterThanOrEqual,50
PercentGtQ30,SequencingRead2,GreaterThanOrEqual,40
$ bs -c laneqc lane threshold import -f /tmp/thresholds.txt
# should finish without errors!
$ bs -c laneqc lane threshold export
Name,Group,Operator,ThresholdValues
PercentGtQ30,SequencingRead1,GreaterThanOrEqual,50
PercentGtQ30,SequencingRead2,GreaterThanOrEqual,40
$ bs -c laneqc lane threshold clear
$ bs -c laneqc lane threshold export
Name,Group,Operator,ThresholdValues
# warning! if your project does not already exist it will be implicitly created
$ bs create biosample -n "MyBioSample" -p "MyProject"
Note that there are quite a few optional metadata parameters for biosamples. These are primarily designed to help high-throughput labs classify and display biosamples:
$ bs create biosample --help
(snip)
BioSample Options:
-n, --name= Name of the BioSample
-p, --project= Name of the project where FastQs will be stored. Created if not found.
--container-name= Name of container
--container-position= Position within the container
--analysis-workflow= Name of the analysis to schedule
--prep-request= Name of the lab workflow that LIMS should perform
--required-yield= Required yield in Gbp that is needed before launching analysis (required if --prep-request is provided)
--metadata= Key/Value metadata properties to set on the BioSample
--delivery-mode=[Deliver|Do Not Deliver] Initial delivery mode
(snip)
Use the --metadata flag to attach arbitrary labels to a new biosample:
$ bs create biosample -n MyBioSample -p MyProject --metadata Type:FFPE --metadata SequencingLab:12
You can preview biosample creation (--preview) to validate that the data provided is correct.
$ bs create biosample -n "MyBioSample" -p "MyProject" --preview
ERROR: *** Error in BioSample Name: BioSample 'MyBioSample' already exists and cannot be imported ***
To attach a new analysis workflow to an existing biosample, use the --allow-existing option.
There are two options to associate uploaded FASTQ files with a biosample:
$ ls
MyBioSample_S1_L001_R1_001.fastq.gz
MyBioSample_S1_L001_R2_001.fastq.gz
# note that you need a project ID here, not a project name as when you created the biosample!
$ bs upload dataset -p 27943921 MyBioSample_S1_L001_R1_001.fastq.gz MyBioSample_S1_L001_R2_001.fastq.gz
Creating sample: MyBioSample
MyBioSample_S1_L001_R1_001.fastq.gz 1.07 GiB / 1.07 GiB [=========================================================] 100.00%
MyBioSample_S1_L001_R2_001.fastq.gz 1.11 GiB / 1.11 GiB [=========================================================] 100.00%
Upload complete
# note that datasets are created asynchronously and there can be a delay
$ bs list dataset --input-biosample="MyBioSample"
+-------------+-------------------------------------+--------------+---------------------+
| Name | Id | Project.Name | DataSetType.Id |
+-------------+-------------------------------------+--------------+---------------------+
| MyBioSample | ds.94f7e9663e86473c8582dcf85a830195 | MyProject | illumina.fastq.v1.8 |
+-------------+-------------------------------------+--------------+---------------------+
$ ls
valid_S1_L001_R1_001.fastq.gz valid_S1_L001_R2_001.fastq.gz
# note that you need a project ID here, not a project name as when you created the biosample!
$ bs upload dataset --biosample-name="MyBioSample" -p 27943921 valid_S1_L001_R1_001.fastq.gz valid_S1_L001_R2_001.fastq.gz
MyBioSample_S1_L001_R1_001.fastq.gz 1.07 GiB / 1.07 GiB [=========================================================] 100.00%
MyBioSample_S1_L001_R2_001.fastq.gz 1.11 GiB / 1.11 GiB [=========================================================] 100.00%
Upload complete
$ bs list dataset --input-biosample="MyBioSample"
+-------------+-------------------------------------+--------------+---------------------+
| Name | Id | Project.Name | DataSetType.Id |
+-------------+-------------------------------------+--------------+---------------------+
| MyBioSample | ds.94f7e9663e86473c8582dcf85a830195 | MyProject | illumina.fastq.v1.8 |
| MyBioSample | ds.64706f7d2e504e1c9495c00c468d6640 | MyProject | illumina.fastq.v1.8 |
+-------------+-------------------------------------+--------------+---------------------+
Note that even though the dataset has a name to match the biosample, the files within retain their original names:
$ bs dataset contents -i ds.94f7e9663e86473c8582dcf85a830195
+------------+-------------------------------------+
| Id | FilePath |
+------------+-------------------------------------+
| 8652306587 | MyBioSample_S1_L001_R1_001.fastq.gz |
| 8652306586 | MyBioSample_S1_L001_R2_001.fastq.gz |
+------------+-------------------------------------+
$ bs dataset contents -i ds.64706f7d2e504e1c9495c00c468d6640
+------------+-------------------------------+
| Id | FilePath |
+------------+-------------------------------+
| 8652572270 | valid_S1_L001_R2_001.fastq.gz |
| 8652572269 | valid_S1_L001_R1_001.fastq.gz |
+------------+-------------------------------+
The bs upload dataset command supports a --recursive option that scans a directory and its subdirectories looking for FASTQ files:
$ ls
MyBioSample2_S1_L002_R1_001.fastq.gz MyBioSample2_S1_L002_R2_001.fastq.gz MyBioSample_S1_L001_R1_001.fastq.gz MyBioSample_S1_L001_R2_001.fastq.gz
$ bs upload dataset -p 21646627 --recursive .
Creating sample: MyBioSample2
MyBioSample2_S1_L002_R1_001.fastq.gz 1.08 GiB / 1.08 GiB [==========================================================] 100.00%
MyBioSample2_S1_L002_R2_001.fastq.gz 1.11 GiB / 1.11 GiB [==========================================================] 100.00%
Upload complete
Creating sample: MyBioSample
MyBioSample_S1_L001_R1_001.fastq.gz 1.07 GiB / 1.07 GiB [==========================================================] 100.00%
MyBioSample_S1_L001_R2_001.fastq.gz 1.11 GiB / 1.11 GiB [==========================================================] 100.00%
Upload complete
By default, these will be automatically grouped by name and uploaded as individual FASTQ datasets, but you can also force them all to be uploaded to the same biosample:
$ bs upload dataset -p 21646627 --recursive . --biosample-name=MyBioSample
Creating sample: MyBioSample
MyBioSample2_S1_L002_R2_001.fastq.gz 1.11 GiB / 1.11 GiB [==========================================================] 100.00%
MyBioSample2_S1_L002_R1_001.fastq.gz 1.08 GiB / 1.08 GiB [==========================================================] 100.00%
MyBioSample_S1_L001_R2_001.fastq.gz 1.11 GiB / 1.11 GiB [==========================================================] 100.00%
MyBioSample_S1_L001_R1_001.fastq.gz 1.07 GiB / 1.07 GiB [==========================================================] 100.00%
Upload complete
Uploading a run requires an additional authentication scope, CREATE RUNS:
$ bs authenticate -c run-upload --scopes "CREATE RUNS"
The upload run command takes a run folder and uploads it to Sequence Hub:
$ bs -c run-upload upload run -n MyNewRunName -t HiSeqX /path/to/runFolder
The run folder must contain the standard Illumina run files. A run sample sheet file (SampleSheet.csv) is recommended, to kick off automatic FASTQ Generation once the run upload has completed.
Runs often consist of many small files; to optimise the upload you may want to tune concurrency settings, for example by increasing --concurrent-files and possibly decreasing --concurrent-parts.
The upload run command requires a named instrument type (--instrument / -t), such as HiSeqX in the example above.
You can use bs upload dataset to upload any arbitrary files by supplying the --type common.files option:
$ ls
testfile1.txt testfile2.txt
$ bs upload dataset -p 21646627 --type common.files --recursive .
Creating dataset: testfile1.txt+
testfile2.txt 2.76 KiB / 2.76 KiB [==========================================================] 100.00%
testfile1.txt 2.31 KiB / 2.31 KiB [==========================================================] 100.00%
Upload complete
If no name for the dataset is specified, a name will be created based on one of the files to be uploaded. You can specify a name with the --name option.
The bs upload dataset command contains a number of options to control how the upload is conducted and reported. For example, to upload with high concurrency and no progress bars:
$ bs upload dataset -p 21646627 --type common.files --no-progress-bars --concurrency=high --recursive .
To download only a given set of file extensions, such as BAMs and VCFs, you can supply the --extension flag to download commands multiple times. This will only pull files with names ending in the given suffixes:
$ bs download dataset -n MyDataSetName --extension bam --extension bai --extension vcf.gz --extension vcf.gz.tbi -o downloads
For uploading runs, there's a --skip-ext option for excluding files from the set being uploaded, and --skip-dir for omitting directories:
$ bs upload run -n MyRun -t HiSeqX --skip-ext jpg --skip-dir Thumbnail_Images /path/to/run-folder
For more complex filtering when uploading or downloading a set of files, see the Advanced filtering section.
Upload and download offer --include and --exclude flags for flexible filtering of file sets. Use these flags multiple times to have full control over which subset of files is uploaded or downloaded for a given command. Include and exclude patterns are not regular expressions but simple UNIX shell patterns, as implemented by fnmatch:
- * matches any string, including path separators
- ? matches any single character
- [] defines a character set, so [AB] matches one of A or B
- ! or ^ negates a character set, so [!1] matches any single character except 1
Note that to prevent your shell from expanding wildcards, you may need to use single quotes around your include and exclude patterns. Do not use single quotes when running under Windows CMD.
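You can get a feel for these patterns using the shell's own case patterns, which belong to the same fnmatch family. This is only an illustration of the pattern syntax, not the CLI's matcher:

```shell
# match PATTERN STRING - report whether STRING matches the fnmatch-style PATTERN
match() {
  case "$2" in
    $1) echo "match" ;;
    *)  echo "no match" ;;
  esac
}
match '*.bam*'      NA12877.bam.bai    # match
match '?'           A                  # match: exactly one character
match 'NA1287[!8]*' NA12878.vcf.gz     # no match: [!8] rejects the 8
```

Note that, as in the list above, * here matches path separators too, unlike in pathname expansion.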
The following examples will use this directory set up to demonstrate how these filters work:
/tmp/upload/
├── NA12877.bam
├── NA12877.bam.bai
├── NA12878.bam
├── NA12878.bam.bai
├── NA12878.vcf.gz
├── NA12878.vcf.gz.tbi
├── metrics
│ ├── exome-coverage.csv
│ └── wgs-coverage.csv
├── plots
│ └── coverage-histogram.png
└── summary.csv
Filters are applied in the order in which they are supplied on the command line, starting from a position of including all files. That means that using only --include flags will have no effect on the files being uploaded: they're already included. To selectively include files, you may want to exclude everything first.
--exclude '*' --include '*.bam*' gives:
/tmp/upload/
├── NA12877.bam
├── NA12877.bam.bai
├── NA12878.bam
└── NA12878.bam.bai

--exclude '*' --include '*.csv' gives:
/tmp/upload/
├── metrics
│   ├── exome-coverage.csv
│   └── wgs-coverage.csv
└── summary.csv

--exclude 'plots/*' --exclude 'metrics/*' --exclude '*/*' gives:
/tmp/upload/
├── NA12877.bam
├── NA12877.bam.bai
├── NA12878.bam
├── NA12878.bam.bai
├── NA12878.vcf.gz
├── NA12878.vcf.gz.tbi
└── summary.csv
A given file may be included and excluded by multiple conditions; its final inclusion status is determined after applying all include and exclude patterns in the order supplied, from left to right. Note that patterns are relative to the root directory being uploaded and are matched against the full relative path of each file being considered. Here are some more examples to demonstrate how this works:
--exclude '*' --include '*.bam' --exclude 'NA12877*' --exclude '*' --include NA12878.bam gives:
/tmp/upload/
└── NA12878.bam

--exclude '*' --include 'metrics/*' --exclude 'metrics/exome*' gives:
/tmp/upload/
└── metrics
    └── wgs-coverage.csv
You can also use negative match characters to filter a file set:
Command | File set |
---|---|
--exclude '*' --include 'NA1287[!8]*' |
/tmp/upload/ ├── NA12877.bam └── NA12877.bam.bai |
--exclude '*' --include 'NA1287[^8]*' |
/tmp/upload/ ├── NA12877.bam └── NA12877.bam.bai |
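The ordering rule can be sketched with a small shell function (an illustration only, not the CLI's actual implementation). Shell `case` patterns behave like fnmatch here, including `*` crossing path separators:

```shell
# Sketch: every file starts out included, then each action/pattern pair is
# applied left to right, and the last matching pattern wins.
decide() {
  file=$1; shift
  verdict=include                    # start from "all files included"
  while [ "$#" -ge 2 ]; do
    action=$1; pattern=$2; shift 2
    case $file in
      $pattern) verdict=$action ;;   # later matches override earlier ones
    esac
  done
  echo "$verdict"
}

decide NA12878.bam              exclude '*' include '*.bam*'       # -> include
decide metrics/wgs-coverage.csv exclude '*' include '*.bam*'       # -> exclude
decide NA12877.bam              exclude '*' include 'NA1287[!8]*'  # -> include
decide NA12878.bam              exclude '*' include 'NA1287[!8]*'  # -> exclude
```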
Many core BSSH entities can be deleted with a `bs delete` command:
# delete dataset by ID
$ bs delete dataset -i ds.123
# delete project by name
$ bs delete project -n MyProject
You can see which entities support deletion by running `bs delete`:
$ bs delete
Please specify one command of: appresult, appsession, dataset, lane, project, property, run or workflow
Note that some of these commands delete configuration for the automated workflow feature rather than entities themselves (`bs delete lane`, `bs delete workflow`).
To delete a property, you need to specify the entity that owns the property, as well as the property name you want to delete:
# show the properties in a project
$ bs list properties project --name="Project test"
+-------------------------+-------------+--------+--------------+
| Name | Description | Type | Content |
+-------------------------+-------------+--------+--------------+
| myproperty.testproperty | | string | "testvalue2" |
+-------------------------+-------------+--------+--------------+
# delete a property - specify both project and property name
$ bs delete properties project --name="Project test" --property-name="myproperty.testproperty"
# list again - the property has disappeared
$ bs list properties project --name="Project test"
$
Some entities (runs and datasets) support deleting the space-consuming files while retaining other information. This is achieved with the `--preserve-metadata` switch.
# this will delete the files in ds.123, but keep the metrics
$ bs delete dataset -i ds.123 --preserve-metadata
# after deleting with preserve metadata, you'll still be able to see dataset attributes
$ bs list attributes dataset -i ds.123
# but the file contents will be gone
$ bs contents dataset -i ds.123
If you delete a run preserving metadata, a number of key files are also retained, including XML files that describe how a run was configured and the InterOp files, which allow the graph views of a run to be viewed. This is ideal if you want to hugely reduce the data footprint of a run by deleting BCL files whilst leaving behind the metrics for long-term trending and record-keeping.
A common pattern that can be useful is to delete BCL files in runs above a certain age. This can be achieved with the following combination:
# delete BCL files in runs older than 30 days,
# whilst retaining interops and other metadata
$ bs list runs --older-than=30d --terse | xargs -n1 bs delete run --preserve-metadata -i
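To preview what such a pipeline would do before running it for real, you can substitute the `bs list` stage with sample `--terse`-style output (one ID per line; the IDs here are made up) and prepend `echo` to the delete command:

```shell
# Dry run: made-up run IDs stand in for `bs list runs ... --terse` output.
# With echo prepended, xargs prints each command instead of executing it.
printf '111\n222\n' | xargs -n1 echo bs delete run --preserve-metadata -i
# bs delete run --preserve-metadata -i 111
# bs delete run --preserve-metadata -i 222
```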
Archiving is a Sequence Hub feature which migrates data to lower-cost cold storage, useful for data that cannot be deleted but does not need to be accessed in the near future. Archived data must be restored before it can be used either as an input to a Sequence Hub application or downloaded for local use. For more information about storage costs, see the illumina.com iCredits information page.
Use the `archive` command to send data to long-term storage:
$ bs archive run -i 123456
$ bs archive dataset -i ds.123
Use the `IsArchived` field to check whether runs or datasets have been archived:
# show all archived runs
$ bs list runs --filter-field IsArchived --filter-term true
Similar to deletion, archival will only move the `Data/` directory of run files, while InterOps and other metadata will remain accessible. For datasets, all files will be archived but metadata, including any dataset attributes, will remain available.
To regain access to archived data, use the `unarchive` command:
$ bs unarchive run -i 123456
$ bs unarchive dataset -i ds.123
Note that the restore process can take up to several days to complete. The `IsArchived` field will remain `true` until the data has been fully restored.
As with other CLI commands, it's easy to combine `archive` with the `list` command for powerful results:
# archive all runs which are over a year old
$ bs list runs --older-than 1y --terse | xargs -n1 bs archive run -i
# archive all datasets in a given project
$ bs list datasets --project-id 123456 --terse | xargs -n1 bs archive dataset -i
The analysis workflow feature of BSSH allows apps to be launched automatically when they meet a set of conditions or dependencies. The feature also allows automated quality control to be applied so that the appsession is automatically marked as "QCPassed" or "QCFailed" based on the metrics it generated.
Before an app can be launched automatically in this way, a "workflow" needs to be created which wraps the app and describes its dependencies and (optionally) any QC thresholds. The workflow takes as input a template appsession, an app launch that has been configured and launched manually using the desired settings; automated launches for this workflow will be based on these settings.
It is also possible to chain automated app launches by creating another workflow for the downstream app and creating a dependency on the upstream step.
The `MANAGE APPLICATIONS` scope is required to create and work with workflows. If you do not already have a CLI configuration with this scope, generate a new one. In the following example commands the new config is named `docs-demo`:
$ bs auth -c docs-demo --scopes 'READ GLOBAL,CREATE GLOBAL,BROWSE GLOBAL,MANAGE APPLICATIONS'
Analysis workflows are created with the `workflow create` command. The example commands below contain example values for appsession IDs and other fields, which will need to be changed for your usage.
Fixed parameters for the analysis workflow will be taken from a template appsession provided via the `--appsession-id` option. This appsession should be one previously launched with each of the settings required by your analysis workflow; for example, if you want your analysis workflow to use the hg19 human reference genome setting, that is what this template appsession must have been launched with.
$ bs -c docs-demo workflow create -n TestWorkflow -d CLICreated --appsession-id 123456789
3978975
The value returned is the ID of the newly-created workflow. You can see all available workflows using the `list applications` command:
$ bs -c docs-demo list applications --category=workflow
+--------------+---------+---------------+
| Name | Id | VersionNumber |
+--------------+---------+---------------+
| TestWorkflow | 3978975 | 1.0.0 |
+--------------+---------+---------------+
The below commands use the analysis workflow ID from the example in the previous section. For your usage you will need to supply your own analysis workflow ID.
This example will demonstrate adding two dependencies, both of which will then gate the launching of the analysis workflow:
1. A biosample with at least 1 megabase of yield
2. A completed DRAGEN Germline appsession
In order to launch the analysis workflow with a biosample input, we must add a "BioSample Yield" dependency:
$ bs -c docs-demo workflow dependency add biosample-yield --chooser-id=automation-sample-id --can-use-primary-biosample --required-yield 100000 -i 3978975
The optional `--required-yield` parameter specifies that the analysis workflow should only be launched once the biosample has reached the given yield. It's a good idea to set a required yield to prevent launching with empty or in-progress biosamples.
In this example, our analysis workflow also requires a file generated by an upstream Sequence Hub application, specifically DRAGEN Germline v3.8.4 which has application ID 11786775 on US Sequence Hub.
Look up the `AppVersionSlug` value of the upstream application you want to use:
$ bs -c docs-demo get application -i 11786775
+----------------------------+----------------------------------------------------------------------------------------+
| AppFamilySlug | illumina-inc.dragen-germline |
| AppVersionSlug | illumina-inc.dragen-germline.3.8.4 |
| Id | 11786775 |
| VersionNumber | 3.8.4 |
...
Find the name of the parameter that will accept the file from the upstream application.
$ bs launch application -i 123456 --list
+--------------------------------+-------------+---------+---------+-------------+----------+
| Option | Type | Choices | Default | Multiselect | Required |
+--------------------------------+-------------+---------+---------+-------------+----------+
...
| vcf-file | FileChooser | <File> | | false | false |
...
The input file parameter is named `vcf-file`. We can add an "App Completion" dependency to our existing workflow as follows:
$ bs -c docs-demo workflow dependency add app-completion -i 3978975 --application-id illumina-inc.dragen-germline.3.8.4 --chooser-id vcf-file --qc-pass --file-selector='.*\.vcf\.gz$|.*\.vcf$'
The regular expression supplied via the `--file-selector` argument will be used to find which single file will be used from the set of output files generated by the upstream application. This pattern must match exactly one file, otherwise the launch will fail.
The `--qc-pass` flag ensures that the dependency will only be met if the upstream appsession's automated QC has passed successfully.
Review the analysis workflow's dependencies using the `workflow dependency export` command:
$ bs -c docs-demo workflow dependency export -i 3978975
[
  {
    "Type": "BioSampleYield",
    "Attributes": {
      "BioSampleChooserId": "sample-id",
      "CanUsePrimaryBioSample": false,
      "Label": "",
      "LibraryPrepId": "",
      "MixLibraryTypesAllowed": false,
      "RequiredYield": 100000
    },
    "Dependencies": null
  },
  {
    "Type": "AppCompletion",
    "Attributes": {
      "ApplicationId": "illumina-inc.dragen-germline.3.8.4",
      "CanUsePrimaryResource": false,
      "ColumnId": "",
      "Label": "",
      "RequireQcPass": true,
      "ResourceChooserId": "vcf-file"
    },
    "Dependencies": null
  }
]
In summary, our analysis workflow now has two launch conditions:
1. A biosample with at least 1 megabase of yield
2. A completed DRAGEN Germline appsession, run on that same biosample, which has written a VCF and has passed QC
When both of those conditions are met, our analysis workflow will launch.
Most analysis workflow settings are copied from their template appsession. One exception is any user-provided files set via FileChooser app controls. These need to be set explicitly as "Specific Resource" dependencies, such that if these files are deleted the dependency will no longer be met and the analysis workflow will not launch successfully.
$ bs workflow dependency add specific-resource -i <analysisWorkflowID> --chooser-id <appOptionID> --resource-reference v1pre3/files/<fileID>
Note that this is used for files which will be fixed for every execution of the analysis workflow; for files or datasets that are output by a previous step in a chain of analysis workflows, you should instead use an App Completion dependency.
Automated QC makes use of registered dataset metrics, so thresholds can be set on any value shown by the `dataset attributes list` command. For example, to auto-QC an analysis workflow wrapping a DRAGEN Germline application, the available metrics can be viewed with:
$ bs -c docs-demo dataset attributes list -i ds.abcdef12345678910987654
+--------------------------------------------------------+-------------------+
| Name | Value |
+--------------------------------------------------------+-------------------+
| number_of_duplicate_marked_reads_pct | 12.19 |
| paired_reads_different_chromosomes_mapq_gt_eq_10_pct | 0.78 |
| pct_of_genome_with_coverage_10x_inf | 96.11 |
| secondary_alignments | 0 |
...
View the dataset type ID for a dataset using `get dataset`:
$ bs -c docs-demo get dataset -i ds.abcdef12345678910987654 -F DataSetType.Id
+----------------+---------------------------------+
| DataSetType.Id | illumina.dragen.complete.v0.3.1 |
+----------------+---------------------------------+
This type ID is used to reference dataset metrics in the threshold definitions. In this example, the thresholds are configured in the following CSV file:
$ cat /tmp/qcthresholds.csv
Name,DatasetTypeId,Operator,ThresholdValues
illumina_dragen_complete_v0_3_1.number_of_duplicate_marked_reads_pct,illumina.dragen.complete.v0.3.1,LessThan,25
illumina_dragen_complete_v0_3_1.pct_of_genome_with_coverage_10x_inf,illumina.dragen.complete.v0.3.1,GreaterThanOrEqual,90
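Before importing, a quick local sanity check (a convenience using awk, not a CLI feature) can confirm that every row has the expected four comma-separated fields:

```shell
# Recreate the example thresholds file so this check is self-contained.
printf '%s\n' \
  'Name,DatasetTypeId,Operator,ThresholdValues' \
  'illumina_dragen_complete_v0_3_1.number_of_duplicate_marked_reads_pct,illumina.dragen.complete.v0.3.1,LessThan,25' \
  'illumina_dragen_complete_v0_3_1.pct_of_genome_with_coverage_10x_inf,illumina.dragen.complete.v0.3.1,GreaterThanOrEqual,90' \
  > /tmp/qcthresholds.csv

# Flag any row without exactly four comma-separated fields; exits non-zero if found.
awk -F, 'NF != 4 { printf "bad row %d: %s\n", NR, $0; bad = 1 } END { exit bad }' /tmp/qcthresholds.csv
```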
Available operators are:
To register these thresholds with your analysis workflow, use the `workflow threshold import` command:
$ bs -c docs-demo workflow threshold import -f /tmp/qcthresholds.csv -i 3978975
To create a new instance of your analysis workflow, create a biosample with that workflow attached. Note that you need to use the project name, not the ID, here; this is to match the manifest import mechanism.
$ bs -c docs-demo biosample create -p "MyProject" -n "MyBiosample" --analysis-workflow "TestWorkflow"
$ bs -c docs-demo list biosamples --newer-than=1d
Requesting additional output fields shows you extra information about the status. Note that you need to use the biosample ID, not the name, here.
$ bs -c docs-demo appsession list --input-biosample=$BIOSAMPLEID -F Id -F Status -F StatusSummary
Uploading data against your new biosample can be carried out as shown in a previous example. This will register yield against the biosample, which can trigger an app launch if the workflow has a yield dependency.
This section provides some examples of manual app launch at the command line. This is distinct from configuring apps for automated launch, as covered in an earlier example.
The CLI parses app forms, giving the user access, via the command line, to view and configure options for the relevant app.
For a given app name and version, the `--list` flag will return a large table showing all available options for a command-line launch:
$ bs launch application -n "Whole Genome Sequencing" --app-version 7.0.1 --list
+------------------------+----------------+-----------------------------------------------------------------------------+----------------------------------------------------------------------------+-------------+----------+
| Option | Type | Choices | Default | Multiselect | Required |
+------------------------+----------------+-----------------------------------------------------------------------------+----------------------------------------------------------------------------+-------------+----------+
| app-session-name | TextBox | <String> | Example [LocalDateTime] | false | true |
| project-id | ProjectChooser | <Project> | | false | true |
| sample-id | SampleChooser | <Sample> | | true | false |
| bam-file-id | FileChooser | <File> | | true | false |
| reference-genome | Select | /data/scratch/hg19/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta, | /data/scratch/hg38/Homo_sapiens/NCBI/GRCh38Decoy/Sequence/WholeGenomeFasta | false | true |
| | | /data/scratch/GRCh37/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta, | | | |
| | | /data/scratch/hg38/Homo_sapiens/NCBI/GRCh38Decoy/Sequence/WholeGenomeFasta | | | |
| enable-variant-calling | CheckBox | 1 | 1 | false | false |
| enable-sv-calling | CheckBox | 1 | 1 | false | false |
| enable-cnv-calling | CheckBox | 1 | 1 | false | false |
| annotation-source | RadioButton | ensembl, refseq, both, none | ensembl | false | false |
+------------------------+----------------+-----------------------------------------------------------------------------+----------------------------------------------------------------------------+-------------+----------+
To launch an app, supply the app name and version along with any settings using the `--option` / `-o` flag in the format `optionName:value`. The launch command expects "New" entities for all inputs, such as biosamples and datasets rather than samples and appresults. Sequence Hub entities can be referred to in an app launch by their unique ID, while other launch arguments can accept plain text:
$ bs launch application -n 'Whole Genome Sequencing' --app-version 7.0.1 \
-o project-id:1232 -o bam-file-id:3534333 -o annotation-source:both \
-l "My test appsession"
You can also launch apps by their ID:
$ bs launch application -i 5143138 -o project-id:1232 -o bam-file-id:3534333
If an option accepts multiple arguments (check with `--list`), you can supply these as comma-separated values:
$ bs launch application -i 5143138 -o project-id:1232 -o bam-file-id:3534333,232321
In most cases, biosamples can be passed to a launch command by just their ID:
$ bs launch application -n 'Whole Genome Sequencing' --app-version 7.0.1 \
-o project-id:1232 -o sample-id:2323244
If a biosample contains FASTQ datasets with a mix of library preps, however, you will need to specify the library prep ID for the FASTQ datasets you wish to launch with, in the format:
-o sample-id:343432/librarypreps/1014015
Apps such as Tumor Normal v5 use a ResourceMatcher to submit matched pairs of WGS datasets. For CLI launch, the format for these fields is:
-o input-id:'col1_dataset1,col1_dataset2;col2_dataset1,col2_dataset2'
Here commas separate multiple inputs and a semicolon delimits a column, so the above string would render as the following table if launched through the Sequence Hub web user interface:
Normal | Tumor |
---|---|
col1_dataset1 | col2_dataset1 |
col1_dataset2 | col2_dataset2 |
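To double-check how such a string will be split, you can break it apart locally (a sketch using `tr`; each output line lists the datasets of one column):

```shell
# Semicolons delimit columns; commas separate the datasets within a column.
input='col1_dataset1,col1_dataset2;col2_dataset1,col2_dataset2'
echo "$input" | tr ';' '\n' | tr ',' ' '
# col1_dataset1 col1_dataset2
# col2_dataset1 col2_dataset2
```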
TabularFieldsets are sophisticated form controls shown as an expandable table of sub-controls when launching an application through the web user interface. To set options for this type of control through the CLI, the format is: `-o tabularFieldsetControlName.subControlName:value`
For example, VCAT v2.3.0 uses a TabularFieldset named `sample-pairs` containing a FileChooser named `file-id` and a TextBox named `file-label` (as shown by `--list`). This application can be launched as shown:
$ bs launch application -n "Variant Calling Assessment Tool" --app-version 2.3.0 \
-o sample-pairs.file-id:11232444,11232445 -o sample-pairs.file-label:vcf1,vcf2 \
...
DisplayFields are additional controls that pop up underneath SampleChoosers and AppResultChoosers; these are not yet supported by the launch API. The CLI will warn you if the app you're trying to launch uses them, for example:
$ bs launch application -n 'SPAdes Genome Assembler' --app-version 3.5.0 --list
WARNING: Input field 'sample-id' uses DisplayFields which are not yet supported, you may not be able to launch this app !
The CLI can get many entities by name and return their ID:
$ bs get project -n MyProjectName --terse
$ bs get biosample -n MyBioSampleName --terse
$ bs get dataset -n MyDatasetName --terse
Alternatively you can retrieve IDs via `list` commands, for example:
$ bs list datasets --project-name ProjectContainingMyDataset
Project and Biosample IDs are also visible in URLs when browsing Sequence Hub through the web user interface.
You can list the contents of an appresult or dataset to get the file IDs:
$ bs contents appresult -i 12224237 | head
+------------+----------------------------------------------------------------------+
| Id | FilePath |
+------------+----------------------------------------------------------------------+
| 8618632807 | Plots/s_0_1_2212_MismatchByCycle.png |
| 8618632806 | Plots/s_0_1_2113_MismatchByCycle.png |
| 8618632805 | Plots/sorted_S1_G1_chr2_MismatchByCycle.png |
| 8618632804 | Plots/s_0_1_1115_MismatchByCycle.png |
| 8618632803 | Plots/s_1_1_2108_MismatchByCycle.png |
| 8618632802 | Plots/s_0_2_1107_MismatchByCycle.png |
| 8618632801 | Plots/s_0_1_1112_NoCallByCycle.png |
You can also use BaseMount to find the file by navigating to the project and appresult which generated it:
${BASEMOUNT}/Projects/<project>/appresults/<appsession name>/Files/file.vcf
Then use the Files.metadata directory to get the ID:
$ cat ${BASEMOUNT}/Projects/<project>/appresults/<appsession name>/Files.metadata/file.vcf/.id
Analysis name is a special parameter; best practice is to set your argument through both the form control and via the `--appsession-label` / `-l` argument:
$ bs launch application -n "Whole Genome Sequencing" --app-version 7.0.1 \
-o app-session-name:"Your appsession name" -l "Your appsession name" \
...
This error is due to an outdated CLI version which is no longer compatible with application launch. Please update to the latest CLI version.