VCF Input with the ClinGen Allele Registry (CAR) API - annotation appending, JSON output, and registration of variants

Goals

  • Demonstration of use cases of VCF input
    • Test Case A - POST VCF - Annotation to retrieve CAIDs - Full and Filtered Column Test
    • Test Case B - POST VCF - Return JSON with CAIDs and additional metadata
    • Test Case C - PUT VCF - Registration of variants and retrieval of CAIDs

API Documentation

Important Note

  • It's imperative to note that the below numbers are likely to change as data sets are updated periodically both internally and by the public

Example gnomAD Input

  • For this example we will be using chr21 GRCh38 VCF

Download input

time wget --no-check-certificate https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/genomes/gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz

...
HTTP request sent, awaiting response... 200 OK
Length: 6197117360 (5.8G) [application/octet-stream]
Saving to: ‘gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz’
gnomad.genomes.r2.1.1.s 100%[=============================>] 5.77G 76.9MB/s in 3m 8s
2020-06-03 11:52:28 (31.4 MB/s) - ‘gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz’ saved [6197117360/6197117360]

Verify MD5 checksum

md5sum gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz

be37bea3ac95097021e8c130291d45de gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz

  • Does this match the MD5? Yes.
    • MD5: be37bea3ac95097021e8c130291d45de
  • Number of starting lines
zcat gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz | wc -l

3499899

  • Number of variants
time zgrep -c -v "^#" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz

3495953

Extract VCF Headers

  • The first thing we need to do is create a valid VCF header if it does not already exist
    • fileformat entry (e.g. ##fileformat=VCFv4.2)
    • contig entries
      • Correct chromosome notation (1-22,X,Y,M) (does not have additional text such as 'chr') (e.g. ##contig=<ID=1,length=249250621,assembly=GRCh37>)
      • Valid assembly (GRCh37, GRCh38)
    • Column names (e.g. #CHROM POS ID REF ALT QUAL FILTER INFO)
    • For this example we will also remove other headers (##) that do not match the required entities (above)
zcat gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz | head -n100000 | grep "#[#C][fcH][ioR][lnO]" | grep -v "HLA" | grep -v "EBV" | sed 's/gnomAD\_//g' | sed 's/chr//g' | grep -v "_" > GRCh38_header.vcf
cat GRCh38_header.vcf

##fileformat=VCFv4.2
##contig=<ID=1,length=248956422,assembly=GRCh38>
##contig=<ID=2,length=242193529,assembly=GRCh38>
##contig=<ID=3,length=198295559,assembly=GRCh38>
##contig=<ID=4,length=190214555,assembly=GRCh38>
##contig=<ID=5,length=181538259,assembly=GRCh38>
##contig=<ID=6,length=170805979,assembly=GRCh38>
##contig=<ID=7,length=159345973,assembly=GRCh38>
##contig=<ID=8,length=145138636,assembly=GRCh38>
##contig=<ID=9,length=138394717,assembly=GRCh38>
##contig=<ID=10,length=133797422,assembly=GRCh38>
##contig=<ID=11,length=135086622,assembly=GRCh38>
##contig=<ID=12,length=133275309,assembly=GRCh38>
##contig=<ID=13,length=114364328,assembly=GRCh38>
##contig=<ID=14,length=107043718,assembly=GRCh38>
##contig=<ID=15,length=101991189,assembly=GRCh38>
##contig=<ID=16,length=90338345,assembly=GRCh38>
##contig=<ID=17,length=83257441,assembly=GRCh38>
##contig=<ID=18,length=80373285,assembly=GRCh38>
##contig=<ID=19,length=58617616,assembly=GRCh38>
##contig=<ID=20,length=64444167,assembly=GRCh38>
##contig=<ID=21,length=46709983,assembly=GRCh38>
##contig=<ID=22,length=50818468,assembly=GRCh38>
##contig=<ID=X,length=156040895,assembly=GRCh38>
##contig=<ID=Y,length=57227415,assembly=GRCh38>
##contig=<ID=M,length=16569,assembly=GRCh38>
#CHROM POS ID REF ALT QUAL FILTER INFO

Extract VCF Data and Prepare Input

  • Next, we'll want to extract the variant data from this file as a precursor to splitting into smaller chunks for annotation / registration
time zgrep -v "^#" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf
wc -l gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf

3495953

  • We will illustrate two approaches with their respective consequences
    • Note: we will also be removing 'chr' from the chromosome value to reflect the above contig change in VCF necessary for input

Approach 1 - All columns in VCF - 100k entries test

#start with GRCh38 header
cat GRCh38_header.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf
#append 100k variants
head -n100000 gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf | sed 's/^chr//g' >> gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf
wc -l gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf

100027

Approach 2 - Ignoring unnecessary columns required for registration / annotation - 100k entries test

  • This approach retains CHROM, POS, ID, REF, ALT and lists the remaining columns as '.'
#start with GRCh38 header
cat GRCh38_header.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf
#append 100k variants with filtered columns
head -n100000 gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf | awk '{split($0,a,"\t"); print a[1] "\t" a[2] "\t" a[3] "\t" a[4] "\t" a[5] "\t.\t.\t."}' | sed 's/^chr//g' >> gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf
wc -l gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf

100027

Note different sizes between two approaches

File Variants Size
gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf 100000 1.4G
gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf 100000 2.3M
  • We generally recommend that users annotate / register ~1M per batch and filter out unnecessary metadata columns if at all possible to decrease the chance of a network timeout, etc.

Test Case A - POST VCF - Annotation to retrieve CAIDs - Full and Filtered Column Test

Full VCF columns

time curl -X POST "http://reg.genome.network/annotateVcf?assembly=GRCh38&ids=CA" --data-binary @gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf.CA.vcf

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2677M 0 1339M 100 1338M 6894k 6888k 0:03:18 0:03:18 --:--:-- 24.9M
real 3m20.297s

Filtered VCF columns

time curl -X POST "http://reg.genome.network/annotateVcf?assembly=GRCh38&ids=CA" --data-binary @gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.CA.vcf

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 7859k 0 4509k 100 3350k 1830k 1360k 0:00:02 0:00:02 --:--:-- 3189k
real 0m2.479s

Full vs. Filtered Column VCF Comparison - 100k tests

Type Variants Starting Size Starting rsIDs Returned CAIDs Time
Full 100,000 1.4G 98385 99163 3m20s
Filtered 100,000 2.3M 98385 99163 2.5s
  • 100,027 lines of output in each returned set == number of input lines (essential for making sure we received the full resultant set)
  • 98.4% of 100k variants has rsID in original VCF
  • 99.2% of 100k variants return CAID
  • 97.9% of 100k variants have both rsID and CAID

Example output

  • Note the appended CAID in ID column
head -n38 gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.CA.vcf | grep -v "^##" 

#CHROM POS ID REF ALT QUAL FILTER INFO
21 8522378 rs1316595271,CA636797537 A G . . .
21 8522386 rs1351680335,CA636797538 T C . . .
21 8522390 rs1456505548,CA636797539 T A . . .
21 8522406 rs559462325,CA317178097 G A . . .
21 8522412 rs181691356,CA317178100 C A . . .
21 8522415 rs1412016291,CA636797540 G T . . .
21 8522435 rs1286875378,CA636797541 C T . . .
21 8522442 rs1350348343,CA636797542 G C . . .
21 8522444 rs1229401566,CA914635628 T C . . .
21 8522444 rs1229401566,CA636797543 TAGATTTC T . . .
21 8522447 rs1292728344,CA636797544 A T . . .

Filtered VCF columns - full chr21

  • Let's see how long it would take to POST the filtered VCF to get a baseline for a larger data set
    • We still recommend ~1M per import of column filtered VCF inputs for more consist and reliable bulk transfers
#start with GRCh38 header
cat GRCh38_header.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf
#append 100k variants with filtered columns
time awk '{split($0,a,"\t"); print a[1] "\t" a[2] "\t" a[3] "\t" a[4] "\t" a[5] "\t.\t.\t."}' gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf |  sed 's/^chr//g' >> gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf
wc -l gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf

28m14.658s
3495980

time curl -X POST "http://reg.genome.network/annotateVcf?assembly=GRCh38&ids=CA" --data-binary @gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf.CA.vcf

bg. % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 273M 0 156M 100 116M 4418k 3297k 0:00:36 0:00:36 --:--:-- 9193k
real 0m36.371s

One more comparison - POSTing the full chr21 filtered data set

Type Variants Starting Size Starting rsIDs Returned CAIDs Time
Full 100,000 1.4G 98385 99163 3m20s
Filtered 100,000 2.3M 98385 99163 2.5s
Full 3,495,953 48G 3423534 NA NA
Filtered 3,495,953 117M 3423534 3493894 36s
  • We observed a POST rate of 2.5s for 100,000 entries, if that holds we should expect ~3.5M to take ~90s
    • We observed a POST rate of ~1s for 100,000 entries for the ~3.5M set
  • 99.94% of entries for the full dataset return CAIDs
    • As compared to 97.9% of entries that contained rsIDs in the original data
  • I did not attempt to import the full column full data set totaling 48G (NA values). If this was required I would resort to the recommended limit of ~1M filtered columns. If non-filtered columns are used, I would recommend decreasing the number to the low hundred megabytes.

Test Case B - POST VCF - Return JSON with CAIDs and additional metadata

  • An alternate way of retrieving CAIDs from the CAR is to do a POST of gnomAD VCF to retrieve JSON response
  • We will start with the input used above and will use the filtered column VCF (smaller, quicker, and same data is used as input as full column VCF)

JSON - CAIDs only

time curl -X POST "http://reg.genome.network/alleles?file=vcf&fields=none+@id" --data-binary @gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 9273k 0 5922k 100 3350k 1979k 1119k 0:00:02 0:00:02 --:--:-- 3098k
real 0m3.007s

  • Non-error entities returned
grep -c "^\ \ \"@id\"" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json

100000

  • CAIDs returned
grep -c allele/CA gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json

99163
#same number as annotating the VCF

  • Example output
head gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json

[{
"@id": "http://reg.genome.network/allele/CA636797537"
}
,{
"@id": "http://reg.genome.network/allele/CA636797538"
}
,{
"@id": "http://reg.genome.network/allele/CA636797539"
}
,{

JSON - All metadata

  • Note difference in POST URL
time curl -X POST "http://reg.genome.network/alleles?file=vcf&fields=all" --data-binary @gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.all_fields.json

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 180M 0 177M 100 3350k 4223k 79628 0:00:43 0:00:43 --:--:-- 16.4M
real 0m43.107s

  • Non-error entities returned
grep -c "^\ \ \"@id\"" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.all_fields.json

100000

  • CAIDs returned
grep -c allele/CA gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.all_fields.json

99163
#same number as annotating the VCF and just CAIDs JSON

  • Example output
head -n25 gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.all_fields.json

[{
"@context": "http://reg.genome.network/schema/allele.jsonld",
"@id": "http://reg.genome.network/allele/CA636797537",
"externalRecords": {
"MyVariantInfo_hg19": [ {
"@id": "http://myvariant.info/v1/variant/chr21:g.9411211A>G?assembly=hg19",
"id": "chr21:g.9411211A>G"
}
],
"MyVariantInfo_hg38": [ {
"@id": "http://myvariant.info/v1/variant/chr21:g.8522378A>G?assembly=hg38",
"id": "chr21:g.8522378A>G"
}
],
"gnomAD": [ {
"@id": "http://gnomad.broadinstitute.org/variant/21-9411211-A-G",
"id": "21-9411211-A-G",
"variant": "21:9411211 A / G"
}
]
},
"genomicAlleles": [

Comparison of VCF => JSON payloads

Type CAIDs returned Output Size Time
CAIDs Only 99,163 5.8M 3s
All Fields 99,163 178M 43s
  • As above, we see a similar trend for input / output sizes and run time, which should be considered when utilizing large data sets
  • Can we account for any errors? Yes, but none seen in this 100k data set.
    • grep -c "error" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json
      • 0
        • We expect to see (# non-error entities (^\ \ @id) + errors = total input size) if the full payload is consumed and returned

Illustrative example - (1) Error from gnomAD GRCh37 input (not from above work)

#Example error:
,{
  "description": "Given allele cannot be mapped in consistent way to reference genome.",
  "errorType": "NoConsistentAlignment",
  "inputLine": "21\t9425293\trs368989897\tGC\tG\t.\t.\t.",
  "message": "Cannot align NC_000021.8 [9425293,9425295). ",
  "region": "[9425293,9425295)",
  "sequenceName": "NC_000021.8" 
}

Test Case C - PUT VCF - Registration of variants and retrieval of CAIDs

Available scripts

  • We provide 3 example scripts that allow users to register (PUT) variants that can be used as a stand-alone script or mimicked in custom pipelines
  • Let's use the shell script as an example
    • Copy the script to your directory
    • Make it executable
      • e.g. chmod 755 request_with_payload.sh

Create an account if you do not already have one

Check usage

./request_with_payload.sh

Incorrect parameters!
This program sends POST or PUT request with payload and print the response.
Parameters for POST request: URL file_with_payload
Parameters for PUT request: URL file_with_payload login password

Register variants

time ./request_with_payload.sh "http://reg.genome.network/alleles?file=vcf&fields=none+@id" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf <your_user_name> <your_password> > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf.PUT.json

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 320M 0 203M 100 116M 3953k 2271k 0:00:52 0:00:52 --:--:-- 6226k
real 0m52.787s

grep -c allele/CA gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf.PUT.json

3495953