VCF Input with the ClinGen Allele Registry (CAR) API - annotation appending, JSON output, and registration of variants¶
Goals¶
- Demonstration of use cases of VCF input
- Test Case A - POST VCF - Annotation to retrieve CAIDs - Full and Filtered Column Test
- Test Case B - POST VCF - Return JSON with CAIDs and additional metadata
- Test Case C - PUT VCF - Registration of variants and retrieval of CAIDs
API Documentation¶
- Please see the API documentation for additional details
- NOTE: We encourage new users / developers to use the test server (reg.test.genome.network), which has the same functionality but dummy CAIDs. Real CAIDs will always be on reg.genome.network.
Important Note¶
- It's imperative to note that the below numbers are likely to change as data sets are updated periodically both internally and by the public
Example gnomAD Input¶
- For this example we will be using chr21 GRCh38 VCF
Download input¶
- https://gnomad.broadinstitute.org/downloads
- gnomAD v2 liftover
time wget --no-check-certificate https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/genomes/gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz
...
HTTP request sent, awaiting response... 200 OK
Length: 6197117360 (5.8G) [application/octet-stream]
Saving to: ‘gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz’
gnomad.genomes.r2.1.1.s 100%[=============================>] 5.77G 76.9MB/s in 3m 8s
2020-06-03 11:52:28 (31.4 MB/s) - ‘gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz’ saved [6197117360/6197117360]
Verify MD5 checksum¶
md5sum gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz
be37bea3ac95097021e8c130291d45de gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz
- Does this match the MD5? Yes.
- MD5: be37bea3ac95097021e8c130291d45de
- Number of starting lines
zcat gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz | wc -l
3499899
- Number of variants
time zgrep -c -v "^#" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz
3495953
Extract VCF Headers¶
- The first thing we need to do is create a valid VCF header if it does not already exist
- fileformat entry (e.g. ##fileformat=VCFv4.2)
- contig entries
- Correct chromosome notation (1-22,X,Y,M) (does not have additional text such as 'chr') (e.g. ##contig=<ID=1,length=249250621,assembly=GRCh37>)
- Valid assembly (GRCh37, GRCh38)
- Column names (e.g. #CHROM POS ID REF ALT QUAL FILTER INFO)
- For this example we will also remove other headers (##) that do not match the required entities (above)
zcat gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz | head -n100000 | grep "#[#C][fcH][ioR][lnO]" | grep -v "HLA" | grep -v "EBV" | sed 's/gnomAD\_//g' | sed 's/chr//g' | grep -v "_" > GRCh38_header.vcf
cat GRCh38_header.vcf
##fileformat=VCFv4.2
##contig=<ID=1,length=248956422,assembly=GRCh38>
##contig=<ID=2,length=242193529,assembly=GRCh38>
##contig=<ID=3,length=198295559,assembly=GRCh38>
##contig=<ID=4,length=190214555,assembly=GRCh38>
##contig=<ID=5,length=181538259,assembly=GRCh38>
##contig=<ID=6,length=170805979,assembly=GRCh38>
##contig=<ID=7,length=159345973,assembly=GRCh38>
##contig=<ID=8,length=145138636,assembly=GRCh38>
##contig=<ID=9,length=138394717,assembly=GRCh38>
##contig=<ID=10,length=133797422,assembly=GRCh38>
##contig=<ID=11,length=135086622,assembly=GRCh38>
##contig=<ID=12,length=133275309,assembly=GRCh38>
##contig=<ID=13,length=114364328,assembly=GRCh38>
##contig=<ID=14,length=107043718,assembly=GRCh38>
##contig=<ID=15,length=101991189,assembly=GRCh38>
##contig=<ID=16,length=90338345,assembly=GRCh38>
##contig=<ID=17,length=83257441,assembly=GRCh38>
##contig=<ID=18,length=80373285,assembly=GRCh38>
##contig=<ID=19,length=58617616,assembly=GRCh38>
##contig=<ID=20,length=64444167,assembly=GRCh38>
##contig=<ID=21,length=46709983,assembly=GRCh38>
##contig=<ID=22,length=50818468,assembly=GRCh38>
##contig=<ID=X,length=156040895,assembly=GRCh38>
##contig=<ID=Y,length=57227415,assembly=GRCh38>
##contig=<ID=M,length=16569,assembly=GRCh38>
#CHROM POS ID REF ALT QUAL FILTER INFO
Extract VCF Data and Prepare Input¶
- Next, we'll want to extract the variant data from this file as a precursor to splitting into smaller chunks for annotation / registration
time zgrep -v "^#" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf
wc -l gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf
3495953
- We will illustrate two approaches with their respective consequences
- Note: we will also be removing 'chr' from the chromosome value to reflect the above contig change in VCF necessary for input
Approach 1 - All columns in VCF - 100k entries test¶
#start with GRCh38 header cat GRCh38_header.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf #append 100k variants head -n100000 gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf | sed 's/^chr//g' >> gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf wc -l gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf
100027
Approach 2 - Ignoring unnecessary columns required for registration / annotation - 100k entries test¶
- This approach retains CHROM, POS, ID, REF, ALT and lists the remaining columns as '.'
#start with GRCh38 header cat GRCh38_header.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf #append 100k variants with filtered columns head -n100000 gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf | awk '{split($0,a,"\t"); print a[1] "\t" a[2] "\t" a[3] "\t" a[4] "\t" a[5] "\t.\t.\t."}' | sed 's/^chr//g' >> gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf wc -l gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf
100027
Note different sizes between two approaches¶
File | Variants | Size |
gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf | 100000 | 1.4G |
gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf | 100000 | 2.3M |
- We generally recommend that users annotate / register ~1M per batch and filter out unnecessary metadata columns if at all possible to decrease the chance of a network timeout, etc.
Test Case A - POST VCF - Annotation to retrieve CAIDs - Full and Filtered Column Test¶
Full VCF columns¶
time curl -X POST "http://reg.genome.network/annotateVcf?assembly=GRCh38&ids=CA" --data-binary @gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf.CA.vcf
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2677M 0 1339M 100 1338M 6894k 6888k 0:03:18 0:03:18 --:--:-- 24.9M
real 3m20.297s
Filtered VCF columns¶
time curl -X POST "http://reg.genome.network/annotateVcf?assembly=GRCh38&ids=CA" --data-binary @gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.CA.vcf
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 7859k 0 4509k 100 3350k 1830k 1360k 0:00:02 0:00:02 --:--:-- 3189k
real 0m2.479s
Full vs. Filtered Column VCF Comparison - 100k tests¶
Type | Variants | Starting Size | Starting rsIDs | Returned CAIDs | Time | |
Full | 100,000 | 1.4G | 98385 | 99163 | 3m20s |
Filtered | 100,000 | 2.3M | 98385 | 99163 | 2.5s |
- 100,027 lines of output in each returned set == number of input lines (essential for making sure we received the full resultant set)
- 98.4% of 100k variants has rsID in original VCF
- 99.2% of 100k variants return CAID
- 97.9% of 100k variants have both rsID and CAID
Example output¶
- Note the appended CAID in ID column
head -n38 gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.CA.vcf | grep -v "^##"
#CHROM POS ID REF ALT QUAL FILTER INFO
21 8522378 rs1316595271,CA636797537 A G . . .
21 8522386 rs1351680335,CA636797538 T C . . .
21 8522390 rs1456505548,CA636797539 T A . . .
21 8522406 rs559462325,CA317178097 G A . . .
21 8522412 rs181691356,CA317178100 C A . . .
21 8522415 rs1412016291,CA636797540 G T . . .
21 8522435 rs1286875378,CA636797541 C T . . .
21 8522442 rs1350348343,CA636797542 G C . . .
21 8522444 rs1229401566,CA914635628 T C . . .
21 8522444 rs1229401566,CA636797543 TAGATTTC T . . .
21 8522447 rs1292728344,CA636797544 A T . . .
Filtered VCF columns - full chr21¶
- Let's see how long it would take to POST the filtered VCF to get a baseline for a larger data set
- We still recommend ~1M per import of column filtered VCF inputs for more consist and reliable bulk transfers
#start with GRCh38 header cat GRCh38_header.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf #append 100k variants with filtered columns time awk '{split($0,a,"\t"); print a[1] "\t" a[2] "\t" a[3] "\t" a[4] "\t" a[5] "\t.\t.\t."}' gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf | sed 's/^chr//g' >> gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf wc -l gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf
28m14.658s
3495980
time curl -X POST "http://reg.genome.network/annotateVcf?assembly=GRCh38&ids=CA" --data-binary @gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf.CA.vcf
bg. % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 273M 0 156M 100 116M 4418k 3297k 0:00:36 0:00:36 --:--:-- 9193k
real 0m36.371s
One more comparison - POSTing the full chr21 filtered data set¶
Type | Variants | Starting Size | Starting rsIDs | Returned CAIDs | Time | |
Full | 100,000 | 1.4G | 98385 | 99163 | 3m20s |
Filtered | 100,000 | 2.3M | 98385 | 99163 | 2.5s | |||||||
Full | 3,495,953 | 48G | 3423534 | NA | NA | |||||||
Filtered | 3,495,953 | 117M | 3423534 | 3493894 | 36s |
- We observed a POST rate of 2.5s for 100,000 entries, if that holds we should expect ~3.5M to take ~90s
- We observed a POST rate of ~1s for 100,000 entries for the ~3.5M set
- 99.94% of entries for the full dataset return CAIDs
- As compared to 97.9% of entries that contained rsIDs in the original data
- I did not attempt to import the full column full data set totaling 48G (NA values). If this was required I would resort to the recommended limit of ~1M filtered columns. If non-filtered columns are used, I would recommend decreasing the number to the low hundred megabytes.
Test Case B - POST VCF - Return JSON with CAIDs and additional metadata¶
- An alternate way of retrieving CAIDs from the CAR is to do a POST of gnomAD VCF to retrieve JSON response
- We will start with the input used above and will use the filtered column VCF (smaller, quicker, and same data is used as input as full column VCF)
JSON - CAIDs only¶
time curl -X POST "http://reg.genome.network/alleles?file=vcf&fields=none+@id" --data-binary @gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 9273k 0 5922k 100 3350k 1979k 1119k 0:00:02 0:00:02 --:--:-- 3098k
real 0m3.007s
- Non-error entities returned
grep -c "^\ \ \"@id\"" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json
100000
- CAIDs returned
grep -c allele/CA gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json
99163
#same number as annotating the VCF
- Example output
head gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json
[{
"@id": "http://reg.genome.network/allele/CA636797537"
}
,{
"@id": "http://reg.genome.network/allele/CA636797538"
}
,{
"@id": "http://reg.genome.network/allele/CA636797539"
}
,{
JSON - All metadata¶
- Note difference in POST URL
time curl -X POST "http://reg.genome.network/alleles?file=vcf&fields=all" --data-binary @gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.all_fields.json
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 180M 0 177M 100 3350k 4223k 79628 0:00:43 0:00:43 --:--:-- 16.4M
real 0m43.107s
- Non-error entities returned
grep -c "^\ \ \"@id\"" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.all_fields.json
100000
- CAIDs returned
grep -c allele/CA gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.all_fields.json
99163
#same number as annotating the VCF and just CAIDs JSON
- Example output
head -n25 gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.all_fields.json
[{
"@context": "http://reg.genome.network/schema/allele.jsonld",
"@id": "http://reg.genome.network/allele/CA636797537",
"externalRecords": {
"MyVariantInfo_hg19": [ {
"@id": "http://myvariant.info/v1/variant/chr21:g.9411211A>G?assembly=hg19",
"id": "chr21:g.9411211A>G"
}
],
"MyVariantInfo_hg38": [ {
"@id": "http://myvariant.info/v1/variant/chr21:g.8522378A>G?assembly=hg38",
"id": "chr21:g.8522378A>G"
}
],
"gnomAD": [ {
"@id": "http://gnomad.broadinstitute.org/variant/21-9411211-A-G",
"id": "21-9411211-A-G",
"variant": "21:9411211 A / G"
}
]
},
"genomicAlleles": [
Comparison of VCF => JSON payloads¶
Type | CAIDs returned | Output Size | Time | |||||
CAIDs Only | 99,163 | 5.8M | 3s | |
All Fields | 99,163 | 178M | 43s |
- As above, we see a similar trend for input / output sizes and run time, which should be considered when utilizing large data sets
- Can we account for any errors? Yes, but none seen in this 100k data set.
- grep -c "error" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json
- 0
- We expect to see (# non-error entities (^\ \ @id) + errors = total input size) if the full payload is consumed and returned
- 0
- grep -c "error" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json
Illustrative example - (1) Error from gnomAD GRCh37 input (not from above work)¶
#Example error: ,{ "description": "Given allele cannot be mapped in consistent way to reference genome.", "errorType": "NoConsistentAlignment", "inputLine": "21\t9425293\trs368989897\tGC\tG\t.\t.\t.", "message": "Cannot align NC_000021.8 [9425293,9425295). ", "region": "[9425293,9425295)", "sequenceName": "NC_000021.8" }
- Can we explain this type of error?
- Let's start with rs368989897 through gnomAD
- https://gnomad.broadinstitute.org/variant/21-9425293-GC-G?dataset=gnomad_r2_1
- Deletion: 21-9425293-GC-G (GRCh37)
- https://gnomad.broadinstitute.org/variant/21-8536460-GC-G?dataset=gnomad_r3
- Deletion: 21-8536460-GC-G (GRCh38)
- https://gnomad.broadinstitute.org/variant/21-9425293-GC-G?dataset=gnomad_r2_1
- Do we have rs368989897 registered via dbSNP? Yes for GRCh38.
- http://reg.genome.network/redmine/projects/registry/genboree_registry/by_caid?caid=CA317179657
- NC_000021.9:g.8536462del , CM000683.2:g.8536462del
- We do not see the mapping to GRCh37 on this page
- http://reg.genome.network/redmine/projects/registry/genboree_registry/by_caid?caid=CA317179657
- Observations
- We see a few % that do not have consistent alignments from different assemblies, but at least for this case we have registered the variant through dbSNP and anticipate being able to register this variant with the gnomAD v3 (GRCh38) data, which should lower the % failed for this class of error
- Let's start with rs368989897 through gnomAD
Test Case C - PUT VCF - Registration of variants and retrieval of CAIDs¶
Available scripts¶
- We provide 3 example scripts that allow users to register (PUT) variants that can be used as a stand-alone script or mimicked in custom pipelines
- http://reg.genome.network/doc/scripts/
- request_with_payload.py
- request_with_payload.rb
- request_with_payload.sh
- http://reg.genome.network/doc/scripts/
- Let's use the shell script as an example
- Copy the script to your directory
- Make it executable
- e.g. chmod 755 request_with_payload.sh
Create an account if you do not already have one¶
Check usage¶
./request_with_payload.sh
Incorrect parameters!
This program sends POST or PUT request with payload and print the response.
Parameters for POST request: URL file_with_payload
Parameters for PUT request: URL file_with_payload login password
Register variants¶
time ./request_with_payload.sh "http://reg.genome.network/alleles?file=vcf&fields=none+@id" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf <your_user_name> <your_password> > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf.PUT.json
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 320M 0 203M 100 116M 3953k 2271k 0:00:52 0:00:52 --:--:-- 6226k
real 0m52.787s
grep -c allele/CA gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf.PUT.json
3495953