VCF Input with the ClinGen Allele Registry (CAR) API - annotation appending, JSON output, and registration of variants¶

Goals¶

Demonstration of use cases of VCF input
- Test Case A - POST VCF - Annotation to retrieve CAIDs - Full and Filtered Column Test
- Test Case B - POST VCF - Return JSON with CAIDs and additional metadata
- Test Case C - PUT VCF - Registration of variants and retrieval of CAIDs

API Documentation¶

Please see the API documentation for additional details
- http://reg.genome.network/doc
  - http://reg.genome.network/doc/AlleleRegistry_1.01.xx_api_v1.pdf
NOTE: We encourage new users / developers to use the test server (reg.test.genome.network), which has the same functionality but dummy CAIDs. Real CAIDs will always be on reg.genome.network.

Important Note¶

It's imperative to note that the below numbers are likely to change as data sets are updated periodically both internally and by the public

Example gnomAD Input¶

For this example we will be using chr21 GRCh38 VCF

Download input¶

https://gnomad.broadinstitute.org/downloads
- gnomAD v2 liftover
  - chr21 sites VCF

time wget --no-check-certificate https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/genomes/gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz

...
HTTP request sent, awaiting response... 200 OK
Length: 6197117360 (5.8G) [application/octet-stream]
Saving to: ‘gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz’
gnomad.genomes.r2.1.1.s 100%[=============================>] 5.77G 76.9MB/s in 3m 8s
2020-06-03 11:52:28 (31.4 MB/s) - ‘gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz’ saved [6197117360/6197117360]

Verify MD5 checksum¶

md5sum gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz

be37bea3ac95097021e8c130291d45de gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz

Does this match the MD5? Yes.
- MD5: be37bea3ac95097021e8c130291d45de

Number of starting lines

zcat gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz | wc -l

3499899

Number of variants

time zgrep -c -v "^#" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz

3495953

Extract VCF Headers¶

The first thing we need to do is create a valid VCF header if it does not already exist
- fileformat entry (e.g. ##fileformat=VCFv4.2)
- contig entries
  - Correct chromosome notation (1-22,X,Y,M) (does not have additional text such as 'chr') (e.g. ##contig=<ID=1,length=249250621,assembly=GRCh37>)
  - Valid assembly (GRCh37, GRCh38)
- Column names (e.g. #CHROM POS ID REF ALT QUAL FILTER INFO)
- For this example we will also remove other headers (##) that do not match the required entities (above)

zcat gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz | head -n100000 | grep "#[#C][fcH][ioR][lnO]" | grep -v "HLA" | grep -v "EBV" | sed 's/gnomAD\_//g' | sed 's/chr//g' | grep -v "_" > GRCh38_header.vcf

cat GRCh38_header.vcf

##fileformat=VCFv4.2
##contig=<ID=1,length=248956422,assembly=GRCh38>
##contig=<ID=2,length=242193529,assembly=GRCh38>
##contig=<ID=3,length=198295559,assembly=GRCh38>
##contig=<ID=4,length=190214555,assembly=GRCh38>
##contig=<ID=5,length=181538259,assembly=GRCh38>
##contig=<ID=6,length=170805979,assembly=GRCh38>
##contig=<ID=7,length=159345973,assembly=GRCh38>
##contig=<ID=8,length=145138636,assembly=GRCh38>
##contig=<ID=9,length=138394717,assembly=GRCh38>
##contig=<ID=10,length=133797422,assembly=GRCh38>
##contig=<ID=11,length=135086622,assembly=GRCh38>
##contig=<ID=12,length=133275309,assembly=GRCh38>
##contig=<ID=13,length=114364328,assembly=GRCh38>
##contig=<ID=14,length=107043718,assembly=GRCh38>
##contig=<ID=15,length=101991189,assembly=GRCh38>
##contig=<ID=16,length=90338345,assembly=GRCh38>
##contig=<ID=17,length=83257441,assembly=GRCh38>
##contig=<ID=18,length=80373285,assembly=GRCh38>
##contig=<ID=19,length=58617616,assembly=GRCh38>
##contig=<ID=20,length=64444167,assembly=GRCh38>
##contig=<ID=21,length=46709983,assembly=GRCh38>
##contig=<ID=22,length=50818468,assembly=GRCh38>
##contig=<ID=X,length=156040895,assembly=GRCh38>
##contig=<ID=Y,length=57227415,assembly=GRCh38>
##contig=<ID=M,length=16569,assembly=GRCh38>
#CHROM POS ID REF ALT QUAL FILTER INFO

Extract VCF Data and Prepare Input¶

Next, we'll want to extract the variant data from this file as a precursor to splitting into smaller chunks for annotation / registration

time zgrep -v "^#" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf

wc -l gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf

3495953

We will illustrate two approaches with their respective consequences
- Note: we will also be removing 'chr' from the chromosome value to reflect the above contig change in VCF necessary for input

Approach 1 - All columns in VCF - 100k entries test¶

#start with GRCh38 header
cat GRCh38_header.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf
#append 100k variants
head -n100000 gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf | sed 's/^chr//g' >> gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf
wc -l gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf

100027

Approach 2 - Ignoring unnecessary columns required for registration / annotation - 100k entries test¶

This approach retains CHROM, POS, ID, REF, ALT and lists the remaining columns as '.'

#start with GRCh38 header
cat GRCh38_header.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf
#append 100k variants with filtered columns
head -n100000 gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf | awk '{split($0,a,"\t"); print a[1] "\t" a[2] "\t" a[3] "\t" a[4] "\t" a[5] "\t.\t.\t."}' | sed 's/^chr//g' >> gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf
wc -l gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf

100027

Note different sizes between two approaches¶

File	Variants	Size
gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf	100000	1.4G
gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf	100000	2.3M

We generally recommend that users annotate / register ~1M per batch and filter out unnecessary metadata columns if at all possible to decrease the chance of a network timeout, etc.

Test Case A - POST VCF - Annotation to retrieve CAIDs - Full and Filtered Column Test¶

Full VCF columns¶

time curl -X POST "http://reg.genome.network/annotateVcf?assembly=GRCh38&ids=CA" --data-binary @gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.full.vcf.CA.vcf

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2677M 0 1339M 100 1338M 6894k 6888k 0:03:18 0:03:18 --:--:-- 24.9M
real 3m20.297s

Filtered VCF columns¶

time curl -X POST "http://reg.genome.network/annotateVcf?assembly=GRCh38&ids=CA" --data-binary @gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.CA.vcf

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 7859k 0 4509k 100 3350k 1830k 1360k 0:00:02 0:00:02 --:--:-- 3189k
real 0m2.479s

Full vs. Filtered Column VCF Comparison - 100k tests¶

Type

Variants

Starting Size

Starting rsIDs

Returned CAIDs

Time

Full

100,000

1.4G

98385

99163

3m20s

Filtered

100,000

2.3M

98385

99163

2.5s

100,027 lines of output in each returned set == number of input lines (essential for making sure we received the full resultant set)
98.4% of 100k variants has rsID in original VCF
99.2% of 100k variants return CAID
97.9% of 100k variants have both rsID and CAID

Example output¶

Note the appended CAID in ID column

head -n38 gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.CA.vcf | grep -v "^##"

#CHROM POS ID REF ALT QUAL FILTER INFO
21 8522378 rs1316595271,CA636797537 A G . . .
21 8522386 rs1351680335,CA636797538 T C . . .
21 8522390 rs1456505548,CA636797539 T A . . .
21 8522406 rs559462325,CA317178097 G A . . .
21 8522412 rs181691356,CA317178100 C A . . .
21 8522415 rs1412016291,CA636797540 G T . . .
21 8522435 rs1286875378,CA636797541 C T . . .
21 8522442 rs1350348343,CA636797542 G C . . .
21 8522444 rs1229401566,CA914635628 T C . . .
21 8522444 rs1229401566,CA636797543 TAGATTTC T . . .
21 8522447 rs1292728344,CA636797544 A T . . .

Filtered VCF columns - full chr21¶

Let's see how long it would take to POST the filtered VCF to get a baseline for a larger data set
- We still recommend ~1M per import of column filtered VCF inputs for more consist and reliable bulk transfers

#start with GRCh38 header
cat GRCh38_header.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf
#append 100k variants with filtered columns
time awk '{split($0,a,"\t"); print a[1] "\t" a[2] "\t" a[3] "\t" a[4] "\t" a[5] "\t.\t.\t."}' gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf |  sed 's/^chr//g' >> gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf
wc -l gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf

28m14.658s
3495980

time curl -X POST "http://reg.genome.network/annotateVcf?assembly=GRCh38&ids=CA" --data-binary @gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf.CA.vcf

bg. % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 273M 0 156M 100 116M 4418k 3297k 0:00:36 0:00:36 --:--:-- 9193k
real 0m36.371s

One more comparison - POSTing the full chr21 filtered data set¶

Type

Variants

Starting Size

Starting rsIDs

Returned CAIDs

Time

Full

100,000

1.4G

98385

99163

3m20s

Filtered

100,000

2.3M

98385

99163

2.5s

Full

3,495,953

48G

3423534

Filtered

3,495,953

117M

3423534

3493894

36s

We observed a POST rate of 2.5s for 100,000 entries, if that holds we should expect ~3.5M to take ~90s
- We observed a POST rate of ~1s for 100,000 entries for the ~3.5M set
99.94% of entries for the full dataset return CAIDs
- As compared to 97.9% of entries that contained rsIDs in the original data
I did not attempt to import the full column full data set totaling 48G (NA values). If this was required I would resort to the recommended limit of ~1M filtered columns. If non-filtered columns are used, I would recommend decreasing the number to the low hundred megabytes.

Test Case B - POST VCF - Return JSON with CAIDs and additional metadata¶

An alternate way of retrieving CAIDs from the CAR is to do a POST of gnomAD VCF to retrieve JSON response
We will start with the input used above and will use the filtered column VCF (smaller, quicker, and same data is used as input as full column VCF)

JSON - CAIDs only¶

time curl -X POST "http://reg.genome.network/alleles?file=vcf&fields=none+@id" --data-binary @gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 9273k 0 5922k 100 3350k 1979k 1119k 0:00:02 0:00:02 --:--:-- 3098k
real 0m3.007s

Non-error entities returned

grep -c "^\ \ \"@id\"" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json

100000

CAIDs returned

grep -c allele/CA gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json

99163
#same number as annotating the VCF

Example output

head gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json

[{
"@id": "http://reg.genome.network/allele/CA636797537"
}
,{
"@id": "http://reg.genome.network/allele/CA636797538"
}
,{
"@id": "http://reg.genome.network/allele/CA636797539"
}
,{

JSON - All metadata¶

Note difference in POST URL

time curl -X POST "http://reg.genome.network/alleles?file=vcf&fields=all" --data-binary @gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.all_fields.json

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 180M 0 177M 100 3350k 4223k 79628 0:00:43 0:00:43 --:--:-- 16.4M
real 0m43.107s

Non-error entities returned

grep -c "^\ \ \"@id\"" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.all_fields.json

100000

CAIDs returned

grep -c allele/CA gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.all_fields.json

99163
#same number as annotating the VCF and just CAIDs JSON

Example output

head -n25 gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.all_fields.json

[{
"@context": "http://reg.genome.network/schema/allele.jsonld",
"@id": "http://reg.genome.network/allele/CA636797537",
"externalRecords": {
"MyVariantInfo_hg19": [ {
"@id": "http://myvariant.info/v1/variant/chr21:g.9411211A>G?assembly=hg19",
"id": "chr21:g.9411211A>G"
}
],
"MyVariantInfo_hg38": [ {
"@id": "http://myvariant.info/v1/variant/chr21:g.8522378A>G?assembly=hg38",
"id": "chr21:g.8522378A>G"
}
],
"gnomAD": [ {
"@id": "http://gnomad.broadinstitute.org/variant/21-9411211-A-G",
"id": "21-9411211-A-G",
"variant": "21:9411211 A / G"
}
]
},
"genomicAlleles": [

Comparison of VCF => JSON payloads¶

Type	CAIDs returned	Output Size	Time
CAIDs Only	99,163	5.8M	3s		All Fields	99,163	178M	43s

As above, we see a similar trend for input / output sizes and run time, which should be considered when utilizing large data sets

Can we account for any errors? Yes, but none seen in this 100k data set.
- grep -c "error" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.100k.filtered.vcf.json
  - 0
    - We expect to see (# non-error entities (^\ \ @id) + errors = total input size) if the full payload is consumed and returned

Illustrative example - (1) Error from gnomAD GRCh37 input (not from above work)¶

#Example error:
,{
  "description": "Given allele cannot be mapped in consistent way to reference genome.",
  "errorType": "NoConsistentAlignment",
  "inputLine": "21\t9425293\trs368989897\tGC\tG\t.\t.\t.",
  "message": "Cannot align NC_000021.8 [9425293,9425295). ",
  "region": "[9425293,9425295)",
  "sequenceName": "NC_000021.8" 
}

Can we explain this type of error?
- Let's start with rs368989897 through gnomAD
  - https://gnomad.broadinstitute.org/variant/21-9425293-GC-G?dataset=gnomad_r2_1
    - Deletion: 21-9425293-GC-G (GRCh37)
  - https://gnomad.broadinstitute.org/variant/21-8536460-GC-G?dataset=gnomad_r3
    - Deletion: 21-8536460-GC-G (GRCh38)
- Do we have rs368989897 registered via dbSNP? Yes for GRCh38.
  - http://reg.genome.network/redmine/projects/registry/genboree_registry/by_caid?caid=CA317179657
    - NC_000021.9:g.8536462del , CM000683.2:g.8536462del
    - We do not see the mapping to GRCh37 on this page
- Observations
  - We see a few % that do not have consistent alignments from different assemblies, but at least for this case we have registered the variant through dbSNP and anticipate being able to register this variant with the gnomAD v3 (GRCh38) data, which should lower the % failed for this class of error

Test Case C - PUT VCF - Registration of variants and retrieval of CAIDs¶

Available scripts¶

We provide 3 example scripts that allow users to register (PUT) variants that can be used as a stand-alone script or mimicked in custom pipelines
- http://reg.genome.network/doc/scripts/
  - request_with_payload.py
  - request_with_payload.rb
  - request_with_payload.sh

Let's use the shell script as an example
- Copy the script to your directory
- Make it executable
  - e.g. chmod 755 request_with_payload.sh

Create an account if you do not already have one¶

http://reg.genome.network/redmine/projects/clingen_users/register_clingen_user/cg_users/new

Check usage¶

./request_with_payload.sh

Incorrect parameters!
This program sends POST or PUT request with payload and print the response.
Parameters for POST request: URL file_with_payload
Parameters for PUT request: URL file_with_payload login password

Register variants¶

time ./request_with_payload.sh "http://reg.genome.network/alleles?file=vcf&fields=none+@id" gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf <your_user_name> <your_password> > gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf.PUT.json

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 320M 0 203M 100 116M 3953k 2271k 0:00:52 0:00:52 --:--:-- 6226k
real 0m52.787s

grep -c allele/CA gnomad.genomes.r2.1.1.sites.21.liftover_grch38.vcf.bgz.variants_only.vcf.all.filtered.vcf.PUT.json

3495953

ClinGen Allele Registry

Wiki

VCF Input with the ClinGen Allele Registry (CAR) API - annotation appending, JSON output, and registration of variants¶

Goals¶

API Documentation¶

Important Note¶

Example gnomAD Input¶

Download input¶

Verify MD5 checksum¶

Extract VCF Headers¶

Extract VCF Data and Prepare Input¶

Approach 1 - All columns in VCF - 100k entries test¶

Approach 2 - Ignoring unnecessary columns required for registration / annotation - 100k entries test¶

Note different sizes between two approaches¶

Test Case A - POST VCF - Annotation to retrieve CAIDs - Full and Filtered Column Test¶

Full VCF columns¶

Filtered VCF columns¶

Full vs. Filtered Column VCF Comparison - 100k tests¶

Example output¶

Filtered VCF columns - full chr21¶

One more comparison - POSTing the full chr21 filtered data set¶

Test Case B - POST VCF - Return JSON with CAIDs and additional metadata¶

JSON - CAIDs only¶

JSON - All metadata¶

Comparison of VCF => JSON payloads¶

Illustrative example - (1) Error from gnomAD GRCh37 input (not from above work)¶

Test Case C - PUT VCF - Registration of variants and retrieval of CAIDs¶

Available scripts¶

Create an account if you do not already have one¶

Check usage¶

Register variants¶