ClinGen Allele Registry (CAR) Performance and Guidelines¶

Registration Metrics - Higher Level Summary¶

Registration rate / second	Sort status	Genomic Range	Annotations in VCF?	Return payload
90k	sorted	narrow (<50M nt)	N	None
20k	sorted	medium (50M < x < ~100M nt)	N	None
5k	unsorted	wide (>~100M nt)	N	None
~1k	sorted / unsorted	narrow / medium / wide range	N	Full
100-900	sorted / unsorted	narrow / medium / wide range	Y	Full

You will note that we observe increased performance (bottom to top):

4X increase in throughput by removing annotations (specific to VCF)
4X increase in throughput by returning no fields during registration
4X increase in throughput by sorting and utilizing a medium range of genomic location distribution
4X increase in throughput by sorting and splitting by narrow genomic ranges

High Volume Guidance¶

The Allele Registry provides for multiple routes for the bulk registration or bulk querying of variants. There are multiple factors that affect the turnaround time of an API request (can vary by 100X between API requests), namely:

Network
- The size of the the submitted input, size of generated output, and network speed all factor in to the total request time
- Input size should be minimized when possible (e.g. removing annotation fields in VCF files)
  - VCF_gnomAD shows the dramatic effect (80X) of removing the INFO field data prior to VCF registration / annotation
- Output size can be minimized by requesting only required JSON fields when possible
  - fields=all returns full variant payload
  - fields=none returns blank variant payload, but can be useful for verifying all input lines were processed (input lines == output lines)
  - fields=none+@id returns only the CA ID of the variant
  - API documentation has more information related to possible values to return
    - http://reg.clinicalgenome.org/doc/
Number of input entities
- The job time is also correlated with number of entities in the input
- <1M entries per API request is generally suitable. Heavily annotated VCF files input chunks should be decreased (25%, 50%, etc.) in the event of network timeouts.
Ordering of entities
- Sorted input is not required, but very strongly advised! Sorted input requires loading less pages into RAM and dramatically decreases turnaround time.
Distribution of entities across the genome
- Tighter distribution of input across chromosomes and chromosomal coordinates also requires loading less pages into RAM, which also decreases the total job time
- When possible, multiple datasets should be combined and sorted for large scale bulk registration

Common Input Types¶

I have HGVS input, where should I start?¶

Obtaining_dbSNP_identifiers_for_CA_IDs - useful information showing how to process HGVS data from dbSNP, which can be applied to any HGVS data
Sort your input by accession and starting coordinate
- You may find that splitting your input by reference sequence (especially for genomic variants) makes sorting easier
Split your input up into multiple 'chunks' < 1M lines each
Query the Allele Registry
- ./request_with_payload.sh "http://reg.clinicalgenome.org/alleles.json?file=hgvs&fields=none+@id" {input_file_name} > {output_file_name}.json
Register variants in the Allele Registry
- ./request_with_payload.sh "http://reg.clinicalgenome.org/alleles.json?file=hgvs&fields=none+@id" {input_file_name} {user_name} {password} > {output_file_name}.json

I have VCF input, where should I start?¶

VCF_gnomAD shows how gnomAD VCF data can be utilized
Sort input based on chromosome and start coordinate
Split your input into 1M chunks (see above regarding decreasing this number if you experience client timeouts)
- Note: per the above VCF gnomAD tutorial, you will need to insert the VCF header onto each split file
Query the Allele Registry - return CA IDs
- ./request_with_payload.sh "https://reg.clinicalgenome.org/alleles?file=vcf&fields=none+@id" {input_file_name} > {output_file_name}.json
Register variants in the Allele Registry - return CA IDs
- ./request_with_payload.sh "https://reg.clinicalgenome.org/alleles?file=vcf&fields=none+@id" {input_file_name} {user_name} {password} > {output_file_name}.json
Annotate VCF - return CA IDs
- ./request_with_payload.sh ""http://reg.clinicalgenome.org/annotateVcf?assembly=GRCh38&ids=CA" {input_file_name} > {output_file_name}.vcf
Annotate VCF and register variants - return CA IDs
- ./request_with_payload.sh ""http://reg.clinicalgenome.org/annotateVcf?assembly=GRCh38&ids=CA" {input_file_name} {user_name} {password} > {output_file_name}.vcf
Header files
- VCF_Header_GRCh37
- VCF_Header_GRCh38

Registration Metrics - More Detailed¶

Row	Input Format	CAR version	Input characteristics	Return type	Database state	Input size	Output size	Average registrations / second
A	VCF	64	unsorted, distributed across the full genome	register only	empty	25MB	3.7MB	89,484
B	VCF	64	sorted narrow distribution	register only	empty	25MB	3.7MB	103,120
C	VCF	64	unsorted, distributed across the full genome	register, full payload	empty	25MB	7.4GB	2,805
D	VCF	64	sorted narrow distribution	register, full payload	empty	25MB	9.9GB	2,278
E	VCF	64	unsorted, distributed across the full genome	register only	2.5B	25MB	3.7MB	5,346
F	VCF	64	sorted narrow distribution	register only	2.5B	25MB	3.7MB	90,679
G	HGVS	32	sorted narrow distribution	query only	2.5B	38MB	54MB	22,222
H	VCF	32	sorted, chr21, no annotations in input	query, annotate VCF	2.5B	2.3MB	.	40,000
I	VCF	32	sorted, chr21, full annotations in input	query, annotate VCF	2.5B	1.4GB	.	500

Additional Notes¶

There are four main external factors related to CAR performance

Network latency (input, output file sizes)
- The Allele Registry is cloud hosted by Rackspace and performance can be adversely affected by connectivity to their distributed servers, which is very rare.
- Additionally, the size of the input that needs to be uploaded and the size of the output that needs to be downloaded is a product of network latency.
- For example:
  - Rows C vs. D we have an increased turnaround time for sorted, narrowly distributed input, which is correlated with the larger return size.
  - Rows H vs. I we see a dramatic decrease in performance by including unnecessary input not required by the CAR (annotation details in the VCF INFO column)
Number of input entities
- Performance scales linearly with number of input items. We generally recommend batches of < 1M per API query.
Ordering of entities
- Submitting input in sorted order is highly encouraged and directly affects performance. This is due to loading pages of the CAR database into memory based on genomic location, thus variants in close proximity will be loaded into RAM for quicker access and minimizes the number of pages that need to be loaded.
Distribution of entities across the genome
- Similar to the above, the narrower the window of input entities across the genome, the better the performance (similar to ordering entities mentioned above).
- For example, you can see the effect of unordered distributed entities across the genome adversely affect the performance (A vs. B, E vs. F)

Notes:

API documentation:
- http://reg.clinicalgenome.org/doc/
  - PDF documentation
  - Script examples (e.g. request_with_payload.sh)
The current version of Allele Registry as of 6/3/22 is utilizing 32 bit infrastructure. We are planning on rolling out 64 bit in the near future, which among other things will provide support for registration of > 4.2B canonical alleles (currently at ~2.57B).
The sorted narrow distribution data referred to in rows B, D, and F has a range of ~10M nt and the input in row G has a range of ~20M
Obtain CA IDs for HGVS (row G) tutorial:
- Obtaining_dbSNP_identifiers_for_CA_IDs
Annotate VCF (rows H, I) tutorial
- VCF_gnomAD
Database growth
- Another function of database growth is initial increased turnaround time (A-D, vs. E-I). Again, this is related to the number of pages that are required for loading into memory and we will continue to monitor performance as we increase the number of registered alleles to 3B, 4B, etc.

CAR Performance - Sorted vs. Unsorted¶

car_performance.png (39.4 KB) Riehle Kevin, 06/03/2022 09:13 PM

ClinGen Allele Registry

Wiki