Obtaining dbSNP identifiers for CA IDs¶
Requirements¶
- Requirements
- Unix / Linux
- We are using CentOS 6.9
- Ruby 1.8.7+
- Required for extracting HGVS expressions from input, but the logic could be ported to another language
- C++ complier (gcc 4.8.2+)
- Required for building bcftools
- gcc --version
- gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
- bcftools and htslib
- Versions
- bcftools 1.9-40-g44192ea
- htslib 1.9-21-gb7e5a51
- bcftools htslib install notes
- Versions
- Download request with payload script (we'll use the shell script for this example, but there is also a ruby and python version)
- http://reg.clinicalgenome.org/doc/scripts/
- Make executable
- chmod 755 request_with_payload.sh
- request_with_payload.sh
- Make executable
- http://reg.clinicalgenome.org/doc/scripts/
- Unix / Linux
Download dbSNP data¶
- Download latest dbSNP data set
- mkdir 10-13-21-dbSNP
- cd 10-13-21-dbSNP
- time wget -r –level=0 -E –ignore-length -x -k -p -erobots=off -np --no-check-certificate -N https://ftp.ncbi.nih.gov/snp/latest_release/VCF/
- 21m (47G)
- Verify checksums
- cd ftp.ncbi.nih.gov/snp/latest_release/VCF/
- md5sum -c GCF_000001405.25.gz.md5 #GRCh37
- GCF_000001405.25.gz: OK
- md5sum -c GCF_000001405.39.gz.md5 #GRCh38
- GCF_000001405.39.gz: OK
Extract variants¶
- Extract variants using bcftools
- time bcftools norm -m -any GCF_000001405.25.gz | bcftools view -V snps > GCF_000001405.25.gz.indels.vcf
- 167m
- time bcftools norm -m -any GCF_000001405.39.gz | bcftools view -V snps > GCF_000001405.39.gz.indels.vcf
- 182m
- time bcftools norm -m -any GCF_000001405.25.gz | bcftools view -V snps > GCF_000001405.25.gz.indels.vcf
- Observe line counts
- GCF_000001405.25.gz.indels.vcf
- Total lines: 131,924,564
- Variants: 131,924,224
- GCF_000001405.39.gz.indels.vcf
- Total lines: 135,985,520
- Variants: 135,984,844
- GCF_000001405.25.gz.indels.vcf
Extract HGVS¶
- Extract HGVS from input
- time ruby dbsnp_convert_to_hgvs.rb GCF_000001405.25.gz.indels.vcf
- 99m
- 131,924,224 lines
- 99m
- time ruby dbsnp_convert_to_hgvs.rb GCF_000001405.39.gz.indels.vcf
- 100m
- 135,984,844 lines
- 100m
- time ruby dbsnp_convert_to_hgvs.rb GCF_000001405.25.gz.indels.vcf
Split input files into 1M chunks¶
- Split input files into 1M chunks
- time split -a 3 -d -l 1000000 GCF_000001405.25.gz.indels.vcf.hgvs GCF_000001405.25.gz.indels.vcf.hgvs.
- time split -a 3 -d -l 1000000 GCF_000001405.39.gz.indels.vcf.hgvs GCF_000001405.39.gz.indels.vcf.hgvs.
Extract HGVS only for Allele Registry POST query and combine input / output into mapping file¶
Note: This is for (1) 1M chunk of GRCh38. The same procedure would be used for the remaining files in GRCh38/37.
- Extract HGVS only from the input
- cut -f1 GCF_000001405.39.gz.indels.vcf.hgvs.000 > GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only
- Search dbSNP variants by HGVS
- Get CA ID only
- time sh ./request_with_payload.sh "http://reg.clinicalgenome.org/alleles.json?file=hgvs&fields=none+@id+end" GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only > GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only.json
- 45s
- time sh ./request_with_payload.sh "http://reg.clinicalgenome.org/alleles.json?file=hgvs&fields=none+@id+end" GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only > GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only.json
- Get CA ID only
- Combine input and output
- HGVS RSID CAID
- paste GCF_000001405.39.gz.indels.vcf.hgvs.000 <(grep -o CA[0-9]* GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only.json) > GCF_000001405.39.gz.indels.vcf.hgvs.000.combined.txt
- HGVS RSID CAID
NC_000001.11:g.10013_10014delinsT 1639538231 CA1147837201 NC_000001.11:g.10019_10020delinsT 775809821 CA16716391 NC_000001.11:g.10024delinsCT 1639541758 CA1148213494 NC_000001.11:g.10031_10032delinsT 1639542364 CA1148307565 NC_000001.11:g.10036_10037delinsC 1639542429 CA1148307570 NC_000001.11:g.10039delinsAC 1639542564 CA1139532654 NC_000001.11:g.10042delinsCT 1639543135 CA1148307582 NC_000001.11:g.10045delinsAC 1639543279 CA1148307591 NC_000001.11:g.10049_10050delinsT 1639543682 CA1148307601 NC_000001.11:g.10051delinsAC 1326880612 CA884477203 ...