Obtaining dbSNP identifiers for CA IDs

Requirements

  • Requirements
    • Unix / Linux
      • We are using CentOS 6.9
    • Ruby 1.8.7+
      • Required for extracting HGVS expressions from input, but the logic could be ported to another language
    • C++ complier (gcc 4.8.2+)
      • Required for building bcftools
      • gcc --version
        • gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
    • bcftools and htslib
    • Download request with payload script (we'll use the shell script for this example, but there is also a ruby and python version)

Download dbSNP data

  • Download latest dbSNP data set
  • Verify checksums
    • cd ftp.ncbi.nih.gov/snp/latest_release/VCF/
    • md5sum -c GCF_000001405.25.gz.md5 #GRCh37
      • GCF_000001405.25.gz: OK
    • md5sum -c GCF_000001405.39.gz.md5 #GRCh38
      • GCF_000001405.39.gz: OK

Extract variants

  • Extract variants using bcftools
    • time bcftools norm -m -any GCF_000001405.25.gz | bcftools view -V snps > GCF_000001405.25.gz.indels.vcf
      • 167m
    • time bcftools norm -m -any GCF_000001405.39.gz | bcftools view -V snps > GCF_000001405.39.gz.indels.vcf
      • 182m
  • Observe line counts
    • GCF_000001405.25.gz.indels.vcf
      • Total lines: 131,924,564
      • Variants: 131,924,224
    • GCF_000001405.39.gz.indels.vcf
      • Total lines: 135,985,520
      • Variants: 135,984,844

Extract HGVS

  • Extract HGVS from input
    • time ruby dbsnp_convert_to_hgvs.rb GCF_000001405.25.gz.indels.vcf
      • 99m
        • 131,924,224 lines
    • time ruby dbsnp_convert_to_hgvs.rb GCF_000001405.39.gz.indels.vcf
      • 100m
        • 135,984,844 lines

Split input files into 1M chunks

  • Split input files into 1M chunks
    • time split -a 3 -d -l 1000000 GCF_000001405.25.gz.indels.vcf.hgvs GCF_000001405.25.gz.indels.vcf.hgvs.
    • time split -a 3 -d -l 1000000 GCF_000001405.39.gz.indels.vcf.hgvs GCF_000001405.39.gz.indels.vcf.hgvs.

Extract HGVS only for Allele Registry POST query and combine input / output into mapping file

Note: This is for (1) 1M chunk of GRCh38. The same procedure would be used for the remaining files in GRCh38/37.

  • Extract HGVS only from the input
    • cut -f1 GCF_000001405.39.gz.indels.vcf.hgvs.000 > GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only
  • Search dbSNP variants by HGVS
    • Get CA ID only
      • time sh ./request_with_payload.sh "http://reg.clinicalgenome.org/alleles.json?file=hgvs&fields=none+@id+end" GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only > GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only.json
        • 45s
  • Combine input and output
    • HGVS RSID CAID
      • paste GCF_000001405.39.gz.indels.vcf.hgvs.000 <(grep -o CA[0-9]* GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only.json) > GCF_000001405.39.gz.indels.vcf.hgvs.000.combined.txt
NC_000001.11:g.10013_10014delinsT       1639538231      CA1147837201
NC_000001.11:g.10019_10020delinsT       775809821       CA16716391
NC_000001.11:g.10024delinsCT    1639541758      CA1148213494
NC_000001.11:g.10031_10032delinsT       1639542364      CA1148307565
NC_000001.11:g.10036_10037delinsC       1639542429      CA1148307570
NC_000001.11:g.10039delinsAC    1639542564      CA1139532654
NC_000001.11:g.10042delinsCT    1639543135      CA1148307582
NC_000001.11:g.10045delinsAC    1639543279      CA1148307591
NC_000001.11:g.10049_10050delinsT       1639543682      CA1148307601
NC_000001.11:g.10051delinsAC    1326880612      CA884477203
...