Obtaining dbSNP identifiers for CA IDs¶

Requirements¶

Requirements
- Unix / Linux
  - We are using CentOS 6.9
- Ruby 1.8.7+
  - Required for extracting HGVS expressions from input, but the logic could be ported to another language
    - dbsnp_convert_to_hgvs.rb
- C++ complier (gcc 4.8.2+)
  - Required for building bcftools
  - gcc --version
    - gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
- bcftools and htslib
  - Versions
    - bcftools 1.9-40-g44192ea
    - htslib 1.9-21-gb7e5a51
  - bcftools htslib install notes
- Download request with payload script (we'll use the shell script for this example, but there is also a ruby and python version)
  - http://reg.clinicalgenome.org/doc/scripts/
    - Make executable
      - chmod 755 request_with_payload.sh
    - request_with_payload.sh

Download dbSNP data¶

Download latest dbSNP data set
- mkdir 10-13-21-dbSNP
- cd 10-13-21-dbSNP
- time wget -r –level=0 -E –ignore-length -x -k -p -erobots=off -np --no-check-certificate -N https://ftp.ncbi.nih.gov/snp/latest_release/VCF/
  - 21m (47G)
Verify checksums
- cd ftp.ncbi.nih.gov/snp/latest_release/VCF/
- md5sum -c GCF_000001405.25.gz.md5 #GRCh37
  - GCF_000001405.25.gz: OK
- md5sum -c GCF_000001405.39.gz.md5 #GRCh38
  - GCF_000001405.39.gz: OK

Extract variants¶

Extract variants using bcftools
- time bcftools norm -m -any GCF_000001405.25.gz | bcftools view -V snps > GCF_000001405.25.gz.indels.vcf
  - 167m
- time bcftools norm -m -any GCF_000001405.39.gz | bcftools view -V snps > GCF_000001405.39.gz.indels.vcf
  - 182m
Observe line counts
- GCF_000001405.25.gz.indels.vcf
  - Total lines: 131,924,564
  - Variants: 131,924,224
- GCF_000001405.39.gz.indels.vcf
  - Total lines: 135,985,520
  - Variants: 135,984,844

Extract HGVS¶

Extract HGVS from input
- time ruby dbsnp_convert_to_hgvs.rb GCF_000001405.25.gz.indels.vcf
  - 99m
    - 131,924,224 lines
- time ruby dbsnp_convert_to_hgvs.rb GCF_000001405.39.gz.indels.vcf
  - 100m
    - 135,984,844 lines

Split input files into 1M chunks¶

Split input files into 1M chunks
- time split -a 3 -d -l 1000000 GCF_000001405.25.gz.indels.vcf.hgvs GCF_000001405.25.gz.indels.vcf.hgvs.
- time split -a 3 -d -l 1000000 GCF_000001405.39.gz.indels.vcf.hgvs GCF_000001405.39.gz.indels.vcf.hgvs.

Extract HGVS only for Allele Registry POST query and combine input / output into mapping file¶

Note: This is for (1) 1M chunk of GRCh38. The same procedure would be used for the remaining files in GRCh38/37.

Extract HGVS only from the input
- cut -f1 GCF_000001405.39.gz.indels.vcf.hgvs.000 > GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only
Search dbSNP variants by HGVS
- Get CA ID only
  - time sh ./request_with_payload.sh "http://reg.clinicalgenome.org/alleles.json?file=hgvs&fields=none+@id+end" GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only > GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only.json
    - 45s

Combine input and output
- HGVS RSID CAID
  - paste GCF_000001405.39.gz.indels.vcf.hgvs.000 <(grep -o CA[0-9]* GCF_000001405.39.gz.indels.vcf.hgvs.000.hgvs_only.json) > GCF_000001405.39.gz.indels.vcf.hgvs.000.combined.txt

NC_000001.11:g.10013_10014delinsT       1639538231      CA1147837201
NC_000001.11:g.10019_10020delinsT       775809821       CA16716391
NC_000001.11:g.10024delinsCT    1639541758      CA1148213494
NC_000001.11:g.10031_10032delinsT       1639542364      CA1148307565
NC_000001.11:g.10036_10037delinsC       1639542429      CA1148307570
NC_000001.11:g.10039delinsAC    1639542564      CA1139532654
NC_000001.11:g.10042delinsCT    1639543135      CA1148307582
NC_000001.11:g.10045delinsAC    1639543279      CA1148307591
NC_000001.11:g.10049_10050delinsT       1639543682      CA1148307601
NC_000001.11:g.10051delinsAC    1326880612      CA884477203
...

ClinGen Allele Registry

Wiki

Obtaining dbSNP identifiers for CA IDs¶

Requirements¶

Download dbSNP data¶

Extract variants¶

Extract HGVS¶

Split input files into 1M chunks¶

Extract HGVS only for Allele Registry POST query and combine input / output into mapping file¶