SnpSift

Filter and manipulate annotated files

SnpSift is a toolbox that allows you to filter and manipulate annotated files.

Once your genomic variants have been annotated, you need to filter them out in order to find the "interesting / relevant variants". Given the large data files, this is not a trivial task (e.g. you cannot load all the variants into XLS spreadsheet). SnpSift helps to perform this VCF file manipulation and filtering required at this stage in data processing pipelines.

Download and install

SnpSift is part of SnpEff main distribution, so please click on here and follow the instructions on how to download and install SnpEff.

SnpSift utilities

SnpSift is a collection of tools to manipulate VCF (variant call format) files.
Some examples of what you can do:

Filter You can filter using arbitrary expressions, for instance "(QUAL > 30) | (exists INDEL) | ( countHet() > 2 )". The actual expressions can be quite complex, so it allows for a lot of flexibility.
Annotate & AnnMem You can add 'ID' and INFO fields from another "VCF database" (e.g. typically dbSnp database in VCF format).
CaseControl You can compare how many variants are in 'case' and in 'control' groups. Also calculates p-values (Fisher exact test).
Intervals Filter variants that intersect with intervals.
Intervals (intidx) Filter variants that intersect with intervals. Index the VCF file using memory mapped I/O to speed up the search. This is intended for huge VCF files and a small number of intervals to retrieve.
Join Join by generic genomic regions (intersecting or closest).
RmRefGen Remove reference genotype (i.e. replace '0/0' genotypes by '.')
TsTv Calculate transition to transversion ratio.
Extract fields Extract fields from a VCF file to a TXT (tab separated) format.
Variant type Adds SNP/MNP/INS/DEL to info field. It also adds "HOM/HET" if there is only one sample.
GWAS Catalog Annotate using GWAS Catalog.
DbNSFP Annotate using dbNSFP: The dbNSFP is an integrated database of functional predictions from multiple algorithms (SIFT, Polyphen2, LRT and MutationTaster, PhyloP and GERP++, etc.)
SplitChr Split a VCF file by chromosome

Citing SnpSift

In order to cite SnpSift, please use the following reference:

"Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift", Cingolani, P., et. al., Frontiers in Genetics, 3, 2012.
BibTex entry:
@article{cingolani2012using,
  title={Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift},
  author={Cingolani, P. and Patel, V.M. and Coon, M. and Nguyen, T. and Land, S.J. and Ruden, D.M. and Lu, X.},
  journal={Frontiers in Genetics},
  volume={3},
  year={2012},
  publisher={Frontiers Media SA}
}

Source code

The project is hosted at SourceForge.

Here is the SVN command to check out the development version of the code:
svn co https://snpeff.svn.sourceforge.net/svnroot/snpeff/SnpSift/trunk

SnpSift filter is one of the most useful SnpSift commands. Using SnpSift filter you can filter VCF files using arbitrary expressions, for instance "(QUAL > 30) | (exists INDEL) | ( countHet() > 2 )". The actual expressions can be quite complex, so it allows for a lot of flexibility.

Some examples for the impatient:
  • I want to filter out samples with quality less than 30:
     cat variants.vcf | java -jar SnpSift.jar filter " ( QUAL >= 30 )" > filtered.vcf
  • ...but we also want InDels that have quality 20 or more:
     cat variants.vcf | java -jar SnpSift.jar filter "(( exists INDEL ) & (QUAL >= 20)) | (QUAL >= 30 )" > filtered.vcf
  • ...or any homozygous variant present in more than 3 samples:
     cat variants.vcf | java -jar SnpSift.jar filter "(countHom() > 3) | (( exists INDEL ) & (QUAL >= 20)) | (QUAL >= 30 )" > filtered.vcf
  • ...or any heterozygous sample with coverage 25 or more:
     cat variants.vcf | java -jar SnpSift.jar filter "((countHet() > 0) && (DP >= 25)) | (countHom() > 3) | (( exists INDEL ) & (QUAL >= 20)) | (QUAL >= 30 )" > filtered.vcf
  • I want to keep samples where the genotype for the first sample is homozygous variant and the genotype for the second sample is reference:
     cat variants.vcf | java -jar SnpSift.jar filter "isHom( GEN[0] ) & isVariant( GEN[0] ) & isRef( GEN[1] )" > filtered.vcf
  • I want to keep samples where the ID matches a set defined in a file:
     cat variants.vcf | java -jar SnpSift.jar filter --set my_rs.txt "ID in SET[0]" > filtered.vcf
    and the file my_rs.txt has one string per line, e.g.:
    rs58108140
    rs71262674
    rs71262673
Now you get the idea. You can combine any conditions you want using boolean operators.

Variables

All VCF fields can be used as variables names, as long as they are declared in the VCF header OR they are "standard" VCF fields (as defined by the VCF 4.1 specification).
  • Fields names: "CHROM, POS, ID, REF, ALT, QUAL or FILTER" Examples:
    • Any variant in chromosome 1:
       "( CHROM = 'chr1' )" 
    • Variants between two positions:
       "( POS > 123456 ) & ( POS < 654321 )" 
    • Has an ID and it matches the regulat expression 'rs':
       "(exists ID) & ( ID =~ 'rs' )" 
    • The reference is 'A':
       "( REF = 'A' )" 
    • The alternative is 'T':
       "( ALT = 'T' )" 
    • Quality over 30:
       "( QUAL > 30 )" 
    • Filter value is either 'PASS' or it is missing:
       "( na FILTER ) | (FILTER = 'PASS')" 
  • INFO field names in the INFO field. E.g. if the info field has "DP=48;AF1=0;..." you can use something like
     ( DP > 10 ) & ( AF1 = 0 )

Multiple valued fields and variables

When variables have multiple values, you can access individual values as if it was an array.
  • Multiple value info fields (comma separated) can be accessed using an index. E.g. If the INFO field has "CI95=0.04167,0.5417" you can use an expression such as
     "( CI95[0] > 0.1 ) & (CI95[1] <= 0.3)" 
  • Multiple indexes You may test multiple indexed fields using 'ANY' or 'ALL' as index. In the examples we assume the INFO field has "CI95=0.04167,0.5417"

    ANY or *: If you use 'ANY' as index, the expression will be true if any field satisfies the expression.
    So, for instance, the following expressions

     "( CI95[ANY] > 0.1 )" 
    or
     "( CI95[*] > 0.1 )" 
    are equivalent to (in this case, there are only two values in the array)
     "( CI95[0] > 0.1 ) | ( CI95[1] > 0.1 )" 

    ALL or ?: If you use 'ALL' as index, the expression will be true if all field satisfy the expression.
    So, for instance, the following expressions

     "( CI95[ALL] > 0.1 )" 
     "( CI95[?] > 0.1 )" 
    are equivalent to (in this case, there are only two values in the array)
     "( CI95[0] > 0.1 ) & ( CI95[1] > 0.1 )" 

Genotype fields

Vcf genotype fields can be accessed individually using array notation.
  • Genotype fields are accessed using an index (sample number) followed by a variable name. E.g. If the genotypes are "GT:PL:GQ 1/1:255,66,0:63 0/1:245,0,255:99" You can write something like
    "( GEN[0].GQ > 60 ) & ( GEN[1].GQ > 90 )" 
    You may use an asterisk to represent 'ANY' field
    "( GEN[*].GQ > 60 )" 
  • Genotype multiple fields are accessed using an index (sample number) followed by a variable name and then another index. E.g. If the genotypes are "GT:PL:GQ 1/1:255,66,0:63 0/1:245,0,255:99" You can write something like
    "( GEN[0].PL[2] = 0 )" 
    You may use an asterisk to represent 'ANY' field
    "( GEN[0].PL[*] = 0 )" 
    ...or even
    "( GEN[*].PL[*] = 0 )" 


You can create an expression using sample names instead of genotype numbers. The script snpSift_filter_sample_to_number.pl converts sample names to sample numbers.
E.g.
$ ./scripts/snpSift_filter_sample_to_number.pl test.1KG.vcf "( GEN[HG00096].GQ > 60 ) & ( GEN[HG00097].GQ > 90 )"
( GEN[0].GQ > 60 ) & ( GEN[1].GQ > 90 )
You can use the resulting expression in SnpSift filter without having to worry about different versions of the VCF files having different sample order.

Sets

Sets are defined by the '-s' (or '--set') command line option. Each file must have one string per line. They are named based on the order used in the command line (e.g. the first one is 'SET[0]', the second one is 'SET[1]', etc.)

Example: You can write something like (assuming your command line was "-s set1.txt -s set2.txt -s set3.txt"):
"( ID in SET[2] )" 

SnpEff 'EFF' fields

SnpEff annotations are parsed, so you can access individual sub-fields:
Effect fields (from SnpEff) are accessed using an index (effect number) followed by a sub-field name.
Available EFF sub-fields are:
  • EFFECT: Effect (e.g. SYNONYMOUS_CODING, NON_SYNONYMOUS_CODING, FRAME_SHIFT, etc.)
  • IMPACT: { HIGH, MODERATE, LOW, MODIFIER }
  • FUNCLASS: { NONE, SILENT, MISSENSE, NONSENSE }
  • CODON: Codon change (e.g. 'ggT/ggG')
  • AA: Amino acid change (e.g. 'G156')
  • GENE: Gene name (e.g. 'PSD3')
  • BIOTYPE: Gene biotype, as described by the annotations (e.g. 'protein_coding')
  • CODING: Gene is { CODING, NON_CODING }
  • TRID: Transcript ID
  • RANK: Exon or Intron rank (i.e. exon number in a transcript)
For example, you may want only the lines where the first effect is a NON_SYNONYMOUS variants:
"( EFF[0].EFFECT = 'NON_SYNONYMOUS_CODING' )" 
...but this probably doesn't make much sense. What you may really want are lines where ANY effect is NON_SYNONYMOUS:
"( EFF[*].EFFECT = 'NON_SYNONYMOUS_CODING' )" 
May be you want only the ones that affect gene 'TCF7L2'
"( EFF[*].EFFECT = 'NON_SYNONYMOUS_CODING' ) &  ( EFF[*].GENE = 'TCF7L2' )" 

SnpEff 'LOF' and 'NMD' fields

Similarly LOF and NMD sub-fields are available:
  • LOF.GENE and NMD.GENE
  • LOF.GENEID and NMD.GENEID
  • LOF.NUMTR and NMD.NUMTR
  • LOF.PERC and NMD.PERC
For instance, if we want to obtain genes having a Loss of Function effect in more than 90% of the transcripts, you can do this:
$cat test.snpeff.vcf | java -Xmx1G -jar SnpSift.jar filter "(exists LOF[*].PERC) & (LOF[*].PERC > 0.9)" 
We assume that 'test.snpeff.vcf' was annotated with SnpEff using '-lof' command line option.

Available operands and functions

The following operators and functions are interpreted by SnpSift filter:

Operand Description Data type Example
= Equality test FLOAT, INT or STRING (REF = 'A')
> Greater than FLOAT or INT (DP > 20)
Greater or equal than FLOAT or INT (DP 20)
< Less than FLOAT or INT (DP < 20)
Less or equal than FLOAT or INT (DP 20)
=~ Match regular expression STRING (REL =~ 'AC')
!~ Does not match regular expression STRING (REL !~ 'AC')
& AND operator Boolean (DP > 20) & (REF = 'A')
| OR operator Boolean (DP > 20) | (REF = 'A')
! NOT operator Boolean ! (DP > 20)
exists The variable exists (not missing) Any (exists INDEL)
Function Description Data type Example
countHom() Count number of homozygous genotypes No arguments (countHom() > 0)
countHet() Count number of heterozygous genotypes No arguments (countHet() > 2)
countVariant() Count number of genotypes that are variants (i.e. not reference 0/0) No arguments (countVariant() > 5)
countRef() Count number of genotypes that are NOT variants (i.e. reference 0/0) No arguments (countRef() < 1)
Genotype
Function
Description Data type Example
isHom Is homozygous genotype? Genotype isHom( GEN[0] )
isHet Is heterozygous genotype? Genotype isHet( GEN[0] )
isVariant Is genotype a variant? (i.e. not reference 0/0) Genotype isVariant( GEN[0] )
isRef Is genotype a reference? (i.e. 0/0) Genotype isRef( GEN[0] )

Annotate using fields from another VCF file (e.g. dbSnp or 1000 genomes projects).

This is typically used to annotate IDs and INFO fields from a 'database' VCF file (e.g. dbSnp). Here is an example:
 java -jar SnpSift.jar annotate dbSnp132.vcf variants.vcf > variants_annotated.vcf 

Important:

  • SnpSift annotate command assumes that both the database and the input VCF files are sorted by position. Chromosome sort order can differ between files.
  • SnpSift annotateMem command allows unsorted VCF files, but it loads the entire 'database' VCF file into memory (which may not be practical for large 'database' VCF files).

Note:

  • By default it adds ALL database INFO fields.
  • You can use the '-info' command line option if you only want select only a subset of fields from db.vcf file.
  • You can use the '-id' command line option if you only want to add ID fields (no INFO fields will be added).

DbSnp in VCF format can download here



Example 1: Annotating ID from dbSnp:

$ cat test.chr22.vcf
#CHROM  POS         ID           REF  ALT  QUAL   FILTER  INFO
22      16157571    .            T    G    0.0    FAIL    NS=53
22      16346045    .            T    C    0.0    FAIL    NS=244
22      16350245    .            C    A    0.0    FAIL    NS=192
22      17054103    .            G    A    0.0    PASS    NS=404
22      17071906    .            A    T    0.0    PASS    NS=464
22      17072347    .            C    T    0.0    PASS    NS=464
22      17072394    .            C    G    0.0    PASS    NS=463
22      17072411    .            G    T    0.0    PASS    NS=464

$ java -jar SnpSift.jar annotate -id db/dbSnp/dbSnp137.20120616.vcf test.chr22.vcf 
#CHROM  POS         ID           REF  ALT  QUAL   FILTER  INFO
22      16157571    .            T    G    0.0    FAIL    NS=53
22      16346045    rs56234788   T    C    0.0    FAIL    NS=244
22      16350245    rs2905295    C    A    0.0    FAIL    NS=192
22      17054103    rs4008588    G    A    0.0    PASS    NS=404
22      17071906    .            A    T    0.0    PASS    NS=464
22      17072347    rs139948519  C    T    0.0    PASS    NS=464
22      17072394    .            C    G    0.0    PASS    NS=463
22      17072411    rs41277596   G    T    0.0    PASS    NS=464
Example 2: Annotatating ID and all INFO fields from dbSnp (VCF headers not shown for brevity):
$ cat test.chr22.vcf
#CHROM  POS         ID           REF  ALT  QUAL   FILTER  INFO
22      16157571    .            T    G    0.0    FAIL    NS=53
22      16346045    .            T    C    0.0    FAIL    NS=244
22      16350245    .            C    A    0.0    FAIL    NS=192
22      17054103    .            G    A    0.0    PASS    NS=404
22      17071906    .            A    T    0.0    PASS    NS=464
22      17072347    .            C    T    0.0    PASS    NS=464
22      17072394    .            C    G    0.0    PASS    NS=463
22      17072411    .            G    T    0.0    PASS    NS=464

$ java -jar SnpSift.jar annotate db/dbSnp/dbSnp137.20120616.vcf test.chr22.vcf 
#CHROM  POS         ID           REF  ALT  QUAL   FILTER  INFO
22      16157571    .            T    G    0.0    FAIL    NS=53
22      16346045    rs56234788   T    C    0.0    FAIL    NS=244;RSPOS=16346045;GMAF=0.162248628884826;dbSNPBuildID=129;SSR=0;SAO=0;VP=050100000000000100000100;WGT=0;VC=SNV;SLO;GNO
22      16350245    rs2905295    C    A    0.0    FAIL    NS=192;RSPOS=16350245;GMAF=0.230804387568556;dbSNPBuildID=101;SSR=1;SAO=0;VP=050000000000000100000140;WGT=0;VC=SNV;GNO
22      17054103    rs4008588    G    A    0.0    PASS    NS=404;RSPOS=17054103;GMAF=0.123400365630713;dbSNPBuildID=108;SSR=0;SAO=0;VP=050100000000070010000100;WGT=0;VC=SNV;SLO;VLD;G5A;G5;KGPilot123
22      17071906    .            A    T    0.0    PASS    NS=464
22      17072347    rs139948519  C    T    0.0    PASS    NS=464;RSPOS=17072347;dbSNPBuildID=134;SSR=0;SAO=0;VP=050200000004040010000100;WGT=0;VC=SNV;S3D;ASP;VLD;KGPilot123
22      17072394    .            C    G    0.0    PASS    NS=463
22      17072411    rs41277596   G    T    0.0    PASS    NS=464;RSPOS=17072411;GMAF=0.00411334552102377;dbSNPBuildID=127;SSR=0;SAO=0;VP=050200000008040010000100;GENEINFO=CCT8L2:150160;WGT=0;VC=SNV;S3D;CFL;VLD;KGPilot123

Allows you to count how many samples are in 'case' and 'control' groups.

This command counts the number of 'homozygous', 'heterozygous' and 'total' variants in a case and control groups and performs some basic pValue calculation using Fisher exact test and Cochran-Armitage test.

Case and Control groups can be defined either by a command line string or a TFAM file (see PLINK's documentation).
Case/Control command line string containing plus and minus symbols {'+', '-', '0'} where '+' is case, '-' is control and '0' is neutral (ignored).
E.g. We have ten samples, which means ten genotype columns in the VCF file. The first four are 'cases', the fifth one is 'neutral', and the last five are 'control'. So the description string would be "++++0-----" (note that the following output has been edited, only counts are shown, no pValues):

$ java -jar SnpSift.jar caseControl "++++0-----" cc.vcf 
#CHROM  POS    ID  REF  ALT  QUAL  FILTER  INFO                                FORMAT  Sample_01  Sample_02  Sample_03  Sample_04  Sample_05  Sample_06  Sample_07  Sample_08  Sample_09  Sample_10
1       69496  .   G    A    .     PASS    AF=0.01;Cases=1,2,4;Controls=2,2,6  GT      0/1        1/1        1/0        0/0        0/0        0/1        1/1        1/1        1/0        0/0
Cases genotypes are samples 1 to 4 : 0/1, 1/1, 1/0 and 0/0. So there are 1 homozygous, 2 heterozygous, and a total of 4 variants (2 * 1 + 1 * 2 = 4). Thus the annotation is Cases=1,2,4
Control genotypes are samples 6 to 10 : 0/1, 1/1, 1/1, 1/0 and 0/0. So there are 2 homozygous, 2 heterozygous, and a total of 6 variants (2 * 2 + 1 * 2 = 6) Thus the annotation is Controls=2,2,6

You can use the -tfam command line option to specify a TFAM file. Case, control from are read from phenotype field of a TFAM file (6th column). Phenotype order in TFAM files do not need to match VCF sample order (sample IDs are used). Phenotype column should be coded as {0,1,2} meaning {Missing, Control, Case} respectively. See PLINK's reference for details about TFAM file format.

You can use the -name nameString command line option to add name to the INFO tags. This can be used to count different case/control groups in the same dataset (e.g. multiple phenotypes)
$  java -jar SnpSift.jar caseControl -name "_MY_GROUP" "++++0-----" cc.vcf \
	| java -jar SnpSift.jar caseControl -name "_ANOTHER_GROUP" "+-+-+-+-+-" -

#CHROM  POS    ID  REF  ALT  QUAL  FILTER  INFO                                                                                                         FORMAT  Sample_01  Sample_02  Sample_03  Sample_04  Sample_05  Sample_06  Sample_07  Sample_08  Sample_09  Sample_10
1       69496  .   G    A    .     PASS    AF=0.01;Cases_MY_GROUP=1,2,4;Controls_MY_GROUP=2,2,6;Cases_ANOTHER_GROUP=1,3,5;Controls_ANOTHER_GROUP=2,1,5  GT      0/1        1/1        1/0        0/0        0/0        0/1        1/1        1/1        1/0        0/0

p-values

SnpSift caseControl calculates the p-value using different models: dominant, recessive, allelic and co-dominant.

When we say we use Fisher exact test, it means that we use the real Fisher exact test calculation, not approximations (like Chi-Square approximations). So the p-values should be correct even for low counts on any of the values in the contingency tables. Approximations tend to be wrong when any count in a contingency table is below 5. You should not see that problem here.

Models:
  • Dominant model (CC_DOM): A 2 by 2 contingency table is created
    Alt (A/a + a/a) Ref (A/A)
    Cases N11 N12
    Controls N21 N22
    This means that the first column are the number of samples that have ANY non-reference: either 1 (heterozygous) or 2 (homozygous). Fisher exact test is used to calculate the p-value.

  • Recessive model (CC_REC): A 2 by 2 contingency table is created
    Alt (a/a) Ref + Het (A/A + A/a)
    Cases N11 N12
    Controls N21 N22
    This means that the first column are the number of samples that have both non-reference chromosomes: homozygous ALT. Fisher exact test is used to calculate the p-value.

  • Allelic model (CC_ALL): A 2 by 2 contingency table is created
    Variants References
    Cases N11 N12
    Controls N21 N22
    This means that the first column are the number of non-reference genotypes. For instance homozygous reference samples count as 0, heterozygous count as 1 and homozygous non-reference count as 2. Fisher exact test is used to calculate the p-value.

  • Genotipic / Codominant model (CC_GENO): A 2 by 3 contingency table is created
    A/A a/A a/a
    Cases N11 N12 N13
    Controls N21 N22 N23
    Weight 0.0 1.0 2.0
    This means that the first column are the number of homozygous reference genotypes. The second column is the number of heterozygous. And the third column is the number of homozygous non-reference. Cochran-Armitage test is used to calculate the p-value, using the weights shown in the last row.

This is used to extract variants that intersect any interval.

You must provide intervals as BED files.
Command line options:

  • '-x' : Filter out (exclude) VCF entries that match any interval in the BED files.
  • '-i file.vcf' : Specify the input VCF file (default is STDIN).

E.g.:

 cat variants.vcf | java -jar SnpSift.jar intervals my_intervals.bed > variants_intersecting_intervals.vcf 


BED file format is tab separated zero-based coordinates "chr \t start \t end " (for this application, all other fields in the BED file are ignored).

If BED file has header lines, they must start with a '#'

This is used to extract variants that intersect any interval.

This is similar to "SnpSift intervals", but intended for huge VCF files, and relatively small number of intervals.

This command indexes the VCF file, thus is optimized for huge VCF files. You must provide intervals as BED files. BED format is tab separated zero-based coordinates "chr \t start \t end " (for this application, all other fields in the BED file are ignored). You can use command line option '-if 1' if you want one-based coordinates.

E.g.:

 java -jar SnpSift.jar intidx variants.vcf my_intervals.bed > variants_intersecting_intervals.vcf 
You can also have genomic coordinate in the command line. Note that in this case, coordinates are assumed to be one-based (instead of zero-based, like in BED files):
 java -jar SnpSift.jar intidx -c variants.vcf chr1:12345-23456 chr2:3456789-4567890  > variants_intersecting_intervals.vcf 


BED file format is tab separated zero-based coordinates "chr \t start \t end " (for this application, all other fields in the BED file are ignored).

If BED file has header lines, they must start with a '#'

Join files by genomic regions (i.e. chr:start-end).

Files can be generic TXT (tab separated), VCF or BED.
Usage example:
Usage: java -jar SnpSift.jar join [options] file1 file2 
Note: It is assumed that both files fit in memory.
Options:
    -if1 <num>       : Offset for file1 (e.g. 1 if coordinates are one-based. Default: 1
    -if2 <num>       : Offset for file2 (e.g. 2 if coordinates are one-based. Default: 1
    -cols1 <colDef>  : Column definition for file 1. Format: chrCol,startCol,endCol (e.g. '1,2,3'). 
                       Shortcuts 'bed' or 'vcf' are allowed. Default: 'vcf
    -cols2 <colDef>  : Column definition for file 2. Format: chrCol,startCol,endCol (e.g. '1,2,3'). 
                       Shortcuts 'bed' or 'vcf' are allowed. Default: 'vcf
    -all             : For each interval, show all intersecting. 
                       Default: show only one (the largest intersection)
    -closest         : Show closest intervals in file2 if none intersect. 
                       Default: off
    -empty           : Show intervals in file1 even if they do not intersect with any other interval. 
                       Default: off
Example: Join two bed files, showing intersecting or closest intervals
java -Xmx2G -jar SnpSift.jar join -v -cols1 bed -cols2 bed -closest file1.bed file2.bed
Example: Join one bed file and another file having chr:start-end in columns 7,8 and 9 respectively. Showing intervals form file1 that do not intersect any interval from file2
java -Xmx2G -jar SnpSift.jar join -v -cols1 bed -cols2 7,8,9 -empty file.bed my_weird_file.txt

Remove reference genotypes.

Replaces genotype information for non-variant samples.

E.g. If you have this file:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT       M1                 M2                X1              X2              
2L      426906  .       C       G       53.30   .       DP=169  GT:PL:GQ     0/1:7,0,255:4      0/1:7,0,255:4     0/0:0,0,0:6     0/0:0,30,255:35 
2L      601171  .       C       A       999.00  .       DP=154  GT:PL:GQ     0/1:81,0,141:78    0/1:42,0,251:39   0/0:0,0,0:4     0/0:0,33,255:36 
2L      648611  .       A       T       999.00  .       DP=225  GT:PL:GQ     0/1:52,0,42:47     1/1:75,21,0:14    0/0:0,0,0:3     0/0:0,60,255:61 
2L      807373  .       A       G       106.00  .       DP=349  GT:PL:GQ     0/1:14,0,65:12     0/1:60,0,42:50    0/0:0,0,0:4     0/0:0,69,255:72 
2L      816766  .       G       T       999.00  .       DP=411  GT:PL:GQ     0/1:108,0,45:53    0/1:7,0,255:6     0/0:0,0,0:4     0/0:0,57,255:59 
You can run:
 cat file.vcf | java -jar SnpSift.jar rmRefGen > file_noref.vcf 
and you get this (notice the last two columns, that had '0/0' genotype):
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT       M1                 M2                X1    X2              
2L      426906  .       C       G       53.30   .       DP=169  GT:PL:GQ     0/1:7,0,255:4      0/1:7,0,255:4     .     .
2L      601171  .       C       A       999.00  .       DP=154  GT:PL:GQ     0/1:81,0,141:78    0/1:42,0,251:39   .     .
2L      648611  .       A       T       999.00  .       DP=225  GT:PL:GQ     0/1:52,0,42:47     1/1:75,21,0:14    .     .
2L      807373  .       A       G       106.00  .       DP=349  GT:PL:GQ     0/1:14,0,65:12     0/1:60,0,42:50    .     .
2L      816766  .       G       T       999.00  .       DP=411  GT:PL:GQ     0/1:108,0,45:53    0/1:7,0,255:6     .     .

Calculate transition vs transversion ratios for each sample.

Usage example:
 
$ java -jar SnpSift.jar tstv hom s.vcf 

Sample        :	1	2	3	4	5	6	7	8	9	10	11	12	Total
Transitions   :	150488	150464	158752	156674	152936	160356	152276	155314	156484	149276	151182	153468	1847670
Transversions :	70878	70358	73688	72434	70828	76150	72030	71958	72960	69348	70180	71688	862500
Ts/Tv         :	2.123	2.139	2.154	2.163	2.159	2.106	2.114	2.158	2.145	2.153	2.154	2.141	2.142	

Extract fields from a VCF file to a TXT, tab separated format, that you can easily load in R, XLS, etc.

You can also use sub-fields and genotype fields / sub-fields such as:

  • Standard VCF fields:
    • CHROM
    • POS
    • ID
    • REF
    • ALT
    • FILTER
  • INFO fields:
    • AF
    • AC
    • DP
    • MQ
    • etc. (any info field available)
  • SnpEff 'EFF' fields:
    • "EFF[*].EFFECT"
    • "EFF[*].IMPACT"
    • "EFF[*].FUNCLASS"
    • "EFF[*].CODON"
    • "EFF[*].AA"
    • "EFF[*].AA_LEN"
    • "EFF[*].GENE"
    • "EFF[*].BIOTYPE"
    • "EFF[*].CODING"
    • "EFF[*].TRID"
    • "EFF[*].RANK"
    You can combine vcfEffOnePerLine.pl script with SnpSift extractFields if you want to have each effect in a seprarate line.
  • SnpEff 'LOF' fields:
    • "LOF[*].GENE"
    • "LOF[*].GENEID"
    • "LOF[*].NUMTR"
    • "LOF[*].PERC"
  • SnpEff' NMD' fields:
    • "NMD[*].GENE"
    • "NMD[*].GENEID"
    • "NMD[*].NUMTR"
    • "NMD[*].PERC"


When using multiple indexes (e.g. "EFF[*].EFFECT") you must remember to use quotes the command line. Otherwise, the shell would parse the asterisk changing the expression and producing unexpected results.


You can use command line option -s to specify multiple field separator and -e to specify how to represent empty fields.

Example 1: Extracting chromosome, position, ID and allele frequency from a VCF file:

$ java -jar SnpSift.jar extractFields s.vcf CHROM POS ID AF | head
#CHROM        POS        ID            AF
1             69134                    0.086
1             69496      rs150690004   0.001
1             69511      rs75062661    0.983
1             69569                    0.538
1             721559                   0.001
1             721757                   0.011
1             846854     rs111957712   0.003
1             865584     rs148711625   0.001
1             865625     rs146327803   0.001

Example 2: Extracting genotype fields

$ java -jar SnpSift.jar extractFields file.vcf CHROM" "POS" "ID" "THETA" "GEN[0].GL[1]" "GEN[1].GL" "GEN[3].GL[*]" "GEN[*].GT"
This means to extract:
  • CHROM POS ID: regular fields (as in the previous example)
  • THETA : This one is from INFO
  • GEN[0].GL[1] : Second likelihood from first genotype
  • GEN[1].GL : The whole GL fiels (all entries without separating them)
  • "GEN[3].GL[*]" : All likelihoods form genotype 3 (this time they will be tab separated, as opposed to the previous one).
  • "GEN[*].GT" : Genotype subfields (GT) from ALL samples (tab separated).
The result will look something like:
#CHROM  POS     ID              THETA   GEN[0].GL[1]    GEN[1].GL               GEN[3].GL[*]            GEN[*].GT
1       10583   rs58108140      0.0046  -0.47           -0.24,-0.44,-1.16       -0.48   -0.48   -0.48   0|0     0|0     0|0     0|1     0|0     0|1     0|0     0|0     0|1 
1       10611   rs189107123     0.0077  -0.48           -0.24,-0.44,-1.16       -0.48   -0.48   -0.48   0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0 
1       13302   rs180734498     0.0048  -0.58           -2.45,-0.00,-5.00       -0.48   -0.48   -0.48   0|0     0|1     0|0     0|0     0|0     1|0     0|0     0|1     0|0 
1       13327   rs144762171     0.0204  -1.11           -1.97,-0.01,-2.51       -0.48   -0.48   -0.48   0|0     0|1     0|0     0|0     0|0     1|0     0|0     0|0     0|0 
1       13957   rs201747181     0.0100  0               0,0,0                   0       0       0       0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0 
1       13980   rs151276478     0.0139  -0.48           -0.48,-0.48,-0.48       -0.48   -0.48   -0.48   0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0 
1       30923   rs140337953     0.0162  -0.61           -0.10,-0.69,-2.81       -0.48   -0.48   -0.48   1|1     0|0     0|0     1|1     1|0     0|0     1|1     1|0     1|1 
1       46402   rs199681827     0.0121  0               0,0,0                   0       0       0       0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0 
1       47190   rs200430748     0.0153  0               0,0,0                   0       0       0       0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0 


Example 3: Extracting fields with multiple values in a friendlier format. You can use command line option -s to specify multiple field separator and -e to specify how to represent empty fields.
$ java -jar SnpSift.jar extractFields -s "," -e "." test.chr22.eff.vcf CHROM POS REF ALT "EFF[*].EFFECT" "EFF[*].AA" 

Notice how we separate same fields using "," instead of the default tab using the option -s ",", and we use "." for empty fiels (option -e "."). The results is:
#CHROM  POS        REF  ALT     EFF[*].EFFECT                           EFF[*].AA
22      16346045   T    C       DOWNSTREAM,INTRON,SPLICE_SITE_DONOR     .,.,.
22      16350245   C    A       INTRON,SPLICE_SITE_ACCEPTOR             .,.
22      17054103   G    A       INTRON,SPLICE_SITE_DONOR                .,.
22      17071906   A    T       DOWNSTREAM,NON_SYNONYMOUS_CODING        .,L512Q
22      17072347   C    T       DOWNSTREAM,STOP_GAINED                  .,W365*
22      17072394   C    G       DOWNSTREAM,NON_SYNONYMOUS_CODING        .,R349S
22      17072411   G    T       DOWNSTREAM,NON_SYNONYMOUS_CODING        .,P344T
22      17072483   A    G       DOWNSTREAM,NON_SYNONYMOUS_CODING        .,W320R
22      17072907   G    T       DOWNSTREAM,NON_SYNONYMOUS_CODING        .,D178E
22      17073032   C    T       DOWNSTREAM,NON_SYNONYMOUS_CODING        .,A137T
22      17073124   G    T       DOWNSTREAM,NON_SYNONYMOUS_CODING        .,A106D
22      17073332   A    T       DOWNSTREAM,NON_SYNONYMOUS_CODING        .,L37M


Example 4: Extracting effects, one per line
In order to extract effects, you can simply do something like this (notice that there are multiple columns per line because there are mutiple effects per variant):
$ java -jar SnpSift.jar extractFields test.snpeff.vcf CHROM POS "EFF[*].EFFECT"
#CHROM  POS     EFF[*].EFFECT
1       10303   DOWNSTREAM      INTERGENIC      UPSTREAM
Note that since that variant has three effects, there are three "EFFECT" columns.

If we prefer to have one effect per line, then we can use the vcfEffOnePerLine.pl provided with SnpEff distribution
$ cat test.snpeff.vcf | ./scripts/vcfEffOnePerLine.pl | java -jar SnpSift.jar extractFields - CHROM POS "EFF[*].EFFECT"
#CHROM  POS     EFF[*].EFFECT
1       10303   DOWNSTREAM
1       10303   INTERGENIC
1       10303   UPSTREAM
Now we obtain one effect per line, while all other parameters in the line are repeated across mutiple lines (e.g. there are three 1:10303 lines).
Note that in SnpSift, we used - as input file name, which denotes STDIN.

SnpSift extractFields can get confused if the VCF field has non-alphanumeric charaters in the name (e.g. dbNSFP_GERP++_RS has two "+" signs). A quick fix, it so is to change the field names in the VCF file. Here is an example:

# Change field names in VCF
$ cat kath.vcf | sed "s/dbNSFP_GERP++/dbNSFP_GERP/g" > kath.gerp.vcf

# Use new names to extract fields
$ java -jar SnpSift.jar  extractFields kath.gerp.vcf CHROM POS REF ALT dbNSFP_GERP_RS dbNSFP_GERP_NS
#CHROM	POS	REF	ALT	dbNSFP_GERP_RS
1	142827044	G	A		
2	132914561	G	A		
7	151933217	C	A		
7	151933251	T	C		
7	151933302	T	C		
7	151945101	G	C	-0.892	
7	151945167	G	T		
7	151962176	T	A		
7	151970672	A	T		
7	151970856	T	A	3.71	
18	14183638	G	C		
18	14183710	A	G		
18	14542909	G	A		
18	14543039	T	C	-0.942	

Adds an INFO field denoting variant type.

It adds "SNP/MNP/INS/DEL/MIXED" in the INFO field. It also adds "HOM/HET", but this last one works if there is only one sample (otherwise it doesn't make any sense).

$ java -jar SnpSift.jar varType  test.vcf | grep -v "^#" | head
20	10469	.	C	G	100.0	PASS	SNP;HOM	GT:AP	0|0:0.075,0.060
20	10492	.	C	T	100.0	PASS	SNP;HET	GT:AP	0|1:0.180,0.345
20	10575	.	C	CG	100.0	PASS	DEL;HET	GT:AP	0|1:0.000,0.000
20	10611	.	CG	C	100.0	PASS	INS;HET	GT:AP	0|1:0.000,0.010
20	10618	.	GT	TA	100.0	PASS	MNP;HET	GT:AP	0|1:0.020,0.030

Annotate using GWAS catalog

You need the GWAS catalog file (in TXT format), which can be downloaded here

$ java -jar SnpSift.jar gwasCat gwascatalog.txt test.vcf | tee test.gwas.vcf
1   1005806 rs3934834   C   T   .   PASS    AF=0.091;GWASCAT=Body_mass_index    
1   2069172 rs425277    C   T   .   PASS    AF=0.400;GWASCAT=Height 
1   2069681 rs3753242   C   T   .   PASS    AF=0.211;GWASCAT=Reasoning  
1   2392648 rs2477686   G   C   .   PASS    AF=0.745;GWASCAT=Non_obstructive_azoospermia    
1   2513216 rs734999    C   T   .   PASS    AF=0.547;GWASCAT=Ulcerative_colitis 
1   2526746 rs3748816   A   G   .   PASS    AF=0.489;GWASCAT=Celiac_disease 
1   3083712 rs2651899   T   C   .   PASS    AF=0.467;GWASCAT=Migraine   
1   3280253 rs6658356   G   A   .   PASS    AF=0.070;GWASCAT=Response_to_statin_therapy 
1   4315204 rs966321    G   T   .   PASS    AF=0.522;GWASCAT=Factor_VII 
1   5170712 rs7513590   A   G   .   PASS    AF=0.256;GWASCAT=Anthropometric_traits  
1   6279370 rs846111    G   C   .   PASS    AF=0.153;GWASCAT=QT_interval,QT_interval    
1   6631431 rs11587438  C   T   .   PASS    AF=0.906;GWASCAT=White_blood_cell_types 
1   7879063 rs2797685   C   T   .   PASS    AF=0.186;GWASCAT=Crohn_s_disease    
1   8021973 rs35675666  G   T   .   PASS    AF=0.093;GWASCAT=Ulcerative_colitis 
1   8046672 rs12727642  C   A   .   PASS    AF=0.101;GWASCAT=Celiac_disease 
1   8422676 rs2252865   T   C   .   PASS    AF=0.771;GWASCAT=Schizophrenia  
1   8526142 rs4908760   G   A   .   PASS    AF=0.630;GWASCAT=Vitiligo   

The dbNSFP is an integrated database of functional predictions from multiple algorithms (SIFT, Polyphen2, LRT and MutationTaster, PhyloP and GERP++, etc.).

One of the main advantages is that you can annotate using multiple prediction tools with just one command. Here is the link to dbNSFP database website.


Database: In order to annotate using dbNSFP, you need to download the dbNSFP database and the index file. dbNSFP is large (several GB) so it might take a while to download it. As of SnpSift version 3.5, the dbNSFP file has to be tabix indexed (i.e. block-gzipped + tabix). This allows for faster annotations. You can download the files from SnpEff's site:

dbNSFP Annotation example

Here is a full example how to perform annotations

# Donload database and tabix index files (you need to do this only once):
# WARNING: The database file is large (over 5Gb)
wget http://sourceforge.net/projects/snpeff/files/databases/dbNSFP/dbNSFP2.4.txt.gz
wget http://sourceforge.net/projects/snpeff/files/databases/dbNSFP/dbNSFP2.4.txt.gz.tbi

# Annotate using dbNSFP
java -jar SnpSift.jar dbnsfp -v dbNSFP2.4.txt.gz myFile.vcf > myFile.annotated.vcf

You can now specify which fields you want to use for annotation using the '-f' command line option followed by a comma separated list of field names. Defaults fields are
  • Uniprot_acc
  • Interpro_domain
  • SIFT_pred
  • Polyphen2_HDIV_pred
  • Polyphen2_HVAR_pred
  • LRT_pred
  • MutationTaster_pred
  • GERP++_NR
  • GERP++_RS
  • phastCons100way_vertebrate
  • 1000Gp1_AF
  • 1000Gp1_AFR_AF
  • 1000Gp1_EUR_AF
  • 1000Gp1_AMR_AF
  • 1000Gp1_ASN_AF
  • ESP6500_AA_AF
  • ESP6500_EA_AF

Building dbNSFP (for developers)

Users do NOT need to do this, since a pre-indexed database can be downloaded from SnpSift's site (see previous sub-section). These instructions are mostly for developers.

You can also create dbNSFP files yourself, downloading the files from DbNsfp site. Two files are required:
  • A block-gzipped database file
  • The corresponding tabix index for the database file.

Creating a file that SnpSift can use is simple, just follow this guideline:
# Download dbNSFP database
$ wget http://dbnsfp.houstonbioinformatics.org/dbNSFPzip/dbNSFP2.4.zip

# Uncompress
$ unzip dbNSFP2.4.zip

# Create a single file version
$ (head -n 1 dbNSFP2.4_variant.chr1 ; cat dbNSFP2.4_variant.chr* | grep -v "^#" ) > dbNSFP2.4.txt

# Compress using block-gzip algorithm
bgzip dbNSFP2.4.txt

# Create tabix index
tabix -s 1 -b 2 -e 2 dbNSFP2.4.txt.gz

Simply split (or join) VCF files. Allows to create one file per chromosome or one file every N lines.

A typical usage for this command is to:
  1. Split very large VCF files SnpSift split huge.vcf
  2. Perform some CPU intensive processing in parallel using several computers or cores
  3. Join the resulting VCF files SnpSift split -j huge.000.vcf huge.001.vcf huge.002.vcf ... > huge.out.vcf.
E.g.: Splitting a VCF having human variants

java -jar SnpSift.jar split myHugeVcf.vcf.gz

Will create files myHugeVcf.1.vcf, myHugeVcf.2.vcf, ... , myHugeVcf.22.vcf, myHugeVcf.X.vcf, myHugeVcf.Y.vcf

You can also specify '-l' command line option to split the file every N lines

E.g.: Split a VCF file every 10,000 lines

java -jar SnpSift.jar split -l 10000 myHugeVcf.vcf.gz

Will create files myHugeVcf.001.vcf, myHugeVcf.002.vcf, ...

VCF header will be added to each file, so resulting files will be more than 10,000 lines.

You can use -j (join) command line option to join a set of VCF files.
java -jar SnpSift.jar split -j huge.000.vcf huge.001.vcf huge.002.vcf ...  > huge.out.vcf

Annotate using PhastCons conservation scores.

You must download PhastCons files here

You also need a chromosome size file, which can be created using samtools faidx, or you can download it from here

Full example. Most of the example deals with downloading and installing PhastCons database, which is done only once. The real annotation process is done in the last line.

# Create a dir for PhastCons database
cd ~/snpEff
mkdir -p db/phastCons/

# Download all PhastCons files
cd db/phastCons/
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr1.phastCons100way.wigFix.gz                  
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr2.phastCons100way.wigFix.gz                  
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr3.phastCons100way.wigFix.gz                  
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr4.phastCons100way.wigFix.gz                  
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr5.phastCons100way.wigFix.gz                  
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr6.phastCons100way.wigFix.gz                  
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr7.phastCons100way.wigFix.gz                  
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr8.phastCons100way.wigFix.gz                  
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr9.phastCons100way.wigFix.gz                  
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr10.phastCons100way.wigFix.gz                 
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr11.phastCons100way.wigFix.gz                 
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr12.phastCons100way.wigFix.gz                 
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr13.phastCons100way.wigFix.gz                 
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr14.phastCons100way.wigFix.gz                 
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr15.phastCons100way.wigFix.gz                 
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr16.phastCons100way.wigFix.gz                 
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr17.phastCons100way.wigFix.gz                 
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr18.phastCons100way.wigFix.gz                 
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr19.phastCons100way.wigFix.gz                 
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr20.phastCons100way.wigFix.gz                 
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr21.phastCons100way.wigFix.gz                 
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chr22.phastCons100way.wigFix.gz                 
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chrM.phastCons100way.wigFix.gz                  
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chrX.phastCons100way.wigFix.gz                  
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/vertebrate/chrY.phastCons100way.wigFix.gz 

# Create a chromosome size file and name it "genome.fai"
samtools faidx path/to/genome/hg19.fa.gz
cp path/to/genome/hg19.fa.gz.fai ./genome.fai

# Now we are ready to annotate
java -Xmx4g -jar SnpSift.jar phastCons ~/snpEff/db/phastCons file.vcf > file.phastCons.vcf


You can annotated intervals using BED files and -bed command line option. In the output BED formatted intervals, the score column (fifth column), is the average conservation score of all bases within the interval.

It is possible to extract sub-intervals having at least 'minScore' conservation score and 'len' length by using -minScore score and -extract len command line options. For instance, the following command:
java -jar SnpSift.jar phastCons -minScore 0.8 -extract 10 -bed path/to/phastCons/dir input.bed
extracts all subintervals from each line in input.bed, that has at least 10 bases length and a conservation score of 0.8

Calculate concordance between two VCF files.

This is typically used when you want to calculate concordance between a genotyping experiment and a sequencing experiment.
For instance, you sequenced several samples and, as part of a related experiment or just as quality control, you also genotype the same samples using a genotyping array. Now you want to compare the two experiments. Ideally there would be no difference between the variants from genotyping and sequencing, but this is hardly the case in real world.
You can use SnpSift concordance to measure the differences between the two experiments.

It is assumed that both VCF files are sorted by chromosome and position.

Sample names are defined in '#CHROM' line of the header section. Concordance is calculated only if sample label matches in both files.
Example:

$ java -Xmx1g -jar SnpSift.jar concordance -v genotype.vcf sequencing.vcf > concordance.txt
00:00:00.000	Indexing file 'genotype.vcf'
	index:	MT	460030998
	index:	1	19705
		1 / 2	45170805 / 45174315
		2 / 3	77052081 / 77055591
		3 / 4	104065531 / 104069041
		4 / 5	124098372 / 124101881
		5 / 6	146535292 / 146538802
		6 / 7	184793526 / 184797035
		7 / 8	206156508 / 206160018
		8 / 9	223072816 / 223076326
		9 / 10	242315995 / 242319505
		10 / 11	261053789 / 261057299
		11 / 12	290190553 / 290194063
		12 / 13	312869636 / 312873146
		13 / 14	321966539 / 321970049
		14 / 15	336131317 / 336134827
		15 / 16	350871669 / 350875179
		16 / 17	368900523 / 368904032
		17 / 18	391305860 / 391309369
		18 / 19	398932237 / 398935747
		19 / 20	425219198 / 425222708
		20 / 21	437022008 / 437025517
		21 / 22	442563678 / 442567188
		22 / X	451783418 / 451786927
		X / Y	459553691 / 459557200
		Y / MT	459588787 / 459592296
00:00:01.137	Open VCF file 'genotype.vcf'
00:00:01.141	Open VCF file 'sequencing.vcf'
00:00:01.176	Chromosome: '1'
00:00:02.127		1:1550992	1:1528859
00:00:02.739		1:2426313	1:2389636
...
00:02:13.780		1:248487058	1:248471945

Output

SnpSif's concordance output is written to STDOUT and two files. For instance the command java -jar SnpSift.jar concordance -v genotype.vcf sequencing.vcf will write:

  • Concordance by variant: Written to STDOUT
  • Concordance by sample: Written to concordance_genotyping_sequencing.by_sample.txt
  • Summary: Written to concordance_genotyping_sequencing.summary.txt

Concordance by variant

This sections is a table showing concordance details for every entry (chr:position) that both files have in common. E.g.:

chr  pos     ref  alt  change_0_0  change_0_1  change_0_2  change_1_0  change_1_1  change_1_2  change_2_0  change_2_1  change_2_2  missing_genotype_genotype  missing_genotype_sequencing
1    865584  G    A        508         0           0           0           2           0           0           0           0           0                                 5
1    865625  G    A        512         0           0           0           1           0           0           0           0           0                                 1
1    865628  G    A        511         0           0           0           2           0           0           0           0           0                                 1
1    865665  G    A        495         0           0           0           4           0           0           0           0           0                                17
1    865694  C    T        428         0           0           0          82           0           0           0           4           0                                 0


Each genotype is coded according to the number of ALT variants. i.e.:
  • '0/0' (homozygous reference) is coded as '0'
  • '0/1' or '1/0' (heterozygous ALT) coded as '1'
  • '1/1' (homozygous ALT) is coded as '1'
So the column "change_X_Y" on the table shows how many genotypes coded 'X' in the first VCF, changed to 'Y' in the second VCF. For example, 'change_0_1' counts the number of "homozygous reference in genotype.vcf" that changed to "heterozygous ALT in sequencing.vcf". Or 'change_2_2' counts the number of "homozygous ALT" that did not change (in both files they are '2').

A few rules apply:
  • If a VCF entry (chr:pos) is present in only one of the files, obviously we cannot calculate concordance, so it is ignored.
  • If a VCF entry (chr:pos) has more than one ALT it is ignored. This means that non-biallelic variants are ignored.
  • If, for the same chr:pos, REF field is different between the two files, then the entry is ignored.
  • If, for the same chr:pos, ALT field is different between the two files, then the entry is ignored.

Concordance by sample

This section shows details in the same format as the previous section. Here, concordance metrics are shown aggregated for each sample. E.g.:

# Totals by sample
sample  change_0_0  change_0_1  change_0_2  change_1_0  change_1_1  change_1_2  change_2_0  change_2_1  change_2_2  missing_genotype_genotype  missing_genotype_sequencing
ID_003  79          0           0           1           8           0           0           0           2           1                          1
ID_004  83          0           0           1           2           0           0           0           5           0                          1
ID_005  80          0           0           0           7           0           0           0           4           1                          0
ID_006  79          0           0           0           5           0           0           0           6           0                          2
ID_008  81          0           0           0           4           0           0           0           4           0                          3
ID_009  80          0           0           0           7           0           0           0           3           0                          2
ID_012  74          0           0           0           10          0           0           0           1           0                          7
ID_013  79          1           0           0           4           0           0           0           5           0                          3
ID_018  84          0           0           0           5           0           0           0           3           0                          0
...

Summary

Summary file contains overall information and errors. Here is an example of a summary file:

$ cat concordance_genotyping_sequencing.summary.txt
Number of samples:
    929    File genotype.vcf
    583    File sequencing.vcf
    514    Both files
Errors:
	ALT field does not match	19
The header indicates that one file ('genotype.vcf') has 929 samples, the other file has 583 and there are 514 matching sample IDs in both files.
At the end of the file, a footer shows the total for each column followed by number of possible errors (or mismatches). In this case the were 19 ALT fields that did not match between 'genotype.vcf' and 'sequencing.vcf'. This can happen, for instance, when there are INDELs, which cannot be detected by genotyping arrays.

Summary messages are shown to STDERR if you use verbose mode (command line option -v).

Annotate if a variant is private to a family.

A Private=Family_ID is added to a variant's INFO field, if the variant is only found in one family. A TFAM file (see PLINK's documentation) specifies a mapping from sample IDs to family IDs.
E.g.:

$ java -jar SnpSift.jar private pheotypes.tfam imp.ann.vcf > imp.ann.private.vcf

An annotated variant may look like this:
1   1005806 rs3934834   C   T   .   PASS    AF=0.091;..;Private=Family_47
This indicates that the variant is only found in members of Family_47, according to the definitions in pheotypes.tfam.

Convert from VCF to PLINK's TPED file format.

The vcf2tped command uses a VCF and a TFAP file as input, creating a TPED and a consolidated TFAM as outputs.

Command line options are:

$ java -jar SnpSift.jar vcf2tped
SnpSift version 1.9d (build 2013-04-26), by Pablo Cingolani
Usage: java -jar SnpSift.jar vcf2tped [options] file.tfam file.vcf outputName
Options:
	-f             : Force. Overwrite new files if they exist. Default: false
	-onlySnp       : Use only SNPs when converting VCF to TPED. Default: false
	-onlyBiAllelic : Use only bi-allelic variants. Default: false
	-useMissing    : Use entries with missing genotypes (otherwise they are filtered out). Default: false
	-useMissingRef : Use entries with missing genotypes marking them as 'reference' instead of 'missing'. Default: false
Parameters:
	file.tfam      : File with genotypes and groups information (in PLINK's TFAM format)
	file.vcf       : A VCF file (variants and genotype data)
	outputName     : Base name for the new TPED and TFAM files.


vcf2tped command supports the following features:
  • Output a TPED file:
    • Only samples present in both the input TFAM and the input VCF files are in the output TPED.
    • Bi-allelic filter: -onlyBiAllelic option filters out non bi-allelic variants.
    • Non SNP variants (InDels, MNPs, etc):
      • InDels and other non-SNP variants are converted for "fake" SNPs (some programs have problems handling non-SNP variants).
      • -onlySnp option filters out non SNP variants.
    • Missing variants:
      • Variants having missing data are filtered out by default.
      • -useMissing uses missing variants in TPED file.
      • -useMissingRef Converts missing variants to reference genotype.
  • Output TFAM file:
    • Only samples present in both the input TFAM and the input VCF files are in the output TFAM.
    • Samples are re-ordered to have the same order as the VCF file

This command intersects several intervals files (e.g. BED, BigBed, TXT) and produces a result of all intersections.


A typical usage example is to create a consensus of peaks from several Chip-Seq experiments.

Algorithm: This command creates one interval forest for each input file. For every interval in all input files, finds all intervals that intersect at least minOverlap bases (default 1 base). If there are at least cluster number of intersecting intervals it creates a consensus interval from the intersections (or union) of all intervals found. The consensus interval, if any, is shown as result.


Command line options:

$ java -jar SnpSift.jar intersect
SnpSift version 1.9d (build 2013-04-26), by Pablo Cingolani
Usage: java -jar SnpSift.jar [options] file_1.bed file_2.bed ... file_N.bed
Options:
	-minOverlap  : Minimum number of bases that two intervals have to overlap. Default : 0
	-cluster     : An interval has to intersect at least 'num' intervals (from other files) to be considered. Default: 0
	-intersect        : Report the intersection of all intervals. Default: false
	-union            : Report the union of all intervals. Default: true

This command removes INFO fields from a VCF file (i.e. removes annotations)

Removing INFO fields is usually done because you want to re-annotate the VCF file, thus removing old INFO fields in order to add new ones later.

SnpEff & SnpSift only add annotations and do not change current ones. So, in order to re-annotate a file, you should first remove the old annotations and then re-annotate.
The reason for this behavior is simply because replacing annotation values is considered a bad practice. Imagine that you have a VCF entry  in your re-annotated file having the value "AA=1": How do you know if this is from the old annotations or from the new ones? This confusion often leads to problems in downstream steps of your pipelines, so it's better to avoid the problem by first removing all the previous annotations and then adding the new ones.

Usage example:

$ cat test.snpeff.vcf 
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	734462	1032	G	A	.	s50	AC=348;EFF=DOWNSTREAM(MODIFIER|||||RP11-206L10.8|processed_transcript|NON_CODING|ENST00000447500||1),INTRON(MODIFIER|||||RP11-206L10.6|processed_transcript|NON_CODING|ENST00000429505|1|1)

$ java -jar SnpSift.jar rmInfo test.snpeff.vcf EFF
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	734462	1032	G	A	.	s50	AC=348

Annotating GeneSets, such as Gene Ontology (GO), KEGG, Reactome, etc.; can be quite useful to find significant variants.

Gene set annotations can be added to a SnpEff annotated file using SnpSift geneSets command. The VCF file must be annotated using SnpEff before perforimg Gene Sets annotations. This is because we must know which gene the variant affects).

You can download MSigDb from Broad Institute

Usage example:

$ java -jar SnpSift.jar geneSets -v db/msigDb/msigdb.v3.1.symbols.gmt test.eff.vcf > test.eff.geneSets.vcf
00:00:00.000	Reading MSigDb from file: 'db/msigDb/msigdb.v3.1.symbols.gmt'
00:00:01.168	Done. Total:
	8513 gene sets
	31847 genes
00:00:01.168	Annotating variants from: 'test.eff.vcf'
00:00:01.298	Done.
# Summary
#	gene_set	gene_set_size	variants
#	ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN	940	8
#	CHR1P36	504	281
#	KEGG_OLFACTORY_TRANSDUCTION	389	8
#	REACTOME_GPCR_DOWNSTREAM_SIGNALING	805	8
#	REACTOME_OLFACTORY_SIGNALING_PATHWAY	328	8
...
#	REACTOME_SIGNALING_BY_GPCR	920	8

$ cat test.eff.geneSets.vcf
##INFO=<ID=MSigDb,Number=.,Type=String,Description="Gene set from MSigDB database (GSEA)">
1	69849	.	G	A	454.73	PASS	AC=33;EFF=STOP_GAINED(HIGH|NONSENSE|tgG/tgA|W253*|305|OR4F5|protein_coding|CODING|ENST00000335137|1|1);MSigDb=ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN,CHR1P36,KEGG_OLFACTORY_TRANSDUCTION,REACTOME_GPCR_DOWNSTREAM_SIGNALING,REACTOME_OLFACTORY_SIGNALING_PATHWAY,REACTOME_SIGNALING_BY_GPCR

Compress genotype calls, reducing the overall size of the VCF file.

This is intended for compressing very large VCF in very large sequencing projects (e.g. thousands of samples).

For instance, we've reduced 1Tb (1,000 Gb) VCF file to roughly 1Gb in a project that has over 10,000 samples.

The underlying idea is quite simple. In large re-sequenceing projects most of the variants are singletons. This means that most variants are present in only one of the samples. For those variants, you have thousands of samples that are homozygous reference (i.e. genotype entry is "0/0") and one that is a variant (e.g. '0/1' or '1/1').
A trivial way to compress these VCF entries is just to state which sample has non-reference information. Intuitively, this is similar to the way used to represent sparse matrices (only store non-zero elements).

SnpSift gt creates three INFO fields. These three files are composed of comma separated indexes of samples having:

  • HE: Indicated heterozygous variants (i.e. '0/1').
  • HO: Indicated homozygous variants (i.e. '1/1').
  • NA: Indicated missing genotype data (i.e. './.').

You can use -u command line option to uncompress.
E.g.:
$ cat test.vcf
#CHROM  POS     ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  Sample_1  Sample_2  Sample_3  Sample_4  Sample_5  Sample_6  Sample_7  Sample_8  Sample_9  Sample_10  Sample_11  Sample_12  Sample_13  Sample_14  Sample_15
1       861276  .   A    G    .     PASS    AC=1  GT      0/0       1/1       0/0       0/0       0/0       0/0       0/0       0/0       0/0       0/0        0/0        0/0        0/0        0/0        0/0

#---
# Compress genotypes
#---
$ java -jar SnpSift.jar gt test.vcf | tee test.gt.vcf
##INFO=<ID=HO,Number=.,Type=Integer,Description="List of sample indexes having homozygous ALT genotypes">
##INFO=<ID=HE,Number=.,Type=Integer,Description="List of sample indexes having heterozygous ALT genotypes">
##INFO=<ID=NA,Number=.,Type=Integer,Description="List of sample indexes having missing genotypes">
#CHROM  POS     ID  REF  ALT  QUAL  FILTER  INFO        FORMAT  Sample_1  Sample_2  Sample_3  Sample_4  Sample_5  Sample_6  Sample_7  Sample_8  Sample_9  Sample_10  Sample_11  Sample_12  Sample_13  Sample_14  Sample_15
1       861276  .   A    G    .     PASS    AC=1;HO=1

#---
# Uncompress genotypes (command line option '-u')
#---
$ java -jar SnpSift.jar gt -u test.gt.vcf 
#CHROM  POS     ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  Sample_1  Sample_2  Sample_3  Sample_4  Sample_5  Sample_6  Sample_7  Sample_8  Sample_9  Sample_10  Sample_11  Sample_12  Sample_13  Sample_14  Sample_15
1       861276  .   A    G    .     PASS    AC=1  GT      0/0       1/1       0/0       0/0       0/0       0/0       0/0       0/0       0/0       0/0        0/0        0/0        0/0        0/0        0/0


This is lossy compression. Note that only GT informations is compressed, all other information in genotype field is lost.

Perform some basic check ups on VCF files to spot common problems.

SnpSift vcfCheck checks for some common problems where VCF files are not following the specification. Given that many common VCF problems cause analysis tools and pipelines to behave unexpectedly, this command is intended as a simple debugging tool.
E.g.:

$ java -jar SnpSift.jar vcfCheck bad.vcf 

WARNING: Malformed VCF entryfile 'bad.vcf', line 7:
	Entry  : 3	148885779	.	A	ATT,AT	999.0	PASS	UK10KWES_AC=0,0;MDV=94
	Errors :
		INFO filed 'UK10KWES_AC' has 'Number=1' in header, but it contains '2' elements.
		Cannot find header for INFO field 'MDV'

WARNING: Malformed VCF entryfile 'bad.vcf', line 14:
	Entry  : 3	148890104	.	TCA	T	.	.	.
	Errors :
		File is not sorted: Position '3:148890104' after position '3:148890105'