SnpEff

Documentation

SnpEff is a variant annotation and effect prediction tool. It annotates and predicts the effects of genetic variants (such as amino acid changes).

License

SnpEff is open source. It is released as "LGPLv3".

Summary

A typical SnpEff use case would be:
  • Input: The inputs are predicted variants (SNPs, insertions, deletions and MNPs). The input file is usually obtained as a result of a sequencing experiment, and it is usually in variant call format (VCF).

  • Output: SnpEff analyzes the input variants. It annotates the variants and calculates the effects they produce on known genes (e.g. amino acid changes). A list of effects and annotations that SnpEff can calculate can be found here.

Variants

By genetic variant we mean a difference between a genome and a "reference" genome. As an example, imagine we are sequencing a "sample". Here "sample" can mean anything that you are interested in studying, from a cell culture to a mouse or a cancer patient.
It is a standard procedure to compare your sample sequences against the corresponding "reference genome". For instance, you may compare the cancer patient genome against the current "reference genome" (at the time of this writing, it is 'hg19').

In a typical sequencing experiment, you will find many places in the genome where your sample differs from the reference genome. These are called "genomic variants" or just "variants".
Typically, variants are categorized as follows:

Name | What it means                    | Example
SNP  | Single-nucleotide polymorphism   | Reference = 'A',   Sample = 'C'
Ins  | Insertion                        | Reference = 'A',   Sample = 'AGT'
Del  | Deletion                         | Reference = 'AC',  Sample = 'C'
MNP  | Multiple-nucleotide polymorphism | Reference = 'ATA', Sample = 'GTC'

This is not a comprehensive list; it is just to give you an idea.

Annotations

So, you have a huge file describing all the differences between your sample and the reference genome. But you want to know more about these variants than just their genetic coordinates. E.g.: Are they in a gene? In an exon? Do they change protein coding? Do they cause premature stop codons?
SnpEff can help you answer all these questions. The process of adding this information about the variants is called "Annotation".
SnpEff provides several degrees of annotations, from simple (e.g. which gene each variant affects) to extremely complex annotations (e.g. will this non-coding variant affect the expression of a gene?). It should be noted that the more complex the annotation, the more it relies on computational predictions. Such computational predictions are sometimes incorrect.
Results cannot be trusted blindly. SnpEff results may have to be analyzed and independently validated.

Databases

In order to produce the annotations, SnpEff requires a database. We build these databases using information from trusted resources.
Currently, there are pre-built databases for over 8,500 reference genomes. This means that 99% of the cases are covered.
On some very rare occasions, people need to build a database for an organism that is not currently supported (e.g. the genome is not publicly available). In most cases this can be done, and there is a section of this manual explaining how to build your own SnpEff database.

Which databases are supported? You can find out all the supported databases by running the databases command:

java -jar snpEff.jar databases

This command shows the database name, genome name and source data (where the genome reference data was obtained from). Keep in mind that many times I use ENSEMBL reference genomes, so the name would be GRCh37 instead of hg19, or GRCm38 instead of mm10, and so on.

Example: Finding a database: So, let's say you want to find out the name of the latest mouse (Mus musculus) database. You can run something like this:

$ java -jar snpEff.jar databases | grep -i musculus
GRCm38.68                                                   	Mus_musculus                                                	OK        	http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
GRCm38.69                                                   	Mus_musculus                                                	          	http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
GRCm38.70                                                   	Mus_musculus                                                	          	http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
GRCm38.71                                                   	Mus_musculus                                                	          	http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
GRCm38.72                                                   	Mus_musculus                                                	          	http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
GRCm38.73                                                   	Mus_musculus                                                	          	http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
GRCm38.74                                                   	Mus_musculus                                                	          	http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
NCBIM37.64                                                  	Mus_musculus                                                	          	http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
NCBIM37.65                                                  	Mus_musculus                                                	          	http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip
NCBIM37.66                                                  	Mus_musculus                                                	          	http://sourceforge.net/projects/snpeff/files/databases/v3.4/snpEff_v3.4_Mus_musculus.zip

At the time of writing this, you have 10 options (obviously this will change in the future). Some databases are for genome version 37 (NCBIM37, i.e. mm9) and some for version 38 (GRCm38, i.e. mm10). Since it is generally better to use the latest release, you should probably pick GRCm38.74. Again, these are the version numbers at the time of writing this paragraph; in the future there will be other releases and you should use the corresponding version.

Unsupported reference genomes: If your reference genome of interest is not supported yet (i.e. there is no database available), you can build a database yourself (see the "Building databases" section). If you have problems adding your own organism, send me an email and I'll do my best to help you out.
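As a rough sketch of what building your own database looks like (the full steps are in the "Building databases" section; the genome name 'mygenome' and the file names are hypothetical, and the genome must also be declared as an entry in snpEff.config):

# Place the gene annotations and reference sequence in the data directory
mkdir -p data/mygenome
cp my_annotations.gtf data/mygenome/genes.gtf
cp my_reference.fa data/mygenome/sequences.fa

# Build the database from the GTF file
java -jar snpEff.jar build -gtf22 -v mygenome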

Citing

If you are using SnpEff or SnpSift, please cite our work as shown here. Thank you!

Features

The following table shows the main SnpEff features:
Feature Comment
Local install SnpEff can be installed on your local computer or servers. Local installations are preferred for processing genomic data. As opposed to remote web-based services, running a program locally has many advantages:
  • There is no need to upload huge genomic datasets.
  • Processing doesn't depend on availability or processing capacity of remote servers.
  • Service continuity: no need to worry if a remote service will be maintained in the future.
  • Security and confidentiality issues of uploading data to third party servers are not a problem.
  • Avoid legal problems of processing clinical data on "outside" servers.
Multi platform SnpEff is written in Java. It runs on Unix / Linux, OS X and Windows.
Simple installation Installation is as simple as downloading a ZIP file and double clicking on it.
Genomes The human genome, as well as all common model organisms, is supported. Over 2,500 genomes are available, including most mammalian, plant, bacterial and fungal genomes with published genomic data.
Speed SnpEff is really fast. It can annotate up to 1,000,000 variants per minute.
GATK integration SnpEff can be easily integrated with GATK pipelines.
GUI Web based user interface via Galaxy project
Input and Output formats SnpEff accepts input files in the following formats:
  • VCF format, which is the de-facto standard for sequencing variants.
  • TXT format: Although TXT format is deprecated, some people still find it useful.
  • BED format: To annotate enrichment experiments (e.g. ChIP-Seq peaks) or other genomic data.
Variants supported SnpEff can annotate SNPs, MNPs, insertions and deletions. Support for mixed variants and structural variants is available (although sometimes limited).
Effect supported Many effects are calculated: such as SYNONYMOUS_CODING, NON_SYNONYMOUS_CODING, FRAME_SHIFT, STOP_GAINED just to name a few.
Variant impact SnpEff provides a simple assessment of the putative impact of the variant (e.g. HIGH, MODERATE or LOW impact).
Cancer tissue analysis Somatic vs Germline mutations can be calculated on the fly. This is very useful for the cancer researcher community.
Loss of Function (LOF) assessment SnpEff can estimate whether a variant is deemed to cause a loss of function in the protein.
Nonsense-mediated decay (NMD) assessment Some mutations may cause mRNA to be degraded and thus not translated into a protein. NMD analysis marks mutations that are estimated to trigger nonsense-mediated decay.
HGVS notation SnpEff can provide output in HGVS notation, which is quite popular in clinical and translational research environments.
User annotations A user can provide custom annotations (by means of BED files).
Public databases SnpEff can annotate using publicly available data from well known databases, for instance:
  • ENCODE datasets are supported by SnpEff (by means of BigWig files provided by ENCODE project).
  • Epigenome Roadmap provides data-sets that can be used with SnpEff.
  • TFBS: Transcription factor binding site predictions can be annotated. The motif data used in these annotations is generated by the Jaspar and ENSEMBL projects.
  • NextProt database can be used to annotate protein domains as well as important functional sites in a protein (e.g. phosphorylation sites).
Common variants (dbSnp) Annotating "common" variants from dbSnp and 1,000 Genomes can be easily done (see SnpSift annotate).
Gwas catalog Support for GWAS catalog annotations (see SnpSift gwasCat )
Conservation scores PhastCons conservation score annotations support (see SnpSift phastCons)
DbNsfp A comprehensive database providing many annotations and scores, such as: SIFT, Polyphen2, GERP++, PhyloP, MutationTaster, SiPhy, Interpro, Haploinsufficiency, etc. (via SnpSift).

See SnpSift dbnsfp for details.

Non-coding annotations Regulatory and non-coding annotations are supported for different tissues and cell lines. Annotations supported include PolII, H3K27ac, H3K4me2, H3K4me3, H3K27me3, CTCF, H3K36me3, just to name a few.
Gene Sets annotations Gene sets (MSigDb, GO, BioCarta, KEGG, Reactome, etc.) can be used to annotate via SnpSift geneSets command.

Here we'll show a few examples that should be enough to get you started.

Basic examples: Installing SnpEff

Obviously the first step to use the program is to install it. You have to download the core program from the "Download" page and then uncompress the ZIP file. On Windows systems, you can just double click the ZIP file and copy its contents to wherever you want the program installed. If you have a Unix or Mac system, the command line would be:

unzip snpEff_latest_core.zip 

By default SnpEff stores all the databases in $HOME/snpEff/data (where $HOME is your home directory). If you want to change this, you can edit the config file and change the data_dir entry:

#---
# Databases are stored here
# E.g.: Information for 'hg19' is stored in data_dir/hg19/
#
# Note: Since version 2.1 you can use tilde ('~') as first character to refer to your home directory
#---
data_dir = ~/snpEff/data/

Basic examples: Managing SnpEff databases

In order to use SnpEff, you need a database for the reference genome you want to use. SnpEff databases for the most popular genomes are already pre-built and available for you to download. So there is no need for you to build it (this will save you a LOT of work).

Most of the time you don't need to build the SnpEff database. You can just download a pre-built one.

The easiest way to download and install a pre-built SnpEff database is by using the "download" command. E.g. if you want to install the SnpEff database for the human genome, you can run the following command:

java -jar snpEff.jar download -v GRCh37.66

If you are running SnpEff from a directory different from the one where it was installed, you will have to specify where the config file is. This is done using the '-c' command line option:

java -Xmx4g -jar snpEff.jar download -c path/to/snpEff/snpEff.config -v GRCh37.66 

Basic examples: Annotate using SnpEff

Let's assume you have a VCF file ("file.vcf") and you want to annotate the variants in that file. You can download the file for this example here. This data is from the 1000 Genomes project, so the reference genome is the human genome (GRCh37.66).

Let's assume this is the first time you use SnpEff and you don't have the database installed, so the first step is to download the database. Then you can annotate the file:

# Go to the directory where snpEff is installed
cd snpEff

# Do this only if you don't already have the database installed.
java -jar snpEff.jar download -v GRCh37.66

# Annotate the file
java -Xmx4g -jar snpEff.jar eff -v GRCh37.66 file.vcf > file.eff.vcf

The annotated variants will be in the new file "file.eff.vcf".

SnpEff creates a file called "snpEff_summary.html" showing basic statistics about the analyzed variants. Take a quick look at it!

We used the java parameter -Xmx4g to increase the memory available to the Java Virtual Machine to 4G. SnpEff's human genome database is large and it has to be loaded into memory. If your computer doesn't have at least 4G of memory, you probably won't be able to run this example.

If you are running SnpEff from a directory different from the one where it was installed, you will have to specify where the config file is. This is done using the '-c' command line option:

java -Xmx4g -jar snpEff.jar eff -c path/to/snpEff/snpEff.config -v GRCh37.66 file.vcf > file.eff.vcf

Advanced examples: Filter out SNPs from dbSnp

These are slightly more advanced examples. Here we'll try to show how to perform specific tasks. If you want to filter out SNPs from dbSnp, you can do it using SnpSift. You can download SnpSift from the "Downloads" page.

You can download the file for this example here.
Here is how to do it:

  1. Annotate ID fields using SnpSift annotate and dbSnp
    We annotate using dbSnp before using SnpEff in order to have 'known' and 'unknown' statistics in SnpEff's summary page. Those stats are based on the presence of an ID field. If the ID is non-empty, then it is assumed to be a 'known variant'.
    # Download and uncompress dbSnp database.
    wget -O dbSnp.vcf.gz ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz
    gunzip dbSnp.vcf.gz
    
    # Annotate ID field using dbSnp
    java -jar SnpSift.jar annotate -v dbSnp.vcf file.vcf > file.dbSnp.vcf
    
  2. Annotate using SnpEff:
    # Do this only if you don't already have the database installed.
    java -jar snpEff.jar download -v GRCh37.66
    
    # Annotate the file
    java -Xmx4g -jar snpEff.jar eff -v GRCh37.66 file.dbSnp.vcf > file.eff.vcf
    
  3. Filter out variants that have a non-empty ID field (i.e. keep only the variants that are NOT in dbSnp, since we annotated the ID field using rs-numbers from dbSnp in step 1):
    java -jar SnpSift.jar filter -f file.eff.vcf "! exists ID" > file.eff.not_in_dbSnp.vcf
    
    The expression used to filter the file is "! exists ID". This means that the ID field does not exist, i.e. the value is empty, which is represented as a dot (".") in a VCF file.
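    Conversely, if you wanted to keep only the variants that ARE already in dbSnp (i.e. those with a non-empty ID), you could drop the negation; the output file name here is just an example:
    java -jar SnpSift.jar filter -f file.eff.vcf "exists ID" > file.eff.in_dbSnp.vcf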

Downloading and installing SnpEff is pretty easy.

System requirements

SnpEff requires that you have Java v1.6 or later installed (any modern operating system has it).
You'll probably need at least 2 GB of memory. The amount of memory used can vary significantly depending on the genome size and the type of analysis you are doing.
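To check which Java version is installed, you can run:

java -version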

Downloading and Installing

In order to download and install the program, you can follow the directions here.

Configuration

The only configuration file is snpEff.config. Most configuration parameters are explained in the comments in the config file itself, so I won't repeat the explanation here :-)
Usually you will only need to change the data_dir parameter. This parameter points to the directory where the databases are stored (on Unix systems, the default is $HOME/snpEff/data).

Source code

The source code is on GitHub (although we keep the binary distribution at SourceForge).

Getting the source: Here is the git command to check out the development version of the code:

git clone https://github.com/pcingola/SnpEff.git

# If you want SnpSift as well
git clone https://github.com/pcingola/SnpSift.git

Install required libraries: Most libraries should be installed using Maven, so you just need to run the mvn command.

You may need to install the sam-jdk JAR file manually in your local Maven repository. Download the JAR file here, then run the following command to add the JAR to Maven:

# Download JAR file
wget http://sourceforge.net/projects/picard/files/sam-jdk/1.97/sam-1.97.jar

# Manually add it to Maven
mvn install:install-file -Dfile=sam-1.97.jar -DgroupId=net.sf.samtools -DartifactId=Sam -Dversion=1.97 -Dpackaging=jar

Installing test cases: Test cases require special "test cases" databases and genomes, which you can find here:

# Install test databases in SnpEff's development directory (not the source code dir!)
cd $HOME/snpEff

# Download databases and genome for test cases
wget http://sourceforge.net/projects/snpeff/files/databases/test_cases.tgz

# Uncompress
tar -xvzf test_cases.tgz

# Go to Eclipse's workspace directory (where the source code is)
cd $HOME/workspace/SnpEff

# Create a link to the 'data' dir, so that we can run test cases within Eclipse
ln -s $HOME/snpEff/data

# Add data dir to 'gitignore'
echo "/data" >> .gitignore

We show some basic examples of how to use SnpEff.

Command line vs Web interface

In order to run SnpEff you need to be comfortable running commands from a command line terminal. If you are not, then it is probably a good idea to ask your systems administrator to install a Galaxy server and use the web interface. You can also use the public Galaxy server, but functionality may be limited and SnpEff versions may not be updated frequently.

Usage example

In this example, we will annotate a variants file using the human reference genome.
As an input, we will use a Variant Call Format (VCF) file. VCF is the de-facto standard for genetic variants.
In this example we will use a file 'demo.1kg.vcf.gz', which is provided in the distribution.
# Go to SnpEff install directory
cd ~/snpEff 

# Download human database 'hg19'
java -jar snpEff.jar download -v hg19

# Annotate 
java -Xmx2g -jar snpEff.jar hg19 -v demo.1kg.vcf > demo.1kg.snpeff.vcf
Here is an example of some entries in the annotated output file. You can see the 'EFF' field was added, predicting STOP_GAINED protein changes:
$ cat demo.1kg.snpeff.vcf | grep STOP_
1    889455    .    G    A    100.0    PASS    ...;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q236*...
1    897062    .    C    T    100.0    PASS    ...;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q141*...
1    900375    .    G    A    100.0    PASS    ...;EFF=STOP_GAINED(HIGH|NONSENSE|tGg/tAg|W578*...
Note: The real output was edited for readability reasons.

Running from another directory

When you run SnpEff from a different directory than your install directory, you have to specify where the config file is located using the '-c' command line option.
java -Xmx2g -jar path/to/snpEff/snpEff.jar -c path/to/snpEff/snpEff.config GRCh37.70 path/to/snps.vcf

Java memory options

By default, the amount of memory assigned to a Java process is too low. If you don't assign more memory to the process, you will most likely get an "OutOfMemory" error.
You should set the amount of memory in your Java virtual machine to at least 2 GB. This can be easily done using the Java command line option "-Xmx". E.g. in this example I use 4 GB:
# Run using 4 GB of memory 
java -Xmx4G -jar snpEff.jar hg19 path/to/your/files/snps.vcf
Note: There is no space between "-Xmx" and "4G".

Running SnpEff in the Cloud

You can run SnpEff in "the Cloud" exactly the same way as you run it on your local computer. You should not have any problems at all.
Here is an example of installing it and running it on an Amazon EC2 instance (virtual machine):

$ ssh -i ./aws_amazon/pcingola_aws.pem ec2-user@ec2-54-234-14-244.compute-1.amazonaws.com

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|


[ec2-user@ip-10-2-202-163 ~]$ wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
[ec2-user@ip-10-2-202-163 ~]$ unzip snpEff_latest_core.zip
[ec2-user@ip-10-2-202-163 ~]$ cd snpEff_3_1/
[ec2-user@ip-10-2-202-163 snpEff_3_1]$ java -jar snpEff.jar download -v hg19
00:00:00.000    Downloading database for 'hg19'
...
00:00:36.340    Done
[ec2-user@ip-10-2-202-163 snpEff_3_1]$ java -Xmx4G -jar snpEff.jar dump -v hg19 > /dev/null
00:00:00.000    Reading database for genome 'hg19' (this might take a while)
00:00:20.688    done
00:00:20.688    Building interval forest
00:00:33.110    Done.

As you can see, it's very simple.

Command line options

In order to see all available options, you can run
java -jar snpEff.jar -help
We'll explain many of those options in detail in the following chapters.

Loading the database

One of the first things SnpEff has to do is load the database. Usually this takes from a few seconds to a couple of minutes, depending on database size. Complex databases, like human, require more time to load. After the database is loaded, SnpEff can analyze thousands of variants per second.

Here we show an example of how to get from sequencing data to an annotated variants file.

Sequencing data example

This is an extremely simplified version of how to analyze the data from scratch. This is not meant to be a tutorial on sequencing analysis, as that would be way beyond the scope of this handbook.
Let's assume you have sequence data in FASTQ format (file "s.fastq") and your reference genome is dm5.34 (the fly genome):
# Download the genome, uncompress and rename file
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.34_FB2011_02/fasta/dmel-all-chromosome-r5.34.fasta.gz
gunzip dmel-all-chromosome-r5.34.fasta.gz
mv dmel-all-chromosome-r5.34.fasta dm5.34.fasta

# Create a genome index (we assume you installed BWA http://bio-bwa.sourceforge.net/)
bwa index -a bwtsw dm5.34.fasta

# Map sequences to the genome: Create SAI file
bwa aln dm5.34.fasta s.fastq > s.sai

# Map sequences to the genome: Create SAM file
bwa samse dm5.34.fasta s.sai s.fastq > s.sam

# Create BAM file (we assume you installed SamTools http://samtools.sourceforge.net/)
samtools view -S -b s.sam > s.bam

# Sort BAM file (will create s_sort.bam)
samtools sort s.bam s_sort

# Create VCF file (BcfTools is part of samtools distribution)
samtools mpileup -uf dm5.34.fasta s_sort.bam | bcftools view -vcg - > s.vcf

# Analyze variants using snpEff
java -Xmx4g -jar snpEff.jar dm5.34 s.vcf > s.eff.vcf
This highly simplified sequencing data analysis pipeline has the common basic steps:
  • Index the reference genome (bwa)
  • Map reads to reference genome (bwa)
  • Call variants (bcftools)
  • Annotate variants (SnpEff)

Files used as input to SnpEff must comply with standard formats. Here we describe supported input data formats.

You can use "-" as the input file name to specify STDIN.
For example, you can easily stream files like this:

cat file.vcf | java -Xmx4g -jar snpEff.jar hg19 - > file.eff.vcf

VCF

As we mentioned before, Variant Call Format (VCF) is the recommended format for input files. This is the format used by the "1000 Genomes Project", and is currently considered the de-facto standard for genomic variants. Furthermore, most variant calling programs output VCF files, so it is also the simplest format to use.
In a nutshell, a VCF file is a tab-separated text file with the following columns:
  1. Chromosome name
  2. Position
  3. Variant's ID
  4. Reference sequence (the reference allele)
  5. Alternative sequence (i.e. the variant)
  6. Quality score
  7. Filter (whether or not the variant passed quality filters)
  8. INFO : Generic information about this variant. SnpEff adds annotation information in this column.
Here is an example of a few lines in a VCF file:
#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                    
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017   
Note that the first line is header information. Header lines start with '#'.

TXT

TXT format is currently deprecated and discouraged.
TXT file format must be tab-separated, containing columns that correspond to:
  • chromosome_name
  • chromosome_position
  • reference sequence
  • changed sequence: A slash is used to separate two alleles
  • strand: {+,-} (optional)
  • quality (optional)
  • coverage (optional)
This is an example of an input TXT file:
5   140532    T    C       +
12  1017956   T    A       +    45    12
2   946507    G    C       +    23    8
14  19584687  C    T       -
19  66520     G    -G/-G   +
8   150029    A    */+T    +

TXT: Representing a heterozygous variant
TXT format is currently deprecated and discouraged.
When representing a heterozygous SNP you can use IUB codes: M=A/C, R=A/G, W=A/T, S=C/G, Y=C/T and K=G/T. Indels are indicated by, for example, */+A, -A/* or +CC/-C. There is no difference between */+A and +A/*.
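For example, the following (hypothetical) TXT lines represent a heterozygous A/G SNP, encoded with the IUB code 'R', and a heterozygous single-base insertion:

1   123456   A    R       +
2   234567   *    */+T    +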

BED

In an enrichment experiment, such as ChIP-Seq, the results are enrichment regions, usually called "peaks". It is common for "peak callers" (the algorithms that detect enrichment) to write the results to a BED file. SnpEff can annotate BED files in order to facilitate interpretation of enrichment experiments.

Columns five onwards are ignored when using the BED file format, and they will be lost in the output file.
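A minimal invocation looks like this (the file names are just examples); the '-i bed' option tells SnpEff that the input is in BED format:

java -Xmx4g -jar snpEff.jar eff -i bed GRCh37.70 peaks.bed > peaks.eff.bed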

Here we describe the format of SnpEff's output files and the information added.

VCF output format

As we mentioned in the previous chapter, VCF is the default input format. It is highly recommended to use VCF as the input and output format, since it is a standard format that can also be used by other tools and software packages. Thus VCF makes it much easier to integrate genomic data processing pipelines.
SnpEff adds annotation information to the INFO field of a VCF file. The INFO field is the eighth column of a VCF file; see the previous section for a quick example or take a look at the VCF specification for details.

Here is an example of a file before and after being annotated using SnpEff:
VCF file before annotations
#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                    
1	889455	.	G	A	100.0	PASS	AF=0.0005
1	897062	.	C	T	100.0	PASS	AF=0.0005
VCF file after being annotated using SnpEff
#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                    
1	889455	.	G	A	100.0	PASS	AF=0.0005;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q236*|749|NOC2L||CODING|NM_015658|)
1	897062	.	C	T	100.0	PASS	AF=0.0005;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q141*|642|KLHL17||CODING|NM_198317|)
As you can see, SnpEff added an 'EFF' tag to the INFO field (eighth column).

VCF EFF field

Effects information is added to the INFO field using an 'EFF' tag. There can be multiple effects separated by comma. The format for each effect is:

EFF= Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_Length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank  | Genotype_Number [ | ERRORS | WARNINGS ] )
EFF sub-field           | Meaning
Effect                  | Effect of this variant. See details here.
Effect_Impact           | Effect impact {High, Moderate, Low, Modifier}. See details here.
Functional_Class        | Functional class {NONE, SILENT, MISSENSE, NONSENSE}.
Codon_Change / Distance | Codon change: old_codon/new_codon, OR distance to transcript (in case of upstream / downstream).
Amino_Acid_Change       | Amino acid change: old_AA AA_position/new_AA (e.g. 'E30K').
Amino_Acid_Length       | Length of protein in amino acids (actually, transcript length divided by 3).
Gene_Name               | Gene name.
Transcript_BioType      | Transcript bioType, if available.
Gene_Coding             | [CODING | NON_CODING]. This field is 'CODING' if any transcript of the gene is marked as protein coding.
Transcript_ID           | Transcript ID (usually ENSEMBL IDs).
Exon/Intron Rank        | Exon or intron rank (e.g. '1' for the first exon, '2' for the second exon, etc.).
Genotype_Number         | Genotype number corresponding to this effect (e.g. '2' if the effect corresponds to the second ALT).
Warnings / Errors       | Any warnings or errors (not shown if empty).
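As a worked example, the first EFF entry from the annotated file shown above breaks down as follows (empty sub-fields, such as Transcript_BioType in this entry, are simply left blank between the '|' separators):

EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q236*|749|NOC2L||CODING|NM_015658|)
    Effect            : STOP_GAINED
    Effect_Impact     : HIGH
    Functional_Class  : NONSENSE
    Codon_Change      : Cag/Tag
    Amino_Acid_Change : Q236*
    Amino_Acid_Length : 749
    Gene_Name         : NOC2L
    Gene_Coding       : CODING
    Transcript_ID     : NM_015658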

Errors and Warnings

As mentioned in the previous section, the last sub-field in the EFF field shows errors or warnings (if any). Here is a description of the errors and warnings:

Error Meaning and possible solutions
ERROR_CHROMOSOME_NOT_FOUND The chromosome does not exist in the reference database. Typically this means that there is a mismatch between the chromosome names in your input file and the chromosome names used in the reference genome to build SnpEff's database.

Warning: This could be caused by trying to annotate using a reference genome different from the one you used for sequence alignment. Obviously doing this makes no sense and the annotation information you'll get will be garbage. That's why SnpEff shows you an error message.

Possible solution: Sometimes the SnpEff database matches the reference genome for your organism and it's just that the chromosome names are different. You can fix this by changing the chromosome names in the input file.
You can see chromosome names used by SnpEff by using -v (verbose) option. SnpEff will show a line like this one:
# Chromosomes names [sizes]  : '1' [249250621] '2' [243199373] ...
Then you can modify your input file chromosome names to match the database (e.g. using sed unix command).
ERROR_OUT_OF_CHROMOSOME_RANGE This means that the position is higher than the chromosome's length. It is probably an indicator that your data is not from this reference genome.
ERROR_OUT_OF_EXON Exonic information not matching the coordinates. Indicates a problem (or even a bug?) in the database
ERROR_MISSING_CDS_SEQUENCE Transcript has no CDS info. Indicates a problem (or even a bug?) in the database
Warning Meaning and possible solutions
WARNING_REF_DOES_NOT_MATCH_GENOME This means that the REF field does not match the reference genome.

This warning probably indicates there is something really wrong with your data!

This happens when your data was aligned to a different reference genome than the one used to create SnpEff's database. If there are many of these warnings, it's a strong indicator that the data doesn't match and all the annotations will be garbage (because you are using the wrong database).

Solution: Use the right database to annotate!

Due to performance and memory optimizations, SnpEff only checks the reference sequence in exons.
WARNING_SEQUENCE_NOT_AVAILABLE For some reason the exon sequence is not available, so we cannot calculate effects.
WARNING_TRANSCRIPT_INCOMPLETE A protein coding transcript whose length is not a multiple of 3. This means that information is missing for one or more amino acids.
This is usually due to errors in the genomic information (e.g. the genomic databases provided by UCSC or ENSEMBL). Genomic information databases are constantly being improved and are getting more accurate, but some errors still remain.
WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS A protein coding transcript has two or more STOP codons in the middle of the coding sequence (CDS). This should not happen and it usually means the genomic information may have an error in this transcript.
This is usually due to errors in the genomic information (e.g. the genomic databases provided by UCSC or ENSEMBL). Genomic information databases are constantly being improved and are getting more accurate, but some errors still remain.
WARNING_TRANSCRIPT_NO_START_CODON A protein coding transcript does not have a proper START codon. It is rare that a real transcript does not have a START codon, so this probably indicates errors in genomic information for this transcript (e.g. the genomic databases provided by UCSC or ENSEMBL).
Genomic information databases are constantly being improved and are getting more accurate, but some errors still remain.

VCF Header lines

SnpEff updates the header of the VCF file to reflect additional fields. This is required by the VCF specification. SnpEff also adds the command line options used to annotate the file as well as SnpEff's version, so you can keep track of what exactly was done.

Here is an example of some header lines added to an annotated file:

##SnpEffVersion="SnpEff 3.1m (build 2013-02-08)"
##SnpEffCmd="SnpEff  hg19 demo.1kg.vcf "
##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change| Amino_Acid_length | Gene_Name | Gene_BioType | Coding | Transcript | Exon [ | ERRORS | WARNINGS ] )' \">


Multiple effects per VCF line

Usually there is more than one effect reported in each EFF field.
There are several reasons for this:

  • A variant can affect multiple genes. E.g. a variant can be DOWNSTREAM from one gene and UPSTREAM from another gene.
  • In complex organisms, genes usually have multiple transcripts. So SnpEff reports the effect of a variant on each transcript.
  • A VCF line can have more than one variant. For instance, if the reference genome is 'G' but the sample has either 'A' or 'T', this will be reported as one line with multiple alternative variants. E.g.:
    #CHROM  POS       ID   REF  ALT      QUAL  FILTER  INFO                    
    1       889455    .    G    A,T      .     .       AF=0.0005
    
    In this case SnpEff will report the effect of each variant on each gene and each transcript.


Counting total number of effects of a given type

Some people try to count the number of effects in a file by doing (assuming we want to count how many MODIFIER effects we have):

 grep -o MODIFIER output.eff.vcf | wc -l 

This is incorrect because a VCF line can have multiple effects (e.g. when there are multiple transcripts in a gene). A proper way to count effects would be:
cat output.eff.vcf \
	| cut -f 8 \
	| tr ";" "\n" \
	| grep ^EFF= \
	| cut -f 2 -d = \
	| tr "," "\n" \
	| grep MODIFIER \
	| wc -l
Brief explanation:
cut -f 8              : Extract INFO fields
tr ";" "\n"           : Expand each field into one line
grep ^EFF=            : Only keep 'EFF' fields
cut -f 2 -d =         : Keep only the effect data (drop the 'EFF=' part)
tr "," "\n"           : Expand effects to multiple lines
grep MODIFIER | wc -l : Count the ones you want (in this example 'MODIFIER')


TXT output format

TXT format is currently deprecated and discouraged.
The TXT output format consists of one line per effect. This means that you usually get more than one line per SNP, since the same SNP may affect several different transcripts in the same gene.

The format consists of tab-separated columns:
Column              | Meaning
Chromosome          | Chromosome name (usually without any leading 'chr' string)
Position            | One-based position
Reference           | Reference sequence
Change              | Sequence change
Change type         | Type of change {SNP, MNP, INS, DEL}
Homozygous          | Is this homozygous or heterozygous {Hom, Het}
Quality             | Quality score (from input file)
Coverage            | Coverage (from input file)
Warnings            | Any warnings or errors
Gene_ID             | Gene ID (usually ENSEMBL)
Gene_name           | Gene name
Bio_type            | BioType, as reported by ENSEMBL
Transcript_ID       | Transcript ID (usually ENSEMBL)
Exon_ID             | Exon ID (usually ENSEMBL)
Exon_Rank           | Exon number on a transcript
Effect              | Effect of this variant. See details below.
old_AA/new_AA       | Amino acid change
old_codon/new_codon | Codon change
Codon_Num(CDS)      | Codon number in CDS
Codon_degenaracy    | Codon degeneracy (see below)
CDS_size            | CDS size in bases
Custom_interval_ID  | If any custom interval was used, its IDs are added here (may be more than one)

This is an example of an output file:
chr2L   35041   T       K       SNP     Het     50                      FBgn0051973     Cda5            FBtr0078164     FBgn0051973:11  11      SYNONYMOUS_CODING       Q/Q     CAA/CAA 332     5997    
chr2L   35041   T       K       SNP     Het     50                      FBgn0051973     Cda5            FBtr0078163     FBgn0051973:11  11      NON_SYNONYMOUS_CODING   Q/H     CAA/CAC 332     3120    
chr2L   200401  C       Y       SNP     Het     228                     FBgn0016977     spen            FBtr0078121     CG18497:10      10      STOP_GAINED     Y/*     TAC/TAG 5240    16602   
chr2L   194601  C       Y       SNP     Het     228                     FBgn0016977     spen            FBtr0078122     CG18497:6       6       NON_SYNONYMOUS_CODING   P/A     CCG/GCG 3473    16683   
chr2L   779563  A       R       SNP     Het     168                     FBgn0031277     CG13947         FBtr0078005     FBgn0031277:1   1       STOP_LOST       */Y     TAA/TAC 120     360   
chr2L   856021  *       +CGGAGGAGG/*    INS     Het     267                     FBgn0031288     CG13949         FBtr0078017     FBgn0031288:3   3       FRAME_SHIFT                     118     438     
chr2L   856890  C       Y       SNP     Het     201                     FBgn0029095     aru             FBtr0078031                     DOWNSTREAM: 43 bases   
chr2L   871556  G       R       SNP     Het     228                     FBgn0053526     PNUTS           FBtr0091487     CG33526:2       2       5PRIME_UTR: 18 bases from TSS  
chr2L   878267  T       W       SNP     Het     228             WARNING: Base is 'G' but SNP says 'T'.  FBgn0031292     CG15824         FBtr0078029     CG15824:14      14      3PRIME_UTR: 14 bases from transcript end  


Meaning of 'Degeneracy' column: Here is an explanation from Stephen Wright (Univ. of Toronto), who requested this feature:
	     "...a fourfold degenerate site would be a site where any change is synonymous. So the 
	     third codon position for the arginine CGN, is a fourfold degenerate site, as is the 
	     third codon position for valine, alanine, etc.
	     Similarly, all second positions of a codon are zerofold degenerate, since any change is 
	     nonsynonymous. Many first codon positions are also zerofold degenerate, however, for 
	     example, the first codon position of AGG is NOT zerofold, because a shift to CGG is a 
	     synonymous change."


Details about Hom/Het calculation (summary file and TXT output format)

In older formats, there used to be one sample per file, so identifying Hom/Het on a per-variant basis made sense.

In VCF files, where many samples are analyzed simultaneously, identification of Hom/Het should be done on a per-sample basis instead of a per-variant basis. In an attempt to output that field I used a majority vote (i.e. if most samples are Hom, I output Hom; otherwise, Het). This is not strictly correct and it may also be confusing.

For this reason, the output is empty when using multi-sample files, such as VCF, unless the file contains only one sample (i.e. only one genotype field).


Coordinate differences between VCF and TXT outputs

There are some situations where there seems to be a difference between TXT output coordinates and VCF output coordinates. This happens because sometimes VCF files have a coordinate that does not start exactly where the variant is. Here is an example:

VCF line:                   "chr10   102770293       .       tgctgcggctgcggctgcggctacggctgcggct      tGCTGCGgctgcggctgcggctgcggctacggctgcggct"
INS reported by SnpEff:     "chr10   102770315       *       +GCGGCT INS"

At first glance, these two look different, since 102770293 is not the same as 102770315. But if you look at the sequences:

tgctgcggctgcggctgcggct      acggctgcggct    REF (from VCF file)
||||||||||||||||||||||      ||||||||||||
tgctgcggctgcggctgcggctgcggctacggctgcggct    ALT (from VCF file)
|      |         |    |||||| 
|      |         |    gcggct                INS (reported by SnpEff TXT format)
|      |         |    |
102770293        |    |                     Position (VCF)
       |         |    |
       102770300 |    |
                 |    |
                 102770310
                      |
                      102770315             Position (TXT)
So the VCF and TXT representations are equivalent.



BED output format


SnpEff can annotate BED files in order to facilitate interpretation of enrichment experiments. Annotations are added to the fourth column of the BED file.
E.g.:
$ java -Xmx4g -jar snpEff.jar eff -i bed BDGP5.69 chipSeq_peaks.bed 

# SnpEff version 3.3 (build 2013-05-15), by Pablo Cingolani
# Command line: SnpEff  -i bed BDGP5.69 /home/pcingola/fly_pvuseq/chipSeq/Sample_w1118_IP_w_5hmC/w1118_IP_w_5hmC_peaks.bed 
# Chromo  Start     End       Name;Effect|Gene|BioType        Score
2L        189463    190154    MACS_peak_1;Exon|exon_6_12_RETAINED|FBtr0078122|protein_coding|spen|protein_coding;Exon|exon_5_10_RETAINED|FBtr0078123|protein_coding|spen|protein_coding;Exon|exon_7_13_RETAINED|FBtr0306341|protein_coding|spen|protein_coding;Exon|exon_6_11_RETAINED|FBtr0078121|protein_coding|spen|protein_coding 245.41
2L        195607    196120    MACS_peak_2;Exon|exon_6_12_RETAINED|FBtr0078122|protein_coding|spen|protein_coding;Exon|exon_5_10_RETAINED|FBtr0078123|protein_coding|spen|protein_coding;Exon|exon_7_13_RETAINED|FBtr0306341|protein_coding|spen|protein_coding;Exon|exon_6_11_RETAINED|FBtr0078121|protein_coding|spen|protein_coding 51.22
2L        527253    527972    MACS_peak_3;Intron|intron_2_RETAINED-RETAINED|FBtr0078063|protein_coding|ush|protein_coding     55.97
2L        711439    711764    MACS_peak_4;Intron|intron_1_RETAINED-RETAINED|FBtr0078045|protein_coding|ds|protein_coding      61.16
2L        1365255   1365556   MACS_peak_5;Upstream|FBtr0077927|protein_coding|CG14346|protein_coding;Upstream|FBtr0077926|protein_coding|CG14346|protein_coding;Intergenic|NLaz...CG14346;Upstream|FBtr0077942|protein_coding|NLaz|protein_coding     62.78
2L        1970199   1970405   MACS_peak_6;Upstream|FBtr0077813|protein_coding|Der-1|protein_coding;Intergenic|tRNA:CR31942...Der-1;Downstream|FBtr0077812|tRNA|tRNA:CR31942|tRNA      110.34
2L        3345637   3346152   MACS_peak_7;Intron|intron_2_ALTTENATIVE_3SS-ALTTENATIVE_3SS|FBtr0089979|protein_coding|E23|protein_coding;Intron|intron_3_ALTTENATIVE_3SS-ALTTENATIVE_3SS|FBtr0089981|protein_coding|E23|protein_coding 65.49
2L        4154734   4155027   MACS_peak_8;Intergenic|CG2955...Or24a;Downstream|FBtr0077468|protein_coding|CG2955|protein_coding       76.92
2L        4643232   4643531   MACS_peak_9;Downstream|FBtr0110769|protein_coding|BG642163|protein_coding;Exon|exon_2_2_RETAINED|FBtr0300354|protein_coding|CG15635|protein_coding      76.92

When a peak intersects multiple transcripts or even multiple genes, each annotation is separated by a semicolon. So if you look at the previous results in more detail, the first line looks like this (format edited for readability purposes):
2L  189463  190154  MACS_peak_1;Exon|exon_6_12_RETAINED|FBtr0078122|protein_coding|spen|protein_coding
                               ;Exon|exon_5_10_RETAINED|FBtr0078123|protein_coding|spen|protein_coding
                               ;Exon|exon_7_13_RETAINED|FBtr0306341|protein_coding|spen|protein_coding
                               ;Exon|exon_6_11_RETAINED|FBtr0078121|protein_coding|spen|protein_coding
This peak is hitting four transcripts (FBtr0078122, FBtr0078123, FBtr0306341, FBtr0078121) in gene 'spen'.


Exon naming convention

The format for the exon identifier is exon_Rank_Total_Type, where:
  • rank is the exon rank in the transcript (position in the transcript)
  • total is the total number of exons in that transcript
  • type is the exon splice type.

For instance exon_5_10_RETAINED would be the fifth exon in a 10 exon transcript. This exon is type "RETAINED", which means it is not spliced out.

Exons are categorized by splicing as follows:
  • NONE : Not spliced
  • RETAINED : All transcripts have this exon
  • SKIPPED : Some transcripts skip it
  • ALTTENATIVE_3SS : Some transcripts have an alternative 3' exon start
  • ALTTENATIVE_5SS : Some transcripts have an alternative 5' exon end
  • MUTUALLY_EXCLUSIVE : Mutually exclusive (with respect to another exon)
  • ALTTENATIVE_PROMOMOTER : The first exon is different in some transcripts.
  • ALTTENATIVE_POLY_A : The last exon is different in some transcripts.
See this Wikipedia entry for more information on exon splice types.

Intron naming convention

Similarly to exons, introns are named as intron_Rank_ExonTypeBefore-ExonTypeAfter, where:
  • Rank : the rank number of this intron in the transcript
  • ExonTypeBefore : the splicing type of the exon preceding this intron (see exon naming convention for details).
  • ExonTypeAfter : the splicing type of the exon following this intron (see exon naming convention for details).
For instance intron_9_SKIPPED-RETAINED would be the ninth intron of the transcript. The intron is preceded by a SKIPPED exon and followed by a RETAINED exon.

SnpEff creates an additional output file showing overall statistics. This "stats" file is an HTML file which can be opened using a web browser. You can find an example of a 'stats' file here.

The program computes some statistics and saves them to the file 'snpEff_summary.html' in the directory where snpEff is being executed. You can see the file by opening it in your browser.

You can change the default location by using the '-stats' command line option. This also changes the location of the TXT summary file.

The summary can be created in CSV format using the command line option -csvStats. This allows easy downstream processing.
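For example, to write the summary to a custom location (the file names here are arbitrary):

java -Xmx4g -jar snpEff.jar eff -stats my_stats.html GRCh37.70 file.vcf > file.eff.vcf

The -csvStats option is used in a similar way; depending on your SnpEff version it is either a plain flag or takes an output file name, so check the -help output for the exact syntax.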

E.g.: In the stats file, you can see coverage histogram plots like this one


"Effects by type" vs "Effects by region"

SnpEff annotates variants. Variants produce effects of different "types" (e.g. NON_SYNONYMOUS_CODING, STOP_GAINED). These variants affect regions of the genome (e.g. EXON, INTRON). The two tables count how many effects exist for each type and for each region.
E.g.: In an EXON region, you can have all the following effect types: NON_SYNONYMOUS_CODING, SYNONYMOUS_CODING, FRAME_SHIFT, STOP_GAINED, etc.

The complicated part is that some effect types affect a region that has the same name (yes, I know, this is confusing).
E.g.: In a UTR_5_PRIME region you can have UTR_5_PRIME and START_GAINED effect type.


This means that the numbers in the two tables are not exactly the same, because the labels don't mean the same thing. See the next figure as an example.

So the number of effects that affect a UTR_5_PRIME region is 206. Of those, 57 are of effect type START_GAINED and 149 are of effect type UTR_5_PRIME.
How exactly are effect type and effect region related? See the following table:

Effect Type                                                                      | Region
NONE, CHROMOSOME, CUSTOM, CDS                                                    | NONE
INTERGENIC, INTERGENIC_CONSERVED                                                 | INTERGENIC
UPSTREAM                                                                         | UPSTREAM
UTR_5_PRIME, UTR_5_DELETED, START_GAINED                                         | UTR_5_PRIME
SPLICE_SITE_ACCEPTOR                                                             | SPLICE_SITE_ACCEPTOR
SPLICE_SITE_DONOR                                                                | SPLICE_SITE_DONOR
SPLICE_SITE_REGION                                                               | SPLICE_SITE_REGION
INTRAGENIC, START_LOST, SYNONYMOUS_START, NON_SYNONYMOUS_START, GENE, TRANSCRIPT | EXON or NONE
EXON, EXON_DELETED, NON_SYNONYMOUS_CODING, SYNONYMOUS_CODING, FRAME_SHIFT, CODON_CHANGE, CODON_INSERTION, CODON_CHANGE_PLUS_CODON_INSERTION, CODON_DELETION, CODON_CHANGE_PLUS_CODON_DELETION, STOP_GAINED, SYNONYMOUS_STOP, STOP_LOST, RARE_AMINO_ACID | EXON
INTRON, INTRON_CONSERVED                                                         | INTRON
UTR_3_PRIME, UTR_3_DELETED                                                       | UTR_3_PRIME
DOWNSTREAM                                                                       | DOWNSTREAM
REGULATION                                                                       | REGULATION

We describe the effects predicted by SnpEff

Notes:
  • Sequence Ontology: The Sequence Ontology (SO) standardizes the terminology used for assessing sequence changes and impact. This allows for a common language across all variant annotation programs and makes it easier to communicate using a uniform terminology.
  • Effect (Classic): These are the "classic" effect names used by SnpEff. We are moving towards a standardized classification system using Sequence Ontology.
  • Effect impact: Effects are categorized by 'impact': {High, Moderate, Low, Modifier}. These are pre-defined categories to help users find the more significant variants.
    Impact categories must be used with care; they were created only to help and simplify the filtering process. Obviously, there is no way to predict whether a "high impact" or a "low impact" variant is the one producing your phenotype of interest.

Here is a list of effects and some brief explanations:

Effect (Seq. Ontology) | Effect (Classic) | Note & Example | Impact
coding_sequence_variant CDS The variant hits a CDS. MODIFIER
chromosome CHROMOSOME_LARGE_DELETION A large part (over 1%) of the chromosome was deleted. HIGH
coding_sequence_variant CODON_CHANGE One or many codons are changed
e.g.: An MNP of size multiple of 3
MODERATE
inframe_insertion CODON_INSERTION One or many codons are inserted
e.g.: An insert multiple of three in a codon boundary
MODERATE
disruptive_inframe_insertion CODON_CHANGE_PLUS_CODON_INSERTION One codon is changed and one or many codons are inserted
e.g.: An insert of size multiple of three, not at codon boundary
MODERATE
inframe_deletion CODON_DELETION One or many codons are deleted
e.g.: A deletion multiple of three at codon boundary
MODERATE
disruptive_inframe_deletion CODON_CHANGE_PLUS_CODON_DELETION One codon is changed and one or more codons are deleted
e.g.: A deletion of size multiple of three, not at codon boundary
MODERATE
downstream_gene_variant DOWNSTREAM Downstream of a gene (default length: 5K bases) MODIFIER
exon_variant EXON The variant hits an exon. MODIFIER
exon_loss_variant EXON_DELETED A deletion removes the whole exon. HIGH
frameshift_variant FRAME_SHIFT Insertion or deletion causes a frame shift
e.g.: An indel whose size is not a multiple of 3
HIGH
gene_variant GENE The variant hits a gene. MODIFIER
intergenic_region INTERGENIC The variant is in an intergenic region MODIFIER
conserved_intergenic_variant INTERGENIC_CONSERVED The variant is in a highly conserved intergenic region MODIFIER
intragenic_variant INTRAGENIC The variant hits a gene, but no transcripts within the gene MODIFIER
intron_variant INTRON Variant hits an intron. Technically, it hits no exon in the transcript. MODIFIER
conserved_intron_variant INTRON_CONSERVED The variant is in a highly conserved intronic region MODIFIER
miRNA MICRO_RNA Variant affects an miRNA MODIFIER
missense_variant NON_SYNONYMOUS_CODING Variant causes a codon that produces a different amino acid
e.g.: Tgg/Cgg, W/R
MODERATE
initiator_codon_variant NON_SYNONYMOUS_START Variant causes start codon to be mutated into another start codon (the new codon produces a different AA).
e.g.: Atg/Ctg, M/L (ATG and CTG can be START codons)
LOW
stop_retained_variant NON_SYNONYMOUS_STOP Variant causes stop codon to be mutated into another stop codon (the new codon produces a different AA).
e.g.: Atg/Ctg, M/L (ATG and CTG can be START codons)
LOW
rare_amino_acid_variant RARE_AMINO_ACID The variant hits a rare amino acid thus is likely to produce protein loss of function HIGH
splice_acceptor_variant SPLICE_SITE_ACCEPTOR The variant hits a splice acceptor site (defined as two bases before exon start, except for the first exon). HIGH
splice_donor_variant SPLICE_SITE_DONOR The variant hits a Splice donor site (defined as two bases after coding exon end, except for the last exon). HIGH
splice_region_variant SPLICE_SITE_REGION A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron. LOW
splice_region_variant SPLICE_SITE_BRANCH A variant affecting a putative (lariat) branch point, located in the intron. LOW
splice_region_variant SPLICE_SITE_BRANCH_U12 A variant affecting a putative (lariat) branch point from the U12 splicing machinery, located in the intron. MODERATE
stop_lost STOP_LOST Variant causes stop codon to be mutated into a non-stop codon
e.g.: Tga/Cga, */R
HIGH
5_prime_UTR_premature_start_codon_gain_variant START_GAINED A variant in the 5'UTR region produces a three-base sequence that can be a START codon. LOW
start_lost START_LOST Variant causes start codon to be mutated into a non-start codon.
e.g.: aTg/aGg, M/R
HIGH
stop_gained STOP_GAINED Variant causes a STOP codon
e.g.: Cag/Tag, Q/*
HIGH
synonymous_variant SYNONYMOUS_CODING Variant causes a codon that produces the same amino acid
e.g.: Ttg/Ctg, L/L
LOW
start_retained SYNONYMOUS_START Variant causes start codon to be mutated into another start codon.
e.g.: Ttg/Ctg, L/L (TTG and CTG can be START codons)
LOW
stop_retained_variant SYNONYMOUS_STOP Variant causes stop codon to be mutated into another stop codon.
e.g.: taA/taG, */*
LOW
transcript_variant TRANSCRIPT The variant hits a transcript. MODIFIER
regulatory_region_variant REGULATION The variant hits a known regulatory feature (non-coding). MODIFIER
upstream_gene_variant UPSTREAM Upstream of a gene (default length: 5K bases) MODIFIER
3_prime_UTR_variant UTR_3_PRIME Variant hits 3'UTR region MODIFIER
3_prime_UTR_truncation + exon_loss UTR_3_DELETED The variant deletes an exon which is in the 3'UTR of the transcript MODERATE
5_prime_UTR_variant UTR_5_PRIME Variant hits 5'UTR region MODIFIER
5_prime_UTR_truncation + exon_loss_variant UTR_5_DELETED The variant deletes an exon which is in the 5'UTR of the transcript MODERATE

Details about Rare amino acid effect

These are amino acids that occur very rarely in an organism. For instance, humans are supposed to use 20 amino acids, but there is also one rare AA. Selenocysteine, single-letter code 'U', appears roughly 100 times in the whole genome. The amino acid is so rare that usually it does not appear in codon translation tables. It is encoded as UGA, which usually means a STOP codon. Secondary RNA structures are assumed to enable this special translation.

A variant in one of these sites is likely to cause a loss of function in the protein. E.g. in the case of selenocysteine, the loss of the selenium atom is likely to cause loss of function. Put simply, the assumption is that the cell goes to a great deal of trouble to get that non-standard amino acid there, so it must be important. The RARE_AMINO_ACID mark is used to show that special attention should be paid in these cases.

When the variant hits a RARE_AMINO_ACID mark, it is likely that the 'old_AA/new_AA' field will be incorrect. This may happen because the amino acid is not predictable using a codon table.


Loss of function and nonsense-mediated decay predictions

Loss of function ('LOF') and nonsense-mediated decay ('NMD') predictions can be activated using the -lof command line option. Details on how these predictions work can be found in these slides.
Usage example:
java -Xmx4g -jar snpEff.jar -v \
	-lof \
	GRCh37.68 \
	file.vcf.gz > file.eff.vcf
SnpEff adds 'LOF' and 'NMD' tags to INFO fields (column 8 in VCF format). LOF and NMD tags have the following format:
Gene | ID | num_transcripts | percent_affected
Where:
Field            | Description
Gene             | Gene name
ID               | Gene ID (usually ENSEMBL)
Num_transcripts  | Number of transcripts in this gene
percent_affected | Percentage of transcripts affected by this variant

Example: If we have this effect
EFF=…, SPLICE_SITE_DONOR(HIGH||||639|ILDR2|protein_coding|CODING|ENST00000271417|1)
and the corresponding LOF tag is
LOF=ILDR2|ENSG00000143195|7|1.00
The meaning of the LOF tag is:
Field            | Description
Gene             | ILDR2
ID               | ENSG00000143195
Num_transcripts  | There are 7 transcripts in this gene
percent_affected | 100% of the transcripts are affected by this variant

Troubleshooting Effect predictions

Chromosome not found

This is by far the most common problem. It means that the input VCF file has chromosome names that do not match the chromosome names in SnpEff's database, and thus don't match the reference genome either, since SnpEff's databases are created using the reference genome's chromosome names.

The solution is simple: fix your VCF file to use standard chromosome names. You can see which chromosome names are used by SnpEff simply by using the -v (verbose) command line option. This shows all chromosome names and their respective lengths. Notice the last line ("Chromosomes names [sizes]"):

$ java -Xmx4g -jar snpEff.jar eff -v GRCh37.70 test.1K.vcf
00:00:00.000    Reading configuration file 'snpEff.config'
00:00:00.176    done
00:00:00.176    Reading database for genome version 'GRCh37.70' from file '/home/pcingola//snpEff/data/GRCh37.70/snpEffectPredictor.bin' (this might take a while)
00:01:13.513    done
00:01:13.555    Setting '-treatAllAsProteinCoding' to 'false'
00:01:13.556    Building interval forest
00:01:32.642    done.
00:01:38.280    Genome stats :
# Genome name                : 'Homo_sapiens'
# Genome version             : 'GRCh37.70'
# Has protein coding info    : true
# Genes                      : 62069
# Protein coding genes       : 22643
# Transcripts                : 206046
# Avg. transcripts per gene  : 3.32
# Protein coding transcripts : 85685
# Cds                        : 756629
# Exons                      : 1262644
# Exons with sequence        : 1262644
# Exons without sequence     : 0
# Avg. exons per transcript  : 6.13
# Number of chromosomes      : 275
# Chromosomes names [sizes]  : '1' [249250621] '2' [243199373] '3' [198022430] '4' [191154276] '5' [180915260] '6' [171115067] '7' [159138663] 'X' [155270560] '8' [146364022] '9' [141213431] '10' [135534747] '11' [135006516] '12' [133851895] '13' [115169878] '14' [107349540] '15' [102531392] '16' [90354753] '17' [81195210] '18' [78077248] '20' [63025520] 'Y' [59373566] '19' [59128983] '22' [51304566] '21' [48129895] 'HG1287_PATCH' [249889652]  'HG1292_PATCH' [249822337]  'HG1473_PATCH' [249272860]  'HG1471_PATCH' [249269426] ...
...
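
For example, if the only mismatch is a 'chr' prefix (e.g. 'chr1' in your VCF versus '1' in the database), a minimal sketch to strip the prefix from the data lines would be (file names are placeholders; always double-check the result, and remember that '##contig' header lines may need fixing too):
 sed -e 's/^chr//' file.vcf > file.no_chr.vcf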

Apparent inconsistencies when using UCSC genome browser

WARNING: Usage of hg19 genome is deprecated and discouraged, you should use GRChXX.YY instead (e.g. the latest version at the time of writing is GRCh37.70)

Reference sequences and annotations are made for an organism version and sub-version. For example, the human genome, version 37, sub-version 70, is called GRCh37.70.
UCSC doesn't specify a sub-version; they just say hg19. This annoying sub-version problem appeared often, so, with reproducibility of results in mind, I dropped UCSC annotations in favor of ENSEMBL ones (they have clear versioning).

SnpEff reporting an effect that doesn't match ENSEMBL's web page

Please remember that databases are updated often (e.g. by ENSEMBL), so if you are using an old database, you might get different effects.

For example, this transcript ENST00000487462 changed from protein_coding in GRCh37.63

1       protein_coding  exon    1655388 1655458 .       -       .        gene_id "ENSG00000008128"; transcript_id "ENST00000487462"; exon_number "1"; gene_name "CDK11A"; transcript_name "CDK11A-013";
1       protein_coding  exon    1653905 1654270 .       -       .        gene_id "ENSG00000008128"; transcript_id "ENST00000487462"; exon_number "2"; gene_name "CDK11A"; transcript_name "CDK11A-013";
...to processed_transcript in GRCh37.64:
1       processed_transcript    exon    1655388 1655458 .       -       .        gene_id "ENSG00000008128"; transcript_id "ENST00000487462"; exon_number "1"; gene_name "CDK11A"; gene_biotype "protein_coding"; transcript_name "CDK11A-013";
1       processed_transcript    exon    1653905 1654270 .       -       .        gene_id "ENSG00000008128"; transcript_id "ENST00000487462"; exon_number "2"; gene_name "CDK11A"; gene_biotype "protein_coding"; transcript_name "CDK11A-013";
This means that you'll get different results for this transcript depending on whether you use sub-version 63 or 64. I assume that the latest versions are improved, so I always encourage upgrading.
Sometimes the latest released database and the one shown on the web interface may even be out of sync.

SnpEff reports a SYNONYMOUS and a NON_SYNONYMOUS effect on the same gene

This is not a bug. It is not uncommon for a gene to have more than one transcript (e.g. in human most genes have multiple transcripts). A variant (e.g. a SNP) might affect different transcripts in different ways, as a result of different reading frames.
For instance:

chr5 137622242 . C T . . EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gaa/Aaa|E/K|CDC25C|protein_coding|CODING|ENST00000514017|exon_5_137622186_137622319),
                             SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000323760|exon_5_137622186_137622319),
                             SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000348983|exon_5_137622186_137622319),
                             SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000356505|exon_5_137622186_137622319),
                             SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000357274|exon_5_137622186_137622319),
                             SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000415130|exon_5_137622186_137622319),
                             SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000513970|exon_5_137622186_137622319),
                             SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000514555|exon_5_137622186_137622319),
                             SYNONYMOUS_CODING(LOW|SILENT|caG/caA|Q|CDC25C|protein_coding|CODING|ENST00000534892|exon_5_137622186_137622319)
In this example (divided into multiple lines for legibility), the first transcript, ENST00000514017, has a NON_SYNONYMOUS effect, but all other transcripts have a SYNONYMOUS effect.

SnpEff can also provide non-coding and regulatory annotations. Here we show how to annotate them.

Non-coding and regulatory annotations databases are available only for a few organisms (e.g. human, mouse, etc.). We intend to incorporate more non-coding annotations as soon as public databases are available, but your organism of choice might not have a non-coding/regulatory database available.

First of all, you need to see if your organism has a regulatory database. You can just look into the database directory to see if regulation_*.bin files are there. For instance, for human genome:

$ cd ~/snpEff
$ cd data/GRCh37.70/
$ ls -al
total 2399156
drwxr-xr-x    2 pcingola pcingola       4096 Feb 11 22:01 .
drwxr-xr-x 2472 pcingola pcingola     122880 Feb 11 19:55 ..
-rw-r--r--    1 pcingola pcingola    6292802 Feb 11 22:01 regulation_CD4.bin
-rw-r--r--    1 pcingola pcingola    2474396 Feb 11 22:01 regulation_GM06990.bin
-rw-r--r--    1 pcingola pcingola    7924641 Feb 11 22:00 regulation_GM12878.bin
-rw-r--r--    1 pcingola pcingola    6212489 Feb 11 22:01 regulation_H1ESC.bin
-rw-r--r--    1 pcingola pcingola    5152384 Feb 11 22:00 regulation_HeLa-S3.bin
-rw-r--r--    1 pcingola pcingola    7392256 Feb 11 22:01 regulation_HepG2.bin
-rw-r--r--    1 pcingola pcingola    3991817 Feb 11 22:01 regulation_HMEC.bin
-rw-r--r--    1 pcingola pcingola    4562241 Feb 11 22:01 regulation_HSMM.bin
-rw-r--r--    1 pcingola pcingola    5550176 Feb 11 22:01 regulation_HUVEC.bin
-rw-r--r--    1 pcingola pcingola    5528240 Feb 11 22:01 regulation_IMR90.bin
-rw-r--r--    1 pcingola pcingola     534786 Feb 11 22:01 regulation_K562b.bin
-rw-r--r--    1 pcingola pcingola    8371354 Feb 11 22:01 regulation_K562.bin
-rw-r--r--    1 pcingola pcingola    3063189 Feb 11 22:01 regulation_NH-A.bin
-rw-r--r--    1 pcingola pcingola    5621629 Feb 11 22:00 regulation_NHEK.bin
-rw-r--r--    1 pcingola pcingola   89937614 Feb 11 21:57 snpEffectPredictor.bin
So we can annotate using any of those tracks.
E.g. To use 'HeLa-S3' and 'NHEK' tracks, you can run:
java -Xmx4g -jar snpEff.jar eff -reg HeLa-S3 -reg NHEK GRCh37.70 in.vcf > out.vcf

You can use the -onlyReg command line option to perform only regulatory annotations (i.e. coding annotations will be omitted).
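
For example, a sketch combining -onlyReg with the tracks shown above (file names are placeholders):
 java -Xmx4g -jar snpEff.jar eff -onlyReg -reg HeLa-S3 -reg NHEK GRCh37.70 in.vcf > out.reg.vcf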

We describe how to change some defaults and how to filter inputs and outputs to narrow down the number of results.

Speed up options: Multithreaded & No statistics

In order to speed up the annotation process, there are two options that can be activated:

  • No statistics: Calculating statistics can take a significant amount of time. Particularly if there are hundreds or thousands of samples in the VCF file. The command line option -noStats disables the statistics and may result in a significant speedup.
  • Multithreaded: SnpEff has a multithreaded implementation. This allows SnpEff to use several cores available on the local machine. It can be activated using the -t command line option (see the example after this list).

    Some features are not available when running in multithreaded mode. For instance, statistics are disabled in order to speed up the annotation process.

    In multithreaded mode, SnpEff will attempt to use all available cores. Nevertheless, it is rare that all cores reach 100% usage.
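
For example, a sketch combining both speed-up options (file names are placeholders):
 java -Xmx4g -jar snpEff.jar eff -t -noStats GRCh37.70 file.vcf > file.eff.vcf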

HGVS notation

SnpEff partially implements HGVS notation, which is somewhat popular amongst clinicians. You can activate HGVS notation using the command line option -hgvs. When activated, the "AA" sub-field in the EFF field (VCF file, INFO column) will be written in HGVS notation.

HGVS notation is only partially implemented, some cases may be missing.
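
For example, a sketch activating HGVS notation (file names are placeholders):
 java -Xmx4g -jar snpEff.jar eff -hgvs GRCh37.70 file.vcf > file.eff.vcf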

Logging

SnpEff will try to log usage statistics to our "log server". This is useful for us to understand users' needs and to gather statistics on how the program is used (e.g. to decide whether a command or option is useful or not).
Logging can be deactivated by using the -noLog command line option.
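
For example, a sketch with logging disabled (file names are placeholders):
 java -Xmx4g -jar snpEff.jar eff -noLog GRCh37.70 file.vcf > file.eff.vcf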

Input filters

SnpEff supports filtering input files using combinations of the following command line options:

Input filters can also be implemented using SnpSift filter, which allows you to create more flexible and complex filters.

Command line option Meaning
-del Analyze deletions only
-ins Analyze insertions only
-hom Analyze homozygous variants only
-het Analyze heterozygous variants only
-minQ X, -minQuality X Filter out variants with quality lower than X
-maxQ X, -maxQuality X Filter out variants with quality higher than X
-minC X, -minCoverage X Filter out variants with coverage lower than X
-maxC X, -maxCoverage X Filter out variants with coverage higher than X
-mnp Only MNPs (multiple nucleotide polymorphisms)
-snp Only SNPs (single nucleotide polymorphisms)
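
For example, a sketch that annotates only homozygous SNPs with quality of at least 30 (file names are placeholders):
 java -Xmx4g -jar snpEff.jar eff -snp -hom -minQ 30 GRCh37.70 file.vcf > file.eff.vcf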

Output filters

SnpEff supports filtering of output results using combinations of the following command line options:

Output filters can also be implemented using SnpSift filter, which allows you to create more flexible and complex filters.

Command line option Meaning
-no-downstream Do not show DOWNSTREAM changes
-no-intergenic Do not show INTERGENIC changes
-no-intron Do not show INTRON changes
-no-upstream Do not show UPSTREAM changes
-no-utr Do not show 5_PRIME_UTR or 3_PRIME_UTR changes
-no EffectType Do not show 'EffectType' (it can be used several times)
e.g: -no INTERGENIC -no SPLICE_SITE_REGION
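
For example, a sketch that omits UPSTREAM, DOWNSTREAM and INTERGENIC annotations (file names are placeholders):
 java -Xmx4g -jar snpEff.jar eff -no-upstream -no-downstream -no-intergenic GRCh37.70 file.vcf > file.eff.vcf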

Annotating selected intervals

You can use the -fi intervals.bed command line option (filterInterval). For instance, let's assume you have an interval file 'intervals.bed':

2L	10000	10999
2L	12000	12999
2L	14000	14999
2L	16000	16999
2L	18000	18999 

In order to get only changes that match your intervals, you can use the command:
 java -Xmx4G -jar snpEff.jar -fi intervals.bed dm5.30 file.vcf 

Canonical transcripts

SnpEff allows you to annotate using only canonical transcripts via the -canon command line option.

A canonical transcript is defined as the protein coding transcript with the longest CDS in a gene. If none of the transcripts in a gene is protein coding, then the transcript with the longest cDNA is used.

Although this seems to be the standard definition of "canonical transcript", there is no guarantee that what SnpEff considers a canonical transcript will exactly match what UCSC or ENSEMBL consider a canonical transcript.

Example on how to use canonical transcripts annotations:
 java -Xmx4G -jar snpEff.jar -canon GRCh37.70 file.vcf > file.snpeff.canon.vcf

In order to get a list of canonical transcripts, you can use the -d (debug) command line option. E.g.:
$ java -Xmx4G -jar snpEff.jar eff -d -v -canon GRCh37.66 test.vcf
00:00:00.000    Reading configuration file 'snpEff.config'
00:00:00.173    done
00:00:00.173    Reading database for genome version 'GRCh37.66'
00:00:02.834    done
00:00:02.845    Filtering out non-canonical transcripts.
00:00:03.219    Canonical transcripts:
                geneName        geneId          transcriptId    cdsLength
                GGPS1           ENSG00000152904 ENST00000488594 903
                RP11-628K18.1.1 ENSG00000235112 ENST00000430808 296
                MIPEPP2         ENSG00000224783 ENST00000422560 1819
                FEN1P1          ENSG00000215873 ENST00000401028 1145
                AL591704.7.1    ENSG00000224784 ENST00000421658 202
                CAPNS1P1        ENSG00000215874 ENST00000401029 634
                ST13P20         ENSG00000215875 ENST00000447996 1061
                NCDN            ENSG00000020129 ENST00000373243 2190
                RP11-99H8.1.1   ENSG00000226208 ENST00000423187 432
                AL391001.1      ENSG00000242652 ENST00000489859 289
...

Selected list of transcripts

SnpEff allows you to provide a list of transcripts to use for annotation via the -onlyTr file.txt option, where the file contains one transcript ID per line. Any other transcript will be ignored.
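
For illustration, a hypothetical my_transcripts.txt (the transcript IDs below are placeholders) would simply look like this:
$ cat my_transcripts.txt
ENST00000335137
ENST00000379407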

 java -Xmx4G -jar snpEff.jar -onlyTr my_transcripts.txt GRCh37.70 file.vcf > file.snpeff.vcf

Upstream and downstream

You can change the default upstream and downstream interval size (default is 5K bases) using the -ud size_in_bases option. This also allows you to eliminate all upstream and downstream effects by using "-ud 0".

Example: Make upstream and downstream size zero (i.e. do not report any upstream or downstream effect).

 java -Xmx4G -jar snpEff.jar -ud 0 GRCh37.70 file.vcf > file.snpeff.vcf

Splice site size

You can change the default splice site size (default is 2 bases) using the -spliceSiteSize size_in_bases option.

Example: Make splice sites four bases long

 java -Xmx4G -jar snpEff.jar -spliceSiteSize 4 GRCh37.70 file.vcf > file.snpeff.vcf

Adding your own annotations

SnpEff allows user defined intervals to be annotated. This is achieved using the -interval file.bed command line option, which can be used multiple times in the same command line. Any variant that intersects an interval defined in those files will be annotated using the "name" field (fourth column) of the input BED file.

Example: We create our own annotations in my_annotations.bed

$ cat my_annotations.bed 
1	10000	20000	MY_ANNOTATION

$ cat test.vcf 
1	10469	.	C	G	365.78	PASS	AC=30;AF=0.0732

$ java -Xmx4g -jar snpEff.jar eff -interval my_annotations.bed GRCh37.66 test.vcf
1	10469	.	C	G	365.78	PASS	AC=30;AF=0.0732;EFF=CUSTOM(MODIFIER||||||MY_ANNOTATION||||1),DOWNSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000423562||1),DOWNSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000438504||1),DOWNSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000488147||1),DOWNSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000538476||1),DOWNSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000541675||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER|||||DDX11L1|processed_transcript|NON_CODING|ENST00000456328||1),UPSTREAM(MODIFIER|||||DDX11L1|transcribed_unprocessed_pseudogene|NON_CODING|ENST00000450305||1),UPSTREAM(MODIFIER|||||DDX11L1|transcribed_unprocessed_pseudogene|NON_CODING|ENST00000515242||1),UPSTREAM(MODIFIER|||||DDX11L1|transcribed_unprocessed_pseudogene|NON_CODING|ENST00000518655||1)
Notice that the variant was annotated using "MY_ANNOTATION" in the EFF field.

Gene ID instead of gene names


You can obtain gene IDs instead of gene names by using the command line option -geneId.
Example:

$ java -Xmx4g -jar snpEff.jar -geneId GRCh37.66 test.vcf 
1  902128  3617  C  T  .  PASS  AC=80;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gCt/gTt|A43V|576|ENSG00000187583|protein_coding|CODING|ENST00000379407|2|1),...
Note: The gene 'PLEKHN1' was annotated as 'ENSG00000187583'.

Chromosome names

This is only used in TXT output format, which is deprecated.

By default SnpEff simplifies all chromosome names. For instance 'chr1' is just '1'. Some people just hate this, so you can prepend any string you want to the chromosome name by using the -chr string command line option.

Example:

$ java -Xmx4g -jar snpEff.jar eff -v -o txt -chr "CHROMOSOME" GRCh37.66 test.vcf 
# Chromo	Position	Reference	Change	Change_type	Homozygous	Quality	Coverage	Warnings	Gene_ID	Gene_name	Bio_type	Transcript_ID	Exon_ID	Exon_Rank	Effect	old_AA/new_AA	Old_codon/New_codon	Codon_Num(CDS)	Codon_Degeneracy	CDS_size	Codons_around	AAs_around	Custom_interval_ID
CHROMOSOME1	10469	C	G	SNP	Hom	365.78	0		ENSG00000227232	WASH7P	unprocessed_pseudogene	ENST00000423562			DOWNSTREAM: 3894 bases								
CHROMOSOME1	10469	C	G	SNP	Hom	365.78	0		ENSG00000227232	WASH7P	unprocessed_pseudogene	ENST00000438504			DOWNSTREAM: 3894 bases								
CHROMOSOME1	10469	C	G	SNP	Hom	365.78	0		ENSG00000227232	WASH7P	unprocessed_pseudogene	ENST00000541675			DOWNSTREAM: 3894 bases								
CHROMOSOME1	10469	C	G	SNP	Hom	365.78	0		ENSG00000227232	WASH7P	unprocessed_pseudogene	ENST00000488147			DOWNSTREAM: 3935 bases								
CHROMOSOME1	10469	C	G	SNP	Hom	365.78	0		ENSG00000227232	WASH7P	unprocessed_pseudogene	ENST00000538476			DOWNSTREAM: 3942 bases								
CHROMOSOME1	10469	C	G	SNP	Hom	365.78	0		ENSG00000223972	DDX11L1	transcribed_unprocessed_pseudogene	ENST00000518655			UPSTREAM: 1405 bases								
CHROMOSOME1	10469	C	G	SNP	Hom	365.78	0		ENSG00000223972	DDX11L1	transcribed_unprocessed_pseudogene	ENST00000450305			UPSTREAM: 1541 bases								
CHROMOSOME1	10469	C	G	SNP	Hom	365.78	0								INTERGENIC			
CHROMOSOME1	10469	C	G	SNP	Hom	365.78	0		ENSG00000223972	DDX11L1	processed_transcript	ENST00000456328			UPSTREAM: 1400 bases								
CHROMOSOME1	10469	C	G	SNP	Hom	365.78	0		ENSG00000223972	DDX11L1	transcribed_unprocessed_pseudogene	ENST00000515242			UPSTREAM: 1403 bases								

Here we describe details about annotating cancer samples.

Representing cancer data

In a typical cancer sequencing experiment, we want to measure and annotate differences between germline (healthy) and somatic (cancer) tissue samples from the same patient. The complication is that germline is not always the same as the reference genome, so a typical annotation does not work.
For instance, let's assume that at a given genomic position (e.g. chr1:69091), the reference genome is 'A', the germline is 'C' and the somatic sample is 'G'. This should be represented in a VCF file as:

#CHROM  POS    ID  REF  ALT    QUAL  FILTER  INFO  FORMAT  Patient_01_Germline  Patient_01_Somatic
1       69091  .   A    C,G    .     PASS    AC=1  GT      1/0                  2/1

Some people tend to represent this by replacing the REF base 'A' with the germline 'C'. This is a mistake: REF must always represent the reference genome, not one of your samples.

Under normal conditions, SnpEff would provide the effects of changes "A -> C" and "A -> G". But in case of cancer samples, we are actually interested in the difference between somatic and germline, so we'd like to calculate the effect of a "C -> G" mutation. Calculating this effect is not trivial, since we have to build a new "reference" by calculating the effect of the first mutation ("A -> C") and then calculate the effect of the second one ("C -> G") on our "new reference".

In order to activate cancer analysis, you must use the -cancer command line option.

Defining cancer samples

As we already mentioned, cancer data is represented in a VCF file using multiple ALTs (REF field always is reference genome). In order to specify which samples are somatic and which ones are germline, there are two options:

  • Use a TXT file with the -cancerSamples command line option.
  • Use the PEDIGREE meta information in your VCF file's header. This is the default, but some people might find it hard to edit / change information in the VCF file's header.

If you do not provide either PEDIGREE meta information or a TXT samples file, SnpEff will not know which somatic samples derive from which germline samples. Thus it will be unable to perform cancer effect analysis.

TXT file
This is quite easy. All you have to do is create a tab-separated TXT file with two columns: the first column has the germline sample names and the second column has the somatic sample names. Make sure the sample names match exactly the ones in the VCF file.

E.g.: Create a TXT file named 'samples_cancer.txt'
Patient_01_Germline	Patient_01_Somatic
Patient_02_Germline	Patient_02_Somatic
Patient_03_Germline	Patient_03_Somatic
Patient_04_Germline	Patient_04_Somatic
Then you have to specify this TXT file when invoking SnpEff, using the -cancerSamples command line option.

E.g. In our example, the file name is 'samples_cancer.txt', so the command line would look like this:
java -Xmx4g -jar snpEff.jar eff -v -cancer -cancerSamples samples_cancer.txt GRCh37.70 cancer.vcf > cancer.snpeff.vcf 


VCF header
This is the default method and the main advantage is that you don't have to carry information in a separate TXT file (all the information is within your VCF file). You have to add a PEDIGREE header line with the appropriate information to your VCF file. Obviously this requires you to edit your VCF file's header.

How to edit VCF headers is beyond the scope of this manual (we recommend using vcf-annotate from VCFtools). But if you find adding PEDIGREE information to your VCF file difficult, just use the TXT file method described in the previous sub-section.

E.g.: Pedigree information in a VCF file would look like this
##PEDIGREE=<Derived=Patient_01_Somatic,Original=Patient_01_Germline>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO        FORMAT  Patient_01_Germline Patient_01_Somatic
1   69091   .   A   C,G .       PASS    AF=0.1122   GT      1/0                 2/1
1   69849   .   G   A,C .       PASS    AF=0.1122   GT      1/0                 2/1
1   69511   .   A   C,G .       PASS    AF=0.3580   GT      1/1                 2/2
Here we say that the sample called Patient_01_Somatic is derived from the sample called Patient_01_Germline. In this context, this means that the cancer sample is derived from the healthy tissue sample.
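
If you prefer a quick command line approach, a minimal sketch that inserts the PEDIGREE line right before the '#CHROM' header line is shown below (it assumes a single patient pair with the sample names used above; adapt the sample names and always verify the resulting header):
awk '/^##/ {print; next} !done {print "##PEDIGREE=<Derived=Patient_01_Somatic,Original=Patient_01_Germline>"; done=1} {print}' cancer.vcf > cancer.pedigree.vcf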

Running in cancer analysis mode

We added the '-cancer' command line option for somatic vs germline comparisons. By default SnpEff does not perform them because they are computationally intensive. So your command line would be:
 java -Xmx4g -jar snpEff.jar eff -v -cancer GRCh37.70 cancer.vcf > cancer.snpeff.vcf 

HGVS notation seems to be preferred in the clinical and cancer community. You can use the -hgvs option to get HGVS notation in the AA sub-field.

Interpreting Cancer effects

Interpretation of the EFF field for cancer samples relies heavily on the 'Genotype' sub-field. Just as a reminder, the EFF field has the following format:
EFF= Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_Length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank  | Genotype_Number [ | ERRORS | WARNINGS ] )
The Genotype_Number field (GenotypeNum) tells you which effect relates to which genotype and, more importantly, to which genotype difference between Somatic and Germline.
Example: when there are multiple ALTs (e.g. REF='A' ALT='C,G') and the genotype field says:
  • GenotypeNum = "1": it means is the effect related to the first ALT ('C')
  • GenotypeNum = "2" if it's the effect related to the second ALT ('G')
  • GenotypeNum = "2-1" means that this is the effect of having the second ALT as variant while using the first ALT as reference ("C -> G").
It is important that you understand the meaning of the last one, because you'll use it often for your cancer analysis.

Example: Sample output for the previously mentioned VCF example would be (the output has been edited for readability reasons)
For the first line:
1   69091   .   A   C,G .   PASS    AF=0.1122
    ;EFF=NON_SYNONYMOUS_START(LOW|MISSENSE|Atg/Ctg|M1L|305|OR4F5|protein_coding|CODING|ENST00000335137|1|1)
        ,START_LOST(HIGH|MISSENSE|Atg/Gtg|M1V|305|OR4F5|protein_coding|CODING|ENST00000335137|1|2)
        ,START_LOST(HIGH|MISSENSE|Ctg/Gtg|L1V|305|OR4F5|protein_coding|CODING|ENST00000335137|1|2-1)
    GT  1/0 2/1 
What does it mean:
  1. In this case, we have two ALTs = 'C' and 'G'.
  2. The germline sample is heterozygous 'C/A' (GT = '1/0')
  3. The somatic tissue is heterozygous 'G/C' (GT = '2/1')
  4. Changes A -> C and A -> G are always calculated by SnpEff (this is the "default mode").
    • A -> C produces this effect: NON_SYNONYMOUS_START(LOW|MISSENSE|Atg/Ctg|p.Met1Leu|305|OR4F5|protein_coding|CODING|ENST00000335137|1). Note that the last field (genotype field) is '1', indicating this is produced by the first ALT (C)
    • A -> G produces this effect: START_LOST(HIGH|MISSENSE|Atg/Gtg|p.Met1Val|305|OR4F5|protein_coding|CODING|ENST00000335137|2). Note that the last field (genotype field) is '2', indicating this is produced by the second ALT (G)
  5. Finally, this is what you were expecting for, the cancer comparisons. Since both germline and somatic are heterozygous (GT are '1/0' and '2/1'), there are 4 possible comparisons to make:
    • 2 vs 1 : This is the Somatic vs Germline we are interested in. SnpEff reports this one
    • 2 vs 0 : This compares ALT to REF, so it was already reported in "default mode". SnpEff doesn't report this one again.
    • 1 vs 1 : This is not a variant, since both of them are '1'. SnpEff skips this one.
    • 1 vs 0 : This compares ALT to REF, so it was already reported in "default mode". SnpEff doesn't report this one again.
    I know it is confusing, but the bottom line is that only the first comparison makes sense, and that is the one SnpEff reports. So 'C -> G' produces the following effect:
    START_LOST(HIGH|MISSENSE|Ctg/Gtg|p.Leu1Val|305|OR4F5|protein_coding|CODING|ENST00000335137|2-1)

    Notice the genotype field is "2-1" meaning the we produce a new reference on the fly using ALT 1 ('C') and then used ALT 2 ('G') as the change.

If your brain hasn't exploded yet and you want to have some more fun, the other lines produced in the above example are (again, lines edited for the sake of brevity):
1   69849   .   G   A,C .   PASS    AF=0.1122
    ;EFF=STOP_GAINED(HIGH|NONSENSE|tgG/tgA|W253*|305|OR4F5|protein_coding|CODING|ENST00000335137|1|1)
        ,NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|tgG/tgC|W253C|305|OR4F5|protein_coding|CODING|ENST00000335137|1|2)
        ,STOP_LOST(HIGH|MISSENSE|tgA/tgC|*253C|305|OR4F5|protein_coding|CODING|ENST00000335137|1|2-1)
    GT  1/0 2/1 


1   69511   .   A   C,G .   PASS    AF=0.3580
    ;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Cca|T141P|305|OR4F5|protein_coding|CODING|ENST00000335137|1|1)
        ,NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|1|2)
        ,NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Cca/Gca|P141A|305|OR4F5|protein_coding|CODING|ENST00000335137|1|2-1)
    GT  1/1 2/2 

The ENCODE project's goal is to find all functional elements in the human genome. You can perform annotations using ENCODE's data.

ENCODE project has produced huge amounts of data (see also Nature's portal). This information is available for download and can be used to annotate genomic variants or regions. An overview of all the data available from ENCODE is shown as an experimental data matrix. The download site is here.

Data is available in "BigBed" format, which can be feed into SnpEff using -interval command line option (you can add many -interval options).
Here is a simple example:

# Create a directory for ENCODE files
mkdir -p db/encode

# Download ENCODE experimental results (BigBed file)
cd db/encode
wget "http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/byDataType/openchrom/jan2011/fdrPeaks/wgEncodeDukeDnase8988T.fdr01peaks.hg19.bb"

# Annotate using ENCODE's data:
java -Xmx4g -jar snpEff.jar eff -v -interval db/encode/wgEncodeDukeDnase8988T.fdr01peaks.hg19.bb GRCh37.71 test.vcf > test.eff.vcf

# Annotations are added as "CUSTOM" intervals:
$ grep CUSTOM test.eff.vcf 
1	799463	.	T	C	.	PASS	AC=958;EFF=CUSTOM[wgEncodeDukeDnase8988T](MODIFIER||||||wgEncodeDukeDnase8988T:799426_799575||||1),DOWNSTREAM(MODIFIER|||||FAM41C|lincRNA|NON_CODING|ENST00000427857||1),DOWNSTREAM(MODIFIER|||||FAM41C|lincRNA|NON_CODING|ENST00000432963||1),DOWNSTREAM(MODIFIER|||||FAM41C|lincRNA|NON_CODING|ENST00000446136||1),DOWNSTREAM(MODIFIER|||||RP11-206L10.11|lincRNA|NON_CODING|ENST00000445118||1),INTERGENIC(MODIFIER||||||||||1)
1	902128	.	C	T	.	PASS	AC=80;EFF=CUSTOM[wgEncodeDukeDnase8988T](MODIFIER||||||wgEncodeDukeDnase8988T:902066_902215||||1),DOWNSTREAM(MODIFIER||||642|KLHL17|protein_coding|CODING|ENST00000338591||1),DOWNSTREAM(MODIFIER|||||KLHL17|nonsense_mediated_decay|CODING|ENST00000466300||1),DOWNSTREAM(MODIFIER|||||KLHL17|retained_intron|CODING|ENST00000463212||1),DOWNSTREAM(MODIFIER|||||KLHL17|retained_intron|CODING|ENST00000481067||1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gCt/gTt|A43V|576|PLEKHN1|protein_coding|CODING|ENST00000379407|2|1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gCt/gTt|A43V|611|PLEKHN1|protein_coding|CODING|ENST00000379410|2|1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gCt/gTt|A43V|663|PLEKHN1|protein_coding|CODING|ENST00000379409|2|1),UPSTREAM(MODIFIER|||||PLEKHN1|retained_intron|CODING|ENST00000480267||1)
1	948921	.	T	C	.	PASS	AC=904;EFF=CUSTOM[wgEncodeDukeDnase8988T](MODIFIER||||||wgEncodeDukeDnase8988T:948786_948935||||1),UPSTREAM(MODIFIER|||||RP11-54O7.11|antisense|NON_CODING|ENST00000458555||1),UTR_5_PRIME(MODIFIER||||165|ISG15|protein_coding|CODING|ENST00000379389|1|1)
...

The Epigenome Roadmap Project has produced large amounts of information that can be used by SnpEff.

The Epigenome Roadmap Project's goal is "to map DNA methylation, histone modifications, chromatin accessibility and small RNA transcripts in stem cells and primary ex vivo tissues selected to represent the normal counterparts of tissues and organ systems frequently involved in human disease". A data matrix shows the experimental setups currently available.

Unfortunately the project is not (currently) providing result files that can be used directly by annotation software such as SnpEff; they will be available later in the project. So, for the time being, the data has to be downloaded and pre-processed. We'll be processing this information and making it available (as SnpEff databases) as soon as we can.

The latest processed Epigenome project information can be found here. This includes genomic intervals for high confidence peaks in the form of BED files.

To annotate you can do:

# Download Epigenome project database (pre-processed as BED files)
wget http://sourceforge.net/projects/snpeff/files/databases/epigenome_latest.tgz/download

# Open tar file
tar -xvzf epigenome_latest.tgz

# Annotate using SnpEff and "-interval" command line
java -Xmx4g -jar snpEff.jar eff -v -interval db/epigenome/BI_Pancreatic_Islets_H3K4me3.peaks.bed GRCh37.71 test.vcf > test.eff.vcf 

# See the data represented as "CUSTOM" EFF fields
$ grep CUSTOM test.eff.vcf 
1	894573	.	G	A	.	PASS	AC=725;EFF=CUSTOM[BI_Pancreatic_Islets_H3K4me3](MODIFIER||||||MACS_peak_8||||1),INTRON(MODIFIER||||749|NOC2L|protein_coding|CODING|ENST00000327044|1|1),INTRON(MODIFIER|||||NOC2L|processed_transcript|CODING|ENST00000487214|1|1),INTRON(MODIFIER|||||NOC2L|retained_intron|CODING|ENST00000469563|1|1),UPSTREAM(MODIFIER||||642|KLHL17|protein_coding|CODING|ENST00000338591||1),UPSTREAM(MODIFIER|||||KLHL17|nonsense_mediated_decay|CODING|ENST00000466300||1),UPSTREAM(MODIFIER|||||KLHL17|retained_intron|CODING|ENST00000463212||1),UPSTREAM(MODIFIER|||||KLHL17|retained_intron|CODING|ENST00000481067||1),UPSTREAM(MODIFIER|||||NOC2L|retained_intron|CODING|ENST00000477976||1)
1	948692	.	G	A	.	PASS	AC=896;EFF=CUSTOM[BI_Pancreatic_Islets_H3K4me3](MODIFIER||||||MACS_peak_9||||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||||165|ISG15|protein_coding|CODING|ENST00000379389||1),UPSTREAM(MODIFIER|||||RP11-54O7.11|antisense|NON_CODING|ENST00000458555||1)
1	948921	.	T	C	.	PASS	AC=904;EFF=CUSTOM[BI_Pancreatic_Islets_H3K4me3](MODIFIER||||||MACS_peak_9||||1),UPSTREAM(MODIFIER|||||RP11-54O7.11|antisense|NON_CODING|ENST00000458555||1),UTR_5_PRIME(MODIFIER||||165|ISG15|protein_coding|CODING|ENST00000379389|1|1)
1	1099342	.	A	C	.	PASS	AC=831;EFF=CUSTOM[BI_Pancreatic_Islets_H3K4me3](MODIFIER||||||MACS_peak_10||||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER|||||MIR200A|miRNA|NON_CODING|ENST00000384875||1),UPSTREAM(MODIFIER|||||MIR200B|miRNA|NON_CODING|ENST00000384997||1)

NextProt has useful proteomic annotations that can help to identify variants causing reduced protein functionality or even loss of function.

The NextProt project provides proteomic information that can be used for genomic annotations. NextProt provides only human data.

This data is available by default in GRCh37 databases (latest version). You only need to add the -nextprot command line option to get this information.

$ java -Xmx4g -jar snpEff.jar -v -nextprot GRCh37.71 test.vcf > test.eff.vcf

# Show results (edited for readability)
$ cat test.eff.vcf
$ grep glycosylation_site test.eff.vcf 
1  1147655  .  G  A  .  PASS  AC=7;EFF=DOWNSTREAM(MODIFIER||||348|SDF4|protein_coding|CODING|ENST00000263741||1)
                                       ,DOWNSTREAM(MODIFIER|||||SDF4|retained_intron|CODING|ENST00000494748||1)
                                       ,INTRON(MODIFIER|||||TNFRSF4|retained_intron|CODING|ENST00000497869|3|1)
                                       ,NEXT_PROT[glycosylation_site:N-linked__GlcNAc..._](MODERATE||||277|TNFRSF4|protein_coding|CODING|ENST00000379236|4|1)
                                       ,NEXT_PROT[glycosylation_site:N-linked__GlcNAc..._](MODERATE||||277|TNFRSF4|protein_coding|CODING|ENST00000379236|5|1)
                                       ,NEXT_PROT[repeat:TNFR-Cys_4](LOW||||277|TNFRSF4|protein_coding|CODING|ENST00000379236|4|1)
                                       ,NEXT_PROT[repeat:TNFR-Cys_4](LOW||||277|TNFRSF4|protein_coding|CODING|ENST00000379236|5|1)
                                       ,NEXT_PROT[topological_domain:Extracellular](LOW||||277|TNFRSF4|protein_coding|CODING|ENST00000379236|1|1)
                                       ,NEXT_PROT[topological_domain:Extracellular](LOW||||277|TNFRSF4|protein_coding|CODING|ENST00000379236|2|1)
                                       ,NEXT_PROT[topological_domain:Extracellular](LOW||||277|TNFRSF4|protein_coding|CODING|ENST00000379236|3|1)
                                       ,NEXT_PROT[topological_domain:Extracellular](LOW||||277|TNFRSF4|protein_coding|CODING|ENST00000379236|4|1)
                                       ,NEXT_PROT[topological_domain:Extracellular](LOW||||277|TNFRSF4|protein_coding|CODING|ENST00000379236|5|1)
                                       ,NEXT_PROT[topological_domain:Extracellular](LOW||||277|TNFRSF4|protein_coding|CODING|ENST00000379236|6|1)
This example shows a glycosylation_site marked as MODERATE impact, since a modification of such a site might impair protein function.

Motif annotations provided by ENSEMBL and Jaspar can be added to the standard annotations.

ENSEMBL provides transcription factor binding site predictions for the human and mouse genomes, using the Jaspar motif database.
Using the -motif command line option, you can add TFBS predictions to your annotations:

$ java -Xmx4g -jar snpEff.jar -v -motif GRCh37.71 test.vcf > test.eff.vcf
00:00:00.000	Reading configuration file 'snpEff.config'
00:00:00.228	done
00:00:00.228	Reading database for genome version 'GRCh37.71' from file '/Users/pablocingolani//snpEff/data/GRCh37.71/snpEffectPredictor.bin' (this might take a while)
00:00:23.506	done
00:00:23.552	Loading Motifs and PWMs
00:00:23.552		Loading PWMs from : /Users/pablocingolani//snpEff/data/GRCh37.71/pwms.bin
00:00:23.699		Loading Motifs from file '/Users/pablocingolani//snpEff/data/GRCh37.71/motif.bin'
00:00:24.811		Motif database: 393522 markers loaded.
...

$ grep MOTIF test.eff.vcf 
1	1060235	.	G	A	.	PASS	AC=16;EFF=MOTIF[PB0134.1:HNF4A](MODIFIER||||||||||1),INTERGENIC(MODIFIER||||||||||1)
1	1250957	.	G	A	.	PASS	AC=9;EFF=MOTIF[MA0058.1:Max](LOW||||||||||1),MOTIF[MA0093.1:USF1](LOW||||||||||1),MOTIF[PB0043.1:Max](MODIFIER||||||||||1),MOTIF[PL0007.1:Max](MODIFIER||||||||||1),MOTIF[PL0014.1:mxl-1](MODIFIER||||||||||1),START_GAINED(LOW||||342|CPSF3L|protein_coding|CODING|ENST00000421495|2|1),SYNONYMOUS_CODING(LOW|SILENT|caC/caT|H128|571|CPSF3L|protein_coding|CODING|ENST00000545578|6|1),SYNONYMOUS_CODING(LOW|SILENT|caC/caT|H157|600|CPSF3L|protein_coding|CODING|ENST00000435064|5|1),SYNONYMOUS_CODING(LOW|SILENT|caC/caT|H163|208|CPSF3L|protein_coding|CODING|ENST00000527719|6|1|WARNING_TRANSCRIPT_INCOMPLETE),SYNONYMOUS_CODING(LOW|SILENT|caC/caT|H163|606|CPSF3L|protein_coding|CODING|ENST00000540437|7|1),SYNONYMOUS_CODING(LOW|SILENT|caC/caT|H204|248|CPSF3L|protein_coding|CODING|ENST00000530031|6|1|WARNING_TRANSCRIPT_INCOMPLETE),SYNONYMOUS_CODING(LOW|SILENT|caC/caT|H33|108|CPSF3L|protein_coding|CODING|ENST00000526332|3|1),SYNONYMOUS_CODING(LOW|SILENT|caC/caT|H56|499|CPSF3L|protein_coding|CODING|ENST00000419704|3|1),SYNONYMOUS_CODING(LOW|SILENT|caC/caT|H59|502|CPSF3L|protein_coding|CODING|ENST00000411962|3|1),UPSTREAM(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000461514||1),UPSTREAM(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000470030||1),UPSTREAM(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000478641||1),UPSTREAM(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000485710||1),UPSTREAM(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000497304||1),UPSTREAM(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000525164||1),UPSTREAM(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000525769||1),UPSTREAM(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000531020||1),UPSTREAM(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000531292||1),UTR_3_PRIME(MODIFIER|||||CPSF3L|nonsense_mediated_decay|CODING|ENST00000430786|7|1),UTR_3_PRIME(MODIFIER|||||CPSF3L|nonsense_mediated_decay|CODING|ENST00000458452|7|1),UTR_3_PRIME(MODIFIER|||||CPSF3L|nonsense_mediated_decay|CODING|ENST00000526797|5|1),UTR_3_PRIME(MODIFIER|||||CPSF3L|nonsense_mediated_decay|CODING|ENST00000527098|6|1),DOWNSTREAM(MODIFIER||||124|PUSL1|protein_coding|CODING|ENST00000467712||1),DOWNSTREAM(MODIFIER||||139|CPSF3L|protein_coding|CODING|ENST00000534345||1),DOWNSTREAM(MODIFIER||||166|CPSF3L|protein_coding|CODING|ENST00000498476||1),DOWNSTREAM(MODIFIER||||303|PUSL1|protein_coding|CODING|ENST00000379031||1),DOWNSTREAM(MODIFIER|||||CPSF3L|nonsense_mediated_decay|CODING|ENST00000470679||1),DOWNSTREAM(MODIFIER|||||CPSF3L|nonsense_mediated_decay|CODING|ENST00000525285||1),DOWNSTREAM(MODIFIER|||||CPSF3L|nonsense_mediated_decay|CODING|ENST00000531019||1),DOWNSTREAM(MODIFIER|||||CPSF3L|processed_transcript|CODING|ENST00000490853||1),DOWNSTREAM(MODIFIER|||||CPSF3L|processed_transcript|CODING|ENST00000493534||1),DOWNSTREAM(MODIFIER|||||CPSF3L|processed_transcript|CODING|ENST00000532952||1),DOWNSTREAM(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000429572||1),DOWNSTREAM(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000496353||1),DOWNSTREAM(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000498173||1),DOWNSTREAM(MODIFIER|||||PUSL1|processed_transcript|CODING|ENST00000470520||1),DOWNSTREAM(MODIFIER|||||PUSL1|retained_intron|CODING|ENST00000493657||1),DOWNSTREAM(MODIFIER|||||RP5-890O3.9|processed_transcript|NON_CODING|ENST00000444968||1),EXON(MODIFIER|||||CPSF3L|nonsense_mediated_decay|CODING|ENST00000527383|1|1),EXON(MODIFIER|||||CPSF3L|nonsense_mediated_decay|CODING|ENST00000528879|5|1),EXON(MODIFIER|||||CPSF3L|nonsense_mediated
_decay|CODING|ENST00000531377|2|1),EXON(MODIFIER|||||CPSF3L|processed_transcript|CODING|ENST00000462432|2|1),EXON(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000323275|4|1),EXON(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000467408|1|1),EXON(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000488042|5|1),EXON(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000525603|3|1),EXON(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000526113|2|1),EXON(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000526904|2|1),EXON(MODIFIER|||||CPSF3L|retained_intron|CODING|ENST00000533916|1|1),INTRON(MODIFIER||||343|CPSF3L|protein_coding|CODING|ENST00000434694|4|1),INTRON(MODIFIER||||578|CPSF3L|protein_coding|CODING|ENST00000450926|4|1),INTRON(MODIFIER|||||CPSF3L|nonsense_mediated_decay|CODING|ENST00000532772|2|1)
1	1310924	.	T	C	.	PASS	AC=948;EFF=MOTIF[PB0020.1:Gabp](MODIFIER||||||||||1),UPSTREAM(MODIFIER||||199|AURKAIP1|protein_coding|CODING|ENST00000321751||1),UPSTREAM(MODIFIER||||199|AURKAIP1|protein_coding|CODING|ENST00000338338||1),UPSTREAM(MODIFIER||||199|AURKAIP1|protein_coding|CODING|ENST00000338370||1),UPSTREAM(MODIFIER||||199|AURKAIP1|protein_coding|CODING|ENST00000378853||1),UPSTREAM(MODIFIER|||||AURKAIP1|processed_transcript|CODING|ENST00000489799||1),UPSTREAM(MODIFIER|||||AURKAIP1|processed_transcript|CODING|ENST00000496905||1),UPSTREAM(MODIFIER|||||RP5-890O3.3|processed_pseudogene|NON_CODING|ENST00000435351||1),INTERGENIC(MODIFIER||||||||||1)
1	1368599	.	A	C	.	PASS	AC=920;EFF=MOTIF[MA0065.1:PPARG](MODIFIER||||||||||1),MOTIF[MA0065.2:PPARG](MODIFIER||||||||||1),MOTIF[MA0114.1:HNF4A](MODIFIER||||||||||1),MOTIF[MA0115.1:RNR1H2](MODIFIER||||||||||1),UPSTREAM(MODIFIER||||147|VWA1|protein_coding|CODING|ENST00000471398||1),UPSTREAM(MODIFIER||||166|VWA1|protein_coding|CODING|ENST00000495558||1),UPSTREAM(MODIFIER||||233|VWA1|protein_coding|CODING|ENST00000404702||1),UPSTREAM(MODIFIER||||445|VWA1|protein_coding|CODING|ENST00000476993||1),UPSTREAM(MODIFIER||||64|VWA1|protein_coding|CODING|ENST00000338660||1),UPSTREAM(MODIFIER|||||RP4-758J18.10|lincRNA|NON_CODING|ENST00000434150||1),INTRON(MODIFIER|||||RP4-758J18.10|lincRNA|NON_CODING|ENST00000417917|1|1),INTRON(MODIFIER|||||RP4-758J18.10|lincRNA|NON_CODING|ENST00000430109|1|1),INTRON(MODIFIER|||||RP4-758J18.10|lincRNA|NON_CODING|ENST00000454562|1|1)
1	2182470	.	G	A	.	PASS	AC=743;EFF=MOTIF[PB0134.1:HNF4A](MODIFIER||||||||||1),INTRON(MODIFIER||||728|SKI|protein_coding|CODING|ENST00000378536|1|1),INTRON(MODIFIER|||||SKI|processed_transcript|CODING|ENST00000478223|1|1),INTRON(MODIFIER|||||SKI|processed_transcript|CODING|ENST00000508416|1|1)
1	2466633	.	A	G	.	PASS	AC=770;EFF=MOTIF[PB0191.1:Ap2gamma](MODIFIER||||||||||1),UPSTREAM(MODIFIER||||166|HES5|protein_coding|CODING|ENST00000378453||1),INTERGENIC(MODIFIER||||||||||1)
1	2480337	.	G	A	.	PASS	AC=16;EFF=MOTIF[MA0058.1:Max](MODIFIER||||||||||1),MOTIF[MA0059.1:Cmyc](MODIFIER||||||||||1),MOTIF[MA0093.1:USF1](MODIFIER||||||||||1),MOTIF[MA0147.1:Cmyc](LOW||||||||||1),MOTIF[PB0043.1:Max](MODIFIER||||||||||1),MOTIF[PL0007.1:Max](MODIFIER||||||||||1),MOTIF[PL0014.1:mxl-1](MODIFIER||||||||||1),DOWNSTREAM(MODIFIER|||||RP3-395M20.8|processed_transcript|NON_CODING|ENST00000416860||1),DOWNSTREAM(MODIFIER|||||RP3-395M20.8|processed_transcript|NON_CODING|ENST00000432521||1),DOWNSTREAM(MODIFIER|||||RP3-395M20.8|processed_transcript|NON_CODING|ENST00000443892||1),DOWNSTREAM(MODIFIER|||||RP3-395M20.8|processed_transcript|NON_CODING|ENST00000448624||1),DOWNSTREAM(MODIFIER|||||RP3-395M20.8|processed_transcript|NON_CODING|ENST00000449660||1),INTERGENIC(MODIFIER||||||||||1)


So, for instance, the first MOTIF annotation is MOTIF[PB0134.1:HNF4A] corresponding to motif PB0134.1, which you can look up in Jaspar.

SnpEff needs a database to perform genomic annotations. There are pre-built databases for over 2,500 genomes, so chances are that your organism of choice already has a SnpEff database available. In the (unlikely?) event that you need to build one yourself, here we describe how to do it.

You can find out which genomes are supported by running the following command:

java -jar snpEff.jar databases

Most people do NOT need to build a database, and can safely use a pre-built one. So unless you are working with a rare genome, you most likely don't need to do it either.

Again, chances are that you DO NOT NEED to build a database. Sorry to repeat this, but I cannot tell you how many times I get emails asking for help building a database that is already available.

Building a database

In order to build a database for a new genome, you need to:

  1. Configure a new genome in SnpEff's config file snpEff.config.
    1. Add genome entry to snpEff's configuration
    2. If the genome uses a non-standard codon table: Add codon table parameter
  2. Get the reference genome sequence (e.g. in FASTA format).
  3. Get genome annotations. There are four different ways you can do this:
    1. Option 1: Building a database from GTF files (the easiest way)
    2. Option 2: Building a database from GFF files
    3. Option 3: Building a database from RefSeq table from UCSC
    4. Option 4: Building a database from GenBank files
  4. Run a command to create the database (i.e. "java -jar snpEff.jar build ...")
Note: All files can be compressed using gzip. E.g. the reference file 'hg19.fa' can be compressed to 'hg19.fa.gz', snpEff will automatically decompress the file.

Warning: Some files that claim to be compressed using GZIP are actually not, or use a block compression variant not supported by Java's gzip library. If you notice that your build process finishes abruptly for no apparent reason, try uncompressing the files. This sometimes happens with ENSEMBL files.

Configuring a new genome

In order to tell SnpEff that there is a new genome available, you must update SnpEff's configuration file snpEff.config.
You must add a new genome entry to snpEff.config.
If your genome, or a chromosome, uses non-standard codon tables, you must update snpEff.config accordingly. A typical case is mitochondrial DNA: you then specify that chromosome 'MT' uses the codon.Invertebrate_Mitochondrial codon table. Another common case is adding a bacterial genome, where you specify that the codon table is Bacterial_and_Plant_Plastid.

Add a genome to the configuration file

This example shows how to add a new genome to the config files. For this example we'll use the mouse genome (mm37.61):

  1. Edit the config file to create the new genome
     vi snpEff.config 
    Add the following lines (you are editing snpEff.config)
    # Mouse genome, version mm37.61
    mm37.61.genome : Mouse 
    

    Warning: You may need to add codon table information for the genome or some parts of it (e.g. mitochondrial "chromosome"). See next section for details.

  2. Optional: Add genome to Galaxy's menu
    cd /path/to/galaxy
    cd tools/snpEffect/
    vi snpEffect.xml 
    
    Add the following lines to the file
    	<param name="genomeVersion" type="select" label="Genome">
    		<option value="hg37">Human (hg37)<option>
    		<option value="mm37.61">Mouse (mm37.61)<option>
    	<param> 
    


Configuring codon tables (not always required)

Codon tables are provided in the snpEff.config configuration file under the section codon.Name_of_your_codon_table. The format is a comma separated list of CODON/AMINO_ACID.
E.g.:

codon.Invertebrate_Mitochondrial:  TTT/F, TTC/F, TAC/Y, TAA/*, ATA/M+, ATG/M+, ACT/T, ...
Note that codons marked with '*' are STOP codons and codons marked with a '+' are START codons.

In order for you to use them, you have to specify that a given "chromosome" uses one of the tables (otherwise the default codon table is used).
E.g. Here we say the chromosome 'M' from fly genome (dm3) uses Invertebrate_Mitochondrial codon table:
dm3.M.codonTable : Invertebrate_Mitochondrial
...of course, chromosome 'M' is not a real chromosome, it is just a way to mark the sequence as mitochondrial DNA in the reference genome.

Reference genome: GTF, GFF, RefSeq or GenBank

As we previously mentioned, reference genome information can be in different formats: GTF, GFF, RefSeq or GenBank.
In the following sub-sections, we show how to build a database for each type of genomic information file.

Option 1: Building a database from GTF files

GTF 2.2 files are supported by SnpEff (e.g. ENSEMBL releases genome annotations in this format).

  1. Get the genome and uncompress it
    # Create directory for this new genome
    cd /path/to/snpEff/data/
    mkdir mm37.61
    cd mm37.61
    
    # Get annotation files
    wget ftp://ftp.ensembl.org/pub/current/gtf/mus_musculus/Mus_musculus.NCBIM37.61.gtf.gz
    mv Mus_musculus.NCBIM37.61.gtf.gz genes.gtf.gz
    
    # Get the genome
    cd /path/to/snpEff/data/genomes
    wget ftp://ftp.ensembl.org/pub/current/fasta/mus_musculus/dna/Mus_musculus.NCBIM37.61.dna.toplevel.fa.gz
    mv Mus_musculus.NCBIM37.61.dna.toplevel.fa.gz mm37.61.fa.gz
    
  2. Note: The FASTA file can be either in
    /path/to/snpEff/data/genomes/mm37.61.fa
    or in
    /path/to/snpEff/data/mm37.61/sequences.fa

  3. Add the new genome to the config file (see section "How to add a new genome to the configuration file" for details)
  4. Create database
    cd /path/to/snpEff
    java -jar snpEff.jar build -gtf22 -v mm37.61 
    


Option 2: Building a database from GFF files


Warning: Using GFF is discouraged, we recommend you use GTF files instead (whenever possible).

This example shows how to create a database for a new genome using a GFF file (e.g. FlyBase, WormBase and BeeBase release GFF files). For this example we'll use the Drosophila melanogaster genome (dm5.31):

  1. Get a GFF file (into path/to/snpEff/data/dm5.31/genes.gff):
    mkdir path/to/snpEff/data/dm5.31
    cd path/to/snpEff/data/dm5.31
    wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.31_FB2010_08/gff/dmel-all-r5.31.gff.gz
    mv dmel-all-r5.31.gff.gz genes.gff.gz 
    
  2. Note: GFF3 files can include the reference sequence in the same file. This is done by appending the FASTA sequences after a '##FASTA' line. You can also add the sequence FASTA file to the 'data/genomes/' directory, like it is done when using the GTF format.
  3. Add the new genome to the config file (see section "How to add a new genome to the configuration file" for details)
  4. Create database (note the "-gff3" flag):
    cd /path/to/snpEff
    java -jar snpEff.jar build -gff3 -v dm5.31 
    


Option 3: Building a database from RefSeq table from UCSC

This example shows how to create a database for a new genome. For this example we'll use the Human genome (hg19).

Warning: Using UCSC genome tables is highly discouraged, we recommend you use ENSEMBL versions instead.

Warning: UCSC tables sometimes change for different species. This means that even if these instructions work for the human genome, they might not work for other genomes. Obviously creating a new parser for each genome is impractical, so working with UCSC genomes is highly discouraged. We recommend using ENSEMBL genomes instead.

Warning: UCSC genomes provide only a major release version, but NOT sub-versions. E.g. UCSC's "hg19" has major version 19 but there is no "sub-version", whereas ENSEMBL's GRCh37.70 clearly has major version 37 and minor version 70. Not providing a minor version means that they might change the database and two "hg19" genomes may actually be different. This creates all sorts of consistency problems (e.g. the annotations may not be the same as those you see in the UCSC genome browser, even though both of them are 'hg19'). Using UCSC genome tables is highly discouraged; we recommend you use ENSEMBL versions instead.

In order to build a genome using UCSC tables, you can follow these instructions:

  1. Go to UCSC genome browser (http://genome.ucsc.edu/)
  2. Click on "Table" menu
  3. Select parameters as shown here:

  4. Click on "get output" and save the data to file "/path/to/snpEff/data/hg19/genes.txt".
  5. Add the fasta reference genome. The FASTA file can be either in
    /path/to/snpEff/data/genomes/hg19.fa
    or in
    /path/to/snpEff/data/hg19/sequences.fa

  6. Add the new genome to the config file (see section "How to add a new genome to the configuration file" for details)
  7. Create database (note the "-refSeq" flag):
    cd /path/to/snpEff
    java -jar snpEff.jar build -refSeq -v hg19 
    


Option 4: Building a database from GenBank files

This example shows how to create a database for a new genome. For this example we'll use "Staphylococcus aureus":

  1. Go to NIH page for CP000730
  2. Download the features in GenBank format, by clicking as shown in the following images (red arrows)



    Make sure you click the "Update" button!
    Then you go to the "Send" menu



    and then



  3. Save the GenBank data to "/path/to/snpEff/data/CP000730/genes.gb".

    Note: If there is more than one GenBank file for an organism (e.g. multiple chromosomes), you can download each file and concatenate them.
    E.g.: Vibrio Cholerae has two chromosomes with GenBank accessions NC_002505.1 and NC_002506.1. You can download both files and save them as snpEff/data/vibrio/NC_002505.1.gb and snpEff/data/vibrio/NC_002506.1.gb respectively, and then concatenate both files:
    cat NC_002505.1.gb NC_002506.1.gb > genes.gb
    
    Add the following entries in the config file:

    # Vibrio Cholerae
    vibrio.genome : Vibrio Cholerae
    	vibrio.chromosomes : NC_002505.1, NC_002506.1
    	vibrio.NC_002505.1.codonTable : Bacterial_and_Plant_Plastid
    	vibrio.NC_002506.1.codonTable : Bacterial_and_Plant_Plastid
    
  4. Create database (note the "-genbank" flag):

    	cd /path/to/snpEff
    	java -jar snpEff.jar build -genbank -v CP000730 
    


Example: Building the Human Genome database

This is a full example of how to build the human genome database (using the GTF file from ENSEMBL). It includes support for regulatory features, sanity checks, rare amino acids, etc.

# Go to SnpEff's install dir
cd ~/snpeff

# Create database dir
mkdir data/GRCh37.70
cd data/GRCh37.70

# Download annotated genes
wget ftp://ftp.ensembl.org/pub/release-70/gtf/homo_sapiens/Homo_sapiens.GRCh37.70.gtf.gz
mv Homo_sapiens.GRCh37.70.gtf.gz genes.gtf.gz

# Download proteins 
# This is used for:
#	- "Rare Amino Acid" annotations
#	- Sanity check (checking protein predicted from DNA sequences match 'real' proteins)
wget ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/pep/Homo_sapiens.GRCh37.70.pep.all.fa.gz
mv Homo_sapiens.GRCh37.70.pep.all.fa.gz protein.fa.gz

# Download CDSs
# Note: This is used as "sanity check" (checking that CDSs prediscted from gene sequences match 'real' CDSs)
wget ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.70.cdna.all.fa.gz
mv Homo_sapiens.GRCh37.70.cdna.all.fa.gz cds.fa.gz

# Download regulatory annotations
wget ftp://ftp.ensembl.org/pub/release-70/regulation/homo_sapiens/AnnotatedFeatures.gff.gz
mv AnnotatedFeatures.gff.gz regulation.gff.gz

# Uncompress
gunzip *.gz

# Download genome
cd ../genomes/
wget ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.70.dna.toplevel.fa.gz
mv Homo_sapiens.GRCh37.70.dna.toplevel.fa.gz GRCh37.70.fa.gz

# Uncompress: 
# Why do we need to uncompress? 
# Because ENSEMBL compresses files using a block compression variant of gzip which is not compatible with Java's gzip library 
gunzip GRCh37.70.fa.gz

# Edit snpEff.config file
#
# WARNING! You must do this yourself. Just copying and pasting this into a terminal won't work.
#
# Add lines:
#		GRCh37.70.genome : Homo_sapiens
#		GRCh37.70.reference : ftp://ftp.ensembl.org/pub/release-70/gtf/

# Now we are ready to build the database
cd ~/snpeff
java -Xmx20g -jar snpEff.jar build -v GRCh37.70 2>&1 | tee GRCh37.70.build


Troubleshooting Database builds

Warning: By far the most common problem is that the FASTA file chromosome names are different than the GFF chromosome names. Make sure chromosome names are consistent in all the files you use.

When I build the database using GFF3, SnpEff reports that exons don't have sequences

GFF3 files can have sequence information either in the same file or in a separate fasta file.
In order to add sequence information in the GFF file, you can do this:

cat annotations.gff > genes.gff
echo "###"  >> genes.gff
echo "##FASTA"  >> genes.gff
cat sequence.fa  >> genes.gff

When building a database, I get zero protein coding genes

When building a database, SnpEff tries to find which transcripts are protein coding. This is done using the 'bioType' information.
The bioType information is not a standard GFF or GTF feature, so I follow ENSEMBL's convention of using the second column ('source') for the bioType, as well as the gene_biotype attribute.
If your file was not produced by ENSEMBL, it probably doesn't have this information. This means that SnpEff doesn't know which genes are protein coding and which ones are not.
Having no information, SnpEff will treat all genes as protein coding (assuming you have the '-treatAllAsProteinCoding Auto' option in the command line, which is the default).
So you will get effects as if all genes were protein coding, and you can then filter out the irrelevant genes. Unfortunately, this is the best that can be done if there is no 'bioType' information.

When building a database, I get too many warnings

There are plenty of GFF and GTF files that, unfortunately, do not follow the specification. SnpEff usually complains about this, but tries hard to correct the problems. So the database may be OK even after you see many warnings.
You can check whether the features (genes, exons, UTRs) have been correctly incorporated by dumping the database:

java -jar snpEff.jar dump myGenome | less
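
For a quick overview, the database can also be dumped in BED format (see the 'SnpEff dump' section below) and the features counted by type. A sketch, with 'myGenome' standing in for your genome name:

# The BED 'name' column is prefixed with the feature type (Gene, Transcript, Exon, ...)
java -Xmx4g -jar snpEff.jar dump -v -bed myGenome | cut -f 4 | cut -d '_' -f 1 | sort | uniq -c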

SnpEff supports regulatory and non-coding annotations. In this section we show how to build those databases. As in the previous section, you will most likely never have to do this yourself and can simply use the available pre-built databases.

There are two ways to add support for regulatory annotations (these are not mutually exclusive, you can use both at the same time):

  1. GFF regulation file (from ENSEMBL).
  2. BED files.
Adding regulation support and analyzing data using regulation tracks can take much more memory. For instance, for the human genome I use 10GB to 20GB of RAM.

It is assumed that the genome is already installed; here we only add the regulatory tracks.

Option 1: Using a GFF file

This example shows how to create a regulation database for human (GRCh37.65):
  1. Get the GFF regulatory annotations (into path/to/snpEff/data/GRCh37.65/regulation.gff):
    cd path/to/snpEff/data/GRCh37.65
    wget ftp://ftp.ensembl.org/pub/release-65/regulation/homo_sapiens/AnnotatedFeatures.gff.gz
    mv AnnotatedFeatures.gff.gz regulation.gff.gz 
    
  2. Create databases. Note that we use the "-onlyReg" flag, because we are only creating regulatory databases. If you omit it, it will create both the 'normal' and the regulatory databases:
    cd /path/to/snpEff
    java -Xmx20G -jar snpEff.jar build -v -onlyReg GRCh37.65 
    
    The output looks like this:
    Reading regulation elements (GFF)
    	Chromosome '11'	line: 226964
    	Chromosome '12'	line: 493780
    	...
    	Chromosome '9'	line: 4832434
    	Chromosome 'X'  line: 5054301
    	Chromosome 'Y'  line: 5166958
    Done
    	Total lines                 : 5176289
    	Total annotation count      : 3961432
    	Percent                     : 76.5%
    	Total annotated length      : 3648200193
    	Number of cell/annotations  : 266
    Saving database 'HeLa-S3' in file '/path/to/snpEff/data/GRCh37.65/regulation_HeLa-S3.bin'
    Saving database 'HepG2' in file '/path/to/snpEff/data/GRCh37.65/regulation_HepG2.bin'
    Saving database 'NHEK' in file '/path/to/snpEff/data/GRCh37.65/regulation_NHEK.bin'
    Saving database 'GM12878' in file '/path/to/snpEff/data/GRCh37.65/regulation_GM12878.bin'
    Saving database 'HUVEC' in file '/path/to/snpEff/data/GRCh37.65/regulation_HUVEC.bin'
    Saving database 'H1ESC' in file '/path/to/snpEff/data/GRCh37.65/regulation_H1ESC.bin'
    Saving database 'CD4' in file '/path/to/snpEff/data/GRCh37.65/regulation_CD4.bin'
    Saving database 'GM06990' in file '/path/to/snpEff/data/GRCh37.65/regulation_GM06990.bin'
    Saving database 'IMR90' in file '/path/to/snpEff/data/GRCh37.65/regulation_IMR90.bin'
    Saving database 'K562' in file '/path/to/snpEff/data/GRCh37.65/regulation_K562.bin'
    Done.
    
    As you can see, annotations for each cell type are saved in different files. This makes it easier to load annotations only for the desired cell types when analyzing data.
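
Once the regulatory database is built, the tracks for specific cell types can be loaded when annotating. This is only a usage sketch ('my.vcf' is a placeholder input file; the '-reg' option selects regulation tracks, but check the help output of your SnpEff version):

java -Xmx8g -jar snpEff.jar -v -reg HeLa-S3 -reg NHEK GRCh37.65 my.vcf > my.annotated.vcf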

Option 2: Using a BED file

This example shows how to create a regulation database for human (GRCh37.65). We assume we have a file called "my_regulation.bed" which has information for H3K9me3 in Pancreatic Islets (for instance, as a result of a ChIP-Seq experiment and peak enrichment analysis).

  1. Add all your BED files to 'path/to/snpEff/data/GRCh37.65/regulation.bed/' dir:
    cd path/to/snpEff/data/GRCh37.65
    mkdir regulation.bed
    cd regulation.bed
    mv wherever/your/bed/file/is/my_regulation.bed ./regulation.Pancreatic_Islets.H3K9me3.bed 
    
    Note: The name of the file must be 'regulation.CELL_TYPE.ANNOTATION_TYPE.bed'. In this case, 'CELL_TYPE=Pancreatic_Islets' and 'ANNOTATION_TYPE=H3K9me3'
  2. Create databases (note the "-onlyReg" flag):
    cd /path/to/snpEff
    java -Xmx20G -jar snpEff.jar build -v -onlyReg GRCh37.65 
    
    The output looks like this:
    Building database for 'GRCh37.65'
    Reading regulation elements (GFF)
    Cannot read regulation elements form file '/path/to/snpEff/data/GRCh37.65/regulation.gff'
    Directory has 1 bed files and 1 cell types
    Creating consensus for cellType 'Pancreatic_Islets', files: [/path/to/snpEff/data/GRCh37.65/regulation.bed/regulation.Pancreatic_Islets.H3K9me3.bed]
    Reading file '/path/to/snpEff/data/GRCh37.65/regulation.bed/regulation.Pancreatic_Islets.H3K9me3.bed'
    	Chromosome '10'	line: 5143
    	Chromosome '11'	line: 8521
    	...
    	Chromosome 'X'	line: 52481
    	Chromosome 'Y'	line: 53340
    Done
    	Total lines                 : 53551
    	Total annotation count      : 53573
    	Percent                     : 100.0%
    	Total annotated length      : 75489402
    	Number of cell/annotations  : 1
    Creating consensus for cell type: Pancreatic_Islets
    Sorting: Pancreatic_Islets	, size: 53573
    Adding to final consensus
    Final consensus for cell type: Pancreatic_Islets	, size: 53549
    Saving database 'Pancreatic_Islets' in file '/path/to/snpEff/data/GRCh37.65/regulation_Pancreatic_Islets.bin'
    Done
    Finishing up
    
    Note: Annotations are saved in one binary file per cell type (i.e. several BED files for the same cell type are collapsed into a consensus). This makes it easier to load annotations only for the desired cell types when analyzing data.

SnpEff is integrated with Broad Institute's Genome Analysis Toolkit (GATK).

In order to make sure SnpEff and GATK understand each other, you must activate GATK compatibility in SnpEff by using the -o gatk command line option. The reason for using '-o gatk' is that, even though both GATK and SnpEff use VCF format, SnpEff has recently updated the EFF sub-field format and this might cause some trouble (since GATK still uses the original version).

GATK only picks one effect: the GATK team decided to report only the effect having the highest impact. This was done intentionally for the sake of brevity, in a 'less is more' spirit. You can get the full list of effects by running SnpEff independently, instead of using it within the GATK framework.

Script example: In this example we combine SnpEff and GATK's VariantAnnotator (you can find this script in the snpEff/scripts/ directory of the distribution):
#!/bin/sh

#-------------------------------------------------------------------------------
# Files
#-------------------------------------------------------------------------------

in=$1                                                   # Input VCF file
eff=`dirname $in`/`basename $in .vcf`.snpeff.vcf        # SnpEff annotated VCF file
out=`dirname $in`/`basename $in .vcf`.gatk.vcf          # Output VCF file (annotated by GATK)

ref=$HOME/snpEff/data/genomes/hg19.fa                   # Reference genome file
dict=`dirname $ref`/`basename $ref .fa`.dict            # Reference genome: Dictionary file

#-------------------------------------------------------------------------------
# Path to programs and libraries
#-------------------------------------------------------------------------------

gatk=$HOME/tools/gatk/GenomeAnalysisTK.jar
picard=$HOME/tools/picard/
snpeff=$HOME/snpEff/snpEff.jar

#-------------------------------------------------------------------------------
# Main
#-------------------------------------------------------------------------------

# Create genome index file
echo
echo "Indexing Genome reference FASTA file: $ref"
samtools faidx $ref

# Create dictionary
echo
echo "Creating Genome reference dictionary file: $dict"
java -jar $picard/CreateSequenceDictionary.jar R= $ref O= $dict

# Annotate
echo
echo "Annotate using SnpEff"
echo "    Input file  : $in"
echo "    Output file : $eff"
java -Xmx4G -jar $snpeff -c $HOME/snpEff/snpEff.config -v -o gatk hg19 $in > $eff

# Use GATK
echo
echo "Annotating using GATK's VariantAnnotator:"
echo "    Input file  : $in"
echo "    Output file : $out"
java -Xmx4g -jar $gatk \
    -T VariantAnnotator \
    -R $ref \
    -A SnpEff \
    --variant $in \
    --snpEffFile $eff \
    -L $in \
    -o $out

Important: In order for this to work, GATK requires that the genome reference file have the chromosomes in karyotyping order (largest to smallest chromosomes, followed by X, Y, and MT). Your VCF file should also respect that order.
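
A quick way to verify the ordering is to inspect the sequence dictionary created by Picard (a sketch, using the hg19.dict path from the script above):

# The @SQ lines of the dictionary (a SAM-style header) appear in the same order as the contigs in the FASTA
grep '^@SQ' $HOME/snpEff/data/genomes/hg19.dict | cut -f 2,3 | head -30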

Now we can use the script:
$ ~/snpEff/scripts/gatk.sh zzz.vcf

Indexing Genome reference FASTA file: /home/pcingola/snpEff/data/genomes/hg19.fa

Creating Genome reference dictionary file: /home/pcingola/snpEff/data/genomes/hg19.dict
[Fri Apr 12 11:23:12 EDT 2013] net.sf.picard.sam.CreateSequenceDictionary REFERENCE=/home/pcingola/snpEff/data/genomes/hg19.fa OUTPUT=/home/pcingola/snpEff/data/genomes/hg19.dict    TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Fri Apr 12 11:23:12 EDT 2013] Executing as pcingola@localhost.localdomain on Linux 3.6.11-4.fc16.x86_64 amd64; OpenJDK 64-Bit Server VM 1.6.0_24-b24; Picard version: 1.89(1408)
[Fri Apr 12 11:23:12 EDT 2013] net.sf.picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=141164544
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" net.sf.picard.PicardException: /home/pcingola/snpEff/data/genomes/hg19.dict already exists.  Delete this file and try again, or specify a different output file.
	at net.sf.picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:114)
	at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
	at net.sf.picard.sam.CreateSequenceDictionary.main(CreateSequenceDictionary.java:93)

Annotate using SnpEff
    Input file  : zzz.vcf
    Output file : ./zzz.snpeff.vcf
00:00:00.000	Reading configuration file '/home/pcingola/snpEff/snpEff.config'
00:00:00.173	done
00:00:00.173	Reading database for genome version 'hg19' from file '/home/pcingola//snpEff/data/hg19/snpEffectPredictor.bin' (this might take a while)
00:00:11.860	done
00:00:11.885	Building interval forest
00:00:17.755	done.
00:00:18.391	Genome stats :
# Genome name                : 'Homo_sapiens (USCS)'
# Genome version             : 'hg19'
# Has protein coding info    : true
# Genes                      : 25933
# Protein coding genes       : 20652
# Transcripts                : 44253
# Avg. transcripts per gene  : 1.71
# Protein coding transcripts : 36332
# Cds                        : 365442
# Exons                      : 429543
# Exons with sequence        : 409789
# Exons without sequence     : 19754
# Avg. exons per transcript  : 9.71
# Number of chromosomes      : 50
# Chromosomes names [sizes]  : '1' [249250621]	'2' [243199373]	'3' [198022430]	'4' [191154276]	'5' [180915260]	'6' [171115067]	'7' [159138663]	'X' [155270560]	'8' [146364022]	'9' [141213431]	'10' [135534747]	'11' [135006516]	'12' [133851895]	'13' [115169878]	'14' [107349540]	'15' [102531392]	'16' [90354753]	'17' [81195210]	'18' [78077248]	'20' [63025520]	'Y' [59373566]	'19' [59128983]	'22' [51304566]	'21' [48129895]	'6_ssto_hap7' [4905564]	'6_mcf_hap5' [4764535]	'6_cox_hap2' [4734611]	'6_mann_hap4' [4679971]	'6_qbl_hap6' [4609904]	'6_dbb_hap3' [4572120]	'6_apd_hap1' [4383650]	'17_ctg5_hap1' [1574839]	'4_ctg9_hap1' [582546]	'Un_gl000220' [156152]	'19_gl000209_random' [145745]	'Un_gl000213' [139339]	'17_gl000205_random' [119732]	'Un_gl000223' [119730]	'4_gl000194_random' [115071]	'Un_gl000228' [114676]	'Un_gl000219' [99642]	'Un_gl000218' [97454]	'Un_gl000211' [93165]	'Un_gl000222' [89310]	'4_gl000193_random' [88375]	'7_gl000195_random' [86719]	'1_gl000192_random' [79327]	'Un_gl000212' [60768]	'1_gl000191_random' [50281]	'M' [16571]	
00:00:18.391	Predicting variants
00:00:20.267	Creating summary file: snpEff_summary.html
00:00:20.847	Creating genes file: snpEff_genes.txt
00:00:25.026	done.
00:00:25.036	Logging
00:00:26.037	Checking for updates...

Annotating using GATK's VariantAnnotator:
    Input file  : zzz.vcf
    Output file : ./zzz.gatk.vcf
INFO  11:23:41,316 ArgumentTypeDescriptor - Dynamically determined type of zzz.vcf to be VCF 
INFO  11:23:41,343 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  11:23:41,344 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-9-g532efad, Compiled 2013/03/19 07:35:36 
INFO  11:23:41,344 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  11:23:41,344 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  11:23:41,347 HelpFormatter - Program Args: -T VariantAnnotator -R /home/pcingola/snpEff/data/genomes/hg19.fa -A SnpEff --variant zzz.vcf --snpEffFile ./zzz.snpeff.vcf -L zzz.vcf -o ./zzz.gatk.vcf 
INFO  11:23:41,347 HelpFormatter - Date/Time: 2013/04/12 11:23:41 
INFO  11:23:41,348 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  11:23:41,348 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  11:23:41,353 ArgumentTypeDescriptor - Dynamically determined type of zzz.vcf to be VCF 
INFO  11:23:41,356 ArgumentTypeDescriptor - Dynamically determined type of ./zzz.snpeff.vcf to be VCF 
INFO  11:23:41,399 GenomeAnalysisEngine - Strictness is SILENT 
INFO  11:23:41,466 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  11:23:41,480 RMDTrackBuilder - Loading Tribble index from disk for file zzz.vcf 
INFO  11:23:41,503 RMDTrackBuilder - Loading Tribble index from disk for file ./zzz.snpeff.vcf 
WARN  11:23:41,505 RMDTrackBuilder - Index file /data/pcingola/Documents/projects/snpEff/gatk_test/./zzz.snpeff.vcf.idx is out of date (index older than input file), deleting and updating the index file 
INFO  11:23:41,506 RMDTrackBuilder - Creating Tribble index in memory for file ./zzz.snpeff.vcf 
INFO  11:23:41,914 RMDTrackBuilder - Writing Tribble index to disk for file /data/pcingola/Documents/projects/snpEff/gatk_test/./zzz.snpeff.vcf.idx 
INFO  11:23:42,076 IntervalUtils - Processing 33411 bp from intervals 
INFO  11:23:42,125 GenomeAnalysisEngine - Creating shard strategy for 0 BAM files 
INFO  11:23:42,134 GenomeAnalysisEngine - Done creating shard strategy 
INFO  11:23:42,134 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  11:23:42,135 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
INFO  11:23:49,268 VariantAnnotator - Processed 9966 loci.
 
INFO  11:23:49,280 ProgressMeter -            done        3.34e+04    7.0 s        3.6 m    100.0%         7.0 s     0.0 s 
INFO  11:23:49,280 ProgressMeter - Total runtime 7.15 secs, 0.12 min, 0.00 hours 
INFO  11:23:49,953 GATKRunReport - Uploaded run statistics report to AWS S3 

SnpEff provides integration with the Galaxy project.

In order to install SnpEff in your own Galaxy server, you can use the galaxy/*.xml files provided in the main distribution.

[Screen capture from a Galaxy server]



Installing SnpEff in a Galaxy server:

# Set variable to snpEff install dir (we only use it for this install script)
export snpEffDir="$HOME/snpEff"

# Go to your galaxy 'tools' dir
cd galaxy-dist/tools

# Create a directory and copy the XML config files from SnpEff's distribution
mkdir snpEff
cd snpEff/
cp $snpEffDir/galaxy/* .

# Create links to JAR files
ln -s $snpEffDir/snpEff.jar
ln -s $snpEffDir/SnpSift.jar

# Link to config file
ln -s $snpEffDir/snpEff.config

# Allow scripts execution
chmod a+x *.{pl,sh}

# Copy genomes information
cd ../..
cp $snpEffDir/galaxy/tool-data/snpEff_genomes.loc tool-data/

# Edit Galaxy's tool_conf.xml and add all the tools
vi tool_conf.xml

-------------------- Begin: Edit tool_conf.xml --------------------
  <!-- 
       Add this section to tool_conf.xml file in your galaxy distribution
	
       Note: The following lines should be added at the end of the 
             file, right before "</toolbox>" line
  -->
  <section name="snpEff tools" id="snpEff_tools">
    <tool file="snpEff/snpEff.xml" />
    <tool file="snpEff/snpEff_download.xml" />
    <tool file="snpEff/snpSift_annotate.xml" />
    <tool file="snpEff/snpSift_caseControl.xml" />
    <tool file="snpEff/snpSift_filter.xml" />
    <tool file="snpEff/snpSift_int.xml" />
  </section>
-------------------- End: Edit tool_conf.xml --------------------

# Run galaxy and check that the new menus appear
./run.sh

SnpEff provides several other commands and utilities that can be useful for genomic data analysis.

Most of this manual is dedicated to the SnpEff eff and SnpEff build commands, which annotate effects and build databases, respectively. Here we describe the other commands, as well as some of the provided scripts, that are useful for genomic data analysis.

SnpEff closest

Annotates using the closest genomic region (e.g. exon, transcript ID, gene name) and distance in bases.
Example:

$ java -Xmx4g -jar snpEff.jar closest GRCh37.66 test.vcf
##INFO=<ID=CLOSEST,Number=4,Type=String,Description="Closest exon: Distance (bases), exons Id, transcript Id, gene name">
1       12078   .       G       A       25.69   PASS    AC=2;AF=0.048;CLOSEST=0,exon_1_11869_12227,ENST00000456328,DDX11L1
1       16097   .       T       G       42.42   PASS    AC=9;AF=0.0113;CLOSEST=150,exon_1_15796_15947,ENST00000423562,WASH7P
1       40261   .       C       A       366.26  PASS    AC=30;AF=0.484;CLOSEST=4180,exon_1_35721_36081,ENST00000417324,FAM138A
1       63880   .       C       T       82.13   PASS    AC=10;AF=0.0400;CLOSEST=0,exon_1_62948_63887,ENST00000492842,OR4G11P
For instance, in the line for variant 1:16097 (T G), the tag CLOSEST=150,exon_1_15796_15947,ENST00000423562,WASH7P was added, which means that the variant is 150 bases away from exon "exon_1_15796_15947". The exon belongs to transcript "ENST00000423562" of gene "WASH7P".

If multiple markers are available at the same distance, the one belonging to the longest mRNA transcript is shown.

The input can also be a BED file; in that case, the same information as in the CLOSEST INFO field is appended to the fourth column of the output BED file:
$ snpeff closest -bed GRCh37.66 test.bed 
1	12077	12078	line_1;0,exon_1_11869_12227,ENST00000456328,DDX11L1
1	16096	16097	line_2;150,exon_1_15796_15947,ENST00000423562,WASH7P
1	40260	40261	line_3;4180,exon_1_35721_36081,ENST00000417324,FAM138A
1	63879	63880	line_4;0,exon_1_62948_63887,ENST00000492842,OR4G11P

 

SnpEff count

As the name suggests, the snpEff count command counts how many reads and bases from a BAM file hit a gene, transcript, exon, intron, etc. Input files can be in BAM, SAM, VCF, BED or BigBed formats.
A summary HTML file with charts is also generated.


If you need to count how many reads (and bases) from a BAM file hit each genomic region, you can use the 'count' utility.
The command line is quite simple. E.g. in order to count how many reads (from N BAM files) hit regions of the human genome, you simply run:

java -Xmx4g -jar snpEff.jar count GRCh37.68 readsFile_1.bam readsFile_2.bam ...  readsFile_N.bam > countReads.txt 


The output is a tab-separated TXT file that looks like this:
chr  start  end       type                IDs                         Reads:readsFile_1.bam  Bases:readsFile_1.bam  Reads:readsFile_2.bam  Bases:readsFile_2.bam ...
1    1      11873     Intergenic          DDX11L1                     130                    6631                   50                     2544
1    1      249250621 Chromosome          1                           2527754                251120400              2969569                328173439
1    6874   11873     Upstream            NR_046018;DDX11L1           130                    6631                   50                     2544
1    9362   14361     Downstream          NR_024540;WASH7P            243                    13702                  182                    9279
1    11874  12227     Exon                exon_1;NR_046018;DDX11L1    4                      116                    2                      102
1    11874  14408     Gene                DDX11L1                     114                    7121                   135                    6792
1    11874  14408     Transcript          NR_046018;DDX11L1           114                    7121                   135                    6792
1    12228  12229     SpliceSiteDonor     exon_1;NR_046018;DDX11L1    3                      6                      0                      0
1    12228  12612     Intron              intron_1;NR_046018;DDX11L1  13                     649                    0                      0
1    12611  12612     SpliceSiteAcceptor  exon_2;NR_046018;DDX11L1    0                      0                      0                      0
1    12613  12721     Exon                exon_2;NR_046018;DDX11L1    3                      24                     1                      51
1    12722  12723     SpliceSiteDonor     exon_2;NR_046018;DDX11L1    3                      6                      0                      0
1    12722  13220     Intron              intron_2;NR_046018;DDX11L1  22                     2110                   20                     987
1    13219  13220     SpliceSiteAcceptor  exon_3;NR_046018;DDX11L1    5                      10                     1                      2
1    13221  14408     Exon                exon_3;NR_046018;DDX11L1    82                     4222                   113                    5652
1    14362  14829     Exon                exon_11;NR_024540;WASH7P    37                     1830                   7                      357
1    14362  29370     Transcript          NR_024540;WASH7P            704                    37262                  524                    34377
1    14362  29370     Gene                WASH7P                      704                    37262                  524                    34377
1    14409  19408     Downstream          NR_046018;DDX11L1           122                    7633                   39                     4254
The columns are:
  • Column 1: Chromosome name
  • Column 2: Genomic region start
  • Column 3: Genomic region end
  • Column 4: Genomic region type (e.g. Exon, Gene, SpliceSiteDonor, etc.)
  • Column 5: ID (e.g. exon ID ; transcript ID; gene ID)
  • Column 6: Count of reads (in file readsFile_1.bam) intersecting genomic region.
  • Column 7: Count of bases (in file readsFile_1.bam) intersecting genomic region, i.e. each read is intersected and the resulting number of bases added.
  • Column ...: (repeat count reads and bases for each BAM file provided)
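
Since the output is plain tab-separated text, it is easy to post-process with standard tools. For example, a small sketch that keeps the header plus only the 'Exon' rows (column 4 is the region type):

awk -F '\t' 'NR == 1 || $4 == "Exon"' countReads.txt > countReads.exons.txt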

Totals and Binomial model
Using command line option -p, you can calculate p-values based on a Binomial model. For example (output edited for the sake of brevity):
$ java -Xmx4g -jar snpEff.jar count -v BDGP5.69 fly.bam > countReads.txt
00:00:00.000	Reading configuration file 'snpEff.config'
...
00:00:12.148	Calculating probability model for read length 50
...
type               p.binomial             reads.fly  expected.fly  pvalue.fly
Chromosome         1.0                    205215     205215        1.0
Downstream         0.29531659795589793    59082      60603         1.0
Exon               0.2030262729897713     53461      41664         0.0
Gene               0.49282883664487515    110475     101136        0.0
Intergenic         0.33995644860241336    54081      69764         0.9999999963234701
Intron             0.3431415343615103     72308      70418         9.06236369003514E-19
RareAminoAcid      9.245222303207472E-7   3          0             9.879186871519377E-4
SpliceSiteAcceptor 0.014623209601955131   3142       3001          0.005099810118785825
SpliceSiteDonor    0.015279075154423956   2998       3135          0.9937690786738507
Transcript         0.49282883664487515    110475     101136        0.0
Upstream           0.31499087549896493    64181      64641         0.9856950416729887
Utr3prime          0.03495370828296416    8850       7173          1.1734134297889064E-84
Utr5prime          0.02765432673262785    8146       5675          7.908406840800345E-215

The columns in this table are (in the previous example the input file was 'fly.bam', so fileName is 'fly'):
  • type : Type of interval
  • p.binomial : Probability that a random read hits this 'type' of interval (in binomial model)
  • reads.fileName : Total number of reads in 'fileName' (user provided BAM/SAM file)
  • expected.fileName : Expected number of reads hitting this 'type' of interval (for user provided BAM/SAM file)
  • pvalue.fileName : p-value that 'reads.fileName' reads or more hit this 'type' of interval (for user provided BAM/SAM file)
  • Column ...: (repeat last three column for each BAM/SAM file provided by the user)

User defined intervals
You can add user-defined intervals using the -i file.bed command line option. The option can be used multiple times, thus allowing multiple BED files to be added.
Example: You want to know how many reads intersect each peak from a peak detection algorithm:
java -Xmx4g -jar snpEff.jar count -i peaks.bed GRCh37.68 reads.bam

SnpEff databases

This command lists the configured databases, i.e. those available in the snpEff.config file.
Example:

$ java -jar snpEff.jar databases
Genome                                                          Organism                                                        Database download link
------                                                          --------                                                        ----------------------
ADWO01                                                          Prevotella bryantii B14                                         http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Prevotella bryantii B14.zip
Acholeplasma_laidlawii_PG_8A_uid58901                           Acholeplasma_laidlawii_PG_8A_uid58901                           http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acholeplasma_laidlawii_PG_8A_uid58901.zip
Achromobacter_xylosoxidans_A8_uid59899                          Achromobacter_xylosoxidans_A8_uid59899                          http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Achromobacter_xylosoxidans_A8_uid59899.zip
Acidaminococcus_fermentans_DSM_20731_uid43471                   Acidaminococcus_fermentans_DSM_20731_uid43471                   http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidaminococcus_fermentans_DSM_20731_uid43471.zip
Acidaminococcus_intestini_RyC_MR95_uid74445                     Acidaminococcus_intestini_RyC_MR95_uid74445                     http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidaminococcus_intestini_RyC_MR95_uid74445.zip
Acidianus_hospitalis_W1_uid66875                                Acidianus_hospitalis_W1_uid66875                                http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidianus_hospitalis_W1_uid66875.zip
Acidilobus_saccharovorans_345_15_uid51395                       Acidilobus_saccharovorans_345_15_uid51395                       http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidilobus_saccharovorans_345_15_uid51395.zip
Acidimicrobium_ferrooxidans_DSM_10331_uid59215                  Acidimicrobium_ferrooxidans_DSM_10331_uid59215                  http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidimicrobium_ferrooxidans_DSM_10331_uid59215.zip
Acidiphilium_cryptum_JF_5_uid58447                              Acidiphilium_cryptum_JF_5_uid58447                              http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidiphilium_cryptum_JF_5_uid58447.zip
Acidiphilium_multivorum_AIU301_uid63345                         Acidiphilium_multivorum_AIU301_uid63345                         http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidiphilium_multivorum_AIU301_uid63345.zip
Acidithiobacillus_caldus_SM_1_uid70791                          Acidithiobacillus_caldus_SM_1_uid70791                          http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidithiobacillus_caldus_SM_1_uid70791.zip
Acidithiobacillus_ferrivorans_SS3_uid67387                      Acidithiobacillus_ferrivorans_SS3_uid67387                      http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidithiobacillus_ferrivorans_SS3_uid67387.zip
Acidithiobacillus_ferrooxidans_ATCC_23270_uid57649              Acidithiobacillus_ferrooxidans_ATCC_23270_uid57649              http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidithiobacillus_ferrooxidans_ATCC_23270_uid57649.zip
Acidithiobacillus_ferrooxidans_ATCC_53993_uid58613              Acidithiobacillus_ferrooxidans_ATCC_53993_uid58613              http://sourceforge.net/projects/snpeff/files/databases/v3.2/snpEff_v3.2_Acidithiobacillus_ferrooxidans_ATCC_53993_uid58613.zip
...
...
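
Since the list contains thousands of entries, it is usually easier to filter it. A small usage sketch:

# Look for a particular genome among the configured databases
java -jar snpEff.jar databases | grep -i grch37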

SnpEff download

This command downloads and installs a database.
Note that the database must be configured in snpEff.config and available at the download site.
Example: Download and install the C. elegans genome:

$ java -jar snpEff.jar download -v WBcel215.69
00:00:00.000	Downloading database for 'WBcel215.69'
00:00:00.002	Connecting to http://downloads.sourceforge.net/project/snpeff/databases/v3_1/snpEff_v3_1_WBcel215.69.zip
00:00:00.547	Copying file (type: application/zip, modified on: Sat Dec 01 20:59:55 EST 2012)
00:00:00.547	Local file name: 'snpEff_v3_1_WBcel215.69.zip'
00:00:01.949	Downloaded 1049506 bytes
00:00:03.624	Downloaded 2135266 bytes
00:00:05.067	Downloaded 3185026 bytes
00:00:06.472	Downloaded 4234786 bytes
00:00:07.877	Downloaded 5284546 bytes
00:00:09.580	Downloaded 6374626 bytes
00:00:11.005	Downloaded 7424386 bytes
00:00:12.410	Downloaded 8474146 bytes
00:00:13.815	Downloaded 9523906 bytes
00:00:15.358	Downloaded 10604226 bytes
00:00:16.761	Downloaded 11653666 bytes
00:00:18.168	Downloaded 12703426 bytes
00:00:19.573	Downloaded 13753186 bytes
00:00:21.198	Downloaded 14837506 bytes
00:00:22.624	Downloaded 15887266 bytes
00:00:24.029	Downloaded 16937026 bytes
00:00:25.434	Downloaded 17986786 bytes
00:00:26.864	Downloaded 19036546 bytes
00:00:28.269	Downloaded 20086306 bytes
00:00:29.155	Donwload finished. Total 20748168 bytes.
00:00:29.156	Local file name: '/home/pcingola//snpEff/data/WBcel215.69/snpEffectPredictor.bin'
00:00:29.156	Extracting file 'data/WBcel215.69/snpEffectPredictor.bin' to '/home/pcingola//snpEff/data/WBcel215.69/snpEffectPredictor.bin'
00:00:29.157	Creating local directory: '/home/pcingola/snpEff/data/WBcel215.69'
00:00:29.424	Unzip: OK
00:00:29.424	Done

SnpEff dump

Dumps the contents of a database as plain text, as a BED file, or as a tab-separated TXT file (which can be loaded into R).

BED file example:

$ java -jar snpEff.jar download -v GRCh37.70
$ java -Xmx4g -jar snpEff.jar dump -v -bed GRCh37.70 > GRCh37.70.bed
00:00:00.000	Reading database for genome 'GRCh37.70' (this might take a while)
00:00:32.476	done
00:00:32.477	Building interval forest
00:00:45.928	Done.
The output file looks like a typical BED file (chr \t start \t end \t name).
Keep in mind that BED coordinates are zero-based, half-open intervals. So a 2-base interval at (one-based) positions 100 and 101 is expressed as the BED interval [99 - 101].
$ head GRCh37.70.bed
1	0	249250621	Chromosome_1
1	111833483	111863188	Gene_ENSG00000134216
1	111853089	111863002	Transcript_ENST00000489524
1	111861741	111861861	Cds_CDS_1_111861742_111861861
1	111861948	111862090	Cds_CDS_1_111861949_111862090
1	111860607	111860731	Cds_CDS_1_111860608_111860731
1	111861114	111861300	Cds_CDS_1_111861115_111861300
1	111860305	111860427	Cds_CDS_1_111860306_111860427
1	111862834	111863002	Cds_CDS_1_111862835_111863002
1	111853089	111853114	Utr5prime_exon_1_111853090_111853114
TXT file example:
$ java -Xmx4g -jar snpEff.jar dump -v -txt GRCh37.70 > GRCh37.70.txt
00:00:00.000	Reading database for genome 'GRCh37.70' (this might take a while)
00:00:31.961	done
00:00:31.962	Building interval forest
00:00:45.467	Done.
In this case, the output is a tab-separated file with one marker per line and several annotation columns:
$ head GRCh37.70.txt
chr start       end        strand  type         id                          geneName  geneId            numberOfTranscripts  canonicalTranscriptLength  transcriptId     cdsLength  numerOfExons  exonRank  exonSpliceType
1   1           249250622  +1      Chromosome   1                                                                                                       
1   111833484   111863189  +1      Gene         ENSG00000134216             CHIA      ENSG00000134216   10                   1431                                                                  
1   111853090   111863003  +1      Transcript   ENST00000489524             CHIA      ENSG00000134216   10                   1431                       ENST00000489524  862        9                     
1   111861742   111861862  +1      Cds          CDS_1_111861742_111861861   CHIA      ENSG00000134216   10                   1431                       ENST00000489524  862        9                     
1   111861949   111862091  +1      Cds          CDS_1_111861949_111862090   CHIA      ENSG00000134216   10                   1431                       ENST00000489524  862        9                     
1   111853090   111853115  +1      Utr5prime    exon_1_111853090_111853114  CHIA      ENSG00000134216   10                   1431                       ENST00000489524  862        9             1         ALTTENATIVE_3SS
1   111854311   111854341  +1      Utr5prime    exon_1_111854311_111854340  CHIA      ENSG00000134216   10                   1431                       ENST00000489524  862        9             2         SKIPPED
1   111860608   111860732  +1      Exon         exon_1_111860608_111860731  CHIA      ENSG00000134216   10                   1431                       ENST00000489524  862        9             5         RETAINED
1   111853090   111853115  +1      Exon         exon_1_111853090_111853114  CHIA      ENSG00000134216   10                   1431                       ENST00000489524  862        9             1         ALTTENATIVE_3SS
1   111861742   111861862  +1      Exon         exon_1_111861742_111861861  CHIA      ENSG00000134216   10                   1431                       ENST00000489524  862        9             7         RETAINED
The format is:
Column Meaning
chr Chromosome name
start Marker start (one-based coordinate)
end Marker end (one-based coordinate)
strand Strand (positive or negative)
type Type of marker (e.g. exon, transcript, etc.)
id ID. E.g. if it's a Gene, then it may be ENSEMBL's gene ID
geneName Gene name, if marker is within a gene (exon, transcript, UTR, etc.), empty otherwise (e.g. intergenic)
geneId Gene ID, if marker is within a gene
numberOfTranscripts Number of transcripts in the gene
canonicalTranscriptLength CDS length of canonical transcript.
transcriptId Transcript ID, if marker is within a transcript
cdsLength CDS length of the transcript
numerOfExons Number of exons in this transcript
exonRank Exon rank, if marker is an exon
exonSpliceType Exon splice type, if marker is an exon
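
Since the TXT dump is tab-separated, it can be filtered with standard tools. A small sketch that keeps only the exon markers (column 5 is the 'type' field):

java -Xmx4g -jar snpEff.jar dump -v -txt GRCh37.70 | awk -F '\t' '$5 == "Exon"' > GRCh37.70.exons.txt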

SnpEff genes2bed

Dumps a selected set of genes as BED intervals.

The functionality of this command is a subset of SnpEff dump, so it is likely to be deprecated in the future.

Example:

$ java -Xmx4g -jar snpEff.jar genes2bed GRCh37.66 DDX11L1 WASH7P
#chr	start	end	geneName;geneId
1	11868	14411	DDX11L1;ENSG00000223972
1	14362	29805	WASH7P;ENSG00000227232

SnpEff cds & SnpEff protein

These commands perform SnpEff database sanity checks. They calculate CDS and protein sequences from a SnpEff database and then compare the results to a FASTA file (having the "correct" sequences).
The commands are invoked automatically when building databases, so there is no need for the user to invoke them manually.

SnpEff len

Calculates the genomic length of every type of marker (Gene, Exon, Utr, etc.). Length is calculated by overlapping all markers and counting the number of bases (e.g. a base is counted as 'Exon' if any exon falls onto that base). This command also reports the probability of a Binomial model.

Parameter -r num adjusts the model for a read length of 'num' bases. That is, if two markers of the same type are closer than 'num' bases, it joins them by including the bases separating them.

E.g.:

$ java -Xmx1g -jar snpEff.jar len -r 100 BDGP5.69
marker                   size    count     raw_size raw_count    binomial_p
Cds                  22767006    56955     45406378    122117    0.13492635563570918
Chromosome          168736537       15    168736537        15    1.0
Downstream           49570138     5373    254095562     50830    0.29377240330587084
Exon                 31275946    61419     63230008    138474    0.18535372691689175
Gene                 82599213    11659     87017182     15222    0.4895158717166277
Intergenic           56792611    11637     56792611     11650    0.3365756581812509
Intron               55813748    42701    168836797    113059    0.33077452573297744
SpliceSiteAcceptor      97977    48983       226118    113059    5.806507691929223E-4
SpliceSiteDonor        101996    50981       226118    113059    6.044689657225808E-4
Transcript           82599213    11659    232066805     25415    0.4895158717166277
Upstream             52874082     5658    254044876     50830    0.3133528928592389
Utr3prime             5264120    13087     10828991     24324    0.031197274126824114
Utr5prime             3729197    19324      6368070     33755    0.02210070839607192

Column meaning:
  • marker : Type of marker interval
  • size : Size of all intervals in the genome, after overlap and join.
  • count : Number of intervals in the genome, after overlap and join.
  • raw_size : Size of all intervals in the genome. Note that this could be larger than the genome.
  • raw_count : Number of intervals in the genome.
  • binomial_p : Probability that a random read hits this type of marker (Binomial model).

Scripts

Several scripts are provided in the scripts directory. Here we briefly describe their functionality:

Script Functionality
sam2fastq.pl Convert a SAM input to a FASTQ output. Example:
samtools view test.bam | ./scripts/sam2fastq.pl | head -n 12
@HWI-ST1220:76:D12CHACXX:7:2207:18986:95756
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATT
+
CCCFFFFFGHHHHIIJIJJIJIJJIJJIIIJIIHIJJJIJJIJJJJJJIJJ
@HWI-ST1220:76:D12CHACXX:7:2206:4721:162268
ATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATT
+
@@@DDD>DAB;?DGGEGGBCD>BFGI?FCFFBFGG@ 
fasta2tab.pl Convert a FASTA file to a two-column tab-separated TXT file (name \t sequence). Example (output truncated for brevity):
$ zcat ce6.fa.gz | ./scripts/fasta2tab.pl 
chrI    gcctaagcctaagcctaagcctaagcctaagcctaagcctaagcct...
chrV    GAATTcctaagcctaagcctaagcctaagcctaagcctaagcctaa...
chrII   cctaagcctaagcctaagcctaagcctaagcctaagcctaagccta...
chrM    CAGTAAATAGTTTAATAAAAATATAGCATTTGGGTTGCTAAGATAT...
chrX    ctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaa...
chrIV   cctaagcctaagcctaagcctaagcctaagcctaagcctaagccta...
chrIII  cctaagcctaagcctaagcctaagcctaagcctaagcctaagccta...
fastaSplit.pl Split a multiple-sequence FASTA file into individual files. Example: create one file per chromosome:
$ zcat ce6.fa.gz | ./scripts/fastaSplit.pl
Writing to chrI.fa
Writing to chrV.fa
Writing to chrII.fa
Writing to chrM.fa
Writing to chrX.fa
Writing to chrIV.fa
Writing to chrIII.fa
hist.pl Given a list of numbers (one per line), shows a histogram. Note: It requires R.
Example: Extract the file sizes in a directory and show a histogram
$ ls -al scripts/ | tr -s " " | cut -f 5 -d " " | ./scripts/hist.pl 
This creates a histogram plot of the input values.
maPlot.pl
plot.pl
qqplot.pl
smoothScatter.pl
Similar to 'hist.pl', these create plots from numbers read on STDIN. Note that in some cases the inputs are expected to be probabilities (qqplot.pl) or pairs of numbers (maPlot.pl).
$ ls -al scripts/ | tr -s " " | cut -f 5 -d " " | ./scripts/plot.pl
This creates the corresponding plot.
queue.pl Process a list of statements in parallel according to the number of CPUs available in the local machine
splitChr.pl Splits a file by chromosome. Works on any tab-separated file whose first column is a chromosome field (e.g. BED, VCF, etc.)
Example:
$ cat large_test.vcf | ./scripts/splitChr.pl 
Input line 28. Creating file 'chr1.txt'
Input line 13332. Creating file 'chr2.txt'
Input line 22097. Creating file 'chr3.txt'
Input line 29289. Creating file 'chr4.txt'
Input line 34236. Creating file 'chr5.txt'
Input line 39899. Creating file 'chr6.txt'
Input line 47120. Creating file 'chr7.txt'
Input line 53371. Creating file 'chr8.txt'
Input line 57810. Creating file 'chr9.txt'
Input line 63005. Creating file 'chr10.txt'
Input line 68080. Creating file 'chr11.txt'
Input line 76629. Creating file 'chr12.txt'
Input line 83071. Creating file 'chr13.txt'
Input line 85124. Creating file 'chr14.txt'
Input line 89281. Creating file 'chr15.txt'
Input line 93215. Creating file 'chr16.txt'
Input line 99081. Creating file 'chr17.txt'
Input line 106405. Creating file 'chr18.txt'
Input line 108330. Creating file 'chr19.txt'
Input line 118568. Creating file 'chr20.txt'
Input line 121795. Creating file 'chr21.txt'
Input line 123428. Creating file 'chr22.txt'
Input line 126520. Creating file 'chrX.txt'
Input line 129094. Creating file 'chrY.txt'
Input line 129113. Creating file 'chrMT.txt'
uniqCount.pl Count the number of unique lines. It's the same as doing cat lines.txt | sort | uniq -c, but much faster. Particularly useful for very large inputs.
vcfEffOnePerLine.pl Splits EFF fields in a VCF file, creating multiple lines, each one with only one effect.
Very useful for filtering with SnpSift.
Example:
$ cat test.stop.vcf 
1	897062	.	C	T	100.0	PASS	AC=1;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q141*|642|KLHL17||CODING|NM_198317|),UPSTREAM(MODIFIER||||576|PLEKHN1||CODING|NM_001160184|),UPSTREAM(MODIFIER||||611|PLEKHN1||CODING|NM_032129|),UPSTREAM(MODIFIER||||749|NOC2L||CODING|NM_015658|)

$ cat test.stop.vcf | ./scripts/vcfEffOnePerLine.pl 
1	897062	.	C	T	100.0	PASS	AC=1;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q141*|642|KLHL17||CODING|NM_198317|)
1	897062	.	C	T	100.0	PASS	AC=1;EFF=UPSTREAM(MODIFIER||||576|PLEKHN1||CODING|NM_001160184|)
1	897062	.	C	T	100.0	PASS	AC=1;EFF=UPSTREAM(MODIFIER||||611|PLEKHN1||CODING|NM_032129|)
1	897062	.	C	T	100.0	PASS	AC=1;EFF=UPSTREAM(MODIFIER||||749|NOC2L||CODING|NM_015658|)