snpEff: SNP effect predictor

Regulatory annotations

There are two ways to add support for regulatory annotations (they are not mutually exclusive; you can use both at the same time):
  1. GFF regulation file (usually from ENSEMBL).
  2. BED files.
WARNING: Adding regulation support and analyzing data using regulation tracks can take much more memory. For instance, for the human genome I use 10 GB to 20 GB of RAM.

WARNING: It is assumed that the genome is already installed; only regulatory tracks are added.


Option 1: Using an annotation GFF

This example shows how to create a regulation database for human (GRCh37.65):
  1. Get the GFF regulatory annotations (into path/to/snpEff/data/GRCh37.65/regulation.gff):
    			cd path/to/snpEff/data/GRCh37.65
    			wget ftp://ftp.ensembl.org/pub/release-65/regulation/homo_sapiens/AnnotatedFeatures.gff.gz
    			mv AnnotatedFeatures.gff.gz regulation.gff.gz

  2. Create databases (note the "-onlyReg" flag):
    			cd /path/to/snpEff
    			java -Xmx20G -jar snpEff.jar build -v -onlyReg GRCh37.65 

    The output looks like this:

    Reading regulation elements (GFF)
        Chromosome '11'	line: 226964
        Chromosome '12'	line: 493780
        ...
        Chromosome '9'	line: 4832434
        Chromosome 'X'  line: 5054301
        Chromosome 'Y'  line: 5166958
    Done
        Total lines                 : 5176289
        Total annotation count      : 3961432
        Percent                     : 76.5%
        Total annotated length      : 3648200193
        Number of cell/annotations  : 266
    Saving database 'HeLa-S3' in file '/path/to/snpEff/data/GRCh37.65/regulation_HeLa-S3.bin'
    Saving database 'HepG2' in file '/path/to/snpEff/data/GRCh37.65/regulation_HepG2.bin'
    Saving database 'NHEK' in file '/path/to/snpEff/data/GRCh37.65/regulation_NHEK.bin'
    Saving database 'GM12878' in file '/path/to/snpEff/data/GRCh37.65/regulation_GM12878.bin'
    Saving database 'HUVEC' in file '/path/to/snpEff/data/GRCh37.65/regulation_HUVEC.bin'
    Saving database 'H1ESC' in file '/path/to/snpEff/data/GRCh37.65/regulation_H1ESC.bin'
    Saving database 'CD4' in file '/path/to/snpEff/data/GRCh37.65/regulation_CD4.bin'
    Saving database 'GM06990' in file '/path/to/snpEff/data/GRCh37.65/regulation_GM06990.bin'
    Saving database 'IMR90' in file '/path/to/snpEff/data/GRCh37.65/regulation_IMR90.bin'
    Saving database 'K562' in file '/path/to/snpEff/data/GRCh37.65/regulation_K562.bin'
    Done.
    		
    As you can see, annotations for each cell type are saved in different files. This makes it easier to load annotations only for the desired cell types when analyzing data.
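
    To see which cell types ended up with a database, the file names themselves can be listed. This is a minimal sketch, assuming the 'regulation_CELL_TYPE.bin' naming shown in the output above; the function name list_reg_cell_types is hypothetical and is not part of snpEff:

```shell
# List cell types that have a regulation database in a genome data directory.
# Assumes files are named regulation_CELL_TYPE.bin, as in the build output.
# list_reg_cell_types is a hypothetical helper, not a snpEff command.
list_reg_cell_types() {
    dir="$1"
    for f in "$dir"/regulation_*.bin; do
        [ -e "$f" ] || continue        # skip the literal pattern when nothing matches
        name=$(basename "$f" .bin)     # e.g. regulation_HeLa-S3
        echo "${name#regulation_}"     # e.g. HeLa-S3
    done
}

# Example: list_reg_cell_types path/to/snpEff/data/GRCh37.65
```

    Each printed name corresponds to one 'regulation_*.bin' file, so the list reflects exactly the cell types available for that genome.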


Option 2: Using a BED file

This example shows how to create a regulation database for human (GRCh37.65). We assume we have a file called "my_regulation.bed" that contains information for H3K9me3 in Pancreatic Islets (for instance, as a result of a ChIP-Seq experiment and peak enrichment analysis).
  1. Add all your BED files to the 'path/to/snpEff/data/GRCh37.65/regulation.bed/' directory:
    			cd path/to/snpEff/data/GRCh37.65
    			mkdir regulation.bed
    			cd regulation.bed
    			mv where/ever/your/bed/file/is/my_regulation.bed ./regulation.Pancreatic_Islets.H3K9me3.bed

    Note: The name of the file must be 'regulation.CELL_TYPE.ANNOTATION_TYPE.bed'. In this case, 'CELL_TYPE=Pancreatic_Islets' and 'ANNOTATION_TYPE=H3K9me3'.

  2. Create databases (note the "-onlyReg" flag):
    			cd /path/to/snpEff
    			java -Xmx20G -jar snpEff.jar build -v -onlyReg GRCh37.65 

    The output looks like this:

    Building database for 'GRCh37.65'
    Reading regulation elements (GFF)
    Cannot read regulation elements form file '/path/to/snpEff/data/GRCh37.65/regulation.gff'
    Directory has 1 bed files and 1 cell types
    Creating consensus for cellType 'Pancreatic_Islets', files: [/path/to/snpEff/data/GRCh37.65/regulation.bed/regulation.Pancreatic_Islets.H3K9me3.bed]
    Reading file '/path/to/snpEff/data/GRCh37.65/regulation.bed/regulation.Pancreatic_Islets.H3K9me3.bed'
    	Chromosome '10'	line: 5143
    	Chromosome '11'	line: 8521
    	...
    	Chromosome 'X'	line: 52481
    	Chromosome 'Y'	line: 53340
    Done
    	Total lines                 : 53551
    	Total annotation count      : 53573
    	Percent                     : 100.0%
    	Total annotated length      : 75489402
    	Number of cell/annotations  : 1
    Creating consensus for cell type: Pancreatic_Islets
    Sorting: Pancreatic_Islets	, size: 53573
    Adding to final consensus
    Final consensus for cell type: Pancreatic_Islets	, size: 53549
    Saving database 'Pancreatic_Islets' in file '/path/to/snpEff/data/GRCh37.65/regulation_Pancreatic_Islets.bin'
    Done
    Finishing up
    		
    Note: If there are many annotations, they are saved in one binary file for each cell type (i.e. several BED files for the same cell type are collapsed together). This makes it easier to load annotations only for the desired cell types when analyzing data.
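
    Because the build relies on the 'regulation.CELL_TYPE.ANNOTATION_TYPE.bed' naming convention, it can help to check file names before building. The following is a minimal sketch, not snpEff's own parsing: the function name parse_reg_bed_name is hypothetical, and it assumes the cell type is the text before the first dot after 'regulation.' (so cell type names should not contain dots):

```shell
# Split a BED file name of the form regulation.CELL_TYPE.ANNOTATION_TYPE.bed
# into its cell type and annotation type.
# Prints "CELL_TYPE ANNOTATION_TYPE", or returns 1 if the name does not match.
# parse_reg_bed_name is a hypothetical helper, not a snpEff command.
parse_reg_bed_name() {
    base=$(basename "$1")
    case "$base" in
        regulation.*.*.bed) ;;          # looks like the expected convention
        *) return 1 ;;                  # anything else is rejected
    esac
    middle=${base#regulation.}          # CELL_TYPE.ANNOTATION_TYPE.bed
    middle=${middle%.bed}               # CELL_TYPE.ANNOTATION_TYPE
    cell=${middle%%.*}                  # assumption: text before the first dot
    annot=${middle#*.}                  # everything after the first dot
    [ -n "$cell" ] && [ -n "$annot" ] || return 1
    echo "$cell $annot"
}

# Example: report BED files in a directory that do not follow the convention.
# for f in path/to/snpEff/data/GRCh37.65/regulation.bed/*.bed; do
#     parse_reg_bed_name "$f" >/dev/null || echo "Bad name: $f"
# done
```

    With the file from this example, 'regulation.Pancreatic_Islets.H3K9me3.bed' yields 'Pancreatic_Islets H3K9me3', while the original name 'my_regulation.bed' is rejected.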

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Author: Pablo Cingolani (pcingola@users.sourceforge.net)