*** INSEGT - INtersecting SEcond Generation sequencing daTa with annotation ***

---------------------------------------------------------------------------
Table of Contents
---------------------------------------------------------------------------
  1.   Overview
  2.   Installation
  3.   Usage
  4.   Input Format
  5.   Output Format
  6.   Contact

---------------------------------------------------------------------------
1. Overview
---------------------------------------------------------------------------

INSEGT is a tool to analyse alignments of RNA-Seq reads (single-end or 
paired-end) by using gene-annotations.
It can measure exon-, transcript- and gene-expression levels of given 
annotations. If read-alignments span more than one exon, INSEGT can also 
compute possible exon-combinations (tuple) and their expression levels.
According to requirements it computes maximal tuple or tuple with a certain 
length. For paired-end-reads INSEGT builds also tuple, whose exons are 
connected by matepairs.
Briefly it searches the intervals of the read-alignments in the intervals 
of the given annotations. It counts the mapped reads for each annotation and 
its parent-annotation and stores the mapped annotations for each read.
INSEGT produces tree output files:
The first contains the expression levels of the given annotions.
The second contains for each given read the mapped annotations.
If the read mapped in not annotated regions, an UNKNOWN_REGION in the output will mark
this.
And the third file contains exon-tuple and their expression levels.  
(For further informations see: 
http://www.inf.fu-berlin.de/en/groups/abi/theses/bachelor/krakau/bsc_thesis_krakau.pdf)

---------------------------------------------------------------------------
2. Installation
---------------------------------------------------------------------------

INSEGT is distributed with SeqAn - The C++ Sequence Analysis Library (see 
http://www.seqan.de). To build INSEGT do the following:

  1)  Download the latest snapshot of SeqAn
  2)  Unzip it to a directory of your choice (e.g. snapshot)
  3)  cd snapshot/apps
  4)  make insegt
  5)  cd insegt

Alternatively you can check out the latest SVN version of INSEGT and SeqAn
with:

  1)  svn co http://svn.mi.fu-berlin.de/seqan/trunk/seqan
  2)  cd seqan
  3)  make forwards
  4)  cd projects/library/apps
  5)  make insegt
  6)  cd insegt

---------------------------------------------------------------------------
3. Usage
---------------------------------------------------------------------------

Usage: insegt [OPTION]... -s <ALIGNMENTS FILE> -g <ANNOTATIONS FILE> 

INSEGT expects at least three files:
A SAM file which contains the alignments of the reads and a GFF file which 
contains the gene-annotations. 
Without any additional parameters INSEGT would compute all tuple up to the 
length 2. That means that it builds only 2-tuple, whose exons are connected 
by matepairs.  
The default behaviour can be modified by adding the following options to 
the command line:

---------------------------------------------------------------------------
3.1. Options
---------------------------------------------------------------------------

  [ -p PATH ],  [ --outputPath PATH ]

  Path to the directory, where the output files should be stored. 
  The default value is the insegt-directory.

  [ -n NUM ],  [ --nTuple NUM ]

  Tuple length parameter. By default it is 2.	

  [ -o NUM ],  [ --offsetInterval NUM ]

  The offset-parameter to short alignment-intervals to search them in the
  given annotation-regions. The default value is 5.

  [ -t NUM ],  [ --thresholdGaps NUM ]

  The number of gaps, which are allowed in the read-alignments, for which 
  the gaps are not interpreted as introns. The default value is 5.

  [ -c NUM ],  [ --thresholdCount NUM ]

  Threshold for min. count of tuple, for which the tuple are given out.
  The default value is 1. 
  
  [ -r NUM ],  [ --thresholdRPKM NUM ]
  
  Threshold for min. RPKM-value of tuple, for which the tuple are given out.
  The default value is 0.0.
  
  [ -m],  [ --maxTuple ]

  Only maximal tuple are computed, which are spanned by the whole read. 

  [ -e ],  [ --exact_nTuple ]
  
  Only tuple of the given length are computed. By default all tuple up to 
  the given length are computed. (If -m is set, -e will be ignored) 
  
  [ -u ],  [ --unknown_orientation ]
  
  Read-orientations are unknown. The alignment-intervals are searched in the 
  annotations of both strands. By default they're just searched in the 
  annotations corresponding to the given orientation of the read.

  [ -f],   [ --fusion_genes ]

  Check for fusion genes. A separate output for matepair tupel, which map in 
  different genes, is created. By default there is no check for fusion genes.

  [ -z],   [ --gtf ]

  The gene-annotations are given in a GTF file (instead of GFF).
  By default INSEGT expects a GFF file.

---------------------------------------------------------------------------
4. Input General Feature Format
---------------------------------------------------------------------------

INSEGT expects a GFF file which contains the gene-annotations.

The General Feature Format (GFF) is specified by the Sanger Institute as a tab-
delimited text format with the following columns:

<seqname> <src> <feat> <start> <end> <score> <strand> <frame> [attr] [cmts]

See also: http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml

INSEGT requires additional specifications for the input of the exon- and genenames.

Input columns:
(Irrelevant fields are represented as "<>", their contents doesn't matter)

<Sequenzname> < > < > <Start> <Ende> < > <Strang> < > <ID;ParentID> 

Example:

 chr1          .   .   200     1000   .   +        .   ID=ENSG1
 chr1          .   .   300     450    .   +        .   ID=ENSG1-1;ParentID=ENSG1
 chr1          .   .   700     600    .   -        .   ID=ENSG2-1;ParentID=ENSG2

The order of the row-entries doesn't matter.
If no entry exists for the parent of an exon, INSEGT generates it automatically.
Moreover parent-entries don't need start- and end-positions.


---------------------------------------------------------------------------
5. Output Formats
---------------------------------------------------------------------------

INSEGT output-files are GFF files. The first output-file contains the annotations 
similar to the GFF-input and additionally the counts of the mapped 
reads and the normalized expression levels in RPKM (reads per kilobase of 
exon model per million mapped reads).
The two other output-files have different specifications.


---------------------------------------------------------------------------
5.1. Output for annotations (annoOutput.gff):
---------------------------------------------------------------------------

Output columns:

<sequencename> < > < > <start> <end> <count> <strand> < > \
    <ID=annotationname;[ParentID=parentname;]expression-level (RPKM)> 

Example:

 chr1           .   .   30      .     60      +        .   ID=P1;30.0;
 chr1           .   .   75      300   20      +        .   ID=E1;ParentID=P1;25.0;  
 

---------------------------------------------------------------------------
5.2. Output for reads (readAnnoOutput.gff):
---------------------------------------------------------------------------

The read-output of INSEGT contains the mapped annotations followed by their 
parent-annotation. If the read mapped in annotations of different parents,
the annotations are presented one after another corresponding to their parents.
Consecutive mapped annotations are separated by ":".
Overlapping annotations which are mapped by the same read-interval are separated by ";". 

Output columns:

<readname> <sequencename> <strand> <annotationnames> <parentname1> \
   [<annotationnames>   <parentname2> ...]

Example:

 read_1     chr1           +        E1:E2-1;E2-2:E3   P1            E10:UNKNOWN_REGION  P2   


---------------------------------------------------------------------------
5.3. Output for tuple (tupleOutput.gff):
---------------------------------------------------------------------------

In the tuple-output of INSEGT exons which are connected by a single read are 
separated by ":" and exons which are connected by a matepair are separated by "^".

Output columns:

<sequencename> <parentname> <strand> <tupel of annotationnames> <count> <expression-level (RPKM)>


Example:

 chr1           P1           +        E1:E2-1:E3^E7              50      30.0
 chr1           P1           +        E1:E2-2:E3^E7              10      11.32


---------------------------------------------------------------------------
6. Contact
---------------------------------------------------------------------------

For questions or comments, contact:
  David Weese <weese@inf.fu-berlin.de>
  Marcel H. Schulz <marcel.schulz@molgen.mpg.de>
  Sabrina Krakau <krakau@mi.fu-berlin.de>
