ChIP-Seq Analysis Server

SGA Format Specifications

A ChIP-seq experiment produces a set of tag positions mapped to the genome. The ChIP-Seq tools store these in their own compact format, SGA (Simplified Genome Annotation), which is a simplified GFF format. SGA is a line-oriented, tab-delimited format with one record per line and the following five obligatory fields:

  • Sequence name (Char String)
  • Feature (Char String)
  • Sequence Position (Integer)
  • Strand (+/- or 0)
  • Tag Counts (Integer)

Any number of additional fields containing application-specific information may be appended after these five.
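
As an illustration of the field layout described above, here is a minimal Python sketch that splits one SGA line into its five obligatory fields plus any extra fields. The names `SgaRecord` and `parse_sga_line` are hypothetical and not part of the ChIP-Seq tools, and the sample line in the usage note is invented for illustration.

```python
from typing import List, NamedTuple


class SgaRecord(NamedTuple):
    """One SGA line: five obligatory fields plus optional extras."""
    seq_name: str      # sequence name
    feature: str       # feature name
    position: int      # sequence position
    strand: str        # "+", "-" or "0" (no orientation)
    count: int         # tag count
    extra: List[str]   # any application-specific fields


def parse_sga_line(line: str) -> SgaRecord:
    """Parse a single tab-delimited SGA line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    seq_name, feature, position, strand, count = fields[:5]
    if strand not in ("+", "-", "0"):
        raise ValueError(f"invalid strand value: {strand!r}")
    # int() raises ValueError if position or count is not an integer
    return SgaRecord(seq_name, feature, int(position), strand,
                     int(count), fields[5:])
```

For example, `parse_sga_line("chr1\tCTCF\t1000\t+\t1")` would yield a record with `position == 1000`, `strand == "+"` and an empty `extra` list.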

ChIP-seq programs use SGA files for both input and output. The SGA format differs from similar formats such as BED or GFF in one very important respect: the file must be sorted by sequence name, genomic position, and strand, according to the following rule:

setenv LANG C; sort -s -k1,1 -k3,3n -k4,4
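
The ordering produced by the command above can be approximated in Python as follows. This is a sketch, not the tools' own code: it assumes every line is tab-delimited with an integer in the third field, and the function names are hypothetical. Comparing the encoded bytes mimics the byte-wise collation of the C locale, in which the strand values order as `+`, `-`, `0`, and Python's sort is stable, like `sort -s`.

```python
from typing import Iterable, List, Tuple


def sga_sort_key(line: str) -> Tuple[bytes, int, bytes]:
    """Key mirroring -k1,1 (bytes), -k3,3n (numeric), -k4,4 (bytes)."""
    fields = line.rstrip("\n").split("\t")
    return (fields[0].encode(), int(fields[2]), fields[3].encode())


def sort_sga_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines in SGA order; ties keep their input order."""
    return sorted(lines, key=sga_sort_key)


def is_sga_sorted(lines: Iterable[str]) -> bool:
    """Check whether the lines already follow the SGA ordering."""
    keys = [sga_sort_key(line) for line in lines]
    return all(a <= b for a, b in zip(keys, keys[1:]))
```

Note that sequence names compare as plain byte strings, so under this rule `chr10` sorts before `chr2`.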

The SGA format can also be used to represent other genome annotations, e.g. the locations of transcription start sites (TSS) or matches to consensus sequences. Features without an orientation are given a strand value of 0.
In a data analysis pipeline, the SGA file is typically generated from a richer format, such as Eland from the Illumina Solexa pipeline, BED (Browser Extensible Data), or BAM (Binary Alignment/Map).
We support data upload in SGA, BED, GFF, BAM and FPS formats.
FPS (Functional Position Set) is the specific format used by the Signal Search Analysis server at SIB (SSA).
SGA may also be converted at output into several formats, in particular BED or WIG (Wiggle Track Format) for data visualization, and FPS for sequence extraction. WIG and BED files are used to view ChIP-seq data or analysis results in the UCSC genome browser.
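
To give a rough idea of what an SGA to BED conversion involves, the sketch below turns one SGA line into a six-column BED line. It is not the server's converter and rests on several assumptions: that SGA positions are 1-based, that the sequence name is already a chromosome name the genome browser accepts, that each feature becomes a one-base interval, and that the tag count can be used as the BED score (capped at 1000, the maximum BED allows). The function name `sga_to_bed_line` is hypothetical.

```python
def sga_to_bed_line(sga_line: str) -> str:
    """Convert one SGA line to a 6-column BED line (illustrative only)."""
    fields = sga_line.rstrip("\n").split("\t")
    seq_name, feature, position, strand, count = fields[:5]
    pos = int(position)
    # BED intervals are 0-based and half-open; assuming a 1-based SGA
    # position, the single base at `pos` becomes [pos - 1, pos).
    chrom_start, chrom_end = pos - 1, pos
    # BED uses "." for features without orientation (SGA strand "0").
    bed_strand = "." if strand == "0" else strand
    # Assumption: reuse the tag count as the score, capped at 1000.
    score = min(int(count), 1000)
    return "\t".join([seq_name, str(chrom_start), str(chrom_end),
                      feature, str(score), bed_strand])
```

Under these assumptions, the SGA line `chr1`, `TSS`, `1000`, `0`, `1` maps to the BED line `chr1`, `999`, `1000`, `TSS`, `1`, `.` (fields tab-separated in both cases).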

Last update September 2021