See all news
2019-04-24New data sets for H. sapiens, and G. gallus.   [showhide]

The MGA Data Repository

The Mass Genome Annotation (MGA) Data Repository stores published next generation sequencing data and other genome annotation data (such as gene start sites, SNPs, etc.) that, in conjunction with the ChIP-Seq and SSA servers, can be accessed and studied by scientists. The main characteristic of the MGA database is to store mapped data (in the form of genomic coordinates of mapped reads) and not sequence files. In this way, each sample present in the database has been pre-processed (for example sequence reads has been mapped to a genome) and presented in a standardized text format named SGA (Simple Genome Annotation).

How to cite:
R. Dreos, G. Ambrosini, R. Groux, R. Cavin Perier, P. Bucher; MGA repository: a curated data resource for ChIP-seq and other genome annotated data, Nucleic Acids Research, gkx995, https://doi.org/10.1093/nar/gkx995

Access to the database

Access to the database can be done in various ways:
  • Searching for keywords in the MGA-Search page. Links to documentation, relevant publication and analysis tools help in the study and interpretation of published data.
  • Via the MGA Data Overview page browsing through all series and samples.
  • Via the FTP site for data download in SGA format.
  • Through menus in all input pages of the ChIP-Seq and SSA servers.

Data export and format conversion

The native file format at the back end of the repository is SGA and can be accessed via the FTP server. Users interested in using MGA data with other tools that do not support SGA format can easily convert SGA formatted data to BED by: Technical information about SGA file format and conversion rules can be found here.

Database content

The MGA repository contains the following number of samples (stratified by organism and data type):

Data Type

Human

Mouse

Rat

Rhesus Macaque

Dog

Chicken

Zebra fish

Bee

Fruit Fly

Water Flea

Worm

Baker's Yeast

Fission Yeast

Arabidopsis

Corn

Malaria Parasite

Total

ChIP-seq

8248

758

4

5

11

14

34

-

514

18

198

527

405

212

12

52

11012

ChIP-seq-invitro

-

-

-

-

-

-

-

-

-

-

-

-

931

-

-

-

931

ChIP-seq-peak

8206

28

-

-

-

-

-

-

-

-

-

-

-

-

-

-

8234

Transcription Profiling

2431

1352

13

15

12

33

12

16

371

11

19

22

16

13

8

13

4357

DNase FAIRE etc.

1434

42

-

-

-

-

4

-

68

-

6

58

8

9

3

12

1644

DNA methylation

24

4

-

-

-

-

-

-

-

-

-

-

-

-

-

-

28

Genome annotation

32

23

2

2

2

15

6

2

16

4

18

4

5

5

3

3

179

Sequence-derived

3617

2315

-

-

-

1

14

9

1240

-

9

9

9

1531

9

-

8764

Total # of Samples

27051

4535

19

22

25

63

70

27

2209

33

250

620

443

2701

35

15

38185


Data types are the following:
  • ChIP-seq: raw data (reads mapping coordinates) from classical ChIP-seq experiments targeting transcription factors, protein-DNA interaction, histone variants and modifications, etc.
  • ChIP-seq-invitro: raw data (reads mapping coordinates) from in-vitro ChIP-seq experiments such ad DAP-seq.
  • ChIP-seq-peak: peak regions provided by the authors of the data
  • Transcript Profiling: raw data from experiments aimed at profiling transcripts initiation such as CAGE, GRO-cap, GRO-seq, PEAT, etc.
  • DNase FAIRE etc.: raw data from chromatin and chromatin accessibility studies such as MNase-seq, DNase-seq, DNase-hypersensitivity, etc.
  • DNA methylation: raw data from methylation studies.
  • Genome Annotation: transcription start sites, transcription end sites, intron-exon boundaries
  • Sequence derived: PWM matches, Natural Variants, Conservation scores, etc.

The list of series present in the database can be found in the MGA Data Overview page.

Sample name conventions

Samples names in MGA contain useful information about the samples' biological and technical variables. For example, the sample '* S2|PolII|80mMsalt|control' contains some information that can be summarised in the figure below:
MGA naming conventions
Sample names are divided into multiple sections separated by pipes ('|') or sometimes by dash lines ('-'). Each section is devoted to store information about one important sample variable:
  1. Cell type: the cell in which the sample experiment was carried out. This can refer to a cell line (for example GM12878), a developmental stage (as in the example of a S2 cell in D. melanogaster) or a mutant strain (for example 'WT', for wild type cells, or 'anchor-away Abf1', for cell depleted of Abf1 TF).
  2. Target: target protein that is the focus of the sample. Examples are transcription factors (CTCF, YY1, etc.), DNA-interacting proteins ('PolII', histones, etc.), histone modifications and variants (H3K4me3, H2A.Z, etc.).
  3. Conditions: important conditions in which the experiment was performed and that characterise one or more samples. Examples can be specific growing media or time points during a time course experiment. Note that this field does not list growing conditions that are common to all samples in the series.
  4. Additional Info: other information that characterise the samples such as replica number
  5. Star: the star symbol ('*') at the beginning of the name indicates that this sample has unoriented features. This is often the case for samples containing peak lists (a peak in the genome is unoriented by definition) or samples derived from paired-end sequencing (the fragment defined by the two paired reads does not have a preferred orientation in the genome).
Note that the first two fields are always present in the sample name, whereas the others can be missing if non-relevant.
Last update September 2021