The MGA Data Repository

The Mass Genome Annotation (MGA) Data Repository stores published next-generation sequencing data and other genome annotation data (such as gene start sites and SNPs) that scientists can access and study in conjunction with the ChIP-Seq and SSA servers. The defining characteristic of the MGA database is that it stores mapped data (the genomic coordinates of mapped reads) rather than sequence files. Each sample in the database has therefore been pre-processed (for example, sequence reads have been mapped to a genome) and is presented in a standardized text format named SGA (Simple Genome Annotation).

How to cite:
R. Dreos, G. Ambrosini, R. Groux, R. Cavin Perier, P. Bucher; MGA repository: a curated data resource for ChIP-seq and other genome annotated data, Nucleic Acids Research, gkx995, https://doi.org/10.1093/nar/gkx995

Access to the database

The database can be accessed in several ways:

  • By searching for keywords in the MGA-Search page. Links to documentation, relevant publications and analysis tools help in the study and interpretation of published data.
  • Via the MGA Data Overview page, by browsing through all series and samples.
  • Via the FTP site, for data download in SGA format.
  • Through menus in all input pages of the ChIP-Seq and SSA servers.

Data export and format conversion

The native file format at the back end of the repository is SGA, and files in this format can be accessed via the FTP server. Users who want to use MGA data with tools that do not support SGA can easily convert SGA-formatted data to BED by:

Technical information about the SGA file format and conversion rules can be found here.
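
As an illustration of what such a conversion involves, here is a minimal Python sketch. It is not the repository's own conversion tool. It assumes that each SGA line holds at least five tab-separated fields (sequence identifier, feature name, 1-based position, strand given as '+', '-' or '0', and count) and that each SGA position becomes a one-base BED interval. The function name sga_to_bed and the example lines are hypothetical.

```python
from typing import Iterable, Iterator


def sga_to_bed(lines: Iterable[str]) -> Iterator[str]:
    """Convert SGA lines to 6-column BED lines (assumed SGA field layout)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError(f"expected at least 5 tab-separated fields: {line!r}")
        # Assumed order: sequence ID, feature, position, strand, count.
        seq_id, feature, pos, strand, count = fields[:5]
        end = int(pos)   # SGA position assumed to be 1-based
        start = end - 1  # BED intervals are 0-based and half-open
        # Anything other than '+' or '-' is written as BED's "no strand" symbol.
        bed_strand = strand if strand in ("+", "-") else "."
        yield "\t".join([seq_id, str(start), str(end), feature, count, bed_strand])


# Hypothetical SGA lines, for illustration only.
example = [
    "NC_000001.10\tCTCF\t10500\t+\t1",
    "NC_000001.10\tCTCF\t10620\t0\t3",
]
bed_lines = list(sga_to_bed(example))
for bed_line in bed_lines:
    print(bed_line)
```

The sketch copies the count into the BED score column unchanged and keeps the sequence identifiers as they are. If a downstream tool expects different chromosome names or a bounded score, an extra mapping step would be needed.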

Database content

The MGA repository contains the following number of samples (stratified by organism and data type):

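| Data Type | Human | Mouse | Rat | Rhesus Macaque | Dog | Chicken | Zebra fish | Bee | Fruit Fly | Water Flea | Worm | Baker's Yeast | Fission Yeast | Arabidopsis | Corn | Malaria Parasite | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ChIP-seq | 8248 | 758 | 4 | 5 | 11 | 14 | 34 | - | 514 | 18 | 198 | 527 | 405 | 212 | 12 | 52 | 11012 |
| ChIP-seq-invitro | - | - | - | - | - | - | - | - | - | - | - | - | 931 | - | - | - | 931 |
| ChIP-seq-peak | 8206 | 28 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 8234 |
| Transcription Profiling | 2431 | 1352 | 13 | 15 | 12 | 33 | 12 | 16 | 371 | 11 | 19 | 22 | 16 | 13 | 8 | 13 | 4357 |
| DNase FAIRE etc. | 1434 | 42 | - | - | - | - | 4 | - | 68 | - | 6 | 58 | 8 | 9 | 3 | 12 | 1644 |
| DNA methylation | 24 | 4 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 28 |
| Genome annotation | 32 | 23 | 2 | 2 | 2 | 15 | 6 | 2 | 16 | 4 | 18 | 4 | 5 | 5 | 3 | 3 | 179 |
| Sequence-derived | 3617 | 2315 | - | - | - | 1 | 14 | 9 | 1240 | - | 9 | 9 | 9 | 1531 | 9 | - | 8764 |
| Total # of Samples | 27051 | 4535 | 19 | 22 | 25 | 63 | 70 | 27 | 2209 | 33 | 250 | 620 | 443 | 2701 | 35 | 15 | 38185 |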

Data types are the following:

  • ChIP-seq: raw data (read mapping coordinates) from classical ChIP-seq experiments targeting transcription factors, protein-DNA interactions, histone variants and modifications, etc.
  • ChIP-seq-invitro: raw data (read mapping coordinates) from in-vitro ChIP-seq experiments such as DAP-seq.
  • ChIP-seq-peak: peak regions provided by the authors of the data.
  • Transcription Profiling: raw data from experiments aimed at profiling transcript initiation, such as CAGE, GRO-cap, GRO-seq, PEAT, etc.
  • DNase FAIRE etc.: raw data from chromatin and chromatin accessibility studies such as MNase-seq, DNase-seq, DNase-hypersensitivity, etc.
  • DNA methylation: raw data from methylation studies.
  • Genome annotation: transcription start sites, transcription end sites, intron-exon boundaries.
  • Sequence-derived: PWM matches, natural variants, conservation scores, etc.

The list of series present in the database can be found in the MGA Data Overview page.

Sample name conventions

Sample names in MGA contain useful information about the samples' biological and technical variables. For example, the sample name '* S2|PolII|80mMsalt|contol' encodes several pieces of information, summarised in the figure below:

[Figure: MGA naming conventions]
Sample names are divided into multiple sections separated by pipes ('|') or sometimes by dashes ('-'). Each section stores information about one important sample variable:
  1. Cell type: the cell in which the sample's experiment was carried out. This can refer to a cell line (for example GM12878), a developmental stage (as in the example of an S2 cell in D. melanogaster) or a mutant strain (for example 'WT', for wild-type cells, or 'anchor-away Abf1', for cells depleted of the Abf1 TF).
  2. Target: the target protein that is the focus of the sample. Examples are transcription factors (CTCF, YY1, etc.), DNA-interacting proteins ('PolII', histones, etc.), and histone modifications and variants (H3K4me3, H2A.Z, etc.).
  3. Conditions: important conditions in which the experiment was performed and that characterise one or more samples. Examples are specific growth media or time points in a time course experiment. Note that this field does not list growth conditions that are common to all samples in the series.
  4. Additional Info: other information that characterises the sample, such as the replicate number.
  5. Star: the star symbol ('*') at the beginning of the name indicates that this sample has unoriented features. This is often the case for samples containing peak lists (a peak in the genome is unoriented by definition) or samples derived from paired-end sequencing (the fragment defined by the two paired reads does not have a preferred orientation in the genome).
Note that the first two fields are always present in the sample name, whereas the others can be omitted if not relevant.
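
The naming convention can be illustrated with a short Python sketch that splits a sample name into its sections. It is a hypothetical helper, not part of the MGA tools. It assumes that sections are separated by pipes only (dashes also occur inside values such as 'anchor-away Abf1', so they are not treated as separators here) and that sections appear in the order listed above. When a name has only three sections, the third is treated as the conditions field, although in practice it could hold additional info instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleName:
    unoriented: bool                       # True if the name starts with '*'
    cell_type: str                         # section 1, always present
    target: str                            # section 2, always present
    conditions: Optional[str] = None       # section 3, may be missing
    additional_info: Optional[str] = None  # section 4 onwards, may be missing


def parse_sample_name(name: str) -> SampleName:
    """Split an MGA-style sample name on pipes (assumed section order)."""
    text = name.strip()
    unoriented = text.startswith("*")
    if unoriented:
        text = text[1:].strip()  # drop the star marker before splitting
    parts = [part.strip() for part in text.split("|")]
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"cell type and target are both required: {name!r}")
    conditions = parts[2] if len(parts) > 2 else None
    # Any further sections are kept together as additional info.
    additional_info = "|".join(parts[3:]) if len(parts) > 3 else None
    return SampleName(unoriented, parts[0], parts[1], conditions, additional_info)


parsed = parse_sample_name("* S2|PolII|80mMsalt|contol")
print(parsed)
```

For the example name used above, the sketch reports an unoriented sample with cell type 'S2', target 'PolII', conditions '80mMsalt' and additional info 'contol'.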

Last update September 2021