=============================================================================================
Mass Genome Annotation (MGA) Data Repository
=============================================================================================

MGA is a repository of publicly available mass genome annotation data (ChIP-seq and RNA-seq)
in various formats from the most common model organisms: human, mouse, fruit fly, worm,
zebrafish, baker's yeast, fission yeast, and Arabidopsis. It also hosts a large collection of
derived data such as ChIP-seq peaks, genome annotations (promoters, splice junctions, etc.),
and sequence-intrinsic data (conservation scores, SNPs, indels, etc.).

=============================================================================================
Overview of Data Types
=============================================================================================

Currently, we provide data for the following assemblies: hg38, hg19, hg18, mm9, dm6, dm3, ce6,
danRer7, sacCer3, araTha1, S.pombe NCBI/ASM294v2 (spo2), and Z. mays B73 RefGen_v3/zm3.

The Mass Genome Annotation Data Repository includes the following types of data:

Primary data:
  - ChIP-seq data (transcription factors, histone modifications, other chromatin proteins)
  - Transcription profiling data (only TSS-related: CAGE, GRO-cap, etc.)
  - DNA methylation data (under development)
  - Chromatin accessibility assays (MNase, DNase I, ATAC-seq, etc.)

Derived data:
  - ChIP-seq peaks (published peak lists)
  - Genome annotations (promoters, splice junctions, etc.)

Sequence-intrinsic data (derived from genome sequences only):
  - Conservation scores
  - Genome variation data (SNPs, indels, etc.)

=============================================================================================
Organization of the MGA Repository
=============================================================================================

The data repository is a hierarchically structured directory.
At the first level, the root directory is split into subdirectories corresponding to genome
assemblies (e.g. hg19, mm9). At the second level, each assembly directory is split into
subdirectories corresponding to data series (e.g. hg19/barski07).

A data series subdirectory typically contains data from one publication. It often corresponds
to a series entry in GEO (GSE entry).

The root directory of a data series contains the following files:

  - One or several data files in SGA format. One file typically contains data from one
    experiment and often corresponds to a GEO sample entry (GSM entry). The EPD MGA data
    archive may contain additional data files obtained by merging several data sources.
  - Compressed versions of all SGA files for FTP download.
  - Two configuration files: series_name.dat and series_name.txt

    The series_name.dat file contains machine-readable information pertaining to the series
    as a whole:
      - A series name (identical to the subdirectory name).
      - A series description. It appears in the web server menu next to "Series".
      - A short reference. It appears on the result pages next to "includes data from".
      - A GEO identifier (optional).
      - A PubMed identifier (optional).
      - A resource identifier (optional).

    The series_name.txt file contains a table pertaining to individual samples, including:
      - The corresponding SGA filename.
      - A short sample description. It appears in the server menu next to "Sample",
        e.g. "CD4+ CTCF". It is also displayed on the result pages.
      - The feature name included in the SGA file (second field), e.g. "CTCF".
      - The data type, e.g. "ChIP-seq".
      - A flag named oriented with values "T" (TRUE) or "F" (FALSE).
      - The contents of the sixth field of the SGA file. A dash indicates that the sixth
        field is not present.
      - The URL of a custom track file in bigWig or bigBed format. A dash indicates the
        absence of custom track files.
  - A document file: series_name.html

    The series_name.html file is an HTML document that includes:
      - A brief description of the contents: experiment type, genome annotations, etc.
      - Identification of the source data: URL, date of download, names and format of the
        source files, and assembly version.
      - A table of the samples contained in the series, including links to the corresponding
        SGA files, sample descriptions, feature names and, where applicable, links to the
        corresponding GEO entries.
      - Methods used to convert the source data into SGA files.
      - Hyperlinks to PubMed, GEO, and other resources.
      - References, credits.
      - Genome browser viewable files.

Some samples (e.g. ChIP-seq peak files, promoters) are also stored in FPS format for access
by the SSA server.

ChIP-Seq server menus are automatically generated from the configuration files.

=============================================================================================
Access to the MGA Repository
=============================================================================================

The MGA Data Repository can be accessed in the following ways:

  - MGA Query Page: https://epd.expasy.org/mga/SearchMga.php
  - MGA Data Browser: https://epd.expasy.org/mga/
  - MGA Data FTP site: https://epd.expasy.org/ftp/mga/
  - ChIP-Seq server: MGA data sets are available through its menu-driven interface.

=============================================================================================
Last update: July 13 2017, CCG Lab
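
Example (illustrative only, not part of the repository tooling): the series_name.txt table
described under "Organization of the MGA Repository" could be read with a short Python sketch
like the one below. It assumes that the table is tab-delimited with exactly the seven columns
listed there, in that order; the actual delimiter and layout are not specified in this
document and should be checked against a real file. The names Sample and parse_series_table
are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Sample(NamedTuple):
    """One row of a series_name.txt table (column order is an assumption)."""
    sga_file: str
    description: str
    feature: str
    data_type: str
    oriented: bool
    sixth_field: Optional[str]
    track_url: Optional[str]


def _dash_to_none(value: str) -> Optional[str]:
    """A dash marks an absent value in the last two columns."""
    return None if value == "-" else value


def parse_series_table(text: str) -> List[Sample]:
    """Parse the table text, assuming one tab-delimited sample per line."""
    samples = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) != 7:
            raise ValueError(f"line {line_no}: expected 7 fields, got {len(fields)}")
        if fields[4] not in ("T", "F"):
            raise ValueError(f"line {line_no}: oriented flag must be T or F")
        samples.append(Sample(
            sga_file=fields[0],
            description=fields[1],
            feature=fields[2],
            data_type=fields[3],
            oriented=(fields[4] == "T"),
            sixth_field=_dash_to_none(fields[5]),
            track_url=_dash_to_none(fields[6]),
        ))
    return samples
```

To use it, read the contents of a downloaded series_name.txt file into a string and pass that
string to parse_series_table. Each returned Sample then exposes the SGA filename, the menu
description, the feature name, the data type, and the oriented flag as named attributes, with
None standing in for a dash in the last two columns.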