######################################################################
       The Eukaryotic Promoter Database EPDnew ftp repository
######################################################################

EPDnew is a collection of species-specific databases that contain
information on experimentally validated promoters. These are the
result of in-house analysis of promoter-specific high-throughput data
such as CAGE and GRO-cap. The raw data used here can be found in the
sister database MGA (https://epd.expasy.org/mga/).

======================================================================
              Organization of the EPDnew ftp repository
======================================================================

The data repository is a hierarchically structured directory. The root
directory is split into subdirectories corresponding to organisms
(e.g. H_sapiens, M_musculus), which are in turn split according to
EPDnew version (e.g. H_sapiens/004). Each version subdirectory has a
"db" folder with text files that are imported into the MySQL database
used by the web server.

A version subdirectory typically contains the EPDnew annotation in
various formats. File names encode the organism (characters 1-2), the
database (characters 4-9), the version (characters 11-13) and the
assembly used (character 15 onwards, before the file extension). For
example, the file named 'Hs_EPDnew_004_hg19.sga' contains promoter
coordinates (version 4) in SGA format for H. sapiens, mapped on genome
assembly hg19.

The root directory of a data series contains the following file
formats:

  - SGA: this file contains information about the location of a
    promoter in the genome. It is a tab-delimited file with the
    following fields:
      - Chromosome name (RefSeq identifier)
      - Feature type. This is always TSS (transcription start site)
      - Position in the chromosome (starting at base 1)
      - Strand (+ or -)
      - Count field (1)
      - Annotation field (associated promoter ID)

  - FPS: this file format is used by our sister toolkit SSA
    (https://epd.expasy.org/ssa/).
    More information about this file format can be found here:
    https://epd.expasy.org/epd/current/usrman.php#The_FP_line

  - BED: annotation file in BED format used to draw the EPDnew track
    on the UCSC genome browser. This is a representation of the
    sequence field (SE lines) of the DAT file. As such, it starts at
    base -49 from the TSS (base 0) and ends at base +10 (60 bp
    interval). The BED file follows UCSC standards. The columns are:

      Chromosome RegionStart RegionEnd PromoterID Score Strand thickStart thickEnd

    with:
      - Chromosome, the chromosome with UCSC nomenclature
      - RegionStart, the start of the region (base -50 or +11 from
        the TSS)
      - RegionEnd, the end of the region (base -50 or +11 from the
        TSS)
      - PromoterID, the EPDnew promoter ID
      - Score, required by UCSC, always 900
      - Strand, + or -
      - thickStart, the start of the thick region (TSS or base +10)
      - thickEnd, the end of the thick region (TSS or base +10)

    Note that BED coordinates start at base 0, whereas SGA
    coordinates start at base 1. Moreover, the BED file is
    chromosome-oriented, so the TSS can be found at the start or at
    the end of the thick region depending on the strand.

  - DAT: EMBL-like annotation file for each entry. A detailed
    description of this file can be found here:
    https://epd.expasy.org/epd/current/usrman.php#FORMAT_CONVENTIONS

The "db" directory contains files used to update the MySQL database.
The file names are the same for all organisms:

  - promoter_coordinate.txt contains information on promoter location
    in the genome, organism and promoter type.
    Columns are the following:
      - EPDnew promoter ID
      - chromosome RefSeq ID
      - TSS position as defined by EPDnew
      - strand: + or -
      - scientific name of the organism
      - type: either "single", "multiple" or "region"

  - promoter_samples_expression.txt contains average expression
    levels for each promoter with the following columns:
      - EPDnew promoter ID
      - number of samples in which the promoter is active
      - average expression level (evaluated as the average number of
        CAGE tags that map in a window of 100 bp around the annotated
        TSS)

  - promoter_expression.txt contains information on the expression
    level of each promoter in each sample used during the EPDnew
    validation process. Columns are the following:
      - EPDnew promoter ID
      - expression level (evaluated as the total number of CAGE tags
        that map in a 100 bp window around the TSS)
      - sample-specific TSS position relative to the annotated TSS
      - sample name

  - promoter_sequence.txt contains sequence information for each
    promoter. Columns are the following:
      - EPDnew promoter ID
      - short sequence segment corresponding to the -49 to +10 region
        of the promoter

  - promoter_ensembl.txt links EPDnew IDs with ENSEMBL Gene IDs.
    Columns are the following:
      - EPDnew promoter ID
      - ENSEMBL Gene ID

  - cross_references.txt links ENSEMBL Gene IDs with external
    database IDs. Columns are the following:
      - ENSEMBL Gene ID
      - Gene Name
      - RefSeq ID
      - Gene Description

  - gene_description.txt stores information on gene name and
    description. Columns are the following:
      - EPDnew promoter ID
      - Gene Name
      - Gene Description

  - promoter_motifs.txt is a boolean file describing the presence or
    absence of core promoter elements. Columns are the following:
      - EPDnew promoter ID
      - TATA-box presence at position -28 (+/- 3 bp)
      - Initiator presence at the TSS
      - CCAAT-box presence in the region -200 to -50 from the TSS
      - GC-box presence in the region -200 to -50 from the TSS

  - promoter_ucsc.txt contains information about the promoter
    location in UCSC style format.
    Columns are the following:
      - EPDnew promoter ID
      - UCSC assembly name (e.g. hg19, hg38, ...)
      - UCSC chromosome name (e.g. chr1, chr2, ...)
      - strand (+ or -)
      - position in the genome (starting at base 0)

======================================================================
Last update: 14 Oct 2019, EPD team
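
======================================================================
          Example: reading file names and SGA lines (sketch)
======================================================================

The file naming scheme and the six SGA fields described above can be
read with a few lines of Python. This is only a minimal sketch, not
part of the EPDnew tools: the names EpdFileName, SgaRecord,
parse_epd_file_name and parse_sga_line are hypothetical, and the code
assumes that file names always follow the fixed character positions
given in this README and carry a single extension.

```python
from typing import NamedTuple


class EpdFileName(NamedTuple):
    """Parts of an EPDnew file name (hypothetical helper type)."""
    organism: str
    database: str
    version: str
    assembly: str
    extension: str


class SgaRecord(NamedTuple):
    """One line of an SGA file (hypothetical helper type)."""
    chromosome: str   # RefSeq identifier
    feature: str      # always "TSS" according to this README
    position: int     # position in the chromosome, starting at base 1
    strand: str       # "+" or "-"
    count: int
    promoter_id: str  # annotation field


def parse_epd_file_name(name: str) -> EpdFileName:
    """Split a file name using the character positions from this README."""
    stem, dot, extension = name.rpartition(".")
    if not dot:  # no extension present: treat the whole name as the stem
        stem, extension = name, ""
    return EpdFileName(
        organism=stem[0:2],    # characters 1-2
        database=stem[3:9],    # characters 4-9
        version=stem[10:13],   # characters 11-13
        assembly=stem[14:],    # character 15 onwards
        extension=extension,
    )


def parse_sga_line(line: str) -> SgaRecord:
    """Parse one tab-delimited SGA line into its six fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    chromosome, feature, position, strand, count, promoter_id = fields
    return SgaRecord(chromosome, feature, int(position), strand,
                     int(count), promoter_id)
```

For the file name used as an example in this README,
parse_epd_file_name("Hs_EPDnew_004_hg19.sga") yields organism "Hs",
database "EPDnew", version "004" and assembly "hg19". Keeping the
position as an integer makes it easier to remember that SGA positions
start at base 1 when comparing them with BED or promoter_ucsc.txt
coordinates, which start at base 0.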