DESSO is a database and web server that provides all sequence and shape motifs identified by the framework of the same name, which we developed. The framework combines a deep learning model for learning motif patterns with a novel statistical model for identifying motif instances. Its performance was evaluated on the 690 ENCODE ChIP-seq datasets (ENCODE Downloads). DESSO also supports motif scanning and other analyses of user-provided DNA sequences.
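To make the idea of a motif scan concrete, the sketch below slides a small motif over a DNA sequence and reports windows with a positive log-odds score. It is a generic illustration only and is not the statistical model used by DESSO; the motif values, the uniform background, the threshold, and the function name `scan` are all assumptions made for this example.

```python
import math

# Hypothetical 4-position motif given as per-position base probabilities.
# These values are illustrative only and do not come from DESSO.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background frequency for every base


def scan(sequence, motif=MOTIF, threshold=0.0):
    """Return (start, window, log-odds score) for windows scoring above threshold."""
    sequence = sequence.upper()
    width = len(motif)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        # Sum of per-position log2 ratios of motif probability to background.
        score = sum(
            math.log2(column[base] / BACKGROUND)
            for column, base in zip(motif, window)
        )
        if score > threshold:
            hits.append((start, window, score))
    return hits


hits = scan("TTACGTAA")
for start, window, score in hits:
    print(start, window, round(score, 2))
```

Only the forward strand is scanned here; a fuller scan would usually also score the reverse complement of each window.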
A DNA motif is a region of DNA that regulates the expression of downstream genes located on the same DNA molecule, i.e., the same chromosome. The concept is equivalent to a DNA cis-regulatory element, or cis-element. It includes transcription factor binding sites (TFBSs) and other conserved functional elements in the 5' intergenic regions of genes.
DNA shape represents the three-dimensional structure of the corresponding DNA sequence and plays an important role in TF-DNA recognition. Four distinct DNA shape features, namely Minor Groove Width (MGW), Propeller Twist (ProT), Helix Twist (HelT), and Roll, can be computationally derived from DNA sequences based on Monte Carlo simulations.
A shape motif is a conserved DNA shape pattern involved in TF recognition through shape readout. Shape motifs are conserved at the shape level but not necessarily at the sequence level.
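One common way to turn simulation-derived values into a per-position shape profile is a sliding-window lookup, in which each 5-mer is mapped to a precomputed value for its central base pair. The sketch below assumes that approach; this page does not confirm it as the exact procedure used here. The table `TOY_MGW_TABLE` and its numbers are placeholders invented for illustration, not real simulation results, and `shape_profile` is a hypothetical helper name.

```python
# Toy lookup table: pentamer -> minor groove width of its central base pair.
# The numbers are placeholders for illustration, NOT real simulation results.
TOY_MGW_TABLE = {
    "AAAAA": 1.0,
    "AAAAC": 2.0,
    "AAACG": 3.0,
}


def shape_profile(sequence, table=TOY_MGW_TABLE, k=5):
    """Return one shape value per position, or None where no value is available."""
    sequence = sequence.upper()
    flank = k // 2
    # The first and last `flank` positions have no full window around them.
    profile = [None] * len(sequence)
    for center in range(flank, len(sequence) - flank):
        kmer = sequence[center - flank:center + flank + 1]
        profile[center] = table.get(kmer)  # None if the k-mer is not in the toy table
    return profile


profile = shape_profile("AAAAACG")
print(profile)
```

A real table would cover every possible pentamer, so only the two positions at each end of the sequence would remain undefined.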
Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. (from WIKIPEDIA)
ChIP-sequencing (ChIP-seq) is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to map global binding sites precisely for any protein of interest. (from WIKIPEDIA)
This matrix summarizes the 690 ENCODE ChIP-seq datasets used in our study, covering 161 TFs and 91 cell lines. Each entry gives the number of datasets derived from a specific cell line and TF.
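As a minimal sketch of how such a count matrix can be built, the example below tallies datasets per (cell line, TF) pair. The four records are hypothetical annotations chosen for illustration and are not actual ENCODE counts; `build_matrix` is an assumed helper name.

```python
from collections import Counter

# Hypothetical (cell line, TF) annotations, one tuple per dataset.
# These records are illustrative and do not reflect real ENCODE counts.
datasets = [
    ("K562", "CTCF"),
    ("K562", "CTCF"),
    ("K562", "MAX"),
    ("HepG2", "CTCF"),
]


def build_matrix(records):
    """Return (row labels, column labels, matrix of dataset counts)."""
    counts = Counter(records)
    cell_lines = sorted({cell for cell, _ in records})
    tfs = sorted({tf for _, tf in records})
    # Counter returns 0 for pairs that never occur, giving empty cells.
    matrix = [[counts[(cell, tf)] for tf in tfs] for cell in cell_lines]
    return cell_lines, tfs, matrix


cell_lines, tfs, matrix = build_matrix(datasets)
for cell, row in zip(cell_lines, matrix):
    print(cell, dict(zip(tfs, row)))
```

The sum of all entries equals the number of datasets, which gives a quick consistency check on any matrix built this way.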
UCSC Accession is a unique ID for each ENCODE ChIP-seq dataset.
FASTA is the only accepted input format.
See FASTA format for details.
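Before uploading, input can be checked locally with a small parser such as the sketch below. It assumes that each record starts with a ">" header line and that sequences use only the letters A, C, G, T, and N; the exact validation rules applied by the server are not stated here, and `parse_fasta` is a hypothetical helper.

```python
def parse_fasta(text):
    """Parse FASTA text into {header: sequence}; raise ValueError on malformed input."""
    records = {}
    header = None
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError(f"line {line_no}: empty header")
            if header in records:
                raise ValueError(f"line {line_no}: duplicate header {header!r}")
            records[header] = ""
        else:
            if header is None:
                raise ValueError(f"line {line_no}: sequence before any '>' header")
            sequence = line.upper()
            # Assumed alphabet: the four bases plus N for unknown positions.
            invalid = set(sequence) - set("ACGTN")
            if invalid:
                raise ValueError(f"line {line_no}: invalid characters {sorted(invalid)}")
            records[header] += sequence
    return records


EXAMPLE = ">seq1 test\nACGT\nacgn\n\n>seq2\nTTTT\n"
records = parse_fasta(EXAMPLE)
print(records)
```

Sequences split across several lines are joined into one string, and lowercase input is converted to uppercase.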
A known motif is one of the 2,263 identified sequence motifs that match a documented motif in JASPAR or TRANSFAC, while an undocumented motif is one of the 523 identified sequence motifs that have no match in either of these two motif databases.