Gene2vec

Gene Subsequence Embedding for Prediction of Mammalian N6-methyladenosine Sites

Input RNA sequences below:
(click here for example sequences in Block Format [1])
(click here for example sequences in Single Format [2])

  (Supports both Single and Block Format)
  (Supports both Single and Block Format)
  (Supports both Single and Block Format)
  (Supports Single Format only)




[1] FASTA format with multiple unknown GAC/AAC peaks in one sample sequence.
[2] FASTA format with a 1001-nt sequence containing one GAC/AAC peak per sample sequence; the title has the form ">TranscriptID|Position", which is required for calculating the neighboring-site state data.
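
The Single Format described in note [2] can be checked locally before submission. The following is a minimal sketch, not the server's own parser: the function names `read_fasta` and `parse_single_format` are hypothetical, and the assumption that the GAC/AAC peak sits at the center of the 1001-nt window is ours and is not stated on this page.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def parse_single_format(text, expected_len=1001):
    """Validate Single Format records and return them as dictionaries."""
    records = []
    mid = expected_len // 2  # assumed index of the candidate adenosine
    for header, seq in read_fasta(text):
        transcript_id, sep, position = header.partition("|")
        if not sep or not position.strip().isdigit():
            raise ValueError(f"header {header!r} is not 'TranscriptID|Position'")
        if len(seq) != expected_len:
            raise ValueError(f"{header!r}: expected {expected_len} nt, got {len(seq)}")
        records.append({
            "transcript_id": transcript_id,
            "position": int(position),
            "sequence": seq,
            # Assumption: the GAC/AAC peak is centered in the window.
            "center_motif": seq[mid - 1:mid + 2],
        })
    return records
```

A caller could then reject records whose `center_motif` is neither `GAC` nor `AAC`, if the centered-peak assumption holds for the data at hand.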

Citation: Quan Zou, Pengwei Xing, Leyi Wei, Bin Liu. Gene2vec: Gene Subsequence Embedding for Prediction of Mammalian N6-Methyladenosine Sites from mRNA. RNA. DOI: 10.1261/rna.069112.118
Pengwei Xing, Ran Su, Fei Guo, and Leyi Wei*. Identifying N6-methyladenosine sites using multi-interval nucleotide pair position specificity and support vector machine. Scientific Reports. 2017. 7, 46757. DOI: 10.1038/srep46757. (IF2015=5.228).

Welcome to Gene2vec server

        N6-methyladenosine (m6A) refers to methylation of the adenosine nucleotide at the nitrogen-6 position. Recent research on identifying m6A methylation sites has extended to mammalian transcriptomes. Many traditional identification methods based on sequence features are limited by data scale. With mammalian m6A site datasets now reaching millions of sites and with larger sequence windows, deep learning, being data-driven, is expected to perform better. For analogous sequence-type data, research in Natural Language Processing (NLP) has proposed learning latent representations of words using word embedding algorithms. Inspired by this, we report Gene2vec, an RNA N6-adenosine methylation predictor based on gene-subsequence-based neural embedding algorithms.
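
To illustrate the idea of treating subsequences as words, the sketch below splits an RNA sequence into overlapping k-mers and maps them to integer ids, which is the usual input form for a word embedding model. It is only an illustration: the function names are hypothetical, `k=3` is chosen because the plots on this page refer to 3-length RNA words, and the stride of 1 and the T-to-U normalization are our assumptions. Training the embedding itself is not shown.

```python
def rna_words(sequence, k=3, stride=1):
    """Split a sequence into overlapping k-mer 'words' (assumed tokenization)."""
    seq = sequence.upper().replace("T", "U")  # assumption: report words in RNA letters
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocabulary(sentences):
    """Map every distinct word to an integer id, in sorted order."""
    words = sorted({word for sentence in sentences for word in sentence})
    return {word: index for index, word in enumerate(words)}


def encode(sentence, vocabulary):
    """Replace each word by its id so it can index an embedding matrix."""
    return [vocabulary[word] for word in sentence]


sentence = rna_words("GGACU")
vocabulary = build_vocabulary([sentence])
ids = encode(sentence, vocabulary)
print(sentence)  # ['GGA', 'GAC', 'ACU']
print(ids)       # [2, 1, 0]
```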

         In this paper, we built four prediction schemes that combine different RNA sequence representations with optimized convolutional neural networks, and compared their prediction performance. All of these neural-network-based predictors achieve more effective prediction than traditional methods, and the gene-subsequence-based neural embedding (Gene2vec) method stands out, benefiting from the combination of word embedding and a deep network. To our knowledge, this is the first time that word embedding and deep neural networks have been applied to the prediction of mammalian N6-methyladenosine sites. We evaluated these predictors on a rigorous independent test dataset and found that our proposed method outperforms the state-of-the-art predictors.

Fig 1. Workflow of the multiple predictors. Four prediction schemes have been built: one-hot encoding of sequence flanking windows with a four-cell-structure network; neighboring methylation state encoding with two cell structures; and RNA word embedding and Gene2vec from pseudo-RNA sequence words, each with two cell structures.
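
As an illustration of the one-hot scheme named in the caption, the sketch below extracts a flanking window around a candidate site and one-hot encodes it. This is a minimal sketch under our own assumptions, not the published pipeline: the function names, the `N` padding at sequence ends, the `ACGU` column order, and the all-zero row for unknown bases are all hypothetical choices.

```python
ALPHABET = "ACGU"  # assumed column order of the one-hot matrix


def flanking_window(sequence, center, flank, pad="N"):
    """Return the bases within `flank` positions of `center`, padded at the ends."""
    left = max(0, center - flank)
    right = min(len(sequence), center + flank + 1)
    missing_left = flank - (center - left)
    missing_right = flank - (right - 1 - center)
    return pad * missing_left + sequence[left:right] + pad * missing_right


def one_hot(sequence):
    """Encode each base as a 4-element row; unknown bases become all zeros."""
    seq = sequence.upper().replace("T", "U")
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


window = flanking_window("AAGACUU", center=3, flank=2)
print(window)           # AGACU
print(one_hot("GAC"))   # [[0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0]]
```

The resulting matrix has one row per nucleotide, so a window of length L yields an L x 4 input for a convolutional network.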


2D plot: vector-space correlation of 3-length RNA words generated by Gene2vec

3D plot: vector-space correlation of 3-length RNA words generated by Gene2vec

Group1 Data

We remapped the rDNA sequences by Transcript ID from the SRAMP data of Zhou et al. [1] for the group 1 comparative experiment.

Training and validation sets for building the model:
rdna_train_balance_group1(xlsx)(fasta)
rdna_validate_balance_group1(xlsx)(fasta)

Independent test set and YTHDF test set for assessing the predictor:
rdna_test_unbalance_group1(xlsx)(fasta)
YTHDF_test_group1(xlsx)(fasta)

Group2 Data

We remapped the rDNA sequences by Transcript ID from the RNAMethPre data of Xiang et al. [2] for the group 2 comparative experiment.

Training and validation sets for building the model:
rdna_train_balance_group2(xlsx)(fasta)
rdna_validate_balance_group2(xlsx)(fasta)

Independent test set for assessing the predictor:
rdna_test_unbalance_group2(xlsx)(fasta)

References:


[1] Zhou Y, Zeng P, Li Y H, et al. SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Research, 2016, 44(10): e91.
[2] Xiang S, Liu K, Yan Z, et al. RNAMethPre: A Web Server for the Prediction and Query of mRNA m6A Sites. PLoS ONE, 2016, 11(10): e0162707.

The CNN kernel weights, motif PWMs, and motif logos:
kernel-PWM-logo.zip

The motif comparison results in TSV and XML format:
tomtom_motif_compar_hum.tsv
tomtom_motif_compar_hum.xml
tomtom_motif_compar_mus.tsv
tomtom_motif_compar_mus.xml
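
The TSV files above can be filtered with a few lines of Python. The sketch below is a hedged example rather than a documented interface: it assumes the files follow the usual Tomtom tab-separated layout with `Query_ID`, `Target_ID`, and `q-value` columns and trailing `#` comment lines, which should be checked against the actual header. The function name `load_tomtom_matches` and the sample rows are made up for illustration.

```python
import csv
import io


def load_tomtom_matches(handle, max_q=0.05):
    """Return (query, target, q-value) tuples with q-value <= max_q, best first."""
    # Skip blank lines and '#' comment lines before handing rows to the CSV reader.
    lines = (line for line in handle if line.strip() and not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    matches = []
    for row in reader:
        q_value = float(row["q-value"])  # assumed column name
        if q_value <= max_q:
            matches.append((row["Query_ID"], row["Target_ID"], q_value))
    return sorted(matches, key=lambda match: match[2])


# Illustrative, made-up rows; real use would pass open("tomtom_motif_compar_hum.tsv").
sample = (
    "Query_ID\tTarget_ID\tq-value\n"
    "kernel_1\tMOTIF_A\t0.2\n"
    "kernel_2\tMOTIF_B\t0.01\n"
    "\n"
    "# trailing comment line\n"
)
print(load_tomtom_matches(io.StringIO(sample)))  # [('kernel_2', 'MOTIF_B', 0.01)]
```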

Motif Comparison Report

Contact

Quan Zou, Professor
Tianjin University, School of Computer Science and Technology, China
Email: zouquan@nclab.net
Personal website: http://lab.malab.cn/~zq

Pengwei Xing
Tianjin University, School of Computer Science and Technology, China
Email: bluerxing@163.com