A mixed method combining Information Theory, K-Skip-N-Grams and PFPA extracts features from protein sequences.

Files submitted on this page must have the extension ".txt" or ".fasta".
The data must be in FASTA format, and each header must include the "|" character and a label.

Upload your file:       

Welcome to IKP-DBPPred server

This paper uses a mixed feature representation, which gave the best performance in the experiments. The method entails the following main steps. The protein sequences are represented by three feature representation methods. The three representations are combined, and max-relevance-max-distance (MRMD) is used to reduce the dimensionality. A support vector machine (SVM) is used as the classifier. The mixed feature representation is finally tested through an experiment. Figure 1 shows the experimental process in this paper.
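One of the three feature representations named above, K-Skip-N-Grams, can be illustrated with a minimal sketch. The exact definition and parameters used by the server are not given on this page, so the code below assumes a common formulation for n = 2: count every residue pair separated by 0 to k skipped positions, then normalize by the total number of pairs. The function name `k_skip_bigram_features` and the default `k=2` are hypothetical choices for illustration only.

```python
from itertools import product

# The 20 standard amino acid letters (assumption: ambiguous codes are ignored).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def k_skip_bigram_features(sequence, k=2):
    """Return a 400-dimensional k-skip-bigram frequency vector.

    Illustrative only: pairs (s[i], s[i + skip + 1]) are counted for every
    skip in 0..k and divided by the total number of counted pairs.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    sequence = sequence.upper()
    for skip in range(k + 1):
        step = skip + 1
        for i in range(len(sequence) - step):
            pair = sequence[i] + sequence[i + step]
            if pair in counts:  # skip pairs containing non-standard letters
                counts[pair] += 1
                total += 1
    if total == 0:
        # Sequence too short to form any pair.
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]
```

A vector like this could then be concatenated with the other two representations before dimensionality reduction and classification; how the server actually combines them is not specified here.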


The benchmark dataset used in the paper is PDB186. It was first proposed by Lou et al. [1] and contains 93 DNA-binding proteins (positive samples) and 93 non-DNA-binding proteins (negative samples). The PDB186 dataset in FASTA format is available for download (Download Here).


[1] W. Lou, et al., "Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naive Bayes," PLoS ONE, vol. 9, 2014.


This site identifies whether a protein sequence is a DNA-binding protein. Input must be in FASTA format. The first line of each record is a header of any text that starts with ">"; the analysis only works for FASTA headers that contain the "|" character and a label. The sequence itself starts on the second line and may use only the established amino acid symbols.
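The input rules above can be checked before uploading with a short script. This is a minimal sketch under stated assumptions, not the server's own validation: it assumes the label is the field after the last "|" in the header, and that "established amino acid symbols" means the 20 standard one-letter codes. The name `parse_labeled_fasta` is hypothetical.

```python
# Assumption: only the 20 standard one-letter amino acid codes are accepted.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def _build_record(header, chunks):
    """Validate one record and return (name, label, sequence)."""
    if "|" not in header:
        raise ValueError(f"header lacks '|': {header!r}")
    # Assumption: the label is whatever follows the last '|'.
    name, label = (part.strip() for part in header.rsplit("|", 1))
    if not name or not label:
        raise ValueError(f"header needs a name and a label: {header!r}")
    sequence = "".join(chunks).upper()
    if not sequence:
        raise ValueError(f"no sequence for header: {header!r}")
    bad = set(sequence) - VALID_AMINO_ACIDS
    if bad:
        raise ValueError(f"invalid symbols {sorted(bad)} in {header!r}")
    return name, label, sequence


def parse_labeled_fasta(text):
    """Parse FASTA text into a list of (name, label, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_build_record(header, chunks))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append(_build_record(header, chunks))
    return records
```

For example, `parse_labeled_fasta(">P1|1\nMKVLLA\n")` yields one record, while a header without "|" raises `ValueError`.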


Dr. Le-Yi WEI
Tianjin University, School of Computer Science and Technology, China
Email: weileyi@tju.edu.cn
Personal website: http://lab.malab.cn/~wly