Discovering protein function from novel primary sequences is essential. Wet-lab experimental procedures are both time-consuming and costly, so reliably predicting protein structure and function from the amino acid sequence alone has significant value. TATA-binding protein (TBP) is a DNA-binding protein that plays a key role in transcription regulation. Our study proposes an automatic approach for identifying TATA-binding proteins efficiently, accurately, and conveniently. The method may also serve as a guide for identifying other specific proteins with computational intelligence strategies.
The raw TBP dataset was downloaded from the UniProt database (http://www.uniprot.org) and contains 964 TBP protein sequences. Because the raw data are highly redundant (including many repeated sequences), we clustered the raw dataset with CD-HIT before each analysis. At a clustering threshold of 90%, we obtained 559 positive instances and 8,465 negative instances. We then selected 559 negative control sequences by random sampling from the 8,465 negative instances. (download source)
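The random sampling step can be illustrated with a minimal Python sketch. It is not the authors' actual script: the function name `sample_negatives`, the fixed seed, and the placeholder sequence identifiers are assumptions made for illustration only. It draws as many negatives as there are positives, without replacement, so that the two classes are balanced.

```python
import random


def sample_negatives(negatives, n_positive, seed=0):
    """Randomly pick n_positive items from negatives, without replacement."""
    if n_positive > len(negatives):
        raise ValueError("not enough negative sequences to sample from")
    # A fixed seed (an assumption, not stated in the source) makes the
    # selection reproducible between runs.
    rng = random.Random(seed)
    return rng.sample(negatives, n_positive)


# Placeholder identifiers standing in for the 8,465 clustered negative sequences.
negatives = [f"neg_{i}" for i in range(8465)]

# Select as many negatives as there are positive instances (559).
selected = sample_negatives(negatives, 559)
print(len(selected))  # 559
```

In practice the list of negatives would be read from the clustered FASTA file rather than generated, and the seed can be changed or removed to draw a different control set.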
Based on the original source above, we extracted a novel 188D and PriPred feature dataset (download) for a series of machine learning experiments. This dataset can be loaded directly into WEKA.
For the convenience of the scientific community, we provide a trained model file for predicting TATA-binding proteins directly. (download)
If you have any questions, please feel free to contact us. (Email: shixiangwan@gmail.com)
Quan Zou, Shixiang Wan, Ying Ju, Jijun Tang, and Xiangxiang Zeng. "Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy." BMC Systems Biology, 10.4 (2016): 401.
Bioinformatics Laboratory - Tianjin University @ Shixiang Wan