Predicting the subcellular location of human proteins is a multi-label problem with imbalanced data. We developed the HPSLPred classifier to predict ten subcellular sites: Cytoplasm, Nucleus, Cell membrane, Membrane, Secreted, Cytoskeleton, Cell projection, Endoplasmic reticulum membrane, Cell junction, and Mitochondrion. Applied to these ten sites, our approach achieves an optimal average precision of 75%.
Imbalanced data can lead to very poor recall for the labels with few samples, which undermines the credibility of the classification results. Based on the decision boundary of a Support Vector Machine (SVM), an imbalanced data set can be transformed into a balanced one. A balanced data set is generated for each label (site), and the ensemble classifier HPSLPred then searches for the highest precision using multiple threads.
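As a rough illustration of the per-site balancing idea, the Python sketch below splits a multi-label data set into one binary data set per site and balances each of them. It is not the HPSLPred implementation: plain random undersampling stands in for the SVM decision-boundary selection, and the names `binary_view`, `balance`, and `build_balanced_sources` are hypothetical. The thread pool only shows how the per-site work could be spread over multiple threads.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# The ten sites listed above.
SITES = [
    "Cytoplasm", "Nucleus", "Cell membrane", "Membrane", "Secreted",
    "Cytoskeleton", "Cell projection", "Endoplasmic reticulum membrane",
    "Cell junction", "Mitochondrion",
]


def binary_view(samples, site):
    """Turn multi-label samples into (features, 0/1) pairs for one site.

    `samples` is a list of (features, set_of_site_names) pairs.
    """
    return [(x, 1 if site in labels else 0) for x, labels in samples]


def balance(dataset, rng):
    """Keep the smaller class and an equally sized random subset of the larger.

    ASSUMPTION: random undersampling is a simple stand-in here; the method
    described above selects samples using an SVM decision boundary instead.
    The result is empty if one of the two classes has no samples.
    """
    pos = [d for d in dataset if d[1] == 1]
    neg = [d for d in dataset if d[1] == 0]
    small, large = sorted((pos, neg), key=len)
    return small + rng.sample(large, len(small))


def build_balanced_sources(samples, sites=SITES, seed=0):
    """Build one balanced binary data set per site, one task per thread."""
    def one(site):
        rng = random.Random(f"{seed}-{site}")  # reproducible per-site sampling
        return site, balance(binary_view(samples, site), rng)

    with ThreadPoolExecutor() as pool:
        return dict(pool.map(one, sites))
```

For example, with ten toy proteins of which two are labelled Nucleus, `build_balanced_sources(samples, sites=["Nucleus"])` would return a Nucleus data set of four samples, two positive and two negative. A per-site classifier could then be trained on each balanced data set and the results combined into an ensemble.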
The original data are extracted from the UniProtKB database. LOCATE, PSORTdb, Arabidopsis Subcellular DB, Yeast Subcellular DB, Plant-PLoc, LOCtarget, and LOC3D are also reasonable alternatives. We carefully cleaned and trimmed the original data and released the result on this site (see the document page), where it can be downloaded freely. If you have any questions, please feel free to contact us.
Bioinformatics Laboratory - Tianjin University @ Shixiang Wan