ELM-MHC

This paper proposes a new computational method for recognizing MHC proteins, in contrast to traditional biological methods, and uses the resulting classifier to identify MHC I and MHC II proteins. The classifier combines SVMProt 188D, bag-of-ngrams (BonG), and information theory (IT) features into a mixed feature representation and is trained with the extreme learning machine (ELM).

Input protein sequences below: (click here for example sequences)
Input must be in FASTA format.
The first line of each record is free text that starts with ">"; the analysis only works when the FASTA header contains the "|" character.
The sequence itself starts on the second line and may use only the established amino acid symbols.
A single input sequence cannot contain line breaks.
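The input rules above can be checked before submission with a short script. The following is a minimal sketch, not the server's own validation code: the function name `validate_record` is hypothetical, and the set of 20 standard one-letter amino acid codes is an assumption about what "established amino acid symbols" means here (the server may accept others).

```python
from typing import List

# Assumption: only the 20 standard one-letter amino acid codes are accepted.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def validate_record(header: str, sequence: str) -> List[str]:
    """Return a list of problems found in one FASTA record (empty if none)."""
    problems = []
    if not header.startswith(">"):
        problems.append("header must start with '>'")
    if "|" not in header:
        problems.append("header must contain '|'")
    if not sequence:
        problems.append("sequence is empty")
    if "\n" in sequence or "\r" in sequence:
        problems.append("sequence must be on a single line")
    # Compare case-insensitively; line breaks are already reported above.
    invalid = sorted(set(sequence.upper()) - STANDARD_AMINO_ACIDS - {"\n", "\r"})
    if invalid:
        problems.append("unexpected symbols: " + "".join(invalid))
    return problems
```

An empty list means the record satisfies the rules as interpreted here, for example `validate_record(">2MA1A|1", "HDAPLF")`.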


Upload data file



Welcome to ELM-MHC server

The ELM-MHC classifier consists of two parts, training and testing. Each sequence in the data set is represented with a hybrid feature representation that combines BonG, 188D, and IT features. The MRMR method is then applied to the mixed feature matrix for feature selection and dimensionality reduction, yielding the final feature matrix. The training model is built from this matrix with the extreme learning machine classification method. The same feature extraction and dimensionality reduction are applied to the sequences to be predicted. The framework of the MHC classifier is shown in Figure 1.
Figure 1. The experimental process in this paper.
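To make the training step concrete, here is a minimal sketch of an extreme learning machine in its basic form: the hidden-layer weights are drawn at random and left fixed, and only the output weights are fitted, by least squares. It assumes the feature matrix has already been extracted and reduced. The class name, the use of NumPy, the tanh activation, the hidden-layer size, the 0.5 decision threshold, and the toy data are all illustrative assumptions; the configuration actually used by ELM-MHC is not stated on this page.

```python
import numpy as np


class ExtremeLearningMachine:
    """Single hidden layer: random fixed hidden weights, least-squares output weights."""

    def __init__(self, n_hidden: int = 50, seed: int = 0):
        self.n_hidden = n_hidden  # assumption: illustrative size only
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Assumption: tanh activation; other activations are equally valid.
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Hidden-layer weights are drawn once at random and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        # Output weights: least-squares solution via the Moore-Penrose pseudoinverse.
        self.beta = np.linalg.pinv(self._hidden(X)) @ y
        return self

    def predict(self, X):
        scores = self._hidden(np.asarray(X, dtype=float)) @ self.beta
        # Assumption: labels are 0/1 and 0.5 is the decision threshold.
        return (scores >= 0.5).astype(int)


# Toy, made-up feature vectors standing in for the selected feature matrix.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
model = ExtremeLearningMachine(n_hidden=20).fit(X, y)
predictions = model.predict(X)
```

Because only `beta` is solved for, training reduces to one pseudoinverse computation, which is what makes the method fast compared with iteratively trained networks.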


Datasets

MHC sequences were first retrieved from the UniProt database, and the relevant protein sequences were selected to generate an MHC sequence file. Negative examples were then obtained from Pfam families. Redundant sequences were removed from the resulting FASTA files to obtain the final data set. After these steps, the data set, named DMHC, contains 6712 MHC protein sequences (denoted Smhc) and 6776 non-MHC protein sequences (denoted Snon-mhc). Smhc is divided into two parts for predicting the two types of MHC protein: the first part, with 4370 MHC protein sequences, is used for training, and the second part, with 2342 MHC protein sequences, is used for independent testing.
The complete dataset in FASTA format is available for download (Download Here).
The feature files extracted from the MHC sequences are available for download (Download Here).



Help

This site identifies whether a protein sequence is an MHC protein. Input must be in FASTA format. The first line of each record is free text that starts with ">"; the analysis only works when the FASTA header contains the "|" character and a label. The sequence itself starts on the second line and may use only the established amino acid symbols.
e.g.
>2MA1A|1
HDAPLFEALRAWRLQKAKELSLPPYTIFHDATLKTIAELRPGSHATLGTVSGVGGRKLAAYGDEVLQVVRDSSGG
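A record in this form can be split into its parts with a few lines of code. The sketch below is illustrative only: the function name `read_records` is hypothetical, and it assumes that each record is exactly one header line followed by one sequence line and that the label is whatever follows the last "|" in the header. What the label values mean is not specified on this page.

```python
from typing import Iterator, Tuple


def read_records(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, label, sequence) for each record in FASTA text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Assumption: records are header/sequence pairs, one line each.
    for header, sequence in zip(lines[0::2], lines[1::2]):
        if not header.startswith(">") or "|" not in header:
            raise ValueError(f"unsupported header: {header!r}")
        # Assumption: the label is the text after the last '|'.
        identifier, _, label = header[1:].rpartition("|")
        yield identifier, label, sequence
```

For the example above, the header `>2MA1A|1` yields the identifier `2MA1A` and the label `1`.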

Contact

Dr. Quan Zou
Tianjin University, School of Computer Science and Technology, China
Email: zouquan@tju.edu.cn
Personal website: http://lab.malab.cn/~zq/