The first step is to search the UniProt database for MHC sequences. The relevant protein sequences are selected to
generate an MHC sequence file. Negative examples (non-MHC sequences) are then obtained from Pfam families. Redundant
sequences are then removed from the resulting FASTA files to obtain the final data set. Following these steps, the paper
constructs a data set, named DMHC, of 6712 MHC protein sequences (denoted Smhc) and 6776 non-MHC protein sequences
(denoted Snon-mhc). Smhc is divided into two parts to predict the two types of MHC protein. The first part, which
contains 4370 MHC protein sequences, is used for training, and the second part, which contains 2342 MHC protein
sequences, is used for independent testing.
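The redundancy-removal and train/test split steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the text does not say which de-redundancy tool or similarity threshold was used, nor how the split was drawn, so the sketch substitutes exact-duplicate removal (similarity-based clustering is a stricter criterion) and a seeded random shuffle. The function names `parse_fasta`, `drop_exact_duplicates`, and `split_train_test` are hypothetical helpers, not part of any published DMHC code.

```python
import random
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(lines: Iterable[str]) -> List[Record]:
    """Parse FASTA-formatted lines into (header, sequence) pairs."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_exact_duplicates(records: List[Record]) -> List[Record]:
    """Keep the first record for each distinct sequence.

    Assumption: exact (case-insensitive) matching stands in for the
    unspecified de-redundancy procedure in the text.
    """
    seen = set()
    unique: List[Record] = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


def split_train_test(records: List[Record], n_train: int, seed: int = 0):
    """Shuffle reproducibly, then return (first n_train records, the rest).

    Assumption: a seeded random split; the text does not say how the
    training and independent test parts were chosen.
    """
    if not 0 <= n_train <= len(records):
        raise ValueError("n_train must be between 0 and len(records)")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```

Applied to a de-redundant Smhc file of 6712 records, calling `split_train_test(records, 4370)` would leave 2342 records for independent testing, matching the sizes reported above.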
The complete dataset in FASTA format can be downloaded here.
The feature files extracted from the MHC sequences can be downloaded here.