Datasets
 
 

Four datasets are used in the paper: one benchmark dataset and three updated large-scale datasets. They are described below.

1. A benchmark dataset, called DD, which has been widely used in previous studies. This dataset includes 311 training sequences (training set) and 383 testing sequences (testing set) from 27 fold classes.

2. Three updated large-scale datasets, in which all sequences are derived from the latest ASTRAL SCOP (release 1.75B, July 2012). The sequence similarity in all three datasets is lower than 40%.

Large-scale Dataset 1 (EDDnew). This dataset includes a total of 3,625 amino acid sequences from the same 27 fold classes as the DD set.

Large-scale Dataset 2 (F95new). This dataset includes a total of 6,791 amino acid sequences from 95 fold classes.

Large-scale Dataset 3 (F194new). This dataset includes a total of 8,525 amino acid sequences from 194 fold classes.

Detailed information on all four datasets (the identifier, the name, and the number of sequences of each fold class) can be found in Supplement A.
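For a quick side-by-side comparison, the dataset sizes stated above can be collected in a small table. The following is a minimal Python sketch: the class name `FoldDataset`, the list `DATASETS`, and the helper `summarize` are hypothetical names chosen for illustration and are not part of the paper or its downloads. The "mean per fold" value is only the total divided by the number of fold classes; the actual per-class counts are uneven and are listed in Supplement A.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoldDataset:
    """Summary of one dataset (hypothetical helper, for illustration only)."""
    name: str
    n_sequences: int
    n_folds: int

    @property
    def mean_per_fold(self) -> float:
        # Plain average; real fold classes are not equally sized.
        return self.n_sequences / self.n_folds


# Counts copied from the descriptions above.
# DD is the sum of its training (311) and testing (383) sets.
DATASETS = [
    FoldDataset("DD", 311 + 383, 27),
    FoldDataset("EDDnew", 3625, 27),
    FoldDataset("F95new", 6791, 95),
    FoldDataset("F194new", 8525, 194),
]


def summarize(datasets):
    """Return one formatted summary line per dataset."""
    return [
        f"{d.name:8s} {d.n_sequences:5d} sequences, "
        f"{d.n_folds:3d} fold classes, "
        f"{d.mean_per_fold:6.1f} sequences per fold on average"
        for d in datasets
    ]


if __name__ == "__main__":
    for line in summarize(DATASETS):
        print(line)
```

The averages make the trade-off visible: the large-scale sets add many more sequences, but as the number of fold classes grows, the average number of examples per class drops.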

3. The feature sets corresponding to the datasets described above can be downloaded (here).