CN113971984A - Classification model construction method and device, electronic equipment and storage medium - Google Patents


Info

Publication number: CN113971984A
Application number: CN202010648024.3A
Authority: CN (China)
Prior art keywords: target, splitting, forest, attribute, target object
Priority date: (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventor: 钱宝健
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Application filed by China Mobile Communications Group Co Ltd and China Mobile Suzhou Software Technology Co Ltd
Priority to CN202010648024.3A
Publication of CN113971984A

Classifications

    • G - PHYSICS
      • G06 - COMPUTING; CALCULATING OR COUNTING
        • G06F - ELECTRIC DIGITAL DATA PROCESSING
          • G06F18/00 - Pattern recognition
            • G06F18/20 - Analysing
              • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
              • G06F18/23 - Clustering techniques
      • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
        • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
          • G16B5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
          • G16B20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
            • G16B20/30 - Detection of binding sites or motifs
          • G16B25/00 - ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
            • G16B25/10 - Gene or protein expression profiling; Expression-ratio estimation or normalisation
          • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding


Abstract

The application discloses a classification model construction method comprising: acquiring a target object sample set and extracting target feature information from it; determining a plurality of splitting attributes of the target feature information, where a splitting attribute characterizes the attribute of a class splitting node in the target object sample set; determining a weight value for each splitting attribute, where the weight value represents the class discrimination of the splitting attribute, and selecting from the plurality of splitting attributes the subset with the largest weight values as target splitting attributes; and constructing a classification model of the target object based on the target feature information and the target splitting attributes. The method improves the training efficiency of the classification model, shortens training time, and reduces computational overhead. The application also discloses a classification model construction apparatus, an electronic device, and a computer-readable storage medium.

Description

Classification model construction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of big data processing technologies, and in particular, to a classification model construction method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Research into the association between gene expression data and cancer/disease states is an important topic in biology and medicine. Comparative study of the gene expression data of diseased tissue against that of normal tissue can deepen the understanding of pathology and help identify different tissues and disease types.
Currently, the industry is interested in classifying cancer/disease gene expression data with machine learning and deep learning methods. However, because gene expression data have few samples and high sample dimensionality, constructing a classification model for such data currently consumes a large amount of model training time.
Disclosure of Invention
The application provides a classification model construction method and device, electronic equipment and a computer readable storage medium, which can improve the training efficiency of a classification model, shorten the training time and reduce the calculation overhead.
In a first aspect, the present application provides a classification model construction method, including:
acquiring a target object sample set, and extracting target characteristic information of the target object sample set;
determining a plurality of split attributes of the target feature information; wherein the splitting attribute is used for characterizing the attribute of the class splitting node in the target object sample set;
determining weight values corresponding to the plurality of splitting attributes respectively, and acquiring partial target splitting attributes with the maximum weight values from the plurality of splitting attributes; the weight value is used for representing the category discrimination of the splitting attribute;
and constructing a classification model of the target object based on the target characteristic information and the target splitting attribute.
In a second aspect, the present application provides a classification model building apparatus, the apparatus comprising:
the characteristic extraction unit is used for acquiring a target object sample set and extracting target characteristic information of the target object sample set;
a split attribute determination unit for determining a plurality of split attributes of the target feature information; wherein the splitting attribute is used for characterizing the attribute of the class splitting node in the target object sample set;
the weight value determining unit is used for determining weight values corresponding to the split attributes respectively and acquiring a part of target split attributes with the maximum weight values from the split attributes; the weight value is used for representing the category discrimination of the splitting attribute;
and the processing unit is used for constructing a classification model of the target object based on the target characteristic information and the target splitting attribute.
In a third aspect, the present application provides an electronic device comprising a processor and a memory for storing a computer program operable on the processor;
wherein the processor is configured to execute the steps of the classification model construction method according to the first aspect when running the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium having a computer program stored thereon, the computer program being executed by a processor to perform the steps of the classification model construction method according to the first aspect.
According to the classification model construction method and apparatus, the electronic device, and the computer storage medium provided by the application, a target object sample set is acquired, its target feature information is extracted, and a plurality of splitting attributes of the target feature information are determined; a weight value, representing the class discrimination of a splitting attribute, is determined for each splitting attribute, and the subset of target splitting attributes with the largest weight values is selected; a classification model of the target object is then constructed based on the target splitting attributes and the target feature information. Feature extraction reduces the dimensionality of the original target object sample set, while building the model only on the splitting attributes with the largest weight values (i.e., the highest discrimination) greatly reduces the data dimensionality during training and shortens the classification model's learning time. Constructing the classification model from the extracted target feature information and the selected subset of splitting attributes therefore improves training efficiency, shortens training time, and reduces computational overhead.
Drawings
Fig. 1 is a schematic flowchart 1 of a classification model construction method provided in an embodiment of the present application;
fig. 2 is a schematic structural diagram 1 of a decision tree according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart 2 of a classification model construction method according to an embodiment of the present application;
fig. 4 is a schematic flowchart 3 of a classification model construction method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram 1 of a deep forest model according to an embodiment of the present application;
fig. 6 is a schematic system architecture diagram of a classification model building method according to an embodiment of the present application;
fig. 7 is a schematic structural composition diagram of a classification model building apparatus according to an embodiment of the present application;
fig. 8 is a schematic diagram illustrating a hardware structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
So that the manner in which the features and elements of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings.
For lack of technical means, early diagnosis of cancer/disease was once possible only through clinical observation, recording, and analysis. With the advent of gene-chip and gene-sequencing technology, large amounts of gene expression data about cancer/disease tissues are continuously generated, providing a new approach to cancer/disease research and diagnosis. In practice, techniques from statistics and machine learning are used to classify cancer/disease gene expression data and to identify characteristic genes helpful for cancer/disease treatment and research.
Currently, many machine learning and deep learning approaches are applied to the classification of cancer/disease gene expression data. Machine-learning-based cancer/disease classification approaches can be broadly divided into four categories:
(1) Classification method based on similarity
The similarity-based classification method determines the class of the gene expression data to be classified mainly according to its similarity (e.g., Euclidean distance) to the training samples. Two representative algorithms are the K-nearest-neighbor (KNN) algorithm and clustering algorithms. The idea of KNN-based classification is to find, through similarity calculation, the training samples most similar to the gene expression data to be classified and to take the class of the selected training samples as its class. Clustering-based classification instead uses a clustering algorithm to group similar samples into the same cluster through similarity calculation, then determines the class of the data to be classified by voting over the actual classes of the other samples in its cluster.
However, similarity-based classification has a drawback: when the gene expression data to be classified and the training samples are high-dimensional, computing the similarity between each item to be classified and every training sample in the training sample set takes a large amount of time, and classification efficiency is low.
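The KNN idea described above can be sketched in a few lines of Python; the toy expression profiles and class labels below are invented purely for illustration:

```python
import math
from collections import Counter

def knn_classify(query, samples, labels, k=3):
    """Classify `query` by majority vote among its k nearest
    training samples under Euclidean distance."""
    dists = sorted(
        (math.dist(query, s), lab) for s, lab in zip(samples, labels)
    )
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]

# Toy expression profiles: two "normal" and two "tumour" samples.
samples = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (7.5, 8.5)]
labels = ["normal", "normal", "tumour", "tumour"]
print(knn_classify((7.8, 8.2), samples, labels, k=3))  # tumour
```

Because every query is compared against every training sample in every dimension, the cost grows with both the sample count and the dimensionality, which is exactly the efficiency problem noted above.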
(2) Classification method based on the maximum margin
The maximum-margin classification method linearly combines the sample gene expression values and makes the classification decision from the result of the linear combination; its key point is maximizing the classification margin, i.e., the distance between the classification boundary and the training sample points closest to it. The representative maximum-margin algorithm is the Support Vector Machine (SVM), which works as follows: find a hyperplane that separates samples of different classes, then decide the class according to which side of the hyperplane the gene expression data to be classified lies on.
However, while the maximum-margin classification method works well for linearly separable gene expression data, its classification accuracy suffers for non-linearly separable and noisy data.
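The decision rule of a trained maximum-margin classifier reduces to checking which side of the hyperplane a sample falls on. The sketch below assumes the weight vector and bias have already been produced by an SVM solver; the numeric values are purely illustrative:

```python
def svm_decision(x, w, b):
    """Side of the separating hyperplane w·x + b = 0 decides the class."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "tumour" if score > 0 else "normal"

# Hyperplane assumed already learned by an SVM solver (illustrative values).
w, b = (1.0, 1.0), -9.0
print(svm_decision((8.0, 8.0), w, b))  # tumour
print(svm_decision((1.0, 1.2), w, b))  # normal
```

Training itself maximizes the margin between the hyperplane and the closest samples; only the resulting linear decision rule is shown here.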
(3) Classification method based on data distribution
The distribution-based classification method relies mainly on the distribution of the training samples to determine the final classification of gene expression data. Representative algorithms include the Fisher Linear Discriminant Analysis (FLDA) algorithm and the naive Bayes algorithm. For example, naive-Bayes-based classification computes, for the given gene expression data to be classified, the probability of each class conditioned on that data, and assigns the data to the class with the maximum probability.
However, distribution-based classification has two problems. First, it assumes independence between genes, whereas many gene expression values are in fact correlated; this assumption ignores many genes that are related in classification effect or biological significance. Second, it assumes the data are Gaussian-distributed, which the actual distribution of gene expression data often does not match; it is therefore suitable only for the gene expression data of certain specific types of cancer/disease and is difficult to adapt to all of them.
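A minimal Gaussian naive Bayes sketch makes the two criticised assumptions explicit, i.e. gene independence and Gaussian-distributed expression values (class priors are assumed equal and omitted); the data and labels are toy values:

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels):
    """Per-class mean/variance for each gene, assuming gene independence
    and Gaussian expression values (the assumptions criticised above)."""
    by_class = defaultdict(list)
    for s, lab in zip(samples, labels):
        by_class[lab].append(s)
    params = {}
    for lab, rows in by_class.items():
        stats = []
        for col in zip(*rows):
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mu, var))
        params[lab] = stats
    return params

def nb_classify(x, params):
    """Pick the class with the highest (log-)likelihood of x."""
    best, best_lp = None, float("-inf")
    for lab, stats in params.items():
        lp = sum(-0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
                 for v, (mu, var) in zip(x, stats))
        if lp > best_lp:
            best, best_lp = lab, lp
    return best

samples = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (7.5, 8.5)]
labels = ["normal", "normal", "tumour", "tumour"]
params = fit_gaussian_nb(samples, labels)
print(nb_classify((7.8, 8.2), params))  # tumour
```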
(4) Method based on deep neural network model
Cancer classification methods based on deep neural network models are mainly inspired by the good performance of deep models in other fields. Researchers apply an autoencoder to learn features of a cancer/disease data set and then feed the learned features into a classifier for classification.
However, although depth-model-based classification can achieve good results on small gene expression data sets, it consumes a large amount of feature dimension-reduction time and model training time, and it cannot identify the characteristic genes with potential biological significance.
To solve the problems in the related art, an embodiment of the present application provides a classification model construction method. The execution subject of the method may be the classification model construction apparatus provided in the embodiment of the present application, or an electronic device integrating that apparatus. The apparatus may be implemented in hardware or software, and the electronic device may be a smartphone, tablet computer, personal computer, server, industrial computer, or the like; the embodiment of the present application is not limited herein.
Fig. 1 is a schematic flowchart 1 of a classification model construction method provided in an embodiment of the present application, and as shown in fig. 1, the classification model construction method includes the following steps:
and step 110, acquiring a target object sample set, and extracting target characteristic information of the target object sample set.
In the embodiments provided herein, the target object may be an object with a small number of samples and/or a high sample dimensionality. Given these characteristics of the target object sample set, the classification model construction method provided by the embodiment of the application obtains target feature information by performing feature extraction on the acquired target object sample set. This eliminates the errors and noise present in the original sample set, removes data irrelevant to the classification result, and reduces the dimensionality of the original target object sample set.
In some embodiments of the present application, the target object may be gene expression data. Specifically, gene expression data is the abundance of the gene transcription product mRNA measured directly or indirectly in the cells of specific tissues using gene-chip technology; it can be used to analyze which genes' expression has changed, what correlations exist among genes, and how gene activity is affected under different conditions.
In the embodiments provided herein, a gene expression data sample set is made up of a plurality of gene expression data samples. In practice, the sample set is expressed as a matrix with m rows and n columns. Referring to Table 1, which shows an example of a gene expression data sample set: each row represents a sample, which may characterize a tissue, a type of cancer/disease, or a subtype of a cancer/disease (for example, case 1 may be normal human tissue, case 2 gastric cancer, and case m lung cancer); each column indicates the expression value of the same gene across the different cases. Typically n ≫ m.
Table 1. Example of a gene expression data sample set

         Gene 1   Gene 2   ……   Gene n
Case 1   219      102.5    ……   45
Case 2   180      117      ……   89
……       ……       ……       ……   ……
Case m   45       56       ……   81
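As a concrete illustration, the m-by-n matrix of Table 1 can be held in memory as one list of expression values per case; the case labels follow the examples given in the text, and the middle genes are elided:

```python
# Minimal in-memory form of Table 1 (row = case/sample, column = gene).
genes = ["Gene 1", "Gene 2", "Gene n"]  # n columns (middle genes elided)
expression = {                          # m rows
    "Case 1": [219.0, 102.5, 45.0],
    "Case 2": [180.0, 117.0, 89.0],
    "Case m": [45.0, 56.0, 81.0],
}
# Example labels following the text: normal tissue / gastric cancer / lung cancer.
labels = {"Case 1": "normal", "Case 2": "gastric cancer", "Case m": "lung cancer"}

m, n = len(expression), len(genes)
print(m, n)  # 3 3  (in real data sets n >> m: tens of rows, thousands of columns)
```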
In the embodiments provided by the application, when gene expression data is classified, or a classification model is constructed from a gene expression data sample set, errors and noise in the original data degrade the classification result. Furthermore, in some types of gene expression data, the expression values of most genes are unrelated to the occurrence of cancer/disease, so the gene expression data closely related to cancer/disease must be screened out by some strategy. In addition, gene expression data sets have few samples but high sample dimensionality: generally the number of samples is between tens and about a hundred, while the sample dimension is between thousands and tens of thousands.
For these reasons, the classification model construction method provided by the embodiment of the application obtains target feature information by performing feature extraction on the acquired gene expression data sample set, so as to eliminate the errors and noise in the original gene expression data, remove data irrelevant to cancer/disease, and reduce the dimensionality of the original gene expression data sample set.
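The embodiment does not fix a particular feature extractor. As one simple stand-in, a variance filter that keeps only the genes varying most across samples already yields the dimensionality reduction described; this is a hedged sketch with toy values from Table 1, not the patent's prescribed method:

```python
def variance_filter(matrix, keep):
    """Keep the `keep` gene columns with the highest variance across
    samples - one simple stand-in for the feature-extraction step."""
    n = len(matrix[0])

    def var(j):
        col = [row[j] for row in matrix]
        mu = sum(col) / len(col)
        return sum((v - mu) ** 2 for v in col) / len(col)

    top = sorted(range(n), key=var, reverse=True)[:keep]
    top.sort()  # preserve the original gene order
    return top, [[row[j] for j in top] for row in matrix]

X = [[219, 102.5, 45], [180, 117, 89], [45, 56, 81]]
cols, X_reduced = variance_filter(X, keep=2)
print(cols)  # [0, 1]: genes 1 and 2 vary most across the three cases
```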
Step 120, determining a plurality of splitting attributes of the target characteristic information; and the splitting attribute is used for representing the attribute of the class splitting node in the target object sample set.
Next, gene expression data is taken as an example of the target object.
Referring to the gene expression data sample set shown in Table 1, each row is one sample and each column can be understood as one attribute field used for classification. The problem the present application addresses is establishing a cancer/disease classification model from such a sample set, where the model may be a decision tree or a random forest consisting of multiple decision trees. When a researcher later obtains the gene expression data of a certain tissue through gene sequencing, whether the tissue has a lesion, and the type of lesion, can be predicted from the rules of the decision tree (or of the random forest formed from multiple decision trees) and the expression value of each gene in the tissue.
In practice, after acquiring the gene expression data set shown in Table 1, the classification model construction apparatus can construct one or more decision trees like the one shown in Fig. 2, a schematic structural diagram of a decision tree. As shown in Fig. 2, the tree has 8 nodes: node 1 is the root node; nodes 4, 5, 6, 7, and 8 are leaf nodes; and nodes 2 and 3 are intermediate nodes. Leaf nodes (nodes 4-8) can no longer be split on an attribute, whereas the intermediate nodes (nodes 2 and 3) and the root node (node 1) can; when a node can be split according to a certain attribute, that attribute is called a splitting attribute.
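A decision tree of the kind shown in Fig. 2 can be represented with a small node structure. The gene index and threshold below are hypothetical, chosen only to illustrate how a splitting attribute routes a sample from the root to a leaf:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """A decision-tree node: leaves carry a class label, internal
    nodes split on one gene column (the splitting attribute)."""
    label: Optional[str] = None        # set on leaf nodes only
    split_gene: Optional[int] = None   # column index of the splitting attribute
    threshold: Optional[float] = None
    children: List["Node"] = field(default_factory=list)

def predict(node, x):
    """Walk from the root to a leaf, branching on each splitting attribute."""
    if node.label is not None:
        return node.label
    child = node.children[0] if x[node.split_gene] <= node.threshold \
        else node.children[1]
    return predict(child, x)

# A 3-node tree in the spirit of Fig. 2 (hypothetical gene 0, threshold 100).
tree = Node(split_gene=0, threshold=100.0,
            children=[Node(label="normal"), Node(label="tumour")])
print(predict(tree, [219.0, 102.5]))  # tumour
```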
In the embodiment of the present application, the splitting attributes may be determined in various ways: for example, the classification model construction apparatus may determine the plurality of splitting attributes of the target feature information through information gain, or through a statistical method. The embodiment of the present application does not limit how the splitting attributes are determined.
Step 130, determining weight values corresponding to the plurality of splitting attributes respectively, and obtaining a part of target splitting attributes with the largest weight values from the plurality of splitting attributes.
The weight value represents the class discrimination of a splitting attribute, i.e., the degree to which the splitting attribute can discriminate between gene expression data classes.
In the embodiments provided by the application, the classification model construction apparatus can quantify this discrimination, representing the discrimination of each splitting attribute by its weight value. The higher the weight value, the higher the discrimination of the splitting attribute; conversely, the smaller the weight value, the lower its discrimination.
Illustratively, in the decision tree model of Fig. 2, the attributes of node 2 and node 3 are both splitting attributes, but splitting on node 2's attribute yields two child nodes while splitting on node 3's attribute yields three. The splitting attribute of node 3 therefore discriminates better than that of node 2, and the weight value of node 3 is greater than that of node 2.
Further, the classification model construction apparatus may determine the weight value of each splitting attribute in several ways. In the embodiments provided by the application, the weight value of each splitting attribute may be obtained by calculating its Gini coefficient, its information gain, or its information gain ratio. The embodiment of the present application does not limit how the splitting attribute weight values are determined.
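As one of the scoring options above, the Gini-based weight of a splitting attribute can be computed as the impurity decrease of the split it induces; the column values, labels, and threshold below are toy data for illustration:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a label multiset."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def split_weight(column, labels, threshold):
    """Impurity decrease of a binary split on one gene column - one way
    to score the 'weight' (class discrimination) of a split attribute."""
    left = [lab for v, lab in zip(column, labels) if v <= threshold]
    right = [lab for v, lab in zip(column, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

col = [219, 180, 45, 60]
labels = ["tumour", "tumour", "normal", "normal"]
print(split_weight(col, labels, threshold=100))  # 0.5: a perfect split
```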
In the embodiments provided by the application, after the weight value of each splitting attribute is determined, the weight values may be sorted in descending order and the top-ranked splitting attributes selected as the target splitting attributes.
For example, the top 10-100 splitting attributes, or the top 5%-10% of splitting attributes, may be selected as the target splitting attributes. The embodiment of the present application does not limit how the top-ranked splitting attributes are selected.
It should be noted that the number of selected splitting attributes depends on the target object sample set: different sample sets call for different numbers of selected splitting attributes, and the number can be adjusted accordingly.
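Selecting the target splitting attributes then reduces to a top-k (or top-percentage) pick over the weight values, e.g.:

```python
def top_k_attributes(weights, k):
    """Indices of the k splitting attributes with the largest weights,
    sorted by weight in descending order (step 130)."""
    return sorted(range(len(weights)), key=weights.__getitem__,
                  reverse=True)[:k]

# Weights per splitting attribute, e.g. Gini-based impurity decreases.
weights = [0.05, 0.42, 0.11, 0.37, 0.02]
print(top_k_attributes(weights, k=2))  # [1, 3]

# Percentage-based selection (e.g. top 10%) is the same pick with
# k = max(1, int(0.10 * len(weights))).
```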
Step 140: construct a classification model of the target object based on the target feature information and the target splitting attributes.
In the embodiments provided in the present application, the classification model construction apparatus may build a classification model of the target object from the subset of target splitting attributes selected from the plurality of splitting attributes and the target feature information extracted from the target object sample set. Specifically, the target feature information and the target splitting attributes may be used as the input of the classification model, which is trained to obtain the final trained classification model. The classification model may be based on a decision tree, a random forest, or a deep forest; the embodiments of the present application are not limited herein.
It can be understood that the classification model construction method provided by the embodiment of the present application does not build a classification model from all splitting attributes and the original target object sample set. Instead, it performs feature extraction on the gene expression data sample set to obtain target feature information and selects from the plurality of splitting attributes the subset with the largest weight values (i.e., the highest discrimination); building the classification model from the extracted target feature information and the selected splitting attributes greatly reduces the data dimensionality during training and shortens the model's learning and training time.
Thus, in the classification model construction method provided by the embodiment of the application, a target object sample set is acquired, its target feature information is extracted, and a plurality of splitting attributes of that feature information are determined; a weight value is determined for each splitting attribute, and the subset of target splitting attributes with the largest weight values is selected; a classification model of the target object is then constructed based on the target splitting attributes and the target feature information. Feature extraction reduces the dimensionality of the original sample set, and building the model only on the most discriminative splitting attributes greatly reduces the data dimensionality during training, improving training efficiency, shortening training time, and reducing computational overhead.
Based on the foregoing embodiments, in the classification model construction method provided in the embodiments of the present application, step 140 may be implemented by steps 1401 and 1402. Referring to fig. 3, a second schematic flow diagram of the classification model construction method, step 140 specifically includes:
Step 1401: take the target feature information and the target split attributes as the input of the deep forest model, and train and test each level of cascade forest in the deep forest model to obtain a trained deep forest model.
Step 1402: take the trained deep forest model as the classification model of the target object.
In a possible implementation, the classification model mentioned in the embodiments of the present application is a deep forest model. The deep forest model is an ensemble forest model that extends the traditional random forest model in both breadth and depth.
In practice, ensemble learning is a major research direction in machine learning; its main idea is to combine multiple base learners into one strong classifier to improve classification performance. According to related research, the accuracy and generalization of an ensemble learning model can be improved in two ways: first, by guaranteeing the diversity and representativeness of the features; second, by guaranteeing the diversity of the base learners. In recent years, a new ensemble learning model has been proposed that borrows the multi-layer feature-extraction idea of deep learning and expands random forests into a multi-layer cascade of forests to improve classification; this is known as the deep forest model. In the embodiments provided by the present application, the deep forest model is applied to classification problems where the target objects have few samples and high sample dimensionality, for example cancer- or disease-related classification, to improve classification accuracy.
In the embodiments provided herein, the deep forest model may include N levels of cascade forests, where N is an integer greater than 1. Each cascade level contains M random forests, and each random forest contains L decision trees.
Specifically, the classification model construction apparatus may pre-configure initial values for the hyper-parameters of the deep forest model. A hyper-parameter is a parameter set before the classification model is trained, rather than learned from the training data. In the embodiments provided by the present application, the hyper-parameters of the deep forest model include at least the maximum number of cascade levels N, the number M of random forests in each cascade level, and the number L of decision trees in each random forest.
In a possible implementation, before the target feature information and the target split attributes are used as the input of the deep forest model, the method further includes the following steps:
Step 1400a: receive configuration information for the deep forest model;
Step 1400b: based on the configuration information, determine the maximum number N of cascade levels, the number M of random forests per level, and the number L of decision trees per random forest in the deep forest model.
It can be understood that the classification model construction apparatus may receive configuration information for the deep forest model entered by a user, or sent by a third-party platform, and configure the maximum number of cascade levels, the number of random forests in each cascade level, and the number of decision trees in each random forest accordingly.
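The hyper-parameter configuration described above can be sketched as follows. This is an illustrative sketch only: the class and key names (`DeepForestConfig`, `"N"`, `"M"`, `"L"`) and the default values are assumptions, not part of the original disclosure.

```python
from dataclasses import dataclass

@dataclass
class DeepForestConfig:
    max_levels: int = 4         # N: maximum number of cascade levels
    forests_per_level: int = 4  # M: random forests per cascade level
    trees_per_forest: int = 50  # L: decision trees per random forest

def config_from_info(info):
    """Build the hyper-parameter config from received configuration information."""
    return DeepForestConfig(
        max_levels=info.get("N", 4),
        forests_per_level=info.get("M", 4),
        trees_per_forest=info.get("L", 50),
    )

cfg = config_from_info({"N": 3, "M": 4, "L": 100})
```

The same function handles configuration entered by a user or received from a third-party platform, since both arrive as a plain mapping of the three hyper-parameters.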
Further, after the initial values of the hyper-parameters of the deep forest model are configured, the target feature information obtained in step 110 and the target split attributes determined in step 130 are used as the input of the deep forest model to train and test it, yielding the trained deep forest model.
Based on the foregoing embodiment, in the classification model construction method provided in the embodiments of the present application, step 1401 may be implemented by steps 1401a to 1401c. Referring to fig. 4, a third schematic flow diagram of the classification model construction method, step 1401 specifically includes the following steps:
Step 1401a: input the target feature information into the first-level cascade forest of the deep forest model, and train and test the first-level cascade forest to obtain the class-1 vector;
Step 1401b: take the i-th class vector, together with the split feature vector corresponding to the target split attributes, as the input of the (i+1)-th level cascade forest, and train and test the (i+1)-th level cascade forest to obtain the (i+1)-th class vector; the split feature vector represents the features of the target feature information that fall under the target split attributes;
Step 1401c: continue to take the (i+1)-th class vector and the split feature vector corresponding to the target split attributes as the input of the (i+2)-th level cascade forest, and train and test it, until the training and testing of the N-th level cascade forest are finished; where i is an integer greater than or equal to 1 and less than N-1.
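Steps 1401a to 1401c can be sketched roughly as below, using scikit-learn random forests as stand-ins for the cascade members. All names and the toy data are illustrative assumptions; the averaged `predict_proba` output plays the role of the class vector, and the concatenation with the split feature vector implements the modified transfer between levels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))            # target feature information (toy data)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # two-class labels
split_idx = [0, 1, 2]                    # assumed indices of the target split attributes
split_vec = X[:, split_idx]              # split feature vector passed between levels

def level_class_vector(X_in, y, n_forests=2, n_trees=20):
    """Train one cascade level (n_forests random forests) and return the
    class vector: the per-sample class probabilities averaged over forests."""
    probas = [
        RandomForestClassifier(n_estimators=n_trees, random_state=s)
        .fit(X_in, y).predict_proba(X_in)
        for s in range(n_forests)
    ]
    return np.mean(probas, axis=0)

# Level 1 sees only the target feature information (step 1401a).
class_vec = level_class_vector(X, y)
# Levels 2..N see the previous class vector concatenated with the
# split feature vector (steps 1401b-1401c).
for _ in range(2):
    class_vec = level_class_vector(np.hstack([class_vec, split_vec]), y)

pred = class_vec.argmax(axis=1)          # final class by maximum probability
```

Note that each level's input grows only by the number of classes plus the number of selected split attributes, which is what keeps the dimensionality matched to the small sample size.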
In practice, because target object samples are few and of high dimensionality, simple classification models (such as a KNN classifier or an autoencoder-based classifier) perform poorly; and with high-dimensional samples it is also very easy to fall into overfitting.
For this reason, the embodiments of the present application jointly consider the complexity of the classification model, the data volume, and the data dimensionality, so that the data dimensionality matches the model complexity. The standard deep forest model applies a multi-granularity scanning strategy to the original sample set, which can be effective when the sample volume is sufficient and the data dimensionality is low; but for target objects with few samples and very high dimensionality, multi-granularity scanning only makes matters worse, aggravating the mismatch between data dimensionality and model complexity. Therefore, to make the deep forest model better fit the sample characteristics of the target object, the classification model construction method provided by the embodiments of the present application retains only the cascade-forest part of the deep forest model and discards its multi-granularity scanning.
In addition, to further improve accuracy, the embodiments of the present application modify the vectors passed between the cascade levels of the deep forest model. The training process of the deep forest is described in detail below with reference to the structural diagram of the deep forest model shown in fig. 5, which illustrates a model in which each cascade level contains four random forests: random forest 1, random forest 2, random forest 3 and random forest 4.
Specifically, referring to fig. 5, the classification model construction apparatus inputs the target feature vector into the level-1 cascade forest 501. Each random forest in the level-1 cascade forest 501 computes, for the leaf nodes into which the target feature information falls, the percentage of each class, and averages these over all trees in the forest to produce a class estimate; the estimates from the forests form the class-1 vector 502.
Next, the class-1 vector 502 output by the level-1 cascade forest 501 is combined with the split feature vector 504 corresponding to the target split attributes 503 selected from the split attributes, and the combined vector is input into the level-2 cascade forest 505. Each random forest in the level-2 cascade forest 505 likewise computes the per-class percentages at the nodes the input data falls into and averages over all its trees, producing the class-2 vector 506.
It is noted that the split feature vector 504 corresponding to the target split attributes 503 is the set of all features of the target feature information that fall under the target split attributes. Here, combining the class-1 vector 502 output by the level-1 cascade forest 501 with the split feature vector 504 specifically means concatenating the two vectors.
Further, the vector obtained by combining the class-2 vector 506 with the split feature vector 504 corresponding to the target split attributes 503 is used as the input of the level-3 cascade forest (not shown in the figure) to obtain the class-3 vector. The class-3 vector is again combined with the split feature vector 504 to train and test the next cascade level, and so on, until every cascade level of the deep forest model has been trained and tested. The N-th class vector output by the N-th level is then averaged and its maximum taken, yielding the final trained deep forest model for predicting the class of the target object.
The classification model construction apparatus can sort the split attributes of the target object sample set by weight, select the largest part of them, concatenate them with the class vector output by each cascade level into a new vector, and pass that vector to the next cascade level. Training the classification model on only the highest-weight split attributes greatly reduces the training data dimensionality and, to some extent, the training time. Moreover, compared with the original target object set, in which a single sample may contain thousands or tens of thousands of characteristic genes, selecting the split feature information corresponding to the most discriminative split attributes is more helpful for clinically judging which genes are related to a given cancer and for discovering feature genes of biological significance.
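One plausible way to realise the weight-based ranking is sketched below, under the assumption that the weight of a split attribute can be approximated by a random-forest feature importance (the original weight definition may differ); all names and the toy data are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 30))                 # toy target feature information
y = (X[:, 3] - X[:, 7] > 0).astype(int)       # classes driven by attributes 3 and 7

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
weights = rf.feature_importances_             # stand-in for the split-attribute weights
top_x = 5                                     # X, tunable (the text suggests 10-100)
target_split_idx = np.argsort(weights)[::-1][:top_x]
split_feature_vec = X[:, target_split_idx]    # split feature vector for the cascade
```

The `split_feature_vec` obtained this way is what gets concatenated with each level's class vector before being passed to the next cascade level.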
Based on the foregoing embodiment, in the classification model construction method provided in the embodiment of the present application, after the target object sample set is obtained in step 110, the method further includes the following steps:
dividing a target object sample set into K mutually disjoint subsets;
selecting K-1 subsets from the K subsets as training sample sets, and selecting the rest subsets as test sample sets;
Correspondingly, step 1401, taking the target feature information and the target split attributes as the input of the deep forest model and training and testing each cascade level of the deep forest model to obtain the trained deep forest model, includes the following steps:
training each cascade level of the deep forest model with the target feature information and target split attributes corresponding to the training sample set, to obtain an initial deep forest model;
testing and adjusting each cascade level of the initial deep forest model with the target feature information and target split attributes corresponding to the test sample set, to obtain the trained deep forest model.
In practice, researchers typically divide the raw data set into a training set and a test set. The training set is used to train the model; the test set is used to evaluate the model's generalization ability, i.e., its adaptability to unseen samples. In the embodiments of the present application, because the target object sample set is small, the deep forest model is trained and tested according to the K-fold cross-testing principle.
Specifically, the target object sample set is divided into K mutually disjoint subsets; K-1 of the subsets are selected as the training sample set and the remaining subset as the test sample set.
Then, each cascade level of the deep forest model is trained with the target feature information and target split attributes corresponding to the training sample set, yielding an initial deep forest model; further, each cascade level of the initial deep forest model is tested and updated with the target feature information and target split attributes corresponding to the test sample set, yielding the trained deep forest model.
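The K-fold partition described above can be sketched with plain NumPy; the function name and seed are illustrative assumptions.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Split sample indices into k mutually disjoint subsets."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, k)

folds = k_fold_indices(n_samples=23, k=5)
test_fold = folds[0]                    # one held-out subset as the test sample set
train_idx = np.concatenate(folds[1:])   # remaining K-1 subsets as the training sample set
```

Rotating which fold is held out gives K train/test splits, so every sample is used for testing exactly once despite the small data volume.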
It should be noted that the training and testing of each cascade level in the deep forest model are the same as in steps 1401a, 1401b and 1401c, and will not be described again here.
Based on the foregoing embodiments, the target object in the present embodiment may be gene expression data. Accordingly, before step 110 of obtaining the target object sample set, the classification model construction method provided in the embodiments of the present application further includes the following steps:
Step 101: obtain a plurality of gene expression data;
Step 102: preprocess the plurality of gene expression data to obtain the target object sample set; the preprocessing includes missing-value processing and/or normalization of the gene expression data.
In the examples provided in the present application, gene expression data obtained from a gene chip contain missing values, which therefore need to be handled. Meanwhile, to reduce overfitting of the trained classification model, the gene expression data generally also need to be normalized.
Accordingly, after acquiring the plurality of gene expression data, the classification model construction apparatus preprocesses each of them; the preprocessing mainly includes missing-value processing and/or normalization. Missing-value processing works as follows: if the number of missing values in a sample exceeds a preset threshold, the gene expression data sample is discarded; otherwise, the missing values are filled in. There are various ways to fill missing values; in one possible embodiment, the gene expression data are arranged in the format described in table 1, the mean of the column containing each missing value is computed, and that mean is filled in at the missing position.
The standardization process, also referred to as normalization, maps each numerical value in the gene expression data into (0, 1). Finally, the preprocessed gene expression data are arranged in the format described in table 1 to obtain the target object sample set.
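A minimal sketch of the preprocessing just described, under the assumption that min-max scaling is the intended normalization (it maps values into the closed interval [0, 1] rather than the strictly open (0, 1)); the threshold and helper name are illustrative.

```python
import numpy as np

def preprocess(samples, max_missing=1):
    """Discard samples with too many missing values, fill the remaining gaps
    with the column mean, then min-max scale each column into [0, 1]."""
    X = np.asarray(samples, dtype=float)
    X = X[np.isnan(X).sum(axis=1) <= max_missing]      # drop heavily-missing samples
    col_mean = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_mean, X)             # fill gaps with column means
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)  # guard constant columns

X = preprocess([[1.0, np.nan, 3.0],
                [2.0, 4.0, np.nan],
                [3.0, 6.0, 9.0]])
```

Each row stands for one gene expression sample and each column for one gene, matching the table 1 layout described in the text.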
In a possible implementation manner, the extracting target feature information of the target object sample set in step 110 specifically includes:
extracting target feature information of a target object sample set according to a preset feature selection method; the preset feature selection method is used for reducing the dimension of the target object.
Because the dimensionality of the target object sample set is high, dimensionality reduction is performed by extracting the target feature information of the target object sample set.
Specifically, the preset feature selection method includes at least one of:
the t-test method, Fisher's discriminant method, the class-related feature method, and the genetic algorithm.
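As an illustration of the first option, a t-test-style selection can be sketched as follows, ranking genes by the absolute two-sample t statistic (Welch form, computed directly with NumPy); all names and the toy data are assumptions.

```python
import numpy as np

def t_statistic(X, y):
    """Per-gene two-sample t statistic (Welch form) between classes 0 and 1."""
    a, b = X[y == 0], X[y == 1]
    se2 = a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(se2)

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 10))            # 40 samples, 10 toy genes
y = np.repeat([0, 1], 20)
X[y == 1, 0] += 3.0                      # make gene 0 strongly class-related
scores = np.abs(t_statistic(X, y))
selected = np.argsort(scores)[::-1][:3]  # keep the 3 most discriminative genes
```

The retained columns form the target feature information, reducing the dimensionality before the cascade forest is trained.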
In the classification model construction method provided by the embodiments of the present application, the split attributes of the target object sample set are sorted by weight, the largest part of them is combined with the class vector output by each cascade level into a new vector, and this vector is passed to the next cascade level. Training the classification model on only the highest-weight split attributes greatly reduces the training data dimensionality and, to some extent, the training time; and compared with the original target object set, in which a single sample may contain thousands or tens of thousands of features, selecting the split feature information corresponding to the most discriminative split attributes improves classification accuracy. It can be understood that when the target object is gene expression data, the method is also more helpful for clinically judging which genes are related to a given disease or cancer and for discovering feature genes of biological significance.
The classification model construction method provided by the embodiment of the present application is described in detail below with reference to specific scenarios.
Based on the above embodiments, the classification model construction method provided by the embodiments of the present application can be applied to the system architecture shown in fig. 6. Referring to fig. 6, the method comprises two stages, a model training stage and a model testing stage, each involving five steps: gene expression data preprocessing, K-fold cross-test partitioning, target feature information selection, classification model training (or classification model testing), and classification model evaluation. Each step is described in detail below.
(1) Gene expression data preprocessing.
Gene expression data obtained from a gene chip contain missing values, which need to be handled; meanwhile, to reduce overfitting of the trained model, the data generally also need to be normalized. Preprocessing of gene expression data therefore includes missing-value processing and normalization. Missing values are filled with the mean of the feature column in which they occur; samples with too many missing values are discarded. For normalization, the data values of the gene expression data are mapped into (0, 1).
(2) K-fold cross-test partitioning. Cross-testing is a statistically motivated method of partitioning a data sample into subsets. In typical data mining and machine learning research, the raw data set is divided into a training set, used to train the model, and a test set, used to evaluate its generalization ability. However, because gene expression data sets are small, a single such split often yields unreliable results; in the embodiments of the present application, the deep forest model is therefore trained and tested according to the K-fold cross-testing principle.
Specifically, dividing a gene expression data sample set into K mutually-disjoint subsets;
and selecting K-1 subsets from the K subsets as a training sample set, and selecting the rest subsets as a test sample set.
(3) Target feature information selection.
Because the gene expression data samples are of high dimensionality, dimensionality reduction is performed by extracting target feature information from the gene expression data sample set.
(4) Classification model training and testing.
In the embodiments provided by the present application, for the training process of the classification model, the classification model construction apparatus may first determine a plurality of split attributes of the target feature information corresponding to the training sample set; determine the weight value of each split attribute in the training sample set; select from the plurality of split attributes the part of target split attributes with the largest weight values; and input the target feature information and target split attributes corresponding to the training sample set into the deep forest model for training, obtaining an initial deep forest model.
Further, after the initial deep forest model is obtained, it is tested. Here, the classification model construction apparatus may determine a plurality of split attributes of the target feature information corresponding to the test sample set; determine the weight value of each split attribute in the test sample set, and select the part of target split attributes with the largest weight values; then input the target feature information and target split attributes of the test sample set into the initial deep forest model and update and adjust its parameters, obtaining the trained deep forest model, which is taken as the classification model for gene expression data.
It should be noted that in both the training process on the training sample set and the testing process on the test sample set, the split attributes corresponding to the respective sample set are determined first, and the top X attributes are combined with the class vector output by each level of the deep forest model (or initial deep forest model) into a new vector that is passed to the next cascade level; each cascade level of the model is trained and tested in this way. This reduces training time to some extent and helps discover feature genes of biological significance. In the embodiments provided by the present application, X may take a value of 10-100; the specific value can be adjusted for different gene expression data sample sets, since different sample sets call for different numbers of selected split attributes.
It should be noted that some initial hyper-parameter values are set before training, for example the number of random forests in each cascade level, the number of decision trees in each random forest, and the number of cascade levels of the deep forest model. Since the improved deep forest has no multi-granularity scanning operation, these three hyper-parameters are the most important.
(5) Classification model evaluation. Classification model evaluation assesses the accuracy and reliability of the classification model. Commonly used evaluation indices include the classification prediction accuracy, the area under the ROC curve (AUC), the confusion matrix, and so on; in cancer/disease classification problems, the most common are the accuracy and recall of the classification.
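The evaluation step can be sketched as below; precision is computed alongside the accuracy and recall named in the text, and the function name is an illustrative assumption.

```python
def evaluate(y_true, y_pred, positive=1):
    """Accuracy, precision and recall for a binary classification result."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

accuracy, precision, recall = evaluate([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

In the cancer/disease setting, recall matters most where a missed positive case is costly, which is why the text singles it out.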
In summary, the classification model construction method provided by the embodiments of the present application obtains a gene expression data sample set, extracts its target feature information, and determines a plurality of split attributes of that information; it then determines the weight value of each split attribute, which characterizes the class discrimination of the attribute, and selects the part of target split attributes with the largest weight values; finally, it constructs the classification model for gene expression data based on the target split attributes and the target feature information. Feature extraction reduces the dimensionality of the original gene expression data sample set, and selecting only the highest-weight (i.e., most discriminative) split attributes for model construction further reduces the data dimensionality during training, which improves training efficiency, shortens training time, and reduces computational overhead.
Based on the foregoing embodiments, an embodiment of the present application provides a classification model building apparatus, as shown in fig. 7, the apparatus includes:
a feature extraction unit 71, configured to obtain a target object sample set and extract target feature information of the target object sample set;
a split attribute determining unit 72 for determining a plurality of split attributes of the target feature information; wherein the splitting attribute is used for characterizing the attribute of the class splitting node in the target object sample set;
a weight value determining unit 73, configured to determine weight values corresponding to the plurality of splitting attributes respectively, and obtain a part of target splitting attributes with a largest weight value from the plurality of splitting attributes; the weight value is used for representing the category discrimination of the splitting attribute;
and the processing unit 74 is configured to establish a classification model of the target object based on the target splitting attribute and the target feature information.
Optionally, the classification model is a deep forest model, the deep forest model comprising N levels of cascade forests; N is an integer greater than 1;
the processing unit 74 is specifically configured to use the target feature information and the target splitting attribute as input of a deep forest model, and train and test each level of the cascade forest in the deep forest model to obtain a trained deep forest model; and taking the trained deep forest model as a classification model of the target object.
Optionally, the processing unit 74 is further configured to input the target feature information into the first-level cascade forest of the deep forest model, train and test the first-level cascade forest, and obtain the class-1 vector; take the i-th class vector and the split feature vector corresponding to the target split attributes as the input of the (i+1)-th level cascade forest, and train and test the (i+1)-th level cascade forest to obtain the (i+1)-th class vector, where the split feature vector represents the features of the target feature information that fall under the target split attributes; and continue to take the (i+1)-th class vector and the split feature vector corresponding to the target split attributes as the input of the (i+2)-th level cascade forest, training and testing it, until the training and testing of the N-th level cascade forest are finished; where i is an integer greater than or equal to 1 and less than N-1.
Optionally, the classification model building apparatus further includes an obtaining unit 75, configured to receive configuration information for the deep forest model;
and the processing unit 74 is configured to determine, based on the configuration information, the maximum number N of layers of the cascaded forests, the number M of random forests in each hierarchical cascaded forest, and the number L of decision trees of each random forest.
Optionally, the target object comprises gene expression data.
Optionally, the obtaining unit 75 is further configured to obtain a plurality of gene expression data;
the processing unit 74 is further configured to perform preprocessing on the multiple gene expression data to obtain the target object sample set; the pretreatment comprises the following steps: performing vacancy value processing and/or normalization processing on the gene expression data.
Optionally, the feature extraction unit 71 is configured to extract target feature information of the target object sample set according to a preset feature selection method; the preset feature selection method is used for reducing the dimension of the target object.
Optionally, the preset feature selection method includes at least one of:
t test method, Fisher's discriminant method, class-related feature method, and genetic algorithm.
Optionally, the processing unit 74 is configured to divide the target object sample set into K mutually disjoint subsets; wherein K is an integer greater than 1; selecting K-1 subsets from the K subsets as training sample sets, and selecting the rest subsets as test sample sets; training each level of cascade forest of the deep forest model through target characteristic information and target splitting attributes corresponding to the training sample set to obtain an initial deep forest model; and testing and updating each level of the cascade forest of the initial deep forest model through the target characteristic information and the target splitting attribute corresponding to the test sample set to obtain the trained deep forest model.
In summary, the classification model construction apparatus provided by the embodiments of the present application obtains a target object sample set, extracts its target feature information, and determines a plurality of split attributes of that information; it then determines the weight value of each split attribute, which characterizes the class discrimination of the attribute, and selects the part of target split attributes with the largest weight values; finally, it constructs the classification model of the target object based on the target split attributes and the target feature information. Feature extraction reduces the dimensionality of the original target object sample set, and selecting only the highest-weight (i.e., most discriminative) split attributes for model construction greatly reduces the data dimensionality during training, improving training efficiency, shortening training time, and reducing computational overhead.
Based on the implementation of the units in the classification model construction apparatus above, and in order to implement the classification model construction method provided in the embodiments of the present application, an embodiment of the present application further provides an electronic device. As shown in fig. 8, the electronic device 80 includes: a processor 81 and a memory 82 configured to store a computer program capable of running on the processor,
wherein the processor 81 is configured to perform the method steps in the preceding embodiments when running the computer program.
In practice, the various components of the electronic device 80 are coupled together by a bus system 83, as shown in fig. 8. It will be appreciated that the bus system 83 is used to enable connection and communication among these components. In addition to the data bus, the bus system 83 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as bus system 83 in fig. 8.
In an exemplary embodiment, the present application further provides a computer-readable storage medium, such as the memory 82 shown in fig. 8, which includes a computer program executable by the processor 81 of the electronic device 80 to perform the steps of the foregoing method. The computer-readable storage medium may be a ferroelectric random access memory (FRAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM), among other memories.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application.

Claims (12)

1. A classification model construction method, characterized in that the method comprises:
acquiring a target object sample set, and extracting target characteristic information of the target object sample set;
determining a plurality of split attributes of the target feature information; wherein the splitting attribute is used for characterizing the attribute of the class splitting node in the target object sample set;
determining weight values corresponding to the plurality of splitting attributes respectively, and acquiring partial target splitting attributes with the maximum weight values from the plurality of splitting attributes; the weight value is used for representing the category discrimination of the splitting attribute;
and constructing a classification model of the target object based on the target characteristic information and the target splitting attribute.
2. The method of claim 1, wherein the classification model is a deep forest model, and the deep forest model comprises N levels of cascade forests; N is an integer greater than 1;
the constructing a classification model of the target object based on the target split attribute and the target feature information includes:
taking the target feature information and the target splitting attributes as the input of the deep forest model, and training and testing each level of cascade forest in the deep forest model to obtain a trained deep forest model;
and taking the trained deep forest model as a classification model of the target object.
3. The method as claimed in claim 2, wherein the taking the target feature information and the target splitting attributes as the input of the deep forest model, and training and testing each level of cascade forest in the deep forest model to obtain a trained deep forest model comprises:
inputting the target feature information into the first-level cascade forest of the deep forest model, and training and testing the first-level cascade forest to obtain a 1st class vector;
taking the i-th class vector and the splitting feature vector corresponding to the target splitting attributes as the input of the (i+1)-th level cascade forest, and training and testing the (i+1)-th level cascade forest to obtain an (i+1)-th class vector; the splitting feature vector is used for characterizing the feature vector in the target feature information that corresponds to the target splitting attributes;
continuing to take the (i+1)-th class vector and the splitting feature vector corresponding to the target splitting attributes as the input of the (i+2)-th level cascade forest, and training and testing the (i+2)-th level cascade forest, until the training and testing of the N-th level cascade forest are finished; wherein i is an integer greater than or equal to 1 and less than N-1.
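The cascade described in claims 2 and 3 can be sketched as follows (an illustrative gcForest-style loop; the forest counts, tree counts, and function name are placeholders, not the patent's exact implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_cascade(X_split, y, n_levels, n_forests=2, seed=0):
    """Each level after the first receives the splitting feature vector
    concatenated with the previous level's class vector (the averaged
    class probabilities of that level's forests)."""
    levels, class_vec = [], None
    for level in range(n_levels):
        feats = X_split if class_vec is None else np.hstack([X_split, class_vec])
        forests = [
            RandomForestClassifier(
                n_estimators=20, random_state=seed + 10 * level + j
            ).fit(feats, y)
            for j in range(n_forests)
        ]
        class_vec = np.mean([f.predict_proba(feats) for f in forests], axis=0)
        levels.append(forests)
    return levels, class_vec

rng = np.random.default_rng(0)
X_split = rng.normal(size=(60, 5))       # splitting feature vectors
y = (X_split[:, 0] > 0).astype(int)
levels, final_vec = train_cascade(X_split, y, n_levels=3)
```

Re-injecting the splitting feature vector at every level keeps the most discriminative raw attributes available alongside the learned class vectors as the cascade deepens.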
4. The method as claimed in claim 2 or 3, wherein before the taking the target feature information and the target splitting attributes as the input of the deep forest model, the method further comprises:
receiving configuration information for the depth forest model;
and determining the maximum number of levels N of the cascade forests, the number M of random forests in each level of cascade forest, and the number L of decision trees in each random forest based on the configuration information.
5. The method of any one of claims 1-3, wherein the target object comprises gene expression data.
6. The method of claim 5, wherein before the acquiring a target object sample set, the method comprises:
obtaining a plurality of gene expression data;
preprocessing the plurality of gene expression data to obtain the target object sample set; the preprocessing comprises: performing missing value processing and/or normalization processing on the gene expression data.
7. The method according to any one of claims 1-3, wherein the extracting target feature information of the target object sample set comprises:
extracting target feature information of the target object sample set according to a preset feature selection method; the preset feature selection method is used for reducing the dimension of the target object.
8. The method of claim 7, wherein the preset feature selection method comprises at least one of:
a t-test method, a Fisher discriminant method, a class-correlation feature method, and a genetic algorithm.
9. The method of claim 2, wherein after obtaining the set of target object samples, further comprising:
dividing the target object sample set into K mutually disjoint subsets; wherein K is an integer greater than 1;
selecting K-1 subsets from the K subsets as training sample sets, and selecting the rest subsets as test sample sets;
the method for training and testing each layer of random forest in the deep forest model by taking the target characteristic information and the target splitting attribute as the input of the deep forest model comprises the following steps:
training each level of the cascade forest of the deep forest model through the target feature information and the target splitting attributes corresponding to the training sample set to obtain an initial deep forest model;
and testing and updating each level of the cascade forest of the initial deep forest model through the target feature information and the target splitting attributes corresponding to the test sample set to obtain the trained deep forest model.
10. A classification model construction apparatus, characterized in that the apparatus comprises:
a feature extraction unit, configured to acquire a target object sample set and extract target feature information of the target object sample set;
a splitting attribute determination unit, configured to determine a plurality of splitting attributes of the target feature information; wherein the splitting attributes are used for characterizing the attributes of the class splitting nodes in the target object sample set;
a weight value determination unit, configured to determine weight values corresponding to the plurality of splitting attributes respectively, and to acquire the partial target splitting attributes with the largest weight values from the plurality of splitting attributes; the weight value is used for characterizing the category discrimination of the splitting attribute;
and a processing unit, configured to construct a classification model of the target object based on the target feature information and the target splitting attributes.
11. An electronic device comprising a processor and a memory for storing a computer program executable on the processor;
wherein the processor is configured to execute the steps of the classification model construction method according to any one of claims 1 to 9 when running the computer program.
12. A computer-readable storage medium having stored thereon a computer program for executing the steps of the method of constructing a classification model according to any one of claims 1 to 9 by a processor.
CN202010648024.3A 2020-07-07 2020-07-07 Classification model construction method and device, electronic equipment and storage medium Pending CN113971984A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010648024.3A CN113971984A (en) 2020-07-07 2020-07-07 Classification model construction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010648024.3A CN113971984A (en) 2020-07-07 2020-07-07 Classification model construction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113971984A true CN113971984A (en) 2022-01-25

Family

ID=79584634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010648024.3A Pending CN113971984A (en) 2020-07-07 2020-07-07 Classification model construction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113971984A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114443849A (en) * 2022-02-09 2022-05-06 北京百度网讯科技有限公司 Method and device for selecting marked sample, electronic equipment and storage medium
CN114443849B (en) * 2022-02-09 2023-10-27 北京百度网讯科技有限公司 Labeling sample selection method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Xie et al. Deep learning based analysis of histopathological images of breast cancer
Zhang et al. Integrated multi-omics analysis using variational autoencoders: application to pan-cancer classification
CN108846259A (en) A kind of gene sorting method and system based on cluster and random forests algorithm
CN112270666A (en) Non-small cell lung cancer pathological section identification method based on deep convolutional neural network
CN113113130A (en) Tumor individualized diagnosis and treatment scheme recommendation method
Inan et al. A hybrid probabilistic ensemble based extreme gradient boosting approach for breast cancer diagnosis
Golugula et al. Evaluating feature selection strategies for high dimensional, small sample size datasets
CN115715416A (en) Medical data inspector based on machine learning
CN116564409A (en) Machine learning-based identification method for sequencing data of transcriptome of metastatic breast cancer
Atlam et al. A new feature selection method for enhancing cancer diagnosis based on DNA microarray
CN113707317B (en) Disease risk factor importance analysis method based on mixed model
Qattous et al. PaCMAP-embedded convolutional neural network for multi-omics data integration
CN111582370B (en) Brain metastasis tumor prognostic index reduction and classification method based on rough set optimization
CN111783088B (en) Malicious code family clustering method and device and computer equipment
CN113971984A (en) Classification model construction method and device, electronic equipment and storage medium
CN117195027A (en) Cluster weighted clustering integration method based on member selection
Mohapatra et al. Automated invasive cervical cancer disease detection at early stage through deep learning
Chellamuthu et al. Data mining and machine learning approaches in breast cancer biomedical research
KR20100001177A (en) Gene selection algorithm using principal component analysis
Chinnaswamy et al. Performance analysis of classifiers on filter-based feature selection approaches on microarray data
Usha et al. Feature Selection Techniques in Learning Algorithms to Predict Truthful Data
AlRefaai et al. Classification of gene expression dataset for type 1 diabetes using machine learning methods
Haines et al. Machine Learning Models for Histopathological Breast Cancer Image Classification
CN117727373B (en) Sample and feature double weighting-based intelligent C-means clustering method for feature reduction
Yaqoob et al. Dimensionality reduction techniques and their applications in cancer classification: a comprehensive review

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination