CN111860671A - Classification model training method and device, terminal equipment and readable storage medium - Google Patents


Info

Publication number
CN111860671A
CN111860671A
Authority
CN
China
Prior art keywords
training
standard
sample
data set
classification model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010738246.4A
Other languages
Chinese (zh)
Inventor
衣杨
李强
梁达安
赵福利
林倩青
周晓聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202010738246.4A priority Critical patent/CN111860671A/en
Publication of CN111860671A publication Critical patent/CN111860671A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a classification model training method and apparatus, a terminal device and a readable storage medium. The method comprises the following steps: preprocessing a training data set to obtain a standard training data set; dividing the standard training samples in the standard training data set into a preset number of categories; performing equalization processing on each category so that the number of standard training samples in each category is consistent; labeling each standard training sample in each category with the label of the corresponding category; and training the classification model using the labeled standard training samples. By classifying the standard training samples and automatically adding labels to them, the invention reduces the time and human resources spent on sample labeling and overcomes the strong subjectivity and high error rate of manual labeling.

Description

Classification model training method and device, terminal equipment and readable storage medium
Technical Field
The invention relates to the field of data mining, in particular to a classification model training method, a classification model training device, terminal equipment and a readable storage medium.
Background
Training samples used in training a classification model are generally labeled manually, and obtaining a classification model with good classification performance requires a large number of labeled training samples, each of which must be labeled accurately. At present, training samples are usually labeled by professional technicians by hand or with the aid of statistical analysis tools. The labeling process is tedious and consumes considerable time and human resources, and manual labeling is highly subjective and error-prone, which affects both the training speed of the classification model and its classification accuracy.
Disclosure of Invention
In view of the above problems, the present invention provides a classification model training method and apparatus, a terminal device and a readable storage medium.
One embodiment of the present invention provides a classification model training method, including:
preprocessing a training data set to obtain a standard training data set;
dividing the standard training samples in the standard training data set into a preset number of categories;
carrying out equalization processing on each category so as to keep the number of standard training samples in each category consistent;
labeling each standard training sample in each category with a label of the corresponding category;
the classification model is trained using standard training samples with labels.
The method for training a classification model according to the foregoing embodiment, where preprocessing a training data set to obtain a standard training data set includes:
carrying out quantization processing on each attribute score in the training data set by using a quantization formula;
carrying out standardization processing on the training data set subjected to quantization processing to obtain a training square matrix;
carrying out eigenvalue decomposition on the training square matrix so as to represent the training square matrix by using eigenvectors and eigenvalues;
selecting a characteristic vector of which the characteristic value is greater than a preset threshold value;
re-scoring the attributes in the training square matrix according to the feature vectors of which the feature values are larger than the preset threshold value to obtain a feature data sample set;
and carrying out normalization processing on each new attribute score in the characteristic data sample set to obtain a standard training data set.
In the classification model training method according to the embodiment, the quantization formula is as follows:

$$x_k^m = 100 \times \frac{a_k^m - a_{\min}^m}{a_{\max}^m - a_{\min}^m}$$

where $x_k^m$ denotes the quantization score of the mth attribute of the kth sample in the training data set, $a_k^m$ denotes the attribute score of the mth attribute of the kth sample in the training data set, $a_{\min}^m$ denotes the minimum attribute score of the mth attribute in the training data set, and $a_{\max}^m$ denotes the maximum attribute score of the mth attribute in the training data set.
In the classification model training method described in the above embodiment, the attributes in the training square matrix are re-scored according to the following formula:

$$y_k^p = \sum_{m=1}^{M} z_m^p \, x_k^m$$

where $y_k^p$ denotes the pth new attribute score of the kth sample in the training square matrix, $x_k^m$ denotes the quantization score of the mth attribute of the kth sample in the training square matrix, $z_m^p$ denotes the mth element value of the eigenvector corresponding to the pth eigenvalue, and the training square matrix is of order M × M.
The classification model training method according to the embodiment described above, where the classification of the standard training samples in the standard training data set into a preset number of classes includes:
randomly selecting the preset number of standard training samples as a clustering center;
respectively calculating the distance between each standard training sample and the preset number of clustering centers;
distributing each standard training sample to the class corresponding to the clustering center with the minimum distance;
calculating a corresponding sample mean value according to the standard training samples in each category;
and if the distance between the sample mean value of a certain class and the clustering center of the class is greater than or equal to a preset distance threshold, taking the sample mean value of each class as a new clustering center to perform the classification operation again until the distance between the sample mean values of the classes with the preset number and the clustering centers of the corresponding classes is smaller than the preset distance threshold.
The method for training a classification model according to the foregoing embodiment, where the equalization processing is performed on each class to keep the number of data samples in each class consistent, includes:
acquiring the number of standard training samples in each category;
and taking the class containing the maximum number of standard training samples as a reference, and supplementing other classes by using an up-sampling method.
The method for training a classification model according to the above embodiment, where the training of the classification model using a standard training sample with a label includes:
dividing the standard training samples with labels into n groups;
sequentially taking the standard training samples included in the ith group in the n groups as a test sample set, and taking the standard training samples in the other groups as training sample sets;
after the ith training is finished, testing the classification model according to the ith group of test sample sets, and calculating the error of the ith test;
after n times of training are finished, calculating the error mean value of n times of tests;
if the error mean value is larger than or equal to a preset error threshold value, adjusting a network structure and/or corresponding parameter values of the classification model;
and continuously training the adjusted classification model by using a standard training sample with a label until the error mean value is smaller than the error threshold value.
Another embodiment of the present invention provides a classification model training apparatus, including:
the data preprocessing module is used for preprocessing the training data set to obtain a standard training data set;
the sample classification module is used for classifying the standard training samples in the standard training data set into a preset number of categories;
the equalization processing module is used for performing equalization processing on each category so as to keep the number of standard training samples in each category consistent;
the class marking module is used for marking each standard training sample in each class with a label of the corresponding class;
and the model training module is used for training the classification model by using the standard training sample with the label.
The above embodiments relate to a terminal device, including a memory and a processor, where the memory is used to store a computer program, and the processor runs the computer program to enable the terminal device to execute the classification model training method according to the above embodiments.
The above embodiments relate to a readable storage medium storing a computer program which, when run on a processor, performs the classification model training method of the above embodiments.
The invention discloses a classification model training method, which comprises the steps of preprocessing a training data set to obtain a standard training data set; dividing the standard training samples in the standard training data set into a preset number of categories; carrying out equalization processing on each category so as to keep the number of standard training samples in each category consistent; labeling each standard training sample in each category with a label of the corresponding category; the classification model is trained using standard training samples with labels. According to the method, the standard training samples are classified, and the labels are automatically added to the standard training samples, so that the time and the human resources for marking the samples are reduced, the problems of strong subjectivity and high error rate of manual marking can be solved, and the training speed of the classification model and the classification accuracy of the classification model can be effectively improved.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the embodiments will be briefly described below, and it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope of the present invention. Like components are numbered similarly in the various figures.
FIG. 1 is a flow chart of a classification model training method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating a pre-processing of a training data set according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating a sort operation provided by an embodiment of the present invention;
FIG. 4 is a flow chart illustrating a classification model training process according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram illustrating a classification model training apparatus according to an embodiment of the present invention.
Description of the main element symbols:
1-a classification model training device; 100-a data preprocessing module; 200-a sample classification module; 300-a balance processing module; 400-category label module; 500-model training module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Hereinafter, the terms "including", "having", and their derivatives, as used in various embodiments of the present invention, are intended only to indicate specific features, numbers, steps, operations, elements, components, or combinations of the foregoing, and should not be construed as excluding the existence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.
Furthermore, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present invention belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in various embodiments of the present invention.
Example 1
In this embodiment, referring to fig. 1, it is shown that a classification model training method includes the following steps:
step S100: the training data set is preprocessed to obtain a standard training data set.
The training data set can generally be obtained from the network by crawler technology, or downloaded directly from dedicated data websites. In either case, the data require targeted preprocessing before they can be used to train a classification model.
Exemplarily, referring to fig. 2, the preprocessing of the training data set comprises the following steps:
step S101: and carrying out quantization processing on each attribute score in the training data set by using a quantization formula.
It should be understood that the training data set includes a plurality of samples, each sample includes a plurality of attributes, and each attribute of each sample corresponds to an attribute score; the training data set can be represented symbolically as in Table 1 below.
[Table 1 (image in original): the attribute score of each attribute of each sample in the training data set]
Further, the following quantization formula may be used to perform quantization processing on each attribute score in the training data set:

$$x_k^m = 100 \times \frac{a_k^m - a_{\min}^m}{a_{\max}^m - a_{\min}^m}$$

where $x_k^m$ denotes the quantization score of the mth attribute of the kth sample in the training data set, $a_k^m$ denotes the attribute score of the mth attribute of the kth sample in the training data set, $a_{\min}^m$ denotes the minimum attribute score of the mth attribute in the training data set, and $a_{\max}^m$ denotes the maximum attribute score of the mth attribute in the training data set; the result may be rounded to the nearest integer so that each quantization score is a natural number between 0 and 100.
Illustratively, the attribute score 70 of attribute 1 of sample 1 in the table above is quantized by substituting it, together with the minimum and maximum scores of attribute 1, into the above formula. [Quantization result (image in original) not reproduced]
After quantization, the attribute scores in Table 1 are converted into quantization scores; accordingly, Table 2 below shows the quantization scores.
[Table 2 (image in original): the quantization scores corresponding to Table 1]
After quantization processing, each attribute score can be represented quantitatively by a natural number between 0 and 100, and the quantization score clearly reflects the importance of an attribute to a sample. As can be seen from Table 2, attribute 1 is most important to sample 2, attribute 2 is most important to sample 1, and attribute M is most important to sample 1.
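As a minimal sketch of this quantization step (assuming the training data set is held as a plain NumPy array; the function name `quantize` and the rounding to the nearest integer are illustrative choices, not mandated by the embodiment):

```python
import numpy as np

def quantize(scores: np.ndarray) -> np.ndarray:
    """Min-max quantize each attribute (column) to a natural number in [0, 100].

    scores: (K, M) array of attribute scores, K samples by M attributes.
    """
    col_min = scores.min(axis=0)  # minimum attribute score of each attribute
    col_max = scores.max(axis=0)  # maximum attribute score of each attribute
    span = np.where(col_max > col_min, col_max - col_min, 1)  # avoid divide-by-zero
    return np.rint(100 * (scores - col_min) / span).astype(int)

# Usage: three samples with two attributes; each column is scaled to span 0..100.
raw = np.array([[70.0, 3.2], [90.0, 1.1], [60.0, 2.5]])
print(quantize(raw))  # [[33, 100], [100, 0], [0, 67]]
```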
For further example, the training data set may be image data: each sample is an image, each attribute is a pixel, and the corresponding attribute score is a pixel value between 0 and 255. Such a training data set may be represented symbolically as in Table 3 below.
[Table 3 (image in original): the pixel-value attribute scores of each image sample]
After quantization, the attribute scores in Table 3 are converted into quantization scores; accordingly, Table 4 below shows the quantization scores.
[Table 4 (image in original): the quantization scores corresponding to Table 3]
For further example, the training data set may be college and university ranking data: each sample is a college or university, each attribute is an evaluation index, and the corresponding attribute score is the score of that evaluation index. Such a training data set may be represented symbolically as in Table 5 below.
[Table 5 (image in original): the evaluation-index scores of each college or university]
After quantization, the attribute scores in Table 5 are converted into quantization scores; accordingly, Table 6 below shows the quantization scores.
[Table 6 (image in original): the quantization scores corresponding to Table 5]
Step S102: and carrying out standardization processing on the training data set subjected to the quantization processing to obtain a training square matrix.
The training data set after quantization processing may be normalized by using data analysis software SPSS to obtain a training square matrix.
Exemplarily, an M × M training square matrix can be generated by importing Table 2 above into the data analysis software SPSS.
Step S103: the training matrix is subjected to eigenvalue decomposition to represent the training matrix with eigenvectors and eigenvalues.
Eigenvalue decomposition is performed on the training square matrix: $R = P I P^T$, where $I$ is a diagonal matrix with the eigenvalues arranged from large to small on the diagonal, $P$ is the eigenvector matrix (each column of $P$ is an eigenvector), and $P^T$ is the transpose of the eigenvector matrix.
Step S104: and selecting the characteristic vector of which the characteristic value is greater than a preset threshold value.
For example, the preset threshold may be set to 1, and a feature vector with a feature value greater than 1 is selected.
Step S105: and re-scoring the attributes in the training square matrix according to the feature vectors of which the feature values are larger than the preset threshold value to obtain a feature data sample set.
Exemplarily, the attributes in the training square matrix are re-scored according to the eigenvectors whose eigenvalues are greater than 1, yielding the feature data sample set.
The attributes in the training square matrix may be re-scored according to the following formula:

$$y_k^p = \sum_{m=1}^{M} z_m^p \, x_k^m$$

where $y_k^p$ denotes the pth new attribute score of the kth sample in the training square matrix, $x_k^m$ denotes the quantization score of the mth attribute of the kth sample in the training square matrix, $z_m^p$ denotes the mth element value of the eigenvector corresponding to the pth eigenvalue, and the training square matrix is of order M × M.
Exemplarily, suppose there are P eigenvectors with eigenvalues greater than 1, and the pth such eigenvector is $[z_1^p, z_2^p, \ldots, z_M^p]$, where $z_m^p$ denotes the mth element value of the eigenvector corresponding to the pth eigenvalue. The quantization score vector of the kth sample is $[x_k^1, x_k^2, \ldots, x_k^M]$, where $x_k^m$ denotes the quantization score of the mth attribute of the kth sample in the training square matrix, with m = 1, 2, …, M.
Step S106: and carrying out normalization processing on each new attribute score in the characteristic data sample set to obtain a standard training data set.
The normalization processing formula (reconstructed here as L2 vector normalization, consistent with the surrounding variable definitions) is as follows:

$$\hat{y}_k^j = \frac{y_k^j}{\sqrt{\sum_{p=1}^{P} (y_k^p)^2}}$$

where $\hat{y}_k^j$ denotes the normalized score of the jth new attribute score of the kth sample in the feature data sample set, $y_k^j$ denotes the jth new attribute score of the kth sample in the feature data sample set, and P denotes the total number of new attribute scores included in the kth sample in the feature data sample set.
After the normalization process, a standard training data set can be obtained for subsequent steps.
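Steps S102 to S106 together amount to a PCA-style feature extraction; a minimal sketch follows. Two points are assumptions rather than statements of the embodiment: the M × M square matrix produced by SPSS is taken here to be the correlation matrix of the quantized attributes, and the final normalization is taken to be L2 vector normalization. The function name `preprocess` is likewise illustrative.

```python
import numpy as np

def preprocess(quantized: np.ndarray, threshold: float = 1.0) -> np.ndarray:
    """Steps S102-S106: square matrix, eigendecomposition, re-scoring, normalization.

    quantized: (K, M) array of quantization scores.
    Returns the (K, P) standard training data set.
    """
    # S102: form an M x M square matrix (assumed: the correlation matrix of the
    # standardized attributes, as an SPSS run would produce).
    R = np.corrcoef(quantized, rowvar=False)              # (M, M)

    # S103: eigenvalue decomposition R = P I P^T (columns of vecs are eigenvectors),
    # with eigenvalues reordered from large to small.
    vals, vecs = np.linalg.eigh(R)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]

    # S104: keep eigenvectors whose eigenvalue exceeds the preset threshold (e.g. 1).
    keep = vals > threshold

    # S105: re-score: y_k^p = sum_m z_m^p * x_k^m, i.e. project onto the kept eigenvectors.
    features = quantized @ vecs[:, keep]                  # (K, P)

    # S106: normalize each sample's new attribute scores (assumed: L2 norm).
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.where(norms > 0, norms, 1)
```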
Step S200: and dividing the standard training samples in the standard training data set into a preset number of categories.
The standard training samples are classified into a preset number of categories, and the number of the categories can be set according to the characteristics of the samples in the standard training data set. Similar standard training samples are aggregated into one class by clustering and classifying the standard training samples in the standard training data set, so that the labels can be conveniently added to the standard training samples in the subsequent process.
Exemplarily, referring to fig. 3, it is shown that the sorting operation comprises the following steps:
step S201: and randomly selecting the preset number of standard training samples as a clustering center.
Exemplarily, 4 standard training samples can be randomly selected as the cluster center according to the characteristics of the standard training samples.
Step S202: and respectively calculating the distance between each standard training sample and the preset number of clustering centers.
Exemplarily, the distance between each standard training sample and 4 cluster centers can be calculated by using an euclidean distance formula or a cosine formula.
Step S203: each standard training sample is assigned to the class corresponding to the cluster center with the smallest distance.
Exemplarily, the distance between each standard training sample and the 4 cluster centers is a first distance, a second distance, a third distance and a fourth distance, respectively, and the first distance, the second distance, the third distance and the fourth distance are compared to assign each standard training sample to the category corresponding to the cluster center with the smallest distance.
Step S204: and calculating the corresponding sample mean value according to the standard training samples in each category.
Exemplarily, after classifying each standard training sample into 4 classes, calculating the sample mean of each class respectively.
Step S205: and if the distance between the sample mean value of a certain class and the clustering center of the class is greater than or equal to a preset distance threshold, taking the sample mean value of each class as a new clustering center to perform the classification operation again until the distance between the sample mean values of the classes with the preset number and the clustering centers of the corresponding classes is smaller than the preset distance threshold.
Further, whether the distance between the sample mean value of each category and the cluster center of the category is greater than or equal to a preset distance threshold value or not is judged, if the distance between the sample mean value of a certain category and the cluster center of the category is greater than or equal to the preset distance threshold value, the sample mean value of each category is taken as a new cluster center to perform the classification operation of the steps S202 to S205 again until the distance between the sample mean values of the preset number of categories and the cluster centers of the corresponding categories is less than the preset distance threshold value.
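A minimal sketch of this clustering operation (steps S201 to S205), assuming Euclidean distance, a NumPy implementation, and 4 clusters as in the running example; a production version would also guard against empty clusters:

```python
import numpy as np

def cluster(samples: np.ndarray, n_clusters: int = 4,
            dist_threshold: float = 1e-4, rng_seed: int = 0) -> np.ndarray:
    """Return a category index for each standard training sample."""
    rng = np.random.default_rng(rng_seed)
    # S201: randomly pick the preset number of samples as initial cluster centers.
    centers = samples[rng.choice(len(samples), n_clusters, replace=False)]
    while True:
        # S202: Euclidean distance from every sample to every cluster center.
        dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
        # S203: assign each sample to the category of the nearest center.
        labels = dists.argmin(axis=1)
        # S204: sample mean of each category.
        means = np.array([samples[labels == c].mean(axis=0)
                          for c in range(n_clusters)])
        # S205: stop once every sample mean is close enough to its cluster center;
        # otherwise the means become the new centers and the assignment repeats.
        if np.all(np.linalg.norm(means - centers, axis=1) < dist_threshold):
            return labels
        centers = means
```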
Step S300: and carrying out equalization processing on each class so as to keep the number of standard training samples in each class consistent.
And carrying out equalization processing on the standard training samples in each class so as to keep the number of the standard training samples in each class consistent.
The number of standard training samples in each class may be obtained first, and then the class containing the largest number of standard training samples is used as a reference to supplement the other classes by an upsampling method. The upsampling method randomly selects a standard training sample from the class and applies a random linear change to it as a whole, thereby generating new standard training samples to fill into the class.
Exemplarily, the specific method for generating a new standard training sample and adding it to the class is as follows: first, randomly select any standard training sample in the class; second, randomly enlarge or reduce all attribute scores of that sample by the same proportion; third, add the enlarged or reduced sample to the class as a new standard training sample; fourth, judge whether the total number of samples in the class is sufficient: if so, stop generating new samples; if not, repeat the first to third steps. A sketch of this procedure follows.
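In this sketch of the equalization step, the ±10% scaling range and the function name `equalize` are illustrative assumptions; only the four sub-steps above are taken from the embodiment:

```python
import numpy as np

def equalize(classes: list[np.ndarray], rng_seed: int = 0) -> list[np.ndarray]:
    """Upsample every class to the size of the largest class."""
    rng = np.random.default_rng(rng_seed)
    target = max(len(c) for c in classes)   # the largest class is the reference
    out = []
    for cls in classes:
        cls = list(cls)                     # list of (M,) sample vectors
        while len(cls) < target:            # fourth step: until the class is full
            sample = cls[rng.integers(len(cls))]   # first step: pick a random sample
            factor = rng.uniform(0.9, 1.1)         # second step: one random proportion
            cls.append(sample * factor)            # third step: add the scaled sample
        out.append(np.array(cls))
    return out
```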
Step S400: and labeling the label of the corresponding category of each standard training sample in each category.
Labeling the standard training samples in each category, for example, labels of all samples in category 1 may be labeled as 1, labels of all samples in corresponding category 2 may be labeled as 2, and so on.
Step S500: the classification model is trained using standard training samples with labels.
And training a classification model by using the obtained standard training sample with the label.
Exemplarily, referring to fig. 4, the training process comprises the following steps:
step S501: dividing the labeled standard training samples into n groups.
n is a preset grouping number and is set according to the number of standard training samples and the complexity of a training model.
Step S502: and sequentially taking the standard training samples included in the ith group in the n groups as a test sample set, and taking the standard training samples in the rest groups as a training sample set.
Exemplarily, if n is 10, group 1 is first used as the test sample set while the standard training samples of the remaining 9 groups serve as the training sample set. It should be understood that each of the 1st to 10th groups is used in turn as the test sample set, with the corresponding remaining 9 groups serving as the training sample set.
Step S503: and after the ith training is finished, testing the classification model according to the ith group of test sample set, and calculating the error of the ith test.
If n is 10, the training process is performed at least 10 times; after each training with the training sample set is completed, the classification model is tested with the corresponding test sample set, and the error of that test is calculated.
Step S504: and after n times of training are finished, calculating the error mean value of n times of tests.
Exemplarily, when 10 training sessions are completed, the error mean of 10 tests is calculated.
Step S505: and if the error mean value is larger than or equal to a preset error threshold value, adjusting the network structure and/or corresponding parameter values of the classification model.
Whether the error mean of the n tests is greater than or equal to a preset error threshold is judged; if so, the trained model has not yet reached the required standard, and the network structure and/or corresponding parameter values of the classification model are adjusted.
Step S506: and continuously training the adjusted classification model by using a standard training sample with a label until the error mean value is smaller than the error threshold value.
And continuously utilizing the standard training sample with the label to repeatedly execute the steps S503 to S506 until the error mean value is smaller than the error threshold value.
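Steps S501 to S506 amount to n-fold cross-validation wrapped in an error-driven adjustment loop; a minimal sketch, assuming a model object with `fit`/`predict` methods and misclassification rate as the error measure (both assumptions, not specified by the embodiment):

```python
import numpy as np

def cross_validated_error(model, X: np.ndarray, y: np.ndarray, n: int = 10) -> float:
    """S501-S504: split into n groups, train n times, return the mean test error."""
    folds = np.array_split(np.arange(len(X)), n)          # S501: n groups
    errors = []
    for i in range(n):
        test_idx = folds[i]                               # S502: group i is the test set
        train_idx = np.concatenate([folds[j] for j in range(n) if j != i])
        model.fit(X[train_idx], y[train_idx])             # i-th training
        pred = model.predict(X[test_idx])                 # S503: test after training
        errors.append(np.mean(pred != y[test_idx]))       # error of the i-th test
    return float(np.mean(errors))                         # S504: mean of n tests

# S505/S506: adjust the model and retrain until the mean error is small enough.
# while cross_validated_error(model, X, y) >= error_threshold:
#     model = adjust(model)   # hypothetical helper: tweak structure and/or parameters
```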
It should be appreciated that the above steps may be used to analyze image problems, college ranking problems, and other data analysis problems. The classification model comprises a neural network model, a deep learning model and other models for classification.
In the classification model training method disclosed in this embodiment, a training data set is preprocessed to obtain a standard training data set; dividing the standard training samples in the standard training data set into a preset number of categories; carrying out equalization processing on each category so as to keep the number of standard training samples in each category consistent; labeling each standard training sample in each category with a label of the corresponding category; the classification model is trained using standard training samples with labels. According to the embodiment, the standard training samples are classified, the labels are automatically added to the standard training samples, the time and the human resources for sample marking are reduced, the problems that the subjectivity of manual marking is strong and the error rate is high can be solved, and the training speed of the classification model and the classification accuracy of the classification model are effectively improved.
Example 2
In the present embodiment, referring to fig. 5, a classification model training apparatus 1 is shown to include a data preprocessing module 100, a sample classification module 200, an equalization processing module 300, a class labeling module 400, and a model training module 500.
A data preprocessing module 100, configured to preprocess the training data set to obtain a standard training data set; a sample classification module 200, configured to classify the standard training samples in the standard training data set into a preset number of categories; the equalization processing module 300 is configured to perform equalization processing on each category to keep the number of standard training samples in each category consistent; a category labeling module 400, configured to label each standard training sample in each category with a label of the corresponding category; and the model training module 500 is used for training the classification model by using the standard training sample with the label.
The classification model training apparatus 1 of this embodiment is configured to execute the classification model training method according to the above embodiment through the cooperative use of the data preprocessing module 100, the sample classification module 200, the equalization processing module 300, the class marking module 400, and the model training module 500, and the implementation scheme and the beneficial effects related to the above embodiment are also applicable in this embodiment, and are not described herein again.
It should be understood that the above embodiments relate to a terminal device, comprising a memory for storing a computer program and a processor for executing the computer program to enable the terminal device to perform the classification model training method according to the above embodiments.
It should be appreciated that the above embodiments relate to a readable storage medium storing a computer program which, when run on a processor, performs the classification model training method described in the above embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, each functional module or unit in each embodiment of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention or a part of the technical solution that contributes to the prior art in essence can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a smart phone, a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.

Claims (10)

1. A classification model training method, comprising:
preprocessing a training data set to obtain a standard training data set;
dividing the standard training samples in the standard training data set into a preset number of categories;
carrying out equalization processing on each category so as to keep the number of standard training samples in each category consistent;
labeling each standard training sample in each category with a label of the corresponding category;
the classification model is trained using standard training samples with labels.
2. The classification model training method according to claim 1, wherein the preprocessing the training data set to obtain a standard training data set comprises:
carrying out quantization processing on each attribute score in the training data set by using a quantization formula;
carrying out standardization processing on the training data set subjected to quantization processing to obtain a training square matrix;
carrying out eigenvalue decomposition on the training square matrix so as to represent the training square matrix by using eigenvectors and eigenvalues;
selecting a characteristic vector of which the characteristic value is greater than a preset threshold value;
re-scoring the attributes in the training square matrix according to the feature vectors of which the feature values are larger than the preset threshold value to obtain a feature data sample set;
and carrying out normalization processing on each new attribute score in the characteristic data sample set to obtain a standard training data set.
3. The classification model training method according to claim 2, wherein the quantization formula is as follows:

$$x_k^m = 100 \times \frac{a_k^m - a_{\min}^m}{a_{\max}^m - a_{\min}^m}$$

where $x_k^m$ denotes the quantization score of the mth attribute of the kth sample in the training data set, $a_k^m$ denotes the attribute score of the mth attribute of the kth sample in the training data set, $a_{\min}^m$ denotes the minimum attribute score of the mth attribute in the training data set, and $a_{\max}^m$ denotes the maximum attribute score of the mth attribute in the training data set.
4. The classification model training method according to claim 2, wherein the attributes in the training square matrix are re-scored according to the following formula:

$$y_k^p = \sum_{m=1}^{M} z_m^p \, x_k^m$$

where $y_k^p$ denotes the pth new attribute score of the kth sample in the training square matrix, $x_k^m$ denotes the quantization score of the mth attribute of the kth sample in the training square matrix, $z_m^p$ denotes the mth element value of the eigenvector corresponding to the pth eigenvalue, and the training square matrix is of order M × M.
5. The method for training classification models according to claim 1, wherein the classifying the standard training samples in the standard training data set into a preset number of classes comprises:
randomly selecting the preset number of standard training samples as a clustering center;
respectively calculating the distance between each standard training sample and the preset number of clustering centers;
distributing each standard training sample to the class corresponding to the clustering center with the minimum distance;
calculating a corresponding sample mean value according to the standard training samples in each category;
and if the distance between the sample mean value of a certain class and the clustering center of the class is greater than or equal to a preset distance threshold, taking the sample mean value of each class as a new clustering center to perform the classification operation again until the distance between the sample mean values of the classes with the preset number and the clustering centers of the corresponding classes is smaller than the preset distance threshold.
6. The method for training classification models according to claim 1, wherein the equalizing the classes to keep the number of data samples in the classes consistent comprises:
acquiring the number of standard training samples in each category;
and taking the class containing the maximum number of standard training samples as a reference, and supplementing other classes by using an up-sampling method.
7. The method for training classification models according to claim 1, wherein the training classification models by using labeled standard training samples comprises:
dividing the standard training samples with labels into n groups;
sequentially taking the standard training samples included in the ith group in the n groups as a test sample set, and taking the standard training samples in the other groups as training sample sets;
after the ith training is finished, testing the classification model according to the ith group of test sample sets, and calculating the error of the ith test;
after n times of training are finished, calculating the error mean value of n times of tests;
if the error mean value is larger than or equal to a preset error threshold value, adjusting a network structure and/or corresponding parameter values of the classification model;
and continuously training the adjusted classification model by using a standard training sample with a label until the error mean value is smaller than the error threshold value.
8. A classification model training apparatus, characterized in that the apparatus comprises:
the data preprocessing module is used for preprocessing the training data set to obtain a standard training data set;
the sample classification module is used for classifying the standard training samples in the standard training data set into a preset number of categories;
the equalization processing module is used for performing equalization processing on each category so as to keep the number of standard training samples in each category consistent;
the class marking module is used for marking each standard training sample in each class with a label of the corresponding class;
and the model training module is used for training the classification model by using the standard training sample with the label.
9. A terminal device, comprising a memory for storing a computer program and a processor for executing the computer program to enable the terminal device to perform the classification model training method according to any one of claims 1 to 7.
10. A readable storage medium, characterized in that it stores a computer program which, when run on a processor, performs the classification model training method of any one of claims 1 to 7.
CN202010738246.4A 2020-07-28 2020-07-28 Classification model training method and device, terminal equipment and readable storage medium Pending CN111860671A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010738246.4A CN111860671A (en) 2020-07-28 2020-07-28 Classification model training method and device, terminal equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010738246.4A CN111860671A (en) 2020-07-28 2020-07-28 Classification model training method and device, terminal equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111860671A true CN111860671A (en) 2020-10-30

Family

ID=72948095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010738246.4A Pending CN111860671A (en) 2020-07-28 2020-07-28 Classification model training method and device, terminal equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111860671A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733884A (en) * 2020-12-23 2021-04-30 树根互联技术有限公司 Welding defect recognition model training method and device and computer terminal
CN113111960A (en) * 2021-04-25 2021-07-13 北京文安智能技术股份有限公司 Image processing method and device and training method and system of target detection model
CN113240032A (en) * 2021-05-25 2021-08-10 北京有竹居网络技术有限公司 Image classification method, device, equipment and storage medium
CN113378944A (en) * 2021-06-17 2021-09-10 北京博创联动科技有限公司 Agricultural machinery operation mode recognition model training method and device and terminal equipment
CN113762347A (en) * 2021-08-06 2021-12-07 佳都科技集团股份有限公司 Method and device for evaluating health degree of sliding door body

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710907A (en) * 2018-05-15 2018-10-26 苏州大学 Handwritten form data classification method, model training method, device, equipment and medium
CN109145937A (en) * 2018-06-25 2019-01-04 北京达佳互联信息技术有限公司 A kind of method and device of model training
CN111091163A (en) * 2020-03-24 2020-05-01 杭州汇萃智能科技有限公司 Minimum distance classification method and device, computer equipment and storage medium
CN111444878A (en) * 2020-04-09 2020-07-24 Oppo广东移动通信有限公司 Video classification method and device and computer readable storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710907A (en) * 2018-05-15 2018-10-26 苏州大学 Handwritten form data classification method, model training method, device, equipment and medium
CN109145937A (en) * 2018-06-25 2019-01-04 北京达佳互联信息技术有限公司 A kind of method and device of model training
CN111091163A (en) * 2020-03-24 2020-05-01 杭州汇萃智能科技有限公司 Minimum distance classification method and device, computer equipment and storage medium
CN111444878A (en) * 2020-04-09 2020-07-24 Oppo广东移动通信有限公司 Video classification method and device and computer readable storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733884A (en) * 2020-12-23 2021-04-30 树根互联技术有限公司 Welding defect recognition model training method and device and computer terminal
CN113111960A (en) * 2021-04-25 2021-07-13 北京文安智能技术股份有限公司 Image processing method and device and training method and system of target detection model
CN113111960B (en) * 2021-04-25 2024-04-26 北京文安智能技术股份有限公司 Image processing method and device and training method and system of target detection model
CN113240032A (en) * 2021-05-25 2021-08-10 北京有竹居网络技术有限公司 Image classification method, device, equipment and storage medium
CN113240032B (en) * 2021-05-25 2024-01-30 北京有竹居网络技术有限公司 Image classification method, device, equipment and storage medium
CN113378944A (en) * 2021-06-17 2021-09-10 北京博创联动科技有限公司 Agricultural machinery operation mode recognition model training method and device and terminal equipment
CN113762347A (en) * 2021-08-06 2021-12-07 佳都科技集团股份有限公司 Method and device for evaluating health degree of sliding door body
CN113762347B (en) * 2021-08-06 2023-08-08 佳都科技集团股份有限公司 Sliding door body health degree assessment method and device

Similar Documents

Publication Publication Date Title
CN111860671A (en) Classification model training method and device, terminal equipment and readable storage medium
CN107391760A (en) User interest recognition methods, device and computer-readable recording medium
CN106326984A (en) User intention identification method and device and automatic answering system
CN106445919A (en) Sentiment classifying method and device
CN111832650B (en) Image classification method based on generation of antagonism network local aggregation coding semi-supervision
CN107545038B (en) Text classification method and equipment
CN108038208A (en) Training method, device and the storage medium of contextual information identification model
CN111210402A (en) Face image quality scoring method and device, computer equipment and storage medium
CN109800309A (en) Classroom Discourse genre classification methods and device
CN110457471A (en) File classification method and device based on A-BiLSTM neural network
CN112100377A (en) Text classification method and device, computer equipment and storage medium
CN114692750A (en) Fine-grained image classification method and device, electronic equipment and storage medium
CN110197213A (en) Image matching method, device and equipment neural network based
CN110147798A (en) A kind of semantic similarity learning method can be used for network information detection
CN111723206B (en) Text classification method, apparatus, computer device and storage medium
CN111767474A (en) Method and equipment for constructing user portrait based on user operation behaviors
CN115859128B (en) Analysis method and system based on interaction similarity of archive data
CN114926702B (en) Small sample image classification method based on depth attention measurement
CN117079099A (en) Illegal behavior detection method based on improved YOLOv8n
CN116467451A (en) Text classification method and device, storage medium and electronic equipment
CN110738246A (en) Product classification method and device, computing equipment and computer storage medium
CN113344031B (en) Text classification method
CN113674571A (en) Exercise method, exercise system and storage medium
CN109299229B (en) Deep learning method for natural language dialogue system intention
CN113159419A (en) Group feature portrait analysis method, device and equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 20240426)