CN116628491A - Training method and device for classification model, electronic equipment and storage medium - Google Patents

Training method and device for classification model, electronic equipment and storage medium

Info

Publication number
CN116628491A
Authority
CN
China
Prior art keywords
sample data
current
current sample
training
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310462029.0A
Other languages
Chinese (zh)
Inventor
杨建雄
杜志高
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Si Tech Information Technology Co Ltd
Original Assignee
Beijing Si Tech Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Si Tech Information Technology Co Ltd
Priority to CN202310462029.0A
Publication of CN116628491A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention discloses a training method and device for a classification model, an electronic device and a storage medium, wherein the method comprises the following steps: obtaining a current sample data set corresponding to the current period, wherein the current sample data set comprises a plurality of current sample data and a sample classification label corresponding to each current sample data; performing the current iterative training on the binary classification model based on the current sample data set, and obtaining the prediction classification result corresponding to each current sample data output by the current round of training; normalizing each current sample data based on each prediction classification result and the corresponding sample classification label; and performing the next iterative training on the binary classification model based on the normalized current sample data set until the current iteration count equals the number of training iterations, at which point the training of the binary classification model for the current period is determined to be complete. The technical scheme provided by the embodiment of the invention improves both the training efficiency and the training accuracy of the binary classification model.

Description

Training method and device for classification model, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to computer technology, and in particular to a training method and device for a classification model, an electronic device and a storage medium.
Background
With the development of computer technology, deep learning models are widely used. A classification model must be trained with a large amount of diverse sample data before it can be used.
Currently, in scenarios where data is generated periodically, such as data generated every November, some feature values are often missing from the data, that is, the values corresponding to certain features are never collected. Moreover, similar or even identical feature values appear across sample data from the same scenario. Directly training a classification model with such low-diversity sample data therefore yields a poor training effect, and the trained model may even be under-fitted.
Disclosure of Invention
The embodiment of the invention provides a training method and device for a classification model, an electronic device and a storage medium, so that the training of the classification model can be completed rapidly and both the training efficiency and the accuracy of the classification model are improved.
In a first aspect, an embodiment of the present invention provides a training method for a classification model, including:
Obtaining a current sample data set corresponding to a current period, wherein the current sample data set comprises: a plurality of current sample data and sample classification labels corresponding to each current sample data;
performing current iterative training on the classification model based on the current sample data set, and obtaining a prediction classification result corresponding to each current sample data output by the current training of the classification model;
based on the prediction classification result corresponding to each current sample data and the sample classification label, carrying out normalization processing on each current sample data to obtain a normalized current sample data set;
and performing the next iterative training on the classification model based on the normalized current sample data set until the current iteration count equals the number of training iterations, and determining that the training of the classification model for the current period is completed.
In a second aspect, an embodiment of the present invention provides a training apparatus for a classification model, including:
a current sample data set obtaining module, configured to obtain a current sample data set corresponding to a current period, where the current sample data set includes: a plurality of current sample data and sample classification labels corresponding to each current sample data;
The model output acquisition module is used for performing current iterative training on the classification model based on the current sample data set and acquiring a prediction classification result corresponding to each piece of current sample data output by the current training of the classification model;
the sample data set normalization processing module is used for carrying out normalization processing on each current sample data based on a prediction classification result corresponding to each current sample data and the sample classification label to obtain a normalized current sample data set;
the model training completion determining module is used for performing the next iterative training on the binary classification model based on the normalized current sample data set until the current iteration count equals the number of training iterations, and for determining that the training of the binary classification model for the current period is completed.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a method of training a classification model as provided by any embodiment of the invention.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a training method for a classification model as provided by any of the embodiments of the present invention.
According to the technical scheme of the embodiment of the invention, a current sample data set corresponding to the current period is obtained, the current sample data set comprising a plurality of current sample data and a sample classification label corresponding to each current sample data; the current iterative training is performed on the classification model based on the current sample data set, and the prediction classification result corresponding to each current sample data output by the current round of training is obtained; each current sample data is normalized based on its prediction classification result and its sample classification label to obtain a normalized current sample data set, so that the current sample data used in the next iterative training is modified without adding new data features to it, which improves the training efficiency of the classification model; and the next iterative training is performed on the classification model based on the normalized current sample data set until the current iteration count equals the number of training iterations, at which point the training of the classification model for the current period is determined to be complete. Because the sample data for each round of training is normalized according to the prediction classification results of the previous round, the poor training effect that low-diversity sample data would otherwise cause is avoided, and the training accuracy of the classification model is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a training method of a classification model according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a training method of a classification model according to a second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a training device for a classification model according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a training method for a classification model according to an embodiment of the present invention. The method may be performed by a training device for a classification model, which may be implemented in hardware and/or software and configured in an electronic device, and which is applicable to scenarios in which data is generated periodically. As shown in fig. 1, the method includes:
S110, acquiring a current sample data set corresponding to the current period, wherein the current sample data set comprises: a plurality of current sample data and a sample class label corresponding to each of the current sample data.
The current period may refer to the period for which the model is currently being trained. For example, periods may be divided in units of months, such as January or February of the current year, or in units of years, such as the whole of last year. The current sample data may refer to sample data collected during the current period for training the classification model; for example, it may be, but is not limited to, sample data generated in February of the current year. The sample classification label may refer to the actual class of a sample, which may be determined by manual labeling.
Specifically, the current sample data set corresponding to the current period is obtained, the set comprising a plurality of current sample data and a sample classification label corresponding to each current sample data. Training the classification model with such periodic sample data has the advantage that a classification model dedicated to a specific period can be trained, so that the trained model can then be used for prediction and classification within that specific period. For example, the air quality monitoring data of February of last year may be used as training data to train a classification model that predicts air quality for February.
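For illustration, a minimal Python sketch of one possible in-memory representation of such a periodic sample data set follows; the names (CurrentSample, load_current_dataset) and the CSV layout are assumptions of this sketch, not part of the method itself.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class CurrentSample:
    features: np.ndarray  # feature values collected in the current period
    label: int            # sample classification label (0 or 1), e.g. assigned manually


def load_current_dataset(csv_path: str) -> List[CurrentSample]:
    """Load the current sample data set for one period (e.g. February of last year)."""
    raw = np.genfromtxt(csv_path, delimiter=",", skip_header=1)
    # Assumed layout: the last column holds the manually assigned binary label.
    return [CurrentSample(features=row[:-1], label=int(row[-1])) for row in raw]
```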
S120, performing the current iterative training on the binary classification model based on the current sample data set, and obtaining the prediction classification result corresponding to each current sample data output by the current round of training.
The training for the current period comprises multiple rounds of iterative training, and the current iterative training may be any one of these rounds. The prediction classification result may refer to the prediction output by the binary classification model. Specifically, each current sample data in the current sample data set is input into the binary classification model for the current round of iterative training, and the prediction classification result corresponding to each current sample data output by this round of training is obtained.
S130, carrying out normalization processing on each current sample data based on a prediction classification result and a sample classification label corresponding to each current sample data to obtain a normalized current sample data set.
Specifically, the prediction classification result corresponding to each current sample data is compared with that sample's classification label, so as to determine the successfully predicted current sample data, whose prediction classification results are consistent with their sample classification labels, and the unsuccessfully predicted current sample data, whose prediction classification results are inconsistent with their sample classification labels. The successfully predicted current sample data are normalized by forward normalization, and the unsuccessfully predicted current sample data are normalized by reverse normalization. Normalizing the data in different directions according to this comparison with the sample classification labels has the advantage that the sample data that failed training are inverted while the originally correct sample data remain unchanged. For example, if a sample that failed training describes a healthy 95-year-old, the inverted sample may describe an unhealthy child of about 5 years old. Because a binary classification model is being trained, if the model parameters are correct, a failed sample that has been inverted and fed into the model again should yield a prediction classification result consistent with its sample classification label; and even if some of these re-output prediction results remain inconsistent with their labels, the training efficiency of the binary classification model is still improved. Rather than merely training repeatedly on a fixed sample data set, the method trains the binary classification model on a sample data set that is adaptively corrected at each iteration.
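A minimal sketch of this normalization step, assuming per-feature min-max statistics taken over the whole current sample data set; the function and variable names here are illustrative, not taken from the method.

```python
import numpy as np


def normalize_by_prediction(X: np.ndarray, y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Forward-normalize correctly predicted samples, reverse-normalize the rest.

    X:      (n_samples, n_features) feature matrix for the current period.
    y_true: sample classification labels; y_pred: the model's prediction results.
    """
    # Per-feature minimum and maximum over the whole current sample data set.
    f_min = X.min(axis=0)
    f_max = X.max(axis=0)
    span = np.where(f_max > f_min, f_max - f_min, 1.0)  # guard against constant features

    forward = (X - f_min) / span   # maps each feature value into [0, 1]
    reverse = (f_max - X) / span   # equals 1 - forward: the "inverted" sample

    correct = (y_pred == y_true)   # prediction matches the sample classification label
    return np.where(correct[:, None], forward, reverse)
```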
S140, performing the next iterative training on the classification model based on the normalized current sample data set until the current iteration count equals the number of training iterations, and determining that the training of the classification model for the current period is completed.
The number of training iterations may refer to the total number of rounds of iterative training the model must perform during the current period of training. Specifically, the next iterative training is performed on the classification model based on the normalized current sample data set; for example, the normalized current sample data set is fed back as the current sample data set and the operations of steps S120-S130 are repeated until the current iteration count equals the number of training iterations, at which point the training of the classification model for the current period is determined to be complete. Using the number of training iterations as the convergence condition in this embodiment has the following benefit: it avoids the situation in which the inversion of sample data keeps the model's prediction accuracy from staying within a preset accuracy convergence range, so that an accuracy-based criterion could never decide whether training is finished. For example, the model output accuracy at the previous iteration may lie outside the preset accuracy convergence range, the accuracy at the current iteration inside it, and the accuracy at the next iteration outside it again. It should be noted that the training of the classification model may be suspended if this situation occurs at least 10 times in succession; the current sample data whose prediction results are still inconsistent with their sample classification labels are then deleted from the current sample data set, and the training of the classification model is performed anew.
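Putting S120-S140 together, the per-period loop might look like the following sketch; model stands in for any sklearn-style binary classifier, and normalize_by_prediction is the helper sketched above. Note that the fixed iteration count, not an accuracy window, ends the period's training.

```python
def train_for_period(model, X, y, n_train_iters: int):
    """Run the fixed number of training iterations for one period (S120-S140)."""
    X_cur = X
    for _ in range(n_train_iters):
        model.fit(X_cur, y)            # current round of iterative training (S120)
        y_pred = model.predict(X_cur)  # prediction classification results
        # Normalize the next round's inputs from this round's predictions (S130).
        X_cur = normalize_by_prediction(X_cur, y, y_pred)
    # Iteration count reached: training for the current period is complete (S140).
    return model
```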
According to the technical scheme of this embodiment, a current sample data set corresponding to the current period is obtained, the set comprising a plurality of current sample data and a sample classification label corresponding to each current sample data; the current iterative training is performed on the binary classification model based on the current sample data set, and the prediction classification result corresponding to each current sample data output by the current round of training is obtained; each current sample data is normalized based on its prediction classification result and its sample classification label to obtain a normalized current sample data set, so that the current sample data used in the next iterative training is modified without adding new data features, which improves the training efficiency of the classification model; and the next iterative training is performed on the binary classification model based on the normalized current sample data set until the current iteration count equals the number of training iterations, at which point the training of the binary classification model for the current period is determined to be complete. Because the sample data for each round of training is normalized according to the prediction classification results obtained in the previous round, the poor training effect that low-diversity sample data would otherwise cause is avoided, and the training accuracy of the binary classification model is improved.
Based on the above technical solution, S110 may include: acquiring a plurality of original sample data; determining the number of non-null feature values corresponding to each original feature based on the original feature value of that feature in each original sample data; determining the target original features based on the number of non-null feature values corresponding to each original feature; and deleting, from each original sample data, the remaining original features other than the target original features together with their feature values, the deleted-down original sample data being taken as the current sample data.
The original sample data may refer to the sample data, with no features yet deleted, collected during the current period for training the classification model. The original features may refer to all features contained in the original sample data. Each original feature value may be a null value or a non-null value. The number of non-null feature values may refer to how many non-null values the same original feature has across all original sample data. The target original features may refer to the portion of the original features that can be used for training the classification model; for example, a target original feature may be, but is not limited to, an original feature whose proportion of null values does not exceed a certain threshold.
Specifically, a plurality of original sample data corresponding to the current period are acquired. The number of non-null feature values corresponding to each original feature is determined from the original feature values of that feature across the original sample data. The target original features are then determined from these counts; for example, the number of non-null feature values of each original feature is divided by the total number of original sample data, the quotient is taken as the non-null ratio of that feature, and the original features whose non-null ratio exceeds a preset feature threshold are determined to be the target original features. The remaining original features and their feature values are deleted from each original sample data, and the deleted-down original sample data are taken as the current sample data. This embodiment can thus screen the features and decide which of them can be used for training the classification model in the current period, avoiding the poor model training effect that a large number of missing feature values would cause. A sketch of this screening step is given below.
It should be noted that the original features may also be feature-engineered; for example, feature analysis, cleaning, screening and retention can be carried out through basic data analysis, data processing and descriptive statistical analysis.
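The screening step referenced above could be sketched as follows, assuming a pandas DataFrame of original sample data; keep_ratio plays the role of the preset feature threshold on the non-null ratio (0.02 would mirror the later example of deleting features that are more than 98% null).

```python
import pandas as pd


def screen_features(raw: pd.DataFrame, keep_ratio: float = 0.02) -> pd.DataFrame:
    """Keep only the target original features whose non-null ratio exceeds keep_ratio."""
    non_null_ratio = raw.notna().sum() / len(raw)  # non-null count / total sample count
    target_features = non_null_ratio[non_null_ratio > keep_ratio].index
    # Delete the remaining original features and their feature values.
    return raw[target_features]
```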
On the basis of the above technical scheme, after it is determined that the training of the classification model for the current period is completed, the method further comprises: determining the current accuracy corresponding to each iterative training in the current period of training based on the prediction classification result and the sample classification label corresponding to that iterative training; and determining the target iteration count of the classification model for the next period of training based on the current iteration count and the current accuracy corresponding to each iterative training in the current period of training.
The current accuracy may refer to the degree to which the model output result corresponding to one iteration training is consistent with the sample classification label. The target iteration number may refer to the total number of iterations required by the classification model in the next period of training, i.e., the training iteration number in the next period of training.
Specifically, the current accuracy corresponding to each iterative training in the current period of training is determined based on the prediction classification result and the sample classification label corresponding to that iterative training. For example, a prediction classification result that matches its sample classification label is marked as 1, and one that does not match is marked as 0; all the marked values are summed, the sum is divided by the total number of current sample data, and the quotient is determined to be the current accuracy of that iterative training. This embodiment may determine the current iteration count corresponding to the highest current accuracy as the target iteration count of the classification model for the next period of training. This embodiment may also determine the current accuracies in the current period of training that are greater than or equal to a preset accuracy threshold, compare the current iteration counts corresponding to those accuracies, and determine the smallest of them to be the target iteration count for the next period of training. For example, when the iteration count should be as small as possible while the model prediction accuracy is at least 95%, the current iteration counts and current accuracies of the current period of training can be screened against this requirement, so that a target iteration count meeting the service requirement is determined; this satisfies diverse user requirements and saves training time. A sketch of this selection follows.
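A sketch of the accuracy bookkeeping and target-iteration selection just described; min_acc encodes the illustrative 95% service requirement, and the fallback to the most accurate iteration is an assumption of this sketch.

```python
import numpy as np


def iteration_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mark matches as 1 and mismatches as 0, sum, and divide by the sample count."""
    return float(np.mean(y_pred == y_true))


def target_iterations(accuracies: list, min_acc: float = 0.95) -> int:
    """Smallest iteration count whose current accuracy meets the requirement.

    accuracies[i] is the current accuracy after iteration i + 1 of this period.
    """
    qualified = [i + 1 for i, acc in enumerate(accuracies) if acc >= min_acc]
    if qualified:
        return min(qualified)
    return int(np.argmax(accuracies)) + 1  # fallback: the most accurate iteration
```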
It should be noted that this embodiment may also determine the prediction accuracy of the binary classification model obtained by training in each historical period, and determine the target iteration count for the next period of training based on the number of training iterations and the prediction accuracy corresponding to each historical period. For example, the number of training iterations corresponding to the highest prediction accuracy across the historical periods may be determined to be the target iteration count for the next period of training. The target iteration count may also be chosen by voting: an iteration count that meets the requirement is selected from the results of multiple iterations and used for prediction on the final test set, which strengthens the robustness and accuracy of the binary classification model. Meanwhile, the iteration count can be adjusted flexibly to find a balance between time and space, achieving an effective classification result.
Example two
Fig. 2 is a flowchart of a training method of a classification model according to a second embodiment of the present invention. On the basis of the foregoing embodiment, this embodiment describes in detail the process of normalizing each current sample data to obtain a normalized current sample data set. Explanations of terms identical or corresponding to those of the above embodiment are not repeated here. As shown in fig. 2, the method includes:
S210, acquiring a current sample data set corresponding to the current period, wherein the current sample data set comprises: a plurality of current sample data and a sample class label corresponding to each of the current sample data.
S220, performing the current iterative training on the binary classification model based on the current sample data set, and obtaining the prediction classification result corresponding to each current sample data output by the current round of training.
S230, comparing the prediction classification result corresponding to each piece of current sample data with the sample classification label to determine first current sample data with the prediction classification result consistent with the sample classification label and second current sample data with the prediction classification result inconsistent with the sample classification label.
The first current sample data may refer to current sample data in which the prediction classification result output by the model in the iterative training is consistent with the sample classification label. The second current sample data may refer to current sample data in which the prediction classification result output by the model in the iterative training is inconsistent with the sample classification label.
S240, performing forward normalization processing on the first current sample data to obtain forward normalized first current sample data.
Specifically, the maximum feature value and the minimum feature value corresponding to each original feature in the current sample data set are obtained, and the minimum feature value is subtracted from the maximum feature value to obtain a third difference value. The minimum feature value is subtracted from the original feature value corresponding to each original feature in each first current sample data to obtain a fourth difference value corresponding to that feature. The fourth difference value corresponding to each original feature in each first current sample data is divided by the third difference value, and the quotient is determined to be the forward normalization value of that feature, thereby obtaining the forward-normalized first current sample data. In the next iterative training process, if the same first current sample data appears again, the previously forward-normalized version can be used directly, which improves the training efficiency of the classification model. Meanwhile, the influence of features having different scales (dimensions) is avoided.
S250, performing reverse normalization processing on the second current sample data to obtain reverse-normalized second current sample data.
Specifically, the maximum feature value and the minimum feature value corresponding to each original feature in the current sample data set, as obtained in S240, are used. The minimum feature value is subtracted from the maximum feature value to obtain a first difference value. The original feature value corresponding to each original feature in each second current sample data is subtracted from the maximum feature value to obtain a second difference value corresponding to that feature. The second difference value corresponding to each original feature in each second current sample data is divided by the first difference value, and the quotient is determined to be the reverse normalization value of that feature, thereby obtaining the reverse-normalized second current sample data; the sample data that failed training are thus inverted while the correctly trained sample data remain unchanged. Model training and sample data adjustment proceed at the same time, which further improves the training efficiency of the binary classification model. Meanwhile, the influence of features having different scales (dimensions) is avoided.
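Written as formulas, for a feature value x with period-wide maximum x_max and minimum x_min (the subtraction order here is the one that keeps the reverse value in the 0-to-1 range stated below):

    forward: x' = (x - x_min) / (x_max - x_min)
    reverse: x' = (x_max - x) / (x_max - x_min) = 1 - (x - x_min) / (x_max - x_min)

so reverse normalization is exactly the inversion of the forward-normalized value.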
S260, performing the next iterative training on the classification model based on the normalized current sample data set until the current iteration count equals the number of training iterations, and determining that the training of the classification model for the current period is completed.
According to the technical scheme of this embodiment, the prediction classification result corresponding to each current sample data is compared with its sample classification label to determine the first current sample data, whose prediction classification results are consistent with their sample classification labels, and the second current sample data, whose prediction classification results are inconsistent with their sample classification labels; the first current sample data are forward-normalized to obtain forward-normalized first current sample data; and the second current sample data are reverse-normalized to obtain reverse-normalized second current sample data, so that the sample data that failed training are inverted while the correctly trained sample data remain unchanged. Model training and sample data adjustment proceed at the same time, which further improves the training efficiency of the binary classification model.
Based on the above technical solution, S250 may include: obtaining the maximum feature value and the minimum feature value corresponding to each original feature in the current sample data set, and subtracting the minimum feature value from the maximum feature value to obtain a first difference value; subtracting the original feature value corresponding to each original feature in each second current sample data from the maximum feature value to obtain a second difference value corresponding to that feature; and dividing the second difference value corresponding to each original feature in each second current sample data by the first difference value, the quotient being determined to be the reverse normalization value of that feature.
The maximum feature value may refer to the largest value a feature takes across the current sample data set, and the minimum feature value to the smallest. The first difference value may refer to the difference between the maximum and minimum feature values. The second difference value may refer to the difference between the maximum feature value and a feature's own value. The reverse normalization value may refer to a value between 0 and 1, inclusive.
On the basis of the above technical scheme, after the second current sample data are reverse-normalized to obtain the reverse-normalized second current sample data, the method further comprises: determining the current data weight corresponding to the reverse-normalized second current sample data; and weighting the reverse-normalized second current sample data based on the current data weight to obtain weighted second current sample data.
The current data weight can be used to adjust the importance, or the proportion, of a current sample data in the next iterative training. Specifically, if the model's outputs for the same sample data repeatedly fail to match that sample's classification label, the sample data can be progressively marginalized as the number of iterative trainings grows. The current data weight corresponding to the reverse-normalized second current sample data is determined, and the reverse-normalized second current sample data is weighted by this weight to obtain weighted second current sample data, which is then used in the next iterative training. This avoids the non-convergence of training results, or the poor model effect, that repeatedly training the model on such second current sample data would cause, and further improves the training efficiency and accuracy of the binary classification model.
It should be noted that the normalized current sample data may further be split into a training set and a test set according to their time-series characteristics, and the LightGBM algorithm, which handles missing values automatically and supports parallelized operation, may be selected as the baseline model for iterative optimization. The number of model iterations is set, the intermediate result of each optimization round is stored during training, the training set is evaluated with the optimal parameters, and the sample set that does not match the true supervision results is identified. Further, the parameters of LightGBM can be configured and the Optuna framework used to dynamically construct a search space of hyperparameters, learning the hyperparameters on the specified sample set and reducing the training loss to obtain hyperparameter estimates for that sample set.
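A hedged sketch of that LightGBM-plus-Optuna step; the searched parameters, metric, and trial count are illustrative choices, not the configuration claimed by the method.

```python
import lightgbm as lgb
import optuna
from sklearn.metrics import f1_score


def tune_lightgbm(X_train, y_train, X_valid, y_valid, n_trials: int = 5):
    """Dynamically build a hyperparameter search space with Optuna for LightGBM."""

    def objective(trial: optuna.Trial) -> float:
        params = {
            "num_leaves": trial.suggest_int("num_leaves", 15, 255),
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        }
        model = lgb.LGBMClassifier(**params)  # LightGBM handles missing values natively
        model.fit(X_train, y_train)
        return f1_score(y_valid, model.predict(X_valid))

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```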
On the basis of the above technical solution, "determining the current data weight corresponding to the reverse-normalized second current sample data" may include: determining the number of iterations in which the prediction classification result corresponding to the second current sample data was inconsistent with its sample classification label; and subtracting this number of inconsistent iterations from the current iteration count, then dividing the result by the current iteration count, to obtain the current data weight corresponding to the reverse-normalized second current sample data.
Specifically, the more iterations in which a sample's prediction is inconsistent with its label, the more likely it is that the sample data itself is problematic; continuing to train the classification model on it may prevent the training results from converging or degrade the model's effect. For example, if current sample data A matches its sample classification label in every iterative training, the current data weight of A is 1. With an iteration count of 10, if current sample data B failed to match its sample classification label in 5 of those iterations, the current data weight of B is 5/10, namely 1/2. In the next iterative training, current sample data A therefore matters more to the training of the binary classification model than current sample data B does.
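Following the worked example (5 mismatches in 10 iterations gives weight 1/2), the weight computation might look like this sketch; applying the weight as a simple multiplicative factor on the reverse-normalized features is an assumption of the sketch.

```python
import numpy as np


def current_data_weight(n_inconsistent: int, n_current_iters: int) -> float:
    """(current iteration count - inconsistent iterations) / current iteration count."""
    return (n_current_iters - n_inconsistent) / n_current_iters


# Example from the text: 5 mismatches in 10 iterations -> weight 0.5.
w = current_data_weight(5, 10)
x_weighted = w * np.array([0.2, 0.8, 1.0])  # weighting one reverse-normalized sample
```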
As an illustration, suppose the current sample data set is an aircraft flight data set. The original sample data in the set comprise 61.55 million samples spanning the period from 2009 to 2018, with 12 months of data per year; the binary supervision labels are balanced, and each sample has 28 features. First, the data are cleaned: the continuous values of the delay field are binarized to serve as the supervision result. The 28 features are checked for null values, and features whose null values exceed 98% of the total are deleted. It is further checked whether data exist for all 12 months of each year and whether any enumeration-type data are present. All data are then normalized, so that differences in feature scales do not bias the model training. Next, a supervised LightGBM classification algorithm model is designed, the parameters of the basic LightGBM are configured, the Optuna framework is used to rapidly and dynamically construct the hyperparameter search space, and the iteration count is specified; in this experiment 5 iterations gave the best effect. The processed data are input into the algorithm model, the training set is predicted via automatic parameter tuning, the first-round set of mismatched samples is found, and the first-round F1 value is obtained as the baseline. The model is then optimized iteratively: as the iterations proceed, the mismatched samples are reverse-normalized and added to the original data for training, a new round of results is generated, and so on. The optimal parameters and the prediction supervision result of each round of model training are stored. After the iterations are complete, equally spaced thresholds are set over the supervision results, and the critical value with the best classification effect is chosen by voting as the decision criterion for test set prediction. In the final test set evaluation, the F1 value improved by about 22.30% on average.
The following is an embodiment of the training device for a classification model provided by the embodiment of the present invention. The training device belongs to the same inventive concept as the training method of the above embodiments; details not described in this embodiment may be found in the embodiments of the training method above.
Example III
Fig. 3 is a schematic structural diagram of a training device for a classification model according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes: a current sample dataset acquisition module 310, a model output acquisition module 320, a sample dataset normalization processing module 330, and a model training completion determination module 340.
The current sample data set obtaining module 310 is configured to obtain the current sample data set corresponding to the current period, the set comprising a plurality of current sample data and a sample classification label corresponding to each current sample data; the model output obtaining module 320 is configured to perform the current iterative training on the binary classification model based on the current sample data set, and to obtain the prediction classification result corresponding to each current sample data output by the current round of training; the sample data set normalization processing module 330 is configured to normalize each current sample data based on its prediction classification result and its sample classification label, obtaining a normalized current sample data set; and the model training completion determining module 340 is configured to perform the next iterative training on the binary classification model based on the normalized current sample data set until the current iteration count equals the number of training iterations, and to determine that the training of the binary classification model for the current period is completed.
According to the technical scheme of this embodiment, a current sample data set corresponding to the current period is obtained, comprising a plurality of current sample data and a sample classification label corresponding to each current sample data; the current iterative training is performed on the binary classification model based on the current sample data set, and the prediction classification result corresponding to each current sample data output by the current round of training is obtained; each current sample data is normalized based on its prediction classification result and its sample classification label to obtain a normalized current sample data set, so that the current sample data used in the next iterative training is modified without adding new data features, improving the training efficiency of the classification model; and the next iterative training is performed on the binary classification model based on the normalized current sample data set until the current iteration count equals the number of training iterations, at which point the training of the binary classification model for the current period is determined to be complete. Because the sample data for each round of training is normalized according to the previous round's prediction classification results, the poor training effect that low-diversity sample data would otherwise cause is avoided, and the training accuracy of the binary classification model is improved.
Optionally, the current sample data set acquisition module 310 is specifically configured to: acquire a plurality of original sample data; determine the number of non-null feature values corresponding to each original feature based on the original feature value of that feature in each original sample data; determine the target original features based on those counts; and delete, from each original sample data, the remaining original features other than the target original features together with their feature values, taking the deleted-down original sample data as the current sample data.
Optionally, the sample dataset normalization processing module 330 may include:
the prediction classification result checking sub-module is used for comparing the prediction classification result corresponding to each piece of current sample data with the sample classification label to determine first current sample data with the prediction classification result consistent with the sample classification label and second current sample data with the prediction classification result inconsistent with the sample classification label;
the forward normalization sub-module is used for carrying out forward normalization processing on the first current sample data to obtain forward normalized first current sample data;
and the reverse normalization sub-module is used for carrying out reverse normalization processing on the second current sample data to obtain reverse normalized second current sample data.
Optionally, the reverse normalization sub-module is specifically configured to: obtain the maximum feature value and the minimum feature value corresponding to each original feature in the current sample data set, and subtract the minimum feature value from the maximum feature value to obtain a first difference value; subtract the original feature value corresponding to each original feature in each second current sample data from the maximum feature value to obtain a second difference value corresponding to that feature; and divide the second difference value corresponding to each original feature in each second current sample data by the first difference value, the quotient being determined to be the reverse normalization value of that feature.
Optionally, the apparatus further comprises:
the current data weight determining module is used for determining, after the second current sample data have been reverse-normalized, the current data weight corresponding to the reverse-normalized second current sample data;
and the weighting processing module is used for carrying out weighting processing on the second current sample data after inverse normalization based on the current data weight to obtain weighted second current sample data.
Optionally, the current data weight determining module is specifically configured to: determine the number of iterations in which the prediction classification result corresponding to the second current sample data was inconsistent with its sample classification label; and subtract this number of inconsistent iterations from the current iteration count, then divide the result by the current iteration count, to obtain the current data weight corresponding to the reverse-normalized second current sample data.
Optionally, the apparatus further comprises:
the current accuracy determining module is used for determining, after it is determined that the training of the binary classification model for the current period is completed, the current accuracy corresponding to each iterative training in the current period of training based on the prediction classification result and the sample classification label corresponding to that iterative training;
the target iteration count determining module is used for determining the target iteration count of the binary classification model for the next period of training based on the current iteration count and the current accuracy corresponding to each iterative training in the current period of training.
Example IV
Fig. 4 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM12 and the RAM13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the training method of the classification model.
In some embodiments, the training method of the classification model may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM12 and/or the communication unit 19. When the computer program is loaded into RAM13 and executed by processor 11, one or more steps of the training method of the classification model described above may be performed. Alternatively, in other embodiments, processor 11 may be configured to perform the training method of the classification model in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also known as a cloud computing server or cloud host, a host product in the cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services.
It should be appreciated that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention can be achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of training a classification model, comprising:
obtaining a current sample data set corresponding to a current period, wherein the current sample data set comprises: a plurality of current sample data and sample classification labels corresponding to each current sample data;
performing current iterative training on the classification model based on the current sample data set, and obtaining a prediction classification result corresponding to each current sample data output by the current training of the classification model;
based on the prediction classification result corresponding to each current sample data and the sample classification label, carrying out normalization processing on each current sample data to obtain a normalized current sample data set;
and performing a next iterative training on the classification model based on the normalized current sample data set until the current iteration count equals the training iteration count, and determining that training of the classification model for the current period is complete.
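For illustration only and not part of the claimed subject matter, the following Python sketch shows one possible shape of the per-period training loop of claim 1. The classifier (`SGDClassifier`) is an assumed stand-in for the claimed classification model, and `renormalize` is a hypothetical helper; a fuller sketch of that step follows claim 3 below.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def renormalize(X, y_pred, y_true):
    # Hypothetical placeholder for the per-sample renormalization of claim 3;
    # see the sketch after claim 3 for one possible implementation.
    return X

def train_one_period(X, y, num_train_iterations):
    """One training period (claim 1): train, predict, renormalize, repeat."""
    model = SGDClassifier()                       # assumed stand-in classifier
    classes = np.unique(y)
    for current_iteration in range(1, num_train_iterations + 1):
        model.partial_fit(X, y, classes=classes)  # current iterative training
        y_pred = model.predict(X)                 # prediction per current sample
        X = renormalize(X, y_pred, y)             # normalized current sample set
    # loop ends once the current iteration count equals the training iteration count
    return model
```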
2. The method of claim 1, wherein obtaining a plurality of current sample data comprises:
acquiring a plurality of original sample data;
determining the number of non-empty feature values corresponding to the same original feature based on the original feature value corresponding to each original feature in each original sample data;
determining target original features based on the number of non-empty feature values corresponding to each original feature;
deleting, from each original sample data, the remaining original features other than the target original features together with their corresponding feature values, and taking the pruned original sample data as the current sample data.
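A hedged sketch of claim 2's feature screening. The claim only requires selecting target original features from the non-empty value counts; the fixed threshold `min_non_empty` below is an assumed selection rule, not something the claim specifies.

```python
import pandas as pd

def screen_original_features(raw: pd.DataFrame, min_non_empty: int) -> pd.DataFrame:
    """Keep only target original features whose non-empty value count
    meets the threshold; delete the remaining features and their values."""
    non_empty_counts = raw.notna().sum(axis=0)    # non-empty count per feature
    target = non_empty_counts[non_empty_counts >= min_non_empty].index
    return raw[target]                            # pruned rows = current sample data

# usage sketch
raw = pd.DataFrame({"f1": [1.0, None, 3.0], "f2": [None, None, 5.0]})
current = screen_original_features(raw, min_non_empty=2)  # keeps f1, drops f2
```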
3. The method according to claim 1, wherein normalizing each current sample data based on the prediction classification result corresponding to each current sample data and the sample classification label to obtain the normalized current sample data set comprises:
comparing the prediction classification result corresponding to each current sample data with the sample classification label, to determine first current sample data whose prediction classification result is consistent with the sample classification label and second current sample data whose prediction classification result is inconsistent with the sample classification label;
carrying out forward normalization processing on the first current sample data to obtain forward normalized first current sample data;
and carrying out inverse normalization processing on the second current sample data to obtain inverse normalized second current sample data.
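A minimal sketch of the split in claim 3, assuming the forward step is ordinary min-max normalization (the claim does not spell that formula out) and using the inverse formula that claim 4 recites; all names are illustrative.

```python
import numpy as np

def renormalize(X, y_pred, y_true):
    """Forward min-max normalization for correctly classified (first) samples,
    inverse normalization for misclassified (second) samples (claim 3)."""
    X = np.asarray(X, dtype=float)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # guard constant features
    correct = np.asarray(y_pred) == np.asarray(y_true)
    out = np.empty_like(X)
    out[correct] = (X[correct] - col_min) / span      # forward: min -> 0, max -> 1
    out[~correct] = (X[~correct] - col_max) / -span   # inverse: max -> 0, min -> 1 (claim 4)
    return out
```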
4. The method according to claim 3, wherein performing inverse normalization on the second current sample data to obtain the inverse-normalized second current sample data comprises:
obtaining a maximum feature value and a minimum feature value corresponding to each original feature in the current sample data set, and subtracting the maximum feature value from the minimum feature value to obtain a first difference value;
subtracting the maximum feature value from the original feature value corresponding to each original feature in each second current sample data to obtain a second difference value corresponding to each original feature in each second current sample data;
and dividing the second difference value corresponding to each original feature in each second current sample data by the first difference value, and taking the quotient as the inverse normalization value corresponding to each original feature in each second current sample data.
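Claim 4 restated as arithmetic with illustrative numbers (the values are assumptions, not from the specification). The net effect is an inverted min-max scaling: the maximum feature value maps to 0 and the minimum maps to 1.

```python
# Worked example of claim 4 with illustrative numbers:
# feature minimum 2, maximum 10, misclassified sample value x = 4.
x_min, x_max, x = 2.0, 10.0, 4.0
first_difference = x_min - x_max                      # min - max = -8
second_difference = x - x_max                         # x - max  = -6
inverse_value = second_difference / first_difference  # (-6) / (-8) = 0.75
assert (x_max - x_max) / first_difference == 0.0      # the maximum maps to 0
assert (x_min - x_max) / first_difference == 1.0      # the minimum maps to 1
```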
5. The method according to claim 3, further comprising, after performing inverse normalization on the second current sample data to obtain the inverse-normalized second current sample data:
determining a current data weight corresponding to the inverse-normalized second current sample data;
and weighting the inverse-normalized second current sample data based on the current data weight to obtain weighted second current sample data.
6. The method of claim 5, wherein determining the current data weight corresponding to the inversely normalized second current sample data comprises:
determining the number of iterations in which the prediction classification result corresponding to the second current sample data is inconsistent with the sample classification label;
and subtracting that inconsistent iteration count from the current iteration count, and dividing the subtraction result by the current iteration count, to obtain the current data weight corresponding to the inverse-normalized second current sample data.
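A sketch of the weighting of claims 5 and 6, reading the subtraction as the current iteration count minus the inconsistent iteration count, which keeps each weight in [0, 1); the function name and inputs are illustrative.

```python
import numpy as np

def weight_misclassified(X_inv, inconsistent_counts, current_iteration):
    """Per-sample weight = (current iterations - inconsistent iterations)
    / current iterations; scale each inverse-normalized second sample by it."""
    counts = np.asarray(inconsistent_counts, dtype=float)
    weights = (current_iteration - counts) / current_iteration  # in [0, 1)
    return np.asarray(X_inv) * weights[:, np.newaxis]           # one weight per row

# usage sketch: 4th iteration; two samples misclassified 1 and 3 times
X_inv = np.array([[0.2, 0.8], [0.5, 0.1]])
weighted = weight_misclassified(X_inv, [1, 3], current_iteration=4)
# weights are 0.75 and 0.25: often-misclassified samples are down-weighted
```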
7. The method of claim 1, further comprising, after determining that training of the classification model for the current period is complete:
determining, for each iteration of training in the current period, a current accuracy based on the prediction classification result and the sample classification label corresponding to that iteration;
and determining a target iteration count for the classification model in the next period of training based on the current iteration count and the current accuracy corresponding to each iteration of training in the current period.
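Claim 7 leaves the selection rule open; the sketch below assumes one plausible rule, namely carrying forward the iteration index at which accuracy peaked in the current period.

```python
import numpy as np

def next_period_iterations(per_iteration_accuracy):
    """Assumed reading of claim 7: use the iteration at which accuracy peaked
    in the current period as the target iteration count for the next period."""
    acc = np.asarray(per_iteration_accuracy)
    return int(np.argmax(acc)) + 1                  # iterations are 1-indexed

# usage sketch: accuracy per iteration in the current period
print(next_period_iterations([0.71, 0.78, 0.80, 0.79]))  # -> 3
```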
8. A training device for a classification model, comprising:
a current sample data set obtaining module, configured to obtain a current sample data set corresponding to a current period, where the current sample data set includes: a plurality of current sample data and sample classification labels corresponding to each current sample data;
a model output acquisition module, configured to perform the current iterative training on the classification model based on the current sample data set, and to acquire a prediction classification result corresponding to each current sample data output by the current training of the classification model;
a sample data set normalization processing module, configured to normalize each current sample data based on the prediction classification result corresponding to each current sample data and the sample classification label, to obtain a normalized current sample data set;
a model training completion determining module, configured to perform the next iterative training on the classification model based on the normalized current sample data set until the current iteration count equals the training iteration count, and to determine that training of the classification model for the current period is complete.
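For completeness, a skeleton of the device of claim 8 with the four claimed modules as methods; every name here is illustrative, and the bodies defer to the method sketches above.

```python
class ClassificationModelTrainingDevice:
    """Claim 8 skeleton: one method per claimed module (names illustrative)."""

    def obtain_current_sample_set(self, period):
        """Current sample data set obtaining module."""
        ...  # fetch current sample data and sample classification labels

    def acquire_model_output(self, model, X, y):
        """Model output acquisition module: train once, collect predictions."""
        model.partial_fit(X, y)               # hypothetical training call
        return model.predict(X)

    def normalize_sample_set(self, X, y_pred, y_true):
        """Sample data set normalization processing module."""
        return renormalize(X, y_pred, y_true)  # see the sketch after claim 3

    def is_training_complete(self, current_iteration, train_iterations):
        """Model training completion determining module."""
        return current_iteration == train_iterations
```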
9. An electronic device, the electronic device comprising:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the training method of a classification model according to any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the training method of a classification model according to any one of claims 1-7.
CN202310462029.0A 2023-04-26 2023-04-26 Training method and device for classification model, electronic equipment and storage medium Pending CN116628491A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310462029.0A CN116628491A (en) 2023-04-26 2023-04-26 Training method and device for classification model, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116628491A (en) 2023-08-22

Family

ID=87596403


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination