CN118094215A - Sample data balancing, model training and classifying method, device and equipment

Info

Publication number
CN118094215A
Authority
CN (China)
Prior art keywords
sample, samples, minority, class, target
Legal status
Pending (assumed; not a legal conclusion)
Application number
CN202410027882.4A
Other languages
Chinese (zh)
Inventors
赵俊杰, 温佳美, 李�昊
Assignee
PICC Information Technology Co., Ltd.

Abstract

The application discloses a sample data balancing method, a model training method, a classification method, and corresponding devices and equipment, belonging to the field of artificial intelligence, and aims to solve the problem of how to balance the number of samples of different classes in an original data set so as to prevent model prediction results from showing an obvious bias. The balancing method comprises the steps of: according to the number of samples in each sample class in an initial sample set, acquiring the samples of the classes with a small number of samples from the initial sample set as minority class samples; based on the minority class samples, acquiring the neighbor samples of each minority class sample by mapping the initial sample set to a linear space; generating a target number of new minority class samples according to the ratio of the number of majority class samples to the number of minority class samples in each sample class, each minority class sample and each neighbor sample; and adding the new minority class samples to the initial sample set. The number of samples under each sample class in the sample set is thereby balanced.

Description

Sample data balancing, model training and classifying method, device and equipment
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a sample data balancing method, a model training method, a classification method, and corresponding devices and equipment.
Background
Insurance underwriting generally refers to the process by which an insurance company reviews, approves, and selects risks in an applicant's insurance application so as to identify standard bodies and non-standard bodies. For example, for personal insurance, an insurance company identifies the standard bodies and non-standard bodies of the insurance by evaluating information such as the health state and occupation category of the insured person.
The standard body usually refers to a group whose body indexes are verified to be normal and whom an insurance company can underwrite normally at the standard premium; the non-standard body refers to insured groups other than the standard body, which generally have abnormal body indexes, so that an insurance company cannot underwrite them normally at the standard premium.
The accurate identification of the standard body and the non-standard body can help an insurance company to reduce risks and effectively prevent insurance fraud.
Currently, with the development of big data analysis and machine learning modeling technology, insurance companies usually utilize machine learning algorithms to mine the features and trends in historical data of applicants, insured persons, insurance applications and the like, so as to predict standard bodies and non-standard bodies. The general method comprises the following steps:
Constructing an original data set by integrating data of a policy, an applicant, an insured person and an agent;
Marking the data of standard bodies in the original data set as positive samples (their label may be 1) and the data of non-standard bodies as negative samples (their label may be 0), taking the underwriting decision as the classification result, and training a classification model with the original data set in a supervised learning manner;
And carrying out probability prediction of standard bodies and non-standard bodies on different insurance applications based on the trained model.
When standard bodies and non-standard bodies are predicted with a machine learning algorithm, the samples of the original data set are unbalanced (the number ratio of standard bodies to non-standard bodies is generally 1:10). Training a two-class model with such an original data set easily causes the trained model, when predicting a class, to favor the classification result corresponding to the class with the larger number of samples in the original data set; that is, the model more easily outputs the result that the object to be predicted is a non-standard body.
In view of the above, how to balance the number of samples of different classes in an original data set, so as to prevent the model prediction results from showing an obvious bias, is a problem to be solved.
Disclosure of Invention
The embodiment of the application provides a sample data balancing method, a model training method, a classification method, and corresponding devices and equipment, which are used for solving the problem of how to balance the number of samples of different classes in an original data set so as to prevent model prediction results from showing an obvious bias.
The embodiment of the application adopts the following technical scheme:
a method of balancing sample data, comprising:
according to the number of samples in each sample class in an initial sample set, acquiring the samples of the sample classes with a small number of samples from the initial sample set as minority class samples;
based on the minority class samples, acquiring the neighbor samples of each of the minority class samples by mapping the initial sample set to a linear space;
Generating a target number of new minority class samples according to the ratio of the number of majority class samples to the number of minority class samples in each sample class, each minority class sample and each neighbor sample; the target number is determined based on the ratio;
the new minority class samples are added to the initial sample set.
A method of training a classification model, comprising:
Acquiring a sample set; the sample set includes an initial sample and a new minority class sample;
Inputting the sample set into a classification model to be trained and performing iterative training; in the iterative training process, adjusting the weight of a target sample in the next round of training according to the accuracy of the prediction result of the target sample in the previous round of training; and obtaining a trained classification model when the iteration condition of the classification model is met. The weight is related to the degree of attention the classification model pays to the target sample;
The new minority samples are obtained according to the balancing method.
A classification method, comprising:
acquiring target data to be classified;
inputting the target data into a trained classification model to obtain a classification result of the target data;
The classification model is obtained through training by the training method.
A sample data balancing apparatus, comprising:
The minority sample selection module is used for acquiring, according to the number of samples in each sample class in the initial sample set, the samples of the sample classes with a small number of samples from the initial sample set as minority class samples;
The neighbor sample determining module is used for acquiring, based on the minority class samples, the neighbor samples of each of the minority class samples by mapping the initial sample set to a linear space;
A new sample generation module, configured to generate a new minority sample of a target number according to a ratio of the number of majority samples to the number of minority samples in each sample class, each minority sample, and each neighbor sample; the target number is determined based on the ratio;
And a sample balancing module, configured to add the new minority class samples to the initial sample set.
A sorting apparatus comprising:
The target acquisition module is used for acquiring target data to be classified;
the classification processing module is used for inputting the target data into a trained classification model to obtain a classification result of the target data;
The classification model is obtained through training by the training method.
A computing device, comprising: a memory and a processor, wherein,
The memory is used for storing a computer program;
The processor is coupled to the memory for executing the computer program stored in the memory for performing the method described above.
A computer readable storage medium storing a computer program which, when executed by a computer, is capable of carrying out the method described above.
The above at least one technical scheme adopted by the embodiment of the application can achieve the following beneficial effects:
According to the minority class samples with a small number of samples under each class of the initial sample set, the neighbor samples of the minority class samples in the linear space, and the ratio of majority class samples to minority class samples, a target number of new minority class samples are generated and added to the initial sample set. This increases the number of minority class samples in the sample set and balances the number of samples under each sample class in the sample set, thereby solving the prior-art problem that model prediction results show an obvious bias owing to unbalanced sample classes in the sample set.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flowchart of a sample data balancing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of linear mapping of an initial sample set and reclassifying of minority class samples during sample data balancing;
FIG. 3 is a flowchart of a training method of a classification model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an implementation of a training method of classification models;
FIG. 5 is a schematic flow chart of a specific implementation of a classification method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a specific flow including sample balancing, model training, and model deployment online provided in an embodiment of the present application;
Fig. 7 is a schematic diagram of a specific structure of a balancing device for sample data according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a specific structure of a training device for classification models according to an embodiment of the present application;
Fig. 9 is a schematic diagram of a specific structure of a sorting device according to an embodiment of the present application;
Fig. 10 is a schematic diagram of a specific structure of a computing device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
As those of ordinary skill in the art will appreciate, with the development of technology and the emergence of new scenarios, the technical solutions provided by the embodiments of the application are likewise applicable to similar technical problems.
The terms first, second and the like in the description, in the claims and in the above-described figures are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely a way of distinguishing objects of the same attribute when describing the embodiments of the application. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Deep learning (Deep Learning, DL) is a research direction in the field of machine learning (Machine Learning, ML); it was introduced into machine learning to bring the field closer to its original goal: artificial intelligence (Artificial Intelligence, AI). In recent years, deep learning has achieved great success in various fields of computer vision, such as object classification and object detection.
The rapid development of machine learning and big data technology has brought new innovations and opportunities to the insurance industry. Big data technology can conveniently extract and integrate massive data to support data analysis. Taking the application scenario of insurance underwriting as an example, machine learning modeling can mine massive historical data in a short time to perform risk assessment on an insurance application, thereby accelerating the underwriting process, reducing cost and improving efficiency. However, owing to the imbalance between underwriting standard-body samples and non-standard-body samples in the data set, the traditional classification modeling method performs poorly and is difficult to apply effectively to the underwriting service.
In the following detailed description of the embodiments of the present application, the insurance underwriting scenario mentioned in the background art is taken as an example and described in more detail. The methods mentioned in the following embodiments of the present application, such as the sample data balancing method, the classification model training method, and the classification method, may also be applied to scenarios such as image classification, text classification, and emotion classification.
An embodiment of the application provides a sample data balancing method, which is used for solving the problem of how to balance the number of samples of different types in an original data set so as to avoid the occurrence of obvious tendency of model prediction results.
The execution subject of the method can be any computing device that can implement the method, such as a server, a mobile phone, a personal computer, an intelligent wearable device, an intelligent robot, and the like.
In addition, the execution sequence of the different steps is not limited in the embodiment of the application. When the method provided by the embodiment of the application is used, the execution sequence of different steps can be adjusted according to actual requirements.
For convenience of description, the method provided by the embodiment of the present application will be described in detail below by taking a balancing device for sample data as an execution subject of the method.
As shown in fig. 1, a flowchart of a specific implementation of a sample data balancing method according to an embodiment of the present application includes the following steps:
Step 11: according to the number of samples in each sample class in the initial sample set, acquiring the samples of the sample classes with a small number of samples from the initial sample set as minority class samples.
The construction of the initial sample set may be: extracting, from a business database of an insurance company, historical data of underwriting decisions made on applicants' insurance applications; integrating information such as the application ID, the corresponding multidimensional features of the insured person and the agent, and the corresponding underwriting result in the historical data, to form samples containing features (the application ID, the multidimensional features) and labels (the underwriting result); and taking the set of samples formed on the basis of the historical data as the initial sample set.
Wherein the multidimensional features comprise the sex, age, marital status, education level, personal income, height, weight, occupation and the like of the insured person, as well as the sex, education level, staff class, work experience, staff status, insurance years and the like of the agent. Integrating the multidimensional features through the application ID can enrich the feature information of each sample.
The label characterizes the underwriting result. In this embodiment, the underwriting result may be standard body or non-standard body; in an actual application scenario, the underwriting result may also be standard acceptance, exclusion, premium loading, postponement, or refusal of cover.
The sample class is determined by the label values of the samples in the initial sample set, and samples with the same label value belong to the same class. For example, the labels of the samples in the initial sample set include label 1 for standard bodies and label 0 for non-standard bodies; the samples with label 1, corresponding to standard bodies, belong to one class, and the samples with label 0, corresponding to non-standard bodies, belong to another class.
Samples under the sample classes with a small number of samples are acquired from the initial sample set; the acquired samples may belong to one class or to multiple classes.
If the sample classes in the initial sample set include two classes (standard body and non-standard body in the above example), the samples of the class whose number is the smaller of the two are taken as the minority class samples (typically, the number of standard-body samples is smaller). Correspondingly, the class with the larger number of samples is taken as the majority class samples.
As another example, if the sample classes in the initial sample set include four classes (e.g., shopping, sports, music and finance, characterizing application types), the minority class samples may be at least one of the following: the class with the fewest samples (e.g., sports); the classes with the fewest and second-fewest samples (e.g., sports and finance); or all the classes other than the class with the most samples (e.g., sports, finance and music). Correspondingly, the majority class samples may be the class with the most samples among all the classes, or the several classes with larger numbers of samples; this is not limited here.
Step 12: based on the minority samples, acquiring neighbor samples of each obtained minority sample in a mode of mapping an initial sample set to a linear space.
The linear space may be the kernel space to which the initial sample set is mapped, that is, the space to which each sample in the initial sample set is mapped by a kernel function (e.g., a Gaussian kernel function). The kernel function maps the samples from a low-dimensional space to a high-dimensional space, so that the initial sample set becomes linearly separable in the high-dimensional space and the distribution of each sample in the linear space is obtained.
Fig. 2 is a schematic diagram of the linear mapping of the initial sample set and the reclassification of the minority class samples, in which the initial sample set is mapped to a two-dimensional linear space; the two marker types in the figure represent samples of different classes, e.g., one marker represents non-standard-body samples and the other represents standard-body samples.
The determination of the neighbor samples may be: calculating the distance values between a target sample and the other samples in the linear space, and arranging the distance values in numerical order; and, based on a set condition, selecting from the initial sample set the samples whose distance values meet the set condition as the neighbor samples of the target sample.
The distance between the target sample and another sample can be determined by the distance between the two points to which the two samples are mapped in the two-dimensional space, or by the inner product of the feature vectors corresponding to the two samples.
The set condition may be a distance threshold or a number threshold. For example, the samples whose distance values are smaller than the distance threshold are the neighbor samples of the target sample; or samples are selected in order of increasing distance until a set number is reached, and those samples are the neighbor samples of the target sample.
As shown in fig. 2, the determined neighbor samples of the target sample a are a1 and a2, and the determined neighbor samples of the target sample B are B1 and B2.
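As an illustrative sketch (not part of the original application text), the kernel-space neighbor determination described above may be implemented as follows, assuming a Gaussian kernel; the function names, the parameter gamma and the choice of k are illustrative assumptions:

import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def kernel_space_neighbors(X, minority_idx, k=5, gamma=1.0):
    # For each minority sample, return the indices of its k nearest
    # neighbors among all samples, measured in the kernel space.
    K = gaussian_kernel(X, X, gamma)
    # Squared kernel-space distance: K(x,x) - 2*K(x,z) + K(z,z);
    # for a Gaussian kernel K(x,x) = 1, so this equals 2 - 2*K(x,z).
    d2 = np.diag(K)[:, None] - 2 * K + np.diag(K)[None, :]
    neighbors = {}
    for i in minority_idx:
        order = np.argsort(d2[i])
        neighbors[i] = [j for j in order if j != i][:k]
    return neighbors

The inner product mentioned above corresponds to the kernel value K(x, z) itself; ranking by the induced distance 2 - 2*K(x, z) is one concrete way to realize the neighbor selection in the kernel space.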
Step 13: generating a target number of new minority class samples according to the ratio of the number of majority class samples to the number of minority class samples in each sample class, each minority class sample and each neighbor sample; the target number is determined based on the ratio.
The determination of the target number may be, for example: if the ratio is 10:1, the target number is 10.
The specific implementation of the step comprises the following steps:
For each of the minority class samples, performing: determining the generation mode of new samples corresponding to the minority class sample; the generation mode is related to the sample class similarity between the minority class sample and its neighbor samples;
And generating the target number of new minority class samples by adopting the generation mode of the new samples, according to the ratio, each minority class sample and each neighbor sample.
The sample class similarity between the minority class sample and its neighbor samples may be the class similarity between the minority class sample and two neighbor samples randomly extracted from its neighbor samples: either the class of the minority class sample differs from the classes of both randomly extracted neighbor samples, or the class of the minority class sample is the same as the class of one of the two randomly extracted neighbor samples.
Determining the generation mode of new samples corresponding to the minority class sample comprises the following steps:
Querying the generation mode of new samples corresponding to the reclassification category of the minority class sample; the reclassification category is determined according to whether the sample class of the minority class sample is the same as the sample classes of its neighbor samples;
and taking the queried generation mode as the generation mode of new samples corresponding to the minority class sample.
The determination of the reclassification category specifically comprises:
If the class of every neighbor sample of a minority class sample is different from the class of the minority class sample, the minority class sample is a noise sample; if among its neighbor samples there is a neighbor sample of the same class as the minority class sample, the minority class sample is a non-noise sample. As shown in fig. 2, samples A and B are both minority class samples. The classes of the neighbor samples a1 and a2 of sample A are different from the class of sample A, so sample A is a noise sample; among the neighbor samples of sample B there exist samples b1 and b2 of the same class as sample B, so sample B is a non-noise sample.
For example, suppose the initial sample set contains two sample classes, namely minority class samples and majority class samples. A minority class sample is denoted J_i, and the minority class sample subset to which J_i belongs is denoted (J_1, ..., J_n). The k neighbor samples of sample J_i are determined according to step 12 above. If all k neighbor samples are majority class samples, sample J_i is classified as a noise sample; if the k neighbor samples include minority class samples, sample J_i is classified as a non-noise sample.
The generation mode of new samples corresponding to each reclassification category is as follows:
(i) For a minority class sample J_i of the non-noise class, two neighbor samples x_1 and x_2 are randomly extracted from its k neighbor samples.
If the classes of the neighbor samples x_1 and x_2 are both different from the class of the minority class sample J_i, the new samples are generated as follows:
t_newi = x_1 + rand(0, 0.5) * (x_2 - x_1)    (1)
X_newi = J_i + rand(0, 1) * (t_newi - J_i)    (2)
In formula (1), t_newi represents a temporary minority class sample; formula (1) randomly generates N temporary minority class samples t_newi using the two samples whose classes differ from that of J_i, where i = 1, ..., N. In formula (2), N new minority class samples X_newi are generated by combining the temporary samples with the minority class sample J_i.
If the classes of the neighbor samples x_1 and x_2 include the minority class, the new samples are generated as follows:
t_newi = x_1 + rand(0, 1) * (x_2 - x_1)    (3)
Based on formula (3), random linear interpolation is performed between the neighbor samples x_1 and x_2 to obtain t_newi, and then random linear interpolation is performed between t_newi and J_i using formula (2), so as to construct N new minority class samples X_newi.
(ii) For a minority class sample of the noise class, up-sampling it would risk introducing noise data; to minimize this risk, the up-sampling magnification N is set to 1. A minority class sample J_k is randomly selected from the minority class samples, and random linear interpolation is performed between J_i and J_k, as shown in formula (4):
X_newi = J_i + rand(0.5, 1) * (J_k - J_i)    (4)
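The synthesis rules (1)-(4), together with the noise/non-noise reclassification above, can be sketched as follows. This is an illustrative reading of the formulas (function and variable names are assumptions), assuming a two-class setting and k >= 2 neighbors per sample:

import numpy as np

rng = np.random.default_rng(0)

def synthesize_for_sample(J_i, neighbor_X, neighbor_y, minority_label,
                          minority_X, N):
    # Generate new minority samples for one minority class sample J_i.
    # neighbor_X / neighbor_y: features and labels of J_i's k neighbors.
    new_samples = []
    if not np.any(neighbor_y == minority_label):
        # Noise sample: up-sampling rate fixed at 1; interpolate toward a
        # randomly chosen minority sample J_k (formula (4)).
        J_k = minority_X[rng.integers(len(minority_X))]
        new_samples.append(J_i + rng.uniform(0.5, 1.0) * (J_k - J_i))
        return new_samples
    for _ in range(N):
        i1, i2 = rng.choice(len(neighbor_X), size=2, replace=False)
        x1, x2 = neighbor_X[i1], neighbor_X[i2]
        if neighbor_y[i1] != minority_label and neighbor_y[i2] != minority_label:
            t = x1 + rng.uniform(0.0, 0.5) * (x2 - x1)   # formula (1)
        else:
            t = x1 + rng.uniform(0.0, 1.0) * (x2 - x1)   # formula (3)
        new_samples.append(J_i + rng.uniform(0.0, 1.0) * (t - J_i))  # formula (2)
    return new_samples

Running this for every minority class sample, with N set from the majority/minority ratio of step 13, yields the target number of new minority class samples.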
Step 14: the new minority class samples are added to the initial sample set.
The new minority class samples are added to the initial sample set to form a balanced sample set in which the number of samples of each class is balanced.
An embodiment of the present application provides a training method for a classification model based on the balancing method for sample data provided in the above embodiment.
The execution subject of the method can be any computing device capable of realizing the method, such as a server, a mobile phone, a personal computer, an intelligent wearable device, an intelligent robot and the like.
Different steps of the method can be implemented by the same execution body or different execution bodies, and the embodiment of the application does not limit what execution body is adopted to implement the method.
In addition, the execution sequence of the different steps is not limited in the embodiment of the application. When the method provided by the embodiment of the application is used, the execution sequence of different steps can be adjusted according to actual requirements.
For convenience of description, taking a model training device as the execution subject of the method, and taking an underwriting scenario of an insurance company as an example, the method provided by the embodiment of the application is described in detail below.
As shown in fig. 3, a flowchart of a specific implementation of a training method of a classification model according to an embodiment of the present application includes the following steps:
Step 31: acquiring a sample set; the sample set includes initial samples and new minority class samples.
The initial samples may be the initial sample set in the sample data balancing method, and the new minority class samples may be the new minority class samples generated in the sample data balancing method.
The construction of the sample set may also be: extracting, from a business database of an insurance company, historical data of underwriting decisions made on applicants' insurance applications; integrating information such as the application ID, the corresponding multidimensional features of the insured person and the agent, and the corresponding underwriting result in the historical data to form samples containing features (the application ID, the multidimensional features) and labels (the underwriting result); and taking the set of samples formed on the basis of the historical data as the sample set.
In one implementation, the method can further comprise preprocessing the data in the acquired sample set by adopting a method in feature engineering.
The data preprocessing includes the processing of missing values and the selection of features.
Wherein the processing of missing values may be: discarding the sample data whose missing-value proportion is large, and filling the missing values in the sample data whose missing-value proportion is small with placeholder characters or values. Whether a missing-value proportion is large or small can be determined with a preset threshold.
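As an illustrative sketch of this step (the threshold of 0.5 and the placeholder value -999 are assumptions, not values from the application):

import pandas as pd

def handle_missing(df: pd.DataFrame, drop_threshold: float = 0.5,
                   fill_value=-999):
    # Drop samples (rows) whose missing-value proportion exceeds the
    # threshold, then fill the remaining missing values with a placeholder.
    row_missing = df.isna().mean(axis=1)
    kept = df.loc[row_missing <= drop_threshold].copy()
    return kept.fillna(fill_value)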
The selection of features may be: binning the features and, for the features in each bin, measuring the contribution degree of the feature to the prediction target by its IV (Information Value), then selecting the features whose contribution degree is in a reasonable range. If a feature's IV value is smaller than 0.1, its contribution to the underwriting prediction is low; if the IV value is higher than 0.8, the feature's contribution is so high that its authenticity is questionable. Therefore, the features whose IV values are in the range of 0.1-0.8 are screened.
It should be noted that, in the two-class classification problems of machine learning, the IV value is mainly used to encode input variables and evaluate their predictive power. The magnitude of a feature variable's IV value represents the strength of the variable's predictive capability. The value range of the IV value is [0, +∞).
The selection of features may also be: calculating the correlation coefficients between features, and eliminating the features weakly correlated with the target feature as well as collinear features, so as to improve the performance of the model. Weak correlation can be determined by setting a threshold for the correlation coefficient: if the calculated correlation coefficient is smaller than the threshold, the two features participating in the calculation are considered weakly correlated. Collinear features are features that are highly correlated with each other; in machine learning they can reduce generalization performance on a sample set, and are therefore eliminated.
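A sketch of the IV screening and the collinearity filter follows; the helper names, the equal-frequency binning, the epsilon smoothing and the correlation cutoff of 0.9 are assumptions (only the 0.1-0.8 IV range comes from the text above):

import numpy as np
import pandas as pd

def information_value(feature: pd.Series, label: pd.Series, bins: int = 10) -> float:
    # IV of a numeric feature for a binary label, over equal-frequency bins.
    eps = 1e-6
    binned = pd.qcut(feature, q=bins, duplicates="drop")
    grouped = pd.crosstab(binned, label).reindex(columns=[0, 1], fill_value=0)
    pos = grouped[1] / max(int(label.sum()), 1) + eps
    neg = grouped[0] / max(int((1 - label).sum()), 1) + eps
    woe = np.log(pos / neg)                    # weight of evidence per bin
    return float(((pos - neg) * woe).sum())

def select_features(df: pd.DataFrame, label: pd.Series,
                    iv_low=0.1, iv_high=0.8, corr_max=0.9):
    # Keep features with IV in [iv_low, iv_high], then drop one feature
    # of each highly correlated (collinear) pair.
    ivs = {c: information_value(df[c], label) for c in df.columns}
    kept = [c for c, v in ivs.items() if iv_low <= v <= iv_high]
    corr = df[kept].corr().abs()
    drop = set()
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            if corr.loc[a, b] > corr_max and b not in drop:
                drop.add(b)
    return [c for c in kept if c not in drop]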
Step 32: inputting the sample set into a classification model to be trained and performing iterative training; in the iterative training process, adjusting the weight of a target sample in the next round of training according to the accuracy of the prediction result of the target sample in the previous round of training; and obtaining a trained classification model when the iteration condition of the classification model is met. The weight is related to the degree of attention the classification model pays to the target sample.
In one implementation, the classification model includes at least two weak learners integrated in series. In this case, the specific implementation process of step 32 includes:
Step 321: inputting the sample set into a classification model to be trained, and performing iterative training;
In order to avoid problems such as insufficient generalization and overfitting of a single classification learner in classification tasks, this embodiment models on the basis of an adaptive boosting algorithm (AdaBoost). The sample set is divided into a training set and a testing set in a certain proportion, and AdaBoost is trained with the training set.
As shown in fig. 4, AdaBoost iteratively learns the feature representations in the application history data by serially integrating a plurality of weak learners (weak learner H1, weak learner H2, ..., weak learner Hn).
Step 322: determining the error of each weak learner in each round of training and the target samples misclassified by each weak learner in each round of training.
Before describing the specific implementation of this step, the samples in the training set are described as follows:
For the N samples in the training set, before training starts each weak learner pays the same attention to every sample, so each sample is given the same weight: w_i = 1/N. The initial weight distribution of the training samples is therefore D_1(i) = 1/N, i ∈ (1, N), as shown in equation (5).
For one iteration, the training process of the classification model is as follows: the weak learner H1 performs classification prediction on each sample in the first training set (i.e., the training set determined in step 321); the error of the weak learner H1 and the target samples misclassified by H1 are counted, and the weights of the target samples misclassified by H1 are adjusted. The second training set with adjusted target-sample weights is input into the weak learner H2, which performs classification prediction on it; the error of H2 is counted, and the weights of the target samples misclassified by H2 are adjusted. The third training set with adjusted target-sample weights is input into the weak learner H3, and so on, until the input n-th training set is classified and predicted by the weak learner Hn and the error of Hn is counted. At this point, one pass of iterative training is complete.
Determination of the target samples misclassified by a weak learner after each training: for a target sample, the classification result of the target sample after training by the weak learner is acquired and compared with the label value of the target sample; if they are the same, the weak learner classified the target sample correctly, otherwise the weak learner misclassified the target sample. In fig. 4, the hexagonal symbols represent misclassified samples.
Error determination of weak learner:
The weak learner H_t is trained with the training set whose weight distribution is D_t(i), and the error of H_t on the weight distribution D_t(i) is calculated as in formula (6):
e_t = P(H_t(x_i) ≠ y_i) = Σ_i w_ti · I(H_t(x_i) ≠ y_i)    (6)
In formula (6), e_t is the error rate of the base learner H_t (the error rate is taken here as the error), and P(H_t(x_i) ≠ y_i) represents the probability that H_t misclassifies sample x_i. w_ti represents the weight of the i-th sample in the t-th round of training. I is an indicator function whose value is 1 when the condition H_t(x_i) ≠ y_i holds and 0 otherwise. The error rate e_t is thus the sum of the weights of the samples misclassified by the weak learner H_t.
Step 323: according to the errors, correspondingly adjusting the influence weight of each weak learner on the prediction result; and adjusting the weights of the misclassified target samples so as to increase the attention the classification model pays to the misclassified target samples.
Based on the error rate of the base learner, the final weight value of the learner and the updated training-sample weight distribution can be calculated with formulas (7) and (8).
In formulas (7) and (8), α_t is the final weight value of learner H_t and D_{t+1} is the updated training-sample weight distribution. For a given target sample, if the target sample is predicted accurately by the learner, its weight value decreases correspondingly after the update; but if it is mispredicted, its weight value increases after the update.
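Formulas (7) and (8) themselves are not reproduced in this text. A standard AdaBoost update consistent with the behavior just described is the following (an assumed reconstruction, not necessarily the exact formulas of the application):

\alpha_t = \frac{1}{2}\ln\frac{1 - e_t}{e_t} \qquad (7)

D_{t+1}(i) = \frac{D_t(i)\,\exp\!\big(-\alpha_t\, y_i\, H_t(x_i)\big)}{Z_t} \qquad (8)

where Z_t is a normalization factor making D_{t+1} a distribution. A correctly classified sample has y_i H_t(x_i) = +1 and its weight shrinks; a misclassified sample has y_i H_t(x_i) = -1 and its weight grows, matching the description above.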
Sample points with large weight values receive more attention in the next round of training, so that they are classified correctly with a larger probability. The principle is as follows: when the next weak classifier is trained, samples with high weight values strongly affect the classification error of the model. If samples with high weight values are misclassified, the sample error calculated from the loss function multiplies. For example, when a sample with a weight of 1 is misclassified, the model error increases by 1; but when a sample with a weight of 10 is misclassified, the model error increases by 10. Since model training aims to minimize the model error, the model must pay more attention to learning the misclassified samples, so that they are classified correctly as far as possible.
Step 324: and when the iteration condition of the classification model is met, using at least two weak learners which are integrated in series and contain respective influence weights as a trained classification model.
The above process is iterated; the iteration stops when the prediction error of the learner is low enough to meet the specified condition, and all the weak learners are fused.
Finally, a linear combination is performed according to the weight values of all the weak learners to obtain the final strong learner H_final, i.e., the trained classification model; the combination method is as shown in formula (9):
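Formula (9) is likewise not reproduced in this text; in standard AdaBoost the combination is H_final(x) = sign(Σ_t α_t · H_t(x)). The whole loop of steps 321-324 can be sketched as follows, assuming depth-1 decision trees (stumps) as the weak learners and labels in {-1, +1}; the weak-learner choice and all names are illustrative assumptions:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50):
    n = len(y)
    D = np.full(n, 1.0 / n)               # initial weights (equation (5))
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=D)
        pred = h.predict(X)
        e = D[pred != y].sum()            # weighted error rate (formula (6))
        if e <= 0 or e >= 0.5:            # stop on degenerate error rates
            break
        alpha = 0.5 * np.log((1 - e) / e)     # learner weight, standard form of (7)
        D = D * np.exp(-alpha * y * pred)     # sample reweighting, standard form of (8)
        D = D / D.sum()
        learners.append(h)
        alphas.append(alpha)

    def strong_classifier(X_new):         # linear combination, standard form of (9)
        scores = sum(a * h.predict(X_new) for a, h in zip(alphas, learners))
        return np.sign(scores)

    return strong_classifier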
in one implementation manner, the training method of the classification model of the present embodiment further includes: and performing model test on the trained classification model by using the test set.
Based on the strong learner obtained by training, the performance of the model is evaluated with the test set. The test set is input into the trained classification model, so that the classification model outputs the underwriting-decision prediction result for each sample in the test set. The AUC index is calculated from the label values and the predicted values of the test set and is used to evaluate the model; the closer the AUC (Area Under ROC Curve, a common index for evaluating the performance of classification models) value is to 1, the better the classification effect of the model. If the model performance is poor, the hyperparameters can be tuned with methods such as a Bayesian optimizer, so that the AUC value of the model on the test set meets the practical application requirements.
The Bayesian hyperparameter tuning may be: adopting the Python third-party package bayes_opt; setting the parameters to be optimized, such as the weak learner class, the maximum number of weak-learner iterations and the learning rate, together with the value range of each parameter; and then calling the Bayesian optimizer method, which through multiple iterations automatically selects the parameter values that optimize the model performance.
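A sketch of such tuning with bayes_opt follows, using AdaBoost's estimator count and learning rate as the searched parameters and test-set AUC as the objective; the bounds, iteration counts, and the use of sklearn's AdaBoostClassifier are assumptions, and X_train, y_train, X_test, y_test are assumed to be prepared beforehand:

from bayes_opt import BayesianOptimization
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score

def auc_for_params(n_estimators, learning_rate):
    # Train with the candidate hyperparameters and score by test-set AUC.
    model = AdaBoostClassifier(n_estimators=int(n_estimators),
                               learning_rate=learning_rate)
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

optimizer = BayesianOptimization(
    f=auc_for_params,
    pbounds={"n_estimators": (50, 500), "learning_rate": (0.01, 1.0)},
    random_state=1,
)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)  # best AUC found and the parameters that produced it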
In one implementation, the training method of the classification model of the present embodiment further includes deployment of the classification model.
In order to ensure the effect of the model in actual production, the PSI (Population Stability Index, a stability index commonly used for evaluating features or models) is adopted to evaluate the stability of the model, so as to monitor the distribution changes of the model across different data sets or time periods and detect instability in time. When the PSI is between 0 and 0.1, the model stability is high. Between 0.1 and 0.25, the model shows some distribution shift over different time periods; the cause of the instability needs to be checked and the monitoring frequency increased. When the PSI value is above 0.25, the stability of the model has degraded, and the model needs to be retrained and reconstructed taking the data of different time periods into consideration, so that the model can fully learn the sample distribution characteristics of different time periods, thereby improving the stability of the model's predictions across time periods.
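The PSI computation itself is standard; a sketch follows (the 10-bin split on the baseline distribution and the epsilon smoothing are assumptions):

import numpy as np

def psi(expected_scores, actual_scores, bins=10):
    # Population Stability Index between a baseline (expected) score
    # distribution and the distribution observed in a new period (actual).
    edges = np.percentile(expected_scores, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected_scores, edges)[0] / len(expected_scores)
    a_pct = np.histogram(actual_scores, edges)[0] / len(actual_scores)
    eps = 1e-6
    return float(np.sum((a_pct - e_pct) * np.log((a_pct + eps) / (e_pct + eps))))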
In the above training method of the classification model, a strong classifier formed by fusing a plurality of weak classifiers is used as the classification model, which solves the problem of low underwriting prediction precision caused by the insufficient generalization and robustness of a single classifier in classification tasks. In the training process of this embodiment, the samples in the training set are given weights related to the degree of attention the classification model pays to them, so that the samples mispredicted in the previous round receive focused attention in the next prediction and are classified correctly, which improves the training efficiency of the model.
An embodiment of the present application provides a classification method on the basis of the classification model training method provided in the above embodiment.
The execution subject of the method can be any computing device capable of realizing the method, such as a server, a mobile phone, a personal computer, an intelligent wearable device, an intelligent robot and the like.
Different steps of the method can be implemented by the same execution body or different execution bodies, and the embodiment of the application does not limit what execution body is adopted to implement the method.
In addition, the execution sequence of the different steps is not limited in the embodiment of the application. When the method provided by the embodiment of the application is used, the execution sequence of different steps can be adjusted according to actual requirements.
For convenience of description, taking a classification device as the execution subject of the method, and taking an underwriting scenario of an insurance company as an example, the method provided by the embodiment of the application is described in detail below.
As shown in fig. 5, a specific flowchart of a classification method according to an embodiment of the present application includes the following steps:
Step 51: acquiring target data to be classified;
In the underwriting scenario of this embodiment, the target data may be the application data of an insurance application. The application data includes the application ID and the corresponding multidimensional features of the insured person and the agent.
This step may also include preprocessing the application data, such as missing-value processing and feature processing, as mentioned in the embodiment of the training method of the classification model.
Step 52: inputting the target data into a trained classification model to obtain a classification result of the target data;
The classification model is obtained by training the training method of the classification model in the embodiment.
In the classification method provided by the embodiment of the application, the classification model adopted is a strong classifier obtained by fusing multiple weak classifiers in training, and the data set used for training the classification model is a balanced sample set in which the number of samples under each class is balanced. The classification model can therefore classify the target to be predicted accurately; that is, the classification method of this embodiment can obtain accurate classification results.
Fig. 6 is a schematic diagram of a specific flow including sample balancing, model training, and model deployment online according to an embodiment of the present application. The flow comprises the following steps:
Data integration: the features of the applicant, insured-person and agent dimensions are integrated through the application ID to form a data set, so as to further enrich the feature information of each sample. The applicant and insured-person dimension features include sex, age, marital status, education level, personal income, height, weight, occupation and the like; the agent dimension features include sex, education level, staff class, work experience, staff status, insurance years and the like.
Feature engineering: for the resulting data set, data preprocessing is performed using methods in feature engineering. The data preprocessing comprises missing-value processing, feature binning and feature screening. The specific implementation may employ the data preprocessing means in step 31 of the classification model training method embodiment.
Data rebalancing: in order to rebalance the unbalanced sample set and improve the model classification performance, a KSMOTE algorithm is proposed for minority class sample generation. The basic idea of the algorithm is as follows. First, a kernel function is used to map the samples to a kernel space, and the distances between sample points are calculated in the kernel space. The standard-body samples (minority class samples) are reclassified in the kernel space, and different up-sampling rates are set according to the reclassification categories. The k neighbors of each standard-body sample are calculated in the whole sample space, and new standard-body samples are generated according to the synthesis rules until the standard-body samples are expanded to balance with the non-standard-body samples. Based on the rebalanced sample set, all samples are mapped back to the original feature space, so as to balance the classifier's understanding of, and predictive power for, both classes of samples. The specific implementation may employ the means of steps 11 to 14 in the sample balancing method embodiment.
Modeling: in order to avoid problems such as insufficient generalization and overfitting of a single classifier in the underwriting classification task, modeling on the basis of an adaptive boosting algorithm (AdaBoost) is proposed. Using the means in the classification model training method embodiment, a plurality of weak learners are fused by weighting to obtain a trained strong classifier as the classification model.
Model test: based on the strong learner obtained by training, the performance of the model is evaluated with the test set. The test set is input into the model to obtain the underwriting-decision prediction result for each sample in the test set. The AUC index is calculated from the label values and the predicted values of the test set to evaluate the model; the closer the AUC value is to 1, the better the classification effect of the model. If the model performance is poor, the hyperparameters can be tuned with methods such as a Bayesian optimizer, so that the AUC value of the model on the test set meets the practical application requirements.
Deployment online: in order to ensure the effect of the model in actual production, the PSI index is adopted to evaluate the stability of the model, so as to monitor the distribution changes of the model across different data sets or time periods and detect instability in time. When the PSI is between 0 and 0.1, the model stability is high. Between 0.1 and 0.25, the model shows some distribution shift over different time periods; the cause of the instability needs to be checked and the monitoring frequency increased. When the PSI value is above 0.25, the stability of the model has degraded, and the model needs to be retrained and reconstructed taking the data of different time periods into consideration, so that the model can fully learn the sample distribution characteristics of different time periods, thereby improving the stability of the model's predictions across time periods.
In order to solve the problem of how to balance the number of samples of different types in the original data set to avoid the occurrence of a significant tendency of model prediction results, an embodiment of the present application provides a sample data balancing apparatus according to the same inventive concept as the sample data balancing method embodiment of the present application.
The specific structural schematic diagram of the balancing device is shown in fig. 7, and the balancing device comprises the following functional modules:
The minority sample selection module 71 is used for acquiring, according to the number of samples in each sample class in the initial sample set, the samples of the sample classes with a small number of samples from the initial sample set as minority class samples;
A neighboring sample determining module 72, configured to obtain, based on the minority samples, neighboring samples of each of the obtained minority samples by mapping the initial sample set to a linear space;
A new sample generation module 73 that generates a new minority sample of a target number from the ratio of the number of majority samples to the number of minority samples in each sample class, each minority sample, and each neighbor sample; the target number is determined based on the ratio;
a sample balancing module 74 for adding the new minority class samples to the initial sample set.
The new sample generating module 73 is configured to determine, for each of the minority class samples, a generating manner of a new sample corresponding to the minority class sample; the generation mode is related to sample category similarity between the minority sample and a neighbor sample of the minority sample;
And generating the new minority samples of the target number by adopting a generation mode of the new samples according to the proportion, the minority samples and the neighbor samples.
Optionally, the new sample generation module 73 determines the generation mode of new samples corresponding to the minority class sample by: querying the generation mode of new samples corresponding to the reclassification category of the minority class sample, and taking the queried generation mode as the generation mode of new samples corresponding to the minority class sample.
Based on the sample data balancing method embodiment, an embodiment of the present application provides a training device for a classification model, which has the same concept as the training method embodiment for the classification model.
The specific structure schematic diagram of the device is shown in fig. 8, and the device comprises the following functional modules:
A sample acquisition module 81 for acquiring a sample set; the sample set includes initial samples and new minority class samples;
The model training module 82 inputs the sample set into a classification model to be trained and performs iterative training, adjusting the weight of a target sample in the next round of training according to the accuracy of the prediction result of the target sample in the previous round; when the iteration condition of the classification model is met, a trained classification model is obtained. The weight is related to the degree of attention the classification model pays to the target sample.
The new minority samples are obtained according to the sample data balancing method embodiment.
The classification model includes at least two weak learners integrated in series, and the model training module 82 includes the following sub-functional modules:
the data input sub-module is used for inputting the sample set into a classification model to be trained and carrying out iterative training;
The data acquisition sub-module is used for determining the error of each weak learner in each round of training and the target samples misclassified by each weak learner in each round of training;
The data processing sub-module correspondingly adjusts the influence weight of each weak learner on the prediction result according to the error; adjusting the weight of the target sample with the wrong classification so as to increase the attention degree of the classification model to the target sample with the wrong classification;
And the model integration sub-module is used for taking at least two weak learners which are integrated in series and contain respective influence weights as a trained classification model when the iteration condition of the classification model is met.
Based on the classification model training method embodiment, an embodiment of the present application provides a classification device according to the same concept as the classification method embodiment.
The specific structure diagram of the classifying device is shown in fig. 9, and the classifying device comprises the following functional modules:
a target acquisition module 91, configured to acquire target data to be classified;
the classification processing module 92 is configured to input the target data into a trained classification model, and obtain a classification result of the target data;
The classification model is obtained by training the method described in the training method embodiment of the classification model.
Embodiments of the present application also provide a computing device, in view of the same inventive concepts as the previous embodiments of the present application.
As shown in fig. 10, the computing device includes: memory 101 and processor 102. The memory 101 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on an electronic device. The memory 101 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
A processor 102 coupled to the memory 101 for executing a computer program stored in the memory 101 for executing a method of balancing sample data as described in the previous embodiments, or a method of training or classifying a classification model as described in an embodiment of the application.
The processor 102 may perform other functions in addition to the above functions when executing the computer program in the memory 101, and in particular, reference is made to the foregoing description of the embodiments.
Further, as shown in fig. 10, the computing device further includes: display 104, communication component 103, power component 105, audio component 106, and other components. Only some of the components are schematically shown in fig. 10, which does not mean that the computing device only includes the components shown in fig. 10.
Accordingly, the embodiments of the present application also provide a computer-readable storage medium storing a computer program, which when executed by a computer is capable of implementing the method provided in each of the above embodiments.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by hardware alone. Based on this understanding, the technical solution above, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments or parts thereof.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method of balancing sample data, comprising:
according to the number of samples in each sample class in an initial sample set, acquiring the samples of the sample class with a small number of samples from the initial sample set as minority class samples;
based on the minority class samples, acquiring the neighbor samples of each obtained minority class sample by mapping the initial sample set to a linear space;
generating a target number of new minority class samples according to the ratio of the number of majority class samples to the number of minority class samples in each sample class, each minority class sample, and each neighbor sample; the target number is determined based on the ratio;
adding the new minority class samples to the initial sample set.
2. The method of claim 1, wherein generating the target number of new minority class samples according to the ratio of the number of majority class samples to the number of minority class samples in each sample class, each minority class sample, and each neighbor sample comprises:
for each minority class sample, performing: determining a generation manner of the new sample corresponding to the minority class sample, the generation manner being related to the sample class similarity between the minority class sample and its neighbor samples;
and generating the target number of new minority class samples in the determined generation manner according to the ratio, the minority class samples, and the neighbor samples.
3. The method of claim 2, wherein determining the generation manner of the new sample corresponding to the minority class sample comprises:
querying, according to the sub-class to which the minority class sample belongs, the generation manner of new samples corresponding to that sub-class;
and determining the queried generation manner as the generation manner of the new sample corresponding to the minority class sample.
4. A method of training a classification model, comprising:
acquiring a sample set; the sample set comprises initial samples and new minority class samples;
inputting the sample set into a classification model to be trained and performing iterative training; during the iterative training, adjusting the weight of a target sample in the next round of training according to the accuracy of the prediction result for that target sample in the previous round; obtaining a trained classification model when the iteration condition of the classification model is met; the weight is related to the degree of attention the classification model pays to the target sample;
wherein the new minority class samples are obtained according to the method of any one of claims 1 to 3.
5. The method of claim 4, wherein the classification model comprises at least two weak learners integrated in series, and wherein
inputting the sample set into the classification model to be trained and performing iterative training, adjusting the weight of the target sample in the next round of training according to the accuracy of the prediction result for the target sample in the previous round, and obtaining a trained classification model when the iteration condition of the classification model is met, comprises:
inputting the sample set into the classification model to be trained and performing iterative training;
determining, in each round of training, the error of each weak learner and the target samples misclassified by each weak learner;
adjusting the influence weight of each weak learner on the prediction result according to its error, and adjusting the weight of the misclassified target samples so as to increase the classification model's attention to them;
and when the iteration condition of the classification model is met, taking the at least two serially integrated weak learners, together with their respective influence weights, as the trained classification model.
6. A method of classification, comprising:
acquiring target data to be classified;
inputting the target data into a trained classification model to obtain a classification result of the target data;
Wherein the classification model is trained by the method of claim 4 or 5.
7. A sample data balancing apparatus, comprising:
a minority class sample selection module, configured to acquire, according to the number of samples in each sample class in an initial sample set, the samples of the sample class with a small number of samples from the initial sample set as minority class samples;
a neighbor sample determination module, configured to acquire, based on the minority class samples, the neighbor samples of each obtained minority class sample by mapping the initial sample set to a linear space;
a new sample generation module, configured to generate a target number of new minority class samples according to the ratio of the number of majority class samples to the number of minority class samples in each sample class, each minority class sample, and each neighbor sample; the target number is determined based on the ratio;
And a sample balancing module, configured to add the new minority class samples to the initial sample set.
8. A classification device, comprising:
The target acquisition module is used for acquiring target data to be classified;
the classification processing module is used for inputting the target data into a trained classification model to obtain a classification result of the target data;
Wherein the classification model is trained by the method of claim 4 or 5.
9. A computing device, comprising: a memory and a processor, wherein,
The memory is used for storing a computer program;
the processor is coupled to the memory and executes the computer program stored in the memory to perform the method of any one of claims 1 to 6.
10. A computer readable storage medium storing a computer program which, when executed by a computer, is capable of carrying out the method of any one of claims 1 to 6.
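Purely as an illustrative sketch of the balancing steps recited in claims 1 and 7, the following Python code generates synthetic minority class samples by interpolating between each minority class sample and its nearest neighbors, in the spirit of SMOTE, with the target number derived from the majority-to-minority ratio. The use of scikit-learn's NearestNeighbors (Euclidean) as the neighbor search over the mapped space, and all names here, are assumptions for the example, not the claimed method.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def balance_minority(X_majority, X_minority, k=5, rng=None):
    """SMOTE-style sketch; assumes the minority class has more than k samples."""
    if rng is None:
        rng = np.random.default_rng()
    ratio = len(X_majority) / len(X_minority)
    target = int((ratio - 1.0) * len(X_minority))  # target number from the class ratio
    if target <= 0:
        return np.empty((0, X_minority.shape[1]))

    # neighbor search over the minority samples; index 0 is the sample itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)

    new_samples = []
    for _ in range(target):
        i = rng.integers(len(X_minority))       # pick a minority sample
        j = idx[i, rng.integers(1, k + 1)]      # pick one of its k neighbors
        gap = rng.random()                      # interpolation coefficient in [0, 1)
        new_samples.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.vstack(new_samples)

# the generated samples are then added back to the initial sample set:
# X_balanced = np.vstack([X_majority, X_minority,
#                         balance_minority(X_majority, X_minority)])
```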
CN202410027882.4A 2024-01-08 2024-01-08 Sample data balancing, model training and classifying method, device and equipment Pending CN118094215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410027882.4A CN118094215A (en) 2024-01-08 2024-01-08 Sample data balancing, model training and classifying method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410027882.4A CN118094215A (en) 2024-01-08 2024-01-08 Sample data balancing, model training and classifying method, device and equipment

Publications (1)

Publication Number Publication Date
CN118094215A (en) 2024-05-28

Family

ID=91160852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410027882.4A Pending CN118094215A (en) 2024-01-08 2024-01-08 Sample data balancing, model training and classifying method, device and equipment

Country Status (1)

Country Link
CN (1) CN118094215A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination