CN116976408A - Method and device for calibrating predictive scores of two classification machine learning models - Google Patents



Publication number
CN116976408A
Authority
CN
China
Prior art keywords
model
prediction score
data set
machine learning
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310845405.4A
Other languages
Chinese (zh)
Inventor
邬子庄
李策
闫旭芃
英继越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310845405.4A priority Critical patent/CN116976408A/en
Publication of CN116976408A publication Critical patent/CN116976408A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and a device for calibrating the prediction scores of a binary classification machine learning model, relating to the technical field of artificial intelligence. The method comprises: inputting feature data into a binary classification calibration model to obtain a calibrated output prediction score. The binary classification calibration model is constructed by: collecting feature data and constructing a first data set; training a first binary classification machine learning model as an upstream model; inputting the feature wide table in the first data set into the upstream model to obtain a first prediction score column; splicing the first prediction score column with the first data set to obtain a second data set; training a second binary classification machine learning model as a midstream model; inputting the feature wide table in the second data set into the midstream model to obtain a second prediction score column; obtaining a third data set from the second prediction score column and the second data set; and training a logistic regression model as a downstream model. The method and the device can improve the calibration of the binary classification machine learning model.

Description

Method and device for calibrating predictive scores of two classification machine learning models
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a method and a device for calibrating the prediction scores of a binary classification machine learning model.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
With the progress of big data and artificial intelligence technology, more and more large enterprises use machine learning models to assist business decisions. Binary classification prediction models based on ensemble methods (such as GBDT models) and on deep learning methods (such as LSTM models) often outperform traditional linear models in practice in terms of ranking ability (which can be measured by metrics such as AUC), and are widely applied in banking, particularly in risk prevention and control and in intelligent marketing.
However, binary classification machine learning models based on ensemble and deep learning methods have complex internal structures. Practical experience shows that when such a black-box model predicts on samples, the model output score generally differs considerably from the probability that the event actually occurs in the samples; that is, the calibration of complex models is often weak.
In banking risk prevention and control and intelligent marketing, certain scenarios place high demands on the calibration of machine learning models. For example, in post-loan early warning in the risk prevention and control field, if the probability that a customer incurs a risk event can be predicted accurately, metrics such as value at risk can be estimated in turn, which assists business personnel in making decisions. In marketing campaign pushing in the intelligent marketing field, if the probability that a customer participates in a campaign can be predicted accurately, the cost and revenue of the campaign can be calculated in turn and marketing resources allocated reasonably.
Therefore, a probability calibration method suitable for binary classification machine learning models is needed, one that converts the prediction output score of the model into the probability that the actual event occurs (such as the probability of a risk event in the risk control field or the probability of marketing success in the marketing field) while preserving the ranking ability of the original model as far as possible. Such a method improves the reliability of the model output and broadens the scenarios in which the model can be applied.
Disclosure of Invention
An embodiment of the invention provides a method for calibrating the prediction scores of a binary classification machine learning model. The method converts the prediction output of the binary classification machine learning model into the probability that the actual event occurs while preserving the ranking ability of the original model as far as possible, thereby improving the calibration of the model. The method comprises:
obtaining feature data to be analyzed for a target field;
inputting the feature data to be analyzed into a binary classification calibration model to obtain an output prediction score, which is the calibrated prediction score of the binary classification machine learning model, wherein the binary classification calibration model inputs the feature data to be analyzed into an upstream model to obtain a first prediction score, inputs the first prediction score into a midstream model to obtain a second prediction score, and inputs the second prediction score into a downstream model to obtain the output prediction score;
the binary classification calibration model is constructed by the following steps:
collecting feature data for the binary classification machine learning model to be built for the target field, and constructing a first data set;
training a first binary classification machine learning model on the first data set as an upstream model;
inputting the feature wide table in the first data set into the upstream model to obtain a first prediction score column;
splicing the first prediction score column with the first data set to obtain a second data set;
training a second binary classification machine learning model on the second data set as a midstream model;
inputting the feature wide table in the second data set into the midstream model to obtain a second prediction score column;
obtaining a third data set from the second prediction score column and the second data set;
training a logistic regression model on the third data set as a downstream model.
An embodiment of the invention also provides a device for calibrating the prediction scores of a binary classification machine learning model. The device converts the prediction output of the binary classification machine learning model into the probability that the actual event occurs while preserving the ranking ability of the original model as far as possible, thereby improving the calibration of the model. The device comprises:
a prediction score calibration module, used for obtaining feature data to be analyzed for a target field, and for inputting the feature data to be analyzed into a binary classification calibration model to obtain an output prediction score, which is the calibrated prediction score of the binary classification machine learning model, wherein the binary classification calibration model inputs the feature data to be analyzed into an upstream model to obtain a first prediction score, inputs the first prediction score into a midstream model to obtain a second prediction score, and inputs the second prediction score into a downstream model to obtain the output prediction score;
a binary classification calibration model construction module, which comprises:
a first data set construction module, used for collecting feature data for the binary classification machine learning model to be built for the target field and constructing a first data set;
an upstream model training module, used for training a first binary classification machine learning model on the first data set as an upstream model;
a first prediction score obtaining module, used for inputting the feature wide table in the first data set into the upstream model to obtain a first prediction score column;
a second data set construction module, used for splicing the first prediction score column with the first data set to obtain a second data set;
a midstream model training module, used for training a second binary classification machine learning model on the second data set as a midstream model;
a second prediction score obtaining module, used for inputting the feature wide table in the second data set into the midstream model to obtain a second prediction score column;
a third data set construction module, used for obtaining a third data set from the second prediction score column and the second data set;
and a downstream model training module, used for training a logistic regression model on the third data set as a downstream model.
An embodiment of the invention also provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above method for calibrating the prediction scores of a binary classification machine learning model when executing the computer program.
An embodiment of the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above method for calibrating the prediction scores of a binary classification machine learning model.
An embodiment of the invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the above method for calibrating the prediction scores of a binary classification machine learning model.
In the embodiment of the invention, feature data to be analyzed for a target field is obtained and input into a binary classification calibration model to obtain an output prediction score, which is the calibrated prediction score of the binary classification machine learning model. The binary classification calibration model inputs the feature data to be analyzed into an upstream model to obtain a first prediction score, inputs the first prediction score into a midstream model to obtain a second prediction score, and inputs the second prediction score into a downstream model to obtain the output prediction score. The binary classification calibration model is constructed by the following steps: collecting feature data for the binary classification machine learning model to be built for the target field, and constructing a first data set; training a first binary classification machine learning model on the first data set as an upstream model; inputting the feature wide table in the first data set into the upstream model to obtain a first prediction score column; splicing the first prediction score column with the first data set to obtain a second data set; training a second binary classification machine learning model on the second data set as a midstream model; inputting the feature wide table in the second data set into the midstream model to obtain a second prediction score column; obtaining a third data set from the second prediction score column and the second data set; and training a logistic regression model on the third data set as a downstream model.
With this scheme the prediction score is calibrated more than once: after the first prediction score is obtained, it is calibrated by the midstream model to give the second prediction score, which is then input into the downstream model for calibration to give the final output prediction score. A logistic regression model builds on a linear model and maps the value of the linear function into the interval from 0 to 1 through a Sigmoid function, using the result as the probability for classification. The model is established under a binomial distribution assumption and its parameters are estimated by the statistical method of maximum likelihood, so the probability it outputs is statistically close to the proportion of samples that actually belong to class 1; that is, the logistic regression model is well calibrated. The prediction output of the binary classification machine learning model is thereby converted into the probability that the actual event occurs while the ranking ability of the original model is preserved as far as possible, which improves the calibration of the binary classification machine learning model.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. In the drawings:
FIG. 1 is a flow chart of a method for calibrating the prediction scores of a binary classification machine learning model in an embodiment of the invention;
FIG. 2 is a flow chart of the construction of the binary classification calibration model in an embodiment of the invention;
FIG. 3 is a schematic diagram of the construction of the binary classification calibration model in an embodiment of the invention;
FIG. 4 is a schematic diagram of a device for calibrating the prediction scores of a binary classification machine learning model in an embodiment of the invention;
FIG. 5 is a schematic diagram of a computer device in an embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings. The exemplary embodiments of the present invention and their descriptions herein are for the purpose of explaining the present invention, but are not to be construed as limiting the invention.
In the technical scheme of the application, the acquisition, storage, use and processing of data all comply with the relevant provisions of national laws and regulations.
First, terms related to the embodiments of the present application will be explained.
A binary classification machine learning model is a classification machine learning model whose prediction result can be formalized as 0 (negative class) or 1 (positive class).
A model output score is the score obtained by inputting a sample to be predicted into a binary classification machine learning model. The score is a decimal between 0 and 1 that indicates how likely the sample is to belong to the positive class: the closer the score is to 1, the more likely the sample belongs to the positive class. Splitting the model output scores at a chosen 0-1 threshold gives the final 0 or 1 prediction result of the classification model.
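As a non-limiting illustration of the thresholding described above, the following Python sketch maps output scores to 0 or 1 predictions. The function name `to_label`, the default threshold of 0.5 and the choice to treat a score equal to the threshold as positive are assumptions made for the example; the text does not fix them.

```python
def to_label(score: float, threshold: float = 0.5) -> int:
    """Map a model output score in [0, 1] to a 0/1 prediction.

    The threshold value and the >= comparison are illustrative assumptions.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return 1 if score >= threshold else 0


scores = [0.12, 0.50, 0.87]
labels = [to_label(s) for s in scores]
```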
Model fusion means combining several machine learning models of different types to form a new machine learning model. Here, model fusion means training an upstream model on an initial training set, then appending the output scores of the upstream model to the initial training set to form a new training set on which a downstream model is trained.
Probability calibration means relearning the output scores of a classification model so that they are converted into probabilities, which improves the probability calibration of the model. The probability calibration of a model is typically measured by the Brier score and the probability calibration curve. A logistic regression model is generally well calibrated because its parameters are estimated by maximum likelihood under a binomial distribution assumption.
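As a non-limiting illustration of the two measures named above, the following Python sketch computes a Brier score (the mean squared difference between score and label) and the points of a calibration curve using equal-width bins. The function names, the equal-width binning and the default number of bins are assumptions made for the example.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted score and 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_curve(y_true, y_prob, n_bins=5):
    """Return (mean predicted score, observed positive rate) per non-empty bin.

    Scores are grouped into n_bins equal-width bins over [0, 1]; the binning
    scheme is an illustrative assumption.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((y, p))
    curve = []
    for members in bins:
        if members:
            mean_score = sum(p for _, p in members) / len(members)
            positive_rate = sum(y for y, _ in members) / len(members)
            curve.append((mean_score, positive_rate))
    return curve
```

For a well calibrated model the points of the curve lie close to the diagonal, and a lower Brier score is better.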
The invention fuses the two kinds of model, making use of the good probability calibration of the logistic regression model and the good ranking ability of the binary classification machine learning model (the ability to rank samples by their probability of belonging to the positive class), so that the output scores of the binary classification machine learning model are probability-calibrated while its ranking ability is preserved as far as possible.
In the embodiment of the invention, the target field is the risk prevention and control field or the intelligent marketing field; other fields are of course also possible, as long as a binary classification machine learning model is applied.
FIG. 1 is a flow chart of a method for calibrating the prediction scores of a binary classification machine learning model in an embodiment of the invention, and FIG. 2 is a flow chart of the construction of the binary classification calibration model in an embodiment of the invention. The method for calibrating the prediction scores of a binary classification machine learning model includes:
step 101, obtaining feature data to be analyzed for a target field;
step 102, inputting the feature data to be analyzed into a binary classification calibration model to obtain an output prediction score, which is the calibrated prediction score of the binary classification machine learning model, wherein the binary classification calibration model inputs the feature data to be analyzed into an upstream model to obtain a first prediction score, inputs the first prediction score into a midstream model to obtain a second prediction score, and inputs the second prediction score into a downstream model to obtain the output prediction score;
the binary classification calibration model is constructed by the following steps:
step 201, collecting feature data for the binary classification machine learning model to be built for the target field, and constructing a first data set;
step 202, training a first binary classification machine learning model on the first data set as an upstream model;
step 203, inputting the feature wide table in the first data set into the upstream model to obtain a first prediction score column;
step 204, splicing the first prediction score column with the first data set to obtain a second data set;
step 205, training a second binary classification machine learning model on the second data set as a midstream model;
step 206, inputting the feature wide table in the second data set into the midstream model to obtain a second prediction score column;
step 207, obtaining a third data set from the second prediction score column and the second data set;
step 208, training a logistic regression model on the third data set as a downstream model.
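As a non-limiting illustration, the chained scoring of step 102 can be sketched in Python as follows. The sketch assumes that each stage receives the same columns it was trained on in steps 201 to 208 (the midstream model sees the features plus the first score; the downstream model sees the features, both scores and their product). The function `calibrated_score` and the stand-in lambda models are hypothetical and serve only to make the example runnable.

```python
from typing import Callable, Sequence

# Hypothetical model type: maps one feature row to a score in [0, 1].
Model = Callable[[Sequence[float]], float]


def calibrated_score(x: Sequence[float], upstream: Model,
                     midstream: Model, downstream: Model) -> float:
    """Chain the three models on a single feature row."""
    p = upstream(x)                               # first prediction score
    q = midstream(list(x) + [p])                  # second prediction score
    return downstream(list(x) + [p, q, p * q])    # calibrated output score


# Stand-in models (not trained; assumptions for illustration only).
result = calibrated_score(
    [1.0, 2.0],
    upstream=lambda row: 0.8,
    midstream=lambda row: row[-1] * 0.5,
    downstream=lambda row: row[-1],
)
```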
The embodiment of the invention calibrates the prediction score more than once: after the first prediction score is obtained, it is calibrated by the midstream model to give the second prediction score, which is then input into the downstream model for calibration to give the final output prediction score. A logistic regression model builds on a linear model and maps the value of the linear function into the interval from 0 to 1 through a Sigmoid function, using the result as the probability for classification. The model is established under a binomial distribution assumption and its parameters are estimated by the statistical method of maximum likelihood, so the probability it outputs is statistically close to the proportion of samples that actually belong to class 1; that is, the logistic regression model is well calibrated. The prediction output of the binary classification machine learning model is thereby converted into the probability that the actual event occurs while the ranking ability of the original model is preserved as far as possible, which improves the calibration of the binary classification machine learning model.
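As a non-limiting illustration of why maximum likelihood estimation ties the output probability to the observed positive proportion, the relations can be written out as follows. The symbols are assumptions introduced for the example: z_i is the i-th input row of the downstream model, w the weight vector, b the intercept and N the number of training samples. The last line holds for an unregularized logistic regression with an intercept, evaluated on its training set.

```latex
\begin{aligned}
\hat{p}(z) &= \sigma(w^{\top} z + b) = \frac{1}{1 + e^{-(w^{\top} z + b)}}, \\
\ell(w, b) &= \sum_{i=1}^{N} \Bigl[ y_i \log \hat{p}(z_i) + (1 - y_i) \log\bigl(1 - \hat{p}(z_i)\bigr) \Bigr], \\
\frac{\partial \ell}{\partial b} &= \sum_{i=1}^{N} \bigl( y_i - \hat{p}(z_i) \bigr) = 0
\;\Longrightarrow\;
\frac{1}{N} \sum_{i=1}^{N} \hat{p}(z_i) = \frac{1}{N} \sum_{i=1}^{N} y_i .
\end{aligned}
```

At the maximum of the log-likelihood, the mean predicted probability therefore equals the fraction of training samples labeled 1.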
The key to the embodiment of the invention is the construction of the binary classification calibration model.
In one embodiment, collecting feature data for the binary classification machine learning model to be built for the target field, and constructing a first data set, includes:
collecting feature data for the binary classification machine learning model to be built for the target field, and forming a feature wide table;
constructing the first data set based on the feature wide table.
In one embodiment, collecting feature data for the binary classification machine learning model to be built for the target field, and forming a feature wide table, includes:
determining, for the target field, the binary classification machine learning model to be built;
collecting feature data for the target field;
preprocessing the feature data, wherein the preprocessing includes filtering out abnormal values and missing values;
performing feature engineering on the preprocessed feature data, wherein the feature engineering includes aggregation and derivation;
forming a feature wide table X from the feature data after feature engineering.
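As a non-limiting illustration, the preprocessing and feature engineering steps above can be sketched in Python as follows. The record fields `customer_id` and `amount`, the value bounds and the derived column names are hypothetical; the text does not specify which features are collected or how abnormal values are defined.

```python
from collections import defaultdict


def build_wide_table(records, low=0.0, high=1e6):
    """Filter raw records, then aggregate them into one row per customer.

    Records with a missing amount or an amount outside [low, high] are
    dropped (the bounds are an illustrative definition of "abnormal").
    """
    kept = [r for r in records
            if r["amount"] is not None and low <= r["amount"] <= high]
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in kept:
        totals[r["customer_id"]] += r["amount"]   # aggregation
        counts[r["customer_id"]] += 1
    return {
        cid: {
            "txn_count": counts[cid],
            "txn_total": totals[cid],
            "txn_mean": totals[cid] / counts[cid],  # derived feature
        }
        for cid in totals
    }


records = [
    {"customer_id": "a", "amount": 100.0},
    {"customer_id": "a", "amount": 300.0},
    {"customer_id": "a", "amount": None},   # missing value, filtered out
    {"customer_id": "b", "amount": 50.0},
    {"customer_id": "b", "amount": 5e9},    # abnormal value, filtered out
]
X = build_wide_table(records)
```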
In one embodiment, constructing the first data set based on the feature wide table includes:
labeling each sample in the feature wide table according to the modeling target of the binary classification machine learning model, with a label value of 0 or 1, to form a label column y;
forming the first data set from the feature wide table and the label column. Specifically, X and y are joined on the primary key to form the data set { X, y } required for modeling.
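As a non-limiting illustration, joining X and y on a primary key can be sketched in Python as follows. Representing both tables as dictionaries keyed by the primary key, keeping only keys present in both, and the function name `join_on_key` are assumptions made for the example.

```python
def join_on_key(X, y):
    """Join features and labels on their shared primary key.

    X maps key -> feature row and y maps key -> 0/1 label; only keys
    present in both are kept (an inner join, chosen for illustration).
    """
    keys = sorted(X.keys() & y.keys())
    return [(key, X[key], y[key]) for key in keys]


X = {"c1": [0.2, 1.5], "c2": [0.7, 0.3], "c3": [0.1, 0.9]}
y = {"c1": 0, "c2": 1}
dataset = join_on_key(X, y)
```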
In one embodiment, training a first binary classification machine learning model on the first data set as an upstream model includes:
splitting the first data set { X, y } into a first training set { X_t, y_t } and a first verification set { X_v, y_v } according to a preset ratio n:1;
training the first binary classification machine learning model using the first training set, wherein the first binary classification machine learning model adopts a complex architecture such as an ensemble model or a deep model;
verifying the trained first binary classification machine learning model using the first verification set;
after the verification is passed, taking the first binary classification machine learning model as the upstream model M1.
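As a non-limiting illustration, the n:1 split into a training set and a verification set can be sketched in Python as follows. Shuffling before the split, the fixed seed and the function name `split_n_to_1` are assumptions made for the example; the text only specifies the preset ratio.

```python
import random


def split_n_to_1(rows, n, seed=0):
    """Split rows into (training, verification) parts in the ratio n:1.

    Rows are shuffled with a fixed seed first (an illustrative choice).
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_verify = len(rows) // (n + 1)
    return rows[n_verify:], rows[:n_verify]


train, verify = split_n_to_1(range(100), n=4)
```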
In one embodiment, inputting the feature wide table in the first data set into the upstream model to obtain a first prediction score column includes:
obtaining the first training set { X_t, y_t } in the first data set;
inputting the feature wide table X_t in the first training set into the upstream model M1, and predicting the probability that each sample in the first training set belongs to class 1, to obtain the first prediction score column p_t.
Here, the prediction score of each sample in the training set is a decimal between 0 and 1.
In one embodiment, splicing the first prediction score column with the first data set to obtain a second data set includes:
splicing the first prediction score column p_t with the feature wide table X_t of the first training set { X_t, y_t } in the first data set to obtain a new feature wide table Y_t = { X_t, p_t };
forming the second data set { Y_t, y_t } from the new feature wide table Y_t = { X_t, p_t } and the label column y_t of the first training set.
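As a non-limiting illustration, splicing the score column p_t onto the feature wide table X_t can be sketched in Python as follows. Representing the table as a list of rows and the function name `append_column` are assumptions made for the example.

```python
def append_column(table, column):
    """Return a new table with one value from `column` appended to each row."""
    if len(table) != len(column):
        raise ValueError("table and column must have the same number of rows")
    return [list(row) + [value] for row, value in zip(table, column)]


X_t = [[1.0, 2.0], [3.0, 4.0]]   # feature wide table of the training set
p_t = [0.2, 0.9]                 # first prediction score column
Y_t = append_column(X_t, p_t)    # new feature wide table { X_t, p_t }
```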
When the second binary classification machine learning model is trained on the second data set, it adopts a complex architecture, such as an ensemble model or a deep model, that differs from the architecture of the first binary classification machine learning model.
In one embodiment, the method further comprises:
inputting the feature wide table X_v in the first verification set { X_v, y_v } into the upstream model, and predicting the probability that each sample in the first verification set belongs to class 1, to obtain a third prediction score column p_v;
splicing the third prediction score column p_v with the feature wide table X_v of the first verification set in the first data set to obtain a new feature wide table Y_v = { X_v, p_v };
forming a second verification set { Y_v, y_v } from the new feature wide table and the label column y_v in the first verification set;
after the midstream model is obtained, verifying the midstream model using the second verification set;
after the verification is passed, outputting the verified midstream model M2;
inputting the feature wide table in the second data set into the midstream model to obtain a second prediction score column then includes: inputting the feature wide table in the second data set into the verified midstream model to obtain the second prediction score column.
In one embodiment, obtaining a third data set from the second prediction score column and the second data set includes:
splicing the second prediction score column q_t with the second data set { Y_t, y_t };
adding a first combined feature p_t*q_t to obtain the third data set Z_t = { X_t, p_t, q_t, p_t*q_t }, wherein each element of the first combined feature is the product of the element of the first prediction score column and the element of the second prediction score column at the corresponding position.
Because p_t and q_t carry the discrimination information that the upstream model M1 and the midstream model M2 extract from the original feature wide table, the downstream model M3 can achieve higher discrimination accuracy.
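As a non-limiting illustration, building the rows of the third data set from Y_t = { X_t, p_t } and q_t can be sketched in Python as follows. The sketch assumes that p_t is stored as the last column of each row of Y_t; the function name `build_third_table` is hypothetical.

```python
def build_third_table(Y_t, q_t):
    """Append q_t and the combined feature p_t*q_t to every row of Y_t.

    Assumes the last entry of each row of Y_t is the first prediction score.
    """
    Z_t = []
    for row, q in zip(Y_t, q_t):
        p = row[-1]
        Z_t.append(list(row) + [q, p * q])
    return Z_t


Y_t = [[1.0, 0.5], [2.0, 0.25]]   # rows of { X_t, p_t }
q_t = [0.5, 0.5]                  # second prediction score column
Z_t = build_third_table(Y_t, q_t)
```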
In one embodiment, the method further comprises:
inputting the feature wide table Y_v in the second verification set into the midstream model M2, and predicting the probability that each sample in the second verification set belongs to class 1, to obtain a fourth prediction score column q_v;
splicing the fourth prediction score column q_v, the label column y_v in the first verification set, the feature wide table Y_v = { X_v, p_v } in the second verification set and a second combined feature to obtain a third verification set Z_v, wherein each element of the second combined feature is the product of the element of the third prediction score column and the element of the fourth prediction score column at the corresponding position;
after the downstream model is obtained, verifying the downstream model using the third verification set Z_v;
after the verification is passed, outputting the verified downstream model.
Here, the downstream model M3 improves calibration while preserving the discrimination ability of M1 and M2 as far as possible.
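As a non-limiting illustration of the downstream stage, the following Python sketch fits a logistic regression by gradient ascent on the mean log-likelihood. For brevity the rows of Z_t contain only p, q and p*q (the original features X_t are left out), and the toy scores, labels, learning rate and number of epochs are all assumptions made for the example, not data or settings from the embodiment.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(Z, w, b):
    """Calibrated probability for each row of Z."""
    return [sigmoid(b + sum(wj * zj for wj, zj in zip(w, row))) for row in Z]


def log_likelihood(Z, y, w, b):
    """Bernoulli log-likelihood of labels y under the fitted model."""
    return sum(math.log(p) if label == 1 else math.log(1.0 - p)
               for p, label in zip(predict(Z, w, b), y))


def fit_logistic(Z, y, lr=0.5, epochs=2000):
    """Maximum likelihood fit by plain gradient ascent (no regularization)."""
    n, d = len(Z), len(Z[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errors = [label - p for p, label in zip(predict(Z, w, b), y)]
        b += lr * sum(errors) / n
        w = [wj + lr * sum(e * row[j] for e, row in zip(errors, Z)) / n
             for j, wj in enumerate(w)]
    return w, b


# Toy upstream and midstream scores (p, q) with 0/1 labels.
pq = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.6), (0.6, 0.7),
      (0.3, 0.4), (0.4, 0.3), (0.2, 0.1), (0.1, 0.2)]
Z_t = [[p, q, p * q] for p, q in pq]
y_t = [1, 1, 0, 1, 0, 1, 0, 0]

w, b = fit_logistic(Z_t, y_t)
calibrated = predict(Z_t, w, b)
```

Because the fit includes an intercept and no regularization, the mean of `calibrated` approaches the observed positive rate of `y_t` as the fit converges.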
A specific example is given below to illustrate the application of the method of the invention.
Fig. 3 is a schematic diagram of constructing a two-class calibration model in an embodiment of the present invention.
In one specific embodiment, a commercial bank uses historical credit risk data to build a two-class machine learning model that predicts, at the time a loan is issued, the probability that the loan will perform poorly. The training set feature X is loan feature data for year MM, with 1,000,000 rows and 75 columns, including customer financial information, transaction information and the like. The training set label y indicates whether each loan defaulted in the year following MM, and consists of 1,000,000 rows and 1 column of 0 or 1 values.
In another specific embodiment, a commercial bank uses transaction record data of financial market traders to build a two-class machine learning model that predicts the probability that a transaction involves a violation. The training set feature X is transaction feature data for year MM, with 500,000 rows and 40 columns, including the market price time point, the highest market price, the lowest market price, the fluctuation of trading prices across all transactions as a whole, the fluctuation of trading prices across all transactions within the institution to which the transaction belongs, and the like. The training set label y records whether each transaction was found to be in violation when the transaction records were inspected after the fact, and consists of 500,000 rows and 1 column of 0 or 1 values.
When constructing the two-class calibration model, a first-type two-class machine learning model is first trained using X and the corresponding y. Here the first-type two-class machine learning model is a two-class boosted tree model trained with the LightGBM algorithm, which yields the upstream model M1.
Then, X is input into the upstream model M1 for prediction to obtain an output score p, which has 1,000,000 rows and 1 column and consists of decimals between 0 and 1.
X and p are spliced horizontally to obtain new feature data Y, which has 1,000,000 rows and 76 columns. A second-type two-class machine learning model is trained using Y and the corresponding y; here the second-type two-class machine learning model uses a deep neural network algorithm, which yields the midstream model M2.
Y is input into the midstream model M2 for prediction to obtain an output score q, which has 1,000,000 rows and 1 column and consists of decimals between 0 and 1.
Y and q are spliced horizontally, and the combined feature p*q of p and q is added, to obtain new feature data Z, which has 1,000,000 rows and 78 columns. A logistic regression model is fitted using Z and the corresponding y to obtain the downstream model M3.
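The three-stage construction described above can be sketched end to end. The code below is only an illustration under stated assumptions: it uses scikit-learn, with `GradientBoostingClassifier` standing in for the LightGBM boosted tree model and `MLPClassifier` standing in for the deep neural network; the synthetic data, hyperparameters and the helper name `predict_calibrated` are all hypothetical and are not taken from the patent.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the feature wide table X and the label column y.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Upstream model M1 (assumed stand-in for the LightGBM boosted tree model).
m1 = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
p = m1.predict_proba(X)[:, 1]

# Y = X with p appended; midstream model M2 (assumed stand-in for the deep neural network).
Y = np.column_stack([X, p])
m2 = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0).fit(Y, y)
q = m2.predict_proba(Y)[:, 1]

# Z = Y with q and the combined feature p*q appended; downstream model M3 is logistic regression.
Z = np.column_stack([Y, q, p * q])
m3 = LogisticRegression(max_iter=1000).fit(Z, y)
calibrated = m3.predict_proba(Z)[:, 1]

def predict_calibrated(X_new: np.ndarray) -> np.ndarray:
    """Chain M1 -> M2 -> M3 on new feature rows and return the calibrated scores."""
    p_new = m1.predict_proba(X_new)[:, 1]
    Y_new = np.column_stack([X_new, p_new])
    q_new = m2.predict_proba(Y_new)[:, 1]
    Z_new = np.column_stack([Y_new, q_new, p_new * q_new])
    return m3.predict_proba(Z_new)[:, 1]
```

As in the description, each stage here scores the same rows it was trained on. Whether out-of-fold scores would be preferable for the later stages is a design question that the source does not address.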
The method provided by the embodiment of the invention converts the predicted output score of a complex two-class machine learning model into the probability that the actual event occurs by fusing it with a logistic regression model, without requiring any additional input data, thereby improving the calibration of the complex model. In a test on actual risk control data of a commercial bank, compared with the upstream LightGBM model, the downstream logistic regression model showed a decrease of less than 1% in AUC (area under the ROC curve), which measures the discrimination ability of a model, while the Brier score, which measures the calibration of a model, improved markedly.
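The two metrics mentioned here measure different things, which a small example can make concrete. The sketch below implements both in plain Python from their standard definitions; the function names and the toy scores are hypothetical and are not the bank's data.

```python
from typing import Sequence

def brier_score(y_true: Sequence[int], prob: Sequence[float]) -> float:
    """Mean squared difference between the predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, prob)) / len(y_true)

def auc(y_true: Sequence[int], score: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, score) if y == 1]
    neg = [s for y, s in zip(y_true, score) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

y = [0, 0, 1, 1]
raw_scores = [0.6, 0.7, 0.8, 0.9]         # correct ordering, but too high for the negatives
calibrated_scores = [0.1, 0.2, 0.8, 0.9]  # same ordering, closer to the actual outcomes
```

Both score lists rank the samples identically, so their AUC is the same, while the second list has the lower (better) Brier score. This is the kind of trade-off the paragraph above reports: ranking ability roughly preserved, calibration improved.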
In banking risk prevention and control and intelligent marketing scenarios, it is often necessary to quantitatively estimate the risk or benefit of a decision, rather than merely rank risks or benefits by relative size. For example, in a loan risk early-warning scenario, the business side cares not only about which loans in a batch are likely to default, but also hopes to estimate the value at risk by combining the loan amount, and to make decisions and take action according to its magnitude. Such scenarios place requirements on the calibration of the artificial intelligence model. This patent creatively proposes a method and a device that use logistic regression model fusion to perform probability calibration on the output scores of a two-class machine learning model, converting the predicted output scores of a complex two-class machine learning model into the probability that the actual event occurs and extending the application scenarios of intelligent models.
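A short example shows why a probability, and not only a ranking, matters once the loan amount enters the decision. The formula used below (amount multiplied by default probability) is an assumed, simplified expected-loss measure chosen for illustration; the source does not specify how the bank combines the two quantities, and the loan records are hypothetical.

```python
# Hypothetical loans with calibrated default probabilities.
loans = [
    {"id": "A", "amount": 1_000_000, "prob_default": 0.02},
    {"id": "B", "amount": 50_000, "prob_default": 0.20},
]

def expected_loss(loan: dict) -> float:
    """Assumed simple measure: loan amount times probability of default."""
    return loan["amount"] * loan["prob_default"]

# Ranking by probability alone puts loan B first.
ranked_by_prob = sorted(loans, key=lambda loan: loan["prob_default"], reverse=True)

# Ranking by expected loss puts loan A first, because its amount is much larger.
ranked_by_loss = sorted(loans, key=expected_loss, reverse=True)
```

The second ranking is only meaningful if `prob_default` is a calibrated probability; an uncalibrated score with the same ordering would give arbitrary loss estimates.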
In summary, in the method provided by the embodiment of the invention, feature data to be analyzed for a target field is obtained, and the feature data to be analyzed is input into a two-class calibration model to obtain an output prediction score, which is the calibrated prediction score of the two-class machine learning model. The two-class calibration model inputs the feature data to be analyzed into an upstream model to obtain a first prediction score, inputs the first prediction score into a midstream model to obtain a second prediction score, and inputs the second prediction score into a downstream model to obtain the output prediction score. The two-class calibration model is constructed by the following steps: collecting feature data for the two-class machine learning model to be modeled for the target field, and constructing a first data set; training, based on the first data set, a first-type two-class machine learning model as the upstream model; inputting the feature wide table in the first data set into the upstream model to obtain a first prediction score column; splicing the first prediction score column with the first data set to obtain a second data set; training, based on the second data set, a second-type two-class machine learning model as the midstream model; inputting the feature wide table in the second data set into the midstream model to obtain a second prediction score column; obtaining a third data set from the second prediction score column and the second data set; and training, based on the third data set, a logistic regression model as the downstream model.
Through the above scheme, the prediction score is calibrated in multiple stages: after the first prediction score is obtained, it is passed through the midstream model to obtain the second prediction score, which is then input into the downstream model to obtain the final output prediction score. The logistic regression model builds on a linear model and maps the value of the linear function to the interval from 0 to 1 through the Sigmoid function, using the result as the probability for classification. The model is established under the assumption of a binomial distribution, and its parameters are estimated by the maximum likelihood method, so the probability it outputs is statistically close to the proportion of samples that actually belong to class 1; that is, the logistic regression model is well calibrated. Consequently, the predicted output value of the two-class machine learning model is converted into the probability that the actual event occurs while the ranking ability of the original two-class machine learning model is preserved as far as possible, and the calibration of the two-class machine learning model is improved.
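The calibration property attributed to logistic regression can be illustrated with the standard maximum likelihood condition. The derivation below is a textbook sketch and is not taken from the patent; the symbols z_i (a row of the third data set), w (weights), b (intercept) and \hat{p}_i (predicted probability) are notation introduced here, and the result assumes an unregularized fit that includes an intercept.

```latex
\begin{aligned}
\hat{p}_i &= \sigma(w^\top z_i + b) = \frac{1}{1 + e^{-(w^\top z_i + b)}},\\
\ell(w, b) &= \sum_{i=1}^{n} \bigl[\, y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i) \,\bigr],\\
\frac{\partial \ell}{\partial b} &= \sum_{i=1}^{n} (y_i - \hat{p}_i) = 0
\;\Longrightarrow\;
\frac{1}{n}\sum_{i=1}^{n} \hat{p}_i = \frac{1}{n}\sum_{i=1}^{n} y_i .
\end{aligned}
```

At the maximum likelihood solution the average predicted probability therefore equals the observed proportion of class 1 in the training sample. This is a statement about the training sample as a whole; how well it carries over to subgroups or to new data depends on the data and the features.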
The embodiment of the invention also provides a device for calibrating the prediction scores of a two-class machine learning model, as described in the following embodiment. Because the principle by which the device solves the problem is similar to that of the method for calibrating the prediction scores of a two-class machine learning model, the implementation of the device can refer to the implementation of the method, and repeated description is omitted.
FIG. 4 is a schematic diagram of a device for calibrating the prediction scores of a two-class machine learning model in an embodiment of the invention. The device comprises:
a prediction score calibration module 401, configured to obtain feature data to be analyzed for a target domain; inputting the characteristic data to be analyzed into a two-class calibration model to obtain an output prediction score after calibrating the prediction score of the two-class machine learning model, wherein the two-class calibration model inputs the characteristic data to be analyzed into an upstream model to obtain a first prediction score, inputs the first prediction score into a midstream model to obtain a second prediction score, and inputs the second prediction score into a downstream model to obtain an output prediction score after calibrating the prediction score of the two-class machine learning model;
a two-class calibration model construction module 402, comprising:
a first data set construction module 4021, configured to collect feature data for the two-class machine learning model to be modeled for the target field, and construct a first data set;
an upstream model training module 4022, configured to train, based on the first data set, a first-type two-class machine learning model as the upstream model;
a first prediction score obtaining module 4023, configured to input the feature wide table in the first data set into the upstream model to obtain a first prediction score column;
a second data set construction module 4024, configured to splice the first prediction score column with the first data set to obtain a second data set;
a midstream model training module 4025, configured to train, based on the second data set, a second-type two-class machine learning model as the midstream model;
a second prediction score obtaining module 4026, configured to input the feature wide table in the second data set into the midstream model to obtain a second prediction score column;
a third data set construction module 4027, configured to obtain a third data set from the second prediction score column and the second data set;
a downstream model training module 4028, configured to train, based on the third data set, a logistic regression model as the downstream model.
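The inference path of the prediction score calibration module can be sketched as a small class that chains three scorers. This is an illustration only: the class name `CalibrationModel`, the use of plain callables for the three models, and the toy scorers are hypothetical. The feature layout passed between stages follows the construction steps of the data sets (Y = {X, p}, Z = {X, p, q, p*q}) and is an assumption about how inference mirrors training.

```python
from dataclasses import dataclass
from typing import Callable, List

Row = List[float]

@dataclass
class CalibrationModel:
    """Chains upstream, midstream and downstream scorers for a single feature row."""
    upstream: Callable[[Row], float]    # M1: features X -> p
    midstream: Callable[[Row], float]   # M2: features {X, p} -> q
    downstream: Callable[[Row], float]  # M3: features {X, p, q, p*q} -> calibrated score

    def predict(self, x: Row) -> float:
        p = self.upstream(x)
        y_row = x + [p]                 # append the first prediction score
        q = self.midstream(y_row)
        z_row = y_row + [q, p * q]      # append the second score and the combined feature
        return self.downstream(z_row)

# Toy scorers standing in for trained models.
model = CalibrationModel(
    upstream=lambda x: 0.5,
    midstream=lambda row: row[-1] / 2,  # q = p / 2
    downstream=lambda row: row[-1],     # returns the combined feature p * q
)
score = model.predict([1.0, 2.0])
```

Keeping the three models behind one `predict` call means callers never have to rebuild the intermediate feature rows themselves.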
In one embodiment, the first data set construction module is specifically configured to:
collecting feature data for the two-class machine learning model to be modeled for the target field, and forming a feature wide table;
constructing the first data set based on the feature wide table.
In one embodiment, the first data set construction module is specifically configured to:
determining, for the target field, the two-class machine learning model to be modeled;
collecting feature data for the target field;
preprocessing the feature data, wherein the preprocessing comprises filtering abnormal values and missing values;
performing feature engineering processing on the preprocessed feature data, wherein the feature engineering processing comprises aggregation and derivation;
and forming a feature wide table according to the feature data processed by the feature engineering.
In one embodiment, the first data set construction module is specifically configured to:
labeling each piece of feature data in the feature wide table according to the modeling target of the two-class machine learning model to form a label column;
forming the first data set from the feature wide table and the label column.
In one embodiment, the upstream model training module is specifically configured to:
splitting the first data set into a first training set and a first verification set according to a preset proportion;
training a first-type two-class machine learning model by using the first training set;
verifying the trained first-type two-class machine learning model by using the first verification set;
after the verification is passed, obtaining the first-type two-class machine learning model as the upstream model.
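Splitting the first data set by a preset proportion can be sketched as follows. The 80/20 ratio, the fixed random seed and the function name `split_dataset` are assumptions made for illustration; the source only says that the proportion is preset.

```python
import random
from typing import List, Sequence, Tuple

def split_dataset(rows: Sequence, train_ratio: float = 0.8, seed: int = 0) -> Tuple[List, List]:
    """Shuffle the rows and split them into a training set and a verification set."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Toy first data set: (feature row, label) pairs.
first_dataset = [([float(i)], i % 2) for i in range(10)]
train_set, verification_set = split_dataset(first_dataset, train_ratio=0.8)
```

Because every later stage reuses the same training and verification rows, the split should be done once and kept fixed across the upstream, midstream and downstream models.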
In an embodiment, the first prediction score obtaining module is specifically configured to:
obtaining a first training set in a first data set;
inputting the feature wide table in the first training set into the upstream model, and predicting the probability that each piece of feature data in the first training set belongs to a first category, to obtain the first prediction score column.
In an embodiment, the second data set construction module is specifically configured to:
splicing the first prediction score column with the feature wide table of the first training set in the first data set to obtain a new feature wide table;
forming the second data set from the new feature wide table and the label column of the first training set.
In an embodiment, the first prediction score obtaining module is further configured to:
inputting the feature wide table in the first verification set into the upstream model, and predicting the probability that each piece of feature data in the first verification set belongs to the first category, to obtain a third prediction score column;
the second data set construction module is further configured to:
splicing the third prediction score column with the first verification set in the first data set to obtain a new feature wide table;
forming a second verification set from the new feature wide table and the label column in the first verification set;
the midstream model training module is further for:
after the midstream model is obtained, verifying the midstream model by adopting a second verification set;
after the verification is passed, outputting a midstream model passing the verification;
the second prediction score obtaining module is specifically configured to: input the feature wide table in the second data set into the verified midstream model to obtain the second prediction score column.
In one embodiment, the third data set construction module is specifically configured to:
splicing the second prediction score column with the second data set;
adding a first combined feature to the spliced set to obtain the third data set, wherein each element of the first combined feature is the product of the elements at the corresponding position of the first prediction score column and the second prediction score column.
In an embodiment, the second prediction score obtaining module is further configured to:
inputting the feature wide table in the second verification set into the midstream model, and predicting the probability that each piece of feature data in the second verification set belongs to the first category, to obtain a fourth prediction score column;
the third data set construction module is further configured to:
splicing the fourth prediction score column, the label column in the first verification set, the feature wide table in the second verification set and a second combined feature to obtain a third verification set, wherein each element of the second combined feature is the product of the elements at the corresponding position of the third prediction score column and the fourth prediction score column;
the downstream model training module is further configured to:
after the downstream model is obtained, verifying the downstream model by using the third verification set;
after the verification is passed, the downstream model that passed the verification is output.
In summary, in the device provided by the embodiment of the invention, feature data to be analyzed for a target field is obtained, and the feature data to be analyzed is input into a two-class calibration model to obtain an output prediction score, which is the calibrated prediction score of the two-class machine learning model. The two-class calibration model inputs the feature data to be analyzed into an upstream model to obtain a first prediction score, inputs the first prediction score into a midstream model to obtain a second prediction score, and inputs the second prediction score into a downstream model to obtain the output prediction score. The two-class calibration model is constructed by the following steps: collecting feature data for the two-class machine learning model to be modeled for the target field, and constructing a first data set; training, based on the first data set, a first-type two-class machine learning model as the upstream model; inputting the feature wide table in the first data set into the upstream model to obtain a first prediction score column; splicing the first prediction score column with the first data set to obtain a second data set; training, based on the second data set, a second-type two-class machine learning model as the midstream model; inputting the feature wide table in the second data set into the midstream model to obtain a second prediction score column; obtaining a third data set from the second prediction score column and the second data set; and training, based on the third data set, a logistic regression model as the downstream model.
Through the above scheme, the prediction score is calibrated in multiple stages: after the first prediction score is obtained, it is passed through the midstream model to obtain the second prediction score, which is then input into the downstream model to obtain the final output prediction score. The logistic regression model builds on a linear model and maps the value of the linear function to the interval from 0 to 1 through the Sigmoid function, using the result as the probability for classification. The model is established under the assumption of a binomial distribution, and its parameters are estimated by the maximum likelihood method, so the probability it outputs is statistically close to the proportion of samples that actually belong to class 1; that is, the logistic regression model is well calibrated. Consequently, the predicted output value of the two-class machine learning model is converted into the probability that the actual event occurs while the ranking ability of the original two-class machine learning model is preserved as far as possible, and the calibration of the two-class machine learning model is improved.
An embodiment of the present invention further provides a computer device, and fig. 5 is a schematic diagram of a computer device in the embodiment of the present invention, where the computer device 500 includes a memory 510, a processor 520, and a computer program 530 stored in the memory 510 and capable of running on the processor 520, and the method for calibrating the prediction scores of the two-class machine learning model is implemented when the processor 520 executes the computer program 530.
Embodiments of the present invention also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the above-described method of calibrating the prediction scores of a two-class machine learning model.
Embodiments of the present invention also provide a computer program product comprising a computer program which, when executed by a processor, implements the above-described method of calibrating the prediction scores of a two-class machine learning model.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description of the embodiments is provided to illustrate the general principles of the invention and is not intended to limit the scope of the invention to the particular embodiments described; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (13)

1. A method of calibrating prediction scores of a two-class machine learning model, comprising:
obtaining feature data to be analyzed for a target field;
inputting the feature data to be analyzed into a two-class calibration model to obtain an output prediction score after calibrating the prediction score of the two-class machine learning model, wherein the two-class calibration model inputs the feature data to be analyzed into an upstream model to obtain a first prediction score, inputs the first prediction score into a midstream model to obtain a second prediction score, and inputs the second prediction score into a downstream model to obtain the output prediction score after calibrating the prediction score of the two-class machine learning model;
wherein the two-class calibration model is constructed by the following steps:
collecting feature data for the two-class machine learning model to be modeled for the target field, and constructing a first data set;
training, based on the first data set, a first-type two-class machine learning model as the upstream model;
inputting the feature wide table in the first data set into the upstream model to obtain a first prediction score column;
splicing the first prediction score column with the first data set to obtain a second data set;
training, based on the second data set, a second-type two-class machine learning model as the midstream model;
inputting the feature wide table in the second data set into the midstream model to obtain a second prediction score column;
obtaining a third data set from the second prediction score column and the second data set;
training, based on the third data set, a logistic regression model as the downstream model.
2. The method of claim 1, wherein collecting feature data for the two-class machine learning model to be modeled for the target field, and constructing a first data set, comprises:
collecting feature data for the two-class machine learning model to be modeled for the target field, and forming a feature wide table;
constructing the first data set based on the feature wide table.
3. The method of claim 2, wherein collecting feature data for the two-class machine learning model to be modeled for the target field, and forming a feature wide table, comprises:
determining, for the target field, the two-class machine learning model to be modeled;
collecting feature data for the target field;
preprocessing the feature data, wherein the preprocessing comprises filtering abnormal values and missing values;
performing feature engineering processing on the preprocessed feature data, wherein the feature engineering processing comprises aggregation and derivation;
forming the feature wide table from the feature data after the feature engineering processing.
4. The method of claim 2, wherein constructing the first data set based on the feature wide table comprises:
labeling each piece of feature data in the feature wide table according to the modeling target of the two-class machine learning model to form a label column;
forming the first data set from the feature wide table and the label column.
5. The method of claim 1, wherein training, based on the first data set, a first-type two-class machine learning model as the upstream model comprises:
splitting the first data set into a first training set and a first verification set according to a preset proportion;
training a first-type two-class machine learning model by using the first training set;
verifying the trained first-type two-class machine learning model by using the first verification set;
after the verification is passed, obtaining the first-type two-class machine learning model as the upstream model.
6. The method of claim 1, wherein inputting the feature wide table in the first data set into the upstream model to obtain a first prediction score column comprises:
obtaining a first training set in the first data set;
inputting the feature wide table in the first training set into the upstream model, and predicting the probability that each piece of feature data in the first training set belongs to a first category, to obtain the first prediction score column.
7. The method of claim 1, wherein splicing the first prediction score column with the first data set to obtain a second data set comprises:
splicing the first prediction score column with the feature wide table of the first training set in the first data set to obtain a new feature wide table;
forming the second data set from the new feature wide table and the label column of the first training set.
8. The method as recited in claim 1, further comprising:
inputting the feature wide table in the first verification set into the upstream model, and predicting the probability that each piece of feature data in the first verification set belongs to the first category, to obtain a third prediction score column;
splicing the third prediction score column with the first verification set in the first data set to obtain a new feature wide table;
forming a second verification set from the new feature wide table and the label column in the first verification set;
after the midstream model is obtained, verifying the midstream model by using the second verification set;
after the verification is passed, outputting the midstream model that passed the verification;
wherein inputting the feature wide table in the second data set into the midstream model to obtain a second prediction score column comprises: inputting the feature wide table in the second data set into the verified midstream model to obtain the second prediction score column.
9. The method of claim 1, wherein obtaining a third data set from the second prediction score column and the second data set comprises:
splicing the second prediction score column with the second data set;
adding a first combined feature to the spliced set to obtain the third data set, wherein each element of the first combined feature is the product of the elements at the corresponding position of the first prediction score column and the second prediction score column.
10. The method as recited in claim 7, further comprising:
inputting the feature wide table in the second verification set into the midstream model, and predicting the probability that each piece of feature data in the second verification set belongs to the first category, to obtain a fourth prediction score column;
splicing the fourth prediction score column, the label column in the first verification set, the feature wide table in the second verification set and a second combined feature to obtain a third verification set, wherein each element of the second combined feature is the product of the elements at the corresponding position of the third prediction score column and the fourth prediction score column;
after the downstream model is obtained, verifying the downstream model by using the third verification set;
after the verification is passed, the downstream model that passed the verification is output.
11. An apparatus for calibrating predictive scores for a bifurcated machine learning model, comprising:
the prediction score calibration module is used for obtaining characteristic data to be analyzed aiming at the target field; inputting the characteristic data to be analyzed into a two-class calibration model to obtain an output prediction score after calibrating the prediction score of the two-class machine learning model, wherein the two-class calibration model inputs the characteristic data to be analyzed into an upstream model to obtain a first prediction score, inputs the first prediction score into a midstream model to obtain a second prediction score, and inputs the second prediction score into a downstream model to obtain an output prediction score after calibrating the prediction score of the two-class machine learning model;
a two-class calibration model construction module, comprising:
a first data set construction module, configured to collect feature data for a two-class machine learning model to be modeled for the target field, and to construct a first data set;
an upstream model training module, configured to train a first two-class machine learning model on the first data set to serve as the upstream model;
a first prediction score obtaining module, configured to input the feature wide table in the first data set into the upstream model to obtain a first prediction score column;
a second data set construction module, configured to splice the first prediction score column with the first data set to obtain a second data set;
a midstream model training module, configured to train a second two-class machine learning model on the second data set to serve as the midstream model;
a second prediction score obtaining module, configured to input the feature wide table in the second data set into the midstream model to obtain a second prediction score column;
a third data set construction module, configured to obtain a third data set according to the second prediction score column and the second data set;
and a downstream model training module, configured to train a logistic regression model on the third data set to serve as the downstream model.
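The construction and calibration flow of claim 11 can be sketched as a three-stage cascade. The sketch below is a minimal illustration under several assumptions, not the claimed implementation: a small gradient-descent logistic regression stands in for all three models (the claim fixes only the downstream model as logistic regression), the third data set is assumed to mirror the splicing recited for the third verification set in claim 10 (second data set features, second score, and the product of the two scores), and the midstream model is assumed to receive the original features together with the first prediction score. All function names are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_one(w, x):
    # w holds one weight per feature followed by a bias term.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def fit_logistic(rows, labels, epochs=300, lr=0.5):
    # Batch gradient descent on the log loss; stand-in for any two-class learner.
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = predict_one(w, x) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(rows) for wj, g in zip(w, grad)]
    return w


def train_cascade(features, labels):
    upstream = fit_logistic(features, labels)
    s1 = [predict_one(upstream, x) for x in features]       # first prediction score column
    second = [list(x) + [a] for x, a in zip(features, s1)]  # second data set (features + s1)
    midstream = fit_logistic(second, labels)
    s2 = [predict_one(midstream, x) for x in second]        # second prediction score column
    # Assumed layout of the third data set: second data set + s2 + (s1 * s2).
    third = [x + [b, a * b] for x, a, b in zip(second, s1, s2)]
    downstream = fit_logistic(third, labels)
    return upstream, midstream, downstream


def calibrated_score(models, x):
    # Inference path: upstream -> midstream -> downstream.
    upstream, midstream, downstream = models
    s1 = predict_one(upstream, x)
    second_row = list(x) + [s1]
    s2 = predict_one(midstream, second_row)
    return predict_one(downstream, second_row + [s2, s1 * s2])


# Hypothetical toy data: two features, tag 1 when both features are positive.
features = [[-1.0, -0.5], [-0.8, -1.0], [-0.3, -0.6], [0.4, 0.7], [0.9, 0.2], [1.0, 1.0]]
labels = [0, 0, 0, 1, 1, 1]
models = train_cascade(features, labels)
print(calibrated_score(models, [0.8, 0.9]))
```

In practice the upstream and midstream stages would be whatever two-class learners the modeler chooses; only the final stage is tied to logistic regression, which maps the intermediate scores onto a probability-like output.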
12. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 10 when executing the computer program.
13. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, implements the method of any of claims 1 to 10.
CN202310845405.4A 2023-07-11 2023-07-11 Method and device for calibrating predictive scores of two classification machine learning models Pending CN116976408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310845405.4A CN116976408A (en) 2023-07-11 2023-07-11 Method and device for calibrating predictive scores of two classification machine learning models

Publications (1)

Publication Number Publication Date
CN116976408A true CN116976408A (en) 2023-10-31

Family

ID=88477611

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination