WO2017133569A1 - Method and device for obtaining an evaluation index - Google Patents


Publication number
WO2017133569A1
WO2017133569A1 PCT/CN2017/072405 CN2017072405W WO2017133569A1 WO 2017133569 A1 WO2017133569 A1 WO 2017133569A1 CN 2017072405 W CN2017072405 W CN 2017072405W WO 2017133569 A1 WO2017133569 A1 WO 2017133569A1
Authority
WO
WIPO (PCT)
Prior art keywords
probability
sample
evaluation index
threshold
histogram
Prior art date
Application number
PCT/CN2017/072405
Inventor
姜晓燕
王少萌
杨旭
蔡宁
Original Assignee
阿里巴巴集团控股有限公司
Application filed by 阿里巴巴集团控股有限公司
Priority to US16/066,102 (US20190034516A1)
Publication of WO2017133569A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Definitions

  • the invention belongs to the field of data processing, and in particular relates to a method and a device for obtaining an evaluation index.
  • the evaluation indicators of a binary classification model include the confusion matrix, the receiver operating characteristic (ROC) curve, the area under the ROC curve (Area Under ROC Curve, AUC for short), Lift indicators, and other indicators.
  • to obtain these evaluation indicators, the output data of the classification model is scanned once for each threshold point, so when a large number of threshold points are input, the output data must be scanned many times before the evaluation indicators of the classification model are obtained.
  • this method of obtaining the evaluation indicators by scanning the output data of the classification model multiple times has low computational efficiency.
  • the invention provides a method and a device for acquiring an evaluation index, which are used to solve the problem that a method for obtaining an evaluation index by repeatedly scanning the output data of the classification model has a low computational efficiency.
  • the present invention provides a method for obtaining an evaluation index, including:
  • the probability statistical result includes a probability interval and an actual positive sample number and an actual negative sample quantity in each probability interval;
  • the evaluation index of the classification model is calculated according to the threshold set and the probability statistics.
  • an evaluation index obtaining apparatus including:
  • a classification training module configured to input a sample into a classification model for classification training, and obtain output data of the classification model
  • a probability statistics module configured to perform probability distribution statistics on the output data to obtain a probability statistical result; wherein the probability statistical result includes a probability interval and an actual positive sample number and an actual negative sample quantity in each probability interval;
  • a calculation module configured to calculate an evaluation indicator of the classification model according to the threshold set and the probability statistics.
  • the method and device for obtaining an evaluation index provided by the present invention perform probability distribution statistics on the output data of the classification model and calculate the evaluation index from the resulting probability statistics, which include the probability intervals and the actual positive and actual negative sample numbers in each probability interval. This avoids scanning the output data multiple times when calculating the evaluation index and, especially when the output data is large-scale data, improves the calculation efficiency of the evaluation index.
  • FIG. 1 is a schematic flowchart of a method for acquiring an evaluation index according to Embodiment 1 of the present invention
  • FIG. 2 is a schematic flowchart of a method for acquiring an evaluation index according to Embodiment 2 of the present invention
  • FIG. 3 is a schematic diagram of an application example of an evaluation index acquisition method according to Embodiment 2 of the present invention.
  • FIG. 4 is a second schematic diagram of an application example of an evaluation index acquisition method according to Embodiment 2 of the present invention.
  • FIG. 5 is a schematic structural diagram of an apparatus for acquiring an evaluation index according to Embodiment 3 of the present invention.
  • FIG. 6 is a schematic structural diagram of an apparatus for acquiring an evaluation index according to Embodiment 4 of the present invention.
  • FIG. 1 is a schematic flowchart diagram of an evaluation index acquisition method according to Embodiment 1 of the present invention.
  • the method for obtaining the evaluation indicator includes the following steps:
  • S101 Enter a sample into a classification model for classification training, and obtain output data of the classification model.
  • the classification model corresponding to the binary classification algorithm divides the sample into positive samples or negative samples.
  • positive samples are often represented by "1" and negative samples by "0".
  • each sample of the input classification model has an original sample attribute.
  • the sample attributes include a positive sample attribute and a negative sample attribute. The original sample attribute indicates whether the sample is actually a positive or negative sample.
  • in order to evaluate the classification model, the sample needs to be input into the classification model for classification training. After the training is completed, the classification model classifies and predicts each sample. Specifically, the classification model outputs a trained sample attribute for each sample, which indicates whether the classification model predicts the sample to be a positive sample or a negative sample.
  • the classification model also performs probability prediction for each sample after the training is completed. According to actual needs, the user can choose to output the probability that each sample is predicted to be a positive sample by the classification model, or the probability that each sample is predicted to be a negative sample. For each sample, the probability of being predicted as a positive sample and the probability of being predicted as a negative sample sum to 1.
  • the probability statistics result includes a probability interval and an actual positive sample number and an actual negative sample quantity in each probability interval.
  • each sample in the output data has a prediction probability.
  • the prediction probability output by the classification model for each sample is the probability that the sample is predicted to be a positive sample by the classification model.
  • probability distribution statistics are performed on the output data according to the predicted probability, and the probability statistical result is obtained.
  • when performing the probability statistics, the probability intervals are divided first. The actual positive sample number and the actual negative sample number in each probability interval are then counted from the original sample attributes of the samples in the output data, which yields the probability distribution map of the positive samples and that of the negative samples.
  • the actual positive sample number in each probability interval is obtained from the probability distribution map of the positive samples, and the actual negative sample number in each probability interval is obtained from the probability distribution map of the negative samples.
  • the output data is subjected to statistics of probability distribution based on a histogram algorithm, and a histogram of the positive sample and a histogram of the negative sample are acquired, and the above-described probability statistical result can be obtained based on the histogram of the positive sample and the histogram of the negative sample.
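The histogram statistics described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: it assumes equal-width probability intervals over [0, 1], labels encoded as 1 (positive) and 0 (negative), and the hypothetical names `probability_histograms`, `rows`, and `num_bins`. The point it shows is that one pass over the output data is enough to fill both histograms.

```python
from typing import Iterable, List, Tuple


def probability_histograms(
    rows: Iterable[Tuple[int, float]], num_bins: int = 25
) -> Tuple[List[float], List[int], List[int]]:
    """Scan (original label, predicted probability) rows once.

    Returns the lower limit of each probability interval and the number of
    actual positive / actual negative samples that fall in each interval.
    """
    lower_limits = [i / num_bins for i in range(num_bins)]
    positives = [0] * num_bins
    negatives = [0] * num_bins
    for label, prob in rows:
        # Clamp so that a probability of exactly 1.0 lands in the last interval.
        idx = min(int(prob * num_bins), num_bins - 1)
        if label == 1:
            positives[idx] += 1
        else:
            negatives[idx] += 1
    return lower_limits, positives, negatives


# Hypothetical toy output data: (original sample attribute, predicted probability).
rows = [(1, 0.91), (0, 0.12), (1, 0.55), (0, 0.58), (1, 0.97), (0, 0.03)]
lower_limits, positives, negatives = probability_histograms(rows, num_bins=5)
```

Everything computed afterwards (confusion matrix, ROC, AUC, Lift) can then work from the two short count lists instead of the raw output data.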
  • the threshold set includes a plurality of threshold points. For each threshold point, an evaluation parameter is obtained from the threshold point and from the actual positive sample number and the actual negative sample number in each probability interval of the probability statistics result; the evaluation parameters corresponding to all the threshold points are then used to generate the evaluation indicators of the classification model.
  • the endpoint value of the probability interval in the probability statistics result may be used as a threshold point to form a threshold set.
  • the lower limit value of each probability interval can be utilized as a threshold point to constitute a threshold set.
  • the lower limit value of the partial probability interval is used as a threshold point to form a threshold set.
  • the upper limit value of the probability interval may be used as a threshold point to form a threshold set.
  • once the probability intervals have been divided, the endpoints of the probability intervals can serve as demarcation points. Using the endpoint values directly as threshold points means the threshold points do not need to be set separately, which improves the calculation efficiency of the evaluation index.
  • alternatively, threshold points that the user selects from the endpoint values of the probability intervals may be received to form the threshold set.
  • the user may use the lower limit value of each probability interval as a threshold point to form the threshold set, or may select the lower limit values of only some of the probability intervals as threshold points to form the threshold set.
  • the user can initially have a certain understanding of the effect of the classification model, so that a suitable threshold point can be selected to form a threshold set, the user interaction is better, and the evaluation of the classification model is more accurate.
  • the evaluation index is calculated according to the threshold point in the threshold set and the probability statistics.
  • the evaluation indicators include confusion matrix, ROC curve, AUC value and Lift diagram.
  • the confusion matrix includes: the number of actual positive samples predicted as positive samples (True Positives, TP for short), the number of actual negative samples predicted as positive samples (False Positives, FP for short), the number of actual negative samples predicted as negative samples (True Negatives, TN for short), and the number of actual positive samples predicted as negative samples (False Negatives, FN for short).
  • the threshold point is used as the demarcation point.
  • the actual positive samples in all probability intervals greater than the threshold point are predicted as positive samples by the classification model; their numbers are accumulated, and the accumulated number is taken as the TP of the confusion matrix.
  • the actual positive samples in all probability intervals smaller than the threshold point are predicted as negative samples by the classification model; their numbers are accumulated, and the accumulated number is taken as the FN of the confusion matrix.
  • the actual negative samples in all probability intervals greater than the threshold point are predicted as positive samples by the classification model; their numbers are accumulated, and the accumulated number is taken as the FP of the confusion matrix.
  • the actual negative samples in all probability intervals smaller than the threshold point are predicted as negative samples by the classification model; their numbers are accumulated, and the accumulated number is taken as the TN of the confusion matrix.
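As a rough sketch of the accumulation just described, the function below builds a confusion matrix for one threshold point from the per-interval counts alone. It uses the standard definitions (FP is an actual negative predicted positive, FN is an actual positive predicted negative), which is also what the ROC abscissa FP / (actual negative total) requires. Treating an interval as "greater than the threshold point" when its lower limit is at or above the threshold is an assumption, and the function and variable names are hypothetical.

```python
from typing import Dict, List


def confusion_at_threshold(
    lower_limits: List[float],
    positives: List[int],
    negatives: List[int],
    threshold: float,
) -> Dict[str, int]:
    """Accumulate per-interval counts into TP, FP, TN, FN for one threshold."""
    tp = fp = tn = fn = 0
    for low, pos, neg in zip(lower_limits, positives, negatives):
        if low >= threshold:
            # Interval at or above the threshold: every sample is predicted positive.
            tp += pos
            fp += neg
        else:
            # Interval below the threshold: every sample is predicted negative.
            fn += pos
            tn += neg
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical histogram with five intervals of width 0.2.
matrix = confusion_at_threshold(
    lower_limits=[0.0, 0.2, 0.4, 0.6, 0.8],
    positives=[0, 0, 1, 0, 2],
    negatives=[2, 0, 1, 0, 0],
    threshold=0.4,
)
```

Because the threshold is itself an interval endpoint, no interval straddles it, so no raw sample has to be re-examined.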
  • after the confusion matrix is obtained, the TP, FP, TN, and FN in the confusion matrix may be used to calculate the evaluation parameters of the other evaluation indicators at each threshold point. Once the evaluation parameters corresponding to all the threshold points have been calculated, each evaluation index is generated from the evaluation parameters corresponding to the threshold points.
  • the coordinates of the ROC curve at the threshold point can be calculated according to the confusion matrix corresponding to one threshold point, and the coordinates are used as the evaluation parameters of the threshold point ROC curve.
  • the ROC curve is drawn using the coordinates of the ROC curve corresponding to each threshold point.
  • the method for obtaining an evaluation index provided by this embodiment performs probability statistics on the output data of the classification model and calculates the evaluation index from the probability statistics, which include the probability intervals and the actual positive sample number and the actual negative sample number in each probability interval. This avoids scanning the output data multiple times when calculating the evaluation index and, especially when the output data is large-scale data, improves the calculation efficiency of the evaluation index.
  • FIG. 2 is a schematic flowchart of a method for acquiring an evaluation index according to Embodiment 2 of the present invention.
  • the method for obtaining the evaluation indicator includes the following steps:
  • S201 input the sample into the classification model for classification training, and obtain output data of the classification model.
  • in order to evaluate the classification model, the sample needs to be input into the classification model for classification training. After the training is completed, the classification model classifies and predicts each sample. Specifically, the classification model outputs a trained sample attribute for each sample, which indicates whether the classification model predicts the sample to be a positive sample or a negative sample. Further, the classification model also performs probability prediction for each sample after the training is completed; in general, the classification model outputs the probability that each sample is predicted to be a positive sample.
  • the output data after the classification model performs classification training includes: the original sample attributes of each sample and the prediction probability that each sample is predicted to be a positive sample by the classification model.
  • the sample attributes include a positive sample attribute and a negative sample attribute.
  • positive samples are often represented by "1" and negative samples by "0".
  • S202 Perform a probability interval division on the output data based on a histogram algorithm, and count the actual positive sample number and the actual negative sample quantity in each probability interval.
  • the output data of the classification model is scanned.
  • the output table format of the classifier is: the original sample attribute, the predicted sample attribute of the classification model, and the predicted probability that the sample is predicted to be a positive sample by the classification model.
  • the classification model may be provided with a selection item for choosing whether to output the prediction probability that a sample is predicted to be a positive sample or the prediction probability that it is predicted to be a negative sample.
  • the ROC curve and the Lift map corresponding to the positive sample may be selected, or the ROC curve and the Lift map corresponding to the negative sample may be selected.
  • a positive sample is taken as an example.
  • a first histogram corresponding to the positive sample and a second histogram corresponding to the negative sample are generated according to the prediction probability that each sample is predicted to be a positive sample and the original sample attribute of each sample in the output data.
  • the horizontal axis of the first histogram is the prediction probability
  • the vertical axis of the first histogram is the actual positive sample number
  • the horizontal axis of the second histogram is the prediction probability
  • the vertical axis of the second histogram is the actual negative sample number.
  • the probability intervals of the two histograms may not coincide. To obtain consistent probability intervals, the step size of the horizontal axis needs to be adjusted so that the probability intervals of the first histogram and the second histogram are the same. Once the probability intervals are consistent, the probability intervals of the probability statistics result are obtained.
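The text does not say how the step size is adjusted. One simple possibility, shown here purely as an assumption, is to merge adjacent bins of the finer histogram until both histograms share the coarser step; this only works when the coarser step is an integer multiple of the finer one. The helper name `merge_bins` is hypothetical.

```python
from typing import List


def merge_bins(counts: List[int], factor: int) -> List[int]:
    """Coarsen a histogram by summing every `factor` adjacent bins."""
    if factor <= 0 or len(counts) % factor != 0:
        raise ValueError("bin count must be a positive multiple of factor")
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]


# Hypothetical case: the positive histogram has 10 bins (step 0.1) while the
# negative histogram has 5 bins (step 0.2); merging pairs aligns the two.
positive_fine = [1, 0, 2, 3, 0, 0, 4, 1, 0, 5]
positive_aligned = merge_bins(positive_fine, factor=2)
```

Merging never changes the total sample count, so the aligned histograms still describe the same output data.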
  • the number of actual positive samples in each probability interval may be obtained from the first histogram, and the number of actual negative samples in each probability interval may be obtained from the second histogram.
  • the endpoint value of the probability interval may be used as a threshold point to form a threshold set.
  • alternatively, the lower limit values or the upper limit values of only some of the probability intervals are used as threshold points to form the threshold set. For example, the lower limit value of every other probability interval is selected as a threshold point to form the threshold set.
  • once the division of the probability intervals is complete, the endpoint values of the probability intervals can serve as demarcation points, so they can be used directly as threshold points to form the threshold set without setting the thresholds separately, which improves the calculation efficiency of the evaluation indicators.
  • the probability statistics result may be fed back to the user, so that the user uses the endpoint values of the probability intervals as threshold points to form the threshold set.
  • the user may use the lower limit value of each probability interval as a threshold point to form the threshold set, or may select the lower limit values of some of the probability intervals as threshold points. More generally, the endpoint values of some of the probability intervals may be selected as threshold points to form the threshold set.
  • the user inputs a threshold set to calculate an evaluation indicator.
  • based on the probability statistics result that is fed back, the user can gain some understanding of the effect of the classification model and can therefore select suitable threshold points to form the threshold set; the user interaction is better, and the evaluation of the classification model is more accurate.
  • the confusion matrix includes the number TP of actual positive samples predicted as positive samples, the number FN of actual positive samples predicted as negative samples, the number TN of actual negative samples predicted as negative samples, and the number FP of actual negative samples predicted as positive samples, as shown in Table 1 below.
  • Table 1 is a schematic table of the confusion matrix
  • for each threshold point, taken in order of size, the actual positive sample numbers in all probability intervals greater than the threshold point are accumulated to obtain TP, and the actual positive sample numbers in all probability intervals smaller than the threshold point are accumulated to obtain FN.
  • likewise, the actual negative sample numbers in all probability intervals greater than the threshold point are accumulated to obtain FP, and the actual negative sample numbers in all probability intervals smaller than the threshold point are accumulated to obtain TN.
  • the ratio of the FP to the actual negative sample total is taken as the abscissa of the ROC
  • the ratio of the TP to the actual positive sample total is taken as the ordinate of the ROC.
  • the ROC coordinates and the ROC curve corresponding to the adjacent threshold points may constitute a curved trapezoid, and the area of a curved trapezoid can be calculated according to the adjacent ROC coordinates. After all the areas of the curved trapezoid are acquired, all the areas are added to obtain the AUC value of the ROC curve.
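The ROC coordinates and the AUC summation can be illustrated as follows. This is a sketch under stated assumptions: each segment between adjacent ROC points is treated as a straight line, so each "curved trapezoid" is approximated by an ordinary trapezoid, and the origin (0, 0) is added as a starting point. The names `roc_point` and `auc_trapezoid` and the sample confusion matrices are hypothetical.

```python
from typing import Iterable, List, Tuple


def roc_point(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """ROC coordinate for one threshold: (FP / actual negatives, TP / actual positives)."""
    total_neg = fp + tn
    total_pos = tp + fn
    fpr = fp / total_neg if total_neg else 0.0
    tpr = tp / total_pos if total_pos else 0.0
    return fpr, tpr


def auc_trapezoid(points: Iterable[Tuple[float, float]]) -> float:
    """Sum the trapezoid areas between adjacent ROC points."""
    # Assumption: anchor the curve at the origin and sort by abscissa.
    pts: List[Tuple[float, float]] = sorted(set(points) | {(0.0, 0.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical confusion matrices (TP, FP, TN, FN) at thresholds 0.8, 0.4, 0.0.
matrices = [(2, 0, 3, 1), (3, 1, 2, 0), (3, 3, 0, 0)]
roc_points = [roc_point(*m) for m in matrices]
auc = auc_trapezoid(roc_points)
```

With finer probability intervals, the straight-line approximation approaches the area under the actual ROC curve.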
  • the ratio of the sum of the TP and FP to the total sample size is taken as the abscissa of the Lift map, and TP is taken as the ordinate of the Lift map.
  • after the Lift coordinates corresponding to each threshold point are acquired, the Lift coordinates corresponding to all the threshold points are plotted to draw the Lift map.
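The Lift coordinate defined above, abscissa (TP + FP) / total samples and ordinate TP, is a one-line calculation per threshold point. The snippet below is only an illustration; the function name `lift_point` and the sample counts are hypothetical.

```python
from typing import List, Tuple


def lift_point(tp: int, fp: int, total: int) -> Tuple[float, int]:
    """Lift coordinate: share of samples predicted positive, and TP."""
    return (tp + fp) / total, tp


# Hypothetical (TP, FP) pairs at three threshold points, for 6 samples in total.
counts = [(2, 0), (3, 1), (3, 3)]
lift_points: List[Tuple[float, int]] = [lift_point(tp, fp, 6) for tp, fp in counts]
```

Plotting `lift_points` in order of increasing abscissa gives the Lift map for these thresholds.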
  • the user may send a display instruction for displaying the evaluation indicators; after the display instruction is received, the calculated evaluation indicators are displayed visually so that the user can intuitively judge the quality of the classification model.
  • the evaluation index obtaining method may be executed on a server. After the evaluation index is calculated, the user may send a display instruction to the server; after receiving it, the server sends the evaluation indicators to the local terminal, and the local terminal visualizes them on its display screen, for example by showing the ROC curve and the Lift map to the user.
  • alternatively, because the amount of data is large when the histogram is calculated, the histogram calculation may be performed on the server.
  • the histogram result may then be sent to the local terminal, and the evaluation index is calculated on the local terminal. This relieves the load on the server.
  • the user can send a display instruction to the local terminal.
  • the local terminal visually displays the evaluation indicators on its display screen, for example by showing the ROC curve and the Lift map to the user.
  • when the user clicks a point on the ROC curve, the confusion matrix corresponding to that point can be displayed.
  • the method for obtaining the evaluation indicator may also be performed entirely on the local terminal.
  • after the evaluation index is calculated, the user may send a display instruction to the local terminal, which, on receiving it, displays the evaluation indicators visually on its display, for example by showing the ROC curve and the Lift map. When the user clicks a point on the ROC curve, the confusion matrix corresponding to that point can be displayed.
  • the samples are user 0 to user 99, and each sample user has the following characteristic parameters: age, work class, sample amount (Fnlwgt), education, education_num, marital status (status), occupation, relationship, race, sex, capital gain (capital_gain), capital loss (capital_loss), weekly work hours (hours_per_week), nationality (native_country), and so on. The characteristic parameters of these users are input into the classification model for classification training, which yields a classification result for each user's income. In this example, "0" indicates low income and "1" indicates high income.
  • the output data of the classification model includes the original sample attributes of each sample, the predicted sample attributes, and the probability that each sample is predicted to be a high-income category, as shown in Table 2 below.
  • Table 2 shows the output data of the classification model.
  • Table 3 is the first histogram result corresponding to the positive sample
  • Table 4 is the second histogram result corresponding to the negative sample.
  • Table 3 shows the results of the first histogram of the positive sample.
  • Probability interval    Number of positive samples in the probability interval
    [0, 0.04)               0
    [0.04, 0.08)            0
    [0.08, 0.12)            0
    [0.12, 0.16)            0
    [0.16, 0.2)             0
  • Table 4 is the second histogram of the negative sample
  • the probability interval may be obtained, and the lower limit of each probability interval is used as the threshold point to form a threshold set.
  • the threshold set in this example is: 0, 0.04, 0.08, 0.12, 0.16, 0.2, 0.24, 0.28, 0.32, 0.36, 0.4, 0.44, 0.48, 0.52, 0.56, 0.6, 0.64, 0.68, 0.72, 0.76, 0.8, 0.84, 0.88, 0.92, 0.96.
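The threshold set listed above is simply the lower limit of each of 25 equal-width probability intervals, so it can be generated rather than typed in. The snippet is a small sketch; the variable names are hypothetical, and the second list shows the "every other interval" variant mentioned earlier in the description.

```python
# 25 equal-width intervals over [0, 1) give a step of 0.04.
num_intervals = 25

# Lower limit of every probability interval: 0, 0.04, ..., 0.96.
thresholds = [i / num_intervals for i in range(num_intervals)]

# Variant: lower limit of every other interval: 0, 0.08, ..., 0.96.
sparse_thresholds = thresholds[::2]
```

Dividing an integer index by the interval count, instead of repeatedly adding 0.04, avoids accumulating floating-point error in the endpoints.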
  • the corresponding ROC coordinates and Lift coordinates can be calculated from the confusion matrix.
  • the ROC curve and the Lift map can be drawn.
  • FIG. 3 is the ROC curve of the classification model.
  • the ordinate of the ROC curve in FIG. 3 is the TPR (True Positive Rate), also called the hit rate, which indicates the sensitivity (Sensitivity) with which the classification model recognizes positive samples.
  • Figure 4 is a Lift diagram of the classification model.
  • the ordinate is the number of actual positive samples
  • after the ROC coordinates corresponding to each threshold point are obtained, the ROC curve can be drawn. The ROC coordinates corresponding to adjacent threshold points and the ROC curve form a curved trapezoid, whose area can be calculated from the adjacent ROC coordinates. After the areas of all the curved trapezoids are obtained, they are added to obtain the AUC value corresponding to the ROC curve.
  • Input: N, icProb, icTrue, icFalse, where N is the number of probability intervals, icProb is the lower limit of each probability interval, icTrue is the number of actual positive samples in each probability interval, and icFalse is the number of actual negative samples in each probability interval.
  • Output: the ROC coordinates, Lift coordinates, and confusion matrix corresponding to each threshold point, and the AUC value.
  • Threshold point: p = icProb[N-1-i]
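Only fragments of the pseudocode survive here, so the following is a reconstruction sketch rather than the patent's algorithm. It keeps the names icProb, icTrue, and icFalse from the text (as `ic_prob`, `ic_true`, `ic_false`) and the threshold order p = icProb[N-1-i]; the running-sum loop, the straight-line trapezoid AUC, and the function name `evaluate_from_histogram` are assumptions.

```python
from typing import Dict, List, Tuple


def evaluate_from_histogram(
    ic_prob: List[float], ic_true: List[int], ic_false: List[int]
) -> Tuple[List[Dict[str, object]], float]:
    """Walk the intervals from the highest threshold down, keeping running sums."""
    n = len(ic_prob)
    total_true, total_false = sum(ic_true), sum(ic_false)
    total = total_true + total_false
    tp = fp = 0
    auc = 0.0
    prev_fpr = prev_tpr = 0.0
    results: List[Dict[str, object]] = []
    for i in range(n):
        k = n - 1 - i
        p = ic_prob[k]                # threshold point p = icProb[N-1-i]
        tp += ic_true[k]              # positives at or above p become TP
        fp += ic_false[k]             # negatives at or above p become FP
        fn, tn = total_true - tp, total_false - fp
        fpr = fp / total_false if total_false else 0.0
        tpr = tp / total_true if total_true else 0.0
        # Trapezoid between the previous and the current ROC point.
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_fpr, prev_tpr = fpr, tpr
        results.append({
            "threshold": p,
            "confusion": {"TP": tp, "FP": fp, "TN": tn, "FN": fn},
            "roc": (fpr, tpr),
            "lift": ((tp + fp) / total if total else 0.0, tp),
        })
    return results, auc


# Hypothetical histogram with five intervals of width 0.2.
results, auc = evaluate_from_histogram(
    [0.0, 0.2, 0.4, 0.6, 0.8], [0, 0, 1, 0, 2], [2, 0, 1, 0, 0]
)
```

Because TP and FP are carried over from one threshold point to the next, the whole set of indicators costs one pass over the N intervals, independent of the number of samples.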
  • the confusion matrix is calculated from the histogram results, the other evaluation indicators can be conveniently calculated based on the confusion matrix, and a visual image can be generated so that the user can intuitively judge the quality of the classification model.
  • FIG. 5 is a schematic structural diagram of an evaluation index obtaining apparatus according to Embodiment 3 of the present invention.
  • the evaluation index obtaining device includes: a classification training module 11, a probability statistics module 12, and a calculation module 13.
  • the classification training module 11 is configured to input the sample into the classification model for classification training, and obtain output data of the classification model.
  • in order to evaluate the classification model, the classification training module 11 needs to input the samples into the classification model for classification training. After the training is completed, the classification training module 11 classifies and predicts each sample. Specifically, the classification training module 11 outputs a trained sample attribute for each sample, which indicates whether the sample is predicted to be a positive sample or a negative sample by the classification model.
  • the classification training module 11 also performs probability prediction for each sample after the training is completed. According to actual needs, the user can choose to output the probability that each sample is predicted to be a positive sample by the classification model, or the probability that each sample is predicted to be a negative sample. The probability that a sample is predicted to be a positive sample and the probability that it is predicted to be a negative sample sum to 1.
  • the probability statistics module 12 is configured to perform probability distribution statistics on the output data to obtain a probability statistics result.
  • the probability statistics result includes the probability interval and the actual positive sample number and the actual negative sample number in each probability interval.
  • each sample in the output data has a prediction probability.
  • the prediction probability of each sample output by the classification training module 11 is the probability that the sample is predicted to be a positive sample by the classification model.
  • the probability and statistics module 12 performs probability distribution statistics on the output data according to the predicted probability, and obtains a probability statistical result.
  • when performing the probability statistics, the probability statistics module 12 first divides the probability intervals. It then counts the actual positive sample number and the actual negative sample number in each probability interval from the original sample attributes of the samples in the output data, which yields the probability distribution map of the positive samples and that of the negative samples.
  • the actual positive sample number in each probability interval is obtained from the probability distribution map of the positive samples, and the actual negative sample number in each probability interval is obtained from the probability distribution map of the negative samples.
  • the probability statistics module 12 performs probability distribution statistics on the output data based on the histogram algorithm and obtains a histogram of the positive samples and a histogram of the negative samples; the above probability statistics result can be obtained from these two histograms.
  • the calculation module 13 is configured to calculate an evaluation indicator of the classification model according to the threshold set and the probability statistics.
  • after the probability statistics result is obtained, a threshold set containing a plurality of threshold points is acquired; then, based on each threshold point and on the first data of the actual positive samples and the second data of the actual negative samples in each probability interval of the probability statistics result, the evaluation parameter corresponding to each threshold point is obtained, and the evaluation index of the classification model is generated from the evaluation parameters of all threshold points.
  • the calculation module 13 may use the endpoint values of the probability intervals in the probability statistics result as threshold points to form the threshold set.
  • for example, the lower limit value of each probability interval can be used as a threshold point to form the threshold set.
  • alternatively, the lower limit values of only some of the probability intervals are used as threshold points to form the threshold set.
  • the probability intervals are divided during the probability statistics, so their endpoints can serve as demarcation points; using the endpoint values directly as threshold points means the threshold points do not need to be set again, which improves the computational efficiency of the evaluation index.
  • optionally, the calculation module 13 may receive a threshold set that the user inputs and that is formed by using endpoint values of the probability intervals as threshold points.
  • for example, the user may use the lower limit value of each probability interval as a threshold point to form the threshold set, or the user may select the lower limit values of only some probability intervals as threshold points.
  • in this embodiment, based on the probability statistics result that is fed back, the user can gain a preliminary understanding of the effect of the classification model and can therefore select suitable threshold points to form the threshold set; user interaction is better, and the evaluation of the classification model is more accurate.
  • the calculation module 13 calculates an evaluation index according to the threshold point in the threshold set and the probability statistics.
  • the evaluation indicators include confusion matrix, ROC curve, AUC value and Lift diagram.
  • the confusion matrix includes: TP, FP, TN, and FN.
  • after the threshold points are obtained, the calculation module 13 uses each threshold point as a demarcation point.
  • for the probability distribution of the positive samples, the actual positive samples in all probability intervals greater than the threshold point are predicted as positive samples by the classification model; their counts are accumulated, and the accumulated count is the TP of the confusion matrix.
  • the actual positive samples in all probability intervals smaller than the threshold point are predicted as negative samples by the classification model; their counts are accumulated, and the accumulated count is the FN of the confusion matrix.
  • for the probability distribution of the negative samples, the actual negative samples in all probability intervals greater than the threshold point are predicted as positive samples by the classification model; their counts are accumulated, and the accumulated count is the FP of the confusion matrix.
  • the actual negative samples in all probability intervals smaller than the threshold point are predicted as negative samples by the classification model; their counts are accumulated, and the accumulated count is the TN of the confusion matrix.
  • after the confusion matrix corresponding to a threshold point is obtained, the calculation module 13 may use its TP, FP, TN, and FN to calculate the evaluation parameters of the other evaluation indices at that threshold point.
  • once the evaluation parameters of all threshold points have been calculated, the evaluation index is generated from the evaluation parameters of each threshold point.
  • for example, the coordinates of the ROC curve at a threshold point can be calculated from the confusion matrix of that threshold point and used as the evaluation parameter of the ROC curve at that point.
  • once the evaluation parameters of all threshold points have been calculated, the ROC curve is drawn from the ROC coordinates of each threshold point.
  • the evaluation index obtaining device provided in this embodiment performs probability statistics on the output data of the classification model and calculates the evaluation index from the resulting probability statistics result, which avoids scanning the output data multiple times while calculating the evaluation index; the computational efficiency improves especially when the output data is large-scale.
  • FIG. 6 is a schematic structural diagram of an evaluation index obtaining apparatus according to Embodiment 4 of the present invention.
  • the evaluation index obtaining device includes: a classification training module 21, a probability statistics module 22, a calculation module 23, and a visualization module 24.
  • the classification training module 21 is configured to input the sample into the classification model for classification training, and obtain output data of the classification model.
  • further, the probability statistics module 22 is specifically configured to divide the output data into probability intervals based on the histogram algorithm and to count the number of actual positive samples and actual negative samples in each probability interval.
  • the output data includes: the original sample attribute of each sample and the predicted probability that the classification model predicts each sample as a positive sample; the sample attributes include a positive sample attribute and a negative sample attribute.
  • in one optional structure, the probability statistics module 22 includes: a scanning unit 221, a histogram generating unit 222, a step adjusting unit 223, and a counting unit 224.
  • the scanning unit 221 is configured to scan output data.
  • the histogram generating unit 222 is configured to generate a first histogram corresponding to the positive samples and a second histogram corresponding to the negative samples, according to the predicted probability that each sample is predicted as a positive sample and the original sample attribute of each sample in the output data.
  • the horizontal axis of the first histogram is the predicted probability, and its vertical axis is the number of actual positive samples.
  • the horizontal axis of the second histogram is the predicted probability, and its vertical axis is the number of actual negative samples.
  • the step adjusting unit 223 is configured to adjust the horizontal-axis step size so that the probability intervals of the first histogram and the second histogram are consistent, thereby obtaining the probability intervals in the probability statistics result.
  • the counting unit 224 is configured to count the number of actual positive samples in each probability interval of the first histogram, and to count the number of actual negative samples in each probability interval of the second histogram.
  • in this embodiment, in one optional structure, the calculation module 23 includes: a threshold set obtaining unit 231, a confusion matrix generating unit 232, and an evaluation index generating unit 233.
  • the threshold set obtaining unit 231 is configured to use the endpoint value of each probability interval as a threshold point to form the threshold set.
  • the threshold set obtaining unit 231 is further configured to receive a threshold set that the user inputs and that is formed from endpoint values of the probability intervals.
  • the confusion matrix generating unit 232 is configured to obtain, in descending order of threshold, the confusion matrix corresponding to each threshold point in the threshold set, where the confusion matrix includes TP, FP, TN, and FN.
  • the evaluation index generating unit 233 is configured to use the confusion matrix corresponding to each threshold point as an evaluation index of the classification model.
  • the confusion matrix generating unit 232 is specifically configured to, for the first histogram, step through the threshold points in order of size and accumulate the number of actual positive samples in all probability intervals greater than the threshold point to obtain TP, and accumulate the number of actual positive samples in all probability intervals smaller than the threshold point to obtain FN.
  • for the second histogram, it likewise steps through the threshold points in order of size and accumulates the number of negative samples in all probability intervals greater than the threshold point to obtain FP, and accumulates the number of negative samples in all probability intervals smaller than the threshold point to obtain TN.
  • the evaluation indicator generating unit 233 is specifically configured to use the confusion matrix corresponding to each threshold as an evaluation index.
  • the evaluation index generating unit 233 is specifically configured to, for each threshold point, use the ratio of FP to the total number of actual negative samples as the abscissa of the ROC and the ratio of TP to the total number of actual positive samples as the ordinate of the ROC, and to draw the ROC curve, an evaluation index of the classification model, from the ROC coordinates of all threshold points.
  • the evaluation index generating unit 233 is specifically configured to obtain the area of each curvilinear trapezoid formed by the ROC coordinates of adjacent threshold points and the ROC curve, and to add the areas of all the curvilinear trapezoids to obtain the AUC value of the ROC curve.
  • the evaluation index generating unit 233 is specifically configured to, for each threshold point, use the ratio of the sum of TP and FP to the total number of samples as the abscissa of the Lift chart and TP as the ordinate of the Lift chart, and to draw the Lift chart, an evaluation index of the classification model, from the Lift coordinates of all threshold points.
  • the visualization module 24 is configured to receive a display instruction of the user, and visually display the evaluation indicator according to the display instruction.
  • in this embodiment, the evaluation index obtaining device may be deployed on a server to execute the evaluation index obtaining method.
  • after the evaluation index is calculated, the user may send a display instruction to the visualization module 24 in the device; on receiving it, the visualization module 24 can deliver the evaluation index to the local terminal, which visualizes the index on its display screen, for example by showing the user the ROC curve or the Lift chart.
  • when the user clicks a point on the ROC curve, the confusion matrix corresponding to that point can be displayed.
  • optionally, for large-scale data, the classification training module 21 and the probability statistics module 22 may be deployed on the server while the calculation module 23 and the visualization module 24 are deployed on the local terminal, which reduces the load on the server and eases interaction with the user.
  • classification training and histogram calculation are performed on the sample data on the server.
  • after the histograms are calculated, the probability statistics module 22 can deliver the histogram result to the calculation module 23 on the local terminal, which calculates the evaluation index there and so relieves the load on the server.
  • after the evaluation index is calculated, the user can send a display instruction to the visualization module 24.
  • on receiving the display instruction, the visualization module 24 visualizes the evaluation index on the display screen, for example by showing the user the ROC curve or the Lift chart.
  • when the user clicks a point on the ROC curve, the confusion matrix corresponding to that point can be displayed.
  • optionally, the evaluation index obtaining device may be deployed on the local terminal to execute the evaluation index obtaining method.
  • after the evaluation index is calculated, the user may send a display instruction to the visualization module 24; on receiving it, the visualization module 24 visualizes the evaluation index on the display screen, for example by showing the user the ROC curve or the Lift chart.
  • when the user clicks a point on the ROC curve, the confusion matrix corresponding to that point can be displayed.
  • the evaluation index obtaining device provided in this embodiment performs probability statistics on the output data of the classification model and calculates the evaluation index from a probability statistics result that includes the probability intervals and the number of actual positive samples and actual negative samples in each probability interval.
  • this avoids scanning the output data multiple times while calculating the evaluation index and improves the computational efficiency, especially when the output data is large-scale. Further, once the evaluation index is obtained it can be displayed visually, so that the user can judge the quality of the classification model intuitively.
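  • a person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be carried out by hardware under the control of program instructions, and that the program can be stored in a computer-readable storage medium.
  • when executed, the program performs the steps of the foregoing method embodiments; the storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.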

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Algebra (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

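An evaluation index obtaining method and device. The method includes: inputting samples into a classification model for classification training and obtaining output data of the classification model (101); performing probability distribution statistics on the output data to obtain a probability statistics result, where the probability statistics result includes probability intervals and the number of actual positive samples and actual negative samples in each probability interval (102); and calculating an evaluation index of the classification model according to a threshold set and the probability statistics result (103). By performing probability statistics on the output data of the classification model and calculating the evaluation index from the resulting probability statistics result, the method and device avoid scanning the output data multiple times while calculating the evaluation index, which improves the computational efficiency of the evaluation index especially when the output data is large-scale.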

Description

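Evaluation index obtaining method and device
This application claims priority to Chinese Patent Application No. 201610082141.1, filed on February 5, 2016 and entitled "Evaluation index obtaining method and device", which is incorporated herein by reference in its entirety.
Technical Field
The present invention belongs to the field of data processing, and in particular relates to an evaluation index obtaining method and device.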
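Background
In big-data mining scenarios, classification algorithms often need to be trained on very large data sets. Many classification algorithms exist, and each has many variants. Once a classification model has been built with a classification algorithm, its performance or accuracy comes into question, so the quality of the model needs to be evaluated. At present, the evaluation indices for binary classification models include the confusion matrix, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC) value, and the Lift chart.
In existing methods or systems for evaluating the classification model of a binary classification algorithm, each time a threshold point is input, the output data of the classification model must be scanned once to calculate the evaluation parameter for that threshold point. The evaluation index of the classification model is obtained only after a large number of threshold points have been input. For large-scale data, obtaining the evaluation index by scanning the output data of the classification model many times is computationally inefficient.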
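Summary of the Invention
The present invention provides an evaluation index obtaining method and device to address the low computational efficiency of obtaining an evaluation index by scanning the output data of a classification model multiple times.
To achieve the above object, the present invention provides an evaluation index obtaining method, including:
inputting samples into a classification model for classification training, and obtaining output data of the classification model;
performing probability distribution statistics on the output data to obtain a probability statistics result, where the probability statistics result includes probability intervals and the number of actual positive samples and actual negative samples in each probability interval;
calculating an evaluation index of the classification model according to a threshold set and the probability statistics result.
To achieve the above object, the present invention provides an evaluation index obtaining device, including:
a classification training module, configured to input samples into a classification model for classification training and to obtain output data of the classification model;
a probability statistics module, configured to perform probability distribution statistics on the output data to obtain a probability statistics result, where the probability statistics result includes probability intervals and the number of actual positive samples and actual negative samples in each probability interval;
a calculation module, configured to calculate an evaluation index of the classification model according to a threshold set and the probability statistics result.
The evaluation index obtaining method and device provided by the present invention perform probability statistics on the output data of the classification model and calculate the evaluation index from a probability statistics result that includes the probability intervals and the corresponding numbers of actual positive and actual negative samples. This avoids scanning the output data multiple times while calculating the evaluation index and improves the computational efficiency, especially when the output data is large-scale.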
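Brief Description of the Drawings
Figure 1 is a schematic flowchart of the evaluation index obtaining method according to Embodiment 1 of the present invention;
Figure 2 is a schematic flowchart of the evaluation index obtaining method according to Embodiment 2 of the present invention;
Figure 3 is the first schematic diagram of an application example of the evaluation index obtaining method according to Embodiment 2 of the present invention;
Figure 4 is the second schematic diagram of an application example of the evaluation index obtaining method according to Embodiment 2 of the present invention;
Figure 5 is a schematic structural diagram of the evaluation index obtaining device according to Embodiment 3 of the present invention;
Figure 6 is a schematic structural diagram of the evaluation index obtaining device according to Embodiment 4 of the present invention.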
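Detailed Description
The evaluation index obtaining method and device provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment 1
Figure 1 is a schematic flowchart of the evaluation index obtaining method according to Embodiment 1 of the present invention. The method includes the following steps:
S101. Input samples into a classification model for classification training, and obtain output data of the classification model.
The classification model of a binary classification algorithm classifies each sample as a positive sample or a negative sample. In a classification model, positive samples are usually denoted by "1" and negative samples by "0". Each sample input into the classification model has an original sample attribute. In this embodiment, the sample attributes include a positive sample attribute and a negative sample attribute. The original sample attribute indicates whether the sample is actually a positive sample or a negative sample.
To evaluate the classification model, samples are input into it for classification training; after training, the classification model classifies each sample and predicts a probability for it. Specifically, after training the classification model outputs a post-training sample attribute for each sample, which indicates whether the model classifies the sample as positive or negative.
Further, after training the classification model also predicts a probability for each sample; according to actual needs, the user can choose to output either the probability that each sample is predicted as a positive sample or the probability that it is predicted as a negative sample. For any sample, the two probabilities sum to 1.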
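S102. Perform probability distribution statistics on the output data to obtain a probability statistics result, where the probability statistics result includes probability intervals and the number of actual positive samples and actual negative samples in each probability interval.
Because the classification model predicts a probability for each sample, every sample in the output data has a predicted probability. In this embodiment, the probability that the classification model outputs for each sample is the predicted probability that the sample is predicted as a positive sample.
Further, probability distribution statistics are performed on the output data according to the predicted probability to obtain the probability statistics result. The probability range is first divided into intervals; then, within each interval, the actual positive samples and actual negative samples are counted according to the original sample attribute of each sample in the output data, which yields probability distribution maps of the positive samples and of the negative samples. The number of actual positive samples in each probability interval is obtained from the distribution map of the positive samples, and the number of actual negative samples in each interval from the distribution map of the negative samples.
Preferably, the probability distribution of the output data is computed with a histogram algorithm, giving a histogram of the positive samples and a histogram of the negative samples, from which the above probability statistics result can be obtained.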
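The single-pass counting described in S102 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_histograms`, the `(label, probability)` row layout, and the use of equal-width intervals are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def build_histograms(
    rows: Iterable[Tuple[int, float]], n_bins: int
) -> Tuple[List[int], List[int]]:
    """Scan (original_label, predicted_positive_probability) rows once and
    count actual positives and actual negatives per equal-width interval."""
    pos_hist = [0] * n_bins
    neg_hist = [0] * n_bins
    for label, prob in rows:
        # Interval k covers [k / n_bins, (k + 1) / n_bins); a probability of
        # exactly 1.0 is clamped into the last interval.
        k = min(int(prob * n_bins), n_bins - 1)
        if label == 1:
            pos_hist[k] += 1
        else:
            neg_hist[k] += 1
    return pos_hist, neg_hist


# Hypothetical output data: (original sample attribute, predicted probability).
rows = [(1, 0.95), (1, 0.62), (0, 0.10), (0, 0.30), (1, 0.45), (0, 0.55)]
pos_hist, neg_hist = build_histograms(rows, n_bins=4)
```

After this one scan, every later step works on the two short count lists rather than on the raw output data, which is the source of the efficiency gain the text describes.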
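S103. Calculate the evaluation index of the classification model according to a threshold set and the probability statistics result.
After the probability statistics result is obtained, a threshold set containing a plurality of threshold points is acquired. Then, based on each threshold point and on the actual positive sample data and actual negative sample data in each probability interval of the probability statistics result, the evaluation parameter corresponding to each threshold point is obtained, and the evaluation index of the classification model is generated from the evaluation parameters of all threshold points.
In this embodiment, after the probability statistics result is obtained, the endpoint values of its probability intervals can be used as threshold points to form the threshold set. For example, the lower limit value of each probability interval can be used as a threshold point, or the lower limit values of only some of the intervals. As another example, the upper limit values of the probability intervals can be used as threshold points. Because the probability intervals are divided during the probability statistics, their endpoints can serve as demarcation points; using the endpoint values directly as threshold points means the threshold points do not need to be set again, which improves the computational efficiency of the evaluation index.
Optionally, a threshold set that the user inputs, formed by using endpoint values of the probability intervals as threshold points, can be received. For example, the user may use the lower limit value of each probability interval as a threshold point, or select the lower limit values of only some probability intervals. In this embodiment, based on the probability statistics result that is fed back, the user can gain a preliminary understanding of the effect of the classification model and can therefore select suitable threshold points to form the threshold set; user interaction is better, and the evaluation of the classification model is more accurate.
Further, after the threshold set is obtained, the evaluation index is calculated according to the threshold points in the threshold set and the probability statistics result. The evaluation indices include the confusion matrix, the ROC curve, the AUC value, and the Lift chart.
The confusion matrix includes: the number of actual positive samples predicted as positive (True Positives, TP), the number of actual negative samples predicted as positive (False Positives, FP), the number of actual negative samples predicted as negative (True Negatives, TN), and the number of actual positive samples predicted as negative (False Negatives, FN).
After a threshold point is obtained, it is used as a demarcation point. For the probability distribution of the positive samples, the actual positive samples in all probability intervals greater than the threshold point are predicted as positive by the classification model; their counts are accumulated, and the accumulated count is the TP of the confusion matrix. The actual positive samples in all probability intervals smaller than the threshold point are predicted as negative; their counts are accumulated, and the accumulated count is the FN of the confusion matrix.
For the probability distribution of the negative samples, the actual negative samples in all probability intervals greater than the threshold point are predicted as positive by the classification model; their counts are accumulated, and the accumulated count is the FP of the confusion matrix. The actual negative samples in all probability intervals smaller than the threshold point are predicted as negative; their counts are accumulated, and the accumulated count is the TN of the confusion matrix.
After the confusion matrix corresponding to a threshold point is obtained, its TP, FP, TN, and FN can be used to calculate the evaluation parameters of the other evaluation indices at that threshold point. Once the evaluation parameters of all threshold points have been calculated, the evaluation index is generated from them. For example, the coordinates of the ROC curve at a threshold point can be calculated from the confusion matrix of that threshold point and used as the evaluation parameter of the ROC curve at that point. Once the evaluation parameters of all threshold points have been calculated, the ROC curve is drawn from the ROC coordinates of each threshold point.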
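As a small illustration of deriving an evaluation parameter from one confusion matrix, the sketch below computes a single ROC coordinate. The `ConfusionMatrix` type and the `roc_point` helper are hypothetical names introduced for this example, and the counts are illustrative; the sketch also assumes both classes are present so that neither denominator is zero.

```python
from typing import NamedTuple, Tuple


class ConfusionMatrix(NamedTuple):
    tp: int  # actual positive, predicted positive
    fp: int  # actual negative, predicted positive
    tn: int  # actual negative, predicted negative
    fn: int  # actual positive, predicted negative


def roc_point(cm: ConfusionMatrix) -> Tuple[float, float]:
    """Return (false positive rate, true positive rate) for one threshold."""
    fpr = cm.fp / (cm.fp + cm.tn)
    tpr = cm.tp / (cm.tp + cm.fn)
    return fpr, tpr


# Illustrative counts for one threshold point.
point = roc_point(ConfusionMatrix(tp=24, fp=1, tn=74, fn=1))
```

Repeating this for every threshold point gives the list of coordinates from which the ROC curve is drawn.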
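The evaluation index obtaining method provided in this embodiment performs probability statistics on the output data of the classification model and calculates the evaluation index from a probability statistics result that includes the probability intervals and the number of actual positive samples and actual negative samples in each interval. This avoids scanning the output data multiple times while calculating the evaluation index and improves the computational efficiency, especially when the output data is large-scale.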
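Embodiment 2
Figure 2 is a schematic flowchart of the evaluation index obtaining method according to Embodiment 2 of the present invention. The method includes the following steps:
S201. Input samples into a classification model for classification training, and obtain output data of the classification model.
To evaluate the classification model, samples are input into it for classification training; after training, the classification model classifies each sample and predicts a probability for it. Specifically, after training the classification model outputs a post-training sample attribute for each sample, which indicates whether the model classifies the sample as positive or negative. Further, the classification model also predicts a probability for each sample; a classification model generally outputs the probability that each sample is predicted as a positive sample.
In this embodiment, the output data of the classification model after classification training includes the original sample attribute of each sample and the predicted probability that the classification model predicts each sample as a positive sample. The sample attributes include a positive sample attribute and a negative sample attribute. In a classification model, positive samples are usually denoted by "1" and negative samples by "0".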
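S202. Divide the output data into probability intervals based on a histogram algorithm, and count the number of actual positive samples and actual negative samples in each probability interval.
Specifically, the output data of the classification model is scanned. In this embodiment, the classifier's output table is assumed to have the following columns: the original sample attribute, the sample attribute predicted by the classification model, and the predicted probability that the sample is predicted as a positive sample. In general, the classification model can offer an option to output either the predicted probability that a sample is predicted as positive or the predicted probability that it is predicted as negative. Correspondingly, the ROC curve and Lift chart can be generated for the positive samples or for the negative samples; this embodiment uses the positive samples as the example.
Further, a first histogram corresponding to the positive samples and a second histogram corresponding to the negative samples are generated according to the predicted probability that each sample is predicted as positive and the original sample attribute of each sample in the output data. The horizontal axis of the first histogram is the predicted probability and its vertical axis is the number of actual positive samples; the horizontal axis of the second histogram is the predicted probability and its vertical axis is the number of actual negative samples.
While the first and second histograms are being generated, their probability intervals may not line up. To obtain consistent intervals, the horizontal-axis step size is adjusted so that the probability intervals of the two histograms are the same; once they match, the probability intervals of the probability statistics result are obtained.
After the probability intervals are obtained, the number of actual positive samples in each interval can be read from the first histogram, and the number of actual negative samples in each interval from the second histogram.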
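The text does not say how the step size is adjusted. One simple possibility, shown here purely as an assumption, is to merge adjacent intervals of the finer histogram until both histograms share the coarser step; `coarsen` is a hypothetical helper and it only handles the case where the interval count divides evenly.

```python
from typing import List


def coarsen(hist: List[int], factor: int) -> List[int]:
    """Merge every `factor` adjacent equal-width intervals into one.

    Assumes the number of intervals is a multiple of `factor`, so the merged
    interval boundaries coincide with boundaries of the original histogram.
    """
    if factor <= 0 or len(hist) % factor != 0:
        raise ValueError("interval count must be a positive multiple of factor")
    return [sum(hist[i:i + factor]) for i in range(0, len(hist), factor)]


# Hypothetical case: positives were counted with step 0.125 (8 intervals),
# negatives with step 0.25 (4 intervals); bring the positives to step 0.25.
pos_fine = [1, 0, 2, 3, 0, 0, 4, 1]
neg_hist = [6, 2, 1, 0]
pos_hist = coarsen(pos_fine, 2)
```

Merging only ever adds counts together, so the totals are preserved and the two lists can then be indexed by the same interval number.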
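S203. Obtain the threshold set formed by the threshold points.
After the probability intervals are generated, their endpoint values can be used as threshold points to form the threshold set. Optionally, the lower or upper limit values of only some of the intervals are used, for example the lower limit value of every other probability interval. In this embodiment, the probability intervals are divided during the probability statistics and their endpoint values can serve as demarcation points, so the endpoint values can be used as threshold points without setting the thresholds again, which improves the computational efficiency of the evaluation index.
Optionally, after the probability intervals are obtained, the probability statistics result can be fed back to the user so that the user forms the threshold set from endpoint values of the probability intervals. For example, the user may use the lower limit value of every probability interval as a threshold point, or select the endpoint values of only some intervals. The user then inputs the threshold set for calculating the evaluation index. In this embodiment, through the histogram statistics and the probability statistics result that is fed back, the user can gain a preliminary understanding of the effect of the classification model and can therefore select suitable threshold points; user interaction is better, and the evaluation of the classification model is more accurate.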
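Forming the threshold set from interval lower limits, either all of them or every other one as in the example above, can be sketched like this. The helper name `threshold_set` and its `every` parameter are assumptions for illustration, and equal-width intervals over [0, 1) are assumed.

```python
from typing import List


def threshold_set(n_bins: int, every: int = 1) -> List[float]:
    """Lower limits of equal-width intervals over [0, 1), taking one
    interval out of each group of `every`."""
    return [k / n_bins for k in range(0, n_bins, every)]


# 25 intervals of width 0.04, as in the worked example of this embodiment.
all_thresholds = threshold_set(25)
sparse_thresholds = threshold_set(25, every=2)  # every other interval
```

Because the thresholds coincide with interval boundaries, each threshold maps to an interval index and no sample needs to be re-examined.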
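S204. Obtain the confusion matrix corresponding to each threshold point in the threshold set, in descending order of threshold.
The confusion matrix includes TP, the number of actual positive samples predicted as positive; FN, the number of actual positive samples predicted as negative; TN, the number of actual negative samples predicted as negative; and FP, the number of actual negative samples predicted as positive, as shown in Table 1 below.
Table 1. Schematic table of the confusion matrix
Figure PCTCN2017072405-appb-000001
Specifically, for the first histogram, which corresponds to the positive samples, the threshold points are processed in order of size; the number of actual positive samples in all probability intervals greater than the threshold point is accumulated to obtain TP, and the number of actual positive samples in all probability intervals smaller than the threshold point is accumulated to obtain FN.
For the second histogram, which corresponds to the negative samples, the threshold points are processed in order of size; the number of negative samples in all probability intervals greater than the threshold point is accumulated to obtain FP, and the number of negative samples in all probability intervals smaller than the threshold point is accumulated to obtain TN.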
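The descending sweep in S204 amounts to a running sum over the histograms taken from the highest interval downwards. A minimal sketch using `itertools.accumulate` follows; the function name `sweep` and the dictionary layout are assumptions, and the lower limit of each interval is taken as its threshold.

```python
from itertools import accumulate
from typing import Dict, List


def sweep(pos_hist: List[int], neg_hist: List[int]) -> List[Dict[str, int]]:
    """Confusion matrix for every interval lower limit, highest threshold first."""
    total_pos, total_neg = sum(pos_hist), sum(neg_hist)
    # Running totals from the last interval down to the first.
    tps = accumulate(reversed(pos_hist))
    fps = accumulate(reversed(neg_hist))
    n = len(pos_hist)
    return [
        {"bin": n - 1 - i, "tp": tp, "fp": fp,
         "fn": total_pos - tp, "tn": total_neg - fp}
        for i, (tp, fp) in enumerate(zip(tps, fps))
    ]


# Hypothetical 4-interval histograms.
matrices = sweep([0, 1, 1, 1], [1, 1, 1, 0])
```

Each threshold costs one addition per histogram, so all confusion matrices are produced in time proportional to the number of intervals rather than the number of samples.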
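S205. Use the confusion matrix corresponding to each threshold point as an evaluation index.
S206. For each threshold point, obtain the corresponding ROC coordinates from the confusion matrix.
S207. Draw the ROC curve from the ROC coordinates of each threshold point.
S208. Obtain the area of each curvilinear trapezoid formed by the ROC coordinates of adjacent threshold points and the ROC curve.
S209. Add the areas of all the curvilinear trapezoids to obtain the AUC value of the ROC curve.
After the confusion matrix of each threshold point is obtained, the other evaluation indices of the classification model, such as the ROC curve, the AUC value (area under the ROC curve), and the Lift chart, can be derived from it.
Specifically, for each threshold point, the ratio of FP to the total number of actual negative samples is the abscissa of the ROC, and the ratio of TP to the total number of actual positive samples is the ordinate of the ROC. After the ROC coordinates of each threshold point are obtained, the ROC curve is drawn by plotting the ROC coordinates of all threshold points.
Further, once the ROC curve is drawn, the ROC coordinates of adjacent threshold points and the ROC curve form a curvilinear trapezoid, whose area can be calculated from the adjacent ROC coordinates. After the areas of all curvilinear trapezoids are obtained, they are added to give the AUC value of the ROC curve.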
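A sketch of the area summation in S208 and S209, under the assumption that adjacent ROC points are joined by straight segments so that each "curvilinear trapezoid" is an ordinary trapezoid. The function name `auc_trapezoid` and the sample points are illustrative.

```python
from typing import List, Tuple


def auc_trapezoid(points: List[Tuple[float, float]]) -> float:
    """Sum the trapezoid areas under consecutive (FPR, TPR) points."""
    pts = sorted(points)  # order by FPR, then TPR
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )


# Hypothetical ROC coordinates, including the (0, 0) and (1, 1) corners.
roc_points = [(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
auc = auc_trapezoid(roc_points)
```

Including the (0, 0) corner matters when the highest threshold already yields a non-zero false positive rate; otherwise the leftmost strip of the area would be dropped.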
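S210. For each threshold point, obtain the corresponding Lift coordinates from the confusion matrix.
Specifically, for each threshold point, the ratio of the sum of TP and FP to the total number of samples is the abscissa of the Lift chart, and TP is the ordinate of the Lift chart.
S211. Draw the Lift chart from the Lift coordinates of each threshold point.
Further, after the Lift coordinates of each threshold point are obtained, the Lift chart is drawn from the Lift coordinates of all threshold points.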
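The Lift coordinates defined in S210 follow directly from TP and FP. The sketch below assumes each confusion matrix is a dictionary with `tp` and `fp` keys; that layout and the name `lift_points` are choices made for this example only.

```python
from typing import Dict, List, Tuple


def lift_points(matrices: List[Dict[str, int]], total: int) -> List[Tuple[float, int]]:
    """Lift coordinates per threshold: x = (TP + FP) / total samples, y = TP."""
    return [((m["tp"] + m["fp"]) / total, m["tp"]) for m in matrices]


# Hypothetical confusion-matrix counts for three thresholds over 6 samples.
matrices = [{"tp": 1, "fp": 0}, {"tp": 2, "fp": 1}, {"tp": 3, "fp": 3}]
points = lift_points(matrices, total=6)
```

The abscissa is the share of samples predicted as positive at that threshold, so the last point always reaches 1.0 when the lowest threshold admits every sample.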
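S212. Receive a display instruction from the user, and display the evaluation index visually according to the display instruction.
After the evaluation index is obtained, the user can send a display instruction for the evaluation index; on receiving it, the calculated evaluation index is displayed visually so that the user can judge the quality of the classification model intuitively.
In this embodiment, the evaluation index obtaining method can be executed on a server. After the evaluation index is calculated, the user can send a display instruction to the server; on receiving it, the server can deliver the evaluation index to the local terminal, which visualizes it on its display screen, for example by showing the user the ROC curve or the Lift chart.
Optionally, for large-scale data, where calculating the histograms involves a large volume of data, the histograms can be calculated on the server and the histogram result then delivered to the local terminal, which calculates the evaluation index; this relieves the load on the server. After the evaluation index is calculated, the user can send a display instruction to the local terminal; on receiving it, the local terminal visualizes the evaluation index on its display screen, for example by showing the user the ROC curve or the Lift chart. When the user clicks a point on the ROC curve, the confusion matrix corresponding to that point can be displayed.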
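To make the server and terminal split concrete, the sketch below serializes a histogram result for delivery. The text does not specify a transport or format; JSON and the field names `lower_bounds`, `pos`, and `neg` are assumptions chosen for illustration.

```python
import json
from typing import List, Tuple


def pack_histograms(lower_bounds: List[float], pos_hist: List[int],
                    neg_hist: List[int]) -> str:
    """Server side: encode the histogram result as a JSON string."""
    return json.dumps({"lower_bounds": lower_bounds, "pos": pos_hist, "neg": neg_hist})


def unpack_histograms(payload: str) -> Tuple[List[float], List[int], List[int]]:
    """Terminal side: decode the histogram result before computing indices."""
    data = json.loads(payload)
    return data["lower_bounds"], data["pos"], data["neg"]


payload = pack_histograms([0.0, 0.25, 0.5, 0.75], [0, 1, 1, 1], [1, 1, 1, 0])
bounds, pos_hist, neg_hist = unpack_histograms(payload)
```

The payload size depends only on the number of intervals, not on the number of samples, which is why the terminal can take over the index calculation cheaply.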
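Optionally, the evaluation index obtaining method can be executed on the local terminal. After the evaluation index is calculated, the user can send a display instruction to the local terminal; on receiving it, the terminal displays the index on its screen, for example by showing the user the ROC curve or the Lift chart. When the user clicks a point on the ROC curve, the confusion matrix corresponding to that point can be displayed.
To aid understanding of the evaluation index obtaining method provided in this embodiment, an example follows. The samples are user 0 to user 99, and each sample user has the following feature parameters: age (age), work class (workclass), sampling weight (fnlwgt), education (education), years of education (education_num), marital status (matrital_status), occupation (occupation), family relationship (relationship), race (race), sex (sex), capital gain (capital_gain), capital loss (capital_loss), weekly working hours (hours_per_week), native country (native_country), and so on. Inputting these feature parameters into the classification model for classification training yields a classification result for the users' income. In this example "0" denotes low income and "1" denotes high income. High income is the positive sample attribute and low income is the negative sample attribute. The output data of the classification model includes each sample's original sample attribute, its predicted sample attribute, and the probability that it is predicted as high income, as shown in Table 2 below.
Table 2. Output data of the classification model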
Figure PCTCN2017072405-appb-000002
Figure PCTCN2017072405-appb-000003
Figure PCTCN2017072405-appb-000004
Figure PCTCN2017072405-appb-000005
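Histogram calculation on the output data of the classification model yields Tables 3 and 4 below; Table 3 is the first histogram result, for the positive samples, and Table 4 is the second histogram result, for the negative samples.
Table 3. First histogram result for the positive samples
Probability interval    Number of positive samples in the interval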
[0,0.04) 0
[0.04,0.08) 0
[0.08,0.12) 0
[0.12,0.16) 0
[0.16,0.2) 0
[0.2,0.24) 0
[0.24,0.28) 0
[0.28,0.32) 0
[0.32,0.36) 0
[0.36,0.4) 1
[0.4,0.44) 0
[0.44,0.48) 2
[0.48,0.52) 0
[0.52,0.56) 1
[0.56,0.6) 0
[0.6,0.64) 2
[0.64,0.68) 3
[0.68,0.72) 2
[0.72,0.76) 3
[0.76,0.8) 0
[0.8,0.84) 2
[0.84,0.88) 3
[0.88,0.92) 1
[0.92,0.96) 0
[0.96,1) 5
Table 4. Second histogram result for the negative samples
Probability interval    Number of negative samples in the interval
[0,0.04) 34
[0.04,0.08) 13
[0.08,0.12) 10
[0.12,0.16) 5
[0.16,0.2) 3
[0.2,0.24) 3
[0.24,0.28) 4
[0.28,0.32) 1
[0.32,0.36) 0
[0.36,0.4) 1
[0.4,0.44) 1
[0.44,0.48) 0
[0.48,0.52) 0
[0.52,0.56) 0
[0.56,0.6) 0
[0.6,0.64) 0
[0.64,0.68) 0
[0.68,0.72) 0
[0.72,0.76) 0
[0.76,0.8) 0
[0.8,0.84) 0
[0.84,0.88) 0
[0.88,0.92) 0
[0.92,0.96) 0
[0.96,1) 0
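After the results of the first and second histograms are obtained, the probability intervals are known, and the lower limit value of each probability interval is used as a threshold point to form the threshold set. In this example the threshold set is: 0, 0.04, 0.08, 0.12, 0.16, 0.2, 0.24, 0.28, 0.32, 0.36, 0.4, 0.44, 0.48, 0.52, 0.56, 0.6, 0.64, 0.68, 0.72, 0.76, 0.8, 0.84, 0.88, 0.92, 0.96.
Only two threshold points are used here to illustrate how the evaluation parameters for a threshold point are calculated:
When the threshold point is 0.4, the first and second histograms give the confusion matrix TP=24, FP=1, FN=1, TN=74 (the interval [0.4, 0.44) counts as above the threshold).
When the threshold point is 0.6, the first and second histogram results give the confusion matrix TP=21, FP=0, FN=4, TN=75.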
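The two confusion matrices can be recomputed directly from the counts in Tables 3 and 4. The sketch below is a check, not part of the patent; `confusion_at_bin` is a hypothetical helper, and an interval whose lower limit equals the threshold is treated as lying above it.

```python
from typing import List, Tuple

# Counts per interval of width 0.04, copied from Table 3 (positives)
# and Table 4 (negatives).
POS = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 1, 0, 2, 3, 2, 3, 0, 2, 3, 1, 0, 5]
NEG = [34, 13, 10, 5, 3, 3, 4, 1, 0, 1, 1] + [0] * 14


def confusion_at_bin(pos: List[int], neg: List[int], k: int) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) when interval k is the first one above the threshold."""
    tp, fn = sum(pos[k:]), sum(pos[:k])
    fp, tn = sum(neg[k:]), sum(neg[:k])
    return tp, fp, fn, tn


at_04 = confusion_at_bin(POS, NEG, 10)  # threshold 0.4 is the lower limit of interval 10
at_06 = confusion_at_bin(POS, NEG, 15)  # threshold 0.6 is the lower limit of interval 15
```

With these tables, threshold 0.4 gives (TP, FP, FN, TN) = (24, 1, 1, 74) and threshold 0.6 gives (21, 0, 4, 75): no negative sample has a predicted probability of 0.6 or higher, and four positive samples fall below 0.6.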
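For each threshold point, the corresponding ROC coordinates and Lift coordinates can be calculated from the confusion matrix.
ROC coordinates: abscissa X = FP/(FP+TN); ordinate Y = TP/(TP+FN). Lift coordinates: abscissa X = (TP+FP)/total number of samples; ordinate Y = TP. After the ROC coordinates and Lift coordinates of all threshold points are obtained, the ROC curve and the Lift chart can be plotted. Figure 3 is the ROC curve of the classification model. Its ordinate is the true positive rate (TPR, also called the hit rate), which indicates the sensitivity with which the classification model identifies positive samples: TPR = TP/(TP+FN). Its abscissa is the false positive rate (FPR), where FPR = FP/(FP+TN). The false positive rate can be expressed through the specificity: FPR = 1 - specificity, where the specificity is the coverage of negative examples (true negative rate, TNR), TNR = TN/(TN+FP).
Figure 4 is the Lift chart of the classification model. Its ordinate is the number of actual positive samples predicted as positive (TP), and its abscissa is the proportion of samples predicted as positive, (TP+FP)/total number of samples.
After the ROC coordinates of each threshold point are obtained and the ROC curve is drawn, the ROC coordinates of adjacent threshold points and the ROC curve form a curvilinear trapezoid, whose area can be calculated from the adjacent ROC coordinates. After the areas of all curvilinear trapezoids are obtained, they are added to give the AUC value of the ROC curve.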
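The following pseudocode calculates the evaluation parameters:
Input: N, icProb, icTrue, icFalse  # N is the number of probability intervals; icProb holds the lower limit value of each interval; icTrue holds the number of actual positive samples in each interval; icFalse holds the number of actual negative samples in each interval
Output: the ROC coordinates, Lift coordinates, and confusion matrix for each threshold point, and the AUC value
Procedure:
1. Compute the total number of positive samples: totalTrue = ∑(icTrue); and the total number of negative samples: totalFalse = ∑(icFalse)
2. Initialize the cumulative positive and negative counts: curTrue = 0, curFalse = 0
3. For i: 0 to N-1
a) Threshold point: p = icProb[N-1-i]
b) curTrue += icTrue[N-1-i]; curFalse += icFalse[N-1-i]  # accumulating the actual positive samples predicted as positive gives TP; accumulating the actual negative samples predicted as positive gives FP
c) Confusion matrix: cm.p = p; cm.tp = curTrue; cm.fp = curFalse
cm.fn = totalTrue - curTrue; cm.tn = totalFalse - curFalse
d) ROC coordinates: roc.p = p
roc.x = curFalse / totalFalse
roc.y = curTrue / totalTrue
e) Lift coordinates: lift.p = p
lift.x = (curTrue + curFalse) / (totalTrue + totalFalse)
lift.y = curTrue
4. Compute the area under the curve from the ROC coordinates, i.e. the AUC value.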
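A runnable Python rendering of the procedure above might look as follows. It is a sketch under stated assumptions: the names `evaluate`, `ic_prob`, `ic_true`, and `ic_false` mirror the pseudocode but are otherwise my own, step 4 is implemented with the straight-line trapezoid rule and a prepended (0, 0) point, and both classes are assumed to be non-empty so that the divisions are defined.

```python
from typing import Dict, List


def evaluate(ic_prob: List[float], ic_true: List[int], ic_false: List[int]) -> Dict[str, object]:
    """One pass over the intervals, highest threshold first."""
    n = len(ic_prob)
    total_true, total_false = sum(ic_true), sum(ic_false)
    cur_true = cur_false = 0
    confusion, roc, lift = [], [(0.0, 0.0)], []
    for i in range(n):
        j = n - 1 - i                      # walk the intervals from high to low
        cur_true += ic_true[j]             # running TP
        cur_false += ic_false[j]           # running FP
        confusion.append({"p": ic_prob[j], "tp": cur_true, "fp": cur_false,
                          "fn": total_true - cur_true, "tn": total_false - cur_false})
        roc.append((cur_false / total_false, cur_true / total_true))
        lift.append(((cur_true + cur_false) / (total_true + total_false), cur_true))
    # Step 4: trapezoid rule over consecutive ROC points.
    auc = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(roc, roc[1:]))
    return {"confusion": confusion, "roc": roc, "lift": lift, "auc": auc}


# Counts from Tables 3 and 4; lower limits 0, 0.04, ..., 0.96.
ic_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 1, 0, 2, 3, 2, 3, 0, 2, 3, 1, 0, 5]
ic_false = [34, 13, 10, 5, 3, 3, 4, 1, 0, 1, 1] + [0] * 14
ic_prob = [k / 25 for k in range(25)]
result = evaluate(ic_prob, ic_true, ic_false)
```

For the data of Tables 3 and 4 this yields an AUC of about 0.9992, and the entry for threshold 0.4 is TP=24, FP=1, FN=1, TN=74.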
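As the above embodiment shows, once the confusion matrix has been calculated from the histogram results, the other evaluation indices can be calculated conveniently from it and visualized, so that the user can judge the quality of the classification model intuitively.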
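Embodiment 3
Figure 5 is a schematic structural diagram of the evaluation index obtaining device according to Embodiment 3 of the present invention. The device includes a classification training module 11, a probability statistics module 12, and a calculation module 13.
The classification training module 11 is configured to input samples into a classification model for classification training and to obtain output data of the classification model.
To evaluate the classification model, the classification training module 11 inputs samples into it for classification training; after training, the classification training module 11 classifies each sample and predicts a probability for it. Specifically, after training the classification training module 11 outputs a post-training sample attribute for each sample, which indicates whether the model classifies the sample as positive or negative.
Further, after training the classification training module 11 also predicts a probability for each sample; according to actual needs, the user can choose to output either the probability that each sample is predicted as a positive sample or the probability that it is predicted as a negative sample. For any sample, the two probabilities sum to 1.
Each input sample has an original sample attribute. In this embodiment, the sample attributes include a positive sample attribute and a negative sample attribute. The original sample attribute indicates whether the sample is actually a positive sample or a negative sample.
The probability statistics module 12 is configured to perform probability distribution statistics on the output data to obtain a probability statistics result.
The probability statistics result includes probability intervals and the number of actual positive samples and actual negative samples in each probability interval.
Because the classification training module 11 predicts a probability for each sample, every sample in the output data has a predicted probability. In this embodiment, the probability that the classification training module 11 outputs for each sample is the predicted probability that the classification model predicts the sample as a positive sample.
Further, the probability statistics module 12 performs probability distribution statistics on the output data according to the predicted probability to obtain the probability statistics result. It first divides the probability range into intervals and then, within each interval, counts the actual positive samples and actual negative samples according to the original sample attribute of each sample in the output data, which yields probability distribution maps of the positive samples and of the negative samples. The number of actual positive samples in each probability interval is obtained from the distribution map of the positive samples, and the number of actual negative samples in each interval from the distribution map of the negative samples.
Preferably, the probability statistics module 12 computes the probability distribution of the output data with a histogram algorithm, obtaining a histogram of the positive samples and a histogram of the negative samples, from which the above probability statistics result can be obtained.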
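The calculation module 13 is configured to calculate the evaluation index of the classification model according to a threshold set and the probability statistics result.
After the probability statistics result is obtained, a threshold set containing a plurality of threshold points is acquired. Then, based on each threshold point and on the first data of the actual positive samples and the second data of the actual negative samples in each probability interval of the probability statistics result, the evaluation parameter corresponding to each threshold point is obtained, and the evaluation index of the classification model is generated from the evaluation parameters of all threshold points.
In this embodiment, after the probability statistics result is obtained, the calculation module 13 can use the endpoint values of its probability intervals as threshold points to form the threshold set. For example, the lower limit value of each probability interval can be used as a threshold point, or the lower limit values of only some of the intervals. Because the probability intervals are divided during the probability statistics, their endpoints can serve as demarcation points; using the endpoint values directly as threshold points means the threshold points do not need to be set again, which improves the computational efficiency of the evaluation index.
Optionally, the calculation module 13 can receive a threshold set that the user inputs and that is formed by using endpoint values of the probability intervals as threshold points. For example, the user may use the lower limit value of each probability interval as a threshold point, or select the lower limit values of only some probability intervals. In this embodiment, based on the probability statistics result that is fed back, the user can gain a preliminary understanding of the effect of the classification model and can therefore select suitable threshold points to form the threshold set; user interaction is better, and the evaluation of the classification model is more accurate.
Further, the calculation module 13 calculates the evaluation index according to the threshold points in the threshold set and the probability statistics result. The evaluation indices include the confusion matrix, the ROC curve, the AUC value, and the Lift chart.
The confusion matrix includes TP, FP, TN, and FN.
After a threshold point is obtained, the calculation module 13 uses it as a demarcation point. For the probability distribution of the positive samples, the actual positive samples in all probability intervals greater than the threshold point are predicted as positive by the classification model; their counts are accumulated, and the accumulated count is the TP of the confusion matrix. The actual positive samples in all probability intervals smaller than the threshold point are predicted as negative; their counts are accumulated, and the accumulated count is the FN of the confusion matrix.
For the probability distribution of the negative samples, the actual negative samples in all probability intervals greater than the threshold point are predicted as positive by the classification model; their counts are accumulated, and the accumulated count is the FP of the confusion matrix. The actual negative samples in all probability intervals smaller than the threshold point are predicted as negative; their counts are accumulated, and the accumulated count is the TN of the confusion matrix.
After the confusion matrix corresponding to a threshold point is obtained, the calculation module 13 can use its TP, FP, TN, and FN to calculate the evaluation parameters of the other evaluation indices at that threshold point. Once the evaluation parameters of all threshold points have been calculated, the evaluation index is generated from them. For example, the coordinates of the ROC curve at a threshold point can be calculated from the confusion matrix of that threshold point and used as the evaluation parameter of the ROC curve at that point. Once the evaluation parameters of all threshold points have been calculated, the ROC curve is drawn from the ROC coordinates of each threshold point.
The evaluation index obtaining device provided in this embodiment performs probability statistics on the output data of the classification model and calculates the evaluation index from the resulting probability statistics result. This avoids scanning the output data multiple times while calculating the evaluation index and improves the computational efficiency, especially when the output data is large-scale.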
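Embodiment 4
Figure 6 is a schematic structural diagram of the evaluation index obtaining device according to Embodiment 4 of the present invention. The device includes a classification training module 21, a probability statistics module 22, a calculation module 23, and a visualization module 24.
The classification training module 21 is configured to input samples into a classification model for classification training and to obtain output data of the classification model.
Further, the probability statistics module 22 is specifically configured to divide the output data into probability intervals based on a histogram algorithm and to count the number of actual positive samples and actual negative samples in each probability interval.
The output data includes the original sample attribute of each sample and the predicted probability that the classification model predicts each sample as a positive sample; the sample attributes include a positive sample attribute and a negative sample attribute.
Further, in one optional structure, the probability statistics module 22 includes a scanning unit 221, a histogram generating unit 222, a step adjusting unit 223, and a counting unit 224.
The scanning unit 221 is configured to scan the output data.
The histogram generating unit 222 is configured to generate a first histogram corresponding to the positive samples and a second histogram corresponding to the negative samples, according to the predicted probability that each sample is predicted as positive and the original sample attribute of each sample in the output data. The horizontal axis of the first histogram is the predicted probability and its vertical axis is the number of actual positive samples; the horizontal axis of the second histogram is the predicted probability and its vertical axis is the number of actual negative samples.
The step adjusting unit 223 is configured to adjust the horizontal-axis step size so that the probability intervals of the first histogram and the second histogram are consistent, thereby obtaining the probability intervals in the probability statistics result.
The counting unit 224 is configured to count the number of actual positive samples in each probability interval of the first histogram, and to count the number of actual negative samples in each probability interval of the second histogram.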
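In this embodiment, in one optional structure, the calculation module 23 includes a threshold set obtaining unit 231, a confusion matrix generating unit 232, and an evaluation index generating unit 233.
The threshold set obtaining unit 231 is configured to use the endpoint value of each probability interval as a threshold point to form the threshold set.
Further, the threshold set obtaining unit 231 is also configured to receive a threshold set that the user inputs and that is formed from endpoint values of the probability intervals.
The confusion matrix generating unit 232 is configured to obtain, in descending order of threshold, the confusion matrix corresponding to each threshold point in the threshold set, where the confusion matrix includes TP, FP, TN, and FN.
The evaluation index generating unit 233 is configured to use the confusion matrix corresponding to each threshold point as an evaluation index of the classification model.
After the confusion matrix of each threshold point is obtained, the other evaluation indices of the classification model, such as the ROC curve, the AUC value (area under the ROC curve), and the Lift chart, can be derived from it.
Further, the confusion matrix generating unit 232 is specifically configured to, for the first histogram, process the threshold points in order of size, accumulating the number of actual positive samples in all probability intervals greater than the threshold point to obtain TP and the number of actual positive samples in all probability intervals smaller than the threshold point to obtain FN; and, for the second histogram, process the threshold points in order of size, accumulating the number of negative samples in all probability intervals greater than the threshold point to obtain FP and the number of negative samples in all probability intervals smaller than the threshold point to obtain TN.
The evaluation index generating unit 233 is specifically configured to use the confusion matrix corresponding to each threshold point as an evaluation index.
The evaluation index generating unit 233 is specifically configured to, for each threshold point, use the ratio of FP to the total number of actual negative samples as the abscissa of the ROC and the ratio of TP to the total number of actual positive samples as the ordinate of the ROC, and to draw the ROC curve, an evaluation index of the classification model, from the ROC coordinates of all threshold points.
The evaluation index generating unit 233 is specifically configured to obtain the area of each curvilinear trapezoid formed by the ROC coordinates of adjacent threshold points and the ROC curve, and to add the areas of all the curvilinear trapezoids to obtain the AUC value of the ROC curve.
The evaluation index generating unit 233 is specifically configured to, for each threshold point, use the ratio of the sum of TP and FP to the total number of samples as the abscissa of the Lift chart and TP as the ordinate of the Lift chart, and to draw the Lift chart, an evaluation index of the classification model, from the Lift coordinates of all threshold points.
The visualization module 24 is configured to receive a display instruction from the user and to display the evaluation index visually according to the display instruction.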
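In this embodiment, the evaluation index obtaining device can be deployed on a server to execute the evaluation index obtaining method. After the evaluation index is calculated, the user can send a display instruction to the visualization module 24 in the device; on receiving it, the visualization module 24 can deliver the evaluation index to the local terminal, which visualizes it on its display screen, for example by showing the user the ROC curve or the Lift chart. When the user clicks a point on the ROC curve, the confusion matrix corresponding to that point can be displayed.
Optionally, for large-scale data, the classification training module 21 and the probability statistics module 22 can be deployed on the server while the calculation module 23 and the visualization module 24 are deployed on the local terminal, which reduces the load on the server and eases interaction with the user. Classification training and histogram calculation are performed on the sample data on the server; after the histograms are calculated, the probability statistics module 22 can deliver the histogram result to the calculation module 23 on the local terminal, which calculates the evaluation index there and so relieves the load on the server. After the evaluation index is calculated, the user can send a display instruction to the visualization module 24; on receiving it, the visualization module 24 visualizes the evaluation index on the display screen, for example by showing the user the ROC curve or the Lift chart. When the user clicks a point on the ROC curve, the confusion matrix corresponding to that point can be displayed.
Optionally, the evaluation index obtaining device can be deployed on the local terminal to execute the evaluation index obtaining method. After the evaluation index is calculated, the user can send a display instruction to the visualization module 24; on receiving it, the visualization module 24 displays the index on the screen, for example by showing the user the ROC curve or the Lift chart. When the user clicks a point on the ROC curve, the confusion matrix corresponding to that point can be displayed.
The evaluation index obtaining device provided in this embodiment performs probability statistics on the output data of the classification model and calculates the evaluation index from a probability statistics result that includes the probability intervals and the number of actual positive samples and actual negative samples in each interval. This avoids scanning the output data multiple times while calculating the evaluation index and improves the computational efficiency, especially when the output data is large-scale. Further, once the evaluation index is obtained it can be displayed visually, so that the user can judge the quality of the classification model intuitively.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments; the storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and that such modifications or replacements do not take the essence of the corresponding technical solutions outside the scope of the technical solutions of the embodiments of the present invention.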

Claims (22)

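  1. An evaluation index obtaining method, characterized by comprising:
    inputting samples into a classification model for classification training, and obtaining output data of the classification model;
    performing probability distribution statistics on the output data to obtain a probability statistics result, wherein the probability statistics result includes probability intervals and the number of actual positive samples and actual negative samples in each probability interval;
    calculating an evaluation index of the classification model according to a threshold set and the probability statistics result.
  2. The evaluation index obtaining method according to claim 1, characterized in that performing probability distribution statistics on the output data to obtain a probability statistics result comprises:
    dividing the output data into probability intervals based on a histogram algorithm, and counting the number of actual positive samples and the number of actual negative samples in each probability interval.
  3. The evaluation index obtaining method according to claim 2, characterized in that the output data includes: the original sample attribute of each sample and the predicted probability that the classification model predicts each sample as a positive sample; wherein the sample attributes include a positive sample attribute and a negative sample attribute.
  4. The evaluation index obtaining method according to claim 3, characterized in that dividing the output data into probability intervals based on a histogram algorithm and counting the number of actual positive samples and the number of actual negative samples in each probability interval comprises:
    scanning the output data;
    generating a first histogram corresponding to the positive samples and a second histogram corresponding to the negative samples according to the predicted probability that each sample is predicted as a positive sample and the original sample attribute of each sample in the output data; wherein the horizontal axis of the first histogram is the predicted probability and the vertical axis of the first histogram is the number of actual positive samples; the horizontal axis of the second histogram is the predicted probability and the vertical axis of the second histogram is the number of actual negative samples;
    adjusting the horizontal-axis step size so that the probability intervals of the first histogram and the second histogram are consistent, to obtain the probability intervals in the probability statistics result;
    counting the number of actual positive samples in each probability interval of the first histogram;
    counting the number of actual negative samples in each probability interval of the second histogram.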
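  5. The evaluation index obtaining method according to claim 4, characterized in that calculating the evaluation index of the classification model according to a threshold set and the probability statistics result comprises:
    using the endpoint value of each probability interval as a threshold point to form the threshold set;
    obtaining, in descending order of threshold, the confusion matrix corresponding to each threshold point in the threshold set, wherein the confusion matrix includes TP, the number of actual positive samples predicted as positive; FN, the number of actual positive samples predicted as negative; TN, the number of actual negative samples predicted as negative; and FP, the number of actual negative samples predicted as positive;
    using the confusion matrix corresponding to each threshold point as an evaluation index.
  6. The evaluation index obtaining method according to claim 4, characterized in that calculating the evaluation index of the classification model according to a threshold set and the probability statistics result comprises:
    receiving the threshold set input by a user and formed from endpoint values of the probability intervals;
    obtaining, in descending order of threshold, the confusion matrix corresponding to each threshold point in the threshold set, wherein the confusion matrix includes TP, FP, TN, and FN;
    using the confusion matrix corresponding to each threshold point as the evaluation index.
  7. The evaluation index obtaining method according to claim 5 or 6, characterized in that obtaining, in descending order of threshold, the confusion matrix corresponding to each threshold point in the threshold set comprises:
    for the first histogram, processing the threshold points in order of size, accumulating the number of actual positive samples in all probability intervals greater than the threshold point to obtain the TP, and accumulating the number of actual positive samples in all probability intervals smaller than the threshold point to obtain the FN;
    for the second histogram, processing the threshold points in order of size, accumulating the number of negative samples in all probability intervals greater than the threshold point to obtain the FP, and accumulating the number of negative samples in all probability intervals smaller than the threshold point to obtain the TN.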
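  8. The evaluation index acquisition method according to claim 7, characterized in that, after obtaining, in descending order, the confusion matrix corresponding to each threshold point in the threshold set, the method further comprises:
    for each threshold point, using the ratio of the FP to the total number of actual negative samples as the abscissa of the ROC;
    using the ratio of the TP to the total number of actual positive samples as the ordinate of the ROC; and
    plotting an ROC curve, as an evaluation index of the classification model, from the ROC coordinates corresponding to all threshold points.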
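  9. The evaluation index acquisition method according to claim 8, characterized in that, after plotting the ROC curve, as an evaluation index of the classification model, from the ROC coordinates corresponding to all threshold points, the method further comprises:
    obtaining the area of each curved-edge trapezoid formed by the ROC coordinates corresponding to adjacent threshold points and the ROC curve; and
    adding the areas of all the curved-edge trapezoids to obtain the AUC value corresponding to the ROC curve.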
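  10. The evaluation index acquisition method according to claim 7, characterized in that, after obtaining, in descending order, the confusion matrix corresponding to each threshold point in the threshold set, the method further comprises:
    for each threshold point, using the ratio of the sum of the TP and the FP to the total number of samples as the abscissa of a Lift chart;
    using the TP as the ordinate of the Lift chart; and
    plotting a Lift chart, as an evaluation index of the classification model, from the Lift coordinates corresponding to all threshold points.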
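  11. An evaluation index acquisition apparatus, characterized by comprising:
    a classification training module, configured to input samples into a classification model for classification training and obtain output data of the classification model;
    a probability statistics module, configured to perform probability distribution statistics on the output data to obtain a probability statistics result, wherein the probability statistics result comprises probability intervals and the number of actual positive samples and the number of actual negative samples within each probability interval; and
    a calculation module, configured to calculate an evaluation index of the classification model according to a threshold set and the probability statistics result.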
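  12. The evaluation index acquisition apparatus according to claim 11, characterized in that the probability statistics module is specifically configured to perform probability interval division on the output data based on a histogram algorithm, and to count the number of actual positive samples and the number of actual negative samples within each probability interval.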
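  13. The evaluation index acquisition apparatus according to claim 12, characterized in that the output data comprises the original sample attribute of each sample and the predicted probability of each sample being predicted as a positive sample by the classification model, wherein the sample attributes comprise a positive sample attribute and a negative sample attribute.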
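  14. The evaluation index acquisition apparatus according to claim 13, characterized in that the probability statistics module comprises:
    a scanning unit, configured to scan the output data;
    a histogram generation unit, configured to generate a first histogram corresponding to positive samples and a second histogram corresponding to negative samples according to the predicted probability of each sample being predicted as a positive sample and the original sample attribute of each sample in the output data, wherein the horizontal axis of the first histogram is the predicted probability and the vertical axis of the first histogram is the number of actual positive samples, and the horizontal axis of the second histogram is the predicted probability and the vertical axis of the second histogram is the number of actual negative samples;
    a step size adjustment unit, configured to adjust the horizontal-axis step size so that the probability intervals of the first histogram and the second histogram are consistent, to obtain the probability intervals in the probability statistics result; and
    a counting unit, configured to count the number of actual positive samples within each probability interval in the first histogram, and to count the number of actual negative samples within each probability interval in the second histogram.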
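  15. The evaluation index acquisition apparatus according to claim 14, characterized in that the calculation module comprises:
    a threshold set acquisition unit, configured to generate the threshold set by using the endpoint values of each probability interval as threshold points;
    a confusion matrix generation unit, configured to obtain, in descending order, a confusion matrix corresponding to each threshold point in the threshold set, wherein the confusion matrix comprises the number TP of actual positive samples predicted as positive samples, the number FN of actual positive samples predicted as negative samples, the number TN of actual negative samples predicted as negative samples, and the number FP of actual negative samples predicted as positive samples; and
    an evaluation index generation unit, configured to use the confusion matrix corresponding to each threshold point as the evaluation index.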
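  16. The evaluation index acquisition apparatus according to claim 15, characterized in that the threshold set acquisition unit is further configured to receive the threshold set, input by a user and formed from endpoint values of the probability intervals.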
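  17. The evaluation index acquisition apparatus according to claim 16, characterized in that the confusion matrix generation unit is specifically configured to: for the first histogram, successively, in order of threshold-point magnitude, accumulate the numbers of actual positive samples in all probability intervals above the threshold point to obtain the TP, and accumulate the numbers of actual positive samples in all probability intervals below the threshold point to obtain the FN; and, for the second histogram, successively, in order of threshold-point magnitude, accumulate the numbers of negative samples in all probability intervals above the threshold point to obtain the FP, and accumulate the numbers of negative samples in all probability intervals below the threshold point to obtain the TN.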
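  18. The evaluation index acquisition apparatus according to claim 17, characterized in that the evaluation index generation unit is specifically configured to, for each threshold point, use the ratio of the FP to the total number of actual negative samples as the abscissa of the ROC, use the ratio of the TP to the total number of actual positive samples as the ordinate of the ROC, and plot an ROC curve, as an evaluation index of the classification model, from the ROC coordinates corresponding to all threshold points.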
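  19. The evaluation index acquisition apparatus according to claim 18, characterized in that the evaluation index generation unit is further specifically configured to obtain the area of each curved-edge trapezoid formed by the ROC coordinates corresponding to adjacent threshold points and the ROC curve, and to add the areas of all the curved-edge trapezoids to obtain the AUC value of the ROC curve.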
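  20. The evaluation index acquisition apparatus according to claim 19, characterized in that the evaluation index generation unit is specifically configured to, for each threshold point, use the ratio of the sum of the TP and the FP to the total number of samples as the abscissa of a Lift chart, use the TP as the ordinate of the Lift chart, and plot a Lift chart, as an evaluation index of the classification model, from the Lift coordinates corresponding to all threshold points.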
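  21. The evaluation index acquisition apparatus according to claim 20, characterized in that the classification training module and the probability statistics module are deployed on a server side, and the calculation module is deployed on a local terminal.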
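  22. The evaluation index acquisition apparatus according to claim 21, characterized by further comprising: a visualization module, configured to receive a display instruction from a user and to visually display the evaluation index of the classification model according to the display instruction;
    wherein the visualization module is deployed on the local terminal.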
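
The ROC, AUC, and Lift computations recited in claims 8 to 10 can be sketched in Python as follows. This is an illustrative reading, not the applicant's implementation. It assumes that each confusion matrix is supplied as a tuple (threshold, TP, FP, TN, FN) ordered by decreasing threshold, that the origin (0, 0) is added as the starting ROC point, and that adjacent ROC points are joined by straight segments, so that each "curved-edge trapezoid" reduces to an ordinary trapezoid. The function names are hypothetical.

```python
def roc_points(rows):
    """Map confusion matrices to ROC coordinates (FPR, TPR)."""
    points = [(0.0, 0.0)]  # assumed starting point, not recited in the claims
    for _, tp, fp, tn, fn in rows:
        fpr = fp / (fp + tn) if (fp + tn) else 0.0  # FP / total actual negatives
        tpr = tp / (tp + fn) if (tp + fn) else 0.0  # TP / total actual positives
        points.append((fpr, tpr))
    return points

def auc(points):
    """Sum the trapezoid areas between adjacent ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def lift_points(rows):
    """Lift coordinates: x = (TP + FP) / total samples, y = TP."""
    out = []
    for _, tp, fp, tn, fn in rows:
        total = tp + fp + tn + fn
        out.append(((tp + fp) / total, tp))
    return out
```

Because every function consumes only the per-threshold confusion matrices, all three indexes can be produced without touching the raw output data again.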
PCT/CN2017/072405 2016-02-05 2017-01-24 Evaluation index acquisition method and apparatus WO2017133569A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/066,102 US20190034516A1 (en) 2016-02-05 2017-01-24 Method and apparatus for acquiring an evaluation index

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610082141.1A CN107045506A (zh) 2016-02-05 2016-02-05 Evaluation index acquisition method and apparatus
CN201610082141.1 2016-02-05

Publications (1)

Publication Number Publication Date
WO2017133569A1 (zh) 2017-08-10

Family

ID=59500562

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/072405 WO2017133569A1 (zh) 2016-02-05 2017-01-24 Evaluation index acquisition method and apparatus

Country Status (4)

Country Link
US (1) US20190034516A1 (zh)
CN (1) CN107045506A (zh)
TW (1) TW201732643A (zh)
WO (1) WO2017133569A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
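CN110163248A (zh) * 2019-04-09 2019-08-23 文远知行有限公司 Visualization method and apparatus for model evaluation, computer device, and storage medium
CN110322143A (zh) * 2019-06-28 2019-10-11 深圳前海微众银行股份有限公司 Model materialization management method, apparatus, device, and computer storage medium
CN111582351A (zh) * 2020-04-30 2020-08-25 北京百度网讯科技有限公司 Method, apparatus, device, and medium for determining evaluation indexes of a classification model
CN112434839A (zh) * 2019-08-26 2021-03-02 电力规划总院有限公司 Prediction method and electronic device
EP3907670A4 (en) * 2019-01-02 2022-03-02 Panasonic Intellectual Property Corporation of America INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING METHOD AND PROGRAM
CN118172589A (zh) * 2024-02-02 2024-06-11 北京视觉世界科技有限公司 Automated model quality evaluation method, apparatus, device, and storage medium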

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
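CN107704495B (zh) * 2017-08-25 2018-08-10 平安科技(深圳)有限公司 Topic classifier training method and apparatus, and computer-readable storage medium
CN109447125B (zh) * 2018-09-28 2019-12-24 北京达佳互联信息技术有限公司 Classification model processing method and apparatus, electronic device, and storage medium
CN109800796A (zh) * 2018-12-29 2019-05-24 上海交通大学 Ship target recognition method based on transfer learning
CN112016940A (zh) * 2019-05-29 2020-12-01 中国移动通信集团福建有限公司 Model building method and device, and network satisfaction evaluation method and device
CN112184279A (zh) * 2019-07-05 2021-01-05 上海哔哩哔哩科技有限公司 Fast AUC index calculation method and apparatus, and computer device
CN112308099B (zh) * 2019-07-29 2024-08-20 腾讯科技(深圳)有限公司 Method for determining sample feature importance, and classification model training method and apparatus
CN110796034B (zh) * 2019-10-12 2022-04-22 北京达佳互联信息技术有限公司 Target object recognition method, apparatus, device, and medium
CN110796381B (zh) * 2019-10-31 2024-07-09 深圳前海微众银行股份有限公司 Risk control model modeling method and apparatus, terminal device, and medium
CN111341439B (zh) * 2020-02-27 2023-09-26 江苏品生医疗科技集团有限公司 Decision analysis method for clinical prediction models
CN111784093B (zh) * 2020-03-27 2023-07-11 国网浙江省电力有限公司 Method for assisting judgment of enterprise work resumption based on electric power big data analysis
CN111488927B (zh) * 2020-04-08 2023-07-21 中国医学科学院肿瘤医院 Classification threshold determination method and apparatus, electronic device, and storage medium
CN112163625B (zh) * 2020-10-06 2021-06-25 西安石油大学 Big data mining method based on artificial intelligence and cloud computing, and cloud service center
CN113887125B (zh) * 2021-08-31 2024-06-04 哈尔滨工业大学 Method for evaluating the operational effectiveness of a complex simulation system
CN113723835B (zh) * 2021-09-02 2024-02-06 国网河北省电力有限公司电力科学研究院 Thermal power plant water use evaluation method and terminal device
CN114330562B (zh) * 2021-12-31 2023-09-26 大箴(杭州)科技有限公司 Small-sample fine-grained classification and multi-class model construction method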

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110137841A1 (en) * 2008-08-05 2011-06-09 Fujitsu Limited Sample class prediction method, prediction program, and prediction apparatus
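CN102663723A (zh) * 2012-02-24 2012-09-12 武汉大学 Image segmentation method based on color samples and an electric field model
CN104361224A (zh) * 2014-10-31 2015-02-18 深圳信息职业技术学院 Confidence classification method and confidence machine
CN104504583A (zh) * 2014-12-22 2015-04-08 广州唯品会网络技术有限公司 Classifier evaluation method
CN105069470A (zh) * 2015-07-29 2015-11-18 腾讯科技(深圳)有限公司 Classification model training method and apparatus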

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8588519B2 (en) * 2010-09-22 2013-11-19 Siemens Aktiengesellschaft Method and system for training a landmark detector using multiple instance learning
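CN103123633A (zh) * 2011-11-21 2013-05-29 阿里巴巴集团控股有限公司 Evaluation parameter generation method and information search method based on evaluation parameters
CN103605103B (zh) * 2013-06-26 2016-12-28 广东电网公司东莞供电局 Intelligent fault diagnosis method for electric energy metering based on an S-shaped curve function
CN105096058A (zh) * 2015-08-20 2015-11-25 北京中电普华信息技术有限公司 Data processing method and apparatus applied to a customer service staff scoring system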

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3907670A4 (en) * 2019-01-02 2022-03-02 Panasonic Intellectual Property Corporation of America INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING METHOD AND PROGRAM
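CN110163248A (zh) * 2019-04-09 2019-08-23 文远知行有限公司 Visualization method and apparatus for model evaluation, computer device, and storage medium
CN110322143A (zh) * 2019-06-28 2019-10-11 深圳前海微众银行股份有限公司 Model materialization management method, apparatus, device, and computer storage medium
CN112434839A (zh) * 2019-08-26 2021-03-02 电力规划总院有限公司 Prediction method and electronic device
CN112434839B (zh) * 2019-08-26 2023-05-30 电力规划总院有限公司 Method for predicting heavy-overload risk of distribution transformers, and electronic device
CN111582351A (zh) * 2020-04-30 2020-08-25 北京百度网讯科技有限公司 Method, apparatus, device, and medium for determining evaluation indexes of a classification model
CN111582351B (zh) * 2020-04-30 2023-09-22 北京百度网讯科技有限公司 Method, apparatus, device, and medium for determining evaluation indexes of a classification model
CN118172589A (zh) * 2024-02-02 2024-06-11 北京视觉世界科技有限公司 Automated model quality evaluation method, apparatus, device, and storage medium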

Also Published As

Publication number Publication date
CN107045506A (zh) 2017-08-15
TW201732643A (zh) 2017-09-16
US20190034516A1 (en) 2019-01-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17746889

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17746889

Country of ref document: EP

Kind code of ref document: A1