
Service data processing method and device, electronic equipment and storage medium

Info

Publication number
CN115034400B
Authority
CN
China
Prior art keywords
feature
discrete
value
determining
sample
Prior art date
Legal status
Active
Application number
CN202210422527.8A
Other languages
Chinese (zh)
Other versions
CN115034400A (en)
Inventor
杨宇雪
李虹锋
曹清鑫
Current Assignee
CCB Finetech Co Ltd
Original Assignee
CCB Finetech Co Ltd
Priority date
Filing date
Publication date
Application filed by CCB Finetech Co Ltd
Priority to CN202210422527.8A
Publication of CN115034400A
Application granted
Publication of CN115034400B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/18 - Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis


Abstract

The application relates to the technical field of data processing, and in particular discloses a business data processing method and apparatus, an electronic device, and a storage medium. The method comprises: inputting user data to be processed into a pre-trained machine learning model, and determining the output target group to which the user belongs. The target training samples used in the training process of the machine learning model are determined as follows: determining a training sample feature set from the training sample set; discretizing the continuous features in the training sample feature set, and forming a target training feature set from the discrete features obtained by discretization together with the discrete features already in the training sample feature set; determining, for each discrete feature in the target training feature set, a distribution difference index of that feature; and screening the discrete features based on their distribution difference indexes, and determining a target training sample set based on the resulting reference feature set. The accuracy of dividing users into target groups is thereby improved.

Description

Service data processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular to a service data processing method and apparatus, an electronic device, and a storage medium.
Background
When facing massive user data, business personnel need to quickly and accurately determine the characteristics and/or classification of each user so as to formulate a targeted business service strategy for each user.
In the related art, a machine learning model is applied, for example, to determine the characteristics and/or classification of individual users. However, user data is voluminous, and the data of a single user spans different dimensions, each of which can serve as a feature of a sample. For business personnel, different sample features differ in their importance for distinguishing user characteristics and/or classifications. If all sample features are used directly to train the model, the trained model is inaccurate, and the characteristics and/or classifications it distinguishes are inaccurate.
Disclosure of Invention
The embodiments of the present application provide a business data processing method and apparatus, an electronic device, and a storage medium, which improve the accuracy of dividing users into target groups according to user data.
In a first aspect, an embodiment of the present application provides a service data processing method, including:
Acquiring user data to be processed; the user data comprises user basic information and service association data;
inputting the user data to be processed into a pre-trained machine learning model, and determining an output target group to which the user belongs;
The target training sample set applied to the training process of the machine learning model is determined by the following steps:
Determining a training sample feature set according to the training sample set; each training sample in the training sample set comprises basic information of a training sample user and business association data of the training sample user;
Discretizing continuous features in the training sample feature set, and forming a target training feature set by the discrete features obtained by discretizing and the discrete features in the training sample feature set;
Determining a distribution difference index of any one discrete feature in the target training feature set; wherein the distribution difference index characterizes the degree of difference of the discrete feature between a positive example sample and a negative example sample; the positive example samples are samples whose basic information meets preset user attributes and/or whose service association data meets preset service attributes, and the negative example samples are samples whose basic information does not meet the preset user attributes and whose service association data does not meet the preset service attributes;
and screening the discrete features based on the distribution difference indexes of the discrete features, and determining a target training sample set based on the obtained reference feature set.
In some exemplary embodiments, the determining the distribution difference index of the discrete feature comprises:
Counting first value count vectors of all positive examples on the discrete features and counting second value count vectors of all negative examples on the discrete features;
determining a first distribution probability vector of the positive example sample on the discrete feature according to the first value count vector and the total number of the positive example samples, and determining a second distribution probability vector of the negative example sample on the discrete feature according to the second value count vector and the total number of the negative example samples;
and determining a distribution difference index of the discrete features according to the first distribution probability vector, the second distribution probability vector and the number of different values of the discrete features.
In some exemplary embodiments, the counting the first valued count vector of all positive examples on the discrete feature includes:
determining a first number of positive examples of all the positive examples of the discrete features as the values according to each value of the discrete features;
Taking each first quantity as an element of a first value count vector to form the first value count vector;
and the counting of the second value count vector of all negative examples on the discrete feature includes the following steps:
Determining, for each value of the discrete feature, a second number of negative examples for which the discrete feature is the value in all of the negative examples;
and taking each second quantity as an element of a second value count vector to form the second value count vector.
In some exemplary embodiments, the determining the distribution difference index of the discrete feature according to the first distribution probability vector, the second distribution probability vector, and the number of different values of the discrete feature includes:
for each value, determining a first probability corresponding to the value in the first distribution probability vector according to the position, in the first value count vector, of the element holding the first number corresponding to the value; determining a second probability corresponding to the value in the second distribution probability vector according to the position, in the second value count vector, of the element holding the second number corresponding to the value; and determining a reference index corresponding to the value according to the absolute value of the difference between the first probability and the second probability and the number of different values of the discrete feature;
and determining the sum of the reference indexes corresponding to the values as the distribution difference index of the discrete features.
In some exemplary embodiments, the filtering each of the discrete features based on the distribution difference index of each of the discrete features to obtain a reference feature set includes:
selecting discrete features with distribution difference indexes larger than a preset index threshold value to form a reference feature set; or
And selecting a preset number of discrete features according to the size of the distribution difference index to form a reference feature set.
In some exemplary embodiments, after the determining the target training sample set based on the obtained reference feature set, the method further includes:
Displaying the reference feature set according to a preset display mode;
For any reference feature, if the range span of the feature values of the original feature corresponding to the reference feature is greater than a preset threshold, the preset display mode is a line-chart comparison mode; if the range span of the feature values of the original feature corresponding to the reference feature is less than or equal to the preset threshold, the preset display mode is a histogram comparison mode; and the abscissa of both the line comparison chart and the histogram comparison chart is the values corresponding to the reference feature, and the ordinate is the value of each element in the first distribution probability vector and the value of each element in the second distribution probability vector corresponding to the reference feature.
In some exemplary embodiments, the method further comprises:
determining, for any one discrete feature of any one target test sample, a SHAP value of the target test sample on the discrete feature based on a prediction result of the machine learning model on the target test sample; the target test sample is obtained by discretizing the test sample; the test sample comprises basic information of a test sample user and business association data of the test sample user;
Carrying out weighted average processing on SHAP values of all target test samples on the discrete features to obtain association degrees of the discrete features and the prediction results of the target test samples; the association degree characterizes the decision degree of the corresponding discrete feature in the model training process;
And determining the association degree corresponding to each discrete feature.
In some exemplary embodiments, after the determining the SHAP value of the target test sample on the discrete feature based on the prediction result of the machine learning model on the target test sample, the method further comprises:
for any one of the discrete features, displaying the prediction results of the discrete features of each target test sample in a scatter diagram display mode;
Wherein, the abscissa of the scatter diagram is the SHAP value of the discrete feature, and the ordinate of the scatter diagram is the value of the discrete feature; wherein the scatter plot characterizes the extent to which each value of the discrete feature affects the predicted outcome of the discrete feature for each target test sample.
In a second aspect, an embodiment of the present application provides a service data processing apparatus, including:
The data acquisition module is used for acquiring user data to be processed; the user data comprises user basic information and service association data;
The determining module is used for inputting the user data to be processed into a pre-trained machine learning model and determining an output target group to which the user belongs;
the apparatus further comprises a model training module, wherein the model training module is used for determining a target training sample set applied to a training process of the machine learning model by the following method:
Determining a training sample feature set according to the training sample set; each training sample in the training sample set comprises basic information of a training sample user and business association data of the training sample user;
Discretizing continuous features in the training sample feature set, and forming a target training feature set by the discrete features obtained by discretizing and the discrete features in the training sample feature set;
Determining a distribution difference index of any one discrete feature in the target training feature set; wherein the distribution difference index characterizes the degree of difference of the discrete feature between a positive example sample and a negative example sample; the positive example samples are samples whose basic information meets preset user attributes and/or whose service association data meets preset service attributes, and the negative example samples are samples whose basic information does not meet the preset user attributes and whose service association data does not meet the preset service attributes;
and screening the discrete features based on the distribution difference indexes of the discrete features, and determining a target training sample set based on the obtained reference feature set.
In some exemplary embodiments, the model training module is specifically configured to:
Counting first value count vectors of all positive examples on the discrete features and counting second value count vectors of all negative examples on the discrete features;
determining a first distribution probability vector of the positive example sample on the discrete feature according to the first value count vector and the total number of the positive example samples, and determining a second distribution probability vector of the negative example sample on the discrete feature according to the second value count vector and the total number of the negative example samples;
and determining a distribution difference index of the discrete features according to the first distribution probability vector, the second distribution probability vector and the number of different values of the discrete features.
In some exemplary embodiments, the model training module is specifically configured to:
determining a first number of positive examples of all the positive examples of the discrete features as the values according to each value of the discrete features;
Taking each first quantity as an element of a first value count vector to form the first value count vector;
and the counting of the second value count vector of all negative examples on the discrete feature includes the following steps:
Determining, for each value of the discrete feature, a second number of negative examples for which the discrete feature is the value in all of the negative examples;
and taking each second quantity as an element of a second value count vector to form the second value count vector.
In some exemplary embodiments, the model training module is specifically configured to:
for each value, determining a first probability corresponding to the value in the first distribution probability vector according to the position, in the first value count vector, of the element holding the first number corresponding to the value; determining a second probability corresponding to the value in the second distribution probability vector according to the position, in the second value count vector, of the element holding the second number corresponding to the value; and determining a reference index corresponding to the value according to the absolute value of the difference between the first probability and the second probability and the number of different values of the discrete feature;
and determining the sum of the reference indexes corresponding to the values as the distribution difference index of the discrete features.
In some exemplary embodiments, the model training module is specifically configured to:
selecting discrete features with distribution difference indexes larger than a preset index threshold value to form a reference feature set; or
And selecting a preset number of discrete features according to the size of the distribution difference index to form a reference feature set.
In some exemplary embodiments, the apparatus further includes a first display module, configured to display the reference feature set according to a preset display mode after the target training sample set is determined based on the obtained reference feature set;
The first display module is specifically configured to: for any reference feature, if the range span of the feature values of the original feature corresponding to the reference feature is greater than a preset threshold, the preset display mode is a line-chart comparison mode; if the range span of the feature values of the original feature corresponding to the reference feature is less than or equal to the preset threshold, the preset display mode is a histogram comparison mode; and the abscissa of both the line comparison chart and the histogram comparison chart is the values corresponding to the reference feature, and the ordinate is the value of each element in the first distribution probability vector and the value of each element in the second distribution probability vector corresponding to the reference feature.
In some exemplary embodiments, the apparatus further comprises a test module, wherein the test module is specifically configured to:
determining, for any one discrete feature of any one target test sample, a SHAP value of the target test sample on the discrete feature based on a prediction result of the machine learning model on the target test sample; the target test sample is obtained by discretizing the test sample; the test sample comprises basic information of a test sample user and business association data of the test sample user;
Carrying out weighted average processing on SHAP values of all target test samples on the discrete features to obtain association degrees of the discrete features and the prediction results of the target test samples; the association degree characterizes the decision degree of the corresponding discrete feature in the model training process;
And determining the association degree corresponding to each discrete feature.
In some exemplary embodiments, the apparatus further comprises a second display module configured to, after the SHAP value of the target test sample on the discrete feature is determined based on the prediction result of the machine learning model on the target test sample:
for any one of the discrete features, displaying the prediction results of the discrete features of each target test sample in a scatter diagram display mode;
Wherein, the abscissa of the scatter diagram is the SHAP value of the discrete feature, and the ordinate of the scatter diagram is the value of the discrete feature; wherein the scatter plot characterizes the extent to which each value of the discrete feature affects the predicted outcome of the discrete feature for each target test sample.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of any of the methods described above when executing the computer program.
In a fourth aspect, an embodiment of the application provides a computer readable storage medium having stored thereon computer program instructions which when executed by a processor perform the steps of any of the methods described above.
In a fifth aspect, an embodiment of the application provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the methods as provided in the first aspect of the application.
The embodiment of the application has the following beneficial effects:
By inputting user data to be processed, including user basic information and business association data, into a pre-trained machine learning model, the target group to which the user belongs can be determined. During training of the machine learning model, a series of processing is performed on the training sample set to obtain a target training sample set, which is then used for training. Specifically, first, the continuous features in the training sample feature set are discretized, and the discrete features obtained by discretization together with the discrete features already in the training sample feature set form the target training feature set; second, for each discrete feature in the target training feature set, a distribution difference index characterizing the degree of difference of that feature between positive and negative samples is determined. The discrete features are then screened based on their distribution difference indexes, retaining the features whose value distributions differ markedly between positive and negative samples (indicating that the feature can effectively distinguish the two), so that the machine learning model is trained on a target training sample set determined from the resulting reference feature set; when the trained model predicts on user data, the accuracy of dividing users into target groups according to user data is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic application scenario diagram of a service data processing method according to an embodiment of the present application;
fig. 2 is a flow chart of a service data processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a process for processing a training sample set according to an embodiment of the present application;
FIG. 4 is a schematic diagram showing a comparison of the distribution of positive examples and negative examples on the "age" feature according to an embodiment of the present application;
FIG. 5 is a schematic diagram showing a comparison of the distribution of positive examples and negative examples on the "owner identification" feature according to an embodiment of the present application;
FIG. 6 is a schematic diagram of the correlation analysis between the values of discrete feature 36 and the prediction results according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a service data processing device according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application.
For convenience of understanding, the terms involved in the embodiments of the present application are explained below:
(1) Machine learning: a computer trains and learns a model from existing data and uses the model to predict results.
(2) SHAP: all called SHAPLEY ADDITIVE exPlanation, a "model interpretation" package developed by Python, can interpret the output of any machine learning model.
Any number of elements in the figures are for illustration and not limitation, and any naming is used for distinction only and not for any limiting sense.
In practice, when facing massive user data, the characteristics and/or classification of each user need to be determined quickly and accurately so that a targeted business service policy can be formulated for each user. User data is voluminous, and the data of a single user spans different dimensions, each of which can serve as a feature of a sample. For business personnel, different sample features differ in their importance for distinguishing user characteristics and/or classifications. If all sample features are used directly to train the model, the trained model is inaccurate, and the characteristics and/or classifications it distinguishes are inaccurate.
Therefore, the application provides a business data processing method in which user data to be processed is obtained, the user data comprising user basic information and service association data, and the user data to be processed is input into a pre-trained machine learning model to determine the output target group to which the user belongs. During training, the training sample feature set determined from the training sample set is processed, discrete features with a large degree of difference between positive and negative samples are screened out as reference features, and the target training sample set is then determined from the obtained reference features. Applying a machine learning model trained on a target training sample set obtained in this way to determine the target group to which a user belongs improves the accuracy of dividing users into target groups according to user data.
After the design idea of the embodiment of the present application is introduced, some simple descriptions are made below for application scenarios applicable to the technical solution of the embodiment of the present application, and it should be noted that the application scenarios described below are only used for illustrating the embodiment of the present application and are not limiting. In the specific implementation, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
Referring to fig. 1, an application scenario diagram of a service data processing method according to an embodiment of the present application is shown. In order to enable service personnel to formulate service strategies for each user more accurately, the service data processing method of the embodiment of the application (11 denotes the service data processing device) classifies 200,000 pieces of user data into, for example, a risk investment class, a steady investment class, and a small flexible savings class. In this way, service personnel can learn the characteristics of each user from the classification result.
Of course, the method provided by the embodiment of the present application is not limited to the application scenario shown in fig. 1, but may be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described together in the following method embodiments, which are not described in detail herein.
In order to further explain the technical solution provided by the embodiments of the present application, the following details are described with reference to the accompanying drawings and the detailed description. Although embodiments of the present application provide the method operational steps shown in the following embodiments or figures, more or fewer operational steps may be included in the method based on routine or non-inventive labor. In steps where there is logically no necessary causal relationship, the execution order of the steps is not limited to the execution order provided by the embodiments of the present application.
The technical scheme provided by the embodiment of the application is described below with reference to an application scenario shown in fig. 1.
Referring to fig. 2, an embodiment of the present application provides a service data processing method, including the following steps:
S201, acquiring user data to be processed; the user data includes user basic information and service association data.
S202, inputting the user data to be processed into a pre-trained machine learning model, and determining the target group to which the output user belongs.
The target training sample set applied to the training process of the machine learning model is determined by the following steps:
determining a training sample feature set according to the training sample set; each training sample in the training sample set comprises basic information of a training sample user and business association data of the training sample user;
Discretizing continuous features in the training sample feature set, and forming a target training feature set by the discrete features obtained by discretizing and the discrete features in the training sample feature set;
Determining a distribution difference index of the discrete features aiming at any one discrete feature in the target training feature set; the distribution difference index characterizes the difference degree of the discrete features in the positive example sample and the negative example sample; the positive example sample is a sample of which the basic information meets preset user attributes and/or the service association data meets preset service attributes, and the negative example sample is a sample of which the basic information does not meet the preset user attributes and the service association data does not meet the preset service attributes;
and screening the discrete features based on the distribution difference indexes of the discrete features, and determining a target training sample set based on the obtained reference feature set.
By inputting user data to be processed, including user basic information and business association data, into a pre-trained machine learning model, the target group to which the user belongs can be determined. During training of the machine learning model, a series of processing is performed on the training sample set to obtain a target training sample set, which is then used for training. Specifically, first, the continuous features in the training sample feature set are discretized, and the discrete features obtained by discretization together with the discrete features already in the training sample feature set form the target training feature set; second, for each discrete feature in the target training feature set, a distribution difference index characterizing the degree of difference of that feature between positive and negative samples is determined. The discrete features are then screened based on their distribution difference indexes, retaining the features whose value distributions differ markedly between positive and negative samples (indicating that the feature can effectively distinguish the two), so that the machine learning model is trained on a target training sample set determined from the resulting reference feature set; when the trained model predicts on user data, the accuracy of dividing users into target groups according to user data is improved.
Referring to S201, the user data to be processed is acquired; the user data includes user basic information and business association data, where the user basic information includes the user's age, gender, occupation, housing conditions, deposits and annual income, and the business association data includes historical investment data, investment risk assessment reports, the maximum acceptable investment amount, and the like.
Referring to S202, after the user data to be processed is obtained, in order to determine the target group to which the user belongs, the user data to be processed is input into a pre-trained machine learning model (for example, a GBM or a random forest) to determine the group to which the user belongs, for example, a group whose available funds are around 100,000, so that service personnel can recommend suitable products to the user.
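As an illustration only, a minimal sketch of this inference step might look as follows; the RandomForestClassifier, the toy training data, and all variable names are assumptions for illustration and are not prescribed by the embodiment:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the pre-trained model; in practice the model is trained
# on the target training sample set described below.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))      # discretized sample features (assumed)
y_train = rng.integers(0, 3, size=200)   # target group labels (assumed)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Inference: feed the user's preprocessed feature vector to the model and
# read off the predicted target group.
user_features = rng.normal(size=(1, 5))
print("target group:", model.predict(user_features)[0])
```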
Illustratively, in connection with FIG. 3, a set of target training samples to which the training process of the machine learning model is applied is determined by:
s301, determining a training sample feature set according to the training sample set; wherein each training sample in the training sample set comprises basic information of a training sample user and business association data of the training sample user.
S302, discretizing continuous features in the training sample feature set, and forming a target training feature set by the discretized discrete features and the discrete features in the training sample feature set.
S303, determining a distribution difference index of the discrete features aiming at any one of the discrete features in the target training feature set; the distribution difference index characterizes the difference degree of the discrete features in the positive example sample and the negative example sample; the positive example sample is a sample of which the basic information meets preset user attributes and/or the business association data meets preset business attributes, and the negative example sample is a sample of which the basic information does not meet the preset user attributes and the business association data does not meet the preset business attributes.
S304, screening the discrete features based on the distribution difference indexes of the discrete features, and determining a target training sample set based on the obtained reference feature set.
By screening each discrete feature in this way, the sample features that participate in model training all have a large degree of difference between positive and negative samples (that is, they are effective features, which optimize the model's target indexes and improve the performance of the machine learning algorithm when predicting user data). The trained machine learning model is therefore more accurate, and it is used to predict user data and determine the target group to which the user belongs.
After the characteristics/categories of users, or the groups to which they belong, are determined from the user data, the purpose is to guide service personnel to know the characteristics of users accurately so as to provide targeted service strategies. The process is therefore mainly oriented to service personnel: if service personnel select a certain number of features as model input based on existing business experience and subjective judgment, such screening is highly subjective, varies with the individual's understanding, and is difficult to reuse.
In addition, compared with the existing approach of calculating the variance of each feature and the correlation between each feature and the target value, and then selecting the features whose scores exceed a threshold as modeling features, this method requires neither that modeling personnel (service personnel) understand the principles of the correlation indexes nor that they rely on the analysis experience of algorithm engineers; the screening process is more intuitive, which lowers the operating difficulty for service personnel.
Referring to S301, a plurality of training samples is acquired to form the training sample set; the data of each training-sample user is one training sample, and each training sample includes the basic information and business association data of that user. Features are extracted from each sample user to obtain the training sample feature set. In a specific example, the features in the training sample feature set mainly include two types: continuous features, such as user age and intended investment amount, and discrete features, such as user gender.
In order to reduce errors caused by the difference between continuous and discrete features, the continuous features are discretized, and the discrete features obtained after discretization are combined with the original discrete features to obtain the target training feature set; that is, every feature in the target training feature set is a discrete feature.
In a specific example, a process of discretization will be described:
All continuous features are binned with the equidistant (equal-width) binning method. Suppose there are m continuous features in total. Taking the i-th continuous feature as an example, let $m_i$ denote the value of a sample on feature i, $\max(m_i)$ the maximum value of the i-th continuous feature over all samples, and $\min(m_i)$ the minimum value of the i-th continuous feature over all samples. The continuous feature is mapped uniformly onto k intervals (k being the bin number), so the bin width corresponding to each bin interval of feature i is:

$$w_i = \frac{\max(m_i) - \min(m_i)}{k}$$

The demarcation points of the bin intervals then form the vector:

$$B_i = \big(\min(m_i),\ \min(m_i) + w_i,\ \min(m_i) + 2 w_i,\ \ldots,\ \min(m_i) + k \, w_i\big)$$

According to the calculated demarcation points, the original value $m_i$ is replaced by the index of the interval into which it falls, i.e., a value in $[0, 1, 2, \ldots, k]$.
Referring to S303, after the discretization process, all features in the target training feature set are discrete. Then, for each discrete feature, a distribution difference index of the discrete feature is determined, where the distribution difference index characterizes the degree of difference of the discrete feature between positive and negative samples; the larger the degree of difference, the larger the role the discrete feature plays in the model training process. In practice, samples in the training sample set can be divided into positive and negative samples according to whether the basic information meets a preset user attribute and/or whether the business association data meets a preset business attribute. The preset user attribute is, for example, being aged 25 to 45, and the preset business attribute is, for example, an investment intention of more than 1,000,000. Samples meeting the conditions are taken as positive samples, and samples not meeting them as negative samples.
In a specific example, for any one of the discrete features, the distribution difference index for that discrete feature is determined by:
A. and counting the first value count vectors of all positive examples on the discrete features and counting the second value count vectors of all negative examples on the discrete features.
In step A, the first value count vector is counted as follows:
a1, determining the first number of positive example samples with the discrete features as the values in all positive example samples according to each value of the discrete features.
Taking age as an example: if, across all training samples, age takes 41 values from 20 to 60, then for each value (for example, 20) it is determined how many positive samples take that value on the discrete feature; this count is recorded as the first number. For the 41 values, 41 first numbers are thus obtained.
A2, taking each first quantity as an element of the first value count vector to form the first value count vector.
The order of the elements may be preset, for example, the order of the elements is from small to large, so that the first value count vector may be formed.
In step A, the second value count vector is counted as follows:
A3, determining the second number of negative example samples with the discrete features as the values in all the negative example samples according to each value of the discrete features.
Still taking age as an example, it is determined how many negative samples among all negative samples take the value 20 on the discrete feature; this count is recorded as the second number. For the 41 values, 41 second numbers are thus obtained.
A4, taking each second quantity as an element of a second value count vector to form the second value count vector.
The order of the elements may be preset, for example, the order of the elements is from small to large, so that the second value count vector may be formed.
B. Determining a first distribution probability vector of the positive example sample on the discrete feature according to the first value count vector and the total number of the positive example samples, and determining a second distribution probability vector of the negative example sample on the discrete feature according to the second value count vector and the total number of the negative example samples.
Dividing each element in the first value counting vector by the total number of the positive samples to obtain a first distribution probability vector of the positive samples on discrete characteristics; dividing each element in the second value counting vector by the total number of negative examples to obtain a second distribution probability vector of the negative examples on the discrete characteristics.
C. And determining a distribution difference index of the discrete features according to the first distribution probability vector, the second distribution probability vector and the number of different values of the discrete features.
In step C, determining the distribution difference index of the discrete feature is achieved by:
C1, determining a first probability corresponding to the value in a first distribution probability vector according to the positions of the elements in the first value counting vector of a first quantity corresponding to the value for each value; determining a second probability corresponding to the value in the second distribution probability vector according to the positions of the elements of the second quantity corresponding to the value in the second value counting vector; and determining a reference index corresponding to the value according to the absolute value of the difference value between the first probability and the second probability and the number of different values of the discrete feature.
Taking age as an example again: for the value 20, the first probability P1 corresponding to the value is determined according to the position (for example, the first position) of the corresponding first number M1 among the elements of the first value count vector. The second probability P2 corresponding to the value is determined in the same way from the second value count vector. The reference index corresponding to the value is then determined according to the absolute value of the difference between the two probabilities and the number of different values of the discrete feature.
And C2, determining the sum of the reference indexes corresponding to the values as a distribution difference index of the discrete feature.
As above, the reference indices corresponding to the respective values (20 to 60) are added to obtain the distribution difference index of the discrete feature.
In a specific example, the distribution difference index of a discrete feature is calculated as follows.

After discretization, let M denote the set of all features, with dimension |M|, and suppose feature i takes n distinct values. The count vectors of the positive and negative samples over the values of feature i are counted separately. The value count vector of the positive samples on the i-th feature (the first value count vector) is:

$$T_{F_i} = (q_{i1}, q_{i2}, \ldots, q_{it}, \ldots, q_{in}), \quad i \in [1, |M|]$$

where $q_{it}$ is the number of positive samples taking the value t on feature i. The value count vector of the negative samples on the i-th feature (the second value count vector) is:

$$F_{Q_i} = (q'_{i1}, q'_{i2}, \ldots, q'_{it}, \ldots, q'_{in}), \quad i \in [1, |M|]$$

where $q'_{it}$ is the number of negative samples taking the value t on feature i.

Let |T| be the number of positive samples and |F| the number of negative samples. Dividing each element (a sample count) of the vectors by the corresponding total gives the sample-proportion vectors of the positive and negative samples on each value of feature i, i.e., the first and second distribution probability vectors:

$$T_i = (p_{i1}, p_{i2}, \ldots, p_{it}, \ldots, p_{in}), \quad p_{it} = q_{it} / |T|$$

$$F_i = (p'_{i1}, p'_{i2}, \ldots, p'_{it}, \ldots, p'_{in}), \quad p'_{it} = q'_{it} / |F|$$

The distribution difference of the positive and negative samples over each value interval of feature i is then computed. Assuming feature i has n values, the feature distribution difference index of feature i is:

$$Z_i = \sum_{t=1}^{n} \frac{|p_{it} - p'_{it}|}{n}$$

where each term $\frac{|p_{it} - p'_{it}|}{n}$ is the reference index of value t, and the sum of the individual reference indexes is the distribution difference index $Z_i$.
The feature distribution difference indexes corresponding to all |M| features are sorted in descending order to obtain a feature distribution difference ranking table; the higher a feature ranks, the larger its difference between positive and negative samples.
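A short sketch of this computation, assuming the feature has already been discretized to integer values 0..n-1 (the function and variable names are illustrative):

```python
import numpy as np

def distribution_difference_index(pos_vals, neg_vals, n: int) -> float:
    """Z_i = sum over the n values of |p_it - p'_it| / n for one feature."""
    q_pos = np.bincount(pos_vals, minlength=n)   # first value count vector
    q_neg = np.bincount(neg_vals, minlength=n)   # second value count vector
    p_pos = q_pos / len(pos_vals)                # first distribution probability vector
    p_neg = q_neg / len(neg_vals)                # second distribution probability vector
    return float(np.abs(p_pos - p_neg).sum() / n)

pos = np.array([0, 0, 1, 2, 2, 2])   # discretized values of the positive samples
neg = np.array([1, 1, 1, 2, 0, 1])   # discretized values of the negative samples
print(distribution_difference_index(pos, neg, n=3))   # 0.333...
```

Features can then be screened by keeping those whose index exceeds a preset threshold, or by taking a preset number of features in descending order of the index, as S304 describes.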
Referring to S304, since the distribution difference index characterizes the degree of difference of the discrete feature in the positive example sample and the negative example sample, the greater the degree of difference, the more important the discrete feature is for the training process of the machine learning model. Thus, the target training sample set may be determined by screening each discrete feature based on its distribution difference index in the following manner, and then based on the resulting reference feature set.
In a specific example, discrete features whose distribution difference index is greater than a preset index threshold may be selected to form the reference feature set; alternatively, a preset number of discrete features may be selected in descending order of distribution difference index to form the reference feature set.
In the embodiment of the application, after the target training sample set is determined based on the obtained reference feature set, the reference feature set is displayed in a preset display mode, in order to show service personnel intuitively which features were selected to participate in model training.
Specifically, for any reference feature, if the range span of the feature values of the original feature corresponding to the reference feature is greater than a preset threshold, the preset display mode is a line-chart comparison mode; if the range span is less than or equal to the preset threshold, the preset display mode is a histogram comparison mode. The abscissa of both the line comparison chart and the histogram comparison chart is the values of the reference feature, and the ordinate is the value of each element in the first distribution probability vector and in the second distribution probability vector corresponding to the reference feature.
Since the reference features are all discrete, the corresponding original features may be either discrete or continuous. To let service personnel better understand the influence of a feature on model training, the original feature corresponding to each reference feature is used for this judgment during display. In actual display, according to the relation between the range span of the original feature's values and the preset threshold, there are mainly the following two cases:
Case 1: the range span of the feature values of the original feature corresponding to the reference feature is greater than the preset threshold, where the range span refers to the number of values contained between the minimum value and the maximum value. In this case, the preset display mode is the line-chart comparison mode. In this example, the reference feature is "age", the preset threshold is, for example, 10, and the range span of 100 (values from 1 to 100) is greater than the threshold 10. Referring to fig. 4, a comparison of the distributions of positive and negative samples on the "age" feature is shown, where cst1 is the feature distribution curve of the positive samples, cst2 is the feature distribution curve of the negative samples, the abscissa is the discretized feature value, and the ordinate is the probability (in %) of the corresponding feature value. From fig. 4 it can be observed that there is a significant distribution difference between positive and negative samples on the feature "age", the users in the positive samples being younger.
Case 2: the range span of the feature values of the original feature corresponding to the reference feature is less than or equal to the preset threshold. In this case, the preset display mode is the histogram comparison mode. In this example, the reference feature is "owner identification", the preset threshold is, for example, 10, and since "owner identification" contains only two values, the range span is 2 (values from 1 to 2), which is smaller than the threshold 10. Referring to fig. 5, a comparison of the distributions of positive and negative samples on the "owner identification" feature is shown, where cst3 is the feature distribution histogram of the positive samples, cst4 is the feature distribution histogram of the negative samples, the abscissa is the discretized feature value, and the ordinate is the probability (in %) of the corresponding feature value.
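A sketch of this display logic with matplotlib; the function name, the span threshold, and the toy numbers are assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

def show_feature_comparison(values, p_pos, p_neg, name, span_threshold=10):
    """Line-comparison chart for wide-span features (case 1), histogram
    comparison otherwise (case 2); the ordinate is the probability (%)
    of each discretized feature value."""
    values = np.asarray(values, dtype=float)
    if len(values) > span_threshold:          # case 1: line chart
        plt.plot(values, p_pos, label="positive samples")
        plt.plot(values, p_neg, label="negative samples")
    else:                                     # case 2: histogram (bar) chart
        w = 0.4
        plt.bar(values - w / 2, p_pos, width=w, label="positive samples")
        plt.bar(values + w / 2, p_neg, width=w, label="negative samples")
    plt.xlabel(name)
    plt.ylabel("probability (%)")
    plt.legend()
    plt.show()

# A two-valued feature falls below the threshold -> histogram comparison, as in fig. 5.
show_feature_comparison([0, 1], [60.0, 40.0], [35.0, 65.0], "owner identification")
```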
In addition, in order to verify the accuracy of the machine learning model obtained by training, a test sample is applied to verify the training result of the machine learning model, and the verification process is as follows:
D1. For any one discrete feature of any one target test sample, a SHAP value of the target test sample on the discrete feature is determined based on the prediction result of the machine learning model on the target test sample; the target test sample is obtained by discretizing the test sample; the test sample includes basic information of the test sample user and business association data of the test sample user.
In this step, the type of the test sample and the processing applied to it are the same as for the training samples, and are not repeated here. In a specific example, the SHAP value of the target test sample on the discrete feature is determined as follows:
Based on SHAP's additive interpretation, the idea of cooperative game theory is introduced: the marginal contribution of a feature when it is added to the model is calculated, and then the average of the feature's marginal contributions over all feature orderings is considered. The mathematical expression is:

$$g(z') = \phi_0 + \sum_{i=1}^{|M|} \phi_i z'_i$$

where g is the interpretation model, M is the set of all features in the training set, and $z'_i \in \{0,1\}^{|M|}$ indicates whether the corresponding feature is present (1 for present, 0 for absent); $\phi_i$ is the attribution value of each feature, and $\phi_0$ is a constant (the average prediction over all training samples). Since the input to a tree model must be structured data, for an instance x the vector z' has all entries equal to 1, i.e., all features can be observed, so the above formula reduces to:

$$g(x) = \phi_0 + \sum_{i=1}^{|M|} \phi_i$$

For a sample x, the SHAP value $\phi_i(x)$ of sample x on feature i is calculated as:

$$\phi_i(x) = \sum_{S \subseteq \{x_1, \ldots, x_{|M|}\} \setminus \{x_i\}} \frac{|S|! \,(|M| - |S| - 1)!}{|M|!} \big( f_x(S \cup \{x_i\}) - f_x(S) \big)$$

where $\{x_1, \ldots, x_{|M|}\}$ is the set of all input features of sample x, S is a subset drawn from the feature library M with dimension |S|, and $f_x(S)$ is the prediction based on the feature subset S; the factorial coefficient is the weight of the difference between the predictions with and without feature i under the corresponding feature subset S. Since many feature combinations can be drawn from M to form subsets S, the Shapley value of sample x is an aggregate score over all possible feature subsets S, taking into account the influence of the other features on feature i beyond the sample x itself.
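As a hedged illustration, per-sample SHAP values can be computed with the Python shap package mentioned above; the XGBoost model and the synthetic data below are assumptions, since the embodiment does not prescribe a specific tree model:

```python
import shap
import xgboost
from sklearn.datasets import make_classification

# Synthetic stand-in for the discretized target test samples.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = xgboost.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)    # efficient SHAP values for tree models
shap_values = explainer.shap_values(X)   # shape: (n_samples, n_features)
# shap_values[j, i] is the SHAP value of sample j on feature i.
```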
D2, carrying out weighted average processing on SHAP values of all the target test samples on the discrete features to obtain association degrees of the discrete features and the prediction results of all the target test samples; the association degree characterizes the decision degree of the corresponding discrete features in the model training process.
In a specific example, denote the SHAP value of the j-th sample on feature i as $\phi_i^{(j)}$. The SHAP values of all samples on feature i are weighted-averaged to obtain the association degree $sp_i$ between feature i and the prediction result; with equal weights over the N test samples, this is:

$$sp_i = \frac{1}{N} \sum_{j=1}^{N} \phi_i^{(j)}$$
The association degrees of all |M| features are then sorted in descending order to obtain a feature association ranking table; the higher a feature ranks, the greater the role it plays in the model decision process.
And D3, determining the association degree corresponding to each discrete feature.
And determining the association degree corresponding to each discrete feature according to the manner of determining the association degree corresponding to one discrete feature.
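A sketch of steps D2 and D3, assuming the per-sample SHAP values are already available as a matrix; reading the weighted average as the mean absolute SHAP value is an assumption, as noted above:

```python
import numpy as np

def association_ranking(shap_matrix, feature_names):
    """shap_matrix: (n_samples, n_features) array of per-sample SHAP values.

    Returns (feature, sp_i) pairs sorted by association degree, descending,
    where sp_i is taken as the mean absolute SHAP value of feature i.
    """
    sp = np.abs(shap_matrix).mean(axis=0)   # association degree per feature
    order = np.argsort(sp)[::-1]            # descending: higher = bigger role
    return [(feature_names[k], float(sp[k])) for k in order]

# Hypothetical SHAP values for 4 test samples and 3 features:
shap = np.array([[0.4, -0.1, 0.0],
                 [0.3, 0.2, -0.1],
                 [-0.5, 0.1, 0.0],
                 [0.6, -0.2, 0.1]])
for name, score in association_ranking(shap, ["f1", "f2", "f3"]):
    print(name, round(score, 3))
```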
In addition, after determining the SHAP value of the target test sample on the discrete feature based on the prediction result of the machine learning model on the target test sample, the verification result is presented to business personnel as follows:
For any one discrete feature, the prediction results of the discrete feature for each target test sample are displayed in a scatter-diagram display mode, where the abscissa of the scatter diagram is the SHAP value of the discrete feature and the ordinate is the value of the discrete feature; the scatter diagram characterizes the degree to which each value of the discrete feature influences the prediction result of the discrete feature for each target test sample.
Referring to fig. 6, from the total of |T|+|F| samples, u samples are randomly extracted to draw a two-dimensional scatter diagram between the feature values of the top-R1 ranked features and the prediction results; at the same time, to further highlight the distribution of the samples, a density measure is computed over all sample points in the original diagram, yielding a heat map that reflects the concentration of the samples.
Taking a certain discrete feature, for example feature 36, as an example: each point in the diagram corresponds to one sample; the vertical axis gives the value of feature 36 at that point, and the horizontal axis gives the correlation (characterized by the SHAP value) between that feature value and the prediction result. A positive correlation value means the model predicts the sample as a positive example, a negative value means it predicts a negative example, and the larger the absolute value, the greater the effect of that feature value on the model's prediction for the sample. The color of a point represents its density, with density mapped to color.
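A minimal sketch of such a density-colored scatter diagram; the SHAP and feature values are randomly generated stand-ins for real model output:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Hypothetical data for one discrete feature (e.g. "feature 36"):
shap_vals = rng.normal(0.0, 0.4, 500)                            # abscissa: SHAP value
feat_vals = rng.integers(1, 6, 500) + rng.normal(0, 0.05, 500)   # ordinate: feature value (jittered)

# Density estimate over the 2-D points so colour reflects sample concentration:
xy = np.vstack([shap_vals, feat_vals])
density = gaussian_kde(xy)(xy)

plt.scatter(shap_vals, feat_vals, c=density, cmap="hot", s=8)
plt.axvline(0.0, linewidth=0.8)   # positive SHAP: pushed toward the positive class
plt.xlabel("SHAP value of the feature")
plt.ylabel("feature value")
plt.colorbar(label="point density")
plt.show()
```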
In this way, SHAP is used to calculate the relationship between the value of a sample feature and the result, and the relationship between feature values and the corresponding numbers of samples is introduced, so that the relationship between sample features and prediction results is visualized in heat-map form, further helping business personnel understand the role a feature plays in the model's prediction results.
In conclusion, the training process of the model directly shows the distribution relationship between feature values and results in a visual manner, so that business personnel can understand the model decision process and trust predictions made on the basis of the model, and can conveniently distill business experience from the patterns the model has learned.
As shown in fig. 7, based on the same inventive concept as the above-mentioned service data processing method, an embodiment of the present application further provides a service data processing apparatus, where the apparatus at least includes a data acquisition module 71, a determination module 72, and a model training module 73.
Wherein, the data acquisition module 71 is configured to acquire user data to be processed; the user data comprises user basic information and service association data;
a determining module 72, configured to input user data to be processed into a pre-trained machine learning model, and determine a target group to which the output user belongs;
The model training module 73 is configured to determine the target training sample set applied in the training process of the machine learning model in the following manner:
determining a training sample feature set according to the training sample set; each training sample in the training sample set comprises basic information of a training sample user and business association data of the training sample user;
Discretizing continuous features in the training sample feature set, and forming a target training feature set by the discrete features obtained by discretizing and the discrete features in the training sample feature set;
Determining a distribution difference index of the discrete features aiming at any one discrete feature in the target training feature set; the distribution difference index characterizes the difference degree of the discrete features in the positive example sample and the negative example sample; the positive example sample is a sample of which the basic information meets preset user attributes and/or the service association data meets preset service attributes, and the negative example sample is a sample of which the basic information does not meet the preset user attributes and the service association data does not meet the preset service attributes;
and screening the discrete features based on the distribution difference indexes of the discrete features, and determining a target training sample set based on the obtained reference feature set.
In some exemplary embodiments, model training module 73 is specifically configured to:
Counting first value count vectors of all positive examples on the discrete features and counting second value count vectors of all negative examples on the discrete features;
Determining a first distribution probability vector of the positive example sample on the discrete feature according to the first value count vector and the total number of the positive example samples, and determining a second distribution probability vector of the negative example sample on the discrete feature according to the second value count vector and the total number of the negative example samples;
and determining a distribution difference index of the discrete features according to the first distribution probability vector, the second distribution probability vector and the number of different values of the discrete features.
In some exemplary embodiments, model training module 73 is specifically configured to:
determining a first number of positive examples samples with discrete features as values in all positive examples samples according to each value of the discrete features;
Taking each first quantity as an element of a first value counting vector to form the first value counting vector;
and counting second value count vectors of all negative example samples on the discrete feature, including:
determining a second number of negative example samples with the discrete features as values in all the negative example samples for each value of the discrete features;
and taking each second quantity as an element of the second value count vector to form the second value count vector.
In some exemplary embodiments, model training module 73 is specifically configured to:
For each value, determining a first probability corresponding to the value in the first distribution probability vector according to the position, in the first value count vector, of the first-quantity element corresponding to the value; determining a second probability corresponding to the value in the second distribution probability vector according to the same position; and determining a reference index corresponding to the value according to the absolute value of the difference between the first probability and the second probability and the number of different values of the discrete feature;
and determining the sum of the reference indexes corresponding to the values as a distribution difference index of the discrete features.
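A minimal sketch of the distribution difference index this module computes; the exact way the number of distinct values K enters the per-value reference index is not spelled out above, so the 1/K factor below is an assumption:

```python
import numpy as np

def distribution_difference_index(pos_values, neg_values):
    """Distribution difference index of one discrete feature.

    pos_values / neg_values: 1-D arrays holding the feature's value for
    every positive / negative example sample. The per-value reference
    index is taken here as |p1 - p2| / K over the K distinct values.
    """
    values = np.union1d(pos_values, neg_values)        # distinct feature values
    k = len(values)
    # First / second value count vectors:
    cnt_pos = np.array([(pos_values == v).sum() for v in values])
    cnt_neg = np.array([(neg_values == v).sum() for v in values])
    # First / second distribution probability vectors:
    p1 = cnt_pos / len(pos_values)
    p2 = cnt_neg / len(neg_values)
    return float(np.sum(np.abs(p1 - p2) / k))          # sum of reference indexes

pos = np.array([1, 1, 2, 1, 3, 1])
neg = np.array([2, 2, 3, 2, 1, 3])
print(distribution_difference_index(pos, neg))
```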
In some exemplary embodiments, model training module 73 is specifically configured to:
selecting discrete features whose distribution difference index is larger than a preset index threshold to form the reference feature set; or
selecting a preset number of discrete features in descending order of distribution difference index to form the reference feature set.
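A sketch of the two screening alternatives just described; the threshold and top-N values below are illustrative parameters:

```python
def screen_features(diff_index, threshold=None, top_n=None):
    """diff_index: dict mapping feature name -> distribution difference index.

    Either keep features whose index exceeds a preset threshold, or keep
    the top_n features by index magnitude -- the two alternatives above.
    """
    if threshold is not None:
        return {f for f, d in diff_index.items() if d > threshold}
    ranked = sorted(diff_index, key=diff_index.get, reverse=True)
    return set(ranked[:top_n])

idx = {"age": 0.42, "owner_identification": 0.31, "balance": 0.07}
print(screen_features(idx, threshold=0.2))   # {'age', 'owner_identification'}
print(screen_features(idx, top_n=1))         # {'age'}
```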
In some exemplary embodiments, the apparatus further includes a first display module, configured to display the reference feature set in a preset display manner after the target training sample set is determined based on the obtained reference feature set;
The first display module is specifically configured to: for any one reference feature, if the span of the feature value range of the original feature corresponding to the reference feature is larger than a preset threshold, the preset display mode is a line-chart comparison graph mode; if the span of the feature value range of the original feature corresponding to the reference feature is smaller than or equal to the preset threshold, the preset display mode is a histogram comparison graph mode. The abscissa of both the line-chart comparison graph and the histogram comparison graph is the value corresponding to the reference feature, and the ordinate is the value of each element in the first distribution probability vector and the value of each element in the second distribution probability vector corresponding to the reference feature.
In some exemplary embodiments, the device further comprises a test module, wherein the test module is specifically configured to:
determining, for any one discrete feature of any one target test sample, the SHAP value of the target test sample on the discrete feature based on the prediction result of the machine learning model on the target test sample; the target test sample is obtained by discretizing a test sample; the test sample comprises basic information of the test-sample user and business association data of the test-sample user;
Carrying out weighted average processing on SHAP values of all the target test samples on the discrete features to obtain the association degree of the discrete features and the prediction results of all the target test samples; the association degree characterizes the decision degree of the corresponding discrete features in the model training process;
And determining the association degree corresponding to each discrete feature.
In some exemplary embodiments, the apparatus further comprises a second display module, configured to, after the SHAP value of the target test sample on the discrete feature is determined based on the prediction result of the machine learning model on the target test sample:
aiming at any one discrete feature, displaying the prediction result of the discrete feature of each target test sample according to a scatter diagram display mode;
Wherein the abscissa of the scatter diagram is the SHAP value of the discrete feature and the ordinate is the value of the discrete feature; the scatter diagram characterizes the degree to which each value of the discrete feature influences the prediction result of the discrete feature for each target test sample.
The service data processing device and the service data processing method provided by the embodiments of the application are based on the same application conception, achieve the same beneficial effects, and are not described again here.
Based on the same inventive concept as the service data processing method, an embodiment of the application also provides an electronic device, which can be a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), a server, or the like. As shown in fig. 8, the electronic device may include a processor 801 and a memory 802.
The processor 801 may be a general-purpose processor, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory 802, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory may include at least one type of storage medium, for example flash memory, hard disk, multimedia card, card memory, Random Access Memory (RAM), Static Random Access Memory (SRAM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), magnetic memory, magnetic disk, optical disk, and the like. The memory may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 802 of the embodiments of the application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions; the foregoing program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. Such computer storage media can be any available media or data storage device that can be accessed by a computer, including but not limited to: removable storage devices, Random Access Memory (RAM), magnetic memory (e.g., floppy disk, hard disk, magnetic tape, magneto-optical disk (MO), etc.), optical memory (e.g., CD, DVD, BD, HVD, etc.), and semiconductor memory (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND Flash), Solid State Disk (SSD)), etc.
Alternatively, the above-described integrated units of the application, if implemented in the form of software functional modules and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods of the embodiments of the application. The aforementioned storage medium includes: removable storage devices, Random Access Memory (RAM), magnetic memory (e.g., floppy disk, hard disk, magnetic tape, magneto-optical disk (MO), etc.), optical memory (e.g., CD, DVD, BD, HVD, etc.), and semiconductor memory (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND Flash), Solid State Disk (SSD)), etc.
The foregoing embodiments are only used to describe the technical solution of the application in detail; the descriptions of the foregoing embodiments are only intended to help understand the methods of the embodiments of the application and should not be construed as limiting those embodiments. Variations or alternatives readily apparent to those skilled in the art are intended to fall within the scope of the embodiments of the application.

Claims (12)

1. A method for processing service data, comprising:
Acquiring user data to be processed; the user data comprises user basic information and service association data;
inputting the user data to be processed into a pre-trained machine learning model, and determining an output target group to which the user belongs;
wherein the target training sample set applied in the training process of the machine learning model is determined by the following steps:
Determining a training sample feature set according to the training sample set; each training sample in the training sample set comprises basic information of a training sample user and business association data of the training sample user;
Discretizing continuous features in the training sample feature set, and forming a target training feature set by the discrete features obtained by discretizing and the discrete features in the training sample feature set;
Determining a distribution difference index of any one discrete feature in the target training feature set; wherein the distribution difference index characterizes the degree of difference of the discrete features in a positive example sample and a negative example sample; the positive example samples are samples with basic information meeting preset user attributes and/or the service related data meeting preset service attributes, and the negative example samples are samples with basic information not meeting preset user attributes and the service related data not meeting preset service attributes;
and screening the discrete features based on the distribution difference indexes of the discrete features, and determining a target training sample set based on the obtained reference feature set.
2. The method of claim 1, wherein said determining a distribution difference index of the discrete features comprises:
Counting first value count vectors of all positive examples on the discrete features and counting second value count vectors of all negative examples on the discrete features;
determining a first distribution probability vector of the positive example sample on the discrete feature according to the first value count vector and the total number of the positive example samples, and determining a second distribution probability vector of the negative example sample on the discrete feature according to the second value count vector and the total number of the negative example samples;
and determining a distribution difference index of the discrete features according to the first distribution probability vector, the second distribution probability vector and the number of different values of the discrete features.
3. The method of claim 2, wherein said counting the first valued count vector for all positive examples on the discrete feature comprises:
determining a first number of positive examples of all the positive examples of the discrete features as the values according to each value of the discrete features;
Taking each first quantity as an element of a first value count vector to form the first value count vector;
and counting a second valued count vector of all negative examples on the discrete feature, wherein the second valued count vector comprises the following steps:
Determining, for each value of the discrete feature, a second number of negative examples for which the discrete feature is the value in all of the negative examples;
and taking each second quantity as an element of a second value count vector to form the second value count vector.
4. The method of claim 2, wherein the determining the distribution difference index of the discrete feature based on the first distribution probability vector, the second distribution probability vector, and the number of different values of the discrete feature comprises:
for each value, determining a first probability corresponding to the value in the first distribution probability vector according to the position, in the first value count vector, of the first-quantity element corresponding to the value; determining a second probability corresponding to the value in the second distribution probability vector according to the same position; and determining a reference index corresponding to the value according to the absolute value of the difference between the first probability and the second probability and the number of different values of the discrete feature;
and determining the sum of the reference indexes corresponding to the values as the distribution difference index of the discrete features.
5. The method of claim 1, wherein said screening each of said discrete features based on their distribution difference index to obtain a set of reference features comprises:
selecting discrete features whose distribution difference index is larger than a preset index threshold to form a reference feature set; or
And selecting a preset number of discrete features according to the size of the distribution difference index to form a reference feature set.
6. The method of claim 1, wherein after determining the set of target training samples based on the obtained set of reference features, further comprising:
Displaying the reference feature set according to a preset display mode;
For any reference feature, if the span of the feature value range of the original feature corresponding to the reference feature is larger than a preset threshold, the preset display mode is a line-chart comparison graph mode; if the span of the feature value range of the original feature corresponding to the reference feature is smaller than or equal to the preset threshold, the preset display mode is a histogram comparison graph mode; and the abscissa of both the line-chart comparison graph and the histogram comparison graph is the value corresponding to the reference feature, and the ordinate is the value of each element in the first distribution probability vector and the value of each element in the second distribution probability vector corresponding to the reference feature.
7. The method according to any one of claims 1 to 6, further comprising:
determining, for any one discrete feature of any one target test sample, a SHAP value of the target test sample on the discrete feature based on a prediction result of the machine learning model on the target test sample; the target test sample is obtained by discretizing the test sample; the test sample comprises basic information of a test sample user and business association data of the test sample user;
Carrying out weighted average processing on SHAP values of all target test samples on the discrete features to obtain association degrees of the discrete features and the prediction results of the target test samples; the association degree characterizes the decision degree of the corresponding discrete feature in the model training process;
And determining the association degree corresponding to each discrete feature.
8. The method of claim 7, wherein after the determining the SHAP value of the target test sample on the discrete feature based on the prediction result of the machine learning model on the target test sample, the method further comprises:
for any one of the discrete features, displaying the prediction results of the discrete features of each target test sample in a scatter diagram display mode;
Wherein, the abscissa of the scatter diagram is the SHAP value of the discrete feature, and the ordinate of the scatter diagram is the value of the discrete feature; wherein the scatter plot characterizes the extent to which each value of the discrete feature affects the predicted outcome of the discrete feature for each target test sample.
9. A traffic data processing apparatus, comprising:
The data acquisition module is used for acquiring user data to be processed; the user data comprises user basic information and service association data;
The determining module is used for inputting the user data to be processed into a pre-trained machine learning model and determining an output target group to which the user belongs;
the training system further comprises a model training module, wherein the model training module is used for determining a target training sample set applied to a training process of the machine learning model by the following method:
Determining a training sample feature set according to the training sample set; each training sample in the training sample set comprises basic information of a training sample user and business association data of the training sample user;
Discretizing continuous features in the training sample feature set, and forming a target training feature set by the discrete features obtained by discretizing and the discrete features in the training sample feature set;
Determining a distribution difference index of any one discrete feature in the target training feature set; wherein the distribution difference index characterizes the degree of difference of the discrete features in a positive example sample and a negative example sample; the positive example samples are samples with basic information meeting preset user attributes and/or the service related data meeting preset service attributes, and the negative example samples are samples with basic information not meeting preset user attributes and the service related data not meeting preset service attributes;
and screening the discrete features based on the distribution difference indexes of the discrete features, and determining a target training sample set based on the obtained reference feature set.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
11. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the method of any of claims 1 to 8.
12. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 8.
CN202210422527.8A 2022-04-21 2022-04-21 Service data processing method and device, electronic equipment and storage medium Active CN115034400B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210422527.8A CN115034400B (en) 2022-04-21 2022-04-21 Service data processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115034400A CN115034400A (en) 2022-09-09
CN115034400B true CN115034400B (en) 2024-05-14

Family

ID=83119709

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884092A (en) * 2021-04-28 2021-06-01 深圳索信达数据技术有限公司 AI model generation method, electronic device, and storage medium
CN113326948A (en) * 2021-06-01 2021-08-31 深圳前海微众银行股份有限公司 Data processing method, device, equipment and storage medium of federal learning model
CN113326900A (en) * 2021-06-30 2021-08-31 深圳前海微众银行股份有限公司 Data processing method and device of federal learning model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant