CN115034400A - Business data processing method and device, electronic equipment and storage medium - Google Patents

Business data processing method and device, electronic equipment and storage medium

Info

Publication number
CN115034400A
CN115034400A (application CN202210422527.8A)
Authority
CN
China
Prior art keywords
discrete
feature
value
determining
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210422527.8A
Other languages
Chinese (zh)
Other versions
CN115034400B (en)
Inventor
杨宇雪
李虹锋
曹清鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCB Finetech Co Ltd
Original Assignee
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCB Finetech Co Ltd filed Critical CCB Finetech Co Ltd
Priority to CN202210422527.8A priority Critical patent/CN115034400B/en
Publication of CN115034400A publication Critical patent/CN115034400A/en
Application granted granted Critical
Publication of CN115034400B publication Critical patent/CN115034400B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Algebra (AREA)
  • Operations Research (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of data processing, and in particular discloses a business data processing method, a business data processing apparatus, an electronic device, and a storage medium. The method comprises: inputting user data to be processed into a pre-trained machine learning model, and determining from the model output the target group to which the user belongs. The target training samples used in the training process of the machine learning model are determined by: determining a training sample feature set according to a training sample set; discretizing the continuous features in the training sample feature set, and forming a target training feature set from the discretized features together with the discrete features already in the training sample feature set; determining, for any discrete feature in the target training feature set, a distribution difference index of that feature; and screening the discrete features based on their distribution difference indexes, and determining a target training sample set based on the resulting reference feature set. This improves the accuracy of dividing users into target groups.

Description

Business data processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for processing service data, an electronic device, and a storage medium.
Background
When facing massive user data, business personnel need to quickly and accurately determine the characteristics and/or classification of each user so as to make a targeted business service strategy for each user.
In the related art, a machine learning model is applied, for example, to determine the characteristics and/or classification of individual users. However, user data is voluminous, the data of a single user spans different dimensions, and the data of each dimension can serve as a sample feature. For distinguishing the characteristics and/or classification of each user, different sample features carry different levels of importance to business personnel. If all sample features are used directly to train the model, the trained model is inaccurate, and the characteristics and/or classification it distinguishes for users are inaccurate.
Disclosure of Invention
The embodiment of the application provides a business data processing method and device, electronic equipment and a storage medium, which are used for improving the accuracy of dividing a target group to which a user belongs according to user data.
In a first aspect, an embodiment of the present application provides a method for processing service data, including:
acquiring user data to be processed; the user data comprises user basic information and service associated data;
inputting the user data to be processed into a pre-trained machine learning model, and determining a target group to which the output user belongs;
wherein a set of target training samples to which a training process of the machine learning model is applied is determined by:
determining a training sample characteristic set according to the training sample set; wherein each training sample in the training sample set comprises basic information of a training sample user and service associated data of the training sample user;
discretizing the continuous features in the training sample feature set, and forming a target training feature set by the discretized discrete features and the discrete features in the training sample feature set;
determining a distribution difference index of any discrete feature in the target training feature set; wherein the distribution difference index characterizes a degree of difference of the discrete features in positive and negative examples; the positive sample is a sample of which the basic information meets the preset user attribute and/or the service associated data meets the preset service attribute, and the negative sample is a sample of which the basic information does not meet the preset user attribute and the service associated data does not meet the preset service attribute;
and screening each discrete feature based on the distribution difference index of each discrete feature, and determining a target training sample set based on the obtained reference feature set.
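The discretization step above can be sketched in plain Python. This is a minimal illustration rather than the patent's implementation: the equal-width binning scheme and the bin count are assumptions, since the claims do not fix a particular discretization.

```python
def discretize(values, n_bins=5):
    """Equal-width binning of a continuous feature into integer codes 0..n_bins-1.

    The patent does not specify a binning scheme; equal-width binning is
    one plausible way of turning a continuous feature into a discrete one.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# A hypothetical continuous "age" feature becomes a discrete feature.
ages = [23, 35, 47, 62, 29]
age_codes = discretize(ages)  # → [0, 1, 3, 4, 0]
```

The resulting integer codes would then join the already-discrete features to form the target training feature set.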
In some exemplary embodiments, the determining the distribution difference index of the discrete features includes:
counting first value counting vectors of all positive samples on the discrete features, and counting second value counting vectors of all negative samples on the discrete features;
determining a first distribution probability vector of the positive sample on the discrete feature according to the first value counting vector and the total number of the positive samples, and determining a second distribution probability vector of the negative sample on the discrete feature according to the second value counting vector and the total number of the negative samples;
and determining the distribution difference index of the discrete features according to the first distribution probability vector, the second distribution probability vector and the number of different values of the discrete features.
In some exemplary embodiments, the counting a first value count vector of all positive example samples on the discrete feature includes:
determining a first number of positive examples of which the discrete features are the values in all the positive examples aiming at each value of the discrete features;
taking each first quantity as an element of a first value counting vector to form the first value counting vector;
the counting of the second value counting vector of all negative example samples on the discrete features comprises the following steps:
determining a second number of negative examples of the values of the discrete features in all the negative examples for each value of the discrete features;
and taking each second quantity as an element of a second value counting vector to form the second value counting vector.
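The counting steps can be illustrated with a short Python sketch (the three-valued feature and the sample data are hypothetical, not from the patent):

```python
from collections import Counter

def value_count_vector(feature_values, value_order):
    """First/second value count vector: for each possible value of the
    discrete feature, the number of samples taking that value."""
    counts = Counter(feature_values)
    return [counts.get(v, 0) for v in value_order]

values = [0, 1, 2]            # the distinct values of the discrete feature
pos = [0, 0, 1, 2, 1, 0]      # the feature over all positive samples
neg = [2, 2, 1, 2]            # the feature over all negative samples
first_vec = value_count_vector(pos, values)    # → [3, 2, 1]
second_vec = value_count_vector(neg, values)   # → [0, 1, 3]
```

Dividing each count vector by the corresponding sample total then yields the first and second distribution probability vectors of the embodiment above.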
In some exemplary embodiments, the determining the distribution difference index of the discrete feature according to the first distribution probability vector, the second distribution probability vector, and the number of different values of the discrete feature includes:
for each value, determining a first probability corresponding to the value in the first distribution probability vector according to the position, in the first value count vector, of the element holding the first number for that value; determining a second probability corresponding to the value in the second distribution probability vector according to the position, in the second value count vector, of the element holding the second number for that value; and determining a reference index corresponding to the value according to the absolute value of the difference between the first probability and the second probability and the number of different values of the discrete feature;
and determining the sum of the reference indexes corresponding to the values as the distribution difference index of the discrete characteristics.
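Under one plausible reading of the embodiment above, the per-value reference index is |p_pos − p_neg| / K, where K is the number of distinct values, and the distribution difference index is the sum over all values. The claims state only that the reference index depends on the absolute probability difference and on K, so the exact combination below is an assumption:

```python
def distribution_difference_index(first_counts, second_counts):
    """Distribution difference index of one discrete feature.

    first_counts / second_counts: value count vectors over the positive /
    negative samples, aligned on the same value order. Dividing each
    absolute probability difference by K (the number of distinct values)
    is an assumed normalization, not fixed by the claims.
    """
    n_pos, n_neg = sum(first_counts), sum(second_counts)
    k = len(first_counts)
    return sum(abs(c1 / n_pos - c2 / n_neg) / k
               for c1, c2 in zip(first_counts, second_counts))

# Identical distributions yield 0; diverging distributions yield larger values.
score = distribution_difference_index([3, 2, 1], [0, 1, 3])
```

A feature whose positive and negative distributions coincide scores 0 and would be screened out; a feature with clearly separated distributions scores high and survives into the reference feature set.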
In some exemplary embodiments, the screening each discrete feature based on the distribution difference index of each discrete feature to obtain a reference feature set includes:
selecting discrete features with distribution difference indexes larger than a preset index threshold value to form a reference feature set; or
And selecting a preset number of discrete features according to the size of the distribution difference index to form a reference feature set.
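Both screening alternatives can be sketched as follows (the feature names and index values are hypothetical):

```python
def screen_features(diff_indices, threshold=None, top_k=None):
    """Screen discrete features into a reference feature set.

    Implements either alternative from the embodiment above: keep features
    whose distribution difference index exceeds a preset threshold, or keep
    the top_k features ranked by index.
    """
    if threshold is not None:
        return {f for f, d in diff_indices.items() if d > threshold}
    ranked = sorted(diff_indices, key=diff_indices.get, reverse=True)
    return set(ranked[:top_k])

indices = {"age": 0.39, "owner_flag": 0.12, "region": 0.27}
by_threshold = screen_features(indices, threshold=0.2)  # → {'age', 'region'}
by_top_k = screen_features(indices, top_k=1)            # → {'age'}
```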
In some exemplary embodiments, after determining the target training sample set based on the obtained reference feature set, the method further includes:
displaying the reference feature set according to a preset display mode;
for any reference feature, if the range span of the feature values of the original feature corresponding to the reference feature is larger than a preset threshold, the preset display mode is a line comparison graph; if the range span of the feature values of the original feature corresponding to the reference feature is less than or equal to the preset threshold, the preset display mode is a histogram comparison graph; the abscissa of both the line comparison graph and the histogram comparison graph is the values of the reference feature, and the ordinate is the value of each element in the first distribution probability vector and the value of each element in the second distribution probability vector corresponding to the reference feature.
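The display-mode decision reduces to a single comparison; a minimal sketch, where the threshold value is a placeholder since the patent leaves it preset but unspecified:

```python
def choose_display_mode(feature_values, threshold=10):
    """Pick the comparison-chart type for a reference feature.

    Line comparison graph when the original feature's value range span
    exceeds the preset threshold, histogram comparison graph otherwise.
    The threshold of 10 is an assumed placeholder.
    """
    span = max(feature_values) - min(feature_values)
    return "line" if span > threshold else "histogram"

mode_age = choose_display_mode([23, 35, 47, 62, 29])   # span 39 → "line"
mode_flag = choose_display_mode([0, 1, 1, 0])          # span 1  → "histogram"
```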
In some exemplary embodiments, the method further comprises:
for any discrete feature of any target test sample, determining a SHAP value of the target test sample on the discrete feature based on the prediction result of the machine learning model on the target test sample; the target test sample is obtained by performing the discretization processing operation on a test sample; the test sample comprises basic information of a test sample user and service associated data of the test sample user;
carrying out weighted average processing on SHAP values of all target test samples on the discrete features to obtain the association degree of the discrete features and the prediction result of each target test sample; the relevance represents the decision degree of the corresponding discrete features in the model training process;
and determining the corresponding relevance of each discrete feature.
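The SHAP aggregation step can be sketched as below. The patent specifies a weighted average but not the weights; uniform weights over absolute SHAP values, a common choice for global feature importance, are assumed here:

```python
def feature_relevance(shap_values, weights=None):
    """Aggregate per-sample SHAP values of one discrete feature into a
    relevance score for that feature.

    Uniform weighting of absolute SHAP values is an assumption; the patent
    only states that a weighted average is used.
    """
    if weights is None:
        weights = [1.0 / len(shap_values)] * len(shap_values)
    return sum(w * abs(s) for w, s in zip(weights, shap_values))

# Hypothetical SHAP values of one feature across four target test samples.
relevance = feature_relevance([0.8, -0.5, 0.3, -0.2])  # ≈ 0.45
```

Repeating this per feature yields the relevance of each discrete feature, i.e. its degree of influence on the model's decisions.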
In some exemplary embodiments, after determining the SHAP value of the target test sample on the discrete feature based on the prediction result of the machine learning model on the target test sample, the method further comprises:
aiming at any one discrete feature, displaying the prediction result of the discrete feature of each target test sample according to a scatter diagram display mode;
the abscissa of the scatter diagram is the SHAP value of the discrete feature, and the ordinate of the scatter diagram is the value of the discrete feature; and the scatter diagram represents the influence degree of each value of the discrete characteristic on the prediction result of the discrete characteristic of each target test sample.
In a second aspect, an embodiment of the present application provides a service data processing apparatus, including:
the data acquisition module is used for acquiring user data to be processed; the user data comprises user basic information and service associated data;
the determining module is used for inputting the user data to be processed into a pre-trained machine learning model and determining a target group to which the output user belongs;
wherein the apparatus further comprises a model training module, configured to determine the target training sample set used by the training process of the machine learning model by:
determining a training sample characteristic set according to the training sample set; wherein each training sample in the training sample set comprises basic information of a training sample user and service associated data of the training sample user;
discretizing the continuous features in the training sample feature set, and forming a target training feature set by the discretized features and the discrete features in the training sample feature set;
determining a distribution difference index of any discrete feature in the target training feature set; wherein the distribution difference index characterizes a degree of difference of the discrete features in positive and negative examples; the positive sample is a sample of which the basic information meets the preset user attribute and/or the service associated data meets the preset service attribute, and the negative sample is a sample of which the basic information does not meet the preset user attribute and the service associated data does not meet the preset service attribute;
and screening each discrete feature based on the distribution difference index of each discrete feature, and determining a target training sample set based on the obtained reference feature set.
In some exemplary embodiments, the model training module is specifically configured to:
counting first value counting vectors of all positive samples on the discrete features, and counting second value counting vectors of all negative samples on the discrete features;
determining a first distribution probability vector of the positive sample on the discrete feature according to the first value counting vector and the total number of the positive samples, and determining a second distribution probability vector of the negative sample on the discrete feature according to the second value counting vector and the total number of the negative samples;
and determining the distribution difference index of the discrete features according to the first distribution probability vector, the second distribution probability vector and the number of different values of the discrete features.
In some exemplary embodiments, the model training module is specifically configured to:
determining a first number of positive examples of which the discrete features are the values in all the positive examples aiming at each value of the discrete features;
taking each first quantity as an element of a first value counting vector to form the first value counting vector;
the counting of the second value counting vector of all negative example samples on the discrete features comprises the following steps:
determining a second number of negative examples of the values of the discrete features in all the negative examples for each value of the discrete features;
and taking each second quantity as an element of a second value counting vector to form the second value counting vector.
In some exemplary embodiments, the model training module is specifically configured to:
for each value, determining a first probability corresponding to the value in the first distribution probability vector according to the position, in the first value count vector, of the element holding the first number for that value; determining a second probability corresponding to the value in the second distribution probability vector according to the position, in the second value count vector, of the element holding the second number for that value; and determining a reference index corresponding to the value according to the absolute value of the difference between the first probability and the second probability and the number of different values of the discrete feature;
and determining the sum of the reference indexes corresponding to the values as the distribution difference index of the discrete characteristics.
In some exemplary embodiments, the model training module is specifically configured to:
selecting discrete features with distribution difference indexes larger than a preset index threshold value to form a reference feature set; or
And selecting a preset number of discrete features according to the size of the distribution difference index to form a reference feature set.
In some exemplary embodiments, the apparatus further includes a first display module, configured to, after the target training sample set is determined based on the obtained reference feature set, display the reference feature set in a preset display mode;
wherein the first display module is specifically configured to: for any reference feature, if the range span of the feature values of the original feature corresponding to the reference feature is larger than a preset threshold, the preset display mode is a line comparison graph; if the range span of the feature values of the original feature corresponding to the reference feature is less than or equal to the preset threshold, the preset display mode is a histogram comparison graph; the abscissa of both the line comparison graph and the histogram comparison graph is the values of the reference feature, and the ordinate is the value of each element in the first distribution probability vector and the value of each element in the second distribution probability vector corresponding to the reference feature.
In some exemplary embodiments, the apparatus further comprises a testing module, the testing module being specifically configured to:
for any discrete feature of any target test sample, determine a SHAP value of the target test sample on the discrete feature based on the prediction result of the machine learning model on the target test sample; the target test sample is obtained by performing the discretization processing operation on a test sample; the test sample comprises basic information of a test sample user and service associated data of the test sample user;
carrying out weighted average processing on SHAP values of all target test samples on the discrete features to obtain the association degree of the discrete features and the prediction results of all the target test samples; the relevance represents the decision degree of the corresponding discrete features in the model training process;
and determining the corresponding relevance of each discrete feature.
In some exemplary embodiments, after determining the SHAP value of the target test sample on the discrete feature based on the prediction result of the machine learning model on the target test sample, the testing module is further configured to:
aiming at any one discrete feature, displaying the prediction result of the discrete feature of each target test sample according to a scatter diagram display mode;
the abscissa of the scatter diagram is the SHAP value of the discrete feature, and the ordinate of the scatter diagram is the value of the discrete feature; and the scatter diagram represents the influence degree of each value of the discrete characteristic on the prediction result of the discrete characteristic of each target test sample.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of any one of the methods when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, implement the steps of any of the methods described above.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising a computer program that, when executed by a processor, performs the steps of any of the methods as provided in the first aspect of the present application.
The embodiment of the application has the following beneficial effects:
By inputting the user data to be processed, including user basic information and business associated data, into a pre-trained machine learning model, the target group to which the user belongs can be determined. During training of the machine learning model, a series of processing is performed on the training sample set to obtain a target training sample set, which is then used for training. In this processing, the continuous features in the training sample feature set are first discretized, and the discretized features together with the discrete features already in the training sample feature set form the target training feature set; second, for any discrete feature in the target training feature set, a distribution difference index is determined that characterizes the degree of difference of that feature between the positive and negative samples. The discrete features are then screened based on their distribution difference indexes, retaining features whose value distributions differ markedly between positive and negative samples (indicating that the feature can effectively distinguish positive from negative samples). The target training sample set determined from the resulting reference feature set is used to train the machine learning model, so that when the trained model predicts on user data, the accuracy of dividing users into target groups according to the user data is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the embodiments of the present application will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a service data processing method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a service data processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a process of training a sample set according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a comparison of the distribution of positive and negative examples on an "age" characteristic according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a comparison between distributions of a positive example and a negative example on an "owner identification" feature according to an embodiment of the present application;
fig. 6 is a schematic diagram of a correlation analysis of a prediction result of a value of a discrete feature 36 according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a service data processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
For convenience of understanding, terms referred to in the embodiments of the present application are explained below:
(1) Machine learning: a technique whereby a computer trains on existing data to learn a model, and uses that model to predict results.
(2) SHAP: short for SHapley Additive exPlanations, a model-interpretation package developed in Python that can interpret the output of any machine learning model.
Any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used for distinction only and not by way of limitation.
In practice, when facing a large amount of user data, the characteristics and/or classification of each user need to be determined quickly and accurately so as to formulate a targeted business service strategy for each user. User data is voluminous, the data of a single user spans different dimensions, and the data of each dimension can serve as a sample feature. For distinguishing the characteristics and/or classification of each user, different sample features carry different levels of importance to business personnel. If all sample features are used directly to train the model, the trained model is inaccurate, and the characteristics and/or classification it distinguishes for users are inaccurate.
Therefore, the application provides a business data processing method, in the method, user data to be processed is obtained; the user data comprises user basic information and service associated data; and inputting the user data to be processed into a pre-trained machine learning model, and determining the target group to which the output user belongs. In the process, the training sample feature set determined by the training sample set is processed, discrete features with large difference degree in the positive sample and the negative sample are screened out to be used as reference features, and then the target training sample set is determined according to the obtained reference features. The machine learning model obtained by training the target training sample set obtained in the mode is used for determining the target group to which the user belongs, so that the accuracy of dividing the target group to which the user belongs according to the user data is improved.
After introducing the design concept of the embodiment of the present application, some simple descriptions are provided below for application scenarios to which the technical solution of the embodiment of the present application can be applied, and it should be noted that the application scenarios described below are only used for describing the embodiment of the present application and are not limited. In specific implementation, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
Fig. 1 is a schematic view of an application scenario of a business data processing method according to an embodiment of the present application. To enable business personnel to formulate a service strategy for each user more accurately, the business data processing method of this embodiment (where 11 denotes the business data processing device) is applied to classify 200,000 pieces of user data, determining which users are suited to a risk-investment class, a steady-investment class, or a small-amount flexible-deposit class. In this way, business personnel can understand the characteristics of each user from the classification result.
Of course, the method provided in the embodiment of the present application is not limited to be used in the application scenario shown in fig. 1, and may also be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described in the following method embodiments, and will not be described in detail herein.
To further illustrate the technical solutions provided by the embodiments of the present application, the following detailed description is made with reference to the accompanying drawings and the detailed description. Although the embodiments of the present application provide method operation steps as shown in the following embodiments or figures, more or fewer operation steps may be included in the method based on conventional or non-inventive labor. In steps where no necessary causal relationship exists logically, the order of execution of the steps is not limited to that provided by the embodiments of the present application.
The following describes the technical solution provided in the embodiment of the present application with reference to the application scenario shown in fig. 1.
Referring to fig. 2, an embodiment of the present application provides a service data processing method, including the following steps:
s201, acquiring user data to be processed; the user data comprises user basic information and service associated data.
S202, inputting the user data to be processed into a machine learning model trained in advance, and determining the target group to which the output user belongs.
Wherein a set of target training samples to which the training process of the machine learning model applies is determined by:
determining a training sample characteristic set according to the training sample set; each training sample in the training sample set comprises basic information of a training sample user and service associated data of the training sample user;
discretizing the continuous features in the training sample feature set, and forming a target training feature set by the discretized features and the discrete features in the training sample feature set;
determining a distribution difference index of discrete features aiming at any one discrete feature in a target training feature set; the distribution difference index represents the difference degree of the discrete features in the positive sample and the negative sample; the positive sample is a sample that the basic information meets the preset user attribute and/or the service associated data meets the preset service attribute, and the negative sample is a sample that the basic information does not meet the preset user attribute and the service associated data does not meet the preset service attribute;
and screening each discrete feature based on the distribution difference index of each discrete feature, and determining a target training sample set based on the obtained reference feature set.
By inputting the user data to be processed, including user basic information and business associated data, into a pre-trained machine learning model, the target group to which the user belongs can be determined. During training of the machine learning model, the training sample set undergoes a series of processing to obtain the target training sample set, which is then used for training. In this processing: first, the continuous features in the training sample feature set are discretized, and the discretized features together with the original discrete features in the training sample feature set form the target training feature set; second, for any discrete feature in the target training feature set, a distribution difference index characterizing the degree of difference of that discrete feature between the positive and negative samples is determined. Each discrete feature is then screened based on its distribution difference index, retaining features whose value distributions differ significantly between the positive and negative samples (indicating that the feature can effectively distinguish the two). The target training sample set is determined based on the resulting reference feature set and is used to train the machine learning model, which improves the accuracy with which the trained model assigns users to target groups when predicting on user data.
Referring to S201, user data to be processed is obtained, where the user data includes user basic information and business associated data. The user basic information includes the user's age, gender, occupation, home-ownership status, deposits, annual income, and the like, and the business associated data includes historical investment data, an investment risk assessment report, the maximum acceptable investment amount, and the like.
Referring to S202, after the user data to be processed is obtained, in order to determine the target group to which the user belongs, the user data to be processed is input into a pre-trained machine learning model (e.g., a GBM or a random forest). After the group to which the target user belongs is determined, for example, a group whose intended investment amount is 100,000, a service person can recommend appropriate products to the user.
Illustratively, in conjunction with fig. 3, the set of target training samples to which the training process of the machine learning model applies is determined by:
s301, determining a training sample characteristic set according to a training sample set; each training sample in the training sample set comprises basic information of a training sample user and service associated data of the training sample user.
S302, discretizing the continuous features in the training sample feature set, and forming a target training feature set by the discretized discrete features and the discrete features in the training sample feature set.
S303, determining a distribution difference index of discrete features aiming at any one discrete feature in the target training feature set; the distribution difference index represents the difference degree of the discrete features in the positive sample and the negative sample; the positive sample is a sample of which the basic information meets the preset user attribute and/or the service associated data meets the preset service attribute, and the negative sample is a sample of which the basic information does not meet the preset user attribute and the service associated data does not meet the preset service attribute.
S304, screening each discrete feature based on the distribution difference index of each discrete feature, and determining a target training sample set based on the obtained reference feature set.
By screening the discrete features in this way, every feature of the samples participating in model training shows a large degree of difference between the positive and negative samples; that is, effective features are obtained, the specific indexes of the model are optimized, and the performance of the machine learning algorithm in predicting user data is improved.
After the characteristics/categories of users, or the groups to which they belong, are determined from the user data, the purpose is to guide business personnel to accurately understand those characteristics so as to provide service strategies in a targeted manner; the process is therefore mainly oriented to business personnel. If business personnel were to select a certain number of features as model inputs based on existing business experience and subjective judgment, that screening mode would be strongly subjective, would vary with the individual cognition of each person, and would be difficult to reuse.
In addition, compared with the prior art, in which the variance of each feature and the correlation coefficient between each feature and a target value are calculated and the features whose results exceed a threshold are then selected as modeling features, the modeling personnel (business personnel) here do not need to understand the principles behind the related indexes or rely on the analysis experience of algorithm engineers; the screening process is more intuitive, which reduces the operational difficulty for business personnel.
In step S301, a plurality of training samples are obtained to form a training sample set, where data of each training sample user is a training sample, and each training sample includes basic information and service-related data of the training sample user. And extracting the characteristics of each sample user to obtain a training sample characteristic set. In a specific example, the features in the training sample feature set mainly include two types, namely, a continuous feature and a discrete feature, wherein the continuous feature is the age of the user, the amount of investment intention, and the like, and the discrete feature can be the gender of the user, and the like.
Referring to S302, in order to reduce errors caused by differences between the continuous features and the discrete features, the continuous features are discretized here, and the discretized features obtained after processing are combined with the original discrete features to obtain the target training feature set; that is, every feature in the target training feature set is a discrete feature.
In a specific example, a description will be given of a procedure of discretization processing:
Bin all continuous features using the equidistant binning method. Suppose there are m continuous features in total, and take the i-th continuous feature as an example: let $m_i$ denote a sample's value on feature i, $\max(m_i)$ the maximum value of the i-th continuous feature over all samples, and $\min(m_i)$ the minimum value of the i-th continuous feature over all samples. The continuous feature is mapped uniformly onto k intervals (k being the number of bins), and the bin width of each bin interval of feature i is:

$$w_i = \frac{\max(m_i) - \min(m_i)}{k}$$

The cut points of the bin intervals are then obtained as the vector:

$$C_i = \big(\min(m_i),\; \min(m_i) + w_i,\; \min(m_i) + 2w_i,\; \dots,\; \min(m_i) + k \cdot w_i\big)$$

Based on the calculated cut points, if

$$\min(m_i) + j \cdot w_i \le m_i < \min(m_i) + (j + 1) \cdot w_i$$

then the original value $m_i$ is replaced by its bin index j, i.e., set to one of the values $[0, 1, 2, \dots, k]$.
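The equal-width binning procedure above can be sketched in a few lines; the clamping of the maximum value into the last bin is an assumption made for illustration:

```python
# Equal-width (equidistant) binning sketch for one continuous feature:
# bin width w = (max - min) / k, then each raw value maps to a bin index.
def equal_width_bin(values, k):
    lo, hi = min(values), max(values)
    w = (hi - lo) / k                          # bin width for this feature
    cuts = [lo + j * w for j in range(k + 1)]  # k + 1 cut points
    # Map each raw value to a bin index in [0, k-1]; the maximum value is
    # clamped into the last bin (an illustration-only boundary choice).
    idx = [min(int((v - lo) // w), k - 1) for v in values]
    return idx, cuts

ages = [20, 25, 33, 41, 48, 60]  # a continuous feature such as age
bins, cuts = equal_width_bin(ages, k=4)
print(bins)  # each age replaced by its discrete bin index
print(cuts)  # the cut-point vector of the bin intervals
```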
Referring to S303, after discretization, all features in the target training feature set are discrete features. Then, for each discrete feature, a distribution difference index of the discrete feature is determined, where the distribution difference index represents the degree of difference of the discrete feature between the positive and negative examples; a larger degree of difference indicates that the discrete feature plays a larger role in the model training process. In practical application, the samples in the training sample set may be divided into positive examples and negative examples according to whether the basic information satisfies the preset user attribute and/or whether the business associated data satisfies the preset business attribute. The preset user attribute is, for example, a group of people aged 25-45, and the preset business attribute is, for example, an intended investment amount of more than 1,000,000. Samples that meet these conditions serve as positive examples, and samples that do not serve as negative examples.
In a specific example, for any one discrete feature, the distribution difference index of the discrete feature is determined by:
A. and counting first value counting vectors of all positive samples on the discrete features, and counting second value counting vectors of all negative samples on the discrete features.
In the step a, the mode of counting the first value counting vector is realized by the following mode:
and A1, determining a first number of positive examples with the discrete features as values in all the positive examples for each value of the discrete features.
Taking age as the discrete feature: suppose the age values across all training samples are the 41 distinct values from 20 to 60. For each value, say 20, determine the number of positive example samples among all positive example samples whose value on the discrete feature is 20, recorded as a first number. Thus, for the 41 values, 41 first numbers are obtained.
And A2, taking each first number as an element of the first value counting vector to form the first value counting vector.
The 41 first numbers are used as elements of the first value counting vector, and the order of each element may be preset, for example, the order of the values from small to large is used as the order of each element, so that the first value counting vector may be formed.
In the step a, the mode of counting the second value counting vector is realized by the following mode:
and A3, determining a second number of negative examples of which the discrete features are values in all the negative examples for each value of the discrete features.
Still taking age as an example, determine the number of negative example samples among all negative example samples whose value on the discrete feature is 20, recorded as a second number. Thus, for the 41 values, 41 second numbers are obtained.
And A4, taking each second quantity as an element of the second value counting vector to form the second value counting vector.
The 41 second numbers are used as elements of the second value counting vector, and the order of each element may be preset, for example, the order of the values from small to large is used as the order of each element, so that the second value counting vector may be formed.
B. And determining a first distribution probability vector of the positive example samples on the discrete features according to the first value counting vector and the total number of the positive example samples, and determining a second distribution probability vector of the negative example samples on the discrete features according to the second value counting vector and the total number of the negative example samples.
Dividing each element in the first value counting vector by the total number of the positive example samples to obtain a first distribution probability vector of the positive example samples on the discrete features; and dividing each element in the second value counting vector by the total number of the negative example samples to obtain a second distribution probability vector of the negative example samples on the discrete features.
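Steps A and B can be sketched together as follows, with illustrative data: count each value of a discrete feature separately over the positive and negative samples, then divide by the class totals to obtain the two distribution probability vectors:

```python
# Sketch of steps A and B: build the first/second value counting vectors over
# a discrete feature, then normalise by the class sizes to obtain the first
# and second distribution probability vectors. The sample values are made up.
from collections import Counter

def distribution_vectors(pos_values, neg_values):
    values = sorted(set(pos_values) | set(neg_values))  # fixed element order
    pos_counts = Counter(pos_values)
    neg_counts = Counter(neg_values)
    t_vec = [pos_counts[v] for v in values]  # first value counting vector
    f_vec = [neg_counts[v] for v in values]  # second value counting vector
    t_prob = [c / len(pos_values) for c in t_vec]  # first distribution probability vector
    f_prob = [c / len(neg_values) for c in f_vec]  # second distribution probability vector
    return values, t_prob, f_prob

values, t_prob, f_prob = distribution_vectors(
    pos_values=[20, 20, 25, 30],      # discrete feature values of positive samples
    neg_values=[25, 30, 30, 30])      # discrete feature values of negative samples
print(values, t_prob, f_prob)
```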
C. And determining the distribution difference index of the discrete features according to the first distribution probability vector, the second distribution probability vector and the number of different values of the discrete features.
In step C, determining the distribution difference index of the discrete feature is performed by:
C1, for each value: determine the first probability corresponding to the value in the first distribution probability vector according to the position, in the first value counting vector, of the element holding the first number for that value; determine the second probability corresponding to the value in the second distribution probability vector according to the position, in the second value counting vector, of the element holding the second number for that value; and determine a reference index corresponding to the value according to the absolute value of the difference between the first probability and the second probability and the number of distinct values of the discrete feature.
Taking the age as an example, for example, if the value is 20, determining a first probability P1 corresponding to the value 20 according to the position, for example, the first one, of the element of the corresponding first number M1 in the first value counting vector. Similarly, a second probability P2 corresponding to the value is determined. And determining a reference index corresponding to the value according to the absolute value of the difference value of the two and the number of different values of the discrete characteristics.
And C2, determining the sum of the reference indexes corresponding to the values as the distribution difference index of the discrete characteristics.
As above, the reference indexes corresponding to the respective values (20 to 60) are summed to obtain the distribution difference index of the discrete feature.
In a specific example, taking a discrete feature as an example, the distribution difference index of the discrete feature is described as follows:
After the discretization processing, let M be the set of all features, with dimensionality |M|. Suppose feature i has n values; count, for the positive and negative samples separately, the number of samples taking each value on feature i. Let

$$T_{Q_i} = (q_{i1}, q_{i2}, \dots, q_{it}, \dots, q_{in}), \quad i \in [1, |M|]$$

denote the value counting vector of the positive samples on the i-th feature (the first value counting vector), where $q_{it}$ is the number of positive example samples with value t on feature i; and let

$$F_{Q_i} = (q'_{i1}, q'_{i2}, \dots, q'_{it}, \dots, q'_{in}), \quad i \in [1, |M|]$$

denote the value counting vector of the negative samples on the i-th feature (the second value counting vector), where $q'_{it}$ is the number of negative example samples with value t on feature i.
Assuming that the number of samples of the positive example sample is | T |, and the number of samples of the negative example sample is | F |, the elements (sample numbers) in the vector are respectively divided by the number of samples, so as to obtain sample number ratio vectors of the positive example sample and the negative example sample at each value of the feature i, that is, the first probability distribution vector and the second probability distribution vector.
$$T_i = (p_{i1}, p_{i2}, p_{i3}, \dots, p_{it}, \dots, p_{in}), \quad i \in [1, |M|]$$

where $p_{it}$ is the probability that a positive example sample takes value t on feature i;

$$F_i = (p'_{i1}, p'_{i2}, p'_{i3}, \dots, p'_{it}, \dots, p'_{in}), \quad i \in [1, |M|]$$

where $p'_{it}$ is the probability that a negative example sample takes value t on feature i.
The distribution difference of the positive and negative samples over each value interval of feature i is then calculated. Assuming feature i contains n values, the feature distribution difference index of feature i is:

$$Z_i = \sum_{t=1}^{n} \frac{\left| p_{it} - p'_{it} \right|}{n}$$

where each term $\frac{\left| p_{it} - p'_{it} \right|}{n}$ is a reference index, and the sum of the individual reference indices is the distribution difference index $Z_i$.
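Under the reading that each reference index is the absolute probability difference divided by the number n of distinct values, the index can be computed as below; this is a sketch of that reading, not the application's exact code:

```python
# Distribution difference index sketch: Z_i = sum_t |p_it - p'_it| / n,
# where n is the number of distinct values of feature i. The 1/n weighting
# follows the description that each reference index also depends on the
# number of distinct values (an interpretive assumption).
def distribution_difference(t_prob, f_prob):
    n = len(t_prob)  # number of distinct values of the discrete feature
    refs = [abs(p - q) / n for p, q in zip(t_prob, f_prob)]  # reference indices
    return sum(refs)

# Probability vectors of positive/negative samples over three feature values.
z = distribution_difference([0.5, 0.25, 0.25], [0.0, 0.25, 0.75])
print(z)
```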
The feature distribution difference indexes of all |M| features are sorted in descending order to obtain a feature distribution difference ranking table; the higher a feature ranks, the larger the difference of that feature between the positive and negative samples.
Referring to S304, since the distribution difference index characterizes the difference degree of the discrete feature between the positive sample and the negative sample, the greater the difference degree, the more important the discrete feature is for the training process of the machine learning model. Therefore, the discrete features can be screened based on the distribution difference index of each discrete feature in the following manner, and then the target training sample set is determined based on the obtained reference feature set.
In a specific example, discrete features with distribution difference indexes larger than a preset index threshold value can be selected to form a reference feature set; or selecting a preset number of discrete features according to the size of the distribution difference index to form a reference feature set.
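Both screening options above can be sketched as follows; the feature names and scores are illustrative assumptions:

```python
# Sketch of S304's two screening options: keep features whose distribution
# difference index exceeds a preset threshold, or keep a preset number of
# top-ranked features. Feature names and index values are made up.
def screen_by_threshold(indices, threshold):
    return {f for f, z in indices.items() if z > threshold}

def screen_top_n(indices, n):
    ranked = sorted(indices, key=indices.get, reverse=True)
    return ranked[:n]

z_scores = {"age": 0.33, "gender": 0.05, "income": 0.21}
print(screen_by_threshold(z_scores, 0.1))  # reference feature set, option 1
print(screen_top_n(z_scores, 1))           # reference feature set, option 2
```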
In the embodiment of the application, after the target training sample set is determined based on the obtained reference feature set, the reference feature set is displayed according to a preset display mode in order to visually display the screened features for participating in model training to business personnel.
Specifically, for any one reference feature: if the range span of the feature values of the original feature corresponding to the reference feature is greater than a preset threshold, the preset display mode is a broken-line comparison graph; if the range span is less than or equal to the preset threshold, the preset display mode is a histogram. In both the broken-line comparison graph and the histogram, the abscissa is the value of the reference feature, and the ordinate is the value of each element in the first distribution probability vector and the second distribution probability vector corresponding to the reference feature.
Since the reference features are all discrete features, the corresponding original features may be discrete features or continuous features. In the display process, in order to enable a service person to better understand the influence of the features on the model training, the original features corresponding to the reference features need to be applied for judgment in the display process. In the actual display process, the following two conditions are mainly classified according to the size relationship between the characteristic value of the original characteristic corresponding to the reference characteristic and the preset threshold:
Case 1: the range span of the feature values of the original feature corresponding to the reference feature is greater than the preset threshold, where the range span refers to the number of values included from the minimum value to the maximum value. In this case, the preset display mode is the broken-line comparison graph. In this example, the reference feature is "age" and the preset threshold is, for example, 10; the range span is 100 (the values from 1 to 100), which is greater than the preset threshold of 10. Referring to FIG. 4, a schematic diagram shows a comparison of the distributions of the positive and negative examples over the "age" feature, where cst1 represents the positive example sample feature distribution curve, cst2 represents the negative example sample feature distribution curve, the abscissa represents the discretized feature value, and the ordinate represents the probability value (unit: %) of the corresponding feature value. It can be observed from fig. 4 that there is a significant difference between the distributions of the positive and negative examples over the feature "age", and that the users in the positive examples are younger.
Case 2: the range span of the characteristic value of the original characteristic corresponding to the reference characteristic is smaller than or equal to a preset threshold value. In this case, the predetermined display mode is a histogram mode. In this case, the predetermined display mode is a histogram mode. In this example, the reference feature is an "owner identifier", and the preset threshold ratio is 10, in this example, the "owner identifier" includes only two values, and therefore, the range span is 2 formed from 1 to 2, and is smaller than the preset threshold 10. Referring to FIG. 5, a schematic diagram illustrating a comparison of the distribution of positive and negative examples on the "owner identification" feature is shown. cst3 represents a positive case sample feature distribution histogram, cst4 represents a negative case sample feature distribution histogram, the abscissa represents the discretized feature value, and the ordinate represents the corresponding feature value probability value (unit%).
In addition, in order to verify the accuracy of the machine learning model obtained by training, the training result of the machine learning model is verified by applying the test sample, and the verification process is as follows:
D1, for any one discrete feature of any one target test sample, determining the SHAP value of the target test sample on the discrete feature based on the prediction result of the machine learning model on the target test sample; the target test sample is obtained by performing the discretization processing operation on a test sample; the test sample comprises basic information of a test sample user and business associated data of the test sample user.
In this step, the type of the test sample and the processing for the test sample are the same as the processing for the training sample, which is not described herein again. In one particular example, the SHAP value of the target test sample on the discrete feature is determined as follows.
In one specific example, the SHAP value is determined as follows:
based on SHAP additive interpretation, the idea of cooperative game theory is introduced, the marginal contribution of a feature when the feature is added into the model is calculated, and then the mean value of different marginal contributions of the feature under all feature sequences is considered. The mathematical expression is as follows:
Figure BDA0003607077460000191
where g is the interpretation model, M is the number of all features in the training set, z ″ i ∈{0,1} M Indicates whether the corresponding feature is present (1 indicates present, 0 indicates absent);
Figure BDA0003607077460000192
is a value of the cause of each feature,
Figure BDA0003607077460000193
is a constant (the predicted average of all training samples, since the input to the tree model must be structured data, for example x, z' should be a vector of all values 1, i.e. all features can be observed), then the above formula reduces to:
Figure BDA0003607077460000194
for sample x, SHAP value of sample x on feature i
Figure BDA0003607077460000195
The calculation method is as follows:
Figure BDA0003607077460000196
wherein, { x 1 ,...,x |M| Is the set of all input features of sample x, and S is the subset extracted from the feature library M. Its dimensionality is | S |, f x (S) is a prediction based on the feature subset S;
Figure BDA0003607077460000197
and a weight value representing a difference between the sample values in the case where the sample includes the feature i and the sample does not include the feature i in the corresponding feature subset S. Since a plurality of feature combinations can be extracted under all the features M to form the subset S, the sample x is a comprehensive score when the share value of the feature i enumerates all the possible feature subsets S, and the influence relationship of other features on the feature i is considered besides the sample x itself.
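The subset-weighted enumeration can be checked with a brute-force computation on a toy value function; real SHAP implementations (e.g., TreeSHAP for tree models) avoid this exponential cost. Here $f_x(S)$ is simply the sum of the observed features, an assumption made so that the expected result is obvious: for an additive function, the SHAP value of feature i equals that feature's own contribution.

```python
# Brute-force Shapley value of feature i for one sample, enumerating all
# subsets S of M \ {i} with weight |S|! (|M|-|S|-1)! / |M|!.
# f is a toy value function (sum of observed features); a real model would
# marginalise the missing features instead. Exponential cost -- illustration only.
from itertools import combinations
from math import factorial

def shap_value(x, i):
    M = list(range(len(x)))
    others = [j for j in M if j != i]
    def f(S):
        return sum(x[j] for j in S)  # toy prediction on feature subset S
    phi = 0.0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            weight = (factorial(len(S)) * factorial(len(M) - len(S) - 1)
                      / factorial(len(M)))
            phi += weight * (f(S + (i,)) - f(S))  # weighted marginal contribution
    return phi

x = (1.0, 2.0, 3.0)
print(shap_value(x, 1))  # for an additive f, this equals x[1]
```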
D2, carrying out weighted average processing on SHAP values of all target test samples on the discrete characteristics to obtain the association degree of the discrete characteristics and the prediction result of each target test sample; and the relevance represents the decision degree of the corresponding discrete features in the model training process.
In a specific example, denote the SHAP value of the j-th sample on feature i as $\phi_i^{(j)}$. The SHAP values of all N target test samples on feature i are averaged to obtain the relevance $sp_i$ between feature i and the prediction results:

$$sp_i = \frac{1}{N} \sum_{j=1}^{N} \left| \phi_i^{(j)} \right|$$

(the absolute value is taken so that positive and negative contributions both count toward relevance).
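Step D2 can be sketched as follows, assuming the weighted average is taken as a mean of absolute SHAP values (the application says only "weighted average", so this exact weighting is an assumption):

```python
# Sketch of step D2: aggregate per-sample SHAP values on a feature into one
# relevance score, then rank features by relevance in descending order.
def relevance(shap_values):
    # Mean absolute SHAP value -- an assumed reading of "weighted average".
    return sum(abs(v) for v in shap_values) / len(shap_values)

sp_age = relevance([0.4, -0.2, 0.3, -0.1])  # SHAP values of 4 samples on "age"
ranking = sorted({"age": sp_age, "gender": 0.05}.items(),
                 key=lambda kv: kv[1], reverse=True)  # feature relevance table
print(sp_age, ranking[0][0])
```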
The relevance degrees of all |M| features are sorted in descending order to obtain a feature relevance ranking table; the higher a feature ranks, the greater the role that feature plays in the model's decision process.
And D3, determining the corresponding association degree of each discrete feature.
And determining the association degree corresponding to each discrete feature according to the manner of determining the association degree corresponding to one discrete feature.
In addition, after the SHAP value of the target test sample on each discrete feature is determined based on the prediction result of the machine learning model on the target test sample, the verification result is displayed to the business personnel in the following way:
aiming at any one discrete feature, displaying the prediction result of the discrete feature of each target test sample according to a scatter diagram display mode; the abscissa of the scatter diagram is the SHAP value of the discrete feature, and the ordinate of the scatter diagram is the value of the discrete feature; and the scatter diagram represents the influence degree of each value of the discrete characteristic on the prediction result of the discrete characteristic of each target test sample.
Referring to fig. 6, from the total of |T| + |F| samples, u samples are randomly drawn to plot a two-dimensional scatter diagram between the feature values of the top-R1 ranked features and the prediction results. At the same time, to further highlight the distribution of the samples, a density measurement is performed over all sample points of the original diagram, yielding a thermodynamic (heat) map that reflects the concentration of samples.
Taking a certain discrete feature as an example, such as discrete feature 36: a point in the graph represents a sample, the vertical axis gives the value of feature 36 for that point, and the abscissa gives the magnitude of the correlation between that feature value and the prediction result (characterized by the SHAP value). A positive correlation value means the model predicts the sample as a positive example; a negative value means the model predicts it as a negative example; and the larger the absolute value, the larger the role that feature value plays in the model's prediction for the sample. The color of a point represents the density of points, with a fixed correspondence between density and color.
Therefore, the SHAP is adopted to calculate the relation between the values of the sample characteristics and the results, the relation between the characteristic values and the corresponding sample quantity is introduced, the relation between the sample characteristics and the prediction results is visualized based on the thermodynamic diagram form, and business personnel can further conveniently understand the role played by the characteristics in the prediction results of the model.
In conclusion, the distribution condition between the characteristic value and the result is directly displayed in a visual mode in the training process of the model, so that business personnel can understand the decision process of the model and trust the prediction result based on the model, and the business personnel can conveniently deposit business experience based on the mode learned by the model.
As shown in fig. 7, based on the same inventive concept as the business data processing method, the embodiment of the present application further provides a business data processing apparatus, which at least includes a data obtaining module 71, a determining module 72, and a model training module 73.
The data acquiring module 71 is configured to acquire user data to be processed; the user data comprises user basic information and service associated data;
a determining module 72, configured to input user data to be processed into a machine learning model trained in advance, and determine a target group to which an output user belongs;
wherein, a model training module 73 is further included for determining a target training sample set to which the training process of the machine learning model is applied by:
determining a training sample characteristic set according to the training sample set; each training sample in the training sample set comprises basic information of a training sample user and service associated data of the training sample user;
discretizing the continuous features in the training sample feature set, and forming a target training feature set by the discretized discrete features and the discrete features in the training sample feature set;
determining a distribution difference index of discrete features aiming at any one discrete feature in a target training feature set; the distribution difference index represents the difference degree of the discrete features in the positive sample and the negative sample; the positive sample is a sample of which the basic information meets the preset user attribute and/or the service associated data meets the preset service attribute, and the negative sample is a sample of which the basic information does not meet the preset user attribute and the service associated data does not meet the preset service attribute;
and screening each discrete feature based on the distribution difference index of each discrete feature, and determining a target training sample set based on the obtained reference feature set.
In some exemplary embodiments, model training module 73 is specifically configured to:
counting a first value counting vector of all positive samples on the discrete feature, and counting a second value counting vector of all negative samples on the discrete feature;
determining a first distribution probability vector of the positive samples on the discrete feature according to the first value counting vector and the total number of positive samples, and determining a second distribution probability vector of the negative samples on the discrete feature according to the second value counting vector and the total number of negative samples;
and determining the distribution difference index of the discrete feature according to the first distribution probability vector, the second distribution probability vector, and the number of distinct values of the discrete feature.
In some exemplary embodiments, model training module 73 is specifically configured to:
determining, for each value of the discrete feature, a first number of positive samples taking that value among all the positive samples;
taking each first number as an element of a first value counting vector to form the first value counting vector;
the counting of the second value counting vector of all negative samples on the discrete feature includes:
determining, for each value of the discrete feature, a second number of negative samples taking that value among all the negative samples;
and taking each second number as an element of a second value counting vector to form the second value counting vector.
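A minimal sketch of the two counting steps, assuming the feature's distinct values are known in advance (the sample data and `value_count_vector` helper are hypothetical):

```python
from collections import Counter

def value_count_vector(feature_values, distinct_values):
    """For each possible value of the discrete feature, count how many
    samples take that value; each count becomes one vector element."""
    counts = Counter(feature_values)
    return [counts.get(v, 0) for v in distinct_values]

distinct = [0, 1, 2]          # the feature's distinct values
pos = [0, 0, 1, 2, 2, 2]      # the feature over all positive samples
neg = [1, 1, 1, 2]            # the feature over all negative samples
first_vec = value_count_vector(pos, distinct)    # first value counting vector
second_vec = value_count_vector(neg, distinct)   # second value counting vector
```

Element i of each vector is the number of samples taking the i-th distinct value, so both vectors share the same value-to-position mapping.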
In some exemplary embodiments, model training module 73 is specifically configured to:
for each value, determining a first probability corresponding to the value in the first distribution probability vector according to the position, in the first value counting vector, of the element holding the first number corresponding to the value; determining a second probability corresponding to the value in the second distribution probability vector according to the position, in the second value counting vector, of the element holding the second number corresponding to the value; and determining a reference index corresponding to the value according to the absolute value of the difference between the first probability and the second probability and the number of distinct values of the discrete feature;
and determining the sum of the reference indexes corresponding to the values as the distribution difference index of the discrete feature.
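One plausible reading of these steps is that each reference index is |p1 − p2| divided by the number K of distinct values, and the distribution difference index is their sum; the exact weighting by K is an assumption, since the text only says the index depends on the absolute difference and the number of distinct values:

```python
def distribution_difference_index(first_vec, second_vec):
    """Normalize each value counting vector into a distribution probability
    vector, then sum the per-value reference indexes |p1 - p2| / K."""
    n_pos, n_neg = sum(first_vec), sum(second_vec)   # totals of positive / negative samples
    k = len(first_vec)                               # number of distinct values K
    p1 = [c / n_pos for c in first_vec]              # first distribution probability vector
    p2 = [c / n_neg for c in second_vec]             # second distribution probability vector
    return sum(abs(a - b) / k for a, b in zip(p1, p2))
```

Without the 1/K factor this is exactly the L1 distance between the two distributions, so the index grows as the feature separates positive from negative samples more strongly.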
In some exemplary embodiments, model training module 73 is specifically configured to:
selecting discrete features whose distribution difference index is larger than a preset index threshold to form the reference feature set; or
selecting a preset number of discrete features in descending order of distribution difference index to form the reference feature set.
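Both screening strategies reduce to a filter or a top-k selection; the threshold, k, and feature names below are made up for illustration:

```python
def screen_by_threshold(index_by_feature, threshold):
    """Keep every discrete feature whose distribution difference index
    exceeds the preset index threshold."""
    return [name for name, idx in index_by_feature if idx > threshold]

def screen_top_k(index_by_feature, k):
    """Keep the k discrete features with the largest indexes."""
    ranked = sorted(index_by_feature, key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

features = [("age", 0.39), ("balance", 0.12), ("tenure", 0.27)]
```

Either function yields the reference feature set from which the target training sample set is then assembled.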
In some exemplary embodiments, the apparatus further includes a first presentation module, configured to, after the target training sample set is determined based on the obtained reference feature set, present the reference feature set in a preset presentation manner;
wherein the first presentation module is specifically configured to: for any reference feature, if the range span of the feature values of the original feature corresponding to the reference feature is larger than a preset threshold, the preset presentation manner is a line comparison chart; if the range span of the feature values of the original feature corresponding to the reference feature is smaller than or equal to the preset threshold, the preset presentation manner is a histogram comparison chart; the abscissa of both the line comparison chart and the histogram comparison chart is the values corresponding to the reference feature, and the ordinate is the value of each element in the first distribution probability vector and the value of each element in the second distribution probability vector corresponding to the reference feature.
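The chart-selection rule reduces to a single span comparison; the threshold value here is an arbitrary example, not one fixed by the text:

```python
def choose_chart(feature_value_span, span_threshold=20):
    """Pick a line comparison chart for wide-ranging original features and a
    histogram comparison chart otherwise (threshold is an assumed example)."""
    return "line" if feature_value_span > span_threshold else "histogram"
```

The chosen chart then plots the two distribution probability vectors of the reference feature against its values.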
In some exemplary embodiments, the apparatus further includes a test module, specifically configured to:
for any discrete feature of any target test sample, determine a SHAP value of the target test sample on the discrete feature based on the prediction result of the machine learning model on the target test sample; the target test sample is obtained by performing the discretization processing operation on a test sample; the test sample comprises basic information of a test sample user and service associated data of the test sample user;
perform weighted average processing on the SHAP values of all target test samples on the discrete feature to obtain the association degree between the discrete feature and the prediction results of all target test samples; the association degree characterizes the degree to which the corresponding discrete feature drives decisions in the model training process;
and determine the association degree corresponding to each discrete feature.
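A sketch of the weighted-average step; the text does not specify the weights, so uniform weights are assumed here as the simplest case:

```python
def association_degree(shap_values, weights=None):
    """Weighted average of one discrete feature's SHAP values over all
    target test samples (uniform weights unless explicitly given)."""
    if weights is None:
        weights = [1.0] * len(shap_values)
    total = sum(weights)
    return sum(s * w for s, w in zip(shap_values, weights)) / total
```

Running this once per discrete feature yields the per-feature association degrees described above.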
In some exemplary embodiments, the apparatus further comprises a second presentation module, configured to, after the SHAP value of the target test sample on the discrete feature is determined based on the prediction result of the machine learning model on the target test sample:
for any discrete feature, present the prediction results on the discrete feature of each target test sample in a scatter chart;
the abscissa of the scatter chart is the SHAP value of the discrete feature, and the ordinate of the scatter chart is the value of the discrete feature; the scatter chart characterizes the degree of influence of each value of the discrete feature on the prediction results on that feature across the target test samples.
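The scatter chart pairs each sample's SHAP value with its feature value; a rough numeric summary of the same information — the mean SHAP value per distinct feature value — could be computed as follows (the grouping helper is an illustration, not part of the method):

```python
def mean_shap_per_value(shap_values, feature_values):
    """Group the samples' SHAP values by the discrete feature's value and
    average each group, summarizing how each value pushes predictions."""
    buckets = {}
    for s, v in zip(shap_values, feature_values):
        buckets.setdefault(v, []).append(s)
    return {v: sum(group) / len(group) for v, group in buckets.items()}
```

A value whose mean SHAP is strongly positive (or negative) pulls the model's prediction up (or down), which is exactly what the scatter chart makes visible.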
Since the service data processing apparatus and the service data processing method provided by the embodiments of the present application share the same inventive concept, the apparatus achieves the same beneficial effects, which are not repeated here.
Based on the same inventive concept as the service data processing method, the embodiment of the present application further provides an electronic device, which may be specifically a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), a server, and the like. As shown in fig. 8, the electronic device may include a processor 801 and a memory 802.
The processor 801 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory 802, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory may include at least one type of storage medium, for example, a flash memory, a hard disk, a multimedia card, a card-type memory, a random access memory (RAM), a static random access memory (SRAM), a programmable read-only memory (PROM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a magnetic memory, a magnetic disk, or an optical disk. The memory is, without limitation, any medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 802 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be completed by program instructions instructing relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The computer storage medium may be any available medium or data storage device that can be accessed by a computer, including but not limited to: a removable memory device, a random access memory (RAM), a magnetic memory (e.g., a floppy disk, a hard disk, a magnetic tape, or a magneto-optical disk (MO)), an optical memory (e.g., a CD, DVD, BD, or HVD), and a semiconductor memory (e.g., a ROM, an EPROM, an EEPROM, a non-volatile memory (NAND flash), or a solid-state disk (SSD)).
Alternatively, if the integrated units described above in the present application are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, or the portions thereof that contribute to the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods of the embodiments of the present application. The aforementioned storage medium includes: a removable memory device, a random access memory (RAM), a magnetic memory (e.g., a floppy disk, a hard disk, a magnetic tape, or a magneto-optical disk (MO)), an optical memory (e.g., a CD, DVD, BD, or HVD), and a semiconductor memory (e.g., a ROM, an EPROM, an EEPROM, a non-volatile memory (NAND flash), or a solid-state disk (SSD)).
The above embodiments are only used to describe the technical solutions of the present application in detail, but the above embodiments are only used to help understanding the method of the embodiments of the present application, and should not be construed as limiting the embodiments of the present application. Modifications and substitutions that may be readily apparent to those skilled in the art are intended to be included within the scope of the embodiments of the present application.

Claims (12)

1. A method for processing service data, comprising:
acquiring user data to be processed; the user data comprises user basic information and service associated data;
inputting the user data to be processed into a pre-trained machine learning model, and determining a target group to which the output user belongs;
wherein a set of target training samples to which a training process of the machine learning model is applied is determined by:
determining a training sample characteristic set according to the training sample set; wherein each training sample in the training sample set comprises basic information of a training sample user and service associated data of the training sample user;
discretizing the continuous features in the training sample feature set, and forming a target training feature set by the discretized features and the discrete features in the training sample feature set;
determining a distribution difference index of any discrete feature in the target training feature set; wherein the distribution difference index characterizes a degree of difference of the discrete features in positive and negative examples; the positive sample is a sample of which the basic information meets the preset user attribute and/or the service associated data meets the preset service attribute, and the negative sample is a sample of which the basic information does not meet the preset user attribute and the service associated data does not meet the preset service attribute;
and screening each discrete feature based on the distribution difference index of each discrete feature, and determining a target training sample set based on the obtained reference feature set.
2. The method of claim 1, wherein determining the distribution difference index of the discrete feature comprises:
counting a first value counting vector of all positive samples on the discrete feature, and counting a second value counting vector of all negative samples on the discrete feature;
determining a first distribution probability vector of the positive samples on the discrete feature according to the first value counting vector and the total number of positive samples, and determining a second distribution probability vector of the negative samples on the discrete feature according to the second value counting vector and the total number of negative samples;
and determining the distribution difference index of the discrete feature according to the first distribution probability vector, the second distribution probability vector, and the number of distinct values of the discrete feature.
3. The method of claim 2, wherein the counting of the first value counting vector of all positive samples on the discrete feature comprises:
determining, for each value of the discrete feature, a first number of positive samples taking that value among all the positive samples;
taking each first number as an element of a first value counting vector to form the first value counting vector;
the counting of the second value counting vector of all negative samples on the discrete feature comprises:
determining, for each value of the discrete feature, a second number of negative samples taking that value among all the negative samples;
and taking each second number as an element of a second value counting vector to form the second value counting vector.
4. The method of claim 2, wherein determining the distribution difference index of the discrete features according to the first distribution probability vector, the second distribution probability vector, and the number of different values of the discrete features comprises:
for each value, determining a first probability corresponding to the value in the first distribution probability vector according to the position, in the first value counting vector, of the element holding the first number corresponding to the value; determining a second probability corresponding to the value in the second distribution probability vector according to the position, in the second value counting vector, of the element holding the second number corresponding to the value; and determining a reference index corresponding to the value according to the absolute value of the difference between the first probability and the second probability and the number of distinct values of the discrete feature;
and determining the sum of the reference indexes corresponding to the values as the distribution difference index of the discrete feature.
5. The method according to claim 1, wherein the screening each of the discrete features based on the distribution difference index of each of the discrete features to obtain a reference feature set comprises:
selecting discrete features whose distribution difference index is larger than a preset index threshold to form the reference feature set; or
selecting a preset number of discrete features in descending order of distribution difference index to form the reference feature set.
6. The method of claim 1, wherein after determining the set of target training samples based on the obtained set of reference features, further comprising:
displaying the reference feature set according to a preset display mode;
for any reference feature, if the range span of the feature values of the original feature corresponding to the reference feature is larger than a preset threshold, the preset presentation manner is a line comparison chart; if the range span of the feature values of the original feature corresponding to the reference feature is smaller than or equal to the preset threshold, the preset presentation manner is a histogram comparison chart; the abscissa of both the line comparison chart and the histogram comparison chart is the values corresponding to the reference feature, and the ordinate is the value of each element in the first distribution probability vector and the value of each element in the second distribution probability vector corresponding to the reference feature.
7. The method according to any one of claims 1 to 6, further comprising:
for any discrete feature of any target test sample, determining a SHAP value of the target test sample on the discrete feature based on a prediction result of the machine learning model on the target test sample; the target test sample is obtained by performing the discretization processing operation on a test sample; the test sample comprises basic information of a test sample user and service associated data of the test sample user;
performing weighted average processing on the SHAP values of all target test samples on the discrete feature to obtain an association degree between the discrete feature and the prediction results of all target test samples; the association degree characterizes the degree to which the corresponding discrete feature drives decisions in the model training process;
and determining the association degree corresponding to each discrete feature.
8. The method of claim 7, wherein after determining the SHAP value of the target test sample on the discrete feature based on the prediction result of the machine learning model on the target test sample, the method further comprises:
for any discrete feature, displaying the prediction results on the discrete feature of each target test sample in a scatter chart;
wherein the abscissa of the scatter chart is the SHAP value of the discrete feature, and the ordinate of the scatter chart is the value of the discrete feature; the scatter chart characterizes the degree of influence of each value of the discrete feature on the prediction results on that feature across the target test samples.
9. A service data processing apparatus, comprising:
the data acquisition module is used for acquiring user data to be processed; the user data comprises user basic information and service associated data;
the determining module is used for inputting the user data to be processed into a pre-trained machine learning model and determining a target group to which the output user belongs;
wherein the apparatus further comprises a model training module, configured to determine the target training sample set used by the training process of the machine learning model by:
determining a training sample characteristic set according to the training sample set; wherein each training sample in the training sample set comprises basic information of a training sample user and service associated data of the training sample user;
discretizing the continuous features in the training sample feature set, and forming a target training feature set by the discretized discrete features and the discrete features in the training sample feature set;
determining a distribution difference index of any discrete feature in the target training feature set; wherein the distribution difference index characterizes a degree of difference of the discrete features in positive and negative examples; the positive sample is a sample of which the basic information meets the preset user attribute and/or the service associated data meets the preset service attribute, and the negative sample is a sample of which the basic information does not meet the preset user attribute and the service associated data does not meet the preset service attribute;
and screening each discrete feature based on the distribution difference index of each discrete feature, and determining a target training sample set based on the obtained reference feature set.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 8 are implemented when the computer program is executed by the processor.
11. A computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the steps of the method of any one of claims 1 to 8.
12. A computer program product comprising a computer program, characterized in that the computer program realizes the method of any one of claims 1 to 8 when executed by a processor.
CN202210422527.8A 2022-04-21 2022-04-21 Service data processing method and device, electronic equipment and storage medium Active CN115034400B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210422527.8A CN115034400B (en) 2022-04-21 2022-04-21 Service data processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115034400A true CN115034400A (en) 2022-09-09
CN115034400B CN115034400B (en) 2024-05-14

Family

ID=83119709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210422527.8A Active CN115034400B (en) 2022-04-21 2022-04-21 Service data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115034400B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884092A (en) * 2021-04-28 2021-06-01 深圳索信达数据技术有限公司 AI model generation method, electronic device, and storage medium
CN113326948A (en) * 2021-06-01 2021-08-31 深圳前海微众银行股份有限公司 Data processing method, device, equipment and storage medium of federal learning model
CN113326900A (en) * 2021-06-30 2021-08-31 深圳前海微众银行股份有限公司 Data processing method and device of federal learning model and storage medium

Also Published As

Publication number Publication date
CN115034400B (en) 2024-05-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant