CN112085528A - Data processing method and device

Info

Publication number: CN112085528A
Application number: CN202010937318.8A
Authority: CN
Inventors: 李见黎
Original assignee: Beijing Shenyan Intelligent Technology Co ltd
Current assignee: Beijing Shenyan Intelligent Technology Co ltd
Priority date: 2020-09-08
Filing date: 2020-09-08
Publication date: 2020-12-15

Abstract

The invention discloses a data processing method and a data processing device. The method comprises the following steps: acquiring sample data and predictive variable data from historical data, wherein the historical data is behavior data generated when a user browses a page; training a loss early warning model according to the sample data and the predictive variable data; classifying users according to the loss early warning model to obtain at least two types of user groups; and matching corresponding loss prevention strategies according to the at least two types of user groups. The invention solves the technical problem in the prior art that the accuracy and effectiveness of data prediction cannot be guaranteed because the rules used in data analysis are manually defined.

Description

Data processing method and device

Technical Field

The invention relates to the technical field of internet, in particular to a data processing method and device.

Background

Driven by internet technology, the demand for internet-derived services is gradually increasing, particularly in the field of electronic commerce. Especially since the rise of Artificial Intelligence (AI), how to efficiently combine AI technology with various computing models to analyze customer traffic data on an electronic commerce platform has become a direction in which the prior art seeks an effective technical scheme.

However, in the prior art, the analysis of customer traffic data is generally determined by rules manually defined by technicians. As a result, it is uncertain whether market behaviors and technical behaviors can be effectively fused, that is, whether a prediction result obtained by the internet technology is similar to the result actually produced under the influence of market behaviors, and the accuracy and effectiveness of data prediction cannot be guaranteed with such data analysis schemes.

For the above-mentioned problem that the accuracy and effectiveness of data prediction cannot be guaranteed because the rules used in the process of analyzing data in the prior art are manually defined rules, no effective solution is proposed at present.

Disclosure of Invention

The embodiment of the invention provides a data processing method and a data processing device, which at least solve the technical problem that the accuracy and the effectiveness of data prediction cannot be guaranteed because the rule used in the data analysis process in the prior art is a manually defined rule.

According to an aspect of an embodiment of the present invention, there is provided a data processing method including: acquiring sample data and predictive variable data from historical data, wherein the historical data is behavior data generated when a user browses a page; training a loss early warning model according to the sample data and the prediction variable data; classifying users according to a loss early warning model to obtain at least two types of user groups; and matching corresponding loss prevention strategies according to at least two types of user groups.

Optionally, obtaining the sample data and the predictive variable data from the historical data includes: classifying the historical data according to the historical behavior data of the page browsed by the user to obtain observation period data and presentation period data; generating the sample data according to the observation period data; and generating the predictive variable data according to the presentation period data.

Further, optionally, the observation period data comprises: at least one of purchase transaction amount, purchase item class, purchase frequency and purchase time of at least one user when browsing the page; the predictor variable data includes: the number of users that are churned, the type of users, and the impact of predictive variables on churning.

Optionally, training the attrition early warning model according to the sample data and the predictive variable data includes: segmenting the sample data to obtain at least one training set and a test set corresponding to the at least one training set; training a loss early warning model by a preset verification method according to at least one training set and a test set corresponding to at least one training set to obtain trained model parameters; and correcting the loss early warning model according to the trained model parameters and the prediction variable data to obtain a corrected loss early warning model.

Further, optionally, classifying the users according to the churn early warning model to obtain at least two types of user groups includes: scoring the users according to the loss early warning model to obtain at least one scored user group; and matching the corresponding risk label according to the score of at least one user group to obtain at least two types of user groups.

According to another aspect of the embodiments of the present invention, there is also provided a data processing apparatus, including: the acquisition module is used for acquiring sample data and predictive variable data from historical data, wherein the historical data is behavior data generated when a user browses a page; the training module is used for training the loss early warning model according to the sample data and the predictive variable data; the classification module is used for classifying the users according to the loss early warning model to obtain at least two types of user groups; and the matching module is used for matching the corresponding loss prevention strategies according to at least two types of user groups.

Optionally, the obtaining module includes: the classification unit is used for classifying the historical data according to the historical behavior data of the page browsed by the user to obtain observation period data and presentation period data; the first data generation unit is used for generating sample data according to the observation period data; and a second data generation unit for generating predictor variable data based on the presentation period data.

Optionally, the training module includes: the data set dividing unit is used for segmenting the sample data to obtain at least one training set and a test set corresponding to the at least one training set; the training unit is used for training the loss early warning model by a preset verification method according to at least one training set and a test set corresponding to the at least one training set to obtain model parameters after training; and the correcting unit is used for correcting the loss early warning model according to the trained model parameters and the prediction variable data to obtain the corrected loss early warning model.

Further, optionally, the classification module includes: the scoring unit is used for scoring the users according to the loss early warning model to obtain at least one scored user group; and the classification unit is used for matching the corresponding risk label according to the score of at least one user group to obtain at least two user groups.

In the embodiment of the invention, sample data and predictive variable data are acquired from historical data, wherein the historical data is behavior data generated when a user browses a page; a loss early warning model is trained according to the sample data and the predictive variable data; users are classified according to the loss early warning model to obtain at least two types of user groups; and corresponding loss prevention strategies are matched according to the at least two types of user groups. This achieves the purpose of effectively marking and distinguishing the user groups, thereby achieving the technical effect of guaranteeing the accuracy and effectiveness of data prediction, and solving the technical problem in the prior art that the accuracy and effectiveness of data prediction cannot be guaranteed because the rules used in the data analysis process are manually defined rules.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic flow diagram of a data processing method according to an embodiment of the invention;

FIG. 2a is a diagram illustrating a distribution of prediction values in a data processing method according to an embodiment of the present invention;

FIG. 2b is a schematic diagram of a calibration curve in a data processing method according to an embodiment of the present invention;

FIG. 3 is a schematic illustration of different risk client classes in a data processing method according to an embodiment of the present invention;

FIG. 4a to 4c are schematic diagrams of a scheme implementation architecture in a data processing method according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

In accordance with an embodiment of the present invention, a method embodiment of a data processing method is provided. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from that presented herein.

FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:

Step S102, acquiring sample data and predictive variable data from historical data, wherein the historical data is behavior data generated when a user browses a page;

In an optional implementation, obtaining sample data and predictive variable data from historical data includes: classifying the historical data according to the historical behavior data of the page browsed by the user to obtain observation period data and presentation period data; generating the sample data according to the observation period data; and generating the predictive variable data according to the presentation period data.
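The window split in step S102 can be illustrated with a minimal sketch. The order records, the window dates, the feature names and the helper functions below are all hypothetical, and reducing the later-period data to a binary "no order in the later window" churn label is an assumption made only for illustration; the embodiment does not prescribe this exact construction.

```python
from datetime import date

# Hypothetical order log: (user_id, order_date, amount). Illustrative values only.
ORDERS = [
    ("u1", date(2020, 1, 5), 30.0),
    ("u1", date(2020, 2, 10), 45.0),
    ("u1", date(2020, 4, 2), 20.0),
    ("u2", date(2020, 1, 20), 80.0),
    ("u2", date(2020, 3, 1), 15.0),
]


def split_windows(orders, obs_start, obs_end, later_end):
    """Split records into the observation window and the window that follows it."""
    observation = [o for o in orders if obs_start <= o[1] <= obs_end]
    later = [o for o in orders if obs_end < o[1] <= later_end]
    return observation, later


def build_samples(observation):
    """Aggregate per-user features (purchase frequency, total amount)."""
    feats = {}
    for user, _, amount in observation:
        f = feats.setdefault(user, {"frequency": 0, "total_amount": 0.0})
        f["frequency"] += 1
        f["total_amount"] += amount
    return feats


def build_labels(samples, later):
    """Assumed label: 1 (churned) if an observed user has no order in the later window."""
    active = {user for user, _, _ in later}
    return {user: 0 if user in active else 1 for user in samples}


obs, later = split_windows(
    ORDERS, date(2020, 1, 1), date(2020, 3, 31), date(2020, 6, 30)
)
samples = build_samples(obs)
labels = build_labels(samples, later)
print(samples)
print(labels)
```

In this toy data, user u1 orders again after the observation window and is labeled as retained, while u2 does not and is labeled as churned.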

Step S104, training a loss early warning model according to the sample data and the predictive variable data;

In an optional implementation manner, training the attrition early warning model according to the sample data and the predictive variable data includes: segmenting the sample data to obtain at least one training set and a test set corresponding to the at least one training set; training the loss early warning model by a preset verification method according to the at least one training set and the test set corresponding to the at least one training set to obtain trained model parameters; and correcting the loss early warning model according to the trained model parameters and the predictive variable data to obtain a corrected loss early warning model.

Specifically, in the embodiment of the present application, the loss early warning model may be selected as follows. Considering that the basic data features are not high-dimensional and sparse but relatively dense, and based on the inherent advantages of the algorithm, namely 1. regularization; 2. parallel training on large-scale data; and 3. flexibility, missing-value handling and the like, the XGBoost algorithm is selected. The AUC of the model ranges between 0.75 and 0.82 on different data sets.
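For readers unfamiliar with the AUC metric mentioned above, the following minimal sketch computes it directly from its pairwise definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy labels and scores are illustrative assumptions, not data from the embodiment.

```python
def auc(labels, scores):
    """Pairwise AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = churned, 0 = retained (illustrative only).
y_true = [1, 0, 1, 0, 1, 0]
y_score = [0.9, 0.4, 0.6, 0.7, 0.8, 0.2]
print(round(auc(y_true, y_score), 4))
```

The quadratic loop is chosen for clarity; a rank-based formulation would be preferable on large data sets.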

The optimization of the model comprises the following steps:

Step 1: optimizing model parameters. GridSearchCV is used with multiple rounds of validation to optimize the model parameters; the optimal values are n_estimators = 400, max_depth = 10, min_child_weight = 5, colsample_bytree = 0.3, learning_rate = 0.1, and so on.
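The exhaustive search that a grid search performs can be sketched without any machine learning library. The candidate values in the grid, the helper names and, in particular, the toy scoring function are assumptions for illustration only; in practice the score would be a cross-validated metric such as AUC, and the toy function below is not a measurement of any model.

```python
from itertools import product

# Hypothetical candidate values; only the parameter names come from the text above.
PARAM_GRID = {
    "n_estimators": [200, 400],
    "max_depth": [6, 10],
    "min_child_weight": [1, 5],
    "colsample_bytree": [0.3, 0.8],
    "learning_rate": [0.1],
}


def iter_grid(grid):
    """Yield every parameter combination in the grid as a dict."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def grid_search(grid, score_fn):
    """Return the combination with the highest score and that score."""
    best_params, best_score = None, float("-inf")
    for params in iter_grid(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy stand-in for a cross-validated score: counts how many parameters match
# the values reported in the text. It is NOT a real evaluation result.
REPORTED = {
    "n_estimators": 400,
    "max_depth": 10,
    "min_child_weight": 5,
    "colsample_bytree": 0.3,
    "learning_rate": 0.1,
}


def toy_score(params):
    return sum(1 for k, v in params.items() if REPORTED[k] == v)


best, best_score = grid_search(PARAM_GRID, toy_score)
print(best, best_score)
```

The cost grows with the product of the candidate list lengths (16 combinations here), which is why grids are usually kept coarse.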

Step 2: performing cross-validation. Cross-validation reuses the data: the obtained sample data is segmented and recombined into different training sets and test sets, where the training sets are used to train the model and the test sets are used to evaluate the quality of the model's predictions. Multiple different groups of training sets and test sets can be obtained in this way. Commonly used variants are simple cross-validation, S-fold cross-validation and leave-one-out cross-validation. In this method, S-fold cross-validation is used to improve the generalization capability of the model and to find the optimal model parameters.
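A minimal sketch of how S-fold cross-validation partitions sample indices is given below. The function name and the choice of contiguous, unshuffled folds are assumptions for illustration; a practical setup would typically shuffle or stratify the samples first.

```python
def s_fold_splits(n_samples, s):
    """Yield (train_indices, test_indices) for each of the S folds."""
    if not 2 <= s <= n_samples:
        raise ValueError("s must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // s + (1 if i < n_samples % s else 0) for i in range(s)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


splits = list(s_fold_splits(10, 3))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx)
```

Each sample appears in exactly one test fold, so every sample is used for both training and evaluation across the S rounds.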

Step 3: model fusion. By fusing a plurality of different models, the performance of machine learning can be improved. The model fusion methods in the embodiment of the application include the following:

Averaging method: the averaging method includes simple averaging and weighted averaging. The averaging method is generally used in regression prediction models; in Boosting-type fusion models, weighted average fusion is generally adopted.

Voting method: this includes absolute majority voting (more than half of the votes), relative majority voting (the most votes) and weighted voting. The voting method is generally used for classification models and is used in bagging models.

Learning method: a more powerful combination strategy is to use "learning", i.e. combining through another learner; the individual learners are referred to as primary learners, and the learner used for combining is referred to as the secondary learner or meta-learner.

In the embodiment of the present application, the voting method is taken as an example; it contributes an improvement of about 2% to the AUC of the model.
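The three voting variants named above can be sketched as follows. The function names, the tie-breaking behavior and the example weights are assumptions for illustration and are not taken from the embodiment.

```python
from collections import Counter


def relative_majority_vote(predictions):
    """Return the label with the most votes (plurality)."""
    return Counter(predictions).most_common(1)[0][0]


def absolute_majority_vote(predictions):
    """Return the label with more than half of the votes, or None if there is none."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else None


def weighted_vote(predictions, weights):
    """Return the label whose voters have the largest total weight."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


# Predictions of three hypothetical base models for one user (1 = churn).
print(relative_majority_vote([1, 0, 1]))
print(absolute_majority_vote([1, 0, 2]))
print(weighted_vote([1, 0, 0], [0.6, 0.2, 0.1]))
```

Weighted voting lets a single stronger model outvote several weaker ones, as in the last call, where one model with weight 0.6 overrides two models with a combined weight of 0.3.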

Step 4: calibration. Corresponding calibration is performed according to the real data distribution and the predicted data distribution.

As shown in FIG. 2a and FIG. 2b, FIG. 2a is a schematic diagram of the distribution of prediction values in a data processing method according to an embodiment of the present invention, and FIG. 2b is a schematic diagram of a calibration curve in the same method. FIG. 2b shows the calibration plot (reliability curve). In the first graph, the vertical axis represents the fraction of positive samples, the dotted line represents the ideal calibration curve (perfectly calibrated), and the solid line represents the prediction curve. In the second graph, the vertical axis represents the parameter, and the curve represents the predicted curve. In the embodiment of the application, the distribution of the prediction values is strongly related to the ratio of positive to negative samples, so corresponding calibration is performed according to the real distribution.
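The points of a reliability curve like the one described for FIG. 2b can be computed with a short sketch: predictions are grouped into equal-width probability bins, and for each bin the mean predicted probability is compared with the observed fraction of positive samples. The function name, the bin count and the toy data are assumptions; the figure in the embodiment may have been produced differently.

```python
def reliability_curve(labels, probs, n_bins=5):
    """Return (mean predicted prob, observed positive fraction, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((y, p))
    curve = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for _, p in members) / len(members)
        frac_pos = sum(y for y, _ in members) / len(members)
        curve.append((mean_pred, frac_pos, len(members)))
    return curve


# Toy labels (1 = churned) and predicted churn probabilities, illustrative only.
toy_labels = [0, 0, 1, 1, 1, 0]
toy_probs = [0.1, 0.15, 0.5, 0.9, 0.95, 0.85]
curve = reliability_curve(toy_labels, toy_probs, n_bins=2)
for mean_pred, frac_pos, count in curve:
    print(round(mean_pred, 3), round(frac_pos, 3), count)
```

A perfectly calibrated model would place every point on the diagonal, where the mean predicted probability equals the observed fraction of positives.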

Step S106, classifying the users according to the loss early warning model to obtain at least two types of user groups;

In an optional implementation manner, classifying users according to the churn early warning model to obtain at least two types of user groups includes: scoring the users according to the loss early warning model to obtain at least one scored user group; and matching the corresponding risk label according to the score of the at least one user group to obtain at least two types of user groups.
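Matching a risk label to a score can be sketched as a simple threshold lookup. The threshold values, the label names and the function names below are hypothetical; the embodiment does not disclose concrete cut-off scores.

```python
# Hypothetical thresholds on the predicted churn probability, highest first.
RISK_THRESHOLDS = [
    (0.8, "high risk"),
    (0.5, "medium risk"),
    (0.2, "low risk"),
]


def risk_label(score):
    """Map a churn score in [0, 1] to a risk label."""
    for threshold, label in RISK_THRESHOLDS:
        if score >= threshold:
            return label
    return "no risk"


def group_users(scores):
    """Group user ids by the risk label matched to their score."""
    groups = {}
    for user, score in scores.items():
        groups.setdefault(risk_label(score), []).append(user)
    return groups


groups = group_users({"a": 0.9, "b": 0.3, "c": 0.1})
print(groups)
```

Keeping the thresholds in one ordered table makes it straightforward to add or remove a risk level without touching the lookup logic.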

Step S108, matching the corresponding loss prevention strategy according to at least two types of user groups.

Specifically, FIG. 3 is a schematic diagram of different risk customer levels in the data processing method according to the embodiment of the present invention. In the embodiment of the application, a user portrait service is constructed by periodically generating labels on large-scale data, and users are divided into different levels: high risk customers, medium risk customers, low risk customers and no risk customers. Subsequently, different countermeasures can be adopted according to the different value levels.
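Step S108 can be pictured as a lookup from risk level to countermeasure. The four level names follow the text above, but every strategy string and the function name in this sketch are hypothetical placeholders; the embodiment does not list concrete strategies.

```python
# Hypothetical countermeasures per risk level; placeholders for illustration only.
STRATEGIES = {
    "high risk": "personal outreach plus retention coupon",
    "medium risk": "targeted promotion message",
    "low risk": "periodic engagement reminder",
    "no risk": "no action",
}


def match_strategies(groups):
    """Assign each user the strategy configured for the user's risk group."""
    return {
        user: STRATEGIES[label]
        for label, users in groups.items()
        for user in users
    }


plan = match_strategies({"high risk": ["a"], "no risk": ["b", "c"]})
print(plan)
```

An unknown risk label raises a KeyError here, which surfaces configuration gaps early instead of silently skipping users.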

In summary, with reference to steps S102 to S108, FIG. 4a to 4c are schematic diagrams of a scheme implementation architecture in a data processing method according to an embodiment of the present invention. As shown in FIG. 4a and 4b, within an observation period window, a batch of sample data is mined from the historical data to refine the attrition evaluation dimensions, and an attrition early warning model is constructed using the presentation period window data (i.e., the predictive variable data in the present embodiment). Within the prediction window, the model predicts the attrition probability, over the next several weeks or months, of users who have not yet definitely churned. An attrition scoring system is then established, corresponding attrition labels are applied to user groups through scoring rules to obtain at least two types of user groups, and corresponding anti-attrition strategies are matched to the different user groups. In the embodiment of the present application, the attrition risk classes of the user groups are divided based on the average order interval duration exceeding 95%.

As shown in FIG. 4c, the acquired historical data may be historical offline behavior data and real-time update data for the data lake (Datalake), processed starting at 0:00 each day. An Extract-Transform-Load (ETL) process is constructed from the historical offline behavior data and the real-time update data, and feature engineering is performed on the data lake to obtain offline features and training data, wherein the offline features are cached. The model training service is updated according to the training data; a model file is generated from the training data, and the loss risk is estimated using the generated model file.

The reasons for user loss in the embodiment of the present application may include: 1. user reasons; 2. service and product quality; 3. competitive factors; 4. feedback. As shown in Table 1:

TABLE 1

Example 2

According to another aspect of the embodiments of the present invention, there is also provided a data processing apparatus. FIG. 5 is a schematic diagram of the data processing apparatus according to the embodiments of the present invention. As shown in FIG. 5, the apparatus includes: an obtaining module 52, configured to obtain sample data and predictive variable data from historical data, where the historical data is behavior data generated when a user browses a page; a training module 54, configured to train a loss early warning model according to the sample data and the predictive variable data; a classification module 56, configured to classify users according to the loss early warning model to obtain at least two types of user groups; and a matching module 58, configured to match the corresponding anti-attrition strategies according to the at least two types of user groups.

Optionally, the obtaining module 52 includes: the classification unit is used for classifying the historical data according to the historical behavior data of the page browsed by the user to obtain observation period data and presentation period data; the first data generation unit is used for generating sample data according to the observation period data; and a second data generation unit for generating predictor variable data based on the presentation period data.

Optionally, the training module 54 includes: the data set dividing unit is used for segmenting the sample data to obtain at least one training set and a test set corresponding to the at least one training set; the training unit is used for training the loss early warning model by a preset verification method according to at least one training set and a test set corresponding to the at least one training set to obtain model parameters after training; and the correcting unit is used for correcting the loss early warning model according to the trained model parameters and the prediction variable data to obtain the corrected loss early warning model.

Further, optionally, the classification module 56 includes: the scoring unit is used for scoring the users according to the loss early warning model to obtain at least one scored user group; and the classification unit is used for matching the corresponding risk label according to the score of at least one user group to obtain at least two user groups.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program code.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A data processing method, comprising:

acquiring sample data and predictive variable data from historical data, wherein the historical data is behavior data generated when a user browses a page;

training a loss early warning model according to the sample data and the predictive variable data;

classifying users according to the loss early warning model to obtain at least two types of user groups;

and matching corresponding loss prevention strategies according to the at least two types of user groups.

2. The method of claim 1, wherein obtaining sample data and predictor variable data from historical data comprises:

classifying the historical data according to the historical behavior data of the page browsed by the user to obtain observation period data and presentation period data;

generating the sample data according to the observation period data;

and generating the predictive variable data according to the presentation period data.

3. The method of claim 2, wherein the observation period data comprises: at least one of purchase transaction amount, purchase item class, purchase frequency and purchase time of at least one user when browsing the page; the predictor variable data includes: the number of users that are churned, the type of users, and the impact of predictive variables on churning.

4. The method of claim 1 or 2, wherein training an attrition early warning model based on the sample data and the predictor variable data comprises:

segmenting the sample data to obtain at least one training set and a test set corresponding to the at least one training set;

training the loss early warning model by a preset verification method according to the at least one training set and the test set corresponding to the at least one training set to obtain trained model parameters;

and correcting the loss early warning model according to the trained model parameters and the predictive variable data to obtain a corrected loss early warning model.

5. The method of claim 4, wherein the classifying users according to the attrition early warning model to obtain at least two user groups comprises:

scoring the users according to the loss early warning model to obtain at least one scored user group;

and matching the corresponding risk label according to the score of the at least one user group to obtain at least two types of user groups.

6. A data processing apparatus, comprising:

the acquisition module is used for acquiring sample data and predictive variable data from historical data, wherein the historical data is behavior data generated when a user browses a page;

the training module is used for training a loss early warning model according to the sample data and the predictive variable data;

the classification module is used for classifying users according to the loss early warning model to obtain at least two types of user groups;

and the matching module is used for matching the corresponding loss prevention strategies according to the at least two types of user groups.

7. The apparatus of claim 6, wherein the obtaining module comprises:

the classification unit is used for classifying the historical data according to the historical behavior data of the page browsed by the user to obtain observation period data and presentation period data;

a first data generating unit, configured to generate the sample data according to the observation period data;

and the second data generation unit is used for generating the predictive variable data according to the presentation period data.

8. The apparatus of claim 7, wherein the observation period data comprises: at least one of purchase transaction amount, purchase item class, purchase frequency and purchase time of at least one user when browsing the page; the predictor variable data includes: the number of users that are churned, the type of users, and the impact of predictive variables on churning.

9. The apparatus of claim 6 or 7, wherein the training module comprises:

the data set dividing unit is used for segmenting the sample data to obtain at least one training set and a test set corresponding to the at least one training set;

the training unit is used for training the loss early warning model through a preset verification method according to the at least one training set and the test set corresponding to the at least one training set to obtain model parameters after training;

and the correcting unit is used for correcting the loss early warning model according to the trained model parameters and the predictive variable data to obtain a corrected loss early warning model.

10. The apparatus of claim 9, wherein the classification module comprises:

the scoring unit is used for scoring the users according to the loss early warning model to obtain at least one scored user group;

and the classification unit is used for matching the corresponding risk label according to the score of the at least one user group to obtain at least two user groups.