CN110930192A

CN110930192A - User loss prediction method, system, device and storage medium

Info

Publication number: CN110930192A
Application number: CN201911154066.5A
Authority: CN
Inventors: 刘畅; 肖铨武; 谢超
Original assignee: Ctrip Travel Information Technology Shanghai Co Ltd
Current assignee: Ctrip Travel Information Technology Shanghai Co Ltd
Priority date: 2019-11-22
Filing date: 2019-11-22
Publication date: 2020-03-27

Abstract

The invention discloses a user loss prediction method, system, device and storage medium. The user loss prediction method comprises the following steps: acquiring historical data, wherein the historical data comprises historical behavior data and historical order data of users; delineating the users in the historical data to obtain tag data; preprocessing the tag data to obtain sample data, wherein the sample data comprises a plurality of variable dimensions; training a regression model based on the sample data to obtain a trained prediction model; acquiring data of a user to be predicted, and preprocessing that data to obtain data to be predicted comprising the plurality of variable dimensions; and predicting, with the prediction model and based on the data to be predicted, the loss probability of the user to be predicted. The invention can predict the user loss probability efficiently and accurately.

Description

User loss prediction method, system, device and storage medium

Technical Field

The invention relates to the technical field of internet products, and in particular to a user loss prediction method, system, device and storage medium.

Background

Air ticket products are an important traffic entry point for an OTA (Online Travel Agency) platform. A user who buys an air ticket product on the OTA platform not only brings profit from the ticket itself, but also brings traffic and conversions to other products on the platform (such as hotels, vacations, business travel and visa services). Predicting the loss of air ticket users is therefore of great significance to the whole OTA platform.

In existing practice, operators typically delineate users who may be lost manually, using existing rules. This approach is inefficient, and manual delineation predicts the user loss probability with low accuracy.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, in which operators manually delineate users who may be lost using existing rules, resulting in low processing efficiency and low accuracy in predicting the user loss probability, and provides a user loss prediction method, system, device and storage medium.

The invention solves the technical problems through the following technical scheme:

the invention provides a user loss prediction method, which comprises the following steps:

acquiring historical data, wherein the historical data comprises historical behavior data and historical order data of a user;

delineating the user in the historical data to obtain tag data;

preprocessing the tag data to obtain sample data, wherein the sample data comprises a plurality of variable dimensions;

training a regression model based on the sample data to obtain a trained prediction model;

acquiring data of a user to be predicted, and preprocessing the data of the user to be predicted to obtain the data to be predicted comprising the variable dimensions;

and predicting the user to be predicted by using the prediction model based on the data to be predicted so as to obtain the loss probability of the user to be predicted.

In this scheme, machine learning is used to establish a method that accurately predicts user loss, which solves the problems of low processing efficiency and low prediction accuracy that arise when operators manually delineate users who may be lost using existing rules.

Preferably, the user churn prediction method further comprises the following steps:

and serializing the prediction model into a file to be stored in a server.

In this scheme, the model is serialized so that it can be conveniently transmitted and stored. The server on which the model is trained and the server on which the prediction model is deployed online are different servers; in addition, serializing the prediction model and storing it as a file facilitates version management and rollback.
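The serialization step can be sketched as follows. The patent does not name a serialization format, so the use of Python's `pickle` module here is an assumption, as are the helper names `save_model` and `load_model` and the `churn_model_<version>.pkl` file naming, which is meant only to illustrate how one file per version makes rollback a matter of loading an older file.

```python
import pickle
from pathlib import Path


def save_model(model, directory, version):
    """Serialize a trained model to a versioned file and return its path.

    The file name pattern is hypothetical; any scheme that keeps one
    file per model version would serve the same purpose.
    """
    path = Path(directory) / f"churn_model_{version}.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(model, handle)
    return path


def load_model(path):
    """Load a model file that was written by save_model.

    pickle can execute arbitrary code while loading, so only files
    from a trusted source should be loaded.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

Under these assumptions, the training server would call `save_model` after each training run and copy the file to the serving server, which calls `load_model` on the version it is configured to use.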

Preferably, the historical behavior data comprises historical search data and/or historical browsing data.

Preferably, the historical data is historical data before X months;

the step of delineating the user in the historical data to obtain tag data comprises:

defining the users in the historical data who placed more than M successful orders in N consecutive months and placed no orders in the following X months as lost users, and the remaining users in the historical data as non-lost users;

labeling the historical data of the lost users to obtain lost tag data; labeling the historical data of the non-lost users to obtain non-lost tag data;

the tag data comprises the lost tag data and the non-lost tag data;

wherein M, N, X are all positive integers.

In this scheme, the historical data is further restricted so that the input data is more accurate, which improves the accuracy of user loss prediction.

Preferably, the variable dimension includes at least one of a latest search time, a number of searches in a past first preset time period, a latest access time, a number of accesses in a past second preset time period, a latest order placing time, an amount of orders in a past third preset time period, and a maximum access depth of the web page.

In this scheme, the specific variable dimensions are further defined, which improves the accuracy of model training.

Preferably, the regression model includes any one of logistic regression, decision tree regression, random forest, and XGBoost (eXtreme Gradient Boosting).

The invention also provides a user loss prediction system, which comprises:

the acquisition module is used for acquiring historical data, wherein the historical data comprises historical behavior data and historical order data of a user;

the delineating module is used for delineating the user in the historical data to obtain tag data;

the preprocessing module is used for preprocessing the tag data to obtain sample data, and the sample data comprises a plurality of variable dimensions;

the training module is used for training the regression model based on the sample data to obtain a trained prediction model;

the data generation module is used for acquiring data of a user to be predicted and preprocessing the data of the user to be predicted to obtain the data to be predicted comprising the variable dimensions;

and the prediction module is used for predicting the user to be predicted by using the prediction model based on the data to be predicted so as to obtain the loss probability of the user to be predicted.

Preferably, the user churn prediction system further comprises a serialization module, and the serialization module is used for serializing the prediction model into a file to be stored in a server.

Preferably, the historical data is historical data before X months;

the delineation module comprises:

the delineating unit is used for delineating the users in the historical data who placed more than M successful orders in N consecutive months and placed no orders in the following X months as lost users, and the remaining users in the historical data as non-lost users;

the label unit is used for labeling the historical data of the lost users to obtain lost tag data, and is also used for labeling the historical data of the non-lost users to obtain non-lost tag data;

the tag data comprises the lost tag data and the non-lost tag data;

wherein M, N, X are all positive integers.

Preferably, the regression model includes any one of logistic regression, decision tree regression, random forest, XGBoost.

The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor implements the user churn prediction method when executing the computer program.

The invention also provides a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the aforementioned user churn prediction method.

The positive progress effects of the invention are as follows:

according to the invention, the users in the historical data are delineated and then labeled, and sample data comprising variable dimensions is obtained after preprocessing. A regression model is trained based on the sample data to obtain a trained prediction model, and the prediction model is used to predict the user loss probability. Compared with the existing approach, in which operators manually delineate users who may be lost using existing rules, the invention can predict the user loss probability efficiently and accurately.

Drawings

Fig. 1 is a flowchart of a user churn prediction method according to embodiment 1 of the present invention.

Fig. 2 is a flowchart of step S102 in embodiment 1 of the present invention.

Fig. 3 is a schematic block diagram of a user churn prediction system according to embodiment 2 of the present invention.

Fig. 4 is a schematic structural diagram of a delineation module in embodiment 2 of the present invention.

Fig. 5 is a schematic structural diagram of an electronic device according to embodiment 3 of the present invention.

Detailed Description

The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.

Example 1

As shown in fig. 1, the present embodiment discloses a user churn prediction method, which includes the following steps:

Step S101, obtaining historical data before X months, wherein the historical data comprises historical behavior data and historical order data of a user, and the historical behavior data includes historical search data and/or historical browsing data.

In this embodiment, the historical behavior data includes both historical search data and historical browsing data. Furthermore, the historical data may also include macroeconomic data, airport city capacity, and the like.

This embodiment can be applied to user churn prediction in various scenarios, for example, predicting user churn in the air ticket channel that corresponds to the air ticket products of an OTA platform. This scenario uses the user's search data and order data in the OTA air ticket channel, together with the user's browsing data across all channels of the OTA; the user's possible behavior data on other OTA websites can be supplemented through third-party data, and together these form the historical data. Because air tickets are a low-frequency purchase, the time window used to delineate loss is longer than that for common e-commerce products, and the time window is therefore selected according to the historical data.

Step S102, delineating the users in the historical data to obtain tag data.

Step S103, preprocessing the tag data to obtain sample data, wherein the sample data comprises a plurality of variable dimensions.

The variables that may influence whether a user is lost are compiled. The variable dimensions comprise at least one of: the most recent search time, the number of searches in a past first preset time period, the most recent access time, the number of accesses in a past second preset time period, the most recent order placing time, the number of orders in a past third preset time period, and the maximum access depth of web pages.

When the method is specifically applied to an OTA platform, based on browsing, searching and ordering data in a data warehouse of the OTA, each variable dimension of sample data is obtained through ETL data preprocessing.
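The variable dimensions listed above could be derived from raw event dates roughly as follows. This is a minimal sketch, not the ETL flow of the embodiment: the input layout (a dict with `searches`, `visits`, `orders` and `page_depths` keys), the function names, and the 90-day and 365-day window lengths standing in for the first, second and third preset time periods are all assumptions made for illustration. The "most recent" times are expressed as the number of days since the most recent event.

```python
from datetime import date, timedelta


def recency_and_count(event_dates, as_of, window_days):
    """Return (days since the most recent event, events in the trailing window).

    Recency is None when the user has no event on or before `as_of`.
    """
    past = [d for d in event_dates if d <= as_of]
    recency = (as_of - max(past)).days if past else None
    window_start = as_of - timedelta(days=window_days)
    count = sum(window_start < d <= as_of for d in past)
    return recency, count


def build_features(user, as_of, search_days=90, visit_days=90, order_days=365):
    """Build one row of variable dimensions for a single user.

    `user` is a hypothetical dict of lists of dates plus a list of
    page depths; the window lengths are illustrative defaults.
    """
    s_rec, s_cnt = recency_and_count(user["searches"], as_of, search_days)
    v_rec, v_cnt = recency_and_count(user["visits"], as_of, visit_days)
    o_rec, o_cnt = recency_and_count(user["orders"], as_of, order_days)
    return {
        "days_since_search": s_rec,
        "searches_in_window": s_cnt,
        "days_since_visit": v_rec,
        "visits_in_window": v_cnt,
        "days_since_order": o_rec,
        "orders_in_window": o_cnt,
        "max_page_depth": max(user["page_depths"], default=0),
    }
```

Missing recency values are returned as `None` in this sketch; how such gaps are filled before training is left open here, since the source does not describe it.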

Step S104, training a regression model based on the sample data to obtain a trained prediction model. The regression model comprises any one of logistic regression, decision tree regression, random forest and XGBoost. In this embodiment, a logistic regression model is used for model training.
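To make the training step concrete, the following is a minimal sketch of logistic regression fitted by batch gradient descent on the log loss, written in plain Python. The embodiment does not specify a library, an optimizer or hyperparameters, so the learning rate, the number of epochs, the function names and the assumption that features have already been scaled to a comparable range are all illustrative choices rather than the patented procedure.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss.

    rows:   list of equal-length feature vectors (assumed pre-scaled)
    labels: list of 0/1 values, where 1 marks a lost user
    """
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_proba(model, x):
    """Return the predicted loss probability for one feature vector."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
```

In this sketch the returned `(weights, bias)` pair plays the role of the trained prediction model, and `predict_proba` yields the loss probability used in the prediction step.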

Step S105, obtaining data of a user to be predicted, and preprocessing the data of the user to be predicted to obtain the data to be predicted comprising a plurality of variable dimensions.

Step S106, serializing the prediction model into a file and storing the file in a server.

Step S107, predicting the user to be predicted by using the prediction model based on the data to be predicted, so as to obtain the loss probability of the user to be predicted.

In a specific application, the model file stored in the server can be called to obtain the prediction result for the current day, namely the probability that each user will be lost within the X months following the current day. The resulting loss probabilities may be stored in a database table, and the prediction results may be sent to the relevant personnel as a data report or by email. The prediction results can also be processed further, for example by tiered output: depending on the actual application scenario, either the probability itself is output directly, or it is mapped to several levels such as high, medium and low. This embodiment can thus effectively identify users with a high, medium or low probability of loss, and can further issue early warnings for users with a high probability of loss.
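The tiered output described above can be sketched as a simple threshold mapping. The source does not give cut-off values, so the thresholds 0.7 and 0.4 and the function names `churn_tier` and `early_warning` are assumptions chosen only to illustrate the idea.

```python
def churn_tier(probability, high=0.7, medium=0.4):
    """Map a loss probability to 'high', 'medium' or 'low'.

    The default thresholds are illustrative, not taken from the source.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high:
        return "high"
    if probability >= medium:
        return "medium"
    return "low"


def early_warning(predictions, high=0.7, medium=0.4):
    """Return the ids of high-tier users, highest probability first.

    predictions: dict mapping user id to predicted loss probability
    """
    flagged = [
        (prob, user_id)
        for user_id, prob in predictions.items()
        if churn_tier(prob, high, medium) == "high"
    ]
    flagged.sort(key=lambda pair: pair[0], reverse=True)
    return [user_id for _, user_id in flagged]
```

In practice the thresholds would be chosen per application scenario, for example from the distribution of predicted probabilities.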

As shown in fig. 2, in this embodiment, step S102 further includes the following steps:

Step S1021, defining the users in the historical data who placed more than M successful orders in N consecutive months and placed no orders in the following X months as lost users, and the remaining users in the historical data as non-lost users.

Step S1022, labeling the historical data of the lost users to obtain lost tag data, and labeling the historical data of the non-lost users to obtain non-lost tag data;

the tag data comprises the lost tag data and the non-lost tag data;

wherein M, N, X are all positive integers.

The delineation of lost users is further described below, taking air tickets as an example. Typical lost users are delineated according to the characteristics of air ticket products and the ticket purchasing behavior (mainly order data) of typical OTA users, which yields the tag data corresponding to lost users and non-lost users. Specifically, taking the historical data before X months, the loss of a high-frequency user can be defined as follows: a user who placed more than M successful orders in a past period of N consecutive months and placed no orders in the following X months is a lost user. For example, if the current month is October and N is 12, M is 3 and X is 6, the users who placed more than 3 orders between April of last year and March of this year are taken as the sample set. Whether each of these users is lost is determined by whether the user placed an order between April and September of this year: if so, the user is a non-lost user; if not, the user is a lost user. The user churn prediction method disclosed in this embodiment can predict the churn probability of users of the OTA air ticket channel, and thereby identify users who may be lost, especially those with a high churn probability. For the users identified by this embodiment, return visits and repeat purchases on the OTA platform over a subsequent period are tracked to determine whether they were actually lost. Compared with operators manually identifying users who may be lost according to existing rules, this embodiment identifies the users who are actually lost more accurately, which verifies the effectiveness of the method.
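The delineation rule and its worked example can be sketched as follows, reading the example as: the current month is October, the observation window runs from April of last year through March of this year (N = 12), and the follow-up window runs from April through September of this year (X = 6). The function names and the input layout (a dict from user id to a list of successful order dates) are assumptions. The sketch follows the worked example in labeling only the users in the sample set, that is, those with more than M orders in the observation window.

```python
from datetime import date


def add_months(d, months):
    """Return the first day of the month that is `months` away from d's month."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def label_windows(today, n, x):
    """Return (obs_start, obs_end, follow_end) as first-of-month dates.

    Observation window: [obs_start, obs_end), N months long.
    Follow-up window:   [obs_end, follow_end), X months long,
                        ending at the start of the current month.
    """
    follow_end = add_months(today, 0)
    obs_end = add_months(follow_end, -x)
    obs_start = add_months(obs_end, -n)
    return obs_start, obs_end, follow_end


def label_users(orders, today, n, m, x):
    """Label sample-set users: 1 for lost, 0 for non-lost.

    orders: dict mapping user id to a list of successful order dates.
    Users with at most M orders in the observation window are skipped.
    """
    obs_start, obs_end, follow_end = label_windows(today, n, x)
    labels = {}
    for user_id, dates in orders.items():
        observed = sum(obs_start <= d < obs_end for d in dates)
        if observed <= m:
            continue  # not a high-frequency user, so not in the sample set
        returned = any(obs_end <= d < follow_end for d in dates)
        labels[user_id] = 0 if returned else 1
    return labels
```

For a date in October 2019 with N = 12 and X = 6, `label_windows` gives an observation window of 2018-04-01 up to (but excluding) 2019-04-01 and a follow-up window ending before 2019-10-01, which matches the months named in the example.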

In a specific application, the ETL data preprocessing flow can be run on a schedule to clean and integrate newly added historical data, producing the input for the regression model to be trained that day. Specifically, the ETL data preprocessing can be executed at the beginning of each month to obtain updated sample data: the variable dimensions of the sample data for lost users and non-lost users are refreshed from the users' browsing, searching and ordering data of the previous month, which keeps the data timely. An updated prediction model is then obtained.

The user loss prediction method disclosed in this embodiment uses machine learning to establish a method that accurately predicts user loss. The users in the historical data are delineated and then labeled, and sample data comprising variable dimensions is obtained after preprocessing. A regression model is trained based on the sample data to obtain a trained prediction model, and the prediction model is used to predict the user loss probability. Compared with the existing approach, in which operators manually delineate users who may be lost according to existing rules, the user loss probability can be predicted efficiently and accurately.

Example 2

As shown in fig. 3, the present embodiment provides a user churn prediction system, which includes an obtaining module 1, a delineating module 2, a preprocessing module 3, a training module 4, a data generating module 5, a prediction module 6, and a serialization module 7.

The acquisition module 1 is used for acquiring historical data, wherein the historical data comprises historical behavior data and historical order data of a user, and the historical behavior data includes historical search data and/or historical browsing data. In this embodiment, the historical behavior data includes both historical search data and historical browsing data. Furthermore, the historical data may also include macroeconomic data, airport city capacity, and the like.

The user churn prediction system provided by this embodiment can be applied to user churn prediction in various scenarios, for example, predicting user churn in the air ticket channel that corresponds to the air ticket products of an OTA platform. This scenario uses the user's search data and order data in the OTA air ticket channel, together with the user's browsing data across all channels of the OTA; the user's possible behavior data on other OTA websites can be supplemented through third-party data, and together these form the historical data. Because air tickets are a low-frequency purchase, the time window used to delineate loss is longer than that for common e-commerce products, and the time window is therefore selected according to the historical data.

The delineating module 2 is used for delineating the users in the historical data to obtain the tag data.

The preprocessing module 3 is configured to preprocess the tag data to obtain sample data, where the sample data includes a plurality of variable dimensions.

The preprocessing module 3 compiles the variables that may influence whether a user is lost. The variable dimensions comprise at least one of: the most recent search time, the number of searches in a past first preset time period, the most recent access time, the number of accesses in a past second preset time period, the most recent order placing time, the number of orders in a past third preset time period, and the maximum access depth of web pages.

The training module 4 is configured to train the regression model based on the sample data to obtain a trained prediction model.

The regression model comprises any one of logistic regression, decision tree regression, random forest and XGBoost. In this embodiment, a logistic regression model is used for model training.

The data generating module 5 is configured to obtain data of the user to be predicted, and pre-process the data of the user to be predicted to obtain the data to be predicted including the plurality of variable dimensions.

The serialization module 7 is used for serializing the prediction model into a file to be stored in the server.

The prediction module 6 is configured to predict the user to be predicted by using a prediction model based on the data to be predicted, so as to obtain the loss probability of the user to be predicted.

In a specific application, the user loss prediction system can call the model file stored in the server to obtain the prediction result for the current day, namely the probability that each user will be lost within the X months following the current day. The resulting loss probabilities may be stored in a database table, and the prediction results may be sent to the relevant personnel as a data report or by email. The prediction results can also be processed further, for example by tiered output: depending on the actual application scenario, either the probability itself is output directly, or it is mapped to several levels such as high, medium and low. This embodiment can thus effectively identify users with a high, medium or low probability of loss, and can further issue early warnings for users with a high probability of loss.

As shown in fig. 4, in the present embodiment, the delineating module 2 includes a delineating unit 1 and a label unit 2.

The delineating unit 1 is configured to delineate the users in the historical data who placed more than M successful orders in N consecutive months and placed no orders in the following X months as lost users, and the remaining users in the historical data as non-lost users.

The label unit 2 is used for labeling the historical data of the lost users to obtain lost tag data, and is also used for labeling the historical data of the non-lost users to obtain non-lost tag data.

The tag data includes the lost tag data and the non-lost tag data.

Wherein M, N, X are all positive integers.

The delineation of lost users by the delineating unit 1 is further described below, taking air tickets as an example. Typical lost users are delineated according to the characteristics of air ticket products and the ticket purchasing behavior (mainly order data) of typical OTA users, which yields the tag data corresponding to lost users and non-lost users. Specifically, taking the historical data before X months, the loss of a high-frequency user can be defined as follows: a user who placed more than M successful orders in a past period of N consecutive months and placed no orders in the following X months is a lost user. For example, if the current month is October and N is 12, M is 3 and X is 6, the users who placed more than 3 orders between April of last year and March of this year are taken as the sample set. Whether each of these users is lost is determined by whether the user placed an order between April and September of this year: if so, the user is a non-lost user; if not, the user is a lost user. The user churn prediction system disclosed in this embodiment can predict the churn probability of users of the OTA air ticket channel, and thereby identify users who may be lost, especially those with a high churn probability. For the users identified by this embodiment, return visits and repeat purchases on the OTA platform over a subsequent period are tracked to determine whether they were actually lost. Compared with operators manually identifying users who may be lost according to existing rules, this embodiment identifies the users who are actually lost more accurately, which verifies the effectiveness of the system.

The user loss prediction system provided by this embodiment uses machine learning to predict the user loss probability accurately. The delineating module delineates and labels the users in the historical data, and the preprocessing module preprocesses the data to obtain sample data comprising variable dimensions. A regression model is trained based on the sample data to obtain a trained prediction model, and the prediction model is used to predict the user loss probability. Compared with the existing approach, in which operators manually delineate users who may be lost according to existing rules, the user loss probability can be predicted efficiently and accurately.

Example 3

Fig. 5 is a schematic structural diagram of an electronic device according to embodiment 3 of the present invention. The electronic device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the user churn prediction method provided in embodiment 1 when executing the program. The electronic device 30 shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.

As shown in fig. 5, the electronic device 30 may be embodied in the form of a general purpose computing device, which may be, for example, a server device. The components of the electronic device 30 may include, but are not limited to: at least one processor 31, at least one memory 32, and a bus 33 connecting the various system components (including the memory 32 and the processor 31).

The bus 33 includes a data bus, an address bus, and a control bus.

The memory 32 may include volatile memory, such as Random Access Memory (RAM) 321 and/or cache memory 322, and may further include Read Only Memory (ROM) 323.

Memory 32 may also include a program/utility 325 having a set (at least one) of program modules 324, such program modules 324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

The processor 31 executes various functional applications and data processing, such as the user churn prediction method provided in embodiment 1 of the present invention, by running a computer program stored in the memory 32.

The electronic device 30 may also communicate with one or more external devices 34 (e.g., keyboard, pointing device, etc.). Such communication may be through input/output (I/O) interfaces 35. The electronic device 30 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via network adapter 36. As shown, network adapter 36 communicates with the other modules of the electronic device 30 via bus 33. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, and data backup storage systems.

It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided so as to be embodied by a plurality of units/modules.

Example 4

The present embodiment provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the user churn prediction method provided in embodiment 1.

More specific examples of the readable storage medium include, but are not limited to: a portable disk, a hard disk, random access memory, read only memory, erasable programmable read only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

In a possible implementation manner, the present invention can also be implemented in the form of a program product, which includes program code for causing a terminal device to execute the steps of implementing the user churn prediction method provided in embodiment 1 when the program product runs on the terminal device.

The program code for carrying out the invention may be written in any combination of one or more programming languages, and may be executed entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.

While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims

1. A user churn prediction method is characterized by comprising the following steps:

delineating the user in the historical data to obtain tag data;

2. The user churn prediction method as recited in claim 1, further comprising the steps of:

and serializing the prediction model into a file to be stored in a server.

3. The user churn prediction method as claimed in claim 1, wherein the historical behavior data comprises historical search data and/or historical browsing data; the regression model comprises any one of logistic regression, decision tree regression, random forest and XGBoost.

4. The user churn prediction method of claim 1,

the historical data is the historical data before X months;

the tag data comprises the attrition tag data and the non-attrition tag data;

wherein M, N, X are all positive integers.

5. The user churn prediction method as claimed in claim 1, wherein the variable dimension includes at least one of a recent search time, a number of searches in a first preset time period in the past, a recent access time, a number of accesses in a second preset time period in the past, a recent order placing time, an amount of orders in a third preset time period in the past, and a maximum access depth of a web page.

6. A user churn prediction system, the user churn prediction system comprising:

7. The user churn prediction system of claim 6, further comprising a serialization module to serialize the prediction model into a file for storage to a server.

8. The user churn prediction system according to claim 6, wherein the historical behavior data comprises historical search data and/or historical browsing data; the regression model comprises any one of logistic regression, decision tree regression, random forest and XGBoost.

9. The user churn prediction system as recited in claim 6,

the historical data is the historical data before X months;

the delineation module comprises:

the tag data comprises the attrition tag data and the non-attrition tag data;

wherein M, N, X are all positive integers.

10. The user churn prediction system as claimed in claim 6 wherein the variable dimensions include at least one of recent search time, number of searches in a first predetermined period of time in the past, recent access time, number of accesses in a second predetermined period of time in the past, recent order placement time, number of orders in a third predetermined period of time in the past, and maximum access depth of web pages.

11. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the user churn prediction method as claimed in any one of claims 1 to 5 when executing the computer program.

12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the user churn prediction method according to any one of claims 1 to 5.