Disclosure of Invention
In view of the problems in the prior art, an object of the present invention is to provide a method, an apparatus, a storage medium, and an electronic device for estimating a registration probability, so as to effectively predict the probability of a user's actions such as registration, purchase, and click.
According to an aspect of the present invention, there is provided a registration probability estimation method, including:
acquiring first user behavior data according to the user operation log stream;
inputting the first user behavior data into a trained first prediction model, and acquiring data of a plurality of hidden layers of the first prediction model as second user behavior data;
carrying out cross construction on part of the first user behavior data according to the calculated importance value to obtain third user behavior data;
splicing the second user behavior data and the third user behavior data to obtain fourth user behavior data;
inputting the fourth user behavior data into a second prediction model, and taking the output of the second prediction model as a predicted value of the user's registration probability,
wherein the step of carrying out cross construction on part of the first user behavior data according to the calculated importance value to obtain the third user behavior data further comprises:
distinguishing the first user behavior data into first characteristic data and second characteristic data according to the calculated importance value;
performing cross construction on the second characteristic data to form third characteristic data;
the first characteristic data and the third characteristic data constitute the third user behavior data;
the first prediction model is an RNN model, the RNN model comprises an input layer, a plurality of hidden layers and an output layer, and each hidden layer is a GRU unit; the second prediction model is a logistic regression model.
In an embodiment of the present invention, the user operation log stream includes user basic information, user behavior information, and device information of a user.
In an embodiment of the present invention, the first prediction model and the second prediction model are trained according to sample data, where the sample data includes user behavior data and a user registration status.
In an embodiment of the invention, the importance value of the first user behavior data is calculated by variance estimation to distinguish the first user behavior data into first feature data and second feature data.
In an embodiment of the invention, the importance value of the first user behavior data is calculated by an xgboost algorithm to distinguish the first user behavior data into first characteristic data and second characteristic data.
In an embodiment of the invention, the importance value of the first user behavior data is calculated by cross entropy to distinguish the first user behavior data into first feature data and second feature data.
According to another aspect of the present invention, there is provided a registration probability estimating apparatus, including:
the acquisition module is used for acquiring first user behavior data according to the user operation log stream;
a first prediction model module, configured to input the first user behavior data into a trained first prediction model, and obtain data of multiple hidden layers of the first prediction model as second user behavior data, where the first prediction model is an RNN model, the RNN model includes an input layer, multiple hidden layers, and an output layer, and each hidden layer is a GRU unit;
the data construction module is used for carrying out cross construction on part of the first user behavior data according to the calculated importance value to obtain third user behavior data;
the data processing module is used for splicing the second user behavior data and the third user behavior data to obtain fourth user behavior data;
the second prediction model module is used for inputting the fourth user behavior data into a second prediction model, and taking the output of the second prediction model as a predicted value of the registration probability of the user; the second prediction model is a logistic regression model;
the registration probability estimating apparatus is further configured such that carrying out cross construction on part of the first user behavior data according to the calculated importance value to obtain the third user behavior data further comprises:
distinguishing the first user behavior data into first characteristic data and second characteristic data according to the calculated importance value;
performing cross construction on the second characteristic data to form third characteristic data;
the first characteristic data and the third characteristic data constitute the third user behavior data.
According to a further aspect of the invention, a storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, performs the method as described above.
According to yet another aspect of the present invention, an electronic device is provided. The electronic device includes: a processor; a storage medium having stored thereon a computer program which, when executed by the processor, performs the method as described above.
The registration probability estimation method provided by the invention combines a recurrent neural network with traditional feature extraction. It acquires the behavior data of the user in real time from the user operation log stream, which keeps result feedback fast, and it models user behavior within an algorithm framework that extends well, so that the probability of user actions such as registration, purchase, and click can be effectively predicted.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In order to overcome the defects of the prior art, the invention provides a registration probability estimation method, an apparatus, a storage medium, and an electronic device, so as to effectively predict the probability of user actions such as registration, purchase, and click, where the registration probability reflects the user's degree of preference for a given app. Fig. 1 is a flowchart of a registration probability estimation method according to an embodiment of the present invention. Fig. 2 is a flowchart of a registration probability estimation method according to another embodiment of the present invention. Fig. 3 is a schematic structural diagram of a registration probability estimating apparatus according to an embodiment of the invention. Fig. 4 is a schematic structural diagram of a registration probability estimating apparatus according to another embodiment of the present invention. Fig. 5 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention. Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
According to an aspect of the present invention, there is provided a registration probability estimation method, as shown in fig. 1, the registration probability estimation method includes:
S110, acquiring first user behavior data according to the user operation log stream.
Specifically, the user operation log stream records a plurality of items of original feature data, which are generally aggregated from historical user behavior information, user basic information, user device information, and the like. The first user behavior data is generally obtained by preprocessing the original feature data, and may specifically include the user's device type, the user's number of page views within seven days, the user's frequent logins, and the like.
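The preprocessing in step S110 can be sketched as follows. This is only an illustrative sketch: the record layout, the field names (`user_id`, `event`, `device`, `days_ago`), and the helper `build_first_behavior_data` are assumptions introduced here and do not appear in the original text.

```python
from collections import Counter

# Hypothetical raw log records; every field name is an assumption for illustration.
log_stream = [
    {"user_id": "u1", "event": "browse", "device": "android", "days_ago": 1},
    {"user_id": "u1", "event": "browse", "device": "android", "days_ago": 3},
    {"user_id": "u1", "event": "login", "device": "android", "days_ago": 9},
    {"user_id": "u2", "event": "browse", "device": "ios", "days_ago": 2},
]


def build_first_behavior_data(records, user_id):
    """Aggregate one user's raw log records into a flat feature dictionary."""
    mine = [r for r in records if r["user_id"] == user_id]
    devices = Counter(r["device"] for r in mine)
    return {
        # Most frequently seen device type for this user.
        "device_type": devices.most_common(1)[0][0] if devices else None,
        # Number of browse events within the last seven days.
        "browse_count_7d": sum(
            1 for r in mine if r["event"] == "browse" and r["days_ago"] <= 7
        ),
        # Total number of login events in the log.
        "login_count": sum(1 for r in mine if r["event"] == "login"),
    }


features_u1 = build_first_behavior_data(log_stream, "u1")
```

In a real deployment the records would arrive as a stream rather than a list, and the aggregation windows would be chosen to suit the application.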
S120, inputting the first user behavior data into a trained first prediction model, and acquiring data of a plurality of hidden layers of the first prediction model as second user behavior data.
Specifically, the first user behavior data may be input directly into the first prediction model after preprocessing. In an embodiment of the present invention, the first prediction model is an RNN model that includes an input layer, a plurality of hidden layers, and an output layer, and each hidden layer is a GRU unit. The RNN model is a recurrent neural network model, whose principle is to add time-series characteristics to the neural network model: a feedback edge is added to the hidden layer, so that the input of each hidden layer includes both the features of the current sample and the information carried over from the previous time step. Each GRU unit contains two gates, a reset gate and an update gate. The output of each gate is passed through a sigmoid function, so its values lie in the range [0, 1]. The candidate hidden state uses the reset gate to control the inflow of the previous hidden state, which contains past time-series information. If the reset gate is approximately 0, the previous hidden state is discarded. The reset gate thus provides a mechanism for discarding past hidden states that are irrelevant to the future; that is, the reset gate decides how much past information is forgotten. The hidden state H_t is obtained by using the update gate Z_t to combine the previous hidden state H_{t-1} with the candidate hidden state. The update gate controls how important the past hidden state is at the current time step. If the update gate is always approximately 1, the past hidden state is preserved over time and passed on to the current time step. This design mitigates the vanishing gradient problem in recurrent neural networks and better captures dependencies that span long intervals in time-series data. The reset gate helps capture short-term dependencies in the time-series data, and the update gate helps capture long-term dependencies.
The parameters of the recurrent network's GRU and of the LR model are updated offline according to user operation data, user click-stream data, and the result of whether the user actually registered, all of which are stored offline in HDFS. HDFS is the Hadoop Distributed File System, a distributed file system designed to run on general-purpose (commodity) hardware.
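The gate behavior described above can be illustrated with a single GRU time step in plain Python. This is a minimal sketch under stated assumptions: the parameter names (`W_r`, `U_r`, `b_r`, and so on), the constant toy weights, and the helpers `make_params` and `run_gru` are hypothetical, and the update rule follows the convention in the text, where an update gate near 1 keeps the previous hidden state.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, p):
    """One GRU time step; p holds hypothetical weights W_*, U_* and biases b_*."""
    # Reset gate R_t and update gate Z_t, both squashed into [0, 1].
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_r"], x), matvec(p["U_r"], h_prev), p["b_r"])]
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_z"], x), matvec(p["U_z"], h_prev), p["b_z"])]
    # The reset gate scales how much of H_{t-1} enters the candidate state.
    gated_prev = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(a + b + c) for a, b, c in
              zip(matvec(p["W_h"], x), matvec(p["U_h"], gated_prev), p["b_h"])]
    # H_t = Z_t * H_{t-1} + (1 - Z_t) * candidate.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


def make_params(n_in, n_hid, update_bias=0.0):
    """Toy constant parameters, assumed for illustration only."""
    W = [[0.1] * n_in for _ in range(n_hid)]
    U = [[0.1] * n_hid for _ in range(n_hid)]
    zeros = [0.0] * n_hid
    return {"W_r": W, "U_r": U, "b_r": zeros,
            "W_z": W, "U_z": U, "b_z": [update_bias] * n_hid,
            "W_h": W, "U_h": U, "b_h": zeros}


def run_gru(sequence, params, n_hid):
    """Return the hidden state after every time step."""
    h, states = [0.0] * n_hid, []
    for x in sequence:
        h = gru_step(x, h, params)
        states.append(h)
    return states


hidden_states = run_gru([[1, 0, 1], [0, 1, 0]], make_params(3, 2), n_hid=2)
# With a large update-gate bias, Z_t is close to 1 and the past state is kept.
kept = gru_step([1, 0, 1], [0.3, -0.2], make_params(3, 2, update_bias=50.0))
```

The collected `hidden_states` play the role of the hidden-layer data that step S120 takes as the second user behavior data; a trained model would of course use learned rather than constant weights.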
S130, carrying out cross construction on part of the first user behavior data according to the calculated importance value to obtain third user behavior data.
Since the first user behavior data includes a plurality of kinds of information, it is necessary to distinguish the importance of the plurality of kinds of information. Specifically, the importance values of the types of data in the first user behavior data may be calculated and distinguished through variance estimation, an xgboost algorithm, cross entropy, and the like.
S140, splicing the second user behavior data and the third user behavior data to obtain fourth user behavior data.
Specifically, if the second user behavior data is [1,0,1,0,0] and the third user behavior data is [0,0,0,1,1], the fourth user behavior data obtained by splicing [1,0,1,0,0] and [0,0,0,1,1] is [1,0,1,0,0,0,0,0,1,1]. Of course, the fourth user behavior data may also be calculated from the second user behavior data and the third user behavior data in other manners, which is not limited by the invention.
S150, inputting the fourth user behavior data into a second prediction model, and taking the output of the second prediction model as a predicted value of the registration probability of the user.
In an embodiment of the invention, the second prediction model is a logistic regression model. The first prediction model and the second prediction model are trained on sample data, which includes user behavior data and the user registration state. The logistic regression model is a common classification model in machine learning and is mainly used for binary classification problems. It maps the feature space to a probability; the dependent variable in the logistic regression model is a qualitative variable taking values in {0, 1}, and the model is mainly used to study the probability that certain events occur.
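The splicing in step S140 amounts to vector concatenation. The following sketch reproduces the worked example above; the function name `splice` is chosen here for illustration.

```python
def splice(second_user_behavior_data, third_user_behavior_data):
    """Concatenate the two feature vectors into the fourth user behavior data."""
    return list(second_user_behavior_data) + list(third_user_behavior_data)


fourth_user_behavior_data = splice([1, 0, 1, 0, 0], [0, 0, 0, 1, 1])
```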
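As a rough illustration of how a logistic regression model maps a spliced feature vector to a probability, the sketch below fits a model by plain stochastic gradient descent on a toy data set. The training routine, the learning rate, the number of epochs, and the four sample vectors are all assumptions made for this example; the original text does not specify how the model is trained.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def predict_proba(weights, bias, features):
    """Map a feature vector to a probability in (0, 1)."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def fit_logistic_regression(samples, labels, lr=0.5, epochs=200):
    """Per-sample gradient descent on the cross-entropy loss."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            # Gradient of the cross-entropy loss with respect to the logit.
            err = predict_proba(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


# Toy fourth user behavior data with registration labels (1 = registered).
samples = [
    [1, 0, 1, 0, 0, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 1, 1, 0, 0, 0],
]
labels = [1, 0, 1, 0]

weights, bias = fit_logistic_regression(samples, labels)
p_registered = predict_proba(weights, bias, samples[0])
p_not_registered = predict_proba(weights, bias, samples[1])
```

The returned value is read directly as the predicted registration probability; thresholding it (for example at 0.5) would turn it into a yes/no decision if one is needed.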
Since the first user behavior data includes a plurality of kinds of information, it is necessary to distinguish the importance of the plurality of kinds of information. FIG. 2 is a flowchart of a method for estimating registration probability according to another embodiment of the present invention. As shown in fig. 2, in another embodiment of the present invention, step S130 further includes:
S1310, dividing the first user behavior data into first feature data and second feature data according to the calculated importance value.
S1320, performing cross construction on the second feature data, whose importance values meet a preset requirement, to form third feature data, while the first feature data, whose importance values do not meet the preset requirement, are kept unchanged. For example, suppose there are two types of second feature data whose importance values meet the preset requirement: age (divided into two groups, over 20 years old and under 20 years old) and gender (divided into two groups, male and female). Cross construction of these two types of second feature data yields four groups of third feature data: over 20 and male, over 20 and female, under 20 and male, and under 20 and female.
S1330, forming the third user behavior data from the first feature data and the third feature data. This avoids the situation in which a large amount of user information cannot be completely acquired.
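The age and gender example of steps S1320 and S1330 can be sketched as a Cartesian product followed by one-hot encoding. The group labels, the helper names, and the single unchanged first feature value are assumptions introduced for this illustration.

```python
from itertools import product


def cross_construct(groups):
    """Build every combination of categories across the given features."""
    names = list(groups)
    return [dict(zip(names, combo))
            for combo in product(*(groups[n] for n in names))]


def encode(sample, crossed):
    """One-hot encode a sample against the crossed feature groups."""
    return [1 if all(sample[k] == v for k, v in combo.items()) else 0
            for combo in crossed]


# Second feature data: the two features selected for cross construction.
second_feature_groups = {
    "age": ["over_20", "under_20"],
    "gender": ["male", "female"],
}
third_feature_groups = cross_construct(second_feature_groups)

user = {"age": "over_20", "gender": "female"}
third_feature_data = encode(user, third_feature_groups)

# First feature data is kept unchanged (a hypothetical browse count here).
first_feature_data = [3]
third_user_behavior_data = first_feature_data + third_feature_data
```

The number of crossed groups grows multiplicatively with the number of features and categories, which is one reason to restrict cross construction to a selected subset of the features.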
Further, an importance value of the first user behavior data may be calculated by variance estimation to distinguish the first user behavior data into first feature data and second feature data.
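One possible reading of variance-based importance is sketched below: the population variance of each feature column is used as its importance value, and a threshold decides which features go into the second feature data. The threshold value, the rule that high variance meets the preset requirement, and the toy columns are assumptions; the original text does not state the exact criterion.

```python
from statistics import pvariance


def split_by_variance(columns, threshold):
    """Use each column's population variance as its importance value."""
    importance = {name: pvariance(values) for name, values in columns.items()}
    # Assumed rule: features at or above the threshold meet the preset requirement.
    first = [n for n, v in importance.items() if v < threshold]
    second = [n for n, v in importance.items() if v >= threshold]
    return importance, first, second


# Toy feature columns, one value per user.
columns = {
    "is_android": [1, 1, 1, 1],
    "age_over_20": [1, 0, 1, 0],
    "gender_male": [0, 1, 1, 0],
}
importance, first_feature_names, second_feature_names = split_by_variance(
    columns, threshold=0.1
)
```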
Optionally, an importance value of the first user behavior data is calculated by an xgboost algorithm to distinguish the first user behavior data into first feature data and second feature data. xgboost performs a second-order Taylor expansion of the loss function and adds a regularization term to the objective function, so that the optimal solution is sought as a whole while balancing the decrease of the objective function against model complexity, which avoids overfitting. The invention calculates the importance value of the first user behavior data through the importance value algorithm (import) in xgboost.
Optionally, an importance value of the first user behavior data is calculated by cross entropy to distinguish the first user behavior data into first feature data and second feature data. Cross entropy can be used as a loss function in neural networks (machine learning). Assume that there are two probability distributions p and q over a sample set, where p is the distribution of the true labels and q is the distribution of the labels predicted by the trained model; the cross-entropy loss function measures the similarity of p and q. The first user behavior data is thus subjected to binary classification by calculating the similarity between items of the first user behavior data, and the importance value of each item of first user behavior data is determined to be the maximum or the minimum according to the classification result. An advantage of cross entropy as the loss function is that, when gradient descent is performed with a sigmoid function, the slowdown in learning that affects the mean squared error loss function can be avoided, because the learning rate is controlled by the output error. The sigmoid function is an S-shaped function common in biology, also called the S-shaped growth curve. In information science, because the function and its inverse are both monotonically increasing, the sigmoid function is often used as the threshold function of a neural network, mapping variables to values between 0 and 1.
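For reference, the second-order approximation usually given for the xgboost objective at boosting round t can be written as follows. The notation is introduced here and is not taken from the original text: l is the loss, f_t is the tree added at round t, g_i and h_i are the first and second derivatives of the loss with respect to the previous prediction, T is the number of leaves, w is the vector of leaf weights, and γ and λ are regularization coefficients.

```latex
\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \Big[\, l\big(y_i, \hat{y}_i^{(t-1)}\big)
    + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \Big] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\, \lambda\, \lVert w \rVert^2,

g_i = \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^2 l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \big(\hat{y}_i^{(t-1)}\big)^2}.
```

The regularization term Ω penalizes trees with many leaves or large leaf weights, which is the balance between objective reduction and model complexity described above.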
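The two properties of cross entropy mentioned above, measuring how close q is to p and avoiding the learning slowdown of the mean squared error loss under a sigmoid output, can be checked numerically with the sketch below. The function names and the sample values are chosen here for illustration; the sketch does not show how importance values are derived from the classification result, which the original text does not detail.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i); smaller means q is closer to p."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))


def grad_mse(logit, y):
    """Derivative of 0.5 * (sigmoid(logit) - y)^2 with respect to the logit."""
    a = sigmoid(logit)
    return (a - y) * a * (1.0 - a)


def grad_cross_entropy(logit, y):
    """Derivative of the binary cross-entropy loss with respect to the logit."""
    return sigmoid(logit) - y


true_labels = [1.0, 0.0]
good_loss = cross_entropy(true_labels, [0.9, 0.1])
bad_loss = cross_entropy(true_labels, [0.1, 0.9])

# A badly wrong, saturated prediction: logit of -5 when the true label is 1.
mse_gradient = grad_mse(-5.0, 1.0)
ce_gradient = grad_cross_entropy(-5.0, 1.0)
```

For the saturated prediction, the mean squared error gradient is shrunk by the factor a(1 - a) of the sigmoid derivative, while the cross-entropy gradient stays proportional to the output error.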
According to another aspect of the present invention, a registration probability estimation apparatus is provided. Fig. 3 is a schematic structural diagram of the registration probability estimation apparatus according to an embodiment of the present invention. As shown in fig. 3, the registration probability estimating apparatus 200 includes an obtaining module 201, a first prediction model module 202, a data constructing module 203, a data processing module 204, and a second prediction model module 205. The obtaining module 201 is configured to obtain first user behavior data according to a user operation log stream. The first prediction model module 202 is configured to input the first user behavior data into a trained first prediction model, and obtain data of a plurality of hidden layers of the first prediction model as second user behavior data. The data constructing module 203 is configured to cross-construct part of the first user behavior data according to the calculated importance value to obtain third user behavior data. The data processing module 204 is configured to splice the second user behavior data and the third user behavior data to obtain fourth user behavior data. The second prediction model module 205 is configured to input the fourth user behavior data into a second prediction model, and to take the output of the second prediction model as a predicted value of the registration probability of the user. In this embodiment, the functions of the modules of the registration probability estimation apparatus, and the specific steps and principles from the obtaining module 201 obtaining the first user behavior data through to the second prediction model module 205 obtaining the predicted value of the registration probability of the user, have been described in the above embodiments and are not repeated here.
The invention combines a recurrent neural network with traditional feature extraction. It acquires the behavior data of the user in real time from the user operation log stream, which keeps result feedback fast, and it models user behavior within an algorithm framework that extends well, so that the probability of user actions such as registration, purchase, and click can be effectively predicted.
Fig. 4 is a schematic structural diagram of a registration probability estimating apparatus according to another embodiment of the present invention. As shown in fig. 4, the registration probability estimating apparatus 200 likewise includes an obtaining module 201, a first prediction model module 202, a data constructing module 203, a data processing module 204, and a second prediction model module 205. In addition, the data constructing module 203 further includes a distinguishing module 2031, a cross construction module 2032, and a data integration module 2033. The obtaining module 201 is configured to obtain first user behavior data according to a user operation log stream. The first prediction model module 202 is configured to input the first user behavior data into a trained first prediction model, and obtain data of a plurality of hidden layers of the first prediction model as second user behavior data. The data constructing module 203 is configured to cross-construct part of the first user behavior data according to the calculated importance value to obtain third user behavior data. The data processing module 204 is configured to splice the second user behavior data and the third user behavior data to obtain fourth user behavior data. The second prediction model module 205 is configured to input the fourth user behavior data into a second prediction model, and to take the output of the second prediction model as a predicted value of the registration probability of the user. The distinguishing module 2031 is configured to distinguish the first user behavior data into first feature data and second feature data according to the calculated importance value. The cross construction module 2032 is configured to cross-construct the second feature data, whose importance values meet a preset requirement, to form third feature data. The data integration module 2033 is configured to form the third user behavior data from the first feature data and the third feature data.
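The way the modules of fig. 4 hand data to one another can be sketched as a small composition of callables. This is an illustrative sketch only: the class name, the field names, and the stand-in lambdas below are assumptions, and each lambda merely takes the place of the real module whose reference numeral appears in the comment.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


@dataclass
class RegistrationProbabilityEstimator:
    acquire: Callable[[Sequence[dict]], Vector]                 # module 201
    hidden_states: Callable[[Vector], Vector]                   # module 202
    distinguish: Callable[[Vector], Tuple[Vector, Vector]]      # module 2031
    cross_construct: Callable[[Vector], Vector]                 # module 2032
    second_model: Callable[[Vector], float]                     # module 205

    def estimate(self, log_stream: Sequence[dict]) -> float:
        first = self.acquire(log_stream)
        second = self.hidden_states(first)
        first_features, second_features = self.distinguish(first)
        third_features = self.cross_construct(second_features)
        # Module 2033: first feature data plus third feature data.
        third = list(first_features) + list(third_features)
        # Module 204: splice second and third user behavior data.
        fourth = list(second) + third
        return self.second_model(fourth)


# Stand-in callables, assumed purely for demonstration.
estimator = RegistrationProbabilityEstimator(
    acquire=lambda logs: [float(len(logs)), 1.0, 0.0],
    hidden_states=lambda v: [0.5, -0.5],
    distinguish=lambda v: (v[:1], v[1:]),
    cross_construct=lambda v: [v[0] * v[1]],
    second_model=lambda v: 1.0 / (1.0 + math.exp(-sum(v))),
)
probability = estimator.estimate([{"event": "browse"}, {"event": "login"}])
```

Keeping each module behind a plain callable makes it straightforward to swap one importance method or model for another without touching the rest of the flow.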
In an exemplary embodiment of the present invention, a computer-readable storage medium is further provided, on which a computer program is stored, which when executed by, for example, a processor, can implement the registration probability estimation method in any of the above embodiments. In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product including program code for causing a terminal device to perform the methods according to various exemplary embodiments of the present invention described in the above registration probability estimation methods of this specification when the program product is run on the terminal device.
Fig. 5 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention. Fig. 5 depicts a program product 300 for implementing the above-described method according to an embodiment of the invention, which may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product 300 may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In an exemplary embodiment of the invention, there is also provided an electronic device that may include a processor and a memory for storing executable instructions of the processor. The processor is configured to execute the registration probability estimation method in any of the above embodiments via execution of the executable instructions.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 400 according to this embodiment of the invention is described below with reference to fig. 6. The electronic device 400 shown in fig. 6 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, electronic device 400 is embodied in the form of a general purpose computing device. The components of electronic device 400 may include, but are not limited to: at least one processing unit 410, at least one storage unit 420, a bus 430 that connects the various system components (including the storage unit 420 and the processing unit 410), a display unit 440, and the like.
The storage unit 420 stores program code executable by the processing unit 410 to cause the processing unit 410 to perform the steps according to various exemplary embodiments of the present invention described in the registration probability estimation method section above in this specification. For example, the processing unit 410 may perform the steps as shown in fig. 1.
The storage unit 420 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 4201 and/or a cache memory unit 4202, and may further include a read only memory unit (ROM) 4203.
The storage unit 420 may also include a program/utility 4204 having a set (at least one) of program modules 4205, such program modules 4205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 430 may be any bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 400 may also communicate with one or more external devices 500 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 400, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 400 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 450. Also, the electronic device 400 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 460. The network adapter 460 may communicate with other modules of the electronic device 400 via the bus 430. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the registration probability estimation method according to the embodiment of the present invention.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.