Disclosure of Invention
The embodiment of the application provides a method and a device for user portrait prediction based on artificial intelligence, which are used for solving the problem of poor user portrait prediction capability in the prior art.
The embodiment of the invention provides a user portrait prediction method based on artificial intelligence, which comprises the following steps:
collecting operation behavior data flow of a user in a commodity purchasing process;
extracting features from the operation behavior data stream to generate a training sample;
training on the training samples with a fusion model of a gradient boosting decision tree (GBDT) and logistic regression (LR), and outputting trained portrait prediction parameters;
predicting the user portrait based on the portrait prediction parameters and natural attribute parameters of the user.
Optionally, training on the training samples with the fusion model of the gradient boosting decision tree (GBDT) and logistic regression (LR), and outputting the trained portrait prediction parameters, includes:
dividing the training samples into N groups of training feature sets according to N commodity categories, wherein N is a positive integer greater than 1;
dividing each group of training feature sets into three training sets according to the user clicking operation behavior feature, the user purchasing operation behavior feature and the user and customer service conversation behavior feature, and respectively establishing corresponding GBDT trees;
respectively traversing GBDT trees corresponding to the three training sets, and outputting three groups of GBDT training sets;
and taking the three groups of GBDT training sets as input of an LR model, training on the three groups of GBDT training sets with the LR model, and outputting the trained portrait prediction parameters.
Optionally, the separately traversing the GBDT trees corresponding to the three training sets includes:
setting a training set x, a loss function L, a depth D of the GBDT tree and a number of iterations M, and initializing a weak learner;
calculating a negative gradient $r_{im}$, i = 1, 2, ..., n, for each of the n samples in the training set;
using the negative gradients as new sample targets, and using the data set $(x_i, r_{im})$ as training data to determine the next GBDT tree $f_m(x)$, wherein the leaf node regions of this tree are $R_{jm}$, j = 1, 2, ..., J, and J is the number of leaf nodes of this tree;
calculating the best fitting value of the leaf node area;
and updating the strong learner, and obtaining the final learner after M rounds of iteration.
Optionally, the training of the three groups of GBDT training sets using an LR model includes:
setting a loss function L1, a step length a, a maximum number of iterations $M_{max}$ and an error limit t;
initializing a portrait prediction parameter $c = \{c_0, c_1, c_2, ..., c_k\}$;
Inputting the three groups of GBDT training sets to carry out portrait prediction parameter iteration, respectively judging whether the errors of the three groups of GBDT training sets are smaller than t in each iteration process, terminating the training if the errors of the three groups of GBDT training sets are smaller than t, and updating the portrait prediction parameter c if the errors of the three groups of GBDT training sets are larger than or equal to t;
and outputting the final portrait prediction parameter c after the iteration is finished.
Optionally, the portrait prediction parameter c is used for predicting whether the user clicks a commodity page, and/or predicting whether the user purchases commodities, and/or predicting whether the user communicates with the customer service frequently.
Optionally, the extracting features from the operation behavior data stream includes:
acquiring a Pearson correlation coefficient between every two parameters;
and if the Pearson correlation coefficient exceeds a preset threshold value, deleting one of the two parameters corresponding to the Pearson correlation coefficient.
Optionally, the natural attributes of the user include a user gender, a user age, and a user interest, and predicting the user portrait based on the portrait prediction parameters and the natural attribute parameters of the user includes:
acquiring a user portrait template library, wherein the user portrait template library comprises a plurality of user portraits and corresponding characteristic values;
setting different weights for different portrait prediction parameters and natural attribute parameters of a user, and performing weighting operation to determine a user portrait characteristic value;
and traversing the characteristic values in the user portrait template library, and determining the characteristic value in the template library that has the minimum difference from the user portrait characteristic value, so that the user portrait corresponding to that characteristic value in the template library is the predicted user portrait.
The embodiment of the invention also provides a device for user portrait prediction based on artificial intelligence, which comprises:
the acquisition unit is used for acquiring the operation behavior data flow of the user in the commodity purchasing process;
the characteristic extraction unit is used for extracting characteristics from the operation behavior data stream to generate a training sample;
the training unit is used for training on the training samples with a fusion model of a gradient boosting decision tree (GBDT) and logistic regression (LR) and outputting trained portrait prediction parameters;
a prediction unit used for predicting the user portrait based on the portrait prediction parameters and the natural attribute parameters of the user.
Optionally, the training unit training on the training samples with the fusion model of the gradient boosting decision tree (GBDT) and logistic regression (LR), and outputting the trained portrait prediction parameters, includes:
dividing the training samples into N groups of training feature sets according to N commodity categories, wherein N is a positive integer greater than 1;
dividing each group of training feature sets into three training sets according to the user clicking operation behavior feature, the user purchasing operation behavior feature and the user and customer service conversation behavior feature, and respectively establishing corresponding GBDT trees;
respectively traversing GBDT trees corresponding to the three training sets, and outputting three groups of GBDT training sets;
and taking the three groups of GBDT training sets as input of an LR model, training on the three groups of GBDT training sets with the LR model, and outputting the trained portrait prediction parameters.
The embodiment of the invention also provides a device which comprises a memory and a processor, wherein the memory is stored with computer executable instructions, and the processor realizes the method when running the computer executable instructions on the memory.
According to the method provided by the embodiment of the invention, the operation behavior of the user is analyzed through the GBDT and LR fusion model, the user portrait prediction parameter is output, the user portrait prediction parameter and the user natural attribute parameter are combined and analyzed, the user portrait prediction is finally obtained, and the user portrait prediction capability and the user portrait prediction accuracy are improved.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
FIG. 1 is a flow chart of a method for user portrait prediction based on artificial intelligence according to an embodiment of the present invention. As shown in FIG. 1, the method comprises:
S101, collecting an operation behavior data stream of a user in a commodity purchasing process;
In the embodiment of the invention, the e-commerce platform is provided with a plurality of cloud servers. The cloud servers acquire operation behavior data streams from the user's clicking, operating and purchasing on the e-commerce web pages, and the data streams comprise real-time data streams and offline data streams.
In the embodiment of the invention, a user portrait is built for each category of commodity, and commodities are recommended based on the user portrait. Therefore, taking one category of commodity as an example, data streams of the user's click, purchase and customer service conversation behaviors for that commodity are obtained. The data streams cover many data types, such as commodity number, commodity brand, commodity store, number of clicks, and the duration, frequency and content data size of customer service conversations.
S102, extracting features from the operation behavior data stream to generate a training sample;
For the data in the data stream, relatively important and relatively independent data (that is, data with low correlation between features) need to be selected as features to generate the training samples.
To reduce feature redundancy, the embodiment of the present invention also prunes redundant data. For example, a Pearson correlation coefficient between every two parameters in the data stream is acquired, and if the Pearson correlation coefficient exceeds a preset threshold value, one of the two corresponding parameters is deleted. The Pearson correlation coefficient r reflects the degree of linear correlation between two parameters and takes values in [-1, 1], where 1 represents a complete positive correlation, 0 represents no linear relationship, and -1 represents a complete negative correlation, that is, one parameter increases while the other decreases. The closer the correlation coefficient is to 0, the weaker the correlation. The calculation formula is as follows:
$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i}(X_i - \bar{X})^2}\,\sqrt{\sum_{i}(Y_i - \bar{Y})^2}}$$
wherein X and Y represent the two continuous parameters of a pair, and $\bar{X}$ and $\bar{Y}$ are their mean values.
The correlation is judged as follows:
Generally, |r| > 0.8 indicates high correlation, 0.4 ≤ |r| < 0.8 indicates moderate correlation, and |r| < 0.4 indicates low correlation. For two highly correlated parameters, the less important one can be removed to reduce data redundancy. For example, the duration of a conversation between the user and customer service is typically positively correlated with the content size: the longer the conversation, the more content is exchanged, so feature extraction only needs to be performed on the duration.
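The pruning rule above can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the column names, the sample values and the greedy "keep the first-listed parameter" policy are assumptions made for the example, and the 0.8 threshold is taken from the high-correlation bound stated above.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here: a constant column is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def prune_correlated(columns: Dict[str, List[float]], threshold: float = 0.8) -> List[str]:
    """Keep a parameter only if |r| against every already-kept parameter is <= threshold.

    Parameters listed earlier are assumed to be the more important ones.
    """
    kept: List[str] = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical customer-service and click data for four users.
columns = {
    "session_duration_s": [60.0, 120.0, 300.0, 600.0],
    "content_size_bytes": [500.0, 1100.0, 2900.0, 6100.0],
    "click_count": [9.0, 2.0, 7.0, 3.0],
}
kept = prune_correlated(columns)
print(kept)  # ['session_duration_s', 'click_count']
```

In this example the content size is dropped because it is almost perfectly correlated with the session duration, which mirrors the duration/content-size case described above.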
S103, training on the training samples with a fusion model of a gradient boosting decision tree (GBDT) and logistic regression (LR), and outputting trained portrait prediction parameters;
Because the features of the model relate to operation behavior features and marketing activity participation behavior features, and the feature dimension is very high, an LR (logistic regression) algorithm is adopted. However, the learning capacity of the LR model is limited, so a large amount of feature engineering is normally required to extract effective features and feature combinations and thereby improve the nonlinear learning capacity of the model. GBDT (Gradient Boosting Decision Tree) is an iterative decision tree algorithm and a member of the boosting family of ensemble learning. It has the advantages of high classification accuracy and good generalization capability, and is a commonly used nonlinear model. Following the boosting idea in ensemble learning, each iteration establishes a new decision tree in the gradient direction that reduces the residual, so the number of decision trees generated equals the number of iterations. GBDT can therefore discover distinctive features and feature combinations automatically, which greatly saves the time and labor cost of feature engineering. For these reasons, a fusion algorithm of GBDT and LR is selected for the user portrait prediction model.
The basic idea of GBDT is as follows: based on the forward stagewise algorithm, each iteration is computed so as to reduce the residual of the previous iteration. To eliminate the residual, a new model is built in the gradient direction along which the residual decreases. In gradient boosting, therefore, the goal of each new model is to push the residual of the previous model down along the gradient direction, which differs markedly from traditional boosting algorithms that reweight correctly and incorrectly classified samples. As a result, GBDT can reach a high accuracy with relatively little parameter tuning. In addition, GBDT can adopt robust loss functions, which makes it highly robust to abnormal data.
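The residual-fitting idea can be illustrated with a small sketch. It assumes a squared-error loss (so the negative gradient equals the plain residual), one-dimensional inputs and depth-1 trees; the function names `fit_stump` and `boost` and the sample data are hypothetical and chosen only for this example.

```python
from typing import Callable, List, Tuple

Stump = Tuple[float, float, float]  # (threshold, left value, right value)


def fit_stump(x: List[float], r: List[float]) -> Stump:
    """Fit a depth-1 regression tree to targets r by minimising squared error."""
    best_sse, best = float("inf"), None
    for t in sorted(set(x))[:-1]:  # every split that leaves both sides non-empty
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lv) ** 2 for ri in left) + sum((ri - rv) ** 2 for ri in right)
        if sse < best_sse:
            best_sse, best = sse, (t, lv, rv)
    if best is None:
        raise ValueError("x needs at least two distinct values")
    return best


def boost(x: List[float], y: List[float], rounds: int = 10) -> Tuple[Callable[[float], float], List[float]]:
    """Each round fits a new stump to the residual left by the previous rounds."""
    f0 = sum(y) / len(y)  # initial constant model
    pred = [f0] * len(y)
    stumps: List[Stump] = []
    history = [sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)]
    for _ in range(rounds):
        residual = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient of squared error
        t, lv, rv = fit_stump(x, residual)
        stumps.append((t, lv, rv))
        pred = [pi + (lv if xi <= t else rv) for xi, pi in zip(x, pred)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))

    def predict(v: float) -> float:
        return f0 + sum(lv if v <= t else rv for t, lv, rv in stumps)

    return predict, history


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
predict, history = boost(x, y, rounds=10)
print(history[0], history[-1])  # training error before and after boosting
```

The training error recorded in `history` never increases from one round to the next, because each stump is the best squared-error fit to the current residual.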
The GBDT and LR fusion model is composed of two parts, wherein GBDT is used for extracting features from a training set to serve as new training input data, and LR is used as a classifier of the new training input data.
FIG. 2 is a schematic diagram of the GBDT + LR model structure. Assume that the GBDT has two weak classifiers, drawn as a hollow part and a solid part respectively, where the hollow weak classifier has 3 leaf nodes and the solid weak classifier has 2 leaf nodes. If a sample falls on the second leaf node of the hollow weak classifier and also on the second leaf node of the solid weak classifier, the 0/1 prediction result of the hollow weak classifier is recorded as [0 1 0] and that of the solid weak classifier as [0 1]. The output of the GBDT is the concatenation of these weak classifier results, [0 1 0 0 1], which is a sparse vector (array).
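The encoding described for FIG. 2 can be sketched as a concatenation of one-hot vectors, one per tree. The function name `encode_leaves` and the use of 0-based leaf indices are assumptions made for this illustration.

```python
from typing import List, Sequence


def encode_leaves(leaf_indices: Sequence[int], leaves_per_tree: Sequence[int]) -> List[int]:
    """Concatenate one one-hot vector per tree; leaf_indices are 0-based."""
    if len(leaf_indices) != len(leaves_per_tree):
        raise ValueError("one leaf index is required per tree")
    encoded: List[int] = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        if not 0 <= leaf < n_leaves:
            raise ValueError(f"leaf index {leaf} out of range for a tree with {n_leaves} leaves")
        one_hot = [0] * n_leaves
        one_hot[leaf] = 1  # mark the leaf the sample fell on
        encoded.extend(one_hot)
    return encoded


# FIG. 2 example: the sample falls on the second leaf (index 1) of both trees,
# which have 3 and 2 leaves respectively.
vector = encode_leaves([1, 1], [3, 2])
print(vector)  # [0, 1, 0, 0, 1]
```

The length of the encoded vector equals the total number of leaves across all trees, and exactly one position per tree is set to 1.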
After the new training data is constructed, the new training data is input into the LR classifier as input training set data for the training of the final classifier.
Specifically, in the embodiment of the present invention, the specific steps of training on the training samples may be:
S1031, dividing the training samples into N groups of training feature sets according to N commodity categories, wherein N is a positive integer greater than 1;
Since the final predicted user portrait is strongly related to a specific category of commodity, the total training samples need to be divided into N different training feature sets according to the N commodity categories (commodity category IDs), where each training feature set corresponds to one commodity category ID, and user portrait prediction is performed for each category of commodity.
S1032, dividing each group of training feature sets into three training sets according to the user clicking operation behavior feature, the user purchasing operation behavior feature and the user and customer service conversation behavior feature, and respectively establishing corresponding GBDT trees;
In the process of commodity purchasing, three operation behaviors of the user are typical, and the overall operation behavior of the user can be analyzed based on them. The three operation behavior features are the user click operation behavior features (such as number of clicks, web page stay time and number of recommendations), the user purchase operation behavior features (such as purchase price and purchase time) and the user and customer service conversation behavior features (such as conversation content, emotion, frequency and duration). The user portrait can be derived from these three typical operation behaviors; for example, the user is interested in the commodity (number of clicks), the user has an intention to pay (historical payment records), or the user is demanding about the commodity (customer service conversation duration). A corresponding GBDT tree is then established for each of the three training sets.
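Steps S1031 and S1032 amount to a two-level partition of the training samples: first by commodity category ID, then by behavior group. A minimal sketch follows; the field names (`category_id`, `click_count` and so on) and the sample records are hypothetical and not taken from the embodiment.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical feature names for the three behavior groups.
BEHAVIOR_GROUPS: Dict[str, List[str]] = {
    "click": ["click_count", "page_stay_s"],
    "purchase": ["purchase_price", "purchase_hour"],
    "service": ["session_count", "session_duration_s"],
}


def partition(samples: List[dict]) -> Dict[str, Dict[str, List[dict]]]:
    """Return {category_id: {behavior_group: [feature dicts]}}."""
    result: Dict[str, Dict[str, List[dict]]] = defaultdict(
        lambda: {group: [] for group in BEHAVIOR_GROUPS}
    )
    for sample in samples:
        for group, names in BEHAVIOR_GROUPS.items():
            # Keep only the features belonging to this behavior group.
            result[sample["category_id"]][group].append({n: sample[n] for n in names})
    return dict(result)


samples = [
    {"category_id": "phones", "click_count": 12, "page_stay_s": 340,
     "purchase_price": 599.0, "purchase_hour": 21,
     "session_count": 2, "session_duration_s": 480},
    {"category_id": "books", "click_count": 3, "page_stay_s": 45,
     "purchase_price": 15.0, "purchase_hour": 9,
     "session_count": 0, "session_duration_s": 0},
]
training_sets = partition(samples)
print(sorted(training_sets))            # ['books', 'phones']
print(training_sets["phones"]["click"])  # [{'click_count': 12, 'page_stay_s': 340}]
```

Each inner list would then serve as the training set for the GBDT tree of the corresponding category and behavior group.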
S1033, respectively traversing GBDT trees corresponding to the three training sets, and outputting three groups of GBDT training sets;
setting a training set x (containing n samples), a loss function L, a depth D of the GBDT tree and a number of iterations M, and initializing a weak learner;
wherein the weak learner is initialized as $f_0(x)$;
for the i-th sample in each training set (the training set contains n samples, i = 1, 2, ..., n), calculating the negative gradient $r_{im}$ (the residual);
using the negative gradients as new sample targets, and using the data set $(x_i, r_{im})$ as training data to determine the next GBDT tree $f_m(x)$, wherein the leaf node regions of this tree are $R_{jm}$, j = 1, 2, ..., J, and J is the number of leaf nodes of this tree;
calculating the best fitting value of each leaf node region;
updating the strong learner;
after M rounds of iteration, obtaining the final learner.
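For reference, in the standard textbook formulation of GBDT the steps above are usually written as follows. The label symbol $y_i$, the leaf constant $\gamma$ and the indicator function $I(\cdot)$ are notation assumed here for illustration; the embodiment may define these quantities differently.

```latex
% Initialization of the weak learner
f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)

% Negative gradient (residual) of sample i in round m
r_{im} = -\left[ \frac{\partial L\bigl(y_i, f(x_i)\bigr)}{\partial f(x_i)} \right]_{f = f_{m-1}}

% Best fitting value of leaf node region R_{jm}
\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\bigl(y_i, f_{m-1}(x_i) + \gamma\bigr)

% Update of the strong learner
f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm}\, I(x \in R_{jm})

% Final learner after M rounds
f(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} \gamma_{jm}\, I(x \in R_{jm})
```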
S1034, taking the three groups of GBDT training sets as input of an LR model, training on the three groups of GBDT training sets with the LR model, and outputting the trained portrait prediction parameters.
Setting a loss function L1, a step length a, a maximum number of iterations $M_{max}$ and an error limit t;
letting any one of the three groups of GBDT training sets be denoted $x_i$, where i = 1, 2, ..., n and n is the number of samples;
initializing a portrait prediction parameter $c = \{c_0, c_1, c_2, ..., c_k\}$;
using a log-likelihood function as the loss function L1;
inputting the three groups of GBDT training sets into the LR model to iterate the portrait prediction parameter; in each iteration, judging for each of the three groups of GBDT training sets whether its error is less than t, terminating the training if the error is less than t, and otherwise (the error being greater than or equal to t) updating each component $c_j$, j = 0, 1, ..., k, of the portrait prediction parameter c;
and outputting the final portrait prediction parameter c after the iteration is finished.
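The iteration above can be sketched for a single GBDT training set as follows. This is a minimal sketch under stated assumptions: the loss L1 is taken to be the mean negative log-likelihood, the update is plain batch gradient descent with step length a, and the binary labels and encoded vectors are hypothetical sample data.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(c: Sequence[float], x: Sequence[float]) -> float:
    """c[0] is the intercept, c[1:] are the feature weights."""
    return sigmoid(c[0] + sum(cj * xj for cj, xj in zip(c[1:], x)))


def log_loss(c: Sequence[float], X: List[List[float]], y: List[int]) -> float:
    """Mean negative log-likelihood, used here as the error compared against t."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(c, xi), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)


def train_lr(X: List[List[float]], y: List[int], a: float = 0.5,
             m_max: int = 1000, t: float = 0.1) -> List[float]:
    """Iterate until the error drops below t or m_max iterations are reached."""
    k = len(X[0])
    c = [0.0] * (k + 1)  # initial portrait prediction parameter {c_0, ..., c_k}
    for _ in range(m_max):
        if log_loss(c, X, y) < t:
            break  # error below the limit: terminate training
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(c, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        c = [cj - a * gj / len(y) for cj, gj in zip(c, grad)]
    return c


# Hypothetical GBDT-encoded vectors with binary labels (1 = user clicked).
X = [[0, 1, 0, 0, 1], [1, 0, 0, 1, 0], [0, 1, 0, 0, 1], [1, 0, 0, 1, 0]]
y = [1, 0, 1, 0]
c = train_lr(X, y)
print(log_loss(c, X, y) < 0.1)  # True
```

Running the same routine on each of the three GBDT training sets would yield one parameter vector per behavior group.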
In the embodiment of the present invention, the portrait prediction parameter c may be a prediction of whether the user clicks a merchandise page, and/or a prediction of whether the user purchases merchandise, and/or whether the user communicates with the customer service frequently.
S104, the user portrait is predicted based on the portrait prediction parameters and the natural attribute parameters of the user.
The natural attributes of the user may include the user's gender, age, interests and the like. These attributes may be obtained from the information filled in by the user at registration.
Wherein, S104 may specifically be:
acquiring a user portrait template library, wherein the user portrait template library comprises a plurality of user portraits and corresponding characteristic values. The user portrait template library is a library of accurate user portraits mined and extracted by engineers from a massive number of users. It covers the users' natural attributes, interests and hobbies, click rates on similar and different commodities, historical purchase rates, complaint rates and the like, and it can be used to recommend commodities accurately according to the attributes or parameters of a user, or to identify potential users accurately for a given commodity. In the user portrait template library, the concept of a characteristic value (namely a correlation coefficient) is defined. The characteristic value is an index quantifying whether the user described by a portrait will purchase a certain commodity. It takes values in [0, 1], where 0 represents that the user is not interested in the commodity and 1 represents that the user has a very strong purchase intention. Different characteristic values therefore correspond to different user portraits.
Different weights ($\lambda_1, \lambda_2, ..., \lambda_n$) are set for the different portrait prediction parameters and natural attribute parameters of the user, and a weighting operation is performed to determine a user portrait characteristic value $\theta$;
wherein $b = \{b_1, b_2, ..., b_j\}$ are the natural attribute parameters of the user, $p = \{p_{1+j}, p_{2+j}, ..., p_n\}$ are the portrait prediction parameters, i = 1, 2, ..., n, and j < n.
And traversing the characteristic values in the user portrait template library, and determining the characteristic value in the template library that has the minimum difference from the user portrait characteristic value, so that the user portrait corresponding to that characteristic value in the template library is the predicted user portrait. For example, if the calculated characteristic value is 0.75 and the template library contains the characteristic values 0.7, 0.73, 0.74 and 0.78, the difference between 0.74 and 0.75 is the smallest, and the user portrait for the calculated characteristic value is deemed consistent with the user portrait corresponding to the characteristic value 0.74 in the template library.
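The weighting and template lookup can be sketched as below. The sketch assumes that the weighting operation is a plain weighted sum over the natural attribute parameters followed by the portrait prediction parameters; the template names, weights and parameter values are hypothetical.

```python
from typing import Dict, Sequence


def feature_value(weights: Sequence[float],
                  natural_attrs: Sequence[float],
                  prediction_params: Sequence[float]) -> float:
    """Weighted sum over b_1..b_j followed by p_{j+1}..p_n (assumed form of the weighting)."""
    values = list(natural_attrs) + list(prediction_params)
    if len(weights) != len(values):
        raise ValueError("one weight is required per parameter")
    return sum(w * v for w, v in zip(weights, values))


def nearest_portrait(theta: float, templates: Dict[str, float]) -> str:
    """Return the template whose characteristic value is closest to theta."""
    return min(templates, key=lambda name: abs(templates[name] - theta))


# Template library from the example above (portrait names are hypothetical).
templates = {"portrait_a": 0.70, "portrait_b": 0.73, "portrait_c": 0.74, "portrait_d": 0.78}
match = nearest_portrait(0.75, templates)
print(match)  # portrait_c

# Hypothetical weights and parameters: one natural attribute, two prediction parameters.
theta = feature_value([0.2, 0.3, 0.5], [1.0], [0.5, 0.8])
print(nearest_portrait(theta, templates))
```

If two templates are equally close, `min` returns the one listed first, so the ordering of the library acts as a tie-breaker in this sketch.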
According to the method provided by the embodiment of the invention, the operation behavior of the user is analyzed through the GBDT and LR fusion model, the user portrait prediction parameter is output, the user portrait prediction parameter and the user natural attribute parameter are combined and analyzed, the user portrait prediction is finally obtained, and the user portrait prediction capability and the user portrait prediction accuracy are improved.
As shown in FIG. 3, an embodiment of the present invention further provides an apparatus for user portrait prediction based on artificial intelligence, including:
the acquisition unit 31 is used for acquiring operation behavior data flow of a user in a commodity purchasing process;
a feature extraction unit 32, configured to perform feature extraction from the operation behavior data stream to generate a training sample;
the training unit 33 is used for training on the training samples with a fusion model of a gradient boosting decision tree (GBDT) and logistic regression (LR) and outputting trained portrait prediction parameters;
a prediction unit 34 for predicting the user portrait based on the portrait prediction parameters and the natural attribute parameters of the user.
The training unit 33 trains on the training samples with the fusion model of the gradient boosting decision tree (GBDT) and logistic regression (LR) and outputs the trained portrait prediction parameters, specifically by:
dividing the training samples into N groups of training feature sets according to N commodity categories, wherein N is a positive integer greater than 1;
dividing each group of training feature sets into three training sets according to the user clicking operation behavior feature, the user purchasing operation behavior feature and the user and customer service conversation behavior feature, and respectively establishing corresponding GBDT trees;
respectively traversing GBDT trees corresponding to the three training sets, and outputting three groups of GBDT training sets;
and taking the three groups of GBDT training sets as input of an LR model, training on the three groups of GBDT training sets with the LR model, and outputting the trained portrait prediction parameters.
The embodiment of the present invention further includes an apparatus, which is characterized by comprising a memory and a processor, wherein the memory stores computer executable instructions, and the processor implements the method when executing the computer executable instructions on the memory.
Embodiments of the present invention also provide a computer-readable storage medium having stored thereon computer-executable instructions for performing the method in the foregoing embodiments.
FIG. 4 is a diagram illustrating the hardware components of the apparatus according to one embodiment. It will be appreciated that FIG. 4 only shows a simplified design of the apparatus. In practical applications, the apparatus may also include other necessary elements, including but not limited to any number of input/output systems, processors, controllers, memories, etc., and all apparatuses that can implement the user portrait prediction method of the embodiments of the present application are within the protection scope of the present application.
The memory includes, but is not limited to, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a portable read-only memory (CD-ROM), which is used for storing instructions and data.
The input system is for inputting data and/or signals and the output system is for outputting data and/or signals. The output system and the input system may be separate devices or may be an integral device.
The processor may include one or more processors, for example, one or more Central Processing Units (CPUs), and in the case of one CPU, the CPU may be a single-core CPU or a multi-core CPU. The processor may also include one or more special purpose processors, which may include GPUs, FPGAs, etc., for accelerated processing.
The memory is used to store program code and data of the apparatus.
The processor is used for calling the program codes and data in the memory and executing the steps in the method embodiment. Specifically, reference may be made to the description of the method embodiment, which is not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. For example, the division of the unit is only one logical function division, and other division may be implemented in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. The shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, systems or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are wholly or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable system. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)), or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that includes one or more of the available media. The usable medium may be a read-only memory (ROM), or a Random Access Memory (RAM), or a magnetic medium, such as a floppy disk, a hard disk, a magnetic tape, a magnetic disk, or an optical medium, such as a Digital Versatile Disk (DVD), or a semiconductor medium, such as a Solid State Disk (SSD).
The above is only a specific embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.