CN114238758A - User portrait prediction method based on multi-source cross-border data fusion - Google Patents


Info

Publication number
CN114238758A
Authority
CN
China
Prior art keywords
user
vector
representing
output
interaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111531109.4A
Other languages
Chinese (zh)
Other versions
CN114238758B (en)
Inventor
周仁杰
郭星宇
张纪林
万健
刘畅
赵乃良
殷昱煜
蒋从锋
刘焱
李炳
陈青雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Panshi Information Technology Co ltd
Hangzhou Dianzi University
Original Assignee
Zhejiang Panshi Information Technology Co ltd
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Panshi Information Technology Co ltd and Hangzhou Dianzi University
Priority to CN202111531109.4A priority Critical patent/CN114238758B/en
Publication of CN114238758A publication Critical patent/CN114238758A/en
Application granted granted Critical
Publication of CN114238758B publication Critical patent/CN114238758B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0631 Item recommendations
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a user portrait prediction method based on multi-source cross-border data fusion, aiming to solve the inaccurate user feature prediction caused in the prior art by sparse item features, loss of high-order structural features, and loss of user behavior sequence features. Based on the e-commerce data generated by a user, commodity features are expanded with a knowledge graph, the user's historical purchase records are fully mined with a graph convolutional network, and the user's latent purchase features are predicted with a recurrent neural network, effectively improving the accuracy of user portrait prediction. The knowledge graph alleviates the sparsity of commodity features, the graph convolutional neural network addresses the loss of high-order structural features, and the recurrent neural network addresses the loss of user behavior sequence features, laying a good foundation for improving the performance of recommendation systems.

Description

User portrait prediction method based on multi-source cross-border data fusion
Technical Field
The invention relates to a user portrait prediction method based on multi-source cross-border data fusion, which constructs user portraits from users' historical shopping order sequences and item content information.
Background
With the development of internet technology and intelligent devices, mobile applications of all kinds are emerging and penetrating people's lives, and the information they generate is growing explosively. This makes it difficult for people to obtain the information they want efficiently, while enterprises struggle to push products and information to users accurately. Recommendation systems are built on user portraits; by constructing user portraits efficiently, an enterprise can realize fine-grained marketing and accurate recommendation.
Online shopping has become commonplace in daily life, and many users directly or indirectly provide personal information to shopping platforms while enjoying its convenience. Direct information includes gender, age, place of residence, and so on; indirect information includes browsing records, purchase records, favorites records, and so on. From this information, a shopping platform can construct a virtual portrait of the user on the internet, accurately recommend the goods the user needs, and thereby increase the platform's revenue.
User portrait technology is mainly applied in the field of personalized recommendation. Methods for predicting user portrait labels include SVMs, decision trees, logistic regression (LR), and other traditional shallow learning models, which have performed well. However, as the data generated by users under the big-data background grows explosively and feature dimensionality increases, the limitations of the flat structure of traditional shallow learning models become prominent. For example, in typical problems such as user click-through-rate estimation and conversion-rate estimation, the input features are high-dimensional and highly sparse, and traditional shallow learning methods face real challenges in user label prediction because they cannot discover the complex nonlinear relationships among features.
In the field of electronic commerce, a user's historical purchase behavior contains rich behavioral information. Mining a user's purchase records can effectively improve the accuracy of the user portrait and, in turn, the performance of a recommendation system. For example, if a user's purchase history contains many "Huawei" brand items, indicating that the user is a "pollen" (the popular nickname for Huawei fans), the user will most likely not buy an "iPhone" if the recommendation system recommends one; but if the system pushes a newly released Huawei phone, the user may well buy it when a replacement is needed. Here "Huawei" and "iPhone" are implicit features hidden in the user's historical purchase behavior; other implicit features include the "efficacy", "genre", "price", or "speaker" of a product, or the "director", "producer", and "genre" of a movie. Such implicit item features often suffer from sparsity on network platforms. In addition, most existing methods do not mine the associations between users or between items: they treat user feature prediction as a set of independent classification tasks, so the associated features between users and between items are lost to some extent, and an effective representation vector of a user cannot be learned for user feature prediction.
The invention uses a knowledge graph to supplement the features of goods the user purchased historically, and provides a user portrait prediction method that learns users' high-order structural features with a graph convolutional neural network. At the same time, a recurrent neural network supplements the user's features from the historical sequence of purchase orders, forming a complete user portrait prediction method based on multi-source cross-border data fusion.
Disclosure of Invention
The invention aims to solve the problems of item feature sparsity, loss of high-order structural features, and loss of user behavior sequence features in the prior art, and provides a user portrait prediction method based on multi-source cross-border data fusion.
The technical scheme adopted by the invention is as follows:
Step 1: collect the information generated by the user's interactions on a shopping platform;
Step 2: construct a heterogeneous knowledge graph and the user's historical interaction sequence;
Step 3: construct the embedding matrices;
Step 4: construct the multi-source cross-border data fusion user portrait prediction model and train it, obtaining the optimal parameter model after the model parameters converge;
Step 5: predict user features with the multi-source cross-border data fusion user portrait prediction model obtained in Step 4.
Another object of the invention is to provide a user portrait prediction device based on multi-source cross-border data fusion, comprising a memory, a processor, and a sequence-perception and graph-convolution based neural network model program stored in the memory and executable on the processor; when executed by the processor, the program implements the steps of the above user portrait prediction method based on multi-source cross-border data fusion.
A further object of the present invention is to provide a storage medium storing a multi-source cross-border data fusion user portrait prediction model program which, when executed by a processor, implements the steps of the above user portrait prediction method based on multi-source cross-border data fusion.
The technical scheme provided by the invention has the following beneficial effects:
(1) according to users' historical orders, commodity features are expanded with a knowledge graph, which alleviates the scarcity of commodity features in e-commerce data;
(2) a knowledge subgraph is constructed from the commodities and the related knowledge graph triplets; a graph convolutional network fully learns the knowledge subgraph node features, preserves the structural features of the graph as much as possible, avoids feature loss during training, and obtains representation vectors that adequately characterize each entity and its local neighborhood, addressing the loss of high-order structural features;
(3) a recurrent neural network extracts the features hidden in the user's behavior sequence from the historical order sequence. Combined with the high-order structural features learned by the graph convolutional network model, this addresses the loss of user behavior sequence features and further improves the model's user portrait prediction capability.
Drawings
FIG. 1 is a flow chart according to the present invention;
FIG. 2 is a diagram of a model structure;
FIG. 3 is a schematic diagram of a heterogeneous knowledge graph;
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
The specific flow of the multi-source cross-border data fusion user portrait prediction method is shown in FIG. 1:
Step 1: collect the information generated by the user's interactions on the shopping platform.
The information collected includes:
(1) The user's basic information, including gender and age.
(2) The user's behavior records, including the time of purchasing a commodity, the commodity number, the commodity name, and the like.
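For concreteness, the collected records can be pictured as two flat tables. The sketch below is illustrative only; the field names (user_id, commodity_id, and so on) are assumptions, not part of the invention.

```python
from dataclasses import dataclass

@dataclass
class UserInfo:
    user_id: str
    gender: str          # basic information: gender
    age: int             # basic information: age

@dataclass
class BehaviorRecord:
    user_id: str
    purchase_time: int   # time the commodity was purchased (epoch seconds)
    commodity_id: str    # commodity number
    commodity_name: str  # raw commodity name, later word-segmented

# Illustrative rows standing in for a real shopping-platform feed.
users = [UserInfo("u1", "F", 29)]
records = [BehaviorRecord("u1", 1639440000, "c42", "Huawei P50 smartphone 8GB")]
```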
Step 2: constructing a heterogeneous knowledge graph and a user historical interaction sequence;
2-1 construction of heterogeneous knowledge graph
2-1-1. Perform word segmentation on the commodity name to obtain the segmentation result set $\{i_1, i_2, \dots, i_m, \dots\}$, where $i_m$ denotes the $m$-th token;
2-1-2. Perform a 2-round recursive search over the segmentation result set in a public knowledge graph, discarding tokens that do not exist in the public knowledge graph; the remaining tokens and the entities found in the public knowledge graph form the entity set $\varepsilon = \{e_1, e_2, \dots, e_n, \dots\}$, from which the triplets $(i_m, contain, e_n)$ are constructed, where $contain$ denotes the association relation between $i_m$ and $e_n$; the triplets $(i_m, contain, e_n)$ are used to construct the knowledge subgraph $\mathcal{G}_s$ corresponding to each commodity name;
2-1-3. The knowledge subgraphs $\mathcal{G}_s$ corresponding to the commodity names are integrated into a heterogeneous knowledge graph $\mathcal{G}$. The heterogeneous knowledge graph $\mathcal{G}$ includes nodes $V$ and edges $E$; the nodes $V$ comprise the user set $U$, the commodity name set $I$, and the entity set $\varepsilon$; the edges comprise three types: entity-entity knowledge graph relations $E_{\varepsilon\varepsilon}$, commodity name-user interaction records $E_{iu}$, and same-click-behavior edges between user pairs $E_{uu}$. The entity-entity knowledge graph relation $E_{\varepsilon\varepsilon}$ is the relation between any two entities in the entity set;
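The subgraph construction of 2-1-1 to 2-1-3 can be sketched as follows. This is a minimal illustration, not the patented implementation: `segment` stands in for a real word segmenter, `public_kg` for a public knowledge graph queried as an adjacency mapping, and the 2-round recursive search is written as two expansion passes.

```python
from collections import defaultdict

def segment(name: str) -> list[str]:
    # Placeholder tokenizer; a real system would use a Chinese word segmenter.
    return name.split()

def build_subgraph(name: str, public_kg: dict[str, set[str]]):
    """Triplets (i_m, contain, e_n) for one commodity name (steps 2-1-1/2-1-2)."""
    tokens = [t for t in segment(name) if t in public_kg]  # discard unknown tokens
    entities: set[str] = set()
    frontier: set[str] = set(tokens)
    for _ in range(2):                        # 2-round recursive search
        frontier = {e for node in frontier for e in public_kg.get(node, set())}
        entities |= frontier
    triplets = [(i_m, "contain", e_n) for i_m in tokens for e_n in entities]
    return tokens, entities, triplets

def merge_into_hetero_graph(all_triplets, interactions, same_click_pairs):
    """Step 2-1-3: one graph with E_ee, E_iu and E_uu edge types."""
    graph = defaultdict(set)
    for h, _, t in all_triplets:              # entity-entity edges E_ee
        graph[h].add(t); graph[t].add(h)
    for user, item in interactions:           # commodity name-user edges E_iu
        graph[user].add(item); graph[item].add(user)
    for u1, u2 in same_click_pairs:           # same-click user-user edges E_uu
        graph[u1].add(u2); graph[u2].add(u1)
    return graph
```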
2-2. From the user set, the commodity name set, and the commodity name-user interaction records, construct the user-commodity name interaction matrix $Y \in \mathbb{R}^{N \times M}$, where $N$ denotes the number of users and $M$ denotes the number of commodity names; in the user-commodity name interaction matrix, $y_{uv} = 1$ indicates that user $u$ has interacted with commodity name $v$ (e.g., purchased, browsed, clicked), and $y_{uv} = 0$ indicates that user $u$ has not interacted with commodity name $v$;
2-3. From the user-commodity name interaction matrix, further construct the set of user historical interaction sequences $\{ [(v_1^u, t_1^u), (v_2^u, t_2^u), \dots] \mid u \in U \}$, where $v_i^u$ denotes the commodity name of user $u$'s $i$-th historical interaction and $t_i^u$ denotes the moment at which user $u$ interacted with $v_i^u$;
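A minimal numpy sketch of steps 2-2 and 2-3, under the assumption that interactions arrive as (user_index, item_index, timestamp) tuples:

```python
import numpy as np

def build_interaction_matrix(interactions, num_users: int, num_items: int):
    """Y in R^{N x M} with y_uv = 1 iff user u interacted with commodity name v."""
    Y = np.zeros((num_users, num_items), dtype=np.int8)
    for u, v, _t in interactions:
        Y[u, v] = 1
    return Y

def build_history_sequences(interactions):
    """Per-user list of (commodity, time) pairs sorted by interaction time."""
    seqs: dict[int, list] = {}
    for u, v, t in interactions:
        seqs.setdefault(u, []).append((v, t))
    for u in seqs:
        seqs[u].sort(key=lambda pair: pair[1])
    return seqs

interactions = [(0, 3, 100), (0, 1, 50), (1, 2, 70)]
Y = build_interaction_matrix(interactions, num_users=2, num_items=4)
sequences = build_history_sequences(interactions)  # {0: [(1, 50), (3, 100)], 1: [(2, 70)]}
```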
Step 3: from the user set, the entity set $\varepsilon$, and the entity-entity knowledge graph relations $E_{\varepsilon\varepsilon}$, further construct the user embedding matrix $\mathbf{U} \in \mathbb{R}^{N \times D}$, the entity embedding matrix $\mathbf{V} \in \mathbb{R}^{|\varepsilon| \times D}$, and the user adjacency matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$, where $D$ denotes the dimension of the vectors; each element of the user adjacency matrix represents the similarity of two users' click behaviors: the element $a_{u_1 u_2} = 1$ indicates that user $u_1$ and user $u_2$ have similar click behavior, and $a_{u_1 u_2} = 0$ indicates that user $u_1$ and user $u_2$ have no similar click behavior;
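The patent leaves the click-similarity criterion abstract; the sketch below initializes the embedding matrices randomly (they are trained later by back-propagation) and, as an assumption, marks two users as similar when the cosine similarity of their interaction rows exceeds a threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
N, num_entities, D = 2, 10, 16
U_emb = rng.normal(size=(N, D))              # user embedding matrix U in R^{N x D}
V_emb = rng.normal(size=(num_entities, D))   # entity embedding matrix V

def user_adjacency(Y: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """A in R^{N x N}: a_{u1,u2} = 1 if the two users' click rows are similar."""
    Yf = Y.astype(float)
    norms = np.linalg.norm(Yf, axis=1, keepdims=True) + 1e-9
    sim = (Yf @ Yf.T) / (norms * norms.T)    # cosine similarity of click vectors
    A = (sim >= threshold).astype(np.int8)
    np.fill_diagonal(A, 0)                   # no self-loops
    return A
```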
Step 4: construct the multi-source cross-border data fusion user portrait prediction model;
the user portrait prediction model of multi-source cross-border data fusion comprises an input embedding layer, a heterogeneous knowledge graph convolution layer, a user behavior sequence perception layer and an output layer:
4-1 Input embedding layer: construct the user interaction entity set $N_e(u)$ from the user historical interaction sequence set; represent users as vectors with the user embedding matrix; represent user interaction entities as vectors with the entity embedding matrix, obtaining $S_i(u)$; obtain the user's neighbor embedding vectors according to the user adjacency matrix and construct the neighbor user set $S_u(u)$;
4-2 Heterogeneous knowledge graph convolutional layer: after the representation vectors of the user interaction entities enter the heterogeneous knowledge graph convolutional layer, two operations are performed;
4-2-1. Through $H$ rounds of iterative aggregation of neighbor topological features, the user interaction entities yield the user-commodity name representation vector $\mathbf{u}^{I}$ carrying neighbor features;
4-2-2. The representation vectors of user $u$'s neighbor user set are aggregated with user $u$'s representation vector to obtain the user neighbor feature representation vector $\mathbf{u}^{S}$;
4-2-3. The user-commodity name representation vector $\mathbf{u}^{I}$ and the user neighbor feature representation vector $\mathbf{u}^{S}$ are concatenated, and the result is added to user $u$'s representation vector to obtain the output vector of the heterogeneous knowledge graph convolutional layer, $\mathbf{u}^{KG}$;
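The graph convolution of 4-2 fixes the dataflow but not the aggregator; the sketch below assumes a mean aggregator for the $H$ rounds of 4-2-1 and adds a projection matrix W_f, an assumption needed to make the concatenated vector of 4-2-3 dimensionally compatible with the user vector.

```python
import numpy as np

def gcn_aggregate(emb: np.ndarray, neighbors: dict[int, list[int]], H: int = 2):
    """4-2-1: H rounds of iterative neighbor aggregation (mean aggregator assumed)."""
    h = emb.copy()
    for _ in range(H):
        new_h = h.copy()
        for node, nbrs in neighbors.items():
            if nbrs:
                new_h[node] = 0.5 * (h[node] + h[nbrs].mean(axis=0))
        h = new_h
    return h

def kg_layer_output(u_item, u_neighbor, u_self, W_f):
    """4-2-3: concatenate u^I and u^S, project to D dims, add the user's own vector."""
    fused = np.concatenate([u_item, u_neighbor]) @ W_f  # (2D,) @ (2D, D) -> (D,)
    return fused + u_self                               # u^KG
```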
4-3 User behavior sequence perception layer: an LSTM or a GRU is used to model the user's sequence features and extract the user's latent interests; taking the embedded user historical interaction sequence $\mathbf{S}_u$ as input, a vector with the same dimensionality as the output of the heterogeneous knowledge graph convolutional layer is obtained;
Method 1: modeling user sequence features with an LSTM

The hidden state and cell state at the last time step of the recurrent neural network are added to obtain the output vector of the LSTM module:

$$\mathbf{s}_u = \mathbf{h}_T^u \oplus \mathbf{c}_T^u$$

where $\mathbf{s}_u$ denotes the output vector of user $u$'s historical interaction sequence $\mathbf{S}_u$ after processing by the LSTM module, $\mathbf{c}_T^u$ denotes the cell state output by the LSTM network at the last time step, $\mathbf{h}_T^u$ denotes the hidden state output by the LSTM network at the last time step, $T$ denotes the last time step, and $\oplus$ denotes element-wise addition;

the output vector of the LSTM module is then spatially transformed into a user behavior sequence representation vector with the same dimension as the user representation vector:

$$\mathbf{u}^{seq} = \mathbf{W}_s \mathbf{s}_u + \mathbf{b}_s$$

where $\mathbf{u}^{seq}$ denotes user $u$'s behavior sequence representation vector, and $\mathbf{W}_s \in \mathbb{R}^{D \times P}$ and $\mathbf{b}_s \in \mathbb{R}^{D}$ denote the weight matrix and bias of the spatial transformation, respectively, with $P$ denoting the number of LSTM hidden-layer neurons;
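A PyTorch sketch of Method 1 (the framework choice is an assumption, not the patent's code): the final hidden and cell states are added element-wise, then mapped from the $P$ hidden units back to $D$ dimensions. Replacing `nn.LSTM` with `nn.GRU` and dropping the cell-state term gives Method 2 below.

```python
import torch
import torch.nn as nn

class SequencePerceptionLSTM(nn.Module):
    def __init__(self, D: int, P: int):
        super().__init__()
        self.lstm = nn.LSTM(input_size=D, hidden_size=P, batch_first=True)
        self.proj = nn.Linear(P, D)          # spatial transformation W_s, b_s

    def forward(self, seq_emb: torch.Tensor) -> torch.Tensor:
        # seq_emb: (batch, T, D) embedded historical interaction sequence
        _, (h_T, c_T) = self.lstm(seq_emb)
        s_u = h_T[-1] + c_T[-1]              # element-wise addition of final states
        return self.proj(s_u)                # u^seq: (batch, D)

u_seq = SequencePerceptionLSTM(D=16, P=32)(torch.randn(4, 10, 16))  # -> (4, 16)
```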
Method 2: modeling user sequence features with a GRU

The hidden state at the last time step is directly the output vector of the GRU network:

$$\mathbf{s}_u = \mathbf{h}_T^u$$

where $\mathbf{s}_u$ denotes the output vector of user $u$'s behavior sequence after processing by the GRU module, and $\mathbf{h}_T^u$ denotes the hidden state output by the hidden layer at the last time step of the GRU network; likewise, the output vector processed by the GRU module must be transformed to the same dimension as the user representation vector:

$$\mathbf{u}^{seq} = \mathbf{W}_s \mathbf{s}_u + \mathbf{b}_s$$

where $\mathbf{u}^{seq}$ denotes user $u$'s behavior sequence representation vector, and $\mathbf{W}_s$ and $\mathbf{b}_s$ denote the weight matrix and bias of the spatial transformation, respectively;
4-4 Output layer: the output layer adds the outputs of the heterogeneous knowledge graph convolutional layer and the user behavior sequence perception layer, then transforms the sum into an output vector whose dimensionality equals the number of predicted features:

$$\mathbf{u}_{final} = \mathbf{u}^{KG} \oplus \mathbf{u}^{seq}$$

$$\mathbf{o} = \mathbf{W} \mathbf{u}_{final} + \mathbf{b}$$

where $\mathbf{u}_{final}$ denotes the final user representation vector, $\mathbf{u}^{KG}$ denotes the representation vector with user neighbor features learned by the heterogeneous knowledge graph convolutional layer, $\mathbf{u}^{seq}$ denotes user $u$'s behavior sequence representation vector, and $\oplus$ denotes vector addition; $\mathbf{o}$ denotes the user output vector, $\mathbf{W}$ denotes the weight matrix, and $\mathbf{b}$ denotes the bias vector;
Step 5: apply the softmax operation to the user output vector $\mathbf{o}$ obtained in Step 4 to obtain the probabilities corresponding to the predicted basic information of the user (i.e., gender or the predicted age bracket):

$$o_i' = \frac{e^{o_i}}{\sum_{j=0}^{f-1} e^{o_j}}$$

where $o_i'$ denotes the probability of the $i$-th dimension obtained by the softmax function and $o_i$ denotes the value of the $i$-th dimension of the output vector $\mathbf{o}$; the softmax function yields the probability representation of the user output vector for the user features in dimensions $0, 1, \dots, f-1$;

the back-propagation of the whole model uses a softmax cross-entropy loss function with the formula:

$$\mathcal{L} = \sum_{u \in \mathcal{U}} \mathcal{J}(y_u, \hat{y}_u) + \lambda \lVert \Theta \rVert_2^2$$

where $\mathcal{U}$ denotes the user set, $\mathcal{J}$ denotes the cross-entropy loss function, $y_u$ and $\hat{y}_u$ denote the true user label value and the model's predicted value, respectively; $\lambda \lVert \Theta \rVert_2^2$ is the L2 regularization term, $\lambda$ denotes the regularization coefficient controlling the strength of L2 regularization, and $\Theta$ denotes the parameters of the model, such as the user, entity, and relation embedding matrices $\mathbf{U}$, $\mathbf{V}$, $\mathbf{R}$ and the weight matrices between the neural network layers.
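A sketch of the output layer and the training objective as one PyTorch step; the framework details are assumptions. `nn.CrossEntropyLoss` fuses the softmax of Step 5 with the cross-entropy term $\mathcal{J}$, and the L2 term $\lambda\lVert\Theta\rVert_2^2$ is supplied through the optimizer's `weight_decay`.

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    def __init__(self, D: int, f: int):
        super().__init__()
        self.fc = nn.Linear(D, f)            # o = W u_final + b

    def forward(self, u_kg, u_seq):
        u_final = u_kg + u_seq               # add the two branch outputs (4-4)
        return self.fc(u_final)              # logits; softmax applied in the loss

D, f = 16, 2                                 # f = 2 for gender; e.g. 4 for age brackets
model = OutputLayer(D, f)
loss_fn = nn.CrossEntropyLoss()              # softmax cross-entropy
opt = torch.optim.Adam(model.parameters(), weight_decay=1e-4)  # ~ lambda ||Theta||^2

logits = model(torch.randn(8, D), torch.randn(8, D))
loss = loss_fn(logits, torch.randint(0, f, (8,)))
opt.zero_grad(); loss.backward(); opt.step()
```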
The performance of the invention was evaluated on the MovieLens-1M movie dataset and a JD.com (Jingdong) e-commerce dataset. On each dataset, the model was evaluated on binary gender prediction and multi-class age prediction.
The following table shows the data volume of the two datasets after knowledge graph entity screening:

[Table: data volume of the two datasets after knowledge-graph entity screening]
the two data sets respectively adopt Microsoft Satori and zhishi.me Chinese knowledge maps to conduct triple feature expansion on the entity set of the commodity name. The distribution of the user characteristics of each data set is as follows:
(1) Gender:
a) in the MovieLens-1M movie dataset, male users account for 72% and female users for 28%;
b) in the JD.com e-commerce dataset, male users account for 44% and female users for 56%.
(2) Age:
a) in the MovieLens-1M movie dataset, users under 25 account for 22%, users aged 25-34 for 35%, users aged 35-50 for 29%, and users over 50 for 15%;
b) in the JD.com e-commerce dataset, users under 26 account for 14%, users aged 26-35 for 55%, users aged 36-55 for 30%, and users over 55 for 1%.
The performance evaluation metrics adopted by the invention are Accuracy and macro-F1.
                 True value = 1         True value = -1
Predicted = 1    TP (True Positive)     FP (False Positive)
Predicted = -1   FN (False Negative)    TN (True Negative)
Accuracy: the proportion of correctly classified samples among all samples:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
macro_F1 is a variant of F1_score, an evaluation metric for binary classification models commonly used in machine learning; the F1_score formula is:

$$F1\_score = \frac{2 \times precision \times recall}{precision + recall}$$
where precision and recall denote the classification precision and the recall rate, respectively: precision evaluates whether the model's positive classifications are accurate, and recall evaluates the proportion of all positive examples that the classifier identifies as positive. From the above formula, F1_score is an evaluation metric that combines the precision and recall of a classifier.
Since the conventional F1_score is mostly used for binary classification while age prediction in the experiments is a multi-class problem, macro_F1 is used as the evaluation metric; macro_F1 is the average of the per-class F1_scores, namely:

$$macro\_F1 = \frac{1}{N}\sum_{i=1}^{N} F1\_score_i$$

where $F1\_score_1, F1\_score_2, \dots, F1\_score_N$ are the F1_scores of classes $1, 2, \dots, N$, and $N$ is the number of classes.
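Both metrics are easy to compute directly; the following self-contained numpy sketch computes Accuracy and macro_F1 one-vs-rest per class, matching the formulas above.

```python
import numpy as np

def accuracy(y_true, y_pred) -> float:
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def macro_f1(y_true, y_pred) -> float:
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in np.unique(y_true):              # one-vs-rest F1 for each class
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return float(np.mean(scores))

y_true, y_pred = [0, 1, 2, 1, 0], [0, 1, 1, 1, 0]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```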
The following table shows the results of the gender prediction experiments of the invention on the two datasets:

[Table: gender prediction results on MovieLens-1M and the JD.com dataset]
the following table shows the results of the age prediction experiments of the present invention on the above two data sets:
Figure BDA0003410781700000094
Figure BDA0003410781700000101
In the gender and age prediction result tables above, logistic regression and the support vector machine are traditional machine learning classifiers; LightGBM is an efficient classification model based on gradient-boosted decision trees proposed by Microsoft; and the heterogeneous knowledge graph convolutional network (Ba-KGCN) is the multi-source cross-border data fusion user portrait prediction model of the present invention.

Claims (9)

1. A user portrait prediction method based on multi-source cross-border data fusion is characterized by comprising the following steps:
step 1: collecting information generated by interaction of a user on a shopping platform, wherein the information comprises basic information of the user and user behavior records, and constructing a user set, a commodity name set and a commodity name-user interaction record;
basic information of the user comprises gender and age;
the user behavior records comprise time for purchasing commodities, commodity numbers and commodity names;
step 2: constructing a heterogeneous knowledge graph and a user historical interaction sequence;
2-1, constructing a heterogeneous knowledge graph and a user behavior sequence set
2-1-1. Perform word segmentation on the commodity name to obtain the segmentation result set $\{i_1, i_2, \dots, i_m, \dots\}$, where $i_m$ denotes the $m$-th token;
2-1-2. Perform a 2-round recursive search over the segmentation result set in a public knowledge graph, discarding tokens that do not exist in the public knowledge graph; the remaining tokens and the entities found in the public knowledge graph form the entity set $\varepsilon = \{e_1, e_2, \dots, e_n, \dots\}$, from which the triplets $(i_m, contain, e_n)$ are constructed, where $contain$ denotes the association relation between $i_m$ and $e_n$; the triplets $(i_m, contain, e_n)$ are used to construct the knowledge subgraph $\mathcal{G}_s$ corresponding to each commodity name;
2-1-3. The knowledge subgraphs $\mathcal{G}_s$ corresponding to the commodity names are integrated into a heterogeneous knowledge graph $\mathcal{G}$;
2-2. From the user set, the commodity name set, and the commodity name-user interaction records, construct the user-commodity name interaction matrix $Y \in \mathbb{R}^{N \times M}$, where $N$ denotes the number of users and $M$ denotes the number of commodity names;
2-3. From the user-commodity name interaction matrix, further construct the set of user historical interaction sequences $\{ [(v_1^u, t_1^u), (v_2^u, t_2^u), \dots] \mid u \in U \}$, where $v_i^u$ denotes the commodity name of user $u$'s $i$-th historical interaction and $t_i^u$ denotes the moment at which user $u$ interacted with $v_i^u$;
Step 3: from the user set and the entity set $\varepsilon$, further construct the user embedding matrix $\mathbf{U} \in \mathbb{R}^{N \times D}$, the entity embedding matrix $\mathbf{V} \in \mathbb{R}^{|\varepsilon| \times D}$, and the user adjacency matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$, where $D$ denotes the dimension of the vectors; each element of the user adjacency matrix represents the similarity of two users' click behaviors;
Step 4: construct the multi-source cross-border data fusion user portrait prediction model;
the user portrait prediction model of multi-source cross-border data fusion comprises an input embedding layer, a heterogeneous knowledge graph convolution layer, a user behavior sequence perception layer and an output layer:
4-1 Input embedding layer: construct the user interaction entity set $N_e(u)$ from the user historical interaction sequence set; represent users as vectors with the user embedding matrix; represent user interaction entities as vectors with the entity embedding matrix, obtaining $S_i(u)$; obtain the user's neighbor user embedding vectors according to the user adjacency matrix and construct the neighbor user set $S_u(u)$;
4-2 Heterogeneous knowledge graph convolutional layer: after the representation vectors of the user interaction entities enter the heterogeneous knowledge graph convolutional layer, two operations are performed;
4-2-1. Through $H$ rounds of iterative aggregation of neighbor topological features, the user interaction entities yield the user-commodity name representation vector $\mathbf{u}^{I}$ carrying neighbor features;
4-2-2. The representation vectors of user $u$'s neighbor user set are aggregated with user $u$'s representation vector to obtain the user neighbor feature representation vector $\mathbf{u}^{S}$;
4-2-3. The user-commodity name representation vector $\mathbf{u}^{I}$ and the user neighbor feature representation vector $\mathbf{u}^{S}$ are concatenated, and the result is added to user $u$'s representation vector to obtain the output vector of the heterogeneous knowledge graph convolutional layer, $\mathbf{u}^{KG}$;
4-3 User behavior sequence perception layer: an LSTM or a GRU is used to model the user's sequence features and extract the user's latent interests; taking the embedded user historical interaction sequence as input, a vector with the same dimensionality as the output of the heterogeneous knowledge graph convolutional layer is obtained;
4-4 Output layer: the output layer adds the outputs of the heterogeneous knowledge graph convolutional layer and the user behavior sequence perception layer, then transforms the sum into an output vector whose dimensionality equals the number of predicted features;
Step 5: apply the softmax operation to the user output vector $\mathbf{o}$ obtained in Step 4 to obtain the probabilities corresponding to the predicted basic information of the user.
2. The user portrait prediction method based on multi-source cross-border data fusion of claim 1, wherein the heterogeneous knowledge graph $\mathcal{G}$ includes nodes $V$ and edges $E$; the nodes $V$ comprise the user set $U$, the commodity name set $I$, and the entity set $\varepsilon$; the edges comprise three types: entity-entity knowledge graph relations $E_{\varepsilon\varepsilon}$, commodity name-user interaction records $E_{iu}$, and same-click-behavior edges between user pairs $E_{uu}$; the entity-entity knowledge graph relation $E_{\varepsilon\varepsilon}$ is the relation between any two entities in the entity set.
3. The user portrait prediction method based on multi-source cross-border data fusion of claim 1, wherein the user behavior sequence perception layer models user sequence features with an LSTM:

the hidden state and cell state at the last time step of the recurrent neural network are added to obtain the output vector of the LSTM module:

$$\mathbf{s}_u = \mathbf{h}_T^u \oplus \mathbf{c}_T^u$$

where $\mathbf{s}_u$ denotes the output vector of user $u$'s historical interaction sequence after processing by the LSTM module, $\mathbf{c}_T^u$ denotes the cell state output by the LSTM network at the last time step, $\mathbf{h}_T^u$ denotes the hidden state output by the LSTM network at the last time step, $T$ denotes the last time step, and $\oplus$ denotes element-wise addition;

the output vector of the LSTM module is then spatially transformed into a user behavior sequence representation vector with the same dimension as the user representation vector:

$$\mathbf{u}^{seq} = \mathbf{W}_s \mathbf{s}_u + \mathbf{b}_s$$

where $\mathbf{u}^{seq}$ denotes user $u$'s behavior sequence representation vector, and $\mathbf{W}_s \in \mathbb{R}^{D \times P}$ and $\mathbf{b}_s \in \mathbb{R}^{D}$ denote the weight matrix and bias of the spatial transformation, respectively, with $P$ denoting the number of LSTM hidden-layer neurons.
4. The user portrait prediction method based on multi-source cross-border data fusion of claim 1, wherein the user behavior sequence perception layer models user sequence features with a GRU:

the hidden state at the last time step is directly the output vector of the GRU network:

$$\mathbf{s}_u = \mathbf{h}_T^u$$

where $\mathbf{s}_u$ denotes the output vector of user $u$'s behavior sequence after processing by the GRU module, and $\mathbf{h}_T^u$ denotes the hidden state output by the hidden layer at the last time step of the GRU network; likewise, the output vector processed by the GRU module must be transformed to the same dimension as the user representation vector:

$$\mathbf{u}^{seq} = \mathbf{W}_s \mathbf{s}_u + \mathbf{b}_s$$

where $\mathbf{u}^{seq}$ denotes user $u$'s behavior sequence representation vector, and $\mathbf{W}_s$ and $\mathbf{b}_s$ denote the weight matrix and bias of the spatial transformation, respectively.
5. The user portrait prediction method based on multi-source cross-border data fusion of claim 1, wherein the output layer is specifically:

$$\mathbf{u}_{final} = \mathbf{u}^{KG} \oplus \mathbf{u}^{seq}$$

$$\mathbf{o} = \mathbf{W} \mathbf{u}_{final} + \mathbf{b}$$

where $\mathbf{u}_{final}$ denotes the final user representation vector, $\mathbf{u}^{KG}$ denotes the representation vector with user neighbor features learned by the heterogeneous knowledge graph convolutional layer, $\mathbf{u}^{seq}$ denotes user $u$'s behavior sequence representation vector, and $\oplus$ denotes vector addition; $\mathbf{o}$ denotes the user output vector, $\mathbf{W}$ denotes the weight matrix, and $\mathbf{b}$ denotes the bias vector.
6. The user portrait prediction method based on multi-source cross-border data fusion of claim 1, wherein the softmax operation of step 5 is specifically:

$$o_i' = \frac{e^{o_i}}{\sum_{j=0}^{f-1} e^{o_j}}$$

where $o_i'$ denotes the probability of the $i$-th dimension obtained by the softmax function and $o_i$ denotes the value of the $i$-th dimension of the output vector $\mathbf{o}$; the softmax function yields the probability representation of the user output vector for the user features in dimensions $0, 1, \dots, f-1$.
7. The user portrait prediction method based on multi-source cross-border data fusion of claim 1, wherein the back-propagation of the user portrait prediction model uses a softmax cross-entropy loss function with the formula:

$$\mathcal{L} = \sum_{u \in \mathcal{U}} \mathcal{J}(y_u, \hat{y}_u) + \lambda \lVert \Theta \rVert_2^2$$

where $\mathcal{U}$ denotes the user set, $\mathcal{J}$ denotes the cross-entropy loss function, $y_u$ and $\hat{y}_u$ denote the true user label value and the model's predicted value, respectively; $\lambda \lVert \Theta \rVert_2^2$ is the L2 regularization term, $\lambda$ denotes the regularization coefficient controlling the strength of L2 regularization, and $\Theta$ denotes the model parameters.
8. A user portrait prediction device based on multi-source cross-border data fusion, comprising a memory, a processor, and a multi-source cross-border data fusion user portrait prediction model program stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the method of any one of claims 1 to 7.
9. A storage medium storing a multi-source cross-border data fusion user portrait prediction model program which, when executed by a processor, implements the steps of the user portrait prediction method based on multi-source cross-border data fusion of any one of claims 1 to 7.
CN202111531109.4A 2021-12-14 2021-12-14 User portrait prediction method based on multi-source cross-border data fusion Active CN114238758B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111531109.4A CN114238758B (en) 2021-12-14 2021-12-14 User portrait prediction method based on multi-source cross-border data fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111531109.4A CN114238758B (en) 2021-12-14 2021-12-14 User portrait prediction method based on multi-source cross-border data fusion

Publications (2)

Publication Number Publication Date
CN114238758A true CN114238758A (en) 2022-03-25
CN114238758B CN114238758B (en) 2023-04-11

Family

ID=80756046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111531109.4A Active CN114238758B (en) 2021-12-14 2021-12-14 User portrait prediction method based on multi-source cross-border data fusion

Country Status (1)

Country Link
CN (1) CN114238758B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210256355A1 (en) * 2020-02-13 2021-08-19 International Business Machines Corporation Evolving graph convolutional networks for dynamic graphs
CN112131404A (en) * 2020-09-19 2020-12-25 哈尔滨工程大学 Entity alignment method in four-risk one-gold domain knowledge graph
CN112350899A (en) * 2021-01-07 2021-02-09 南京信息工程大学 Network flow prediction method based on graph convolution network fusion multi-feature input
CN113590900A (en) * 2021-07-29 2021-11-02 南京工业大学 Sequence recommendation method fusing dynamic knowledge maps

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SIJIN ZHOU et al.: "Interactive Recommender System via Knowledge Graph-enhanced Reinforcement Learning", arXiv *
刘欢 (LIU Huan): "基于知识图谱驱动的图神经网络推荐模型" (A knowledge-graph-driven graph neural network recommendation model), 《计算机应用》 (Journal of Computer Applications) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117829968A (en) * 2024-03-06 2024-04-05 南京数策信息科技有限公司 Service product recommendation method, device and system based on user data analysis
CN117829968B (en) * 2024-03-06 2024-05-31 南京数策信息科技有限公司 Service product recommendation method, device and system based on user data analysis

Also Published As

Publication number Publication date
CN114238758B (en) 2023-04-11

Similar Documents

Publication Publication Date Title
Ma et al. Machine learning and AI in marketing–Connecting computing power to human insights
Khan et al. CNN with depthwise separable convolutions and combined kernels for rating prediction
US20220301024A1 (en) Sequential recommendation method based on long-term and short-term interests
CN108648049B (en) Sequence recommendation method based on user behavior difference modeling
Cao Coupling learning of complex interactions
Pan et al. Study on convolutional neural network and its application in data mining and sales forecasting for E-commerce
CN111222332B (en) Commodity recommendation method combining attention network and user emotion
CN110674407B (en) Hybrid recommendation method based on graph convolution neural network
Murphy Machine learning: a probabilistic perspective
CN107833117B (en) Bayesian personalized sorting recommendation method considering tag information
CN106447066A (en) Big data feature extraction method and device
CN112487199B (en) User characteristic prediction method based on user purchasing behavior
CN106445988A (en) Intelligent big data processing method and system
Hossein Javaheri Response modeling in direct marketing: a data mining based approach for target selection
CN108921602B (en) User purchasing behavior prediction method based on integrated neural network
CN114238758B (en) User portrait prediction method based on multi-source cross-border data fusion
Meng et al. Concept-concept association information integration and multi-model collaboration for multimedia semantic concept detection
Zhu et al. Multimodal sparse linear integration for content-based item recommendation
Zhang et al. Image annotation of ancient Chinese architecture based on visual attention mechanism and GCN
CN116823321B (en) Method and system for analyzing economic management data of electric business
Ifada et al. Do-rank: DCG optimization for learning-to-rank in tag-based item recommendation systems
CN115293851A (en) Recommendation method for introducing item category information into graph neural network
DIVYA et al. Matrix factorization for movie recommended system using deep learning
Svensson et al. Exploring NMF and LDA Topic Models of Swedish News Articles
Tekchandani et al. Applied Deep Learning: Design and implement your own Neural Networks to solve real-world problems (English Edition)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant