CN114238758B - User portrait prediction method based on multi-source cross-border data fusion - Google Patents
- Publication number
- CN114238758B (application CN202111531109.4A)
- Authority
- CN
- China
- Prior art keywords
- user
- vector
- representing
- output
- interaction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0631—Item recommendations
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a user portrait prediction method based on multi-source cross-border data fusion, aimed at the inaccurate user-feature prediction that prior-art methods suffer from sparse item features, loss of high-order structural features, and loss of user behavior sequence features. Starting from the e-commerce data a user generates, commodity features are expanded with a knowledge graph, the user's historical purchase records are fully mined with a graph convolutional network, and the user's latent purchase features are predicted with a recurrent neural network, effectively improving the accuracy of user portrait prediction. The knowledge graph addresses the sparsity of commodity features, the graph convolutional neural network addresses the loss of high-order structural features, and the recurrent neural network addresses the loss of user behavior sequence features, laying a good foundation for improving recommendation-system performance.
Description
Technical Field
The invention relates to a user portrait prediction method based on multi-source cross-border data fusion, in which the portrait is constructed from the historical order sequence and content information of the user's shopping.
Background
With the development of internet technology and intelligent devices, mobile applications of all kinds are emerging and penetrating people's lives, and the information they generate is growing explosively. This makes it difficult for people to obtain the information they want efficiently, and for enterprises to push products and information to users accurately. Recommendation systems are built on user portraits; constructing user portraits efficiently facilitates fine-grained marketing and accurate recommendation by enterprises.
Online shopping has become extremely common in daily life, and many users directly or indirectly provide personal information to a shopping platform while enjoying its convenience. Direct information includes gender, age, place of residence, and the like; indirect information includes browsing records, purchase records, favoriting records, and the like. From this information, the shopping platform can construct a virtual portrait of the user on the internet, accurately recommend needed goods to the user, and thereby increase the platform's revenue.
User portrait technology is mainly applied in the field of personalized recommendation. Methods for predicting user portrait labels include SVM, decision trees, LR, and other traditional shallow learning models, which have achieved good results. However, as the data users generate in the big-data era grows explosively and feature dimensionality increases, the limitations of the flattened structure of traditional shallow learning models begin to show. For example, in typical problems such as user click-through rate estimation and conversion rate estimation, the processed input features are high-dimensional and highly sparse; traditional shallow learning methods face real challenges in user label prediction because they cannot capture complex nonlinear relations among features.
In the field of electronic commerce, a user's historical purchasing behavior contains information about the user's preferences. Mining the user's purchase records can effectively improve the accuracy of the user portrait and, in turn, the performance of a recommendation system. For example, if a user's purchase history contains a large number of "Huawei"-brand items, indicating that the user is a Huawei fan, the user will very likely not buy an "iPhone"-brand phone recommended by the recommendation system; but if the recommendation system pushes a newly released "Huawei" phone, the user may buy it when a replacement is needed. Here "Huawei" and "iPhone" are implicit features hidden in the user's historical purchasing behavior. Other implicit features include the "efficacy", "genre", "price", or "spokesperson" of a product, or the "director", "producer", or "genre" of a movie. Such implicit item features often suffer from sparsity on a network platform. In addition, most existing methods do not mine the associations between users or between items: they treat user feature prediction as a classification task in which each user feature is relatively independent, so the associated features between users and between items are lost to some extent, and an effective representation vector of a user cannot be learned for user feature prediction.
The invention utilizes the knowledge graph to supplement the characteristics of the user historical purchased commodities and provides a user portrait prediction method for learning the high-order structural characteristics of the user based on the graph convolution neural network. Meanwhile, the characteristics of the user are supplemented by the recurrent neural network according to the historical purchase order sequence of the user, and a complete user portrait prediction method based on multi-source cross-border data fusion is constructed.
Disclosure of Invention
The invention aims to solve the prior-art problems of item feature sparsity, loss of high-order structural features, and loss of user behavior sequence features, and provides a user portrait prediction method based on multi-source cross-border data fusion.
The technical scheme adopted by the invention is as follows:
Step 1: Collect information generated by the user's interactions on a shopping platform;
Step 2: Construct a heterogeneous knowledge graph and the user historical interaction sequences;
Step 3: Construct the embedding matrices;
Step 4: Construct and train the user portrait prediction model of multi-source cross-border data fusion, obtaining the optimal parameter model once the model parameters converge;
Step 5: Predict user features using the trained multi-source cross-border data fusion user portrait prediction model obtained in step 4.
The invention also aims to provide a user portrait prediction device based on multi-source cross-border data fusion, comprising a memory, a processor, and a sequence-aware and graph-convolution-based neural network model program stored in the memory and runnable on the processor; when executed by the processor, the program implements the steps of the above user portrait prediction method based on multi-source cross-border data fusion.
It is still another object of the present invention to provide a storage medium storing a multi-source cross-border data fusion user portrait prediction model program which, when executed by a processor, implements the steps of the above user portrait prediction method based on multi-source cross-border data fusion.
The technical scheme provided by the invention has the following beneficial effects:
(1) According to users' historical orders, commodity features are expanded with a knowledge graph, solving the sparsity of commodity features in e-commerce data;
(2) Knowledge subgraphs are built from the commodities and the related knowledge graph triples; a graph convolutional network fully learns the subgraph node features, preserves the structural features of the graph as much as possible, avoids feature loss during training, and yields representation vectors that fully capture each entity and its local neighbor features, solving the loss of high-order structural features;
(3) For the user's historical order sequence, a recurrent neural network extracts the features hidden in the user behavior sequence. Combined with the high-order structural features learned by the graph convolutional network model, this solves the loss of user behavior sequence features and further improves the model's user portrait prediction ability.
Drawings
FIG. 1 is a flow chart according to the present invention;
FIG. 2 is a diagram of a model structure;
FIG. 3 is a schematic of a heterogeneous knowledge graph;
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
A specific flow description of a user portrait prediction method of multi-source cross-border data fusion is shown in FIG. 1, in which:
step 1: and collecting information generated by interaction of the user on the shopping platform.
The collected information includes:
(1) Basic information of the user includes gender and age.
(2) The user behavior records comprise the time of purchasing the commodity, the commodity number, the commodity name and the like.
Step 2: Construct a heterogeneous knowledge graph and the user historical interaction sequences;
2-1 construction of heterogeneous knowledge graph
2-1-1: Perform word segmentation on the commodity name to obtain a segmentation result set {i_1, i_2, ..., i_m, ...}, where i_m denotes the m-th segment;
2-1-2: Perform a 2-round recursive search over the segmentation result set in a public knowledge graph, discarding segments that do not exist in the public knowledge graph; the remaining segments, together with the entities found in the public knowledge graph, form an entity set ε = {e_1, e_2, ..., e_n, ...} and are further assembled into triples (i_m, contain, e_n), where contain denotes the association between i_m and e_n; use these triples (i_m, contain, e_n) to construct the knowledge subgraph G_i corresponding to each commodity name;
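The subgraph-construction step above (segmentation, 2-round recursive search, triple assembly) can be sketched as follows; the tokenizer and the `kg_links` dictionary, standing in for a real segmenter and the public knowledge graph, are toy assumptions:

```python
def build_subgraph_triples(commodity_name, kg_links, tokenize, rounds=2):
    """Sketch of step 2-1-2: segment a commodity name, discard segments
    absent from the public knowledge graph, then expand each kept segment
    by `rounds` of recursive search, emitting (segment, 'contain', entity)
    triples for the commodity's knowledge subgraph.

    kg_links: dict mapping an entity to the entities it links to in the
    public knowledge graph (a toy stand-in for a real KG API).
    """
    # Keep only segments that exist as entities in the public KG.
    tokens = [t for t in tokenize(commodity_name) if t in kg_links]
    triples = []
    for t in tokens:
        frontier, seen = {t}, set()
        for _ in range(rounds):          # 2-round recursive search
            nxt = set()
            for e in frontier:
                for e2 in kg_links.get(e, ()):
                    if e2 not in seen:
                        triples.append((t, "contain", e2))
                        nxt.add(e2)
                        seen.add(e2)
            frontier = nxt
    return triples
```

For example, with `kg_links = {"phone": ["smartphone"], "smartphone": ["electronics"]}` and whitespace tokenization, the name "phone case" keeps only the segment "phone" and yields two triples over the two search rounds.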
2-1-3: Integrate the knowledge subgraphs G_i corresponding to the commodity names into a heterogeneous knowledge graph G.
The heterogeneous knowledge graph G comprises nodes V and edges E. The nodes V comprise the user set U, the commodity name set I, and the entity set ε. The edges comprise three types: entity-entity knowledge graph relations E_ee, commodity name-user interaction records E_iu, and pairs of users with the same click behavior E_uu;
The entity-entity knowledge graph relations E_ee are the relations between any two entities in the entity set;
2-2: From the user set, the commodity name set, and the commodity name-user interaction records, construct a user-commodity name interaction matrix Y ∈ {0,1}^{N×M}, where N denotes the number of users and M the number of commodity names;
In the user-commodity name interaction matrix, y_uv = 1 indicates that user u interacted with commodity name v (e.g., bought, browsed, or clicked it), and y_uv = 0 indicates that user u did not interact with commodity name v;
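A minimal sketch of building the 0/1 user-commodity name interaction matrix from raw interaction records (the function and argument names are illustrative):

```python
import numpy as np

def interaction_matrix(records, users, items):
    """Sketch of step 2-2: records is an iterable of (user, item) pairs
    (a purchase, browse, or click). Returns Y in {0,1}^{N x M} with
    y_uv = 1 iff user u interacted with commodity name v."""
    u_idx = {u: i for i, u in enumerate(users)}
    v_idx = {v: j for j, v in enumerate(items)}
    Y = np.zeros((len(users), len(items)), dtype=int)
    for u, v in records:
        Y[u_idx[u], v_idx[v]] = 1        # any interaction type sets the cell
    return Y
```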
2-3: From the user-commodity name interaction matrix, further construct the set of user historical interaction sequences, each of the form S_u = {(v_1^u, t_1^u), (v_2^u, t_2^u), ...},
where v_i^u denotes the commodity name of user u's i-th historical interaction and t_i^u denotes the moment at which the interaction between user u and v_i^u occurred;
Step 3: From the user set, the entity set ε, and the entity-entity knowledge graph relations E_ee, further construct a user embedding matrix U ∈ R^{N×D}, an entity embedding matrix V ∈ R^{|ε|×D}, and a user adjacency matrix A ∈ {0,1}^{N×N}, where D denotes the vector dimension; each element of the user adjacency matrix indicates whether the click behaviors of two users are similar;
An element a_{u1,u2} = 1 of the user adjacency matrix indicates that users u_1 and u_2 have similar click behavior, and a_{u1,u2} = 0 indicates that they do not;
Step 4: Construct the user portrait prediction model of multi-source cross-border data fusion;
the user portrait prediction model of multi-source cross-border data fusion comprises an input embedding layer, a heterogeneous knowledge graph convolution layer, a user behavior sequence perception layer and an output layer:
4-1 Input embedding layer: construct the set of user interaction entities N_e(u) from the user's historical interaction sequence; represent the user as a vector via the user embedding matrix, and represent the user interaction entities as vectors S_i(u) via the entity embedding matrix; obtain the embedding vectors of the user's neighboring users from the user adjacency matrix and construct the neighboring user set S_u(u);
4-2 Heterogeneous knowledge graph convolutional layer: after the representation vectors of the user interaction entities enter the heterogeneous knowledge graph convolutional layer, two operations are performed;
4-2-1: Through H rounds of iterative aggregation of neighbor topological features, the user interaction entities yield a user-commodity name representation vector e_u carrying neighbor features;
4-2-2: The representation vectors of user u's neighboring user set are aggregated with user u's representation vector to obtain the user neighbor feature representation vector n_u;
4-2-3: The user-commodity name representation vector e_u and the user neighbor feature representation vector n_u are concatenated, and the concatenated vector is added to user u's representation vector to obtain the output vector u_g of the heterogeneous knowledge graph convolutional layer;
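The two convolutional-layer operations can be sketched as below. The mean-aggregation rule and the projection matrix `W` (mapping the concatenated 2D-dimensional vector back to D dimensions before the residual addition) are assumptions, since the patent's exact aggregation formulas appear only as figures:

```python
import numpy as np

def aggregate(h, adj, H=2):
    """Sketch of 4-2-1: H rounds of iterative neighbor aggregation.
    Each round adds the mean of a node's neighbor vectors to its own
    vector (h: n x D node features, adj: n x n 0/1 adjacency)."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1                    # avoid division by zero for isolated nodes
    for _ in range(H):
        h = h + (adj @ h) / deg
    return h

def conv_layer_output(e_u, n_u, u_vec, W):
    """Sketch of 4-2-3: concatenate the item-side vector e_u and the
    neighbor vector n_u, project the 2D-dim result back to D dims with an
    assumed weight matrix W (shape 2D x D), then add the user's own vector."""
    return np.concatenate([e_u, n_u]) @ W + u_vec
```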
4-3 User behavior sequence perception layer: LSTM or GRU is used to model the user's sequence features and extract the user's latent interests; the user's historical interaction sequence S_u is taken as input, and a vector with the same dimensionality as the output of the heterogeneous knowledge graph convolutional layer is produced;
the first method is as follows: user sequence feature modeling using LSTM
The hidden state and the cell state at the last time step of the recurrent neural network are added to obtain the output vector of the LSTM module:

z_u = h_T ⊕ c_T

where z_u denotes the output vector of user u's historical interaction sequence S_u after processing by the LSTM module, c_T denotes the cell state output by the LSTM network at the last time step, h_T denotes the hidden state output by the LSTM network at the last time step, T denotes the last time step, and ⊕ denotes element-wise addition;
The output vector of the LSTM module is then spatially transformed into a user behavior sequence representation vector with the same dimensionality as the user representation vector:

u_s = W_1 z_u + b_1

where u_s denotes user u's behavior sequence representation vector, and W_1 ∈ R^{D×P} and b_1 ∈ R^D denote the weight matrix and bias of the spatial transformation, with P the number of LSTM hidden-layer neurons;
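A sketch of the LSTM-side combination and spatial transformation; the final hidden state and cell state are assumed to come from any off-the-shelf LSTM implementation, and the shapes are illustrative:

```python
import numpy as np

def lstm_sequence_vector(h_T, c_T, W1, b1):
    """Sketch of method 1: add the last hidden state h_T and cell state
    c_T element-wise, then apply the spatial transformation W1 z + b1 that
    maps the P-dim LSTM output into the D-dim user representation space."""
    z = h_T + c_T                        # element-wise addition of hidden and cell state
    return W1 @ z + b1                   # assumed shapes: W1 (D x P), b1 (D,)
```

The GRU variant of 4-3 differs only in that z is the last hidden state alone, since a GRU has no separate cell state.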
the second method comprises the following steps: user sequence feature modeling using GRUs
z_u = h_T

where z_u denotes the output vector of user u's behavior sequence S_u after processing by the GRU module, and h_T denotes the hidden state output by the hidden layer of the GRU network at the last time step; likewise, the output vector processed by the GRU module must be converted to the same dimensionality as the user representation vector:

u_s = W_2 z_u + b_2

where u_s denotes user u's behavior sequence representation vector, and W_2 and b_2 denote the weight matrix and bias of the spatial transformation;
4-4 Output layer: the output layer adds the outputs of the heterogeneous knowledge graph convolutional layer and the user behavior sequence perception layer, then maps the result to an output vector whose dimensionality equals the number of predicted feature classes:

o = W u_final + b

where u_final = u_g ⊕ u_s denotes the final user representation vector, u_g denotes the representation vector with user neighbor features learned by the heterogeneous knowledge graph convolutional layer, u_s denotes user u's behavior sequence representation vector, and ⊕ denotes vector addition; o denotes the user output vector, W a weight matrix, and b a bias vector;
Step 5: Apply a softmax operation to the user output vector o obtained in step 4 to obtain the probabilities of the predicted basic user information (i.e., gender or age bracket):

o'_i = exp(o_i) / Σ_j exp(o_j)

where o'_i denotes the probability of the i-th dimension produced by the softmax function and o_i denotes the value of the i-th dimension of the output vector o; the softmax function turns the user output vector into a probability distribution over the f dimensions 0, 1, ..., f-1 corresponding to the user features;
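The output layer and the softmax step can be sketched together; the vector names and shapes are illustrative:

```python
import numpy as np

def output_probabilities(u_g, u_s, W, b):
    """Sketch of 4-4 and step 5: u_final = u_g + u_s (conv-layer output
    plus sequence-layer output), o = W u_final + b, then a numerically
    stable softmax turning the logits o into class probabilities."""
    o = W @ (u_g + u_s) + b
    e = np.exp(o - o.max())              # subtract max for numerical stability
    return e / e.sum()
```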
the back propagation process of the whole model adopts a softmax cross entropy loss function, and the formula is as follows:
L = Σ_{u∈U} CE(y_u, ŷ_u) + λ ||Θ||_2^2

where U denotes the user set, CE denotes the cross-entropy loss function, and y_u and ŷ_u denote the true user label and the model prediction, respectively; ||Θ||_2^2 is the L2 regularization term, λ is the regularization coefficient controlling the strength of the L2 regularization, and Θ denotes the parameters of the model, such as the user, entity, and relation embedding matrices U, V, R and the weight matrices between neural network layers.
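A sketch of the training objective, assuming the cross-entropy is computed from softmax probabilities and the L2 term sums the squared model parameters:

```python
import numpy as np

def total_loss(probs, labels, params, lam):
    """Softmax cross-entropy summed over users plus lam * ||Theta||_2^2.

    probs: (num_users x num_classes) predicted probabilities per user;
    labels: integer true-class index per user;
    params: list of parameter arrays (embedding and weight matrices);
    lam: the L2 regularization coefficient lambda."""
    ce = -np.sum(np.log(probs[np.arange(len(labels)), labels]))
    l2 = lam * sum(np.sum(p ** 2) for p in params)
    return ce + l2
```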
The performance of the invention is evaluated on the MovieLens-1M movie dataset and the JD.com (Jingdong) e-commerce dataset. On each dataset, the model is evaluated on binary gender prediction and multi-class age prediction.
The following table shows the data volume of two data sets after knowledge graph entity screening:
The two datasets use the Microsoft Satori and zhishi.me Chinese knowledge graphs, respectively, to perform triple feature expansion on the entity sets extracted from the commodity names. The user-feature distributions of the datasets are as follows:
(1) Sex aspect:
a) In the MovieLens-1M movie dataset, male users account for 72% and female users for 28%;
b) In the JD.com e-commerce dataset, male users account for 44% and female users for 56%.
(2) Age-related:
a) In the MovieLens-1M movie dataset, users under 25 account for 22%, users aged 25-34 for 35%, users aged 35-50 for 29%, and users over 50 for 15%;
b) In the JD.com e-commerce dataset, users under 26 account for 14%, users aged 26-35 for 55%, users aged 36-55 for 30%, and users over 55 for 1%.
The performance evaluation indexes adopted by the invention are Accuracy and macro-F1.
|                     | True value = 1     | True value = 0     |
| Predicted value = 1 | TP (True Positive) | FP (False Positive) |
| Predicted value = 0 | FN (False Negative) | TN (True Negative) |

Accuracy: the proportion of correctly classified samples among all samples:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
macro_F1 is a variant of the F1_score evaluation metric used in machine learning to measure binary classifiers; F1_score is defined as:

F1_score = 2 · precision · recall / (precision + recall)

where precision and recall denote the classification precision and the recall rate, respectively: precision evaluates whether the model's positive predictions are accurate, and recall evaluates the proportion of all positive examples that the classifier identifies as positive. From the formula above, F1_score is an evaluation metric that combines the classifier's precision and recall.
Since the conventional F1_score is mostly used for binary classification while age prediction in this experiment is a multi-class problem, macro_F1 is used as the evaluation metric; macro_F1 is the average of the per-class F1_score values:

macro_F1 = (1/N) Σ_{i=1}^{N} F1_score_i

where F1_score_1, F1_score_2, ..., F1_score_N denote the F1_score of classes 1, 2, ..., N, and N is the number of classes.
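The two evaluation metrics can be sketched directly from their definitions:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """(TP + TN) / total: the fraction of correctly classified samples."""
    return float(np.mean(y_true == y_pred))

def macro_f1(y_true, y_pred, n_classes):
    """Average of per-class F1_score, with F1 = 2*precision*recall /
    (precision + recall); classes with no predictions or no examples
    contribute an F1 of 0."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / n_classes
```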
The following table shows the results of the gender prediction experiment of the present invention on the above two data sets:
the following table shows the results of the age prediction experiment of the present invention on the above two data sets:
In the gender and age prediction result tables above, logistic regression and the support vector machine are traditional machine learning classifiers, LightGBM is an efficient classification model based on gradient-boosted decision trees proposed by Microsoft, and the heterogeneous knowledge graph convolutional network (Ba-KGCN) is the multi-source cross-border data fusion user portrait prediction model of the invention.
Claims (9)
1. A user portrait prediction method based on multi-source cross-border data fusion is characterized by comprising the following steps:
step 1: collecting information generated by interaction of a user on a shopping platform, wherein the information comprises basic information of the user and user behavior records, and constructing a user set, a commodity name set and a commodity name-user interaction record;
basic information of the user comprises gender and age;
the user behavior records comprise time for purchasing commodities, commodity numbers and commodity names;
step 2: constructing a heterogeneous knowledge graph and a user historical interaction sequence;
2-1, constructing a heterogeneous knowledge graph and a user behavior sequence set;
2-1-1: Perform word segmentation on the commodity name to obtain a segmentation result set {i_1, i_2, ..., i_m, ...}, where i_m denotes the m-th segment;
2-1-2: Perform a 2-round recursive search over the segmentation result set in a public knowledge graph, discarding segments that do not exist in the public knowledge graph; the remaining segments, together with the entities found in the public knowledge graph, form an entity set ε = {e_1, e_2, ..., e_n, ...} and are further assembled into triples (i_m, contain, e_n), where contain denotes the association between i_m and e_n; use these triples (i_m, contain, e_n) to construct the knowledge subgraph G_i corresponding to each commodity name;
2-1-3: Integrate the knowledge subgraphs G_i corresponding to the commodity names into a heterogeneous knowledge graph G;
2-2: Construct a user-commodity name interaction matrix Y ∈ {0,1}^{N×M} from the user set, the commodity name set, and the commodity name-user interaction records, where N denotes the number of users and M the number of commodity names;
2-3: From the user-commodity name interaction matrix, further construct the set of user historical interaction sequences, each of the form S_u = {(v_1^u, t_1^u), (v_2^u, t_2^u), ...},
where v_i^u denotes the commodity name of user u's i-th historical interaction and t_i^u denotes the moment of that interaction;
Step 3: From the user set and the entity set ε, further construct a user embedding matrix U ∈ R^{N×D}, an entity embedding matrix V ∈ R^{|ε|×D}, and a user adjacency matrix A ∈ {0,1}^{N×N}, where D denotes the vector dimension; each element of the user adjacency matrix indicates whether the click behaviors of two users are similar;
Step 4: Construct the user portrait prediction model of multi-source cross-border data fusion;
the user portrait prediction model of multi-source cross-border data fusion comprises an input embedding layer, a heterogeneous knowledge graph convolution layer, a user behavior sequence perception layer and an output layer:
4-1 Input embedding layer: construct the set of user interaction entities N_e(u) from the user's historical interaction sequence; represent the user as a vector via the user embedding matrix; represent the user interaction entities as vectors S_i(u) via the entity embedding matrix; obtain the embedding vectors of the user's neighboring users from the user adjacency matrix and construct the neighboring user set S_u(u);
4-2 Heterogeneous knowledge graph convolutional layer: after the representation vectors of the user interaction entities enter the heterogeneous knowledge graph convolutional layer, two operations are performed;
4-2-1: Through H rounds of iterative aggregation of neighbor topological features, the user interaction entities yield a user-commodity name representation vector e_u carrying neighbor features;
4-2-2: The representation vectors of user u's neighboring user set are aggregated with user u's representation vector to obtain the user neighbor feature representation vector n_u;
4-2-3: The user-commodity name representation vector e_u and the user neighbor feature representation vector n_u are concatenated, and the concatenated vector is added to user u's representation vector to obtain the output vector u_g of the heterogeneous knowledge graph convolutional layer;
4-3 User behavior sequence perception layer: LSTM or GRU is used to model the user's sequence features and extract the user's latent interests; the user's historical interaction sequence S_u is taken as input, and a vector with the same dimensionality as the output of the heterogeneous knowledge graph convolutional layer is produced;
4-4 Output layer: the output layer adds the outputs of the heterogeneous knowledge graph convolutional layer and the user behavior sequence perception layer, then converts the result into an output vector whose dimensionality equals the number of predicted feature classes;
and 5: and (4) performing softmax operation on the user output vector o obtained in the step (4) to obtain the probability corresponding to the basic information of the predicted user.
2. The method for predicting a user portrait based on multi-source cross-border data fusion according to claim 1, wherein the heterogeneous knowledge graph comprises nodes V and edges E; the nodes V comprise a user set U, a commodity-name set I, and an entity set ε; the edges comprise three types: entity-entity knowledge-graph relations, commodity-name-user interaction records E_iu, and same-click-behavior relations between pairs of users E_uu; an entity-entity knowledge-graph relation is a relation between any two entities in the entity set.
3. The method for predicting a user portrait based on multi-source cross-border data fusion according to claim 1, wherein the user behavior sequence perception layer models the user's sequential features with an LSTM:
The hidden state and the cell state at the last time step of the recurrent neural network are added to obtain the output vector of the LSTM module:

z_u = h_T ⊕ c_T

where z_u denotes the output vector obtained after the historical interaction sequence of user u is processed by the LSTM module, c_T denotes the cell state output by the LSTM network at the last time step, h_T denotes the hidden state output by the LSTM network at the last time step, T denotes the last time step, and ⊕ denotes element-wise addition;
The output vector of the LSTM module is then spatially transformed into a user behavior sequence representation vector with the same dimensionality as the user representation vector:

s_u = W_s z_u + b_s

where s_u denotes the behavior sequence representation vector of user u, W_s and b_s denote the weight matrix and the bias of the spatial transformation, respectively, and P denotes the number of LSTM hidden-layer neurons.
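A sketch of the two operations in claim 3, using toy dimensions and random stand-ins; in practice h_T and c_T would come from an actual LSTM, and the weight values here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
P, d = 32, 16                # P: LSTM hidden units, d: user-vector dimension

h_T = rng.normal(size=P)     # hidden state at the last time step (stand-in)
c_T = rng.normal(size=P)     # cell state at the last time step (stand-in)

z = h_T + c_T                # element-wise addition -> LSTM module output

W_s = rng.normal(size=(d, P))  # assumed spatial-transformation weights
b_s = rng.normal(size=d)       # assumed bias

s_u = W_s @ z + b_s            # behavior-sequence representation vector
print(s_u.shape)  # (16,)
```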
4. The method for predicting a user portrait based on multi-source cross-border data fusion according to claim 1, wherein the user behavior sequence perception layer models the user's sequential features with a GRU:

z_u = h_T

where z_u denotes the output vector obtained after the behavior sequence of user u is processed by the GRU module, and h_T denotes the hidden state output by the hidden layer of the GRU network at the last time step; likewise, the output vector of the GRU module is transformed into the same dimensionality as the user representation vector:

s_u = W_s z_u + b_s
5. The method for predicting a user portrait based on multi-source cross-border data fusion according to claim 1, wherein the output layer is specified as follows:

o = W u_final + b,  u_final = e_u ⊕ s_u

where u_final denotes the final user representation vector, e_u denotes the representation vector with user-neighbor features learned through the heterogeneous knowledge graph convolutional layer, s_u denotes the behavior sequence representation vector of user u, and ⊕ denotes vector addition; o denotes the user output vector, W denotes the weight matrix, and b denotes the bias vector.
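A sketch of the output layer in claim 5; the dimensions and random weights are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(3)
d, f = 16, 5                 # d: representation dim, f: number of predicted feature classes

e_u = rng.normal(size=d)     # output of the heterogeneous KG convolutional layer (stand-in)
s_u = rng.normal(size=d)     # behavior-sequence representation vector (stand-in)

u_final = e_u + s_u          # vector addition -> final user representation

W = rng.normal(size=(f, d))  # assumed output weight matrix
b = rng.normal(size=f)       # assumed bias vector

o = W @ u_final + b          # user output vector, one value per feature class
print(o.shape)  # (5,)
```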
6. The method for predicting a user portrait based on multi-source cross-border data fusion according to claim 1, wherein the softmax operation of step 5 is specified as:

o'_i = exp(o_i) / Σ_{j=0}^{f-1} exp(o_j)

where o'_i denotes the probability of the i-th dimension obtained through the softmax function, and o_i denotes the value of the i-th dimension of the output vector o; the softmax function yields the probabilities that the user output vector assigns to the user features corresponding to dimensions 0, 1, …, f-1.
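The softmax of claim 6 can be sketched in a few lines of plain Python; the max-subtraction is a standard numerical-stability trick that does not change the result:

```python
import math

def softmax(o):
    """Numerically stable softmax over the output vector o."""
    m = max(o)
    exps = [math.exp(x - m) for x in o]
    total = sum(exps)
    return [e / total for e in exps]

o = [2.0, 1.0, 0.1]
probs = softmax(o)
print(probs)  # per-dimension probabilities, in decreasing order here
print(abs(sum(probs) - 1.0) < 1e-12)  # True: probabilities sum to 1
```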
7. The method for predicting a user portrait based on multi-source cross-border data fusion according to claim 1, wherein the back-propagation process of the user portrait prediction model based on multi-source cross-border data fusion uses a softmax cross-entropy loss function of the form:

L = Σ_{u∈U} J(y_u, ŷ_u) + λ‖Θ‖²₂

where U denotes the user set, J denotes the cross-entropy loss function, and y_u and ŷ_u denote the true user label value and the model prediction, respectively; λ‖Θ‖²₂ is the L2 regularization term, λ denotes the regularization coefficient controlling the strength of the L2 regularization, and Θ denotes the model parameters.
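A sketch of the loss in claim 7: per-user cross-entropy summed over a batch, plus an L2 penalty on the parameters. The batch values, parameter list, and λ are made-up toy data:

```python
import math

def ce_loss(y_true, y_prob):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob) if t > 0)

def total_loss(batch, params, lam=0.01):
    """Sum of per-user cross-entropy plus an L2 penalty on parameters."""
    ce = sum(ce_loss(y, p) for y, p in batch)
    l2 = lam * sum(w * w for w in params)
    return ce + l2

batch = [([1, 0, 0], [0.7, 0.2, 0.1]),   # (true one-hot label, predicted probs)
         ([0, 1, 0], [0.1, 0.8, 0.1])]
params = [0.5, -0.3]                     # toy model parameters Θ
print(total_loss(batch, params))         # ≈ 0.5832
```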
8. A user portrait prediction device based on multi-source cross-border data fusion, comprising a memory, a processor, and a user portrait prediction model program based on multi-source cross-border data fusion that is stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the method of any one of claims 1-7.
9. A storage medium storing a user portrait prediction model program based on multi-source cross-border data fusion, which, when executed by a processor, implements the steps of the user portrait prediction method based on multi-source cross-border data fusion of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111531109.4A CN114238758B (en) | 2021-12-14 | 2021-12-14 | User portrait prediction method based on multi-source cross-border data fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114238758A CN114238758A (en) | 2022-03-25 |
CN114238758B (en) | 2023-04-11 |
Family
ID=80756046
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117829968B (en) * | 2024-03-06 | 2024-05-31 | 南京数策信息科技有限公司 | Service product recommendation method, device and system based on user data analysis |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112131404A (en) * | 2020-09-19 | 2020-12-25 | 哈尔滨工程大学 | Entity alignment method in four-risk one-gold domain knowledge graph |
CN112350899A (en) * | 2021-01-07 | 2021-02-09 | 南京信息工程大学 | Network flow prediction method based on graph convolution network fusion multi-feature input |
CN113590900A (en) * | 2021-07-29 | 2021-11-02 | 南京工业大学 | Sequence recommendation method fusing dynamic knowledge maps |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11537852B2 (en) * | 2020-02-13 | 2022-12-27 | International Business Machines Corporation | Evolving graph convolutional networks for dynamic graphs |
Non-Patent Citations (2)
Title |
---|
Interactive Recommender System via Knowledge Graph-enhanced Reinforcement Learning; Sijin Zhou et al.; arXiv; 2020-07-18; full text * |
Knowledge-graph-driven graph neural network recommendation model (基于知识图谱驱动的图神经网络推荐模型); Liu Huan; Journal of Computer Applications (计算机应用); 2021-07-10; Vol. 41, No. 7; full text * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Pan et al. | Study on convolutional neural network and its application in data mining and sales forecasting for E-commerce | |
US20220301024A1 (en) | Sequential recommendation method based on long-term and short-term interests | |
Cao | Coupling learning of complex interactions | |
CN110674407B (en) | Hybrid recommendation method based on graph convolution neural network | |
Zhu et al. | Online purchase decisions for tourism e-commerce | |
CN106447066A (en) | Big data feature extraction method and device | |
CN111191092B (en) | Label determining method and label determining model training method | |
CN106445988A (en) | Intelligent big data processing method and system | |
CN111222332A (en) | Commodity recommendation method combining attention network and user emotion | |
Xia et al. | ForeXGBoost: passenger car sales prediction based on XGBoost | |
CN112487199B (en) | User characteristic prediction method based on user purchasing behavior | |
CN108921602B (en) | User purchasing behavior prediction method based on integrated neural network | |
Hossein Javaheri | Response modeling in direct marketing: a data mining based approach for target selection | |
CN113761359A (en) | Data packet recommendation method and device, electronic equipment and storage medium | |
CN114238758B (en) | User portrait prediction method based on multi-source cross-border data fusion | |
She et al. | Learning discriminative sentiment representation from strongly-and weakly supervised CNNs | |
Zhu et al. | Multimodal sparse linear integration for content-based item recommendation | |
Zhang et al. | Image annotation of ancient Chinese architecture based on visual attention mechanism and GCN | |
CN116823321B (en) | Method and system for analyzing economic management data of electric business | |
CN110851694A (en) | Personalized recommendation system based on user memory network and tree structure depth model | |
CN117132368A (en) | Novel media intelligent marketing platform based on AI | |
CN109344319B (en) | Online content popularity prediction method based on ensemble learning | |
CN112148994A (en) | Information push effect evaluation method and device, electronic equipment and storage medium | |
Zhang et al. | Multi-view dynamic heterogeneous information network embedding | |
CN111666410B (en) | Emotion classification method and system for commodity user comment text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||