CN113486143A - User portrait generation method based on multi-level text representation and model fusion - Google Patents

User portrait generation method based on multi-level text representation and model fusion

Info

Publication number
CN113486143A
Authority
CN
China
Prior art keywords
word
model
representation
vector
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110569271.9A
Other languages
Chinese (zh)
Inventor
杜永萍
苗宇
金醒男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110569271.9A priority Critical patent/CN113486143A/en
Publication of CN113486143A publication Critical patent/CN113486143A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G06F 16/353 - Clustering; Classification into predefined classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a user portrait generation method based on multi-level text representation and model fusion. Internet search texts of real users are collected and preprocessed, text features are extracted at several different levels, each level is classified by a different neural network, and a secondary classifier then re-classifies the predictions of those neural networks to produce the user feature portrait. Specifically, the internet search data of real users are segmented into words, from which word-level, sub-word-level, and character-level vector representations are generated and fed into different deep neural networks for classification. K-fold cross validation is used in the training stage of each neural network classifier. Finally, the predictions that each primary classification model produces for the training data and for the test data are concatenated and used as the training and test data of a secondary classifier, whose re-classification yields an accurate portrait of internet users.

Description

User portrait generation method based on multi-level text representation and model fusion
Technical Field
The invention relates to a user portrait generation method based on multi-level text representation and model fusion, and belongs to the field of natural language processing application.
Background
Alan Cooper, the father of interaction design, first proposed the concept of personas, pointing out that a persona is a virtual representation of a real user: a target-user model built on a series of real data. A user portrait is essentially user tagging: after an enterprise collects and analyzes data on consumers' social habits, living habits, consumption behavior and other information, it abstracts a complete picture of the user, helping the enterprise quickly locate precise user groups. The invention uses the internet metadata of real users to predict user attributes, serving enterprises in collecting broader information about user needs.
Predicting user attributes from the internet metadata of real users is essentially an application of text classification, a classic task in natural language processing. Text classification is the process of assigning predefined labels to texts, and many implementations of it have accumulated.
A convolutional neural network is essentially a multi-layer perceptron; its success comes from reducing the number of weights, the complexity of the model, the risk of overfitting, and so on. Convolutional neural networks are not only advantageous for image problems; their excellent feature-extraction capability can also be applied to one-dimensional text-sequence problems.
Text data is unstructured, and structuring it is the first step a computer takes in text processing. The traditional one-hot representation suffers from high dimensionality and cannot express the semantic information of words. In a distributed representation, by contrast, each word is mapped through training to a shorter word vector, and a notion of 'distance' exists between words, so more semantic and syntactic information can be captured.
Model fusion combines the classification results of several classifiers to obtain a new, more accurate prediction, which replaces the classification results of the first-stage classifiers. Fusion methods fall into two types, fixed and trainable. The advantage of the fixed fusion method is that it needs no additional training corpus and is simple to implement; the advantage of the trainable fusion method is that, given a sufficient corpus, it can achieve a better classification effect.
At present, deep neural network classification models and model-fusion algorithms are rarely applied to user-label prediction scenarios. Each text classification model has its own strengths and weaknesses in text feature extraction, and none of them fully considers the hierarchical representation of one-dimensional text features.
Problems to be solved and advantages achieved
With the rapid development of the mobile internet era, user portrait technology is becoming increasingly important for enterprises in grasping the characteristics of their target users and carrying out precise marketing and advertisement delivery. How to use modern data-mining techniques and artificial-intelligence algorithms to extract features from and label target customers, so that enterprises can discover and understand them, thereby increasing customer stickiness and reducing churn, is a direction of concern for enterprise marketing, product, and algorithm departments.
Starting from the enterprise goals of data modelling and label prediction for describing a user portrait, the invention studies how to use deep learning models and a model-fusion algorithm to model and predict user characteristics.
The method classifies user features from the metadata of real internet users. The user features are represented by text at different levels, taking both the semantic features and the grammatical structure of the text into account, and different deep neural networks perform classification prediction. Finally, considering the strengths and weaknesses of the classifiers built on each level of text representation, a secondary classification is performed on top of the training results of all primary classifiers, improving classification and prediction accuracy and providing a basis and support for the enterprise's subsequent precise-marketing tasks.
Disclosure of Invention
To make up for the shortcomings of existing methods, word-level, sub-word-level, and character-level vector representations are generated when creating word representations of the user metadata, and a deep neural network performs label classification prediction for each of them. Considering the strengths and weaknesses of the vector representations at the different text levels, after each primary classification model has been trained, the prediction results are re-classified by means of model fusion.
The invention provides a user portrait generation method based on multi-level text representation and model fusion, which comprises: a neural network classifier based on word-level feature vectors, a neural network classifier based on sub-word-level feature vectors, a neural network classifier based on character-level feature vectors, and a secondary model-fusion classifier.
The neural network classifier based on word-level feature vectors fine-tunes each word in the user metadata with a pre-trained word-vector model to obtain task-specific word vectors, which are fed into a deep convolutional neural network for classification.
The neural network classifier based on sub-word-level feature vectors trains sub-word-level vectors using the character-level n-gram idea, sums all sub-word vectors to obtain an averaged representation of each word, and then classifies with a neural network.
The neural network classifier based on character-level feature vectors represents words as character-level one-hot vectors and performs text classification with a convolutional neural network.
The model-fusion part trains the three classification models with K-fold cross validation; the resulting predictions are concatenated as new features and input into a secondary classifier for secondary classification. The specific technical scheme is as follows:
A user portrait generation method based on multi-level text representation and model fusion is characterized by comprising the following steps.
A basic corpus is established and organized.
The basic corpus consists mainly of real user metadata crawled from the internet, such as the search texts that users enter into a search engine.
Organizing the corpus includes cleaning and collating the collected data, filling in default values, and removing stop words and low-frequency words.
A user portrait generation model based on multi-level text representation and model fusion is then built; the three primary classifiers based on multi-level text representation are constructed first.
To construct the neural network classifier based on word-level feature vectors, a word-vector model is first introduced: the words of the user-metadata text are turned into word vectors by a pre-trained word-vector model, and, to guarantee the classification effect, the mean and variance of the word vectors are initialized to the mean and variance of the pre-trained word vectors; these vectors serve as the input of the model.
A convolutional neural network classification model is built: 3 one-dimensional convolution kernels of different sizes first perform n-gram feature extraction; after each convolution, a one-dimensional max-pooling layer keeps the maximum of the extracted features for feature dimensionality reduction; finally, 3 fully connected layers are stacked for classification, with dropout layers added between them to avoid overfitting.
The model outputs are the prediction vectors for the held-out fold of the training data in each round of K-fold cross validation and the prediction vectors for the original test data.
A neural network classification model based on the sub-word representation is then constructed.
A word w can be mapped to several character-level n-gram sub-words; to capture more root-and-affix information when training the sub-word vectors, character-level 3-gram to 6-gram sub-word representations are used together. The word vector of the centre word is the sum of the vectors of all its sub-words and of the word itself. During sub-word-vector training, to keep the vocabulary from becoming too large, a hashing technique maps several sub-words to the same vector; the final word vector is generated by summing the sub-word vectors.
The trained word vectors are summed and averaged to obtain the vector representation of a document, which serves as input and is classified by a linear classifier.
The model outputs are the prediction vectors for the held-out fold of the training data in each round of K-fold cross validation and the prediction vectors for the original test data.
A character representation-based neural network classification model is created.
A table of commonly used characters is compiled first. Since letter case has no particular influence on user-portrait generation, the compiled table consists of 70 characters, mainly the 26 lower-case letters, commonly used punctuation marks, and some special characters.
Based on the created character table, the characters contained in the user text are represented as one-hot vectors; a one-dimensional convolution kernel performs convolution along the feature dimension, and max pooling is applied along the length dimension.
A convolutional network classification model is built with six convolutional layers and three fully connected layers; three of the convolutional layers are paired with max-pooling layers for feature dimensionality reduction. A BN (Batch Normalization) operation is added to the classification model so that the feature vectors keep approximately the same distribution before entering each layer of the neural network.
After the classifier has been trained, the distribution of the data must also be considered when testing with the test data: the mean and variance used in each test are corrected with a smoothing method.
The model outputs are the prediction vectors for the held-out fold of the training data in each round of K-fold cross validation and the prediction vectors for the original test data.
Model fusion performs a secondary classification.
The predictions that each model produced for the held-out folds of the training data are concatenated into a new feature matrix, which serves as the input feature matrix of the secondary classifier; its labels are still the label values of the original training data.
The predictions that each model produced for the test data are summed, averaged, and concatenated into a matrix that serves as the test data of the secondary classifier; its labels are the label values of the original test data.
The feature matrices obtained above are input into the secondary classifier for training; the secondary classifier is formed of stacked fully connected layers followed by linear classification.
The method extracts text features at different levels, trains a separate classifier for each, and finally fuses the models' predictions, achieving a better prediction result than any single model.
Description of the drawings:
in order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings which are required to be used in the description of the embodiments or the prior art.
FIG. 1 is a schematic diagram of an overall model of the present invention.
FIG. 2 is a diagram of a classification model based on word-level text representation.
FIG. 3 is a diagram of a classification model based on sub-word level text representations.
FIG. 4 is a diagram of a classification model based on character-level text representations.
Detailed description of the embodiments:
FIG. 1 is a schematic structural diagram of the user portrait generation model based on multi-level text representation and model fusion according to the present invention; the model mainly comprises the primary classifiers built on each level of text representation and the classifier-fusion model.
The basic concepts and interrelationships involved in the present invention are as follows:
1. Word-level text representation: each word in the corpus is represented by a feature vector.
2. Sub-word-level text representation: each word in the corpus is represented by the averaged sum of its character-level n-gram feature vectors.
3. Letter-level (character-level) text representation: each word in the corpus is decomposed into individual letters for feature-vector representation.
4. Primary classifier: the classification model that classifies the text-representation feature vectors for the first time.
5. Secondary classifier: the classifier that performs a second classification on the prediction results of the primary classifiers.
The invention comprises the following specific steps:
Step (1): establish and organize the basic corpus. In the experiment, real internet search data of users together with user labels are used for model construction. The corpus is loaded first, a simple linear classifier is trained on all fully labelled samples, and the labels of unlabelled samples in the corpus are predicted and filled in.
All sentences in the corpus are then segmented into words, word frequencies are counted, and words that occur fewer than 3 times are removed; the result serves as the final corpus for subsequent processing.
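As an illustration, the following is a minimal Python sketch of this frequency filtering; the whitespace tokenizer and the function name are assumptions, since the segmentation tool is not specified in the text.

```python
# Minimal sketch: segment sentences, count word frequencies, and drop words
# that occur fewer than 3 times (tokenizer and names are assumptions).
from collections import Counter

def filter_low_frequency(sentences, min_count=3):
    tokenized = [s.split() for s in sentences]                 # naive whitespace segmentation
    freq = Counter(w for words in tokenized for w in words)    # corpus-wide word frequencies
    return [[w for w in words if freq[w] >= min_count]         # keep only frequent words
            for words in tokenized]
```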
Step (2) creating a neural network classification model based on word-level text representation:
A word-vector model is imported first; pre-trained word2vec word vectors are used here. To guarantee the classification effect, the mean and variance of the word vectors are initialized to the mean and variance of the pre-trained word vectors.
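A minimal sketch of one plausible reading of this initialization follows: rows of the embedding matrix for words absent from the pre-trained word2vec vocabulary are sampled from a normal distribution whose mean and standard deviation match the pre-trained vectors. The `pretrained` dictionary, the 300-dimensional size, and the function name are assumptions.

```python
# Minimal sketch: build an embedding matrix from pre-trained word2vec vectors;
# unseen words get vectors drawn with the pre-trained mean and std.
import numpy as np

def build_embedding_matrix(vocab, pretrained, embed_dim=300):
    all_vecs = np.stack(list(pretrained.values()))                 # pre-trained vectors
    mean, std = all_vecs.mean(), all_vecs.std()
    matrix = np.random.normal(mean, std, (len(vocab), embed_dim)).astype(np.float32)
    for i, word in enumerate(vocab):
        if word in pretrained:
            matrix[i] = pretrained[word]                           # known words keep their vector
    return matrix
```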
A convolutional neural network classification model is built: 3 one-dimensional convolution kernels of different sizes first extract word-level n-gram features; after each convolution, a one-dimensional max-pooling layer keeps the maximum of the extracted features for dimensionality reduction; finally, a fully connected layer is appended for classification, and a dropout layer with a rate of 0.5 is used to avoid overfitting.
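A minimal PyTorch sketch of such a word-level TextCNN follows. The kernel sizes (2, 3, 4), the number of filters, and the class name are illustrative assumptions; the single fully connected output layer and the dropout rate of 0.5 follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordLevelTextCNN(nn.Module):
    def __init__(self, embedding_matrix, num_classes,
                 kernel_sizes=(2, 3, 4), num_filters=128):
        super().__init__()
        vocab_size, embed_dim = embedding_matrix.shape
        self.embedding = nn.Embedding.from_pretrained(
            torch.as_tensor(embedding_matrix, dtype=torch.float), freeze=False)  # fine-tuned
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])  # n-gram extractors
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, embed_dim, seq_len)
        feats = []
        for conv in self.convs:
            c = F.relu(conv(x))                          # (batch, num_filters, L)
            feats.append(F.max_pool1d(c, c.size(2)).squeeze(2))  # 1-D max pooling over time
        out = self.dropout(torch.cat(feats, dim=1))      # concatenate the pooled features
        return self.fc(out)                              # class logits
```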
Since three primary classifiers are used in the experiment, 3-fold cross validation is used when training each of them. The training data are divided into three parts, denoted A1, A2 and A3. The concatenation of A2 and A3 is split 9:1 into a training set and a validation set for model training and validation, and the resulting model is used to predict A1.
Likewise, the model is trained and validated on A1 and A3 and then predicts A2, and finally it is trained and validated on the concatenation of A1 and A2 and predicts A3. The prediction results are concatenated end to end in order, and the resulting vector is recorded as a new feature vector B1. At the same time, each trained model predicts all of the original test data, yielding 3 prediction vectors; these are summed and averaged, and the result is recorded as C1.
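The following is a minimal sketch of how the out-of-fold feature B1 and the averaged test prediction C1 can be assembled for one primary classifier. The fit/predict_proba interface, the shuffling, and the omission of the inner 9:1 training/validation split are simplifying assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

def stacking_features(make_model, X_train, y_train, X_test, n_splits=3):
    n_classes = len(np.unique(y_train))
    oof_pred = np.zeros((len(X_train), n_classes))    # B: out-of-fold predictions
    test_pred = np.zeros((len(X_test), n_classes))    # C: averaged test predictions
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, heldout_idx in kf.split(X_train):  # X_train/X_test assumed numpy arrays
        model = make_model()
        model.fit(X_train[train_idx], y_train[train_idx])                  # train on the other folds
        oof_pred[heldout_idx] = model.predict_proba(X_train[heldout_idx])  # predict the held-out fold
        test_pred += model.predict_proba(X_test) / n_splits                # average over the folds
    return oof_pred, test_pred
```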
Step (3) creating a neural network classification model based on the sub-word representation:
Since mapping each word to a single word vector ignores the information inside the word, this classifier uses sub-word-based word-vector training for text classification, which can also alleviate the OOV (out-of-vocabulary) problem to some extent. A word w can be mapped to several character-level n-gram sub-words. Taking the word 'question' as an example, its character-level 3-grams are <qu, que, ues, est, sti, tio, ion, on> together with the word itself <question>; the boundary markers '<' and '>' make it possible to distinguish prefixes and suffixes from other character sequences, and to distinguish the word itself from its sub-words. In practice, to capture more root-and-affix information when training the sub-word vectors, 3-gram to 6-gram sub-word representations are used together. The score function between the word and the context is:
s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^{\top} v_c

where v_c is the word vector of the context word, \mathcal{G}_w is the set of character-level sub-words of the centre word w together with the word itself, and z_g is the vector of sub-word g. The word vector of the centre word is therefore the sum of the vectors of all its sub-words and of the word itself.
Because sub-words from 3-grams to 6-grams must be created for every word, the vocabulary would otherwise become very large, so when training the sub-word vectors a hashing technique maps several sub-words onto one word vector. Although some information is lost, the word vectors of the words themselves do not take part in the hash calculation and keep unique representations, so the effect on the final training result is negligible, while the vocabulary shrinks considerably and training speeds up. After the sub-word vectors have been trained, they are summed to obtain the final word-representation vector.
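A minimal Python sketch of the character-level 3- to 6-gram extraction with hashed bucket ids follows; the bucket count and the use of Python's built-in hash are illustrative assumptions, while the word itself keeps its own un-hashed vector as described above.

```python
def subword_ids(word, num_buckets=2_000_000, min_n=3, max_n=6):
    token = "<" + word + ">"                      # '<' and '>' mark the word boundaries
    grams = [token[i:i + n]
             for n in range(min_n, max_n + 1)
             for i in range(len(token) - n + 1)
             if token[i:i + n] != token]          # the whole word is kept un-hashed
    # hashing maps many sub-words into a fixed number of buckets,
    # which keeps the sub-word vocabulary (and the embedding table) small
    return [hash(g) % num_buckets for g in grams]

# The final vector of a word is the sum of the vectors of its hashed sub-words
# plus the vector of the word itself.
```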
The trained word vectors are summed and averaged to obtain the vector representation of the document, which is then classified through a fully connected layer.
The K-fold cross validation procedure and the construction of the new feature vector are the same as for the classification model based on the word representation: the successive prediction results are concatenated end to end and recorded as a new feature vector B2, and the three prediction vectors obtained on the test data are summed and averaged to give C2.
Step (4) creating a neural network classification model based on character representation:
Because the texts used to generate the user portrait come from abundant sources and contents, and the texts users type when searching are rather casual and noisy, a character-level representation has a natural advantage here and helps reduce the rate of OOV (out-of-vocabulary) words.
A table of commonly used characters is compiled first. Since letter case has no particular influence on user-portrait generation, the compiled table consists of 70 characters in total as the character set: the 26 lower-case letters a-z, the digits 0-9, commonly used punctuation marks, and some special characters.
Assume each character of the input is represented by a number, giving a discrete input function g(x); let f(x) be the parameters of a convolution kernel of size k, let d be the stride, and let c = k - d + 1 be an offset constant. The result of the convolution calculation can then be written as

h(y) = \sum_{x=1}^{k} f(x) \cdot g(y \cdot d - x + c)
In practice, each letter is represented by a one-hot vector over the character set above; a one-dimensional convolution kernel performs the convolution along the feature dimension, and a max-pooling operation is performed along the length dimension.
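A minimal sketch of this character-level one-hot encoding is given below; the exact symbol inventory and the maximum text length are illustrative assumptions.

```python
import numpy as np

# Assumed alphabet: lower-case letters, digits, and common punctuation/special symbols.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}"
CHAR_TO_ID = {c: i for i, c in enumerate(ALPHABET)}

def one_hot_encode(text, max_len=1014):
    """Return a (max_len, alphabet_size) matrix; unknown characters and padding are all-zero rows."""
    x = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for pos, ch in enumerate(text.lower()[:max_len]):
        idx = CHAR_TO_ID.get(ch)
        if idx is not None:
            x[pos, idx] = 1.0
    return x
```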
A convolutional network classification model is built with six convolutional layers and three fully connected layers; three of the convolutional layers are paired with max-pooling layers for feature dimensionality reduction.
In machine learning and deep learning, the different features of a feature vector often have different scales and units, which affects the accuracy of data analysis. To eliminate the influence of these differing scales, the data need to be standardized so that the features of the vector share a consistent scale and distribution, which benefits model training. For a multi-layer neural network it is not enough to make the feature distributions uniform at the model input; the distribution of the input vectors should also be kept relatively stable before every layer of the network, because even a slight difference in the input distribution at the start of the model becomes increasingly significant as the network deepens. Therefore, a BN (Batch Normalization) operation is added to this classification model so that the feature vectors keep approximately the same distribution before entering each layer of the neural network.
The normalization procedure is as follows. For a mini-batch {x_1, ..., x_m}:

The mean of the mini-batch is

\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i

The variance of the mini-batch is

\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2

The original batch data are normalized as

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

where \epsilon is a small constant added to prevent division by zero when the variance is 0.

Finally, the output distribution is computed as

y_i = \gamma \hat{x}_i + \beta

where \gamma and \beta are a learnable scale and shift that prevent every layer's activations from being forced into a standard normal distribution.
After the classifier has been trained, the distribution of the data must also be considered when testing with the test data. A smoothing method records the mean and variance of each batch during training, and the statistics used at test time are computed as

\text{mean} = 0.95 \cdot \text{mean}_{pre} + 0.05 \cdot \text{mean}_{cur}

with the variance updated in the same way.
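A minimal numpy sketch of this batch normalization and of the 0.95/0.05 running-statistics smoothing follows; treating gamma, beta and the statistics as per-feature vectors, and applying the same smoothing to the variance, are assumptions.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var, eps=1e-5):
    mu = x.mean(axis=0)                               # mini-batch mean
    var = x.var(axis=0)                               # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)             # normalize; eps avoids division by zero
    y = gamma * x_hat + beta                          # learnable rescale and shift
    running_mean = 0.95 * running_mean + 0.05 * mu    # smoothed statistics for test time
    running_var = 0.95 * running_var + 0.05 * var
    return y, running_mean, running_var

def batch_norm_test(x, gamma, beta, running_mean, running_var, eps=1e-5):
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```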
The K-fold cross validation procedure and the construction of the new feature vector are the same as for the classification model based on the word representation: the successive prediction results are concatenated end to end and recorded as a new feature vector B3, and the three prediction vectors obtained on the test data are summed and averaged to give C3.
Step (5) model fusion and secondary classification:
and splicing the training data predicted values B1, B2 and B3 obtained by each model to obtain a new feature matrix, and taking the new feature matrix as an input feature matrix of the secondary classifier, wherein the feature value is still the label value corresponding to the original training data.
The test-data prediction vectors C1, C2 and C3 obtained from the three models are concatenated into a matrix that serves as the test data of the secondary classifier; its labels are the label values of the original test data.
Constructing a secondary classifier:
The secondary classifier consists of two hidden fully connected layers followed by a linear classification function; the new feature matrix generated by the primary classifiers is input into the secondary classifier for the secondary classification.
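A minimal PyTorch sketch of such a secondary classifier follows; the hidden-layer sizes and the ReLU activations are illustrative assumptions, while the two hidden fully connected layers and the final linear classification layer follow the description above.

```python
import torch
import torch.nn as nn

class SecondaryClassifier(nn.Module):
    def __init__(self, in_dim, num_classes, hidden=(64, 32)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden[0]), nn.ReLU(),     # first hidden layer
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),  # second hidden layer
            nn.Linear(hidden[1], num_classes),           # linear classification layer
        )

    def forward(self, stacked_features):   # (batch, in_dim): concatenated B1, B2, B3
        return self.net(stacked_features)
```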
Text-feature representations at different levels are used to classify the user metadata, and a secondary classification is performed on top of the training results of all primary classifiers, so as to improve classification and prediction accuracy and better serve the construction of user portraits.

Claims (5)

1. A user portrait generation method based on multi-level text representation and model fusion, characterized by comprising the following steps:
step 1, establishing and organizing a basic corpus: the basic corpus is real user metadata crawled from the internet, and this metadata is organized;
step 2, constructing a user portrait generation model based on multi-level text representation and model fusion, and constructing three primary classifiers based on multi-level text representation, namely a neural network classification model based on word-level feature vectors, a neural network classification model based on sub-word representation and a neural network classification model based on character representation;
step 3, constructing the neural network classification model based on word-level feature vectors: a word-vector model is introduced, the words of the user-metadata text are turned into word vectors by a pre-trained word-vector model, and the mean and variance of the word vectors are initialized to the mean and variance of the pre-trained word vectors; these vectors serve as the input of the model;
step 4, constructing the neural network classification model based on the sub-word representation: a word w is mapped to several character-level n-gram sub-words, and character-level 3-gram to 6-gram sub-word representations are used so as to capture more root-and-affix information when training the sub-word vectors; a hashing technique maps several sub-words onto the same word vector, and the sub-word vectors are finally summed to generate the word vector, which serves as the input of the model;
step 5, creating the neural network classification model based on the character representation: a character table is compiled, consisting of 70 characters comprising the 26 lower-case letters, commonly used punctuation marks and some special characters; based on this character table, the characters contained in the user text are represented as one-hot vectors, which serve as the input of the model;
step 6, after the three trained primary classification models have been obtained through steps 3, 4 and 5, performing model fusion for the secondary classification: the predictions that each model produced for the held-out folds of the training data are concatenated into a new feature matrix, which serves as the input feature matrix of a secondary classifier and whose labels are still the label values of the original training data; the test-data predictions obtained from each model are summed, averaged and concatenated into a feature matrix that serves as the test data of the secondary classifier and whose labels are the label values of the original test data;
the feature matrices obtained above are input into the secondary classifier for training, the secondary classifier being formed of stacked fully connected layers followed by linear classification.
2. The method as claimed in claim 1, wherein the organizing in step 1 comprises cleaning and collating the collected basic corpus, filling in default values, and removing stop words and low-frequency words.
3. The user portrait generation method based on multi-level text representation and model fusion as claimed in claim 1, characterized in that, in step 3, a convolutional neural network classification model is built: 3 one-dimensional convolution kernels of different sizes perform n-gram feature extraction; after each convolution, a one-dimensional max-pooling layer keeps the maximum of the extracted features for feature dimensionality reduction; finally, 3 fully connected layers are stacked for classification, with dropout layers added between them to avoid model overfitting; the outputs of the neural network classification model based on word-level feature vectors are the prediction vectors for the held-out fold of the training data in each round of K-fold cross validation and the prediction vectors for the original test data.
4. The user portrait generation method based on multi-level text representation and model fusion as claimed in claim 1, wherein, in step 4, the trained word vectors are summed and averaged to obtain the vector representation of the document, which serves as input and is classified by a linear classifier; the outputs of the neural network classification model based on the sub-word representation are the prediction vectors for the held-out fold of the training data in each round of K-fold cross validation and the prediction vectors for the original test data.
5. The user portrait generation method based on multi-level text representation and model fusion as claimed in claim 1, wherein, in step 5, a one-dimensional convolution kernel performs convolution along the feature dimension and max pooling is applied along the length dimension; a convolutional network classification model is built with six convolutional layers and three fully connected layers, three of the convolutional layers being paired with max-pooling layers for feature dimensionality reduction; a BN operation is added to the classification model so that the feature vectors keep the same distribution before entering each layer of the neural network; after the classifier has been trained, the distribution of the data is also considered when testing with the test data, and the mean and variance used in each test are corrected with a smoothing method; the outputs of the neural network classification model based on the character representation are the prediction vectors for the held-out fold of the training data in each round of K-fold cross validation and the prediction vectors for the original test data.
CN202110569271.9A 2021-05-25 2021-05-25 User portrait generation method based on multi-level text representation and model fusion Pending CN113486143A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110569271.9A CN113486143A (en) 2021-05-25 2021-05-25 User portrait generation method based on multi-level text representation and model fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110569271.9A CN113486143A (en) 2021-05-25 2021-05-25 User portrait generation method based on multi-level text representation and model fusion

Publications (1)

Publication Number Publication Date
CN113486143A true CN113486143A (en) 2021-10-08

Family

ID=77933388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110569271.9A Pending CN113486143A (en) 2021-05-25 2021-05-25 User portrait generation method based on multi-level text representation and model fusion

Country Status (1)

Country Link
CN (1) CN113486143A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200226126A1 (en) * 2019-01-14 2020-07-16 Alternative Experts, LLC Vector-based contextual text searching
CN110134786A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text classification method based on theme term vector and convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
M. Gopinathan, P. C. Berg: "A Deep Learning Ensemble Approach to Gender Identification of Tweet Authors", Master's Thesis, NTNU, pages 1-134 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116521822A (en) * 2023-03-15 2023-08-01 上海帜讯信息技术股份有限公司 User intention recognition method and device based on 5G message multi-round session mechanism
CN116521822B (en) * 2023-03-15 2024-02-13 上海帜讯信息技术股份有限公司 User intention recognition method and device based on 5G message multi-round session mechanism
CN116432210A (en) * 2023-06-13 2023-07-14 成都航空职业技术学院 File management method and system based on security protection
CN116432210B (en) * 2023-06-13 2023-08-29 成都航空职业技术学院 File management method and system based on security protection


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination