CN107122455B

CN107122455B - Network user enhanced representation method based on microblog

Info

Publication number: CN107122455B
Application number: CN201710283853.4A
Authority: CN
Inventors: 胡玥; 贾焰; 周斌; 杨树强; 韩伟红; 李爱平; 黄九鸣; 江荣; 全拥; 邓璐; 刘强; 张涛; 童咏之; 刘心; 韩文祥
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2017-04-26
Filing date: 2017-04-26
Publication date: 2019-12-31
Anticipated expiration: 2037-04-26
Also published as: CN107122455A

Abstract

The invention discloses a microblog-based network enhanced representation method, belongs to the field of microblog data mining, and particularly relates to a network representation learning method for microblog data. The method takes the colloquial character of microblog short texts into account and preprocesses the text accordingly, reducing the influence of noisy data. It generates a feature representation of each user's historical blog text with an LDA topic model and computes the cosine similarity between the blog-text features of any two users to construct a potential friend relationship network. It then integrates the structural information of the original network and fuses the potential friend relationships into it to obtain a corrected network structure. By correcting the original network topology with the potential friend relationship network extracted from user-generated text, the method obtains a more accurate feature representation of microblog user nodes. Compared with network representation learning methods that consider only the network structure, accuracy is markedly improved on the two tasks of gender and age inference.

Description

Network user enhanced representation method based on microblog

Technical Field

The invention belongs to the field of microblog data mining and particularly relates to a network representation learning method for microblog data.

Background

The Internet of the Web 2.0 era is gradually evolving into a ubiquitous information dissemination platform, and new social media built on social network services (SNS for short), such as Twitter and microblogs, are rapidly gaining popularity. The latest statistics show that Twitter has 310 million monthly active users and the Sina microblog has 297 million. People express opinions, share information, and interact through social media, and social media in turn spread messages through social networks, with a profound influence on politics, the economy, culture, education, and other fields. Online social network analysis therefore has important research value, owing to the large scale, varied forms, complex structure, and dynamic change of online social network data and to its guiding effect on trending topics and public opinion. Taking the Sina microblog as an example, a user can publish original posts of up to 140 characters, which may include pictures, hyperlinks, video, audio, and other forms, and can also browse, forward, and comment on the posts of friends of interest. Microblog data are multi-source and heterogeneous: user-generated text, user attribute lists, network topological relations, and so on are all important data sources, and how to fuse multi-source microblog information to compute the feature representation of user nodes becomes important.

Representation learning is an important research problem in the field of machine learning, and an effective feature representation is obtained by automatically learning a transformation from original input data to a new feature representation. The network representation learning is to learn the feature representation of the network nodes in the low-dimensional space, so as to realize the purposes of feature quantification and dimension reduction representation.

Many research results have emerged in the field of network representation learning. Traditional manifold learning methods recover a low-dimensional manifold structure from high-dimensional data and find a low-dimensional embedded representation of high-dimensional network data. For example, the Isomap algorithm is built on the MDS theoretical framework and uses the geodesic distance between any two points as the geometric description of the manifold. The LLE algorithm (Locally Linear Embedding) holds that a manifold can be regarded as approximately linear within a small local neighborhood and uses the coefficients of the local linear fit to describe the local geometry of the manifold. The basic idea of the LE algorithm (Laplacian Eigenmaps) is to describe a manifold with an undirected weighted graph and then find a low-dimensional representation by embedding, that is, to redraw the graph from the high-dimensional space in a low-dimensional space while preserving its local adjacency relations.

In recent years, deep learning provides a new idea for network representation learning, and a network representation model based on deep learning continuously appears aiming at large-scale network structure data and rich network node information.

Inspired by the word2vec model, the DeepWalk model considers only the topological structure of the network. Nodes in the network correspond to words in a corpus, and node sequences correspond to sentences. Standard input sequences are generated by random walks and then modeled with the Skip-gram model to obtain vector representations of the network nodes. However, the DeepWalk algorithm does not establish an objective function and cannot learn node representations for weighted directed graphs, and because the node sequences are generated randomly they are strongly affected by noise.
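The random-walk stage described above can be sketched as follows. This is a minimal illustration only: the function name `random_walks`, the toy graph, and the parameters `walks_per_node` and `walk_length` are assumptions made for the example, and the Skip-gram training that would follow is not shown.

```python
import random

def random_walks(adjacency, walks_per_node=2, walk_length=5, seed=0):
    """Generate truncated random walks; each walk plays the role of a sentence."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit the start nodes in a fresh random order
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Hypothetical toy graph: node -> list of neighbor nodes.
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
walks = random_walks(graph)
```

Each walk would then be fed to a Skip-gram model exactly as a sentence of words would be.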

The LINE model considers the first-order and second-order similarities of the network topology at the same time. The first-order similarity is the pairwise similarity between two nodes in the network, that is, the weight of the edge between them. The second-order similarity rests on the assumption that two nodes sharing similar neighbor nodes tend to be similar, and it is described by the common neighbors of the two nodes. After the models based on the first-order and second-order similarities are built, the node representations of the network are obtained with an edge-based negative sampling method. The GraRep model considers higher-order similarity information, models the local information of each order separately, obtains vector representations of the network nodes by SVD matrix decomposition, and is suitable for large-scale network structures.

To address the randomness of the node sequences generated by the DeepWalk algorithm, the node2vec model improves the way neighbor nodes are searched. It considers that nodes in a network exhibit both content similarity and structural similarity. Content similarity is mainly the similarity between adjacent nodes, and neighbors with content similarity are found by breadth-first search; structurally similar nodes are not necessarily adjacent, and they are found by depth-first search. The vector representations of the nodes are then extracted from the resulting node sequences with the Skip-gram method.

The research above starts only from the network structure, but an online social network such as the Sina microblog has not only network topological relations; its nodes also carry a great deal of information in other forms. In view of the diversity of node information, the TADW (Text-Associated DeepWalk) method adopts an inductive matrix completion algorithm and models text features and network structure at the same time to obtain a better node representation. The GENE model incorporates group information into network representation learning, considering that online social network users can create groups themselves or join groups created by others, and that nodes in the same group have some internal relation even when no edge connects them directly. The Multi-faceted Representation model considers three kinds of information, namely user-generated text, node attribute information, and network topology, and obtains a more realistic representation of the network nodes.

However, real-world networks are often sparse, that is, the number of directly connected edges is too small, and it is difficult to learn an accurate network representation from the limited initial structural information alone. For users of an online social network, similarity between the texts they generate suggests that they share common interests, in which case a potential friend relationship may exist. Existing research has not yet expanded the network topology from the text information of the nodes to enhance network representation learning.

Disclosure of Invention

Addressing the sparsity of network structure, and based on the above assumption, the invention establishes a network user enhanced representation learning method that incorporates user-generated text information and performs inference of user gender and age from the feature representation of the user.

The method comprises the following concrete implementation steps:

Step one: the user-generated blog posts are preprocessed with existing microblog short-text processing methods to eliminate the influence of noisy data;

Step two: feature vectors of the preprocessed user blog text are generated with natural language processing techniques, the similarity between blog texts is computed with a similarity measurement function, the potential friend relationships are extracted from the user-generated text, and a potential friend relationship network is constructed;

Step three: the first-order and second-order similarities of the network structure are considered, the original network structure information is integrated, and the original microblog topological relation network is expanded;

Step four: the potential friend relationship network extracted from the blog information is fused into the integrated network topology to correct the original network structure information, the correction taking two forms: adding some new edges and increasing the weights of some existing edges;

Step five: the enhanced feature representation of the microblog network users is learned with existing network representation learning techniques;

Step six: to compare the representation vectors of the enhanced network with those of the original network, the representation learning results are applied to gender and age inference tasks, and the accuracy of the inference results is compared with that of a baseline method.

Compared with the prior art, the invention has the advantages that: aiming at the sparsity problem of a network topology structure, the invention provides a network enhanced representation learning method combining a user generated text, which considers the fact that two users publishing similar blog texts in an online social network have similar interests, so that the user characteristics of the online social network are more accurately described, and the accuracy of a microblog user attribute reasoning task is improved.

Drawings

FIG. 1 is a flow chart of the network enhanced representation method combined with user-generated text

FIG. 2 is a schematic diagram of network enhanced representation according to an embodiment of the present invention

FIG. 3 is a distribution diagram of LDA extracted text features in the embodiment of the present invention

FIG. 4 is a diagram illustrating the visualization of potential network structures extracted from user-generated text in an embodiment of the present invention

FIG. 5 is a diagram of the visualization effect of the enhanced network topology in the embodiment of the present invention

FIG. 6 is a comparison chart of the experimental results of the age inference task in the embodiment of the present invention

Detailed Description

Addressing the sparsity of network structure, and based on the above assumption, the invention establishes a network user enhanced representation learning method that incorporates user-generated text information and performs inference of user gender and age from the feature representation of the user.

The invention is described below with reference to the accompanying drawings and the detailed description. First, the following formal definitions are given:

In the social network, nodes correspond to users, and each node is associated with a large amount of text representing the historical blog posts of the corresponding user. Let the network be denoted by $G = (V, E, T)$, where $V = \{v_i\}$ is the set of user nodes, $E = \{(v_i, v_j)\}$ is the set of binary edges, each edge carrying a weight $w \in \{0, 1\}$, and $T = \{t_i\}$ is the set of user-generated blog posts. The invention therefore aims to capture textual feature information from the user-generated blog posts, use it to modify the original network, and learn a low-dimensional representation of each node in the modified network $G''$.

Preprocessing of microblog short texts. Posts on the Sina microblog are short texts of no more than 140 characters, so the historical posts of each user are first merged into one text paragraph. The colloquial style of blog posts introduces a large amount of noise into microblog text. The preprocessing of microblog short texts therefore removes noisy data through stop-word filtering, irregular-word replacement, word segmentation, and similar operations, which facilitates the extraction of text features. The specific preprocessing operations adopted by the invention are as follows:

1) text content between two # specified in the Sina microblog is topic information corresponding to the blog, and can reflect the interest of users, so that the text content between the two # is directly extracted to be used as a keyword without being segmented again;

2) the "@" represents that a user is mentioned, so the text content after the "@" is a nickname of the user and does not need to be further split;

3) filtering out punctuation marks in the original text;

4) replacing all irregular words in the text by consulting an irregular-word table. Irregular words are network expressions widely accepted by netizens, including abbreviations and spliced words. For example, "3Q" may be used to express "thank you"; likewise, for certain expressive purposes the characters of the word for "harmony" may be split into their component parts;

5) replacing all traditional Chinese characters with corresponding simplified Chinese characters according to the traditional Chinese character list;

6) performing word segmentation processing on the retained microblog text by using a HanLP word segmentation tool;

7) filtering stop words in the stop word list;

8) computing the TF-IDF values of all words and filtering out the low-frequency words among them.
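The preprocessing steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the lookup tables `IRREGULAR_WORDS` and `STOP_WORDS` are hypothetical, whitespace splitting stands in for the HanLP segmenter of step 6, the traditional-to-simplified conversion of step 5 is omitted, and step 8 is interpreted as dropping tokens whose TF-IDF score falls below a chosen cutoff.

```python
import math
import re
from collections import Counter

# Hypothetical lookup tables; real tables would be far larger.
IRREGULAR_WORDS = {"3Q": "thanks"}
STOP_WORDS = {"the", "a", "to", "for"}

def preprocess(post):
    """Return keyword tokens for one post (steps 1-4, 6 and 7)."""
    topics = re.findall(r"#([^#]+)#", post)   # step 1: keep text between two '#' whole
    post = re.sub(r"#[^#]+#", " ", post)
    mentions = re.findall(r"@(\w+)", post)    # step 2: keep the nickname after '@' whole
    post = re.sub(r"@\w+", " ", post)
    post = re.sub(r"[^\w\s]", " ", post)      # step 3: drop punctuation
    for irregular, normal in IRREGULAR_WORDS.items():  # step 4: replace irregular words
        post = post.replace(irregular, normal)
    words = post.split()                      # step 6 placeholder: whitespace split
    words = [w for w in words if w.lower() not in STOP_WORDS]  # step 7: stop words
    return topics + mentions + words

def tfidf_filter(docs, min_score):
    """Step 8 (assumed reading): drop tokens whose TF-IDF score is below min_score."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    filtered = []
    for doc in docs:
        counts = Counter(doc)
        kept = [t for t in doc
                if (counts[t] / len(doc)) * math.log(n_docs / doc_freq[t]) >= min_score]
        filtered.append(kept)
    return filtered

tokens = preprocess("3Q @alice for the #graph embedding# talk!")
```

In practice the per-user token lists produced by `preprocess` would be merged per user before `tfidf_filter` is applied across all users.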

Extraction of potential friend relationships based on user-generated text. Since similar blog posts can reflect the common interests of users, a potential friend relationship is highly likely to exist between the users who published them; the user relationship extracted from user-generated text is therefore called a potential friend relationship.

The extraction of potential friend relationships essentially reduces to a text similarity computation problem. First, the feature vector of each user's microblog text is generated with the LDA topic model; then the cosine similarity between the blog-text vectors of any two users is computed and used as the weight of the corresponding potential relationship edge, thereby constructing the potential friend relationship network.

LDA is a generative probabilistic model with three levels: documents, topics, and words. A document is represented as a random mixture of K latent topics, where each topic follows a multinomial distribution over words and each document follows a multinomial distribution over the K topics. For each document in the corpus, the generation process is as follows:

1) for each document $M_i$, draw $\theta \sim \mathrm{Dir}(\alpha)$, where $\mathrm{Dir}(\alpha)$ is a Dirichlet distribution with parameter $\alpha$ and $\theta$ is the topic vector, each element of which gives the probability of the corresponding topic appearing in the document;

2) for the j-th word $w_{ij}$ of the i-th document, select a latent topic $z_j$ from the topic vector $\theta$ with conditional probability $p(z_j \mid \theta)$, and then generate the word $w_j$ with conditional probability $p(w_j \mid z_j, \beta)$;

3) given the parameters $\alpha$ and $\beta$, the joint distribution of the model is

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{j=1}^{N} p(z_j \mid \theta)\, p(w_j \mid z_j, \beta),$$

where N is the number of words in the document, $\mathbf{w}$ is the observed variable, and $\theta$ is the hidden variable. The parameters $\alpha$ and $\beta$ are then learned with the expectation-maximization (EM) algorithm.
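The generative process above can be simulated in a few lines. The sketch below only samples one document from given parameters; it does not perform the EM learning of the parameters. The vocabulary, the values of `alpha` and `beta`, and the function names are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, beta, vocabulary, n_words, rng):
    """Generate one document: theta ~ Dir(alpha), then one topic and one word per position."""
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(alpha)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]     # z ~ p(z | theta)
        w = rng.choices(vocabulary, weights=beta[z])[0]  # w ~ p(w | z, beta)
        words.append(w)
    return theta, words

rng = random.Random(0)
# Hypothetical two-topic model over a four-word vocabulary.
vocabulary = ["football", "goal", "recipe", "noodles"]
beta = [[0.5, 0.5, 0.0, 0.0],   # topic 0: sports words
        [0.0, 0.0, 0.5, 0.5]]   # topic 1: food words
theta, words = generate_document([0.5, 0.5], beta, vocabulary, n_words=8, rng=rng)
```

The sampled `theta` plays the role of the per-document topic vector that the method later uses as the text feature of a user.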

Assuming the top T topics are retained, each text paragraph is embedded into a vector $(w_1, w_2, \ldots, w_T)$, where $w_i$ is the weight of the i-th topic and represents the likelihood that the text generated by the user belongs to the i-th topic. Fig. 3 shows the distribution of the text features: for each user-generated text the first three topics are selected and used as three coordinates, so that each point corresponds to the vector representation of one text.

Finally, each feature vector captures the topics associated with a user's generated text, in other words, the interests extracted from the blog posts the user has published. Cosine similarity is therefore used to extract potential friend relationships from the representation vectors, although other similarity functions could also be used. Given the representation vectors of two users $v_i$ and $v_j$, denoted $\mathbf{x}_i$ and $\mathbf{x}_j$, the potential friend relationship between them is defined as

$$w'_{ij} = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\lVert \mathbf{x}_i \rVert \, \lVert \mathbf{x}_j \rVert}.$$

Thus, the potential adjacency matrix extracted from user-generated text can be described as a matrix $W'$ in which each element $w'_{ij} \in [0, 1]$.
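A minimal sketch of building the potential adjacency matrix from topic vectors with cosine similarity is given below. The three topic vectors are made-up values, and setting the diagonal to zero (no self-loops) is an assumption of the example rather than something stated in the text.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm > 0 else 0.0

def potential_friend_matrix(topic_vectors):
    """Symmetric matrix of pairwise cosine similarities, with a zero diagonal."""
    n = len(topic_vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine_similarity(topic_vectors[i], topic_vectors[j])
            matrix[i][j] = matrix[j][i] = sim
    return matrix

# Hypothetical topic vectors for three users (T = 3 topics).
topic_vectors = [[0.8, 0.2, 0.0], [0.6, 0.4, 0.0], [0.0, 0.1, 0.9]]
W_prime = potential_friend_matrix(topic_vectors)
```

Because topic weights are non-negative, every entry falls in the interval [0, 1]; here users 0 and 1 come out far more similar to each other than either is to user 2.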

Integration of the original network structure information. Real-world social networks are often sparse, because only some pairs of users have a direct follow relationship. Moreover, since direct friend relationships are usually added voluntarily by users according to their own preferences, direct follow relationships play an important role in network embedding approaches that consider only the network structure. However, direct friend relationships are not sufficient to describe the entire network structure: two people who are not friends may still share certain features. In fact, two users in a social network who have common friends tend to have the same interests and characteristics.

LINE was therefore the first to propose the concepts of first-order and second-order similarity to fully characterize the local and global information of the network structure.

1) First order similarity:

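Given the edge set E, the weight of the edge joining each node pair represents their first-order similarity. Let $w^{1}_{ij}$ denote an element of the first-order similarity matrix $W^{1}$, defined as

$$w^{1}_{ij} = \begin{cases} w(v_i, v_j), & (v_i, v_j) \in E, \\ 0, & \text{otherwise}, \end{cases}$$

where $w(v_i, v_j)$ is the weight of the edge $(v_i, v_j)$.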

2) second order similarity:

The number of common neighbors of a node pair is used to define the second-order similarity, which describes how similar the neighborhood structures of two users in the social network are. Given the neighbor sets $N(v_i)$ and $N(v_j)$ of users $v_i$ and $v_j$, the number of common friends is computed, and the second-order similarity is defined as

$$w^{2}_{ij} = \lvert N(v_i) \cap N(v_j) \rvert.$$

The first-order and second-order similarities are now considered together and fused into the adjacency matrix extracted from the network structure. Let $W$ denote the integrated adjacency matrix, each element of which is composed of the two similarity values,

$$w_{ij} = \lambda\, w^{1}_{ij} + \mu\, w^{2}_{ij},$$

where $w^{1}_{ij}$ and $w^{2}_{ij}$ are the first-order and second-order similarities of the node pair $(v_i, v_j)$, and $\lambda$ and $\mu$ are normalization coefficients whose specific values are determined by repeated experimental tuning.
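The integration of first-order and second-order similarity can be sketched as below. Several choices are assumptions made for the example: the graph is treated as undirected, the second-order term is the raw count of common neighbors, the two terms are combined as a weighted sum with coefficients `lam` and `mu`, and the default coefficient values are arbitrary.

```python
def integrated_adjacency(n_nodes, edges, lam=1.0, mu=0.5):
    """Combine first-order (edge weight) and second-order (common-neighbor count) similarity."""
    neighbors = [set() for _ in range(n_nodes)]
    first = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
        first[i][j] = first[j][i] = 1.0   # binary edge weight
    W = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(n_nodes):
            if i != j:
                second = len(neighbors[i] & neighbors[j])   # number of common neighbors
                W[i][j] = lam * first[i][j] + mu * second
    return W

# Hypothetical four-user network given as an undirected edge list.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
W = integrated_adjacency(4, edges)
```

In this toy network users 0 and 3 are not directly connected but share neighbor 2, so they receive a nonzero integrated weight even though their first-order similarity is zero.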

Correcting the network structure with the potential friend relationships. The network structure is first corrected with the potential friend relationships extracted from the text, and the latent representation of the expanded network is then learned with the LINE model. The expansion brings two kinds of change: first, a weight changes from zero to nonzero, that is, from 0 to 1, so that an edge that did not exist is created; second, an existing weight is increased. In Fig. 2, the subgraph spanned by the gray nodes is the original network structure, in which the colored nodes are isolated, that is, they have no relationship with any other node in the network. After the network structure is corrected with the potential friend relationships, the newly generated dashed edges are new friend relationships extracted from the microblog text, and the bold solid edges are edges of the original network whose weights have been increased, that is, strengthened friend relationships. Figs. 4 and 5 show the microblog friend relationship topology before and after the network structure correction, respectively.

Let $W''$ be the adjacency matrix of the corrected network, in which each element $w''_{ij}$ combines the integrated structural weight $w_{ij}$ with the potential friend weight $w'_{ij}$.

However, some elements of the corrected adjacency matrix are too small, so a threshold is set and all elements below it are deleted. The final corrected adjacency matrix is then used as the input to LINE to compute the low-dimensional representation. LINE introduces first-order and second-order similarities, learns a representation vector for each node under each of the two similarities separately, and then fuses the two vectors into one final node representation.
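The correction and thresholding step can be illustrated as follows. The exact fusion formula is not recoverable from the text, so the element-wise sum used here is purely an assumption for the sketch, as are the function name, the example matrices, and the threshold value.

```python
def fuse_and_threshold(W, W_prime, threshold):
    """Add the potential-friend weights to the structural weights (assumed fusion rule),
    then delete (zero out) every entry that falls below the threshold."""
    n = len(W)
    fused = [[W[i][j] + W_prime[i][j] for j in range(n)] for i in range(n)]
    return [[v if v >= threshold else 0.0 for v in row] for row in fused]

# Hypothetical structural weights: only users 0 and 1 are connected; user 2 is isolated.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
# Hypothetical text-based similarities between the three users.
W_prime = [[0.0, 0.9, 0.05],
           [0.9, 0.0, 0.6],
           [0.05, 0.6, 0.0]]
W_fused = fuse_and_threshold(W, W_prime, threshold=0.1)
```

The example shows both kinds of change described earlier: the existing edge between users 0 and 1 is strengthened, a new edge appears between users 1 and 2, and the very weak link between users 0 and 2 is removed by the threshold.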

In essence, the first-order similarity is the weight of the edge joining a node pair in the network. To model it, LINE builds an empirical probability from the edge weights and a joint probability from the representation vectors, and uses the K-L divergence between the two as the objective function. A similar objective function is established for the second-order similarity, and the node vector representations under the two similarities are each obtained with a negative-sampling optimization algorithm. Finally, the two vectors are simply concatenated to obtain the final network representation.
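For reference, the first-order objective of LINE is commonly written as follows. The embedding symbol $\mathbf{u}_i$ and the use of the corrected weights $w''_{ij}$ as edge weights are notation assumed here for illustration; the sums run over the edges $(i, j)$ of the corrected network.

$$p_1(v_i, v_j) = \frac{1}{1 + \exp(-\mathbf{u}_i^{\top} \mathbf{u}_j)}, \qquad \hat{p}_1(v_i, v_j) = \frac{w''_{ij}}{\sum_{(m,n)} w''_{mn}}$$

$$O_1 = -\sum_{(i,j)} w''_{ij} \log p_1(v_i, v_j)$$

The second expression is what remains of the K-L divergence between $\hat{p}_1$ and $p_1$ once terms that do not depend on the embeddings are dropped. The second-order objective has the same form, with $p_1$ replaced by a softmax conditional probability $p_2(v_j \mid v_i)$ over context vectors.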

Gender inference for microblog users can be treated as a supervised binary classification problem based on the user feature representation. The final representation vector is therefore used as the extracted feature to train a gender classifier with a linear-kernel SVM. The results of the baseline method are shown in Table 1 and those of the method of the invention in Table 2.

TABLE 1 Experimental results of gender inference task (benchmark method)

TABLE 2 Experimental results of gender inference tasks (methods of the invention)

As the data in the tables show, the average accuracy improves by about 4 percentage points. Moreover, the accuracy improves as the sample size of the test set increases; the explanation offered is that the more training samples there are, the more accurate the classifier obtained by SVM training.

Age inference is a supervised multi-class classification problem. To infer the age of a test sample more accurately, user ages are divided into 4 intervals according to the distribution of birth dates in the user profiles; the statistics show that most users are young people between 18 and 30 years of age. User age is then inferred with two multi-class extensions of the SVM, "one-versus-one" and "one-versus-rest". The experimental results are shown in Tables 3 and 4.
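As a quick illustration of the two multi-class SVM extensions named above (commonly called one-versus-one and one-versus-rest), the number of binary classifiers each scheme trains for the $K = 4$ age intervals is shown below, assuming the standard constructions: one classifier per pair of classes with majority voting, and one classifier per class against all the others.

$$\text{one-versus-one: } \frac{K(K-1)}{2} = \frac{4 \cdot 3}{2} = 6, \qquad \text{one-versus-rest: } K = 4.$$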

TABLE 3 Experimental results of the age inference task (baseline method)

TABLE 4 Experimental results of the age inference task (method of the invention)

In both tables, the first row of accuracies gives the age inference results of the SVM classifier extended in the one-versus-one manner, and the second row gives those of the SVM classifier extended in the one-versus-rest manner. The data show that the representation vectors obtained by the network enhanced representation classify considerably better than those obtained by the baseline scheme; for example, when the corresponding percentage is about 10%, the accuracy of the first extension scheme rises from 69.03% to 76.25%. Fig. 6 compares the age inference results graphically and shows that the vector representation produced by the network enhanced representation does yield better classification results than that of the baseline method.

In summary, addressing the sparsity of real-world online social networks, a network enhanced representation learning method fusing node text information is provided, based on the assumption that two users who publish similar blog texts have a potential friend relationship. Specifically, a potential friend relationship network is extracted from user-generated text and used to correct the original network topology, yielding a more accurate network node representation. Compared with network representation learning that considers only the network topology, accuracy is markedly improved on the two tasks of gender and age inference.

Therefore, the microblog-based network enhanced representation method provided by the invention has important practical application value in the representation of network user characteristics and subsequent classification and reasoning tasks.

This specification presents a specific embodiment for the purpose of illustrating the context and method of practicing the invention. The details introduced in the examples are not intended to limit the scope of the claims but to aid in the understanding of the process described herein. Those skilled in the art will understand that: various modifications, changes or substitutions to the preferred embodiment steps are possible without departing from the spirit and scope of the invention and its appended claims. Therefore, the present invention should not be limited to the disclosure of the preferred embodiments and the accompanying drawings.

Claims

1. A network user enhanced representation method based on microblog is characterized by comprising the following steps:

step three, considering the first-order and second-order similarities of the network structure, integrating the original network structure information and expanding the topological relation network among users in the microblog;

fusing the potential friend relationship network extracted from the blog information to the integrated network topology structure, and correcting the original network structure information, wherein the correction comprises two correction modes of increasing part of potential connecting edges and increasing part of the weight values of the existing connecting edges;

2. The microblog-based network user enhanced representation method according to claim 1, wherein in a social network, the nodes correspond to users, each node is associated with a large amount of text information representing the historical blog information of the corresponding user, and the network is denoted by $G = (V, E, T)$, wherein $V = \{v_i\}$ is the set of user nodes, $E = \{(v_i, v_j)\}$ is the set of binary edges, each edge corresponding to a weight $w \in \{0, 1\}$, and $T = \{t_i\}$ is the set of user-generated blog posts; the characteristic information of the texts is captured from the user-generated blog posts and the original network is corrected, so that a low-dimensional representation of each node in the corrected network $G'$ is learned.

3. The microblog-based network user enhanced representation method according to claim 1, wherein the method for preprocessing the user-generated blog posts in step one comprises the following steps:

(1) extracting text content between two "#" and directly using the text content as a keyword;

(2) extracting the text content after the '@';

(3) filtering out punctuation marks in the original text;

(4) replacing all irregular words in the text by consulting the irregular-word table;

(5) performing word segmentation processing on the retained microblog text by using a HanLP word segmentation tool;

(6) filtering stop words in the stop word list;

(7) computing the TF-IDF values of all words and filtering out the low-frequency words among them.

4. The microblog-based network user enhanced representation method according to claim 2, wherein the potential friend relationship extraction method based on the user-generated text in the second step is as follows:

(1) generating a feature vector of the microblog text of the user by adopting an LDA topic model:

LDA is a generative probabilistic model involving three levels: documents, topics, and words; a document is represented as a random mixture of K latent topics, wherein each topic obeys a multinomial distribution over words and each document obeys a multinomial distribution over the K topics; for each document in the corpus, the generation process is described as follows:

for each document $M_i$, $\theta \sim \mathrm{Dir}(\alpha)$ is selected, wherein $\mathrm{Dir}(\alpha)$ is a Dirichlet distribution with parameter $\alpha$, $\theta$ is a topic vector, and each element of the vector represents the probability of the corresponding topic appearing in the document;

for the j-th word $w_{ij}$ in the i-th document, a latent topic $z_j$ is selected from the topic vector $\theta$ with conditional probability $p(z_j \mid \theta)$, and the word $w_j$ is then generated with conditional probability $p(w_j \mid z_j, \beta)$;

given the parameter $\alpha$ and the parameter $\beta$, the joint distribution of the model is

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{j=1}^{N} p(z_j \mid \theta)\, p(w_j \mid z_j, \beta),$$

wherein N is the number of words in the document, $\mathbf{w}$ is the observed variable, and $\theta$ is the hidden variable; the parameter $\alpha$ and the parameter $\beta$ are then learned using the expectation-maximization (EM) algorithm; assuming the top T topics are retained, each text paragraph is embedded into a vector $(w_1, w_2, \ldots, w_T)$, wherein $w_i$ is the weight corresponding to the i-th topic and represents the likelihood that the text generated by the user belongs to the i-th topic;

(2) calculating the weight of a potential relationship edge corresponding to cosine similarity representation between any two user blog text vectors so as to construct a potential friend relationship network;

extracting potential friend relationships from the representation vectors by a cosine similarity calculation method; given the representation vectors of two users $v_i$ and $v_j$, denoted $\mathbf{x}_i$ and $\mathbf{x}_j$, the potential friend relationship between them is defined as

$$w'_{ij} = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\lVert \mathbf{x}_i \rVert \, \lVert \mathbf{x}_j \rVert};$$

thus, the potential adjacency matrix extracted from the user-generated text is described as a matrix $W'$, wherein each element $w'_{ij} \in [0, 1]$.

5. The microblog-based network user enhanced representation method according to claim 2, wherein the method for integrating the original network structure information in step three is as follows:

two users in a social network who have common friends tend to have the same interests and characteristics; LINE first proposes concepts of first and second order similarities to fully characterize the local and global information of the network structure;

(1) first order similarity:

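given an edge set E, for each node pair therein, the weight value of the corresponding edge represents the first-order similarity; $w^{1}_{ij}$ denotes an element of the first-order similarity matrix $W^{1}$, defined as

$$w^{1}_{ij} = \begin{cases} w(v_i, v_j), & (v_i, v_j) \in E, \\ 0, & \text{otherwise}, \end{cases}$$

wherein $w(v_i, v_j)$ is the weight of the edge $(v_i, v_j)$;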

(2) second order similarity:

the number of common neighbors of any node pair is used to define the second-order similarity, which describes the similarity of the neighborhood structures of two users in the social network; given the neighbor node sets $N(v_i)$ and $N(v_j)$ of user $v_i$ and user $v_j$ respectively, the number of common friends is calculated, and the second-order similarity is defined as

$$w^{2}_{ij} = \lvert N(v_i) \cap N(v_j) \rvert;$$

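the first-order and second-order similarities are considered together and fused into the adjacency matrix extracted from the network structure; $W$ is introduced to denote the integrated adjacency matrix, each element of which is composed of the two similarity values,

$$w_{ij} = \lambda\, w^{1}_{ij} + \mu\, w^{2}_{ij},$$

wherein $w^{1}_{ij}$ and $w^{2}_{ij}$ are the first-order and second-order similarities of the node pair $(v_i, v_j)$, and $\lambda$ and $\mu$ are normalization coefficients.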

6. The microblog-based network user enhanced representation method according to claim 2, wherein in step four the original network structure is corrected with the potential friend relationships as follows:

however, some elements in the modified adjacency matrix are too small, so it is necessary to set a threshold, delete all elements smaller than the threshold, and use the final modified adjacency matrix as an input to the LINE to compute a low-dimensional representation of the network node user.