CN107122455B - Network user enhanced representation method based on microblog - Google Patents

Info

Publication number
CN107122455B
Authority
CN
China
Prior art keywords
network
user
text
microblog
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710283853.4A
Other languages
Chinese (zh)
Other versions
CN107122455A (en)
Inventor
胡玥
贾焰
周斌
杨树强
韩伟红
李爱平
黄九鸣
江荣
全拥
邓璐
刘强
张涛
童咏之
刘心
韩文祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201710283853.4A
Publication of CN107122455A
Application granted
Publication of CN107122455B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Abstract

The invention discloses a microblog-based network enhanced representation method. It belongs to the field of microblog data mining and in particular relates to a network representation learning method for microblog data. The method takes the colloquial character of short microblog texts into account and preprocesses them accordingly, which reduces the influence of noisy data. It generates a feature representation of each user's historical posts with an LDA topic model and computes the cosine similarity between the post features of any two users in order to construct a potential friend relationship network. It then integrates the structural information of the original network and fuses the potential friend relationships into it to obtain a corrected network structure. By correcting the original network topology with the potential friend relationship network extracted from user-generated text, the method obtains a more accurate feature representation of microblog user nodes. Compared with network representation learning methods that consider only the network structure, accuracy improves markedly on both the gender and the age inference tasks.

Description

Network user enhanced representation method based on microblog
Technical Field
The invention belongs to the field of microblog data mining, and in particular relates to a network representation learning method for microblog data.
Background
The Internet of the Web 2.0 era is gradually evolving into a ubiquitous information dissemination platform, and new social media built on Social Network Services (SNS), such as Twitter and Weibo, are rapidly gaining popularity. The latest statistics show that Twitter has 310 million monthly active users and Sina Weibo has 297 million. People use social media to express opinions, share information and interact, and social media in turn spread messages through social networks, with a profound influence on politics, the economy, culture, education and other fields. Online social network analysis therefore has important research value, given the large scale, varied forms, complex structure and dynamic change of online social network data and its role in steering hot topics and public sentiment. Taking Sina Weibo as an example, a user can publish original posts of up to 140 characters, which may include pictures, hyperlinks, video, audio and other forms, and can also browse, forward and comment on the posts of friends of interest. Microblog data are multi-source and heterogeneous: user-generated text, user attribute lists and network topology are all important data sources, so how to fuse multi-source microblog information to compute the feature representation of user nodes becomes an important problem.
Representation learning is an important research problem in machine learning: an effective feature representation is obtained by automatically learning a transformation from the original input data to a new feature representation. Network representation learning learns feature representations of network nodes in a low-dimensional space, which achieves both feature quantification and dimensionality reduction.
Many research results have emerged in the field of network representation learning. Traditional manifold learning methods recover a low-dimensional manifold structure from high-dimensional data and find a low-dimensional embedded representation of high-dimensional network data. For example, the Isomap algorithm builds on the MDS framework and uses the geodesic distance between any two points as the geometric description of the manifold. The LLE algorithm (Locally Linear Embedding) assumes that a manifold is approximately linear within a small local neighborhood and uses the linear fitting coefficients to describe its local geometry. The LE algorithm (Laplacian Eigenmaps) describes the manifold with an undirected weighted graph and then finds a low-dimensional representation by embedding, that is, it redraws the graph from the high-dimensional space in a low-dimensional space while preserving the local adjacency relations of the graph.
In recent years, deep learning has provided new ideas for network representation learning, and deep-learning-based network representation models aimed at large-scale network structure data and rich node information continue to appear.
Inspired by the word2vec model, the DeepWalk model considers only the topology of the network. Nodes in the network correspond to words in a corpus and node sequences correspond to sentences; input sequences are generated by random walks and then modeled with the Skip-gram model to obtain vector representations of the network nodes. However, the DeepWalk algorithm does not define an explicit objective function and cannot learn node representations for weighted directed graphs, and because its node sequences are generated randomly, it is strongly affected by noise.
The LINE model considers the first-order and second-order similarities of the network topology at the same time. The first-order similarity is the pairwise similarity between two nodes in the network, that is, the weight of the edge between them. The second-order similarity rests on the assumption that two nodes that share similar neighbor nodes tend to be similar, and it is described by the common neighbors of the two nodes. After the models based on first-order and second-order similarity are built, the node representations of the network are obtained by edge-based negative sampling. The GraRep model considers higher-order similarity information, models the local information of each order separately, obtains the vector representations of the network nodes by SVD matrix factorization, and is suitable for large-scale network structures.
To address the randomness of node sequence generation in DeepWalk, the node2vec model improves the way neighbor nodes are searched. It considers that nodes in a network exhibit both content similarity and structural similarity. Content similarity mainly holds between neighboring nodes, and such neighbors are found by breadth-first search; structurally similar nodes are not necessarily adjacent, and they are found by depth-first search. The vector representations of the nodes are then extracted from the resulting node sequences with the Skip-gram method.
The research above starts only from the network structure, but an online social network such as Sina Weibo has more than topological relations: its nodes also carry a large amount of information in other forms. In view of the diversity of node information, the TADW (Text-Associated DeepWalk) method uses an inductive matrix completion algorithm to model text features and network structure simultaneously and so obtains a better node representation. The GENE model takes group information into account in network representation learning, considering that users of an online social network can create groups themselves or join groups created by others, and that nodes in the same group are related internally even when no edge connects them directly. The multi-faceted representation model considers three kinds of information, namely user-generated text, node attribute information and network topology, and obtains a more faithful representation of the network nodes.
However, real-world networks are often sparse, i.e., the number of directly connected edges is too small, and it is difficult to learn an accurate network representation from the limited initial structural information alone. For users of an online social network, similarity in the text they generate may indicate a common interest, in which case a potential friend relationship may exist between them. Existing research has not yet expanded the topology of the network using the text information of its nodes in order to improve network representation learning.
Disclosure of Invention
To address the sparsity of network structures, the invention provides an enhanced representation learning method for network users that incorporates user-generated text, based on the assumption stated above that similar user-generated text suggests a potential friend relationship, and uses the resulting feature representations to infer user gender and age.
The method comprises the following concrete implementation steps:
step one, preprocessing the user-generated blog posts with existing microblog short-text processing methods, so as to eliminate the influence of noisy data;
step two, generating feature vectors of the preprocessed user posts with a natural language processing technique, computing the similarity between posts with a similarity measure, extracting the potential friend relationships from the user-generated text, and constructing a potential friend relationship network;
step three, considering the first-order and second-order similarities of the network structure, integrating the original network structure information and expanding the original microblog network topology;
step four, fusing the potential friend relationship network extracted from the post information into the integrated network topology and correcting the original network structure information, the correction taking two forms: adding some connecting edges and increasing the weights of some existing connecting edges;
step five, learning the enhanced feature representation of the microblog network users with an existing network representation learning technique;
and step six, in order to compare the representation vectors of the enhanced network with those of the original network, applying the representation learning results to gender and age inference tasks and comparing the accuracy of the inference results with a baseline method.
Compared with the prior art, the invention has the following advantage: to address the sparsity of the network topology, it provides a network enhanced representation learning method that incorporates user-generated text and exploits the fact that two users who publish similar posts in an online social network have similar interests. The user features of the online social network are thereby described more accurately, and the accuracy of microblog user attribute inference is improved.
Drawings
FIG. 1 is a flow chart of the network enhanced representation method combining user-generated text.
FIG. 2 is a schematic diagram of network enhanced representation in an embodiment of the present invention.
FIG. 3 is a distribution diagram of the text features extracted by LDA in an embodiment of the present invention.
FIG. 4 is a visualization of the potential network structure extracted from user-generated text in an embodiment of the present invention.
FIG. 5 is a visualization of the enhanced network topology in an embodiment of the present invention.
FIG. 6 is a comparison chart of the experimental results of the age inference task in an embodiment of the present invention.
Detailed Description
To address the sparsity of network structures, the invention provides an enhanced representation learning method for network users that incorporates user-generated text, based on the assumption that similar user-generated text suggests a potential friend relationship, and uses the resulting feature representations to infer user gender and age.
The invention is described below with reference to the accompanying drawings and the detailed description. First, the following formal definitions are given:
In the social network, nodes correspond to users, and each node is associated with a large amount of text, namely the historical blog posts of that user. Let the network be $G = (V, E, T)$, where $V = \{v_i\}$ is the set of user nodes, $E = \{(v_i, v_j)\}$ is the set of binary edges, each edge carrying a weight $w \in \{0, 1\}$, and $T = \{t_i\}$ is the set of user-generated blog posts. The invention therefore aims to capture textual feature information from the user-generated posts and use it to correct the original network, so as to learn a low-dimensional representation of each node in the corrected network $G''$.
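As a minimal sketch of the definition above, the triple (V, E, T) can be held in a small Python container. The class name `MicroblogNetwork`, its field names, and the choice to treat edges as undirected are assumptions made only for illustration; the specification does not prescribe a data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MicroblogNetwork:
    """Illustrative container for G = (V, E, T); all names are hypothetical."""
    users: List[str]                                                  # V: user nodes
    edges: Dict[Tuple[str, str], int] = field(default_factory=dict)   # E, weight w in {0, 1}
    posts: Dict[str, List[str]] = field(default_factory=dict)         # T: posts per user

    def weight(self, u: str, v: str) -> int:
        # Edges are looked up in both directions (an assumption);
        # a missing edge has weight 0.
        return self.edges.get((u, v), self.edges.get((v, u), 0))


g = MicroblogNetwork(
    users=["a", "b", "c"],
    edges={("a", "b"): 1},
    posts={"a": ["post one"], "b": ["post two"], "c": []},
)
```

Here user "c" has posts but no edges, which is the kind of isolated node that the text-based correction described later is meant to connect.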
Microblog short-text preprocessing. A Sina Weibo post is a short text of no more than 140 characters, so the historical posts of each user are first merged into a single text paragraph. The colloquial style of posts introduces a large amount of noise into microblog text. Preprocessing therefore removes the noise through stop-word filtering, replacement of nonstandard words, word segmentation and similar steps, which makes text features easier to extract. The specific preprocessing operations applied to the microblog text are as follows:
1) in Sina Weibo, the text between two "#" characters is the topic of the post and reflects the user's interests, so it is extracted directly as a keyword and is not segmented further;
2) "@" indicates that a user is mentioned, so the text following "@" is a user nickname and is not split further;
3) punctuation marks in the original text are filtered out;
4) all nonstandard words in the text are replaced by looking them up in a nonstandard-word table. Nonstandard words are Internet expressions widely accepted by netizens, including acronyms and spliced words. For example, "thank you" may be written as "3Q"; likewise, "harmony" may be split into "all-grass-spoken" for certain expressive purposes;
5) all traditional Chinese characters are replaced with the corresponding simplified characters according to a traditional-to-simplified character table;
6) the remaining microblog text is segmented with the HanLP word segmentation tool;
7) stop words in the stop-word list are filtered out;
8) the TF-IDF value of every word is computed, and low-frequency words are filtered out accordingly.
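The preprocessing steps listed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `SLANG` and `STOP_WORDS` tables are hypothetical placeholders for the word lists, the nickname pattern is a guess, and the `tokenize` argument stands in for a real segmenter such as HanLP, whose API is not reproduced here. Traditional-to-simplified conversion (step 5) and TF-IDF filtering (step 8) are left out.

```python
import re

# Hypothetical lookup tables; the actual word lists are not given in the text.
SLANG = {"3Q": "谢谢"}
STOP_WORDS = {"的", "了"}
PUNCT = re.compile(r"[^\w\s]")


def preprocess(post, tokenize=str.split):
    """Return the token list for one post.

    `tokenize` is a stand-in for a word segmenter; the default simply
    splits on whitespace.
    """
    topics = re.findall(r"#([^#]+)#", post)       # step 1: text between two '#'
    rest = re.sub(r"#[^#]+#", " ", post)
    mentions = re.findall(r"@([\w-]+)", rest)     # step 2: nickname after '@'
    rest = re.sub(r"@[\w-]+", " ", rest)
    rest = PUNCT.sub(" ", rest)                   # step 3: drop punctuation
    for slang, standard in SLANG.items():         # step 4: replace nonstandard words
        rest = rest.replace(slang, standard)
    words = [w for w in tokenize(rest) if w not in STOP_WORDS]  # steps 6 and 7
    return topics + mentions + words


tokens = preprocess("#音乐# 3Q @小明 , 今天 的 天气 好!")
```

Topics and nicknames are pulled out before punctuation is stripped so that they survive as whole keywords, which mirrors the order of steps 1) to 3).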
Extraction of potential friend relationships from user-generated text. Because similar posts can reflect common interests, users whose posts are similar are likely to have a potential friend relationship. The user relationship extracted from user-generated text is therefore called the potential friend relationship.
The extraction of potential friend relationships essentially reduces to a text similarity computation. First, an LDA topic model generates a feature vector for each user's microblog text; then the cosine similarity between the text vectors of any two users is computed and used as the weight of the corresponding potential relationship edge, which yields the potential friend relationship network.
LDA is a generative probabilistic model with three levels: documents, topics and words. A document is represented as a random mixture of K latent topics, where each topic follows a multinomial distribution over words and each document follows a multinomial distribution over the K topics. For each document in the corpus, the generative process is as follows:
1) for each document $M_i$, choose $\theta \sim \mathrm{Dir}(\alpha)$, where $\mathrm{Dir}(\alpha)$ is the Dirichlet distribution with parameter $\alpha$ and $\theta$ is the topic vector, each element of which is the probability that the corresponding topic appears in the document;
2) for the $j$-th word $w_j$ of the $i$-th document, choose a latent topic $z_j$ from the topic vector $\theta$ according to the conditional probability $p(z_j \mid \theta)$, then generate the word $w_j$ according to the conditional probability $p(w_j \mid z_j, \beta)$;
3) given the parameters $\alpha$ and $\beta$, the joint distribution of the model takes the standard LDA form
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{j} p(z_j \mid \theta)\, p(w_j \mid z_j, \beta),$$
where $\mathbf{w}$ is the observed variable and $\theta$ is a hidden variable. The parameters $\alpha$ and $\beta$ are then learned with the expectation-maximization (EM) algorithm.
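The generative process described above can be illustrated with a short sampling sketch using only the Python standard library. It draws a document from fixed parameters and does not perform the EM fitting step; the function names, the toy vocabulary and the two-topic `beta` table are assumptions made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalising independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, vocab, n_words, rng):
    """Generate one document under LDA.

    alpha: Dirichlet parameter, one entry per topic.
    beta:  beta[k][v] is the probability of word v under topic k.
    """
    theta = sample_dirichlet(alpha, rng)             # topic mixture of this document
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]   # z ~ p(z | theta)
        w = rng.choices(vocab, weights=beta[z])[0]             # w ~ p(w | z, beta)
        words.append(w)
    return theta, words


rng = random.Random(0)
vocab = ["goal", "match", "album", "concert"]
beta = [[0.5, 0.5, 0.0, 0.0],    # topic 0: sports words only
        [0.0, 0.0, 0.5, 0.5]]    # topic 1: music words only
theta, doc = generate_document([0.5, 0.5], beta, vocab, n_words=8, rng=rng)
```

The returned `theta` plays the role of the per-document topic vector; in practice it is inferred from observed text rather than sampled.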
Keeping the top T topics, each text paragraph is embedded into a T-dimensional vector whose $i$-th component $w_i$ is the weight of the $i$-th topic, i.e., the likelihood that the text generated by the user belongs to the $i$-th topic. Fig. 2 shows the distribution of the text features: for each user-generated text the first three topics are selected, their weights are used as three coordinate values, and each point corresponds to the vector representation of one text.
Finally, each feature vector represents the topics associated with a user's generated text, in other words the interests extracted from the posts the user has published. Cosine similarity is therefore used to extract potential friend relationships from the representation vectors, although other similarity functions may also be used. Given the representation vectors of two users $v_i$ and $v_j$, written here as $\mathbf{x}_i$ and $\mathbf{x}_j$, the potential friend relationship between them is defined as
$$w'_{ij} = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\lVert \mathbf{x}_i \rVert \, \lVert \mathbf{x}_j \rVert}.$$
Thus, the potential adjacency matrix extracted from user-generated text can be described as a matrix $W' = (w'_{ij})$, where each element $w'_{ij} \in [0, 1]$.
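A minimal sketch of this step is shown below: it builds the potential adjacency matrix from per-user topic vectors with cosine similarity. The function names are hypothetical, and setting the diagonal to zero (no self-loops) is an assumption of the sketch rather than something stated in the text.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def potential_friend_matrix(topic_vectors):
    """Build W' with w'_ij = cosine(x_i, x_j) for i != j and 0 on the diagonal."""
    n = len(topic_vectors)
    return [[0.0 if i == j else cosine(topic_vectors[i], topic_vectors[j])
             for j in range(n)] for i in range(n)]


# Users 0 and 1 share the same topic mix; user 2 writes about something else.
vectors = [[0.8, 0.2, 0.0], [0.8, 0.2, 0.0], [0.0, 0.0, 1.0]]
w_prime = potential_friend_matrix(vectors)
```

Because topic weights are non-negative, every entry stays within [0, 1], which matches the stated range of the potential edge weights.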
Integration of the original network structure information. Real-world social networks are often sparse, because only some users follow each other directly. Since direct friend relationships are usually added voluntarily according to users' preferences, direct follow relationships play an important role in network embedding methods that consider only the network structure. However, direct friend relationships are not sufficient to describe the whole network structure: two people who are not friends may still share certain features. In fact, two users in a social network who have common friends tend to have the same interests and characteristics.
For this reason, LINE introduced the concepts of first-order and second-order similarity to characterize the local and global information of the network structure.
1) First-order similarity:
given the edge set E, the weight of the edge between a node pair is their first-order similarity. Let $w^{(1)}_{ij}$ denote an element of the first-order similarity matrix $W_1$; with binary edge weights it can be written as
$$w^{(1)}_{ij} = \begin{cases} 1, & (v_i, v_j) \in E \\ 0, & \text{otherwise.} \end{cases}$$
2) Second-order similarity:
the number of common neighbors of a node pair defines the second-order similarity, which describes how similar the neighborhood structures of two users in the social network are. Given the neighbor sets of users $v_i$ and $v_j$, written here as $N(v_i)$ and $N(v_j)$, the number of common friends is computed, and the second-order similarity can be written as
$$w^{(2)}_{ij} = \lvert N(v_i) \cap N(v_j) \rvert.$$
The first-order and second-order similarities are now considered together and fused into the adjacency matrix extracted from the network structure. Let $W$ denote the integrated adjacency matrix, each element of which is composed of the two similarity values,
$$w_{ij} = \lambda\, w^{(1)}_{ij} + \mu\, w^{(2)}_{ij},$$
where $\lambda$ and $\mu$ are normalization coefficients whose values are determined by repeated experimental tuning.
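The two similarities and their integration can be sketched as follows. The code treats edges as undirected and combines the two matrices as a weighted sum with coefficients `lam` and `mu`; both choices, like the function names, are assumptions made for illustration, since the exact combination formula is not recoverable from the text.

```python
def first_order(n, edges):
    """W1: w1[i][j] = 1 if users i and j are directly connected, else 0."""
    w1 = [[0] * n for _ in range(n)]
    for i, j in edges:
        w1[i][j] = w1[j][i] = 1          # edges treated as undirected (assumption)
    return w1


def second_order(w1):
    """W2: w2[i][j] = number of common neighbours of users i and j."""
    n = len(w1)
    nbrs = [{k for k in range(n) if w1[i][k]} for i in range(n)]
    return [[0 if i == j else len(nbrs[i] & nbrs[j]) for j in range(n)]
            for i in range(n)]


def integrate(w1, w2, lam, mu):
    """Assumed combination: w[i][j] = lam * w1[i][j] + mu * w2[i][j]."""
    n = len(w1)
    return [[lam * w1[i][j] + mu * w2[i][j] for j in range(n)] for i in range(n)]


# A star graph: user 1 is connected to users 0, 2 and 3.
edges = [(0, 1), (1, 2), (1, 3)]
w1 = first_order(4, edges)
w2 = second_order(w1)
w = integrate(w1, w2, lam=1.0, mu=0.5)
```

In this toy graph users 0 and 2 are not directly connected, yet they receive a nonzero integrated weight because they share user 1 as a neighbor.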
Correction of the network structure with the potential friend relationships. The network structure is first corrected with the potential friend relationships extracted from the text, and the latent representation of the expanded network structure is then learned with the LINE model. The expansion brings two kinds of change: first, a weight changes from absent to present, i.e., from 0 to 1; second, an existing weight becomes larger. In Fig. 1, the subgraph spanned by the gray nodes is the original network structure; at this point the colored nodes are isolated, that is, they have no relationship with any other node in the network. After the network structure is corrected with the potential friend relationships, the newly generated dotted edges are new friend relationships extracted from the microblog text, and the bold solid edges indicate that the weight of an edge in the original network structure has increased, i.e., the friend relationship has been strengthened. Figs. 3 and 4 are the microblog friend relationship topology graphs before and after the network structure correction, respectively.
Let $W''$ be the adjacency matrix of the corrected network, where each element $w''_{ij}$ is obtained by fusing the integrated structural weight with the potential friend weight.
However, some elements of the corrected adjacency matrix are too small, so a threshold is set and all elements below the threshold are removed. The final corrected adjacency matrix is then used as the input of LINE to compute the low-dimensional representation. LINE first learns a representation vector for each node under the first-order similarity and under the second-order similarity separately, and then fuses the two vectors into one final node representation.
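The correction and thresholding step can be sketched as below. The fusion formula itself is not recoverable from the text, so the element-wise sum used here is purely an assumption for illustration, as are the function name and the example threshold.

```python
def fuse_and_threshold(w, w_prime, threshold):
    """Return a corrected matrix from the structural matrix W and the
    text-based matrix W'.

    The additive fusion rule is an assumption made for this sketch.
    """
    n = len(w)
    fused = [[w[i][j] + w_prime[i][j] for j in range(n)] for i in range(n)]
    # Entries below the threshold are treated as "no edge".
    return [[x if x >= threshold else 0.0 for x in row] for row in fused]


w = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
w_prime = [[0.0, 0.9, 0.05],
           [0.9, 0.0, 0.6],
           [0.05, 0.6, 0.0]]
w_final = fuse_and_threshold(w, w_prime, threshold=0.1)
```

The example shows both kinds of change described above: the existing edge between users 0 and 1 is strengthened, a new edge appears between users 1 and 2, and the weak candidate edge between users 0 and 2 is dropped by the threshold.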
In essence, the first-order similarity is the weight of the edge between a node pair. To model it, LINE builds an objective function: the empirical probability is defined from the direct edge weights, the joint probability is constructed from the representation vectors, and the K-L divergence measures the error between the two. A similar objective function is built for the second-order similarity. The node vector representations under the two similarities are each obtained with a negative-sampling optimization algorithm, and the two vectors are finally concatenated to give the final network representation.
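As a sketch of the first-order part of this description, the function below evaluates the LINE first-order objective, which follows from the K-L divergence after dropping constants: the negative weighted log of the joint probability, where the joint probability is the sigmoid of the inner product of the two node vectors. The negative-sampling optimization itself is omitted, and the function names and toy embeddings are illustrative assumptions.

```python
import math


def first_order_objective(edges, emb):
    """O1 = -sum over edges of w_ij * log p1(v_i, v_j),
    with p1(v_i, v_j) = sigmoid(u_i . u_j)."""
    total = 0.0
    for (i, j), w_ij in edges.items():
        dot = sum(a * b for a, b in zip(emb[i], emb[j]))
        p1 = 1.0 / (1.0 + math.exp(-dot))
        total -= w_ij * math.log(p1)
    return total


def concat(first, second):
    """Final representation: first-order vector followed by second-order vector."""
    return list(first) + list(second)


edges = {(0, 1): 1.0}
aligned = {0: [1.0, 1.0], 1: [1.0, 1.0]}      # connected nodes with similar vectors
opposed = {0: [1.0, 1.0], 1: [-1.0, -1.0]}    # connected nodes with opposite vectors
loss_aligned = first_order_objective(edges, aligned)
loss_opposed = first_order_objective(edges, opposed)
```

The objective is lower when connected nodes have similar vectors, which is what the optimization drives toward.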
Gender inference for microblog users can be regarded as a supervised binary classification problem on the user feature representations. The final representation vectors are therefore used as features to train a gender classifier with a linear-kernel SVM. The results of the baseline method are shown in Table 1 and those of the method of the invention in Table 2.
TABLE 1 Experimental results of gender inference task (benchmark method)
TABLE 2 Experimental results of gender inference tasks (methods of the invention)
As the data in the tables show, the average accuracy improves by about 4 percentage points. Accuracy also increases with the sample size of the test set; the reason given is that the more training samples there are, the more accurate the classifier obtained by SVM training.
Age inference is a supervised multi-class classification problem. To infer the age of the test samples more accurately, user ages are divided into 4 intervals according to the distribution of birth dates in the user profiles. The statistics show that most users are young people between 18 and 30 years old. User age is then inferred with two multi-class extensions of the SVM, "one-vs-one" and "one-vs-rest". The experimental results are shown in Tables 3 and 4.
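The two multi-class extensions differ only in how binary decisions are combined, which the sketch below illustrates without training any SVM. The age interval labels, the decision scores and the `toy_pair_winner` function are hypothetical stand-ins for trained binary classifiers.

```python
from collections import Counter
from itertools import combinations


def one_vs_rest(scores):
    """scores[c] is the decision value of the 'class c vs. rest' classifier;
    the class with the largest value wins."""
    return max(scores, key=scores.get)


def one_vs_one(pair_winner, classes):
    """pair_winner(a, b) returns whichever of the two classes the pairwise
    classifier prefers; the class with the most votes wins."""
    votes = Counter(pair_winner(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


# Hypothetical age intervals; the text only says that four intervals are used.
classes = ["<18", "18-30", "31-45", ">45"]


def toy_pair_winner(a, b):
    # Stand-in for a trained pairwise SVM: prefers "18-30" whenever it is present.
    return "18-30" if "18-30" in (a, b) else a


ovo_label = one_vs_one(toy_pair_winner, classes)
ovr_label = one_vs_rest({"<18": -0.4, "18-30": 1.2, "31-45": 0.1, ">45": -0.9})
```

One-vs-one needs one binary classifier per pair of classes (six for four intervals), whereas one-vs-rest needs one per class (four).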
TABLE 3 Experimental results of the age inference task (baseline method)
TABLE 4 Experimental results of the age inference task (method of the invention)
The first row of accuracies in the two tables gives the age inference results of the SVM classifier extended in the one-vs-one manner, and the second row gives those of the SVM classifier extended in the one-vs-rest manner. As the data in the tables show, the representation vectors obtained by network enhanced representation classify considerably better than those of the baseline scheme; for example, when the corresponding Percentage is about 10%, the accuracy of the first extension scheme rises from 69.03% to 76.25%. Fig. 6 compares the age inference results and shows that the vector representation obtained by network enhanced representation does yield better classification results than that of the baseline method.
In summary, to address the sparsity of real-world online social networks, a network enhanced representation learning method that fuses node text information is provided, based on the fact that two users who publish similar posts have a potential friend relationship. Specifically, a potential friend relationship network is extracted from user-generated text and used to correct the original network topology, which yields a more accurate representation of the network nodes. Compared with network representation learning that considers only the network topology, accuracy improves markedly on both the gender and the age inference tasks.
The microblog-based network enhanced representation method provided by the invention therefore has important practical value for representing network user features and for subsequent classification and inference tasks.
This specification presents a specific embodiment in order to illustrate the context of the invention and how it is practiced. The details introduced in the examples are not intended to limit the scope of the claims but to aid understanding of the method described herein. Those skilled in the art will understand that various modifications, changes or substitutions to the steps of the preferred embodiment are possible without departing from the spirit and scope of the invention and its appended claims. The invention should therefore not be limited to what is disclosed in the preferred embodiment and the accompanying drawings.

Claims (6)

1. A microblog-based network user enhanced representation method, characterized by comprising the following steps:
step one, preprocessing the user-generated blog posts with existing microblog short-text processing methods, so as to eliminate the influence of noisy data;
step two, generating feature vectors of the preprocessed user posts with a natural language processing technique, computing the similarity between posts with a similarity measure, extracting the potential friend relationships from the user-generated text, and constructing a potential friend relationship network;
step three, considering the first-order and second-order similarities of the network structure, integrating the original network structure information and expanding the topological relation network among the users of the microblog;
step four, fusing the potential friend relationship network extracted from the post information into the integrated network topology and correcting the original network structure information, the correction taking two forms: adding some potential connecting edges and increasing the weights of some existing connecting edges;
step five, learning the enhanced feature representation of the microblog network users with an existing network representation learning technique;
and step six, in order to compare the representation vectors of the enhanced network with those of the original network, applying the representation learning results to gender and age inference tasks and comparing the accuracy of the inference results with a baseline method.
2. The microblog-based network user enhanced representation method according to claim 1, wherein, in a social network, the nodes are the corresponding users, and each node is associated with a large amount of text representing the historical microblog posts of that user; if $G$ denotes the network, then $G = (V, E, T)$, where $V = \{v_i\}$ is the set of user nodes, $E = \{(v_i, v_j)\}$ is the set of binary edges, each edge having a weight $w \in \{0, 1\}$, and $T = \{t_i\}$ is the set of user-generated microblog posts; feature information of the text is captured from the user-generated posts and used to correct the original network, so that a low-dimensional representation of each node in the corrected network $G'$ is learned.
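As an illustration of the network definition in claim 2, the following is a minimal sketch of one way to hold $G = (V, E, T)$ in memory. The class name `MicroblogNetwork`, its field names, and the sample users are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class MicroblogNetwork:
    """Container for G = (V, E, T); all names here are illustrative."""
    users: set = field(default_factory=set)    # V: user nodes
    edges: dict = field(default_factory=dict)  # E: (v_i, v_j) -> weight
    posts: dict = field(default_factory=dict)  # T: user -> list of post texts

    def add_edge(self, u, v, weight=1.0):
        # Register both endpoints and store the edge weight.
        self.users.update((u, v))
        self.edges[(u, v)] = weight

    def add_post(self, user, text):
        # Attach one historical post to a user node.
        self.users.add(user)
        self.posts.setdefault(user, []).append(text)


g = MicroblogNetwork()
g.add_edge("alice", "bob")
g.add_post("alice", "#music# great concert tonight")
```

Keeping the text set keyed by user makes it straightforward to build one document per user for the topic-model step.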
3. The microblog-based network user enhanced representation method according to claim 1, wherein the method for preprocessing the user-generated microblog posts in step one comprises the following steps:
(1) extracting the text between two "#" characters and using it directly as a keyword;
(2) extracting the text that follows "@";
(3) filtering out punctuation marks in the original text;
(4) looking up the singular-word list and replacing all singular words in the text;
(5) performing word segmentation on the retained microblog text with the HanLP word segmentation tool;
(6) filtering out the stop words listed in the stop-word list;
(7) computing the TF-IDF value of every word and filtering out low-frequency words accordingly.
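The preprocessing steps above can be sketched roughly as follows. This is an illustration under several assumptions, not the patented implementation: the function names, the regular expressions, the punctuation set, and the `min_score` cut-off are all hypothetical, hashtag and mention spans are assumed to be removed from the body once extracted, and a plain whitespace split stands in for the HanLP segmenter so that the sketch has no external dependencies.

```python
import math
import re
import string
from collections import Counter

HASHTAG = re.compile(r"#([^#]+)#")   # text between two '#'
MENTION = re.compile(r"@([\w-]+)")   # text following '@'
PUNCT = set(string.punctuation) | set("，。！？；：、“”‘’（）《》【】…")


def preprocess(post, stop_words, replacements=None, tokenize=str.split):
    """Steps (1)-(6); `tokenize` is a stand-in for a real word segmenter."""
    keywords = HASHTAG.findall(post)                      # step (1)
    mentions = MENTION.findall(post)                      # step (2)
    # Assumption: extracted spans are dropped from the remaining body.
    body = MENTION.sub(" ", HASHTAG.sub(" ", post))
    body = "".join(" " if ch in PUNCT else ch for ch in body)  # step (3)
    for old, new in (replacements or {}).items():         # step (4)
        body = body.replace(old, new)
    tokens = tokenize(body)                               # step (5)
    tokens = [t for t in tokens if t not in stop_words]   # step (6)
    return keywords, mentions, tokens


def filter_by_tfidf(docs, min_score):
    """Step (7): keep tokens whose TF-IDF score reaches min_score."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    kept = []
    for doc in docs:
        tf = Counter(doc)
        kept.append([t for t in doc
                     if (tf[t] / len(doc)) * math.log(n / df[t]) >= min_score])
    return kept


keywords, mentions, tokens = preprocess(
    "#music# gr8 concert with @bob, the best!",
    stop_words={"the", "with"},
    replacements={"gr8": "great"},
)
filtered = filter_by_tfidf([["a", "b"], ["a", "c"]], min_score=0.1)
```

For Chinese text, the `tokenize` argument would be replaced by a call into a real segmenter; the rest of the pipeline is unchanged.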
4. The microblog-based network user enhanced representation method according to claim 2, wherein the method for extracting potential friend relationships from user-generated text in step two is as follows:
(1) generating a feature vector of each user's microblog text with the LDA topic model:
LDA is a generative probabilistic model involving three levels: documents, topics, and words. A document is represented as a random mixture of $K$ latent topics, where each topic follows a multinomial distribution over words and each document follows a multinomial distribution over the $K$ topics. Thus, for each document in the corpus, the generative process is described as follows:
for each document $M_i$, select $\theta \sim \mathrm{Dir}(\alpha)$, where $\mathrm{Dir}(\alpha)$ is the Dirichlet distribution with parameter $\alpha$ and $\theta$ is the topic vector, each element of which represents the probability of the corresponding topic appearing in the document;
for the $j$-th word $w_j$ of the $i$-th document, select a latent topic $z_j$ from the topic vector $\theta$ according to the conditional probability $p(z_j \mid \theta)$, then generate the word $w_j$ according to the conditional probability $p(w_j \mid z_j, \beta)$.
Given the parameters $\alpha$ and $\beta$, the joint distribution of the model is
$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{j} p(z_j \mid \theta)\, p(w_j \mid z_j, \beta),$$
where $w$ is the observed variable and $\theta$ is the hidden variable; the parameters $\alpha$ and $\beta$ are then learned with the expectation-maximization (EM) algorithm. Assuming the top $T$ topics are retained, each user's text is embedded into a vector $(w_1, \dots, w_T)$, where $w_i$ is the weight of the $i$-th topic and represents the likelihood that the text generated by the user belongs to the $i$-th topic;
(2) computing the cosine similarity between the microblog text vectors of every pair of users and taking it as the weight of the corresponding potential relationship edge, so as to construct the potential friend relationship network:
potential friend relationships are extracted from the representation vectors by cosine similarity; given the representation vectors of two users $v_i$ and $v_j$, the potential friend relationship between them is defined as the cosine similarity of the two vectors, which gives the edge weight $w'_{ij}$;
thus, the potential adjacency matrix extracted from the user-generated text is described as a matrix $W'$, in which each element $w'_{ij} \in [0, 1]$.
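The cosine-similarity step of claim 4 can be illustrated with a short sketch. It assumes the per-user topic vectors have already been produced by a topic model; the function names and the sample vectors are hypothetical. Because topic weights are non-negative, the resulting similarities fall in $[0, 1]$, which matches the range stated in the claim.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def potential_adjacency(topic_vectors):
    """Map each ordered user pair (u, v), u != v, to its potential edge weight."""
    users = sorted(topic_vectors)
    return {(u, v): cosine(topic_vectors[u], topic_vectors[v])
            for u in users for v in users if u != v}


# Hypothetical topic vectors for three users over two topics.
vectors = {"alice": [1.0, 0.0], "bob": [0.5, 0.5], "carol": [0.0, 1.0]}
w_prime = potential_adjacency(vectors)
```

The dictionary-of-pairs layout is a sparse stand-in for the matrix $W'$; a dense array would work equally well for small user sets.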
5. The microblog-based network user enhanced representation method according to claim 2, wherein the method for integrating the original network structure information in step three is as follows:
two users in a social network who share common friends tend to have the same interests and characteristics; LINE first proposed the concepts of first-order and second-order similarity to characterize both the local and the global information of the network structure;
(1) first-order similarity:
given the edge set $E$, for each node pair in $E$ the weight of the corresponding edge represents the first-order similarity, and these weights form the elements of the first-order similarity matrix $W_1$;
(2) second-order similarity:
the number of common neighbors of a node pair is used to define the second-order similarity, which describes how similar the neighborhood structures of two users in the social network are; given the neighbor node sets of user $v_i$ and user $v_j$, the number of their common friends is calculated and taken as the second-order similarity;
considering both first-order and second-order similarity, the integrated adjacency matrix $W$ extracted from the network structure is introduced, each element of which is composed of the two similarity values together with the coefficients $\lambda$ and $\mu$,
where $\lambda$ and $\mu$ are normalization coefficients whose specific values are determined through repeated experimental tuning.
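A rough sketch of the structure-integration step in claim 5 follows. The exact combination formula is not reproduced in the text, so the linear form `lam * w1 + mu * w2` used here is an assumption, as are the undirected treatment of edges, the default coefficient values, and the function name.

```python
from collections import defaultdict
from itertools import combinations


def integrate_structure(edges, lam=1.0, mu=0.1):
    """Combine first- and second-order similarity for every node pair.

    edges: dict mapping (u, v) -> weight, treated as undirected (assumption).
    Returns a dict mapping frozenset({u, v}) -> integrated weight.
    """
    neighbors = defaultdict(set)
    first = {}
    for (u, v), w in edges.items():
        neighbors[u].add(v)
        neighbors[v].add(u)
        first[frozenset((u, v))] = w          # first-order: the edge weight

    integrated = {}
    for u, v in combinations(sorted(neighbors), 2):
        key = frozenset((u, v))
        w1 = first.get(key, 0.0)
        w2 = len(neighbors[u] & neighbors[v])  # second-order: common neighbors
        score = lam * w1 + mu * w2             # assumed linear combination
        if score > 0:
            integrated[key] = score
    return integrated


W = integrate_structure({("a", "b"): 1, ("b", "c"): 1})
```

In this toy graph, users `a` and `c` are not directly connected but share the neighbor `b`, so they gain a small integrated weight from the second-order term alone.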
6. The microblog-based network user enhanced representation method according to claim 2, wherein the method for correcting the original network structure with the potential friend relationships in step four is as follows:
let $W''$ be the adjacency matrix of the corrected network, each element $w''_{ij}$ of which is obtained by fusing the integrated network structure with the potential friend relationship weight $w'_{ij}$;
however, some elements of the corrected adjacency matrix are too small, so a threshold is set and all elements smaller than the threshold are deleted; the final corrected adjacency matrix is then used as the input to LINE to compute the low-dimensional representations of the user nodes in the network.
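The correction and thresholding described in claim 6 might look like the sketch below. The fusion formula is not reproduced in the text, so simple addition of the structural and potential weights is assumed here; it covers both correction modes named in claim 1 (new edges appear, existing edges gain weight). The function name and sample values are hypothetical, and the call into LINE itself is left out.

```python
def fuse_and_prune(structural, potential, threshold):
    """Add potential weights onto structural ones, then drop small entries.

    structural, potential: dicts mapping a node pair -> weight.
    Additive fusion is an assumption, not the formula from the patent.
    """
    fused = dict(structural)
    for pair, w in potential.items():
        # Existing edges are strengthened; missing edges are created.
        fused[pair] = fused.get(pair, 0.0) + w
    # Keep only elements that reach the threshold.
    return {pair: w for pair, w in fused.items() if w >= threshold}


corrected = fuse_and_prune(
    structural={("a", "b"): 1.0},
    potential={("a", "b"): 0.5, ("a", "c"): 0.05, ("b", "c"): 0.8},
    threshold=0.1,
)
```

The pruned dictionary can then be written out as a weighted edge list, which is the usual input form for network embedding tools.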
CN201710283853.4A 2017-04-26 2017-04-26 Network user enhanced representation method based on microblog Active CN107122455B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710283853.4A CN107122455B (en) 2017-04-26 2017-04-26 Network user enhanced representation method based on microblog

Publications (2)

Publication Number Publication Date
CN107122455A CN107122455A (en) 2017-09-01
CN107122455B true CN107122455B (en) 2019-12-31

Family

ID=59724978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710283853.4A Active CN107122455B (en) 2017-04-26 2017-04-26 Network user enhanced representation method based on microblog

Country Status (1)

Country Link
CN (1) CN107122455B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107577782B (en) * 2017-09-14 2021-04-30 国家计算机网络与信息安全管理中心 Figure similarity depicting method based on heterogeneous data
CN110020151B (en) * 2017-12-01 2022-04-26 北京搜狗科技发展有限公司 Data processing method and device, electronic equipment and storage medium
CN108647800B (en) * 2018-03-19 2022-01-11 浙江工业大学 Online social network user missing attribute prediction method based on node embedding
CN108536844B (en) * 2018-04-13 2021-09-03 吉林大学 Text-enhanced network representation learning method
CN108877946A (en) * 2018-05-04 2018-11-23 浙江工业大学 A kind of doctor's expert recommendation method based on network characterization
CN110555305A (en) * 2018-05-31 2019-12-10 武汉安天信息技术有限责任公司 Malicious application tracing method based on deep learning and related device
CN109189936B (en) * 2018-08-13 2021-07-27 天津科技大学 Label semantic learning method based on network structure and semantic correlation measurement
CN111127232B (en) * 2018-10-31 2023-08-29 百度在线网络技术(北京)有限公司 Method, device, server and medium for discovering interest circle
CN110008975B (en) * 2018-11-30 2023-05-02 武汉科技大学 Social network water army detection method based on immune hazard theory
CN109743196B (en) * 2018-12-13 2021-12-17 杭州电子科技大学 Network characterization method based on cross-double-layer network random walk
CN110245682B (en) * 2019-05-13 2021-07-27 华中科技大学 Topic-based network representation learning method
CN110879861B (en) * 2019-09-05 2023-07-14 国家计算机网络与信息安全管理中心 Similar mobile application computing method and device based on representation learning
CN112134720A (en) * 2020-05-26 2020-12-25 北京国腾创新科技有限公司 Network topology discovery method
CN113076743A (en) * 2021-03-30 2021-07-06 太原理工大学 Knowledge graph multi-hop inference method based on network structure and representation learning
CN113722437B (en) * 2021-08-31 2023-06-23 平安科技(深圳)有限公司 User tag identification method, device, equipment and medium based on artificial intelligence

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102510551A (en) * 2011-09-30 2012-06-20 奇智软件(北京)有限公司 Method and device for automatic recommendation of friends in mobile communication tool
CN103150678A (en) * 2013-03-12 2013-06-12 中国科学院计算技术研究所 Method and device for discovering inter-user potential focus relationships on microblogs
CN104834632A (en) * 2015-05-13 2015-08-12 北京工业大学 Microblog topic detection and hotspot evaluation method based on semantic expansion
CN104899657A (en) * 2015-06-09 2015-09-09 北京邮电大学 Method for predicting association fusion events
CN105302866A (en) * 2015-09-23 2016-02-03 东南大学 OSN community discovery method based on LDA Theme model

Also Published As

Publication number Publication date
CN107122455A (en) 2017-09-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant