CN109189936B - Label semantic learning method based on network structure and semantic correlation measurement

Label semantic learning method based on network structure and semantic correlation measurement

Info

Publication number
CN109189936B
CN109189936B
Authority
CN
China
Prior art keywords
network
label
text
semantic
edge
Prior art date
Legal status
Active
Application number
CN201810914904.3A
Other languages
Chinese (zh)
Other versions
CN109189936A (en)
Inventor
Wang Yuan (王嫄)
Yang Jucheng (杨巨成)
Li Zheng (李政)
Zhao Tingting (赵婷婷)
Chen Yarui (陈亚瑞)
Zhao Qing (赵青)
Current Assignee
Beijing Contention Technology Co., Ltd.
Original Assignee
Tianjin University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Tianjin University of Science and Technology filed Critical Tianjin University of Science and Technology
Priority to CN201810914904.3A priority Critical patent/CN109189936B/en
Publication of CN109189936A publication Critical patent/CN109189936A/en
Application granted granted Critical
Publication of CN109189936B publication Critical patent/CN109189936B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a label semantic learning method based on network structure and semantic correlation measurement, comprising the steps of: initializing a label network based on user behavior facts to obtain a fact label network G; constructing a reduced label network G_R from the fact label network G; applying an improved random walk strategy on G_R to construct a random-walk-based label network G_C; constructing a label network G_T based on label-related text information; and normalizing G_C and G_T, then learning the semantic vector representation of each label by a random walk strategy combined with a word vector learning method. The invention is reasonably designed: it fully exploits the network topology while also taking into account the text information attached to the nodes, so that label semantic vectors that are easy to operate on, of high confidence, sufficiently expressive, and low in noise can be learned from both topology and text in a short time. It can be widely applied to label network learning and to label semantic learning over text collections containing labels.

Description

Label semantic learning method based on network structure and semantic correlation measurement
Technical Field
The invention belongs to the technical field of network representation learning, and particularly relates to a label semantic learning method based on a network structure and semantic correlation measurement.
Background
In network representation learning technology, text semantic learning mainly concerns the feature representation of text, that is, representing a target text object (word, sentence, paragraph, or document) numerically (as a scalar, a vector, or a matrix). Driven by the requirements of computation and semantic modeling applications, the models in common use today are latent semantic analysis (LSA), based on matrix singular value decomposition; latent Dirichlet allocation (LDA), based on a probabilistic model; and word vector representation models solved with neural networks, such as NNLM (Neural Network Language Model) and word2vec. These models are mainly suited to long texts. Facing short texts with high sparsity and high noise, researchers have proposed expanding the original short-text corpus with external corpora such as WordNet, MeSH, and Wikipedia, or aggregating short texts by certain rules into "pseudo-documents". Both approaches have obvious drawbacks: the former is limited by the external corpus, the latter by the hand-crafted rules. Analysis of user behavior on short texts shows that users extend the informational reach of a short text with large numbers of labels, and researchers have exploited these labels in methods such as Tag-LDA, TWTM (Tag-Weighted Topic Model), TWDA (Tag-Weighted Dirichlet Allocation), Labeled LDA, MB-LDA, and HGTM (Hashtag-Graph-Based Topic Model) for topic semantic modeling of short texts containing labels, with good results. As a byproduct, these models also output a vector representation of each label.
In network research, classification, clustering, outlier detection, and link prediction must all cope with the sparsity of links in the network; representing network nodes as vectors effectively avoids this problem. An important branch of network representation learning is learning vector representations of network nodes. Conventional methods mainly include graph factorization and Laplacian-eigenmap-based methods. Perozzi et al. sample linear sequences from the network topology by random walk and, borrowing Mikolov's Skip-Gram word vector learning method, learn network node representations; however, DeepWalk learns only from the connections between nodes and does not consider the content information the nodes may carry. Yang et al. proved that DeepWalk is essentially equivalent to factorizing the transition probability matrix of the network and, on that basis, proposed the Text-Associated DeepWalk (TADW) model, which introduces node features for joint factorization. Tang et al. proposed LINE (Large-scale Information Network Embedding) for modeling large networks, but the model considers only first- and second-order neighborhoods, cutting off a large amount of information propagated over the network; associations three, four, or more steps apart are important components of the global network structure and carry real significance. On this basis, Cao et al. proposed GraRep, which fully learns the correlation information between nodes by defining a different transition matrix and a different loss function for each step count.
In summary, existing algorithms struggle with the facts that the original label network arises from unconstrained user collaboration, that its information is noisy and volatile, and that topic boundaries blur or topics drift, causing semantic models to fail.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and to provide a label semantic learning method based on a network structure and semantic correlation measurement.
The technical problem to be solved by the invention is addressed through the following technical scheme:
A label semantic learning method based on network structure and semantic correlation measurement comprises the following steps:
Step 1: initialize a label network based on user behavior facts to obtain a fact label network G;
Step 2: construct a reduced label network G_R from the fact label network G;
Step 3: apply an improved random walk strategy on label network G_R to construct a random-walk-based label network G_C;
Step 4: construct a label network G_T based on the text information related to the labels;
Step 5: normalize label networks G_C and G_T, and learn the semantic vector representation of each label by a random walk strategy combined with a word vector learning method.
Further, the fact label network G is defined as follows:
Define a label network G = {V, E} based on co-occurrence inside texts, where V is the set of all labels in the whole text collection D. If any label i and label j appear together in a text d, there is an edge between them, denoted e_ij. The weight g_ij of an edge in the network is defined as:
g_ij = |D_{i,j}| = |{d ∈ D : i ∈ h_d and j ∈ h_d}|,
where D_{i,j} is the set of texts containing both labels i and j, and h_d is the tag set of document d.
Further, the processing in step 2 is as follows:
First, random noise in semantic node association is considered: pruning removes weakly correlated edges of strong randomness, constraining the network to its effective edges and reducing noise. Let the incidence matrix after pruning be T; its element t_ij, the correlation value of labels i and j after pruning, is:
t_ij = g_ij if g_ij ≥ δ, and t_ij = 0 otherwise,
where δ is the truncation threshold below which the lowest-frequency 20% of edges are cut, and g_ij is the weight of the edge in the network.
Then, since label nodes diverge to different degrees, for any edge in the network the weight of the label association relation is adjusted according to the numbers of endpoints associated with the edge's two endpoints, strengthening label topic association:
t′_ij = t_ij * log(N/N(i)) * log(N/N(j)),
where t′_ij is the weight of the edge in network G_R, N is the number of nodes in the network, N(i) and N(j) are the degrees of labels i and j on the network graph, and log(N/N(i)) is the logarithm of the reciprocal of the probability p(i) = N(i)/N that the current label is associated with a node in the network.
Further, the improved random walk strategy of step 3 samples the high-noise complex network structure into multiple linear sequences, obtaining local microscopic descriptive information of the network in the manner of breadth-first search and global macroscopic information of the network in the manner of depth-first search; a window is then slid over the sampled linear sequences to obtain new label association relationships, and from these the new edge weights.
Further, the processing in step 4 is as follows:
First, define W_i as the vocabulary set that co-occurs with label i, and let w_ij denote the number of times label i co-occurs with word j.
Then weight w_ij by the inverse document frequency (IDF); the IDF value of word t_i is calculated as:
idf_i = log( |D| / |{ j : t_i ∈ d_j }| ),
where |D| is the number of documents in the collection and |{ j : t_i ∈ d_j }| is the number of documents containing word t_i.
Next, compute the products w_ij * idf_j to obtain the text representation vector of each label.
Finally, compute the cosine similarity between every two label text representation vectors, define θ as the truncation threshold below which the lowest 80% of edges are cut, remove edges whose cosine similarity is smaller than θ, and set the weight of each retained edge to its cosine similarity value.
Further, in step 5, a graph sampling preference parameter is introduced into the network random walk process to switch between label network G_T and label network G_C, so as to benefit from the association of the network structure and the text information; in the semantic vector updating process, the linear sequences obtained by random walks over the networks are used as sentences, and label semantics are learned based on the left and right contexts.
The invention has the advantages and positive effects that:
1. The invention combines text semantic learning with network representation learning. Through multi-stage strategies of noise reduction, network reduction, random walk, and text similarity computation, network noise is reduced and a core network is obtained, on which vector representations of the labels are learned. The method fully exploits the network topology while also considering the text information attached to the nodes; label semantic vectors that are easy to operate on, of high confidence, sufficiently expressive, and low in noise can be learned from topology and text in a short time. It can be widely applied to label network learning, to label semantic learning over text collections containing labels, and to various text mining applications.
2. The method is reasonably designed. Based on factual information, it abstracts the core of the label association network, samples the edges of the graph, and updates their weights, yielding a stable label network structure; the model is thereby insensitive to noise interference, and modeling generalization is improved.
Drawings
FIG. 1 is a schematic diagram of the overall architecture of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The design idea of the invention is as follows: based mainly on statistical machine learning theory and text mining technology, a highly credible label network is constructed from user behavior data in order to learn a clean semantic representation of the labels. First, the label network is initialized from user behavior facts; second, the label network is reconstructed based on a reduction technique and an improved random walk technique; third, a label network is constructed from the text similarity related to the labels; finally, label semantic vector representations are learned from the reconstructed label network and the text-similarity-based label network. The network reconstruction can be regarded as an automatic multi-path filter for discovering strong topic correlations among labels: weakly correlated parts are filtered out and the influence of noisy, topic-irrelevant correlations is reduced, helping to discover and strengthen a label network representation with more consistent and compact topic semantics, and yielding topic-enhanced label semantic vectors.
Based on the above design concept, the label semantic learning method based on network structure and semantic correlation measurement of the invention, as shown in FIG. 1, comprises the following steps:
Step 1: initialize the label network based on user behavior facts to obtain a fact label network G.
In the field of internet text applications, a "tag" is essentially a topic. The topic may be fine-grained, such as a specific event, time, or place, or coarse-grained, such as aggregated news or an abstract concept. Tags come into being through users' composition of texts, and the relationships between tags are grounded in users' behavioral facts. Tags therefore naturally co-occur with texts.
The most direct relationship between tags is co-occurrence within a single text. Define a label network G = {V, E} based on co-occurrence inside texts, where V is the set of all tags in the whole text collection D. If any label i and label j appear together in a text d, there is an edge between them, denoted e_ij. The weight g_ij of an edge is defined as the number of texts containing both tags, i.e.,
g_ij = |D_{i,j}| = |{d ∈ D : i ∈ h_d and j ∈ h_d}|,
where D_{i,j} is the set of texts containing both labels i and j, and h_d is the tag set of document d. Intuitively, the more texts two labels appear in together, the larger the weight of the edge between them, and the tighter their semantic association.
By this, a tag network based on user behavior facts is obtained.
In addition to the co-occurrence relationship within texts described above, this step can also construct the tag fact network indirectly, using user-tag usage relationships or co-occurrence relationships between external links and tags.
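As a concrete illustration of this initialization, the following minimal Python sketch builds the co-occurrence weights g_ij from documents given as tag sets h_d; the function and variable names are illustrative assumptions, not part of the invention:

import itertools
from collections import Counter

def build_fact_network(documents):
    # documents: iterable of tag sets h_d, one per text d
    g = Counter()
    for h_d in documents:
        # every unordered pair of tags co-occurring in this text gains weight 1,
        # so the final count is g_ij = |D_{i,j}|
        for i, j in itertools.combinations(sorted(set(h_d)), 2):
            g[(i, j)] += 1
    return g

# Two of the three texts share the tag pair (a, b), so g_ab = 2:
print(build_fact_network([{"a", "b"}, {"a", "b", "c"}, {"b", "c"}]))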
Step 2: construct the reduced label network G_R from the fact label network G.
A tag network constructed directly from facts is relatively noisy, mainly owing to the randomness and variability of user usage. Randomness means that users are free to tag texts, so several expressions of the same meaning are likely, such as "Qinghua University" and "Qinghua"; variability refers to differing understandings of the connotation of one and the same text, leading users to add different tags: to a text such as "how many people can wear this", different users may add the tag "padded jacket" or the tag "bikini".
To condense semantics, this step rests on four naive assumptions: (1) the more often two tags are associated, the closer their topical association; (2) the more identical neighbors two tags share, the closer their topical association; (3) tag sets sharing a topic exhibit block- and community-like aggregation; (4) a tag associated with a great number of tags is not closely tied to the topic of any other tag.
Random noise in semantic node association is considered first. Pruning removes weakly correlated edges of strong randomness, constraining the network to its effective edges and reducing noise. Let the incidence matrix after pruning be T; its element t_ij is the correlation value of labels i and j after pruning:
t_ij = g_ij if g_ij ≥ δ, and t_ij = 0 otherwise,
where δ is the truncation threshold below which the lowest-frequency 20% of edges are cut.
Second, since label nodes diverge to different degrees, for any edge in the network the weight of the label association relation is adjusted according to the numbers of endpoints associated with the edge's two endpoints, strengthening label topic association:
t′_ij = t_ij * log(N/N(i)) * log(N/N(j)).
Here t′_ij is the weight of the edge in network G_R, N is the number of nodes in the network, and N(i) and N(j) are the degrees of labels i and j on the network graph, i.e., the numbers of edges incident to the vertices. log(N/N(i)) is the logarithm of the reciprocal of the probability p(i) = N(i)/N that the current label is associated with a node in the network; it reflects the amount of information label i carries in the network and may be called the Inverse Correlation Frequency (IRF): the larger its value, the better the topic-semantic discrimination and the more effective the information carried by the topic semantic association.
Step 3: on the reduced label network G_R obtained in step 2, apply the improved random walk strategy to construct the random-walk-based label network G_C.
This step samples the nodes associated with each node by the improved random walk strategy. Step 2 yielded the weights t′_ij on the network; here second-order transition weights are defined to guide the walk. After the walk has moved from label t to label i, the unnormalized transition weight toward a label j is defined as:
π^t_ij = α(t, j) * t′_ij,
where α(t, j) is an adjustment factor over the shortest graph distance d_tj from label t to label j (only nodes with d_tj at most 2 are considered):
α(t, j) = 1/p if d_tj = 0, 1 if d_tj = 1, and 1/q if d_tj = 2,
with p and q parameters controlling the walk. Label t is the node preceding i in the walk; when d_tj = 0, t and j coincide, and π^t_ij is the weight of walking from label j to label i and then stepping straight back from label i to label j.
The transition weights π^t_ij so defined are normalized into transition probabilities, P(j | i, t) = π^t_ij / Σ_k π^t_ik; then n-step random walks are performed m times, yielding m paths of step length n. A window of size s is slid along each path; an edge is added between labels co-occurring inside a window, and its weight is incremented by 1. After all sliding is finished, each weight equals the total number of windows in which the label pair co-occurred, and from these weights the new label network G_C is obtained.
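The sketch below shows one way the second-order walk and the window-based re-association could be realized; treating the adjustment factor with return and in-out parameters p and q is an assumption made for illustration, as are all identifiers:

import random
from collections import Counter, defaultdict

def build_adjacency(weights):
    # weights: {(i, j): w} undirected edge weights, e.g. the t'_ij of step 2
    adj = defaultdict(dict)
    for (i, j), w in weights.items():
        adj[i][j] = w
        adj[j][i] = w
    return adj

def biased_walk(adj, start, n, p=1.0, q=1.0):
    # Unnormalized transition weight from i to j given previous node t:
    # pi^t_ij = alpha(t, j) * t'_ij, with alpha depending on d_tj in {0, 1, 2}
    walk = [start]
    while len(walk) < n:
        i = walk[-1]
        nbrs = list(adj[i])
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(random.choice(nbrs))
            continue
        t = walk[-2]
        w = [(1.0 / p if j == t else 1.0 if j in adj[t] else 1.0 / q) * adj[i][j]
             for j in nbrs]
        walk.append(random.choices(nbrs, weights=w)[0])  # normalization is implicit
    return walk

def window_cooccurrence(paths, s):
    # Slide a window of size s along each path; co-occurring labels gain weight 1,
    # yielding the edge weights of the new network G_C
    g_c = Counter()
    for path in paths:
        for k, a in enumerate(path):
            for b in path[k + 1 : k + s]:
                if a != b:
                    g_c[tuple(sorted((a, b)))] += 1
    return g_c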
Step 4: construct the label network G_T based on label-related text information.
This step builds a label network from the text associated with the labels, that is, from the content features of the network nodes themselves. In the invention, labels and text words stand in a co-occurrence relationship: the more often a word co-occurs with a label, the better that word conveys the meaning the label indicates. Define W_i as the vocabulary set co-occurring with label i, and let w_ij denote the number of times label i co-occurs with word j.
For example, suppose a text collection contains three texts:
"#South Africa Golden Brick Meeting# [Times News Eye] Golden Brick Meeting: the 'golden bricks' of the Golden City"
"#South Africa Golden Brick Meeting# an economic community that cannot do without technology"
"#South Africa Golden Brick Meeting# among the golden-brick countries, friendship endures; weathering wind and rain together, the win is shared. May our motherland grow ever stronger, the country prosperous and the people at peace."
Here the co-occurrence weight w of the word "golden brick" with the label "#South Africa Golden Brick Meeting#" is 3.
After the text is represented, w_ij is weighted by the inverse document frequency (IDF). The IDF value of word t_i is calculated as:
idf_i = log( |D| / |{ j : t_i ∈ d_j }| ),
where |D| is the number of documents in the collection and |{ j : t_i ∈ d_j }| is the number of documents containing word t_i. The products w_ij * idf_j are then computed to obtain the text representation vector of each label.
Next, the cosine similarity between the text representation vectors of every two labels is computed and, analogously to step 2, θ is defined as the truncation threshold below which the lowest 80% of edges are cut. Edges whose cosine similarity is smaller than θ are removed; the weight of each retained edge is set to its cosine similarity value.
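A sketch of this text-side construction, again assuming the 80% truncation threshold θ is taken as the 80th percentile of the similarity values; the data layouts and names are illustrative:

import math
import numpy as np

def label_text_vectors(w, doc_freq, n_docs, vocab):
    # w: {label: {word: w_ij}} co-occurrence counts; doc_freq: {word: document frequency}
    index = {word: k for k, word in enumerate(vocab)}
    vecs = {}
    for label, counts in w.items():
        v = np.zeros(len(vocab))
        for word, w_ij in counts.items():
            v[index[word]] = w_ij * math.log(n_docs / doc_freq[word])  # w_ij * idf_j
        vecs[label] = v
    return vecs

def text_network(vecs):
    labels = sorted(vecs)
    sims = {}
    for a in range(len(labels)):
        for b in range(a + 1, len(labels)):
            va, vb = vecs[labels[a]], vecs[labels[b]]
            denom = np.linalg.norm(va) * np.linalg.norm(vb)
            if denom > 0:
                sims[(labels[a], labels[b])] = float(va @ vb / denom)
    theta = np.percentile(list(sims.values()), 80)  # cuts the lowest 80% of edges
    # retained edges keep their cosine similarity as weight
    return {edge: s for edge, s in sims.items() if s >= theta}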
Step 5: normalize the label networks G_C and G_T obtained in steps 3 and 4, and learn the semantic vector representation of each label by a random walk strategy combined with a word vector learning method.
G_C, obtained in step 3, embodies label association based on the network structure; G_T, obtained in step 4, embodies label association based on semantic relevance. This step fuses the two networks G_C and G_T and learns the semantic representation of the labels over the fused networks.
Since the weight ranges of the two networks differ, their weights are first unified into the interval [0, 1]. Normalization uses a linear (min-max) function:
x_norm = (x - x_min) / (x_max - x_min),
that is, the original data are scaled proportionally; x_norm is the normalized value of x, and x_min and x_max are respectively the minimum and maximum edge weights in the network. The normalized networks are denoted G_C-norm and G_T-norm. Random walks are then performed on the two networks to obtain a neighbor sequence for each node; word2vec learning is applied to these sequences, and the SkipGram method updates the vector representation of each label node.
The calculation flow of the label semantic vector Φ is as follows:
initialize Φ;
for each pass:
    O = Shuffle(V);
    for each label node j in O:
        W_j = RandomWalk(G_C-norm, j, t); SkipGram(Φ, W_j, d);
        W_j = RandomWalk(G_T-norm, j, t); SkipGram(Φ, W_j, d).
Here Shuffle(·) reorders the label nodes to avoid a preference induced by processing order. RandomWalk(G_C-norm, j, t) performs, on graph G_C-norm, a random walk of t steps starting from j, returning a label node sequence W_j of length t; RandomWalk(G_T-norm, j, t) is analogous. SkipGram(Φ, W_j, d) is the word vector learning method, which updates the label semantic vector Φ of dimension d based on the node sequence W_j.
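Putting step 5 together, the sketch below normalizes both networks, gathers walk sequences in shuffled node order, and delegates SkipGram to gensim's Word2Vec as one off-the-shelf stand-in; its use here, all parameter values, and the deterministic alternation between the two networks (instead of a preference-parameter switch) are simplifying assumptions, and build_adjacency/biased_walk refer to the earlier sketch:

import random
from gensim.models import Word2Vec

def min_max_normalize(g):
    # x_norm = (x - x_min) / (x_max - x_min)
    lo, hi = min(g.values()), max(g.values())
    return {e: (w - lo) / (hi - lo) for e, w in g.items()} if hi > lo else g

def learn_label_vectors(g_c, g_t, passes=10, t=40, d=128, window=5):
    sentences = []
    for net in (min_max_normalize(g_c), min_max_normalize(g_t)):
        adj = build_adjacency(net)
        nodes = list(adj)
        for _ in range(passes):
            random.shuffle(nodes)  # Shuffle(.): avoid order-induced preference
            for j in nodes:
                sentences.append([str(v) for v in biased_walk(adj, j, t)])
    # SkipGram (sg=1) updates the d-dimensional label vectors Phi from each sequence
    model = Word2Vec(sentences, vector_size=d, window=window, sg=1, min_count=0)
    return model.wv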
It should be emphasized that the embodiments described herein are illustrative rather than restrictive, and thus the present invention is not limited to the embodiments described in the detailed description, but also includes other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art.

Claims (1)

1. A label semantic learning method based on network structure and semantic correlation measurement, characterized in that the method comprises the following steps:
Step 1: initialize a label network based on user behavior facts to obtain a fact label network G;
Step 2: construct a reduced label network G_R from the fact label network G;
Step 3: apply an improved random walk strategy on label network G_R to construct a random-walk-based label network G_C;
Step 4: construct a label network G_T based on the text information related to the labels;
Step 5: normalize label networks G_C and G_T, and learn the semantic vector representation of each label by a random walk strategy combined with a word vector learning method;
the appearance of the labels is generated by users' composition of texts, the relationships between labels are grounded in users' behavioral facts, and the labels and the texts stand in a co-occurrence relationship;
the fact label network G is:
a label network G = {V, E} based on co-occurrence inside texts, where V is the set of all labels in the whole text collection D; if any label i and label j appear together in a text d, there is an edge between them, denoted e_ij; the weight g_ij of an edge in the network is defined as:
g_ij = |D_{i,j}| = |{d ∈ D : i ∈ h_d and j ∈ h_d}|,
where D_{i,j} is the set of texts containing both labels i and j, and h_d is the tag set of document d;
the processing of step 2 is as follows:
first, random noise in semantic node association is considered: pruning removes weakly correlated edges of strong randomness, constraining the network to its effective edges and reducing noise; let the incidence matrix after pruning be T; its element t_ij, the correlation value of labels i and j after pruning, is:
t_ij = g_ij if g_ij ≥ δ, and t_ij = 0 otherwise,
where δ is the truncation threshold below which the lowest-frequency 20% of edges are cut, and g_ij is the weight of the edge in the network;
then, since label nodes diverge to different degrees, for any edge in the network the weight of the label association relation is adjusted according to the numbers of endpoints associated with the edge's two endpoints, strengthening label topic association:
t′_ij = t_ij * log(N/N(i)) * log(N/N(j)),
where t′_ij is the weight of the edge in network G_R, N is the number of nodes in the network, N(i) and N(j) are the degrees of labels i and j on the network graph, and log(N/N(i)) is the logarithm of the reciprocal of the probability p(i) = N(i)/N that the current label is associated with a node in the network;
the improved random walk strategy of step 3 samples the high-noise complex network structure into multiple linear sequences, obtaining local microscopic descriptive information of the network in the manner of breadth-first search and global macroscopic information of the network in the manner of depth-first search; a window is slid over the sampled linear sequences to obtain new label association relationships, and from these the new edge weights;
the processing of step 4 is as follows:
first, define W_i as the vocabulary set that co-occurs with label i, and let w_ij denote the number of times label i co-occurs with word j;
then weight w_ij by the inverse document frequency (IDF); the IDF value of word t_i is calculated as:
idf_i = log( |D| / |{ j : t_i ∈ d_j }| ),
where |D| is the number of documents in the collection and |{ j : t_i ∈ d_j }| is the number of documents containing word t_i;
next, compute the products w_ij * idf_j to obtain the text representation vector of each label;
finally, compute the cosine similarity between every two label text representation vectors, define θ as the truncation threshold below which the lowest 80% of edges are cut, remove edges whose cosine similarity is smaller than θ, and set the weight of each retained edge to its cosine similarity value;
in step 5, a graph sampling preference parameter is introduced into the network random walk process to switch between label network G_T and label network G_C, so as to benefit from the association of the network structure and the text information; in the semantic vector updating process, the linear sequences obtained by random walks over the networks are used as sentences, and label semantics are learned based on the left and right contexts.
CN201810914904.3A 2018-08-13 2018-08-13 Label semantic learning method based on network structure and semantic correlation measurement Active CN109189936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810914904.3A CN109189936B (en) 2018-08-13 2018-08-13 Label semantic learning method based on network structure and semantic correlation measurement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810914904.3A CN109189936B (en) 2018-08-13 2018-08-13 Label semantic learning method based on network structure and semantic correlation measurement

Publications (2)

Publication Number Publication Date
CN109189936A CN109189936A (en) 2019-01-11
CN109189936B true CN109189936B (en) 2021-07-27

Family

ID=64921583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810914904.3A Active CN109189936B (en) 2018-08-13 2018-08-13 Label semantic learning method based on network structure and semantic correlation measurement

Country Status (1)

Country Link
CN (1) CN109189936B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948066B (en) * 2019-04-16 2020-12-11 杭州电子科技大学 Interest point recommendation method based on heterogeneous information network
CN110413865A (en) * 2019-08-02 2019-11-05 知者信息技术服务成都有限公司 Semantic expressiveness model and its method based on alternating binary coding device characterization model
CN112580916B (en) * 2019-09-30 2024-05-28 深圳无域科技技术有限公司 Data evaluation method, device, computer equipment and storage medium
CN110851491B (en) * 2019-10-17 2023-06-30 天津大学 Network link prediction method based on multiple semantic influence of multiple neighbor nodes
CN110889001B (en) * 2019-11-25 2021-11-05 浙江财经大学 Big image sampling visualization method based on image representation learning
CN111460118B (en) * 2020-03-26 2023-10-20 聚好看科技股份有限公司 Artificial intelligence conflict semantic recognition method and device
CN111444352A (en) * 2020-03-26 2020-07-24 深圳壹账通智能科技有限公司 Knowledge graph construction method and device based on knowledge node membership
CN111611498B (en) * 2020-04-26 2024-01-02 北京科技大学 Network representation learning method and system based on field internal semantics
CN111723301B (en) * 2020-06-01 2022-05-27 山西大学 Attention relation identification and labeling method based on hierarchical theme preference semantic matrix
CN112131569B (en) * 2020-09-15 2024-01-05 上海交通大学 Risk user prediction method based on graph network random walk
CN112182511B (en) * 2020-11-27 2021-02-19 中国人民解放军国防科技大学 Complex semantic enhanced heterogeneous information network representation learning method and device
CN112529621A (en) * 2020-12-10 2021-03-19 中山大学 Advertisement audience basic attribute estimation method based on heterogeneous graph embedding technology
CN112688813B (en) * 2020-12-24 2022-07-15 中国人民解放军战略支援部队信息工程大学 Routing node importance ordering method and system based on routing characteristics
CN112989209B (en) * 2021-05-10 2021-09-17 腾讯科技(深圳)有限公司 Content recommendation method, device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956197A (en) * 2016-06-15 2016-09-21 杭州量知数据科技有限公司 Social media graph representation model-based social risk event extraction method
CN106055604A (en) * 2016-05-25 2016-10-26 南京大学 Short text topic model mining method based on word network to extend characteristics
CN106874931A (en) * 2016-12-30 2017-06-20 东软集团股份有限公司 User portrait grouping method and device
CN107122455A (en) * 2017-04-26 2017-09-01 中国人民解放军国防科学技术大学 A kind of network user's enhancing method for expressing based on microblogging
CN107291803A (en) * 2017-05-15 2017-10-24 广东工业大学 A kind of network representation method for merging polymorphic type information
WO2018118546A1 (en) * 2016-12-21 2018-06-28 Microsoft Technology Licensing, Llc Systems and methods for an emotionally intelligent chat bot
CN108255809A (en) * 2018-01-10 2018-07-06 北京海存志合科技股份有限公司 Consider the method for calculating the theme corresponding to document of Words similarity

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106055604A (en) * 2016-05-25 2016-10-26 南京大学 Short text topic model mining method based on word network to extend characteristics
CN105956197A (en) * 2016-06-15 2016-09-21 杭州量知数据科技有限公司 Social media graph representation model-based social risk event extraction method
WO2018118546A1 (en) * 2016-12-21 2018-06-28 Microsoft Technology Licensing, Llc Systems and methods for an emotionally intelligent chat bot
CN106874931A (en) * 2016-12-30 2017-06-20 东软集团股份有限公司 User portrait grouping method and device
CN107122455A (en) * 2017-04-26 2017-09-01 中国人民解放军国防科学技术大学 A kind of network user's enhancing method for expressing based on microblogging
CN107291803A (en) * 2017-05-15 2017-10-24 广东工业大学 A kind of network representation method for merging polymorphic type information
CN108255809A (en) * 2018-01-10 2018-07-06 北京海存志合科技股份有限公司 Consider the method for calculating the theme corresponding to document of Words similarity

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Link Prediction Method Based on Mapping of Node Structural Features in Large-Scale Networks; Li Zhiyu, Liang Xun, Zhou Xiaoping, Zhang Haiyan, Ma Yuefeng; Chinese Journal of Computers; 2016-10-31; Vol. 39, No. 10; pp. 1947-1964 *

Also Published As

Publication number Publication date
CN109189936A (en) 2019-01-11

Similar Documents

Publication Publication Date Title
CN109189936B (en) Label semantic learning method based on network structure and semantic correlation measurement
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN110765775A (en) Self-adaptive method for named entity recognition field fusing semantics and label differences
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
CN109753589A (en) A kind of figure method for visualizing based on figure convolutional network
CN111078833B (en) Text classification method based on neural network
JP2021508866A (en) Promote area- and client-specific application program interface recommendations
CN107688576B (en) Construction and tendency classification method of CNN-SVM model
CN107357899B (en) Short text sentiment analysis method based on sum-product network depth automatic encoder
CN112559684A (en) Keyword extraction and information retrieval method
WO2013049529A1 (en) Method and apparatus for unsupervised learning of multi-resolution user profile from text analysis
Fu et al. Long short-term memory network over rhetorical structure theory for sentence-level sentiment analysis
JP2021508391A (en) Promote area- and client-specific application program interface recommendations
CN113051932B (en) Category detection method for network media event of semantic and knowledge expansion theme model
Sadr et al. Unified topic-based semantic models: a study in computing the semantic relatedness of geographic terms
CN114462392A (en) Short text feature expansion method based on topic relevance and keyword association
CN114692605A (en) Keyword generation method and device fusing syntactic structure information
Saha et al. Sentiment Classification in Bengali News Comments using a hybrid approach with Glove
CN115600582B (en) Controllable text generation method based on pre-training language model
CN116992040A (en) Knowledge graph completion method and system based on conceptual diagram
CN114491076B (en) Data enhancement method, device, equipment and medium based on domain knowledge graph
CN110597982A (en) Short text topic clustering algorithm based on word co-occurrence network
CN111782964B (en) Recommendation method of community posts
CN110377845B (en) Collaborative filtering recommendation method based on interval semi-supervised LDA
Kim Research on Text Classification Based on Deep Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240403

Address after: Room 1518B, Unit 2, 12th Floor, Huizhi Building, No. 9 Xueqing Road, Haidian District, Beijing, 100080

Patentee after: Beijing contention Technology Co.,Ltd.

Country or region after: China

Address before: No.9, 13th Street, economic and Technological Development Zone, Binhai New Area, Tianjin

Patentee before: TIANJIN University OF SCIENCE AND TECHNOLOGY

Country or region before: China
