CN110083683B - Entity semantic annotation method based on random walk - Google Patents


Info

Publication number
CN110083683B
CN110083683B (application CN201910323155.1A)
Authority
CN
China
Prior art keywords
entity
matrix
term
idf
random walk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910323155.1A
Other languages
Chinese (zh)
Other versions
CN110083683A (en)
Inventor
张明西
苏冠英
李学民
杨柳倩
乐水波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN201910323155.1A priority Critical patent/CN110083683B/en
Publication of CN110083683A publication Critical patent/CN110083683A/en
Application granted granted Critical
Publication of CN110083683B publication Critical patent/CN110083683B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an entity semantic annotation method based on random walk. First, in an offline module, a random-walk algorithm is used to obtain a steady-state probability matrix over the description texts of the entities in a corpus, i.e. an entity-term relevance score matrix. Then, in the online query stage, the number of annotation terms to select is given and a user query entity node is input; the row vector corresponding to the query node is found in the steady-state probability matrix obtained by the offline module, and all element values in that row vector are sorted by magnitude to form the tag recommendation list for the query. Finally, according to the user's needs, the first k terms with the largest entity-term relevance scores are recommended as the entity's feature words, i.e. the semantic tags of the queried entity. The method introduces a random-walk model to overcome data sparsity, can semantically annotate entities on sparse datasets, and effectively improves recall and precision over existing keyword-labeling methods.

Description

Entity semantic annotation method based on random walk
Technical Field
The invention relates to a data mining technology, in particular to an entity semantic annotation method based on random walk.
Background
Tags can effectively organize information on the internet; the task of semantic annotation is, given an entity (a document, image, video, etc.), to recommend several related tags. Several approaches to tag labeling exist. Collaborative filtering is the most widely applied recommendation technique, but it suffers from data sparsity, cold start, and similar problems, which directly degrade recommendation quality. There are also keyword-labeling methods, whose aim is to find several representative words in the original text; these cannot supply words that may be more representative but do not appear in the document. Tag recommendation differs from keyword extraction in that the required tags may not appear in the text at all, making them unobtainable by extraction. Moreover, because real data are sparse, existing methods struggle to label comprehensively, which hurts the recall and precision of the returned results.
Random walk with restart has become a widely used way to measure node-to-node similarity on graph data, and this technique offers an important reference for semantic annotation. A recommendation model based on random walk matches human intuition and can capture the global structure of a complex network graph, thereby effectively alleviating data sparsity.
Disclosure of Invention
Addressing the problem that existing methods struggle to label comprehensively, which hurts the recall and precision of returned results, the invention provides an entity semantic annotation method based on random walk and solves entity semantic annotation in large-scale networks. The adopted technical scheme effectively combines the TF-IDF algorithm with a random-walk model and takes the relations between entities and terms into account, so that terms can be matched effectively at query time. By fully mining and exploiting the structural information implicit in the complex network graph, potential semantic tags are discovered and query results are presented better.
The technical scheme of the invention is as follows: an entity semantic annotation method based on random walk, comprising the following steps.
1) In an offline module, based on the TF-IDF algorithm and a random-walk algorithm, obtain the steady-state probability matrix over the description texts of the entities in a corpus; the obtained steady-state probability matrix is the entity-term relevance score matrix. First, preprocess the texts in the corpus and uniformly represent the processed text data in the form of entity IDs and term IDs. Then, establish the entity-term bipartite network in combination with the TF-IDF algorithm and map the network to a TF-IDF weight matrix A_TF-IDF; in A_TF-IDF, sort the element values of each row vector by weight and set a node retention proportion for the row vector, setting the TF-IDF values of the nodes below the retained proportion to 0, to obtain the pruned entity-term matrix A_PT. Finally, construct a square matrix based on the entity-term matrix, take the normalized square matrix as the initial transition probability matrix, execute the random-walk process, and obtain the steady-state probability matrix by offline mining of the entities' description texts.
2) In the online query stage, the number of annotation terms to select is given and a user query node, i.e. an entity node, is input. The row vector corresponding to the query node is then found in the steady-state probability matrix obtained in step 1) by the offline module, and all element values in the row vector, called entity-term relevance scores, are sorted by magnitude to form the tag recommendation list for the query.
3) According to the user's needs, the first k terms with the largest entity-term relevance scores are recommended as the entity's feature words, i.e. the semantic tags of the queried entity.
Preprocessing the texts in the corpus in step 1) means removing punctuation, converting case, and segmenting the description texts into words, then numbering the entities and the preprocessed terms separately, so that the text data is uniformly represented in the form of entity IDs and term IDs.
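As a rough illustration (not the patent's actual code), the preprocessing step can be sketched in Python; the function name and the regex tokenizer are assumptions:

```python
import re

def preprocess(descriptions):
    """Sketch of step 1) preprocessing: strip punctuation, lowercase,
    segment into words, then number entities and terms with integer IDs.
    `descriptions` maps an entity name to its description text; all names
    here are illustrative."""
    entity_ids, term_ids, corpus = {}, {}, {}
    for name, text in descriptions.items():
        eid = entity_ids.setdefault(name, len(entity_ids))
        # remove punctuation and convert case via a simple word regex
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        corpus[eid] = [term_ids.setdefault(t, len(term_ids)) for t in tokens]
    return entity_ids, term_ids, corpus
```

The output represents every description uniformly as an entity ID paired with a list of term IDs, matching the form the later matrix construction expects.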
Establishing the entity-term bipartite network in combination with the TF-IDF algorithm in step 1) is realized as follows.
A: Compute the TF and IDF values of the preprocessed data. The TF value counts how often a single term appears in the description text corresponding to an entity:

tf_{i,j} = q_{i,j} / (Σ_{k=1}^{m} q_{i,k})

where q_{i,j} is the number of occurrences of term j in the description of entity i, the denominator is the total number of terms in entity i, and m is the length of entity i's description text. The IDF value of a term divides the total number of entities in the corpus by the number of entities in the corpus containing the term:

idf_j = log( |D| / |{i : term j appears in entity i}| )

where |D| is the total number of entities in the corpus.
B: Compute the TF-IDF weight with the formula tfidf_{i,j} = tf_{i,j} × idf_j, obtaining the TF-IDF weight between entity i and term j.
C: Construct the "entity-term" bipartite graph G = <V, E, W>, where V = V_P ∪ V_T represents the set of entity nodes V_P and term nodes V_T, E is the set of entity-term edges represented by node pairs, and the edge weight W is the entity-term TF-IDF weight obtained in step B. The entity-term bipartite network is thereby mapped to a TF-IDF weight matrix A_TF-IDF, whose row vectors represent the entity vectors, whose column vectors represent all term vectors, and whose elements are the entity-term TF-IDF weights.
D: In matrix A_TF-IDF, sort each row vector by TF-IDF weight; then set a retention proportion for the elements of the row vector, filter by that proportion, and remove the edges to the entity-term nodes outside it, i.e. set the TF-IDF values of the elements below the retained proportion in each row to 0. This yields the pruned entity-term matrix A_PT; term nodes that cannot describe the entity's characteristics are thereby skipped.
Constructing the square matrix in step 1): the square matrix is a block matrix composed of A_PT, A_TP, and two zero matrices, where A_TP is the term-to-entity matrix and the transpose of A_PT. The two zero matrices are placed on the main diagonal because entity-entity and term-term relations need not be considered.
Obtaining the steady-state probability matrix in step 1): the normalized square matrix Ā is taken as the initial transition probability matrix of the random walk, and the adapted random-walk algorithm iterates with the formula

R^{l+1} = (1 - c) · R^l · Ā + c · R^0

where c is the restart probability, l is the iteration step of the random walk, the overbar denotes normalization, R^0 is the initial distribution, R^l is the probability distribution matrix obtained after l walk steps, and R^{l+1} is the probability distribution matrix after step l + 1. The central idea is: starting from a source node S, where S is the query node, the walker restarts at node S with probability c and walks to a neighbor node with probability (1 - c); after each step a probability distribution is obtained, describing the probability that each node in the graph is visited.
Set the maximum number of random-walk iteration steps, use the previous probability distribution as the input of the next step, and repeat the iteration to execute the random walk. When the maximum step count is reached, the walk has converged to stable probability values; the iteration then ends, and the resulting probability distribution matrix is the steady-state probability matrix of the random walk.
The beneficial effects of the invention: the entity semantic annotation method based on random walk uses the TF-IDF algorithm and introduces a random-walk model to overcome data sparsity; by considering the correlations between entities and terms it mines latent semantic features, attaches suitable tags to entities, and improves recall and precision over prior methods.
Drawings
FIG. 1 is a flow chart of the entity semantic annotation method based on random walk according to the present invention;
FIG. 2 is a diagram of the TF-IDF weight matrix of the present invention;
FIG. 3 is a diagram of an initial transition probability matrix for random walks in accordance with the present invention;
FIG. 4 is a graph illustrating the effect of the restart probability c on the precision ratio according to the present invention;
FIG. 5 is a graph of the effect of iteration step length l on recall and precision according to the present invention;
FIG. 6 is a graph illustrating the effect of the number k of selected tags on recall and precision according to the present invention.
Detailed Description
As shown in fig. 1, the entity semantic annotation method based on random walk includes the following specific steps:
step 1, in an offline module, based on a random walk algorithm, obtaining a steady-state probability matrix of description texts of entities in a corpus, wherein the obtained steady-state probability matrix represents a correlation score among entity terms. The concrete implementation is as follows:
1.1 Pre-processing the description text of the entity in the corpus (removing punctuation marks, case conversion, word segmentation, etc.). The entities and the pre-processed terms are numbered separately. The processed text data is uniformly expressed in the form of entity ID and term ID.
1.2) TF-IDF is a weighting technique commonly used in information retrieval and data mining. Compute the TF and IDF values of the preprocessed data. The TF value (term frequency) counts how often a single term appears in the description text corresponding to an entity:

tf_{i,j} = q_{i,j} / (Σ_{k=1}^{m} q_{i,k})

where q_{i,j} is the number of occurrences of term j in the description of entity i, the denominator is the total number of terms in entity i's description text, and m is the length of that description text. The IDF value (inverse document frequency) of a term is obtained by dividing the total number of entities in the corpus by the number of entities that contain the term:

idf_j = log( |D| / |{i : term j appears in entity i}| )

where |D| is the total number of entities in the corpus.
Compute the TF-IDF weight with the formula tfidf_{i,j} = tf_{i,j} × idf_j, obtaining the TF-IDF weight between entity i and term j. A high term frequency within an entity combined with a low document frequency for that term across the whole entity set yields a high TF-IDF weight. TF-IDF therefore filters out common terms and retains the important terms that describe the entity's characteristics.
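The TF-IDF computation of step 1.2) can be sketched as follows; a minimal Python illustration with assumed names, not the patent's implementation:

```python
import math

def tf_idf_weights(docs):
    """Compute tfidf_{i,j} = tf_{i,j} * idf_j over tokenized entity
    descriptions. `docs` maps an entity ID to its list of terms; the
    function name and input shape are illustrative assumptions."""
    n_docs = len(docs)
    # document frequency: number of entity descriptions containing each term
    df = {}
    for terms in docs.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    weights = {}
    for ent, terms in docs.items():
        total = len(terms)  # total number of terms in entity i's description
        for t in set(terms):
            tf = terms.count(t) / total          # term frequency
            idf = math.log(n_docs / df[t])       # inverse document frequency
            weights[(ent, t)] = tf * idf
    return weights
```

A term occurring in every entity gets idf = log(1) = 0, so, as the text notes, common terms are filtered out while distinctive ones keep a positive weight.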
1.3) Construct the "entity-term" bipartite graph G = <V, E, W>, where V = V_P ∪ V_T represents the set of entity nodes V_P and term nodes V_T, E is the set of entity-term edges represented by node pairs, and the edge weight W is the entity-term TF-IDF weight obtained in step 1.2). The entity-term bipartite network is then mapped to a TF-IDF weight matrix A_TF-IDF, whose row vectors represent the entity vectors, whose column vectors represent all term vectors, and whose elements are the entity-term TF-IDF weights. When term j appears in the description text of entity i, the element in row i, column j of the matrix takes the value tfidf_{i,j}; otherwise the element value is 0.
1.4) In matrix A_TF-IDF, sort each row vector by TF-IDF weight; then filter by weight, removing the term nodes with smaller weights (here we sort each row vector by weight and keep the top 60% of term nodes). This is equivalent to setting the TF-IDF values of the elements outside the retained proportion of each row of A_TF-IDF to 0, and yields the pruned entity-term matrix A_PT. Through this step, term nodes that cannot describe the entity's characteristics are skipped in the bipartite network, avoiding interference with the result.
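The top-60% row pruning of step 1.4) can be sketched with NumPy; `prune_rows` and its signature are assumptions for illustration:

```python
import numpy as np

def prune_rows(A, keep=0.6):
    """Keep only the top `keep` fraction of each row's nonzero TF-IDF
    weights; entries outside the retained fraction are zeroed, yielding
    the pruned matrix A_PT. keep=0.6 mirrors the 60% retention used in
    the patent's experiments."""
    A_pruned = np.zeros_like(A, dtype=float)
    for i, row in enumerate(A):
        nonzero = np.flatnonzero(row)
        if nonzero.size == 0:
            continue
        k = max(1, int(np.ceil(keep * nonzero.size)))
        # indices of the k largest weights in this row
        top = nonzero[np.argsort(row[nonzero])[::-1][:k]]
        A_pruned[i, top] = row[top]
    return A_pruned
```

Zeroing instead of deleting columns keeps the matrix shape fixed, so the later block-matrix construction is unaffected.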
1.5) Construct the square matrix: A_PT, A_TP and two zero matrices together form a block matrix, i.e. the square matrix. A_TP is the term-to-entity matrix and the transpose of A_PT. In particular, because entity-entity and term-term relations are not considered, the two zero matrices are placed on the main diagonal.
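The block construction of step 1.5) and the row normalization that follows might look like this in NumPy; the matrix values are illustrative, not from the patent:

```python
import numpy as np

# A_PT: entity-to-term TF-IDF matrix (rows = entities, columns = terms),
# with made-up values for illustration.
A_PT = np.array([[0.5, 0.0, 0.3],
                 [0.0, 0.7, 0.2]])
A_TP = A_PT.T  # term-to-entity matrix, the transpose of A_PT

n_p, n_t = A_PT.shape
# Zero blocks sit on the main diagonal: entity-entity and term-term
# relations are not considered.
M = np.block([[np.zeros((n_p, n_p)), A_PT],
              [A_TP, np.zeros((n_t, n_t))]])

# Row-normalize to obtain the initial transition probability matrix
row_sums = M.sum(axis=1, keepdims=True)
row_sums[row_sums == 0] = 1.0  # guard against isolated nodes
M_bar = M / row_sums
```

Because the diagonal blocks are zero, normalizing the whole square matrix is the same as normalizing A_PT and A_TP separately, as step 1.6) observes.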
1.6) Normalize the square matrix, denoting the result Ā. The element values of the normalized square matrix are computed as

ā_{i,j} = a_{i,j} / (Σ_{n} a_{i,n})

where a_{i,j} is the TF-IDF weight between entity i and term j, and a_{i,n} is the TF-IDF weight of the n-th term in entity i's row vector. ā_{i,j} represents the initial transition probability from entity i to term j. In other words, in A_PT the sum of the TF-IDF values of all terms in entity i's row vector is taken as the denominator and each term's TF-IDF weight as the numerator, giving the normalized square matrix Ā. Since the main diagonal consists of two zero matrices, normalizing the square matrix can be regarded as normalizing A_PT and A_TP separately; Ā_PT and Ā_TP are the respective normalized forms of A_PT and A_TP.
1.7) Take the normalized square matrix Ā as the initial transition probability matrix of the random walk and apply the adapted random-walk algorithm. The iterative formula is

R^{l+1} = (1 - c) · R^l · Ā + c · R^0

where c is the restart probability, l is the iteration step of the random walk, R^0 is the initial distribution, R^l is the probability distribution matrix obtained after l walk steps, and R^{l+1} is the probability distribution matrix after step l + 1.
Random walk: starting from a source node S (i.e. the query node), the walker restarts at node S with probability c and walks to a neighbor node with probability (1 - c); after each step a probability distribution is obtained, characterizing the probability that each node in the graph is visited.
1.8) Set the maximum number of random-walk iteration steps, use the previous probability distribution as the input of the next step, and repeat the iteration with the formula of step 1.7) to execute the random walk. When the maximum step count is reached, the walk has converged to stable probability values; the iteration then ends, and the resulting probability distribution matrix is the steady-state probability matrix of the random walk.
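Steps 1.7) and 1.8) can be sketched as a power iteration in Python. The published formulas appear only as images, so the update R = (1 - c)·R·Ā + c·R^0 used here is a reconstruction from the surrounding description (restart with probability c, step to a neighbor with probability 1 - c), not the patent's verbatim formula:

```python
import numpy as np

def steady_state(M_bar, c=0.8, max_steps=11):
    """Random walk with restart on the row-normalized block matrix M_bar.

    Each row of R is one walk: the walker restarts at its source node
    with probability c and moves to a neighbor with probability 1 - c.
    c=0.8 and max_steps=11 mirror the values reported in the experiments."""
    n = M_bar.shape[0]
    R0 = np.eye(n)  # initial distribution: one walk per source node
    R = R0
    for _ in range(max_steps):
        R = (1 - c) * R @ M_bar + c * R0
    return R
```

After convergence, row i of the result holds the visiting probabilities of every node as seen from source node i, i.e. the entity-term relevance scores used online.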
Step 2: in the online query stage, the number k of annotation terms to select is given and a user query node (i.e. an entity node) is input. The row vector corresponding to the query node is found in the steady-state probability matrix obtained by the offline module in step 1, and all element values in the row vector (the element values are called entity-term relevance scores) are sorted by magnitude to form the tag recommendation list for the query.
Step 3: according to the user's needs, the first k terms with the highest entity-term relevance scores are recommended as the entity's feature words, i.e. the semantic tags of the queried entity.
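The online lookup of steps 2 and 3 can be sketched as follows; `recommend_tags`, the vocabulary mapping, and the block layout assumption (entity rows first, then term columns) are illustrative, not taken from the patent:

```python
import numpy as np

def recommend_tags(R, entity_row, term_vocab, k=10):
    """Given the offline steady-state matrix R over the block graph,
    return the top-k terms for the queried entity. `entity_row` is the
    entity's row index in R; `term_vocab` maps column offsets within the
    term block to term strings."""
    n_entities = R.shape[0] - len(term_vocab)
    # relevance scores of the query entity against every term node
    scores = R[entity_row, n_entities:]
    top = np.argsort(scores)[::-1][:k]  # sort by magnitude, descending
    return [(term_vocab[j], float(scores[j])) for j in top]
```

Since R is computed once offline, the online stage reduces to one row lookup plus a sort, which is consistent with the millisecond-level response times reported below.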
The present invention was evaluated in a series of experiments on the Amazon dataset.
First, the ASIN and Title fields of the Amazon dataset (the ASIN is the entity's ID number; the Title is the entity's description text) are extracted to form the source data of entities and their corresponding description texts. The source data is then preprocessed as in step 1), and the entity-term TF-IDF weights are computed as in step 1.2), yielding the TF-IDF weight matrix A_TF-IDF shown in FIG. 2. The rows of the matrix represent the entity vectors; each column corresponds to a single term obtained after preprocessing; the matrix entries are the entity-term TF-IDF weights.
We find the TF-IDF weight matrix to be sparse. To better exploit this sparsity, the row vectors of A_TF-IDF are sorted by TF-IDF weight and the top 60% of term nodes are retained, giving the pruned entity-term matrix A_PT. A square matrix is then constructed as in step 1.5); its rows and columns are jointly indexed by entity nodes and term nodes, as shown in FIG. 3. As the figure shows, the square matrix is a block matrix formed by A_PT, A_TP and two zero matrices. A_TP is the term-to-entity matrix and the transpose of A_PT. In particular, because entity-entity and term-term relations are not considered, the two zero matrices are placed on the main diagonal.
The square matrix is normalized as described in step 1.6) and denoted Ā; its elements are computed as ā_{i,j} = a_{i,j} / (Σ_n a_{i,n}). The normalized square matrix is then taken as the initial transition probability matrix of the random walk, and the adapted random-walk algorithm is applied. Since the main diagonal consists of zero matrices, normalizing the square matrix can be regarded as normalizing A_PT and A_TP separately, and the resulting iterative formula is R^{l+1} = (1 - c) · R^l · Ā + c · R^0.
a random walk is performed on the data set and given a maximum iteration step of 11 for the random walk, the number of extracted entities is 5 ten thousand (i.e. there are 5 thousand entities of description text). In fig. 4, the influence of the restart probability c on the experimental result is firstly explored, and we find that when the restart probability is selected to be 0.8, the algorithm guarantees an accuracy of 73.5%, so that the experiment is performed by taking c =0.8 next. When the step 3 is executed, we find that the transition probability values are always in a stable state (i.e. the probability values are not changed) when the iteration reaches the step 11 with the increase of the iteration step size of the random walk, that is, a stable probability distribution is obtained, and the experimental result is shown in fig. 5. The steady-state probability value at iteration step size 11 is therefore chosen as the final relevance score for the entity-term.
Table 1 gives examples of labeling entities in the Amazon dataset. Using the proposed method, the importance ordering of the entity description text is obtained and the tag recommendation list is generated as in step 2; the top-10 descriptive words can be selected to inspect the entity's semantic annotation. For example, in the second column, when the user queries the entity with ASIN 0878301534, the steady-state probability matrix computed in the offline stage is searched and sorted by relevance score to find the most relevant terms; the Top-10 terms, namely daning, balloon, latin, finger board, slow, teach, kick, session, midnight and stars, can then serve as the entity's semantic tags. This example shows that the words "recording, balloon, latin, finger board, slow, teach, kick" are highly related to the entity's Title "balloon recording", and that these tags do not appear in the entity's description text. Compared with keyword labeling, potential semantic tags are mined, so the user can clearly grasp the entity's basic description information, shortening entity query time.
TABLE 1
(Table 1 appears as an image in the original publication.)
FIG. 6 shows curves of the number of selected annotation words against the recall (Recall) and precision (Precision) of entity semantic annotation. They indicate that, once the iteration has stabilized, the more annotation words are selected the higher the recall, but the user's browsing time increases and precision drops, so a balance must be found. As shown in FIG. 6, for the short entity text descriptions in the Amazon dataset, selecting 4 annotation words ensures a good compromise between recall and precision. Through the tags attached to an entity, a user can therefore tell whether the entity is what they need, reducing unnecessary browsing time, enabling real-time recommendation, and meeting user needs.
A large number of experiments on the Amazon dataset show that the proposed method achieves a precision of 73.5% and a recall of 60%. The method can therefore discover latent semantic associations and, to a certain extent, realize semantic annotation of entities. For single user queries, 50 queries were timed to measure the average query time; the result shows an average response time of 1.553 ms per query, meeting the user's real-time response requirement.

Claims (2)

1. A random walk-based entity semantic annotation method is characterized by comprising the following steps:
1) In an offline module, based on a TF-IDF algorithm and a random walk algorithm, obtaining a steady-state probability matrix of a description text of an entity in a corpus, wherein the obtained steady-state probability matrix is an entity term correlation score matrix;
firstly, preprocessing the texts in the corpus and uniformly representing the processed text data in the form of entity IDs and term IDs; then establishing the entity-term bipartite network in combination with the TF-IDF algorithm and mapping the network to a TF-IDF weight matrix A_TF-IDF; in matrix A_TF-IDF, sorting the element values of each row vector by weight and setting a node retention proportion for the row vector, i.e. setting the TF-IDF values of the nodes below the retained proportion in the row vector to 0, to obtain the pruned entity-term matrix A_PT; finally, constructing a square matrix based on the entity-term matrix, taking the normalized square matrix as the initial transition probability matrix, executing the random-walk process, and obtaining the steady-state probability matrix by offline mining of the entities' description texts;
2) in the online query stage, giving the number of annotation terms to select and inputting a user query node, i.e. an entity node; then finding, in the steady-state probability matrix obtained in step 1) by the offline module, the row vector corresponding to the query node, and sorting all element values in the row vector, called entity-term relevance scores, by magnitude to form the tag recommendation list for the query;
3) according to the user's needs, recommending the first k terms with the largest entity-term relevance scores as the entity's feature words, i.e. the semantic tags of the queried entity;
the entity-term bipartite network is established by combining the TF-IDF algorithm, and the method is specifically realized as follows:
a, calculating TF value and IDF value of the preprocessed data, and counting the times of the single term appearing in the description text corresponding to the entity, namely the times are TF values
Figure FDA0003849724300000011
Wherein q is i,j Is the number of occurrences of the term j in the entity description i, and the denominator is the total number of all terms in the entity i, and m is the length of the description text of the entity i; counting the IDF value of the term, dividing the total number of entities in the corpus by the number of entities in the corpus containing the term,
Figure FDA0003849724300000012
| D | is the total number of all entities in the corpus;
b, calculating TF-TDF weight value with the formula tfidf i,j =tf i,j ×idf j Obtaining TF-IDF weight between the entity i and the term j;
c: construction of "entity-term" bipartite graph G =<V,E,W>Wherein V =<V P ∪V T >Representing a physical node V P And the term node V T Set, E is an entity and term represented by a node pairThe weight W between edges of the edge set is the TF-IDF weight between the entity and the term obtained in the step B, so that the entity-term binary network is mapped into a TF-IDF weight matrix A TF-IDF Wherein, matrix A TF-IDF The row vector of (1) represents each entity vector, the column vector represents all term vectors, and each element of the matrix represents TF-IDF weight values among the entity terms;
d in the matrix A TF-IDF In the method, the row vectors are sorted according to the TF-IDF weight; then setting the reserved proportion of elements in the row vector, screening according to the proportion, removing the connecting edges between the entity term nodes outside the proportion, namely setting the TF-IDF value of the elements in the matrix row vector, which are smaller than the proportion, to be 0, thus obtaining the processed entity term matrix A PT Namely, the term node which cannot describe the entity characteristics is skipped; constructing a square matrix: the square matrix is composed of PT 、A TP And a block matrix composed of two zero matrixes, wherein A TP Is a matrix of terms to entities, and A PT The two zero matrixes are arranged on the main diagonal line because the relation between entities and terms is not needed to be considered;
Obtaining the steady-state probability matrix: the square matrix M is normalized,

Ā_{i,j} = M_{i,j} / Σ_k M_{k,j}

and the normalized matrix Ā is regarded as the initial transition probability matrix of the random walk; the rewritten random-walk algorithm is realized with the iterative formula

R^{l+1} = (1 − c) · Ā · R^l + c · E

wherein c is the restart probability, l is the iteration step of the random walk, the overbar denotes normalization, and E is the restart matrix with 1 at the position of the source node and 0 elsewhere; R^l is the probability distribution matrix obtained after the l-th random-walk step, and R^{l+1} is the probability distribution matrix after step (l + 1). The central idea is as follows: starting from a source node S, where S is the query node, the walker restarts at node S with probability c and walks to a neighbor of the current node with probability (1 − c); each walk yields a probability distribution describing the probability that each node in the graph is visited;
A maximum random-walk iteration step count is set; the probability distribution from the previous walk is used as the input of the next walk, and the iteration repeats so as to execute the random walk. When the iteration reaches the maximum number of steps, the probability values have converged to stable values; the iteration then ends, and the resulting probability distribution matrix is the steady-state probability matrix of the random walk.
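The random walk with restart described above can be sketched as follows (a minimal pure-Python sketch; the helper names, the column-normalization choice, and the toy two-node graph are my own assumptions, not prescribed by the patent):

```python
def normalize_columns(M):
    """Normalize M so each column sums to 1 (a transition probability matrix)."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def rwr(M, source, c=0.15, max_steps=100):
    """Random walk with restart from `source`, iterating
    r^{l+1} = (1 - c) * A * r^l + c * e until max_steps is reached."""
    n = len(M)
    A = normalize_columns(M)
    e = [1.0 if i == source else 0.0 for i in range(n)]  # restart vector
    r = e[:]
    for _ in range(max_steps):
        # previous distribution is the input of the next walk
        r = [(1 - c) * sum(A[i][j] * r[j] for j in range(n)) + c * e[i]
             for i in range(n)]
    return r  # steady-state probability vector

# Toy example: two nodes connected by one edge, query node 0
M = [[0.0, 1.0],
     [1.0, 0.0]]
r = rwr(M, source=0)
```

Since the update is a contraction with factor (1 − c), the distribution converges geometrically, so a fixed maximum step count suffices in practice.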
2. The entity semantic annotation method based on random walk according to claim 1, wherein preprocessing the text in the corpus in step 1) refers to removing punctuation marks, converting case, and performing word segmentation on the description text, and then numbering the entities and the preprocessed terms respectively, i.e., uniformly representing the text data in the form of entity IDs and term IDs.
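The preprocessing of claim 2 can be sketched as follows (a hedged sketch: the regex tokenizer stands in for the word-segmentation step, and all names are illustrative):

```python
import re

def preprocess(descriptions):
    """Lowercase, strip punctuation, split into terms, and map entities and
    terms to integer IDs, so all text is represented as entity/term IDs."""
    entity_ids, term_ids, tokenized = {}, {}, {}
    for name, text in descriptions.items():
        eid = entity_ids.setdefault(name, len(entity_ids))
        # case conversion + punctuation removal + (naive) segmentation
        terms = re.findall(r"[a-z0-9]+", text.lower())
        tokenized[eid] = [term_ids.setdefault(t, len(term_ids)) for t in terms]
    return entity_ids, term_ids, tokenized

entity_ids, term_ids, tokenized = preprocess({"E1": "Random walk, graph!"})
```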
CN201910323155.1A 2019-04-22 2019-04-22 Entity semantic annotation method based on random walk Active CN110083683B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910323155.1A CN110083683B (en) 2019-04-22 2019-04-22 Entity semantic annotation method based on random walk

Publications (2)

Publication Number Publication Date
CN110083683A CN110083683A (en) 2019-08-02
CN110083683B true CN110083683B (en) 2022-12-13

Family

ID=67415955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910323155.1A Active CN110083683B (en) 2019-04-22 2019-04-22 Entity semantic annotation method based on random walk

Country Status (1)

Country Link
CN (1) CN110083683B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444317B (en) * 2020-03-17 2021-11-30 杭州电子科技大学 Semantic-sensitive knowledge graph random walk sampling method
CN111931480B (en) * 2020-07-03 2023-07-18 北京新联财通咨询有限公司 Text main content determining method and device, storage medium and computer equipment
CN111767440B (en) * 2020-09-03 2021-01-05 平安国际智慧城市科技股份有限公司 Vehicle portrayal method based on knowledge graph, computer equipment and storage medium
CN113361605B (en) * 2021-06-07 2024-05-24 汇智数字科技控股(深圳)有限公司 Product similarity quantification method based on Amazon keywords
CN116775849B (en) * 2023-08-23 2023-10-24 成都运荔枝科技有限公司 On-line problem processing system and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298606A (en) * 2011-06-01 2011-12-28 清华大学 Random walking image automatic annotation method and device based on label graph model
CN105243149A (en) * 2015-10-26 2016-01-13 深圳市智搜信息技术有限公司 Semantic-based query recommendation method and system
CN106445989A (en) * 2016-06-03 2017-02-22 新乡学院 Query click graph-based search recommendation model optimization
CN107832306A (en) * 2017-11-28 2018-03-23 武汉大学 A kind of similar entities method for digging based on Doc2vec
CN108460011A (en) * 2018-02-01 2018-08-28 北京百度网讯科技有限公司 A kind of entitative concept mask method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11074495B2 (en) * 2013-02-28 2021-07-27 Z Advanced Computing, Inc. (Zac) System and method for extremely efficient image and pattern recognition and artificial intelligence platform

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Fast Random Walk with Restart and Its Applications";Hanghang Tong;《IEEE》;20070108;全文 *
"基于双层随机游走的关系推理算法";刘峤等;《计算机学报》;20161125;全文 *

Also Published As

Publication number Publication date
CN110083683A (en) 2019-08-02

Similar Documents

Publication Publication Date Title
Bennani-Smires et al. Simple unsupervised keyphrase extraction using sentence embeddings
CN110083683B (en) Entity semantic annotation method based on random walk
Wang et al. K-adapter: Infusing knowledge into pre-trained models with adapters
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
US10146862B2 (en) Context-based metadata generation and automatic annotation of electronic media in a computer network
CN109885773B (en) Personalized article recommendation method, system, medium and equipment
CN105183833B (en) Microblog text recommendation method and device based on user model
CN108446316B (en) association word recommendation method and device, electronic equipment and storage medium
WO2008106667A1 (en) Searching heterogeneous interrelated entities
CN110879834B (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN110807101A (en) Scientific and technical literature big data classification method
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN110321437B (en) Corpus data processing method and device, electronic equipment and medium
CN106708929A (en) Video program search method and device
CN113190593A (en) Search recommendation method based on digital human knowledge graph
CN109145083A (en) A kind of candidate answers choosing method based on deep learning
CN113673252A (en) Automatic join recommendation method for data table based on field semantics
CN112434533A (en) Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
CN106570196B (en) Video program searching method and device
CN113111178B (en) Method and device for disambiguating homonymous authors based on expression learning without supervision
CN105426490A (en) Tree structure based indexing method
CN108509449B (en) Information processing method and server
CN113312523B (en) Dictionary generation and search keyword recommendation method and device and server
CN112199461B (en) Document retrieval method, device, medium and equipment based on block index structure
CN111985217B (en) Keyword extraction method, computing device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant