CN110083683B - Entity semantic annotation method based on random walk - Google Patents


Info

Publication number
CN110083683B
CN110083683B (application CN201910323155.1A)
Authority
CN
China
Prior art keywords
entity
matrix
term
idf
random walk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910323155.1A
Other languages
Chinese (zh)
Other versions
CN110083683A (en)
Inventor
张明西
苏冠英
李学民
杨柳倩
乐水波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN201910323155.1A priority Critical patent/CN110083683B/en
Publication of CN110083683A publication Critical patent/CN110083683A/en
Application granted granted Critical
Publication of CN110083683B publication Critical patent/CN110083683B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an entity semantic annotation method based on random walk. First, in an offline module, a random-walk algorithm is used to obtain a steady-state probability matrix over the description texts of the entities in a corpus, i.e. an entity-term relevance score matrix. Then, in the online query stage, the number of annotation terms to select is given and a user query entity node is input; the row vector corresponding to the query node is found in the steady-state probability matrix obtained by the offline module, and all element values in that row vector are sorted by magnitude to form the tag recommendation list for the query. Finally, according to the user's needs, the first k terms with the largest entity-term relevance scores are recommended as the entity's feature words, i.e. the semantic tags of the queried entity. The method introduces a random-walk model to overcome data sparsity, can semantically annotate entities on sparse datasets, and effectively improves recall and precision over existing keyword-labeling methods.

Description

Entity semantic annotation method based on random walk
Technical Field
The invention relates to a data mining technology, in particular to an entity semantic annotation method based on random walk.
Background
Tags can effectively organize information on the internet; the task of semantic annotation is, given an entity (a document, image, video, etc.), to recommend several related tags. Several approaches to tag labeling exist. Collaborative filtering is the most widely applied recommendation technique, but it suffers from data sparsity, cold start, and similar problems, which directly degrade recommendation quality. There are also keyword-labeling methods, whose aim is to find several representative words in the original text; these cannot supply words that may be more representative but do not appear in the document. Tag recommendation differs from keyword extraction in that the required tags may not appear in the text at all, making them unobtainable by extraction. Moreover, because real data are sparse, existing methods struggle to label comprehensively, which hurts the recall and precision of the returned results.
Random walk with restart has become a widely used way to measure node-to-node similarity on graph data, and this technique offers an important reference for semantic annotation. A recommendation model based on random walk matches human intuition and can capture the global structure of a complex network graph, thereby effectively alleviating data sparsity.
Disclosure of Invention
Addressing the problem that existing methods struggle to label comprehensively, which hurts the recall and precision of returned results, the invention provides an entity semantic annotation method based on random walk and solves entity semantic annotation in large-scale networks. The adopted technical scheme effectively combines the TF-IDF algorithm with a random-walk model and takes the relations between entities and terms into account, so that terms can be matched effectively at query time. By fully mining and exploiting the structural information implicit in the complex network graph, potential semantic tags are discovered and query results are presented better.
The technical scheme of the invention is as follows: an entity semantic annotation method based on random walk, comprising the following steps.
1) In an offline module, based on the TF-IDF algorithm and a random-walk algorithm, obtain the steady-state probability matrix over the description texts of the entities in a corpus; the obtained steady-state probability matrix is the entity-term relevance score matrix. First, preprocess the texts in the corpus and uniformly represent the processed text data in the form of entity IDs and term IDs. Then, establish the entity-term bipartite network in combination with the TF-IDF algorithm and map the network to a TF-IDF weight matrix A_TF-IDF; in A_TF-IDF, sort the element values of each row vector by weight and set a node retention proportion for the row vector, setting the TF-IDF values of the nodes below the retained proportion to 0, to obtain the pruned entity-term matrix A_PT. Finally, construct a square matrix based on the entity-term matrix, take the normalized square matrix as the initial transition probability matrix, execute the random-walk process, and obtain the steady-state probability matrix by offline mining of the entities' description texts.
2) In the online query stage, the number of annotation terms to select is given and a user query node, i.e. an entity node, is input. The row vector corresponding to the query node is then found in the steady-state probability matrix obtained in step 1) by the offline module, and all element values in the row vector, called entity-term relevance scores, are sorted by magnitude to form the tag recommendation list for the query.
3) According to the user's needs, the first k terms with the largest entity-term relevance scores are recommended as the entity's feature words, i.e. the semantic tags of the queried entity.
Preprocessing the texts in the corpus in step 1) means removing punctuation, converting case, and segmenting the description texts into words, then numbering the entities and the preprocessed terms separately, so that the text data is uniformly represented in the form of entity IDs and term IDs.
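As a rough illustration (not the patent's actual code), the preprocessing step can be sketched in Python; the function name and the regex tokenizer are assumptions:

```python
import re

def preprocess(descriptions):
    """Sketch of step 1) preprocessing: strip punctuation, lowercase,
    segment into words, then number entities and terms with integer IDs.
    `descriptions` maps an entity name to its description text; all names
    here are illustrative."""
    entity_ids, term_ids, corpus = {}, {}, {}
    for name, text in descriptions.items():
        eid = entity_ids.setdefault(name, len(entity_ids))
        # remove punctuation and convert case via a simple word regex
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        corpus[eid] = [term_ids.setdefault(t, len(term_ids)) for t in tokens]
    return entity_ids, term_ids, corpus
```

The output represents every description uniformly as an entity ID paired with a list of term IDs, matching the form the later matrix construction expects.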
Establishing the entity-term bipartite network in combination with the TF-IDF algorithm in step 1) is realized as follows.
A: Compute the TF and IDF values of the preprocessed data. The TF value counts how often a single term appears in the description text corresponding to an entity:

tf_{i,j} = q_{i,j} / (Σ_{k=1}^{m} q_{i,k})

where q_{i,j} is the number of occurrences of term j in the description of entity i, the denominator is the total number of terms in entity i, and m is the length of entity i's description text. The IDF value of a term divides the total number of entities in the corpus by the number of entities in the corpus containing the term:

idf_j = log( |D| / |{i : term j appears in entity i}| )

where |D| is the total number of entities in the corpus.
B: Compute the TF-IDF weight with the formula tfidf_{i,j} = tf_{i,j} × idf_j, obtaining the TF-IDF weight between entity i and term j.
C: Construct the "entity-term" bipartite graph G = <V, E, W>, where V = V_P ∪ V_T represents the set of entity nodes V_P and term nodes V_T, E is the set of entity-term edges represented by node pairs, and the edge weight W is the entity-term TF-IDF weight obtained in step B. The entity-term bipartite network is thereby mapped to a TF-IDF weight matrix A_TF-IDF, whose row vectors represent the entity vectors, whose column vectors represent all term vectors, and whose elements are the entity-term TF-IDF weights.
D: In matrix A_TF-IDF, sort each row vector by TF-IDF weight; then set a retention proportion for the elements of the row vector, filter by that proportion, and remove the edges to the entity-term nodes outside it, i.e. set the TF-IDF values of the elements below the retained proportion in each row to 0. This yields the pruned entity-term matrix A_PT; term nodes that cannot describe the entity's characteristics are thereby skipped.
Constructing the square matrix in step 1): the square matrix is a block matrix composed of A_PT, A_TP, and two zero matrices, where A_TP is the term-to-entity matrix and the transpose of A_PT. The two zero matrices are placed on the main diagonal because entity-entity and term-term relations need not be considered.
Obtaining the steady-state probability matrix in step 1): the normalized square matrix Ā is taken as the initial transition probability matrix of the random walk, and the adapted random-walk algorithm iterates with the formula

R^{l+1} = (1 - c) · R^l · Ā + c · R^0

where c is the restart probability, l is the iteration step of the random walk, the overbar denotes normalization, R^0 is the initial distribution, R^l is the probability distribution matrix obtained after l walk steps, and R^{l+1} is the probability distribution matrix after step l + 1. The central idea is: starting from a source node S, where S is the query node, the walker restarts at node S with probability c and walks to a neighbor node with probability (1 - c); after each step a probability distribution is obtained, describing the probability that each node in the graph is visited.
Set the maximum number of random-walk iteration steps, use the previous probability distribution as the input of the next step, and repeat the iteration to execute the random walk. When the maximum step count is reached, the walk has converged to stable probability values; the iteration then ends, and the resulting probability distribution matrix is the steady-state probability matrix of the random walk.
The beneficial effects of the invention: the entity semantic annotation method based on random walk uses the TF-IDF algorithm and introduces a random-walk model to overcome data sparsity; by considering the correlations between entities and terms it mines latent semantic features, attaches suitable tags to entities, and improves recall and precision over prior methods.
Drawings
FIG. 1 is a flow chart of the entity semantic annotation method based on random walk according to the present invention;
FIG. 2 is a diagram of the TF-IDF weight matrix of the present invention;
FIG. 3 is a diagram of an initial transition probability matrix for random walks in accordance with the present invention;
FIG. 4 is a graph illustrating the effect of the restart probability c on the precision ratio according to the present invention;
FIG. 5 is a graph of the effect of iteration step length l on recall and precision according to the present invention;
FIG. 6 is a graph illustrating the effect of the number k of selected tags on recall and precision according to the present invention.
Detailed Description
As shown in fig. 1, the entity semantic annotation method based on random walk includes the following specific steps:
step 1, in an offline module, based on a random walk algorithm, obtaining a steady-state probability matrix of description texts of entities in a corpus, wherein the obtained steady-state probability matrix represents a correlation score among entity terms. The concrete implementation is as follows:
1.1 Pre-processing the description text of the entity in the corpus (removing punctuation marks, case conversion, word segmentation, etc.). The entities and the pre-processed terms are numbered separately. The processed text data is uniformly expressed in the form of entity ID and term ID.
1.2) TF-IDF is a weighting technique commonly used in information retrieval and data mining. Compute the TF and IDF values of the preprocessed data. The TF value (term frequency) counts how often a single term appears in the description text corresponding to an entity:

tf_{i,j} = q_{i,j} / (Σ_{k=1}^{m} q_{i,k})

where q_{i,j} is the number of occurrences of term j in the description of entity i, the denominator is the total number of terms in entity i's description text, and m is the length of that description text. The IDF value (inverse document frequency) of a term is obtained by dividing the total number of entities in the corpus by the number of entities that contain the term:

idf_j = log( |D| / |{i : term j appears in entity i}| )

where |D| is the total number of entities in the corpus.
Compute the TF-IDF weight with the formula tfidf_{i,j} = tf_{i,j} × idf_j, obtaining the TF-IDF weight between entity i and term j. A high term frequency within an entity combined with a low document frequency for that term across the whole entity set yields a high TF-IDF weight. TF-IDF therefore filters out common terms and retains the important terms that describe the entity's characteristics.
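The TF-IDF computation of step 1.2) can be sketched as follows; a minimal Python illustration with assumed names, not the patent's implementation:

```python
import math

def tf_idf_weights(docs):
    """Compute tfidf_{i,j} = tf_{i,j} * idf_j over tokenized entity
    descriptions. `docs` maps an entity ID to its list of terms; the
    function name and input shape are illustrative assumptions."""
    n_docs = len(docs)
    # document frequency: number of entity descriptions containing each term
    df = {}
    for terms in docs.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    weights = {}
    for ent, terms in docs.items():
        total = len(terms)  # total number of terms in entity i's description
        for t in set(terms):
            tf = terms.count(t) / total          # term frequency
            idf = math.log(n_docs / df[t])       # inverse document frequency
            weights[(ent, t)] = tf * idf
    return weights
```

A term occurring in every entity gets idf = log(1) = 0, so, as the text notes, common terms are filtered out while distinctive ones keep a positive weight.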
1.3) Construct the "entity-term" bipartite graph G = <V, E, W>, where V = V_P ∪ V_T represents the set of entity nodes V_P and term nodes V_T, E is the set of entity-term edges represented by node pairs, and the edge weight W is the entity-term TF-IDF weight obtained in step 1.2). The entity-term bipartite network is then mapped to a TF-IDF weight matrix A_TF-IDF, whose row vectors represent the entity vectors, whose column vectors represent all term vectors, and whose elements are the entity-term TF-IDF weights. When term j appears in the description text of entity i, the element in row i, column j of the matrix takes the value tfidf_{i,j}; otherwise the element value is 0.
1.4) In matrix A_TF-IDF, sort each row vector by TF-IDF weight; then filter by weight, removing the term nodes with smaller weights (here we sort each row vector by weight and keep the top 60% of term nodes). This is equivalent to setting the TF-IDF values of the elements outside the retained proportion of each row of A_TF-IDF to 0, and yields the pruned entity-term matrix A_PT. Through this step, term nodes that cannot describe the entity's characteristics are skipped in the bipartite network, avoiding interference with the result.
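The top-60% row pruning of step 1.4) can be sketched with NumPy; `prune_rows` and its signature are assumptions for illustration:

```python
import numpy as np

def prune_rows(A, keep=0.6):
    """Keep only the top `keep` fraction of each row's nonzero TF-IDF
    weights; entries outside the retained fraction are zeroed, yielding
    the pruned matrix A_PT. keep=0.6 mirrors the 60% retention used in
    the patent's experiments."""
    A_pruned = np.zeros_like(A, dtype=float)
    for i, row in enumerate(A):
        nonzero = np.flatnonzero(row)
        if nonzero.size == 0:
            continue
        k = max(1, int(np.ceil(keep * nonzero.size)))
        # indices of the k largest weights in this row
        top = nonzero[np.argsort(row[nonzero])[::-1][:k]]
        A_pruned[i, top] = row[top]
    return A_pruned
```

Zeroing instead of deleting columns keeps the matrix shape fixed, so the later block-matrix construction is unaffected.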
1.5) Construct the square matrix: A_PT, A_TP and two zero matrices together form a block matrix, i.e. the square matrix. A_TP is the term-to-entity matrix and the transpose of A_PT. In particular, because entity-entity and term-term relations are not considered, the two zero matrices are placed on the main diagonal.
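The block construction of step 1.5) and the row normalization that follows might look like this in NumPy; the matrix values are illustrative, not from the patent:

```python
import numpy as np

# A_PT: entity-to-term TF-IDF matrix (rows = entities, columns = terms),
# with made-up values for illustration.
A_PT = np.array([[0.5, 0.0, 0.3],
                 [0.0, 0.7, 0.2]])
A_TP = A_PT.T  # term-to-entity matrix, the transpose of A_PT

n_p, n_t = A_PT.shape
# Zero blocks sit on the main diagonal: entity-entity and term-term
# relations are not considered.
M = np.block([[np.zeros((n_p, n_p)), A_PT],
              [A_TP, np.zeros((n_t, n_t))]])

# Row-normalize to obtain the initial transition probability matrix
row_sums = M.sum(axis=1, keepdims=True)
row_sums[row_sums == 0] = 1.0  # guard against isolated nodes
M_bar = M / row_sums
```

Because the diagonal blocks are zero, normalizing the whole square matrix is the same as normalizing A_PT and A_TP separately, as step 1.6) observes.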
1.6) Normalize the square matrix, denoting the result Ā. The element values of the normalized square matrix are computed as

ā_{i,j} = a_{i,j} / (Σ_{n} a_{i,n})

where a_{i,j} is the TF-IDF weight between entity i and term j, and a_{i,n} is the TF-IDF weight of the n-th term in entity i's row vector. ā_{i,j} represents the initial transition probability from entity i to term j. In other words, in A_PT the sum of the TF-IDF values of all terms in entity i's row vector is taken as the denominator and each term's TF-IDF weight as the numerator, giving the normalized square matrix Ā. Since the main diagonal consists of two zero matrices, normalizing the square matrix can be regarded as normalizing A_PT and A_TP separately; Ā_PT and Ā_TP are the respective normalized forms of A_PT and A_TP.
1.7) Take the normalized square matrix Ā as the initial transition probability matrix of the random walk and apply the adapted random-walk algorithm. The iterative formula is

R^{l+1} = (1 - c) · R^l · Ā + c · R^0

where c is the restart probability, l is the iteration step of the random walk, R^0 is the initial distribution, R^l is the probability distribution matrix obtained after l walk steps, and R^{l+1} is the probability distribution matrix after step l + 1.
Random walk: starting from a source node S (i.e. the query node), the walker restarts at node S with probability c and walks to a neighbor node with probability (1 - c); after each step a probability distribution is obtained, characterizing the probability that each node in the graph is visited.
1.8) Set the maximum number of random-walk iteration steps, use the previous probability distribution as the input of the next step, and repeat the iteration with the formula of step 1.7) to execute the random walk. When the maximum step count is reached, the walk has converged to stable probability values; the iteration then ends, and the resulting probability distribution matrix is the steady-state probability matrix of the random walk.
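Steps 1.7) and 1.8) can be sketched as a power iteration in Python. The published formulas appear only as images, so the update R = (1 - c)·R·Ā + c·R^0 used here is a reconstruction from the surrounding description (restart with probability c, step to a neighbor with probability 1 - c), not the patent's verbatim formula:

```python
import numpy as np

def steady_state(M_bar, c=0.8, max_steps=11):
    """Random walk with restart on the row-normalized block matrix M_bar.

    Each row of R is one walk: the walker restarts at its source node
    with probability c and moves to a neighbor with probability 1 - c.
    c=0.8 and max_steps=11 mirror the values reported in the experiments."""
    n = M_bar.shape[0]
    R0 = np.eye(n)  # initial distribution: one walk per source node
    R = R0
    for _ in range(max_steps):
        R = (1 - c) * R @ M_bar + c * R0
    return R
```

After convergence, row i of the result holds the visiting probabilities of every node as seen from source node i, i.e. the entity-term relevance scores used online.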
Step 2: in the online query stage, the number k of annotation terms to select is given and a user query node (i.e. an entity node) is input. The row vector corresponding to the query node is found in the steady-state probability matrix obtained by the offline module in step 1, and all element values in the row vector (the element values are called entity-term relevance scores) are sorted by magnitude to form the tag recommendation list for the query.
Step 3: according to the user's needs, the first k terms with the highest entity-term relevance scores are recommended as the entity's feature words, i.e. the semantic tags of the queried entity.
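The online lookup of steps 2 and 3 can be sketched as follows; `recommend_tags`, the vocabulary mapping, and the block layout assumption (entity rows first, then term columns) are illustrative, not taken from the patent:

```python
import numpy as np

def recommend_tags(R, entity_row, term_vocab, k=10):
    """Given the offline steady-state matrix R over the block graph,
    return the top-k terms for the queried entity. `entity_row` is the
    entity's row index in R; `term_vocab` maps column offsets within the
    term block to term strings."""
    n_entities = R.shape[0] - len(term_vocab)
    # relevance scores of the query entity against every term node
    scores = R[entity_row, n_entities:]
    top = np.argsort(scores)[::-1][:k]  # sort by magnitude, descending
    return [(term_vocab[j], float(scores[j])) for j in top]
```

Since R is computed once offline, the online stage reduces to one row lookup plus a sort, which is consistent with the millisecond-level response times reported below.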
The present invention was evaluated in a series of experiments on the Amazon dataset.
First, the ASIN and Title fields of the Amazon dataset (the ASIN is the entity's ID number; the Title is the entity's description text) are extracted to form the source data of entities and their corresponding description texts. The source data is then preprocessed as in step 1), and the entity-term TF-IDF weights are computed as in step 1.2), yielding the TF-IDF weight matrix A_TF-IDF shown in FIG. 2. The rows of the matrix represent the entity vectors; each column corresponds to a single term obtained after preprocessing; the matrix entries are the entity-term TF-IDF weights.
We find the TF-IDF weight matrix to be sparse. To better exploit this sparsity, the row vectors of A_TF-IDF are sorted by TF-IDF weight and the top 60% of term nodes are retained, giving the pruned entity-term matrix A_PT. A square matrix is then constructed as in step 1.5); its rows and columns are jointly indexed by entity nodes and term nodes, as shown in FIG. 3. As the figure shows, the square matrix is a block matrix formed by A_PT, A_TP and two zero matrices. A_TP is the term-to-entity matrix and the transpose of A_PT. In particular, because entity-entity and term-term relations are not considered, the two zero matrices are placed on the main diagonal.
The square matrix is normalized as described in step 1.6) and denoted Ā; its elements are computed as ā_{i,j} = a_{i,j} / (Σ_n a_{i,n}). The normalized square matrix is then taken as the initial transition probability matrix of the random walk, and the adapted random-walk algorithm is applied. Since the main diagonal consists of zero matrices, normalizing the square matrix can be regarded as normalizing A_PT and A_TP separately, and the resulting iterative formula is R^{l+1} = (1 - c) · R^l · Ā + c · R^0.
a random walk is performed on the data set and given a maximum iteration step of 11 for the random walk, the number of extracted entities is 5 ten thousand (i.e. there are 5 thousand entities of description text). In fig. 4, the influence of the restart probability c on the experimental result is firstly explored, and we find that when the restart probability is selected to be 0.8, the algorithm guarantees an accuracy of 73.5%, so that the experiment is performed by taking c =0.8 next. When the step 3 is executed, we find that the transition probability values are always in a stable state (i.e. the probability values are not changed) when the iteration reaches the step 11 with the increase of the iteration step size of the random walk, that is, a stable probability distribution is obtained, and the experimental result is shown in fig. 5. The steady-state probability value at iteration step size 11 is therefore chosen as the final relevance score for the entity-term.
Table 1 gives examples of labeling entities in the Amazon dataset. Using the proposed method, the importance ordering of the entity description text is obtained and the tag recommendation list is generated as in step 2; the top-10 descriptive words can be selected to inspect the entity's semantic annotation. For example, in the second column, when the user queries the entity with ASIN 0878301534, the steady-state probability matrix computed in the offline stage is searched and sorted by relevance score to find the most relevant terms; the Top-10 terms, namely daning, balloon, latin, finger board, slow, teach, kick, session, midnight and stars, can then serve as the entity's semantic tags. This example shows that the words "recording, balloon, latin, finger board, slow, teach, kick" are highly related to the entity's Title "balloon recording", and that these tags do not appear in the entity's description text. Compared with keyword labeling, potential semantic tags are mined, so the user can clearly grasp the entity's basic description information, shortening entity query time.
TABLE 1
(Table 1 appears as an image in the original publication.)
FIG. 6 shows curves of the number of selected annotation words against the recall (Recall) and precision (Precision) of entity semantic annotation. They indicate that, once the iteration has stabilized, the more annotation words are selected the higher the recall, but the user's browsing time increases and precision drops, so a balance must be found. As shown in FIG. 6, for the short entity text descriptions in the Amazon dataset, selecting 4 annotation words ensures a good compromise between recall and precision. Through the tags attached to an entity, a user can therefore tell whether the entity is what they need, reducing unnecessary browsing time, enabling real-time recommendation, and meeting user needs.
A large number of experiments on the Amazon dataset show that the proposed method achieves a precision of 73.5% and a recall of 60%. The method can therefore discover latent semantic associations and, to a certain extent, realize semantic annotation of entities. For single user queries, 50 queries were timed to measure the average query time; the result shows an average response time of 1.553 ms per query, meeting the user's real-time response requirement.

Claims (2)

1. A random walk-based entity semantic annotation method is characterized by comprising the following steps:
1) In an offline module, based on a TF-IDF algorithm and a random walk algorithm, obtaining a steady-state probability matrix of a description text of an entity in a corpus, wherein the obtained steady-state probability matrix is an entity term correlation score matrix;
firstly, preprocessing the texts in the corpus and uniformly representing the processed text data in the form of entity IDs and term IDs; then establishing the entity-term bipartite network in combination with the TF-IDF algorithm and mapping the network to a TF-IDF weight matrix A_TF-IDF; in matrix A_TF-IDF, sorting the element values of each row vector by weight and setting a node retention proportion for the row vector, i.e. setting the TF-IDF values of the nodes below the retained proportion in the row vector to 0, to obtain the pruned entity-term matrix A_PT; finally, constructing a square matrix based on the entity-term matrix, taking the normalized square matrix as the initial transition probability matrix, executing the random-walk process, and obtaining the steady-state probability matrix by offline mining of the entities' description texts;
2) in the online query stage, giving the number of annotation terms to select and inputting a user query node, i.e. an entity node; then finding, in the steady-state probability matrix obtained in step 1) by the offline module, the row vector corresponding to the query node, and sorting all element values in the row vector, called entity-term relevance scores, by magnitude to form the tag recommendation list for the query;
3) according to the user's needs, recommending the first k terms with the largest entity-term relevance scores as the entity's feature words, i.e. the semantic tags of the queried entity;
the entity-term bipartite network is established by combining the TF-IDF algorithm, and the method is specifically realized as follows:
a, calculating TF value and IDF value of the preprocessed data, and counting the times of the single term appearing in the description text corresponding to the entity, namely the times are TF values
Figure FDA0003849724300000011
Wherein q is i,j Is the number of occurrences of the term j in the entity description i, and the denominator is the total number of all terms in the entity i, and m is the length of the description text of the entity i; counting the IDF value of the term, dividing the total number of entities in the corpus by the number of entities in the corpus containing the term,
Figure FDA0003849724300000012
| D | is the total number of all entities in the corpus;
b, calculating TF-TDF weight value with the formula tfidf i,j =tf i,j ×idf j Obtaining TF-IDF weight between the entity i and the term j;
c: construction of "entity-term" bipartite graph G =<V,E,W>Wherein V =<V P ∪V T >Representing a physical node V P And the term node V T Set, E is an entity and term represented by a node pairThe weight W between edges of the edge set is the TF-IDF weight between the entity and the term obtained in the step B, so that the entity-term binary network is mapped into a TF-IDF weight matrix A TF-IDF Wherein, matrix A TF-IDF The row vector of (1) represents each entity vector, the column vector represents all term vectors, and each element of the matrix represents TF-IDF weight values among the entity terms;
d in the matrix A TF-IDF In the method, the row vectors are sorted according to the TF-IDF weight; then setting the reserved proportion of elements in the row vector, screening according to the proportion, removing the connecting edges between the entity term nodes outside the proportion, namely setting the TF-IDF value of the elements in the matrix row vector, which are smaller than the proportion, to be 0, thus obtaining the processed entity term matrix A PT Namely, the term node which cannot describe the entity characteristics is skipped; constructing a square matrix: the square matrix is composed of PT 、A TP And a block matrix composed of two zero matrixes, wherein A TP Is a matrix of terms to entities, and A PT The two zero matrixes are arranged on the main diagonal line because the relation between entities and terms is not needed to be considered;
Obtaining the steady-state probability matrix: the square matrix M is normalized,

Ā_{i,j} = M_{i,j} / Σ_k M_{k,j}

and the normalized matrix Ā is regarded as the initial transition probability matrix of the random walk; the rewritten random-walk algorithm is realized with the iterative formula

R^{l+1} = (1 − c) · Ā · R^l + c · E

wherein c is the restart probability, l is the iteration step of the random walk, the overbar denotes normalization, and E is the restart matrix with 1 at the position of the source node and 0 elsewhere; R^l is the probability distribution matrix obtained after the l-th random-walk step, and R^{l+1} is the probability distribution matrix after step (l + 1). The central idea is as follows: starting from a source node S, where S is the query node, the walker restarts at node S with probability c and walks to a neighbor of the current node with probability (1 − c); each walk yields a probability distribution describing the probability that each node in the graph is visited;
A maximum random-walk iteration step count is set; the probability distribution from the previous walk is used as the input of the next walk, and the iteration repeats so as to execute the random walk. When the iteration reaches the maximum number of steps, the probability values have converged to stable values; the iteration then ends, and the resulting probability distribution matrix is the steady-state probability matrix of the random walk.
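The random walk with restart described above can be sketched as follows (a minimal pure-Python sketch; the helper names, the column-normalization choice, and the toy two-node graph are my own assumptions, not prescribed by the patent):

```python
def normalize_columns(M):
    """Normalize M so each column sums to 1 (a transition probability matrix)."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def rwr(M, source, c=0.15, max_steps=100):
    """Random walk with restart from `source`, iterating
    r^{l+1} = (1 - c) * A * r^l + c * e until max_steps is reached."""
    n = len(M)
    A = normalize_columns(M)
    e = [1.0 if i == source else 0.0 for i in range(n)]  # restart vector
    r = e[:]
    for _ in range(max_steps):
        # previous distribution is the input of the next walk
        r = [(1 - c) * sum(A[i][j] * r[j] for j in range(n)) + c * e[i]
             for i in range(n)]
    return r  # steady-state probability vector

# Toy example: two nodes connected by one edge, query node 0
M = [[0.0, 1.0],
     [1.0, 0.0]]
r = rwr(M, source=0)
```

Since the update is a contraction with factor (1 − c), the distribution converges geometrically, so a fixed maximum step count suffices in practice.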
2. The entity semantic annotation method based on random walk according to claim 1, wherein preprocessing the text in the corpus in step 1) refers to removing punctuation marks, converting case, and performing word segmentation on the description text, and then numbering the entities and the preprocessed terms respectively, i.e., uniformly representing the text data in the form of entity IDs and term IDs.
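The preprocessing of claim 2 can be sketched as follows (a hedged sketch: the regex tokenizer stands in for the word-segmentation step, and all names are illustrative):

```python
import re

def preprocess(descriptions):
    """Lowercase, strip punctuation, split into terms, and map entities and
    terms to integer IDs, so all text is represented as entity/term IDs."""
    entity_ids, term_ids, tokenized = {}, {}, {}
    for name, text in descriptions.items():
        eid = entity_ids.setdefault(name, len(entity_ids))
        # case conversion + punctuation removal + (naive) segmentation
        terms = re.findall(r"[a-z0-9]+", text.lower())
        tokenized[eid] = [term_ids.setdefault(t, len(term_ids)) for t in terms]
    return entity_ids, term_ids, tokenized

entity_ids, term_ids, tokenized = preprocess({"E1": "Random walk, graph!"})
```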
CN201910323155.1A 2019-04-22 2019-04-22 Entity semantic annotation method based on random walk Active CN110083683B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910323155.1A CN110083683B (en) 2019-04-22 2019-04-22 Entity semantic annotation method based on random walk

Publications (2)

Publication Number Publication Date
CN110083683A CN110083683A (en) 2019-08-02
CN110083683B true CN110083683B (en) 2022-12-13

Family

ID=67415955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910323155.1A Active CN110083683B (en) 2019-04-22 2019-04-22 Entity semantic annotation method based on random walk

Country Status (1)

Country Link
CN (1) CN110083683B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444317B (en) * 2020-03-17 2021-11-30 杭州电子科技大学 Semantic-sensitive knowledge graph random walk sampling method
CN111931480B (en) * 2020-07-03 2023-07-18 北京新联财通咨询有限公司 Text main content determining method and device, storage medium and computer equipment
CN111767440B (en) * 2020-09-03 2021-01-05 平安国际智慧城市科技股份有限公司 Vehicle portrayal method based on knowledge graph, computer equipment and storage medium
CN113361605B (en) * 2021-06-07 2024-05-24 汇智数字科技控股(深圳)有限公司 Product similarity quantification method based on Amazon keywords
CN116775849B (en) * 2023-08-23 2023-10-24 成都运荔枝科技有限公司 On-line problem processing system and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298606A (en) * 2011-06-01 2011-12-28 清华大学 Random walking image automatic annotation method and device based on label graph model
CN105243149A (en) * 2015-10-26 2016-01-13 深圳市智搜信息技术有限公司 Semantic-based query recommendation method and system
CN106445989A (en) * 2016-06-03 2017-02-22 新乡学院 Query click graph-based search recommendation model optimization
CN107832306A (en) * 2017-11-28 2018-03-23 武汉大学 A kind of similar entities method for digging based on Doc2vec
CN108460011A (en) * 2018-02-01 2018-08-28 北京百度网讯科技有限公司 A kind of entitative concept mask method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11074495B2 (en) * 2013-02-28 2021-07-27 Z Advanced Computing, Inc. (Zac) System and method for extremely efficient image and pattern recognition and artificial intelligence platform

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Fast Random Walk with Restart and Its Applications";Hanghang Tong;《IEEE》;20070108;全文 *
"基于双层随机游走的关系推理算法";刘峤等;《计算机学报》;20161125;全文 *

Also Published As

Publication number Publication date
CN110083683A (en) 2019-08-02

Similar Documents

Publication Publication Date Title
Bennani-Smires et al. Simple unsupervised keyphrase extraction using sentence embeddings
CN110083683B (en) Entity semantic annotation method based on random walk
Wang et al. K-adapter: Infusing knowledge into pre-trained models with adapters
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
US10146862B2 (en) Context-based metadata generation and automatic annotation of electronic media in a computer network
CN109885773B (en) Personalized article recommendation method, system, medium and equipment
CN105183833B (en) Microblog text recommendation method and device based on user model
CN108446316B (en) association word recommendation method and device, electronic equipment and storage medium
WO2008106667A1 (en) Searching heterogeneous interrelated entities
CN110879834B (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN110807101A (en) Scientific and technical literature big data classification method
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN110321437B (en) Corpus data processing method and device, electronic equipment and medium
CN106708929A (en) Video program search method and device
CN113190593A (en) Search recommendation method based on digital human knowledge graph
CN109145083A (en) A kind of candidate answers choosing method based on deep learning
CN113673252A (en) Automatic join recommendation method for data table based on field semantics
CN112434533A (en) Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
CN106570196B (en) Video program searching method and device
CN113111178B (en) Method and device for disambiguating homonymous authors based on expression learning without supervision
CN105426490A (en) Tree structure based indexing method
CN108509449B (en) Information processing method and server
CN113312523B (en) Dictionary generation and search keyword recommendation method and device and server
CN112199461B (en) Document retrieval method, device, medium and equipment based on block index structure
CN111985217B (en) Keyword extraction method, computing device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant