CN112836029A - Graph-based document retrieval method, system and related components thereof - Google Patents

Graph-based document retrieval method, system and related components thereof

Info

Publication number
CN112836029A
Authority
CN
China
Prior art keywords
document
graph
semantic
retrieval
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110110581.4A
Other languages
Chinese (zh)
Inventor
王伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Runlian Software System Shenzhen Co Ltd
Original Assignee
Runlian Software System Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Runlian Software System Shenzhen Co Ltd filed Critical Runlian Software System Shenzhen Co Ltd
Priority to CN202110110581.4A priority Critical patent/CN112836029A/en
Publication of CN112836029A publication Critical patent/CN112836029A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, a system and related components for searching a document based on a graph, wherein the method comprises the following steps: acquiring a document to be retrieved to construct a target knowledge graph, acquiring embedded vectors corresponding to entity nodes and edge nodes in the target knowledge graph, and calculating semantic distance between every two entity nodes; acquiring a minimum semantic subgraph containing all retrieval keywords input by a user; acquiring a topic word set of each document in a document set and calculating an average vector; and acquiring the feature vector of the minimum semantic subgraph, performing cosine similarity calculation on the feature vector and the average vector, taking the document with the calculation result larger than the similarity threshold value as a candidate searched document, and searching out a target searched document. The invention utilizes the cosine similarity calculation method to calculate the characteristic vector and the average vector of the documents, screens out the candidate searched documents according to the calculation result, and searches out the target searched documents from the candidate searched documents, the whole calculation process is simple and effective, the target is clear, and the search cost is reduced.

Description

Graph-based document retrieval method, system and related components thereof
Technical Field
The invention relates to the technical field of document retrieval, in particular to a document retrieval method and system based on a graph and related components thereof.
Background
People often need to obtain information through retrieval, and the prevailing approach is keyword-based retrieval over the document set to be retrieved. With this approach, two prominent problems arise when users choose keywords: "lack" and "error". "Lack" means that the user does not know the subject well and therefore cannot supply more specific keywords; "error" means that the search keyword used by the user is not one that truly reflects the features of the subject. Currently, the mainstream information retrieval approaches fall roughly into the following categories:
1. retrieval schemes such as TF-IDF (term frequency-inverse document frequency, a common weighting technique used in information retrieval and data mining) and BM25 (an algorithm for evaluating the relevance between search terms and documents), as adopted by traditional full-text search engines such as Elasticsearch and Solr;
2. training a deep neural network on a large amount of labeled data to establish semantic associations between questions and answers, so that the start positions of answers can be located in the document.
The first approach often returns too many results, and a great deal of effort is still needed to screen them for the desired information. The second approach requires substantial resources to collect and curate training data, and is therefore costly.
Disclosure of Invention
The embodiment of the invention provides a graph-based document retrieval method, a graph-based document retrieval system and related components thereof, and aims to solve the problems of excessive retrieval results, time consumption and high cost in the prior art.
In a first aspect, an embodiment of the present invention provides a graph-based document retrieval method, which includes:
acquiring a document to be retrieved, constructing a document set, identifying entities in the document set and relations between the entities by using a natural language processing technology, and constructing a target knowledge graph;
acquiring embedded vectors corresponding to all entity nodes and edge nodes in the target knowledge graph, calculating the cosine similarity of the embedded vectors corresponding to every two entity nodes in a preset reference knowledge graph, and calculating the semantic distance between every two entity nodes according to the cosine similarity of the embedded vectors;
acquiring a retrieval keyword set input by a user, and screening out a minimum semantic sub-graph containing all retrieval keywords in the retrieval keyword set from the target knowledge graph;
acquiring a subject term set of each document in the document set, acquiring a corresponding embedded vector of each subject term, and calculating an average vector of the subject terms according to the embedded vectors of the subject terms;
extracting a feature vector of the minimum semantic subgraph by using a graph convolution neural network, performing cosine similarity calculation on the feature vector and an average vector, and judging whether a calculation result is greater than a similarity threshold value;
and if the calculation result is larger than the similarity threshold value, taking the document corresponding to the subject term set as a candidate searched document, and searching out a target searched document.
In a second aspect, an embodiment of the present invention provides a graph-based document retrieval system, which includes:
the target knowledge graph acquisition unit is used for acquiring a document to be retrieved, constructing a document set, identifying entities in the document set and relations between the entities by using a natural language processing technology, and constructing a target knowledge graph;
the semantic distance calculating unit is used for acquiring embedded vectors corresponding to all entity nodes and edge nodes in the target knowledge graph, calculating the cosine similarity of the embedded vectors corresponding to every two entity nodes in a preset reference knowledge graph, and calculating the semantic distance between every two entity nodes according to the cosine similarity of the embedded vectors;
the minimum semantic subgraph acquisition unit is used for acquiring a retrieval keyword set input by a user and screening out minimum semantic subgraphs containing all retrieval keywords in the retrieval keyword set from the target knowledge graph;
the average vector calculation unit is used for acquiring a subject term set of each document in the document set, acquiring a corresponding embedded vector of each subject term, and calculating an average vector of the subject terms according to the embedded vectors of the subject terms;
the similarity judging unit is used for extracting the feature vector of the minimum semantic subgraph by using a graph convolution neural network, calculating the cosine similarity of the feature vector and the average vector and judging whether the calculation result is greater than a similarity threshold value;
and the target retrieval document acquisition unit is used for taking the document corresponding to the subject term set as a candidate retrieved document and retrieving a target retrieval document if the calculation result is greater than the similarity threshold value.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the processor implements the graph-based document retrieval method described in the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the graph-based document retrieval method according to the first aspect.
The embodiment of the invention provides a graph-based document retrieval method, a graph-based document retrieval system and related components thereof. Acquiring a document to be retrieved, constructing a document set, identifying entities in the document set and relations between the entities by using a natural language processing technology, and constructing a target knowledge graph; acquiring embedded vectors corresponding to all entity nodes and edge nodes in the target knowledge graph, calculating the cosine similarity of the embedded vectors corresponding to every two entity nodes in a preset reference knowledge graph, and calculating the semantic distance between every two entity nodes according to the cosine similarity of the embedded vectors; acquiring a retrieval keyword set input by a user, and screening out a minimum semantic sub-graph containing all retrieval keywords in the retrieval keyword set from the target knowledge graph; acquiring a subject term set of each document in the document set, acquiring a corresponding embedded vector of each subject term, and calculating an average vector of the subject terms according to the embedded vectors of the subject terms; extracting a feature vector of the minimum semantic subgraph by using a graph convolution neural network, performing cosine similarity calculation on the feature vector and an average vector, and judging whether a calculation result is greater than a similarity threshold value; and if the calculation result is larger than the similarity threshold value, taking the document corresponding to the subject term set as a candidate searched document, and searching out a target searched document. 
The embodiment of the invention utilizes a cosine similarity calculation method to calculate the characteristic vector of the minimum semantic subgraph and the average vector of the documents in the target knowledge graph, screens out the candidate searched documents according to the calculation result, and directly searches out the target searched documents from the candidate searched documents, the whole calculation process is simple and effective, the target is clear, and the search cost is reduced. The invention reduces the number of irrelevant documents in the document set to be retrieved, thereby reducing the workload of subsequent retrieval.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating a graph-based document retrieval method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating the minimum semantic subgraph screening of the graph-based document retrieval method according to the embodiment of the present invention;
FIG. 3 is a schematic block diagram of a graph-based document retrieval system provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, fig. 1 is a schematic flowchart of a graph-based document retrieval method according to an embodiment of the present invention, where the method includes steps S101 to S106.
S101, obtaining a document to be retrieved, constructing a document set, identifying entities in the document set and relations between the entities by using a natural language processing technology, and constructing a target knowledge graph;
In this step, all documents to be retrieved are acquired to construct a document set, and the entities in the document set and the relationships between them are then identified through natural language processing techniques such as word segmentation, named entity recognition and relation extraction, so as to construct the target knowledge graph. Word segmentation splits the sentences of each document into words, named entity recognition identifies the entities with specific meanings among the segmented words, and relation extraction obtains the relationships between the entities; the target knowledge graph is then constructed from the entities and their relationships.
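As a minimal sketch of the last part of this step, the following assumes the segmentation, named entity recognition and relation extraction stages have already produced (head, relation, tail) triples. The function name `build_knowledge_graph`, the adjacency-map representation and the example triples are illustrative assumptions, not taken from the patent.

```python
def build_knowledge_graph(triples):
    """Build an adjacency map {entity: [(relation, neighbour), ...]} from triples."""
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
        # Make sure entities that only appear as tails are still nodes.
        graph.setdefault(tail, [])
    return graph


# Hypothetical output of the NLP stages (segmentation, NER, relation extraction).
triples = [
    ("Runlian Software", "located_in", "Shenzhen"),
    ("Shenzhen", "located_in", "China"),
]
kg = build_knowledge_graph(triples)
```

Here each entity becomes a node and each extracted relation becomes a labelled edge; a production system would more likely store the graph in a dedicated graph database.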
In one embodiment, the identifying entities and entity-to-entity relationships in the set of documents using natural language processing techniques includes:
constructing an entity-document mapping table according to the relation between the entity and the document;
acquiring the access count of the document over a period of time, and judging whether the access count is smaller than a count threshold;
and if the access count is smaller than the count threshold, deleting the corresponding document from the entity-document mapping table.
In this embodiment, entities are matched to documents to build an entity-document mapping table; documents whose access count is below the threshold are then removed, and the mapping table is updated. When an entity in the target knowledge graph appears in a document, the entity and the document are considered to correspond. One entity may correspond to one document or to several, so the documents need to be screened. The access counts of all documents over a period of time are obtained, and an LRU algorithm (least recently used; its core idea is that recently used page data is likely to be used again in the near future, while pages that have not been used for a long time are likely to remain unused) is applied to delete from the entity-document mapping table the documents whose access count is below the count threshold. In this manner, when the relationship between an entity and its documents is looked up again, the deleted document-entity relationships are no longer returned.
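The mapping-table embodiment can be sketched as follows. This is a simplified illustration that prunes by a plain access-count threshold rather than a full LRU cache. The function names, the dictionary layout and the sample counts are assumptions made for the example.

```python
def build_entity_document_table(doc_entities):
    """Invert {doc_id: [entities]} into {entity: set of doc_ids}."""
    table = {}
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            table.setdefault(entity, set()).add(doc_id)
    return table


def prune_rarely_accessed(table, access_counts, count_threshold):
    """Remove documents whose access count in the window is below the threshold."""
    return {
        entity: {d for d in doc_ids if access_counts.get(d, 0) >= count_threshold}
        for entity, doc_ids in table.items()
    }


# Hypothetical data: entities found per document and access counts per document.
doc_entities = {
    "doc1": ["graph", "retrieval"],
    "doc2": ["graph"],
    "doc3": ["retrieval"],
}
access_counts = {"doc1": 12, "doc2": 1, "doc3": 5}

table = build_entity_document_table(doc_entities)
pruned = prune_rarely_accessed(table, access_counts, count_threshold=3)
```

With these sample values, `doc2` falls below the threshold and no longer appears under the entity `graph`.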
S102, acquiring embedded vectors corresponding to all entity nodes and edge nodes in the target knowledge graph, calculating the cosine similarity of the embedded vectors corresponding to every two entity nodes in a preset reference knowledge graph, and calculating the semantic distance between every two entity nodes according to the cosine similarity of the embedded vectors;
In this step, each entity becomes an entity node of the target knowledge graph and each relationship between entities becomes an edge node. The embedded vectors of the entity nodes and edge nodes are obtained with the translation embedding (TransE) algorithm, and the cosine similarity of the embedded vectors of every two entity nodes is calculated in a predetermined reference knowledge graph (for example, a large-scale, entity-rich knowledge graph such as Hotan, Freebase or Wikipedia), from which the semantic distance between the two entity nodes is obtained. In one embodiment, the semantic distance between two entity nodes is calculated by the following formula:
Figure BDA0002918823010000051
where $\mathrm{sim}(E_i, E_j)$ is the cosine similarity of the embedded vectors of the two entity nodes, $\sum R_{E_i}$ is the sum of the embedded vectors of all edges connected to entity $E_i$, and $\sum R_{E_j}$ is the sum of the embedded vectors of all edges connected to entity $E_j$.
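The quantities named in this formula can be computed as in the sketch below. Because the formula itself is only available as an image in the source text, the sketch stops at its three inputs and does not guess how they are combined into the final distance. The helper names, the dictionary layout and the two-dimensional sample vectors are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sum_vectors(vectors, dim):
    """Element-wise sum of a list of vectors of dimension dim."""
    total = [0.0] * dim
    for vec in vectors:
        for k, value in enumerate(vec):
            total[k] += value
    return total


def distance_inputs(entity_vecs, edge_vecs, e_i, e_j):
    """Return sim(E_i, E_j) and the summed edge embeddings of both entities."""
    dim = len(entity_vecs[e_i])
    return {
        "sim": cosine_similarity(entity_vecs[e_i], entity_vecs[e_j]),
        "sum_r_i": sum_vectors(edge_vecs.get(e_i, []), dim),
        "sum_r_j": sum_vectors(edge_vecs.get(e_j, []), dim),
    }


# Hypothetical TransE-style embeddings for two entities and their connected edges.
entity_vecs = {"A": [1.0, 0.0], "B": [1.0, 1.0]}
edge_vecs = {"A": [[0.5, 0.5], [0.5, -0.5]], "B": [[0.0, 1.0]]}
inputs = distance_inputs(entity_vecs, edge_vecs, "A", "B")
```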
S103, acquiring a retrieval keyword set input by a user, and screening out a minimum semantic sub-graph containing all retrieval keywords in the retrieval keyword set from the target knowledge graph;
In this step, the nodes corresponding to all retrieval keywords in the retrieval keyword set input by the user are located in the target knowledge graph and extracted to obtain the semantic subgraphs containing all the retrieval keywords, and the semantic subgraph with the smallest radius is selected as the minimum semantic subgraph.
In one embodiment, the step S103 includes:
taking all the retrieval keywords in the retrieval keyword set as nodes;
firstly, selecting any node as a central node, and entering a semantic subgraph selection step;
semantic subgraph selection: taking the semantic distance from the central node to the node farthest from it as the radius, forming a semantic subgraph containing all the nodes, and then adjusting the radius by successively increasing or decreasing it by a preset value to obtain the semantic subgraph that has the smallest radius and still contains all the retrieval keywords;
continuously selecting another node as a central node, and entering a semantic subgraph selection step;
and obtaining semantic subgraphs corresponding to all the nodes, and selecting the semantic subgraph with the minimum radius from the semantic subgraphs.
In this embodiment, any retrieval keyword in the retrieval keyword set input by the user is selected as the central node, and the semantic subgraph selection step is performed to obtain a semantic subgraph. The process is repeated until every retrieval keyword has served as the central node of a corresponding semantic subgraph, and the semantic subgraph with the smallest radius is then selected from all of them as the minimum semantic subgraph. After a semantic subgraph is obtained, the radius of the next semantic subgraph is the radius of the previous one increased or decreased by a predefined Δr, chosen so that the next semantic subgraph still contains all the retrieval keywords.
As shown in fig. 2, the retrieval keywords input by the user are kw1, kw2 and kw3. With kw1 as the central node, the node farthest from kw1 is kw3, and the semantic distance between kw1 and kw3 is taken as the radius to obtain subgraph 1. Similarly, with kw2 as the central node, the semantic distance between kw2 and kw1 is taken as the radius to obtain subgraph 2; with kw3 as the central node, the semantic distance between kw3 and kw1 is taken as the radius to obtain subgraph 3. Subgraph 2, which has the smallest radius of the three, is selected as the minimum semantic subgraph.
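The centre-by-centre selection illustrated by fig. 2 can be sketched as follows. The sketch takes each keyword in turn as the central node, uses the distance to its farthest keyword as the radius, and keeps the centre with the smallest radius. It omits the incremental Δr adjustment, and the pairwise distances are invented values chosen only to be consistent with the figure description.

```python
def minimum_semantic_subgraph(keywords, distance):
    """Return (center, radius) of the smallest subgraph covering all keywords."""
    best_center, best_radius = None, float("inf")
    for center in keywords:
        # Radius = semantic distance to the keyword farthest from this center.
        radius = max(
            (distance(center, other) for other in keywords if other != center),
            default=0.0,
        )
        if radius < best_radius:
            best_center, best_radius = center, radius
    return best_center, best_radius


# Hypothetical symmetric semantic distances between the three keywords.
pair_distance = {
    frozenset(("kw1", "kw2")): 2.0,
    frozenset(("kw2", "kw3")): 1.5,
    frozenset(("kw1", "kw3")): 3.0,
}


def distance(a, b):
    return pair_distance[frozenset((a, b))]


center, radius = minimum_semantic_subgraph(["kw1", "kw2", "kw3"], distance)
```

With these distances the radii are 3.0 for kw1, 2.0 for kw2 and 3.0 for kw3, so kw2 is chosen as the centre, matching subgraph 2 in the example.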
In a specific embodiment, the acquiring the set of search keywords input by the user includes:
acquiring a retrieval keyword input by a user, and searching a position corresponding to the retrieval keyword in the knowledge graph;
if the retrieval keyword is not found in the knowledge graph, obtaining the embedded vector of the retrieval keyword through a word embedding method, and obtaining a near-synonym set of the retrieval keyword by using a cosine similarity calculation method;
and replacing the retrieval keyword with the word in the near-synonym set whose meaning is most similar to that of the retrieval keyword.
In this embodiment, a retrieval keyword input by the user may be inaccurate. In that case, the embedded vector of the retrieval keyword is obtained by a word embedding method, a near-synonym set is obtained by using a cosine similarity calculation method, and the word in the near-synonym set whose meaning is most similar to that of the retrieval keyword is used to replace it. During the cosine similarity calculation, each result is compared with a near-synonym threshold, and every word whose similarity deviation is smaller than the threshold is regarded as a near-synonym.
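A minimal sketch of this keyword replacement is given below. It assumes that the "similarity deviation" is `1 - cosine similarity` and that replacement candidates are limited to entities already present in the knowledge graph. Both are interpretations rather than statements from the patent, and the function name, parameter names and toy word vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def resolve_keyword(keyword, graph_entities, word_vectors, max_deviation):
    """Return the keyword itself if it is in the graph, else its closest near-synonym."""
    if keyword in graph_entities:
        return keyword
    query_vec = word_vectors[keyword]
    # Near-synonym set: graph entities whose deviation (1 - similarity) is small enough.
    synonyms = []
    for entity in graph_entities:
        if entity not in word_vectors:
            continue
        similarity = cosine_similarity(query_vec, word_vectors[entity])
        if 1.0 - similarity < max_deviation:
            synonyms.append((similarity, entity))
    if not synonyms:
        return None  # no acceptable replacement found
    return max(synonyms)[1]


# Hypothetical two-dimensional word embeddings.
word_vectors = {
    "automobile": [0.9, 0.1],
    "car": [1.0, 0.0],
    "banana": [0.0, 1.0],
}
graph_entities = {"car", "banana"}
replacement = resolve_keyword("automobile", graph_entities, word_vectors, max_deviation=0.2)
```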
S104, acquiring a subject term set of each document in the document set, acquiring a corresponding embedding vector of each subject term, and calculating an average vector of the subject terms according to the embedding vectors of the subject terms;
In this step, the embedded vectors of the subject terms of each document in the document set are obtained, and the average vector of the subject terms is calculated from them. Specifically, the subject term set of a document is extracted using an LDA topic model or a probabilistic latent semantic indexing topic model, and the embedded vector of each subject term is obtained with the word2vec method. LDA (Latent Dirichlet Allocation) is a document topic generation model, also called a three-layer Bayesian probability model, with a three-layer structure of words, topics and documents. word2vec is a group of related models for generating word vectors; these models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words.
In one embodiment, said calculating an average vector of said subject term according to said embedded vector of said subject term comprises:
and acquiring a word vector corresponding to the subject word in a preset word vector list, and calculating the average value of the word vectors to acquire the average vector of the subject word.
In this embodiment, the word vector corresponding to the subject word is found in a preset word vector list, and an average value of all the word vectors is calculated, that is, the average vector of the subject word. The word vector is also the embedding vector.
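The averaging described here amounts to an element-wise mean over the word vectors of a document's subject terms. The sketch below assumes the preset word vector list is a plain dictionary and skips subject terms missing from it. Both choices, like the function name and sample vectors, are illustrative assumptions.

```python
def average_vector(subject_terms, word_vectors):
    """Element-wise mean of the word vectors of a document's subject terms."""
    vectors = [word_vectors[term] for term in subject_terms if term in word_vectors]
    if not vectors:
        raise ValueError("none of the subject terms has a word vector")
    dim = len(vectors[0])
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]


# Hypothetical preset word vector list and subject terms of one document.
word_vectors = {
    "graph": [1.0, 0.0],
    "retrieval": [0.0, 1.0],
    "document": [1.0, 1.0],
}
doc_average = average_vector(["graph", "retrieval", "document"], word_vectors)
```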
S105, extracting a feature vector of the minimum semantic subgraph by using a graph convolution neural network, performing cosine similarity calculation on the feature vector and an average vector, and judging whether a calculation result is greater than a similarity threshold value;
in this step, cosine similarity calculation is performed on the feature vector and the average vector, and the obtained calculation result is compared with the similarity threshold value, so as to determine whether the corresponding document is a candidate document to be retrieved.
The specific process of extracting the feature vector of the minimum semantic subgraph by using the graph convolution neural network comprises the following steps:
defining the number of convolutional layers of the graph convolutional neural network according to the minimum semantic subgraph;
calculating the feature vector by the following formula:
Figure BDA0002918823010000081
wherein
Figure BDA0002918823010000082
A is the adjacency matrix of the nodes in the knowledge graph, D is the in-out degree matrix of the nodes in the knowledge graph, $W_0$ is a weight matrix, $\sigma$ is an activation function, $D_{ii} = \sum_j A_{ij}$, and X is the adjacency matrix of the minimum semantic subgraph.
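The formula images are not reproduced in the text, so the following is only a sketch under a stated assumption: that the layer has the common graph-convolution form $\sigma(D^{-1/2} A D^{-1/2} X W_0)$, which is consistent with the symbols defined above but is not confirmed by the source. ReLU as the activation, mean pooling of node rows into one feature vector, and all function names and sample matrices are likewise assumptions.

```python
import math


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalize_adjacency(adj):
    """Return D^{-1/2} A D^{-1/2}, where D_ii = sum_j A_ij."""
    degrees = [sum(row) for row in adj]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    n = len(adj)
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def gcn_layer(adj, features, weights):
    """One assumed graph-convolution layer with ReLU activation."""
    propagated = matmul(matmul(normalize_adjacency(adj), features), weights)
    return [[max(0.0, value) for value in row] for row in propagated]


def mean_pool(node_features):
    """Average node rows into a single graph-level feature vector (an assumption)."""
    n = len(node_features)
    return [sum(row[k] for row in node_features) / n for k in range(len(node_features[0]))]


# Hypothetical two-node subgraph; X is its adjacency matrix, as in the text.
adj = [[0.0, 1.0], [1.0, 0.0]]
weights = [[1.0, -1.0], [0.5, 2.0]]
hidden = gcn_layer(adj, adj, weights)
feature_vector = mean_pool(hidden)
```

In practice the weight matrix would be learned and the number of stacked layers chosen according to the minimum semantic subgraph, as stated above.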
And S106, if the calculation result is larger than the similarity threshold value, taking the document corresponding to the subject term set as a candidate searched document, and searching out a target searched document.
In this step, when the calculation result is greater than the similarity threshold, the document corresponding to the subject term set is taken as a candidate document to be retrieved and is placed in the candidate document set to be retrieved; and when the calculation result is smaller than the similarity threshold value, not putting the document corresponding to the subject term set into the candidate searched document set. After all the candidate searched documents are obtained, corresponding target searched documents are searched out from the candidate searched document set.
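Steps S105 and S106 together reduce to a threshold filter on cosine similarity, sketched below. Sorting the candidates by score is an addition for convenience, not something the patent states, and the function names, threshold value and vectors are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_candidates(subgraph_vector, doc_average_vectors, similarity_threshold):
    """Keep documents whose average vector is similar enough to the subgraph vector."""
    candidates = []
    for doc_id, average in doc_average_vectors.items():
        score = cosine_similarity(subgraph_vector, average)
        if score > similarity_threshold:
            candidates.append((doc_id, score))
    # Highest similarity first (ordering is an assumption of this sketch).
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates


# Hypothetical subgraph feature vector and per-document average vectors.
subgraph_vector = [1.0, 0.0]
doc_average_vectors = {
    "doc1": [0.9, 0.1],
    "doc2": [0.0, 1.0],
    "doc3": [1.0, 1.0],
}
candidates = select_candidates(subgraph_vector, doc_average_vectors, similarity_threshold=0.7)
```

Only the documents in `candidates` need to be searched in the final retrieval stage, which is where the claimed reduction in workload comes from.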
Referring to fig. 3, fig. 3 is a schematic block diagram of a graph-based document retrieval system according to an embodiment of the present invention, where the graph-based document retrieval system 200 includes:
a target knowledge graph obtaining unit 201, configured to obtain a document to be retrieved, construct a document set, identify entities in the document set and relationships between the entities by using a natural language processing technology, and construct a target knowledge graph;
a semantic distance calculating unit 202, configured to obtain embedded vectors corresponding to all entity nodes and edge nodes in the target knowledge graph, calculate an embedded vector cosine similarity corresponding to each two entity nodes in a predetermined reference knowledge graph, and calculate a semantic distance between each two entity nodes according to the embedded vector cosine similarity;
a minimum semantic subgraph obtaining unit 203, configured to obtain a search keyword set input by a user, and screen out a minimum semantic subgraph including all search keywords in the search keyword set from the target knowledge graph;
an average vector calculation unit 204, configured to obtain a subject term set of each document in the document set, obtain an embedded vector corresponding to each subject term, and calculate an average vector of the subject terms according to the embedded vector of the subject terms;
a similarity judgment unit 205, configured to extract a feature vector of the minimum semantic subgraph by using a graph convolution neural network, perform cosine similarity calculation on the feature vector and the average vector, and judge whether a calculation result is greater than a similarity threshold;
and the target retrieval document acquiring unit 206 is configured to, if the calculation result is greater than the similarity threshold, take the document corresponding to the topic word set as a candidate retrieved document, and retrieve a target retrieval document.
In one embodiment, the target knowledge-graph obtaining unit 201 includes:
the mapping table construction unit is used for constructing an entity-document mapping table according to the relation between the entity and the document;
the access count acquiring unit is used for acquiring the access count of the document over a period of time and judging whether the access count is smaller than a count threshold;
and the document deleting unit is used for deleting the corresponding document from the entity-document mapping table if the access count is smaller than the count threshold.
In an embodiment, the semantic distance calculating unit 202 includes:
a formula calculation unit, configured to calculate a semantic distance between two entity nodes by using the following formula:
Figure BDA0002918823010000091
where $\mathrm{sim}(E_i, E_j)$ is the cosine similarity of the embedded vectors of the two entity nodes, $\sum R_{E_i}$ is the sum of the embedded vectors of all edges connected to entity $E_i$, and $\sum R_{E_j}$ is the sum of the embedded vectors of all edges connected to entity $E_j$.
In an embodiment, the minimum semantic subgraph obtaining unit 203 includes:
a node acquisition unit, configured to take all the search keywords in the search keyword set as nodes;
the central node selection unit is used for selecting any node as a central node and entering a semantic subgraph selection step;
a semantic subgraph selection unit, configured to perform the semantic subgraph selection step: taking the semantic distance from the central node to the node farthest from it as the radius, forming a semantic subgraph containing all the nodes, and then adjusting the radius by successively increasing or decreasing it by a preset value to obtain the semantic subgraph that has the smallest radius and still contains all the retrieval keywords;
the central node reselecting unit is used for continuously selecting another node as a central node and entering a semantic subgraph selection step;
and the minimum semantic subgraph screening unit is used for obtaining semantic subgraphs corresponding to all the nodes and selecting the semantic subgraph with the minimum radius from the semantic subgraphs.
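The selection procedure above can be sketched as follows. This is a simplified illustration under stated assumptions: the semantic distance is supplied as a hypothetical `distance(a, b)` function, and the step-wise radius adjustment is replaced by taking the largest distance from the center to any other keyword, which is already the smallest radius that covers all keywords when exact distances are available.

```python
def minimum_semantic_subgraph(keywords, distance, all_nodes):
    """Try each keyword as the center and keep the center with the smallest radius.

    Returns (center, radius, members), where members are all graph nodes that
    lie within the radius of the chosen center.
    """
    if not keywords:
        raise ValueError("at least one retrieval keyword is required")
    best_center, best_radius = None, None
    for center in keywords:
        # Radius needed for a ball around this center to contain every keyword.
        radius = max(
            (distance(center, other) for other in keywords if other != center),
            default=0.0,
        )
        if best_radius is None or radius < best_radius:
            best_center, best_radius = center, radius
    members = {
        node
        for node in all_nodes
        if node == best_center or distance(best_center, node) <= best_radius
    }
    return best_center, best_radius, members


# Hypothetical toy graph: nodes placed on a line, distance = gap between positions.
positions = {"a": 0, "b": 1, "x": 2, "c": 3, "y": 10}
center, radius, members = minimum_semantic_subgraph(
    ["a", "b", "c"],
    lambda p, q: abs(positions[p] - positions[q]),
    positions.keys(),
)
```

In this toy case `b` is the best center with radius 2, and the subgraph picks up the non-keyword node `x` because it also lies within that radius.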
In an embodiment, the minimum semantic subgraph obtaining unit 203 includes:
the retrieval keyword acquisition unit is used for acquiring a retrieval keyword input by a user and searching for the position corresponding to the retrieval keyword in the knowledge graph;
a near-synonym set obtaining unit, configured to obtain an embedded vector of the retrieval keyword through a word embedding method if the retrieval keyword is not found in the knowledge graph, and obtain a near-synonym set of the retrieval keyword by using a cosine similarity calculation method;
and the keyword replacing unit is used for replacing the retrieval keyword with the word in the near-synonym set that is semantically most similar to it.
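The fallback for keywords missing from the knowledge graph can be sketched as below. The sketch assumes that candidate replacements are drawn from the entities already in the graph and that a hypothetical `embed` function returns a word vector; the names `resolve_keyword` and `top_k` and the toy vectors are likewise assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def resolve_keyword(keyword, graph_vectors, embed, top_k=3):
    """Return the keyword itself if it is in the graph, else its closest graph entity."""
    if keyword in graph_vectors:
        return keyword
    if not graph_vectors:
        raise ValueError("the knowledge graph has no entities to substitute")
    query = embed(keyword)
    ranked = sorted(
        graph_vectors,
        key=lambda word: cosine_similarity(query, graph_vectors[word]),
        reverse=True,
    )
    near_synonyms = ranked[:top_k]  # the near-synonym set
    return near_synonyms[0]         # the semantically closest word


# Hypothetical embeddings for graph entities and for an out-of-graph keyword.
graph_vectors = {"automobile": [1.0, 0.1], "banana": [0.0, 1.0]}
toy_embeddings = {"car": [0.9, 0.2]}
replacement = resolve_keyword("car", graph_vectors, toy_embeddings.__getitem__)
```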
In an embodiment, the similarity determining unit 205 includes:
a convolution layer number confirmation unit, configured to define the number of convolution layers of the graph convolution neural network according to the minimum semantic subgraph;
a feature vector calculation unit for calculating the feature vector by the following formula:
Figure BDA0002918823010000101
wherein
Figure BDA0002918823010000102
A is the adjacency matrix of the nodes in the knowledge graph, D is the in-out degree matrix of the nodes in the knowledge graph with D_ii = Σ_j A_ij, W_0 is a weight matrix, σ is an activation function, and X is the adjacency matrix of the minimum semantic subgraph.
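For orientation, a single graph convolution layer built from the symbols defined above can be sketched as follows. The exact formula is given only in the figures and is not reproduced here, so this sketch assumes the commonly used form σ(D^{-1/2} A D^{-1/2} X W_0) with ReLU as the activation; both choices, and all function names, are assumptions. For simplicity the toy example uses matrices of the same node count for A and X.

```python
import math


def matmul(p, q):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
        for i in range(len(p))
    ]


def normalize_adjacency(a):
    """Return D^{-1/2} A D^{-1/2} with D_ii = sum_j A_ij (isolated nodes stay zero)."""
    degrees = [sum(row) for row in a]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    n = len(a)
    return [[inv_sqrt[i] * a[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def gcn_layer(a, x, w):
    """One layer: ReLU(normalized A times X times W). ReLU is an assumed activation."""
    z = matmul(matmul(normalize_adjacency(a), x), w)
    return [[max(0.0, value) for value in row] for row in z]


# Hypothetical two-node graph with a single edge.
a = [[0.0, 1.0], [1.0, 0.0]]   # adjacency matrix A
x = [[1.0, 0.0], [0.0, 1.0]]   # input matrix X
w = [[1.0], [-1.0]]            # weight matrix W_0
h = gcn_layer(a, x, w)
```

A feature vector for the whole subgraph would still need a pooling step over the rows of `h`, which the text does not specify.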
In one embodiment, the average vector calculation unit 204 includes:
and the word vector average value calculating unit is used for acquiring the word vector corresponding to each subject term from a preset word vector list and calculating the average of the word vectors to obtain the average vector of the subject terms.
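The averaging step can be sketched as below. The preset word vector list is modelled as a plain dictionary, and skipping subject terms that have no entry in it is an assumption of this sketch rather than something the text states.

```python
def average_vector(subject_terms, word_vectors):
    """Component-wise mean of the word vectors of the given subject terms."""
    # Assumption: terms missing from the preset list are ignored.
    vectors = [word_vectors[term] for term in subject_terms if term in word_vectors]
    if not vectors:
        raise ValueError("none of the subject terms has a word vector")
    return [sum(components) / len(vectors) for components in zip(*vectors)]


# Hypothetical preset word vector list.
word_vectors = {"graph": [1.0, 0.0], "retrieval": [0.0, 1.0]}
avg = average_vector(["graph", "retrieval", "unknown"], word_vectors)
```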
The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the graph-based document retrieval method is implemented.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the graph-based document retrieval method is implemented.
The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method. It should be noted that those skilled in the art may make various improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present invention.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A graph-based document retrieval method, comprising:
acquiring a document to be retrieved, constructing a document set, identifying entities in the document set and relations between the entities by using a natural language processing technology, and constructing a target knowledge graph;
acquiring embedded vectors corresponding to all entity nodes and edge nodes in the target knowledge graph, calculating the cosine similarity of the embedded vectors corresponding to every two entity nodes in a preset reference knowledge graph, and calculating the semantic distance between every two entity nodes according to the cosine similarity of the embedded vectors;
acquiring a retrieval keyword set input by a user, and screening out a minimum semantic sub-graph containing all retrieval keywords in the retrieval keyword set from the target knowledge graph;
acquiring a subject term set of each document in the document set, acquiring a corresponding embedded vector of each subject term, and calculating an average vector of the subject terms according to the embedded vectors of the subject terms;
extracting a feature vector of the minimum semantic subgraph by using a graph convolution neural network, performing cosine similarity calculation on the feature vector and an average vector, and judging whether a calculation result is greater than a similarity threshold value;
and if the calculation result is greater than the similarity threshold, taking the document corresponding to the subject term set as a candidate retrieved document, and retrieving the target retrieval document.
2. The graph-based document retrieval method of claim 1, wherein the identifying entities and entity-to-entity relationships in the document collection using natural language processing techniques comprises:
constructing an entity-document mapping table according to the relation between the entity and the document;
acquiring the number of times the document is accessed within a period of time, and judging whether the number of accesses is smaller than a count threshold;
and if the number of accesses is smaller than the count threshold, deleting the corresponding document from the entity-document mapping table.
3. The graph-based document retrieval method of claim 1, wherein the obtaining of the embedded vectors corresponding to all entity nodes and edge nodes in the target knowledge graph, calculating cosine similarity of the embedded vectors corresponding to every two entity nodes in a predetermined reference knowledge graph, and calculating semantic distance between every two entity nodes according to the cosine similarity of the embedded vectors comprises:
the semantic distance between two entity nodes is calculated by the following formula:
Figure FDA0002918823000000011
wherein sim(E_i, E_j) is the cosine similarity of the embedded vectors of the two entity nodes, ΣR_{E_i} is the sum of the embedded vectors of all edges connected to entity E_i, and ΣR_{E_j} is the sum of the embedded vectors of all edges connected to entity E_j.
4. The graph-based document retrieval method of claim 1, wherein the obtaining of the set of retrieval keywords input by the user and the screening of the target knowledge-graph to the smallest semantic subgraph containing all the retrieval keywords in the set of retrieval keywords comprises:
taking all the retrieval keywords in the retrieval keyword set as nodes;
firstly, selecting any node as a central node, and entering a semantic subgraph selection step;
semantic subgraph selection: taking the semantic distance from the central node to the other node farthest from it as the radius to form a semantic subgraph containing all the nodes, and adjusting the radius of the semantic subgraph by successively increasing or decreasing it by a preset value to obtain the semantic subgraph which has the smallest radius and contains all the retrieval keywords;
then selecting another node as the central node, and entering the semantic subgraph selection step again;
and obtaining semantic subgraphs corresponding to all the nodes, and selecting the semantic subgraph with the minimum radius from the semantic subgraphs.
5. The graph-based document retrieval method of claim 1, wherein the obtaining of the set of retrieval keywords input by the user comprises:
acquiring a retrieval keyword input by a user, and searching a position corresponding to the retrieval keyword in the knowledge graph;
if the retrieval keyword is not found in the knowledge graph, obtaining an embedded vector of the retrieval keyword through a word embedding method, and obtaining a near-synonym set of the retrieval keyword by using a cosine similarity calculation method;
and replacing the retrieval keyword with the word in the near-synonym set that is semantically most similar to it.
6. The graph-based document retrieval method of claim 1, wherein the extracting the feature vector of the minimal semantic subgraph by using a graph convolutional neural network comprises:
defining the number of convolutional layers of the graph convolutional neural network according to the minimum semantic subgraph;
calculating the feature vector by the following formula:
Figure FDA0002918823000000021
wherein
Figure FDA0002918823000000022
A is the adjacency matrix of the nodes in the knowledge graph, D is the in-out degree matrix of the nodes in the knowledge graph with D_ii = Σ_j A_ij, W_0 is a weight matrix, σ is an activation function, and X is the adjacency matrix of the minimum semantic subgraph.
7. The graph-based document retrieval method of claim 1, wherein the computing an average vector of the subject term from the embedded vector of the subject term comprises:
acquiring the word vector corresponding to each subject term from a preset word vector list, and calculating the average of the word vectors to obtain the average vector of the subject terms.
8. A graph-based document retrieval system, comprising:
the target knowledge graph acquisition unit is used for acquiring a document to be retrieved, constructing a document set, identifying entities in the document set and relations between the entities by using a natural language processing technology, and constructing a target knowledge graph;
the semantic distance calculating unit is used for acquiring embedded vectors corresponding to all entity nodes and edge nodes in the target knowledge graph, calculating the cosine similarity of the embedded vectors corresponding to every two entity nodes in a preset reference knowledge graph, and calculating the semantic distance between every two entity nodes according to the cosine similarity of the embedded vectors;
the minimum semantic subgraph acquisition unit is used for acquiring a retrieval keyword set input by a user and screening out minimum semantic subgraphs containing all retrieval keywords in the retrieval keyword set from the target knowledge graph;
the average vector calculation unit is used for acquiring a subject term set of each document in the document set, acquiring a corresponding embedded vector of each subject term, and calculating an average vector of the subject terms according to the embedded vectors of the subject terms;
the similarity judging unit is used for extracting the feature vector of the minimum semantic subgraph by using a graph convolution neural network, calculating the cosine similarity of the feature vector and the average vector and judging whether the calculation result is greater than a similarity threshold value;
and the target retrieval document acquisition unit is used for taking the document corresponding to the subject term set as a candidate retrieved document and retrieving a target retrieval document if the calculation result is greater than the similarity threshold value.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the graph-based document retrieval method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the graph-based document retrieval method according to any one of claims 1 to 7.
CN202110110581.4A 2021-01-27 2021-01-27 Graph-based document retrieval method, system and related components thereof Pending CN112836029A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110110581.4A CN112836029A (en) 2021-01-27 2021-01-27 Graph-based document retrieval method, system and related components thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110110581.4A CN112836029A (en) 2021-01-27 2021-01-27 Graph-based document retrieval method, system and related components thereof

Publications (1)

Publication Number Publication Date
CN112836029A true CN112836029A (en) 2021-05-25

Family

ID=75931765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110110581.4A Pending CN112836029A (en) 2021-01-27 2021-01-27 Graph-based document retrieval method, system and related components thereof

Country Status (1)

Country Link
CN (1) CN112836029A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114417845A (en) * 2022-03-30 2022-04-29 支付宝(杭州)信息技术有限公司 Identical entity identification method and system based on knowledge graph
CN116433799A (en) * 2023-06-14 2023-07-14 安徽思高智能科技有限公司 Flow chart generation method and device based on semantic similarity and sub-graph matching
CN116433799B (en) * 2023-06-14 2023-08-25 安徽思高智能科技有限公司 Flow chart generation method and device based on semantic similarity and sub-graph matching
CN116975314A (en) * 2023-09-25 2023-10-31 浙江星汉信息技术股份有限公司 Intelligent query method and system for electronic files
CN116975314B (en) * 2023-09-25 2023-12-22 浙江星汉信息技术股份有限公司 Intelligent query method and system for electronic files
CN117112736A (en) * 2023-10-24 2023-11-24 云南瀚文科技有限公司 Information retrieval analysis method and system based on semantic analysis model
CN117112736B (en) * 2023-10-24 2024-01-05 云南瀚文科技有限公司 Information retrieval analysis method and system based on semantic analysis model

Similar Documents

Publication Publication Date Title
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN106844658B (en) Automatic construction method and system of Chinese text knowledge graph
CN106649818B (en) Application search intention identification method and device, application search method and server
JP5391633B2 (en) Term recommendation to define the ontology space
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN104239513B (en) A kind of semantic retrieving method of domain-oriented data
CN112836029A (en) Graph-based document retrieval method, system and related components thereof
CN108280114B (en) Deep learning-based user literature reading interest analysis method
CN111190997B (en) Question-answering system implementation method using neural network and machine learning ordering algorithm
CN107577785A (en) A kind of level multi-tag sorting technique suitable for law identification
US20090292685A1 (en) Video search re-ranking via multi-graph propagation
Grabski et al. Sentence completion
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
JP2009093651A (en) Modeling topics using statistical distribution
JP2009093650A (en) Selection of tag for document by paragraph analysis of document
WO2014054052A2 (en) Context based co-operative learning system and method for representing thematic relationships
CN108090178B (en) Text data analysis method, text data analysis device, server and storage medium
Landthaler et al. Extending Full Text Search for Legal Document Collections Using Word Embeddings.
Mahdabi et al. The effect of citation analysis on query expansion for patent retrieval
CN112115716A (en) Service discovery method, system and equipment based on multi-dimensional word vector context matching
CN110674301A (en) Emotional tendency prediction method, device and system and storage medium
CN116501875A (en) Document processing method and system based on natural language and knowledge graph
CN113761192B (en) Text processing method, text processing device and text processing equipment
CN112307364B (en) Character representation-oriented news text place extraction method
CN111813916A (en) Intelligent question and answer method, device, computer equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Wang Wei

Inventor after: Huang Yongqi

Inventor after: Yu Cuicui

Inventor after: Zhang Qian

Inventor before: Wang Wei

CB03 Change of inventor or designer information