CN112836029A

CN112836029A - Graph-based document retrieval method, system and related components thereof

Info

Publication number: CN112836029A
Application number: CN202110110581.4A
Authority: CN
Inventors: 王伟
Original assignee: Runlian Software System Shenzhen Co Ltd
Current assignee: Runlian Software System Shenzhen Co Ltd
Priority date: 2021-01-27
Filing date: 2021-01-27
Publication date: 2021-05-25

Abstract

The invention discloses a graph-based document retrieval method, system and related components. The method comprises: acquiring documents to be retrieved to construct a target knowledge graph, acquiring the embedded vectors of the entity nodes and edge nodes in the target knowledge graph, and calculating the semantic distance between every two entity nodes; acquiring a minimum semantic subgraph containing all retrieval keywords input by a user; acquiring a topic word set for each document in a document set and calculating its average vector; and acquiring the feature vector of the minimum semantic subgraph, calculating the cosine similarity between the feature vector and the average vector, taking each document whose result is larger than a similarity threshold as a candidate retrieved document, and retrieving the target document from the candidates. Because candidate documents are screened by the cosine similarity between the subgraph feature vector and the average vectors of the documents before the target document is retrieved, the whole calculation is simple and effective, the target is clear, and the retrieval cost is reduced.

Description

Graph-based document retrieval method, system and related components thereof

Technical Field

The invention relates to the technical field of document retrieval, in particular to a document retrieval method and system based on a graph and related components thereof.

Background

People often need to obtain information through retrieval. The prevailing approach is keyword-based retrieval over the document set to be retrieved, which suffers from two prominent problems when users choose keywords: "lack" and "error". "Lack" means that the user does not know the subject well and therefore cannot supply more specific keywords; "error" means that the keywords the user supplies do not actually reflect the features of the subject. Current mainstream information retrieval approaches fall roughly into the following categories:

1. retrieval schemes such as TF-IDF (term frequency-inverse document frequency, a common weighting technique in information retrieval and data mining) and BM25 (an algorithm for scoring the relevance between search terms and documents), as adopted by traditional full-text search engines such as Elasticsearch and Solr;

2. training a deep neural network on a large amount of labeled data to establish semantic associations between questions and answers, so that the start position of the answer can be located in the document.

The first approach often returns too many results, so considerable effort is still needed to screen them for the desired information. The second approach requires substantial resources to collect and curate training data, and is therefore costly.

Disclosure of Invention

The embodiment of the invention provides a graph-based document retrieval method, a graph-based document retrieval system and related components thereof, and aims to solve the problems of excessive retrieval results, time consumption and high cost in the prior art.

In a first aspect, an embodiment of the present invention provides a graph-based document retrieval method, which includes:

acquiring a document to be retrieved, constructing a document set, identifying entities in the document set and relations between the entities by using a natural language processing technology, and constructing a target knowledge graph;

acquiring embedded vectors corresponding to all entity nodes and edge nodes in the target knowledge graph, calculating the cosine similarity of the embedded vectors corresponding to every two entity nodes in a preset reference knowledge graph, and calculating the semantic distance between every two entity nodes according to the cosine similarity of the embedded vectors;

acquiring a retrieval keyword set input by a user, and screening out a minimum semantic sub-graph containing all retrieval keywords in the retrieval keyword set from the target knowledge graph;

acquiring a subject term set of each document in the document set, acquiring a corresponding embedded vector of each subject term, and calculating an average vector of the subject terms according to the embedded vectors of the subject terms;

extracting a feature vector of the minimum semantic subgraph by using a graph convolution neural network, performing cosine similarity calculation on the feature vector and an average vector, and judging whether a calculation result is greater than a similarity threshold value;

and if the calculation result is larger than the similarity threshold value, taking the document corresponding to the subject term set as a candidate searched document, and searching out a target searched document.

In a second aspect, an embodiment of the present invention provides a graph-based document retrieval system, which includes:

the target knowledge graph acquisition unit is used for acquiring a document to be retrieved, constructing a document set, identifying entities in the document set and relations between the entities by using a natural language processing technology, and constructing a target knowledge graph;

the semantic distance calculating unit is used for acquiring embedded vectors corresponding to all entity nodes and edge nodes in the target knowledge graph, calculating the cosine similarity of the embedded vectors corresponding to every two entity nodes in a preset reference knowledge graph, and calculating the semantic distance between every two entity nodes according to the cosine similarity of the embedded vectors;

the minimum semantic subgraph acquisition unit is used for acquiring a retrieval keyword set input by a user and screening out minimum semantic subgraphs containing all retrieval keywords in the retrieval keyword set from the target knowledge graph;

the average vector calculation unit is used for acquiring a subject term set of each document in the document set, acquiring a corresponding embedded vector of each subject term, and calculating an average vector of the subject terms according to the embedded vectors of the subject terms;

the similarity judging unit is used for extracting the feature vector of the minimum semantic subgraph by using a graph convolution neural network, calculating the cosine similarity of the feature vector and the average vector and judging whether the calculation result is greater than a similarity threshold value;

and the target retrieval document acquisition unit is used for taking the document corresponding to the subject term set as a candidate retrieved document and retrieving a target retrieval document if the calculation result is greater than the similarity threshold value.

In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the processor implements the graph-based document retrieval method described in the first aspect.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the graph-based document retrieval method according to the first aspect.

The embodiment of the invention provides a graph-based document retrieval method, a graph-based document retrieval system and related components thereof. The method comprises: acquiring a document to be retrieved, constructing a document set, identifying entities in the document set and relations between the entities by using a natural language processing technology, and constructing a target knowledge graph; acquiring embedded vectors corresponding to all entity nodes and edge nodes in the target knowledge graph, calculating the cosine similarity of the embedded vectors corresponding to every two entity nodes in a preset reference knowledge graph, and calculating the semantic distance between every two entity nodes according to the cosine similarity of the embedded vectors; acquiring a retrieval keyword set input by a user, and screening out a minimum semantic subgraph containing all retrieval keywords in the retrieval keyword set from the target knowledge graph; acquiring a subject term set of each document in the document set, acquiring a corresponding embedded vector of each subject term, and calculating an average vector of the subject terms according to the embedded vectors of the subject terms; extracting a feature vector of the minimum semantic subgraph by using a graph convolution neural network, performing cosine similarity calculation on the feature vector and the average vector, and judging whether the calculation result is greater than a similarity threshold value; and if the calculation result is larger than the similarity threshold value, taking the document corresponding to the subject term set as a candidate retrieved document, and retrieving a target document.
The embodiment of the invention calculates the cosine similarity between the feature vector of the minimum semantic subgraph and the average vector of each document, screens out the candidate retrieved documents according to the calculation result, and retrieves the target document directly from the candidates. The whole calculation is simple and effective, the target is clear, and the retrieval cost is reduced. The invention reduces the number of irrelevant documents in the document set to be retrieved, thereby reducing the workload of subsequent retrieval.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below illustrate some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart illustrating a graph-based document retrieval method according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating the minimum semantic subgraph screening of the graph-based document retrieval method according to the embodiment of the present invention;

FIG. 3 is a schematic block diagram of a graph-based document retrieval system provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

Referring to FIG. 1, FIG. 1 is a schematic flowchart of a graph-based document retrieval method according to an embodiment of the present invention, where the method includes steps S101 to S106.

S101, obtaining a document to be retrieved, constructing a document set, identifying entities in the document set and relations between the entities by using a natural language processing technology, and constructing a target knowledge graph;

In this step, all documents to be retrieved are acquired to construct a document set, and the entities in the document set and the relationships between them are then identified using natural language processing techniques such as word segmentation, named entity recognition and relation extraction. Word segmentation splits each sentence of a document into words; named entity recognition identifies the entities with specific meanings among the segmented words; and relation extraction obtains the relationships between those entities. The target knowledge graph is then constructed from the entities and the relationships between them.
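As a rough illustration of the last part of this step, the sketch below assembles a knowledge graph from (head, relation, tail) triples. It assumes the triples have already been produced by segmentation, named entity recognition and relation extraction; the function name `build_knowledge_graph` and the sample triples are hypothetical and do not come from the source.

```python
from collections import defaultdict


def build_knowledge_graph(triples):
    """Build an adjacency map {entity: [(relation, neighbour), ...]} from triples."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
        # Store the reverse direction too, so the graph can be walked both ways.
        graph[tail].append((relation, head))
    return dict(graph)


# Hypothetical triples standing in for the output of relation extraction.
triples = [
    ("graph convolution", "is_a", "neural network"),
    ("knowledge graph", "used_by", "document retrieval"),
    ("graph convolution", "applied_to", "knowledge graph"),
]
kg = build_knowledge_graph(triples)
print(sorted(kg))
```

Storing each edge in both directions is a design choice of this sketch; a directed representation would work equally well if relation direction matters downstream.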

In one embodiment, the identifying entities and entity-to-entity relationships in the set of documents using natural language processing techniques includes:

constructing an entity-document mapping table according to the relation between the entity and the document;

acquiring the number of times the document was accessed within a period of time, and judging whether the access count is smaller than a count threshold;

and if the access count is smaller than the count threshold, deleting the corresponding document from the entity-document mapping table.

In this embodiment, entities are associated with documents to construct an entity-document mapping table; documents whose access count is below the threshold are then found in the document set and deleted, and the entity-document mapping table is updated. When an entity in the target knowledge graph appears in a document, the entity and the document are considered to correspond. One entity may correspond to one document or to several, so screening is needed. The access counts of all documents over a period of time are obtained, and an LRU algorithm (least recently used; its core idea is that recently used page data is likely to be used again in the near future, whereas pages that have gone unused for a long time are likely to remain unused) is applied to delete the documents whose access count is below the count threshold from the entity-document mapping table. In this manner, when the relationship between entities and documents is looked up again, the deleted documents are no longer returned.
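The mapping table and its pruning can be sketched as follows. This is a minimal illustration that applies only the access-count threshold; the source does not detail how the LRU policy is realised, so no cache eviction logic is modelled here. The names `build_entity_document_map`, `prune_rarely_accessed` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_entity_document_map(doc_entities):
    """Map each entity to the set of documents in which it appears."""
    mapping = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            mapping[entity].add(doc_id)
    return dict(mapping)


def prune_rarely_accessed(mapping, access_counts, count_threshold):
    """Remove documents whose access count is below the threshold.

    Entities are kept even if all of their documents are removed.
    """
    return {
        entity: {doc for doc in docs if access_counts.get(doc, 0) >= count_threshold}
        for entity, docs in mapping.items()
    }


# Hypothetical documents, entities and access counts.
doc_entities = {
    "doc1": ["TransE", "knowledge graph"],
    "doc2": ["knowledge graph"],
    "doc3": ["LDA"],
}
access_counts = {"doc1": 12, "doc2": 1, "doc3": 0}

mapping = build_entity_document_map(doc_entities)
pruned = prune_rarely_accessed(mapping, access_counts, count_threshold=2)
print(pruned)
```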

S102, acquiring embedded vectors corresponding to all entity nodes and edge nodes in the target knowledge graph, calculating the cosine similarity of the embedded vectors corresponding to every two entity nodes in a preset reference knowledge graph, and calculating the semantic distance between every two entity nodes according to the cosine similarity of the embedded vectors;

in this step, all entities are set as corresponding entity nodes in the target knowledge graph, relationships between the entities are used as edge nodes, embedded vectors corresponding to the entity nodes and the edge nodes are obtained through a translation embedding algorithm (namely, a TransE algorithm), and cosine similarity of the embedded vectors corresponding to the two entity nodes is calculated in a predetermined reference knowledge graph (such as a large-scale knowledge graph rich in entities such as Hotan, Freebase, Wikipedia and the like), so that a semantic distance between the two entity nodes is obtained. In one embodiment, the semantic distance between two entity nodes is calculated by the following formula:

where $\mathrm{sim}(E_i, E_j)$ is the cosine similarity of the embedded vectors of the two entity nodes, $\sum R_{E_i}$ is the sum of the embedded vectors of all edges connected to entity $E_i$, and $\sum R_{E_j}$ is the sum of the embedded vectors of all edges connected to entity $E_j$.
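The quantities named in this step can be computed as in the sketch below, which covers only the ingredients: the cosine similarity of two entity embeddings and the sum of the edge embeddings attached to an entity. How the formula combines them into a semantic distance is not recoverable from the text here, so the sketch deliberately stops short of that. The helper names and the toy vectors are hypothetical, and the embeddings are assumed to have been produced already (for example by TransE).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention of this sketch for zero vectors.
    return dot / (norm_u * norm_v)


def sum_vectors(vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


# Toy embeddings (hypothetical values).
e_i = [1.0, 0.0]
e_j = [1.0, 1.0]
edges_of_e_i = [[0.5, 0.5], [0.5, -0.5]]

sim_ij = cosine_similarity(e_i, e_j)       # sim(E_i, E_j)
edge_sum_i = sum_vectors(edges_of_e_i)     # sum of edge embeddings of E_i
print(sim_ij, edge_sum_i)
```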

S103, acquiring a retrieval keyword set input by a user, and screening out a minimum semantic sub-graph containing all retrieval keywords in the retrieval keyword set from the target knowledge graph;

in the step, according to a retrieval keyword set input by a user, node positions of all retrieval keywords are found in the target knowledge graph, extraction is carried out, semantic subgraphs containing all the retrieval keywords are obtained, and the minimum semantic subgraph with the minimum radius is screened out.

In one embodiment, the step S103 includes:

taking all the retrieval keywords in the retrieval keyword set as nodes;

firstly, selecting any node as a central node, and entering a semantic subgraph selection step;

semantic subgraph selection: taking the semantic distance from the central node to the node farthest from it as the radius to form a semantic subgraph containing all the nodes, then adjusting the radius by successively increasing or decreasing it by a preset value to obtain the semantic subgraph that has the smallest radius and still contains all the retrieval keywords;

continuously selecting another node as a central node, and entering a semantic subgraph selection step;

and obtaining semantic subgraphs corresponding to all the nodes, and selecting the semantic subgraph with the minimum radius from the semantic subgraphs.

In this embodiment, any retrieval keyword in the set input by the user is selected as the central node, and the semantic subgraph selection step is performed to obtain a semantic subgraph. The process is repeated until every retrieval keyword has served as the central node of a corresponding semantic subgraph, and the semantic subgraph with the smallest radius is then selected from all of them as the minimum semantic subgraph. Within the selection step, each successive semantic subgraph takes the radius of the previous one increased or decreased by a predefined Δr, subject to the subgraph still containing all the retrieval keywords.

As shown in FIG. 2, the retrieval keywords input by the user are kw1, kw2 and kw3. With kw1 as the central node, the node farthest from kw1 is kw3, and the semantic distance between kw1 and kw3 is taken as the radius to obtain subgraph 1. Similarly, with kw2 as the central node, the semantic distance between kw2 and kw1 is taken as the radius to obtain subgraph 2; with kw3 as the central node, the semantic distance between kw3 and kw1 is taken as the radius to obtain subgraph 3. Subgraph 2, which has the smallest radius among subgraph 1, subgraph 2 and subgraph 3, is selected as the minimum semantic subgraph.
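The selection over central nodes can be sketched as follows. For each keyword taken as the centre, the radius is the semantic distance to the farthest other keyword, and the centre with the smallest radius wins. The sketch omits the Δr adjustment of the radius, and the pairwise distances are hypothetical values chosen only so that the outcome matches the FIG. 2 description (subgraph 2 is the smallest); the names `minimum_semantic_subgraph`, `pairwise` and `distance` are likewise assumptions.

```python
def minimum_semantic_subgraph(keywords, distance):
    """Return (center, radius) of the smallest keyword-covering subgraph."""
    best_center, best_radius = None, float("inf")
    for center in keywords:
        # Radius needed so that the subgraph around `center` covers every keyword.
        radius = max(
            (distance(center, other) for other in keywords if other != center),
            default=0.0,
        )
        if radius < best_radius:
            best_center, best_radius = center, radius
    return best_center, best_radius


# Hypothetical symmetric semantic distances between the three keywords.
pairwise = {
    frozenset(("kw1", "kw2")): 0.6,
    frozenset(("kw1", "kw3")): 0.9,
    frozenset(("kw2", "kw3")): 0.4,
}


def distance(a, b):
    return pairwise[frozenset((a, b))]


center, radius = minimum_semantic_subgraph(["kw1", "kw2", "kw3"], distance)
print(center, radius)
```

With these values the radii are 0.9 for kw1, 0.6 for kw2 and 0.9 for kw3, so kw2 is chosen as the centre.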

In a specific embodiment, the acquiring the set of search keywords input by the user includes:

acquiring a retrieval keyword input by a user, and searching a position corresponding to the retrieval keyword in the knowledge graph;

if the retrieval keyword is not found in the knowledge graph, obtaining the embedded vector of the retrieval keyword through a word embedding method, and obtaining a near-synonym set of the retrieval keyword by cosine similarity calculation;

and replacing the retrieval keyword with the word in the near-synonym set that is semantically closest to it.

In this embodiment, the retrieval keyword input by the user may be inaccurate. In that case the embedded vector of the keyword is obtained through a word embedding method, a near-synonym set is obtained by cosine similarity calculation, and the word in that set that is semantically closest to the keyword replaces it. During the cosine similarity calculation, each result is compared with a near-synonym threshold, and every word whose similarity deviation is smaller than the threshold is regarded as a near-synonym.
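A minimal sketch of this replacement step is given below. It interprets "similarity deviation" as one minus the cosine similarity, which is an assumption of the sketch rather than something the text states; the function names, the toy vectors and the threshold value are also hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def replace_unknown_keyword(keyword, keyword_vector, entity_vectors, synonym_threshold):
    """Return the keyword itself if it is in the graph, else its closest near-synonym.

    Returns None when no graph entity is close enough to count as a near-synonym.
    """
    if keyword in entity_vectors:
        return keyword
    near_synonyms = []
    for entity, vector in entity_vectors.items():
        similarity = cosine_similarity(keyword_vector, vector)
        # Assumed reading: deviation = 1 - similarity.
        if 1.0 - similarity < synonym_threshold:
            near_synonyms.append((similarity, entity))
    if not near_synonyms:
        return None
    return max(near_synonyms, key=lambda pair: pair[0])[1]


# Toy embeddings for entities that exist in the knowledge graph (hypothetical).
entity_vectors = {"automobile": [0.9, 0.1], "banana": [0.0, 1.0]}
replacement = replace_unknown_keyword("car", [1.0, 0.0], entity_vectors, 0.2)
print(replacement)
```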

S104, acquiring a subject term set of each document in the document set, acquiring a corresponding embedding vector of each subject term, and calculating an average vector of the subject terms according to the embedding vectors of the subject terms;

In this step, the embedded vectors of the subject terms of each document in the document set are obtained, and the average vector of the subject terms is calculated from them. Specifically, the topic word set of a document is extracted using an LDA topic model or a probabilistic latent semantic indexing topic model, and the embedded vector of each topic word is obtained with the word2vec method. LDA (Latent Dirichlet Allocation) is a document topic generation model, also called a three-layer Bayesian probability model, with a three-layer structure of words, topics and documents. Word2vec is a group of related models for generating word vectors; these models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words.

In one embodiment, said calculating an average vector of said subject term according to said embedded vector of said subject term comprises:

and acquiring a word vector corresponding to the subject word in a preset word vector list, and calculating the average value of the word vectors to acquire the average vector of the subject word.

In this embodiment, the word vector corresponding to the subject word is found in a preset word vector list, and an average value of all the word vectors is calculated, that is, the average vector of the subject word. The word vector is also the embedding vector.
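The averaging itself is a component-wise mean, as the sketch below shows. The preset word vector list is modelled as a plain dictionary, and topic words missing from it are skipped; both choices, along with the name `average_vector` and the toy vectors, are assumptions made for illustration.

```python
def average_vector(topic_words, word_vectors):
    """Component-wise mean of the vectors of the topic words found in the list."""
    vectors = [word_vectors[word] for word in topic_words if word in word_vectors]
    if not vectors:
        raise ValueError("none of the topic words has a word vector")
    return [sum(components) / len(vectors) for components in zip(*vectors)]


# Hypothetical preset word vector list and topic word set of one document.
word_vectors = {"graph": [1.0, 0.0], "retrieval": [3.0, 2.0]}
doc_average = average_vector(["graph", "retrieval", "unseen"], word_vectors)
print(doc_average)
```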

S105, extracting a feature vector of the minimum semantic subgraph by using a graph convolution neural network, performing cosine similarity calculation on the feature vector and an average vector, and judging whether a calculation result is greater than a similarity threshold value;

in this step, cosine similarity calculation is performed on the feature vector and the average vector, and the obtained calculation result is compared with the similarity threshold value, so as to determine whether the corresponding document is a candidate document to be retrieved.

The specific process of extracting the feature vector of the minimum semantic subgraph by using the graph convolution neural network comprises the following steps:

defining the number of convolutional layers of the graph convolutional neural network according to the minimum semantic subgraph;

calculating the feature vector by the following formula:

where $A$ is the adjacency matrix of the nodes in the knowledge graph, $D$ is the degree matrix of the nodes in the knowledge graph with $D_{ii} = \sum_j A_{ij}$, $W_0$ is a weight matrix, $\sigma$ is an activation function, and $X$ is the adjacency matrix of the minimum semantic subgraph.
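For orientation, the sketch below implements one graph convolution layer using the symbols defined above. Since the formula itself is not reproduced in the text, the sketch assumes the widely used propagation rule $\sigma(D^{-1/2} A D^{-1/2} X W_0)$ with ReLU as the activation, and it assumes mean pooling over nodes to turn the per-node outputs into a single feature vector. These are assumptions, not the patent's confirmed formula; the function names and toy matrices are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adjacency, features, weights):
    """One layer: ReLU(D^{-1/2} A D^{-1/2} X W0), with D_ii = sum_j A_ij."""
    degrees = [sum(row) for row in adjacency]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    n = len(adjacency)
    normalized = [
        [inv_sqrt[i] * adjacency[i][j] * inv_sqrt[j] for j in range(n)]
        for i in range(n)
    ]
    hidden = matmul(matmul(normalized, features), weights)
    return [[max(0.0, value) for value in row] for row in hidden]


def mean_pool(node_features):
    """Average the node rows into one graph-level feature vector (assumed readout)."""
    n = len(node_features)
    return [sum(column) / n for column in zip(*node_features)]


# Toy two-node subgraph with self-loops; X is set to the adjacency matrix,
# following the definition of X given in the text.
A = [[1.0, 1.0], [1.0, 1.0]]
X = A
W0 = [[2.0, -1.0], [4.0, -3.0]]  # Hypothetical weights; normally learned.

feature_vector = mean_pool(gcn_layer(A, X, W0))
print(feature_vector)
```

Stacking several such layers, as suggested by the step that defines the number of convolutional layers, would amount to feeding the output of one call to `gcn_layer` in as the `features` of the next.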

And S106, if the calculation result is larger than the similarity threshold value, taking the document corresponding to the subject term set as a candidate searched document, and searching out a target searched document.

In this step, when the calculation result is greater than the similarity threshold, the document corresponding to the subject term set is taken as a candidate document to be retrieved and is placed in the candidate document set to be retrieved; and when the calculation result is smaller than the similarity threshold value, not putting the document corresponding to the subject term set into the candidate searched document set. After all the candidate searched documents are obtained, corresponding target searched documents are searched out from the candidate searched document set.
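Steps S105 and S106 together amount to a threshold filter on cosine similarity, sketched below. The subgraph feature vector and the per-document average vectors are assumed to have the same dimension; sorting the candidates by score is a convenience of this sketch and is not stated in the source. All names, vectors and the threshold value are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_candidates(subgraph_vector, doc_average_vectors, similarity_threshold):
    """Keep documents whose similarity to the subgraph vector exceeds the threshold."""
    candidates = []
    for doc_id, average in doc_average_vectors.items():
        score = cosine_similarity(subgraph_vector, average)
        if score > similarity_threshold:
            candidates.append((doc_id, score))
    # Highest similarity first (an addition of this sketch).
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates


# Hypothetical feature vector and per-document average vectors.
subgraph_vector = [1.0, 0.0]
doc_average_vectors = {
    "doc_a": [2.0, 0.0],
    "doc_b": [1.0, 1.0],
    "doc_c": [0.0, 3.0],
}
candidates = select_candidates(subgraph_vector, doc_average_vectors, 0.7)
print([doc_id for doc_id, _ in candidates])
```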

Referring to FIG. 3, FIG. 3 is a schematic block diagram of a graph-based document retrieval system according to an embodiment of the present invention, where the graph-based document retrieval system 200 includes:

a target knowledge graph obtaining unit 201, configured to obtain a document to be retrieved, construct a document set, identify entities in the document set and relationships between the entities by using a natural language processing technology, and construct a target knowledge graph;

a semantic distance calculating unit 202, configured to obtain embedded vectors corresponding to all entity nodes and edge nodes in the target knowledge graph, calculate an embedded vector cosine similarity corresponding to each two entity nodes in a predetermined reference knowledge graph, and calculate a semantic distance between each two entity nodes according to the embedded vector cosine similarity;

a minimum semantic subgraph obtaining unit 203, configured to obtain a search keyword set input by a user, and screen out a minimum semantic subgraph including all search keywords in the search keyword set from the target knowledge graph;

an average vector calculation unit 204, configured to obtain a subject term set of each document in the document set, obtain an embedded vector corresponding to each subject term, and calculate an average vector of the subject terms according to the embedded vector of the subject terms;

a similarity judgment unit 205, configured to extract a feature vector of the minimum semantic subgraph by using a graph convolution neural network, perform cosine similarity calculation on the feature vector and the average vector, and judge whether a calculation result is greater than a similarity threshold;

and the target retrieval document acquiring unit 206 is configured to, if the calculation result is greater than the similarity threshold, take the document corresponding to the topic word set as a candidate retrieved document, and retrieve a target retrieval document.

In one embodiment, the target knowledge-graph obtaining unit 201 includes:

the mapping table construction unit is used for constructing an entity-document mapping table according to the relation between the entity and the document;

the access count acquiring unit is used for acquiring the number of times the document was accessed within a period of time and judging whether the access count is smaller than a count threshold;

and the document deleting unit is used for deleting the corresponding document from the entity-document mapping table if the access count is smaller than the count threshold.

In an embodiment, the semantic distance calculating unit 202 includes:

a formula calculation unit, configured to calculate a semantic distance between two entity nodes by using the following formula:

where $\mathrm{sim}(E_i, E_j)$ is the cosine similarity of the embedded vectors of the two entity nodes, $\sum R_{E_i}$ is the sum of the embedded vectors of all edges connected to entity $E_i$, and $\sum R_{E_j}$ is the sum of the embedded vectors of all edges connected to entity $E_j$.

In an embodiment, the minimum semantic subgraph obtaining unit 203 includes:

a node acquisition unit, configured to take all the search keywords in the search keyword set as nodes;

the central node selection unit is used for selecting any node as a central node and entering a semantic subgraph selection step;

a semantic subgraph selection unit used for the semantic subgraph selection step: taking the semantic distance between the other node which is farthest away from the central node as a radius to form a semantic sub-graph containing all the nodes, adjusting the radius of the semantic sub-graph, and sequentially increasing or decreasing preset numerical values to obtain the semantic sub-graph which has the smallest radius and contains all the retrieval keywords;

the central node reselecting unit is used for continuously selecting another node as a central node and entering a semantic subgraph selection step;

and the minimum semantic subgraph screening unit is used for obtaining semantic subgraphs corresponding to all the nodes and selecting the semantic subgraph with the minimum radius from the semantic subgraphs.

In an embodiment, the minimum semantic subgraph obtaining unit 203 includes:

the retrieval key word acquisition unit is used for acquiring retrieval key words input by a user and searching corresponding positions of the retrieval key words in the knowledge graph;

a near meaning word set obtaining unit, configured to obtain an embedded vector of the search keyword through a word embedding method if the corresponding search keyword is not found in the knowledge graph, and obtain a near meaning word set of the search keyword by using a cosine similarity calculation method;

and the keyword replacing unit is used for replacing the search keyword by using the word with the most similar semantic meaning to the search keyword in the similar meaning word set.

In an embodiment, the similarity determining unit 205 includes:

a convolution layer number confirmation unit, configured to define the number of convolution layers of the graph convolution neural network according to the minimum semantic subgraph;

a feature vector calculation unit for calculating the feature vector by the following formula:

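where $A$ is the adjacency matrix of the nodes in the knowledge graph, $D$ is the degree matrix of the nodes in the knowledge graph with $D_{ii} = \sum_j A_{ij}$, $W_0$ is a weight matrix, $\sigma$ is an activation function, and $X$ is the adjacency matrix of the minimum semantic subgraph.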

In one embodiment, the average vector calculation unit 204 includes:

and the word vector average value calculating unit is used for acquiring the corresponding word vectors of the subject words in a preset word vector list, calculating the average value of the word vectors and acquiring the average vector of the subject words.

The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the graph-based document retrieval method is implemented.

An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the graph-based document retrieval method is implemented.

The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A graph-based document retrieval method, comprising:

2. The graph-based document retrieval method of claim 1, wherein the identifying entities and entity-to-entity relationships in the document collection using natural language processing techniques comprises:

3. The graph-based document retrieval method of claim 1, wherein the obtaining of the embedded vectors corresponding to all entity nodes and edge nodes in the target knowledge graph, calculating cosine similarity of the embedded vectors corresponding to every two entity nodes in a predetermined reference knowledge graph, and calculating semantic distance between every two entity nodes according to the cosine similarity of the embedded vectors comprises:

the semantic distance between two entity nodes is calculated by the following formula:

where $\mathrm{sim}(E_i, E_j)$ is the cosine similarity of the embedded vectors of the two entity nodes, $\sum R_{E_i}$ is the sum of the embedded vectors of all edges connected to entity $E_i$, and $\sum R_{E_j}$ is the sum of the embedded vectors of all edges connected to entity $E_j$.

4. The graph-based document retrieval method of claim 1, wherein the obtaining of the set of retrieval keywords input by the user and the screening of the target knowledge-graph to the smallest semantic subgraph containing all the retrieval keywords in the set of retrieval keywords comprises:

taking all the retrieval keywords in the retrieval keyword set as nodes;

5. The graph-based document retrieval method of claim 1, wherein the obtaining of the set of retrieval keywords input by the user comprises:

6. The graph-based document retrieval method of claim 1, wherein the extracting the feature vector of the minimal semantic subgraph by using a graph convolutional neural network comprises:

calculating the feature vector by the following formula:

wherein

7. The graph-based document retrieval method of claim 1, wherein the computing an average vector of the subject term from the embedded vector of the subject term comprises:

8. A graph-based document retrieval system, comprising:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the graph-based document retrieval method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the graph-based document retrieval method according to any one of claims 1 to 7.