Disclosure of Invention
An objective of the embodiments of the present application is to provide a knowledge-graph-based document retrieval method and related apparatus, so as to solve the technical problem of low document retrieval efficiency.
To solve the above technical problem, an embodiment of the present application provides a knowledge-graph-based document retrieval method, which adopts the following technical scheme:
acquiring a document set to be retrieved, performing word segmentation on the documents in the document set to be retrieved to obtain a document word-segmentation set, and constructing a target knowledge graph based on the document word-segmentation set;
when a plurality of search keywords are received, calculating semantic distances among the search keywords based on the target knowledge graph, and determining the two search keywords corresponding to the maximum semantic distance as two center keywords;
constructing a first sub-graph and a second sub-graph based on the two center keywords respectively, counting the number of nodes in the first sub-graph and the second sub-graph respectively, and selecting the maximum semantic sub-graph from the first sub-graph and the second sub-graph according to the node numbers;
acquiring a preset graph convolutional neural network, and performing feature extraction on the maximum semantic sub-graph based on the graph convolutional neural network to obtain a feature vector;
extracting the subject words of each document in the document set to be retrieved, and calculating topic embedding vectors of the subject words;
and calculating the vector similarity between each topic embedding vector and the feature vector, determining the topic embedding vectors whose vector similarity is greater than or equal to a preset similarity threshold as target embedding vectors, and taking the documents corresponding to the target embedding vectors as target retrieval documents.
Further, the step of calculating the semantic distance between the search keywords based on the target knowledge graph includes:
acquiring a reference knowledge graph corresponding to the target knowledge graph, and determining the distance weights among the search keywords according to the reference knowledge graph;
and calculating the embedding similarity among the search keywords and the sum of the embedding vectors of the edge nodes of each search keyword in the target knowledge graph, and calculating the semantic distances among the search keywords according to the distance weights, the embedding similarity, and the sums of the embedding vectors.
Further, the step of determining the distance weight between the search keywords according to the reference knowledge graph includes:
obtaining the category attribute and level of each search keyword in the reference knowledge graph, and determining the distance weights among the search keywords according to the category attributes and levels.
Further, the step of determining the distance weight between the search keywords according to the category attribute and the hierarchy includes:
judging whether the category attributes of the search keywords are the same and whether the levels of the search keywords are the same, and determining the distance weight between the search keywords to be a preset weight when both the category attributes and the levels are the same;
and when the category attributes are different, or the category attributes are the same but the levels are different, acquiring the common superior entity of the search keywords in the reference knowledge graph, calculating the level distances between the superior entity and the search keywords, and calculating the distance weight between the search keywords according to the level distances.
Further, the step of extracting the subject term of each document in the document set to be retrieved includes:
acquiring the word count of each document, and sorting the documents in ascending order of word count to obtain a document queue;
acquiring the number of keywords corresponding to the lowest-ranked document in the document queue, taking that number as a lowest threshold, and, starting from the lowest threshold, increasing the number of keywords of the other documents one by one in the order of the document queue until a preset maximum threshold is reached;
and extracting the subject words of the documents in the document queue in sequence, with counts running from the lowest threshold to the maximum threshold.
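The ascending keyword-count schedule described above can be sketched as follows; this is an illustrative fragment, assuming the per-document word counts and the two thresholds are given as inputs (the names `min_threshold` and `max_threshold` are invented for the sketch):

```python
def keyword_counts(word_counts, min_threshold, max_threshold):
    """Sort documents by word count (ascending) and assign each a number
    of subject words to extract, ramping up from min_threshold by one per
    queue position and capping at max_threshold."""
    queue = sorted(range(len(word_counts)), key=lambda i: word_counts[i])
    counts = {}
    for position, doc_id in enumerate(queue):
        counts[doc_id] = min(min_threshold + position, max_threshold)
    return queue, counts

# Example: four documents with 120, 40, 300 and 80 words respectively.
queue, counts = keyword_counts([120, 40, 300, 80], min_threshold=2, max_threshold=4)
# The shortest document (index 1) gets 2 keywords; counts then increase
# by one per position until the cap of 4 is reached.
```

The cap keeps long documents from dominating the subject-word extraction while short documents still contribute at least the lowest threshold.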
Further, the step of extracting features of the maximum semantic subgraph based on the graph convolution neural network to obtain feature vectors includes:
calculating the adjacency matrix and the degree matrix of the maximum semantic sub-graph;
and acquiring a preset weight matrix, and calculating the feature vector through the graph convolutional neural network according to the weight matrix, the adjacency matrix, and the degree matrix.
Further, before the step of calculating the semantic distance between the search keywords based on the target knowledge graph, the method further includes:
searching the target knowledge graph, and determining whether each search keyword exists in the target knowledge graph;
when a search keyword does not exist in the target knowledge graph, acquiring a preset pre-trained language model, inputting the search keyword and the words in the document word-segmentation set into the pre-trained language model respectively, and calculating a first characterization vector and second characterization vectors;
and calculating the word similarity between the search keyword and each segmented word according to the first characterization vector and the second characterization vectors, determining the segmented words whose word similarity is greater than or equal to a preset similarity as candidate keywords, and replacing the search keyword with a candidate keyword.
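The fallback replacement of an out-of-graph search keyword can be sketched as below. The 3-dimensional "characterization vectors" are toy stand-ins for the pre-trained language model's outputs, and the words and threshold are invented examples:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def replace_missing_keyword(query_vec, segment_vecs, threshold):
    """Return the segmented word most similar to the missing search
    keyword, among those whose similarity meets the threshold."""
    candidates = {}
    for word, vec in segment_vecs.items():
        score = cosine(query_vec, vec)
        if score >= threshold:
            candidates[word] = score
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Toy vectors standing in for the first and second characterization vectors.
vecs = {"automobile": [0.9, 0.1, 0.0], "banana": [0.0, 0.9, 0.4]}
best = replace_missing_keyword([1.0, 0.2, 0.0], vecs, threshold=0.8)
```

Here "automobile" clears the threshold and replaces the missing keyword, while "banana" is rejected.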
To solve the above technical problem, an embodiment of the present application further provides a knowledge-graph-based document retrieval apparatus, which adopts the following technical scheme:
a construction module, used for acquiring a document set to be retrieved, performing word segmentation on the documents in the document set to be retrieved to obtain a document word-segmentation set, and constructing a target knowledge graph based on the document word-segmentation set;
a first calculation module, used for calculating, when a plurality of search keywords are received, semantic distances among the search keywords based on the target knowledge graph, and determining the two search keywords corresponding to the maximum semantic distance as two center keywords;
a selection module, used for constructing a first sub-graph and a second sub-graph based on the two center keywords respectively, counting the number of nodes in the first sub-graph and the second sub-graph respectively, and selecting the maximum semantic sub-graph from the first sub-graph and the second sub-graph according to the node numbers;
a second calculation module, used for acquiring a preset graph convolutional neural network, and performing feature extraction on the maximum semantic sub-graph based on the graph convolutional neural network to obtain a feature vector;
an extraction module, used for extracting the subject words of each document in the document set to be retrieved, and calculating topic embedding vectors of the subject words;
and a confirmation module, used for calculating the vector similarity between each topic embedding vector and the feature vector, determining the topic embedding vectors whose vector similarity is greater than or equal to a preset similarity threshold as target embedding vectors, and taking the documents corresponding to the target embedding vectors as target retrieval documents.
To solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical scheme:
acquiring a document set to be retrieved, performing word segmentation on the documents in the document set to be retrieved to obtain a document word-segmentation set, and constructing a target knowledge graph based on the document word-segmentation set;
when a plurality of search keywords are received, calculating semantic distances among the search keywords based on the target knowledge graph, and determining the two search keywords corresponding to the maximum semantic distance as two center keywords;
constructing a first sub-graph and a second sub-graph based on the two center keywords respectively, counting the number of nodes in the first sub-graph and the second sub-graph respectively, and selecting the maximum semantic sub-graph from the first sub-graph and the second sub-graph according to the node numbers;
acquiring a preset graph convolutional neural network, and performing feature extraction on the maximum semantic sub-graph based on the graph convolutional neural network to obtain a feature vector;
extracting the subject words of each document in the document set to be retrieved, and calculating topic embedding vectors of the subject words;
and calculating the vector similarity between each topic embedding vector and the feature vector, determining the topic embedding vectors whose vector similarity is greater than or equal to a preset similarity threshold as target embedding vectors, and taking the documents corresponding to the target embedding vectors as target retrieval documents.
To solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical scheme:
acquiring a document set to be retrieved, performing word segmentation on the documents in the document set to be retrieved to obtain a document word-segmentation set, and constructing a target knowledge graph based on the document word-segmentation set;
when a plurality of search keywords are received, calculating semantic distances among the search keywords based on the target knowledge graph, and determining the two search keywords corresponding to the maximum semantic distance as two center keywords;
constructing a first sub-graph and a second sub-graph based on the two center keywords respectively, counting the number of nodes in the first sub-graph and the second sub-graph respectively, and selecting the maximum semantic sub-graph from the first sub-graph and the second sub-graph according to the node numbers;
acquiring a preset graph convolutional neural network, and performing feature extraction on the maximum semantic sub-graph based on the graph convolutional neural network to obtain a feature vector;
extracting the subject words of each document in the document set to be retrieved, and calculating topic embedding vectors of the subject words;
and calculating the vector similarity between each topic embedding vector and the feature vector, determining the topic embedding vectors whose vector similarity is greater than or equal to a preset similarity threshold as target embedding vectors, and taking the documents corresponding to the target embedding vectors as target retrieval documents.
In the method, a document set to be retrieved is acquired, word segmentation is performed on the documents in the set to obtain a document word-segmentation set, and a target knowledge graph is constructed based on that set, so that the documents can be efficiently organized and searched through the target knowledge graph. When a plurality of search keywords are received, the semantic distances among them are calculated based on the target knowledge graph, and the two search keywords corresponding to the maximum semantic distance are determined as two center keywords. A first sub-graph and a second sub-graph are then constructed based on the two center keywords respectively, the number of nodes in each is counted, and the maximum semantic sub-graph is selected according to the node numbers; this screens out the semantic sub-graph containing more search keywords, avoids interference from irrelevant search keywords, and thereby improves both the accuracy and the efficiency of document retrieval. A feature vector of the maximum semantic sub-graph is obtained through the graph convolutional neural network. Finally, the subject words of each document in the document set are extracted, their topic embedding vectors are calculated, the vector similarity between each topic embedding vector and the feature vector is computed, and the documents whose topic embedding vectors meet the preset similarity threshold are returned as target retrieval documents, achieving efficient screening of target retrieval documents and improving the accuracy and efficiency of document retrieval.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms used in the description are for the purpose of describing particular embodiments only and are not intended to limit the application. The terms "comprising" and "having", and any variations thereof, in the description, the claims, and the above description of the drawings are intended to cover non-exclusive inclusions. The terms "first", "second", and the like in the description, the claims, or the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
To enable those skilled in the art to better understand the solution of the present application, the technical solution of the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the document retrieval method based on the knowledge graph provided by the embodiment of the application is generally executed by a server/terminal device, and correspondingly, the document retrieval device based on the knowledge graph is generally arranged in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flowchart of one embodiment of a method of knowledge-graph based document retrieval in accordance with the application is shown. The document retrieval method based on the knowledge graph comprises the following steps:
Step S201, a document set to be searched is obtained, word segmentation processing is carried out on documents in the document set to be searched, a document word segmentation set is obtained, and a target knowledge graph is built based on the document word segmentation set.
In this embodiment, the document set to be retrieved is a set of documents acquired in advance; it is obtained, and the documents in it are subjected to word segmentation to obtain a document word-segmentation set. Specifically, the content of each document can be segmented by a preset word-segmentation tool to obtain the segmented words of each document, and the segmented words of all documents in the set are collected into the document word-segmentation set. Once the document word-segmentation set is obtained, entities and relations are extracted from its words through named entity recognition, yielding the entity corresponding to each word and the relations among different entities. Triples are then constructed from the entities and relations, for example in the format (entity 1, relation, entity 2), the triples are input into a graph database (for example, Neo4j), and the target knowledge graph is obtained based on the output of the graph database.
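As an illustrative sketch of the graph-construction step, the fragment below assembles a toy in-memory knowledge graph from (entity 1, relation, entity 2) triples. In practice a word-segmentation tool and named entity recognition would produce the triples, and they would be loaded into a graph database such as Neo4j rather than a Python dictionary; the triples shown are invented examples:

```python
from collections import defaultdict

def build_graph(triples):
    """Build a simple in-memory knowledge graph (adjacency map) from
    (entity1, relation, entity2) triples. A production system would
    load these into a graph database such as Neo4j instead."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
        graph[tail].append((relation, head))  # treat edges as undirected
    return graph

triples = [("insurance", "covers", "vehicle"),
           ("vehicle", "instance_of", "car")]
g = build_graph(triples)
```

Each entity node then carries its incident relations, which is the structure the later subgraph and distance computations operate over.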
Step S202, when a plurality of search keywords are received, semantic distances among the search keywords are calculated based on the target knowledge graph, and the two search keywords corresponding to the maximum semantic distance are determined as two center keywords.
In this embodiment, the search keywords are the input document-retrieval keywords, which can be obtained by segmenting a search sentence input by the user. When a plurality of search keywords are received, the semantic distances among the different search keywords are calculated based on the target knowledge graph, and the search keywords are screened based on those distances. The semantic distance is the distance between two different search keywords: the search keywords are matched against the entity nodes in the target knowledge graph to determine the matching entity nodes, the semantic distances among those entity nodes are calculated, and the semantic distance between entity nodes in the target knowledge graph is used as the semantic distance between the matched search keywords. To compare entity nodes, the embedding vector of each node is calculated; the embedding vectors can be obtained by one-hot encoding the search keywords under a bag-of-words (BOW) model, and the cosine similarity between different embedding vectors is then computed by the standard cosine-similarity formula, this cosine score serving as the semantic distance between the entity nodes. Once the semantic distances among all the search keywords are obtained, the two search keywords corresponding to the maximum semantic distance are taken as the two center keywords.
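A minimal sketch of the center-keyword selection follows. The 3-dimensional embeddings are toy stand-ins for the bag-of-words one-hot vectors, and, following the text, the cosine score between embeddings is used directly as the semantic distance, of which the maximum pair is kept:

```python
import itertools
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def center_keywords(embeddings):
    """Return the pair of keywords with the maximum pairwise score,
    which the method takes as the two center keywords."""
    best_pair, best_score = None, -1.0
    for a, b in itertools.combinations(embeddings, 2):
        score = cosine(embeddings[a], embeddings[b])
        if score > best_score:
            best_pair, best_score = (a, b), score
    return best_pair, best_score

emb = {"kw1": [1.0, 0.0, 0.0], "kw2": [0.7, 0.7, 0.0], "kw3": [0.9, 0.1, 0.1]}
pair, score = center_keywords(emb)
```

A common alternative convention would take 1 − cosine as the distance; the sketch keeps the convention stated in the text.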
Step S203, a first sub-graph and a second sub-graph are respectively constructed based on the two center keywords, the number of nodes in the first sub-graph and the second sub-graph is respectively counted, and the maximum semantic sub-graph is selected from the first sub-graph and the second sub-graph according to the node numbers.
In this embodiment, once the center keywords are determined, the first sub-graph and the second sub-graph are constructed based on them. The center keywords are the two optimal search keywords selected through the semantic distances; each center keyword yields one sub-graph, so the first and second sub-graphs can be constructed from the two center keywords respectively. One center keyword is arbitrarily chosen as the center and the maximum semantic distance as the radius to delimit the first sub-graph from the target knowledge graph, and the remaining center keyword is used as the center with the same radius to delimit the second sub-graph. The first and second sub-graphs contain different numbers of entity nodes; the node numbers of both are obtained, and the sub-graph with the larger node number is taken as the maximum semantic sub-graph. As shown in FIG. 3, a schematic diagram of the maximum semantic sub-graph, the semantic distance between kw1 and kw3 is the longest, so kw1 and kw3 are the two center keywords; the first sub-graph (i.e., sub-graph 1) is obtained based on kw1, the second sub-graph (i.e., sub-graph 2) is obtained based on kw3, the number of nodes of the second sub-graph is larger than that of the first sub-graph, and the second sub-graph is therefore the maximum semantic sub-graph.
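The sub-graph delimitation and node counting can be sketched as below. As a simplifying assumption, the semantic-distance radius is stood in for by a hop radius over the graph, explored by breadth-first search; the graph and keyword names are invented examples:

```python
from collections import deque

def subgraph_nodes(graph, center, radius):
    """Collect all nodes within `radius` hops of `center` by BFS;
    the hop radius stands in for the semantic-distance radius."""
    seen = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if seen[node] == radius:
            continue  # do not expand past the radius
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return set(seen)

graph = {"kw1": ["a"], "a": ["kw1"],
         "kw3": ["b", "c"], "b": ["kw3", "d"], "c": ["kw3"], "d": ["b"]}
sub1 = subgraph_nodes(graph, "kw1", radius=2)
sub2 = subgraph_nodes(graph, "kw3", radius=2)
max_sub = sub1 if len(sub1) >= len(sub2) else sub2  # maximum semantic sub-graph
```

Here the second sub-graph contains more nodes and is selected, mirroring the FIG. 3 example.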
Step S204, a preset graph convolution neural network is obtained, and feature extraction is carried out on the maximum semantic subgraph based on the graph convolution neural network to obtain feature vectors.
In this embodiment, once the maximum semantic sub-graph is obtained, a preset graph convolutional neural network is acquired, and feature extraction is performed on the maximum semantic sub-graph based on this network to obtain a feature vector. Specifically, a graph convolutional network (GCN) is a convolutional neural network for processing graph-structured data, through which feature extraction can be performed on the input maximum semantic sub-graph. When the maximum semantic sub-graph is obtained, its adjacency matrix and degree matrix are calculated; the graph convolutional network then combines the adjacency matrix, the degree matrix, and the embedding vectors of the nodes in the maximum semantic sub-graph to compute a Laplacian-based feature vector, which is the feature vector of the maximum semantic sub-graph.
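A single GCN propagation layer over the adjacency and degree matrices can be sketched as follows, using the common normalization H' = ReLU(D^(-1/2)(A+I)D^(-1/2) H W). The toy features, weights, and the mean-pooling into one graph-level feature vector are illustrative assumptions, not details fixed by the text:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W),
    where A is the adjacency matrix of the maximum semantic sub-graph,
    H the node embeddings, and W a learned weight matrix."""
    A_hat = A + np.eye(A.shape[0])                      # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])      # 3-node path graph
H = np.eye(3)                     # toy one-hot node embeddings
W = np.ones((3, 2)) * 0.5         # toy weight matrix
out = gcn_layer(A, H, W)
graph_vector = out.mean(axis=0)   # pool node features into one vector
```

The pooled `graph_vector` plays the role of the sub-graph feature vector compared against topic embeddings in the following steps.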
Step S205, extracting the subject word of each document in the document set to be retrieved, and calculating the subject embedded vector of the subject word.
In this embodiment, the subject words are the core words of each document; a subject word may be a word whose frequency of occurrence in the document is greater than or equal to a frequency threshold, a word whose semantic weight is greater than or equal to a weight threshold, or a word whose total weight, computed from both frequency and semantic weight, is greater than or equal to a total-weight threshold. The subject words of each document in the document set to be retrieved are extracted, and the embedding vector of each subject word is calculated; this embedding vector is the topic embedding vector. The topic embedding vectors can be computed by a preset pre-trained language model (e.g., BERT): once the subject words of a document are obtained, they are input into the pre-trained language model, and the embedding vector of each subject word is obtained from the model's embedding encoding.
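The frequency-threshold variant of subject-word extraction described above can be sketched in a few lines; the tokens and threshold are invented examples, and the semantic-weight variants would slot in by replacing the counting criterion:

```python
from collections import Counter

def subject_words(tokens, freq_threshold):
    """Return the words whose frequency in the document meets the
    threshold; semantic weights could be folded in the same way."""
    freq = Counter(tokens)
    return {word for word, count in freq.items() if count >= freq_threshold}

doc = ["claim", "policy", "claim", "vehicle", "claim", "policy"]
topics = subject_words(doc, freq_threshold=2)
```

The extracted words would then be passed to the pre-trained language model to obtain their topic embedding vectors.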
Step S206, calculating the vector similarity of each topic embedded vector and the feature vector, determining the topic embedded vector with the vector similarity larger than or equal to a preset similarity threshold as a target embedded vector, and taking the document corresponding to the target embedded vector as a target retrieval document.
In this embodiment, the vector similarity between each topic embedding vector and the feature vector is calculated by the cosine-similarity formula, and it is determined whether the vector similarity is greater than or equal to a preset similarity threshold. If it is, the corresponding topic embedding vector is determined to be a target embedding vector; if it is smaller than the threshold, the corresponding topic embedding vector is a non-target embedding vector. Once the target embedding vectors are obtained, the documents corresponding to them are taken as the target retrieval documents.
The method and the device realize efficient screening of the target retrieval documents, reduce interference of irrelevant documents on retrieval results, save machine learning cost and improve the efficiency and accuracy of target document retrieval.
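The final selection step can be sketched as below, assuming each document's topic embedding and the sub-graph feature vector are already computed (the 2-dimensional vectors, document names, and threshold are invented examples):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def target_documents(topic_vectors, feature_vector, threshold):
    """Keep the documents whose topic embedding vector is similar
    enough to the sub-graph feature vector."""
    return [doc for doc, vec in topic_vectors.items()
            if cosine(vec, feature_vector) >= threshold]

topic_vectors = {"doc1": [1.0, 0.0], "doc2": [0.6, 0.8], "doc3": [0.0, 1.0]}
hits = target_documents(topic_vectors, [1.0, 0.1], threshold=0.7)
```

Only documents whose topics align with the query's maximum semantic sub-graph survive the threshold, which is what filters out irrelevant documents.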
In some optional implementations of this embodiment, the step of calculating the semantic distance between the search keywords based on the target knowledge-graph includes:
acquiring a reference knowledge graph corresponding to the target knowledge graph, and determining the distance weights among the search keywords according to the reference knowledge graph;
and calculating the embedding similarity among the search keywords and the sum of the embedding vectors of the edge nodes of each search keyword in the target knowledge graph, and calculating the semantic distances among the search keywords according to the distance weights, the embedding similarity, and the sums of the embedding vectors.
In this embodiment, the reference knowledge graph is a graph constructed from data in the main domain to which the retrieval objects belong; taking documents as the retrieval objects, the reference knowledge graph may be a knowledge graph constructed from knowledge-network data. The reference knowledge graph corresponding to the target knowledge graph is obtained, and the distance weights among the search keywords are determined according to it. A distance weight is the weight value between the entities corresponding to the search keywords in the reference knowledge graph, where each such entity is identical to its search keyword; if no identical entity exists in the reference knowledge graph, the entity most similar to the search keyword is selected as the corresponding entity. The preset reference distance between the two entities in the reference knowledge graph is acquired, and the distance weight between the corresponding search keywords is determined according to it.
Once the distance weights among all the search keywords are obtained, the embedding similarity among the search keywords and the sum of the embedding vectors of the edge nodes of each search keyword in the target knowledge graph are calculated. The embedding similarity can be obtained by computing the cosine similarity of the embedding vectors of different search keywords, and the sum of embedding vectors is the sum of the embedding vectors of all edge nodes connected to the entity corresponding to the search keyword in the target knowledge graph. With these in hand, the semantic distance between the search keywords can be calculated from the distance weight, the embedding similarity, and the sums of the embedding vectors. The calculation formula of the semantic distance is as follows:
wherein the first factor is the distance weight, sim(E_i, E_j) is the embedding similarity, and the remaining two terms are the sums of the embedding vectors of entity E_i and of entity E_j, respectively.
In this embodiment, the reference knowledge graph corresponding to the target knowledge graph is obtained, the distance weights among the search keywords are determined according to it, and the semantic distances among the search keywords are calculated from the distance weights, the embedding similarity, and the sums of the embedding vectors. This enables accurate calculation of the semantic distances among the search keywords, so that the documents corresponding to the search keywords can be retrieved accurately through the semantic distances, improving the accuracy of document retrieval.
In some optional implementations of this embodiment, the step of determining the distance weight between the search keywords according to the reference knowledge graph includes:
obtaining the category attribute and level of each search keyword in the reference knowledge graph, and determining the distance weights among the search keywords according to the category attributes and levels.
In this embodiment, the category attribute is the object category of the entity corresponding to each search keyword; for example, for the entities apple and banana, the category attribute of both is the fruit category. The level is the positional level of the entity corresponding to each search keyword in the reference knowledge graph; when the reference knowledge graph is constructed, the level of each entity can be assigned according to the relations among the entities. The category attribute and level of each search keyword in the reference knowledge graph are obtained, and the distance weights among the search keywords are determined from them. Specifically, the distance weight is a distance-measurement weight between the entities corresponding to the search keywords in the reference knowledge graph, through which the semantic distance between the search keywords can be accurately calculated. Once the category attributes and levels are obtained, the corresponding distance weight is determined, for example by looking up the weight values for different category attributes and levels in a preset weight table.
According to the embodiment, the category attribute and the hierarchy of the search keywords in the reference knowledge graph are obtained, and the distance weight between the search keywords is determined according to the category attribute and the hierarchy, so that the semantic distance between the search keywords can be accurately calculated through the distance weight, and the accuracy of the semantic distance is improved.
In some optional implementations of this embodiment, the step of determining the distance weight between the search keywords according to the category attribute and the hierarchy includes:
judging whether the category attributes of the search keywords are the same and whether the levels of the search keywords are the same, and determining the distance weight between the search keywords to be a preset weight when both the category attributes and the levels are the same;
and when the category attributes are different, or the category attributes are the same but the levels are different, acquiring the common superior entity of the search keywords in the reference knowledge graph, calculating the level distances between the superior entity and the search keywords, and calculating the distance weight between the search keywords according to the level distances.
In this embodiment, when the category attribute and the hierarchy of all the search keywords in the reference knowledge graph are obtained, it is determined whether the category attributes of each pair of search keywords are the same. If the category attributes of two search keywords are the same, it is then determined whether their levels are the same, that is, whether the two search keywords are at the same level in the reference knowledge graph; if both the category attributes and the levels of the two search keywords are the same, the distance weight between them is determined to be a preset weight, such as 1. If the category attributes of the two search keywords are different, or the category attributes are the same but the levels are different, the common superior entity of the two search keywords in the reference knowledge graph is obtained; for example, for two search keywords under the same category attribute but at different levels, such as a balcony and a room, the common superior entity in the reference knowledge graph is a building. The hierarchical distances between the superior entity and the two search keywords are calculated, and the distance weight between the two search keywords is calculated according to the hierarchical distances, wherein the calculation formula of the distance weight is as follows:
wherein λ is a preset parameter, e_s is the superior entity common to the search keywords in the reference knowledge graph, E_i is the entity of search keyword i in the reference knowledge graph, E_j is the entity of search keyword j in the reference knowledge graph, d(e_s, E_i) is the hierarchical distance from entity i to the superior entity, and d(e_s, E_j) is the hierarchical distance from entity j to the superior entity.
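As an illustrative sketch only: the combining formula itself is not reproduced above, so the form used below — the preset parameter λ scaling the sum of the two hierarchical distances d(e_s, E_i) and d(e_s, E_j) — is a hypothetical stand-in, as are the function and parameter names.

```python
def distance_weight(cat_i, cat_j, level_i, level_j, d_i, d_j, lam=0.5, preset=1.0):
    """Distance weight between two search keywords.

    cat_*/level_*: category attribute and hierarchy level of each keyword;
    d_i, d_j: hierarchical distances d(e_s, E_i) and d(e_s, E_j) from the
    common superior entity e_s. The combining expression is a hypothetical
    stand-in for the patent's formula.
    """
    # Same category attribute and same level: use the preset weight (e.g. 1).
    if cat_i == cat_j and level_i == level_j:
        return preset
    # Otherwise combine the hierarchical distances to the common superior
    # entity, scaled by the preset parameter lambda (assumed form).
    return lam * (d_i + d_j)

# Apple and banana share the fruit category and level: preset weight applies.
w_same = distance_weight("fruit", "fruit", 2, 2, 1, 1)
# Different categories: weight derived from the hierarchical distances.
w_diff = distance_weight("fruit", "vehicle", 2, 3, 2, 3)
```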
According to the method and the device, the category attribute and the hierarchy of the search keywords are judged, different calculation is carried out on the distance weights among the search keywords according to the category attribute and the hierarchy, the calculation accuracy of the distance weights is further improved, and the semantic distances among different search keywords can be reflected more accurately through the distance weights.
In some optional implementations of this embodiment, the step of extracting the subject term of each document in the document set to be retrieved includes:
acquiring the word number of each document, and carrying out ascending order sequencing on the documents according to the word number to obtain a document queue;
Acquiring the number of subject words corresponding to the lowest-ranked document in the document queue, taking that number as a lowest threshold, and, based on the lowest threshold, sequentially incrementing the number of subject words of the other documents in the document queue according to the arrangement order of the queue until the number of subject words reaches a preset maximum threshold;
And sequentially extracting the subject terms of the documents in the document queue according to the sequence and the number from the lowest threshold value to the maximum threshold value.
In this embodiment, when extracting the subject words of each document in the document set to be retrieved, the word count of each document is obtained, where the word count is the total number of words contained in the document, and the documents in the document set to be retrieved are sorted from low to high by word count to obtain a document queue. Based on the lowest threshold, the number of subject words of the other documents in the document queue is increased in arithmetic progression according to the arrangement order of the documents in the queue until the number of subject words reaches a preset maximum threshold. For example, with the lowest threshold TW_min, the number of subject words of each subsequent document in the queue is increased by 1 until the preset maximum threshold TW_max is reached. Then, the subject words of the documents in the document queue are extracted sequentially, in order and in number from the lowest threshold to the maximum threshold, where the number of subject words extracted for each document is the number assigned to that document in the document queue.
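The queue construction and arithmetic increase of subject-word counts described above can be sketched as follows (function and variable names are illustrative, not from the patent):

```python
def topic_word_counts(docs, tw_min, tw_max):
    """Sort documents by word count (ascending) into a queue and assign each a
    number of subject words: the lowest-ranked document gets tw_min, each
    following document one more, capped at the preset maximum tw_max."""
    queue = sorted(docs, key=lambda d: len(d.split()))
    counts = [min(tw_min + i, tw_max) for i in range(len(queue))]
    return list(zip(queue, counts))

docs = ["a b", "a b c d e", "a b c", "a b c d"]
# Shortest document gets 2 subject words, then 3, then 4, capped at 4.
plan = topic_word_counts(docs, tw_min=2, tw_max=4)
```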
According to the method and the device, the topic words of the documents in the document set to be retrieved are extracted through the document queue, so that the sequential extraction of the topic words of the documents is realized, and the extraction efficiency and the extraction precision of the topic words of the documents are improved.
In some optional implementations of this embodiment, the step of extracting features of the maximum semantic subgraph based on the graph convolution neural network to obtain feature vectors includes:
calculating an adjacency matrix and a degree matrix of the maximum semantic subgraph;
And acquiring a preset weight matrix, and calculating the feature vector through the graph convolution neural network according to the weight matrix, the adjacency matrix and the degree matrix.
In this embodiment, when the maximum semantic subgraph is obtained, the adjacency matrix and the degree matrix of the maximum semantic subgraph are calculated, a preset weight matrix is then obtained, and the feature vector is calculated through the graph convolution neural network according to the weight matrix, the adjacency matrix and the degree matrix. The calculation formula of the feature vector is as follows:

L^(1) = D^(-1/2) A D^(-1/2) L^(0) W_0

L^(0) = X

wherein A is the adjacency matrix, D is the degree matrix, W_0 is the preset weight matrix, and X is the embedding vector of the nodes in the maximum semantic subgraph.
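A minimal pure-Python sketch of one such graph-convolution layer, assuming the standard GCN normalization with added self-loops and a ReLU activation (the patent's exact formula and activation may differ):

```python
def matmul(A, B):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gcn_layer(A, X, W):
    """One graph-convolution layer: H = ReLU(D^(-1/2) A_hat D^(-1/2) X W),
    where A_hat = A + I adds self-loops and D is the degree matrix of A_hat."""
    n = len(A)
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    d_inv_sqrt = [sum(row) ** -0.5 for row in A_hat]
    A_norm = [[d_inv_sqrt[i] * A_hat[i][j] * d_inv_sqrt[j] for j in range(n)]
              for i in range(n)]
    H = matmul(matmul(A_norm, X), W)
    return [[max(0.0, v) for v in row] for row in H]

# Two connected nodes, 2-d node embeddings, identity weight matrix.
A = [[0, 1], [1, 0]]
X = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
H = gcn_layer(A, X, W)
```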
According to the embodiment, the graph convolution neural network is constructed, so that the characteristics of the maximum semantic subgraph can be accurately and efficiently extracted through the graph convolution neural network, and the accuracy of characteristic extraction is further improved.
In some optional implementations of this embodiment, before the step of calculating the semantic distance between the search keywords based on the target knowledge-graph, the method further includes:
searching the target knowledge graph, and determining whether the search keywords exist in the target knowledge graph;
When the search keyword does not exist in the target knowledge graph, a preset pre-training language model is obtained, the search keyword and the word in the document word segmentation set are respectively input into the pre-training language model, and a first characterization vector and a second characterization vector are obtained through calculation;
According to the first characterization vector and the second characterization vector, calculating to obtain word similarity of the search keywords and the segmented words, determining segmented words with the word similarity larger than or equal to preset similarity as candidate keywords, and replacing the search keywords with the candidate keywords.
In this embodiment, before calculating the semantic distance between the search keywords based on the target knowledge graph, the target knowledge graph is searched to determine whether the search keywords exist in it. If a search keyword does not exist in the target knowledge graph, a preset pre-trained language representation model (BERT, Bidirectional Encoder Representations from Transformers) is obtained; the received search keyword is input into the pre-trained language model and the first characterization vector is obtained through its output layer, and the segmented words in the document word segmentation set are input into the pre-trained language model and the second characterization vectors are obtained through its output layer. The cosine similarity of the first characterization vector and each second characterization vector is calculated as the word similarity, the preset similarity is obtained, a segmented word whose word similarity is greater than or equal to the preset similarity is taken as a candidate keyword, and the search keyword is replaced with the candidate keyword, so that when the semantic distance between the search keywords is calculated, the calculation is performed with the candidate keyword. If the word similarity between the search keyword and every segmented word is smaller than the preset similarity, the search keyword does not need to be replaced, and the original search keyword is still used when the semantic distance between the search keywords is calculated.
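The replacement step above can be sketched as follows, with small toy vectors standing in for the BERT characterization vectors (the words and two-dimensional vectors are purely illustrative):

```python
def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def replace_missing_keyword(keyword_vec, segment_vecs, threshold):
    """If a search keyword is absent from the target knowledge graph, find the
    segmented word with the most similar characterization vector; replace the
    keyword only when the similarity reaches the preset threshold."""
    best_word, best_sim = None, -1.0
    for word, vec in segment_vecs.items():
        sim = cosine(keyword_vec, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word if best_sim >= threshold else None

segments = {"automobile": [0.9, 0.1], "banana": [0.1, 0.9]}
candidate = replace_missing_keyword([1.0, 0.0], segments, threshold=0.8)
```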
According to the method and the device for searching the keywords, when the search keywords do not exist in the target knowledge graph, word similarity of the search keywords and the word similarity of the word in the document word segmentation set is calculated, and the search keywords are replaced according to the word similarity, so that all received search keywords can be accurately searched through the target knowledge graph, and the search range based on the keywords is improved.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by computer readable instructions stored in a computer readable storage medium that, when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk or a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
With further reference to fig. 4, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a document retrieval device based on a knowledge graph, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device may specifically be applied to various electronic devices.
As shown in fig. 4, the document retrieval device 400 based on a knowledge graph according to the present embodiment includes a construction module 401, a first calculation module 402, a selection module 403, a second calculation module 404, an extraction module 405, and a confirmation module 406. Wherein:
The construction module 401 is configured to obtain a document set to be retrieved, perform word segmentation on documents in the document set to be retrieved to obtain a document word segmentation set, and construct a target knowledge graph based on the document word segmentation set;
In this embodiment, the document set to be retrieved is a set of documents acquired in advance; the document set to be retrieved is obtained, and the documents in it are subjected to word segmentation processing to obtain a document word segmentation set. Specifically, the content of each document can be segmented through a preset word segmentation tool to obtain the segmented words corresponding to different documents, and the segmented words corresponding to all documents in the document set to be retrieved are collected to obtain the document word segmentation set. When the document word segmentation set is obtained, entities and relations are extracted from the segmented words through named entity recognition to obtain the entity corresponding to each segmented word and the relations between different entities; triples are then constructed from the entities and relations, for example triples in the format (entity 1, relation, entity 2), the triples are input into a graph database (for example, Neo4j), and the target knowledge graph is obtained based on the output of the graph database.
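The assembly of (entity 1, relation, entity 2) triples into a graph can be sketched with a minimal in-memory adjacency map, standing in for loading them into a graph database such as Neo4j (illustrative only):

```python
def build_graph(triples):
    """Assemble (head entity, relation, tail entity) triples into an
    adjacency map: each entity maps to its outgoing (relation, tail) pairs."""
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
        graph.setdefault(tail, [])  # ensure every entity appears as a node
    return graph

triples = [("apple", "is_a", "fruit"), ("banana", "is_a", "fruit")]
kg = build_graph(triples)
```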
A first calculation module 402, configured to calculate, when receiving a plurality of search keywords, semantic distances between the search keywords based on the target knowledge graph, and determine two search keywords corresponding to a maximum semantic distance as two center keywords;
in some alternative implementations of the present embodiment, the first computing module 402 includes:
The first acquisition unit is used for acquiring a reference knowledge graph corresponding to the target knowledge graph and determining the distance weights among all the search keywords according to the reference knowledge graph;
The computing unit is used for computing the embedding similarity between the search keywords and the sum of the embedding vectors of the corresponding edge nodes of each search keyword in the target knowledge graph, and computing the semantic distance between the search keywords according to the distance weight, the embedding similarity and the sum of the embedding vectors.
In some optional implementations of the present embodiment, the first obtaining unit includes:
and the confirming subunit is used for acquiring the category attribute and the hierarchy of the search keywords in the reference knowledge graph and determining the distance weight between the search keywords according to the category attribute and the hierarchy.
In some optional implementations of the present embodiment, the validation subunit includes:
The judging subunit is used for judging whether category attributes among the search keywords are the same or not and whether the levels among the search keywords are the same or not, and determining that the distance weights among the search keywords are preset weights when the category attributes of the search keywords are the same and the levels are the same;
and the first calculating subunit is used for acquiring a common upper entity of the search keywords in the reference knowledge graph when the category attributes are different or the category attributes are the same and the levels are different, calculating the level distance between the upper entity and the search keywords, and calculating the distance weight between the search keywords according to the level distance.
In this embodiment, a search keyword is an input document search keyword, which can be obtained by segmenting a search sentence input by a user. When a plurality of search keywords are received, the semantic distances between different search keywords are calculated based on the target knowledge graph, and the search keywords are screened based on the semantic distances. A semantic distance is the semantic distance between two different search keywords. When the search keywords are obtained, they are matched against the entity nodes in the target knowledge graph to determine the entity node matching each search keyword; the semantic distances between different entity nodes are then calculated, and the semantic distance between two entity nodes in the target knowledge graph is used as the semantic distance between the matched search keywords. When determining the entity nodes matching the search keywords, the embedding vectors of the entity nodes are calculated; the embedding vectors can be obtained by one-hot encoding the search keywords with a bag-of-words (BOW) model, and the cosine similarity between different embedding vectors is then calculated according to the cosine similarity formula, the cosine similarity serving as the semantic distance between the entity nodes. When the semantic distances between all the search keywords are obtained, the two search keywords corresponding to the maximum semantic distance are taken as the two center keywords.
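Selecting the two center keywords from pairwise semantic distances can be sketched as follows (a toy distance table stands in for the graph-based distances described above):

```python
def center_keywords(keywords, distance):
    """Return the pair of search keywords with the maximum semantic distance;
    these two become the center keywords."""
    pairs = [(a, b) for i, a in enumerate(keywords) for b in keywords[i + 1:]]
    return max(pairs, key=lambda p: distance(p[0], p[1]))

# Hypothetical pairwise semantic distances between three search keywords.
dist_table = {frozenset(("kw1", "kw2")): 0.4,
              frozenset(("kw1", "kw3")): 0.9,
              frozenset(("kw2", "kw3")): 0.5}
centers = center_keywords(["kw1", "kw2", "kw3"],
                          lambda a, b: dist_table[frozenset((a, b))])
```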
A selecting module 403, configured to respectively construct a first sub-graph and a second sub-graph based on two central keywords, respectively calculate the number of nodes in the first sub-graph and the second sub-graph, and select a maximum semantic sub-graph from the first sub-graph and the second sub-graph according to the number of nodes;
In this embodiment, when the center keywords are determined, a first sub-graph and a second sub-graph are constructed based on the center keywords. The center keywords are the two optimal search keywords selected through the semantic distances, and each center keyword constructs one sub-graph, so the first sub-graph and the second sub-graph can be respectively constructed from the two center keywords. Specifically, one center keyword is arbitrarily selected as the circle center and the maximum semantic distance is used as the radius to carve the first sub-graph out of the target knowledge graph; the remaining center keyword is then used as the circle center, with the maximum semantic distance as the radius, to carve the second sub-graph out of the target knowledge graph. The first sub-graph and the second sub-graph contain different numbers of entity nodes; the node counts of the first sub-graph and the second sub-graph are obtained, and the sub-graph with the larger node count is taken as the maximum semantic sub-graph. As shown in FIG. 3, which is a schematic diagram of a maximum semantic subgraph, the semantic distance between kw1 and kw3 is the longest, so kw1 and kw3 are the two center keywords; a first subgraph (i.e., subgraph 1) is obtained based on kw1, a second subgraph (i.e., subgraph 2) is obtained based on kw3, the number of nodes of the second subgraph is larger than that of the first subgraph, and the second subgraph is therefore the maximum semantic subgraph.
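The carving of the two sub-graphs and the choice of the maximum semantic sub-graph can be sketched as follows (precomputed center-to-node distances stand in for distances measured on the target knowledge graph; all names are illustrative):

```python
def subgraph_nodes(distances, center, radius):
    """Nodes of the sub-graph carved around a center keyword: every entity
    whose semantic distance from the center does not exceed the radius
    (the maximum semantic distance between the two center keywords)."""
    return {n for (c, n), d in distances.items() if c == center and d <= radius}

# Hypothetical distances from each center keyword to surrounding entities.
dists = {("kw1", "a"): 0.3, ("kw1", "b"): 1.2,
         ("kw3", "a"): 0.5, ("kw3", "b"): 0.4, ("kw3", "c"): 0.8}
radius = 0.9  # maximum semantic distance between the two center keywords
sub1 = subgraph_nodes(dists, "kw1", radius)
sub2 = subgraph_nodes(dists, "kw3", radius)
# The sub-graph with more nodes is the maximum semantic sub-graph.
largest = sub1 if len(sub1) >= len(sub2) else sub2
```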
The second calculation module 404 is configured to obtain a preset graph convolution neural network, and perform feature extraction on the maximum semantic subgraph based on the graph convolution neural network to obtain a feature vector;
In some alternative implementations of the present embodiment, the second computing module 404 includes:
The second calculating subunit is used for calculating an adjacency matrix and a degree matrix of the maximum semantic subgraph;
And the third calculation subunit is used for acquiring a preset weight matrix, and calculating the feature vector through the graph convolution neural network according to the weight matrix, the adjacency matrix and the degree matrix.
In this embodiment, when the maximum semantic subgraph is obtained, a preset graph convolutional neural network is obtained, and feature extraction is performed on the maximum semantic subgraph based on the graph convolutional neural network to obtain the feature vector. Specifically, a graph convolutional neural network (GCN, Graph Convolutional Network) is a convolutional neural network for processing graph structures, through which feature extraction can be performed on the input maximum semantic subgraph to obtain the feature vector. When the maximum semantic subgraph is obtained, its adjacency matrix and degree matrix are calculated, and then the adjacency matrix, the degree matrix and the embedding vectors of the nodes in the maximum semantic subgraph are processed by the graph convolutional neural network to obtain a Laplacian feature vector, which is the feature vector of the maximum semantic subgraph.
An extracting module 405, configured to extract a subject term of each document in the document set to be retrieved, and calculate a subject embedding vector of the subject term;
in some alternative implementations of the present embodiment, the extracting module 405 includes:
The ordering unit is used for acquiring the word number of each document, and carrying out ascending order ordering on the documents according to the word number to obtain a document queue;
a second obtaining unit, configured to obtain the number of subject words corresponding to the lowest-ranked document in the document queue, take that number as a lowest threshold, and, based on the lowest threshold, sequentially increment the number of subject words of the other documents in the document queue according to the arrangement order of the queue until the number of subject words reaches a preset maximum threshold;
And the extraction unit is used for sequentially extracting the subject terms of the documents in the document queue according to the sequence and the number from the lowest threshold value to the maximum threshold value.
In this embodiment, the subject words are the core words of each document; a subject word may be a word whose frequency of occurrence in the document is greater than or equal to a frequency threshold, a word whose semantic weight is greater than or equal to a weight threshold, or a word whose total weight, calculated from the frequency of occurrence and the semantic weight in the document, is greater than or equal to a total-weight threshold. The subject words of each document in the document set to be retrieved are extracted, and the embedding vector of each subject word is calculated, where the embedding vector of a subject word is the topic embedding vector. The topic embedding vectors can be calculated through a preset pre-trained language model (BERT); specifically, when the subject words of a document are obtained, they are input into the pre-trained language model, and the embedding vector of each subject word is obtained based on the embedding encoding of the pre-trained language model.
And the confirmation module 406 is configured to calculate a vector similarity between each of the topic embedded vectors and the feature vector, determine the topic embedded vector with the vector similarity greater than or equal to a preset similarity threshold as a target embedded vector, and use a document corresponding to the target embedded vector as a target search document.
In the embodiment, based on a calculation formula of cosine similarity, the vector similarity of the topic embedded vector and the feature vector is calculated, whether the vector similarity is larger than or equal to a preset similarity threshold is determined, if the vector similarity is larger than or equal to the preset similarity threshold, the topic embedded vector corresponding to the vector similarity is determined to be a target embedded vector, and if the vector similarity is smaller than the preset similarity threshold, the topic embedded vector corresponding to the vector similarity is determined to be a non-target embedded vector. When the target embedded vector is obtained, the document corresponding to the target embedded vector is taken as a target retrieval document.
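The final screening step can be sketched as follows (toy two-dimensional vectors stand in for the topic embedding vectors and the subgraph feature vector):

```python
def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def target_documents(topic_vecs, feature_vec, threshold):
    """Documents whose topic embedding vector reaches the preset similarity
    threshold against the subgraph feature vector are the target retrieval
    documents; the rest are screened out."""
    return [doc for doc, vec in topic_vecs.items()
            if cosine(vec, feature_vec) >= threshold]

topic_vecs = {"doc1": [1.0, 0.0], "doc2": [0.0, 1.0], "doc3": [0.7, 0.7]}
hits = target_documents(topic_vecs, [1.0, 0.0], threshold=0.6)
```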
In some optional implementations of this embodiment, the document retrieval device 400 based on a knowledge graph further includes:
the retrieval module is used for retrieving the target knowledge graph and determining whether the retrieval keywords exist in the target knowledge graph;
The third calculation module is used for acquiring a preset pre-training language model when the search keyword does not exist in the target knowledge graph, inputting the search keyword and the segmentation in the document segmentation set into the pre-training language model respectively, and calculating to obtain a first characterization vector and a second characterization vector;
And the replacing module is used for calculating the word similarity of the search keyword and the word segmentation according to the first characterization vector and the second characterization vector, determining the word segmentation with the word similarity larger than or equal to the preset similarity as a candidate keyword, and replacing the search keyword with the candidate keyword.
In this embodiment, before calculating the semantic distance between the search keywords based on the target knowledge graph, the target knowledge graph is searched to determine whether the search keywords exist in it. If a search keyword does not exist in the target knowledge graph, a preset pre-trained language representation model (namely BERT, Bidirectional Encoder Representations from Transformers) is obtained; the received search keyword is input into the pre-trained language model and the first characterization vector is obtained through its output layer, and the segmented words in the document word segmentation set are input into the pre-trained language model and the second characterization vectors are obtained through its output layer. The cosine similarity of the first characterization vector and each second characterization vector is calculated as the word similarity, the preset similarity is obtained, a segmented word whose word similarity is greater than or equal to the preset similarity is taken as a candidate keyword, and the search keyword is replaced with the candidate keyword, so that when the semantic distance between the search keywords is calculated, the calculation is performed with the candidate keyword. If the word similarity between the search keyword and every segmented word is smaller than the preset similarity, the search keyword does not need to be replaced, and the original search keyword is still used when the semantic distance between the search keywords is calculated.
The document retrieval device based on the knowledge graph provided by the embodiment realizes efficient screening of the target retrieval document, reduces the interference of irrelevant documents on the retrieval result, saves the machine learning cost, and improves the efficiency and the accuracy of target document retrieval.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 5, fig. 5 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 6 comprises a memory 61, a processor 62 and a network interface 63 communicatively connected to each other via a system bus. It is noted that only a computer device 6 having components 61-63 is shown in the figures, but it should be understood that not all of the illustrated components are required to be implemented and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 61 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or a memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the computer device 6. Of course, the memory 61 may also comprise both an internal storage unit of the computer device 6 and an external storage device. In this embodiment, the memory 61 is generally used to store an operating system and various application software installed on the computer device 6, such as computer readable instructions of the knowledge-graph-based document retrieval method. Further, the memory 61 may be used to temporarily store various types of data that have been output or are to be output.
The processor 62 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 62 is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to execute computer readable instructions stored in the memory 61 or process data, for example, execute computer readable instructions of the knowledge-graph-based document retrieval method.
The network interface 63 may comprise a wireless network interface or a wired network interface, which network interface 63 is typically used for establishing a communication connection between the computer device 6 and other electronic devices.
The computer equipment provided by the embodiment realizes the efficient screening of the target retrieval document, reduces the interference of irrelevant documents on the retrieval result, saves the machine learning cost and improves the efficiency and the accuracy of the target document retrieval.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the knowledge-graph-based document retrieval method as described above.
The computer readable storage medium provided by the embodiment realizes efficient screening of the target retrieval document, reduces the interference of irrelevant documents on the retrieval result, saves the machine learning cost and improves the efficiency and the accuracy of target document retrieval.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment methods may be implemented by means of software plus a necessary general hardware platform, or by means of hardware, though in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods according to the embodiments of the present application.
It is apparent that the above-described embodiments are only some embodiments of the present application, not all of them; the preferred embodiments of the present application are shown in the drawings, which do not limit the scope of the claims. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the present application will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the technical solutions described in the foregoing embodiments, or equivalents may be substituted for some of their features. All equivalent structures made using the content of the specification and the drawings of the application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the application.