CN116881436A

CN116881436A - Knowledge graph-based document retrieval method, system, terminal and storage medium

Info

Publication number: CN116881436A
Application number: CN202311004772.8A
Authority: CN
Inventors: 陈清财; 褚达文; 杨扬; 任鹏宇; 刘荣; 王斐; 张恭
Original assignee: Shenzhen Graduate School Harbin Institute of Technology; First Medical Center of PLA General Hospital
Current assignee: Shenzhen Graduate School Harbin Institute of Technology; First Medical Center of PLA General Hospital
Priority date: 2023-08-09
Filing date: 2023-08-09
Publication date: 2023-10-13

Abstract

The invention provides a knowledge-graph-based document retrieval method, system, terminal and storage medium, and relates in particular to the technical field of medical big data processing. The method comprises: training a character vector model and performing fuzzy matching on unregistered words to obtain candidate words for the query sentence, extracting the candidate word with the highest matching degree, and adding it to an entity set; generating a triplet based on the entity set and a relation set; acquiring the document index corresponding to the triplet based on the target-domain knowledge graph and the target-domain documents; and sorting the document indexes by relevance and outputting retrieval results according to the sorted list. By locating and expanding the query sentence from the perspectives of its entities, relations, values and unregistered words, the scheme can screen out the documents most relevant to the query sentence, thereby achieving accurate and efficient retrieval of documents in the target domain.

Description

Knowledge graph-based document retrieval method, system, terminal and storage medium

Technical Field

The invention relates to the technical field of medical big data processing, in particular to a document retrieval method, a system, a terminal and a storage medium based on a knowledge graph.

Background

With the rapid development of big data, the variety and number of documents in every field have increased rapidly, and how to accurately and quickly retrieve the needed medical documents from a large body of medical literature has become a technical problem to be solved.

Currently, most document retrieval methods search by exact matching on key fields, which are usually the title, author, index number or domain keywords of a document. If the searcher does not know the title, author or index number corresponding to the content being queried, the desired document cannot be retrieved accurately. In addition, the text corpus covered by the knowledge graph in existing document retrieval systems is limited. When retrieval relies only on user-defined keywords, the same keyword may express different meanings in different contexts, and the same concept may even appear under different keywords because of different translations, so documents that meet the requirements are difficult to retrieve and retrieval accuracy is low. Retrieval methods that rank by factors such as word frequency, citation count and publication year are likewise of limited effectiveness.

Disclosure of Invention

In view of the above-mentioned shortcomings of the prior art, the present invention aims to provide a knowledge-graph-based document retrieval method, system, terminal and storage medium, which aims to solve the problem of low accuracy of document retrieval in the prior art.

In order to achieve the above object, a first aspect of the present invention provides a document retrieval method based on a knowledge graph, including the steps of:

acquiring a query sentence input by a user;

scanning the query statement by adopting a prefix dictionary, extracting all entities, and constructing an entity set; generating a directed acyclic graph by utilizing the entity, extracting all entity relations based on the directed acyclic graph, and constructing a relation set;

training a word vector model by adopting a preset word vector training model and a preset training corpus, carrying out fuzzy matching on unregistered words to obtain matched word vectors, and forming a word vector group by the matched word vectors; obtaining candidate words in the query sentence, calculating the matching degree of the word vector group and the candidate words, extracting the candidate words with the highest matching degree from the word vector group, and adding the candidate words into the entity set;

generating a triplet based on the entity set and the relation set;

Extracting all target domain entities in a target domain document, forming target domain entity pairs, and acquiring all triples comprising the target domain entity pairs from a target domain knowledge graph to obtain a triplet set; based on the triplet set, obtaining target field documents matched with each triplet in the triplet set, obtaining a document and triplet pair set, and sorting the target field documents in the document and triplet pair set according to the association degree, so as to obtain a document index corresponding to the triplet;

and sorting the document indexes according to the association degree, obtaining a retrieval result and outputting the retrieval result.

Further, the generating a directed acyclic graph by using the entity, extracting all entity relationships based on the directed acyclic graph, and constructing a relationship set includes:

generating a directed acyclic graph with each entity as an edge;

searching, with a dynamic programming algorithm, for the path with the maximum weight sum to each point on the directed acyclic graph, and taking the entity relationships generated on that maximum-weight-sum path as the extracted entity relationships;

and constructing a relation set by using the extracted entity relation of each point on the directed acyclic graph.

Further, the obtaining the candidate word in the query sentence includes:

splitting each entity word, each relation word and each value in the target field knowledge graph into single words, and storing each word into an inverted index to obtain index words;

and performing inverted indexing on the words in the query statement to obtain all index words associated with the words in the query statement, and obtaining candidate words of the query statement.

Further, the calculating the matching degree of the word vector group and the candidate words, extracting the candidate word with the highest matching degree, and adding the candidate word to the entity set includes:

converting the candidate words and the words in the query sentences into word vectors by using a preset word vector training model, and obtaining the word vectors of the candidate words and the word vectors of the words in the query sentences;

and forming a word vector pair from the word vector of the candidate word and the word vector of the word in the query sentence, calculating the matching degree of the word vector pair by using the Word Mover's Distance (WMD) algorithm, extracting the candidate word with the highest matching degree, and adding the candidate word into the entity set.

Further, the generating a triplet based on the entity set and the relationship set includes:

Generating an entity relation set based on the relation among the entities in the entity set, generating the relation set by using the extracted relation, and dividing the relation among the entities into a determined relation, a fuzzy relation, a single entity and multiple entities based on the number and the type of the relation set and the number of the entities;

when the relation set is not empty, matching all the relations in the entity relation set with all the relations in the relation set by using a preset relation matching operation, and sorting according to the matching value to obtain a sorting queue for extracting the relations; extracting a plurality of relations in the ordering queue of the extraction relations as blocks, extracting all triples of an entity corresponding to each relation, and ordering the triples in each block according to a matching value to obtain a triplet block of a determined relation or a triplet block of a fuzzy relation;

when the relation set is empty, all triples of the entities in the entity set are found in the target field knowledge graph; when the number of entities in the entity set is more than one, all shortest paths among the entities are found by using a path optimization algorithm; all triples are classified according to relation to generate blocks, and the blocks are sorted according to the number of triples in each relation block, so that a single-entity triplet or a multi-entity triplet is obtained.

Further, the sorting the target domain documents in the document and triplet pair set according to the association degree to obtain a document index includes:

based on the importance and the relevance, adding labels to the target domain document and triplet pairs in the document and triplet pair set to obtain a prediction model;

and carrying out association degree prediction on the target field literature and the triplet pairs based on the prediction model to obtain a prediction result, and sequencing according to the prediction result to obtain a literature index corresponding to the triplet.

Further, the adding labels to the target domain literature and triplet pairs based on the importance and the relevance to obtain a prediction model includes:

adding machine feature labels and manual labels to the target field literature and triplet pairs based on the importance and the relevance, and obtaining machine feature label data and manual label data;

training a preset document mapping model by using the machine feature labeling data to obtain a feature model;

and continuing training the preset document mapping model by using the manually marked data and the characteristic model to obtain a prediction model.

A second aspect of the present invention provides a knowledge-graph-based document retrieval system, which comprises an interaction unit and a retrieval unit, wherein the interaction unit comprises an input module for receiving query sentences and an output module for outputting document retrieval results, and the retrieval unit comprises a query sentence entity and relationship extraction module, an unregistered word fuzzy matching module, a matching triplet module, a medical document triplet index construction module and a correlation module.

The query sentence entity and relation extraction module is used for extracting entities and relations based on the query sentences input in the input module, scanning the query sentences by using a prefix dictionary, extracting all the entities and constructing an entity set; generating a relational graph by utilizing the entities, extracting all entity relations based on the relational graph, constructing a relation set, and outputting the relation set to a matching triplet module;

the unregistered word fuzzy matching module is used for carrying out fuzzy matching on unregistered words in the query sentence received by the input module: training a word vector model by adopting a preset word vector training model and a preset training corpus, carrying out fuzzy matching on the unregistered words to obtain matched word vectors, and forming a word vector group by the matched word vectors; obtaining candidate words in the query sentence, calculating the matching degree of the word vector group and the candidate words, extracting the candidate word with the highest matching degree, adding it to the entity set received from the query sentence entity and relation extraction module, and outputting the entity set to the matching triplet module;

the matching triplet module is used for generating triples based on the received entity set and the relation set and outputting the triples to the medical document triplet index construction module;

The medical literature triplet index construction module is used for extracting all target field entities in the target field literature, forming target field entity pairs, acquiring all triples comprising the target field entity pairs from a target field knowledge graph, and obtaining a triplet set; based on the triplet set, obtaining target domain documents matched with all triples in the triplet set, obtaining a document and triplet pair set, sorting the target domain documents in the document and triplet pair set according to the association degree, obtaining a document index, and outputting the document index to an association module;

and the association module is used for sorting the document indexes according to the association degree, obtaining search results and outputting the search results.

A third aspect of the present invention provides an intelligent terminal including a memory, a processor, and a knowledge-graph-based document retrieval program stored on the memory and executable on the processor, the knowledge-graph-based document retrieval program implementing the steps of any one of the above-described knowledge-graph-based document retrieval methods when executed by the processor.

A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a knowledge-graph-based document retrieval program which, when executed by a processor, implements the steps of any one of the above-described knowledge-graph-based document retrieval methods.

Compared with the prior art, the beneficial effects of this scheme are as follows:

Based on the target-domain knowledge graph and natural language processing technology, the invention extracts entities and relations from the query sentence and constructs an entity set and a relation set; trains a character vector model, performs fuzzy matching on unregistered words, extracts the candidate word with the highest matching degree in the query sentence and adds it to the entity set; and then generates a triplet based on the entity set and the relation set. The method locates and expands from the perspectives of the entities, relations, values and unregistered words of the query sentence, improving the efficiency and accuracy of document retrieval for users.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of the knowledge-graph-based document retrieval method of the present invention;

FIG. 2 is a flow chart of the query sentence entity and relation extraction of the present invention;

FIG. 3 is a flow chart of the fuzzy matching of unregistered words of the present invention;

FIG. 4 is a flow chart of an example of unregistered word fuzzy matching of the present invention;

FIG. 5 is a triplet matching flow chart of the present invention;

FIG. 6 is a flowchart of an example triplet matching of the present invention;

FIG. 7 is a flow chart of the medical document triplet index construction of the present invention;

FIG. 8 is a flow chart of an example of document and triplet pair generation of the present invention;

FIG. 9 is a flow chart of the search result ordering of the present invention;

FIG. 10 is a schematic diagram of a document numbering sequence construction flow of the present invention;

FIG. 11 is a machine feature labeling flow chart in data labeling of the present invention;

FIG. 12 is a flow chart of manual feature labeling in data labeling according to the present invention;

FIG. 13 is a schematic diagram of the knowledge-graph-based document retrieval system of the present invention;

FIG. 14 is a schematic structural diagram of an intelligent terminal for knowledge-graph-based document retrieval.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.

It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.

As used in this specification and the appended claims, the term "if" may be interpreted in context as "when …" or "upon" or "in response to a determination" or "in response to detection". Similarly, the phrase "if a condition or event described is determined" or "if a condition or event described is detected" may be interpreted in context as meaning "upon determination" or "in response to determination" or "upon detection of a condition or event described" or "in response to detection of a condition or event described".

The following description of the embodiments of the present invention will be made more fully hereinafter with reference to the accompanying drawings, in which embodiments of the invention are shown, it being evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.

The knowledge graph is an important research method in the field of artificial intelligence and a subject of academic research spanning multiple directions and domains. It involves research directions such as data mining, entity extraction, relation extraction and information processing, and is applied in domains such as medicine and finance, with knowledge graphs differing from one professional domain to another. A knowledge graph is structured knowledge extracted from a large amount of text; the knowledge generally has to be reviewed and organized by domain experts, so it can usually be regarded as reliable basic knowledge and can serve as a knowledge base for intelligent retrieval, intelligent question answering and intelligent analysis. In recent years, techniques for constructing knowledge graphs have matured along with improvements in natural language models, and knowledge acquisition has shifted from manual curation to machine reading and recognition, so knowledge graphs are being applied ever more widely.

Large language models have made breakthroughs in recent years and various specialized knowledge graphs have appeared, yet prior-art document retrieval algorithms rely on keyword indexes or on multi-factor weighting models based on word frequency, citation count, publication year and the like. In specialized domains such retrieval suffers from low effectiveness, limited functionality and poor understanding of the user's query, which leads to low retrieval accuracy. The invention therefore provides a knowledge-graph-based document retrieval method that uses knowledge graph and natural language processing technology, combined with knowledge mapping, to achieve efficient retrieval of documents in the target domain; it can effectively improve retrieval efficiency and screen out the documents most relevant to the query sentence. First, a mapping from target-domain documents to triples is established and an index is built. After a user inputs a query sentence, the method extracts the entities and relations of the query sentence and performs determined matching or fuzzy matching as the case requires, obtaining the triplet blocks corresponding to the query sentence. The triples matched to the query sentence are then looked up in the document triplet index, and target documents are output according to their matching degree with the query sentence. The method can locate and expand from multiple angles of the user's query, which improves the efficiency with which users retrieve documents, and the knowledge structure contained in the retrieval results also helps users understand those results, so the method has research significance and practical value.

Exemplary method

The embodiment of the application provides a knowledge-graph-based document retrieval method, which is deployed on electronic equipment such as a computer or a server. It involves research directions such as data mining, entity extraction, relation extraction and information processing in domains such as medicine, finance, and science and technology, and targets document retrieval scenarios in which the text corpus of a specific domain is small. Specifically, this embodiment takes medical document retrieval as an example to describe the flow of the knowledge-graph-based document retrieval method in detail. It should be noted that the application field of the method includes but is not limited to the medical field. The flow chart of the method is shown in FIG. 1, and the main steps include:

Step S1000: acquiring a query sentence input by a user.

Specifically, the user inputs a natural language query according to their own needs. The query may be a word, a phrase or a sentence, or any combination of one or more words, phrases or sentences; it is hereinafter referred to as the query sentence.

Step S2000: scanning query sentences by using a prefix dictionary, extracting all entities, and constructing an entity set; generating a directed acyclic graph by using the entities, extracting all entity relations based on the directed acyclic graph, and constructing a relation set.

Specifically, in query sentence entity and relation extraction, the input query sentence is rapidly scanned using a prefix dictionary to extract all words that can possibly be extracted; a directed acyclic graph is then generated from all possible word formations, a dynamic programming algorithm is used to find the optimal path, and extraction is performed along that optimal path.

As shown in FIG. 2, the query sentence entity and relation extraction includes the following steps:

Step 2100: prefix dictionary construction: a prefix dictionary is constructed from the CMeKG (Chinese Medical Knowledge Graph) knowledge graph. All entities (denoted E), relations (denoted R) and values (denoted V) of the knowledge graph are put into the prefix dictionary, and a hash table is used to speed up text scanning. Different class labels are set for entities, relations, values and stop words: 1 is the entity class label, 2 is the relation class label, 3 is the value class label and 4 is the stop word class label. In addition, every prefix of each word is given the class label 0, for example {stream: 0, flow nose: 0, runny nose: 1}, where the first two keys are prefixes of the entity "runny nose".

This embodiment uses the CMeKG knowledge graph for prefix dictionary construction. CMeKG is a Chinese medical knowledge graph developed in a human-machine collaborative manner from large-scale medical text data using natural language processing and text mining technology. In other preferred embodiments, a knowledge graph such as the UMLS Semantic Network, Wikidata, schema, cnSchema or the OMAHA knowledge base can be flexibly selected to construct the prefix dictionary according to the practical application.
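The prefix dictionary of Step 2100 can be sketched as an ordinary hash table. The following is a minimal illustration, not the patented implementation: the function name `build_prefix_dict`, the constant names and the sample words are hypothetical, and the choice that a real word always overrides a prefix label is an assumption of this sketch.

```python
# Class labels as described in Step 2100; 0 marks a pure prefix.
PREFIX, ENTITY, RELATION, VALUE, STOP = 0, 1, 2, 3, 4

def build_prefix_dict(entities, relations, values, stop_words):
    """Map every word to its class label and every proper prefix to 0."""
    table = {}
    groups = ((entities, ENTITY), (relations, RELATION),
              (values, VALUE), (stop_words, STOP))
    for words, label in groups:
        for word in words:
            # Register proper prefixes without overwriting real words.
            for end in range(1, len(word)):
                table.setdefault(word[:end], PREFIX)
            # Assumption: a full word overrides an earlier prefix label.
            table[word] = label
    return table

# Hypothetical sample vocabulary (runny nose, cold, symptom, a stop word).
prefix_dict = build_prefix_dict(
    entities=["流鼻涕", "感冒"],
    relations=["症状"],
    values=[],
    stop_words=["的"],
)
```

Because the table is a plain Python `dict`, each lookup during scanning is a constant-time hash probe, which matches the text's stated reason for using a hash table.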

Step 2200: text scanning using the prefix dictionary: the query sentence is scanned character by character. If the string in the current queue is not in the prefix dictionary, no new word starting with the first character in the queue can be formed, so the first character is dequeued and the remaining characters in the queue continue to be scanned and checked one by one. If the string in the queue appears in the prefix dictionary, it is further checked whether it is a valid word such as an entity or a relation; if so, it is retained, and the next character is then read to continue scanning;

Step 2300: constructing a directed acyclic graph: an initial directed acyclic graph (DAG) is generated with each word of the text as an edge, in which each node is connected to one or more later nodes and the edges are added according to the valid words extracted in the previous step. Each edge is a feasible path segment and represents one way of forming a word, and each path through the directed acyclic graph yields a different entity and relation extraction result;
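The scanning and graph construction of Steps 2200 and 2300 can be sketched as follows. This is an illustrative sketch under stated assumptions: nodes are taken to be character boundaries 0..n, each valid word `text[i:j]` becomes an edge from node i to node j, and a single-character fallback edge with label 0 is added so that every node stays reachable. The fallback edge, the function names and the sample data are assumptions, not details given in the text.

```python
def scan(text, prefix_dict):
    """Return (start, end, label) for every valid word in text (end exclusive)."""
    matches = []
    for start in range(len(text)):
        end = start + 1
        # Extend the fragment while it is still a known prefix or word.
        while end <= len(text) and text[start:end] in prefix_dict:
            label = prefix_dict[text[start:end]]
            if label != 0:  # label 0 means "prefix only", not a valid word
                matches.append((start, end, label))
            end += 1
    return matches

def build_dag(text, matches):
    """Adjacency map: dag[i] lists (j, label) edges for words text[i:j]."""
    dag = {i: [] for i in range(len(text) + 1)}
    for start, end, label in matches:
        dag[start].append((end, label))
    for i in range(len(text)):
        # Assumed fallback: a single-character edge with label 0.
        if not any(j == i + 1 for j, _ in dag[i]):
            dag[i].append((i + 1, 0))
    return dag

# Hypothetical prefix dictionary and query ("symptoms of a cold").
prefix_dict = {"感": 0, "感冒": 1, "的": 4, "症": 0, "症状": 2}
text = "感冒的症状"
matches = scan(text, prefix_dict)
dag = build_dag(text, matches)
```

Every path from node 0 to node `len(text)` in `dag` corresponds to one candidate segmentation of the query sentence.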

Step 2400: determining the best combination by dynamic programming: the weight of each word is defined by combining its word frequency, word length and priority, and this weight is mapped to the weight of the corresponding edge in the directed acyclic graph. The state of a point on the directed acyclic graph is defined as the weight sum from the starting point to that point. The optimal state of each point is the largest value, over all points from which the current point can be reached, of that point's state plus the transfer cost (i.e. the weight of the directed edge), that is, the maximum weight sum;

The weight formula is:

where $typeWeight_{word}$ denotes the weight priority of the word, ordered from high to low as relation > entity > value > stop word; $frequency_{word}$ denotes the word frequency of the word; and $length_{word}$ denotes the word length of the word.

The specific steps of the dynamic programming algorithm are as follows:

1) Based on the nodes of the directed acyclic graph, construct a weight matrix $W$ and a position matrix $pre$, both with initial values of 0, and obtain the adjacency matrix $P$ of the directed acyclic graph, from which $W$ and $pre$ are calculated. $P_{ij}$ represents the adjacency between node $i$ and node $j$: a nonzero $P_{ij}$ indicates that there is a path from node $i$ to node $j$, and its value indicates the class label to which the corresponding word belongs.

2) Traverse all rows of $P$ starting from the first row. For each nonzero $P_{ij}$ in the $i$-th row, update the weight matrix $W$ with the following update formula:

$$W_j = \max\{W_j,\ [(j - i + 1) \cdot ty(P_{ij}) + W_i]\}$$

where $ty(\cdot)$ represents the class label weight (for example, the entity class label weight is set to 1, the relation class label weight to 1.1, the stop word class label weight to 0.2 and the value class label weight to 0.55), $W_j$ represents the $j$-th value in the $W$ array and $W_i$ represents the $i$-th value in the $W$ array. Whenever $W$ is updated, the position matrix $pre$ is updated at the same time.

3) After $W$ has been updated, reading starts from the last entry of the position matrix $pre$, and each value read is used as the index of the next entry to read, until the index reaches 0. For each edge on the resulting path, the corresponding segment is read out of the text, and the prefix dictionary is used to identify its extraction type.
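The three dynamic programming steps above can be sketched as a longest-path search over the directed acyclic graph. This is a minimal sketch under stated assumptions: edges are given as `(i, j, label)` over character boundaries; the edge weight is taken to be segment length times class label weight, modeled on the update rule in step 2), with word frequency left out; the weight for label 0 (unmatched single characters) is invented for the example; and unreachable nodes are initialized to negative infinity rather than 0 so they cannot be mistaken for valid states.

```python
# Class label weights quoted in the text: entity 1, relation 1.1,
# value 0.55, stop word 0.2. The weight for label 0 is an assumption.
TYPE_WEIGHT = {0: 0.1, 1: 1.0, 2: 1.1, 3: 0.55, 4: 0.2}

def best_path(n, edges):
    """Return the edges of the maximum-weight-sum path from node 0 to node n."""
    neg_inf = float("-inf")
    W = [neg_inf] * (n + 1)   # W[j]: best weight sum from node 0 to node j
    pre = [None] * (n + 1)    # pre[j]: edge used to reach node j optimally
    W[0] = 0.0
    # Every edge satisfies i < j, so sorting by start node is a topological order.
    for i, j, label in sorted(edges):
        if W[i] == neg_inf:
            continue  # node i cannot be reached from the start
        cand = W[i] + (j - i) * TYPE_WEIGHT[label]
        if cand > W[j]:
            W[j], pre[j] = cand, (i, j, label)
    # Backtrack from the last node through pre, as in step 3).
    path, node = [], n
    while node != 0:
        if pre[node] is None:
            raise ValueError("node %d is unreachable" % node)
        path.append(pre[node])
        node = pre[node][0]
    return path[::-1]

# Hypothetical edges for a 5-character query.
edges = [(0, 2, 1), (0, 1, 0), (1, 2, 0), (2, 3, 4),
         (3, 5, 2), (3, 4, 0), (4, 5, 0)]
path = best_path(5, edges)
```

With these toy weights the entity, stop word and relation edges win over the single-character fallbacks, so the returned path is the segmentation a reader would expect.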

Step 3000: training a word vector model by adopting a preset word vector training model and a preset training corpus, carrying out fuzzy matching on unregistered words to obtain matched word vectors, and forming a word vector group by the matched word vectors; and obtaining candidate words in the query sentence, calculating the matching degree of the word vector group and the candidate words, extracting the candidate words with the highest matching degree from the word vector group, and adding the candidate words into the entity set.

Specifically, for fuzzy matching of unregistered words, a Continuous Bag-of-Words (CBOW) model is used for word vector training, candidate words are obtained through the inverted index of the words in the query sentence, and the Word Mover's Distance (WMD) algorithm is then used to calculate the matching degree between the input word vector group and each candidate word; the candidate word with the highest matching degree is the matching result.

The word vector training model adopted in this embodiment is the Continuous Bag-of-Words (CBOW) model in word2vec. In other preferred embodiments, word vectors can be trained with other models, such as the skip-gram model in word2vec, GloVe, fastText, ELMo, GPT, BERT or XLNet, selected flexibly according to system configuration, data volume, accuracy and/or efficiency.
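To make the CBOW setup concrete, the sketch below builds character-level training pairs, where the model would learn to predict a center character from its surrounding context characters. Only the sample construction is shown, not the training itself; the function name `cbow_samples`, the window size and the sample text are assumptions for illustration.

```python
def cbow_samples(text, window=2):
    """Build (context_chars, center_char) pairs for character-level CBOW."""
    chars = list(text)
    samples = []
    for pos, center in enumerate(chars):
        left = chars[max(0, pos - window):pos]
        right = chars[pos + 1:pos + 1 + window]
        context = left + right
        if context:  # skip one-character texts that have no context
            samples.append((context, center))
    return samples

# Hypothetical four-character text with a window of one character per side.
samples = cbow_samples("小儿感冒", window=1)
```

In practice a library would normally perform the training; for example, gensim's `Word2Vec` class selects CBOW when `sg=0`.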

As shown in FIG. 3, the unregistered word fuzzy matching includes the following steps:

Step 3100: training word vectors: character vector training is performed using a Continuous Bag-of-Words model. The abstracts of all medical documents in the target database are concatenated to serve as the training corpus of the character vector model, and the text of the training corpus is split into individual characters to form training samples of character sequences;

Step 3200: inverted index of characters: each entity word, relation word and value in the knowledge graph is split into single characters, and each character is stored in an inverted index, generating an index from each character to the words that contain it. All entities related to each character can then be found through the inverted index;

Step 3300: obtaining fuzzy words: the exactly matched words and the stop words are removed, and the remaining unlabeled characters are spliced together to obtain the fuzzy words.
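The inverted index of Step 3200 and the candidate lookup it enables can be sketched as below. The function names and the sample terms are hypothetical; the sketch simply maps each character to the knowledge-graph terms containing it and collects every term that shares at least one character with the fuzzy word.

```python
from collections import defaultdict

def build_inverted_index(terms):
    """Map each character to the set of terms that contain it."""
    index = defaultdict(set)
    for term in terms:
        for ch in term:
            index[ch].add(term)
    return index

def candidates(fuzzy_word, index):
    """Collect every indexed term sharing at least one character with fuzzy_word."""
    found = set()
    for ch in fuzzy_word:
        found |= index.get(ch, set())
    return found

# Hypothetical knowledge-graph terms and fuzzy word.
index = build_inverted_index(["小儿感冒", "小儿发热", "高血压"])
cands = candidates("幼儿感冒", index)
```

The candidate set is deliberately broad; ranking it by matching degree is left to the Word Mover's Distance step.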

For example, in the unregistered word fuzzy matching process shown in FIG. 4, suppose the input word is "child typhoid". Character vectors are obtained by training a Continuous Bag-of-Words model on the character sequences of the article abstracts in the medical documents. Candidate words such as "little typhoid", "child cold" and "child fever" are obtained through the inverted index of the characters in the query sentence. The Word Mover's Distance algorithm is then used to calculate the matching degree between the character vectors in the word vector group and each candidate word: the matching degrees between "child typhoid" and the candidate words "little typhoid", "child cold" and "child fever" are 0.88, 0.82 and 0.66 respectively. The candidate word with the highest matching degree is therefore "little typhoid", which is the final matching result.
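The ranking in this example can be illustrated with a simplified stand-in for Word Mover's Distance. The sketch below uses the relaxed variant, in which each character is matched to its nearest character on the other side, rather than the exact optimal-transport computation. The toy two-dimensional vectors, the placeholder characters and the conversion `1 / (1 + distance)` from distance to matching degree are assumptions; the text does not state how the matching degree is derived from the distance.

```python
import math

def relaxed_wmd(a, b, vec):
    """Relaxed Word Mover's Distance between two character sequences."""
    def one_way(src, dst):
        # Each source character moves to its nearest destination character.
        return sum(min(math.dist(vec[s], vec[d]) for d in dst)
                   for s in src) / len(src)
    # Take the larger of the two one-directional relaxations.
    return max(one_way(a, b), one_way(b, a))

def rank(query, cands, vec):
    """Sort candidates by an assumed matching degree of 1 / (1 + distance)."""
    scored = [(1.0 / (1.0 + relaxed_wmd(query, c, vec)), c) for c in cands]
    return sorted(scored, reverse=True)

# Toy character vectors; the letters stand in for trained character vectors.
vec = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 1.0), "d": (5.0, 5.0)}
ranking = rank("ab", ["ac", "ad"], vec)
```

The first element of `ranking` plays the role of the best-matching candidate in the example above.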

Step 4000: based on the entity set and the relationship set, a triplet is generated.

Specifically, after entity and relation extraction is performed on the query sentence, an entity set, a relation set and a value set are obtained, and triplets are generated from them. Triplet generation is handled in four cases: determined relation, fuzzy relation, no relation with a single entity, and no relation with multiple entities.

As shown in fig. 5 and 6, the triplet matching includes the following steps:

Step 4100: determined-relation processing: an entity relation set R' is generated from the relations among the entities in the entity set, and a relation set R is generated from the extracted relations. When the relation set is not empty and is not a fuzzy relation, every relation in R' is matched against every relation in R using a TransE relation matching operation, and R' is sorted by matching value to obtain a sorted queue of R'. The first n relations of the queue are taken as blocks, and all triplets of the entity corresponding to each relation are extracted. TransE entity matching is performed between the tail entity of each triplet in a block and the words in the entity set and value set other than the head entity, and the triplets within each block are sorted by the resulting matching value to generate the determined-relation triplet blocks;

Step 4200: fuzzy-relation processing: when the relation set is not empty and is a fuzzy relation, the relations in R' are fuzzily matched against the fuzzy relation using the word mover's distance algorithm, and R' is sorted by matching value to obtain a sorted queue of R'. The first n relations are taken as blocks, and all triplets of the entity corresponding to each relation are extracted. TransE entity matching is performed between the tail entity of each triplet in a block and the words in the entity set and value set other than the head entity, and the triplets within each block are sorted by the resulting matching value to generate the fuzzy-relation triplet blocks;

Step 4300: no-relation, single-entity processing: when the relation set is empty and only one entity exists, all triplets of the entity are retrieved from the knowledge graph and grouped by relation into relation blocks. The blocks are sorted by the number of triplets they contain, and the first n relation blocks are taken as the single-entity triplets;

Step 4400: no-relation, multi-entity processing: when the relation set is empty and multiple entities exist, a path optimization algorithm is used to find all shortest paths between the entities in the entity set. All triplets on these paths are grouped by relation into relation blocks, the blocks are sorted by the number of triplets they contain, and the first n relation blocks are taken as the multi-entity triplets.
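The choice between the four cases, and the block construction of step 4300, can be sketched as follows. This is an illustrative outline only: the names `select_case` and `single_entity_blocks`, the `is_fuzzy` flag, the toy triplets and the alphabetical tie-break between equally sized blocks are assumptions, and the TransE and shortest-path computations of the other steps are not shown.

```python
from collections import defaultdict


def select_case(relations, entities, is_fuzzy):
    """Pick which of steps 4100-4400 applies to the extracted sets."""
    if relations:
        return "fuzzy_relation" if is_fuzzy else "determined_relation"
    if not entities:
        raise ValueError("at least one entity is required when no relation is given")
    return "single_entity" if len(entities) == 1 else "multi_entity"


def single_entity_blocks(entity, kg_triplets, n):
    """Step 4300: group the entity's triplets by relation and keep the n largest blocks."""
    blocks = defaultdict(list)
    for head, relation, tail in kg_triplets:
        if entity in (head, tail):
            blocks[relation].append((head, relation, tail))
    # Larger blocks first; ties broken alphabetically (an assumption).
    ranked = sorted(blocks.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return ranked[:n]


# Hypothetical knowledge-graph fragment.
kg = [
    ("typhoid", "symptom", "fever"),
    ("typhoid", "symptom", "headache"),
    ("typhoid", "treated_by", "antibiotics"),
    ("cold", "symptom", "cough"),
]
case = select_case(relations=[], entities=["typhoid"], is_fuzzy=False)
top_blocks = single_entity_blocks("typhoid", kg, n=1)
```

Here the query contains one entity and no relation, so the single-entity case applies and the "symptom" block, which holds the most triplets, is kept.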

Step 5000: extracting all target domain entities in the target domain documents, forming target domain entity pairs, and acquiring all triplets containing the target domain entity pairs from the target domain knowledge graph to obtain a triplet set; based on the triplet set, obtaining the target domain documents matched with each triplet in the set to obtain a document and triplet pair set, and sorting the target domain documents in the document and triplet pair set according to association degree to obtain the document index corresponding to the triplets.

Specifically, all entities in the medical documents are extracted to form entity pairs, and the triplets containing these entity pairs are obtained from the medical knowledge graph. A pre-trained BERT model is fine-tuned so that its output better matches the expected output in the medical field, a document mapping model is obtained through weakly supervised training, and the index between triplets and documents is obtained by sorting the output weak class-label probabilities.

As shown in fig. 7, in the medical document triplet index construction, the steps are as follows:

Step 5100: constructing document and triplet pairs: document and triplet pairs are used for data labeling and for forming the index. First, the abstract and title of each medical document are taken from the medical document database and combined, and all entities in them are extracted with an entity relation extraction model. The entities are combined in pairs to generate entity pairs, all triplets corresponding to each entity pair are found in the knowledge graph and merged into the triplet set of the medical document database, and each triplet in the set is paired with the ID of every medical document it matches to form the document and triplet pair set;

Step 5200: data labeling: the generated document and triplet pairs are labeled using two indicators, importance and relevance, and an association-degree class label is added. Importance indicates whether the two entities of the triplet are main subjects of the document or appear in important parts of it. Relevance indicates whether the relation of the triplet is expressed in the semantics of the document or is part of what the document describes. Data labeling is divided into machine feature labeling and manual labeling, and the pre-trained BERT model is fine-tuned on the labeled data so that it performs better at mapping documents to triplets;

Step 5300: model training and prediction: the document mapping model is first trained on the machine-feature-labeled data to obtain a feature model. The feature model learns the machine-labeled features but does not understand relation semantics accurately, so training continues from the feature model on the manually labeled data to obtain a prediction model. The prediction model is then used for large-scale prediction on the document and triplet pairs, and the predicted probabilities are sorted to obtain the index between triplets and documents.
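The final part of step 5300, turning predicted probabilities into a triplet-to-document index, can be sketched as below. The sketch assumes that the prediction model yields a single association score per document and triplet pair; how the weak class probabilities are reduced to such a score is not specified in the source. The function name `build_triplet_index` and the sample scores are hypothetical.

```python
from collections import defaultdict


def build_triplet_index(predictions):
    """Build {triplet: [doc_id, ...]} with documents ordered by descending score.

    predictions is an iterable of (doc_id, triplet, score) tuples.
    """
    grouped = defaultdict(list)
    for doc_id, triplet, score in predictions:
        grouped[triplet].append((score, doc_id))
    return {
        triplet: [doc_id for _, doc_id in sorted(items, key=lambda x: (-x[0], x[1]))]
        for triplet, items in grouped.items()
    }


# Hypothetical model outputs for one triplet.
T = ("typhoid", "symptom", "fever")
triplet_index = build_triplet_index([("D1", T, 0.9), ("D2", T, 0.4), ("D3", T, 0.7)])
```

Looking up a triplet in the resulting index returns its documents in order of predicted association, which is the form consumed by the ranking stage.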

The model used for document triplet index construction in this embodiment is a weakly supervised learning-to-rank model. It uses a self-attention mechanism to convert the abstract and title of a document into a text vector representation, computes its correlation with the embedded sequence of the triplet, and finally applies a fully connected layer and a softmax layer to classify weak class labels; documents are ranked by association degree according to the probability values of the weak class labels. The model is built by fine-tuning a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model. The NSP (Next Sentence Prediction) mechanism is used to capture the association between the document and the triplet, and a fully connected classification layer is added after the last Transformer layer of BERT for weakly supervised classification.

To illustrate the document and triplet pair generation flow, as shown in fig. 8: first, the title and abstract of each medical document are read from the medical document database, and entities are extracted from them with a preset entity relation extraction model to obtain an entity set. The entities in the entity set are combined in pairs into entity pairs, and all triplets in the medical knowledge graph that contain both entities of a pair are collected to form a triplet set. Finally, documents and triplets are matched by the prediction model obtained by training the document mapping model, generating the document and triplet pair set.
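The candidate generation part of this flow (entity pairs, then knowledge-graph lookup) can be sketched as follows. Entity extraction and the prediction model are outside the scope of the sketch: the per-document entity sets are assumed to be given, and the function name `document_triplet_pairs` and the sample data are hypothetical.

```python
from itertools import combinations


def document_triplet_pairs(doc_entities, kg_triplets):
    """Return (doc_id, triplet) pairs where both triplet entities occur in the document.

    doc_entities maps a document ID to the set of entities extracted
    from its title and abstract.
    """
    # Index triplets by their unordered (head, tail) entity pair.
    by_pair = {}
    for head, relation, tail in kg_triplets:
        by_pair.setdefault(frozenset((head, tail)), []).append((head, relation, tail))

    pairs = []
    for doc_id, entities in doc_entities.items():
        for a, b in combinations(sorted(entities), 2):
            for triplet in by_pair.get(frozenset((a, b)), []):
                pairs.append((doc_id, triplet))
    return pairs


# Hypothetical extraction results and knowledge-graph fragment.
docs = {"D1": {"typhoid", "fever", "aspirin"}, "D2": {"cold"}}
kg = [("typhoid", "symptom", "fever"), ("cold", "symptom", "cough")]
pairs = document_triplet_pairs(docs, kg)
```

Only document D1 mentions both entities of a triplet, so it yields the single candidate pair; D2 contains one entity and therefore forms no entity pair.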

Step 6000: and sorting the document indexes according to the association degree, obtaining a retrieval result and outputting the retrieval result.

Specifically, as shown in fig. 9, in the search result ranking, the following steps are included:

Step 6100: obtaining document blocks: the triplets obtained in step 4000 are sorted by association degree to form triplet blocks, the document ID sequence corresponding to each triplet is obtained according to step 5000, and each sequence is sorted by association degree to obtain the document blocks;

Step 6200: lateral window ordering: a lateral window is defined whose width is the number of highest-priority documents extracted for each triplet in one pass. Based on the set document window, the top-ranked documents of each document sequence are cut off and their document IDs are arranged side by side in a row; this is repeated until all document IDs have been cut and arranged by association degree.

For example, as shown in fig. 10, the document number sequence is constructed as follows. First, according to the document and triplet pair set, the triplets are arranged into a triplet sequence in descending order of matching degree, and the documents corresponding to each triplet are arranged by document ID into a document sequence in descending order of association degree; that is, the triplet sequence contains multiple triplets, each with its own document sequence. Next, the top-ranked documents in each document sequence are cut off according to the set document window, and the cut document IDs are arranged side by side in a row. The same cut is then applied to what remains of each document sequence, and the newly cut document IDs are appended after the row from the previous cut. This continues until all document IDs are arranged in a single row by association degree.
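The lateral window ordering described above can be sketched as follows. The function name `lateral_window_order` is hypothetical, and skipping a document ID that was already emitted for an earlier triplet is an assumption; the source does not say how duplicates across document sequences are handled.

```python
def lateral_window_order(doc_sequences, window):
    """Interleave per-triplet document sequences, `window` documents at a time.

    doc_sequences is a list of document ID lists, one per triplet, with
    triplets in descending matching order and each list in descending
    association order.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    ordered, seen = [], set()
    longest = max((len(seq) for seq in doc_sequences), default=0)
    offset = 0
    while offset < longest:
        # One pass: cut the next `window` documents from every sequence in turn.
        for seq in doc_sequences:
            for doc_id in seq[offset:offset + window]:
                if doc_id not in seen:  # assumed de-duplication
                    seen.add(doc_id)
                    ordered.append(doc_id)
        offset += window
    return ordered


# Three triplets with their ranked documents (hypothetical IDs).
sequences = [["d1", "d2", "d3"], ["d4", "d2"], ["d5"]]
result = lateral_window_order(sequences, window=2)
```

With a window of 2, the first pass takes the top two documents of every triplet before any third-ranked document is considered, so lower-ranked triplets still surface their best documents early in the result list.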

In the step 5200, adding machine feature labels and manual labels to data comprises the following steps:

Step 5210: machine feature labeling.

As shown in fig. 11, machine feature labeling includes the following steps:

Step 5211: importance labeling: importance is divided into three classes, as shown in table 1, and measures entity coverage and distribution over important regions. Important regions are the core parts of a document, such as the title, purpose and conclusion; abstracts of Chinese medical documents generally follow a purpose, method, result and conclusion structure;

TABLE 1

Step 5212: relevance labeling: the relevance classes, as shown in table 2, measure features such as whether the head and tail entities appear in the same sentence and whether words related to the relation appear in that sentence;

TABLE 2

Step 5213: integrating into major classes: based on the importance class and the relevance class, the importance and relevance labels are integrated into a unified major class label with five classes, as shown in table 3, where a slash in the table denotes "or". Major class 1 indicates that the association between the triplet and the document is highest, and major class 5 indicates that it is lowest. For example, when the importance class is 1 and the relevance class is 1, the major class is set to 1; when the importance class is 1 and the relevance class is 2, the major class is set to 3.

TABLE 3

Step 5220: manual labeling.

As shown in fig. 12, manual labeling includes the following steps:

Step 5221: manual understanding: an annotator reads and understands the main content of the document and the triplet relation;

Step 5222: manual association labeling: the association degree is divided into five classes, as shown in table 4. The labels mainly measure the importance of the two entities of the triplet in the document, the consistency between the triplet relation and the document, and the degree to which the knowledge expressed by the triplet is relevant to the document. Class 1 indicates the highest association between the triplet and the document.

TABLE 4

Exemplary System

As shown in fig. 13, corresponding to the above-mentioned knowledge-graph-based document retrieval method, the embodiment of the present invention further provides a knowledge-graph-based document retrieval system, which includes an interaction unit 1 and a retrieval unit 2, wherein the interaction unit includes an input module 11 for receiving a query sentence and an output module 12 for outputting a document retrieval result, and the retrieval unit includes a query sentence entity and relationship extraction module 21, an unregistered word fuzzy matching module 22, a matching triplet module 23, a medical document triplet index construction module 24, and an association module 25.

The query sentence entity and relationship extraction module 21 is configured to perform entity and relationship extraction based on the query sentence input in the input module 11, scan the query sentence using the prefix dictionary, extract all entities, and construct an entity set; generating a directed acyclic graph by using the entities, extracting all entity relations based on the directed acyclic graph, constructing a relation set, and outputting the relation set to a matching triplet module 23;

The unregistered word fuzzy matching module 22 is configured to perform fuzzy matching on unregistered words in the query sentence received by the input module 11. It trains a preset word vector training model on a preset training corpus, performs fuzzy matching on the unregistered words to obtain matched word vectors, and forms a word vector group from the matched word vectors. It then obtains candidate words in the query sentence, calculates the matching degree between the word vector group and the candidate words, extracts the candidate word with the highest matching degree, adds it to the entity set received from the query sentence entity and relationship extraction module 21, and outputs the entity set to the matching triplet module 23;

a matching triplet module 23, configured to generate a triplet based on the received entity set and the relationship set, and output the generated triplet to the medical document triplet index building module 24;

The medical document triplet index construction module 24 is configured to extract all target domain entities in the target domain documents, form target domain entity pairs, and acquire all triplets containing the target domain entity pairs from the target domain knowledge graph to obtain a triplet set; based on the triplet set, it obtains the target domain documents matched with each triplet in the set to obtain a document and triplet pair set, sorts the target domain documents in the document and triplet pair set according to association degree to obtain the document index corresponding to the triplets, and outputs the document index to the association module 25;

and the association module 25 is used for sorting the document indexes according to the association degree, obtaining and outputting the retrieval result.

The specific working process of the system of the embodiment is as follows:

The query sentence entity and relationship extraction module 21 extracts entities and relations from the query sentence received by the input module 11; the unregistered word fuzzy matching module 22 performs fuzzy matching on unregistered words in that query sentence; the matching triplet module 23 matches the extracted entities and relations with the triplets in the knowledge graph; the medical document triplet index construction module 24 extracts the entities and relations of the medical documents and matches them with the triplets in the knowledge graph to build the document index; the association module 25 associates the triplets obtained by the matching triplet module 23 with the document triplets obtained by the medical document triplet index construction module 24, looks up the corresponding documents through the index, and sends the retrieval result to the output module 12, which outputs it to the interaction interface using lateral window ordering.

Specifically, in this embodiment, the specific function of the knowledge-graph-based document retrieval system may refer to the corresponding description in the knowledge-graph-based document retrieval method, which is not described herein again.

Based on the above embodiment, the present invention also provides an intelligent terminal, a functional block diagram of which may be as shown in fig. 14. The intelligent terminal comprises a processor, a memory, a network interface and a display screen connected through a system bus. The processor of the intelligent terminal provides computing and control capabilities. The memory of the intelligent terminal comprises a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system and a knowledge-graph-based document retrieval program. The internal memory provides an environment for running the operating system and the knowledge-graph-based document retrieval program stored in the nonvolatile storage medium. The network interface of the intelligent terminal is used to communicate with an external terminal through a network connection. When executed by the processor, the knowledge-graph-based document retrieval program implements the steps of any one of the above knowledge-graph-based document retrieval methods. The display screen of the intelligent terminal may be a liquid crystal display screen or an electronic ink display screen.

It will be appreciated by those skilled in the art that the schematic block diagram shown in fig. 14 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the smart terminal to which the present inventive arrangements are applied, and that a particular smart terminal may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In one embodiment, an intelligent terminal is provided, where the intelligent terminal includes a memory, a processor, and a knowledge-graph-based document retrieval program stored in the memory and capable of running on the processor, where the knowledge-graph-based document retrieval program implements any one of the steps of the knowledge-graph-based document retrieval method provided in the embodiment of the present invention when executed by the processor.

The embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a document retrieval program based on a knowledge graph, and the document retrieval program based on the knowledge graph realizes any step of the document retrieval method based on the knowledge graph provided by the embodiment of the invention when being executed by a processor.

It should be understood that the sequence number of each step in the above embodiment does not mean the sequence of execution, and the execution sequence of each process should be determined by its function and internal logic, and should not be construed as limiting the implementation process of the embodiment of the present invention.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present invention. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.

In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of the modules or units described above is merely a logical function division, and may be implemented in other manners, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed.

The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the invention and are also within the scope of the invention.

Claims

1. The document retrieval method based on the knowledge graph is characterized by comprising the following steps of:

acquiring a query sentence input by a user;

training a preset character vector training model and a preset training corpus, carrying out fuzzy matching on unregistered words to obtain matched character vectors, and forming a word vector group by the matched character vectors; obtaining candidate words in the query sentence, calculating the matching degree of the word vector group and the candidate words, extracting the candidate words with the highest matching degree, and adding the candidate words into the entity set;

Generating a triplet based on the entity set and the relation set;

extracting all target domain entities in a target domain document, forming target domain entity pairs, and acquiring all triples comprising the target domain entity pairs from a target domain knowledge graph to obtain a triplet set; based on the triplet set, obtaining target domain documents matched with all triples in the triplet set, obtaining a document and triplet pair set, and sorting the target domain documents in the document and triplet pair set according to the association degree to obtain a document index;

2. The knowledge-graph-based document retrieval method of claim 1, wherein said generating a directed acyclic graph using said entities and extracting all entity relationships based on said directed acyclic graph, constructing a relationship set, comprises:

generating a directed acyclic graph with each entity as an edge;

3. The knowledge-graph-based document retrieval method as recited in claim 1, wherein said obtaining candidate words in the query sentence comprises:

4. The knowledge-graph-based document retrieval method according to claim 3, wherein said calculating the degree of matching between the word vector group and the candidate words, and extracting the candidate word with the highest degree of matching, adding the candidate word to the entity set, includes:

5. The knowledge-graph-based document retrieval method as recited in claim 1, wherein said generating a triplet based on said set of entities and said set of relationships comprises:

6. The knowledge-graph-based document retrieval method according to claim 1, wherein the ranking the documents according to the degree of association with the target domain documents in the triplet pair set to obtain a document index includes:

7. The knowledge-graph-based document retrieval method according to claim 6, wherein the adding labels to the target domain document and triplet pairs based on importance and relevance to obtain a prediction model includes:

8. The document retrieval system based on the knowledge graph is characterized by comprising an interaction unit and a retrieval unit, wherein the interaction unit comprises an input module for receiving query sentences and an output module for outputting document retrieval results, and the retrieval unit comprises a query sentence entity and relationship extraction module, an unregistered word fuzzy matching module, a matching triplet module, a medical document triplet index construction module and a correlation module.

The query sentence entity and relation extraction module is used for extracting entities and relations based on the query sentences input in the input module, scanning the query sentences by using a prefix dictionary, extracting all the entities and constructing an entity set; generating a directed acyclic graph by utilizing the entity, extracting all entity relations based on the directed acyclic graph, and constructing a relation set;

9. A smart terminal, characterized in that it comprises a memory, a processor and a knowledge-graph-based document retrieval program stored on the memory and executable on the processor, which knowledge-graph-based document retrieval program, when executed by the processor, implements the steps of the knowledge-graph-based document retrieval method according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a knowledge-graph based document retrieval program, which when executed by a processor, implements the steps of the knowledge-graph based document retrieval method according to any one of claims 1-7.