CN117854735A

CN117854735A - Medical text retrieval method and system based on knowledge injection

Info

Publication number: CN117854735A
Application number: CN202311863508.XA
Authority: CN
Inventors: 姚娟娟
Original assignee: Shanghai Mingping Medical Data Technology Co ltd
Current assignee: Shanghai Mingping Medical Data Technology Co ltd
Priority date: 2023-12-29
Filing date: 2023-12-29
Publication date: 2024-04-09

Abstract

The invention provides a medical text retrieval method and system based on knowledge injection, comprising the following steps. Entity recognition is performed on the input sentence to detect the position and category of each entity. In the knowledge injection stage, the predicates and objects of the corresponding entities are retrieved from the knowledge graph and associated with those entities to form a semantically enhanced sentence tree. The sentence tree is serialized to generate an integrated description text. A corresponding visual matrix is generated from the integrated description text. The integrated description text and the visual matrix are fed into a language model, which computes a word vector for each word and a sentence vector for the whole sentence. The input sentence is sent to an Elasticsearch search engine, which returns a list of candidate texts. A similarity score between the integrated description text and each candidate text is computed, the candidate texts are reordered by this score, and the ranked result list is returned. By extracting information from the medical knowledge graph to complete missing knowledge, the invention supplements the input text with medical background information.

Description

Medical text retrieval method and system based on knowledge injection

Technical Field

The invention relates to the field of medical data retrieval, in particular to a medical text retrieval method and system based on knowledge injection.

Background

At present, many text retrieval systems use deep neural networks to process text and obtain word vectors and sentence vectors that represent it. However, the deep neural networks used in commercial systems can only capture semantic information present in the training corpus; they lack professional background knowledge from outside the corpus and awareness of how external things relate to one another.

Patent document CN115687574A discloses a text retrieval method, device, terminal device, and storage medium. It determines the retrieval results corresponding to a target retrieval text according to that text and a pre-established ranking model, where the retrieval results include at least ranking parameters and the weight value of each ranking parameter. The retrieval results are then sorted according to the ranking parameters and their weight values to obtain the target retrieval results. The ranking model is obtained by supervised machine learning on a large amount of historical query information and the corresponding document information; when a target retrieval text is received, the ranking model sorts all documents in the retrieval results so that the results closest to the target retrieval text are displayed. Although this method can retrieve information related to the text, it cannot supplement the text with domain-specific background knowledge and the corresponding associated information.

In the medical field, users typically know only their own symptoms and find it difficult to enter professional medical terms, which makes retrieval results inaccurate. Domain knowledge and knowledge of the associations between things therefore need to be supplemented.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a medical text retrieval method and system based on knowledge injection.

The medical text retrieval method based on knowledge injection provided by the invention comprises the following steps:

step 1: carrying out entity identification on the input sentences, and detecting the positions and the categories of the entities;

step 2: retrieving predicates and objects of the corresponding entities in the knowledge graph, and associating the retrieved predicates and objects with the corresponding entities to form a semantic enhanced sentence tree;

step 3: serializing the sentence tree to generate an integrated description text;

step 4: generating a corresponding visual matrix according to the integrated description text;

step 5: sending the integrated description text and the corresponding visual matrix into a language model, and calculating a corresponding word vector;

step 6: sending the input sentences into a search engine, and returning corresponding candidate texts;

step 7: calculating a similarity score between the integrated description text and each candidate text, reordering the candidate texts according to the similarity score, and returning the ranking result.

Further, the step 1 includes:

in the text rewriting stage, each entity a in the sentence tree, together with its extracted predicate b and object c, is rewritten into a natural sentence according to the part of speech of predicate b: when predicate b is a noun, the triple is rewritten in the form "the b of a is c"; when predicate b is a verb, it is rewritten in the form "a b c";

in the text splicing stage, the resulting natural sentences are appended as supplementary sentences after the input sentence to form the integrated description text.

Further, the step 5 includes:

performing text cleaning and word-level segmentation on the integrated description text, mapping each word to a word vector of length 768, and unifying every text to a fixed length of 512 words by padding with preset characters or truncating overlong text;

the masking operation constructs, for each text, a vector of length 512 (matching the fixed text length of 512 words) composed of 0s and 1s, where 1 indicates that the position holds a text character and 0 indicates that it holds a padding character; when attention is calculated, the value at padding positions is 0, so that no attention is allocated to padding characters;

when attention is calculated between word vectors, if the corresponding mask value is 0 or the corresponding visual matrix value is 0, the attention value is set to 0 and no attention calculation is performed; otherwise, attention is calculated normally.

Further, the step 6 includes:

sending the original input sentence to an Elasticsearch search engine, which automatically performs word segmentation, indexing, and scoring and returns a ranked list of candidate texts.

Further, the step 7 includes:

calculating the tf-idf weight of each word of the integrated description text;

performing the following operations on each candidate text: (1) feeding the candidate text into a BERT language model, calculating the word vector of each word in the text, and averaging the word vectors to obtain a sentence vector; (2) calculating the word vector similarity from the word vectors of the candidate text and the word vectors of the integrated description text; (3) calculating the sentence vector similarity from the sentence vector of the candidate text and the sentence vector of the integrated description text; (4) adding the word vector similarity and the sentence vector similarity to obtain the final similarity score between the candidate text and the integrated description text;

reordering the candidate texts according to the final similarity score, wherein the higher the score, the higher the rank.

According to the invention, a medical text retrieval system based on knowledge injection comprises:

module M1: carrying out entity identification on the input sentences, and detecting the positions and the categories of the entities;

module M2: retrieving predicates and objects of the corresponding entities in the knowledge graph, and associating the retrieved predicates and objects with the corresponding entities to form a semantic enhanced sentence tree;

module M3: serializing the sentence tree to generate an integrated description text;

module M4: generating a corresponding visual matrix according to the integrated description text;

module M5: sending the integrated description text and the corresponding visual matrix into a language model, and calculating a corresponding word vector;

module M6: sending the input sentences into a search engine, and returning corresponding candidate texts;

module M7: calculating a similarity score between the integrated description text and each candidate text, reordering the candidate texts according to the similarity score, and returning the ranking result.

Further, the module M1 includes:

Further, the module M5 includes:

Further, the module M6 includes:

Further, the module M7 includes:

Compared with the prior art, the invention has the following beneficial effects:

1. By means of knowledge injection, the invention extracts information from the medical knowledge graph to complete missing knowledge, thereby supplementing the input text with medical background information.

2. The invention computes semantic similarity from both word vectors and sentence vectors, so that the similarity calculation takes both word-level and sentence-level granularity into account.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:

FIG. 1 is a workflow diagram of the present invention;

FIG. 2 is a flowchart illustrating an embodiment of the present invention;

FIG. 3 is a schematic diagram of a knowledge injection process according to an embodiment of the invention;

FIG. 4 is a schematic diagram of a process for generating an integrated descriptive text in accordance with an embodiment of the present invention;

FIG. 5 is a schematic diagram of a visual matrix generating process according to an embodiment of the present invention.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.

Example 1

As shown in FIG. 1, taking the common cold as an example, a medical text retrieval method based on knowledge injection includes:

Step 1: As shown in FIG. 2, entity recognition is performed on the user's input sentence, and the position and category of each entity are detected. From the input sentence "I started having a cold today and then developed a fever. Will taking Kang Taike help?", the entities "cold", "fever", and "Kang Taike" are identified.
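The text does not say which recognizer is used for Step 1. As a minimal sketch only, the following assumes a simple dictionary matcher; the lexicon, the category names, and the function `recognize_entities` are hypothetical and stand in for whatever entity recognition model is actually deployed.

```python
# Hypothetical lexicon (surface form -> category); not taken from the patent.
LEXICON = {"cold": "disease", "fever": "symptom", "Kang Taike": "drug"}

def recognize_entities(sentence, lexicon=LEXICON):
    """Return (entity, start, end, category) tuples sorted by position."""
    found = []
    lowered = sentence.lower()
    for surface, category in lexicon.items():
        needle = surface.lower()
        start = lowered.find(needle)
        while start != -1:
            found.append((surface, start, start + len(surface), category))
            start = lowered.find(needle, start + 1)
    return sorted(found, key=lambda item: item[1])

sentence = "I caught a cold today and then a fever. Will taking Kang Taike help?"
entities = recognize_entities(sentence)
```

Each tuple carries the character span of the entity, which is the position information that later steps use to tie supplementary sentences back to the original words.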

Step 2: In the knowledge injection stage, the predicates and objects of the corresponding entities are retrieved from the knowledge graph and associated with those entities to form a semantically enhanced sentence tree structure. Taking the entity "cold" as an example, its associated site is the upper respiratory tract and its associated manifestation is sneezing. For the entity "Kang Taike", the associated component is pseudoephedrine and the associated treatment target is the cold.
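The knowledge injection of Step 2 can be sketched as a lookup from each recognized entity to its (predicate, object) pairs. The dictionary layout, the predicate labels, and the function `build_sentence_tree` below are illustrative assumptions; a real system would query a medical knowledge graph store.

```python
# Toy knowledge graph: entity -> list of (predicate, object) pairs.
# The triples mirror the example in the text; the storage layout is assumed.
KNOWLEDGE_GRAPH = {
    "cold": [("site", "upper respiratory tract"), ("manifests as", "sneezing")],
    "fever": [("site", "head")],
    "Kang Taike": [("component", "pseudoephedrine"), ("treats", "cold")],
}

def build_sentence_tree(sentence, entities, graph=KNOWLEDGE_GRAPH):
    """Attach each entity's triples as branches hanging off the input sentence."""
    branches = {entity: list(graph.get(entity, [])) for entity in entities}
    return {"trunk": sentence, "branches": branches}

tree = build_sentence_tree(
    "I caught a cold today and then a fever. Will taking Kang Taike help?",
    ["cold", "fever", "Kang Taike"],
)
```

Entities with no entry in the graph simply get an empty branch list, so the original sentence passes through unchanged when nothing can be injected.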

Step 3: As shown in FIG. 3, the sentence tree is serialized to generate the integrated description text.

Step 3.1: In the text rewriting stage, each entity a in the sentence tree structure, together with its extracted predicate b and object c, is rewritten into a natural sentence according to the part of speech of predicate b. When predicate b is a noun, the triple is rewritten in the form "the b of a is c"; when predicate b is a verb, it is rewritten in the form "a b c". For the input sentence "I started having a cold today and then developed a fever. Will taking Kang Taike help?", the rewritten sentences are "the site of a cold is the upper respiratory tract", "a cold manifests as sneezing", "the site of fever is the head", "the component of Kang Taike is pseudoephedrine", and "Kang Taike treats colds".

Step 3.2: In the text splicing stage, the resulting natural sentences are appended as supplementary sentences after the original input sentence to form the integrated description text. The final spliced text is: "I started having a cold today and then developed a fever. Will taking Kang Taike help? The site of a cold is the upper respiratory tract. A cold manifests as sneezing. The site of fever is the head. The component of Kang Taike is pseudoephedrine. Kang Taike treats colds."
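The rewriting and splicing of Step 3 can be sketched as two small functions. This is an illustration under assumptions: the part of speech of each predicate is supplied as a boolean flag rather than detected, the noun template "The b of a is c." is inferred from the worked example, and the names `rewrite_triple` and `serialize` are hypothetical.

```python
def rewrite_triple(entity, predicate, obj, predicate_is_noun):
    """Noun predicate -> 'The b of a is c.'; verb predicate -> 'a b c.'"""
    if predicate_is_noun:
        return f"The {predicate} of {entity} is {obj}."
    return f"{entity} {predicate} {obj}."

def serialize(sentence, triples):
    """Append one supplementary sentence per triple after the input sentence."""
    supplements = [rewrite_triple(*triple) for triple in triples]
    return " ".join([sentence] + supplements)

# (entity, predicate, object, predicate_is_noun); the flag is an assumed input.
triples = [
    ("cold", "site", "upper respiratory tract", True),
    ("cold", "manifests as", "sneezing", False),
    ("Kang Taike", "component", "pseudoephedrine", True),
    ("Kang Taike", "treats", "cold", False),
]
text = serialize("Will taking Kang Taike help my cold?", triples)
```

Keeping the original sentence first and the supplements after it preserves the user's wording while making the injected knowledge easy to locate when the visual matrix is built.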

Step 4: As shown in FIG. 4, the corresponding visual matrix is generated from the integrated description text. A value of 0 in the visual matrix means that the two words are invisible to each other and no attention is calculated between them. A nonzero value means that the two words can see each other and attention is calculated normally. The generation rule is that each supplementary sentence is visible only to its corresponding entity in the original sentence and is invisible to all other words, which reduces the semantic noise introduced by knowledge injection.
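One way to realize the generation rule of Step 4 is sketched below. The text only states that a supplementary sentence is visible to its own entity; this sketch additionally assumes that words of the original sentence see one another and that words inside the same supplementary sentence see one another. The segment encoding and the function `build_visual_matrix` are hypothetical.

```python
def build_visual_matrix(segments, anchors):
    """Build an n x n 0/1 matrix; 1 means words i and j can see each other.

    segments[i]: 0 for a word of the original sentence, k > 0 for supplement k.
    anchors[k]:  set of original-sentence indices of the entity that
                 supplement k describes.
    """
    n = len(segments)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            si, sj = segments[i], segments[j]
            if si == sj:                                    # same sentence
                matrix[i][j] = 1
            elif si == 0 and sj > 0 and i in anchors[sj]:   # entity -> its supplement
                matrix[i][j] = 1
            elif sj == 0 and si > 0 and j in anchors[si]:   # supplement -> its entity
                matrix[i][j] = 1
    return matrix

# Words: "cold", "and", "fever" (original) + "cold", "manifests", "sneezing" (supplement 1)
segments = [0, 0, 0, 1, 1, 1]
anchors = {1: {0}}          # supplement 1 describes the word at index 0 ("cold")
visual = build_visual_matrix(segments, anchors)
```

In this toy case the word "fever" (index 2) cannot see the supplement about "cold", which is the noise reduction the step describes.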

Step 5: As shown in FIG. 5, the integrated description text and the corresponding visual matrix are fed into a language model, and the word vector corresponding to each word is calculated.

Step 5.1: Text cleaning and word-level segmentation are performed on the text data, and each word is mapped to a word vector of length 768. Every text is unified to a fixed length of 512 words by padding with specific characters or truncating overlong text. The masking operation constructs, for each text, a vector of length 512 (matching the fixed text length of 512 words) composed of 0s and 1s, where 1 indicates that the position holds a text character and 0 indicates that it holds a padding character. When attention is calculated, the value at padding positions is 0, so that no attention is allocated to padding characters.
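The padding, truncation, and mask construction of Step 5.1 can be sketched as follows. The fixed length 512 comes from the text; the padding token name `[PAD]` and the function `pad_and_mask` are assumptions made for illustration.

```python
MAX_LEN = 512      # fixed text length stated in the text
PAD = "[PAD]"      # assumed name for the padding character

def pad_and_mask(tokens, max_len=MAX_LEN, pad=PAD):
    """Truncate or pad tokens to max_len and return (tokens, 0/1 mask)."""
    kept = tokens[:max_len]
    n_pad = max_len - len(kept)
    padded = kept + [pad] * n_pad
    mask = [1] * len(kept) + [0] * n_pad   # 1 = text character, 0 = padding
    return padded, mask

short_tokens, short_mask = pad_and_mask(["cold", "fever"])
long_tokens, long_mask = pad_and_mask(["w"] * 600)
```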

Step 5.2: When attention is calculated between word vectors, if the corresponding mask value is 0 or the corresponding visual matrix value is 0, the attention value is set to 0 and no attention calculation is performed. In all other cases, attention is calculated normally.
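The combined effect of the padding mask and the visual matrix in Step 5.2 can be illustrated on a single attention head. This is a simplified sketch, not the language model's actual implementation: it takes precomputed raw scores, applies the mask on the attended-to (key) side only, and normalizes each row with a softmax over the permitted positions so that blocked pairs receive weight 0. The function name is hypothetical.

```python
import math

def masked_attention_weights(scores, pad_mask, visual_matrix):
    """Row-wise softmax over raw scores; blocked pairs get weight exactly 0."""
    n = len(scores)
    weights = []
    for i in range(n):
        allowed = [j for j in range(n)
                   if pad_mask[j] == 1 and visual_matrix[i][j] != 0]
        row = [0.0] * n
        if allowed:
            top = max(scores[i][j] for j in allowed)   # numerical stability
            exps = {j: math.exp(scores[i][j] - top) for j in allowed}
            total = sum(exps.values())
            for j in allowed:
                row[j] = exps[j] / total
        weights.append(row)
    return weights

scores = [[0.0] * 3 for _ in range(3)]
pad_mask = [1, 1, 0]                         # third position is padding
visual = [[1, 0, 1], [1, 1, 1], [1, 1, 1]]   # word 0 cannot see word 1
attn = masked_attention_weights(scores, pad_mask, visual)
```

Production implementations usually get the same effect by adding a large negative number to blocked scores before the softmax instead of filtering positions explicitly.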

Step 6: The original input sentence is sent to an Elasticsearch search engine, which automatically performs word segmentation, indexing, and scoring and returns a ranked list of candidate texts.
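A candidate retrieval request for Step 6 could look like the sketch below, which only builds the request body for a full-text `match` query. The field name `content`, the result size of 50, and the function `build_candidate_query` are assumptions; the text does not describe the index layout or how many candidates are requested.

```python
def build_candidate_query(input_sentence, field="content", size=50):
    """Build a full-text match query body for the original input sentence."""
    return {
        "query": {"match": {field: input_sentence}},  # analyzed full-text match
        "size": size,                                 # number of candidates returned
    }

body = build_candidate_query("I caught a cold today and then a fever.")
# The body would be posted to the `_search` endpoint of the index that holds
# the medical texts; the index name is deployment specific and not given here.
```

Note that the original input sentence, not the integrated description text, is used for this first-pass retrieval; the injected knowledge only influences the later reranking.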

Step 7: A similarity score between the integrated description text and each candidate text is calculated, the candidate texts are reordered according to the similarity score, and the ranking result is returned.

Step 7.1: the tf-idf score weight is computed for each word of the integrated descriptive text.
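The text does not give the tf-idf formula or the reference corpus for Step 7.1. The sketch below assumes term frequency within the integrated description text and a smoothed inverse document frequency over a small tokenized corpus; both the variant and the function `tfidf_weights` are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_weights(query_tokens, corpus):
    """Return {word: tf * idf} for the query; corpus is a list of token lists."""
    counts = Counter(query_tokens)
    n_docs = len(corpus)
    weights = {}
    for word, count in counts.items():
        tf = count / len(query_tokens)
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0   # smoothed idf (assumed variant)
        weights[word] = tf * idf
    return weights

corpus = [["cold", "fever"], ["cold", "cough"], ["headache"]]
weights = tfidf_weights(["cold", "sneezing"], corpus)
```

Words that are rare in the corpus, such as "sneezing" here, receive a larger weight than common ones and therefore count for more in the word-level similarity.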

Step 7.2: The language model is used to calculate the word vectors of every candidate text, and the word vectors of each text are then averaged to obtain the corresponding sentence vector.
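The averaging in Step 7.2 is plain mean pooling over the word vectors. A minimal sketch, using short toy vectors instead of 768-dimensional ones and a hypothetical function name:

```python
def mean_pool(word_vectors):
    """Average a list of equal-length word vectors into one sentence vector."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    dim = len(word_vectors[0])
    count = len(word_vectors)
    return [sum(vec[d] for vec in word_vectors) / count for d in range(dim)]

sentence_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

Whether padding positions are excluded before averaging is not stated in the text; excluding them is a common choice so that padding does not dilute the sentence vector.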

Step 7.3: For each candidate text, the cosine between the word vectors of the integrated description text and the word vectors of the candidate text is calculated as the similarity score between individual words; the similarity scores of all words are then combined in a weighted average to obtain the word-level similarity score, where each weight is the tf-idf weight of the corresponding word in the integrated description text.
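The text does not say how words of the integrated description text are paired with words of the candidate text in Step 7.3. The sketch below assumes that each query word is scored by its best-matching candidate word (maximum cosine) and that these scores are then averaged with the tf-idf weights; the pairing rule and function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def word_level_similarity(query_vecs, query_weights, cand_vecs):
    """Weighted average, over query words, of each word's best cosine match."""
    total = sum(query_weights)
    if total == 0 or not cand_vecs:
        return 0.0
    best = [max(cosine(q, c) for c in cand_vecs) for q in query_vecs]
    return sum(w * s for w, s in zip(query_weights, best)) / total

score = word_level_similarity(
    query_vecs=[[1.0, 0.0], [0.0, 1.0]],
    query_weights=[3.0, 1.0],      # tf-idf weights of the two query words
    cand_vecs=[[1.0, 0.0]],
)
```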

Step 7.4: For each candidate text, the cosine between the sentence vector of the integrated description text and the sentence vector of the candidate text is calculated as the sentence-level similarity score.

Step 7.5: The final similarity score equals the word-level similarity plus the sentence-level similarity. The candidate texts are reordered according to the final similarity score; the higher the score, the higher the rank.
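The final scoring and reordering of Step 7.5 reduces to a sum and a descending sort. A minimal sketch with made-up similarity values and a hypothetical function name:

```python
def rerank(candidates):
    """candidates: list of (text, word_sim, sentence_sim); highest total first."""
    scored = [(text, word_sim + sent_sim) for text, word_sim, sent_sim in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)

ranked = rerank([
    ("candidate a", 0.2, 0.3),
    ("candidate b", 0.6, 0.5),
    ("candidate c", 0.1, 0.1),
])
```

Because the two similarities are added without weights, both contribute equally; any relative weighting would be a design choice beyond what the text describes.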

Example 2

The invention also provides a medical text retrieval system based on knowledge injection, which can be implemented by executing the steps of the medical text retrieval method based on knowledge injection; that is, those skilled in the art can regard the method as a preferred embodiment of the system. The system comprises:

Module M1: Entity recognition is performed on the input sentence, and the position and category of each entity are detected.

Module M2: The predicates and objects of the corresponding entities are retrieved from the knowledge graph and associated with those entities to form a semantically enhanced sentence tree.

Module M3: The sentence tree is serialized to generate an integrated description text.

Module M4: A corresponding visual matrix is generated from the integrated description text.

Module M5: The integrated description text and the corresponding visual matrix are fed into a language model, and the corresponding word vectors are calculated.

Module M6: The input sentence is sent to a search engine, which returns the corresponding candidate texts.

Those skilled in the art will appreciate that, in addition to implementing the system and its devices, modules, and units as pure computer-readable program code, the method steps can be logically programmed so that the system and its devices, modules, and units realize the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, the system and its devices, modules, and units may be regarded as a hardware component, and the devices, modules, and units it includes for realizing various functions may be regarded as structures within that hardware component; they may also be regarded both as software modules implementing the method and as structures within the hardware component.

The foregoing describes specific embodiments of the present invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the invention. The embodiments of the present application and features in the embodiments may be combined with each other arbitrarily without conflict.

Claims

1. A medical text retrieval method based on knowledge injection, comprising:

2. The knowledge-based medical text retrieval method as recited in claim 1, wherein said step 1 includes:

3. The knowledge-based medical text retrieval method as recited in claim 1, wherein said step 5 includes:

4. The knowledge-based medical text retrieval method as recited in claim 1, wherein said step 6 includes:

5. The knowledge-based medical text retrieval method as recited in claim 1, wherein said step 7 includes:

6. A knowledge injection-based medical text retrieval system, comprising:

7. The knowledge-based medical text retrieval system as recited in claim 6, wherein said module M1 includes:

8. The knowledge-based medical text retrieval system as recited in claim 6, wherein said module M5 includes:

9. The knowledge-based medical text retrieval system as recited in claim 6, wherein said module M6 includes:

10. The knowledge-based medical text retrieval system as recited in claim 6, wherein said module M7 includes: