CN118227769B - Knowledge graph enhancement-based large language model question-answer generation method
- Publication number: CN118227769B
- Application number: CN202410653158.2A
- Authority: CN (China)
- Prior art keywords: entity, representing, knowledge, text, representation
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
- G06F16/367—Ontology
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G06F40/30—Semantic analysis
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/08—Learning methods
- G06N5/027—Frames
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change
Abstract
The application relates to a knowledge-graph-enhanced large language model question-answer generation method, which comprises the following steps: constructing an external knowledge base of medical knowledge, where the external knowledge base is in knowledge graph form; acquiring the user's historical dialogue text and obtaining a text context embedded representation through an encoder; extracting entity mentions from the question text in the historical dialogue text and linking them to target entities in the external knowledge base; querying the external knowledge base for medical knowledge triples related to the target entities; passing the text context embedded representation and the medical knowledge triples together through a multi-head attention mechanism and then a fully connected feed-forward network, thereby computing the distribution over knowledge triples and obtaining the final knowledge triple representation; and finally combining, in a decoder, the text context embedded representation and the final knowledge triple representation with the knowledge triple distribution to generate a response utterance. The method improves the utilization of external medical knowledge and the accuracy of the responses.
Description
Technical Field
The application relates to the technical field of large language model question-answer generation, and in particular to a knowledge-graph-enhanced large language model question-answer generation method.
Background
After pre-training on large-scale data, a pre-trained model carries a certain amount of knowledge that can enhance its responses. However, when specific knowledge of a particular domain is required, the model may still produce inaccurate or misleading responses, and may even hallucinate harmful false facts; the interpretability of such models is also poor. Medical dialogue systems are especially challenged by the rich domain knowledge they require, and how to model this domain knowledge has become a hot research problem.
Knowledge graphs, as a method for representing and organizing knowledge, capture real-world information and concepts in a structured, semantic fashion, describe facts between instances, and are flexible in most application environments. However, existing methods for integrating external knowledge into generative pre-trained language models transfer relational knowledge only through post-training on individual knowledge triples, ignoring the rich structural and semantic information in the knowledge graph.
Disclosure of Invention
Based on this, there is a need for a knowledge-graph-enhanced large language model question-answer generation method, which includes:

S1: constructing an external knowledge base of medical knowledge, where the external knowledge base is in knowledge graph form;

S2: acquiring the user's historical dialogue text and passing it through an encoder to obtain a text context embedded representation;

S3: extracting entity mentions from the question text in the historical dialogue text, linking the entity mentions to target entities in the external knowledge base, and querying the external knowledge base for medical knowledge triples related to the target entities;

S4: passing the text context embedded representation and the medical knowledge triples together through a multi-head attention mechanism and then a fully connected feed-forward network, thereby computing the distribution over knowledge triples and obtaining the final knowledge triple representation;

S5: generating a response utterance by combining, in a decoder, the text context embedded representation, the final knowledge triple representation, and the knowledge triple distribution.
Beneficial effects: the method acquires the user's historical dialogue text and obtains a text context embedded representation through an encoder; extracts entity mentions from the question text and links them to target entities in the external knowledge base; queries the external knowledge base for related medical knowledge triples; passes the text context embedded representation and the medical knowledge triples together through a multi-head attention mechanism and a fully connected feed-forward network to compute the knowledge triple distribution and the final knowledge triple representation; and finally combines these in a decoder to generate a response utterance. This improves the utilization of external medical knowledge and the accuracy of the responses.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the application, and other drawings may be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a flowchart of a knowledge-graph-enhancement-based large language model question-answer generation method according to an embodiment of the present application.
Detailed Description
In order that the above objects, features and advantages of the application may be readily understood, a more particular description of the application is given below with reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application may, however, be embodied in many forms other than those described herein and may similarly be modified by those skilled in the art without departing from its spirit; the application is therefore not limited to the specific embodiments disclosed below.

Furthermore, the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two (for example, two or three), unless specifically defined otherwise.
As shown in fig. 1, the present embodiment provides a knowledge graph enhancement-based large language model question-answer generation method, which includes:
s1: and constructing an external knowledge base based on medical knowledge, wherein the external knowledge base is in a knowledge graph form.
Specifically, the external knowledge base comprises a plurality of triples, in the forms {head entity, relation, tail entity} and {entity, attribute, attribute value}. Medical knowledge entities and their corresponding entity description texts are extracted from the medical knowledge; triples of the form {head entity, relation, tail entity} are constructed from the relations among the medical knowledge entities; triples of the form {entity, attribute, attribute value} are constructed from each medical knowledge entity, with its entity description text as the attribute value; and entity alignment is performed on all the triples to obtain the external knowledge base.
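For concreteness, the sketch below shows one way such a triple store could be represented in Python; the dataclass layout and the sample entries (diabetes, insulin) are illustrative assumptions, not the patented storage format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str      # head entity, or the entity itself for attribute triples
    relation: str  # relation name, or the attribute name (e.g. "description")
    tail: str      # tail entity, or the attribute value

# Relation triples {head entity, relation, tail entity} and attribute triples
# {entity, attribute, attribute value}; the concrete entries are illustrative.
external_kb = {
    Triple("diabetes", "treated_with", "insulin"),
    Triple("diabetes", "has_symptom", "metabolic abnormality"),
    Triple("diabetes", "description", "a chronic disease of glucose metabolism ..."),
}
```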
In this embodiment, the sources of medical knowledge include public datasets for knowledge graph construction, data crawled from medical websites, and medical literature. For the semi-structured data crawled from medical websites, triple information is extracted according to defined extraction rules; for the unstructured medical literature, triple information is extracted with a BERT+BiLSTM+CRF model.

The BERT+BiLSTM+CRF model is a composite deep learning architecture widely used for named entity recognition. It consists of a BERT pre-trained language model (Bidirectional Encoder Representations from Transformers), a Bidirectional Long Short-Term Memory network (BiLSTM), and a Conditional Random Field (CRF).

BERT is a pre-trained language model capable of understanding and generating natural language text; BiLSTM is a recurrent neural network capable of processing sequence data; the CRF is a conditional random field capable of identifying structural patterns in a sequence.

The workflow of the BERT+BiLSTM+CRF model is as follows. First, the input text is passed through the BERT pre-trained language model to obtain word vectors that capture the contextual information of the vocabulary. The word vectors output by BERT are then fed into the BiLSTM module for further processing, which extracts sequence features while taking context into account. Finally, the CRF module decodes the BiLSTM output into a predicted label sequence, from which each entity is extracted and classified, completing the entity recognition process. Thanks to its deep contextual understanding, bidirectional information processing, and sequence-level optimization, the BERT+BiLSTM+CRF model exhibits strong performance on named entity recognition tasks.
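The following is a minimal sketch of such a BERT+BiLSTM+CRF tagger in PyTorch. The `pytorch-crf` package, the checkpoint name, and the tag-set size are assumptions for illustration; the patent does not specify these details.

```python
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf (assumed dependency)

class BertBiLSTMCRF(nn.Module):
    """BERT word vectors -> BiLSTM sequence features -> CRF tag decoding."""
    def __init__(self, bert_name="bert-base-chinese", num_tags=7, lstm_hidden=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)      # contextual word vectors
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * lstm_hidden, num_tags)  # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)            # sequence-level decoding

    def forward(self, input_ids, attention_mask, tags=None):
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(h)                      # sequence features with context
        emissions = self.emission(h)
        mask = attention_mask.bool()
        if tags is not None:                     # training: CRF negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: best tag paths
```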
The medical knowledge entities include disease names (e.g., diabetes), drug names (e.g., insulin), symptom names (e.g., metabolic abnormality), examination item names, food names, and department names.
In this embodiment, the entity description texts in the external knowledge base can be overly long and redundant; redundant descriptions may cause confusion and misunderstanding, especially when different entities are given similar descriptions, making them hard to distinguish accurately and thereby affecting the understanding and use of the information. This step therefore further includes redundancy-elimination preprocessing of the entity description text with the TextRank algorithm, as follows (a code sketch of the full procedure is given after Step 5):
Step 1: dividing the entity description text into a number of sentences; taking the sentences as nodes and the similarity between two sentences as the weight of the edge between them. The similarity calculation formula is:

$$\mathrm{Sim}(S_i, S_j) = \frac{\left|\{\, w \mid w \in S_i \text{ and } w \in S_j \,\}\right|}{\log |S_i| + \log |S_j|}$$

where $\mathrm{Sim}(S_i, S_j)$ is the similarity between sentences $S_i$ and $S_j$; $S_i$ and $S_j$ denote the $i$-th and $j$-th sentences; $|S_i|$ denotes the sentence length; and $\{\, w \mid w \in S_i \text{ and } w \in S_j \,\}$ is the set of words that appear in both sentences.
Step 2: initializing the weight value of every sentence to $1/n$ and iteratively recomputing the weight of sentence $i$ from all other sentences connected to it and their pairwise similarities. The iterative calculation formula is:

$$W_i = (1 - d) + d \sum_{S_j \in \mathrm{In}(S_i)} \frac{\mathrm{Sim}(S_j, S_i)}{\sum_{S_k \in \mathrm{Out}(S_j)} \mathrm{Sim}(S_j, S_k)}\, W_j$$

where $W_i$ is the weight value of the $i$-th sentence; $\mathrm{In}(S_i)$ is the set of sentences pointing to $S_i$; $\mathrm{Out}(S_j)$ is the set of sentences that $S_j$ points to; $S_k$ is the $k$-th sentence pointed to by $S_j$; $d$ is the damping ratio, typically set to 0.85; and $W_j$ is the weight value of the $j$-th sentence.
Step 3: repeating Step 2 until the weight value of every sentence has been computed;
Step 4: adjusting the weight of each sentence according to its entity coverage and its position in the entity description text to obtain an importance score for each sentence. The importance score calculation formula is:

$$W'_i = W_i \odot \lambda \left( \hat{F}_C + \hat{F}_P \right)^{T}$$

$$\hat{F}_C = \frac{F_C}{\lVert F_C \rVert}, \qquad F_C = [C_1, C_2, \ldots, C_n], \qquad C_i = \frac{e_i}{n}$$

$$\hat{F}_P = \frac{F_P}{\lVert F_P \rVert}, \qquad F_P = [p_1, p_2, \ldots, p_n]$$

where $W'_i$ is the importance score of the $i$-th sentence; $\odot$ denotes the Hadamard product; $\lambda$ is an adjustment parameter; $C_i$ is the entity coverage of the $i$-th sentence and $e_i$ the number of entity names it contains; $p_i$ is the position of the $i$-th sentence in the entity description text; $n$ is the number of sentences in the entity description text; $\hat{F}_C$ is the normalization of $F_C$, the feature matrix of sentence entity coverage; $\hat{F}_P$ is the normalization of $F_P$, the feature matrix of sentence positions; and $T$ denotes the transpose.
Step 5: selecting the x sentences with the highest importance scores as candidate text summaries and combining all candidate summaries to obtain the redundancy-eliminated entity description text.
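A compact sketch of the whole redundancy-elimination procedure follows; the whitespace sentence splitting, the position feature, and the adjustment parameter `lam` are illustrative assumptions rather than the exact patented choices.

```python
import math
import numpy as np

def textrank_summarize(sentences, entities, top_x=3, d=0.85, n_iter=50, lam=0.5):
    """Steps 1-5: TextRank weights, adjusted by entity coverage and position."""
    n = len(sentences)
    words = [set(s.split()) for s in sentences]
    # Step 1: edge weights = word-overlap similarity between sentence pairs
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and len(words[i]) > 1 and len(words[j]) > 1:
                sim[i, j] = len(words[i] & words[j]) / (
                    math.log(len(words[i])) + math.log(len(words[j])))
    # Steps 2-3: iterate the TextRank update from weights initialized to 1/n
    w = np.full(n, 1.0 / n)
    out_sum = sim.sum(axis=1)
    for _ in range(n_iter):
        contrib = np.divide(w, out_sum, out=np.zeros_like(w), where=out_sum > 0)
        w = (1 - d) + d * sim.T.dot(contrib)
    # Step 4: adjust by normalized entity-coverage and position features
    cov = np.array([sum(e in s for e in entities) for s in sentences], float)
    pos = np.array([n - i for i in range(n)], float)  # assumption: earlier is better
    cov /= np.linalg.norm(cov) + 1e-12
    pos /= np.linalg.norm(pos) + 1e-12
    score = w * (lam * (cov + pos))                   # elementwise (Hadamard) product
    # Step 5: keep the top-x sentences, restoring document order
    keep = sorted(np.argsort(-score)[:top_x])
    return " ".join(sentences[i] for i in keep)
```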
S2: and acquiring a user history dialogue text, wherein the user history dialogue text is subjected to an encoder to obtain a text context embedded representation.
Specifically, the user's historical dialogue text is recorded as $C = \{u_1, u_2, \ldots, u_{k-1}\}$, where $u_1$ denotes the 1st utterance and $u_{k-1}$ the $(k-1)$-th utterance. The process of obtaining the text context embedded representation is:

Step 1: inserting special tokens into the historical dialogue text to delimit each utterance. After inserting the special tokens, the historical dialogue text is expressed as:

$$C = \{[\mathrm{CLS}],\; w_1^1, \ldots, w_{n_1}^1,\; [\mathrm{SEP}],\; w_1^2, \ldots, w_{n_2}^2,\; [\mathrm{SEP}],\; \ldots,\; w_1^{k-1}, \ldots, w_{n_{k-1}}^{k-1},\; [\mathrm{SEP}]\}$$

where $[\mathrm{CLS}]$ is the special token marking the beginning of the session; $[\mathrm{SEP}]$ is the special token that marks the end of the conversation at the final position and segments utterances at intermediate positions; $w_j^i$ is the $j$-th word of the $i$-th utterance; and $n_i$ is the number of words in the $i$-th utterance, so that $w_1^1$ is the 1st word of the 1st utterance and $w_{n_{k-1}}^{k-1}$ the $n_{k-1}$-th word of the $(k-1)$-th utterance.

Step 2: inputting the historical dialogue text with the special tokens inserted into the encoder to generate the text context embedded representation. The calculation formula is:

$$H = \mathrm{Encoder}(C)$$

where $H$ denotes the text context embedded representation; $\mathrm{Encoder}$ is an encoder implemented with the BioBERT pre-trained model; and $C$ is the historical dialogue text with the special tokens inserted.
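A minimal sketch of this encoding step with the Hugging Face transformers library; the `dmis-lab/biobert-base-cased-v1.1` checkpoint and the use of the standard [CLS]/[SEP] tokens are assumptions, since the patent names only "BioBERT".

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "dmis-lab/biobert-base-cased-v1.1"  # assumed public BioBERT checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)

utterances = ["I feel thirsty all the time.",
              "How long has this lasted?",
              "About two weeks, and I have lost weight."]

# Step 1: [CLS] opens the session; [SEP] segments utterances and closes the dialogue.
text = tok.cls_token + tok.sep_token.join(utterances) + tok.sep_token
batch = tok(text, add_special_tokens=False, return_tensors="pt")

# Step 2: the encoder output H is the text context embedded representation.
with torch.no_grad():
    H = encoder(**batch).last_hidden_state  # shape: (1, sequence_length, hidden_size)
```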
S3: extracting entity mentions from the question text in the historical dialogue text and linking them to target entities in the external knowledge base; then querying the external knowledge base for related medical knowledge triples according to the target entities.
Specifically, the process of obtaining the medical knowledge triples is:

Step 1: extracting entity mentions from the question text in the historical dialogue text with the BERT+BiLSTM+CRF model;

Step 2: computing the similarity between the entity mention and each medical knowledge entity in the external knowledge base, taking the medical knowledge entity with the highest similarity as the target entity, and linking the entity mention to that target entity in the external knowledge base.
Further, computing the similarity between the entity mention and each medical knowledge entity in the external knowledge base comprises:

Step 2.1: encoding each medical knowledge entity in the external knowledge base to obtain a candidate entity encoding; encoding the entity description text of each medical knowledge entity to obtain an entity description sentence vector; and encoding the entity mention to obtain a mention encoding;

Step 2.2: concatenating each candidate entity encoding with its entity description sentence vector to obtain the semantic text representation vector of the candidate entity; and concatenating the mention encoding with its corresponding text context embedded representation to obtain the semantic text representation vector of the mention;

Step 2.3: computing the similarity between the semantic text representation vector of the mention and that of the candidate entity, with the calculation formula:

$$\mathrm{sim}(v_m, v_e) = \cos(v_m, v_e) = \frac{v_m \cdot v_e}{\lVert v_m \rVert\, \lVert v_e \rVert}$$

where $\mathrm{sim}(v_m, v_e)$ is the similarity between the semantic text representation vector of the mention and that of the candidate entity; $\cos(\cdot,\cdot)$ denotes cosine similarity; $v_m$ is the semantic text representation vector of the mention; and $v_e$ is the semantic text representation vector of the candidate entity.
Step 2.4: repeating Step 2.3 until all medical knowledge entities have been processed, yielding the similarity between the entity mention and each medical knowledge entity in the external knowledge base.
Step 3: querying the external knowledge base for the medical knowledge triples that contain the target entity; a code sketch of this linking-and-querying procedure follows.
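The vector inputs below are assumed to come from the encoders of Step 2.1, and the knowledge-base layout is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def link_mention(mention_vec, context_vec, kb):
    """Steps 2.2-2.4: pick the KB entity whose [entity; description] vector is
    most cosine-similar to the mention's [mention; context] vector.
    `kb` maps entity name -> (entity_vector, description_vector)."""
    query = torch.cat([mention_vec, context_vec])   # semantic text vector of the mention
    best_name, best_sim = None, -1.0
    for name, (ent_vec, desc_vec) in kb.items():
        cand = torch.cat([ent_vec, desc_vec])       # semantic text vector of the candidate
        sim = F.cosine_similarity(query, cand, dim=0).item()
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name, best_sim

def query_triples(target_entity, triples):
    """Step 3: all triples whose head or tail is the linked target entity."""
    return [t for t in triples if target_entity in (t[0], t[2])]
```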
S4: passing the text context embedded representation and the medical knowledge triples together through a multi-head attention mechanism and then a fully connected feed-forward network, thereby computing the distribution over knowledge triples and obtaining the final knowledge triple representation.
In this embodiment, there is a representation gap between the text context embedded representation obtained from the BioBERT encoder and the raw representation of a medical knowledge triple. This step therefore further includes bridging that gap with a multi-layer perceptron to obtain a vector representation of each medical knowledge triple. The calculation formula is:

$$g_t = \mathrm{MLP}([\,h;\, hs;\, r;\, t;\, ts\,])$$

where $g_t$ denotes the vector representation of the medical knowledge triple; $\mathrm{MLP}$ is the multi-layer perceptron; $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation operation; $h$ is the head entity of the triple; $hs$ is the entity description text of the head entity; $r$ is the relation; $t$ is the tail entity; and $ts$ is the entity description text of the tail entity.
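A minimal sketch of this bridging MLP; the hidden sizes and two-layer shape are assumptions, and the five inputs are assumed to be fixed-size embeddings of the triple components and their description texts.

```python
import torch
import torch.nn as nn

class TripleEncoder(nn.Module):
    """Concatenate [h; hs; r; t; ts] and project into the text embedding space."""
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(5 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, h, hs, r, t, ts):
        # h/t: head/tail entity embeddings; hs/ts: their description embeddings;
        # r: relation embedding -- all assumed to share the encoder dimension.
        return self.mlp(torch.cat([h, hs, r, t, ts], dim=-1))
```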
Further, this step comprises:

predicting the distribution over knowledge triples with a multi-head attention mechanism, based on the text context embedded representation and the medical knowledge triples:

$$\alpha_i^{(m)} = \frac{\left(h_{w_j^k} W_Q^{(m)}\right)\left(g_{t_i} W_K^{(m)}\right)^{T}}{\sqrt{d_k}}$$

$$P_t = \mathrm{softmax}\left(\alpha^{(m)}\right)$$

where $P_t$ denotes the distribution over knowledge triples; $\alpha_i^{(m)}$ is the attention score of the $i$-th triple; $\sqrt{d_k}$ is the scaling factor; $w_j^k$ is the $j$-th word in the $k$-th utterance and $h_{w_j^k}$ its contextual representation; $g_{t_i}$ is the vector representation of the $i$-th triple; and $m$ indexes the heads of the multi-head attention;
the vector representations of the medical knowledge triples and the text context embedded representation then pass together through a multi-head attention mechanism followed by a fully connected feed-forward network to obtain the final knowledge triple representation. The calculation formula is:

$$Z = \mathrm{FFN}\left(\mathrm{MHA}\left(H,\; g_t,\; g_t\right)\right)$$

where $Z$ denotes the final knowledge triple representation; $\mathrm{FFN}$ is the fully connected feed-forward network; $\mathrm{MHA}$ is the multi-head attention mechanism, with $H$, $g_t$, $g_t$ serving as its query, key, and value respectively; $H$ is the text context embedded representation; and $g_{t_k}$ is the vector representation of the $k$-th medical knowledge triple.
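One way to realize this fusion step in PyTorch is sketched below; treating the context as the query and reusing the attention weights as the triple distribution follows the description above, while the dimensions and the head count are assumptions.

```python
import torch.nn as nn

class KnowledgeFusion(nn.Module):
    """Query = context H; key = value = triple vectors; then a feed-forward net."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, H, triple_vecs):
        # H: (B, L, dim) context embeddings; triple_vecs: (B, K, dim) triple vectors
        fused, attn = self.mha(H, triple_vecs, triple_vecs)  # attn: (B, L, K)
        p_triple = attn.mean(dim=1)   # distribution over the K triples (sums to 1)
        Z = self.ffn(fused)           # final knowledge triple representation
        return Z, p_triple
```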
S5: generating a response utterance by combining, in a decoder, the text context embedded representation, the final knowledge triple representation, and the distribution over knowledge triples.
Specifically, the process of obtaining the response utterance is:

S5.1: inputting the text context embedded representation into the decoder to obtain the softmax probability distribution over the dataset vocabulary and the distribution over the vocabulary of the session history. The calculation formulas are:

$$H^{d} = \mathrm{Decoder}(H)$$

$$P_{v} = \mathrm{Softmax}\left(H^{d}\right)$$

$$P_{h}(w) = \sum_{j:\, w_j^k = w} \beta_j^{(m)}$$

where $\mathrm{Decoder}$ is a decoder implemented with the BioBERT pre-trained model; $H$ is the text context embedded representation; $\mathrm{Softmax}$ is the softmax layer of the decoder; $P_{v}$ is the softmax probability distribution over the dataset vocabulary; $P_{h}$ is the distribution over the session-history vocabulary, which aggregates the probability of the $j$-th word $w_j^k$ in the $k$-th utterance; $\beta_j^{(m)}$ is the cross-attention score of the decoder; $w$ denotes a word; and $m$ indexes the heads of the multi-head attention.
S5.2: combining the distribution over knowledge triples, the softmax probability distribution over the dataset vocabulary, and the distribution over the session-history vocabulary with gating probabilities to obtain the final probability distribution. The calculation formulas are:

$$P(w) = g_1\, P_{v}(w) + (1 - g_1)\, P_{k}(w)$$

$$P_{k}(w) = g_2\, P_{h}(w) + (1 - g_2)\, P_{t}(w)$$

$$g_1 = \sigma_1\left(W_1 [H;\, Z] + b_1\right)$$

$$g_2 = \sigma_2\left(W_2 [H;\, Z] + b_2\right)$$

where $P(w)$ denotes the final probability distribution; $g_1$ is the first gating probability and $g_2$ the second; $\sigma_1$ is the first sigmoid layer of the decoder, with training parameters $W_1$ and bias term $b_1$; $\sigma_2$ is the second sigmoid layer of the decoder, with training parameters $W_2$ and bias term $b_2$; $H$ is the text context embedded representation; $Z$ is the final knowledge triple representation; $P_{v}$ is the softmax probability distribution over the dataset vocabulary; $P_{h}$ is the distribution over the session-history vocabulary; $P_{t}$ is the distribution over knowledge triples; and $P_{k}$ is the probability distribution mixing $P_{h}$ and $P_{t}$.
S5.3: emitting the word with the highest final probability as the next word, and repeating until a complete medical dialogue reply has been generated; this medical dialogue reply is the response utterance.
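The gated mixture of S5.2 and the greedy emission of S5.3 can be sketched as follows; how the three distributions are aligned onto one index space, and the pooling of H and Z into single vectors, are implementation assumptions.

```python
import torch
import torch.nn as nn

class GatedOutput(nn.Module):
    """Two sigmoid gates mix vocabulary, history-copy, and triple distributions."""
    def __init__(self, dim=768):
        super().__init__()
        self.gate1 = nn.Linear(2 * dim, 1)  # W1, b1
        self.gate2 = nn.Linear(2 * dim, 1)  # W2, b2

    def forward(self, h_state, z_state, p_vocab, p_hist, p_triple):
        # h_state / z_state: (B, dim) pooled context and final-triple representations;
        # p_*: (B, V) distributions already projected onto a shared vocabulary.
        gate_in = torch.cat([h_state, z_state], dim=-1)
        g1 = torch.sigmoid(self.gate1(gate_in))
        g2 = torch.sigmoid(self.gate2(gate_in))
        p_know = g2 * p_hist + (1 - g2) * p_triple   # knowledge-side mixture
        return g1 * p_vocab + (1 - g1) * p_know      # final next-word distribution

# S5.3 (greedy decoding): next_word = p_final.argmax(dim=-1), repeated until the
# reply is complete.
```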
The knowledge-graph-enhanced large language model question-answer generation method provided by this embodiment acquires the user's historical dialogue text and obtains a text context embedded representation through an encoder; extracts entity mentions from the question text and links them to target entities in the external knowledge base; queries the external knowledge base for the corresponding medical knowledge triples; passes the text context embedded representation and the medical knowledge triples together through a multi-head attention mechanism and a fully connected feed-forward network to obtain the final knowledge triple representation and, from it, the distribution over knowledge triples; and finally combines the text context embedded representation with the knowledge triple distribution in a decoder to generate the response utterance. The method improves the utilization of external medical knowledge and the accuracy of the responses.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any combination of them that involves no contradiction should be considered within the scope of this description.
The above examples illustrate only a few embodiments of the application; their description is detailed but is not to be construed as limiting the scope of the claims. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the spirit of the application, all of which fall within its scope of protection. Accordingly, the scope of protection of the present application is determined by the appended claims.
Claims (8)
1. A knowledge-graph-enhanced large language model question-answer generation method, characterized by comprising:

S1: constructing an external knowledge base of medical knowledge, wherein the external knowledge base is in knowledge graph form;

S2: acquiring the user's historical dialogue text and passing it through an encoder to obtain a text context embedded representation;
the user's historical dialogue text is recorded as $C = \{u_1, u_2, \ldots, u_{k-1}\}$, where $u_1$ denotes the 1st utterance and $u_{k-1}$ the $(k-1)$-th utterance; the process of obtaining the text context embedded representation comprises:

Step 1: inserting special tokens into the historical dialogue text to delimit each utterance; after inserting the special tokens, the historical dialogue text is expressed as:

$$C = \{[\mathrm{CLS}],\; w_1^1, \ldots, w_{n_1}^1,\; [\mathrm{SEP}],\; w_1^2, \ldots, w_{n_2}^2,\; [\mathrm{SEP}],\; \ldots,\; w_1^{k-1}, \ldots, w_{n_{k-1}}^{k-1},\; [\mathrm{SEP}]\}$$

wherein $[\mathrm{CLS}]$ is the special token indicating the beginning of the session; $[\mathrm{SEP}]$ is the special token that marks the end of the conversation at the final position and segments utterances at intermediate positions; $w_j^i$ is the $j$-th word of the $i$-th utterance; and $n_i$ is the number of words in the $i$-th utterance;

Step 2: inputting the historical dialogue text with the special tokens inserted into the encoder to generate the text context embedded representation, with the calculation formula:

$$H = \mathrm{Encoder}(C)$$

wherein $H$ denotes the text context embedded representation; $\mathrm{Encoder}$ is an encoder implemented with the BioBERT pre-trained model; and $C$ is the historical dialogue text with the special tokens inserted;
S3: extracting entity mentions from the question text in the historical dialogue text, linking the entity mentions to target entities in the external knowledge base, and querying the external knowledge base for medical knowledge triples related to the target entities;

S4: passing the text context embedded representation and the medical knowledge triples together through a multi-head attention mechanism and then a fully connected feed-forward network, thereby computing the distribution over knowledge triples and obtaining the final knowledge triple representation;

S5: generating a response utterance by combining, in a decoder, the text context embedded representation, the final knowledge triple representation, and the distribution over knowledge triples;
the process of obtaining the response utterance comprises:

S5.1: inputting the text context embedded representation into the decoder to obtain the softmax probability distribution over the dataset vocabulary and the distribution over the vocabulary of the session history, with the calculation formulas:

$$H^{d} = \mathrm{Decoder}(H)$$

$$P_{v} = \mathrm{Softmax}\left(H^{d}\right)$$

$$P_{h}(w) = \sum_{j:\, w_j^k = w} \beta_j^{(m)}$$

wherein $\mathrm{Decoder}$ is a decoder implemented with the BioBERT pre-trained model; $H$ is the text context embedded representation; $\mathrm{Softmax}$ is the softmax layer of the decoder; $P_{v}$ is the softmax probability distribution over the dataset vocabulary; $P_{h}$ is the distribution over the session-history vocabulary; $\beta_j^{(m)}$ is the cross-attention score of the decoder; $w_j^k$ is the $j$-th word in the $k$-th utterance; $w$ denotes a word; and $m$ indexes the heads of the multi-head attention;

S5.2: combining the distribution over knowledge triples, the softmax probability distribution over the dataset vocabulary, and the distribution over the session-history vocabulary with gating probabilities to obtain the final probability distribution;

S5.3: emitting the word with the highest final probability as the next word until the terminating $[\mathrm{SEP}]$ special token is generated, yielding a complete medical dialogue reply, the medical dialogue reply being the response utterance.
2. The knowledge-graph-enhanced large language model question-answer generation method according to claim 1, wherein in S1 the external knowledge base comprises a plurality of triples, in the forms {head entity, relation, tail entity} and {entity, attribute, attribute value}; medical knowledge entities and their corresponding entity description texts are extracted from the medical knowledge; triples of the form {head entity, relation, tail entity} are constructed from the relations among the medical knowledge entities; triples of the form {entity, attribute, attribute value} are constructed from each medical knowledge entity, with its entity description text as the attribute value; and entity alignment is performed on all the triples to obtain the external knowledge base;

the medical knowledge entities include disease names, drug names, symptom names, examination item names, food names, and department names.
3. The knowledge-graph-enhanced large language model question-answer generation method according to claim 2, wherein S1 further comprises redundancy-elimination preprocessing of the entity description text with the TextRank algorithm, the redundancy-elimination preprocessing comprising:

Step 1: dividing the entity description text into a number of sentences; taking the sentences as nodes and the similarity between two sentences as the weight of the edge between them, with the similarity calculation formula:

$$\mathrm{Sim}(S_i, S_j) = \frac{\left|\{\, w \mid w \in S_i \text{ and } w \in S_j \,\}\right|}{\log |S_i| + \log |S_j|}$$

wherein $\mathrm{Sim}(S_i, S_j)$ is the similarity between sentences $S_i$ and $S_j$; $S_i$ and $S_j$ denote the $i$-th and $j$-th sentences; $|S_i|$ denotes the sentence length; and $\{\, w \mid w \in S_i \text{ and } w \in S_j \,\}$ is the set of words that appear in both sentences;

Step 2: initializing the weight value of every sentence to $1/n$ and iteratively recomputing the weight of sentence $i$ from all other sentences connected to it and their pairwise similarities, with the iterative calculation formula:

$$W_i = (1 - d) + d \sum_{S_j \in \mathrm{In}(S_i)} \frac{\mathrm{Sim}(S_j, S_i)}{\sum_{S_k \in \mathrm{Out}(S_j)} \mathrm{Sim}(S_j, S_k)}\, W_j$$

wherein $W_i$ is the weight value of the $i$-th sentence; $\mathrm{In}(S_i)$ is the set of sentences pointing to $S_i$; $\mathrm{Out}(S_j)$ is the set of sentences that $S_j$ points to; $S_k$ is the $k$-th sentence pointed to by $S_j$; $d$ is the damping ratio; and $W_j$ is the weight value of the $j$-th sentence;

Step 3: repeating Step 2 until the weight value of every sentence has been computed;

Step 4: adjusting the weight of each sentence according to its entity coverage and its position in the entity description text to obtain an importance score for each sentence, with the importance score calculation formula:

$$W'_i = W_i \odot \lambda \left( \hat{F}_C + \hat{F}_P \right)^{T}$$

$$\hat{F}_C = \frac{F_C}{\lVert F_C \rVert}, \qquad F_C = [C_1, C_2, \ldots, C_n], \qquad C_i = \frac{e_i}{n}$$

$$\hat{F}_P = \frac{F_P}{\lVert F_P \rVert}, \qquad F_P = [p_1, p_2, \ldots, p_n]$$

wherein $W'_i$ is the importance score of the $i$-th sentence; $\odot$ denotes the Hadamard product; $\lambda$ is an adjustment parameter; $C_i$ is the entity coverage of the $i$-th sentence and $e_i$ the number of entity names it contains; $p_i$ is the position of the $i$-th sentence in the entity description text; $n$ is the number of sentences in the entity description text; $\hat{F}_C$ is the normalization of $F_C$, the feature matrix of sentence entity coverage; $\hat{F}_P$ is the normalization of $F_P$, the feature matrix of sentence positions; and $T$ denotes the transpose;

Step 5: selecting the x sentences with the highest importance scores as candidate text summaries and combining all candidate summaries to obtain the redundancy-eliminated entity description text.
4. The knowledge-graph-enhanced large language model question-answer generation method according to claim 2, wherein in S3 the process of obtaining the medical knowledge triples comprises:

Step 1: extracting entity mentions from the question text in the historical dialogue text with the BERT+BiLSTM+CRF model;

Step 2: computing the similarity between the entity mention and each medical knowledge entity in the external knowledge base, taking the medical knowledge entity with the highest similarity as the target entity, and linking the entity mention to that target entity in the external knowledge base;

Step 3: querying the external knowledge base for the medical knowledge triples that contain the target entity.
5. The knowledge-graph-enhanced large language model question-answer generation method according to claim 4, wherein said computing the similarity between the entity mention and each medical knowledge entity in the external knowledge base comprises:

Step 2.1: encoding each medical knowledge entity in the external knowledge base to obtain a candidate entity encoding; encoding the entity description text of each medical knowledge entity to obtain an entity description sentence vector; and encoding the entity mention to obtain a mention encoding;

Step 2.2: concatenating each candidate entity encoding with its entity description sentence vector to obtain the semantic text representation vector of the candidate entity; and concatenating the mention encoding with its corresponding text context embedded representation to obtain the semantic text representation vector of the mention;

Step 2.3: computing the similarity between the semantic text representation vector of the mention and that of the candidate entity, with the calculation formula:

$$\mathrm{sim}(v_m, v_e) = \cos(v_m, v_e) = \frac{v_m \cdot v_e}{\lVert v_m \rVert\, \lVert v_e \rVert}$$

wherein $\mathrm{sim}(v_m, v_e)$ is the similarity between the semantic text representation vector of the mention and that of the candidate entity; $\cos(\cdot,\cdot)$ denotes cosine similarity; $v_m$ is the semantic text representation vector of the mention; and $v_e$ is the semantic text representation vector of the candidate entity;

Step 2.4: repeating Step 2.3 until all medical knowledge entities have been processed, yielding the similarity between the entity mention and each medical knowledge entity in the external knowledge base.
6. The knowledge-graph-enhanced large language model question-answer generation method according to claim 1, wherein S4 further comprises bridging the representation gap between the medical knowledge triples and the text context embedded representation with a multi-layer perceptron to obtain the vector representation of each medical knowledge triple, with the calculation formula:

$$g_t = \mathrm{MLP}([\,h;\, hs;\, r;\, t;\, ts\,])$$

wherein $g_t$ denotes the vector representation of the medical knowledge triple; $\mathrm{MLP}$ is the multi-layer perceptron; $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation operation; $h$ is the head entity of the triple; $hs$ is the entity description text of the head entity; $r$ is the relation; $t$ is the tail entity; and $ts$ is the entity description text of the tail entity.
7. The knowledge-graph-enhanced large language model question-answer generation method according to claim 6, wherein S4 comprises:

predicting the distribution over knowledge triples with a multi-head attention mechanism, based on the text context embedded representation and the medical knowledge triples:

$$\alpha_i^{(m)} = \frac{\left(h_{w_j^k} W_Q^{(m)}\right)\left(g_{t_i} W_K^{(m)}\right)^{T}}{\sqrt{d_k}}$$

$$P_t = \mathrm{softmax}\left(\alpha^{(m)}\right)$$

wherein $P_t$ denotes the distribution over knowledge triples; $\alpha_i^{(m)}$ is the attention score of the $i$-th triple; $\sqrt{d_k}$ is the scaling factor; $w_j^k$ is the $j$-th word in the $k$-th utterance and $h_{w_j^k}$ its contextual representation; $g_{t_i}$ is the vector representation of the $i$-th triple; and $m$ indexes the heads of the multi-head attention;

the vector representations of the medical knowledge triples and the text context embedded representation then pass together through a multi-head attention mechanism and a fully connected feed-forward network to obtain the final knowledge triple representation, with the calculation formula:

$$Z = \mathrm{FFN}\left(\mathrm{MHA}\left(H,\; g_t,\; g_t\right)\right)$$

wherein $Z$ denotes the final knowledge triple representation; $\mathrm{FFN}$ is the fully connected feed-forward network; $\mathrm{MHA}$ is the multi-head attention mechanism, with $H$, $g_t$, $g_t$ serving as its query, key, and value respectively; $H$ is the text context embedded representation; and $g_{t_k}$ is the vector representation of the $k$-th medical knowledge triple.
8. The knowledge-graph-enhanced large language model question-answer generation method according to claim 1, wherein the calculation formula of the final probability distribution is:

$$P(w) = g_1\, P_{v}(w) + (1 - g_1)\, P_{k}(w)$$

$$P_{k}(w) = g_2\, P_{h}(w) + (1 - g_2)\, P_{t}(w)$$

$$g_1 = \sigma_1\left(W_1 [H;\, Z] + b_1\right)$$

$$g_2 = \sigma_2\left(W_2 [H;\, Z] + b_2\right)$$

wherein $P(w)$ denotes the final probability distribution; $g_1$ is the first gating probability and $g_2$ the second; $\sigma_1$ is the first sigmoid layer of the decoder, with training parameters $W_1$ and bias term $b_1$; $\sigma_2$ is the second sigmoid layer of the decoder, with training parameters $W_2$ and bias term $b_2$; $H$ is the text context embedded representation; $Z$ is the final knowledge triple representation; $P_{v}$ is the softmax probability distribution over the dataset vocabulary; $P_{h}$ is the distribution over the session-history vocabulary; $P_{t}$ is the distribution over knowledge triples; and $P_{k}$ is the probability distribution mixing $P_{h}$ and $P_{t}$.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202410653158.2A | 2024-05-24 | 2024-05-24 | Knowledge graph enhancement-based large language model question-answer generation method

Publications (2)

Publication Number | Publication Date
---|---
CN118227769A | 2024-06-21
CN118227769B | 2024-08-20
Family (ID: 91513772)

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202410653158.2A (Active) | Knowledge graph enhancement-based large language model question-answer generation method | 2024-05-24 | 2024-05-24

Country Status (1)

Country | Link
---|---
CN | CN118227769B (en)
Families Citing this family (1)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN118469005B * | 2024-07-10 | 2024-09-27 | 北方健康医疗大数据科技有限公司 | Medical knowledge graph construction method, system, terminal and storage medium based on large language model
Citations (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN115470327A * | 2022-08-11 | 2022-12-13 | 天津泰凡科技有限公司 | Medical question-answering method based on knowledge graph and related equipment
CN117033602A * | 2023-08-24 | 2023-11-10 | 北京邮电大学 | Method for constructing multi-mode user mental perception question-answering model

Family Cites Families (4)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN113010693B * | 2021-04-09 | 2024-03-08 | 大连民族大学 | Knowledge graph intelligent question-answering method integrating pointer generation network
CN113488165B * | 2021-07-26 | 2023-08-22 | 平安科技(深圳)有限公司 | Text matching method, device, equipment and storage medium based on knowledge graph
CN115422369B * | 2022-08-30 | 2023-11-03 | 中国人民解放军国防科技大学 | Knowledge graph completion method and device based on improved TextRank
CN116561272A * | 2023-04-18 | 2023-08-08 | 华南师范大学 | Open domain visual language question-answering method and device, electronic equipment and storage medium
Also Published As

Publication number | Publication date
---|---
CN118227769A | 2024-06-21
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant