CN118227769B - Knowledge graph enhancement-based large language model question-answer generation method
- Publication number: CN118227769B
- Application number: CN202410653158.2A
- Authority: CN (China)
- Prior art keywords: entity, representing, knowledge, text, representation
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
- G06F16/367—Ontology
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G06F40/30—Semantic analysis
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/08—Learning methods
- G06N5/027—Frames
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change
Abstract
The application relates to a knowledge-graph-enhanced large language model question-answer generation method, which comprises the following steps: constructing an external knowledge base of medical knowledge, where the external knowledge base is in knowledge graph form; acquiring the user's historical dialogue text and obtaining a text context embedded representation through an encoder; extracting entity mentions from the question text in the historical dialogue text and linking them to target entities in the external knowledge base; querying the external knowledge base for medical knowledge triples related to the target entities; passing the text context embedded representation and the medical knowledge triples together through a multi-head attention mechanism and then a fully connected feed-forward network, thereby computing the distribution over knowledge triples and obtaining the final knowledge triple representation; and finally combining, in a decoder, the text context embedded representation and the final knowledge triple representation with the knowledge triple distribution to generate a response utterance. The method improves the utilization of external medical knowledge and the accuracy of the responses.
Description
Technical Field
The application relates to the technical field of large language model question-answer generation, and in particular to a knowledge-graph-enhanced large language model question-answer generation method.
Background
After pre-training on large-scale data, a pre-trained model carries a certain amount of knowledge that can enhance its responses. However, when specific knowledge of a particular domain is required, the model may still produce inaccurate or misleading responses, and may even hallucinate harmful false facts; the interpretability of such models is also poor. Medical dialogue systems are especially challenged by the rich domain knowledge they require, and how to model this domain knowledge has become a hot research problem.
Knowledge graphs, as a method for representing and organizing knowledge, capture real-world information and concepts in a structured, semantic fashion, describe facts between instances, and are flexible in most application environments. However, existing methods for integrating external knowledge into generative pre-trained language models transfer relational knowledge only through post-training on individual knowledge triples, ignoring the rich structural and semantic information in the knowledge graph.
Disclosure of Invention
Based on this, there is a need for a knowledge-graph-enhanced large language model question-answer generation method, which includes:

S1: constructing an external knowledge base of medical knowledge, where the external knowledge base is in knowledge graph form;

S2: acquiring the user's historical dialogue text and passing it through an encoder to obtain a text context embedded representation;

S3: extracting entity mentions from the question text in the historical dialogue text, linking the entity mentions to target entities in the external knowledge base, and querying the external knowledge base for medical knowledge triples related to the target entities;

S4: passing the text context embedded representation and the medical knowledge triples together through a multi-head attention mechanism and then a fully connected feed-forward network, thereby computing the distribution over knowledge triples and obtaining the final knowledge triple representation;

S5: generating a response utterance by combining, in a decoder, the text context embedded representation, the final knowledge triple representation, and the knowledge triple distribution.
Beneficial effects: the method acquires the user's historical dialogue text and obtains a text context embedded representation through an encoder; extracts entity mentions from the question text and links them to target entities in the external knowledge base; queries the external knowledge base for related medical knowledge triples; passes the text context embedded representation and the medical knowledge triples together through a multi-head attention mechanism and a fully connected feed-forward network to compute the knowledge triple distribution and the final knowledge triple representation; and finally combines these in a decoder to generate a response utterance. This improves the utilization of external medical knowledge and the accuracy of the responses.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the application, and other drawings may be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a flowchart of a knowledge-graph-enhancement-based large language model question-answer generation method according to an embodiment of the present application.
Detailed Description
In order that the above objects, features and advantages of the application may be readily understood, a more particular description of the application is given below with reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application may, however, be embodied in many forms other than those described herein and may similarly be modified by those skilled in the art without departing from its spirit; the application is therefore not limited to the specific embodiments disclosed below.

Furthermore, the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two (for example, two or three), unless specifically defined otherwise.
As shown in fig. 1, the present embodiment provides a knowledge graph enhancement-based large language model question-answer generation method, which includes:
s1: and constructing an external knowledge base based on medical knowledge, wherein the external knowledge base is in a knowledge graph form.
Specifically, the external knowledge base comprises a plurality of triples, in the forms {head entity, relation, tail entity} and {entity, attribute, attribute value}. Medical knowledge entities and their corresponding entity description texts are extracted from the medical knowledge; triples of the form {head entity, relation, tail entity} are constructed from the relations among the medical knowledge entities; triples of the form {entity, attribute, attribute value} are constructed from each medical knowledge entity, with its entity description text as the attribute value; and entity alignment is performed on all the triples to obtain the external knowledge base.
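For concreteness, the sketch below shows one way such a triple store could be represented in Python; the dataclass layout and the sample entries (diabetes, insulin) are illustrative assumptions, not the patented storage format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str      # head entity, or the entity itself for attribute triples
    relation: str  # relation name, or the attribute name (e.g. "description")
    tail: str      # tail entity, or the attribute value

# Relation triples {head entity, relation, tail entity} and attribute triples
# {entity, attribute, attribute value}; the concrete entries are illustrative.
external_kb = {
    Triple("diabetes", "treated_with", "insulin"),
    Triple("diabetes", "has_symptom", "metabolic abnormality"),
    Triple("diabetes", "description", "a chronic disease of glucose metabolism ..."),
}
```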
In this embodiment, the sources of medical knowledge include public datasets for knowledge graph construction, data crawled from medical websites, and medical literature. For the semi-structured data crawled from medical websites, triple information is extracted according to defined extraction rules; for the unstructured medical literature, triple information is extracted with a BERT+BiLSTM+CRF model.

The BERT+BiLSTM+CRF model is a composite deep learning architecture widely used for named entity recognition. It consists of a BERT pre-trained language model (Bidirectional Encoder Representations from Transformers), a Bidirectional Long Short-Term Memory network (BiLSTM), and a Conditional Random Field (CRF).

BERT is a pre-trained language model capable of understanding and generating natural language text; BiLSTM is a recurrent neural network capable of processing sequence data; the CRF is a conditional random field capable of identifying structural patterns in a sequence.

The workflow of the BERT+BiLSTM+CRF model is as follows. First, the input text is passed through the BERT pre-trained language model to obtain word vectors that capture the contextual information of the vocabulary. The word vectors output by BERT are then fed into the BiLSTM module for further processing, which extracts sequence features while taking context into account. Finally, the CRF module decodes the BiLSTM output into a predicted label sequence, from which each entity is extracted and classified, completing the entity recognition process. Thanks to its deep contextual understanding, bidirectional information processing, and sequence-level optimization, the BERT+BiLSTM+CRF model exhibits strong performance on named entity recognition tasks.
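The following is a minimal sketch of such a BERT+BiLSTM+CRF tagger in PyTorch. The `pytorch-crf` package, the checkpoint name, and the tag-set size are assumptions for illustration; the patent does not specify these details.

```python
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf (assumed dependency)

class BertBiLSTMCRF(nn.Module):
    """BERT word vectors -> BiLSTM sequence features -> CRF tag decoding."""
    def __init__(self, bert_name="bert-base-chinese", num_tags=7, lstm_hidden=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)      # contextual word vectors
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * lstm_hidden, num_tags)  # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)            # sequence-level decoding

    def forward(self, input_ids, attention_mask, tags=None):
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(h)                      # sequence features with context
        emissions = self.emission(h)
        mask = attention_mask.bool()
        if tags is not None:                     # training: CRF negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: best tag paths
```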
The medical knowledge entities include disease names (e.g., diabetes), drug names (e.g., insulin), symptom names (e.g., metabolic abnormality), examination item names, food names, and department names.
In this embodiment, the entity description texts in the external knowledge base can be overly long and redundant; redundant descriptions may cause confusion and misunderstanding, especially when different entities are given similar descriptions, making them hard to distinguish accurately and thereby affecting the understanding and use of the information. This step therefore further includes redundancy-elimination preprocessing of the entity description text with the TextRank algorithm, as follows (a code sketch of the full procedure is given after Step 5):
Step 1: dividing the entity description text into a number of sentences; taking the sentences as nodes and the similarity between two sentences as the weight of the edge between them. The similarity calculation formula is:

$$\mathrm{Sim}(S_i, S_j) = \frac{\left|\{\, w \mid w \in S_i \text{ and } w \in S_j \,\}\right|}{\log |S_i| + \log |S_j|}$$

where $\mathrm{Sim}(S_i, S_j)$ is the similarity between sentences $S_i$ and $S_j$; $S_i$ and $S_j$ denote the $i$-th and $j$-th sentences; $|S_i|$ denotes the sentence length; and $\{\, w \mid w \in S_i \text{ and } w \in S_j \,\}$ is the set of words that appear in both sentences.
Step 2: initializing the weight value of every sentence to $1/n$ and iteratively recomputing the weight of sentence $i$ from all other sentences connected to it and their pairwise similarities. The iterative calculation formula is:

$$W_i = (1 - d) + d \sum_{S_j \in \mathrm{In}(S_i)} \frac{\mathrm{Sim}(S_j, S_i)}{\sum_{S_k \in \mathrm{Out}(S_j)} \mathrm{Sim}(S_j, S_k)}\, W_j$$

where $W_i$ is the weight value of the $i$-th sentence; $\mathrm{In}(S_i)$ is the set of sentences pointing to $S_i$; $\mathrm{Out}(S_j)$ is the set of sentences that $S_j$ points to; $S_k$ is the $k$-th sentence pointed to by $S_j$; $d$ is the damping ratio, typically set to 0.85; and $W_j$ is the weight value of the $j$-th sentence.
Step 3: repeating Step 2 until the weight value of every sentence has been computed;
Step 4: adjusting the weight of each sentence according to its entity coverage and its position in the entity description text to obtain an importance score for each sentence. The importance score calculation formula is:

$$W'_i = W_i \odot \lambda \left( \hat{F}_C + \hat{F}_P \right)^{T}$$

$$\hat{F}_C = \frac{F_C}{\lVert F_C \rVert}, \qquad F_C = [C_1, C_2, \ldots, C_n], \qquad C_i = \frac{e_i}{n}$$

$$\hat{F}_P = \frac{F_P}{\lVert F_P \rVert}, \qquad F_P = [p_1, p_2, \ldots, p_n]$$

where $W'_i$ is the importance score of the $i$-th sentence; $\odot$ denotes the Hadamard product; $\lambda$ is an adjustment parameter; $C_i$ is the entity coverage of the $i$-th sentence and $e_i$ the number of entity names it contains; $p_i$ is the position of the $i$-th sentence in the entity description text; $n$ is the number of sentences in the entity description text; $\hat{F}_C$ is the normalization of $F_C$, the feature matrix of sentence entity coverage; $\hat{F}_P$ is the normalization of $F_P$, the feature matrix of sentence positions; and $T$ denotes the transpose.
Step 5: selecting the x sentences with the highest importance scores as candidate text summaries and combining all candidate summaries to obtain the redundancy-eliminated entity description text.
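A compact sketch of the whole redundancy-elimination procedure follows; the whitespace sentence splitting, the position feature, and the adjustment parameter `lam` are illustrative assumptions rather than the exact patented choices.

```python
import math
import numpy as np

def textrank_summarize(sentences, entities, top_x=3, d=0.85, n_iter=50, lam=0.5):
    """Steps 1-5: TextRank weights, adjusted by entity coverage and position."""
    n = len(sentences)
    words = [set(s.split()) for s in sentences]
    # Step 1: edge weights = word-overlap similarity between sentence pairs
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and len(words[i]) > 1 and len(words[j]) > 1:
                sim[i, j] = len(words[i] & words[j]) / (
                    math.log(len(words[i])) + math.log(len(words[j])))
    # Steps 2-3: iterate the TextRank update from weights initialized to 1/n
    w = np.full(n, 1.0 / n)
    out_sum = sim.sum(axis=1)
    for _ in range(n_iter):
        contrib = np.divide(w, out_sum, out=np.zeros_like(w), where=out_sum > 0)
        w = (1 - d) + d * sim.T.dot(contrib)
    # Step 4: adjust by normalized entity-coverage and position features
    cov = np.array([sum(e in s for e in entities) for s in sentences], float)
    pos = np.array([n - i for i in range(n)], float)  # assumption: earlier is better
    cov /= np.linalg.norm(cov) + 1e-12
    pos /= np.linalg.norm(pos) + 1e-12
    score = w * (lam * (cov + pos))                   # elementwise (Hadamard) product
    # Step 5: keep the top-x sentences, restoring document order
    keep = sorted(np.argsort(-score)[:top_x])
    return " ".join(sentences[i] for i in keep)
```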
S2: and acquiring a user history dialogue text, wherein the user history dialogue text is subjected to an encoder to obtain a text context embedded representation.
Specifically, the user's historical dialogue text is recorded as $C = \{u_1, u_2, \ldots, u_{k-1}\}$, where $u_1$ denotes the 1st utterance and $u_{k-1}$ the $(k-1)$-th utterance. The process of obtaining the text context embedded representation is:

Step 1: inserting special tokens into the historical dialogue text to delimit each utterance. After inserting the special tokens, the historical dialogue text is expressed as:

$$C = \{[\mathrm{CLS}],\; w_1^1, \ldots, w_{n_1}^1,\; [\mathrm{SEP}],\; w_1^2, \ldots, w_{n_2}^2,\; [\mathrm{SEP}],\; \ldots,\; w_1^{k-1}, \ldots, w_{n_{k-1}}^{k-1},\; [\mathrm{SEP}]\}$$

where $[\mathrm{CLS}]$ is the special token marking the beginning of the session; $[\mathrm{SEP}]$ is the special token that marks the end of the conversation at the final position and segments utterances at intermediate positions; $w_j^i$ is the $j$-th word of the $i$-th utterance; and $n_i$ is the number of words in the $i$-th utterance, so that $w_1^1$ is the 1st word of the 1st utterance and $w_{n_{k-1}}^{k-1}$ the $n_{k-1}$-th word of the $(k-1)$-th utterance.

Step 2: inputting the historical dialogue text with the special tokens inserted into the encoder to generate the text context embedded representation. The calculation formula is:

$$H = \mathrm{Encoder}(C)$$

where $H$ denotes the text context embedded representation; $\mathrm{Encoder}$ is an encoder implemented with the BioBERT pre-trained model; and $C$ is the historical dialogue text with the special tokens inserted.
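A minimal sketch of this encoding step with the Hugging Face transformers library; the `dmis-lab/biobert-base-cased-v1.1` checkpoint and the use of the standard [CLS]/[SEP] tokens are assumptions, since the patent names only "BioBERT".

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "dmis-lab/biobert-base-cased-v1.1"  # assumed public BioBERT checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)

utterances = ["I feel thirsty all the time.",
              "How long has this lasted?",
              "About two weeks, and I have lost weight."]

# Step 1: [CLS] opens the session; [SEP] segments utterances and closes the dialogue.
text = tok.cls_token + tok.sep_token.join(utterances) + tok.sep_token
batch = tok(text, add_special_tokens=False, return_tensors="pt")

# Step 2: the encoder output H is the text context embedded representation.
with torch.no_grad():
    H = encoder(**batch).last_hidden_state  # shape: (1, sequence_length, hidden_size)
```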
S3: extracting entity mentions from the question text in the historical dialogue text and linking them to target entities in the external knowledge base; then querying the external knowledge base for related medical knowledge triples according to the target entities.
Specifically, the process of obtaining the medical knowledge triples is:

Step 1: extracting entity mentions from the question text in the historical dialogue text with the BERT+BiLSTM+CRF model;

Step 2: computing the similarity between the entity mention and each medical knowledge entity in the external knowledge base, taking the medical knowledge entity with the highest similarity as the target entity, and linking the entity mention to that target entity in the external knowledge base.
Further, computing the similarity between the entity mention and each medical knowledge entity in the external knowledge base comprises:

Step 2.1: encoding each medical knowledge entity in the external knowledge base to obtain a candidate entity encoding; encoding the entity description text of each medical knowledge entity to obtain an entity description sentence vector; and encoding the entity mention to obtain a mention encoding;

Step 2.2: concatenating each candidate entity encoding with its entity description sentence vector to obtain the semantic text representation vector of the candidate entity; and concatenating the mention encoding with its corresponding text context embedded representation to obtain the semantic text representation vector of the mention;

Step 2.3: computing the similarity between the semantic text representation vector of the mention and that of the candidate entity, with the calculation formula:

$$\mathrm{sim}(v_m, v_e) = \cos(v_m, v_e) = \frac{v_m \cdot v_e}{\lVert v_m \rVert\, \lVert v_e \rVert}$$

where $\mathrm{sim}(v_m, v_e)$ is the similarity between the semantic text representation vector of the mention and that of the candidate entity; $\cos(\cdot,\cdot)$ denotes cosine similarity; $v_m$ is the semantic text representation vector of the mention; and $v_e$ is the semantic text representation vector of the candidate entity.
Step 2.4: repeating Step 2.3 until all medical knowledge entities have been processed, yielding the similarity between the entity mention and each medical knowledge entity in the external knowledge base.
Step 3: querying the external knowledge base for the medical knowledge triples that contain the target entity; a code sketch of this linking-and-querying procedure follows.
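The vector inputs below are assumed to come from the encoders of Step 2.1, and the knowledge-base layout is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def link_mention(mention_vec, context_vec, kb):
    """Steps 2.2-2.4: pick the KB entity whose [entity; description] vector is
    most cosine-similar to the mention's [mention; context] vector.
    `kb` maps entity name -> (entity_vector, description_vector)."""
    query = torch.cat([mention_vec, context_vec])   # semantic text vector of the mention
    best_name, best_sim = None, -1.0
    for name, (ent_vec, desc_vec) in kb.items():
        cand = torch.cat([ent_vec, desc_vec])       # semantic text vector of the candidate
        sim = F.cosine_similarity(query, cand, dim=0).item()
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name, best_sim

def query_triples(target_entity, triples):
    """Step 3: all triples whose head or tail is the linked target entity."""
    return [t for t in triples if target_entity in (t[0], t[2])]
```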
S4: passing the text context embedded representation and the medical knowledge triples together through a multi-head attention mechanism and then a fully connected feed-forward network, thereby computing the distribution over knowledge triples and obtaining the final knowledge triple representation.
In this embodiment, there is a representation gap between the text context embedded representation obtained from the BioBERT encoder and the raw representation of a medical knowledge triple. This step therefore further includes bridging that gap with a multi-layer perceptron to obtain a vector representation of each medical knowledge triple. The calculation formula is:

$$g_t = \mathrm{MLP}([\,h;\, hs;\, r;\, t;\, ts\,])$$

where $g_t$ denotes the vector representation of the medical knowledge triple; $\mathrm{MLP}$ is the multi-layer perceptron; $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation operation; $h$ is the head entity of the triple; $hs$ is the entity description text of the head entity; $r$ is the relation; $t$ is the tail entity; and $ts$ is the entity description text of the tail entity.
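A minimal sketch of this bridging MLP; the hidden sizes and two-layer shape are assumptions, and the five inputs are assumed to be fixed-size embeddings of the triple components and their description texts.

```python
import torch
import torch.nn as nn

class TripleEncoder(nn.Module):
    """Concatenate [h; hs; r; t; ts] and project into the text embedding space."""
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(5 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, h, hs, r, t, ts):
        # h/t: head/tail entity embeddings; hs/ts: their description embeddings;
        # r: relation embedding -- all assumed to share the encoder dimension.
        return self.mlp(torch.cat([h, hs, r, t, ts], dim=-1))
```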
Further, this step comprises:

predicting the distribution over knowledge triples with a multi-head attention mechanism, based on the text context embedded representation and the medical knowledge triples:

$$\alpha_i^{(m)} = \frac{\left(h_{w_j^k} W_Q^{(m)}\right)\left(g_{t_i} W_K^{(m)}\right)^{T}}{\sqrt{d_k}}$$

$$P_t = \mathrm{softmax}\left(\alpha^{(m)}\right)$$

where $P_t$ denotes the distribution over knowledge triples; $\alpha_i^{(m)}$ is the attention score of the $i$-th triple; $\sqrt{d_k}$ is the scaling factor; $w_j^k$ is the $j$-th word in the $k$-th utterance and $h_{w_j^k}$ its contextual representation; $g_{t_i}$ is the vector representation of the $i$-th triple; and $m$ indexes the heads of the multi-head attention;
the vector representations of the medical knowledge triples and the text context embedded representation then pass together through a multi-head attention mechanism followed by a fully connected feed-forward network to obtain the final knowledge triple representation. The calculation formula is:

$$Z = \mathrm{FFN}\left(\mathrm{MHA}\left(H,\; g_t,\; g_t\right)\right)$$

where $Z$ denotes the final knowledge triple representation; $\mathrm{FFN}$ is the fully connected feed-forward network; $\mathrm{MHA}$ is the multi-head attention mechanism, with $H$, $g_t$, $g_t$ serving as its query, key, and value respectively; $H$ is the text context embedded representation; and $g_{t_k}$ is the vector representation of the $k$-th medical knowledge triple.
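One way to realize this fusion step in PyTorch is sketched below; treating the context as the query and reusing the attention weights as the triple distribution follows the description above, while the dimensions and the head count are assumptions.

```python
import torch.nn as nn

class KnowledgeFusion(nn.Module):
    """Query = context H; key = value = triple vectors; then a feed-forward net."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, H, triple_vecs):
        # H: (B, L, dim) context embeddings; triple_vecs: (B, K, dim) triple vectors
        fused, attn = self.mha(H, triple_vecs, triple_vecs)  # attn: (B, L, K)
        p_triple = attn.mean(dim=1)   # distribution over the K triples (sums to 1)
        Z = self.ffn(fused)           # final knowledge triple representation
        return Z, p_triple
```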
S5: generating a response utterance by combining, in a decoder, the text context embedded representation, the final knowledge triple representation, and the distribution over knowledge triples.
Specifically, the process of obtaining the response utterance is:

S5.1: inputting the text context embedded representation into the decoder to obtain the softmax probability distribution over the dataset vocabulary and the distribution over the vocabulary of the session history. The calculation formulas are:

$$H^{d} = \mathrm{Decoder}(H)$$

$$P_{v} = \mathrm{Softmax}\left(H^{d}\right)$$

$$P_{h}(w) = \sum_{j:\, w_j^k = w} \beta_j^{(m)}$$

where $\mathrm{Decoder}$ is a decoder implemented with the BioBERT pre-trained model; $H$ is the text context embedded representation; $\mathrm{Softmax}$ is the softmax layer of the decoder; $P_{v}$ is the softmax probability distribution over the dataset vocabulary; $P_{h}$ is the distribution over the session-history vocabulary, which aggregates the probability of the $j$-th word $w_j^k$ in the $k$-th utterance; $\beta_j^{(m)}$ is the cross-attention score of the decoder; $w$ denotes a word; and $m$ indexes the heads of the multi-head attention.
S5.2: combining the distribution over knowledge triples, the softmax probability distribution over the dataset vocabulary, and the distribution over the session-history vocabulary with gating probabilities to obtain the final probability distribution. The calculation formulas are:

$$P(w) = g_1\, P_{v}(w) + (1 - g_1)\, P_{k}(w)$$

$$P_{k}(w) = g_2\, P_{h}(w) + (1 - g_2)\, P_{t}(w)$$

$$g_1 = \sigma_1\left(W_1 [H;\, Z] + b_1\right)$$

$$g_2 = \sigma_2\left(W_2 [H;\, Z] + b_2\right)$$

where $P(w)$ denotes the final probability distribution; $g_1$ is the first gating probability and $g_2$ the second; $\sigma_1$ is the first sigmoid layer of the decoder, with training parameters $W_1$ and bias term $b_1$; $\sigma_2$ is the second sigmoid layer of the decoder, with training parameters $W_2$ and bias term $b_2$; $H$ is the text context embedded representation; $Z$ is the final knowledge triple representation; $P_{v}$ is the softmax probability distribution over the dataset vocabulary; $P_{h}$ is the distribution over the session-history vocabulary; $P_{t}$ is the distribution over knowledge triples; and $P_{k}$ is the probability distribution mixing $P_{h}$ and $P_{t}$.
S5.3: emitting the word with the highest final probability as the next word, and repeating until a complete medical dialogue reply has been generated; this medical dialogue reply is the response utterance.
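The gated mixture of S5.2 and the greedy emission of S5.3 can be sketched as follows; how the three distributions are aligned onto one index space, and the pooling of H and Z into single vectors, are implementation assumptions.

```python
import torch
import torch.nn as nn

class GatedOutput(nn.Module):
    """Two sigmoid gates mix vocabulary, history-copy, and triple distributions."""
    def __init__(self, dim=768):
        super().__init__()
        self.gate1 = nn.Linear(2 * dim, 1)  # W1, b1
        self.gate2 = nn.Linear(2 * dim, 1)  # W2, b2

    def forward(self, h_state, z_state, p_vocab, p_hist, p_triple):
        # h_state / z_state: (B, dim) pooled context and final-triple representations;
        # p_*: (B, V) distributions already projected onto a shared vocabulary.
        gate_in = torch.cat([h_state, z_state], dim=-1)
        g1 = torch.sigmoid(self.gate1(gate_in))
        g2 = torch.sigmoid(self.gate2(gate_in))
        p_know = g2 * p_hist + (1 - g2) * p_triple   # knowledge-side mixture
        return g1 * p_vocab + (1 - g1) * p_know      # final next-word distribution

# S5.3 (greedy decoding): next_word = p_final.argmax(dim=-1), repeated until the
# reply is complete.
```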
The knowledge-graph-enhanced large language model question-answer generation method provided by this embodiment acquires the user's historical dialogue text and obtains a text context embedded representation through an encoder; extracts entity mentions from the question text and links them to target entities in the external knowledge base; queries the external knowledge base for the corresponding medical knowledge triples; passes the text context embedded representation and the medical knowledge triples together through a multi-head attention mechanism and a fully connected feed-forward network to obtain the final knowledge triple representation and, from it, the distribution over knowledge triples; and finally combines the text context embedded representation with the knowledge triple distribution in a decoder to generate the response utterance. The method improves the utilization of external medical knowledge and the accuracy of the responses.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any combination of them that involves no contradiction should be considered within the scope of this description.
The above examples illustrate only a few embodiments of the application; their description is detailed but is not to be construed as limiting the scope of the claims. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the spirit of the application, all of which fall within its scope of protection. Accordingly, the scope of protection of the present application is determined by the appended claims.
Claims (8)
1. A knowledge-graph-enhanced large language model question-answer generation method, characterized by comprising:

S1: constructing an external knowledge base of medical knowledge, wherein the external knowledge base is in knowledge graph form;

S2: acquiring the user's historical dialogue text and passing it through an encoder to obtain a text context embedded representation;
the user's historical dialogue text is recorded as $C = \{u_1, u_2, \ldots, u_{k-1}\}$, where $u_1$ denotes the 1st utterance and $u_{k-1}$ the $(k-1)$-th utterance; the process of obtaining the text context embedded representation comprises:

Step 1: inserting special tokens into the historical dialogue text to delimit each utterance; after inserting the special tokens, the historical dialogue text is expressed as:

$$C = \{[\mathrm{CLS}],\; w_1^1, \ldots, w_{n_1}^1,\; [\mathrm{SEP}],\; w_1^2, \ldots, w_{n_2}^2,\; [\mathrm{SEP}],\; \ldots,\; w_1^{k-1}, \ldots, w_{n_{k-1}}^{k-1},\; [\mathrm{SEP}]\}$$

wherein $[\mathrm{CLS}]$ is the special token indicating the beginning of the session; $[\mathrm{SEP}]$ is the special token that marks the end of the conversation at the final position and segments utterances at intermediate positions; $w_j^i$ is the $j$-th word of the $i$-th utterance; and $n_i$ is the number of words in the $i$-th utterance;

Step 2: inputting the historical dialogue text with the special tokens inserted into the encoder to generate the text context embedded representation, with the calculation formula:

$$H = \mathrm{Encoder}(C)$$

wherein $H$ denotes the text context embedded representation; $\mathrm{Encoder}$ is an encoder implemented with the BioBERT pre-trained model; and $C$ is the historical dialogue text with the special tokens inserted;
S3: extracting entity mentions from the question text in the historical dialogue text, linking the entity mentions to target entities in the external knowledge base, and querying the external knowledge base for medical knowledge triples related to the target entities;

S4: passing the text context embedded representation and the medical knowledge triples together through a multi-head attention mechanism and then a fully connected feed-forward network, thereby computing the distribution over knowledge triples and obtaining the final knowledge triple representation;

S5: generating a response utterance by combining, in a decoder, the text context embedded representation, the final knowledge triple representation, and the distribution over knowledge triples;
the process of obtaining the response utterance comprises:

S5.1: inputting the text context embedded representation into the decoder to obtain the softmax probability distribution over the dataset vocabulary and the distribution over the vocabulary of the session history, with the calculation formulas:

$$H^{d} = \mathrm{Decoder}(H)$$

$$P_{v} = \mathrm{Softmax}\left(H^{d}\right)$$

$$P_{h}(w) = \sum_{j:\, w_j^k = w} \beta_j^{(m)}$$

wherein $\mathrm{Decoder}$ is a decoder implemented with the BioBERT pre-trained model; $H$ is the text context embedded representation; $\mathrm{Softmax}$ is the softmax layer of the decoder; $P_{v}$ is the softmax probability distribution over the dataset vocabulary; $P_{h}$ is the distribution over the session-history vocabulary; $\beta_j^{(m)}$ is the cross-attention score of the decoder; $w_j^k$ is the $j$-th word in the $k$-th utterance; $w$ denotes a word; and $m$ indexes the heads of the multi-head attention;

S5.2: combining the distribution over knowledge triples, the softmax probability distribution over the dataset vocabulary, and the distribution over the session-history vocabulary with gating probabilities to obtain the final probability distribution;

S5.3: emitting the word with the highest final probability as the next word until the terminating $[\mathrm{SEP}]$ special token is generated, yielding a complete medical dialogue reply, the medical dialogue reply being the response utterance.
2. The knowledge-graph-enhanced large language model question-answer generation method according to claim 1, wherein in S1 the external knowledge base comprises a plurality of triples, in the forms {head entity, relation, tail entity} and {entity, attribute, attribute value}; medical knowledge entities and their corresponding entity description texts are extracted from the medical knowledge; triples of the form {head entity, relation, tail entity} are constructed from the relations among the medical knowledge entities; triples of the form {entity, attribute, attribute value} are constructed from each medical knowledge entity, with its entity description text as the attribute value; and entity alignment is performed on all the triples to obtain the external knowledge base;

the medical knowledge entities include disease names, drug names, symptom names, examination item names, food names, and department names.
3. The knowledge-graph-enhanced large language model question-answer generation method according to claim 2, wherein S1 further comprises redundancy-elimination preprocessing of the entity description text with the TextRank algorithm, the redundancy-elimination preprocessing comprising:

Step 1: dividing the entity description text into a number of sentences; taking the sentences as nodes and the similarity between two sentences as the weight of the edge between them, with the similarity calculation formula:

$$\mathrm{Sim}(S_i, S_j) = \frac{\left|\{\, w \mid w \in S_i \text{ and } w \in S_j \,\}\right|}{\log |S_i| + \log |S_j|}$$

wherein $\mathrm{Sim}(S_i, S_j)$ is the similarity between sentences $S_i$ and $S_j$; $S_i$ and $S_j$ denote the $i$-th and $j$-th sentences; $|S_i|$ denotes the sentence length; and $\{\, w \mid w \in S_i \text{ and } w \in S_j \,\}$ is the set of words that appear in both sentences;

Step 2: initializing the weight value of every sentence to $1/n$ and iteratively recomputing the weight of sentence $i$ from all other sentences connected to it and their pairwise similarities, with the iterative calculation formula:

$$W_i = (1 - d) + d \sum_{S_j \in \mathrm{In}(S_i)} \frac{\mathrm{Sim}(S_j, S_i)}{\sum_{S_k \in \mathrm{Out}(S_j)} \mathrm{Sim}(S_j, S_k)}\, W_j$$

wherein $W_i$ is the weight value of the $i$-th sentence; $\mathrm{In}(S_i)$ is the set of sentences pointing to $S_i$; $\mathrm{Out}(S_j)$ is the set of sentences that $S_j$ points to; $S_k$ is the $k$-th sentence pointed to by $S_j$; $d$ is the damping ratio; and $W_j$ is the weight value of the $j$-th sentence;

Step 3: repeating Step 2 until the weight value of every sentence has been computed;

Step 4: adjusting the weight of each sentence according to its entity coverage and its position in the entity description text to obtain an importance score for each sentence, with the importance score calculation formula:

$$W'_i = W_i \odot \lambda \left( \hat{F}_C + \hat{F}_P \right)^{T}$$

$$\hat{F}_C = \frac{F_C}{\lVert F_C \rVert}, \qquad F_C = [C_1, C_2, \ldots, C_n], \qquad C_i = \frac{e_i}{n}$$

$$\hat{F}_P = \frac{F_P}{\lVert F_P \rVert}, \qquad F_P = [p_1, p_2, \ldots, p_n]$$

wherein $W'_i$ is the importance score of the $i$-th sentence; $\odot$ denotes the Hadamard product; $\lambda$ is an adjustment parameter; $C_i$ is the entity coverage of the $i$-th sentence and $e_i$ the number of entity names it contains; $p_i$ is the position of the $i$-th sentence in the entity description text; $n$ is the number of sentences in the entity description text; $\hat{F}_C$ is the normalization of $F_C$, the feature matrix of sentence entity coverage; $\hat{F}_P$ is the normalization of $F_P$, the feature matrix of sentence positions; and $T$ denotes the transpose;

Step 5: selecting the x sentences with the highest importance scores as candidate text summaries and combining all candidate summaries to obtain the redundancy-eliminated entity description text.
4. The knowledge-graph-enhanced large language model question-answer generation method according to claim 2, wherein in S3 the process of obtaining the medical knowledge triples comprises:

Step 1: extracting entity mentions from the question text in the historical dialogue text with the BERT+BiLSTM+CRF model;

Step 2: computing the similarity between the entity mention and each medical knowledge entity in the external knowledge base, taking the medical knowledge entity with the highest similarity as the target entity, and linking the entity mention to that target entity in the external knowledge base;

Step 3: querying the external knowledge base for the medical knowledge triples that contain the target entity.
5. The knowledge-graph-enhanced large language model question-answer generation method according to claim 4, wherein said computing the similarity between the entity mention and each medical knowledge entity in the external knowledge base comprises:

Step 2.1: encoding each medical knowledge entity in the external knowledge base to obtain a candidate entity encoding; encoding the entity description text of each medical knowledge entity to obtain an entity description sentence vector; and encoding the entity mention to obtain a mention encoding;

Step 2.2: concatenating each candidate entity encoding with its entity description sentence vector to obtain the semantic text representation vector of the candidate entity; and concatenating the mention encoding with its corresponding text context embedded representation to obtain the semantic text representation vector of the mention;

Step 2.3: computing the similarity between the semantic text representation vector of the mention and that of the candidate entity, with the calculation formula:

$$\mathrm{sim}(v_m, v_e) = \cos(v_m, v_e) = \frac{v_m \cdot v_e}{\lVert v_m \rVert\, \lVert v_e \rVert}$$

wherein $\mathrm{sim}(v_m, v_e)$ is the similarity between the semantic text representation vector of the mention and that of the candidate entity; $\cos(\cdot,\cdot)$ denotes cosine similarity; $v_m$ is the semantic text representation vector of the mention; and $v_e$ is the semantic text representation vector of the candidate entity;

Step 2.4: repeating Step 2.3 until all medical knowledge entities have been processed, yielding the similarity between the entity mention and each medical knowledge entity in the external knowledge base.
6. The knowledge-graph-enhanced large language model question-answer generation method according to claim 1, wherein S4 further comprises bridging the representation gap between the medical knowledge triples and the text context embedded representation with a multi-layer perceptron to obtain the vector representation of each medical knowledge triple, with the calculation formula:

$$g_t = \mathrm{MLP}([\,h;\, hs;\, r;\, t;\, ts\,])$$

wherein $g_t$ denotes the vector representation of the medical knowledge triple; $\mathrm{MLP}$ is the multi-layer perceptron; $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation operation; $h$ is the head entity of the triple; $hs$ is the entity description text of the head entity; $r$ is the relation; $t$ is the tail entity; and $ts$ is the entity description text of the tail entity.
7. The knowledge-graph-enhanced large language model question-answer generation method according to claim 6, wherein S4 comprises:

predicting the distribution over knowledge triples with a multi-head attention mechanism, based on the text context embedded representation and the medical knowledge triples:

$$\alpha_i^{(m)} = \frac{\left(h_{w_j^k} W_Q^{(m)}\right)\left(g_{t_i} W_K^{(m)}\right)^{T}}{\sqrt{d_k}}$$

$$P_t = \mathrm{softmax}\left(\alpha^{(m)}\right)$$

wherein $P_t$ denotes the distribution over knowledge triples; $\alpha_i^{(m)}$ is the attention score of the $i$-th triple; $\sqrt{d_k}$ is the scaling factor; $w_j^k$ is the $j$-th word in the $k$-th utterance and $h_{w_j^k}$ its contextual representation; $g_{t_i}$ is the vector representation of the $i$-th triple; and $m$ indexes the heads of the multi-head attention;

the vector representations of the medical knowledge triples and the text context embedded representation then pass together through a multi-head attention mechanism and a fully connected feed-forward network to obtain the final knowledge triple representation, with the calculation formula:

$$Z = \mathrm{FFN}\left(\mathrm{MHA}\left(H,\; g_t,\; g_t\right)\right)$$

wherein $Z$ denotes the final knowledge triple representation; $\mathrm{FFN}$ is the fully connected feed-forward network; $\mathrm{MHA}$ is the multi-head attention mechanism, with $H$, $g_t$, $g_t$ serving as its query, key, and value respectively; $H$ is the text context embedded representation; and $g_{t_k}$ is the vector representation of the $k$-th medical knowledge triple.
8. The knowledge-graph-enhanced large language model question-answer generation method according to claim 1, wherein the calculation formula of the final probability distribution is:

$$P(w) = g_1\, P_{v}(w) + (1 - g_1)\, P_{k}(w)$$

$$P_{k}(w) = g_2\, P_{h}(w) + (1 - g_2)\, P_{t}(w)$$

$$g_1 = \sigma_1\left(W_1 [H;\, Z] + b_1\right)$$

$$g_2 = \sigma_2\left(W_2 [H;\, Z] + b_2\right)$$

wherein $P(w)$ denotes the final probability distribution; $g_1$ is the first gating probability and $g_2$ the second; $\sigma_1$ is the first sigmoid layer of the decoder, with training parameters $W_1$ and bias term $b_1$; $\sigma_2$ is the second sigmoid layer of the decoder, with training parameters $W_2$ and bias term $b_2$; $H$ is the text context embedded representation; $Z$ is the final knowledge triple representation; $P_{v}$ is the softmax probability distribution over the dataset vocabulary; $P_{h}$ is the distribution over the session-history vocabulary; $P_{t}$ is the distribution over knowledge triples; and $P_{k}$ is the probability distribution mixing $P_{h}$ and $P_{t}$.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202410653158.2A | 2024-05-24 | 2024-05-24 | Knowledge graph enhancement-based large language model question-answer generation method

Publications (2)

Publication Number | Publication Date
---|---
CN118227769A | 2024-06-21
CN118227769B | 2024-08-20
Family (ID: 91513772)

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202410653158.2A (Active) | Knowledge graph enhancement-based large language model question-answer generation method | 2024-05-24 | 2024-05-24

Country Status (1)

Country | Link
---|---
CN | CN118227769B (en)
Families Citing this family (1)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN118469005B * | 2024-07-10 | 2024-09-27 | 北方健康医疗大数据科技有限公司 | Medical knowledge graph construction method, system, terminal and storage medium based on large language model
Citations (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN115470327A * | 2022-08-11 | 2022-12-13 | 天津泰凡科技有限公司 | Medical question-answering method based on knowledge graph and related equipment
CN117033602A * | 2023-08-24 | 2023-11-10 | 北京邮电大学 | Method for constructing multi-mode user mental perception question-answering model

Family Cites Families (4)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN113010693B * | 2021-04-09 | 2024-03-08 | 大连民族大学 | Knowledge graph intelligent question-answering method integrating pointer generation network
CN113488165B * | 2021-07-26 | 2023-08-22 | 平安科技(深圳)有限公司 | Text matching method, device, equipment and storage medium based on knowledge graph
CN115422369B * | 2022-08-30 | 2023-11-03 | 中国人民解放军国防科技大学 | Knowledge graph completion method and device based on improved TextRank
CN116561272A * | 2023-04-18 | 2023-08-08 | 华南师范大学 | Open domain visual language question-answering method and device, electronic equipment and storage medium
Also Published As

Publication number | Publication date
---|---
CN118227769A | 2024-06-21
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant