CN118227769B - Knowledge graph enhancement-based large language model question-answer generation method

Info

Publication number
CN118227769B
Authority
CN
China
Prior art keywords
entity
representing
knowledge
text
representation
Prior art date
Legal status
Active
Application number
CN202410653158.2A
Other languages
Chinese (zh)
Other versions
CN118227769A (en)
Inventor
陈晓红
谢俊伟
曾阳艳
Current Assignee
Xiangjiang Laboratory
Original Assignee
Xiangjiang Laboratory
Priority date
Filing date
Publication date
Application filed by Xiangjiang Laboratory filed Critical Xiangjiang Laboratory
Priority to CN202410653158.2A
Publication of CN118227769A
Application granted
Publication of CN118227769B
Legal status: Active

Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/367 Ontology
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30 Semantic analysis
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/08 Learning methods
    • G06N5/027 Frames
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation


Abstract

The application relates to a knowledge graph enhancement-based large language model question-answer generation method, which comprises the following steps: constructing an external knowledge base, in knowledge graph form, from medical knowledge; acquiring a user history dialogue text and obtaining a text context embedded representation through an encoder; extracting entity mentions from the question text in the user history dialogue text and linking them to target entities in the external knowledge base; querying the external knowledge base for the medical knowledge triples related to each target entity; passing the text context embedded representation and the medical knowledge triples jointly and sequentially through a multi-head attention mechanism and a fully connected feed-forward network, thereby computing the distribution over knowledge triples and obtaining the representation of the final knowledge triples; and finally combining, by a decoder, the text context embedded representation, the representation of the final knowledge triples, and the distribution over knowledge triples to generate a response utterance. The method raises the utilization of external medical knowledge and improves response accuracy.

Description

Knowledge graph enhancement-based large language model question-answer generation method
Technical Field
The application relates to the technical field of large language model question and answer generation, in particular to a knowledge graph enhancement-based large language model question and answer generation method.
Background
After pre-training on large-scale data, a pre-trained model carries a certain amount of knowledge with which it can enhance its responses. However, when specific knowledge of a particular field is required, the model may still produce inappropriate or misleading responses, and even harmful hallucinated facts, and the interpretability of such models is poor. Medical dialogue systems have long been challenged by rich domain knowledge; to benefit from it, how to model this domain knowledge has become a hot research problem.
Knowledge graphs, as a method for representing and organizing knowledge, capture information and concepts in the real world in a structured and semantic fashion, describe facts between instances, and remain flexible in most application environments. However, existing methods for integrating external knowledge into a generative pre-trained language model transfer relational knowledge only through post-training on individual knowledge triples, ignoring the rich structural and semantic information in the knowledge graph.
Disclosure of Invention
Based on this, there is a need to provide a knowledge-graph-enhanced large language model question-answer generation method, which includes:
S1: constructing an external knowledge base based on medical knowledge, wherein the external knowledge base is in knowledge graph form;
S2: acquiring a user history dialogue text, and obtaining a text context embedded representation from it through an encoder;
S3: extracting entity mentions from the question text in the user history dialogue text, and linking the entity mentions to target entities in the external knowledge base; querying the external knowledge base for the medical knowledge triples related to each target entity;
S4: passing the text context embedded representation and the medical knowledge triples jointly and sequentially through a multi-head attention mechanism and a fully connected feed-forward network, thereby computing the distribution over knowledge triples and obtaining the representation of the final knowledge triples;
S5: generating a response utterance by a decoder that combines the text context embedded representation, the representation of the final knowledge triples, and the distribution over knowledge triples.
The beneficial effects are that: the method acquires a user history dialogue text and obtains a text context embedded representation through an encoder; extracts entity mentions from the question text in the user history dialogue text and links them to target entities in the external knowledge base; queries the external knowledge base for the medical knowledge triples related to each target entity; passes the text context embedded representation and the medical knowledge triples jointly and sequentially through a multi-head attention mechanism and a fully connected feed-forward network, thereby computing the distribution over knowledge triples and obtaining the representation of the final knowledge triples; and finally combines, by a decoder, the text context embedded representation, the representation of the final knowledge triples, and the distribution over knowledge triples to generate a response utterance. The method raises the utilization of external medical knowledge and improves response accuracy.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a knowledge-graph-enhancement-based large language model question-answer generation method according to an embodiment of the present application.
Detailed Description
In order that the above objects, features and advantages of the application will be readily understood, a more particular description of the application will be rendered by reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present application may be embodied in many other forms than described herein and similarly modified by those skilled in the art without departing from the spirit of the application, whereby the application is not limited to the specific embodiments disclosed below.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
As shown in Fig. 1, the present embodiment provides a knowledge graph enhancement-based large language model question-answer generation method, which includes:
S1: constructing an external knowledge base based on medical knowledge, wherein the external knowledge base is in knowledge graph form.
Specifically, the external knowledge base comprises a plurality of triples, whose forms include {head entity, relation, tail entity} and {entity, attribute, attribute value}. Medical knowledge entities and their corresponding entity description texts are extracted from the medical knowledge; triples of the form {head entity, relation, tail entity} are constructed according to the relations among the medical knowledge entities (for example, {diabetes, common symptom, polydipsia}); triples of the form {entity, attribute, attribute value} are constructed from each medical knowledge entity, taking its entity description text as the attribute value; and entity alignment is performed on all triples to obtain the external knowledge base.
In this embodiment, the sources of medical knowledge include datasets publicly disclosed for knowledge graph construction, semi-structured data crawled from medical websites, and unstructured medical literature. For the semi-structured data crawled from medical websites, triple information is extracted according to defined extraction rules; for the unstructured medical literature data, triple information is extracted with a BERT+BiLSTM+CRF model.
The BERT+BiLSTM+CRF model is a composite deep learning architecture widely used for named entity recognition. It consists of a BERT pre-trained language model (Bidirectional Encoder Representations from Transformers, BERT), a Bidirectional Long Short-Term Memory network (BiLSTM), and a Conditional Random Field (CRF).
BERT is a pre-trained language model capable of understanding and generating natural language text; BiLSTM is a recurrent neural network capable of processing sequence data; CRF is a probabilistic model capable of identifying structural patterns in a label sequence.
The workflow of the BERT+BiLSTM+CRF model is as follows. First, the input text is passed through the BERT pre-trained language model to obtain the corresponding word vectors, which capture the contextual information of the vocabulary. The word vectors output by BERT are then fed into the BiLSTM module for further processing, which extracts sequence features while taking context into account. Finally, the CRF module decodes the BiLSTM output to obtain a predicted label sequence, from which each entity is extracted and classified, completing the entity recognition process. Owing to its deep contextual understanding, bidirectional information processing, and sequence-level optimization, the BERT+BiLSTM+CRF model performs well on named entity recognition tasks.
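For illustration, the following is a minimal Python sketch of such a tagger. It assumes PyTorch, the Hugging Face transformers library, and the third-party pytorch-crf package; the checkpoint name, hidden size, and tag set are placeholder assumptions rather than details fixed by this embodiment.

    import torch.nn as nn
    from transformers import BertModel
    from torchcrf import CRF  # third-party package: pip install pytorch-crf

    class BertBiLstmCrf(nn.Module):
        def __init__(self, num_tags, bert_name="bert-base-chinese", lstm_hidden=256):
            super().__init__()
            # placeholder checkpoint; any BERT-family encoder would fit here
            self.bert = BertModel.from_pretrained(bert_name)    # contextual word vectors
            self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                                  batch_first=True, bidirectional=True)  # sequence features
            self.fc = nn.Linear(2 * lstm_hidden, num_tags)      # per-token tag emissions
            self.crf = CRF(num_tags, batch_first=True)          # sequence-level decoding

        def forward(self, input_ids, attention_mask, tags=None):
            h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
            h, _ = self.bilstm(h)
            emissions = self.fc(h)
            mask = attention_mask.bool()
            if tags is not None:
                # training: negative log-likelihood of the gold tags under the CRF
                return -self.crf(emissions, tags, mask=mask)
            # inference: Viterbi-decoded best tag sequence per input
            return self.crf.decode(emissions, mask=mask)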
The medical knowledge entities include disease names (e.g., diabetes), drug names (e.g., insulin), symptom names (e.g., metabolic abnormality), examination item names, food names, and department names.
In this embodiment, the entity description texts in the external knowledge base tend to be overly long and redundant. Redundant descriptions can cause confusion and misunderstanding, especially when different entities are given similar descriptions and therefore cannot be distinguished accurately, which affects the understanding and use of the information. This step therefore further comprises redundancy-elimination preprocessing of the entity description texts using the TextRank algorithm, as follows:
Step 1: dividing the entity description text into a plurality of sentences; taking the sentences as nodes, and taking the similarity between two sentences as the weight of the edge between them; the similarity calculation formula is:

$$\mathrm{Sim}(S_i, S_j) = \frac{\left|\{\, w_k \mid w_k \in S_i \ \text{and}\ w_k \in S_j \,\}\right|}{\log \lvert S_i \rvert + \log \lvert S_j \rvert}$$

where $\mathrm{Sim}(S_i, S_j)$ is the similarity between sentences $S_i$ and $S_j$; $S_i$ and $S_j$ denote the $i$-th and $j$-th sentences; $\lvert S_i \rvert$ denotes sentence length; and $w_k$ ranges over the words appearing in both $S_i$ and $S_j$.
Step 2: initializing the weight value of every sentence to $1/n$, and iteratively computing the weight value of sentence $i$ from all other sentences connected to it and the similarity between each pair; the iterative formula is:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)} \frac{w_{ji}}{\sum_{V_k \in \mathrm{Out}(V_j)} w_{jk}}\, WS(V_j)$$

where $WS(V_i)$ is the weight value of the $i$-th sentence; $\mathrm{In}(V_i)$ is the set of sentences pointing to sentence $V_i$; $\mathrm{Out}(V_j)$ is the set of sentences that sentence $V_j$ points to; $V_k$ denotes the $k$-th sentence pointed to by $V_j$; $w_{ji}$ is the edge weight between $V_j$ and $V_i$; $d$ is the damping ratio, typically set to 0.85; and $WS(V_j)$ is the weight value of the $j$-th sentence.
Step 3: repeating Step 2 until the weight value of every sentence has converged.
Step 4: adjusting the weight value of each sentence based on its entity coverage and its position in the entity description text to obtain an importance score for each sentence; consistent with the quantities defined below, the importance score can be written as:

$$W_i = WS(V_i) \odot \left( \alpha\, \hat{C}_i + (1 - \alpha)\, \hat{P}_i \right)$$

where $W_i$ is the importance score of the $i$-th sentence; $\odot$ denotes the Hadamard product; $\alpha$ denotes the adjustment parameter; $C_i$ denotes the entity coverage of the $i$-th sentence, with $\mathrm{count}_i$ the number of entity names contained in the $i$-th sentence; $P_i$ denotes the position of the $i$-th sentence in the entity description text; $n$ denotes the number of sentences in the entity description text; $\hat{C} = \mathrm{norm}(C)^{\mathsf{T}}$ is the normalized feature matrix of sentence entity coverage; $\hat{P} = \mathrm{norm}(P)^{\mathsf{T}}$ is the normalized feature matrix of sentence positions; and $\mathsf{T}$ denotes the transpose.
Step 5: selecting the sentences with the top $x$ importance scores as candidate text summaries, and combining all candidate text summaries to obtain the redundancy-eliminated, preprocessed entity description text.
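A minimal Python sketch of Steps 1 to 5 follows; it assumes whitespace-tokenized sentences and takes the position score as $1 - i/n$, and the +1 inside the logarithms guards one-word sentences. Both are implementation assumptions rather than details fixed by this embodiment.

    import math
    from itertools import combinations

    def textrank_summary(sentences, entity_counts, x=3, d=0.85, alpha=0.5, iters=50):
        n = len(sentences)
        toks = [set(s.split()) for s in sentences]
        # Step 1: edge weight = word-overlap similarity between sentence pairs
        # (+1 in the logs is a guard for one-word sentences, not part of the formula)
        sim = [[0.0] * n for _ in range(n)]
        for i, j in combinations(range(n), 2):
            denom = math.log(len(toks[i]) + 1) + math.log(len(toks[j]) + 1)
            sim[i][j] = sim[j][i] = len(toks[i] & toks[j]) / denom
        # Steps 2 and 3: iterate the TextRank update from uniform weights 1/n
        ws = [1.0 / n] * n
        for _ in range(iters):
            ws = [(1 - d) + d * sum(sim[j][i] / (sum(sim[j]) or 1.0) * ws[j]
                                    for j in range(n) if j != i)
                  for i in range(n)]
        # Step 4: adjust by normalized entity coverage and (assumed) position score
        cov_max = max(entity_counts) or 1
        scores = [ws[i] * (alpha * entity_counts[i] / cov_max
                           + (1 - alpha) * (1 - i / n)) for i in range(n)]
        # Step 5: keep the top-x sentences, restored to their original order
        top = sorted(sorted(range(n), key=lambda i: -scores[i])[:x])
        return " ".join(sentences[i] for i in top)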
S2: acquiring a user history dialogue text, and obtaining a text context embedded representation from it through an encoder.
Specifically, the user history dialogue text is recorded as $C = \{u_1, u_2, \dots, u_{k-1}\}$, where $u_1$ denotes the 1st utterance and $u_{k-1}$ denotes the $(k-1)$-th utterance. The process of obtaining the text context embedded representation includes:
Step 1: inserting special tokens into the user history dialogue text to segment each utterance; the user history dialogue text after inserting the special tokens is expressed as:

$$C = [\mathrm{CLS}]\; w^1_1 \dots w^1_{n_1}\; [\mathrm{SEP}]\; w^2_1 \dots w^2_{n_2}\; [\mathrm{SEP}]\; \dots\; w^{k-1}_1 \dots w^{k-1}_{n_{k-1}}\; [\mathrm{SEP}]$$

where $[\mathrm{CLS}]$ is a special token indicating the beginning of the session; $[\mathrm{SEP}]$ is a special token marking the end of the session or segmenting utterances: at the end it ends the session, and in the middle it segments utterances; $w^1_1$ denotes the 1st word of the 1st utterance; $w^1_{n_1}$ denotes the $n_1$-th word of the 1st utterance; $w^2_1$ denotes the 1st word of the 2nd utterance; $w^2_{n_2}$ denotes the $n_2$-th word of the 2nd utterance; $w^{k-1}_1$ denotes the 1st word of the $(k-1)$-th utterance; and $w^{k-1}_{n_{k-1}}$ denotes the $n_{k-1}$-th word of the $(k-1)$-th utterance.
Step 2: inputting the user history dialogue text with the inserted special tokens into an encoder to generate the text context embedded representation; the calculation formula is:

$$\mathbf{H} = \mathrm{Encoder}(C)$$

where $\mathbf{H}$ denotes the text context embedded representation; $\mathrm{Encoder}(\cdot)$ denotes an encoder implemented by the BioBERT pre-trained model; and $C$ denotes the user history dialogue text after inserting the special tokens.
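As a concrete, non-limiting illustration of S2, assuming the Hugging Face transformers library and the public BioBERT checkpoint "dmis-lab/biobert-base-cased-v1.1" (this embodiment does not fix a particular checkpoint):

    from transformers import AutoTokenizer, AutoModel

    name = "dmis-lab/biobert-base-cased-v1.1"   # assumed BioBERT checkpoint
    tok = AutoTokenizer.from_pretrained(name)
    encoder = AutoModel.from_pretrained(name)

    utterances = ["I was recently diagnosed with diabetes.",
                  "What foods should I avoid?"]
    # Step 1: the tokenizer adds [CLS] at the start and [SEP] at the end;
    # joining on "[SEP]" segments the utterances in between
    inputs = tok(" [SEP] ".join(utterances), return_tensors="pt")
    # Step 2: H = Encoder(C), one contextual vector per token
    H = encoder(**inputs).last_hidden_state     # shape (1, seq_len, hidden)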
S3: extracting entity mentions from the question text in the user history dialogue text, and linking the entity mentions to target entities in the external knowledge base; then querying the external knowledge base for the medical knowledge triples related to each target entity.
Specifically, the process of obtaining the medical knowledge triples includes:
Step 1: extracting entity mentions from the question text in the user history dialogue text using the BERT+BiLSTM+CRF model;
Step 2: calculating the similarity between the entity mention and each medical knowledge entity in the external knowledge base, taking the medical knowledge entity with the highest similarity to the entity mention as the target entity, and linking the entity mention to that target entity in the external knowledge base;
Further, calculating the similarity between the entity mention and each of the medical knowledge entities in the external knowledge base comprises:
Step 2.1: encoding each medical knowledge entity in the external knowledge base to obtain candidate entity encodings; encoding the entity description text corresponding to each medical knowledge entity to obtain entity description sentence vectors; and encoding the entity mention to obtain an entity encoding;
Step 2.2: concatenating each candidate entity encoding with its corresponding entity description sentence vector to obtain the semantic text representation vector of the candidate entity; and concatenating the entity encoding with its corresponding text context embedded representation to obtain the semantic text representation vector of the entity;
Step 2.3: calculating the similarity between the semantic text representation vector of the entity and that of the candidate entity, with the calculation formula:

$$\mathrm{sim} = \cos\left( \mathbf{v}_{e},\ \mathbf{v}_{c} \right)$$

where $\mathrm{sim}$ is the similarity between the semantic text representation vector of the entity and that of the candidate entity; $\cos(\cdot,\cdot)$ denotes cosine similarity; $\mathbf{v}_{e}$ is the semantic text representation vector of the entity; and $\mathbf{v}_{c}$ is the semantic text representation vector of the candidate entity;
Step 2.4: repeating Step 2.3 until all medical knowledge entities have been processed, obtaining the similarity between the entity mention and each medical knowledge entity in the external knowledge base.
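A minimal sketch of Steps 2.1 to 2.4, assuming all encodings are already available as fixed-size torch tensors (one row per candidate entity):

    import torch
    import torch.nn.functional as F

    def link_mention(mention_vec, context_vec, cand_vecs, cand_desc_vecs):
        # Step 2.2: concatenate the mention encoding with its text context
        # embedding, and each candidate encoding with its description vector
        v_entity = torch.cat([mention_vec, context_vec], dim=-1)    # (2D,)
        v_cands = torch.cat([cand_vecs, cand_desc_vecs], dim=-1)    # (N, 2D)
        # Steps 2.3 and 2.4: cosine similarity against every candidate at once
        sims = F.cosine_similarity(v_entity.unsqueeze(0), v_cands, dim=-1)
        return int(sims.argmax())   # index of the target entity

The returned index identifies the target entity, and the knowledge base is then queried for all triples containing it (Step 3).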
Step 3: querying the external knowledge base, according to the target entity, for the medical knowledge triples that contain the target entity.
S4: passing the text context embedded representation and the medical knowledge triples jointly and sequentially through a multi-head attention mechanism and a fully connected feed-forward network, thereby computing the distribution over knowledge triples and obtaining the representation of the final knowledge triples.
In this embodiment, there is a representation gap between the text context embedded representation obtained by the BioBERT encoder and the medical knowledge triples; this step therefore further comprises bridging that gap with a multi-layer perceptron to obtain a vector representation of each medical knowledge triple; the calculation formula is:

$$\mathbf{t}_k = \mathrm{MLP}\big( [\, h;\ hs;\ r;\ t;\ ts \,] \big)$$

where $\mathbf{t}_k$ denotes the vector representation of a medical knowledge triple; $\mathrm{MLP}(\cdot)$ denotes the multi-layer perceptron; $[\,\cdot\,;\cdot\,]$ denotes the concatenation operation; $h$ denotes the head entity of the medical knowledge triple; $hs$ denotes the entity description text of the head entity in the medical knowledge triple; $r$ denotes the relation; $t$ denotes the tail entity of the medical knowledge triple; and $ts$ denotes the entity description text of the tail entity in the medical knowledge triple.
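A minimal sketch of this bridge, assuming each of the five pieces (head entity, head description, relation, tail entity, tail description) has already been encoded to a vector of dimension emb_dim; the layer sizes are illustrative assumptions:

    import torch
    import torch.nn as nn

    class TripleBridge(nn.Module):
        """MLP that maps a concatenated triple encoding into the text space."""
        def __init__(self, emb_dim, hidden_dim, out_dim):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(5 * emb_dim, hidden_dim),
                                     nn.ReLU(),
                                     nn.Linear(hidden_dim, out_dim))

        def forward(self, h, hs, r, t, ts):
            # [h; hs; r; t; ts] -> bridged triple vector t_k
            return self.mlp(torch.cat([h, hs, r, t, ts], dim=-1))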
Further, this step comprises:
predicting the distribution over knowledge triples with a multi-head attention mechanism, based on the text context embedded representation and the medical knowledge triples; consistent with the quantities defined below, this can be written as:

$$a^{(l)} = \mathrm{softmax}\!\left( \frac{\mathbf{h}_{w^k_j}\, \mathbf{t}^{\mathsf{T}}}{\sqrt{d}} \right), \qquad P_{kg} = \frac{1}{L} \sum_{l=1}^{L} a^{(l)}$$

where $P_{kg}$ denotes the distribution over knowledge triples; $a^{(l)}$ denotes an attention score; $\sqrt{d}$ denotes the scaling factor; $w^k_j$ denotes the $j$-th word in the $k$-th utterance, with $\mathbf{h}_{w^k_j}$ its contextual representation; $w$ denotes a word; and $l$ indexes the $l$-th head of the multi-head attention, of $L$ heads in total;
the vector representations of the medical knowledge triples and the text context embedded representation then jointly and sequentially pass through the multi-head attention mechanism and a fully connected feed-forward network to obtain the representation of the final knowledge triples; the calculation formula is:

$$\mathbf{t}^{f} = \mathrm{FFN}\big( \mathrm{MultiHead}(\mathbf{H},\ \mathbf{t}_k,\ \mathbf{t}_k) \big)$$

where $\mathbf{t}^{f}$ denotes the representation of the final knowledge triples; $\mathrm{FFN}(\cdot)$ denotes the fully connected feed-forward network; $\mathrm{MultiHead}(\cdot)$ denotes the multi-head attention mechanism, in which $\mathbf{H}$, $\mathbf{t}_k$, and $\mathbf{t}_k$ serve respectively as the query, the key, and the value; $\mathbf{H}$ denotes the text context embedded representation; and $\mathbf{t}_k$ denotes the vector representation of the $k$-th medical knowledge triple.
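A minimal sketch of this fusion step, built on PyTorch's stock multi-head attention (an assumption; the embodiment does not prescribe a particular implementation). The averaged attention weights double as a distribution over the triples, mirroring $P_{kg}$ above:

    import torch.nn as nn

    class KnowledgeFusion(nn.Module):
        def __init__(self, dim, heads=8, ffn_dim=2048):
            super().__init__()
            self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(),
                                     nn.Linear(ffn_dim, dim))

        def forward(self, H, triples):
            # query = H, key = value = bridged triple vectors
            fused, attn = self.mha(H, triples, triples)  # attn: (B, L, num_triples)
            p_kg = attn.mean(dim=1)   # average over context tokens -> distribution
            return self.ffn(fused), p_kg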
S5: a response utterance is generated by a decoder combining the text context embedded representation, the representation of the final knowledge triplet, and the distribution of the knowledge triplet.
Specifically, the process of obtaining the response utterance includes:
S5.1: the text context embedded representation is input into the decoder to obtain the softmax probability distribution over the dataset vocabulary and the distribution over the words in the session history; consistent with the quantities defined below, the calculation can be written as:

$$P_{vocab} = \mathrm{softmax}\big( \mathrm{Decoder}(\mathbf{H}) \big), \qquad P_{his}(w) = \sum_{(k,j)\,:\, w^k_j = w} \beta_{k,j}$$

where $\mathrm{Decoder}(\cdot)$ denotes a decoder implemented by the BioBERT pre-trained model; $\mathbf{H}$ denotes the text context embedded representation; $\mathrm{softmax}(\cdot)$ denotes the softmax layer of the decoder; $P_{vocab}$ denotes the softmax probability distribution over the dataset vocabulary; $P_{his}$ denotes the distribution over the words in the session history; $\beta_{k,j}$ denotes the cross-attention score of the decoder, i.e. the probability information of the $j$-th word in the $k$-th utterance; $w^k_j$ denotes the $j$-th word in the $k$-th utterance; $w$ denotes a word; and $l$ indexes the $l$-th head of the multi-head attention.
S5.2: the final probability distribution is obtained by combining, via gating probabilities, the distribution over knowledge triples, the softmax probability distribution over the dataset vocabulary, and the distribution over the words in the session history; the calculation formula of the final probability distribution is:

$$g_1 = \sigma_1\big( W_1 [\mathbf{H};\ \mathbf{t}^{f}] + b_1 \big), \qquad g_2 = \sigma_2\big( W_2 [\mathbf{H};\ \mathbf{t}^{f}] + b_2 \big)$$

$$P(w) = g_1\, P_{vocab}(w) + (1 - g_1)\,\big( g_2\, P_{his}(w) + (1 - g_2)\, P_{kg}(w) \big)$$

where $P(w)$ denotes the final probability distribution; $g_1$ denotes the first gating probability; $g_2$ denotes the second gating probability; $\sigma_1$ denotes the first sigmoid layer of the decoder, with training parameters $W_1$ and bias term $b_1$; $\sigma_2$ denotes the second sigmoid layer of the decoder, with training parameters $W_2$ and bias term $b_2$; $\mathbf{H}$ denotes the text context embedded representation; $\mathbf{t}^{f}$ denotes the representation of the final knowledge triples; $P_{vocab}$ denotes the softmax probability distribution over the dataset vocabulary; $P_{his}$ denotes the distribution over the words in the session history; $P_{kg}$ denotes the distribution over knowledge triples; and $g_2$ measures the trade-off between $P_{his}$ and $P_{kg}$.
S5.3: selecting the word with the highest probability under the final distribution as the next generated word, and repeating until a complete medical dialogue reply has been generated; the medical dialogue reply is the response utterance.
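The gated combination of S5.2 and the greedy selection of S5.3 reduce to a few lines; this sketch assumes the three distributions and the two gates have already been computed, as torch tensors, for the current decoding step:

    def next_token(p_vocab, p_his, p_kg, g1, g2):
        # P(w) = g1*P_vocab + (1-g1)*(g2*P_his + (1-g2)*P_kg)
        p_final = g1 * p_vocab + (1 - g1) * (g2 * p_his + (1 - g2) * p_kg)
        return int(p_final.argmax())   # S5.3: pick the highest-probability word

Decoding repeats this step, appending each chosen word, until the complete medical dialogue reply has been produced.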
The knowledge graph enhancement-based large language model question-answer generation method provided by this embodiment acquires the user history dialogue text and obtains a text context embedded representation through an encoder; extracts entity mentions from the question text in the user history dialogue text and links them to target entities in the external knowledge base; queries the external knowledge base for the medical knowledge triples corresponding to each target entity; passes the text context embedded representation and the medical knowledge triples jointly through a multi-head attention mechanism and a fully connected feed-forward network to obtain the representation of the final knowledge triples, computing the distribution over knowledge triples along the way; and finally combines, by a decoder, the text context embedded representation, the representation of the final knowledge triples, and the distribution over knowledge triples to generate a response utterance. The method raises the utilization of external medical knowledge and improves response accuracy.
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the claims. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (8)

1. A knowledge graph enhancement-based large language model question-answer generation method is characterized by comprising the following steps:
S1: constructing an external knowledge base based on medical knowledge, wherein the external knowledge base is in a knowledge graph form;
S2: acquiring a user history dialogue text, and obtaining a text context embedded representation from it through an encoder;
the user history dialogue text is recorded as $C = \{u_1, u_2, \dots, u_{k-1}\}$, where $u_1$ denotes the 1st utterance and $u_{k-1}$ denotes the $(k-1)$-th utterance; the process of obtaining the text context embedded representation includes:
Step 1: inserting special tokens into the user history dialogue text to segment each utterance; the user history dialogue text after inserting the special tokens is expressed as:

$$C = [\mathrm{CLS}]\; w^1_1 \dots w^1_{n_1}\; [\mathrm{SEP}]\; w^2_1 \dots w^2_{n_2}\; [\mathrm{SEP}]\; \dots\; w^{k-1}_1 \dots w^{k-1}_{n_{k-1}}\; [\mathrm{SEP}]$$

where $[\mathrm{CLS}]$ is a special token indicating the beginning of the session; $[\mathrm{SEP}]$ is a special token marking the end of the session or segmenting utterances: at the end it ends the session, and in the middle it segments utterances; $w^1_1$ denotes the 1st word of the 1st utterance; $w^1_{n_1}$ denotes the $n_1$-th word of the 1st utterance; $w^2_1$ denotes the 1st word of the 2nd utterance; $w^2_{n_2}$ denotes the $n_2$-th word of the 2nd utterance; $w^{k-1}_1$ denotes the 1st word of the $(k-1)$-th utterance; and $w^{k-1}_{n_{k-1}}$ denotes the $n_{k-1}$-th word of the $(k-1)$-th utterance;
Step 2: inputting the user history dialogue text with the inserted special tokens into an encoder to generate the text context embedded representation; the calculation formula is:

$$\mathbf{H} = \mathrm{Encoder}(C)$$

where $\mathbf{H}$ denotes the text context embedded representation; $\mathrm{Encoder}(\cdot)$ denotes an encoder implemented by the BioBERT pre-trained model; and $C$ denotes the user history dialogue text after inserting the special tokens;
S3: extracting entity mentions from the question text in the user history dialogue text, and linking the entity mentions to target entities in the external knowledge base; querying the external knowledge base for the medical knowledge triples related to each target entity;
S4: passing the text context embedded representation and the medical knowledge triples jointly and sequentially through a multi-head attention mechanism and a fully connected feed-forward network, thereby computing the distribution over knowledge triples and obtaining the representation of the final knowledge triples;
S5: generating, by a decoder, a response utterance by combining the text context embedded representation, the representation of the final knowledge triples, and the distribution over knowledge triples;
the process of deriving the response utterance includes:
S5.1: the text context embedded representation is input into the decoder to obtain the softmax probability distribution over the dataset vocabulary and the distribution over the words in the session history; consistent with the quantities defined below, the calculation can be written as:

$$P_{vocab} = \mathrm{softmax}\big( \mathrm{Decoder}(\mathbf{H}) \big), \qquad P_{his}(w) = \sum_{(k,j)\,:\, w^k_j = w} \beta_{k,j}$$

where $\mathrm{Decoder}(\cdot)$ denotes a decoder implemented by the BioBERT pre-trained model; $\mathbf{H}$ denotes the text context embedded representation; $\mathrm{softmax}(\cdot)$ denotes the softmax layer of the decoder; $P_{vocab}$ denotes the softmax probability distribution over the dataset vocabulary; $P_{his}$ denotes the distribution over the words in the session history; $\beta_{k,j}$ denotes the cross-attention score of the decoder, i.e. the probability information of the $j$-th word in the $k$-th utterance; $w^k_j$ denotes the $j$-th word in the $k$-th utterance; $w$ denotes a word; and $l$ indexes the $l$-th head of the multi-head attention;
S5.2: obtaining the final probability distribution by combining, via gating probabilities, the distribution over knowledge triples, the softmax probability distribution over the dataset vocabulary, and the distribution over the words in the session history;
S5.3: generating the next word as the word with the highest probability under the final distribution, until the final $[\mathrm{SEP}]$ special token is generated, to obtain a complete medical dialogue reply, wherein the medical dialogue reply is the response utterance.
2. The knowledge graph enhancement-based large language model question-answer generation method according to claim 1, wherein in S1, the external knowledge base comprises a plurality of triples, whose forms include {head entity, relation, tail entity} and {entity, attribute, attribute value}; medical knowledge entities and corresponding entity description texts are extracted from the medical knowledge; triples of the form {head entity, relation, tail entity} are constructed according to the relations among the medical knowledge entities; triples of the form {entity, attribute, attribute value} are constructed from each medical knowledge entity, taking its entity description text as the attribute value; and entity alignment is performed on all triples to obtain the external knowledge base;
the medical knowledge entities include disease names, drug names, symptom names, examination item names, food names, and department names.
3. The knowledge graph enhancement-based large language model question-answer generation method according to claim 2, wherein S1 further comprises redundancy-elimination preprocessing of the entity description text using the TextRank algorithm, the preprocessing comprising:
Step 1: dividing the entity description text into a plurality of sentences; taking the sentences as nodes, and taking the similarity between two sentences as the weight of the edge between them; the similarity calculation formula is:

$$\mathrm{Sim}(S_i, S_j) = \frac{\left|\{\, w_k \mid w_k \in S_i \ \text{and}\ w_k \in S_j \,\}\right|}{\log \lvert S_i \rvert + \log \lvert S_j \rvert}$$

where $\mathrm{Sim}(S_i, S_j)$ is the similarity between sentences $S_i$ and $S_j$; $S_i$ and $S_j$ denote the $i$-th and $j$-th sentences; $\lvert S_i \rvert$ denotes sentence length; and $w_k$ ranges over the words appearing in both $S_i$ and $S_j$;
Step 2: initializing the weight value of every sentence to $1/n$, and iteratively computing the weight value of sentence $i$ from all other sentences connected to it and the similarity between each pair; the iterative formula is:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)} \frac{w_{ji}}{\sum_{V_k \in \mathrm{Out}(V_j)} w_{jk}}\, WS(V_j)$$

where $WS(V_i)$ is the weight value of the $i$-th sentence; $\mathrm{In}(V_i)$ is the set of sentences pointing to sentence $V_i$; $\mathrm{Out}(V_j)$ is the set of sentences that sentence $V_j$ points to; $V_k$ denotes the $k$-th sentence pointed to by $V_j$; $w_{ji}$ is the edge weight between $V_j$ and $V_i$; $d$ is the damping ratio; and $WS(V_j)$ is the weight value of the $j$-th sentence;
Step 3: repeating Step 2 until the weight value of every sentence has converged;
Step 4: adjusting the weight value of each sentence based on its entity coverage and its position in the entity description text to obtain an importance score for each sentence; consistent with the quantities defined below, the importance score can be written as:

$$W_i = WS(V_i) \odot \left( \alpha\, \hat{C}_i + (1 - \alpha)\, \hat{P}_i \right)$$

where $W_i$ is the importance score of the $i$-th sentence; $\odot$ denotes the Hadamard product; $\alpha$ denotes the adjustment parameter; $C_i$ denotes the entity coverage of the $i$-th sentence, with $\mathrm{count}_i$ the number of entity names contained in the $i$-th sentence; $P_i$ denotes the position of the $i$-th sentence in the entity description text; $n$ denotes the number of sentences in the entity description text; $\hat{C} = \mathrm{norm}(C)^{\mathsf{T}}$ is the normalized feature matrix of sentence entity coverage; $\hat{P} = \mathrm{norm}(P)^{\mathsf{T}}$ is the normalized feature matrix of sentence positions; and $\mathsf{T}$ denotes the transpose;
Step 5: selecting the sentences with the top $x$ importance scores as candidate text summaries, and combining all candidate text summaries to obtain the redundancy-eliminated, preprocessed entity description text.
4. The knowledge graph enhancement-based large language model question-answer generation method according to claim 2, wherein in S3, the process of obtaining the medical knowledge triples includes:
Step 1: extracting entity mentions from the question text in the user history dialogue text using the BERT+BiLSTM+CRF model;
Step 2: calculating the similarity between the entity mention and each medical knowledge entity in the external knowledge base, taking the medical knowledge entity with the highest similarity to the entity mention as the target entity, and linking the entity mention to that target entity in the external knowledge base;
Step 3: querying the external knowledge base, according to the target entity, for the medical knowledge triples that contain the target entity.
5. The knowledge graph enhancement-based large language model question-answer generation method of claim 4, wherein said calculating the similarity between the entity mention and each medical knowledge entity in the external knowledge base comprises:
Step 2.1: encoding each medical knowledge entity in the external knowledge base to obtain candidate entity encodings; encoding the entity description text corresponding to each medical knowledge entity to obtain entity description sentence vectors; and encoding the entity mention to obtain an entity encoding;
Step 2.2: concatenating each candidate entity encoding with its corresponding entity description sentence vector to obtain the semantic text representation vector of the candidate entity; and concatenating the entity encoding with its corresponding text context embedded representation to obtain the semantic text representation vector of the entity;
Step 2.3: calculating the similarity between the semantic text representation vector of the entity and that of the candidate entity, with the calculation formula:

$$\mathrm{sim} = \cos\left( \mathbf{v}_{e},\ \mathbf{v}_{c} \right)$$

where $\mathrm{sim}$ is the similarity between the semantic text representation vector of the entity and that of the candidate entity; $\cos(\cdot,\cdot)$ denotes cosine similarity; $\mathbf{v}_{e}$ is the semantic text representation vector of the entity; and $\mathbf{v}_{c}$ is the semantic text representation vector of the candidate entity;
Step 2.4: repeating Step 2.3 until all medical knowledge entities have been processed, obtaining the similarity between the entity mention and each medical knowledge entity in the external knowledge base.
6. The knowledge graph enhancement-based large language model question-answer generation method according to claim 1, wherein S4 further comprises bridging the representation gap between the medical knowledge triples and the text context embedded representation with a multi-layer perceptron to obtain a vector representation of each medical knowledge triple; the calculation formula is:

$$\mathbf{t}_k = \mathrm{MLP}\big( [\, h;\ hs;\ r;\ t;\ ts \,] \big)$$

where $\mathbf{t}_k$ denotes the vector representation of a medical knowledge triple; $\mathrm{MLP}(\cdot)$ denotes the multi-layer perceptron; $[\,\cdot\,;\cdot\,]$ denotes the concatenation operation; $h$ denotes the head entity of the medical knowledge triple; $hs$ denotes the entity description text of the head entity in the medical knowledge triple; $r$ denotes the relation; $t$ denotes the tail entity of the medical knowledge triple; and $ts$ denotes the entity description text of the tail entity in the medical knowledge triple.
7. The knowledge graph enhancement-based large language model question-answer generation method according to claim 6, wherein S4 comprises:
predicting the distribution over knowledge triples with a multi-head attention mechanism, based on the text context embedded representation and the medical knowledge triples:

$$a^{(l)} = \mathrm{softmax}\!\left( \frac{\mathbf{h}_{w^k_j}\, \mathbf{t}^{\mathsf{T}}}{\sqrt{d}} \right), \qquad P_{kg} = \frac{1}{L} \sum_{l=1}^{L} a^{(l)}$$

where $P_{kg}$ denotes the distribution over knowledge triples; $a^{(l)}$ denotes an attention score; $\sqrt{d}$ denotes the scaling factor; $w^k_j$ denotes the $j$-th word in the $k$-th utterance, with $\mathbf{h}_{w^k_j}$ its contextual representation; $w$ denotes a word; and $l$ indexes the $l$-th head of the multi-head attention, of $L$ heads in total;
the vector representations of the medical knowledge triples and the text context embedded representation then jointly and sequentially pass through the multi-head attention mechanism and a fully connected feed-forward network to obtain the representation of the final knowledge triples; the calculation formula is:

$$\mathbf{t}^{f} = \mathrm{FFN}\big( \mathrm{MultiHead}(\mathbf{H},\ \mathbf{t}_k,\ \mathbf{t}_k) \big)$$

where $\mathbf{t}^{f}$ denotes the representation of the final knowledge triples; $\mathrm{FFN}(\cdot)$ denotes the fully connected feed-forward network; $\mathrm{MultiHead}(\cdot)$ denotes the multi-head attention mechanism, in which $\mathbf{H}$, $\mathbf{t}_k$, and $\mathbf{t}_k$ serve respectively as the query, the key, and the value; $\mathbf{H}$ denotes the text context embedded representation; and $\mathbf{t}_k$ denotes the vector representation of the $k$-th medical knowledge triple.
8. The knowledge graph enhancement-based large language model question-answer generation method according to claim 1, wherein the calculation formula of the final probability distribution is:

$$g_1 = \sigma_1\big( W_1 [\mathbf{H};\ \mathbf{t}^{f}] + b_1 \big), \qquad g_2 = \sigma_2\big( W_2 [\mathbf{H};\ \mathbf{t}^{f}] + b_2 \big)$$

$$P(w) = g_1\, P_{vocab}(w) + (1 - g_1)\,\big( g_2\, P_{his}(w) + (1 - g_2)\, P_{kg}(w) \big)$$

where $P(w)$ denotes the final probability distribution; $g_1$ denotes the first gating probability; $g_2$ denotes the second gating probability; $\sigma_1$ denotes the first sigmoid layer of the decoder, with training parameters $W_1$ and bias term $b_1$; $\sigma_2$ denotes the second sigmoid layer of the decoder, with training parameters $W_2$ and bias term $b_2$; $\mathbf{H}$ denotes the text context embedded representation; $\mathbf{t}^{f}$ denotes the representation of the final knowledge triples; $P_{vocab}$ denotes the softmax probability distribution over the dataset vocabulary; $P_{his}$ denotes the distribution over the words in the session history; $P_{kg}$ denotes the distribution over knowledge triples; and $g_2$ measures the trade-off between $P_{his}$ and $P_{kg}$.
CN202410653158.2A 2024-05-24 2024-05-24 Knowledge graph enhancement-based large language model question-answer generation method Active CN118227769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410653158.2A CN118227769B (en) 2024-05-24 2024-05-24 Knowledge graph enhancement-based large language model question-answer generation method


Publications (2)

Publication Number Publication Date
CN118227769A (en) 2024-06-21
CN118227769B (en) 2024-08-20

Family

ID=91513772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410653158.2A Active CN118227769B (en) 2024-05-24 2024-05-24 Knowledge graph enhancement-based large language model question-answer generation method

Country Status (1)

Country Link
CN (1) CN118227769B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118469005B (en) * 2024-07-10 2024-09-27 北方健康医疗大数据科技有限公司 Medical knowledge graph construction method, system, terminal and storage medium based on large language model

Citations (2)

Publication number Priority date Publication date Assignee Title
CN115470327A (en) * 2022-08-11 2022-12-13 天津泰凡科技有限公司 Medical question-answering method based on knowledge graph and related equipment
CN117033602A (en) * 2023-08-24 2023-11-10 北京邮电大学 Method for constructing multi-mode user mental perception question-answering model

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN113010693B (en) * 2021-04-09 2024-03-08 大连民族大学 Knowledge graph intelligent question-answering method integrating pointer generation network
CN113488165B (en) * 2021-07-26 2023-08-22 平安科技(深圳)有限公司 Text matching method, device, equipment and storage medium based on knowledge graph
CN115422369B (en) * 2022-08-30 2023-11-03 中国人民解放军国防科技大学 Knowledge graph completion method and device based on improved TextRank
CN116561272A (en) * 2023-04-18 2023-08-08 华南师范大学 Open domain visual language question-answering method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN118227769A (en) 2024-06-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant