CN116719913A - Medical question-answering system based on improved named entity recognition and construction method thereof

Info

Publication number: CN116719913A
Application number: CN202310469261.7A
Authority: CN
Inventors: 姜芳艽; 陈婕妤; 王斌
Original assignee: Jiangsu Normal University
Current assignee: Jiangsu Normal University
Priority date: 2023-04-27
Filing date: 2023-04-27
Publication date: 2023-09-08

Abstract

A medical question-answering system based on improved named entity recognition and a construction method thereof are provided. The system and method extract text features with a BERT pre-trained language model, which has strong semantic expression capability, and add a perturbation to the resulting word vectors to enhance the generalization capability and robustness of the model. Adversarial training is introduced to address insufficient or missing annotations that may exist in the data set, reducing the influence of data-set noise on the results. Learning rates are set layer by layer, so that the performance of the BERT layer does not degrade, the downstream layers train faster, and training stays synchronized. The output features of the BERT layer and the BiLSTM layer are spliced, so that the layers are more closely connected, deeper features are obtained, and the original features of the BERT layer are not lost. Chit-chat sentences are handled through two-round intent recognition, so that the system can respond to users' chit-chat topics, and slot inheritance gives the system a multi-round question-answering capability, which greatly improves the accuracy of the system's answers.

Description

Medical question-answering system based on improved named entity recognition and construction method thereof

Technical Field

The invention relates to a medical question-answering system based on improved named entity recognition and a construction method thereof, belonging to the technical field of knowledge graph and natural language processing.

Background

With the continuous development of natural language processing technology, knowledge graphs have gradually been applied in various fields, and question-answering systems based on knowledge graphs have emerged. At present, there are few question-answering assistant platforms in the vertical field of precision medicine. Traditional question-answering systems mainly retrieve related content by keyword search, but they return too many pages, which users must then screen and judge themselves, so the accuracy of the results is difficult to guarantee. Moreover, data in the medical industry is massive and complex, and traditional methods alone cannot meet the diverse needs of users.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a medical question-answering system based on improved named entity recognition and a construction method thereof.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: the medical question-answering system based on the improved named entity recognition comprises a data acquisition module, a knowledge storage module, a natural language understanding module, a knowledge calculation module and a dialogue management and interaction module, wherein the output end of the data acquisition module is connected with the input end of the knowledge storage module;

the data acquisition module is used for crawling medical data on a website, cleaning and preprocessing the data, and constructing a knowledge graph data set;

the knowledge storage module is used for storing the knowledge triples extracted from the data set by adopting a Neo4j graph database and displaying a visualized knowledge graph;

the natural language understanding module is used for carrying out a named entity recognition task and an intention recognition task on a question input by a user and understanding the specific meaning of the user question;

the knowledge calculation module is used for converting the question processed by the natural language understanding module into a structured query statement, and querying the knowledge graph with Cypher statements to obtain an answer;

the dialogue management and interaction module is used for developing the pages of the knowledge-graph-based question-answering system, supporting the user in entering questions, returning the corresponding answers, and supporting multiple rounds of question answering.

A medical question-answering system construction method based on improved named entity recognition comprises the following steps:

step one, crawling relevant medical information, cleaning and preprocessing the data, constructing a data set, defining the required data schema, storing the collected data in a Neo4j graph database according to the defined schema, and completing the construction of the knowledge graph;

step two, constructing a named entity recognition model structure based on BERT-FGM-BiLSTM-CRF-lr splicing improved network, executing a medical named entity recognition task, and extracting medical entities in question sentences input by a user;

step three, constructing an intention recognition model structure based on the BERT-TextCNN network, executing a question intention recognition task, and recognizing the intention contained in the question input by the user;

step four, constructing a medical question-answering system by adopting a technical route of intention recognition, semantic slot design and template, carrying out two-round intention recognition, filling slots after judging intention, and inquiring in a knowledge graph according to the structured semantic slots;

step five, realizing a multi-round question-answering function of the system by utilizing slot inheritance;

step six, designing a medical question-answering system and a user page thereof, and carrying out page development by using PyQt5 to support man-machine interaction and realize the function of on-line auxiliary diagnosis.

Further, the step of constructing the data set and the knowledge graph in the step one is as follows:

(1.1) extracting semi-structured data from web pages: the medicine searching and questioning network provides users with comprehensive knowledge of diseases and treatments, its data is reasonably authoritative and clearly structured, and it is suitable for crawling, so it is selected herein as the data source for constructing the knowledge graph;

(1.2) analyzing the property of the webpage to obtain a URL address corresponding to the data;

(1.3) sending a request to the network address by using the urllib.request data request module to acquire the HTML-format data of the webpage; the network address is the URL address obtained in step (1.2);

(1.4) analyzing the HTML tag by using XPath to extract the required data and the association relation thereof;

(1.5) defining eight types of entities: diseases, symptoms, examinations, medicines, pharmaceutical companies, departments, foods and recipes; and designing eleven types of relations: disease-symptom, disease-complication, disease-examination, disease-recommended medicine, disease-common medicine, disease-department, disease-recipe, disease-recommended food, disease-contraindicated food, department-department and pharmaceutical company-medicine;

(1.6) saving the data in the step (1.5) to the local to obtain an initial corpus;

(1.7) directly deleting or filling in missing values in the data;

(1.8) for noise in the data, normalizing with regular expressions to remove stop words, garbled characters, special characters and redundant information, and to unify the format of punctuation marks and letters;

(1.9) performing word segmentation with a word segmentation tool, formatting the data, converting it into key-value pairs for storage, and exporting well-formed JSON data as the data set for constructing the medical knowledge graph;

(1.10) analyzing the entities, attributes and inter-entity relationships in the constructed data set, and defining the required data schema (Schema) in combination with the application of the question-answering system;

(1.11) extracting knowledge triples from the data according to the Schema;

(1.12) using the Py2Neo module in Python to connect to Neo4j, using Cypher statements to create the entity nodes and entity relation edges of the knowledge graph according to the Schema, and writing the attributes into the disease entity nodes to complete the construction of the knowledge graph.
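As a minimal sketch of steps (1.11) and (1.12), the snippet below turns knowledge triples into parameterized Cypher `MERGE` statements. The labels `Disease` and `Symptom`, the relationship type `HAS_SYMPTOM`, the `name` property and the helper `merge_statement` are illustrative assumptions rather than names taken from the invention. Executing the statements against Neo4j (for example through Py2Neo) is left out so that the example runs without a database.

```python
def merge_statement(head_label, rel_type, tail_label):
    """Build a Cypher statement that creates two nodes and one edge if absent.

    Labels and relationship types are interpolated into the text because they
    are generally not parameterizable in Cypher; the node names stay as the
    $head / $tail query parameters.
    """
    return (
        f"MERGE (h:{head_label} {{name: $head}}) "
        f"MERGE (t:{tail_label} {{name: $tail}}) "
        f"MERGE (h)-[:{rel_type}]->(t)"
    )


# Hypothetical triples: (head label, head name, relation, tail label, tail name)
triples = [
    ("Disease", "cold", "HAS_SYMPTOM", "Symptom", "cough"),
    ("Disease", "cold", "HAS_SYMPTOM", "Symptom", "fever"),
]

# One (statement, parameters) pair per triple, ready to hand to a Neo4j client.
statements = [
    (merge_statement(hl, rel, tl), {"head": h, "tail": t})
    for hl, h, rel, tl, t in triples
]

for cypher, params in statements:
    print(cypher, params)
```

`MERGE` is used instead of `CREATE` in this sketch so that a disease appearing in many triples is matched rather than duplicated.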

Further, in the second step, the construction steps of the named entity recognition model based on BERT-FGM-BiLSTM-CRF-lr splicing improved network are as follows:

(2.1) fusing the CCKS2019 dataset and the cMedQANER dataset: the CCKS2019 dataset contains disease and diagnosis, anatomy, surgery, imaging examination, medication, and laboratory test; the cMedQANER dataset contains disease, symptom, detection, physiology, treatment regimen, body part, population, department, medicine, local, and time; both data sets are annotated with the BIO tagging scheme;

(2.2) taking the CCKS2019 dataset as the base dataset, extracting the entities of some usable categories from the cMedQANER dataset and merging them into the CCKS2019 dataset under the mappings "detection to examination, treatment to surgery, drug to medication", thereby expanding the CCKS2019 dataset on a small scale;

(2.3) introducing a BERT model for word embedding: the embeddings are obtained through pre-training followed by a fine-tuning stage; all word vectors are obtained through the BERT model and text features are extracted, which strengthens the model's ability to capture character-level semantic features; the result is denoted e1;

(2.4) introducing adversarial training: a perturbation r is added to all word vectors e1 obtained through the BERT model, and the result is denoted e2; the adversarial training objective is:

$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim D}\left[\max_{\Delta x \in \Omega} L(x+\Delta x,\, y;\, \theta)\right]$$

wherein D is the training set, x is the input, y is the label, θ is the model parameters, L(x+Δx, y; θ) is the loss of a single sample, Ω is the perturbation space, and Δx is the adversarial perturbation;

the perturbation r is obtained by normalizing the current gradient of the loss with respect to the word vectors output by BERT, and the sum of the word vector and the perturbation is the adversarial sample; the adversarial perturbation is calculated as:

$$\Delta x = \epsilon \cdot \frac{g}{\lVert g \rVert_2}, \qquad g = \nabla_x L(x,\, y;\, \theta)$$

where g is the gradient, i.e., the partial derivative of the loss function with respect to x, and ε is the scaling factor;

(2.5) sending the perturbed word vector e2 into the BiLSTM network to obtain contextual feature information, denoted e3;

(2.6) performing feature stitching on the output e3 of the BiLSTM layer and the disturbed word vector e2, and simultaneously reserving the output features of the two layers, and marking the output features as e4;

(2.7) sending the vector e4 into the full connection layer for dimension reduction treatment, and marking as e5;

(2.8) inputting e5 into the CRF layer for decoding to obtain the label sequence corresponding to each character, denoted output; during decoding, the CRF layer decides the labels according to the transition probability matrix: the transition matrix is randomly initialized at the start of training and then optimized so that it better matches the actual transition probabilities among the labels of the training data; the formula is as follows:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i}$$

wherein s(X, y) is the score of tag sequence y for an input sequence X of n characters, A is the transition matrix, $A_{y_i, y_{i+1}}$ represents the transition score from tag $y_i$ to tag $y_{i+1}$, P is the output of the BiLSTM network, and $P_{i, y_i}$ represents the score that the BiLSTM layer assigns to the $y_i$-th label for the i-th character;

(2.9) training the BERT layer and the downstream model layers with different learning rates, adjusted appropriately at different stages to keep training synchronized, wherein the learning rate of the BERT layer is set to lr1, that of the BiLSTM layer to lr2, and that of the CRF layer to lr3.
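Two of the computations in step two can be illustrated with a short sketch in plain Python. It assumes the usual FGM form for step (2.4), in which the perturbation is the gradient scaled to L2 norm ε, and the usual linear-chain CRF path score for step (2.8), the sum of emission and transition scores. The function names, the toy numbers and the omission of start and end tags are assumptions made for illustration; a real implementation would operate on framework tensors.

```python
import math


def fgm_perturbation(grad, epsilon=1.0):
    """Return r = epsilon * g / ||g||_2 for a gradient given as a flat list."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # A zero gradient gives no direction to perturb in.
        return [0.0 for _ in grad]
    return [epsilon * g / norm for g in grad]


def crf_path_score(emissions, transitions, tags):
    """Score one tag path: emission scores plus tag-to-tag transition scores.

    emissions[i][t]   : score of tag t for character i (the P matrix)
    transitions[a][b] : score of moving from tag a to tag b (the A matrix)
    Start and end transitions are left out of this sketch.
    """
    score = sum(emissions[i][t] for i, t in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


# Toy word vector e1 and a toy gradient of the loss with respect to it.
e1 = [0.5, -1.0, 2.0]
grad = [3.0, 0.0, 4.0]
r = fgm_perturbation(grad, epsilon=0.5)
e2 = [x + d for x, d in zip(e1, r)]  # adversarial sample = word vector + r

# Two characters, two tags; score the path "tag 1 then tag 0".
emissions = [[1.0, 2.0], [3.0, 4.0]]
transitions = [[0.1, 0.2], [0.3, 0.4]]
path_score = crf_path_score(emissions, transitions, [1, 0])

print(r, e2, path_score)
```

Because the gradient is normalized, ε alone controls the size of the perturbation, regardless of how large the raw gradient is.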

Further, the method for constructing the intention recognition model structure based on the BERT-TextCNN network in the third step is as follows:

(3.1) selecting a published CMID dataset;

(3.2) according to the constructed knowledge graph and the defined entities and relations, extracting from the CMID dataset the 13 types of intents that the system can answer, and writing them into a label file;

(3.3) for the intents with little data, supplementing the data set by template-based generation: rule templates are set up, and the keywords of each rule are written manually;

(3.4) combining the keywords in a certain order and generating samples at random, so as to obtain a supplementary corpus and balance the original data set;

(3.5) converting the text into vectors using the BERT model as the Embedding layer, extracting text features, denoted $b_1, b_2, \ldots, b_n$;

(3.6) splicing the vectors $b_1, b_2, \ldots, b_n$ to obtain the embedding matrix, denoted $B_{1:n}$:

$$B_{1:n} = [b_1, b_2, \ldots, b_n]$$

wherein $b_1, b_2, \ldots, b_n$ represent the word vectors;

(3.7) sending the embedding matrix $B_{1:n}$ into the convolution layer for feature extraction, performing convolution with kernels of sizes 3, 4 and 5, and extracting the semantic features of the sentence, denoted $a_i$;

the semantic feature $a_i$ is calculated as:

$$a_i = f(W \cdot M_{i:i+h-1} + b)$$

wherein M is the word vector matrix, b is the bias, W is the neural network weight, h is the convolution kernel size, f is a nonlinear function used to calculate the feature value, and $M_{i:i+h-1}$ denotes the word vectors at positions i through i+h-1 of the text;

(3.8) sending the semantic features $a_i$ extracted in step (3.7) to the pooling layer, where max pooling (max_pooling) downsamples the semantic features $a_i$ so that the vector dimension stays the same, and the resulting output is denoted f1;

(3.9) after passing through the pooling layer, converting sentences with different lengths into fixed-length expression;

(3.10) adding Dropout to prevent overfitting;

(3.11) sending the pooled f1 into a full connection layer to obtain the probability of each label;

(3.12) outputting the classification result of the text by using softmax, and recording as output.
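The convolution and pooling of steps (3.7) to (3.9) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: a single filter per kernel size, ReLU as the nonlinear function f, all-ones weights, and a toy two-dimensional embedding. None of these values come from the invention, and the dropout, fully connected and softmax layers are omitted.

```python
def relu(x):
    return max(0.0, x)


def conv_feature_map(M, W, b, f=relu):
    """Compute a_i = f(W . M[i:i+h-1] + b) for every window of h word vectors.

    M is a list of word vectors, W is a filter with h rows of the same width,
    and b is a scalar bias. The sentence must contain at least h vectors
    (a real model would pad shorter sentences).
    """
    h = len(W)
    features = []
    for i in range(len(M) - h + 1):
        total = b
        for k in range(h):
            total += sum(w * m for w, m in zip(W[k], M[i + k]))
        features.append(f(total))
    return features


def max_over_time(features):
    """Max pooling: keep the strongest response of one filter."""
    return max(features)


# Toy sentence of six word vectors with embedding size 2.
M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0], [1.0, 0.0]]

pooled = []
for h in (3, 4, 5):  # kernel sizes named in step (3.7)
    W = [[1.0, 1.0] for _ in range(h)]  # hypothetical all-ones filter
    pooled.append(max_over_time(conv_feature_map(M, W, b=0.0)))

print(pooled)  # [5.0, 6.0, 7.0]
```

Each kernel size yields a feature map of a different length, but max pooling reduces every map to one value, which is why sentences of different lengths end up with a fixed-length representation.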

Further, the step of constructing the medical question-answering system in the step four is as follows:

(4.1) defining a question input by a user as Q;

(4.2) extracting the entities related to the medical treatment in the Q by using the named entity recognition model in the second step;

(4.3) performing first-round intent recognition on Q with a logistic regression algorithm to judge whether the user's input is a chit-chat intent or a diagnosis intent;

(4.4) if a chit-chat intent is judged, answering with the preset chit-chat template;

(4.5) if the diagnosis intention is judged, entering a second round of intention recognition, namely judging the specific diagnosis intention of the user by using the intention recognition model in the third step;

(4.6) after obtaining the probability of each category, the system performs descending order sorting according to the probability, and takes the intention with the highest confidence;

(4.7) presetting three confidence intervals, namely greater than 0.8, between 0.4 and 0.8, and less than 0.4, and comparing the intent confidence against these thresholds;

(4.8) determining a reply strategy according to the obtained intention confidence, filling slots, and inquiring in a knowledge graph by using a Cypher statement according to the structured semantic slots;

(4.9) when the highest intent confidence returned is greater than 0.8, adopting the "accept" strategy and answering according to the intent and the slot value in combination with the reply template;

(4.10) if the slot value is empty, which shows that no related result was found in the knowledge graph, replying directly with the dense_response template;

(4.11) when the highest intent confidence returned is between 0.4 and 0.8, adopting the "clarify" strategy, in which the system asks the user a clarifying question according to the template;

(4.12) when the highest intent confidence returned is less than 0.4, a "reject" strategy is employed to reject the answer.
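The threshold logic of steps (4.6) to (4.12) can be sketched as a small function. The intent names and the function `choose_strategy` are hypothetical, and the handling of the exact boundary values 0.4 and 0.8 (both treated here as "clarify") is an assumption, since the text only gives the three intervals.

```python
def choose_strategy(intent_probs):
    """Pick the most probable intent and map its confidence to a reply strategy."""
    # Sort in descending order of probability and keep the top intent.
    intent, confidence = sorted(
        intent_probs.items(), key=lambda item: item[1], reverse=True
    )[0]
    if confidence > 0.8:
        strategy = "accept"    # answer from the reply template
    elif confidence >= 0.4:
        strategy = "clarify"   # ask the user a clarifying question
    else:
        strategy = "reject"    # decline to answer
    return intent, strategy


# Hypothetical outputs of the second-round intent classifier.
print(choose_strategy({"disease_symptom": 0.91, "disease_drug": 0.09}))
print(choose_strategy({"disease_symptom": 0.55, "disease_drug": 0.45}))
print(choose_strategy({"disease_symptom": 0.35, "disease_drug": 0.33, "disease_food": 0.32}))
```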

Further, the steps of the multi-round question and answer of the system in the step five are as follows:

(5.1) predefining a semantic slot template for the entity;

(5.2) identifying which slots are contained in the question input by the user;

(5.3) extracting the slot values in the semantic slots and filling the semantic slots into predefined semantic slots, namely filling the slots;

(5.4) when a question input by the user is analyzed, if a slot value is empty after slot filling, the slot value of the previous question is inherited, that is, slot inheritance is performed;

and (5.5) inquiring in the knowledge graph by using a Cypher statement to realize multiple rounds of question and answer.
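Slot filling with inheritance, as described in steps (5.1) to (5.4), can be sketched as follows. The single `disease` slot, the function `fill_slots` and the example questions are illustrative assumptions; in the described system the recognized entities would come from the named entity recognition model.

```python
def fill_slots(recognized, previous_slots, slot_names=("disease",)):
    """Fill each predefined slot, inheriting from the previous turn when empty."""
    slots = {}
    for name in slot_names:
        value = recognized.get(name)
        if value is None:
            # Slot inheritance: reuse the value from the previous question.
            value = previous_slots.get(name)
        slots[name] = value
    return slots


history = {}

# Turn 1: "What are the symptoms of a cold?" (entity recognized: cold)
turn1 = fill_slots({"disease": "cold"}, history)
history = turn1

# Turn 2: "What should I eat?" (no entity recognized, so the slot is inherited)
turn2 = fill_slots({}, history)

print(turn1, turn2)
```

If a slot is still empty after inheritance, for instance on the very first turn, its value stays `None`, which corresponds to the case in which no result can be queried from the knowledge graph.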

Further, the construction steps of the medical question-answering system and the user page thereof in the step six are as follows:

(6.1) adopting a hierarchical structure to design a medical question-answering system, wherein the medical question-answering system comprises a data layer, a construction layer and a user layer; wherein the data layer is responsible for providing data support; the construction layer comprises two major contents, namely, constructing a knowledge graph according to a self-built medical field data set, and constructing a medical question-answering system; the user layer is oriented to the user, mainly performs dialogue management and interaction, and a question input in the interface is transmitted into the construction layer to perform specific analysis operation;

(6.2) designing the pages with PyQt5; for each question input by the user, entity extraction, intent judgment, slot filling, structured-statement query and template reply are performed in sequence by directly calling the trained models, the user's input being passed to the construction layer for the specific analysis;

and (6.3) feeding back the generated answer to the user in the user operation page.

According to the invention, text features are extracted with the BERT pre-trained language model, which is trained without supervision on a large corpus, can learn rich prior information and has strong semantic expression capability; a perturbation is added to the resulting word vectors, which enhances the generalization capability and robustness of the model. Adversarial training is introduced to address insufficient or missing annotations that may exist in the data set, reducing the influence of data-set noise on the results. Learning rates are set layer by layer: the learning rate of the pre-trained layer is lowered and that of the downstream layers is set larger, so that the performance of the BERT layer does not degrade, the downstream layers train faster, and training stays synchronized. The output features of the BERT layer and the BiLSTM layer are spliced, so that the layers are more closely connected, deeper features are obtained, and the original features of the BERT layer are not lost. Chit-chat sentences are handled through two-round intent recognition, so that the system can respond to users' chit-chat topics, and slot inheritance gives the system a multi-round question-answering capability, which greatly improves the accuracy of the system's answers.
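The layered learning-rate setting can be illustrated by grouping parameters by the layer they belong to. The sketch below is an assumption-laden illustration: the parameter names, the prefix convention (`bert.`, `bilstm.`, `crf.`) and the numeric rates standing in for lr1, lr2 and lr3 are hypothetical, since the text does not give concrete values. The returned list of dictionaries with `params` and `lr` keys follows the per-parameter-group format accepted by PyTorch optimizers, but no framework is needed to run the example.

```python
def build_param_groups(named_params, lr1, lr2, lr3):
    """Group parameters by layer prefix and attach one learning rate per layer."""
    groups = {
        "bert": {"params": [], "lr": lr1},    # pre-trained layer: small rate
        "bilstm": {"params": [], "lr": lr2},  # downstream layer: larger rate
        "crf": {"params": [], "lr": lr3},     # downstream layer: larger rate
    }
    for name, param in named_params:
        prefix = name.split(".", 1)[0]
        # An unknown prefix raises KeyError so that no parameter is silently skipped.
        groups[prefix]["params"].append(param)
    return list(groups.values())


# Placeholder strings stand in for real parameter tensors.
named_params = [
    ("bert.encoder.layer.0.weight", "w_bert"),
    ("bilstm.weight_ih_l0", "w_bilstm"),
    ("crf.transitions", "w_crf"),
]

# Hypothetical values for lr1, lr2 and lr3.
param_groups = build_param_groups(named_params, lr1=2e-5, lr2=1e-3, lr3=1e-2)

for group in param_groups:
    print(group["lr"], group["params"])
```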

Drawings

FIG. 1 is a schematic diagram of the configuration of the medical question-answering system of the present invention;

FIG. 2 is a workflow diagram of a method of constructing a medical question-answering system of the present invention;

FIG. 3 is a frame diagram of the construction of the medical question-answering system of the present invention;

FIG. 4 is a diagram of a model structure of a named entity recognition module of the present invention;

FIG. 5 is a workflow diagram of an intent recognition module of the present invention;

FIG. 6 is an example of a system question and answer of the present invention;

FIG. 7 is an example of a multi-round question and answer of the system of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

As shown in FIG. 1, the medical question-answering system based on improved named entity recognition comprises a data acquisition module, a knowledge storage module, a natural language understanding module, a knowledge calculation module and a dialogue management and interaction module, wherein the output end of the data acquisition module is connected with the input end of the knowledge storage module, the output end of the knowledge storage module is connected with the input end of the natural language understanding module, the output end of the natural language understanding module is connected with the input end of the knowledge calculation module, and the output end of the knowledge calculation module is connected with the input end of the dialogue management and interaction module.

As shown in FIGS. 2 and 3, a method for constructing a medical question-answering system based on improved named entity recognition includes the following steps:

The steps of constructing the data set and the knowledge graph are as follows:

(1.7) deleting or filling the missing value of the data directly;

(1.11) extracting knowledge triples from the data according to the Schema;

As shown in FIG. 4, the named entity recognition model of the improved network based on BERT-FGM-BiLSTM-CRF-lr splicing is constructed as follows:

As shown in FIG. 5, the construction method of the intention recognition model structure based on the BERT-TextCNN network is as follows:

(3.1) selecting a published CMID dataset;

(3.5) converting the text into vectors using the BERT model as the Embedding layer, extracting text features, denoted $b_1, b_2, \ldots, b_n$;

(3.6) splicing the vectors $b_1, b_2, \ldots, b_n$ to obtain the embedding matrix, denoted $B_{1:n}$:

$$B_{1:n} = [b_1, b_2, \ldots, b_n]$$

wherein $b_1, b_2, \ldots, b_n$ represent the word vectors;

the semantic feature $a_i$ is calculated as:

$$a_i = f(W \cdot M_{i:i+h-1} + b)$$

(3.10) adding Dropout to prevent overfitting;

As shown in FIG. 6, the steps of constructing the medical question-answering system are as follows:

(4.1) defining a question input by a user as Q;

As shown in FIG. 7, the steps of the multi-round question answering of the system are as follows:

(5.1) predefining a semantic slot template for the entity;

(5.2) identifying which slots are contained in the question input by the user;

Experiments show that the named entity recognition model based on the BERT-FGM-BiLSTM-CRF-lr splicing improved network achieves an F1 score of 87.70% on the self-built data set, and the intention recognition model achieves an F1 score of 76.64% on the self-built data set, so the accuracy of the question-answering system's replies is greatly improved.

Claims

1. The medical question-answering system based on the improved named entity recognition is characterized by comprising a data acquisition module, a knowledge storage module, a natural language understanding module, a knowledge calculation module and a dialogue management and interaction module, wherein the output end of the data acquisition module is connected with the input end of the knowledge storage module;

2. The medical question-answering system construction method based on the improved named entity recognition is characterized by comprising the following steps of:

3. The method for constructing a medical question-answering system based on the improved named entity recognition according to claim 2, wherein the step of constructing the data set and the knowledge graph in the step one is as follows:

(1.7) deleting or filling the missing value of the data directly;

(1.11) extracting knowledge triples from the data according to the Schema;

4. The method for constructing a medical question-answering system based on improved named entity recognition according to claim 2, wherein the construction steps of the named entity recognition model based on the BERT-FGM-BiLSTM-CRF-lr splicing improved network in the second step are as follows:

wherein A is the transition matrix, $A_{y_i, y_{i+1}}$ represents the transition score from tag $y_i$ to tag $y_{i+1}$, P is the output of the BiLSTM network, and $P_{i, y_i}$ represents the score that the BiLSTM layer assigns to the $y_i$-th label for the i-th character;

5. The method for constructing a medical question-answering system based on improved named entity recognition according to claim 2, wherein the method for constructing an intention recognition model structure based on the BERT-TextCNN network in the third step is as follows:

(3.1) selecting a published CMID dataset;

$$B_{1:n} = [b_1, b_2, \ldots, b_n]$$

wherein $b_1, b_2, \ldots, b_n$ represent the word vectors;

(3.7) sending the embedding matrix $B_{1:n}$ into the convolution layer for feature extraction, performing convolution with kernels of sizes 3, 4 and 5, and extracting the semantic features of the sentence, denoted $a_i$;

the semantic feature $a_i$ is calculated as:

$$a_i = f(W \cdot M_{i:i+h-1} + b)$$

(3.8) sending all of the semantic features $a_i$ extracted in step (3.7) to the pooling layer, where max pooling (max_pooling) downsamples the semantic features $a_i$ so that the vector dimension stays the same, and the resulting output is denoted f1;

(3.10) adding Dropout to prevent overfitting;

6. The method for constructing a medical question-answering system based on the improved named entity recognition according to claim 2, wherein the step of constructing a medical question-answering system in the fourth step is as follows:

(4.1) defining a question input by a user as Q;

7. The method for constructing a medical question-answering system based on the improved named entity recognition according to claim 2, wherein the steps of the multi-round question answering of the system in step five are as follows:

(5.1) predefining a semantic slot template for the entity;

(5.2) identifying which slots are contained in the question input by the user;

8. The method for constructing a medical question-answering system based on the improved named entity recognition according to claim 2, wherein the steps of constructing the medical question-answering system and the user page thereof in the step six are as follows: