CN112925918A - Question-answer matching system based on disease field knowledge graph - Google Patents

Question-answer matching system based on disease field knowledge graph

Info

Publication number
CN112925918A
Authority
CN
China
Prior art keywords
entity
character
bert
domain
disease
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110213829.XA
Other languages
Chinese (zh)
Other versions
CN112925918B (en)
Inventor
倪妙玲
孙庆华
王聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202110213829.XA
Publication of CN112925918A
Application granted
Publication of CN112925918B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a question-answer matching system based on a disease-domain knowledge graph, comprising a preprocessing module, a named entity recognition module, an entity linking module, and a relation matching module.

Description

Question-answer matching system based on disease field knowledge graph
Technical Field
The invention relates to the field of natural language processing, in particular to a question-answer matching system based on a knowledge graph in the disease field.
Background
A question-answering system is a common application of natural language processing and a common form of information retrieval system. By analyzing the question input by a user with algorithms such as deep learning, it quickly extracts accurate and relevant information from a large body of information and returns it to the user as the answer, meeting the user's need to obtain accurate information quickly.
A knowledge graph organizes and interprets data in a form close to human cognition, and offers a good solution for managing and using the massive, heterogeneous, and dynamic data on the Internet. Knowledge graphs integrate many technologies and methods, including knowledge representation, knowledge extraction (named entity recognition, relation extraction, and the like), knowledge fusion, knowledge storage, knowledge inference, graph computation, visualization, semantic search, knowledge question answering, and knowledge crowdsourcing. A knowledge graph generally follows the RDF triple data structure, i.e., the form (s, p, o), corresponding to the head entity (subject), the relation (predicate), and the tail entity (object). A large knowledge graph can contain tens or hundreds of millions of entities and billions or tens of billions of facts (i.e., attribute values and relations to other entities), organized in thousands of semantically defined conceptual structures of the objective world. Fig. 1 shows part of a knowledge graph for the coronary heart disease domain. Arrows represent relations and point from the head entity to the tail entity. Circles represent entities or attribute values; entities are abstractions of objective individuals, such as body parts, diseases, and symptoms, while attribute values describe entities and are either textual or numerical, such as "incurable" or "80%".
A question-answering system based on a medical knowledge graph can help a user pick out accurate answers from massive medical data and return them to the user; the structured knowledge graph also makes it possible to give some explanation for the answers.
The prior art has the following defects:
1. Word vector representation models trained on the general domain lack domain knowledge. BERT is pre-trained on general corpora, whereas the disease domain has many specialized terms and complex entities. For example, "rheumatic mitral stenosis" is a complete entity that also nests another disease entity, "mitral stenosis". Because "rheumatic mitral stenosis" is rare in general training corpora, its semantic vector is learned poorly, and only "mitral stenosis" is recognized.
2. Pipeline-based entity recognition and entity linking accumulate errors. In a pipeline, entity recognition and entity linking run serially, and entity linking depends on the result of entity recognition, so errors propagate. When the entity recognition module outputs a wrong entity mention, the mention cannot be linked to the correct entity word in the knowledge graph.
Disclosure of Invention
To overcome the two defects of the prior art described above, the invention provides a question-answer matching system based on a disease-domain knowledge graph.
The invention adopts the following technical scheme:
a question-answer matching system based on a disease domain knowledge graph, comprising:
a preprocessing module for preprocessing the question input by the user, the output of the preprocessing module being denoted Q;
a named entity recognition module for determining the start and end positions of the entity mention: it predicts, for each character in the output Q, the probability that the character is the start position or the end position of the entity mention in the sentence, and determines the start and end positions of the entity mention from these probabilities, the named entity recognition module being obtained by fine-tuning BERT_domain;
an entity linking module for linking the entity mention to entity words in the disease knowledge graph and retrieving, through the entity words, all corresponding relations in the disease knowledge graph;
a relation matching module for matching the relations corresponding to the entity word against the output Q of the preprocessing module and judging whether each relation is consistent with the user's question, and if so, outputting the result, the relation matching module being obtained by fine-tuning BERT_domain.
Further, the preprocessing includes using regular expressions to remove spaces and punctuation marks from the question text entered by the user, and converting all letters to lower case.
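The preprocessing step can be illustrated with a minimal Python sketch. The function name `preprocess` and the exact character classes are assumptions made for illustration; the text only states that spaces and punctuation are removed with regular expressions and that letters are lowercased.

```python
import re

# Assumption: "punctuation" is taken to mean any character that is not a
# letter or digit. \w is Unicode-aware in Python 3, so CJK characters are kept;
# the underscore is removed explicitly.
_NON_WORD = re.compile(r"[\W_]+")


def preprocess(question: str) -> str:
    """Remove whitespace and punctuation, then lowercase Latin letters."""
    return _NON_WORD.sub("", question).lower()


# The result plays the role of Q in the text.
Q = preprocess("How to treat Rheumatic Mitral-Stenosis?")
```

A single compiled pattern keeps the step cheap enough to run on every incoming question.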
Further, predicting the probability of each character in the output Q being the start position or the end position of the entity mention in the sentence, and determining the entity mention, is specifically:
the BERT_domain model, pre-trained on the disease domain, encodes the output Q of the preprocessing module and outputs feature vectors; two Softmax classification layers use the feature vectors to predict the probability of each character being the start position and the end position of the entity mention; and the character string between the start position and the end position is the entity mention.
Further, the pre-training process for obtaining the disease-domain BERT_domain model is as follows:
first, the disease names in clinical diagnosis and treatment documents and electronic medical records are stored as a dictionary file;
then, the text of the clinical diagnosis and treatment documents and electronic medical records is split into sentences at punctuation marks;
for each sentence, maximum dictionary matching is used to check whether the sentence contains a disease word from the dictionary file; if so, the disease word is masked, i.e., replaced with MASK; for a sentence that contains no disease word, a character is masked at random; this yields an MLM task data set for the disease domain;
finally, the BERT model is pre-trained on the MLM task data set. The MLM task is to predict the words replaced by MASK from the context of the sentence, so that the model learns bidirectional information about the sentence. The MLM task yields the BERT_domain model and the character vectors E_char.
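The construction of the MLM data set described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, "maximum dictionary matching" is read as forward maximum matching, each character of a matched disease word is replaced by its own `[MASK]` token, and the example strings are space-free English stand-ins for the Chinese text.

```python
import random
import re

MASK = "[MASK]"


def split_sentences(text):
    """Split text on common Chinese and Western sentence punctuation."""
    return [s for s in re.split(r"[。！？；!?;.\n]+", text) if s]


def find_disease_span(sentence, dictionary, max_len):
    """Forward maximum matching: (start, end) of the leftmost, longest hit."""
    for start in range(len(sentence)):
        for length in range(min(max_len, len(sentence) - start), 0, -1):
            if sentence[start:start + length] in dictionary:
                return start, start + length
    return None


def build_mlm_example(sentence, dictionary, rng):
    """Return (masked_tokens, labels); labels keep the original masked characters."""
    tokens = list(sentence)
    max_len = max(map(len, dictionary))  # assumes a non-empty dictionary
    span = find_disease_span(sentence, dictionary, max_len)
    if span is None:
        # No disease word found: mask one randomly chosen character.
        pos = rng.randrange(len(tokens))
        span = (pos, pos + 1)
    labels = [None] * len(tokens)
    for i in range(*span):
        labels[i] = tokens[i]
        tokens[i] = MASK
    return tokens, labels


rng = random.Random(0)
disease_dict = {"mitralstenosis", "rheumaticmitralstenosis"}  # hypothetical entries
tokens, labels = build_mlm_example(
    "patienthasrheumaticmitralstenosis", disease_dict, rng
)
```

Matching the longest entry first means the nested entity "mitralstenosis" does not shadow the full disease name.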
Further, the named entity recognition module is obtained by fine-tuning BERT_domain: network parameters for the downstream task are added on top of BERT_domain, and the parameters of the whole network are fine-tuned. The named entity recognition module attaches two softmax classification layers on top of BERT_domain, which respectively predict the probability of each character in the output Q being the start position and the end position of the entity mention. In addition, in the fine-tuning stage the segment encoding in the original BERT_domain input is replaced with a word segmentation encoding. The network parameters of BERT_domain in the fine-tuning stage are initialized from the model parameters saved during the pre-training stage.
Further, the named entity recognition module calculates the probabilities of the start and end positions of the entity mention by the following formula:
Figure BDA0002953228470000031
wherein L is the number of characters of Q,
Figure BDA0002953228470000032
represents the feature encoding of the k-th character c_k in Q, as output by BERT_domain, and h_k is a network parameter of the Softmax classification layer.
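Because the formula survives only as an image reference, the following Python sketch shows one plausible reading rather than the patent's exact expression: a softmax over the L character positions of Q, computed separately for the start and the end with different parameter vectors. The names `h_start` and `h_end`, the dot-product scoring, and the rule for picking the end position are assumptions; the feature vectors stand in for the BERT_domain output.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict_span(features, h_start, h_end):
    """features: one vector per character of Q (stand-in for BERT_domain output)."""
    p_start = softmax([dot(h_start, x) for x in features])
    p_end = softmax([dot(h_end, x) for x in features])
    start = max(range(len(p_start)), key=p_start.__getitem__)
    # Assumption: the end is the most probable end position at or after the start.
    end = max(range(start, len(p_end)), key=p_end.__getitem__)
    return start, end, p_start, p_end


# Toy 2-dimensional features for a 4-character Q.
features = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
start, end, p_start, p_end = predict_span(features, [1.0, 0.0], [0.0, 1.0])
```

The entity mention would then be the substring of Q from `start` to `end`, inclusive.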
Further, the word segmentation encoding is specifically as follows: the output Q is segmented by a word segmentation tool, and each character is labeled according to the segmentation result. The label set is {B, M, E, S}, where B indicates that the character is the first character of a word, M that it is a middle character of a word, E that it is the last character of a word, and S that it is a single-character word. The corresponding word segmentation encodings are E_B, E_M, E_E, and E_S, respectively.
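The B/M/E/S labeling can be sketched in a few lines of Python. The segmenter itself is not specified in the text, so the sketch assumes its output is already available as a list of words; the function name `bmes_tags` is hypothetical.

```python
def bmes_tags(words):
    """Map a word segmentation result to one B/M/E/S label per character."""
    tags = []
    for word in words:
        if not word:
            continue  # ignore empty strings a segmenter might emit
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


# Hypothetical segmentation of a short input.
tags = bmes_tags(["ab", "c", "defg"])
```

Each label would then index a learned vector (E_B, E_M, E_E, or E_S) that is fed to the model alongside the character and position encodings.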
Further, the entity linking module is implemented by retrieval and is divided into two stages, an offline stage and an online search stage:
offline stage: the entity words and entity aliases in the disease knowledge graph are split into characters, and TF-IDF is computed;
a character index is built for the entity words, recording, for each character, all entity words that contain it and its positions within those entity words;
online search stage: the question input by the user is preprocessed to obtain Q, the entity mention is obtained by the named entity recognition module, the entity mention is split into characters, and the corresponding character index entries are looked up. TF-IDF scores are accumulated per entity word, the entity words are sorted by score in descending order, and the 2 entity words with the highest scores are selected as candidate entities;
the DICE distance d_1 between each candidate entity and the entity mention and the DICE distance d_2 between the candidate entity and Q are computed; if their weighted sum is greater than a threshold, the candidate entity is retained; otherwise, the candidate entity is removed.
Further, the DICE distance is:
Figure BDA0002953228470000041
where X, Y, and S denote the entity word, the entity mention, and the user input, respectively; α and β are parameters; a ∩ b denotes the characters common to a and b; and |x| denotes the character length of x.
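The formula itself survives only as an image reference. A plausible reconstruction from the symbol definitions, assuming the standard Dice coefficient and a linear weighting by α and β (neither is confirmed by the surviving text), would be:

```latex
d_1 = \frac{2\,|X \cap Y|}{|X| + |Y|}, \qquad
d_2 = \frac{2\,|X \cap S|}{|X| + |S|}, \qquad
\text{score} = \alpha\, d_1 + \beta\, d_2
```

Under this reading a larger value means a closer match, which agrees with the rule that a candidate is retained when the weighted sum is greater than the threshold.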
Further, the network parameters of the two Softmax classification layers are different.
The invention has the beneficial effects that:
(1) the invention uses a richer and more specialized character representation for the disease domain. Existing pre-trained models are pre-trained on general corpora and cope poorly with the many specialized terms, nested entities, and similar issues of the disease domain. Pre-training on clinical diagnosis and treatment documents and electronic medical records of the disease domain yields disease-domain character vector representations and the domain pre-trained model BERT_domain;
(2) entity boundary recognition is more accurate. The entities in the knowledge graph are used to assist word segmentation, word segmentation encodings are added according to the segmentation result, and the segmentation result thus assists entity boundary recognition;
(3) the invention weakens the coupling between entity recognition and entity linking. Entity linking uses a retrieval-based method. Because the entity recognition result directly affects the entity linking result, the DICE distance between the entity word and the sentence is computed in addition to the DICE distance between the entity word and the entity mention, which keeps a misrecognized entity boundary from producing an unduly small DICE distance.
Drawings
FIG. 1 is a partial schematic view of a prior-art coronary heart disease domain knowledge graph, showing head entities, relations, attributes, and tail entities;
FIG. 2 is a flow chart of the operation of the present invention;
FIG. 3 is a schematic diagram of a named entity recognition module of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.
Examples
As shown in figs. 1 to 3, a question-answer matching system based on a disease domain knowledge graph analyzes the question posed by a user on the basis of a constructed disease knowledge graph and provides a relevant answer. The system includes:
a preprocessing module, which uses regular expressions to remove spaces, punctuation marks, and the like from the question input by the user and converts all letters to lower case; the output of the preprocessing module is denoted Q;
a named entity recognition module, which determines the start and end positions of the entity mention: two Softmax classification layers predict the probability of each character in Q being the start position and the end position of the entity mention in the sentence, and the character string determined by these positions is the entity mention. The named entity recognition module is obtained by fine-tuning BERT_domain.
The probability calculation formula is:
Figure BDA0002953228470000051
wherein L is the number of characters of the output Q of the preprocessing module,
Figure BDA0002953228470000052
denotes the feature encoding of the k-th character c_k, as output by BERT_domain, and h_k is a network parameter of the classification layer. The predictions of the start and end positions of the entity differ only in the parameters of their classification layers, i.e., their h_k are different, while BERT_domain is shared.
Figure BDA0002953228470000053
is obtained from the domain pre-trained BERT_domain. The original BERT input consists of character encoding, position encoding, and segment encoding. The named entity recognition module is obtained by fine-tuning BERT_domain, whose input here is divided into 3 parts: character encoding, position encoding, and word segmentation encoding. The character encoding is the character representation E_char pre-trained with BERT_domain. The position encoding is the position representation of each character; it lets the BERT model take character order into account and is the same as the position encoding of the original BERT. The original segment encoding is removed and replaced by the word segmentation encoding: named entity recognition is a single-sentence task that does not involve multiple sentences, so no segment encoding is needed to indicate which sentence each character belongs to.
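How the three input parts might be combined can be sketched as below. The sketch assumes the encodings are summed element-wise, as the text states for the pre-training input; the lookup tables `e_char`, `e_pos`, and `e_seg` and their toy 2-dimensional values are hypothetical.

```python
def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def build_inputs(chars, tags, e_char, e_pos, e_seg):
    """One input vector per character: character + position + word segmentation encoding."""
    return [
        add_vectors(e_char[c], e_pos[i], e_seg[t])
        for i, (c, t) in enumerate(zip(chars, tags))
    ]


# Toy lookup tables; in the described system these would be learned parameters.
e_char = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
e_pos = [[0.5, 0.5], [0.25, 0.25]]
e_seg = {"B": [1.0, 1.0], "E": [2.0, 2.0]}

inputs = build_inputs("ab", ["B", "E"], e_char, e_pos, e_seg)
```

Replacing the segment table with the word segmentation table leaves the input dimensionality unchanged, so the pre-trained encoder can be reused as is.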
The word segmentation encoding is specifically as follows: the output Q is segmented by a word segmentation tool, and each character is labeled according to the segmentation result. The label set is {B, M, E, S}, where B indicates that the character is the first character of a word, M a middle character, E the last character, and S a single-character word. If a character's label is B, its word segmentation boundary encoding is E_B, and similarly for the other labels; E_B, E_M, E_E, and E_S are all network parameters. Introducing the word segmentation encoding provides some prior knowledge for recognizing entity boundaries. For example, if the user inputs "how to treat rheumatic mitral stenosis", the word for "stenosis" comes out of word segmentation as a single connected word; using the segmentation result as a feature of the named entity recognition module reduces boundary errors such as an entity mention that includes only part of the word "stenosis".
The entity linking module links the entity mention output by the entity recognition module to an entity word s in the knowledge graph, using a retrieval-based entity linking method. The specific implementation is as follows. All entity words and entity aliases in the knowledge graph (hereinafter, "entity words" refers to both) are split into characters, and TF-IDF (term frequency-inverse document frequency) is computed. TF is the term frequency, which indicates how often a word appears in an article. IDF is the inverse document frequency, the ratio of the total number of articles to the number of articles containing the word, usually taken logarithmically. The main idea of TF-IDF is that if a word or phrase appears frequently in one article (high TF) and rarely in other articles, it is considered to discriminate well between articles. For the entity linking module, TF is the frequency of a character within an entity word, and IDF is the number of all entity words divided by the number of entity words containing that character. After these statistics are computed, an inverted index is built over the entity words, recording, for each character, all entity words that contain it and its positions within those entity words. When an entity mention is input, it is split into characters; the index is queried for each character to obtain every entity word containing that character together with its TF-IDF; the TF-IDF values of the hit characters are accumulated per entity word; the entity words are sorted by score in descending order; and the 2 entity words with the highest scores are selected as candidate entity words.
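The offline and online retrieval stages can be sketched in Python as follows. This is a minimal illustration with hypothetical function names; it assumes a logarithmic IDF, accumulates scores over the distinct characters of the mention, and breaks ties alphabetically, none of which the text specifies.

```python
import math
from collections import Counter, defaultdict


def build_index(entity_words):
    """Offline stage: character-level inverted index with TF-IDF weights."""
    n = len(entity_words)
    doc_freq = Counter(ch for word in entity_words for ch in set(word))
    index = defaultdict(list)  # char -> [(entity word, positions, tf-idf)]
    for word in entity_words:
        for ch, count in Counter(word).items():
            tf = count / len(word)
            idf = math.log(n / doc_freq[ch])
            positions = [i for i, c in enumerate(word) if c == ch]
            index[ch].append((word, positions, tf * idf))
    return index


def top_candidates(mention, index, k=2):
    """Online stage: accumulate TF-IDF over hit characters and keep the top k."""
    scores = defaultdict(float)
    for ch in set(mention):
        for word, _positions, weight in index.get(ch, []):
            scores[word] += weight
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Toy entity vocabulary; real entries would come from the disease knowledge graph.
index = build_index(["ab", "abc", "xyz"])
candidates = top_candidates("ab", index)
```

Because the index is keyed by character, a lookup only touches entity words that share at least one character with the mention.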
Whether each of the top-2 candidate entity words matches is judged by the distance formula below, where X, Y, and S denote the candidate entity word, the entity mention, and the output Q of the preprocessing module, respectively; α and β are parameters; a ∩ b denotes the characters common to a and b; and |x| denotes the character length of x.
Formula (2) below gives the DICE distance d_1 between the candidate entity word and the entity mention and the DICE distance d_2 between the candidate entity word and the sentence. If their weighted sum is greater than the threshold, the candidate entity word is retained; otherwise it is removed.
In addition to the DICE distance d_1 between the entity mention and the entity word, the DICE distance d_2 between the entity word and the sentence is taken into account. This mitigates, to some extent, the case where a missed entity mention boundary makes d_1 small, so that the entity mention cannot be linked to the correct entity word. For example, for the input "how to treat rheumatic mitral stenosis", suppose the recognized entity mention covers only "mitral stenosis"; the recalled top-1 entity is "mitral stenosis" and the top-2 entity is "rheumatic mitral stenosis". On d_1 the top-1 entity scores higher than the top-2 entity, while on d_2 the top-1 entity scores lower than the top-2 entity. The weighted sum of d_1 and d_2 therefore reduces the cases in which a misrecognized entity boundary causes a wrong link to "mitral stenosis".
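The effect described in this example can be reproduced with a short Python sketch. It assumes the standard character-level Dice coefficient with repeated characters counted (the patent's own formula is available only as an image), uses space-free lowercase English strings in place of the original Chinese text, and picks α = 0.4 and β = 0.6 purely for illustration.

```python
from collections import Counter


def dice(a: str, b: str) -> float:
    """Character-level Dice coefficient, counting repeated characters."""
    if not a and not b:
        return 0.0
    common = sum((Counter(a) & Counter(b)).values())
    return 2 * common / (len(a) + len(b))


def link_score(entity, mention, question, alpha, beta):
    """Weighted sum of d_1 (entity vs. mention) and d_2 (entity vs. question)."""
    return alpha * dice(entity, mention) + beta * dice(entity, question)


question = "howtotreatrheumaticmitralstenosis"  # preprocessed input Q
mention = "mitralstenosis"                       # mention with a missed left boundary
top1, top2 = "mitralstenosis", "rheumaticmitralstenosis"

d1_top1, d1_top2 = dice(top1, mention), dice(top2, mention)
d2_top1, d2_top2 = dice(top1, question), dice(top2, question)

# Illustrative weights only; the patent does not give values for alpha and beta.
score_top1 = link_score(top1, mention, question, alpha=0.4, beta=0.6)
score_top2 = link_score(top2, mention, question, alpha=0.4, beta=0.6)
```

With these illustrative weights the full entity name ends up with the higher combined score even though the truncated mention matches the shorter entity exactly; different weights can change that outcome.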
Figure BDA0002953228470000061
The entity linking module computes TF-IDF scores through the character index to obtain two candidate entity words and computes the weighted sum of the two DICE distances for each candidate. A candidate entity word whose weighted sum meets the threshold is called an entity word; the number of entity words can be 0, 1, or 2.
The relation matching module judges whether a relation p matches the intention of the user's input; if it does, the o of the triple (s, p, o) in the knowledge graph is the answer to output. The specific implementation is as follows. The relations of the entity words obtained by entity linking are retrieved from the knowledge graph. Each entity word is spliced with a relation to form sentence 1, and the user input is sentence 2; since an entity word can have several relations, there are several candidates for sentence 1. This sentence-pair input is consistent with the NSP task in BERT pre-training. The representation of the [CLS] character is used as the feature vector of the sentence pair, and Softmax classification determines whether sentence 1 (the entity word spliced with the relation) and sentence 2 (the preprocessed user input Q) are semantically similar. Relation matching thus matches the entity word spliced with the relation against Q, rather than matching the relation against the part of Q that remains after the entity mention is removed. The reason is that entity linking is based on character matching, so both the entity and the relation need to be checked against the entity and the intention in the user's input. Moreover, if only the part of Q outside the entity mention were used, an error in the entity mention boundary could distort the semantics of that part and lower its degree of match with the relation. The relation with the highest score whose probability is greater than the threshold is selected as p, and the corresponding o in (s, p, o) is returned as the answer to the user's question.
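The selection logic of the relation matching module can be sketched as below. The fine-tuned BERT_domain classifier is replaced by a toy character-overlap scorer, so `toy_score`, the threshold value, and the relation names are all hypothetical stand-ins; only the control flow (splice, score, keep the best relation above the threshold) follows the text.

```python
def match_relation(entity_word, relations, question, score_fn, threshold=0.5):
    """Return the best-scoring relation whose score exceeds the threshold, else None."""
    best_relation, best_score = None, threshold
    for relation in relations:
        sentence_1 = entity_word + relation      # entity word spliced with the relation
        score = score_fn(sentence_1, question)   # stand-in for the fine-tuned classifier
        if score > best_score:
            best_relation, best_score = relation, score
    return best_relation


def toy_score(sentence_1, sentence_2):
    """Toy stand-in: fraction of sentence_1 characters that also occur in sentence_2."""
    return sum(ch in sentence_2 for ch in sentence_1) / len(sentence_1)


question = "howtotreatmitralstenosis"
p = match_relation("mitralstenosis", ["treatment", "symptom"], question, toy_score)
```

Once p is chosen, the answer is the o stored for the triple (s, p, o) in the knowledge graph; returning `None` covers the case where no relation clears the threshold.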
The invention uses the BERT model in two stages:
Stage 1: pre-training. Pre-training uses the Masked Language Model (MLM) task. Because the original BERT training corpus is general and not targeted at the disease domain, the character representations of some disease words are poor. An MLM task for the disease domain is therefore designed to obtain the disease-domain model BERT_domain. The input encoding of the MLM task is the same as in general-domain BERT pre-training: the sum of the character vector, the segment encoding, and the position vector, all of which are learnable parameters. After pre-training is finished, the trained input character vectors E_char and the model parameters are saved to a file. The output of BERT_domain in the pre-training stage is called the character feature vector.
Stage 2: fine-tuning. In the fine-tuning stage, network parameters for the downstream task are added on top of BERT_domain, and the parameters of the whole network are fine-tuned. The parameters of BERT_domain are initialized from the model parameters of stage 1 and use a relatively small learning rate, generally 10^-5; the network parameters for the downstream task are initialized randomly. A new loss function is designed for the downstream task and the model parameters are fine-tuned. Because the learning rate is small, the pre-trained model parameters are adjusted only slightly for the downstream task, and the farther a network layer is from the output layer, the smaller the adjustment of its parameters.
For the question-answering system, there are two downstream tasks: entity recognition and relation matching.
Entity recognition: for this task, the segment encoding in the BERT_domain input is replaced with the word segmentation encoding. Two softmax classification layers are attached on top of BERT_domain, which respectively predict the probability of each character in Q being the start position and the end position of the entity mention. The network parameters of BERT_domain are initialized from the model parameter file saved in the pre-training stage.
Relation matching: for this task, the input of BERT_domain is unchanged: the sum of the character vector, the segment encoding, and the position vector. One softmax classification layer is attached on top of BERT_domain to predict whether the output Q is consistent with the entity word + relation. BERT_domain is initialized from the model parameter file saved in the pre-training stage.
The invention differs from the prior art in that:
(1) the named entity recognition module and the relation matching module use a BERT model pre-trained on the disease domain and disease-domain character vector representations. Pre-training is carried out through an MLM task, and the pre-training corpus comes from clinical diagnosis and treatment documents and electronic medical records of the disease domain;
(2) named entity recognition is a single-sentence task and therefore does not need segment encoding. The entities in the knowledge graph are used to assist word segmentation, and word segmentation encodings are obtained from the segmentation result. The segment encoding in the original BERT model input is replaced with the word segmentation encoding, so that the segmentation result assists entity boundary recognition;
(3) the entity linking module is implemented with a retrieval-based entity linking method. Because the entity recognition result directly affects the entity linking result, the DICE distance between the entity word and the sentence is additionally computed, which reduces the cases in which a missed entity boundary pushes the DICE distance below the threshold.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A question-answer matching system based on a disease domain knowledge graph, comprising:
a preprocessing module for preprocessing the question input by the user, the output of the preprocessing module being denoted Q;
a named entity recognition module for determining the start and end positions of the entity mention: it predicts, for each character in the output Q, the probability that the character is the start position or the end position of the entity mention in the sentence, and determines the start and end positions of the entity mention from these probabilities, the named entity recognition module being obtained by fine-tuning BERT_domain;
an entity linking module for linking the entity mention to entity words in the disease knowledge graph and retrieving, through the entity words, all corresponding relations in the disease knowledge graph;
a relation matching module for matching the relations corresponding to the entity word against the question input by the user and judging whether each relation is consistent with the user's question, and if so, outputting the result, the relation matching module being obtained by fine-tuning BERT_domain.
2. The question-answer matching system according to claim 1, wherein the preprocessing includes removing spaces and punctuation marks from the text question entered by the user using regular expressions, and converting all letters to lower case.
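By way of non-limiting illustration, the preprocessing of claim 2 might be sketched as follows. The function name `preprocess` and the particular set of full-width punctuation marks are assumptions made for this sketch; the claim only specifies removal of spaces and punctuation and lower-casing of letters.

```python
import re
import string

# Assumed punctuation set: ASCII punctuation plus some common full-width marks.
# The patent does not enumerate the marks that are removed.
_PUNCT = string.punctuation + "，。？！、；：“”‘’（）《》【】…"
_STRIP_RE = re.compile("[\\s" + re.escape(_PUNCT) + "]+")


def preprocess(question: str) -> str:
    """Remove whitespace and punctuation, then lower-case any letters."""
    return _STRIP_RE.sub("", question).lower()


if __name__ == "__main__":
    print(preprocess("What is COVID-19 ?"))
    print(preprocess("糖尿病 的症状是什么？"))
```

The returned string plays the role of the output Q consumed by the later modules.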
3. The question-answer matching system according to claim 1, characterized in that the probabilities of each character in the output Q being the start position and the end position of the entity mention in the sentence are predicted, and the entity mention is determined, specifically as follows:
the BERT_domain model, pre-trained on the disease domain, encodes the output Q of the preprocessing module and outputs feature vectors; the feature vectors are passed through two Softmax classification layers to predict the probability of each character being the entity mention start position and the entity mention end position; and the character string spanning from the start position to the end position is the entity mention.
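A minimal sketch of the start/end prediction described in claim 3 is given below, under stated assumptions. The per-character feature vectors are passed in as plain lists standing in for the output of the domain-pre-trained BERT model; `w_start` and `w_end` are hypothetical weight vectors for the two Softmax layers; normalising over the character positions and constraining the end position to be at or after the start position are choices made for this sketch, not details confirmed by the source.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def extract_mention(
    q: str,
    features: List[List[float]],  # one feature vector per character of q
    w_start: List[float],         # hypothetical start-layer weights
    w_end: List[float],           # hypothetical end-layer weights
) -> Tuple[int, int, str]:
    """Pick the most probable start and end characters of the mention.

    Assumes q is non-empty and len(features) == len(q).
    """
    p_start = softmax([dot(h, w_start) for h in features])
    p_end = softmax([dot(h, w_end) for h in features])
    start = max(range(len(q)), key=p_start.__getitem__)
    # Assumption: the end position may not precede the start position.
    end = max(range(start, len(q)), key=p_end.__getitem__)
    return start, end, q[start:end + 1]
```

For example, with toy features in which the second character scores highest for the start layer and the fourth for the end layer, `extract_mention("abcde", ...)` returns the span `"bcd"`.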
4. The question-answer matching system according to any one of claims 1 to 3, characterized in that the pre-training process of the disease-domain BERT_domain model is as follows:
first, the disease names in clinical diagnosis and treatment documents and electronic medical records are stored as a dictionary file;
then, the text content of the clinical diagnosis and treatment documents and electronic medical records is split into sentences at punctuation marks;
for each resulting sentence, a maximum dictionary matching method is used to check whether the sentence contains a disease word from the dictionary file; if so, the disease word is masked, that is, replaced with MASK; for sentences that contain no disease word, a character is masked at random; in this way an MLM task data set for the disease domain is constructed;
finally, the BERT model is pre-trained on the MLM task data set; the aim of the MLM task is to predict the words replaced by MASK from the context of the sentence, so that the model learns bidirectional information about the sentence; the BERT_domain model and the character vectors E_char are obtained through the MLM task.
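The data-set construction steps of claim 4 might be sketched as follows. This is an illustration under assumptions rather than the patented implementation: forward maximum matching is used as one possible reading of "maximum dictionary matching", one `[MASK]` token is emitted per masked character (the source only says the word is replaced with MASK), and the names `split_sentences`, `find_disease_word` and `build_example` are hypothetical.

```python
import random
import re
from typing import List, Optional, Set, Tuple

MASK = "[MASK]"


def split_sentences(text: str) -> List[str]:
    """Split text into sentences at (assumed) sentence-ending punctuation."""
    return [s for s in re.split(r"[。！？；;!?\n]+", text) if s.strip()]


def find_disease_word(sentence: str, dictionary: Set[str]) -> Optional[Tuple[int, int]]:
    """Forward maximum matching: at each position try the longest word first."""
    max_len = max((len(w) for w in dictionary), default=0)
    for i in range(len(sentence)):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            if sentence[i:i + n] in dictionary:
                return i, i + n
    return None


def build_example(
    sentence: str, dictionary: Set[str], rng: random.Random
) -> Tuple[List[str], str]:
    """Return (masked character sequence, masked target text)."""
    span = find_disease_word(sentence, dictionary)
    if span is None:
        # No disease word found: mask one randomly chosen character.
        i = rng.randrange(len(sentence))
        span = (i, i + 1)
    start, end = span
    target = sentence[start:end]
    masked = list(sentence[:start]) + [MASK] * (end - start) + list(sentence[end:])
    return masked, target
```

With a dictionary containing both "糖尿病" and "2型糖尿病", the sentence "患者确诊2型糖尿病三年" has the longer word masked, because the longest match is tried first at each position.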
5. The question-answer matching system according to claim 3, characterized in that the named entity recognition module is fine-tuned from BERT_domain, fine-tuning meaning that network parameters for the downstream task are added on top of BERT_domain and the parameters of the whole network are then adjusted; the named entity recognition module attaches two Softmax classification layers on top of BERT_domain, which respectively predict the probability that each character in the output Q is the start position and the end position of the entity mention; in addition, in the fine-tuning stage the segment encoding in the original BERT_domain input is replaced by the word segmentation encoding, and the character encoding in the BERT_domain input is initialized with the character vectors E_char saved during the pre-training stage.
6. The question-answer matching system according to claim 4, wherein the probabilities of the entity mention start position and end position in the named entity recognition module are calculated using the following formula:
Figure FDA0002953228460000021
wherein L is the number of characters in Q,
Figure FDA0002953228460000022
denotes the feature encoding of the k-th character c_k in Q, as output by BERT_domain, and h_k is a network parameter of the Softmax classification layer.
7. The question-answer matching system according to claim 5, wherein the word segmentation encoding is specifically as follows: the output Q is segmented by a word segmentation tool, and each character is labeled according to the segmentation result; the label set is {B, M, E, S}, wherein B indicates that the character is the first character of a word, M indicates that the character is a middle character of a word, E indicates that the character is the last character of a word, and S indicates a single-character word; the corresponding word segmentation encodings are E_B, E_M, E_E and E_S respectively.
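The {B, M, E, S} labeling of claim 7 can be illustrated with the short sketch below. The segmentation itself is assumed to come from an external word segmentation tool and is passed in as a list of words; the function name `bmes_tags` and the integer ids in `TAG_TO_ID` (which would index the four word segmentation encodings) are hypothetical.

```python
from typing import Dict, List

# Hypothetical ids used to look up the four word segmentation encodings.
TAG_TO_ID: Dict[str, int] = {"B": 0, "M": 1, "E": 2, "S": 3}


def bmes_tags(words: List[str]) -> List[str]:
    """Label every character of a segmented sentence with B, M, E or S."""
    tags: List[str] = []
    for w in words:
        if not w:
            continue  # ignore empty segments
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


if __name__ == "__main__":
    words = ["糖尿病", "的", "症状"]  # assumed output of a segmentation tool
    tags = bmes_tags(words)
    print(tags, [TAG_TO_ID[t] for t in tags])
```

The resulting per-character ids would take the place that segment ids occupy in the original BERT input.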
8. The question-answer matching system according to claim 1, characterized in that the entity linking module is implemented by retrieval and is divided into two stages, an offline stage and an online search stage:
offline stage: the entity words and entity aliases in the disease knowledge graph are split into characters, and TF-IDF statistics are computed;
a character index is built over the counted entity words, recording, for each character, all entity words containing that character and the position of the character within each entity word;
online search stage: the question input by the user is preprocessed to obtain Q, and the entity mention is obtained through the named entity recognition module; the entity mention is split into characters, the corresponding character indexes are looked up, the TF-IDF scores are accumulated per entity word, the entity words are sorted by score in descending order, and the 2 entity words with the highest scores are selected as candidate entities;
the DICE distance d_1 between each candidate entity and the entity mention and the DICE distance d_2 between the candidate entity and Q are calculated; if the weighted sum of d_1 and d_2 is greater than the threshold, the candidate entity is retained, otherwise the candidate entity is removed.
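The offline indexing and online candidate retrieval of claim 8 might be sketched as follows. The TF-IDF weighting shown (term frequency within the entity word times a smoothed inverse document frequency over entity words) is an assumption, since the source does not give the exact formula; `build_index` and `top_candidates` are hypothetical names, and entity words are assumed to be non-empty.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# char -> {entity word: (tf-idf weight, positions of char in the entity word)}
Index = Dict[str, Dict[str, Tuple[float, List[int]]]]


def build_index(entities: List[str]) -> Index:
    """Offline stage: character-level inverted index with TF-IDF weights."""
    n = len(entities)
    df = Counter(c for e in entities for c in set(e))
    index: Index = defaultdict(dict)
    for e in entities:
        for c, k in Counter(e).items():
            idf = math.log((1 + n) / (1 + df[c])) + 1.0  # assumed smoothing
            positions = [i for i, ch in enumerate(e) if ch == c]
            index[c][e] = ((k / len(e)) * idf, positions)
    return index


def top_candidates(mention: str, index: Index, k: int = 2) -> List[str]:
    """Online stage: accumulate weights per entity word and keep the top k."""
    scores: Dict[str, float] = defaultdict(float)
    for c in mention:
        for e, (w, _) in index.get(c, {}).items():
            scores[e] += w
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [e for e, _ in ranked[:k]]
```

Here `k=2` mirrors the claim's selection of the 2 highest-scoring entity words; the candidates returned would then be filtered by the DICE-based check.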
9. The question-answer matching system according to claim 8, characterized in that the DICE distance is:
Figure FDA0002953228460000023
where X, Y and S denote, in order, the candidate entity word, the entity mention and the output Q of the preprocessing module; α and β are parameters; a ∩ b denotes the characters common to a and b; and |x| denotes the character length of x.
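Because the formula of claim 9 is only available as a figure, the sketch below uses the standard Dice coefficient, 2|a ∩ b| / (|a| + |b|), computed over characters, and combines the two values as a weighted sum. Both the coefficient form and the default values of `alpha`, `beta` and `threshold` are assumptions for illustration. Note that, since candidates are retained when the score exceeds the threshold, the quantity behaves as a similarity even though the claims call it a distance.

```python
from collections import Counter


def dice(a: str, b: str) -> float:
    """Standard Dice coefficient over characters (multiset intersection)."""
    if not a and not b:
        return 0.0
    common = sum((Counter(a) & Counter(b)).values())
    return 2.0 * common / (len(a) + len(b))


def keep_candidate(
    candidate: str,
    mention: str,
    q: str,
    alpha: float = 0.5,      # assumed weight for candidate vs. mention
    beta: float = 0.5,       # assumed weight for candidate vs. Q
    threshold: float = 0.5,  # assumed threshold
) -> bool:
    """Retain the candidate if the weighted sum of d_1 and d_2 exceeds the threshold."""
    d1 = dice(candidate, mention)
    d2 = dice(candidate, q)
    return alpha * d1 + beta * d2 > threshold
```

In this sketch, a mention truncated to "糖尿" still allows the candidate "糖尿病" to pass, because the comparison against the full question Q contributes the missing character.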
10. The question-answer matching system according to claim 3, characterized in that the network parameters of the two Softmax classification layers are different.
CN202110213829.XA 2021-02-26 2021-02-26 Question-answer matching system based on disease field knowledge graph Active CN112925918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110213829.XA CN112925918B (en) 2021-02-26 2021-02-26 Question-answer matching system based on disease field knowledge graph

Publications (2)

Publication Number Publication Date
CN112925918A true CN112925918A (en) 2021-06-08
CN112925918B CN112925918B (en) 2023-03-24

Family

ID=76172049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110213829.XA Active CN112925918B (en) 2021-02-26 2021-02-26 Question-answer matching system based on disease field knowledge graph

Country Status (1)

Country Link
CN (1) CN112925918B (en)

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160103932A1 (en) * 2014-02-13 2016-04-14 Samsung Electronics Co., Ltd. Dynamically modifying elements of user interface based on knowledge graph
CN106407443A (en) * 2016-09-28 2017-02-15 医渡云(北京)技术有限公司 Structured medical data generation method and device
US20180082197A1 (en) * 2016-09-22 2018-03-22 nference, inc. Systems, methods, and computer readable media for visualization of semantic information and inference of temporal signals indicating salient associations between life science entities
US20190121903A1 (en) * 2017-10-24 2019-04-25 Optra Health, Inc. Method of Retrieving Information from a Health Report through a Machine Assisted Interrogation Process
CN110032648A (en) * 2019-03-19 2019-07-19 微医云(杭州)控股有限公司 A kind of case history structuring analytic method based on medical domain entity
CN110704631A (en) * 2019-08-16 2020-01-17 北京紫冬认知科技有限公司 Construction method and device of medical knowledge map
CN110765257A (en) * 2019-12-30 2020-02-07 杭州识度科技有限公司 Intelligent consulting system of law of knowledge map driving type
CN110827941A (en) * 2019-11-06 2020-02-21 山东众阳健康科技集团有限公司 Electronic medical record information correction method and system
CN111104498A (en) * 2019-12-12 2020-05-05 华南理工大学 Semantic understanding method in task type dialogue system
CN111191453A (en) * 2019-12-25 2020-05-22 中国电子科技集团公司第十五研究所 Named entity recognition method based on confrontation training
US20200219618A1 (en) * 2019-01-04 2020-07-09 International Business Machines Corporation Progress evaluation of a diagnosis process
CN111414393A (en) * 2020-03-26 2020-07-14 湖南科创信息技术股份有限公司 Semantic similar case retrieval method and equipment based on medical knowledge graph
CN111444313A (en) * 2020-03-04 2020-07-24 深圳追一科技有限公司 Knowledge graph-based question and answer method and device, computer equipment and storage medium
CN111462841A (en) * 2020-03-12 2020-07-28 华南理工大学 Depression intelligent diagnosis device and system based on knowledge graph
CN111914074A (en) * 2020-07-16 2020-11-10 华中师范大学 Method and system for generating limited field conversation based on deep learning and knowledge graph
CN112002411A (en) * 2020-08-20 2020-11-27 杭州电子科技大学 Cardiovascular and cerebrovascular disease knowledge map question-answering method based on electronic medical record
CN112035635A (en) * 2020-08-28 2020-12-04 康键信息技术(深圳)有限公司 Medical field intention recognition method, device, equipment and storage medium
CN112131401A (en) * 2020-09-14 2020-12-25 腾讯科技(深圳)有限公司 Method and device for constructing concept knowledge graph

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
乔振浩 等: "基于问题生成的知识图谱问答方法", 《智能计算机与应用》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420160A (en) * 2021-06-24 2021-09-21 竹间智能科技(上海)有限公司 Data processing method and device
CN114723073A (en) * 2022-06-07 2022-07-08 阿里健康科技(杭州)有限公司 Language model pre-training method, language model pre-training device, language model searching device and computer equipment
CN114723073B (en) * 2022-06-07 2023-09-05 阿里健康科技(杭州)有限公司 Language model pre-training method, product searching method, device and computer equipment
CN116821712A (en) * 2023-08-25 2023-09-29 中电科大数据研究院有限公司 Semantic matching method and device for unstructured text and knowledge graph
CN116821712B (en) * 2023-08-25 2023-12-19 中电科大数据研究院有限公司 Semantic matching method and device for unstructured text and knowledge graph

Also Published As

Publication number Publication date
CN112925918B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN112925918B (en) Question-answer matching system based on disease field knowledge graph
CN112002411A (en) Cardiovascular and cerebrovascular disease knowledge map question-answering method based on electronic medical record
CN110674252A (en) High-precision semantic search system for judicial domain
CN110765277B (en) Knowledge-graph-based mobile terminal online equipment fault diagnosis method
CN117573843B (en) Knowledge calibration and retrieval enhancement-based medical auxiliary question-answering method and system
CN115048447B (en) Database natural language interface system based on intelligent semantic completion
CN112328800A (en) System and method for automatically generating programming specification question answers
US20220114340A1 (en) System and method for an automatic search and comparison tool
CN114416942A (en) Automatic question-answering method based on deep learning
CN113297364A (en) Natural language understanding method and device for dialog system
CN112784532A (en) Multi-head attention memory network for short text sentiment classification
CN115392252A (en) Entity identification method integrating self-attention and hierarchical residual error memory network
CN114077673A (en) Knowledge graph construction method based on BTBC model
Manik et al. Out-of-Scope Intent Detection on A Knowledge-Based Chatbot.
CN116010553A (en) Viewpoint retrieval system based on two-way coding and accurate matching signals
CN117251524A (en) Short text classification method based on multi-strategy fusion
CN115238026A (en) Medical text subject segmentation method and device based on deep learning
CN111666374A (en) Method for integrating additional knowledge information into deep language model
KR20230077588A (en) Method of classifying intention of various question and searching answers of financial domain based on financial term language model and system impelemting thereof
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN115828854B (en) Efficient table entity linking method based on context disambiguation
Göker et al. Neural text normalization for turkish social media
Li et al. Approach of intelligence question-answering system based on physical fitness knowledge graph
CN116992002A (en) Intelligent care scheme response method and system
Saikh et al. COVIDRead: A large-scale question answering dataset on COVID-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant