CN113436698B - Automatic medical term standardization system and method integrating self-supervision and active learning - Google Patents

Automatic medical term standardization system and method integrating self-supervision and active learning

Info

Publication number
CN113436698B
CN113436698B
Authority
CN
China
Prior art keywords
term
model
standard
training
candidate set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110994475.7A
Other languages
Chinese (zh)
Other versions
CN113436698A (en)
Inventor
李劲松
杨宗峰
辛然
李玉格
史黎鑫
田雨
周天舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202110994475.7A priority Critical patent/CN113436698B/en
Publication of CN113436698A publication Critical patent/CN113436698A/en
Application granted granted Critical
Publication of CN113436698B publication Critical patent/CN113436698B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an automatic medical term standardization system and method integrating self-supervision and active learning. The system comprises basic modules, including a candidate set generation module, a self-supervised learning module for training a term standardization model, an active learning module, and a precise ranking module that comprehensively evaluates the prediction results of the term standardization model from the text and semantic dimensions; it further comprises preferred modules, including a semi-supervised learning module and a direct superior term retrieval module. The invention can realize an automatic medical term standardization model with little labeled data, and the model retains the ability to be quickly updated and upgraded, greatly reducing the workload of manual intervention while ensuring the accuracy of the output results. A new clinical concept can be matched to its direct superior term and assigned an accurate position in the standard term table, ensuring the integrity and uniformity of the standardized results.

Description

Automatic medical term standardization system and method integrating self-supervision and active learning
Technical Field
The invention belongs to the technical field of Chinese medical term standardization and multi-center medical information platforms, and particularly relates to a medical term automatic standardization system and method integrating self-supervision and active learning.
Background
With the popularization of electronic medical record systems, a large amount of important medical information is stored in various medical information systems in electronic form. This data creates great value for clinical decision support, drug research and development, public health monitoring and evaluation, infectious disease early warning, personalized precision medicine, and the like. Medical data standardization is a key step in promoting the integration of domestic medical systems and realizing collaborative research and large-scale analysis of medical data, and standardizing medical terms is the first difficult problem to be solved in that process. Internationally, different types of medical terms have corresponding standard terminology systems, including the disease terminology set ICD-10, the surgical procedure codes ICD-9-CM-3, and the laboratory test terminology set LOINC. However, hospitals and other medical facilities do not make good use of these internationally accepted standard terminology sets in actual operation, mainly because: (1) different hospitals often adopt different medical information systems whose data standards differ, so the medical terms they generate differ considerably in data dimension and data format; (2) different operators understand standard terms and their granularity inconsistently. Medical information systems usually require the operator to select the disease name, operation name, and other information according to the patient's condition; where the meanings of superior and subordinate terms overlap (for example, the two ICD-10 codes "D00.2" and "D00.200" for "gastric carcinoma in situ"), different operators, or even the same operator at different times, may choose differently; (3) operators personalize the terms they enter. Most information systems allow manual input for the convenience of entering new concepts, so operators may coin irregular terms based on past experience and personal habits. These factors mean that original clinical concepts cannot be directly related to the common standard terms, making data unification and information exchange between different organizations difficult.
The ultimate goal of medical term standardization is to establish a mapping between original clinical concepts and standard terms. Previous term standardization schemes are generally based on one of two ideas. (1) Manual mapping: professional clinicians are invited to map and proofread terms one by one. However, each medical information system contains terms on the order of tens of thousands, so the proofreading time required of clinicians is very long, which makes rapid nationwide adoption difficult and hinders the rapid implementation of domestic medical data standardization. In addition, because clinicians differ in work experience, their mappings to standard terms lack a uniform criterion, so consistency between different clinicians is hard to guarantee; manual errors also make it hard to guarantee that the same clinician maps consistently at different times. (2) Training a medical concept semantic matching model with a machine learning algorithm: manual data annotation is difficult and time-consuming, so training data is insufficient, the resulting model generalizes poorly, and extra manpower must still be spent verifying the output to ensure the accuracy of the term standardization results in actual use. On the other hand, many standard terminology sets have superior-subordinate relationships; for example, the subordinate terms of "corneal surgery (11)" in the surgical procedure codes ICD-9-CM-3 include "suture of corneal laceration (11.51)", "corneal transplantation NOS (11.6)", and the like. When a concept generated in actual clinical work has no synonymous peer-level term in the standard term set, its direct superior standard term needs to be located accurately; existing methods cannot solve this problem well, so newly added clinical concepts cannot be merged into the universal standard terminology system. The invention aims to establish a medical term standardization system with good accuracy and generalization capability without a large amount of labeled data, to realize rapid automatic iterative updating of the system with as little manual intervention as possible, and at the same time to locate the correct peer-level or superior standard term for each original clinical concept.
Disclosure of Invention
Medical data standardization is a key step in promoting the integration of domestic medical systems and realizing collaborative research and large-scale analysis of medical data. However, existing clinical term standardization methods and systems generally require substantial manual review and labeling work, and their accuracy and generalization ability are difficult to guarantee, making it difficult to rapidly popularize clinical data standardization domestically.
Aiming at the difficulties of current medical term standardization work, the invention provides an automatic medical term standardization system and method based on a deep learning model that integrates self-supervision and active learning.
The purpose of the invention is achieved by the following technical scheme. A medical term standardization model is constructed on the basis of a deep learning language model and trained with a self-supervised learning method. Negative samples are sampled based on a text relevance model and the hierarchical structure of the standard term table, yielding negative samples that carry more information and are harder for the model to discriminate, which acts as data augmentation; the model can therefore make full use of the semantic relations it contains even when only a small number of labeled samples are available. An active learning function is implemented based on principles such as maximum entropy, low confidence, and high frequency; according to the model's predictions on a large number of unknown samples, a group of samples that can improve model performance the most is screened out, so that the model can be upgraded quickly and significantly with minimal manual intervention. A precise ranking model is designed to integrate text, semantic, and other information and finally output the correct standard term. Precisely ranked samples automatically update the training data through semi-supervised self-training, further improving the accuracy and generalization ability of the model and continuously reducing the workload of manual intervention. An upward retrieval method is constructed to locate the direct superior terms of newly added original clinical concepts, ensuring the integrity and consistency of the medical term standardization results; newly added clinical concepts can thus find their correct positions in the standard term table, which facilitates comprehensive standardization of clinical data.
The invention discloses an automatic medical term standardization system integrating self-supervision and active learning, which comprises the following components:
(1) a candidate set generation module: sampling negative samples based on a hierarchical structure of a text correlation model and a standard glossary to generate a training candidate set, and sampling possible positive samples based on the text correlation model to generate a prediction candidate set;
(2) the self-supervision learning module: for training a term normalized model, comprising:
training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms;
respectively calculating the semantic similarity of the labeled original clinical concept and the label thereof and the negative sample of the training candidate set through a semantic matching model;
adopting a self-supervision learning mode, and calculating a loss function of a normalized model according to the semantic similarity;
(3) an active learning module: calculating semantic similarity scores between the unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out, according to the active learning criteria, a group of samples about which the current term standardization model is most uncertain, and fusing these samples into the training candidate set after their labels are determined;
(4) the accurate sequencing module: acquiring semantic similarity scores of the original clinical concept and standard terms output by the self-supervision learning module as semantic features, calculating text features, training a regression decision tree-based accurate sequencing model based on the semantic and text features, and calculating confidence scores of medical term standardization results; and calculating a confidence score for the standard term positive sample in the prediction candidate set by using the trained accurate sequencing model to obtain the standard term with the maximum confidence score.
Further, the automatic medical term standardization system further comprises a semi-supervised learning module, and the semi-supervised learning module fuses the samples of which the confidence scores of the medical term standardization results output by the precise ordering module meet the conditions to the training candidate set.
Further, the automatic medical term standardization system further comprises a direct superior term retrieval module, wherein the direct superior term retrieval module comprises: acquiring a group of standard terms with the highest confidence scores predicted by an accurate sequencing model for the original clinical concept, and generating a path traced back to the upper level in the hierarchical structure of the standard term table; and determining the direct superior terms corresponding to the original clinical concept based on the principle of majority voting.
The invention discloses a medical term automatic standardization method fusing self-supervision and active learning on the other hand, which comprises the following steps:
(1) generating a negative sample and a positive sample, and respectively constructing a training candidate set and a prediction candidate set: sampling negative samples based on a hierarchical structure of a text correlation model and a standard glossary to generate a training candidate set, and sampling possible positive samples based on the text correlation model to generate a prediction candidate set;
(2) training the term normalization model by self-supervised learning: training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms; respectively calculating the semantic similarity of the labeled original clinical concept and the label thereof and the negative sample of the training candidate set through a semantic matching model; adopting a self-supervision learning mode, and calculating a loss function of a normalized model according to the semantic similarity;
(3) the term standardization model is rapidly upgraded through active learning: calculating semantic similarity scores between the unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out, according to the active learning criteria, a group of samples about which the current term standardization model is most uncertain, and fusing these samples into the training candidate set after their labels are determined;
(4) training an accurate ordering model, and comprehensively evaluating the prediction result of the term standardization model from text and semantic dimensions: acquiring the semantic similarity score of the original clinical concept and the standard term output by the self-supervision learning in the step two as a semantic feature, and calculating a text feature; training a regression decision tree-based accurate sequencing model based on semantic and text features for calculating a confidence score of a medical term standardization result;
(5) predicting the final term normalization result: and calculating confidence scores for the standard term positive samples in the prediction candidate set by using the trained accurate ranking model, and taking the standard term with the maximum confidence score as a term standardization result.
Further, the step (1) includes:
(1.1) Training candidate set: the training candidate set consists of the original clinical concept x and the corresponding standard term y. If y has a direct superior term Y1, all next-level terms of Y1 are taken and denoted as the set M; if y has no direct superior term but has a second-level superior term Y2, all next-level and next-next-level terms of Y2 are taken and denoted as the set M; otherwise, the whole standard term table is denoted as the set M. The text relevance score of x and any standard term m in M is calculated, the terms are sorted by text relevance score, and a negative sample set {y1^-, y2^-, ..., yn^-} is selected, obtaining the training candidate set {(x, y), (x, y1^-), (x, y2^-), ..., (x, yn^-)}.
(1.2) Prediction candidate set: when the term standardization model makes predictions, for an unlabeled original clinical concept x the whole standard term table is denoted as the set M, and a positive sample set {y1^+, y2^+, ..., ym^+} is selected from M using the text relevance score, deriving the prediction candidate set {(x, y1^+), (x, y2^+), ..., (x, ym^+)}.
Further, in the step (2), the Chinese medical language model is a bidirectional autoregressive language model, which specifically includes: the original clinical concept x and any standard term y* are concatenated character by character, a separator token [SEP] is added at the junction, and a start token [S] is added at the leftmost position; the concatenation result is input into the bidirectional autoregressive language model as a whole sentence, and the last layer of the model outputs the semantic vector of x and y* at the position of the start token [S].
Further, in the step (3), the unlabeled original clinical concept x is passed through the step (1) to obtain a prediction candidate set {(x, y1^+), (x, y2^+), ..., (x, ym^+)}; the semantic matching model is used to compute the semantic similarity scores s1(x, yj^+), j = 1, ..., m, which are normalized into a probability distribution:

pj = s1(x, yj^+) / Σ_k s1(x, yk^+)

The uncertainty C(x) of the term standardization model for x is calculated as follows:

C(x) = w1·ent(x) + w2·margin(x) + w3·lc(x) + w4·freq(x)

where ent(x) is the information entropy of the term standardization model for x:

ent(x) = -Σ_j pj·log(pj)

margin(x) is the margin term:

margin(x) = -(p1 - p2)

where p1 and p2 are respectively the largest and second-largest of all pj;

lc(x) is the confidence term:

lc(x) = -p1

freq(x) is the frequency with which x occurs in the original clinical text data;

(wi), i = 1, 2, 3, 4, are the weights of the respective features.
Further, in the step (4), the gradient boosting model XGBoost is adopted as the precise ranking model, specifically: a number of regression decision trees are trained, the learning target of each tree is the error of the previous trees, and the accumulation of the outputs of all trees is the final confidence score. If a gradient boosting model is constructed for u samples, the loss function L^(t) of the t-th decision tree is:

L^(t) = Σ_{i=1..u} l(vi, v̂i^(t-1) + f(Tree_t, xi)) + Ω(Tree_t)

where l(·,·) is the square loss function, vi is the label of sample xi, f(Tree_t, xi) is the predicted value of the t-th decision tree for xi, v̂i^(t-1) is the predicted value of the first t-1 decision trees for xi, and

Ω(Tree_t) = γ·|Tree_t| + (λ/2)·Σ_k wk²

is a regularization term representing the complexity of the decision tree, where |Tree_t| is the number of leaf nodes of the t-th decision tree, wk is the predicted value of the k-th leaf node, and γ and λ are weight parameters.

In the process of training the precise ranking model, the input training data is a data set consisting of the original clinical concepts paired with the standard term positive samples in their prediction candidate sets, {(xi, yi^+)}, i = 1, ..., u. If the trained precise ranking model comprises T decision trees, the confidence score r(xi, yi^+) of the medical term standardization result for a sample (xi, yi^+) is:

r(xi, yi^+) = Σ_{t=1..T} f(Tree_t, (xi, yi^+))
Further, samples whose confidence scores of the medical term standardization results output by the precise ranking module satisfy the condition are fused into the training candidate set, and the parameters of the term standardization model and the precise ranking model are updated.
Further, the method also comprises a direct superior term retrieval function: acquiring a group of standard terms with the highest confidence scores predicted by an accurate sequencing model for the original clinical concept, and generating a path traced back to the upper level in the hierarchical structure of the standard term table; and determining the direct superior terms corresponding to the original clinical concept based on the principle of majority voting.
The invention has the following beneficial effects: an automatic medical term standardization model can be realized with little labeled data, and the model retains the ability to be quickly updated and upgraded, greatly reducing the workload of manual intervention while ensuring the accuracy of the output results; a new clinical concept can be matched to its direct superior term and assigned an accurate position in the standard term table, ensuring the integrity and uniformity of the standardized results.
Drawings
FIG. 1 is a block diagram of an automatic standardization system for medical terms fusing self-supervision and active learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an implementation of a candidate set generation module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an implementation of an auto-supervised learning module and an active learning module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an implementation of a direct superior term retrieval module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a training process of a decision tree-based precision ranking model according to an embodiment of the present invention;
fig. 6 is a schematic diagram of direct superior term retrieval according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
In the invention, self-supervised learning means: mining supervision information from large-scale unlabeled data by means of auxiliary (pretext) tasks, and using the constructed supervision information to train the network so that it learns features valuable for downstream tasks. There are three main approaches to self-supervised learning: context-based learning, time-series-based learning, and contrastive learning, where contrastive learning builds representations by learning to encode the similarity or dissimilarity of two things.
Active learning means: the main goal is to reduce the cost of manually annotating data. Samples that the model finds difficult or ambiguous to classify are obtained through a machine learning method; such data is generally considered to lie near the boundaries between classes and can therefore give the model greater help in accurately learning the features of different classes. By manually re-confirming and auditing these samples, the model's performance can be improved more significantly for the same amount of labeled data.
Semi-supervised learning refers to: the learner automatically exploits unlabeled samples, without external interaction, to improve learning performance. Self-training is a particular implementation of semi-supervised learning: assuming that similar samples have similar outputs, an initial model is first trained with labeled samples; the model then predicts labels for unlabeled samples, samples whose predictions have higher confidence are screened out according to some criterion, and the predicted soft or hard labels are used as new labeled data to expand the training set.
Medical term standardization refers to: the process of unifying medical terms within a certain scope by establishing medical term standards, using standardization principles and methods, so as to obtain optimal order and social benefit. Establishing unified medical term standards and term sets helps solve problems such as term duplication and inconsistent connotation, semantic expression, and understanding, and is of great significance for effectively promoting the dissemination, sharing, and use of medical information on a wider scale and at a deeper level.
The embodiment of the invention provides an automatic medical term standardization system integrating self-supervision and active learning, which comprises the following modules as shown in figure 1:
a candidate set generation module: sampling negative samples based on a hierarchical structure of a text correlation model and a standard glossary to generate a training candidate set, and sampling possible positive samples based on the text correlation model to generate a prediction candidate set;
a self-supervised learning module for training the term standardized model;
the active learning module is realized on the basis of the principles of maximum entropy, minimum confidence coefficient and the like;
a precision ranking module for comprehensive evaluation of term normalized model predictions from text and semantic dimensions.
Preferably, the system further comprises: and fusing the sample of which the confidence score of the medical term standardization result output by the precise ordering module meets the condition to a semi-supervised learning module of the training candidate set.
Preferably, the system further comprises: directly superior term retrieval module.
Specifically, the candidate set generation module is composed of two parts: in the term standardization model training process, sampling is carried out based on the text correlation BM25 model and the hierarchical structure of a standard term table, and standard terms which are close to but not identical with the original clinical concept as much as possible are obtained as negative sample standard terms; in the term standardization model prediction process, possible positive sample standard terms are generated based on the text relevance BM25 model, and the detailed flow is shown in FIG. 2.
Specifically, the self-supervision learning module mainly comprises the following three steps:
1. training a Chinese medical language model, preferably a bidirectional autoregressive language model (BERT), by a self-adaptive method, and further acquiring semantic vectors of original clinical concepts and standard terms;
2. respectively calculating the semantic similarity of the labeled original clinical concept and the label thereof and the negative sample of the training candidate set through a semantic matching model;
3. The loss function of the standardization model is calculated from the semantic similarity using a self-supervised learning approach (preferably self-supervised contrastive learning), as shown in the left part of FIG. 3.
Specifically, the active learning module mainly includes the following two steps:
1. calculating a semantic similarity score by using the unlabeled original clinical concepts and the standard terms of the prediction candidate set;
2. and screening out a group of samples with the most uncertain current term standardization model according to the active learning standard, determining labels of the samples, and then merging the samples into a training candidate set, wherein the labels are shown in the right part of the figure 3.
Specifically, the precise sorting module mainly comprises the following two steps:
1. firstly, acquiring semantic similarity scores of an original clinical concept and a standard term output by an automatic supervision learning module as semantic features, and calculating text features, wherein the text features comprise the literal similarity of the original clinical concept and the standard term, word co-occurrence frequency, the difference of the number of contained words and the like;
2. a regression decision tree-based precision ranking model is then trained based on these features for computing a confidence score for the medical term normalization result.
Specifically, the main function of the semi-supervised learning module is to screen out, based on the confidence scores output by the precise ranking module, a group of samples about which the current term standardization model is most certain, and to expand the training candidate set with them.
Specifically, the direct superior term retrieval module mainly includes the following two steps:
1. firstly, acquiring a group of standard terms with the highest confidence scores predicted by a precise ordering model for an original clinical concept, and generating a path traced back to the upper level in the hierarchical structure of a standard term table;
2. and then determining the direct superior terms corresponding to the original clinical concept based on the principle of majority voting, as shown in fig. 4.
The embodiment of the invention provides a medical term automatic standardization method integrating self-supervision and active learning, which comprises the following specific implementation steps:
generating a negative sample and a positive sample, and respectively constructing a training candidate set and a prediction candidate set, specifically: sampling negative samples based on a hierarchical structure of a text correlation model and a standard glossary to generate a training candidate set, and sampling possible positive samples based on the text correlation model to generate a prediction candidate set; more specifically, with reference to fig. 2, the following sub-steps are included:
1) the training candidate set of the term normalization model consists of a large number of original clinical concepts x and their corresponding standard terms y. In term normalization model training, a set of negative samples is first sampled in a standard term that has a different meaning than x. In order for the term-normalized model to learn more from negative examples, the sampling process needs to obtain standard terms that are as close as possible to, but not exactly the same as, the meaning of the original clinical concept. Some standard nomenclature exists hierarchically, for example, the disease nomenclature table ICD-10 encodes "oral, esophageal and gastric carcinoma in situ" as "D00" with the next-level nomenclature of "carcinoma in situ of the lip, oral and pharynx (D00.0)", "esophageal carcinoma in situ (D00.2)" and the like, and the next-level nomenclature of "carcinoma in situ of the tonsil (D00.001)", "carcinoma in situ of the lip (D00.002)" and the like. The operations were performed in the following order:
① if y has a direct superior term Y1, all next-level terms of Y1 are taken and denoted as the set M;
② if y has no direct superior term but has a second-level superior term Y2, all next-level and next-next-level terms of Y2 are taken and denoted as the set M;
③ otherwise, the whole standard term table is denoted as the set M.
Then, the text relevance score of x and any standard term m in M is calculated; the formula for the text relevance Score(x, m) of the BM25 text relevance model is:

Score(x, m) = Σ_i IDF(qi) · wi · fi·(k1 + 1) / ( fi + k1·(1 - b + b·len/avglen) )

where IDF(qi) is the IDF value of the word qi of x, fi is the frequency with which qi occurs in m, len is the length of m, avglen is the average length of all standard terms in M, wi is the weight of the word, and k1 and b are empirically specified parameters; in this embodiment, k1 = 1 and b = 0.5.
The negative sample set {y1^-, y2^-, ..., yn^-} is selected by sorting according to the text relevance scores, obtaining the training candidate set {(x, y), (x, y1^-), (x, y2^-), ..., (x, yn^-)}.
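For illustration, a minimal Python sketch of this BM25-style scoring and negative-sample selection follows. The function names, tokenization into characters, and the IDF table are assumptions for the example (the per-word weight wi is omitted, i.e. taken as 1); it is a sketch of the sampling step described above, not the patent's implementation.

```python
import math
from collections import Counter

def bm25_score(query_chars, term_chars, idf, avglen, k1=1.0, b=0.5):
    """BM25-style text relevance Score(x, m) as described above (illustrative).

    query_chars: characters q_i of the original clinical concept x
    term_chars:  characters of a candidate standard term m
    idf:         dict mapping a character to its IDF value over the term table
    avglen:      average length of all standard terms in the candidate set M
    """
    tf = Counter(term_chars)           # f_i: frequency of q_i in m
    length = len(term_chars)           # len: length of m
    score = 0.0
    for q in query_chars:
        f = tf.get(q, 0)
        denom = f + k1 * (1 - b + b * length / avglen)
        score += idf.get(q, 0.0) * (f * (k1 + 1)) / denom
    return score

def top_text_matches(x, candidate_terms, idf, n=20):
    """Rank candidate standard terms by BM25 score and keep the top n:
    used both for negative sampling (terms close to x in surface text but
    different in meaning) and for building the prediction candidate set."""
    avglen = sum(len(m) for m in candidate_terms) / max(len(candidate_terms), 1)
    ranked = sorted(candidate_terms,
                    key=lambda m: bm25_score(list(x), list(m), idf, avglen),
                    reverse=True)
    return ranked[:n]
```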
2) When the term standardization model makes predictions, for an unlabeled original clinical concept x, a group of most likely standard terms is first screened out from the whole standard term table as a prediction candidate set; the term standardization model then only needs to compute the semantic similarity scores between the original clinical concept and the standard terms in the prediction candidate set, rather than over the whole standard term table, which avoids a large amount of useless computation and improves prediction efficiency. In this case the whole standard term table is denoted as the set M, and a positive sample set {y1^+, y2^+, ..., ym^+} is selected from M using the text relevance Score(x, m), deriving the prediction candidate set {(x, y1^+), (x, y2^+), ..., (x, ym^+)}.
Secondly, training a term standardization model through self-supervision learning, specifically: training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms; respectively calculating the semantic similarity of the labeled original clinical concept and the label thereof and the negative sample of the training candidate set through a semantic matching model; adopting a self-supervision learning mode, and calculating a loss function of a normalized model according to the semantic similarity; more specifically, the following substeps are included:
1) The term standardization model consists of a bidirectional autoregressive language model and a semantic matching model. The bidirectional autoregressive language model performs autoregressive training of semantic units based on forward and reverse contexts, and can learn efficient semantic vector representations while modeling natural language. In the multi-layer bidirectional autoregressive language model, the input of the next layer is derived from a self-attention mechanism over the hidden state of the previous layer:

Z = softmax( Q·K^T / sqrt(d_k) ) · V

where Q = h·W_Q, K = h·W_K, V = h·W_V are the vectors obtained from the hidden state h of the previous layer after matrix transformation, d_k is the dimension of h, Z is the input of the next layer, and W_Q, W_K, W_V are matrices obtained by training. The hidden state FFN(Z) of the next layer is obtained through the following nonlinear transformation:

FFN(Z) = max(0, Z·W1 + b1)·W2 + b2

where W1 and W2 are matrices obtained by training, and b1 and b2 are vectors obtained by training.
When term standardization is performed, the semantic vectors of the original clinical concepts and the standard terms can be derived from the bidirectional autoregressive language model. Specifically: the original clinical concept x and any standard term y* (either a positive or a negative sample) are concatenated character by character, a separator token [SEP] is added at the junction, and a start token [S] is added at the leftmost position. For example, if the original clinical operation concept "fallopian tube resection" corresponds to the positive-sample standard term "bilateral salpingectomy (66.51)" in the ICD-9-CM-3 term table, the positive sample is spliced into "[S] fallopian tube resection [SEP] bilateral salpingectomy". The concatenation result is input into the bidirectional autoregressive language model as a whole sentence, and the last layer of the model outputs the semantic vector of x and y* at the position of the start token [S].
The semantic vector is input into the semantic matching model to compute the semantic similarity. The calculation process of the multi-layer semantic matching model is:

Zi = Wi·h(i-1) + bi,    hi = σ(Zi)

where hi is the hidden state of the i-th layer of the semantic matching model, Zi is its linear output computed from the hidden state h(i-1) of the previous layer, σ(·) is a nonlinear activation, and Wi and bi are parameters obtained by training. The output dimension of the last layer (denoted ZL = [zL,0, zL,1]) is set to 2, and the two scores are obtained through a nonlinear (softmax) transformation:

s1(x, y*) = exp(zL,1) / ( exp(zL,0) + exp(zL,1) ),    s0(x, y*) = exp(zL,0) / ( exp(zL,0) + exp(zL,1) )

In the output, s1(x, y*) is the semantic similarity score of x and y*, and s0(x, y*) is their degree-of-difference score.
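A minimal PyTorch sketch of such a pair-encoding matcher follows. The encoder interface (returning one hidden vector per token), the hidden sizes, and the two-way softmax head are illustrative assumptions made for this example; any BERT-style bidirectional encoder could fill the `encoder` slot.

```python
import torch
import torch.nn as nn

class TermMatcher(nn.Module):
    """Sketch: encode the pair "[S] x [SEP] y*" with a BERT-style encoder and
    score it with a small matching head whose 2-dimensional output gives the
    dissimilarity score s0 and the similarity score s1."""

    def __init__(self, encoder, hidden_size, mid_size=256):
        super().__init__()
        self.encoder = encoder                     # returns (batch, seq, hidden)
        self.head = nn.Sequential(
            nn.Linear(hidden_size, mid_size),
            nn.ReLU(),
            nn.Linear(mid_size, 2),                # last layer: output dimension 2
        )

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask)
        pair_vec = hidden[:, 0, :]                 # semantic vector at the start token [S]
        logits = self.head(pair_vec)               # (batch, 2)
        probs = torch.softmax(logits, dim=-1)
        s0, s1 = probs[:, 0], probs[:, 1]          # difference / similarity scores
        return s0, s1
```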
2) The semantic matching model is trained in a self-supervised manner, so that the model autonomously learns the shared features of synonymous data from a large amount of data, alleviating the problem of insufficient labeled training data. In the specific implementation, contrastive learning is adopted: the model focuses on learning the common features of synonymous terms while distinguishing terms with different meanings. Let the label of the original clinical concept x be the standard term y, and let the training candidate set {(x, y), (x, y1^-), ..., (x, yn^-)} be obtained through the first step. The global loss function L is constructed as:

L = -E[ log( exp(s1(x, y)/τ) / ( exp(s1(x, y)/τ) + Σ_{i=1..n} exp(s1(x, yi^-)/τ) ) ) ]

where E[·] denotes the expectation over the labeled original clinical concepts, and τ is an empirically specified temperature parameter; in this embodiment τ = 0.9. The term standardization model parameters are updated by gradient backpropagation using this loss function.
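The loss above is an InfoNCE-style contrastive objective; a short sketch of how it might be computed from the similarity scores is given below. The tensor shapes and the use of cross-entropy over the positive/negative logits are assumptions for this example, not a quotation of the patent's code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(s1_pos, s1_negs, tau=0.9):
    """s1_pos:  (batch,)    similarity s1(x, y) with the labeled standard term
       s1_negs: (batch, n)  similarities s1(x, y_i^-) with the sampled negatives
       tau:     temperature, 0.9 in the embodiment."""
    logits = torch.cat([s1_pos.unsqueeze(1), s1_negs], dim=1) / tau   # (batch, 1 + n)
    # the positive pair sits at index 0 of every row
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```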
Thirdly, rapidly upgrading a term standardization model through active learning, specifically: calculating a semantic similarity score by using the unlabeled original clinical concepts and the standard terms of the prediction candidate set; and screening a group of samples with most uncertain current term standardization models according to the active learning standard, and fusing a training candidate set after determining labels of the samples. The more specific implementation principle and process are as follows:
Obtaining a more effective model with as little labeled data as possible is a problem faced by many machine learning algorithms. The idea of active learning is that samples about which the current model is most ambiguous carry the most information; screening such samples and generating labels for them improves the performance of the model the most for the same amount of data. The unlabeled original clinical concept x is passed through the first step to obtain a prediction candidate set {(x, y1^+), (x, y2^+), ..., (x, ym^+)}; the semantic matching model is used to compute the semantic similarity scores s1(x, yj^+), j = 1, ..., m, which are normalized into a probability distribution:

pj = s1(x, yj^+) / Σ_k s1(x, yk^+)

The uncertainty C(x) of the term standardization model for x is calculated as follows:

C(x) = w1·ent(x) + w2·margin(x) + w3·lc(x) + w4·freq(x)

where ent(x) is the information entropy of the term standardization model for x:

ent(x) = -Σ_j pj·log(pj)

margin(x) is the margin term:

margin(x) = -(p1 - p2)

where p1 and p2 are respectively the largest and second-largest of all pj;

lc(x) is the confidence term:

lc(x) = -p1

freq(x) is the frequency with which x occurs in the original clinical text data; generating correct labels for high-frequency original clinical concepts helps the term standardization model better learn the distribution of the whole set of original clinical concepts.

(wi), i = 1, 2, 3, 4, are the weights of the respective features; in this embodiment, w1 = 0.45, w2 = 0.2, w3 = 0.2, w4 = 0.15.

The active learning process screens out the original clinical concepts with the highest C(x) values, manually determines their labels, and merges them into the training candidate set for retraining the term standardization model; repeating this process yields the best term standardization model with the least labeled data.
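A short Python sketch of the uncertainty computation and sample selection follows. The frequency term is assumed to be pre-normalized, and the selection budget is an illustrative parameter; both are assumptions for the example.

```python
import numpy as np

def uncertainty(p, freq, w=(0.45, 0.2, 0.2, 0.15), eps=1e-12):
    """C(x) = w1*ent(x) + w2*margin(x) + w3*lc(x) + w4*freq(x).

    p:    probability distribution over the prediction candidate set of x
    freq: frequency of x in the original clinical text data (assumed normalized)
    """
    p = np.asarray(p, dtype=float)
    ent = -np.sum(p * np.log(p + eps))              # information entropy ent(x)
    top = np.sort(p)[::-1]
    p1, p2 = top[0], (top[1] if top.size > 1 else 0.0)
    margin = -(p1 - p2)                              # margin(x) = -(p1 - p2)
    lc = -p1                                         # lc(x) = -p1
    return w[0] * ent + w[1] * margin + w[2] * lc + w[3] * freq

def select_for_labeling(concepts, probs, freqs, budget=100):
    """Pick the `budget` concepts the current model is most uncertain about."""
    scores = [uncertainty(p, f) for p, f in zip(probs, freqs)]
    order = np.argsort(scores)[::-1]
    return [concepts[i] for i in order[:budget]]
```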
Training a precise ordering model, and comprehensively evaluating the prediction result of the term standardization model from text and semantic dimensions, specifically: acquiring the semantic similarity score of the original clinical concept and the standard term output by the self-supervision learning in the step two as a semantic feature, and calculating a text feature; a regression decision tree-based precision ranking model is trained based on the features for computing a confidence score for the medical term normalization result. The more specific implementation principle and process are as follows:
The accuracy of the term standardization results must be considered before putting them into practical application, especially in the initial stage when the whole system still lacks training data. The traditional approach is to manually verify the predictions of the term standardization model again, which usually consumes considerable manpower. The precise ranking model designed here, which comprehensively considers semantic, text, and other features, helps correctly predicted standard terms obtain a higher rank and thus effectively addresses this problem; as the term standardization model is iteratively upgraded through self-supervision and active learning, the weight of the semantic features in the precise ranking model can be gradually increased. Preferably, the gradient boosting model XGBoost is used as the precise ranking model. Its basic idea is to train a number of regression decision trees, where the learning target of each tree is the error of the previous trees, and the accumulation of the outputs of all trees is the final confidence score, as shown in FIG. 5. If a gradient boosting model is constructed for u samples, the loss function L^(t) of the t-th decision tree is:

L^(t) = Σ_{i=1..u} l(vi, v̂i^(t-1) + f(Tree_t, xi)) + Ω(Tree_t)

where l(·,·) is the square loss function, vi is the label of sample xi, f(Tree_t, xi) is the predicted value of the t-th decision tree for xi, v̂i^(t-1) is the accumulated predicted value of the first t-1 decision trees for xi, and

Ω(Tree_t) = γ·|Tree_t| + (λ/2)·Σ_k wk²

is a regularization term representing the complexity of the decision tree, where |Tree_t| is the number of leaf nodes of the t-th decision tree, wk is the predicted value of the k-th leaf node, and γ and λ are weight parameters; in this embodiment, γ = 0.1 and λ = 0.9.
In the process of training the precise ranking model, input training data is a data set consisting of standard term positive samples in a prediction candidate set of an original clinical concept and a term standardization model:
Figure GDA0003300318080000115
labels v for each set of samplesiThe characteristics involved in training are shown in table 1, either 0 (wrong standard terms) or 1 (correct standard terms). If the trained accurate sequencing model comprises T decision trees, the samples are subjected to
Figure GDA0003300318080000116
Computing a confidence score for a medical term normalized result
Figure GDA0003300318080000117
Comprises the following steps:
Figure GDA0003300318080000121
TABLE 1 characteristics adopted by the precision ranking model
Figure GDA0003300318080000122
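A brief sketch using the XGBoost library is shown below. The feature list, the number of trees, and the mapping of the embodiment's γ and λ onto the library's `gamma` and `reg_lambda` parameters are assumptions for this example; it illustrates the gradient-boosted regression idea rather than reproducing the patent's exact configuration.

```python
import numpy as np
import xgboost as xgb

def build_features(semantic_score, literal_sim, co_occurrence, len_diff):
    """One feature row per (original concept, candidate standard term) pair:
    the semantic similarity plus text features (illustrative subset of Table 1)."""
    return [semantic_score, literal_sim, co_occurrence, len_diff]

def train_ranker(X, v):
    """X: feature rows; v: labels (1 = correct standard term, 0 = wrong)."""
    ranker = xgb.XGBRegressor(
        n_estimators=200,                  # T decision trees (assumed value)
        objective="reg:squarederror",      # square loss l(.)
        gamma=0.1,
        reg_lambda=0.9,
    )
    ranker.fit(np.asarray(X, dtype=float), np.asarray(v, dtype=float))
    return ranker

def confidence_scores(ranker, candidate_features):
    """predict() accumulates the outputs of all trees: the confidence score r(x, y+)."""
    return ranker.predict(np.asarray(candidate_features, dtype=float))
```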
And fifthly, predicting the final term standardization result. The trained precise ranking model is used to calculate a confidence score r(x, yj^+) for each standard term positive sample in the prediction candidate set {(x, y1^+), ..., (x, ym^+)}; the standard term with the largest confidence score, ŷ = argmax_j r(x, yj^+), is then taken, and ŷ can be regarded as the synonymous standard term corresponding to x.
And sixthly, screening samples whose prediction results have higher confidence scores for semi-supervised self-training. Specifically: a strict threshold θ is set for the confidence score predicted by the precise ranking model. Let the standard term predicted by the precise ranking model for the original clinical concept x be ŷ, with output confidence score r(x, ŷ); if r(x, ŷ) > θ, then (x, ŷ) is added to the original training candidate set, and the parameters of the term standardization model and the precise ranking model are updated.
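A minimal sketch of one self-training round is given below. The threshold value and the shape of `predict_fn` are assumptions; the patent only requires the threshold to be strict.

```python
def self_training_step(unlabeled_concepts, predict_fn, threshold, train_candidates):
    """Keep only predictions whose precise-ranking confidence exceeds a strict
    threshold and add them as new pseudo-labeled training pairs.
    `predict_fn(x)` is assumed to return (standard_term, confidence)."""
    newly_labeled = []
    for x in unlabeled_concepts:
        y_hat, conf = predict_fn(x)
        if conf > threshold:
            newly_labeled.append((x, y_hat))
    train_candidates.extend(newly_labeled)
    # the caller then retrains the term standardization and precise ranking models
    return newly_labeled
```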
Seventhly, searching the direct superior terms, specifically: acquiring a group of standard terms with the highest confidence scores predicted by an accurate sequencing model for the original clinical concept, and generating a path traced back to the upper level in the hierarchical structure of the standard term table; and determining the direct superior terms corresponding to the original clinical concept based on the principle of majority voting. More specifically, the implementation process is as follows:
For an original clinical concept x whose output confidence score r(x, ŷ) is low, this indicates that the standard term table may not contain a standard term with the same meaning as x, and an upward search in the standard term table is needed to locate the direct superior term of x. The precise ranking model is used to calculate and sort the confidence scores of x with the standard terms in the prediction candidate set, and the k standard terms with the highest confidence scores, {ŷ1, ..., ŷk}, are selected; starting from the code of each ŷj in the standard term table, the path tracing back to the upper levels is marked. For example, for the original disease concept "right synovitis", the k = 5 standard terms with the highest confidence scores are shown in Table 2. Starting from the codes of the terms in the table, the path traced back to the upper levels and every intermediate node it passes through are marked in the standard term table, and for each standard term node on a backtracking path (denoted node_j) the number of times the node is passed is recorded (denoted count(node_j)), as shown in FIG. 6, where the number in each node of the graph is count(node_j). Then, searching from the lower levels toward the upper levels along the backtracking paths, the first standard term node_j encountered that satisfies the majority-vote condition count(node_j) > k/2 can be regarded as the direct superior standard term of x. For example, the first node encountered in FIG. 6 during the search from the lower levels upward that satisfies this condition is "synovitis and tenosynovitis (M65.9)", indicating that the original clinical concept "right synovitis" should be merged into the standard term table as a direct subordinate term of that term.
Table 2 The standard terms with the highest confidence scores for the original concept "right synovitis"

Standard term name | Standard term code
Synovitis | M65.909
Infectious synovitis | M65.101
Synovitis of shoulder joint | M65.901
Synovitis and tenosynovitis | M65.9
Other synovitis and tenosynovitis | M65.8
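Building on the backtracking example above, a short Python sketch of the upward retrieval is given below. The `parent_of` mapping, the inclusion of each top-k node itself on its own path, and the explicit count(node) > k/2 test are assumptions used to illustrate the majority-vote idea; with the five M65.* codes of Table 2 and a suitable hierarchy map, the deepest node passed by more than half of the paths would be M65.9, matching the example.

```python
from collections import Counter

def parent_chain(code, parent_of):
    """All ancestors of a standard-term code, nearest first.
    `parent_of` maps a code to its direct superior code (None at the root)."""
    chain = []
    node = parent_of.get(code)
    while node is not None:
        chain.append(node)
        node = parent_of.get(node)
    return chain

def direct_superior(top_k_codes, parent_of):
    """Mark every node on each backtracking path (including the code itself),
    then search from the lower levels upward for the first node passed by more
    than half of the k top-ranked terms (majority voting)."""
    counts = Counter()
    for code in top_k_codes:
        for node in [code] + parent_chain(code, parent_of):
            counts[node] += 1
    k = len(top_k_codes)
    # deeper nodes have longer ancestor chains, so visit them before their ancestors
    for node in sorted(counts, key=lambda n: len(parent_chain(n, parent_of)), reverse=True):
        if counts[node] > k / 2:
            return node
    return None
```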
In summary, the invention designs a self-supervised learning method for medical term standardization and realizes a high-accuracy medical term standardization model with little labeled data; an active learning function built on the term standardization process allows the model to be upgraded rapidly and automatically; a candidate sample generation function designed around the characteristics of the standard term table ensures that the candidate samples carry enough information; a precise ranking function for the predictions of the medical term standardization model, integrating semantic and text features, further reduces manual intervention; and a direct superior term retrieval function for original clinical concepts, built on the precise ranking results, ensures the integrity and uniformity of the medical term standardization results.
The foregoing is only a preferred embodiment of the present invention; although the invention has been disclosed in terms of preferred embodiments, they are not intended to limit it. Using the methods and technical content disclosed above, those skilled in the art can make numerous possible variations and modifications to the technical solution of the invention, or rework it into equivalent embodiments with equivalent changes, without departing from the scope of the technical solution of the invention. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the invention, still falls within the protection scope of the technical solution of the invention.

Claims (10)

1. An automatic medical term standardization system combining self-supervision and active learning, comprising:
(1) a candidate set generation module: sampling negative samples based on a hierarchical structure of a text correlation model and a standard glossary to generate a training candidate set, and sampling possible positive samples based on the text correlation model to generate a prediction candidate set;
(2) the self-supervision learning module: for training a term normalized model, comprising:
training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms;
respectively calculating the semantic similarity of the labeled original clinical concept and the label thereof and the negative sample of the training candidate set through a semantic matching model;
adopting a self-supervision learning mode, and calculating a loss function of a normalized model according to the semantic similarity;
(3) an active learning module: calculating a semantic similarity score by using the unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out a group of samples with most uncertain current term standardized models according to the active learning standard, and fusing the samples into a training candidate set after determining labels of the samples;
(4) the accurate sequencing module: acquiring semantic similarity scores of the original clinical concept and standard terms output by the self-supervision learning module as semantic features, calculating text features, training a regression decision tree-based accurate sequencing model based on the semantic and text features, and calculating confidence scores of medical term standardization results; and calculating a confidence score for the standard term positive sample in the prediction candidate set by using the trained accurate sequencing model to obtain the standard term with the maximum confidence score.
2. The system of claim 1, further comprising a semi-supervised learning module, which fuses the sample with the confidence score of the result of the medical term standardization output by the precise ranking module satisfying the condition to the training candidate set.
3. The system of claim 1, further comprising a direct superior term retrieval module, the direct superior term retrieval module comprising: acquiring a group of standard terms with the highest confidence scores predicted by an accurate sequencing model for the original clinical concept, and generating a path traced back to the upper level in the hierarchical structure of the standard term table; and determining the direct superior terms corresponding to the original clinical concept based on the principle of majority voting.
4. A method for automatically normalizing medical terms fusing self-supervision and active learning, the method comprising the steps of:
generating a negative sample and a positive sample, and respectively constructing a training candidate set and a prediction candidate set: sampling negative samples based on a hierarchical structure of a text correlation model and a standard glossary to generate a training candidate set, and sampling possible positive samples based on the text correlation model to generate a prediction candidate set;
step (2) training a term standardization model through self-supervision learning: training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms; respectively calculating the semantic similarity of the labeled original clinical concept and the label thereof and the negative sample of the training candidate set through a semantic matching model; adopting a self-supervision learning mode, and calculating a loss function of a normalized model according to the semantic similarity;
and (3) rapidly upgrading the term standardized model through active learning: calculating a semantic similarity score by using the unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out a group of samples with most uncertain current term standardized models according to the active learning standard, and fusing the samples into a training candidate set after determining labels of the samples;
training an accurate sequencing model, and comprehensively evaluating the prediction result of the term standardized model from text and semantic dimensions: acquiring semantic similarity scores of the original clinical concept and the standard terms output in the step (2) by self-supervision learning as semantic features, and calculating text features; training a regression decision tree-based accurate sequencing model based on semantic and text features for calculating a confidence score of a medical term standardization result;
step (5) predicting the final term normalization result: and calculating confidence scores for the standard term positive samples in the prediction candidate set by using the trained accurate ranking model, and taking the standard term with the maximum confidence score as a term standardization result.
5. The method for automatically standardizing medical terms fused with self-supervision and active learning according to claim 4, wherein the step (1) comprises:
(1.1) training candidate set: the training candidate set is constructed from a labeled original clinical concept x and its corresponding standard term Y; if Y has a direct superior term Y1, all terms one level below Y1 are taken and denoted as a set M; if Y has no first-level superior term but has a second-level superior term Y2, all terms one and two levels below Y2 are taken and denoted as the set M; otherwise, the whole standard term table is denoted as the set M; the text relevance score between x and every standard term m in M is computed, the terms in M are sorted by this score, and a negative sample set
{y1⁻, y2⁻, …, yn⁻}
is selected; the standard term Y together with these negative samples forms the training candidate set of x;
(1.2) prediction candidate set: when the term standardization model makes predictions, for an unlabeled original clinical concept x the whole standard term table is denoted as the set M, and a positive sample set
{y1⁺, y2⁺, …, ym⁺}
is selected from M by the text relevance score; these positive samples form the prediction candidate set of x.
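A minimal sketch of the candidate set construction in (1.1) and (1.2), in which a simple character-overlap ratio stands in for the text relevance model and a dictionary-based hierarchy stands in for the standard term table; all identifiers and the value of k are illustrative assumptions:

    def text_relevance(x, m):
        # Stand-in text relevance score: character overlap ratio between the two strings.
        return len(set(x) & set(m)) / max(len(set(x) | set(m)), 1)

    def training_candidate_set(x, Y, parent_of, children_of, all_terms, k=5):
        # Restrict negative sampling to the siblings of the gold standard term Y when it
        # has a direct superior term; otherwise fall back to the whole standard term table.
        pool = children_of.get(parent_of[Y], all_terms) if Y in parent_of else all_terms
        negatives = sorted((m for m in pool if m != Y),
                           key=lambda m: text_relevance(x, m), reverse=True)[:k]
        return [Y] + negatives          # gold term plus the hardest negative samples

    def prediction_candidate_set(x, all_terms, k=10):
        # At prediction time the whole standard term table is the pool of possible positives.
        return sorted(all_terms, key=lambda m: text_relevance(x, m), reverse=True)[:k]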
6. The method according to claim 4, wherein in the step (2), the Chinese medical language model is a bidirectional autoregressive language model, specifically: the original clinical concept x and any standard term y* are concatenated character by character, a separator token [SEP] is added at the junction, and a start token [S] is added at the leftmost position; the concatenated result is input into the bidirectional autoregressive language model as one sentence, and the output of the last layer of the model at the position of the start token [S] is the semantic vector of x and y*.
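A sketch of this concatenation step using the Hugging Face transformers API, with a general-purpose Chinese BERT checkpoint standing in for the Chinese medical language model and the [CLS] position standing in for the start character [S]; both substitutions are assumptions made for illustration:

    import torch
    from transformers import AutoTokenizer, AutoModel

    # Stand-in checkpoint; the patent's own adaptively trained medical model would replace it.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    encoder = AutoModel.from_pretrained("bert-base-chinese")

    def semantic_vector(concept, standard_term):
        # Concatenate the clinical concept and a candidate standard term with a separator
        # token, prepend a start token, and read the last-layer hidden state at that
        # start position as the joint semantic vector of the pair.
        inputs = tokenizer(concept, standard_term, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state    # (1, seq_len, hidden_size)
        return hidden[0, 0]                                  # vector at the leading token

    vec = semantic_vector("心梗", "急性心肌梗死")
    print(vec.shape)    # torch.Size([768]) for a base-size encoder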
7. The method for automatic medical term standardization integrating self-supervision and active learning according to claim 5, wherein in the step (3), an unlabeled original clinical concept x is passed through the step (1) to obtain its prediction candidate set {y1⁺, y2⁺, …, ym⁺}; semantic similarity scores si = sim(x, yi⁺), i = 1, …, m, are computed with the semantic matching model and normalized into a probability distribution:
pi = exp(si) / Σj exp(sj)
the uncertainty C(x) of the term standardization model about x is then computed as:
C(x) = w1·ent(x) + w2·margin(x) + w3·lc(x) + w4·freq(x)
where ent(x) is the information entropy of the term standardization model over x:
ent(x) = -Σi pi·log(pi)
margin(x) is the margin term:
margin(x) = -(p1 - p2)
where p1 and p2 are the largest and second-largest of all the pi, respectively;
lc(x) is the least-confidence term:
lc(x) = -p1
freq(x) is the occurrence frequency of x in the original clinical text data; and
wi, i = 1, 2, 3, 4, are the weights of the respective features.
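A minimal sketch of the uncertainty score C(x), assuming the semantic similarity scores for one unlabeled concept are already available and using illustrative weights; the softmax normalization mirrors the probability distribution above:

    import math

    def uncertainty(scores, freq, w=(0.4, 0.3, 0.2, 0.1)):
        # scores: semantic similarity scores of x against its prediction candidate set
        # freq:   occurrence frequency of x in the original clinical text data
        exps = [math.exp(s) for s in scores]
        p = sorted((e / sum(exps) for e in exps), reverse=True)   # probabilities, descending
        ent = -sum(pi * math.log(pi) for pi in p if pi > 0)       # information entropy
        margin = -(p[0] - p[1])                                   # negative top-two margin
        lc = -p[0]                                                # negative top confidence
        w1, w2, w3, w4 = w
        return w1 * ent + w2 * margin + w3 * lc + w4 * freq

    # Candidates with near-identical scores make the model uncertain -> higher C(x):
    print(uncertainty([2.1, 2.0, 1.9], freq=0.7))   # close scores, larger value
    print(uncertainty([5.0, 0.5, 0.1], freq=0.7))   # one clear winner, smaller value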
8. The method for automatic medical term standardization integrating self-supervision and active learning according to claim 5, wherein in the step (4), the gradient boosting model XGBoost is adopted as the precise ranking model, specifically: a plurality of regression decision trees are trained, the learning target of each tree being the error left by the preceding trees, and the final confidence score being the sum of the outputs of all the trees; if the gradient boosting model is built on u samples, the loss function L(t) of the t-th decision tree is:
L(t) = Σi=1..u l(vi, Fi(t-1) + f(Treet, xi)) + Ω(Treet)
where l(·) is the squared loss function, vi is the label of sample xi, f(Treet, xi) is the predicted value of the t-th decision tree for xi, Fi(t-1) is the accumulated predicted value of the first t-1 decision trees for xi, and
Ω(Treet) = γ·|Treet| + (λ/2)·Σk wk²
is a regularization term representing the complexity of the decision tree, where |Treet| is the number of leaf nodes of the t-th decision tree, wk is the predicted value of the k-th leaf node, and γ and λ are weight parameters;
in training the precise ranking model, the input training data is the data set formed by the original clinical concepts and the standard term positive samples in the prediction candidate sets produced by the term standardization model:
{(x, yi⁺, vi), i = 1, …, m}
if the trained precise ranking model contains T decision trees, then for a sample (x, yi⁺) the confidence score score(x, yi⁺) of the medical term standardization result is:
score(x, yi⁺) = Σt=1..T f(Treet, (x, yi⁺))
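A sketch of the precise ranking step with the xgboost Python package; the toy features (one semantic score plus two simple text features), the sample data, and the hyperparameters are all illustrative assumptions rather than the configuration of the patent:

    import numpy as np
    import xgboost as xgb

    def features(x, y, sem_score):
        # One semantic feature from the term standardization model plus two text features.
        overlap = len(set(x) & set(y)) / max(len(set(x) | set(y)), 1)
        return [sem_score, overlap, abs(len(x) - len(y))]

    # Toy training data: (concept, candidate standard term, semantic score, label).
    train = [("心梗", "急性心肌梗死", 0.90, 1), ("心梗", "心绞痛", 0.40, 0),
             ("胃穿孔", "胃穿孔", 0.95, 1), ("胃穿孔", "胃溃疡", 0.50, 0)]
    X = np.array([features(x, y, s) for x, y, s, _ in train])
    v = np.array([label for *_, label in train], dtype=float)

    # Squared-error objective matches the regression-tree loss above; reg_lambda and
    # gamma play the roles of the λ and γ regularization weights.
    ranker = xgb.XGBRegressor(n_estimators=50, max_depth=3, learning_rate=0.1,
                              objective="reg:squarederror", reg_lambda=1.0, gamma=0.0)
    ranker.fit(X, v)

    # Step (5): score every positive candidate and keep the best one as the final result.
    candidates = [("心梗", "急性心肌梗死", 0.90), ("心梗", "心绞痛", 0.40)]
    scores = ranker.predict(np.array([features(x, y, s) for x, y, s in candidates]))
    print(candidates[int(np.argmax(scores))][1])    # expected: 急性心肌梗死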
9. The method for automatic medical term standardization integrating self-supervision and active learning according to any one of claims 4-8, wherein samples whose medical term standardization results output by the precise ranking model have confidence scores satisfying a preset condition are merged into the training candidate set, and the parameters of the term standardization model and the precise ranking model are updated.
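A brief sketch of the self-training loop described in claim 9; predict_with_confidence and retrain are placeholder callables for the pipeline assembled in the earlier steps, and the threshold value is an assumption:

    def self_training_round(unlabeled, train_set, predict_with_confidence, retrain,
                            threshold=0.9):
        # predict_with_confidence(x) -> (best standard term, confidence score)
        # retrain(train_set)         -> refit the term standardization and ranking models
        accepted = []
        for x in unlabeled:
            term, score = predict_with_confidence(x)
            if score >= threshold:                  # confidence score satisfies the condition
                train_set.append((x, term))         # merge pseudo-label into training candidates
                accepted.append(x)
        retrain(train_set)
        return train_set, [x for x in unlabeled if x not in accepted]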
10. The method for automatic medical term standardization integrating self-supervision and active learning according to any one of claims 4-8, further comprising direct superior term retrieval: acquiring the group of standard terms predicted with the highest confidence scores by the precise ranking model for an original clinical concept; for each of these standard terms, generating the path traced back to its upper levels in the hierarchical structure of the standard term table; and determining the direct superior term corresponding to the original clinical concept by majority voting over these paths.
CN202110994475.7A 2021-08-27 2021-08-27 Automatic medical term standardization system and method integrating self-supervision and active learning Active CN113436698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110994475.7A CN113436698B (en) 2021-08-27 2021-08-27 Automatic medical term standardization system and method integrating self-supervision and active learning

Publications (2)

Publication Number Publication Date
CN113436698A CN113436698A (en) 2021-09-24
CN113436698B true CN113436698B (en) 2021-12-07

Family

ID=77798234





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant