CN113436698A - Automatic medical term standardization system and method integrating self-supervision and active learning - Google Patents

Automatic medical term standardization system and method integrating self-supervision and active learning

Info

Publication number
CN113436698A
CN113436698A
Authority
CN
China
Prior art keywords
term
model
standard
training
candidate set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110994475.7A
Other languages
Chinese (zh)
Other versions
CN113436698B (en)
Inventor
李劲松
杨宗峰
辛然
李玉格
史黎鑫
田雨
周天舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202110994475.7A
Publication of CN113436698A
Application granted
Publication of CN113436698B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Epidemiology (AREA)
  • Machine Translation (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses an automatic medical term standardization system and method integrating self-supervision and active learning. The system comprises basic modules including a candidate set generation module, a self-supervised learning module for training the term standardization model, an active learning module, and a precise ranking module that comprehensively evaluates the predictions of the term standardization model from text and semantic dimensions; it further comprises preferred modules including a semi-supervised learning module and a direct superior term retrieval module. The invention can build an automatic medical term standardization model with little labeled data, and the model remains capable of fast updating and upgrading, greatly reducing the workload of manual intervention while ensuring the accuracy of the output results. New clinical concepts can be matched to their direct superior terms and given the correct position in the standard glossary, ensuring the completeness and consistency of the standardization results.

Description

Automatic medical term standardization system and method integrating self-supervision and active learning
Technical Field
The invention belongs to the technical field of Chinese medical term standardization and multi-center medical information platforms, and particularly relates to a medical term automatic standardization system and method integrating self-supervision and active learning.
Background
With the popularization of electronic medical record systems, a large amount of important medical information is stored electronically in various medical information systems. These data create great value for clinical decision support, drug research and development, public health monitoring and evaluation, early warning of infectious disease epidemics, personalized precision medicine, and more. Medical data standardization is a key step in promoting the integration of domestic medical systems and realizing collaborative research and large-scale analysis of medical data, and standardizing medical terms is the first difficult problem to solve in that process. Internationally, different types of medical terms have corresponding standard terminology systems, including the disease terminology ICD-10, the surgical procedure codes ICD-9-CM-3, and the laboratory terminology LOINC. However, hospitals and other medical institutions do not make good use of these internationally accepted standard term sets in actual operation, mainly because: (1) different hospitals often adopt different medical information systems whose data standards differ, so the generated medical terms vary considerably in data dimension and data format; (2) different operators understand standard terminology and its granularity inconsistently. Medical information systems usually require the operator to select the disease name, operation name and other information according to the patient's condition, and where the meanings of superior and subordinate terms overlap (for example, the two ICD-10 codes "D00.2" and "D00.200" for "gastric carcinoma in situ"), different operators, or even the same operator at different times, may understand them differently; (3) operators personalize the terms they enter. Most information systems provide manual input for the convenience of entering new concepts, so operators may create irregular terms based on past experience and personal habits. These factors mean that original clinical concepts cannot be directly related to the accepted standard terms, and data unification and information exchange between different institutions are not easy.
The ultimate goal of medical term standardization is to establish a mapping between original clinical concepts and standard terms. Past term standardization schemes generally follow one of two approaches. (1) Manual mapping: professional clinicians are invited to map and proofread the terms one by one. However, each medical information system contains terms on the order of tens of thousands, so the proofreading time required of clinicians is very long, which makes rapid domestic popularization difficult and further hinders the rapid implementation of domestic medical data standardization. In addition, because doctors differ in work experience, the mapping to standard surgical terms lacks a uniform criterion, so consistency between different doctors is hard to guarantee; mapping results also contain manual errors, so consistency of the same doctor at different times is likewise hard to guarantee. (2) Training a medical concept semantic matching model with a machine learning algorithm: manual data annotation is difficult and time-consuming, so training data are insufficient, the resulting model generalizes poorly, and more manpower must be spent verifying the output to ensure the accuracy of the term standardization results actually used. On the other hand, many standard term sets have superior-subordinate relationships; for example, the subordinate terms of "corneal surgery (11)" in the surgical procedure codes ICD-9-CM-3 include "suture of corneal laceration (11.51)", "corneal transplant NOS (11.6)", and so on. When a concept generated in actual clinical practice has no same-meaning peer term in the standard term set, its direct superior standard term needs to be located accurately; existing methods cannot solve this well, so newly added clinical concepts cannot be fused into the accepted standard terminology system. The invention aims to establish a medical term standardization system with good accuracy and generalization capability without a large amount of labeled data, to realize fast automatic iterative updating of the system with as little manual intervention as possible, and at the same time to locate the correct peer or superior standard term for each original clinical concept.
Disclosure of Invention
Medical data standardization is a key step in promoting the integration of domestic medical systems and realizing collaborative research and large-scale analysis of medical data. However, existing clinical term standardization methods and systems generally require considerable manual review and labeling work, and their accuracy and generalization capability are difficult to guarantee, so clinical data standardization is difficult to popularize quickly at home.
The invention aims to provide an automatic medical term standardization system and method, based on a deep learning model and integrating self-supervision and active learning, to address the difficulties of current medical term standardization work.
The purpose of the invention is achieved by the following technical scheme. A medical term standardization model is constructed on the basis of a deep learning language model and trained with a self-supervised learning method; negative samples are drawn based on a text relevance model and the hierarchical structure of the standard glossary, obtaining negative samples that carry more information and are harder for the model to discriminate, which acts as data augmentation, so that the model can make full use of the semantic relations it contains even when only a small number of labeled samples are available. An active learning function is implemented based on principles such as maximum entropy, low confidence and high frequency; from the model's predictions on a large number of unknown samples, a group of samples that can improve model performance the most is screened out, so that the model can be upgraded quickly and markedly with minimal manual intervention. A precise ranking model is designed that integrates textual, semantic and other information to output the correct standard term. Precisely ranked samples automatically update the training data in a semi-supervised self-training manner, further improving the accuracy and generalization capability of the model while continuously reducing the workload of manual intervention. An upward retrieval method is constructed that locates newly added original clinical concepts to their direct superior terms, guaranteeing the completeness and consistency of the medical term standardization results; newly added clinical concepts can thus find their correct position in the standard glossary, which facilitates comprehensive standardization of clinical data.
The invention discloses an automatic medical term standardization system integrating self-supervision and active learning, which comprises the following components:
(1) a candidate set generation module: sampling negative samples based on a text relevance model and the hierarchical structure of the standard glossary to generate a training candidate set, and sampling possible positive samples based on the text relevance model to generate a prediction candidate set;
(2) a self-supervised learning module for training the term standardization model, comprising:
training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms;
computing, through a semantic matching model, the semantic similarity between a labeled original clinical concept and its label, and between the concept and each negative sample in the training candidate set;
computing the loss function of the term standardization model from these semantic similarities in a self-supervised learning manner;
(3) an active learning module: computing semantic similarity scores between unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out, according to the active learning criteria, a group of samples about which the current term standardization model is most uncertain, determining their labels, and merging them into the training candidate set;
(4) a precise ranking module: taking the semantic similarity scores between the original clinical concept and the standard terms output by the self-supervised learning module as semantic features, computing text features, and training a regression-decision-tree-based precise ranking model on the semantic and text features to compute confidence scores for the medical term standardization results; the trained precise ranking model computes a confidence score for each standard-term positive sample in the prediction candidate set, and the standard term with the largest confidence score is obtained.
Further, the automatic medical term standardization system further comprises a semi-supervised learning module that merges into the training candidate set those samples whose confidence scores for the medical term standardization results, output by the precise ranking module, satisfy the condition.
Further, the automatic medical term standardization system further comprises a direct superior term retrieval module, which: obtains the group of standard terms with the highest confidence scores predicted by the precise ranking model for the original clinical concept, and generates the paths tracing back to upper levels in the hierarchical structure of the standard glossary; and determines the direct superior term corresponding to the original clinical concept based on majority voting.
In another aspect, the invention discloses an automatic medical term standardization method integrating self-supervision and active learning, which comprises the following steps:
(1) generating negative samples and positive samples and constructing a training candidate set and a prediction candidate set respectively: sampling negative samples based on a text relevance model and the hierarchical structure of the standard glossary to generate the training candidate set, and sampling possible positive samples based on the text relevance model to generate the prediction candidate set;
(2) training the term standardization model by self-supervised learning: training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms; computing, through a semantic matching model, the semantic similarity between a labeled original clinical concept and its label, and between the concept and each negative sample in the training candidate set; computing the loss function of the term standardization model from these semantic similarities in a self-supervised learning manner;
(3) rapidly upgrading the term standardization model through active learning: computing semantic similarity scores between unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out, according to the active learning criteria, a group of samples about which the current term standardization model is most uncertain, determining their labels, and merging them into the training candidate set;
(4) training a precise ranking model and comprehensively evaluating the predictions of the term standardization model from text and semantic dimensions: taking the semantic similarity scores between the original clinical concept and the standard terms output by the self-supervised learning of step (2) as semantic features, and computing text features; training a regression-decision-tree-based precise ranking model on the semantic and text features to compute confidence scores for the medical term standardization results;
(5) predicting the final term standardization result: computing a confidence score for each standard-term positive sample in the prediction candidate set with the trained precise ranking model, and taking the standard term with the largest confidence score as the term standardization result.
Further, the step (1) includes:
(1.1) Training candidate set: the training candidate set is composed of original clinical concepts x and their corresponding standard terms y. If y has a direct superior term one level up (a parent term), all terms one level below that parent are taken as the set M; if y has no parent term but has a direct superior term two levels up, all terms one and two levels below that term are taken as the set M; otherwise, the whole standard glossary is taken as the set M. The text relevance score between x and each standard term m in M is computed, the scores are sorted to select a negative sample set, and the negative sample set together with y forms the training candidate set.
(1.2) Prediction candidate set: when the term standardization model makes predictions, for an unlabeled original clinical concept x the whole standard glossary is taken as the set M, and text relevance scores are used to select a positive sample set from M, which forms the prediction candidate set.
Further, in step (2), the Chinese medical language model is a bidirectional autoregressive language model, and specifically: the original clinical concept x and any standard term are concatenated character by character, a separator token [SEP] is inserted at the junction, and a start token [S] is added at the leftmost position; the concatenated result is input into the bidirectional autoregressive language model as a single sentence, and the output of the last layer at the position of the start token [S] is the semantic vector of x and that standard term.
Further, in step (3), an unlabeled original clinical concept x obtains its prediction candidate set through step (1). The semantic matching model computes a semantic similarity score between x and each standard term in the prediction candidate set, and the scores are normalized into a probability distribution p. The uncertainty of the term standardization model about x is then computed as a weighted combination of the following features, each multiplied by its own weight:
the information entropy of the term standardization model's prediction for x, H(x) = -Σ_i p_i log p_i;
the margin probability, i.e. the difference between the largest and the second-largest probabilities in p;
the confidence, i.e. the largest probability in p;
the frequency with which the original clinical text x occurs.
Further, in step (4), the gradient boosting model XGBoost is adopted as the precise ranking model, and specifically: a number of regression decision trees are trained, the learning target of each tree is the error of the preceding trees, and the sum of the outputs of all trees is the final confidence score. Suppose a gradient boosting model is built over u samples; the loss function L_t of the t-th decision tree is

L_t = Σ_{i=1}^{u} l(g_i, s_i^(t-1) + f_t(i)) + Ω(f_t),

where l is the squared loss function, g_i is the label of the i-th sample, f_t(i) is the prediction of the t-th decision tree for the i-th sample, s_i^(t-1) is the prediction of the first t-1 decision trees for the i-th sample, and Ω(f_t) = γ·J_t + (λ/2)·Σ_{k=1}^{J_t} w_k^2 is a regularization term representing the complexity of the decision tree, in which J_t is the number of leaf nodes of the t-th decision tree, w_k is the prediction value of the k-th leaf node, and γ and λ are weight parameters;
in the process of training the precise ranking model, the input training data is a data set consisting of the original clinical concepts and the standard-term positive samples in the prediction candidate sets of the term standardization model; suppose the trained precise ranking model contains T decision trees, then the confidence score of the medical term standardization result for a sample is computed as the sum Σ_{t=1}^{T} f_t of the predictions of all T decision trees for that sample.
and further, fusing the sample with the confidence score meeting the condition of the medical term standardization result output by the accurate sequencing module into a training candidate set, and updating the term standardization model and the accurate sequencing model parameters.
Further, the method also comprises a direct superior term retrieval function: obtaining the group of standard terms with the highest confidence scores predicted by the precise ranking model for the original clinical concept, and generating the paths tracing back to upper levels in the hierarchical structure of the standard glossary; and determining the direct superior term corresponding to the original clinical concept based on majority voting.
The invention has the following beneficial effects: an automatic medical term standardization model can be realized with little labeled data, and the model remains capable of fast updating and upgrading, greatly reducing the workload of manual intervention while ensuring the accuracy of the output results; new clinical concepts can be matched to their direct superior terms and given the correct position in the standard glossary, ensuring the completeness and consistency of the standardization results.
Drawings
FIG. 1 is a block diagram of an automatic standardization system for medical terms fusing self-supervision and active learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an implementation of a candidate set generation module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an implementation of the self-supervised learning module and the active learning module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an implementation of a direct superior term retrieval module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a training process of a decision tree-based precision ranking model according to an embodiment of the present invention;
fig. 6 is a schematic diagram of direct superior term retrieval according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention; however, as will be readily apparent to those of ordinary skill in the art, the present invention may be practiced in ways other than those specifically described without departing from its spirit, and the present invention is therefore not limited to the specific embodiments disclosed below.
In the present invention, self-supervised learning means: mining supervision information from large-scale unlabeled data by means of auxiliary tasks, and training the network with this constructed supervision so as to learn representations valuable for downstream tasks. There are three main forms of self-supervised learning: context-based learning, time-series-based learning, and contrastive learning, where contrastive learning builds representations by learning to encode the similarity or dissimilarity of two things.
Active learning means: its main goal is to reduce the cost of human data annotation. Samples that the model finds hard to classify or classifies ambiguously are obtained through machine learning; such data are generally considered to lie near the boundaries between classes and can therefore give the model greater help in accurately learning the features of the different classes. By having humans re-confirm and audit these samples, the model can be improved more markedly for the same amount of labeled data.
Semi-supervised learning means: the learner automatically exploits unlabeled samples, without external interaction, to improve learning performance. Self-training is a particular implementation of semi-supervised learning: assuming that similar samples have similar outputs, an initial model is first trained with labeled samples, the model then predicts and classifies the unlabeled samples, samples whose predictions have a high confidence are screened out according to some criterion, and the predicted soft or hard labels are used as new labeled data to expand the training set.
Medical term standardization means: the process of applying standardization principles and methods to unify medical terms within a certain scope by establishing medical term standards, so as to obtain the best order and social benefit. Establishing unified medical term standards and term sets helps resolve problems such as duplicated terms and inconsistent connotation, semantic expression and understanding, and is of great significance for effectively promoting the dissemination, sharing and use of medical information at a wider range and a deeper level.
The embodiment of the invention provides an automatic medical term standardization system integrating self-supervision and active learning, which comprises the following modules as shown in figure 1:
a candidate set generation module: sampling negative samples based on a text relevance model and the hierarchical structure of the standard glossary to generate a training candidate set, and sampling possible positive samples based on the text relevance model to generate a prediction candidate set;
a self-supervised learning module for training the term standardized model;
an active learning module, implemented based on principles such as maximum entropy and minimum confidence;
a precise ranking module for comprehensively evaluating the predictions of the term standardization model from text and semantic dimensions.
Preferably, the system further comprises a semi-supervised learning module that merges into the training candidate set those samples whose confidence scores for the medical term standardization results, output by the precise ranking module, satisfy the condition.
Preferably, the system further comprises a direct superior term retrieval module.
Specifically, the candidate set generation module consists of two parts: during training of the term standardization model, sampling based on the text relevance BM25 model and the hierarchical structure of the standard glossary obtains standard terms that are as close as possible to, but not identical with, the original clinical concept as negative-sample standard terms; during prediction of the term standardization model, possible positive-sample standard terms are generated based on the text relevance BM25 model. The detailed flow is shown in FIG. 2.
Specifically, the self-supervision learning module mainly comprises the following three steps:
1. training a Chinese medical language model, preferably a bidirectional autoregressive language model (BERT), by a self-adaptive method, and further acquiring semantic vectors of original clinical concepts and standard terms;
2. computing, through a semantic matching model, the semantic similarity between the labeled original clinical concept and its label, and between the concept and each negative sample in the training candidate set;
3. computing the loss function of the term standardization model from the semantic similarities using a self-supervised learning approach (preferably self-supervised contrastive learning), as shown in the left part of FIG. 3.
Specifically, the active learning module mainly includes the following two steps:
1. calculating a semantic similarity score by using the unlabeled original clinical concepts and the standard terms of the prediction candidate set;
2. a group of samples about which the current term standardization model is most uncertain is screened out according to the active learning criteria, their labels are determined, and they are merged into the training candidate set, as shown in the right part of FIG. 3.
Specifically, the precise sorting module mainly comprises the following two steps:
1. first, the semantic similarity scores between the original clinical concept and the standard terms output by the self-supervised learning module are obtained as semantic features, and text features are computed, including the literal similarity between the original clinical concept and the standard term, the word co-occurrence frequency, the difference in the number of contained words, and so on;
2. a regression-decision-tree-based precise ranking model is then trained on these features to compute the confidence score of the medical term standardization result.
Specifically, the main function of the semi-supervised learning module is to screen out, based on the confidence scores output by the precise ranking module, a group of samples about which the current term standardization model is most certain, and to expand the training candidate set with them.
Specifically, the direct superior term retrieval module mainly includes the following two steps:
1. firstly, acquiring a group of standard terms with the highest confidence scores predicted by a precise ordering model for an original clinical concept, and generating a path traced back to the upper level in the hierarchical structure of a standard term table;
2. and then determining the direct superior terms corresponding to the original clinical concept based on the principle of majority voting, as shown in fig. 4.
The embodiment of the invention provides a medical term automatic standardization method integrating self-supervision and active learning, which comprises the following specific implementation steps:
Firstly, generating negative samples and positive samples, and constructing the training candidate set and the prediction candidate set respectively, specifically: sampling negative samples based on a text relevance model and the hierarchical structure of the standard glossary to generate the training candidate set, and sampling possible positive samples based on the text relevance model to generate the prediction candidate set; more specifically, with reference to FIG. 2, this step comprises the following sub-steps:
1) The training candidate set of the term standardization model consists of a large number of original clinical concepts x and their corresponding standard terms y. When training the term standardization model, a set of negative samples is first drawn from the standard terms whose meaning differs from x. For the term standardization model to learn as much as possible from the negative samples, the sampling procedure must obtain standard terms whose meaning is as close as possible to, but not exactly the same as, that of the original clinical concept. Some standard terminologies are organized hierarchically; for example, the disease terminology ICD-10 encodes "oral, esophageal and gastric carcinoma in situ" as "D00", whose next-level terms include "carcinoma in situ of the lip, oral cavity and pharynx (D00.0)" and "esophageal carcinoma in situ (D00.2)", and whose next-next-level terms include "carcinoma in situ of the tonsil (D00.001)" and "carcinoma in situ of the lip (D00.002)". The sampling is performed in the following order:
① if y has a parent (direct superior) term one level up, all terms one level below that parent are taken as the set M;
② if y has no parent term one level up but has a direct superior term two levels up, all terms one and two levels below that term are taken as the set M;
③ otherwise, the whole standard glossary is taken as the set M.
Then the text relevance score between x and each standard term m in M is computed with the text relevance BM25 model:

score(x, m) = Σ_i IDF(c_i) · tf(c_i, m)·(k1 + 1) / (tf(c_i, m) + k1·(1 - b + b·len/avglen)),

where IDF(c_i) is the IDF value of the i-th character c_i of x and serves as the word weight, tf(c_i, m) is the frequency with which c_i occurs in m, len is the length of m, avglen is the average length of all standard terms in M, and k1 and b are empirically specified parameters set to fixed values in this embodiment. The negative sample set is selected by sorting the text relevance scores, and together with y it forms the training candidate set.
2) When the term standardization model makes predictions, for an unlabeled original clinical concept x a group of the most likely standard terms is first screened out of the whole standard glossary as the prediction candidate set; the term standardization model then only needs to compute semantic similarity scores between the original clinical concept and the standard terms in the prediction candidate set rather than over the whole standard glossary, which avoids a large amount of useless computation by the term standardization model and improves prediction efficiency. In this case the whole standard glossary is taken as the set M, and the text relevance scores are used to select a positive sample set from M, which forms the prediction candidate set.
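As a non-limiting illustration of the candidate set generation described above, the following Python sketch builds training and prediction candidate sets with a character-level BM25 scorer. The function names, the parent_of/children_of glossary structures and the default values of k1, b, n_neg and n_pos are assumptions for illustration only, and the two-levels-up fallback of sub-step ② is omitted for brevity.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Character-level BM25 relevance of `query` against every candidate term in `docs`.
    k1 and b are the empirically specified parameters mentioned above; the values here
    are common defaults, not necessarily those used in the patent."""
    n = len(docs)
    avglen = sum(len(d) for d in docs) / max(n, 1)
    df = Counter(c for d in docs for c in set(d))   # document frequency per character, for the IDF weight
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for c in set(query):
            idf = math.log((n - df[c] + 0.5) / (df[c] + 0.5) + 1.0)
            f = tf[c]
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avglen))
        scores.append(s)
    return scores

def training_candidates(x, y, glossary, parent_of, children_of, n_neg=10):
    """Negative sampling restricted to the siblings of the gold term y (sub-step ①);
    falls back to the whole glossary when y has no parent (sub-step ③)."""
    parent = parent_of.get(y)
    pool = [t for t in (children_of[parent] if parent is not None else glossary) if t != y]
    ranked = [t for _, t in sorted(zip(bm25_scores(x, pool), pool), reverse=True)]
    return [y] + ranked[:n_neg]                      # gold term plus hard negatives

def prediction_candidates(x, glossary, n_pos=20):
    """Positive candidate screening over the whole glossary (sub-step 2)."""
    ranked = [t for _, t in sorted(zip(bm25_scores(x, glossary), glossary), reverse=True)]
    return ranked[:n_pos]
```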
Secondly, training the term standardization model through self-supervised learning, specifically: training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms; computing, through a semantic matching model, the semantic similarity between the labeled original clinical concept and its label, and between the concept and each negative sample in the training candidate set; and computing the loss function of the term standardization model from these semantic similarities in a self-supervised learning manner. More specifically, this step comprises the following sub-steps:
1) The term standardization model consists of a bidirectional autoregressive language model and a semantic matching model. The bidirectional autoregressive language model performs autoregressive training of semantic units on both the forward and the backward context, and learns efficient semantic vector representations while modelling natural language. In the multi-layer bidirectional autoregressive language model, the input Z of the next layer is derived from a self-attention mechanism over the hidden state h of the previous layer:

Z = softmax(Q·K^T / sqrt(d_h)) · V · W,

where Q, K and V are the vectors obtained by matrix transformations of the previous-layer hidden state h, d_h is the dimension of h, Z is the input of the next layer, and W is a matrix obtained by training. The hidden state h' of the next layer is then obtained through a nonlinear transformation of the form

h' = σ(Z·W_1 + b_1)·W_2 + b_2,

where W_1 and W_2 are matrices obtained by training, b_1 and b_2 are vectors obtained by training, and σ is the nonlinear activation.
When term standardization is performed, the semantic vectors of the original clinical concepts and the standard terms can be derived from the bidirectional autoregressive language model. Specifically: the original clinical concept x and any standard term (either a positive or a negative sample) are concatenated character by character, a separator token [SEP] is added at the junction, and a start token [S] is added at the leftmost position. For example, if the original clinical operation concept "fallopian tube resection" corresponds to the positive-sample standard term "bilateral fallopian tube resection (66.51)" in the ICD-9-CM-3 glossary, the concatenation is "[S] fallopian tube resection [SEP] bilateral fallopian tube resection". The concatenated result is input into the bidirectional autoregressive language model as a single sentence, and the output of the last layer at the position of the start token [S] is the semantic vector of x and that standard term.
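The following sketch illustrates, under the assumption of a Hugging Face BERT checkpoint for Chinese, how such a pair can be encoded and the start-position vector read out; here the [CLS] token plays the role of the [S] token of the description, and the checkpoint name and the example strings are illustrative rather than the ones used in the patent.

```python
import torch
from transformers import BertTokenizer, BertModel

# Checkpoint name is an assumption; the patent only specifies "a Chinese medical language model".
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

def pair_vector(concept: str, term: str) -> torch.Tensor:
    """Encode '[CLS] concept [SEP] term [SEP]' and return the last-layer vector at the
    start-token position, which the text above uses as the semantic vector of the pair.
    [CLS] stands in for the [S] token of the description."""
    enc = tokenizer(concept, term, return_tensors="pt")   # builds [CLS] x [SEP] term [SEP]
    with torch.no_grad():
        out = encoder(**enc)
    return out.last_hidden_state[:, 0, :]                 # vector at the [CLS]/[S] position

# Illustrative strings for the fallopian-tube example mentioned above.
vec = pair_vector("输卵管切除术", "双侧输卵管切除术")
```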
The semantic vector is then input into the semantic matching model to compute the semantic similarity. The multi-layer semantic matching model is computed as

h_l = σ(h_{l-1}·W_l + b_l),

where h_l is the hidden state (the output value) of the l-th layer of the semantic matching model and W_l and b_l are parameters obtained by training. The output dimension of the last layer of the semantic matching model is set to 2, and a nonlinear (softmax) transformation of the two outputs yields the semantic similarity score of x and the standard term together with their dissimilarity score.
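A minimal sketch of a semantic matching head consistent with the description above: a small feed-forward network whose two softmax outputs are read as the similarity and dissimilarity scores. The layer sizes and depth are assumptions.

```python
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    """Feed-forward semantic matching head over the pair vector from the encoder.
    The last layer has 2 outputs; after softmax the first is read as the similarity
    score and the second as the dissimilarity score."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 2),
        )

    def forward(self, pair_vec: torch.Tensor) -> torch.Tensor:
        sim, _dissim = torch.softmax(self.layers(pair_vec), dim=-1).unbind(-1)
        return sim     # semantic similarity score of the (concept, term) pair
```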
2) The semantic matching model is trained in a self-supervised learning manner, so that the model autonomously learns the common features of identical data from a large amount of data, which alleviates the lack of labeled training data. In the concrete implementation a contrastive learning scheme is adopted: the model focuses on learning the common features of synonymous terms while learning to distinguish terms with different meanings. Let the label of the original clinical concept x be the standard term y, and let the training candidate set obtained through step one consist of y and the sampled negative samples. A global loss function L is constructed by contrasting the semantic similarity of the positive pair (x, y) against the semantic similarity of each negative pair, where E denotes the expectation function and γ is an empirically specified parameter given a fixed value in this embodiment. The term standardization model parameters are updated by back-propagating the gradient of this loss function.
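Since the exact loss formula appears only as an image in the original, the following sketch shows one common contrastive form consistent with the description, pushing the positive-pair similarity above each negative-pair similarity by a margin γ; the hinge form and the margin value are assumptions.

```python
import torch

def contrastive_loss(sim_pos: torch.Tensor, sim_neg: torch.Tensor, gamma: float = 0.3) -> torch.Tensor:
    """Hinge-style contrastive loss: push the similarity of the positive pair (x, y) above
    the similarity of every negative pair (x, m_i) by at least the margin gamma.

    sim_pos: shape (batch,)         similarity of x with its gold standard term y
    sim_neg: shape (batch, n_neg)   similarities of x with the sampled negatives
    gamma:   empirically specified margin; the value here is illustrative."""
    return torch.clamp(gamma - sim_pos.unsqueeze(1) + sim_neg, min=0.0).mean()
```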
Thirdly, rapidly upgrading the term standardization model through active learning, specifically: computing semantic similarity scores between the unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out, according to the active learning criteria, a group of samples about which the current term standardization model is most uncertain, determining their labels, and merging them into the training candidate set. The more specific principle and process are as follows:
Obtaining a better model with as little labeled data as possible is a problem faced by many machine learning algorithms. The idea of active learning is that the samples the current model classifies most ambiguously carry the most information; by screening such samples and generating labels for them on this principle, model performance can be improved the most for the same amount of data. An unlabeled original clinical concept x obtains its prediction candidate set through step one. The semantic matching model computes a semantic similarity score between x and each standard term in the prediction candidate set, and the scores are normalized into a probability distribution p. The uncertainty of the term standardization model about x is then computed as a weighted combination of the following features, each multiplied by its own weight (the weights are set to fixed empirical values in this embodiment):
the information entropy of the term standardization model's prediction for x, H(x) = -Σ_i p_i log p_i;
the margin probability, i.e. the difference between the largest and the second-largest probabilities in p;
the confidence, i.e. the largest probability in p;
the frequency with which the original clinical text x occurs; generating correct labels for high-frequency original clinical concepts helps the term standardization model better learn the distribution of the whole set of original clinical concepts.
The active learning process screens out the original clinical concepts with the highest uncertainty values; their labels are determined manually and they are merged into the training candidate set to retrain the term standardization model. Repeating this procedure yields the best-performing term standardization model with the least labeled data.
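An illustrative way to combine the four signals above into a single acquisition score is sketched below. The signs and the weights (high entropy, small margin, low confidence and high frequency all raise the score) are assumptions, since the embodiment's weight values appear only as images in the original.

```python
import numpy as np

def uncertainty(probs: np.ndarray, freq: float, w=(1.0, 1.0, 1.0, 1.0)) -> float:
    """Weighted acquisition score for active learning over one original clinical concept.
    probs: unnormalized similarity scores against the prediction candidate set.
    freq:  occurrence frequency of the original clinical text."""
    p = probs / probs.sum()                        # normalized probability distribution
    entropy = -(p * np.log(p + 1e-12)).sum()       # prediction entropy
    top2 = np.sort(p)[-2:]
    margin = top2[1] - top2[0]                     # largest minus second-largest probability
    confidence = top2[1]                           # largest probability
    w1, w2, w3, w4 = w
    return w1 * entropy + w2 * (1 - margin) + w3 * (1 - confidence) + w4 * freq

# Concepts are ranked by this score; the top ones are sent for manual labeling and
# then merged back into the training candidate set for retraining.
```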
Fourthly, training the precise ranking model and comprehensively evaluating the predictions of the term standardization model from text and semantic dimensions, specifically: taking the semantic similarity scores between the original clinical concept and the standard terms output by the self-supervised learning of step two as semantic features, and computing text features; training a regression-decision-tree-based precise ranking model on these features to compute the confidence score of the medical term standardization result. The more specific principle and process are as follows:
In practical applications the reliability of the term standardization result must be considered, especially in the cold-start period of the whole system when training data are relatively scarce. The traditional approach is to manually re-verify the predictions of the term standardization model, which usually consumes considerable manpower. The precise ranking model designed here, which jointly considers semantic, textual and other features, helps the correctly predicted standard terms obtain a higher rank and thus addresses this problem effectively; as the term standardization model is iteratively upgraded through self-supervision and active learning, the weight of the semantic features in the precise ranking model can be gradually increased. Preferably, the gradient boosting model XGBoost is used as the precise ranking model. Its basic idea is to train a number of regression decision trees, the learning target of each tree being the error of the preceding trees, with the sum of the outputs of all trees giving the final confidence score, as shown in FIG. 5. Suppose a gradient boosting model is built over u samples; the loss function L_t of the t-th decision tree is

L_t = Σ_{i=1}^{u} l(g_i, s_i^(t-1) + f_t(i)) + Ω(f_t),

where l is the squared loss function, g_i is the label of the i-th sample, f_t(i) is the prediction of the t-th decision tree for the i-th sample, s_i^(t-1) is the prediction of the first t-1 decision trees for the i-th sample, and Ω(f_t) = γ·J_t + (λ/2)·Σ_{k=1}^{J_t} w_k^2 is a regularization term representing the complexity of the decision tree, in which J_t is the number of leaf nodes of the t-th decision tree, w_k is the prediction value of the k-th leaf node, and γ and λ are weight parameters given fixed values in this embodiment.
In the process of training the precise ranking model, the input training data is a data set consisting of the original clinical concepts and the standard-term positive samples in the prediction candidate sets of the term standardization model. The label of each sample is either 0 (wrong standard term) or 1 (correct standard term), and the features used in training are shown in Table 1. Suppose the trained precise ranking model contains T decision trees; then the confidence score of the medical term standardization result for a sample is computed as the sum Σ_{t=1}^{T} f_t of the predictions of all T decision trees for that sample.
Table 1 Features adopted by the precise ranking model
Semantic feature: the semantic similarity score between the original clinical concept and the standard term output by the term standardization model.
Text features: the literal similarity between the original clinical concept and the standard term, the word co-occurrence frequency, and the difference in the number of contained words.
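A sketch of the precise ranking model using the xgboost package is given below. The feature layout follows Table 1, while the toy data and the gamma/reg_lambda values are illustrative assumptions rather than the embodiment's settings.

```python
import numpy as np
import xgboost as xgb

# One row per (original concept, candidate standard term) pair, columns following Table 1:
# [semantic similarity, literal similarity, word co-occurrence, word-count difference].
X_train = np.array([[0.92, 0.66, 3, 1],
                    [0.31, 0.10, 0, 4]], dtype=float)   # toy rows, for illustration only
y_train = np.array([1, 0])                              # 1 = correct standard term, 0 = wrong

# gamma / reg_lambda correspond to the complexity weights in the objective above;
# the concrete values here are assumptions, not the patent's settings.
ranker = xgb.XGBRegressor(n_estimators=100, max_depth=4,
                          objective="reg:squarederror",
                          gamma=1.0, reg_lambda=1.0)
ranker.fit(X_train, y_train)

# At prediction time, score every candidate in the prediction candidate set and
# keep the standard term with the largest confidence score.
scores = ranker.predict(X_train)
best = int(np.argmax(scores))
```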
Fifthly, predicting the final term standardization result. The trained precise ranking model computes a confidence score for each standard-term positive sample in the prediction candidate set, and the standard term with the largest confidence score is taken; it can be regarded as the standard term with the same meaning as the original clinical concept x.
Sixthly, screening the samples whose prediction results have high confidence scores for semi-supervised self-training. Specifically: a strict threshold is set for the confidence scores predicted by the precise ranking model. If the confidence score output for the standard term that the precise ranking model predicts for an original clinical concept x exceeds this threshold, then x together with that predicted standard term is added to the original training candidate set, and the parameters of the term standardization model and the precise ranking model are updated.
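A sketch of this semi-supervised self-training step, under the assumption of a predict_best(x) helper that returns the top-ranked standard term and its confidence score; the threshold value is illustrative.

```python
def self_training_update(unlabeled_concepts, predict_best, threshold=0.95):
    """Keep only predictions whose confidence score exceeds a strict threshold and
    return them as new pseudo-labeled training pairs."""
    new_pairs = []
    for x in unlabeled_concepts:
        term, score = predict_best(x)          # top-ranked standard term and its confidence
        if score >= threshold:
            new_pairs.append((x, term))        # (original concept, pseudo-labeled standard term)
    return new_pairs                           # merged into the training candidate set before retraining
```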
Seventhly, retrieving the direct superior term, specifically: obtaining the group of standard terms with the highest confidence scores predicted by the precise ranking model for the original clinical concept, and generating the paths tracing back to upper levels in the hierarchical structure of the standard glossary; then determining the direct superior term corresponding to the original clinical concept based on majority voting. The more specific implementation process is as follows:
An original clinical concept x whose output confidence scores are all low may have no standard term of the same meaning; the standard glossary is then searched upwards to locate the direct superior term of x. The precise ranking model computes the confidence scores of x against the standard terms in its prediction candidate set, the scores are sorted, and the k standard terms with the highest confidence scores are selected; the codes of these terms are traced back to upper levels in the standard glossary. For example, for the original disease concept "right synovitis", the k = 5 standard terms with the highest confidence scores are shown in Table 2. Starting from the codes of the terms in the table, the backtracking paths to upper levels and every intermediate node passed through are marked in the standard glossary, and each standard term node on a backtracking path is annotated with the number of times it has been traversed; this is shown in FIG. 6, where the number in each node is its traversal count. Then, searching along the backtracking paths from lower levels to upper levels, the first standard term node encountered whose traversal count satisfies the majority-vote condition can be regarded as the direct superior standard term of x. For example, the first node satisfying the condition encountered during the lower-to-upper search in FIG. 6 is "synovitis and tenosynovitis (M65.9)", which indicates that the original clinical concept "right synovitis" should be fused into the standard glossary as a direct subordinate term of that term.
Table 2 The standard terms with the highest confidence scores for the original concept "right synovitis"
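A sketch of the majority-vote retrieval of the direct superior term, assuming the glossary hierarchy is available as a parent_of mapping from each code to its direct superior code; the vote threshold (a strict majority of k by default) and all names are illustrative.

```python
from collections import Counter

def glossary_depth(code, parent_of):
    """Depth of a code in the glossary hierarchy (root terms have depth 0)."""
    d = 0
    while parent_of.get(code) is not None:
        code = parent_of[code]
        d += 1
    return d

def direct_superior_term(top_codes, parent_of, vote_threshold=None):
    """Majority-vote retrieval of the direct superior term, as illustrated in FIG. 6.
    top_codes:  codes of the k standard terms with the highest confidence scores (cf. Table 2).
    parent_of:  mapping from a term code to its direct superior code (None for root terms)."""
    threshold = vote_threshold if vote_threshold is not None else len(top_codes) // 2 + 1
    visits = Counter()
    for code in top_codes:
        node = parent_of.get(code)
        while node is not None:                 # trace the path back to the top level
            visits[node] += 1
            node = parent_of.get(node)
    # scan from the deepest (lowest) levels toward the upper levels
    for node in sorted(visits, key=lambda n: -glossary_depth(n, parent_of)):
        if visits[node] >= threshold:
            return node                         # first qualifying node from lower to upper levels
    return None
```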
The invention designs a self-supervised learning method for medical term standardization and realizes a high-accuracy medical term standardization model with little labeled data; it implements an active learning function on top of the term standardization process so that the model can be upgraded quickly and automatically; it designs a candidate sample generation function that exploits the characteristics of the standard glossary so that the candidate samples carry enough information; it designs a precise ranking function for the predictions of the medical term standardization model that integrates semantic and text features, further reducing manual intervention; and it designs a direct superior term retrieval function for original clinical concepts on the basis of the precise ranking results, ensuring the completeness and consistency of the medical term standardization results.
The foregoing is only a preferred embodiment of the present invention. Although the present invention has been disclosed in terms of preferred embodiments, they are not intended to limit it. Those skilled in the art can make numerous possible variations and modifications to the technical solution of the present invention, or modify it into equivalent embodiments, using the methods and technical content disclosed above, without departing from the scope of the technical solution of the present invention. Therefore, any simple modification, equivalent change or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the protection scope of the technical solution of the present invention.

Claims (10)

1. An automatic medical term standardization system combining self-supervision and active learning, comprising:
(1) a candidate set generation module: sampling negative samples based on a text relevance model and the hierarchical structure of the standard glossary to generate a training candidate set, and sampling possible positive samples based on the text relevance model to generate a prediction candidate set;
(2) a self-supervised learning module for training the term standardization model, comprising:
training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms;
computing, through a semantic matching model, the semantic similarity between a labeled original clinical concept and its label, and between the concept and each negative sample in the training candidate set;
computing the loss function of the term standardization model from these semantic similarities in a self-supervised learning manner;
(3) an active learning module: computing semantic similarity scores between unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out, according to the active learning criteria, a group of samples about which the current term standardization model is most uncertain, determining their labels, and merging them into the training candidate set;
(4) a precise ranking module: taking the semantic similarity scores between the original clinical concept and the standard terms output by the self-supervised learning module as semantic features, computing text features, and training a regression-decision-tree-based precise ranking model on the semantic and text features to compute confidence scores for the medical term standardization results; the trained precise ranking model computes a confidence score for each standard-term positive sample in the prediction candidate set to obtain the standard term with the largest confidence score.
2. The system of claim 1, further comprising a semi-supervised learning module that merges into the training candidate set those samples whose confidence scores for the medical term standardization results, output by the precise ranking module, satisfy the condition.
3. The system of claim 1, further comprising a direct superior term retrieval module, which: obtains the group of standard terms with the highest confidence scores predicted by the precise ranking model for the original clinical concept, and generates the paths tracing back to upper levels in the hierarchical structure of the standard glossary; and determines the direct superior term corresponding to the original clinical concept based on majority voting.
4. A method for automatic medical term standardization fusing self-supervision and active learning, the method comprising:
(1) generating negative and positive samples and constructing a training candidate set and a prediction candidate set, respectively: sampling negative samples based on a text relevance model and the hierarchical structure of a standard terminology table to generate the training candidate set, and sampling probable positive samples based on the text relevance model to generate the prediction candidate set;
(2) training a term standardization model by self-supervised learning: training a Chinese medical language model by an adaptive method to obtain semantic vectors of original clinical concepts and standard terms; computing, through a semantic matching model, the semantic similarity between a labeled original clinical concept and its label, and between the concept and the negative samples in the training candidate set; computing the loss function of the standardization model from these semantic similarities in a self-supervised learning manner;
(3) rapidly upgrading the term standardization model by active learning: computing semantic similarity scores between unlabeled original clinical concepts and the standard terms in the prediction candidate set; selecting, according to an active learning criterion, the group of samples about which the current term standardization model is most uncertain, and fusing them into the training candidate set after their labels have been determined;
(4) training a precise ranking model to comprehensively evaluate the prediction results of the term standardization model from the text and semantic dimensions: taking the semantic similarity scores between original clinical concepts and standard terms output by the self-supervised learning of step (2) as semantic features, and computing text features; training a precise ranking model based on regression decision trees over the semantic and text features, used to compute confidence scores of medical term standardization results;
(5) predicting the final term standardization result: computing, with the trained precise ranking model, confidence scores for the standard term positive samples in the prediction candidate set, and taking the standard term with the highest confidence score as the term standardization result.
5. The method for automatic medical term standardization fusing self-supervision and active learning according to claim 4, wherein step (1) comprises:
(1.1) Training candidate set: the training candidate set is built from original clinical concepts $x$ and their corresponding standard terms $y$. If $y$ has a first-level direct superior term $y^{(1)}$, all next-level terms under $y^{(1)}$ are taken as the set $M$; if $y$ has no first-level direct superior term but has a second-level superior term $y^{(2)}$, all terms one and two levels below $y^{(2)}$ are taken as the set $M$; otherwise, the whole standard terminology table is taken as the set $M$. The text relevance score between $x$ and each standard term $m$ in $M$ is computed, the terms are ranked by this score, and a negative sample set $N(x)=\{m_1, m_2, \dots, m_k\}$ is selected, yielding the training candidate set $C_{\mathrm{train}}(x)=\{y\}\cup N(x)$.
(1.2) Prediction candidate set: at prediction time of the term standardization model, for an unlabeled original clinical concept $x$, the whole standard terminology table is taken as the set $M$; the text relevance scores are used to select from $M$ a set of probable positive samples $P(x)=\{m_1, m_2, \dots, m_n\}$, yielding the prediction candidate set $C_{\mathrm{pred}}(x)=P(x)$.
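Purely as an illustration of step (1.1), and not as part of the claims, the Python sketch below builds a training candidate set under simplifying assumptions: a character-level TF-IDF cosine similarity stands in for the text relevance model, the term hierarchy is a flat child-to-parent dictionary, and only the first-level superior term case is handled; the function names and the toy data are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_relevance_scores(concept, terms):
    """Character n-gram TF-IDF cosine similarity as a stand-in text relevance model."""
    vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
    mat = vec.fit_transform([concept] + list(terms))
    return cosine_similarity(mat[0], mat[1:]).ravel()

def training_candidates(concept, gold_term, parent_of, vocabulary, k=10):
    """Build the candidate pool M from the hierarchy, then keep the top-k negatives."""
    parent = parent_of.get(gold_term)            # first-level superior term, if any
    if parent is not None:
        pool = [t for t, p in parent_of.items() if p == parent and t != gold_term]
    else:
        pool = [t for t in vocabulary if t != gold_term]
    if not pool:                                 # fall back to the whole term table
        pool = [t for t in vocabulary if t != gold_term]
    scores = text_relevance_scores(concept, pool)
    ranked = [t for _, t in sorted(zip(scores, pool), reverse=True)]
    return [gold_term] + ranked[:k]              # training candidate set {y} ∪ N(x)

# Toy usage with hypothetical terms
vocab = ["急性上呼吸道感染", "慢性上呼吸道感染", "上呼吸道感染", "肺炎", "支气管炎"]
hier = {"急性上呼吸道感染": "上呼吸道感染", "慢性上呼吸道感染": "上呼吸道感染"}
print(training_candidates("急性上感", "急性上呼吸道感染", hier, vocab, k=2))
```

The prediction candidate set of step (1.2) can be obtained with the same relevance scores by keeping the top-n terms of the whole terminology table instead of the hierarchy-restricted pool.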
6. The method for automatic medical term standardization fusing self-supervision and active learning according to claim 4, wherein the Chinese medical language model of step (2) adopts a bidirectional autoregressive language model, specifically: the original clinical concept $x$ and any standard term $m$ are concatenated character by character, a separator token [SEP] is inserted at the junction and a start token [S] is added at the leftmost position; the concatenation is fed into the bidirectional autoregressive language model as a single sentence, and the output of its last layer at the position of the start token [S] is taken as the semantic vector of $x$ and $m$.
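As an illustrative stand-in for the paired encoding of claim 6, the sketch below uses the Hugging Face transformers library with a generic Chinese BERT checkpoint; the checkpoint name is an assumption, the [CLS] token plays the role of the start token [S], and the adaptive pre-training of the claimed bidirectional autoregressive Chinese medical language model is not reproduced here.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Stand-in encoder; the embodiment itself uses an adaptively pre-trained
# Chinese medical language model rather than this generic checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def pair_vector(concept: str, term: str) -> torch.Tensor:
    """Concatenate concept and term with a separator; return the last-layer
    hidden state at the start-token position as the joint semantic vector."""
    inputs = tokenizer(concept, term, return_tensors="pt")  # [CLS] concept [SEP] term [SEP]
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[0, 0]                  # vector at the start position

v = pair_vector("急性上感", "急性上呼吸道感染")
print(v.shape)  # e.g. torch.Size([768])
```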
7. The method for automatic medical term standardization fusing self-supervision and active learning according to claim 5, wherein in step (3), an unlabeled original clinical concept $x$ obtains through step (1) the prediction candidate set

$C_{\mathrm{pred}}(x)=\{m_1, m_2, \dots, m_n\}$

The semantic matching model computes the semantic similarity scores

$s_i=\mathrm{sim}(x, m_i),\quad i=1,\dots,n$

which are normalized into a probability distribution:

$p_i=\dfrac{\exp(s_i)}{\sum_{j=1}^{n}\exp(s_j)}$

The uncertainty $U(x)$ of the term standardization model about $x$ is computed as a weighted combination of the following features:

$U(x)=\lambda_1 H(x)+\lambda_2 M(x)+\lambda_3 C(x)+\lambda_4\,\mathrm{freq}(x)$

wherein $H(x)$ is the information entropy of the term standardization model's prediction for $x$:

$H(x)=-\sum_{i=1}^{n} p_i \log p_i$

$M(x)$ is the margin probability:

$M(x)=p_{(1)}-p_{(2)}$

where $p_{(1)}$ and $p_{(2)}$ are respectively the maximum and the second maximum among all $p_i$; $C(x)$ is the confidence:

$C(x)=\max_{i} p_i$

$\mathrm{freq}(x)$ is the frequency of occurrence of $x$ in the raw clinical text data; and $\lambda_1,\dots,\lambda_4$ are the weights of the respective features.
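An illustrative numeric sketch of the uncertainty features of claim 7 (softmax normalization, entropy, margin, confidence, frequency) follows; the example weights are arbitrary assumptions, and in practice the weights of the margin and confidence terms would be chosen so that larger values reduce the uncertainty.

```python
import numpy as np

def uncertainty(scores, freq, weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted combination of the active-learning features of claim 7.
    scores:  semantic similarity scores of one concept against its candidates.
    freq:    occurrence frequency of the concept in the raw clinical text.
    weights: lambda_1..lambda_4 (illustrative values only; signs/scaling of the
             margin and confidence terms are a design choice of the embodiment)."""
    p = np.exp(scores - np.max(scores))
    p = p / p.sum()                               # softmax normalization p_i
    entropy = -np.sum(p * np.log(p + 1e-12))      # H(x)
    top2 = np.sort(p)[-2:]
    margin = top2[1] - top2[0]                    # M(x): max minus second max
    confidence = p.max()                          # C(x)
    l1, l2, l3, l4 = weights
    return l1 * entropy + l2 * margin + l3 * confidence + l4 * freq

print(uncertainty(np.array([2.1, 1.9, 0.3]), freq=5))
```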
8. The method for automatic medical term standardization fusing self-supervision and active learning according to claim 5, wherein in step (4) the gradient boosting model XGBoost is adopted as the precise ranking model, specifically: a plurality of regression decision trees are trained, the learning target of each tree being the error left by the preceding trees, and the accumulated outputs of all trees giving the final confidence score. Suppose the gradient boosting model is constructed over $u$ samples; the loss function $L^{(t)}$ of the $t$-th decision tree is:

$L^{(t)}=\sum_{i=1}^{u} l\big(y_i,\; \hat{y}_i^{(t-1)}+f_t(x_i)\big)+\Omega(f_t)$

wherein $l(\cdot,\cdot)$ is the square loss function, $y_i$ is the label of sample $x_i$, $f_t(x_i)$ is the prediction of the $t$-th decision tree for $x_i$, $\hat{y}_i^{(t-1)}$ is the prediction of the first $t-1$ decision trees for $x_i$, and

$\Omega(f_t)=\gamma K_t+\tfrac{1}{2}\lambda\sum_{k=1}^{K_t} w_k^{2}$

is the regularization term representing the complexity of the decision tree, wherein $K_t$ is the number of leaf nodes of the $t$-th decision tree, $w_k$ is the predicted value of the $k$-th leaf node, and $\gamma$ and $\lambda$ are weight parameters.

When training the precise ranking model, the input training data is the data set consisting of the original clinical concepts and the standard term positive samples in their prediction candidate sets produced by the term standardization model:

$D=\{(x_i, m_i, y_i)\}_{i=1}^{u}$

Let the trained precise ranking model comprise $T$ decision trees; then for a sample $(x, m)$ the confidence score of the medical term standardization result is computed as:

$\mathrm{score}(x, m)=\sum_{t=1}^{T} f_t(x, m)$
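A minimal sketch of training the precise ranking model of claim 8 with the xgboost library follows; the toy feature vectors (semantic similarity plus simple text features), the labels, and all hyperparameter values are assumptions, and the squared-error objective corresponds to the square loss named in the claim.

```python
import numpy as np
import xgboost as xgb

# Toy feature matrix: each row holds (semantic similarity, character overlap,
# length ratio) for one (original concept, candidate standard term) pair;
# labels are 1 for the correct standard term and 0 otherwise.
X = np.array([
    [0.92, 0.80, 0.95],
    [0.35, 0.40, 0.70],
    [0.88, 0.75, 0.90],
    [0.20, 0.10, 0.50],
])
y = np.array([1.0, 0.0, 1.0, 0.0])

ranker = xgb.XGBRegressor(
    n_estimators=50,               # T regression trees
    max_depth=3,
    reg_lambda=1.0,                # leaf-weight L2 penalty (lambda in the claim)
    gamma=0.0,                     # per-leaf complexity penalty (gamma in the claim)
    objective="reg:squarederror",  # square loss
)
ranker.fit(X, y)

# Confidence score of a new (concept, candidate term) pair: accumulated tree outputs
print(ranker.predict(np.array([[0.90, 0.78, 0.93]])))
```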
9. The method for automatic medical term standardization fusing self-supervision and active learning according to any one of claims 4-8, wherein samples whose confidence scores, as output by the precise ranking model for the medical term standardization results, satisfy a given condition are fused into the training candidate set, and the parameters of the term standardization model and of the precise ranking model are updated.
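A minimal sketch of the semi-supervised fusion of claim 9, assuming a simple confidence threshold as the condition; the data structures and the threshold value are assumptions.

```python
def fuse_confident_predictions(predictions, training_set, threshold=0.9):
    """predictions: list of (concept, predicted_standard_term, confidence_score).
    Pairs whose confidence exceeds the threshold are appended to the training
    candidate set as pseudo-labeled samples before retraining."""
    for concept, term, score in predictions:
        if score >= threshold:
            training_set.append((concept, term))
    return training_set

train = [("急性上感", "急性上呼吸道感染")]
preds = [("上感", "上呼吸道感染", 0.95), ("咳嗽伴发热", "肺炎", 0.42)]
print(fuse_confident_predictions(preds, train))
```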
10. The method for automatic medical term standardization fusing self-supervision and active learning according to any one of claims 4-8, further comprising a direct superior term retrieval step: obtaining the group of standard terms with the highest confidence scores predicted by the precise ranking model for an original clinical concept; generating, for each of them, a path traced back to its superior levels in the hierarchical structure of the standard terminology table; and determining the direct superior term corresponding to the original clinical concept by majority voting.
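A minimal sketch of the direct superior term retrieval of claim 10; for brevity it votes only over the immediate parents of the top-ranked terms rather than over full ancestor paths, and the dictionary-based hierarchy is an assumed representation.

```python
from collections import Counter

def direct_superior(top_terms, parent_of):
    """top_terms: standard terms with the highest confidence scores for one concept.
    parent_of:  child -> parent mapping of the standard terminology table.
    Returns the superior term chosen by majority vote, or None if no parent exists."""
    votes = Counter(parent_of[t] for t in top_terms if parent_of.get(t) is not None)
    return votes.most_common(1)[0][0] if votes else None

hierarchy = {"急性上呼吸道感染": "上呼吸道感染",
             "慢性上呼吸道感染": "上呼吸道感染",
             "上呼吸道感染": "呼吸道感染"}
print(direct_superior(["急性上呼吸道感染", "慢性上呼吸道感染", "肺炎"], hierarchy))
```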
CN202110994475.7A 2021-08-27 2021-08-27 Automatic medical term standardization system and method integrating self-supervision and active learning Active CN113436698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110994475.7A CN113436698B (en) 2021-08-27 2021-08-27 Automatic medical term standardization system and method integrating self-supervision and active learning

Publications (2)

Publication Number Publication Date
CN113436698A true CN113436698A (en) 2021-09-24
CN113436698B (en) 2021-12-07

Family

ID=77798234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110994475.7A Active CN113436698B (en) 2021-08-27 2021-08-27 Automatic medical term standardization system and method integrating self-supervision and active learning

Country Status (1)

Country Link
CN (1) CN113436698B (en)

Citations (4)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150309987A1 (en) * 2014-04-29 2015-10-29 Google Inc. Classification of Offensive Words
CN108520038A (en) * 2018-03-31 2018-09-11 大连理工大学 A kind of Biomedical literature search method based on Ranking Algorithm
CN111881334A (en) * 2020-07-15 2020-11-03 浙江大胜达包装股份有限公司 Keyword-to-enterprise retrieval method based on semi-supervised learning
CN112364174A (en) * 2020-10-21 2021-02-12 山东大学 Patient medical record similarity evaluation method and system based on knowledge graph

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘龙航: "Research on a Multi-Resource-Based Method for Constructing a Chinese Medical Knowledge Graph", China Master's Theses Full-text Database (Medicine and Health Sciences) *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7432802B2 (en) 2021-10-19 2024-02-16 之江実験室 Medical terminology normalization system and method based on heterogeneous graph neural network
CN114691826B (en) * 2022-03-10 2022-12-09 南京云设智能科技有限公司 Medical data information retrieval method based on co-occurrence analysis and spectral clustering
CN114691826A (en) * 2022-03-10 2022-07-01 南京云设智能科技有限公司 Medical data information retrieval method based on co-occurrence analysis and spectral clustering
CN114330370B (en) * 2022-03-17 2022-05-20 天津思睿信息技术有限公司 Natural language processing system and method based on artificial intelligence
CN114330370A (en) * 2022-03-17 2022-04-12 天津思睿信息技术有限公司 Natural language processing system and method based on artificial intelligence
WO2023092961A1 (en) * 2022-04-27 2023-06-01 之江实验室 Semi-supervised method and apparatus for public opinion text analysis
CN115270780A (en) * 2022-07-20 2022-11-01 北京新纽科技有限公司 Method for recognizing terms
CN115270780B (en) * 2022-07-20 2023-04-07 北京新纽科技有限公司 Method for recognizing terms
CN115080751A (en) * 2022-08-16 2022-09-20 之江实验室 Medical standard term management system and method based on general model
CN115080751B (en) * 2022-08-16 2022-11-11 之江实验室 Medical standard term management system and method based on general model
CN115062602B (en) * 2022-08-17 2022-11-11 杭州火石数智科技有限公司 Sample construction method and device for contrast learning and computer equipment
CN115062602A (en) * 2022-08-17 2022-09-16 杭州火石数智科技有限公司 Sample construction method and device for contrast learning, computer equipment and storage medium
CN115688779A (en) * 2022-10-11 2023-02-03 杭州瑞成信息技术股份有限公司 Address recognition method based on self-supervision deep learning
CN115688779B (en) * 2022-10-11 2023-05-09 杭州瑞成信息技术股份有限公司 Address recognition method based on self-supervision deep learning
CN115994227A (en) * 2023-03-23 2023-04-21 北京左医科技有限公司 Medical term standardization model construction method, device, terminal equipment and medium
CN115994227B (en) * 2023-03-23 2023-06-06 北京左医科技有限公司 Medical term standardization model construction method, device, terminal equipment and medium
CN117540734A (en) * 2024-01-10 2024-02-09 中南大学 Chinese medical entity standardization method, device and equipment
CN117540734B (en) * 2024-01-10 2024-04-09 中南大学 Chinese medical entity standardization method, device and equipment

Also Published As

Publication number Publication date
CN113436698B (en) 2021-12-07

Similar Documents

Publication Publication Date Title
CN113436698B (en) Automatic medical term standardization system and method integrating self-supervision and active learning
CN109378053B (en) Knowledge graph construction method for medical image
CN112214995B (en) Hierarchical multitasking term embedded learning for synonym prediction
CN111540468B (en) ICD automatic coding method and system for visualizing diagnostic reasons
JP4774073B2 (en) Methods for document clustering or categorization
CN102799579B (en) Statistical machine translation method with error self-diagnosis and self-correction functions
CN112002411A (en) Cardiovascular and cerebrovascular disease knowledge map question-answering method based on electronic medical record
CN109871538A (en) A kind of Chinese electronic health record name entity recognition method
CN111078875B (en) Method for extracting question-answer pairs from semi-structured document based on machine learning
US20050027664A1 (en) Interactive machine learning system for automated annotation of information in text
CN112364174A (en) Patient medical record similarity evaluation method and system based on knowledge graph
CN110298033A (en) Keyword corpus labeling trains extracting tool
CN106682397A (en) Knowledge-based electronic medical record quality control method
CN108875809A (en) The biomedical entity relationship classification method of joint attention mechanism and neural network
CN111554360A (en) Drug relocation prediction method based on biomedical literature and domain knowledge data
CN113707339B (en) Method and system for concept alignment and content inter-translation among multi-source heterogeneous databases
CN104699741A (en) Analyzing natural language questions to determine missing information in order to improve accuracy of answers
US20210042344A1 (en) Generating or modifying an ontology representing relationships within input data
CN117151220B (en) Entity link and relationship based extraction industry knowledge base system and method
US20190317986A1 (en) Annotated text data expanding method, annotated text data expanding computer-readable storage medium, annotated text data expanding device, and text classification model training method
CN113779211A (en) Intelligent question-answer reasoning method and system based on natural language entity relationship
CN106407183A (en) Method and device for generating medical named entity recognition system
CN116245107B (en) Electric power audit text entity identification method, device, equipment and storage medium
CN112420148A (en) Medical image report quality control system, method and medium based on artificial intelligence
CN114912435A (en) Power text knowledge discovery method and device based on frequent itemset algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant