CN113436698B - Automatic medical term standardization system and method integrating self-supervision and active learning - Google Patents

Automatic medical term standardization system and method integrating self-supervision and active learning

Info

Publication number
CN113436698B
CN113436698B
Authority
CN
China
Prior art keywords
term
model
standard
training
candidate set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110994475.7A
Other languages
Chinese (zh)
Other versions
CN113436698A (en)
Inventor
李劲松
杨宗峰
辛然
李玉格
史黎鑫
田雨
周天舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202110994475.7A priority Critical patent/CN113436698B/en
Publication of CN113436698A publication Critical patent/CN113436698A/en
Application granted granted Critical
Publication of CN113436698B publication Critical patent/CN113436698B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an automatic medical term standardization system and method integrating self-supervision and active learning. The system comprises basic modules, including a candidate set generation module, a self-supervised learning module for training a term standardization model, an active learning module, and a precise ranking module that comprehensively evaluates the prediction results of the term standardization model from the text and semantic dimensions; it further comprises preferred modules, including a semi-supervised learning module and a direct superior term retrieval module. The invention can realize an automatic medical term standardization model with little labeled data, and the model retains the ability to be quickly updated and upgraded, greatly reducing the workload of manual intervention while ensuring the accuracy of the output results. A new clinical concept can be matched to its direct superior term and assigned an accurate position in the standard term table, ensuring the integrity and uniformity of the standardized results.

Description

Automatic medical term standardization system and method integrating self-supervision and active learning
Technical Field
The invention belongs to the technical field of Chinese medical term standardization and multi-center medical information platforms, and particularly relates to a medical term automatic standardization system and method integrating self-supervision and active learning.
Background
With the popularization of electronic medical record systems, a large amount of important medical information is stored in various medical information systems in electronic form. This data creates great value for clinical decision support, drug research and development, public health monitoring and evaluation, infectious disease early warning, personalized precision medicine, and the like. Medical data standardization is a key step in promoting the integration of domestic medical systems and realizing collaborative research and large-scale analysis of medical data, and standardizing medical terms is the first difficult problem to be solved in that process. Internationally, different types of medical terms have corresponding standard terminology systems, including the disease terminology set ICD-10, the surgical procedure codes ICD-9-CM-3, and the laboratory test terminology set LOINC. However, hospitals and other medical facilities do not make good use of these internationally accepted standard terminology sets in actual operation, mainly because: (1) different hospitals often adopt different medical information systems whose data standards differ, so the medical terms they generate differ considerably in data dimension and data format; (2) different operators understand standard terms and their granularity inconsistently. Medical information systems usually require the operator to select the disease name, operation name, and other information according to the patient's condition; where the meanings of superior and subordinate terms overlap (for example, the two ICD-10 codes "D00.2" and "D00.200" for "gastric carcinoma in situ"), different operators, or even the same operator at different times, may choose differently; (3) operators personalize the terms they enter. Most information systems allow manual input for the convenience of entering new concepts, so operators may coin irregular terms based on past experience and personal habits. These factors mean that original clinical concepts cannot be directly related to the common standard terms, making data unification and information exchange between different organizations difficult.
The ultimate goal of medical term standardization is to establish a mapping between original clinical concepts and standard terms. Previous term standardization schemes are generally based on one of two ideas. (1) Manual mapping: professional clinicians are invited to map and proofread terms one by one. However, each medical information system contains terms on the order of tens of thousands, so the proofreading time required of clinicians is very long, which makes rapid nationwide adoption difficult and hinders the rapid implementation of domestic medical data standardization. In addition, because clinicians differ in work experience, their mappings to standard terms lack a uniform criterion, so consistency between different clinicians is hard to guarantee; manual errors also make it hard to guarantee that the same clinician maps consistently at different times. (2) Training a medical concept semantic matching model with a machine learning algorithm: manual data annotation is difficult and time-consuming, so training data is insufficient, the resulting model generalizes poorly, and extra manpower must still be spent verifying the output to ensure the accuracy of the term standardization results in actual use. On the other hand, many standard terminology sets have superior-subordinate relationships; for example, the subordinate terms of "corneal surgery (11)" in the surgical procedure codes ICD-9-CM-3 include "suture of corneal laceration (11.51)", "corneal transplantation NOS (11.6)", and the like. When a concept generated in actual clinical work has no synonymous peer-level term in the standard term set, its direct superior standard term needs to be located accurately; existing methods cannot solve this problem well, so newly added clinical concepts cannot be merged into the universal standard terminology system. The invention aims to establish a medical term standardization system with good accuracy and generalization capability without a large amount of labeled data, to realize rapid automatic iterative updating of the system with as little manual intervention as possible, and at the same time to locate the correct peer-level or superior standard term for each original clinical concept.
Disclosure of Invention
Medical data standardization is a key step in promoting the integration of domestic medical systems and realizing collaborative research and large-scale analysis of medical data. However, existing clinical term standardization methods and systems generally require substantial manual review and labeling work, and their accuracy and generalization ability are difficult to guarantee, making it difficult to rapidly popularize clinical data standardization domestically.
Aiming at the difficulties of current medical term standardization work, the invention provides an automatic medical term standardization system and method based on a deep learning model that integrates self-supervision and active learning.
The purpose of the invention is achieved by the following technical scheme. A medical term standardization model is constructed on the basis of a deep learning language model and trained with a self-supervised learning method. Negative samples are sampled based on a text relevance model and the hierarchical structure of the standard term table, yielding negative samples that carry more information and are harder for the model to discriminate, which acts as data augmentation; the model can therefore make full use of the semantic relations it contains even when only a small number of labeled samples are available. An active learning function is implemented based on principles such as maximum entropy, low confidence, and high frequency; according to the model's predictions on a large number of unknown samples, a group of samples that can improve model performance the most is screened out, so that the model can be upgraded quickly and significantly with minimal manual intervention. A precise ranking model is designed to integrate text, semantic, and other information and finally output the correct standard term. Precisely ranked samples automatically update the training data through semi-supervised self-training, further improving the accuracy and generalization ability of the model and continuously reducing the workload of manual intervention. An upward retrieval method is constructed to locate the direct superior terms of newly added original clinical concepts, ensuring the integrity and consistency of the medical term standardization results; newly added clinical concepts can thus find their correct positions in the standard term table, which facilitates comprehensive standardization of clinical data.
The invention discloses an automatic medical term standardization system integrating self-supervision and active learning, which comprises the following components:
(1) a candidate set generation module: sampling negative samples based on a hierarchical structure of a text correlation model and a standard glossary to generate a training candidate set, and sampling possible positive samples based on the text correlation model to generate a prediction candidate set;
(2) the self-supervision learning module: for training a term normalized model, comprising:
training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms;
respectively calculating the semantic similarity of the labeled original clinical concept and the label thereof and the negative sample of the training candidate set through a semantic matching model;
adopting a self-supervision learning mode, and calculating a loss function of a normalized model according to the semantic similarity;
(3) an active learning module: calculating semantic similarity scores between the unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out, according to the active learning criteria, a group of samples about which the current term standardization model is most uncertain, and fusing these samples into the training candidate set after their labels are determined;
(4) the accurate sequencing module: acquiring semantic similarity scores of the original clinical concept and standard terms output by the self-supervision learning module as semantic features, calculating text features, training a regression decision tree-based accurate sequencing model based on the semantic and text features, and calculating confidence scores of medical term standardization results; and calculating a confidence score for the standard term positive sample in the prediction candidate set by using the trained accurate sequencing model to obtain the standard term with the maximum confidence score.
Further, the automatic medical term standardization system further comprises a semi-supervised learning module, and the semi-supervised learning module fuses the samples of which the confidence scores of the medical term standardization results output by the precise ordering module meet the conditions to the training candidate set.
Further, the automatic medical term standardization system further comprises a direct superior term retrieval module, wherein the direct superior term retrieval module comprises: acquiring a group of standard terms with the highest confidence scores predicted by an accurate sequencing model for the original clinical concept, and generating a path traced back to the upper level in the hierarchical structure of the standard term table; and determining the direct superior terms corresponding to the original clinical concept based on the principle of majority voting.
The invention discloses a medical term automatic standardization method fusing self-supervision and active learning on the other hand, which comprises the following steps:
(1) generating a negative sample and a positive sample, and respectively constructing a training candidate set and a prediction candidate set: sampling negative samples based on a hierarchical structure of a text correlation model and a standard glossary to generate a training candidate set, and sampling possible positive samples based on the text correlation model to generate a prediction candidate set;
(2) training the term normalization model by self-supervised learning: training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms; respectively calculating the semantic similarity of the labeled original clinical concept and the label thereof and the negative sample of the training candidate set through a semantic matching model; adopting a self-supervision learning mode, and calculating a loss function of a normalized model according to the semantic similarity;
(3) the term standardization model is rapidly upgraded through active learning: calculating semantic similarity scores between the unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out, according to the active learning criteria, a group of samples about which the current term standardization model is most uncertain, and fusing these samples into the training candidate set after their labels are determined;
(4) training an accurate ordering model, and comprehensively evaluating the prediction result of the term standardization model from text and semantic dimensions: acquiring the semantic similarity score of the original clinical concept and the standard term output by the self-supervision learning in the step two as a semantic feature, and calculating a text feature; training a regression decision tree-based accurate sequencing model based on semantic and text features for calculating a confidence score of a medical term standardization result;
(5) predicting the final term normalization result: and calculating confidence scores for the standard term positive samples in the prediction candidate set by using the trained accurate ranking model, and taking the standard term with the maximum confidence score as a term standardization result.
Further, the step (1) includes:
(1.1) Training candidate set: the training candidate set consists of the original clinical concept x and the corresponding standard term y. If y has a direct superior term Y1, all next-level terms of Y1 are taken and denoted as the set M; if y has no direct superior term but has a second-level superior term Y2, all next-level and next-next-level terms of Y2 are taken and denoted as the set M; otherwise, the whole standard term table is denoted as the set M. The text relevance score of x and any standard term m in M is calculated, the terms are sorted by text relevance score, and a negative sample set {y1^-, y2^-, ..., yn^-} is selected, obtaining the training candidate set {(x, y), (x, y1^-), (x, y2^-), ..., (x, yn^-)}.
(1.2) Prediction candidate set: when the term standardization model makes predictions, for an unlabeled original clinical concept x the whole standard term table is denoted as the set M, and a positive sample set {y1^+, y2^+, ..., ym^+} is selected from M using the text relevance score, deriving the prediction candidate set {(x, y1^+), (x, y2^+), ..., (x, ym^+)}.
Further, in the step (2), the Chinese medical language model is a bidirectional autoregressive language model, which specifically includes: the original clinical concept x and any standard term y* are concatenated character by character, a separator token [SEP] is added at the junction, and a start token [S] is added at the leftmost position; the concatenation result is input into the bidirectional autoregressive language model as a whole sentence, and the last layer of the model outputs the semantic vector of x and y* at the position of the start token [S].
Further, in the step (3), the unlabeled original clinical concept x is passed through the step (1) to obtain a prediction candidate set {(x, y1^+), (x, y2^+), ..., (x, ym^+)}; the semantic matching model is used to compute the semantic similarity scores s1(x, yj^+), j = 1, ..., m, which are normalized into a probability distribution:

pj = s1(x, yj^+) / Σ_k s1(x, yk^+)

The uncertainty C(x) of the term standardization model for x is calculated as follows:

C(x) = w1·ent(x) + w2·margin(x) + w3·lc(x) + w4·freq(x)

where ent(x) is the information entropy of the term standardization model for x:

ent(x) = -Σ_j pj·log(pj)

margin(x) is the margin term:

margin(x) = -(p1 - p2)

where p1 and p2 are respectively the largest and second-largest of all pj;

lc(x) is the confidence term:

lc(x) = -p1

freq(x) is the frequency with which x occurs in the original clinical text data;

(wi), i = 1, 2, 3, 4, are the weights of the respective features.
Further, in the step (4), the gradient boosting model XGBoost is adopted as the precise ranking model, specifically: a number of regression decision trees are trained, the learning target of each tree is the error of the previous trees, and the accumulation of the outputs of all trees is the final confidence score. If a gradient boosting model is constructed for u samples, the loss function L^(t) of the t-th decision tree is:

L^(t) = Σ_{i=1..u} l(vi, v̂i^(t-1) + f(Tree_t, xi)) + Ω(Tree_t)

where l(·,·) is the square loss function, vi is the label of sample xi, f(Tree_t, xi) is the predicted value of the t-th decision tree for xi, v̂i^(t-1) is the predicted value of the first t-1 decision trees for xi, and

Ω(Tree_t) = γ·|Tree_t| + (λ/2)·Σ_k wk²

is a regularization term representing the complexity of the decision tree, where |Tree_t| is the number of leaf nodes of the t-th decision tree, wk is the predicted value of the k-th leaf node, and γ and λ are weight parameters.

In the process of training the precise ranking model, the input training data is a data set consisting of the original clinical concepts paired with the standard term positive samples in their prediction candidate sets, {(xi, yi^+)}, i = 1, ..., u. If the trained precise ranking model comprises T decision trees, the confidence score r(xi, yi^+) of the medical term standardization result for a sample (xi, yi^+) is:

r(xi, yi^+) = Σ_{t=1..T} f(Tree_t, (xi, yi^+))
Further, samples whose confidence scores of the medical term standardization results output by the precise ranking module satisfy the condition are fused into the training candidate set, and the parameters of the term standardization model and the precise ranking model are updated.
Further, the method also comprises a direct superior term retrieval function: acquiring a group of standard terms with the highest confidence scores predicted by an accurate sequencing model for the original clinical concept, and generating a path traced back to the upper level in the hierarchical structure of the standard term table; and determining the direct superior terms corresponding to the original clinical concept based on the principle of majority voting.
The invention has the following beneficial effects: an automatic medical term standardization model can be realized with little labeled data, and the model retains the ability to be quickly updated and upgraded, greatly reducing the workload of manual intervention while ensuring the accuracy of the output results; a new clinical concept can be matched to its direct superior term and assigned an accurate position in the standard term table, ensuring the integrity and uniformity of the standardized results.
Drawings
FIG. 1 is a block diagram of an automatic standardization system for medical terms fusing self-supervision and active learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an implementation of a candidate set generation module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an implementation of an auto-supervised learning module and an active learning module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an implementation of a direct superior term retrieval module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a training process of a decision tree-based precision ranking model according to an embodiment of the present invention;
fig. 6 is a schematic diagram of direct superior term retrieval according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
In the invention, self-supervised learning means: mining supervision information from large-scale unlabeled data by means of auxiliary (pretext) tasks, and using the constructed supervision information to train the network so that it learns features valuable for downstream tasks. There are three main approaches to self-supervised learning: context-based learning, time-series-based learning, and contrastive learning, where contrastive learning builds representations by learning to encode the similarity or dissimilarity of two things.
Active learning means: the main goal is to reduce the cost of manually annotating data. Samples that the model finds difficult or ambiguous to classify are obtained through a machine learning method; such data is generally considered to lie near the boundaries between classes and can therefore give the model greater help in accurately learning the features of different classes. By manually re-confirming and auditing these samples, the model's performance can be improved more significantly for the same amount of labeled data.
Semi-supervised learning refers to: the learner automatically exploits unlabeled samples, without external interaction, to improve learning performance. Self-training is a particular implementation of semi-supervised learning: assuming that similar samples have similar outputs, an initial model is first trained with labeled samples; the model then predicts labels for unlabeled samples, samples whose predictions have higher confidence are screened out according to some criterion, and the predicted soft or hard labels are used as new labeled data to expand the training set.
Medical term standardization refers to: the process of unifying medical terms within a certain scope by establishing medical term standards, using standardization principles and methods, so as to obtain optimal order and social benefit. Establishing unified medical term standards and term sets helps solve problems such as term duplication and inconsistent connotation, semantic expression, and understanding, and is of great significance for effectively promoting the dissemination, sharing, and use of medical information on a wider scale and at a deeper level.
The embodiment of the invention provides an automatic medical term standardization system integrating self-supervision and active learning, which comprises the following modules as shown in figure 1:
a candidate set generation module: sampling negative samples based on a hierarchical structure of a text correlation model and a standard glossary to generate a training candidate set, and sampling possible positive samples based on the text correlation model to generate a prediction candidate set;
a self-supervised learning module for training the term standardized model;
the active learning module is realized on the basis of the principles of maximum entropy, minimum confidence coefficient and the like;
a precision ranking module for comprehensive evaluation of term normalized model predictions from text and semantic dimensions.
Preferably, the system further comprises: and fusing the sample of which the confidence score of the medical term standardization result output by the precise ordering module meets the condition to a semi-supervised learning module of the training candidate set.
Preferably, the system further comprises: directly superior term retrieval module.
Specifically, the candidate set generation module is composed of two parts: in the term standardization model training process, sampling is carried out based on the text correlation BM25 model and the hierarchical structure of a standard term table, and standard terms which are close to but not identical with the original clinical concept as much as possible are obtained as negative sample standard terms; in the term standardization model prediction process, possible positive sample standard terms are generated based on the text relevance BM25 model, and the detailed flow is shown in FIG. 2.
Specifically, the self-supervision learning module mainly comprises the following three steps:
1. training a Chinese medical language model, preferably a bidirectional autoregressive language model (BERT), by a self-adaptive method, and further acquiring semantic vectors of original clinical concepts and standard terms;
2. respectively calculating the semantic similarity of the labeled original clinical concept and the label thereof and the negative sample of the training candidate set through a semantic matching model;
3. The loss function of the standardization model is calculated from the semantic similarity using a self-supervised learning approach (preferably self-supervised contrastive learning), as shown in the left part of FIG. 3.
Specifically, the active learning module mainly includes the following two steps:
1. calculating a semantic similarity score by using the unlabeled original clinical concepts and the standard terms of the prediction candidate set;
2. and screening out a group of samples with the most uncertain current term standardization model according to the active learning standard, determining labels of the samples, and then merging the samples into a training candidate set, wherein the labels are shown in the right part of the figure 3.
Specifically, the precise sorting module mainly comprises the following two steps:
1. firstly, acquiring semantic similarity scores of an original clinical concept and a standard term output by an automatic supervision learning module as semantic features, and calculating text features, wherein the text features comprise the literal similarity of the original clinical concept and the standard term, word co-occurrence frequency, the difference of the number of contained words and the like;
2. a regression decision tree-based precision ranking model is then trained based on these features for computing a confidence score for the medical term normalization result.
Specifically, the main function of the semi-supervised learning module is to screen out, based on the confidence scores output by the precise ranking module, a group of samples about which the current term standardization model is most certain, and to expand the training candidate set with them.
Specifically, the direct superior term retrieval module mainly includes the following two steps:
1. firstly, acquiring a group of standard terms with the highest confidence scores predicted by a precise ordering model for an original clinical concept, and generating a path traced back to the upper level in the hierarchical structure of a standard term table;
2. and then determining the direct superior terms corresponding to the original clinical concept based on the principle of majority voting, as shown in fig. 4.
The embodiment of the invention provides a medical term automatic standardization method integrating self-supervision and active learning, which comprises the following specific implementation steps:
generating a negative sample and a positive sample, and respectively constructing a training candidate set and a prediction candidate set, specifically: sampling negative samples based on a hierarchical structure of a text correlation model and a standard glossary to generate a training candidate set, and sampling possible positive samples based on the text correlation model to generate a prediction candidate set; more specifically, with reference to fig. 2, the following sub-steps are included:
1) the training candidate set of the term normalization model consists of a large number of original clinical concepts x and their corresponding standard terms y. In term normalization model training, a set of negative samples is first sampled in a standard term that has a different meaning than x. In order for the term-normalized model to learn more from negative examples, the sampling process needs to obtain standard terms that are as close as possible to, but not exactly the same as, the meaning of the original clinical concept. Some standard nomenclature exists hierarchically, for example, the disease nomenclature table ICD-10 encodes "oral, esophageal and gastric carcinoma in situ" as "D00" with the next-level nomenclature of "carcinoma in situ of the lip, oral and pharynx (D00.0)", "esophageal carcinoma in situ (D00.2)" and the like, and the next-level nomenclature of "carcinoma in situ of the tonsil (D00.001)", "carcinoma in situ of the lip (D00.002)" and the like. The operations were performed in the following order:
① if y has a direct superior term Y1, all next-level terms of Y1 are taken and denoted as the set M;
② if y has no direct superior term but has a second-level superior term Y2, all next-level and next-next-level terms of Y2 are taken and denoted as the set M;
③ otherwise, the whole standard term table is denoted as the set M.
Then, the text relevance score of x and any standard term m in M is calculated; the formula for the text relevance Score(x, m) of the BM25 text relevance model is:

Score(x, m) = Σ_i IDF(qi) · wi · fi·(k1 + 1) / ( fi + k1·(1 - b + b·len/avglen) )

where IDF(qi) is the IDF value of the word qi of x, fi is the frequency with which qi occurs in m, len is the length of m, avglen is the average length of all standard terms in M, wi is the weight of the word, and k1 and b are empirically specified parameters; in this embodiment, k1 = 1 and b = 0.5.
The negative sample set {y1^-, y2^-, ..., yn^-} is selected by sorting according to the text relevance scores, obtaining the training candidate set {(x, y), (x, y1^-), (x, y2^-), ..., (x, yn^-)}.
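For illustration, a minimal Python sketch of this BM25-style scoring and negative-sample selection follows. The function names, tokenization into characters, and the IDF table are assumptions for the example (the per-word weight wi is omitted, i.e. taken as 1); it is a sketch of the sampling step described above, not the patent's implementation.

```python
import math
from collections import Counter

def bm25_score(query_chars, term_chars, idf, avglen, k1=1.0, b=0.5):
    """BM25-style text relevance Score(x, m) as described above (illustrative).

    query_chars: characters q_i of the original clinical concept x
    term_chars:  characters of a candidate standard term m
    idf:         dict mapping a character to its IDF value over the term table
    avglen:      average length of all standard terms in the candidate set M
    """
    tf = Counter(term_chars)           # f_i: frequency of q_i in m
    length = len(term_chars)           # len: length of m
    score = 0.0
    for q in query_chars:
        f = tf.get(q, 0)
        denom = f + k1 * (1 - b + b * length / avglen)
        score += idf.get(q, 0.0) * (f * (k1 + 1)) / denom
    return score

def top_text_matches(x, candidate_terms, idf, n=20):
    """Rank candidate standard terms by BM25 score and keep the top n:
    used both for negative sampling (terms close to x in surface text but
    different in meaning) and for building the prediction candidate set."""
    avglen = sum(len(m) for m in candidate_terms) / max(len(candidate_terms), 1)
    ranked = sorted(candidate_terms,
                    key=lambda m: bm25_score(list(x), list(m), idf, avglen),
                    reverse=True)
    return ranked[:n]
```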
2) When the term standardization model makes predictions, for an unlabeled original clinical concept x, a group of most likely standard terms is first screened out from the whole standard term table as a prediction candidate set; the term standardization model then only needs to compute the semantic similarity scores between the original clinical concept and the standard terms in the prediction candidate set, rather than over the whole standard term table, which avoids a large amount of useless computation and improves prediction efficiency. In this case the whole standard term table is denoted as the set M, and a positive sample set {y1^+, y2^+, ..., ym^+} is selected from M using the text relevance Score(x, m), deriving the prediction candidate set {(x, y1^+), (x, y2^+), ..., (x, ym^+)}.
Secondly, training a term standardization model through self-supervision learning, specifically: training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms; respectively calculating the semantic similarity of the labeled original clinical concept and the label thereof and the negative sample of the training candidate set through a semantic matching model; adopting a self-supervision learning mode, and calculating a loss function of a normalized model according to the semantic similarity; more specifically, the following substeps are included:
1) The term standardization model consists of a bidirectional autoregressive language model and a semantic matching model. The bidirectional autoregressive language model performs autoregressive training of semantic units based on forward and reverse contexts, and can learn efficient semantic vector representations while modeling natural language. In the multi-layer bidirectional autoregressive language model, the input of the next layer is derived from a self-attention mechanism over the hidden state of the previous layer:

Z = softmax( Q·K^T / sqrt(d_k) ) · V

where Q = h·W_Q, K = h·W_K, V = h·W_V are the vectors obtained from the hidden state h of the previous layer after matrix transformation, d_k is the dimension of h, Z is the input of the next layer, and W_Q, W_K, W_V are matrices obtained by training. The hidden state FFN(Z) of the next layer is obtained through the following nonlinear transformation:

FFN(Z) = max(0, Z·W1 + b1)·W2 + b2

where W1 and W2 are matrices obtained by training, and b1 and b2 are vectors obtained by training.
When term standardization is performed, the semantic vectors of the original clinical concepts and the standard terms can be derived from the bidirectional autoregressive language model. Specifically: the original clinical concept x and any standard term y* (either a positive or a negative sample) are concatenated character by character, a separator token [SEP] is added at the junction, and a start token [S] is added at the leftmost position. For example, if the original clinical operation concept "fallopian tube resection" corresponds to the positive-sample standard term "bilateral salpingectomy (66.51)" in the ICD-9-CM-3 term table, the positive sample is spliced into "[S] fallopian tube resection [SEP] bilateral salpingectomy". The concatenation result is input into the bidirectional autoregressive language model as a whole sentence, and the last layer of the model outputs the semantic vector of x and y* at the position of the start token [S].
The semantic vector is input into the semantic matching model to compute the semantic similarity. The calculation process of the multi-layer semantic matching model is:

Zi = Wi·h(i-1) + bi,    hi = σ(Zi)

where hi is the hidden state of the i-th layer of the semantic matching model, Zi is its linear output computed from the hidden state h(i-1) of the previous layer, σ(·) is a nonlinear activation, and Wi and bi are parameters obtained by training. The output dimension of the last layer (denoted ZL = [zL,0, zL,1]) is set to 2, and the two scores are obtained through a nonlinear (softmax) transformation:

s1(x, y*) = exp(zL,1) / ( exp(zL,0) + exp(zL,1) ),    s0(x, y*) = exp(zL,0) / ( exp(zL,0) + exp(zL,1) )

In the output, s1(x, y*) is the semantic similarity score of x and y*, and s0(x, y*) is their degree-of-difference score.
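A minimal PyTorch sketch of such a pair-encoding matcher follows. The encoder interface (returning one hidden vector per token), the hidden sizes, and the two-way softmax head are illustrative assumptions made for this example; any BERT-style bidirectional encoder could fill the `encoder` slot.

```python
import torch
import torch.nn as nn

class TermMatcher(nn.Module):
    """Sketch: encode the pair "[S] x [SEP] y*" with a BERT-style encoder and
    score it with a small matching head whose 2-dimensional output gives the
    dissimilarity score s0 and the similarity score s1."""

    def __init__(self, encoder, hidden_size, mid_size=256):
        super().__init__()
        self.encoder = encoder                     # returns (batch, seq, hidden)
        self.head = nn.Sequential(
            nn.Linear(hidden_size, mid_size),
            nn.ReLU(),
            nn.Linear(mid_size, 2),                # last layer: output dimension 2
        )

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask)
        pair_vec = hidden[:, 0, :]                 # semantic vector at the start token [S]
        logits = self.head(pair_vec)               # (batch, 2)
        probs = torch.softmax(logits, dim=-1)
        s0, s1 = probs[:, 0], probs[:, 1]          # difference / similarity scores
        return s0, s1
```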
2) The semantic matching model is trained in a self-supervised manner, so that the model autonomously learns the shared features of synonymous data from a large amount of data, alleviating the problem of insufficient labeled training data. In the specific implementation, contrastive learning is adopted: the model focuses on learning the common features of synonymous terms while distinguishing terms with different meanings. Let the label of the original clinical concept x be the standard term y, and let the training candidate set {(x, y), (x, y1^-), ..., (x, yn^-)} be obtained through the first step. The global loss function L is constructed as:

L = -E[ log( exp(s1(x, y)/τ) / ( exp(s1(x, y)/τ) + Σ_{i=1..n} exp(s1(x, yi^-)/τ) ) ) ]

where E[·] denotes the expectation over the labeled original clinical concepts, and τ is an empirically specified temperature parameter; in this embodiment τ = 0.9. The term standardization model parameters are updated by gradient backpropagation using this loss function.
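The loss above is an InfoNCE-style contrastive objective; a short sketch of how it might be computed from the similarity scores is given below. The tensor shapes and the use of cross-entropy over the positive/negative logits are assumptions for this example, not a quotation of the patent's code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(s1_pos, s1_negs, tau=0.9):
    """s1_pos:  (batch,)    similarity s1(x, y) with the labeled standard term
       s1_negs: (batch, n)  similarities s1(x, y_i^-) with the sampled negatives
       tau:     temperature, 0.9 in the embodiment."""
    logits = torch.cat([s1_pos.unsqueeze(1), s1_negs], dim=1) / tau   # (batch, 1 + n)
    # the positive pair sits at index 0 of every row
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```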
Thirdly, rapidly upgrading a term standardization model through active learning, specifically: calculating a semantic similarity score by using the unlabeled original clinical concepts and the standard terms of the prediction candidate set; and screening a group of samples with most uncertain current term standardization models according to the active learning standard, and fusing a training candidate set after determining labels of the samples. The more specific implementation principle and process are as follows:
Obtaining a more effective model with as little labeled data as possible is a problem faced by many machine learning algorithms. The idea of active learning is that samples about which the current model is most ambiguous carry the most information; screening such samples and generating labels for them improves the performance of the model the most for the same amount of data. The unlabeled original clinical concept x is passed through the first step to obtain a prediction candidate set {(x, y1^+), (x, y2^+), ..., (x, ym^+)}; the semantic matching model is used to compute the semantic similarity scores s1(x, yj^+), j = 1, ..., m, which are normalized into a probability distribution:

pj = s1(x, yj^+) / Σ_k s1(x, yk^+)

The uncertainty C(x) of the term standardization model for x is calculated as follows:

C(x) = w1·ent(x) + w2·margin(x) + w3·lc(x) + w4·freq(x)

where ent(x) is the information entropy of the term standardization model for x:

ent(x) = -Σ_j pj·log(pj)

margin(x) is the margin term:

margin(x) = -(p1 - p2)

where p1 and p2 are respectively the largest and second-largest of all pj;

lc(x) is the confidence term:

lc(x) = -p1

freq(x) is the frequency with which x occurs in the original clinical text data; generating correct labels for high-frequency original clinical concepts helps the term standardization model better learn the distribution of the whole set of original clinical concepts.

(wi), i = 1, 2, 3, 4, are the weights of the respective features; in this embodiment, w1 = 0.45, w2 = 0.2, w3 = 0.2, w4 = 0.15.

The active learning process screens out the original clinical concepts with the highest C(x) values, manually determines their labels, and merges them into the training candidate set for retraining the term standardization model; repeating this process yields the best term standardization model with the least labeled data.
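A short Python sketch of the uncertainty computation and sample selection follows. The frequency term is assumed to be pre-normalized, and the selection budget is an illustrative parameter; both are assumptions for the example.

```python
import numpy as np

def uncertainty(p, freq, w=(0.45, 0.2, 0.2, 0.15), eps=1e-12):
    """C(x) = w1*ent(x) + w2*margin(x) + w3*lc(x) + w4*freq(x).

    p:    probability distribution over the prediction candidate set of x
    freq: frequency of x in the original clinical text data (assumed normalized)
    """
    p = np.asarray(p, dtype=float)
    ent = -np.sum(p * np.log(p + eps))              # information entropy ent(x)
    top = np.sort(p)[::-1]
    p1, p2 = top[0], (top[1] if top.size > 1 else 0.0)
    margin = -(p1 - p2)                              # margin(x) = -(p1 - p2)
    lc = -p1                                         # lc(x) = -p1
    return w[0] * ent + w[1] * margin + w[2] * lc + w[3] * freq

def select_for_labeling(concepts, probs, freqs, budget=100):
    """Pick the `budget` concepts the current model is most uncertain about."""
    scores = [uncertainty(p, f) for p, f in zip(probs, freqs)]
    order = np.argsort(scores)[::-1]
    return [concepts[i] for i in order[:budget]]
```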
Training a precise ordering model, and comprehensively evaluating the prediction result of the term standardization model from text and semantic dimensions, specifically: acquiring the semantic similarity score of the original clinical concept and the standard term output by the self-supervision learning in the step two as a semantic feature, and calculating a text feature; a regression decision tree-based precision ranking model is trained based on the features for computing a confidence score for the medical term normalization result. The more specific implementation principle and process are as follows:
The accuracy of the term standardization results must be considered before putting them into practical application, especially in the initial stage when the whole system still lacks training data. The traditional approach is to manually verify the predictions of the term standardization model again, which usually consumes considerable manpower. The precise ranking model designed here, which comprehensively considers semantic, text, and other features, helps correctly predicted standard terms obtain a higher rank and thus effectively addresses this problem; as the term standardization model is iteratively upgraded through self-supervision and active learning, the weight of the semantic features in the precise ranking model can be gradually increased. Preferably, the gradient boosting model XGBoost is used as the precise ranking model. Its basic idea is to train a number of regression decision trees, where the learning target of each tree is the error of the previous trees, and the accumulation of the outputs of all trees is the final confidence score, as shown in FIG. 5. If a gradient boosting model is constructed for u samples, the loss function L^(t) of the t-th decision tree is:

L^(t) = Σ_{i=1..u} l(vi, v̂i^(t-1) + f(Tree_t, xi)) + Ω(Tree_t)

where l(·,·) is the square loss function, vi is the label of sample xi, f(Tree_t, xi) is the predicted value of the t-th decision tree for xi, v̂i^(t-1) is the accumulated predicted value of the first t-1 decision trees for xi, and

Ω(Tree_t) = γ·|Tree_t| + (λ/2)·Σ_k wk²

is a regularization term representing the complexity of the decision tree, where |Tree_t| is the number of leaf nodes of the t-th decision tree, wk is the predicted value of the k-th leaf node, and γ and λ are weight parameters; in this embodiment, γ = 0.1 and λ = 0.9.
In the process of training the precise ranking model, input training data is a data set consisting of standard term positive samples in a prediction candidate set of an original clinical concept and a term standardization model:
Figure GDA0003300318080000115
labels v for each set of samplesiThe characteristics involved in training are shown in table 1, either 0 (wrong standard terms) or 1 (correct standard terms). If the trained accurate sequencing model comprises T decision trees, the samples are subjected to
Figure GDA0003300318080000116
Computing a confidence score for a medical term normalized result
Figure GDA0003300318080000117
Comprises the following steps:
Figure GDA0003300318080000121
TABLE 1 characteristics adopted by the precision ranking model
Figure GDA0003300318080000122
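A brief sketch using the XGBoost library is shown below. The feature list, the number of trees, and the mapping of the embodiment's γ and λ onto the library's `gamma` and `reg_lambda` parameters are assumptions for this example; it illustrates the gradient-boosted regression idea rather than reproducing the patent's exact configuration.

```python
import numpy as np
import xgboost as xgb

def build_features(semantic_score, literal_sim, co_occurrence, len_diff):
    """One feature row per (original concept, candidate standard term) pair:
    the semantic similarity plus text features (illustrative subset of Table 1)."""
    return [semantic_score, literal_sim, co_occurrence, len_diff]

def train_ranker(X, v):
    """X: feature rows; v: labels (1 = correct standard term, 0 = wrong)."""
    ranker = xgb.XGBRegressor(
        n_estimators=200,                  # T decision trees (assumed value)
        objective="reg:squarederror",      # square loss l(.)
        gamma=0.1,
        reg_lambda=0.9,
    )
    ranker.fit(np.asarray(X, dtype=float), np.asarray(v, dtype=float))
    return ranker

def confidence_scores(ranker, candidate_features):
    """predict() accumulates the outputs of all trees: the confidence score r(x, y+)."""
    return ranker.predict(np.asarray(candidate_features, dtype=float))
```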
And fifthly, predicting the final term standardization result. The trained precise ranking model is used to calculate a confidence score r(x, yj^+) for each standard term positive sample in the prediction candidate set {(x, y1^+), ..., (x, ym^+)}; the standard term with the largest confidence score, ŷ = argmax_j r(x, yj^+), is then taken, and ŷ can be regarded as the synonymous standard term corresponding to x.
And sixthly, screening samples whose prediction results have higher confidence scores for semi-supervised self-training. Specifically: a strict threshold θ is set for the confidence score predicted by the precise ranking model. Let the standard term predicted by the precise ranking model for the original clinical concept x be ŷ, with output confidence score r(x, ŷ); if r(x, ŷ) > θ, then (x, ŷ) is added to the original training candidate set, and the parameters of the term standardization model and the precise ranking model are updated.
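A minimal sketch of one self-training round is given below. The threshold value and the shape of `predict_fn` are assumptions; the patent only requires the threshold to be strict.

```python
def self_training_step(unlabeled_concepts, predict_fn, threshold, train_candidates):
    """Keep only predictions whose precise-ranking confidence exceeds a strict
    threshold and add them as new pseudo-labeled training pairs.
    `predict_fn(x)` is assumed to return (standard_term, confidence)."""
    newly_labeled = []
    for x in unlabeled_concepts:
        y_hat, conf = predict_fn(x)
        if conf > threshold:
            newly_labeled.append((x, y_hat))
    train_candidates.extend(newly_labeled)
    # the caller then retrains the term standardization and precise ranking models
    return newly_labeled
```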
Seventhly, searching the direct superior terms, specifically: acquiring a group of standard terms with the highest confidence scores predicted by an accurate sequencing model for the original clinical concept, and generating a path traced back to the upper level in the hierarchical structure of the standard term table; and determining the direct superior terms corresponding to the original clinical concept based on the principle of majority voting. More specifically, the implementation process is as follows:
For an original clinical concept x whose output confidence score r(x, ŷ) is low, this indicates that the standard term table may not contain a standard term with the same meaning as x, and an upward search in the standard term table is needed to locate the direct superior term of x. The precise ranking model is used to calculate and sort the confidence scores of x with the standard terms in the prediction candidate set, and the k standard terms with the highest confidence scores, {ŷ1, ..., ŷk}, are selected; starting from the code of each ŷj in the standard term table, the path tracing back to the upper levels is marked. For example, for the original disease concept "right synovitis", the k = 5 standard terms with the highest confidence scores are shown in Table 2. Starting from the codes of the terms in the table, the path traced back to the upper levels and every intermediate node it passes through are marked in the standard term table, and for each standard term node on a backtracking path (denoted node_j) the number of times the node is passed is recorded (denoted count(node_j)), as shown in FIG. 6, where the number in each node of the graph is count(node_j). Then, searching from the lower levels toward the upper levels along the backtracking paths, the first standard term node_j encountered that satisfies the majority-vote condition count(node_j) > k/2 can be regarded as the direct superior standard term of x. For example, the first node encountered in FIG. 6 during the search from the lower levels upward that satisfies this condition is "synovitis and tenosynovitis (M65.9)", indicating that the original clinical concept "right synovitis" should be merged into the standard term table as a direct subordinate term of that term.
Table 2 The standard terms with the highest confidence scores for the original concept "right synovitis"

Standard term name | Standard term code
Synovitis | M65.909
Infectious synovitis | M65.101
Synovitis of shoulder joint | M65.901
Synovitis and tenosynovitis | M65.9
Other synovitis and tenosynovitis | M65.8
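Building on the backtracking example above, a short Python sketch of the upward retrieval is given below. The `parent_of` mapping, the inclusion of each top-k node itself on its own path, and the explicit count(node) > k/2 test are assumptions used to illustrate the majority-vote idea; with the five M65.* codes of Table 2 and a suitable hierarchy map, the deepest node passed by more than half of the paths would be M65.9, matching the example.

```python
from collections import Counter

def parent_chain(code, parent_of):
    """All ancestors of a standard-term code, nearest first.
    `parent_of` maps a code to its direct superior code (None at the root)."""
    chain = []
    node = parent_of.get(code)
    while node is not None:
        chain.append(node)
        node = parent_of.get(node)
    return chain

def direct_superior(top_k_codes, parent_of):
    """Mark every node on each backtracking path (including the code itself),
    then search from the lower levels upward for the first node passed by more
    than half of the k top-ranked terms (majority voting)."""
    counts = Counter()
    for code in top_k_codes:
        for node in [code] + parent_chain(code, parent_of):
            counts[node] += 1
    k = len(top_k_codes)
    # deeper nodes have longer ancestor chains, so visit them before their ancestors
    for node in sorted(counts, key=lambda n: len(parent_chain(n, parent_of)), reverse=True):
        if counts[node] > k / 2:
            return node
    return None
```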
In summary, the invention designs a self-supervised learning method for medical term standardization and realizes a high-accuracy medical term standardization model with little labeled data; an active learning function built on the term standardization process allows the model to be upgraded rapidly and automatically; a candidate sample generation function designed around the characteristics of the standard term table ensures that the candidate samples carry enough information; a precise ranking function for the predictions of the medical term standardization model, integrating semantic and text features, further reduces manual intervention; and a direct superior term retrieval function for original clinical concepts, built on the precise ranking results, ensures the integrity and uniformity of the medical term standardization results.
The foregoing is only a preferred embodiment of the present invention; although the invention has been disclosed in terms of preferred embodiments, they are not intended to limit it. Using the methods and technical content disclosed above, those skilled in the art can make numerous possible variations and modifications to the technical solution of the invention, or rework it into equivalent embodiments with equivalent changes, without departing from the scope of the technical solution of the invention. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the invention, still falls within the protection scope of the technical solution of the invention.

Claims (10)

1. An automatic medical term standardization system combining self-supervision and active learning, comprising:
(1) a candidate set generation module: sampling negative samples based on a hierarchical structure of a text correlation model and a standard glossary to generate a training candidate set, and sampling possible positive samples based on the text correlation model to generate a prediction candidate set;
(2) the self-supervision learning module: for training a term normalized model, comprising:
training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms;
respectively calculating the semantic similarity of the labeled original clinical concept and the label thereof and the negative sample of the training candidate set through a semantic matching model;
adopting a self-supervision learning mode, and calculating a loss function of a normalized model according to the semantic similarity;
(3) an active learning module: calculating a semantic similarity score by using the unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out a group of samples with most uncertain current term standardized models according to the active learning standard, and fusing the samples into a training candidate set after determining labels of the samples;
(4) the accurate sequencing module: acquiring semantic similarity scores of the original clinical concept and standard terms output by the self-supervision learning module as semantic features, calculating text features, training a regression decision tree-based accurate sequencing model based on the semantic and text features, and calculating confidence scores of medical term standardization results; and calculating a confidence score for the standard term positive sample in the prediction candidate set by using the trained accurate sequencing model to obtain the standard term with the maximum confidence score.
2. The system of claim 1, further comprising a semi-supervised learning module, which fuses the sample with the confidence score of the result of the medical term standardization output by the precise ranking module satisfying the condition to the training candidate set.
3. The system of claim 1, further comprising a direct superior term retrieval module, the direct superior term retrieval module comprising: acquiring a group of standard terms with the highest confidence scores predicted by an accurate sequencing model for the original clinical concept, and generating a path traced back to the upper level in the hierarchical structure of the standard term table; and determining the direct superior terms corresponding to the original clinical concept based on the principle of majority voting.
4. A method for automatically normalizing medical terms fusing self-supervision and active learning, the method comprising the steps of:
generating a negative sample and a positive sample, and respectively constructing a training candidate set and a prediction candidate set: sampling negative samples based on a hierarchical structure of a text correlation model and a standard glossary to generate a training candidate set, and sampling possible positive samples based on the text correlation model to generate a prediction candidate set;
step (2) training a term standardization model through self-supervision learning: training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms; respectively calculating the semantic similarity of the labeled original clinical concept and the label thereof and the negative sample of the training candidate set through a semantic matching model; adopting a self-supervision learning mode, and calculating a loss function of a normalized model according to the semantic similarity;
and (3) rapidly upgrading the term standardized model through active learning: calculating a semantic similarity score by using the unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out a group of samples with most uncertain current term standardized models according to the active learning standard, and fusing the samples into a training candidate set after determining labels of the samples;
training an accurate sequencing model, and comprehensively evaluating the prediction result of the term standardized model from text and semantic dimensions: acquiring semantic similarity scores of the original clinical concept and the standard terms output in the step (2) by self-supervision learning as semantic features, and calculating text features; training a regression decision tree-based accurate sequencing model based on semantic and text features for calculating a confidence score of a medical term standardization result;
step (5) predicting the final term normalization result: and calculating confidence scores for the standard term positive samples in the prediction candidate set by using the trained accurate ranking model, and taking the standard term with the maximum confidence score as a term standardization result.
5. The method for automatically standardizing medical terms fused with self-supervision and active learning according to claim 4, wherein the step (1) comprises:
(1.1) training candidate set: the training candidate set is constructed from a labeled original clinical concept x and its corresponding standard term Y; if Y has a direct superior term Y1, all terms one level below Y1 are taken and denoted as a set M; if Y has no first-level superior term but has a second-level superior term Y2, all terms one and two levels below Y2 are taken and denoted as the set M; otherwise, the whole standard term table is denoted as the set M; the text relevance score between x and every standard term m in M is computed, the terms in M are sorted by this score, and a negative sample set
{y1⁻, y2⁻, …, yn⁻}
is selected; the standard term Y together with these negative samples forms the training candidate set of x;
(1.2) prediction candidate set: when the term standardization model makes predictions, for an unlabeled original clinical concept x the whole standard term table is denoted as the set M, and a positive sample set
{y1⁺, y2⁺, …, ym⁺}
is selected from M by the text relevance score; these positive samples form the prediction candidate set of x.
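A minimal sketch of the candidate set construction in (1.1) and (1.2), in which a simple character-overlap ratio stands in for the text relevance model and a dictionary-based hierarchy stands in for the standard term table; all identifiers and the value of k are illustrative assumptions:

    def text_relevance(x, m):
        # Stand-in text relevance score: character overlap ratio between the two strings.
        return len(set(x) & set(m)) / max(len(set(x) | set(m)), 1)

    def training_candidate_set(x, Y, parent_of, children_of, all_terms, k=5):
        # Restrict negative sampling to the siblings of the gold standard term Y when it
        # has a direct superior term; otherwise fall back to the whole standard term table.
        pool = children_of.get(parent_of[Y], all_terms) if Y in parent_of else all_terms
        negatives = sorted((m for m in pool if m != Y),
                           key=lambda m: text_relevance(x, m), reverse=True)[:k]
        return [Y] + negatives          # gold term plus the hardest negative samples

    def prediction_candidate_set(x, all_terms, k=10):
        # At prediction time the whole standard term table is the pool of possible positives.
        return sorted(all_terms, key=lambda m: text_relevance(x, m), reverse=True)[:k]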
6. The method according to claim 4, wherein in the step (2), the Chinese medical language model is a bidirectional autoregressive language model, specifically: the original clinical concept x and any standard term y* are concatenated character by character, a separator token [SEP] is added at the junction, and a start token [S] is added at the leftmost position; the concatenated result is input into the bidirectional autoregressive language model as one sentence, and the output of the last layer of the model at the position of the start token [S] is the semantic vector of x and y*.
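A sketch of this concatenation step using the Hugging Face transformers API, with a general-purpose Chinese BERT checkpoint standing in for the Chinese medical language model and the [CLS] position standing in for the start character [S]; both substitutions are assumptions made for illustration:

    import torch
    from transformers import AutoTokenizer, AutoModel

    # Stand-in checkpoint; the patent's own adaptively trained medical model would replace it.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    encoder = AutoModel.from_pretrained("bert-base-chinese")

    def semantic_vector(concept, standard_term):
        # Concatenate the clinical concept and a candidate standard term with a separator
        # token, prepend a start token, and read the last-layer hidden state at that
        # start position as the joint semantic vector of the pair.
        inputs = tokenizer(concept, standard_term, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state    # (1, seq_len, hidden_size)
        return hidden[0, 0]                                  # vector at the leading token

    vec = semantic_vector("心梗", "急性心肌梗死")
    print(vec.shape)    # torch.Size([768]) for a base-size encoder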
7. The method for automatic medical term standardization integrating self-supervision and active learning according to claim 5, wherein in the step (3), an unlabeled original clinical concept x is passed through the step (1) to obtain its prediction candidate set {y1⁺, y2⁺, …, ym⁺}; semantic similarity scores si = sim(x, yi⁺), i = 1, …, m, are computed with the semantic matching model and normalized into a probability distribution:
pi = exp(si) / Σj exp(sj)
the uncertainty C(x) of the term standardization model about x is then computed as:
C(x) = w1·ent(x) + w2·margin(x) + w3·lc(x) + w4·freq(x)
where ent(x) is the information entropy of the term standardization model over x:
ent(x) = -Σi pi·log(pi)
margin(x) is the margin term:
margin(x) = -(p1 - p2)
where p1 and p2 are the largest and second-largest of all the pi, respectively;
lc(x) is the least-confidence term:
lc(x) = -p1
freq(x) is the occurrence frequency of x in the original clinical text data; and
wi, i = 1, 2, 3, 4, are the weights of the respective features.
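A minimal sketch of the uncertainty score C(x), assuming the semantic similarity scores for one unlabeled concept are already available and using illustrative weights; the softmax normalization mirrors the probability distribution above:

    import math

    def uncertainty(scores, freq, w=(0.4, 0.3, 0.2, 0.1)):
        # scores: semantic similarity scores of x against its prediction candidate set
        # freq:   occurrence frequency of x in the original clinical text data
        exps = [math.exp(s) for s in scores]
        p = sorted((e / sum(exps) for e in exps), reverse=True)   # probabilities, descending
        ent = -sum(pi * math.log(pi) for pi in p if pi > 0)       # information entropy
        margin = -(p[0] - p[1])                                   # negative top-two margin
        lc = -p[0]                                                # negative top confidence
        w1, w2, w3, w4 = w
        return w1 * ent + w2 * margin + w3 * lc + w4 * freq

    # Candidates with near-identical scores make the model uncertain -> higher C(x):
    print(uncertainty([2.1, 2.0, 1.9], freq=0.7))   # close scores, larger value
    print(uncertainty([5.0, 0.5, 0.1], freq=0.7))   # one clear winner, smaller value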
8. The method for automatic medical term standardization integrating self-supervision and active learning according to claim 5, wherein in the step (4), the gradient boosting model XGBoost is adopted as the precise ranking model, specifically: a plurality of regression decision trees are trained, the learning target of each tree being the error left by the preceding trees, and the final confidence score being the sum of the outputs of all the trees; if the gradient boosting model is built on u samples, the loss function L(t) of the t-th decision tree is:
L(t) = Σi=1..u l(vi, Fi(t-1) + f(Treet, xi)) + Ω(Treet)
where l(·) is the squared loss function, vi is the label of sample xi, f(Treet, xi) is the predicted value of the t-th decision tree for xi, Fi(t-1) is the accumulated predicted value of the first t-1 decision trees for xi, and
Ω(Treet) = γ·|Treet| + (λ/2)·Σk wk²
is a regularization term representing the complexity of the decision tree, where |Treet| is the number of leaf nodes of the t-th decision tree, wk is the predicted value of the k-th leaf node, and γ and λ are weight parameters;
in training the precise ranking model, the input training data is the data set formed by the original clinical concepts and the standard term positive samples in the prediction candidate sets produced by the term standardization model:
{(x, yi⁺, vi), i = 1, …, m}
if the trained precise ranking model contains T decision trees, then for a sample (x, yi⁺) the confidence score score(x, yi⁺) of the medical term standardization result is:
score(x, yi⁺) = Σt=1..T f(Treet, (x, yi⁺))
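A sketch of the precise ranking step with the xgboost Python package; the toy features (one semantic score plus two simple text features), the sample data, and the hyperparameters are all illustrative assumptions rather than the configuration of the patent:

    import numpy as np
    import xgboost as xgb

    def features(x, y, sem_score):
        # One semantic feature from the term standardization model plus two text features.
        overlap = len(set(x) & set(y)) / max(len(set(x) | set(y)), 1)
        return [sem_score, overlap, abs(len(x) - len(y))]

    # Toy training data: (concept, candidate standard term, semantic score, label).
    train = [("心梗", "急性心肌梗死", 0.90, 1), ("心梗", "心绞痛", 0.40, 0),
             ("胃穿孔", "胃穿孔", 0.95, 1), ("胃穿孔", "胃溃疡", 0.50, 0)]
    X = np.array([features(x, y, s) for x, y, s, _ in train])
    v = np.array([label for *_, label in train], dtype=float)

    # Squared-error objective matches the regression-tree loss above; reg_lambda and
    # gamma play the roles of the λ and γ regularization weights.
    ranker = xgb.XGBRegressor(n_estimators=50, max_depth=3, learning_rate=0.1,
                              objective="reg:squarederror", reg_lambda=1.0, gamma=0.0)
    ranker.fit(X, v)

    # Step (5): score every positive candidate and keep the best one as the final result.
    candidates = [("心梗", "急性心肌梗死", 0.90), ("心梗", "心绞痛", 0.40)]
    scores = ranker.predict(np.array([features(x, y, s) for x, y, s in candidates]))
    print(candidates[int(np.argmax(scores))][1])    # expected: 急性心肌梗死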
9. The method for automatic medical term standardization integrating self-supervision and active learning according to any one of claims 4-8, wherein samples whose medical term standardization results output by the precise ranking model have confidence scores satisfying a preset condition are merged into the training candidate set, and the parameters of the term standardization model and the precise ranking model are updated.
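A brief sketch of the self-training loop described in claim 9; predict_with_confidence and retrain are placeholder callables for the pipeline assembled in the earlier steps, and the threshold value is an assumption:

    def self_training_round(unlabeled, train_set, predict_with_confidence, retrain,
                            threshold=0.9):
        # predict_with_confidence(x) -> (best standard term, confidence score)
        # retrain(train_set)         -> refit the term standardization and ranking models
        accepted = []
        for x in unlabeled:
            term, score = predict_with_confidence(x)
            if score >= threshold:                  # confidence score satisfies the condition
                train_set.append((x, term))         # merge pseudo-label into training candidates
                accepted.append(x)
        retrain(train_set)
        return train_set, [x for x in unlabeled if x not in accepted]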
10. The method for automatic medical term standardization integrating self-supervision and active learning according to any one of claims 4-8, further comprising direct superior term retrieval: acquiring the group of standard terms predicted with the highest confidence scores by the precise ranking model for an original clinical concept; for each of these standard terms, generating the path traced back to its upper levels in the hierarchical structure of the standard term table; and determining the direct superior term corresponding to the original clinical concept by majority voting over these paths.
CN202110994475.7A 2021-08-27 2021-08-27 Automatic medical term standardization system and method integrating self-supervision and active learning Active CN113436698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110994475.7A CN113436698B (en) 2021-08-27 2021-08-27 Automatic medical term standardization system and method integrating self-supervision and active learning

Publications (2)

Publication Number Publication Date
CN113436698A CN113436698A (en) 2021-09-24
CN113436698B true CN113436698B (en) 2021-12-07

Family

ID=77798234





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant