CN113593661B - Clinical term standardization method, device, electronic equipment and storage medium


Info

Publication number
CN113593661B
CN113593661B
Authority
CN
China
Prior art keywords
sentence
vector
sample
bert model
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110767577.5A
Other languages
Chinese (zh)
Other versions
CN113593661A (en)
Inventor
尹珊珊
舒正
朱波
张骁雅
赵明
刘英杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Guoxin Health Industry Technology Co ltd
Original Assignee
Qingdao Guoxin Health Industry Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Guoxin Health Industry Technology Co ltd filed Critical Qingdao Guoxin Health Industry Technology Co ltd
Priority to CN202110767577.5A
Publication of CN113593661A
Application granted
Publication of CN113593661B

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00: ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60: ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri


Abstract

The invention provides a clinical term standardization method, a device, an electronic device and a storage medium, wherein the method comprises the following steps: acquiring a first sentence and a plurality of second sentences, wherein the first sentence is a sentence to be identified and each second sentence is a standard clinical term; inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors; and performing similarity calculation between the first sentence vector and each second sentence vector, and taking the second sentence corresponding to the highest similarity value as the standard clinical term corresponding to the first sentence. The method provided by the invention can improve the accuracy and efficiency of clinical term standardization.

Description

Clinical term standardization method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a clinical term standardization method, a device, an electronic apparatus, and a storage medium.
Background
With the rapid development of medical informatization, both the variety and the scale of medical data are growing rapidly. Performing data analysis and mining on the medical data held by multiple medical data centers (abbreviated as "multi-centers"), so as to support clinical decision-making, medical management services, scientific research and the like, has therefore become an inevitable trend.
At present, domestic standards for medical terminology are scarce and the standard system is incomplete, while medical information system vendors are numerous. As a result, term names and codes are severely heterogeneous between medical data centers, and even within a single center, and large amounts of semi-structured and unstructured data exist. This mass of distributed, heterogeneous data, information, instruments and systems creates many barriers to the expression, storage, exchange and sharing of medical information and to the cooperative operation of systems. To realize the digitization and informatization of medical care, a set of standard clinical medical terms is clearly required in order to achieve efficient society-wide sharing of medical resources, cross-regional medical care and cross-system medical care. However, the mapping relations among the internationally existing standard clinical term sets are difficult to apply to the standardization of domestic medical terms because of language barriers, so the standardization and sharing of medical data across multiple medical data centers are hard to realize.
In the prior art, data standardization mostly relies on empirical rules; after such simple processing, considerable manual review is still needed, so the workload of mapping personnel is heavy, the efficiency is low, and the accuracy of the resulting mapping relations is poor.
Disclosure of Invention
The invention provides a clinical term standardization method, a device, electronic equipment and a storage medium, which are used to solve the prior-art technical problems of low mapping efficiency and low accuracy caused by relying on empirical rules and manual rechecking during standardization, thereby improving the accuracy and efficiency of the clinical term standardization process.
In a first aspect, the present invention provides a method for normalization of clinical terms comprising:
acquiring a first sentence and a plurality of second sentences; the first sentence is a sentence to be identified, and the second sentence is a standard clinical term;
Inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors;
Respectively carrying out similarity calculation on the first sentence vector and the plurality of second sentence vectors, and taking a second sentence corresponding to the highest similarity value as a standard clinical term corresponding to the first sentence;
the S-Bert model is obtained based on sample sentence pairs and sample category labels in a training mode; wherein the sample sentence pairs comprise a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample sentence to be identified and a plurality of standard clinical terms;
the S-Bert model is used for determining and pooling the semantic symbol sequences of the first sentence and the second sentences.
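The matching step of the first aspect can be sketched in a few lines. The following is a hedged illustration, not the patent's implementation: the toy vectors and the helper name `normalize_term` are assumptions, with plain NumPy arrays standing in for the S-Bert sentence embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two sentence vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def normalize_term(first_vec, second_vecs, second_sentences):
    """Return the standard clinical term whose vector is most similar
    to the vector of the sentence to be identified."""
    scores = [cosine_similarity(first_vec, v) for v in second_vecs]
    best = int(np.argmax(scores))
    return second_sentences[best], scores[best]

# Toy vectors standing in for S-Bert sentence embeddings (hypothetical).
u = np.array([1.0, 0.0, 1.0])
terms = ["lung cancer", "lung cancer postoperative infection"]
vecs = [np.array([1.0, 1.0, 0.0]), np.array([2.0, 0.0, 2.0])]
term, score = normalize_term(u, vecs, terms)
```

With these toy vectors the similarities are 0.5 and 1, matching the magnitudes used in the worked example later in the description.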
According to the clinical term normalization method provided by the invention, the S-Bert model comprises a twin neural network structure and a pooling layer, wherein,
Correspondingly, the inputting the first sentence and the plurality of second sentences into the pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors includes:
Respectively constructing sentence pairs from the first sentence and any one of the plurality of second sentences to obtain a plurality of sentence pairs;
Respectively inputting the plurality of sentence pairs into the twin neural network structure to determine semantic unit symbols and perform symbol processing, so as to obtain the semantic symbol sequence of each sentence pair;
and respectively inputting the semantic symbol sequences of the sentence pairs into the pooling layer to perform average pooling, maximum pooling or initial symbol position pooling processing to obtain a first sentence vector and a plurality of second sentence vectors.
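The sentence-pair construction described above is simply a cross of the first sentence with each second sentence. A minimal sketch (the helper name `build_sentence_pairs` is an assumption for illustration):

```python
def build_sentence_pairs(first_sentence, second_sentences):
    """Pair the sentence to be identified with each standard clinical term."""
    return [(first_sentence, s) for s in second_sentences]

pairs = build_sentence_pairs("ICD", ["lung cancer", "lung cancer postoperative infection"])
```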
According to the clinical term normalization method provided by the invention, the symbol processing comprises the following steps:
Adding a preset initial symbol to the front of the sentence pair;
and/or,
Adding a preset separation symbol between two adjacent sentences in the sentence pair;
and/or,
Adding special symbols to special semantic unit symbols.
According to the clinical term standardization method provided by the invention, before the first sentence and the second sentences are acquired, the S-Bert model is trained according to sample sentence pairs and sample category labels, which comprises the following steps:
Step S1: determining semantic unit symbols and performing symbol processing on a third sentence and a fourth sentence in the sample sentence pair by using the S-Bert model to be trained, to obtain a semantic symbol sequence; wherein the third sentence is the sample sentence to be identified, and the fourth sentence is any one of the plurality of standard clinical terms;
Step S2: performing average pooling, maximum pooling or initial-symbol-position pooling on the semantic symbol sequence to obtain a third sentence vector and a fourth sentence vector of the sample sentence pair;
Step S3: performing score evaluation on the third sentence vector and the fourth sentence vector of the sample sentence pair, and selecting the model with the highest score as the S-Bert model to be trained in the next round;
Step S4: training the S-Bert model to be trained again; when the model training termination condition is not met, adjusting the S-Bert model to be trained and executing step S1 again with the adjusted S-Bert model; when the model training termination condition is met, obtaining the trained S-Bert model.
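The round-based procedure of steps S1 to S4 can be sketched as a loop that keeps the highest-scoring candidate and retrains until a termination condition fires. This is a hedged control-flow sketch only: `train_step`, `evaluate` and `should_stop` are placeholder callbacks, not the patent's actual training routines.

```python
def train_s_bert(model, train_step, evaluate, should_stop, max_rounds=100):
    """Keep the highest-scoring candidate as the model for the next
    round (step S3); adjust and retrain until the termination
    condition is met (step S4)."""
    best_model, best_score = model, evaluate(model)
    for _ in range(max_rounds):
        candidate = train_step(best_model)   # steps S1-S2: encode sample pairs, pool
        score = evaluate(candidate)          # step S3: score evaluation
        if score > best_score:               # keep the highest-scoring model
            best_model, best_score = candidate, score
        if should_stop(best_score):          # step S4: termination condition
            break
    return best_model

# Dummy callbacks: each round "improves" the model by one score point.
result = train_s_bert(model=0,
                      train_step=lambda m: m + 1,
                      evaluate=lambda m: m,
                      should_stop=lambda s: s >= 3)
```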
According to the clinical term standardization method provided by the invention, the adjusting the S-Bert model to be trained comprises the following steps:
obtaining a difference vector according to the third sentence vector and the fourth sentence vector of the sample sentence pair;
Splicing the third sentence vector, the fourth sentence vector and the difference vector of the sample sentence pair to obtain a fifth sentence vector;
Optimizing the S-Bert model to be trained according to the fifth sentence vector and the training weight value;
Or alternatively,
Performing cosine similarity calculation on the third sentence vector and the fourth sentence vector of the sample sentence pair to obtain a cosine similarity value;
optimizing the S-Bert model to be trained according to the cosine similarity value and the score in the class label of the sample sentence pair;
Or alternatively,
And optimizing the S-Bert model to be trained according to the distance between the third sentence vector and the fourth sentence vector of the sample sentence pair and the given sentence vector.
According to the clinical term standardization method provided by the invention, the similarity calculation is carried out on the first sentence vector and the plurality of second sentence vectors respectively, and the method comprises the following steps:
And carrying out similarity calculation on the first sentence vector and the plurality of second sentence vectors based on a cosine similarity algorithm, and taking the second sentence corresponding to the highest similarity value as a standard clinical term of the first sentence.
In a second aspect, the present invention provides a clinical term normalization device comprising:
The acquisition module is used for acquiring the first sentence and a plurality of second sentences; the first sentence is a sentence to be identified, and the second sentence is a standard clinical term;
the input module is used for inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors;
The calculation module is used for carrying out similarity calculation on the first sentence vector and the plurality of second sentence vectors respectively, and taking the second sentence corresponding to the highest similarity value as a standard clinical term corresponding to the first sentence;
the S-Bert model is obtained based on sample sentence pairs and sample category labels in a training mode; wherein the sample sentence pairs comprise a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample sentence to be identified and a plurality of standard clinical terms;
the S-Bert model is used for determining and pooling the semantic symbol sequences of the first sentence and the second sentences.
According to the clinical term standardization device provided by the invention, the input module is further configured to:
Respectively constructing sentence pairs from the first sentence and any one of the plurality of second sentences to obtain a plurality of sentence pairs;
Respectively inputting the plurality of sentence pairs into the twin neural network structure to determine semantic unit symbols and perform symbol processing, so as to obtain the semantic symbol sequence of each sentence pair;
and respectively inputting the semantic symbol sequences of the sentence pairs into the pooling layer to perform average pooling, maximum pooling or initial symbol position pooling processing to obtain a first sentence vector and a plurality of second sentence vectors.
In a third aspect, the present invention also provides an electronic device, including:
a processor, a memory, and a bus, wherein,
The processor and the memory complete communication with each other through the bus;
The memory stores program instructions executable by the processor, the processor invoking the program instructions to perform a method of clinical term normalization as described in any of the above.
In a fourth aspect, the present invention also provides a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform a method of normalization of clinical terms as described above.
The invention provides a method, a device, electronic equipment and a storage medium for standardizing clinical terms, wherein the method comprises the following steps: acquiring a first sentence and a plurality of second sentences; the first sentence is a sentence to be identified, and the second sentence is a standard clinical term; inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors; and respectively carrying out similarity calculation on the first sentence vector and the plurality of second sentence vectors, and taking the second sentence corresponding to the highest similarity value as a standard clinical term corresponding to the first sentence. The method provided by the invention can improve the accuracy and efficiency of clinical term standardization.
Drawings
In order to illustrate the invention or the prior-art technical solutions more clearly, the drawings needed in the embodiments or the prior-art description are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a method for standardizing clinical terms provided by the invention;
FIG. 2 is a schematic diagram of a clinical term normalization device provided by the present invention;
fig. 3 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention is implemented on the basis of an S-Bert model obtained by fine-tuning the Bert language model. The Bert model, released in October 2018, caused a huge stir in the NLP industry and is regarded as a milestone in the NLP field. Bert is short for Bidirectional Encoder Representations from Transformers, i.e., a Transformer-based bidirectional encoder representation. Unlike other language representation models, the Bert model aims to pre-train deep bidirectional representations by jointly conditioning on context in all layers.
Fig. 1 is a schematic flow chart of the clinical term standardization method provided by the present invention. As shown in Fig. 1, the method includes the following steps:
step 101: acquiring a first sentence and a plurality of second sentences; the first sentence is a sentence to be identified, and the second sentence is a standard clinical term;
step 102: inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors;
Step 103: respectively carrying out similarity calculation on the first sentence vector and the plurality of second sentence vectors, and taking a second sentence corresponding to the highest similarity value as a standard clinical term corresponding to the first sentence;
the S-Bert model is obtained based on sample sentence pairs and sample category labels in a training mode; wherein the sample sentence pairs comprise a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample sentence to be identified and a plurality of standard clinical terms;
the S-Bert model is used for determining and pooling the semantic symbol sequences of the first sentence and the second sentences.
Specifically, the semantic symbol sequence refers to a sequence of symbols endowed with linguistic meaning, arranged in a certain order; standardization refers to the process of unifying medical expressions within a certain scope by formulating clinical medical term standards, so as to obtain optimal order and social benefit.
In step 101, the first sentence is the sentence to be identified and each second sentence is a standard clinical term. For example, the sentence to be identified is "ICD", and the standard clinical terms are "lung cancer", "lung cancer postoperative infection", and so on. It should be noted that the second sentences can be set according to actual needs and are not specifically limited herein.
In step 102, the acquired first sentence and the plurality of second sentences are input into the pre-trained S-Bert model. It should be noted that the S-Bert model is trained on sample sentence pairs and sample category labels, where a sample category label describes whether a mapping relation exists between the sample sentence to be identified and each of the plurality of standard clinical terms. The existence of a mapping relation can be judged by a mapping relation score; in this embodiment the score is set between 0 and 1, where 0 indicates that no mapping relation exists at all and 1 indicates that a definite mapping relation exists. For example, if the mapping relation score between "ICD" and "lung cancer postoperative infection" is 1 while the scores for the other standard medical terms are smaller than 1, this indicates that a mapping relation exists between "ICD" and "lung cancer postoperative infection" and may not exist for the other standard medical terms. The sample category labels may be set according to actual needs and are not specifically limited herein.
In step 103, similarity calculation is performed between the first sentence vector output by the S-Bert model and each of the plurality of second sentence vectors, and the standard clinical term corresponding to the highest similarity value is taken as the standard clinical term of the first sentence. For example, suppose the first sentence is "ICD" and the plurality of second sentences are "lung cancer" and "lung cancer postoperative infection"; after semantic-symbol-sequence determination and pooling by the S-Bert model, the first sentence vector is U and the second sentence vectors are V and W respectively. If, according to the cosine similarity algorithm, the similarity of U and V is 0.5 and the similarity of U and W is 1, then by comparison the sentence corresponding to W, namely "lung cancer postoperative infection", is taken as the standard clinical term for the first sentence U. Note that the similarity calculation method may be selected and set according to actual needs and is not specifically limited herein.
In the embodiment of the invention, the first sentence vector and the plurality of second sentence vectors are obtained by inputting the obtained first sentence and the plurality of second sentences into a pre-trained S-Bert model, similarity calculation is carried out on the first sentence vector and the plurality of second sentence vectors respectively, and the second sentence corresponding to the highest similarity value is used as a standard clinical term corresponding to the first sentence. The method provided by the invention can improve the accuracy and the recognition efficiency of clinical term standardization.
In another embodiment of the present invention, the S-Bert model comprises a twin neural network structure and a pooling layer, wherein,
Correspondingly, the inputting the first sentence and the plurality of second sentences into the pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors includes:
Respectively constructing sentence pairs from the first sentence and any one of the plurality of second sentences to obtain a plurality of sentence pairs;
Respectively inputting the plurality of sentence pairs into the twin neural network structure to determine semantic unit symbols and perform symbol processing, so as to obtain the semantic symbol sequence of each sentence pair;
and respectively inputting the semantic symbol sequences of the sentence pairs into the pooling layer to perform average pooling, maximum pooling or initial symbol position pooling processing to obtain a first sentence vector and a plurality of second sentence vectors.
Specifically, a Siamese neural network, also known as a twin neural network, is a coupled framework built on two artificial neural networks.
Pooling is used to discard extraneous information while retaining key information, thereby optimizing the data. Average pooling (mean pooling) takes the mean of the feature points in a neighborhood and preserves the overall characteristics of the data, better highlighting background information. Maximum pooling (max pooling) takes the maximum of the feature points in a neighborhood and better preserves texture features. Initial-symbol-position pooling pools the feature points at the position of the initial symbol.
In the embodiment of the present invention, the semantic unit symbols refer to all of the word vectors of each sentence. The first sentence is paired with each of the plurality of second sentences to obtain a plurality of sentence pairs; according to the twin neural network structure, the sentence pairs are input into two Bert models with shared parameters to obtain all the word vectors of each sentence pair, and symbol processing is performed to obtain the semantic symbol sequences; the semantic symbol sequences output by Bert are then input into the pooling layer for pooling, yielding the first sentence vector and the plurality of second sentence vectors. It should be noted that the semantic unit symbols may be all the character vectors contained in each sentence, or the sentence may first be segmented into words to obtain the word vectors it contains. This may be set according to actual needs and is not specifically limited herein.
The pooling strategy of the pooling layer may be average pooling, i.e., averaging all word vectors along the sentence-length dimension and taking the result as the overall semantic vector of the sentence; maximum pooling, i.e., taking the maximum of all word vectors along the sentence-length dimension as the overall semantic vector of the sentence; or initial-symbol-position pooling, i.e., taking the vector at the initial symbol position as the overall semantic vector of the sentence.
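The three pooling strategies can be sketched over a matrix of token vectors (sentence length x embedding dimension). A minimal NumPy illustration, assuming the helper name `pool` and toy 2x2 token vectors:

```python
import numpy as np

def pool(token_vectors, strategy="mean"):
    """Collapse a (length x dim) matrix of token vectors into one
    sentence vector using the pooling strategies described above."""
    if strategy == "mean":   # average along the sentence-length dimension
        return token_vectors.mean(axis=0)
    if strategy == "max":    # element-wise maximum over all tokens
        return token_vectors.max(axis=0)
    if strategy == "cls":    # vector at the initial-symbol position
        return token_vectors[0]
    raise ValueError(f"unknown pooling strategy: {strategy}")

# Two token vectors of dimension 2 (illustrative values).
tokens = np.array([[1.0, 4.0],
                   [3.0, 2.0]])
```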
In the embodiment of the invention, the S-Bert model is a model based on Bert sentence vectorization processing, a semantic symbol sequence of a sentence is determined through a twin neural network structure, the input semantic symbol sequence is subjected to pooling processing through a pooling layer to obtain a first sentence vector and a plurality of second sentence vectors, the model keeps the integrity of the understanding of the sentence by the Bert layer, meanwhile, the generalization capability of the vector on migration tasks is considered, the understanding of the semantic can be deeply carried out, and the recognition efficiency and the accuracy of the model are improved.
In another embodiment of the present invention, the symbol processing includes:
Adding a preset initial symbol to the front of the sentence pair;
and/or,
Adding a preset separation symbol between two adjacent sentences in the sentence pair;
and/or,
Adding special symbols to special semantic unit symbols.
Specifically, the start symbol is a flag for representing the start of a sentence; the separation symbol is used for separating sentence pairs; special symbols are used for representation of special characters.
In the embodiment of the invention, the symbol processing of sentences may add a preset start symbol in front of the sentence pair, and/or add a preset separation symbol between two adjacent sentences in each sentence pair, and/or add special symbols for special semantic unit symbols. Note that a semantic symbol marked with a special symbol is treated as a single semantic unit symbol and is not segmented further.
In this embodiment, the start symbol may be user-defined, such as [CLS] or [START]; the preset separation symbol may be user-defined, such as [SEP] or [S]; and the special symbol may be user-defined, such as [NOT].
For example, suppose sentence a is "gastric cancer" and sentence b is "gastric cancer, unspecified", with "unspecified" set as a special symbol denoted by [NOT], and each single character of a sentence used as a semantic unit symbol (e.g., "stomach" as token_1). The semantic symbol sequence of the sentence pair is then: [CLS] token_1 token_2 [SEP] token_3 token_4 [NOT]. The number of [SEP] symbols may also be two or more.
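The symbol processing for a sentence pair can be sketched as follows. This is a hedged illustration only: whitespace-separated English words stand in for the single-character semantic units of the Chinese example, and the helper name `build_symbol_sequence` is an assumption.

```python
def build_symbol_sequence(pair, special_units=("unspecified",),
                          start="[CLS]", sep="[SEP]", special="[NOT]"):
    """Build the semantic symbol sequence for a sentence pair: a start
    symbol in front, a separation symbol between adjacent sentences,
    and a special symbol in place of any special semantic unit (which
    is not segmented further)."""
    seq = [start]
    for i, sentence in enumerate(pair):
        if i > 0:
            seq.append(sep)
        for unit in sentence.split():
            seq.append(special if unit in special_units else unit)
    return seq
```

For the pair ("stomach cancer", "stomach cancer unspecified") this yields [CLS] stomach cancer [SEP] stomach cancer [NOT], mirroring the [CLS] token_1 token_2 [SEP] token_3 token_4 [NOT] sequence above.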
In the embodiment of the invention, the semantic symbol sequence of the sentence pair is obtained by carrying out symbol processing on the sentence pair, so that understanding of the semantic can be realized more deeply, and the accuracy and the efficiency of model identification are improved.
In one embodiment of the present invention, before acquiring the first sentence and the second sentences, the S-Bert model is trained according to the sample sentence pairs and the sample category labels, which includes:
Step S1: determining semantic unit symbols and performing symbol processing on a third sentence and a fourth sentence in the sample sentence pair by using the S-Bert model to be trained, to obtain a semantic symbol sequence; wherein the third sentence is the sample sentence to be identified, and the fourth sentence is any one of the plurality of standard clinical terms;
Step S2: performing average pooling, maximum pooling or initial-symbol-position pooling on the semantic symbol sequence to obtain a third sentence vector and a fourth sentence vector of the sample sentence pair;
Step S3: performing score evaluation on the third sentence vector and the fourth sentence vector of the sample sentence pair, and selecting the model with the highest score as the S-Bert model to be trained in the next round;
Step S4: training the S-Bert model to be trained again; when the model training termination condition is not met, adjusting the S-Bert model to be trained and executing step S1 again with the adjusted S-Bert model; when the model training termination condition is met, obtaining the trained S-Bert model.
Specifically, the model training termination condition may be that training is terminated when the score of the training model is less than a preset threshold.
In the embodiment of the invention, the training-end instruction is set as a preset threshold on the model score. According to the obtained sample sentence vectors and standard clinical term vectors, and in combination with the specific business target, a suitable evaluator is selected to evaluate the overall semantic vectors, and the model with the highest score is taken as the model to be trained in the next round. Evaluation indexes for machine learning or deep learning models, such as accuracy, precision, recall and F1 value, are used for the evaluation; the user can choose different indexes according to the ratio of positive to negative training samples and the training task.

Intensive training is then performed on the false-positive data in the validation set until the training-end instruction is triggered, that is, training stops when the score of the model to be trained is smaller than the preset threshold. For example, if the preset threshold is 90, training can be stopped when the model training score is 80. The false-positive data are sentence pairs that actually have no mapping relation but for which the training model incorrectly predicts that a mapping relation exists.

In the embodiment of the invention, semantic symbol sequence determination and pooling must be performed on the sample sentence to be identified and the standard clinical terms to obtain the sample sentence vector and the standard clinical term vectors, where the semantic symbol sequence may consist of the vectors of all characters in a sentence, or of the vectors of all words after the sentence is segmented. The specific implementation is as in the above embodiment and is not repeated here.
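The evaluation indexes named above (accuracy, precision, recall, F1) can be computed from binary mapping labels as follows; a minimal sketch, with the function name and example labels assumed for illustration (1 = mapping relation exists, 0 = no mapping relation):

```python
def evaluation_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# One true positive, one false positive, one false negative, one true negative.
metrics = evaluation_metrics([1, 1, 0, 0], [1, 0, 1, 0])
```

The false positives counted here (`fp`) correspond to the false-positive data on which the intensive training described above is performed.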
According to the embodiment of the invention, the model to be trained undergoes two rounds of evaluation and training using the training set data and the verification set data, which can improve the accuracy of the recognition results of the S-Bert model.
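The evaluation indexes named above (accuracy, precision, recall, F1) and the extraction of false-positive pairs for intensive re-training can be sketched as follows. This is an illustrative sketch, not code from the patent; the function names and the 0/1 label convention ("1" = mapping relation exists) are assumptions.

```python
# Illustrative sketch: evaluation metrics for the binary
# "mapping / no mapping" classification of sentence pairs.

def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for 0/1 label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

def false_positives(pairs, y_true, y_pred):
    """Pairs predicted as 'mapping' whose label is 'no mapping' --
    the data the embodiment routes back into intensive training."""
    return [pair for pair, t, p in zip(pairs, y_true, y_pred)
            if t == 0 and p == 1]
```

Which index matters most depends, as the text notes, on the positive/negative sample ratio: with few positive pairs, F1 is usually more informative than accuracy.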
In another embodiment of the present invention, the adjusting the S-Bert model to be trained includes:
obtaining a difference vector according to the third sentence vector and the fourth sentence vector of the sample sentence pair;
Splicing the third sentence vector, the fourth sentence vector and the difference vector of the sample sentence pair to obtain a fifth sentence vector;
Optimizing the S-Bert model to be trained according to the fifth sentence vector and the training weight value;
Or alternatively,
Performing cosine similarity calculation on the third statement vector and the fourth statement vector of the sample statement pair to obtain a cosine similarity value;
optimizing the S-Bert model to be trained according to the cosine similarity value and the score in the class label of the sample sentence pair;
Or alternatively,
And optimizing the S-Bert model to be trained according to the distance between the third sentence vector and the fourth sentence vector of the sample sentence pair and the given sentence vector.
Specifically, the optimization process may be implemented by adjusting parameters of the model, where the objective function applied in the present embodiment may be a classification objective function, a regression objective function, or a triplet objective function.
In the embodiment of the invention, the third sentence is the sample sentence to be identified, and the third sentence vector is the overall semantic vector of the sample sentence to be identified; the fourth sentence is any one of the plurality of standard clinical terms, and the fourth sentence vector is the overall semantic vector of that standard clinical term. A difference vector U−V is obtained from the overall semantic vector U of the third sentence and the overall semantic vector V of the fourth sentence of the sample sentence pair; the overall semantic vector U of the third sentence, the overall semantic vector V of the fourth sentence and the difference vector U−V are spliced to obtain a fifth sentence vector M; the spliced fifth sentence vector M is multiplied by a trainable weight matrix W_t ∈ R^(3n×k), where n is the dimension of the overall semantic vector and k is the number of categories; finally a softmax classifier is connected to train the classification of sentence pairs, and the optimization of the S-Bert model to be trained is realized by adjusting the parameters. The specific representation of the classification objective function is: o = softmax(W_t(u, v, |u−v|)), where the loss function employs the cross-entropy loss, specifically expressed as: loss(o, class) = −log( exp(o[class]) / Σ_j exp(o[j]) )
Where loss(o, class) represents the loss value of the sentence pair's semantic symbol sequence; class represents the category to which the sentence pair belongs, and k is the number of categories; o is the sentence pair vector output by the softmax classifier; j ranges over the k possible categories of the sentence pair.
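The classification objective o = softmax(W_t(u, v, |u−v|)) with a cross-entropy loss can be sketched numerically as below. This is a minimal sketch under stated assumptions: the function name, shapes, and the zero-initialized W_t are illustrative and not from the patent.

```python
import numpy as np

def classification_objective(u, v, W_t, target_class):
    """Sketch of the classification objective: splice (u, v, |u-v|),
    project with the trainable weight W_t in R^(3n x k), apply softmax,
    and return the class distribution plus the cross-entropy loss."""
    features = np.concatenate([u, v, np.abs(u - v)])  # shape (3n,)
    logits = features @ W_t                           # shape (k,)
    o = np.exp(logits - logits.max())                 # numerically stable
    o = o / o.sum()                                   # softmax over k classes
    loss = -np.log(o[target_class])                   # cross-entropy
    return o, loss
```

With all-zero weights the classifier is maximally uncertain, so the loss equals log(k); training adjusts W_t (and the encoder parameters) to reduce this loss on labeled sentence pairs.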
In the embodiment of the invention, the cosine similarity of the overall semantic vector U of the third sentence and the overall semantic vector V of the fourth sentence of the sample sentence pair is calculated; the parameters of the S-Bert model to be trained are then adjusted according to the cosine similarity and the score in the class label of the sample sentence pair, thereby realizing the optimization. The loss function is the mean square error, calculated by the following expression:
loss(sim(u,v), cosine) = (cosine − sim(u,v))²
Where sim(u,v) represents the cosine similarity of the third sentence vector and the fourth sentence vector, and cosine represents the score in the class label of the sample sentence pair formed by the third sentence and the fourth sentence.
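The regression objective above can be sketched as follows. A minimal sketch, assuming a label score in [−1, 1] as in the mean-square-error formula; the function names are illustrative.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity of two sentence vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def regression_loss(u, v, label_score):
    """Mean squared error between the label score and sim(u, v),
    matching loss(sim(u,v), cosine) = (cosine - sim(u,v))^2."""
    return (label_score - cosine_sim(u, v)) ** 2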
In the embodiment of the invention, the S-Bert model to be trained is optimized according to the distances between the third sentence vector, the fourth sentence vector and a given sentence vector of the sample sentence pair, where the objective function is a triplet objective function. Three sentences are input: a given sentence a, a positive example sentence p and a negative example sentence n. The model parameters are adjusted so that the distance between sentence a and sentence p is smaller than the distance between sentence a and sentence n, that is, the overall semantic vector of sentence a is drawn closer to the overall semantic vector of sentence p, thereby optimizing the model. The loss function is specifically characterized as follows:
loss = min(max(‖Sa − Sp‖ − ‖Sa − Sn‖ + ε, 0))
Where Sa, Sp and Sn represent the overall semantic vectors of sentences a, p and n, respectively, and ε represents the preset distance margin.
It should be noted that the purpose of the ε parameter is to ensure that Sp is at least ε closer to Sa than Sn is; here ε may be set to 1.
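The triplet loss max(‖Sa − Sp‖ − ‖Sa − Sn‖ + ε, 0), whose minimization the formula above expresses, can be sketched as a single function. The function name and default ε = 1 follow the text; everything else is illustrative.

```python
import numpy as np

def triplet_loss(s_a, s_p, s_n, epsilon=1.0):
    """max(||Sa - Sp|| - ||Sa - Sn|| + eps, 0): the loss becomes zero
    once the positive example p is at least eps closer to the anchor a
    than the negative example n is."""
    d_pos = np.linalg.norm(s_a - s_p)
    d_neg = np.linalg.norm(s_a - s_n)
    return max(d_pos - d_neg + epsilon, 0.0)
```

During training, gradients of this loss move Sp toward Sa and Sn away from it; once the margin ε is satisfied the triplet contributes nothing further.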
The embodiment of the invention provides three optimization processing modes, wherein the objective functions are respectively a classification objective function, a regression objective function and a triplet objective function, so that the optimization processing of the trained model parameters is realized, and the accuracy of the recognition result of the model is improved.
In another embodiment of the present invention, the performing similarity calculation on the first sentence vector and the plurality of second sentence vectors includes:
And carrying out similarity calculation on the first sentence vector and the plurality of second sentence vectors based on a cosine similarity algorithm, and taking the second sentence corresponding to the highest similarity value as a standard clinical term of the first sentence.
Specifically, cosine similarity measures the difference between two individuals by taking the cosine of the angle between their two vectors as the basis of comparison.
In the embodiment of the invention, similarity calculation is performed on the first sentence vector and the plurality of second sentence vectors based on the cosine similarity algorithm, and the second sentence corresponding to the highest similarity value is determined as the standard clinical term of the first sentence. For example, if the cosine similarity between the first sentence vector U and a second sentence vector V is 1, while the cosine similarity between U and another second sentence vector W is 0, the second sentence corresponding to V is taken as the standard clinical term of the first sentence. It should be noted that the range of the cosine similarity value is [−1, 1]; the closer the cosine value is to 1, the closer the included angle of the two vectors is to 0 degrees, that is, the more similar the two vectors are.
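The retrieval step above, i.e. scoring the first sentence vector against every second sentence vector and returning the best-scoring standard term, can be sketched as follows. The function name and argument layout are illustrative assumptions.

```python
import numpy as np

def best_standard_term(first_vec, second_vecs, second_sentences):
    """Cosine similarity of the first-sentence vector against every
    second-sentence vector; the standard clinical term corresponding
    to the highest similarity value is returned."""
    sims = [float(first_vec @ v
                  / (np.linalg.norm(first_vec) * np.linalg.norm(v)))
            for v in second_vecs]
    return second_sentences[int(np.argmax(sims))]
```

Because the second-sentence (standard term) vectors do not depend on the query, they can be encoded once and cached, which is precisely the time saving S-Bert offers over pairwise Bert scoring.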
In the embodiment of the invention, the standard clinical terms of the first sentence are determined based on the cosine similarity algorithm, the calculation method is simple, and the model identification efficiency is improved.
In another embodiment of the present invention, the construction of the S-Bert model is specifically described as follows:
The pre-trained Bert model can be fine-tuned through an additional output layer and is suitable for building state-of-the-art models for a wide range of tasks without large-scale task-specific architecture modifications. In the present invention, the Bert model is pre-trained using two unsupervised prediction tasks. Task one randomly masks some parts of the input sentence and then predicts the masked parts to train the model; in this case, the final hidden vectors corresponding to the masked parts are fed into a softmax function, which predicts the probabilities of all words as in a standard language model, thereby enhancing the understanding of semantics. This step is called MLM (Masked Language Model). Task two is next-sentence prediction: many important downstream tasks, such as question answering (QA) and natural language inference (NLI), rely on understanding the relationship between two text sentences, which is not directly captured by language modeling. To train a model that understands sentence relationships, a binary next-sentence prediction task is set, whose training data can easily be generated from any monolingual corpus: sentences A and B are selected as a pre-training sample, where 50% of the time B is the actual next sentence of A and 50% of the time B is a random sentence from the corpus. This enables the model to capture the semantic link between two sentences.
The S-Bert (Sentence-Bert) model is mainly proposed to address the huge time cost of Bert-based semantic similarity retrieval and the fact that Bert sentence representations are not suitable for unsupervised tasks such as clustering and sentence similarity calculation. The S-Bert model uses a twin (Siamese) neural network structure and a pooling layer to obtain the vector representations of sentence pairs, and the S-Bert pre-training process mainly comprises two steps. First, the overall semantic vector representation of each sentence is obtained by the twin neural network structure, with two Bert models used for fine-tuning in model pre-training: the sentence pair is input into two Bert models with shared parameters, and all the word vectors output by Bert for each sentence are fed into a pooling layer for pooling, yielding the overall semantic vector representation of each sentence. Second, the model is optimized using an objective function, as described in detail in the embodiments above.
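The pooling step, collapsing the per-token vectors emitted by the shared-parameter encoder into one sentence vector via average pooling, maximum pooling, or the initial-symbol ([CLS]) position, can be sketched as below. The encoder itself is omitted; the mode names are an assumption mirroring the three pooling options named in the text.

```python
import numpy as np

def pool(token_vectors, mode="mean"):
    """Collapse a (seq_len, n) array of per-token vectors into a single
    overall semantic vector of dimension n."""
    token_vectors = np.asarray(token_vectors)
    if mode == "mean":          # average pooling
        return token_vectors.mean(axis=0)
    if mode == "max":           # maximum pooling (element-wise)
        return token_vectors.max(axis=0)
    if mode == "cls":           # initial symbol position pooling
        return token_vectors[0]
    raise ValueError(f"unknown pooling mode: {mode}")
```

In the twin structure the same `pool` is applied to both Bert outputs, so the two sentence vectors live in the same space and can be compared directly.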
It should be noted that, as for how the Bert/S-Bert model is obtained, the basic Bert/S-Bert model may be fine-tuned with training corpora from the corresponding scene according to the specific application scenario, so as to obtain a model suitable for the specific business scenario; alternatively, a relevant model may be trained from scratch and used.
In another embodiment of the present invention, the method further comprises:
Calculating the first sentence vector and the second sentence vector according to a mapping strategy to acquire a standard clinical term of the first sentence;
The mapping strategy is a mode for classifying the characteristics of clinical terms and limiting the mapping range by adopting axis words and axes.
Specifically, axis words are the characteristic words or keywords contained in clinical terms that distinguish them from other attributes; an axis is the standard or scale of classification.
In the embodiment of the invention, a plurality of mapping strategies are added according to the business requirements and data quality, and the standard clinical term of the first sentence is obtained according to the mapping strategies. For example, the similarity may be calculated after splicing the vector obtained by the S-Bert model with the vector of the axis words, or the n clinical standard terms with the highest similarity obtained by the S-Bert model may be filtered through medical rules, and so on. For example: suppose axis 1 contains the axis word "digestive system", axis 1.1 under it contains the axis words "stomach" and "gastric body", and axis 2 contains the axis word "respiratory system". For the clinical diagnosis result "malignant tumor of the stomach", the axis word "stomach" is extracted, which belongs to axis 1.1, so the mapping range of the clinical term can be limited to axis 1 and axis 1.1. It should be noted that the extraction of axis words may be performed by certain rules or by machine learning or deep learning methods, selected according to actual needs, which is not limited herein.
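The axis-word filter described above can be sketched as a simple keyword lookup over an axis hierarchy. The axis table and keyword lists below are invented illustrative examples, not data from the patent; real systems would use a rule set or a learned extractor.

```python
# Illustrative axis hierarchy: axis id -> axis words (assumed examples).
AXES = {
    "1":   ["digestive system"],
    "1.1": ["stomach", "gastric body"],
    "2":   ["respiratory system"],
}

def candidate_axes(diagnosis_text):
    """Return the axis ids whose axis words appear in the diagnosis text,
    restricting the mapping range of the standard-term search."""
    return [axis for axis, words in AXES.items()
            if any(w in diagnosis_text for w in words)]
```

Restricting the candidate standard terms to the matched axes shrinks the similarity search space before the S-Bert comparison runs, which is where the efficiency gain claimed below comes from.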
In the embodiment of the invention, the standardized processing of the sentence to be identified is realized by adding mapping strategies, which improves the efficiency of model identification and the user experience.
Fig. 2 is a device for standardizing clinical terms provided by the present invention, and as shown in fig. 2, the device for standardizing clinical terms provided by the present invention includes:
an obtaining module 201, configured to obtain a first sentence and a plurality of second sentences; the first sentence is a sentence to be identified, and the second sentence is a standard clinical term;
An input module 202, configured to input the first sentence and the plurality of second sentences into a pre-trained S-Bert model, and obtain a first sentence vector and a plurality of second sentence vectors;
the calculation module 203 is configured to perform similarity calculation on the first sentence vector and the plurality of second sentence vectors, and use a second sentence corresponding to the highest similarity value as a standard clinical term corresponding to the first sentence;
the S-Bert model is obtained based on sample sentence pairs and sample category labels in a training mode; wherein the sample sentence pairs comprise a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample sentence to be identified and a plurality of standard clinical terms;
the S-Bert model is used for determining and pooling semantic symbol sequences of the first statement and the second statement.
The device for standardizing clinical terms provided by the embodiment of the invention comprises an acquisition module for acquiring a first sentence and a plurality of second sentences, the first sentence being a sentence to be identified and the second sentences being standard clinical terms; an input module for inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors; and a calculation module for performing similarity calculation on the first sentence vector and the plurality of second sentence vectors and taking the second sentence corresponding to the highest similarity value as the standard clinical term corresponding to the first sentence. The device provided by the invention can focus accurately on clinical terms; through deep understanding of semantics in both depth and breadth, it improves the accuracy and efficiency of clinical term standardization and the user experience.
Since the apparatus according to the embodiment of the present invention operates on the same principle as the method according to the above embodiments, the detailed explanation is not repeated here.
Fig. 3 is a schematic diagram of an entity structure of an electronic device according to an embodiment of the present invention, where, as shown in fig. 3, the present invention provides an electronic device, including: a processor (processor) 301, a memory (memory) 302, and a bus 303;
Wherein, the processor 301 and the memory 302 complete communication with each other through the bus 303;
The processor 301 is configured to invoke program instructions in the memory 302 to perform the methods provided by the above-described method embodiments, for example, including: acquiring a first sentence and a plurality of second sentences; the first sentence is a sentence to be identified, and the second sentence is a standard clinical term; inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors; respectively carrying out similarity calculation on the first sentence vector and the plurality of second sentence vectors, and taking a second sentence corresponding to the highest similarity value as a standard clinical term corresponding to the first sentence; the S-Bert model is obtained based on sample sentence pairs and sample category labels in a training mode; wherein the sample sentence pairs comprise a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample sentence to be identified and a plurality of standard clinical terms; the S-Bert model is used for determining and pooling semantic symbol sequences of the first statement and the second statement.
The present embodiment provides a non-transitory computer readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above-described method embodiments, for example, including: acquiring a first sentence and a plurality of second sentences; the first sentence is a sentence to be identified, and the second sentence is a standard clinical term; inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors; respectively carrying out similarity calculation on the first sentence vector and the plurality of second sentence vectors, and taking a second sentence corresponding to the highest similarity value as a standard clinical term corresponding to the first sentence; the S-Bert model is obtained based on sample sentence pairs and sample category labels in a training mode; wherein the sample sentence pairs comprise a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample sentence to be identified and a plurality of standard clinical terms; the S-Bert model is used for determining and pooling semantic symbol sequences of the first statement and the second statement.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions, where the foregoing program may be stored in a computer readable storage medium, and when executed, the program performs steps including the above method embodiments; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for normalization of clinical terms, comprising:
acquiring a first sentence and a plurality of second sentences; the first sentence is a sentence to be identified, and the second sentence is a standard clinical term;
Inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors;
Respectively carrying out similarity calculation on the first sentence vector and the plurality of second sentence vectors, and taking a second sentence corresponding to the highest similarity value as a standard clinical term corresponding to the first sentence;
the S-Bert model is obtained based on sample sentence pairs and sample category labels in a training mode; wherein the sample sentence pairs comprise a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample sentence to be identified and a plurality of standard clinical terms;
The S-Bert model is used for determining and pooling semantic symbol sequences of the first sentence and the second sentence;
The S-Bert model comprises a twin neural network structure and a pooling layer;
Inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors, wherein the method comprises the following steps:
Respectively constructing sentence pairs from the first sentence and any one of the plurality of second sentences to obtain a plurality of sentence pairs;
Respectively inputting the multiple sentences into the twin neural network structure to determine semantic unit symbols and perform symbol processing to obtain respective semantic symbol sequences of the multiple sentences, wherein the semantic unit symbols comprise all word vectors corresponding to the sentence pairs or at least one word vector corresponding to the sentence pairs after word segmentation;
and respectively inputting the semantic symbol sequences of the sentence pairs into the pooling layer to perform average pooling, maximum pooling or initial symbol position pooling processing to obtain a first sentence vector and a plurality of second sentence vectors.
2. The method of claim 1, wherein the symbol processing comprises:
Adding a preset initial symbol to the front of the sentence pair;
and/or the number of the groups of groups,
Adding a preset separation symbol between two adjacent sentences in the sentence pair;
and/or the number of the groups of groups,
Special symbols are added to the special semantic unit symbols.
3. The method of claim 1, wherein training the S-Bert model based on the sample sentence pairs and the class identification tags prior to the acquiring the first sentence and the second sentence comprises:
S1, determining semantic unit symbols and performing symbol processing on a third sentence and a fourth sentence in the sample sentence pair by utilizing an S-Bert model to be trained to obtain a semantic symbol sequence; wherein the third sentence is the sample sentence to be identified, and the fourth sentence is any one of the plurality of standard clinical terms;
s2, carrying out average pooling, maximum pooling or initial symbol position pooling treatment on the semantic symbol sequence to obtain a third sentence vector and a fourth sentence vector of the sample sentence pair;
step S3, performing score evaluation on the third sentence vector and the fourth sentence vector of the sample sentence, and selecting a model with the highest score as an S-Bert model to be trained in the next round;
Step S4, training the S-Bert model to be trained again, and when the model training termination condition is not met, adjusting the S-Bert model to be trained, and executing the step S1 again by using the adjusted S-Bert model; and when the model training termination condition is met, obtaining a trained S-Bert model.
4. A clinical term normalization method according to claim 3, in which the adapting the S-Bert model to be trained comprises:
obtaining a difference vector according to the third sentence vector and the fourth sentence vector of the sample sentence pair;
Splicing the third sentence vector, the fourth sentence vector and the difference vector of the sample sentence pair to obtain a fifth sentence vector;
Optimizing the S-Bert model to be trained according to the fifth sentence vector and the training weight value;
Or alternatively,
Performing cosine similarity calculation on the third statement vector and the fourth statement vector of the sample statement pair to obtain a cosine similarity value;
optimizing the S-Bert model to be trained according to the cosine similarity value and the score in the class label of the sample sentence pair;
Or alternatively,
And optimizing the S-Bert model to be trained according to the distance between the third sentence vector and the fourth sentence vector of the sample sentence pair and the given sentence vector.
5. The method of claim 1, wherein said performing similarity calculation on said first sentence vector and said plurality of second sentence vectors, respectively, comprises:
And carrying out similarity calculation on the first sentence vector and the plurality of second sentence vectors based on a cosine similarity algorithm, and taking the second sentence corresponding to the highest similarity value as a standard clinical term of the first sentence.
6. A clinical term normalization device, comprising:
The acquisition module is used for acquiring the first sentence and a plurality of second sentences; the first sentence is a sentence to be identified, and the second sentence is a standard clinical term;
the input module is used for inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors;
The calculation module is used for carrying out similarity calculation on the first sentence vector and the plurality of second sentence vectors respectively, and taking the second sentence corresponding to the highest similarity value as a standard clinical term corresponding to the first sentence;
the S-Bert model is obtained based on sample sentence pairs and sample category labels in a training mode; wherein the sample sentence pairs comprise a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample sentence to be identified and a plurality of standard clinical terms;
The S-Bert model is used for determining and pooling semantic symbol sequences of the first sentence and the second sentence;
The S-Bert model comprises a twin neural network structure and a pooling layer; the input module is also used for:
Respectively constructing sentence pairs from the first sentence and any one of the plurality of second sentences to obtain a plurality of sentence pairs;
Respectively inputting the multiple sentences into the twin neural network structure to determine semantic unit symbols and perform symbol processing to obtain respective semantic symbol sequences of the multiple sentences, wherein the semantic unit symbols comprise all word vectors corresponding to the sentence pairs or at least one word vector corresponding to the sentence pairs after word segmentation;
and respectively inputting the semantic symbol sequences of the sentence pairs into the pooling layer to perform average pooling, maximum pooling or initial symbol position pooling processing to obtain a first sentence vector and a plurality of second sentence vectors.
7. An electronic device, comprising:
a processor, a memory, and a bus, wherein,
The processor and the memory complete communication with each other through the bus;
The memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-5.
8. A non-transitory computer readable storage medium storing computer instructions that cause the computer to perform the method of any one of claims 1 to 5.
CN202110767577.5A 2021-07-07 2021-07-07 Clinical term standardization method, device, electronic equipment and storage medium Active CN113593661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110767577.5A CN113593661B (en) 2021-07-07 2021-07-07 Clinical term standardization method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110767577.5A CN113593661B (en) 2021-07-07 2021-07-07 Clinical term standardization method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113593661A CN113593661A (en) 2021-11-02
CN113593661B true CN113593661B (en) 2024-06-14

Family

ID=78246599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110767577.5A Active CN113593661B (en) 2021-07-07 2021-07-07 Clinical term standardization method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113593661B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114974602A (en) * 2022-05-26 2022-08-30 山东大学 Diagnostic coding method and system based on contrast learning
CN114996466B (en) * 2022-08-01 2022-11-01 神州医疗科技股份有限公司 Method and system for establishing medical standard mapping model and using method
CN116186271B (en) * 2023-04-19 2023-07-25 北京亚信数据有限公司 Medical term classification model training method, classification method and device
CN116167354B (en) * 2023-04-19 2023-07-07 北京亚信数据有限公司 Medical term feature extraction model training and standardization method and device
CN117894482A (en) * 2024-03-14 2024-04-16 北方健康医疗大数据科技有限公司 Medical tumor coding method, system, electronic equipment and storage medium

Citations (1)

Publication number Priority date Publication date Assignee Title
CN112464662A (en) * 2020-12-02 2021-03-09 平安医疗健康管理股份有限公司 Medical phrase matching method, device, equipment and storage medium

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US20180373700A1 (en) * 2015-11-25 2018-12-27 Koninklijke Philips N.V. Reader-driven paraphrasing of electronic clinical free text
CN109697286A (en) * 2018-12-18 2019-04-30 众安信息技术服务有限公司 A kind of diagnostic standardization method and device based on term vector
CN109992648B (en) * 2019-04-10 2021-07-02 北京神州泰岳软件股份有限公司 Deep text matching method and device based on word migration learning
CN111353302A (en) * 2020-03-03 2020-06-30 平安医疗健康管理股份有限公司 Medical word sense recognition method and device, computer equipment and storage medium

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN112464662A (en) * 2020-12-02 2021-03-09 平安医疗健康管理股份有限公司 Medical phrase matching method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113593661A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN113593661B (en) Clinical term standardization method, device, electronic equipment and storage medium
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
Santos et al. Attentive pooling networks
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
CN111738001B (en) Training method of synonym recognition model, synonym determination method and equipment
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN113849653B (en) Text classification method and device
CN111966811A (en) Intention recognition and slot filling method and device, readable storage medium and terminal equipment
CN112347771A (en) Method and equipment for extracting entity relationship
CN116450829A (en) Medical text classification method, device, equipment and medium
CN115588193A (en) Visual question-answering method and device based on graph attention neural network and visual relation
CN114756678A (en) Unknown intention text identification method and device
CN116680407A (en) Knowledge graph construction method and device
CN116775798A (en) Cross-modal hash method based on feature fusion between graph network and modalities
CN114020871B (en) Multi-mode social media emotion analysis method based on feature fusion
CN116072306A (en) Drug interaction information extraction method based on BioBERT and improved Focal loss
CN113722477B (en) Internet citizen emotion recognition method and system based on multitask learning and electronic equipment
CN115062783A (en) Entity alignment method and related device, electronic equipment and storage medium
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium
CN114003708A (en) Automatic question answering method and device based on artificial intelligence, storage medium and server
CN117556275B (en) Correlation model data processing method, device, computer equipment and storage medium
CN117611845B (en) Multi-mode data association identification method, device, equipment and storage medium
CN117952206B (en) Knowledge graph link prediction method
CN118132788A (en) Image text semantic matching method and system applied to image-text retrieval

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant