CN112836055A - Quantity prediction method and device for clinical term standardization

Info

Publication number: CN112836055A
Application number: CN202110264867.8A
Authority: CN
Inventors: 李雪; 刘升平; 梁家恩
Original assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Current assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date: 2021-03-12
Filing date: 2021-03-12
Publication date: 2021-05-25

Abstract

The invention relates to a quantity prediction method for clinical term standardization, comprising the following steps: acquiring a basic data set, where the basic data set includes a plurality of clinical terms and a standard term set corresponding to each clinical term; determining a classification label for the standard term set corresponding to each clinical term; enhancing each standard term set to determine an enhanced data set; and, based on the classification labels and the enhanced data set, fine-tuning the pre-trained language model BERT to determine the number of standard terms into which a clinical term is standardized.

Description

Quantity prediction method and device for clinical term standardization

Technical Field

The invention relates to the technical field of data processing, in particular to a quantity prediction method and device for clinical term standardization.

Background

Clinical terms describing diagnoses, operations, medicines, examinations, assays and symptoms are expressed in flexible and variable ways. To facilitate the analysis and filing of medical records, clinical terms need to be standardized into their corresponding standard terms. In the prior art, clinical terms are segmented by rules based on separators. When terms are standardized, it is important to determine the number of standard terms after standardization. Table 1 gives an example of operation name standardization:

TABLE 1

As can be seen from Table 1, in the operation name standardization example, a clinical term without separators does not necessarily correspond to only one standard term; the separators used in clinical terms are diverse; and the number of segments produced by splitting on separators can differ from the actual number of standard terms.

The prior art has the following problems: rule-based segmentation cannot handle a clinical term that has no separator but corresponds to multiple standard terms; it is poorly compatible with the variety of separators used in clinical terms; and it has difficulty with cases where the number of standard terms is inconsistent with the number of separators.
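The limitation of separator-based rules can be illustrated with a minimal Python sketch. The separator list and the function name `count_by_separators` are assumptions made for illustration; the record is the hysteroscopy example given later in the description.

```python
import re

# Hypothetical separator list; in practice it depends on the data set at hand.
SEPARATORS = ["+", ",", ";"]


def count_by_separators(clinical_term: str) -> int:
    """Rule-based baseline: count the non-empty segments after splitting on separators."""
    pattern = "|".join(re.escape(sep) for sep in SEPARATORS)
    segments = [seg.strip() for seg in re.split(pattern, clinical_term)]
    return len([seg for seg in segments if seg])


# One separator in the clinical term, but only one standard term in the record.
term = "hysteroscopy + segmental curettage"
standard_terms = ["hysteroscopic diagnostic curettage"]

rule_based = count_by_separators(term)
actual = len(standard_terms)
print(rule_based, actual)  # the rule over-counts for this record
```

Here the rule yields two segments while the record carries a single standard term, which is the mismatch the classification approach is meant to address.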

Disclosure of Invention

The invention aims to provide a quantity prediction method and a quantity prediction device for clinical term standardization aiming at the defects of the prior art so as to solve the problems in the prior art.

To solve the above problems, in a first aspect, the present invention provides a quantity prediction method for clinical term standardization, the method comprising:

acquiring a basic data set, where the basic data set includes a plurality of clinical terms and a standard term set corresponding to each clinical term;

determining a classification label for the standard term set corresponding to each clinical term;

enhancing each standard term set to determine an enhanced data set;

and, according to the classification labels and the enhanced data set, fine-tuning the pre-trained language model BERT to determine the number of standard terms into which a clinical term is standardized.

In a possible implementation manner, the acquiring the basic data set specifically includes:

the basic data set is obtained from an open-source term standardization competition or through web crawling.

In one possible implementation, the determining the classification label of the standard term set corresponding to each clinical term specifically includes:

using the formula $K = \max_{1 \le i \le n} \mathrm{Card}(Y_i)$ to determine the classification labels; wherein the classification label is an integer from 1 to K, K is the maximum, over all clinical terms, of the number of standard terms in the corresponding standard term set, and n is the number of clinical terms; the basic data set is $D = \{X_1, X_2, \ldots, X_n; Y_1, Y_2, \ldots, Y_n\}$, where $X_i$ denotes a clinical term and $Y_i = \{y_{i1}, y_{i2}, \ldots, y_{ik}\}$ is the standard term set corresponding to $X_i$.

In a possible implementation manner, the enhancing each standard term set, and the determining the enhanced data set specifically includes:

mining the standard term set, and determining standard terms in the standard term set that can be combined; merging the combinable standard terms, and determining the merged standard terms as a standard term enhancement set; and

when a separator exists in a clinical term, expanding the standard term set corresponding to the clinical term with the separator, and determining the expanded standard term set as a basic data enhancement set; and

counting the separators in each clinical term according to the basic data set, the standard term enhancement set and the basic data enhancement set, determining the probability of each separator occurring in the clinical terms, and determining a separator enhancement data set according to the determined probability.

In a possible implementation, the determining, according to the classification label and the enhanced data set, the normalized number of clinical terms by fine-tuning a pre-trained language model BERT specifically includes:

predicting with the pre-trained language model BERT, according to the basic data set, the standard term enhancement set, the basic data enhancement set, the separator enhancement data set and the classification labels, to obtain the number of standard terms into which a clinical term is standardized.

In a second aspect, the present invention provides a quantity prediction apparatus for clinical term standardization, the apparatus comprising:

an obtaining unit configured to acquire a basic data set, where the basic data set includes a plurality of clinical terms and a standard term set corresponding to each clinical term;

a determining unit configured to determine a classification label for the standard term set corresponding to each clinical term;

the determining unit is further configured to enhance each standard term set to determine an enhanced data set;

the determining unit is further configured to fine-tune the pre-trained language model BERT, based on the classification labels and the enhanced data set, to determine the number of standard terms into which a clinical term is standardized.

In a possible implementation manner, the obtaining unit is specifically configured to:

In a possible implementation manner, the determining unit is specifically configured to:

predict with the pre-trained language model BERT, according to the basic data set, the standard term enhancement set, the basic data enhancement set, the separator enhancement data set and the classification labels, to obtain the number of standard terms into which a clinical term is standardized.

In a third aspect, the invention provides an apparatus comprising a memory for storing a program and a processor for performing the method of any of the first aspects.

In a fourth aspect, the present invention provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of the first aspect.

In a fifth aspect, the invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of any of the first aspects.

By applying the quantity prediction method and device for clinical term standardization provided by the invention, the compatibility of the model with different separators can be improved through data enhancement, particularly separator enhancement. In addition, by defining quantity prediction as a multi-class classification problem, accuracy is improved in cases where the number of separator-based segments differs from the number of standard terms, and the number of standard terms can be predicted correctly even when a clinical term has no separator.

Drawings

FIG. 1 is a block diagram of a quantity prediction scheme for term standardization according to an embodiment of the present invention;

FIG. 2 is a flowchart of a quantity prediction method for clinical term standardization according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a quantity prediction apparatus for clinical term standardization according to a second embodiment of the present invention.

Detailed Description

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments.

The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

FIG. 1 is a block diagram of a quantity prediction scheme for term standardization according to an embodiment of the present invention. FIG. 2 is a flowchart of a quantity prediction method for clinical term standardization according to an embodiment of the present invention. The method is executed by a device with computing capability, such as a terminal or a server. The technical solution of the invention is described in detail below with reference to FIG. 1 and FIG. 2.

Step 210, acquiring a basic data set; the basic data set includes a plurality of clinical terms and a standard term set corresponding to each clinical term;

wherein the basic data set $D = \{X_1, X_2, \ldots, X_n; Y_1, Y_2, \ldots, Y_n\}$ can be obtained from an open-source term standardization competition or through web crawling, where $X_i$ denotes a clinical term, $Y_i = \{y_{i1}, y_{i2}, \ldots, y_{ik}\}$ is the standard term set corresponding to $X_i$, and k is the number of standard terms corresponding to $X_i$; that is, $(X_i; Y_i)$ is one piece of data.

Step 220, determining a classification label of a standard term set corresponding to each clinical term;

wherein the classification labels are determined from the number of standard terms $\mathrm{Card}(Y_i)$ in each set $Y_i$, with $K = \max_{1 \le i \le n} \mathrm{Card}(Y_i)$; the classification label is an integer from 1 to K, K is the maximum, over all clinical terms, of the number of standard terms in the corresponding standard term set, and n is the number of clinical terms.
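Steps 210 and 220 can be sketched in Python as follows. This is a minimal illustration, assuming the data set is held as a list of (clinical term, standard term list) pairs; the two records are toy examples assembled from terms mentioned in this description, and the names `Record`, `card`, `max_label` and `labels` are hypothetical.

```python
from typing import List, Tuple

# One piece of data (X_i; Y_i): a clinical term and its standard term set.
Record = Tuple[str, List[str]]


def card(standard_terms: List[str]) -> int:
    """Card(Y_i): the number of distinct standard terms in the set."""
    return len(set(standard_terms))


def max_label(dataset: List[Record]) -> int:
    """K: the largest Card(Y_i) over all records in the data set."""
    return max(card(ys) for _, ys in dataset)


def labels(dataset: List[Record]) -> List[int]:
    """Classification label of each record, an integer from 1 to K."""
    return [card(ys) for _, ys in dataset]


# Toy records for illustration only.
dataset: List[Record] = [
    ("hysteroscopy + segmental curettage", ["hysteroscopic diagnostic curettage"]),
    ("gastrotomy + gastroduodenal anastomosis", ["gastrotomy", "gastroduodenal anastomosis"]),
]

K = max_label(dataset)
print(K, labels(dataset))
```

With these two records K is 2 and the labels are 1 and 2, so the task becomes a two-class classification over the clinical term text.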

Step 230, enhancing each standard term set, and determining an enhanced data set;

specifically, the enhanced data set may be determined in three ways, which can be applied together:

firstly, mining the standard term set, and determining standard terms in the standard term set that can be combined; merging the combinable standard terms, and determining the merged standard terms as a standard term enhancement set;

for example, codes that can be merged are mined from the standard term set, such as "gastrotomy and gastroduodenal anastomosis" in the ICD-9-CM-3 standard for the classification of operations and procedures, and expanded to (gastrotomy + gastroduodenal anastomosis; gastroduodenal anastomosis). The standard term enhancement set can be denoted as $D_{stand}$.
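One possible reading of this first enhancement is sketched below in Python. It assumes the groups of combinable standard terms are already known (the mining step itself is not specified here), joins each group with a separator to form a synthetic clinical term, and pairs it with the constituent standard terms. The pairing, the default separator and the function name `build_standard_term_enhancement` are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Record = Tuple[str, List[str]]


def build_standard_term_enhancement(
    combinable_groups: Sequence[Sequence[str]], separator: str = "+"
) -> List[Record]:
    """Join each group of combinable standard terms into one synthetic record."""
    d_stand: List[Record] = []
    for group in combinable_groups:
        # Drop duplicates while keeping the original order.
        terms = list(dict.fromkeys(group))
        if len(terms) < 2:
            continue  # nothing to merge
        synthetic_term = f" {separator} ".join(terms)
        d_stand.append((synthetic_term, terms))
    return d_stand


# Hypothetical output of the mining step.
groups = [["gastrotomy", "gastroduodenal anastomosis"]]
d_stand = build_standard_term_enhancement(groups)
print(d_stand)
```

Each synthetic record contributes a training example whose label equals the number of merged standard terms.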

Secondly, when a separator exists in the clinical terms, expanding the standard term set corresponding to the clinical terms with the separator, and determining the expanded standard term set as a basic data enhancement set;

for example, if $X_i$ contains a separator, the pair $(X_i; Y_i)$ is enhanced. For instance, (hysteroscopy + segmental curettage; hysteroscopic diagnostic curettage) is expanded through expert labeling into (hysteroscopy; hysteroscopy) and (hysteroscopy segmental curettage; hysteroscopic diagnostic uterine curettage). The basic data enhancement set can be denoted as $D_{base}$.

Thirdly, counting the separators in each clinical term according to the basic data set, the standard term enhancement set and the basic data enhancement set, determining the probability of the clinical term with the separator, and determining the separator enhancement data set according to the determined probability of the clinical term with the separator.

Specifically, the number of occurrences of each separator in the clinical terms and the probability p of each separator are counted. For the separators, refer to Table 1; examples include "+", ",", "1.", "2." and the like. The candidate separators can be determined manually according to the actual data set, and the number of occurrences of each is counted. The probability of a separator is its number of occurrences divided by the total number of occurrences of all separators.
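The separator statistics can be sketched in Python as below. The candidate separator list and the sample clinical terms are placeholders chosen so that "+" accounts for 80% of separator occurrences; the function name `separator_probabilities` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def separator_probabilities(
    clinical_terms: Iterable[str], separators: Sequence[str]
) -> Dict[str, float]:
    """Probability of each separator: its occurrences divided by all separator occurrences."""
    counts: Counter = Counter()
    for term in clinical_terms:
        for sep in separators:
            counts[sep] += term.count(sep)
    total = sum(counts.values())
    if total == 0:
        return {}  # no separator found in the data
    return {sep: counts[sep] / total for sep in separators}


# Placeholder clinical terms: four "+" occurrences and one ",".
terms = ["a + b", "c + d", "e + f + g", "h, i"]
p = separator_probabilities(terms, ["+", ","])
print(p)
```

Separators that overlap textually (for example a numbered prefix that contains a period) would need a more careful tokenization than `str.count`; the sketch ignores that case.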

For example, m pieces of data are randomly selected from the set $D \cup D_{stand} \cup D_{base}$ such that $\sum_{1<j<m} \mathrm{Card}(Y_j) < K$ and $m > 1$, and separators are randomly chosen with probability p to join the m clinical terms $X_j$. For instance, if there are only two separators, one of which is "+", and 80% of the data in the data set is separated by "+", then the probability of "+" occurring is 80%. The separator enhancement data set $D_{symbel}$ is thus obtained according to the number and probability of each separator.
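A minimal Python sketch of generating one separator-enhanced record follows. It assumes that one separator is drawn independently for each junction between adjacent clinical terms and that the resulting standard term set is the union of the selected sets; both points are assumptions, as are the placeholder pool and the name `make_separator_sample`.

```python
import random
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, List[str]]


def make_separator_sample(
    pool: List[Record],
    sep_probs: Dict[str, float],
    m: int,
    k_max: int,
    rng: random.Random,
) -> Optional[Record]:
    """Join m randomly chosen records with separators drawn according to sep_probs."""
    if m <= 1:
        raise ValueError("m must be greater than 1")
    picked = rng.sample(pool, m)
    # Constraint from the text: the summed set sizes must stay below K.
    total = sum(len(set(ys)) for _, ys in picked)
    if not total < k_max:
        return None
    # Assumption: one separator per junction, sampled with probability p.
    seps = rng.choices(list(sep_probs), weights=list(sep_probs.values()), k=m - 1)
    text = picked[0][0]
    for sep, (term, _) in zip(seps, picked[1:]):
        text += f" {sep} {term}"
    merged: List[str] = []
    for _, ys in picked:
        for y in ys:
            if y not in merged:
                merged.append(y)
    return text, merged


# Placeholder pool standing in for the union of D, D_stand and D_base.
pool: List[Record] = [
    ("term a", ["standard a"]),
    ("term b", ["standard b"]),
    ("term c", ["standard c"]),
]
sample = make_separator_sample(pool, {"+": 1.0}, m=2, k_max=4, rng=random.Random(0))
print(sample)
```

Calling the function repeatedly and keeping the non-empty results would yield a collection playing the role of $D_{symbel}$.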

Step 240, fine-tuning the pre-trained language model BERT according to the classification labels and the enhanced data set to determine the number of standard terms into which a clinical term is standardized.

Specifically, the number of standard terms for a clinical term is predicted by the pre-trained language model BERT according to the basic data set, the standard term enhancement set, the basic data enhancement set, the separator enhancement data set and the classification labels. That is, the pre-trained language model BERT is fine-tuned (Finetuning) on $D \cup D_{stand} \cup D_{base} \cup D_{symbel}$ with classification labels 1, 2, ..., K, yielding a quantity model for clinical term standardization. The number of standard terms is then obtained through forward inference of the model.
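The data side of this step can be sketched in Python without committing to a particular BERT implementation. The sketch assumes a K-way classifier with zero-based class indices, which is a common library convention rather than something stated here, so label c in 1..K maps to index c - 1 and back. The function names and the sample logits are hypothetical.

```python
from typing import List, Sequence, Tuple

Record = Tuple[str, List[str]]


def to_examples(dataset: Sequence[Record], k_max: int) -> List[Tuple[str, int]]:
    """Turn (clinical term, standard term set) records into (text, class index) pairs."""
    examples: List[Tuple[str, int]] = []
    for term, ys in dataset:
        count = len(set(ys))
        if not 1 <= count <= k_max:
            raise ValueError(f"label {count} outside 1..{k_max}")
        examples.append((term, count - 1))  # zero-based class index (assumption)
    return examples


def predicted_count(logits: Sequence[float]) -> int:
    """Map the K class scores of a forward pass back to a count from 1 to K."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return best_index + 1


# Toy training records, standing in for the union of the four data sets.
train = [
    ("hysteroscopy + segmental curettage", ["hysteroscopic diagnostic curettage"]),
    ("gastrotomy + gastroduodenal anastomosis", ["gastrotomy", "gastroduodenal anastomosis"]),
]
examples = to_examples(train, k_max=2)
count = predicted_count([0.1, 2.3])  # made-up scores for illustration
print(examples, count)
```

In a full pipeline the examples would be tokenized and fed to a BERT sequence classifier with K output classes, and `predicted_count` would be applied to the scores it returns.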

With the quantity prediction method for clinical term standardization provided by the embodiment of the invention, the compatibility of the model with different separators can be improved through data enhancement, especially separator enhancement. In addition, by defining quantity prediction as a multi-class classification problem, accuracy is improved in cases where the number of separator-based segments differs from the number of standard terms, and the number of standard terms can be predicted correctly even when a clinical term has no separator.

FIG. 3 is a schematic structural diagram of a quantity prediction apparatus for clinical term standardization according to a second embodiment of the present invention. The apparatus is applied to quantity prediction for clinical term standardization and, as shown in FIG. 3, includes an obtaining unit 310 and a determining unit 320.

The obtaining unit 310 is configured to obtain a basic data set; the basic data set includes a plurality of clinical terms and a standard term set corresponding to each clinical term;

the determining unit 320 is configured to determine a classification label for the standard term set corresponding to each clinical term;

the determining unit 320 is further configured to enhance each standard term set to determine an enhanced data set;

the determining unit 320 is further configured to fine-tune the pre-trained language model BERT, based on the classification labels and the enhanced data set, to determine the number of standard terms into which a clinical term is standardized.

Further, the obtaining unit 310 is specifically configured to:

Further, the determining unit 320 is specifically configured to:

use the formula $K = \max_{1 \le i \le n} \mathrm{Card}(Y_i)$ to determine the classification labels; wherein the classification label is an integer from 1 to K, K is the maximum, over all clinical terms, of the number of standard terms in the corresponding standard term set, and n is the number of clinical terms; the basic data set is $D = \{X_1, X_2, \ldots, X_n; Y_1, Y_2, \ldots, Y_n\}$, where $X_i$ denotes a clinical term and $Y_i = \{y_{i1}, y_{i2}, \ldots, y_{ik}\}$ is the standard term set corresponding to $X_i$.

Further, the determining unit 320 is specifically configured to:

mine the standard term set, and determine standard terms in the standard term set that can be combined; merge the combinable standard terms, and determine the merged standard terms as a standard term enhancement set; and

when a separator exists in a clinical term, expand the standard term set corresponding to the clinical term with the separator, and determine the expanded standard term set as a basic data enhancement set; and

count the separators in each clinical term according to the basic data set, the standard term enhancement set and the basic data enhancement set, determine the probability of each separator occurring in the clinical terms, and determine the separator enhancement data set according to the determined probability.

Further, the determining unit 320 is specifically configured to:

By applying the quantity prediction apparatus for clinical term standardization provided by the embodiment of the invention, the compatibility of the model with different separators can be improved through data enhancement, especially separator enhancement. In addition, by defining quantity prediction as a multi-class classification problem, accuracy is improved in cases where the number of separator-based segments differs from the number of standard terms, and the number of standard terms can be predicted correctly even when a clinical term has no separator.

The third embodiment of the invention provides equipment, which comprises a memory and a processor, wherein the memory is used for storing programs, and the memory can be connected with the processor through a bus. The memory may be a non-volatile memory such as a hard disk drive and a flash memory, in which a software program and a device driver are stored. The software program is capable of performing various functions of the above-described methods provided by embodiments of the present invention; the device drivers may be network and interface drivers. The processor is used for executing a software program, and the software program can realize the method provided by the first embodiment of the invention when being executed.

A fourth embodiment of the present invention provides a computer program product including instructions, which, when the computer program product runs on a computer, causes the computer to execute the method provided in the first embodiment of the present invention.

The fifth embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method provided in the first embodiment of the present invention is implemented.

Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method for quantitative prediction of normalization of clinical terms, the method comprising:

enhancing each standard term set, and determining an enhanced data set;

2. The method according to claim 1, wherein the obtaining a basic data set specifically comprises:

3. The method of claim 1, wherein the determining the classification label of the standard term set corresponding to each clinical term specifically comprises:

4. The method according to claim 1, wherein said enhancing each set of standard terms, determining an enhanced data set specifically comprises:

5. The method according to claim 4, wherein the determining the amount of clinical term normalization, fine-tuned by a pre-trained language model BERT based on the classification labels and the enhanced data set, specifically comprises:

6. An apparatus for quantity prediction for normalization of clinical terms, the apparatus comprising:

7. The apparatus according to claim 6, wherein the obtaining unit is specifically configured to:

8. The apparatus according to claim 6, wherein the determining unit is specifically configured to:

9. The apparatus according to claim 6, wherein the determining unit is specifically configured to:

10. The apparatus according to claim 6, wherein the determining unit is specifically configured to: