CN116502629B

CN116502629B - Medical direct reporting method and system based on self-training text error correction and text matching

Info

Publication number: CN116502629B
Application number: CN202310735155.9A
Authority: CN
Inventors: 刘硕; 杨雅婷; 白焜太; 宋佳祥; 许娟; 史文钊
Original assignee: Digital Health China Technologies Co Ltd
Current assignee: Digital Health China Technologies Co Ltd
Priority date: 2023-06-20
Filing date: 2023-06-20
Publication date: 2023-08-18
Anticipated expiration: 2043-06-20
Also published as: CN116502629A

Abstract

The invention relates to the technical field of disease early warning, and in particular to a medical direct reporting method and system based on self-training text error correction and text matching. Training data for a text error correction model are constructed from original medical record data and used to obtain a word-missing supplementary model and a word-error correction model, and new input data undergo text error correction through these models. The similarity between the error-corrected data and an existing disease standard name knowledge base is calculated using the bm25 algorithm and the jaccard algorithm, and the existing disease standard name with the highest comprehensive similarity score is selected as the standard disease name mapped by the new input data. The standard disease name is then directly matched against the infectious disease names in a dangerous infectious disease database; if corresponding data exist, the current disease is judged to be a dangerous infectious disease and is reported directly to the responsible institution, completing the direct reporting of the dangerous disease. Data standardization is thereby achieved, so that the direct reporting system can identify diseases accurately, which addresses the problem of system inaccuracy.

Description

Medical direct reporting method and system based on self-training text error correction and text matching

Technical Field

The invention relates to the technical field of disease early warning, in particular to a medical direct reporting method and a system based on self-training text error correction and text matching.

Background

As society and the natural environment change, the pathogens, transmission routes, disease characteristics and influencing factors of infectious diseases are also changing considerably. How to identify public health emergencies involving infectious diseases at an early stage, raise an alarm in time, and take control measures as early as possible to minimize the resulting losses has long been a focus of the public health field and an important part of health emergency work. Early warning of infectious disease public health emergencies is an important preventive control measure for avoiding or reducing the occurrence and spread of infectious diseases and for reducing their impact on public health, social safety and economic development, and it embodies the basic guidelines of public health emergency work. Such early warning collects, sorts, analyzes and integrates relevant data; monitors, identifies, diagnoses and evaluates the signs of an event using modern technologies such as computers, networks and communications; raises an alarm in time; and informs the relevant departments and the public so that they can prepare and take effective prevention and control measures promptly, preventing or slowing the emergency or reducing its harm as far as possible.

With the development of diagnosis and treatment data informatization, the demand for data management has also grown, and early warning and direct reporting mechanisms for dangerous infectious diseases have become increasingly important. However, a direct reporting system has to cope with textual errors in the data and non-uniform data naming, which often make the system insensitive to diseases or cause deviations in its results.

Disclosure of Invention

In view of the defects of the prior art, the invention aims to provide a medical direct reporting method and system based on self-training text error correction and text matching, so as to solve the problem that a medical direct reporting system is insensitive to diseases or produces deviations because of textual errors in the data and non-uniform data naming.

In order to solve the problems, the invention adopts the following technical scheme:

In one aspect, the invention provides a medical direct reporting method based on self-training text error correction and text matching, comprising the following steps:

constructing training data of a text error correction model based on the original medical record data, wherein the training data of the text error correction model comprises training data lacking characters and training data of wrongly written characters;

training, based on the bert pre-training model, a word-missing supplementary model with the training data lacking characters and a word-error correction model with the training data of wrongly written characters, to obtain the word-missing supplementary model and the word-error correction model;

performing text error correction processing on the new input data through the word-missing supplementary model and the word-error correction model;

calculating the similarity between the error-corrected data and the existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm;

adding the similarity calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score;

selecting the existing disease standard name with the highest similarity comprehensive score from the existing disease standard name knowledge base as the standard disease name mapped by the new input data;

and directly matching and searching the mapped standard disease name and the infectious disease name in the dangerous infectious disease database, judging that the current disease is dangerous infectious disease if corresponding data exists in the dangerous infectious disease database, and directly reporting the disease to a responsible institution to finish dangerous disease direct reporting.

As an implementation manner, constructing the training data of the text error correction model based on the original medical record data includes:

deleting two characters in each sentence in the original medical record data at random, recording character indexes of the deleting positions and deleted character information, and constructing training data lacking characters;

and randomly replacing two characters in each sentence in the original medical record data with other characters, recording the character index of the replacement position and the information of the original characters before replacement, and constructing the training data of wrongly written characters.
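The two corruption steps above can be sketched in Python as follows. This is a minimal sketch rather than the patented implementation: the function names, the `vocab` pool of replacement characters, the seeded random generator and the example sentence are all assumptions introduced here for illustration.

```python
import random

def make_missing_sample(sentence, rng, n=2):
    """Delete n random characters; return (corrupted_text, [(index, deleted_char), ...])."""
    positions = sorted(rng.sample(range(len(sentence)), n))
    dropped = set(positions)
    corrupted = "".join(ch for i, ch in enumerate(sentence) if i not in dropped)
    return corrupted, [(i, sentence[i]) for i in positions]

def make_typo_sample(sentence, rng, vocab, n=2):
    """Replace n random characters; return (corrupted_text, [(index, original_char), ...])."""
    chars = list(sentence)
    labels = []
    for i in sorted(rng.sample(range(len(sentence)), n)):
        labels.append((i, chars[i]))
        # Hypothetical replacement pool: any character from vocab except the original.
        chars[i] = rng.choice([c for c in vocab if c != chars[i]])
    return "".join(chars), labels

def restore_missing(corrupted, labels):
    """Re-insert deleted characters at their recorded indexes (ascending order)."""
    chars = list(corrupted)
    for i, ch in labels:
        chars.insert(i, ch)
    return "".join(chars)

rng = random.Random(0)              # fixed seed only for reproducibility
sentence = "患者确诊为流行性感冒"      # made-up example sentence
missing_text, missing_labels = make_missing_sample(sentence, rng)
typo_text, typo_labels = make_typo_sample(sentence, rng, vocab="肝炎痢疾肺结核")
```

Because each label keeps the original index and character, the clean sentence can be recovered from the corrupted sample, which gives the supervision signal needed for training without manual annotation.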

As an embodiment, the structure of the bert pretraining model includes:

the L1 embedding layer is used for multiplying the embedding weight matrix by the ids mapped from the input data to obtain embedding word vectors as the embedding matrix representation of the input data, wherein the vector dimension is 768;

the L2 multi-head attention mechanism layer is used for extracting matrix features from the 768-dimensional feature vectors output by the embedding layer through three linear layers respectively, and obtaining, through matrix multiplication, a 768-dimensional vector representation of each input fused with attention information;

the L3 forward calculation layer is used for passing each output of the multi-head attention mechanism layer through two linear layers with an activation layer, and outputting the final 768-dimensional vector representation of each input;

the loss function of the bert pre-training model is the cross entropy:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

wherein p(x) is the actual label of the current input, and q(x) is the predicted value of the model for each label.

As an implementation manner, the calculating the similarity between the error correction processed data and the existing disease standard name knowledge base by using the bm25 algorithm and the jaccard algorithm includes:

calculating the similarity to the existing disease standard name knowledge base using the bm25 algorithm: sentence s_1 is segmented into a word list [w_i]; for a sentence s_2 to be compared with s_1, the relevance score of each word w_i with respect to s_2 is calculated, and finally the relevance scores of all w_i with respect to s_2 are weighted and summed, the calculation formula being as follows:

$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i \, (k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$

wherein idf(w_i) is the idf value of word w_i, f_i is the frequency with which word w_i occurs in sentence s_2, k_1 and b are adjustment factors set to 2 and 0.75 respectively, len(s_2) is the length of sentence s_2, and avgsl is the average length of all sentences;
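A hedged Python sketch of this bm25 scoring follows. The text does not say how idf is computed or which word segmenter is used, so two assumptions are made here: idf uses the common smoothed form log(1 + (N - n + 0.5) / (n + 0.5)), and each Chinese character is treated as one token. The function name `bm25_scores` and the three-entry knowledge base are illustrative only.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=2.0, b=0.75):
    """Score one query (s_1) against every candidate sentence (s_2) in the corpus."""
    n_docs = len(corpus_tokens)
    avgsl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Number of candidate sentences that contain each token (for idf).
    doc_freq = Counter(tok for doc in corpus_tokens for tok in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for w in query_tokens:
            f = tf[w]
            if f == 0:
                continue  # a token absent from s_2 contributes nothing
            # Assumed idf variant; the source does not specify one.
            idf = math.log(1.0 + (n_docs - doc_freq[w] + 0.5) / (doc_freq[w] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgsl))
        scores.append(score)
    return scores

# Toy knowledge base of standard names, tokenized per character (an assumption).
knowledge_base = ["流行性感冒", "病毒性肝炎", "细菌性痢疾"]
scores = bm25_scores(list("流行感冒"), [list(name) for name in knowledge_base])
best_name = knowledge_base[scores.index(max(scores))]
```

With k1 = 2 and b = 0.75 as stated in the text, candidates that share no token with the query score exactly zero, so only the first toy entry receives a positive score here.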

calculating the similarity to the existing disease standard name knowledge base using the jaccard algorithm: the Jaccard coefficient of the error-corrected data set A and the existing disease standard name set B in the existing disease standard name knowledge base is calculated, the calculation formula being as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
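The Jaccard coefficient can be illustrated with a few lines of Python. The sketch assumes that sets A and B are the sets of characters in the two strings; the text does not state whether characters or segmented words are used, and the function name `jaccard` is introduced here only for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over the characters of two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    # Two empty inputs are defined as having zero similarity to avoid dividing by zero.
    return len(set_a & set_b) / len(union) if union else 0.0

similarity = jaccard("流行感冒", "流行性感冒")  # 4 shared characters out of 5 distinct
```

The result always lies between 0 and 1, unlike the bm25 score, which has no fixed upper bound.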

In another aspect, the invention provides a medical direct reporting system based on self-training text error correction and text matching, which comprises a training data construction module, a word-missing supplementary model and word-error correction model construction module, a text error correction module, a similarity calculation module, a comprehensive scoring module, a mapping standard disease name determination module and a direct reporting module;

the training data construction module is used for constructing training data of a text error correction model based on the original medical record data, wherein the training data of the text error correction model comprises training data lacking characters and training data of wrongly written characters;

the word-missing supplementary model and word-error correction model construction module is used for training, based on the bert pre-training model, the word-missing supplementary model with the training data lacking characters and the word-error correction model with the training data of wrongly written characters, to obtain the word-missing supplementary model and the word-error correction model;

the text error correction module is used for carrying out text error correction processing on the new input data through the word-missing supplementary model and the word-error correction model;

the similarity calculation module is used for calculating the similarity between the data subjected to error correction processing and an existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm;

the comprehensive scoring module is used for adding the similarity calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score;

the mapping standard disease name determining module is used for selecting the existing disease standard name with the largest similarity comprehensive score from the existing disease standard name knowledge base as the standard disease name mapped by the new input data;

the direct reporting module is used for directly matching and searching the mapped standard disease name and the infectious disease name in the dangerous infectious disease database, judging that the current disease is dangerous infectious disease if corresponding data exists in the dangerous infectious disease database, and directly reporting the disease to a responsible institution to complete direct reporting of the dangerous disease.

As an embodiment, the training data construction module includes a training data construction unit for lacking characters and a training data construction unit for wrongly written characters:

the training data construction unit for lacking characters is used for randomly deleting two characters in each sentence in the original medical record data, recording the character indexes of the deletion positions and the deleted character information, and constructing the training data lacking characters;

the training data construction unit for wrongly written characters is used for randomly replacing two characters in each sentence in the original medical record data with other characters, recording the character indexes of the replacement positions and the information of the original characters before replacement, and constructing the training data of wrongly written characters.

As an embodiment, the structure of the bert pretraining model includes:

the loss function of the bert pre-training model is the cross entropy:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

As an embodiment, the similarity calculation module includes a bm25 similarity calculation unit;

the bm25 similarity calculation unit is configured to calculate the similarity to the existing disease standard name knowledge base using the bm25 algorithm: sentence s_1 is segmented into a word list [w_i]; for a sentence s_2 to be compared with s_1, the relevance score of each word w_i with respect to s_2 is calculated, and finally the relevance scores of all w_i with respect to s_2 are weighted and summed, the calculation formula being as follows:

$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i \, (k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$

As an implementation manner, the similarity calculation module includes a jaccard similarity calculation unit;

the jaccard similarity calculation unit is configured to calculate the similarity to the existing disease standard name knowledge base using the jaccard algorithm: the Jaccard coefficient of the error-corrected data set A and the existing disease standard name set B in the existing disease standard name knowledge base is calculated, the calculation formula being as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The invention has the following beneficial effects: in the medical direct reporting method and system based on self-training text error correction and text matching, a word-missing supplementary model and a word-error correction model are established to perform text error correction on the input data; the similarity between the input data and the existing disease standard names in the knowledge base is calculated with the bm25 algorithm and the jaccard algorithm, and the two scores are added to find the existing disease standard name with the highest score and establish the mapping; the dangerous infectious disease database is then matched for judgment and reporting. By correcting missing-word and wrong-word text up front and matching it to a standard name, data standardization is achieved, so that the direct reporting system can identify diseases accurately, overcoming the insensitivity and inaccuracy of the system.

Drawings

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings, in which:

Fig. 1 is a schematic flow chart of a medical direct reporting method based on self-training text correction and text matching according to an embodiment of the invention.

Fig. 2 is a schematic diagram of a medical direct-reporting system based on self-training text correction and text matching according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to specific examples.

It should be noted that these examples are only for illustrating the present invention, and not for limiting the present invention, and simple modifications of the method under the premise of the inventive concept are all within the scope of the claimed invention.

A self-training text correction and text matching based medical direct reporting method, comprising:

S100, constructing training data of a text error correction model based on the original medical record data, wherein the training data of the text error correction model comprises training data lacking characters and training data of wrongly written characters.

The original medical record data refers to the historical medical record data of a hospital; the training data are constructed after the historical medical record data have been desensitized or screened.

S200, training, based on the bert pre-training model, the word-missing supplementary model with the training data lacking characters and the word-error correction model with the training data of wrongly written characters, to obtain the word-missing supplementary model and the word-error correction model.

Wherein, the structure of the bert pre-training model comprises:

the loss function of the bert pre-training model is the cross entropy:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

Assuming a three-class task in which the correct label of a sample is the first class, p = [1, 0, 0]; if the model prediction is q = [0.5, 0.4, 0.1], the cross entropy is calculated as follows:

$$H(p, q) = -(1 \cdot \log 0.5 + 0 \cdot \log 0.4 + 0 \cdot \log 0.1) = -\log 0.5$$
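The same three-class example can be checked numerically. The snippet below is an illustrative sketch; the function name `cross_entropy` and the choice of the natural logarithm are assumptions, since the text does not state the base of the logarithm.

```python
import math

def cross_entropy(p, q):
    """Cross entropy -sum(p(x) * log q(x)); terms with p(x) == 0 contribute nothing."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# True label is the first class; the model assigns it probability 0.5.
loss = cross_entropy([1, 0, 0], [0.5, 0.4, 0.1])  # equals -ln(0.5) = ln(2), about 0.693
```

Only the probability assigned to the correct class affects the loss for a one-hot label, so a more confident correct prediction, for example 0.9 instead of 0.5, yields a smaller value.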

S300, performing text error correction processing on the new input data through the word-missing supplementary model and the word-error correction model.

After model training is completed, the new input data, namely the medical record data that need to be judged and reported, are input into the models for processing.

S400, calculating the similarity between the error-corrected data and an existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm.

Wherein, the bm25 algorithm is used for calculating the similarity to the existing disease standard name knowledge base: sentence s_1 is segmented into a word list [w_i]; for a sentence s_2 to be compared with s_1, the relevance score of each word w_i with respect to s_2 is calculated, and finally the relevance scores of all w_i with respect to s_2 are weighted and summed, the calculation formula being as follows:

$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i \, (k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$

The similarity to the existing disease standard name knowledge base is also calculated using the jaccard algorithm: the Jaccard coefficient of the error-corrected data set A and the existing disease standard name set B in the existing disease standard name knowledge base is calculated, the calculation formula being as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

S500, adding the similarities calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score.

S600, selecting the existing disease standard name with the largest similarity comprehensive score from the existing disease standard name knowledge base as the standard disease name mapped by the new input data.

S700, directly matching and searching the mapped standard disease name against the infectious disease names in the dangerous infectious disease database, judging that the current disease is a dangerous infectious disease if corresponding data exists in the dangerous infectious disease database, and directly reporting the disease to a responsible institution to finish direct reporting of the dangerous disease.
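Steps S500 to S700 can be sketched as below. This is a minimal illustration under stated assumptions: the two score lists are taken as already computed for each standard name, the numbers are made up, the dangerous infectious disease database is modeled as a plain Python set, and the function names are hypothetical.

```python
def pick_standard_name(standard_names, bm25_scores, jaccard_scores):
    """S500 + S600: add the two similarities and return the best name with its score."""
    combined = [s1 + s2 for s1, s2 in zip(bm25_scores, jaccard_scores)]
    best = max(range(len(standard_names)), key=combined.__getitem__)
    return standard_names[best], combined[best]

def needs_direct_report(standard_name, dangerous_db):
    """S700: exact-match lookup of the mapped name in the dangerous disease database."""
    return standard_name in dangerous_db

standard_names = ["流行性感冒", "病毒性肝炎"]
# Made-up scores for one error-corrected input; real values come from S400.
name, score = pick_standard_name(standard_names, [1.9, 0.0], [0.8, 0.0])
report = needs_direct_report(name, dangerous_db={"流行性感冒"})
```

Since the bm25 score is unbounded while the Jaccard coefficient stays within 0 to 1, a plain sum lets bm25 dominate when its values are large; the text specifies simple addition, so no rescaling is applied in this sketch.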

In another aspect, the invention provides a medical direct reporting system based on self-training text error correction and text matching, which comprises a training data construction module 100, a word-missing supplementary model and word-error correction model construction module 200, a text error correction module 300, a similarity calculation module 400, a comprehensive scoring module 500, a mapping standard disease name determination module 600 and a direct reporting module 700;

the training data construction module 100 is configured to construct training data of a text error correction model based on the original medical record data, where the training data of the text error correction model includes training data lacking characters and training data of wrongly written characters;

the word-missing supplementary model and word-error correction model construction module 200 is configured to train, based on the bert pre-training model, the word-missing supplementary model with the training data lacking characters and the word-error correction model with the training data of wrongly written characters, so as to obtain the word-missing supplementary model and the word-error correction model;

the text error correction module 300 is configured to perform text error correction processing on new input data through the word-missing supplementary model and the word-error correction model;

the similarity calculation module 400 is configured to calculate the similarity between the error-corrected data and the existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm;

the comprehensive scoring module 500 is configured to add the similarity calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score;

the mapping standard disease name determining module 600 is configured to select, from the existing disease standard name knowledge base, an existing disease standard name with the largest similarity comprehensive score as a standard disease name mapped by the new input data;

the direct reporting module 700 is configured to directly match and retrieve the mapped standard disease name and the infectious disease name in the dangerous infectious disease database, and if corresponding data exists in the dangerous infectious disease database, determine that the current disease is a dangerous infectious disease, and directly report the disease to a responsible institution to complete direct reporting of the dangerous disease.

As an embodiment, the training data construction module 100 includes a training data construction unit 110 lacking characters and a wrongly written training data construction unit 120:

the training data construction unit 110 lacking characters is configured to randomly delete two characters in each sentence in the original medical record data, record the character indexes of the deletion positions and the deleted character information, and construct the training data lacking characters;

the wrongly written training data construction unit 120 is configured to randomly replace two characters in each sentence in the original medical record data with other characters, record the character index of the replacement position and the information of the original characters before replacement, and construct the wrongly written training data.

As an embodiment, the structure of the bert pretraining model includes:

the loss function of the bert pre-training model is the cross entropy:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

As an embodiment, the similarity calculation module 400 includes a bm25 similarity calculation unit 410;

the bm25 similarity calculation unit 410 is configured to calculate the similarity to the existing disease standard name knowledge base using the bm25 algorithm: sentence s_1 is segmented into a word list [w_i]; for a sentence s_2 to be compared with s_1, the relevance score of each word w_i with respect to s_2 is calculated, and finally the relevance scores of all w_i with respect to s_2 are weighted and summed, the calculation formula being as follows:

$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i \, (k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$

As an embodiment, the similarity calculation module 400 includes a jaccard similarity calculation unit 420;

the jaccard similarity calculation unit 420 is configured to calculate the similarity to the existing disease standard name knowledge base using the jaccard algorithm: the Jaccard coefficient of the error-corrected data set A and the existing disease standard name set B in the existing disease standard name knowledge base is calculated, the calculation formula being as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Finally, it is noted that the above embodiments illustrate rather than limit the invention, and those skilled in the art will understand that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A self-training text correction and text matching based medical direct reporting method, comprising:

2. The self-training text correction and text matching based medical direct reporting method of claim 1 wherein constructing training data of a text correction model based on raw medical record data comprises:

3. The self-training text correction and text matching based medical direct reporting method of claim 1, wherein the structure of the bert pre-training model comprises:

the L1 embedding layer multiplies the embedding weight matrix by the ids mapped from the input data to obtain embedding word vectors as the embedding matrix representation of the input data, wherein the vector dimension is 768;

the loss function of the bert pre-training model is the cross entropy:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

4. The self-training text correction and text matching based medical direct reporting method as claimed in claim 1, wherein the calculating the similarity between the error correction processed data and the existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm comprises:

calculating the similarity to the existing disease standard name knowledge base using the bm25 algorithm: sentence s_1 is segmented into a word list [w_i]; for a sentence s_2 to be compared with s_1, the relevance score of each word w_i with respect to s_2 is calculated, and finally the relevance scores of all w_i with respect to s_2 are weighted and summed, the calculation formula being as follows:

$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i \, (k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$

wherein idf(w_i) is the idf value of word w_i, f_i is the frequency with which word w_i occurs in sentence s_2, k_1 and b are adjustment factors set to 2 and 0.75 respectively, len(s_2) is the length of sentence s_2, and avgsl is the average length of all sentences.

5. The self-training text correction and text matching based medical direct reporting method as claimed in claim 1, wherein the calculating the similarity between the error correction processed data and the existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm comprises:

calculating the similarity to the existing disease standard name knowledge base using the jaccard algorithm: the Jaccard coefficient of the error-corrected data set A and the existing disease standard name set B in the existing disease standard name knowledge base is calculated, the calculation formula being as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

6. A medical direct reporting system based on self-training text error correction and text matching, characterized by comprising a training data construction module, a word-missing supplementary model and word-error correction model construction module, a text error correction module, a similarity calculation module, a comprehensive scoring module, a mapping standard disease name determination module and a direct reporting module;

7. The self-training text correction and text matching based medical direct reporting system of claim 6, wherein the training data construction module comprises a training data construction unit for lacking characters and a training data construction unit for wrongly written characters:

8. The self-training text correction and text matching based medical direct reporting system of claim 6, wherein the structure of the bert pre-training model comprises:

the loss function of the bert pre-training model is the cross entropy:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

9. The self-training text correction and text matching based medical direct reporting system of claim 6, wherein the similarity calculation module comprises a bm25 similarity calculation unit;

the bm25 similarity calculation unit is configured to calculate the similarity to the existing disease standard name knowledge base using the bm25 algorithm: sentence s_1 is segmented into a word list [w_i]; for a sentence s_2 to be compared with s_1, the relevance score of each word w_i with respect to s_2 is calculated, and finally the relevance scores of all w_i with respect to s_2 are weighted and summed, the calculation formula being as follows:

$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i \, (k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$

10. The self-training text correction and text matching based medical direct reporting system of claim 6, wherein the similarity calculation module comprises a jaccard similarity calculation unit;

the jaccard similarity calculation unit is configured to calculate the similarity to the existing disease standard name knowledge base using the jaccard algorithm: the Jaccard coefficient of the error-corrected data set A and the existing disease standard name set B in the existing disease standard name knowledge base is calculated, the calculation formula being as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$