CN116502629B - Medical direct reporting method and system based on self-training text error correction and text matching - Google Patents
- Publication number
- CN116502629B (CN202310735155.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- word
- training
- disease
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/226—Validation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/80—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for detecting, monitoring or modelling epidemics or pandemics, e.g. flu
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention relates to the technical field of disease early warning, and in particular to a medical direct reporting method and system based on self-training text error correction and text matching. Training data for a text error correction model is constructed from original medical record data and used to obtain a missing-word supplement model and a wrong-word correction model. New input data undergoes text error correction through these models; the similarity between the corrected data and an existing disease standard name knowledge base is then calculated using the bm25 algorithm and the jaccard algorithm, and the existing disease standard name with the highest comprehensive similarity score is selected as the standard disease name mapped by the new input data. The standard disease name is matched directly against the infectious disease names in a dangerous infectious disease database; if corresponding data exists, the current disease is judged to be a dangerous infectious disease and is reported directly to the responsible institution, completing the direct reporting of the dangerous disease. Data standardization is thereby achieved, so that the direct reporting system can accurately identify diseases, which addresses the problem of system inaccuracy.
Description
Technical Field
The invention relates to the technical field of disease early warning, in particular to a medical direct reporting method and a system based on self-training text error correction and text matching.
Background
As society and the natural environment change, the pathogens, transmission routes, disease characteristics and influencing factors of infectious diseases have also changed greatly. How to identify public health emergencies involving infectious diseases at an early stage, issue timely alerts, and take control measures as early as possible to minimize the resulting losses has long been a focus of the public health field and an important part of health emergency work. Early warning of public health emergencies involving infectious diseases is an important preventive control measure for avoiding or reducing the occurrence and spread of infectious diseases and for reducing their impact on public health, social safety and economic development, and it fully embodies the basic guidelines of public health emergency work. Early warning of public health emergencies means collecting, sorting, analyzing and integrating relevant data; using modern technologies such as computers, networks and communication to monitor, identify, diagnose and evaluate the signs of an event; issuing timely alerts; informing the relevant departments and the public so that they can prepare and respond; and taking effective prevention and control measures in time to prevent or slow the occurrence of the emergency or to reduce its harm as far as possible.
With the development of informatized diagnosis and treatment data, the demands on data management have also increased, and early warning and direct reporting mechanisms for dangerous infectious diseases have become important. However, a direct reporting system must cope with text errors in the data and with non-uniform data naming, which often cause the system to be insensitive to diseases or to report them with deviations.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a medical direct reporting method and system based on self-training text error correction and text matching, so as to solve the problems of insensitivity to diseases and deviation in a medical direct reporting system that are caused by text errors in the data and non-uniform data naming.
In order to solve the problems, the invention adopts the following technical scheme:
In one aspect, the invention provides a medical direct reporting method based on self-training text error correction and text matching, comprising the following steps:
constructing training data of a text error correction model based on the original medical record data, wherein the training data of the text error correction model comprises training data lacking characters and training data of wrongly written characters;
training, on the basis of the bert pre-training model, a missing-word supplement model with the training data lacking characters and a wrong-word correction model with the training data of wrongly written characters, to obtain the missing-word supplement model and the wrong-word correction model;
performing text error correction processing on the new input data through the missing-word supplement model and the wrong-word correction model;
calculating the similarity between the error-corrected data and the existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm;
adding the similarity calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score;
selecting the existing disease standard name with the highest similarity comprehensive score from the existing disease standard name knowledge base as the standard disease name mapped by the new input data;
and matching the mapped standard disease name directly against the infectious disease names in the dangerous infectious disease database, judging that the current disease is a dangerous infectious disease if corresponding data exists in the dangerous infectious disease database, and reporting the disease directly to the responsible institution to complete the direct reporting of the dangerous disease.
As an implementation manner, constructing the training data of the text error correction model based on the original medical record data includes:
randomly deleting two characters from each sentence in the original medical record data, recording the character indexes of the deleted positions and the deleted characters, and constructing the training data lacking characters;
and randomly replacing two characters in each sentence in the original medical record data with other characters, recording the character indexes of the replaced positions and the original characters before replacement, and constructing the training data of wrongly written characters.
As an embodiment, the structure of the bert pre-training model includes:
the L1 embedding layer, which matches the ids mapped from the input data against the embedding weight matrix to obtain embedding word vectors as the embedding matrix representation of the input data, wherein the vector dimension is 768;
the L2 multi-head attention mechanism layer, which passes the 768-dimensional feature vectors output by the embedding layer through three separate linear layers to extract matrix features, and obtains, through matrix multiplication, a 768-dimensional vector representation of each input that incorporates attention information;
the L3 forward calculation layer, which passes each output of the multi-head attention mechanism layer through two linear layers and, after activation by the activation layer, outputs the final 768-dimensional vector representation of each input;
the loss function of the bert pre-training model is the cross entropy:
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
wherein $p(x)$ is the actual label of the current input, and $q(x)$ is the value predicted by the model for each label.
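The three layers described above can be illustrated with a small pure-Python sketch. It is only an approximation under stated assumptions: the hidden size is reduced from 768 to 8 so that it runs quickly, a single attention head is used, and residual connections, layer normalization and bias terms are omitted. The GELU activation, the random weights and all function names are illustrative choices, not details taken from the patent.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def gelu(v):
    # Assumed activation; the patent only says "activation layer".
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def attention(x, wq, wk, wv):
    """L2: three linear projections, then scaled dot-product attention."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    k_t = [list(col) for col in zip(*k)]          # transpose of k
    scale = math.sqrt(len(q[0]))
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def feed_forward(x, w1, w2):
    """L3: two linear layers with an activation in between."""
    hidden = [[gelu(h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

DIM = 8                       # 768 in the patent; reduced here for speed
rng = random.Random(0)
embedding_table = rand_matrix(20, DIM, rng)   # hypothetical 20-token vocabulary
token_ids = [3, 7, 1, 12]                     # hypothetical ids of the input

x = [embedding_table[i] for i in token_ids]   # L1: id lookup in the embedding matrix
wq, wk, wv = (rand_matrix(DIM, DIM, rng) for _ in range(3))
attended = attention(x, wq, wk, wv)
out = feed_forward(attended, rand_matrix(DIM, 4 * DIM, rng), rand_matrix(4 * DIM, DIM, rng))
print(len(out), len(out[0]))
```

Each input token keeps one vector of the hidden size through all three layers, which is the property the description emphasizes.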
As an implementation manner, calculating the similarity between the error-corrected data and the existing disease standard name knowledge base by using the bm25 algorithm and the jaccard algorithm includes:
calculating the similarity to the existing disease standard name knowledge base using the bm25 algorithm: sentence $s_1$ is segmented into a word list $[w_i]$; for a sentence $s_2$ to be compared with $s_1$, the relevance score of each word $w_i$ with respect to $s_2$ is calculated, and finally the relevance scores of all $w_i$ with respect to $s_2$ are weighted and summed. The calculation formula is as follows:
$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i \,(k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$
wherein $\mathrm{idf}(w_i)$ is the idf of the word $w_i$, $f_i$ is the frequency with which the word $w_i$ occurs in sentence $s_2$, $k_1$ and $b$ are adjustment factors set to 2 and 0.75 respectively, $\mathrm{len}(s_2)$ is the length of sentence $s_2$, and $\mathrm{avgsl}$ is the average length of all sentences;
As an implementation manner, calculating the similarity between the error-corrected data and the existing disease standard name knowledge base by using the bm25 algorithm and the jaccard algorithm includes:
calculating the similarity to the existing disease standard name knowledge base using the jaccard algorithm: the Jaccard coefficient of the error-corrected data set A and the existing disease standard name set B in the existing disease standard name knowledge base is calculated, and the calculation formula is as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
In another aspect, the invention provides a medical direct reporting system based on self-training text error correction and text matching, which comprises a training data construction module, a missing-word supplement model and wrong-word correction model construction module, a text error correction module, a similarity calculation module, a comprehensive scoring module, a mapping standard disease name determination module and a direct reporting module;
the training data construction module is used for constructing training data of a text error correction model based on the original medical record data, wherein the training data of the text error correction model comprises training data lacking characters and training data of wrongly written characters;
the missing-word supplement model and wrong-word correction model construction module is used for training, on the basis of the bert pre-training model, a missing-word supplement model with the training data lacking characters and a wrong-word correction model with the training data of wrongly written characters, to obtain the missing-word supplement model and the wrong-word correction model;
the text error correction module is used for carrying out text error correction processing on the new input data through the missing-word supplement model and the wrong-word correction model;
the similarity calculation module is used for calculating the similarity between the data subjected to error correction processing and an existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm;
the comprehensive scoring module is used for adding the similarity calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score;
the mapping standard disease name determining module is used for selecting the existing disease standard name with the largest similarity comprehensive score from the existing disease standard name knowledge base as the standard disease name mapped by the new input data;
the direct reporting module is used for directly matching and searching the mapped standard disease name and the infectious disease name in the dangerous infectious disease database, judging that the current disease is dangerous infectious disease if corresponding data exists in the dangerous infectious disease database, and directly reporting the disease to a responsible institution to complete direct reporting of the dangerous disease.
As an embodiment, the training data construction module includes a construction unit for training data lacking characters and a construction unit for training data of wrongly written characters:
the construction unit for training data lacking characters is used for randomly deleting two characters from each sentence in the original medical record data, recording the character indexes of the deleted positions and the deleted characters, and constructing the training data lacking characters;
the construction unit for training data of wrongly written characters is used for randomly replacing two characters in each sentence in the original medical record data with other characters, recording the character indexes of the replaced positions and the original characters before replacement, and constructing the training data of wrongly written characters.
As an embodiment, the structure of the bert pre-training model includes:
the L1 embedding layer, which matches the ids mapped from the input data against the embedding weight matrix to obtain embedding word vectors as the embedding matrix representation of the input data, wherein the vector dimension is 768;
the L2 multi-head attention mechanism layer, which passes the 768-dimensional feature vectors output by the embedding layer through three separate linear layers to extract matrix features, and obtains, through matrix multiplication, a 768-dimensional vector representation of each input that incorporates attention information;
the L3 forward calculation layer, which passes each output of the multi-head attention mechanism layer through two linear layers and, after activation by the activation layer, outputs the final 768-dimensional vector representation of each input;
the loss function of the bert pre-training model is the cross entropy:
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
wherein $p(x)$ is the actual label of the current input, and $q(x)$ is the value predicted by the model for each label.
As an embodiment, the similarity calculation module includes a bm25 similarity calculation unit;
the bm25 similarity calculation unit is configured to calculate the similarity to the existing disease standard name knowledge base using the bm25 algorithm: sentence $s_1$ is segmented into a word list $[w_i]$; for a sentence $s_2$ to be compared with $s_1$, the relevance score of each word $w_i$ with respect to $s_2$ is calculated, and finally the relevance scores of all $w_i$ with respect to $s_2$ are weighted and summed. The calculation formula is as follows:
$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i \,(k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$
wherein $\mathrm{idf}(w_i)$ is the idf of the word $w_i$, $f_i$ is the frequency with which the word $w_i$ occurs in sentence $s_2$, $k_1$ and $b$ are adjustment factors set to 2 and 0.75 respectively, $\mathrm{len}(s_2)$ is the length of sentence $s_2$, and $\mathrm{avgsl}$ is the average length of all sentences;
As an implementation manner, the similarity calculation module includes a jaccard similarity calculation unit;
the jaccard similarity calculation unit is configured to calculate the similarity to the existing disease standard name knowledge base using the jaccard algorithm: the Jaccard coefficient of the error-corrected data set A and the existing disease standard name set B in the existing disease standard name knowledge base is calculated, and the calculation formula is as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The beneficial effects of the invention are as follows: in the medical direct reporting method and system based on self-training text error correction and text matching, a missing-word supplement model and a wrong-word correction model are established to perform text error correction on the input data. The similarity between the input data and the existing disease standard names in the knowledge base is calculated with the bm25 algorithm and the jaccard algorithm; the two scores are added, the existing disease standard name with the highest score is selected, and a mapping is established. The mapped name is then matched against the dangerous infectious disease database for judgment and reporting. Through this front-end text processing (supplementing missing words, correcting wrong words, and matching standard names), data standardization is achieved, the direct reporting system can accurately identify diseases, and the problems of system insensitivity and inaccuracy are overcome.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flow chart of a medical direct reporting method based on self-training text error correction and text matching according to an embodiment of the invention.
Fig. 2 is a schematic diagram of a medical direct-reporting system based on self-training text correction and text matching according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
It should be noted that these examples are only for illustrating the present invention, and not for limiting the present invention, and simple modifications of the method under the premise of the inventive concept are all within the scope of the claimed invention.
A self-training text correction and text matching based medical direct reporting method, comprising:
S100, constructing training data of a text error correction model based on the original medical record data, wherein the training data of the text error correction model comprises training data lacking characters and training data of wrongly written characters.
Deleting two characters in each sentence in the original medical record data at random, recording character indexes of the deleting positions and deleted character information, and constructing training data lacking characters;
and randomly replacing two characters in each sentence in the original medical record data with other characters, recording the character index of the replacement position and the information of the original characters before replacement, and constructing the training data of wrongly written characters.
The original medical record data refers to a hospital's historical medical record data; the training data is constructed after this historical data has been desensitized or screened.
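The construction of the two training sets in S100 can be sketched as follows. This is a minimal illustration, assuming each sentence is a plain string and that replacement characters are drawn from a caller-supplied vocabulary; the function names, the `vocab` parameter and the example sentence are hypothetical and not part of the patent.

```python
import random

def make_missing_sample(sentence, n=2, rng=random):
    """Delete n random characters; return the corrupted text and (index, char) labels."""
    if len(sentence) < n:
        raise ValueError("sentence is shorter than the number of deletions")
    positions = sorted(rng.sample(range(len(sentence)), n))
    removed = [(i, sentence[i]) for i in positions]   # indexes refer to the original sentence
    drop = set(positions)
    corrupted = "".join(ch for i, ch in enumerate(sentence) if i not in drop)
    return corrupted, removed

def make_typo_sample(sentence, vocab, n=2, rng=random):
    """Replace n random characters with different ones; return text and original (index, char)."""
    if len(sentence) < n:
        raise ValueError("sentence is shorter than the number of replacements")
    positions = sorted(rng.sample(range(len(sentence)), n))
    chars = list(sentence)
    originals = []
    for i in positions:
        candidates = [c for c in vocab if c != chars[i]]  # never "replace" with the same character
        originals.append((i, chars[i]))
        chars[i] = rng.choice(candidates)
    return "".join(chars), originals

demo_rng = random.Random(42)
record = "patient diagnosed with pulmonary tuberculosis"
print(make_missing_sample(record, rng=demo_rng))
print(make_typo_sample(record, vocab="abcdefghijklmnopqrstuvwxyz", rng=demo_rng))
```

Recording the index together with the original character gives each corrupted sentence its own supervision label, which is what allows the training data to be generated from unlabeled medical records.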
S200, training, on the basis of the bert pre-training model, a missing-word supplement model with the training data lacking characters and a wrong-word correction model with the training data of wrongly written characters, to obtain the missing-word supplement model and the wrong-word correction model.
The structure of the bert pre-training model comprises:
the L1 embedding layer, which matches the ids mapped from the input data against the embedding weight matrix to obtain embedding word vectors as the embedding matrix representation of the input data, wherein the vector dimension is 768;
the L2 multi-head attention mechanism layer, which passes the 768-dimensional feature vectors output by the embedding layer through three separate linear layers to extract matrix features, and obtains, through matrix multiplication, a 768-dimensional vector representation of each input that incorporates attention information;
the L3 forward calculation layer, which passes each output of the multi-head attention mechanism layer through two linear layers and, after activation by the activation layer, outputs the final 768-dimensional vector representation of each input;
the loss function of the bert pre-training model is the cross entropy:
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
wherein $p(x)$ is the actual label of the current input, and $q(x)$ is the value predicted by the model for each label.
Assume a three-class task in which the correct label of a sample is the first class, so that $p = [1, 0, 0]$, and assume the model predicts $q = [0.5, 0.4, 0.1]$. The cross entropy is then calculated as follows (taking the natural logarithm):
$$H(p, q) = -(1 \cdot \log 0.5 + 0 \cdot \log 0.4 + 0 \cdot \log 0.1) = -\log 0.5 \approx 0.693$$
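The three-class example above can be checked numerically. The sketch below assumes the natural logarithm (the base is not stated in the text) and clamps predictions at a small `eps` to avoid `log(0)`; the function name and the clamp are illustrative additions.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """Cross entropy between a true distribution p and a predicted distribution q."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))

loss = cross_entropy([1, 0, 0], [0.5, 0.4, 0.1])
print(round(loss, 4))  # 0.6931
```

Only the probability assigned to the correct class contributes, so the loss falls toward zero as that probability approaches 1.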
S300, performing text error correction processing on the new input data through the missing-word supplement model and the wrong-word correction model.
After model training is completed, the new input data, that is, the medical record data that needs to be judged and reported, is fed into the models for processing.
S400, calculating the similarity between the error-corrected data and the existing disease standard name knowledge base by using the bm25 algorithm and the jaccard algorithm.
The similarity to the existing disease standard name knowledge base is calculated using the bm25 algorithm: sentence $s_1$ is segmented into a word list $[w_i]$; for a sentence $s_2$ to be compared with $s_1$, the relevance score of each word $w_i$ with respect to $s_2$ is calculated, and finally the relevance scores of all $w_i$ with respect to $s_2$ are weighted and summed. The calculation formula is as follows:
$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i \,(k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$
wherein $\mathrm{idf}(w_i)$ is the idf of the word $w_i$, $f_i$ is the frequency with which the word $w_i$ occurs in sentence $s_2$, $k_1$ and $b$ are adjustment factors set to 2 and 0.75 respectively, $\mathrm{len}(s_2)$ is the length of sentence $s_2$, and $\mathrm{avgsl}$ is the average length of all sentences.
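The bm25 scoring described above can be sketched as follows, using the stated factors $k_1 = 2$ and $b = 0.75$. Several details are assumptions: the input is taken to be already segmented into word lists, the idf uses a common smoothed form (the patent does not specify one), and the function names and sample knowledge base are hypothetical.

```python
import math
from collections import Counter

def idf(word, corpus):
    """Smoothed inverse document frequency (assumed variant)."""
    n = len(corpus)
    df = sum(1 for doc in corpus if word in doc)
    return math.log((n - df + 0.5) / (df + 0.5) + 1.0)

def bm25(query_tokens, doc_tokens, corpus, k1=2.0, b=0.75):
    """Score doc_tokens (s2) against query_tokens (the word list of s1)."""
    avgsl = sum(len(doc) for doc in corpus) / len(corpus)   # average sentence length
    freq = Counter(doc_tokens)
    score = 0.0
    for w in query_tokens:
        f = freq[w]                                         # frequency of w in s2
        if f == 0:
            continue
        denom = f + k1 * (1 - b + b * len(doc_tokens) / avgsl)
        score += idf(w, corpus) * f * (k1 + 1) / denom
    return score

# Hypothetical knowledge base of standard names, already segmented into words.
knowledge_base = [
    ["pulmonary", "tuberculosis"],
    ["viral", "hepatitis", "b"],
    ["bacillary", "dysentery"],
]
query = ["pulmonary", "tuberculosis"]
for name in knowledge_base:
    print(" ".join(name), round(bm25(query, name, knowledge_base), 3))
```

Standard names that share no word with the corrected input score exactly zero, so only overlapping candidates compete for the best match.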
The similarity to the existing disease standard name knowledge base is calculated using the jaccard algorithm: the Jaccard coefficient of the error-corrected data set A and the existing disease standard name set B in the existing disease standard name knowledge base is calculated, and the calculation formula is as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
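The Jaccard coefficient is the size of the intersection of two sets divided by the size of their union. The sketch below assumes both the corrected input and the standard name are reduced to sets of tokens (whether the patent uses words or characters is not stated); the function name and the zero result for two empty sets are illustrative choices.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two token collections."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0                      # convention chosen here for two empty sets
    return len(set_a & set_b) / len(set_a | set_b)

print(jaccard(["pulmonary", "tuberculosis"], ["pulmonary", "tuberculosis"]))
print(jaccard(["viral", "hepatitis"], ["viral", "hepatitis", "b"]))
```

The result always lies between 0 and 1, with 1 meaning the two sets are identical.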
S500, adding the similarity calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score.
S600, selecting the existing disease standard name with the largest similarity comprehensive score from the existing disease standard name knowledge base as the standard disease name mapped by the new input data.
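Steps S500 and S600 amount to summing two scores per candidate and taking the maximum. The sketch below assumes the two similarities have already been computed and stored in dictionaries keyed by standard name; the function name and the numeric scores are invented for illustration only.

```python
def pick_standard_name(bm25_scores, jaccard_scores):
    """Add the two similarities per candidate and return the best name with its score."""
    combined = {
        name: bm25_scores[name] + jaccard_scores.get(name, 0.0)
        for name in bm25_scores
    }
    best = max(combined, key=combined.get)
    return best, combined[best]

# Illustrative scores only; real values would come from the similarity step.
bm25_scores = {"pulmonary tuberculosis": 3.1, "viral hepatitis b": 0.4}
jaccard_scores = {"pulmonary tuberculosis": 0.67, "viral hepatitis b": 0.1}
print(pick_standard_name(bm25_scores, jaccard_scores))
```

Note that a bm25 score is unbounded while a Jaccard coefficient is at most 1, so with plain addition the bm25 term tends to dominate and the Jaccard term mostly separates close candidates.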
S700, matching the mapped standard disease name directly against the infectious disease names in the dangerous infectious disease database, judging that the current disease is a dangerous infectious disease if corresponding data exists in the dangerous infectious disease database, and reporting the disease directly to the responsible institution to complete the direct reporting of the dangerous disease.
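The final step is an exact-match lookup followed by a notification. The sketch below assumes the dangerous infectious disease database can be represented as an in-memory set of names and that reporting is a callback supplied by the caller; `direct_report`, `notify` and the sample disease names are hypothetical.

```python
def direct_report(standard_name, dangerous_diseases, notify):
    """Report the disease if its standard name is in the dangerous disease set."""
    if standard_name in dangerous_diseases:      # direct (exact) match
        notify(standard_name)
        return True
    return False

# Illustrative data; a deployment would query the actual database instead.
dangerous_diseases = {"cholera", "plague", "pulmonary tuberculosis"}
reports = []
print(direct_report("pulmonary tuberculosis", dangerous_diseases, reports.append))
print(direct_report("common cold", dangerous_diseases, reports.append))
print(reports)
```

Because the lookup is an exact match, it depends on the earlier steps having mapped the free-text input to a standard name first.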
In another aspect, the invention provides a medical direct reporting system based on self-training text error correction and text matching, which comprises a training data construction module 100, a missing-word supplement model and wrong-word correction model construction module 200, a text error correction module 300, a similarity calculation module 400, a comprehensive scoring module 500, a mapping standard disease name determination module 600 and a direct reporting module 700;
the training data construction module 100 is configured to construct training data of a text error correction model based on the original medical record data, where the training data of the text error correction model includes training data lacking characters and training data of wrongly written characters;
the missing-word supplement model and wrong-word correction model construction module 200 is configured to train, on the basis of the bert pre-training model, a missing-word supplement model with the training data lacking characters and a wrong-word correction model with the training data of wrongly written characters, so as to obtain the missing-word supplement model and the wrong-word correction model;
the text error correction module 300 is configured to perform text error correction processing on new input data through the missing-word supplement model and the wrong-word correction model;
the similarity calculation module 400 is configured to calculate the similarity between the error-corrected data and the existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm;
the comprehensive scoring module 500 is configured to add the similarity calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score;
the mapping standard disease name determining module 600 is configured to select, from the existing disease standard name knowledge base, an existing disease standard name with the largest similarity comprehensive score as a standard disease name mapped by the new input data;
the direct reporting module 700 is configured to directly match and retrieve the mapped standard disease name and the infectious disease name in the dangerous infectious disease database, and if corresponding data exists in the dangerous infectious disease database, determine that the current disease is a dangerous infectious disease, and directly report the disease to a responsible institution to complete direct reporting of the dangerous disease.
As an embodiment, the training data construction module 100 includes a construction unit 110 for training data lacking characters and a construction unit 120 for training data of wrongly written characters:
the construction unit 110 for training data lacking characters is configured to randomly delete two characters from each sentence in the original medical record data, record the character indexes of the deleted positions and the deleted characters, and construct the training data lacking characters;
the construction unit 120 for training data of wrongly written characters is configured to randomly replace two characters in each sentence in the original medical record data with other characters, record the character indexes of the replaced positions and the original characters before replacement, and construct the training data of wrongly written characters.
As an embodiment, the structure of the bert pretraining model includes:
the L1 embedding layer is used for matching the ids mapped from the input data against the embedding weight matrix to obtain embedding word vectors as the embedding matrix representation of the input data, wherein the vector dimension is 768;
the L2 multi-head attention mechanism layer is used for extracting matrix features from the 768-dimensional feature vectors output by the embedding layer through three linear layers respectively, and obtaining, through matrix multiplication, a 768-dimensional vector representation of each input fused with attention information;
the L3 forward calculation layer is used for passing each vector from the multi-head attention mechanism layer through two linear layers and, after activation by the activation layer, outputting the final 768-dimensional vector representation of each input;
the loss function of the bert pre-training model is as follows:
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
wherein p(x) is the actual label of the current input, and q(x) is the predicted value of the model for each label.
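By way of illustration only, and assuming the loss is the usual cross-entropy between the actual label distribution p(x) and the predicted distribution q(x), which is what the definitions of p(x) and q(x) suggest, the computation can be sketched in Python as follows. The function name `cross_entropy` and the clipping constant `eps` are hypothetical and are not part of the original text.

```python
import math


def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(p, q) = -sum_x p(x) * log q(x).

    p: actual label distribution (for example a one-hot vector).
    q: predicted probability for each label.
    eps: lower clip on q to avoid log(0) (assumption for numerical safety).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of labels")
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))
```

With a one-hot p, the loss reduces to the negative log of the probability that the model assigns to the correct label, so a confident correct prediction gives a loss close to zero.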
As an embodiment, the similarity calculation module 400 includes a bm25 similarity calculation unit 410;
the bm25 similarity calculation unit 410 is configured to calculate the similarity with the existing disease standard name knowledge base using the bm25 algorithm: sentence $s_1$ is segmented into words to generate a word list $[w_i]$; for a sentence $s_2$ to be compared with $s_1$, the relevance score of each word $w_i$ with respect to $s_2$ is calculated, and finally the relevance scores of the words $w_i$ with respect to $s_2$ are weighted and summed, and the calculation formula is as follows:
$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i \,(k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$
wherein $\mathrm{idf}(w_i)$ is the idf of the word $w_i$, $f_i$ is the frequency with which the word $w_i$ occurs in sentence $s_2$, $k_1$ and $b$ are adjustment factors set to 2 and 0.75 respectively, $\mathrm{len}(s_2)$ is the length of sentence $s_2$, and avgsl is the average length of all sentences;
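The bm25 calculation described above can be sketched in Python as follows. This is a hedged illustration, not the patented implementation: the function name `bm25_score`, the assumption that sentences are already segmented into token lists, and in particular the idf formula (a commonly used smoothed variant) are assumptions, since the original text does not state how idf is computed.

```python
import math
from collections import Counter


def bm25_score(query_tokens, doc_tokens, corpus, k1=2.0, b=0.75):
    """bm25 relevance of doc_tokens (s2) to query_tokens (words of s1).

    corpus: non-empty list of token lists, used for idf and for the
    average sentence length avgsl.
    """
    n_docs = len(corpus)
    avgsl = sum(len(doc) for doc in corpus) / n_docs
    freqs = Counter(doc_tokens)
    score = 0.0
    for word in query_tokens:
        # Number of sentences in the corpus that contain the word.
        n_word = sum(1 for doc in corpus if word in doc)
        # Smoothed idf variant (assumption; not given in the source).
        idf = math.log((n_docs - n_word + 0.5) / (n_word + 0.5) + 1.0)
        f = freqs[word]
        denom = f + k1 * (1 - b + b * len(doc_tokens) / avgsl)
        score += idf * f * (k1 + 1) / denom
    return score
```

The defaults `k1=2.0` and `b=0.75` follow the adjustment factors stated in the text. A word of $s_1$ that does not occur in $s_2$ contributes zero to the sum.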
As an embodiment, the similarity calculation module 400 includes a jaccard similarity calculation unit 420;
the jaccard similarity calculation unit 420 is configured to calculate the similarity with the existing disease standard name knowledge base using the jaccard algorithm: the Jaccard coefficient of the error-corrected data set A and the existing disease standard name set B in the existing disease standard name knowledge base is calculated, and the calculation formula is as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
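As a brief, non-limiting illustration of the Jaccard coefficient, the following Python sketch computes the ratio of the intersection size to the union size of two sets. The function name `jaccard`, the treatment of each string as a set of characters, and the value returned for two empty sets are assumptions; the original text does not state the granularity of the set elements.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two iterables.

    Strings are treated as sets of characters (assumption); token lists
    work equally well. Two empty inputs give 0.0 by convention here.
    """
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)
```

For example, `jaccard("abc", "bcd")` shares two of four distinct characters and therefore evaluates to 0.5.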
Finally, it is noted that the above-mentioned embodiments illustrate rather than limit the invention, and those skilled in the art will understand that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. A self-training text correction and text matching based medical direct reporting method, comprising:
constructing training data of a text error correction model based on the original medical record data, wherein the training data of the text error correction model comprises training data lacking characters and training data of wrongly written characters;
training a missing word supplementing model and a wrong word correcting model based on the bert pre-training model by using the training data lacking the characters and the training data of the wrong word to obtain the missing word supplementing model and the wrong word correcting model;
performing text error correction processing on the new input data through the missing word supplementing model and the wrong word correcting model;
calculating the similarity between the error-corrected data and the existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm;
adding the similarity calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score;
selecting the existing disease standard name with the highest similarity comprehensive score from the existing disease standard name knowledge base as the standard disease name mapped by the new input data;
and directly matching and searching the mapped standard disease name and the infectious disease name in the dangerous infectious disease database, judging that the current disease is dangerous infectious disease if corresponding data exists in the dangerous infectious disease database, and directly reporting the disease to a responsible institution to finish dangerous disease direct reporting.
2. The self-training text correction and text matching based medical direct reporting method of claim 1 wherein constructing training data of a text correction model based on raw medical record data comprises:
deleting two characters in each sentence in the original medical record data at random, recording character indexes of the deleting positions and deleted character information, and constructing training data lacking characters;
and randomly replacing two characters in each sentence in the original medical record data with other characters, recording the character index of the replacement position and the information of the original characters before replacement, and constructing the training data of wrongly written characters.
3. The self-training text correction and text matching based medical direct reporting method of claim 1, wherein the structure of the bert pre-training model comprises:
the L1 embedding layer multiplies the embedding weight matrix by the ids mapped from the input data to obtain embedding word vectors as the embedding matrix representation of the input data, wherein the vector dimension is 768;
the L2 multi-head attention mechanism layer is used for extracting matrix features from the 768-dimensional feature vectors output by the embedding layer through three linear layers respectively, and obtaining, through matrix multiplication, a 768-dimensional vector representation of each input fused with attention information;
the L3 forward calculation layer is used for passing each vector from the multi-head attention mechanism layer through two linear layers and, after activation by the activation layer, outputting the final 768-dimensional vector representation of each input;
the loss function of the bert pre-training model is as follows:
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
wherein p(x) is the actual label of the current input, and q(x) is the predicted value of the model for each label.
4. The self-training text correction and text matching based medical direct reporting method as claimed in claim 1, wherein the calculating the similarity between the error correction processed data and the existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm comprises:
calculating the similarity with the existing disease standard name knowledge base using the bm25 algorithm: sentence $s_1$ is segmented into words to generate a word list $[w_i]$; for a sentence $s_2$ to be compared with $s_1$, the relevance score of each word $w_i$ with respect to $s_2$ is calculated, and finally the relevance scores of the words $w_i$ with respect to $s_2$ are weighted and summed, and the calculation formula is as follows:
$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i \,(k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$
wherein $\mathrm{idf}(w_i)$ is the idf of the word $w_i$, $f_i$ is the frequency with which the word $w_i$ occurs in sentence $s_2$, $k_1$ and $b$ are adjustment factors set to 2 and 0.75 respectively, $\mathrm{len}(s_2)$ is the length of sentence $s_2$, and avgsl is the average length of all sentences.
5. The self-training text correction and text matching based medical direct reporting method as claimed in claim 1, wherein the calculating the similarity between the error correction processed data and the existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm comprises:
calculating the similarity with the existing disease standard name knowledge base using the jaccard algorithm: the Jaccard coefficient of the error-corrected data set A and the existing disease standard name set B in the existing disease standard name knowledge base is calculated, and the calculation formula is as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
6. A medical direct reporting system based on self-training text error correction and text matching, characterized by comprising a training data construction module, a missing word supplementing model and wrong word correcting model construction module, a text error correction module, a similarity calculation module, a comprehensive scoring module, a mapping standard disease name determination module and a direct reporting module;
the training data construction module is used for constructing training data of a text error correction model based on the original medical record data, wherein the training data of the text error correction model comprises training data lacking characters and training data of wrongly written characters;
the missing word supplementing model and wrong word correcting model construction module is used for training, based on the bert pre-training model, with the training data lacking characters and the training data of wrongly written characters respectively, to obtain a missing word supplementing model and a wrong word correcting model;
the text error correction module is used for carrying out text error correction processing on the new input data through the missing word supplementing model and the wrong word correcting model;
the similarity calculation module is used for calculating the similarity between the data subjected to error correction processing and an existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm;
the comprehensive scoring module is used for adding the similarity calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score;
the mapping standard disease name determining module is used for selecting the existing disease standard name with the largest similarity comprehensive score from the existing disease standard name knowledge base as the standard disease name mapped by the new input data;
the direct reporting module is used for directly matching and searching the mapped standard disease name and the infectious disease name in the dangerous infectious disease database, judging that the current disease is dangerous infectious disease if corresponding data exists in the dangerous infectious disease database, and directly reporting the disease to a responsible institution to complete direct reporting of the dangerous disease.
7. The self-training text correction and text matching based medical direct reporting system of claim 6, wherein the training data construction module comprises a training data construction unit lacking characters and a wrongly written word training data construction unit:
the training data construction unit lacking the characters is used for randomly deleting two characters in each sentence in the original medical record data, recording character indexes of deleting positions and deleted character information, and constructing training data lacking the characters;
the wrongly written word training data construction unit is used for randomly replacing two characters in each sentence in the original medical record data with other characters, recording the character index of the replacement position and the information of the original characters before replacement, and constructing wrongly written word training data.
8. The self-training text correction and text matching based medical direct reporting system of claim 6, wherein the structure of the bert pre-training model comprises:
the L1 embedding layer multiplies the embedding weight matrix by the ids mapped from the input data to obtain embedding word vectors as the embedding matrix representation of the input data, wherein the vector dimension is 768;
the L2 multi-head attention mechanism layer is used for extracting matrix features from the 768-dimensional feature vectors output by the embedding layer through three linear layers respectively, and obtaining, through matrix multiplication, a 768-dimensional vector representation of each input fused with attention information;
the L3 forward calculation layer is used for passing each vector from the multi-head attention mechanism layer through two linear layers and, after activation by the activation layer, outputting the final 768-dimensional vector representation of each input;
the loss function of the bert pre-training model is as follows:
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
wherein p(x) is the actual label of the current input, and q(x) is the predicted value of the model for each label.
9. The self-training text correction and text matching based medical direct reporting system of claim 6, wherein the similarity calculation module comprises a bm25 similarity calculation unit;
the bm25 similarity calculation unit is configured to calculate the similarity with the existing disease standard name knowledge base by using the bm25 algorithm: sentence $s_1$ is segmented into words to generate a word list $[w_i]$; for a sentence $s_2$ to be compared with $s_1$, the relevance score of each word $w_i$ with respect to $s_2$ is calculated, and finally the relevance scores of the words $w_i$ with respect to $s_2$ are weighted and summed, and the calculation formula is as follows:
$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i \,(k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$
wherein $\mathrm{idf}(w_i)$ is the idf of the word $w_i$, $f_i$ is the frequency with which the word $w_i$ occurs in sentence $s_2$, $k_1$ and $b$ are adjustment factors set to 2 and 0.75 respectively, $\mathrm{len}(s_2)$ is the length of sentence $s_2$, and avgsl is the average length of all sentences.
10. The self-training text correction and text matching based medical direct reporting system of claim 6, wherein the similarity calculation module comprises a jaccard similarity calculation unit;
the jaccard similarity calculation unit is configured to calculate the similarity with the existing disease standard name knowledge base by using the jaccard algorithm: the Jaccard coefficient of the error-corrected data set A and the existing disease standard name set B in the existing disease standard name knowledge base is calculated, and the calculation formula is as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310735155.9A CN116502629B (en) | 2023-06-20 | 2023-06-20 | Medical direct reporting method and system based on self-training text error correction and text matching |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116502629A CN116502629A (en) | 2023-07-28 |
CN116502629B CN116502629B (en) | 2023-08-18 |
Family
ID=87316810
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310735155.9A Active CN116502629B (en) | 2023-06-20 | 2023-06-20 | Medical direct reporting method and system based on self-training text error correction and text matching |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116502629B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111859921A (en) * | 2020-07-08 | 2020-10-30 | 金蝶软件(中国)有限公司 | Text error correction method and device, computer equipment and storage medium |
KR20210035987A (en) * | 2019-09-25 | 2021-04-02 | 국민대학교산학협력단 | Document search device and method based on jaccard model |
CN116127952A (en) * | 2023-01-16 | 2023-05-16 | 之江实验室 | Multi-granularity Chinese text error correction method and device |
Non-Patent Citations (1)
Title |
---|
An in-depth study of similarity predicate committee; Jia Zhu et al.; ELSEVIER; pp. 381-393 * |
Also Published As
Publication number | Publication date |
---|---|
CN116502629A (en) | 2023-07-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||