CN116502629B - Medical direct reporting method and system based on self-training text error correction and text matching - Google Patents

Info

Publication number
CN116502629B
CN116502629B CN202310735155.9A CN202310735155A CN116502629B CN 116502629 B CN116502629 B CN 116502629B CN 202310735155 A CN202310735155 A CN 202310735155A CN 116502629 B CN116502629 B CN 116502629B
Authority
CN
China
Prior art keywords
data
word
training
disease
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310735155.9A
Other languages
Chinese (zh)
Other versions
CN116502629A (en)
Inventor
刘硕
杨雅婷
白焜太
宋佳祥
许娟
史文钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Health China Technologies Co Ltd
Original Assignee
Digital Health China Technologies Co Ltd
Application filed by Digital Health China Technologies Co Ltd
Priority to CN202310735155.9A
Publication of CN116502629A
Application granted
Publication of CN116502629B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/226 Validation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/80 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for detecting, monitoring or modelling epidemics or pandemics, e.g. flu
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention relates to the technical field of disease early warning, and in particular to a medical direct reporting method and system based on self-training text error correction and text matching. Training data for a text error correction model is constructed from original medical record data and used to obtain a missing-word supplement model and a wrong-word correction model, which perform text error correction on new input data. The similarity between the corrected data and an existing disease standard name knowledge base is calculated with the bm25 algorithm and the jaccard algorithm, and the existing disease standard name with the highest comprehensive similarity score is selected as the standard disease name to which the new input data is mapped. The mapped standard disease name is then searched by direct matching against the infectious disease names in a dangerous infectious disease database; if corresponding data exists, the current disease is judged to be a dangerous infectious disease and is reported directly to the responsible institution, completing the direct report of the dangerous disease. Data standardization is thereby achieved, so that the direct reporting system can identify diseases accurately, which addresses the inaccuracy of the system.

Description

Medical direct reporting method and system based on self-training text error correction and text matching
Technical Field
The invention relates to the technical field of disease early warning, in particular to a medical direct reporting method and a system based on self-training text error correction and text matching.
Background
As society and the natural environment change, the pathogens, transmission routes, disease characteristics and influencing factors of infectious diseases have also changed greatly. How to identify infectious disease public health emergencies at an early stage, raise an alarm in time, and take control measures as early as possible so as to minimize the resulting losses has long been a focus of attention in the public health field and an important part of health emergency work. Early warning of infectious disease public health emergencies is an important preventive control measure for avoiding or reducing the occurrence and spread of infectious diseases and for reducing their impact on public health, social safety and economic development, and it fully embodies the basic guidelines of public health emergency work. Early warning of public health emergencies means collecting, sorting, analyzing and integrating relevant data; monitoring, identifying, diagnosing and evaluating the signs of an event using modern technologies such as computers, networks and communications; raising an alarm in time; informing the relevant departments and the public so that they can take countermeasures and make preparations; and taking effective prevention and control measures in time, so as to prevent or slow the occurrence of the emergency or reduce its harm as much as possible.
With the development of informatization of diagnosis and treatment data, the demand for data management has also increased, and early warning and direct reporting mechanisms for dangerous infectious diseases have become important. However, a direct reporting system must cope with text errors and inconsistent naming in the data, which often cause the system to be insensitive to diseases or to deviate in its reports.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a medical direct reporting method and system based on self-training text error correction and text matching, so as to solve the problems of insensitivity to diseases and deviation in a medical direct reporting system that are caused by text errors and inconsistent naming in the data.
In order to solve the problems, the invention adopts the following technical scheme:
in one aspect, the invention provides a medical direct reporting method based on self-training text correction and text matching, comprising the following steps:
constructing training data of a text error correction model based on the original medical record data, wherein the training data of the text error correction model comprises training data lacking characters and training data of wrongly written characters;
training a missing word supplementing model and a wrong word correcting model based on the bert pre-training model by using the training data lacking the characters and the training data of the wrong word to obtain the missing word supplementing model and the wrong word correcting model;
performing text error correction processing on the new input data through the word-missing supplementary model and the word-error correction model;
calculating the similarity between the error-corrected data and the existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm;
adding the similarity calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score;
selecting the existing disease standard name with the highest similarity comprehensive score from the existing disease standard name knowledge base as the standard disease name mapped by the new input data;
and directly matching and searching the mapped standard disease name and the infectious disease name in the dangerous infectious disease database, judging that the current disease is dangerous infectious disease if corresponding data exists in the dangerous infectious disease database, and directly reporting the disease to a responsible institution to finish dangerous disease direct reporting.
As an implementation manner, the training data for constructing the text error correction model based on the original medical record data includes:
deleting two characters in each sentence in the original medical record data at random, recording character indexes of the deleting positions and deleted character information, and constructing training data lacking characters;
and randomly replacing two characters in each sentence in the original medical record data with other characters, recording the character index of the replacement position and the information of the original characters before replacement, and constructing the training data of wrongly written characters.
As an embodiment, the structure of the bert pretraining model includes:
the L1 embedding layer is used for looking up the ids mapped from the input data in the embedding weight matrix to obtain embedding word vectors as the embedding matrix representation of the input data, wherein the vector dimension is 768;
the L2 multi-head attention mechanism layer is used for extracting matrix features from the 768-dimensional feature vectors output by the embedding layer through three linear layers respectively, and obtaining, through matrix multiplication, a 768-dimensional vector representation of each input that incorporates attention information;
the L3 forward calculation layer is used for passing each vector output by the multi-head attention mechanism layer through two linear layers and, after activation by the activation layer, outputting the final 768-dimensional vector representation of each input;
the loss function of the bert pre-training model is the cross entropy:
$$H(p, q) = -\sum_{x} p(x)\,\log q(x)$$
wherein p(x) is the actual label of the current input, and q(x) is the model's predicted value for each label.
As an implementation manner, the calculating the similarity between the error correction processed data and the existing disease standard name knowledge base by using the bm25 algorithm and the jaccard algorithm includes:
similarity to the existing disease standard name knowledge base is calculated using the bm25 algorithm: sentence $s_1$ is segmented into a word list $[w_i]$; for a sentence $s_2$ to be compared with $s_1$, the relevance score of each word $w_i$ with respect to $s_2$ is calculated, and finally the relevance scores of all $w_i$ with respect to $s_2$ are weighted and summed, with the calculation formula as follows:
$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i\,(k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$
wherein $\mathrm{idf}(w_i)$ is the idf of the word $w_i$, $f_i$ is the frequency with which the word $w_i$ occurs in sentence $s_2$, $k_1$ and $b$ are adjustment factors set to 2 and 0.75 respectively, $\mathrm{len}(s_2)$ is the length of sentence $s_2$, and avgsl is the average length of all sentences;
As an implementation manner, the calculating the similarity between the error correction processed data and the existing disease standard name knowledge base by using the bm25 algorithm and the jaccard algorithm includes:
similarity to the existing disease standard name knowledge base is calculated using the jaccard algorithm: the Jaccard coefficient of the error-corrected data set A and the existing disease standard name set B in the existing disease standard name knowledge base is calculated, with the calculation formula as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
In another aspect, the invention provides a medical direct reporting system based on self-training text correction and text matching, which comprises a training data construction module, a missing-word supplement model and wrong-word correction model construction module, a text error correction module, a similarity calculation module, a comprehensive scoring module, a mapping standard disease name determination module and a direct reporting module;
the training data construction module is used for constructing training data of a text error correction model based on the original medical record data, wherein the training data of the text error correction model comprises training data lacking characters and training data of wrongly written characters;
the missing-word supplement model and wrong-word correction model construction module is used for training a missing-word supplement model and a wrong-word correction model based on the bert pre-training model, using the training data lacking characters and the training data of wrongly written characters respectively, to obtain the missing-word supplement model and the wrong-word correction model;
the text error correction module is used for performing text error correction processing on the new input data through the missing-word supplement model and the wrong-word correction model;
the similarity calculation module is used for calculating the similarity between the data subjected to error correction processing and an existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm;
the comprehensive scoring module is used for adding the similarity calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score;
the mapping standard disease name determining module is used for selecting the existing disease standard name with the largest similarity comprehensive score from the existing disease standard name knowledge base as the standard disease name mapped by the new input data;
the direct reporting module is used for directly matching and searching the mapped standard disease name and the infectious disease name in the dangerous infectious disease database, judging that the current disease is dangerous infectious disease if corresponding data exists in the dangerous infectious disease database, and directly reporting the disease to a responsible institution to complete direct reporting of the dangerous disease.
As an embodiment, the training data construction module includes a training data construction unit lacking characters and a wrongly written word training data construction unit:
the training data construction unit lacking the characters is used for randomly deleting two characters in each sentence in the original medical record data, recording character indexes of deleting positions and deleted character information, and constructing training data lacking the characters;
the wrongly written word training data construction unit is used for randomly replacing two characters in each sentence in the original medical record data with other characters, recording the character index of the replacement position and the information of the original characters before replacement, and constructing wrongly written word training data.
As an embodiment, the structure of the bert pretraining model includes:
the L1 embedding layer is used for looking up the ids mapped from the input data in the embedding weight matrix to obtain embedding word vectors as the embedding matrix representation of the input data, wherein the vector dimension is 768;
the L2 multi-head attention mechanism layer is used for extracting matrix features from the 768-dimensional feature vectors output by the embedding layer through three linear layers respectively, and obtaining, through matrix multiplication, a 768-dimensional vector representation of each input that incorporates attention information;
the L3 forward calculation layer is used for passing each vector output by the multi-head attention mechanism layer through two linear layers and, after activation by the activation layer, outputting the final 768-dimensional vector representation of each input;
the loss function of the bert pre-training model is the cross entropy:
$$H(p, q) = -\sum_{x} p(x)\,\log q(x)$$
wherein p(x) is the actual label of the current input, and q(x) is the model's predicted value for each label.
As an embodiment, the similarity calculation module includes a bm25 similarity calculation unit;
the bm25 similarity calculation unit is configured to calculate the similarity with the existing disease standard name knowledge base by using the bm25 algorithm: sentence $s_1$ is segmented into a word list $[w_i]$; for a sentence $s_2$ to be compared with $s_1$, the relevance score of each word $w_i$ with respect to $s_2$ is calculated, and finally the relevance scores of all $w_i$ with respect to $s_2$ are weighted and summed, with the calculation formula as follows:
$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i\,(k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$
wherein $\mathrm{idf}(w_i)$ is the idf of the word $w_i$, $f_i$ is the frequency with which the word $w_i$ occurs in sentence $s_2$, $k_1$ and $b$ are adjustment factors set to 2 and 0.75 respectively, $\mathrm{len}(s_2)$ is the length of sentence $s_2$, and avgsl is the average length of all sentences;
As an implementation manner, the similarity calculation module includes a jaccard similarity calculation unit;
the jaccard similarity calculation unit is configured to calculate the similarity with the existing disease standard name knowledge base by using the jaccard algorithm: the Jaccard coefficient of the error-corrected data set A and the existing disease standard name set B in the existing disease standard name knowledge base is calculated, with the calculation formula as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The invention has the following beneficial effects: in the medical direct reporting method and system based on self-training text error correction and text matching, a missing-word supplement model and a wrong-word correction model are established to perform text error correction on input data; the similarity between the input data and the existing disease standard names in the knowledge base is calculated with the bm25 algorithm and the jaccard algorithm, the two similarities are added, and the existing disease standard name with the highest score is taken to establish a mapping; the mapped name is then matched against the dangerous infectious disease database for judgment and reporting. Through this front-end text processing, which supplements missing words, corrects wrong words and matches standard names, data standardization is achieved, so that the direct reporting system can identify diseases accurately, overcoming the problems of system insensitivity and inaccuracy.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flow chart of a medical direct reporting method based on self-training text correction and text matching according to an embodiment of the invention.
Fig. 2 is a schematic diagram of a medical direct-reporting system based on self-training text correction and text matching according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
It should be noted that these examples are only for illustrating the present invention, and not for limiting the present invention, and simple modifications of the method under the premise of the inventive concept are all within the scope of the claimed invention.
A self-training text correction and text matching based medical direct reporting method, comprising:
S100, constructing training data of a text error correction model based on the original medical record data, wherein the training data of the text error correction model comprises training data lacking characters and training data of wrongly written characters.
Deleting two characters in each sentence in the original medical record data at random, recording character indexes of the deleting positions and deleted character information, and constructing training data lacking characters;
and randomly replacing two characters in each sentence in the original medical record data with other characters, recording the character index of the replacement position and the information of the original characters before replacement, and constructing the training data of wrongly written characters.
The original medical record data refers to a hospital's historical medical record data; the training data is constructed after this historical data has been desensitized or screened.
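The construction of the two training sets in S100 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the `vocab` parameter used as the source of replacement characters, the `rng` parameter, and the choice to record indexes relative to the original sentence are all hypothetical.

```python
import random

def make_missing_char_sample(sentence, n=2, rng=random):
    """Delete n random characters.

    Returns the corrupted sentence and a list of (index, deleted_char)
    labels, where index refers to the position in the ORIGINAL sentence.
    """
    if len(sentence) < n:
        raise ValueError("sentence is shorter than the number of deletions")
    positions = sorted(rng.sample(range(len(sentence)), n))
    removed = set(positions)
    labels = [(i, sentence[i]) for i in positions]
    corrupted = "".join(ch for i, ch in enumerate(sentence) if i not in removed)
    return corrupted, labels

def make_wrong_char_sample(sentence, vocab, n=2, rng=random):
    """Replace n random characters with different characters from vocab.

    Returns the corrupted sentence and a list of (index, original_char) labels.
    """
    if len(sentence) < n:
        raise ValueError("sentence is shorter than the number of replacements")
    positions = sorted(rng.sample(range(len(sentence)), n))
    chars = list(sentence)
    labels = []
    for i in positions:
        original = chars[i]
        # Exclude the original character so the position really becomes wrong.
        candidates = [c for c in vocab if c != original]
        chars[i] = rng.choice(candidates)
        labels.append((i, original))
    return "".join(chars), labels

# Illustrative sentence; real input would be desensitized medical record text.
example = "fever and cough for three days"
print(make_missing_char_sample(example, rng=random.Random(0)))
print(make_wrong_char_sample(example, "abcdefghijklmnopqrstuvwxyz", rng=random.Random(0)))
```

Because the labels store the original index and character, each corrupted sentence can be restored exactly, which is what lets the corrupted/original pairs serve as supervised training data.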
S200, training a missing-word supplement model and a wrong-word correction model based on the bert pre-training model, using the training data lacking characters and the training data of wrongly written characters respectively, to obtain the missing-word supplement model and the wrong-word correction model.
Wherein, the structure of the bert pre-training model comprises:
the L1 embedding layer is used for looking up the ids mapped from the input data in the embedding weight matrix to obtain embedding word vectors as the embedding matrix representation of the input data, wherein the vector dimension is 768;
the L2 multi-head attention mechanism layer is used for extracting matrix features from the 768-dimensional feature vectors output by the embedding layer through three linear layers respectively, and obtaining, through matrix multiplication, a 768-dimensional vector representation of each input that incorporates attention information;
the L3 forward calculation layer is used for passing each vector output by the multi-head attention mechanism layer through two linear layers and, after activation by the activation layer, outputting the final 768-dimensional vector representation of each input;
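In symbols, the three layers can be written roughly as below. This is a sketch that assumes the standard BERT encoder formulation; the weight names $W_E$, $W_Q$, $W_K$, $W_V$, $W_1$, $W_2$, the head count $H$, the inner width $d_{ff}$, the activation $\sigma$, and the scaling by $\sqrt{d_k}$ are assumptions not stated in the text, and residual connections, layer normalization and the attention output projection are left out.

```latex
\begin{aligned}
E &= \mathrm{Lookup}(W_E, \mathrm{ids}) \in \mathbb{R}^{n \times 768} \\
Q &= E W_Q, \quad K = E W_K, \quad V = E W_V,
  \qquad W_Q, W_K, W_V \in \mathbb{R}^{768 \times 768} \\
\mathrm{head}_h &= \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}}\right) V_h,
  \qquad d_k = 768 / H \\
A &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H) \in \mathbb{R}^{n \times 768} \\
F &= \sigma(A W_1 + b_1)\, W_2 + b_2 \in \mathbb{R}^{n \times 768},
  \qquad W_1 \in \mathbb{R}^{768 \times d_{ff}},\; W_2 \in \mathbb{R}^{d_{ff} \times 768}
\end{aligned}
```

Here $n$ is the number of input tokens, $Q_h$, $K_h$, $V_h$ are the column slices of $Q$, $K$, $V$ belonging to head $h$, and the three projections correspond to the "three linear layers" of L2 while $W_1$ and $W_2$ correspond to the "two linear layers" of L3.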
the loss function of the bert pre-training model is the cross entropy:
$$H(p, q) = -\sum_{x} p(x)\,\log q(x)$$
wherein p(x) is the actual label of the current input, and q(x) is the model's predicted value for each label.
Assuming a three-class task in which the correct label of a sample is the first class, p = [1, 0, 0]; if the model's prediction is q = [0.5, 0.4, 0.1], the cross entropy is calculated as follows:
$$H(p, q) = -(1 \cdot \log 0.5 + 0 \cdot \log 0.4 + 0 \cdot \log 0.1) = -\log 0.5$$
which is approximately 0.693 when the natural logarithm is used.
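The three-class example can be checked numerically with a few lines of Python. This is a sketch that assumes the natural logarithm (the text does not state the base); the function name `cross_entropy` is illustrative.

```python
import math

def cross_entropy(p, q):
    """Cross entropy H(p, q) = -sum_x p(x) * log q(x).

    Terms with p(x) == 0 contribute nothing and are skipped, which also
    avoids evaluating log(0) for classes the label does not select.
    """
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

p = [1, 0, 0]        # one-hot label: the sample belongs to the first class
q = [0.5, 0.4, 0.1]  # model prediction for each class
loss = cross_entropy(p, q)
print(round(loss, 4))  # 0.6931
```

Only the probability assigned to the correct class affects the result here, so the loss shrinks toward 0 as that probability approaches 1.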
S300, performing text error correction processing on the new input data through the missing-word supplement model and the wrong-word correction model.
After model training is completed, new input data, namely medical record data needing to be judged and reported, is input into a model for processing.
S400, calculating the similarity between the error-corrected data and an existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm.
Wherein, the bm25 algorithm is used for calculating the similarity with the existing disease standard name knowledge base: sentence $s_1$ is segmented into a word list $[w_i]$; for a sentence $s_2$ to be compared with $s_1$, the relevance score of each word $w_i$ with respect to $s_2$ is calculated, and finally the relevance scores of all $w_i$ with respect to $s_2$ are weighted and summed, with the calculation formula as follows:
$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i\,(k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$
wherein $\mathrm{idf}(w_i)$ is the idf of the word $w_i$, $f_i$ is the frequency with which the word $w_i$ occurs in sentence $s_2$, $k_1$ and $b$ are adjustment factors set to 2 and 0.75 respectively, $\mathrm{len}(s_2)$ is the length of sentence $s_2$, and avgsl is the average length of all sentences;
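The bm25 scoring described above can be sketched in plain Python as follows. Several details are assumptions, since the text does not fix them: the input is taken to be already segmented into tokens, the idf uses one common smoothed variant, `log((N - n + 0.5) / (n + 0.5) + 1)`, and the function name `bm25_scores` is illustrative. The defaults `k1=2` and `b=0.75` follow the values given in the text.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=2.0, b=0.75):
    """Score each tokenized sentence in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    if n_docs == 0:
        return []
    # avgsl: average sentence length over the whole knowledge base.
    avgsl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many sentences each word occurs.
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for w in query_tokens:
            f = tf.get(w, 0)
            if f == 0:
                continue  # a word absent from the sentence contributes 0
            # Assumed idf variant (smoothed, always positive).
            idf = math.log((n_docs - df[w] + 0.5) / (df[w] + 0.5) + 1.0)
            norm = f + k1 * (1 - b + b * len(doc) / avgsl)
            score += idf * f * (k1 + 1) / norm
        scores.append(score)
    return scores

# Illustrative knowledge base of standard names, pre-segmented into words.
knowledge_base = [
    ["viral", "hepatitis"],
    ["pulmonary", "tuberculosis"],
    ["viral", "pneumonia"],
]
print(bm25_scores(["pulmonary", "tuberculosis"], knowledge_base))
```

For Chinese medical text, the whitespace-style tokens used here would be replaced by the output of a word segmenter; the scoring itself is unchanged.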
the similarity with the existing disease standard name knowledge base is calculated by using the jaccard algorithm: the Jaccard coefficient of the error-corrected data set A and the existing disease standard name set B in the existing disease standard name knowledge base is calculated, with the calculation formula as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
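The Jaccard coefficient is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; treating each string as a set of characters is an assumption (the text does not say whether the sets hold characters or words), and returning 0.0 for two empty inputs is a convention chosen here to avoid dividing by zero.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two iterables.

    Passing strings compares their character sets; passing token lists
    compares their word sets.
    """
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention for two empty inputs
    return len(set_a & set_b) / len(union)

print(jaccard("flu", "influenza"))  # 0.375
```

In the example, the character sets share {f, l, u} and their union has 8 distinct characters, giving 3/8.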
S500, adding the similarity calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score;
S600, selecting the existing disease standard name with the largest similarity comprehensive score from the existing disease standard name knowledge base as the standard disease name mapped by the new input data.
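Steps S500 and S600 amount to an element-wise sum followed by an argmax. The sketch below takes the two score lists as inputs so that it does not depend on any particular bm25 or jaccard implementation; the function name `select_standard_name` and the example scores are hypothetical.

```python
def select_standard_name(standard_names, bm25_scores, jaccard_scores):
    """Add the two similarity scores per name and return the best match.

    Returns (standard_name, combined_score). On ties, the name that
    appears first in the knowledge base is returned.
    """
    if not (len(standard_names) == len(bm25_scores) == len(jaccard_scores)):
        raise ValueError("all inputs must have the same length")
    if not standard_names:
        raise ValueError("the knowledge base is empty")
    combined = [b + j for b, j in zip(bm25_scores, jaccard_scores)]
    best = max(range(len(combined)), key=lambda i: combined[i])
    return standard_names[best], combined[best]

names = ["viral hepatitis", "pulmonary tuberculosis", "viral pneumonia"]
name, score = select_standard_name(names, [0.0, 2.5, 0.0], [0.25, 1.0, 0.25])
print(name, score)  # pulmonary tuberculosis 3.5
```

One point worth keeping in mind when reading the sum: the Jaccard coefficient lies in [0, 1] while a bm25 score has no fixed upper bound, so with a plain addition the bm25 term can dominate the combined score.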
S700, searching the mapped standard disease name by direct matching against the infectious disease names in the dangerous infectious disease database; if corresponding data exists in the dangerous infectious disease database, judging that the current disease is a dangerous infectious disease and directly reporting the disease to the responsible institution to complete the direct reporting of the dangerous disease.
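The direct matching in S700 is an exact lookup of the mapped standard name in the dangerous infectious disease database. A minimal sketch, assuming the database is available as a set of names and the reporting channel is passed in as a callable; `direct_report`, `send_report` and the example disease list are hypothetical.

```python
def direct_report(standard_name, dangerous_diseases, send_report):
    """Report standard_name if it exactly matches a database entry.

    Returns True when a report was sent and False otherwise.
    """
    if standard_name in dangerous_diseases:
        send_report(standard_name)
        return True
    return False

# Illustrative data only; a real deployment would load the official list.
dangerous_diseases = {"plague", "cholera"}
sent = []  # stands in for the channel to the responsible institution
print(direct_report("cholera", dangerous_diseases, sent.append))      # True
print(direct_report("common cold", dangerous_diseases, sent.append))  # False
print(sent)  # ['cholera']
```

Exact matching is only reliable here because the preceding steps have already mapped the free-text input to a standard name; a raw, uncorrected name would usually miss the lookup.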
In another aspect, the invention provides a medical direct reporting system based on self-training text correction and text matching, which comprises a training data construction module 100, a missing-word supplement model and wrong-word correction model construction module 200, a text error correction module 300, a similarity calculation module 400, a comprehensive scoring module 500, a mapping standard disease name determination module 600 and a direct reporting module 700;
the training data construction module 100 is configured to construct training data of a text error correction model based on the original medical record data, where the training data of the text error correction model includes training data lacking characters and training data of wrongly written characters;
the missing-word supplement model and wrong-word correction model construction module 200 is configured to train a missing-word supplement model and a wrong-word correction model based on the bert pre-training model, using the training data lacking characters and the training data of wrongly written characters respectively, so as to obtain the missing-word supplement model and the wrong-word correction model;
the text error correction module 300 is configured to perform text error correction processing on new input data through the missing-word supplement model and the wrong-word correction model;
the similarity calculation module 400 is configured to calculate the similarity between the error-corrected data and the existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm;
the comprehensive scoring module 500 is configured to add the similarity calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score;
the mapping standard disease name determining module 600 is configured to select, from the existing disease standard name knowledge base, an existing disease standard name with the largest similarity comprehensive score as a standard disease name mapped by the new input data;
the direct reporting module 700 is configured to directly match and retrieve the mapped standard disease name and the infectious disease name in the dangerous infectious disease database, and if corresponding data exists in the dangerous infectious disease database, determine that the current disease is a dangerous infectious disease, and directly report the disease to a responsible institution to complete direct reporting of the dangerous disease.
As an embodiment, the training data construction module 100 includes a training data construction unit 110 lacking characters and a wrongly written training data construction unit 120:
the training data construction unit 110 lacking characters is configured to randomly delete two characters in each sentence in the original medical record data, record the character indexes of the deletion positions and the deleted character information, and construct the training data lacking characters;
the wrongly written training data construction unit 120 is configured to randomly replace two characters in each sentence in the original medical record data with other characters, record the character index of the replacement position and the information of the original characters before replacement, and construct the wrongly written training data.
As an embodiment, the structure of the bert pretraining model includes:
the L1 embedding layer is used for looking up the ids mapped from the input data in the embedding weight matrix to obtain embedding word vectors as the embedding matrix representation of the input data, wherein the vector dimension is 768;
the L2 multi-head attention mechanism layer is used for extracting matrix features from the 768-dimensional feature vectors output by the embedding layer through three linear layers respectively, and obtaining, through matrix multiplication, a 768-dimensional vector representation of each input that incorporates attention information;
the L3 forward calculation layer is used for passing each vector output by the multi-head attention mechanism layer through two linear layers and, after activation by the activation layer, outputting the final 768-dimensional vector representation of each input;
the loss function of the bert pre-training model is the cross entropy:
$$H(p, q) = -\sum_{x} p(x)\,\log q(x)$$
wherein p(x) is the actual label of the current input, and q(x) is the model's predicted value for each label.
As an embodiment, the similarity calculation module 400 includes a bm25 similarity calculation unit 410;
the bm25 similarity calculation unit 410 is configured to calculate the similarity with the existing disease standard name knowledge base using the bm25 algorithm: sentence $s_1$ is segmented into a word list $[w_i]$; for a sentence $s_2$ to be compared with $s_1$, the relevance score of each word $w_i$ with respect to $s_2$ is calculated, and finally the relevance scores of the words $w_i$ with respect to $s_2$ are weighted and summed, the calculation formula being as follows:
$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i \, (k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$
wherein $\mathrm{idf}(w_i)$ is the idf of the word $w_i$, $f_i$ is the frequency with which the word $w_i$ occurs in sentence $s_2$, $k_1$ and $b$ are adjustment factors set to 2 and 0.75 respectively, $\mathrm{len}(s_2)$ is the length of sentence $s_2$, and avgsl is the average length of all sentences;
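A minimal Python sketch of BM25 scoring with k1 = 2 and b = 0.75 is given below. The text does not state which idf variant or which word segmenter is used, so the smoothed, non-negative idf and the character-level tokens in the example are assumptions, as are all function names.

```python
import math
from collections import Counter


def idf(word, corpus):
    """Smoothed, non-negative idf (an assumed variant; the text does not specify one)."""
    n = len(corpus)
    df = sum(1 for doc in corpus if word in doc)
    return math.log((n - df + 0.5) / (df + 0.5) + 1.0)


def bm25_score(query_words, doc_words, corpus, k1=2.0, b=0.75):
    """Score a tokenized candidate s2 (doc_words) against a tokenized query s1."""
    avgsl = sum(len(doc) for doc in corpus) / len(corpus)
    freq = Counter(doc_words)
    score = 0.0
    for w in query_words:
        f = freq[w]
        if f == 0:
            continue  # words absent from s2 contribute nothing
        norm = f + k1 * (1.0 - b + b * len(doc_words) / avgsl)
        score += idf(w, corpus) * f * (k1 + 1.0) / norm
    return score


if __name__ == "__main__":
    # Character-level tokens stand in for a real word segmenter.
    corpus = [list("肺结核"), list("病毒性肝炎"), list("流行性感冒")]
    query = list("肺结核病")
    for doc in corpus:
        print("".join(doc), bm25_score(query, doc, corpus))
```

Here `corpus` plays the role of the tokenized standard disease names in the knowledge base, and the candidate with the highest score is the closest match under BM25 alone.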
As an embodiment, the similarity calculation module 400 includes a jaccard similarity calculation unit 420;
the jaccard similarity calculation unit 420 is configured to calculate the similarity with the existing disease standard name knowledge base using the jaccard algorithm: the Jaccard coefficient is calculated between the set A of the error-corrected data and the set B of an existing disease standard name in the existing disease standard name knowledge base, the calculation formula being as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
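The Jaccard coefficient, |A ∩ B| / |A ∪ B|, can be sketched in a few lines of Python. Treating each string as a set of characters is an assumption made for illustration; a word-level segmentation would work the same way.

```python
def jaccard(a, b):
    """Jaccard coefficient of two token collections (here: characters of a string)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # both inputs empty: define the similarity as 0
    return len(set_a & set_b) / len(union)


if __name__ == "__main__":
    # Three shared characters out of four distinct characters overall.
    print(jaccard("肺结核病", "肺结核"))  # 0.75
```

Because the coefficient lies between 0 and 1 while BM25 scores are unbounded, an additive combination of the two is dominated by BM25 whenever the BM25 scores are large.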
Finally, it should be noted that the above embodiments illustrate rather than limit the invention, and those skilled in the art will understand that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A self-training text correction and text matching based medical direct reporting method, comprising:
constructing training data of a text error correction model based on the original medical record data, wherein the training data of the text error correction model comprises training data lacking characters and training data of wrongly written characters;
training, based on the bert pre-training model, with the training data lacking characters and the training data of wrongly written characters respectively, to obtain a missing-word supplement model and a wrong-word correction model;
performing text error correction processing on new input data through the missing-word supplement model and the wrong-word correction model;
calculating the similarity between the error-corrected data and the existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm;
adding the similarity calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score;
selecting the existing disease standard name with the highest similarity comprehensive score from the existing disease standard name knowledge base as the standard disease name mapped by the new input data;
and directly matching the mapped standard disease name against the infectious disease names in the dangerous infectious disease database; if corresponding data exists in the dangerous infectious disease database, judging that the current disease is a dangerous infectious disease and reporting it directly to the responsible institution, thereby completing the direct reporting of the dangerous disease.
2. The self-training text correction and text matching based medical direct reporting method of claim 1 wherein constructing training data of a text correction model based on raw medical record data comprises:
deleting two characters in each sentence in the original medical record data at random, recording character indexes of the deleting positions and deleted character information, and constructing training data lacking characters;
and randomly replacing two characters in each sentence in the original medical record data with other characters, recording the character index of the replacement position and the information of the original characters before replacement, and constructing the training data of wrongly written characters.
3. The self-training text correction and text matching based medical direct reporting method of claim 1, wherein the structure of the bert pre-training model comprises:
the L1 embedding layer, which performs matrix multiplication between the embedding weight matrix and the ids mapped from the input data to obtain embedding word vectors as the embedding matrix representation of the input data, wherein the vector dimension is 768;
the L2 multi-head attention mechanism layer, which is used for passing the 768-dimensional feature vectors output by the embedding layer through three linear layers respectively to extract matrix features, and obtaining, through matrix multiplication, a 768-dimensional vector representation of each input item fused with attention information;
the L3 forward calculation layer, which is used for passing each vector from the multi-head attention mechanism layer through two linear layers and, after activation by the activation layer, outputting the final 768-dimensional vector representation of each item;
the loss function of the bert pre-training model is the cross-entropy loss, as follows:
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
wherein p(x) is the actual label of the current input, and q(x) is the predicted value of the model for each label.
4. The self-training text correction and text matching based medical direct reporting method as claimed in claim 1, wherein the calculating the similarity between the error correction processed data and the existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm comprises:
calculating the similarity with the existing disease standard name knowledge base using the bm25 algorithm: sentence $s_1$ is segmented into a word list $[w_i]$; for a sentence $s_2$ to be compared with $s_1$, the relevance score of each word $w_i$ with respect to $s_2$ is calculated, and finally the relevance scores of the words $w_i$ with respect to $s_2$ are weighted and summed, the calculation formula being as follows:
$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i \, (k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$
wherein $\mathrm{idf}(w_i)$ is the idf of the word $w_i$, $f_i$ is the frequency with which the word $w_i$ occurs in sentence $s_2$, $k_1$ and $b$ are adjustment factors set to 2 and 0.75 respectively, $\mathrm{len}(s_2)$ is the length of sentence $s_2$, and avgsl is the average length of all sentences.
5. The self-training text correction and text matching based medical direct reporting method as claimed in claim 1, wherein the calculating the similarity between the error correction processed data and the existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm comprises:
calculating the similarity with the existing disease standard name knowledge base using the jaccard algorithm: the Jaccard coefficient is calculated between the set A of the error-corrected data and the set B of an existing disease standard name in the existing disease standard name knowledge base, the calculation formula being as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$
6. A medical direct reporting system based on self-training text error correction and text matching, characterized by comprising a training data construction module, a missing-word supplement model and wrong-word correction model construction module, a text error correction module, a similarity calculation module, a comprehensive scoring module, a mapping standard disease name determination module and a direct reporting module;
the training data construction module is used for constructing training data of a text error correction model based on the original medical record data, wherein the training data of the text error correction model comprises training data lacking characters and training data of wrongly written characters;
the missing-word supplement model and wrong-word correction model construction module is used for training, based on the bert pre-training model, with the training data lacking characters and the training data of wrongly written characters respectively, to obtain a missing-word supplement model and a wrong-word correction model;
the text error correction module is used for performing text error correction processing on new input data through the missing-word supplement model and the wrong-word correction model;
the similarity calculation module is used for calculating the similarity between the data subjected to error correction processing and an existing disease standard name knowledge base by using a bm25 algorithm and a jaccard algorithm;
the comprehensive scoring module is used for adding the similarity calculated by the bm25 algorithm and the jaccard algorithm to obtain a similarity comprehensive score;
the mapping standard disease name determining module is used for selecting the existing disease standard name with the largest similarity comprehensive score from the existing disease standard name knowledge base as the standard disease name mapped by the new input data;
the direct reporting module is used for directly matching the mapped standard disease name against the infectious disease names in the dangerous infectious disease database; if corresponding data exists in the dangerous infectious disease database, judging that the current disease is a dangerous infectious disease and reporting it directly to the responsible institution, thereby completing the direct reporting of the dangerous disease.
7. The self-training text correction and text matching based medical direct reporting system of claim 6, wherein the training data construction module comprises a missing-character training data construction unit and a wrongly written word training data construction unit;
the missing-character training data construction unit is used for randomly deleting two characters in each sentence of the original medical record data, recording the character index of each deletion position and the deleted character, and constructing the training data lacking characters;
the wrongly written word training data construction unit is used for randomly replacing two characters in each sentence of the original medical record data with other characters, recording the character index of each replacement position and the original character before replacement, and constructing the wrongly written word training data.
8. The self-training text correction and text matching based medical direct reporting system of claim 6, wherein the structure of the bert pre-training model comprises:
the L1 embedding layer, which performs matrix multiplication between the embedding weight matrix and the ids mapped from the input data to obtain embedding word vectors as the embedding matrix representation of the input data, wherein the vector dimension is 768;
the L2 multi-head attention mechanism layer, which is used for passing the 768-dimensional feature vectors output by the embedding layer through three linear layers respectively to extract matrix features, and obtaining, through matrix multiplication, a 768-dimensional vector representation of each input item fused with attention information;
the L3 forward calculation layer, which is used for passing each vector from the multi-head attention mechanism layer through two linear layers and, after activation by the activation layer, outputting the final 768-dimensional vector representation of each item;
the loss function of the bert pre-training model is the cross-entropy loss, as follows:
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
wherein p(x) is the actual label of the current input, and q(x) is the predicted value of the model for each label.
9. The self-training text correction and text matching based medical direct reporting system of claim 6, wherein the similarity calculation module comprises a bm25 similarity calculation unit;
the bm25 similarity calculation unit is configured to calculate the similarity with the existing disease standard name knowledge base using the bm25 algorithm: sentence $s_1$ is segmented into a word list $[w_i]$; for a sentence $s_2$ to be compared with $s_1$, the relevance score of each word $w_i$ with respect to $s_2$ is calculated, and finally the relevance scores of the words $w_i$ with respect to $s_2$ are weighted and summed, the calculation formula being as follows:
$$\mathrm{Score}(s_1, s_2) = \sum_{i} \mathrm{idf}(w_i) \cdot \frac{f_i \, (k_1 + 1)}{f_i + k_1 \left(1 - b + b \cdot \frac{\mathrm{len}(s_2)}{\mathrm{avgsl}}\right)}$$
wherein $\mathrm{idf}(w_i)$ is the idf of the word $w_i$, $f_i$ is the frequency with which the word $w_i$ occurs in sentence $s_2$, $k_1$ and $b$ are adjustment factors set to 2 and 0.75 respectively, $\mathrm{len}(s_2)$ is the length of sentence $s_2$, and avgsl is the average length of all sentences.
10. The self-training text correction and text matching based medical direct reporting system of claim 6, wherein the similarity calculation module comprises a jaccard similarity calculation unit;
the jaccard similarity calculation unit is configured to calculate the similarity with the existing disease standard name knowledge base using the jaccard algorithm: the Jaccard coefficient is calculated between the set A of the error-corrected data and the set B of an existing disease standard name in the existing disease standard name knowledge base, the calculation formula being as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$
CN202310735155.9A 2023-06-20 2023-06-20 Medical direct reporting method and system based on self-training text error correction and text matching Active CN116502629B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310735155.9A CN116502629B (en) 2023-06-20 2023-06-20 Medical direct reporting method and system based on self-training text error correction and text matching

Publications (2)

Publication Number Publication Date
CN116502629A CN116502629A (en) 2023-07-28
CN116502629B true CN116502629B (en) 2023-08-18

Family

ID=87316810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310735155.9A Active CN116502629B (en) 2023-06-20 2023-06-20 Medical direct reporting method and system based on self-training text error correction and text matching

Country Status (1)

Country Link
CN (1) CN116502629B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859921A (en) * 2020-07-08 2020-10-30 金蝶软件(中国)有限公司 Text error correction method and device, computer equipment and storage medium
KR20210035987A (en) * 2019-09-25 2021-04-02 국민대학교산학협력단 Document search device and method based on jaccard model
CN116127952A (en) * 2023-01-16 2023-05-16 之江实验室 Multi-granularity Chinese text error correction method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An in-depth study of similarity predicate committee;Jia Zhu等;《ELSEVIER》;第381-393页 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant