CN114446422A

CN114446422A - Medical record marking method, system and corresponding equipment and storage medium

Info

Publication number: CN114446422A
Application number: CN202111536210.9A
Authority: CN
Inventors: 赵建强; 王梦迪
Original assignee: Wanghai Kangxin Beijing Technology Co ltd
Current assignee: Wanghai Kangxin Beijing Technology Co ltd
Priority date: 2021-12-15
Filing date: 2021-12-15
Publication date: 2022-05-06

Abstract

The application discloses a medical record marking method, a medical record marking system, corresponding equipment and a storage medium, wherein the method comprises the following steps: extracting diagnosis words from the medical record information; for each diagnosis word, calculating the relevance scores between the diagnosis word and all standard words, and recalling a preset number of the most relevant standard words according to the ranking of the relevance scores; calculating the text similarity between each diagnosis word and each corresponding recalled standard word; inputting, in pairs, each diagnosis word whose text similarity is greater than or equal to a preset threshold value together with its corresponding standard word into a trained semantic similarity model, and ranking the pairs by semantic similarity; and selecting the standard word with the highest semantic similarity as the standard diagnosis word of the corresponding diagnosis word. The invention can greatly improve the accuracy and normalization of the diagnosis words in medical records and thereby improve the quality of the medical records.

Description

Medical record marking method, system and corresponding equipment and storage medium

Technical Field

The application relates to the field of electric digital data processing, in particular to a medical record marking method. The application also relates to a medical record marking system and a corresponding computer device and computer readable storage medium.

Background

In the process of writing medical records, there are various problems, such as widespread copying, overly brief and non-standard descriptions of the disease, and differences in individual understanding, so that medical records cannot faithfully and accurately reflect the patient's actual disease progression, treatment effects and the like. At the same time, the near-identical medical histories that result from copying also degrade the quality of the medical records, thereby creating a greater risk of medical disputes.

Disclosure of Invention

The invention provides a medical record marking method, a medical record marking system, corresponding equipment and a storage medium, which can greatly improve the accuracy and normalization of medical record diagnosis words and further improve the quality of medical records.

In a first aspect of the present invention, there is provided a method of medical record marking, the method comprising:

extracting diagnosis words in the medical record information;

for each diagnosis word, calculating the relevance scores of the diagnosis word and all standard words, and recalling a preset number of most relevant standard words according to the ranking of the relevance scores;

calculating the text similarity of each diagnosis word and each corresponding recalled standard word;

inputting, in pairs, each diagnostic word whose text similarity is greater than or equal to a preset threshold value together with its corresponding standard word into the trained semantic similarity model, and ranking the pairs by semantic similarity;

and selecting the standard word with the highest semantic similarity as the standard diagnostic word of the corresponding diagnostic word.

In a second aspect of the present invention, there is provided a medical record marking system, comprising:

the diagnostic word extraction module is used for extracting diagnostic words in the medical record information;

the relevant standard word recalling module is used for calculating the relevance scores between each diagnosis word and all the standard words and recalling a preset number of the most relevant standard words according to the ranking of the relevance scores;

the text similarity calculation module is used for calculating the text similarity of each diagnosis word and each corresponding recalled standard word;

the semantic similarity sorting module is used for inputting, in pairs, each diagnostic word whose text similarity is greater than or equal to a preset threshold value together with its corresponding standard word into the trained semantic similarity model, and ranking the pairs by semantic similarity;

and the standard word selecting module is used for selecting the standard word with the highest semantic similarity as the standard diagnostic word of the corresponding diagnostic word.

In a third aspect of the invention, a computer device is provided, comprising a processor, a memory and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to the first aspect of the invention or implements the functions of the system according to the second aspect of the invention.

According to a fourth aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method according to the first aspect of the present invention or performs the functions of the system according to the second aspect of the present invention.

According to the invention, the diagnosis words in the medical record information are extracted; for each diagnosis word, the relevance scores between the diagnosis word and all the standard words are calculated, and a preset number of the most relevant standard words are recalled according to the ranking of the relevance scores; the text similarity between each diagnosis word and each corresponding recalled standard word is calculated; the diagnosis words whose text similarity is greater than or equal to a preset threshold are input, in pairs with their corresponding standard words, into a trained semantic similarity model and ranked by semantic similarity; and the standard word with the highest semantic similarity is selected as the standard diagnosis word of the corresponding diagnosis word. In this way, the accuracy and normalization of the diagnosis words in medical records can be improved automatically and substantially, the quality of the medical records is improved, the difficulty of manual quality control of medical records is avoided or reduced, and the workload of quality control personnel is reduced. Tests show that, after this standardization, the accuracy of the diagnosis words in the medical records can exceed 97 percent.

Other features and advantages of the present invention will become more apparent from the detailed description of the embodiments of the present invention when taken in conjunction with the accompanying drawings.

Drawings

FIG. 1 is a flow chart of one embodiment of a method according to the present invention;

FIG. 2 is a block diagram of one embodiment of a system according to the present invention.

For the sake of clarity, the figures are schematic and simplified drawings, which only show details which are necessary for understanding the invention and other details are omitted.

Detailed Description

Embodiments and examples of the present invention will be described in detail below with reference to the accompanying drawings.

The scope of applicability of the present invention will become apparent from the detailed description given hereinafter. It should be understood, however, that the detailed description and the specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only.

FIG. 1 is a flow chart of a preferred embodiment of a method for marking a medical record according to the present invention.

In step S102, diagnostic words in the medical record information are extracted. For the plain-text content of a medical record (also called an electronic medical record) written by a doctor, the content may first be classified, for example by natural language processing, into categories such as physical signs, disease diagnosis, surgical operation and the like; then, for the disease diagnosis and/or surgical operation parts, the diagnostic words may be extracted, for example by a trained natural language processing model. The natural language processing model is trained on diagnostic word annotations made by professional physicians. For example, if the medical record information includes "suggestive of liver fat infiltration, liver cyst", the extracted diagnostic words may be "liver fat infiltration" and "liver cyst".

In step S104, for each disease diagnosis word or surgical operation diagnosis word, its relevance scores with respect to all the standard words in a pre-established standard word library are calculated, and a predetermined number of the most relevant standard words are recalled according to the ranking of the relevance scores. The standard words are terms in ICD-10 and ICD-9 (ICD: International Classification of Diseases). The number of most relevant standard words recalled may be chosen as a trade-off between processing speed and accuracy, and may be, for example, between 40 and 70, such as 50, 55 or 60.

The relevance score between the diagnostic word and the standard word can be calculated using the conventional BM25 algorithm. The BM25 algorithm is commonly used to score search relevance: it performs morpheme analysis on the search term $Q$ to generate morphemes $q_i$; then, for each search result $D$, it calculates the relevance score of each morpheme $q_i$ with respect to $D$; finally, it computes a weighted sum of the relevance scores of the $q_i$ with respect to $D$, thereby obtaining the relevance score of $Q$ and $D$. The general formula of the BM25 algorithm is as follows:

$$\mathrm{Score}(Q, D) = \sum_{i=1}^{n} W_i \cdot R(q_i, D)$$

where $W_i$ is the weight of morpheme $q_i$ and $R(q_i, D)$ is the relevance score of $q_i$ with respect to $D$. Since the BM25 algorithm is well known, it is not described further herein.

After the relevance scores between each diagnostic word and all the standard words are calculated using the BM25 algorithm, the standard words are ranked by relevance score and the most relevant ones, for example the top 50, are recalled.
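The recall step can be illustrated with a minimal sketch of one common BM25 variant (Okapi BM25 with a smoothed IDF weight). The class name `BM25`, the parameter values `k1=1.5` and `b=0.75`, the whitespace tokenizer and the four sample standard words are all assumptions made for illustration; they are not specified in this application, and Chinese text would require a word segmentation tool in place of `tokenize`.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumption: whitespace tokenization for English sample terms.
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 scorer over a small standard word library."""

    def __init__(self, corpus: List[str], k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.corpus = list(corpus)
        self.docs = [tokenize(doc) for doc in self.corpus]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        # Document frequency: number of standard words containing each term.
        df = Counter()
        for d in self.docs:
            df.update(set(d))
        n = len(self.docs)
        # Smoothed IDF weight W_i; it stays positive for every term.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, index: int) -> float:
        doc = self.docs[index]
        tf = Counter(doc)
        norm = self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
        total = 0.0
        for term in tokenize(query):
            if term in tf:
                # W_i * R(q_i, D), summed over the query terms.
                total += self.idf[term] * tf[term] * (self.k1 + 1) / (tf[term] + norm)
        return total

    def recall(self, query: str, top_k: int = 50) -> List[str]:
        scored = [(self.score(query, i), word) for i, word in enumerate(self.corpus)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [word for _, word in scored[:top_k]]


# Hypothetical miniature standard word library.
standard_words = [
    "fatty liver",
    "cyst of liver",
    "fracture of neck of femur",
    "type 1 diabetes mellitus",
]
bm25 = BM25(standard_words)
top_two = bm25.recall("liver cyst", top_k=2)
```

In a full library, `top_k` would correspond to the predetermined number of recalled standard words (for example 50).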

In step S106, the text similarity between each diagnostic word and each corresponding recalled standard word is calculated.

In an embodiment, the text similarity may be determined by the Levenshtein distance (also known as the edit distance, i.e., the minimum number of editing operations required to transform one string into another). The edit distance between each recalled standard word and the corresponding diagnosis word is calculated one by one, and the result of the BM25 algorithm is audited by setting a distance threshold: if the edit distance is greater than or equal to the set distance threshold, the corresponding recalled standard word is approved; otherwise, if the edit distance is smaller than the set distance threshold, the corresponding recalled standard word is not approved.
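As an illustration of the edit distance, the following sketch computes the Levenshtein distance with the standard dynamic-programming recurrence. The helper `text_similarity`, which rescales the distance into a score between 0 and 1 so that larger values mean more similar strings, is an assumption added for illustration and is not taken from this application.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def text_similarity(a: str, b: str) -> float:
    # Assumption: normalize by the longer string so the score lies in [0, 1].
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


distance = levenshtein("kitten", "sitting")  # classic textbook pair
```

Because the function compares characters, it applies unchanged to Chinese diagnostic words, where each character counts as one edit unit.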

In other embodiments, the text similarity may also be determined by other string similarity algorithms such as cosine similarity, matrix similarity, and the like.

In step S108, it is determined whether the calculated text similarity is greater than or equal to a predetermined threshold. If so, the process proceeds to step S110; otherwise, i.e., if the calculated text similarity is smaller than the predetermined threshold, the process proceeds to step S120.
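The branch in step S108 can be sketched as a small routing function. The function name, the threshold value 0.6 and the rule that a diagnostic word goes to step S120 only when none of its recalled standard words reaches the threshold are assumptions for illustration; the application does not fix these details.

```python
from typing import List, Tuple


def route_by_text_similarity(
    scored_candidates: List[Tuple[str, float]],
    threshold: float = 0.6,  # hypothetical value
) -> Tuple[str, List[str]]:
    """Return the next step and the standard words that passed the threshold.

    scored_candidates holds (standard_word, text_similarity) pairs for one
    diagnostic word.
    """
    approved = [word for word, sim in scored_candidates if sim >= threshold]
    if approved:
        return "S110", approved  # go on to semantic similarity ranking
    return "S120", []            # fall back to medical entity recognition


next_step, kept = route_by_text_similarity(
    [("cyst of liver", 0.7), ("fatty liver", 0.3)]
)
```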

In step S110, the diagnostic word and the corresponding standard word are input into the trained semantic similarity model in pairs and subjected to semantic similarity ranking.

In an embodiment, a BERT model trained on a large amount (e.g., millions of items) of medical data may be used to determine the semantic similarity between a diagnostic word and a corresponding standard word. The BERT (Bidirectional Encoder Representations from Transformers) model may be any of various well-known BERT-type models, such as ERNIE (Wenxin), ALBERT, etc. The similarity labels are provided by medical specialists. For example, the similarity between "brittle diabetes mellitus" and "type 1 diabetes mellitus with early-stage microalbuminuria" is 0.3, the similarity between "upper limb fracture" and "open upper limb fracture" is 0.5, and so on.

In step S112, the standard word with the highest semantic similarity is selected as the standard diagnostic word of the corresponding diagnostic word.
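Steps S110 and S112 amount to scoring each (diagnostic word, standard word) pair and keeping the best-scoring standard word. The sketch below shows only this ranking and selection logic. The `toy_scorer` function is a placeholder based on token overlap and is an assumption for illustration; it merely stands in for the trained semantic similarity model and is not a BERT model.

```python
from typing import Callable, List, Optional, Tuple

Scorer = Callable[[str, str], float]


def rank_by_semantic_similarity(
    diagnostic_word: str, candidates: List[str], scorer: Scorer
) -> List[Tuple[str, float]]:
    """Score each pair and sort from most to least similar."""
    pairs = [(word, scorer(diagnostic_word, word)) for word in candidates]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def select_standard_word(
    diagnostic_word: str, candidates: List[str], scorer: Scorer
) -> Optional[str]:
    """Return the standard word with the highest semantic similarity."""
    ranked = rank_by_semantic_similarity(diagnostic_word, candidates, scorer)
    return ranked[0][0] if ranked else None


def toy_scorer(a: str, b: str) -> float:
    # Placeholder only: Jaccard overlap of lowercase tokens.
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


best = select_standard_word(
    "open upper limb fracture",
    ["lower limb fracture", "upper limb fracture"],
    toy_scorer,
)
```

Passing the scorer as a callable keeps the selection logic independent of whichever trained model is plugged in.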

In step S120, the diagnosis words having the text similarity smaller than the predetermined threshold are input into the trained medical entity recognition model to recognize the medical entities in the corresponding diagnosis words.

In an embodiment, a BERT model trained on a large amount of data annotated with medical entities may be employed as the medical entity recognition model to identify the medical entities in the corresponding diagnostic words. For example, given the data "after consultation, the patient was admitted to our hospital with a diagnosis of femoral neck fracture", the annotator annotates the entity "femoral neck" as a body part and the entity "fracture" as a clinical finding. Likewise, the BERT model may be any of various well-known BERT-type models, such as ERNIE (Wenxin), ALBERT, and the like.
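One common way to turn such annotations into training labels for an entity recognition model is the BIO tagging scheme. The sketch below converts entity spans into BIO tags. The tagging scheme, the label names `BODY_PART` and `CLINICAL_FINDING` and the word-level tokenization are assumptions for illustration; the application does not state which annotation format is used.

```python
from typing import List, Tuple


def to_bio_tags(tokens: List[str], entities: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, label) token spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining tokens of the entity
    return tags


# Hypothetical annotated sample based on the femoral neck fracture example.
tokens = ["admitted", "with", "femoral", "neck", "fracture"]
entities = [(2, 4, "BODY_PART"), (4, 5, "CLINICAL_FINDING")]
tags = to_bio_tags(tokens, entities)
```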

In step S122, the corresponding standard words are recalled from a pre-constructed knowledge graph according to the identified medical entities. The knowledge graph is constructed by professional medical personnel based on a large amount of medical data and comprises medical entities, entity attributes, relationships among the entities and the like; the standard words are obtained by reasoning over the relationships among the entities in the knowledge graph. Since the invention does not lie in the knowledge graph itself, a detailed description of the knowledge graph is omitted here. After step S122, the process proceeds to step S110.
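A very simplified picture of recalling standard words from a knowledge graph is sketched below. The triples, the relation names `has_site` and `has_finding`, and the rule of returning the standard words linked to every identified entity are all hypothetical; a real knowledge graph and its reasoning would be considerably richer than this set intersection.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical (standard_word, relation, entity) triples.
TRIPLES = [
    ("fracture of neck of femur", "has_site", "femoral neck"),
    ("fracture of neck of femur", "has_finding", "fracture"),
    ("fracture of shaft of femur", "has_site", "femoral shaft"),
    ("fracture of shaft of femur", "has_finding", "fracture"),
    ("cyst of liver", "has_site", "liver"),
    ("cyst of liver", "has_finding", "cyst"),
]


def build_index(triples: Iterable[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """Map each entity to the standard words it is linked to."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for standard_word, _relation, entity in triples:
        index[entity].add(standard_word)
    return index


def recall_from_graph(index: Dict[str, Set[str]], entities: List[str]) -> List[str]:
    """Return the standard words linked to all of the identified entities."""
    linked = [index.get(entity, set()) for entity in entities]
    if not linked:
        return []
    return sorted(set.intersection(*linked))


index = build_index(TRIPLES)
recalled = recall_from_graph(index, ["femoral neck", "fracture"])
```

The recalled standard words would then be passed, paired with the diagnostic word, to the semantic similarity model of step S110.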

In an embodiment, after one or more standard diagnostic words are selected, they may be converted into standard ICD (International Classification of Diseases) codes, and the standard ICD codes may be used to update the corresponding existing ICD codes in the medical record, such as the ICD code for the admission diagnosis, the ICD code for the main discharge diagnosis, the ICD codes for other discharge diagnoses, the ICD codes for surgeries and operations, and the like, so that the ICD codes of the medical record are more accurate.
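The conversion to ICD codes can be pictured as a table lookup followed by an update of the relevant record field. In the sketch below, the record layout, the field name `admission_diagnosis_icd` and the two-entry code table are assumptions for illustration; the codes shown should be verified against the official ICD release in use.

```python
from typing import Dict, List

# Illustrative excerpt of a standard word to ICD-10 code table.
ICD_BY_STANDARD_WORD = {
    "fatty liver": "K76.0",
    "fracture of neck of femur": "S72.0",
}


def update_icd_codes(
    record: Dict[str, List[str]],
    field: str,
    standard_words: List[str],
    code_table: Dict[str, str],
) -> Dict[str, List[str]]:
    """Return a copy of the record whose field holds the looked-up ICD codes."""
    codes = [code_table[word] for word in standard_words if word in code_table]
    updated = dict(record)  # leave the original record untouched
    updated[field] = codes
    return updated


record = {"admission_diagnosis_icd": ["K76"], "operation_icd": []}
updated = update_icd_codes(
    record, "admission_diagnosis_icd", ["fatty liver"], ICD_BY_STANDARD_WORD
)
```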

The BM25 algorithm depends heavily on the accuracy of the word segmentation tool and generally cannot capture the relatedness between synonyms, so normalization using the BM25 algorithm alone is less accurate. The present invention standardizes medical record diagnosis words by combining the BM25 algorithm, the knowledge graph and the deep learning language model BERT, and greatly improves the quality control of medical records by using both text similarity and semantic similarity.

FIG. 2 shows a block diagram of a preferred embodiment of a medical records marking system according to the present invention, the system comprising:

a diagnostic word extraction module 202, configured to extract diagnostic words in the medical record information;

a related standard word recalling module 204, configured to calculate, for each diagnostic word, a relevance score between the diagnostic word and all standard words, and recall a predetermined number of most relevant standard words according to the ranking of the relevance scores;

a text similarity calculation module 206, configured to calculate a text similarity between each diagnosis word and each corresponding recalled standard word;

the semantic similarity sorting module 208 is configured to input the diagnostic words and the corresponding standard words, of which the text similarity is greater than or equal to a predetermined threshold, into the trained semantic similarity model in pairs and perform semantic similarity sorting;

the standard word selecting module 210 is configured to select a standard word with the highest semantic similarity as a standard diagnostic word of the corresponding diagnostic word.
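The cooperation of modules 202 to 210 can be sketched as a small class that wires together four pluggable functions. The class name, the threshold value and the toy stand-ins used in the usage example (comma splitting for extraction, a fixed candidate list for recall, token overlap for both similarities) are assumptions for illustration and do not represent the trained models described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MedicalRecordMarkingSystem:
    extract: Callable[[str], List[str]]               # module 202
    recall: Callable[[str], List[str]]                # module 204
    text_similarity: Callable[[str, str], float]      # module 206
    semantic_similarity: Callable[[str, str], float]  # module 208
    threshold: float = 0.5                            # hypothetical value

    def mark(self, record_text: str) -> Dict[str, str]:
        """Map each extracted diagnostic word to its standard diagnostic word."""
        result: Dict[str, str] = {}
        for word in self.extract(record_text):
            candidates = [
                c for c in self.recall(word)
                if self.text_similarity(word, c) >= self.threshold
            ]
            if candidates:
                # Module 210: keep the candidate with the highest semantic similarity.
                result[word] = max(
                    candidates, key=lambda c: self.semantic_similarity(word, c)
                )
        return result


def jaccard(a: str, b: str) -> float:
    # Toy stand-in for both similarity measures.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


system = MedicalRecordMarkingSystem(
    extract=lambda text: [part.strip() for part in text.split(",")],
    recall=lambda word: ["cyst of liver", "fatty liver"],
    text_similarity=jaccard,
    semantic_similarity=jaccard,
)
marked = system.mark("liver cyst, fatty liver")
```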

In another embodiment, the present invention provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method embodiment described in conjunction with FIG. 1 or of other corresponding method embodiments, or implements the functions of the system embodiment described in conjunction with FIG. 2 or of other corresponding system embodiments; the details are not repeated here.

In another embodiment, the present invention provides a computer device including a processor, a memory, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the method embodiment described in conjunction with FIG. 1 or of other corresponding method embodiments, or implements the functions of the system embodiment described in conjunction with FIG. 2 or of other corresponding system embodiments; the details are not repeated here.

The various embodiments described herein, or certain features, structures, or characteristics thereof, may be combined as suitable in one or more embodiments of the invention. Additionally, in some cases, the order of steps depicted in the flowcharts and/or in the pipelined process may be modified, as appropriate, and need not be performed exactly in the order depicted. In addition, various aspects of the invention may be implemented using software, hardware, firmware, or a combination thereof, and/or other computer implemented modules or devices that perform the described functions. Software implementations of the present invention may include executable code stored in a computer readable medium and executed by one or more processors. The computer-readable medium may include a computer hard drive, ROM, RAM, flash memory, portable computer storage media such as CD-ROM, DVD-ROM, flash drives, and/or other devices with a Universal Serial Bus (USB) interface, and/or any other suitable tangible or non-transitory computer-readable medium or computer memory on which executable code may be stored and executed by a processor. The present invention may be used in conjunction with any suitable operating system.

As used herein, the singular forms "a", "an" and "the" include plural references (i.e., have the meaning "at least one"), unless the context clearly dictates otherwise. It will be further understood that the terms "has," "includes" and/or "including," when used in this specification, specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

The foregoing describes some preferred embodiments of the present invention, but it should be emphasized that the invention is not limited to these embodiments, but can be implemented in other ways within the scope of the inventive subject matter. Various modifications and alterations of this invention will become apparent to those skilled in the art without departing from the spirit and scope of this invention.

Claims

1. A method for marking a medical record, the method comprising:

extracting diagnosis words in the medical record information;

calculating the text similarity of each diagnostic word and each corresponding recalled standard word;

2. The method of claim 1, further comprising:

inputting the diagnosis words with the text similarity smaller than a preset threshold value into the trained medical entity recognition model to recognize the medical entities in the corresponding diagnosis words;

recalling corresponding standard words from the knowledge graph according to the identified medical entities based on the pre-constructed knowledge graph;

inputting, in pairs, each diagnostic word whose text similarity is smaller than the preset threshold value together with the standard words recalled from the knowledge graph into the trained semantic similarity model, and ranking the pairs by semantic similarity.

3. The method of claim 1, further comprising:

and determining the ICD code of the medical record according to one or more standard diagnostic words.

4. The method of claim 1, wherein the relevance score is calculated using the BM25 algorithm.

5. The method of claim 1, wherein the text similarity is an edit distance.

6. The method of claim 1, wherein the trained semantic similarity model is a trained BERT model.

7. The method of claim 1, wherein the trained medical entity recognition model is a BERT model trained using data labeling medical entities.

8. A medical record marking system, comprising:

9. A computer device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, wherein the processor when executing the computer program realizes the steps of the method according to any of the claims 1-7 or the functions of the system according to claim 8.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7 or the functions of the system according to claim 8.