CN112883717A

CN112883717A - Wrongly written character detection method and device

Info

Publication number: CN112883717A
Application number: CN202110459221.5A
Authority: CN
Inventors: 胡文; 陈联忠; 胡可云
Original assignee: Beijing Jiahesen Health Technology Co ltd
Current assignee: Beijing Jiahesen Health Technology Co ltd
Priority date: 2021-04-27
Filing date: 2021-04-27
Publication date: 2021-06-01

Abstract

The application provides a method and a device for detecting wrongly written characters, applied to the recognition of wrongly written characters in Chinese electronic medical records. The method comprises: acquiring a text to be detected and performing sentence division processing on it to obtain at least one text to be processed; scoring each text to be processed according to an N-gram language model to obtain a score corresponding to each text to be processed; comparing the score corresponding to each text to be processed with a preset threshold value; and, when the score corresponding to a text to be processed is smaller than the preset threshold value, determining that wrongly written characters exist in that text and locating their positions. Because detection is based on a combined 2-gram and 3-gram score, wrongly written characters in medical corpus data can be detected effectively and at high speed, which lays a foundation for the research and development of subsequent products. In addition, for different data environments, the threshold value searching method can be used to adjust the threshold value, so the applicability is strong.

Description

Wrongly written character detection method and device

Technical Field

The present application relates to the field of data recognition technologies, and in particular, to a method and an apparatus for detecting wrongly written characters.

Background

With the rapid development of electronic technology, the data volume of various industries has increased explosively, and people have entered the big data era of traffic big data, meteorological big data, financial big data, business big data, biomedical big data and the like. With the gradual popularization of big data and electronic medical records in hospitals, the medical industry also generates massive clinical big data, and technologies that analyze and mine clinical big data have developed to a certain extent in the medical industry.

However, the technology for checking wrongly written characters in electronic medical records is not perfect. For example, if the electronic medical record contains text with wrongly written characters, the efficiency and accuracy of subsequent processing can be reduced. In the application of detecting wrongly written or mispronounced characters in medical corpus, most of the detection methods are not suitable for the medical corpus.

Therefore, how to detect wrongly written characters in medical corpus data effectively and at high speed, so as to lay a foundation for the research and development of subsequent products, is a problem that urgently needs to be solved by those skilled in the art.

Disclosure of Invention

The application provides a method and a device for detecting wrongly written characters, which are used for effectively detecting the wrongly written characters of medical corpus data at high speed.

In order to achieve the above object, the present application provides the following technical solutions:

a method for detecting wrongly written characters is applied to the recognition of wrongly written characters in a Chinese electronic medical record, and comprises the following steps:

acquiring a text to be detected, and performing sentence division processing on the text to be detected to obtain at least one text to be processed, wherein the text to be detected is a Chinese electronic medical record;

according to the N-gram language model, scoring each text to be processed to obtain a score corresponding to each text to be processed;

comparing the score corresponding to each text to be processed with a preset threshold value;

and when the score corresponding to the text to be processed is smaller than a preset threshold value, determining that wrongly written characters exist in the text to be processed, and positioning the positions of the wrongly written characters.

Preferably, when the sentence dividing processing is performed on the text to be detected, the method further includes any one or more of the following steps:

removing interference elements in the text to be detected;

and converting the character strings in the text to be detected into a preset format.

Preferably, the step of training the N-gram language model includes:

acquiring a training sample set, wherein the training sample set comprises at least one training sample;

performing sentence division processing on each training sample in the training sample set;

labeling each processed training sample, labeling sentences with errors and determining the error positions of the sentences;

and training according to the sentences with errors and the error positions to form the N-gram language model.

Preferably, the scoring each text to be processed according to the N-gram language model to obtain a score corresponding to each text to be processed includes:

scoring each text to be processed according to a 2-gram language model to obtain a first score;

scoring each text to be processed according to a 3-gram language model to obtain a second score;

and calculating the score corresponding to each text to be processed according to the first score and the second score.

Preferably, the method for determining the preset threshold includes:

scoring each text to be processed according to the N-gram language model, and obtaining scores of all error positions as an error score set;

taking 85% -95% percentiles of the error score sets as standby thresholds respectively;

summing the scores corresponding to the texts to be processed, and calculating to obtain the score corresponding to the text to be processed;

comparing the standby threshold with the score corresponding to the text to be detected;

determining that errors exist in the text to be detected which is smaller than the standby threshold value, and determining the positions of the errors;

and respectively testing the standby threshold values, and determining the threshold value with the highest score corresponding to the text to be detected as the preset threshold value.

A wrongly written or mispronounced character detection device is applied to the recognition of wrongly written or mispronounced characters in a Chinese electronic medical record, and comprises the following components:

the first processing unit is used for acquiring a text to be detected, and performing sentence division processing on the text to be detected to obtain at least one text to be processed, wherein the text to be detected is a Chinese electronic medical record;

the second processing unit is used for scoring each text to be processed according to an N-gram language model to obtain a score corresponding to each text to be processed;

the third processing unit is used for comparing the score corresponding to each text to be processed with a preset threshold value;

and the fourth processing unit is used for determining that wrongly written characters exist in the text to be processed and positioning the positions of the wrongly written characters when the score corresponding to the text to be processed is smaller than a preset threshold value.

Preferably, the first processing unit is further configured to:

removing interference elements in the text to be detected;

Preferably, the second processing unit is further configured to:

A storage medium comprising a stored program, wherein a device on which the storage medium is located is controlled to perform the method of detecting wrongly written words as described above when the program is run.

An electronic device comprising at least one processor, and at least one memory, bus connected with the processor; the processor and the memory complete mutual communication through the bus; the processor is configured to call program instructions in the memory to perform the method for detecting wrongly written words as described above.

The method and the device for detecting wrongly written characters are applied to the recognition of wrongly written characters in a Chinese electronic medical record. The method acquires a text to be detected, which is a Chinese electronic medical record, and performs sentence division processing on it to obtain at least one text to be processed; scores each text to be processed according to the N-gram language model to obtain a score corresponding to each text to be processed; compares the score corresponding to each text to be processed with a preset threshold value; and, when the score corresponding to a text to be processed is smaller than the preset threshold value, determines that wrongly written characters exist in that text and locates their positions. Because detection is based on a combined 2-gram and 3-gram score, wrongly written characters in medical corpus data can be detected effectively and at high speed, which lays a foundation for the research and development of subsequent products. In addition, for different data environments, the threshold value searching method can be used to adjust the threshold value, so the applicability is strong.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flowchart of a method for detecting wrongly written characters according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a device for detecting wrongly written characters according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application.

Detailed Description

The applicant has found that the prior art offers three kinds of methods for detecting wrongly written characters in medical corpus data: detection based on a single n-gram, detection based on word segmentation, and detection based on deep learning. The single n-gram method uses only one scoring method (2-gram or 3-gram), or fails to combine several scoring methods effectively, so its detection accuracy is low. The word segmentation method segments the input sentence with a word segmentation dictionary and treats any single character left unsegmented as a wrongly written character; it cannot cover unobserved medical-record words, so it can only handle words that belong to the word segmentation dictionary, has poor flexibility, and cannot accurately determine the position of the wrong character. The deep learning method does not actually detect wrongly written characters: it uses a trained encoder-decoder model to convert one sentence into another sentence, and it requires a large amount of training corpus and time, so its cost is relatively high and its effect is poor.

From the above discussion, the applicant believes that the wrongly written character detection methods of the prior art all have problems and cannot detect wrongly written characters in medical corpus data effectively and at high speed.

The application provides a method and a device for detecting wrongly written characters, which are applied to the recognition of wrongly written characters in a Chinese electronic medical record and aim to solve the following problem: how to detect wrongly written characters in medical corpus data effectively and at high speed, so as to lay a foundation for the research and development of subsequent products.

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to Fig. 1, a schematic flowchart of a method for detecting wrongly written characters according to an embodiment of the present application is shown. As shown in Fig. 1, an embodiment of the present application provides a method for detecting wrongly written characters, which is applied to the recognition of wrongly written characters in a Chinese electronic medical record, and the method includes:

S101: acquiring a text to be detected, and performing sentence division processing on the text to be detected to obtain at least one text to be processed, wherein the text to be detected is a Chinese electronic medical record.

In the embodiment of the application, the text to be detected is a Chinese electronic medical record. After the text to be detected is acquired, sentence division processing is performed on it; specifically, the text is divided into sentences at periods. For example, the text to be detected can be split into a plurality of sentences using punctuation marks (such as a period, a question mark or an exclamation mark).
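The sentence division described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `split_sentences` is hypothetical, and the choice to split on the Chinese period, question mark and exclamation mark plus the ASCII `?` and `!` (but not the ASCII `.`, which also appears in decimals) is an assumption.

```python
import re

# One sentence = a run of non-terminator characters followed by an optional
# terminator. The set of terminators is an assumption for illustration.
_SENTENCE_RE = re.compile(r"[^。？！?!]+[。？！?!]?")


def split_sentences(text: str) -> list[str]:
    """Split a medical-record text into sentences, keeping the end mark."""
    pieces = _SENTENCE_RE.findall(text)
    return [p.strip() for p in pieces if p.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("患者发热3天。无咳嗽！今日出院"):
        print(sentence)
```

Each returned string corresponds to one "text to be processed" in the sense of step S101.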

It should be noted that an electronic medical record (EMR), also called a computerized medical record system or computer-based patient record, uses electronic devices (computers, health cards, etc.) to store, manage, transmit and reproduce digitized medical records in place of handwritten paper medical records. Its content mainly includes the chief complaint, history of present illness, past history, personal history, family history, physical examination, and the like.

In the embodiment of the present application, when performing clause processing on the text to be detected, the method further includes any one or more of the following steps:

and removing the interference elements in the text to be detected.

The interference elements may include emoticons (e.g., emoji stored as Unicode code points, such as a "smiley face"), Uniform Resource Locator (URL) addresses, and the like. In one implementation, regular expressions may be used to remove interference elements from the text to be detected.

In one implementation, a regular expression may be used to extract a numeric character string from the text to be detected and convert it into a form such as "Num + {length of the numeric character string}", so as to reduce the dimensionality of numbers, where "Num" indicates that the current character string is a numeric character string. For example, the numeric string "123456" may be converted into Num6, and the numeric string "1111111111" may be converted into Num10.
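The two preprocessing steps above (removing interference elements and converting numeric strings) might look like the sketch below. The regular expressions are assumptions: the URL pattern only covers `http`/`https` addresses, and the emoji ranges are rough and not exhaustive. The function names are hypothetical.

```python
import re

# Assumed patterns, for illustration only.
URL_RE = re.compile(r"https?://[^\s\u4e00-\u9fff]+")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
DIGITS_RE = re.compile(r"\d+")


def remove_interference(text: str) -> str:
    """Delete URLs and emoji from the text."""
    text = URL_RE.sub("", text)
    return EMOJI_RE.sub("", text)


def normalize_numbers(text: str) -> str:
    """Replace every run of digits with 'Num' + the length of the run."""
    return DIGITS_RE.sub(lambda m: f"Num{len(m.group())}", text)


def preprocess(text: str) -> str:
    """Remove interference first so that digits inside URLs are not converted."""
    return normalize_numbers(remove_interference(text))
```

Note that a decimal such as 36.5 is treated here as two digit runs separated by a point; whether decimals should be handled as one unit is not stated in the description.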

It should be noted that several steps of the above-mentioned processes can be flexibly combined and used in combination with actual situations. For example, when there are interfering elements, such as emoticons and unrecognizable character strings, in a text to be detected, steps of removing the interfering elements from the text to be detected, converting the character strings in the text to be detected into a predetermined format, performing sentence division processing, and the like need to be performed. In addition, each step can also be flexibly changed in combination with the actual situation, for example, if the URL in the text to be detected is desired to be detected, the step of removing the URL of the interfering element in the text to be detected may not be executed.

S102: scoring each text to be processed according to the N-gram language model to obtain a score corresponding to each text to be processed.

In the embodiment of the application, the N-gram is a language model that uses collocation information between adjacent words in context. The model is based on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no other words, so that the probability of a whole sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained by directly counting, in a corpus, the number of times the N words occur together. The most commonly used models are the binary Bi-Gram and the ternary Tri-Gram, namely the 2-gram and the 3-gram.
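As an illustration of obtaining these probabilities by counting, the sketch below builds character-level n-gram counts and derives a maximum-likelihood conditional probability, count(gram) / count(prefix). Working at the character level, the absence of smoothing, and the function names are all assumptions made for the example; the description does not specify them.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count every run of n consecutive characters in the corpus."""
    counts = Counter()
    for s in sentences:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts


def conditional_probability(gram, counts_n, counts_prefix):
    """Estimate P(last char | preceding chars) as count(gram) / count(prefix).

    counts_n holds counts of grams of len(gram); counts_prefix holds counts
    of grams one character shorter. Unseen prefixes give 0.0 (no smoothing).
    """
    prefix_count = counts_prefix[gram[:-1]]
    if prefix_count == 0:
        return 0.0
    return counts_n[gram] / prefix_count


if __name__ == "__main__":
    corpus = ["出院医嘱", "出院带药"]  # toy corpus, not real data
    unigrams = ngram_counts(corpus, 1)
    bigrams = ngram_counts(corpus, 2)
    print(conditional_probability("院医", bigrams, unigrams))
```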

It should be noted that the step of training the N-gram language model includes:

In the embodiment of the application, to train the N-gram language model, a number of Chinese electronic medical records are collected and divided into sentences at periods; the sentences containing errors are then labeled together with the positions of the errors; these erroneous sentences are used as training samples, and correct Chinese electronic medical records are used for comparison, so as to construct the N-gram language model.

After the N-gram language model is built, each sentence in the text to be detected is scored with the 2-gram and with the 3-gram, and the 2-gram and 3-gram scores are then combined to obtain the score corresponding to each sentence of the text to be detected.

Specifically, the scoring each text to be processed according to the N-gram language model to obtain a score corresponding to each text to be processed includes:

In the embodiment of the present application, the key point of wrongly written character detection is the scoring rule applied to sentences. Taking the sentence "doctor's advice for discharge: all-purposeBook (I)2 weeks" as an example, in which the character in bold face is a wrongly written character, the scoring rule is described in detail below; see Table 1.

TABLE 1 rules of scoring

The items in table 1 above are explained:

Column N2: the 2-gram score of the two characters at the current position and the current position + 1.

Column N3+1: the 3-gram score of the three characters at the current position, the current position + 1, and the current position + 2.

Column N3-1: the 3-gram score of the three characters at the current position - 1, the current position, and the current position + 1.

Score column: the score calculated by the formula in the table, which is used for judging whether the character is correct.

As can be seen from Table 1, the scores of row 1 and column 1 are similar to the scores of row 1 and column 5, but from the analysis, the data of row 1 and column 1 are correct data, and the data of row 1 and column 5 are wrong data.

It can be seen that a 2-gram score alone cannot distinguish correct characters from incorrect ones, while a 3-gram score alone cannot accurately determine the position of the incorrect character. The combined score both reduces the number of false detections and determines the location of the error more accurately.
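The three windows described for Table 1 can be sketched as below. Because the combining formula of the Score column is not reproduced in this text, the sketch takes the combining function as a parameter; the default (the arithmetic mean of the available columns) is only a placeholder assumption, not the formula of the application. The lookup tables `score2` and `score3` and the function name are likewise hypothetical.

```python
from statistics import mean


def position_scores(sentence, score2, score3, combine=mean):
    """Return one row per character: (char, N2, N3+1, N3-1, combined).

    score2 / score3 map 2-grams / 3-grams to scores; unseen grams score 0.0.
    A column is None when its window does not fit inside the sentence.
    `combine` is a placeholder for the Score-column formula of Table 1.
    """
    rows = []
    n = len(sentence)
    for i, ch in enumerate(sentence):
        # N2: characters at positions i, i+1
        n2 = score2.get(sentence[i:i + 2], 0.0) if i + 2 <= n else None
        # N3+1: characters at positions i, i+1, i+2
        n3_next = score3.get(sentence[i:i + 3], 0.0) if i + 3 <= n else None
        # N3-1: characters at positions i-1, i, i+1
        n3_prev = (score3.get(sentence[i - 1:i + 2], 0.0)
                   if i >= 1 and i + 2 <= n else None)
        available = [v for v in (n2, n3_next, n3_prev) if v is not None]
        combined = combine(available) if available else None
        rows.append((ch, n2, n3_next, n3_prev, combined))
    return rows
```

Passing a different `combine` callable lets the same loop be reused once the actual formula is known.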

In addition, to further illustrate the advantages of the scoring rule in the embodiment of the present application, the 2-gram + 3-gram + 4-gram + ... + n-gram combination is compared with the 2-gram + 3-gram combination as follows.

For ease of comparison, the 2-gram + 3-gram + 4-gram combination is used as an example:

Method 1: hierarchical computation. The combined score of the 3-gram and the 4-gram (the 3-4-gram score) is calculated first, and then the combined score of the 2-gram score and the 3-4-gram score is calculated. The results are shown in Tables 2 and 3.

TABLE 2 3-gram and 4-gram combination scores

TABLE 3 2-gram and 3-4-gram combination scores

As can be seen from the above comparison, the results of Method 1 are almost the same as those of the 2-gram and 3-gram combination.

Method 2: the calculation method and results are shown in Table 4.

TABLE 4

It can be seen that the results of Method 2 differ somewhat from those of the 2-gram and 3-gram combination, but in practical application the effect is almost the same. Since, for the same effect, fewer calculation steps are better, the 2-3-gram combined score is the relatively optimal method.

S103: comparing the score corresponding to each text to be processed with a preset threshold value.

S104: when the score corresponding to the text to be processed is smaller than the preset threshold value, determining that wrongly written characters exist in the text to be processed, and locating the positions of the wrongly written characters.
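Steps S103 and S104 can be illustrated with the sketch below. It assumes that each sentence already has a list of per-character scores (with None for positions that could not be scored) and that a sentence is flagged when any position scores below the threshold; this reading, the data layout and the function name are assumptions, not details given in the description.

```python
def detect_typos(scored_sentences, threshold):
    """Return {sentence: [positions]} for positions scoring below threshold.

    scored_sentences maps each sentence to a list of per-character scores;
    None marks a position without a score and is never flagged.
    """
    findings = {}
    for sentence, scores in scored_sentences.items():
        positions = [i for i, s in enumerate(scores)
                     if s is not None and s < threshold]
        if positions:
            findings[sentence] = positions
    return findings
```

Sentences with no position below the threshold are simply absent from the result.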

It should be noted that the method for determining the preset threshold includes:

Specifically, the detailed steps are described as follows:

Step 1: score the sentences of the whole electronic medical record using the scoring method in Table 1, and take the scores at all error positions as the error score set.

For example, through Step 1, an error score set A is obtained:

[0.05878375, 0.005998509, 0.003742163, 0.005285632, 0.002947054, 0.00073473, 0.001753893, 0.014158532]

Step 2: take the 85%-95% percentiles of the error score set as standby thresholds.

Calculating the 85%-95% percentiles of A gives a set B:

[0.000795337, 0.000796084, 0.000796797, 0.000797551, 0.000798224, 0.000799651, 0.000800364, 0.000801078, 0.000801791, 0.000802504]
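Step 2 can be sketched as a percentile computation over the error score set. The description does not state which percentile definition or step size is used, so the linear interpolation between order statistics and the one-percent steps below are assumptions, and the values this sketch returns need not coincide with set B. The function names are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0-100), linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def candidate_thresholds(error_scores, start=85, stop=95):
    """One standby threshold per whole percentile from start to stop inclusive."""
    return [percentile(error_scores, q) for q in range(start, stop + 1)]
```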

Step 3: use each of the thresholds obtained in Step 2 in turn to check the sentence scores of the whole electronic medical record; where a score is smaller than the threshold, an error is detected. Comparing the result with the known error positions gives the number of detections and the number of false detections. The score of each sentence equals the number of detections minus the number of false detections, and the score of the whole electronic medical record is the sum of the scores of its sentences. The final results are given in Table 5.

TABLE 5

As can be seen from Table 5, three thresholds (0.000798224, 0.000799651, 0.000800364) have the highest overall score. Therefore, if the highest number of detections is required, 0.000798224 is selected; if the lowest false detection rate is required, 0.000800364 is selected.
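The evaluation in Step 3 (detections minus false detections, summed over sentences, then choosing the best-scoring threshold) can be sketched as follows. The data layout (per-character score lists keyed by a sentence identifier, and a set of known error positions per sentence) and the function names are assumptions made for the example.

```python
def evaluate_threshold(threshold, scored_sentences, known_errors):
    """Sum over sentences of (detections - false detections).

    scored_sentences: {sentence_id: [per-character score or None]}
    known_errors:     {sentence_id: set of labeled error positions}
    """
    total = 0
    for sid, scores in scored_sentences.items():
        flagged = {i for i, s in enumerate(scores)
                   if s is not None and s < threshold}
        truth = known_errors.get(sid, set())
        detected = len(flagged & truth)        # flagged and really wrong
        false_detected = len(flagged - truth)  # flagged but actually correct
        total += detected - false_detected
    return total


def choose_threshold(candidates, scored_sentences, known_errors):
    """Return the candidate with the highest total; ties keep the earliest."""
    return max(candidates,
               key=lambda t: evaluate_threshold(t, scored_sentences, known_errors))
```

When several candidates tie, as in Table 5, the final choice depends on whether more detections or fewer false detections is preferred; this sketch simply keeps the first tied candidate in the list.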

The embodiment of the application provides a method for detecting wrongly written characters, which is applied to the recognition of wrongly written characters in a Chinese electronic medical record. The method acquires a text to be detected, which is a Chinese electronic medical record, and performs sentence division processing on it to obtain at least one text to be processed; scores each text to be processed according to the N-gram language model to obtain a score corresponding to each text to be processed; compares the score corresponding to each text to be processed with a preset threshold value; and, when the score corresponding to a text to be processed is smaller than the preset threshold value, determines that wrongly written characters exist in that text and locates their positions. Because detection is based on a combined 2-gram and 3-gram score, wrongly written characters in medical corpus data can be detected effectively and at high speed, which lays a foundation for the research and development of subsequent products. In addition, for different data environments, the threshold value searching method can be used to adjust the threshold value, so the applicability is strong.

Referring to Fig. 2, based on the method for detecting wrongly written characters disclosed in the above embodiment, the present embodiment correspondingly discloses a device for detecting wrongly written characters, which is applied to the recognition of wrongly written characters in a Chinese electronic medical record, and the device includes:

the first processing unit 201 is configured to acquire a text to be detected, and perform sentence division processing on the text to be detected to obtain at least one text to be processed, where the text to be detected is a Chinese electronic medical record;

the second processing unit 202 is configured to score each to-be-processed text according to an N-gram language model to obtain a score corresponding to each to-be-processed text;

the third processing unit 203 is configured to compare the score corresponding to each text to be processed with a preset threshold;

a fourth processing unit 204, configured to determine that a wrongly written or mispronounced word exists in the to-be-processed text when the score corresponding to the to-be-processed text is smaller than a preset threshold, and locate a position of the wrongly written or mispronounced word.

Further, the first processing unit 201 is further configured to:

removing interference elements in the text to be detected;

Further, the second processing unit 202 is further configured to:

The device for detecting the wrongly written words comprises a processor and a memory, wherein the first processing unit, the second processing unit, the third processing unit, the fourth processing unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.

The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. The kernel can be set to be one or more than one, the purpose of high-speed and effective detection of wrongly written characters on the medical corpus data is achieved by adjusting the kernel parameters, and a foundation is laid for the research and development of subsequent products.

An embodiment of the present application provides a storage medium on which a program is stored, the program implementing the wrongly written word detection method when executed by a processor.

The embodiment of the application provides a processor, wherein the processor is used for running a program, and the method for detecting the wrongly written words is executed when the program runs.

The embodiment of the present application provides an electronic device, as shown in fig. 3, the electronic device 30 includes at least one processor 301, and at least one memory 302 and a bus 303 connected to the processor; the processor 301 and the memory 302 complete communication with each other through the bus 303; the processor 301 is configured to call program instructions in the memory 302 to execute the method for detecting a wrongly written word.

The electronic device herein may be a server, a PC, a PAD, a mobile phone, etc.

The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:

Further, when the sentence dividing processing is performed on the text to be detected, any one or more of the following steps are also included:

removing interference elements in the text to be detected;

Further, the step of training the N-gram language model includes:

Further, the scoring each text to be processed according to the N-gram language model to obtain a score corresponding to each text to be processed includes:

Further, the method for determining the preset threshold includes:

The present application is described in terms of flowcharts and/or block diagrams of methods, apparatus (systems), computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.

The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method for detecting wrongly written characters is characterized by being applied to recognition of wrongly written characters in a Chinese electronic medical record, and comprises the following steps:

2. The method according to claim 1, wherein when the sentence dividing processing is performed on the text to be detected, any one or more of the following steps are further included:

removing interference elements in the text to be detected;

3. The method according to claim 1 or 2, wherein the step of training the N-gram language model comprises:

4. The method according to claim 1, wherein the scoring each text to be processed according to an N-gram language model to obtain a score corresponding to each text to be processed comprises:

5. The method according to claim 4, wherein the method for determining the preset threshold value comprises:

6. A wrongly written or mispronounced character detection device is applied to the recognition of wrongly written or mispronounced characters in a Chinese electronic medical record, and comprises:

7. The apparatus of claim 6, wherein the first processing unit is further configured to:

removing interference elements in the text to be detected;

8. The apparatus according to claim 6 or 7, wherein the second processing unit is further configured to:

9. A storage medium comprising a stored program, wherein a device on which the storage medium is located is controlled to perform the method of detecting a wrongly written word according to any one of claims 1 to 5 when the program is run.

10. An electronic device comprising at least one processor, and at least one memory, bus connected to the processor; the processor and the memory complete mutual communication through the bus; the processor is configured to call program instructions in the memory to perform the method of detecting wrongly written words as claimed in any one of claims 1 to 5.