CN114387602A

CN114387602A - Medical OCR data optimization model training method, optimization method and equipment

Info

Publication number: CN114387602A
Application number: CN202210294556.0A
Authority: CN
Inventors: 安波
Original assignee: Beijing Zhiyuan Artificial Intelligence Research Institute
Current assignee: Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date: 2022-03-24
Filing date: 2022-03-24
Publication date: 2022-04-22
Anticipated expiration: 2042-03-24
Also published as: CN114387602B

Abstract

The invention discloses a medical OCR data optimization model training method, an optimization method and equipment, wherein the training method comprises the following steps: acquiring large-scale label-free medical text data, and identifying medical terms and characters in the text data to form a training set; pre-training the training set to obtain a pre-training data set for training the medical OCR optimization model, and training the medical OCR optimization model by using the pre-training data set; the pre-training process comprises: carrying out data augmentation processing on low-frequency terms and low-frequency characters in the training set; randomly replacing a first target character in a training set with an error character; shielding a second target character in the training set; and segmenting the training set into a plurality of text paragraphs to obtain a pre-training data set for training the medical OCR optimization model. According to the invention, the pre-training language model in the medical field is utilized to perform structured extraction, error recognition and optimization on the medical OCR result, so that the accuracy of the medical OCR is improved.

Description

Medical OCR data optimization model training method, optimization method and equipment

技术领域technical field

本发明涉及智能医疗数据处理技术领域，尤其涉及一种医疗OCR数据优化模型训练方法、优化方法及相关的电子设备和计算机可读存储介质。The invention relates to the technical field of intelligent medical data processing, in particular to a medical OCR data optimization model training method, an optimization method, and related electronic equipment and computer-readable storage media.

背景技术Background technique

随着机器学习的快速发展，光学文字识别（OCR）目前在文字识别取得了长足的进步，已经出现多种商业应用如百度OCR等。在医疗领域中，临床医学研究、病案结构化、核保理赔等都要对纸质数据进行结构化。如何将纸质医疗数据转换为计算机可处理的结构数据已成为智能医疗发展的关键。医疗图片数据结构化也需要进行光学识别，识别的结果决定了后续过程。然而，医疗领域的光学文本识别的准确性还存在较多的问题。与通用领域的图像文字识别不同，医疗图像文字识别包含大量医疗专业术语，如病历中疾病的名称和字段名称，而且术语库的规模较大，常用的医疗专用术语已超过100万。而且，医学领域包含大量的生僻的、非常用的字符，这些字符在通用文本中出现的频率极低，同时罕见病等非生僻字的医疗术语在语料库中的出现频率也很低，如“川崎病”，这些低频术语的识别准确率较低（如“睑”经常会识别为“脸”）。另外，医疗领域常用字符指代各类病症及状态等，但这些字符的出现频率也较低，因此OCR文字识别模型难以准确识别这些字符。在病历的结构上，与普通的文本材料不同，医疗数据通常具有特定结构，如病历报告中包含多个字段的信息，不同字段包含的数据类型不同，然而目前的文本识别系统缺少对结构信息的利用。医疗领域的文本语言风格非常简练，医务人员在形成文档时经常省略大量非医学词语。以上各个方面对OCR和已有的后处理模型都提出了新的挑战。With the rapid development of machine learning, Optical Character Recognition (OCR) has made great progress in character recognition, and there have been various commercial applications such as Baidu OCR. In the medical field, paper data must be structured for clinical medical research, medical record structuring, and underwriting claims. How to convert paper medical data into computer-processable structured data has become the key to the development of intelligent medical care. The structuring of medical image data also requires optical recognition, and the result of the recognition determines the subsequent process. However, there are still many problems with the accuracy of optical text recognition in the medical field. Different from image text recognition in the general field, medical image text recognition contains a large number of medical professional terms, such as the names of diseases and field names in medical records, and the term base is large, with more than 1 million commonly used medical terms. Moreover, the medical field contains a large number of rare and very used characters, which appear very infrequently in general texts, and medical terms such as rare diseases and other non-rare words appear in the corpus very low frequency, such as "Kawasaki" Diseases", these low-frequency terms have low recognition accuracy (eg "lid" is often recognized as "face"). In addition, characters commonly used in the medical field refer to various diseases and states, etc., but the frequency of these characters is low, so it is difficult for the OCR character recognition model to accurately recognize these characters. In terms of the structure of medical records, different from ordinary text materials, medical data usually has a specific structure, such as the information of multiple fields contained in the medical record report, and the data types contained in different fields are different. However, the current text recognition system lacks structural information. use. The language style of texts in the medical field is very concise, and medical personnel often omit a large number of non-medical words when forming documents. All of the above aspects pose new challenges to OCR and existing post-processing models.

发明内容SUMMARY OF THE INVENTION

为了解决现有技术中存在的OCR无法准确识别异常或错误数据问题，本发明提供了如下技术方案。In order to solve the problem in the prior art that OCR cannot accurately identify abnormal or incorrect data, the present invention provides the following technical solutions.

本发明在第一方面提供了一种医疗OCR优化模型训练方法，包括：The present invention provides a medical OCR optimization model training method in a first aspect, including:

获取大规模无标注医疗文本数据，并对所述大规模无标注医疗文本数据中的医疗术语和字符进行识别以形成训练集；acquiring large-scale unlabeled medical text data, and identifying medical terms and characters in the large-scale unlabeled medical text data to form a training set;

对所述训练集进行预训练处理以得到用于训练所述医疗OCR优化模型的预训练数据集，并利用所述预训练数据集对所述医疗OCR优化模型进行训练；Performing pre-training processing on the training set to obtain a pre-training data set for training the medical OCR optimization model, and using the pre-training data set to train the medical OCR optimization model;

其中，所述预训练处理包括：Wherein, the pre-training process includes:

对所述训练集中的低频术语和低频字符进行数据增广处理；performing data augmentation processing on the low-frequency terms and low-frequency characters in the training set;

将所述训练集中的第一目标字符随机替换为错误字符；Randomly replace the first target character in the training set with an incorrect character;

对所述训练集中的第二目标字符进行遮挡；以及occluding the second target character in the training set; and

将所述训练集切分为多个文本段落，得到用于训练所述医疗OCR优化模型的预训练数据集。The training set is divided into multiple text paragraphs to obtain a pre-training data set for training the medical OCR optimization model.

优选地，在所述对所述训练集中的低频术语和低频字符进行数据增广处理之前，进一步包括：Preferably, before the data augmentation processing is performed on the low-frequency terms and low-frequency characters in the training set, the method further includes:

统计识别出的所述训练集中的每个医疗术语和字符的频次，根据相应的低频阈值来确定所述训练集中的低频术语和低频字符。The frequency of each medical term and character in the training set identified is counted, and the low-frequency term and low-frequency character in the training set are determined according to the corresponding low-frequency threshold.

优选地，在所述形成训练集之后，进一步包括：Preferably, after forming the training set, it further includes:

利用医疗知识图谱对所述训练集进行医疗术语的表示学习，并在表示空间进行映射。Representation learning of medical terms is performed on the training set using the medical knowledge graph, and mapping is performed in the representation space.

优选地，所述将所述训练集中的第一目标字符随机替换为错误字符，进一步包括：Preferably, the randomly replacing the first target character in the training set with an error character further includes:

从所述训练集中的医疗术语和字符中筛选第一目标字符，其中所述第一目标字符包括字形相似字典中所包含的字符和/或医疗常用字符。First target characters are screened from medical terms and characters in the training set, wherein the first target characters include characters contained in a glyph similarity dictionary and/or commonly used medical characters.

优选地，所述利用所述预训练数据集对所述医疗OCR优化模型进行训练，进一步包括：Preferably, the use of the pre-training data set to train the medical OCR optimization model further includes:

在已将所述第一目标字符随机替换为错误字符之后，将当前训练集作为第一数据集，迭代地根据当前上下文提取所述第一数据集中的所述错误字符，并预测与所述错误字符相对应的所述第一目标字符以训练所述医疗OCR优化模型的字符纠错能力。After the first target characters have been randomly replaced with wrong characters, the current training set is taken as the first data set, the wrong characters in the first data set are iteratively extracted according to the current context, and predictions that match the wrong characters are performed. The first target character corresponding to the character is used to train the character error correction capability of the medical OCR optimization model.

在已遮挡所述第二目标字符之后，将当前训练集作为第二数据集，迭代地根据当前上下文预测与所述第二数据集中的被遮挡位置相对应的所述第二目标字符以训练所述医疗OCR优化模型识别遮挡内容的能力。After the second target character has been occluded, using the current training set as the second data set, iteratively predicts the second target character corresponding to the occluded position in the second data set according to the current context to train the The ability of the medical OCR optimization model to identify occluded content.

迭代地根据当前上下文预测所述预训练数据集中的段落结束语句以训练所述医疗OCR优化模型自动分段的能力。Iteratively predicts end-of-paragraph sentences in the pre-trained dataset based on the current context to train the medical OCR optimization model's ability to automatically segment.

本发明在第二方面提供了一种医疗OCR数据优化方法，包括：The present invention provides a medical OCR data optimization method in a second aspect, comprising:

获取目标医疗图像，并对目标医疗图像进行OCR识别，得到待优化文本数据；Obtain the target medical image, perform OCR recognition on the target medical image, and obtain the text data to be optimized;

将所述待优化文本数据输入医疗OCR优化模型，以使所述医疗OCR优化模型输出与所述待优化文本数据对应的医疗术语和字符识别结果；Inputting the text data to be optimized into a medical OCR optimization model, so that the medical OCR optimization model outputs medical terms and character recognition results corresponding to the text data to be optimized;

其中，所述医疗OCR优化模型预先基于第一方面所述的医疗OCR优化模型训练方法得到。Wherein, the medical OCR optimization model is obtained in advance based on the medical OCR optimization model training method described in the first aspect.

本发明另一方面提供了一种电子设备，包括处理器和存储器，所述存储器存储有多条指令，所述处理器用于读取所述指令并执行第一方面所述的医疗OCR数据优化模型训练方法，或者执行基于第二方面所述的医疗OCR数据优化方法。Another aspect of the present invention provides an electronic device, comprising a processor and a memory, the memory stores a plurality of instructions, the processor is configured to read the instructions and execute the medical OCR data optimization model described in the first aspect training method, or executing the medical OCR data optimization method based on the second aspect.

本发明又一方面提供了一种计算机可读存储介质，所述计算机可读存储介质存储有多条指令，所述多条指令可被处理器读取并执行第一方面所述的医疗OCR数据优化模型训练方法，或者执行基于第二方面所述的医疗OCR数据优化方法。Yet another aspect of the present invention provides a computer-readable storage medium, where the computer-readable storage medium stores a plurality of instructions, and the plurality of instructions can be read by a processor to execute the medical OCR data described in the first aspect The model training method is optimized, or the medical OCR data optimization method based on the second aspect is executed.

本发明的有益效果是：本发明的技术方案在数据增强的基础上，利用医疗领域预训练语言模型对医疗OCR结果进行结构化提取、错误识别及优化，提升了医疗图像文字识别的准确率，尤其是提升了对医疗术语、病历关键词等关键词汇的识别准确率，同时能够对文本段落进行辅助切分，用于实现后续的医疗知识抽取、事件抽取。The beneficial effects of the present invention are: on the basis of data enhancement, the technical solution of the present invention uses a pre-trained language model in the medical field to perform structured extraction, error recognition and optimization of medical OCR results, thereby improving the accuracy of medical image text recognition, In particular, the recognition accuracy of key words such as medical terminology and medical record keywords is improved, and at the same time, text paragraphs can be subdivided for subsequent medical knowledge extraction and event extraction.

附图说明Description of drawings

图1为本发明所述的医疗OCR优化模型训练方法的流程图。FIG. 1 is a flowchart of the medical OCR optimization model training method according to the present invention.

图2为本发明所述的用于模型训练的预训练数据集的形成过程示意图。FIG. 2 is a schematic diagram of the formation process of the pre-training data set used for model training according to the present invention.

图3为本发明所述的面向文字识别后处理的预训练语言模型的训练过程示意图。FIG. 3 is a schematic diagram of the training process of the pre-trained language model for post-processing of text recognition according to the present invention.

图4为本发明所述的医疗OCR数据优化方法的流程图。FIG. 4 is a flow chart of the medical OCR data optimization method according to the present invention.

图5为本发明所述的医疗图像文字识别方法的详细流程图。FIG. 5 is a detailed flowchart of the method for recognizing text in medical images according to the present invention.

具体实施方式Detailed ways

为了更好的理解上述技术方案，下面将结合说明书附图以及具体的实施方式对上述技术方案做详细的说明。In order to better understand the above technical solutions, the above technical solutions will be described in detail below with reference to the accompanying drawings and specific embodiments.

本发明提供的方法可以在如下的终端环境中实施，该终端可以包括一个或多个如下部件：处理器、存储器和显示屏。其中，存储器中存储有至少一条指令，所述指令由处理器加载并执行以实现下述实施例所述的方法。The method provided by the present invention may be implemented in the following terminal environment, and the terminal may include one or more of the following components: a processor, a memory and a display screen. Wherein, at least one instruction is stored in the memory, and the instruction is loaded and executed by the processor to implement the methods described in the following embodiments.

处理器可以包括一个或者多个处理核心。处理器利用各种接口和线路连接整个终端内的各个部分，通过运行或执行存储在存储器内的指令、程序、代码集或指令集，以及调用存储在存储器内的数据，执行终端的各种功能和处理数据。A processor may include one or more processing cores. The processor uses various interfaces and lines to connect various parts of the entire terminal, and executes various functions of the terminal by running or executing the instructions, programs, code sets or instruction sets stored in the memory, and calling the data stored in the memory. and processing data.

存储器可以包括随机存储器(Random Access Memory，RAM)，也可以包括只读存储器(Read-Only Memory，ROM)。存储器可用于存储指令、程序、代码、代码集或指令。The memory may include random access memory (Random Access Memory, RAM), or may include read-only memory (Read-Only Memory, ROM). Memory may be used to store instructions, programs, codes, sets of codes, or instructions.

显示屏用于显示各个应用程序的用户界面。The display is used to display the user interface of each application.

除此之外，本领域技术人员可以理解，上述终端的结构并不构成对终端的限定，终端可以包括更多或更少的部件，或者组合某些部件，或者不同的部件布置。比如，终端中还包括射频电路、输入单元、传感器、音频电路、电源等部件，在此不再赘述。In addition, those skilled in the art can understand that the structure of the above-mentioned terminal does not constitute a limitation on the terminal, and the terminal may include more or less components, or combine some components, or arrange different components. For example, the terminal also includes components such as a radio frequency circuit, an input unit, a sensor, an audio circuit, and a power supply, which will not be repeated here.

针对上述问题，为了实现对医疗图像OCR结果的优化，本发明提出一种医疗OCR优化模型训练方法，以及基于预训练语言模型的医疗图像文字识别（OCR）的优化方法，本方法利用医疗预训练语言模型对OCR的结果进行后处理。具体利用预设训练模型来识别OCR得到的文本中出错的字符，并预测正确的字符，以得到正确的文本识别结果。本发明在数据增强的基础上利用医疗领域预训练语言模型对医疗OCR结果进行结构化提取、错误识别及优化，旨在提升医疗OCR的准确率。In view of the above problems, in order to optimize the results of medical image OCR, the present invention proposes a medical OCR optimization model training method, and a medical image text recognition (OCR) optimization method based on a pre-trained language model. The method utilizes medical pre-training The language model post-processes the results of OCR. Specifically, a preset training model is used to identify the wrong characters in the text obtained by OCR, and predict the correct characters to obtain the correct text recognition result. On the basis of data enhancement, the present invention utilizes a pre-trained language model in the medical field to perform structured extraction, error identification and optimization of medical OCR results, aiming to improve the accuracy of medical OCR.

实施例一Example 1

如图1所示，本发明实施例提供了一种医疗OCR优化模型训练方法，包括：As shown in FIG. 1, an embodiment of the present invention provides a medical OCR optimization model training method, including:

S101、获取大规模无标注医疗文本数据，并对所述大规模无标注医疗文本数据中的医疗术语和字符进行识别以形成训练集；S101. Obtain large-scale unlabeled medical text data, and identify medical terms and characters in the large-scale unlabeled medical text data to form a training set;

其中，预训练语言模型的原始训练数据为大规模医疗文本数据，这些数据来自于临床指南、医学教材、医疗百科、医疗论坛和积累的病历中的文本数据。基于该大规模无标注的医疗文本数据，利用现有的医疗命名实体识别模型进行识别处理得到初步的医疗术语和字符识别结果，如疾病、诊断、手术、药品、病历关键词（主诉、超声诊断等），具体的术语识别模型可以采用基于深度学习的实体识别、基于统计学习的实体识别或基于词典的方法。Among them, the original training data of the pre-trained language model is large-scale medical text data, which comes from text data in clinical guidelines, medical textbooks, medical encyclopedias, medical forums and accumulated medical records. Based on the large-scale unlabeled medical text data, the existing medical named entity recognition model is used for recognition processing to obtain preliminary medical terminology and character recognition results, such as disease, diagnosis, surgery, medicine, medical record keywords (main complaint, ultrasound diagnosis) etc.), the specific term recognition model can adopt deep learning-based entity recognition, statistical learning-based entity recognition or dictionary-based methods.

S102、对所述训练集进行预训练处理以得到用于训练所述医疗OCR优化模型的预训练数据集，并利用所述预训练数据集对所述医疗OCR优化模型进行训练。S102. Perform pre-training processing on the training set to obtain a pre-training data set for training the medical OCR optimization model, and use the pre-training data set to train the medical OCR optimization model.

为提升医疗图像文本识别的准确率，本发明主要针对面向文字识别后处理的预训练语言模型，对文字识别的结果进行优化，得到对医疗图像文字识别具有更优处理能力的模型。在优选的实施例中，更优的处理能力可以体现在模型能够正确地识别并修正文字识别的错误，补充缺失的字符以及对长文本进行合理的段落划分。因此在步骤S102中，对步骤S101得到的训练集进行预训练处理，得到增强上述识别处理能力的新的训练集，并根据新的训练集对原始医疗OCR优化模型进行训练。In order to improve the accuracy of medical image text recognition, the present invention is mainly aimed at the pre-trained language model for text recognition post-processing, and optimizes the text recognition results to obtain a model with better processing capability for medical image text recognition. In a preferred embodiment, the better processing capability can be reflected in that the model can correctly identify and correct errors in character recognition, supplement missing characters, and divide long texts into reasonable paragraphs. Therefore, in step S102, pre-training is performed on the training set obtained in step S101 to obtain a new training set that enhances the above identification processing capability, and the original medical OCR optimization model is trained according to the new training set.

在进一步优选的实施例中，可以利用医疗知识图谱对所述训练集进行医疗术语的表示学习，并在表示空间进行映射。具体可以利用词典、模型对大规模无标注医疗文本数据进行标注，利用医疗知识图谱和标注文本进行术语的表示学习，其中知识图谱可以采用图神经网络（GNN），文本可以采用Transformer进行表示学习，并在表示空间进行映射，使得不同模态学习到的数据具有相同的向量表示。In a further preferred embodiment, representation learning of medical terms can be performed on the training set by using a medical knowledge graph, and mapping is performed in the representation space. Specifically, dictionaries and models can be used to label large-scale unlabeled medical text data, and medical knowledge graphs and labeled texts can be used to perform term representation learning. The knowledge graph can use a graph neural network (GNN), and the text can use Transformer for representation learning. And map in the representation space, so that the data learned by different modalities have the same vector representation.

其中，所述预训练处理可以包括以下一个或多个方面，如图2所示：Wherein, the pre-training process may include one or more of the following aspects, as shown in FIG. 2:

S1021、对所述训练集中的低频术语和低频字符进行数据增广处理。S1021. Perform data augmentation processing on the low-frequency terms and low-frequency characters in the training set.

针对生僻字的识别，本发明利用复述生成实现数据增广，即增加低频词字的词频，以更好地满足训练需求。具体方式可以选取语句级复述生成方式，可以结合seq2seq模型等，即直接复制并粘贴低频术语和低频字符出现的语句。在具体操作中，可以对医疗文本及术语进行统计分析，得到训练集中的每个医疗术语和字符的频次。根据相应的低频阈值来确定训练集中的低频术语和低频字符，例如出现次数低于20次的术语，或出现次数低于5次的字符等。对这部分数据进行增广对该类文本数据进行增强，得到更为平衡的数据。For the recognition of uncommon words, the present invention utilizes paraphrasing generation to realize data augmentation, that is, increasing the word frequency of low-frequency words, so as to better meet the training requirements. The specific method can choose the sentence-level paraphrase generation method, which can be combined with the seq2seq model, that is, directly copy and paste the sentences with low-frequency terms and low-frequency characters. In specific operations, statistical analysis can be performed on medical texts and terms to obtain the frequency of each medical term and character in the training set. The low-frequency terms and low-frequency characters in the training set are determined according to the corresponding low-frequency thresholds, such as terms that appear less than 20 times, or characters that appear less than 5 times. This part of the data is augmented to enhance this type of text data to obtain more balanced data.

S1022、将训练集中的第一目标字符随机替换为错误字符。S1022. Randomly replace the first target character in the training set with an error character.

具体地，可以从所述训练集中的识别得到的医疗术语和字符中筛选第一目标字符，其中该第一目标字符可以包括字形相似字典中所包含的字符和/或医疗常用字符。由于OCR识别是基于图像的识别，字形相似的字符容易被错误识别。因此针对异常字符的识别，本发明利用字形相似字典对医疗文本中的字符随机使用同形字进行替换，在训练数据中将同形字库的正确的字符替换为错误的形近字。字形相似字典包含常见字符（如“人”和“如”）以及医疗常用字符（如“脸”和“睑”）。举例而言，可以将原始训练集中正确的医学术语“睑板腺”，确定第一目标字符“睑”。根据同形字库对应的字符“脸”替换为错误术语“脸板腺”。在语言模型训练时，可以将包含错误的训练集输入预训练模型，正向地激励模型预测出不符合当前上下文的字符“脸”，并相应地预测出正确的目标字符“睑”，从而提高模型的文本纠错能力。Specifically, the first target characters may be screened from the recognized medical terms and characters in the training set, where the first target characters may include characters contained in a glyph similarity dictionary and/or commonly used medical characters. Since OCR recognition is based on image recognition, characters with similar glyphs are easily misidentified. Therefore, for the identification of abnormal characters, the present invention utilizes a glyph similarity dictionary to randomly replace the characters in the medical text with homographs, and replaces the correct characters of the homograph library with wrong glyphs in the training data. The glyph similarity dictionary contains common characters (such as "person" and "such as") as well as commonly used medical characters (such as "face" and "lid"). For example, the correct medical term "meiibomian gland" in the original training set can be used to determine the first target character "eyelid". The character "face" corresponding to the homograph is replaced by the wrong term "face gland". During language model training, the training set containing errors can be input into the pre-training model, and the model can be positively motivated to predict the character "face" that does not conform to the current context, and correspondingly predict the correct target character "eyelid", thereby improving The text error correction capability of the model.

在进一步优选的实施例中，在随机选择被替换的目标字符时，相比于普通字符，可以提高医疗常用字符的替换频率，以使模型预测出正确的医疗常用字符。由于所述第一目标字符可以是一个以上，即可以将多个易错字符进行随机替换，并将当前训练集作为第一数据集，因此相应地，在对医疗OCR优化模型进行训练期间，迭代地根据当前上下文提取所述第一数据集中的每个错误字符，并预测与每个错误字符相对应的每个正确的第一目标字符，以训练所述医疗OCR优化模型的字符纠错能力。In a further preferred embodiment, when the target characters to be replaced are randomly selected, the replacement frequency of commonly used medical characters can be increased compared with ordinary characters, so that the model can predict correct commonly used medical characters. Since the first target character can be more than one, that is, multiple error-prone characters can be randomly replaced, and the current training set can be used as the first data set, accordingly, during the training of the medical OCR optimization model, iterative Each erroneous character in the first data set is extracted according to the current context, and each correct first target character corresponding to each erroneous character is predicted, so as to train the character error correction ability of the medical OCR optimization model.

S1023、对训练集中的第二目标字符进行遮挡。S1023, occlude the second target character in the training set.

针对于医疗图像文字识别后经常出现的字符缺失问题，本发明优化后的语言模型需要具备缺少词汇和字符的识别能力。可以按照预设概率，从所述训练集中的识别得到的医疗术语和字符中筛选要遮挡的第二目标字符，然后在语言模型的训练时预测被遮挡位置的内容。即在完成一个或多个字符的遮挡操作之后，将当前训练集作为第二数据集。相应地，在对医疗OCR优化模型进行训练期间，迭代地根据当前上下文预测与第二数据集中的被遮挡位置相对应的每个正确的第二目标字符，以训练所述医疗OCR优化模型识别缺失内容的能力。Aiming at the problem of missing characters that often occurs after the medical image text recognition, the optimized language model of the present invention needs to have the ability to recognize the lack of words and characters. The second target character to be occluded can be screened from the recognized medical terms and characters in the training set according to a preset probability, and then the content of the occluded position is predicted during language model training. That is, after completing the occlusion operation of one or more characters, the current training set is used as the second data set. Accordingly, during the training of the medical OCR optimization model, each correct second target character corresponding to the occluded position in the second dataset is iteratively predicted from the current context to train the medical OCR optimization model to identify missing content capability.

与常用的语言模型遮挡操作不同的是，本发明优选地提高医疗术语词汇被遮挡的概率。在优选的实施例中，医疗术语随机替换的概率可以是其他词汇的3倍，同时也会部分地遮挡医疗词汇中的字符。字符的替换频率可以与字符在语料库中出现的次数成反比，越低频的字符被替换的概率越高。Unlike common language model occlusion operations, the present invention preferably increases the probability of medical term vocabulary being occluded. In a preferred embodiment, the probability of random replacement of medical terms can be 3 times that of other terms, and at the same time, the characters in the medical terms are also partially occluded. The replacement frequency of a character can be inversely proportional to the number of times the character appears in the corpus, the lower the frequency the higher the probability of the character being replaced.

S1024、将所述训练集切分为多个文本段落，得到用于训练所述医疗OCR优化模型的预训练数据集。S1024. Divide the training set into multiple text paragraphs to obtain a pre-training data set for training the medical OCR optimization model.

针对病历结构化的特点，在预训练阶段，本发明利用关键词和语言模型进行不同文本块的划分，将医疗数据自动切分为多个独立的文本块，并作为单独的任务训练语言模型，使模型具有字段划分的能力。正确拆分得到的段落对后续的医疗知识抽取、事件抽取都具有重要作用。切分的方式可以利用预先获取的关键字等。在医疗文本中不同的段落描述的内容差异通常较大，本发明的语言模型通过预测当前语句是否为当前段落的最后一句来执行段落切分。该预测可以基于当前文本和下一句文本展开，当两句之间没有明显的语义关联时，进行段落切分。其中，当前语句是否为当前段落的最后一句的预测方式具体也可以参考公式（1）来根据当前的上下文预测当前语句为正确的概率。In view of the structural characteristics of medical records, in the pre-training stage, the present invention uses keywords and language models to divide different text blocks, automatically divides medical data into multiple independent text blocks, and trains language models as separate tasks. Enables the model to have field partitioning capabilities. Correctly split paragraphs play an important role in subsequent medical knowledge extraction and event extraction. The method of segmentation can use pre-acquired keywords and the like. Different paragraphs in medical texts usually have large differences in content. The language model of the present invention performs paragraph segmentation by predicting whether the current sentence is the last sentence of the current paragraph. The prediction can be expanded based on the current text and the next sentence, and when there is no obvious semantic relationship between the two sentences, paragraph segmentation is performed. Among them, the prediction method of whether the current sentence is the last sentence of the current paragraph can also refer to formula (1) to predict the probability that the current sentence is correct according to the current context.

在对医疗OCR优化模型进行训练期间，迭代地根据当前上下文预测所述预训练数据集中的一个或多个段落结束语句，以训练所述医疗OCR优化模型自动分段的能力。During the training of the medical OCR optimization model, one or more end-of-paragraph sentences in the pre-training dataset are iteratively predicted based on the current context to train the medical OCR optimization model's ability to automatically segment.

通过执行S1022、S1023和S1024的预训练过程，得到预训练数据集，接下来，利用得到的预训练数据集对所述医疗OCR优化模型进行迭代地训练和更新，可以得到对医疗图像文字识别具有更优处理能力的模型，包括分别能够正确识别和修正文字识别的错误、缺失的字符以及优化文本段的划分的能力。By performing the pre-training process of S1022, S1023 and S1024, a pre-training data set is obtained. Next, the medical OCR optimization model is iteratively trained and updated by using the obtained pre-training data set, and it can be obtained that the medical image text recognition has a Models for better processing power, including the ability to correctly identify and correct text recognition errors, missing characters, and optimal segmentation of text segments, respectively.

图3示出了离线阶段的预训练语言模型的完整训练过程。应当注意的是，在本发明的上述流程中，字符纠错、字符补充以及文本分段是相对独立的模型优化过程。因此，在实际应用的预训练过程中，选择上述至少一种预训练方式均可增强语言模型的训练数据。而且步骤S1022、S1023和S1024的顺序是可以任意调整的，而不限于上述实施例所描述的顺序。例如，可以先将所述训练集切分为多个文本段落，然后再对训练集中的预设目标字符进行错误替换和/或遮挡，得到预训练数据集，等等。Figure 3 shows the complete training process of the pre-trained language model in the offline phase. It should be noted that, in the above process of the present invention, character error correction, character supplement and text segmentation are relatively independent model optimization processes. Therefore, in the pre-training process of practical application, selecting at least one of the above-mentioned pre-training methods can enhance the training data of the language model. Moreover, the sequence of steps S1022, S1023 and S1024 can be adjusted arbitrarily, and is not limited to the sequence described in the above embodiment. For example, the training set may be first divided into multiple text paragraphs, and then the preset target characters in the training set may be incorrectly replaced and/or occluded to obtain a pre-training data set, and so on.

在进一步的实施例中，针对医疗语言风格的识别问题，本发明的预训练处理还可以包括：In a further embodiment, for the identification of medical language styles, the pre-training process of the present invention may further include:

S1025、抽取大量诊断结果文本进行语言模型的微调。S1025 , extracting a large amount of diagnosis result texts to fine-tune the language model.

通过对大量诊断结果文本的学习和训练来执行模型微调（fine-tuning），从而增强模型对医疗人员语法习惯和行文风格的理解，提升诊断结果的识别准确率。Model fine-tuning is performed by learning and training a large number of texts of diagnostic results, thereby enhancing the model's understanding of medical personnel's grammar habits and writing style, and improving the recognition accuracy of diagnostic results.

此外，作为实现语句错误检测的具体方式，在步骤S1022完成错误字符随机替换之后的模型训练阶段，首先可以估计当前语句为正确语句的概率，接下来识别错误字符，最后根据上下文预测正确字符，并计算纠错后的语句为正确语句的概率。具体地，语句错误检测的计算方法可以如公式（1）表示：In addition, as a specific method for realizing sentence error detection, in the model training stage after the random replacement of wrong characters is completed in step S1022, the probability that the current sentence is a correct sentence can be estimated first, the wrong characters can be identified next, and the correct characters can be predicted according to the context. Calculate the probability that the sentence after error correction is the correct sentence. Specifically, the calculation method of sentence error detection can be expressed as formula (1):

P(s)=P(w₁, w₂,w₃,...,w_n) 公式（1）P(s)=P(w ₁ , w ₂ ,w ₃ ,...,w _n ) Formula (1)

其中s为OCR的语句，该语句由字符序列w₁,w₂,w₃,...,w_n构成，P(s)为该语句为正确语句的概率。Where s is an OCR sentence, the sentence is composed of character sequences w ₁ , w ₂ , w ₃ ,..., _wn , and P(s) is the probability that the sentence is a correct sentence.

当P(s)的值小于给定阈值时，则判定为包含错误。即当识别出错误语句后预测错误字符，模型预测错误字符的方法如公式（2）所示：When the value of P(s) is less than a given threshold, it is determined to contain an error. That is, after identifying the wrong sentence and predicting the wrong character, the model predicts the wrong character as shown in formula (2):

P_error(w_i)=minP(w_i|w₁,..,w_i-1,w_i+1,..,w_n) 公式（2）P _error ( _wi )=minP( _wi |w ₁ ,..,wi _-1 ,wi ₊₁ ,..,w _n ) Formula (2)

其中是字符w_i是识别错误的字符，P(w_i|w₁,..,w_i-1,w_i+1,..,w_n)是在给定上下文情况下w_i出现的概率，语句中概率最低的字符即为错误字符P_error(w_i)。模型给出正确字符的方法如公式（3）所示：where is the character w _i is the misidentified character and P( _wi |w ₁ ,..,wi _-1 ,w _i+1 ,..,w _n ) is the probability of w _i appearing in a given context , the character with the lowest probability in the sentence is the error character P _error ( _wi ). The way the model gives the correct character is shown in formula (3):

w’=maxP(w’|w₁,w₂,w₃,...) 公式（3）w'=maxP(w'|w ₁ ,w ₂ ,w ₃ ,...) Formula (3)

其中w’为预测的正确字符，P(w’|w₁,w₂,w₃,...)给定上下文的情况下，字符w’能够构成合理语句的概率。Where w' is the predicted correct character, P(w'|w ₁ ,w ₂ ,w ₃ ,...) Given the context, the probability that the character w' can constitute a reasonable sentence.

可以看出，通过本发明的上述方法，与现有技术相比，能够进一步提升医疗图像文字识别的准确率，尤其是对医疗术语、病历关键词（主诉等）等关键词汇的识别准确率，同时能够对文本段落进行辅助切分，有利于后续的医疗知识抽取、事件抽取。实验数据表明，采用本发明的医疗OCR优化模型训练方法后，文本识别错误检测率可以达到78%，正确字符的预测正确率可以达到85%，因而能够显著优化医疗图像OCR识别的准确率。本发明的医疗OCR优化模型训练方法对于病历的结构化、临床医学统计、核保理赔等应用都具有重要价值。It can be seen that, compared with the prior art, the above method of the present invention can further improve the accuracy of text recognition in medical images, especially the recognition accuracy of key words such as medical terms, medical record keywords (main complaint, etc.), etc. At the same time, it can perform auxiliary segmentation on text paragraphs, which is beneficial to subsequent medical knowledge extraction and event extraction. The experimental data show that after using the medical OCR optimization model training method of the present invention, the text recognition error detection rate can reach 78%, and the correct character prediction accuracy rate can reach 85%, so the accuracy of medical image OCR recognition can be significantly optimized. The medical OCR optimization model training method of the present invention has important value for the structuring of medical records, clinical medical statistics, underwriting claims and other applications.

实施例二Embodiment 2

如图4所示，本发明在第二方面提供了一种医疗OCR数据优化方法，包括：As shown in FIG. 4 , the present invention provides a medical OCR data optimization method in a second aspect, including:

S201、获取目标医疗图像，并对目标医疗图像进行OCR识别，得到待优化文本数据；S201, obtaining a target medical image, and performing OCR identification on the target medical image to obtain text data to be optimized;

待识别的目标医疗图像可以包含被扫描或拍摄的病历、诊断报告等图像文件。当获取目标医疗图像之后，根据现有的OCR识别算法来提取图像文本数据，作为初始的文本数据。The target medical image to be identified may include image files such as scanned or photographed medical records and diagnostic reports. After acquiring the target medical image, the image text data is extracted according to the existing OCR recognition algorithm as the initial text data.

S202、将待优化文本数据输入医疗OCR优化模型，以使所述医疗OCR优化模型输出与所述待优化文本数据对应的医疗术语和字符识别结果。S202. Input the text data to be optimized into a medical OCR optimization model, so that the medical OCR optimization model outputs medical terms and character recognition results corresponding to the text data to be optimized.

如前所述，在OCR在线识别之前，在离线阶段已经预先基于实施例一的医疗OCR优化模型训练方法得到了最终的医疗OCR优化模型。而在线阶段的完整医疗图像OCR识别方法参见图5所示。将初始的待优化文本数据输入医疗OCR优化模型，以使该模型输出的对应的优化文本数据。由于模型被训练成具有更高的纠正文字识别错误、识别缺失的字符以及优化文本段划分的能力，因此优化后的文本数据至少包含对初始文本数据的段落划分，并在初始文本数据中存在错误医疗术语或字符的情况下，标识出初始文本数据中的错误医疗术语或字符；然后将错误项替换为对应的正确元素项，以及在初始文本数据中存在缺失项的情况下，预测出缺失的医疗术语或字符。As mentioned above, before the OCR online identification, the final medical OCR optimization model has been obtained in advance based on the medical OCR optimization model training method of the first embodiment in the offline stage. The complete medical image OCR recognition method in the online stage is shown in Figure 5. Input the initial text data to be optimized into the medical OCR optimization model, so that the model outputs the corresponding optimized text data. Since the model is trained to have a higher ability to correct text recognition errors, identify missing characters, and optimize text segmentation, the optimized text data contains at least the segment segmentation of the initial text data, and there are errors in the initial text data In the case of medical terms or characters, identify erroneous medical terms or characters in the initial text data; then replace the erroneous terms with the corresponding correct element terms, and in the case of missing terms in the initial text data, predict the missing Medical term or character.

实施例三Embodiment 3

本发明的另一方面还包括和前述方法流程完全对应一致的功能模块架构，即本发明实施例还提供了一种医疗OCR数据优化模型训练装置，包括：Another aspect of the present invention also includes a functional module architecture that is completely consistent with the aforementioned method flow, that is, an embodiment of the present invention also provides a medical OCR data optimization model training device, including:

获取模块301，用于获取大规模无标注医疗文本数据，并对所述大规模无标注医疗文本数据中的医疗术语和字符进行识别以形成训练集；The obtaining module 301 is used for obtaining large-scale unlabeled medical text data, and identifying medical terms and characters in the large-scale unlabeled medical text data to form a training set;

预训练模块302，用于对所述训练集进行预训练处理以得到用于训练所述医疗OCR优化模型的预训练数据集，并利用所述预训练数据集对所述医疗OCR优化模型进行训练；A pre-training module 302, configured to perform pre-training processing on the training set to obtain a pre-training data set for training the medical OCR optimization model, and use the pre-training data set to train the medical OCR optimization model ;

其中，所述预训练处理模块包括：Wherein, the pre-training processing module includes:

增广模块3021，用于对所述训练集中的低频术语和低频字符进行数据增广处理；Augmentation module 3021, configured to perform data augmentation processing on low-frequency terms and low-frequency characters in the training set;

替换模块3022，用于将所述训练集中的第一目标字符随机替换为错误字符；Replacement module 3022, for randomly replacing the first target character in the training set with an incorrect character;

遮挡模块3023，用于对所述训练集中的第二目标字符进行遮挡；以及an occlusion module 3023, configured to occlude the second target character in the training set; and

切分模块3024，用于将所述训练集切分为多个文本段落，得到用于训练所述医疗OCR优化模型的预训练数据集。The segmentation module 3024 is configured to segment the training set into multiple text paragraphs to obtain a pre-training data set for training the medical OCR optimization model.

该装置可通过上述实施例一提供的医疗OCR数据优化模型训练方法实现，具体的实现方法可参见实施例一中的描述，在此不再赘述。The apparatus can be implemented by the medical OCR data optimization model training method provided in the first embodiment. For the specific implementation method, reference may be made to the description in the first embodiment, which will not be repeated here.

实施例四Embodiment 4

相应地，本发明实施例还提供了一种医疗OCR数据优化装置，包括：Correspondingly, the embodiment of the present invention also provides a medical OCR data optimization device, including:

识别模块401，用于获取目标医疗图像，并对目标医疗图像进行OCR识别，得到待优化文本数据；The identification module 401 is used for acquiring the target medical image, and performing OCR identification on the target medical image to obtain text data to be optimized;

优化模块402，用于将所述待优化文本数据输入医疗OCR优化模型，以使所述医疗OCR优化模型输出与所述待优化文本数据对应的医疗术语和字符识别结果；An optimization module 402, configured to input the text data to be optimized into a medical OCR optimization model, so that the medical OCR optimization model outputs medical terms and character recognition results corresponding to the text data to be optimized;

其中，所述医疗OCR优化模型预先基于实施例一所述的医疗OCR优化模型训练方法得到。Wherein, the medical OCR optimization model is obtained in advance based on the medical OCR optimization model training method described in the first embodiment.

实施例五Embodiment 5

本发明另一方面提供了一种电子设备，包括处理器和存储器，所述存储器存储有多条指令，所述处理器用于读取所述指令并执行实施例一所述的医疗OCR数据优化模型训练方法，或者执行实施例二所述的医疗OCR数据优化方法。其中处理器和存储器可以通过总线或者其他方式连接，以通过总线连接为例。处理器可以为中央处理器（CentralProcessing Unit，CPU）。处理器还可以为其他通用处理器、数字信号处理器（DigitalSignal Processor，DSP）、专用集成电路（Application Specific Integrated Circuit，ASIC）、现场可编程门阵列（Field-Programmable Gate Array，FPGA）或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等芯片，或者上述各类芯片的组合。Another aspect of the present invention provides an electronic device, including a processor and a memory, the memory stores a plurality of instructions, and the processor is configured to read the instructions and execute the medical OCR data optimization model of the first embodiment training method, or execute the medical OCR data optimization method described in Embodiment 2. The processor and the memory may be connected by a bus or in other ways, and the connection by a bus is taken as an example. The processor may be a central processing unit (Central Processing Unit, CPU). The processor may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA) or other Chips such as programming logic devices, discrete gate or transistor logic devices, discrete hardware components, or a combination of the above types of chips.

存储器作为一种非暂态计算机可读存储介质，可用于存储非暂态软件程序、非暂态计算机可执行程序以及模块，如本申请实施例中的医疗OCR数据优化模型训练方法、优化方法对应的程序指令/模块。处理器通过运行存储在存储器中的非暂态软件程序、指令以及模块，从而执行处理器的各种功能应用以及数据处理，即实现上述方法实施例中的方法。As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs, non-transitory computer-executable programs and modules, such as the medical OCR data optimization model training method and the optimization method in the embodiments of the present application corresponding to program instructions/modules. The processor executes various functional applications and data processing of the processor by running the non-transitory software programs, instructions and modules stored in the memory, that is, to implement the methods in the above method embodiments.

存储器可以包括存储程序区和存储数据区，其中，存储程序区可存储操作系统、至少一个功能所需要的应用程序；存储数据区可存储处理器所创建的数据等。此外，存储器可以包括高速随机存取存储器，还可以包括非暂态存储器，例如至少一个磁盘存储器件、闪存器件、或其他非暂态固态存储器件。在一些实施例中，存储器可选包括相对于处理器远程设置的存储器，这些远程存储器可以通过网络连接至处理器。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required by at least one function; the storage data area may store data created by the processor, and the like. Additionally, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory may optionally include memory located remotely from the processor, such remote memory being connectable to the processor via a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

实施例六Embodiment 6

本发明又一方面提供了一种计算机可读存储介质，所述计算机可读存储介质存储有多条指令，所述多条指令可被处理器读取并执行实施例一所述的医疗OCR数据优化模型训练方法，或者执行实施例二所述的医疗OCR数据优化方法。该计算机可读存储介质可以是有形存储介质，诸如随机存储器（RAM）、内存、只读存储器（ROM）、电可编程ROM、电可擦除可编程ROM、寄存器、软盘、硬盘、可移动存储盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质。Another aspect of the present invention provides a computer-readable storage medium, where the computer-readable storage medium stores a plurality of instructions, and the plurality of instructions can be read by a processor to execute the medical OCR data described in the first embodiment Optimize the model training method, or execute the medical OCR data optimization method described in Embodiment 2. The computer-readable storage medium may be a tangible storage medium such as random access memory (RAM), memory, read only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, floppy disk, hard disk, removable storage disk, CD-ROM, or any other form of storage medium known in the art.

尽管已描述了本发明的优选实施例，但本领域内的技术人员一旦得知了基本创造性概念，则可对这些实施例作出另外的变更和修改。所以，所附权利要求意欲解释为包括优选实施例以及落入本发明范围的所有变更和修改。显然，本领域的技术人员可以对本发明进行各种改动和变型而不脱离本发明的精神和范围。这样，倘若本发明的这些修改和变型属于本发明权利要求及其等同技术的范围之内，则本发明也意图包含这些改动和变型在内。Although the preferred embodiments of the present invention have been described, additional changes and modifications to these embodiments may occur to those skilled in the art once the basic inventive concepts are known. Therefore, the appended claims are intended to be construed to include the preferred embodiment and all changes and modifications that fall within the scope of the present invention. It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit and scope of the invention. Thus, provided that these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include these modifications and variations.

Claims

1. a medical OCR optimization model training method, is characterized in that, comprises:

acquiring large-scale unlabeled medical text data, and identifying medical terms and characters in the large-scale unlabeled medical text data to form a training set;

Performing pre-training processing on the training set to obtain a pre-training data set for training the medical OCR optimization model, and using the pre-training data set to train the medical OCR optimization model;

Wherein, the pre-training process includes:

performing data augmentation processing on the low-frequency terms and low-frequency characters in the training set;

Randomly replace the first target character in the training set with an incorrect character;

occluding the second target character in the training set; and

The training set is divided into multiple text paragraphs to obtain a pre-training data set for training the medical OCR optimization model.

2. The method according to claim 1, wherein before the data augmentation processing is performed on the low-frequency terms and low-frequency characters in the training set, the method further comprises:

The frequency of each medical term and character in the training set identified is counted, and the low-frequency term and low-frequency character in the training set are determined according to the corresponding low-frequency threshold.

3. The method according to claim 1, characterized in that, after said forming a training set, further comprising:

Representation learning of medical terms is performed on the training set using the medical knowledge graph, and mapping is performed in the representation space.

4. The method according to claim 1, wherein the randomly replacing the first target character in the training set with an error character further comprises:

First target characters are screened from medical terms and characters in the training set, wherein the first target characters include characters contained in a glyph similarity dictionary and/or commonly used medical characters.

5. The method according to claim 1 or 4, wherein the training of the medical OCR optimization model using the pre-training data set further comprises:

After the first target characters have been randomly replaced with wrong characters, the current training set is taken as the first data set, the wrong characters in the first data set are iteratively extracted according to the current context, and predictions that match the wrong characters are performed. The first target character corresponding to the character is used to train the character error correction capability of the medical OCR optimization model.

6. The method according to claim 1, wherein the training of the medical OCR optimization model using the pre-training data set further comprises:

After the second target character has been occluded, using the current training set as the second data set, iteratively predicts the second target character corresponding to the occluded position in the second data set according to the current context to train the The ability of the medical OCR optimization model to identify occluded content.

7. The method according to claim 1, wherein the training of the medical OCR optimization model using the pre-training data set further comprises:

Iteratively predicts end-of-paragraph sentences in the pre-trained dataset based on the current context to train the medical OCR optimization model's ability to automatically segment.

8. A method for optimizing medical OCR data, comprising:

Obtain the target medical image, perform OCR recognition on the target medical image, and obtain the text data to be optimized;

Inputting the text data to be optimized into a medical OCR optimization model, so that the medical OCR optimization model outputs medical terms and character recognition results corresponding to the text data to be optimized;

Wherein, the medical OCR optimization model is obtained in advance based on the medical OCR optimization model training method described in any one of claims 1 to 7.

9. An electronic device, comprising a processor and a memory, wherein the memory stores a plurality of instructions, and the processor is configured to read the instructions and execute the medical treatment according to any one of claims 1 to 7. The OCR data optimization model training method, or the medical OCR data optimization method as claimed in claim 8 is implemented.

10. A computer-readable storage medium, wherein the computer-readable storage medium stores a plurality of instructions, and the plurality of instructions can be read and executed by a processor according to any one of claims 1 to 7 The medical OCR data optimization model training method, or the medical OCR data optimization method according to claim 8 is executed.