CN114419632A - A method, device and system for generating OCR training samples - Google Patents


Info

Publication number
CN114419632A
Authority
CN
China
Prior art keywords
text
area
generated
random
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111646988.5A
Other languages
Chinese (zh)
Inventor
沈达伟
王勇
朱军民
康铁钢
Current Assignee
Beijing Yidao Boshi Technology Co ltd
Original Assignee
Beijing Yidao Boshi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yidao Boshi Technology Co ltd
Priority to CN202111646988.5A
Publication of CN114419632A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/20 Image enhancement or restoration using local operators
    • G06T 5/30 Erosion or dilatation, e.g. thinning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/12 Edge-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/90 Determination of colour characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Character Input (AREA)

Abstract

The invention discloses a method, device and system for generating OCR training samples, relating to the field of computer vision. The method comprises: a text contour extraction step of extracting all text contours from the original image, determining an erasure area mask from the erasure area coordinates, and obtaining a repair area mask; an image inpainting step of inpainting the image according to the repair area mask and the pixel information around the repair area, to obtain a background template with the text erased; and a random text generation step of generating random text in each generation area, thereby obtaining a new sample image and a corresponding annotation information file. By combining a text contour extraction algorithm with image inpainting and related techniques, the method makes full use of the background information of the original image to generate high-quality training images, and simultaneously produces an annotation file (containing text content and position information) for each image, so that tedious annotation work is avoided and the output can be used directly for OCR model training.

Description

A method, device and system for generating OCR training samples

Technical Field

The present invention relates to the field of computer vision, and in particular to a method, device and system for generating OCR training samples.

Background Art

OCR (Optical Character Recognition) tasks are widespread in everyday life and business scenarios. The best-performing current approach is to use deep learning for text localization and recognition. In real scenarios, however, training samples for a particular layout are often scarce, while deep learning depends on large numbers of training samples. Methods for automatically generating OCR training samples have therefore emerged.

Current methods for generating OCR samples fall roughly into two categories:

The first category is template-based methods: a clean template image is provided, and random text is written at specified positions on the template to produce OCR samples in that template's style. This approach has two main drawbacks. First, in most cases a template image is not available. Second, template images tend to be clean, undistorted, idealized images, quite unlike real images that contain many kinds of noise, so the generated samples lack realism.

The second category is deep-learning-based image editing. These methods have four main drawbacks. First, the models generalize poorly: because image features and text features differ greatly across OCR layouts, a model usually performs badly on layouts or text styles it did not see during training. Second, training data are hard to obtain: given the limited generalization, training an image editing model for a given layout first requires a large amount of data similar to that layout, yet image editing is needed precisely because such data are lacking. Third, the samples for training an image editing model require pixel-level annotation, which is very expensive and time-consuming. Fourth, model inference is relatively slow and usually requires a deep learning inference framework, which greatly increases engineering complexity.

Summary of the Invention

The present invention relates to a method for generating OCR training samples. To address the sample scarcity commonly encountered when training OCR models, the applicant innovatively combines a text contour extraction algorithm with image inpainting and related techniques, making full use of the background information of the original image to generate high-quality training images, and simultaneously generating an annotation file (containing text content and position information) for each image. This eliminates tedious, labor-intensive annotation work, and the output can be used directly for OCR model training. Because the invention is based on traditional computer vision methods and operates directly on real original OCR images, it overcomes the drawbacks described above.

According to a first aspect of the present invention, a method for generating OCR training samples is provided, whose input is an original image and the coordinates of the erasure areas. The method comprises:

a text contour extraction step: extracting all text contours from the original image, determining an erasure area mask from the erasure area coordinates, and obtaining a repair area mask;

an image inpainting step: inpainting the image according to the repair area mask and the pixel information around the repair area, to obtain a background template with the text erased;

a random text generation step: generating random text in each generation area, thereby obtaining a new sample image and a corresponding annotation information file.

Further, the text contour extraction step specifically comprises:

converting the input original image to a single-channel grayscale image and applying adaptive binarization to obtain a mask of all text contours, in which text pixels have the value 1 and background pixels have the value 0;

obtaining the erasure area mask from the erasure area coordinates, in which erasure area pixels have the value 1 and all other pixels have the value 0;

multiplying the all-text-contour mask element-wise with the erasure area mask to obtain the erasure area text contour mask;

applying morphological dilation to the erasure area text contour mask, thereby obtaining the repair area mask.

Further, the dilation kernel of the morphological dilation has size 2 or 3.

Further, the image inpainting step specifically comprises:

determining the area to be repaired in the original image according to the repair area mask;

traversing each pixel of the area to be repaired in order from the outside inward, and computing the value to fill in at a given repair point from the known pixels around it, whereupon that point becomes a known pixel;

computing the value of the next pixel further inward;

iterating step by step, so that the area to be repaired gradually shrinks until all of it has been repaired, yielding the repaired background template with the text erased.

Further, the outside-in ordering algorithm is the Fast Marching Method.

Further, methods for computing the fill value of a repair point from the known pixels around it include a weighted average of known neighboring pixel values, the INPAINT_NS method, or the INPAINT_TELEA method.

Here, INPAINT_NS (a Navier-Stokes-based inpainting method) and INPAINT_TELEA (a fast marching method based on image gradients, also known as the Telea method) are two common image inpainting algorithms.

Further, a generation area is a location area in the text-erased background template where sample text is to be generated.

Further, the random text generation step specifically comprises:

for a given generation area, determining the expected random text length w for that area, setting the font size to s, and estimating the character count of the random text as n = int(w / s);

generating redundant random text of n*k characters, where the redundancy multiplier k is a positive integer;

determining the final random text and its actual length L from the relationship between the redundant random text length and the expected length w;

randomly determining the position of the generated text within the generation area, writing the final random text, and determining its annotation information;

traversing every generation area, thereby obtaining a new sample image and a corresponding annotation information file.

Further, determining the expected random text length w for the generation area specifically comprises:

taking the larger of the generation area's length and width as the maximum length of the generated text and the smaller as the minimum length, and randomly selecting an integer between the minimum and the maximum as the expected random text length, denoted w.

Further, in the step of generating redundant random text of n*k characters: if a corpus type is specified, redundant random text of n*k characters is generated from that corpus type; if no corpus type is specified, redundant random text of n*k characters is generated at random.

Further, methods for generating a specified corpus include:

regular expression generation, used for corpora with well-defined rules: the required rule features of the corpus are encoded as a regular expression, and strings are then generated at random according to the regular expression; or

random retrieval from a database, used for corpora whose rules are unclear or that are of some fixed type: all content of the corpus is obtained from public sources and stored entry by entry in a database or data file, and one entry is drawn from it at random at generation time.
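As an illustration of the regular-expression route, the sketch below hand-rolls a generator for a small regex subset (`\d`, character classes, `{n}` repetition, literals); a real pipeline would typically use a full-featured generator library instead, and the invoice-number pattern is a made-up example:

```python
import random
import re

def expand_class(body):
    """Expand a character-class body like 'A-Z0-9' into explicit characters."""
    chars, k = [], 0
    while k < len(body):
        if k + 2 < len(body) and body[k + 1] == '-':
            chars.extend(chr(c) for c in range(ord(body[k]), ord(body[k + 2]) + 1))
            k += 3
        else:
            chars.append(body[k])
            k += 1
    return ''.join(chars)

def generate_from_pattern(pattern):
    """Generate one random string matching a restricted regex subset."""
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == '\\' and pattern[i + 1] == 'd':
            charset, i = '0123456789', i + 2
        elif ch == '[':
            j = pattern.index(']', i)
            charset, i = expand_class(pattern[i + 1:j]), j + 1
        else:                               # literal character
            charset, i = ch, i + 1
        reps = 1
        if i < len(pattern) and pattern[i] == '{':
            j = pattern.index('}', i)
            reps, i = int(pattern[i + 1:j]), j + 1
        out.extend(random.choice(charset) for _ in range(reps))
    return ''.join(out)

# Hypothetical corpus rule: invoice-number-like strings,
# two capital letters followed by eight digits.
PATTERN = r"[A-Z]{2}\d{8}"
sample = generate_from_pattern(PATTERN)
```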

Further, k is set to 3.

Further, determining the final random text and its actual length L from the relationship between the redundant random text length and the expected length w specifically comprises:

starting from the first character of the redundant random text, accumulating the actual length of each character in turn until the total length just satisfies the condition that adding one more character would exceed w; the accumulated characters are taken as the final random text, denoted "text", and its actual length L is recorded.

Further, randomly determining the position of the generated text within the generation area, writing the final random text and determining its annotation information specifically comprises:

letting the upper-left corner of the generation area be (x1, y1) and the lower-right corner be (x2, y2), the x coordinate of the upper-left starting point of the final random text ranges over [x1, (x2-w)] and the y coordinate over [y1, (y2-s)]; an integer point (x, y) is randomly selected within this range as the starting position of the final random text;

writing the final random text on the background template with (x, y) as the upper-left starting position of the text; its annotation information is: coordinates [(x,y), (x+L,y), (x+L,y+s), (x,y+s)], text content "text".
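The generation steps above (expected length, redundant text, greedy truncation, random placement, annotation) can be sketched end to end in plain Python. The `char_width` callable is a stand-in for a real font-metrics query, and the corpus is passed in as a simple character pool; both are illustrative assumptions:

```python
import random

def make_text_sample(region, font_size, corpus_chars, k=3,
                     char_width=lambda ch: 1.0):
    """Generate one random text and its annotation for a generation area.

    region: (x1, y1, x2, y2). char_width(ch) returns a character's width
    in units of the font size s (roughly 1.0 for CJK, 0.5 for ASCII).
    """
    x1, y1, x2, y2 = region
    s = font_size
    # Expected length w: a random integer between the smaller and larger
    # of the region's two side lengths.
    lo, hi = sorted((x2 - x1, y2 - y1))
    w = random.randint(lo, hi)
    # Estimated character count n, then k-fold redundant random text.
    n = int(w / s)
    redundant = ''.join(random.choice(corpus_chars) for _ in range(n * k))
    # Greedy truncation: stop just before one more character would exceed w.
    text, length = '', 0.0
    for ch in redundant:
        step = char_width(ch) * s
        if length + step > w:
            break
        text, length = text + ch, length + step
    L = int(length)
    # Random upper-left start point keeping the text inside the region.
    x = random.randint(x1, x2 - w)
    y = random.randint(y1, y2 - s)
    return {"points": [(x, y), (x + L, y), (x + L, y + s), (x, y + s)],
            "text": text}
```

The sketch assumes a wide region (width at least the height), so that w never exceeds the horizontal span available for placement.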

Further, after the step of randomly determining the position of the generated text within the generation area, writing the final random text and determining its annotation information, the method further comprises a step of adjusting the font size and color of the final random text.

Further, adjusting the font size of the final random text specifically comprises: determining the width h of the erasure area, and setting the size of the final random text to h by default.

Further, adjusting the font color of the final random text specifically comprises:

for a given generation area, selecting the corresponding area from the repair area mask and performing skeleton extraction on the text contours of that area to form a text skeleton area;

extracting the area corresponding to the generation area from the original image and, for each of that area's three RGB channels, averaging all pixel values at the text skeleton area of that channel to obtain that channel's color value for the text skeleton area;

using this color value as the color of the generated text, so that the font color of the final random text closely matches the color of the original text.

Here, commonly used skeleton extraction algorithms include the K3M algorithm and the Zhang-Suen algorithm.
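The color-estimation step is simple once a skeleton mask is available. The sketch below assumes the thinning itself (K3M or Zhang-Suen, available e.g. as `skimage.morphology.skeletonize`) has already been applied, and shows only the per-channel averaging:

```python
import numpy as np

def text_color_from_skeleton(region_rgb, skeleton_mask):
    """Estimate the original text color from skeleton pixels.

    region_rgb: H x W x 3 crop of the original image at the generation area.
    skeleton_mask: boolean H x W array, True on the 1-pixel-wide skeleton.
    Averaging over the skeleton rather than the full contour keeps the
    lighter anti-aliased edge pixels from biasing the estimate.
    """
    pixels = region_rgb[skeleton_mask]       # shape (N, 3), one row per pixel
    return tuple(int(v) for v in pixels.mean(axis=0))
```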

According to a second aspect of the present invention, an apparatus for generating OCR training samples is provided. The apparatus operates according to the method provided in any of the preceding aspects and comprises:

a text contour extraction module for extracting all text contours from the original image, determining the erasure area mask from the erasure area coordinates, and obtaining the repair area mask;

an image inpainting module for inpainting the image according to the repair area mask and the pixel information around the repair area, to obtain the background template with the text erased;

a random text generation module for generating random text in each generation area, thereby obtaining a new sample image and a corresponding annotation information file.

According to a third aspect of the present invention, a system for generating OCR training samples is provided, the system comprising a processor and a memory for storing executable instructions, wherein the processor is configured to execute the executable instructions so as to perform the OCR training sample generation method of any of the above aspects.

According to a fourth aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the OCR training sample generation method of any of the above aspects.

Beneficial effects of the present invention:

1. A method for generating OCR training samples is provided, which can be used to solve the problem of sample scarcity when training OCR models with deep learning.

2. The dilated text contours are innovatively used as the "image area to be repaired", and an image inpainting algorithm is used to erase the text. This both ensures that text traces are wiped clean and preserves as much surrounding pixel information as possible for the text area during inpainting, achieving the best repair result.

3. An optional adaptive text size module is innovatively proposed, which can adaptively reproduce the different text sizes at different positions in the real sample, making the generated images closer to real samples in text size.

4. An optional adaptive text color module is innovatively proposed, which can adaptively reproduce the different text colors at different positions in the real sample, making the generated images closer to real samples in text color.

5. Annotation files containing the text position coordinates and text content required for OCR training can be generated at the same time, eliminating tedious and labor-intensive annotation work.

Description of the Drawings

To describe the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from the structures shown in them without creative effort.

FIG. 1 shows an example of an original image according to an embodiment of the present invention.

FIG. 2 shows an example of extracted text contours according to an embodiment of the present invention.

FIG. 3 shows an example of a generated background template according to an embodiment of the present invention.

FIG. 4 shows an example of a generated new sample and new text boxes according to an embodiment of the present invention.

FIG. 5 shows an example of the result of adding the adaptive text size module according to an embodiment of the present invention.

FIG. 6 shows an example of the result of further adding the adaptive text color module according to an embodiment of the present invention.

FIG. 7 shows an example of the result of further adding the corpus configuration module according to an embodiment of the present invention.

FIG. 8 shows a module flowchart according to an embodiment of the present invention.

FIG. 9 shows a flowchart of the text contour extraction module algorithm according to an embodiment of the present invention.

FIG. 10 shows a flowchart of the image inpainting module algorithm according to an embodiment of the present invention.

FIG. 11 shows a flowchart of the random generation module algorithm according to an embodiment of the present invention.

FIG. 12 shows a flowchart of the adaptive text size module algorithm according to an embodiment of the present invention.

FIG. 13 shows a flowchart of the adaptive text color module algorithm according to an embodiment of the present invention.

FIG. 14 shows a schematic diagram of the text contours of a generation area according to an embodiment of the present invention.

FIG. 15 shows a schematic diagram of the text contour skeleton according to an embodiment of the present invention.

FIG. 16 shows the original image of that generation area according to an embodiment of the present invention.

FIG. 17 shows a schematic diagram of original image 1 according to an embodiment of the present invention.

FIG. 18 shows a schematic diagram of generated image 1 according to an embodiment of the present invention.

FIG. 19 shows a schematic diagram of original image 2 according to an embodiment of the present invention.

FIG. 20 shows a schematic diagram of generated image 2 according to an embodiment of the present invention.

FIG. 21 shows a schematic diagram of original image 3 according to an embodiment of the present invention.

FIG. 22 shows a schematic diagram of generated image 3 according to an embodiment of the present invention.

The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed Description of the Embodiments

Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the appended claims.

The terms "first", "second" and the like in the description and claims of the present disclosure are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the disclosure described here can, for example, be practiced in orders other than those illustrated or described here.

Furthermore, the terms "comprising" and "having", and any variants thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.

"A plurality" means two or more.

It should be understood that the term "and/or" as used in the present disclosure merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" can mean: A alone, both A and B, or B alone.

The present invention provides a method for generating OCR training samples, mainly involving image text contour extraction, image skeleton extraction, and image inpainting and filling techniques:

Text contour extraction:

Text contour extraction includes methods based on traditional computer vision and methods based on deep learning. The traditional methods are mainly the various binarization methods, including fixed-threshold binarization, OTSU binarization and adaptive binarization. Their advantages are that the algorithms are simple, fast and highly interpretable; their disadvantages are that the extracted content is more prone to contain noise, is easily affected by brightness, and is less robust.

The deep-learning-based methods are mainly image segmentation algorithms. Their advantages are strong noise resistance, insensitivity to brightness, and strong robustness. Their disadvantages are that annotating text contour training samples is very difficult, and that deep learning requires model training and has relatively slow inference. Text contour extraction based on deep learning is therefore extremely rare in industrial scenarios.

Image skeleton extraction:

Image skeleton extraction is used to reduce the objects in a binarized image to a representation one pixel wide. It works by iterating over the image, removing pixels on object boundaries without changing their connectivity, until no more pixels can be removed. The present invention uses this technique to extract the skeleton of the text in an image.

Image inpainting and filling:

Image inpainting and filling techniques fall into three broad categories: image filling, image inpainting and texture synthesis. Image filling is generally applied when the area to be filled is large: the usual approach is to set a patch size for each iteration, find the region that best matches the pixels around the area to be filled, copy its pixels directly into that area, and iterate until the whole area is filled. Image inpainting is generally applied when the area to be filled is small: the usual approach is to traverse each pixel of the area from the outside inward and compute the value of each pixel to be filled, by some algorithm, from the known pixels around it, until all pixels are filled. Texture synthesis is generally applied to images with pronounced texture, such as cells or brick walls, where the content to be filled is a continuation of the texture of a known region; OCR sample images do not fit this characterization. In the experimental stage, the applicant carried out detailed experiments on both image filling and image inpainting, and found that image inpainting works best in this scenario.

实施例Example

本发明涉及一种OCR训练样本生成方法。该方法可自适应生成大量高质量OCR样本,以解决OCR训练样本匮乏的问题。The invention relates to a method for generating OCR training samples. This method can adaptively generate a large number of high-quality OCR samples to solve the problem of lack of OCR training samples.

如果想要以图1为原图，生成该种版式的大量样本，为便于展示，其中后添加的文本框既为我们想要抹除文字的“抹除区域”，又是想要生成文字的“生成区域”（抹除区域与生成区域不必重合）。Suppose we want to use Figure 1 as the original image and generate a large number of samples of this layout. For ease of presentation, the text boxes added to it serve both as the "erasing areas" where we want to erase text and as the "generating areas" where we want to generate text (the erasing areas and generating areas need not coincide).

第一步,进入文字轮廓提取模块,在整张图片上提取所有“抹除区域”内的文字轮廓,如图2所示。The first step is to enter the text outline extraction module, and extract all text outlines in the "erased area" on the entire image, as shown in Figure 2.

第二步，进入图像修复模块，将文字轮廓（图2中的白色部分）作为原图像的损毁部分，根据损毁部分周围的像素信息进行图像的修复填充，得到抹除文字后的背景模板，如图3所示。In the second step, the image inpainting module takes the text outlines (the white parts in Figure 2) as the damaged parts of the original image, repairs and fills them from the pixel information around the damaged parts, and obtains the background template with the text erased, as shown in Figure 3.

第三步,进入随机生成模块,在指定的生成区域内生成指定样式的随机文本,并记录生成文本的坐标和文本内容作为标注信息,如图4所示。The third step is to enter the random generation module, generate random text of the specified style in the specified generation area, and record the coordinates and text content of the generated text as the annotation information, as shown in Figure 4.

整体的模块流程图如下图8所示,每个模块内部的算法流程图见每个模块说明部分。The overall module flow chart is shown in Figure 8 below, and the algorithm flow chart inside each module is shown in the description section of each module.

文字轮廓提取模块Text outline extraction module

如上图9所示，文字轮廓提取模块的输入为原图和“抹除区域”的坐标。对于输入图片，我们先将其转换为单通道灰度图，再将其自适应二值化，得到一张像素值为0,1的掩膜，即文字轮廓掩膜，记为mask1，其中值为1的区域为文字区域，值为0的区域为背景区域。对于输入的所有“抹除区域”坐标，我们也构造一个掩膜，使其抹除区域值为1，其他区域值为0，记为mask2。将mask1与mask2对应位置像素相乘。mask2值为0的部分，相乘结果为0；mask2值为1的部分，相乘结果为mask1的原值。以此可得到“抹除区域文字轮廓”的掩膜。As shown in Figure 9 above, the inputs to the text outline extraction module are the original image and the coordinates of the "erasing areas". The input image is first converted to a single-channel grayscale image and then adaptively binarized, yielding a 0/1 mask — the text outline mask, denoted mask1 — in which areas with value 1 are text and areas with value 0 are background. From all the input "erasing area" coordinates we also construct a mask, mask2, whose value is 1 inside the erasing areas and 0 elsewhere. Multiplying mask1 and mask2 pixel by pixel gives 0 wherever mask2 is 0 and the original mask1 value wherever mask2 is 1, which yields the "erasing-area text outline" mask.

由于二值化提取的文字轮廓边缘可能不全，修补后不能完全抹除文字边缘的痕迹，我们对上步得到的“抹除区域文字轮廓”掩膜进行形态学膨胀（膨胀核为2或3），以确保文字轮廓能够完全覆盖原图中的文字，以此得到修复区域掩膜，如图2所示。Since the text outlines extracted by binarization may have incomplete edges, traces of the character edges would otherwise survive the repair; we therefore morphologically dilate the "erasing-area text outline" mask obtained in the previous step (with a dilation kernel of 2 or 3) to ensure the outlines fully cover the text in the original image, yielding the repair area mask, as shown in Figure 2.
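
A minimal NumPy sketch of this masking pipeline follows. A fixed global threshold stands in for the adaptive binarization (an assumption for brevity), and the function name and box format are illustrative.

```python
import numpy as np

def repair_mask(gray, erase_boxes, dilate_iters=1, thresh=128):
    """Build the repair-area mask: text-outline mask (mask1) multiplied by
    the erase-area mask (mask2), then a small morphological dilation."""
    # mask1: 1 on (dark) text pixels -- a global threshold used here as a
    # stand-in for adaptive binarization
    mask1 = (gray < thresh).astype(np.uint8)
    # mask2: 1 inside every erase box (x1, y1, x2, y2), 0 elsewhere
    mask2 = np.zeros_like(mask1)
    for x1, y1, x2, y2 in erase_boxes:
        mask2[y1:y2, x1:x2] = 1
    m = mask1 * mask2  # text outlines restricted to the erase areas
    # 3x3 binary dilation so the outline fully covers the glyph edges
    for _ in range(dilate_iters):
        p = np.pad(m, 1)
        h, w = m.shape
        d = np.zeros_like(m)
        for dr in (0, 1, 2):
            for dc in (0, 1, 2):
                d |= p[dr:dr + h, dc:dc + w]
        m = d
    return m
```

Passing an empty box list reproduces the "nothing to erase" case described below: mask2 is all zeros, so the repair mask comes out all zeros.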

若我们不想抹除图片上的任何文字,则输入空的“抹除区域坐标”即可。此时构造的mask2值全为0,与mask1相乘后结果亦全为0,形态学膨胀后仍全为0,因此该模块得到的“修复区域”掩膜值全为0,也就是无需修复任何区域。If we don't want to erase any text on the picture, just enter an empty "Erase Area Coordinates". At this time, the constructed mask2 value is all 0, and the result of multiplication with mask1 is also all 0, and it is still all 0 after morphological expansion, so the mask value of the "repair area" obtained by this module is all 0, that is, no repair is required. any area.

若我们想要抹除图片上的所有文字，除了输入所有文本的坐标作为抹除区域的方法外，也可以直接忽略mask2，将mask1直接作为“抹除区域文字掩膜”，再将其进行形态学膨胀，得到修复区域掩膜。If we want to erase all the text in the image, then besides entering the coordinates of all the text as erasing areas, we can also simply ignore mask2, use mask1 directly as the "erasing-area text mask", and morphologically dilate it to obtain the repair area mask.

图像修复模块Image inpainting module

图像修复模块的功能是抹除图片上需要抹除的文字，生成一张背景模板。上一模块得到的修复区域掩膜用于告诉该模块具体需要修复的区域。现在我们按从外到内的顺序轮询待修复区域的每个像素点（从外到内的排序算法可自由选择，如快速行进算法Fast Marching Method），根据该像素点周围已知的像素信息，计算该修复点应该填充的像素值。之所以选择从外到内的顺序，即先填充待修补区域外围的像素点再填充内部的像素点，是因为轮廓外围的像素点邻域内的已知像素更多，含有更丰富的已知像素信息，这样基于周围像素点计算的填充值更为准确和平滑。当一个待修复的像素点被填充后，则认为该像素点成为已知像素，当计算其周围待填充像素点时，可将其作为已知像素信息并提供参考值。基于某像素点周围像素信息计算其像素值的具体方法可自由选择，如邻域已知像素值的加权平均、兼顾速度和效果的INPAINT_NS、INPAINT_TELEA等方法。The function of the image inpainting module is to erase the designated text from the image and produce a background template. The repair area mask from the previous module tells this module exactly which region needs repair. We poll each pixel of the region to be repaired in order from the outside in (the outside-in ordering algorithm can be chosen freely, e.g. the Fast Marching Method) and, from the known pixel information around it, compute the value with which that pixel should be filled. We choose the outside-in order — filling the pixels at the rim of the region before the interior ones — because pixels near the contour have more known pixels in their neighborhoods and thus richer known information, so the fill values computed from their surroundings are more accurate and smoother. Once a pixel has been filled, it is treated as known, and it serves as known pixel information and a reference value when the surrounding unfilled pixels are computed. The specific method for computing a pixel's value from its surroundings can also be chosen freely, e.g. a weighted average of the known neighborhood values, or methods such as INPAINT_NS and INPAINT_TELEA that balance speed and quality.

通过以上方法的逐步迭代,待修复区域逐渐收缩变小,直到整个区域都被修复,得到已修复的背景模板图片,如图3所示。Through the step-by-step iteration of the above method, the area to be repaired gradually shrinks and becomes smaller until the entire area is repaired, and a repaired background template image is obtained, as shown in Figure 3.
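
The outside-in fill described above can be sketched with the simplest estimator — a plain mean of the already-known 8-neighbors (the text also permits Fast Marching ordering and the INPAINT_NS/INPAINT_TELEA estimators; this NumPy version is only a minimal instance, and the function name is illustrative):

```python
import numpy as np

def inpaint_mean(img, mask):
    """Fill mask==1 pixels ring by ring from the outside in, each ring
    taking the mean of its already-known 8-neighbors."""
    img = img.astype(float).copy()
    known = mask == 0
    h, w = img.shape
    while not known.all():
        ring = []  # unknown pixels that touch at least one known pixel
        for r, c in np.argwhere(~known):
            vals = [img[rr, cc]
                    for rr in range(max(r - 1, 0), min(r + 2, h))
                    for cc in range(max(c - 1, 0), min(c + 2, w))
                    if (rr, cc) != (r, c) and known[rr, cc]]
            if vals:
                ring.append((r, c, sum(vals) / len(vals)))
        if not ring:  # unknown region with no known border: give up
            break
        for r, c, v in ring:  # commit the whole ring, then shrink inward
            img[r, c] = v
            known[r, c] = True
    return img
```

Each pass fills the outermost ring of the hole and marks it known, so the region to be repaired shrinks inward exactly as the iteration above describes.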

随机生成模块Randomly generated modules

随机生成模块用于在背景模板的指定位置或随机位置生成随机文本和标注文件。通过设置生成数量,该模块用于快速生成指定数量的新样本。该模块需要用户指定生成文本的字体大小和颜色,如用户未指定,则使用设好的默认值。The random generation module is used to generate random text and annotation files at specified or random positions of the background template. By setting the generation number, this module is used to quickly generate a specified number of new samples. This module requires the user to specify the font size and color of the generated text. If the user does not specify it, the set default value will be used.

对于每一个“生成区域”，我们取其长宽的最大值作为生成文本的最大长度，取其长宽的最小值作为生成文本的最小长度，然后在该最小值和最大值之间随机选择一个整数作为本次生成文本的实际长度，记为w。这样既实现了生成文本的长度随机性，又可以保证该文本长度不会超出生成区域。我们记设定的字体大小为s，估算该段文本的文字个数n=int(w/s)。如指定语料类型，我们以该语料类型生成n*k个字符长度的冗余文本，如不指定语料类型，我们就随机生成n*k个字符长度的冗余文本。其中k为冗余倍数，由于上述估计的文字个数是每个字符宽度刚好与字符高度相等的情况，而不同的字符长宽比并不一定是严格的1:1，因此我们将n放大k倍，以确保生成的冗余文本字符的个数够用，通常k取3即可。然后从冗余文本的第一个字符开始，统计每个字符的实际长度并依次累加，直到字符总长度刚好满足“再加一个字符就会超出w”，将其作为最终生成的文本，记为text，并记录其此时的真实长度为L。For each "generating area", we take the larger of its length and width as the maximum length of the generated text and the smaller as the minimum, then randomly choose an integer between them as the actual target length of this text, denoted w. This makes the generated text length random while guaranteeing that the text does not exceed the generating area. With the configured font size denoted s, we estimate the character count of the text as n=int(w/s). If a corpus type is specified, we generate redundant text of n*k characters from that corpus; otherwise we generate n*k random characters. Here k is a redundancy multiple: the estimate above assumes every character is exactly as wide as it is tall, but real character aspect ratios are not strictly 1:1, so we enlarge n by a factor of k to ensure enough redundant characters are generated; k=3 is usually sufficient. Then, starting from the first character of the redundant text, we accumulate the actual width of each character until the total just satisfies "one more character would exceed w"; the accumulated string is the final generated text, denoted text, and its actual length at that point is recorded as L.
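
A Python sketch of this length logic follows. The `char_width` rule — full-width characters as wide as the font size, ASCII half as wide — is an illustrative assumption, not from the patent; the function names are likewise hypothetical.

```python
import random

def char_width(ch, s):
    # assumption for illustration: full-width chars are s wide, ASCII s//2
    return s if ord(ch) > 127 else s // 2

def make_text(region_w, region_h, s, charset, k=3, rng=random):
    # random target length w between the region's two side lengths
    w = rng.randint(min(region_w, region_h), max(region_w, region_h))
    n = int(w / s)  # estimated character count at font size s
    # redundant text of n*k characters (random charset stands in for an
    # optional corpus type)
    redundant = ''.join(rng.choice(charset) for _ in range(n * k))
    text, total = '', 0
    for ch in redundant:
        cw = char_width(ch, s)
        if total + cw > w:  # one more character would exceed w
            break
        text += ch
        total += cw
    return text, total, w  # text, actual length L, target length w
```

The returned `total` plays the role of L in the annotation step, and `w` is the randomly chosen target length that bounds it.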

接下来我们实现文本生成在生成区域的随机位置功能。为保证生成的文本不超出生成区域，设生成区域左上角坐标为(x1,y1)，右下角坐标为(x2,y2)，则文本左上角起始点的x轴坐标范围应为[x1,(x2-w)]，y轴坐标范围应为[y1,(y2-s)]。在该范围内随机选择一个整数点作为生成文本的起始位置，实现位置随机功能。Next we implement random placement of the text within the generating area. To keep the generated text inside the area, let the upper-left corner of the generating area be (x1,y1) and the lower-right corner be (x2,y2); then the x-coordinate of the text's upper-left starting point must lie in [x1,(x2-w)] and the y-coordinate in [y1,(y2-s)]. A random integer point in this range is chosen as the starting position of the generated text, which realizes the random-position function.

然后我们在图片上，以(x,y)为起始位置，写入指定颜色和大小的“text”字符串。该文本行的标注信息亦可得到：坐标为[(x,y),(x+L,y),(x+L,y+s),(x,y+s)]，文本内容为“text”。以此流程轮询完每个“生成区域”，即可得到一张新图片和与之对应的答案标注文件。Then, starting at (x,y), we write the string "text" on the image in the specified color and size. The annotation information of this text line follows directly: the coordinates are [(x,y),(x+L,y),(x+L,y+s),(x,y+s)] and the text content is "text". Polling every "generating area" in this way yields a new image and its corresponding answer annotation file.
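
The placement and annotation rules above can be sketched as follows (coordinates use upper-left (x1,y1) and lower-right (x2,y2); the function name is illustrative):

```python
import random

def place_text(region, w, L, s, rng=random):
    """Pick a random start point so a text line of target width w and font
    size s stays inside region, and return the 4-point annotation box for
    its actual width L."""
    x1, y1, x2, y2 = region
    x = rng.randint(x1, x2 - w)  # horizontal range [x1, x2 - w]
    y = rng.randint(y1, y2 - s)  # vertical range [y1, y2 - s]
    box = [(x, y), (x + L, y), (x + L, y + s), (x, y + s)]
    return (x, y), box
```

After computing (x,y), the string would be drawn at that point in the chosen color and size, and `box` together with the string forms one annotation record; since L ≤ w, the box is guaranteed to stay inside the region.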

至此,已完成OCR训练样本生成,效果如图4所示。So far, the OCR training sample generation has been completed, and the effect is shown in Figure 4.

自适应文字大小模块Adaptive text size module

自适应文字大小模块为可选模块,用来自适应调节生成的每个文本行的文字大小,使得每个区域生成的文字大小和原来该位置的文字大小近似一致。The adaptive text size module is an optional module, used to adaptively adjust the text size of each generated text line, so that the size of the text generated in each area is approximately the same as the size of the original text at that location.

为保留更多的背景信息，我们指定的抹除区域都应紧贴文字，此时抹除区域宽度即为其中字符的大小，记为h。我们利用这种特性，将生成文本字体大小默认设置为h，则该区域生成的文字大小和原来该位置的文字大小近似一致。To retain as much background information as possible, the erasing areas we designate should fit tightly around the text; the width of an erasing area then equals the size of the characters inside it, denoted h. Exploiting this property, we set the default font size of the generated text to h, so that the text generated in each area is approximately the same size as the text originally at that position.

本模块的优点有三个:There are three advantages of this module:

1.可以保持每个区域生成的文字大小和原来该位置的文字大小近似一致,最大化地还原原图不同位置的不同字体特征;1. The size of the text generated in each area can be kept approximately the same as the size of the text in the original position, and the different font features in different positions of the original image can be restored to the maximum extent;

2.省去为每个生成文本区域手动设置固定字体大小的繁琐步骤;2. Eliminate the tedious steps of manually setting a fixed font size for each generated text area;

3.解决手动设置字体，大小难以把控的问题。3. It solves the problem that a manually set font size is hard to get right.

添加该模块后的样本生成效果如图5所示。The sample generation effect after adding this module is shown in Figure 5.

自适应文字颜色模块Adaptive text color module

自适应文字颜色模块为可选模块,用来自适应调节生成的每个文本行的字体颜色,使得每个区域内生成文字的颜色与原文字颜色近似一致。The adaptive text color module is an optional module used to adaptively adjust the font color of each generated text line, so that the color of the generated text in each area is approximately the same as the original text color.

对于每个生成区域，我们从“文字轮廓提取模块”中选择对应的生成区域，如图14。对该区域的文字轮廓进行骨干提取，如图15。提取骨干的目的是为了滤除文字轮廓边界可能包含的一些背景像素，这些背景像素的颜色往往与文字颜色相差很大，会引入误差。然后，对于生成区域原图（如图16）的RGB三个通道，我们以R通道为例，将R通道文字骨干区域处的所有像素值进行求平均，即为该区域文字骨干的R通道颜色值。以此求得每个通道的均值，即为当前区域内文字的RGB三通道颜色均值。将该颜色值作为生成文字的颜色，使得每个区域内生成文字的颜色与原文字颜色近似一致。For each generating area, we select the corresponding region from the output of the text outline extraction module, as shown in Figure 14, and perform skeleton extraction on its text outline, as shown in Figure 15. The purpose of extracting the skeleton is to filter out background pixels that the outline border may include; their colors often differ greatly from the text color and would introduce error. Then, over the three RGB channels of the original image of the generating area (Figure 16) — taking the R channel as an example — we average all pixel values on the text-skeleton region of the R channel to obtain the R-channel color value of the text skeleton in this area. Computing this mean for each channel gives the mean RGB color of the text in the current area. Using this color for the generated text makes the color of the generated text in each area approximately match that of the original text.
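
A NumPy sketch of this per-channel averaging (the function name is illustrative; the skeleton mask would come from the skeleton-extraction step described earlier):

```python
import numpy as np

def mean_text_color(rgb_region, skeleton):
    """Average each RGB channel over the text-skeleton pixels only, so
    background pixels at the glyph edges do not bias the estimate."""
    on = skeleton == 1  # skeleton pixels of the text in this region
    return tuple(float(rgb_region[..., ch][on].mean()) for ch in range(3))
```

The returned triple is then used directly as the font color of the generated text for that region.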

本模块的优点有三个:There are three advantages of this module:

1.可以保持每个区域生成的文字颜色和原来该位置的文字颜色近似一致,最大化地还原原图不同位置的不同字体颜色特征;1. The color of the text generated in each area can be kept approximately the same as the color of the text in the original position, and the color characteristics of different fonts in different positions of the original image can be restored to the maximum extent;

2.省去为每个生成文本区域手动设置固定字体颜色的繁琐步骤;2. Eliminate the tedious steps of manually setting a fixed font color for each generated text area;

3.解决手动设置字体颜色，颜色像素值难以把控的问题。3. It solves the problem that when the font color is set manually, the color's pixel values are hard to get right.

添加该模块后的样本生成效果如图6所示。The sample generation effect after adding this module is shown in Figure 6.

语料配置模块Corpus Configuration Module

语料配置模块为可选模块,用来配置生成的每个文本行的语料类型,使得每个区域内生成文字的语料意义与原文字一致。本模块嵌入在“随机生成模块”中,如图11所示。The corpus configuration module is an optional module, which is used to configure the corpus type of each generated text line, so that the corpus meaning of the generated text in each area is consistent with the original text. This module is embedded in the "Random Generation Module", as shown in Figure 11.

本模块使用两种方法生成指定语料:1.正则表达式生成;2.数据库随机获取。This module uses two methods to generate the specified corpus: 1. Regular expression generation; 2. Random acquisition from database.

正则表达式生成的方法,主要用来生成规则比较明确的语料,如手机号码、身份证号码等。这类语料有固定的字符长度和编码规则,我们将需要的语料规则特征编为正则表达式,然后根据正则表达式随机生成字符串。The method of regular expression generation is mainly used to generate corpus with clear rules, such as mobile phone numbers, ID numbers, etc. This kind of corpus has fixed character length and encoding rules. We encode the required corpus rule features into regular expressions, and then randomly generate strings according to the regular expressions.
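
As a sketch of the regular-expression approach: for a fixed pattern such as a mainland mobile number — taken here only as an illustrative rule of "11 digits, leading 1, second digit 3-9", not as the patent's exact rule — strings satisfying the regex can be generated directly (libraries that expand arbitrary regexes exist, but a hand-rolled generator suffices for fixed patterns):

```python
import random
import re

def random_mobile(rng=random):
    # illustrative rule: '1', then a digit in 3-9, then 9 more digits
    return ('1' + rng.choice('3456789')
            + ''.join(rng.choice('0123456789') for _ in range(9)))

# the regex that the generator is built to satisfy
PATTERN = re.compile(r'1[3-9]\d{9}')
```

Each generated string matches `PATTERN`, so the corpus rule and the generator stay in sync.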

数据库随机获取的方法，主要用来生成规则不明确或某种固定型语料，如公司名称、银行名称、国家名称、省、市、语言等。针对这类语料，我们首先要通过公开信息获得该语料的所有内容，并将数据按条目存入数据库或数据文件中，在随机生成的时候从中随机获取一条即可。The method of random database retrieval is mainly used to generate corpora whose rules are unclear or that are of some fixed type, such as company names, bank names, country names, provinces, cities, languages, and so on. For this kind of corpus, we first collect all of its content from public information and store the data entry by entry in a database or data file; at generation time, one entry is simply fetched from it at random.
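
A minimal sqlite3 sketch of the entry-per-row storage and random retrieval (the table name, column name, and sample entries are illustrative):

```python
import sqlite3

# store the corpus entry by entry, here in an in-memory database
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE corpus (entry TEXT)')
banks = ['中国银行', '工商银行', '建设银行']
conn.executemany('INSERT INTO corpus (entry) VALUES (?)',
                 [(b,) for b in banks])

# fetch one entry at random each time a sample needs this corpus type
row = conn.execute(
    'SELECT entry FROM corpus ORDER BY RANDOM() LIMIT 1').fetchone()
```

`ORDER BY RANDOM() LIMIT 1` is SQLite's idiom for a single random row; a flat data file with `random.choice` over its lines would serve equally well.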

添加该模块后的样本生成效果如图7所示。图17-22为使用本方法为不同样式的OCR样本生成虚拟样本的效果图。The sample generation effect after adding this module is shown in Figure 7. Figures 17-22 are the effect diagrams of using this method to generate virtual samples for OCR samples of different styles.

需要说明的是，在本文中，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。It should be noted that, herein, the terms "comprise", "include", or any variation thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or device that includes it.

上述本发明实施例序号仅仅为了描述,不代表实施例的优劣。The above-mentioned serial numbers of the embodiments of the present invention are only for description, and do not represent the advantages or disadvantages of the embodiments.

通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质（如ROM/RAM、磁碟、光盘）中，包括若干指令用以使得一台终端（可以是手机，计算机，服务器，空调器，或者网络设备等）执行本发明各个实施例所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that the methods above can be implemented by software plus the necessary general-purpose hardware platform, or of course by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied as a software product stored on a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and including several instructions that cause a terminal (which may be a mobile phone, computer, server, air conditioner, network device, etc.) to execute the methods described in the embodiments of the present invention.

上面结合附图对本发明的实施例进行了描述，但是本发明并不局限于上述的具体实施方式，上述的具体实施方式仅仅是示意性的，而不是限制性的，本领域的普通技术人员在本发明的启示下，在不脱离本发明宗旨和权利要求所保护的范围情况下，还可做出很多形式，这些均属于本发明的保护之内。The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the specific embodiments above, which are merely illustrative rather than restrictive. Inspired by the present invention, those of ordinary skill in the art can devise many further forms without departing from its purpose and the scope protected by the claims, all of which fall within the protection of the present invention.

Claims (17)

1.一种OCR训练样本生成方法，其特征在于，输入信息为原始图像以及抹除区域坐标，所述方法包括：文字轮廓提取步骤，基于原始图像提取所有文字轮廓，结合抹除区域坐标确定抹除区域掩膜，并得到修复区域掩膜；图像修复填充步骤，根据修复区域掩膜以及修复区域周围的像素信息进行图像修复填充，得到抹除文字后的背景模板；随机文本生成步骤，在每个生成区域内生成随机文本，由此得到一张新的样本图片和与之对应的标注信息文件。1. An OCR training sample generation method, characterized in that the input information is an original image and erasure area coordinates, the method comprising: a text outline extraction step of extracting all text outlines from the original image, determining an erasure area mask in combination with the erasure area coordinates, and obtaining a repair area mask; an image inpainting and filling step of inpainting the image according to the repair area mask and the pixel information around the repair area, to obtain a background template with the text erased; and a random text generation step of generating random text in each generation area, thereby obtaining a new sample image and its corresponding annotation file.

2.根据权利要求1所述的OCR训练样本生成方法，其特征在于，所述文字轮廓提取步骤具体包括：将输入的原始图像转换为单通道灰度图，再将其自适应二值化，得到所有文字轮廓掩膜，文字区域值为1，背景区域值为0；根据抹除区域坐标，得到抹除区域掩膜，抹除区域值为1，其他区域值为0；将所有文字轮廓掩膜与抹除区域掩膜对应位置像素相乘，得到抹除区域文字轮廓掩膜；对抹除区域文字轮廓掩膜进行形态学膨胀，由此得到修复区域掩膜。2. The method of claim 1, wherein the text outline extraction step specifically comprises: converting the input original image into a single-channel grayscale image and adaptively binarizing it to obtain an all-text outline mask, in which text areas have value 1 and background areas have value 0; obtaining an erasure area mask from the erasure area coordinates, in which the erasure areas have value 1 and other areas have value 0; multiplying the two masks pixel by pixel to obtain the erasure-area text outline mask; and morphologically dilating the erasure-area text outline mask to obtain the repair area mask.

3.根据权利要求1所述的OCR训练样本生成方法，其特征在于，所述图像修复填充步骤具体包括：根据修复区域掩膜，在原始图像中确定待修复区域；从外到内的顺序轮询待修复区域的每个像素点，根据某个像素点周围已知像素的信息，计算该修复点应该填充的像素值，成为已知像素；向内计算下一个像素点的像素值；逐步迭代，待修复区域逐渐收缩变小，直到待修复区域都被修复，得到已修复的抹除文字后的背景模板。3. The method of claim 1, wherein the image inpainting and filling step specifically comprises: determining the area to be repaired in the original image according to the repair area mask; polling each pixel of the area to be repaired from the outside in and computing, from the information of the known pixels around it, the value with which that pixel should be filled, whereupon it becomes a known pixel; computing the value of the next pixel inward; and iterating step by step so that the area to be repaired gradually shrinks until it is fully repaired, yielding the repaired background template with the text erased.

4.根据权利要求3所述的OCR训练样本生成方法，其特征在于，从外到内的排序算法为快速行进算法。4. The method of claim 3, wherein the outside-to-inside ordering algorithm is the Fast Marching Method.

5.根据权利要求3所述的OCR训练样本生成方法，其特征在于，所述根据某个像素点周围已知像素的信息，计算该修复点应该填充的像素值的方法包括：邻域已知像素值加权平均、INPAINT_NS或INPAINT_TELEA方法。5. The method of claim 3, wherein the method for computing the fill value of a pixel from the known pixels around it comprises: a weighted average of known neighborhood pixel values, INPAINT_NS, or INPAINT_TELEA.

6.根据权利要求1所述的OCR训练样本生成方法，其特征在于，所述随机文本生成步骤具体包括：针对某一生成区域，确定针对该生成区域的随机文本预计长度w，设定字体大小为s，估算该段随机文本的文字个数n=int(w/s)；生成n*k个字符长度的冗余随机文本，k为冗余倍数，取值为正整数；根据冗余随机文本长度和随机文本预计长度w的关系，确定最终生成的随机文本及其实际长度L；在生成区域内随机确定该生成文本的位置，写入该最终生成的随机文本并确定其标注信息；轮询每个生成区域，由此得到一张新的样本图片和与之对应的标注信息文件。6. The method of claim 1, wherein the random text generation step specifically comprises: for a given generation area, determining an expected random text length w, setting the font size s, and estimating the number of characters of the random text as n=int(w/s); generating redundant random text of n*k characters, where the redundancy multiple k is a positive integer; determining the finally generated random text and its actual length L from the relationship between the redundant text length and the expected length w; randomly determining the position of the generated text within the generation area, writing the finally generated random text, and determining its annotation information; and polling each generation area, thereby obtaining a new sample image and its corresponding annotation file.

7.根据权利要求6所述的OCR训练样本生成方法，其特征在于，所述确定针对该生成区域的随机文本预计长度w具体包括：取该生成区域的长和宽中的最大值作为生成文本的最大长度，长和宽的最小值作为生成文本的最小长度，在该最小值和最大值之间随机选择一个整数作为随机文本预计长度，记为w。7. The method of claim 6, wherein determining the expected random text length w specifically comprises: taking the larger of the generation area's length and width as the maximum length of the generated text and the smaller as the minimum, and randomly selecting an integer between the minimum and maximum as the expected random text length, denoted w.

8.根据权利要求6所述的OCR训练样本生成方法，其特征在于，所述生成n*k个字符长度的冗余随机文本步骤中，如指定语料类型，则以该指定语料类型生成n*k个字符长度的冗余随机文本；如不指定语料类型，则随机生成n*k个字符长度的冗余随机文本。8. The method of claim 6, wherein, in the step of generating redundant random text of n*k characters, if a corpus type is specified, the redundant random text of n*k characters is generated from that corpus type; if no corpus type is specified, the redundant random text of n*k characters is generated at random.

9.根据权利要求8所述的OCR训练样本生成方法，其特征在于，所述以该指定语料类型生成n*k个字符长度的冗余随机文本具体包括：正则表达式生成：用来生成规则明确的语料，将需要的语料规则特征编为正则表达式，然后根据正则表达式随机生成n*k个字符长度的冗余随机文本；或者数据库随机获取：用来生成规则不明确或某种固定型语料，通过公开信息获得该语料的所有内容，并将数据按条目存入数据库或数据文件中，在随机生成的时候从中随机获取一条n*k个字符长度的冗余随机文本即可。9. The method of claim 8, wherein generating redundant random text of n*k characters of the specified corpus type specifically comprises: regular-expression generation, used for corpora with clear rules, in which the required corpus rule features are encoded as a regular expression and redundant random text of n*k characters is randomly generated from it; or random database retrieval, used for corpora with unclear rules or of a fixed type, in which all content of the corpus is collected from public information and stored entry by entry in a database or data file, from which one piece of redundant random text of n*k characters is randomly retrieved at generation time.

10.根据权利要求6所述的OCR训练样本生成方法，其特征在于，所述根据冗余随机文本长度和随机文本预计长度w的关系，确定最终生成的随机文本及其实际长度L具体包括：从冗余随机文本的第一个字符开始，统计每个字符的实际长度并依次累加，直到字符总长度刚好满足“再加一个字符就会超出w”，将其作为最终生成的随机文本，记为“text”，并记录该最终生成的随机文本的实际长度L。10. The method of claim 6, wherein determining the finally generated random text and its actual length L specifically comprises: starting from the first character of the redundant random text, accumulating the actual length of each character in turn until the total length just satisfies "one more character would exceed w", taking the accumulated string as the finally generated random text, denoted "text", and recording its actual length L.

11.根据权利要求6所述的OCR训练样本生成方法，其特征在于，在生成区域内随机确定该生成文本的位置，写入该最终生成的随机文本并确定其标注信息具体包括：设该生成区域左上角坐标为(x1,y1)，右下角坐标为(x2,y2)，则该最终生成的随机文本左上角起始点的x轴坐标范围为[x1,(x2-w)]，y轴坐标范围为[y1,(y2-s)]，在该范围内随机选择一个整数点(x,y)作为该最终生成的随机文本的起始位置；在背景模板上，以(x,y)为文本左上角的起始位置，写入该最终生成的随机文本，其标注信息为：坐标为[(x,y),(x+L,y),(x+L,y+s),(x,y+s)]；文本内容为“text”。11. The method of claim 6, wherein randomly determining the position of the generated text within the generation area, writing the finally generated random text, and determining its annotation information specifically comprises: letting the upper-left corner of the generation area be (x1,y1) and the lower-right corner be (x2,y2), so that the x-coordinate of the starting point at the upper-left corner of the finally generated random text ranges over [x1,(x2-w)] and the y-coordinate over [y1,(y2-s)]; randomly selecting an integer point (x,y) in this range as the starting position of the finally generated random text; and writing the finally generated random text on the background template with (x,y) as the starting position of its upper-left corner, its annotation information being: coordinates [(x,y),(x+L,y),(x+L,y+s),(x,y+s)] and text content "text".

12.根据权利要求6所述的OCR训练样本生成方法，其特征在于，所述在生成区域内随机确定该生成文本的位置，写入该最终生成的随机文本并确定其标注信息步骤之后，还包括对该最终生成的随机文本的字体大小和颜色进行调节的步骤。12. The method of claim 6, further comprising, after the step of randomly determining the position of the generated text, writing the finally generated random text, and determining its annotation information, a step of adjusting the font size and color of the finally generated random text.

13.根据权利要求12所述的OCR训练样本生成方法，其特征在于，所述对该最终生成的随机文本的字体大小进行调节具体包括：确定抹除区域宽度h，将最终生成的随机文本的大小默认设置为h。13. The method of claim 12, wherein adjusting the font size of the finally generated random text specifically comprises: determining the erasure area width h and setting the size of the finally generated random text to h by default.

14.根据权利要求12所述的OCR训练样本生成方法，其特征在于，所述对该最终生成的随机文本的字体颜色进行调节具体包括：对于某一生成区域，从修复区域掩膜中选择对应的区域，并对该区域的文字轮廓进行骨干提取，形成文字骨干区域；从原始图像中提取该生成区域对应的区域，对于该区域的RGB三个通道，将各个通道文字骨干区域处的所有像素值进行求平均，即为该文字骨干区域的各个通道颜色值；将该颜色值作为生成文字的颜色，使得该最终生成的随机文本的字体颜色与原文字颜色近似一致。14. The method of claim 12, wherein adjusting the font color of the finally generated random text specifically comprises: for a given generation area, selecting the corresponding area from the repair area mask and performing skeleton extraction on its text outline to form a text backbone area; extracting the area corresponding to the generation area from the original image and, for each of its three RGB channels, averaging all pixel values within the text backbone area to obtain that channel's color value of the text backbone area; and using these color values as the color of the generated text, so that the font color of the finally generated random text approximately matches the original text color.

15.一种OCR训练样本生成装置，所述装置基于根据权利要求1至14中任一项所述的方法进行操作，所述装置包括：文字轮廓提取模块，用于基于原始图像提取所有文字轮廓，结合抹除区域坐标确定抹除区域掩膜，并得到修复区域掩膜；图像修复填充模块，用于根据修复区域掩膜以及修复区域周围的像素信息进行图像修复填充，得到抹除文字后的背景模板；随机文本生成模块，用于在每个生成区域内生成随机文本，由此得到一张新的样本图片和与之对应的标注信息文件。15. An OCR training sample generation apparatus operating according to the method of any one of claims 1 to 14, the apparatus comprising: a text outline extraction module for extracting all text outlines from the original image, determining the erasure area mask in combination with the erasure area coordinates, and obtaining the repair area mask; an image inpainting and filling module for inpainting the image according to the repair area mask and the pixel information around the repair area, to obtain the background template with the text erased; and a random text generation module for generating random text in each generation area, thereby obtaining a new sample image and its corresponding annotation file.

16.一种OCR训练样本生成系统，所述系统包括：处理器和用于存储可执行指令的存储器；其中，所述处理器被配置为执行所述可执行指令，以执行根据权利要求1至14中任一项所述的OCR训练样本生成方法。16. An OCR training sample generation system comprising a processor and a memory for storing executable instructions, wherein the processor is configured to execute the executable instructions to perform the OCR training sample generation method of any one of claims 1 to 14.

17.一种计算机可读存储介质，其特征在于，其上存储有计算机程序，所述计算机程序被处理器执行时实现根据权利要求1至14中任一项所述的OCR训练样本生成方法。17. A computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the OCR training sample generation method of any one of claims 1 to 14.
CN202111646988.5A 2021-12-29 2021-12-29 A method, device and system for generating OCR training samples Pending CN114419632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111646988.5A CN114419632A (en) 2021-12-29 2021-12-29 A method, device and system for generating OCR training samples

Publications (1)

Publication Number Publication Date
CN114419632A true CN114419632A (en) 2022-04-29

Family

ID=81269457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111646988.5A Pending CN114419632A (en) 2021-12-29 2021-12-29 A method, device and system for generating OCR training samples

Country Status (1)

Country Link
CN (1) CN114419632A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401365A (en) * 2020-03-17 2020-07-10 海尔优家智能科技(北京)有限公司 OCR image automatic generation method and device
CN111401465A (en) * 2020-03-25 2020-07-10 深圳前海微众银行股份有限公司 Training sample optimization method, device, equipment and storage medium
CN112949754A (en) * 2021-03-29 2021-06-11 中国科学院合肥物质科学研究院 Text recognition data synthesis method based on image fusion
CN113537229A (en) * 2021-08-27 2021-10-22 广州广电运通金融电子股份有限公司 Bill image generation method and device, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034977A (en) * 2022-05-12 2022-09-09 北京控制与电子技术研究所 A Contour-Based Repair Method for Incomplete Text
CN114926839A (en) * 2022-07-22 2022-08-19 富璟科技(深圳)有限公司 Image identification method based on RPA and AI and electronic equipment
CN117253233A (en) * 2023-09-05 2023-12-19 广东奥普特科技股份有限公司 Character erasing method, device and equipment
CN117253233B (en) * 2023-09-05 2024-05-17 广东奥普特科技股份有限公司 Character erasing method, device and equipment
CN118379221A (en) * 2024-02-02 2024-07-23 北京东方瑞丰航空技术有限公司 Method, system, electronic equipment and medium for repairing satellite picture

Similar Documents

Publication Publication Date Title
CN114419632A (en) A method, device and system for generating OCR training samples
CN113673338B (en) Method, system and medium for weakly supervised automatic annotation of character pixels in natural scene text images
JP3822277B2 (en) Character template set learning machine operation method
CN111723585A (en) A style-controllable real-time translation and conversion method for image and text
WO2021169740A1 (en) Image restoration method and apparatus, computer device, and storage medium
CN110516554A (en) A multi-scene multi-font Chinese character detection and recognition method
CN112529989B (en) Picture reconstruction method based on bill template
CN113469148B (en) A text erasing method and model training method, device and storage medium
WO2023019995A1 (en) Training method and apparatus, translation presentation method and apparatus, and electronic device and storage medium
CN115392188A (en) Method and device for generating editable document based on non-editable image-text images
CN111274863A (en) Text prediction method based on text peak probability density
CN117350910A (en) Image watermark protection method based on diffusion image editing model
CN110232724A (en) A kind of Chinese character style image vector representation method
CN115376022A (en) Application of small target detection algorithm based on neural network in unmanned aerial vehicle aerial photography
CN112734778A (en) Vehicle matting method, system, equipment and storage medium based on neural network
CN112837329A (en) A method and system for image binarization of Tibetan ancient book documents
JP7402931B2 (en) METHODS, COMPUTER READABLE PROGRAMS AND SYSTEM
CN117132994B (en) Handwritten character erasing method based on generation countermeasure network
CN116681604B (en) Qin simple text restoration method based on condition generation countermeasure network
CN113744140B (en) Image processing method, device and computer readable storage medium
CN114494528A (en) Automatic high-quality text tampered image generation method, terminal and medium
CN110197522B (en) Image data processing method, device and computer readable storage medium
WO2024197829A1 (en) Single-character detection method and apparatus, model training method and apparatus, and device and medium
CN114463238B (en) Image fusion method, device and storage medium
TWI753332B (en) Method for processing pictures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination