CN114419632A - OCR training sample generation method, device and system - Google Patents

OCR training sample generation method, device and system

Info

Publication number
CN114419632A
CN114419632A
Authority
CN
China
Prior art keywords
area
text
character
length
random text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111646988.5A
Other languages
Chinese (zh)
Inventor
沈达伟
王勇
朱军民
康铁钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yidao Boshi Technology Co ltd
Original Assignee
Beijing Yidao Boshi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yidao Boshi Technology Co ltd
Priority to CN202111646988.5A
Publication of CN114419632A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/20 Image enhancement or restoration using local operators
    • G06T5/30 Erosion or dilatation, e.g. thinning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/12 Edge-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/90 Determination of colour characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Character Input (AREA)

Abstract

The invention discloses an OCR training sample generation method, device and system, and relates to the field of computer vision. The method comprises the following steps: a character outline extraction step of extracting all character outlines from the original image, determining an erasure area mask in combination with the erasure area coordinates, and obtaining a repair area mask; an image inpainting step of repairing and filling the image according to the repair area mask and the pixel information around the repair area, to obtain a background template with the characters erased; and a random text generation step of generating random text in each generation area, thereby obtaining a new sample picture and a corresponding annotation file. By combining character contour extraction, image inpainting and related techniques, the method makes full use of the background information of the original picture to generate high-quality training pictures, and simultaneously produces the annotation file (containing character content and position information) for each picture, so that tedious labeling work is avoided and the generated samples can be used directly for OCR model training.

Description

OCR training sample generation method, device and system
Technical Field
The invention relates to the field of computer vision, in particular to an OCR training sample generation method, device and system.
Background
OCR (Optical Character Recognition) tasks arise widely in everyday life and in business scenarios. The current state of the art for this task is to use deep learning for character localization and recognition. In real scenarios, however, training samples for a given document format are often scarce, while deep learning depends on large amounts of training data. Methods for automatically generating OCR training samples have therefore emerged.
Current methods of generating OCR samples fall broadly into two categories:
The first category is template-based methods: a clean template picture is taken as input, and random characters are written at specified positions of the template to generate OCR samples with that template's layout. This approach has two major drawbacks: first, in most cases the template picture cannot be obtained at all; second, the template picture is usually an ideal image, clean and free of deformation, which differs greatly from real pictures contaminated by various kinds of noise, so the generated samples have poor realism.
The second category is image editing techniques based on deep learning. This type of method has four major drawbacks. First, model generalization is poor: because OCR pictures of different formats differ greatly in image and character characteristics, such models often perform badly on picture formats or character styles not seen during training. Second, training data is hard to obtain: given the limited generalization ability, training an image editing model for a particular format first requires a large amount of training data in a similar format, yet the scenario needs image editing precisely because such data is lacking. Third, the samples used to train an image editing model require pixel-level labels, and such labeling is very expensive and time-consuming. Fourth, model inference is relatively slow and usually requires a deep learning inference framework, which greatly increases engineering complexity.
Disclosure of Invention
The invention relates to an OCR training sample generation method. Aiming at the sample shortage frequently encountered when training OCR models, the applicant creatively combines character contour extraction, image inpainting and related techniques to make full use of the background information of the original picture, generating high-quality training pictures together with the corresponding annotation files (containing character content and position information), so that tedious labeling work is avoided and the samples can be used directly for OCR model training. The method is based on traditional computer vision and operates directly on real OCR pictures, thereby overcoming the drawbacks described above.
According to a first aspect of the present invention, there is provided an OCR training sample generation method, wherein the input information comprises an original image and erasure area coordinates, the method comprising:
a character outline extraction step of extracting all character outlines from the original image, determining an erasure area mask (mask) in combination with the erasure area coordinates, and obtaining a repair area mask;
an image inpainting step of repairing and filling the image according to the repair area mask and the pixel information around the repair area, to obtain a background template with the characters erased; and
a random text generation step of generating random text in each generation area, thereby obtaining a new sample picture and a corresponding annotation file.
Further, the character outline extraction step specifically comprises:
converting the input original image into a single-channel grayscale image, and then applying adaptive binarization to the grayscale image to obtain the all-character-outline mask, in which character areas have value 1 and background areas have value 0;
constructing an erasure area mask from the erasure area coordinates, in which the erasure area has value 1 and all other areas have value 0;
multiplying the all-character-outline mask pixel-wise with the erasure area mask to obtain the erasure area character outline mask; and
performing morphological dilation on the erasure area character outline mask to obtain the repair area mask.
Further, the kernel of the morphological dilation is 2 or 3.
Further, the image inpainting step specifically comprises:
determining the region to be repaired in the original image according to the repair area mask;
polling each pixel of the region to be repaired in order from the outside inward, computing the value with which each repair point should be filled from the known pixels around it, whereupon that point itself becomes a known pixel;
computing the value of the next pixel further inward; and
iterating step by step, gradually shrinking the region to be repaired until it is fully repaired, to obtain the repaired background template with the characters erased.
Further, the outside-in ordering algorithm is the Fast Marching Method.
Further, methods for computing the value with which a repair point should be filled from the known pixels around it include a weighted average of the known pixel values in its neighborhood, and the INPAINT_NS and INPAINT_TELEA methods.
Here, INPAINT_NS (an inpainting method based on the Navier-Stokes equations) and INPAINT_TELEA (a fast marching method based on image gradients, also called the TELEA method) are two common image inpainting algorithms.
Further, a generation area is a position area in the character-erased background template where sample text is to be generated.
Further, the random text generation step specifically comprises:
determining the expected length w of the random text for a given generation area, setting the font size to s, and estimating the number of characters of the random text as n = int(w/s);
generating a redundant random text of n × k characters, where the redundancy factor k is a positive integer;
determining the finally generated random text and its actual length L according to the relation between the length of the redundant random text and the expected length w;
randomly determining the position of the generated text within the generation area, writing the finally generated random text and determining its annotation information; and
polling each generation area in this way to obtain a new sample picture and its corresponding annotation file.
Further, determining the expected length w of the random text for the generation area specifically comprises:
taking the larger of the generation area's length and width as the maximum length of the generated text and the smaller as the minimum length, and randomly selecting an integer between the minimum and the maximum as the expected length of the random text, denoted w.
Further, in the step of generating the redundant random text of n × k characters, if a corpus type is specified, the redundant random text of n × k characters is generated from that corpus type; if no corpus type is specified, the n × k characters are generated at random.
Further, the methods for generating a specified corpus comprise:
regular expression generation, used for corpora with definite rules: the required corpus rules are written as a regular expression, and character strings are then generated at random from that regular expression; or
random database retrieval, used for corpora without definite rules or of a certain fixed type: all contents of the corpus are collected from public information and stored item by item in a database or data file, and one item is drawn at random at generation time.
Further, k takes the value 3.
Further, determining the finally generated random text and its actual length L according to the relation between the redundant random text length and the expected length w specifically comprises:
starting from the first character of the redundant random text, measuring the actual length of each character and accumulating these lengths in turn until the total length just satisfies the condition that adding one more character would exceed w; taking the accumulated characters as the finally generated random text, denoted "text", and recording its actual length L.
Further, randomly determining the position of the generated text within the generation area, writing the finally generated random text and determining its annotation information specifically comprises:
given the upper-left corner coordinates (x1, y1) and lower-right corner coordinates (x2, y2) of the generation area, the x-axis coordinate range of the upper-left starting point of the finally generated random text is [x1, (x2-w)] and the y-axis coordinate range is [y1, (y2-s)]; an integer point (x, y) is randomly selected within this range as the starting position of the finally generated random text; and
the finally generated random text is written onto the background template with (x, y) as the upper-left starting position, its annotation information being: coordinates [(x, y), (x+L, y), (x+L, y+s), (x, y+s)]; text content "text".
Further, after the step of randomly determining the position of the generated text within the generation area, writing the finally generated random text and determining its annotation information, the method further comprises adjusting the font size and color of the finally generated random text.
Further, adjusting the font size of the finally generated random text specifically comprises: determining the width h of the erasure area, and setting the size of the finally generated random text to h by default.
Further, adjusting the font color of the finally generated random text specifically comprises:
for a given generation area, selecting the corresponding area from the repair area mask (mask), and performing skeleton extraction on the character outline of that area to form a character skeleton area;
extracting the area corresponding to the generation area from the original image and, for each of its three RGB channels, averaging all pixel values over the character skeleton area, to obtain the per-channel color values of the character skeleton area; and
taking this color value as the color of the generated characters, so that the font color of the finally generated random text is approximately consistent with the color of the original characters.
Here, commonly used skeleton extraction algorithms include the K3M algorithm and the Zhang-Suen algorithm.
According to a second aspect of the present invention, there is provided an OCR training sample generation apparatus operating according to the method provided in any one of the preceding aspects, the apparatus comprising:
a character outline extraction module for extracting all character outlines from the original image, determining an erasure area mask in combination with the erasure area coordinates, and obtaining a repair area mask;
an image inpainting module for repairing and filling the image according to the repair area mask and the pixel information around the repair area, to obtain a background template with the characters erased; and
a random text generation module for generating random text in each generation area, thereby obtaining a new sample picture and a corresponding annotation file.
According to a third aspect of the present invention, there is provided an OCR training sample generating system, the system comprising: a processor and a memory for storing executable instructions; wherein the processor is configured to execute the executable instructions to perform the OCR training sample generation method of any of the above aspects.
According to a fourth aspect of the present invention, there is provided a computer-readable storage medium, characterized in that a computer program is stored thereon, which, when executed by a processor, implements the OCR training sample generating method according to any one of the above aspects.
The invention has the following beneficial effects:
1. The disclosed OCR training sample generation method can solve the sample shortage encountered when training OCR models with deep learning;
2. The dilated character outline is innovatively taken as the image region to be repaired, and the characters are erased with an image inpainting algorithm. This not only ensures that character traces are erased cleanly, but also preserves as much of the surrounding pixel information of the character area as possible during inpainting, yielding the best repair effect;
3. An optional adaptive character size module is innovatively provided, which can adaptively reproduce the different character sizes at each position of the real sample, so that the generated picture is closer to the real sample in character size;
4. An optional adaptive character color module is innovatively provided, which can adaptively reproduce the different character colors at each position of the real sample, so that the generated picture is closer to the real sample in character color;
5. An annotation file containing the character position coordinates and text content required for OCR training is generated at the same time, avoiding tedious and labor-intensive ground-truth labeling work.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 illustrates an example of an original image according to an embodiment of the present invention.
FIG. 2 illustrates an example of extracted text outlines according to an embodiment of the present invention.
FIG. 3 illustrates an example of a generated background template according to an embodiment of the present invention.
FIG. 4 shows an example of a new sample and a new text box generated according to an embodiment of the present invention.
FIG. 5 shows an example of the result of adding an adaptive text size module according to an embodiment of the present invention.
Fig. 6 shows an example of the result of adding an adaptive text color module according to an embodiment of the present invention.
FIG. 7 shows an example of the result of the re-adding corpus configuration module according to an embodiment of the present invention.
FIG. 8 shows a block flow diagram according to an embodiment of the invention.
FIG. 9 shows a flow chart of a text outline extraction module algorithm according to an embodiment of the invention.
FIG. 10 shows an image inpainting module algorithm flow diagram according to an embodiment of the invention.
FIG. 11 shows a flow diagram of a random generation module algorithm according to an embodiment of the invention.
FIG. 12 shows a flow diagram of an adaptive text size module algorithm according to an embodiment of the invention.
FIG. 13 shows a flow diagram of an adaptive text color module algorithm according to an embodiment of the invention.
FIG. 14 shows the character outline of a certain generation area according to an embodiment of the invention.
FIG. 15 shows the skeleton of that character outline according to an embodiment of the invention.
FIG. 16 shows the original image of the generation area according to an embodiment of the present invention.
FIG. 17 shows original image 1 according to an embodiment of the invention.
FIG. 18 shows generated image 1 according to an embodiment of the invention.
FIG. 19 shows original image 2 according to an embodiment of the present invention.
FIG. 20 shows generated image 2 according to an embodiment of the invention.
FIG. 21 shows original image 3 according to an embodiment of the present invention.
FIG. 22 shows generated image 3 according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terms "first," "second," and the like in the description and in the claims of the present disclosure are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
"A plurality" means two or more.
The term "and/or" as used in this disclosure merely describes an association between objects and indicates that three relationships may exist. For example, "A and/or B" may mean that A exists alone, that A and B exist simultaneously, or that B exists alone.
The invention provides a method for generating OCR training samples, which mainly involves image character contour extraction technology, image skeleton extraction technology, and image inpainting and filling technology:
Character contour extraction technology:
Character contour extraction methods comprise traditional computer-vision-based methods and deep-learning-based methods. The traditional methods are mainly the various binarization methods, including fixed-threshold binarization, OTSU binarization, adaptive binarization and the like. Their advantages are simple, fast and highly interpretable algorithms; their disadvantages are that the extracted content tends to include noise, is easily affected by brightness, and has poor robustness.
The deep-learning-based methods are mainly image segmentation algorithms. Their advantages are strong noise resistance, insensitivity to brightness, and strong robustness. Their disadvantages are that labeling character contour training samples is very laborious, that deep learning requires model training, and that inference is slow. Consequently, character contour extraction based on deep learning is extremely rare in industrial scenarios.
Image skeleton extraction technology:
Image skeleton extraction reduces the objects in a binarized image to a representation one pixel wide. It works by iterating over the image repeatedly, removing pixels on object boundaries without changing the connectivity of the image, until no more pixels can be deleted. The invention uses this technique to extract the skeleton of the characters in the image.
Image inpainting and filling technology:
Image restoration and filling techniques fall into three major categories: image filling, image inpainting, and texture synthesis. Image filling is generally applied when the area to be filled is large; the general idea is to set a fill size per iteration, find the region that best matches the pixels around the area to be filled, copy its pixels directly into that area, and iterate until the whole area is filled. Image inpainting is generally applied when the area to be filled is small; the general idea is to poll each pixel of the area from the outside inward and compute the value to be filled from the known pixels around it, until all pixels are filled. Texture synthesis is generally applied to images with prominent texture features, such as cells or brick walls, where the content to be filled is grown from the texture of a known region; OCR sample images do not fit the texture synthesis setting. In the experimental stage the applicant tested the image filling and image inpainting methods in detail and found that image inpainting works best in this scenario.
Examples
The invention relates to an OCR training sample generation method that can adaptively generate a large number of high-quality OCR samples, so as to solve the shortage of OCR training samples.
Suppose we want to use Fig. 1 as the original image and generate a large number of samples with this layout. For convenience of illustration, the overlaid text boxes serve both as the "erasure areas" whose characters we want to erase and as the "generation areas" where we want to generate new text (the erasure areas and generation areas need not coincide).
First, the character outline extraction module extracts the character outlines in all "erasure areas" of the whole picture, as shown in Fig. 2.
Second, the image inpainting module treats the character outlines (the white parts in Fig. 2) as the damaged parts of the original image and repairs them from the surrounding pixel information, yielding a background template with the characters erased, as shown in Fig. 3.
Third, the random generation module generates random text of a specified style in each specified generation area and records the coordinates and content of the generated text as annotation information, as shown in Fig. 4.
The overall flow is shown in Fig. 8 below; the algorithm flow inside each module is given in that module's description.
Character outline extraction module
As shown in Fig. 9, the inputs of the character outline extraction module are the original image and the "erasure area" coordinates. The input picture is converted into a single-channel grayscale image, which is then adaptively binarized to give a mask of pixel values 0 and 1, i.e. the character outline mask, denoted mask1; areas with value 1 are character areas and areas with value 0 are background. From all the input "erasure area" coordinates we likewise construct a mask in which the erasure areas have value 1 and all other areas value 0, denoted mask2. Multiplying mask1 pixel-wise with mask2, positions where mask2 is 0 give 0, and positions where mask2 is 1 keep the original value of mask1. This yields the "erasure area character outline" mask.
Since the character outline extracted by binarization may not be complete at the edges, and traces of the character edges would then survive the repair, we morphologically dilate the "erasure area character outline" mask obtained in the previous step (with a dilation kernel of 2 or 3) to ensure that the outline fully covers the characters in the original image, obtaining the repair area mask, as shown in Fig. 2.
If we do not want to erase any characters on the picture, we can input empty "erasure area coordinates". The constructed mask2 is then all 0, its product with mask1 is all 0, and the result of the morphological dilation is all 0, so the "repair area" mask produced by this module is all 0, i.e. no region needs repair.
If we want to erase all the characters on the picture, then besides inputting the coordinates of all characters as erasure areas, we can simply ignore mask2, use mask1 directly as the "erasure area character outline" mask, and morphologically dilate it to obtain the repair area mask.
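The following is a minimal sketch of this module in Python with OpenCV. The function name, the box format of erase_boxes, and the binarization parameters (block size 25, constant 10) are illustrative assumptions, not values fixed by the method:

    import cv2
    import numpy as np

    def extract_repair_mask(original_img, erase_boxes, dilate_kernel=3):
        # Convert the input picture to a single-channel grayscale image.
        gray = cv2.cvtColor(original_img, cv2.COLOR_BGR2GRAY)
        # Adaptive binarization; THRESH_BINARY_INV marks dark characters as foreground.
        binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                       cv2.THRESH_BINARY_INV, 25, 10)
        mask1 = (binary > 0).astype(np.uint8)   # character areas 1, background 0

        # mask2: erasure areas 1, all other areas 0.
        mask2 = np.zeros_like(mask1)
        for (x1, y1, x2, y2) in erase_boxes:
            mask2[y1:y2, x1:x2] = 1

        # The pixel-wise product keeps only character outlines inside erasure areas.
        erased_outline = mask1 * mask2

        # Morphological dilation (kernel 2 or 3) so the mask fully covers the strokes.
        kernel = np.ones((dilate_kernel, dilate_kernel), np.uint8)
        return cv2.dilate(erased_outline, kernel)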
Image inpainting module
The image inpainting module erases the characters that need to be erased and produces the background template. The repair area mask from the previous module tells this module exactly which region needs repair. Each pixel of the region to be repaired is polled in order from the outside inward (the outside-in ordering algorithm can be chosen freely, e.g. the Fast Marching Method), and the value with which the repair point should be filled is computed from the known pixel information around it. The outside-in order, i.e. filling the pixels at the periphery of the region before the pixels inside it, is chosen because pixels at the periphery have more known pixels in their neighborhood and hence richer known pixel information, so the fill value computed from the surrounding pixels is more accurate and smooth. Once a pixel to be repaired has been filled, it is treated as a known pixel and provides reference information when the remaining pixels around it are computed. The specific method for computing a pixel's value from the surrounding pixel information can also be chosen freely, e.g. a weighted average of the known pixel values in the neighborhood, or the INPAINT_NS and INPAINT_TELEA methods, which balance speed and quality.
By iterating in this way, the region to be repaired shrinks step by step until the whole region is repaired, giving the repaired background template picture, as shown in Fig. 3.
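OpenCV's cv2.inpaint implements exactly this outside-in filling; a minimal sketch follows, where the inpaint radius of 3 is an assumed value:

    import cv2

    def erase_text(original_img, repair_mask):
        # repair_mask: uint8 mask whose non-zero pixels mark the region to repair.
        # cv2.INPAINT_TELEA is the Fast-Marching-based method; cv2.INPAINT_NS is
        # the Navier-Stokes-based alternative.
        return cv2.inpaint(original_img, repair_mask * 255,
                           inpaintRadius=3, flags=cv2.INPAINT_TELEA)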
Random generation module
The random generation module generates random text and annotation files at specified or random positions on the background template. By setting the number of generations, the module can quickly produce a specified number of new samples. The user may specify the font size and color of the generated text; otherwise the configured defaults are used.
For each "generation area", the larger of its length and width is taken as the maximum length of the generated text and the smaller as the minimum length; an integer is then randomly selected between the minimum and the maximum as the expected length of this generated text, denoted w. This makes the length of the generated text random while guaranteeing that it cannot exceed the generation area. Denoting the configured font size by s, the number of characters of the text segment is estimated as n = int(w/s). If a corpus type is specified, a redundant text of n × k characters is generated from that corpus type; otherwise the n × k characters are generated at random. The estimate n assumes that each character is exactly as wide as it is tall, but the aspect ratio of real characters is not strictly 1:1, so n is enlarged by a factor of k to ensure that enough redundant characters are generated; usually k = 3. Then, starting from the first character of the redundant text, the actual length of each character is measured and accumulated in turn, until the total length just satisfies the condition that adding one more character would exceed w. The accumulated characters are taken as the finally generated text, denoted text, and their actual length at this point is recorded as L.
Next we realize random positioning of the generated text within the generation area. To guarantee that the generated text does not exceed the generation area, let the upper-left corner of the generation area be (x1, y1) and its lower-right corner (x2, y2); the x-axis coordinate range of the text's upper-left starting point is then [x1, (x2-w)] and the y-axis range is [y1, (y2-s)]. An integer point (x, y) is randomly selected within this range as the starting position of the generated text, realizing the random-position function.
Then the "text" string is written on the picture in the specified color and size, starting at (x, y). The annotation information of this text line is thereby also obtained: coordinates [(x, y), (x+L, y), (x+L, y+s), (x, y+s)], text content "text". After polling every "generation area" with this procedure, a new picture and its corresponding ground-truth annotation file are obtained.
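A minimal sketch of this module using Pillow follows; the character pool, the font_path argument and measuring character widths with font.getlength are implementation assumptions:

    import random
    import string
    from PIL import ImageDraw, ImageFont

    def generate_line(background, gen_box, font_path, s, k=3):
        x1, y1, x2, y2 = gen_box
        # Expected length w: a random integer between the box's two side lengths.
        w = random.randint(min(x2 - x1, y2 - y1), max(x2 - x1, y2 - y1))
        n = max(1, int(w / s))
        redundant = ''.join(random.choices(string.ascii_letters + string.digits,
                                           k=n * k))
        font = ImageFont.truetype(font_path, s)

        # Accumulate characters until adding one more would exceed w.
        text, L = '', 0.0
        for ch in redundant:
            cw = font.getlength(ch)
            if L + cw > w:
                break
            text, L = text + ch, L + cw
        L = int(L)

        # A random upper-left start point keeps the text inside the generation area.
        x = random.randint(x1, max(x1, x2 - w))
        y = random.randint(y1, max(y1, y2 - s))
        ImageDraw.Draw(background).text((x, y), text, font=font, fill=(0, 0, 0))

        # Annotation: the four corners of the text box plus the text content.
        return {'coords': [(x, y), (x + L, y), (x + L, y + s), (x, y + s)],
                'text': text}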
At this point, OCR training sample generation is complete; the effect is shown in Fig. 4.
Adaptive text size module
The adaptive character size module is an optional module that adaptively adjusts the font size of each generated text line, so that the characters generated in each area are approximately the same size as the original characters at that position.
To preserve as much background information as possible, the designated erasure area should fit tightly around its characters, so the width of the erasure area equals the size of the characters inside it, denoted h. Exploiting this property, the font size of the generated text is set to h by default, and the characters generated in the area are approximately the same size as the original characters at that position.
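A minimal sketch of this rule, assuming the erasure box is a tightly fitting axis-aligned rectangle around a horizontal text line:

    def adaptive_font_size(erase_box):
        x1, y1, x2, y2 = erase_box
        # The narrow side of a tight box around a text line is the character size.
        h = min(x2 - x1, y2 - y1)
        return h   # used as the default font size of the generated text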
This module has three advantages:
1. The characters generated in each area stay approximately the same size as the original characters at that position, reproducing the different font characteristics at different positions of the original image as faithfully as possible;
2. The tedious step of manually setting a fixed font size for each text generation area is eliminated;
3. The difficulty of controlling a manually chosen typeface size is avoided.
The sample generation effect after adding the module is shown in fig. 5.
Adaptive text color module
The adaptive character color module is an optional module that adaptively adjusts the font color of each generated text line, so that the characters generated in each area are approximately the same color as the original characters.
For each generation area, we select the corresponding area from the output of the character outline extraction module, as in Fig. 14, and extract the skeleton of the character outline of that area, as shown in Fig. 15. The purpose of the skeleton extraction is to filter out background pixels that may be included at the boundary of the character outline; their color often differs greatly from the character color and would introduce error. Then, for the three RGB channels of the original image of the generation area (shown in Fig. 16), taking the R channel as an example, all pixel values of the R channel within the character skeleton area are averaged to give the R-channel color value of the area's character skeleton. Averaging each channel in this way gives the RGB three-channel mean color of the characters in the current area. Using this color value as the color of the generated characters keeps the generated characters in each area approximately the same color as the original characters.
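A minimal sketch of this module follows. It assumes the skeleton is computed with cv2.ximgproc.thinning (the Zhang-Suen variant), which requires the opencv-contrib-python package; the argument names are illustrative:

    import cv2
    import numpy as np

    def adaptive_font_color(original_img, repair_mask, gen_box):
        x1, y1, x2, y2 = gen_box
        # Character outline of this area (0/255) and its one-pixel-wide skeleton.
        outline = repair_mask[y1:y2, x1:x2] * 255
        skeleton = cv2.ximgproc.thinning(
            outline, thinningType=cv2.ximgproc.THINNING_ZHANGSUEN)
        ys, xs = np.nonzero(skeleton)
        if xs.size == 0:
            return (0, 0, 0)   # no character pixels found; fall back to black
        # Per-channel mean over the skeleton pixels of the original image region.
        color = original_img[y1:y2, x1:x2][ys, xs].mean(axis=0)
        return tuple(int(c) for c in color)   # (B, G, R) in OpenCV convention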
This module has three advantages:
1. The characters generated in each area stay approximately the same color as the original characters at that position, reproducing the color characteristics of the different fonts at different positions of the original image as faithfully as possible;
2. The tedious step of manually setting a fixed font color for each text generation area is eliminated;
3. The difficulty of controlling the color pixel values when setting font colors manually is avoided.
The sample generation effect after adding the module is shown in fig. 6.
Corpus configuration module
The corpus configuration module is an optional module that configures the corpus type of each generated text line, so that the generated text in each area matches the semantics of the original text. This module is embedded in the random generation module, as shown in Fig. 11.
This module uses two methods to generate the specified corpus: 1. regular expression generation; 2. random database retrieval.
The regular expression generation method is mainly used for corpora with clear rules, such as mobile phone numbers and identity card numbers. Such corpora have fixed character lengths and encoding rules; the required rules are written as a regular expression, and character strings are then generated at random from that expression.
The random database retrieval method is mainly used for corpora without definite rules, or of a certain fixed type, such as company names, bank names, country names, provinces, cities and languages. For such corpora we first collect all the contents from public information and store the data item by item in a database or data file; at generation time one item is drawn from it at random.
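A minimal sketch of both generation methods under two assumptions: the rule-based path is shown with a hand-rolled digit pattern standing in for a full regular-expression generator, and the fixed corpora are assumed to live in a SQLite table named corpus with a value column:

    import random
    import sqlite3

    def random_from_rule():
        # e.g. an 11-digit mobile-style number: a leading '1' plus 10 random digits.
        return '1' + ''.join(random.choice('0123456789') for _ in range(10))

    def random_from_database(db_path):
        # Draw one stored corpus item uniformly at random.
        conn = sqlite3.connect(db_path)
        row = conn.execute(
            "SELECT value FROM corpus ORDER BY RANDOM() LIMIT 1").fetchone()
        conn.close()
        return row[0] if row else ''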
The sample generation effect after adding this module is shown in Fig. 7. Figures 17-22 show the effect of using the present method to generate virtual samples for OCR samples of different styles.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the above implementation method can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation method. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (17)

1. An OCR training sample generation method, characterized in that the input information comprises an original image and erasure area coordinates, the method comprising:
a character outline extraction step of extracting all character outlines from the original image, determining an erasure area mask in combination with the erasure area coordinates, and obtaining a repair area mask;
an image inpainting step of repairing and filling the image according to the repair area mask and the pixel information around the repair area, to obtain a background template with the characters erased; and
a random text generation step of generating random text in each generation area, thereby obtaining a new sample picture and a corresponding annotation file.
2. The OCR training sample generation method according to claim 1, characterized in that the character outline extraction step specifically comprises:
converting the input original image into a single-channel grayscale image, and then applying adaptive binarization to the grayscale image to obtain the all-character-outline mask, in which character areas have value 1 and background areas have value 0;
constructing an erasure area mask from the erasure area coordinates, in which the erasure area has value 1 and all other areas have value 0;
multiplying the all-character-outline mask pixel-wise with the erasure area mask to obtain the erasure area character outline mask; and
performing morphological dilation on the erasure area character outline mask to obtain the repair area mask.
3. The OCR training sample generation method according to claim 1, characterized in that the image inpainting step specifically comprises:
determining the region to be repaired in the original image according to the repair area mask;
polling each pixel of the region to be repaired in order from the outside inward, computing the value with which each repair point should be filled from the known pixels around it, whereupon that point becomes a known pixel;
computing the value of the next pixel further inward; and
iterating step by step, gradually shrinking the region to be repaired until it is fully repaired, to obtain the repaired background template with the characters erased.
4. The OCR training sample generation method according to claim 3, characterized in that the outside-in ordering algorithm is a fast marching algorithm.
5. The OCR training sample generation method according to claim 3, characterized in that the methods for computing the value with which a repair point should be filled from the known pixels around it comprise: a weighted average of the known pixel values in the neighborhood, or the INPAINT_NS or INPAINT_TELEA method.
6. The OCR training sample generation method according to claim 1, characterized in that the random text generation step specifically comprises:
determining the expected length w of the random text for a given generation area, setting the font size to s, and estimating the number of characters of the random text as n = int(w/s);
generating a redundant random text of n × k characters, where the redundancy factor k is a positive integer;
determining the finally generated random text and its actual length L according to the relation between the length of the redundant random text and the expected length w;
randomly determining the position of the generated text within the generation area, writing the finally generated random text and determining its annotation information; and
polling each generation area to obtain a new sample picture and a corresponding annotation file.
7. The OCR training sample generation method according to claim 6, characterized in that determining the expected length w of the random text for the generation area specifically comprises:
taking the larger of the generation area's length and width as the maximum length of the generated text and the smaller as the minimum length, and randomly selecting an integer between the minimum and the maximum as the expected length of the random text, denoted w.
8. The OCR training sample generation method according to claim 6, characterized in that in the step of generating the redundant random text of n × k characters, if a corpus type is specified, the redundant random text of n × k characters is generated from the specified corpus type; if no corpus type is specified, the n × k characters are generated at random.
9. The OCR training sample generation method according to claim 8, characterized in that generating the redundant random text of n × k characters from the specified corpus type specifically comprises:
regular expression generation, used for corpora with definite rules: the required corpus rules are written as a regular expression, and the redundant random text of n × k characters is then generated at random from that regular expression; or
random database retrieval, used for corpora without definite rules or of a certain fixed type: all contents of the corpus are collected from public information and stored item by item in a database or data file, and the redundant random text of n × k characters is drawn from it at random at generation time.
10. The OCR training sample generation method according to claim 6, characterized in that determining the finally generated random text and its actual length L according to the relation between the redundant random text length and the expected length w specifically comprises:
starting from the first character of the redundant random text, measuring the actual length of each character and accumulating these lengths in turn until the total length just satisfies the condition that adding one more character would exceed w; taking the accumulated characters as the finally generated random text, denoted "text", and recording its actual length L.
11. The OCR training sample generation method according to claim 6, characterized in that randomly determining the position of the generated text within the generation area, writing the finally generated random text and determining its annotation information specifically comprises:
given the upper-left corner coordinates (x1, y1) and lower-right corner coordinates (x2, y2) of the generation area, the x-axis coordinate range of the upper-left starting point of the finally generated random text is [x1, (x2-w)] and the y-axis coordinate range is [y1, (y2-s)]; an integer point (x, y) is randomly selected within this range as the starting position of the finally generated random text; and
the finally generated random text is written onto the background template with (x, y) as the upper-left starting position, its annotation information being: coordinates [(x, y), (x+L, y), (x+L, y+s), (x, y+s)]; text content "text".
12. The OCR training sample generation method according to claim 6, characterized in that after the step of randomly determining the position of the generated text within the generation area, writing the finally generated random text and determining its annotation information, the method further comprises adjusting the font size and color of the finally generated random text.
13. The OCR training sample generation method according to claim 12, characterized in that adjusting the font size of the finally generated random text specifically comprises: determining the width h of the erasure area, and setting the size of the finally generated random text to h by default.
14. The OCR training sample generation method according to claim 12, characterized in that adjusting the font color of the finally generated random text specifically comprises:
for a given generation area, selecting the corresponding area from the repair area mask, and performing skeleton extraction on the character outline of that area to form a character skeleton area;
extracting the area corresponding to the generation area from the original image and, for each of its three RGB channels, averaging all pixel values over the character skeleton area, to obtain the per-channel color values of the character skeleton area; and
taking this color value as the color of the generated characters, so that the font color of the finally generated random text is approximately consistent with the color of the original characters.
15. An OCR training sample generation apparatus, operating according to the method of any one of claims 1 to 14, the apparatus comprising:
a character outline extraction module for extracting all character outlines from the original image, determining an erasure area mask in combination with the erasure area coordinates, and obtaining a repair area mask;
an image inpainting module for repairing and filling the image according to the repair area mask and the pixel information around the repair area, to obtain a background template with the characters erased; and
a random text generation module for generating random text in each generation area, thereby obtaining a new sample picture and a corresponding annotation file.
16. An OCR training sample generation system, the system comprising: a processor and a memory for storing executable instructions; wherein the processor is configured to execute the executable instructions to perform the OCR training sample generation method of any one of claims 1 to 14.
17. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, realizes the OCR training sample generation method according to any one of claims 1 to 14.
CN202111646988.5A 2021-12-29 2021-12-29 OCR training sample generation method, device and system Pending CN114419632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111646988.5A CN114419632A (en) 2021-12-29 2021-12-29 OCR training sample generation method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111646988.5A CN114419632A (en) 2021-12-29 2021-12-29 OCR training sample generation method, device and system

Publications (1)

Publication Number Publication Date
CN114419632A true CN114419632A (en) 2022-04-29

Family

ID=81269457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111646988.5A Pending CN114419632A (en) 2021-12-29 2021-12-29 OCR training sample generation method, device and system

Country Status (1)

Country Link
CN (1) CN114419632A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926839A (en) * 2022-07-22 2022-08-19 富璟科技(深圳)有限公司 Image identification method based on RPA and AI and electronic equipment
CN117253233A (en) * 2023-09-05 2023-12-19 广东奥普特科技股份有限公司 Character erasing method, device and equipment
CN117253233B (en) * 2023-09-05 2024-05-17 广东奥普特科技股份有限公司 Character erasing method, device and equipment
CN118379221A (en) * 2024-02-02 2024-07-23 北京东方瑞丰航空技术有限公司 Method, system, electronic equipment and medium for repairing satellite picture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination