CN113435163A - OCR data generation method for any character combination - Google Patents

OCR data generation method for any character combination

Info

Publication number
CN113435163A
Authority
CN
China
Prior art keywords
character
font
text
picture
width
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110978686.1A
Other languages
Chinese (zh)
Other versions
CN113435163B (en)
Inventor
苗功勋
孙强
陈姝
熊英超
韦文峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhongfu Information Technology Co Ltd
Original Assignee
Nanjing Zhongfu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhongfu Information Technology Co Ltd filed Critical Nanjing Zhongfu Information Technology Co Ltd
Priority to CN202110978686.1A priority Critical patent/CN113435163B/en
Publication of CN113435163A publication Critical patent/CN113435163A/en
Application granted granted Critical
Publication of CN113435163B publication Critical patent/CN113435163B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/109Font handling; Temporal or kinetic typography
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses an OCR data generation method for arbitrary character combinations in the technical field of optical character recognition. The method generates a character-font mapping dictionary from a character dictionary, a font library and a corpus to obtain the correspondence between each character and all fonts that support it; obtains a line of text for the picture to be generated from the corpus, splits the text into several character strings, and finds the font corresponding to each character string; renders each character string with its corresponding font to obtain text pictures; and splices the text pictures to obtain the final picture. The invention improves the step of drawing specified characters on a background picture in conventional OCR data generation and realizes a simple and efficient way of generating OCR data for arbitrary character combinations.

Description

OCR data generation method for any character combination
Technical Field
The invention relates to the technical field of optical character recognition, and in particular to an OCR data generation method for arbitrary character combinations.
Background
Currently, mainstream algorithms in the field of OCR (Optical Character Recognition) fall into two types: two-stage algorithms and end-to-end algorithms. A two-stage algorithm generally consists of a text detection algorithm and a text recognition algorithm; the main idea is to first use the text detection algorithm to obtain the detection boxes of text lines in an image, and then use the text recognition algorithm to recognize the content inside each text box. An end-to-end algorithm completes both text detection and text recognition within a single algorithm. Although end-to-end models are smaller and faster, they are mostly used for OCR in fixed scenes such as bills and bank cards, so two-stage algorithms are generally adopted for scenes with more flexible input.
In a two-stage algorithm, the text detection algorithm is relatively simple to implement and needs less data, whereas the text recognition algorithm is the core OCR module and directly affects the accuracy of the output. Training it often requires hundreds of thousands of samples, and manual labeling at that scale would consume a large amount of manpower, so a data generation method is generally needed to meet the demand for massive amounts of data.
For the text recognition algorithm, the input is a text picture and the output is the text corresponding to that picture. When generating such OCR data, it is therefore necessary to generate the corresponding text picture from the text and to save the text together with the picture. The general generation steps are as follows:
1. Corpora are collected from various channels; they can take many forms, such as articles, dictionaries or phrases.
2. According to the actual language and target requirements, text content is generally produced from the corpus in various ways, such as randomly cutting article fragments of a certain length or randomly combining extracted characters.
3. A transparent picture is generated, text is drawn on it, and the transparent text picture is pasted onto a target background picture; the transparent picture makes it possible to replace the text background and produce a more realistic picture. Some methods do not generate a transparent picture in advance but instead crop the background picture directly and draw the text on the cropped region.
4. Parameters such as the font, font size and font color of the text are set, usually according to actual requirements, so as to be closer to real pictures.
5. The text is drawn on the transparent picture (or background picture) using the chosen font. This step is the core link of text image generation, and the quality of the generated text directly affects the accuracy of the algorithm model.
6. The transparent text picture is transformed, generally by operations such as adding noise, deformation and adding a background picture, to produce the final text picture.
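For concreteness, the sketch below (not part of the patent; the library choice and all helper names are assumptions) shows how steps 3 to 6 might look with Pillow: measure the text, draw it on a transparent layer, and composite the layer onto a random crop of a background picture.

```python
from PIL import Image, ImageDraw, ImageFont
import random

def render_sample(text, font_path, background, font_size=32, color=(20, 20, 20)):
    font = ImageFont.truetype(font_path, font_size)
    # Measure the rendered text (getbbox is available in recent Pillow versions).
    left, top, right, bottom = font.getbbox(text)
    width, height = right - left, bottom - top
    # Step 3: a transparent layer, so the text background can be replaced later.
    layer = Image.new("RGBA", (width, height), (0, 0, 0, 0))
    # Steps 4-5: draw the text on the transparent layer with the chosen font and color.
    ImageDraw.Draw(layer).text((-left, -top), text, font=font, fill=color + (255,))
    # Step 3 (variant): crop a random patch of the background as the final backdrop.
    bg = background.convert("RGBA")
    x = random.randint(0, max(0, bg.width - width))
    y = random.randint(0, max(0, bg.height - height))
    patch = bg.crop((x, y, x + width, y + height))
    patch.alpha_composite(layer)
    # Step 6 (noise, deformation, etc.) is omitted; return the picture and its label.
    return patch.convert("RGB"), text
```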
In existing OCR data generation methods, the above steps all have common implementations. For step 5, however, a single image usually uses only a single font, because fonts are divided by language and the number of characters supported by a single font is limited; unsupported characters are replaced with blanks or placeholders such as "#". The generated data therefore has serious limitations, and the main problems are as follows:
1. Data mixing multiple languages cannot be generated. A single font generally supports only one or two languages, so if a given corpus uses several languages, the picture corresponding to a given text cannot be generated, whereas in actual use multiple languages are often mixed.
2. Data generation for uncommon characters is not supported. Taking Chinese as an example, there are about 3,500 common Chinese characters, but the full Chinese character set is far larger, and a single font sometimes cannot even cover the 3,500 common characters, let alone the uncommon ones.
3. Data generation for special symbols such as "≧", "∑" or "█" is poorly supported: some fonts support these symbols, but many more do not.
In view of the above problems, the solutions currently adopted are as follows:
1. Change the implementation idea of the text detection algorithm. During text detection, languages are detected separately: when different languages appear in the same line, each language is detected with its own detection box. However, this idea requires a fine-grained classification algorithm to distinguish the languages, which is difficult; characters that look similar, such as simplified and traditional Chinese, are often hard to tell apart, and the processing cost increases.
2. Generate data by random character combination. Characters are randomly drawn from the set of characters to be recognized and combined to generate data; when an unsupported character is encountered, it is skipped or the font is changed. This method often leads to data imbalance, with common characters appearing frequently and uncommon characters rarely, and it still cannot generate pictures for a specified character combination.
Based on the above, the invention provides an OCR data generation method for arbitrary character combinations to solve the above problems.
Disclosure of Invention
The present invention aims to provide an OCR data generation method for arbitrary character combinations, so as to solve the problems raised in the background art.
To achieve this purpose, the invention provides the following technical scheme:
An OCR data generation method for arbitrary character combinations, comprising the following steps:
S1: generating a character-font mapping dictionary from the character dictionary, the font library and the corpus, to obtain the correspondence between each character and all fonts that support it;
S2: splitting the corpus: acquiring a line of text of the picture to be generated from the corpus, splitting the text into several character strings, and finding the font corresponding to each character string;
The concrete steps of corpus splitting are as follows:
S21: reading the first character c of the text to be generated;
S22: taking the list s of all fonts corresponding to character c from the character-font mapping dictionary, and returning either null or a font;
S23: according to the return value of S22, either shortening the text and returning to S21, or recording the return value as temp_font, until the first character c with font support is found;
S24: if the text is empty or the font returned for the character is empty, ending all steps; otherwise traversing each character c of the current text;
S25: performing an iteration for each character c in S24;
S26: if temp_font is not empty, obtaining the last text segment temp_text and its corresponding font temp_font, and adding them to the text-font list text_font_list;
S3: generating pictures: rendering each found character string with its corresponding font, arranged in either the horizontal or the vertical text direction, to obtain text pictures; the width of the spliced picture is recorded as final_width and its height as final_height, both initialized to 0;
S4: splicing pictures: splicing the text pictures in the horizontal or vertical direction to obtain the final picture.
Preferably, in S1, the character dictionary contains all characters appearing in the corpus, the font library is the set of all fonts intended to be used and must satisfy that every character in the character dictionary is supported by at least one font, and the corpus is the text content to be generated.
Preferably, in S1, the character-font mapping dictionary is generated as follows:
S11: reading the character dictionary and initializing the character-font mapping dictionary to empty;
S12: traversing all fonts in the font library;
S13: reading all characters supported by each font of S12;
S14: traversing all characters of S13, and if a character supported by the font is in the character dictionary, adding the font object to the font list of that character;
S15: completing the construction of the character-font mapping dictionary, thereby obtaining the correspondence between each character and all fonts that support it.
Preferably, in S22, the concrete steps of taking the list s of all fonts corresponding to character c from the character-font mapping dictionary are as follows:
S221: if character c is not in the character-font mapping dictionary or the list s is empty, returning null and ending S22;
S222: if the list s contains only one font object, returning this font and ending S22;
S223: if the list s contains several font objects, randomly selecting one font object from the list s, returning it, and ending S22.
Preferably, in S23, the concrete steps of finding the first character c with font support are as follows:
S231: if the return value of S22 is null and the current text is not null, the text to be generated becomes the text without its first character, i.e. text = text[1:], and the process returns to S21 until the return value of S22 is not null or the text becomes empty, ending S23;
S232: if the return value of S22 is not null, recording the return value of S22 as temp_font and the list of all characters supported by this font as temp_char_list, ending S23.
Preferably, in S25, the iteration specifically comprises the following steps:
S251: if c is the first character of the text, recording temp_text = c, with the currently used font being the font temp_font corresponding to c, and the iteration ends;
S252: if c is not the first character but character c is in the character list temp_char_list of the current font, meaning the current font continues to support the character, then temp_text += c, and the iteration ends;
S253: if c is not the first character, character c is not in the character list temp_char_list of the current font, and temp_char_list is not empty, meaning the current font only supports the text content before c, then a text segment temp_text and its corresponding font temp_font are obtained and added to text_font_list; temp_font is then set to the return value of S22 for c; if temp_font is null, no font supports the current character and temp_text and temp_char_list are both set to empty; otherwise temp_text = c and temp_char_list is set to the list of all characters supported by temp_font; the iteration ends.
Preferably, in S3, the main steps of generating the text pictures in the horizontal arrangement mode are as follows:
S311: obtaining each split text segment and its corresponding font;
S312: for the text and font of S311, using the font to obtain the picture size required by the text, recording the height as height and the width as width;
S313: generating a transparent picture of the size obtained in S312;
S314: drawing the text content on the transparent picture of S313 using the font;
S315: adding the drawn picture to a result list;
S316: taking the largest picture height as the final height: if height is greater than final_height, then final_height = height, otherwise final_height remains unchanged;
S317: the final picture width is the sum of the widths of all pictures, so final_width += width.
Preferably, in S3, the main steps of generating the text pictures in the vertical arrangement mode are as follows:
S321: obtaining each split text segment and its corresponding font;
S322: traversing each character c of the text, and using the font to obtain the picture size required by character c, recording the height as height and the width as width;
S323: generating a transparent picture of the size obtained in S322;
S324: drawing the character c on the transparent picture of S323 using the font;
S325: adding the drawn picture to a result list;
S326: taking the largest picture width as the final width: if width is greater than final_width, then final_width = width, otherwise final_width remains unchanged;
S327: the final picture height is the sum of the heights of all pictures, so final_height += height.
Preferably, in S4, splicing the pictures in the horizontal direction comprises the following main steps:
S411: generating a transparent picture with width final_width and height final_height;
S412: initializing the pasting position x = 0, y = 0;
S413: acquiring the size of a picture, with its width recorded as width and its height as height;
S414: pasting the generated picture at coordinates (x, y) of the transparent picture;
S415: x += width.
Preferably, in S4, splicing the pictures in the vertical direction comprises the following main steps:
S421: generating a transparent picture with width final_width and height final_height;
S422: initializing the pasting position x = 0, y = 0;
S423: acquiring the size of a picture, with its width recorded as width and its height as height;
S424: pasting the generated picture at coordinates (x, y) of the transparent picture;
S425: y += height.
Compared with the prior art, the invention has the following beneficial effects:
The invention improves the step of drawing specified characters on a background picture in conventional OCR data generation. Instead of simply drawing the picture with a single font or directly replacing unsupported characters, it splits the corpus to find the text parts that can be drawn with the different supported fonts, thereby realizing OCR data generation for arbitrary character combinations in a simple and efficient way and laying a solid foundation for constructing good OCR data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a technical solution:
An OCR data generation method for arbitrary character combinations comprises the following steps:
S1: generating a character-font mapping dictionary from the character dictionary, the font library and the corpus, to obtain the correspondence between each character and all fonts that support it;
S2: splitting the corpus: acquiring a line of text of the picture to be generated from the corpus, splitting the text into several character strings, and finding the font corresponding to each character string;
The concrete steps of corpus splitting are as follows:
S21: reading the first character c of the text to be generated;
S22: taking the list s of all fonts corresponding to character c from the character-font mapping dictionary, and returning either null or a font;
S23: according to the return value of S22, either shortening the text and returning to S21, or recording the return value as temp_font, until the first character c with font support is found;
S24: if the text is empty or the font returned for the character is empty, ending all steps; otherwise traversing each character c of the current text;
S25: performing an iteration for each character c in S24;
S26: if temp_font is not empty, obtaining the last text segment temp_text and its corresponding font temp_font, and adding them to the text-font list text_font_list;
S3: generating pictures: rendering each found character string with its corresponding font, arranged in either the horizontal or the vertical text direction, to obtain text pictures; the width of the spliced picture is recorded as final_width and its height as final_height, both initialized to 0;
S4: splicing pictures: splicing the text pictures in the horizontal or vertical direction to obtain the final picture.
The method mainly improves the step of drawing specified characters on a background picture during OCR data generation; the other links of OCR data generation are therefore not described here, and in actual generation the characters can be drawn on a transparent picture or on a background picture according to the user's choice.
The method comprises four modules: character-font mapping dictionary generation, corpus splitting, picture generation and picture splicing; the flow chart is shown in FIG. 1. Before generating data, a character dictionary, a font library and a corpus are prepared. The character dictionary contains all characters appearing in the corpus; the font library is the set of all fonts intended to be used and must satisfy that every character in the character dictionary is supported by at least one font, otherwise some characters cannot be generated. The corpus is the text content to be generated; its form is not limited, and the way text content is produced from it can be chosen according to requirements. If no corpus is given and random character combinations are used to generate data, the corpus can simply be the character dictionary.
Character-font mapping dictionary generation
The character-font mapping dictionary records the correspondence between characters and fonts: its keys are all characters in the character dictionary, and its values are the sets of fonts supporting each character. The character-font mapping dictionary is generated as follows:
S11: reading the character dictionary and initializing the character-font mapping dictionary to empty;
S12: traversing all fonts in the font library;
S13: reading all characters supported by each font of S12;
S14: traversing all characters of S13, and if a character supported by the font is in the character dictionary, adding the font object to the font list of that character;
S15: completing the construction of the character-font mapping dictionary, thereby obtaining the correspondence between each character and all fonts that support it.
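As an illustration only (the patent does not prescribe an implementation, so the library choice and all names below are assumptions), the character-font mapping dictionary of S11-S15 could be built with fontTools, whose cmap table lists the code points each font supports:

```python
from fontTools.ttLib import TTFont

def build_char_font_map(char_dict, font_paths):
    # S11: initialize the mapping with an empty font list for every character.
    char_font_map = {c: [] for c in char_dict}
    for path in font_paths:                        # S12: traverse the font library
        cmap = TTFont(path).getBestCmap()          # S13: code points this font supports
        supported = {chr(cp) for cp in cmap}
        for c in supported & set(char_dict):       # S14: keep only dictionary characters
            char_font_map[c].append(path)
    return char_font_map                           # S15: character -> supporting fonts
```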
Corpus splitting
A line of text for the picture to be generated is obtained from the corpus. Because it cannot be determined in advance whether one font supports every character of the text, the text needs to be split to find how many consecutive characters can use the same font; that is, the text is split into several character strings and the font corresponding to each character string is found. If one font supports all characters of the line, the text does not need to be split.
S21: reading the first character c of the text to be generated;
S22: the concrete steps of taking the list s of all fonts corresponding to character c from the character-font mapping dictionary are as follows:
S221: if character c is not in the character-font mapping dictionary or the list s is empty, returning null and ending S22;
S222: if the list s contains only one font object, returning this font and ending S22;
S223: if the list s contains several font objects, randomly selecting one font object from the list s, returning it, and ending S22;
S23: finding the first character c with font support, as follows:
S231: if the return value of S22 is null and the current text is not null, the text to be generated becomes the text without its first character, i.e. text = text[1:], and the process returns to S21 until the return value of S22 is not null or the text becomes empty, ending S23;
S232: if the return value of S22 is not null, recording the return value of S22 as temp_font and the list of all characters supported by this font as temp_char_list, ending S23;
S24: if the text is empty or the font returned for the character is empty, no font supports any character of the text to be generated and all steps end; if the font temp_font returned for the first character is not empty, each character c of the current text is traversed;
S25: for each character c in S24, the following iteration is performed:
S251: if c is the first character of the text, recording temp_text = c, with the currently used font being the font temp_font corresponding to c, and the iteration ends;
S252: if c is not the first character but character c is in the character list temp_char_list of the current font, meaning the current font continues to support the character, then temp_text += c, and the iteration ends;
S253: if c is not the first character, character c is not in the character list temp_char_list of the current font, and temp_char_list is not empty, meaning the current font only supports the text content before c, then a text segment temp_text and its corresponding font temp_font are obtained and both are added to the text-font list (denoted text_font_list). temp_font is then set to the return value of S22 for c. If temp_font is empty, no font supports the current character, and temp_text and temp_char_list are both set to empty; otherwise temp_text = c and temp_char_list is set to the list of all characters supported by temp_font. The iteration ends;
S26: if temp_font is not empty at this point, the last text segment temp_text and its corresponding font temp_font are obtained and added to the text-font list text_font_list.
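The sketch below condenses S21-S26 into a single loop; it is an assumed implementation, not the patent's code. It walks the text, extends the current segment while the current font supports the next character, and otherwise closes the segment and picks a new font at random; characters supported by no font are simply dropped.

```python
import random

def split_text_by_font(text, char_font_map):
    def pick_font(c):                               # S22: a random supporting font, or None
        fonts = char_font_map.get(c, [])
        return random.choice(fonts) if fonts else None

    def chars_of(font):                             # temp_char_list: characters this font supports
        return {c for c, fonts in char_font_map.items() if font in fonts}

    text_font_list = []                             # the resulting (segment, font) pairs
    temp_text, temp_font, temp_chars = "", None, set()
    for c in text:                                  # S24/S25: iterate over the characters
        if c in temp_chars:                         # S252: the current font still supports c
            temp_text += c
            continue
        if temp_text:                               # S253: close the finished segment
            text_font_list.append((temp_text, temp_font))
        temp_font = pick_font(c)                    # S251/S253: start a new segment at c
        if temp_font is None:                       # no font supports c: drop it (cf. S23)
            temp_text, temp_chars = "", set()
        else:
            temp_text, temp_chars = c, chars_of(temp_font)
    if temp_text:                                   # S26: flush the last segment
        text_font_list.append((temp_text, temp_font))
    return text_font_list
```

For example, a line mixing several languages would come back as a sequence of (segment, font) pairs, each segment later drawn with a font that fully supports it.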
By splitting the corpus in this way, the invention can generate data mixing multiple languages, supports data generation for uncommon characters and special symbols, and at the same time reduces processing difficulty and cost.
Picture generation
Picture generation is closely related to the text direction, which can be horizontal or vertical: horizontal means the characters in the picture are arranged from left to right, vertical means they are arranged from top to bottom. To facilitate the subsequent splicing, the size of the final spliced picture needs to be obtained; its width is recorded as final_width and its height as final_height, both initialized to 0.
If the text pictures are generated in horizontal mode, the text-font list text_font_list is traversed, and the main steps are as follows:
S311: obtaining each split text segment and its corresponding font;
S312: for the text and font of S311, using the font to obtain the picture size required by the text, recording the height as height and the width as width;
S313: generating a transparent picture of the size obtained in S312;
S314: drawing the text content on the transparent picture of S313 using the font;
S315: adding the drawn picture to a result list;
S316: taking the largest picture height as the final height, because pictures generated with different fonts may differ in height: if height is greater than final_height, then final_height = height, otherwise final_height remains unchanged;
S317: the final picture width is the sum of the widths of all pictures, so final_width += width.
If the text pictures are generated in vertical mode, the text-font list text_font_list is traversed, and the main steps are as follows:
S321: obtaining each split text segment and its corresponding font;
S322: traversing each character c of the text, and using the font to obtain the picture size required by character c, recording the height as height and the width as width;
S323: generating a transparent picture of the size obtained in S322;
S324: drawing the character c on the transparent picture of S323 using the font;
S325: adding the drawn picture to a result list;
S326: taking the largest picture width as the final width, because pictures generated with different fonts may differ in width: if width is greater than final_width, then final_width = width, otherwise final_width remains unchanged;
S327: the final picture height is the sum of the heights of all pictures, so final_height += height.
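A minimal sketch of the horizontal generation mode is shown below (an assumption for illustration, using Pillow; the helper names are not from the patent); the vertical mode is symmetric, rendering character by character and swapping the roles of width and height.

```python
from PIL import Image, ImageDraw, ImageFont

def generate_pieces(text_font_list, font_size=32, color=(0, 0, 0, 255)):
    pieces, final_width, final_height = [], 0, 0
    for segment, font_path in text_font_list:             # S311: each segment and its font
        font = ImageFont.truetype(font_path, font_size)
        left, top, right, bottom = font.getbbox(segment)  # S312: required picture size
        width, height = right - left, bottom - top
        piece = Image.new("RGBA", (width, height), (0, 0, 0, 0))  # S313: transparent picture
        ImageDraw.Draw(piece).text((-left, -top), segment, font=font, fill=color)  # S314
        pieces.append(piece)                              # S315: collect the rendered piece
        final_height = max(final_height, height)          # S316: tallest piece wins
        final_width += width                              # S317: widths accumulate
    return pieces, final_width, final_height
```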
Picture splicing
To obtain the final picture, the generated pictures need to be spliced; the splicing mode is likewise related to the text direction.
If the pictures are spliced in horizontal mode, the main steps are as follows:
S411: generating a transparent picture with width final_width and height final_height (the size comes from picture generation);
S412: initializing the pasting position x = 0, y = 0;
S413: acquiring the size of a picture, with its width recorded as width and its height as height;
S414: pasting the generated picture at coordinates (x, y) of the transparent picture;
S415: x += width.
If the pictures are spliced in vertical mode, the main steps are as follows:
S421: generating a transparent picture with width final_width and height final_height (the size comes from picture generation);
S422: initializing the pasting position x = 0, y = 0;
S423: acquiring the size of a picture, with its width recorded as width and its height as height;
S424: pasting the generated picture at coordinates (x, y) of the transparent picture;
S425: y += height.
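Both splicing modes can be sketched with one helper (assumed names, continuing the earlier sketch rather than the patent's own code): the rendered pieces are pasted onto a single transparent canvas, advancing x in horizontal mode (S415) or y in vertical mode (S425).

```python
from PIL import Image

def splice(pieces, final_width, final_height, vertical=False):
    canvas = Image.new("RGBA", (final_width, final_height), (0, 0, 0, 0))  # S411/S421
    x = y = 0                                       # S412/S422: initial pasting position
    for piece in pieces:                            # S413: each piece carries its own size
        canvas.alpha_composite(piece, (x, y))       # S414/S424: paste at (x, y)
        if vertical:
            y += piece.height                       # S425: advance down the column
        else:
            x += piece.width                        # S415: advance along the row
    return canvas
```

The resulting canvas can then be pasted onto a background picture and transformed as in the background-art steps described earlier.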
A specific application of this embodiment is as follows: the invention does not simply draw pictures with a single font or directly replace unsupported characters; instead, by splitting the corpus, it finds the text parts that can be drawn with the different supported fonts. This avoids the need for a fine-grained classification algorithm to distinguish different languages, which is difficult (characters as similar as simplified and traditional Chinese are often hard to tell apart) and increases processing cost.
Splitting the corpus and finding the text parts drawn with different supported fonts also overcomes the data imbalance problem, and an OCR data generation mode for arbitrary character combinations is realized that is simple and efficient and can generate any specified character combination.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (10)

1. An OCR data generation method for arbitrary character combinations, characterized by comprising the following steps:
S1: generating a character-font mapping dictionary from the character dictionary, the font library and the corpus, to obtain the correspondence between each character and all fonts that support it;
S2: splitting the corpus: acquiring a line of text of the picture to be generated from the corpus, splitting the text into several character strings, and finding the font corresponding to each character string;
the concrete steps of corpus splitting being as follows:
S21: reading the first character c of the text to be generated;
S22: taking the list s of all fonts corresponding to character c from the character-font mapping dictionary, and returning either null or a font;
S23: according to the return value of S22, either shortening the text and returning to S21, or recording the return value as temp_font, until the first character c with font support is found;
S24: if the text is empty or the font returned for the character is empty, ending all steps; otherwise traversing each character c of the current text;
S25: performing an iteration for each character c in S24;
S26: if temp_font is not empty, obtaining the last text segment temp_text and its corresponding font temp_font, and adding them to the text-font list text_font_list;
S3: generating pictures: rendering each found character string with its corresponding font, arranged in either the horizontal or the vertical text direction, to obtain text pictures, the width of the spliced picture being recorded as final_width and its height as final_height, both initialized to 0;
S4: splicing pictures: splicing the text pictures in the horizontal or vertical direction to obtain the final picture.
2. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S1, the character dictionary contains all characters appearing in the corpus, the font library is the set of all fonts intended to be used and must satisfy that every character in the character dictionary is supported by at least one font, and the corpus is the text content to be generated.
3. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S1, the character-font mapping dictionary is generated as follows:
S11: reading the character dictionary and initializing the character-font mapping dictionary to empty;
S12: traversing all fonts in the font library;
S13: reading all characters supported by each font of S12;
S14: traversing all characters of S13, and if a character supported by the font is in the character dictionary, adding the font object to the font list of that character;
S15: completing the construction of the character-font mapping dictionary, thereby obtaining the correspondence between each character and all fonts that support it.
4. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S22, the concrete steps of taking the list s of all fonts corresponding to character c from the character-font mapping dictionary are as follows:
S221: if character c is not in the character-font mapping dictionary or the list s is empty, returning null and ending S22;
S222: if the list s contains only one font object, returning this font and ending S22;
S223: if the list s contains several font objects, randomly selecting one font object from the list s, returning it, and ending S22.
5. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S23, the concrete steps of finding the first character c with font support are as follows:
S231: if the return value of S22 is null and the current text is not null, the text to be generated becomes the text without its first character, i.e. text = text[1:], and the process returns to S21 until the return value of S22 is not null or the text becomes empty, ending S23;
S232: if the return value of S22 is not null, recording the return value of S22 as temp_font and the list of all characters supported by this font as temp_char_list, ending S23.
6. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S25, the iteration specifically comprises the following steps:
S251: if c is the first character of the text, recording temp_text = c, with the currently used font being the font temp_font corresponding to c, and the iteration ends;
S252: if c is not the first character but character c is in the character list temp_char_list of the current font, meaning the current font continues to support the character, then temp_text += c, and the iteration ends;
S253: if c is not the first character, character c is not in the character list temp_char_list of the current font, and temp_char_list is not empty, meaning the current font only supports the text content before c, then a text segment temp_text and its corresponding font temp_font are obtained and added to text_font_list; temp_font is then set to the return value of S22 for c; if temp_font is null, no font supports the current character and temp_text and temp_char_list are both set to empty; otherwise temp_text = c and temp_char_list is set to the list of all characters supported by temp_font; the iteration ends.
7. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S3, the main steps of generating the text pictures in the horizontal arrangement mode are as follows:
S311: obtaining each split text segment and its corresponding font;
S312: for the text and font of S311, using the font to obtain the picture size required by the text, recording the height as height and the width as width;
S313: generating a transparent picture of the size obtained in S312;
S314: drawing the text content on the transparent picture of S313 using the font;
S315: adding the drawn picture to a result list;
S316: taking the largest picture height as the final height: if height is greater than final_height, then final_height = height, otherwise final_height remains unchanged;
S317: the final picture width is the sum of the widths of all pictures, so final_width += width.
8. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S3, the steps of generating the text pictures in the vertical arrangement mode are as follows:
S321: obtaining each split text segment and its corresponding font;
S322: traversing each character c of the text, and using the font to obtain the picture size required by character c, recording the height as height and the width as width;
S323: generating a transparent picture of the size obtained in S322;
S324: drawing the character c on the transparent picture of S323 using the font;
S325: adding the drawn picture to a result list;
S326: taking the largest picture width as the final width: if width is greater than final_width, then final_width = width, otherwise final_width remains unchanged;
S327: the final picture height is the sum of the heights of all pictures, so final_height += height.
9. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S4, splicing the pictures in the horizontal direction comprises the following steps:
S411: generating a transparent picture with width final_width and height final_height;
S412: initializing the pasting position x = 0, y = 0;
S413: acquiring the size of a picture, with its width recorded as width and its height as height;
S414: pasting the generated picture at coordinates (x, y) of the transparent picture;
S415: x += width.
10. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S4, splicing the pictures in the vertical direction comprises the following steps:
S421: generating a transparent picture with width final_width and height final_height;
S422: initializing the pasting position x = 0, y = 0;
S423: acquiring the size of a picture, with its width recorded as width and its height as height;
S424: pasting the generated picture at coordinates (x, y) of the transparent picture;
S425: y += height.
CN202110978686.1A 2021-08-25 2021-08-25 OCR data generation method for any character combination Active CN113435163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110978686.1A CN113435163B (en) 2021-08-25 2021-08-25 OCR data generation method for any character combination

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110978686.1A CN113435163B (en) 2021-08-25 2021-08-25 OCR data generation method for any character combination

Publications (2)

Publication Number Publication Date
CN113435163A true CN113435163A (en) 2021-09-24
CN113435163B CN113435163B (en) 2021-11-16

Family

ID=77797823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110978686.1A Active CN113435163B (en) 2021-08-25 2021-08-25 OCR data generation method for any character combination

Country Status (1)

Country Link
CN (1) CN113435163B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102262619A (en) * 2010-05-31 2011-11-30 汉王科技股份有限公司 Method and device for extracting characters of document
CN110246197A (en) * 2019-05-21 2019-09-17 北京奇艺世纪科技有限公司 Identifying code character generating method, device, electronic equipment and storage medium
CN111401365A (en) * 2020-03-17 2020-07-10 海尔优家智能科技(北京)有限公司 OCR image automatic generation method and device
CN112488114A (en) * 2020-11-13 2021-03-12 宁波多牛大数据网络技术有限公司 Picture synthesis method and device and character recognition system
CN112418224A (en) * 2021-01-22 2021-02-26 成都无糖信息技术有限公司 General OCR training data generation system and method based on machine learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Tingting, "Research on a Text Recognition System Based on Tesseract_OCR", China Master's Theses Full-text Database (Electronic Journal) *

Also Published As

Publication number Publication date
CN113435163B (en) 2021-11-16


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant