CN113435163A - OCR data generation method for any character combination - Google Patents

OCR data generation method for any character combination

Info

Publication number
CN113435163A
Authority
CN
China
Prior art keywords
character
font
text
picture
width
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110978686.1A
Other languages
Chinese (zh)
Other versions
CN113435163B (en)
Inventor
苗功勋
孙强
陈姝
熊英超
韦文峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhongfu Information Technology Co Ltd
Original Assignee
Nanjing Zhongfu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhongfu Information Technology Co Ltd filed Critical Nanjing Zhongfu Information Technology Co Ltd
Priority to CN202110978686.1A priority Critical patent/CN113435163B/en
Publication of CN113435163A publication Critical patent/CN113435163A/en
Application granted granted Critical
Publication of CN113435163B publication Critical patent/CN113435163B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/109Font handling; Temporal or kinetic typography
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses an OCR data generation method for arbitrary character combinations in the technical field of optical character recognition. The method generates a character-font mapping dictionary from a character dictionary, a font library and a corpus to obtain the correspondence between each character and all fonts that support it; obtains a line of text for the picture to be generated from the corpus, splits the text into several character strings, and finds the font corresponding to each character string; renders each character string with its corresponding font to obtain text pictures; and splices the text pictures to obtain the final picture. The invention improves the step of drawing specified characters on a background picture in conventional OCR data generation and realizes a simple and efficient way of generating OCR data for arbitrary character combinations.

Description

OCR data generation method for any character combination
Technical Field
The invention relates to the technical field of optical character recognition, and in particular to an OCR data generation method for arbitrary character combinations.
Background
Currently, mainstream algorithms in the field of OCR (Optical Character Recognition) fall into two types: two-stage algorithms and end-to-end algorithms. A two-stage algorithm generally consists of a text detection algorithm and a text recognition algorithm; the main idea is to first use the text detection algorithm to obtain the detection boxes of text lines in an image, and then use the text recognition algorithm to recognize the content inside each text box. An end-to-end algorithm completes both text detection and text recognition within a single algorithm. Although end-to-end models are smaller and faster, they are mostly used for OCR in fixed scenes such as bills and bank cards, so two-stage algorithms are generally adopted for scenes with more flexible input.
In a two-stage algorithm, the text detection algorithm is relatively simple to implement and needs less data, whereas the text recognition algorithm is the core OCR module and directly affects the accuracy of the output. Training it often requires hundreds of thousands of samples, and manual labeling at that scale would consume a large amount of manpower, so a data generation method is generally needed to meet the demand for massive amounts of data.
For the text recognition algorithm, the input is a text picture and the output is the text corresponding to that picture. When generating such OCR data, it is therefore necessary to generate the corresponding text picture from the text and to save the text together with the picture. The general generation steps are as follows:
1. Corpora are collected from various channels; they can take many forms, such as articles, dictionaries or phrases.
2. According to the actual language and target requirements, text content is generally produced from the corpus in various ways, such as randomly cutting article fragments of a certain length or randomly combining extracted characters.
3. A transparent picture is generated, text is drawn on it, and the transparent text picture is pasted onto a target background picture; the transparent picture makes it possible to replace the text background and produce a more realistic picture. Some methods do not generate a transparent picture in advance but instead crop the background picture directly and draw the text on the cropped region.
4. Parameters such as the font, font size and font color of the text are set, usually according to actual requirements, so as to be closer to real pictures.
5. The text is drawn on the transparent picture (or background picture) using the chosen font. This step is the core link of text image generation, and the quality of the generated text directly affects the accuracy of the algorithm model.
6. The transparent text picture is transformed, generally by operations such as adding noise, deformation and adding a background picture, to produce the final text picture.
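For concreteness, the sketch below (not part of the patent; the library choice and all helper names are assumptions) shows how steps 3 to 6 might look with Pillow: measure the text, draw it on a transparent layer, and composite the layer onto a random crop of a background picture.

```python
from PIL import Image, ImageDraw, ImageFont
import random

def render_sample(text, font_path, background, font_size=32, color=(20, 20, 20)):
    font = ImageFont.truetype(font_path, font_size)
    # Measure the rendered text (getbbox is available in recent Pillow versions).
    left, top, right, bottom = font.getbbox(text)
    width, height = right - left, bottom - top
    # Step 3: a transparent layer, so the text background can be replaced later.
    layer = Image.new("RGBA", (width, height), (0, 0, 0, 0))
    # Steps 4-5: draw the text on the transparent layer with the chosen font and color.
    ImageDraw.Draw(layer).text((-left, -top), text, font=font, fill=color + (255,))
    # Step 3 (variant): crop a random patch of the background as the final backdrop.
    bg = background.convert("RGBA")
    x = random.randint(0, max(0, bg.width - width))
    y = random.randint(0, max(0, bg.height - height))
    patch = bg.crop((x, y, x + width, y + height))
    patch.alpha_composite(layer)
    # Step 6 (noise, deformation, etc.) is omitted; return the picture and its label.
    return patch.convert("RGB"), text
```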
In existing OCR data generation methods, the above steps all have common implementations. For step 5, however, a single image usually uses only a single font, because fonts are divided by language and the number of characters supported by a single font is limited; unsupported characters are replaced with blanks or placeholders such as "#". The generated data therefore has serious limitations, and the main problems are as follows:
1. Data mixing multiple languages cannot be generated. A single font generally supports only one or two languages, so if a given corpus uses several languages, the picture corresponding to a given text cannot be generated, whereas in actual use multiple languages are often mixed.
2. Data generation for uncommon characters is not supported. Taking Chinese as an example, there are about 3,500 common Chinese characters, but the full Chinese character set is far larger, and a single font sometimes cannot even cover the 3,500 common characters, let alone the uncommon ones.
3. Data generation for special symbols such as "≧", "∑" or "█" is poorly supported: some fonts support these symbols, but many more do not.
In view of the above problems, the solutions currently adopted are as follows:
1. Change the implementation idea of the text detection algorithm. During text detection, languages are detected separately: when different languages appear in the same line, each language is detected with its own detection box. However, this idea requires a fine-grained classification algorithm to distinguish the languages, which is difficult; characters that look similar, such as simplified and traditional Chinese, are often hard to tell apart, and the processing cost increases.
2. Generate data by random character combination. Characters are randomly drawn from the set of characters to be recognized and combined to generate data; when an unsupported character is encountered, it is skipped or the font is changed. This method often leads to data imbalance, with common characters appearing frequently and uncommon characters rarely, and it still cannot generate pictures for a specified character combination.
Based on the above, the invention provides an OCR data generation method for arbitrary character combinations to solve the above problems.
Disclosure of Invention
The present invention aims to provide an OCR data generation method for arbitrary character combinations, so as to solve the problems raised in the background art.
To achieve this purpose, the invention provides the following technical scheme:
An OCR data generation method for arbitrary character combinations, comprising the following steps:
S1: generating a character-font mapping dictionary from the character dictionary, the font library and the corpus, to obtain the correspondence between each character and all fonts that support it;
S2: splitting the corpus: acquiring a line of text of the picture to be generated from the corpus, splitting the text into several character strings, and finding the font corresponding to each character string;
The concrete steps of corpus splitting are as follows:
S21: reading the first character c of the text to be generated;
S22: taking the list s of all fonts corresponding to character c from the character-font mapping dictionary, and returning either null or a font;
S23: according to the return value of S22, either shortening the text and returning to S21, or recording the return value as temp_font, until the first character c with font support is found;
S24: if the text is empty or the font returned for the character is empty, ending all steps; otherwise traversing each character c of the current text;
S25: performing an iteration for each character c in S24;
S26: if temp_font is not empty, obtaining the last text segment temp_text and its corresponding font temp_font, and adding them to the text-font list text_font_list;
S3: generating pictures: rendering each found character string with its corresponding font, arranged in either the horizontal or the vertical text direction, to obtain text pictures; the width of the spliced picture is recorded as final_width and its height as final_height, both initialized to 0;
S4: splicing pictures: splicing the text pictures in the horizontal or vertical direction to obtain the final picture.
Preferably, in S1, the character dictionary contains all characters appearing in the corpus, the font library is the set of all fonts intended to be used and must satisfy that every character in the character dictionary is supported by at least one font, and the corpus is the text content to be generated.
Preferably, in S1, the character-font mapping dictionary is generated as follows:
S11: reading the character dictionary and initializing the character-font mapping dictionary to empty;
S12: traversing all fonts in the font library;
S13: reading all characters supported by each font of S12;
S14: traversing all characters of S13, and if a character supported by the font is in the character dictionary, adding the font object to the font list of that character;
S15: completing the construction of the character-font mapping dictionary, thereby obtaining the correspondence between each character and all fonts that support it.
Preferably, in S22, the concrete steps of taking the list s of all fonts corresponding to character c from the character-font mapping dictionary are as follows:
S221: if character c is not in the character-font mapping dictionary or the list s is empty, returning null and ending S22;
S222: if the list s contains only one font object, returning this font and ending S22;
S223: if the list s contains several font objects, randomly selecting one font object from the list s, returning it, and ending S22.
Preferably, in S23, the concrete steps of finding the first character c with font support are as follows:
S231: if the return value of S22 is null and the current text is not null, the text to be generated becomes the text without its first character, i.e. text = text[1:], and the process returns to S21 until the return value of S22 is not null or the text becomes empty, ending S23;
S232: if the return value of S22 is not null, recording the return value of S22 as temp_font and the list of all characters supported by this font as temp_char_list, ending S23.
Preferably, in S25, the iteration specifically comprises the following steps:
S251: if c is the first character of the text, recording temp_text = c, with the currently used font being the font temp_font corresponding to c, and the iteration ends;
S252: if c is not the first character but character c is in the character list temp_char_list of the current font, meaning the current font continues to support the character, then temp_text += c, and the iteration ends;
S253: if c is not the first character, character c is not in the character list temp_char_list of the current font, and temp_char_list is not empty, meaning the current font only supports the text content before c, then a text segment temp_text and its corresponding font temp_font are obtained and added to text_font_list; temp_font is then set to the return value of S22 for c; if temp_font is null, no font supports the current character and temp_text and temp_char_list are both set to empty; otherwise temp_text = c and temp_char_list is set to the list of all characters supported by temp_font; the iteration ends.
Preferably, in S3, the main steps of generating the text pictures in the horizontal arrangement mode are as follows:
S311: obtaining each split text segment and its corresponding font;
S312: for the text and font of S311, using the font to obtain the picture size required by the text, recording the height as height and the width as width;
S313: generating a transparent picture of the size obtained in S312;
S314: drawing the text content on the transparent picture of S313 using the font;
S315: adding the drawn picture to a result list;
S316: taking the largest picture height as the final height: if height is greater than final_height, then final_height = height, otherwise final_height remains unchanged;
S317: the final picture width is the sum of the widths of all pictures, so final_width += width.
Preferably, in S3, the main steps of generating the text pictures in the vertical arrangement mode are as follows:
S321: obtaining each split text segment and its corresponding font;
S322: traversing each character c of the text, and using the font to obtain the picture size required by character c, recording the height as height and the width as width;
S323: generating a transparent picture of the size obtained in S322;
S324: drawing the character c on the transparent picture of S323 using the font;
S325: adding the drawn picture to a result list;
S326: taking the largest picture width as the final width: if width is greater than final_width, then final_width = width, otherwise final_width remains unchanged;
S327: the final picture height is the sum of the heights of all pictures, so final_height += height.
Preferably, in S4, splicing the pictures in the horizontal direction comprises the following main steps:
S411: generating a transparent picture with width final_width and height final_height;
S412: initializing the pasting position x = 0, y = 0;
S413: acquiring the size of a picture, with its width recorded as width and its height as height;
S414: pasting the generated picture at coordinates (x, y) of the transparent picture;
S415: x += width.
Preferably, in S4, splicing the pictures in the vertical direction comprises the following main steps:
S421: generating a transparent picture with width final_width and height final_height;
S422: initializing the pasting position x = 0, y = 0;
S423: acquiring the size of a picture, with its width recorded as width and its height as height;
S424: pasting the generated picture at coordinates (x, y) of the transparent picture;
S425: y += height.
Compared with the prior art, the invention has the following beneficial effects:
The invention improves the step of drawing specified characters on a background picture in conventional OCR data generation. Instead of simply drawing the picture with a single font or directly replacing unsupported characters, it splits the corpus to find the text parts that can be drawn with the different supported fonts, thereby realizing OCR data generation for arbitrary character combinations in a simple and efficient way and laying a solid foundation for constructing good OCR data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a technical solution:
An OCR data generation method for arbitrary character combinations comprises the following steps:
S1: generating a character-font mapping dictionary from the character dictionary, the font library and the corpus, to obtain the correspondence between each character and all fonts that support it;
S2: splitting the corpus: acquiring a line of text of the picture to be generated from the corpus, splitting the text into several character strings, and finding the font corresponding to each character string;
The concrete steps of corpus splitting are as follows:
S21: reading the first character c of the text to be generated;
S22: taking the list s of all fonts corresponding to character c from the character-font mapping dictionary, and returning either null or a font;
S23: according to the return value of S22, either shortening the text and returning to S21, or recording the return value as temp_font, until the first character c with font support is found;
S24: if the text is empty or the font returned for the character is empty, ending all steps; otherwise traversing each character c of the current text;
S25: performing an iteration for each character c in S24;
S26: if temp_font is not empty, obtaining the last text segment temp_text and its corresponding font temp_font, and adding them to the text-font list text_font_list;
S3: generating pictures: rendering each found character string with its corresponding font, arranged in either the horizontal or the vertical text direction, to obtain text pictures; the width of the spliced picture is recorded as final_width and its height as final_height, both initialized to 0;
S4: splicing pictures: splicing the text pictures in the horizontal or vertical direction to obtain the final picture.
The method mainly improves the step of drawing specified characters on a background picture during OCR data generation; the other links of OCR data generation are therefore not described here, and in actual generation the characters can be drawn on a transparent picture or on a background picture according to the user's choice.
The method comprises four modules: character-font mapping dictionary generation, corpus splitting, picture generation and picture splicing; the flow chart is shown in FIG. 1. Before generating data, a character dictionary, a font library and a corpus are prepared. The character dictionary contains all characters appearing in the corpus; the font library is the set of all fonts intended to be used and must satisfy that every character in the character dictionary is supported by at least one font, otherwise some characters cannot be generated. The corpus is the text content to be generated; its form is not limited, and the way text content is produced from it can be chosen according to requirements. If no corpus is given and random character combinations are used to generate data, the corpus can simply be the character dictionary.
Character-font mapping dictionary generation
The character-font mapping dictionary records the correspondence between characters and fonts: its keys are all characters in the character dictionary, and its values are the sets of fonts supporting each character. The character-font mapping dictionary is generated as follows:
S11: reading the character dictionary and initializing the character-font mapping dictionary to empty;
S12: traversing all fonts in the font library;
S13: reading all characters supported by each font of S12;
S14: traversing all characters of S13, and if a character supported by the font is in the character dictionary, adding the font object to the font list of that character;
S15: completing the construction of the character-font mapping dictionary, thereby obtaining the correspondence between each character and all fonts that support it.
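As an illustration only (the patent does not prescribe an implementation, so the library choice and all names below are assumptions), the character-font mapping dictionary of S11-S15 could be built with fontTools, whose cmap table lists the code points each font supports:

```python
from fontTools.ttLib import TTFont

def build_char_font_map(char_dict, font_paths):
    # S11: initialize the mapping with an empty font list for every character.
    char_font_map = {c: [] for c in char_dict}
    for path in font_paths:                        # S12: traverse the font library
        cmap = TTFont(path).getBestCmap()          # S13: code points this font supports
        supported = {chr(cp) for cp in cmap}
        for c in supported & set(char_dict):       # S14: keep only dictionary characters
            char_font_map[c].append(path)
    return char_font_map                           # S15: character -> supporting fonts
```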
Corpus splitting
A line of text for the picture to be generated is obtained from the corpus. Because it cannot be determined in advance whether one font supports every character of the text, the text needs to be split to find how many consecutive characters can use the same font; that is, the text is split into several character strings and the font corresponding to each character string is found. If one font supports all characters of the line, the text does not need to be split.
S21: reading the first character c of the text to be generated;
S22: the concrete steps of taking the list s of all fonts corresponding to character c from the character-font mapping dictionary are as follows:
S221: if character c is not in the character-font mapping dictionary or the list s is empty, returning null and ending S22;
S222: if the list s contains only one font object, returning this font and ending S22;
S223: if the list s contains several font objects, randomly selecting one font object from the list s, returning it, and ending S22;
S23: finding the first character c with font support, as follows:
S231: if the return value of S22 is null and the current text is not null, the text to be generated becomes the text without its first character, i.e. text = text[1:], and the process returns to S21 until the return value of S22 is not null or the text becomes empty, ending S23;
S232: if the return value of S22 is not null, recording the return value of S22 as temp_font and the list of all characters supported by this font as temp_char_list, ending S23;
S24: if the text is empty or the font returned for the character is empty, no font supports any character of the text to be generated and all steps end; if the font temp_font returned for the first character is not empty, each character c of the current text is traversed;
S25: for each character c in S24, the following iteration is performed:
S251: if c is the first character of the text, recording temp_text = c, with the currently used font being the font temp_font corresponding to c, and the iteration ends;
S252: if c is not the first character but character c is in the character list temp_char_list of the current font, meaning the current font continues to support the character, then temp_text += c, and the iteration ends;
S253: if c is not the first character, character c is not in the character list temp_char_list of the current font, and temp_char_list is not empty, meaning the current font only supports the text content before c, then a text segment temp_text and its corresponding font temp_font are obtained and both are added to the text-font list (denoted text_font_list). temp_font is then set to the return value of S22 for c. If temp_font is empty, no font supports the current character, and temp_text and temp_char_list are both set to empty; otherwise temp_text = c and temp_char_list is set to the list of all characters supported by temp_font. The iteration ends;
S26: if temp_font is not empty at this point, the last text segment temp_text and its corresponding font temp_font are obtained and added to the text-font list text_font_list.
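The sketch below condenses S21-S26 into a single loop; it is an assumed implementation, not the patent's code. It walks the text, extends the current segment while the current font supports the next character, and otherwise closes the segment and picks a new font at random; characters supported by no font are simply dropped.

```python
import random

def split_text_by_font(text, char_font_map):
    def pick_font(c):                               # S22: a random supporting font, or None
        fonts = char_font_map.get(c, [])
        return random.choice(fonts) if fonts else None

    def chars_of(font):                             # temp_char_list: characters this font supports
        return {c for c, fonts in char_font_map.items() if font in fonts}

    text_font_list = []                             # the resulting (segment, font) pairs
    temp_text, temp_font, temp_chars = "", None, set()
    for c in text:                                  # S24/S25: iterate over the characters
        if c in temp_chars:                         # S252: the current font still supports c
            temp_text += c
            continue
        if temp_text:                               # S253: close the finished segment
            text_font_list.append((temp_text, temp_font))
        temp_font = pick_font(c)                    # S251/S253: start a new segment at c
        if temp_font is None:                       # no font supports c: drop it (cf. S23)
            temp_text, temp_chars = "", set()
        else:
            temp_text, temp_chars = c, chars_of(temp_font)
    if temp_text:                                   # S26: flush the last segment
        text_font_list.append((temp_text, temp_font))
    return text_font_list
```

For example, a line mixing several languages would come back as a sequence of (segment, font) pairs, each segment later drawn with a font that fully supports it.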
By splitting the corpus in this way, the invention can generate data mixing multiple languages, supports data generation for uncommon characters and special symbols, and at the same time reduces processing difficulty and cost.
Picture generation
Picture generation is closely related to the text direction, which can be horizontal or vertical: horizontal means the characters in the picture are arranged from left to right, vertical means they are arranged from top to bottom. To facilitate the subsequent splicing, the size of the final spliced picture needs to be obtained; its width is recorded as final_width and its height as final_height, both initialized to 0.
If the text pictures are generated in horizontal mode, the text-font list text_font_list is traversed, and the main steps are as follows:
S311: obtaining each split text segment and its corresponding font;
S312: for the text and font of S311, using the font to obtain the picture size required by the text, recording the height as height and the width as width;
S313: generating a transparent picture of the size obtained in S312;
S314: drawing the text content on the transparent picture of S313 using the font;
S315: adding the drawn picture to a result list;
S316: taking the largest picture height as the final height, because pictures generated with different fonts may differ in height: if height is greater than final_height, then final_height = height, otherwise final_height remains unchanged;
S317: the final picture width is the sum of the widths of all pictures, so final_width += width.
If the text pictures are generated in vertical mode, the text-font list text_font_list is traversed, and the main steps are as follows:
S321: obtaining each split text segment and its corresponding font;
S322: traversing each character c of the text, and using the font to obtain the picture size required by character c, recording the height as height and the width as width;
S323: generating a transparent picture of the size obtained in S322;
S324: drawing the character c on the transparent picture of S323 using the font;
S325: adding the drawn picture to a result list;
S326: taking the largest picture width as the final width, because pictures generated with different fonts may differ in width: if width is greater than final_width, then final_width = width, otherwise final_width remains unchanged;
S327: the final picture height is the sum of the heights of all pictures, so final_height += height.
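A minimal sketch of the horizontal generation mode is shown below (an assumption for illustration, using Pillow; the helper names are not from the patent); the vertical mode is symmetric, rendering character by character and swapping the roles of width and height.

```python
from PIL import Image, ImageDraw, ImageFont

def generate_pieces(text_font_list, font_size=32, color=(0, 0, 0, 255)):
    pieces, final_width, final_height = [], 0, 0
    for segment, font_path in text_font_list:             # S311: each segment and its font
        font = ImageFont.truetype(font_path, font_size)
        left, top, right, bottom = font.getbbox(segment)  # S312: required picture size
        width, height = right - left, bottom - top
        piece = Image.new("RGBA", (width, height), (0, 0, 0, 0))  # S313: transparent picture
        ImageDraw.Draw(piece).text((-left, -top), segment, font=font, fill=color)  # S314
        pieces.append(piece)                              # S315: collect the rendered piece
        final_height = max(final_height, height)          # S316: tallest piece wins
        final_width += width                              # S317: widths accumulate
    return pieces, final_width, final_height
```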
Picture splicing
To obtain the final picture, the generated pictures need to be spliced; the splicing mode is likewise related to the text direction.
If the pictures are spliced in horizontal mode, the main steps are as follows:
S411: generating a transparent picture with width final_width and height final_height (the size comes from picture generation);
S412: initializing the pasting position x = 0, y = 0;
S413: acquiring the size of a picture, with its width recorded as width and its height as height;
S414: pasting the generated picture at coordinates (x, y) of the transparent picture;
S415: x += width.
If the pictures are spliced in vertical mode, the main steps are as follows:
S421: generating a transparent picture with width final_width and height final_height (the size comes from picture generation);
S422: initializing the pasting position x = 0, y = 0;
S423: acquiring the size of a picture, with its width recorded as width and its height as height;
S424: pasting the generated picture at coordinates (x, y) of the transparent picture;
S425: y += height.
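Both splicing modes can be sketched with one helper (assumed names, continuing the earlier sketch rather than the patent's own code): the rendered pieces are pasted onto a single transparent canvas, advancing x in horizontal mode (S415) or y in vertical mode (S425).

```python
from PIL import Image

def splice(pieces, final_width, final_height, vertical=False):
    canvas = Image.new("RGBA", (final_width, final_height), (0, 0, 0, 0))  # S411/S421
    x = y = 0                                       # S412/S422: initial pasting position
    for piece in pieces:                            # S413: each piece carries its own size
        canvas.alpha_composite(piece, (x, y))       # S414/S424: paste at (x, y)
        if vertical:
            y += piece.height                       # S425: advance down the column
        else:
            x += piece.width                        # S415: advance along the row
    return canvas
```

The resulting canvas can then be pasted onto a background picture and transformed as in the background-art steps described earlier.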
A specific application of this embodiment is as follows: the invention does not simply draw pictures with a single font or directly replace unsupported characters; instead, by splitting the corpus, it finds the text parts that can be drawn with the different supported fonts. This avoids the need for a fine-grained classification algorithm to distinguish different languages, which is difficult (characters as similar as simplified and traditional Chinese are often hard to tell apart) and increases processing cost.
Splitting the corpus and finding the text parts drawn with different supported fonts also overcomes the data imbalance problem, and an OCR data generation mode for arbitrary character combinations is realized that is simple and efficient and can generate any specified character combination.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (10)

1. An OCR data generation method for arbitrary character combinations, characterized by comprising the following steps:
S1: generating a character-font mapping dictionary from the character dictionary, the font library and the corpus, to obtain the correspondence between each character and all fonts that support it;
S2: splitting the corpus: acquiring a line of text of the picture to be generated from the corpus, splitting the text into several character strings, and finding the font corresponding to each character string;
the concrete steps of corpus splitting being as follows:
S21: reading the first character c of the text to be generated;
S22: taking the list s of all fonts corresponding to character c from the character-font mapping dictionary, and returning either null or a font;
S23: according to the return value of S22, either shortening the text and returning to S21, or recording the return value as temp_font, until the first character c with font support is found;
S24: if the text is empty or the font returned for the character is empty, ending all steps; otherwise traversing each character c of the current text;
S25: performing an iteration for each character c in S24;
S26: if temp_font is not empty, obtaining the last text segment temp_text and its corresponding font temp_font, and adding them to the text-font list text_font_list;
S3: generating pictures: rendering each found character string with its corresponding font, arranged in either the horizontal or the vertical text direction, to obtain text pictures, the width of the spliced picture being recorded as final_width and its height as final_height, both initialized to 0;
S4: splicing pictures: splicing the text pictures in the horizontal or vertical direction to obtain the final picture.
2. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S1, the character dictionary contains all characters appearing in the corpus, the font library is the set of all fonts intended to be used and must satisfy that every character in the character dictionary is supported by at least one font, and the corpus is the text content to be generated.
3. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S1, the character-font mapping dictionary is generated as follows:
S11: reading the character dictionary and initializing the character-font mapping dictionary to empty;
S12: traversing all fonts in the font library;
S13: reading all characters supported by each font of S12;
S14: traversing all characters of S13, and if a character supported by the font is in the character dictionary, adding the font object to the font list of that character;
S15: completing the construction of the character-font mapping dictionary, thereby obtaining the correspondence between each character and all fonts that support it.
4. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S22, the concrete steps of taking the list s of all fonts corresponding to character c from the character-font mapping dictionary are as follows:
S221: if character c is not in the character-font mapping dictionary or the list s is empty, returning null and ending S22;
S222: if the list s contains only one font object, returning this font and ending S22;
S223: if the list s contains several font objects, randomly selecting one font object from the list s, returning it, and ending S22.
5. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S23, the concrete steps of finding the first character c with font support are as follows:
S231: if the return value of S22 is null and the current text is not null, the text to be generated becomes the text without its first character, i.e. text = text[1:], and the process returns to S21 until the return value of S22 is not null or the text becomes empty, ending S23;
S232: if the return value of S22 is not null, recording the return value of S22 as temp_font and the list of all characters supported by this font as temp_char_list, ending S23.
6. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S25, the iteration specifically comprises the following steps:
S251: if c is the first character of the text, recording temp_text = c, with the currently used font being the font temp_font corresponding to c, and the iteration ends;
S252: if c is not the first character but character c is in the character list temp_char_list of the current font, meaning the current font continues to support the character, then temp_text += c, and the iteration ends;
S253: if c is not the first character, character c is not in the character list temp_char_list of the current font, and temp_char_list is not empty, meaning the current font only supports the text content before c, then a text segment temp_text and its corresponding font temp_font are obtained and added to text_font_list; temp_font is then set to the return value of S22 for c; if temp_font is null, no font supports the current character and temp_text and temp_char_list are both set to empty; otherwise temp_text = c and temp_char_list is set to the list of all characters supported by temp_font; the iteration ends.
7. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S3, the main steps of generating the text pictures in the horizontal arrangement mode are as follows:
S311: obtaining each split text segment and its corresponding font;
S312: for the text and font of S311, using the font to obtain the picture size required by the text, recording the height as height and the width as width;
S313: generating a transparent picture of the size obtained in S312;
S314: drawing the text content on the transparent picture of S313 using the font;
S315: adding the drawn picture to a result list;
S316: taking the largest picture height as the final height: if height is greater than final_height, then final_height = height, otherwise final_height remains unchanged;
S317: the final picture width is the sum of the widths of all pictures, so final_width += width.
8. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S3, the steps of generating the text pictures in the vertical arrangement mode are as follows:
S321: obtaining each split text segment and its corresponding font;
S322: traversing each character c of the text, and using the font to obtain the picture size required by character c, recording the height as height and the width as width;
S323: generating a transparent picture of the size obtained in S322;
S324: drawing the character c on the transparent picture of S323 using the font;
S325: adding the drawn picture to a result list;
S326: taking the largest picture width as the final width: if width is greater than final_width, then final_width = width, otherwise final_width remains unchanged;
S327: the final picture height is the sum of the heights of all pictures, so final_height += height.
9. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S4, splicing the pictures in the horizontal direction comprises the following steps:
S411: generating a transparent picture with width final_width and height final_height;
S412: initializing the pasting position x = 0, y = 0;
S413: acquiring the size of a picture, with its width recorded as width and its height as height;
S414: pasting the generated picture at coordinates (x, y) of the transparent picture;
S415: x += width.
10. The OCR data generation method for arbitrary character combinations according to claim 1, wherein in S4, splicing the pictures in the vertical direction comprises the following steps:
S421: generating a transparent picture with width final_width and height final_height;
S422: initializing the pasting position x = 0, y = 0;
S423: acquiring the size of a picture, with its width recorded as width and its height as height;
S424: pasting the generated picture at coordinates (x, y) of the transparent picture;
S425: y += height.
CN202110978686.1A 2021-08-25 2021-08-25 OCR data generation method for any character combination Active CN113435163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110978686.1A CN113435163B (en) 2021-08-25 2021-08-25 OCR data generation method for any character combination

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110978686.1A CN113435163B (en) 2021-08-25 2021-08-25 OCR data generation method for any character combination

Publications (2)

Publication Number Publication Date
CN113435163A true CN113435163A (en) 2021-09-24
CN113435163B CN113435163B (en) 2021-11-16

Family

ID=77797823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110978686.1A Active CN113435163B (en) 2021-08-25 2021-08-25 OCR data generation method for any character combination

Country Status (1)

Country Link
CN (1) CN113435163B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102262619A (en) * 2010-05-31 2011-11-30 汉王科技股份有限公司 Method and device for extracting characters of document
CN110246197A (en) * 2019-05-21 2019-09-17 北京奇艺世纪科技有限公司 Identifying code character generating method, device, electronic equipment and storage medium
CN111401365A (en) * 2020-03-17 2020-07-10 海尔优家智能科技(北京)有限公司 OCR image automatic generation method and device
CN112488114A (en) * 2020-11-13 2021-03-12 宁波多牛大数据网络技术有限公司 Picture synthesis method and device and character recognition system
CN112418224A (en) * 2021-01-22 2021-02-26 成都无糖信息技术有限公司 General OCR training data generation system and method based on machine learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Tingting, "Research on a Text Recognition System Based on Tesseract_OCR", China Master's Theses Full-text Database (Electronic Journal) *

Also Published As

Publication number Publication date
CN113435163B (en) 2021-11-16


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant