CN113435426B - Data augmentation method, device and equipment for OCR recognition and storage medium - Google Patents

Data augmentation method, device and equipment for OCR recognition and storage medium

Info

Publication number
CN113435426B
Authority
CN
China
Prior art keywords
word frequency
dictionary
data set
character
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110991555.7A
Other languages
Chinese (zh)
Other versions
CN113435426A (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Eeasy Electronic Tech Co ltd
Original Assignee
Zhuhai Eeasy Electronic Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Eeasy Electronic Tech Co ltd filed Critical Zhuhai Eeasy Electronic Tech Co ltd
Priority to CN202110991555.7A priority Critical patent/CN113435426B/en
Publication of CN113435426A publication Critical patent/CN113435426A/en
Application granted granted Critical
Publication of CN113435426B publication Critical patent/CN113435426B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Character Discrimination (AREA)

Abstract

The invention is applicable to the technical field of character recognition and provides a data augmentation method, device, equipment and storage medium for OCR (optical character recognition). The method comprises the following steps: establishing a recognition dictionary; establishing a first word frequency dictionary based on the recognition dictionary and an acquired open source data set; establishing a synthetic data set text document based on the first word frequency dictionary; and performing data augmentation on a current data set based on established data set attributes, the OCR application scenario and the synthetic data set text document to obtain an augmented basic data set. The cost of obtaining training samples for deep learning OCR algorithms is thereby reduced, and the pertinence of the data augmentation is improved.

Description

Data augmentation method, device and equipment for OCR recognition and storage medium
Technical Field
The invention belongs to the technical field of character recognition, and particularly relates to a data augmentation method, device, equipment and storage medium for OCR recognition.
Background
OCR (Optical Character Recognition) refers to a process in which an electronic device examines characters printed on paper, determines their shapes by detecting dark and light patterns, and then translates the shapes into computer text using a character recognition method. OCR has a wide range of applications, such as document recognition.
There are currently two main approaches to OCR: methods based on conventional OCR algorithms and methods based on deep learning. In recent years, the application of deep learning network structures has made OCR recognition accuracy and stability far higher than those of traditional OCR methods. However, deep learning relies on a large number of training samples. For Chinese document recognition, which covers Chinese characters, English letters and digits, there are usually 6K-8K common characters, the required amount of data is in the millions or even tens of millions, and OCR recognition performance depends on the quantity and variety of the acquired data sets. For Chinese documents, tens of millions of samples are needed to obtain ideal OCR performance, and relying on manual labeling alone is unrealistic. Existing data augmentation approaches mainly vary the background, whereas in Chinese document recognition the background is of a single type while the text contains many kinds of characters, in particular rare characters whose word frequency is low.
Disclosure of Invention
The invention aims to provide a data augmentation method, device, equipment and storage medium for OCR recognition, so as to solve the problems in the prior art that acquiring training samples by manual labeling is costly and that training samples acquired by existing data augmentation approaches are poorly targeted.
In one aspect, the present invention provides a data augmentation method for OCR recognition, the method comprising the steps of:
establishing a recognition dictionary;
establishing a first word frequency dictionary based on the recognition dictionary and the acquired open source data set;
establishing a synthetic data set text document based on the first word frequency dictionary;
and performing data augmentation on the current data set based on the established data set attributes, the application scene identified by the OCR and the synthetic data set text document to obtain an augmented basic data set.
Preferably, the step of establishing a recognition dictionary includes:
adjusting the character position in the recognition dictionary according to the character type; and/or
adjusting the character position in the recognition dictionary according to the calculated glyph similarity of the Chinese characters; and/or
serializing the label of each character in the recognition dictionary.
Preferably, the step of establishing a first word frequency dictionary based on the recognition dictionary and the acquired open source data set further includes:
establishing an index document according to the open source data set;
and traversing the index document, and counting each character in the recognition dictionary to obtain the first word frequency dictionary.
Preferably, the step of creating a text document of a synthesized data set based on the first word frequency dictionary comprises:
balancing the word frequency in the first word frequency dictionary to obtain a second word frequency dictionary;
and establishing the synthetic data set text document based on the second word frequency dictionary.
Preferably, the step of equalizing the word frequencies in the first word frequency dictionary to obtain a second word frequency dictionary includes:
traversing the first word frequency of each character in the first word frequency dictionary;
obtaining a second word frequency corresponding to each character according to a random number selected for each character, wherein the second word frequency is the sum of the first word frequency and the random number, the second word frequency is within a preset word frequency range, the minimum word frequency in the preset word frequency range is greater than the minimum word frequency in the first word frequency dictionary, and the maximum word frequency in the preset word frequency range is greater than the maximum word frequency in the first word frequency dictionary;
and taking the second word frequency as the word frequency of the corresponding character to obtain a second word frequency dictionary.
Preferably, the step of generating the synthesized data set text document based on the second word frequency dictionary comprises:
acquiring a character range of a line text in the application scene;
and establishing the synthesized data set text document based on the character range, wherein the synthesized data set text document comprises at least one text document, the number of characters of the line text in each text document is in the character range, and the word frequency of each character in the synthesized data set text document is the same as the word frequency of the corresponding character in the second word frequency dictionary.
Preferably, the data set attributes include at least font size, font style, tilt angle, blur, distortion, text position, line thickness, color, inter-character spacing, background, and frame pressing.
In another aspect, the present invention provides an OCR model training method based on the data augmentation method for OCR recognition as described above, including:
training an OCR model by using the basic data set until the model converges to obtain a pre-training model;
and mixing the basic data set and the actual marked data set, and continuously training the pre-training model by using the mixed data set to obtain the trained OCR model.
In another aspect, the present invention provides a data augmentation apparatus for OCR recognition, including:
the recognition dictionary establishing unit is used for establishing a recognition dictionary;
the first dictionary establishing unit is used for establishing a first word frequency dictionary based on the recognition dictionary and the acquired open source data set;
the document establishing unit is used for establishing a synthetic data set text document based on the first word frequency dictionary; and
the data augmentation unit is used for performing data augmentation on the current data set based on the established data set attributes, the application scenario of the OCR recognition and the synthetic data set text document to obtain an augmented basic data set.
In another aspect, the present invention further provides a data augmentation device for OCR recognition, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method when executing the computer program.
In another aspect, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described above.
According to the invention, a recognition dictionary is established, a first word frequency dictionary is established based on the recognition dictionary and the acquired open source data set, a synthetic data set text document is established based on the first word frequency dictionary, and data augmentation is performed on the current data set based on the established data set attributes, the OCR application scenario and the synthetic data set text document to obtain the augmented basic data set. The cost of acquiring training samples for deep learning OCR algorithms is thereby reduced, and at the same time the pertinence of the data augmentation is improved.
Drawings
FIG. 1 is a flowchart illustrating an implementation of a data augmentation method for OCR recognition according to an embodiment of the present invention;
FIG. 2 is a flowchart of an implementation of an OCR model training method according to a second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a data augmentation apparatus for OCR recognition according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an OCR model training apparatus according to a fourth embodiment of the present invention; and
FIG. 5 is a schematic structural diagram of a data augmentation device for OCR recognition according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of specific implementations of the present invention is provided in conjunction with specific embodiments:
Example one:
FIG. 1 shows an implementation flow of a data augmentation method for OCR recognition according to a first embodiment of the present invention. For convenience of description, only the portions related to the embodiment of the present invention are shown, as detailed below:
In step S101, a recognition dictionary is established.
In the embodiment of the invention, the recognition dictionary can contain Chinese characters, English letters, and Chinese and English punctuation marks. The recognition dictionary can be established according to the Chinese character encoding standards of China. The standard GB2312-80 specifies 3755 level-1 Chinese characters and 3008 level-2 Chinese characters, 6763 in total. Level-2 characters appear less frequently than level-1 characters but still occur in daily life, for example the characters used to write "wonton", "cricket" and "crucian carp".
When the recognition dictionary is established, the positions of the characters can be adjusted according to actual conditions. Preferably, the positions of characters in the recognition dictionary are adjusted according to character type so that characters of the same type are located near one another. For example, Chinese characters are grouped together, punctuation marks are grouped together, and English letters are grouped together. Further preferably, the positions of the characters in the recognition dictionary are adjusted according to the calculated glyph similarity of the Chinese characters so that characters with greater similarity are located at adjacent positions. For example, the visually similar Chinese characters glossed as "person", "in", "big" and "too" are moved to adjacent positions.
When the recognition dictionary is built, it is taken into account that model training requires corresponding labels in addition to the pictures. The label of each character in the recognition dictionary is therefore preferably serialized, so that the serialized labels meet the requirements of model training. Specifically, the line number of each character in the dictionary may be used as the label of that character. For example, if "I" is on line 32, the sequence number corresponding to "I" is "32".
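The dictionary construction described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `build_recognition_dictionary` and `encode` are hypothetical, the sample characters are chosen only for the example, and the glyph-similarity calculation is not implemented (the similar characters are simply supplied already adjacent).

```python
import string


def build_recognition_dictionary(hanzi, punctuation, letters=string.ascii_letters):
    """Group characters by type and use the 1-based line number as the label.

    Assumption: the caller supplies `hanzi` already ordered so that
    visually similar characters are adjacent.
    """
    chars = []
    seen = set()
    for group in (hanzi, punctuation, letters):  # one type after another
        for ch in group:
            if ch not in seen:  # keep each character only once
                seen.add(ch)
                chars.append(ch)
    # Serialized label: the line on which the character would be written.
    label_of = {ch: line_no for line_no, ch in enumerate(chars, start=1)}
    return chars, label_of


def encode(text, label_of):
    """Turn a label string into the sequence of serialized labels."""
    return [label_of[ch] for ch in text]


chars, label_of = build_recognition_dictionary("人入大太我", "，。,.")
print(encode("我太大", label_of))  # [5, 4, 3]
```

Writing `chars` to a file with one character per line would give a dictionary file in which the line number and the label coincide, as in the "line 32" example above.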
In step S102, a first word frequency dictionary is established based on the recognition dictionary and the acquired open source data set.
In the embodiment of the present invention, the word frequency dictionary corresponding to the open source data set is referred to as the first word frequency dictionary. In addition to the characters, the first word frequency dictionary contains the word frequency of each character, which can be determined from the number of times the character appears in the open source data set. Preferably, an index document is established according to the open source data set, the index document is traversed, and each character in the recognition dictionary is counted to obtain the first word frequency dictionary. In a specific implementation, the index document may include a picture path and the corresponding label characters, connected by a space. For example, if the address of picture 1 is /dataset/ocr and the corresponding text is "I love China", it is saved as: "/data/ocr/1.jpg I love China". After the index document is established, the recognition dictionary is initialized to obtain an initial first word frequency dictionary in which the word frequency of every character is 0. The index document is then traversed line by line, the characters in the recognition dictionary are counted, and the word frequency of each character is updated according to the counting result to obtain the first word frequency dictionary based on the open source data. For example, suppose the open source data set includes picture 1, whose text is "I love China" (four Chinese characters, glossed here as 'I', 'love', 'middle' and 'country'). The initial first dictionary is dict1 = {'I': 0, 'love': 0, 'middle': 0, 'country': 0, ..., '*': 0}, and after picture 1 is processed the dictionary becomes dict1 = {'I': 1, 'love': 1, 'middle': 1, 'country': 1, ..., '*': 0}.
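A minimal sketch of this counting step is shown below. The function name `build_first_word_freq` is hypothetical, and the sketch assumes that picture paths contain no spaces, so that the first space on each index line separates the path from the label text.

```python
def build_first_word_freq(index_lines, recognition_chars):
    """Count how often each recognition-dictionary character occurs in the labels."""
    # Initial first word frequency dictionary: every character starts at 0.
    dict1 = {ch: 0 for ch in recognition_chars}
    for line in index_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # "<picture path> <label text>"; the label itself may contain spaces.
        _path, _, label = line.partition(" ")
        for ch in label:
            if ch in dict1:  # characters outside the dictionary are ignored
                dict1[ch] += 1
    return dict1


index_lines = [
    "/data/ocr/1.jpg 我爱中国",
    "/data/ocr/2.jpg 我爱你",
]
dict1 = build_first_word_freq(index_lines, "我爱中国你*")
print(dict1)
```

Characters that never appear in the open source data, such as `*` here, keep a word frequency of 0, which is what makes the later balancing step necessary.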
In step S103, a synthetic data set text document is created based on the first word frequency dictionary.
In the embodiment of the invention, the synthetic data set text document can be established based on the first word frequency dictionary in consideration of the fact that the synthetic data set needs the corresponding text document for generating the picture. The number of documents in the synthesized data set text document can be set according to actual needs, the synthesized data set text document usually comprises at least one text document, the number of characters of a line text in each text document is usually the same, the number of characters of the line text in different text documents is different, and the word frequency of each character in the synthesized data set text document is the same as the word frequency of a corresponding character in the second word frequency dictionary.
In order to avoid the difference of different character recognition accuracy rates caused by the difference of word frequencies in the actual application process, preferably, the word frequencies in the first word frequency dictionary are balanced to obtain a second word frequency dictionary, and a synthesized data set text document is established based on the second word frequency dictionary, so that the synthesized data set text document is reasonably established by balancing the number of the word frequencies of different characters. It should be noted that the second word frequency dictionary refers to a word frequency dictionary corresponding to the entire data set (open source + synthesis).
When the word frequencies in the first word frequency dictionary are balanced to obtain the second word frequency dictionary, preferably, the first word frequency of each character in the first word frequency dictionary is traversed, a second word frequency corresponding to each character is obtained according to a random number selected for that character, and the second word frequency is used as the word frequency of the corresponding character to obtain the second word frequency dictionary, so that the word frequencies of different characters are balanced. The second word frequency is the sum of the first word frequency and the random number, and the second word frequency lies within a preset word frequency range. For convenience of description, the preset word frequency range is denoted [freq_min, freq_max], and the minimum and maximum word frequencies in the first word frequency dictionary are denoted dic1_min and dic1_max, respectively. The minimum word frequency in the preset range is greater than the minimum word frequency in the first word frequency dictionary, that is, freq_min > dic1_min, and the maximum word frequency in the preset range is greater than the maximum word frequency in the first word frequency dictionary, that is, freq_max > dic1_max. To ensure the final recognition effect, preferably freq_min > 500.
In a specific implementation, the word frequency range may be preset, and the second word frequency dictionary dict2 is then initialized so that every word frequency in dict2 is freq_min. Each character in the first word frequency dictionary dict1 is then traversed, and for any character char with corresponding word frequency char_freq, a random number num is selected such that freq_min <= num + char_freq <= freq_max. That is, num is randomly selected within the range [max(0, freq_min - char_freq), max(0, freq_max - char_freq)], and the value of num is assigned to the word frequency corresponding to the character char in dict2 to obtain the second word frequency dictionary.
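The balancing rule can be sketched as below. The text describes the second word frequency both as the sum of the first word frequency and the random number and as the random number itself, so this hypothetical helper `balance_word_freq` returns both quantities and leaves the choice to the caller; the uniform integer draw is also an assumption, since the distribution of the random number is not specified.

```python
import random


def balance_word_freq(dict1, freq_min, freq_max, rng=None):
    """Pick a per-character random top-up so every total lands in [freq_min, freq_max].

    Returns (extra, total): `extra[ch]` is the random number num and
    `total[ch]` is char_freq + num.
    """
    rng = rng or random.Random()
    if freq_min > freq_max:
        raise ValueError("freq_min must not exceed freq_max")
    # Preconditions stated in the text: freq_min > dic1_min, freq_max > dic1_max.
    if freq_min <= min(dict1.values()) or freq_max <= max(dict1.values()):
        raise ValueError("preset range must exceed the range found in dict1")
    extra, total = {}, {}
    for ch, char_freq in dict1.items():
        low = max(0, freq_min - char_freq)
        high = max(0, freq_max - char_freq)
        num = rng.randint(low, high)  # inclusive on both ends
        extra[ch] = num
        total[ch] = char_freq + num
    return extra, total


extra, total = balance_word_freq({"a": 3, "b": 900, "c": 0}, 501, 1000, random.Random(0))
print(extra, total)
```

Because freq_max is greater than every first word frequency, the upper bound of the draw is always positive, and a rare character with a first word frequency near 0 is guaranteed at least roughly freq_min synthetic occurrences.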
When the synthetic data set text document is established based on the second word frequency dictionary, preferably, the character range of a line of text in the application scenario is obtained and the synthetic data set text document is established based on that character range, so that the document reflects the actual application scenario and is better targeted. The synthetic data set text document comprises at least one text document, the number of characters in each line of text in each text document lies within the character range, and the word frequency of each character in the synthetic data set text document is the same as the word frequency of the corresponding character in the second word frequency dictionary. The minimum value of the character range of a line of text is typically 1. It should be noted that a space (" ") also counts as a character.
Further, the synthetic data set text document may include a first document and at least one second document. The first document includes all characters in the second word frequency dictionary, and each of its lines is a single character. Each line of a second document consists of several characters, with the count lying within the above character range, and the characters in the lines of a second document may be randomly selected. Lines within the same text document have the same number of characters, and lines in different text documents have different numbers of characters. When the synthetic data set text document is established, a third word frequency dictionary dict3 = dict2 is initialized in advance, and a character list is established from all characters in the second word frequency dictionary. The lines of the first document are selected from the character list, and the lines of each second document are randomly selected from the character list. Each time a character is selected, its word frequency in dict3 is reduced by one; when the word frequency of a character in dict3 reaches zero, that character is removed from the character list. This continues until the character list is empty, that is, until every word frequency in dict3 is 0, at which point the establishment of the synthetic data set text document is complete. The third word frequency dictionary dict3 can be understood as an auxiliary dictionary for generating the text documents.
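The countdown procedure with the auxiliary dictionary can be sketched as follows. The function name `build_synthetic_documents` is hypothetical, and two details are assumptions because the text does not specify them: line lengths for the second documents are taken in round-robin order, and a final line that is shorter than the intended length is filed under the document matching its actual length.

```python
import random
from collections import Counter


def build_synthetic_documents(dict2, max_len, rng=None):
    """Return {line_length: [lines]} that uses each character exactly dict2[ch] times."""
    rng = rng or random.Random()
    dict3 = {ch: n for ch, n in dict2.items() if n > 0}  # auxiliary countdown copy
    char_list = list(dict3)
    docs = {n: [] for n in range(1, max_len + 1)}

    def take(ch):
        # Each selection lowers the remaining count; exhausted characters leave the list.
        dict3[ch] -= 1
        if dict3[ch] == 0:
            char_list.remove(ch)

    # First document: every character once, one character per line.
    for ch in list(char_list):
        docs[1].append(ch)
        take(ch)

    # Second documents: random lines, cycling through the allowed line lengths.
    lengths = list(range(2, max_len + 1)) or [1]
    turn = 0
    while char_list:
        n = min(lengths[turn % len(lengths)], sum(dict3.values()))
        turn += 1
        line = []
        for _ in range(n):
            ch = rng.choice(char_list)
            line.append(ch)
            take(ch)
        docs[n].append("".join(line))
    return docs


dict2 = {"我": 3, "爱": 2, " ": 4}
docs = build_synthetic_documents(dict2, max_len=3, rng=random.Random(0))
used = Counter("".join(line for lines in docs.values() for line in lines))
print(used == Counter(dict2))  # True
```

Each key of the returned mapping corresponds to one text document whose lines all have that many characters, and the space character is treated like any other character.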
In step S104, data augmentation is performed on the current data set based on the established data set attributes, the application scenarios recognized by OCR and the synthesized data set text document, so as to obtain an augmented basic data set.
In the embodiment of the present invention, before data augmentation is performed on the current data set based on the established data set attributes, the application scenario of the OCR recognition and the synthetic data set text document, the data set attributes are established in advance. Preferably, the data set attributes include at least font size, font style, tilt angle, blur, distortion, text position, line thickness, color, inter-character spacing, background, and frame pressing, so as to provide clear indicators for the actual application scenario and generate synthetic data closer to real data. The data set attributes are described one by one below:
Font size: unlike selecting a font size in an editing tool, the font size in the synthesized data is mainly reflected in the width and height set for the picture, and, to keep the generated picture complete, a barrel effect applies (the tighter of the two dimensions limits the character size). Specifically, the size can be represented by [height, width]. For example, for the same text, [80,100] (height 80, width 100) is just enough to produce characters that fill the picture, while [80,200] produces a picture with the same font size as [80,100], except that the wider picture has more blank space on the line. Here 80, 100 and 200 are numbers of pixels.
Font style: determined by the font file used.
Tilt angle: to simulate the angular tilt that scanned text has in real life, a tilt within a given range of angles is acceptable, typically set within 30 degrees, and the tilt can be randomly chosen to be to the left or to the right.
Blur: in real situations, scanned text may be blurred to some degree. A blur function (e.g., a Gaussian blur function) may be used to simulate the blurring of the text, and the degree of blur can be set and represented by a number, where a larger number means more severe blur. For example, the degree of blur is 0-3, where 0 represents no blur, 1 slight blur, 2 moderate blur, and 3 severe blur.
Distortion: in real situations, scanned text may be deformed to some degree. A warping or distortion function may be used to simulate the degree of distortion (or deformation) of the scanned text, and the degree of distortion can also be represented by a number, where a larger number means more severe deformation. For example, the degree of distortion is 0-3, where 0 represents no deformation, 1 slight deformation, 2 moderate deformation, and 3 severe deformation.
Text position: to avoid the influence of blank areas on the recognition result, the synthesized data generally leaves blank space around the text, and the text position can be represented by a number. For example, 0, 1 and 2 indicate that the text is in the middle, on the left, and on the right, respectively.
Line thickness: to simulate the differences in stroke weight found in real text, the line thickness can also be represented by a number. For example, the line thickness is 0-3, where 1 represents normal, 0 is thinner, 2 is thicker, and 3 is thicker still. Because the outline of a Chinese character with many strokes becomes indistinct when the strokes are thick, the thickness range for Chinese characters with more than 8 strokes can be limited to 0-2.
Color: to simulate the differences in text color in real scenes, each of the three RGB channels takes a value of 0-255, and the color can be written in hexadecimal. For example, R, G and B all equal to 40 corresponds to #282828.
Inter-character spacing: so that spacing and space characters are recognized properly, the spacing can be represented by a number, for example 0-5, where the number is the spacing in pixels.
Background: to simulate the various background patterns found in practice, a background library is first established, and pictures are randomly selected from the background library for synthesis. The background library may be built for the specific recognition task.
Frame pressing: to simulate scenes in which a picture is partially covered, the top, bottom, left or right part of the picture is occluded, and the corresponding pixel values can simply be discarded. It should be noted that frame pressing must not prevent the characters from being recognized correctly. For example, the Chinese character 对 ("pair") must not be cropped so that it reads as 寸 ("cun").
After the data set attributes are established, the data augmentation parameters can be determined based on the application scenario of the OCR recognition and the established data set attributes, and a basic data set closer to real data is obtained from these parameters. The basic data set includes the open source data set and the synthetic data set. Example data augmentation parameters are shown in Table 1.
[Table 1 appears as an image in the original publication and is not reproduced here.]
TABLE 1
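As an illustration of how the attributes described above might be turned into per-sample augmentation parameters, the following sketch draws one parameter set at random. The ranges come from the attribute descriptions, not from Table 1, whose values are not available in this text. The names `AugmentParams`, `sample_params` and `rgb_to_hex`, the font and background file names, and the restriction to dark gray text colors are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class AugmentParams:
    size: tuple        # [height, width] of the picture, in pixels
    font: str          # font file that determines the font style
    tilt_deg: float    # negative = tilt left, positive = tilt right
    blur: int          # 0 none .. 3 severe
    distortion: int    # 0 none .. 3 severe
    position: int      # 0 middle, 1 left, 2 right
    thickness: int     # 0 thinner, 1 normal, 2 thicker, 3 thicker still
    color: str         # hexadecimal RGB string
    spacing: int       # inter-character spacing in pixels
    background: str    # picture chosen from the background library
    frame_press: str   # which edge of the picture is occluded, if any


def rgb_to_hex(r, g, b):
    """Write three 0-255 channel values as a hexadecimal color string."""
    return "#{:02X}{:02X}{:02X}".format(r, g, b)


def sample_params(stroke_count, fonts, backgrounds, rng=None):
    rng = rng or random.Random()
    gray = rng.randint(0, 80)  # assumption: dark gray text only
    return AugmentParams(
        size=(80, rng.choice([100, 200])),
        font=rng.choice(fonts),
        tilt_deg=rng.uniform(-30, 30),
        blur=rng.randint(0, 3),
        distortion=rng.randint(0, 3),
        position=rng.choice([0, 1, 2]),
        # Characters with more than 8 strokes are limited to thickness 0-2.
        thickness=rng.randint(0, 2 if stroke_count > 8 else 3),
        color=rgb_to_hex(gray, gray, gray),
        spacing=rng.randint(0, 5),
        background=rng.choice(backgrounds),
        frame_press=rng.choice(["none", "top", "bottom", "left", "right"]),
    )


print(rgb_to_hex(40, 40, 40))  # #282828
params = sample_params(12, ["font_a.ttf"], ["paper_01.png"], random.Random(1))
print(params)
```

A rendering step, not shown here, would consume one `AugmentParams` instance per line of the synthetic text documents to produce a picture and its label.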
In the embodiment of the invention, the recognition dictionary is established, the first word frequency dictionary is established based on the recognition dictionary and the acquired open source data set, the synthetic data set text document is established based on the first word frequency dictionary, and data augmentation is performed on the current data set based on the established data set attributes, the application scenario of the OCR recognition and the synthetic data set text document to obtain the augmented basic data set. The pertinence of the data augmentation is thereby improved while the cost of acquiring training samples for deep learning OCR algorithms is reduced.
Example two:
FIG. 2 shows an implementation flow of an OCR model training method provided by the second embodiment of the present invention, which is based on the data augmentation method described in the first embodiment. For convenience of description, only the parts related to the embodiment of the present invention are shown, as detailed below:
Training the OCR model on the basic data set alone (the open source data set and the synthetic data set) is not enough to obtain a satisfactory character recognition effect, so part of an actual data set is needed to fine-tune the model. Therefore, the following steps can be adopted when the OCR model is trained:
In step S201, the OCR model is trained using the basic data set described in the first embodiment until the model converges, to obtain a pre-training model.
In the embodiment of the present invention, the OCR model may be a model with the commonly used CRNN structure. In a specific implementation, the labels corresponding to the basic data set can be serialized, and the model is trained to convergence on the label-serialized basic data set to obtain the pre-training model.
In step S202, the basic data set and the actual labeled data set are mixed, and the pre-trained model is trained continuously using the mixed data set, so as to obtain a trained OCR model.
In the embodiment of the invention, the basic data set and the actual labeled data set are mixed, the learning rate is reduced relative to pre-training, and training continues for 1 to 2 epochs to obtain the final model. One epoch is one pass of training over all training samples.
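The second training stage described above can be sketched in a framework-independent way as follows. This is only an illustrative sketch: the sample layout (image path, label text), the function names, the `lr_scale` factor of 0.1 and the batch size are assumptions that do not appear in the description, and the actual optimizer step on the CRNN model is left to the caller.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Sample = Tuple[str, str]  # (image path, label text); hypothetical sample layout


def mix_datasets(base: Sequence[Sample], labeled: Sequence[Sample],
                 seed: int = 0) -> List[Sample]:
    """Mix the basic data set with the actual labeled data set and shuffle."""
    mixed = list(base) + list(labeled)
    random.Random(seed).shuffle(mixed)
    return mixed


def finetune_schedule(mixed: Sequence[Sample], pretrain_lr: float,
                      lr_scale: float = 0.1, epochs: int = 2,
                      batch_size: int = 2
                      ) -> Iterator[Tuple[int, float, List[Sample]]]:
    """Yield (epoch, learning rate, batch) for the continued training.

    The learning rate is reduced relative to pre-training (the factor is an
    assumption), and training runs for a small number of epochs.
    """
    lr = pretrain_lr * lr_scale
    for epoch in range(epochs):
        for start in range(0, len(mixed), batch_size):
            yield epoch, lr, list(mixed[start:start + batch_size])
```

A caller would iterate over `finetune_schedule(...)` and apply one optimizer step of the pre-training model per yielded batch.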
By adopting the basic data set obtained in the first embodiment, a satisfactory result can be obtained with a limited actual data set. For example, when the basic data set comprises 1000 pieces, fine-tuning with only 3 to 5 pieces of actual data is enough to obtain a satisfactory result. Otherwise, a large amount of manpower and material resources would be needed to collect and label real data to obtain a good result.
In the embodiment of the invention, the OCR model is trained using the basic data set until the model converges to obtain the pre-training model, the basic data set and the actual labeled data set are mixed, and the pre-training model is further trained using the mixed data set to obtain the trained OCR model, so that the model training effect is ensured and the cost of acquiring real data is reduced.
Example three:
Fig. 3 shows the structure of a data augmentation apparatus for OCR recognition according to the third embodiment of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown, which include:
a recognition dictionary creating unit 31 for creating a recognition dictionary;
a first dictionary establishing unit 32 for establishing a first word frequency dictionary based on the recognition dictionary and the acquired open source data set;
a document creating unit 33 for creating a synthetic data set text document based on the first word frequency dictionary; and
a data augmentation unit 34 for performing data augmentation on the current data set based on the established data set attributes, the application scenario of the OCR recognition and the synthetic data set text document, so as to obtain an augmented basic data set.
Preferably, the recognition dictionary creating unit includes:
a first adjusting unit for adjusting the character positions in the recognition dictionary according to character type; and/or
a second adjusting unit for adjusting the character positions in the recognition dictionary according to the calculated glyph similarity of Chinese characters; and/or
a label serialization unit for serializing the label of each character in the recognition dictionary.
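As an illustration of ordering the recognition dictionary by character type and serializing labels, a minimal sketch is given below. The four coarse character types, the reservation of index 0 (for example for a CTC blank when a CRNN is used) and all function names are assumptions made for this example; the adjustment by glyph similarity is not covered here.

```python
from typing import Dict, Iterable, List


def char_type(ch: str) -> int:
    """Coarse character type used for grouping (this grouping is an assumption)."""
    if ch.isdigit():
        return 0                      # digits
    if ch.isascii() and ch.isalpha():
        return 1                      # Latin letters
    if "\u4e00" <= ch <= "\u9fff":
        return 2                      # common CJK unified ideographs
    return 3                          # punctuation and everything else


def build_recognition_dictionary(chars: Iterable[str]) -> Dict[str, int]:
    """Group characters by type, then assign each one an integer label."""
    ordered = sorted(set(chars), key=lambda c: (char_type(c), c))
    # Index 0 is left unused so it can serve as a blank label (assumption).
    return {ch: index + 1 for index, ch in enumerate(ordered)}


def serialize_label(text: str, dictionary: Dict[str, int]) -> List[int]:
    """Convert a label string into its sequence of integer labels."""
    return [dictionary[ch] for ch in text]
```

For example, building the dictionary from the characters of "b中a1,0" places the digits first, then the letters, then the Chinese character, then the punctuation mark.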
Preferably, the first dictionary establishing unit includes:
an index document establishing unit for establishing an index document according to the open source data set; and
a dictionary establishing subunit for traversing the index document and counting each character in the recognition dictionary to obtain the first word frequency dictionary.
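The counting performed by these units might look like the following sketch. The index document format (one line per sample, image path and label text separated by a tab) and the function name are assumptions; the description does not specify the format.

```python
from collections import Counter
from typing import Dict, Iterable


def build_first_word_frequency(index_lines: Iterable[str],
                               recognition_chars: Iterable[str]) -> Dict[str, int]:
    """Count how often each recognition-dictionary character occurs in the labels."""
    counts: Counter = Counter()
    for line in index_lines:
        # Assumed line format: "<image path>\t<label text>"
        _, _, label = line.rstrip("\n").partition("\t")
        counts.update(label)          # counts every character of the label
    # Keep only dictionary characters; characters never seen get frequency 0.
    return {ch: counts.get(ch, 0) for ch in recognition_chars}
```

Characters that occur in the open source labels but not in the recognition dictionary are ignored, while dictionary characters that never occur are kept with a frequency of 0 so that they can be balanced later.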
Preferably, the document creation unit includes:
a second dictionary establishing unit for balancing the word frequencies in the first word frequency dictionary to obtain a second word frequency dictionary; and
a second document establishing unit for establishing the synthetic data set text document based on the second word frequency dictionary.
Preferably, the second dictionary establishing unit includes:
a first word frequency traversing unit for traversing the first word frequency of each character in the first word frequency dictionary;
a second word frequency obtaining unit for obtaining a second word frequency corresponding to each character according to a random number selected for each character, wherein the second word frequency is the sum of the first word frequency and the random number, the second word frequency is within a preset word frequency range, the minimum word frequency in the preset word frequency range is greater than the minimum word frequency in the first word frequency dictionary, and the maximum word frequency in the preset word frequency range is greater than the maximum word frequency in the first word frequency dictionary; and
a second dictionary establishing subunit for taking the second word frequency as the word frequency of the corresponding character to obtain the second word frequency dictionary.
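The balancing rule above can be sketched as follows. The description only requires that the second word frequency equals the first word frequency plus a random number and falls within the preset range; drawing the random number uniformly and keeping it non-negative are assumptions of this example, as is the function name.

```python
import random
from typing import Dict, Optional


def balance_word_frequency(first: Dict[str, int], min_freq: int, max_freq: int,
                           seed: Optional[int] = None) -> Dict[str, int]:
    """Return a second word frequency dictionary within [min_freq, max_freq]."""
    if min_freq > max_freq:
        raise ValueError("min_freq must not exceed max_freq")
    if min_freq <= min(first.values()) or max_freq <= max(first.values()):
        raise ValueError("preset range must exceed both extremes of the first dictionary")
    rng = random.Random(seed)
    second = {}
    for ch, freq in first.items():
        low = max(0, min_freq - freq)   # lift rare characters up to min_freq
        high = max_freq - freq          # never exceed max_freq
        second[ch] = freq + rng.randint(low, high)
    return second
```

Because the preset maximum is greater than every first word frequency, `high` is always positive, so a valid random number exists for every character.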
The second document creating unit includes:
a character range acquisition unit for acquiring the character range of a line of text in the application scenario; and
a document establishing subunit for establishing the synthetic data set text document based on the character range, wherein the synthetic data set text document comprises at least one text document, the number of characters of each line of text in each text document is within the character range, and the word frequency of each character in the synthetic data set text document is the same as the word frequency of the corresponding character in the second word frequency dictionary.
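One possible way to satisfy both constraints (line length within the character range, and per-character counts equal to the second word frequency dictionary) is sketched below. Shuffling a character pool and cutting it into lines is an assumption of this example rather than the method prescribed by the description, and the function names are hypothetical; the result is a single list of lines rather than several documents.

```python
import random
from typing import Dict, List, Optional


def split_lengths(total: int, lo: int, hi: int, rng: random.Random) -> List[int]:
    """Split `total` characters into line lengths that all lie in [lo, hi]."""
    k = -(-total // hi)               # fewest lines that can hold all characters
    if k * lo > total:
        raise ValueError("character count cannot be split into lines within the range")
    lengths = [lo] * k
    extra = total - k * lo
    open_idx = list(range(k))         # lines that can still grow
    while extra > 0:
        j = rng.randrange(len(open_idx))
        i = open_idx[j]
        lengths[i] += 1
        extra -= 1
        if lengths[i] == hi:
            open_idx.pop(j)
    return lengths


def build_synthetic_lines(second: Dict[str, int], lo: int, hi: int,
                          seed: Optional[int] = None) -> List[str]:
    """Build text lines whose character counts match `second` exactly."""
    rng = random.Random(seed)
    pool = [ch for ch, n in second.items() for _ in range(n)]
    rng.shuffle(pool)
    lines, pos = [], 0
    for length in split_lengths(len(pool), lo, hi, rng):
        lines.append("".join(pool[pos:pos + length]))
        pos += length
    return lines
```

The generated lines are random character sequences, not natural sentences; they are intended only as text content for rendering synthetic line images.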
Preferably, the dataset properties include at least font size, font style, tilt angle, blur, distortion, text position, line thickness, color, interword spacing, background, and frame.
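For illustration, the listed data set attributes could be grouped into a single record and sampled per synthetic image, as in the sketch below. All value ranges, the style and background names, and the class and function names are hypothetical placeholders; they are not taken from Table 1 and would in practice be chosen according to the application scenario.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DatasetAttributes:
    font_size: int
    font_style: str
    tilt_angle: float
    blur: float
    distortion: float
    text_position: Tuple[int, int]
    line_thickness: int
    color: Tuple[int, int, int]
    interword_spacing: int
    background: str
    frame: bool


def sample_attributes(rng: random.Random) -> DatasetAttributes:
    """Draw one set of rendering attributes (all ranges are hypothetical)."""
    return DatasetAttributes(
        font_size=rng.randint(16, 48),
        font_style=rng.choice(["regular", "bold", "italic"]),
        tilt_angle=rng.uniform(-5.0, 5.0),
        blur=rng.uniform(0.0, 1.5),
        distortion=rng.uniform(0.0, 0.2),
        text_position=(rng.randint(0, 10), rng.randint(0, 10)),
        line_thickness=rng.randint(1, 3),
        color=(rng.randint(0, 80), rng.randint(0, 80), rng.randint(0, 80)),
        interword_spacing=rng.randint(0, 4),
        background=rng.choice(["plain", "paper", "noise"]),
        frame=rng.random() < 0.2,
    )
```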
In the embodiment of the present invention, each unit of the data augmentation apparatus for OCR recognition may be implemented by a corresponding hardware or software unit, and each unit may be an independent software or hardware unit, or may be integrated into a single software or hardware unit, which is not limited herein. For a specific implementation of each unit of the data augmentation apparatus for OCR recognition, reference may be made to the description of the foregoing method embodiment, and details are not described here again.
Example four:
Fig. 4 shows the structure of an OCR model training apparatus according to the fourth embodiment of the present invention, which is based on the third embodiment. For convenience of description, only the parts related to the embodiment of the present invention are shown, which include:
a pre-training unit 41 for training the OCR model using the basic data set until the model converges, so as to obtain a pre-training model; and
a secondary training unit 42 for mixing the basic data set and the actual labeled data set, and continuing to train the pre-training model using the mixed data set, so as to obtain a trained OCR model.
The basic data set is obtained by the data augmentation apparatus for OCR recognition described in the third embodiment.
In the embodiment of the present invention, each unit of the OCR model training apparatus may be implemented by a corresponding hardware or software unit, and each unit may be an independent software or hardware unit, or may be integrated into a software or hardware unit, which is not limited herein. For the specific implementation of each unit of the OCR model training apparatus, reference may be made to the description of the foregoing method embodiment, and details are not repeated here.
Example five:
Fig. 5 shows the structure of a data augmentation device for OCR recognition according to the fifth embodiment of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown.
The data augmentation apparatus 5 for OCR recognition of the embodiment of the present invention includes a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and executable on the processor 50. The processor 50, when executing the computer program 52, implements the steps in the above-described method embodiments, such as the steps S101 to S104 shown in fig. 1. Alternatively, the processor 50, when executing the computer program 52, implements the functions of the units in the above-described device embodiments, such as the functions of the units 31 to 34 shown in fig. 3.
Example six:
In an embodiment of the present invention, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the steps in the above-described method embodiments, e.g., steps S101 to S104 shown in fig. 1. Alternatively, the computer program, when executed by the processor, implements the functions of the units in the above-described device embodiments, such as the functions of the units 31 to 34 shown in fig. 3.
The computer-readable storage medium of the embodiments of the present invention may include any entity or device capable of carrying computer program code, or a recording medium such as a ROM/RAM, a magnetic disk, an optical disk, or a flash memory.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (9)

1. A data augmentation method for OCR recognition, said method comprising the steps of:
establishing a recognition dictionary;
establishing a first word frequency dictionary based on the recognition dictionary and the acquired open source data set;
establishing a synthetic data set text document based on the first word frequency dictionary;
performing data augmentation on the current data set based on the established data set attributes, the application scenario of the OCR recognition and the synthetic data set text document to obtain an augmented basic data set;
the step of establishing a synthetic data set text document based on the first word frequency dictionary comprises:
balancing the word frequencies in the first word frequency dictionary to obtain a second word frequency dictionary;
establishing the synthetic data set text document based on the second word frequency dictionary;
the step of balancing the word frequencies in the first word frequency dictionary to obtain a second word frequency dictionary comprises:
traversing the first word frequency of each character in the first word frequency dictionary;
obtaining a second word frequency corresponding to each character according to a random number selected for each character, wherein the second word frequency is the sum of the first word frequency and the random number, the second word frequency is within a preset word frequency range, the minimum word frequency in the preset word frequency range is greater than the minimum word frequency in the first word frequency dictionary, and the maximum word frequency in the preset word frequency range is greater than the maximum word frequency in the first word frequency dictionary;
and taking the second word frequency as the word frequency of the corresponding character to obtain the second word frequency dictionary.
2. The method of claim 1, wherein the step of creating a recognition dictionary comprises:
adjusting the character positions in the recognition dictionary according to character type; and/or
adjusting the character positions in the recognition dictionary according to the calculated glyph similarity of Chinese characters; and/or
serializing the label of each character in the recognition dictionary.
3. The method of claim 1, wherein the step of establishing a first word frequency dictionary based on the recognition dictionary and the acquired open source data set further comprises:
establishing an index document according to the open source data set;
and traversing the index document, and counting each character in the recognition dictionary to obtain the first word frequency dictionary.
4. The method of claim 1, wherein said step of establishing said synthetic data set text document based on said second word frequency dictionary comprises:
acquiring a character range of a line of text in the application scenario;
and establishing the synthetic data set text document based on the character range, wherein the synthetic data set text document comprises at least one text document, the number of characters of each line of text in each text document is within the character range, and the word frequency of each character in the synthetic data set text document is the same as the word frequency of the corresponding character in the second word frequency dictionary.
5. The method of claim 1, wherein the dataset properties include at least font size, font style, tilt angle, blur, distortion, text position, line thickness, color, interword space, background, and frame.
6. An OCR model training method based on the data augmentation method for OCR recognition according to any one of claims 1 to 5, comprising:
training an OCR model by using the basic data set until the model converges to obtain a pre-training model;
and mixing the basic data set and an actual labeled data set, and continuing to train the pre-training model using the mixed data set to obtain a trained OCR model.
7. A data augmentation apparatus for OCR recognition, the apparatus comprising:
the recognition dictionary establishing unit is used for establishing a recognition dictionary;
the first dictionary establishing unit is used for establishing a first word frequency dictionary based on the recognition dictionary and the acquired open source data set;
the document establishing unit is used for establishing a synthetic data set text document based on the first word frequency dictionary; and
the data augmentation unit is used for performing data augmentation on the current data set based on the established data set attributes, the application scenario of the OCR recognition and the synthetic data set text document to obtain an augmented basic data set;
the document creation unit further includes:
the second dictionary establishing unit is used for balancing the word frequency in the first word frequency dictionary to obtain a second word frequency dictionary; and
the second document establishing unit is used for establishing a synthesized data set text document based on the second word frequency dictionary;
the second dictionary establishing unit includes:
the first word frequency traversing unit is used for traversing the first word frequency of each character in the first word frequency dictionary;
the second word frequency obtaining unit is used for obtaining a second word frequency corresponding to each character according to the random number selected for each character, wherein the second word frequency is the sum of the first word frequency and the random number, the second word frequency is in a preset word frequency range, the minimum word frequency in the preset word frequency range is larger than the minimum word frequency in the first word frequency dictionary, and the maximum word frequency in the preset word frequency range is larger than the maximum word frequency in the first word frequency dictionary; and
the second dictionary establishing subunit is used for taking the second word frequency as the word frequency of the corresponding character to obtain the second word frequency dictionary.
8. A data augmentation device for OCR recognition comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that said processor implements the steps of the method according to any one of claims 1 to 6 when executing said computer program.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202110991555.7A 2021-08-27 2021-08-27 Data augmentation method, device and equipment for OCR recognition and storage medium Active CN113435426B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110991555.7A CN113435426B (en) 2021-08-27 2021-08-27 Data augmentation method, device and equipment for OCR recognition and storage medium

Publications (2)

Publication Number Publication Date
CN113435426A CN113435426A (en) 2021-09-24
CN113435426B true CN113435426B (en) 2021-11-16

Family

ID=77798123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110991555.7A Active CN113435426B (en) 2021-08-27 2021-08-27 Data augmentation method, device and equipment for OCR recognition and storage medium

Country Status (1)

Country Link
CN (1) CN113435426B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079428A (en) * 2019-12-27 2020-04-28 出门问问信息科技有限公司 Word segmentation and industry dictionary construction method and device and readable storage medium

Family To Family Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101571852B (en) * 2008-04-28 2011-04-20 富士通株式会社 Dictionary generating device and information retrieving device
CN111563377A (en) * 2019-02-13 2020-08-21 北京京东尚科信息技术有限公司 Data enhancement method and device
CN111797908B (en) * 2020-06-18 2022-08-09 浪潮金融信息技术有限公司 Training set generation method of deep learning model for print character recognition
CN112633268B (en) * 2020-12-21 2024-08-23 江苏国光信息产业股份有限公司 OCR (optical character recognition) method and OCR recognition system based on domestic platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant