CN113435426B - Data augmentation method, device and equipment for OCR recognition and storage medium - Google Patents
- Publication number
- CN113435426B (CN202110991555.7A)
- Authority
- CN
- China
- Prior art keywords
- word frequency
- dictionary
- data set
- character
- recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Character Discrimination (AREA)
Abstract
The invention belongs to the technical field of character recognition and provides a data augmentation method, device, equipment and storage medium for OCR recognition. The method comprises: establishing a recognition dictionary; establishing a first word frequency dictionary based on the recognition dictionary and an acquired open source data set; establishing a synthetic data set text document based on the first word frequency dictionary; and performing data augmentation on the current data set based on established data set attributes, the application scenario of the OCR (optical character recognition) recognition and the synthetic data set text document to obtain an augmented basic data set. This reduces the cost of obtaining training samples for deep learning OCR algorithms and makes the data augmentation more targeted.
Description
Technical Field
The invention belongs to the technical field of character recognition, and particularly relates to a data augmentation method, device, equipment and storage medium for OCR recognition.
Background
OCR (Optical Character Recognition) refers to a process in which an electronic device examines characters printed on paper, determines their shapes by detecting dark and light patterns, and then translates the shapes into computer text using a character recognition method. OCR recognition has a wide range of applications, such as document recognition.
There are currently two main approaches to OCR: methods based on conventional OCR algorithms and methods based on deep learning. In recent years, deep learning network structures have made OCR recognition far more accurate and stable than conventional OCR methods. However, deep learning relies on a large number of training samples. For Chinese document recognition, which covers Chinese characters, English letters and digits, there are usually 6K-8K common characters, the required amount of data is in the millions or even tens of millions, and OCR recognition performance depends on the number and variety of the acquired data sets. For Chinese documents, tens of millions of samples are needed to obtain ideal OCR performance, and relying on manual labeling alone is unrealistic. Existing data augmentation approaches mainly target varied backgrounds, whereas Chinese document recognition has a fairly uniform background but contains many kinds of characters, in particular rare characters with low word frequency.
Disclosure of Invention
The invention aims to provide a data augmentation method, device, equipment and storage medium for OCR recognition, so as to solve the problems in the prior art that acquiring training samples through manual labeling is costly and that training samples obtained through data augmentation are poorly targeted.
In one aspect, the present invention provides a data augmentation method for OCR recognition, the method comprising the steps of:
establishing a recognition dictionary;
establishing a first word frequency dictionary based on the recognition dictionary and the acquired open source data set;
establishing a synthetic data set text document based on the first word frequency dictionary;
and performing data augmentation on the current data set based on the established data set attributes, the application scenario of the OCR recognition and the synthetic data set text document to obtain an augmented basic data set.
Preferably, the step of establishing a recognition dictionary includes:
adjusting the character position in the recognition dictionary according to the character type; and/or
adjusting the character position in the recognition dictionary according to the calculated glyph similarity of the Chinese characters; and/or
serializing the label of each character in the recognition dictionary.
Preferably, the step of establishing a first word frequency dictionary based on the recognition dictionary and the acquired open source data set further includes:
establishing an index document according to the open source data set;
and traversing the index document, and counting each character in the recognition dictionary to obtain the first word frequency dictionary.
Preferably, the step of establishing a synthetic data set text document based on the first word frequency dictionary comprises:
balancing the word frequency in the first word frequency dictionary to obtain a second word frequency dictionary;
and establishing the synthetic data set text document based on the second word frequency dictionary.
Preferably, the step of equalizing the word frequencies in the first word frequency dictionary to obtain a second word frequency dictionary includes:
traversing the first word frequency of each character in the first word frequency dictionary;
obtaining a second word frequency corresponding to each character according to a random number selected for each character, wherein the second word frequency is the sum of the first word frequency and the random number, the second word frequency is within a preset word frequency range, the minimum word frequency in the preset word frequency range is greater than the minimum word frequency in the first word frequency dictionary, and the maximum word frequency in the preset word frequency range is greater than the maximum word frequency in the first word frequency dictionary;
and taking the second word frequency as the word frequency of the corresponding character to obtain a second word frequency dictionary.
Preferably, the step of generating the synthesized data set text document based on the second word frequency dictionary comprises:
acquiring a character range of a line text in the application scene;
and establishing the synthesized data set text document based on the character range, wherein the synthesized data set text document comprises at least one text document, the number of characters of the line text in each text document is in the character range, and the word frequency of each character in the synthesized data set text document is the same as the word frequency of the corresponding character in the second word frequency dictionary.
Preferably, the data set attributes include at least font size, font style, tilt angle, blur, distortion, text position, line thickness, color, inter-character spacing, background, and frame pressing.
In another aspect, the present invention provides an OCR model training method based on the data augmentation method for OCR recognition as described above, including:
training an OCR model by using the basic data set until the model converges to obtain a pre-training model;
and mixing the basic data set and the actual labeled data set, and continuing to train the pre-training model by using the mixed data set to obtain the trained OCR model.
In another aspect, the present invention provides a data augmentation apparatus for OCR recognition, including:
the recognition dictionary establishing unit is used for establishing a recognition dictionary;
the first dictionary establishing unit is used for establishing a first word frequency dictionary based on the recognition dictionary and the acquired open source data set;
the document establishing unit is used for establishing a synthetic data set text document based on the first word frequency dictionary; and
the data augmentation unit is used for performing data augmentation on the current data set based on the established data set attributes, the application scenario of the OCR recognition and the synthetic data set text document to obtain an augmented basic data set.
In another aspect, the present invention further provides a data augmentation device for OCR recognition, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method when executing the computer program.
In another aspect, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described above.
According to the invention, a recognition dictionary is established, a first word frequency dictionary is established based on the recognition dictionary and the acquired open source data set, a synthetic data set text document is established based on the first word frequency dictionary, and data augmentation is performed on the current data set based on the established data set attributes, the application scenario of the OCR recognition and the synthetic data set text document to obtain the augmented basic data set. This reduces the cost of acquiring training samples for deep learning OCR algorithms and, at the same time, makes the data augmentation more targeted.
Drawings
FIG. 1 is a flowchart of an implementation of a data augmentation method for OCR recognition according to a first embodiment of the present invention;
FIG. 2 is a flowchart of an implementation of an OCR model training method according to a second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a data augmentation apparatus for OCR recognition according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an OCR model training apparatus according to a fourth embodiment of the present invention; and
FIG. 5 is a schematic structural diagram of a data augmentation device for OCR recognition according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of specific implementations of the present invention is provided in conjunction with specific embodiments:
Embodiment one:
FIG. 1 shows the implementation flow of a data augmentation method for OCR recognition according to the first embodiment of the present invention. For convenience of description, only the parts related to this embodiment are shown, detailed as follows:
In step S101, a recognition dictionary is established.
In the embodiment of the invention, the recognition dictionary can contain Chinese characters, English letters, and Chinese and English punctuation marks. The recognition dictionary can be established according to China's Chinese character encoding standards. The standard GB2312-80 specifies 3755 level-1 Chinese characters and 3008 level-2 Chinese characters, 6763 in total. Level-2 Chinese characters appear less frequently than level-1 characters but still occur in daily life, for example in the words for wonton, cricket and crucian carp.
When the recognition dictionary is established, the positions of the characters can be adjusted as appropriate. Preferably, the positions of characters in the recognition dictionary are adjusted according to character type so that characters of the same type are adjacent: Chinese characters are grouped together, punctuation marks are grouped together, and English letters are grouped together. Further preferably, the positions of the characters are adjusted according to the calculated glyph similarity of the Chinese characters so that characters with higher similarity are adjacent. For example, the visually similar characters meaning "person", "in", "big" and "too" are moved to adjacent positions.
When the recognition dictionary is built, it should be considered that model training requires a label for each picture in addition to the picture itself. The label of each character in the recognition dictionary is therefore preferably serialized so that the serialized labels meet the requirements of model training. Specifically, the line number of each character in the dictionary may be used as the label of that character. For example, if "I" is on line 32, the sequence number corresponding to "I" is "32".
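The dictionary layout and label serialization described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_recognition_dictionary`, the type keys and the type order are assumptions, and the glyph-similarity reordering is omitted because the text does not specify how similarity is calculated.

```python
def build_recognition_dictionary(chars_by_type):
    """Return (ordered_chars, label_of) with same-type characters adjacent.

    chars_by_type maps a hypothetical type name to a list of characters.
    """
    ordered = []
    seen = set()
    # Assumed type order; the text only requires that characters of the
    # same type end up next to each other.
    for type_name in ("chinese", "english", "punctuation"):
        for ch in chars_by_type.get(type_name, []):
            if ch not in seen:
                seen.add(ch)
                ordered.append(ch)
    # Serialize labels: a character's label is its 1-based line number
    # in the dictionary file (one character per line).
    label_of = {ch: line_no for line_no, ch in enumerate(ordered, start=1)}
    return ordered, label_of


# Illustrative input only; a real dictionary would hold thousands of characters.
chars, labels = build_recognition_dictionary({
    "chinese": ["人", "入", "大", "太"],
    "english": ["a", "b"],
    "punctuation": [",", "。"],
})
```

Writing `chars` to a file one character per line would reproduce the "line number as label" convention described above.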
In step S102, a first word frequency dictionary is established based on the recognition dictionary and the acquired open source data set.
In the embodiment of the present invention, the word frequency dictionary corresponding to the open source data set is referred to as the first word frequency dictionary. In addition to the characters, the first word frequency dictionary contains the word frequency of each character, which can be determined from the number of times the character appears in the open source data set. Preferably, an index document is established according to the open source data set, the index document is traversed, and each character in the recognition dictionary is counted to obtain the first word frequency dictionary. In a specific implementation, the index document may include a picture path and the corresponding label characters, connected by a space. For example, if the address of picture 1 is /dataset/ocr and the corresponding text is "I love China", it is saved as: "/data/ocr/1.jpg I love China". After the index document is established, the recognition dictionary is initialized to obtain an initial first word frequency dictionary in which the word frequency of every character is 0. The index document is then traversed line by line, the characters in the recognition dictionary are counted, and the word frequency of each character is updated according to the counting result, yielding the first word frequency dictionary based on the open source data. For example, if the open source data set includes picture 1 containing the characters "I love China" (four Chinese characters, glossed here as 'I', 'love', 'middle' and 'country'), the initial first dictionary is dict1 = {'I': 0, 'love': 0, 'middle': 0, 'country': 0, ..., '*': 0}, and after picture 1 is processed the dictionary becomes dict1 = {'I': 1, 'love': 1, 'middle': 1, 'country': 1, ..., '*': 0}.
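The counting procedure above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function name `build_first_word_freq` is hypothetical, the index document is passed as a list of lines, and picture paths are assumed to contain no spaces so that the first space separates the path from the label.

```python
def build_first_word_freq(index_lines, recognition_chars):
    """Count how often each recognition-dictionary character occurs in the labels."""
    # Initial first word frequency dictionary: every character starts at 0.
    dict1 = {ch: 0 for ch in recognition_chars}
    for line in index_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # "<picture path> <label characters>": split on the first space only,
        # because the label itself may contain space characters.
        _path, _, label = line.partition(" ")
        for ch in label:
            if ch in dict1:  # characters outside the dictionary are ignored
                dict1[ch] += 1
    return dict1


# Illustrative index document with two entries.
index_lines = ["/data/ocr/1.jpg 我爱中国", "/data/ocr/2.jpg 我爱 OCR"]
dict1 = build_first_word_freq(index_lines, ["我", "爱", "中", "国", " ", "*"])
```

Characters that never occur in the open source labels, such as `*` here, keep a frequency of 0, which is what the later balancing step compensates for.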
In step S103, a synthetic data set text document is created based on the first word frequency dictionary.
In the embodiment of the invention, the synthetic data set text document can be established based on the first word frequency dictionary in consideration of the fact that the synthetic data set needs the corresponding text document for generating the picture. The number of documents in the synthesized data set text document can be set according to actual needs, the synthesized data set text document usually comprises at least one text document, the number of characters of a line text in each text document is usually the same, the number of characters of the line text in different text documents is different, and the word frequency of each character in the synthesized data set text document is the same as the word frequency of a corresponding character in the second word frequency dictionary.
In order to avoid the difference of different character recognition accuracy rates caused by the difference of word frequencies in the actual application process, preferably, the word frequencies in the first word frequency dictionary are balanced to obtain a second word frequency dictionary, and a synthesized data set text document is established based on the second word frequency dictionary, so that the synthesized data set text document is reasonably established by balancing the number of the word frequencies of different characters. It should be noted that the second word frequency dictionary refers to a word frequency dictionary corresponding to the entire data set (open source + synthesis).
When the word frequencies in the first word frequency dictionary are balanced to obtain the second word frequency dictionary, preferably, the first word frequency of each character in the first word frequency dictionary is traversed, the second word frequency corresponding to each character is obtained according to a random number selected for that character, and the second word frequency is used as the word frequency of the corresponding character to obtain the second word frequency dictionary, thereby balancing the word frequencies of different characters. The second word frequency is the sum of the first word frequency and the random number, and lies within a preset word frequency range. For convenience of description, the preset word frequency range is denoted [freq_min, freq_max], and the minimum and maximum word frequencies in the first word frequency dictionary are denoted dic1_min and dic1_max respectively. The minimum of the preset range is greater than the minimum word frequency in the first word frequency dictionary, i.e. freq_min > dic1_min, and the maximum of the preset range is greater than the maximum word frequency in the first word frequency dictionary, i.e. freq_max > dic1_max. To ensure the final recognition effect, preferably freq_min > 500.
In a specific implementation, the word frequency range may be preset, and the second word frequency dictionary dict2 is then initialized so that every word frequency in dict2 is freq_min. Each character in the first word frequency dictionary dict1 is then traversed, and for any character char with corresponding word frequency char_freq, a random number num is selected such that freq_min <= num + char_freq <= freq_max. That is, num is randomly selected within the range [max(0, freq_min - char_freq), max(0, freq_max - char_freq)], and the value of num is assigned to the word frequency corresponding to the character char in dict2 to obtain the second word frequency dictionary.
When the synthetic data set text document is established based on the second word frequency dictionary, preferably, the character range of the line text in the application scenario is obtained and the synthetic data set text document is established based on that character range, so that the document reflects the actual application scenario and is more targeted. The synthetic data set text document comprises at least one text document, the number of characters of a line of text in each text document is within the character range, and the word frequency of each character in the synthetic data set text document is the same as the word frequency of the corresponding character in the second word frequency dictionary. The minimum value of the character range of a line of text is typically 1. It should be noted that a space (" ") is also a character.
Further, the synthetic data set text document may include a first document and at least one second document. The first document includes all characters in the second word frequency dictionary, and each of its lines is a single character. Each line of a second document consists of several characters, with the count within the above character range, and the characters in those lines may be randomly selected. Lines within the same text document have the same number of characters, and lines in different text documents have different numbers of characters. When the synthetic data set text document is established, a third word frequency dictionary dict3 = dict2 is initialized in advance, and a character list is established from all characters in the second word frequency dictionary. The lines of the first document are selected from the character list, and the lines of the second documents are randomly selected from the character list. Each time a character is selected, its word frequency in dict3 is reduced by one. When the word frequency of any character in dict3 reaches zero, that character is removed from the character list. Once every word frequency in dict3 is 0, i.e. the character list is empty, the synthetic data set text document is complete. The third word frequency dictionary dict3 may be understood as an auxiliary dictionary for generating the text documents.
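A minimal sketch of the balancing step is given below. It follows the statement that the second word frequency is the first word frequency plus a random number drawn from [max(0, freq_min - char_freq), max(0, freq_max - char_freq)]. The function name `balance_word_freq`, the returned `extra` dictionary (the random number chosen per character) and the use of a seeded `random.Random` are assumptions made for illustration.

```python
import random


def balance_word_freq(dict1, freq_min, freq_max, rng=None):
    """Return (dict2, extra): balanced frequencies and the random number per character."""
    if rng is None:
        rng = random.Random()
    # The preset range must exceed the minimum and maximum of the first dictionary.
    if freq_min <= min(dict1.values()) or freq_max <= max(dict1.values()):
        raise ValueError("need freq_min > dic1_min and freq_max > dic1_max")
    dict2, extra = {}, {}
    for char, char_freq in dict1.items():
        low = max(0, freq_min - char_freq)
        high = max(0, freq_max - char_freq)
        num = rng.randint(low, high)   # inclusive on both ends
        extra[char] = num
        dict2[char] = char_freq + num  # freq_min <= dict2[char] <= freq_max
    return dict2, extra


# Illustrative frequencies; freq_min > 500 as preferred in the text.
dict1 = {"我": 0, "爱": 700, "中": 1200}
dict2, extra = balance_word_freq(dict1, freq_min=600, freq_max=1500,
                                 rng=random.Random(0))
```

Rare characters (frequency near 0) receive the largest top-up, while characters that are already frequent receive little or none, which is what evens out the distribution.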
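The document generation procedure with the auxiliary dictionary can be sketched as follows. This is one possible reading, not a definitive implementation: the function name `build_synthetic_documents`, the round-robin choice of line length, and the handling of leftover characters (written as single-character lines when too few remain to fill a longer line) are assumptions; the word frequency dictionary passed in is taken to hold the number of occurrences to place in the synthesized documents.

```python
import random


def build_synthetic_documents(dict2, char_range, rng=None):
    """Return (docs, dict3); docs maps a line length to that document's lines."""
    if rng is None:
        rng = random.Random()
    min_len, max_len = char_range
    dict3 = dict(dict2)                       # auxiliary countdown dictionary
    char_list = [c for c, n in dict3.items() if n > 0]
    remaining = sum(dict3.values())
    docs = {n: [] for n in range(min_len, max_len + 1)}
    docs.setdefault(1, [])                    # first document: single characters

    def take(ch):
        nonlocal remaining
        dict3[ch] -= 1
        remaining -= 1
        if dict3[ch] == 0:
            char_list.remove(ch)              # exhausted characters leave the list

    # First document: every character appears once, one character per line.
    for ch in list(char_list):
        docs[1].append(ch)
        take(ch)

    lengths = [n for n in docs if n > 1]
    i = 0
    while char_list:
        n = lengths[i % len(lengths)] if lengths else 1
        i += 1
        if remaining < n:
            n = 1                             # assumption: leftovers become single lines
        line = []
        for _ in range(n):
            ch = rng.choice(char_list)
            line.append(ch)
            take(ch)
        docs[n].append("".join(line))
    return docs, dict3


docs, dict3 = build_synthetic_documents({"我": 3, "爱": 2, "中": 4, " ": 1},
                                        char_range=(1, 3), rng=random.Random(0))
```

When the loop ends, every count in `dict3` is zero, so the character totals across all documents match the input dictionary exactly.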
In step S104, data augmentation is performed on the current data set based on the established data set attributes, the application scenario of the OCR recognition and the synthetic data set text document, so as to obtain an augmented basic data set.
In the embodiment of the present invention, the data set attributes are established before this data augmentation is performed. Preferably, the data set attributes include at least font size, font style, tilt angle, blur, distortion, text position, line thickness, color, inter-character spacing, background, and frame pressing, so as to provide clear indicators for the actual application scenario and generate synthesized data closer to real data. The data set attributes are described one by one below:
Font size: unlike selecting a font size in an editing tool, the font size in the synthesized data is mainly reflected in the width and height settings of the picture, and a barrel effect applies in order to keep the generated picture complete. Specifically, the size can be represented by [height, width]. For example, for the same text, [80,100], i.e. a height of 80 and a width of 100, is sufficient to produce characters that fill the picture, while [80,200] produces a picture with the same font size as [80,100], except that the line contains more blank space. Here 80, 100 and 200 are in pixels.
Font: determined by the different font files.
Inclination angle: to simulate the fact that the scanned text will have an angular tilt in real life, an angular tilt within a range of tilt angles is acceptable, typically set within 30 degrees, wherein the tilt angles can be randomly selected to tilt left or right.
Blurring: in a real situation, the scanned text may have a certain degree of blur, a blur function (e.g., a gaussian blur function) may be used to simulate the degree of blur of the text, and the degree of blur may be set, and the degree of blur may be represented by a number, and the larger the number, the more serious the blur. For example, the degree of blur is 0-3, where 0 represents no blur, 1 represents slight blur, 2 represents moderate blur, and 3 represents severe blur.
Distortion: in a real situation, the scanned text may be deformed to some degree. A warping or distortion function may be used to simulate the degree of distortion (or deformation) of the scanned text, and the degree of distortion may also be represented by a number; the larger the number, the more severe the deformation. For example, the degree of distortion is 0-3, where 0 represents no deformation, 1 slight deformation, 2 moderate deformation, and 3 severe deformation.
Text position: to avoid blank space around the text affecting the recognition result, the synthesized data generally includes such blank space. The text position can be represented by a number; for example, 0, 1 and 2 represent text in the middle, on the left and on the right, respectively.
Line thickness: to simulate differences in font weight in real text, the line thickness can also be represented by a number. For example, the line thickness is 0-3, where 1 represents normal, 0 thinner, 2 thicker, and 3 thicker still. Because the outline of a Chinese character with many strokes becomes unclear when the strokes are thick, the thickness range for Chinese characters with more than 8 strokes can be limited to 0-2.
Color: to simulate differences in character color in real scenes, each of the three RGB channels takes a value of 0-255, and the color can be written in hexadecimal. For example, R, G and B each equal to 40 gives #282828.
Inter-character spacing: so that both spacing and space characters are recognized correctly, the spacing can be represented by a number, for example 0-5, where the number is the spacing in pixels.
Background: in order to simulate various background patterns in practical situations, a background library is firstly established, and pictures are randomly selected from the background library to be synthesized. The establishment of the background library may be specific to the recognition task.
Frame pressing: to simulate a scene in which a picture is partially covered, the top, bottom, left or right part of the picture is simulated as blocked, and the corresponding pixel values can simply be discarded. It should be noted that frame pressing must not prevent effective recognition of the characters; for example, the Chinese character "对" ("pair") must not be cropped so that it reads "寸" ("cun").
After the data set attributes are established, data augmentation parameters may be determined based on the application scenario of the OCR recognition and the established data set attributes, and a basic data set closer to real data is obtained from those parameters. The basic data set includes the open source data set and the synthesized data set. Example data augmentation parameters are shown in Table 1.
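As an illustration of how the attributes above might be turned into concrete augmentation parameters, the sketch below samples one parameter set per synthesized picture. The value ranges follow the examples given in the attribute descriptions (tilt within 30 degrees, blur and distortion 0-3, position 0-2, thickness 0-3 or 0-2 for characters with more than 8 strokes, spacing 0-5 pixels); the function names, dictionary keys, file names, gray-only colors and the encoding of frame pressing as a single side are assumptions and are not taken from Table 1.

```python
import random


def rgb_to_hex(r, g, b):
    """Format three 0-255 channel values as a hexadecimal color string."""
    return "#{:02X}{:02X}{:02X}".format(r, g, b)


def sample_augmentation_params(rng, fonts, backgrounds, stroke_count=0):
    """Draw one hypothetical set of augmentation parameters."""
    gray = rng.randint(0, 255)  # assumption: same value on all three channels
    return {
        "size": (80, rng.choice([100, 200])),        # [height, width] in pixels
        "font": rng.choice(fonts),                    # font file
        "tilt_deg": rng.uniform(-30.0, 30.0),         # negative = left, positive = right
        "blur": rng.randint(0, 3),                    # 0 none .. 3 severe
        "distortion": rng.randint(0, 3),              # 0 none .. 3 severe
        "position": rng.randint(0, 2),                # 0 middle, 1 left, 2 right
        "thickness": rng.randint(0, 2 if stroke_count > 8 else 3),
        "color": rgb_to_hex(gray, gray, gray),
        "spacing_px": rng.randint(0, 5),
        "background": rng.choice(backgrounds),        # picture from the background library
        "frame_press": rng.choice(["none", "top", "bottom", "left", "right"]),
    }


params = sample_augmentation_params(random.Random(0), ["font_a.ttf"],
                                    ["bg_001.png"], stroke_count=12)
```

A rendering step, not shown here, would consume such a dictionary together with a line of text from the synthetic data set text document to produce one training picture.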
TABLE 1
In the embodiment of the invention, a recognition dictionary is established, a first word frequency dictionary is established based on the recognition dictionary and the acquired open source data set, a synthetic data set text document is established based on the first word frequency dictionary, and data augmentation is performed on the current data set based on the established data set attributes, the application scenario of the OCR recognition and the synthetic data set text document to obtain the augmented basic data set. This makes the data augmentation more targeted while reducing the cost of acquiring training samples for deep learning OCR algorithms.
Embodiment two:
FIG. 2 shows the implementation flow of an OCR model training method, based on the data augmentation method described in the first embodiment, according to the second embodiment of the present invention. For convenience of description, only the parts related to this embodiment are shown, detailed as follows:
Training the OCR model on the basic data set alone (the open source data set and the synthesized data set) is not enough to obtain an ideal character recognition result, so part of an actual data set is needed to fine-tune the model. The following steps can therefore be adopted when training the OCR model:
In step S201, the OCR model is trained using the basic data set described in the first embodiment until the model converges, to obtain a pre-training model;
In the embodiment of the present invention, the OCR model may use the common CRNN structure. In a specific implementation, the labels corresponding to the basic data set can be serialized, and the model is trained to convergence on the label-serialized basic data set to obtain the pre-training model.
In step S202, the basic data set and the actual labeled data set are mixed, and the pre-trained model is trained continuously using the mixed data set, so as to obtain a trained OCR model.
In the embodiment of the invention, the basic data set and the actual labeled data set are mixed, the learning rate is reduced on the basis of pre-training, and 1-2 epochs can be continuously trained, so that the final model can be obtained. One epoch is the process of training all training samples once.
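The two-stage schedule can be summarized as a small, framework-independent plan. This is only a sketch: the function name `make_training_plan`, the concrete learning rates and the decay factor are assumptions, and the actual training loop (for example for a CRNN model) is left out.

```python
import random


def make_training_plan(base_set, labeled_set, base_lr=1e-3, lr_decay=0.1,
                       finetune_epochs=2, seed=0):
    """Describe pre-training on the basic set, then fine-tuning on the mixed set."""
    mixed = list(base_set) + list(labeled_set)
    random.Random(seed).shuffle(mixed)  # mix basic and actual labeled samples
    return [
        # Stage 1: train on the basic data set until the model converges.
        {"stage": "pretrain", "data": list(base_set), "lr": base_lr, "epochs": None},
        # Stage 2: continue from the pre-training model with a reduced learning
        # rate for 1-2 epochs on the mixed data set.
        {"stage": "finetune", "data": mixed, "lr": base_lr * lr_decay,
         "epochs": finetune_epochs},
    ]


plan = make_training_plan(["synth_1", "synth_2", "open_1"], ["real_1"])
```

Here `epochs: None` stands for "until convergence", since the text does not fix a number of epochs for pre-training.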
By adopting the basic data set obtained in the first embodiment, a more ideal effect can be obtained on a limited actual data set, for example, the basic data set comprises 1000 pieces, and the more ideal effect can be obtained only by fine-tuning 3 to 5 pieces of actual data sets. Otherwise, a large amount of manpower and material resources are needed to collect and label the real data to obtain a considerable effect.
In the embodiment of the invention, the OCR model is trained by using the basic data set until the model converges to obtain the pre-training model, the basic data set and the actual marking data set are mixed, and the pre-training model is continuously trained by using the mixed data set to obtain the trained OCR model, so that the model training effect is ensured, and the cost for acquiring real data is reduced.
Example three:
Fig. 3 shows the structure of a data augmentation apparatus for OCR recognition according to a third embodiment of the present invention. For convenience of description, only the portions related to the third embodiment are shown, which include:
a recognition dictionary creating unit 31, configured to establish a recognition dictionary;
a first dictionary establishing unit 32, configured to establish a first word frequency dictionary based on the recognition dictionary and the acquired open source data set;
a document creating unit 33, configured to establish a synthetic data set text document based on the first word frequency dictionary; and
a data augmentation unit 34, configured to perform data augmentation on the current data set based on the established data set attributes, the application scenario of the OCR recognition, and the synthetic data set text document, so as to obtain an augmented basic data set.
Preferably, the recognition dictionary creating unit includes:
a first adjusting unit, configured to adjust character positions in the recognition dictionary according to character type; and/or
a second adjusting unit, configured to adjust character positions in the recognition dictionary according to the calculated glyph similarity of Chinese characters; and/or
a label serialization unit, configured to serialize the label of each character in the recognition dictionary.
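As an illustration of ordering by character type and of label serialization, the following sketch groups characters by a coarse type and assigns each character an integer label. The type categories, the function names and the choice to reserve index 0 (a common convention for the CTC blank in CRNN-style models) are assumptions made for this example; the glyph-similarity adjustment is not covered because the text does not specify how the similarity is calculated.

```python
from typing import Dict, Iterable, List


def char_type(ch: str) -> int:
    """Coarse character type used only for ordering (assumed categories)."""
    if ch.isascii() and ch.isdigit():
        return 0  # ASCII digits
    if ch.isascii() and ch.isalpha():
        return 1  # ASCII letters
    if "\u4e00" <= ch <= "\u9fff":
        return 2  # CJK Unified Ideographs (basic block)
    return 3      # punctuation and everything else


def build_recognition_dictionary(chars: Iterable[str]) -> Dict[str, int]:
    """Place characters of the same type next to each other and serialize their labels."""
    ordered = sorted(set(chars), key=lambda c: (char_type(c), c))
    # Index 0 is left unused here so it can serve as a blank label (assumption).
    return {ch: index + 1 for index, ch in enumerate(ordered)}


def serialize_label(text: str, dictionary: Dict[str, int]) -> List[int]:
    """Convert a text label into the sequence of integer labels used for training."""
    return [dictionary[ch] for ch in text]
```

For example, `serialize_label("a1中", build_recognition_dictionary("b中a1。2"))` maps each character to its position in the ordered dictionary.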
Preferably, the first dictionary establishing unit includes:
an index document establishing unit, configured to establish an index document according to the open source data set; and
a dictionary establishing subunit, configured to traverse the index document and count the occurrences of each character in the recognition dictionary to obtain the first word frequency dictionary.
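The counting step can be sketched as below. The index document format is an assumption: each line is taken to be an image path and its label text separated by a tab, which the text does not specify. The function name `build_first_word_frequency` is likewise hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def build_first_word_frequency(
    index_lines: Iterable[str],
    recognition_chars: Iterable[str],
) -> Dict[str, int]:
    """Count how often each recognition-dictionary character occurs in the labels."""
    counts: Counter = Counter()
    for line in index_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed line format: "<image path>\t<label text>".
        _, _, label = line.partition("\t")
        counts.update(label)
    # Characters that never occur are kept with a frequency of 0, so that
    # rare characters remain visible to the later balancing step.
    return {ch: counts.get(ch, 0) for ch in recognition_chars}
```

Keeping zero-count characters in the result matters because those are exactly the characters the synthetic data set needs to supply.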
Preferably, the document creating unit includes:
a second dictionary establishing unit, configured to balance the word frequencies in the first word frequency dictionary to obtain a second word frequency dictionary; and
a second document establishing unit, configured to establish the synthetic data set text document based on the second word frequency dictionary.
Preferably, the second dictionary establishing unit includes:
a first word frequency traversing unit, configured to traverse the first word frequency of each character in the first word frequency dictionary;
a second word frequency obtaining unit, configured to obtain a second word frequency for each character according to a random number selected for that character, wherein the second word frequency is the sum of the first word frequency and the random number, the second word frequency lies within a preset word frequency range, the minimum word frequency of the preset word frequency range is greater than the minimum word frequency in the first word frequency dictionary, and the maximum word frequency of the preset word frequency range is greater than the maximum word frequency in the first word frequency dictionary; and
a second dictionary establishing subunit, configured to take the second word frequency as the word frequency of the corresponding character to obtain the second word frequency dictionary.
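A minimal sketch of the balancing rule follows. It assumes that the random number is a non-negative integer drawn uniformly from the values that keep the sum inside the preset range; the text only requires that the sum of the first word frequency and the random number fall within that range. The names `balance_word_frequency`, `preset_min` and `preset_max` are hypothetical.

```python
import random
from typing import Dict, Optional


def balance_word_frequency(
    first: Dict[str, int],
    preset_min: int,
    preset_max: int,
    seed: Optional[int] = None,
) -> Dict[str, int]:
    """Return second word frequencies: first frequency plus a per-character random number."""
    if not first:
        return {}
    if preset_min > preset_max:
        raise ValueError("preset_min must not exceed preset_max")
    # Conditions stated in the text: the preset range starts above the smallest
    # first frequency and ends above the largest first frequency.
    if preset_min <= min(first.values()) or preset_max <= max(first.values()):
        raise ValueError("preset range must lie above the first-frequency minimum and maximum")

    rng = random.Random(seed)
    second: Dict[str, int] = {}
    for ch, freq in first.items():
        low = max(0, preset_min - freq)   # smallest addend that reaches the range
        high = preset_max - freq          # largest addend that stays in the range
        second[ch] = freq + rng.randint(low, high)
    return second
```

With this rule, a character that is rare in the open source data set receives a large addend, while a frequent character receives a small one, so the resulting frequencies are compressed into the preset range.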
Preferably, the second document establishing unit includes:
a character range acquisition unit, configured to acquire the character range of a line of text in the application scenario; and
a document establishing subunit, configured to establish the synthetic data set text document based on the character range, wherein the synthetic data set text document comprises at least one text document, the number of characters in each line of text in each text document is within the character range, and the word frequency of each character in the synthetic data set text document is the same as the word frequency of the corresponding character in the second word frequency dictionary.
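One way to satisfy both constraints (line length within the character range, and character counts equal to the second word frequencies) is sketched below. The approach of shuffling a pool of characters and cutting it into nearly equal lines is an assumption chosen for simplicity, as are the names `build_synthetic_lines`, `min_chars` and `max_chars`; the text does not say how the lines are composed.

```python
import math
import random
from typing import Dict, List, Optional


def build_synthetic_lines(
    second: Dict[str, int],
    min_chars: int,
    max_chars: int,
    seed: Optional[int] = None,
) -> List[str]:
    """Build text lines whose character counts match `second` exactly."""
    if not 1 <= min_chars <= max_chars:
        raise ValueError("require 1 <= min_chars <= max_chars")
    # Each character appears in the pool as many times as its second word frequency.
    pool = [ch for ch, count in second.items() for _ in range(count)]
    total = len(pool)
    if total == 0:
        return []

    # Fewest lines that keep every line at or below max_chars.
    n_lines = math.ceil(total / max_chars)
    if n_lines * min_chars > total:
        raise ValueError("total character count cannot be split within the character range")

    rng = random.Random(seed)
    rng.shuffle(pool)

    # Spread the characters as evenly as possible across the lines.
    base, extra = divmod(total, n_lines)
    lines: List[str] = []
    position = 0
    for i in range(n_lines):
        length = base + (1 if i < extra else 0)
        lines.append("".join(pool[position:position + length]))
        position += length
    return lines
```

The lines produced this way are random character sequences rather than natural sentences; writing them to one or more files would give the text documents described above.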
Preferably, the dataset properties include at least font size, font style, tilt angle, blur, distortion, text position, line thickness, color, interword spacing, background, and frame.
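The listed data set attributes could be represented as a per-sample configuration that is drawn at random before each synthetic line is rendered. The sketch below is purely illustrative: every field name, value range and option list is a placeholder chosen for the example and is not taken from the embodiment.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RenderAttributes:
    """One random draw of the data set attributes (all fields are hypothetical)."""
    font_size: int
    font_style: str
    tilt_angle_deg: float
    blur_radius: float
    distortion: float
    text_position: Tuple[int, int]
    line_thickness: int
    color: Tuple[int, int, int]
    inter_word_spacing: int
    background: str
    frame: bool


def sample_attributes(rng: random.Random) -> RenderAttributes:
    """Draw one attribute set; ranges are placeholders to be tuned per application scenario."""
    return RenderAttributes(
        font_size=rng.randint(16, 48),
        font_style=rng.choice(["regular", "bold", "italic"]),
        tilt_angle_deg=rng.uniform(-5.0, 5.0),
        blur_radius=rng.uniform(0.0, 1.5),
        distortion=rng.uniform(0.0, 0.2),
        text_position=(rng.randint(0, 10), rng.randint(0, 10)),
        line_thickness=rng.randint(1, 3),
        color=(rng.randint(0, 80), rng.randint(0, 80), rng.randint(0, 80)),
        inter_word_spacing=rng.randint(0, 4),
        background=rng.choice(["plain", "paper", "noise"]),
        frame=rng.random() < 0.3,
    )
```

A renderer would then take one line from the synthetic data set text document together with one `RenderAttributes` instance to produce a training image.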
In the embodiment of the present invention, each unit of the data augmentation apparatus for OCR recognition may be implemented by a corresponding hardware or software unit, and each unit may be an independent software or hardware unit, or may be integrated into a single software or hardware unit, which is not limited herein. For a specific implementation of each unit of the data augmentation apparatus for OCR recognition, reference may be made to the description of the foregoing method embodiment, and details are not described here again.
Example four:
Fig. 4 shows the structure of an OCR model training apparatus according to a fourth embodiment of the present invention, which builds on the third embodiment. For convenience of description, only the portions related to this embodiment are shown, which include:
a pre-training unit 41, configured to train the OCR model using the basic data set until the model converges to obtain a pre-trained model; and
a secondary training unit 42, configured to mix the basic data set and the actual labeled data set, and continue training the pre-trained model using the mixed data set to obtain a trained OCR model.
The basic data set is obtained by the data augmentation apparatus for OCR recognition described in the third embodiment.
In the embodiment of the present invention, each unit of the OCR model training apparatus may be implemented by a corresponding hardware or software unit, and each unit may be an independent software or hardware unit, or may be integrated into a software or hardware unit, which is not limited herein. For the specific implementation of each unit of the OCR model training apparatus, reference may be made to the description of the foregoing method embodiment, and details are not repeated here.
Example five:
Fig. 5 shows the structure of a data augmentation device for OCR recognition according to a fifth embodiment of the present invention. For convenience of description, only the portions related to this embodiment are shown.
The data augmentation device 5 for OCR recognition of the embodiment of the present invention includes a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and executable on the processor 50. The processor 50, when executing the computer program 52, implements the steps in the above-described method embodiments, such as steps S101 to S104 shown in Fig. 1. Alternatively, the processor 50, when executing the computer program 52, implements the functions of the units in the above-described apparatus embodiments, such as the functions of units 31 to 34 shown in Fig. 3.
Example six:
In an embodiment of the present invention, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the steps in the above-described method embodiments, e.g., steps S101 to S104 shown in Fig. 1. Alternatively, the computer program, when executed by the processor, implements the functions of the units in the above-described apparatus embodiments, such as the functions of units 31 to 34 shown in Fig. 3.
The computer-readable storage medium of the embodiments of the present invention may include any entity or device capable of carrying computer program code, or a recording medium such as a ROM/RAM, a magnetic disk, an optical disk, or a flash memory.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (9)
1. A data augmentation method for OCR recognition, said method comprising the steps of:
establishing a recognition dictionary;
establishing a first word frequency dictionary based on the recognition dictionary and the acquired open source data set;
establishing a synthetic data set text document based on the first word frequency dictionary;
performing data augmentation on the current data set based on the established data set attributes, the application scenario of the OCR recognition and the synthetic data set text document to obtain an augmented basic data set;
wherein the step of establishing a synthetic data set text document based on the first word frequency dictionary comprises:
balancing the word frequency in the first word frequency dictionary to obtain a second word frequency dictionary;
establishing the synthetic data set text document based on the second word frequency dictionary;
and the step of balancing the word frequency in the first word frequency dictionary to obtain a second word frequency dictionary comprises:
traversing the first word frequency of each character in the first word frequency dictionary;
obtaining a second word frequency corresponding to each character according to a random number selected for each character, wherein the second word frequency is the sum of the first word frequency and the random number, the second word frequency is within a preset word frequency range, the minimum word frequency in the preset word frequency range is greater than the minimum word frequency in the first word frequency dictionary, and the maximum word frequency in the preset word frequency range is greater than the maximum word frequency in the first word frequency dictionary;
and taking the second word frequency as the word frequency of the corresponding character to obtain the second word frequency dictionary.
2. The method of claim 1, wherein the step of creating a recognition dictionary comprises:
adjusting the character position in the recognition dictionary according to the character type; and/or
adjusting the character position in the recognition dictionary according to the calculated glyph similarity of the Chinese characters; and/or
serializing the label of each character in the recognition dictionary.
3. The method of claim 1, wherein the step of establishing a first word frequency dictionary based on the recognition dictionary and the acquired open source data set further comprises:
establishing an index document according to the open source data set;
and traversing the index document, and counting each character in the recognition dictionary to obtain the first word frequency dictionary.
4. The method of claim 1, wherein said step of establishing said synthetic data set text document based on said second word frequency dictionary comprises:
acquiring a character range of a line of text in the application scenario;
and establishing the synthetic data set text document based on the character range, wherein the synthetic data set text document comprises at least one text document, the number of characters in each line of text in each text document is within the character range, and the word frequency of each character in the synthetic data set text document is the same as the word frequency of the corresponding character in the second word frequency dictionary.
5. The method of claim 1, wherein the dataset properties include at least font size, font style, tilt angle, blur, distortion, text position, line thickness, color, interword space, background, and frame.
6. An OCR model training method based on the data augmentation method for OCR recognition according to any one of claims 1 to 5, comprising:
training an OCR model by using the basic data set until the model converges to obtain a pre-training model;
and mixing the basic data set and the actual labeled data set, and continuing to train the pre-training model by using the mixed data set to obtain the trained OCR model.
7. A data augmentation apparatus for OCR recognition, the apparatus comprising:
a recognition dictionary establishing unit, configured to establish a recognition dictionary;
a first dictionary establishing unit, configured to establish a first word frequency dictionary based on the recognition dictionary and the acquired open source data set;
a document establishing unit, configured to establish a synthetic data set text document based on the first word frequency dictionary; and
a data augmentation unit, configured to perform data augmentation on the current data set based on the established data set attributes, the application scenario of the OCR recognition and the synthetic data set text document to obtain an augmented basic data set;
the document establishing unit further includes:
a second dictionary establishing unit, configured to balance the word frequency in the first word frequency dictionary to obtain a second word frequency dictionary; and
a second document establishing unit, configured to establish the synthetic data set text document based on the second word frequency dictionary;
the second dictionary establishing unit includes:
the first word frequency traversing unit is used for traversing the first word frequency of each character in the first word frequency dictionary;
the second word frequency obtaining unit is used for obtaining a second word frequency corresponding to each character according to the random number selected for each character, wherein the second word frequency is the sum of the first word frequency and the random number, the second word frequency is in a preset word frequency range, the minimum word frequency in the preset word frequency range is larger than the minimum word frequency in the first word frequency dictionary, and the maximum word frequency in the preset word frequency range is larger than the maximum word frequency in the first word frequency dictionary; and
and the second dictionary establishing subunit is used for taking the second word frequency as the word frequency of the corresponding character to obtain the second word frequency dictionary.
8. A data augmentation device for OCR recognition comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that said processor implements the steps of the method according to any one of claims 1 to 6 when executing said computer program.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110991555.7A CN113435426B (en) | 2021-08-27 | 2021-08-27 | Data augmentation method, device and equipment for OCR recognition and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113435426A (en) | 2021-09-24 |
CN113435426B (en) | 2021-11-16 |
Family
ID=77798123
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202110991555.7A | Active | CN113435426B (en) | 2021-08-27 | 2021-08-27 | Data augmentation method, device and equipment for OCR recognition and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113435426B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111079428A (en) * | 2019-12-27 | 2020-04-28 | 出门问问信息科技有限公司 | Word segmentation and industry dictionary construction method and device and readable storage medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101571852B (en) * | 2008-04-28 | 2011-04-20 | 富士通株式会社 | Dictionary generating device and information retrieving device |
CN111563377A (en) * | 2019-02-13 | 2020-08-21 | 北京京东尚科信息技术有限公司 | Data enhancement method and device |
CN111797908B (en) * | 2020-06-18 | 2022-08-09 | 浪潮金融信息技术有限公司 | Training set generation method of deep learning model for print character recognition |
CN112633268B (en) * | 2020-12-21 | 2024-08-23 | 江苏国光信息产业股份有限公司 | OCR (optical character recognition) method and OCR recognition system based on domestic platform |
Also Published As
Publication number | Publication date |
---|---|
CN113435426A (en) | 2021-09-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Burie et al. | ICDAR2015 competition on smartphone document capture and OCR (SmartDoc) | |
CN108415887A (en) | A kind of method that pdf document is converted to OFD files | |
US11468694B1 (en) | Systems and methods for document image processing using neural networks | |
JP7170773B2 (en) | Structured document information marking method, structured document information marking device, electronic device, computer-readable storage medium, and computer program | |
CN103488711A (en) | Method and system for fast making vector font library | |
US8386943B2 (en) | Method for query based on layout information | |
CN103632387A (en) | Method and system for generation of brush writing copybook | |
CN109508712A (en) | A kind of Chinese written language recognition methods based on image | |
CN110554991A (en) | Method for correcting and managing text picture | |
CN110162773A (en) | Title estimator | |
CN109670502A (en) | Training data generation system and method based on dimension language character recognition | |
CN113435426B (en) | Data augmentation method, device and equipment for OCR recognition and storage medium | |
CN114399782B (en) | Text image processing method, apparatus, device, storage medium, and program product | |
CN113038184B (en) | Data processing method, device, equipment and storage medium | |
CN104866631A (en) | Method and device for aggregating counseling problems | |
CN114612912A (en) | Image character recognition method, system and equipment based on intelligent corpus | |
CN114220112A (en) | Person name card oriented arbitrary relationship extraction method and system | |
CN114332882A (en) | Text translation method and device, electronic equipment and storage medium | |
Suchenwirth et al. | Optical recognition of Chinese characters | |
Sobhan Sarbandi | Navigating the Latent: Exploring the Potentials of Islamic Calligraphy with Generative Adversarial Networks | |
CN111582281A (en) | Picture display optimization method and device, electronic equipment and storage medium | |
CN112836467A (en) | Image processing method and device | |
CN110580352A (en) | Chinese character and line book intercommunication mutual identification technical method | |
US20210390344A1 (en) | Automatically applying style characteristics to images | |
CN115830612A (en) | OCR training data generation method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||