CN111523541A - Data generation method, system, equipment and medium based on OCR

Info

Publication number: CN111523541A
Application number: CN202010315649.8A
Authority: CN
Inventors: 周曦; 姚志强; 林文峰; 许梅芳
Original assignee: Shanghai Yunconghuilin Artificial Intelligence Technology Co ltd
Current assignee: Shanghai Yunconghuilin Artificial Intelligence Technology Co ltd
Priority date: 2020-04-21
Filing date: 2020-04-21
Publication date: 2020-08-11

Abstract

The invention provides an OCR-based data generation method, system, device, and medium. The method comprises: constructing a sample database from one or more sample data, wherein each sample data comprises one or more attributes; and generating, based on the sample database, one or more target data containing the one or more attributes. The sample database is built from sample data collected in a real environment, the target data generated from it carry one or more attributes of that sample data, and the generated target data are used for training as if they were data pairs collected in the real environment. This can greatly reduce the amount of real data required to train an OCR recognition model, greatly reduce the data-labeling workload, and greatly shorten the development cycle of the recognition model, thereby increasing its development speed.

Description

Data generation method, system, equipment and medium based on OCR

Technical Field

The present invention relates to the field of data processing technologies, and in particular to an OCR-based data generation method, system, device, and medium.

Background

OCR (Optical Character Recognition), an application field of artificial intelligence, needs a large amount of training data before a neural network can predict accurately. An enormous amount of training data covering as many scenes as possible is therefore crucial to the accuracy of the resulting OCR model. At the same time, OCR scenes are complex and varied, layouts are intricate, and customization requirements are high, so training data are difficult to obtain, and the lack of real-scene data is a pain point of the industry. As a result, product development cycles are long, model robustness is poor, and the high cost of machine learning development makes applications difficult to put into practice. The OCR field therefore urgently needs a tool that can generate large amounts of layout training data that closely simulate real scenes.

Disclosure of Invention

In view of the above shortcomings of the prior art, an object of the present invention is to provide an OCR-based data generation method, system, device, and medium that solve the problems existing in the prior art.

To achieve the above and other related objects, the present invention provides an OCR-based data generation method, including the steps of:

constructing a sample database according to one or more sample data, wherein each sample data comprises one or more attributes;

and generating one or more target data containing the one or more attributes based on the sample database.

Optionally, the sample data comprises at least one of: bill image, ticket image, certificate image, bill image.

Optionally, the attribute comprises at least one of: texture, background style, layout, table, text, icon, font, language, line.

Optionally, the method further includes synthesizing the generated target data into one or more training sample sets.

Optionally, training is further performed according to the one or more training sample sets, and one or more recognition models are generated.

Optionally, the recognition model is used to recognize at least one of: bills, tickets, certificates, documents.

Optionally, in the process of generating one or more target data including the one or more attributes based on the sample database, the method further includes enhancing the target data by adding one or more perturbation factors.

Optionally, if the attribute is a font, the added perturbation factor includes at least one of: character strings, font colors, character spacing, character special effects, text line backgrounds, character distortion, background noise, character position, stroke adhesion, stroke fracture, character inclination and various fonts.

Optionally, if the attribute is an icon, the added perturbation factor includes at least one of: adding lines, adding seals, Gaussian filtering, morphological filtering, motion blurring, illumination, highlighting, deforming and sharpening.

Optionally, if the attribute is text, the added perturbation factor includes at least one of: text box perturbation, text line random scaling, text line inclination, and text line aspect ratio random adjustment.

Optionally, the added perturbation factors further comprise effects, the effects comprising at least one of: perspective change, global color noise, global brightness adjustment, global brightness drift, local color noise, local brightness noise, global contrast adjustment, font motion blur.

The invention also provides an OCR-based data generation system, comprising:

The invention also provides data generation equipment based on OCR, which comprises:

The present invention also provides an apparatus comprising:

one or more processors; and

one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform a method as described in one or more of the above.

The present invention also provides one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the methods as described in one or more of the above.

As described above, the OCR-based data generation method, system, device, and medium provided by the present invention have the following beneficial effects. A sample database is constructed from one or more sample data, wherein each sample data comprises one or more attributes, and one or more target data containing the one or more attributes are generated based on the sample database. Because the sample database is built from sample data collected in a real environment and the generated target data carry one or more attributes of that sample data, the generated target data can be used for training as if they were data pairs collected in the real environment. This can greatly reduce the amount of real data required to train an OCR recognition model, greatly reduce the data-labeling workload, and greatly shorten the development cycle of the recognition model, thereby increasing its development speed.

Drawings

FIG. 1 is a schematic flow chart of an OCR-based data generation method according to an embodiment;

FIG. 2 is a schematic diagram of a hardware structure of a terminal device according to an embodiment;

FIG. 3 is a schematic diagram of a hardware structure of a terminal device according to another embodiment.

Description of the element reference numerals

1100 input device

1101 first processor

1102 output device

1103 first memory

1104 communication bus

1200 processing assembly

1201 second processor

1202 second memory

1203 communication assembly

1204 power supply assembly

1205 multimedia assembly

1206 voice assembly

1207 input/output interface

1208 sensor assembly

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

Referring to FIG. 1, the present invention provides an OCR-based data generation method comprising the following steps:

S100, constructing a sample database according to one or more sample data, wherein each sample data comprises one or more attributes;

S200, generating one or more target data containing the one or more attributes based on the sample database.
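The two steps can be illustrated with a minimal Python sketch. The specification does not define a data schema, so the `Sample` and `SampleDatabase` classes, the attribute names, and the uniform random sampling below are all assumptions made for illustration only.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One real-world sample and its extracted attributes (hypothetical schema)."""
    name: str
    attributes: dict  # e.g. {"font": "SimSun", "layout": "invoice_v1"}


@dataclass
class SampleDatabase:
    """Maps each attribute name to the set of values observed in real samples."""
    values: dict = field(default_factory=dict)

    def add(self, sample: Sample) -> None:
        for key, value in sample.attributes.items():
            self.values.setdefault(key, set()).add(value)


def build_database(samples):
    """Step S100: construct the sample database from one or more samples."""
    db = SampleDatabase()
    for sample in samples:
        db.add(sample)
    return db


def generate_targets(db, count, seed=0):
    """Step S200: generate target records whose attributes come from the database."""
    rng = random.Random(seed)
    return [
        {key: rng.choice(sorted(options)) for key, options in db.values.items()}
        for _ in range(count)
    ]


samples = [
    Sample("invoice_001", {"font": "SimSun", "layout": "invoice_v1"}),
    Sample("id_card_001", {"font": "SimHei", "layout": "id_card_front"}),
]
db = build_database(samples)
targets = generate_targets(db, count=5)
```

Each generated record is only a description of a target datum; rendering it into an image would be a separate stage that this sketch leaves out.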

The method constructs a sample database from sample data collected in a real environment, generates from that database one or more target data carrying one or more attributes of the sample data, and then uses the generated target data for training as if they were data pairs collected in the real environment. This can greatly reduce the amount of real data required to train an OCR recognition model, greatly reduce the data-labeling workload, and greatly shorten the development cycle of the recognition model, thereby increasing its development speed.

In an exemplary embodiment, the sample data comprises at least one of: bill image, ticket image, certificate image, bill image. For example, the sample data may be an invoice image, a bank card image, an identification card image, a household register (hukou booklet) image, or the like acquired in a real environment.

In some exemplary embodiments, the attribute comprises at least one of: texture, background style, layout, table, text, icon, font, language, line.

As an example, if the attribute is a background and/or a texture, a sample database is constructed from sample data collected in a real environment, wherein the sample data comprise at least the background and/or the texture, and target data containing the background and/or the texture are generated based on the sample database. As an example, if the attribute is a layout and/or a table, a sample database is constructed from sample data collected in a real environment, wherein the sample data comprise at least the layout and/or the table, and target data containing the layout and/or the table are generated based on the sample database. In addition, line segments, graphics, and defined fields can be drawn according to the layout information, and the thickness and position of each line can be adjusted so that the image corresponding to the layout and/or the table is accurate to the pixel.
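Drawing table lines at exact pixel positions can be sketched as follows. This is an illustrative sketch only: the canvas is a plain list of grayscale rows, and the function names and the cell-box convention are assumptions, not part of the specification.

```python
def blank_canvas(width, height, value=255):
    """Return a height x width grayscale canvas as a list of rows."""
    return [[value] * width for _ in range(height)]


def draw_hline(canvas, y, x0, x1, thickness=1, value=0):
    """Fill rows y..y+thickness-1 over columns x0..x1 (inclusive)."""
    for row in range(y, y + thickness):
        for col in range(x0, x1 + 1):
            canvas[row][col] = value


def draw_vline(canvas, x, y0, y1, thickness=1, value=0):
    """Fill columns x..x+thickness-1 over rows y0..y1 (inclusive)."""
    for col in range(x, x + thickness):
        for row in range(y0, y1 + 1):
            canvas[row][col] = value


def draw_table(canvas, left, top, col_widths, row_heights, thickness=1):
    """Draw a table grid and return the interior box of every cell.

    Each box is (x0, y0, x1, y1) in inclusive pixel coordinates, so text
    placed later can be positioned exactly inside the drawn lines.
    """
    xs = [left]
    for width in col_widths:
        xs.append(xs[-1] + width)
    ys = [top]
    for height in row_heights:
        ys.append(ys[-1] + height)

    for y in ys:
        draw_hline(canvas, y, xs[0], xs[-1] + thickness - 1, thickness)
    for x in xs:
        draw_vline(canvas, x, ys[0], ys[-1] + thickness - 1, thickness)

    cells = []
    for r in range(len(row_heights)):
        for c in range(len(col_widths)):
            cells.append(
                (xs[c] + thickness, ys[r] + thickness, xs[c + 1] - 1, ys[r + 1] - 1)
            )
    return cells


canvas = blank_canvas(40, 30)
cells = draw_table(canvas, left=2, top=2, col_widths=[15, 20], row_heights=[10, 12])
```

Because the function returns the cell boxes, the same coordinates can be reused as ground-truth regions for layout analysis without any manual labeling.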

As an example, if the attribute is text and/or font, the text comprises text content and a text box. A sample database is constructed from sample data collected in a real environment, and the closest font is matched based on the sample database to generate text content in the corresponding font. Specifically, font selection is performed using the recognition model and the font model: a font identical or similar to that of the sample data is selected from the sample database, and training data, namely target data in that font, are generated. The font may be selected by a program or manually. Text line data are also generated randomly to increase the generalization of indicators such as text length. In this embodiment, hyperparameters are combined using a plurality of customized image enhancement models and the similar fonts produced by the font selection module. A reinforcement learning algorithm searches the hyperparameter space for the most suitable enhancement combination, different types of samples are generated in combination with scale factors, various special effect modules are configured, data enhancement is performed, and training samples are synthesized automatically. The method also adds perturbation factors to parameters such as character rules, character lengths, dictionary ranges, character counts, text lines, and text boxes to enhance the fonts, so that the generated fonts are closer to fonts in a real environment.
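The font matching and random text line generation described above might look roughly like the sketch below. The specification does not say how fonts are compared, so the nearest-neighbour match on feature vectors is an assumption, as are the feature values, the field names, and the `RULES` table.

```python
import math
import random


def closest_font(sample_vec, font_vecs):
    """Pick the font whose (hypothetical) feature vector is nearest to the sample."""
    return min(font_vecs, key=lambda name: math.dist(sample_vec, font_vecs[name]))


def random_text_line(rng, charset, min_len, max_len):
    """Draw one text line with random length and random characters."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(charset) for _ in range(length))


def generate_field_values(rules, count, seed=0):
    """Generate `count` records; `rules` maps a field to (charset, min_len, max_len)."""
    rng = random.Random(seed)
    return [
        {name: random_text_line(rng, *spec) for name, spec in rules.items()}
        for _ in range(count)
    ]


# Hypothetical font features, e.g. stroke-width and serif scores from a font model.
FONT_FEATURES = {"SimSun": (1.0, 0.0), "SimHei": (0.0, 1.0)}

# Hypothetical character rules: dictionary range and length limits per field.
RULES = {
    "invoice_number": ("0123456789", 8, 8),
    "payer_name": ("ABCDEFGHIJKLMNOPQRSTUVWXYZ ", 4, 20),
}

font = closest_font((0.9, 0.1), FONT_FEATURES)
lines = generate_field_values(RULES, count=3)
```

Varying the length limits and the character set per field is one simple way to broaden the range of text lengths seen during training.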

As an example, if the attribute is an icon, a special icon is generated for the local feature and processed specially. For example, for a particular icon that may appear in a text layout, such as the national emblem on a business license, a picture can be inserted and its size, position, and precision adjusted accordingly. For patterns or seals appearing in the text, similar pictures are matched and inserted, and special processing is applied during generation; for example, seals are made partially transparent so that they do not block the characters.
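The seal transparency step can be approximated with ordinary alpha blending. The following is a minimal sketch on RGB tuples; the function names, the fixed alpha value, and the convention that white seal pixels are fully transparent are assumptions for illustration.

```python
def blend_pixel(background, overlay, alpha):
    """Blend one RGB pixel: alpha=0 keeps the background, alpha=1 keeps the overlay."""
    return tuple(
        round((1 - alpha) * b + alpha * o) for b, o in zip(background, overlay)
    )


def paste_seal(image, seal, top, left, alpha=0.5, transparent=(255, 255, 255)):
    """Blend a seal onto the image in place, skipping fully transparent seal pixels."""
    for dy, seal_row in enumerate(seal):
        for dx, seal_px in enumerate(seal_row):
            if seal_px == transparent:
                continue  # nothing to draw here
            y, x = top + dy, left + dx
            if 0 <= y < len(image) and 0 <= x < len(image[0]):
                image[y][x] = blend_pixel(image[y][x], seal_px, alpha)
    return image


# A 3x3 white page with one black "text" pixel in the centre.
page = [[(255, 255, 255)] * 3 for _ in range(3)]
page[1][1] = (0, 0, 0)

# A 2x2 red seal whose top-right pixel is transparent.
seal = [
    [(255, 0, 0), (255, 255, 255)],
    [(255, 0, 0), (255, 0, 0)],
]
paste_seal(page, seal, top=0, left=0, alpha=0.5)
```

With a partial alpha the text pixel under the seal stays darker than the surrounding seal area, which is the property that keeps characters legible beneath a stamp.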

In some exemplary embodiments, the method further includes synthesizing the generated target data into one or more training sample sets. Training can then be performed on the one or more training sample sets to generate one or more recognition models, wherein the recognition models are used to recognize at least one of: bills, tickets, certificates, documents. In this embodiment, the overall effect of the generated training sample set is adjusted and special effects are added, and automatic enhancement improves the accuracy and generalization of layout analysis and field recognition. Specifically, once each element of the text has been set, the zoom range of the text and its collective offset distance can be set as a whole. Where several backgrounds exist, such as the background surface of a bank card, a set of background pictures can be configured so that text patterns are generated on a variety of backgrounds. To bring the images closer to real photographs, automatic enhancement is combined with a plurality of special effects, including but not limited to: perspective change, global color noise, global brightness adjustment, global brightness drift, local color noise, local brightness noise, global contrast adjustment, font motion blur, and the like.
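A few of the listed global and local effects can be sketched on a grayscale image stored as a list of rows. The formulas below (additive brightness, contrast scaling about a pivot of 128, Gaussian noise inside a box) are common choices assumed for illustration; the specification does not give the exact formulas, and all function names are hypothetical.

```python
import random


def clamp(value):
    """Round to an integer and keep it inside the 0..255 pixel range."""
    return max(0, min(255, int(round(value))))


def adjust_brightness(img, delta):
    """Global brightness adjustment: add a constant to every pixel."""
    return [[clamp(p + delta) for p in row] for row in img]


def adjust_contrast(img, factor, pivot=128):
    """Global contrast adjustment: scale each pixel's distance from the pivot."""
    return [[clamp(pivot + factor * (p - pivot)) for p in row] for row in img]


def add_local_noise(img, box, sigma, rng):
    """Local brightness noise: add Gaussian noise inside box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    out = [row[:] for row in img]
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            out[y][x] = clamp(out[y][x] + rng.gauss(0, sigma))
    return out


def apply_effects(img, effects):
    """Apply a configured chain of (function, keyword arguments) effects in order."""
    for effect, kwargs in effects:
        img = effect(img, **kwargs)
    return img


img = [[100] * 4 for _ in range(4)]
rng = random.Random(0)
out = apply_effects(
    img,
    [
        (adjust_brightness, {"delta": 20}),
        (adjust_contrast, {"factor": 1.5}),
        (add_local_noise, {"box": (0, 0, 1, 1), "sigma": 5.0, "rng": rng}),
    ],
)
```

Expressing the effect chain as data makes it straightforward for a tuning step to switch effects on or off and to vary their parameters.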

The generated mass of simulated fixed-layout training data is labeled accordingly, and the recognition model is optimized through an automatic training algorithm and iterative hyperparameter combination. Specifically, a data format is generated that comprises a path, a picture, a label, and a text line coordinate box; mass data and the corresponding labels and content are generated automatically; and the recognition model is trained by the automatic training algorithm, which returns the accuracy of the recognition model on the validation set to verify the effect of the combination. Based on that validation accuracy, the hyperparameter selection module automatically and iteratively tunes a series of parameters, such as the special effect combination and the morphological probability, to optimize the recognition model. With transfer learning in a specific scene, the accuracy can reach 99%.
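The annotation record and the accuracy-driven tuning loop can be sketched as below. Everything here is an assumption made for illustration: the record fields, the JSON serialization, the search space, and in particular the search itself, which uses plain random search as a stand-in because the text does not detail the tuning procedure. The `fake_validation_accuracy` function merely replaces a real train-and-validate run.

```python
import json
import random
from dataclasses import asdict, dataclass


@dataclass
class TextLine:
    label: str   # ground-truth text of the line
    box: tuple   # text line coordinate box (x0, y0, x1, y1)


@dataclass
class Record:
    path: str    # path of the generated picture
    lines: list  # TextLine entries for that picture


def to_json(record):
    """Serialize one annotation record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)


def random_search(evaluate, space, trials, seed=0):
    """Try random configurations and keep the one with the best validation score."""
    rng = random.Random(seed)
    best_cfg, best_acc = None, float("-inf")
    for _ in range(trials):
        cfg = {name: rng.choice(options) for name, options in space.items()}
        acc = evaluate(cfg)  # in practice: train, then score on the validation set
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc


# Hypothetical search space over special-effect parameters.
SPACE = {"blur_prob": [0.0, 0.2, 0.5], "noise_sigma": [0, 5, 10]}


def fake_validation_accuracy(cfg):
    """Stand-in for a real training run; peaks at blur_prob=0.2, noise_sigma=5."""
    return 1.0 - abs(cfg["blur_prob"] - 0.2) - abs(cfg["noise_sigma"] - 5) / 100


rec = Record(path="out/000001.png", lines=[TextLine("12345678", (10, 20, 110, 44))])
parsed = json.loads(to_json(rec))
best_cfg, best_acc = random_search(fake_validation_accuracy, SPACE, trials=20)
```

Since the generator knows every label and box it rendered, records like this can be written at generation time without any manual labeling.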

According to the above embodiment, in the process of generating one or more target data containing the one or more attributes based on the sample database, the method further includes enhancing the target data by adding one or more perturbation factors. The target data are enhanced through a sample enhancement algorithm and multiple customized special effects, which improves the generalization capability and accuracy of the recognition model. The sample enhancement algorithm covers, without limitation: enhancement of character strings, font colors, character spacing, character special effects, text line backgrounds, and the like; character image enhancement, including character distortion, background noise, character position, stroke adhesion, stroke breakage, character tilt, multiple fonts, and the like; image enhancement, including adding lines, adding seals, Gaussian filtering, morphological filtering, motion blur, lighting, highlighting, deformation, sharpening, and the like; text box data enhancement, including text box perturbation, random text line scaling, text line tilt, random aspect ratio adjustment, and the like; and other enhancements, including random channel switching and fusion, grayscale transformation, Gaussian filtering, blurring, and random backgrounds. Custom special effects include, but are not limited to, perspective change, global color noise, global brightness adjustment, global brightness drift, local color noise, local brightness noise, global contrast adjustment, font motion blur, and the like.
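Selecting perturbation factors by attribute could be organized as a small registry, sketched here for the text box case. The registry layout, the shift and scale ranges, and the function names are assumptions for illustration; the specification only names the perturbation types.

```python
import random


def perturb_text_box(box, max_shift, rng):
    """Text box perturbation: shift the whole box by a small random offset."""
    x0, y0, x1, y1 = box
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


def scale_text_box(box, min_scale, max_scale, rng):
    """Random aspect ratio adjustment: scale width and height independently."""
    x0, y0, x1, y1 = box
    sx = rng.uniform(min_scale, max_scale)
    sy = rng.uniform(min_scale, max_scale)
    return (x0, y0, x0 + round((x1 - x0) * sx), y0 + round((y1 - y0) * sy))


# Hypothetical registry: attribute name -> perturbations applied in order.
PERTURBATIONS = {
    "text": [
        lambda box, rng: perturb_text_box(box, 3, rng),
        lambda box, rng: scale_text_box(box, 0.9, 1.1, rng),
    ],
}


def enhance(attribute, value, rng):
    """Apply every perturbation registered for the attribute; others pass through."""
    for perturbation in PERTURBATIONS.get(attribute, []):
        value = perturbation(value, rng)
    return value


rng = random.Random(0)
original = (10, 10, 110, 30)
perturbed = enhance("text", original, rng)
untouched = enhance("unknown_attribute", original, rng)
```

Font and icon perturbations would register under their own keys in the same way, so the generation step needs no attribute-specific branching.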

To address the shortage of training data in the OCR field, the invention provides an OCR-based data generation method that automatically generates training data as required. The method constructs a sample database from one or more sample data, wherein each sample data comprises one or more attributes, and generates one or more target data containing the one or more attributes based on the sample database. The sample database is built from sample data collected in a real environment, and the target data generated from it carry one or more attributes of that sample data. The generated attributes cover all layouts, tables, paper styles, background textures, text content in various formats and fonts (Chinese, English, other languages, and numbers), icons, seals, and the like. The generated data are close to real data, and their similarity to real-scene data can reach 99%. After generation, special effects such as noise are added so that the generated data are closer to real photographs. The generated target data are then used for training as if they were data pairs collected in the real environment, so that an OCR recognition model trained on them has better robustness and higher accuracy in real scenes. In addition, the method can generate ten thousand highly realistic simulated data items in about 4 hours. The method can thus greatly reduce the amount of real data required to train an OCR recognition model, greatly reduce the data-labeling workload, and greatly shorten the development cycle of the recognition model, thereby increasing its development speed. It also addresses the problems of scarce training data, poor model generalization, long development cycles, high development cost, and low accuracy in the OCR field, making it possible to develop an accurate recognition model quickly.

The invention also provides a data generation system based on OCR, comprising:

The system constructs a sample database from sample data collected in a real environment, generates from that database one or more target data carrying one or more attributes of the sample data, and then uses the generated target data for training as if they were data pairs collected in the real environment. This can greatly reduce the amount of real data required to train an OCR recognition model, greatly reduce the data-labeling workload, and greatly shorten the development cycle of the recognition model, thereby increasing its development speed.

According to the above embodiment, in the process of generating one or more target data containing the one or more attributes based on the sample database, the system further enhances the target data by adding one or more perturbation factors. The system enhances the target data through a sample enhancement algorithm and multiple customized special effects, which improves the generalization capability and accuracy of the recognition model. The sample enhancement algorithm covers, without limitation: enhancement of character strings, font colors, character spacing, character special effects, text line backgrounds, and the like; character image enhancement, including character distortion, background noise, character position, stroke adhesion, stroke breakage, character tilt, multiple fonts, and the like; image enhancement, including adding lines, adding seals, Gaussian filtering, morphological filtering, motion blur, lighting, highlighting, deformation, sharpening, and the like; text box data enhancement, including text box perturbation, random text line scaling, text line tilt, random aspect ratio adjustment, and the like; and other enhancements, including random channel switching and fusion, grayscale transformation, Gaussian filtering, blurring, and random backgrounds. Custom special effects include, but are not limited to, perspective change, global color noise, global brightness adjustment, global brightness drift, local color noise, local brightness noise, global contrast adjustment, font motion blur, and the like.

To address the shortage of training data in the OCR field, the invention provides an OCR-based data generation system that automatically generates training data as required. The system constructs a sample database from one or more sample data, wherein each sample data comprises one or more attributes, and generates one or more target data containing the one or more attributes based on the sample database. The sample database is built from sample data collected in a real environment, and the target data generated from it carry one or more attributes of that sample data. The generated attributes cover all layouts, tables, paper styles, background textures, text content in various formats and fonts (Chinese, English, other languages, and numbers), icons, seals, and the like. The generated data are close to real data, and their similarity to real-scene data can reach 99%. After generation, special effects such as noise are added so that the generated data are closer to real photographs. The generated target data are then used for training as if they were data pairs collected in the real environment, so that an OCR recognition model trained on them has better robustness and higher accuracy in real scenes. In addition, the system can generate ten thousand highly realistic simulated data items in about 4 hours. The system can thus greatly reduce the amount of real data required to train an OCR recognition model, greatly reduce the data-labeling workload, and greatly shorten the development cycle of the recognition model, thereby increasing its development speed. It also addresses the problems of scarce training data, poor model generalization, long development cycles, high development cost, and low accuracy in the OCR field, making it possible to develop an accurate recognition model quickly.

An embodiment of the present application further provides an OCR-based data generating apparatus, including:

In this embodiment, the OCR-based data generation apparatus executes the system or the method, and specific functions and technical effects may refer to the above embodiments, which are not described herein again.

An embodiment of the present application further provides an apparatus, which may include: one or more processors; and one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of FIG. 1. In practical applications, the apparatus may serve as a terminal device or as a server. Examples of the terminal device include: a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a vehicle-mounted computer, a desktop computer, a set-top box, a smart television, a wearable device, and the like.

Embodiments of the present application also provide a non-transitory readable storage medium storing one or more modules (programs); when the one or more modules are applied to a device, the device can execute the instructions of the method in FIG. 1 according to the embodiments of the present application.

Fig. 2 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.

Optionally, the first processor 1101 may be implemented as, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or another electronic component. The first processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.

Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output devices 1102 may include output devices such as a display, audio, and the like.

In this embodiment, the processor of the terminal device includes functions for executing each module of the OCR-based data generation apparatus described above; for specific functions and technical effects, refer to the above embodiments, which are not repeated here.

Fig. 3 is a schematic hardware structure diagram of a terminal device according to an embodiment of the present application. Fig. 3 is a specific embodiment of fig. 2 in an implementation process. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.

The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment.

The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.

Optionally, the second processor 1201 is provided in the processing component 1200. The terminal device may further include: a communication component 1203, a power supply component 1204, a multimedia component 1205, a voice component 1206, an input/output interface 1207, and/or a sensor component 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.

The processing component 1200 generally controls the overall operation of the terminal device. The processing component 1200 may include one or more second processors 1201 to execute instructions to perform all or a portion of the steps in the OCR-based data generation method described above. Further, the processing component 1200 can include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 can include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.

The power supply component 1204 provides power to the various components of the terminal device. The power supply component 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.

The multimedia component 1205 includes a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action but also the duration and pressure associated with the touch or slide operation.

The voice component 1206 is configured to output and/or input voice signals. For example, the voice component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received voice signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, the voice component 1206 further comprises a speaker for outputting voice signals.

The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.

The sensor component 1208 includes one or more sensors for providing status assessments of various aspects of the terminal device. For example, the sensor component 1208 may detect the open/closed state of the terminal device, the relative positioning of its components, and the presence or absence of user contact with the terminal device. The sensor component 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor component 1208 may also include a camera or the like.

The communication component 1203 is configured to facilitate wired or wireless communication between the terminal device and other devices. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot for inserting a SIM card, so that the terminal device may log onto a GPRS network to establish communication with the server via the internet.

As can be seen from the above, the communication component 1203, the voice component 1206, the input/output interface 1207 and the sensor component 1208 referred to in the embodiment of fig. 3 can be implemented as input devices in the embodiment of fig. 2.

The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas of the present invention shall be covered by the claims of the present invention.

Claims

1. An OCR-based data generation method, comprising the steps of:

2. An OCR based data generation method according to claim 1 wherein said sample data includes at least one of: bill image, ticket image, certificate image, document image.

3. An OCR based data generation method according to claim 2 wherein said attributes include at least one of: texture, background style, layout, table, text, icon, font, language, line.

4. An OCR-based data generation method according to claim 1, further comprising synthesizing the generated target data into one or more training sample sets.

5. An OCR-based data generation method according to claim 4, further comprising training from the one or more training sample sets to generate one or more recognition models.

6. An OCR based data generation method according to claim 5 wherein said recognition model includes means for recognizing at least one of: bills, tickets, certificates, documents.

7. An OCR based data generation method according to claim 3, wherein generating one or more target data containing said one or more attributes based on said sample database further comprises enhancing said target data by adding one or more perturbation factors.

8. An OCR based data generation method according to claim 7 wherein if said attribute is font, the added perturbation factor includes at least one of: character strings, font colors, character spacing, character special effects, text line backgrounds, character distortion, background noise, character position, stroke adhesion, stroke fracture, character inclination and various fonts.

9. An OCR based data generation method according to claim 7, wherein if said attribute is an icon, the added perturbation factor comprises one of: adding lines, adding seals, Gaussian filtering, morphological filtering, motion blurring, illumination, highlighting, deforming and sharpening.

10. An OCR based data generation method according to claim 7 wherein if said attribute is text, the added perturbation factor includes at least one of: text box perturbation, text line random scaling, text line inclination, and text line aspect ratio random adjustment.

11. An OCR based data generation method according to claim 7 wherein the added perturbation factors further include special effects including at least one of: perspective change, global color noise, global brightness adjustment, global brightness drift, local color noise, local brightness noise, global contrast adjustment, font motion blur.

12. An OCR-based data generation system, comprising:

13. An OCR based data generation system according to claim 12 wherein said sample data includes at least one of: bill image, ticket image, certificate image, document image.

14. An OCR based data generation system according to claim 13 wherein said attributes include at least one of: texture, background style, layout, table, text, icon, font, language, line.

15. An OCR-based data generation system as recited in claim 12 further comprising synthesizing the generated target data into one or more training sample sets.

16. An OCR-based data generation system as recited in claim 15 further comprising training from the one or more training sample sets to generate one or more recognition models.

17. An OCR based data generation system according to claim 16 wherein said recognition model includes means for recognizing at least one of: bills, tickets, certificates, documents.

18. An OCR-based data generation system according to claim 14, wherein generating one or more target data comprising said one or more attributes based on said sample database further comprises enhancing said target data by adding one or more perturbation factors.

19. An OCR based data generation system according to claim 18 wherein if said attribute is font, the added perturbation factors include at least one of: character strings, font colors, character spacing, character special effects, text line backgrounds, character distortion, background noise, character position, stroke adhesion, stroke fracture, character inclination and various fonts.

20. An OCR based data generation system according to claim 18 wherein if said attribute is an icon, the added perturbation factor comprises one of: adding lines, adding seals, Gaussian filtering, morphological filtering, motion blurring, illumination, highlighting, deforming and sharpening.

21. An OCR based data generation system according to claim 18 wherein if said attribute is text, the added perturbation factor includes at least one of: text box perturbation, text line random scaling, text line inclination, and text line aspect ratio random adjustment.

22. An OCR-based data generation system according to claim 18 wherein the added perturbation factors further include special effects including at least one of: perspective change, global color noise, global brightness adjustment, global brightness drift, local color noise, local brightness noise, global contrast adjustment, font motion blur.

23. An OCR-based data generation apparatus, comprising:

24. An apparatus, comprising:

one or more processors; and

one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method recited by one or more of claims 1-11.

25. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method recited by one or more of claims 1-11.
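
By way of illustration only, the flow recited in the method claims (constructing a sample database of attributes, generating target data from it, enhancing the target data with perturbation factors, and synthesizing a training sample set) might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names SAMPLE_DB, generate_target, render_stub, adjust_global_brightness, add_local_brightness_noise and build_training_set are hypothetical, the attribute values are invented for the example, and the rendering step is replaced by a uniform grayscale placeholder image. Only two of the listed perturbation factors (global brightness adjustment and local brightness noise) are shown.

```python
import random

# Hypothetical sample database: each attribute maps to values that would,
# in practice, be collected from real bill/ticket/certificate samples.
SAMPLE_DB = {
    "background_style": ["plain", "watermark"],
    "font": ["SimSun", "SimHei"],
    "language": ["zh", "en"],
    "text": ["Invoice No. 0001", "Total: 128.00"],
}


def generate_target(db, rng):
    """Draw one value per attribute to form a single target-data record."""
    return {attr: rng.choice(values) for attr, values in db.items()}


def render_stub(record, width=8, height=4, base=200):
    """Placeholder for rendering: returns a uniform grayscale image
    (rows of 0-255 integers). A real renderer would draw the record."""
    return [[base] * width for _ in range(height)]


def clamp(value):
    """Round to an integer and keep it inside the 0-255 pixel range."""
    return max(0, min(255, int(round(value))))


def adjust_global_brightness(image, delta):
    """Perturbation: shift every pixel by the same brightness offset."""
    return [[clamp(p + delta) for p in row] for row in image]


def add_local_brightness_noise(image, rng, amplitude):
    """Perturbation: add an independent random offset to each pixel."""
    return [
        [clamp(p + rng.uniform(-amplitude, amplitude)) for p in row]
        for row in image
    ]


def build_training_set(db, count, seed=0):
    """Generate `count` perturbed samples and collect them into one set."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    samples = []
    for _ in range(count):
        record = generate_target(db, rng)
        image = render_stub(record)
        image = adjust_global_brightness(image, rng.randint(-30, 30))
        image = add_local_brightness_noise(image, rng, amplitude=10)
        # The generated text doubles as the label, so no manual
        # annotation is needed for the synthetic sample.
        samples.append(
            {"image": image, "label": record["text"], "attributes": record}
        )
    return samples


if __name__ == "__main__":
    training_set = build_training_set(SAMPLE_DB, count=3, seed=42)
    for sample in training_set:
        print(sample["label"], sample["attributes"]["font"])
```

Because each label is taken directly from the generated record, every synthetic sample is annotated at the moment it is created, which is the source of the reduced labeling workload described in the abstract.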