CN107122785B - Text recognition model establishing method and device - Google Patents

Text recognition model establishing method and device

Info

Publication number
CN107122785B
CN107122785B (application CN201610105478.XA)
Authority
CN
China
Prior art keywords: text, file, different, files, information
Prior art date
Legal status: Active
Application number
CN201610105478.XA
Other languages
Chinese (zh)
Other versions
CN107122785A (en)
Inventor
李洁 (Li Jie)
Current Assignee
ZTE Corp
Original Assignee
ZTE Corp
Priority date
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN201610105478.XA
Priority to PCT/CN2017/074291 (published as WO2017143973A1)
Publication of CN107122785A
Application granted
Publication of CN107122785B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/24 - Character recognition characterised by the processing or recognition method
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a text recognition model establishing method and device. The method comprises the following steps: acquiring a text file set; selecting mutually different text files from the text file set as characteristic text files; and establishing a text recognition model by using the characteristic text files, wherein the text recognition model is used for recognizing text information in a text file to be recognized. The invention solves the problem in the prior art that a text recognition model established from repeatedly acquired identical text files has low accuracy, thereby improving the accuracy of the established text recognition model.

Description

Text recognition model establishing method and device
Technical Field
The invention relates to the field of communication, in particular to a text recognition model establishing method and device.
Background
With the development of the internet and the popularization of mobile devices, a large number of network-synthesized text pictures containing complex noise or various distortions are generated. To extract valuable information from this large volume of public multimedia data, it is very important to be able to recognize these complex network-synthesized text pictures.
However, recognizing complex network-synthesized text pictures poses considerable challenges. On the one hand, such pictures are highly diverse and may differ in font, color, size, orientation and arrangement; on the other hand, they suffer from noise, blur, uneven illumination, occlusion and similar problems, which makes the detection and recognition of characters very difficult.
If these network-synthesized text pictures are recognized with a conventional Optical Character Recognition (OCR) method, it is difficult to meet the required recognition rate and accuracy. With the advent of machine learning methods, OCR of text pictures with complex backgrounds has made breakthrough progress, but before characters can be recognized with machine learning, a large number of text files are required as training samples to establish a text recognition model. In the existing process of establishing a text recognition model, however, the same text file is often acquired repeatedly, so the model established from such files cannot cover all text content and therefore cannot be used to perform text recognition accurately.
No effective solution has yet been proposed for the problem in the related art that a text recognition model established from repeatedly acquired identical text files has low accuracy.
Disclosure of Invention
The invention provides a method and a device for establishing a text recognition model, which at least solve the problem of lower accuracy of the text recognition model established by repeatedly acquiring the same text file in the related technology.
According to an aspect of the present invention, there is provided a text recognition model building method, including: acquiring a text file set; selecting different text files from the text file set as characteristic text files; and establishing a text recognition model by using the characteristic text file, wherein the text recognition model is used for recognizing text information in the text file to be recognized.
Optionally, selecting the mutually different text files from the text file set as the feature text files comprises: and selecting the different text files from the text file set as the characteristic text files according to the file identification of the text file in the text file set and/or the storage position identification of the text file in the text file set.
Optionally, selecting the mutually different text files from the text file set as the characteristic text files according to the file identifier of the text file in the text file set and/or the storage location identifier of the text file in the text file set includes: acquiring a first preset number of file identifiers from the text file set according to a preset algorithm to obtain a file identifier set, wherein the storage location identifiers of the text files corresponding to the same file identifier in the file identifier set are the same; acquiring the mutually different storage location identifiers corresponding to the file identifiers in the file identifier set; screening a second preset number of mutually different file identifiers from the file identifier set according to the mutually different storage location identifiers; and extracting the text files corresponding to the mutually different file identifiers from the text file set as the characteristic text files.
Optionally, the obtaining the text file set includes: acquiring text information; copying the text information in batch to obtain a plurality of text information; and respectively setting text parameters for the plurality of text messages to obtain different text files, wherein the text file set comprises the different text files.
Optionally, the acquiring the text information includes: receiving an input first text string as the text information; or reading a second text string stored in the system; segmenting the second text string according to a preset strategy to obtain a text string set; and extracting a third text string in the text string set as the text information.
Optionally, the text parameter comprises at least one of: the font format parameter of the characters in the text information, the font display size parameter of the characters in the text information, the size ratio parameter of the blank characters in the text information, the interval size ratio parameter of the characters in the text information, the rotation angle parameter of the characters in the text information, the font color parameter of the characters in the text information, the transparency parameter of the characters in the text information, the thickening degree parameter of the characters in the text information, the inclination degree parameter of the characters in the text information, the underlining drawing parameter of the characters in the text information, a background picture and the display position parameter of the text information in the background picture.
According to another aspect of the present invention, there is also provided a text recognition model building apparatus, including: the acquisition module is used for acquiring a text file set; the selection module is used for selecting different text files from the text file set as characteristic text files; and the establishing module is used for establishing a text recognition model by using the characteristic text file, wherein the text recognition model is used for recognizing the text information in the text file to be recognized.
Optionally, the selection module is configured to: and selecting the different text files from the text file set as the characteristic text files according to the file identification of the text file in the text file set and/or the storage position identification of the text file in the text file set.
Optionally, the selection module comprises: a first obtaining unit, configured to obtain, according to a preset algorithm, a first preset number of file identifiers in the text file set to obtain a file identifier set, where storage location identifiers of text files corresponding to the same text file identifier in the file identifier set are the same; a second obtaining unit, configured to obtain different storage location identifiers corresponding to the file identifiers in the file identifier set; a selecting unit, configured to select a second preset number of different file identifiers from the file identifier set according to the different storage location identifiers; and the extracting unit is used for extracting the text files corresponding to the different file identifications from the text file set as the characteristic text files.
Optionally, the obtaining module includes: a third acquisition unit configured to acquire text information; the copying unit is used for copying the text information in batch to obtain a plurality of text information; and the setting unit is used for setting text parameters for the plurality of text messages respectively to obtain different text files, wherein the text file set comprises the different text files.
Optionally, the third obtaining unit is configured to: receiving an input first text string as the text information; or reading a second text string stored in the system; segmenting the second text string according to a preset strategy to obtain a text string set; and extracting a third text string in the text string set as the text information.
According to the invention, after the text file set is acquired, mutually different text files are selected from the text file set as characteristic text files, and a text recognition model is established by using the characteristic text files, the text recognition model being used for recognizing text information in a text file to be recognized. That is, mutually different text files are automatically selected from the text file set as characteristic text files to establish the text recognition model, so that the established model can cover different text files. This ensures the accuracy of the established text recognition model and solves the problem in the prior art that a text recognition model established from repeatedly acquired identical text files has low accuracy. Furthermore, text information in a text picture can be accurately recognized by a text recognition model established with the method provided by this embodiment.
In addition, automatically selecting mutually different text files from the text file set to establish the text recognition model reduces the number of text files used as training samples, that is, it reduces the number of repeatedly acquired text files. This improves the efficiency of establishing the text recognition model and solves the problem that establishing a text recognition model is inefficient when too many text files are acquired.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of an alternative text recognition model building method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a method for text recognition model building in accordance with an alternative embodiment of the present invention;
FIG. 3 is a flow diagram of a novel improved linear congruence random number generator in accordance with an alternative embodiment of the present invention;
FIG. 4 is a block diagram of an alternative text recognition model building apparatus according to an embodiment of the present invention;
FIG. 5 is a block diagram of an alternative text recognition model building apparatus according to an embodiment of the present invention;
fig. 6 is a block diagram of another alternative text recognition model building apparatus according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Example one
In this embodiment, a text recognition model building method is provided, and fig. 1 is a flowchart of an optional text recognition model building method according to an embodiment of the present invention, as shown in fig. 1, the flowchart includes the following steps:
step S102, acquiring a text file set;
step S104, selecting different text files from the text file set as characteristic text files;
and step S106, establishing a text recognition model by using the characteristic text file, wherein the text recognition model is used for recognizing text information in the text file to be recognized.
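For illustration only, the three steps above can be sketched in Python roughly as follows; the content-hash deduplication used here is merely a stand-in for the identifier-based selection described later, and all function names are hypothetical.

```python
import hashlib

def acquire_text_file_set(paths):
    # Step S102: the text file set is represented here simply as a list of
    # paths to generated text files (images containing rendered text).
    return list(paths)

def select_characteristic_files(text_file_set):
    # Step S104: keep only mutually different text files; an MD5 hash of the
    # file content stands in for the file identifier used in the patent.
    seen, characteristic = set(), []
    for path in text_file_set:
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            characteristic.append(path)
    return characteristic

def establish_model(characteristic_files):
    # Step S106: a real system would hand these files to a machine-learning
    # trainer; here only the selected training list is returned.
    return {"training_samples": characteristic_files}
```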
Optionally, the present embodiment may be applied, but is not limited to, in scenarios of building a text recognition model, in particular building a machine-learning text recognition model in an optical character recognition (OCR) scenario. For example, it may be applied, but is not limited to, in the process of text localization, text detection or text recognition. The above scenarios are only examples, and this embodiment is not limited thereto.
Through the above steps, mutually different text files are automatically selected from the text file set as characteristic text files to establish the text recognition model for recognizing text information in text files, so that the established model can cover different text files. This ensures the accuracy of the established text recognition model and solves the problem in the prior art that a text recognition model established from repeatedly acquired identical text files has low accuracy. Furthermore, text information in a text picture can be accurately recognized by a text recognition model established with the method provided by this embodiment.
In addition, automatically selecting mutually different text files from the text file set to establish the text recognition model reduces the number of text files used as training samples, that is, it reduces the number of repeatedly acquired text files. This improves the efficiency of establishing the text recognition model and solves the problem that establishing a text recognition model is inefficient when too many text files are acquired.
In this embodiment, the method can be used for training an OCR text recognition model. OCR can be understood as a technology that allows a computer to recognize the characters in a picture: given a picture stored in a computer, the computer cannot automatically recognize the characters it contains. To realize this function, an OCR model needs to be established, and the model is obtained through training. Before training, text files for OCR training must be acquired. The approach in the related art is to collect a massive number of pictures containing characters and to label the content of each picture one by one (i.e., in a form the computer can read), and then to have the model learn from these labeled text files. After the OCR model has been trained with a massive number of text files, when it encounters a new picture containing characters, it can recognize the characters in the picture and output them in a character format readable by the computer.
However, OCR model training requires a very large number of samples to ensure that a usable OCR model can be trained, and this requirement has two disadvantages. 1. For such a number of samples to be collected and labeled, a person must view each picture, determine its text content, and then record that content in text format (i.e., make it computer readable), picture by picture. The labor cost is very high and labeling errors cannot be ruled out. 2. The samples must be very diverse. For example, the character "good" may appear in various colors, fonts and backgrounds, and even with shading, slant, different stroke weights and different angles of illumination. As many of these forms of "good" as possible should be included as samples for training the OCR model, so that the model will correctly recognize a newly encountered "good" in future use. However, collecting samples with such rich forms of expression by manual searching and screening is an enormous amount of work.
In this embodiment, first, since the text recognition model is generated from text files that are already readable by the computer, manual labeling of text pictures is not needed. Second, the text files used to generate the text recognition model contain a variety of different representations of the same text information. In addition, after the text files have been generated, a random algorithm is applied to perform a second, random selection of the text files used for training. Without such a random algorithm, if 1000 forms of the character "good" and 1000 forms of the character "bad" are generated, then every time the program takes content as input it outputs all 1000 forms, which reduces the accuracy of computer recognition. In this embodiment, 1000 forms of "good" are generated and 500 are randomly selected, and 1000 forms of "bad" are generated and 500 are randomly selected, which ensures both the richness and the randomness of the samples.
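As a rough illustration of this "generate 1000, keep 500" idea, a per-class subsampling step might look like the sketch below; the file names are assumed, and Python's random.sample stands in for the improved random selection described later in this embodiment.

```python
import random

def subsample_per_class(samples_by_class, keep=500, seed=None):
    # Randomly keep a fixed number of variants for each character class,
    # e.g. 500 out of 1000 generated forms of the character "good".
    rng = random.Random(seed)
    return {label: rng.sample(variants, min(keep, len(variants)))
            for label, variants in samples_by_class.items()}

# Hypothetical example: 1000 generated variants per character, keep 500 each.
generated = {"good": [f"good_{i:04d}.png" for i in range(1000)],
             "bad":  [f"bad_{i:04d}.png" for i in range(1000)]}
picked = subsample_per_class(generated, keep=500, seed=42)
```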
In this embodiment, text files different from each other may be selected from the text file set as the characteristic text file according to, but not limited to, the file identifier of the text file in the text file set and/or the storage location identifier of the text file in the text file set.
The process of selecting mutually different text files from the set of text files as characteristic text files is described below by three examples.
An example is a process of selecting different text files from a text file set as characteristic text files according to file identifications of the text files in the text file set.
In the first example, because different text files in the text file set carry different file identifiers, the file identifiers may be selected in batch through a preset algorithm, and then the same file identifiers are deleted, and different file identifiers are reserved. And then extracting corresponding text files from the text file set according to the screened different file identifications as characteristic text files to establish a text recognition model. Through the steps, the characteristic text file is obtained according to the characteristic that different text files carry different text identifications, so that the established text recognition model can cover different text files, the accuracy of the established text recognition model is ensured, and the problem that the accuracy of the text recognition model established by repeatedly obtaining the same text file is lower in the prior art is solved. And further, the text information in the text picture can be accurately identified by the text identification model established by the text identification model establishing method provided by the embodiment.
The second example is a process of selecting different text files from the text file set as characteristic text files according to the storage position identification of the text files in the text file set.
In example two, different text file storage locations in the text file set are different and therefore carry different storage location identifiers, the storage location identifiers may be selected in batch through a preset algorithm, the same storage location identifiers are deleted, and the different storage location identifiers are retained. And then extracting corresponding text files from the text file set according to the screened different storage position identifications as characteristic text files to establish a text recognition model. Through the steps, the characteristic text file is obtained according to the characteristic that different text files carry different storage position identifications due to different storage positions, so that the established text recognition model can cover different text files, the accuracy of the established text recognition model is ensured, and the problem that the accuracy of the text recognition model established by repeatedly obtaining the same text file is lower in the prior art is solved. And further, the text information in the text picture can be accurately identified by the text identification model established by the text identification model establishing method provided by the embodiment.
Example three is a process of selecting different text files from the text file set as the characteristic text files according to the file identification of the text file in the text file set and the storage location identification of the text file in the text file set.
In example three, text identifiers may be selected in batch from a text file set according to text identifiers, at this time, the batch-selected text identifiers may be the same, different text identifiers are stored in different storage locations, the same text identifiers are stored in the same storage locations, so that the different text identifiers carry different storage location identifiers, then different storage location identifiers are selected in batch, different file identifiers are obtained according to the different storage location identifiers, and thus corresponding different text files are obtained in the text file set as feature text files, and a text recognition model is established. Through the steps, the same file identification in the possibly repeated file identifications obtained in batch is stored at the same position, the different storage position identifications corresponding to the different file identifications are ensured, different file identifications are screened out according to the different storage position identifications to extract the characteristic text file from the text file set, so that the established text recognition model can cover different text files, the accuracy of the established text recognition model is ensured, and the problem that the accuracy of the text recognition model established by repeatedly obtaining the same text file is lower in the prior art is solved. And further, the text information in the text picture can be accurately identified by the text identification model established by the text identification model establishing method provided by the embodiment.
It should be noted that, in this embodiment, only the text identifier and the storage location identifier are taken as examples to explain how to obtain different text files to establish the text recognition model, and other information such as identifiers or parameters that can distinguish the different text files may also be used to obtain the different text files, which belongs to the protection scope of the present invention and is not described herein again.
In step S102, the text file set may be obtained by obtaining an existing text file set, or by generating a text file set according to a predetermined rule. The method for generating the text file set may be, but is not limited to, generating the text files in batch, and then selecting the text files forming the text file set from the generated text files, or selecting the existing text files to form the text file set.
Before generating the text file set, it may also be determined whether to process the text files, where the processing manner includes, but is not limited to: blur, noise, sharpening, lighting, etc.
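A minimal sketch of such optional image processing, assuming OpenCV and NumPy are available (the kernel and parameter values below are illustrative only):

```python
import cv2
import numpy as np

def process_text_image(img):
    # Optional processing of a generated text file (image): blur, noise,
    # sharpening and a simple illumination (brightness/contrast) change.
    blurred = cv2.GaussianBlur(img, (5, 5), 0)

    noise = np.random.normal(0, 10, img.shape).astype(np.float32)
    noisy = np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

    sharpen_kernel = np.array([[0, -1, 0],
                               [-1, 5, -1],
                               [0, -1, 0]], dtype=np.float32)
    sharpened = cv2.filter2D(img, -1, sharpen_kernel)

    brighter = cv2.convertScaleAbs(img, alpha=1.2, beta=20)
    return blurred, noisy, sharpened, brighter
```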
In this embodiment, in order to obtain the text file set, the obtained text information may be copied in batch to obtain a large amount of text information, different text parameters are set for each text information, and a large amount of text files different from each other are obtained to form the text file set. Through the steps, different text parameters are set for a large amount of same text information, and different text files are obtained to form a text file set, so that the text files with the same text information but different text parameters are ensured to be stored in the text file set, and the text information can be identified from the text files in various forms in the subsequent identification process of the text files.
In addition, in the present embodiment, the text information may be obtained in a form of, but not limited to, receiving an input text string, or reading a text string stored in the system.
If the text information is obtained by reading the text character strings stored in the system, the read text character strings are divided into a plurality of different text character strings according to a preset rule, and one text character string is extracted to be used as the text information for generating the text file. The division unit may be, but is not limited to, a line, a plurality of lines, a word, a plurality of words, and the like.
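For example, a stored text string could be segmented and one piece extracted as follows (a simple sketch; the segmentation rules used in practice may be more elaborate):

```python
def segment_text(raw, unit="line"):
    # Segment a stored text string by line or by word according to a rule.
    if unit == "line":
        pieces = [line.strip() for line in raw.splitlines()]
    elif unit == "word":
        pieces = raw.split()
    else:
        raise ValueError("unit must be 'line' or 'word'")
    return [piece for piece in pieces if piece]

stored = "first line of text\nsecond line of text\nthird line of text"
text_string_set = segment_text(stored, unit="line")
text_information = text_string_set[0]   # one extracted string as the text information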
Through the steps, the generated text files can be ensured to carry the same text information, but the text parameters of the text information are different. The establishment condition of the text recognition model is satisfied.
In this embodiment, the text parameter may include, but is not limited to, at least one of: font format, font display size, blank character size ratio, character interval size ratio, character rotation angle, character font color, character transparency parameter, character thickening degree, character inclination degree, character underline drawing, background picture and display position of text information in the background picture. Optionally, in this embodiment, a port of an open source computer vision library (OPENCV) may be called, but is not limited to, to set a text parameter of the text information.
The following describes the setting process of the parameters by taking a background picture as an example.
After the text information is obtained, different text parameters are set for the text information in batch, the text information with different text parameters is added into one or more background pictures obtained from a background picture library, the same text information can be added into different background pictures to generate different text files, different text information can be added into the same background picture to generate different text files, and therefore a large number of text files are obtained.
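This embodiment mentions calling an OpenCV interface to set the text parameters; purely as an illustration, the sketch below uses Pillow instead (because it renders arbitrary TrueType fonts directly), and the font and file paths are assumptions:

```python
from PIL import Image, ImageDraw, ImageFont

def render_text_file(text, background_path, out_path,
                     font_path="simhei.ttf", size=32,
                     color=(0, 0, 0), position=(10, 10)):
    # Draw one piece of text information, with its text parameters, onto a
    # background picture and save the result as a new text file (image).
    background = Image.open(background_path).convert("RGB")
    font = ImageFont.truetype(font_path, size)
    ImageDraw.Draw(background).text(position, text, font=font, fill=color)
    background.save(out_path)

# The same text on different backgrounds, or different texts on the same
# background, both yield mutually different text files.
render_text_file("good", "bg1.png", "good_on_bg1.png")
render_text_file("good", "bg2.png", "good_on_bg2.png")
```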
Optionally, in step S104, text files different from each other may be selected from the text file set as the characteristic text file according to the file identifier of the text file in the text file set and/or the storage location identifier of the text file in the text file set.
Optionally, when mutually different text files are selected from the text file set as the characteristic text files according to the file identifiers of the text files in the text file set and/or the storage location identifiers of the text files in the text file set, a first preset number of file identifiers can be acquired from the text file set according to a preset algorithm to obtain a file identifier set, wherein the storage location identifiers of the text files corresponding to the same file identifier in the file identifier set are the same; the mutually different storage location identifiers corresponding to the file identifiers in the file identifier set are acquired; a second preset number of mutually different file identifiers are screened out from the file identifier set according to the mutually different storage location identifiers; and the text files corresponding to the mutually different file identifiers are extracted from the text file set as the characteristic text files.
The above process is exemplified below.
Example 1: the step of screening out the second preset number of different file identifiers from the file identifier set according to the different storage location identifiers may include, but is not limited to, the following steps: repeatedly executing the following steps until the number of the acquired file identifications which are different from each other reaches a second preset number: judging whether the number of the currently acquired file identifications which are different from each other reaches a second preset number or not; when the number does not reach a second preset number, acquiring a storage position identifier from a storage position identifier set, and generating a current variable according to the acquired storage position identifier, wherein the storage position identifier set is used for storing the storage position identifiers which are not used for generating the variable; acquiring a random number corresponding to a current variable from a preset random array; acquiring file identifications corresponding to the random number from a file identification set as different currently acquired file identifications; and updating the quantity of the currently acquired file identifications which are different from each other, and deleting the acquired storage position identification from the storage position identification set.
The above process may be expressed, but is not limited to, as follows. The current variable is assigned as
n = Σ_{l=0}^{L-1} I_{W+l} · 2^l
where L is the number of binary bits taken from the storage location identifier after it is converted to binary, W is the ordinal of the starting binary bit (bit positions are numbered in sequence from 0 upwards), l takes integer values from 0 to L-1 in sequence, and I_{W+l} is the (W+l)-th binary bit of a storage location identifier obtained from the storage location identifier set, the storage location identifier set being used for storing the storage location identifiers that have not yet been used for generating n. Then y_i = V[n] is assigned, where V[n] is the random number corresponding to n in the random array V[N], and the file identifier corresponding to y_i is obtained from the file identifier set as a currently acquired, mutually different file identifier. In the above process, L may be, but is not limited to, preset; the values of W and l increase in sequence, and I_{W+l} corresponds to a storage location identifier in the storage location identifier set. Weighting the bits I_{W+l} by 2^l perturbs the order of the storage location identifiers without repetition, which further ensures the randomness of the acquired storage location identifiers: the larger L is, the more random the arrangement of the storage location identifiers becomes, and the larger the random array V[N] obtained by perturbing that order becomes. Accordingly, to balance the randomness of the storage location identifiers against the amount of storage, L may be selected reasonably in practice according to the actual situation.
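A sketch of this screening loop is given below. The mapping from the random number V[n] to a concrete file identifier, and the use of modulo arithmetic to keep indices in range, are assumptions made only to keep the example runnable.

```python
def screen_file_identifiers(file_ids, storage_ids, V, target, W=0, L=8):
    # Screen out `target` mutually different file identifiers: each unused
    # storage location identifier contributes L binary bits (starting at bit
    # W) that form the index n; V[n] gives a random number, and that random
    # number then selects a file identifier.
    selected = []
    remaining = list(storage_ids)              # identifiers not yet used for n
    while len(selected) < target and remaining:
        storage_id = remaining.pop(0)          # take it and discard it afterwards
        n = sum(((storage_id >> (W + l)) & 1) << l for l in range(L))
        y = V[n % len(V)]                      # random number for this n
        file_id = file_ids[y % len(file_ids)]  # assumed mapping to a file identifier
        if file_id not in selected:
            selected.append(file_id)
    return selected
```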
Example 2: the process of obtaining the file identifiers in the first preset number of text file sets according to the preset algorithm may be: a first preset number of the file identifications are obtained according to a preset random number generator (such as a linear congruential random number generator).
A first preset number of the file identifiers may be obtained from a linear congruential random number generator by the following formula: the random numbers x_1, x_2, ..., x_{i-1}, x_i generated from x_i = (a·x_{i-1} + c) mod M form the first preset number of file identifiers, where a, c, M and x_0 are preset parameters with M > 0, 0 < a < M and 0 ≤ c < M.
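A minimal linear congruential generator of this form might look as follows; the constants a, c, M and x_0 are illustrative preset values, not ones specified by this embodiment:

```python
def lcg_file_identifiers(count, a=1103515245, c=12345, M=2**31, x0=1):
    # x_i = (a * x_{i-1} + c) mod M; repeats among the outputs are possible,
    # which is why the later screening step is needed.
    identifiers, x = [], x0
    for _ in range(count):
        x = (a * x + c) % M
        identifiers.append(x)
    return identifiers

first_batch = lcg_file_identifiers(1000)   # first preset number = 1000
```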
Optionally, the process of obtaining the text file set may be: acquiring text information; copying the text information in batch to obtain a plurality of text information; and respectively setting text parameters for the plurality of text messages to obtain different text files, wherein the text file set comprises the different text files.
Alternatively, the text information may be obtained by receiving an input first text string; or reading a second text string stored in the system; segmenting the second text string according to a preset strategy to obtain a text string set; and extracting a third text string in the text string set as text information to obtain the text information.
Optionally, the text parameters may include, but are not limited to, at least one of: the font format parameter of the characters in the text information, the font display size parameter of the characters in the text information, the space size ratio parameter of the characters in the text information, the rotation angle parameter of the characters in the text information, the font color parameter of the characters in the text information, the transparency parameter of the characters in the text information, the thickening degree parameter of the characters in the text information, the inclination degree parameter of the characters in the text information, the underline drawing parameter of the characters in the text information, the background picture and the display position parameter of the text information in the background picture.
In the following examples and alternative embodiments, the text file is exemplified by a sample, the text file set is exemplified by a batch sample set, and the feature text file is exemplified by a feature sample.
In order to make the description of the embodiments of the present invention clearer, the following description and illustrations are made with reference to alternative embodiments.
The optional embodiment provides a batch sample generation method for text positioning, detection and identification.
The optional embodiment solves the problem that when the existing OCR of the complex background text picture is carried out based on machine learning, the accuracy of the established text recognition model is low because the same text file is obtained repeatedly.
The method for generating the text recognition model for text positioning, detection and recognition in the optional embodiment comprises the following steps:
step 1, loading text information, and providing two loading modes: inputting a text character string, and if the mode is the mode, executing a step 3; or reading the existing text character string, and if the mode is the mode, executing the step 2;
step 2, selecting a preset rule to segment the read text character string into a plurality of objects, and storing the segmented text character string to an appointed path;
step 3, selecting a background picture to be loaded from the background picture library;
step 4, reading the segmented text character string or reading the input character string, and performing batch text parameter setting on the segmented text character string or the input character string, wherein the text parameter comprises at least one of the following parameters: font format, font display size, space size ratio, rotation angle, display position, font color, transparency setting, bolding degree, inclination degree, underline drawing, and the like;
step 5, adding various different text information after the text parameters are set in batches into the picture background to generate a text file;
step 6, judging whether to perform image processing on the text file according to the requirement: if image processing is needed, executing step 7, if image processing is not needed, executing step 8;
and 7, carrying out image processing on the text file, wherein the image processing comprises the following steps: blur, noise, sharpening, lighting, and the like;
step 8, providing a novel improved linear congruential random number generator to ensure the randomness of the acquired characteristic text files:
step 8-1, setting a random rule to the generated text file:
x_i = (a·x_{i-1} + c) mod M
where x_0 is the initial text file, M is the modulus (M > 0), a is the multiplier (0 < a < M), and c is the increment (0 ≤ c < M); x_0, M, a and c are preset values.
Step 8-2, generating x_i and a·x_{i-1} from step 8-1, where x_i and a·x_{i-1} are file identifiers randomly selected from the text file set;
step 8-3, variable assignment
n = Σ_{l=0}^{L-1} I_{W+l} · 2^l
where L is the number of binary bits taken from the storage location identifier after it is converted to binary, W is the ordinal of the starting binary bit (bit positions are numbered in sequence from 0 upwards), l takes integer values from 0 to L-1 in sequence, and I_{W+l} is the (W+l)-th binary bit of the storage location identifier indicating, in the computer, the storage location of the integer a·x_{i-1} or x_i;
step 8-4, assigning y_i = V[n], where V[n] is the random number in the auxiliary random array V[N];
step 8-5, for a preset number of random numbers y_i, extracting the corresponding x_i and acquiring the corresponding text files as the characteristic text files;
and 9, saving the selected characteristic text file again, renaming the characteristic text file (for example, renaming the characteristic text file by using sequence numbers), and generating a text recognition model.
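Step 9 can be illustrated by the following sketch, which copies the selected characteristic text files to a new directory and renames them with sequence numbers (the paths are hypothetical):

```python
import shutil
from pathlib import Path

def save_characteristic_files(selected_paths, target_dir="Path_target"):
    # Copy the selected characteristic text files to the target directory,
    # renaming them with sequential numbers (0001.png, 0002.png, ...).
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    renamed = []
    for index, source in enumerate(selected_paths, start=1):
        destination = target / f"{index:04d}{Path(source).suffix}"
        shutil.copy(source, destination)
        renamed.append(destination)
    return renamed
```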
The following description is given with reference to a specific example. Fig. 2 is a flowchart of a text recognition model building method according to an alternative embodiment of the present invention, in which a text string is a text document in *.txt format. The process comprises the following steps:
step S202, loading text information and judging whether to read the text character string. The loading of the text information comprises two loading modes: the text string is entered or retrieved from a pre-stored text string. If it is determined that the text string is read (i.e., it needs to be obtained from a pre-stored text string), step S204-2 is performed, and if it is determined that the text string is not read (i.e., it needs to be input), step S204-1 is performed.
And step S204-1, inputting a text character string.
S204-2, segmenting the read text string into a plurality of objects according to a preset rule, choosing line-based or word-based segmentation as required; saving the plurality of segmented text strings (in *.txt format) to a specified path, named Path_A; and finding, under the path Path_A, the segmented text file to be processed, naming it source-text.
In step S206, a background picture is loaded.
Selecting a background picture (named background) to be loaded from an existing background picture library, where the background picture library is open and new picture files can be added to it as needed; the supported picture formats are: Windows bitmap files (BMP, DIB); JPEG files (JPEG, JPG, JPE); Portable Network Graphics (PNG); portable image formats (PBM, PGM, PPM); Sun raster images (SR, RAS); TIFF images (TIFF, TIF); OpenEXR HDR images (EXR); and JPEG 2000 images (JP2).
Step S208, batch operation, wherein the step S208 comprises:
s208-1, performing text parameter batch setting on the text character string source-text.
Setting a batch font format: alternative formats include, but are not limited to, the following font library's various fonts:
TrueType fonts (and collections), Type 1 fonts, CID-keyed Type 1 fonts, CFF fonts, OpenType fonts (both TrueType and CFF variants), SFNT-based bitmap fonts, X11 PCF fonts, Windows FNT fonts, BDF fonts (including anti-aliased ones);
Setting the font sizes in batches: by adjusting the font size parameters, size parameters such as the font display size, the blank character size ratio, the spacing size ratio and the rotation angle can be set in batches;
Setting the font positions in batches: the position at which the text is displayed on the picture is set; the text position can be set in batch by, but not limited to, setting in batch the horizontal and vertical coordinates of the first pixel at the upper-left corner of the text;
Setting the font colors in batches: an RGB format is adopted, and fonts of different colors are generated in batches by combining preset arrays with different numerical combinations of R, G and B;
Setting the font transparency in batches: the setting range can be 0-100%;
Setting the font rendering effects in batches: bolding (the bolding degree can be set, and vertical or horizontal bolding can be set independently), tilting (different tilt angles can be set), drawing with a border, drawing with a shadow, drawing with an underline, and the like.
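The batch settings listed above can be combined exhaustively; a small illustrative sketch (with assumed font files and value ranges) is:

```python
from itertools import product

fonts = ["truetype_a.ttf", "type1_b.pfb"]       # assumed font file names
sizes = [24, 32, 48]
colors = [(0, 0, 0), (200, 0, 0), (0, 0, 200)]  # R, G, B combinations
alphas = [1.0, 0.7, 0.4]                        # transparency levels
effects = ["plain", "bold", "italic", "underline"]

# Every combination of the batch settings yields one distinct text rendering.
parameter_grid = [
    {"font": f, "size": s, "color": c, "alpha": a, "effect": e}
    for f, s, c, a, e in product(fonts, sizes, colors, alphas, effects)
]
print(len(parameter_grid))   # 2 * 3 * 3 * 3 * 4 = 216 combinations
```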
And S208-2, writing various different text files after batch parameter adjustment into background pictures (background) respectively.
Step S208-3, judging whether to perform image processing according to requirements: if image processing is required, step S208-4 is performed, and if image processing is not required, step S208-5 is performed.
Step S208-4, performing image processing on the series of text files obtained in the step S208-2 by combination and selection, wherein the image processing can comprise blurring, noise, sharpening, illumination and the like; the execution of step S208-5 is continued after the image processing.
Step S208-5, renaming the text files generated in batches (for example, renaming with sequential numbers), saving them in a new format, and selecting a save path Path_B for the text files.
And step S210, generating a characteristic text file.
In step S210, a novel improved linear congruential random number generator is provided to ensure the randomness of the generated feature samples; its generation process can be as shown in fig. 3, and the randomness of the generated characteristic text files is ensured by this improved linear congruential random number generator. The flow of the generation process, shown in fig. 3, includes the following steps:
step S302, loading batch text files x i Setting a random rule for the generated batch text file:
x i =(ax i - 1 +c)mod(M)
wherein x is 0 For the initial text file, M is the modulus, M>0, a is multiplier, 0<a<M, c are increments, c is more than or equal to 0<M;x 0 M, a and c are preset values.
Step S304, generating x_i and a·x_{i-1} from step S302, where x_i and a·x_{i-1} are file identifiers randomly selected from the text file set.
Step S306, assigning a value to n, wherein,
n = Σ_{l=0}^{L-1} I_{W+l} · 2^l
where L is the number of binary bits taken from the storage location identifier after it is converted to binary, W is the ordinal of the starting binary bit (bit positions are numbered in sequence from 0 upwards), l takes integer values from 0 to L-1 in sequence, and I_{W+l} is the (W+l)-th binary bit of the storage location identifier indicating, in the computer, the storage location of the integer a·x_{i-1} or x_i.
Step S308, assigning a value to y_i, where y_i = V[n] and V[n] is the random number in the auxiliary random array V[N].
Step S310, for a preset number of random numbers y_i, extracting the corresponding x_i and finding the corresponding text files under the save path Path_B.
Step S312, renaming the selected text files (for example, by sequence number), saving them to the target path Path_target, and generating the batch of characteristic text files.
It should be noted that the present invention is not limited to using the above method to obtain the number of samples that can satisfy the training machine learning, and other random methods can be used to generate the feature text file.
And S212, selecting a storage format and a path of the characteristic text file.
And step S214, saving the characteristic text file.
In summary, the embodiments and alternative embodiments of the present invention can generate a wide variety of text files in large quantities as required, with the following advantages. First, the input text can be entered through a personalized editing command, or an existing text string can be read directly and segmented to obtain the required text passages. Second, a large number of methods are added to generate, in one batch, text in different formats such as font format, font display size, blank character size ratio, spacing size ratio, rotation angle, display position, font color, transparency, bolding degree, inclination degree and underlining, and a series of image processing operations such as blurring, noise, sharpening and illumination are added to further expand the diversity of the samples. In addition, a novel improved linear congruential random number generator method is provided, which guarantees the randomness of the generated samples, provides more complete and reasonable samples for subsequent machine-learning-based model training, and ensures that the trained models have higher accuracy. Meanwhile, the text recognition model establishing method significantly saves labor cost and greatly improves the training efficiency of machine learning.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention or portions thereof contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example two
In this embodiment, a text recognition model establishing apparatus is further provided, and the apparatus is used to implement the foregoing embodiments and optional embodiments, which have already been described and are not described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 4 is a block diagram of an alternative text recognition model building apparatus according to an embodiment of the present invention, as shown in fig. 4, the apparatus includes:
1) an obtaining module 42, configured to obtain a text file set;
2) a selecting module 44, coupled to the obtaining module 42, configured to select different text files from the text file set as feature text files;
3) a building module 46, coupled to the selecting module 44, for building a text recognition model using the feature text file, wherein the text recognition model is used for recognizing text information in the text file to be recognized.
Optionally, the present embodiment may be applied, but not limited to, in a scenario of building a text recognition model. Particularly, a text recognition model for machine learning is established in an optical character recognition scenario.
By the device, firstly, the obtaining module 42 obtains a large number of text files to form a text file set, the selecting module 44 automatically selects different text files from the text file set, and the establishing module 46 establishes text recognition models for recognizing text information in the text files, so that the established text recognition models can cover different text files, the accuracy of the established text recognition models is ensured, and the problem of low accuracy of the text recognition models established by repeatedly obtaining the same text files in the prior art is solved. And further, the text information in the text picture can be accurately identified by the text identification model established by the text identification model establishing method provided by the embodiment.
In addition, through the mode of automatically selecting different text files from the text file set to establish the text recognition model, the number of the text files used as training samples for establishing the text recognition model can be reduced, namely the number of the repeatedly acquired text files is reduced, so that the efficiency of establishing the text recognition model is improved, and the problem of low efficiency of establishing the text recognition model caused by excessive number of the acquired text files is solved.
In this embodiment, the selection module 44 may be, but not limited to, configured to select different text files from the text file set as the characteristic text files according to file identifications of the text files in the text file set and/or storage location identifications of the text files in the text file set.
The process of selecting mutually different text files from the set of text files as characteristic text files by the selection module 44 is explained below by three examples.
An example is a process in which selection module 44 selects text files that are different from each other from the collection of text files as characteristic text files based on file identifications of the text files in the collection of text files.
In an example one, because different text files in the text file set carry different file identifiers, the selecting module 44 may select file identifiers in batch through a preset algorithm, delete the same file identifiers, and keep different file identifiers. And then extracting corresponding text files from the text file set according to the screened different file identifications as characteristic text files to establish a text recognition model. By the device, the characteristic text file is obtained according to the characteristic that different text files carry different text identifications, so that the established text recognition model can cover different text files, the accuracy of the established text recognition model is ensured, and the problem that the accuracy of the text recognition model established by repeatedly obtaining the same text file is lower in the prior art is solved. And further, the text information in the text picture can be accurately identified by the text identification model established by the text identification model establishing method provided by the embodiment.
The second example is a process of selecting different text files from the text file set as characteristic text files by the selection module 44 according to the storage location identifiers of the text files in the text file set.
In example two, different text file storage locations in the text file set are different and therefore carry different storage location identifiers, and the selection module 44 may select the storage location identifiers in batch through a preset algorithm, delete the same storage location identifiers, and reserve different storage location identifiers. And then extracting corresponding text files from the text file set according to the screened different storage position identifications as characteristic text files to establish a text recognition model. By the device, the characteristic text file is obtained according to the characteristic that different text files carry different storage position identifications due to different storage positions, so that the established text recognition model can cover different text files, the accuracy of the established text recognition model is ensured, and the problem that the accuracy of the text recognition model established by repeatedly obtaining the same text file is lower in the prior art is solved. And further, the text information in the text picture can be accurately identified by the text identification model established by the text identification model establishing method provided by the embodiment.
Example three is a process of selecting, by the selection module 44, different text files from the text file set as characteristic text files according to the file identifier of the text file in the text file set and the storage location identifier of the text file in the text file set.
In example three, the selecting module 44 may first select the text identifiers in batch from the text file set according to the text identifiers, where the text identifiers selected in batch may be the same, store different text identifiers in different storage locations, where the same text identifiers are stored in the same storage locations, so that the different text identifiers carry different storage location identifiers, then select different storage location identifiers in batch, obtain different file identifiers according to the different storage location identifiers, thereby obtaining corresponding different text files in the text file set as the feature text files, and establish the text recognition model. By the aid of the device, the same file identification in the possibly repeated file identifications acquired in batches is stored at the same position, the different file identifications correspond to the different storage position identifications, different file identifications are screened out according to the different storage position identifications to extract the characteristic text file from the text file set, so that the established text recognition model can cover different text files, accuracy of the established text recognition model is guaranteed, and the problem that the text recognition model established by the aid of the repeatedly acquired same text file in the prior art is low in accuracy is solved. And further, the text information in the text picture can be accurately identified by the text identification model established by the text identification model establishing method provided by the embodiment.
It should be noted that this embodiment only takes the file identifier and the storage location identifier as examples of how to obtain different text files for establishing the text recognition model; other identifiers or parameters capable of distinguishing different text files may also be used to obtain the different text files, which likewise falls within the protection scope of the present invention and is not described again here.
The obtaining module 42 may acquire the text file set by acquiring an existing text file set, or by generating the text file set according to a predetermined rule. The text file set may be generated by, but is not limited to, generating text files in batch and then selecting from them the text files that form the text file set, or by selecting existing text files to form the text file set.
Before generating the text file set, the obtaining module 42 may further decide whether to process the text files, where the processing manner includes, but is not limited to: blurring, adding noise, sharpening, and lighting adjustment.
In this embodiment, in order to obtain the text file set, the obtaining module 42 may copy the acquired text information in batch to obtain a large amount of text information, set different text parameters for each piece of text information, and thereby obtain a large number of mutually different text files that form the text file set. With this apparatus, different text parameters are set for a large number of copies of the same text information, and the resulting different text files form the text file set, so that text files with the same text information but different text parameters are stored in the set and the text information can later be recognized from text files of various forms.
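A minimal sketch of this batch copying and parameter variation follows; the parameter names, value ranges, and random seeding are assumptions made for illustration rather than values taken from the embodiment.

```python
# Sketch: duplicate one piece of text information in batch and attach different
# text parameters to each copy so that every generated text file differs.
import random

def make_varied_copies(text_info, count, seed=7):
    rng = random.Random(seed)
    copies = []
    for _ in range(count):
        params = {
            "font_scale": round(rng.uniform(0.8, 2.0), 2),     # font display size
            "rotation_deg": round(rng.uniform(-10.0, 10.0), 1),
            "color_bgr": (rng.randrange(256), rng.randrange(256), rng.randrange(256)),
            "thickness": rng.randrange(1, 4),                   # thickening degree
            "position": (rng.randrange(5, 50), rng.randrange(20, 80)),
        }
        copies.append((text_info, params))
    return copies
```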
In addition, in this embodiment, the obtaining module 42 may obtain the text information by, but is not limited to, receiving an input text string or reading a text string stored in the system.
If the text information is obtained by reading a text string stored in the system, the obtaining module 42 divides the read text string into a plurality of different text strings according to a predetermined rule and extracts one of them as a piece of text information for generating a text file. The division unit may be, but is not limited to, one line, several lines, one character, several characters, one word, or several words.
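A possible sketch of this division step is shown below; the division-unit names and the whitespace-based word splitting are assumptions, since the predetermined rule is left open in the embodiment.

```python
# Sketch: split a stored text string into candidate pieces of text information,
# one of which is then taken as the text information for a generated text file.
def split_text_string(source_text, unit="line"):
    if unit == "line":
        pieces = source_text.splitlines()
    elif unit == "word":
        pieces = source_text.split()           # whitespace-delimited words
    else:
        raise ValueError("unsupported division unit: %s" % unit)
    return [p for p in pieces if p.strip()]    # drop empty pieces
```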
With this apparatus, the generated text files are ensured to carry the same text information while the text parameters of the text information differ, which satisfies the conditions for establishing the text recognition model.
In this embodiment, the text parameters may include, but are not limited to, at least one of: font format, font display size, blank-character size ratio, character spacing ratio, character rotation angle, character font color, character transparency, character thickening degree, character inclination degree, character underlining, the background picture, and the display position of the text information in the background picture. Optionally, in this embodiment, an OpenCV interface may be called, but is not limited to being called, to set the text parameters of the text information.
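One way to group the listed text parameters is sketched below; the field names, types, and defaults are assumptions made for illustration, and only the mention of OpenCV comes from the embodiment itself.

```python
# Sketch of a container for the text parameters listed above.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TextParams:
    font_face: int = 0                        # e.g. an OpenCV Hershey font id
    font_scale: float = 1.0                   # font display size
    blank_ratio: float = 1.0                  # blank-character size ratio
    spacing_ratio: float = 1.0                # character spacing ratio
    rotation_deg: float = 0.0                 # character rotation angle
    color_bgr: Tuple[int, int, int] = (0, 0, 0)
    transparency: float = 0.0
    thickness: int = 1                        # character thickening degree
    italic_shear: float = 0.0                 # character inclination degree
    underline: bool = False
    background_path: str = ""                 # background picture
    position: Tuple[int, int] = (10, 30)      # display position in the picture
```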
The following takes the background picture as an example to explain the parameter setting procedure.
After the obtaining module 42 obtains the text information, it sets different text parameters for the text information in batch and adds the text information carrying the different text parameters to one or more background pictures obtained from a background picture library. The same text information may be added to different background pictures to generate different text files, and different text information may be added to the same background picture to generate different text files, so that a large number of text files are obtained.
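For illustration, the sketch below adds one piece of text information to several background pictures with different parameter sets through OpenCV's cv2.putText; this call covers only part of the parameter list above (no underline, transparency, or rotation), and the output naming scheme is an assumption.

```python
# Sketch: render the same text onto multiple backgrounds with varying parameters.
import cv2

def render_text_files(text, background_paths, param_sets, out_prefix="sample"):
    generated = []
    for i, bg_path in enumerate(background_paths):
        background = cv2.imread(bg_path)
        if background is None:
            continue                           # skip unreadable backgrounds
        for j, p in enumerate(param_sets):
            canvas = background.copy()
            cv2.putText(canvas, text, p["position"], cv2.FONT_HERSHEY_SIMPLEX,
                        p["font_scale"], p["color_bgr"], p["thickness"],
                        cv2.LINE_AA)
            out_path = f"{out_prefix}_{i}_{j}.png"
            cv2.imwrite(out_path, canvas)
            generated.append(out_path)
    return generated                           # paths of the generated text files
```

The parameter dictionaries produced by the make_varied_copies sketch earlier use the same keys, so they can be passed here directly as param_sets.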
Optionally, the selection module 44 is configured to select mutually different text files from the text file set as the characteristic text files according to the file identifiers of the text files in the text file set and/or the storage location identifiers of the text files in the text file set.
Fig. 5 is a block diagram of another alternative text recognition model building apparatus according to an embodiment of the present invention, as shown in fig. 5, optionally, the selecting module 44 includes:
1) a first obtaining unit 52, configured to obtain a first preset number of file identifiers from the text file set according to a preset algorithm to obtain a file identifier set, where the storage location identifiers of the text files corresponding to identical file identifiers in the file identifier set are the same;
2) a second obtaining unit 54, coupled to the first obtaining unit 52, configured to obtain different storage location identifiers corresponding to file identifiers in the file identifier set;
3) a selecting unit 56, coupled to the second obtaining unit 54, configured to select a second preset number of different file identifiers from the file identifier set according to the different storage location identifiers;
4) an extracting unit 58, coupled to the selecting unit 56, configured to extract, from the text file set, the text files corresponding to the mutually different file identifiers as the characteristic text files.
Fig. 6 is a block diagram of another alternative text recognition model building apparatus according to an embodiment of the present invention, as shown in fig. 6, optionally, the obtaining module 42 includes:
1) a third acquiring unit 62 for acquiring text information;
2) a copying unit 64, coupled to the third acquiring unit 62, configured to copy the text information in batch to obtain a plurality of pieces of text information;
3) a setting unit 66, coupled to the copying unit 64, configured to set text parameters for the plurality of pieces of text information respectively to obtain mutually different text files, where the text file set includes the mutually different text files.
Optionally, the third obtaining unit 62 is configured to: receive an input first text string as the text information; or read a second text string stored in the system, segment the second text string according to a preset strategy to obtain a text string set, and extract a third text string from the text string set as the text information.
Optionally, the text parameter comprises at least one of: the font format parameter of the characters in the text information, the font display size parameter of the characters in the text information, the space size ratio parameter of the characters in the text information, the rotation angle parameter of the characters in the text information, the font color parameter of the characters in the text information, the transparency parameter of the characters in the text information, the thickening degree parameter of the characters in the text information, the inclination degree parameter of the characters in the text information, the underline drawing parameter of the characters in the text information, the background picture and the display position parameter of the text information in the background picture.
It should be noted that the above modules may be implemented by software or by hardware; in the latter case, the implementation may be, but is not limited to, the following: the modules are all located in the same processor, or the modules are separately located in a plurality of processors.
EXAMPLE III
The embodiment of the invention also provides a storage medium. In the present embodiment, the storage medium described above may be configured to store program code for performing the steps of:
step S1, acquiring a text file set;
step S2, selecting different text files from the text file set as characteristic text files;
and step S3, establishing a text recognition model by using the characteristic text file, wherein the text recognition model is used for recognizing the text information in the text file to be recognized.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing program codes, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented with a general-purpose computing device. They may be centralized on a single computing device or distributed across a network formed by multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device, and in some cases the steps shown or described may be performed in an order different from that described herein; alternatively, they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only an alternative embodiment of the present invention and is not intended to limit the present invention, and various modifications and variations of the present invention may occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A text recognition model building method is characterized by comprising the following steps:
acquiring a text file set, wherein the acquiring the text file set comprises: acquiring text information; copying the text information in batch to obtain a plurality of pieces of text information; and respectively setting text parameters for the plurality of pieces of text information to obtain different text files, wherein the text file set comprises the different text files;
selecting different text files from the text file set as characteristic text files, wherein the characteristic text files are obtained by using an improved linear congruential random number generator;
establishing a text recognition model by using the characteristic text file, wherein the text recognition model is used for recognizing text information in the text file to be recognized;
the acquiring of the text information includes: receiving an input text string, or reading a text string stored in a system;
the reading system reads the stored text character string, and comprises: dividing the read text character string into a plurality of different text character strings according to a preset rule, and extracting one text information serving as a generated text file, wherein the division unit comprises one line, a plurality of lines, one character, a plurality of characters, one word and/or a plurality of words;
the setting of text parameters for the plurality of text messages respectively to obtain different text files includes: respectively adding the text information with different text parameters to one or more background pictures obtained from a background picture library;
the adding the text information with different text parameters to one or more background pictures obtained from a background picture library respectively comprises: and adding the same text information into different background pictures to generate different text files, or adding different text information into the same background picture to generate different text files.
2. The method according to claim 1, wherein selecting the mutually different text files from the set of text files as the feature text file comprises: and selecting the different text files from the text file set as the characteristic text files according to the file identification of the text file in the text file set and/or the storage position identification of the text file in the text file set.
3. The method as claimed in claim 2, wherein selecting the mutually different text files from the text file set as the characteristic text files according to the file identifiers of the text files in the text file set and/or the storage location identifiers of the text files in the text file set comprises: acquiring a first preset number of file identifiers from the text file set according to a preset algorithm to obtain a file identifier set, wherein the storage location identifiers of the text files corresponding to identical file identifiers in the file identifier set are the same;
acquiring different storage position identifications corresponding to the file identifications in the file identification set;
screening a second preset number of different file identifications from the file identification set according to the different storage position identifications;
and extracting text files corresponding to the different file identifications from the text file set as the characteristic text files.
4. The method of claim 1, wherein the obtaining text information comprises: receiving an input first text string as the text information; or
reading a second text string stored in the system; segmenting the second text string according to a preset strategy to obtain a text string set; and extracting a third text string from the text string set as the text information.
5. The method of claim 1 or 4, wherein the text parameter comprises at least one of: the font format parameter of the characters in the text information, the font display size parameter of the characters in the text information, the size ratio parameter of the blank characters in the text information, the interval size ratio parameter of the characters in the text information, the rotation angle parameter of the characters in the text information, the font color parameter of the characters in the text information, the transparency parameter of the characters in the text information, the thickening degree parameter of the characters in the text information, the inclination degree parameter of the characters in the text information, the underlining drawing parameter of the characters in the text information, a background picture and the display position parameter of the text information in the background picture.
6. A text recognition model creation apparatus, comprising:
the acquisition module is used for acquiring a text file set; wherein the acquisition module comprises: a third acquisition unit configured to acquire text information; the copying unit is used for copying the text information in batch to obtain a plurality of text information; the setting unit is used for setting text parameters for the text messages respectively to obtain different text files, wherein the text file set comprises the different text files;
the selection module is used for selecting different text files from the text file set as characteristic text files, wherein the characteristic text files are obtained by using an improved linear congruential random number generator;
the establishing module is used for establishing a text recognition model by using the characteristic text file, wherein the text recognition model is used for recognizing text information in the text file to be recognized;
the acquisition module is used for receiving an input text character string or reading a text character string stored in the system;
the acquisition module is used for dividing the read text string into a plurality of different text strings according to a predetermined rule and extracting one of them as a piece of text information for generating a text file, wherein the division unit comprises one line, a plurality of lines, one character, a plurality of characters, one word and/or a plurality of words;
the acquisition module is used for respectively adding the text information with different text parameters to one or more background pictures acquired from a background picture library;
the acquisition module is used for adding the same text information into different background pictures to generate different text files, or adding different text information into the same background picture to generate different text files.
7. The apparatus of claim 6, wherein the selection module is configured to: and selecting the different text files from the text file set as the characteristic text files according to the file identification of the text file in the text file set and/or the storage position identification of the text file in the text file set.
8. The apparatus of claim 7, wherein the selection module comprises: a first obtaining unit, configured to obtain, according to a preset algorithm, a first preset number of file identifiers in the text file set to obtain a file identifier set, where storage location identifiers of text files corresponding to the same text file identifier in the file identifier set are the same;
a second obtaining unit, configured to obtain different storage location identifiers corresponding to the file identifiers in the file identifier set;
a selecting unit, configured to select a second preset number of different file identifiers from the file identifier set according to the different storage location identifiers;
and the extracting unit is used for extracting the text files corresponding to the different file identifications from the text file set as the characteristic text files.
9. The apparatus of claim 6, wherein the third obtaining unit is configured to: receive an input first text string as the text information; or
read a second text string stored in the system, segment the second text string according to a preset strategy to obtain a text string set, and extract a third text string from the text string set as the text information.
CN201610105478.XA 2016-02-25 2016-02-25 Text recognition model establishing method and device Active CN107122785B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610105478.XA CN107122785B (en) 2016-02-25 2016-02-25 Text recognition model establishing method and device
PCT/CN2017/074291 WO2017143973A1 (en) 2016-02-25 2017-02-21 Text recognition model establishing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610105478.XA CN107122785B (en) 2016-02-25 2016-02-25 Text recognition model establishing method and device

Publications (2)

Publication Number Publication Date
CN107122785A CN107122785A (en) 2017-09-01
CN107122785B true CN107122785B (en) 2022-09-27

Family

ID=59685923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610105478.XA Active CN107122785B (en) 2016-02-25 2016-02-25 Text recognition model establishing method and device

Country Status (2)

Country Link
CN (1) CN107122785B (en)
WO (1) WO2017143973A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766879B (en) * 2019-01-11 2023-06-30 北京字节跳动网络技术有限公司 Character detection model generation method, character detection device, character detection equipment and medium
CN111695381B (en) * 2019-03-13 2024-02-02 杭州海康威视数字技术股份有限公司 Text feature extraction method and device, electronic equipment and readable storage medium
CN110135413B (en) * 2019-05-08 2021-08-17 达闼机器人有限公司 Method for generating character recognition image, electronic equipment and readable storage medium
CN113034415B (en) * 2021-03-23 2021-09-14 哈尔滨市科佳通用机电股份有限公司 Method for amplifying small parts of railway locomotive image

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104751153A (en) * 2013-12-31 2015-07-01 中国科学院深圳先进技术研究院 Scene text recognizing method and device

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100718139B1 (en) * 2005-11-04 2007-05-14 삼성전자주식회사 Apparatus and method for recognizing character in an image
CN101635763A (en) * 2008-07-23 2010-01-27 深圳富泰宏精密工业有限公司 Picture classification system and method
CN102024152B (en) * 2010-12-14 2013-01-30 浙江大学 Method for recognizing traffic sings based on sparse expression and dictionary study
US8867828B2 (en) * 2011-03-04 2014-10-21 Qualcomm Incorporated Text region detection system and method
CN102999533A (en) * 2011-09-19 2013-03-27 腾讯科技(深圳)有限公司 Textspeak identification method and system
CN103077407B (en) * 2013-01-21 2017-05-17 信帧电子技术(北京)有限公司 Car logo positioning and recognition method and car logo positioning and recognition system
CN103488798B (en) * 2013-10-14 2016-06-15 大连民族学院 A kind of Automatic oracle identification method
CN104298713B (en) * 2014-09-16 2017-12-08 北京航空航天大学 A kind of picture retrieval method based on fuzzy clustering
CN104778481B (en) * 2014-12-19 2018-04-27 五邑大学 A kind of construction method and device of extensive face pattern analysis sample storehouse
CN104966097B (en) * 2015-06-12 2019-01-18 成都数联铭品科技有限公司 A kind of complex script recognition methods based on deep learning
CN105184313B (en) * 2015-08-24 2019-04-19 小米科技有限责任公司 Disaggregated model construction method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104751153A (en) * 2013-12-31 2015-07-01 中国科学院深圳先进技术研究院 Scene text recognizing method and device

Also Published As

Publication number Publication date
CN107122785A (en) 2017-09-01
WO2017143973A1 (en) 2017-08-31


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant