CN112329803B - Natural scene character recognition method based on standard font generation - Google Patents

Natural scene character recognition method based on standard font generation

Info

Publication number
CN112329803B
CN112329803B (application CN201910716704.1A)
Authority
CN
China
Prior art keywords
font
attention
standard
neural network
glyph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910716704.1A
Other languages
Chinese (zh)
Other versions
CN112329803A (en)
Inventor
连宙辉
王逸之
唐英敏
肖建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN201910716704.1A priority Critical patent/CN112329803B/en
Publication of CN112329803A publication Critical patent/CN112329803A/en
Application granted granted Critical
Publication of CN112329803B publication Critical patent/CN112329803B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a character recognition method based on standard font generation. A neural network model based on an attention mechanism and a generation mechanism is established; at each time step, attention is focused on a certain position of the picture, and the neural network features at that position are used both to predict the character category and to generate multi-font standard glyphs, until all characters in the picture have been traversed, thereby recognizing and outputting the characters in a natural scene picture containing one or more characters. The invention utilizes multi-font generation and an improved attention module to raise both glyph generation quality and feature robustness, thereby improving character recognition accuracy.

Description

Natural scene character recognition method based on standard font generation
Technical Field
The invention belongs to the technical field of computer vision and pattern recognition, relates to a character recognition method, and particularly relates to a method for recognizing characters in a natural scene picture.
Background
In the field of computer vision and pattern recognition, character recognition refers to having a computer automatically recognize the textual content in a picture. Natural scene character recognition specifically refers to recognizing all of the textual content in a natural scene picture whose main subject is text. Automatic recognition of characters in natural scenes is of great significance for improving productivity and daily life, understanding image content, and enabling machines to perceive their environment.
To date, many text recognition techniques have been proposed in academia and industry, mainly divided into local-feature-based methods and neural-network-based methods. The local-feature-based methods are represented by the method proposed in the literature (Wang, K., Babenko, B., & Belongie, S. J. (2011). End-to-End Scene Text Recognition. In 2011 International Conference on Computer Vision (pp. 1457-1464)). Such a method locates feature points using a series of hand-crafted rules and extracts features at those positions for character classification. However, in natural scene images the background and font of the text are complex and the shape of the text is not fixed (bent, tilted, etc.), and such methods offer no unified criterion for deciding which feature points are important, so they cannot achieve a good recognition effect.
Recently, some methods based on neural networks have been proposed. By exploiting the ability of neural networks to adaptively select features and their strong robustness to noise, these methods perform very well on the text recognition problem. They generally use a convolutional neural network (CNN) to extract visual features of the image, then use a recurrent neural network (RNN) for sequence modeling and predict each character in the image in turn. The long short-term memory network (LSTM) is a commonly used RNN structure. The most advanced methods at present are represented by the ASTER method in the literature (Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., & Bai, X. (2018). ASTER: An Attentional Scene Text Recognizer with Flexible Rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence) and the SAR method in the literature (Li, H., Wang, P., Shen, C., & Zhang, G. (2018). Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition. arXiv:1811.00751). However, these methods still have a defect: they supervise the neural network only with character category labels, and the guiding information provided by such labels is insufficient. When processing pictures with noisy text backgrounds or novel font styles, these methods cannot extract discriminative features, so the recognition accuracy is still not ideal. Some methods attempt to use standard glyphs as additional supervisory information, such as the SSFL method in the literature (Liu, Y., Wang, Z., Jin, H., & Wassell, I. J. (2018). Synthetically Supervised Feature Learning for Scene Text Recognition. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 449-465)) and the method in the literature (Zhang, Y., Liang, S., Nie, S., Liu, W., et al. (2018)); however, these methods use only a single font as the generation target and cannot handle irregularly shaped text well, so the gain in recognition accuracy is limited.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a character recognition method based on standard font generation. For the natural scene character features extracted by the neural network, an attention mechanism is combined with glyph generation: the neural network is used not only to predict the character category, but also to generate, for each natural scene character, the standard glyphs of multiple fonts corresponding to it. By learning how to generate standard glyphs, the neural network can extract natural scene character features that are more robust to interference factors such as noisy backgrounds and font styles, thereby improving the accuracy of character recognition.
For convenience of description, the present invention defines the following terms:
Natural scene picture: a picture of a real scene captured by a person.
Text picture: a picture whose main content is text, containing one or more characters.
The core of the invention is as follows: in the process of recognizing characters, unnecessary font style information in the neural network features is redundant information. The SSFL method of the prior art has two main problems: first, SSFL generates a standard glyph of a single font only in order to learn how to filter out the background of characters in natural scenes, and does not consider generating glyphs of multiple fonts or what benefit doing so could bring; second, the model provided by SSFL cannot generate glyphs of multiple fonts, which is a technical difficulty. Whereas SSFL in the prior art uses the glyph of only one font as the generation target, the invention provides standard glyph generation for multiple fonts: a glyph generator is used to generate, from the attention vector c(x, t), the standard glyphs ĝ_i(x, t) (i = 1, ..., m) of the m fonts corresponding to it, and a glyph discriminator is used to compete against the glyph generator so that the glyph generator can generate more realistic standard glyphs. For a given character there are several typical standard fonts, such as Song, Kai (regular), Hei (black) and so on. The invention uses a font style embedding vector z to control which font is generated, so that the features extracted by the neural network only need to reflect the most important content information (which character it is); this reduces unnecessary font style information in the neural network features and further improves recognition accuracy. Meanwhile, the innovative use of the font style embedding vector z to control the font of the generated glyph solves the problem of multi-font generation. In addition, the attention mechanism and the standard glyph generation are jointly optimized (joint learning), organically combining two models that would otherwise be learned independently, so that both perform better.
The technical scheme provided by the invention is as follows:
a character recognition method based on standard font generation. The invention processes a natural scene picture containing one or more characters and outputs the characters in the picture in their writing order. The invention uses a neural network model based on an attention mechanism and a generation mechanism: at each time step, attention is focused on a certain position of the picture, and the neural network features at that position are used both to predict the character category and to generate multi-font standard glyphs, until all characters in the picture have been traversed, thereby recognizing and outputting the characters in a natural scene picture containing one or more characters.
The attention mechanism and generation mechanism based neural network model comprises:
A. a convolutional neural network for extracting visual features F(x) of the input picture x;
B. a recurrent neural network for sequence modeling of the features F(x); the recurrent neural network comprises an LSTM encoder and a decoder;
C. an attention module for obtaining an attention weight matrix M(x, t) from the hidden state h(x, t) of the recurrent neural network at time t and F(x);
D. a classifier for classifying the features; in specific implementation, a softmax classifier is adopted;
E. a glyph generator for generating, from the attention vector c(x, t), the standard glyphs ĝ_i(x, t) (i = 1, ..., m) of its corresponding m fonts;
F. a glyph discriminator for competing with the glyph generator so that the glyph generator can generate a more realistic standard glyph.
The character recognition method based on standard font generation specifically comprises the following steps:
1. The visual features F(x) of the input picture x are extracted using a convolutional neural network.
2. A recurrent neural network is used to perform sequence modeling on F(x); the hidden state h(x, t) of the recurrent neural network at time t is fed, together with F(x), to the attention module to obtain an attention weight matrix M(x, t), which represents the attention allocated to each region of the picture at time t.
3. Each feature channel of F(x) is dot-multiplied with M(x, t) to obtain an attention vector c(x, t), which represents the features of the attended picture region at time t.
4. c(x, t) and h(x, t) are concatenated and classified with a classifier to predict the character category at the attended position at time t.
5. A glyph generator is used to generate, from the attention vector c(x, t), the standard glyphs ĝ_i(x, t) (i = 1, ..., m) of the m fonts corresponding to it; a glyph discriminator is used to compete against the glyph generator so that the glyph generator can generate more realistic standard glyphs.
In step 1, the convolutional neural network in the ASTER method (an attention-based scene text recognizer) is taken as a basis and the stride of the first convolution unit in each of the last two convolution groups is modified to 1 × 1; this network is used as the CNN feature extractor of the invention to extract the visual features F(x) of the input picture x. Here x ∈ ℝ^{48×160×3}, that is, the input picture is scaled to a height of 48 pixels and a width of 160 pixels with 3 color channels, and F(x) ∈ ℝ^{H×W×C}, where H = 6, W = 40 and C = 512 respectively denote the height, width and number of channels of the feature F(x).
In step 2, an LSTM encoder and decoder are used to perform sequence modeling on the features F(x). Both the LSTM encoder and decoder have two hidden layers with 512 nodes per layer. Each group of features of F(x) along the width (W) dimension is max-pooled (Max-Pooling) along the height dimension and then input into the LSTM encoder in sequence. The hidden state of the LSTM encoder at the last time step is used as the initial state of the LSTM decoder. The hidden state h(x, t) of the LSTM decoder at time t is fed, together with F(x), to the attention module to obtain the attention weight matrix M(x, t) ∈ ℝ^{H×W}, which is calculated as follows:

M′_ij(x, t) = tanh( Σ_{p,q∈N(i,j)} W_F F_pq(x) + W_h h(x, t) )   formula (1)

M(x, t) = softmax( W_M M′(x, t) )   formula (2)

where M′(x, t) is an intermediate variable of the calculation; M′_ij(x, t) denotes the value of M′(x, t) at position (i, j), with 1 ≤ i ≤ H and 1 ≤ j ≤ W; N(i, j) denotes the neighbourhood centred at (i, j), i.e. i − 1 ≤ p ≤ i + 1 and j − 1 ≤ q ≤ j + 1; F_pq(x) denotes the feature of F(x) at position (p, q); W_F and W_h are parameters to be learned; tanh is the hyperbolic tangent function, and softmax is the normalized exponential function (softmax function).
In step 3, the feature of each channel of F(x) is dot-multiplied with M(x, t) to obtain the attention vector c(x, t) ∈ ℝ^C, representing the features of the attended picture region at time t.
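To make the computation above concrete, the following is a minimal PyTorch-style sketch of the attention module of formulas (1)-(2) together with the channel-wise dot product of step 3. It is an illustrative reconstruction rather than the exact implementation of the invention: the class and variable names are assumptions, and the neighbourhood sum over N(i, j) is realised as a learnable 3 × 3 convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class GlyphAttention(nn.Module):
    """Sketch of the attention module: formulas (1)-(2) plus the step-3 weighted sum."""
    def __init__(self, feat_channels=512, hidden_size=512):
        super().__init__()
        # The sum of W_F F_pq(x) over the 3x3 neighbourhood N(i, j) is realised here
        # as a single 3x3 convolution (an equivalent learnable form; assumption).
        self.conv_F = nn.Conv2d(feat_channels, hidden_size, kernel_size=3, padding=1, bias=False)
        self.W_h = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_M = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, feat, h_t):
        # feat: (B, C, H, W) visual features F(x); h_t: (B, hidden) decoder state h(x, t)
        B, C, H, W = feat.shape
        m = torch.tanh(self.conv_F(feat) + self.W_h(h_t)[:, :, None, None])   # formula (1)
        scores = self.W_M(m.permute(0, 2, 3, 1)).reshape(B, H * W)            # W_M M'(x, t)
        M = F_nn.softmax(scores, dim=1).reshape(B, 1, H, W)                   # formula (2)
        c = (feat * M).sum(dim=(2, 3))                                        # step 3: c(x, t) in R^C
        return c, M
```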
In step 4, the softmax classifier commonly used in machine learning is applied to the concatenation of the attention vector c(x, t) and the hidden state h(x, t) to obtain the probability of the character category y_t at the attended position at time t:

p(y_t | x) = softmax( W_o [c(x, t); h(x, t)] + b_o )   formula (3)

where W_o and b_o are parameters to be learned, the square brackets denote concatenation, y_t ∈ C, and C denotes the overall set of character categories. The category ŷ_t that maximizes p(y_t | x) is selected as the predicted character category.
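A corresponding sketch of the classifier in formula (3) is given below, again as an assumption-laden illustration: the character set size num_classes and the module name are placeholders, not values fixed by the invention.

```python
import torch
import torch.nn as nn

class CharClassifier(nn.Module):
    """Sketch of formula (3): softmax over W_o [c(x,t); h(x,t)] + b_o."""
    def __init__(self, feat_channels=512, hidden_size=512, num_classes=97):
        super().__init__()
        self.W_o = nn.Linear(feat_channels + hidden_size, num_classes)  # bias plays the role of b_o

    def forward(self, c_t, h_t):
        logits = self.W_o(torch.cat([c_t, h_t], dim=1))   # concatenation [c(x,t); h(x,t)]
        probs = torch.softmax(logits, dim=1)               # p(y_t | x)
        return probs.argmax(dim=1), probs                  # predicted category and full distribution
```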
In step 5, a glyph generator based on a deconvolutional neural network (DCNN) takes the attention vector c(x, t) as input and generates the standard glyphs ĝ_i(x, t) of the m selected fonts, as expressed by formula (4):

ĝ_i(x, t) = DCNN( [c(x, t); z_i] ), i = 1, ..., m   formula (4)

where z_i is the embedding vector of the i-th font, a random vector following a multivariate standard normal distribution; the square brackets denote concatenation; and m is the chosen number of fonts. The real multi-font standard glyphs g_i(x, t) are rendered from TTF (TrueType Font) or OTF (OpenType Font) files. Meanwhile, following the idea of generative adversarial networks, a glyph discriminator based on a convolutional neural network is used to discriminate the generated standard glyphs from the real standard glyphs; the adversarial competition between the glyph discriminator and the glyph generator makes the generated glyphs more accurate. The glyph discriminator gives the probability that a generated glyph ĝ_i(x, t) is real as p(y_d = 1 | ĝ_i(x, t)) and the probability that it is fake as p(y_d = 0 | ĝ_i(x, t)) = 1 − p(y_d = 1 | ĝ_i(x, t)); likewise, it gives the probability that a real glyph g_i(x, t) is real as p(y_d = 1 | g_i(x, t)) and the probability that it is fake as p(y_d = 0 | g_i(x, t)) = 1 − p(y_d = 1 | g_i(x, t)).
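The glyph generator and glyph discriminator can be sketched as follows. This is only a schematic DCNN/CNN pair in the spirit of formula (4); the layer counts, channel widths and the assumed 32 × 32 glyph size are illustrative and do not reproduce the configuration given later in Table 2.

```python
import torch
import torch.nn as nn

class GlyphGenerator(nn.Module):
    """Sketch: maps [c(x,t); z_i] to a single-channel standard glyph image (assumed 32x32)."""
    def __init__(self, feat_dim=512, z_dim=128):
        super().__init__()
        self.fc = nn.Linear(feat_dim + z_dim, 256 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),  # 8x8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),    # 16x16
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Sigmoid(),                       # 32x32
        )

    def forward(self, c_t, z_i):
        # z_i: font style embedding, sampled from a multivariate standard normal distribution
        x = self.fc(torch.cat([c_t, z_i], dim=1)).view(-1, 256, 4, 4)
        return self.deconv(x)                      # generated glyph g_hat_i(x, t)

class GlyphDiscriminator(nn.Module):
    """Sketch: outputs p(y_d = 1 | glyph), the probability that a glyph is real."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),    # 16x16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),  # 8x8
            nn.Flatten(), nn.Linear(128 * 8 * 8, 1), nn.Sigmoid(),
        )

    def forward(self, glyph):
        return self.net(glyph)
```

In such a sketch each font's embedding z_i would be drawn once per sample, e.g. z_i = torch.randn(batch_size, 128).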
The network parameters to be trained include the parameters to be learned in the CNN feature extractor, the LSTM encoder, the LSTM decoder, the attention module, the glyph generator and the glyph discriminator. When training the network, the invention updates the network parameters by combining the character category prediction loss, the glyph pixel loss and the glyph discriminator loss. Specifically, the invention iteratively optimizes two objective functions L_G and L_D:

L_G = Σ_{t=1..T} [ −log p(y_t | x) + α Σ_{i=1..m} ( −log p(y_d = 1 | ĝ_i(x, t)) + ‖ĝ_i(x, t) − g_i(x, t)‖_1 ) ]   formula (5)

L_D = Σ_{t=1..T} Σ_{i=1..m} [ −log p(y_d = 0 | ĝ_i(x, t)) − log p(y_d = 1 | g_i(x, t)) ]   formula (6)

where α is a weight coefficient, set to 0.01; y_1, y_2, ..., y_T are the category labels of all T characters in the input picture x; and ‖·‖_1 denotes the L1 norm. In L_G, the first term −log p(y_t | x) is the character category prediction loss, the second term −log p(y_d = 1 | ĝ_i(x, t)) is the loss for the glyph discriminator falsely judging the generated glyph to be real, and the third term ‖ĝ_i(x, t) − g_i(x, t)‖_1 is the glyph pixel loss. In L_D, the first term −log p(y_d = 0 | ĝ_i(x, t)) is the loss for the glyph discriminator correctly judging the generated glyph to be fake, and the second term −log p(y_d = 1 | g_i(x, t)) is the loss for the glyph discriminator correctly judging the real glyph to be real. This iterative optimization realizes the adversarial competition between the glyph discriminator and the glyph generator. The Adam optimizer proposed in the literature (Kingma, D. P., & Ba, J. L. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations) is used to optimize the network parameters; the initial learning rate is set to 0.001 and is decayed to 0.9 times its value every 10,000 steps; the same training data as the SAR method is used.
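A hedged sketch of the alternating training loop follows. It reuses the module sketches above, wires the losses according to the term-by-term description of formulas (5) and (6), and uses Adam with the stated learning-rate schedule; the batch handling, the eps constant and the parameter grouping are assumptions rather than the invention's exact procedure.

```python
import torch

# Modules from the earlier sketches (assumed to be defined already):
generator = GlyphGenerator()            # DCNN glyph generator
discriminator = GlyphDiscriminator()    # CNN glyph discriminator
# In the full model the generator-side optimizer would also receive the CNN, LSTM,
# attention-module and classifier parameters; only the generator is shown here.
opt_G = torch.optim.Adam(generator.parameters(), lr=0.001)
opt_D = torch.optim.Adam(discriminator.parameters(), lr=0.001)
# learning-rate decay: multiplied by 0.9 every 10,000 steps
sched_G = torch.optim.lr_scheduler.StepLR(opt_G, step_size=10000, gamma=0.9)
sched_D = torch.optim.lr_scheduler.StepLR(opt_D, step_size=10000, gamma=0.9)
alpha, eps = 0.01, 1e-8

def train_step(probs_list, labels, fake_glyphs, real_glyphs):
    """One alternating update on a batch.

    probs_list[t]: p(y_t | x) from the classifier; labels[t]: ground-truth categories y_t;
    fake_glyphs[k] / real_glyphs[k]: generated / rendered glyphs over all (t, i) pairs.
    """
    # ---- discriminator update, following the description of L_D (formula (6)) ----
    opt_D.zero_grad()
    L_D = sum(-torch.log(1 - discriminator(f.detach()) + eps).mean()
              - torch.log(discriminator(r) + eps).mean()
              for f, r in zip(fake_glyphs, real_glyphs))
    L_D.backward(); opt_D.step(); sched_D.step()

    # ---- generator/recognizer update, following the description of L_G (formula (5)) ----
    opt_G.zero_grad()
    L_cls = sum(-torch.log(p.gather(1, y.view(-1, 1)) + eps).mean()
                for p, y in zip(probs_list, labels))
    L_gen = sum(-torch.log(discriminator(f) + eps).mean() + torch.abs(f - r).mean()
                for f, r in zip(fake_glyphs, real_glyphs))
    L_G = L_cls + alpha * L_gen
    L_G.backward(); opt_G.step(); sched_G.step()
    return L_G.item(), L_D.item()
```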
Compared with the prior art, the beneficial effects of the invention include the following aspects:
the invention provides a character recognition method based on standard font generation: a neural network model based on an attention mechanism and a generation mechanism is established; at each time step attention is focused on a certain position of the picture, and the neural network features at that position are used both to predict the character category and to generate multi-font standard glyphs, until all characters in the picture have been traversed, thereby recognizing and outputting the characters in a natural scene picture containing one or more characters. The invention exploits multi-font generation and an improved attention module to raise both recognition accuracy and glyph generation quality, specifically embodied as follows:
Firstly, the method adopts standard glyph generation to guide the learning of character features; compared with most methods that guide feature learning only with character category labels, it can better learn scene-independent features, thereby improving recognition accuracy.
Secondly, the standard glyphs are generated using a spatial attention mechanism; compared with the existing SSFL method, the standard glyphs corresponding to irregularly shaped text can be generated better, which greatly improves the quality of the generated glyphs and leads to better character recognition accuracy.
Thirdly, the method adopts multi-font standard font generation, and further enhances the robustness of the learned characteristics. Compared with the generation of single-font standard fonts, the method reduces the font style characteristics in the characters in the natural scene, and is more favorable for identifying the contents of the characters.
Drawings
Fig. 1 is a flowchart of a text recognition method provided in the present invention.
FIG. 2 is a diagram comparing the present invention with the SSFL method for glyph generation.
FIG. 3 is an exemplary diagram of a standard glyph font utilized in the present invention.
FIG. 4 is a comparison graph of glyphs generated by the present invention and SSFL method when processing irregular-shaped text pictures.
FIG. 5 is a comparison graph of glyph pixel loss values during training for the present invention and other prior art.
Fig. 6 is a comparison graph of the visualization result of the attention weight matrix calculated by the SAR method and the present invention.
FIG. 7 is a comparison graph of standard glyphs generated with and without the use of resist learning in accordance with the present invention.
FIG. 8 is a comparison graph of standard glyphs generated using single font and multi-font training in accordance with the invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a character recognition method based on standard font generation. The invention processes a natural scene picture containing one or more characters, and sequentially outputs the characters in the picture according to the writing sequence. The invention uses a neural network model based on an attention mechanism and a generation mechanism, focuses attention on a certain position of the picture at each moment, and respectively predicts character types and generates multi-font standard fonts by using the neural network characteristics of the position until all characters in the picture are traversed.
The flow chart of the invention is shown in the attached figure 1, and when the method is implemented, the method comprises the following steps:
1. The visual features F(x) of the input picture x are extracted using the CNN feature extractor, where x ∈ ℝ^{48×160×3}, that is, the input picture is scaled to a height of 48 pixels and a width of 160 pixels with 3 color channels, and F(x) ∈ ℝ^{H×W×C}, where H = 6, W = 40 and C = 512 respectively denote the height, width and number of channels of the feature F(x).
Table 1. Parameter configuration of the CNN feature extractor in the embodiment

The configuration of the CNN feature extractor is shown in Table 1. It comprises 6 convolution groups in total; the second column gives the feature dimension output by each convolution group in the format h × w × c, where h, w and c respectively denote the height, width and number of channels of the feature. Except for the first convolution group, each convolution group is built internally from the residual units proposed in the literature (He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770-778)); a convolution group denoted by n residual units consists of n residual units, each containing two convolution layers with kernel sizes 1 × 1 and 3 × 3 and o output feature channels, and the stride column gives the convolution stride.
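Since Table 1 is only reproduced as an image in the original publication, the following PyTorch-style sketch merely illustrates a backbone of this kind: six convolution groups built from such residual units, with the strides of the last two groups equal to 1 × 1 so that a 48 × 160 input yields a 6 × 40 × 512 feature map. The channel widths and the strides of the earlier groups are assumptions.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Residual unit with a 1x1 and a 3x3 convolution, o output channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1, bias=False) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

def conv_group(in_ch, out_ch, n_units, stride):
    """A convolution group: a strided entry convolution followed by n residual units."""
    layers = [nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
              nn.BatchNorm2d(out_ch), nn.ReLU()]
    layers += [ResidualUnit(out_ch, out_ch) for _ in range(n_units)]
    return nn.Sequential(*layers)

# Assumed channel/stride plan; the strides of the last two groups are 1x1 as described,
# the earlier strides are chosen so that 48x160 maps to 6x40 spatially.
backbone = nn.Sequential(
    conv_group(3,   64, 1, stride=1),          # 48 x 160
    conv_group(64,  128, 1, stride=2),         # 24 x 80
    conv_group(128, 256, 2, stride=2),         # 12 x 40
    conv_group(256, 256, 5, stride=(2, 1)),    # 6 x 40
    conv_group(256, 512, 3, stride=1),         # 6 x 40 (stride 1x1)
    conv_group(512, 512, 3, stride=1),         # 6 x 40 (stride 1x1)
)

feat = backbone(torch.randn(1, 3, 48, 160))    # -> (1, 512, 6, 40), i.e. C=512, H=6, W=40
```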
2. The LSTM encoder and decoder in the SAR method are used to perform sequence modeling on the features F(x). Both the LSTM encoder and decoder have two hidden layers with 512 nodes per layer. Each group of features of F(x) along the width (W) dimension is max-pooled (Max-Pooling) along the height dimension and then input into the LSTM encoder in sequence. The hidden state of the LSTM encoder at the last time step is used as the initial state of the LSTM decoder. The hidden state h(x, t) of the LSTM decoder at time t is fed, together with F(x), to the attention module to obtain the attention weight matrix M(x, t) ∈ ℝ^{H×W}, which is calculated as follows:

M′_ij(x, t) = tanh( Σ_{p,q∈N(i,j)} W_F F_pq(x) + W_h h(x, t) )

M(x, t) = softmax( W_M M′(x, t) )

where M′(x, t) is an intermediate variable of the calculation, i.e. the attention weight matrix before softmax normalization; M′_ij(x, t) denotes the value of M′(x, t) at position (i, j), with 1 ≤ i ≤ H and 1 ≤ j ≤ W; N(i, j) denotes the neighbourhood centred at (i, j), i.e. i − 1 ≤ p ≤ i + 1 and j − 1 ≤ q ≤ j + 1; F_pq(x) denotes the feature of F(x) at position (p, q); W_F and W_h are parameters to be learned; tanh is the hyperbolic tangent function, and softmax is the normalized exponential function (softmax function).
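The attention computation itself was sketched earlier; the following is a minimal sketch of the sequence-modeling part of this step: F(x) is max-pooled along the height dimension, the resulting W feature vectors are fed to a two-layer LSTM encoder, and the encoder's final state initializes a two-layer LSTM decoder whose hidden state h(x, t) is passed to the attention module. The interface and names are assumptions.

```python
import torch
import torch.nn as nn

class SequenceEncoderDecoder(nn.Module):
    """Sketch: F(x) -> max-pool along height -> LSTM encoder; encoder final state seeds the decoder."""
    def __init__(self, feat_channels=512, hidden_size=512, num_layers=2):
        super().__init__()
        self.encoder = nn.LSTM(feat_channels, hidden_size, num_layers, batch_first=True)
        self.decoder = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)

    def encode(self, feat):
        # feat: (B, C, H, W); max-pool along the height dimension -> (B, W, C)
        pooled = feat.max(dim=2).values.permute(0, 2, 1)
        _, state = self.encoder(pooled)       # final (h, c) of the encoder
        return state                          # used as the initial state of the decoder

    def decode_step(self, prev_embedding, state):
        # prev_embedding: (B, 1, hidden) embedding of the previously predicted character
        out, state = self.decoder(prev_embedding, state)
        return out.squeeze(1), state          # h(x, t) for the attention module, updated state
```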
3. The feature of each channel of F(x) is dot-multiplied with M(x, t) to obtain the attention vector c(x, t) ∈ ℝ^C, representing the features of the attended picture region at time t.
4. The softmax classifier commonly used in machine learning is applied to the concatenation of the attention vector c(x, t) and the hidden state h(x, t) to obtain the probability of the character category y_t at the attended position at time t:

p(y_t | x) = softmax( W_o [c(x, t); h(x, t)] + b_o )

where W_o and b_o are parameters to be learned, the square brackets denote concatenation, y_t ∈ C, and C denotes the overall set of character categories. The category ŷ_t that maximizes p(y_t | x) is selected as the predicted character category.
5. A glyph generator based on a deconvolutional neural network takes the attention vector c(x, t) as input and generates the standard glyphs ĝ_i(x, t) of the m selected fonts:

ĝ_i(x, t) = DCNN( [c(x, t); z_i] ), i = 1, ..., m

where z_i is the embedding vector of the i-th font, a random vector following a multivariate standard normal distribution, and the square brackets denote concatenation. The real multi-font standard glyphs g_i(x, t) are rendered from TTF (TrueType Font) or OTF (OpenType Font) files. Meanwhile, following the idea of generative adversarial networks, a glyph discriminator is used to discriminate the generated standard glyphs from the real standard glyphs, and the adversarial competition between the glyph discriminator and the glyph generator makes the generated glyphs more accurate. The glyph discriminator gives the probability that a generated glyph ĝ_i(x, t) is real as p(y_d = 1 | ĝ_i(x, t)) and the probability that it is fake as p(y_d = 0 | ĝ_i(x, t)) = 1 − p(y_d = 1 | ĝ_i(x, t)); likewise, it gives the probability that a real glyph g_i(x, t) is real as p(y_d = 1 | g_i(x, t)) and the probability that it is fake as p(y_d = 0 | g_i(x, t)) = 1 − p(y_d = 1 | g_i(x, t)). The configuration parameters of the glyph generator and discriminator are shown in Table 2: the first, second and third columns of the table give the name, type and specific configuration of each network layer. In the third column, for convolution and deconvolution layers, "k × k × c, s, BN, ReLU" means that the convolution kernel size is k × k, the output feature dimension is c, the stride is s, and batch normalization and the ReLU activation function are used. For a fully connected layer, "i o" means that the input feature dimension of the layer is i and the output feature dimension is o.
Table 2. Parameter configuration of the glyph generator and glyph discriminator in the embodiment
When training the whole network, the invention updates the network parameters by combining the character category prediction loss, the glyph pixel loss and the glyph discriminator loss. Specifically, the invention iteratively optimizes the two adversarial objective functions L_G and L_D, expressed as follows:

L_G = Σ_{t=1..T} [ −log p(y_t | x) + α Σ_{i=1..m} ( −log p(y_d = 1 | ĝ_i(x, t)) + ‖ĝ_i(x, t) − g_i(x, t)‖_1 ) ]

L_D = Σ_{t=1..T} Σ_{i=1..m} [ −log p(y_d = 0 | ĝ_i(x, t)) − log p(y_d = 1 | g_i(x, t)) ]

where α is a weight coefficient, set to 0.01, and y_1, y_2, ..., y_T are the category labels of all T characters in the input picture x. In L_G, the first term −log p(y_t | x) is the character category prediction loss, the second term −log p(y_d = 1 | ĝ_i(x, t)) is the loss for the glyph discriminator falsely judging the generated glyph to be real, and the third term ‖ĝ_i(x, t) − g_i(x, t)‖_1 is the glyph pixel loss. In L_D, the first term −log p(y_d = 0 | ĝ_i(x, t)) is the loss for the glyph discriminator correctly judging the generated glyph to be fake, and the second term −log p(y_d = 1 | g_i(x, t)) is the loss for the glyph discriminator correctly judging the real glyph to be real. This iterative optimization realizes the adversarial competition between the glyph discriminator and the glyph generator. The Adam optimizer proposed in the literature (Kingma, D. P., & Ba, J. L. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations) is used to optimize the network parameters; the initial learning rate is set to 0.001 and is decayed to 0.9 times its value every 10,000 steps; the same training data as the SAR method is used.
FIG. 2 compares the glyph generation schemes of the present invention and the existing SSFL method. The part above the dotted line is the attention-based multi-font standard glyph generation scheme provided by the invention, and the part below the dotted line is the standard glyph generation scheme of the SSFL method. The scheme adopted by the invention differs from SSFL in two main points: first, an attention mechanism is adopted to generate the standard glyph corresponding to each scene character one by one; second, the invention adopts multi-font standard glyph generation, which helps to learn features irrelevant to font style.
FIG. 3 shows examples of the standard glyph fonts used in the present invention. The invention trains three network models, for English, Chinese and Bengali respectively. For English, the invention uses 8 (m = 8) fonts, namely Arial, Bradley Hand ITC, Comic Sans MS, Courier New, Georgia, Times New Roman, Kunstler Script and Vladimir Script. For Chinese, the invention adopts 4 (m = 4) fonts, namely Song, Kai (regular), Hei (black) and Fangsong (imitation Song). For Bengali, the invention uses 1 font (m = 1), Nirmala UI.
TABLE 3 recognition accuracy of the present invention and other prior art techniques on English evaluation datasets
Method IIIT5k SVT IC13 IC15 SVTP CT80
SSFL 89.4 87.1 94.0 - 73.9 62.5
ASTER 93.4 89.5 91.8 76.1 78.5 79.5
SAR 95.0 91.2 94.0 78.8 86.4 89.6
The invention 95.3 91.3 95.1 81.7 86.0 88.5
TABLE 4 recognition accuracy of the present invention and other prior art techniques on Chinese and Bengali evaluation datasets
Method Pan+ChiPhoto ISI Bengali
HOG 59.2 87.4
CNN 61.5 89.7
ConvCoHOG 71.2 92.2
The invention 89.4 97.4
Tables 3 and 4 list the recognition accuracy of the present invention and other prior art on the evaluation data sets. Among them, IIIT5k, SVT, IC13, IC15, SVTP and CT80 are English text data sets commonly used in the field. As shown by the recognition accuracy (in %) in the tables, the present invention achieves the best results on most data sets. The invention has a large advantage in accuracy on the IC15 data set; on the two smaller data sets SVTP and CT80, the accuracy of the invention is somewhat behind the SAR method. Pan+ChiPhoto is a Chinese data set and ISI Bengali is a Bengali data set, on both of which the invention also achieves the highest recognition accuracy. HOG is the method in the literature (Dalal, N., & Triggs, B. (2005). Histograms of Oriented Gradients for Human Detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) (Vol. 1, pp. 886-893)); CNN is the method in the literature (Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014). Deep Features for Text Spotting. In European Conference on Computer Vision (pp. 512-528)); ConvCoHOG is a method based on the convolutional co-occurrence histogram of oriented gradients. Overall, the invention is more advanced than the prior art on the task of character recognition in natural scenes.
FIG. 4 compares the glyphs generated by the present invention and the SSFL method when processing irregularly shaped text pictures. The SSFL method generates a standard glyph by a global mapping and cannot handle irregularly shaped text well. The invention locates the approximate position of each character through the attention mechanism and then generates the corresponding standard glyph, thereby obtaining better results for irregular text.
FIG. 5 plots the glyph pixel loss (also called the L1 loss) during training for the present invention and other prior art. CNN-DCNN is the standard glyph generation framework used by the SSFL method; CNN-DCNN (Skip) adds skip connections to CNN-DCNN; CNN-LSTM-DCNN is an improved version of CNN-DCNN in which the CNN features are passed through an LSTM before being delivered to the deconvolution network (DCNN); Attentional Generation is the attention-based standard glyph generation framework proposed herein. For a fair comparison, the four methods use the same CNN and DCNN structure configuration and the same training data, and multi-font generation is also introduced into the first three methods. The comparison shows that the attention-based generation method provided by the invention generates more accurate standard glyphs.
Fig. 6 compares the visualization results of the attention weight matrix (i.e., M(x, t)) obtained by the SAR method and by the present invention. Because the invention learns to generate standard glyphs, its attention module produces a more accurate and more meaningful attention weight matrix. The 2nd and 3rd columns in the figure show the heat maps of M(x, t) calculated by the SAR method and by the invention, respectively, and the underlined letter below each heat map is the character label predicted by the model at that moment. Taking the first group of pictures as an example, the invention focuses attention on the lower half of the cursive letter "L" and correctly recognizes it as "L", whereas the SAR method, focusing on the lower half of the cursive "L", mistakenly recognizes it as "R".
FIG. 7 compares the standard glyphs generated with and without adversarial learning. In the figure, one output row shows the result without adversarial training, another output row shows the result with adversarial training, and "target" is the true standard glyph. Through adversarial learning, the invention can better generate the standard glyphs and recognize the text content for blurred and distorted text. Although many generated standard glyphs still differ somewhat from the true standard glyphs even with adversarial training, they are nevertheless clearly improved compared with the results obtained without adversarial training.
FIG. 8 compares the standard glyphs generated when training with a single font and with multiple fonts. One output row shows the result of training with a single font (the name of that font is given in parentheses), another shows the result of training with multiple fonts, and "target" is the true standard glyph. If only the standard glyphs of one font are used for training, the model cannot correctly generate the standard glyph or recognize the character when it encounters a character with a novel font style at test time. By generating standard glyphs of multiple fonts, the model can better learn features irrelevant to the font, and thereby correctly recognize the content of the characters.
The technical solutions in the embodiments of the present invention are clearly and completely described above with reference to the drawings in the embodiments of the present invention. It is to be understood that the described examples are only a few embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Claims (8)

1. A character recognition method based on standard font generation is characterized in that a neural network model based on an attention mechanism and a generation mechanism is established, attention is focused at a certain position of an image at each moment, character category prediction and multi-font standard font generation are respectively carried out by utilizing the neural network characteristics of the position until all characters in the image are traversed, and therefore the characters in a natural scene image containing one or more characters are recognized and output;
the attention mechanism and generation mechanism based neural network model comprises:
A. a convolutional neural network for extracting visual features F(x) of the input picture x;
B. a recurrent neural network for sequence modeling of the features F(x); the recurrent neural network comprises an LSTM encoder and a decoder;
C. an attention module for obtaining an attention weight matrix M(x, t) from the hidden state h(x, t) of the recurrent neural network at time t and F(x);
D. a classifier for classifying the features;
E. a glyph generator for generating, from the attention vector c(x, t), the standard glyphs ĝ_i(x, t) (i = 1, ..., m) of its corresponding m fonts;
F. a glyph discriminator for competing against the glyph generator so that the glyph generator can generate more realistic standard glyphs;
establishing and training the neural network model based on the attention mechanism and the generation mechanism, wherein the loss function of the model comprises: the character category prediction loss, the glyph pixel loss and the glyph discriminator loss; the parameters to be learned comprise the parameters in the CNN feature extractor, the LSTM encoder, the LSTM decoder, the attention module, the glyph generator and the glyph discriminator;
the training process iteratively optimizes two objective functions L_G and L_D, expressed by formula (5) and formula (6):

L_G = Σ_{t=1..T} [ −log p(y_t | x) + α Σ_{i=1..m} ( −log p(y_d = 1 | ĝ_i(x, t)) + ‖ĝ_i(x, t) − g_i(x, t)‖_1 ) ]   formula (5)

L_D = Σ_{t=1..T} Σ_{i=1..m} [ −log p(y_d = 0 | ĝ_i(x, t)) − log p(y_d = 1 | g_i(x, t)) ]   formula (6)

wherein α is a weight coefficient; y_1, y_2, ..., y_T are the category labels of all T characters in the input picture x; ‖·‖_1 denotes the L1 norm; in L_G, −log p(y_t | x) is the character category prediction loss, −log p(y_d = 1 | ĝ_i(x, t)) is the loss for the glyph discriminator falsely judging the generated glyph to be real, and ‖ĝ_i(x, t) − g_i(x, t)‖_1 is the glyph pixel loss; in L_D, −log p(y_d = 0 | ĝ_i(x, t)) is the loss for the glyph discriminator correctly judging the generated glyph to be fake, and −log p(y_d = 1 | g_i(x, t)) is the loss for the glyph discriminator correctly judging the real glyph to be real;
the character recognition method based on standard font generation comprises the following steps:
1) structurally modifying the convolutional neural network in the ASTER method, wherein the stride of the first convolution unit in each of the last two convolution groups of the convolutional neural network is set to 1 × 1, and using it as the CNN feature extractor; extracting the visual features F(x) of the input picture x by using the convolutional neural network, wherein F(x) ∈ ℝ^{H×W×C}, and H, W, C respectively denote the height, width and number of channels of the feature F(x);
2) using a recurrent neural network to perform sequence modeling on F (x), and transmitting the hidden state h (x, t) of the recurrent neural network at the time t and the F (x) to an attention module together to obtain an attention weight matrix M (x, t) which represents the attention distributed to each area of the image at the time t;
3) performing point multiplication on each feature channel by using F (x) and M (x, t) to obtain an attention vector c (x, t) which represents the feature of the concerned picture area at the time t;
4) classifying the features of c (x, t) and h (x, t) after the c (x, t) and the h (x, t) are connected in series by using a classifier, and predicting the character category of the attention position at the moment t;
5) generating, from the attention vector c(x, t) by using the glyph generator, the standard glyphs ĝ_i(x, t) (i = 1, ..., m) of the m fonts corresponding to it; and further using the glyph discriminator to compete against the glyph generator, so that the glyph generator can generate more realistic standard glyphs;
through the steps, the characters in the picture are identified based on the standard font generation.
2. The method for recognizing characters based on standard font generation as claimed in claim 1, wherein, in step 1), for the visual features F(x) of the picture x, the picture x is scaled to a height of 48 pixels and a width of 160 pixels with the number of color channels set to 3, i.e. x ∈ ℝ^{48×160×3}; and F(x) ∈ ℝ^{H×W×C}, wherein H = 6, W = 40, C = 512.
3. The method as claimed in claim 1, wherein, in step 2), an LSTM encoder and decoder are used to perform sequence modeling on the features F(x), comprising:
21) the LSTM encoder and decoder each comprise two hidden layers, each layer having 512 nodes;
22) each group of features of F(x) along the width W dimension is max-pooled (Max-Pooling) along the height dimension and then input into the LSTM encoder in sequence;
23) the hidden state of the LSTM encoder at the last time step is taken as the initial state of the LSTM decoder;
24) the hidden state h(x, t) of the LSTM decoder at time t is fed, together with F(x), to the attention module to obtain the attention weight matrix M(x, t) ∈ ℝ^{H×W}, which is calculated as follows:

M′_ij(x, t) = tanh( Σ_{p,q∈N(i,j)} W_F F_pq(x) + W_h h(x, t) )   formula (1)

M(x, t) = softmax( W_M M′(x, t) )   formula (2)

wherein M′(x, t) is an intermediate variable of the calculation; M′_ij(x, t) denotes the value of M′(x, t) at position (i, j), with 1 ≤ i ≤ H and 1 ≤ j ≤ W; N(i, j) denotes the neighbourhood centred at (i, j), i.e. i − 1 ≤ p ≤ i + 1 and j − 1 ≤ q ≤ j + 1; F_pq(x) denotes the feature of F(x) at position (p, q); W_F and W_h are parameters to be learned; tanh is the hyperbolic tangent function, and softmax is the softmax function.
4. The method for recognizing characters based on standard font generation as claimed in claim 1, wherein, in step 4), a softmax classifier is used to classify the concatenated features of the attention vector c(x, t) and the hidden state h(x, t) to obtain the probability of the character category y_t at the attended position at time t:

p(y_t | x) = softmax( W_o [c(x, t); h(x, t)] + b_o )   formula (3)

wherein W_o and b_o are parameters to be learned, the square brackets denote concatenation, y_t ∈ C, and C denotes the overall set of character categories.
5. The method for character recognition based on standard font generation as claimed in claim 1, wherein, in step 5), a glyph generator based on a deconvolutional neural network (DCNN) takes the attention vector c(x, t) as input and generates the standard glyphs ĝ_i(x, t) of the m selected fonts, as expressed by formula (4):

ĝ_i(x, t) = DCNN( [c(x, t); z_i] ), i = 1, ..., m   formula (4)

wherein z_i is the embedding vector of the i-th font, a random vector following a multivariate standard normal distribution, the square brackets denote concatenation, and m is the set number of fonts.
6. The method for recognizing characters based on standard font generation as claimed in claim 1, wherein TTF or OTF file rendering is used to obtain the real multi-font standard glyphs g_i(x, t).
7. The method for recognizing characters based on standard font generation as claimed in claim 1, further comprising using the idea of a generative adversarial network to discriminate the generated standard glyphs from the real standard glyphs by using a glyph discriminator based on a convolutional neural network, wherein the glyph discriminator gives the probability that a generated glyph ĝ_i(x, t) is real as p(y_d = 1 | ĝ_i(x, t)) and the probability that it is fake as p(y_d = 0 | ĝ_i(x, t)) = 1 − p(y_d = 1 | ĝ_i(x, t)); and gives the probability that a real glyph g_i(x, t) is real as p(y_d = 1 | g_i(x, t)) and the probability that it is fake as p(y_d = 0 | g_i(x, t)) = 1 − p(y_d = 1 | g_i(x, t)).
8. The method of claim 1, wherein an adam optimizer is used to optimize network parameters, and the initial learning rate is set to 0.001; the weight coefficient α is set to 0.01.
CN201910716704.1A 2019-08-05 2019-08-05 Natural scene character recognition method based on standard font generation Active CN112329803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910716704.1A CN112329803B (en) 2019-08-05 2019-08-05 Natural scene character recognition method based on standard font generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910716704.1A CN112329803B (en) 2019-08-05 2019-08-05 Natural scene character recognition method based on standard font generation

Publications (2)

Publication Number Publication Date
CN112329803A CN112329803A (en) 2021-02-05
CN112329803B true CN112329803B (en) 2022-08-26

Family

ID=74319415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910716704.1A Active CN112329803B (en) 2019-08-05 2019-08-05 Natural scene character recognition method based on standard font generation

Country Status (1)

Country Link
CN (1) CN112329803B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792855B (en) * 2021-09-09 2023-06-23 北京百度网讯科技有限公司 Model training and word stock building method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10515296B2 (en) * 2017-11-14 2019-12-24 Adobe Inc. Font recognition by dynamically weighting multiple deep learning neural networks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007122500A (en) * 2005-10-28 2007-05-17 Ricoh Co Ltd Character recognition device, character recognition method and character data
CN107577651A (en) * 2017-08-25 2018-01-12 上海交通大学 Chinese character style migratory system based on confrontation network
CN107644006A (en) * 2017-09-29 2018-01-30 北京大学 A kind of Chinese script character library automatic generation method based on deep neural network
CN108615036A (en) * 2018-05-09 2018-10-02 中国科学技术大学 A kind of natural scene text recognition method based on convolution attention network
CN108804397A (en) * 2018-06-12 2018-11-13 华南理工大学 A method of the Chinese character style conversion based on a small amount of target font generates
CN109255356A (en) * 2018-07-24 2019-01-22 阿里巴巴集团控股有限公司 A kind of character recognition method, device and computer readable storage medium
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Focusing Attention: Towards Accurate Text Recognition in Natural Images; Zhanzhan Cheng et al.; 2017 IEEE International Conference on Computer Vision (ICCV); 2017-12-25; pp. 5076-5084 *

Also Published As

Publication number Publication date
CN112329803A (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN109961034B (en) Video target detection method based on convolution gating cyclic neural unit
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN110334705B (en) Language identification method of scene text image combining global and local information
CN109948714B (en) Chinese scene text line identification method based on residual convolution and recurrent neural network
CN111753828B (en) Natural scene horizontal character detection method based on deep convolutional neural network
CN112819686B (en) Image style processing method and device based on artificial intelligence and electronic equipment
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
CN109753897B (en) Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning
CN107358172B (en) Human face feature point initialization method based on human face orientation classification
CN108681735A (en) Optical character recognition method based on convolutional neural networks deep learning model
CN111428557A (en) Method and device for automatically checking handwritten signature based on neural network model
CN112364873A (en) Character recognition method and device for curved text image and computer equipment
CN110738201B (en) Self-adaptive multi-convolution neural network character recognition method based on fusion morphological characteristics
CN110427819A (en) The method and relevant device of PPT frame in a kind of identification image
CN110689044A (en) Target detection method and system combining relationship between targets
CN110503090B (en) Character detection network training method based on limited attention model, character detection method and character detector
CN115116074A (en) Handwritten character recognition and model training method and device
CN115880704A (en) Automatic case cataloging method, system, equipment and storage medium
CN112329803B (en) Natural scene character recognition method based on standard font generation
Abe et al. Font creation using class discriminative deep convolutional generative adversarial networks
CN111242114B (en) Character recognition method and device
Nandhini et al. Sign language recognition using convolutional neural network
CN111461061A (en) Pedestrian re-identification method based on camera style adaptation
CN111461239A (en) White box attack method of CTC scene character recognition model
CN108537855B (en) Ceramic stained paper pattern generation method and device with consistent sketch

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant