CN112329803B - Natural scene character recognition method based on standard font generation - Google Patents

Natural scene character recognition method based on standard font generation

Info

Publication number
CN112329803B
CN112329803B (application CN201910716704.1A)
Authority
CN
China
Prior art keywords
font
attention
standard
neural network
glyph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910716704.1A
Other languages
Chinese (zh)
Other versions
CN112329803A (en)
Inventor
连宙辉
王逸之
唐英敏
肖建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN201910716704.1A priority Critical patent/CN112329803B/en
Publication of CN112329803A publication Critical patent/CN112329803A/en
Application granted granted Critical
Publication of CN112329803B publication Critical patent/CN112329803B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a character recognition method based on standard font generation. A neural network model based on an attention mechanism and a generation mechanism is established; at each time step, attention is focused on a certain position of the picture, and the neural network features at that position are used both to predict the character category and to generate multi-font standard glyphs, until all characters in the picture have been traversed, thereby recognizing and outputting the characters in a natural scene picture containing one or more characters. The invention utilizes multi-font generation and an improved attention module to raise both glyph generation quality and feature robustness, thereby improving character recognition accuracy.

Description

Natural scene character recognition method based on standard font generation
Technical Field
The invention belongs to the technical field of computer vision and pattern recognition, relates to a character recognition method, and particularly relates to a method for recognizing characters in a natural scene picture.
Background
In the field of computer vision and pattern recognition, character recognition refers to having a computer automatically recognize the textual content in a picture. Natural scene character recognition specifically refers to recognizing all of the textual content in a natural scene picture whose main subject is text. Automatic recognition of characters in natural scenes is of great significance for improving productivity and daily life, understanding image content, and enabling machines to perceive their environment.
To date, many text recognition techniques have been proposed in academia and industry, mainly divided into local-feature-based methods and neural-network-based methods. The local-feature-based methods are represented by the method proposed in the literature (Wang, K., Babenko, B., & Belongie, S. J. (2011). End-to-End Scene Text Recognition. In 2011 International Conference on Computer Vision (pp. 1457-1464)). Such a method locates feature points using a series of hand-crafted rules and extracts features at those positions for character classification. However, in natural scene images the background and font of the text are complex and the shape of the text is not fixed (bent, tilted, etc.), and such methods offer no unified criterion for deciding which feature points are important, so they cannot achieve a good recognition effect.
Recently, some methods based on neural networks have been proposed. By exploiting the ability of neural networks to adaptively select features and their strong robustness to noise, these methods perform very well on the text recognition problem. They generally use a convolutional neural network (CNN) to extract visual features of the image, then use a recurrent neural network (RNN) for sequence modeling and predict each character in the image in turn. The long short-term memory network (LSTM) is a commonly used RNN structure. The most advanced methods at present are represented by the ASTER method in the literature (Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., & Bai, X. (2018). ASTER: An Attentional Scene Text Recognizer with Flexible Rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence) and the SAR method in the literature (Li, H., Wang, P., Shen, C., & Zhang, G. (2018). Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition. arXiv:1811.00751). However, these methods still have a defect: they supervise the neural network only with character category labels, and the guiding information provided by such labels is insufficient. When processing pictures with noisy text backgrounds or novel font styles, these methods cannot extract discriminative features, so the recognition accuracy is still not ideal. Some methods attempt to use standard glyphs as additional supervisory information, such as the SSFL method in the literature (Liu, Y., Wang, Z., Jin, H., & Wassell, I. J. (2018). Synthetically Supervised Feature Learning for Scene Text Recognition. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 449-465)) and the method in the literature (Zhang, Y., Liang, S., Nie, S., Liu, W., et al. (2018)); however, these methods use only a single font as the generation target and cannot handle irregularly shaped text well, so the gain in recognition accuracy is limited.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a character recognition method based on standard font generation. For the natural scene character features extracted by the neural network, an attention mechanism is combined with glyph generation: the neural network is used not only to predict the character category, but also to generate, for each natural scene character, the standard glyphs of multiple fonts corresponding to it. By learning how to generate standard glyphs, the neural network can extract natural scene character features that are more robust to interference factors such as noisy backgrounds and font styles, thereby improving the accuracy of character recognition.
For convenience of description, the present invention defines the following terms:
Natural scene picture: a picture of a real scene captured by a person.
Text picture: a picture whose main content is text, containing one or more characters.
The core of the invention is as follows: in the process of recognizing characters, unnecessary font style information in the neural network features is redundant information. The SSFL method of the prior art has two main problems: first, SSFL generates a standard glyph of a single font only in order to learn how to filter out the background of characters in natural scenes, and does not consider generating glyphs of multiple fonts or what benefit doing so could bring; second, the model provided by SSFL cannot generate glyphs of multiple fonts, which is a technical difficulty. Whereas SSFL in the prior art uses the glyph of only one font as the generation target, the invention provides standard glyph generation for multiple fonts: a glyph generator is used to generate, from the attention vector c(x, t), the standard glyphs ĝ_i(x, t) (i = 1, ..., m) of the m fonts corresponding to it, and a glyph discriminator is used to compete against the glyph generator so that the glyph generator can generate more realistic standard glyphs. For a given character there are several typical standard fonts, such as Song, Kai (regular), Hei (black) and so on. The invention uses a font style embedding vector z to control which font is generated, so that the features extracted by the neural network only need to reflect the most important content information (which character it is); this reduces unnecessary font style information in the neural network features and further improves recognition accuracy. Meanwhile, the innovative use of the font style embedding vector z to control the font of the generated glyph solves the problem of multi-font generation. In addition, the attention mechanism and the standard glyph generation are jointly optimized (joint learning), organically combining two models that would otherwise be learned independently, so that both perform better.
The technical scheme provided by the invention is as follows:
a character recognition method based on standard font generation. The invention processes a natural scene picture containing one or more characters and outputs the characters in the picture in their writing order. The invention uses a neural network model based on an attention mechanism and a generation mechanism: at each time step, attention is focused on a certain position of the picture, and the neural network features at that position are used both to predict the character category and to generate multi-font standard glyphs, until all characters in the picture have been traversed, thereby recognizing and outputting the characters in a natural scene picture containing one or more characters.
The attention mechanism and generation mechanism based neural network model comprises:
A. a convolutional neural network for extracting visual features F(x) of the input picture x;
B. a recurrent neural network for sequence modeling of the features F(x); the recurrent neural network comprises an LSTM encoder and a decoder;
C. an attention module for obtaining an attention weight matrix M(x, t) from the hidden state h(x, t) of the recurrent neural network at time t and F(x);
D. a classifier for classifying the features; in specific implementation, a softmax classifier is adopted;
E. a glyph generator for generating, from the attention vector c(x, t), the standard glyphs ĝ_i(x, t) (i = 1, ..., m) of its corresponding m fonts;
F. a glyph discriminator for competing with the glyph generator so that the glyph generator can generate a more realistic standard glyph.
The character recognition method based on standard font generation specifically comprises the following steps:
1. The visual features F(x) of the input picture x are extracted using a convolutional neural network.
2. A recurrent neural network is used to perform sequence modeling on F(x); the hidden state h(x, t) of the recurrent neural network at time t is fed, together with F(x), to the attention module to obtain an attention weight matrix M(x, t), which represents the attention allocated to each region of the picture at time t.
3. Each feature channel of F(x) is dot-multiplied with M(x, t) to obtain an attention vector c(x, t), which represents the features of the attended picture region at time t.
4. c(x, t) and h(x, t) are concatenated and classified with a classifier to predict the character category at the attended position at time t.
5. A glyph generator is used to generate, from the attention vector c(x, t), the standard glyphs ĝ_i(x, t) (i = 1, ..., m) of the m fonts corresponding to it; a glyph discriminator is used to compete against the glyph generator so that the glyph generator can generate more realistic standard glyphs.
In step 1, the convolutional neural network in the ASTER method (an attention-based scene text recognizer) is taken as a basis and the stride of the first convolution unit in each of the last two convolution groups is modified to 1 × 1; this network is used as the CNN feature extractor of the invention to extract the visual features F(x) of the input picture x. Here x ∈ ℝ^{48×160×3}, that is, the input picture is scaled to a height of 48 pixels and a width of 160 pixels with 3 color channels, and F(x) ∈ ℝ^{H×W×C}, where H = 6, W = 40 and C = 512 respectively denote the height, width and number of channels of the feature F(x).
In step 2, an LSTM encoder and decoder are used to perform sequence modeling on the features F(x). Both the LSTM encoder and decoder have two hidden layers with 512 nodes per layer. Each group of features of F(x) along the width (W) dimension is max-pooled (Max-Pooling) along the height dimension and then input into the LSTM encoder in sequence. The hidden state of the LSTM encoder at the last time step is used as the initial state of the LSTM decoder. The hidden state h(x, t) of the LSTM decoder at time t is fed, together with F(x), to the attention module to obtain the attention weight matrix M(x, t) ∈ ℝ^{H×W}, which is calculated as follows:

M′_ij(x, t) = tanh( Σ_{p,q∈N(i,j)} W_F F_pq(x) + W_h h(x, t) )   formula (1)

M(x, t) = softmax( W_M M′(x, t) )   formula (2)

where M′(x, t) is an intermediate variable of the calculation; M′_ij(x, t) denotes the value of M′(x, t) at position (i, j), with 1 ≤ i ≤ H and 1 ≤ j ≤ W; N(i, j) denotes the neighbourhood centred at (i, j), i.e. i − 1 ≤ p ≤ i + 1 and j − 1 ≤ q ≤ j + 1; F_pq(x) denotes the feature of F(x) at position (p, q); W_F and W_h are parameters to be learned; tanh is the hyperbolic tangent function, and softmax is the normalized exponential function (softmax function).
In step 3, the feature of each channel of F(x) is dot-multiplied with M(x, t) to obtain the attention vector c(x, t) ∈ ℝ^C, representing the features of the attended picture region at time t.
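To make the computation above concrete, the following is a minimal PyTorch-style sketch of the attention module of formulas (1)-(2) together with the channel-wise dot product of step 3. It is an illustrative reconstruction rather than the exact implementation of the invention: the class and variable names are assumptions, and the neighbourhood sum over N(i, j) is realised as a learnable 3 × 3 convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class GlyphAttention(nn.Module):
    """Sketch of the attention module: formulas (1)-(2) plus the step-3 weighted sum."""
    def __init__(self, feat_channels=512, hidden_size=512):
        super().__init__()
        # The sum of W_F F_pq(x) over the 3x3 neighbourhood N(i, j) is realised here
        # as a single 3x3 convolution (an equivalent learnable form; assumption).
        self.conv_F = nn.Conv2d(feat_channels, hidden_size, kernel_size=3, padding=1, bias=False)
        self.W_h = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_M = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, feat, h_t):
        # feat: (B, C, H, W) visual features F(x); h_t: (B, hidden) decoder state h(x, t)
        B, C, H, W = feat.shape
        m = torch.tanh(self.conv_F(feat) + self.W_h(h_t)[:, :, None, None])   # formula (1)
        scores = self.W_M(m.permute(0, 2, 3, 1)).reshape(B, H * W)            # W_M M'(x, t)
        M = F_nn.softmax(scores, dim=1).reshape(B, 1, H, W)                   # formula (2)
        c = (feat * M).sum(dim=(2, 3))                                        # step 3: c(x, t) in R^C
        return c, M
```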
In step 4, the softmax classifier commonly used in machine learning is applied to the concatenation of the attention vector c(x, t) and the hidden state h(x, t) to obtain the probability of the character category y_t at the attended position at time t:

p(y_t | x) = softmax( W_o [c(x, t); h(x, t)] + b_o )   formula (3)

where W_o and b_o are parameters to be learned, the square brackets denote concatenation, y_t ∈ C, and C denotes the overall set of character categories. The category ŷ_t that maximizes p(y_t | x) is selected as the predicted character category.
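A corresponding sketch of the classifier in formula (3) is given below, again as an assumption-laden illustration: the character set size num_classes and the module name are placeholders, not values fixed by the invention.

```python
import torch
import torch.nn as nn

class CharClassifier(nn.Module):
    """Sketch of formula (3): softmax over W_o [c(x,t); h(x,t)] + b_o."""
    def __init__(self, feat_channels=512, hidden_size=512, num_classes=97):
        super().__init__()
        self.W_o = nn.Linear(feat_channels + hidden_size, num_classes)  # bias plays the role of b_o

    def forward(self, c_t, h_t):
        logits = self.W_o(torch.cat([c_t, h_t], dim=1))   # concatenation [c(x,t); h(x,t)]
        probs = torch.softmax(logits, dim=1)               # p(y_t | x)
        return probs.argmax(dim=1), probs                  # predicted category and full distribution
```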
In step 5, a glyph generator based on a deconvolutional neural network (DCNN) takes the attention vector c(x, t) as input and generates the standard glyphs ĝ_i(x, t) of the m selected fonts, as expressed by formula (4):

ĝ_i(x, t) = DCNN( [c(x, t); z_i] ), i = 1, ..., m   formula (4)

where z_i is the embedding vector of the i-th font, a random vector following a multivariate standard normal distribution; the square brackets denote concatenation; and m is the chosen number of fonts. The real multi-font standard glyphs g_i(x, t) are rendered from TTF (TrueType Font) or OTF (OpenType Font) files. Meanwhile, following the idea of generative adversarial networks, a glyph discriminator based on a convolutional neural network is used to discriminate the generated standard glyphs from the real standard glyphs; the adversarial competition between the glyph discriminator and the glyph generator makes the generated glyphs more accurate. The glyph discriminator gives the probability that a generated glyph ĝ_i(x, t) is real as p(y_d = 1 | ĝ_i(x, t)) and the probability that it is fake as p(y_d = 0 | ĝ_i(x, t)) = 1 − p(y_d = 1 | ĝ_i(x, t)); likewise, it gives the probability that a real glyph g_i(x, t) is real as p(y_d = 1 | g_i(x, t)) and the probability that it is fake as p(y_d = 0 | g_i(x, t)) = 1 − p(y_d = 1 | g_i(x, t)).
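The glyph generator and glyph discriminator can be sketched as follows. This is only a schematic DCNN/CNN pair in the spirit of formula (4); the layer counts, channel widths and the assumed 32 × 32 glyph size are illustrative and do not reproduce the configuration given later in Table 2.

```python
import torch
import torch.nn as nn

class GlyphGenerator(nn.Module):
    """Sketch: maps [c(x,t); z_i] to a single-channel standard glyph image (assumed 32x32)."""
    def __init__(self, feat_dim=512, z_dim=128):
        super().__init__()
        self.fc = nn.Linear(feat_dim + z_dim, 256 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),  # 8x8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),    # 16x16
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Sigmoid(),                       # 32x32
        )

    def forward(self, c_t, z_i):
        # z_i: font style embedding, sampled from a multivariate standard normal distribution
        x = self.fc(torch.cat([c_t, z_i], dim=1)).view(-1, 256, 4, 4)
        return self.deconv(x)                      # generated glyph g_hat_i(x, t)

class GlyphDiscriminator(nn.Module):
    """Sketch: outputs p(y_d = 1 | glyph), the probability that a glyph is real."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),    # 16x16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),  # 8x8
            nn.Flatten(), nn.Linear(128 * 8 * 8, 1), nn.Sigmoid(),
        )

    def forward(self, glyph):
        return self.net(glyph)
```

In such a sketch each font's embedding z_i would be drawn once per sample, e.g. z_i = torch.randn(batch_size, 128).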
The network parameters to be trained include the parameters to be learned in the CNN feature extractor, the LSTM encoder, the LSTM decoder, the attention module, the glyph generator and the glyph discriminator. When training the network, the invention updates the network parameters by combining the character category prediction loss, the glyph pixel loss and the glyph discriminator loss. Specifically, the invention iteratively optimizes two objective functions L_G and L_D:

L_G = Σ_{t=1..T} [ −log p(y_t | x) + α Σ_{i=1..m} ( −log p(y_d = 1 | ĝ_i(x, t)) + ‖ĝ_i(x, t) − g_i(x, t)‖_1 ) ]   formula (5)

L_D = Σ_{t=1..T} Σ_{i=1..m} [ −log p(y_d = 0 | ĝ_i(x, t)) − log p(y_d = 1 | g_i(x, t)) ]   formula (6)

where α is a weight coefficient, set to 0.01; y_1, y_2, ..., y_T are the category labels of all T characters in the input picture x; and ‖·‖_1 denotes the L1 norm. In L_G, the first term −log p(y_t | x) is the character category prediction loss, the second term −log p(y_d = 1 | ĝ_i(x, t)) is the loss for the glyph discriminator falsely judging the generated glyph to be real, and the third term ‖ĝ_i(x, t) − g_i(x, t)‖_1 is the glyph pixel loss. In L_D, the first term −log p(y_d = 0 | ĝ_i(x, t)) is the loss for the glyph discriminator correctly judging the generated glyph to be fake, and the second term −log p(y_d = 1 | g_i(x, t)) is the loss for the glyph discriminator correctly judging the real glyph to be real. This iterative optimization realizes the adversarial competition between the glyph discriminator and the glyph generator. The Adam optimizer proposed in the literature (Kingma, D. P., & Ba, J. L. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations) is used to optimize the network parameters; the initial learning rate is set to 0.001 and is decayed to 0.9 times its value every 10,000 steps; the same training data as the SAR method is used.
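A hedged sketch of the alternating training loop follows. It reuses the module sketches above, wires the losses according to the term-by-term description of formulas (5) and (6), and uses Adam with the stated learning-rate schedule; the batch handling, the eps constant and the parameter grouping are assumptions rather than the invention's exact procedure.

```python
import torch

# Modules from the earlier sketches (assumed to be defined already):
generator = GlyphGenerator()            # DCNN glyph generator
discriminator = GlyphDiscriminator()    # CNN glyph discriminator
# In the full model the generator-side optimizer would also receive the CNN, LSTM,
# attention-module and classifier parameters; only the generator is shown here.
opt_G = torch.optim.Adam(generator.parameters(), lr=0.001)
opt_D = torch.optim.Adam(discriminator.parameters(), lr=0.001)
# learning-rate decay: multiplied by 0.9 every 10,000 steps
sched_G = torch.optim.lr_scheduler.StepLR(opt_G, step_size=10000, gamma=0.9)
sched_D = torch.optim.lr_scheduler.StepLR(opt_D, step_size=10000, gamma=0.9)
alpha, eps = 0.01, 1e-8

def train_step(probs_list, labels, fake_glyphs, real_glyphs):
    """One alternating update on a batch.

    probs_list[t]: p(y_t | x) from the classifier; labels[t]: ground-truth categories y_t;
    fake_glyphs[k] / real_glyphs[k]: generated / rendered glyphs over all (t, i) pairs.
    """
    # ---- discriminator update, following the description of L_D (formula (6)) ----
    opt_D.zero_grad()
    L_D = sum(-torch.log(1 - discriminator(f.detach()) + eps).mean()
              - torch.log(discriminator(r) + eps).mean()
              for f, r in zip(fake_glyphs, real_glyphs))
    L_D.backward(); opt_D.step(); sched_D.step()

    # ---- generator/recognizer update, following the description of L_G (formula (5)) ----
    opt_G.zero_grad()
    L_cls = sum(-torch.log(p.gather(1, y.view(-1, 1)) + eps).mean()
                for p, y in zip(probs_list, labels))
    L_gen = sum(-torch.log(discriminator(f) + eps).mean() + torch.abs(f - r).mean()
                for f, r in zip(fake_glyphs, real_glyphs))
    L_G = L_cls + alpha * L_gen
    L_G.backward(); opt_G.step(); sched_G.step()
    return L_G.item(), L_D.item()
```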
Compared with the prior art, the beneficial effects of the invention include the following aspects:
the invention provides a character recognition method based on standard font generation: a neural network model based on an attention mechanism and a generation mechanism is established; at each time step attention is focused on a certain position of the picture, and the neural network features at that position are used both to predict the character category and to generate multi-font standard glyphs, until all characters in the picture have been traversed, thereby recognizing and outputting the characters in a natural scene picture containing one or more characters. The invention exploits multi-font generation and an improved attention module to raise both recognition accuracy and glyph generation quality, specifically embodied as follows:
Firstly, the method adopts standard glyph generation to guide the learning of character features; compared with most methods that guide feature learning only with character category labels, it can better learn scene-independent features, thereby improving recognition accuracy.
Secondly, the standard glyphs are generated using a spatial attention mechanism; compared with the existing SSFL method, the standard glyphs corresponding to irregularly shaped text can be generated better, which greatly improves the quality of the generated glyphs and leads to better character recognition accuracy.
Thirdly, the method adopts multi-font standard font generation, and further enhances the robustness of the learned characteristics. Compared with the generation of single-font standard fonts, the method reduces the font style characteristics in the characters in the natural scene, and is more favorable for identifying the contents of the characters.
Drawings
Fig. 1 is a flowchart of a text recognition method provided in the present invention.
FIG. 2 is a diagram comparing the present invention with the SSFL method for glyph generation.
FIG. 3 is an exemplary diagram of a standard glyph font utilized in the present invention.
FIG. 4 is a comparison graph of glyphs generated by the present invention and SSFL method when processing irregular-shaped text pictures.
FIG. 5 is a comparison graph of glyph pixel loss values during training for the present invention and other prior art.
Fig. 6 is a comparison graph of the visualization result of the attention weight matrix calculated by the SAR method and the present invention.
FIG. 7 is a comparison graph of standard glyphs generated with and without the use of resist learning in accordance with the present invention.
FIG. 8 is a comparison graph of standard glyphs generated using single font and multi-font training in accordance with the invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a character recognition method based on standard font generation. The invention processes a natural scene picture containing one or more characters, and sequentially outputs the characters in the picture according to the writing sequence. The invention uses a neural network model based on an attention mechanism and a generation mechanism, focuses attention on a certain position of the picture at each moment, and respectively predicts character types and generates multi-font standard fonts by using the neural network characteristics of the position until all characters in the picture are traversed.
The flow chart of the invention is shown in the attached figure 1, and when the method is implemented, the method comprises the following steps:
1. The visual features F(x) of the input picture x are extracted using the CNN feature extractor, where x ∈ ℝ^{48×160×3}, that is, the input picture is scaled to a height of 48 pixels and a width of 160 pixels with 3 color channels, and F(x) ∈ ℝ^{H×W×C}, where H = 6, W = 40 and C = 512 respectively denote the height, width and number of channels of the feature F(x).
Table 1. Parameter configuration of the CNN feature extractor in the embodiment

The configuration of the CNN feature extractor is shown in Table 1. It comprises 6 convolution groups in total; the second column gives the feature dimension output by each convolution group in the format h × w × c, where h, w and c respectively denote the height, width and number of channels of the feature. Except for the first convolution group, each convolution group is built internally from the residual units proposed in the literature (He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770-778)); a convolution group denoted by n residual units consists of n residual units, each containing two convolution layers with kernel sizes 1 × 1 and 3 × 3 and o output feature channels, and the stride column gives the convolution stride.
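Since Table 1 is only reproduced as an image in the original publication, the following PyTorch-style sketch merely illustrates a backbone of this kind: six convolution groups built from such residual units, with the strides of the last two groups equal to 1 × 1 so that a 48 × 160 input yields a 6 × 40 × 512 feature map. The channel widths and the strides of the earlier groups are assumptions.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Residual unit with a 1x1 and a 3x3 convolution, o output channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1, bias=False) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

def conv_group(in_ch, out_ch, n_units, stride):
    """A convolution group: a strided entry convolution followed by n residual units."""
    layers = [nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
              nn.BatchNorm2d(out_ch), nn.ReLU()]
    layers += [ResidualUnit(out_ch, out_ch) for _ in range(n_units)]
    return nn.Sequential(*layers)

# Assumed channel/stride plan; the strides of the last two groups are 1x1 as described,
# the earlier strides are chosen so that 48x160 maps to 6x40 spatially.
backbone = nn.Sequential(
    conv_group(3,   64, 1, stride=1),          # 48 x 160
    conv_group(64,  128, 1, stride=2),         # 24 x 80
    conv_group(128, 256, 2, stride=2),         # 12 x 40
    conv_group(256, 256, 5, stride=(2, 1)),    # 6 x 40
    conv_group(256, 512, 3, stride=1),         # 6 x 40 (stride 1x1)
    conv_group(512, 512, 3, stride=1),         # 6 x 40 (stride 1x1)
)

feat = backbone(torch.randn(1, 3, 48, 160))    # -> (1, 512, 6, 40), i.e. C=512, H=6, W=40
```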
2. The LSTM encoder and decoder in the SAR method are used to perform sequence modeling on the features F(x). Both the LSTM encoder and decoder have two hidden layers with 512 nodes per layer. Each group of features of F(x) along the width (W) dimension is max-pooled (Max-Pooling) along the height dimension and then input into the LSTM encoder in sequence. The hidden state of the LSTM encoder at the last time step is used as the initial state of the LSTM decoder. The hidden state h(x, t) of the LSTM decoder at time t is fed, together with F(x), to the attention module to obtain the attention weight matrix M(x, t) ∈ ℝ^{H×W}, which is calculated as follows:

M′_ij(x, t) = tanh( Σ_{p,q∈N(i,j)} W_F F_pq(x) + W_h h(x, t) )

M(x, t) = softmax( W_M M′(x, t) )

where M′(x, t) is an intermediate variable of the calculation, i.e. the attention weight matrix before softmax normalization; M′_ij(x, t) denotes the value of M′(x, t) at position (i, j), with 1 ≤ i ≤ H and 1 ≤ j ≤ W; N(i, j) denotes the neighbourhood centred at (i, j), i.e. i − 1 ≤ p ≤ i + 1 and j − 1 ≤ q ≤ j + 1; F_pq(x) denotes the feature of F(x) at position (p, q); W_F and W_h are parameters to be learned; tanh is the hyperbolic tangent function, and softmax is the normalized exponential function (softmax function).
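The attention computation itself was sketched earlier; the following is a minimal sketch of the sequence-modeling part of this step: F(x) is max-pooled along the height dimension, the resulting W feature vectors are fed to a two-layer LSTM encoder, and the encoder's final state initializes a two-layer LSTM decoder whose hidden state h(x, t) is passed to the attention module. The interface and names are assumptions.

```python
import torch
import torch.nn as nn

class SequenceEncoderDecoder(nn.Module):
    """Sketch: F(x) -> max-pool along height -> LSTM encoder; encoder final state seeds the decoder."""
    def __init__(self, feat_channels=512, hidden_size=512, num_layers=2):
        super().__init__()
        self.encoder = nn.LSTM(feat_channels, hidden_size, num_layers, batch_first=True)
        self.decoder = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)

    def encode(self, feat):
        # feat: (B, C, H, W); max-pool along the height dimension -> (B, W, C)
        pooled = feat.max(dim=2).values.permute(0, 2, 1)
        _, state = self.encoder(pooled)       # final (h, c) of the encoder
        return state                          # used as the initial state of the decoder

    def decode_step(self, prev_embedding, state):
        # prev_embedding: (B, 1, hidden) embedding of the previously predicted character
        out, state = self.decoder(prev_embedding, state)
        return out.squeeze(1), state          # h(x, t) for the attention module, updated state
```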
3. The feature of each channel of F(x) is dot-multiplied with M(x, t) to obtain the attention vector c(x, t) ∈ ℝ^C, representing the features of the attended picture region at time t.
4. The softmax classifier commonly used in machine learning is applied to the concatenation of the attention vector c(x, t) and the hidden state h(x, t) to obtain the probability of the character category y_t at the attended position at time t:

p(y_t | x) = softmax( W_o [c(x, t); h(x, t)] + b_o )

where W_o and b_o are parameters to be learned, the square brackets denote concatenation, y_t ∈ C, and C denotes the overall set of character categories. The category ŷ_t that maximizes p(y_t | x) is selected as the predicted character category.
5. A glyph generator based on a deconvolutional neural network takes the attention vector c(x, t) as input and generates the standard glyphs ĝ_i(x, t) of the m selected fonts:

ĝ_i(x, t) = DCNN( [c(x, t); z_i] ), i = 1, ..., m

where z_i is the embedding vector of the i-th font, a random vector following a multivariate standard normal distribution, and the square brackets denote concatenation. The real multi-font standard glyphs g_i(x, t) are rendered from TTF (TrueType Font) or OTF (OpenType Font) files. Meanwhile, following the idea of generative adversarial networks, a glyph discriminator is used to discriminate the generated standard glyphs from the real standard glyphs, and the adversarial competition between the glyph discriminator and the glyph generator makes the generated glyphs more accurate. The glyph discriminator gives the probability that a generated glyph ĝ_i(x, t) is real as p(y_d = 1 | ĝ_i(x, t)) and the probability that it is fake as p(y_d = 0 | ĝ_i(x, t)) = 1 − p(y_d = 1 | ĝ_i(x, t)); likewise, it gives the probability that a real glyph g_i(x, t) is real as p(y_d = 1 | g_i(x, t)) and the probability that it is fake as p(y_d = 0 | g_i(x, t)) = 1 − p(y_d = 1 | g_i(x, t)). The configuration parameters of the glyph generator and discriminator are shown in Table 2: the first, second and third columns of the table give the name, type and specific configuration of each network layer. In the third column, for convolution and deconvolution layers, "k × k × c, s, BN, ReLU" means that the convolution kernel size is k × k, the output feature dimension is c, the stride is s, and batch normalization and the ReLU activation function are used. For a fully connected layer, "i o" means that the input feature dimension of the layer is i and the output feature dimension is o.
Table 2. Parameter configuration of the glyph generator and glyph discriminator in the embodiment
When training the whole network, the invention updates the network parameters by combining the character category prediction loss, the glyph pixel loss and the glyph discriminator loss. Specifically, the invention iteratively optimizes the two adversarial objective functions L_G and L_D, expressed as follows:

L_G = Σ_{t=1..T} [ −log p(y_t | x) + α Σ_{i=1..m} ( −log p(y_d = 1 | ĝ_i(x, t)) + ‖ĝ_i(x, t) − g_i(x, t)‖_1 ) ]

L_D = Σ_{t=1..T} Σ_{i=1..m} [ −log p(y_d = 0 | ĝ_i(x, t)) − log p(y_d = 1 | g_i(x, t)) ]

where α is a weight coefficient, set to 0.01, and y_1, y_2, ..., y_T are the category labels of all T characters in the input picture x. In L_G, the first term −log p(y_t | x) is the character category prediction loss, the second term −log p(y_d = 1 | ĝ_i(x, t)) is the loss for the glyph discriminator falsely judging the generated glyph to be real, and the third term ‖ĝ_i(x, t) − g_i(x, t)‖_1 is the glyph pixel loss. In L_D, the first term −log p(y_d = 0 | ĝ_i(x, t)) is the loss for the glyph discriminator correctly judging the generated glyph to be fake, and the second term −log p(y_d = 1 | g_i(x, t)) is the loss for the glyph discriminator correctly judging the real glyph to be real. This iterative optimization realizes the adversarial competition between the glyph discriminator and the glyph generator. The Adam optimizer proposed in the literature (Kingma, D. P., & Ba, J. L. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations) is used to optimize the network parameters; the initial learning rate is set to 0.001 and is decayed to 0.9 times its value every 10,000 steps; the same training data as the SAR method is used.
FIG. 2 compares the glyph generation schemes of the present invention and the existing SSFL method. The part above the dotted line is the attention-based multi-font standard glyph generation scheme provided by the invention, and the part below the dotted line is the standard glyph generation scheme of the SSFL method. The scheme adopted by the invention differs from SSFL in two main points: first, an attention mechanism is adopted to generate the standard glyph corresponding to each scene character one by one; second, the invention adopts multi-font standard glyph generation, which helps to learn features irrelevant to font style.
FIG. 3 shows examples of the standard glyph fonts used in the present invention. The invention trains three network models, for English, Chinese and Bengali respectively. For English, the invention uses 8 (m = 8) fonts, namely Arial, Bradley Hand ITC, Comic Sans MS, Courier New, Georgia, Times New Roman, Kunstler Script and Vladimir Script. For Chinese, the invention adopts 4 (m = 4) fonts, namely Song, Kai (regular), Hei (black) and Fangsong (imitation Song). For Bengali, the invention uses 1 font (m = 1), Nirmala UI.
TABLE 3 recognition accuracy of the present invention and other prior art techniques on English evaluation datasets
Method IIIT5k SVT IC13 IC15 SVTP CT80
SSFL 89.4 87.1 94.0 - 73.9 62.5
ASTER 93.4 89.5 91.8 76.1 78.5 79.5
SAR 95.0 91.2 94.0 78.8 86.4 89.6
The invention 95.3 91.3 95.1 81.7 86.0 88.5
TABLE 4 recognition accuracy of the present invention and other prior art techniques on Chinese and Bengali evaluation datasets
Method Pan+ChiPhoto ISI Bengali
HOG 59.2 87.4
CNN 61.5 89.7
ConvCoHOG 71.2 92.2
The invention 89.4 97.4
Tables 3 and 4 list the recognition accuracy of the present invention and other prior art on the evaluation data sets. Among them, IIIT5k, SVT, IC13, IC15, SVTP and CT80 are English text data sets commonly used in the field. As shown by the recognition accuracy (in %) in the tables, the present invention achieves the best results on most data sets. The invention has a large advantage in accuracy on the IC15 data set; on the two smaller data sets SVTP and CT80, the accuracy of the invention is somewhat behind the SAR method. Pan+ChiPhoto is a Chinese data set and ISI Bengali is a Bengali data set, on both of which the invention also achieves the highest recognition accuracy. HOG is the method in the literature (Dalal, N., & Triggs, B. (2005). Histograms of Oriented Gradients for Human Detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) (Vol. 1, pp. 886-893)); CNN is the method in the literature (Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014). Deep Features for Text Spotting. In European Conference on Computer Vision (pp. 512-528)); ConvCoHOG is a method based on the convolutional co-occurrence histogram of oriented gradients. Overall, the invention is more advanced than the prior art on the task of character recognition in natural scenes.
FIG. 4 compares the glyphs generated by the present invention and the SSFL method when processing irregularly shaped text pictures. The SSFL method generates a standard glyph by a global mapping and cannot handle irregularly shaped text well. The invention locates the approximate position of each character through the attention mechanism and then generates the corresponding standard glyph, thereby obtaining better results for irregular text.
FIG. 5 plots the glyph pixel loss (also called the L1 loss) during training for the present invention and other prior art. CNN-DCNN is the standard glyph generation framework used by the SSFL method; CNN-DCNN (Skip) adds skip connections to CNN-DCNN; CNN-LSTM-DCNN is an improved version of CNN-DCNN in which the CNN features are passed through an LSTM before being delivered to the deconvolution network (DCNN); Attentional Generation is the attention-based standard glyph generation framework proposed herein. For a fair comparison, the four methods use the same CNN and DCNN structure configuration and the same training data, and multi-font generation is also introduced into the first three methods. The comparison shows that the attention-based generation method provided by the invention generates more accurate standard glyphs.
Fig. 6 compares the visualization results of the attention weight matrix (i.e., M(x, t)) obtained by the SAR method and by the present invention. Because the invention learns to generate standard glyphs, its attention module produces a more accurate and more meaningful attention weight matrix. The 2nd and 3rd columns in the figure show the heat maps of M(x, t) calculated by the SAR method and by the invention, respectively, and the underlined letter below each heat map is the character label predicted by the model at that moment. Taking the first group of pictures as an example, the invention focuses attention on the lower half of the cursive letter "L" and correctly recognizes it as "L", whereas the SAR method, focusing on the lower half of the cursive "L", mistakenly recognizes it as "R".
FIG. 7 compares the standard glyphs generated with and without adversarial learning. In the figure, one output row shows the result without adversarial training, another output row shows the result with adversarial training, and "target" is the true standard glyph. Through adversarial learning, the invention can better generate the standard glyphs and recognize the text content for blurred and distorted text. Although many generated standard glyphs still differ somewhat from the true standard glyphs even with adversarial training, they are nevertheless clearly improved compared with the results obtained without adversarial training.
FIG. 8 compares the standard glyphs generated when training with a single font and with multiple fonts. One output row shows the result of training with a single font (the name of that font is given in parentheses), another shows the result of training with multiple fonts, and "target" is the true standard glyph. If only the standard glyphs of one font are used for training, the model cannot correctly generate the standard glyph or recognize the character when it encounters a character with a novel font style at test time. By generating standard glyphs of multiple fonts, the model can better learn features irrelevant to the font, and thereby correctly recognize the content of the characters.
The technical solutions in the embodiments of the present invention are clearly and completely described above with reference to the drawings in the embodiments of the present invention. It is to be understood that the described examples are only a few embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Claims (8)

1. A character recognition method based on standard font generation is characterized in that a neural network model based on an attention mechanism and a generation mechanism is established, attention is focused at a certain position of an image at each moment, character category prediction and multi-font standard font generation are respectively carried out by utilizing the neural network characteristics of the position until all characters in the image are traversed, and therefore the characters in a natural scene image containing one or more characters are recognized and output;
the attention mechanism and generation mechanism based neural network model comprises:
A. a convolutional neural network for extracting visual features F(x) of the input picture x;
B. a recurrent neural network for sequence modeling of the features F(x); the recurrent neural network comprises an LSTM encoder and a decoder;
C. an attention module for obtaining an attention weight matrix M(x, t) from the hidden state h(x, t) of the recurrent neural network at time t and F(x);
D. a classifier for classifying the features;
E. a glyph generator for generating, from the attention vector c(x, t), the standard glyphs ĝ_i(x, t) (i = 1, ..., m) of its corresponding m fonts;
F. a glyph discriminator for competing against the glyph generator so that the glyph generator can generate more realistic standard glyphs;
establishing and training the neural network model based on the attention mechanism and the generation mechanism, wherein the loss function of the model comprises: the character category prediction loss, the glyph pixel loss and the glyph discriminator loss; the parameters to be learned comprise the parameters in the CNN feature extractor, the LSTM encoder, the LSTM decoder, the attention module, the glyph generator and the glyph discriminator;
the training process iteratively optimizes two objective functions L_G and L_D, expressed by formula (5) and formula (6):

L_G = Σ_{t=1..T} [ −log p(y_t | x) + α Σ_{i=1..m} ( −log p(y_d = 1 | ĝ_i(x, t)) + ‖ĝ_i(x, t) − g_i(x, t)‖_1 ) ]   formula (5)

L_D = Σ_{t=1..T} Σ_{i=1..m} [ −log p(y_d = 0 | ĝ_i(x, t)) − log p(y_d = 1 | g_i(x, t)) ]   formula (6)

wherein α is a weight coefficient; y_1, y_2, ..., y_T are the category labels of all T characters in the input picture x; ‖·‖_1 denotes the L1 norm; in L_G, −log p(y_t | x) is the character category prediction loss, −log p(y_d = 1 | ĝ_i(x, t)) is the loss for the glyph discriminator falsely judging the generated glyph to be real, and ‖ĝ_i(x, t) − g_i(x, t)‖_1 is the glyph pixel loss; in L_D, −log p(y_d = 0 | ĝ_i(x, t)) is the loss for the glyph discriminator correctly judging the generated glyph to be fake, and −log p(y_d = 1 | g_i(x, t)) is the loss for the glyph discriminator correctly judging the real glyph to be real;
the character recognition method based on standard font generation comprises the following steps:
1) structurally modifying the convolutional neural network in the ASTER method, wherein the stride of the first convolution unit in each of the last two convolution groups of the convolutional neural network is set to 1 × 1, and using it as the CNN feature extractor; extracting the visual features F(x) of the input picture x by using the convolutional neural network, wherein F(x) ∈ ℝ^{H×W×C}, and H, W, C respectively denote the height, width and number of channels of the feature F(x);
2) using a recurrent neural network to perform sequence modeling on F (x), and transmitting the hidden state h (x, t) of the recurrent neural network at the time t and the F (x) to an attention module together to obtain an attention weight matrix M (x, t) which represents the attention distributed to each area of the image at the time t;
3) performing point multiplication on each feature channel by using F (x) and M (x, t) to obtain an attention vector c (x, t) which represents the feature of the concerned picture area at the time t;
4) classifying the features of c (x, t) and h (x, t) after the c (x, t) and the h (x, t) are connected in series by using a classifier, and predicting the character category of the attention position at the moment t;
5) generating, from the attention vector c(x, t) by using the glyph generator, the standard glyphs ĝ_i(x, t) (i = 1, ..., m) of the m fonts corresponding to it; and further using the glyph discriminator to compete against the glyph generator, so that the glyph generator can generate more realistic standard glyphs;
through the steps, the characters in the picture are identified based on the standard font generation.
2. The method for recognizing characters based on standard font generation as claimed in claim 1, wherein, in step 1), for the visual features F(x) of the picture x, the picture x is scaled to a height of 48 pixels and a width of 160 pixels with the number of color channels set to 3, i.e. x ∈ ℝ^{48×160×3}; and F(x) ∈ ℝ^{H×W×C}, wherein H = 6, W = 40, C = 512.
3. The method as claimed in claim 1, wherein, in step 2), an LSTM encoder and decoder are used to perform sequence modeling on the features F(x), comprising:
21) the LSTM encoder and decoder each comprise two hidden layers, each layer having 512 nodes;
22) each group of features of F(x) along the width W dimension is max-pooled (Max-Pooling) along the height dimension and then input into the LSTM encoder in sequence;
23) the hidden state of the LSTM encoder at the last time step is taken as the initial state of the LSTM decoder;
24) the hidden state h(x, t) of the LSTM decoder at time t is fed, together with F(x), to the attention module to obtain the attention weight matrix M(x, t) ∈ ℝ^{H×W}, which is calculated as follows:

M′_ij(x, t) = tanh( Σ_{p,q∈N(i,j)} W_F F_pq(x) + W_h h(x, t) )   formula (1)

M(x, t) = softmax( W_M M′(x, t) )   formula (2)

wherein M′(x, t) is an intermediate variable of the calculation; M′_ij(x, t) denotes the value of M′(x, t) at position (i, j), with 1 ≤ i ≤ H and 1 ≤ j ≤ W; N(i, j) denotes the neighbourhood centred at (i, j), i.e. i − 1 ≤ p ≤ i + 1 and j − 1 ≤ q ≤ j + 1; F_pq(x) denotes the feature of F(x) at position (p, q); W_F and W_h are parameters to be learned; tanh is the hyperbolic tangent function, and softmax is the softmax function.
4. The method for recognizing characters based on standard font generation as claimed in claim 1, wherein, in step 4), a softmax classifier is used to classify the concatenated features of the attention vector c(x, t) and the hidden state h(x, t) to obtain the probability of the character category y_t at the attended position at time t:

p(y_t | x) = softmax( W_o [c(x, t); h(x, t)] + b_o )   formula (3)

wherein W_o and b_o are parameters to be learned, the square brackets denote concatenation, y_t ∈ C, and C denotes the overall set of character categories.
5. The method for character recognition based on standard font generation as claimed in claim 1, wherein, in step 5), a glyph generator based on a deconvolutional neural network (DCNN) takes the attention vector c(x, t) as input and generates the standard glyphs ĝ_i(x, t) of the m selected fonts, as expressed by formula (4):

ĝ_i(x, t) = DCNN( [c(x, t); z_i] ), i = 1, ..., m   formula (4)

wherein z_i is the embedding vector of the i-th font, a random vector following a multivariate standard normal distribution, the square brackets denote concatenation, and m is the set number of fonts.
6. The method for recognizing characters based on standard font generation as claimed in claim 1, wherein TTF or OTF file rendering is used to obtain the real multi-font standard glyphs g_i(x, t).
7. The method for recognizing characters based on standard font generation as claimed in claim 1, further comprising using the idea of a generative adversarial network to discriminate the generated standard glyphs from the real standard glyphs by using a glyph discriminator based on a convolutional neural network, wherein the glyph discriminator gives the probability that a generated glyph ĝ_i(x, t) is real as p(y_d = 1 | ĝ_i(x, t)) and the probability that it is fake as p(y_d = 0 | ĝ_i(x, t)) = 1 − p(y_d = 1 | ĝ_i(x, t)); and gives the probability that a real glyph g_i(x, t) is real as p(y_d = 1 | g_i(x, t)) and the probability that it is fake as p(y_d = 0 | g_i(x, t)) = 1 − p(y_d = 1 | g_i(x, t)).
8. The method of claim 1, wherein an adam optimizer is used to optimize network parameters, and the initial learning rate is set to 0.001; the weight coefficient α is set to 0.01.
CN201910716704.1A 2019-08-05 2019-08-05 Natural scene character recognition method based on standard font generation Active CN112329803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910716704.1A CN112329803B (en) 2019-08-05 2019-08-05 Natural scene character recognition method based on standard font generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910716704.1A CN112329803B (en) 2019-08-05 2019-08-05 Natural scene character recognition method based on standard font generation

Publications (2)

Publication Number Publication Date
CN112329803A CN112329803A (en) 2021-02-05
CN112329803B true CN112329803B (en) 2022-08-26

Family

ID=74319415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910716704.1A Active CN112329803B (en) 2019-08-05 2019-08-05 Natural scene character recognition method based on standard font generation

Country Status (1)

Country Link
CN (1) CN112329803B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792855B (en) * 2021-09-09 2023-06-23 北京百度网讯科技有限公司 Model training and word stock building method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10515296B2 (en) * 2017-11-14 2019-12-24 Adobe Inc. Font recognition by dynamically weighting multiple deep learning neural networks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007122500A (en) * 2005-10-28 2007-05-17 Ricoh Co Ltd Character recognition device, character recognition method and character data
CN107577651A (en) * 2017-08-25 2018-01-12 上海交通大学 Chinese character style migratory system based on confrontation network
CN107644006A (en) * 2017-09-29 2018-01-30 北京大学 A kind of Chinese script character library automatic generation method based on deep neural network
CN108615036A (en) * 2018-05-09 2018-10-02 中国科学技术大学 A kind of natural scene text recognition method based on convolution attention network
CN108804397A (en) * 2018-06-12 2018-11-13 华南理工大学 A method of the Chinese character style conversion based on a small amount of target font generates
CN109255356A (en) * 2018-07-24 2019-01-22 阿里巴巴集团控股有限公司 A kind of character recognition method, device and computer readable storage medium
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Focusing Attention: Towards Accurate Text Recognition in Natural Images; Zhanzhan Cheng et al.; 2017 IEEE International Conference on Computer Vision (ICCV); 2017-12-25; pp. 5076-5084 *

Also Published As

Publication number Publication date
CN112329803A (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN109961034B (en) Video target detection method based on convolution gating cyclic neural unit
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN110334705B (en) Language identification method of scene text image combining global and local information
CN109948714B (en) Chinese scene text line identification method based on residual convolution and recurrent neural network
CN111753828B (en) Natural scene horizontal character detection method based on deep convolutional neural network
CN112819686B (en) Image style processing method and device based on artificial intelligence and electronic equipment
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
CN109753897B (en) Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning
CN107358172B (en) Human face feature point initialization method based on human face orientation classification
CN108681735A (en) Optical character recognition method based on convolutional neural networks deep learning model
CN111428557A (en) Method and device for automatically checking handwritten signature based on neural network model
CN112364873A (en) Character recognition method and device for curved text image and computer equipment
CN110738201B (en) Self-adaptive multi-convolution neural network character recognition method based on fusion morphological characteristics
CN110427819A (en) The method and relevant device of PPT frame in a kind of identification image
CN110689044A (en) Target detection method and system combining relationship between targets
CN110503090B (en) Character detection network training method based on limited attention model, character detection method and character detector
CN115116074A (en) Handwritten character recognition and model training method and device
CN115880704A (en) Automatic case cataloging method, system, equipment and storage medium
CN112329803B (en) Natural scene character recognition method based on standard font generation
Abe et al. Font creation using class discriminative deep convolutional generative adversarial networks
CN111242114B (en) Character recognition method and device
Nandhini et al. Sign language recognition using convolutional neural network
CN111461061A (en) Pedestrian re-identification method based on camera style adaptation
CN111461239A (en) White box attack method of CTC scene character recognition model
CN108537855B (en) Ceramic stained paper pattern generation method and device with consistent sketch

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant