CN110414498B - Natural scene text recognition method based on cross attention mechanism - Google Patents


Info

Publication number
CN110414498B
Authority
CN
China
Prior art keywords
training
network
text
feature
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910517855.4A
Other languages
Chinese (zh)
Other versions
CN110414498A (en
Inventor
Luo Canjie (罗灿杰)
Jin Lianwen (金连文)
Huang Yunlong (黄云龙)
Lin Qingxiang (林庆祥)
Zhou Weiying (周伟英)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201910517855.4A priority Critical patent/CN110414498B/en
Publication of CN110414498A publication Critical patent/CN110414498A/en
Application granted granted Critical
Publication of CN110414498B publication Critical patent/CN110414498B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a natural scene text recognition method based on a cross attention mechanism, which comprises the following steps. Data acquisition: download sample pictures of natural scenes and synthesize a training set from them using publicly available code. Data processing: stretch all training sample pictures so that each processed picture is 32 × 100 pixels, keeping the aspect ratio consistent with the original picture and filling the insufficient part with black borders. Label making: train the recognition model with a supervised method, so that every line-of-text picture has corresponding text annotation. Network training: input the prepared training picture data and labels into a cross attention network for training, the cross attention network consisting of a vertical attention network and a horizontal attention network. Finally, input test data into the trained network to obtain the recognition result and the predicted confidence of each character. The method has high recognition accuracy, strong robustness and good recognition performance for irregular text.

Description

Natural scene text recognition method based on cross attention mechanism
Technical Field
The invention belongs to the technical field of pattern recognition and artificial intelligence, and particularly relates to a natural scene text recognition method based on a cross attention mechanism.
Background
With the rapid development of computer technology, artificial intelligence is gradually changing our lives, making them more convenient and efficient. The rapid progress of hardware such as GPUs in recent years has also made practical application of deep neural networks possible.
In daily life we are surrounded by text, and a large share of human visual information is carried by it. Whether in the past or the future, people depend heavily on text to obtain information, so correctly recognizing text is a crucial step. Recognizing text in a picture is simple for humans, but it is far from easy for computers. If a computer is to help humans understand the information in an image, it must first recognize the text in that image correctly. Text in natural scenes appears against rich and varied backgrounds, and artistic effects often make the arrangement of characters irregular, for example curved, which greatly increases the difficulty of recognition. These factors make natural scene text recognition a hard problem, so a more effective recognition method is urgently needed.
Progress in deep neural networks has provided the necessary tools, and researchers have recently proposed a variety of methods that apply deep neural networks to natural scene text recognition. Among them, attention-based methods have greatly improved recognition rates through their special decoding scheme and their ability to infer semantics, and attention-based recognition networks are now used in many text recognition systems. However, traditional attention-based scene text recognition methods directly compress the original picture into a feature map of height 1, which introduces noise into the feature map used for decoding and thereby harms the recognition result.
Disclosure of Invention
The invention aims to provide a natural scene text recognition method based on a cross attention mechanism, which solves the problems in the prior art and enables text with irregular arrangement shapes to be correctly recognized.
In order to achieve the above object, the present invention provides a natural scene text recognition method based on a cross attention mechanism, comprising the following steps:
S1, data acquisition: downloading sample pictures of natural scenes and synthesizing a training set from them using publicly available code, the training set covering multiple fonts and backgrounds;
S2, data processing: stretching all training sample pictures so that each processed picture is 32 × 100 pixels, keeping the aspect ratio consistent with the original picture and filling the insufficient part with black borders;
S3, label making: training the recognition model with a supervised method, so that every line-of-text picture has corresponding text annotation;
S4, network training: inputting the training picture data and labels from step S2 into a cross attention network for training, the cross attention network consisting of a vertical attention network and a horizontal attention network connected in series;
S5, inputting test data into the trained network to obtain the recognition result and predict the confidence of each character.
Preferably, the text of the training set in the step S1 covers multiple fonts and multiple backgrounds.
Preferably, in the step S2, the training sample picture is stretched such that the height becomes 32 pixels, the width is scaled according to the original aspect ratio, and any remaining width is filled with black borders.
Preferably, the step S3 includes the steps of:
S3.1, synthesizing line-of-text pictures using publicly available code and a text corpus;
S3.2, storing the text content of each text picture in a corresponding text file;
S3.3, randomly dividing the synthesized line-of-text pictures into a training set and a validation set.
Preferably, the step S4 includes:
S4.1, constructing a convolutional neural network with the convolution module as its basic unit;
S4.2, constructing a cross attention network, wherein the cross attention network uses asymmetric convolutions to compute, for each column sub-feature map of the two-dimensional feature map, the weight vector $\alpha_j = \{\alpha_{1,j}, \alpha_{2,j}, \ldots, \alpha_{n,j}\}$ occupied by its feature vectors:

$$\alpha_{i,j} = \frac{\exp(g_{i,j})}{\sum_{i'=1}^{H} \exp(g_{i',j})}$$
where $H$ denotes the height of the feature map and $g_{i,j}$ is the weight of the feature vector in row $i$ of column $j$ within that column, computed by the vertical attention network:

$$g_{i,j} = \mathrm{conv}_{1\times 1}\big(\mathrm{conv}_{3\times 1}(X_j) + X_j\big)$$
where $\mathrm{conv}_{h\times w}(X)$ denotes convolving the feature map $X$ with a kernel of height $h$ and width $w$, and $X_j$ denotes the $j$-th column sub-feature map of the two-dimensional feature map generated by the convolutional neural network;
after the weights of the feature vectors in each column-wise sub-feature map have been computed, each sub-feature map is summed, weighted by the feature-vector weights at the corresponding positions, to obtain a feature vector $f_{v,j}$:

$$f_{v,j} = \sum_{i=1}^{H} \alpha_{i,j} X_{i,j}$$
The feature vectors $f_{v,j}$ produced by all sub-feature maps are concatenated into a feature sequence $f_v$. After $f_v$ passes through a BLSTM network composed of bidirectional long short-term memory models, a feature sequence $f'_v$ with contextual features is obtained. The sequence $f'_v$ is fed into a horizontal attention network, which at each time step extracts a confidence probability distribution $y_t$ over the character currently to be recognized:

$$y_t = \mathrm{softmax}(\psi(h_t))$$
where $\psi$ denotes a fully connected layer that reduces the dimension of the decoding vector $h_t$ to the number of target characters; $h_t$ is obtained by feeding the context vector $ctx_t$ from the horizontal attention network, together with the word embedding $emb_{t-1}$ of the character decoded at the previous time step, into a gated recurrent unit:

$$h_t = \mathrm{GRU}([ctx_t; emb_{t-1}], h_{t-1})$$
where GRU denotes the GRU network operation and $h_{t-1}$ is the hidden-state vector output by the GRU at the previous time step; the context vector $ctx_t$ is obtained as:

$$ctx_t = \sum_{j} \beta_{t,j} f'_{v,j}$$
where $\beta_{t,j}$ denotes the weight of the $j$-th feature vector in the feature sequence, obtained by applying fully connected layers to the feature sequence $f'_v$ and the vector $h_{t-1}$ used to decode the previous character:

$$\beta_{t,j} = \frac{\exp(e_{t,j})}{\sum_{j'} \exp(e_{t,j'})}$$

$$e_{t,j} = W^{T}\tanh(Q f'_{v,j} + V h_{t-1} + b)$$
where $W$, $Q$, $V$ and $b$ are trainable weights; the character corresponding to the highest-confidence value in $y_t$ is selected as the currently decoded output character $c_t$.
S4.3, training parameter setting: the training data are fed into the network, which traverses the training data set 10 times in total; the optimizer is an adaptive gradient descent method with an initial learning rate of 1; after the network has traversed the training set 5 times, the learning rate is manually reduced to 0.1 and training continues until the 10 traversals are complete;
the Loss value Loss of network output can be calculated by the following formula:
$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j}\log p\big(c_{i,j} \mid I^{(i)}; \theta\big)$$

where $N$ denotes the amount of data used in the current batch optimization and $p(c_{i,j} \mid I^{(i)}; \theta)$ denotes the probability of outputting character $c_{i,j}$ for the $i$-th sample picture at the $j$-th time step;
S4.4, weight initialization: at the start of training, the weight parameters of all networks are randomly initialized with Gaussian noise;
S4.5, training the convolutional neural network: the probability that each character of the target character string is output at its corresponding time step is used to compute the cross entropy, which is minimized by gradient descent.
Preferably, the step S5 includes:
S5.1, inputting the pictures and labels of the test set into the trained network for a recognition test;
S5.2, after recognition is completed, the program computes the recognition accuracy;
S5.3, randomly displaying the visualized recognition process of 20 photos, where the character features of each photo are selected crosswise by the horizontal and vertical attention networks.
The invention discloses the following technical effects:
1. due to the adoption of the automatic learning recognition algorithm of the deep network structure, effective expression can be well learned from data, and the recognition accuracy is improved.
2. The invention adopts end-to-end training, does not need to mark the position of each character, and saves the marking cost.
3. The method has high recognition accuracy, strong robustness and good recognition performance for irregular text.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a flow chart of data acquisition and processing in accordance with the present invention;
FIG. 3 is a natural scene text recognition flow chart of the present invention;
FIG. 4 is an example of the result of an attention heat map in the identification process of the present invention;
fig. 5 is a table of the deep convolutional neural network structure and parameter configuration of the present invention.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Referring to fig. 1-5, the invention provides a natural scene text recognition method based on a cross attention mechanism, which comprises the following steps:
S1, data acquisition: downloading sample pictures of natural scenes and synthesizing a training set from them using publicly available code, the training set covering multiple fonts and backgrounds;
S2, data processing: stretching all training sample pictures so that each processed picture is 32 × 100 pixels, keeping the aspect ratio consistent with the original picture and filling the insufficient part with black borders;
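As a concrete illustration of step S2, a minimal preprocessing sketch follows; it assumes OpenCV, NumPy and a 3-channel input picture, and the helper name resize_and_pad is an illustrative assumption rather than part of the patent:

```python
import cv2
import numpy as np

def resize_and_pad(image, target_h=32, target_w=100):
    """Stretch the height to 32 px, scale the width by the same ratio
    (capped at 100 px), and fill the remaining width with black borders."""
    h, w = image.shape[:2]
    scale = target_h / h
    new_w = min(int(round(w * scale)), target_w)   # keep the original aspect ratio
    resized = cv2.resize(image, (new_w, target_h))
    canvas = np.zeros((target_h, target_w, 3), dtype=np.uint8)  # black canvas
    canvas[:, :new_w] = resized
    return canvas
```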
S3, label making: training the recognition model with a supervised method, so that every line-of-text picture has corresponding text annotation;
S3.1, synthesizing line-of-text pictures using publicly available code and a text corpus;
S3.2, storing the text content of each text picture in a corresponding text file;
S3.3, randomly dividing the synthesized line-of-text pictures into a training set and a validation set.
S4, network training: inputting the training picture data and labels from step S2 into a cross attention network for training, the cross attention network consisting of a vertical attention network and a horizontal attention network connected in series;
S4.1, constructing a convolutional neural network with the convolution module as its basic unit: input picture → first convolution block → first convolutional layer → second convolution block → second convolutional layer → third convolution block → third convolutional layer → fourth convolutional layer; the convolutional neural network downsamples the features through its convolutional layers, each of which downsamples by a factor of 2;
each convolution module carries out the following computation involving its convolutional layers: input feature map → first convolutional layer → first feature map → second convolutional layer → second feature map → third convolutional layer;
the feature map output by the first convolutional layer and the feature map output by the third convolutional layer are added element-wise to give the module's output feature map; no convolution module downsamples its feature map;
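A minimal PyTorch sketch of one such convolution module follows; the channel count and 3 × 3 kernels are illustrative assumptions (the patent's exact configuration is tabulated in fig. 5):

```python
import torch.nn as nn

class ConvModule(nn.Module):
    """Three convolutional layers; the output of the first layer is added
    element-wise to the output of the third. Stride 1 throughout, so the
    module itself performs no downsampling."""
    def __init__(self, channels=128):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f1 = self.relu(self.conv1(x))   # first feature map
        f2 = self.relu(self.conv2(f1))  # second feature map
        f3 = self.conv3(f2)             # third convolutional layer
        return self.relu(f1 + f3)       # residual addition of first and third outputs
```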
S4.2, constructing a cross attention network, wherein the cross attention network uses asymmetric convolutions to compute, for each column sub-feature map of the two-dimensional feature map, the weight vector $\alpha_j = \{\alpha_{1,j}, \alpha_{2,j}, \ldots, \alpha_{n,j}\}$ occupied by its feature vectors:

$$\alpha_{i,j} = \frac{\exp(g_{i,j})}{\sum_{i'=1}^{H} \exp(g_{i',j})}$$
where $H$ denotes the height of the feature map and $g_{i,j}$ is the weight of the feature vector in row $i$ of column $j$ within that column, computed by the vertical attention network:

$$g_{i,j} = \mathrm{conv}_{1\times 1}\big(\mathrm{conv}_{3\times 1}(X_j) + X_j\big)$$
where $\mathrm{conv}_{h\times w}(X)$ denotes convolving the feature map $X$ with a kernel of height $h$ and width $w$, and $X_j$ denotes the $j$-th column sub-feature map of the two-dimensional feature map generated by the convolutional neural network;
after the weights of the feature vectors in each column-wise sub-feature map have been computed, each sub-feature map is summed, weighted by the feature-vector weights at the corresponding positions, to obtain a feature vector $f_{v,j}$:

$$f_{v,j} = \sum_{i=1}^{H} \alpha_{i,j} X_{i,j}$$
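The vertical attention computation above can be sketched in PyTorch as follows; this is a non-authoritative sketch assuming an N×C×H×W input tensor, with the channel count chosen for illustration:

```python
import torch.nn as nn
import torch.nn.functional as F

class VerticalAttention(nn.Module):
    """Collapses the height of the 2-D feature map column by column:
    g = conv1x1(conv3x1(X) + X), alpha = softmax over the height,
    f_v = sum over rows of alpha * X."""
    def __init__(self, channels=512):
        super().__init__()
        self.conv3x1 = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        self.conv1x1 = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                      # x: (N, C, H, W)
        g = self.conv1x1(self.conv3x1(x) + x)  # g_{i,j}: (N, 1, H, W)
        alpha = F.softmax(g, dim=2)            # normalize over the height H
        f_v = (alpha * x).sum(dim=2)           # (N, C, W) feature sequence
        return f_v.permute(2, 0, 1)            # (W, N, C), ready for the BLSTM
```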
The feature vectors $f_{v,j}$ produced by all sub-feature maps are concatenated into a feature sequence $f_v$. After $f_v$ passes through a BLSTM network composed of bidirectional long short-term memory models, a feature sequence $f'_v$ with contextual features is obtained. The sequence $f'_v$ is fed into a horizontal attention network, which at each time step extracts a confidence probability distribution $y_t$ over the character currently to be recognized:

$$y_t = \mathrm{softmax}(\psi(h_t))$$
where $\psi$ denotes a fully connected layer that reduces the dimension of the decoding vector $h_t$ to the number of target characters; $h_t$ is obtained by feeding the context vector $ctx_t$ from the horizontal attention network, together with the word embedding $emb_{t-1}$ of the character decoded at the previous time step, into a gated recurrent unit:

$$h_t = \mathrm{GRU}([ctx_t; emb_{t-1}], h_{t-1})$$
where GRU denotes the GRU network operation and $h_{t-1}$ is the hidden-state vector output by the GRU at the previous time step; the context vector $ctx_t$ is obtained as:

$$ctx_t = \sum_{j} \beta_{t,j} f'_{v,j}$$
where $\beta_{t,j}$ denotes the weight of the $j$-th feature vector in the feature sequence, obtained by applying fully connected layers to the feature sequence $f'_v$ and the vector $h_{t-1}$ used to decode the previous character:

$$\beta_{t,j} = \frac{\exp(e_{t,j})}{\sum_{j'} \exp(e_{t,j'})}$$

$$e_{t,j} = W^{T}\tanh(Q f'_{v,j} + V h_{t-1} + b)$$
where $W$, $Q$, $V$ and $b$ are trainable weights; the character corresponding to the highest-confidence value in $y_t$ is selected as the currently decoded output character $c_t$.
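One decoding step of the horizontal attention network might look like the sketch below; the dimensions, the layer names and the use of nn.GRUCell are assumptions for illustration, not the patent's definitive implementation:

```python
import torch
import torch.nn as nn

class HorizontalAttentionStep(nn.Module):
    """Additive attention over f'_v followed by a GRU cell:
    e = W^T tanh(Q f'_v + V h_{t-1} + b), beta = softmax(e),
    ctx = sum(beta * f'_v), h_t = GRU([ctx; emb_{t-1}], h_{t-1})."""
    def __init__(self, feat_dim=512, hidden=256, emb_dim=128, num_classes=97):
        super().__init__()
        self.Q = nn.Linear(feat_dim, hidden, bias=False)
        self.V = nn.Linear(hidden, hidden, bias=True)   # the bias plays the role of b
        self.w = nn.Linear(hidden, 1, bias=False)       # W^T
        self.gru = nn.GRUCell(feat_dim + emb_dim, hidden)
        self.psi = nn.Linear(hidden, num_classes)       # fully connected layer psi

    def forward(self, f_v, h_prev, emb_prev):           # f_v: (T, N, feat_dim)
        e = self.w(torch.tanh(self.Q(f_v) + self.V(h_prev).unsqueeze(0)))  # (T, N, 1)
        beta = torch.softmax(e, dim=0)                  # weights over sequence positions
        ctx = (beta * f_v).sum(dim=0)                   # context vector: (N, feat_dim)
        h_t = self.gru(torch.cat([ctx, emb_prev], dim=1), h_prev)
        y_t = torch.softmax(self.psi(h_t), dim=1)       # confidence distribution y_t
        return y_t, h_t
```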
S4.3, training parameter setting: the training data are fed into the network, which traverses the training data set 10 times in total; the optimizer is an adaptive gradient descent method with an initial learning rate of 1; after the network has traversed the training set 5 times, the learning rate is manually reduced to 0.1 and training continues until the 10 traversals are complete;
the Loss value Loss of network output can be calculated by the following formula:
$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j}\log p\big(c_{i,j} \mid I^{(i)}; \theta\big)$$

where $N$ denotes the amount of data used in the current batch optimization and $p(c_{i,j} \mid I^{(i)}; \theta)$ denotes the probability of outputting character $c_{i,j}$ for the $i$-th sample picture at the $j$-th time step;
S4.4, weight initialization: at the start of training, the weight parameters of all networks are randomly initialized with Gaussian noise;
S4.5, training the convolutional neural network: the probability that each character of the target character string is output at its corresponding time step is used to compute the cross entropy, which is minimized by gradient descent.
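For S4.5, the loss can be written as the following sketch, assuming the network returns per-step log-softmax outputs; handling of padded targets is omitted for brevity:

```python
import torch

def recognition_loss(log_probs, targets):
    """Cross entropy of the target string:
    Loss = -(1/N) * sum_i sum_j log p(c_{i,j} | I^(i); theta).

    log_probs: (N, T, num_classes) per-time-step log-softmax outputs
    targets:   (N, T) ground-truth character indices
    """
    picked = log_probs.gather(2, targets.unsqueeze(2)).squeeze(2)  # (N, T)
    return -picked.sum(dim=1).mean()  # sum over time steps, mean over the batch
```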
S5, inputting test data into the trained network to obtain the recognition result and predict the confidence of each character;
S5.1, inputting the pictures and labels of the test set into the trained network for a recognition test;
S5.2, after recognition is completed, the program computes the recognition accuracy;
S5.3, randomly displaying the visualized recognition process of 20 photos, where the character features of each photo are selected crosswise by the horizontal and vertical attention networks.
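For S5, greedy decoding with per-character confidences could be sketched as follows; the step function, the start token and the EOS index are assumptions for illustration:

```python
import torch

def greedy_decode(step_fn, embed, f_v, hidden=256, max_len=25, eos_idx=0):
    """At each time step pick the highest-confidence character from y_t
    and record its confidence; stop once every sample has emitted EOS."""
    n = f_v.size(1)
    h = torch.zeros(n, hidden)                     # initial hidden state
    emb = embed(torch.zeros(n, dtype=torch.long))  # assumed start token at index 0
    chars, confs = [], []
    for _ in range(max_len):
        y_t, h = step_fn(f_v, h, emb)              # e.g. HorizontalAttentionStep above
        conf, idx = y_t.max(dim=1)                 # character with highest confidence
        chars.append(idx)
        confs.append(conf)
        emb = embed(idx)
        if (idx == eos_idx).all():
            break
    return torch.stack(chars, dim=1), torch.stack(confs, dim=1)
```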
The example shown in fig. 4 displays the recognition results for 5 strongly curved texts.
In the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate or are based on the orientation or positional relationship shown in the drawings, merely to facilitate description of the present invention, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention. The above embodiments are only illustrative of the preferred embodiments of the present invention and are not intended to limit the scope of the present invention, and various modifications and improvements made by those skilled in the art to the technical solutions of the present invention should fall within the protection scope defined by the claims of the present invention without departing from the design spirit of the present invention.

Claims (5)

1. A natural scene text recognition method based on a cross attention mechanism, characterized by comprising the following steps:
S1, data acquisition: downloading sample pictures of natural scenes and synthesizing a training set from them using publicly available code, the training set covering multiple fonts and backgrounds;
S2, data processing: stretching all training sample pictures so that each processed picture is 32 × 100 pixels, keeping the aspect ratio consistent with the original picture and filling the insufficient part with black borders;
S3, label making: training the recognition model with a supervised method, so that every line-of-text picture has corresponding text annotation;
S4, network training: inputting the training picture data and labels from S2 into a cross attention network for training, the cross attention network consisting of a vertical attention network and a horizontal attention network connected in series;
S5, inputting test data into the trained network to obtain the recognition result and predict the confidence of each character;
The step S4 comprises the following steps:
S4.1, constructing a convolutional neural network with the convolution module as its basic unit;
S4.2, constructing a cross attention network, wherein the cross attention network uses asymmetric convolutions to compute, for each column sub-feature map of the two-dimensional feature map, the weight vector $\alpha_j = \{\alpha_{1,j}, \alpha_{2,j}, \ldots, \alpha_{n,j}\}$ occupied by its feature vectors:

$$\alpha_{i,j} = \frac{\exp(g_{i,j})}{\sum_{i'=1}^{H} \exp(g_{i',j})}$$
where $H$ denotes the height of the feature map and $g_{i,j}$ is the weight of the feature vector in row $i$ of column $j$ within that column, computed by the vertical attention network:

$$g_{i,j} = \mathrm{conv}_{1\times 1}\big(\mathrm{conv}_{3\times 1}(X_j) + X_j\big)$$
where $\mathrm{conv}_{h\times w}(X)$ denotes convolving the feature map $X$ with a kernel of height $h$ and width $w$, and $X_j$ denotes the $j$-th column sub-feature map of the two-dimensional feature map generated by the convolutional neural network; after the weights of the feature vectors in each column-wise sub-feature map have been computed, each sub-feature map is summed, weighted by the feature-vector weights at the corresponding positions, to obtain a feature vector $f_{v,j}$:

$$f_{v,j} = \sum_{i=1}^{H} \alpha_{i,j} X_{i,j}$$
The feature vectors $f_{v,j}$ produced by all sub-feature maps are concatenated into a feature sequence $f_v$. After $f_v$ passes through a BLSTM network composed of bidirectional long short-term memory models, a feature sequence $f'_v$ with contextual features is obtained. The sequence $f'_v$ is fed into a horizontal attention network, which at each time step extracts a confidence probability distribution $y_t$ over the character currently to be recognized:

$$y_t = \mathrm{softmax}(\psi(h_t))$$
where $\psi$ denotes a fully connected layer that reduces the dimension of the decoding vector $h_t$ to the number of target characters; $h_t$ is obtained by feeding the context vector $ctx_t$ from the horizontal attention network, together with the word embedding $emb_{t-1}$ of the character decoded at the previous time step, into a gated recurrent unit:

$$h_t = \mathrm{GRU}([ctx_t; emb_{t-1}], h_{t-1})$$
where GRU denotes the GRU network operation and $h_{t-1}$ is the hidden-state vector output by the GRU at the previous time step; the context vector $ctx_t$ is obtained as:

$$ctx_t = \sum_{j} \beta_{t,j} f'_{v,j}$$
where $\beta_{t,j}$ denotes the weight of the $j$-th feature vector in the feature sequence, obtained by applying fully connected layers to the feature sequence $f'_v$ and the vector $h_{t-1}$ used to decode the previous character:

$$\beta_{t,j} = \frac{\exp(e_{t,j})}{\sum_{j'} \exp(e_{t,j'})}$$

$$e_{t,j} = W^{T}\tanh(Q f'_{v,j} + V h_{t-1} + b)$$
where $W$, $Q$, $V$ and $b$ are trainable weights; the character corresponding to the highest-confidence value in $y_t$ is selected as the currently decoded output character $c_t$;
S4.3, training parameter setting: the training data are fed into the network, which traverses the training data set 10 times in total; the optimizer is an adaptive gradient descent method with an initial learning rate of 1; after the network has traversed the training set 5 times, the learning rate is manually reduced to 0.1 and training continues until the 10 traversals are complete;
the Loss value Loss of the network output is calculated by the following formula:
$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j}\log p\big(c_{i,j} \mid I^{(i)}; \theta\big)$$

where $N$ denotes the amount of data used in the current batch optimization and $p(c_{i,j} \mid I^{(i)}; \theta)$ denotes the probability of outputting character $c_{i,j}$ for the $i$-th sample picture at the $j$-th time step;
S4.4, weight initialization: at the start of training, the weight parameters of all networks are randomly initialized with Gaussian noise;
S4.5, training the convolutional neural network: the probability that each character of the target character string is output at its corresponding time step is used to compute the cross entropy, which is minimized by gradient descent.
2. The natural scene text recognition method based on a cross-attention mechanism of claim 1, wherein: the text of the training set in S1 covers a plurality of fonts and a plurality of backgrounds.
3. The method for recognizing natural scene text based on a cross attention mechanism according to claim 1, wherein in S2 the training sample picture is stretched such that the height becomes 32 pixels, the width is scaled according to the original aspect ratio, and any remaining width is filled with black borders.
4. The natural scene text recognition method based on the cross-attention mechanism as recited in claim 1, wherein S3 includes the following:
S3.1, synthesizing line-of-text pictures using publicly available code and a text corpus;
S3.2, storing the text content of each text picture in a corresponding text file;
S3.3, randomly dividing the synthesized line-of-text pictures into a training set and a validation set.
5. The natural scene text recognition method based on the cross-attention mechanism of claim 1, wherein S5 comprises:
S5.1, inputting the pictures and labels of the test set into the trained network for a recognition test;
S5.2, after recognition is completed, the program computes the recognition accuracy;
S5.3, randomly displaying the visualized recognition process of 20 photos, where the character features of each photo are selected crosswise by the horizontal and vertical attention networks.
CN201910517855.4A 2019-06-14 2019-06-14 Natural scene text recognition method based on cross attention mechanism Active CN110414498B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910517855.4A CN110414498B (en) 2019-06-14 2019-06-14 Natural scene text recognition method based on cross attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910517855.4A CN110414498B (en) 2019-06-14 2019-06-14 Natural scene text recognition method based on cross attention mechanism

Publications (2)

Publication Number Publication Date
CN110414498A CN110414498A (en) 2019-11-05
CN110414498B true CN110414498B (en) 2023-07-11

Family

ID=68359132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910517855.4A Active CN110414498B (en) 2019-06-14 2019-06-14 Natural scene text recognition method based on cross attention mechanism

Country Status (1)

Country Link
CN (1) CN110414498B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027553A (en) * 2019-12-23 2020-04-17 武汉唯理科技有限公司 Character recognition method for circular seal
CN111160341B (en) * 2019-12-27 2023-04-07 华南理工大学 Scene Chinese text recognition method based on double-attention-machine mechanism
CN111401373B (en) * 2020-03-04 2022-02-15 武汉大学 Efficient semantic segmentation method based on packet asymmetric convolution
CN111652130B (en) * 2020-06-02 2023-09-15 上海语识信息技术有限公司 Method for identifying number, symbol and letter group of non-specific font
CN111899292A (en) * 2020-06-15 2020-11-06 北京三快在线科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN112115834A (en) * 2020-09-11 2020-12-22 昆明理工大学 Standard certificate photo detection method based on small sample matching network
CN112101355B (en) * 2020-09-25 2024-04-02 北京百度网讯科技有限公司 Method and device for detecting text in image, electronic equipment and computer medium
CN112149644A (en) * 2020-11-09 2020-12-29 西北工业大学 Two-dimensional attention mechanism text recognition method based on global feature guidance
CN113065432A (en) * 2021-03-23 2021-07-02 内蒙古工业大学 Handwritten Mongolian recognition method based on data enhancement and ECA-Net
CN113283336A (en) * 2021-05-21 2021-08-20 湖南大学 Text recognition method and system
CN113705713B (en) * 2021-09-03 2023-08-22 华南理工大学 Text recognition method based on global and local attention mechanisms
CN114399757A (en) * 2022-01-13 2022-04-26 福州大学 Natural scene text recognition method and system for multi-path parallel position correlation network
CN115187996B (en) * 2022-09-09 2023-01-06 中电科新型智慧城市研究院有限公司 Semantic recognition method and device, terminal equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111356997B (en) * 2017-08-03 2024-04-09 皇家飞利浦有限公司 Hierarchical neural network with granular attention
CN108829801B (en) * 2018-06-06 2020-11-20 大连理工大学 Event trigger word extraction method based on document level attention mechanism
CN111368565B (en) * 2018-09-05 2022-03-18 腾讯科技(深圳)有限公司 Text translation method, text translation device, storage medium and computer equipment
CN109492679A (en) * 2018-10-24 2019-03-19 杭州电子科技大学 Based on attention mechanism and the character recognition method for being coupled chronological classification loss
CN109753954A (en) * 2018-11-14 2019-05-14 安徽艾睿思智能科技有限公司 The real-time positioning identifying method of text based on deep learning attention mechanism
CN109543681A (en) * 2018-11-20 2019-03-29 中国石油大学(华东) Character recognition method under a kind of natural scene based on attention mechanism
CN109710919A (en) * 2018-11-27 2019-05-03 杭州电子科技大学 A kind of neural network event extraction method merging attention mechanism

Also Published As

Publication number Publication date
CN110414498A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN110414498B (en) Natural scene text recognition method based on cross attention mechanism
CN110378334B (en) Natural scene text recognition method based on two-dimensional feature attention mechanism
CN109977918B (en) Target detection positioning optimization method based on unsupervised domain adaptation
US11164059B2 (en) Two-dimensional code image generation method and apparatus, storage medium and electronic device
CN107368831B (en) English words and digit recognition method in a kind of natural scene image
CN107818314B (en) Face image processing method, device and server
CN112215280B (en) Small sample image classification method based on meta-backbone network
CN108764195A (en) Handwriting model training method, hand-written character recognizing method, device, equipment and medium
CN110533737A (en) The method generated based on structure guidance Chinese character style
CN112784763B (en) Expression recognition method and system based on local and overall feature adaptive fusion
CN105205448A (en) Character recognition model training method based on deep learning and recognition method thereof
CN110287777B (en) Golden monkey body segmentation algorithm in natural scene
CN111428727B (en) Natural scene text recognition method based on sequence transformation correction and attention mechanism
CN109086653A (en) Handwriting model training method, hand-written character recognizing method, device, equipment and medium
CN109359608A (en) A kind of face identification method based on deep learning model
RU2665273C2 (en) Trained visual markers and the method of their production
CN107832292A (en) A kind of conversion method based on the image of neural network model to Chinese ancient poetry
CN110363770A (en) A kind of training method and device of the infrared semantic segmentation model of margin guide formula
CN109840512A (en) A kind of Facial action unit recognition methods and identification device
CN109753897A (en) Based on memory unit reinforcing-time-series dynamics study Activity recognition method
CN107038419A (en) A kind of personage's behavior method for recognizing semantics based on video sequence deep learning
CN116597136A (en) Semi-supervised remote sensing image semantic segmentation method and system
CN109508640A (en) A kind of crowd's sentiment analysis method, apparatus and storage medium
Salem et al. Semantic image inpainting using self-learning encoder-decoder and adversarial loss
CN108985442A (en) Handwriting model training method, hand-written character recognizing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Luo Canjie

Inventor after: Jin Lianwen

Inventor after: Huang Yunlong

Inventor after: Lin Qingxiang

Inventor after: Zhou Weiying

Inventor before: Huang Yunlong

Inventor before: Jin Lianwen

Inventor before: Luo Canjie

Inventor before: Lin Qingxiang

Inventor before: Zhou Weiying

GR01 Patent grant