CN113807340B - Attention mechanism-based irregular natural scene text recognition method - Google Patents


Info

Publication number
CN113807340B
CN113807340B (application CN202111043808.4A)
Authority
CN
China
Prior art keywords
attention
visual
feature
layer
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111043808.4A
Other languages
Chinese (zh)
Other versions
CN113807340A (en)
Inventor
孙亚杰
曹小玲
孙莹莹
董方怡
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202111043808.4A priority Critical patent/CN113807340B/en
Publication of CN113807340A publication Critical patent/CN113807340A/en
Application granted granted Critical
Publication of CN113807340B publication Critical patent/CN113807340B/en
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an irregular natural scene text recognition method based on an attention mechanism. A natural scene text image correction module locates the shape of the text region and rectifies it into a regular text image; a feature extraction module extracts visual feature maps at different scales; an attention alignment module aligns the multi-scale visual feature maps with a fully convolutional neural network to obtain visual attention feature maps; a context feature space module applies context selection to the visual attention feature map to obtain a context feature map of the image, then concatenates the visual attention feature map with the context feature map to form a new feature space; and an attention-based sequence recognition module decodes this context feature space with an LSTM attention decoder to produce the recognition result. The method improves recognition of irregular scene text, keeps recognition accuracy unaffected by nearby text and background noise, and broadens the application scenarios of character recognition.

Description

Attention mechanism-based irregular natural scene text recognition method
Technical Field
The invention relates to a natural scene text recognition method, and in particular to an irregular natural scene text recognition method based on an attention mechanism, belonging to the technical fields of pattern recognition and artificial intelligence.
Background
With the development of information technology, artificial intelligence has become a major research focus, and natural scene text recognition, as one component of it, has attracted wide attention from researchers. Thanks to the rapid development of deep learning, many natural scene text recognition techniques have achieved significant results, particularly for regular scene text. However, scene text images are often affected by shooting conditions, yielding images of uneven quality (curved text, perspective distortion, noise, etc.) that reduce recognition accuracy. To address irregular scene text recognition, several research teams have in recent years proposed rectifying the original text image into an image with regular text using a text correction model. The rectified image, however, tends to introduce new noise that interferes with recognition accuracy. In addition, attention-based methods have had a significant impact on natural scene text recognition, but most of them suffer from alignment problems caused by the repeated use of historical decoding information.
Existing irregular scene text recognition techniques do not solve the problems of newly introduced noise interference and attention misalignment.
disclosure of Invention
To address these shortcomings, the invention provides an irregular scene text recognition method based on an attention mechanism, which aims to solve the problems identified in the background above.
The invention is realized by the following technical scheme:
an irregular natural scene text recognition method based on an attention mechanism is characterized by comprising the following steps of:
(1) Positioning the shape of the text region using a natural scene text image correction module, and rectifying the irregular natural scene text image into a regular text image;
(2) Introducing a space-channel mixed attention mechanism into ResNet to construct a feature extraction module, and extracting visual feature maps of different scales with the feature extraction module;
(3) Aligning the visual feature maps of different scales using a fully convolutional neural network to obtain a visual attention map; multiplying the visual feature map with the visual attention map to obtain a visual attention feature map;
(4) Performing context selection on the obtained visual attention feature map through a two-layer BiLSTM to obtain a context feature map of the image, and then concatenating the visual attention feature map with the context feature map to obtain a new feature space D, which contains both the visual features and the context features of the image;
(5) Decoding the feature space D with an LSTM attention decoder to obtain the recognition result.
Optionally, the specific process of step (1) is as follows:
(11) Constructing a positioning network to obtain the shape of the text region and locate fiducial points C on the upper and lower edges; the positioning network comprises 4 convolutional layers followed by 1 batch normalization layer and 2 max pooling layers, and uses the ReLU activation function;
(12) Computing the TPS transformation parameters from the fiducial points C in a grid generator to obtain a sampling grid over the text image;
(13) Inputting the sampling grid and the original image into a sampler, and sampling the grid points on the original image to obtain the rectified image.
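The sampler stage can be illustrated with a minimal NumPy sketch of bilinear grid sampling; the grid generation itself (solving the TPS system from the fiducial points C) is omitted, and the function and array names are illustrative, not taken from the patent:

```python
import numpy as np

def bilinear_sample(image, grid):
    """image: (H, W) array; grid: (H_out, W_out, 2) of (x, y) source coords."""
    H, W = image.shape
    x = np.clip(grid[..., 0], 0, W - 1)
    y = np.clip(grid[..., 1], 0, H - 1)
    x0 = np.floor(x).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(y).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    # interpolate along x on the two neighbouring rows, then along y
    top = image[y0, x0] * (1 - wx) + image[y0, x1] * wx
    bot = image[y1, x0] * (1 - wx) + image[y1, x1] * wx
    return top * (1 - wy) + bot * wy

# Sanity check: sampling with the identity grid reproduces the image.
img = np.arange(12, dtype=float).reshape(3, 4)
xs, ys = np.meshgrid(np.arange(4.0), np.arange(3.0))
ident = np.stack([xs, ys], axis=-1)
rectified = bilinear_sample(img, ident)
```

A TPS grid would simply replace `ident` with the warped source coordinates computed from the fiducial points.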
Optionally, the positioning network, grid generator and sampler are all differentiable, so the natural scene text image correction module can update its network parameters through backpropagation.
Optionally, the specific process of step (2) is as follows:
(21) Extracting the channel attention map M_c based on a channel attention mechanism; the channel attention mechanism comprises 1 max pooling layer, 1 average pooling layer and a multi-layer perceptron, with a sigmoid activation function. The intermediate feature map F is used as the input of the max pooling layer and the average pooling layer respectively, the outputs of the two pooling layers are each forwarded to the multi-layer perceptron, and finally the channel attention map M_c is extracted:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))   (1)
where F is the intermediate feature map; AvgPool is average pooling; MaxPool is max pooling; MLP is the multi-layer perceptron; σ is the sigmoid activation function;
(22) Multiplying the channel attention map obtained in step (21) with the intermediate feature map to obtain F':
F' = M_c(F) ⊗ F   (2)
(23) Obtaining the spatial attention map M_s based on a spatial attention mechanism; the spatial attention mechanism comprises 1 max pooling layer, 1 average pooling layer and 1 convolutional layer. With F' obtained in step (22) as input, max-pooled and average-pooled features are computed and integrated through the convolutional layer, and finally the spatial attention map M_s is obtained:
M_s(F') = σ(f^{7×7}([AvgPool(F'); MaxPool(F')]))   (3)
where f^{7×7} is a convolution operation with a 7×7 filter; σ is the ReLU activation function;
(24) Multiplying the output of step (23) with F' to obtain F'':
F'' = M_s(F') ⊗ F'   (4)
(25) Adding the input x of the overall space-channel mixed attention mechanism to F'' and applying the ReLU activation to obtain the output visual feature map F_v:
F_v = σ(F'' + x)   (5)
where σ is the ReLU activation function.
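Equations (1)-(5) describe a channel-spatial attention block in the style of CBAM; under that reading, a minimal NumPy sketch looks like the following, where the MLP weights `W1`, `W2` and the 7×7 kernel are random stand-ins, not the trained parameters of the patent's network:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: np.maximum(z, 0.0)

def channel_attention(F, W1, W2):
    # Eq. (1): M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
    avg = F.mean(axis=(1, 2)); mx = F.max(axis=(1, 2))   # pool over H, W
    mlp = lambda v: W2 @ relu(W1 @ v)
    return sigmoid(mlp(avg) + mlp(mx))                   # shape (C,)

def spatial_attention(Fp, kernel):
    # Eq. (3): pool along channels, 7x7 conv (zero padding), then activation
    pooled = np.stack([Fp.mean(axis=0), Fp.max(axis=0)])  # (2, H, W)
    pad = np.pad(pooled, ((0, 0), (3, 3), (3, 3)))
    H, W = Fp.shape[1:]
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = (kernel * pad[:, i:i + 7, j:j + 7]).sum()
    return sigmoid(out)                                   # shape (H, W)

C, H, W = 8, 5, 5
x = rng.standard_normal((C, H, W))        # input of the attention block
F = x                                     # intermediate feature map
Mc = channel_attention(F, rng.standard_normal((4, C)), rng.standard_normal((C, 4)))
Fp = Mc[:, None, None] * F                # Eq. (2): F' = M_c(F) * F
Ms = spatial_attention(Fp, rng.standard_normal((2, 7, 7)))
Fpp = Ms[None] * Fp                       # Eq. (4): F'' = M_s(F') * F'
Fv = relu(Fpp + x)                        # Eq. (5): residual connection + ReLU
```

Note the spatial gate here uses sigmoid; the patent's text names ReLU as σ in equation (3), so the exact activation is an interpretation.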
Optionally, the specific process of step (3) is as follows:
the method comprises the steps of coding feature graphs with different sizes by utilizing a downsampling method in a convolution process, wherein the convolution process comprises convolution layers with the same layer number and deconvolution layers, the sizes of output of each layer of convolution layers are different, and the output of each layer of deconvolution layer is added with the output of the convolution layer with the corresponding size to be used as the input of the next deconvolution layer; finally, activating the Relu function to obtain a visual attention diagram; f (F) v Representing visual feature map, A att Representing a visual attention map obtained by attention alignment, a visual attention profile V is obtained by the following formula:
Optionally, for step (4), a two-layer BiLSTM is applied on the visual feature map to output the context feature map H, and H is combined with the visual attention feature map V to obtain the new feature space D = (V, H);
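The construction of the feature space D can be sketched as a concatenation along the feature dimension; the sequence length and feature widths below are assumed values, not taken from the patent:

```python
import numpy as np

T, dv, dh = 25, 512, 256                # sequence length, feature widths (assumed)
V = np.zeros((T, dv))                   # visual attention features per time step
Hctx = np.zeros((T, dh))                # BiLSTM context features per time step
D = np.concatenate([V, Hctx], axis=1)   # new feature space D = (V, H)
```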
Optionally, step (5) is implemented as follows:
The predicted output of the decoder at time t is y_t:
y_t = softmax(W_o h_t + b_o)   (7)
where W_o and b_o are learnable parameters; h_t is the hidden state of the LSTM at time t; and softmax is the normalized exponential function. h_t is computed as:
h_t = LSTM(y_{t-1}, c_t, h_{t-1})   (8)
where y_{t-1} is the prediction at time t-1; c_t is the semantic vector; h_{t-1} is the hidden state of the LSTM at time t-1; and LSTM is the long short-term memory network.
The final loss function Loss is computed as:
Loss = -Σ_i log P(Y_i | X_i)   (9)
where X_i is a training image and Y_i the predicted label.
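Equation (7) in isolation is a softmax over a linear projection of the hidden state; a minimal NumPy sketch, with illustrative dimensions, random weights, and the LSTM recurrence of equation (8) stubbed out, is:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(2)
num_classes, hidden = 37, 256          # e.g. 26 letters + 10 digits + EOS (assumed)
W_o = rng.standard_normal((num_classes, hidden))
b_o = rng.standard_normal(num_classes)
h_t = rng.standard_normal(hidden)      # would come from LSTM(y_{t-1}, c_t, h_{t-1})
y_t = softmax(W_o @ h_t + b_o)         # Eq. (7): distribution over characters
```

The predicted character at step t is then `y_t.argmax()`, which is fed back as y_{t-1} at the next step.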
and constructing a deep convolution network model according to the content, and sending the training set into the network model for training until the network model reaches convergence.
Optionally, the training of the deep convolutional network model is set as follows:
the epoch of the deep convolution network model is 10;
the optimizer of the deep convolution network model is Adadelta;
the learning rate of the deep convolution network model is 0.1;
the number of pictures read in each batch of the depth convolution network model is 64;
the parameter initialization mode of the deep convolution network model is Kaiming initialization.
The beneficial effects brought by adopting the technical scheme are as follows:
(1) A text image correction module is introduced to improve the recognition effect of the irregular scene text;
(2) Introducing a channel-space attention mechanism so that recognition accuracy is not affected by nearby text and background noise;
drawings
FIG. 1 is a network block diagram of irregular natural scene text recognition based on an attention mechanism;
FIG. 2 is a flow chart of an irregular natural scene text recognition method based on an attention mechanism of the present invention;
fig. 3 is a network configuration diagram of the feature extractor.
Detailed Description
The following describes embodiments of the invention with reference to specific examples; other advantages and effects of the invention will be readily apparent to those skilled in the art from this disclosure. The invention may also be practiced or applied through other, different embodiments, and the details of this description may be modified or varied without departing from the spirit and scope of the invention. It should be noted that the illustrations in the following embodiments only convey the basic idea of the invention schematically, and the embodiments and their features may be combined with one another provided there is no conflict.
The drawings are for illustration only, are schematic rather than physical, and are not intended to limit the invention; to better illustrate the embodiments, some elements of the drawings may be omitted, enlarged or reduced, and do not reflect the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The invention provides an irregular natural scene text recognition method based on an attention mechanism, the network structure of which is shown in FIG. 1; it comprises a natural scene text image correction module, a feature extraction module, an attention alignment module and a text decoding module;
the natural scene text image correction module positions the shape of the text region and corrects the irregular natural scene text image into a regular text image;
the feature extractor extracts visual feature graphs with different scales;
the attention mechanism alignment model uses a full convolution neural network to align the visual feature images with different scales to obtain an attention force map;
the text decoding module decodes the visual feature map and the attention map simultaneously using an LSTM attention decoder to obtain a recognition result.
As shown in FIG. 2, the attention-based irregular natural scene text recognition method comprises the following steps:
step one: a dataset is prepared, and the dataset is divided into a training dataset and a test dataset.
For training, the invention uses the synthetic dataset SynthText to train the network. Network performance is evaluated on standard public test sets, including the regular-text datasets IIIT5K, ICDAR2003 and ICDAR2013 and the irregular-text datasets SVT-Perspective, CUTE80 and ICDAR2015.
Step two: first, the irregular text image I is rectified into an image I' with regular text by the natural scene text image correction module, as follows: the image is fed into the positioning network, the text region is detected, and a set of fiducial points C on the upper and lower edges of the text is obtained; the grid generator then computes the TPS transformation parameters from the fiducial points C, and a sampling grid P = {p_i} on image I is obtained from the TPS transformation. Finally, the sampler generates the rectified image I' by bilinear interpolation at the grid points.
Step three: the rectified image I' is fed into the feature extractor to extract visual feature maps F_v of different sizes. The network structure of the feature extractor is shown in FIG. 3: the network consists of a stem convolutional layer and convolution blocks containing 3, 4, 6 and 3 convolutional layers respectively, and the input of each convolution block, after ReLU activation, is combined with a channel-spatial attention module to obtain the output sequence, so that the channel information and spatial information of the feature maps at different stages are well integrated.
Step four: the visual feature maps are fed into the attention alignment module; the feature maps of different sizes are encoded by convolution blocks, the features of different sizes are then added, via deconvolution blocks, to the matching-size features output by the convolution stage, and the attention map is obtained through ReLU activation. The attention map is multiplied with the visual feature map to obtain the visual attention feature map.
Step five: the visual feature map passes through a context selector, consisting of two BiLSTMs, to extract context information, which is then concatenated with the visual attention feature map to obtain the new feature space D.
Step six: the feature space D is input to the text decoder, which decodes each character in turn.
Given an input scene text image, the attention-based irregular natural scene text recognition model accurately recognizes the characters in the text image.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (4)

1. An irregular natural scene text recognition method based on an attention mechanism is characterized by comprising the following steps of:
(1) Positioning the shape of the text region by using a natural scene text image correction module, and correcting an irregular natural scene text image into a regular text image;
(2) Introducing a space-channel mixed attention mechanism into ResNet to construct a feature extraction module, and using the feature extraction module to extract visual feature maps of different scales;
(3) Using a full convolution neural network to align visual feature graphs of different scales to obtain a visual attention map; multiplying the visual feature map and the visual attention map to obtain a visual attention feature map;
(4) Selecting the obtained visual attention feature map through a double-layer BiLSTM context to obtain a context feature map of the image, and then connecting the visual attention feature map with the context feature map to obtain a new feature space D, wherein the feature space D comprises visual features and context features of the image;
(5) Decoding the feature space D by using an LSTM attention decoder to obtain a recognition result;
the specific process of the step (1) is as follows:
(11) Constructing a positioning network to obtain the shape of the text region and locate fiducial points C on the upper and lower edges; the positioning network comprises 4 convolutional layers, which are followed by 1 batch normalization layer and 2 max pooling layers; the positioning network uses the ReLU activation function;
(12) Calculating TPS transformation parameters by using the datum point C at a grid generator to obtain a sampling grid on a text image;
(13) Inputting the sampling grid and the original image into a sampler, and sampling the grid points on the original image to obtain a corrected image;
the positioning network, the grid generator and the sampler are all micro, and the natural scene text image correction module updates network parameters by following back propagation;
for step (4), a two-layer BiLSTM is applied on the visual feature map to output the context feature map H, and H is combined with the visual attention feature map V to obtain the new feature space D = (V, H);
the step (5) is specifically implemented as follows:
the predicted output of the decoder at time t is y t
y t =softmax(W o h t +b o ) (7)
Wherein W is o And b o To learn parameters, h t Represents the hidden state of LSTM at time t; softmax is a normalized exponential function;
h t the calculation mode of (a) is expressed as follows:
h t =LSTM(y t-1 ,c t ,h t-1 ) (8)
Wherein y is t-1 Representing a prediction of time t-1; c t Representing a semantic vector; h is a t-1 Represents the hidden state of LSTM at time t-1; LSTM is long-term memory network;
the final Loss function Loss is calculated as follows:
wherein X is i Representing a training picture; y is Y i Representing a predictive label;
and constructing a deep convolution network model according to the content, and sending the training set into the network model for training until the network model reaches convergence.
2. The method for recognizing irregular natural scene text based on an attention mechanism according to claim 1, wherein the specific process of the step (2) is as follows:
(21) Extracting the channel attention map M_c based on a channel attention mechanism; the channel attention mechanism comprises 1 max pooling layer, 1 average pooling layer and a multi-layer perceptron, with a sigmoid activation function; the intermediate feature map F is used as the input of the max pooling layer and the average pooling layer respectively, the outputs of the two pooling layers are each forwarded to the multi-layer perceptron, and finally the channel attention map M_c is extracted:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))   (1)
where F is the intermediate feature map; AvgPool is average pooling; MaxPool is max pooling; MLP is the multi-layer perceptron; σ is the sigmoid activation function;
(22) Multiplying the channel attention map obtained in step (21) with the intermediate feature map to obtain F':
F' = M_c(F) ⊗ F   (2)
(23) Obtaining the spatial attention map M_s based on a spatial attention mechanism; the spatial attention mechanism comprises 1 max pooling layer, 1 average pooling layer and 1 convolutional layer; with F' obtained in step (22) as input, max-pooled and average-pooled features are computed and integrated through the convolutional layer, and finally the spatial attention map M_s is obtained:
M_s(F') = σ(f^{7×7}([AvgPool(F'); MaxPool(F')]))   (3)
where f^{7×7} is a convolution operation with a 7×7 filter; σ is the ReLU activation function;
(24) Multiplying the output of step (23) with F' to obtain F'':
F'' = M_s(F') ⊗ F'   (4)
(25) Adding the input x of the overall space-channel mixed attention mechanism to F'' and applying the ReLU activation to obtain the output visual feature map F_v:
F_v = σ(F'' + x)   (5)
where σ is the ReLU activation function.
3. The method for recognizing irregular natural scene text based on an attention mechanism according to claim 1, wherein the specific process of the step (3) is as follows:
the method comprises the steps of coding feature graphs with different sizes by utilizing a downsampling method in a convolution process, wherein the convolution process comprises convolution layers with the same layer number and deconvolution layers, the sizes of output of each layer of convolution layers are different, and the output of each layer of deconvolution layer is added with the output of the convolution layer with the corresponding size to be used as the input of the next deconvolution layer; finally, activating the Relu function to obtain a visual attention diagram; f (F) v Representing visual feature map, A att Representing a visual attention map obtained by attention alignment, a visual attention profile V is obtained by the following formula:
4. The method for recognizing irregular natural scene text based on an attention mechanism according to claim 1, wherein the training of the deep convolutional network model is set as follows:
the number of epochs of the deep convolutional network model is 10;
the optimizer of the deep convolutional network model is Adadelta;
the learning rate of the deep convolutional network model is 0.1;
the number of pictures read in each batch by the deep convolutional network model is 64;
the parameters of the deep convolutional network model are initialized with Kaiming initialization.
CN202111043808.4A 2021-09-07 2021-09-07 Attention mechanism-based irregular natural scene text recognition method Active CN113807340B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111043808.4A CN113807340B (en) 2021-09-07 2021-09-07 Attention mechanism-based irregular natural scene text recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111043808.4A CN113807340B (en) 2021-09-07 2021-09-07 Attention mechanism-based irregular natural scene text recognition method

Publications (2)

Publication Number Publication Date
CN113807340A CN113807340A (en) 2021-12-17
CN113807340B (en) 2024-03-15

Family

ID=78940697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111043808.4A Active CN113807340B (en) 2021-09-07 2021-09-07 Attention mechanism-based irregular natural scene text recognition method

Country Status (1)

Country Link
CN (1) CN113807340B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241467A (en) * 2021-12-21 2022-03-25 北京有竹居网络技术有限公司 Text recognition method and related equipment thereof
CN114937277B (en) * 2022-05-18 2023-04-11 北京百度网讯科技有限公司 Image-based text acquisition method and device, electronic equipment and storage medium
CN114863407B (en) * 2022-07-06 2022-10-04 宏龙科技(杭州)有限公司 Multi-task cold start target detection method based on visual language deep fusion

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165697A (en) * 2018-10-12 2019-01-08 福州大学 A kind of natural scene character detecting method based on attention mechanism convolutional neural networks
CN109543667A (en) * 2018-11-14 2019-03-29 北京工业大学 A kind of text recognition method based on attention mechanism
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN110378334A (en) * 2019-06-14 2019-10-25 华南理工大学 A kind of natural scene text recognition method based on two dimensional character attention mechanism
CN110427938A (en) * 2019-07-26 2019-11-08 中科视语(北京)科技有限公司 A kind of irregular character recognition device and method based on deep learning
CN111967470A (en) * 2020-08-20 2020-11-20 华南理工大学 Text recognition method and system based on decoupling attention mechanism
CN111985369A (en) * 2020-08-07 2020-11-24 西北工业大学 Course field multi-modal document classification method based on cross-modal attention convolution neural network
CN112215236A (en) * 2020-10-21 2021-01-12 科大讯飞股份有限公司 Text recognition method and device, electronic equipment and storage medium
CN112329760A (en) * 2020-11-17 2021-02-05 内蒙古工业大学 Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
WO2021115490A1 (en) * 2020-06-22 2021-06-17 平安科技(深圳)有限公司 Seal character detection and recognition method, device, and medium for complex environments
CN113011304A (en) * 2021-03-12 2021-06-22 山东大学 Human body posture estimation method and system based on attention multi-resolution network
CN113343707A (en) * 2021-06-04 2021-09-03 北京邮电大学 Scene text recognition method based on robustness characterization learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3598339A1 (en) * 2018-07-19 2020-01-22 Tata Consultancy Services Limited Systems and methods for end-to-end handwritten text recognition using neural networks


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Representative Batch Normalization for Scene Text Recognition; Yajie Sun et al.; KSII Transactions on Internet and Information Systems; Vol. 16, No. 07; 2390-2406 *
What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis; Jeonghun Baek et al.; arXiv:1904.01906v4; 20191218; 1-19 *
Scene Chinese Text Recognition Based on a Dual Attention Mechanism; Chen Xuanying; China Master's Theses Full-text Database, Information Science and Technology; 20210215; I138-1782 *
Zhu Qingtang et al. Biofabrication and Clinical Evaluation of Repair Materials for Peripheral Nerve Defects. Sun Yat-sen University Press, 2018, pp. 136-139. *

Also Published As

Publication number Publication date
CN113807340A (en) 2021-12-17

Similar Documents

Publication Publication Date Title
CN113807340B (en) Attention mechanism-based irregular natural scene text recognition method
CN109948691B (en) Image description generation method and device based on depth residual error network and attention
CN110414498B (en) Natural scene text recognition method based on cross attention mechanism
CN111160533A (en) Neural network acceleration method based on cross-resolution knowledge distillation
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN114782694B (en) Unsupervised anomaly detection method, system, device and storage medium
CN111428727B (en) Natural scene text recognition method based on sequence transformation correction and attention mechanism
CN112070114B (en) Scene character recognition method and system based on Gaussian constraint attention mechanism network
CN112800876A (en) Method and system for embedding hypersphere features for re-identification
CN111598842A (en) Method and system for generating model of insulator defect sample and storage medium
CN111368773A (en) Mathematical formula identification method and device, terminal equipment and readable storage medium
CN115393396B (en) Unmanned aerial vehicle target tracking method based on mask pre-training
CN113160246A (en) Image semantic segmentation method based on depth supervision
CN115620010A (en) Semantic segmentation method for RGB-T bimodal feature fusion
CN113591978A (en) Image classification method, device and storage medium based on confidence penalty regularization self-knowledge distillation
CN116228792A (en) Medical image segmentation method, system and electronic device
CN114565628B (en) Image segmentation method and system based on boundary perception attention
CN116188509A (en) High-efficiency three-dimensional image segmentation method
CN116363149A (en) Medical image segmentation method based on U-Net improvement
CN115240259A (en) Face detection method and face detection system based on YOLO deep network in classroom environment
CN114581918A (en) Text recognition model training method and device
CN111310820A (en) Foundation meteorological cloud chart classification method based on cross validation depth CNN feature integration
CN111814693A (en) Marine ship identification method based on deep learning
CN116416649A (en) Video pedestrian re-identification method based on multi-scale resolution alignment
CN113256528B (en) Low-illumination video enhancement method based on multi-scale cascade depth residual error network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant