CN113159023A - Scene text recognition method based on explicit supervision mechanism - Google Patents


Info

Publication number
CN113159023A
CN113159023A (application CN202110273068.7A)
Authority
CN
China
Prior art keywords
attention
feature
module
character
channel
Prior art date
Legal status
Pending
Application number
CN202110273068.7A
Other languages
Chinese (zh)
Inventor
王鹏 (Wang Peng)
郑财源 (Zheng Caiyuan)
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202110273068.7A
Publication of CN113159023A

Classifications

    • G06V 20/62: Text, e.g. of license plates, overlay texts or captions on TV images (G Physics > G06 Computing; Calculating or Counting > G06V Image or video recognition or understanding > G06V 20/00 Scenes; scene-specific elements > G06V 20/60 Type of objects)
    • G06N 3/045: Combinations of networks (G06N Computing arrangements based on specific computational models > G06N 3/00 Computing arrangements based on biological models > G06N 3/02 Neural networks > G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/047: Probabilistic or stochastic networks (same branch)
    • G06N 3/048: Activation functions (same branch)
    • G06N 3/084: Backpropagation, e.g. using gradient descent (G06N 3/08 Learning methods)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a scene text recognition method based on an explicit supervision mechanism, and belongs to the field of scene text recognition. In the first part, the feature extraction stage of a ResNet convolutional neural network equipped with attention mechanisms extracts and encodes features from the text image, producing a feature map and a global representation. In the second part, the relationships between characters are modeled by combining the previously predicted characters, the position information, and the global representation; attention weights are then generated over the feature map, the weights are multiplied with the feature map to obtain the features of a single character, and these features are fed into a feedforward neural network to predict the character. The process then moves to the prediction of the next character, and so on, until the end-of-recognition token is produced. Because the method automatically localizes the relevant region at each prediction step, it improves the recognition result and addresses the poor recognition of curved or slanted text.

Description

Scene text recognition method based on explicit supervision mechanism
Technical Field
The invention belongs to the field of scene text recognition, and specifically provides a text image recognition method and system adopting an encoder-decoder structure with an explicitly supervised attention mechanism. The system uses a ResNet34 convolutional neural network with spatial attention and channel attention mechanisms to extract text image features, and uses a Transformer structure based on the self-attention mechanism for decoding and recognition.
Background
Scene text recognition is an important challenge in the field of computer vision; its task is to automatically detect and recognize text in natural images. Text serves as a physical carrier of language that can be used to store and transmit information, and with the help of text detection and recognition technology, important semantic information in visual images can be decoded. Because of its enormous application value, scene text recognition has attracted extensive research and exploration in industry and academia in recent years. At present, however, the texts that are recognized well are mostly horizontal texts with simple backgrounds. In real scenes, factors such as illumination, occlusion, imaging devices and shooting angles, together with properties of the text itself such as curvature, slant, and artistic fonts, mean that scene text recognition, and especially irregular scene text recognition, still faces major bottlenecks.
To address irregular text recognition, existing scene text recognition techniques (such as MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition, and ASTER: An Attentional Scene Text Recognizer with Flexible Rectification) adopt an attention-based decoder in the decoding stage, so that the character regions in the picture can be attended to automatically. Such methods handle irregular text recognition reasonably well, but because scene pictures are often too noisy, problems such as "attention drift" frequently occur, which lowers text recognition accuracy.
Disclosure of Invention
Technical problem to be solved
To solve the problem of low text recognition accuracy caused by the "attention drift" of attention-based decoders in the prior art, the invention provides a scene text recognition method based on an explicitly supervised attention mechanism, enabling scene text recognition that accounts for curvature and slant.
Technical scheme
A scene text recognition method based on an explicit supervision mechanism is characterized by comprising the following steps:
Step 1: input the scene text picture into a ResNet34 convolutional neural network to extract a feature map, denoted F, where F ∈ R^(25×8×512); input F into a global feature extraction layer of six Bottleneck blocks to obtain a global feature representation G, where G ∈ R^(1×1×512); meanwhile, apply a 1×1 convolution to F to obtain the finally extracted feature F', where F' ∈ R^(25×8×1024); a channel attention and spatial attention mechanism is added to each block of the four layers of ResNet34;
the spatial attention is explicitly supervised according to the characters' labeled bounding boxes, with the loss computed as follows:
$$\mathcal{L}_{att} = -\sum_{i,j}\Big[\,y_{label}^{ij}\log y_{pred}^{ij} + \big(1-y_{label}^{ij}\big)\log\big(1-y_{pred}^{ij}\big)\Big]$$
where y_pred^ij is the attention weight at point (i, j), and y_label^ij is 1 when the point lies in a character region and 0 otherwise; the supervision signal is added only to the last block of each layer;
Step 2: at decoding time step t, add the embeddings of the previously predicted characters and the position information to obtain E, where E ∈ R^(t×512), and then concatenate E with the global feature representation G to obtain a vector C, where C ∈ R^(t×1024);
Step 3: input C into a masked self-attention mechanism to model the dependencies between different characters in the output word; the self-attention mechanism is as follows:
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
the attention calculation is divided into three steps: the first step calculates the similarity between the query and each key to obtain the weights; the second step normalizes the weights, typically using a softmax function; the final step computes the weighted sum of the weights and the corresponding values to obtain the final attention output;
connect the encoder and the decoder with a two-dimensional attention module, whose structure is essentially the same as the self-attention module, except that its K and V both come from the F' obtained in the encoding stage while Q is the output of the masked self-attention module; the invention performs explicitly supervised training on the attention weights, with the loss function defined as follows:
$$\mathcal{L}_{att} = -\sum_{i,j}\Big[\,y_{label}^{ij}\log y_{pred}^{ij} + \big(1-y_{label}^{ij}\big)\log\big(1-y_{pred}^{ij}\big)\Big]$$
where y_pred^ij is the attention weight at point (i, j), and y_label^ij is 1 when the point lies in a character region and 0 otherwise;
Step 4: a picture feature vector is obtained after the two-dimensional attention module; this vector is passed through a fully connected layer to obtain a vector whose dimension equals the number of character classes, and an argmax operation on this vector gives the prediction result at the current time step.
The technical scheme of the invention further provides that, in step 1: the channel attention module passes the input feature map through a width-and-height-based global max pooling layer and a global average pooling layer respectively, and then through a shared multilayer perceptron; the features output by the multilayer perceptron are combined by element-wise addition and passed through a sigmoid activation to generate the final channel attention map; this map is multiplied element-wise with the input feature map to generate the input features required by the spatial attention module; the spatial attention module takes the feature map output by the channel attention module as its input feature map, first performs channel-based global max pooling and global average pooling, then concatenates the two results along the channel dimension; a convolution operation then reduces the result to a single channel; a sigmoid generates the spatial attention map; finally, this map is multiplied with the module's input features to obtain the final features.
A computer system, comprising: one or more processors, a computer readable storage medium, for storing one or more programs, which when executed by the one or more processors, cause the one or more processors to implement the above-described method.
A computer-readable storage medium having stored thereon computer-executable instructions for performing the above-described method when executed.
A computer program comprising computer executable instructions which when executed perform the method described above.
Advantageous effects
The scene text recognition method based on the explicit supervision mechanism can recognize curved and slanted scene text pictures: using the two-dimensional supervision mechanism, the information of the picture is converted into an attention weight matrix, and the features of the relevant region are localized automatically at each prediction step, which improves the recognition result and solves the problem of poor recognition under curved or slanted conditions. The explicitly supervised attention mechanism effectively alleviates the attention drift problem, allowing the model to find the key region of each scene text character at every decoding step and, combined with the character features, to better recognize complex scene text pictures. The method also recognizes standard horizontal scene text, so the whole system is highly practical and can handle scene text recognition under curved, slanted, horizontal, and other conditions.
Drawings
FIG. 1 structural diagram of CBAM
FIG. 2 ResNet structure diagram
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the system comprises two parts, wherein the first part is a process for extracting and coding the features of the scene picture based on a space attention and channel attention mechanism, the second part is a decoding process based on a transform of a self-attention mechanism, and the recognition of the scene text is realized through a coding and decoding structure and the attention mechanism. In the first part, a feature extraction part of a ResNet convolutional neural network with an attention mechanism is adopted to perform feature extraction and coding on a text image, and a feature map and a global representation are obtained. In the second part, the relationship modeling between characters is carried out by combining the information of the previous predicted character, the position information and the global representation, then the attention weight is generated according to the feature diagram, the feature of a single character is obtained by multiplying the weight by the feature diagram, the feature is input into a feedforward neural network to obtain the predicted character, and then the predicted character enters the predicted recognition process of the next character, and so on until the recognition ending identifier is obtained. It should be noted that, in order to extract more text information while ignoring the background information of the picture as much as possible in the encoding stage and to focus the model on the corresponding feature map region in the decoding stage, the present invention explicitly supervises the attention mechanism used in the encoder and the decoder respectively according to the frame information of the characters. The method comprises the following steps:
(1) A scene text image is processed by a ResNet34 convolutional neural network to obtain a feature map, denoted F, where F ∈ R^(25×8×512). The invention adds a spatial attention and a channel attention mechanism to each block of the four layers of the ResNet; this combination is known as CBAM (Convolutional Block Attention Module), and its structure is shown in FIG. 1.
The channel attention module passes the input feature map through a width-and-height-based global max pooling layer and a global average pooling layer respectively, and then through a shared multilayer perceptron. The features output by the multilayer perceptron are combined by element-wise addition and passed through a sigmoid activation to generate the final channel attention map. This map is multiplied element-wise with the input feature map to generate the input features required by the spatial attention module.
The spatial attention module takes the feature map output by the channel attention module as its input feature map. First, channel-based global max pooling and global average pooling are performed, and the two results are concatenated along the channel dimension. A convolution operation then reduces the result to a single channel, and a sigmoid generates the spatial attention map. Finally, this map is multiplied with the module's input features to obtain the final features.
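As a minimal PyTorch sketch of the CBAM-style block just described (an illustration, not the patent's exact implementation; the reduction ratio and convolution kernel size are assumptions):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):          # reduction ratio assumed
        super().__init__()
        self.mlp = nn.Sequential(                        # shared multilayer perceptron
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                                # x: (B, C, H, W)
        b, c, _, _ = x.shape
        max_feat = torch.amax(x, dim=(2, 3))             # global max pooling -> (B, C)
        avg_feat = torch.mean(x, dim=(2, 3))             # global average pooling -> (B, C)
        attn = torch.sigmoid(self.mlp(max_feat) + self.mlp(avg_feat))  # element-wise add
        return x * attn.view(b, c, 1, 1)                 # rescale input channels

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):                   # kernel size assumed
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                # x: output of channel attention
        max_map = torch.amax(x, dim=1, keepdim=True)     # channel-wise max -> (B, 1, H, W)
        avg_map = torch.mean(x, dim=1, keepdim=True)     # channel-wise mean -> (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
        return x * attn, attn                            # attn is the supervised weight map
```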
In order to extract as much information about the text as possible, the invention explicitly supervises the spatial attention according to the characters' labeled bounding boxes; the loss is computed as follows:
$$\mathcal{L}_{att} = -\sum_{i,j}\Big[\,y_{label}^{ij}\log y_{pred}^{ij} + \big(1-y_{label}^{ij}\big)\log\big(1-y_{pred}^{ij}\big)\Big]$$
where y_pred^ij is the attention weight at point (i, j), and y_label^ij is 1 when the point lies in a character region and 0 otherwise. The supervision signal is added only to the last block of each layer.
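Under the assumption that this supervision takes the binary cross-entropy form reconstructed above, a sketch of the loss is:

```python
import torch.nn.functional as F

def attention_supervision_loss(attn_map, char_mask):
    """attn_map: (B, 1, H, W) sigmoid attention weights y_pred;
    char_mask: (B, 1, H, W) binary labels y_label (1 inside character boxes)."""
    return F.binary_cross_entropy(attn_map, char_mask.float())
```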
(2) To maintain dimensional consistency in the two-dimensional attention calculation of the decoding stage, the channels of F are changed using a 1×1 convolution to obtain F', where F' ∈ R^(25×8×1024). Meanwhile, F passes through six Bottleneck layers to obtain another feature G called the global representation, where G ∈ R^(1×1×512), i.e., G is a 512-dimensional vector. Inspired by the Transformer, the designed attention-based sequence decoder consists of three layers: a masked self-attention mechanism that models the dependencies between different characters in the output word; a two-dimensional attention module connecting the encoder and the decoder; and a position-wise feed-forward layer applied at each decoding position. Each of the three layers uses an additive residual connection followed by layer normalization. Together these three layers form one module, and such modules can be stacked without sharing parameters.
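One such decoder module could be sketched as follows; nn.MultiheadAttention serves here as a stand-in for the patent's attention sub-modules, and the number of heads is an assumption.

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=1024, n_heads=8, d_ff=2048):   # head count assumed
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))    # position-wise feed-forward
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, memory, causal_mask):
        # masked self-attention over previously predicted characters
        sa, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + sa)                                # residual + layer norm
        # two-dimensional attention: K and V come from the flattened encoder features F'
        ca, attn_w = self.cross_attn(x, memory, memory)
        x = self.norm2(x + ca)
        x = self.norm3(x + self.ffn(x))
        return x, attn_w    # attn_w can be reshaped to 25x8 for explicit supervision
```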
(3) At time step t (t starts from 0), the t previously predicted characters are embedded and then added to the positional encodings (position embeddings) to obtain 512-dimensional vectors, which are concatenated with the global representation G, finally giving t 1024-dimensional inputs C, where C ∈ R^(t×1024).
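An illustrative assembly of the decoder input C is sketched below; the sinusoidal positional encoding and the enlarged vocabulary (94 character classes plus assumed START/END tokens) are assumptions consistent with the text.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 96            # 94 character classes + assumed START/END tokens
char_embed = nn.Embedding(vocab_size, d_model)

def positional_encoding(t, d=d_model):
    """Standard sinusoidal position embedding, shape (t, d)."""
    pos = torch.arange(t, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d, 2, dtype=torch.float32)
    angle = pos / torch.pow(10000.0, i / d)
    pe = torch.zeros(t, d)
    pe[:, 0::2], pe[:, 1::2] = torch.sin(angle), torch.cos(angle)
    return pe

def build_decoder_input(prev_chars, G):
    """prev_chars: (t,) ids of previously predicted characters; G: (512,) global feature."""
    E = char_embed(prev_chars) + positional_encoding(len(prev_chars))   # (t, 512)
    C = torch.cat([E, G.expand(len(prev_chars), -1)], dim=-1)           # (t, 1024)
    return C
```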
(4) C is input to a masked self-attention mechanism that models the dependencies between different characters in the output word, as shown below.
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
The attention computation consists of three main steps: first, compute the similarity between the query and each key to obtain the weights; second, normalize the weights, typically with a softmax function; finally, compute the weighted sum of the weights and the corresponding values to obtain the final attention output.
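These three steps are the standard scaled dot-product attention, sketched below with an optional causal mask for the masked self-attention:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # step 1: similarity of the query with each key
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    if mask is not None:                          # causal mask hides future characters
        scores = scores.masked_fill(mask == 0, float('-inf'))
    # step 2: normalize the weights with softmax
    weights = torch.softmax(scores, dim=-1)
    # step 3: weighted sum over the values
    return weights @ V, weights
```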
(5) The encoder and the decoder are connected using a two-dimensional attention module, whose structure is essentially the same as the self-attention module except that its K and V both come from the F' obtained in the encoding stage, while Q is the output of the masked self-attention module. To make the model focus on the region of the corresponding character, i.e., to increase the attention weight of that region, the invention performs explicitly supervised training on the attention weights. The loss function is defined as follows:
$$\mathcal{L}_{att} = -\sum_{i,j}\Big[\,y_{label}^{ij}\log y_{pred}^{ij} + \big(1-y_{label}^{ij}\big)\log\big(1-y_{pred}^{ij}\big)\Big]$$
where y_pred^ij is the attention weight at point (i, j), and y_label^ij is 1 when the point lies in a character region and 0 otherwise.
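A sketch of this two-dimensional attention follows, reusing the scaled_dot_product_attention function above; flattening F' to 25 × 8 = 200 positions follows the text, while the function name and shapes are illustrative assumptions.

```python
import torch

def two_dim_attention(q, f_prime):
    """q: (t, 1024) masked self-attention output; f_prime: (25, 8, 1024) encoder features."""
    kv = f_prime.reshape(-1, f_prime.size(-1))            # (200, 1024): keys = values = F'
    out, w = scaled_dot_product_attention(q, kv, kv)      # out: (t, 1024), w: (t, 200)
    attn_2d = w.view(q.size(0), 25, 8)                    # per-step 2-D attention weight map
    return out, attn_2d                                   # attn_2d receives explicit supervision
```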
(6) After the two-dimensional attention module, a picture feature vector in R^(1×1×1024) is obtained; this vector is passed through a fully connected layer to obtain a vector whose dimension equals the number of character classes, and an argmax operation on this vector gives the prediction result at the current time step.
(7) Operations (3)-(6) are repeated at the next time step to obtain the predictions over successive time steps, until the END terminator is obtained.
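Steps (3)-(7) amount to the greedy autoregressive decoding loop sketched below; decoder_step and classifier stand in for the stacked attention modules and the final fully connected layer, and the special token ids and length cap are assumptions.

```python
import torch

START, END, MAX_LEN = 94, 95, 25          # special token ids and length cap assumed

@torch.no_grad()
def greedy_decode(f_prime, G, decoder_step, classifier):
    ids = [START]
    for _ in range(MAX_LEN):
        C = build_decoder_input(torch.tensor(ids), G)   # (t, 1024), see sketch above
        S, _ = decoder_step(C, f_prime)                 # stacked attention modules (assumed)
        logits = classifier(S[-1])                      # vector sized as the character set
        next_id = int(torch.argmax(logits))             # argmax prediction at this step
        if next_id == END:                              # stop at the END terminator
            break
        ids.append(next_id)
    return ids[1:]                                      # predicted character ids
```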
(8) During training, the decoder's input is the vector obtained by word embedding of the ground-truth label; at test time, since the ground-truth label is unknown, the decoder's previous output is used as the input at the current step. Back-propagation is involved only in the training phase.
The specific process of the embodiment is as follows:
1. label making of attention mechanism:
the invention provides the bounding box information of each character in the picture in the synthetic scene text data set SynthText, and a label of an attention mechanism is made according to the information for carrying out explicit supervision training on attention.
The model's input pictures have size 400 × 128, with width 400 and height 128. In the encoding stage, the invention adds an attention supervision signal on the last block of each layer of the ResNet. The spatial attention weight maps of the four stages have sizes 100 × 32, 50 × 16, 25 × 8, and 25 × 8 respectively, and labels of the corresponding sizes are made accordingly: the character bounding boxes are first scaled to the corresponding size according to the ratio between the original image and the attention weight map, and the attention label is then generated, with value 1 inside a character bounding box and 0 outside. In the decoding stage, a corresponding label must be made for the attention mechanism of each decoding step, each of size 25 × 8: the character bounding boxes are again scaled to the corresponding size, but the attention label is 1 only inside the current character's bounding box and 0 elsewhere.
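A sketch of this label-making step is given below; the box format (corner coordinates in input-image pixels) is an assumption.

```python
import numpy as np

def make_attention_label(boxes, map_w, map_h, img_w=400, img_h=128):
    """boxes: per-character (x1, y1, x2, y2) in input-image pixels;
    returns a (map_h, map_w) mask with 1 inside the scaled character boxes."""
    label = np.zeros((map_h, map_w), dtype=np.float32)
    sx, sy = map_w / img_w, map_h / img_h                # scale to attention-map size
    for x1, y1, x2, y2 in boxes:
        c1, r1 = int(x1 * sx), int(y1 * sy)
        c2, r2 = int(np.ceil(x2 * sx)), int(np.ceil(y2 * sy))
        label[r1:r2, c1:c2] = 1.0                        # 1 inside the character box
    return label

# Encoder labels (100x32, 50x16, 25x8, 25x8) use all character boxes; each
# decoding step's 25x8 label uses only the current character's box.
```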
2. Scene text picture preprocessing
To make the input pictures 400 × 128, each picture is resized to 400 × 128 using bilinear interpolation. The data augmentation used in training consists of random cropping and changes to the image's brightness, contrast, saturation, and hue.
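A possible preprocessing and augmentation pipeline matching this description is sketched below; the crop size and jitter magnitudes are assumptions.

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop((120, 380), pad_if_needed=True),        # random cropping (size assumed)
    T.ColorJitter(brightness=0.3, contrast=0.3,          # brightness/contrast/
                  saturation=0.3, hue=0.1),              # saturation/hue changes
    T.Resize((128, 400),                                 # height 128, width 400
             interpolation=T.InterpolationMode.BILINEAR),
    T.ToTensor(),
])
```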
3. ResNet scene text picture feature extraction based on explicit supervision mechanism
The tensor (400 × 128 × 3) obtained by picture preprocessing is input to the feature extraction layers of ResNet34. In each block of the four layers of ResNet34, the channel attention and spatial attention mechanism CBAM is added, as shown in FIG. 1.
To enlarge the extracted feature map, the stride of the last layer of ResNet34 is changed from 2 to 1. The overall framework of the adjusted ResNet34 is shown in FIG. 2. Feature extraction yields F, where F ∈ R^(25×8×512); F is input into a global feature extraction layer of six Bottleneck blocks to obtain the global feature representation G, where G ∈ R^(1×1×512); meanwhile, a 1×1 convolution is applied to F to obtain the finally extracted feature F', where F' ∈ R^(25×8×1024).
4. feature decoding of Transformer structure based on explicit supervised attention mechanism
At decoding time step t, the embeddings of the previously predicted characters are added to the positional encodings (position embeddings) to obtain E, where E ∈ R^(t×512); E is then concatenated with the global representation G to obtain a vector C, where C ∈ R^(t×1024), which is input into the self-attention module to model the relationships between characters. In the two-dimensional attention module, the output of the self-attention module serves as Q and the encoded features F' serve as K and V; at each time step, the feature vector currently attended to, S_t ∈ R^1024, is computed. A position-wise feed-forward layer is added after both the self-attention module and the two-dimensional attention module; it can be regarded as a two-layer fully connected network whose input and output dimensions are 1024 and whose hidden dimension is 2048.
5. Model training
S_t is passed through a fully connected layer whose output dimension equals the number of character classes; a softmax operation then converts the output vector into a probability distribution over the characters, and the character with the highest probability is taken as the prediction at this step. By analogy, the predictions obtained over the successive time steps form the complete text of the scene image. The recognition loss uses a cross-entropy loss function:
$$\mathcal{L}_{rec} = -\log\frac{e^{x_{gt}}}{\sum_{j=1}^{94} e^{x_{j}}}$$
where x is the predicted 94-dimensional vector and gt is the true character label. The final loss function is:
$$\mathcal{L} = \alpha\left(\mathcal{L}_{att}^{enc}+\mathcal{L}_{att}^{dec}\right)+\beta\,\mathcal{L}_{rec}$$
where α and β are balancing coefficients; here α is 0.1 and β is 1.
ADADELTA is chosen as the optimizer to compute gradients and perform back-propagation. The training batch size is set to 112; one epoch requires 64638 iterations, and 6 epochs are trained in total.
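A hedged sketch of the total objective and optimizer setup follows; splitting the attention term into encoder and decoder parts is an assumption consistent with the two supervised attention losses defined above.

```python
import torch
import torch.nn.functional as F

alpha, beta = 0.1, 1.0                                    # coefficients from the text

def total_loss(logits, target_ids, enc_attn_losses, dec_attn_losses):
    rec = F.cross_entropy(logits, target_ids)             # recognition loss over 94 classes
    att = sum(enc_attn_losses) + sum(dec_attn_losses)     # supervised attention losses
    return alpha * att + beta * rec

# optimizer = torch.optim.Adadelta(model.parameters())    # batch size 112, 6 epochs
```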
8. Model application
After the training process, several model checkpoints are obtained, and the best one (with the smallest loss value) is selected for application. At this stage, no data augmentation of the image data is needed: the image only has to be resized to 400 × 128 and normalized before being used as model input. With the parameters of the whole network fixed, the image data is simply fed forward, yielding in turn the feature map F' ∈ R^(25×8×1024) and G ∈ R^(1×1×512); these are then passed into the decoding network for automatic decoding, and the recognition result is obtained directly from the full model. When a large number of scene text pictures must be tested, they can all be packed into a single lmdb-format file so that all pictures can conveniently be read at once.
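A minimal inference sketch under these settings; the normalization statistics and the model's encode/decode interfaces are assumptions.

```python
import torch
import torchvision.transforms as T
from PIL import Image

infer_transform = T.Compose([
    T.Resize((128, 400), interpolation=T.InterpolationMode.BILINEAR),
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),   # stats assumed
])

@torch.no_grad()
def recognize(model, image_path):
    x = infer_transform(Image.open(image_path).convert('RGB')).unsqueeze(0)
    f_prime, G = model.encode(x)                  # assumed encoder interface
    ids = greedy_decode(f_prime[0], G[0], model.decoder_step, model.classifier)
    return ids                                    # map ids back to characters as needed
```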
The invention is not to be considered as limited to the particular embodiments shown, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A scene text recognition method based on an explicit supervision mechanism is characterized by comprising the following steps:
Step 1: input the scene text picture into a ResNet34 convolutional neural network to extract a feature map, denoted F, where F ∈ R^(25×8×512); input F into a global feature extraction layer of six Bottleneck blocks to obtain a global feature representation G, where G ∈ R^(1×1×512); meanwhile, apply a 1×1 convolution to F to obtain the finally extracted feature F', where F' ∈ R^(25×8×1024); a channel attention and spatial attention mechanism is added to each block of the four layers of ResNet34;
the spatial attention is explicitly supervised according to the characters' labeled bounding boxes, with the loss computed as follows:
$$\mathcal{L}_{att} = -\sum_{i,j}\Big[\,y_{label}^{ij}\log y_{pred}^{ij} + \big(1-y_{label}^{ij}\big)\log\big(1-y_{pred}^{ij}\big)\Big]$$
where y_pred^ij is the attention weight at point (i, j), and y_label^ij is 1 when the point lies in a character region and 0 otherwise; the supervision signal is added only to the last block of each layer;
Step 2: at decoding time step t, add the embeddings of the previously predicted characters and the position information to obtain E, where E ∈ R^(t×512), and then concatenate E with the global feature representation G to obtain a vector C, where C ∈ R^(t×1024);
Step 3: input C into a masked self-attention mechanism to model the dependencies between different characters in the output word; the self-attention mechanism is as follows:
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
the attention calculation is divided into three steps: the first step calculates the similarity between the query and each key to obtain the weights; the second step normalizes the weights, typically using a softmax function; the final step computes the weighted sum of the weights and the corresponding values to obtain the final attention output;
connect the encoder and the decoder with a two-dimensional attention module, whose structure is essentially the same as the self-attention module, except that its K and V both come from the F' obtained in the encoding stage while Q is the output of the masked self-attention module; the invention performs explicitly supervised training on the attention weights, with the loss function defined as follows:
$$\mathcal{L}_{att} = -\sum_{i,j}\Big[\,y_{label}^{ij}\log y_{pred}^{ij} + \big(1-y_{label}^{ij}\big)\log\big(1-y_{pred}^{ij}\big)\Big]$$
where y_pred^ij is the attention weight at point (i, j), and y_label^ij is 1 when the point lies in a character region and 0 otherwise;
Step 4: a picture feature vector is obtained after the two-dimensional attention module; this vector is passed through a fully connected layer to obtain a vector whose dimension equals the number of character classes, and an argmax operation on this vector gives the prediction result at the current time step.
2. The scene text recognition method based on an explicitly supervised attention mechanism as recited in claim 1, wherein in step 1: the channel attention module passes the input feature map through a width-and-height-based global max pooling layer and a global average pooling layer respectively, and then through a shared multilayer perceptron; the features output by the multilayer perceptron are combined by element-wise addition and passed through a sigmoid activation to generate the final channel attention map; this map is multiplied element-wise with the input feature map to generate the input features required by the spatial attention module; the spatial attention module takes the feature map output by the channel attention module as its input feature map, first performs channel-based global max pooling and global average pooling, then concatenates the two results along the channel dimension; a convolution operation then reduces the result to a single channel; a sigmoid generates the spatial attention map; finally, this map is multiplied with the module's input features to obtain the final features.
3. A computer system, comprising: one or more processors, a computer readable storage medium, for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 1.
4. A computer-readable storage medium having stored thereon computer-executable instructions for, when executed, implementing the method of claim 1.
5. A computer program comprising computer executable instructions which when executed perform the method of claim 1.
CN202110273068.7A 2021-03-14 2021-03-14 Scene text recognition method based on explicit supervision mechanism Pending CN113159023A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110273068.7A CN113159023A (en) 2021-03-14 2021-03-14 Scene text recognition method based on explicit supervision mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110273068.7A CN113159023A (en) 2021-03-14 2021-03-14 Scene text recognition method based on explicit supervision mechanism

Publications (1)

Publication Number Publication Date
CN113159023A true CN113159023A (en) 2021-07-23

Family

ID=76886970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110273068.7A Pending CN113159023A (en) 2021-03-14 2021-03-14 Scene text recognition method based on explicit supervision mechanism

Country Status (1)

Country Link
CN (1) CN113159023A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569871A (en) * 2021-08-03 2021-10-29 内蒙古工业大学 Library automatic book-making method and system based on deep learning
CN114020881A (en) * 2022-01-10 2022-02-08 珠海金智维信息科技有限公司 Topic positioning method and system
CN115019143A (en) * 2022-06-16 2022-09-06 湖南大学 Text detection method based on CNN and Transformer mixed model
CN115147381A (en) * 2022-07-08 2022-10-04 烟台大学 Pavement crack detection method based on image segmentation
CN115067945A (en) * 2022-08-22 2022-09-20 深圳市海清视讯科技有限公司 Fatigue detection method, device, equipment and storage medium
CN115424330A (en) * 2022-09-16 2022-12-02 郑州轻工业大学 Single-mode face in-vivo detection method based on DFMN and DSD
CN115424330B (en) * 2022-09-16 2023-08-11 郑州轻工业大学 Single-mode face living body detection method based on DFMN and DSD

Similar Documents

Publication Publication Date Title
CN113159023A (en) Scene text recognition method based on explicit supervision mechanism
CN113065550B (en) Text recognition method based on self-attention mechanism
CN114973222A (en) Scene text recognition method based on explicit supervision mechanism
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN114495129B (en) Character detection model pre-training method and device
CN111027576B (en) Cooperative significance detection method based on cooperative significance generation type countermeasure network
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
Tang et al. FontRNN: Generating Large‐scale Chinese Fonts via Recurrent Neural Network
CN114596566B (en) Text recognition method and related device
CN111310766A (en) License plate identification method based on coding and decoding and two-dimensional attention mechanism
CN115222998B (en) Image classification method
Fan et al. A novel sonar target detection and classification algorithm
CN114581918A (en) Text recognition model training method and device
CN111242114B (en) Character recognition method and device
CN110942463B (en) Video target segmentation method based on generation countermeasure network
Brzeski et al. Evaluating performance and accuracy improvements for attention-OCR
US11816909B2 (en) Document clusterization using neural networks
CN114692715A (en) Sample labeling method and device
Zhu et al. Dc-net: Divide-and-conquer for salient object detection
Lin et al. Spatio-temporal co-attention fusion network for video splicing localization
CN114882412B (en) Labeling-associated short video emotion recognition method and system based on vision and language
CN118038451B (en) Open world fruit detection model construction method, detection method and electronic equipment
Ma et al. Har enhanced weakly-supervised semantic segmentation coupled with adversarial learning
CN117593755B (en) Method and system for recognizing gold text image based on skeleton model pre-training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination