CN112070114B - Scene character recognition method and system based on Gaussian constraint attention mechanism network


Info

Publication number
CN112070114B
CN112070114B (application CN202010767079.6A)
Authority
CN
China
Prior art keywords
time step
original
attention
weighted
feature vector
Prior art date
Legal status
Active
Application number
CN202010767079.6A
Other languages
Chinese (zh)
Other versions
CN112070114A (en)
Inventor
王伟平
乔峙
秦绪功
周宇
Current Assignee
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN202010767079.6A
Publication of CN112070114A
Application granted
Publication of CN112070114B

Classifications

    • G06F18/253 Pattern recognition: fusion techniques of extracted features
    • G06F40/30 Handling natural language data: semantic analysis
    • G06N3/044 Neural networks: recurrent networks, e.g. Hopfield networks
    • G06N3/045 Neural networks: combinations of networks
    • G06V30/10 Character recognition

Abstract

The invention provides a scene character recognition method and system based on a Gaussian-constrained attention mechanism network, relating to the field of image information recognition. Visual features are extracted from a picture to be recognized to obtain a two-dimensional feature map; the two-dimensional feature map is converted into a one-dimensional feature sequence, from which global semantic information is extracted; the global semantic information is input at the first time step to initialize the decoder hidden state, and at each time step an original attention weight is computed from the hidden state and the two-dimensional feature map, with a weighted summation yielding an original weighted feature vector; a two-dimensional Gaussian distribution mask is constructed from the hidden state and the original weighted feature vector and multiplied with the original attention weight to obtain a corrected attention weight, from which a corrected weighted feature vector is obtained; finally, the original and corrected weighted feature vectors are fused to predict the characters of the picture to be recognized, thereby alleviating attention dispersion.

Description

Scene character recognition method and system based on Gaussian constraint attention mechanism network
Technical Field
The invention relates to the field of image information recognition, in particular to a scene text recognition method and system based on a Gaussian-constrained attention mechanism network.
Background
Text detection and recognition in scene images is a recent research hotspot, and text recognition is the core of the whole pipeline: its task is to transcribe the text in a picture into a form that a computer can directly edit. With the development of deep learning, the field has advanced rapidly. Inspired by machine translation, current mainstream methods are based on an encoder-decoder structure: the encoder extracts rich visual features through a convolutional neural network and a recurrent neural network, while the decoder acquires the required features through an attention mechanism and predicts each character in order along the text sequence.
However, the prior art has the following defects:
1. At each decoding time step, text recognition only needs the region of one specific character in the text picture, and existing methods do not fully exploit this characteristic of text recognition.
2. Existing methods do not constrain the attention weight but let the model predict it freely, so attention dispersion occurs on some pictures, i.e. the weight cannot concentrate on a specific character.
3. Some existing approaches use a Gaussian-distributed label for each character's position to supervise the attention weight, thereby constraining it implicitly. However, because no explicit constraint is introduced into the computation, attention dispersion still occurs on some pictures.
Disclosure of Invention
The invention aims to provide a scene character recognition method and system based on a Gaussian-constrained attention mechanism network, which introduce an explicit constraint into the computation of the attention weight to correct the original attention weight, so that the corrected attention weight concentrates more on the region corresponding to a character and the problem of attention dispersion is alleviated.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a scene text recognition method based on Gaussian constraint attention mechanism network comprises the following steps:
extracting visual features of the picture to be identified to obtain a two-dimensional feature map;
converting the two-dimensional feature map into a one-dimensional feature sequence, and extracting global semantic information according to the one-dimensional feature sequence;
inputting the global semantic information at the first time step to initialize the decoder hidden state; in each time step, calculating an original attention weight according to the hidden state and the two-dimensional feature map, and obtaining an original weighted feature vector by a weighted summation with the weights;
constructing a two-dimensional Gaussian distribution mask according to the hidden state and the original weighted feature vector, multiplying the mask by the original attention weight to obtain corrected attention weight, and obtaining a corrected weighted feature vector according to the weight;
and fusing the original weighted feature vector and the corrected weighted feature vector together to predict the character of the picture to be identified.
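The steps above can be sketched as a single decoding loop. The module objects (`encoder`, `attention`, `correction`, `classifier`, `decoder_cell`) are hypothetical stand-ins for the components the method describes, and the toy lambdas at the bottom only exercise the control flow:

```python
import torch

def decode(feat_2d, encoder, decoder_cell, attention, correction, classifier, max_len=25):
    """Sketch of the claimed decoding loop; all module arguments are
    hypothetical stand-ins for the components described above."""
    hidden = encoder(feat_2d)            # global semantic information initializes the hidden state
    chars = []
    for _ in range(max_len):
        glimpse, alpha = attention(feat_2d, hidden)                        # original weighted feature
        glimpse_c, _alpha_c = correction(feat_2d, hidden, glimpse, alpha)  # corrected feature
        fused = torch.cat([glimpse, glimpse_c], dim=1)                     # fuse both vectors
        chars.append(classifier(fused).argmax(dim=1))                      # predict one character
        hidden = decoder_cell(fused, hidden)                               # update for next step
    return torch.stack(chars, dim=1)

# Toy stand-ins that only exercise the control flow:
feat = torch.randn(2, 8, 4, 16)
preds = decode(
    feat,
    encoder=lambda f: torch.zeros(2, 8),
    decoder_cell=lambda x, h: h,
    attention=lambda f, h: (f.mean(dim=(2, 3)), torch.full((2, 1, 4, 16), 1 / 64)),
    correction=lambda f, h, g, a: (g, a),
    classifier=torch.nn.Linear(16, 37),   # 37 character classes is an assumption
    max_len=5,
)
```

In a real implementation the stand-ins would be the trained modules; the loop itself mirrors the claim: original weights, Gaussian-corrected weights, then fusion for prediction.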
A scene text recognition system based on a gaussian constrained attention mechanism network, comprising:
the feature extraction module comprises a multi-layer residual network and is responsible for extracting visual features of the picture to be recognized to obtain a two-dimensional feature map;
the encoder module comprises a unidirectional two-layer long short-term memory (LSTM) network and is responsible for converting the two-dimensional feature map into a one-dimensional feature sequence, which is then input into the LSTM to extract global semantic information;
the decoder module comprises a unidirectional two-layer attention-based LSTM (AM-LSTM) and is responsible for initializing the hidden state of the AM-LSTM at the first time step with the global semantic information; in each time step it calculates an original attention weight according to the AM-LSTM hidden state and the two-dimensional feature map, obtains an original weighted feature vector by a weighted summation with the weights, and fuses the original weighted feature vector with the corrected weighted feature vector to predict the characters of the picture to be recognized;
and the Gaussian-constraint-based correction module is responsible for constructing a two-dimensional Gaussian distribution mask according to the AM-LSTM hidden state and the original weighted feature vector, multiplying the mask with the original attention weight to obtain a corrected attention weight, and obtaining the corrected weighted feature vector from that weight.
Further, the feature extraction module includes a 31-layer residual network.
Further, the encoder module is responsible for max-pooling the two-dimensional feature map into the one-dimensional feature sequence.
Further, for each time step starting from the second, the decoder module updates the hidden state of the AM-LSTM according to the decoding result of the previous time step.
Further, the Gaussian-constraint-based correction module is responsible for concatenating the AM-LSTM hidden state with the original weighted feature vector, predicting a group of Gaussian distribution parameters, comprising a mean and a variance, through a fully connected layer, and constructing a two-dimensional Gaussian distribution mask from these parameters.
Further, the system is trained by optimizing a character recognition loss and an attention weight loss: the character recognition loss is a cross-entropy loss between the predicted character probabilities and the recognition labels, and the attention weight loss is an L1 regression loss between the predicted character attention distribution and the character position labels.
Compared with the prior art, the invention provides a novel Gaussian-constraint-based correction module, which predicts a Gaussian mask to correct the original attention weight. Since the characters in a word recognition task are usually regular in shape, the model is made to additionally predict a Gaussian mask as an explicit constraint for correcting the original attention weight. The corrected attention weight concentrates more on the region corresponding to the character, which alleviates attention dispersion. Experiments show that the invention achieves superior performance on existing datasets, and the proposed module is flexible enough to be used in existing attention-based methods.
Drawings
Fig. 1 is a schematic diagram of a scene text recognition network structure based on a gaussian constraint attention mechanism network according to an embodiment.
Fig. 2 is a schematic diagram of a decoder structure according to an embodiment.
Fig. 3 is a diagram comparing visualizations of the recognition results of a prior-art method and of the present invention.
Detailed Description
In order to make the technical scheme of the invention clearer, specific embodiments are described in detail below with reference to the accompanying drawings.
This embodiment discloses a scene character recognition method and system based on a Gaussian-constrained attention mechanism network (GCAN). As shown in Fig. 1, GCAN is a recognition model based on a two-dimensional attention mechanism into which a novel Gaussian-constraint-based correction module (GCRM) is introduced. The input of the GCRM is the original, unconstrained attention-weighted feature; its output is the corrected attention-weighted feature. The two feature vectors are fused and then used to predict the character of the current decoding time step; a time step is one iteration of the step-by-step decoding, each predicting one character of the word. The system consists of four parts: a feature extraction module, an encoder module, a decoder module, and a Gaussian-constraint-based correction module.
The feature extraction module is a 31-layer residual network, which extracts rich visual features for the subsequent encoding and decoding processes.
The encoder module consists of a unidirectional two-layer long short-term memory (LSTM) network. The two-dimensional feature map output by the feature extraction module is first max-pooled along the vertical direction to obtain a one-dimensional feature sequence, which is then input into the LSTM to extract context information. The output of the encoder module is the hidden state of the LSTM at the last moment, which serves as global semantic information to guide the decoder.
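A minimal PyTorch sketch of this encoder, assuming a 512-channel feature map and a 256-dimensional hidden state (the sizes are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Vertical max-pooling followed by a unidirectional two-layer LSTM.
    The 512-channel input and 256-dim hidden state are assumptions."""
    def __init__(self, channels=512, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(channels, hidden, num_layers=2, batch_first=True)

    def forward(self, feat_2d):              # feat_2d: (B, C, H, W)
        seq = feat_2d.max(dim=2).values      # max-pool over height -> (B, C, W)
        seq = seq.permute(0, 2, 1)           # one feature vector per image column
        _, (h_n, _) = self.lstm(seq)
        return h_n[-1]                       # last hidden state = global semantics

enc = Encoder()
g = enc(torch.randn(2, 512, 8, 32))          # a batch of two 8x32 feature maps
```

The vertical pooling collapses each column of the feature map into one vector, so the LSTM reads the image left to right, matching the horizontal reading order of most text.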
The decoder module consists of a unidirectional two-layer attention-based LSTM (AM-LSTM), whose structure is shown in Fig. 2. The global semantic information output by the encoder is input at the first decoding time step; at every subsequent time step, the decoding result of the previous time step is input to update the hidden state of the decoder AM-LSTM. At each time step, the original attention weight is computed from the hidden state of the AM-LSTM and the feature map output by the feature extraction module, and the feature map is summed, weighted by the original attention weight, to obtain the original weighted feature vector. Finally, the original weighted feature vector and the corrected weighted feature vector are fused to predict the character of the current time step.
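One decoding step of this attention computation might be sketched as follows, using additive attention; the exact scoring function and all dimensions are assumptions, since the patent does not specify them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention2D(nn.Module):
    """One step of 2D additive attention; the scoring function and all
    dimensions are assumptions, since the patent does not fix them."""
    def __init__(self, feat_dim=512, hid_dim=256, att_dim=256):
        super().__init__()
        self.w_f = nn.Conv2d(feat_dim, att_dim, kernel_size=1)
        self.w_h = nn.Linear(hid_dim, att_dim)
        self.score = nn.Conv2d(att_dim, 1, kernel_size=1)

    def forward(self, feat, hidden):         # feat: (B, C, H, W), hidden: (B, hid)
        e = torch.tanh(self.w_f(feat) + self.w_h(hidden)[:, :, None, None])
        alpha = F.softmax(self.score(e).flatten(1), dim=1)     # original attention weight
        alpha = alpha.view(feat.size(0), 1, *feat.shape[2:])   # back to (B, 1, H, W)
        glimpse = (alpha * feat).sum(dim=(2, 3))               # original weighted feature vector
        return glimpse, alpha

att = Attention2D()
glimpse, alpha = att(torch.randn(2, 512, 8, 32), torch.randn(2, 256))
```

The softmax normalizes the weights over all spatial positions, which is exactly what allows them to "disperse" when unconstrained; the correction module below addresses this.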
In the Gaussian-constraint-based correction module, the hidden state of the decoder AM-LSTM at the corresponding time step is concatenated with the original weighted feature vector, a group of Gaussian distribution parameters (mean and variance) is predicted through a fully connected layer, and a two-dimensional Gaussian distribution constructed from these parameters serves as a mask. The mask is multiplied with the original attention weight to obtain the corrected attention weight, from which a new, corrected weighted feature vector is computed. Compared with the original attention weight, the corrected attention is more concentrated, which alleviates attention dispersion.
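A sketch of this correction module under the definitions above. The patent only states that a fully connected layer predicts the mean and variance; the sigmoid/softplus squashing used here to keep those parameters in valid ranges, and all sizes, are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianCorrection(nn.Module):
    """A fully connected layer predicts the mean and variance of a 2D
    Gaussian from the concatenated hidden state and original glimpse;
    the resulting mask rescales the original attention weights.
    Sigmoid/softplus squashing and all sizes are assumptions."""
    def __init__(self, feat_dim=512, hid_dim=256):
        super().__init__()
        self.fc = nn.Linear(feat_dim + hid_dim, 4)   # mu_x, mu_y, sigma_x, sigma_y

    def forward(self, feat, hidden, glimpse, alpha): # alpha: (B, 1, H, W)
        _, _, H, W = feat.shape
        p = self.fc(torch.cat([hidden, glimpse], dim=1))
        mu = torch.sigmoid(p[:, :2])                 # means in [0, 1] map coordinates
        sigma = F.softplus(p[:, 2:]) + 1e-3          # keep the deviations positive
        ys = torch.linspace(0, 1, H, device=feat.device)
        xs = torch.linspace(0, 1, W, device=feat.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        mask = torch.exp(
            -((gx - mu[:, 0, None, None]) ** 2) / (2 * sigma[:, 0, None, None] ** 2)
            - ((gy - mu[:, 1, None, None]) ** 2) / (2 * sigma[:, 1, None, None] ** 2)
        )
        alpha_c = alpha * mask[:, None]              # explicitly constrain the raw weights
        alpha_c = alpha_c / alpha_c.sum(dim=(2, 3), keepdim=True)
        glimpse_c = (alpha_c * feat).sum(dim=(2, 3)) # corrected weighted feature vector
        return glimpse_c, alpha_c

corr = GaussianCorrection()
feat = torch.randn(2, 512, 8, 32)
alpha = torch.softmax(torch.randn(2, 1, 8, 32).flatten(1), dim=1).view(2, 1, 8, 32)
glimpse_c, alpha_c = corr(feat, torch.randn(2, 256), (alpha * feat).sum(dim=(2, 3)), alpha)
```

Because the mask is unimodal, multiplying it with the raw weights suppresses attention far from the predicted character center, which is the explicit constraint the patent describes.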
The whole process for identifying the scene characters by adopting the method and the system comprises the following steps:
1. and the input picture extracts visual features through a feature extraction module to obtain a two-dimensional feature map.
2. The extracted visual features are passed through an encoder module to extract global semantic information, which is then input into a decoder module.
3. The decoder module adopts an attention mechanism: the original attention weight is calculated from the hidden state and the feature map output by the feature extraction module, and the original weighted feature vector is obtained by weighted summation.
4. The original weighted feature vector and the hidden state are input into the Gaussian-constraint-based correction module, which corrects the original attention weight predicted by the decoder with a two-dimensional Gaussian mask to obtain the corrected attention weight and, from it, the corrected weighted feature vector.
5. And fusing the corrected weighted feature vector and the original weighted feature vector together to predict corresponding characters.
6. The whole model is trained by optimizing a character recognition loss and an attention weight loss. The character recognition loss is a cross-entropy loss between the predicted character probabilities and the recognition labels; the attention weight loss is an L1 regression loss between the predicted character attention distribution and the character position labels.
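The combined objective of step 6 can be sketched as follows; the balance factor `lam` between the two losses is an assumption, not taken from the patent:

```python
import torch
import torch.nn.functional as F

def total_loss(logits, targets, alpha_pred, alpha_gt, lam=1.0):
    """Character cross-entropy plus L1 regression on attention maps;
    the balance factor `lam` is an assumption."""
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    l1 = F.l1_loss(alpha_pred, alpha_gt)
    return ce + lam * l1

logits = torch.randn(2, 5, 37)            # (batch, decoding steps, character classes)
targets = torch.randint(0, 37, (2, 5))    # recognition labels
alpha_pred = torch.rand(2, 5, 8, 32)      # predicted attention map per step
alpha_gt = torch.rand(2, 5, 8, 32)        # Gaussian position labels (stand-in)
loss = total_loss(logits, targets, alpha_pred, alpha_gt)
```

The L1 term supervises where the model looks, while the cross-entropy term supervises what it reads; both gradients flow through the shared decoder.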
Extensive experiments were conducted to evaluate the effectiveness of the GCAN. GCAN is trained on two synthetic datasets, Syn90k and SynthText, and tested on several mainstream scene text datasets. IIIT5K has 3000 images, mostly high-quality horizontal text; SVT has 647 images, mostly horizontal text; SVT-Perspective (SVTP) has 645 images, most with strong distortion; ICDAR2013 (IC13) has 1015 images, mostly high-quality horizontal text; ICDAR2015 (IC15) has 1811 images, mostly low-quality text of arbitrary shape; CUTE has 288 images, mostly high-quality curved text.
Table 1 compares the effect of each module of the GCAN; the results show that the proposed GCRM brings a clear improvement, while character supervision alone, as in existing methods, does not improve performance significantly. Table 2 compares the invention with other mainstream methods on the test datasets; the best performance on multiple datasets demonstrates its effectiveness. Fig. 3 visualizes the recognition results and attention weights of a prior-art method and of the invention: for each picture on the left, the first result on the right is from the prior-art method and the second is from the invention; the white rings mark the attended positions and the letters below are the recognized characters. It can be seen that the invention effectively resolves attention dispersion and obtains better recognition results.
Table 1 comparative experiments of the respective modules
Table 2 comparison of GCAN with other methods on various datasets
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention, and the scope of the present invention is defined by the claims.

Claims (8)

1. A scene text recognition method based on a Gaussian constraint attention mechanism network is characterized by comprising the following steps:
extracting visual features of the picture to be identified to obtain a two-dimensional feature map;
converting the two-dimensional feature map into a one-dimensional feature sequence, and extracting global semantic information according to the one-dimensional feature sequence;
inputting the global semantic information at the first time step to initialize the decoder hidden state; in each time step, calculating an original attention weight according to the hidden state and the two-dimensional feature map, and obtaining an original weighted feature vector by a weighted summation with the weights;
in each decoding time step, concatenating the hidden state of the corresponding time step with the original weighted feature vector, and then predicting a group of parameters of a Gaussian distribution through a fully connected layer, wherein the parameters comprise a mean and a variance; constructing a two-dimensional Gaussian distribution from the parameters as a mask; and multiplying the mask with the original attention weight to obtain a corrected attention weight, from which a corrected weighted feature vector is obtained;
and fusing the original weighted feature vector and the corrected weighted feature vector together to predict the character of the picture to be identified.
2. The method of claim 1, wherein, starting from the second time step, each time step inputs the decoding result of the previous time step to update the hidden state.
3. A scene text recognition system based on a gaussian constrained attention mechanism network, comprising:
the feature extraction module comprises a multi-layer residual error network and is responsible for extracting visual features of the picture to be identified to obtain a two-dimensional feature map;
the encoder module comprises a unidirectional two-layer long short-term memory (LSTM) network and is responsible for converting the two-dimensional feature map into a one-dimensional feature sequence, inputting the one-dimensional feature sequence into the LSTM to extract global semantic information, and outputting the hidden state of the LSTM at the last moment;
the decoder module comprises a unidirectional two-layer attention-based LSTM (AM-LSTM) and is responsible for updating the hidden state of the AM-LSTM at each time step based on the global semantic information; in each time step it calculates an original attention weight according to the AM-LSTM hidden state and the two-dimensional feature map, obtains an original weighted feature vector by a weighted summation with the weights, and fuses the original weighted feature vector with the corrected weighted feature vector to predict the characters of the picture to be recognized;
and the Gaussian-constraint-based correction module is responsible, at each decoding time step, for concatenating the AM-LSTM hidden state of the corresponding time step with the original weighted feature vector, predicting a group of Gaussian distribution parameters comprising a mean and a variance through a fully connected layer, constructing a two-dimensional Gaussian distribution from the parameters as a mask, multiplying the mask with the original attention weight to obtain a corrected attention weight, and obtaining the corrected weighted feature vector from that weight.
4. The system of claim 3, wherein the feature extraction module comprises a 31-layer residual network.
5. The system of claim 3, wherein the encoder module is responsible for max-pooling the two-dimensional feature map into the one-dimensional feature sequence.
6. The system of claim 3, wherein the decoder module is responsible for inputting the global semantic information at the first decoding time step, and thereafter updating the hidden state of the AM-LSTM at each time step according to the decoding result of the previous time step.
7. The system of claim 3, wherein the system optimizes training by calculating character recognition loss and attention weight loss.
8. The system of claim 7, wherein the character recognition penalty is optimized by calculating a cross entropy penalty between predicted character probabilities and recognition labels, and the attention weight penalty is optimized by calculating an L1 regression penalty between predicted character attention distribution and character position labels.
CN202010767079.6A 2020-08-03 2020-08-03 Scene character recognition method and system based on Gaussian constraint attention mechanism network Active CN112070114B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010767079.6A CN112070114B (en) 2020-08-03 2020-08-03 Scene character recognition method and system based on Gaussian constraint attention mechanism network


Publications (2)

Publication Number Publication Date
CN112070114A (en) 2020-12-11
CN112070114B (en) 2023-05-16

Family

ID=73657592


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541501B (en) * 2020-12-18 2021-09-07 北京中科研究院 Scene character recognition method based on visual language modeling network
CN113065561A (en) * 2021-03-15 2021-07-02 国网河北省电力有限公司 Scene text recognition method based on fine character segmentation
CN113221874A (en) * 2021-06-09 2021-08-06 上海交通大学 Character recognition system based on Gabor convolution and linear sparse attention
CN113591546B (en) * 2021-06-11 2023-11-03 中国科学院自动化研究所 Semantic enhancement type scene text recognition method and device
CN114463675B (en) * 2022-01-11 2023-04-28 北京市农林科学院信息技术研究中心 Underwater fish group activity intensity identification method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5710833A (en) * 1995-04-20 1998-01-20 Massachusetts Institute Of Technology Detection, recognition and coding of complex objects using probabilistic eigenspace analysis
CN111428727A (en) * 2020-03-27 2020-07-17 华南理工大学 Natural scene text recognition method based on sequence transformation correction and attention mechanism


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Recognition system for text images with uneven illumination based on deep learning; 何鎏一 et al.; Computer Applications and Software; 2020-06-12 (No. 06); full text *


Similar Documents

Publication Publication Date Title
CN112070114B (en) Scene character recognition method and system based on Gaussian constraint attention mechanism network
CN107239801B (en) Video attribute representation learning method and video character description automatic generation method
CN111753827B (en) Scene text recognition method and system based on semantic enhancement encoder and decoder framework
CN106570464A (en) Human face recognition method and device for quickly processing human face shading
CN110188762B (en) Chinese-English mixed merchant store name identification method, system, equipment and medium
CN111967471A (en) Scene text recognition method based on multi-scale features
CN115080766B (en) Multi-modal knowledge graph characterization system and method based on pre-training model
CN112733768A (en) Natural scene text recognition method and device based on bidirectional characteristic language model
CN113807340B (en) Attention mechanism-based irregular natural scene text recognition method
CN112329767A (en) Contract text image key information extraction system and method based on joint pre-training
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN111178363B (en) Character recognition method, character recognition device, electronic equipment and readable storage medium
CN111259197B (en) Video description generation method based on pre-coding semantic features
CN113191355A (en) Text image synthesis method, device, equipment and storage medium
CN113537187A (en) Text recognition method and device, electronic equipment and readable storage medium
CN116958512A (en) Target detection method, target detection device, computer readable medium and electronic equipment
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN111814508A (en) Character recognition method, system and equipment
CN115718815A (en) Cross-modal retrieval method and system
CN113535975A (en) Chinese character knowledge graph-based multi-dimensional intelligent error correction method
CN112287938A (en) Text segmentation method, system, device and medium
CN117612071B (en) Video action recognition method based on transfer learning
CN116311275B (en) Text recognition method and system based on seq2seq language model
CN116452600B (en) Instance segmentation method, system, model training method, medium and electronic equipment
Beche et al. Narrowing the semantic gap between real and synthetic data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant