CN110458165B - Natural scene text detection method introducing attention mechanism - Google Patents

Natural scene text detection method introducing attention mechanism

Info

Publication number
CN110458165B
CN110458165B
Authority
CN
China
Prior art keywords
attention
text
channel
feature
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910750169.1A
Other languages
Chinese (zh)
Other versions
CN110458165A (en)
Inventor
牛作东
李捍东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University
Priority to CN201910750169.1A
Publication of CN110458165A
Application granted
Publication of CN110458165B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Abstract

The invention discloses a natural scene text detection method introducing an attention mechanism, which comprises the following steps: in the process of down-sampling images with a PVANet network, a spatial attention module is generated from the spatial relationship of the intermediate text feature information and is used to capture importance information for the target area in two-dimensional space; the feature information generated by each convolution satisfies I ∈ R^{1×H×W} and is activated through the sigmoid function. During image up-sampling, features are extracted by unpooling and used to approximate the target position features to generate a channel attention module, which is then adjusted through a shared MLP network. Finally, in the process of feature fusion, the channel attention weights and the spatial attention weights form the whole branch attention model. The method pays more attention to useful information and suppresses useless information when extracting text target features, effectively improves the ability of the EAST algorithm to detect long text, and improves detection precision without losing detection efficiency.

Description

Natural scene text detection method introducing attention mechanism
Technical Field
The invention relates to a natural scene text detection method introducing an attention mechanism, and belongs to the technical field of text detection methods.
Background
Under a classification strategy based on the original detection target, existing methods fall into three groups. Character-based detection algorithms first detect single characters or parts of the text and then group them into words. Word-based detection methods extract text directly, in a way similar to general object detection. Text-line-based detection algorithms first detect lines of text and then subdivide the lines into words. Detection methods under a classification strategy based on the shape of the target bounding box can be divided into two categories. The first category is horizontal or near-horizontal detection methods, which focus on detecting horizontal or near-horizontal text in the image. The second category is multi-directional detection methods; compared with horizontal or near-horizontal detection, multi-directional text detection is more robust, because text in a natural scene can lie in any direction in an image. The main research methods of this type exploit the rotation-invariant features of multi-directional text detection: before feature calculation they first estimate the center, scale and direction of the detection target, and then group chain-level features according to size change, color self-similarity and structural self-similarity.
The EAST algorithm provides a fast and accurate scene text detection pipeline with only two stages. The pipeline employs a fully convolutional network (FCN) model to directly generate word- or text-line-level predictions, without redundant and slow intermediate steps. The generated text predictions, which may be rotated rectangles or quadrangles, are sent to non-maximum suppression to produce the final result, as shown in Fig. 2. However, the method is limited when extracting long text, and its detection effect on long text is poor.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: a natural scene text detection method introducing an attention mechanism is provided to solve the problems in the prior art.
The technical scheme adopted by the invention is as follows: a natural scene text detection method introducing an attention mechanism comprises the following steps: in the process of down-sampling images with a PVANet network, a spatial attention module is generated from the spatial relationship of the intermediate text feature information and is used to capture importance information for the target area in two-dimensional space; the feature information generated by each convolution satisfies I ∈ R^{1×H×W} and is activated through the sigmoid function, with the expression:

W_S(I) = σ(f^{7×7}(Pool(I)))    (4)

where f^{7×7} is a convolution operation with a 7×7 convolution kernel. In the image up-sampling process, features are extracted by unpooling and used to approximate the target position features to generate a channel attention module, which is then adjusted through a shared MLP network, with the expression:

W_C(I′) = σ(MLP(unpool(I′))) = σ(W_1 W_0 I′)    (5)

where σ is the sigmoid activation function, and W_0 ∈ R^{C/r×C} and W_1 ∈ R^{C×C/r} are the MLP weights. Finally, in the process of feature fusion, the channel attention weights and the spatial attention weights form the whole branch attention model; the process is expressed as:

I′ = (W_S(I) + 1) ⊙ I    (6)
I″ = (W_C(I′) + 1) ⊙ I′    (7)

where ⊙ denotes element-wise multiplication of the corresponding matrices. Since each module is finally activated with the sigmoid function, every element of the attention channel lies between 0 and 1, so the attention modules enhance useful image information and suppress useless information.
The invention has the beneficial effects that: compared with the prior art, the invention addresses the problem that the field of view of the EAST algorithm is limited when extracting text direction features. By introducing an attention mechanism into the backbone network PVANet, an Attention-EAST detection method is obtained, so that the training model pays more attention to useful information and suppresses useless information when extracting text target features. Experiments show that the method effectively improves the ability of the EAST algorithm to detect long text and improves detection precision without losing detection efficiency.
Drawings
FIG. 1 is a basic flow diagram of an object detection algorithm;
FIG. 2 is a block diagram of the EAST algorithm;
FIG. 3 is a diagram of the Attention-EAST algorithm architecture;
FIG. 4 is a diagram of the EAST algorithm long text detection effect;
FIG. 5 is a diagram of the detection effect of the Attention-EAST algorithm for long text.
Detailed Description
The invention is further described with reference to the accompanying drawings and specific embodiments.
The feasibility of visual attention is mainly due to the reasonable assumption that human vision does not immediately process the entire image as a whole; instead, one focuses on selective portions of the entire visual space only when and where it is needed. In particular, attention is not directed to encoding images as static vectors, but rather to allowing image features to evolve from the sentence context at hand, resulting in richer and longer descriptions of cluttered images. In this way, visual attention can be viewed as a dynamic feature extraction mechanism that incorporates contextual localization over time.
When an attention mechanism is added to an image processing task to describe the features and information of the detection target in the image, the feature information the attention module needs to process contains an explicit sequence A = {a_1, a_2, a_3, …, a_L}, a_i ∈ R^D, where L represents the number of feature vectors and D represents the spatial dimension. The attention mechanism therefore needs to compute, at the current time t, the weight α_{t,i} of each feature vector a_i, with the formulas:

e_{ti} = f_att(a_i, h_{t-1})    (1)

α_{t,i} = exp(e_{ti}) / Σ_{k=1}^{L} exp(e_{tk})    (2)

where f_att(·) denotes a multi-layer perceptron, e_{ti} is an intermediate variable, h_{t-1} is the hidden state at the previous time step, and k indexes the feature vectors. After the weights are computed, the model can screen the input sequence A; the screened sequence item is:

ẑ_t = φ({a_i}, {α_{t,i}})    (3)

The choice of the function φ determines whether the attention mechanism is hard or soft.
Example 1: as shown in Figs. 3 to 5, a natural scene text detection method introducing an attention mechanism comprises the following steps: in the process of down-sampling images with a PVANet network, a spatial attention module is generated from the spatial relationship of the intermediate text feature information and is used to capture importance information for the target area in two-dimensional space; the feature information generated by each convolution satisfies I ∈ R^{1×H×W} and is activated through the sigmoid function, with the expression:

W_S(I) = σ(f^{7×7}(Pool(I)))    (4)

where f^{7×7} is a convolution operation with a 7×7 convolution kernel. In the image up-sampling process, features are extracted by unpooling and used to approximate the target position features to generate a channel attention module, which is then adjusted through a shared MLP network, with the expression:

W_C(I′) = σ(MLP(unpool(I′))) = σ(W_1 W_0 I′)    (5)

where σ is the sigmoid activation function, and W_0 ∈ R^{C/r×C} and W_1 ∈ R^{C×C/r} are the MLP weights. Finally, in the process of feature fusion, the channel attention weights and the spatial attention weights form the whole branch attention model; the process is expressed as:

I′ = (W_S(I) + 1) ⊙ I    (6)
I″ = (W_C(I′) + 1) ⊙ I′    (7)

where ⊙ denotes element-wise multiplication of the corresponding matrices. Since each module is finally activated with the sigmoid function, every element of the attention channel lies in [0, 1], so the attention modules enhance useful image information and suppress useless information.
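The following NumPy sketch mirrors Eqs. (4)-(7). It is an illustration rather than the patent's implementation: the pooling choices and the stand-in for the 7×7 convolution are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(I, conv7x7):
    """Eq. (4): W_S(I) = sigmoid(f_7x7(Pool(I))). Pool is taken here as a
    channel-wise average; conv7x7 stands in for the real 7x7 convolution."""
    pooled = I.mean(axis=0, keepdims=True)   # (1, H, W) spatial descriptor
    return sigmoid(conv7x7(pooled))          # per-pixel weights in (0, 1)

def channel_attention(I1, W0, W1):
    """Eq. (5): W_C(I') = sigmoid(W1 W0 I'), with the unpooled map reduced to a
    C-dimensional descriptor by global average pooling for this sketch."""
    v = I1.mean(axis=(1, 2))                 # (C,) channel descriptor
    return sigmoid(W1 @ (W0 @ v))            # (C,) weights in (0, 1)

# Eqs. (6)-(7): residual re-weighting of the feature map
rng = np.random.default_rng(1)
C, H, W, r = 8, 16, 16, 2
I = rng.normal(size=(C, H, W))
conv7x7 = lambda x: x                        # placeholder for a trained 7x7 conv
I1 = (spatial_attention(I, conv7x7) + 1.0) * I                  # Eq. (6)
W0 = rng.normal(size=(C // r, C))            # W0 in R^{C/r x C}
W1 = rng.normal(size=(C, C // r))            # W1 in R^{C x C/r}
I2 = (channel_attention(I1, W0, W1)[:, None, None] + 1.0) * I1  # Eq. (7)
```

The "+ 1" in Eqs. (6)-(7) makes the re-weighting residual: even a zero attention weight passes the original feature through unchanged rather than suppressing it entirely.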
The text detection method of the invention has the following loss function:

L = L_s + λ_g L_g    (8)

where L_s and L_g denote the losses of the score map and the geometry, respectively, and λ_g represents the importance between the two losses; the invention sets λ_g to 1. To simplify the training process, the invention introduces class-balanced cross-entropy:

L_s = −β Y* log Ŷ − (1 − β)(1 − Y*) log(1 − Ŷ)    (9)

where Ŷ is the predicted value of the score map and Y* is the ground truth. The parameter β is a balance factor between positive and negative samples, given by:

β = 1 − (Σ_{y*∈Y*} y*) / |Y*|    (10)
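A minimal NumPy sketch of the class-balanced cross-entropy of Eqs. (9)-(10); the epsilon clipping is an added numerical-stability assumption, not part of the patent:

```python
import numpy as np

def balanced_bce(y_hat, y_true, eps=1e-7):
    """Eqs. (9)-(10): class-balanced cross-entropy for the score map.
    y_hat: predicted scores in (0, 1); y_true: binary ground-truth map."""
    beta = 1.0 - y_true.mean()               # Eq. (10): fraction of negatives
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    loss = -(beta * y_true * np.log(y_hat)
             + (1.0 - beta) * (1.0 - y_true) * np.log(1.0 - y_hat))
    return loss.mean()
```

Because text pixels are usually a small minority of the image, β is close to 1, so positive pixels are up-weighted relative to the abundant negatives.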
in order to generate accurate geometric predictions for large and small text regions, keeping the regression loss scale unchanged, the rotated rectangular box RBox regression portion employs the IoU loss function because it is fixed for objects of different scales, whose expression is:
Figure BDA0002166919960000054
wherein
Figure BDA0002166919960000055
Expressed as predicted geometric shape, R is its corresponding true shape, intersecting rectangles
Figure BDA0002166919960000056
Respectively, the width and height of (a):
Figure BDA0002166919960000057
wherein d is 1 ,d 2 ,d 3 And d 4 Representing the distance of the pixel to the upper, right, lower and left boundaries of its corresponding rectangle, respectively. The union region is given by the following equation:
Figure BDA0002166919960000061
from this the intersection or union region is calculated, the rotation angle loss is calculated as follows:
Figure BDA0002166919960000062
in the formula (I), the compound is shown in the specification,
Figure BDA0002166919960000063
is a prediction of the angle of rotation, theta * Representing the actual value. Finally, the total geometric loss is calculated as:
L g =L Rθ L θ (15)
in the experimental process, the invention converts lambda into θ Set to 10.
In the algorithm shown in Fig. 3, the key part is a neural network model incorporating the attention modules, which is trained to predict the existence of text instances and their geometry directly from full images. The model is a fully convolutional neural network suited to text detection that outputs dense per-pixel word or text-line predictions, which eliminates intermediate steps such as candidate proposals, text region formation and word segmentation. The post-processing steps include only thresholding of the predicted geometry and NMS. As applied to text detection, the algorithm mainly comprises three parts: a feature extraction network, a feature fusion network and an output layer:
1. Feature extraction network: the convolutional neural network is first pre-trained on an ICDAR data set to generate initialization parameters for the neural network model. Then, in the feature extraction stage, based on a PVANet model, four levels of feature maps with sizes of 1/32, 1/16, 1/8 and 1/4 of the input image are extracted through convolution operations. The spatial attention feature of each feature map, used to focus on text features, is then computed with the spatial attention module; it is denoted f_i (i = 1, 2, 3, 4) and serves as the output for feature merging;
2. Feature fusion network: in this network, the features extracted by the feature extraction network are combined layer by layer (see the first sketch following this list), with the calculation formulas:

g_i = unpool(h_i) (i ≤ 3),  g_i = conv_{3×3}(h_i) (i = 4)    (16)

h_i = f_i (i = 1),  h_i = conv_{3×3}(conv_{1×1}([g_{i−1}; f_i])) (i > 1)    (17)

In each merging stage, the feature map from the previous stage is first fed into an up-sampling layer to enlarge its size, then passed through the channel attention module to focus on text position feature information, and then concatenated with the feature map of the current layer of the feature extraction network. Finally, a Conv1×1 operation reduces the number of channels and the amount of computation, and a Conv3×3 operation fuses the local information to generate the output h_i (i = 1, 2, 3, 4) of the merging stage. After the last merging stage, a Conv3×3 layer generates the final feature map of the merging branch and sends it to the output layer;
3. Output layer: the output layer contains several Conv1×1 operations that project the 32-channel feature map onto a 1-channel score map and a multi-channel geometry map. The geometry map regresses the position of the detected text with a rotated rectangular box: four channels describe the rectangular text box, representing the 4 distances from a pixel position to the top, right, bottom and left boundaries of the rectangle, and one channel represents the rotation angle of the text box (see the second sketch following this list). Finally, the text detected in the image is marked with the generated rotated rectangular box; the detection effect is shown in Fig. 5.
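First sketch, for the merging stage of part 2: one merge step of Eqs. (16)-(17), written as a TensorFlow/Keras illustration under assumed layer sizes, not the patent's exact network. The comment marks where the patent's channel attention module would act.

```python
import tensorflow as tf

def merge_stage(prev, f_i, out_ch):
    """One merging stage: upsample the previous map (the channel attention
    module would act here per the patent), concatenate with the backbone
    feature f_i, then reduce channels with 1x1 and fuse locally with 3x3."""
    g = tf.keras.layers.UpSampling2D(size=2, interpolation="bilinear")(prev)  # unpool
    x = tf.keras.layers.Concatenate(axis=-1)([g, f_i])
    x = tf.keras.layers.Conv2D(out_ch, 1, padding="same", activation="relu")(x)
    return tf.keras.layers.Conv2D(out_ch, 3, padding="same", activation="relu")(x)

# Usage with dummy maps: f_2 has twice the spatial size of h_1
h_1 = tf.random.normal([1, 16, 16, 128])
f_2 = tf.random.normal([1, 32, 32, 64])
h_2 = merge_stage(h_1, f_2, out_ch=64)  # -> shape (1, 32, 32, 64)
```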
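Second sketch, for the output layer of part 3: decoding one pixel's five geometry channels (four boundary distances plus an angle) into the corners of a rotated rectangle. This NumPy illustration assumes the rotation is applied about the pixel position, which the patent text does not specify.

```python
import numpy as np

def decode_rbox(x, y, d, theta):
    """d = (d_top, d_right, d_bottom, d_left): distances from pixel (x, y) to
    the four box boundaries; theta: rotation angle in radians. Returns the
    4 corner coordinates of the rotated rectangle."""
    d1, d2, d3, d4 = d
    # Corners relative to the pixel before rotation (top-left, top-right,
    # bottom-right, bottom-left); the image y-axis points down.
    corners = np.array([[-d4, -d1], [d2, -d1], [d2, d3], [-d4, d3]], dtype=float)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return corners @ rot.T + np.array([x, y], dtype=float)

# Usage: a pixel at (100, 50) inside a 40x16 box rotated by 0.1 rad
print(decode_rbox(100, 50, d=(8, 25, 8, 15), theta=0.1))
```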
Model training: the model proposed by the invention is trained end-to-end with an Adam optimizer, following the training scheme of the EAST algorithm. To accelerate learning, 512×512 samples of the original images are uniformly batched 24 at a time. The Adam learning rate starts at 1e-3, is reduced to one tenth every 27,300 mini-batches, and stops at 1e-5; the network is trained until the performance improvement plateaus.
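The stated schedule can be sketched as a step function of the training iteration (illustrative; the exact decay implementation in the original code is not specified):

```python
def learning_rate(step, base=1e-3, decay_steps=27300, floor=1e-5):
    """Start at 1e-3, divide by 10 every 27,300 mini-batches, stop at 1e-5."""
    return max(base * 0.1 ** (step // decay_steps), floor)

assert learning_rate(0) == 1e-3
assert learning_rate(27300) == 1e-4
assert abs(learning_rate(60000) - 1e-5) < 1e-12  # clamped at the floor
```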
Experimental verification and analysis:
the experimental environment is as follows: the experiment is carried out on an Ubuntu18.04LTS operating system, the development language is Python3.6, the integrated development environment is Pycharm, and the deep learning framework is TensorFlow of a GPU version. The hardware configuration CPU is i7-6700k with four cores and eight threads, the main frequency is 4GHz, the memory is 32GB, the GPU is NVIDIA GTX 1080T, and the video memory is 11G.
The experimental results are as follows: the data set adopted in the experiments is the one used in the ICDAR challenge, which is also a popular data set for text detection algorithms. It contains 1500 pictures: 1000 pictures are used for model training and 500 pictures form the test set. Text regions are annotated by the four vertices of a quadrangle, corresponding to the four-sided geometry of the target text. The pictures were shot casually with mobile phones or cameras, so the text in a scene can lie in any direction and may be affected by the natural environment; these characteristics make the data set well suited to evaluating and verifying text detection algorithms.
The invention compares the detection results of the Attention-EAST algorithm and the EAST algorithm on long text in natural scenes, as shown in Figs. 4-5. It can be seen that adding the attention modules strengthens the extraction of text and direction feature information, enlarges the text detection field of view, and effectively improves the detection of long text. Meanwhile, three indexes, Recall, Precision and the weighted harmonic mean F-measure, are used to evaluate the training effect of the detection method on the ICDAR data set. The experimental results are shown in Table 1; they show that the performance indexes of the method introducing the attention mechanism proposed herein are all improved over the original EAST algorithm.
Table 1 Comparison of experimental results of each text detection algorithm

Algorithm        Recall   Precision   F-measure
Attention-EAST   0.7902   0.8401      0.8144
EAST             0.7831   0.8224      0.8022
In order to analyze the influence of the attention modules on the detection efficiency of the original EAST algorithm, the frames-per-second (FPS) index, which represents the number of pictures processed per second, is adopted in the same experimental environment to evaluate the detection efficiency of both algorithms; the 500 detection pictures in the test set are randomly divided into 5 parts and tested separately. The experimental results are shown in Table 2; it can be seen that adding the attention modules does not lose the detection efficiency of the original algorithm.
Table 2 Text detection efficiency comparison data (FPS) of the two algorithms
(Table 2 appears as an image in the original publication; the per-part FPS values are not reproduced in the text.)
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present invention, and therefore, the scope of the present invention should be determined by the scope of the claims.

Claims (1)

1. A natural scene text detection method introducing an attention mechanism, characterized in that the method comprises the following steps: in the process of down-sampling images with a PVANet network, a spatial attention module is generated from the spatial relationship of the intermediate text feature information and is used to capture importance information for the target area in two-dimensional space; the feature information generated by each convolution satisfies I ∈ R^{1×H×W} and is activated through the sigmoid function, with the expression:

W_S(I) = σ(f^{7×7}(Pool(I)))    (4)

where f^{7×7} is a convolution operation with a 7×7 convolution kernel; in the image up-sampling process, features are extracted by unpooling and used to approximate the target position features to generate a channel attention module, which is then adjusted through a shared MLP network, with the expression:

W_C(I′) = σ(MLP(unpool(I′))) = σ(W_1 W_0 I′)    (5)

where σ is the sigmoid activation function, and W_0 ∈ R^{C/r×C} and W_1 ∈ R^{C×C/r} are the MLP weights; finally, in the process of feature fusion, the channel attention weights and the spatial attention weights form the whole branch attention model; the process is expressed as:

I′ = (W_S(I) + 1) ⊙ I    (6)
I″ = (W_C(I′) + 1) ⊙ I′    (7)

where ⊙ denotes element-wise multiplication of the corresponding matrices, resulting in an attention channel whose element values lie in [0, 1].
CN201910750169.1A 2019-08-14 2019-08-14 Natural scene text detection method introducing attention mechanism Active CN110458165B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910750169.1A CN110458165B (en) 2019-08-14 2019-08-14 Natural scene text detection method introducing attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910750169.1A CN110458165B (en) 2019-08-14 2019-08-14 Natural scene text detection method introducing attention mechanism

Publications (2)

Publication Number Publication Date
CN110458165A CN110458165A (en) 2019-11-15
CN110458165B true CN110458165B (en) 2022-11-08

Family

ID=68486514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910750169.1A Active CN110458165B (en) 2019-08-14 2019-08-14 Natural scene text detection method introducing attention mechanism

Country Status (1)

Country Link
CN (1) CN110458165B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126243B (en) * 2019-12-19 2023-04-07 北京科技大学 Image data detection method and device and computer readable storage medium
CN113311700B (en) * 2020-02-27 2022-10-04 陕西师范大学 UUV cluster cooperative control method guided by non-average mechanism
CN111414875B (en) * 2020-03-26 2023-06-02 电子科技大学 Three-dimensional point cloud head posture estimation system based on depth regression forest
CN111753825A (en) * 2020-03-27 2020-10-09 北京京东尚科信息技术有限公司 Image description generation method, device, system, medium and electronic equipment
CN112749621B (en) * 2020-11-25 2023-06-13 厦门理工学院 Remote sensing image cloud layer detection method based on deep convolutional neural network
CN112446372B (en) * 2020-12-08 2022-11-08 电子科技大学 Text detection method based on channel grouping attention mechanism
CN113052159A (en) * 2021-04-14 2021-06-29 中国移动通信集团陕西有限公司 Image identification method, device, equipment and computer storage medium
CN113255646B (en) * 2021-06-02 2022-10-18 北京理工大学 Real-time scene text detection method
CN113554026A (en) * 2021-07-28 2021-10-26 广东电网有限责任公司 Power equipment nameplate identification method and device and electronic equipment
CN114863437B (en) * 2022-04-21 2023-04-07 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN116636423B (en) * 2023-07-26 2023-09-26 云南农业大学 Efficient cultivation method of poria cocos strain


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376611B (en) * 2018-09-27 2022-05-20 方玉明 Video significance detection method based on 3D convolutional neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288088A (en) * 2018-01-17 2018-07-17 浙江大学 A kind of scene text detection method based on end-to-end full convolutional neural networks
CN109299262A (en) * 2018-10-09 2019-02-01 中山大学 A kind of text implication relation recognition methods for merging more granular informations
CN109165697A (en) * 2018-10-12 2019-01-08 福州大学 A kind of natural scene character detecting method based on attention mechanism convolutional neural networks
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
EAST: An Efficient and Accurate Scene Text Detector; Xinyu Zhou et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017-11-09; 2642-2651 *
Link prediction algorithm based on the attention mechanism; Cheng Hua et al.; Journal of Huazhong University of Science and Technology (Natural Science Edition); 2019-02-28; Vol. 47, No. 02; 109-114 *
Deep learning image object detection combined with an attention mechanism; Sun Ping et al.; Computer Engineering and Applications; 2019-04-29; Vol. 55, No. 17; 180-184 *

Also Published As

Publication number Publication date
CN110458165A (en) 2019-11-15

Similar Documents

Publication Publication Date Title
CN110458165B (en) Natural scene text detection method introducing attention mechanism
CN110276316B (en) Human body key point detection method based on deep learning
TWI773189B (en) Method of detecting object based on artificial intelligence, device, equipment and computer-readable storage medium
CN108647585B (en) Traffic identifier detection method based on multi-scale circulation attention network
CN111126472A (en) Improved target detection method based on SSD
CN111950453B (en) Random shape text recognition method based on selective attention mechanism
CN108304820B (en) Face detection method and device and terminal equipment
CN111126258A (en) Image recognition method and related device
CN111079739B (en) Multi-scale attention feature detection method
CN113780296A (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
KR20200091331A Learning method and learning device for object detector based on cnn, adaptable to customers' requirements such as key performance index, using target object merging network and target region estimating network, and testing method and testing device using the same to be used for multi-camera or surround view monitoring
CN113807361B (en) Neural network, target detection method, neural network training method and related products
CN112464912B (en) Robot end face detection method based on YOLO-RGGNet
CN113781164B (en) Virtual fitting model training method, virtual fitting method and related devices
CN110889421A (en) Target detection method and device
CN114463759A (en) Lightweight character detection method and device based on anchor-frame-free algorithm
CN114241277A (en) Attention-guided multi-feature fusion disguised target detection method, device, equipment and medium
CN111179272B (en) Rapid semantic segmentation method for road scene
CN116645592A (en) Crack detection method based on image processing and storage medium
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN113239866B (en) Face recognition method and system based on space-time feature fusion and sample attention enhancement
CN115546468A (en) Method for detecting elongated object target based on transform
CN112052865A (en) Method and apparatus for generating neural network model
CN113393434A (en) RGB-D significance detection method based on asymmetric double-current network architecture
CN112070040A (en) Text line detection method for video subtitles

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant