CN113516126A: Adaptive threshold scene text detection method based on attention feature fusion

Adaptive threshold scene text detection method based on attention feature fusion

Info

Publication number
CN113516126A
CN113516126A (application CN202110750847.1A)
Authority
CN
China
Prior art keywords
feature
map
image
threshold
text
Prior art date
2021-07-02
Legal status
Pending
Application number
CN202110750847.1A
Other languages
Chinese (zh)
Inventor
胡靖
雷小唐
王小龙
吴锡
Current Assignee
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date
2021-07-02
Filing date
2021-07-02
Publication date
2021-10-19
Application filed by Chengdu University of Information Technology filed Critical Chengdu University of Information Technology
Priority to CN202110750847.1A
Publication of CN113516126A
Current legal status: Pending


Classifications

    • G06F 18/253: Pattern recognition; analysing; fusion techniques of extracted features
    • G06N 3/045: Neural networks; architecture; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G06T 7/13: Image analysis; segmentation; edge detection
    • G06T 7/136: Image analysis; segmentation involving thresholding
    • G06T 2207/20081: Special algorithmic details; training; learning
    • G06T 2207/20084: Special algorithmic details; artificial neural networks [ANN]
    • G06T 2207/20104: Interactive image processing based on input by user; interactive definition of region of interest [ROI]

Abstract

The invention relates to an adaptive threshold scene text detection method based on attention feature fusion. A scene picture is captured by a device and input into a constructed neural network comprising three processing modules. The feature extraction module extracts features from the picture; a channel attention module added to its convolutional layers dynamically modulates the weight of each channel, enhancing text feature weights to improve the representational capacity of the network. Features of different scales obtained by downsampling in the feature extraction module are fused in a pyramid fashion, combining the high resolution of low-level features with the semantic information of high-level features to improve segmentation robustness. The image segmentation module predicts a probability map and a threshold map from the fused features, learns the optimal per-pixel threshold of the probability map within the network, and finally applies that threshold to the probability map to generate an optimal binary map, yielding the detected text regions.

Description

Adaptive threshold scene text detection method based on attention feature fusion
Technical Field
The invention relates to the field of digital image processing, in particular to an adaptive threshold scene text detection method based on attention feature fusion.
Background
Natural scene text detection has a wide range of applications and important prospects in military affairs, autonomous driving, scene recognition, computer vision, and other areas. For example, scene text detection helps visually impaired people understand their surroundings, and the detected characters allow a machine to make correct judgments; improving detection quality would greatly improve the performance of pattern recognition in computer vision. The conventional approach extracts manually designed features and separates background from text regions with a classification algorithm to obtain the detected target region; it is limited by constraints such as very high engineering cost and handles text detection against the complex backgrounds of real life poorly. Improving multi-scene text detection from the standpoint of software and algorithms has therefore become a hot research topic in image processing, computer vision, and several other fields.
Scene text detection uses methods of signal and image processing: background interference regions in the image are suppressed by a software algorithm so that the text regions present in the scene are detected. Existing methods for natural scene text detection broadly divide into candidate-box-based methods and instance-segmentation-based methods.
Candidate-box-based detection methods improve upon mainstream object detection algorithms. The RRPN algorithm builds on Faster R-CNN detection and proposes a Rotation Region Proposal Network (RRPN); this module mainly generates inclined text regions, and the candidate regions are finally mapped onto the feature map through an RRoI layer to obtain the detected text region. This provides a solution for scene text detection from the direction of object detection, but a general-purpose algorithm cannot effectively handle the diversity of text: multi-character regions within a text line are often detected as single characters. Most of these methods yield low recall when detecting natural scene text against complex backgrounds.
Text detection algorithms based on instance segmentation perform better than those improved from object detection and solve the problem of the diverse sizes and shapes of scene characters, but their results on scene text with background interference are not ideal: regions where the background resembles characters are easily detected as text, and the conventional post-processing for segmenting text regions is complex, which reduces inference speed and limits practical application.
Traditional scene text detection relies on manually designed and extracted features; the expressive power of the learned model is limited, the workload is huge, and the algorithmic complexity is high. With the breakthroughs of deep learning in computer vision, researchers have introduced deep neural networks to the text detection task, building deep multi-feature fusion networks trained end to end to handle the variability of text and the complexity of real scenes.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an adaptive threshold scene text detection method based on attention feature fusion. An end-to-end network is constructed that extracts features based on channel attention, fuses multi-dimensional feature information, then learns an adaptive segmentation threshold, and finally outputs the segmented target region. The specific steps comprise:
step 1: acquiring a scene text image, adjusting the size of the acquired scene text image to 720 × 720, and inputting the scene text image into a constructed neural network for training, wherein the neural network comprises a feature extraction module, a feature fusion module and an image segmentation module which are sequentially connected;
step 2: feature extraction, wherein the feature extraction module suppresses the background information of the scene text image using channel attention; the specific steps are:
step 21: the scene text image passes through a convolution layer of the neural network, and a convolution layer operator is used for extracting spatial semantic features of the scene text image to obtain a feature image;
step 22: during feature extraction, each pass through a convolutional stage downsamples the feature image, yielding 1/2, 1/4, 1/8, 1/16 and 1/32 scale feature maps in sequence;
step 23: after each downsampling process, a channel attention module introduced into the convolutional layer processes the downsampled feature map, so that adjacent K channels participate in the attention prediction of the current channel, the weight of each channel is dynamically modulated, the text feature weight is enhanced to improve the expression capacity of the network, the background information is filtered, and a feature image with the background information filtered is obtained;
step 3: feature fusion, which fuses feature information of different dimensions in preparation for the subsequent segmentation. The feature image with background information filtered out in step 2 is input into the feature fusion module. First feature maps at 1/4, 1/8, 1/16 and 1/32 scale are obtained from the feature image after 4-fold, 8-fold, 16-fold and 32-fold downsampling; second feature maps at 1/16, 1/8 and 1/4 scale are obtained from the 1/32 first feature map after 2-fold, 4-fold and 8-fold upsampling. The first and second feature maps are added and fused at corresponding dimensions to obtain enhanced feature information, the channels of the different scales are then concatenated, and a 1x1 convolution rearranges and combines the concatenated features to obtain a third feature map F;
step 4: image segmentation. The fused third feature map F is used to predict a probability map and a threshold map; the network learns the optimal threshold for each pixel of the probability map, the optimal threshold is finally applied to the probability map to generate the optimal binary map, and the detected text region is output. Combining the two maps, the optimal segmentation threshold is solved with a differentiable binarization function. The specific steps comprise:
step 41: the third feature map F is input into the image segmentation module for processing. The probability map branch of the neural network judges, for each pixel, the probability that it is text, giving the probability map; the threshold map branch of the neural network processes F to give the threshold map, in which a threshold is set according to the gray value of each image pixel;
step 42: combining the probability map and the threshold map, adaptive learning with a differentiable binarization function yields the optimal adaptive threshold;
step 43: the optimal binary map is obtained: each pixel value P of the probability map is compared with the optimal adaptive threshold T; where P ≥ T the pixel is set to 1 and counted as a valid text region, otherwise it is set to 0 as an invalid region, thereby segmenting text regions from background. In essence, threshold binarization of the probability map yields an approximate binary map of the character regions.
Step 5: a loss value is computed between the text regions of the approximate binary map obtained in step 43 and the ground-truth text boxes, and the constructed neural network is iteratively updated with the loss function, whose mathematical expression is:

$$\mathrm{Loss} = L_s + \alpha \times L_b + \beta \times L_t$$

where $L_s$ is the probability map loss value, $L_b$ is the threshold map loss value, and $L_t$ is the binarized threshold map loss value; before Loss converges, network parameters are updated and steps 2 to 5 are repeated.
Step 6: when Loss converges, the constructed neural network has finished the iterative learning of steps 2 to 5 and has learned the optimal binarization threshold. Contours are computed on the finally output approximate binary map, a bounding rectangle is computed for each contour, and the predicted box of that rectangle is then computed.
The invention has the beneficial effects that:
1. The method uses channel attention: the weight of each image channel is dynamically modulated inside the neural network, enhancing text feature weights to improve the representational capacity of the network. This yields good results even in complex scenes.
2. Downsampled features of different scales, with background suppressed by channel attention, are fused; the fused features carry rich semantic and spatial information, which greatly improves prediction performance.
3. Most existing scene text detection methods adopt an object detection approach; faced with the variability of text, they tend to detect single characters as whole texts, leading to incomplete detections, and their generalization is poor. The present method instead binarizes the probability map with per-pixel adaptive thresholds learned by the network; it adapts to text wherever characters occur, generalizes strongly, and detects character regions well.
Drawings
FIG. 1 is a schematic flow chart of a scene detection method of the present invention;
FIG. 2 is a schematic structural diagram of a neural network constructed by the detection method of the present invention;
FIG. 3 is a block diagram of a channel attention module;
FIG. 4 is a schematic diagram of a feature fusion module architecture;
FIG. 5 is a probability map and threshold map for one embodiment; and
FIG. 6 is a graph comparing the effects of the present invention and the prior art method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
The following detailed description is made with reference to the accompanying drawings.
Module one denotes the feature extraction module, module two the feature fusion module, and module three the image segmentation module. The spatial semantic features include spatial features and semantic information.
the invention constructs an end-to-end network based on a deep learning theory, the network dynamically adjusts text feature weight based on the attention of a channel, the obtained features are modeled and fused to obtain high semantic and rich spatial information, the fused features have very obvious improvement on segmentation prediction, a probability graph and a threshold graph are obtained through prediction, in order to adapt to the variability of characters, an optimal segmentation threshold value of each pixel is further obtained through learning the threshold value of each pixel in the probability graph through the network dynamics, the optimal segmentation threshold value acts on the probability graph to generate an optimal binary graph, and the method has very small influence on scene change and strong generalization.
Aiming at the defects of the prior art, the invention provides an adaptive threshold scene text detection method based on deep-learning attention feature fusion; fig. 1 is a flow chart of the method. An end-to-end network is constructed that extracts features based on channel attention, fuses multi-dimensional feature information, then learns an adaptive segmentation threshold, and finally outputs the segmented target region. The method specifically comprises the following steps:
step 1: scene text images are collected, resized to 720 × 720, and input into the constructed neural network for training; the neural network comprises a feature extraction module, a feature fusion module and an image segmentation module connected in sequence. Fig. 2 is a schematic structural diagram of the neural network constructed by the detection method of the invention, described in further detail below with reference to fig. 2.
Step 2: feature extraction, wherein the feature extraction module suppresses the background information of the scene text image using channel attention; the specific steps are:
step 21: the scene text image passes through the convolutional layers of the neural network, and its spatial semantic features are extracted by the convolution operators to obtain a feature image.
Step 22: in the process of extracting the features, the characteristic images are subjected to downsampling processing of 1/2, 1/4, 1/8, 1/16 and 1/32 in sequence every time the convolutional layers pass.
Step 23: after each downsampling process, a channel attention module introduced into the convolutional layer processes the downsampled feature map, enables adjacent K channels to participate in attention prediction of the current channel, dynamically modulates the weight of each channel, enhances the text feature weight to improve the expression capacity of the network, and achieves filtering of background information to obtain a feature image with the background information filtered.
Fig. 3 is a block diagram of the channel attention module's cross-channel information processing. As shown in fig. 3, the channel attention module applies global average pooling to the downsampled feature map and outputs 1 × 1 × C first channel information, where C reflects the numerical distribution of the feature map. A 1-dimensional convolution then lets the K surrounding channels interact and participate in the attention prediction of the current channel, where K is the size of the 1-dimensional convolution kernel; this predicts the relative importance of the target feature weights in the feature map and outputs 1 × 1 × C second channel information. The product of the second channel information and the feature map highlights the important targets in the feature map, reducing the interference of the background and of irrelevant factors. In filtering the spatial semantic features extracted by the convolutional layers, the channel attention module thus enhances the weight of the target objects to be detected and reduces background interference.
The channel attention convolution module provides a local cross-channel interaction strategy without dimensionality reduction. After channel-wise global average pooling without dimensionality reduction, local cross-channel interaction information is captured by considering each channel of the image together with its k neighbors. This avoids the damage that dimensionality reduction does to learning channel attention, improves the representational capacity of the network through feature weights learned iteratively in the network, and suppresses background information. The channel weights are adjusted as follows:

$$\omega_i = \sigma\Big(\sum_{j=1}^{k} w_i^j\, y_i^j\Big), \quad y_i^j \in \Omega_i^k$$

where $\omega_i$ denotes the learned channel attention of channel $i$, $w_i^j y_i^j$ represents the interaction information between channel $i$ of the image and its surrounding channels $j$, and $\Omega_i^k$ denotes the set of $k$ channels adjacent to channel $i$. To further improve performance, all channels can also share the same weight information,

$$\omega_i = \sigma\Big(\sum_{j=1}^{k} w^j\, y_i^j\Big), \quad y_i^j \in \Omega_i^k,$$

which is realized efficiently by a 1-dimensional convolution of kernel size $k$. The coverage of cross-channel information interaction (i.e. the kernel size $k$ of the 1-dimensional convolution) is proportional to the channel dimension $C$, so there is a mapping between $k$ and $C$; a linear function is too limited for the relevant features, and since the channel dimension is usually an exponential multiple of 2, an exponential function with base 2 is used to represent the nonlinear mapping:

$$C = \phi(k) = 2^{(r \cdot k - b)}$$

where $r$ and $b$ are constant coefficients. Therefore, given a channel dimension $C$, the convolution kernel size $k$ can be calculated according to:

$$k = \psi(C) = \left| \frac{\log_2 C}{r} + \frac{b}{r} \right|_{\mathrm{odd}}$$

where $|\cdot|_{\mathrm{odd}}$ denotes taking the nearest odd number.
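To make this concrete, the following is a minimal PyTorch sketch of such a channel attention module following the formulas above; the class name, the coefficients r = 2 and b = 1, and the tensor layout are illustrative assumptions rather than values fixed by the patent.

```python
import math
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention via local cross-channel interaction without
    dimensionality reduction: global average pooling, then a 1-D conv
    of kernel size k over the channel descriptor, then a sigmoid."""

    def __init__(self, channels: int, r: int = 2, b: int = 1):
        super().__init__()
        # k = |log2(C)/r + b/r|_odd : nearest odd kernel size for C channels
        t = int(abs(math.log2(channels) / r + b / r))
        k = t if t % 2 == 1 else t + 1
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, C, H, W)
        y = self.pool(x)                          # (N, C, 1, 1) channel descriptor
        y = y.squeeze(-1).transpose(1, 2)         # (N, 1, C): channels as a sequence
        y = self.conv(y)                          # each channel attends to its k neighbours
        w = torch.sigmoid(y).transpose(1, 2).unsqueeze(-1)  # (N, C, 1, 1) weights
        return x * w                              # re-weight the feature map channels
```

Per step 23, a module of this kind is inserted after each downsampling stage, so each stage's feature map is re-weighted channel by channel before fusion.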
Fig. 4 is a schematic structural diagram of the feature fusion module. Step 3: feature fusion fuses feature information of different dimensions in preparation for the subsequent segmentation. The feature image with background information filtered out in step 2 is input into the feature fusion module. First feature maps at 1/4, 1/8, 1/16 and 1/32 scale are obtained by downsampling the feature image 4, 8, 16 and 32 times; meanwhile, after the 1/32 first feature map passes through a convolutional layer, upsampling by 2, 4 and 8 times gives second feature maps at 1/16, 1/8 and 1/4 scale. The first and second feature maps are added and fused at corresponding dimensions to obtain enhanced feature information; the channels of the different scales are then concatenated, and a 1x1 convolution rearranges and combines the concatenated features into a third feature map F, which fuses the semantic information of feature maps of different scales. The enhanced third feature map F improves the precision of the subsequent image segmentation module. This pyramid-style fusion of the differently scaled downsampled features from the feature extraction module combines the high resolution of low-level features with the semantic information of high-level features and improves segmentation robustness. A sketch of such a fusion module follows.
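The following is a minimal PyTorch sketch of this pyramid-style fusion under stated assumptions: the per-scale channel counts (64, 128, 256, 512), the common channel width, and nearest-neighbor upsampling are placeholders, since the text fixes only the scales, the addition step, the concatenation, and the final 1x1 convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Fuse 1/4, 1/8, 1/16, 1/32 feature maps: upsample the 1/32 map by
    2x/4x/8x, add it to the matching scales, concatenate all scales at
    1/4 resolution, and combine them with a 1x1 convolution."""

    def __init__(self, in_channels=(64, 128, 256, 512), mid_channels=64):
        super().__init__()
        # 1x1 convs bring every scale to a common channel count
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, mid_channels, kernel_size=1) for c in in_channels)
        # 1x1 conv rearranges/combines the concatenated features into F
        self.fuse = nn.Conv2d(4 * mid_channels, mid_channels, kernel_size=1)

    def forward(self, c2, c3, c4, c5):  # scales: 1/4, 1/8, 1/16, 1/32
        p2, p3, p4, p5 = (l(c) for l, c in zip(self.lateral, (c2, c3, c4, c5)))
        # second feature maps: the 1/32 map upsampled 2x/4x/8x, added in
        p4 = p4 + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = p3 + F.interpolate(p5, scale_factor=4, mode="nearest")
        p2 = p2 + F.interpolate(p5, scale_factor=8, mode="nearest")
        # concatenate the channels of all scales at 1/4 resolution
        size = p2.shape[-2:]
        cat = torch.cat([p2,
                         F.interpolate(p3, size=size, mode="nearest"),
                         F.interpolate(p4, size=size, mode="nearest"),
                         F.interpolate(p5, size=size, mode="nearest")], dim=1)
        return self.fuse(cat)  # third feature map F
```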
Step 4: image segmentation. The fused third feature map F is used to predict a probability map and a threshold map; the network learns the optimal threshold for each pixel of the probability map, the optimal threshold is finally applied to the probability map to generate the optimal binary map, and the detected text region is output. Combining the two maps, the optimal segmentation threshold is solved with a differentiable binarization function. The specific steps comprise:
step 41: the third feature map F is input into the image segmentation module for processing. The probability map branch of the neural network judges, for each pixel, the probability that it is text, giving the probability map; the threshold map branch of the neural network processes F to give the threshold map, in which a threshold is set according to the gray value of each image pixel, as shown in fig. 5.
Step 42: and combining the probability map and the threshold map, and performing adaptive learning by using a differentiable and binaryzation function to obtain an optimal adaptive threshold.
The predicted output of the threshold value image is the threshold value of the pixel point of the character boundary frame, in order to learn the threshold value of each pixel in the probability image, the pixel P of the probability image and the threshold value T of the pixel point in the threshold value image are brought into a differentiable and binarizable function to carry out self-adaptive learning, and the self-adaptive threshold value T of the probability image is learned through the pixel point P.
The mathematical expression of the differentiable binarization function is:

$$\hat{B}_{i,j} = \frac{1}{1 + e^{-k\,(P_{i,j} - T_{i,j})}}$$

where $\hat{B}$ denotes the predicted approximate binary map, $T_{i,j}$ is the optimal adaptive threshold to be learned by the neural network, $P_{i,j}$ denotes the probability at the current pixel, and $k$ is an amplification factor.
In the traditional binarization procedure the binarization function is not differentiable, and the segmented image is poor. To strengthen the generalization of character detection, the binarization function is modified into a differentiable form so that it can be learned iteratively within the network; experiments set the value of k in the network to 55. Compared with the traditional binarization function, this function is differentiable everywhere and highly flexible: every pixel can be adaptively binarized within the network, and the adaptive (optimal) threshold of each pixel is learned by the network, so the final thresholding of the probability map generalizes far better.
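A minimal sketch of this differentiable binarization in PyTorch; the formula and k = 55 come from the text above, while the function name and tensor types are assumptions.

```python
import torch

def differentiable_binarization(prob_map: torch.Tensor,
                                thresh_map: torch.Tensor,
                                k: float = 55.0) -> torch.Tensor:
    """Approximate binary map: B = 1 / (1 + exp(-k * (P - T))).
    Differentiable everywhere, so the threshold map T can be
    learned jointly with the probability map P."""
    return torch.sigmoid(k * (prob_map - thresh_map))
```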
Step 43: and acquiring an optimal binary image, comparing each pixel value P with an optimal adaptive threshold T in the probability image according to the optimal adaptive threshold, setting the pixel value of the probability image to be 1 when P is larger than or equal to T, and determining the probability image to be a valid text region, otherwise, setting the probability image to be 0, and determining the probability image to be an invalid region, thereby realizing the segmentation of the text region and the background region. The method mainly comprises the step of carrying out threshold value binarization processing on a probability map to obtain a character area approximate binary map.
Step 5: a loss value is computed between the text regions of the approximate binary map obtained in step 43 and the ground-truth text boxes, and the constructed neural network is iteratively updated with the loss function, whose mathematical expression is:

$$\mathrm{Loss} = L_s + \alpha \times L_b + \beta \times L_t$$

where $L_s$ is the probability map loss value, $L_b$ is the threshold map loss value, and $L_t$ is the binarized threshold map loss value; before Loss converges, network parameters are updated and steps 2 to 5 are repeated.
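For illustration, a sketch of the combined loss in PyTorch under stated assumptions: the text specifies only the weighted sum, so the individual terms (binary cross-entropy for the probability and binarized maps, L1 for the threshold map) and the weights alpha and beta are placeholders.

```python
import torch.nn.functional as F

def detection_loss(prob_map, thresh_map, binary_map,
                   gt_prob, gt_thresh, gt_binary,
                   alpha=1.0, beta=10.0):
    """Loss = L_s + alpha * L_b + beta * L_t."""
    l_s = F.binary_cross_entropy(prob_map, gt_prob)      # probability map loss
    l_b = F.l1_loss(thresh_map, gt_thresh)               # threshold map loss
    l_t = F.binary_cross_entropy(binary_map, gt_binary)  # binarized threshold map loss
    return l_s + alpha * l_b + beta * l_t
```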
Step 6: when Loss converges, the constructed neural network has finished the iterative learning of steps 2 to 5 and has learned the optimal binarization threshold. Contours are computed on the approximate binary map finally output by step 4, a bounding rectangle is computed for each contour, and the predicted box of that rectangle is then computed.
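This contour-to-box post-processing can be sketched with OpenCV as follows; the function name and the min_area filter are assumptions, while the contour, bounding-rectangle, and predicted-box flow follows the text.

```python
import cv2
import numpy as np

def boxes_from_binary_map(binary_map: np.ndarray, min_area: float = 10.0):
    """Compute contours on the approximate binary map, fit a bounding
    rectangle to each contour, and return the rectangles' corner points
    as the predicted text boxes."""
    bitmap = (binary_map * 255).astype(np.uint8)
    contours, _ = cv2.findContours(bitmap, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        rect = cv2.minAreaRect(contour)          # rotated bounding rectangle
        w, h = rect[1]
        if w * h < min_area:                     # drop tiny spurious regions
            continue
        boxes.append(cv2.boxPoints(rect).astype(np.int32))  # 4 corner points
    return boxes
```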
FIG. 6 compares the method with a prior art method: the first row shows detection samples produced by an existing detection method, and the second row shows samples produced by the detection method of the invention. The method copes well with the variability of the detected characters and adapts to curved-text scenes and scenes with complex backgrounds. The experimental results show that the method detects real-life scene text very well, and the detected text boxes adapt well to character variability and irrelevant background interference, giving excellent detection performance.
It should be noted that the above-mentioned embodiments are exemplary, and that those skilled in the art, having benefit of the present disclosure, may devise various arrangements that are within the scope of the present disclosure and that fall within the scope of the invention. It should be understood by those skilled in the art that the present specification and figures are illustrative only and are not limiting upon the claims. The scope of the invention is defined by the claims and their equivalents.

Claims (1)

1. An adaptive threshold scene text detection method based on attention feature fusion, characterized by comprising constructing an end-to-end network that extracts features based on channel attention, fuses multi-dimensional feature information, then learns an adaptive segmentation threshold, and finally outputs the segmented target region, with the following specific steps:
step 1: acquiring a scene text image, adjusting the size of the acquired scene text image to 720 × 720, and inputting the scene text image into a constructed neural network for training, wherein the neural network comprises a feature extraction module, a feature fusion module and an image segmentation module which are sequentially connected;
step 2: feature extraction, wherein the feature extraction module suppresses the background information of the scene text image using channel attention; the specific steps are:
step 21: the scene text image passes through a convolution layer of the neural network, and a convolution layer operator is used for extracting spatial semantic features of the scene text image to obtain a feature image;
step 22: during feature extraction, each pass through a convolutional stage downsamples the feature image, yielding 1/2, 1/4, 1/8, 1/16 and 1/32 scale feature maps in sequence;
step 23: after each downsampling process, a channel attention module introduced into the convolutional layer processes the downsampled feature map, so that adjacent K channels participate in the attention prediction of the current channel, the weight of each channel is dynamically modulated, the text feature weight is enhanced to improve the expression capacity of the network, the background information is filtered, and a feature image with the background information filtered is obtained;
step 3: feature fusion, fusing feature information of different dimensions to prepare for the subsequent segmentation: inputting the feature image with background information filtered out in step 2 into the feature fusion module; obtaining first feature maps at 1/4, 1/8, 1/16 and 1/32 scale from the feature image after 4-fold, 8-fold, 16-fold and 32-fold downsampling; simultaneously obtaining second feature maps at 1/16, 1/8 and 1/4 scale from the 1/32 first feature map after 2-fold, 4-fold and 8-fold upsampling; adding and fusing the first and second feature maps at corresponding dimensions to obtain enhanced feature information; then concatenating the channels of the different scales and rearranging and combining the concatenated features with a 1x1 convolution to obtain a third feature map F;
step 4: image segmentation, predicting a probability map and a threshold map from the fused third feature map F, learning the optimal threshold of the probability map pixels through the network, finally generating the optimal binary map on the probability map using the optimal threshold and outputting the detected text region, the optimal segmentation threshold being solved by combining the two maps with a differentiable binarization function, the specific steps comprising:
step 41: inputting the third feature map F into the image segmentation module for processing, wherein the probability map branch of the neural network judges, for each pixel, the probability that it is text to obtain a probability map, and the threshold map branch of the neural network processes the third feature map F to obtain a threshold map in which a threshold is set according to the gray value of each image pixel;
step 42: combining the probability map and the threshold map, and performing adaptive learning with a differentiable binarization function to obtain an optimal adaptive threshold;
step 43: obtaining the optimal binary map, comparing each pixel value P of the probability map with the optimal adaptive threshold T, setting the pixel value to 1 as a valid text region where P ≥ T and to 0 as an invalid region otherwise, thereby segmenting the text region from the background region;
step 5: computing a loss value between the text region of the approximate binary map detected in step 4 and the ground-truth text boxes, and iteratively updating the constructed neural network with the loss function, whose mathematical expression is:

$$\mathrm{Loss} = L_s + \alpha \times L_b + \beta \times L_t$$

where $L_s$ is the probability map loss value, $L_b$ is the threshold map loss value, and $L_t$ is the binarized threshold map loss value; before Loss converges, updating the network parameters and repeating steps 2 to 5;
step 6: when Loss converges, the constructed neural network has learned the optimal binarization threshold through the iterative learning of steps 2 to 5; computing contours on the finally output approximate binary map, computing a bounding rectangle for each contour, then computing the predicted box of that rectangle, and finally outputting the detected text region.
CN202110750847.1A 2021-07-02 2021-07-02 Adaptive threshold scene text detection method based on attention feature fusion Pending CN113516126A (en)

Priority Applications (1)

Application CN202110750847.1A, priority date 2021-07-02, filed 2021-07-02: Adaptive threshold scene text detection method based on attention feature fusion

Applications Claiming Priority (1)

Application CN202110750847.1A, priority date 2021-07-02, filed 2021-07-02: Adaptive threshold scene text detection method based on attention feature fusion

Publications (1)

Publication: CN113516126A, published 2021-10-19

Family

ID=78066375

Family Applications (1)

Application CN202110750847.1A (pending): Adaptive threshold scene text detection method based on attention feature fusion

Country Status (1)

Country Link
CN (1) CN113516126A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114240968A (en) * 2021-12-17 2022-03-25 联通(上海)产业互联网有限公司 Self-supervision deep learning algorithm for segmenting abnormal region in image
CN114419449A (en) * 2022-03-28 2022-04-29 成都信息工程大学 Self-attention multi-scale feature fusion remote sensing image semantic segmentation method
CN114419449B (en) * 2022-03-28 2022-06-24 成都信息工程大学 Self-attention multi-scale feature fusion remote sensing image semantic segmentation method
CN114758332A (en) * 2022-06-13 2022-07-15 北京万里红科技有限公司 Text detection method and device, computing equipment and storage medium
CN115019143A (en) * 2022-06-16 2022-09-06 湖南大学 Text detection method based on CNN and Transformer mixed model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination