CN114529894A - Rapid scene text detection method fusing hole convolution - Google Patents

Rapid scene text detection method fusing hole convolution

Info

Publication number
CN114529894A
CN114529894A (application CN202210046573.2A)
Authority
CN
China
Prior art keywords
convolution
text detection
module
scene text
text
Prior art date
Legal status
Pending
Application number
CN202210046573.2A
Other languages
Chinese (zh)
Inventor
谭钦红
江一峰
黄俊
Current Assignee
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date: 2022-01-14
Filing date: 2022-01-14
Publication date: 2022-05-24
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202210046573.2A
Publication of CN114529894A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks


Abstract

The invention discloses a rapid scene text detection method fusing hole convolution. The method comprises: acquiring a text detection training data set and performing label generation on the training data set; establishing a preliminary rapid scene text detection model fusing hole convolution, the preliminary model comprising a lightweight feature extraction module, a hole convolution module and a differentiable binarization module; training the preliminary model on the label-generated training data set, using a loss function to calculate loss values and adjust the parameters of the preliminary model, thereby obtaining the rapid scene text detection model fusing hole convolution; and detecting the text in scene images using the obtained model. The invention replaces a large backbone network with a lightweight neural network for extracting features from the input image, which reduces the number of network model parameters and effectively improves the efficiency of the text detection network model. To compensate for the insufficient features extracted by the lightweight neural network, a hole convolution module is added to the feature fusion layer, and a channel attention mechanism is used to fuse and screen the features, improving the utilization of features within the network. The method thus achieves rapid scene text detection while maintaining a high level of text detection accuracy.

Description

Rapid scene text detection method fusing hole convolution
Technical Field
The invention relates to the field of image processing, in particular to a rapid scene text detection method fusing hole convolution (dilated convolution).
Background
With the rapid development of the economy and society and the widespread adoption of intelligent terminals, the channels through which people acquire information have become increasingly diverse, and images and videos have become the main media of information dissemination. Unlike ordinary visual elements in images, the text appearing in an image is important content that often plays a crucial role in conveying visual information. Detecting and recognizing the text appearing in images helps people analyze and understand the deeper information contained in scene images, which promotes the development and application of technologies such as image search, autonomous driving and office automation, and provides great convenience for people's work and daily life.
Text in natural scenes has many characteristics that hinder detection, such as complex backgrounds, large variations in text scale, diverse text shapes and indistinct text edges; accurately detecting text in natural scene images by machine therefore remains a very challenging task.
Traditional scene text detection methods rely on hand-crafted features: traditional feature extraction is performed on the input image with methods such as LBP, DPM and HOG, and the extracted features are then classified by a specific classifier or by heuristic rules. The two most typical categories are connected-component-based methods and sliding-window-based methods. Because these methods depend heavily on hand-designed features, they cannot effectively cope with variations in objective factors such as illumination intensity, image quality and text background, so their robustness is poor.
In recent years, the successful application of deep learning, in particular deep convolutional neural networks, in computer vision has advanced research on natural scene text detection. Such methods usually train a network model based on a deep convolutional neural network on a specific data set to automatically extract the basic features of the input image, and then obtain the final text regions through a series of post-processing algorithms. Compared with traditional scene text detection algorithms, this approach effectively avoids the limitations of hand-designed features. At present, deep-learning-based scene text detection algorithms generally adopt a large deep neural network as the backbone for feature extraction; the detection accuracy is remarkable, but the detection model is huge, which hinders porting of the algorithm. In practical applications, text detection in natural scenes must consider detection efficiency as well as accuracy, and an oversized scene text detection model inevitably detects slowly, which limits the application of such algorithms in real life.
Disclosure of Invention
Current deep-learning-based scene text detection methods generally adopt a large deep neural network as the backbone for feature extraction, and the huge text detection model makes detection too slow, which hinders the practical application of scene text detection algorithms. To address this problem, the invention provides a rapid scene text detection method fusing hole convolution, which comprises the following steps:
S1, acquiring a text detection training data set, and performing label generation on the training data set;
S2, establishing a preliminary rapid scene text detection model fusing hole convolution; the preliminary model comprises a lightweight feature extraction module, a hole convolution module and a differentiable binarization module;
S3, training the preliminary model established in step S2 on the training data set with labels generated in step S1, using a loss function to calculate loss values and train the preliminary model, to obtain the rapid scene text detection model fusing hole convolution;
and S4, detecting the text in scene images using the rapid scene text detection model fusing hole convolution obtained in step S3.
In step S1, a text detection training data set is acquired and label generation is performed on it; specifically, labels are generated from the original annotations of the public data sets ICDAR2015 and CTW1500.
The preliminary rapid scene text detection model fusing hole convolution in step S2 comprises a lightweight feature extraction module, a hole convolution module and a differentiable binarization module. Specifically, the lightweight feature extraction module adopts EfficientNet-b3 as the backbone network to extract features from the input image and construct a feature pyramid structure; a hole convolution module is added to the feature fusion layer; and the two parts of features are fused and then connected to the differentiable binarization module.
In step S3, the preliminary model established in step S2 is trained on the training data set with labels generated in step S1, and a loss function is used to calculate loss values and train the preliminary model to obtain the rapid scene text detection model fusing hole convolution. Specifically, the text detection model is obtained by training with the following steps:
S3.1, the label-generated text images are input to the lightweight backbone network EfficientNet-b3, and the feature maps of the first to fifth stages are extracted to construct a feature pyramid structure;
S3.2, the hole convolution module processes the 1/16-scale feature map extracted in step S3.1 with hole convolutions at hole rates of 1, 6, 12 and 18, respectively, to obtain hole convolution features;
S3.3, the feature fusion layer fuses the features generated in steps S3.1 and S3.2, and a channel attention mechanism is used to fuse and screen the features;
S3.4, the probability map (P) and the threshold map (T) are predicted from the fused feature map generated in step S3.3; the differentiable binarization module combines the probability map and the threshold map to obtain an approximate binary map (B), adaptively predicting the threshold at each position in the image; in the inference stage, the bounding boxes of the text regions are obtained from the approximate binary map B;
S3.5, when the probability map P and the threshold map T are predicted from the fused feature map in step S3.4, the following formula is adopted as the prediction loss function:
L = L_s + α × L_b + β × L_t
where L_s is the loss of the probability map, L_b is the loss of the binary map, and L_t is the loss of the threshold map; α and β are set to 1 and 10, respectively. L_s and L_b use the binary cross-entropy (BCE) loss, given by:
L_s = L_b = Σ_i [ y_i·log x_i + (1 − y_i)·log(1 − x_i) ]
the invention discloses a fast scene text detection method fusing void convolution, which replaces a large network with a light-weight neural network to extract the characteristics of an input image, solves the problem of overlarge network model parameters and can effectively improve the efficiency of a text detection network model. And a cavity convolution module is added in the feature fusion layer to enlarge the receptive field, after the features are extracted, the features of each layer are fused layer by layer from top to bottom, and a channel attention mechanism is used for fusing and screening the features, so that the utilization efficiency of the features in the network is improved, and the problem of insufficient extraction features of the lightweight neural network is effectively solved. The text detection method can greatly reduce the parameters of the text detection model, greatly improve the detection speed and realize the rapid detection of the scene text under the condition of keeping a higher detection level.
Drawings
FIG. 1 is a flow chart of a fast scene text detection method incorporating hole convolution according to the present invention;
FIG. 2 is a schematic diagram of a data set tag generation of the present invention;
FIG. 3 is a flow chart of a scene text detection network architecture of the present invention;
FIG. 4 is a diagram of a hole convolution module according to the present invention.
Detailed Description
FIG. 1 is a schematic diagram of the detection flow of the method of the present invention. The invention provides a rapid scene text detection method fusing hole convolution, comprising the following steps:
S1, acquiring a text detection training data set, and performing label generation on the training data set;
Specifically, labels are generated from the original annotations of the public data sets ICDAR2015 and CTW1500. Given a scene text image, each text-region polygon is described by a set of line segments:
G = { S_k } (k = 1, 2, …, n)
where n is the number of vertices; for example, a text region in the ICDAR2015 data set consists of 4 vertices. The polygon G is shrunk to G_s using the Vatti clipping algorithm, with the shrink offset D calculated from the perimeter L and the area A of the original polygon:
D = A(1 − r²) / L
where r is the shrink ratio, typically set to 0.4. The label of the threshold map can be generated by a similar process: first, the text polygon G is dilated to G_d with the same offset D; the gap between G_s and G_d is taken as the border of the text region, and the label of the threshold map is generated by calculating the distance to the closest segment in G, as shown in FIG. 2.
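As a hedged illustration (not part of the patent text), this shrink/dilate label generation can be sketched with the shapely and pyclipper libraries, pyclipper being a common implementation of Vatti-style polygon offsetting; the function name and integer rounding below are the editor's assumptions:

```python
import numpy as np
import pyclipper
from shapely.geometry import Polygon

def shrink_and_dilate(poly, r=0.4):
    """Shrink a text polygon G to G_s and dilate it to G_d using the
    offset D = A * (1 - r**2) / L described above (hypothetical helper)."""
    p = Polygon(poly)
    d = p.area * (1 - r ** 2) / p.length   # A and L of the original polygon

    pco = pyclipper.PyclipperOffset()
    pco.AddPath([(int(x), int(y)) for x, y in poly],
                pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    g_s = pco.Execute(-d)  # negative offset shrinks: probability-map label
    g_d = pco.Execute(+d)  # positive offset dilates: threshold-map border
    return np.array(g_s[0]), np.array(g_d[0])

# Example: an axis-aligned 4-vertex text region, as in ICDAR2015
g_s, g_d = shrink_and_dilate([(0, 0), (100, 0), (100, 30), (0, 30)])
```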
S2, establishing a preliminary rapid scene text detection model fusing hole convolution; the preliminary model comprises a lightweight feature extraction module, a hole convolution module and a differentiable binarization module;
Specifically, the lightweight feature extraction module adopts EfficientNet-b3 as the backbone network to extract features from the input image and construct a feature pyramid structure; a hole convolution module is added to the feature fusion layer; the feature fusion module fuses the two parts of features and is then connected to the differentiable binarization module. The structure of the text detection network model is shown in FIG. 3.
S3, training the preliminary model established in step S2 on the training data set with labels generated in step S1, using a loss function to calculate loss values and adjust the parameters of the preliminary model, to obtain the rapid scene text detection model fusing hole convolution;
Specifically, the preliminary text detection model is trained with the following steps to obtain the text detection model:
S3.1, the label-generated text images are input to the lightweight backbone network EfficientNet-b3, and the feature maps of the first to fifth stages are extracted to construct a feature pyramid structure (a sketch of this step is given below);
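Purely as an illustration, the five-stage feature extraction can be reproduced with the timm library, whose features_only mode returns the intermediate feature maps of EfficientNet-b3 at strides 2, 4, 8, 16 and 32; the use of timm and the 640×640 input size are assumptions, not the patent's specification:

```python
import timm
import torch

# EfficientNet-b3 backbone that returns the five stage feature maps
# (at 1/2, 1/4, 1/8, 1/16 and 1/32 of the input resolution).
backbone = timm.create_model('efficientnet_b3', pretrained=True, features_only=True)

x = torch.randn(1, 3, 640, 640)       # one label-generated training image
c1, c2, c3, c4, c5 = backbone(x)      # c4 is the 1/16 map used in step S3.2
print([tuple(f.shape) for f in (c1, c2, c3, c4, c5)])
```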
S3.2, the hole convolution module processes the 1/16-scale feature map extracted in step S3.1 with hole convolutions at hole rates of 1, 6, 12 and 18, respectively, to obtain hole convolution features;
Specifically, C4 of the backbone network EfficientNet-b3, i.e. the 1/16-scale feature map, is sampled in parallel by an ordinary 1×1 convolution and three 3×3 convolution kernels with hole rates of 6, 12 and 18, respectively, to obtain different receptive fields; the multi-scale features of the 4 branches are then concatenated to fully capture the context information of the input image, as shown in FIG. 4.
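A minimal PyTorch sketch of such a module follows; the parallel rates 1, 6, 12 and 18 match the ASPP pattern of DeepLab, and the class name, output width and final 1×1 fusion are the editor's assumptions:

```python
import torch
import torch.nn as nn

class HoleConvModule(nn.Module):
    """Parallel hole (dilated) convolutions over the 1/16 feature map C4."""
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, 1)                      # 1x1, rate 1
        self.branch2 = nn.Conv2d(in_ch, out_ch, 3, padding=6, dilation=6)
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=12, dilation=12)
        self.branch4 = nn.Conv2d(in_ch, out_ch, 3, padding=18, dilation=18)
        self.project = nn.Conv2d(4 * out_ch, out_ch, 1)                 # fuse 4 branches

    def forward(self, c4):
        feats = [self.branch1(c4), self.branch2(c4),
                 self.branch3(c4), self.branch4(c4)]
        return self.project(torch.cat(feats, dim=1))  # concatenate, then 1x1 fuse
```

Setting padding equal to the dilation rate keeps the spatial size of each 3×3 branch identical to the input, so the four branches can be concatenated directly.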
S3.3, the feature fusion layer fuses the features generated in steps S3.1 and S3.2, and a channel attention mechanism is used to fuse and screen the features;
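The patent text does not spell out the attention design; shown here, purely as an assumed sketch, is a squeeze-and-excitation style channel attention that could perform the described fusion screening:

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: re-weights the channels of the fused
    feature map so that informative channels are emphasised (assumption)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: global average
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # per-channel weights in (0, 1)
        )

    def forward(self, fused):
        return fused * self.fc(fused)  # screen features by channel importance
```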
S3.4, the probability map (P) and the threshold map (T) are predicted from the fused feature map generated in step S3.3; the differentiable binarization module combines the probability map and the threshold map to obtain an approximate binary map (B), adaptively predicting the threshold at each position in the image; in the inference stage, the bounding boxes of the text regions are obtained from the approximate binary map B;
Specifically, the feature map after the attention mechanism is used to predict the probability map P and the threshold map T, and the relationship between them is then established by the following formula to generate the approximate binary map B:
B_{i,j} = 1 / (1 + e^{−k(P_{i,j} − T_{i,j})})
where k is an amplification factor, generally set to 50; B_{i,j} is the value at point (i, j) on the approximate binary map, P_{i,j} is the value at point (i, j) on the probability map, and T_{i,j} is the value at point (i, j) on the threshold map. This differentiable approximate binarization function can be optimized together with the network during training and helps distinguish text regions from the background.
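In code, this differentiable binarization is a single element-wise operation; a minimal sketch (the function name is the editor's):

```python
import torch

def differentiable_binarization(P, T, k=50.0):
    """B = 1 / (1 + exp(-k * (P - T))): a steep sigmoid that approximates
    hard thresholding of P by T while remaining differentiable for training."""
    return torch.sigmoid(k * (P - T))
```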
S3.5, when the probability map P and the threshold map T are predicted from the fused feature map in step S3.4, the following formula is adopted as the prediction loss function:
L = L_s + α × L_b + β × L_t
where L_s is the loss of the probability map, L_b is the loss of the binary map, and L_t is the loss of the threshold map; α and β are set to 1 and 10, respectively. L_s and L_b use the binary cross-entropy (BCE) loss, given by:
L_s = L_b = Σ_i [ y_i·log x_i + (1 − y_i)·log(1 − x_i) ].
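A hedged sketch of this total loss follows: BCE for the probability and binary maps as given above, while the threshold-map loss L_t is not specified in the text, so a masked L1 distance over the border region, the usual choice in differentiable-binarization networks, is assumed:

```python
import torch
import torch.nn.functional as F

def detection_loss(P, B, T, prob_gt, thresh_gt, thresh_mask,
                   alpha=1.0, beta=10.0):
    """L = L_s + alpha * L_b + beta * L_t, with alpha = 1 and beta = 10.

    P, B, T: predicted probability, approximate-binary and threshold maps.
    prob_gt: shrunk-polygon label, used as target for both L_s and L_b.
    thresh_gt, thresh_mask: threshold-map label and border mask (assumed inputs).
    """
    L_s = F.binary_cross_entropy(P, prob_gt)   # probability-map loss (BCE)
    L_b = F.binary_cross_entropy(B, prob_gt)   # binary-map loss (BCE)
    # Threshold-map loss: masked L1 (assumption; not stated in the patent text).
    L_t = (torch.abs(T - thresh_gt) * thresh_mask).sum() / (thresh_mask.sum() + 1e-6)
    return L_s + alpha * L_b + beta * L_t
```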
And S4, the text in scene images is detected using the rapid scene text detection model fusing hole convolution obtained in step S3; a sketch of the inference-stage post-processing follows.
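As a final hedged sketch, the inference-stage box extraction can binarize the probability map, trace contours with OpenCV, and dilate each contour back out, inverting the label-generation shrinkage; the binarization threshold and unclip ratio below are assumptions:

```python
import cv2
import numpy as np
import pyclipper
from shapely.geometry import Polygon

def boxes_from_prob_map(prob_map, bin_thresh=0.3, unclip_ratio=1.5):
    """Recover text bounding boxes from the predicted probability map (sketch)."""
    binary = (prob_map > bin_thresh).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        pts = c.reshape(-1, 2)
        if len(pts) < 4:
            continue
        poly = Polygon(pts)
        if poly.area < 1:
            continue
        d = poly.area * unclip_ratio / poly.length   # inverse of the shrink offset
        pco = pyclipper.PyclipperOffset()
        pco.AddPath([(int(x), int(y)) for x, y in pts],
                    pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
        expanded = pco.Execute(d)
        if expanded:
            rect = cv2.minAreaRect(np.array(expanded[0], dtype=np.float32))
            boxes.append(cv2.boxPoints(rect))   # 4-point box of the text region
    return boxes
```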

Claims (5)

1. A rapid scene text detection method fusing hole convolution, comprising the following steps:
S1, acquiring a text detection training data set, and performing label generation on the training data set;
S2, establishing a preliminary rapid scene text detection model fusing hole convolution; the preliminary model comprises a lightweight feature extraction module, a hole convolution module and a differentiable binarization module;
S3, training the preliminary model established in step S2 on the training data set with labels generated in step S1, using a loss function to calculate loss values and adjust the parameters of the preliminary model, to obtain the rapid scene text detection model fusing hole convolution;
and S4, detecting the text in scene images using the rapid scene text detection model fusing hole convolution obtained in step S3.
2. The method according to claim 1, wherein in step S1 a text detection training data set is acquired and label generation is performed on it; specifically, labels are generated from the original annotations of the public data sets ICDAR2015 and CTW1500.
3. The method according to claim 1, wherein the preliminary rapid scene text detection model fusing hole convolution in step S2 comprises a lightweight feature extraction module, a hole convolution module and a differentiable binarization module; specifically, the lightweight feature extraction module adopts EfficientNet-b3 as the backbone network to extract features from the input image and construct a feature pyramid structure; the hole convolution module is added to the feature fusion layer; and the feature fusion module fuses the two parts of features and is then connected to the differentiable binarization module.
4. The method according to claim 1, wherein in step S3 the preliminary model established in step S2 is trained on the training data set with labels generated in step S1, and a loss function is used to calculate loss values and adjust the parameters of the preliminary model to obtain the rapid scene text detection model fusing hole convolution; specifically, the text detection model is obtained by training with the following steps:
S3.1, the label-generated text images are input to the lightweight backbone network EfficientNet-b3, and the feature maps of the first to fifth stages are extracted to construct a feature pyramid structure;
S3.2, the hole convolution module processes the 1/16-scale feature map extracted in step S3.1 with hole convolutions at hole rates of 1, 6, 12 and 18, respectively, to obtain hole convolution features;
S3.3, the feature fusion layer fuses the features generated in steps S3.1 and S3.2, and a channel attention mechanism is used to fuse and screen the features;
and S3.4, the probability map (P) and the threshold map (T) are predicted from the fused feature map generated in step S3.3; the differentiable binarization module combines the probability map and the threshold map to obtain an approximate binary map (B), adaptively predicting the threshold at each position in the image; and in the inference stage, the bounding boxes of the text regions are obtained from the approximate binary map B.
5. The method according to claim 4, wherein the following formula is adopted as the prediction loss function when the probability map (P) and the threshold map (T) are predicted in step S3.4:
L = L_s + α × L_b + β × L_t
where L_s is the loss of the probability map, L_b is the loss of the binary map, and L_t is the loss of the threshold map; α and β are set to 1 and 10, respectively. L_s and L_b use the binary cross-entropy (BCE) loss, given by:
L_s = L_b = Σ_i [ y_i·log x_i + (1 − y_i)·log(1 − x_i) ].
CN202210046573.2A (priority date 2022-01-14, filing date 2022-01-14): Rapid scene text detection method fusing hole convolution. Status: Pending; published as CN114529894A.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210046573.2A | 2022-01-14 | 2022-01-14 | Rapid scene text detection method fusing hole convolution

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210046573.2A | 2022-01-14 | 2022-01-14 | Rapid scene text detection method fusing hole convolution

Publications (1)

Publication Number | Publication Date
CN114529894A | 2022-05-24

Family

ID=81621819

Family Applications (1)

Application Number | Status | Priority Date | Filing Date | Title
CN202210046573.2A | Pending | 2022-01-14 | 2022-01-14 | Rapid scene text detection method fusing hole convolution

Country Status (1)

Country | Publication
CN | CN114529894A

Cited By (1)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
CN115100428A * | 2022-07-01 | 2022-09-23 | Tianjin University (天津大学) | Target detection method using context sensing



Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination