CN110533041B - Regression-based multi-scale scene text detection method - Google Patents


Info

Publication number
CN110533041B
Authority
CN
China
Prior art keywords
convolution
module
size
text
convolution kernel
Prior art date
Legal status
Active
Application number
CN201910838235.0A
Other languages
Chinese (zh)
Other versions
CN110533041A (en)
Inventor
Jing Xiaorong (景小荣)
Zhu Li (朱莉)
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201910838235.0A priority Critical patent/CN110533041B/en
Publication of CN110533041A publication Critical patent/CN110533041A/en
Application granted granted Critical
Publication of CN110533041B publication Critical patent/CN110533041B/en

Classifications

    • G06F18/253 Fusion techniques of extracted features (Pattern recognition; Analysing)
    • G06N3/045 Combinations of networks (Neural networks; Architecture, e.g. interconnection topology)
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs (Neural networks; Architecture)
    • G06N3/08 Learning methods (Neural networks)
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images (Scenes; Type of objects)

Abstract

The invention relates to a regression-based multi-scale scene text detection method and belongs to the field of digital image processing. The method specifically comprises the following steps: S1: preparing sufficient training data annotated with text positions; S2: constructing a feature extraction network comprising a bottom-up forward network process and a top-down feature fusion process, used to extract low-level, middle-level and high-level features from each training sample; S3: applying a cascade module to each feature layer fed into the detection layer; S4: adopting a regression-based detection framework, setting default boxes suited to text characteristics, and detecting text positions in the image. The cascade module adopted by the invention enlarges the receptive field of the network, so that the default boxes set for text characteristics fit well, and the text positions in the image are finally detected accurately.

Description

Regression-based multi-scale scene text detection method
Technical Field
The invention belongs to the field of digital image processing and relates to a regression-based multi-scale scene text detection method.
Background
With the popularization of smart devices, people can acquire image information anytime and anywhere. Text in an image serves as high-level semantic information and provides important clues for understanding and analyzing image content. Text reflects image content directly and is easier to extract and understand than other elements, and its description can be used as-is, which makes it convenient for keyword-based retrieval and analysis of all kinds of image and video content. Text detection has thus become a popular research topic in the field of computer vision.
There are many text detection methods. Traditional scene text detection methods rely on hand-crafted features: different images require different feature extraction schemes, which entails an enormous workload. At the same time, feature design places high demands on designers and requires rich professional knowledge. All of this creates a development bottleneck for hand-designed features, a problem that the advent of deep learning has solved.
Following the excellent detection results of deep learning in the field of object detection, text detection methods improved from generic object detection algorithms have emerged. Methods based on generic object detection fall into two main categories: candidate-region-based methods and regression-based methods. Unlike generic object detection, the aspect ratio of text varies drastically, so making the network strongly robust to text scale changes is a problem that must be considered. Among text detection algorithms developed from candidate-region-based methods is the Connectionist Text Proposal Network (CTPN, from "Detecting Text in Natural Image with Connectionist Text Proposal Network"). This framework observes that the length of a text sequence varies violently and that the horizontal extent is harder to predict than the vertical one; to generate text proposals more accurately, it fixes the default box width to 16 and predicts positions only in the vertical direction. Although this method realizes end-to-end training of a convolutional neural network with a recurrent neural network for the first time, extracting both the spatial and sequential features of text, and achieves high detection accuracy on multi-scale and multi-language text, it handles only horizontal text and runs slowly. Among text detection algorithms that improve on regression-based methods is TextBoxes ("A Fast Text Detector with a Single Deep Neural Network"), which predicts at different layers, predicting small targets at low layers and large targets at high layers, and designs default boxes that fit text scales. Although it performs well in both speed and accuracy, its detection of small targets is unsatisfactory because low- and middle-level feature extraction is insufficient.
Therefore, a text detection method that is strongly robust to text scale changes is needed.
Disclosure of Invention
In view of this, the present invention provides a regression-based multi-scale scene text detection method, which solves the problem that current regression-based text detection networks are not robust enough to text scale changes, sets default boxes suited to text characteristics, and finally detects the text positions in an image.
To achieve the above purpose, the invention provides the following technical scheme:
A regression-based multi-scale scene text detection method, which specifically comprises the following steps:
S1: preparing sufficient training data annotated with text positions;
S2: constructing a feature extraction network comprising a bottom-up forward network process and a top-down feature fusion process, used to extract low-level, middle-level and high-level features from each training sample;
S3: applying a cascade (Inception) module to each feature layer fed into the detection layer;
S4: adopting a regression-based detection framework, setting default boxes suited to text characteristics, and detecting the text positions in the image.
Further, in step S2, the bottom-up forward network comprises: an input module, first to fifth convolution modules, first to fifth pooling modules, a recurrent neural network module, sixth to tenth convolution modules and a sixth pooling module; the input module and the first to fifth convolution modules are cascaded with the first to fifth pooling modules in alternation, each convolution module being followed by its corresponding pooling module; the recurrent neural network module, the sixth to tenth convolution modules and the sixth pooling module are then cascaded in sequence.
Further, in step S2, the top-down feature fusion refers to the fusion of a high-level feature with a low-level feature, and specifically comprises: the high level first obtains, through deconvolution, a feature map consistent with the size of the low level, followed by a Batch Normalization (BatchNorm) module; the low level is first passed through a convolution module with kernel size 1 x 1, stride 1 and padding 0, followed by a BatchNorm module; finally the two feature layers are fused by an element-wise product operation (Eltwise); the fused output serves as the output of the whole feature extraction network.
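For illustration only, a minimal PyTorch sketch of this fusion step follows. The channel counts, the 2x upsampling factor of the deconvolution and the example map sizes are assumptions; the patent specifies only the order of the operations.

```python
import torch
import torch.nn as nn

class TopDownFusion(nn.Module):
    """Fuse one high-level feature map into one low-level feature map:
    deconvolution + BatchNorm on the high level, 1x1 convolution +
    BatchNorm on the low level, then element-wise (Eltwise) product."""
    def __init__(self, high_ch, low_ch):
        super().__init__()
        # High-level path: deconvolve up to the low-level map size.
        self.deconv = nn.ConvTranspose2d(high_ch, low_ch, kernel_size=2, stride=2)
        self.bn_high = nn.BatchNorm2d(low_ch)
        # Low-level path: 1x1 convolution, stride 1, padding 0.
        self.conv1x1 = nn.Conv2d(low_ch, low_ch, kernel_size=1, stride=1, padding=0)
        self.bn_low = nn.BatchNorm2d(low_ch)

    def forward(self, high, low):
        h = self.bn_high(self.deconv(high))
        l = self.bn_low(self.conv1x1(low))
        return h * l  # Eltwise product fusion

# Example: fuse a 10x10, 512-channel map into a 20x20, 256-channel map.
fused = TopDownFusion(512, 256)(torch.randn(1, 512, 10, 10),
                                torch.randn(1, 256, 20, 20))
print(fused.shape)  # torch.Size([1, 256, 20, 20])
```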
Furthermore, the convolution kernels of the first, second, third and fourth convolution modules are all 3 x 3 with stride 1 and padding 1; the pooling kernel of the fifth pooling module is 3 x 3 with stride 1 and padding 1, while the remaining pooling modules use 2 x 2 kernels with stride 2 and padding 0; the recurrent neural network module is a bidirectional Long Short-Term Memory recurrent neural network (BLSTM-RNN) with 256 hidden-layer units; the convolution kernel of the seventh convolution module is 1 x 1 with stride 1 and padding 0; and the eighth to tenth convolution modules each comprise two convolution kernels, one of size 1 x 1 with stride 1 and padding 0, the other of size 3 x 3 with stride 2 and padding 1.
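The repeated building blocks above can be sketched as follows. The channel counts are assumptions, and applying the BLSTM row-wise (CTPN-style) is one plausible reading of the recurrent neural network module, which the patent does not spell out.

```python
import torch
import torch.nn as nn

def conv3x3(cin, cout):
    # 3 x 3 convolution, stride 1, padding 1 (first to fourth convolution modules).
    return nn.Sequential(nn.Conv2d(cin, cout, 3, 1, 1), nn.ReLU(inplace=True))

def down_module(cin, cmid, cout):
    # Eighth to tenth convolution modules: a 1 x 1 kernel (stride 1, padding 0)
    # followed by a 3 x 3 kernel (stride 2, padding 1) that halves the map size.
    return nn.Sequential(nn.Conv2d(cin, cmid, 1, 1, 0), nn.ReLU(inplace=True),
                         nn.Conv2d(cmid, cout, 3, 2, 1), nn.ReLU(inplace=True))

class RowBLSTM(nn.Module):
    """Bidirectional LSTM with 256 hidden units, run over each row of the
    feature map so that horizontal text sequences are modelled."""
    def __init__(self, cin, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(cin, hidden, bidirectional=True, batch_first=True)

    def forward(self, x):
        n, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(n * h, w, c)  # each row is a sequence
        out, _ = self.rnn(rows)                            # (n*h, w, 2*hidden)
        return out.reshape(n, h, w, -1).permute(0, 3, 1, 2)

print(conv3x3(3, 64)(torch.randn(1, 3, 300, 300)).shape)   # (1, 64, 300, 300)
feat = RowBLSTM(256)(torch.randn(1, 256, 19, 19))          # (1, 512, 19, 19)
print(down_module(512, 128, 256)(feat).shape)              # (1, 256, 10, 10)
```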
Further, in step S3, the cascade module comprises an input feature map end and a feature map concatenation end, connected by four parallel convolution branches, each branch containing 1, 2 or 3 convolution modules.
Further, the cascade module comprises four convolution branches connected in parallel:
the first convolution branch comprises one convolution kernel of size 3 x 3 with stride 1 and padding 1;
the second convolution branch comprises three convolution kernels: one of size 1 x 1 with stride 1 and padding 0, one of size 1 x 5 with stride 1 and padding 1, and one of size 5 x 1 with stride 1 and padding 1;
the third convolution branch comprises three convolution kernels: one of size 1 x 1 with stride 1 and padding 0, one of size 5 x 1 with stride 1 and padding 1, and one of size 1 x 5 with stride 1 and padding 1;
the fourth convolution branch comprises a pooling layer whose kernel is 3 x 3 with stride 1 and padding 1, followed by a convolution kernel of size 1 x 1 with stride 1 and padding 0;
every convolution kernel is followed by a BatchNorm module and a ReLU module.
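A minimal PyTorch sketch of this four-branch module follows. The channel counts are assumptions. Note also that for the 1 x 5 and 5 x 1 kernels the sketch uses paddings (0, 2) and (2, 0) so that all four branches keep the input size and can be concatenated; the text above states padding 1, which would not preserve the map size.

```python
import torch
import torch.nn as nn

def cbr(cin, cout, k, s, p):
    # Convolution followed by a BatchNorm module and a ReLU module.
    return nn.Sequential(nn.Conv2d(cin, cout, k, s, p),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class CascadeModule(nn.Module):
    def __init__(self, cin, cb):
        super().__init__()
        self.b1 = cbr(cin, cb, 3, 1, 1)                    # single 3x3 kernel
        self.b2 = nn.Sequential(cbr(cin, cb, 1, 1, 0),     # 1x1 -> 1x5 -> 5x1
                                cbr(cb, cb, (1, 5), 1, (0, 2)),
                                cbr(cb, cb, (5, 1), 1, (2, 0)))
        self.b3 = nn.Sequential(cbr(cin, cb, 1, 1, 0),     # 1x1 -> 5x1 -> 1x5
                                cbr(cb, cb, (5, 1), 1, (2, 0)),
                                cbr(cb, cb, (1, 5), 1, (0, 2)))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, 1, 1),     # 3x3 pooling layer
                                cbr(cin, cb, 1, 1, 0))     # then 1x1 kernel

    def forward(self, x):
        # Concatenate the four parallel branches along the channel axis.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

y = CascadeModule(256, 64)(torch.randn(1, 256, 38, 38))
print(y.shape)  # torch.Size([1, 256, 38, 38])
```

Mixing 1 x 5 and 5 x 1 kernels widens the receptive field horizontally and vertically at low cost, which suits long text lines.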
The invention has the following beneficial effects. The text detection method is strongly robust to text scale changes. The method uses a convolutional-recurrent neural network to extract the spatial and sequential features of text simultaneously. It uses the multi-layer prediction outputs of a feature pyramid structure, predicting small targets from low-level feature maps and large targets from high-level feature maps. Through feature fusion, high-level semantic information is used for classification while low-level structural information assists regression, which alleviates, to a certain extent, the problems of insufficient low-level feature extraction and low accuracy in small-target prediction. Finally, an Inception module is applied to each feature layer fed into the detection layer to further enlarge the receptive field of the network; a regression-based detection framework is then adopted, default boxes suited to text characteristics are set, and the text positions in the image are finally detected.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For a better understanding of the objects, aspects and advantages of the present invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic diagram of a network architecture according to the present invention;
FIG. 2 is a schematic view of feature fusion;
FIG. 3 is a schematic structural diagram of the cascade Inception module.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
Referring to fig. 1 to fig. 3, a preferred embodiment of a regression-based multi-scale scene text detection method according to the present invention includes the following steps:
the method comprises the following steps: preparing data;
several public data sets were aggregated — SynthText, ICDAR2011, ICDAR2013, SVT. Wherein SynthText contains 8 x 105The opening and combination picture is used for network pre-training, and 749 training pictures including ICDAR2011, ICDAR2013 and SVT are used for fine adjustment of the network. The three data sets of ICDAR2011, ICDAR2013, SVT total 585 training pictures for testing.
Step two: network pre-training, which specifically comprises the following steps:
1) constructing a network structure as shown in fig. 1;
2) pre-training the network on the SynthText synthetic dataset: images normalized to 300 × 300 are input into the network model; the network outputs the text localization results and text classification scores, and is trained with the loss function shown in formula (1):
L(x, c, l, g) = (1/N)(L_conf(x, c) + α L_loc(x, l, g))    (1)
The loss function includes two parts: the binary classification loss of the text line and the regression loss of the text-line default box positions. Here N denotes the number of matched default boxes, α = 1, x is the matching matrix between default boxes and ground-truth boxes, c denotes the confidence that each default box contains text, l denotes the localization result predicted by the network for each default box, and g denotes the ground-truth box position. The binary classification loss L_conf of the text line uses the cross-entropy loss, and the regression loss L_loc of the text-line default box positions uses the smooth L1 loss;
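For illustration only, the following sketch computes the loss of formula (1), assuming the matching step has already produced per-box labels, encoded localization targets and therefore the positive-box set; the hard-negative mining used by SSD-style detectors is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def detection_loss(conf, loc, gt_loc, labels, alpha=1.0):
    """Loss of formula (1).
    conf:   (num_boxes, 2) text / non-text scores, c
    loc:    (num_boxes, 4) predicted box offsets, l
    gt_loc: (num_boxes, 4) encoded ground-truth offsets, g
    labels: (num_boxes,)   0 = background, 1 = text (from the matching x)
    """
    pos = labels > 0                        # matched (positive) default boxes
    n = pos.sum().clamp(min=1)              # N in formula (1)
    l_conf = F.cross_entropy(conf, labels, reduction='sum')           # L_conf
    l_loc = F.smooth_l1_loss(loc[pos], gt_loc[pos], reduction='sum')  # L_loc
    return (l_conf + alpha * l_loc) / n

print(detection_loss(torch.randn(100, 2), torch.randn(100, 4),
                     torch.randn(100, 4), torch.randint(0, 2, (100,))))
```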
3) optimizing the loss obtained in 2) with the Adam optimizer (Adam: A Method for Stochastic Optimization): parameters in the network are updated continually by minimizing the loss function. The network is trained for 4 × 10^6 iterations in total; the learning rate is initialized to 10^-3 and multiplied by 0.1 every 4 × 10^5 iterations, and dropout with rate 0.3 is applied.
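The stated schedule maps directly onto Adam with a step learning-rate decay, sketched below; the tiny model here is only a stand-in for the detection network, and the loop is shortened so the snippet runs quickly.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(),
                      nn.Dropout(p=0.3),   # dropout rate 0.3, as stated
                      nn.Linear(8, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial lr 10^-3
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                            step_size=400_000,  # 4 x 10^5
                                            gamma=0.1)

for step in range(1_000):  # 4 x 10^6 iterations in the patent
    x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # multiplies the learning rate by 0.1 every step_size steps
```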
Step three: network fine-tuning, which specifically comprises the following steps:
1) fine-tuning the network model obtained in step two with the 749 real images from ICDAR2011, ICDAR2013 and SVT provided in step one, applying data enhancement to the 749 real images, including random flipping, noise addition, blurring and similar operations;
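A sketch of these augmentations with torchvision follows; the flip probability, blur kernel and noise amplitude are assumptions, and box annotations would have to be transformed together with the image, which is omitted here.

```python
import torch
import torchvision.transforms as T

# Augmentation pipeline applied to a PIL training image.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),     # random flipping
    T.GaussianBlur(kernel_size=3),     # blurring
    T.ToTensor(),
    # additive Gaussian noise, clipped back to the valid range
    T.Lambda(lambda img: (img + 0.02 * torch.randn_like(img)).clamp(0, 1)),
])
```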
2) setting default boxes with 6 different aspect ratios on the different output layers, namely: 1, 2, 3, 5, 7 and 10;
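At a single feature-map location these ratios correspond to boxes like the following; the per-layer scale is an assumption, since the patent lists only the aspect ratios.

```python
# Default boxes of equal area whose width/height ratio runs over the six
# stated aspect ratios; the wide boxes suit long horizontal text lines.
ASPECT_RATIOS = (1, 2, 3, 5, 7, 10)

def default_boxes(cx, cy, scale):
    """Return (cx, cy, w, h) boxes centred at (cx, cy) with area scale**2."""
    return [(cx, cy, scale * ar ** 0.5, scale / ar ** 0.5)
            for ar in ASPECT_RATIOS]

for box in default_boxes(0.5, 0.5, 0.1):
    print(box)
```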
3) the detection layer uses a cascade (Inception) module to concatenate convolution kernels of different sizes, enlarging the receptive field of the network by increasing the network width and thereby handling the detection of texts with extreme aspect ratios;
4) setting the learning rate to 10^-5 and iterating 20000 times in total, optimizing with stochastic gradient descent during this process to obtain the final deep neural network model;
step four: and testing the learned network on a test set: in the step, the normalized test image is input into a network model, and the network output is the positioning result of the text and the score of the text classification.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (3)

1. A regression-based multi-scale scene text detection method, characterized by specifically comprising the following steps:
S1: preparing sufficient training data annotated with text positions;
S2: constructing a feature extraction network comprising a bottom-up forward network process and a top-down feature fusion process, used to extract low-level, middle-level and high-level features from each training sample; wherein the bottom-up forward network comprises: an input module, first to fifth convolution modules, first to fifth pooling modules, a recurrent neural network module, sixth to tenth convolution modules and a sixth pooling module; the input module and the first to fifth convolution modules are cascaded with the first to fifth pooling modules in alternation, each convolution module being followed by its corresponding pooling module; the recurrent neural network module, the sixth to tenth convolution modules and the sixth pooling module are cascaded in sequence;
the convolution kernels of the first, second, third and fourth convolution modules are all 3 x 3 with stride 1 and padding 1; the pooling kernel of the fifth pooling module is 3 x 3 with stride 1 and padding 1, while the remaining pooling modules use 2 x 2 kernels with stride 2 and padding 0; the recurrent neural network module is a bidirectional Long Short-Term Memory recurrent neural network (BLSTM-RNN) with 256 hidden-layer units; the convolution kernel of the seventh convolution module is 1 x 1 with stride 1 and padding 0; the eighth to tenth convolution modules each comprise two convolution kernels, one of size 1 x 1 with stride 1 and padding 0, and the other of size 3 x 3 with stride 2 and padding 1;
S3: applying a cascade module to each feature layer fed into the detection layer;
the cascade module comprises four convolution branches connected in parallel:
the first convolution branch comprises one convolution kernel of size 3 x 3 with stride 1 and padding 1;
the second convolution branch comprises three convolution kernels: one of size 1 x 1 with stride 1 and padding 0, one of size 1 x 5 with stride 1 and padding 1, and one of size 5 x 1 with stride 1 and padding 1;
the third convolution branch comprises three convolution kernels: one of size 1 x 1 with stride 1 and padding 0, one of size 5 x 1 with stride 1 and padding 1, and one of size 1 x 5 with stride 1 and padding 1;
the fourth convolution branch comprises a pooling layer whose kernel is 3 x 3 with stride 1 and padding 1, followed by a convolution kernel of size 1 x 1 with stride 1 and padding 0;
every convolution kernel is followed by a BatchNorm module and a Rectified Linear Unit (ReLU) module;
S4: adopting a regression-based detection framework, setting default boxes suited to text characteristics, and detecting the text positions in the image.
2. The regression-based multi-scale scene text detection method according to claim 1, wherein in step S2, the top-down feature fusion refers to the fusion of high-level features with low-level features, and specifically comprises: the high level first obtains, through deconvolution, a feature map consistent with the size of the low level, followed by a Batch Normalization (BatchNorm) module; the low level is first passed through a convolution module with kernel size 1 x 1, stride 1 and padding 0, followed by a BatchNorm module; finally the two feature layers are fused by an element-wise product operation; the fused output serves as the output of the whole feature extraction network.
3. The regression-based multi-scale scene text detection method according to claim 1, wherein in step S3, the cascade module comprises an input feature map end and a feature map concatenation end, connected by four parallel convolution branches, each branch containing 1, 2 or 3 convolution modules.
CN201910838235.0A 2019-09-05 2019-09-05 Regression-based multi-scale scene text detection method Active CN110533041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910838235.0A CN110533041B (en) 2019-09-05 2019-09-05 Regression-based multi-scale scene text detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910838235.0A CN110533041B (en) 2019-09-05 2019-09-05 Regression-based multi-scale scene text detection method

Publications (2)

Publication Number Publication Date
CN110533041A CN110533041A (en) 2019-12-03
CN110533041B true CN110533041B (en) 2022-07-01

Family

ID=68667081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910838235.0A Active CN110533041B (en) 2019-09-05 2019-09-05 Regression-based multi-scale scene text detection method

Country Status (1)

Country Link
CN (1) CN110533041B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI689875B (en) * 2018-06-29 2020-04-01 由田新技股份有限公司 Defect inspection and classification apparatus and training apparatus using deep learning system
CN111259764A (en) * 2020-01-10 2020-06-09 中国科学技术大学 Text detection method and device, electronic equipment and storage device
CN111881943A (en) * 2020-07-08 2020-11-03 泰康保险集团股份有限公司 Method, device, equipment and computer readable medium for image classification
CN112287962B (en) * 2020-08-10 2023-06-09 南京行者易智能交通科技有限公司 Training method, detection method and device for multi-scale target detection model, and terminal equipment
CN113408525B (en) * 2021-06-17 2022-08-02 成都崇瑚信息技术有限公司 Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method
CN115393868B (en) * 2022-08-18 2023-05-26 中化现代农业有限公司 Text detection method, device, electronic equipment and storage medium
CN116704248A (en) * 2023-06-07 2023-09-05 南京大学 Serum sample image classification method based on multi-semantic unbalanced learning


Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
RU2691214C1 (en) * 2017-12-13 2019-06-11 Общество с ограниченной ответственностью "Аби Продакшн" Text recognition using artificial intelligence

Patent Citations (9)

Publication number Priority date Publication date Assignee Title
CN105631426A (en) * 2015-12-29 2016-06-01 中国科学院深圳先进技术研究院 Image text detection method and device
CN107688808A (en) * 2017-08-07 2018-02-13 电子科技大学 A kind of quickly natural scene Method for text detection
CN107578060A (en) * 2017-08-14 2018-01-12 电子科技大学 A kind of deep neural network based on discriminant region is used for the method for vegetable image classification
EP3534298A1 (en) * 2018-02-26 2019-09-04 Capital One Services, LLC Dual stage neural network pipeline systems and methods
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN108734169A (en) * 2018-05-21 2018-11-02 南京邮电大学 One kind being based on the improved scene text extracting method of full convolutional network
CN109086663A (en) * 2018-06-27 2018-12-25 大连理工大学 The natural scene Method for text detection of dimension self-adaption based on convolutional neural networks
CN109271967A (en) * 2018-10-16 2019-01-25 腾讯科技(深圳)有限公司 The recognition methods of text and device, electronic equipment, storage medium in image
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks

Non-Patent Citations (4)

Title
Deep Direct Regression for Multi-oriented Scene Text Detection; Wenhao He et al.; 2017 IEEE International Conference on Computer Vision; 2017-12-25 *
Natural scene text detection and recognition based on deep learning (基于深度学习的自然场景文本检测与识别); Fang Qing; China Masters' Theses Full-text Database, Information Science and Technology; 2018-09-15 *
Multi-oriented scene text detection based on deep features (基于深度特征的多方向场景文字检测); Yang Xiaodong; China Masters' Theses Full-text Database, Information Science and Technology; 2019-07-15 *
Research on multi-oriented natural scene text extraction methods (多方向自然场景文本提取方法研究); Lei Qilun; China Masters' Theses Full-text Database, Information Science and Technology; 2019-04-15 *

Also Published As

Publication number Publication date
CN110533041A (en) 2019-12-03

Similar Documents

Publication Publication Date Title
CN110533041B (en) Regression-based multi-scale scene text detection method
CN111104898B (en) Image scene classification method and device based on target semantics and attention mechanism
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
CN114202672A (en) Small target detection method based on attention mechanism
CN111598183B (en) Multi-feature fusion image description method
CN109743642B (en) Video abstract generation method based on hierarchical recurrent neural network
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN102385592B (en) Image concept detection method and device
CN110610210B (en) Multi-target detection method
CN112329760A (en) Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
CN111259940A (en) Target detection method based on space attention map
CN112784756B (en) Human body identification tracking method
CN112348036A (en) Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN113569881A (en) Self-adaptive semantic segmentation method based on chain residual error and attention mechanism
Li et al. Learning hierarchical video representation for action recognition
CN112668492A (en) Behavior identification method for self-supervised learning and skeletal information
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN114612832A (en) Real-time gesture detection method and device
CN112507904A (en) Real-time classroom human body posture detection method based on multi-scale features
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
Sun et al. Video understanding: from video classification to captioning
CN117197632A (en) Transformer-based electron microscope pollen image target detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant