CN106650725B - Candidate text box generation and text detection method based on full convolution neural network - Google Patents

Candidate text box generation and text detection method based on full convolution neural network

Info

Publication number
CN106650725B
CN106650725B CN201611070587.9A CN106650725A
Authority
CN
China
Prior art keywords
text
candidate
box
network
boxes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611070587.9A
Other languages
Chinese (zh)
Other versions
CN106650725A (en)
Inventor
马景法
金连文
钟卓耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201611070587.9A priority Critical patent/CN106650725B/en
Publication of CN106650725A publication Critical patent/CN106650725A/en
Application granted granted Critical
Publication of CN106650725B publication Critical patent/CN106650725B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a candidate text box generation and text detection method based on a full convolution neural network, which comprises the following steps: generating text region candidate boxes, wherein an Inception-RPN takes a natural scene picture and a set of ground-truth bounding boxes marking text regions as input and generates a controllable number of word region candidate boxes, sliding an Inception network over the convolutional feature response map of a VGG16 model with a set of text feature prior boxes at each sliding position; incorporating ambiguity-prone text category supervision information, fusing multi-level region downsampling information, and performing text detection; training the Inception candidate box generation network and the text detection network in an end-to-end manner through back propagation and stochastic gradient descent; candidate box iterative voting achieves a higher text recall in a complementary manner, and a candidate box filtering algorithm removes superfluous detection boxes. The invention achieves accuracies of 0.83 and 0.85 on the ICDAR 2011 and ICDAR 2013 robust text detection benchmark databases respectively, surpassing the previous best results.

Description

Candidate text box generation and text detection method based on full convolution neural network
Technical Field
The invention relates to a technology for generating a text candidate box and detecting a text in a natural scene picture, in particular to a method for generating a candidate text box and detecting a text based on a full convolution neural network.
Background
Text in images provides rich and accurate high-level semantic information that is critical to a large number of potential applications such as scene understanding, image and video retrieval, and content-based recommendation systems. Text detection in natural scene pictures has therefore attracted a great deal of attention in the computer vision and image understanding communities. However, text detection in natural scenes remains a challenging and unsolved problem. First, the backgrounds of text pictures are complex, and regions such as symbols, signs, bricks and grass are very difficult to distinguish from text. In addition, confounding factors such as non-uniform lighting, strong exposure, low contrast, blur and low resolution add significant challenges to the text detection task.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a candidate text box generation and text detection method based on a full convolution neural network.
The technical scheme of the invention is realized as follows:
the candidate text box generation and text detection method based on the full convolution neural network includes the steps of
S1: generating text region candidate boxes, wherein an Inception-RPN takes a natural scene picture and a set of ground-truth bounding boxes marking text regions as input and generates a controllable number of word region candidate boxes, sliding an Inception network over the convolutional feature response map of a VGG16 model with a set of text feature prior boxes at each sliding position;
S2: incorporating ambiguity-prone text category supervision information, fusing multi-level region downsampling information, and performing text detection;
S3: training the Inception candidate box generation network and the text detection network in an end-to-end manner through back propagation and stochastic gradient descent;
S4: candidate box iterative voting achieves a higher text recall in a complementary manner, and a candidate box filtering algorithm removes superfluous detection boxes.
Further, step S1 includes the steps of
S11: designing the text feature prior boxes;
S12: constructing the Inception candidate box generation network.
Further, in step S11, there are 24 text feature prior boxes, where the width of the sliding window at each sliding position is set to 32, 48, 64 or 80, and the length-to-width ratio to 0.2, 0.5, 0.8, 1.0, 1.2 or 1.5.
Further, the Inception candidate box generation network in step S12 is formed by connecting a 3 × 3 convolutional layer, a 5 × 5 convolutional layer and a 3 × 3 max pooling layer to the corresponding spatial receptive field of the Conv5_3 feature response map as input.
Further, in step S2, the text category supervision information is: a candidate box with IoU overlap of 0.5 or more is designated as containing text, a candidate box with IoU overlap of 0.2 or more and less than 0.5 is designated as "fuzzy text", and the others are designated as containing no text information.
Further, the multi-level region downsampling information in step S2 is: the convolutional feature response maps of Conv4_3 and Conv5_3 in the VGG16 network both undergo multi-level region downsampling, yielding two 512-channel sampled features, which are then concatenated and decoded with a 512-channel 1 × 1 convolutional layer to join the features together.
Compared with the prior art, the invention provides an Inception candidate box generation network, which applies sliding windows of different sizes on a convolutional feature map and assists a set of text feature prior boxes at each sliding position to generate word region candidate boxes. The sliding windows of different sizes preserve local information at the corresponding positions while also capturing context, helping to filter out candidate boxes without text; the Inception candidate box generation network of the invention obtains a high recall rate using only hundreds of word candidate boxes. The invention also introduces additional ambiguity-prone text category supervision information and multi-level region downsampling information, both fused into the text detection network; this information helps the text detection network learn more discriminative features to distinguish text from complex backgrounds. In addition, to better utilize the intermediate models of the training process, the invention provides a candidate box iterative voting scheme, obtaining a higher word recall rate in a complementary manner.
Drawings
FIG. 1 is a flow chart of a candidate text box generation and text detection method based on a full convolution neural network according to the present invention.
FIG. 2 is an exemplary diagram of word region candidate boxes whose IoU overlap falls in a particular interval, according to one embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the method for generating a candidate text box and detecting a text based on a full convolution neural network of the present invention includes four steps: s1, generating a text region candidate box; s2, text detection; s3, end-to-end learning optimization; and S4, heuristic processing.
The function of component S1 is as follows: the Inception-RPN takes a natural scene picture and a set of ground-truth bounding boxes marking text regions as input and generates a controllable number of word region candidate boxes. To search for word region candidate boxes, we slide an Inception network over the convolutional feature response map of the VGG16 model, with a set of text feature prior boxes at each sliding position. The method comprises the following steps: (1) designing the text feature prior boxes and (2) constructing the Inception candidate box generation network. Four different scales (32, 48, 64 and 80) and six different aspect ratios (0.2, 0.5, 0.8, 1.0, 1.2 and 1.5) are set at each sliding position, for a total of 24 prior sliding windows. In the learning phase, a prior box whose intersection-over-union overlap with a ground-truth text box is greater than 0.5 is assigned a text label, whereas one whose overlap is less than 0.3 is assigned a background label. The designed Inception candidate box generation network connects a 3 × 3 convolutional layer, a 5 × 5 convolutional layer and a 3 × 3 max pooling layer to the corresponding spatial receptive field of the Conv5_3 feature response map as input. In addition, to reduce dimensionality, a 1 × 1 convolution is applied after the 3 × 3 max pooling layer. We then concatenate the features of each part along the channel axis, and the resulting 640-dimensional feature vector is fed to two output layers: a classification layer that predicts a text/non-text score for the region, and a regression layer that refines the text region position for each kind of prior window at each sliding position.
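As a concrete illustration, the 24 prior sliding windows described above can be enumerated directly. Interpreting each ratio as height/width (so a 32-wide window with ratio 0.5 is 16 high) is an assumption of this sketch; the patent fixes only the four widths and the six length-to-width ratios.

```python
def make_prior_boxes(widths=(32, 48, 64, 80),
                     ratios=(0.2, 0.5, 0.8, 1.0, 1.2, 1.5)):
    """Enumerate the text feature prior boxes: 4 widths x 6 ratios = 24.

    Treating each ratio as height/width is an assumption of this sketch;
    the patent states only the widths and length-to-width ratios.
    Returns (width, height) pairs centered at the sliding position.
    """
    return [(w, w * r) for w in widths for r in ratios]

priors = make_prior_boxes()
print(len(priors))  # 24 prior windows per sliding position
```

Because text tends to be wider than tall, most of these priors are wide, low boxes, which matches the word regions the network is asked to propose.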
Step S2 includes: (1) incorporating ambiguity-prone text category supervision information, which provides more reasonable supervision, helps the classifier learn more discriminative features, distinguishes text regions from complex and diverse backgrounds, and filters out candidate boxes that contain no text; (2) fusing multi-level region downsampling information, whose effect is to better utilize the convolutional features of multiple layers and enrich the discriminative information of each sliding window.
Much previous work on detection networks designates candidate boxes with IoU overlap greater than 0.5 as containing text and all others as background. However, this way of deciding whether text is present in a candidate box is not reasonable, because a candidate box with IoU overlap in the interval 0.2 to 0.5 may still contain partial or substantial text information, as shown in FIG. 2. Such confounding label information can confuse the classification learning of text and non-text candidate boxes. To this end, we propose to designate candidate boxes with IoU overlap of 0.5 or more as containing text, candidate boxes with IoU overlap of 0.2 or more and less than 0.5 as "fuzzy text", and the others as containing no text. This strategy provides more reasonable supervision information, helping the classifier learn more discriminative features to distinguish text from complex and diverse backgrounds and to filter out candidate boxes that contain no text.
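The three-way labeling rule above can be sketched as follows. The (x1, y1, x2, y2) box representation and the integer encoding of the three classes are choices of this sketch, not fixed by the patent.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def text_label(candidate, gt_boxes):
    """Three-way label from the patent's thresholds:
    2 = text (IoU >= 0.5), 1 = fuzzy text (0.2 <= IoU < 0.5),
    0 = no text. The integer encoding is an arbitrary choice here."""
    best = max((iou(candidate, g) for g in gt_boxes), default=0.0)
    if best >= 0.5:
        return 2
    if best >= 0.2:
        return 1
    return 0
```

For example, a candidate that overlaps a ground-truth word by a third of their union lands in the "fuzzy text" band rather than being forced into the background class.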
In order to better utilize multi-level convolutional features and enrich the discriminative information of each candidate box, the invention performs multi-level region downsampling on the convolutional feature response maps of Conv4_3 and Conv5_3 of the VGG16 network, obtaining two 512-channel sampled features. The concatenated features are then decoded with a 512-channel 1 × 1 convolutional layer. The effects of this 1 × 1 convolutional layer are (1) to combine the sampled features of the multiple levels with learned weighted fusion during training, and (2) to reduce the dimensionality to match the first fully connected layer of VGG16.
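The fusion step can be sketched with plain array operations: a 1 × 1 convolution over the concatenated features is just a per-position linear map from 1024 channels back down to 512. The 7 × 7 spatial size of the pooled region features is an assumption of this sketch (the patent fixes only the 512 channels per level), and the random inputs stand in for real pooled features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for region features pooled from Conv4_3 and Conv5_3
# (the 7x7 spatial size is an assumption of this sketch).
f4 = rng.standard_normal((512, 7, 7))
f5 = rng.standard_normal((512, 7, 7))

cat = np.concatenate([f4, f5], axis=0)  # (1024, 7, 7)

# A 1x1 convolution with 512 output channels applied at every spatial
# position: it fuses the two levels with learned weights and restores
# the 512-dim input expected by the first fully connected layer of VGG16.
w = rng.standard_normal((512, 1024)) * 0.01  # 1x1 conv kernel
fused = np.einsum('oc,chw->ohw', w, cat)

print(fused.shape)  # (512, 7, 7)
```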
Component S3 differs from the previously proposed four-step training strategy combining RPN and Fast-RCNN: the invention trains the Inception candidate box generation network and the text detection network in an end-to-end manner through back propagation and stochastic gradient descent. The shared convolutional network is initialized with an ImageNet-pretrained classification network. The weights of the new layers are initialized from a Gaussian distribution with a mean of 0 and a variance of 0.01. The base learning rate is 0.001 and is reduced to one tenth every 40000 iterations. The momentum and weight decay are set to 0.9 and 0.0005, respectively.
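The step-decay schedule just described (base rate 0.001, divided by ten every 40000 iterations) can be written as a small helper. This sketches only the schedule, not the full SGD update with momentum 0.9 and weight decay 0.0005.

```python
def learning_rate(iteration, base_lr=1e-3, step=40000, gamma=0.1):
    """Step-decay schedule: the base learning rate 0.001 is reduced
    to one tenth every 40000 iterations, as in the training setup."""
    return base_lr * gamma ** (iteration // step)

print(learning_rate(0))       # 0.001 for the first 40000 iterations
print(learning_rate(40000))   # then 0.0001, and so on
```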
The Inception candidate box generation network and the text detection network each have two sibling output layers: a classification layer and a regression layer. The differences between the output layers of the two networks are as follows: (1) in the Inception candidate box generation network, each prior box must be parameterized independently, so we predict all k = 24 prior candidate boxes simultaneously; the classification layer outputs 2k scores judging whether each candidate box contains text, and the regression layer outputs 4k values giving the refined candidate boxes' offsets from the original candidate boxes. (2) The text detection network outputs three scores for each candidate box, corresponding respectively to background, fuzzy text and text, and its regression layer outputs 4 regression offset values for each text candidate box. We minimize the following multi-task loss function during training:
L(p, p*, t, t*) = L_cls(p, p*) + λ·L_reg(t, t*),    (0.1)
The classification-layer loss function L_cls is the softmax loss, where p and p* are the predicted label and the ground-truth label, respectively. The regression loss function L_reg applies the smooth-L1 loss. In addition, t = {t_x, t_y, t_w, t_h} and t* = {t*_x, t*_y, t*_w, t*_h} denote the regression offset vectors of the predicted and ground-truth candidate boxes, respectively, where t* is obtained from the following formula:

t*_x = (G_x − P_x) / P_w,  t*_y = (G_y − P_y) / P_h,  t*_w = log(G_w / P_w),  t*_h = log(G_h / P_h)    (0.2)

Here, P = {P_x, P_y, P_w, P_h} and G = {G_x, G_y, G_w, G_h} denote the center coordinates, width and height of the candidate box P and the ground-truth text box G, respectively. λ denotes the loss-balance parameter; in the Inception candidate box generation network we set λ = 3 to bias the loss toward better candidate box positions, and in the text detection network we set λ = 1.
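A minimal sketch of the regression offsets t* relating a candidate box P to a ground-truth box G, together with the inverse transform used to decode predicted offsets back into a box; both boxes are (center_x, center_y, width, height). The inverse decode is the standard counterpart of this parameterization and is an assumption of this sketch, as the patent states only the forward formula.

```python
import math

def regression_target(P, G):
    """Offsets t* from the candidate box P to the ground-truth box G;
    P and G are (center_x, center_y, width, height)."""
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    return ((Gx - Px) / Pw, (Gy - Py) / Ph,
            math.log(Gw / Pw), math.log(Gh / Ph))

def apply_offsets(P, t):
    """Inverse transform: decode predicted offsets back into a box
    (an assumption of this sketch; the patent gives only the forward map)."""
    Px, Py, Pw, Ph = P
    tx, ty, tw, th = t
    return (Px + tx * Pw, Py + ty * Ph,
            Pw * math.exp(tw), Ph * math.exp(th))
```

Dividing the center shifts by the candidate's width and height, and taking logs of the size ratios, makes the targets scale-invariant, so boxes of all 24 prior sizes are regressed on a comparable footing.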
Component S4 comprises a candidate box iterative voting mechanism and a filtering algorithm. The iterative voting mechanism lets the invention obtain a higher text recall rate in a complementary manner, improving the performance of the text detection system. The filtering algorithm lets the invention remove superfluous detection boxes to improve accuracy.
The invention first feeds a natural scene picture and a set of ground-truth text box data into the Inception candidate box generation network, which generates a certain number of word region candidate boxes. The resulting word region candidate boxes are then sent to the text detection network for text/non-text classification and text localization; during training this network adds the ambiguity-prone text category supervision information and the fused multi-level region downsampling information. The entire system is trained in an end-to-end fashion through back propagation and gradient descent. In order to fully utilize the intermediate models of the training process, the invention adopts a candidate box iterative voting mechanism to obtain a high recall rate of text instances in a complementary manner, improving the performance of the whole text detection system. Finally, the invention applies a filtering algorithm that finds, by coordinate position, the inner and outer candidate boxes of each text instance, retains the high-score candidate boxes, and removes the low-score candidate boxes.
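The final filtering step can be sketched as follows. The strict containment test and the pairwise scan are assumptions of this sketch; the patent says only that inner and outer candidate boxes of each text instance are found by coordinate position and that the low-score ones are removed.

```python
def contains(outer, inner):
    """True if box `inner` lies inside box `outer`; boxes are (x1, y1, x2, y2)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def filter_nested(boxes, scores):
    """Sketch of the filtering step: when one detection box encloses
    another (both presumably covering the same text instance), keep only
    the higher-scoring one. Returns the kept indices in order."""
    keep = set(range(len(boxes)))
    for i in range(len(boxes)):
        for j in range(len(boxes)):
            if i != j and (contains(boxes[i], boxes[j])
                           or contains(boxes[j], boxes[i])):
                keep.discard(i if scores[i] < scores[j] else j)
    return sorted(keep)
```

For instance, a low-score box nested inside a high-score detection of the same word is dropped, while detections of distinct words elsewhere in the image are untouched.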
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (6)

1. The candidate text box generation and text detection method based on the full convolution neural network is characterized by comprising the following steps
S1: generating text region candidate boxes, wherein an Inception-RPN takes a natural scene picture and a set of ground-truth bounding boxes marking text regions as input and generates a controllable number of word region candidate boxes, sliding an Inception network over the convolutional feature response map of a VGG16 model with a set of text feature prior boxes at each sliding position;
S2: incorporating ambiguity-prone text category supervision information, fusing multi-level region downsampling information, and performing text detection;
S3: training the Inception candidate box generation network and the text detection network in an end-to-end manner through back propagation and stochastic gradient descent;
S4: iteratively voting on the candidate boxes to obtain a higher text recall rate in a complementary manner, and removing superfluous detection boxes with a candidate box filtering algorithm;
the multi-task loss function is minimized during training, with the formula:

L(p, p*, t, t*) = L_cls(p, p*) + λ·L_reg(t, t*)    (0.1)

the classification-layer loss function L_cls is the softmax loss, p and p* being the predicted label and the ground-truth label, respectively; the regression loss function L_reg applies the smooth-L1 loss; in addition, t = {t_x, t_y, t_w, t_h} and t* = {t*_x, t*_y, t*_w, t*_h} denote the regression offset vectors of the predicted and ground-truth candidate boxes, respectively, t* being obtained from the following formula:

t*_x = (G_x − P_x) / P_w,  t*_y = (G_y − P_y) / P_h,  t*_w = log(G_w / P_w),  t*_h = log(G_h / P_h)

wherein P = {P_x, P_y, P_w, P_h} and G = {G_x, G_y, G_w, G_h} denote the center coordinates, width and height of the candidate box P and the ground-truth text box G, respectively, and λ denotes the loss-balance parameter.
2. The method for generating candidate text boxes and detecting text based on the full convolution neural network as claimed in claim 1, wherein step S1 includes the steps of
S11: designing the text feature prior boxes;
S12: constructing the Inception candidate box generation network.
3. The method for generating candidate text boxes and detecting text based on the full convolution neural network as claimed in claim 2, wherein there are 24 text feature prior boxes in step S11, the width of the sliding window at each sliding position being set to 32, 48, 64 or 80 and the length-to-width ratio to 0.2, 0.5, 0.8, 1.0, 1.2 or 1.5.
4. The method as claimed in claim 2, wherein the Inception candidate box generation network in step S12 is formed by connecting a 3 × 3 convolutional layer, a 5 × 5 convolutional layer and a 3 × 3 max pooling layer to the corresponding spatial receptive field of the Conv5_3 feature response map as input.
5. The method for generating candidate text boxes and detecting text based on the full convolution neural network as claimed in claim 1, wherein the text category supervision information in step S2 is: a candidate box with IoU overlap of 0.5 or more is designated as containing text, a candidate box with IoU overlap of 0.2 or more and less than 0.5 is designated as "fuzzy text", and the others are designated as containing no text information.
6. The method for generating candidate text boxes and detecting text based on the full convolution neural network as claimed in claim 1, wherein the multi-level region downsampling information in step S2 is: the convolutional feature response maps of Conv4_3 and Conv5_3 in the VGG16 network both undergo multi-level region downsampling, yielding two 512-channel sampled features, which are then concatenated and decoded with a 512-channel 1 × 1 convolutional layer to join the features together.
CN201611070587.9A 2016-11-29 2016-11-29 Candidate text box generation and text detection method based on full convolution neural network Active CN106650725B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611070587.9A CN106650725B (en) 2016-11-29 2016-11-29 Candidate text box generation and text detection method based on full convolution neural network


Publications (2)

Publication Number Publication Date
CN106650725A CN106650725A (en) 2017-05-10
CN106650725B true CN106650725B (en) 2020-06-26

Family

ID=58813359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611070587.9A Active CN106650725B (en) 2016-11-29 2016-11-29 Candidate text box generation and text detection method based on full convolution neural network

Country Status (1)

Country Link
CN (1) CN106650725B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107316058A (en) * 2017-06-15 2017-11-03 国家新闻出版广电总局广播科学研究院 Improve the method for target detection performance by improving target classification and positional accuracy
CN107397658B (en) * 2017-07-26 2020-06-19 成都快眼科技有限公司 Multi-scale full-convolution network and visual blind guiding method and device
CN109389114B (en) * 2017-08-08 2021-12-03 富士通株式会社 Text line acquisition device and method
CN107480649B (en) * 2017-08-24 2020-08-18 浙江工业大学 Fingerprint sweat pore extraction method based on full convolution neural network
CN108090443B (en) * 2017-12-15 2020-09-22 华南理工大学 Scene text detection method and system based on deep reinforcement learning
CN108288088B (en) * 2018-01-17 2020-02-28 浙江大学 Scene text detection method based on end-to-end full convolution neural network
CN108154145B (en) * 2018-01-24 2020-05-19 北京地平线机器人技术研发有限公司 Method and device for detecting position of text in natural scene image
CN108647681B (en) * 2018-05-08 2019-06-14 重庆邮电大学 A kind of English text detection method with text orientation correction
CN108764228A (en) * 2018-05-28 2018-11-06 嘉兴善索智能科技有限公司 Word object detection method in a kind of image
CN110619325B (en) * 2018-06-20 2024-03-08 北京搜狗科技发展有限公司 Text recognition method and device
CN109190458B (en) * 2018-07-20 2022-03-25 华南理工大学 Method for detecting head of small person based on deep learning
CN109165697B (en) * 2018-10-12 2021-11-30 福州大学 Natural scene character detection method based on attention mechanism convolutional neural network
CN109376658B (en) * 2018-10-26 2022-03-08 信雅达科技股份有限公司 OCR method based on deep learning
CN109492630A (en) * 2018-10-26 2019-03-19 信雅达系统工程股份有限公司 A method of the word area detection positioning in the financial industry image based on deep learning
CN109299274B (en) * 2018-11-07 2021-12-17 南京大学 Natural scene text detection method based on full convolution neural network
CN109598290A (en) * 2018-11-22 2019-04-09 上海交通大学 A kind of image small target detecting method combined based on hierarchical detection
CN109800756B (en) * 2018-12-14 2021-02-12 华南理工大学 Character detection and identification method for dense text of Chinese historical literature
WO2020132322A1 (en) * 2018-12-19 2020-06-25 Aquifi, Inc. Systems and methods for joint learning of complex visual inspection tasks using computer vision
CN109918987B (en) * 2018-12-29 2021-05-14 中国电子科技集团公司信息科学研究院 Video subtitle keyword identification method and device
CN110135408B (en) * 2019-03-26 2021-02-19 北京捷通华声科技股份有限公司 Text image detection method, network and equipment
CN110135248A (en) * 2019-04-03 2019-08-16 华南理工大学 A kind of natural scene Method for text detection based on deep learning
CN110135424B (en) * 2019-05-23 2021-06-11 阳光保险集团股份有限公司 Inclined text detection model training method and ticket image text detection method
CN112418207B (en) * 2020-11-23 2024-03-19 南京审计大学 Weak supervision character detection method based on self-attention distillation
CN112765353B (en) * 2021-01-22 2022-11-04 重庆邮电大学 Scientific research text-based biomedical subject classification method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915386A (en) * 2015-05-25 2015-09-16 中国科学院自动化研究所 Short text clustering method based on deep semantic feature learning
CN105740892A (en) * 2016-01-27 2016-07-06 北京工业大学 High-accuracy human body multi-position identification method based on convolutional neural network
CN105912611A (en) * 2016-04-05 2016-08-31 中国科学技术大学 CNN based quick image search method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015132665A2 (en) * 2014-03-07 2015-09-11 Wolf, Lior System and method for the detection and counting of repetitions of repetitive activity via a trained network


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Dictionary Pair Classifier Driven Convolutional Neural Networks for Object Detection; Keze Wang et al.; 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 20160630; pp. 2138-2146 *
A survey of deep learning applications in handwritten Chinese character recognition; Jin Lianwen et al.; Acta Automatica Sinica; 20160831; pp. 1125-1142 *

Also Published As

Publication number Publication date
CN106650725A (en) 2017-05-10

Similar Documents

Publication Publication Date Title
CN106650725B (en) Candidate text box generation and text detection method based on full convolution neural network
CN110852368B (en) Global and local feature embedding and image-text fusion emotion analysis method and system
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN111858954B (en) Task-oriented text-generated image network model
EP3447727B1 (en) A method, an apparatus and a computer program product for object detection
CN110866140A (en) Image feature extraction model training method, image searching method and computer equipment
CN111160350B (en) Portrait segmentation method, model training method, device, medium and electronic equipment
CN106897732A (en) Multi-direction Method for text detection in a kind of natural picture based on connection word section
CN111914622A (en) Character interaction detection method based on deep learning
CN108345850A (en) The scene text detection method of the territorial classification of stroke feature transformation and deep learning based on super-pixel
CN113361432B (en) Video character end-to-end detection and identification method based on deep learning
CN111274981B (en) Target detection network construction method and device and target detection method
CN111598183A (en) Multi-feature fusion image description method
CN110929665A (en) Natural scene curve text detection method
US20180314894A1 (en) Method, an apparatus and a computer program product for object detection
CN112734803A (en) Single target tracking method, device, equipment and storage medium based on character description
CN111598155A (en) Fine-grained image weak supervision target positioning method based on deep learning
Piergiovanni et al. Video question answering with iterative video-text co-tokenization
Shi et al. Weakly supervised deep learning for objects detection from images
Yan Computational Methods for Deep Learning: Theory, Algorithms, and Implementations
Li A deep learning-based text detection and recognition approach for natural scenes
Puspasari et al. A Survey of Data Mining Techniques for Smart Museum Applications
Shao et al. Multi-object detection based on deep learning in real classrooms
Hsu et al. DensityLayout: Density-Conditioned Layout GAN for Visual-Textual Presentation Designs
Castro et al. Segmentation task for fashion and apparel

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant