CN111062386B - Natural scene text detection method based on depth pyramid attention and feature fusion - Google Patents
Natural scene text detection method based on depth pyramid attention and feature fusion
- Publication number: CN111062386B
- Application number: CN201911192949.5A
- Authority: CN (China)
- Prior art keywords: feature, network, text, depth, conv5
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V20/63 — Scene text, e.g. street names
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/253 — Fusion techniques of extracted features
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a natural scene text detection method that combines a depth pyramid attention network with feature fusion. It addresses two problems: models whose feature levels are imbalanced cannot be fully exploited despite a sound original design, which limits overall performance; and because convolution operates on local receptive fields, long-range dependencies fade as the network deepens. By combining feature fusion with a depth pyramid attention model, the method makes better use of the model, overcoming the common defect of text detection models that are well designed yet underutilized, and preserving the long-range dependencies that would otherwise vanish as the convolution, based on local receptive fields, deepens.
Description
Technical Field
The invention relates to a natural scene text detection method, and in particular to a natural scene text detection algorithm that combines a depth pyramid attention network with a feature fusion technique.
Background
With the progress of science and technology, demand for internet products keeps growing, and more and more applications need the text information contained in images. Text detection is the first, and an extremely important, step toward fully recognizing the text content of an image, and it directly affects text recognition performance.
Text detection in natural scenes must overcome the complexity caused by background interference, variable character aspect ratios, arbitrary text orientations, and small text, making it one of the most challenging problems in computer vision today. By feature extraction approach, natural scene text detection divides into traditional methods and deep-learning-based methods. Unlike document images, scene images contain complex backgrounds and rotated text, which are hard to separate from the background using traditional methods alone. Current deep-learning text detectors fall into two main categories: region-proposal-based methods and image-segmentation-based methods. Analysis of both reveals that most models lack balance across feature levels, so otherwise well-designed models cannot be fully exploited and overall performance is limited.
To exploit the model more fully, the invention proposes a new network that overcomes the defect that a well-designed model cannot be fully utilized, which limits overall performance, and addresses the loss of long-range dependencies caused by the local receptive field of deepening convolution.
Disclosure of Invention
The invention provides a natural scene text detection algorithm combining a depth pyramid attention network with feature fusion, solving the problems that an otherwise well-designed model cannot be fully utilized and that overall performance is limited.
The technical scheme of the invention is as follows:
a natural scene text detection method based on depth pyramid attention and feature fusion comprises the following steps:
step one, taking a text public data set related to a natural scene as a training sample;
Step two, inputting training samples in batches of 8 images into a preliminary feature extraction network (the feature extraction network of PixelLink), whose basic framework is a VGG16 network arranged in a Unet structure. The top-down path is the VGG16 network, a deep network built from stacked 3×3 convolutions and max pooling. Stacking several small convolutions in series needs fewer parameters and yields more nonlinear transformations than a single larger convolution kernel.
The bottom-up path is the up-sampling stage; up-sampling uses bilinear interpolation.
Lateral connections prevent the loss of context that direct up-sampling of the VGG16 feature maps would cause: feature maps of the same spatial size from the top-down and bottom-up paths are fused, complementing missing information and making the up-sampled feature representation stronger.
Step three, the feature extraction network of PixelLink yields 4 feature mapping layers: h4, h3, h2 and h1; the 4 layers are up-sampled to the size of h4 and their pixel values averaged, with the number of channels unchanged, which is called feature fusion; the up-sampling is bilinear interpolation; the formula of feature fusion is:
F = (h4 + Up×2(h3) + Up×4(h2) + Up×4(h1)) / 4    (1)
where Up×2(·) and Up×4(·) denote up-sampling by factors of 2 and 4, respectively;
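As a concrete sketch of the fusion in Eq. (1), the NumPy code below bilinearly up-samples h3, h2 and h1 to the size of h4 and averages the pixel values; the function names and the pixel-centre alignment convention are illustrative assumptions, not details fixed by the patent.

```python
import numpy as np

def bilinear_upsample(x, factor):
    """Bilinearly up-sample an (H, W, C) feature map by an integer factor.
    Target pixel centres are mapped back to source coordinates (a common
    convention; the patent does not pin down the alignment, so this is an
    assumption)."""
    h, w, _ = x.shape
    ys = (np.arange(h * factor) + 0.5) / factor - 0.5
    xs = (np.arange(w * factor) + 0.5) / factor - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None, None]
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :, None]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bot = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def fuse_features(h4, h3, h2, h1):
    """Eq. (1): F = (h4 + Up×2(h3) + Up×4(h2) + Up×4(h1)) / 4, channels unchanged."""
    return (h4
            + bilinear_upsample(h3, 2)
            + bilinear_upsample(h2, 4)
            + bilinear_upsample(h1, 4)) / 4.0
```

With the sizes given later in the description (h4 of 64×64, h3 of 32×32, h2 and h1 of 16×16), all four terms align at 64×64 and the channel count stays unchanged.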
Step four, taking the output of the feature fusion as the input of a depth pyramid attention model; adding this model refines the features further and utilizes the model more fully;
the depth pyramid attention model consists of three branches: depth feature pyramid network branches, nonlinear transformation branches, and global average pooling branches. The invention does not simply add the extracted information to the depth feature pyramid network, but performs refinement processing. The depth feature pyramid network branches are convolved with 2 7 x 7, 25 x 5,2 x 3*3, respectively, in order to extract information from different pyramid scales. The same convolution kernel adopts a serial form, and different convolution kernels adopt a parallel form. The present invention labels conv7×7 in the left half, bn, relu as conv7_1, conv7×7 in the right half, bn as conv7_2. Similarly, conv5 x 5 in the left half, bn, relu is denoted Conv5_1, conv5 x 5 in the right half, bn is denoted Conv5_2, conv3 x 3 in the left half, bn, relu is denoted Conv3_1, conv3 x 3 in the right half, bn is denoted Conv3_2. The refining process is as follows: the feature map after feature fusion first goes through conv7_1, conv5_1, conv3_1 and conv3_2, respectively. The feature map of conv3_2 is then up-sampled and superimposed with the feature map of conv5_1 by pixel values and the superimposed result is input to conv5_2. And finally, up-sampling the Conv5_2 feature map, superposing the pixel values with the Conv7_1 feature map, and inputting the superposition result to the Conv7_2. Wherein the up-sampling is deconvolution, the size of the kernel is 4*4, the step size is 2, and BN and Relu activation functions are used;
Step five, inputting the refined feature mapping layer into the PixelLink output network;
the PixelLink output network consists mainly of two parts: the first part predicts whether each pixel is text; the second part predicts whether each pixel and the 8 pixels around it belong to the same text instance; positive pixels are joined through positive links to form connected components, each component being one text instance;
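The linking rule of the output network can be sketched as a union-find over positive pixels: two positive pixels end up in the same text instance whenever one of them predicts a positive link toward the other. This is a minimal illustration; the ordering of the 8 link channels is an assumption.

```python
import numpy as np

# 8-neighbour offsets; the patent predicts a link for each of a pixel's 8
# neighbours, but the channel ordering below is assumed for illustration.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def link_pixels(text_pos, link_pos):
    """Group positive pixels into text instances via union-find.
    text_pos: (H, W) bool; link_pos: (H, W, 8) bool, channel k = link to NEIGHBOURS[k].
    Returns an (H, W) int label map, 0 for background."""
    h, w = text_pos.shape
    parent = {}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for y in range(h):
        for x in range(w):
            if not text_pos[y, x]:
                continue
            parent.setdefault((y, x), (y, x))
            for k, (dy, dx) in enumerate(NEIGHBOURS):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and text_pos[ny, nx] and link_pos[y, x, k]:
                    parent.setdefault((ny, nx), (ny, nx))
                    ra, rb = find((y, x)), find((ny, nx))
                    if ra != rb:
                        parent[ra] = rb          # merge the two components

    labels = np.zeros((h, w), dtype=int)
    roots = {}
    for p in parent:
        labels[p] = roots.setdefault(find(p), len(roots) + 1)
    return labels
```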
Step six, finally, the segmented text instances are turned into final connected domains via minAreaRect in the OpenCV connected-domain method; a connected region whose shortest side is less than 10 pixels or whose area is less than 300 pixels is regarded as a false detection and the text region is automatically filtered out; finally the bounding boxes are output.
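The post-processing of step six can be sketched as follows; only the two thresholds come from the patent, the helper names are illustrative, and whether "area" means pixel count or rectangle area is an assumption (pixel count is used here).

```python
import numpy as np

MIN_SIDE = 10    # shortest rectangle side in pixels (threshold from step six)
MIN_AREA = 300   # minimum region area in pixels (threshold from step six)

def keep_box(width, height, area):
    """Filtering rule of step six: a component whose shortest side is under
    10 px or whose area is under 300 px is treated as a false detection."""
    return min(width, height) >= MIN_SIDE and area >= MIN_AREA

def boxes_from_mask(text_mask):
    """Turn a binary text-instance mask into oriented bounding boxes using
    OpenCV connected components + minAreaRect, as the patent describes.
    cv2 is imported lazily so keep_box stays testable without OpenCV."""
    import cv2
    n, labels = cv2.connectedComponents(text_mask.astype(np.uint8))
    boxes = []
    for lab in range(1, n):
        ys, xs = np.nonzero(labels == lab)
        pts = np.stack([xs, ys], axis=1).astype(np.float32)
        rect = cv2.minAreaRect(pts)          # ((cx, cy), (w, h), angle)
        w, h = rect[1]
        if keep_box(w, h, len(xs)):
            boxes.append(cv2.boxPoints(rect))
    return boxes
```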
The invention has the beneficial effects that:
(1) Feature fusion and the depth pyramid attention model raise the utilization of the model, overcoming the defect that many text detection models are well designed yet underutilized, which limits overall performance.
(2) The problem that long-range dependencies vanish as convolution, which is based on local receptive fields, deepens is avoided.
(3) The method is effective for multi-scale text.
Drawings
Fig. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of the overall network architecture of the present invention.
FIG. 3 is a schematic diagram of a portion of a deep pyramid attention network architecture.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings and technical schemes.
As shown in fig. 1, the following steps are specifically described:
Step one, taking the training set of a text public data set related to natural scenes as training samples;
Step two, using the feature extraction network of PixelLink as the preliminary feature extraction network; the basic framework is a VGG16 network with a Unet structure;
the Unet is composed of a top-down path, a bottom-up path and a transverse connection.
(1) The top-down path is the VGG16 network, a deep network built from stacked 3×3 convolutions and max pooling. Stacking several small convolutions in series needs fewer parameters and yields more nonlinear transformations than a single larger convolution kernel.
(2) The bottom-up path is the up-sampling stage; up-sampling uses bilinear interpolation.
(3) Lateral connections prevent the context loss that direct up-sampling of the VGG16 feature maps would cause: feature maps of the same spatial size from the two paths are fused, complementing missing information and making the up-sampled feature representation stronger.
Step three, the feature extraction network of PixelLink yields 4 feature mapping layers: h4, h3, h2 and h1; the 4 layers are up-sampled to the size of h4 and their pixel values averaged, with the number of channels unchanged, namely feature fusion; the up-sampling is bilinear interpolation; the formula of feature fusion is:
F = (h4 + Up×2(h3) + Up×4(h2) + Up×4(h1)) / 4    (1)
where Up×2(·) and Up×4(·) denote up-sampling by factors of 2 and 4, respectively;
(1) Owing to hardware constraints, the training image size is 256×256; h4 is 64×64, h3 is 32×32, h2 is 16×16, and h1 is 16×16.
Step four, taking the output of the feature fusion as the input of the depth pyramid attention network, which further refines the features and utilizes the model more fully;
(1) The depth pyramid attention network consists of a depth feature pyramid network branch, a nonlinear transformation branch, and a global average pooling branch. Rather than simply fusing the features of each branch, the depth feature pyramid network branch is given additional structure so that each of its parts is further refined.
Step five, the refined feature mapping layer is input to the PixelLink output network.
(1) This output network mainly comprises two parts. The first predicts whether each pixel is text or not; the second predicts whether each pixel and the 8 pixels around it belong to the same text instance. Positive pixels are joined through positive links into connected components, each component being one text instance;
Step six, the segmented text instances are finally turned into connected domains via minAreaRect in the OpenCV connected-domain method; because this method is sensitive to noise and may predict noise as real text, several thresholds are set to reduce false positives. A connected region whose shortest side is less than 10 pixels or whose area is less than 300 pixels is regarded as a false detection and the text region is automatically filtered out; finally the bounding boxes are output.
The refinement network of the invention thus brings two benefits: model utilization is improved, avoiding the situation where many current text detection models are well designed yet cannot be fully exploited; and the long-range dependencies that convolution, based on local receptive fields, loses with increasing depth are preserved.
The following describes embodiments of the invention in detail with reference to the accompanying drawings; the embodiments and their operating procedures are given on the premise of the technical solution of the invention, but the scope of protection is not limited to them.
The experiments used the ICDAR2015 and ICDAR2013 data sets. ICDAR2015 contains 1500 natural-scene images at 1280×720 resolution, of which 1000 are training images and 500 are test images. Unlike images from earlier ICDAR competitions, these were captured with Google Glass and shot very casually; the text can be tilted and blurred, deliberately increasing the detection difficulty.
ICDAR2013 contains 229 training images and 233 test images. The data set is a subset of ICDAR2011, with duplicate images removed and incorrect annotations repaired. It is widely used in text detection but contains only horizontal text.
The experiments ran on a computer with an Intel(R) Core i7-6700 CPU @ 3.40 GHz under Linux Ubuntu 14.04, using PyCharm with Python 2.7. The deep-learning framework was tensorflow-gpu 1.3.0; the main additional libraries were OpenCV 2, setproctitle, and matplotlib.
ICDAR2015 experiment: training images from the ICDAR2015 data set were input at 256×256, and test images at 1280×704. Evaluation uses the recall (R), precision (P), and F-measure (F) published by the ICDAR2015 challenge.
Table 1 reports the R, P, and F values of the proposed model and of PixelLink on the ICDAR2015 data set:
table 1 ICDAR2015 multi-directional text detection experimental results
Model | Recall (R) | Precision (P) | F-measure |
---|---|---|---|
Proposed model | 0.7708 | 0.7595 | 0.7651 |
PixelLink | 0.7299 | 0.7607 | 0.7450 |
ICDAR2013 experiment: training images from the ICDAR2013 data set were input at 256×256, and test images at 384×384. Evaluation uses the recall (R), precision (P), and F-measure (F) published by the ICDAR2013 challenge.
Table 2 reports the R, P, and F values of the proposed model and of PixelLink on the ICDAR2013 data set:
table 2 ICDAR2013 horizontal text test results
Model | Recall (R) | Precision (P) | F-measure |
---|---|---|---|
Proposed model | 0.8168 | 0.7041 | 0.7563 |
PixelLink | 0.6919 | 0.7508 | 0.7201 |
Claims (1)
1. A natural scene text detection method based on depth pyramid attention and feature fusion is characterized by comprising the following steps:
step one, taking a text public data set related to a natural scene as a training sample;
step two, inputting training samples in batches of 8 images into a preliminary feature extraction network, of which the basic framework is a VGG16 network adopting a Unet structure; the preliminary feature extraction network is the feature extraction network of PixelLink;
step three, the feature extraction network of PixelLink yields 4 feature mapping layers: h4, h3, h2 and h1; the 4 layers are up-sampled to h4 and their pixel values averaged, with the number of channels unchanged, which is called feature fusion; the up-sampling is bilinear interpolation; the formula of feature fusion is:
F = (h4 + Up×2(h3) + Up×4(h2) + Up×4(h1)) / 4    (1)
where Up×2(·) and Up×4(·) denote up-sampling by factors of 2 and 4, respectively;
step four, taking the output of the feature fusion as the input of a depth pyramid attention model, which further refines the features and utilizes the model more fully;
the depth pyramid attention model consists of three branches: a depth feature pyramid network branch, a nonlinear transformation branch, and a global average pooling branch; the depth feature pyramid network branch uses 2 convolutions of 7×7, 2 convolutions of 5×5 and 2 convolutions of 3×3, so as to extract information from different pyramid scales; convolutions with the same kernel size are connected in series, and those with different kernel sizes in parallel; the 7×7 convolution + BN + ReLU in the left half is labeled Conv7_1, and the 7×7 convolution + BN in the right half Conv7_2; similarly, the 5×5 convolution + BN + ReLU in the left half is labeled Conv5_1, the 5×5 convolution + BN in the right half Conv5_2, the 3×3 convolution + BN + ReLU in the left half Conv3_1, and the 3×3 convolution + BN in the right half Conv3_2; the refining process is as follows: the feature mapping after feature fusion first passes through Conv7_1, Conv5_1, Conv3_1 and Conv3_2 respectively; the feature map of Conv3_2 is then up-sampled, superposed pixel-wise with the feature map of Conv5_1, and the result is input to Conv5_2; finally, the feature map of Conv5_2 is up-sampled, superposed pixel-wise with the feature map of Conv7_1, and the result is input to Conv7_2; the up-sampling is a deconvolution with kernel size 4×4 and stride 2, using BN and ReLU activation functions;
step five, inputting the refined feature mapping layer into a PixelLink output network;
the PixelLink output network comprises two parts: the first part predicts whether a pixel is text; the second part predicts whether the pixel and the 8 pixels around it belong to the same text instance; positive pixels are connected by positive links to form connected components, each component being a text instance;
step six, finally, the segmented text instances are turned into final connected domains through minAreaRect in the OpenCV connected-domain method; a connected region whose shortest side is less than 10 pixels or whose area is less than 300 pixels is regarded as a false detection and the text region is automatically filtered out; finally the bounding boxes are output.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911192949.5A CN111062386B (en) | 2019-11-28 | 2019-11-28 | Natural scene text detection method based on depth pyramid attention and feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111062386A CN111062386A (en) | 2020-04-24 |
CN111062386B true CN111062386B (en) | 2023-12-29 |
Family
ID=70299270
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109325534A (en) * | 2018-09-22 | 2019-02-12 | 天津大学 | A kind of semantic segmentation method based on two-way multi-Scale Pyramid |
CN110097049A (en) * | 2019-04-03 | 2019-08-06 | 中国科学院计算技术研究所 | A kind of natural scene Method for text detection and system |
CN110287960A (en) * | 2019-07-02 | 2019-09-27 | 中国科学院信息工程研究所 | The detection recognition method of curve text in natural scene image |
WO2019192397A1 (en) * | 2018-04-04 | 2019-10-10 | 华中科技大学 | End-to-end recognition method for scene text in any shape |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10679085B2 (en) * | 2017-10-31 | 2020-06-09 | University Of Florida Research Foundation, Incorporated | Apparatus and method for detecting scene text in an image |
Non-Patent Citations (2)
- "Scene text detection based on feature pyramid" (in Chinese); Chang Yufei, Chen Xinpeng, Wang Yuanhang, Qian Bing; Journal of Information Engineering University, (05), full text *
- "Automatic building recognition in high-resolution imagery combining dilated-convolution residual networks and pyramid pooling representation" (in Chinese); Qiao Wenfan, Shen Li, Dai Yanshuai, Cao Yungang; Geography and Geo-Information Science, (05), full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |