CN110751154B - Complex environment multi-shape text detection method based on pixel-level segmentation - Google Patents

Complex environment multi-shape text detection method based on pixel-level segmentation

Info

Publication number
CN110751154B
CN110751154B (application CN201910929393.7A)
Authority
CN
China
Prior art keywords
text
image
pixel
segmentation
fused
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910929393.7A
Other languages
Chinese (zh)
Other versions
CN110751154A (en)
Inventor
袁媛
王琦
陈旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN201910929393.7A priority Critical patent/CN110751154B/en
Publication of CN110751154A publication Critical patent/CN110751154A/en
Application granted granted Critical
Publication of CN110751154B publication Critical patent/CN110751154B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/187 Segmentation; Edge detection involving region growing; involving region merging; involving connected component labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image


Abstract

The invention provides a complex environment multi-shape text detection method based on pixel-level segmentation. First, the images in the dataset are preprocessed, including augmentation, to expand the dataset and to generate labels of different sizes; next, a complex-environment text segmentation model based on a fully convolutional network is constructed and trained; finally, the trained model performs text detection on a given image. The method can detect text of various shapes, including curved text, effectively handles text at different scales, is robust to illumination changes and complex backgrounds, and achieves high detection precision and recall.

Description

Complex environment multi-shape text detection method based on pixel-level segmentation
Technical Field
The invention belongs to the technical field of computer vision and graphic processing, and particularly relates to a complex environment multi-shape text detection method based on pixel-level segmentation.
Background
Text recognition consists of two specific steps, text detection and character recognition, neither of which can be omitted, and detection is the prerequisite for recognition. Text detection is not a simple task, and detection in complex scenes is especially challenging. Yet text recognition in natural scenes is important for intelligent transportation, autonomous driving, picture translation and the like, and because of this strong application value it is also a research hotspot in the field of computer vision.
Text in natural scenes is highly complex, varying in tilt angle, language, arrangement, scale and font. Shooting conditions add further difficulty: brightness changes, blur, or deformation of the text caused by the capture conditions all increase the complexity of natural-scene text and make detection harder. Since traditional methods have difficulty coping with such complexity, machine learning methods have increasingly been applied to text detection in recent years.
Scene text detection methods based on deep learning mainly build on convolutional neural networks and fall roughly into two categories. The first is regression-based methods, typically built on a generic object detection framework. For example, "J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, 'Arbitrary-oriented scene text detection via rotation proposals,' IEEE Transactions on Multimedia, vol. 20, no. 11, pp. 3111-3122, 2018" proposes the RRPN method, which generates rotated candidate regions based on the Faster R-CNN region proposal network (RPN) to detect text in any orientation. The second is segmentation-based methods, primarily built on fully convolutional networks (FCNs). For example, "D. Deng, H. Liu, X. Li, and D. Cai, 'PixelLink: Detecting scene text via instance segmentation,' Proc. AAAI Conference on Artificial Intelligence, 2018" proposes the PixelLink method, which classifies pixels as text/non-text and predicts pixel links between different text instances, then performs connected-component analysis and merging to obtain the final text boxes.
Building on generic detection, these methods overcome the difficulty traditional methods have with oblique text. They still have limitations, however, such as the inability to cope effectively with text exhibiting large curvature or large scale variation.
Disclosure of Invention
To overcome the defects that existing text detection methods cannot handle text with large curvature or scale variation and cannot correctly separate multi-line text, the invention provides a complex environment multi-shape text detection method based on pixel-level segmentation. First, the images in the dataset are preprocessed, including augmentation, to expand the dataset and to generate labels of different sizes; next, a complex-environment text segmentation model based on a fully convolutional network is constructed and trained; finally, the trained model performs text detection on a given image. The method can detect text of various shapes, including curved text, effectively handles text at different scales, is more robust to illumination changes and complex backgrounds, and achieves higher detection precision and recall.
A complex environment multi-shape text detection method based on pixel level segmentation is characterized by comprising the following steps:
step 1, data preprocessing:
All images in the dataset are augmented, and the augmented images are merged with the original dataset into a new image dataset. The text region labels of each image in the new dataset are shrunk to 1/2 and 1/4 of their original size, and together with the original labels this yields three groups of labels. The augmentation comprises image rotation, brightness adjustment and scaling.
Step 2, constructing and training a complex environment text segmentation model based on a full convolution network:
step 2.1: input the samples into a ResNet50 network and extract the outputs of the pool2, pool3, pool4 and pool5 layers to obtain 4 features of different scales, denoted f_1, f_2, f_3 and f_4 from the smallest scale to the largest;
step 2.2: pass the smallest-scale feature f_1 through an upsampling layer, concatenate it with f_2, and input the concatenated features into a feature fusion module to obtain a first fused feature; pass the first fused feature through an upsampling layer, concatenate it with f_3, and pass the concatenated features through a feature fusion module to obtain a second fused feature; pass the second fused feature through an upsampling layer, concatenate it with f_4, and pass the concatenated features through a feature fusion module to finally obtain features fusing all 4 scales; the feature fusion module consists of a convolution layer with a 3×3 kernel, a Batch Normalization layer and a ReLU layer;
step 2.3: input the finally fused features of step 2.2 into a convolution layer with a 1×1 kernel followed by a Sigmoid activation layer to obtain a pixel-level segmentation image;
step 2.4: train the model of steps 2.1 to 2.3 against the image labels, using cross entropy as the loss function to compute the loss value; training with the three groups of different labels yields three segmentation models;
step 3, text detection:
step 3.1: input the text image to be detected into each of the three segmentation models obtained in step 2 and binarize the outputs, obtaining three segmentation results A_1, A_2 and A_3, corresponding respectively to the 1/4, 1/2 and original-size text region segmentation images;
step 3.2: perform connected-component analysis on A_1 and mark different connected components with different positive integers; superpose the marked image on A_2, perform connected-component analysis on the superposed image, and apply region removal and region expansion to obtain a 1/2-sized segmentation image A'_2; superpose A'_2 on A_3, perform connected-component analysis on the superposed image, and apply region removal and region expansion to obtain the final original-size segmentation image A'_3; region removal means that, for a connected region whose maximum value is 1, all pixel values are set to 0; after region removal, each remaining pixel of value 1 is set to the value of the nearest pixel whose value is neither 0 nor 1;
step 3.3: process the segmentation image A'_3 with an OpenCV contour detection function to obtain the contour point coordinates of the different text regions.
The invention has the beneficial effects that: because the network model fuses features of different scales, the method detects text of various sizes well. Because an image segmentation approach is adopted, not only rectangular text regions but also irregular text such as curved text can be detected well. Because the text core region is expanded, multi-line text in dense regions can be separated well, and compared with direct segmentation, overlapping text regions are also separated well, which reduces the false detection rate. The deep network of the method can cope with text detection tasks in complex backgrounds, with higher detection accuracy and better robustness.
Drawings
FIG. 1 is a flowchart of the complex environment multi-shape text detection method of the present invention
FIG. 2 is a diagram of the complex environment text segmentation model of the present invention
Detailed Description
The present invention is further described below with reference to the drawings and an embodiment; the invention includes, but is not limited to, the following embodiment.
As shown in fig. 1, the present invention provides a method for detecting a multi-shape text in a complex environment, which is implemented as follows:
1. data pre-processing
Step 1.1: images in the used ICDAR2015 and Total-Text datasets, whose pictures carry Text region labels, are first data enhanced to fit complex scenes. The image in the data set is subjected to image data enhancement through a combination of rotation, brightness adjustment and a scaling mode, in the embodiment, the rotation angle is randomly generated from-90 degrees to 90 degrees, the brightness adjustment mode is that the brightness is randomly increased or decreased by 50%, and the scaling mode is that the brightness is randomly scaled 1/2 to 2 times. After the images are subjected to the enhancement processing, the processed images are combined into the original data set to obtain an expanded image data set, and the expanded image data set is used for training samples of a subsequent feature learning algorithm to deal with light changes and shooting angle changes in a complex environment.
Step 1.2: and respectively reducing the text region labels of each image in the new data set to 1/2 and 1/4, and adding the original labels to each image to obtain three groups of labels with different sizes. The method specifically comprises the following steps: firstly, generating an image (the size is the size of an original image) with all pixel values of 0, filling a text region of a label with 1 by using an Opencv polygon filling algorithm, then respectively corroding the text region with the widths of 1/4 and 3/8 (namely the minimum value of the distances between four corner points) by using an Opencv corrosion algorithm, so that a new text label is changed into 1/2 and 1/4 of the original size, and adding the original label to obtain three groups of pixel-level segmentation labels with different sizes.
2. Construction and training of complex environment text segmentation model based on full convolution network
As shown in FIG. 2, the model is constructed and trained as follows:
Step 2.1: Construct a multi-scale feature extractor based on a Feature Pyramid Network (FPN). Using ResNet50 as the backbone network, generate a feature pyramid and take the features output by the four layers pool2, pool3, pool4 and pool5, denoted f_1, f_2, f_3 and f_4 from the smallest scale to the largest.
The ResNet50 network is described in "Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition [A]. IEEE Conference on Computer Vision and Pattern Recognition [C]. 2016."
Step 2.2: f_1 passes through an upsampling layer (bilinear upsampling, doubling the scale) and is concatenated with f_2. The concatenated features pass through a feature fusion module, consisting of a convolution layer with a 3×3 kernel, a Batch Normalization layer and a ReLU layer, to obtain features fusing f_1 and f_2. Likewise, the resulting features pass through an upsampling layer, are concatenated with f_3 and pass through the feature fusion module, fusing f_1, f_2 and f_3. The new features pass through an upsampling layer, are concatenated with f_4 and pass through the feature fusion module, finally yielding the features obtained by fusing all 4 scales.
Step 2.3: the fused transformation features are segmented using convolution layers (Conv1x1) with convolution kernel size 1x1 and Sigmoid function activation layers, and segmented images with pixel values of 0 to 1 are output, corresponding to the confidence of each pixel in the detection region.
Step 2.4: and inputting the labeled image to train the model. The cross entropy is used as a loss function to calculate a loss value, the learning rate is set to be 0.001, the batch size is set to be 32, and the model is trained by using a stochastic gradient descent method. And respectively obtaining three text segmentation models for the three groups of different labels.
3. Text detection
Step 3.1: the text image to be detected is input into the above to obtain three segmentation models, the output is subjected to binarization processing, and three segmentation results, namely segmentation images of 1/4 size, 1/2 size and original size of each text region are respectively obtained and are respectively represented as A _1, A _2 and A _ 3. The binarization threshold value is set to 0.6 in this embodiment.
Step 3.2: and performing connected component analysis on A _1, and marking different connected components (the marking method is to set all pixel values in the components to different positive integers). The resulting image is superimposed with a _2, i.e. the value of each pixel is added, and connected domain analysis is performed. Setting all the values to be 0 for the connected region with the maximum value of 1, and removing the text region with lower credibility; for each remaining pixel with a value of 1, the value of the pixel with a value other than 0 or 1 closest to the pixel is set, resulting in an example segmented image that extends to a size of 1/2. Similarly, the obtained image is superposed with the segmentation result of a _3, and the same operation as the above process is performed, and finally the text instance segmentation image is obtained.
Step 3.3: and processing the segmented image obtained in the last step by using an OpenCV contour detection function to obtain contour point coordinates of different text areas, namely the required final output result.
To verify the effectiveness of the method of the present invention, simulation experiments were performed under an Ubuntu 18.04 LTS system on a machine with an Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz, 64 GB of memory and a GeForce 1080Ti GPU, using the PyTorch framework. The experiments used the public dataset ICDAR2015, which contains oblique text, and the public dataset Total-Text, which contains curved text.
First, features are learned on the training set following the training steps of the detailed description; then the images in the test set are detected following the detection steps, and, against the ground-truth annotations, the precision P (the accuracy of the detection results), the recall R (the proportion of existing text regions that are detected) and the F value are computed. The F value combines precision and recall; the larger it is, the better the method performs.
For comparison, the Connectionist Text Proposal Network (CTPN) ("Z. Tian, W. Huang, T. He, P. He, and Y. Qiao, 'Detecting text in natural image with connectionist text proposal network,' in ECCV, 2016"), SegLink ("B. Shi, X. Bai, and S. Belongie, 'Detecting oriented text in natural images by linking segments,' in CVPR, 2017") and the Rotation Region Proposal Network (RRPN) ("J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, 'Arbitrary-oriented scene text detection via rotation proposals,' IEEE Transactions on Multimedia, 2018") were selected as comparison methods, and the results are given in Tables 1 and 2. The results show that the method detects both oblique and curved text well; in particular, its results on curved text far exceed those of the other methods, demonstrating good practicality and robustness for detecting complex text in natural environments.
TABLE 1 (ICDAR2015)

Method                   Recall    Precision   F value
CTPN                     51.56%    74.22%      60.85%
SegLink                  76.8%     73.1%       75.0%
RRPN                     73.0%     82.0%       77.0%
Method of the invention  73.62%    79.81%      76.6%

TABLE 2 (Total-Text)

Method                   Recall    Precision   F value
CTPN                     20.7%     28.6%       24.0%
SegLink                  23.8%     30.3%       26.7%
RRPN                     36.2%     40.2%       38.09%
Method of the invention  69.54%    77.02%      73.09%

Claims (1)

1. A complex environment multi-shape text detection method based on pixel level segmentation is characterized by comprising the following steps:
step 1, data preprocessing:
respectively carrying out enhancement processing on all images in the data set, and combining the images subjected to enhancement processing and the images in the original data set into a new image data set; respectively reducing the text region labels of each image in the new data set to 1/2 and 1/4, and adding the original labels to obtain three groups of labels; the enhancement processing comprises image rotation, brightness adjustment and scaling processing;
step 2, constructing and training a complex environment text segmentation model based on a full convolution network:
step 2.1: inputting the samples into a ResNet50 network, and extracting the outputs of the pool2, pool3, pool4 and pool5 layers to obtain 4 features of different scales, denoted f_1, f_2, f_3 and f_4 from the smallest scale to the largest;
step 2.2: passing the smallest-scale feature f_1 through an upsampling layer and concatenating it with f_2, and inputting the concatenated features into a feature fusion module to obtain a first fused feature; passing the first fused feature through an upsampling layer and concatenating it with f_3, and passing the concatenated features through a feature fusion module to obtain a second fused feature; passing the second fused feature through an upsampling layer and concatenating it with f_4, and passing the concatenated features through a feature fusion module to finally obtain features fusing all 4 scales; the feature fusion module consists of a convolution layer with a 3×3 kernel, a Batch Normalization layer and a ReLU layer;
step 2.3: inputting the transformation characteristics finally fused in the step 2.2 into a convolution layer with a convolution kernel size of 1x1, and activating the layer through a Sigmoid function to obtain a pixel-level segmentation image;
step 2.4: training the models in the steps 2.1 to 2.3 by taking the label of the image as a target and using cross entropy as a loss function to calculate a loss value, and training three groups of different labels to obtain three segmentation models;
step 3, text detection:
step 3.1: inputting the text image to be detected into each of the three segmentation models obtained in step 2 and binarizing the outputs to obtain three segmentation results A_1, A_2 and A_3, corresponding respectively to the 1/4, 1/2 and original-size text region segmentation images;
step 3.2: performing connected-component analysis on A_1 and marking different connected components with different positive integers; superposing the marked image on A_2, performing connected-component analysis on the superposed image, and applying region removal and region expansion to obtain a 1/2-sized segmentation image A'_2; superposing A'_2 on A_3, performing connected-component analysis on the superposed image, and applying region removal and region expansion to obtain the final original-size segmentation image A'_3; the region removal means that, for a connected region whose maximum value is 1, all pixel values are set to 0; after the region removal, each remaining pixel of value 1 is set to the value of the nearest pixel whose value is neither 0 nor 1;
step 3.3: processing the segmentation image A'_3 with an OpenCV contour detection function to obtain the contour point coordinates of the different text regions.
CN201910929393.7A 2019-09-27 2019-09-27 Complex environment multi-shape text detection method based on pixel-level segmentation Active CN110751154B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910929393.7A CN110751154B (en) 2019-09-27 2019-09-27 Complex environment multi-shape text detection method based on pixel-level segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910929393.7A CN110751154B (en) 2019-09-27 2019-09-27 Complex environment multi-shape text detection method based on pixel-level segmentation

Publications (2)

Publication Number Publication Date
CN110751154A CN110751154A (en) 2020-02-04
CN110751154B true CN110751154B (en) 2022-04-08

Family

ID=69277379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910929393.7A Active CN110751154B (en) 2019-09-27 2019-09-27 Complex environment multi-shape text detection method based on pixel-level segmentation

Country Status (1)

Country Link
CN (1) CN110751154B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368848B (en) * 2020-05-28 2020-08-21 北京同方软件有限公司 Character detection method under complex scene
CN112200181B (en) * 2020-08-19 2023-10-10 西安理工大学 Character shape approximation method based on particle swarm optimization algorithm
CN112926372B (en) * 2020-08-22 2023-03-10 清华大学 Scene character detection method and system based on sequence deformation
CN112101355B (en) * 2020-09-25 2024-04-02 北京百度网讯科技有限公司 Method and device for detecting text in image, electronic equipment and computer medium
CN113255646B (en) * 2021-06-02 2022-10-18 北京理工大学 Real-time scene text detection method
CN114049625B (en) * 2021-11-11 2024-02-27 西北工业大学 Multidirectional text detection method based on novel image shrinkage method


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784654A (en) * 2016-08-26 2018-03-09 杭州海康威视数字技术股份有限公司 Image partition method, device and full convolutional network system
CN107609549A (en) * 2017-09-20 2018-01-19 北京工业大学 The Method for text detection of certificate image under a kind of natural scene
CN108288088A (en) * 2018-01-17 2018-07-17 浙江大学 A kind of scene text detection method based on end-to-end full convolutional neural networks
CN108830855A (en) * 2018-04-02 2018-11-16 华南理工大学 A kind of full convolutional network semantic segmentation method based on the fusion of multiple dimensioned low-level feature
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN110059539A (en) * 2019-02-27 2019-07-26 天津大学 A kind of natural scene text position detection method based on image segmentation
CN110008950A (en) * 2019-03-13 2019-07-12 南京大学 The method of text detection in the natural scene of a kind of pair of shape robust
CN110232381A (en) * 2019-06-19 2019-09-13 梧州学院 License Plate Segmentation method, apparatus, computer equipment and computer readable storage medium

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
Arbitrary-Oriented Scene Text Detection via Rotation Proposals; Jianqi Ma et al.; IEEE Transactions on Multimedia; 2018-11-30; vol. 20, no. 11; pp. 3111-3122 *
Detecting Oriented Text in Natural Images by Linking Segments; Baoguang Shi et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017-11-09; pp. 3482-3490 *
Detecting Text in Natural Image with Connectionist Text Proposal Network; Zhi Tian et al.; ECCV 2016; 2016-09-17; pp. 56-72 *
Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes; Pengyuan Lyu et al.; arXiv; 2018-08-01; pp. 1-18 *
PixelLink: Detecting Scene Text via Instance Segmentation; Dan Deng et al.; arXiv; 2018-01-04; pp. 1-8 *
TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes; Shangbang Long et al.; ECCV 2018; 2018-10-09; pp. 19-35 *
Arbitrary-Orientation Text Recognition Based on Semantic Segmentation; Wang Tao et al.; Applied Science and Technology (应用科技); 2018-06; vol. 45, no. 3; pp. 55-60 *

Also Published As

Publication number Publication date
CN110751154A (en) 2020-02-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant