CN110516669B - Multi-level and multi-scale fusion character detection method in complex environment - Google Patents

Multi-level and multi-scale fusion character detection method in complex environment

Info

Publication number
CN110516669B
CN110516669B · CN201910781042.6A
Authority
CN
China
Prior art keywords
features
substep
network
scale
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910781042.6A
Other languages
Chinese (zh)
Other versions
CN110516669A (en)
Inventor
袁媛
王琦
刘琛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN201910781042.6A priority Critical patent/CN110516669B/en
Publication of CN110516669A publication Critical patent/CN110516669A/en
Application granted granted Critical
Publication of CN110516669B publication Critical patent/CN110516669B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/24Aligning, centring, orientation detection or correction of the image
    • G06V10/243Aligning, centring, orientation detection or correction of the image by compensating for image skew or non-uniform image deformations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images

Abstract

The invention relates to a multi-level and multi-scale fusion character detection method in a complex environment. It effectively solves the problem of text localization and detection against complex backgrounds (variable illumination, variable contrast, and the like), trains quickly, and reaches a detection precision above 77%. For pictures containing text of various shapes and scales in natural scenes, the method is efficient, accurate, and simple.

Description

Multi-level and multi-scale fusion character detection method in complex environment
Technical Field
The invention belongs to the technical field of computer vision and graphics processing, and particularly relates to a multi-level and multi-scale fusion character detection method in a complex environment.
Background
Text detection in complex scenes plays an important role in intelligent transportation, bill recognition, and similar applications. Because the pictures to be examined are usually captured in complex real-world scenes, detection must cope with interference such as poor image quality, blurred fonts, curved text, low contrast, and diverse typefaces. Moreover, unlike general targets, words or lines of text vary greatly in shape (for example, in length), so text detection in a complex environment also faces large variations in target size.
To address these problems, two families of text detection methods have been built on top of general object detection. One detects the position of a target center point, proposes candidate boxes of various sizes and shapes around that point, and then selects the most appropriate candidate; the other skips candidate boxes entirely and directly regresses the coordinates.
Among candidate-box methods, the documents "J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, 'Arbitrary-oriented scene text detection via rotation proposals,' IEEE Transactions on Multimedia, vol. 20, no. 11, pp. 3111-3122, 2018" propose a rotated region proposal network that generates candidate boxes with oblique angles to cope with the shape variations of text.
Among direct coordinate-regression methods, the documents "X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, 'EAST: An efficient and accurate scene text detector,' in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5551-5560" propose a simple and effective network that directly predicts text lines and removes the intermediate step of generating candidate boxes.
Both approaches have limitations: they struggle to exploit multi-scale feature information effectively in multi-scale text detection, and their large parameter counts and complex network structures limit their practicality.
Disclosure of Invention
The technical problem solved by the invention is as follows: to overcome the limitations of existing scene text detection algorithms on the problems above, the invention provides a multi-level and multi-scale fusion character detection method for complex environments.
The technical scheme of the invention is as follows: a multi-level and multi-scale fusion character detection method in a complex environment comprises the following steps:
Step one: a training phase comprising the following substeps:
Substep one: expand the image data of the labeled training pictures through a combination of three operations (rotation, flipping, and brightness change): extract 30% of all data, apply the three operations to obtain expanded image data, and merge it into the original image data to form a new enlarged image dataset for subsequent operations.
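For illustration, a minimal PyTorch-style sketch of this expansion, assuming PIL images and torchvision's functional transforms (transforming the ground-truth box annotations alongside the pixels is omitted here); the parameter ranges follow the further technical scheme given below:

```python
import random
import torchvision.transforms.functional as TF

def augment(img):
    # Rotate by a random angle drawn from [-90, 90] degrees.
    img = TF.rotate(img, random.uniform(-90.0, 90.0), expand=True)
    # Flip left-right, flip up-down, or keep unchanged, at random.
    mode = random.choice(["hflip", "vflip", "none"])
    if mode == "hflip":
        img = TF.hflip(img)
    elif mode == "vflip":
        img = TF.vflip(img)
    # Scale brightness by a random factor in [0.5, 1.5].
    return TF.adjust_brightness(img, random.uniform(0.5, 1.5))

def expand_dataset(images):
    # Draw 30% of the training pictures, augment them, and merge them back in.
    subset = random.sample(images, int(0.3 * len(images)))
    return images + [augment(im) for im in subset]
```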
Substep two: input the sample images of the enlarged image dataset obtained in substep one into a ResNeXt-101 network and extract the output features of its 8 layers 'conv1', 'conv2_1', 'conv3_1', 'conv3_4', 'conv4_3', 'conv4_12', 'conv4_20', and 'conv5_3'; these are depth features of different scales.
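The patent does not say how the eight taps are wired; a sketch of one way to collect them, using forward hooks on the torchvision ResNeXt-101 implementation, follows. The mapping from the patent's layer names to torchvision submodule paths is an assumption of this sketch.

```python
import torch
import torchvision

# Assumed correspondence between the patent's layer names and torchvision paths.
TAP_LAYERS = {
    "conv1": "conv1",        "conv2_1": "layer1.0",
    "conv3_1": "layer2.0",   "conv3_4": "layer2.3",
    "conv4_3": "layer3.2",   "conv4_12": "layer3.11",
    "conv4_20": "layer3.19", "conv5_3": "layer4.2",
}

backbone = torchvision.models.resnext101_32x8d(weights=None)  # or pretrained
features = {}

def make_hook(name):
    def hook(module, inputs, output):
        features[name] = output  # cache this layer's depth feature
    return hook

for name, path in TAP_LAYERS.items():
    backbone.get_submodule(path).register_forward_hook(make_hook(name))

x = torch.randn(1, 3, 512, 512)  # dummy input picture
backbone(x)                      # one forward pass fills `features`
```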
Substep three: fuse the 8 depth features of different scales obtained in substep two. First, pass each of the 8 features through its own transformation module, formed by cascading a convolution layer with a convolution kernel size of 1x1, a Batch Normalization layer, and a ReLU (rectified linear activation function) layer, to obtain 8 transformed features of different scales. Then apply bilinear upsampling to each transformed feature so that all are unified to the largest scale among them. Finally, stack the 8 scale-unified features along the channel dimension to form a multi-scale fusion feature.
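A sketch of one transformation module and of the fusion step; the output channel width (64) is an assumption, since the patent fixes only the 1x1 kernel size:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformModule(nn.Module):
    """1x1 convolution -> Batch Normalization -> ReLU, as in substep three."""
    def __init__(self, in_ch, out_ch=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

def fuse(depth_feats, transforms):
    # Transform each depth feature, bilinearly upsample all of them to the
    # largest spatial scale, and stack along the channel dimension.
    transformed = [t(f) for t, f in zip(transforms, depth_feats)]
    h = max(f.shape[2] for f in transformed)
    w = max(f.shape[3] for f in transformed)
    unified = [F.interpolate(f, size=(h, w), mode="bilinear",
                             align_corners=False) for f in transformed]
    return torch.cat(unified, dim=1)  # the multi-scale fusion feature
```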
Substep four: pass the multi-scale fusion feature obtained in substep three through k decoding networks to obtain multi-level multi-scale features. Each decoding network consists of n convolution layers mirrored by n completely symmetric deconvolution layers; its output has the same spatial size as its input, so it can be concatenated with the multi-scale fusion feature, and the concatenated features are sent to the next decoding network. Each decoding network yields 2n features at n scales; the 2nk features output by the k decoding networks are grouped by scale, features of the same scale are concatenated along the channel dimension, and the result is n multi-level features at n different scales.
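One possible realization of a single decoding network; stride-2 convolutions mirrored by stride-2 transposed convolutions with ReLU activations are assumptions (the patent specifies only the symmetric n-conv/n-deconv structure), and the input height and width are assumed divisible by 2^n so the output recovers the original size:

```python
import torch.nn as nn
import torch.nn.functional as F

class DecodingNetwork(nn.Module):
    """n stride-2 convolutions mirrored by n stride-2 transposed convolutions
    (n = 3 in the embodiment); every intermediate map is kept so that features
    of the same scale can later be merged across decoders."""
    def __init__(self, ch, n=3):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(ch, ch, 3, stride=2, padding=1) for _ in range(n)])
        self.deconvs = nn.ModuleList(
            [nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1) for _ in range(n)])

    def forward(self, x):
        feats = []
        for conv in self.convs:
            x = F.relu(conv(x))
            feats.append(x)    # n progressively downsampled features
        for deconv in self.deconvs:
            x = F.relu(deconv(x))
            feats.append(x)    # n progressively upsampled features (2n total)
        return x, feats        # x recovers the input spatial size
```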
Substep five: introduce a CBAM (Convolutional Block Attention Module) model; pass each of the n multi-level features obtained in substep four through a CBAM module to obtain n multi-level fusion features of different scales, then upsample these n features and concatenate them along the channel dimension into one multi-scale multi-level feature.
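CBAM itself is defined in Woo et al., 2018 (cited in the references below); a compact sketch of the module, with the paper's default reduction ratio of 16:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention (Woo et al., 2018)."""
    def __init__(self, ch, r=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(ch, ch // r),
            nn.ReLU(inplace=True),
            nn.Linear(ch // r, ch),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: a shared MLP over avg- and max-pooled descriptors.
        att = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) +
                            self.mlp(x.amax(dim=(2, 3))))
        x = x * att.view(b, c, 1, 1)
        # Spatial attention: 7x7 convolution over channel-wise avg and max maps.
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(pooled))
```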
Substep six: send the multi-scale multi-level feature into a regression prediction network consisting of a convolution layer with a convolution kernel size of 1x1 and a fully connected layer; it outputs a feature of size 1 x 5h representing the h predicted results. Each prediction result comprises 5 attributes (the coordinate values of the upper-left and lower-right corner points of the detection box, plus a prediction score), and the result with the maximum prediction score is finally screened out as the prediction.
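A sketch of such a head; the intermediate channel width, the spatial size of the incoming feature, and the value of h are assumptions needed to size the fully connected layer:

```python
import torch.nn as nn

class RegressionHead(nn.Module):
    """1x1 convolution followed by a fully connected layer emitting h results
    of 5 attributes each: (x1, y1, x2, y2, score)."""
    def __init__(self, in_ch, feat_hw, h=100, mid_ch=32):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.fc = nn.Linear(mid_ch * feat_hw * feat_hw, 5 * h)
        self.h = h

    def forward(self, x):
        z = self.conv(x).flatten(1)
        preds = self.fc(z).view(-1, self.h, 5)  # h results x 5 attributes
        best = preds[..., 4].argmax(dim=1)      # index of the maximum score
        return preds, best
```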
Substep seven: optimize the GIoU objective function with stochastic gradient descent, continuously updating the parameters of every network layer, and store the set of model parameters once iterative optimization makes the result stable.
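A sketch of the GIoU objective for boxes in the (x1, y1, x2, y2) corner format used by the prediction head; in training, torch.optim.SGD would then carry out the stochastic gradient descent updates over all network-layer parameters:

```python
import torch

def giou_loss(pred, target):
    """L = 1 - GIoU, with GIoU = IoU - |C - (A U B)| / |C| and C the smallest
    axis-aligned box enclosing both the predicted box A and the target box B."""
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=1e-7)
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c_area = (cw * ch).clamp(min=1e-7)          # enclosing box C
    giou = iou - (c_area - union) / c_area
    return (1.0 - giou).mean()
```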
Step two: a detection phase comprising the following substeps:
Substep one: for each text picture to be detected, read the model parameters obtained in the training phase, apply them to every network layer and fix them, then repeat substeps two to six of step one on the picture to obtain h predicted outputs, each comprising coordinate positions and a prediction score.
Substep two: select the coordinates of the regression box with the maximum prediction score as the text coordinate position.
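A minimal sketch of this detection phase, assuming the full pipeline above is wrapped in a single model whose forward returns the h predictions and the index of the best one (the file path and model interface are hypothetical):

```python
import torch

@torch.no_grad()
def detect(image, model, weights_path="text_detector.pt"):  # hypothetical path
    model.load_state_dict(torch.load(weights_path))  # read trained parameters
    model.eval()                                     # apply and fix them
    preds, best = model(image)    # h results of (x1, y1, x2, y2, score)
    return preds[0, best[0], :4]  # box with the maximum prediction score
```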
A further technical scheme of the invention is as follows: in substep one, the rotation angle is randomly generated between -90 and 90 degrees using a computer random seed; the flipping mode is chosen at random among left-right flipping, up-down flipping, and no flipping; and the brightness change randomly scales brightness by a factor of 0.5 to 1.5.
Effects of the invention
The technical effects of the invention are as follows: the invention discloses a method for localizing and detecting arbitrarily oriented text of varying scale. It effectively solves the problem of text localization and detection against complex backgrounds (variable illumination, variable contrast, and the like), trains quickly, and reaches a detection precision above 77%. For pictures containing text of various shapes and scales in natural scenes, the method is efficient, accurate, and simple.
Drawings
FIG. 1 is a flow chart of the multi-scale multi-level text detection method according to the present invention.
Detailed Description
Referring to fig. 1, the invention provides a multi-level and multi-scale fusion text detection method in a complex environment, addressing the limitations of conventional text detection methods and of general object detection on this problem. The technical scheme comprises two stages: a training phase and a testing phase.
A training stage:
1. Expand the original training set sample pictures by rotation, flipping, brightness change, and combinations thereof, obtaining a multi-angle, multi-brightness enlarged image dataset;
2. Input the sample images of the enlarged image dataset obtained in step 1 into a ResNeXt-101 network; balancing computational complexity against effectiveness, extract the outputs of the 8 layers 'conv1', 'conv2_1', 'conv3_1', 'conv3_4', 'conv4_3', 'conv4_12', 'conv4_20', and 'conv5_3' to obtain depth features at 8 different scales;
3. Pass the 8 depth features of different scales obtained in step 2 through 8 transformation modules to obtain 8 transformed features of different scales; apply bilinear upsampling to each so that all are unified to the largest scale among them; finally, stack the 8 unified features along the channel dimension into a multi-scale fusion feature;
4. Send the multi-scale fusion feature obtained in step 3 into the 1st decoding network to obtain a group of 1st-level features at multiple scales; concatenate the last-layer output of the 1st decoding network with the multi-scale fusion feature of step 3 and send it into the 2nd decoding network to obtain a group of 2nd-level features at multiple scales; concatenate the last-layer output of the 2nd decoding network with the fusion feature and send it into the 3rd decoding network to obtain a group of 3rd-level features, and so on, until the k decoding networks have produced k groups of k-level multi-scale features. Merge the same-scale features across the k groups to obtain one group of multi-scale 'multi-level features'. A decoding network is a lightweight feature-extraction unit formed by a convolution network and a deconvolution network; in this embodiment it consists of n convolution layers and n deconvolution layers, and features at multiple scales are extracted from every layer as one group of that level. Here k denotes the number of decoding networks and n the number of convolution layers (equal to the number of deconvolution layers) in each. In theory k may take any value from 2 upward, but in view of complexity 2 ≤ k ≤ 5 is generally used, and k = 3 in this embodiment; likewise n may in theory be any integer greater than 1, but in view of complexity n ≤ 5 is generally used, and n = 3 in this embodiment. A sketch of this cascade follows below.
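The following sketch of the k-decoder cascade and the cross-group merge reuses the DecodingNetwork sketch given earlier; the 1x1 projections that restore the channel width after each concatenation, and the grouping of features by spatial size, are assumptions of this sketch:

```python
import torch

def run_decoders(fusion, decoders, projs):
    """decoders: k DecodingNetwork instances (k = 3 here); projs: k 1x1 convs
    mapping the concatenated (output + fusion) width back to the input width."""
    all_feats, x = [], fusion
    for dec, proj in zip(decoders, projs):
        out, feats = dec(x)                        # one group of 2n features
        all_feats.extend(feats)
        x = proj(torch.cat([out, fusion], dim=1))  # input for the next decoder
    # Group the collected features by spatial size, then concatenate each group
    # along channels: one merged "multi-level feature" per distinct scale.
    buckets = {}
    for f in all_feats:
        buckets.setdefault(tuple(f.shape[-2:]), []).append(f)
    return [torch.cat(fs, dim=1) for fs in buckets.values()]
```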
5. Introduce a CBAM model: send the group of multi-scale 'multi-level features' obtained in step 4 through CBAM modules to obtain a group of multi-scale 'multi-level fusion features'; unify the different scales to the largest one by upsampling; finally, concatenate along the channel dimension to obtain the 'multi-scale multi-level feature'.
6. Send the multi-scale multi-level feature obtained in step 5 into a regression prediction network composed of a convolution layer and a fully connected layer. The output has size 1 x 5h and represents h preliminary predictions, each comprising four coordinate values and a confidence; the result with the maximum confidence is selected as the predicted coordinate position;
7. Optimize with the GIoU objective function, and store the set of depth-model parameters finally obtained from training;
Testing stage:
1. and when detecting the text picture every time, reading the model parameters obtained in the training stage, applying the model parameters to each network layer and fixing, repeating the steps 2 to 6 in the training stage on the text picture to be detected, obtaining h prediction outputs, wherein the outputs comprise coordinate positions and prediction scores. All the network layers mentioned above (including the ResNeXt-101 network, the transformation module, the decoding network, the CBAM, the prediction regression network and the like) have a large number of parameters, the parameters are continuously updated and optimized based on an objective function and an optimization method in a training stage, finally, stable solutions are obtained and stored, the parameters which are trained and optimized are directly read in a testing stage and applied to each network layer, and a picture to be tested can obtain an expected ideal output result through the network layers. The prediction score is also the result of the output of the network, which represents that this result of the prediction may not be credible, and is predicted by the model learned during the training phase without human factor control.
2. Select the coordinates of the regression box with the maximum prediction score as the text coordinate position. (The network outputs the coordinates of the upper-left and lower-right corners of the box; these four values determine a rectangle that frames the text in the picture.) To illustrate the implemented technical solution more clearly, each module required is briefly introduced in the description of the embodiment below. It should be apparent that the drawings in the following description are only flow charts of the present invention; those skilled in the art can expand them and obtain other drawings without creative effort.
Referring to fig. 1, the implementation steps of the invention are as follows:
step 1, firstly, performing data expansion on a labeled training picture through a combination of three modes of rotating, turning and changing light and shade, wherein the rotating angle is randomly generated by using a computer random seed from-90 degrees to 90 degrees, the turning mode is a random one of left-right turning, up-down turning and keeping unchanged, the changing light and shade mode is random scaling by 0.5-1.5 times, extracting 30% of all data to perform the three operations, obtaining expanded data, and combining the expanded data with the original data to form a new expanded data set for subsequent operation.
Step 2: input the sample images of the enlarged image dataset obtained in step 1 into the ResNeXt-101 network and extract the output features of its 8 layers 'conv1', 'conv2_1', 'conv3_1', 'conv3_4', 'conv4_3', 'conv4_12', 'conv4_20', and 'conv5_3'; these are depth features of different scales.
Step 3: fuse the 8 depth features of different scales obtained in step 2. First, pass each feature through its own transformation module, formed by cascading a convolution layer with a convolution kernel size of 1x1, a Batch Normalization layer, and a ReLU (rectified linear activation function) layer, to obtain 8 transformed features of different scales. Then apply bilinear upsampling to each transformed feature so that all are unified to the largest scale among them. Finally, stack the 8 scale-unified features along the channel dimension to form a multi-scale fusion feature.
Step 4: pass the multi-scale fusion feature obtained in step 3 through k decoding networks to obtain multi-level multi-scale features. Each decoding network consists of n convolution layers mirrored by n completely symmetric deconvolution layers; its output has the same spatial size as its input, so it can be concatenated with the multi-scale fusion feature and sent to the next decoding network. Each decoding network yields 2n features at n scales; the 2nk features output by the k decoding networks are grouped by scale, features of the same scale are concatenated along the channel dimension, and the result is n multi-level features at n different scales.
Step 5: introduce a CBAM (Convolutional Block Attention Module) model; pass each of the n multi-level features obtained in step 4 through a CBAM module to obtain n multi-level fusion features of different scales, then upsample these n features and concatenate them along the channel dimension into one multi-scale multi-level feature.
Step 6: send the multi-scale multi-level feature into the regression prediction network, which consists of a convolution layer with a convolution kernel size of 1x1 and a fully connected layer; it outputs a feature of size 1 x 5h representing the h predicted results. Each result comprises 5 attributes (the coordinate values of the upper-left and lower-right corner points of the detection box, plus a prediction score); the result with the maximum prediction score is finally kept as the prediction.
Step 7: optimize the GIoU objective function with stochastic gradient descent, continuously updating the parameters of every network layer, and store the set of model parameters once iterative optimization makes the result stable.
Step 8: remove the optimization of step 7, import the trained model parameters, process the picture to be tested through steps 2 to 6, infer the h output results, and select the one with the largest prediction score as the text position output.
The effects of the present invention can be further explained by the following simulation experiments.
1. Simulation conditions
The simulation was carried out with the PyTorch framework on the Ubuntu 14.04 LTS operating system, on a machine whose central processing unit is an Intel(R) Core(TM) i7-6800K CPU @ 3.40 GHz, with 128 GB of memory and a Tesla 1080Ti graphics processor.
The data used in the simulation were the text detection pictures of ICDAR 2015.
2. Simulation content
First, features are learned with the training set according to the training steps of the detailed description. Then, following the test steps, the pictures of the test set are compared against the ground-truth annotations to compute the precision P, the recall R, and the F1 value, where

P = TP / (TP + FP),  R = TP / (TP + FN),  F1 = 2PR / (P + R),

and TP, FP, and FN denote the numbers of true positive, false positive, and false negative detections.
To demonstrate the effectiveness of the algorithm, the Deep Matching Prior Network (DMPNet), the Connectionist Text Proposal Network (CTPN), and a fully convolutional multi-oriented text detection network (MCLAB FCN) are selected as comparison algorithms. The DMPNet algorithm is described in detail in "Y. Liu and L. Jin, 'Deep Matching Prior Network: Toward Tighter Multi-oriented Text Detection,' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3454-3461, 2017"; the CTPN algorithm is proposed in "Z. Tian, W. Huang, T. He, P. He, and Y. Qiao, 'Detecting Text in Natural Image with Connectionist Text Proposal Network,' in Proceedings of the European Conference on Computer Vision, pp. 56-72, 2016"; the MCLAB FCN algorithm is proposed in "Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai, 'Multi-oriented Text Detection with Fully Convolutional Networks,' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4159-4167, 2016". These comparison methods are representative methods with prominent results in the art; the comparison results are shown in Table 1.
TABLE 1 comparative results
Method Recall Precision F-measure
CTPN 51.56% 74.22% 60.85%
MCLAB_FCN 43.09% 70.81% 53.58%
DMPNet 68.22% 73.23% 70.64%
MMText(ours) 52.34% 76.26% 62.07%
As can be seen from Table 1, the detection precision of the invention (76.26%) is the highest among the compared methods. Although the recall is somewhat lower than DMPNet's, the method adapts well to complex environments and shows good practicality and robustness in such detection settings.
The effectiveness of the invention can be verified through the simulation experiment.

Claims (2)

1. A multi-level and multi-scale fusion character detection method in a complex environment, characterized by comprising the following steps:
Step one: a training phase comprising the following substeps:
Substep one: expand the image data of the labeled training pictures through a combination of three operations (rotation, flipping, and brightness change) applied to 30% of all data to obtain expanded image data; merge it into the original image data to form a new enlarged image dataset for subsequent operations;
Substep two: input the sample images of the enlarged image dataset obtained in substep one into a ResNeXt-101 network and extract the output features of its 8 layers 'conv1', 'conv2_1', 'conv3_1', 'conv3_4', 'conv4_3', 'conv4_12', 'conv4_20', and 'conv5_3'; these are depth features of different scales;
Substep three: fuse the 8 depth features of different scales obtained in substep two: first, pass each of the 8 features through its own transformation module, formed by cascading a convolution layer with a convolution kernel size of 1x1, a Batch Normalization layer, and a ReLU (rectified linear activation function) layer, to obtain 8 transformed features of different scales; then apply bilinear upsampling to each transformed feature so that all are unified to the largest scale among them; finally, stack the 8 scale-unified features along the channel dimension to form a multi-scale fusion feature;
Substep four: pass the multi-scale fusion feature obtained in substep three through k decoding networks to obtain multi-level multi-scale fusion features: each decoding network consists of n convolution layers mirrored by n completely symmetric deconvolution layers, and its output has the same spatial size as its input, so it can be concatenated with the multi-scale fusion feature; the concatenated features are sent to the next decoding network; each decoding network yields 2n features at n scales, the 2nk features output by the k decoding networks are grouped by scale, features of the same scale are concatenated along the channel dimension, and the result is n multi-level features at n different scales;
Substep five: introduce a CBAM (Convolutional Block Attention Module) model; pass each of the n multi-level features obtained in substep four through a CBAM module to obtain n multi-level fusion features of different scales, then upsample these n features and concatenate them along the channel dimension into one multi-scale multi-level feature;
Substep six: send the multi-scale multi-level feature into a regression prediction network consisting of a convolution layer with a convolution kernel size of 1x1 and a fully connected layer, which outputs a feature of size 1 x 5h representing the h predicted results; each prediction result comprises 5 attributes, and the result with the maximum prediction score is screened out as the prediction;
Substep seven: optimize the GIoU objective function with stochastic gradient descent, continuously updating the parameters of every network layer, and store the set of model parameters once iterative optimization makes the result stable;
Step two: a detection phase comprising the following substeps:
Substep one: for each text picture to be detected, read the model parameters obtained in the training phase, apply them to every network layer and fix them, then repeat substeps two to six of step one on the picture to obtain h predicted outputs, each comprising coordinate positions and a prediction score;
Substep two: select the coordinates of the regression box with the maximum prediction score as the text coordinate position.
2. The multi-level and multi-scale fusion character detection method in a complex environment according to claim 1, characterized in that in substep one the rotation angle is randomly generated between -90 and 90 degrees using a computer random seed, the flipping mode is chosen at random among left-right flipping, up-down flipping, and no flipping, and the brightness change randomly scales brightness by a factor of 0.5 to 1.5.
CN201910781042.6A 2019-08-23 2019-08-23 Multi-level and multi-scale fusion character detection method in complex environment Active CN110516669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910781042.6A CN110516669B (en) 2019-08-23 2019-08-23 Multi-level and multi-scale fusion character detection method in complex environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910781042.6A CN110516669B (en) 2019-08-23 2019-08-23 Multi-level and multi-scale fusion character detection method in complex environment

Publications (2)

Publication Number Publication Date
CN110516669A CN110516669A (en) 2019-11-29
CN110516669B (en) 2022-04-29

Family

ID=68626229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910781042.6A Active CN110516669B (en) 2019-08-23 2019-08-23 Multi-level and multi-scale fusion character detection method in complex environment

Country Status (1)

Country Link
CN (1) CN110516669B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259764A (en) * 2020-01-10 2020-06-09 中国科学技术大学 Text detection method and device, electronic equipment and storage device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107750366A (en) * 2015-01-08 2018-03-02 线性代数技术有限公司 Hardware accelerator for histogram of gradients
CN109165697A (en) * 2018-10-12 2019-01-08 福州大学 A kind of natural scene character detecting method based on attention mechanism convolutional neural networks
CN109753956A (en) * 2018-11-23 2019-05-14 西北工业大学 The multi-direction text detection algorithm extracted based on dividing candidate area
CN110097049A (en) * 2019-04-03 2019-08-06 中国科学院计算技术研究所 A kind of natural scene Method for text detection and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107750366A (en) * 2015-01-08 2018-03-02 线性代数技术有限公司 Hardware accelerator for histogram of gradients
CN109165697A (en) * 2018-10-12 2019-01-08 福州大学 A kind of natural scene character detecting method based on attention mechanism convolutional neural networks
CN109753956A (en) * 2018-11-23 2019-05-14 西北工业大学 The multi-direction text detection algorithm extracted based on dividing candidate area
CN110097049A (en) * 2019-04-03 2019-08-06 中国科学院计算技术研究所 A kind of natural scene Method for text detection and system

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
Aggregated Residual Transformations for Deep Neural Networks; Saining Xie et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017-11-09; pp. 5987-5995 *
Attention-YOLO: a YOLO detection algorithm incorporating an attention mechanism; Xu Chengji et al.; Computer Engineering and Applications, vol. 55, no. 6; 2019-03; pp. 13-23, 125 (in Chinese) *
CBAM: Convolutional Block Attention Module; Sanghyun Woo et al.; arXiv; 2018-07-18; pp. 1-17 *
Deep Matching Prior Network: Toward Tighter Multi-oriented Text Detection; Yuliang Liu et al.; arXiv; 2017-03-04; pp. 1-8 *
Multi-Oriented Text Detection with Fully Convolutional Networks; Zheng Zhang et al.; 2016 IEEE Conference on Computer Vision and Pattern Recognition; 2016-12-12; pp. 4159-4167 *
Scene Text Detection via Deep Semantic Feature Fusion and Attention-based Refinement; Yu Song et al.; 2018 24th International Conference on Pattern Recognition (ICPR); 2018-11-29; pp. 3747-3752 *
Squeeze-and-Excitation Networks; Jie Hu et al.; arXiv; 2019-05-16; pp. 1-13 *

Also Published As

Publication number Publication date
CN110516669A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN111461110A (en) Small target detection method based on multi-scale image and weighted fusion loss
CN112183501B (en) Depth counterfeit image detection method and device
CN111489357A (en) Image segmentation method, device, equipment and storage medium
CN111353544B (en) Improved Mixed Pooling-YOLOV 3-based target detection method
CN111274981B (en) Target detection network construction method and device and target detection method
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
CN110866938B (en) Full-automatic video moving object segmentation method
CN110598715A (en) Image recognition method and device, computer equipment and readable storage medium
CN115861462B (en) Training method and device for image generation model, electronic equipment and storage medium
CN111553351A (en) Semantic segmentation based text detection method for arbitrary scene shape
CN111612789A (en) Defect detection method based on improved U-net network
Ji et al. LGCNet: A local-to-global context-aware feature augmentation network for salient object detection
CN112070040A (en) Text line detection method for video subtitles
CN110633706B (en) Semantic segmentation method based on pyramid network
Zhu et al. DFTR: Depth-supervised fusion transformer for salient object detection
CN101650824B (en) Content erotic image zooming method based on conformal energy
CN110516669B (en) Multi-level and multi-scale fusion character detection method in complex environment
CN113052187B (en) Global feature alignment target detection method based on multi-scale feature fusion
Zong et al. A cascaded refined rgb-d salient object detection network based on the attention mechanism
CN110580462B (en) Natural scene text detection method and system based on non-local network
CN111753714A (en) Multidirectional natural scene text detection method based on character segmentation
CN113052156B (en) Optical character recognition method, device, electronic equipment and storage medium
CN110619387A (en) Channel expansion method based on convolutional neural network
CN113343979B (en) Method, apparatus, device, medium and program product for training a model
Guo et al. Text detection of power equipment nameplates based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant