CN112149620A

CN112149620A - Method for constructing natural scene character region detection model based on no anchor point

Info

Publication number: CN112149620A
Application number: CN202011098722.7A
Authority: CN
Inventors: 徐亦飞; 王冕; 王爱臣; 严汤文; 王优; 李斌; 尉萍萍; 肖志峰
Original assignee: Nanchang Huiyichen Technology Co ltd
Current assignee: Nanchang Huiyichen Technology Co ltd
Priority date: 2020-10-14
Filing date: 2020-10-14
Publication date: 2020-12-29

Abstract

The invention discloses a method for constructing a natural scene character region detection model based on no anchor point, which uses a pixel-based detection mode and introduces a convolution branch for predicting the inclination angle of a boundary box, thereby detecting the inclined characters in a natural scene; the method has the advantages that the deformable convolution DCN is added into some layers of the network backbone, so that the capability of expressing the specific characteristics of the text instance by the network is improved, and the receptive field of the text target shape is more flexible; an attention module is introduced into the network, so that the extracted features are filtered, positive information is enhanced, and interference information is suppressed; in the invention, classification Loss, regression Loss CIoU Loss, centrality Loss and angle Loss are used as a joint Loss function, so that the detection precision is improved, the regression of the target frame becomes more stable, and the faster convergence speed is achieved.

Description

Method for constructing natural scene character region detection model based on no anchor point

Technical Field

The invention relates to the technical field of image processing, in particular to a method for constructing a natural scene character region detection model based on no anchor point.

Background

The character region detection is a hot point of research in the field of computer vision, and aims to detect the position of characters in a natural scene image for further identification, so that the image is converted into real character information which can be processed by a computer. Characters in the natural scene image generally have great differences in the aspects of fonts, combination modes, character sizes and the like, and the natural scene image also has great uncertainty in the aspects of illumination intensity, resolution, image noise, shooting angles and the like, and the difficulty of character area detection in the natural scene is greatly increased by the complex factors.

One method commonly used for detecting the character region of the natural scene is a method based on bounding box regression, which generally regards a text as a type of target and directly predicts the bounding box of the text as a detection result. The method based on the bounding box regression comprises a two-stage (two-stage) method and a one-stage (one-stage) method, wherein the two-stage (two-stage) method is to generate a series of candidate frames serving as samples by an algorithm and then to classify the samples by a convolutional neural network; in the latter case, the problem of target frame positioning is directly converted into regression problem processing without generating candidate frames. Generally, the former has a higher accuracy than the latter, while the latter has a higher speed than the former.

Based on the characteristics of the two methods, under the situation of higher real-time requirement, for example, the character region detection in the automatic driving scene needs shorter identification time, which belongs to the real-time and commonly uses a single-stage method. For example, the FCOS algorithm proposed by Tian Z, Shen C, Chen H, et al.FCOS: full volumetric one-stage object detection is an anchor-frame-free single-stage target detection algorithm, the algorithm reserves an anchor-frame-free mechanism, three strategies based on pixel regression prediction, multi-scale features and centrality (Center-less) prediction are introduced, and finally the effect can be better than that of various mainstream anchor-frame-based target detection algorithms under the condition of no anchor frame. However, the FCOS algorithm has a problem of low accuracy.

Disclosure of Invention

The invention provides a method for constructing a natural scene character region detection model based on no anchor point, which aims to solve the problem of low accuracy of the conventional natural scene character region detection without anchor point.

The invention provides a method for constructing a natural scene character region detection model without anchor points, which comprises the following steps:

s100, collecting a data set containing character images facing to a natural scene, wherein the data set comprises a training image set T_trainAnd detecting the image set T_test；

Step S200, inputting the natural image as Input to a Feature extraction network to generate a Feature pyramid consisting of multi-scale Input Feature maps, wherein the Feature extraction network comprises a deformable convolution DCN;

step S300, introducing the Feature pyramid into Attention Module Attention, and filtering the Input Feature Map by operating a Head pyramid Attention Module through a Head to generate an accurate Feature Map, wherein the Attention Module Attention comprises a Channel Attention Module and a Spatial Attention Module;

step S400, transmitting the accurate Feature Map into an output layer comprising three Convolution branches, generating a Feature Map, wherein the Feature Map comprises a Classification Feature Map, a Center-to-new Feature Map, a Regression Feature Map and an Angle Feature Map,

among the three Convolution branches, the first Convolution branch is responsible for a classification task and a centrality prediction task, the second Convolution branch is responsible for regression of a boundary box, and the third Convolution branch is responsible for prediction of an inclination angle of the boundary box;

step S500, training image set T_trainThe training image in (1) is input into step S200, a characteristic feature map corresponding to the training image is obtained through step S200, step S300 and step S400,

training each actual target frame centrality, target frame regression coordinate target frame character inclination angle and corresponding characteristic feature map labeled by the training image by using a joint loss function to obtain an anchor-free natural scene character region detection model;

step S600, detecting an image data set T_testThe detection image in (1) is used as input and is input to a natural scene character area detection model without anchor points, and a character detection area in the detection image is obtained.

Optionally, step S200, including,

step S210, the natural images are transmitted to a feature extraction network, and a third layer C3, a fourth layer C4 and a fifth layer C5 of a ResNet network in the feature extraction network generate corresponding input feature maps P3, input feature maps P4 and input feature maps P5;

in step S220, two convolution layers are added to the input feature map P5 generated at the fifth layer, so as to generate two new input feature maps P6 and P7, and obtain a feature pyramid composed of five input feature maps with different sizes.

Optionally, step S300, including,

step S310, compressing the Input Feature Map in the Feature pyramid in the spatial dimension by using maximum pooling Maxpool and average pooling Avgppool operations to generate two different spatial and contextual descriptors; inputting the two descriptors into a shared network, wherein the shared network consists of a hidden layer of multi-layer perceptrons MLP, and respectively generating corresponding sub-channel attention drawings through the shared network; merging the two generated subchannel attention diagrams to generate an attention weight diagram; performing dot product operation on the attention weight graph and the Input Feature Map to generate a Channel Refined Feature Map;

step S320, performing Maxpool operation and average pool Avgpool operation on the Channel accurate Feature Map along the Channel axis of the Channel accurate Feature Map, and performing connection operation on the generated Feature Map to generate a Feature descriptor; applying the convolutional layer Conv on the feature descriptor to generate a Spatial Attention map; and performing dot product operation on the generated Spatial Attention Map and the Channel accurate Feature Map to generate a Spatial Attention accurate Feature Map.

Optionally, classification Loss, regression Loss CIoU Loss, centrality Loss, and angle Loss are used as a joint Loss function, and a calculation formula of the joint Loss function is as follows:

wherein L is_cls、L_reg、L_θ、L_cesRespectively classification loss, regression loss, angle loss, centrality loss, N_posRepresenting the number of positive samples, l being an indicator function when

When the position is classified as text, the value of the function is 1, otherwise, the value of the function is 0;

specifically, the classification penalty is:

the regression Loss CIoU Loss is:

wherein, b and b^gtRespectively representing the central points of the prediction frame and the target frame, wherein p () is the Euclidean distance for calculating the two central points, a is a coefficient for balancing the length-width ratio, and v is used for measuring the ratio consistency of the prediction frame and the target frame;

the angle loss function is:

L_θ(θ,θ^*)＝1-cos(θ-θ^*) Where θ represents the predicted tilt angle, θ^*And representing the inclination angle of the target box characters.

The centrality loss is:

L_ces(c,c^*)＝-c^*·log(c)+(1-c^*) Log (1-c), wherein c and c^*Respectively, the predicted value of the centrality and the centrality of the target frame.

Optionally, step S600, including,

step S610, for the natural scene character detection model without anchor point, detecting the image data set T_testObtaining a characteristic Feature Map corresponding to the detected image as input, and generating the distance between a pixel point corresponding to a certain point in the Regression Feature Map and the Angle Feature Map and the distance between the point in the detected image and four frames in a prediction frame so as to generate the prediction frame;

obtaining a primary Classification score and a centrality score of the point according to the Classification Feature Map and the centrality Feature Map, and multiplying the primary Classification score obtained according to the Classification Feature Map by the centrality to obtain a final Classification score;

and S620, filtering the prediction box by using a non-maximum suppression algorithm NMS and the final classification score to obtain a character region in the detected image.

The invention provides a method for constructing a natural scene character region detection model based on no anchor point, which adds a convolution branch for predicting the inclination angle of a boundary box on an algorithm without an anchor point so as to detect inclined characters in a natural scene; the method has the advantages that the deformable convolution DCN is added into some layers of the network backbone, so that the capability of expressing the specific characteristics of the text instance by the network is improved, and the receptive field of the text target shape is more flexible; an attention module is introduced into the network, so that the extracted features are filtered, positive information is enhanced, and interference information is suppressed; in the invention, classification Loss, regression Loss CIoU Loss, centrality Loss and angle Loss are used as a joint Loss function, so that the detection precision is improved, the regression of the target frame becomes more stable, and the faster convergence speed is achieved.

Drawings

In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.

FIG. 1 is a flow chart of a method for constructing a text region detection model based on a natural scene without anchor points according to the present invention;

FIG. 2 is a network structure diagram of the method for constructing a text region detection model based on a natural scene without anchor points according to the present invention;

FIG. 3 is a network architecture diagram of the attention module of the present invention;

FIG. 4 is a network architecture diagram of a channel attention module according to the present invention;

FIG. 5 is a network architecture diagram of the spatial attention module of the present invention.

Detailed Description

The invention provides a method for constructing a natural scene character region detection model based on no anchor point, which is applied to occasions with higher real-time requirements, and ensures higher accuracy while maintaining higher detection speed.

Fig. 1 is a flowchart of a method for constructing a natural scene text region detection model based on no anchor point, fig. 2 is a network structure diagram of a method for constructing a natural scene text region detection model based on no anchor point, and as shown in fig. 1 and fig. 2, the method for constructing a natural scene text region detection model based on no anchor point comprises,

s100, collecting a data set containing character images facing to a natural scene, wherein the data set comprises a training image set T_trainAnd detecting the image set T_test。

Step S200, inputting the natural image as Input to a Feature extraction network, and generating a Feature pyramid consisting of multi-scale Input Feature maps, wherein the Feature extraction network comprises deformable Convolution DCN (Deformable Convolution Net).

It is explained here that the feature extraction network uses the ResNet50 as a backbone network, and a deformable convolution DCN is added to the network, so that the network is more suitable for extracting text information. The network is constructed into a characteristic pyramid structure, so that a multi-scale strategy is used, and the network can well detect targets with various scales.

In the present invention, step S200 specifically includes:

Step S300, the Feature pyramid is transmitted to the Attention Module Attention, and the Head pyramid Attention Module is operated by the Head to filter the Input Feature Map, so as to generate an accurate Feature Map, where the Attention Module Attention includes a Channel Attention Module and a Spatial Attention Module, as shown in fig. 3.

In the present invention, step S300 specifically includes:

step S310, compressing the Input Feature Map in the Feature pyramid in the spatial dimension by using maximum pooling Maxpool and average pooling Avgppool operations to generate two different spatial and contextual descriptors; inputting the two descriptors into a shared network, wherein the shared network consists of a hidden layer of multi-layer perceptrons MLP, and respectively generating corresponding sub-channel attention drawings through the shared network; merging the two generated subchannel attention diagrams to generate an attention weight diagram; performing dot product operation on the attention weight graph and the Input Feature Map to generate a Channel Refined Feature Map, as shown in fig. 4;

step S320, performing Maxpool operation and average pool Avgpool operation on the Channel accurate Feature Map along the Channel axis of the Channel accurate Feature Map, and performing connection operation on the generated Feature Map to generate a Feature descriptor; applying the convolutional layer Conv on the feature descriptor to generate a Spatial Attention map; the generated Spatial Attention Map and the Channel Refined Feature Map are subjected to dot product operation to generate a Spatial Attention Refined Feature Map, as shown in fig. 5.

Step S400, transmitting the accurate Feature Map into an output layer comprising three Convolution branches, and generating a Feature Map, wherein the Feature Map comprises a Classification Feature Map, a Center-to-less Feature Map, a Regression Feature Map and an Angle Feature Map;

in the three Convolution branches, the first Convolution branch is responsible for a classification task and a centrality prediction task, the second Convolution branch is responsible for regression of a bounding box, and the third Convolution branch is responsible for prediction of an inclination angle of the bounding box.

Compared with the FCOS algorithm, the method has the advantages that a convolution branch for predicting the inclination angle of the boundary box is added, so that the algorithm can detect the inclined text.

Step S500, training image set T_trainThe training image in (1) is input to step S200, and passes through step S200 and step S300Step S400, obtaining a characteristic feature map corresponding to the training image,

and training each actual target frame centrality, target frame regression coordinate target frame character inclination angle and corresponding characteristic feature diagram marked by the training image by using a joint loss function to obtain an anchor-free natural scene character region detection model.

In order to improve the detection precision, make the regression of the target frame more stable and achieve faster convergence speed, in the invention, classification Loss, regression Loss CIoU Loss, centrality Loss and angle Loss are used as a joint Loss function, and the calculation formula of the joint Loss function is as follows:

I.e. the classification of the location is text, the value of the function is 1, otherwise the value of the function is 0.

Specifically, the classification penalty is:

the regression Loss CIoU Loss is:

the angle loss function is:

The centrality loss is:

In the present invention, step S600 specifically includes:

and S620, filtering the prediction box by using a non-maximum suppression algorithm NMS and the final classification score to obtain a character region in the detected image. In the present invention, the threshold in the NMS is a predicted box coverage of 0.6.

The invention provides a method for constructing a natural scene character region detection model based on no anchor point, which can detect inclined characters in a natural scene by adding a convolution branch for predicting the inclination angle of a boundary box; the deformable convolution is added in some layers of the network backbone, so that the capability of expressing the specific characteristics of the text instance by the network is improved, and the receptive field of the text target shape is more flexible; an attention module is introduced into the network, so that the extracted features are filtered, positive information is enhanced, and interference information is suppressed; in the invention, classification Loss, regression Loss CIoU Loss, centrality Loss and angle Loss are used as a joint Loss function, so that the detection precision is improved, the regression of the target frame becomes more stable, and the faster convergence speed is achieved.

The above-described embodiments of the present invention should not be construed as limiting the scope of the present invention.

Claims

1. A construction method of a natural scene character region detection model based on no anchor point is characterized by comprising the following steps:

2. The method for constructing the text region detection model based on the natural scene without anchor point as claimed in claim 1, wherein step S200 comprises,

3. The method for constructing the text region detection model based on the natural scene without anchor point as claimed in claim 1, wherein step S300 comprises,

4. The method for constructing the natural scene character region detection model based on the anchorless as claimed in claim 1, wherein a classification Loss, a regression Loss CIoU Loss, a centrality Loss and an angle Loss are used as a joint Loss function, and a calculation formula of the joint Loss function is as follows:

specifically, the classification penalty is:

the regression Loss CIoU Loss is:

the angle loss function is:

The centrality loss is:

5. The method for constructing the text region detection model based on the natural scene without anchor point as claimed in claim 1, wherein step S600 comprises,