CN112883964A - Method for detecting characters in natural scene - Google Patents

Method for detecting characters in natural scene

Info

Publication number
CN112883964A
Authority
CN
China
Prior art keywords
representing
attention
feature map
module
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110176924.7A
Other languages
Chinese (zh)
Other versions
CN112883964B (en)
Inventor
巫义锐
刘文翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202110176924.7A priority Critical patent/CN112883964B/en
Publication of CN112883964A publication Critical patent/CN112883964A/en
Application granted granted Critical
Publication of CN112883964B publication Critical patent/CN112883964B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63 Scene text, e.g. street names
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words


Abstract

The invention discloses a method for detecting characters in a natural scene, and belongs to the technical field of character detection methods. The method comprises the following steps: 1, inputting 7200 pictures of characters to be trained; 2, acquiring basic feature information through the convolution layers, and removing redundant information and enlarging the receptive field through the pooling layers; 3, adding channel attention and receptive field attention to optimize the feature information; 4, layering the network to enhance the detection capability for objects of different sizes and to generate target points; 5, cascading the generated content to remove false positive results and obtain the final text area; 6, comparing the text region obtained after cascading with the annotated text region, calculating the loss and adjusting the network parameters; and 7, inputting the pictures to be detected into the trained network to obtain their detection results. With the receptive field attention and spatial attention, the recall rate and accuracy of the model can be improved. By means of the last cascade module, false positive results can be removed.

Description

Method for detecting characters in natural scene
Technical Field
The invention relates to a method for detecting characters in a natural scene, and belongs to the technical field of character detection methods.
Background
In recent years, the rapid development of mobile devices and automatic driving has drawn attention to character detection. For situations such as traveling abroad, the need to convert photographed characters into text is growing, and understanding characters in natural scenes receives increasing attention. Detecting text in natural scenes is still challenging, because text in a natural scene may appear in different orientations and, unlike characters in books, may be curved. How to handle the detection problems caused by multi-language, curved and multi-oriented text urgently needs to be solved.
Disclosure of Invention
Aiming at the lack of a receptive field attention module in conventional character detection methods, the invention provides a method for detecting characters in a natural scene, which enhances feature information with a multi-dimensional attention module and adds a cascade structure that helps remove false positive results.
The invention adopts the following technical scheme for solving the technical problems:
a method for detecting characters in a natural scene comprises the following steps:
step 1, inputting 7200 pictures of characters to be trained;
step 2, obtaining basic feature information through the convolution layers, removing redundant information and enlarging the receptive field through the pooling layers, and adding a residual network;
step 3, adding channel attention and receptive field attention to optimize the feature information;
step 4, layering the network to enhance the detection capability for objects of different sizes and to generate target points;
step 5, cascading the generated content to remove false positive results and obtain the final text area;
step 6, comparing the text region obtained after cascading with the annotated text region to calculate the loss and adjust the network parameters;
and step 7, inputting the pictures to be detected into the trained network to obtain their detection results.
The step 2 comprises the following processes:
step 21, converting the input picture into a feature map I of size n × n × 3, where n is the length and width of the feature map and 3 is the number of channels;
step 22, extracting feature information from the feature map I obtained in step 21:
the feature information is extracted with ResNet-50, which has 5 convolution modules; ResNet-50 processes the input feature information as follows:
F = Res_50(I)
where I represents the n × n × 3 feature map of the processed input picture, Res_50(·) represents the 50-layer residual network, and F represents the feature map after ResNet-50 processing; ResNet-50 is divided into 5 modules, represented by the following formula:
F = F_i, i = {1,2,3,4,5}
where i = {1,2,3,4,5} indexes the 5 convolution modules and F_i represents the feature map of each module.
Step 3 comprises the following processes:
step 31, channel attention and receptive field attention are first obtained through an ISTK module; the ISTK module processes the feature maps of layers 2, 3, 4 and 5 generated by ResNet-50, with the formula:
F_i^istk = f_istk(F_i), i = {2,3,4,5}
where f_istk(·) represents the channel and receptive field attention module and F_i^istk represents the feature map generated after passing through the ISTK module;
step 32, the feature map is first processed by convolution, with the formula:
K_{i,λ} = f_{conv,λ}(F_i), λ = {1,2,3}
where K_{i,λ} represents the feature result processed by convolution kernels of different sizes, and f_{conv,λ}(·) represents the convolution operation with convolution kernels of different sizes;
the feature maps generated by the different convolution kernel sizes are passed through two fully-connected layers and a pooling layer to generate a new weight-related feature map, and its weights are then calculated through softmax; the specific formula is as follows:
w_i = softmax(f_fc(f_avg(K_{i,λ})))
where w_i represents the weight coefficients generated by the respective convolution kernels, softmax is one way of calculating the weights, f_fc(·) represents the two fully-connected operations, and f_avg(·) represents the average pooling operation; softmax is calculated as follows:
w_{i,λ} = exp(C_{i,λ}) / Σ_λ exp(C_{i,λ})
where w_{i,λ} represents the weight of the ith channel for the λth convolution kernel and C_{i,λ} represents the attention index of the ith channel for the λth convolution kernel;
step 33, a new feature map is generated from the generated weight values and feature maps, expressed by the following formula:
F_i^istk = relu(sum(w_i · K_i))
where sum(·) represents a summing function, w_i represents the weight of the ith feature map, the feature map passing through the attention module is obtained by weighting each feature map by its weight and summing, F_i^istk represents the new feature map, and relu(·) represents the activation function, specifically:
f_relu(x) = max(0, x)
where x represents the value to be activated and f_relu(x) represents the value after activation;
step 34, NLNet is added to the second layer as a spatial attention module, with the formula:
F_2^nl = f_NLN(F_2^istk)
where F_2^istk represents the second-layer feature result from the previous step, and f_NLN(·) is the global relation (non-local) network module, specifically expressed as the following formula:
y_i = (1/C(x)) Σ_j f(x_i, x_j) g(x_j)
where C(x) denotes the normalization factor, f(x_i, x_j) represents the relation between feature map positions i and j, and g(x_j) computes the feature value at position j.
Step 4 comprises the following processes:
layering is performed through the FPN and anchors of different levels are generated through the RPN, with the formula:
b_0 = f_RPN(P)
where b_0 denotes the initial bounding boxes, f_RPN(·) represents the RPN module, and the bounding boxes of different lengths and widths selected in the first stage, namely the target points, are generated by the RPN at the different levels.
Step 5 comprises the following processes:
the initial bounding boxes obtained in step 4 are in turn used as references and cyclically fed into RoIAlign:
m_k = f_{M,k}(R_a(b_{k-1}, P)), k = 1,2,3
b_k = f_{B,k}(R_a(b_{k-1}, P)), k = 1,2,3
where k represents the stage index of the cascade; here k takes 3, meaning three stages are passed through; m_k is the segmentation mask of the kth stage, b_k is the detection box of the kth stage, b_{k-1} is the detection box of the (k-1)th stage, P represents the feature map generated in step 3, R_a represents RoIAlign, f_{M,k}(·) represents the mask produced from RoIAlign at stage k, and f_{B,k}(·) represents the bounding box generated at stage k.
The invention has the following beneficial effects:
The method modifies Cascade Mask R-CNN (cascade mask region-based convolutional neural network) so that it carries the contextual information of the picture, and the receptive field attention and spatial attention can improve the recall rate and accuracy of the model. By means of the last cascade module, false positive results can be removed.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a picture of a character to be trained.
Fig. 3 is a characteristic diagram of the entire network structure.
Fig. 4 is a view of an attention module structure.
FIG. 5 is a mask (segmentation) diagram.
Fig. 6 is a detection result picture.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
As shown in FIG. 1, the method for detecting characters in a natural scene of the present invention includes the following steps:
Step 1: the pictures of characters to be trained are input; an example is shown in FIG. 2.
Step 2 comprises the following steps.
The operation can refer to the basic feature extraction module in FIG. 3.
The input picture is converted into a feature map I of size n × n × 3, where n is the length and width of the feature map and 3 is the number of channels.
Feature information is then extracted from the obtained feature map.
The feature information is extracted with ResNet-50 (residual network), which has 5 convolution modules; the processing of the input feature information by ResNet-50 can be expressed as
F = Res_50(I)
where I represents the n × n × 3 feature map of the processed input picture, Res_50(·) represents the 50-layer residual network, and F represents the feature map after ResNet-50 processing; ResNet-50 is divided into 5 modules, which can be represented by the following formula:
F = F_i, i = {1,2,3,4,5}
where i = {1,2,3,4,5} indexes the 5 convolution modules and F_i represents the feature map of each module.
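For illustration only (this is not code from the patent), the following is a minimal PyTorch sketch of extracting the five stage feature maps F_1 to F_5 from a ResNet-50 backbone as described above; the class name, the 640 × 640 input size and all variable names are assumptions made for this example.

import torch
import torch.nn as nn
import torchvision

class ResNet50Backbone(nn.Module):
    # Splits ResNet-50 into its 5 convolution modules and returns F_1 .. F_5.
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet50(weights=None)
        self.stage1 = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)  # module 1 (stem)
        self.stage2 = r.layer1                                          # module 2
        self.stage3 = r.layer2                                          # module 3
        self.stage4 = r.layer3                                          # module 4
        self.stage5 = r.layer4                                          # module 5

    def forward(self, x):
        feats = []
        for stage in (self.stage1, self.stage2, self.stage3, self.stage4, self.stage5):
            x = stage(x)
            feats.append(x)   # F_i, i = 1..5
        return feats

backbone = ResNet50Backbone()
image = torch.randn(1, 3, 640, 640)   # an n x n x 3 picture as a 1 x 3 x n x n tensor
features = backbone(image)            # list of the 5 stage feature maps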
Step 3 comprises the following steps.
The operation can refer to FIG. 4.
Channel attention and receptive field attention are first obtained through an ISTK (independently selected text convolution kernel) module; the ISTK module processes the feature maps of layers 2, 3, 4 and 5 generated by ResNet-50, with the formula:
F_i^istk = f_istk(F_i), i = {2,3,4,5}
where f_istk(·) represents the channel and receptive field attention module and F_i^istk represents the feature map generated after passing through the ISTK module. The details of the f_istk(·) module are as follows.
The feature map is first processed by convolution, with the formula:
K_{i,λ} = f_{conv,λ}(F_i), λ = {1,2,3}
where K_{i,λ} represents the feature result processed by convolution kernels of different sizes, and f_{conv,λ}(·) represents the convolution operation with convolution kernels of different sizes.
The feature maps generated by the different convolution kernel sizes are passed through two fully-connected layers and a pooling layer to generate a new weight-related feature map, and its weights are then calculated through softmax; the specific formula is as follows:
w_i = softmax(f_fc(f_avg(K_{i,λ})))
where w_i represents the weight coefficients generated by the respective convolution kernels, softmax is one way of calculating the weights, f_fc(·) represents the two fully-connected operations, and f_avg(·) represents the average pooling operation; softmax is calculated as follows:
w_{i,λ} = exp(C_{i,λ}) / Σ_λ exp(C_{i,λ})
where w_{i,λ} represents the weight of the ith channel for the λth convolution kernel and C_{i,λ} represents the attention index of the ith channel for the λth convolution kernel.
A new feature map can be generated according to the generated weight values and feature maps, and can be represented by the following formula:
Figure BDA0002940232260000055
where sum () represents a summing function,
Figure BDA0002940232260000056
the weight of the ith feature map is represented, and the feature map passing through the attention module can be obtained by summing the weights of the ith feature map and the ith feature map, wherein
Figure BDA0002940232260000061
Representing a new feature graph, relu () representing an activation function, in particular the following formula:
Figure BDA0002940232260000062
x represents the value to be activated, wherein frelu(x) Representing the value after activation.
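The patent does not give the ISTK module as code; as a hedged illustration, the following PyTorch sketch implements a selective-kernel-style channel and receptive field attention block consistent with the formulas above (convolution branches of different kernel sizes, average pooling, two fully-connected layers, and a softmax over branches). The kernel sizes (1, 3, 5), the reduction ratio and all names are assumptions, not details fixed by the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ISTKAttention(nn.Module):
    # Sketch of channel / receptive-field attention: branches K_{i,lambda} with different
    # kernel sizes, f_avg (average pooling), f_fc (two FC layers), softmax branch weights,
    # weighted sum of branches, and a final relu producing the new feature map.
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2)  # f_conv,lambda
            for k in (1, 3, 5)
        ])
        hidden = max(channels // reduction, 8)
        self.fc1 = nn.Linear(channels, hidden)                    # first FC of f_fc
        self.fc2 = nn.Linear(hidden, channels * len(self.branches))  # second FC of f_fc

    def forward(self, x):
        k = [branch(x) for branch in self.branches]           # K_{i,lambda}
        u = sum(k)                                             # fuse the branches
        s = F.adaptive_avg_pool2d(u, 1).flatten(1)             # f_avg
        z = self.fc2(F.relu(self.fc1(s)))                      # f_fc
        # softmax over the branch dimension gives per-channel weights w_{i,lambda}
        w = z.view(x.size(0), len(self.branches), -1).softmax(dim=1)
        out = sum(w[:, i].unsqueeze(-1).unsqueeze(-1) * k[i] for i in range(len(k)))
        return F.relu(out)                                     # new feature map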
Adding NLNet (spatial attention) to the second layer, the formula can be expressed as:
Figure BDA0002940232260000063
wherein
Figure BDA0002940232260000064
Representing the result of convolution of the feature values of the second layer of the previous step, wherein fNLN() The global contact network module may be specifically expressed as the following formula:
Figure BDA0002940232260000065
wherein c (x) denotes normalization, f (x)i,xj) Is represented as the relation of the sought characteristic maps i and j, g (x)j) The eigenvalue of the j point is calculated.
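As a concrete illustration of the non-local formula y_i = (1/C(x)) Σ_j f(x_i, x_j) g(x_j), the following is a minimal PyTorch sketch of an NLNet-style spatial attention block in its embedded-Gaussian form; the choice of embedded-Gaussian pairwise function, the channel reduction and the residual connection are common conventions assumed here rather than details stated in the patent.

import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    # f(x_i, x_j) is softmax(theta(x_i) . phi(x_j)), which already contains the 1/C(x)
    # normalization; g(x_j) is a 1x1 convolution; the output is added back residually.
    def __init__(self, channels):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)
        self.g = nn.Conv2d(channels, inter, kernel_size=1)
        self.out = nn.Conv2d(inter, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, c')
        k = self.phi(x).flatten(2)                     # (b, c', hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, c')
        attn = torch.softmax(q @ k, dim=-1)            # f(x_i, x_j) / C(x)
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection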
Step 4 comprises the following steps:
Reference may be made to the FPN (feature pyramid network) part of FIG. 3.
The network is layered with the FPN and anchors (target points) of different levels are generated with the RPN (region proposal network); the formula can be expressed as:
b_0 = f_RPN(P)
where b_0 denotes the initial bounding boxes, f_RPN(·) represents the RPN module, and the bounding boxes of different lengths and widths selected in the first stage, namely the target points, are generated by the RPN at the different levels.
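As an illustration of f_RPN, the following is a minimal PyTorch sketch of a standard RPN head applied to the FPN feature maps; it predicts, per anchor, an objectness score and four box-regression offsets from which the initial boxes b_0 would be decoded. The number of anchors per location and the layer structure are assumptions for this sketch.

import torch
import torch.nn as nn

class RPNHead(nn.Module):
    # For each FPN level, a shared 3x3 convolution followed by two 1x1 branches:
    # one objectness score and 4 box-regression offsets per anchor.
    def __init__(self, channels, num_anchors=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(channels, num_anchors, kernel_size=1)
        self.bbox_deltas = nn.Conv2d(channels, num_anchors * 4, kernel_size=1)

    def forward(self, fpn_features):
        scores, deltas = [], []
        for p in fpn_features:                  # one feature map per FPN level
            t = torch.relu(self.conv(p))
            scores.append(self.objectness(t))   # (b, A, h, w)
            deltas.append(self.bbox_deltas(t))  # (b, 4A, h, w)
        return scores, deltas                   # decoded against the anchors to get b_0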
Step 5 comprises the following steps:
Reference may be made to the cascade detection part of FIG. 3.
The initial bounding boxes obtained in step 4 can in turn be used as references and cyclically fed into RoIAlign (a pooling layer):
m_k = f_{M,k}(R_a(b_{k-1}, P)), k = 1,2,3
b_k = f_{B,k}(R_a(b_{k-1}, P)), k = 1,2,3
where k represents the stage index of the cascade; here k takes 3, meaning three stages are passed through; m_k is the segmentation mask of the kth stage, b_k is the detection box of the kth stage, b_{k-1} is the detection box of the (k-1)th stage, P represents the feature map generated in step 3, R_a represents RoIAlign, f_{M,k}(·) represents the mask produced from RoIAlign at stage k, and f_{B,k}(·) represents the bounding box generated at stage k.
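A minimal sketch of the three-stage cascade loop is given below, using torchvision's roi_align as R_a; the stage heads f_{B,k} and f_{M,k} are left abstract (plain callables here), and the RoI output size and spatial scale are assumptions for illustration.

import torch
from torchvision.ops import roi_align

def cascade_refine(feature_map, boxes, box_heads, mask_heads, spatial_scale=0.25):
    # Each stage k pools features with RoIAlign from the previous stage's boxes b_{k-1}
    # and produces refined boxes b_k and a mask m_k.
    masks = []
    for box_head, mask_head in zip(box_heads, mask_heads):   # stages k = 1, 2, 3
        rois = roi_align(feature_map, [boxes], output_size=(7, 7),
                         spatial_scale=spatial_scale, aligned=True)   # R_a(b_{k-1}, P)
        boxes = box_head(rois)          # b_k = f_{B,k}(...)
        masks.append(mask_head(rois))   # m_k = f_{M,k}(...)
    return boxes, masks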
Step 6 comprises the following steps:
The distance between the coordinates generated by the present network and the actual coordinates is calculated using a cross-entropy loss function, which is formulated as follows:
Loss = -Σ_x p(x) log q(x)
where p(·) is the expected output, q(·) is the actual output, log is the logarithm function, and Loss is the resulting cross-entropy loss.
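As a worked illustration of Loss = -Σ_x p(x) log q(x), with p as the expected output and q as the actual output, the following small PyTorch sketch computes the loss on made-up example tensors.

import torch

def cross_entropy(p, q, eps=1e-12):
    # Loss = -sum_x p(x) * log q(x); eps added for numerical stability.
    return -(p * torch.log(q + eps)).sum(dim=-1).mean()

# Example: expected one-hot targets p and predicted probabilities q for 2 samples.
p = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
q = torch.tensor([[0.9, 0.1], [0.2, 0.8]])
loss = cross_entropy(p, q)   # (-log 0.9 - log 0.8) / 2, about 0.164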
Step 7: after passing through the cascade module, the resulting segmented image is shown in FIG. 5.
Step 8: the resulting text area is shown in FIG. 6.

Claims (5)

1. A method for detecting characters in a natural scene is characterized by comprising the following steps:
step 1, inputting 7200 pictures of characters to be trained;
step 2, obtaining basic feature information through the convolution layers, removing redundant information and enlarging the receptive field through the pooling layers, and adding a residual network;
step 3, adding channel attention and receptive field attention to optimize the feature information;
step 4, layering the network to enhance the detection capability for objects of different sizes and to generate target points;
step 5, cascading the generated content to remove false positive results and obtain the final text area;
step 6, comparing the text region obtained after cascading with the annotated text region to calculate the loss and adjust the network parameters;
and step 7, inputting the pictures to be detected into the trained network to obtain their detection results.
2. The method for detecting the characters in the natural scene according to claim 1, wherein: the step 2 comprises the following processes:
step 21, converting the input picture into a feature map I of size n × n × 3, where n is the length and width of the feature map and 3 is the number of channels;
step 22, extracting feature information from the feature map I obtained in step 21:
the feature information is extracted with ResNet-50, which has 5 convolution modules; ResNet-50 processes the input feature information as follows:
F = Res_50(I)
where I represents the n × n × 3 feature map of the processed input picture, Res_50(·) represents the 50-layer residual network, and F represents the feature map after ResNet-50 processing; ResNet-50 is divided into 5 modules, represented by the following formula:
F = F_i, i = {1,2,3,4,5}
where i = {1,2,3,4,5} indexes the 5 convolution modules and F_i represents the feature map of each module.
3. The method for detecting the characters in the natural scene according to claim 2, wherein the step 3 comprises the following steps:
step 31, channel attention and receptive field attention are first obtained through an ISTK module; the ISTK module processes the feature maps of layers 2, 3, 4 and 5 generated by ResNet-50, with the formula:
F_i^istk = f_istk(F_i), i = {2,3,4,5}
where f_istk(·) represents the channel and receptive field attention module and F_i^istk represents the feature map generated after passing through the ISTK module;
step 32, the feature map is first processed by convolution, with the formula:
K_{i,λ} = f_{conv,λ}(F_i), λ = {1,2,3}
where K_{i,λ} represents the feature result processed by convolution kernels of different sizes, and f_{conv,λ}(·) represents the convolution operation with convolution kernels of different sizes;
the feature maps generated by the different convolution kernel sizes are passed through two fully-connected layers and a pooling layer to generate a new weight-related feature map, and its weights are then calculated through softmax; the specific formula is as follows:
w_i = softmax(f_fc(f_avg(K_{i,λ})))
where w_i represents the weight coefficients generated by the respective convolution kernels, softmax is one way of calculating the weights, f_fc(·) represents the two fully-connected operations, and f_avg(·) represents the average pooling operation; softmax is calculated as follows:
w_{i,λ} = exp(C_{i,λ}) / Σ_λ exp(C_{i,λ})
where w_{i,λ} represents the weight of the ith channel for the λth convolution kernel and C_{i,λ} represents the attention index of the ith channel for the λth convolution kernel;
step 33, a new feature map is generated from the generated weight values and feature maps, expressed by the following formula:
F_i^istk = relu(sum(w_i · K_i))
where sum(·) represents a summing function, w_i represents the weight of the ith feature map, the feature map passing through the attention module is obtained by weighting each feature map by its weight and summing, F_i^istk represents the new feature map, and relu(·) represents the activation function, specifically:
f_relu(x) = max(0, x)
where x represents the value to be activated and f_relu(x) represents the value after activation;
step 34, NLNet is added to the second layer as a spatial attention module, with the formula:
F_2^nl = f_NLN(F_2^istk)
where F_2^istk represents the second-layer feature result from the previous step, and f_NLN(·) is the global relation (non-local) network module, specifically expressed as the following formula:
y_i = (1/C(x)) Σ_j f(x_i, x_j) g(x_j)
where C(x) denotes the normalization factor, f(x_i, x_j) represents the relation between feature map positions i and j, and g(x_j) computes the feature value at position j.
4. The method for detecting the characters in the natural scene according to claim 1, wherein the step 4 comprises the following steps:
layering through the FPN and generating anchors of different levels through the RPN, with the formula:
b_0 = f_RPN(P)
where b_0 denotes the initial bounding boxes, f_RPN(·) represents the RPN module, and the bounding boxes of different lengths and widths selected in the first stage, namely the target points, are generated by the RPN at the different levels.
5. The method for detecting the characters in the natural scene according to claim 4, wherein the step 5 comprises the following steps:
the initial bounding boxes obtained in step 4 are in turn used as references and cyclically fed into RoIAlign:
m_k = f_{M,k}(R_a(b_{k-1}, P)), k = 1,2,3
b_k = f_{B,k}(R_a(b_{k-1}, P)), k = 1,2,3
where k represents the stage index of the cascade; here k takes 3, meaning three stages are passed through; m_k is the segmentation mask of the kth stage, b_k is the detection box of the kth stage, b_{k-1} is the detection box of the (k-1)th stage, P represents the feature map generated in step 3, R_a represents RoIAlign, f_{M,k}(·) represents the mask produced from RoIAlign at stage k, and f_{B,k}(·) represents the bounding box generated at stage k.
CN202110176924.7A 2021-02-07 2021-02-07 Method for detecting characters in natural scene Active CN112883964B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110176924.7A CN112883964B (en) 2021-02-07 2021-02-07 Method for detecting characters in natural scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110176924.7A CN112883964B (en) 2021-02-07 2021-02-07 Method for detecting characters in natural scene

Publications (2)

Publication Number Publication Date
CN112883964A true CN112883964A (en) 2021-06-01
CN112883964B CN112883964B (en) 2022-07-29

Family

ID=76056307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110176924.7A Active CN112883964B (en) 2021-02-07 2021-02-07 Method for detecting characters in natural scene

Country Status (1)

Country Link
CN (1) CN112883964B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165697A (en) * 2018-10-12 2019-01-08 福州大学 A kind of natural scene character detecting method based on attention mechanism convolutional neural networks
CN112149619A (en) * 2020-10-14 2020-12-29 南昌慧亦臣科技有限公司 Natural scene character recognition method based on Transformer model

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115482538A (en) * 2022-11-15 2022-12-16 上海安维尔信息科技股份有限公司 Material label extraction method and system based on Mask R-CNN
CN115661828A (en) * 2022-12-08 2023-01-31 中化现代农业有限公司 Character direction identification method based on dynamic hierarchical nested residual error network
CN115661828B (en) * 2022-12-08 2023-10-20 中化现代农业有限公司 Character direction recognition method based on dynamic hierarchical nested residual error network

Also Published As

Publication number Publication date
CN112883964B (en) 2022-07-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant