CN114255456A - Natural scene text detection method and system based on attention mechanism feature fusion and enhancement - Google Patents
- Publication number
- CN114255456A (application number CN202111393620.2A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- features
- feature
- channel
- natural scene
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a natural scene text detection method and system based on attention-mechanism feature fusion and enhancement. The method comprises: extracting features from a natural scene text image to obtain first feature maps; further extracting spatial information features to obtain spatial information masks; extracting semantic (channel) information features from the last first feature map to obtain a channel weight vector; decoding and fusing the first feature maps, the spatial information masks and the channel weight vectors stage by stage based on an attention mechanism to obtain second feature maps; adjusting the channel counts of the fused features and concatenating them along the channel dimension to obtain a third feature map; and upsampling to the original image size and applying convolution to obtain segmentation masks for the text core region and boundary region in the natural scene text image. The extracted feature information is comprehensive and the detection effect is good; the decoded features contain more accurate target information; and convolution and pooling operations over different dimensions extract more salient features while suppressing noise.
Description
Technical Field
The invention belongs to the technical field of computer vision and artificial intelligence, and particularly relates to a natural scene text detection method and system based on attention mechanism feature fusion and enhancement.
Background
A scene image is an image captured by an image acquisition device in a natural scene. Compared with other elements in an image, text conveys richer and more precise information, and text information has high auxiliary value in technologies such as autonomous driving and intelligent translation, so detecting text in natural scenes is important for image understanding. To help a computer understand an image more accurately, it is essential to automatically and accurately extract text regions from scene images.
Detecting scene text is a more challenging task than detecting regular document text. Text in natural scenes is characterized by complex backgrounds, many sources of interference, and its own diversity and variability, all of which add difficulty to the text detection task. In addition, the quality of scene images is affected by external conditions such as outdoor lighting, weather, shooting angle and occlusion, so captured images often suffer from low contrast, blur, distortion and occlusion. All of this makes text detection in natural scene images a difficult problem.
With the rise of deep learning, the traditional approach of hand-designing features and classifiers for text detection has gradually been replaced by convolutional neural networks, which learn the feature information in an image autonomously through convolution operations. Current deep-learning-based text detection methods fall mainly into two categories: regression-based and segmentation-based.
Regression-based methods adapt common object detection algorithms, such as the Faster R-CNN or SSD frameworks, to the characteristics of text. These methods localize text with circumscribed rectangular boxes and detect document text well, but they cannot tightly enclose text of arbitrary shape, and the redundant background noise severely interferes with subsequent text recognition.
Segmentation-based methods build network frameworks on classic semantic segmentation ideas such as FCN and FPN. Each pixel of the input image is predicted and classified, and pixels of the same class are clustered to determine the position of the target text. These methods can adapt to text targets of arbitrary shape, but the complex backgrounds and diverse scales of scene text make it harder to extract effective features, leading to false detections and missed detections.
Disclosure of Invention
In view of the defects of the prior art, the invention adopts a natural scene text detection method and system based on attention-mechanism feature fusion and enhancement to effectively segment the text in scene text images, and reduces false detections and missed detections by enhancing the spatial and channel information of scene text image features.
In a first aspect of the present invention, a natural scene text detection method based on attention mechanism feature fusion and enhancement is provided, which includes the following steps,
s1, acquiring a natural scene text image;
s2, extracting features from the natural scene text image, performing n downsampling operations overall, and taking the feature maps after the 2nd through nth downsampling as the first feature maps;
s3, extracting spatial information features from each first feature map except the one from the last downsampling to obtain spatial information masks; extracting channel information features from the last first feature map to obtain a channel weight vector;
s4, decoding and fusing the first feature maps, the spatial information masks and the channel weight vectors stage by stage based on an attention mechanism, obtaining second feature maps with salient features through upsampling, multiplication and addition operations;
s5, adjusting the channel counts of the fused features by convolution, upsampling at different rates to a uniform size of 1/4 the original image, and concatenating along the channel dimension to obtain a third feature map;
s6, constructing different combinations of convolution, pooling and splicing, and further extracting, fusing and reinforcing the features of the third feature map;
and S7, upsampling the features obtained in the S6 to the size of the original image, and performing convolution to obtain a segmentation mask of a text core area and a boundary area in the text image of the natural scene.
Further, step S2 specifically comprises uniformly scaling the natural scene text image to be recognized to a size of A×A, extracting features using deformable convolution, and performing downsampling 5 times to obtain first feature maps in2, in3, in4, in5 with sizes A/4, A/8, A/16 and A/32.
Further, step S3 specifically comprises constructing combinations of convolution, pooling and splicing: the spatial information extraction module SAM performs position-wise maximum pooling and average pooling on the first feature map in_k, then splices and convolves the pooled features to obtain a spatial information mask; maximum pooling and average pooling are performed on the first feature map in5 along the channel dimension to obtain two information vectors, which pass through a fully connected operation and are added position-wise to obtain the channel weight vector.
Further, step S4 specifically comprises: the feature fusion module AFFM operates on the first feature map in_k (k=2,3,4) from the shallow encoding end and the second feature map out_{k+1} (k=2,3,4) from the deep decoding end, and outputs the fused second feature map out_m (m=2,3,4,5). Specifically, the channel weight vector c5 obtained in step S3 is multiplied channel-by-channel with the first feature map in4 from the encoding end to obtain the channel-weighted encoding feature in4'; the spatial information mask s4 obtained in step S3 is multiplied position-by-position with the 2×-upsampled out5 from the decoding end to obtain the spatially weighted decoding feature out5'; then in4' and out5' are added position-by-position to obtain the second feature map out4 after the 1st decoding fusion. Step S3 is repeated to extract the spatial and channel information s3 and c4 of in3 and out4, and step S4 is repeated to obtain the second feature map out3 after the 2nd decoding fusion; and so on, yielding a series of second feature maps out2, out3, out4, out5 that have undergone 3, 2, 1 and 0 decoding fusions and have scales A/4, A/8, A/16 and A/32, respectively (out5 here is output directly from in5 without any feature extraction).
Further, step S5 specifically comprises passing each fused second feature map out_m through a 3×3 convolution layer to further extract features and uniformly adjust the channel dimension to C; upsampling out3, out4 and out5 at rates of 2, 4 and 8, respectively, to 1/4 of the original image size (out2 is already at that scale), giving features P2, P3, P4 and P5; and concatenating these along the channel dimension as the third feature map F, of dimension 4C×A/4×A/4.
Further, step S6 specifically comprises: the feature enhancement module JAM uses the channel information extraction module CAM and the spatial information extraction module SAM respectively to model the dependencies between channels and between spatial positions of the third feature map F; the resulting 4C×1×1 channel weight vector and 1×A/4×A/4 spatial information mask are expanded to dimension 4C×A/4×A/4 and multiplied position-wise to obtain the weight feature F'; after a Sigmoid activation function, F' is multiplied with the third feature map F to obtain the enhanced feature F''.
In a second aspect of the present invention, a natural scene text detection system based on attention-mechanism feature fusion and enhancement is provided, comprising: an acquisition module for acquiring a natural scene text image; an encoding extraction module for performing convolution operations on the natural scene text image to extract features, with n downsampling operations, to obtain the first feature maps; a spatial information extraction module for extracting spatial detail information from the first feature maps to obtain spatial information masks; a channel information extraction module for extracting semantic information from the decoded-and-fused second feature map of the adjacent deeper layer to obtain a channel weight vector; a feature fusion module for stage-by-stage decoding, i.e. weighting and additively fusing the first feature map of the current layer, the second feature map of the adjacent deeper layer, the spatial information mask and the channel weight vector to obtain the information-enhanced second feature map of the current layer; a feature splicing module for upsampling the second feature maps of different scales to the same scale and concatenating them along the channel dimension to obtain a third feature map; and a feature enhancement module for modeling the spatial and channel relationships of the third feature map.
In a third aspect of the present invention, a computer-readable storage medium is provided, in which a computer program is stored, wherein the computer program is configured to perform the method according to any of the above-mentioned aspects when the computer program runs.
In a fourth aspect of the present invention, there is provided an electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to perform the method according to any of the above technical solutions.
The invention has the following beneficial effects: compared with existing feature extraction networks, a ResNet50 network with deformable convolution is used for feature extraction, so the extracted feature information is comprehensive and the effect is good; an attention-based feature fusion module is designed, which uses spatial and channel attention to extract the effective information of features at different levels and fuses them, so that the decoded features contain more accurate target information; and an attention-based feature enhancement module is designed, which uses convolution and pooling operations over different dimensions to extract more salient features and suppress noise.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart illustrating feature fusion and enhancement based on attention mechanism for natural scene text detection according to an embodiment of the present invention.
Fig. 2 is a diagram of a network architecture used in the embodiment of fig. 1.
Fig. 3 is a schematic structural diagram of the decoding fusion module AFFM.
Fig. 4 is a schematic diagram of a spatial information extraction module SAM structure.
FIG. 5 is a diagram of a CAM structure of a channel information extraction module.
Fig. 6 is a schematic structural diagram of a feature enhancement module JAM.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment is a natural scene text detection method based on attention mechanism feature fusion and enhancement, the overall method flow is shown in fig. 1, the network architecture and the module internal details are shown in fig. 2-6, wherein the method comprises the following steps:
and S1, acquiring a natural scene text image.
In this embodiment, images containing scene text are acquired by street cameras, street photographs, and the like.
S2, a modified ResNet50 network performs feature extraction on the natural scene text image obtained in S1, with n downsampling operations overall (5 in this embodiment); the features after the 2nd, 3rd, 4th and 5th downsampling are taken as the first feature maps.
Specifically, as shown in the left backbone portion of fig. 2, the natural scene text image to be recognized is uniformly scaled to a size of A×A. The ResNet50 network is modified to apply deformable convolution for feature extraction, performing 5 downsampling operations to obtain first feature maps in2, in3, in4, in5 with sizes A/4, A/8, A/16 and A/32.
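For illustration only, a minimal PyTorch sketch of such a backbone follows, using torchvision's stock ResNet-50; the function name build_backbone and the use of create_feature_extractor are our assumptions (they require a recent torchvision), and the patent's deformable-convolution modification is omitted for brevity:

```python
import torchvision
from torchvision.models.feature_extraction import create_feature_extractor

def build_backbone():
    # ResNet-50 trunk; the patent's deformable-convolution replacement of the
    # standard convolutions is omitted here.
    resnet = torchvision.models.resnet50(weights=None)
    # layer1..layer4 outputs follow the 2nd..5th downsampling steps, i.e. the
    # first feature maps in2..in5 at scales A/4, A/8, A/16, A/32.
    return create_feature_extractor(resnet, return_nodes={
        "layer1": "in2", "layer2": "in3", "layer3": "in4", "layer4": "in5"})
```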
S3, extracting spatial information features from each acquired first feature map to obtain spatial information masks; extracting semantic information features from the last first feature map to obtain a channel weight vector.
Specifically, as shown in fig. 4, the spatial information extraction module SAM performs position-wise maximum pooling and average pooling on the first feature map in_k, splices the two resulting 1×H×W features, and applies a 7×7 convolution to obtain a 1×H×W spatial information mask. As shown in fig. 5, the channel information extraction module CAM performs maximum pooling and average pooling on the first feature map in5 along the channel dimension to obtain two C×1×1 information vectors, applies a fully connected operation to them, and adds them position-wise to obtain the C×1×1 channel weight vector.
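The following is a minimal PyTorch sketch of the SAM and CAM structures just described; the class names follow the patent's module names, but the reduction ratio in CAM and the tensor-shape conventions are our assumptions:

```python
import torch
import torch.nn as nn

class SAM(nn.Module):
    """Spatial attention (fig. 4): position-wise max/avg pooling over channels,
    channel-dim concatenation, then a 7x7 convolution -> 1 x H x W mask."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                          # x: (B, C, H, W)
        max_map, _ = x.max(dim=1, keepdim=True)    # (B, 1, H, W)
        avg_map = x.mean(dim=1, keepdim=True)      # (B, 1, H, W)
        return self.conv(torch.cat([max_map, avg_map], dim=1))

class CAM(nn.Module):
    """Channel attention (fig. 5): global max/avg pooling per channel, a shared
    fully connected mapping, then position-wise addition -> C x 1 x 1 weights."""
    def __init__(self, channels, reduction=16):    # reduction ratio is assumed
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, _, _ = x.shape
        max_vec = x.amax(dim=(2, 3))               # (B, C)
        avg_vec = x.mean(dim=(2, 3))               # (B, C)
        return (self.fc(max_vec) + self.fc(avg_vec)).view(b, c, 1, 1)
```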
S4, decoding and fusing the first feature maps, the spatial information masks and the channel weight vectors stage by stage based on an attention mechanism, obtaining second feature maps with salient features through upsampling, multiplication and addition operations.
Specifically, as shown in fig. 3, the feature fusion module AFFM takes the first feature map in_k (k=2,3,4) from the shallow encoding end and the second feature map out_{k+1} (k=2,3,4) from the deep decoding end, and outputs the fused second feature map out_m (m=2,3,4,5). Concretely, the channel weight vector c5 obtained in step S3 is multiplied channel-by-channel with the first feature map in4 from the encoding end to obtain the channel-weighted encoding feature in4'; the spatial information mask s4 obtained in step S3 is multiplied position-by-position with the 2×-upsampled out5 from the decoding end to obtain the spatially weighted decoding feature out5'; then in4' and out5' are added position-by-position to obtain the second feature map out4 after the 1st decoding fusion. Step S3 is repeated to extract the spatial and channel information s3 and c4 of in3 and out4, and step S4 is repeated to obtain the second feature map out3 after the 2nd decoding fusion; and so on, yielding a series of second feature maps out2, out3, out4, out5 that have undergone 3, 2, 1 and 0 decoding fusions and have scales A/4, A/8, A/16 and A/32, respectively (out5 here is output directly from in5 without any feature extraction). The reason for this design is that shallow feature maps in a convolutional neural network contain more spatial detail, while deep feature maps contain more semantic information; extracting the effective information of different levels through the attention mechanism to weight the features of the other branch effectively enhances salient features and suppresses background noise while enhancing the target.
For this embodiment, as shown in the middle part of fig. 2, decoding fusion starts from the smallest-scale feature map in5: first, the spatial information mask of in4 is extracted and multiplied position-wise with the 2× bilinearly upsampled out5 (where out5 = in5); the channel weight vector of out5 is extracted and multiplied channel-wise with in4; the two weighted features are then added position-wise to obtain the decoded-and-fused feature out4. The other levels are decoded in the same manner.
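A minimal sketch of one AFFM fusion step follows, reusing the SAM/CAM sketches above; it assumes the encoder and decoder features at a level share the same channel count, and the sigmoid gates are our addition (the patent states only multiplication):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFFM(nn.Module):
    """One decoding-fusion step (fig. 3)."""
    def __init__(self, channels):
        super().__init__()
        self.sam = SAM()
        self.cam = CAM(channels)

    def forward(self, in_k, out_k1):
        # Channel-weight the shallow encoder feature with the deep feature's
        # channel weight vector (the c vector from CAM).
        in_w = in_k * torch.sigmoid(self.cam(out_k1))
        # Spatially weight the 2x-upsampled deep decoder feature with the
        # shallow feature's spatial mask (the s mask from SAM).
        up = F.interpolate(out_k1, scale_factor=2, mode="bilinear",
                           align_corners=False)
        out_w = up * torch.sigmoid(self.sam(in_k))
        return in_w + out_w  # position-wise sum -> fused second feature map
```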
S5, adjusting the channel counts of the fused features by convolution, upsampling at different rates to a uniform size of 1/4 the original image, and concatenating along the channel dimension to obtain a third feature map.
Specifically, as shown in the middle part of fig. 2, each fused second feature map out_m passes through a 3×3 convolution layer to further extract features, and the channel dimension is uniformly adjusted to C. The adjusted features out3, out4 and out5 are then upsampled at rates of 2, 4 and 8, respectively, to 1/4 of the original image size; together with out2, these give features P2, P3, P4 and P5, which are concatenated along the channel dimension to obtain the final decoded-and-fused third feature map F, of dimension 4C×A/4×A/4.
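A sketch of this step-S5 concatenation follows, under the assumption that out2..out5 carry ResNet-50 stage widths and that C = 64 (both illustrative values, not taken from the patent):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureConcat(nn.Module):
    """Step S5: 3x3 conv to a common channel count C, upsample every level to
    the A/4 scale, concatenate along channels."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), C=64):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(ic, C, kernel_size=3, padding=1) for ic in in_channels)

    def forward(self, outs):            # outs = [out2, out3, out4, out5]
        target = outs[0].shape[-2:]     # the A/4 x A/4 scale of out2
        feats = [F.interpolate(conv(o), size=target, mode="bilinear",
                               align_corners=False)      # P2..P5
                 for conv, o in zip(self.convs, outs)]
        return torch.cat(feats, dim=1)  # third feature map F: (B, 4C, A/4, A/4)
```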
And S6, constructing different combinations of convolution, pooling and splicing, and further extracting, fusing and reinforcing the features of the third feature map.
Specifically, as shown in fig. 6, the feature enhancement module JAM processes the features with a parallel connection. The module uses CAM and SAM respectively to model the dependencies between channels and between spatial positions of the third feature map F, expands the resulting 4C×1×1 channel weight vector and 1×A/4×A/4 spatial information mask to dimension 4C×A/4×A/4, and multiplies them position-wise to obtain the weight feature F'. After a Sigmoid activation function, F' is multiplied with the input feature F to obtain the enhanced feature F''. To avoid network degradation, a residual connection is added inside the module to ensure effective model training.
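A matching sketch of the JAM enhancement follows, again reusing the SAM/CAM sketches above; the broadcast semantics of the position-wise product are our reading of fig. 6:

```python
import torch
import torch.nn as nn

class JAM(nn.Module):
    """Feature enhancement (fig. 6): CAM and SAM in parallel on F, broadcast
    multiply into a joint weight F', sigmoid, gate F, plus a residual link."""
    def __init__(self, channels):
        super().__init__()
        self.cam = CAM(channels)
        self.sam = SAM()

    def forward(self, f):              # f: (B, 4C, A/4, A/4)
        w = self.cam(f) * self.sam(f)  # (B,4C,1,1) * (B,1,H,W) -> (B,4C,H,W)
        return f + torch.sigmoid(w) * f  # enhanced feature F'' with residual
```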
And S7, up-sampling the finally obtained features to the size of the original image, and performing convolution to obtain a segmentation mask of a text core area and a boundary area in the text image of the natural scene.
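For completeness, a hypothetical prediction head for step S7; the choice of a 1×1 convolution and of two output channels (text core and boundary masks) are assumptions consistent with, but not specified by, the text above:

```python
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    """Step S7: upsample F'' to the input resolution and predict per-pixel
    masks for the text core and boundary regions."""
    def __init__(self, in_channels, num_masks=2):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, num_masks, kernel_size=1)

    def forward(self, f, image_size):  # image_size = (A, A)
        f = F.interpolate(f, size=image_size, mode="bilinear",
                          align_corners=False)
        return self.conv(f)            # mask logits at A x A
```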
The invention provides another embodiment, which is a natural scene text image segmentation system based on an attention mechanism, comprising an acquisition module, a coding extraction module, a spatial information extraction module, a channel information extraction module, a decoding fusion module, a feature splicing module and a feature enhancement module, wherein,
and the acquisition module is used for acquiring the natural scene text image.
And the coding extraction module is used for performing convolution operation on the acquired natural scene text image to extract features, and performing downsampling for 5 times to acquire 4 first feature maps with different scales.
And the spatial information extraction module SAM is used for extracting spatial detail information from the first feature maps, producing spatial information masks for the decoding fusion module. In this embodiment, the SAM is embedded inside the AFFM and JAM modules; its internal structure, shown in FIG. 4, comprises a first pooling unit, a splicing unit and a convolution unit, wherein
the first pooling unit performs position-wise maximum pooling and average pooling on the first feature map;
the splicing unit concatenates the two feature maps obtained by maximum and average pooling along the channel dimension;
and the convolution unit convolves the spliced features to extract features, obtaining the spatial information mask.
And the channel information extraction module CAM is used for extracting semantic information from the decoded-and-fused second feature map of the previous layer, producing a channel weight vector for the decoding fusion module of the next layer. In this embodiment, the CAM module is embedded inside the AFFM and JAM modules; its internal structure, shown in FIG. 5, comprises a second pooling unit, a fully connected unit and an adding unit, wherein
the second pooling unit performs global maximum pooling and global average pooling on the second feature map along the channel dimension;
the fully connected unit applies a fully connected operation to the max- and average-pooled feature vectors to establish dependencies between channels;
and the adding unit adds the two fully connected feature vectors position-wise to obtain the channel weight vector.
And the decoding fusion module AFFM is used for stage-by-stage decoding, i.e. weighting and additively fusing the first feature map of the current layer, the second feature map of the previous layer, the spatial information mask and the channel weight vector to obtain the information-enhanced second feature map of the current layer. In this embodiment, there are 3 decoding fusion modules.
And the feature splicing module is used for sampling the second feature maps of 4 different scales to the same scale and splicing the second feature maps in the channel dimension to obtain a third feature map.
And the feature enhancement module JAM is used for carrying out spatial and channel relation modeling on the third feature graph and enhancing the feature characterization capability.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (9)
1. The method for detecting the natural scene text based on attention mechanism feature fusion and enhancement is characterized by comprising the following steps,
s1, acquiring a natural scene text image;
s2, extracting features from the natural scene text image, performing n downsampling operations overall, and taking the feature maps after the 2nd through nth downsampling as the first feature maps;
s3, extracting spatial information features from each first feature map except the one from the last downsampling to obtain spatial information masks; extracting channel information features from the last first feature map to obtain a channel weight vector;
s4, decoding and fusing the first feature maps, the spatial information masks and the channel weight vectors stage by stage based on an attention mechanism, obtaining second feature maps with salient features through upsampling, multiplication and addition operations;
s5, adjusting the channel counts of the fused features by convolution, upsampling at different rates to a uniform size of 1/4 the original image, and concatenating along the channel dimension to obtain a third feature map;
s6, constructing different combinations of convolution, pooling and splicing, and further extracting, fusing and reinforcing the features of the third feature map;
and S7, upsampling the features obtained in the S6 to the size of the original image, and performing convolution to obtain a segmentation mask of a text core area and a text boundary area in the text image of the natural scene.
2. The method for detecting natural scene text according to claim 1, wherein step S2 specifically comprises uniformly scaling the natural scene text image to be recognized to a size of A×A; extracting features using deformable convolution; and performing downsampling 5 times to obtain first feature maps in2, in3, in4, in5 with sizes A/4, A/8, A/16 and A/32.
3. The natural scene text detection method of claim 2, wherein step S3 specifically comprises constructing combinations of convolution, pooling and splicing: the spatial information extraction module SAM performs position-wise maximum pooling and average pooling on the first feature map in4, then splices and convolves the pooled features to obtain the spatial information mask s4; maximum pooling and average pooling are performed on the first feature map in5 along the channel dimension to obtain two information vectors, which pass through a fully connected operation and are added position-wise to obtain the channel weight vector c5.
4. The method according to claim 3, wherein step S4 is specifically implemented by the feature fusion module AFFM, which operates on the first feature map in_k (k=2,3,4) from the shallow encoding end and the second feature map out_{k+1} (k=2,3,4) from the deep decoding end to obtain the fused second feature map out_m (m=2,3,4,5).
5. The natural scene text detection method according to claim 4, wherein step S5 specifically comprises passing each fused second feature map out_m through a 3×3 convolution layer to further extract features and uniformly adjust the channel dimension to C; upsampling out3, out4 and out5 at rates of 2, 4 and 8, respectively, to 1/4 of the original image size (out2 is already at that scale), giving features P2, P3, P4 and P5; and splicing these along the channel dimension as the third feature map F, of dimension 4C×A/4×A/4.
6. The natural scene text detection method of claim 5, wherein step S6 specifically comprises: the feature enhancement module JAM uses the channel information extraction module CAM and the spatial information extraction module SAM respectively to model the dependencies between channels and between spatial positions of the third feature map F; the resulting 4C×1×1 channel weight vector and 1×A/4×A/4 spatial information mask are expanded to dimension 4C×A/4×A/4 and multiplied position-wise to obtain the weight feature F'; after a Sigmoid activation function, F' is multiplied with the third feature map F to obtain the enhanced feature F''.
7. A natural scene text detection system based on an attention mechanism is characterized by comprising,
the acquisition module is used for acquiring a natural scene text image;
the encoding extraction module is used for performing convolution operation on the natural scene text image to extract features and performing downsampling for n times to obtain a first feature map;
the spatial information extraction module is used for extracting spatial detail information of the first feature map to obtain a spatial information mask;
the channel information extraction module is used for extracting semantic information from the second feature map of the adjacent deep layer to obtain a channel weight vector;
the feature fusion module is used for decoding step by step, namely combining, weighting and adding and fusing the first feature map of the layer, the second feature map of the adjacent deep layer, the spatial information mask and the channel weight vector to obtain the second feature map of the layer after information enhancement;
the characteristic splicing module is used for sampling the second characteristic graphs with different scales to the same scale and splicing the second characteristic graphs in the channel dimension to obtain a third characteristic graph;
and the characteristic enhancement module is used for carrying out spatial and channel relational modeling on the third characteristic diagram.
8. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 6 when executed.
9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 6.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111393620.2A | 2021-11-23 | 2021-11-23 | Natural scene text detection method and system based on attention mechanism feature fusion and enhancement |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111393620.2A | 2021-11-23 | 2021-11-23 | Natural scene text detection method and system based on attention mechanism feature fusion and enhancement |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN114255456A | 2022-03-29 |

Family
ID=80791093

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111393620.2A (CN114255456A, pending) | Natural scene text detection method and system based on attention mechanism feature fusion and enhancement | 2021-11-23 | 2021-11-23 |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN114255456A (en) |
Patent Citations (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2020221013A1 | 2019-04-29 | 2020-11-05 | 腾讯科技(深圳)有限公司 | Image processing method and apparatus, electronic device and storage medium |
| CN113486890A | 2021-06-16 | 2021-10-08 | 湖北工业大学 | Text detection method based on attention feature fusion and cavity residual error feature enhancement |
| CN113516126A | 2021-07-02 | 2021-10-19 | 成都信息工程大学 | Adaptive threshold scene text detection method based on attention feature fusion |
Non-Patent Citations (1)

| Title |
|---|
| Minghui Liao et al., "Real-Time Scene Text Detection with Differentiable Binarization", The Thirty-Fourth AAAI Conference on Artificial Intelligence, vol. 34, no. 7, 3 April 2020, pages 11474-11481 |
Cited By (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114820515A | 2022-04-26 | 2022-07-29 | 渭南日报社印刷厂 | Non-reference image quality evaluation method based on channel attention |
| CN114581906A | 2022-05-06 | 2022-06-03 | 山东大学 | Text recognition method and system for natural scene image |
| CN114821074A | 2022-07-01 | 2022-07-29 | 湖南盛鼎科技发展有限责任公司 | Airborne LiDAR point cloud semantic segmentation method, electronic equipment and storage medium |
Legal Events

| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |