CN114255456A - Natural scene text detection method and system based on attention mechanism feature fusion and enhancement - Google Patents
- Publication number
- CN114255456A (application number CN202111393620.2A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- features
- feature
- channel
- natural scene
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a natural scene text detection method and system based on attention-mechanism feature fusion and enhancement. The method comprises: extracting features from a natural scene text image to obtain first feature maps; further extracting spatial information features to obtain spatial information masks; extracting semantic (channel) information features from the last first feature map to obtain a channel weight vector; decoding and fusing the first feature maps, the spatial information masks and the channel weight vectors stage by stage based on an attention mechanism to obtain second feature maps; adjusting the channel counts of the fused features and concatenating them along the channel dimension to obtain a third feature map; and upsampling to the original image size and applying convolution to obtain segmentation masks for the text core region and boundary region in the natural scene text image. The extracted feature information is comprehensive and the detection effect is good; the decoded features contain more accurate target information; and convolution and pooling operations over different dimensions extract more salient features while suppressing noise.
Description
Technical Field
The invention belongs to the technical field of computer vision and artificial intelligence, and particularly relates to a natural scene text detection method and system based on attention mechanism feature fusion and enhancement.
Background
A scene image is an image captured by an image acquisition device in a natural scene. Compared with other elements in an image, text conveys richer and more precise information, and text information has high auxiliary value in technologies such as autonomous driving and intelligent translation, so detecting text in natural scenes is important for image understanding. To help a computer understand an image more accurately, it is essential to automatically and accurately extract text regions from scene images.
Detecting scene text is a more challenging task than detecting regular document text. Text in natural scenes is characterized by complex backgrounds, many sources of interference, and its own diversity and variability, all of which add difficulty to the text detection task. In addition, the quality of scene images is affected by external conditions such as outdoor lighting, weather, shooting angle and occlusion, so captured images often suffer from low contrast, blur, distortion and occlusion. All of this makes text detection in natural scene images a difficult problem.
With the rise of deep learning, the traditional approach of hand-designing features and classifiers for text detection has gradually been replaced by convolutional neural networks, which learn the feature information in an image autonomously through convolution operations. Current deep-learning-based text detection methods fall mainly into two categories: regression-based and segmentation-based.
Regression-based methods adapt common object detection algorithms, such as the Faster R-CNN or SSD frameworks, to the characteristics of text. These methods localize text with circumscribed rectangular boxes and detect document text well, but they cannot tightly enclose text of arbitrary shape, and the redundant background noise severely interferes with subsequent text recognition.
Segmentation-based methods build network frameworks on classic semantic segmentation ideas such as FCN and FPN. Each pixel of the input image is predicted and classified, and pixels of the same class are clustered to determine the position of the target text. These methods can adapt to text targets of arbitrary shape, but the complex backgrounds and diverse scales of scene text make it harder to extract effective features, leading to false detections and missed detections.
Disclosure of Invention
In view of the defects of the prior art, the invention adopts a natural scene text detection method and system based on attention-mechanism feature fusion and enhancement to effectively segment the text in scene text images, and reduces false detections and missed detections by enhancing the spatial and channel information of scene text image features.
In a first aspect of the present invention, a natural scene text detection method based on attention mechanism feature fusion and enhancement is provided, which includes the following steps,
s1, acquiring a natural scene text image;
s2, extracting features from the natural scene text image, performing n downsampling operations overall, and taking the feature maps after the 2nd through nth downsampling as the first feature maps;
s3, extracting spatial information features from each first feature map except the one from the last downsampling to obtain spatial information masks; extracting channel information features from the last first feature map to obtain a channel weight vector;
s4, decoding and fusing the first feature maps, the spatial information masks and the channel weight vectors stage by stage based on an attention mechanism, obtaining second feature maps with salient features through upsampling, multiplication and addition operations;
s5, adjusting the channel counts of the fused features by convolution, upsampling at different rates to a uniform size of 1/4 the original image, and concatenating along the channel dimension to obtain a third feature map;
s6, constructing different combinations of convolution, pooling and splicing, and further extracting, fusing and reinforcing the features of the third feature map;
and S7, upsampling the features obtained in the S6 to the size of the original image, and performing convolution to obtain a segmentation mask of a text core area and a boundary area in the text image of the natural scene.
Further, step S2 specifically comprises uniformly scaling the natural scene text image to be recognized to a size of A×A, extracting features using deformable convolution, and performing downsampling 5 times to obtain first feature maps in2, in3, in4, in5 with sizes A/4, A/8, A/16 and A/32.
Further, step S3 specifically comprises constructing combinations of convolution, pooling and splicing: the spatial information extraction module SAM performs position-wise maximum pooling and average pooling on the first feature map in_k, then splices and convolves the pooled features to obtain a spatial information mask; maximum pooling and average pooling are performed on the first feature map in5 along the channel dimension to obtain two information vectors, which pass through a fully connected operation and are added position-wise to obtain the channel weight vector.
Further, step S4 specifically comprises: the feature fusion module AFFM operates on the first feature map in_k (k=2,3,4) from the shallow encoding end and the second feature map out_{k+1} (k=2,3,4) from the deep decoding end, and outputs the fused second feature map out_m (m=2,3,4,5). Specifically, the channel weight vector c5 obtained in step S3 is multiplied channel-by-channel with the first feature map in4 from the encoding end to obtain the channel-weighted encoding feature in4'; the spatial information mask s4 obtained in step S3 is multiplied position-by-position with the 2×-upsampled out5 from the decoding end to obtain the spatially weighted decoding feature out5'; then in4' and out5' are added position-by-position to obtain the second feature map out4 after the 1st decoding fusion. Step S3 is repeated to extract the spatial and channel information s3 and c4 of in3 and out4, and step S4 is repeated to obtain the second feature map out3 after the 2nd decoding fusion; and so on, yielding a series of second feature maps out2, out3, out4, out5 that have undergone 3, 2, 1 and 0 decoding fusions and have scales A/4, A/8, A/16 and A/32, respectively (out5 here is output directly from in5 without any feature extraction).
Further, step S5 specifically comprises passing each fused second feature map out_m through a 3×3 convolution layer to further extract features and uniformly adjust the channel dimension to C; upsampling out3, out4 and out5 at rates of 2, 4 and 8, respectively, to 1/4 of the original image size (out2 is already at that scale), giving features P2, P3, P4 and P5; and concatenating these along the channel dimension as the third feature map F, of dimension 4C×A/4×A/4.
Further, step S6 specifically comprises: the feature enhancement module JAM uses the channel information extraction module CAM and the spatial information extraction module SAM respectively to model the dependencies between channels and between spatial positions of the third feature map F; the resulting 4C×1×1 channel weight vector and 1×A/4×A/4 spatial information mask are expanded to dimension 4C×A/4×A/4 and multiplied position-wise to obtain the weight feature F'; after a Sigmoid activation function, F' is multiplied with the third feature map F to obtain the enhanced feature F''.
In a second aspect of the present invention, a natural scene text detection system based on attention-mechanism feature fusion and enhancement is provided, comprising: an acquisition module for acquiring a natural scene text image; an encoding extraction module for performing convolution operations on the natural scene text image to extract features, with n downsampling operations, to obtain the first feature maps; a spatial information extraction module for extracting spatial detail information from the first feature maps to obtain spatial information masks; a channel information extraction module for extracting semantic information from the decoded-and-fused second feature map of the adjacent deeper layer to obtain a channel weight vector; a feature fusion module for stage-by-stage decoding, i.e. weighting and additively fusing the first feature map of the current layer, the second feature map of the adjacent deeper layer, the spatial information mask and the channel weight vector to obtain the information-enhanced second feature map of the current layer; a feature splicing module for upsampling the second feature maps of different scales to the same scale and concatenating them along the channel dimension to obtain a third feature map; and a feature enhancement module for modeling the spatial and channel relationships of the third feature map.
In a third aspect of the present invention, a computer-readable storage medium is provided, in which a computer program is stored, wherein the computer program is configured to perform the method according to any of the above-mentioned aspects when the computer program runs.
In a fourth aspect of the present invention, there is provided an electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to perform the method according to any of the above technical solutions.
The invention has the following beneficial effects: compared with existing feature extraction networks, a ResNet50 network with deformable convolution is used for feature extraction, so the extracted feature information is comprehensive and the effect is good; an attention-based feature fusion module is designed, which uses spatial and channel attention to extract the effective information of features at different levels and fuses them, so that the decoded features contain more accurate target information; and an attention-based feature enhancement module is designed, which uses convolution and pooling operations over different dimensions to extract more salient features and suppress noise.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart illustrating feature fusion and enhancement based on attention mechanism for natural scene text detection according to an embodiment of the present invention.
Fig. 2 is a diagram of a network architecture used in the embodiment of fig. 1.
Fig. 3 is a schematic structural diagram of the decoding fusion module AFFM.
Fig. 4 is a schematic diagram of a spatial information extraction module SAM structure.
FIG. 5 is a diagram of a CAM structure of a channel information extraction module.
Fig. 6 is a schematic structural diagram of a feature enhancement module JAM.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment is a natural scene text detection method based on attention mechanism feature fusion and enhancement, the overall method flow is shown in fig. 1, the network architecture and the module internal details are shown in fig. 2-6, wherein the method comprises the following steps:
and S1, acquiring a natural scene text image.
In this embodiment, images containing scene text are acquired by street cameras, street photographs, and the like.
S2, a modified ResNet50 network performs feature extraction on the natural scene text image obtained in S1, with n downsampling operations overall (5 in this embodiment); the features after the 2nd, 3rd, 4th and 5th downsampling are taken as the first feature maps.
Specifically, as shown in the left backbone portion of fig. 2, the natural scene text image to be recognized is uniformly scaled to a size of A×A. The ResNet50 network is modified to apply deformable convolution for feature extraction, performing 5 downsampling operations to obtain first feature maps in2, in3, in4, in5 with sizes A/4, A/8, A/16 and A/32.
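For illustration only, a minimal PyTorch sketch of such a backbone follows, using torchvision's stock ResNet-50; the function name build_backbone and the use of create_feature_extractor are our assumptions (they require a recent torchvision), and the patent's deformable-convolution modification is omitted for brevity:

```python
import torchvision
from torchvision.models.feature_extraction import create_feature_extractor

def build_backbone():
    # ResNet-50 trunk; the patent's deformable-convolution replacement of the
    # standard convolutions is omitted here.
    resnet = torchvision.models.resnet50(weights=None)
    # layer1..layer4 outputs follow the 2nd..5th downsampling steps, i.e. the
    # first feature maps in2..in5 at scales A/4, A/8, A/16, A/32.
    return create_feature_extractor(resnet, return_nodes={
        "layer1": "in2", "layer2": "in3", "layer3": "in4", "layer4": "in5"})
```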
S3, extracting spatial information features from each acquired first feature map to obtain spatial information masks; extracting semantic information features from the last first feature map to obtain a channel weight vector.
Specifically, as shown in fig. 4, the spatial information extraction module SAM performs position-wise maximum pooling and average pooling on the first feature map in_k, splices the two resulting 1×H×W features, and applies a 7×7 convolution to obtain a 1×H×W spatial information mask. As shown in fig. 5, the channel information extraction module CAM performs maximum pooling and average pooling on the first feature map in5 along the channel dimension to obtain two C×1×1 information vectors, applies a fully connected operation to them, and adds them position-wise to obtain the C×1×1 channel weight vector.
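The following is a minimal PyTorch sketch of the SAM and CAM structures just described; the class names follow the patent's module names, but the reduction ratio in CAM and the tensor-shape conventions are our assumptions:

```python
import torch
import torch.nn as nn

class SAM(nn.Module):
    """Spatial attention (fig. 4): position-wise max/avg pooling over channels,
    channel-dim concatenation, then a 7x7 convolution -> 1 x H x W mask."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                          # x: (B, C, H, W)
        max_map, _ = x.max(dim=1, keepdim=True)    # (B, 1, H, W)
        avg_map = x.mean(dim=1, keepdim=True)      # (B, 1, H, W)
        return self.conv(torch.cat([max_map, avg_map], dim=1))

class CAM(nn.Module):
    """Channel attention (fig. 5): global max/avg pooling per channel, a shared
    fully connected mapping, then position-wise addition -> C x 1 x 1 weights."""
    def __init__(self, channels, reduction=16):    # reduction ratio is assumed
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, _, _ = x.shape
        max_vec = x.amax(dim=(2, 3))               # (B, C)
        avg_vec = x.mean(dim=(2, 3))               # (B, C)
        return (self.fc(max_vec) + self.fc(avg_vec)).view(b, c, 1, 1)
```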
S4, decoding and fusing the first feature maps, the spatial information masks and the channel weight vectors stage by stage based on an attention mechanism, obtaining second feature maps with salient features through upsampling, multiplication and addition operations.
Specifically, as shown in fig. 3, the feature fusion module AFFM takes the first feature map in_k (k=2,3,4) from the shallow encoding end and the second feature map out_{k+1} (k=2,3,4) from the deep decoding end, and outputs the fused second feature map out_m (m=2,3,4,5). Concretely, the channel weight vector c5 obtained in step S3 is multiplied channel-by-channel with the first feature map in4 from the encoding end to obtain the channel-weighted encoding feature in4'; the spatial information mask s4 obtained in step S3 is multiplied position-by-position with the 2×-upsampled out5 from the decoding end to obtain the spatially weighted decoding feature out5'; then in4' and out5' are added position-by-position to obtain the second feature map out4 after the 1st decoding fusion. Step S3 is repeated to extract the spatial and channel information s3 and c4 of in3 and out4, and step S4 is repeated to obtain the second feature map out3 after the 2nd decoding fusion; and so on, yielding a series of second feature maps out2, out3, out4, out5 that have undergone 3, 2, 1 and 0 decoding fusions and have scales A/4, A/8, A/16 and A/32, respectively (out5 here is output directly from in5 without any feature extraction). The reason for this design is that shallow feature maps in a convolutional neural network contain more spatial detail, while deep feature maps contain more semantic information; extracting the effective information of different levels through the attention mechanism to weight the features of the other branch effectively enhances salient features and suppresses background noise while enhancing the target.
For this embodiment, as shown in the middle part of fig. 2, decoding fusion starts from the smallest-scale feature map in5: first, the spatial information mask of in4 is extracted and multiplied position-wise with the 2× bilinearly upsampled out5 (where out5 = in5); the channel weight vector of out5 is extracted and multiplied channel-wise with in4; the two weighted features are then added position-wise to obtain the decoded-and-fused feature out4. The other levels are decoded in the same manner.
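A minimal sketch of one AFFM fusion step follows, reusing the SAM/CAM sketches above; it assumes the encoder and decoder features at a level share the same channel count, and the sigmoid gates are our addition (the patent states only multiplication):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFFM(nn.Module):
    """One decoding-fusion step (fig. 3)."""
    def __init__(self, channels):
        super().__init__()
        self.sam = SAM()
        self.cam = CAM(channels)

    def forward(self, in_k, out_k1):
        # Channel-weight the shallow encoder feature with the deep feature's
        # channel weight vector (the c vector from CAM).
        in_w = in_k * torch.sigmoid(self.cam(out_k1))
        # Spatially weight the 2x-upsampled deep decoder feature with the
        # shallow feature's spatial mask (the s mask from SAM).
        up = F.interpolate(out_k1, scale_factor=2, mode="bilinear",
                           align_corners=False)
        out_w = up * torch.sigmoid(self.sam(in_k))
        return in_w + out_w  # position-wise sum -> fused second feature map
```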
S5, adjusting the channel counts of the fused features by convolution, upsampling at different rates to a uniform size of 1/4 the original image, and concatenating along the channel dimension to obtain a third feature map.
Specifically, as shown in the middle part of fig. 2, each fused second feature map out_m passes through a 3×3 convolution layer to further extract features, and the channel dimension is uniformly adjusted to C. The adjusted features out3, out4 and out5 are then upsampled at rates of 2, 4 and 8, respectively, to 1/4 of the original image size; together with out2, these give features P2, P3, P4 and P5, which are concatenated along the channel dimension to obtain the final decoded-and-fused third feature map F, of dimension 4C×A/4×A/4.
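A sketch of this step-S5 concatenation follows, under the assumption that out2..out5 carry ResNet-50 stage widths and that C = 64 (both illustrative values, not taken from the patent):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureConcat(nn.Module):
    """Step S5: 3x3 conv to a common channel count C, upsample every level to
    the A/4 scale, concatenate along channels."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), C=64):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(ic, C, kernel_size=3, padding=1) for ic in in_channels)

    def forward(self, outs):            # outs = [out2, out3, out4, out5]
        target = outs[0].shape[-2:]     # the A/4 x A/4 scale of out2
        feats = [F.interpolate(conv(o), size=target, mode="bilinear",
                               align_corners=False)      # P2..P5
                 for conv, o in zip(self.convs, outs)]
        return torch.cat(feats, dim=1)  # third feature map F: (B, 4C, A/4, A/4)
```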
And S6, constructing different combinations of convolution, pooling and splicing, and further extracting, fusing and reinforcing the features of the third feature map.
Specifically, as shown in fig. 6, the feature enhancement module JAM processes the features with a parallel connection. The module uses CAM and SAM respectively to model the dependencies between channels and between spatial positions of the third feature map F, expands the resulting 4C×1×1 channel weight vector and 1×A/4×A/4 spatial information mask to dimension 4C×A/4×A/4, and multiplies them position-wise to obtain the weight feature F'. After a Sigmoid activation function, F' is multiplied with the input feature F to obtain the enhanced feature F''. To avoid network degradation, a residual connection is added inside the module to ensure effective model training.
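A matching sketch of the JAM enhancement follows, again reusing the SAM/CAM sketches above; the broadcast semantics of the position-wise product are our reading of fig. 6:

```python
import torch
import torch.nn as nn

class JAM(nn.Module):
    """Feature enhancement (fig. 6): CAM and SAM in parallel on F, broadcast
    multiply into a joint weight F', sigmoid, gate F, plus a residual link."""
    def __init__(self, channels):
        super().__init__()
        self.cam = CAM(channels)
        self.sam = SAM()

    def forward(self, f):              # f: (B, 4C, A/4, A/4)
        w = self.cam(f) * self.sam(f)  # (B,4C,1,1) * (B,1,H,W) -> (B,4C,H,W)
        return f + torch.sigmoid(w) * f  # enhanced feature F'' with residual
```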
And S7, up-sampling the finally obtained features to the size of the original image, and performing convolution to obtain a segmentation mask of a text core area and a boundary area in the text image of the natural scene.
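For completeness, a hypothetical prediction head for step S7; the choice of a 1×1 convolution and of two output channels (text core and boundary masks) are assumptions consistent with, but not specified by, the text above:

```python
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    """Step S7: upsample F'' to the input resolution and predict per-pixel
    masks for the text core and boundary regions."""
    def __init__(self, in_channels, num_masks=2):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, num_masks, kernel_size=1)

    def forward(self, f, image_size):  # image_size = (A, A)
        f = F.interpolate(f, size=image_size, mode="bilinear",
                          align_corners=False)
        return self.conv(f)            # mask logits at A x A
```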
The invention provides another embodiment, which is a natural scene text image segmentation system based on an attention mechanism, comprising an acquisition module, a coding extraction module, a spatial information extraction module, a channel information extraction module, a decoding fusion module, a feature splicing module and a feature enhancement module, wherein,
and the acquisition module is used for acquiring the natural scene text image.
And the coding extraction module is used for performing convolution operation on the acquired natural scene text image to extract features, and performing downsampling for 5 times to acquire 4 first feature maps with different scales.
And the spatial information extraction module SAM is used for extracting spatial detail information from the first feature maps, producing spatial information masks for the decoding fusion module. In this embodiment, the SAM is embedded inside the AFFM and JAM modules; its internal structure, shown in FIG. 4, comprises a first pooling unit, a splicing unit and a convolution unit, wherein
the first pooling unit performs position-wise maximum pooling and average pooling on the first feature map;
the splicing unit concatenates the two feature maps obtained by maximum and average pooling along the channel dimension;
and the convolution unit convolves the spliced features to extract features, obtaining the spatial information mask.
And the channel information extraction module CAM is used for extracting semantic information from the decoded-and-fused second feature map of the previous layer, producing a channel weight vector for the decoding fusion module of the next layer. In this embodiment, the CAM module is embedded inside the AFFM and JAM modules; its internal structure, shown in FIG. 5, comprises a second pooling unit, a fully connected unit and an adding unit, wherein
the second pooling unit performs global maximum pooling and global average pooling on the second feature map along the channel dimension;
the fully connected unit applies a fully connected operation to the max- and average-pooled feature vectors to establish dependencies between channels;
and the adding unit adds the two fully connected feature vectors position-wise to obtain the channel weight vector.
And the decoding fusion module AFFM is used for stage-by-stage decoding, i.e. weighting and additively fusing the first feature map of the current layer, the second feature map of the previous layer, the spatial information mask and the channel weight vector to obtain the information-enhanced second feature map of the current layer. In this embodiment, there are 3 decoding fusion modules.
And the feature splicing module is used for sampling the second feature maps of 4 different scales to the same scale and splicing the second feature maps in the channel dimension to obtain a third feature map.
And the feature enhancement module JAM is used for carrying out spatial and channel relation modeling on the third feature graph and enhancing the feature characterization capability.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (9)
1. The method for detecting the natural scene text based on attention mechanism feature fusion and enhancement is characterized by comprising the following steps,
s1, acquiring a natural scene text image;
s2, extracting features from the natural scene text image, performing n downsampling operations overall, and taking the feature maps after the 2nd through nth downsampling as the first feature maps;
s3, extracting spatial information features from each first feature map except the one from the last downsampling to obtain spatial information masks; extracting channel information features from the last first feature map to obtain a channel weight vector;
s4, decoding and fusing the first feature maps, the spatial information masks and the channel weight vectors stage by stage based on an attention mechanism, obtaining second feature maps with salient features through upsampling, multiplication and addition operations;
s5, adjusting the channel counts of the fused features by convolution, upsampling at different rates to a uniform size of 1/4 the original image, and concatenating along the channel dimension to obtain a third feature map;
s6, constructing different combinations of convolution, pooling and splicing, and further extracting, fusing and reinforcing the features of the third feature map;
and S7, upsampling the features obtained in the S6 to the size of the original image, and performing convolution to obtain a segmentation mask of a text core area and a text boundary area in the text image of the natural scene.
2. The method for detecting natural scene text according to claim 1, wherein step S2 specifically comprises uniformly scaling the natural scene text image to be recognized to a size of A×A; extracting features using deformable convolution; and performing downsampling 5 times to obtain first feature maps in2, in3, in4, in5 with sizes A/4, A/8, A/16 and A/32.
3. The natural scene text detection method of claim 2, wherein step S3 specifically comprises constructing combinations of convolution, pooling and splicing: the spatial information extraction module SAM performs position-wise maximum pooling and average pooling on the first feature map in4, then splices and convolves the pooled features to obtain the spatial information mask s4; maximum pooling and average pooling are performed on the first feature map in5 along the channel dimension to obtain two information vectors, which pass through a fully connected operation and are added position-wise to obtain the channel weight vector c5.
4. The method according to claim 3, wherein step S4 is specifically implemented by the feature fusion module AFFM, which operates on the first feature map in_k (k=2,3,4) from the shallow encoding end and the second feature map out_{k+1} (k=2,3,4) from the deep decoding end to obtain the fused second feature map out_m (m=2,3,4,5).
5. The natural scene text detection method according to claim 4, wherein step S5 specifically comprises passing each fused second feature map out_m through a 3×3 convolution layer to further extract features and uniformly adjust the channel dimension to C; upsampling out3, out4 and out5 at rates of 2, 4 and 8, respectively, to 1/4 of the original image size (out2 is already at that scale), giving features P2, P3, P4 and P5; and splicing these along the channel dimension as the third feature map F, of dimension 4C×A/4×A/4.
6. The natural scene text detection method of claim 5, wherein step S6 specifically comprises: the feature enhancement module JAM uses the channel information extraction module CAM and the spatial information extraction module SAM respectively to model the dependencies between channels and between spatial positions of the third feature map F; the resulting 4C×1×1 channel weight vector and 1×A/4×A/4 spatial information mask are expanded to dimension 4C×A/4×A/4 and multiplied position-wise to obtain the weight feature F'; after a Sigmoid activation function, F' is multiplied with the third feature map F to obtain the enhanced feature F''.
7. A natural scene text detection system based on an attention mechanism is characterized by comprising,
the acquisition module is used for acquiring a natural scene text image;
the encoding extraction module is used for performing convolution operation on the natural scene text image to extract features and performing downsampling for n times to obtain a first feature map;
the spatial information extraction module is used for extracting spatial detail information of the first feature map to obtain a spatial information mask;
the channel information extraction module is used for extracting semantic information from the second feature map of the adjacent deep layer to obtain a channel weight vector;
the feature fusion module is used for decoding step by step, namely combining, weighting and adding and fusing the first feature map of the layer, the second feature map of the adjacent deep layer, the spatial information mask and the channel weight vector to obtain the second feature map of the layer after information enhancement;
the characteristic splicing module is used for sampling the second characteristic graphs with different scales to the same scale and splicing the second characteristic graphs in the channel dimension to obtain a third characteristic graph;
and the characteristic enhancement module is used for carrying out spatial and channel relational modeling on the third characteristic diagram.
8. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 6 when executed.
9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 6.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111393620.2A | 2021-11-23 | 2021-11-23 | Natural scene text detection method and system based on attention mechanism feature fusion and enhancement |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111393620.2A | 2021-11-23 | 2021-11-23 | Natural scene text detection method and system based on attention mechanism feature fusion and enhancement |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN114255456A | 2022-03-29 |

Family
ID=80791093

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111393620.2A (CN114255456A, pending) | Natural scene text detection method and system based on attention mechanism feature fusion and enhancement | 2021-11-23 | 2021-11-23 |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN114255456A (en) |
Patent Citations (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2020221013A1 | 2019-04-29 | 2020-11-05 | 腾讯科技(深圳)有限公司 | Image processing method and apparatus, electronic device and storage medium |
| CN113486890A | 2021-06-16 | 2021-10-08 | 湖北工业大学 | Text detection method based on attention feature fusion and cavity residual error feature enhancement |
| CN113516126A | 2021-07-02 | 2021-10-19 | 成都信息工程大学 | Adaptive threshold scene text detection method based on attention feature fusion |
Non-Patent Citations (1)

| Title |
|---|
| Minghui Liao et al., "Real-Time Scene Text Detection with Differentiable Binarization", The Thirty-Fourth AAAI Conference on Artificial Intelligence, vol. 34, no. 7, 3 April 2020, pages 11474-11481 |
Cited By (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114820515A | 2022-04-26 | 2022-07-29 | 渭南日报社印刷厂 | Non-reference image quality evaluation method based on channel attention |
| CN114581906A | 2022-05-06 | 2022-06-03 | 山东大学 | Text recognition method and system for natural scene image |
| CN114821074A | 2022-07-01 | 2022-07-29 | 湖南盛鼎科技发展有限责任公司 | Airborne LiDAR point cloud semantic segmentation method, electronic equipment and storage medium |
Legal Events

| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |