CN114255456A - Natural scene text detection method and system based on attention mechanism feature fusion and enhancement - Google Patents


Info

Publication number
CN114255456A
Authority
CN
China
Prior art keywords
feature map
features
feature
channel
natural scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111393620.2A
Other languages
Chinese (zh)
Inventor
孔令军
陈静娴
裴会增
沈馨怡
刘伟光
周耀威
闫佳艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinling Institute of Technology
Original Assignee
Jinling Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinling Institute of Technology filed Critical Jinling Institute of Technology
Priority to CN202111393620.2A priority Critical patent/CN114255456A/en
Publication of CN114255456A publication Critical patent/CN114255456A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks


Abstract

The invention discloses a natural scene text detection method and system based on attention-mechanism feature fusion and enhancement. The method comprises: extracting features from a natural scene text image to obtain first feature maps; further extracting spatial information to obtain spatial information masks; extracting semantic (channel) information from the last first feature map to obtain a channel weight vector; decoding and fusing the first feature maps, spatial information masks and channel weight vector stage by stage based on an attention mechanism to obtain second feature maps; adjusting the channel count of the fused features and concatenating them along the channel dimension to obtain a third feature map; and upsampling to the original image size and convolving to obtain segmentation masks of the text core region and boundary region in the natural scene text image. The invention extracts comprehensive feature information with good effect; the decoded features contain more accurate target information; and convolution and pooling operations over different dimensions extract more salient features and suppress noise.

Description

Natural scene text detection method and system based on attention mechanism feature fusion and enhancement
Technical Field
The invention belongs to the technical field of computer vision and artificial intelligence, and particularly relates to a natural scene text detection method and system based on attention mechanism feature fusion and enhancement.
Background
A scene image is an image captured by an image acquisition device in a natural scene. Compared with other elements in an image, text conveys richer and more precise information, and text information has high auxiliary value in research on technologies such as autonomous driving and intelligent translation, so detecting text in natural scenes is important for image understanding. To help a computer understand an image more accurately, it is essential to automatically and accurately extract text regions from a scene image.
Detecting scene text is more challenging than detecting regular document text. Text in natural scenes appears against complex backgrounds, is subject to many sources of interference, and is itself diverse and variable, all of which make the detection task harder. In addition, the quality of scene images captured by image acquisition devices is affected by external conditions such as outdoor lighting, weather, shooting angle and occlusion, so the captured images often suffer from low contrast, blur, distortion and occlusion. Together these make text detection in natural scene images a difficult problem.
With the rise of deep learning, the traditional approach of manually designing features and classifiers for text detection has gradually been replaced by convolutional neural networks, which learn feature information in images autonomously through convolution operations. Current deep-learning-based text detection methods fall into two main categories: regression-based and segmentation-based.
Regression-based methods adapt general object detection frameworks, such as Faster R-CNN or SSD, to the characteristics of text. They localize text with circumscribed rectangular boxes, which works well for document text, but such boxes cannot tightly enclose text of arbitrary shape, and the redundant background noise they include seriously interferes with subsequent text recognition.
Segmentation-based methods build network frameworks on classical semantic segmentation ideas such as FCN and FPN. Each pixel of the input image is predicted and classified, and pixels of the same class are clustered to determine the position of the target text. This approach adapts to text of arbitrary shape, but the complex backgrounds and diverse scales of scene text make extracting effective features more difficult, leading to false detections and missed detections.
Disclosure of Invention
In view of the defects of the prior art, the invention adopts a natural scene text detection method and system based on attention-mechanism feature fusion and enhancement to effectively segment text in scene text images, and reduces false detections and missed detections by enhancing the spatial and channel information of scene text image features.
In a first aspect of the present invention, a natural scene text detection method based on attention mechanism feature fusion and enhancement is provided, which includes the following steps,
s1, acquiring a natural scene text image;
s2, extracting features from the natural scene text image, performing n downsampling operations in total, and taking the feature maps after the 2nd through nth downsampling as first feature maps;
s3, extracting spatial information features from each first feature map except the last downsampled one to obtain spatial information masks; extracting channel information features from the last first feature map to obtain a channel weight vector;
s4, decoding and fusing the first feature map, the spatial information mask and the channel weight vector step by step based on an attention mechanism, and obtaining a second feature map with significant features through operations of upsampling, multiplying and adding;
s5, adjusting the number of channels of the fused features by convolution, upsampling at different magnifications to the uniform size of 1/4 of the original image, and concatenating along the channel dimension to obtain a third feature map;
s6, constructing different combinations of convolution, pooling and splicing, and further extracting, fusing and reinforcing the features of the third feature map;
and S7, upsampling the features obtained in the S6 to the size of the original image, and performing convolution to obtain a segmentation mask of a text core area and a boundary area in the text image of the natural scene.
Further, step S2 specifically comprises uniformly scaling the natural scene text image to be recognized to size A × A; extracting features using deformable convolution and performing 5 downsampling operations to obtain first feature maps in2, in3, in4, in5 with sizes A/4, A/8, A/16 and A/32.
Further, step S3 specifically comprises constructing combinations of convolution, pooling and concatenation: the spatial information extraction module SAM performs position-wise maximum pooling and average pooling on the first feature map ink, then concatenates and convolves the pooled features to obtain a spatial information mask; the first feature map in5 undergoes maximum pooling and average pooling along the channel dimension to obtain two information vectors, which are passed through fully connected operations and added position-wise to obtain the channel weight vector.
Further, step S4 specifically comprises: the feature fusion module AFFM fuses the first feature map ink (k = 2,3,4) from the shallow encoding end with the second feature map outk+1 (k = 2,3,4) from the deep decoding end to obtain the fused second feature map outm (m = 2,3,4,5). Specifically, the channel weight vector c5 obtained in step S3 is multiplied channel by channel with the first feature map in4 from the encoding end to obtain the channel-weighted encoding feature in4'; the spatial information mask s4 obtained in step S3 is multiplied position by position with the 2× upsampled decoding feature out5 to obtain the spatially weighted decoding feature out5'; then in4' and out5' are added position-wise to obtain the second feature map out4 after the first decoding fusion. Step S3 is repeated to extract the spatial and channel information s3 and c4 of in3 and out4, and step S4 is repeated to obtain the second feature map out3 after the second decoding fusion, and so on, yielding a series of second feature maps out2, out3, out4, out5 that undergo 3, 2, 1 and 0 decoding fusions and have scales A/4, A/8, A/16 and A/32 respectively (the second feature map out5 is output directly from in5 without any feature extraction).
Further, step S5 specifically comprises: each fused second feature map outm passes through a 3 × 3 convolution layer for further feature extraction, with the channel dimension uniformly adjusted to C; the features are then upsampled at magnifications of 1, 2, 4 and 8 respectively to 1/4 of the original image size, giving features P2, P3, P4 and P5, which are concatenated along the channel dimension as the third feature map F, of dimension 4C × A/4 × A/4.
Further, step S6 specifically comprises: the feature enhancement module JAM uses the channel information extraction module CAM and the spatial information extraction module SAM to model the dependencies between channels and between spatial positions of the third feature map F respectively; the resulting 4C × 1 × 1 channel information weight vector and 1 × A/4 × A/4 spatial information mask are expanded to 4C × A/4 × A/4 and multiplied position-wise to obtain the weight feature F', which after a Sigmoid activation function is multiplied with the third feature map F to obtain the enhanced feature F''.
In a second aspect of the present invention, a natural scene text detection system based on attention-mechanism feature fusion and enhancement is provided, comprising: an acquisition module for acquiring a natural scene text image; an encoding extraction module for performing convolution operations on the natural scene text image to extract features, with n downsampling operations, obtaining first feature maps; a spatial information extraction module for extracting spatial detail information from a first feature map to obtain a spatial information mask; a channel information extraction module for extracting semantic information from the decoded and fused second feature map of the adjacent deeper layer to obtain a channel weight vector; a feature fusion module for stage-by-stage decoding, i.e. weighting and additively fusing the current layer's first feature map, the adjacent deeper layer's second feature map, the spatial information mask and the channel weight vector to obtain the information-enhanced second feature map of the current layer; a feature concatenation module for upsampling the second feature maps of different scales to the same scale and concatenating them along the channel dimension to obtain a third feature map; and a feature enhancement module for modeling the spatial and channel relationships of the third feature map.
In a third aspect of the present invention, a computer-readable storage medium is provided, in which a computer program is stored, wherein the computer program is configured to perform the method according to any of the above-mentioned aspects when the computer program runs.
In a fourth aspect of the present invention, there is provided an electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to perform the method according to any of the above technical solutions.
The invention has the following beneficial effects: compared with existing feature extraction networks, the ResNet50 network with deformable convolution extracts more comprehensive feature information with good effect; the attention-based feature fusion module extracts and fuses effective information from features at different levels using spatial and channel attention, so the decoded features contain more accurate target information; and the attention-based feature enhancement module uses convolution and pooling operations over different dimensions to extract more salient features and suppress noise.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart illustrating feature fusion and enhancement based on attention mechanism for natural scene text detection according to an embodiment of the present invention.
Fig. 2 is a diagram of a network architecture used in the embodiment of fig. 1.
Fig. 3 is a schematic structural diagram of the decoding fusion module AFFM.
Fig. 4 is a schematic diagram of a spatial information extraction module SAM structure.
FIG. 5 is a diagram of a CAM structure of a channel information extraction module.
Fig. 6 is a schematic structural diagram of a feature enhancement module JAM.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment is a natural scene text detection method based on attention mechanism feature fusion and enhancement, the overall method flow is shown in fig. 1, the network architecture and the module internal details are shown in fig. 2-6, wherein the method comprises the following steps:
and S1, acquiring a natural scene text image.
In this embodiment, the image including the scene text is acquired by a street camera, a street photo, or the like.
And S2, performing feature extraction on the natural scene text image obtained in S1 with a modified ResNet50 network, with n downsampling operations in total (5 in this embodiment), and taking the features obtained after the 2nd, 3rd, 4th and 5th downsampling as the first feature maps.
Specifically, as shown in the backbone network portion on the left of fig. 2, the natural scene text image to be recognized is uniformly scaled to size A × A. The ResNet50 network is modified to extract features with deformable convolution and perform 5 downsampling operations, yielding first feature maps in2, in3, in4, in5 with sizes A/4, A/8, A/16 and A/32.
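The scale relationships above can be checked with a short sketch (not part of the patent; the input side A = 640 is an assumed example value, and the function name `stage_sizes` is illustrative only):

```python
import numpy as np  # only used to match the style of later sketches

# A ResNet-50-style backbone halves the spatial resolution at each of its
# 5 stages. For an input scaled to A x A, the feature maps kept as "first
# feature maps" (stages 2-5) have side lengths A/4, A/8, A/16 and A/32.
A = 640  # assumed input side length; the patent only requires A x A

def stage_sizes(a, n_stages=5):
    """Return the spatial side length after each downsampling stage."""
    return [a // (2 ** s) for s in range(1, n_stages + 1)]

sizes = stage_sizes(A)    # side lengths after stages 1..5
first_maps = sizes[1:]    # stages 2-5 correspond to in2..in5
```

For A = 640 this yields first feature maps with sides 160, 80, 40 and 20, i.e. A/4 through A/32.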
S3, extracting spatial information features of each acquired first feature map to obtain a spatial information mask; and performing feature extraction of semantic information on the last first feature map to obtain a channel weight vector.
Specifically, as shown in fig. 4, the spatial information extraction module SAM performs position-wise maximum pooling and average pooling on the first feature map ink, concatenates the two resulting 1 × H × W features, and applies a 7 × 7 convolution to obtain a 1 × H × W spatial information mask; as shown in fig. 5, the channel information extraction module CAM performs maximum pooling and average pooling on the first feature map in5 along the channel dimension to obtain two C × 1 × 1 information vectors, applies fully connected operations to them, and adds them position-wise to obtain the C × 1 × 1 channel weight vector.
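The SAM and CAM operations described above can be sketched in numpy as follows. This is not the patent's implementation: the 7 × 7 convolution of SAM is reduced here to a fixed 1 × 1 combination of the two pooled maps, and the CAM fully connected layer is replaced by assumed identity weights, purely to keep the sketch short and self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

def sam_mask(x, w=(0.5, 0.5), b=0.0):
    """Spatial information mask: position-wise max/avg pooling over channels,
    concatenation, then a convolution (here simplified to a weighted sum of
    the two pooled maps with assumed weights w)."""
    mx = x.max(axis=0)                    # H x W: channel-wise max per position
    av = x.mean(axis=0)                   # H x W: channel-wise average
    logits = w[0] * mx + w[1] * av + b    # stand-in for the 7x7 convolution
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> mask values in (0, 1)

def cam_vector(x, W=None):
    """Channel weight vector: global max/avg pooling over H x W, a shared
    fully connected layer (assumed identity here), position-wise addition."""
    C = x.shape[0]
    mx = x.max(axis=(1, 2))               # C-dim vector of channel maxima
    av = x.mean(axis=(1, 2))              # C-dim vector of channel averages
    if W is None:
        W = np.eye(C)                     # assumed FC weights for the sketch
    s = W @ mx + W @ av                   # FC on each vector, then add
    return 1.0 / (1.0 + np.exp(-s))       # C-dim channel weights in (0, 1)

x = rng.standard_normal((8, 16, 16))      # toy C x H x W feature map
mask = sam_mask(x)                        # shape (16, 16)
cvec = cam_vector(x)                      # shape (8,)
```

In a real network the sigmoid, convolution weights and FC weights would all be learned; the sketch only shows the data flow.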
And S4, decoding and fusing the first feature map, the spatial information mask and the channel weight vector step by step based on an attention mechanism, and obtaining a second feature map with significant features through operations of upsampling, multiplying and adding.
Specifically, as shown in fig. 3, the feature fusion module AFFM takes the first feature map ink (k = 2,3,4) from the shallow encoding end and the second feature map outk+1 (k = 2,3,4) from the deep decoding end, and outputs the fused second feature map outm (m = 2,3,4,5). Specifically, the channel weight vector c5 obtained in step S3 is multiplied channel by channel with the first feature map in4 from the encoding end to obtain the channel-weighted encoding feature in4'; the spatial information mask s4 obtained in step S3 is multiplied position by position with the 2× upsampled decoding feature out5 to obtain the spatially weighted decoding feature out5'; then in4' and out5' are added position-wise to obtain the second feature map out4 after the first decoding fusion. Step S3 is repeated to extract the spatial and channel information s3 and c4 of in3 and out4, and step S4 is repeated to obtain the second feature map out3 after the second decoding fusion, and so on, yielding a series of second feature maps out2, out3, out4, out5 that undergo 3, 2, 1 and 0 decoding fusions and have scales A/4, A/8, A/16 and A/32 respectively (the second feature map out5 is output directly from in5 without any feature extraction). The reason for this design is that shallow feature maps in a convolutional neural network contain more spatial detail information, while deep feature maps contain more semantic information; extracting effective information from different levels through the attention mechanism to weight the other features effectively enhances salient features, strengthening the target while suppressing background noise.
For this embodiment, as shown in the middle part of fig. 2, decoding fusion starts from the smallest-scale feature map in5: the spatial information mask of in4 is extracted and multiplied position-wise with the 2× linearly upsampled out5 (here out5 = in5); the channel information weight vector of out5 is extracted and multiplied channel-wise with in4; and the two weighted features are added position-wise to obtain the decoded and fused feature out4. The other levels are decoded in the same manner.
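One AFFM fusion step as described above can be sketched in numpy. This is an illustration, not the patent's code: nearest-neighbour repetition stands in for the linear 2× upsampling, and the constant mask/vector values are assumed purely so the result is easy to verify by hand.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a C x H x W array (a stand-in
    for the linear upsampling an actual implementation would use)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def affm(in_k, out_k1, s_k, c_k1):
    """AFFM step: channel-weight the shallow encoder feature in_k with the
    deep channel vector c_{k+1}, spatially weight the 2x-upsampled deep
    decoder feature out_{k+1} with mask s_k, then add position-wise."""
    enc = in_k * c_k1[:, None, None]        # channel-by-channel weighting
    dec = upsample2x(out_k1) * s_k[None]    # position-by-position weighting
    return enc + dec

C, H = 8, 16
in_k   = np.ones((C, H, H))                 # encoder feature at scale H
out_k1 = np.ones((C, H // 2, H // 2))       # decoder feature at scale H/2
s_k    = np.full((H, H), 0.5)               # assumed spatial mask
c_k1   = np.full(C, 0.25)                   # assumed channel weights
out_k  = affm(in_k, out_k1, s_k, c_k1)      # fused C x H x W feature
```

With these constants every output position equals 1 × 0.25 + 1 × 0.5 = 0.75, which makes the weighting-then-adding data flow easy to see.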
S5, adjusting the number of channels of the fusion features by convolution, adopting the uniform size of upsampling with different multiplying power as the size of the original image 1/4, and splicing according to the channel dimensions to obtain a third feature map.
Specifically, as shown in the middle part of fig. 2, each fused second feature map outm passes through a 3 × 3 convolution layer for further feature extraction. The channel dimension is uniformly adjusted to C, and the channel-adjusted features out3, out4, out5 are upsampled at magnifications of 2, 4 and 8 respectively to 1/4 of the original image size, giving features P2, P3, P4 and P5, which are concatenated along the channel dimension to obtain the third feature map F after final decoding fusion, of dimension 4C × A/4 × A/4.
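The shape bookkeeping of this concatenation step can be sketched as follows (not from the patent; C = 4 and A = 64 are assumed toy values, and nearest-neighbour repetition again stands in for the real upsampling; the 3 × 3 convolution is omitted since it does not change shapes here):

```python
import numpy as np

def upsample(x, r):
    """Nearest-neighbour r-times upsampling (r = 1 leaves x unchanged)."""
    return x.repeat(r, axis=1).repeat(r, axis=2) if r > 1 else x

C, A = 4, 64                     # assumed channel count and input side
# out2..out5 at scales A/4, A/8, A/16, A/32, each with C channels
outs = [np.ones((C, A // s, A // s)) for s in (4, 8, 16, 32)]
# upsample at magnifications 1, 2, 4, 8 so all reach the A/4 scale
ps = [upsample(o, r) for o, r in zip(outs, (1, 2, 4, 8))]   # P2..P5
F = np.concatenate(ps, axis=0)   # third feature map: 4C x A/4 x A/4
```

For C = 4 and A = 64 this gives F of shape 16 × 16 × 16, matching the stated 4C × A/4 × A/4 dimension.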
And S6, constructing different combinations of convolution, pooling and splicing, and further extracting, fusing and reinforcing the features of the third feature map.
Specifically, as shown in fig. 6, the feature enhancement module JAM processes the features with a parallel structure. The module uses CAM and SAM to model the dependencies between channels and between spatial positions of the third feature map F respectively, expands the resulting 4C × 1 × 1 channel information weight vector and 1 × A/4 × A/4 spatial information mask to 4C × A/4 × A/4, and multiplies them position-wise to obtain the weight feature F'; after a Sigmoid activation function this is multiplied with the input feature F to obtain the enhanced feature F''. To avoid network degradation, a residual connection is added in the module to ensure effective model training.
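The JAM weighting-plus-residual path can be sketched in numpy as below. This is an illustration under assumptions: the channel vector and spatial mask are taken as given constants (in the module they come from CAM and SAM), and broadcasting replaces the explicit expansion to 4C × A/4 × A/4.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def jam(F, c_vec, s_mask):
    """JAM enhancement: broadcast the channel vector and spatial mask to the
    feature's full shape, multiply them position-wise into a weight feature
    F', apply a sigmoid, multiply with F, and add a residual connection."""
    Fp = c_vec[:, None, None] * s_mask[None]   # expanded weights, multiplied
    return F * sigmoid(Fp) + F                 # weighted feature + residual

C4, H = 16, 8
F = np.ones((C4, H, H))          # toy third feature map (4C x A/4 x A/4)
c_vec = np.zeros(C4)             # assumed zero weights -> sigmoid gives 0.5
s_mask = np.zeros((H, H))
F2 = jam(F, c_vec, s_mask)       # enhanced feature F''
```

With zero weights the sigmoid evaluates to 0.5 everywhere, so each position of F'' equals 1 × 0.5 + 1 = 1.5, showing how the residual path keeps the input feature present even when the attention weights contribute little.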
And S7, up-sampling the finally obtained features to the size of the original image, and performing convolution to obtain a segmentation mask of a text core area and a boundary area in the text image of the natural scene.
The invention provides another embodiment, which is a natural scene text image segmentation system based on an attention mechanism, comprising an acquisition module, a coding extraction module, a spatial information extraction module, a channel information extraction module, a decoding fusion module, a feature splicing module and a feature enhancement module, wherein,
and the acquisition module is used for acquiring the natural scene text image.
And the coding extraction module is used for performing convolution operation on the acquired natural scene text image to extract features, and performing downsampling for 5 times to acquire 4 first feature maps with different scales.
And the spatial information extraction module SAM is used for extracting spatial detail information from the first feature map to obtain a spatial information mask for the decoding fusion module. In this embodiment, the SAM is embedded inside the AFFM and JAM modules; its internal structure, shown in fig. 4, comprises a first pooling unit, a concatenation unit and a convolution unit, wherein
The first pooling unit performs maximum pooling and average pooling on the first characteristic diagram according to positions;
the splicing unit splices the features of 2 feature graphs obtained by maximum pooling and average pooling according to channel dimensions;
and the convolution unit is used for performing convolution operation on the spliced features to extract the features, so as to obtain the spatial information mask.
And the channel information extraction module CAM is used for extracting semantic information from the decoded and fused second feature map of the previous layer to obtain a channel weight vector for the next layer's decoding fusion module. In this embodiment, the CAM is embedded inside the AFFM and JAM modules; its internal structure, shown in fig. 5, comprises a second pooling unit, a fully connected unit and an addition unit, wherein
The second pooling unit respectively performs global maximum pooling and global average pooling on the second feature map according to channel dimensions;
the full-connection unit establishes a dependency relationship between channels on the maximum and average pooled feature vectors by using full-connection operation;
and the adding unit is used for adding the fully connected 2 eigenvectors according to positions to obtain the channel weight vector.
And the decoding fusion module AFFM is used for decoding step by step, namely combining, weighting and adding and fusing the first characteristic diagram of the current layer, the second characteristic diagram of the previous layer, the spatial information mask and the channel weight vector to obtain the information-enhanced second characteristic diagram of the current layer. In this embodiment, there are 3 decoding fusion modules.
And the feature splicing module is used for sampling the second feature maps of 4 different scales to the same scale and splicing the second feature maps in the channel dimension to obtain a third feature map.
And the feature enhancement module JAM is used for carrying out spatial and channel relation modeling on the third feature graph and enhancing the feature characterization capability.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (9)

1. The method for detecting the natural scene text based on attention mechanism feature fusion and enhancement is characterized by comprising the following steps,
s1, acquiring a natural scene text image;
s2, extracting features from the natural scene text image, performing n downsampling operations in total, and taking the feature maps after the 2nd through nth downsampling as first feature maps;
s3, extracting spatial information features of each first feature map except the first feature map subjected to the last downsampling to obtain a spatial information mask; extracting the characteristics of the channel information of the last first characteristic diagram to obtain a channel weight vector;
s4, decoding and fusing the first feature map, the spatial information mask and the channel weight vector step by step based on an attention mechanism, and obtaining a second feature map with significant features through operations of upsampling, multiplying and adding;
s5, adjusting the number of channels of the fusion features by convolution, adopting upsampling unified sizes with different multiplying powers as the size of the original image 1/4, and splicing according to channel dimensions to obtain a third feature map;
s6, constructing different combinations of convolution, pooling and splicing, and further extracting, fusing and reinforcing the features of the third feature map;
and S7, upsampling the features obtained in the S6 to the size of the original image, and performing convolution to obtain a segmentation mask of a text core area and a text boundary area in the text image of the natural scene.
2. The natural scene text detection method according to claim 1, wherein step S2 specifically comprises uniformly scaling the natural scene text image to be recognized to size A × A; extracting features using deformable convolution and performing 5 downsampling operations to obtain first feature maps in2, in3, in4, in5 with sizes A/4, A/8, A/16 and A/32.
3. The natural scene text detection method according to claim 2, wherein step S3 specifically comprises constructing combinations of convolution, pooling and concatenation: the spatial information extraction module SAM performs position-wise maximum pooling and average pooling on the first feature map in4, then concatenates and convolves the pooled features to obtain the spatial information mask s4; the first feature map in5 undergoes maximum pooling and average pooling along the channel dimension to obtain two information vectors, which are passed through fully connected operations and added position-wise to obtain the channel weight vector c5.
4. The method according to claim 3, wherein step S4 is implemented by a feature fusion module AFFM, which fuses the first feature map ink (k = 2,3,4) from the shallow encoding end with the second feature map outk+1 (k = 2,3,4) from the deep decoding end to obtain the fused second feature map outm (m = 2,3,4,5).
5. The natural scene text detection method according to claim 4, wherein step S5 specifically comprises passing each fused second feature map outm through a 3 × 3 convolution layer for further feature extraction, uniformly adjusting the channel dimension to C, upsampling the features at magnifications of 1, 2, 4 and 8 respectively to 1/4 of the original image size to obtain features P2, P3, P4 and P5, and concatenating them along the channel dimension as the third feature map F, of dimension 4C × A/4 × A/4.
6. The natural scene text detection method according to claim 5, wherein step S6 specifically comprises: the feature enhancement module JAM uses a channel information extraction module CAM and a spatial information extraction module SAM to model the dependencies between channels and between spatial positions of the third feature map F respectively, expands the resulting 4C × 1 × 1 channel information weight vector and 1 × A/4 × A/4 spatial information mask to 4C × A/4 × A/4, multiplies them position-wise to obtain the weight feature F', and after a Sigmoid activation function multiplies it with the third feature map F to obtain the enhanced feature F''.
7. A natural scene text detection system based on an attention mechanism, characterized by comprising:
the acquisition module is used for acquiring a natural scene text image;
the encoding extraction module is used for performing convolution operation on the natural scene text image to extract features and performing downsampling for n times to obtain a first feature map;
the spatial information extraction module is used for extracting spatial detail information of the first feature map to obtain a spatial information mask;
the channel information extraction module is used for extracting semantic information from the second feature map of the adjacent deep layer to obtain a channel weight vector;
the feature fusion module is used for step-by-step decoding, namely fusing, by combining, weighting and adding, the first feature map of the current layer, the second feature map of the adjacent deeper layer, the spatial information mask and the channel weight vector, to obtain the information-enhanced second feature map of the current layer;
the feature splicing module is used for upsampling the second feature maps of different scales to the same scale and splicing them along the channel dimension to obtain a third feature map;
and the feature enhancement module is used for modeling the spatial and channel relations of the third feature map.
8. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 6 when executed.
9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 6.
CN202111393620.2A 2021-11-23 2021-11-23 Natural scene text detection method and system based on attention mechanism feature fusion and enhancement Pending CN114255456A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111393620.2A CN114255456A (en) 2021-11-23 2021-11-23 Natural scene text detection method and system based on attention mechanism feature fusion and enhancement

Publications (1)

Publication Number Publication Date
CN114255456A true CN114255456A (en) 2022-03-29

Family

ID=80791093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111393620.2A Pending CN114255456A (en) 2021-11-23 2021-11-23 Natural scene text detection method and system based on attention mechanism feature fusion and enhancement

Country Status (1)

Country Link
CN (1) CN114255456A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581906A (en) * 2022-05-06 2022-06-03 山东大学 Text recognition method and system for natural scene image
CN114820515A (en) * 2022-04-26 2022-07-29 渭南日报社印刷厂 Non-reference image quality evaluation method based on channel attention
CN114821074A (en) * 2022-07-01 2022-07-29 湖南盛鼎科技发展有限责任公司 Airborne LiDAR point cloud semantic segmentation method, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
WO2023056889A1 (en) Model training and scene recognition method and apparatus, device, and medium
CN112232349B (en) Model training method, image segmentation method and device
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN114255456A (en) Natural scene text detection method and system based on attention mechanism feature fusion and enhancement
CN111199522A (en) Single-image blind motion blur removing method for generating countermeasure network based on multi-scale residual errors
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN111915530A (en) End-to-end-based haze concentration self-adaptive neural network image defogging method
CN112241939B (en) Multi-scale and non-local-based light rain removal method
CN114764868A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
CN114255474A (en) Pedestrian re-identification method based on multi-scale and multi-granularity
CN112766056A (en) Method and device for detecting lane line in low-light environment based on deep neural network
CN116645598A (en) Remote sensing image semantic segmentation method based on channel attention feature fusion
CN110728238A (en) Personnel re-detection method of fusion type neural network
CN114155165A (en) Image defogging method based on semi-supervision
Ren et al. A lightweight object detection network in low-light conditions based on depthwise separable pyramid network and attention mechanism on embedded platforms
CN113177956A (en) Semantic segmentation method for unmanned aerial vehicle remote sensing image
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
CN115641445B (en) Remote sensing image shadow detection method integrating asymmetric inner convolution and Transformer
CN114283431B (en) Text detection method based on differentiable binarization
CN113627368B (en) Video behavior recognition method based on deep learning
CN113222016B (en) Change detection method and device based on cross enhancement of high-level and low-level features
CN113256528B (en) Low-illumination video enhancement method based on multi-scale cascade depth residual error network
Tao et al. An accurate low-light object detection method based on pyramid networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination