CN114387610A - Method for detecting arbitrary-shape scene text based on enhanced feature pyramid network - Google Patents

Method for detecting arbitrary-shape scene text based on enhanced feature pyramid network

Info

Publication number
CN114387610A
CN114387610A
Authority
CN
China
Prior art keywords
module
feature
text
enhanced
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210042376.3A
Other languages
Chinese (zh)
Inventor
谭钦红
江一峰
黄�俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210042376.3A priority Critical patent/CN114387610A/en
Publication of CN114387610A publication Critical patent/CN114387610A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network, which comprises the following modules: a feature extraction module for extracting features from the input image; a ratio-invariant feature enhancement module for enhancing semantic information; a spatial resolution reconstruction module for enhancing spatial information; a feature fusion module for fusing the semantically enhanced features with the spatially enhanced features to generate segmentation results at several different scales; and a progressive expansion module, serving as a post-processing module, which applies a progressive scale expansion algorithm to gradually expand and merge the multi-scale segmentation results generated by the feature fusion module into the final text detection result. By fusing the semantically enhanced features with the spatially enhanced features, the text detection model gains a deeper understanding of the input image and the detection accuracy is improved; the post-processing module expands the segmentation maps of different scales in turn from the smallest to the largest, which effectively predicts the true shape of scene text and cleanly separates closely spaced text instances, thereby achieving detection of scene text of arbitrary shape.

Description

Method for detecting arbitrary-shape scene text based on enhanced feature pyramid network
Technical Field
The invention relates to the field of image processing, and in particular to a method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network.
Background
With the rapid development of the economy and society and the wide adoption of intelligent terminals, the channels through which people perceive the outside world have become increasingly diverse. As a carrier of information, images have become an important channel through which people obtain information in daily life. Unlike the visual elements of general images, text in natural scene images carries rich semantic information and can help people analyze and understand the deeper information those images contain. Scene text detection is therefore gradually being applied in production and daily life, and plays a major role in fields such as intelligent transportation systems, office automation, and visual assistance.
Text in natural scenes exhibits great randomness and diversity: it may run in the conventional horizontal or vertical direction, in a somewhat more complicated oblique direction, or take an even more complicated curved or irregular shape. Moreover, because scene images are affected during acquisition by objective factors such as illumination conditions and shooting angle, detecting text in natural scenes by machine vision remains a very challenging task.
Early natural scene text detection methods relied mainly on hand-crafted features and prior information about text, such as texture, color, or stroke width. Such methods can be roughly divided into two types: methods based on connected-component analysis and methods based on sliding windows. Connected-component methods first preprocess the input image with digital image processing techniques such as edge extraction to obtain candidate text regions, then apply various connected-component analysis methods to refine and partition those regions, locating characters and linking them into text. Sliding-window methods represent candidate regions with hand-crafted features and train a classifier on those features to predict and verify the candidates. Both types of method perform well on scene text with simple backgrounds and regular shapes, but they depend too heavily on manually designed features and cannot cope effectively with text in complex and varied scene images.
In recent years, the successful application of deep learning, in particular deep convolutional neural networks, in computer vision has driven the development of natural scene text detection. These methods usually train a network model on a specific dataset to automatically extract the basic features of the input image, and then obtain the final text regions through a series of post-processing algorithms. Compared with traditional scene text detection algorithms, they effectively avoid the limitations of hand-crafted features. Current deep-learning scene text detection methods are mainly segmentation-based or regression-based. Segmentation-based methods generally segment the text from the image and then apply thresholding to obtain the bounding boxes of text regions. Regression-based methods directly regress the bounding boxes of text regions and are faster, but their results on long text and on irregular scene text such as curved text are still unsatisfactory, which limits the application of scene text detection in real life.
Disclosure of Invention
The invention provides a method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network, addressing the problem that deep-learning scene text detection methods perform poorly on long text and on irregular scene text such as curved text. The method specifically comprises the following modules:
a feature extraction module for extracting features from the input image;
a ratio-invariant feature enhancement module for enhancing semantic information;
a spatial resolution reconstruction module for enhancing spatial information;
a feature fusion module for fusing the semantically enhanced features with the spatially enhanced features to generate segmentation results at several different scales;
and a progressive expansion module, serving as a post-processing module, which applies a progressive scale expansion algorithm to gradually expand and merge the multi-scale segmentation results generated by the feature fusion module into the final text detection result.
The feature extraction module uses ResNet50 as the backbone network to extract the original features {C2, C3, C4, C5} of the input image.
The ratio-invariant feature enhancement module processes the high-level semantic feature map C5 with 3 parallel branches and fuses the outputs of the 3 branches to enhance the high-level semantic information.
The spatial resolution reconstruction module uses 1 × 1 convolutions to adjust the number of channels of the original features {C2, C3, C4, C5} acquired by the feature extraction module to 256, upsamples {C3, C4, C5} to the same resolution as C2, and forms new features {R2, R3, R4, R5}.
The feature fusion module adds the multi-layer feature maps with reconstructed spatial resolution to the upsampled features of the corresponding levels of the enhanced feature pyramid structure to obtain {P2, P3, P4, P5}, fuses them into a fusion feature P, and uses P to generate n different segmentation results S1, S2, …, Sn.
The progressive expansion module uses a progressive scale expansion algorithm to expand the n segmentation results S1, S2, …, Sn produced by the feature fusion module in turn from the smallest to the largest, obtaining the final text prediction result.
In the method, the ratio-invariant feature enhancement module enhances the high-level semantic information extracted by the feature extraction module, forming an enhanced feature pyramid structure; the spatial resolution reconstruction module enhances the spatial information of the original features extracted by the feature extraction module; by fusing the semantically enhanced features with the spatially enhanced features, the text detection model's understanding of the input image is deepened and the text detection accuracy is improved. The post-processing module applies a progressive scale expansion algorithm to expand the segmentation maps of different scales in turn from small to large, effectively predicting the true shape of scene text and cleanly separating closely spaced text instances, so that the disclosed method can detect scene text of arbitrary shape.
Drawings
FIG. 1 is a structure diagram of the text detection model of the method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network;
FIG. 2 is a schematic diagram of a ratio invariant feature enhancement module according to the present invention;
Detailed Description
FIG. 1 shows the structure of the text detection model of the method of the invention. The method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network specifically comprises the following modules:
a feature extraction module for extracting features from the input image;
Specifically, the feature extraction module uses ResNet50 as the backbone network to extract the original features {C2, C3, C4, C5} of the input image.
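As a concrete illustration of the tensor shapes involved, the sketch below builds a toy pyramid using ResNet50's standard stage widths (256/512/1024/2048 channels at strides 4/8/16/32). The zero-filled arrays merely stand in for real convolutional activations; this is an assumption-laden shape sketch, not the patent's implementation.

```python
import numpy as np

# Sketch of the C2..C5 pyramid a ResNet50 backbone produces.  The stage
# widths (256/512/1024/2048) and strides (4/8/16/32) are standard ResNet50
# values; zero arrays stand in for the actual activations.
def extract_pyramid(image):
    """image: (H, W, 3) array -> dict with C2..C5 placeholder feature maps."""
    h, w, _ = image.shape
    stage_channels = {2: 256, 3: 512, 4: 1024, 5: 2048}
    pyramid = {}
    for level, channels in stage_channels.items():
        stride = 2 ** level                      # C2 -> 4, ..., C5 -> 32
        pyramid[f"C{level}"] = np.zeros((h // stride, w // stride, channels))
    return pyramid

feats = extract_pyramid(np.zeros((640, 640, 3)))
```

For a 640 × 640 input this yields C2 of size 160 × 160 × 256 down to C5 of size 20 × 20 × 2048.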
a ratio-invariant feature enhancement module for enhancing semantic information;
Specifically, to reduce the influence of complex backgrounds on text detection, the ratio-invariant feature enhancement module processes the high-level semantic feature map C5 from the feature extraction module with three parallel branches, adds the output features of the parallel branches directly, and applies a ReLU activation to enhance the high-level semantic information; the specific structure is shown in FIG. 2.
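The add-then-ReLU fusion of the parallel branches can be sketched as follows. The excerpt does not specify what each branch computes, so each branch here is a hypothetical per-pixel linear map (a 1 × 1 convolution); only the "sum the branch outputs and apply ReLU" structure comes from the text.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def branch(x, w):
    # Hypothetical branch body: a per-pixel linear map (1x1 convolution).
    # The actual branch operations are not detailed in this excerpt.
    return x @ w

def enhance_c5(c5, branch_weights):
    """Add the outputs of the parallel branches, then apply ReLU."""
    return relu(sum(branch(c5, w) for w in branch_weights))

rng = np.random.default_rng(0)
c5 = rng.standard_normal((4, 4, 8))                        # toy C5: 4x4, 8 channels
weights = [rng.standard_normal((8, 8)) for _ in range(3)]  # 3 parallel branches
m5 = enhance_c5(c5, weights)
```

The output keeps C5's shape, and the ReLU guarantees a non-negative enhanced map.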
a spatial resolution reconstruction module for enhancing spatial information;
Specifically, the spatial resolution reconstruction module reconstructs the spatial resolution of the original features {C2, C3, C4, C5} acquired by the feature extraction module as follows: first, 1 × 1 convolutions adjust the number of channels of {C2, C3, C4, C5} to 256; then {C3, C4, C5} are upsampled to the same resolution as C2, yielding the spatially enhanced features {R2, R3, R4, R5}. The fine spatial information of the input image is thus fully utilized, reducing the influence of invalid context information on text region localization and improving its accuracy.
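The two operations — a 1 × 1 convolution to equalize channel counts and upsampling to C2's resolution — can be sketched in NumPy. The patent projects to 256 channels; the toy example below uses 6 for brevity, and nearest-neighbour interpolation is an assumption since the excerpt does not name the upsampling method.

```python
import numpy as np

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    return x @ w

def upsample_nearest(x, factor):
    # Nearest-neighbour upsampling by an integer factor on both spatial axes.
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def reconstruct_resolution(pyramid, weights):
    """Project C2..C5 to a common channel count, then upsample C3..C5 to
    C2's spatial size, producing R2..R5."""
    h2 = pyramid["C2"].shape[0]
    feats = {}
    for level in range(2, 6):
        x = conv1x1(pyramid[f"C{level}"], weights[level])
        factor = h2 // x.shape[0]
        if factor > 1:
            x = upsample_nearest(x, factor)
        feats[f"R{level}"] = x
    return feats

rng = np.random.default_rng(1)
sizes = {2: (16, 4), 3: (8, 8), 4: (4, 16), 5: (2, 32)}   # (spatial, channels)
pyramid = {f"C{l}": rng.standard_normal((s, s, c)) for l, (s, c) in sizes.items()}
weights = {l: rng.standard_normal((c, 6)) for l, (_, c) in sizes.items()}
rfeats = reconstruct_resolution(pyramid, weights)
```

All four outputs share C2's spatial size and a common channel count, so they can later be added level-by-level to the pyramid features.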
a feature fusion module for fusing the semantically enhanced features with the spatially enhanced features to generate segmentation results at several different scales;
Specifically, the top-level feature C5 is processed by the ratio-invariant feature enhancement module to produce the semantically enhanced feature M5; the number of channels of the original features {C2, C3, C4, C5} extracted by the feature extraction module is adjusted with 1 × 1 convolutions for lateral connection, and top-down information fusion builds the enhanced feature pyramid structure {M2, M3, M4, M5}. After upsampling, the enhanced feature pyramid {M2, M3, M4, M5} is added to the corresponding levels of the spatially reconstructed multi-layer feature maps {R2, R3, R4, R5} to form the fused features {P2, P3, P4, P5}. A channel attention mechanism fuses {P2, P3, P4, P5} into a fusion feature P, from which n different segmentation results S1, S2, …, Sn are generated.
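The final fusion step names a channel attention mechanism without detailing it; the sketch below uses an assumed squeeze-style gate (global average pooling followed by a sigmoid) over the concatenated P2..P5 features, purely for illustration of the idea.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention_fuse(p_levels):
    """Concatenate P2..P5 along the channel axis and reweight each channel
    with an assumed squeeze-style gate (global average pool -> sigmoid)."""
    x = np.concatenate(p_levels, axis=-1)      # (H, W, total channels)
    gate = sigmoid(x.mean(axis=(0, 1)))        # one weight per channel
    return x * gate                            # fused feature P

p_levels = [np.ones((4, 4, 2)) * i for i in range(1, 5)]  # toy P2..P5
fused = channel_attention_fuse(p_levels)
```

Each channel is scaled by a data-dependent weight in (0, 1), so informative channels can be emphasized before the segmentation heads read from P.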
and a progressive expansion module, serving as a post-processing module, which applies a progressive scale expansion algorithm to gradually expand and merge the multi-scale segmentation results generated by the feature fusion module into the final text detection result.
Specifically, the progressive expansion module uses a progressive scale expansion algorithm to expand the n segmentation results S1, S2, …, Sn from the feature fusion module in turn from the smallest to the largest, obtaining the final text prediction result.
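The expansion itself can be sketched in plain Python/NumPy as a breadth-first flood fill: seeds are the connected components of the smallest segmentation map, and each round grows those labels into the next larger map, so pixels are claimed by whichever instance reaches them first and nearby instances stay separated. This is a reimplementation in the spirit of the algorithm described above, not the patent's own code.

```python
import numpy as np
from collections import deque

def label_components(mask):
    # 4-connected component labelling of a binary mask via BFS.
    labels = np.zeros(mask.shape, dtype=int)
    h, w = mask.shape
    next_label = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and labels[i, j] == 0:
                next_label += 1
                labels[i, j] = next_label
                queue = deque([(i, j)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] \
                                and labels[ny, nx] == 0:
                            labels[ny, nx] = next_label
                            queue.append((ny, nx))
    return labels

def progressive_scale_expansion(kernels):
    """kernels: binary masks ordered smallest -> largest.  Seeds come from
    the smallest kernel; each round grows the labels (BFS) into the next
    kernel, so nearby text instances stay separated."""
    labels = label_components(kernels[0])
    h, w = labels.shape
    for kernel in kernels[1:]:
        queue = deque(zip(*np.nonzero(labels)))
        while queue:
            y, x = queue.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and kernel[ny, nx] \
                        and labels[ny, nx] == 0:
                    labels[ny, nx] = labels[y, x]
                    queue.append((ny, nx))
    return labels

kernels = [np.array([[1, 0, 0, 0, 0, 0, 1]], dtype=bool),
           np.array([[1, 1, 1, 0, 1, 1, 1]], dtype=bool)]
result = progressive_scale_expansion(kernels)
```

With S1 = [1,0,0,0,0,0,1] and S2 = [1,1,1,0,1,1,1], the two seeds grow outward into two separate instances — they never merge even though the larger map nearly touches.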

Claims (6)

1. A method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network, comprising the following modules:
a feature extraction module for extracting features from the input image;
a ratio-invariant feature enhancement module for enhancing semantic information;
a spatial resolution reconstruction module for enhancing spatial information;
a feature fusion module for fusing the semantically enhanced features with the spatially enhanced features to generate segmentation results at several different scales;
and a progressive expansion module, serving as a post-processing module, which applies a progressive scale expansion algorithm to gradually expand and merge the multi-scale segmentation results generated by the feature fusion module into the final text detection result.
2. The method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network as claimed in claim 1, wherein the feature extraction module uses ResNet50 as the backbone network to extract the original features {C2, C3, C4, C5} of the input image.
3. The method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network as claimed in claim 1, wherein the ratio-invariant feature enhancement module processes the high-level semantic feature map C5 obtained by the feature extraction module with 3 parallel branches and fuses the outputs of the 3 branches to enhance the high-level semantic information.
4. The method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network as claimed in claim 1, wherein the spatial resolution reconstruction module uses 1 × 1 convolutions to adjust the number of channels of the original features {C2, C3, C4, C5} acquired by the feature extraction module to 256, and upsamples {C3, C4, C5} to the same resolution as C2, forming new features {R2, R3, R4, R5}.
5. The method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network as claimed in claim 1, wherein the feature fusion module adds the multi-layer feature maps with reconstructed spatial resolution to the upsampled features of the corresponding levels of the enhanced feature pyramid structure to obtain {P2, P3, P4, P5}, fuses them into a fusion feature P, and uses P to generate n different segmentation results S1, S2, …, Sn.
6. The method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network as claimed in claim 1, wherein the progressive expansion module uses a progressive scale expansion algorithm to expand the n segmentation results S1, S2, …, Sn of the feature fusion module in turn from the smallest to the largest, obtaining the final text prediction result.
CN202210042376.3A 2022-01-14 2022-01-14 Method for detecting optional-shape scene text based on enhanced feature pyramid network Pending CN114387610A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210042376.3A CN114387610A (en) 2022-01-14 2022-01-14 Method for detecting optional-shape scene text based on enhanced feature pyramid network


Publications (1)

Publication Number Publication Date
CN114387610A (en) 2022-04-22

Family

ID=81202815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210042376.3A Pending CN114387610A (en) 2022-01-14 2022-01-14 Method for detecting optional-shape scene text based on enhanced feature pyramid network

Country Status (1)

Country Link
CN (1) CN114387610A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114758332A (en) * 2022-06-13 2022-07-15 北京万里红科技有限公司 Text detection method and device, computing equipment and storage medium
CN114758332B (en) * 2022-06-13 2022-09-02 北京万里红科技有限公司 Text detection method and device, computing equipment and storage medium
CN115052182A (en) * 2022-06-27 2022-09-13 重庆邮电大学 Ultra-high-definition video transmission system and method based on queue learning and super-resolution
CN115052182B (en) * 2022-06-27 2023-07-21 重庆邮电大学 Ultrahigh-definition video transmission system and method based on queue learning and super resolution

Similar Documents

Publication Publication Date Title
CN110322495B (en) Scene text segmentation method based on weak supervised deep learning
CN112232349B (en) Model training method, image segmentation method and device
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN109726657B (en) Deep learning scene text sequence recognition method
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN109902748A (en) A kind of image, semantic dividing method based on the full convolutional neural networks of fusion of multi-layer information
CN110059768B (en) Semantic segmentation method and system for fusion point and region feature for street view understanding
CN110781775A (en) Remote sensing image water body information accurate segmentation method supported by multi-scale features
CN111126379A (en) Target detection method and device
CN108491836B (en) Method for integrally identifying Chinese text in natural scene image
CN114387610A (en) Method for detecting optional-shape scene text based on enhanced feature pyramid network
CN112906706A (en) Improved image semantic segmentation method based on coder-decoder
CN110956681B (en) Portrait background automatic replacement method combining convolution network and neighborhood similarity
CN109299303B (en) Hand-drawn sketch retrieval method based on deformable convolution and depth network
CN113011357A (en) Depth fake face video positioning method based on space-time fusion
CN114067143A (en) Vehicle weight recognition method based on dual sub-networks
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN113808005A (en) Video-driving-based face pose migration method and device
CN113610024B (en) Multi-strategy deep learning remote sensing image small target detection method
CN111209886B (en) Rapid pedestrian re-identification method based on deep neural network
CN117036770A (en) Detection model training and target detection method and system based on cascade attention
CN110728238A (en) Personnel re-detection method of fusion type neural network
CN113554655B (en) Optical remote sensing image segmentation method and device based on multi-feature enhancement
CN109409224A (en) A kind of method of natural scene fire defector

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination