CN114387610A - Method for detecting arbitrary-shape scene text based on enhanced feature pyramid network - Google Patents

Method for detecting arbitrary-shape scene text based on enhanced feature pyramid network

Info

Publication number
CN114387610A
CN114387610A
Authority
CN
China
Prior art keywords
module
feature
text
enhanced
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210042376.3A
Other languages
Chinese (zh)
Inventor
谭钦红
江一峰
黄�俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210042376.3A priority Critical patent/CN114387610A/en
Publication of CN114387610A publication Critical patent/CN114387610A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network, which comprises the following modules: a feature extraction module for extracting features from the input image; a ratio-invariant feature enhancement module for enhancing semantic information; a spatial resolution reconstruction module for enhancing spatial information; a feature fusion module for fusing the semantically enhanced features with the spatially enhanced features to generate segmentation results at several different scales; and a progressive expansion module, serving as a post-processing module, which applies a progressive scale expansion algorithm to gradually expand and merge the multi-scale segmentation results generated by the feature fusion module into the final text detection result. By fusing the semantically enhanced features with the spatially enhanced features, the text detection model gains a deeper understanding of the input image and the detection accuracy is improved; the post-processing module expands the segmentation maps of different scales in turn from the smallest to the largest, which effectively predicts the true shape of scene text and cleanly separates closely spaced text instances, thereby achieving detection of scene text of arbitrary shape.

Description

Method for detecting arbitrary-shape scene text based on enhanced feature pyramid network
Technical Field
The invention relates to the field of image processing, and in particular to a method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network.
Background
With the rapid development of the economy and society and the wide adoption of intelligent terminals, the channels through which people perceive the outside world have become increasingly diverse. As a carrier of information, images have become an important channel through which people obtain information in daily life. Unlike the visual elements of general images, text in natural scene images carries rich semantic information and can help people analyze and understand the deeper information those images contain. Scene text detection is therefore gradually being applied in production and daily life, and plays a major role in fields such as intelligent transportation systems, office automation, and visual assistance.
Text in natural scenes exhibits great randomness and diversity: it may run in the conventional horizontal or vertical direction, in a somewhat more complicated oblique direction, or take an even more complicated curved or irregular shape. Moreover, because scene images are affected during acquisition by objective factors such as illumination conditions and shooting angle, detecting text in natural scenes by machine vision remains a very challenging task.
Early natural scene text detection methods relied mainly on hand-crafted features and prior information about text, such as texture, color, or stroke width. Such methods can be roughly divided into two types: methods based on connected-component analysis and methods based on sliding windows. Connected-component methods first preprocess the input image with digital image processing techniques such as edge extraction to obtain candidate text regions, then apply various connected-component analysis methods to refine and partition those regions, locating characters and linking them into text. Sliding-window methods represent candidate regions with hand-crafted features and train a classifier on those features to predict and verify the candidates. Both types of method perform well on scene text with simple backgrounds and regular shapes, but they depend too heavily on manually designed features and cannot cope effectively with text in complex and varied scene images.
In recent years, the successful application of deep learning, in particular deep convolutional neural networks, in computer vision has driven the development of natural scene text detection. These methods usually train a network model on a specific dataset to automatically extract the basic features of the input image, and then obtain the final text regions through a series of post-processing algorithms. Compared with traditional scene text detection algorithms, they effectively avoid the limitations of hand-crafted features. Current deep-learning scene text detection methods are mainly segmentation-based or regression-based. Segmentation-based methods generally segment the text from the image and then apply thresholding to obtain the bounding boxes of text regions. Regression-based methods directly regress the bounding boxes of text regions and are faster, but their results on long text and on irregular scene text such as curved text are still unsatisfactory, which limits the application of scene text detection in real life.
Disclosure of Invention
The invention provides a method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network, addressing the problem that deep-learning scene text detection methods perform poorly on long text and on irregular scene text such as curved text. The method specifically comprises the following modules:
a feature extraction module for extracting features from the input image;
a ratio-invariant feature enhancement module for enhancing semantic information;
a spatial resolution reconstruction module for enhancing spatial information;
a feature fusion module for fusing the semantically enhanced features with the spatially enhanced features to generate segmentation results at several different scales;
and a progressive expansion module, serving as a post-processing module, which applies a progressive scale expansion algorithm to gradually expand and merge the multi-scale segmentation results generated by the feature fusion module into the final text detection result.
The feature extraction module uses ResNet50 as the backbone network to extract the original features {C2, C3, C4, C5} of the input image.
The ratio-invariant feature enhancement module processes the high-level semantic feature map C5 with 3 parallel branches and fuses the outputs of the 3 branches to enhance the high-level semantic information.
The spatial resolution reconstruction module uses 1 × 1 convolutions to adjust the number of channels of the original features {C2, C3, C4, C5} acquired by the feature extraction module to 256, upsamples {C3, C4, C5} to the same resolution as C2, and forms new features {R2, R3, R4, R5}.
The feature fusion module adds the multi-layer feature maps with reconstructed spatial resolution to the upsampled features of the corresponding levels of the enhanced feature pyramid structure to obtain {P2, P3, P4, P5}, fuses them into a fusion feature P, and uses P to generate n different segmentation results S1, S2, …, Sn.
The progressive expansion module uses a progressive scale expansion algorithm to expand the n segmentation results S1, S2, …, Sn produced by the feature fusion module in turn from the smallest to the largest, obtaining the final text prediction result.
In the method, the ratio-invariant feature enhancement module enhances the high-level semantic information extracted by the feature extraction module, forming an enhanced feature pyramid structure; the spatial resolution reconstruction module enhances the spatial information of the original features extracted by the feature extraction module; by fusing the semantically enhanced features with the spatially enhanced features, the text detection model's understanding of the input image is deepened and the text detection accuracy is improved. The post-processing module applies a progressive scale expansion algorithm to expand the segmentation maps of different scales in turn from small to large, effectively predicting the true shape of scene text and cleanly separating closely spaced text instances, so that the disclosed method can detect scene text of arbitrary shape.
Drawings
FIG. 1 is a structure diagram of the text detection model of the method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network;
FIG. 2 is a schematic diagram of a ratio invariant feature enhancement module according to the present invention;
Detailed Description
FIG. 1 shows the structure of the text detection model of the method of the invention. The method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network specifically comprises the following modules:
a feature extraction module for extracting features from the input image;
Specifically, the feature extraction module uses ResNet50 as the backbone network to extract the original features {C2, C3, C4, C5} of the input image.
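As a concrete illustration of the tensor shapes involved, the sketch below builds a toy pyramid using ResNet50's standard stage widths (256/512/1024/2048 channels at strides 4/8/16/32). The zero-filled arrays merely stand in for real convolutional activations; this is an assumption-laden shape sketch, not the patent's implementation.

```python
import numpy as np

# Sketch of the C2..C5 pyramid a ResNet50 backbone produces.  The stage
# widths (256/512/1024/2048) and strides (4/8/16/32) are standard ResNet50
# values; zero arrays stand in for the actual activations.
def extract_pyramid(image):
    """image: (H, W, 3) array -> dict with C2..C5 placeholder feature maps."""
    h, w, _ = image.shape
    stage_channels = {2: 256, 3: 512, 4: 1024, 5: 2048}
    pyramid = {}
    for level, channels in stage_channels.items():
        stride = 2 ** level                      # C2 -> 4, ..., C5 -> 32
        pyramid[f"C{level}"] = np.zeros((h // stride, w // stride, channels))
    return pyramid

feats = extract_pyramid(np.zeros((640, 640, 3)))
```

For a 640 × 640 input this yields C2 of size 160 × 160 × 256 down to C5 of size 20 × 20 × 2048.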
a ratio-invariant feature enhancement module for enhancing semantic information;
Specifically, to reduce the influence of complex backgrounds on text detection, the ratio-invariant feature enhancement module processes the high-level semantic feature map C5 from the feature extraction module with three parallel branches, adds the output features of the parallel branches directly, and applies a ReLU activation to enhance the high-level semantic information; the specific structure is shown in FIG. 2.
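The add-then-ReLU fusion of the parallel branches can be sketched as follows. The excerpt does not specify what each branch computes, so each branch here is a hypothetical per-pixel linear map (a 1 × 1 convolution); only the "sum the branch outputs and apply ReLU" structure comes from the text.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def branch(x, w):
    # Hypothetical branch body: a per-pixel linear map (1x1 convolution).
    # The actual branch operations are not detailed in this excerpt.
    return x @ w

def enhance_c5(c5, branch_weights):
    """Add the outputs of the parallel branches, then apply ReLU."""
    return relu(sum(branch(c5, w) for w in branch_weights))

rng = np.random.default_rng(0)
c5 = rng.standard_normal((4, 4, 8))                        # toy C5: 4x4, 8 channels
weights = [rng.standard_normal((8, 8)) for _ in range(3)]  # 3 parallel branches
m5 = enhance_c5(c5, weights)
```

The output keeps C5's shape, and the ReLU guarantees a non-negative enhanced map.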
a spatial resolution reconstruction module for enhancing spatial information;
Specifically, the spatial resolution reconstruction module reconstructs the spatial resolution of the original features {C2, C3, C4, C5} acquired by the feature extraction module as follows: first, 1 × 1 convolutions adjust the number of channels of {C2, C3, C4, C5} to 256; then {C3, C4, C5} are upsampled to the same resolution as C2, yielding the spatially enhanced features {R2, R3, R4, R5}. The fine spatial information of the input image is thus fully utilized, reducing the influence of invalid context information on text region localization and improving its accuracy.
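The two operations — a 1 × 1 convolution to equalize channel counts and upsampling to C2's resolution — can be sketched in NumPy. The patent projects to 256 channels; the toy example below uses 6 for brevity, and nearest-neighbour interpolation is an assumption since the excerpt does not name the upsampling method.

```python
import numpy as np

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    return x @ w

def upsample_nearest(x, factor):
    # Nearest-neighbour upsampling by an integer factor on both spatial axes.
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def reconstruct_resolution(pyramid, weights):
    """Project C2..C5 to a common channel count, then upsample C3..C5 to
    C2's spatial size, producing R2..R5."""
    h2 = pyramid["C2"].shape[0]
    feats = {}
    for level in range(2, 6):
        x = conv1x1(pyramid[f"C{level}"], weights[level])
        factor = h2 // x.shape[0]
        if factor > 1:
            x = upsample_nearest(x, factor)
        feats[f"R{level}"] = x
    return feats

rng = np.random.default_rng(1)
sizes = {2: (16, 4), 3: (8, 8), 4: (4, 16), 5: (2, 32)}   # (spatial, channels)
pyramid = {f"C{l}": rng.standard_normal((s, s, c)) for l, (s, c) in sizes.items()}
weights = {l: rng.standard_normal((c, 6)) for l, (_, c) in sizes.items()}
rfeats = reconstruct_resolution(pyramid, weights)
```

All four outputs share C2's spatial size and a common channel count, so they can later be added level-by-level to the pyramid features.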
a feature fusion module for fusing the semantically enhanced features with the spatially enhanced features to generate segmentation results at several different scales;
Specifically, the top-level feature C5 is processed by the ratio-invariant feature enhancement module to produce the semantically enhanced feature M5; the number of channels of the original features {C2, C3, C4, C5} extracted by the feature extraction module is adjusted with 1 × 1 convolutions for lateral connection, and top-down information fusion builds the enhanced feature pyramid structure {M2, M3, M4, M5}. After upsampling, the enhanced feature pyramid {M2, M3, M4, M5} is added to the corresponding levels of the spatially reconstructed multi-layer feature maps {R2, R3, R4, R5} to form the fused features {P2, P3, P4, P5}. A channel attention mechanism fuses {P2, P3, P4, P5} into a fusion feature P, from which n different segmentation results S1, S2, …, Sn are generated.
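The final fusion step names a channel attention mechanism without detailing it; the sketch below uses an assumed squeeze-style gate (global average pooling followed by a sigmoid) over the concatenated P2..P5 features, purely for illustration of the idea.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention_fuse(p_levels):
    """Concatenate P2..P5 along the channel axis and reweight each channel
    with an assumed squeeze-style gate (global average pool -> sigmoid)."""
    x = np.concatenate(p_levels, axis=-1)      # (H, W, total channels)
    gate = sigmoid(x.mean(axis=(0, 1)))        # one weight per channel
    return x * gate                            # fused feature P

p_levels = [np.ones((4, 4, 2)) * i for i in range(1, 5)]  # toy P2..P5
fused = channel_attention_fuse(p_levels)
```

Each channel is scaled by a data-dependent weight in (0, 1), so informative channels can be emphasized before the segmentation heads read from P.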
and a progressive expansion module, serving as a post-processing module, which applies a progressive scale expansion algorithm to gradually expand and merge the multi-scale segmentation results generated by the feature fusion module into the final text detection result.
Specifically, the progressive expansion module uses a progressive scale expansion algorithm to expand the n segmentation results S1, S2, …, Sn from the feature fusion module in turn from the smallest to the largest, obtaining the final text prediction result.
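The expansion itself can be sketched in plain Python/NumPy as a breadth-first flood fill: seeds are the connected components of the smallest segmentation map, and each round grows those labels into the next larger map, so pixels are claimed by whichever instance reaches them first and nearby instances stay separated. This is a reimplementation in the spirit of the algorithm described above, not the patent's own code.

```python
import numpy as np
from collections import deque

def label_components(mask):
    # 4-connected component labelling of a binary mask via BFS.
    labels = np.zeros(mask.shape, dtype=int)
    h, w = mask.shape
    next_label = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and labels[i, j] == 0:
                next_label += 1
                labels[i, j] = next_label
                queue = deque([(i, j)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] \
                                and labels[ny, nx] == 0:
                            labels[ny, nx] = next_label
                            queue.append((ny, nx))
    return labels

def progressive_scale_expansion(kernels):
    """kernels: binary masks ordered smallest -> largest.  Seeds come from
    the smallest kernel; each round grows the labels (BFS) into the next
    kernel, so nearby text instances stay separated."""
    labels = label_components(kernels[0])
    h, w = labels.shape
    for kernel in kernels[1:]:
        queue = deque(zip(*np.nonzero(labels)))
        while queue:
            y, x = queue.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and kernel[ny, nx] \
                        and labels[ny, nx] == 0:
                    labels[ny, nx] = labels[y, x]
                    queue.append((ny, nx))
    return labels

kernels = [np.array([[1, 0, 0, 0, 0, 0, 1]], dtype=bool),
           np.array([[1, 1, 1, 0, 1, 1, 1]], dtype=bool)]
result = progressive_scale_expansion(kernels)
```

With S1 = [1,0,0,0,0,0,1] and S2 = [1,1,1,0,1,1,1], the two seeds grow outward into two separate instances — they never merge even though the larger map nearly touches.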

Claims (6)

1. A method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network, comprising the following modules:
a feature extraction module for extracting features from the input image;
a ratio-invariant feature enhancement module for enhancing semantic information;
a spatial resolution reconstruction module for enhancing spatial information;
a feature fusion module for fusing the semantically enhanced features with the spatially enhanced features to generate segmentation results at several different scales;
and a progressive expansion module, serving as a post-processing module, which applies a progressive scale expansion algorithm to gradually expand and merge the multi-scale segmentation results generated by the feature fusion module into the final text detection result.
2. The method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network as claimed in claim 1, wherein the feature extraction module uses ResNet50 as the backbone network to extract the original features {C2, C3, C4, C5} of the input image.
3. The method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network as claimed in claim 1, wherein the ratio-invariant feature enhancement module processes the high-level semantic feature map C5 obtained by the feature extraction module with 3 parallel branches and fuses the outputs of the 3 branches to enhance the high-level semantic information.
4. The method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network as claimed in claim 1, wherein the spatial resolution reconstruction module uses 1 × 1 convolutions to adjust the number of channels of the original features {C2, C3, C4, C5} acquired by the feature extraction module to 256, and upsamples {C3, C4, C5} to the same resolution as C2, forming new features {R2, R3, R4, R5}.
5. The method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network as claimed in claim 1, wherein the feature fusion module adds the multi-layer feature maps with reconstructed spatial resolution to the upsampled features of the corresponding levels of the enhanced feature pyramid structure to obtain {P2, P3, P4, P5}, fuses them into a fusion feature P, and uses P to generate n different segmentation results S1, S2, …, Sn.
6. The method for detecting arbitrary-shape scene text based on an enhanced feature pyramid network as claimed in claim 1, wherein the progressive expansion module uses a progressive scale expansion algorithm to expand the n segmentation results S1, S2, …, Sn of the feature fusion module in turn from the smallest to the largest, obtaining the final text prediction result.
CN202210042376.3A 2022-01-14 2022-01-14 Method for detecting optional-shape scene text based on enhanced feature pyramid network Pending CN114387610A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210042376.3A CN114387610A (en) 2022-01-14 2022-01-14 Method for detecting optional-shape scene text based on enhanced feature pyramid network


Publications (1)

Publication Number Publication Date
CN114387610A (en) 2022-04-22

Family

ID=81202815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210042376.3A Pending CN114387610A (en) 2022-01-14 2022-01-14 Method for detecting optional-shape scene text based on enhanced feature pyramid network

Country Status (1)

Country Link
CN (1) CN114387610A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114758332A (en) * 2022-06-13 2022-07-15 北京万里红科技有限公司 Text detection method and device, computing equipment and storage medium
CN114758332B (en) * 2022-06-13 2022-09-02 北京万里红科技有限公司 Text detection method and device, computing equipment and storage medium
CN115052182A (en) * 2022-06-27 2022-09-13 重庆邮电大学 Ultra-high-definition video transmission system and method based on queue learning and super-resolution
CN115052182B (en) * 2022-06-27 2023-07-21 重庆邮电大学 Ultrahigh-definition video transmission system and method based on queue learning and super resolution

Similar Documents

Publication Publication Date Title
CN110322495B (en) Scene text segmentation method based on weak supervised deep learning
CN112232349B (en) Model training method, image segmentation method and device
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN109726657B (en) Deep learning scene text sequence recognition method
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN109902748A (en) A kind of image, semantic dividing method based on the full convolutional neural networks of fusion of multi-layer information
CN110059768B (en) Semantic segmentation method and system for fusion point and region feature for street view understanding
CN110781775A (en) Remote sensing image water body information accurate segmentation method supported by multi-scale features
CN111126379A (en) Target detection method and device
CN108491836B (en) Method for integrally identifying Chinese text in natural scene image
CN114387610A (en) Method for detecting optional-shape scene text based on enhanced feature pyramid network
CN112906706A (en) Improved image semantic segmentation method based on coder-decoder
CN110956681B (en) Portrait background automatic replacement method combining convolution network and neighborhood similarity
CN109299303B (en) Hand-drawn sketch retrieval method based on deformable convolution and depth network
CN113011357A (en) Depth fake face video positioning method based on space-time fusion
CN114067143A (en) Vehicle weight recognition method based on dual sub-networks
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN113808005A (en) Video-driving-based face pose migration method and device
CN113610024B (en) Multi-strategy deep learning remote sensing image small target detection method
CN111209886B (en) Rapid pedestrian re-identification method based on deep neural network
CN117036770A (en) Detection model training and target detection method and system based on cascade attention
CN110728238A (en) Personnel re-detection method of fusion type neural network
CN113554655B (en) Optical remote sensing image segmentation method and device based on multi-feature enhancement
CN109409224A (en) A kind of method of natural scene fire defector

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination