CN114529894A - Rapid scene text detection method fusing hole convolution - Google Patents

Rapid scene text detection method fusing hole convolution

Info

Publication number
CN114529894A
CN114529894A (application CN202210046573.2A)
Authority
CN
China
Prior art keywords
convolution
text detection
module
scene text
text
Prior art date
Legal status
Pending
Application number
CN202210046573.2A
Other languages
Chinese (zh)
Inventor
谭钦红
江一峰
黄俊
Current Assignee
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date: 2022-01-14
Filing date: 2022-01-14
Publication date: 2022-05-24
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202210046573.2A
Publication of CN114529894A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks


Abstract

The invention discloses a rapid scene text detection method fusing hole convolution. The method comprises: acquiring a text detection training data set and performing label generation on the training data set; establishing a preliminary rapid scene text detection model fusing hole convolution, the preliminary model comprising a lightweight feature extraction module, a hole convolution module and a differentiable binarization module; training the preliminary model on the label-generated training data set, using a loss function to calculate loss values and adjust the parameters of the preliminary model, thereby obtaining the rapid scene text detection model fusing hole convolution; and detecting the text in scene images using the obtained model. The invention replaces a large backbone network with a lightweight neural network for extracting features from the input image, which reduces the number of network model parameters and effectively improves the efficiency of the text detection network model. To compensate for the insufficient features extracted by the lightweight neural network, a hole convolution module is added to the feature fusion layer, and a channel attention mechanism is used to fuse and screen the features, improving the utilization of features within the network. The method thus achieves rapid scene text detection while maintaining a high level of text detection accuracy.

Description

Rapid scene text detection method fusing hole convolution
Technical Field
The invention relates to the field of image processing, in particular to a rapid scene text detection method fusing hole convolution (dilated convolution).
Background
With the rapid development of the economy and society and the widespread adoption of intelligent terminals, the channels through which people acquire information have become increasingly diverse, and images and videos have become the main media of information dissemination. Unlike ordinary visual elements in images, the text appearing in an image is important content that often plays a crucial role in conveying visual information. Detecting and recognizing the text appearing in images helps people analyze and understand the deeper information contained in scene images, which promotes the development and application of technologies such as image search, autonomous driving and office automation, and provides great convenience for people's work and daily life.
Text in natural scenes has many characteristics that hinder detection, such as complex backgrounds, large variations in text scale, diverse text shapes and indistinct text edges; accurately detecting text in natural scene images by machine therefore remains a very challenging task.
Traditional scene text detection methods rely on hand-crafted features: traditional feature extraction is performed on the input image with methods such as LBP, DPM and HOG, and the extracted features are then classified by a specific classifier or by heuristic rules. The two most typical categories are connected-component-based methods and sliding-window-based methods. Because these methods depend heavily on hand-designed features, they cannot effectively cope with variations in objective factors such as illumination intensity, image quality and text background, so their robustness is poor.
In recent years, the successful application of deep learning, in particular deep convolutional neural networks, in computer vision has advanced research on natural scene text detection. Such methods usually train a network model based on a deep convolutional neural network on a specific data set to automatically extract the basic features of the input image, and then obtain the final text regions through a series of post-processing algorithms. Compared with traditional scene text detection algorithms, this approach effectively avoids the limitations of hand-designed features. At present, deep-learning-based scene text detection algorithms generally adopt a large deep neural network as the backbone for feature extraction; the detection accuracy is remarkable, but the detection model is huge, which hinders porting of the algorithm. In practical applications, text detection in natural scenes must consider detection efficiency as well as accuracy, and an oversized scene text detection model inevitably detects slowly, which limits the application of such algorithms in real life.
Disclosure of Invention
Current deep-learning-based scene text detection methods generally adopt a large deep neural network as the backbone for feature extraction, and the huge text detection model makes detection too slow, which hinders the practical application of scene text detection algorithms. To address this problem, the invention provides a rapid scene text detection method fusing hole convolution, which comprises the following steps:
S1, acquiring a text detection training data set, and performing label generation on the training data set;
S2, establishing a preliminary rapid scene text detection model fusing hole convolution; the preliminary model comprises a lightweight feature extraction module, a hole convolution module and a differentiable binarization module;
S3, training the preliminary model established in step S2 on the training data set with labels generated in step S1, using a loss function to calculate loss values and train the preliminary model, to obtain the rapid scene text detection model fusing hole convolution;
and S4, detecting the text in scene images using the rapid scene text detection model fusing hole convolution obtained in step S3.
In step S1, a text detection training data set is acquired and label generation is performed on it; specifically, labels are generated from the original annotations of the public data sets ICDAR2015 and CTW1500.
The preliminary rapid scene text detection model fusing hole convolution in step S2 comprises a lightweight feature extraction module, a hole convolution module and a differentiable binarization module. Specifically, the lightweight feature extraction module adopts EfficientNet-b3 as the backbone network to extract features from the input image and construct a feature pyramid structure; a hole convolution module is added to the feature fusion layer; and the two parts of features are fused and then connected to the differentiable binarization module.
In step S3, the preliminary model established in step S2 is trained on the training data set with labels generated in step S1, and a loss function is used to calculate loss values and train the preliminary model to obtain the rapid scene text detection model fusing hole convolution. Specifically, the text detection model is obtained by training with the following steps:
S3.1, the label-generated text images are input to the lightweight backbone network EfficientNet-b3, and the feature maps of the first to fifth stages are extracted to construct a feature pyramid structure;
S3.2, the hole convolution module processes the 1/16-scale feature map extracted in step S3.1 with hole convolutions at hole rates of 1, 6, 12 and 18, respectively, to obtain hole convolution features;
S3.3, the feature fusion layer fuses the features generated in steps S3.1 and S3.2, and a channel attention mechanism is used to fuse and screen the features;
S3.4, the probability map (P) and the threshold map (T) are predicted from the fused feature map generated in step S3.3; the differentiable binarization module combines the probability map and the threshold map to obtain an approximate binary map (B), adaptively predicting the threshold at each position in the image; in the inference stage, the bounding boxes of the text regions are obtained from the approximate binary map B;
S3.5, when the probability map P and the threshold map T are predicted from the fused feature map in step S3.4, the following formula is adopted as the prediction loss function:
L = L_s + α × L_b + β × L_t
where L_s is the loss of the probability map, L_b is the loss of the binary map, and L_t is the loss of the threshold map; α and β are set to 1 and 10, respectively. L_s and L_b use the binary cross-entropy (BCE) loss, given by:
L_s = L_b = Σ_i [ y_i·log x_i + (1 − y_i)·log(1 − x_i) ]
the invention discloses a fast scene text detection method fusing void convolution, which replaces a large network with a light-weight neural network to extract the characteristics of an input image, solves the problem of overlarge network model parameters and can effectively improve the efficiency of a text detection network model. And a cavity convolution module is added in the feature fusion layer to enlarge the receptive field, after the features are extracted, the features of each layer are fused layer by layer from top to bottom, and a channel attention mechanism is used for fusing and screening the features, so that the utilization efficiency of the features in the network is improved, and the problem of insufficient extraction features of the lightweight neural network is effectively solved. The text detection method can greatly reduce the parameters of the text detection model, greatly improve the detection speed and realize the rapid detection of the scene text under the condition of keeping a higher detection level.
Drawings
FIG. 1 is a flow chart of a fast scene text detection method incorporating hole convolution according to the present invention;
FIG. 2 is a schematic diagram of a data set tag generation of the present invention;
FIG. 3 is a flow chart of a scene text detection network architecture of the present invention;
FIG. 4 is a diagram of a hole convolution module according to the present invention.
Detailed Description
FIG. 1 is a schematic diagram of the detection flow of the method of the present invention. The invention provides a rapid scene text detection method fusing hole convolution, comprising the following steps:
S1, acquiring a text detection training data set, and performing label generation on the training data set;
Specifically, labels are generated from the original annotations of the public data sets ICDAR2015 and CTW1500. Given a scene text image, each text-region polygon is described by a set of line segments:
G = { S_k } (k = 1, 2, …, n)
where n is the number of vertices; for example, a text region in the ICDAR2015 data set consists of 4 vertices. The polygon G is shrunk to G_s using the Vatti clipping algorithm, with the shrink offset D calculated from the perimeter L and the area A of the original polygon:
D = A(1 − r²) / L
where r is the shrink ratio, typically set to 0.4. The label of the threshold map can be generated by a similar process: first, the text polygon G is dilated to G_d with the same offset D; the gap between G_s and G_d is taken as the border of the text region, and the label of the threshold map is generated by calculating the distance to the closest segment in G, as shown in FIG. 2.
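As a hedged illustration (not part of the patent text), this shrink/dilate label generation can be sketched with the shapely and pyclipper libraries, pyclipper being a common implementation of Vatti-style polygon offsetting; the function name and integer rounding below are the editor's assumptions:

```python
import numpy as np
import pyclipper
from shapely.geometry import Polygon

def shrink_and_dilate(poly, r=0.4):
    """Shrink a text polygon G to G_s and dilate it to G_d using the
    offset D = A * (1 - r**2) / L described above (hypothetical helper)."""
    p = Polygon(poly)
    d = p.area * (1 - r ** 2) / p.length   # A and L of the original polygon

    pco = pyclipper.PyclipperOffset()
    pco.AddPath([(int(x), int(y)) for x, y in poly],
                pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    g_s = pco.Execute(-d)  # negative offset shrinks: probability-map label
    g_d = pco.Execute(+d)  # positive offset dilates: threshold-map border
    return np.array(g_s[0]), np.array(g_d[0])

# Example: an axis-aligned 4-vertex text region, as in ICDAR2015
g_s, g_d = shrink_and_dilate([(0, 0), (100, 0), (100, 30), (0, 30)])
```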
S2, establishing a preliminary rapid scene text detection model fusing hole convolution; the preliminary model comprises a lightweight feature extraction module, a hole convolution module and a differentiable binarization module;
Specifically, the lightweight feature extraction module adopts EfficientNet-b3 as the backbone network to extract features from the input image and construct a feature pyramid structure; a hole convolution module is added to the feature fusion layer; the feature fusion module fuses the two parts of features and is then connected to the differentiable binarization module. The structure of the text detection network model is shown in FIG. 3.
S3, training the preliminary model established in step S2 on the training data set with labels generated in step S1, using a loss function to calculate loss values and adjust the parameters of the preliminary model, to obtain the rapid scene text detection model fusing hole convolution;
Specifically, the preliminary text detection model is trained with the following steps to obtain the text detection model:
S3.1, the label-generated text images are input to the lightweight backbone network EfficientNet-b3, and the feature maps of the first to fifth stages are extracted to construct a feature pyramid structure (a sketch of this step is given below);
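Purely as an illustration, the five-stage feature extraction can be reproduced with the timm library, whose features_only mode returns the intermediate feature maps of EfficientNet-b3 at strides 2, 4, 8, 16 and 32; the use of timm and the 640×640 input size are assumptions, not the patent's specification:

```python
import timm
import torch

# EfficientNet-b3 backbone that returns the five stage feature maps
# (at 1/2, 1/4, 1/8, 1/16 and 1/32 of the input resolution).
backbone = timm.create_model('efficientnet_b3', pretrained=True, features_only=True)

x = torch.randn(1, 3, 640, 640)       # one label-generated training image
c1, c2, c3, c4, c5 = backbone(x)      # c4 is the 1/16 map used in step S3.2
print([tuple(f.shape) for f in (c1, c2, c3, c4, c5)])
```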
S3.2, the hole convolution module processes the 1/16-scale feature map extracted in step S3.1 with hole convolutions at hole rates of 1, 6, 12 and 18, respectively, to obtain hole convolution features;
Specifically, C4 of the backbone network EfficientNet-b3, i.e. the 1/16-scale feature map, is sampled in parallel by an ordinary 1×1 convolution and three 3×3 convolution kernels with hole rates of 6, 12 and 18, respectively, to obtain different receptive fields; the multi-scale features of the 4 branches are then concatenated to fully capture the context information of the input image, as shown in FIG. 4.
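A minimal PyTorch sketch of such a module follows; the parallel rates 1, 6, 12 and 18 match the ASPP pattern of DeepLab, and the class name, output width and final 1×1 fusion are the editor's assumptions:

```python
import torch
import torch.nn as nn

class HoleConvModule(nn.Module):
    """Parallel hole (dilated) convolutions over the 1/16 feature map C4."""
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, 1)                      # 1x1, rate 1
        self.branch2 = nn.Conv2d(in_ch, out_ch, 3, padding=6, dilation=6)
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=12, dilation=12)
        self.branch4 = nn.Conv2d(in_ch, out_ch, 3, padding=18, dilation=18)
        self.project = nn.Conv2d(4 * out_ch, out_ch, 1)                 # fuse 4 branches

    def forward(self, c4):
        feats = [self.branch1(c4), self.branch2(c4),
                 self.branch3(c4), self.branch4(c4)]
        return self.project(torch.cat(feats, dim=1))  # concatenate, then 1x1 fuse
```

Setting padding equal to the dilation rate keeps the spatial size of each 3×3 branch identical to the input, so the four branches can be concatenated directly.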
S3.3, the feature fusion layer fuses the features generated in steps S3.1 and S3.2, and a channel attention mechanism is used to fuse and screen the features;
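The patent text does not spell out the attention design; shown here, purely as an assumed sketch, is a squeeze-and-excitation style channel attention that could perform the described fusion screening:

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: re-weights the channels of the fused
    feature map so that informative channels are emphasised (assumption)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: global average
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # per-channel weights in (0, 1)
        )

    def forward(self, fused):
        return fused * self.fc(fused)  # screen features by channel importance
```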
S3.4, the probability map (P) and the threshold map (T) are predicted from the fused feature map generated in step S3.3; the differentiable binarization module combines the probability map and the threshold map to obtain an approximate binary map (B), adaptively predicting the threshold at each position in the image; in the inference stage, the bounding boxes of the text regions are obtained from the approximate binary map B;
Specifically, the feature map after the attention mechanism is used to predict the probability map P and the threshold map T, and the relationship between them is then established by the following formula to generate the approximate binary map B:
B_{i,j} = 1 / (1 + e^{−k(P_{i,j} − T_{i,j})})
where k is an amplification factor, generally set to 50; B_{i,j} is the value at point (i, j) on the approximate binary map, P_{i,j} is the value at point (i, j) on the probability map, and T_{i,j} is the value at point (i, j) on the threshold map. This differentiable approximate binarization function can be optimized together with the network during training and helps distinguish text regions from the background.
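In code, this differentiable binarization is a single element-wise operation; a minimal sketch (the function name is the editor's):

```python
import torch

def differentiable_binarization(P, T, k=50.0):
    """B = 1 / (1 + exp(-k * (P - T))): a steep sigmoid that approximates
    hard thresholding of P by T while remaining differentiable for training."""
    return torch.sigmoid(k * (P - T))
```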
S3.5, when the probability map P and the threshold map T are predicted from the fused feature map in step S3.4, the following formula is adopted as the prediction loss function:
L = L_s + α × L_b + β × L_t
where L_s is the loss of the probability map, L_b is the loss of the binary map, and L_t is the loss of the threshold map; α and β are set to 1 and 10, respectively. L_s and L_b use the binary cross-entropy (BCE) loss, given by:
L_s = L_b = Σ_i [ y_i·log x_i + (1 − y_i)·log(1 − x_i) ].
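A hedged sketch of this total loss follows: BCE for the probability and binary maps as given above, while the threshold-map loss L_t is not specified in the text, so a masked L1 distance over the border region, the usual choice in differentiable-binarization networks, is assumed:

```python
import torch
import torch.nn.functional as F

def detection_loss(P, B, T, prob_gt, thresh_gt, thresh_mask,
                   alpha=1.0, beta=10.0):
    """L = L_s + alpha * L_b + beta * L_t, with alpha = 1 and beta = 10.

    P, B, T: predicted probability, approximate-binary and threshold maps.
    prob_gt: shrunk-polygon label, used as target for both L_s and L_b.
    thresh_gt, thresh_mask: threshold-map label and border mask (assumed inputs).
    """
    L_s = F.binary_cross_entropy(P, prob_gt)   # probability-map loss (BCE)
    L_b = F.binary_cross_entropy(B, prob_gt)   # binary-map loss (BCE)
    # Threshold-map loss: masked L1 (assumption; not stated in the patent text).
    L_t = (torch.abs(T - thresh_gt) * thresh_mask).sum() / (thresh_mask.sum() + 1e-6)
    return L_s + alpha * L_b + beta * L_t
```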
And S4, the text in scene images is detected using the rapid scene text detection model fusing hole convolution obtained in step S3; a sketch of the inference-stage post-processing follows.
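As a final hedged sketch, the inference-stage box extraction can binarize the probability map, trace contours with OpenCV, and dilate each contour back out, inverting the label-generation shrinkage; the binarization threshold and unclip ratio below are assumptions:

```python
import cv2
import numpy as np
import pyclipper
from shapely.geometry import Polygon

def boxes_from_prob_map(prob_map, bin_thresh=0.3, unclip_ratio=1.5):
    """Recover text bounding boxes from the predicted probability map (sketch)."""
    binary = (prob_map > bin_thresh).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        pts = c.reshape(-1, 2)
        if len(pts) < 4:
            continue
        poly = Polygon(pts)
        if poly.area < 1:
            continue
        d = poly.area * unclip_ratio / poly.length   # inverse of the shrink offset
        pco = pyclipper.PyclipperOffset()
        pco.AddPath([(int(x), int(y)) for x, y in pts],
                    pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
        expanded = pco.Execute(d)
        if expanded:
            rect = cv2.minAreaRect(np.array(expanded[0], dtype=np.float32))
            boxes.append(cv2.boxPoints(rect))   # 4-point box of the text region
    return boxes
```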

Claims (5)

1. A rapid scene text detection method fusing hole convolution, comprising the following steps:
S1, acquiring a text detection training data set, and performing label generation on the training data set;
S2, establishing a preliminary rapid scene text detection model fusing hole convolution; the preliminary model comprises a lightweight feature extraction module, a hole convolution module and a differentiable binarization module;
S3, training the preliminary model established in step S2 on the training data set with labels generated in step S1, using a loss function to calculate loss values and adjust the parameters of the preliminary model, to obtain the rapid scene text detection model fusing hole convolution;
and S4, detecting the text in scene images using the rapid scene text detection model fusing hole convolution obtained in step S3.
2. The method according to claim 1, wherein in step S1 a text detection training data set is acquired and label generation is performed on it; specifically, labels are generated from the original annotations of the public data sets ICDAR2015 and CTW1500.
3. The method according to claim 1, wherein the preliminary rapid scene text detection model fusing hole convolution in step S2 comprises a lightweight feature extraction module, a hole convolution module and a differentiable binarization module; specifically, the lightweight feature extraction module adopts EfficientNet-b3 as the backbone network to extract features from the input image and construct a feature pyramid structure; the hole convolution module is added to the feature fusion layer; and the feature fusion module fuses the two parts of features and is then connected to the differentiable binarization module.
4. The method according to claim 1, wherein in step S3 the preliminary model established in step S2 is trained on the training data set with labels generated in step S1, and a loss function is used to calculate loss values and adjust the parameters of the preliminary model to obtain the rapid scene text detection model fusing hole convolution; specifically, the text detection model is obtained by training with the following steps:
S3.1, the label-generated text images are input to the lightweight backbone network EfficientNet-b3, and the feature maps of the first to fifth stages are extracted to construct a feature pyramid structure;
S3.2, the hole convolution module processes the 1/16-scale feature map extracted in step S3.1 with hole convolutions at hole rates of 1, 6, 12 and 18, respectively, to obtain hole convolution features;
S3.3, the feature fusion layer fuses the features generated in steps S3.1 and S3.2, and a channel attention mechanism is used to fuse and screen the features;
and S3.4, the probability map (P) and the threshold map (T) are predicted from the fused feature map generated in step S3.3; the differentiable binarization module combines the probability map and the threshold map to obtain an approximate binary map (B), adaptively predicting the threshold at each position in the image; and in the inference stage, the bounding boxes of the text regions are obtained from the approximate binary map B.
5. The method according to claim 4, wherein the following formula is adopted as the prediction loss function when the probability map (P) and the threshold map (T) are predicted in step S3.4:
L = L_s + α × L_b + β × L_t
where L_s is the loss of the probability map, L_b is the loss of the binary map, and L_t is the loss of the threshold map; α and β are set to 1 and 10, respectively. L_s and L_b use the binary cross-entropy (BCE) loss, given by:
L_s = L_b = Σ_i [ y_i·log x_i + (1 − y_i)·log(1 − x_i) ].
CN202210046573.2A (priority date 2022-01-14, filing date 2022-01-14): Rapid scene text detection method fusing hole convolution. Status: Pending; published as CN114529894A.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210046573.2A | 2022-01-14 | 2022-01-14 | Rapid scene text detection method fusing hole convolution

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210046573.2A | 2022-01-14 | 2022-01-14 | Rapid scene text detection method fusing hole convolution

Publications (1)

Publication Number | Publication Date
CN114529894A | 2022-05-24

Family

ID=81621819

Family Applications (1)

Application Number | Status | Priority Date | Filing Date | Title
CN202210046573.2A | Pending | 2022-01-14 | 2022-01-14 | Rapid scene text detection method fusing hole convolution

Country Status (1)

Country | Publication
CN | CN114529894A

Cited By (1)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
CN115100428A * | 2022-07-01 | 2022-09-23 | Tianjin University (天津大学) | Target detection method using context sensing



Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination