CN114241470A - Natural scene character detection method based on attention mechanism - Google Patents
- Publication number
- CN114241470A (application CN202111603367.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- scene
- feature
- convolution
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Character Discrimination (AREA)
Abstract
A natural scene character detection method based on an attention mechanism comprises: designing a convolutional neural network model for extracting text targets according to the feature information of text center blocks and stroke areas, and training the model with text center block and stroke information as supervision data; in the model testing stage, inputting the test image respectively into a text center block model and a stroke model to obtain probability maps of the text center block and the word stroke area; obtaining the final text area through reasoning and marking it, thereby completing the scene image character detection task. The method solves problems of prior-art scene character detection such as the difficulty of regressing the direction information of curved text, adhesion between closely spaced text lines, and information redundancy caused by multi-level feature integration.
Description
Technical Field
The invention relates to the field of computer analysis, in particular to a natural scene character detection method based on an attention mechanism.
Background
Characters, as carriers of human knowledge and information, exist widely in real daily-life scenes, and extracting them is valuable in many applications based on image content. Character extraction from scene images has broad application prospects in blind navigation, blind reading, image retrieval and tagging, human-computer interaction, autonomous driving, and similar scenarios. Scene character detection determines the specific position of characters in an image, and character recognition then identifies the text within the bounding box as character strings. Scene character detection plays an important role in extracting and understanding character information in scene images, and its performance directly determines the performance of character recognition. As a technology for extracting character information from images to assist or augment reality applications, scene character detection and recognition has become a challenging research field in academia and industry and has drawn wide attention from researchers at home and abroad.
In recent years, with the rapid development of general object detection and semantic segmentation technology, scene character detection has been widely researched and has achieved remarkable results. Although many scene text detection methods with superior performance have been proposed, accurate localization of scene text remains difficult in some challenging scenes. The challenges of scene text detection come mainly from three aspects: scene characters are affected by noise, blur, occlusion, strong light, and low resolution; scene characters appear in diverse forms with large aspect-ratio variation; and scene text differs in size, color, font, language, and style. For these three reasons, scene text detection is still an open problem.
Currently, mainstream scene character detection methods can be roughly divided into two types: methods based on general-object bounding-box regression and methods based on semantic segmentation. The prior art exhibits the following defects in use:
the direction of curved text is difficult to regress: scene character detection methods based on bounding-box regression must regress direction information to handle multi-oriented scene text instances; however, for text instances of arbitrary shape, such as curved text, the direction information cannot be regressed;
adhesion between closely spaced text lines: scene character detection methods based on semantic segmentation perform well on scene text of arbitrary shape and direction; however, when different text lines lie close to each other, they easily stick together;
multi-level feature integration yields information redundancy: when predicting text-region information, semantic-segmentation-based methods utilize multi-level features from shallow and deep layers; however, the important information of text targets is concentrated in the deep features, and the integration process does not consider the differing importance of features.
disclosure of Invention
The invention aims to provide a natural scene character detection method based on an attention mechanism, so that the problems in the prior art are solved.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a natural scene character detection method based on an attention mechanism comprises the following steps:
s1, constructing a candidate text instance prediction network: adding a context feature extraction module, a bidirectional long-short time memory and feature fusion module, and an attention-based feature integration module to the convolution feature extraction network to form the candidate text instance prediction network for natural scene character feature extraction;
s2, acquiring scene character images, and carrying out classification labeling on the scene character images to acquire a scene character image data set; the scene character image dataset comprises a scene character image and a corresponding binary label image, wherein the binary label image comprises a text center block label and a character stroke area label;
s3, extracting the features of the scene character images through the candidate text instance prediction network constructed in the step S1, specifically comprising the following steps:
s301, extracting the features of the scene character images through the convolution feature extraction network to obtain the corresponding convolution feature maps, which are taken from the third, fourth and fifth of the five convolution stages of the VGG-16 convolution network and denoted F = {f3, f4, f5}, where F represents the convolution feature map set of the scene text image;
s302, inputting the convolution feature maps into the context feature extraction module to obtain the corresponding multi-scale context information feature maps, denoted F′ = {f′3, f′4, f′5}, where F′ represents the set of multi-scale context information feature maps and f′i the context information feature map of convolution stage i, i ∈ {3, 4, 5};
s303, encoding the multi-scale context information feature maps of each convolution stage by sliding a 3 × 3 window from left to right over each map to obtain the corresponding feature sequence set Si, whose sequences are indexed by k ∈ {1, …, Hi} with time steps t ∈ {1, …, Wi}, where Hi and Wi are the height and width of feature map f′i and i ∈ {3, 4, 5} is the convolution-stage index; the feature sequence set Si is input in forward and backward order into the bidirectional long-short time memory and feature fusion module to obtain, for the state map of each sliding window in the multi-scale context information feature map of each convolution stage, a probability map of scene characters, denoted Md, where d ∈ {0, 1} represents the two prediction categories of scene text displayed in each state map;
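The sliding-window encoding of step S303 can be illustrated with a minimal single-channel sketch; the stride of 1 and the zero padding used here are illustrative assumptions, and the actual module slides a 3 × 3 window over multi-channel maps:

```python
def feature_sequences(fmap, win=3):
    # fmap: an H x W single-channel feature map. Slide a win-wide window from
    # left to right along each row (stride 1, zero padding), so every row of
    # height H yields a sequence of W window states.
    w = len(fmap[0])
    pad = win // 2
    sequences = []
    for row in fmap:
        padded = [0.0] * pad + list(row) + [0.0] * pad
        sequences.append([tuple(padded[t:t + win]) for t in range(w)])
    return sequences

# A 2 x 3 toy map f'_i yields H_i = 2 sequences of W_i = 3 steps each.
seqs = feature_sequences([[1.0, 2.0, 3.0],
                          [4.0, 5.0, 6.0]])
```

Each element of `seqs` then plays the role of one sequence of the set Si fed to the bidirectional recurrent module.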
s304, mapping the state maps to a probability distribution over the prediction-category values in (0, 1) to obtain the weight maps Wm = {Wconv3_3, Wconv4_3, Wconv5_3} for the scene characters displayed in the state maps, and outputting the probability map Fout of scene characters at each pixel position of the scene character image, where σ[·] is an activation function, ⊙ denotes element-wise multiplication, Wl and bl are the weight and bias of convolution layer l ∈ L showing the scene text, and Wl reflects the degree of attention paid to the feature maps at different positions.
S4, training a text center block model and a character stroke area model: the text center block model is trained to convergence with the text center block labels in the training set of the scene character image dataset of step S2, the character stroke area model is trained to convergence with the character stroke area labels, and fine-tuning is performed on the training set to generate the text center block model and the character stroke area model;
s5, applying the text center block model and the character stroke area model to the test image Fout of the scene character image processed in step S3, finally generating a probability map of the text center block and a probability map of the character stroke area corresponding to the scene text image;
and S6, judging the text center block through the area limit of the text center block, eliminating false candidate text center blocks, finally obtaining the text detected in the scene text image, and marking.
Preferably, the context feature extraction module includes four parallel hole convolution layers, the expansion coefficients of the four parallel hole convolution layers are 1, 3, 5 and 7, the size of the convolution kernel is 3 × 3, and the context feature extraction module is added after each convolution layer in the convolution feature extraction network.
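The coverage of the four parallel hole (dilated) convolution branches can be checked with a short sketch: a 3 × 3 kernel with dilation rate d spans 3 + 2(d − 1) input positions per axis, so the four branches observe progressively larger contexts from the same 3 × 3 kernel.

```python
def effective_kernel_size(k, d):
    # A k x k convolution with dilation d covers k + (k - 1) * (d - 1)
    # input positions per axis.
    return k + (k - 1) * (d - 1)

# The four parallel branches described above: 3 x 3 kernels with
# expansion coefficients 1, 3, 5 and 7.
branches = [effective_kernel_size(3, d) for d in (1, 3, 5, 7)]
print(branches)  # [3, 7, 11, 15]
```

The widening spans (3, 7, 11, 15 positions) are what give the module its multi-scale context without adding parameters.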
Preferably, the bidirectional long-short time memory and feature fusion module includes a forward LSTM layer, a backward LSTM layer, and a Concat layer, and the feature extraction step in the bidirectional long-short time memory and feature fusion module is as follows:
the characteristic sequence S belongs to SiInputting the data into the bidirectional long-time and short-time memory and feature fusion module; for characteristic sequencesRespectively using a forward LSTM layer and a backward LSTM layer to calculate the state sequence of the hidden layers, and splicing the states of all the hidden layers at each time step to obtain the state diagrams of all the hidden layersWhereinWherein B represents the state diagram of the hidden layer of each characteristic diagram, i belongs to {3,4,5}, and represents the sequence number of the convolution stage; and respectively mapping the hidden layer sequences to corresponding deconvolution layers, respectively outputting a probability graph of scene characters in a state graph of each sliding window in the multi-scale context information characteristic graph in each convolution stage, and marking the probability graph as a character:Wherein d ∈ {0, 1} represents two prediction categories with scene text displayed in each state diagram; will MdCutting the character image into a character image with the same size as the input scene, and marking the character image as a feature imageSplicing the two convolutional layers through a Concat layer, inputting the spliced convolutional layers into the two convolutional layers, wherein the first convolutional layer comprises 512 channels, and the size of a convolutional kernel is 3 multiplied by 3; the second convolutional layer contains 3 channels and the convolutional kernel size is 1 × 1.
Preferably, the attention-based feature integration module comprises two convolution layers, a Softmax layer, a Slice layer and three SpatialProduct layers; the attention-based feature integration module comprises the following execution steps:
the three channels of the second convolutional layer correspond to the characteristic diagram respectively In the Softmax layer, the feature map F is weightedcThe weight of (2) is mapped to the probability distribution of the value (0, 1) to obtain the probability distribution; in the Slice layer, dividing the probability distribution into weight maps Wm={Wconv33,Wconv43,Wconv53}; the attention-based feature integration module integrates the feature according to the feature map FcThe probability map F of the scene characters displayed at each pixel position in the scene character image is outputout。
Preferably, the judgment condition of the text center block is as follows:
Smin≤Stcb≤Smax
wherein Smin and Smax are respectively the minimum-area threshold and the maximum-area threshold of the text center block, and Stcb represents the area of the candidate text center block.
Preferably, the labeling method for the detected text in step S6 is: for the text with the curved shape, marking a text example by adopting the outline of the text area; for a straight line text, the outline of the text region is fitted by using a minimum rectangle, and a text example is marked by using a rectangular box.
Preferably, the loss function of the text center block model and the character stroke area model in step S4 is L = −Σi,j [Gij log(Pij) + (1 − Gij) log(1 − Pij)], wherein Gij is the label of the pixel at (i, j) and Pij represents the probability that the pixel at (i, j) belongs to the foreground.
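The pixel-wise cross-entropy loss above can be written directly as a short sketch:

```python
import math

def text_region_loss(G, P, eps=1e-12):
    # L = -sum_{i,j} [ G_ij * log(P_ij) + (1 - G_ij) * log(1 - P_ij) ]
    total = 0.0
    for g_row, p_row in zip(G, P):
        for g, p in zip(g_row, p_row):
            p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
            total += g * math.log(p) + (1 - g) * math.log(1 - p)
    return -total

# Two pixels, labels 1 and 0, both predicted at probability 0.5:
loss = text_region_loss([[1, 0]], [[0.5, 0.5]])
```

At P = 0.5 every pixel contributes log 2, the maximally uncertain case, and the loss decreases as predictions align with the labels.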
The invention has the beneficial effects that: the invention discloses a natural scene character detection method based on an attention mechanism, in which a context feature extraction module, a bidirectional long-short time memory and feature fusion module and an attention-based feature integration module are added on the basis of a VGG-16 framework to construct a network model that can effectively predict character areas in scene images, overcoming the problems of prior-art scene character detection: the direction information of curved text is difficult to regress, closely spaced text lines stick together, and multi-level feature integration produces information redundancy.
Drawings
FIG. 1 is a test flow diagram of a natural scene text detection method based on an attention mechanism;
FIG. 2 is a comparison diagram of the impact of three modules, a context feature extraction module, a two-way long-and-short term memory and feature fusion module, and an attention-based feature integration module, on detection performance;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
S1, constructing a candidate text instance prediction network, and simultaneously adding a context feature extraction module, a bidirectional long-time memory and feature fusion module and an attention-based feature integration module in the convolution feature extraction network to form the candidate text instance prediction network for natural scene character feature extraction;
the convolution feature extraction network is created based on VGG-16, the last pooling layer and three full-connection layers in the VGG-16 structure are deleted, and three modules are added: the system comprises a context feature extraction module, a bidirectional long-time memory and feature fusion module and an attention-based feature integration module; the context feature extraction module comprises four parallel cavity convolution layers, the expansion coefficients of the four cavity convolution layers are respectively 1, 3, 5 and 7, the convolution kernel size is 3 multiplied by 3, and the context feature extraction module is added behind each convolution layer in the convolution feature extraction network; the bidirectional long-short time memory and feature fusion module comprises a forward LSTM layer, a backward LSTM layer and a Concat layer; the attention-based feature integration module comprises two convolution layers, a Softmax layer, a Slice layer and three SpatialProduct layers.
S2, acquiring scene character images, and carrying out classification labeling on the scene character images to acquire a scene character image data set; the scene character image dataset comprises a scene character image and a corresponding binary label image, wherein the binary label image comprises a text center block label and a character stroke area label;
the generating method of the text center block label adopts an algebraic method, and the coordinates of each scene character in the number set are directly calculated through a zoom factor: assume that each instance polygon P in the scene text image dataset contains 2N vertices, where N ≧ 2, a set of vertices { P { (N ≧ 2) }1,…,pN,p′1,…,p′NDenotes wherein (p)i,p′i) Called a point pair, i ∈ {1, …, N }; the scaled polygon P' still has 2N vertices, using a set of vertices { P }01,…,p0N,p′01,…,p′0NRepresents it. Suppose a vertex pi、p′i、p0i、p′0iAre respectively expressed as: (x)i,yi),(x′i,y′i),(x0i,y0i),(x′0i,y′0i). Given (x)i,yi),(x′i,y′i) And a scaling factor lambda0、x0i、y0i、x′0iAnd y'0iCalculated from the following formula:
the generation mode of the character stroke area label is obtained through an area growing algorithm, and for scene image characters with simple backgrounds, fewer seeds are selected to generate a stroke horizontal area; for scene image characters with complex background, more seeds need to be selected; for scene characters with complex backgrounds, if the number of the selected seeds exceeds 10 and the generated area is not perfect, in this case, the selected seeds are discarded, and other seeds are replaced to generate a stroke level area until 1-10 seeds are adopted, so that the area with the scene image characters can be completely generated.
S3, extracting the features of the scene character images through the candidate text instance prediction network constructed in the step S1, specifically comprising the following steps:
s301, extracting the features of the scene character images through the convolution feature extraction network to obtain the corresponding convolution feature maps, which are taken from the third, fourth and fifth of the five convolution stages of the VGG-16 convolution network and denoted F = {f3, f4, f5}, where F represents the convolution feature map set of the scene text image;
s302, inputting the convolution feature maps into the context feature extraction module to obtain the corresponding multi-scale context information feature maps, denoted F′ = {f′3, f′4, f′5}, where F′ represents the set of multi-scale context information feature maps and f′i the context information feature map of convolution stage i, i ∈ {3, 4, 5};
s303, encoding the multi-scale context information feature maps of each convolution stage by sliding a 3 × 3 window from left to right over each map to obtain the corresponding feature sequence set Si, indexed by k ∈ {1, …, Hi} with time steps t ∈ {1, …, Wi}, where Hi and Wi are the height and width of feature map f′i and i ∈ {3, 4, 5} is the convolution-stage index; each feature sequence s ∈ Si is input into the bidirectional long-short time memory and feature fusion module; for each feature sequence, the forward LSTM layer and the backward LSTM layer compute the hidden-layer state sequences, and the hidden states at each time step are spliced to obtain the hidden-layer state maps Bi, i ∈ {3, 4, 5}; the hidden-layer sequences are mapped to the corresponding deconvolution layers, which output the probability map of scene characters for the state map of each sliding window, denoted Md, where d ∈ {0, 1} represents the two prediction categories of scene text displayed in each state map; Md is cropped to the same size as the input scene character image, the resulting feature maps are spliced through the Concat layer and input to two convolutional layers: the first convolutional layer contains 512 channels with a 3 × 3 convolution kernel, and the second contains 3 channels with a 1 × 1 convolution kernel;
s304, the three channels of the second convolutional layer in step S303 correspond respectively to the feature maps of the conv3_3, conv4_3 and conv5_3 stages; in the Softmax layer, the weights of the feature map Fc are mapped to a probability distribution over the values (0, 1); in the Slice layer, the probability distribution is divided into the weight maps Wm = {Wconv3_3, Wconv4_3, Wconv5_3}; the attention-based feature integration module outputs, from the feature map Fc, the probability map Fout of scene characters displayed at each pixel position of the scene character image, where σ[·] is an activation function, ⊙ denotes element-wise multiplication, Wl and bl are the weight and bias of convolution layer l ∈ L showing the scene text, and Wl reflects the degree of attention paid to the feature maps at different positions.
S4, training a text center block model and a character stroke area model: the text center block model is trained to convergence with the text center block labels in the training set of the scene character image dataset of step S2, the character stroke area model is trained to convergence with the character stroke area labels, and fine-tuning is performed on the training set to generate the text center block model and the character stroke area model;
the loss function of the text center module and the character stroke area model is L ═ sigmai,jGijlog(Pij)+(1-Gij)log(1-Pij) Wherein G isijIs the label of the pixel at (i, j), PijRepresenting the probability that the pixel at (i, j) belongs to the foreground;
s5, applying the text center block model and the character stroke area model to the test image Fout of the scene character image processed in step S3, finally generating a probability map of the text center block and a probability map of the character stroke area corresponding to the scene text image;
test image F containing scene charactersoutRespectively inputting into the text center model and the character stroke region model to generate a probability map (F) of the text center blocktcb) And probability map of word stroke area (F)wsr);
And S6, judging the text center blocks through the area limit of the text center block, removing false candidate text center blocks, finally obtaining the text detected in the scene text image, and adopting different labeling modes for different text objects.
The judgment condition of the text center block is as follows:
Smin≤Stcb≤Smax
wherein Smin and Smax are respectively the minimum-area threshold and the maximum-area threshold of the text center block, and Stcb represents the area of a candidate text center block;
When judging candidate text center blocks by the area of the text center block, Smin and Smax are set to 211 and 81179 respectively, because 99% of text center blocks have an area of at least 211 and less than 81179; the finally screened text center block and word stroke area instances are marked; the final character area is obtained, completing the scene image character detection task. The marking mode of the detected text is: for text with a curved shape, the outline of the text area marks the text instance; for straight-line text, the outline of the text region is fitted with a minimum rectangle and the text instance is marked with a rectangular box.
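The area screening of candidate text center blocks reduces to one comparison per block against the thresholds quoted above, following the judgment condition Smin ≤ Stcb ≤ Smax:

```python
S_MIN, S_MAX = 211, 81179  # thresholds reported for the Total-Text experiments

def keep_text_center_block(area, s_min=S_MIN, s_max=S_MAX):
    # A candidate survives only if s_min <= area <= s_max.
    return s_min <= area <= s_max

candidates = [120, 211, 5000, 81179, 90000]
kept = [a for a in candidates if keep_text_center_block(a)]
```

Blocks outside the band (tiny noise blobs or background-sized regions) are discarded as false candidates before marking.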
Examples
In this embodiment, a comparative experiment is performed to verify the technical effect of the scene text detection method described in the first embodiment, and the experimental environment and the experimental result are as follows:
(1) experimental Environment
The system environment is as follows: ubuntu 16.04;
hardware environment: GPU, GTX 1080Ti, memory: 512G.
(2) Experimental data set
Training data: first, the text center block model was pre-trained for 4 × 10^5 iterations using the 7200 training images of MLT2017; then the text center block model and the word stroke region model were fine-tuned for 4 × 10^5 iterations on Total-Text (1255 training images).
Test data: Total-Text (300 test images).
(3) Evaluation method
Curved shape text: pascal evaluation methods.
In order to show the effectiveness of the invention, four groups of experiments were set up, each training the model with the same training set and evaluated on the test set of the Total-Text dataset:
the first set of experiments: the method comprises the following steps of training by using a combination of a bidirectional long-short time memory and feature fusion module and an attention-based feature integration module, marking as the bidirectional long-short time memory and feature fusion module and the attention-based feature integration module, predicting a target region by using the combination of the bidirectional long-short time memory and feature fusion module and the attention-based feature integration module, and verifying the effectiveness of a context feature extraction module;
the second set of experiments: training by using a combination of a context feature extraction module and an attention-based feature integration module, marking as the context feature extraction module and the attention-based feature integration module, predicting a target region by using the combination of the context feature extraction module and the attention-based feature integration module, and verifying the validity of a bidirectional long-short time memory and feature fusion module;
the third set of experiments: the method comprises the following steps of training by using a combination of a bidirectional long-short time memory and feature fusion module and a context feature extraction module, marking as the bidirectional long-short time memory and feature fusion module and the context feature extraction module, predicting a target region by using the combination of the context feature extraction module and the bidirectional long-short time memory and feature fusion module, and verifying the effectiveness of a feature integration module based on attention;
fourth set of experiments: the method comprises the following steps of training by using a combination of three modules, namely a bidirectional long-short time memory and feature fusion module, a context feature extraction module and an attention-based feature integration module, marking as the bidirectional long-short time memory and feature fusion module, the context feature extraction module and the attention-based feature integration module, predicting a target region by using a combination of the three modules, namely the context feature extraction module, the bidirectional long-short time memory and feature fusion module and the attention-based feature integration module, and verifying the effectiveness of the three modules as a comparison group;
setting parameters: t iswAnd TbSet to 0.55, 0.60, respectively;
validity of the context feature extraction module: in order to obtain multi-scale context information, the invention designs a context feature extraction module; from fig. 2, it can be seen that the method of "two-way long-short time memory and feature fusion module + attention-based feature integration module" reduces F-measure by 1.54% (76.99% vs. 78.53%), reduces Precision by 4.52% (74.39% vs. 78.91%), but increases Recall by 1.63% (79.79% vs. 78.16%) without using the context feature extraction module.
Validity of the bidirectional long-time and short-time memory and feature fusion module: in order to utilize the space sequence characteristics of characters in text objects (words and text lines), the invention designs a bidirectional long-time and short-time memory and feature fusion module. From fig. 2, it can be seen that the method of "context feature extraction module + attention-based feature integration module" reduces F-measure by 3.25% (75.28% vs. 78.53%), Precision by 7.19% (71.72% vs. 78.91%), but increases Recall by 1.06% (79.22% vs. 78.16%) without using the two-way long-short time memory and feature fusion module.
Effectiveness of the attention-based feature integration module: in order to make the trained model strengthen its attention to text regions in the scene image, the invention designs an attention-based feature integration module. As can be seen from fig. 2, without this module, the "context feature extraction module + bidirectional long-short-term memory and feature fusion module" method reduces F-measure by 0.71% (77.82% vs. 78.53%) and Recall by 1.94% (76.22% vs. 78.16%), but increases Precision by 0.51% (79.48% vs. 78.91%).
As shown in fig. 2, the test results reflect the influence of the "context feature extraction module", the "bidirectional long-short-term memory and feature fusion module", and the "attention-based feature integration module" on detection performance. Comparing the first and second groups of experiments with the fourth group shows that the "context feature extraction module" and the "bidirectional long-short-term memory and feature fusion module" each significantly improve Precision of the method while slightly reducing Recall; comparing the third group with the fourth group shows that the "attention-based feature integration module" significantly improves Recall of the method of the invention while slightly reducing Precision. Meanwhile, the comparison experiments show that the three modules are complementary in improving Precision, Recall, and F-measure of the method.
The above embodiments are only used to illustrate the technical solution of the present invention and not to limit it. It is clear to those skilled in the art that the above embodiments can be implemented by software, or by software plus a necessary general hardware platform. Based on this understanding, the technical solutions of the above embodiments may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (for example, a CD-ROM, a USB disk, or a removable hard disk) and includes instructions for causing a computing device (for example, a personal computer, a server, or a network appliance) to perform the methods described in the various embodiments of the invention.
By adopting the technical scheme disclosed by the invention, the following beneficial effects are obtained:
the invention discloses a natural scene character detection method based on an attention mechanism, in which a context feature extraction module, a bidirectional long-short-term memory and feature fusion module, and an attention-based feature integration module are added on the basis of the VGG-16 framework to construct a network model that can effectively predict character areas in a scene image, so as to overcome the problems of scene character detection in the prior art: the direction information of curved text is difficult to regress, adjacent text lines stick closely together, and multi-level feature integration produces redundant information.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.
Claims (7)
1. A natural scene character detection method based on an attention mechanism is characterized by comprising the following steps:
s1, constructing a candidate text instance prediction network: adding a context feature extraction module, a bidirectional long-short-term memory and feature fusion module, and an attention-based feature integration module to the convolution feature extraction network to form the candidate text instance prediction network for natural scene character feature extraction;
s2, acquiring scene character images, and carrying out classification labeling on the scene character images to acquire a scene character image data set; the scene character image dataset comprises a scene character image and a corresponding binary label image, wherein the binary label image comprises a text center block label and a character stroke area label;
s3, extracting the features of the scene character images through the candidate text instance prediction network constructed in the step S1, specifically comprising the following steps:
s301, extracting features of the scene character image through the convolution feature extraction network to obtain convolution feature maps of the corresponding scene character image; the convolution feature maps are extracted from the third, fourth, and fifth of the five convolution stages of the VGG-16 convolution network and marked as F = {f_3, f_4, f_5}, where F represents the convolution feature map set of the scene text image; the last convolution layer of each of these convolution stages is marked as L = {conv3_3, conv4_3, conv5_3};
s302, inputting the convolution feature maps into the context feature extraction module to obtain the corresponding multi-scale context information feature maps, marked as F′ = {f′_3, f′_4, f′_5}, where F′ represents the set of multi-scale context information feature maps and f′_i represents the multi-scale context information feature map of each convolution stage, i ∈ {3, 4, 5};
s303, coding all the multi-scale context information feature maps of each convolution stage: sliding a 3 × 3 window from left to right over each multi-scale context information feature map to obtain the corresponding feature sequence set, marked as S_i, where C represents the number of channels, H_i and W_i are the height and width of the feature map f′_i, and i ∈ {3, 4, 5} is the sequence number of the convolution stage; the feature sequence set S_i is input into the bidirectional long-short-term memory and feature fusion module in forward and backward order, and the probability map of scene characters in the state map of each sliding window of the multi-scale context information feature map of each convolution stage is obtained, marked as M_a^d, where d ∈ {0, 1} represents the two prediction categories of scene text displayed in each state map;
s304, according to the probability distribution obtained by mapping the state maps onto the prediction category values (0, 1), obtaining the weight maps W_m = {W_conv3_3, W_conv4_3, W_conv5_3} for displaying scene characters in the state maps, and outputting the probability map F_out of scene characters at each pixel position in the scene character image, where σ[·] is an activation function, ⊙ denotes element-wise multiplication, W_l and b_l are the weight and bias of the convolution layer l ∈ L displaying the scene text, and W_l reflects the degree of attention paid to the feature maps at different positions;
s4, training a text center block model and a character stroke area model: pre-training the two models to convergence with the text center block labels and character stroke area labels in the training set of the scene character image dataset of step S2, and fine-tuning on the basis of the training set to generate the text center block model and the character stroke area model;
s5, processing the scene character image to be tested through step S3 to obtain F_out, and calculating through the text center block model and the character stroke area model to finally generate the probability map of the text center block and the probability map of the character stroke area corresponding to the scene text image;
and S6, judging the text center blocks through the area limit of the text center block, eliminating false candidate text center blocks, finally obtaining the text detected in the scene text image, and marking it.
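The VGG-16 backbone of step S301 can be sketched in code. The following is a minimal PyTorch illustration (an assumption-laden sketch, not the patent's implementation) of the standard VGG-16 convolution stack, returning the outputs of stages 3-5 (i.e. after conv3_3, conv4_3, conv5_3) as the feature map set F = {f3, f4, f5}:

```python
import torch
import torch.nn as nn

def vgg_stage(in_c, out_c, n_convs):
    """One VGG convolution stage: n_convs pairs of (3x3 conv + ReLU)."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_c if i == 0 else out_c, out_c, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class VGG16Backbone(nn.Module):
    """Standard VGG-16 convolution stack (no pretrained weights here);
    forward() taps the outputs of stages 3, 4, 5, giving F = {f3, f4, f5}."""
    def __init__(self):
        super().__init__()
        self.stage1 = vgg_stage(3, 64, 2)
        self.stage2 = vgg_stage(64, 128, 2)
        self.stage3 = vgg_stage(128, 256, 3)   # ends at conv3_3
        self.stage4 = vgg_stage(256, 512, 3)   # ends at conv4_3
        self.stage5 = vgg_stage(512, 512, 3)   # ends at conv5_3
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.pool(self.stage1(x))
        x = self.pool(self.stage2(x))
        f3 = self.stage3(x)                    # stride 4
        f4 = self.stage4(self.pool(f3))        # stride 8
        f5 = self.stage5(self.pool(f4))        # stride 16
        return f3, f4, f5

f3, f4, f5 = VGG16Backbone()(torch.randn(1, 3, 64, 64))
print(f3.shape, f4.shape, f5.shape)
```

In practice the backbone would be initialized from ImageNet-pretrained VGG-16 weights before the fine-tuning of step S4.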
2. The attention mechanism-based natural scene character detection method according to claim 1, wherein the context feature extraction module comprises four parallel dilated (hole) convolution layers with dilation coefficients 1, 3, 5, and 7 and convolution kernel size 3 × 3, and a context feature extraction module is added after each convolution layer in the convolution feature extraction network.
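A minimal PyTorch sketch of the context feature extraction module of claim 2: four parallel 3 × 3 dilated convolutions with dilation coefficients 1, 3, 5, and 7. Fusing the four branches by summation is an assumption; the claim does not state how the branch outputs are combined.

```python
import torch
import torch.nn as nn

class ContextFeatureExtraction(nn.Module):
    """Four parallel 3x3 dilated convolutions (dilations 1, 3, 5, 7).
    Assumption: the four branch outputs are summed, keeping the channel
    count and spatial size unchanged."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in (1, 3, 5, 7)   # padding == dilation keeps H x W unchanged
        )

    def forward(self, f):
        return sum(branch(f) for branch in self.branches)

f3 = torch.randn(1, 256, 56, 56)
print(ContextFeatureExtraction(256)(f3).shape)  # same shape as the input
```

Setting `padding` equal to the dilation rate for a 3 × 3 kernel preserves the feature map resolution, so the module can be dropped in after any convolution stage without resizing.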
3. The attention mechanism-based natural scene text detection method according to claim 1, wherein the bidirectional long-short-term memory and feature fusion module comprises a forward LSTM layer, a backward LSTM layer, and a Concat layer, and the feature extraction steps in the bidirectional long-short-term memory and feature fusion module are as follows:
each feature sequence s ∈ S_i is input into the bidirectional long-short-term memory and feature fusion module; for each feature sequence, the forward LSTM layer and the backward LSTM layer are used respectively to calculate the hidden-layer state sequences, and the states of all hidden layers at each time step are spliced to obtain the hidden-layer state maps B_i, where B represents the hidden-layer state map of each feature map and i ∈ {3, 4, 5} is the sequence number of the convolution stage; the hidden-layer sequences are mapped onto corresponding deconvolution layers, and the probability map of scene characters in the state map of each sliding window of the multi-scale context information feature map of each convolution stage is output, marked as M_a^d, where d ∈ {0, 1} represents the two prediction categories of scene text displayed in each state map; M_a is cut to the same size as the input scene character image and marked as the feature map F_c; the maps are spliced through the Concat layer and input into two convolution layers, wherein the first convolution layer comprises 512 channels with convolution kernel size 3 × 3, and the second convolution layer comprises 3 channels with convolution kernel size 1 × 1.
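A hedged PyTorch sketch of the sliding-window encoding and bidirectional LSTM step of claim 3: each row of the feature map is read left-to-right as a sequence of 3 × 3 windows (extracted with `unfold`), and the forward and backward hidden states are spliced at each time step. The hidden size, the one-sequence-per-row layout, and the window stride are assumptions; the deconvolution, Concat layer, and the 512-/3-channel convolution head are omitted for brevity.

```python
import torch
import torch.nn as nn

class BiLSTMFusion(nn.Module):
    """Reads each row of a C x H x W feature map as a left-to-right sequence
    of 3x3 windows and runs a bidirectional LSTM over it, concatenating the
    forward and backward hidden states (the splicing step of claim 3)."""
    def __init__(self, channels, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(channels * 9, hidden, batch_first=True,
                           bidirectional=True)

    def forward(self, f):
        n, c, h, w = f.shape
        # 3x3 sliding windows at every position: (N, C*9, H*W)
        patches = nn.functional.unfold(f, kernel_size=3, padding=1)
        patches = patches.view(n, c * 9, h, w).permute(0, 2, 3, 1)  # (N, H, W, C*9)
        seqs = patches.reshape(n * h, w, c * 9)   # one sequence per row
        states, _ = self.rnn(seqs)                # (N*H, W, 2*hidden)
        return states.reshape(n, h, w, -1).permute(0, 3, 1, 2)  # (N, 2*hidden, H, W)

out = BiLSTMFusion(32)(torch.randn(1, 32, 8, 10))
print(out.shape)
```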
4. The attention-based natural scene text detection method of claim 3, wherein the attention-based feature integration module comprises two convolution layers, a Softmax layer, a Slice layer, and three spatial product layers; the attention-based feature integration module comprises the following execution steps:
the three channels of the second convolution layer correspond to the feature maps of the three convolution stages respectively; in the Softmax layer, the weights of the feature map F_c are mapped onto a probability distribution over the values (0, 1); in the Slice layer, the probability distribution is divided into the weight maps W_m = {W_conv3_3, W_conv4_3, W_conv5_3}; the attention-based feature integration module outputs, according to the feature map F_c, the probability map F_out of scene characters displayed at each pixel position in the scene character image.
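A sketch of the attention-based feature integration module of claim 4, assuming the three softmax channels weight three per-stage score maps that are then summed; the claim specifies only the layer inventory (two convolutions, Softmax, Slice, three spatial products), so the fusion by weighted sum is an assumption.

```python
import torch
import torch.nn as nn

class AttentionFeatureIntegration(nn.Module):
    """Two convolutions (512ch 3x3, then 3ch 1x1), a channel-wise softmax,
    a slice into three weight maps W_conv3_3 / W_conv4_3 / W_conv5_3, and
    three element-wise (spatial) products with the per-stage maps.
    Summing the weighted maps into F_out is an assumed fusion rule."""
    def __init__(self, in_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 512, 3, padding=1)  # 512 ch, 3x3
        self.conv2 = nn.Conv2d(512, 3, 1)                       # 3 ch, 1x1

    def forward(self, fc, stage_maps):
        # Softmax layer: weights become a (0, 1) probability distribution
        # over the three channels; Slice layer: one weight map per stage.
        weights = torch.softmax(self.conv2(self.conv1(fc)), dim=1)  # (N, 3, H, W)
        # three spatial products, then (assumed) summation into F_out
        return sum(weights[:, i:i + 1] * stage_maps[i] for i in range(3))

fc = torch.randn(1, 768, 32, 32)                      # concatenated F_c (channel count assumed)
maps = [torch.randn(1, 1, 32, 32) for _ in range(3)]  # per-stage score maps
print(AttentionFeatureIntegration(768)(fc, maps).shape)
```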
5. The attention mechanism-based natural scene character detection method according to claim 1, wherein the judgment condition of the text center block is:
S_min ≤ S_tcb ≤ S_max
wherein S_min and S_max are the thresholds of the minimum area and the maximum area of the text center block respectively, and S_tcb represents the area of the candidate text center block.
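The area check of claim 5 amounts to a simple filter over candidate blocks. A sketch, where the `(block_id, area)` pair representation is an assumption for illustration:

```python
def filter_text_center_blocks(blocks, s_min, s_max):
    """Keep candidate text center blocks whose area S_tcb satisfies
    S_min <= S_tcb <= S_max (claim 5); the rest are eliminated as
    false candidates. `blocks` is a list of (block_id, area) pairs."""
    return [b for b in blocks if s_min <= b[1] <= s_max]

candidates = [("a", 50), ("b", 400), ("c", 9000)]
print(filter_text_center_blocks(candidates, s_min=100, s_max=5000))  # [('b', 400)]
```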
6. The attention mechanism-based natural scene character detection method of claim 1, wherein the labeling manner for the detected text in step S6 is: for text with a curved shape, the outline of the text area is used to mark the text instance; for straight-line text, the outline of the text region is fitted with a minimum rectangle, and the text instance is marked with a rectangular box.
7. The attention mechanism-based natural scene text detection method according to claim 1, wherein the loss function of the text center block model and the character stroke area model in step S4 is L = -Σ_{i,j} [G_ij log(P_ij) + (1 - G_ij) log(1 - P_ij)], where G_ij is the label of the pixel at (i, j) and P_ij represents the probability that the pixel at (i, j) belongs to the foreground.
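The loss of claim 7 is a summed pixel-wise binary cross-entropy. A direct PyTorch sketch (the epsilon clamp is an added numerical-stability assumption):

```python
import torch

def center_block_loss(p, g, eps=1e-7):
    """Pixel-wise binary cross-entropy of claim 7:
    L = -sum_{i,j} [ G_ij * log(P_ij) + (1 - G_ij) * log(1 - P_ij) ],
    where g holds the labels G_ij and p the foreground probabilities P_ij."""
    p = p.clamp(eps, 1 - eps)   # avoid log(0)
    return -(g * torch.log(p) + (1 - g) * torch.log(1 - p)).sum()

p = torch.tensor([[0.9, 0.1]])   # predicted foreground probabilities
g = torch.tensor([[1.0, 0.0]])   # ground-truth labels
print(float(center_block_loss(p, g)))  # -2*log(0.9) ≈ 0.2107
```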
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111603367.9A CN114241470A (en) | 2021-12-24 | 2021-12-24 | Natural scene character detection method based on attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114241470A true CN114241470A (en) | 2022-03-25 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115438214A (en) * | 2022-11-07 | 2022-12-06 | 北京百度网讯科技有限公司 | Method for processing text image, neural network and training method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||