CN114241470A - Natural scene character detection method based on attention mechanism - Google Patents

Natural scene character detection method based on attention mechanism

Info

Publication number
CN114241470A
CN114241470A (application CN202111603367.9A)
Authority
CN
China
Prior art keywords
text
scene
feature
convolution
character
Prior art date
Legal status
Pending
Application number
CN202111603367.9A
Other languages
Chinese (zh)
Inventor
刘占东
张海军
Current Assignee
Xinjiang Normal University
Original Assignee
Xinjiang Normal University
Priority date
Filing date
Publication date
Application filed by Xinjiang Normal University filed Critical Xinjiang Normal University
Priority to CN202111603367.9A priority Critical patent/CN114241470A/en
Publication of CN114241470A publication Critical patent/CN114241470A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Character Discrimination (AREA)

Abstract

A natural scene character detection method based on an attention mechanism comprises: designing a convolutional neural network model for extracting text targets according to the feature information of text center blocks and stroke areas, and training the model with text center block and stroke information as supervision data; in the model testing stage, the test image is input into the text center block model and the stroke model respectively to obtain probability maps of the text center block and of the word stroke area; the final text area is obtained through inference and labeled, completing the scene image text detection task. The method solves problems of scene text detection in the prior art, such as the difficulty of regressing the direction information of curved text, the adhesion between closely spaced text lines, and the information redundancy caused by multi-level feature integration.

Description

Natural scene character detection method based on attention mechanism
Technical Field
The invention relates to the field of computer analysis, in particular to a natural scene character detection method based on an attention mechanism.
Background
Characters serve as carriers of human knowledge and information and exist widely in daily life scenes. Extracting the text contained in scene images is valuable in many applications based on image content information, with broad application prospects in blind navigation, blind reading, image retrieval and annotation, human-computer interaction, autonomous driving and similar scenarios. Scene text detection determines the specific position of text in an image, and text recognition converts the text inside the bounding box into character strings. Scene text detection plays an important role in extracting and understanding the text information in scene images, and its performance directly determines the performance of text recognition in images. Scene text detection and recognition technology, which extracts text information from images to assist or augment reality applications, has become a challenging research field in academia and industry and has attracted wide attention from researchers at home and abroad.
In recent years, with the rapid development of general object detection and semantic segmentation technology, scene text detection has been widely studied and has achieved remarkable results. Although many scene text detection methods with superior performance have been proposed, accurate localization of scene text is still difficult in some challenging scenes. The challenges of scene text detection come mainly from three aspects: scene text is affected by noise, blur, occlusion, strong light and low-resolution factors; scene text appears in various forms with large variation in aspect ratio; and scene text differs in size, color, font, language and style. For these three reasons, scene text detection remains an open problem.
Currently, mainstream scene text detection methods can be roughly divided into two types: methods based on general-object bounding-box regression and methods based on semantic segmentation. The prior art has the following defects in use:
the direction of curved text is not easy to regress: scene text detection methods based on bounding-box regression need to regress direction information when handling multi-oriented scene text instances; however, for text instances of arbitrary shape, such as curved text, the direction information cannot be regressed;
adhesion between closely spaced text lines: scene text detection methods based on semantic segmentation have achieved good performance on scene text of arbitrary shape and orientation; however, when different text lines are close to each other, they easily stick together;
multi-level feature integration causes information redundancy: scene text detection methods based on semantic segmentation use multi-level shallow and deep features when predicting text region information; however, important text-target information is mainly concentrated in the deep features, and the importance of different features is not considered during feature integration.
disclosure of Invention
The invention aims to provide a natural scene character detection method based on an attention mechanism, so that the problems in the prior art are solved.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a natural scene character detection method based on an attention mechanism comprises the following steps:
s1, constructing a candidate text instance prediction network, and simultaneously adding a context feature extraction module, a bidirectional long-time memory and feature fusion module and an attention-based feature integration module in the convolution feature extraction network to form the candidate text instance prediction network for natural scene character feature extraction;
s2, acquiring scene character images, and carrying out classification labeling on the scene character images to acquire a scene character image data set; the scene character image dataset comprises a scene character image and a corresponding binary label image, wherein the binary label image comprises a text center block label and a character stroke area label;
s3, extracting the features of the scene character images through the candidate text instance prediction network constructed in the step S1, specifically comprising the following steps:
S301, extracting features of the scene text image through the convolution feature extraction network to obtain the convolution feature maps of the corresponding scene text image; the convolution feature maps are extracted from the third, fourth and fifth of the five convolution stages of the VGG-16 convolution network and denoted F = {f_3, f_4, f_5}, where F represents the convolution feature map set of the scene text image, and the last convolution layer of each of these stages is denoted L = {conv33, conv43, conv53};
S302, inputting the convolution feature maps into the context feature extraction module to obtain the corresponding multi-scale context information feature maps, denoted F' = {f'_3, f'_4, f'_5}, where F' represents the set of multi-scale context information feature maps and f'_i represents the multi-scale context information feature map of each convolution stage, with i ∈ {3,4,5};
S303, coding all the multi-scale context information feature maps of each convolution stage: a 3 × 3 sliding window is slid from left to right over each multi-scale context information feature map to obtain the corresponding feature sequence set, denoted S = {S_3, S_4, S_5}, where each element s^i_{k,t} of S_i satisfies s^i_{k,t} ∈ R^{3×3×C}, k ∈ {1, …, H_i}, t ∈ {1, …, W_i}, C denotes the number of channels, H_i and W_i are the height and width of the feature map f'_i, and i ∈ {3,4,5} is the index of the convolution stage; the feature sequence set S_i is input into the bidirectional long-short time memory and feature fusion module in forward and backward order, obtaining the probability maps of scene text in the state map of each sliding window of the multi-scale context information feature map of each convolution stage, denoted:
M_d = {M_d^conv33, M_d^conv43, M_d^conv53},
where d ∈ {0, 1} denotes the two prediction categories of scene text in each state map;
S304, according to the probability distribution obtained by mapping the state maps onto the prediction category values (0, 1), a weight map W_m = {W_conv33, W_conv43, W_conv53} of scene text in the state maps is obtained, and the probability map of scene text at each pixel position of the scene text image is output as:
F_out = σ[ Σ_{l∈L} (W_l ⊙ F_c^l + b_l) ],
where σ[·] is an activation function, ⊙ denotes element-wise multiplication, W_l and b_l are the weight and bias of the convolution layer l ∈ L for scene text, and W_l reflects the degree of attention paid to the feature map F_c^l at different positions.
S4, training a text center block model and a character stroke area model: the text center block model is trained to convergence with the text center block labels in the training set of the scene text image dataset of step S2, the character stroke area model is trained to convergence with the character stroke area labels, and fine tuning is performed on the basis of the training set to generate the text center block model and the character stroke area model;
S5, the test image of the scene text image to be tested is processed as in step S3 by the text center block model and the character stroke area model to compute F_out, finally generating the probability map of the text center block and the probability map of the character stroke area corresponding to the scene text image;
and S6, judging the text center block through the area limit of the text center block, eliminating false candidate text center blocks, finally obtaining the text detected in the scene text image, and marking.
Preferably, the context feature extraction module includes four parallel dilated (hole) convolution layers with dilation rates of 1, 3, 5 and 7 and a convolution kernel size of 3 × 3, and a context feature extraction module is added after each convolution layer in the convolution feature extraction network.
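As an illustration only, a minimal PyTorch sketch of such a context feature extraction module might look like the following; the class name, the choice of summing the four branches and the padding scheme are assumptions, not details taken from the filing:

```python
import torch.nn as nn

class ContextFeatureExtraction(nn.Module):
    """Four parallel 3x3 dilated convolutions with dilation rates 1, 3, 5, 7.
    How the four branch outputs are combined is not stated in the filing;
    element-wise summation is assumed here."""

    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in (1, 3, 5, 7)
        ])

    def forward(self, x):
        # padding == dilation keeps every branch at the input's spatial size,
        # so the branch outputs can be summed directly.
        return sum(branch(x) for branch in self.branches)
```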
Preferably, the bidirectional long-short time memory and feature fusion module includes a forward LSTM layer, a backward LSTM layer and a Concat layer, and feature extraction in the bidirectional long-short time memory and feature fusion module proceeds as follows:
each feature sequence s ∈ S_i is input into the bidirectional long-short time memory and feature fusion module; for each feature sequence, the forward LSTM layer and the backward LSTM layer are used to compute the hidden-layer state sequences, and the hidden states of all time steps are spliced to obtain the hidden-layer state maps B = {B_3, B_4, B_5}, where B_i denotes the hidden-layer state map of each feature map and i ∈ {3,4,5} is the index of the convolution stage; the hidden-layer sequences are mapped to the corresponding deconvolution layers, which respectively output the probability maps of scene text in the state map of each sliding window of the multi-scale context information feature map of each convolution stage, denoted:
M_d = {M_d^conv33, M_d^conv43, M_d^conv53},
where d ∈ {0, 1} denotes the two prediction categories of scene text in each state map; M_d is cropped to the same size as the input scene text image and denoted as the feature maps F_c = {F_c^conv33, F_c^conv43, F_c^conv53}, which are spliced by the Concat layer and input into two convolution layers, the first containing 512 channels with a 3 × 3 convolution kernel and the second containing 3 channels with a 1 × 1 convolution kernel.
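For orientation, the sketch below shows one way such a module could be wired up in PyTorch; it simplifies the description by treating each pixel (rather than a 3 × 3 window) as a sequence element, by using bilinear upsampling in place of the deconvolution layers, and by assuming the three stage maps share a channel count; the hidden size and all names are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMFeatureFusion(nn.Module):
    """Encodes every row of a feature map left-to-right with a bidirectional
    LSTM, upsamples the three per-stage state maps to a common size, splices
    them (Concat layer) and applies the two fusion convolutions
    (512 channels / 3x3 kernel, then 3 channels / 1x1 kernel)."""

    def __init__(self, in_channels, hidden_size=128):
        super().__init__()
        self.bilstm = nn.LSTM(in_channels, hidden_size,
                              batch_first=True, bidirectional=True)
        self.conv1 = nn.Conv2d(3 * 2 * hidden_size, 512, 3, padding=1)
        self.conv2 = nn.Conv2d(512, 3, 1)

    def encode(self, fmap):
        # fmap: (B, C, H, W) -> one left-to-right sequence per row
        b, c, h, w = fmap.shape
        seq = fmap.permute(0, 2, 3, 1).reshape(b * h, w, c)
        states, _ = self.bilstm(seq)                       # (B*H, W, 2*hidden)
        return states.reshape(b, h, w, -1).permute(0, 3, 1, 2)

    def forward(self, fmaps, out_size):
        # fmaps: the three multi-scale context feature maps f'_3, f'_4, f'_5
        encoded = [F.interpolate(self.encode(f), size=out_size,
                                 mode='bilinear', align_corners=False)
                   for f in fmaps]
        fused = torch.cat(encoded, dim=1)                  # Concat layer
        return self.conv2(torch.relu(self.conv1(fused)))   # 512 -> 3 channels
```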
Preferably, the attention-based feature integration module comprises two convolution layers, a Softmax layer, a Slice layer and three SpatialProduct layers; the attention-based feature integration module executes as follows:
the three channels of the second convolution layer correspond to the feature maps F_c^conv33, F_c^conv43 and F_c^conv53 respectively; in the Softmax layer, the weights of the feature maps F_c are mapped into a probability distribution over the values (0, 1); in the Slice layer, the probability distribution is divided into the weight maps W_m = {W_conv33, W_conv43, W_conv53}; according to the feature maps F_c, the attention-based feature integration module outputs the probability map F_out of scene text at each pixel position of the scene text image.
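A compact sketch of this integration step is given below, assuming the activation σ is a sigmoid and that the Softmax/Slice/SpatialProduct layers amount to a per-pixel softmax over the three branches followed by re-weighting and summation; this is an interpretation of the description, not its verbatim structure:

```python
import torch
import torch.nn as nn

class AttentionFeatureIntegration(nn.Module):
    """Predicts one attention weight per branch and pixel (Softmax layer),
    splits the weights per branch (Slice layer), re-weights each branch map
    (SpatialProduct layers) and sums them into the fused probability map."""

    def __init__(self):
        super().__init__()
        # two convolution layers: 512 channels / 3x3, then 3 channels / 1x1
        self.conv = nn.Sequential(
            nn.Conv2d(3, 512, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 3, 1),
        )

    def forward(self, f_c):
        # f_c: (B, 3, H, W) -- the cropped branch maps F_c^{conv33,conv43,conv53}
        weights = torch.softmax(self.conv(f_c), dim=1)   # W_m, one map per branch
        w33, w43, w53 = weights.split(1, dim=1)          # Slice layer
        f33, f43, f53 = f_c.split(1, dim=1)
        fused = w33 * f33 + w43 * f43 + w53 * f53        # SpatialProduct + sum
        return torch.sigmoid(fused)                      # sigma assumed sigmoid
```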
Preferably, the judgment condition of the text center block is as follows:
S_min ≤ S_tcb ≤ S_max,
where S_min and S_max are respectively the minimum-area threshold and the maximum-area threshold of the text center block, and S_tcb denotes the area of a candidate text center block.
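A trivial helper expressing this filter follows; the dictionary interface for candidate blocks is an assumed convenience, and the default thresholds are the values reported in the embodiment later in the description:

```python
def filter_text_center_blocks(candidates, s_min=211, s_max=81179):
    """Keep only candidate text center blocks whose area S_tcb satisfies
    s_min <= S_tcb <= s_max (each candidate assumed to carry an "area" key)."""
    return [block for block in candidates if s_min <= block["area"] <= s_max]
```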
Preferably, the labeling method for the detected text in step S6 is: for the text with the curved shape, marking a text example by adopting the outline of the text area; for a straight line text, the outline of the text region is fitted by using a minimum rectangle, and a text example is marked by using a rectangular box.
Preferably, the loss function of the text center block model and the character stroke area model in step S4 is L = −Σ_{i,j} [G_ij log(P_ij) + (1 − G_ij) log(1 − P_ij)], where G_ij is the label of the pixel at (i, j) and P_ij represents the probability that the pixel at (i, j) belongs to the foreground.
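In code, the stated pixel-wise cross-entropy could be written as follows; the clamp for numerical stability and the summed (rather than averaged) reduction are assumptions:

```python
import torch

def text_region_loss(pred, label, eps=1e-6):
    """L = -sum_{i,j} [G_ij * log(P_ij) + (1 - G_ij) * log(1 - P_ij)],
    where pred holds P (foreground probabilities) and label holds G (0/1)."""
    pred = pred.clamp(eps, 1.0 - eps)      # avoid log(0)
    loss = -(label * torch.log(pred) + (1.0 - label) * torch.log(1.0 - pred))
    return loss.sum()
```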
The invention has the beneficial effects that: the invention discloses a natural scene text detection method based on an attention mechanism, in which a context feature extraction module, a bidirectional long-short time memory and feature fusion module and an attention-based feature integration module are added on the basis of a VGG-16 backbone to construct a network model that can effectively predict text regions in scene images, so as to overcome the problems of scene text detection in the prior art: the direction information of curved text is not easy to regress, closely spaced text lines tend to adhere to each other, and multi-level feature integration produces information redundancy.
Drawings
FIG. 1 is a test flow diagram of a natural scene text detection method based on an attention mechanism;
FIG. 2 is a comparison diagram of the impact of three modules, a context feature extraction module, a two-way long-and-short term memory and feature fusion module, and an attention-based feature integration module, on detection performance;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
S1, constructing a candidate text instance prediction network, and simultaneously adding a context feature extraction module, a bidirectional long-time memory and feature fusion module and an attention-based feature integration module in the convolution feature extraction network to form the candidate text instance prediction network for natural scene character feature extraction;
the convolution feature extraction network is created based on VGG-16, the last pooling layer and three full-connection layers in the VGG-16 structure are deleted, and three modules are added: the system comprises a context feature extraction module, a bidirectional long-time memory and feature fusion module and an attention-based feature integration module; the context feature extraction module comprises four parallel cavity convolution layers, the expansion coefficients of the four cavity convolution layers are respectively 1, 3, 5 and 7, the convolution kernel size is 3 multiplied by 3, and the context feature extraction module is added behind each convolution layer in the convolution feature extraction network; the bidirectional long-short time memory and feature fusion module comprises a forward LSTM layer, a backward LSTM layer and a Concat layer; the attention-based feature integration module comprises two convolution layers, a Softmax layer, a Slice layer and three SpatialProduct layers.
S2, acquiring scene character images, and carrying out classification labeling on the scene character images to acquire a scene character image data set; the scene character image dataset comprises a scene character image and a corresponding binary label image, wherein the binary label image comprises a text center block label and a character stroke area label;
the generating method of the text center block label adopts an algebraic method, and the coordinates of each scene character in the number set are directly calculated through a zoom factor: assume that each instance polygon P in the scene text image dataset contains 2N vertices, where N ≧ 2, a set of vertices { P { (N ≧ 2) }1,…,pN,p′1,…,p′NDenotes wherein (p)i,p′i) Called a point pair, i ∈ {1, …, N }; the scaled polygon P' still has 2N vertices, using a set of vertices { P }01,…,p0N,p′01,…,p′0NRepresents it. Suppose a vertex pi、p′i、p0i、p′0iAre respectively expressed as: (x)i,yi),(x′i,y′i),(x0i,y0i),(x′0i,y′0i). Given (x)i,yi),(x′i,y′i) And a scaling factor lambda0、x0i、y0i、x′0iAnd y'0iCalculated from the following formula:
Figure BDA0003432795130000061
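Since the exact scaling formula is only available as an image, the sketch below uses one plausible interpretation: each point pair is moved symmetrically toward its midpoint so that the pair's span is scaled by λ_0; the formula, the default λ_0 and the vertex ordering are assumptions:

```python
import numpy as np

def shrink_point_pair(p, p_prime, lam0):
    """Move paired vertices toward each other so their span is scaled by lam0."""
    p, p_prime = np.asarray(p, dtype=float), np.asarray(p_prime, dtype=float)
    shift = (1.0 - lam0) / 2.0 * (p_prime - p)
    return p + shift, p_prime - shift

def text_center_block(polygon, lam0=0.5):
    """polygon: 2N vertices ordered [p_1..p_N, p'_1..p'_N]; returns the shrunk
    polygon used as the text center block label."""
    pts = np.asarray(polygon, dtype=float)
    n = len(pts) // 2
    pairs = [shrink_point_pair(pts[i], pts[n + i], lam0) for i in range(n)]
    top = [a for a, _ in pairs]
    bottom = [b for _, b in pairs]
    return np.vstack([top, bottom])
```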
the generation mode of the character stroke area label is obtained through an area growing algorithm, and for scene image characters with simple backgrounds, fewer seeds are selected to generate a stroke horizontal area; for scene image characters with complex background, more seeds need to be selected; for scene characters with complex backgrounds, if the number of the selected seeds exceeds 10 and the generated area is not perfect, in this case, the selected seeds are discarded, and other seeds are replaced to generate a stroke level area until 1-10 seeds are adopted, so that the area with the scene image characters can be completely generated.
S3, extracting the features of the scene character images through the candidate text instance prediction network constructed in the step S1, specifically comprising the following steps:
S301, extracting features of the scene text image through the convolution feature extraction network to obtain the convolution feature maps of the corresponding scene text image; the convolution feature maps are extracted from the third, fourth and fifth of the five convolution stages of the VGG-16 convolution network and denoted F = {f_3, f_4, f_5}, where F represents the convolution feature map set of the scene text image, and the last convolution layer of each of these stages is denoted L = {conv33, conv43, conv53};
S302, inputting the convolution feature maps into the context feature extraction module to obtain the corresponding multi-scale context information feature maps, denoted F' = {f'_3, f'_4, f'_5}, where F' represents the set of multi-scale context information feature maps and f'_i represents the multi-scale context information feature map of each convolution stage, with i ∈ {3,4,5};
S303, coding all the multi-scale context information feature maps of each convolution stage: a 3 × 3 sliding window is slid from left to right over each multi-scale context information feature map to obtain the corresponding feature sequence set, denoted S = {S_3, S_4, S_5}, where each element s^i_{k,t} of S_i satisfies s^i_{k,t} ∈ R^{3×3×C}, k ∈ {1, …, H_i}, t ∈ {1, …, W_i}, C denotes the number of channels, H_i and W_i are the height and width of the feature map f'_i, and i ∈ {3,4,5} is the index of the convolution stage; each feature sequence s ∈ S_i is input into the bidirectional long-short time memory and feature fusion module; for each feature sequence, the forward LSTM layer and the backward LSTM layer are used to compute the hidden-layer state sequences, and the hidden states of all time steps are spliced to obtain the hidden-layer state maps B = {B_3, B_4, B_5}, where B_i denotes the hidden-layer state map of each feature map and i ∈ {3,4,5} is the index of the convolution stage; the hidden-layer sequences are mapped to the corresponding deconvolution layers, which respectively output the probability maps of scene text in the state map of each sliding window of the multi-scale context information feature map of each convolution stage, denoted:
M_d = {M_d^conv33, M_d^conv43, M_d^conv53},
where d ∈ {0, 1} denotes the two prediction categories of scene text in each state map; M_d is cropped to the same size as the input scene text image and denoted as the feature maps F_c = {F_c^conv33, F_c^conv43, F_c^conv53}, which are spliced by the Concat layer and input into two convolution layers, the first containing 512 channels with a 3 × 3 convolution kernel and the second containing 3 channels with a 1 × 1 convolution kernel;
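The 3 × 3 sliding-window encoding of step S303 can be expressed compactly with an unfold operation, for example as below; the function name and tensor layout are assumptions:

```python
import torch.nn.functional as F

def window_sequences(fmap):
    """For a context feature map f'_i of shape (B, C, H_i, W_i), return a tensor
    of shape (B, H_i, W_i, 9*C): element [b, k, t] is the flattened 3x3 window
    centred at row k, column t (one left-to-right sequence per row)."""
    b, c, h, w = fmap.shape
    patches = F.unfold(fmap, kernel_size=3, padding=1)   # (B, 9*C, H*W)
    return patches.transpose(1, 2).reshape(b, h, w, 9 * c)
```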
S304, the three channels of the second convolution layer in step S303 correspond to the feature maps F_c^conv33, F_c^conv43 and F_c^conv53 respectively; in the Softmax layer, the weights of the feature maps F_c are mapped into a probability distribution over the values (0, 1); in the Slice layer, the probability distribution is divided into the weight maps W_m = {W_conv33, W_conv43, W_conv53}; according to the feature maps F_c, the attention-based feature integration module outputs the probability map of scene text at each pixel position of the scene text image:
F_out = σ[ Σ_{l∈L} (W_l ⊙ F_c^l + b_l) ],
where σ[·] is an activation function, ⊙ denotes element-wise multiplication, W_l and b_l are the weight and bias of the convolution layer l ∈ L for scene text, and W_l reflects the degree of attention paid to the feature map F_c^l at different positions.
S4, training a text center block model and a character stroke area model: the text center block model is trained to convergence with the text center block labels in the training set of the scene text image dataset of step S2, the character stroke area model is trained to convergence with the character stroke area labels, and fine tuning is performed on the basis of the training set to generate the text center block model and the character stroke area model;
the loss function of the text center module and the character stroke area model is L ═ sigmai,jGijlog(Pij)+(1-Gij)log(1-Pij) Wherein G isijIs the label of the pixel at (i, j), PijRepresenting the probability that the pixel at (i, j) belongs to the foreground;
S5, the test image of the scene text image to be tested is processed as in step S3 by the text center block model and the character stroke area model to compute F_out, finally generating the probability map of the text center block and the probability map of the character stroke area corresponding to the scene text image;
the test image containing scene text is input into the text center block model and the character stroke area model respectively, generating the probability map of the text center block (F_tcb) and the probability map of the word stroke area (F_wsr);
And S6, the text center blocks are judged by the area limit of the text center block, false candidate text center blocks are eliminated, the text detected in the scene text image is finally obtained, and different labeling modes are adopted for different text objects.
The judgment condition of the text center block is as follows:
S_min ≤ S_tcb ≤ S_max,
where S_min and S_max are respectively the minimum-area threshold and the maximum-area threshold of the text center block, and S_tcb denotes the area of a candidate text center block;
when judging candidate text center blocks according to the area of the text center block, S_min and S_max are set to 211 and 81179 respectively, because 99% of text center blocks have an area greater than or equal to 211 and less than 81179; the finally screened text center block and word stroke area instances are labeled; the final text area is obtained, completing the scene image text detection task. The labeling mode of the detected text is: for text of curved shape, the outline of the text area is used to label the text instance; for straight text, the outline of the text region is fitted with a minimum rectangle and the text instance is labeled with a rectangular box.
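An OpenCV sketch of this labeling step is given below; the mask/contour interface and the boolean switch between curved and straight text are assumptions:

```python
import cv2
import numpy as np

def label_text_instances(mask, straight=False):
    """Extract text instances from a binary region mask: curved text keeps the
    full region contour, straight text is fitted with a minimum-area rectangle."""
    mask = (mask > 0).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    instances = []
    for cnt in contours:
        if straight:
            box = cv2.boxPoints(cv2.minAreaRect(cnt))   # 4-point rectangle
            instances.append(box.astype(np.int32))
        else:
            instances.append(cnt.reshape(-1, 2))        # full contour
    return instances
```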
Examples
In this embodiment, a comparative experiment is performed to verify the technical effect of the scene text detection method described in the first embodiment, and the experimental environment and the experimental result are as follows:
(1) experimental Environment
The system environment is as follows: ubuntu 16.04;
hardware environment: GPU, GTX 1080Ti, memory: 512G.
(2) Experimental data set
Training data: first, the text center block model was pre-trained for 4 × 10^5 iterations using the 7200 training images of MLT2017; then the text center block model and the word stroke region model were fine-tuned for 4 × 10^5 iterations on Total-Text (1255 training images).
Test data: total-test (300 test set).
(3) Evaluation method
Curved text: the Pascal evaluation method.
In order to demonstrate the effectiveness of the invention, four groups of experiments are set up, all training the model with the same training set and evaluated on the test set of the Total-Text dataset:
the first set of experiments: the method comprises the following steps of training by using a combination of a bidirectional long-short time memory and feature fusion module and an attention-based feature integration module, marking as the bidirectional long-short time memory and feature fusion module and the attention-based feature integration module, predicting a target region by using the combination of the bidirectional long-short time memory and feature fusion module and the attention-based feature integration module, and verifying the effectiveness of a context feature extraction module;
the second set of experiments: training by using a combination of a context feature extraction module and an attention-based feature integration module, marking as the context feature extraction module and the attention-based feature integration module, predicting a target region by using the combination of the context feature extraction module and the attention-based feature integration module, and verifying the validity of a bidirectional long-short time memory and feature fusion module;
the third set of experiments: the method comprises the following steps of training by using a combination of a bidirectional long-short time memory and feature fusion module and a context feature extraction module, marking as the bidirectional long-short time memory and feature fusion module and the context feature extraction module, predicting a target region by using the combination of the context feature extraction module and the bidirectional long-short time memory and feature fusion module, and verifying the effectiveness of a feature integration module based on attention;
fourth set of experiments: the method comprises the following steps of training by using a combination of three modules, namely a bidirectional long-short time memory and feature fusion module, a context feature extraction module and an attention-based feature integration module, marking as the bidirectional long-short time memory and feature fusion module, the context feature extraction module and the attention-based feature integration module, predicting a target region by using a combination of the three modules, namely the context feature extraction module, the bidirectional long-short time memory and feature fusion module and the attention-based feature integration module, and verifying the effectiveness of the three modules as a comparison group;
setting parameters: t iswAnd TbSet to 0.55, 0.60, respectively;
validity of the context feature extraction module: in order to obtain multi-scale context information, the invention designs a context feature extraction module; from fig. 2, it can be seen that the method of "two-way long-short time memory and feature fusion module + attention-based feature integration module" reduces F-measure by 1.54% (76.99% vs. 78.53%), reduces Precision by 4.52% (74.39% vs. 78.91%), but increases Recall by 1.63% (79.79% vs. 78.16%) without using the context feature extraction module.
Validity of the bidirectional long-time and short-time memory and feature fusion module: in order to utilize the space sequence characteristics of characters in text objects (words and text lines), the invention designs a bidirectional long-time and short-time memory and feature fusion module. From fig. 2, it can be seen that the method of "context feature extraction module + attention-based feature integration module" reduces F-measure by 3.25% (75.28% vs. 78.53%), Precision by 7.19% (71.72% vs. 78.91%), but increases Recall by 1.06% (79.22% vs. 78.16%) without using the two-way long-short time memory and feature fusion module.
Effectiveness of the attention-based feature integration module: in order to enable the trained model to enhance the attention of the text region in the scene image, the invention designs an attention-based feature integration module. From fig. 2, it can be seen that the method of "context feature extraction module + two-way long-short time memory and feature fusion module" reduces F-measure by 0.71% (77.82% vs. 78.53%), reduces Recall by 1.94% (76.22% vs. 78.16%), but increases Precision by 0.51% (79.48% vs. 78.91%) without using the attention-based feature integration module.
As shown in fig. 2, the test results show the influence of the three modules of the "context feature extraction module", "two-way long-and-short-term memory and feature fusion module", and "attention-based feature integration module" on the detection performance. Comparing the first group of experiments with the fourth group of experiments, and comparing the second group of experiments with the fourth group of experiments, and finding that the two modules, namely the context feature extraction module and the two-way long-short time memory and feature fusion module, can both remarkably improve Precision of the method, but slightly reduce Recall; comparing the third set of experiments with the fourth set of experiments, it was found that the "attention-based feature integration module" module can significantly improve Recall of the method of the invention, but slightly reduce Precision; meanwhile, the results of the comparison experiment show that the three modules, namely the context feature extraction module, the two-way long-short time memory and feature fusion module and the attention-based feature integration module, have complementarity in the aspects of improving Precision, Recall and F-measure of the method.
The above embodiments are only used for illustrating the technical solution of the present invention and not for limiting the same, and it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the above embodiments may be embodied in the form of a software product, which may be stored in a non-volatile storage medium, for example: CD-ROM, usb disk, removable hard disk, etc., comprising instructions for causing a computing device, such as: a personal computer, a server, or a network appliance, etc., that performs the methods described in the various embodiments of the invention.
By adopting the technical scheme disclosed by the invention, the following beneficial effects are obtained:
the invention discloses a natural scene character detection method based on an attention mechanism, which is characterized in that a context feature extraction module, a bidirectional long-time memory and feature fusion module and an attention-based feature integration module are added on the basis of a VGG-16 framework to construct a network model capable of effectively predicting character areas in a scene image so as to overcome the problems in scene character detection in the prior art: the direction information of the bent text is not easy to return, adhesion among different text lines is close, and information redundancy is generated by multi-level feature integration.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.

Claims (7)

1. A natural scene character detection method based on an attention mechanism is characterized by comprising the following steps:
s1, constructing a candidate text instance prediction network, and simultaneously adding a context feature extraction module, a bidirectional long-time memory and feature fusion module and an attention-based feature integration module in the convolution feature extraction network to form the candidate text instance prediction network for natural scene character feature extraction;
s2, acquiring scene character images, and carrying out classification labeling on the scene character images to acquire a scene character image data set; the scene character image dataset comprises a scene character image and a corresponding binary label image, wherein the binary label image comprises a text center block label and a character stroke area label;
s3, extracting the features of the scene character images through the candidate text instance prediction network constructed in the step S1, specifically comprising the following steps:
S301, extracting features of the scene text image through a convolution feature extraction network to obtain the convolution feature maps of the corresponding scene text image; the convolution feature maps are extracted from the third, fourth and fifth of the five convolution stages of the VGG-16 convolution network and denoted F = {f_3, f_4, f_5}, where F represents the convolution feature map set of the scene text image, and the last convolution layer of each convolution stage is denoted L = {conv33, conv43, conv53};
S302, inputting the convolution feature maps into the context feature extraction module to obtain the corresponding multi-scale context information feature maps, denoted F' = {f'_3, f'_4, f'_5}, where F' represents the set of multi-scale context information feature maps and f'_i represents the multi-scale context information feature map of each convolution stage, with i ∈ {3,4,5};
S303, coding all the multi-scale context information feature maps of each convolution stage: a 3 × 3 sliding window is slid from left to right over each multi-scale context information feature map to obtain the corresponding feature sequence set, denoted S = {S_3, S_4, S_5}, where each element s^i_{k,t} of S_i satisfies s^i_{k,t} ∈ R^{3×3×C}, C represents the number of channels, H_i and W_i are the height and width of the feature map f'_i, and i ∈ {3,4,5} is the index of the convolution stage; the feature sequence set S_i is input into the bidirectional long-short time memory and feature fusion module in forward and backward order, obtaining the probability maps of scene text in the state map of each sliding window of the multi-scale context information feature map of each convolution stage, denoted:
M_d = {M_d^conv33, M_d^conv43, M_d^conv53},
where d ∈ {0, 1} denotes the two prediction categories of scene text in each state map;
S304, according to the probability distribution obtained by mapping the state maps onto the prediction category values (0, 1), a weight map W_m = {W_conv33, W_conv43, W_conv53} of scene text in the state maps is obtained, and the probability map of scene text at each pixel position of the scene text image is output as:
F_out = σ[ Σ_{l∈L} (W_l ⊙ F_c^l + b_l) ],
where σ[·] is an activation function, ⊙ denotes element-wise multiplication, W_l and b_l are the weight and bias of the convolution layer l ∈ L for scene text, and W_l reflects the degree of attention paid to the feature map F_c^l at different positions;
S4, training a text center block model and a character stroke area model: the text center block model and the character stroke area model are pre-trained to convergence with the text center block labels and the character stroke area labels in the training set of the scene text image dataset of step S2, and fine tuning is performed on the basis of the training set to generate the text center block model and the character stroke area model;
S5, the test image of the scene text image to be tested is processed as in step S3 by the text center block model and the character stroke area model to compute F_out, finally generating the probability map of the text center block and the probability map of the character stroke area corresponding to the scene text image;
and S6, judging the text center block through the area limit of the text center block, eliminating false candidate text center blocks, finally obtaining the text detected in the scene text image, and marking.
2. The attention mechanism-based natural scene character detection method according to claim 1, wherein the context feature extraction module comprises four parallel dilated (hole) convolution layers with dilation rates of 1, 3, 5 and 7 and a convolution kernel size of 3 × 3, and a context feature extraction module is added after each convolution layer in the convolution feature extraction network.
3. The attention mechanism-based natural scene text detection method according to claim 1, wherein the bidirectional long-short time memory and feature fusion module comprises a forward LSTM layer, a backward LSTM layer and a Concat layer, and the feature extraction step in the bidirectional long-short time memory and feature fusion module is as follows:
each feature sequence s ∈ S_i is input into the bidirectional long-short time memory and feature fusion module; for each feature sequence, the forward LSTM layer and the backward LSTM layer are used to compute the hidden-layer state sequences, and the hidden states of all time steps are spliced to obtain the hidden-layer state maps B = {B_3, B_4, B_5}, where B_i denotes the hidden-layer state map of each feature map and i ∈ {3,4,5} is the index of the convolution stage; the hidden-layer sequences are mapped to the corresponding deconvolution layers, which respectively output the probability maps of scene text in the state map of each sliding window of the multi-scale context information feature map of each convolution stage, denoted:
M_d = {M_d^conv33, M_d^conv43, M_d^conv53},
where d ∈ {0, 1} denotes the two prediction categories of scene text in each state map; M_d is cropped to the same size as the input scene text image and denoted as the feature maps F_c = {F_c^conv33, F_c^conv43, F_c^conv53}, which are spliced by the Concat layer and input into two convolution layers, the first containing 512 channels with a 3 × 3 convolution kernel and the second containing 3 channels with a 1 × 1 convolution kernel.
4. The attention-based natural scene text detection method of claim 3, wherein the attention-based feature integration module comprises two convolution layers, a Softmax layer, a Slice layer and three spatialProduct layers; the attention-based feature integration module comprises the following execution steps:
the three channels of the second convolution layer correspond to the feature maps F_c^conv33, F_c^conv43 and F_c^conv53 respectively; in the Softmax layer, the weights of the feature maps F_c are mapped into a probability distribution over the values (0, 1); in the Slice layer, the probability distribution is divided into the weight maps W_m = {W_conv33, W_conv43, W_conv53}; according to the feature maps F_c, the attention-based feature integration module outputs the probability map F_out of scene text at each pixel position of the scene text image.
5. The attention mechanism-based natural scene character detection method according to claim 1, wherein the judgment condition of the text center block is:
S_min ≤ S_tcb ≤ S_max,
where S_min and S_max are respectively the minimum-area threshold and the maximum-area threshold of the text center block, and S_tcb denotes the area of a candidate text center block.
6. The attention mechanism-based natural scene character detection method of claim 1, wherein the labeling manner for the detected character text in step S6 is: for the text with the curved shape, marking a text example by adopting the outline of the text area; for a straight line text, the outline of the text region is fitted by using a minimum rectangle, and a text example is marked by using a rectangular box.
7. The attention mechanism-based natural scene text detection method according to claim 1, wherein the loss function of the text center block model and the character stroke area model in step S4 is L = −Σ_{i,j} [G_ij log(P_ij) + (1 − G_ij) log(1 − P_ij)], where G_ij is the label of the pixel at (i, j) and P_ij represents the probability that the pixel at (i, j) belongs to the foreground.
CN202111603367.9A 2021-12-24 2021-12-24 Natural scene character detection method based on attention mechanism Pending CN114241470A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111603367.9A CN114241470A (en) 2021-12-24 2021-12-24 Natural scene character detection method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111603367.9A CN114241470A (en) 2021-12-24 2021-12-24 Natural scene character detection method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN114241470A true CN114241470A (en) 2022-03-25

Family

ID=80762840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111603367.9A Pending CN114241470A (en) 2021-12-24 2021-12-24 Natural scene character detection method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN114241470A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438214A (en) * 2022-11-07 2022-12-06 北京百度网讯科技有限公司 Method for processing text image, neural network and training method thereof


Similar Documents

Publication Publication Date Title
CN110738207B (en) Character detection method for fusing character area edge information in character image
US11670071B2 (en) Fine-grained image recognition
CN109919108B (en) Remote sensing image rapid target detection method based on deep hash auxiliary network
CN110322495B (en) Scene text segmentation method based on weak supervised deep learning
CN110598029B (en) Fine-grained image classification method based on attention transfer mechanism
CN111640125B (en) Aerial photography graph building detection and segmentation method and device based on Mask R-CNN
CN109784283B (en) Remote sensing image target extraction method based on scene recognition task
CN112100346B (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN113642390B (en) Street view image semantic segmentation method based on local attention network
CN111612008A (en) Image segmentation method based on convolution network
CN107784288B (en) Iterative positioning type face detection method based on deep neural network
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
CN110210431B (en) Point cloud semantic labeling and optimization-based point cloud classification method
CN112418212A (en) Improved YOLOv3 algorithm based on EIoU
CN115482418B (en) Semi-supervised model training method, system and application based on pseudo-negative labels
CN113505670A (en) Remote sensing image weak supervision building extraction method based on multi-scale CAM and super-pixels
CN113989287A (en) Urban road remote sensing image segmentation method and device, electronic equipment and storage medium
CN116977844A (en) Lightweight underwater target real-time detection method
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
CN116071331A (en) Workpiece surface defect detection method based on improved SSD algorithm
CN111739037A (en) Semantic segmentation method for indoor scene RGB-D image
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN117152427A (en) Remote sensing image semantic segmentation method and system based on diffusion model and knowledge distillation
CN114972947A (en) Depth scene text detection method and device based on fuzzy semantic modeling
CN116486393A (en) Scene text detection method based on image segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination