CN113723352A - Text detection method, system, storage medium and electronic equipment - Google Patents
- Publication number
- CN113723352A, CN202111069214.0A
- Authority
- CN
- China
- Prior art keywords
- feature map
- inputting
- attention
- text
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The embodiments of the invention provide a text detection method, a text detection system, a storage medium and electronic equipment, which can be applied in the field of artificial intelligence or the field of finance. The method comprises the following steps: extracting features of an image to be detected with an attention pyramid network model to obtain an attention pyramid feature map; selecting candidate boxes from the attention pyramid feature map with a region proposal network to obtain text candidate boxes; and inputting the attention pyramid feature map and the position information of the text candidate boxes into a Faster R-CNN model for candidate box classification prediction, so as to judge whether the area selected by each text candidate box is a text area and obtain a text detection result. The attention pyramid network model performs saliency detection on the text in the image to be detected, highlighting the text while suppressing background information; this reduces interference from the background, improves the representation capability of the features, and improves the accuracy of text detection.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a text detection method, a text detection system, a storage medium, and an electronic device.
Background
With the rapid development of computer technologies and mobile devices, a large number of practical applications need to acquire the high-level semantic information contained in scene text. Text detection in a bill image aims to accurately locate the text areas of the bill in the image, and the detection result directly affects the final recognition effect. However, because the text in a freely shot bill image is sparse, the large background area makes the real text difficult to detect, and text detection accuracy is low.
Disclosure of Invention
The embodiment of the invention aims to provide a text detection method, a text detection system, a storage medium and electronic equipment, which can improve the accuracy of text detection. The specific technical scheme is as follows:
the invention provides a text detection method, which comprises the following steps:
acquiring an image to be detected;
extracting features of the image to be detected by adopting an attention pyramid network model to obtain an attention pyramid feature map;
selecting candidate boxes from the attention pyramid feature map by adopting a region proposal network to obtain text candidate boxes;
inputting the attention pyramid feature map and the position information of the text candidate boxes into a Faster R-CNN model to perform candidate box classification prediction processing, so as to judge whether the area selected by each text candidate box is a text area and obtain a text detection result.
Optionally, the attention pyramid network model includes a ResNet101 network, a global average pooling layer, a first residual module, a channel attention module, and a second residual module; the ResNet101 network comprises a top layer convolution unit, a middle layer convolution unit and a bottom layer convolution unit;
the extracting features of the image to be detected by adopting an attention pyramid network model to obtain an attention pyramid feature map specifically comprises:
inputting the image to be detected into the bottom layer convolution unit for feature extraction to obtain a bottom layer feature map; inputting the bottom layer feature map into the middle layer convolution unit for feature extraction to obtain a middle layer feature map; inputting the middle layer feature map into the top layer convolution unit to obtain a top layer feature map;
inputting the top layer feature map into a global average pooling layer to obtain a pooling processing result;
inputting the top-level feature map into a first residual error module corresponding to the top-level convolution unit to obtain a first residual error feature map;
inputting the pooling processing result and the first residual error feature map into a channel attention module corresponding to the top convolution unit for weight adjustment to obtain a first channel attention feature map;
inputting the first channel attention feature map into a second residual error module corresponding to the top layer convolution unit to obtain a second residual error feature map;
inputting the middle layer feature map into a first residual error module corresponding to the middle layer convolution unit to obtain a third residual error feature map;
inputting the second residual error feature map and the third residual error feature map into a channel attention module corresponding to the middle layer convolution unit for weight adjustment to obtain a second channel attention feature map;
and inputting the second channel attention feature map into a second residual error module corresponding to the middle layer convolution unit to obtain an attention pyramid feature map.
Optionally, the inputting the top-level feature map into a first residual error module corresponding to the top-level convolution unit to obtain a first residual error feature map specifically includes:
inputting the top layer feature map into a 1 × 1 convolutional layer for channel merging processing to obtain a merging result;
inputting the merging result into a 3 × 3 convolutional layer for size amplification processing to obtain an amplification processing result;
inputting the amplification processing result into a Batch Norm layer for batch normalization processing to obtain a normalization processing result;
inputting the normalization processing result into a ReLU function, and performing size reduction processing on the output of the ReLU function through a 3 × 3 convolutional layer to obtain a reduction processing result;
and inputting the sum of the top layer feature map and the reduction processing result into a ReLU function to obtain a first residual feature map.
Optionally, the inputting the second residual feature map and the third residual feature map into a channel attention module corresponding to the middle layer convolution unit for weight adjustment to obtain a second channel attention feature map specifically includes:
merging the second residual error feature map and the third residual error feature map to obtain a merged feature map;
inputting the combined feature map into a global pooling layer for compression processing to obtain a compressed feature map;
inputting the compressed feature map into a 1 × 1 convolutional layer for processing, and inputting the processed result into a ReLU function to obtain an output result;
inputting the output result into a 1 × 1 convolutional layer for processing, and inputting the processed result into a Sigmoid function to obtain an attention vector;
and carrying out weight adjustment on the second residual error feature map by using the attention vector to obtain a second channel attention feature map.
Optionally, the inputting the output result into a 1 × 1 convolutional layer for processing, and inputting the processed result into a Sigmoid function to obtain an attention vector specifically includes:
inputting the output result into a 1 × 1 convolutional layer to perform feature map channel summation processing to obtain a score map;
determining text prediction probability by using the scores in the score map;
and obtaining an attention vector by using a Sigmoid function according to the text prediction probability and the text expectation probability.
Optionally, the performing weight adjustment on the second residual feature map by using the attention vector to obtain a second channel attention feature map specifically includes:
performing product operation on the attention vector and the second residual error feature map to obtain a product operation result;
and performing summation operation on the product operation result and the combined feature map to obtain a second channel attention feature map.
Optionally, the inputting the attention pyramid feature map and the candidate box information into a Faster R-CNN model to perform candidate box classification prediction processing to determine whether the text candidate box is a text region, so as to obtain a text detection result, specifically includes:
inputting the attention pyramid feature map and the candidate box information into a Faster R-CNN model to perform candidate box classification prediction processing to obtain a detection box;
and carrying out NMS duplicate removal processing on the detection box to obtain a text detection result.
The present invention also provides a text detection system, comprising:
the image acquisition module is used for acquiring an image to be detected;
the feature extraction module is used for extracting features of the image to be detected by adopting an attention pyramid network model to obtain an attention pyramid feature map;
the candidate box generation module is used for selecting candidate boxes from the attention pyramid feature map by adopting a region proposal network to obtain text candidate boxes;
and the text detection module is used for inputting the attention pyramid feature map and the position information of the text candidate box into a Faster R-CNN model to perform candidate box classification prediction processing so as to judge whether the selected area of the text candidate box is a text area or not and obtain a text detection result.
The present invention also provides a computer-readable storage medium having a program stored thereon, which when executed by a processor implements the text detection method described above.
The present invention also provides an electronic device comprising:
at least one processor, and at least one memory, bus connected with the processor;
the processor and the memory complete mutual communication through the bus; the processor is used for calling the program instructions in the memory to execute the text detection method.
In the text detection method, the text detection system, the storage medium and the electronic device, an attention pyramid network model is adopted to extract features of an image to be detected to obtain an attention pyramid feature map; a region proposal network selects candidate boxes from the attention pyramid feature map to obtain text candidate boxes; and the attention pyramid feature map and the position information of the text candidate boxes are input into a Faster R-CNN model for candidate box classification prediction, so as to judge whether the area selected by each text candidate box is a text area and obtain a text detection result. The attention pyramid network model performs saliency detection on the text in the image to be detected, highlighting the text while suppressing background information; this reduces interference from the background, improves the representation capability of the features, and improves the accuracy of text detection.
Of course, it is not necessary for any product or method of practicing the invention to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a text detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an attention pyramid network structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a refined residual block according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a channel attention module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a text detection process according to an embodiment of the present invention;
FIG. 6 is a block diagram of a text detection system according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a text detection method. As shown in FIG. 1, the method comprises:
Step 101: Acquire an image to be detected.
Step 102: Extract features of the image to be detected by adopting an attention pyramid network model to obtain an attention pyramid feature map.
The structure of the Attention Pyramid Network (APN) is shown in FIG. 2. The attention pyramid network model includes a ResNet101 network, a Global Average Pooling (GAP) layer, two Refined Residual Blocks (RRB), namely a first residual module and a second residual module, and a Channel Attention Block (CAB). The CAB uses an attention mechanism to adaptively recalibrate the channel weights of the feature map at each stage, which enhances the representation capability of the target features and yields more discriminative features. An RRB is added to the feature layer of each stage to merge channel information, and it introduces a residual block to optimize the network and thereby refine the feature map.
The ResNet101 network has five groups of convolutions: a bottom layer convolution unit conv1, a middle layer convolution unit, and a top layer convolution unit Res5. The middle layer convolution unit has three layers, namely Res2, Res3 and Res4. The five groups of convolutions comprise 101 layers in total. The input size of conv1 is 224 × 224 and the output size of Res5 is 7 × 7: each convolution stage reduces the feature map size by a factor of 2, so the five groups together reduce it by a factor of 32. Finally, to address the network's lack of global information, the APN introduces a global average pooling layer (GAP) at the top of the ResNet101 network, which provides global context information and ensures a high-level consistency constraint, completing the network construction.
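The size arithmetic above can be checked with a short sketch. It follows the description's statement that each of the five stages halves the side length; the function name `stage_sizes` and its parameters are illustrative assumptions, not part of the patent.

```python
def stage_sizes(input_size=224, num_stages=5, factor=2):
    """Return the spatial side length after each downsampling stage."""
    sizes = []
    size = input_size
    for _ in range(num_stages):
        size //= factor  # each convolution group reduces the side length by `factor`
        sizes.append(size)
    return sizes


names = ["conv1", "Res2", "Res3", "Res4", "Res5"]
sizes = stage_sizes()
for name, size in zip(names, sizes):
    print(f"{name}: {size} x {size}")

# Overall reduction from the 224 x 224 input to the Res5 output.
total_stride = 224 // sizes[-1]
print("total reduction factor:", total_stride)
```

Five halvings give 2^5 = 32, which matches the 224 → 7 figures in the description.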
The attention fusion module in the APN structure comprises the refined residual block RRB and the channel attention block CAB. The RRB acts as a lateral connection: it merges channel information and introduces a residual block to optimize the network, thereby refining the feature map. The CAB works at the level of feature map channels: it uses an attention mechanism to fuse the features of adjacent stages and correct the weight of each channel, producing more discriminative features for the subsequent text detection task.
1) Input the image to be detected into the bottom layer convolution unit for feature extraction to obtain a bottom layer feature map; input the bottom layer feature map into the middle layer convolution unit for feature extraction to obtain a middle layer feature map; and input the middle layer feature map into the top layer convolution unit to obtain a top layer feature map.
Specifically, the image to be detected is input into conv1 for feature extraction to obtain the bottom layer feature map; the bottom layer feature map is input into Res2 to obtain a first middle layer feature map; the first middle layer feature map is input into Res3 to obtain a second middle layer feature map; the second middle layer feature map is input into Res4 to obtain a third middle layer feature map; and the third middle layer feature map is input into the top layer convolution unit Res5 to obtain the top layer feature map.
2) Input the top layer feature map into the global average pooling layer (GAP) to obtain a pooling processing result.
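As a minimal illustration of this step, global average pooling reduces every channel of the top layer feature map to its mean value. The nested-list layout (channels, then rows, then columns) and the function name are assumptions made for this sketch.

```python
def global_average_pool(feature_map):
    """feature_map: list of C channels, each an H x W nested list.

    Returns a list with one mean value per channel.
    """
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Toy top layer feature map with 2 channels of size 2 x 2.
top = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
context = global_average_pool(top)
print(context)
```

Each output value summarizes a whole channel, which is why the description treats the pooling result as global context information.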
3) And inputting the top-level feature map into a first residual error module corresponding to the top-level convolution unit to obtain a first residual error feature map.
As shown in fig. 3, the steps specifically include:
The top layer feature map is input into a 1 × 1 convolutional layer (1 × 1 conv) for channel merging processing to obtain a merging result; the 1 × 1 convolutional layer merges the information of all channels and fixes the number of feature map channels coming from the CNN at 512. The merging result is input into a 3 × 3 convolutional layer (3 × 3 conv) for size amplification processing to obtain an amplification processing result. The amplification processing result is input into a Batch Norm layer for batch normalization processing to obtain a normalization processing result (the batch normalization layer keeps the data distribution consistent, and the ReLU accelerates training). The normalization processing result is input into a ReLU function, and the output is passed through a 3 × 3 convolutional layer for size reduction processing to obtain a reduction processing result. Finally, the top layer feature map and the reduction processing result are summed (sum), and the sum is input into a ReLU function to obtain the first residual feature map.
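The order of operations in the refined residual block can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the patented implementation: both 3 × 3 convolutions are modeled as size-preserving (zero padding) because the description gives no amplification or reduction factors, one shared 3 × 3 kernel is applied per channel instead of a full channel-mixing convolution, batch normalization has no learned scale or shift, and the 1 × 1 convolution keeps the channel count so that the sum with the block input is well defined. All function names are hypothetical.

```python
import math


def conv1x1(x, weights):
    """x: C_in channels of H x W; weights: C_out rows of C_in coefficients."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wk[c] * x[c][i][j] for c in range(len(x))) for j in range(w)]
             for i in range(h)] for wk in weights]


def conv3x3_same(x, kernel):
    """Apply one shared 3 x 3 kernel to every channel with zero padding."""
    h, w = len(x[0]), len(x[0][0])
    out = []
    for ch in x:
        res = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernel[di + 1][dj + 1] * ch[ii][jj]
                res[i][j] = acc
        out.append(res)
    return out


def batch_norm(x, eps=1e-5):
    """Normalize each channel to zero mean and unit variance."""
    out = []
    for ch in x:
        vals = [v for row in ch for v in row]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        out.append([[(v - mean) / math.sqrt(var + eps) for v in row] for row in ch])
    return out


def relu(x):
    return [[[max(0.0, v) for v in row] for row in ch] for ch in x]


def add(a, b):
    return [[[va + vb for va, vb in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def refined_residual_block(x, w1x1, k_a, k_b):
    merged = conv1x1(x, w1x1)        # 1 x 1 conv: channel merging
    y = conv3x3_same(merged, k_a)    # first 3 x 3 conv
    y = relu(batch_norm(y))          # Batch Norm, then ReLU
    y = conv3x3_same(y, k_b)         # second 3 x 3 conv
    return relu(add(x, y))           # sum with the block input, then ReLU


# Toy usage: identity weights make the data flow easy to follow.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
     [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]]
out = refined_residual_block(x, [[1, 0], [0, 1]], identity, identity)
```

With identity weights the residual branch only adds the non-negative normalized activations, so every output value is at least as large as the corresponding input value.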
4) And inputting the pooling result and the first residual error feature map into a channel attention module corresponding to the top convolution unit for weight adjustment to obtain a first channel attention feature map.
As shown in FIG. 4, the channel attention block CAB combines the features of adjacent stages, taking the pooling processing result and the first residual feature map as input so as to make full use of the differences between stages. The CAB first concatenates (concat) the features of the high stage (the pooling processing result) and the low stage (the first residual feature map), which explicitly establishes the dependency between channels. It then compresses the concatenated feature map with a global pooling layer (global pool) to generate per-channel statistics, applies two 1 × 1 convolutions (1 × 1 conv) with a ReLU function between them to reduce model complexity and aid generalization, and uses a Sigmoid function to learn the dependency between channels and obtain an attention vector. The attention vector is then used to adjust the weights of the low-stage feature channels: the attention vector is multiplied (mul) with the first residual feature map, and the product is summed (sum) with the pooling processing result to obtain the first channel attention feature map.
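The channel attention computation described above can be sketched as follows. This is a minimal sketch under assumptions: the high-stage features are taken to have already been brought to the same shape as the low-stage features, the two 1 × 1 convolutions act on a 1 × 1 map and are therefore written as matrix-vector products, and the weight matrices `w1` and `w2` are hypothetical placeholders for learned parameters.

```python
import math


def channel_attention_block(high, low, w1, w2):
    """high, low: lists of C channels (H x W each).

    w1: C x 2C matrix (first 1 x 1 conv), w2: C x C matrix (second 1 x 1 conv).
    Returns the attention vector and the channel attention feature map.
    """
    def gap(fm):
        return [sum(v for row in ch for v in row) / (len(ch) * len(ch[0])) for ch in fm]

    # Concatenate along channels, then compress each channel to one statistic.
    stats = gap(high + low)
    # First 1 x 1 convolution followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, stats))) for row in w1]
    # Second 1 x 1 convolution followed by Sigmoid: one weight in (0, 1) per channel.
    attention = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
                 for row in w2]
    # Multiply the low-stage channels by the attention weights, then add the high stage.
    out = [[[a * lv + hv for lv, hv in zip(lrow, hrow)] for lrow, hrow in zip(lch, hch)]
           for a, lch, hch in zip(attention, low, high)]
    return attention, out


# Toy usage with a single channel of size 1 x 2 and all-zero weights,
# so the Sigmoid output is exactly 0.5.
high = [[[1.0, 1.0]]]
low = [[[2.0, 4.0]]]
attention, fused = channel_attention_block(high, low, w1=[[0.0, 0.0]], w2=[[0.0]])
print(attention, fused)
```

Because the Sigmoid output lies strictly between 0 and 1, the block can only attenuate low-stage channels relative to one another before the high-stage features are added back.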
5) And inputting the first channel attention feature map into a second residual error module corresponding to the top layer convolution unit to obtain a second residual error feature map.
As shown in fig. 3, the generation method of the second residual feature map is similar to the generation method of the first residual feature map, and is not repeated.
6) And inputting the middle layer feature map into a first residual error module corresponding to the middle layer convolution unit to obtain a third residual error feature map.
Inputting the third intermediate layer feature map into a first residual error module corresponding to the third intermediate layer convolution unit to obtain a third residual error feature map;
inputting the second intermediate layer feature map into a first residual error module corresponding to the second intermediate layer convolution unit to obtain a second residual error feature map;
and inputting the first intermediate layer feature map into a first residual error module corresponding to the first intermediate layer convolution unit to obtain a first residual error feature map.
7) And inputting the second residual error feature map and the third residual error feature map into a channel attention module corresponding to the middle layer convolution unit for weight adjustment to obtain a second channel attention feature map.
The method specifically comprises the following steps:
merging the second residual feature map and the third residual feature map to obtain a merged feature map; inputting the merged feature map into a global pooling layer for compression processing to obtain a compressed feature map; inputting the compressed feature map into a 1 × 1 convolutional layer for processing, and inputting the processed result into a ReLU function to obtain an output result; inputting the output result into a 1 × 1 convolutional layer for processing, and inputting the processed result into a Sigmoid function to obtain an attention vector; and performing weight adjustment on the second residual feature map by using the attention vector to obtain a second channel attention feature map.
Optionally, performing weight adjustment on the second residual feature map by using the attention vector to obtain a second channel attention feature map, which specifically includes:
performing product operation on the attention vector and the second residual error feature map to obtain a product operation result; and performing summation operation on the product operation result and the combined feature map to obtain a second channel attention feature map.
Optionally, inputting the output result into the 1 × 1 convolutional layer for processing, and inputting the processed result into a Sigmoid function to obtain the attention vector, which specifically includes:
inputting the output result into a 1 × 1 convolutional layer to perform feature map channel summation processing to obtain a score map; determining text prediction probability by using the scores in the score map; and obtaining the attention vector by using a Sigmoid function according to the text prediction probability and the text expectation probability.
The CAB aims to integrate the features of adjacent stages to compute channel attention vectors, and changes the weight of the features at each stage to improve feature consistency. After the APN is extended to an FCN (fully convolutional network) architecture, the convolution operation outputs a score map that gives the probability of each pixel for each category, as in formula (1-1); the final score y_n of the score map is simply the sum over the channels of all the feature maps.
In formula (1-1), x is the feature output by the network, k is the convolution kernel, n ∈ {1, 2, …, N}, N is the number of channels, and D is the set of pixel positions (i denotes rows, j denotes columns). Formula (1-1) implicitly assumes that the weights of the different channels are equal. The channel attention weight is calculated as shown in formulas (1-2) and (1-3). In formula (1-2), δ is the prediction probability, y is the network output, and N is the total number of columns.
The final predicted label is the category with the highest probability, obtained from formulas (1-2) and (1-3). Suppose the prediction result is y_0 while the true label is y_1; then, as in formula (1-3), an attention weight parameter α is introduced to correct the highest-probability value from y_0 to y_1.
In formula (1-3), ȳ represents the new prediction of the network and α = Sigmoid(x; k). To obtain consistent and accurate predictions, discriminative features should be extracted and non-discriminative features suppressed. Accordingly, in formula (1-3) the value α is the attention weight applied to the feature map x, which means that the CAB performs attention-based feature selection. In this way the network acquires discriminative features stage by stage, which ensures the consistency of the predicted categories.
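Formulas (1-1) to (1-3) are not reproduced in this text. The block below is a hedged reconstruction inferred only from the surrounding symbol definitions (x, k, D, δ, α, ȳ), writing N for the number of channels; the standard softmax form for (1-2) and the element-wise weighting for (1-3) are assumptions and may differ from the formulas in the original filing.

```latex
% (1-1), assumed form: score of channel n as a sum over the pixel positions D
y_n = F(x; k) = \sum_{(i,j) \in D} k_{i,j}\, x_{i,j},
\qquad n \in \{1, 2, \dots, N\}
\tag{1-1}

% (1-2), assumed form: softmax prediction probability of channel n
\delta(y_n) = \frac{\exp(y_n)}{\sum_{m=1}^{N} \exp(y_m)}
\tag{1-2}

% (1-3), assumed form: attention-weighted prediction
\bar{y}_n = \alpha_n\, y_n,
\qquad \alpha = \mathrm{Sigmoid}(x; k)
\tag{1-3}
```

Under this reading, (1-1) weights all channels equally, while (1-3) rescales each channel score by a learned factor in (0, 1), which is how the ranking of y_0 and y_1 can change.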
To refine the features more accurately, a deep supervision method is adopted to obtain better performance and make the network easier to optimize. In the APN of the invention, a Softmax loss function supervises the sampled output of every stage except the global average pooling layer, as shown in formula (1-4).
L=SoftmaxLoss(y;k) (1-4)
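A minimal sketch of deep supervision with a softmax loss is given below. The description only states that each stage's output is supervised with formula (1-4); adding the per-stage losses into one total, as well as the function names, are assumptions of this sketch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(scores, label):
    """Cross-entropy of the softmax probabilities against the true class index."""
    return -math.log(softmax(scores)[label])


def deep_supervision_loss(stage_scores, label):
    """Sum the softmax loss over the outputs of all supervised stages (assumed)."""
    return sum(softmax_loss(scores, label) for scores in stage_scores)


# Two stages, two classes (background = 0, text = 1), uninformative scores.
total = deep_supervision_loss([[0.0, 0.0], [0.0, 0.0]], label=1)
print(total)  # each stage contributes ln 2
```

A stage whose scores favor the true class contributes a smaller loss, so every supervised stage is pushed toward the correct prediction rather than only the final output.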
8) And inputting the second channel attention feature map into a second residual error module corresponding to the middle layer convolution unit to obtain an attention pyramid feature map.
And inputting the second channel attention feature map into a second residual error module corresponding to the third middle layer convolution unit to obtain a first attention pyramid feature map.
It should be noted that, because there are three intermediate layer feature maps, three attention pyramid feature maps, namely, a first attention pyramid feature map, a second attention pyramid feature map, and a third attention pyramid feature map, can be obtained.
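The data flow of steps 1) to 8) can be traced symbolically. In the sketch below, `rrb` and `cab` are hypothetical stand-ins that only record which operation was applied; they do no numerical work. Feeding each stage's output into the channel attention block of the next lower stage is stated in the description for the top and middle layers and is assumed here to repeat for all three middle layers.

```python
def rrb(index, x):
    """Stand-in for the first (index 1) or second (index 2) residual module."""
    return f"RRB{index}({x})"


def cab(high, low):
    """Stand-in for the channel attention block fusing adjacent stages."""
    return f"CAB({high}, {low})"


def attention_pyramid(stages):
    """stages: backbone outputs ordered from top to bottom, e.g. Res5 first."""
    high = f"GAP({stages[0]})"  # global context from the top layer feature map
    outputs = []
    for feat in stages:
        low = rrb(1, feat)                # first residual module of this stage
        high = rrb(2, cab(high, low))     # attention fusion, then second residual module
        outputs.append(high)
    return outputs


maps = attention_pyramid(["Res5", "Res4", "Res3", "Res2"])
pyramid = maps[1:]  # one attention pyramid feature map per middle layer
for m in pyramid:
    print(m)
```

The first entry corresponds to the second residual feature map of the top layer, and the remaining three entries correspond to the three attention pyramid feature maps mentioned above.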
Step 103: Select candidate boxes from the attention pyramid feature map by adopting a region proposal network to obtain text candidate boxes.
As shown in FIG. 5, after the attention pyramid network extracts features from the image to be detected, the resulting attention pyramid features are input to both the region proposal network and the Faster R-CNN model. Anchor boxes are input to the region proposal network together with the attention pyramid features; after binary text/non-text classification and rectangular bounding box regression, refined rectangular text candidate boxes are obtained. The method uses the region proposal network to generate candidate boxes from the attention pyramid feature map output by the APN, and extracts the corresponding effective RoI features for each candidate box.
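The anchor boxes fed to the region proposal network are typically laid out on a regular grid over the feature map. The sketch below shows one common way to generate them; the stride, scale and aspect ratio values are illustrative assumptions, since the description does not specify them.

```python
def generate_anchors(feat_h, feat_w, stride, scales=(32,), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred on every feature map cell.

    stride: image pixels per feature map cell; ratio = height / width.
    """
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx = (j + 0.5) * stride
            cy = (i + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale / ratio ** 0.5
                    h = scale * ratio ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# A 2 x 3 feature map with stride 16 yields 2 * 3 * 3 = 18 anchors.
anchors = generate_anchors(2, 3, stride=16)
print(len(anchors), anchors[1])
```

The region proposal network then scores each anchor as text or non-text and regresses offsets that turn the anchor into a refined candidate box.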
Step 104: Input the attention pyramid feature map and the position information of the text candidate boxes into a Faster R-CNN model for candidate box classification prediction, so as to judge whether the region framed by each text candidate box is a text region and obtain a text detection result.
A classifier in the Fast R-CNN module classifies each extracted RoI and judges whether it is text, and the corrected text candidate box is output directly as the text detection result.
Specifically, the attention pyramid feature map and the candidate box information are input into the Faster R-CNN model for candidate box classification prediction to obtain detection boxes, and non-maximum suppression (NMS) is applied to the detection boxes to remove duplicates and obtain the text detection result.
As shown in fig. 5, after the attention pyramid feature map and the candidate box information are input into the Faster R-CNN model, the model output undergoes binary text classification and quadrilateral candidate box regression, and NMS de-duplication is applied to the resulting detection boxes to obtain the text detection result. The invention thus performs finer classification and bounding box regression on the text candidate boxes detected in step 103: the classification task learns to judge whether a candidate box is a text region or a background region, the regression task learns to regress the position of the quadrilateral bounding box, and finally NMS de-duplication is applied to the candidate boxes to obtain the final text prediction result.
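The NMS de-duplication step can be sketched as follows. This is a minimal greedy NMS under stated assumptions, not the patented implementation: the description regresses quadrilateral boxes, whereas the sketch uses axis-aligned rectangles given as (x1, y1, x2, y2) to keep the overlap computation short, and the IoU threshold of 0.5 is a hypothetical value that the source does not specify.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping a kept one.

    Returns the indices of the kept boxes, ordered by descending score.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # indices of the boxes that survive
```

For quadrilateral boxes, the same greedy loop would apply with `iou` replaced by a polygon overlap computation.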
The invention is based on the attention-mechanism feature extraction model APN. It uses the attention mechanism to extract more discriminative features on top of the ResNet101 base model, provides saliency detection for text regions, highlights text information, suppresses background information, and reduces false positives caused by text-like background interference. The channel attention module uses an attention mechanism to fuse the features of adjacent stages and corrects the weight of each channel from the perspective of the feature map channels, so as to obtain more discriminative features. The residual refinement module acts as a lateral connection that merges channel information, and it introduces a residual block to optimize the network and refine the feature map.
As an optional embodiment, the method performs bank bill text detection based on the attention pyramid network. It combines the feature pyramid structure with an attention mechanism, adjusts the weight parameters at the text feature level by using channel attention vectors, and guides the combination of high-stage and low-stage features to enhance feature consistency, improve feature representation capability, and select better bill text features, thereby improving the text detection result. This addresses the low detection accuracy that arises because, in freely photographed bank bill images, the bill text is sparse and the large background areas make it difficult to detect the real text.
The present invention also provides a text detection system. As shown in fig. 6, the system includes:
The image obtaining module 601 is configured to obtain an image to be detected.
The feature extraction module 602 is configured to perform feature extraction on the image to be detected by using the attention pyramid network model, so as to obtain an attention pyramid feature map.
The attention pyramid network model comprises a ResNet101 network, a global average pooling layer, a first residual module, a channel attention module and a second residual module; the ResNet101 network includes a top layer convolution unit, a middle layer convolution unit and a bottom layer convolution unit.
The feature extraction module 602 specifically includes:
a feature extraction unit, configured to input the image to be detected into the bottom-layer convolution unit for feature extraction to obtain a bottom-layer feature map; input the bottom-layer feature map into the middle-layer convolution unit for feature extraction to obtain a middle-layer feature map; and input the middle-layer feature map into the top-layer convolution unit to obtain a top-layer feature map;
a pooling processing unit, configured to input the top-layer feature map into the global average pooling layer to obtain a pooling processing result;
a first residual feature map generating unit, configured to input the top-layer feature map into the first residual module corresponding to the top-layer convolution unit to obtain a first residual feature map;
a first channel attention feature map generating unit, configured to input the pooling processing result and the first residual feature map into the channel attention module corresponding to the top-layer convolution unit for weight adjustment to obtain a first channel attention feature map;
a second residual feature map generating unit, configured to input the first channel attention feature map into the second residual module corresponding to the top-layer convolution unit to obtain a second residual feature map;
a third residual feature map generating unit, configured to input the middle-layer feature map into the first residual module corresponding to the middle-layer convolution unit to obtain a third residual feature map;
a second channel attention feature map generating unit, configured to input the second residual feature map and the third residual feature map into the channel attention module corresponding to the middle-layer convolution unit for weight adjustment to obtain a second channel attention feature map; and
an attention pyramid feature map generating unit, configured to input the second channel attention feature map into the second residual module corresponding to the middle-layer convolution unit to obtain an attention pyramid feature map.
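The data flow between the units listed above can be sketched as a single forward function. This is only a wiring sketch: each module is passed in as a callable, the dictionary keys are hypothetical names chosen for illustration, and a single middle-layer stage is shown. The internals of the convolution, pooling, residual, and channel attention modules are deliberately left out.

```python
def attention_pyramid_forward(image, modules):
    """Wire the units of the feature extraction module in the order described.

    `modules` maps hypothetical names to callables supplied by the caller.
    """
    # Backbone: bottom -> middle -> top feature maps.
    bottom = modules["bottom_conv"](image)
    middle = modules["middle_conv"](bottom)
    top = modules["top_conv"](middle)

    # Top stage: global pooling, first residual module, channel attention,
    # then the second residual module.
    pooled = modules["global_avg_pool"](top)
    first_residual = modules["top_residual_1"](top)
    first_attention = modules["top_channel_attention"](pooled, first_residual)
    second_residual = modules["top_residual_2"](first_attention)

    # Middle stage: its own first residual module, then channel attention
    # fed by the top-stage result and the middle-stage result.
    third_residual = modules["middle_residual_1"](middle)
    second_attention = modules["middle_channel_attention"](
        second_residual, third_residual
    )
    return modules["middle_residual_2"](second_attention)


if __name__ == "__main__":
    # String-tracing stubs make the resulting call graph visible.
    names = [
        "bottom_conv", "middle_conv", "top_conv", "global_avg_pool",
        "top_residual_1", "top_channel_attention", "top_residual_2",
        "middle_residual_1", "middle_channel_attention", "middle_residual_2",
    ]
    stubs = {n: (lambda *xs, n=n: n + "(" + ",".join(xs) + ")") for n in names}
    print(attention_pyramid_forward("img", stubs))
```

Passing the modules in as callables keeps the sketch independent of any particular deep learning framework; with several middle-layer convolution units, the middle-stage steps would presumably repeat once per unit.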
Wherein:
The first residual feature map generating unit is specifically configured to: input the top-layer feature map into a 1 × 1 convolutional layer for channel merging to obtain a merging result; input the merging result into a 3 × 3 convolutional layer for size amplification to obtain an amplification result; input the amplification result into a Batch Norm layer for batch normalization to obtain a normalization result; input the normalization result into a ReLU function, and perform size reduction on the result through a 3 × 3 convolutional layer to obtain a reduction result; and input the sum of the top-layer feature map and the reduction result into a ReLU function to obtain the first residual feature map.
The second channel attention feature map generating unit is specifically configured to:
merge the second residual feature map and the third residual feature map to obtain a merged feature map; input the merged feature map into a global pooling layer for compression to obtain a compressed feature map; input the compressed feature map into a 1 × 1 convolutional layer, and input the processed result into a ReLU function to obtain an output result; input the output result into a 1 × 1 convolutional layer, and input the processed result into a Sigmoid function to obtain an attention vector; and adjust the weights of the second residual feature map by using the attention vector to obtain the second channel attention feature map.
Optionally, inputting the output result into the 1 × 1 convolutional layer and inputting the processed result into the Sigmoid function to obtain the attention vector specifically includes: inputting the output result into a 1 × 1 convolutional layer to sum over the feature map channels and obtain a score map; determining a text prediction probability from the scores in the score map; and obtaining the attention vector by using a Sigmoid function according to the text prediction probability and the text expectation probability.
Optionally, adjusting the weights of the second residual feature map by using the attention vector to obtain the second channel attention feature map specifically includes: multiplying the attention vector by the second residual feature map to obtain a product result; and summing the product result and the merged feature map to obtain the second channel attention feature map.
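The channel attention steps described above (merge, global pooling, 1 × 1 convolution, ReLU, 1 × 1 convolution, Sigmoid, product, sum) can be sketched in plain Python as follows. Several points are assumptions made so that the shapes agree: a feature map is a nested list indexed as [channel][row][column], "merging" is treated as element-wise addition (channel concatenation is another plausible reading and would need an extra projection), and a 1 × 1 convolution applied to a pooled vector is written as a matrix-vector product with caller-supplied weights `w1` and `w2`. The optional score-map variant is not modeled.

```python
import math


def global_avg_pool(fmap):
    """Compress a C x H x W feature map into a length-C vector."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def conv1x1(vec, weights):
    """On a pooled vector, a 1x1 convolution reduces to a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def relu(vec):
    return [max(0.0, v) for v in vec]


def sigmoid(vec):
    return [1.0 / (1.0 + math.exp(-v)) for v in vec]


def add_maps(a, b):
    """Element-wise sum of two feature maps of identical shape."""
    return [[[x + y for x, y in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def channel_attention(second_residual, third_residual, w1, w2):
    """Return (second channel attention feature map, attention vector)."""
    merged = add_maps(second_residual, third_residual)  # assumed merge rule
    hidden = relu(conv1x1(global_avg_pool(merged), w1))
    attention = sigmoid(conv1x1(hidden, w2))
    # Product: scale every channel of the second residual map by its weight.
    weighted = [[[a * x for x in row] for row in ch]
                for a, ch in zip(attention, second_residual)]
    # Sum: add the weighted map to the merged map.
    return add_maps(weighted, merged), attention


if __name__ == "__main__":
    second = [[[1.0, 1.0]], [[2.0, 2.0]]]   # 2 channels, 1 x 2 each
    third = [[[1.0, 1.0]], [[0.0, 0.0]]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    out, att = channel_attention(second, third, identity, zeros)
    print(att, out)
```

With all-zero weights in the second 1 × 1 convolution, every channel receives the neutral weight 0.5, which makes the arithmetic of the product and sum steps easy to follow by hand.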
The candidate box generating module 603 is configured to select candidate boxes from the attention pyramid feature map by using a region proposal network (RPN) to obtain text candidate boxes.
The text detection module 604 is configured to input the attention pyramid feature map and the position information of the text candidate boxes into a Faster R-CNN model for candidate box classification prediction, so as to judge whether the region framed by each text candidate box is a text region and obtain a text detection result.
The text detection module 604 is specifically configured to: input the attention pyramid feature map and the candidate box information into the Faster R-CNN model for candidate box classification prediction to obtain detection boxes; and apply NMS de-duplication to the detection boxes to obtain the text detection result.
An embodiment of the present invention provides a computer-readable storage medium on which a program is stored, the program implementing the text detection method when executed by a processor.
An embodiment of the present invention provides an electronic device. As shown in fig. 7, the electronic device 70 includes at least one processor 701, and at least one memory 702 and a bus 703 that are connected to the processor 701; the processor 701 and the memory 702 communicate with each other through the bus 703; the processor 701 is configured to call program instructions in the memory 702 to perform the text detection method described above. The electronic device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the steps of the text detection method described above.
It should be noted that the text detection method, the text detection system, the storage medium and the electronic device provided by the invention can be applied to the field of artificial intelligence or the field of finance. The foregoing is merely an example, and does not limit the application fields of the text detection method, the text detection system, the storage medium, and the electronic device provided by the present invention.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, systems and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include computer-readable media in the form of volatile memory, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner; identical or similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, its description is brief; for relevant details, refer to the corresponding parts of the method embodiment.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (10)
1. A text detection method, comprising:
acquiring an image to be detected;
performing feature extraction on the image to be detected by using an attention pyramid network model to obtain an attention pyramid feature map;
selecting candidate boxes from the attention pyramid feature map by using a region proposal network to obtain text candidate boxes;
inputting the attention pyramid feature map and the position information of the text candidate boxes into a Faster R-CNN model for candidate box classification prediction, so as to judge whether the region framed by each text candidate box is a text region and obtain a text detection result.
2. The text detection method of claim 1, wherein the attention pyramid network model comprises a ResNet101 network, a global average pooling layer, a first residual module, a channel attention module, and a second residual module; the ResNet101 network comprises a top layer convolution unit, a middle layer convolution unit and a bottom layer convolution unit;
wherein performing feature extraction on the image to be detected by using the attention pyramid network model to obtain the attention pyramid feature map specifically comprises:
inputting the image to be detected into the bottom-layer convolution unit for feature extraction to obtain a bottom-layer feature map; inputting the bottom-layer feature map into the middle-layer convolution unit for feature extraction to obtain a middle-layer feature map; inputting the middle-layer feature map into the top-layer convolution unit to obtain a top-layer feature map;
inputting the top-layer feature map into the global average pooling layer to obtain a pooling processing result;
inputting the top-layer feature map into the first residual module corresponding to the top-layer convolution unit to obtain a first residual feature map;
inputting the pooling processing result and the first residual feature map into the channel attention module corresponding to the top-layer convolution unit for weight adjustment to obtain a first channel attention feature map;
inputting the first channel attention feature map into the second residual module corresponding to the top-layer convolution unit to obtain a second residual feature map;
inputting the middle-layer feature map into the first residual module corresponding to the middle-layer convolution unit to obtain a third residual feature map;
inputting the second residual feature map and the third residual feature map into the channel attention module corresponding to the middle-layer convolution unit for weight adjustment to obtain a second channel attention feature map;
and inputting the second channel attention feature map into the second residual module corresponding to the middle-layer convolution unit to obtain the attention pyramid feature map.
3. The text detection method according to claim 2, wherein inputting the top-layer feature map into the first residual module corresponding to the top-layer convolution unit to obtain the first residual feature map specifically comprises:
inputting the top-layer feature map into a 1 × 1 convolutional layer for channel merging to obtain a merging result;
inputting the merging result into a 3 × 3 convolutional layer for size amplification to obtain an amplification result;
inputting the amplification result into a Batch Norm layer for batch normalization to obtain a normalization result;
inputting the normalization result into a ReLU function to obtain a result, and performing size reduction on the result through a 3 × 3 convolutional layer to obtain a reduction result;
and inputting the sum of the top-layer feature map and the reduction result into a ReLU function to obtain the first residual feature map.
4. The text detection method according to claim 2, wherein inputting the second residual feature map and the third residual feature map into the channel attention module corresponding to the middle-layer convolution unit for weight adjustment to obtain the second channel attention feature map specifically comprises:
merging the second residual feature map and the third residual feature map to obtain a merged feature map;
inputting the merged feature map into a global pooling layer for compression to obtain a compressed feature map;
inputting the compressed feature map into a 1 × 1 convolutional layer for processing, and inputting the processed result into a ReLU function to obtain an output result;
inputting the output result into a 1 × 1 convolutional layer for processing, and inputting the processed result into a Sigmoid function to obtain an attention vector;
and adjusting the weights of the second residual feature map by using the attention vector to obtain the second channel attention feature map.
5. The text detection method according to claim 4, wherein the inputting the output result into a 1 × 1 convolutional layer for processing, and inputting the processed result into a Sigmoid function to obtain an attention vector, specifically comprises:
inputting the output result into a 1 × 1 convolutional layer to perform feature map channel summation processing to obtain a score map;
determining text prediction probability by using the scores in the score map;
and obtaining an attention vector by using a Sigmoid function according to the text prediction probability and the text expectation probability.
6. The text detection method according to claim 4, wherein adjusting the weights of the second residual feature map by using the attention vector to obtain the second channel attention feature map specifically comprises:
multiplying the attention vector by the second residual feature map to obtain a product result;
and summing the product result and the merged feature map to obtain the second channel attention feature map.
7. The text detection method according to claim 1, wherein inputting the attention pyramid feature map and the candidate box information into the Faster R-CNN model for candidate box classification prediction to judge whether the text candidate box is a text region and obtain the text detection result specifically comprises:
inputting the attention pyramid feature map and the candidate box information into the Faster R-CNN model for candidate box classification prediction to obtain detection boxes;
and applying NMS de-duplication to the detection boxes to obtain the text detection result.
8. A text detection system, comprising:
the image acquisition module is used for acquiring an image to be detected;
the feature extraction module is used for extracting features of the image to be detected by adopting an attention pyramid network model to obtain an attention pyramid feature map;
the candidate box generation module is used for selecting candidate boxes from the attention pyramid feature map by using a region proposal network to obtain text candidate boxes;
and the text detection module is used for inputting the attention pyramid feature map and the position information of the text candidate boxes into a Faster R-CNN model for candidate box classification prediction, so as to judge whether the region framed by each text candidate box is a text region and obtain a text detection result.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a program which, when executed by a processor, implements the text detection method according to any one of claims 1 to 7.
10. An electronic device, comprising:
at least one processor, and at least one memory and a bus connected to the processor;
the processor and the memory complete mutual communication through the bus; the processor is configured to invoke program instructions in the memory to perform the text detection method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111069214.0A CN113723352B (en) | 2021-09-13 | 2021-09-13 | Text detection method, system, storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111069214.0A CN113723352B (en) | 2021-09-13 | 2021-09-13 | Text detection method, system, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113723352A (en) | 2021-11-30 |
CN113723352B (en) | 2024-08-02 |
Family
ID=78683569
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111069214.0A Active CN113723352B (en) | 2021-09-13 | 2021-09-13 | Text detection method, system, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113723352B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||