CN111723798B - Multi-instance natural scene text detection method based on relevance hierarchy residual errors - Google Patents

Multi-instance natural scene text detection method based on relevance hierarchy residual errors

Info

Publication number
CN111723798B
CN111723798B (granted publication of application CN202010464099.6A)
Authority
CN
China
Prior art keywords
text
feature
loss
feature map
regions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010464099.6A
Other languages
Chinese (zh)
Other versions
CN111723798A (en
Inventor
田智强 (Tian Zhiqiang)
王春晖 (Wang Chunhui)
杜少毅 (Du Shaoyi)
兰旭光 (Lan Xuguang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Xingzhou Zhiyi Intelligent Technology Co ltd
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202010464099.6A priority Critical patent/CN111723798B/en
Publication of CN111723798A publication Critical patent/CN111723798A/en
Application granted granted Critical
Publication of CN111723798B publication Critical patent/CN111723798B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/62 - Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63 - Scene text, e.g. street names
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/22 - Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-instance natural scene text detection method based on relevance hierarchy residuals. The feature extraction network adopted by the method uses relevance hierarchy residuals and reverse stage-by-stage feature fusion to extract multi-scale features that combine coarse and fine granularity; these features contain more accurate and complete text information and thus improve text detection precision. The regression loss of the text detection box consists of a CIoU Loss term and an angle loss term; in particular, the CIoU Loss term takes into account the overlapping area, center distance and aspect ratio between the predicted and ground-truth text detection boxes, so the actual regression of the text detection box is evaluated more accurately and the performance of the detection method is improved. In addition, the invention reduces hardware computation pressure at several steps in suitable ways, and achieves good detection results on both conventional text regions and small text regions.

Description

Multi-instance natural scene text detection method based on relevance hierarchy residual errors
Technical Field
The invention belongs to the field of deep learning, computer vision and text detection, and particularly relates to a multi-instance natural scene text detection method based on relevance hierarchy residual errors.
Background
Text is a primary means of conveying information and plays an indispensable role in daily life. With the arrival of the big data era, how to acquire text information from massive numbers of images has become an urgent problem. Driven by the development of deep learning, natural scene text detection has therefore become a very active research direction in computer vision and is of great significance for image retrieval, scene understanding and similar tasks.
At present, a large body of research results has allowed natural scene text detection to be applied widely across many industries. For example, many internet companies build services and applications such as image retrieval and street-view navigation on natural scene text detection technology. Several cloud service providers have also launched image text detection services aimed at education, logistics, video, e-commerce, tourism and other fields, offering users direct text detection services, text detection models, and customized AI system integration. Although the technical achievements of natural scene text detection are remarkable, the images it must process have complex backgrounds and diverse text, and the related techniques still suffer from insufficient detection precision and similar problems.
Investigation shows that existing natural scene text detection methods have two main shortcomings. On the one hand, although feature extraction networks used for text detection have adopted the idea of multi-scale feature extraction and fusion from networks such as SPPNet, U-Net and FPN, they generally use a convolution kernel of only one size when extracting adjacent features at different scales, so the extracted features are coarse-grained. Since a text instance in a natural scene image is usually much smaller than the natural background, and existing methods detect small text regions poorly, finer-grained features are needed for the text detection task; coarse-grained multi-scale feature extraction therefore still leaves room for improvement. On the other hand, the regression loss function commonly used in text detection is the IoU Loss, which computes the intersection-over-union between the predicted and ground-truth text detection boxes. However, for the same intersection-over-union, the relative position and overlap of the predicted and ground-truth boxes can differ, so evaluating the regression of a text detection box directly by the intersection-over-union is also flawed, and the design of the loss function still needs improvement.
Disclosure of Invention
The invention aims to provide a multi-instance natural scene text detection method based on relevance hierarchy residuals, in order to solve two problems of current text detection methods: poor detection of small text regions, and loss functions that cannot properly evaluate the actual regression of the text detection box.
In order to achieve the purpose, the invention adopts the following technical scheme:
firstly, extracting features of an original input image by adopting a feature extraction network based on relevance hierarchy residual errors so as to obtain feature maps with different scales containing rich text information from low level to high level;
step two, performing reverse step-by-step feature fusion on the feature maps of different scales extracted in the step one so as to obtain a multi-scale fusion feature map;
thirdly, performing text region detection on the multi-scale fusion characteristic map output in the second step by adopting characteristic mapping, and outputting a pixel-level text score characteristic map and a text region geometric characteristic map so as to represent candidate prediction text regions;
step four, simply screening and eliminating all candidate predictive text regions generated in the step three in advance according to the score of each candidate predictive text region;
step five, merging and screening the residual candidate prediction text regions in the step four by using a local perception non-maximum suppression algorithm, thereby obtaining quasi-prediction text regions;
and step six, calculating the average score of the regions of all the quasi-prediction text regions obtained in the step five, and removing the regions with the average score of the regions lower than a certain threshold value, thereby obtaining the final prediction text regions and the detection results.
The method includes a training process in which several public common text detection data sets are used to train the model;
back propagation is used during training: the model parameters are updated continuously while the loss is large, until the loss converges to a small value, and the model parameters are then stored;
and step seven, the stored structure and parameters of the model form the multi-instance natural scene text detection model.
Further, in step one, the feature extraction network based on relevance hierarchy residuals introduces a relevance hierarchy residual structure into a ResNet-50 backbone network, so that accurate and complete multi-scale text features combining coarse and fine granularity can be extracted. In the feature extraction stage, the original input image passes through 5 convolutional layers Conv1-Conv5 and gradually acquires coarse-grained feature information at different scales from low level to high level; the feature map size after each convolutional layer becomes 1/2, 1/4, 1/8, 1/16 and 1/32 of the original image in turn. In addition, the relevance hierarchy residual structure is introduced into Conv2-Conv5 for fine-grained feature extraction between adjacent feature maps of different scales. In this way, the feature maps f1, f2, f3 and f4 of different scales generated during extraction simultaneously contain multi-scale feature information combining coarse and fine granularity.
Further, in the feature extraction network based on relevance hierarchy residuals, Conv1 uses a 7 × 7 convolution kernel followed by a MaxPool layer with a 3 × 3 kernel for downsampling. Conv2-Conv5 are each composed of a 1 × 1 convolution, a 3 × 3 convolution group and a 1 × 1 convolution, with a residual connection attached to simplify the learning objective and difficulty of the deep neural network. The 3 × 3 convolution group is the key to fine-grained feature extraction: the feature map produced by the first 1 × 1 convolution is divided equally into 4 sub-feature maps along the channel dimension; the 1st sub-feature map x_1 is output directly as y_1; each subsequent sub-feature map x_i passes through a 3 × 3 convolution K_i to obtain the output y_i; starting from the 3rd sub-feature map, x_i is first added to the output y_(i-1) of the previous sub-feature map before the 3 × 3 convolution; finally, the outputs of the 4 sub-feature maps are concatenated along the channel dimension to obtain the overall output y.
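For illustration only, the following PyTorch sketch shows one possible form of the bottleneck just described (a Res2Net-style hierarchical residual block); the channel counts, the use of BatchNorm and ReLU, and all module names are assumptions rather than details taken from the patent.

```python
# Hypothetical sketch of the relevance hierarchy residual bottleneck used in Conv2-Conv5:
# 1x1 convolution, a 3x3 convolution group over 4 sub-feature maps, 1x1 convolution,
# plus a residual connection (assumed layer choices).
import torch
import torch.nn as nn

class HierarchyResidualBottleneck(nn.Module):
    def __init__(self, in_channels, mid_channels, out_channels, splits=4):
        super().__init__()
        assert mid_channels % splits == 0
        self.splits = splits
        width = mid_channels // splits
        self.reduce = nn.Sequential(  # first 1x1 convolution reduces the channel dimension
            nn.Conv2d(in_channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        # one 3x3 convolution K_i for every sub-feature map except the first
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(width, width, 3, padding=1, bias=False),
                          nn.BatchNorm2d(width), nn.ReLU(inplace=True))
            for _ in range(splits - 1)])
        self.expand = nn.Sequential(  # second 1x1 convolution restores the channel dimension
            nn.Conv2d(mid_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels))
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, 1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.reduce(x)
        xs = torch.chunk(out, self.splits, dim=1)      # split into 4 sub-feature maps
        ys = [xs[0]]                                   # x_1 is passed through directly as y_1
        for i in range(1, self.splits):
            inp = xs[i] if i == 1 else xs[i] + ys[-1]  # from x_3 on, add the previous output y_(i-1)
            ys.append(self.convs[i - 1](inp))
        out = self.expand(torch.cat(ys, dim=1))        # merge the 4 outputs along the channel dimension
        return self.relu(out + self.shortcut(x))       # residual connection
```

Splitting the intermediate feature map into narrow sub-feature maps keeps the additional 3 × 3 convolutions cheap while giving each block several effective receptive-field sizes, which is how the fine-grained multi-scale behaviour described above arises.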
Further, in step two, reverse step-by-step feature fusion starts from the feature map f_1 generated by Conv5: f_1 is first upsampled to output a feature map twice its original size, so that the output has the same size as the feature map f_2 generated by Conv4 and the two can be concatenated directly along the channel dimension; after the feature maps are concatenated, a 1 × 1 and a 3 × 3 convolution are applied to reduce the channel dimension and the amount of parameter computation. In this way, the feature maps f_1, f_2, f_3 and f_4 of different scales are fused step by step in turn, and the size of the fused feature map is 1/4 of the original input image. Finally, a 3 × 3 convolutional layer is added to generate the final multi-scale fusion feature map.
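As an illustrative sketch of one such fusion stage (bilinear upsampling and the specific channel counts below are assumptions, not values fixed by the patent):

```python
# Hypothetical sketch of one reverse step-by-step fusion stage: upsample the deeper feature
# map by 2x, concatenate it with the shallower map, then apply 1x1 and 3x3 convolutions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionStage(nn.Module):
    def __init__(self, deep_channels, shallow_channels, out_channels):
        super().__init__()
        self.conv1x1 = nn.Conv2d(deep_channels + shallow_channels, out_channels, 1)
        self.conv3x3 = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, deep, shallow):
        deep = F.interpolate(deep, size=shallow.shape[-2:],
                             mode="bilinear", align_corners=False)  # 2x upsampling of the deeper map
        fused = torch.cat([deep, shallow], dim=1)   # concatenate along the channel dimension
        fused = F.relu(self.conv1x1(fused))         # 1x1 convolution reduces the channel dimension
        return F.relu(self.conv3x3(fused))          # 3x3 convolution refines the fused features

# usage example with illustrative channel counts: fuse f_1 (1/32 scale) into f_2 (1/16 scale)
f1, f2 = torch.randn(1, 2048, 16, 16), torch.randn(1, 1024, 32, 32)
print(FusionStage(2048, 1024, 128)(f1, f2).shape)   # torch.Size([1, 128, 32, 32])
```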
Further, in step three, the feature mapping applied to the multi-scale fusion feature map is a 1 × 1 convolution; the output pixel-level text score feature map and text region geometric feature map indicate, respectively, whether each pixel of the feature map lies in a text region, the distances from each pixel to the boundaries of its text region, and the inclination angle of that text region, and together represent the candidate predicted text regions.
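A possible 1 × 1 feature-mapping head is sketched below for illustration; the geometry layout (four boundary distances plus one angle, as in EAST-style detectors) and the output scaling are assumptions.

```python
# Hypothetical 1x1-convolution head mapping the fused feature map to a pixel-level score
# map and a geometry map (4 boundary distances + 1 inclination angle).
import math
import torch
import torch.nn as nn

class TextDetectionHead(nn.Module):
    def __init__(self, in_channels=128, max_dist=512.0):
        super().__init__()
        self.max_dist = max_dist                     # assumed upper bound on boundary distances
        self.score = nn.Conv2d(in_channels, 1, 1)    # probability that each pixel lies in text
        self.dists = nn.Conv2d(in_channels, 4, 1)    # distances to the 4 text-box boundaries
        self.angle = nn.Conv2d(in_channels, 1, 1)    # inclination angle of the text region

    def forward(self, fused):
        score = torch.sigmoid(self.score(fused))
        dists = torch.sigmoid(self.dists(fused)) * self.max_dist
        angle = (torch.sigmoid(self.angle(fused)) - 0.5) * math.pi / 2   # assumed range (-pi/4, pi/4)
        return score, torch.cat([dists, angle], dim=1)
```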
Further, the score threshold for performing simple pre-screening and culling on the candidate predictive text regions in step four is set to 0.5.
Further, in step five, the locality-aware non-maximum suppression algorithm first merges the remaining candidate predicted text regions row by row; two candidate predicted text regions are merged when their intersection area is greater than the set threshold of 0.2. When merging, the vertex coordinates of the two original text regions are averaged with their scores as weights to obtain the vertex coordinates of the merged text region, and the scores of the two original regions are added to give the score of the new merged region. The merged candidate predicted text regions are then filtered by a standard non-maximum suppression algorithm to obtain the quasi-predicted text regions.
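The weighted merging rule can be illustrated with the short Python sketch below; representing regions as quadrilaterals, using shapely for the overlap test, and interpreting the 0.2 threshold as an intersection-over-union are assumptions made for this example only.

```python
# Hypothetical sketch of the score-weighted vertex merge used in locality-aware NMS.
from shapely.geometry import Polygon

MERGE_THRESHOLD = 0.2

def should_merge(quad_a, quad_b):
    """Decide whether two quadrilaterals (lists of 4 (x, y) vertices) overlap enough to merge."""
    pa, pb = Polygon(quad_a), Polygon(quad_b)
    union = pa.union(pb).area
    return union > 0 and pa.intersection(pb).area / union > MERGE_THRESHOLD

def weighted_merge(quad_a, score_a, quad_b, score_b):
    """Average corresponding vertices with the region scores as weights; add the scores."""
    merged = [((score_a * ax + score_b * bx) / (score_a + score_b),
               (score_a * ay + score_b * by) / (score_a + score_b))
              for (ax, ay), (bx, by) in zip(quad_a, quad_b)]
    return merged, score_a + score_b

quad1 = [(0, 0), (10, 0), (10, 4), (0, 4)]
quad2 = [(1, 0), (11, 0), (11, 4), (1, 4)]
if should_merge(quad1, quad2):
    print(weighted_merge(quad1, 0.9, quad2, 0.8))
```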
Further, in the sixth step, a threshold value for performing region screening according to the region average score of the quasi-predictive text region is set to 0.1.
Further, a loss function is used in the training process, and the model parameters are adjusted when the loss is back-propagated.
Further, the loss function is composed of two parts: the text classification loss guides the correct classification of text regions, and the detection box regression loss guides the correct regression of the text detection box. The overall loss function is calculated as:
L = L_cls + λ·L_reg
where L is the total detection loss; L_cls is the text classification loss; L_reg is the detection box regression loss; and λ is a parameter balancing the importance of the two losses, set to 1.
The text classification loss is calculated as:
L_cls = 1 - 2·|Y ∩ Y*| / (|Y| + |Y*|)
where L_cls is the text classification loss; |Y| is the set of all positive-sample pixels in the ground-truth text score feature map; |Y*| is the set of all positive-sample pixels in the predicted text score feature map; and |Y ∩ Y*| is the intersection of the positive-sample regions of the predicted and ground-truth text score feature maps.
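For illustration only, a minimal PyTorch sketch of this dice-style classification loss follows; the exact functional form above is reconstructed from the quantities the patent defines, and the function name and tensor shapes are assumptions.

```python
# Hypothetical dice-style text classification loss over the predicted and ground-truth
# score maps, matching the |Y|, |Y*| and |Y ∩ Y*| terms described above.
import torch

def classification_loss(pred_score, gt_score, eps=1e-6):
    """pred_score, gt_score: tensors with values in [0, 1] and shape (N, 1, H, W)."""
    intersection = (pred_score * gt_score).sum()   # |Y ∩ Y*|
    total = pred_score.sum() + gt_score.sum()      # |Y*| + |Y|
    return 1.0 - 2.0 * intersection / (total + eps)
```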
The detection box regression loss is calculated as:
L_reg = L_g + λ_θ·L_θ
where L_reg is the detection box regression loss; L_g is the geometric regression loss of the text detection box without considering the angle; L_θ is the angle loss of the text detection box; and λ_θ is a parameter balancing the two losses, set to 20.
Furthermore, the geometric regression loss of the text detection box without considering the angle in the detection box regression loss is the CIoU Loss, calculated as:
L_g = 1 - IoU + R(A, B)
where L_g is the geometric regression loss of the text detection box; IoU = |A ∩ B| / |A ∪ B| is the intersection-over-union of the areas of the predicted box and the ground-truth box; A and B denote the predicted box region and the ground-truth box region, respectively; and R(A, B) is a penalty term calculated as:
R(A, B) = ρ²(a, b) / c² + α·v
where a and b are the centers of the predicted box A and the ground-truth box B, respectively; ρ(·) is the Euclidean distance; c is the diagonal length of the smallest enclosing rectangle covering both A and B; α = v / ((1 - IoU) + v) is a trade-off parameter; and v measures the consistency of the aspect ratios, calculated as:
v = (4 / π²)·(arctan(w_B / h_B) - arctan(w_A / h_A))²
where w_B, h_B are the width and height of the ground-truth box B, and w_A, h_A are the width and height of the predicted box A.
Further, the angle loss of the text detection box in the detection box regression loss is calculated as:
L_θ = 1 - cos(θ* - θ)
where L_θ is the angle loss of the text detection box; θ* is the predicted angle of the text region; and θ is the true angle of the text region.
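For illustration, the PyTorch sketch below combines the CIoU term and the angle term into the detection box regression loss described above; the geometric part is written for axis-aligned boxes (x1, y1, x2, y2) for simplicity, whereas the patent applies it to oriented text boxes, and the function names are assumptions.

```python
# Hypothetical sketch of the detection box regression loss L_reg = L_g + 20 * L_theta,
# where L_g is the CIoU loss (overlap + center distance + aspect-ratio terms).
import math
import torch

def ciou_loss(pred, gt, eps=1e-7):
    """pred, gt: tensors of shape (N, 4) holding (x1, y1, x2, y2)."""
    # overlap area and IoU
    iw = (torch.min(pred[:, 2], gt[:, 2]) - torch.max(pred[:, 0], gt[:, 0])).clamp(min=0)
    ih = (torch.min(pred[:, 3], gt[:, 3]) - torch.max(pred[:, 1], gt[:, 1])).clamp(min=0)
    inter = iw * ih
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)
    # squared center distance normalized by the enclosing rectangle's squared diagonal
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxg, cyg = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
    center = (cxp - cxg) ** 2 + (cyp - cyg) ** 2
    ew = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    eh = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])
    diag = ew ** 2 + eh ** 2 + eps
    # aspect-ratio consistency term v and trade-off parameter alpha
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wg / (hg + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + center / diag + alpha * v

def regression_loss(pred_box, gt_box, pred_angle, gt_angle, lambda_theta=20.0):
    l_g = ciou_loss(pred_box, gt_box)
    l_theta = 1 - torch.cos(pred_angle - gt_angle)   # L_theta = 1 - cos(theta* - theta)
    return (l_g + lambda_theta * l_theta).mean()
```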
Compared with the prior art, the invention has the following technical effects:
the adopted feature extraction network extracts multi-scale features combining coarse granularity and fine granularity by utilizing relevance hierarchy residual and reverse stage-by-stage feature fusion, wherein the multi-scale features comprise more accurate and complete text information, and the feature expression capability of the network is further enhanced, so that the text detection precision can be improved;
the regression Loss of the text detection box used by the invention consists of a CIoU Loss part and an angle Loss part, and particularly, the use of the CIoU Loss part considers factors such as the overlapping area, the central distance, the length-width ratio and the like between the predicted text detection box and the real text detection box, so that the actual regression condition of the text detection box can be more accurately evaluated, and the performance of the text detection method can be improved;
the invention relieves hardware computational stress in a number of steps in a suitable manner, such as: 1 × 1, 3 × 3 small convolutions, feature splitting and splicing and the like are used at multiple positions in network design to reduce feature dimensions and reduce parameter calculation amount; pre-simple threshold screening and the like for candidate prediction text regions are also carried out;
the method has high detection precision for the conventional text region, is sensitive to the detection of the small text region, and has higher application value in the field of natural scene text detection.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a feature extraction and feature fusion network architecture of the present invention;
FIG. 3 is a basic structure diagram of the relevance hierarchy residual error used by the feature extraction network Conv2-Conv5 according to the present invention;
FIG. 4 shows a selection of detection results of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
referring to fig. 1, the present invention comprises the steps of:
step 101, using a camera to acquire image data or directly uploading the image data as image input.
Step 102, extracting features from the original input image with the feature extraction network based on relevance hierarchy residuals to obtain multi-scale feature maps f1, f2, f3 and f4, which combine coarse and fine granularity, have sizes of 1/32, 1/16, 1/8 and 1/4 of the original input image, and represent rich feature information from low level to high level.
Step 103, reverse step-by-step feature fusion: starting from feature map f1, sequentially upsample and concatenate f1, f2, f3 and f4 to finally generate a multi-scale fusion feature map whose size is 1/4 of the original input image.
And 104, performing feature mapping on the multi-scale fusion feature map to detect a text region, and outputting a pixel-level text score feature map and a text region geometric feature map, wherein the pixel-level text score feature map and the text region geometric feature map respectively indicate whether each pixel point in the feature map is in the text region, the boundary distance between each pixel point and the text region and the inclination angle of the text region to which the pixel point belongs, so that candidate prediction text regions can be represented.
And 105, simply screening and eliminating the candidate predictive text region in advance according to the region score, wherein the screening score threshold is set to be 0.5.
Step 106, a locality-aware non-maximum suppression algorithm merges the remaining candidate predicted text regions row by row; two candidate predicted text regions are merged when their intersection area is greater than the set threshold of 0.2. When merging, the vertex coordinates of the two original text regions are averaged with their scores as weights to obtain the vertex coordinates of the merged text region, and the scores of the two original regions are added to give the score of the new merged region. The merged candidate predicted text regions are then filtered by a standard non-maximum suppression algorithm to obtain the quasi-predicted text regions.
Step 107, performing region screening according to the region average score of the quasi-predictive text region to obtain a final predictive text region (i.e. a text detection result), wherein the screening threshold is set to 0.1.
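Taken together, steps 105-107 form the post-processing pipeline. A simplified sketch of its thresholding logic follows; the region representation (a dict holding a score and the pixels it covers) and the lanms_merge callable standing in for step 106 are illustrative assumptions.

```python
# Hypothetical post-processing pipeline for steps 105-107.
SCORE_THRESHOLD = 0.5       # step 105: pre-screening of candidate regions by score
AVG_SCORE_THRESHOLD = 0.1   # step 107: filtering by region average score

def post_process(regions, score_map, lanms_merge):
    """regions: list of dicts with 'score' and 'pixels' (list of (row, col)); score_map: 2-D array."""
    candidates = [r for r in regions if r["score"] > SCORE_THRESHOLD]
    quasi = lanms_merge(candidates)   # step 106: locality-aware NMS (merge + standard NMS)
    final = []
    for r in quasi:
        avg = sum(score_map[y][x] for y, x in r["pixels"]) / len(r["pixels"])
        if avg > AVG_SCORE_THRESHOLD:
            final.append(r)
    return final
```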
In addition, as with most deep learning methods, the model is first trained with a large amount of labeled image data, and back propagation and parameter optimization during training require a loss function. The loss function is composed of two parts: the text classification loss guides the correct classification of text regions, and the detection box regression loss guides the correct regression of the text detection box. The overall loss function is calculated as:
L = L_cls + λ·L_reg
where L is the total detection loss; L_cls is the text classification loss; L_reg is the detection box regression loss; and λ is a parameter balancing the importance of the two losses, set to 1.
The text classification loss is calculated as:
L_cls = 1 - 2·|Y ∩ Y*| / (|Y| + |Y*|)
where L_cls is the text classification loss; |Y| is the set of all positive-sample pixels in the ground-truth text score feature map; |Y*| is the set of all positive-sample pixels in the predicted text score feature map; and |Y ∩ Y*| is the intersection of the positive-sample regions of the predicted and ground-truth text score feature maps.
The detection box regression loss is calculated as:
L_reg = L_g + λ_θ·L_θ
where L_reg is the detection box regression loss; L_g is the geometric regression loss of the text detection box without considering the angle; L_θ is the angle loss of the text detection box; and λ_θ is a parameter balancing the two losses, set to 20.
The geometric regression loss of the text detection box without considering the angle in the detection box regression loss is the CIoU Loss, calculated as:
L_g = 1 - IoU + R(A, B)
where L_g is the geometric regression loss of the text detection box; IoU = |A ∩ B| / |A ∪ B| is the intersection-over-union of the areas of the predicted box and the ground-truth box; A and B denote the predicted box region and the ground-truth box region, respectively; and R(A, B) is a penalty term calculated as:
R(A, B) = ρ²(a, b) / c² + α·v
where a and b are the centers of the predicted box A and the ground-truth box B, respectively; ρ(·) is the Euclidean distance; c is the diagonal length of the smallest enclosing rectangle covering both A and B; α = v / ((1 - IoU) + v) is a trade-off parameter; and v measures the consistency of the aspect ratios, calculated as:
v = (4 / π²)·(arctan(w_B / h_B) - arctan(w_A / h_A))²
where w_B, h_B are the width and height of the ground-truth box B, and w_A, h_A are the width and height of the predicted box A.
The angle loss of the text detection box in the detection box regression loss is calculated as:
L_θ = 1 - cos(θ* - θ)
where L_θ is the angle loss of the text detection box; θ* is the predicted angle of the text region; and θ is the true angle of the text region.
Referring to fig. 2, it depicts a structure diagram of the feature extraction and feature fusion network of the present invention, which includes the following parts:
step 201, using a camera to acquire image data or directly uploading the image data as image input.
Step 202, feature extraction is performed on the original input image with the feature extraction network based on relevance hierarchy residuals. The 5 convolutional layers Conv1-Conv5 in the network gradually acquire coarse-grained feature information at different scales from low level to high level, and the feature map size after each convolutional layer becomes 1/2, 1/4, 1/8, 1/16 and 1/32 of the original image in turn. In addition, the relevance hierarchy residual structure is introduced into Conv2-Conv5 for fine-grained feature extraction between adjacent feature maps of different scales; in this way, the feature maps f1, f2, f3 and f4 of different scales generated during extraction simultaneously contain multi-scale feature information combining coarse and fine granularity. Conv1 uses a 7 × 7 convolution kernel followed by a MaxPool layer with a 3 × 3 kernel for downsampling. Conv2-Conv5 are each composed of a 1 × 1 convolution, a 3 × 3 convolution group and a 1 × 1 convolution, with a residual connection attached to simplify the learning objective and difficulty of the deep neural network.
Step 203, reverse step-by-step feature fusion starts from the feature map f_1 generated by Conv5: f_1 is first upsampled to output a feature map twice its original size, so that the output has the same size as the feature map f_2 generated by Conv4 and the two can be concatenated directly along the channel dimension; after the feature maps are concatenated, a 1 × 1 and a 3 × 3 convolution are applied to reduce the channel dimension and the amount of parameter computation. In this way, the feature maps f_1, f_2, f_3 and f_4 of different scales are fused step by step in turn, and the size of the fused feature map is 1/4 of the original input image. Finally, a 3 × 3 convolutional layer is added to generate the final multi-scale fusion feature map.
Referring to fig. 3, which depicts a basic structure diagram of the relevance hierarchy residual error used by the feature extraction network Conv2-Conv5 of the present invention, includes the following parts:
in step 301, the feature map 1 is convolved by 1 × 1 to reduce the amount of parameter calculation.
Step 302, the feature map generated by the 1 × 1 convolution is divided equally into 4 sub-feature maps along the channel dimension; the 1st sub-feature map x_1 is output directly as y_1; each subsequent sub-feature map x_i passes through a 3 × 3 convolution K_i to obtain the output y_i; starting from the 3rd sub-feature map, x_i is first added to the output y_(i-1) of the previous sub-feature map before the 3 × 3 convolution; finally, the outputs of the 4 sub-feature maps are concatenated along the channel dimension to obtain the overall output y.
Step 303, a 1 × 1 convolution of the feature map y restores the feature dimension and finally generates feature map 2.
Meanwhile, Conv2-Conv5 use residual connections in order to simplify the learning objective and difficulty of the deep neural network.
Referring to fig. 4, partial detection results of the method are shown, and the results show that the method is relatively accurate in detection of the horizontal text, relatively sensitive in detection of small text regions, relatively accurate in distinguishing of multiple instances, and capable of eliminating interference of text similar objects.
The embodiments of the present invention have been described above with reference to the accompanying drawings. It will be appreciated by persons skilled in the art that the present invention is not limited by the embodiments described above. On the basis of the technical solution of the present invention, those skilled in the art can make various modifications or variations without creative efforts and still be within the protection scope of the present invention.

Claims (9)

1. The multi-instance natural scene text detection method based on the relevance hierarchy residual error is characterized by comprising the following steps of:
firstly, extracting features of an original input image by adopting a feature extraction network based on relevance hierarchy residual errors so as to obtain feature maps with different scales containing rich text information from low level to high level; the relevance hierarchy residual error means that Conv2-Conv5 are each composed of a 1 × 1 convolution, a 3 × 3 convolution group and a 1 × 1 convolution, with a residual connection attached to simplify the learning objective and difficulty of the deep neural network, wherein the 3 × 3 convolution group is the key to fine-grained feature extraction: the feature map generated by the 1 × 1 convolution is first divided equally into 4 sub-feature maps along the channel dimension; the 1st sub-feature map x_1 is output directly as y_1; each subsequent sub-feature map x_i passes through a 3 × 3 convolution K_i to obtain the output y_i; starting from the 3rd sub-feature map, x_i is first added to the output y_(i-1) of the previous sub-feature map before the 3 × 3 convolution; finally, the outputs of the 4 sub-feature maps are concatenated along the channel dimension to obtain the overall output y;
step two, performing reverse step-by-step feature fusion on the feature maps of different scales extracted in the step one so as to obtain a multi-scale fusion feature map;
thirdly, performing text region detection on the multi-scale fusion characteristic graph output in the second step by adopting characteristic mapping, and outputting a pixel-level text score characteristic graph and a text region geometric characteristic graph so as to represent candidate prediction text regions;
step four, simply screening and eliminating all candidate predictive text regions generated in the step three in advance according to the score of each candidate predictive text region, and setting the score threshold value to be 0.5;
step five, merging and screening the residual candidate prediction text regions in the step four by using a local perception non-maximum suppression algorithm, thereby obtaining quasi prediction text regions;
step six, calculating the average score of the regions of all the quasi-prediction text regions obtained in the step five, and removing the regions with the average score of the regions lower than the threshold value of 0.1, so as to obtain the final prediction text regions and the detection results;
the method comprises a training process, wherein a plurality of public common text detection data sets are used to train the multi-instance natural scene text detection model formed by steps one to five;
in the training process, the model parameters are continuously updated by using back propagation until loss convergence, and the parameters of the model are stored;
and step seven, forming a multi-instance natural scene text detection model by using the model parameters and the structure stored in the step six.
2. The method for detecting the text of the multi-instance natural scene based on the relevance hierarchy residual error as claimed in claim 1, wherein in the step one, the relevance hierarchy residual error based feature extraction network introduces a relevance hierarchy residual error structure based on a ResNet-50 backbone network, so as to extract precise and complete multi-scale text features combining coarse granularity and fine granularity, in a feature extraction link, an original input image gradually acquires coarse-grained feature information of different scales from a low level to a high level through 5 convolutional layers Conv1-Conv5, and the feature map size after passing through each convolutional layer sequentially becomes 1/2, 1/4, 1/8, 1/16 and 1/32 of an original image; in addition, a relevance hierarchy residual error structure is introduced into Conv2-Conv5 and used for fine-grained feature extraction between adjacent feature maps with different scales; in this way, the different scale feature maps f1, f2, f3 and f4 generated in the extraction process simultaneously contain multi-scale feature information combining coarse granularity and fine granularity.
3. The method according to claim 1, wherein in step two, reverse step-by-step feature fusion starts from the feature map f_1 generated by Conv5: f_1 is first upsampled to output a feature map twice its original size, so that the output has the same size as the feature map f_2 generated by Conv4 and the two can be concatenated directly along the channel dimension; after the feature maps are concatenated, a 1 × 1 and a 3 × 3 convolution are applied to reduce the channel dimension and the amount of parameter computation; in this way, the feature maps f_1, f_2, f_3 and f_4 of different scales are fused step by step in turn, and the size of the fused feature map is 1/4 of the original input image; in addition, a 3 × 3 convolutional layer is added to generate the final multi-scale feature fusion map.
4. The method for detecting multi-instance natural scene text based on relevance hierarchy residual error according to claim 1, wherein in step three, the feature mapping for the multi-scale feature fusion graph adopts 1x1 convolution operation; and then, the output pixel-level text score feature map and the text region geometric feature map respectively show whether each pixel point in the feature map is in the text region, the boundary distance between each pixel point and the text region and the inclination angle of the text region to which the pixel point belongs, so that the candidate prediction text region can be represented.
5. The method for detecting the multi-instance natural scene text based on the relevance hierarchy residual error according to claim 1, wherein in step five, the locality-aware non-maximum suppression algorithm first merges the remaining candidate predicted text regions row by row, and two candidate predicted text regions are merged when their intersection area is greater than the set threshold of 0.2; when merging, the vertex coordinates of the two original text regions are averaged with their scores as weights to obtain the vertex coordinates of the merged text region, and the scores of the two original regions are added to give the score of the new merged region; the merged candidate predicted text regions are then filtered by a standard non-maximum suppression algorithm to obtain the quasi-predicted text regions.
6. The method according to claim 1, wherein in step six, a loss function is used in the training process, and parameter adjustment is performed when loss is propagated backwards.
7. The method of claim 6, wherein the loss function consists of two parts, wherein the text classification loss is used to guide the correct classification of text regions; the regression loss of the detection box is used for guiding the correct regression of the text detection box; the overall loss function is calculated as:
L = L_cls + λ·L_reg
where L is the total detection loss; L_cls is the text classification loss; L_reg is the detection box regression loss; and λ is a parameter balancing the importance of the two losses, set to 1;
the text classification loss calculation formula is as follows:
Figure FDA0003718126560000031
wherein L is cls Representing a text classification loss; y represents all positive sample areas in the real text score feature map; | Y * I represents all positive sample regions in the predicted text score feature map; y is reverse U Y * I represents the part of the predicted text score feature map where the positive sample region intersects with the positive sample region in the real text score feature map;
the detection frame regression loss calculation formula is as follows:
L reg =L gθ L θ
wherein L is reg Regression loss for the detection box; l is g Detecting box geometric regression loss for the text without considering the angle; l is θ Detecting a frame angle loss for the text; lambda [ alpha ] θ The two loss tradeoff parameters are 20.
8. The method of claim 7, wherein the text detection box geometric regression Loss without considering angles in the text detection box regression Loss is CIoU Loss, and the calculation formula is as follows:
L_g = 1 - IoU + R(A, B)
where L_g is the geometric regression loss of the text detection box; IoU = |A ∩ B| / |A ∪ B| is the intersection-over-union of the areas of the predicted box and the ground-truth box; A and B denote the predicted box region and the ground-truth box region, respectively; R(A, B) is a penalty term calculated as:
R(A, B) = ρ²(a, b) / c² + α·v
where a and b are the centers of the predicted box A and the ground-truth box B, respectively; ρ(·) is the Euclidean distance; c is the diagonal length of the smallest enclosing rectangle covering both A and B; α = v / ((1 - IoU) + v) is a trade-off parameter; and v measures the consistency of the aspect ratios, calculated as:
v = (4 / π²)·(arctan(w_B / h_B) - arctan(w_A / h_A))²
where w_B, h_B are the width and height of the ground-truth box B, and w_A, h_A are the width and height of the predicted box A.
9. The method of claim 8, wherein the calculation formula of the text detection box angle loss in the detection box regression loss is as follows:
L_θ = 1 - cos(θ* - θ)
where L_θ is the angle loss of the text detection box; θ* is the predicted angle of the text region; and θ is the true angle of the text region.
CN202010464099.6A 2020-05-27 2020-05-27 Multi-instance natural scene text detection method based on relevance hierarchy residual errors Active CN111723798B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010464099.6A CN111723798B (en) 2020-05-27 2020-05-27 Multi-instance natural scene text detection method based on relevance hierarchy residual errors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010464099.6A CN111723798B (en) 2020-05-27 2020-05-27 Multi-instance natural scene text detection method based on relevance hierarchy residual errors

Publications (2)

Publication Number Publication Date
CN111723798A CN111723798A (en) 2020-09-29
CN111723798B true CN111723798B (en) 2022-08-16

Family

ID=72565109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010464099.6A Active CN111723798B (en) 2020-05-27 2020-05-27 Multi-instance natural scene text detection method based on relevance hierarchy residual errors

Country Status (1)

Country Link
CN (1) CN111723798B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149620A (en) * 2020-10-14 2020-12-29 南昌慧亦臣科技有限公司 Method for constructing natural scene character region detection model based on no anchor point
CN112598004A (en) * 2020-12-21 2021-04-02 安徽七天教育科技有限公司 English composition test paper layout analysis method based on scanning
CN112598066A (en) * 2020-12-25 2021-04-02 中用科技有限公司 Lightweight road pavement detection method and system based on machine vision
CN112926533A (en) * 2021-04-01 2021-06-08 北京理工大学重庆创新中心 Optical remote sensing image ground feature classification method and system based on bidirectional feature fusion
CN113191450B (en) * 2021-05-19 2022-09-06 清华大学深圳国际研究生院 Weak supervision target detection algorithm based on dynamic label adjustment
CN114037826A (en) * 2021-11-16 2022-02-11 平安普惠企业管理有限公司 Text recognition method, device, equipment and medium based on multi-scale enhanced features
CN114526682B (en) * 2022-01-13 2023-03-21 华南理工大学 Deformation measurement method based on image feature enhanced digital volume image correlation method
CN114842001B (en) * 2022-07-01 2022-09-20 苏州大学 Remote sensing image detection system and method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10163022B1 (en) * 2017-06-22 2018-12-25 StradVision, Inc. Method for learning text recognition, method for recognizing text using the same, and apparatus for learning text recognition, apparatus for recognizing text using the same
CN110008953A (en) * 2019-03-29 2019-07-12 华南理工大学 Potential target Area generation method based on the fusion of convolutional neural networks multilayer feature

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10163022B1 (en) * 2017-06-22 2018-12-25 StradVision, Inc. Method for learning text recognition, method for recognizing text using the same, and apparatus for learning text recognition, apparatus for recognizing text using the same
CN110008953A (en) * 2019-03-29 2019-07-12 华南理工大学 Potential target Area generation method based on the fusion of convolutional neural networks multilayer feature

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Zhaohui Zheng et al.; "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression"; arXiv [cs.CV]; 2019-11-19; pp. 1-8 *
Xinyu Zhou et al.; "EAST: An Efficient and Accurate Scene Text Detector"; arXiv [cs.CV]; 2017-07-10; pp. 1-10 *
ShangHua Gao et al.; "Res2Net: A New Multi-scale Backbone Architecture"; arXiv [cs.CV]; 2019-04-02; pp. 1-8 *
Yang Xiaodong; "Multi-oriented scene text detection based on deep features"; China Master's Theses Full-text Database (Master), Information Science and Technology; 2019-07-15 (No. 7); pp. I138-941 *

Also Published As

Publication number Publication date
CN111723798A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111723798B (en) Multi-instance natural scene text detection method based on relevance hierarchy residual errors
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
WO2019192397A1 (en) End-to-end recognition method for scene text in any shape
CN109840556B (en) Image classification and identification method based on twin network
CN110738697A (en) Monocular depth estimation method based on deep learning
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN113505792B (en) Multi-scale semantic segmentation method and model for unbalanced remote sensing image
CN111259940A (en) Target detection method based on space attention map
CN111768388A (en) Product surface defect detection method and system based on positive sample reference
CN112365514A (en) Semantic segmentation method based on improved PSPNet
CN113313706B (en) Power equipment defect image detection method based on detection reference point offset analysis
CN113239818B (en) Table cross-modal information extraction method based on segmentation and graph convolution neural network
CN115311730B (en) Face key point detection method and system and electronic equipment
CN112418212A (en) Improved YOLOv3 algorithm based on EIoU
CN115578735B (en) Text detection method and training method and device of text detection model
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN111553351A (en) Semantic segmentation based text detection method for arbitrary scene shape
CN112070040A (en) Text line detection method for video subtitles
CN114332921A (en) Pedestrian detection method based on improved clustering algorithm for Faster R-CNN network
CN111192279B (en) Object segmentation method based on edge detection, electronic terminal and storage medium
CN113496480A (en) Method for detecting weld image defects
CN111104941B (en) Image direction correction method and device and electronic equipment
CN115330703A (en) Remote sensing image cloud and cloud shadow detection method based on context information fusion
CN114202648A (en) Text image correction method, training method, device, electronic device and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240103

Address after: 710075 Room 204, 2nd Floor, Building 4A, West Yungu Phase II, Fengxi New City, Xixian New Area, Xi'an City, Shaanxi Province

Patentee after: Xi'an Xingzhou Zhiyi Intelligent Technology Co.,Ltd.

Address before: 710049 No. 28 West Xianning Road, Shaanxi, Xi'an

Patentee before: XI'AN JIAOTONG University

TR01 Transfer of patent right