CN111723798B - Multi-instance natural scene text detection method based on relevance hierarchy residual errors - Google Patents

Multi-instance natural scene text detection method based on relevance hierarchy residual errors

Info

Publication number
CN111723798B
CN111723798B (granted publication of application CN202010464099.6A)
Authority
CN
China
Prior art keywords
text
feature
loss
feature map
regions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010464099.6A
Other languages
Chinese (zh)
Other versions
CN111723798A (en
Inventor
田智强 (Tian Zhiqiang)
王春晖 (Wang Chunhui)
杜少毅 (Du Shaoyi)
兰旭光 (Lan Xuguang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Xingzhou Zhiyi Intelligent Technology Co ltd
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202010464099.6A priority Critical patent/CN111723798B/en
Publication of CN111723798A publication Critical patent/CN111723798A/en
Application granted granted Critical
Publication of CN111723798B publication Critical patent/CN111723798B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/62 - Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63 - Scene text, e.g. street names
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/22 - Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-instance natural scene text detection method based on relevance hierarchy residuals. The feature extraction network adopted by the method uses relevance hierarchy residuals and reverse stage-by-stage feature fusion to extract multi-scale features that combine coarse and fine granularity; these features contain more accurate and complete text information and thus improve text detection precision. The regression loss of the text detection box consists of a CIoU Loss term and an angle loss term; in particular, the CIoU Loss term takes into account the overlapping area, center distance and aspect ratio between the predicted and ground-truth text detection boxes, so the actual regression of the text detection box is evaluated more accurately and the performance of the detection method is improved. In addition, the invention reduces hardware computation pressure at several steps in suitable ways, and achieves good detection results on both conventional text regions and small text regions.

Description

Multi-instance natural scene text detection method based on relevance hierarchy residual errors
Technical Field
The invention belongs to the field of deep learning, computer vision and text detection, and particularly relates to a multi-instance natural scene text detection method based on relevance hierarchy residual errors.
Background
Text is a primary means of conveying information and plays an indispensable role in daily life. With the arrival of the big data era, how to acquire text information from massive numbers of images has become an urgent problem. Driven by the development of deep learning, natural scene text detection has therefore become a very active research direction in computer vision and is of great significance for image retrieval, scene understanding and similar tasks.
At present, a large body of research results has allowed natural scene text detection to be applied widely across many industries. For example, many internet companies build services and applications such as image retrieval and street-view navigation on natural scene text detection technology. Several cloud service providers have also launched image text detection services aimed at education, logistics, video, e-commerce, tourism and other fields, offering users direct text detection services, text detection models, and customized AI system integration. Although the technical achievements of natural scene text detection are remarkable, the images it must process have complex backgrounds and diverse text, and the related techniques still suffer from insufficient detection precision and similar problems.
Investigation shows that existing natural scene text detection methods have two main shortcomings. On the one hand, although feature extraction networks used for text detection have adopted the idea of multi-scale feature extraction and fusion from networks such as SPPNet, U-Net and FPN, they generally use a convolution kernel of only one size when extracting adjacent features at different scales, so the extracted features are coarse-grained. Since a text instance in a natural scene image is usually much smaller than the natural background, and existing methods detect small text regions poorly, finer-grained features are needed for the text detection task; coarse-grained multi-scale feature extraction therefore still leaves room for improvement. On the other hand, the regression loss function commonly used in text detection is the IoU Loss, which computes the intersection-over-union between the predicted and ground-truth text detection boxes. However, for the same intersection-over-union, the relative position and overlap of the predicted and ground-truth boxes can differ, so evaluating the regression of a text detection box directly by the intersection-over-union is also flawed, and the design of the loss function still needs improvement.
Disclosure of Invention
The invention aims to provide a multi-instance natural scene text detection method based on relevance hierarchy residuals, in order to solve two problems of current text detection methods: poor detection of small text regions, and loss functions that cannot properly evaluate the actual regression of the text detection box.
In order to achieve the purpose, the invention adopts the following technical scheme:
firstly, extracting features of an original input image by adopting a feature extraction network based on relevance hierarchy residual errors so as to obtain feature maps with different scales containing rich text information from low level to high level;
step two, performing reverse step-by-step feature fusion on the feature maps of different scales extracted in the step one so as to obtain a multi-scale fusion feature map;
thirdly, performing text region detection on the multi-scale fusion characteristic map output in the second step by adopting characteristic mapping, and outputting a pixel-level text score characteristic map and a text region geometric characteristic map so as to represent candidate prediction text regions;
step four, simply screening and eliminating all candidate predictive text regions generated in the step three in advance according to the score of each candidate predictive text region;
step five, merging and screening the residual candidate prediction text regions in the step four by using a local perception non-maximum suppression algorithm, thereby obtaining quasi-prediction text regions;
and step six, calculating the average score of the regions of all the quasi-prediction text regions obtained in the step five, and removing the regions with the average score of the regions lower than a certain threshold value, thereby obtaining the final prediction text regions and the detection results.
The method includes a training process in which several public common text detection data sets are used to train the model;
back propagation is used during training: the model parameters are updated continuously while the loss is large, until the loss converges to a small value, and the model parameters are then stored;
and step seven, the stored structure and parameters of the model form the multi-instance natural scene text detection model.
Further, in step one, the feature extraction network based on relevance hierarchy residuals introduces a relevance hierarchy residual structure into a ResNet-50 backbone network, so that accurate and complete multi-scale text features combining coarse and fine granularity can be extracted. In the feature extraction stage, the original input image passes through 5 convolutional layers Conv1-Conv5 and gradually acquires coarse-grained feature information at different scales from low level to high level; the feature map size after each convolutional layer becomes 1/2, 1/4, 1/8, 1/16 and 1/32 of the original image in turn. In addition, the relevance hierarchy residual structure is introduced into Conv2-Conv5 for fine-grained feature extraction between adjacent feature maps of different scales. In this way, the feature maps f1, f2, f3 and f4 of different scales generated during extraction simultaneously contain multi-scale feature information combining coarse and fine granularity.
Further, in the feature extraction network based on relevance hierarchy residuals, Conv1 uses a 7 × 7 convolution kernel followed by a MaxPool layer with a 3 × 3 kernel for downsampling. Conv2-Conv5 are each composed of a 1 × 1 convolution, a 3 × 3 convolution group and a 1 × 1 convolution, with a residual connection attached to simplify the learning objective and difficulty of the deep neural network. The 3 × 3 convolution group is the key to fine-grained feature extraction: the feature map produced by the first 1 × 1 convolution is divided equally into 4 sub-feature maps along the channel dimension; the 1st sub-feature map x_1 is output directly as y_1; each subsequent sub-feature map x_i passes through a 3 × 3 convolution K_i to obtain the output y_i; starting from the 3rd sub-feature map, x_i is first added to the output y_(i-1) of the previous sub-feature map before the 3 × 3 convolution; finally, the outputs of the 4 sub-feature maps are concatenated along the channel dimension to obtain the overall output y.
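For illustration only, the following PyTorch sketch shows one possible form of the bottleneck just described (a Res2Net-style hierarchical residual block); the channel counts, the use of BatchNorm and ReLU, and all module names are assumptions rather than details taken from the patent.

```python
# Hypothetical sketch of the relevance hierarchy residual bottleneck used in Conv2-Conv5:
# 1x1 convolution, a 3x3 convolution group over 4 sub-feature maps, 1x1 convolution,
# plus a residual connection (assumed layer choices).
import torch
import torch.nn as nn

class HierarchyResidualBottleneck(nn.Module):
    def __init__(self, in_channels, mid_channels, out_channels, splits=4):
        super().__init__()
        assert mid_channels % splits == 0
        self.splits = splits
        width = mid_channels // splits
        self.reduce = nn.Sequential(  # first 1x1 convolution reduces the channel dimension
            nn.Conv2d(in_channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        # one 3x3 convolution K_i for every sub-feature map except the first
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(width, width, 3, padding=1, bias=False),
                          nn.BatchNorm2d(width), nn.ReLU(inplace=True))
            for _ in range(splits - 1)])
        self.expand = nn.Sequential(  # second 1x1 convolution restores the channel dimension
            nn.Conv2d(mid_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels))
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, 1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.reduce(x)
        xs = torch.chunk(out, self.splits, dim=1)      # split into 4 sub-feature maps
        ys = [xs[0]]                                   # x_1 is passed through directly as y_1
        for i in range(1, self.splits):
            inp = xs[i] if i == 1 else xs[i] + ys[-1]  # from x_3 on, add the previous output y_(i-1)
            ys.append(self.convs[i - 1](inp))
        out = self.expand(torch.cat(ys, dim=1))        # merge the 4 outputs along the channel dimension
        return self.relu(out + self.shortcut(x))       # residual connection
```

Splitting the intermediate feature map into narrow sub-feature maps keeps the additional 3 × 3 convolutions cheap while giving each block several effective receptive-field sizes, which is how the fine-grained multi-scale behaviour described above arises.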
Further, in step two, reverse step-by-step feature fusion starts from the feature map f_1 generated by Conv5: f_1 is first upsampled to output a feature map twice its original size, so that the output has the same size as the feature map f_2 generated by Conv4 and the two can be concatenated directly along the channel dimension; after the feature maps are concatenated, a 1 × 1 and a 3 × 3 convolution are applied to reduce the channel dimension and the amount of parameter computation. In this way, the feature maps f_1, f_2, f_3 and f_4 of different scales are fused step by step in turn, and the size of the fused feature map is 1/4 of the original input image. Finally, a 3 × 3 convolutional layer is added to generate the final multi-scale fusion feature map.
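As an illustrative sketch of one such fusion stage (bilinear upsampling and the specific channel counts below are assumptions, not values fixed by the patent):

```python
# Hypothetical sketch of one reverse step-by-step fusion stage: upsample the deeper feature
# map by 2x, concatenate it with the shallower map, then apply 1x1 and 3x3 convolutions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionStage(nn.Module):
    def __init__(self, deep_channels, shallow_channels, out_channels):
        super().__init__()
        self.conv1x1 = nn.Conv2d(deep_channels + shallow_channels, out_channels, 1)
        self.conv3x3 = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, deep, shallow):
        deep = F.interpolate(deep, size=shallow.shape[-2:],
                             mode="bilinear", align_corners=False)  # 2x upsampling of the deeper map
        fused = torch.cat([deep, shallow], dim=1)   # concatenate along the channel dimension
        fused = F.relu(self.conv1x1(fused))         # 1x1 convolution reduces the channel dimension
        return F.relu(self.conv3x3(fused))          # 3x3 convolution refines the fused features

# usage example with illustrative channel counts: fuse f_1 (1/32 scale) into f_2 (1/16 scale)
f1, f2 = torch.randn(1, 2048, 16, 16), torch.randn(1, 1024, 32, 32)
print(FusionStage(2048, 1024, 128)(f1, f2).shape)   # torch.Size([1, 128, 32, 32])
```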
Further, in step three, the feature mapping applied to the multi-scale fusion feature map is a 1 × 1 convolution; the output pixel-level text score feature map and text region geometric feature map indicate, respectively, whether each pixel of the feature map lies in a text region, the distances from each pixel to the boundaries of its text region, and the inclination angle of that text region, and together represent the candidate predicted text regions.
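A possible 1 × 1 feature-mapping head is sketched below for illustration; the geometry layout (four boundary distances plus one angle, as in EAST-style detectors) and the output scaling are assumptions.

```python
# Hypothetical 1x1-convolution head mapping the fused feature map to a pixel-level score
# map and a geometry map (4 boundary distances + 1 inclination angle).
import math
import torch
import torch.nn as nn

class TextDetectionHead(nn.Module):
    def __init__(self, in_channels=128, max_dist=512.0):
        super().__init__()
        self.max_dist = max_dist                     # assumed upper bound on boundary distances
        self.score = nn.Conv2d(in_channels, 1, 1)    # probability that each pixel lies in text
        self.dists = nn.Conv2d(in_channels, 4, 1)    # distances to the 4 text-box boundaries
        self.angle = nn.Conv2d(in_channels, 1, 1)    # inclination angle of the text region

    def forward(self, fused):
        score = torch.sigmoid(self.score(fused))
        dists = torch.sigmoid(self.dists(fused)) * self.max_dist
        angle = (torch.sigmoid(self.angle(fused)) - 0.5) * math.pi / 2   # assumed range (-pi/4, pi/4)
        return score, torch.cat([dists, angle], dim=1)
```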
Further, the score threshold for performing simple pre-screening and culling on the candidate predictive text regions in step four is set to 0.5.
Further, in step five, the locality-aware non-maximum suppression algorithm first merges the remaining candidate predicted text regions row by row; two candidate predicted text regions are merged when their intersection area is greater than the set threshold of 0.2. When merging, the vertex coordinates of the two original text regions are averaged with their scores as weights to obtain the vertex coordinates of the merged text region, and the scores of the two original regions are added to give the score of the new merged region. The merged candidate predicted text regions are then filtered by a standard non-maximum suppression algorithm to obtain the quasi-predicted text regions.
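The weighted merging rule can be illustrated with the short Python sketch below; representing regions as quadrilaterals, using shapely for the overlap test, and interpreting the 0.2 threshold as an intersection-over-union are assumptions made for this example only.

```python
# Hypothetical sketch of the score-weighted vertex merge used in locality-aware NMS.
from shapely.geometry import Polygon

MERGE_THRESHOLD = 0.2

def should_merge(quad_a, quad_b):
    """Decide whether two quadrilaterals (lists of 4 (x, y) vertices) overlap enough to merge."""
    pa, pb = Polygon(quad_a), Polygon(quad_b)
    union = pa.union(pb).area
    return union > 0 and pa.intersection(pb).area / union > MERGE_THRESHOLD

def weighted_merge(quad_a, score_a, quad_b, score_b):
    """Average corresponding vertices with the region scores as weights; add the scores."""
    merged = [((score_a * ax + score_b * bx) / (score_a + score_b),
               (score_a * ay + score_b * by) / (score_a + score_b))
              for (ax, ay), (bx, by) in zip(quad_a, quad_b)]
    return merged, score_a + score_b

quad1 = [(0, 0), (10, 0), (10, 4), (0, 4)]
quad2 = [(1, 0), (11, 0), (11, 4), (1, 4)]
if should_merge(quad1, quad2):
    print(weighted_merge(quad1, 0.9, quad2, 0.8))
```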
Further, in the sixth step, a threshold value for performing region screening according to the region average score of the quasi-predictive text region is set to 0.1.
Further, a loss function is used in the training process, and the model parameters are adjusted when the loss is back-propagated.
Further, the loss function is composed of two parts: the text classification loss guides the correct classification of text regions, and the detection box regression loss guides the correct regression of the text detection box. The overall loss function is calculated as:
L = L_cls + λ·L_reg
where L is the total detection loss; L_cls is the text classification loss; L_reg is the detection box regression loss; and λ is a parameter balancing the importance of the two losses, set to 1.
The text classification loss is calculated as:
L_cls = 1 - 2·|Y ∩ Y*| / (|Y| + |Y*|)
where L_cls is the text classification loss; |Y| is the set of all positive-sample pixels in the ground-truth text score feature map; |Y*| is the set of all positive-sample pixels in the predicted text score feature map; and |Y ∩ Y*| is the intersection of the positive-sample regions of the predicted and ground-truth text score feature maps.
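For illustration only, a minimal PyTorch sketch of this dice-style classification loss follows; the exact functional form above is reconstructed from the quantities the patent defines, and the function name and tensor shapes are assumptions.

```python
# Hypothetical dice-style text classification loss over the predicted and ground-truth
# score maps, matching the |Y|, |Y*| and |Y ∩ Y*| terms described above.
import torch

def classification_loss(pred_score, gt_score, eps=1e-6):
    """pred_score, gt_score: tensors with values in [0, 1] and shape (N, 1, H, W)."""
    intersection = (pred_score * gt_score).sum()   # |Y ∩ Y*|
    total = pred_score.sum() + gt_score.sum()      # |Y*| + |Y|
    return 1.0 - 2.0 * intersection / (total + eps)
```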
The detection box regression loss is calculated as:
L_reg = L_g + λ_θ·L_θ
where L_reg is the detection box regression loss; L_g is the geometric regression loss of the text detection box without considering the angle; L_θ is the angle loss of the text detection box; and λ_θ is a parameter balancing the two losses, set to 20.
Furthermore, the geometric regression loss of the text detection box without considering the angle in the detection box regression loss is the CIoU Loss, calculated as:
L_g = 1 - IoU + R(A, B)
where L_g is the geometric regression loss of the text detection box; IoU = |A ∩ B| / |A ∪ B| is the intersection-over-union of the areas of the predicted box and the ground-truth box; A and B denote the predicted box region and the ground-truth box region, respectively; and R(A, B) is a penalty term calculated as:
R(A, B) = ρ²(a, b) / c² + α·v
where a and b are the centers of the predicted box A and the ground-truth box B, respectively; ρ(·) is the Euclidean distance; c is the diagonal length of the smallest enclosing rectangle covering both A and B; α = v / ((1 - IoU) + v) is a trade-off parameter; and v measures the consistency of the aspect ratios, calculated as:
v = (4 / π²)·(arctan(w_B / h_B) - arctan(w_A / h_A))²
where w_B, h_B are the width and height of the ground-truth box B, and w_A, h_A are the width and height of the predicted box A.
Further, the angle loss of the text detection box in the detection box regression loss is calculated as:
L_θ = 1 - cos(θ* - θ)
where L_θ is the angle loss of the text detection box; θ* is the predicted angle of the text region; and θ is the true angle of the text region.
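For illustration, the PyTorch sketch below combines the CIoU term and the angle term into the detection box regression loss described above; the geometric part is written for axis-aligned boxes (x1, y1, x2, y2) for simplicity, whereas the patent applies it to oriented text boxes, and the function names are assumptions.

```python
# Hypothetical sketch of the detection box regression loss L_reg = L_g + 20 * L_theta,
# where L_g is the CIoU loss (overlap + center distance + aspect-ratio terms).
import math
import torch

def ciou_loss(pred, gt, eps=1e-7):
    """pred, gt: tensors of shape (N, 4) holding (x1, y1, x2, y2)."""
    # overlap area and IoU
    iw = (torch.min(pred[:, 2], gt[:, 2]) - torch.max(pred[:, 0], gt[:, 0])).clamp(min=0)
    ih = (torch.min(pred[:, 3], gt[:, 3]) - torch.max(pred[:, 1], gt[:, 1])).clamp(min=0)
    inter = iw * ih
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)
    # squared center distance normalized by the enclosing rectangle's squared diagonal
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxg, cyg = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
    center = (cxp - cxg) ** 2 + (cyp - cyg) ** 2
    ew = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    eh = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])
    diag = ew ** 2 + eh ** 2 + eps
    # aspect-ratio consistency term v and trade-off parameter alpha
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wg / (hg + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + center / diag + alpha * v

def regression_loss(pred_box, gt_box, pred_angle, gt_angle, lambda_theta=20.0):
    l_g = ciou_loss(pred_box, gt_box)
    l_theta = 1 - torch.cos(pred_angle - gt_angle)   # L_theta = 1 - cos(theta* - theta)
    return (l_g + lambda_theta * l_theta).mean()
```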
Compared with the prior art, the invention has the following technical effects:
the adopted feature extraction network extracts multi-scale features combining coarse granularity and fine granularity by utilizing relevance hierarchy residual and reverse stage-by-stage feature fusion, wherein the multi-scale features comprise more accurate and complete text information, and the feature expression capability of the network is further enhanced, so that the text detection precision can be improved;
the regression Loss of the text detection box used by the invention consists of a CIoU Loss part and an angle Loss part, and particularly, the use of the CIoU Loss part considers factors such as the overlapping area, the central distance, the length-width ratio and the like between the predicted text detection box and the real text detection box, so that the actual regression condition of the text detection box can be more accurately evaluated, and the performance of the text detection method can be improved;
the invention relieves hardware computational stress in a number of steps in a suitable manner, such as: 1 × 1, 3 × 3 small convolutions, feature splitting and splicing and the like are used at multiple positions in network design to reduce feature dimensions and reduce parameter calculation amount; pre-simple threshold screening and the like for candidate prediction text regions are also carried out;
the method has high detection precision for the conventional text region, is sensitive to the detection of the small text region, and has higher application value in the field of natural scene text detection.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a feature extraction and feature fusion network architecture of the present invention;
FIG. 3 is a basic structure diagram of the relevance hierarchy residual error used by the feature extraction network Conv2-Conv5 according to the present invention;
FIG. 4 shows a selection of detection results of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
referring to fig. 1, the present invention comprises the steps of:
step 101, using a camera to acquire image data or directly uploading the image data as image input.
Step 102, extracting features from the original input image with the feature extraction network based on relevance hierarchy residuals to obtain multi-scale feature maps f1, f2, f3 and f4, which combine coarse and fine granularity, have sizes of 1/32, 1/16, 1/8 and 1/4 of the original input image, and represent rich feature information from low level to high level.
Step 103, reverse step-by-step feature fusion: starting from feature map f1, sequentially upsample and concatenate f1, f2, f3 and f4 to finally generate a multi-scale fusion feature map whose size is 1/4 of the original input image.
And 104, performing feature mapping on the multi-scale fusion feature map to detect a text region, and outputting a pixel-level text score feature map and a text region geometric feature map, wherein the pixel-level text score feature map and the text region geometric feature map respectively indicate whether each pixel point in the feature map is in the text region, the boundary distance between each pixel point and the text region and the inclination angle of the text region to which the pixel point belongs, so that candidate prediction text regions can be represented.
And 105, simply screening and eliminating the candidate predictive text region in advance according to the region score, wherein the screening score threshold is set to be 0.5.
Step 106, a locality-aware non-maximum suppression algorithm merges the remaining candidate predicted text regions row by row; two candidate predicted text regions are merged when their intersection area is greater than the set threshold of 0.2. When merging, the vertex coordinates of the two original text regions are averaged with their scores as weights to obtain the vertex coordinates of the merged text region, and the scores of the two original regions are added to give the score of the new merged region. The merged candidate predicted text regions are then filtered by a standard non-maximum suppression algorithm to obtain the quasi-predicted text regions.
Step 107, performing region screening according to the region average score of the quasi-predictive text region to obtain a final predictive text region (i.e. a text detection result), wherein the screening threshold is set to 0.1.
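Taken together, steps 105-107 form the post-processing pipeline. A simplified sketch of its thresholding logic follows; the region representation (a dict holding a score and the pixels it covers) and the lanms_merge callable standing in for step 106 are illustrative assumptions.

```python
# Hypothetical post-processing pipeline for steps 105-107.
SCORE_THRESHOLD = 0.5       # step 105: pre-screening of candidate regions by score
AVG_SCORE_THRESHOLD = 0.1   # step 107: filtering by region average score

def post_process(regions, score_map, lanms_merge):
    """regions: list of dicts with 'score' and 'pixels' (list of (row, col)); score_map: 2-D array."""
    candidates = [r for r in regions if r["score"] > SCORE_THRESHOLD]
    quasi = lanms_merge(candidates)   # step 106: locality-aware NMS (merge + standard NMS)
    final = []
    for r in quasi:
        avg = sum(score_map[y][x] for y, x in r["pixels"]) / len(r["pixels"])
        if avg > AVG_SCORE_THRESHOLD:
            final.append(r)
    return final
```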
In addition, as with most deep learning methods, the model is first trained with a large amount of labeled image data, and back propagation and parameter optimization during training require a loss function. The loss function is composed of two parts: the text classification loss guides the correct classification of text regions, and the detection box regression loss guides the correct regression of the text detection box. The overall loss function is calculated as:
L = L_cls + λ·L_reg
where L is the total detection loss; L_cls is the text classification loss; L_reg is the detection box regression loss; and λ is a parameter balancing the importance of the two losses, set to 1.
The text classification loss is calculated as:
L_cls = 1 - 2·|Y ∩ Y*| / (|Y| + |Y*|)
where L_cls is the text classification loss; |Y| is the set of all positive-sample pixels in the ground-truth text score feature map; |Y*| is the set of all positive-sample pixels in the predicted text score feature map; and |Y ∩ Y*| is the intersection of the positive-sample regions of the predicted and ground-truth text score feature maps.
The detection box regression loss is calculated as:
L_reg = L_g + λ_θ·L_θ
where L_reg is the detection box regression loss; L_g is the geometric regression loss of the text detection box without considering the angle; L_θ is the angle loss of the text detection box; and λ_θ is a parameter balancing the two losses, set to 20.
The geometric regression loss of the text detection box without considering the angle in the detection box regression loss is the CIoU Loss, calculated as:
L_g = 1 - IoU + R(A, B)
where L_g is the geometric regression loss of the text detection box; IoU = |A ∩ B| / |A ∪ B| is the intersection-over-union of the areas of the predicted box and the ground-truth box; A and B denote the predicted box region and the ground-truth box region, respectively; and R(A, B) is a penalty term calculated as:
R(A, B) = ρ²(a, b) / c² + α·v
where a and b are the centers of the predicted box A and the ground-truth box B, respectively; ρ(·) is the Euclidean distance; c is the diagonal length of the smallest enclosing rectangle covering both A and B; α = v / ((1 - IoU) + v) is a trade-off parameter; and v measures the consistency of the aspect ratios, calculated as:
v = (4 / π²)·(arctan(w_B / h_B) - arctan(w_A / h_A))²
where w_B, h_B are the width and height of the ground-truth box B, and w_A, h_A are the width and height of the predicted box A.
The angle loss of the text detection box in the detection box regression loss is calculated as:
L_θ = 1 - cos(θ* - θ)
where L_θ is the angle loss of the text detection box; θ* is the predicted angle of the text region; and θ is the true angle of the text region.
Referring to fig. 2, it depicts a structure diagram of the feature extraction and feature fusion network of the present invention, which includes the following parts:
step 201, using a camera to acquire image data or directly uploading the image data as image input.
Step 202, feature extraction is performed on the original input image with the feature extraction network based on relevance hierarchy residuals. The 5 convolutional layers Conv1-Conv5 in the network gradually acquire coarse-grained feature information at different scales from low level to high level, and the feature map size after each convolutional layer becomes 1/2, 1/4, 1/8, 1/16 and 1/32 of the original image in turn. In addition, the relevance hierarchy residual structure is introduced into Conv2-Conv5 for fine-grained feature extraction between adjacent feature maps of different scales; in this way, the feature maps f1, f2, f3 and f4 of different scales generated during extraction simultaneously contain multi-scale feature information combining coarse and fine granularity. Conv1 uses a 7 × 7 convolution kernel followed by a MaxPool layer with a 3 × 3 kernel for downsampling. Conv2-Conv5 are each composed of a 1 × 1 convolution, a 3 × 3 convolution group and a 1 × 1 convolution, with a residual connection attached to simplify the learning objective and difficulty of the deep neural network.
Step 203, reverse step-by-step feature fusion starts from the feature map f_1 generated by Conv5: f_1 is first upsampled to output a feature map twice its original size, so that the output has the same size as the feature map f_2 generated by Conv4 and the two can be concatenated directly along the channel dimension; after the feature maps are concatenated, a 1 × 1 and a 3 × 3 convolution are applied to reduce the channel dimension and the amount of parameter computation. In this way, the feature maps f_1, f_2, f_3 and f_4 of different scales are fused step by step in turn, and the size of the fused feature map is 1/4 of the original input image. Finally, a 3 × 3 convolutional layer is added to generate the final multi-scale fusion feature map.
Referring to fig. 3, which depicts a basic structure diagram of the relevance hierarchy residual error used by the feature extraction network Conv2-Conv5 of the present invention, includes the following parts:
in step 301, the feature map 1 is convolved by 1 × 1 to reduce the amount of parameter calculation.
Step 302, the feature map generated by the 1 × 1 convolution is divided equally into 4 sub-feature maps along the channel dimension; the 1st sub-feature map x_1 is output directly as y_1; each subsequent sub-feature map x_i passes through a 3 × 3 convolution K_i to obtain the output y_i; starting from the 3rd sub-feature map, x_i is first added to the output y_(i-1) of the previous sub-feature map before the 3 × 3 convolution; finally, the outputs of the 4 sub-feature maps are concatenated along the channel dimension to obtain the overall output y.
Step 303, a 1 × 1 convolution of the feature map y restores the feature dimension and finally generates feature map 2.
Meanwhile, Conv2-Conv5 use residual connections in order to simplify the learning objective and difficulty of the deep neural network.
Referring to fig. 4, partial detection results of the method are shown, and the results show that the method is relatively accurate in detection of the horizontal text, relatively sensitive in detection of small text regions, relatively accurate in distinguishing of multiple instances, and capable of eliminating interference of text similar objects.
The embodiments of the present invention have been described above with reference to the accompanying drawings. It will be appreciated by persons skilled in the art that the present invention is not limited by the embodiments described above. On the basis of the technical solution of the present invention, those skilled in the art can make various modifications or variations without creative efforts and still be within the protection scope of the present invention.

Claims (9)

1. The multi-instance natural scene text detection method based on the relevance hierarchy residual error is characterized by comprising the following steps of:
firstly, extracting features of an original input image by adopting a feature extraction network based on relevance hierarchy residual errors so as to obtain feature maps with different scales containing rich text information from low level to high level; the relevance hierarchy residual error means that Conv2-Conv5 are each composed of a 1 × 1 convolution, a 3 × 3 convolution group and a 1 × 1 convolution, with a residual connection attached to simplify the learning objective and difficulty of the deep neural network, wherein the 3 × 3 convolution group is the key to fine-grained feature extraction: the feature map generated by the 1 × 1 convolution is first divided equally into 4 sub-feature maps along the channel dimension; the 1st sub-feature map x_1 is output directly as y_1; each subsequent sub-feature map x_i passes through a 3 × 3 convolution K_i to obtain the output y_i; starting from the 3rd sub-feature map, x_i is first added to the output y_(i-1) of the previous sub-feature map before the 3 × 3 convolution; finally, the outputs of the 4 sub-feature maps are concatenated along the channel dimension to obtain the overall output y;
step two, performing reverse step-by-step feature fusion on the feature maps of different scales extracted in the step one so as to obtain a multi-scale fusion feature map;
thirdly, performing text region detection on the multi-scale fusion characteristic graph output in the second step by adopting characteristic mapping, and outputting a pixel-level text score characteristic graph and a text region geometric characteristic graph so as to represent candidate prediction text regions;
step four, simply screening and eliminating all candidate predictive text regions generated in the step three in advance according to the score of each candidate predictive text region, and setting the score threshold value to be 0.5;
step five, merging and screening the residual candidate prediction text regions in the step four by using a local perception non-maximum suppression algorithm, thereby obtaining quasi prediction text regions;
step six, calculating the average score of the regions of all the quasi-prediction text regions obtained in the step five, and removing the regions with the average score of the regions lower than the threshold value of 0.1, so as to obtain the final prediction text regions and the detection results;
the method comprises a training process, wherein a plurality of public common text detection data sets are used to train the multi-instance natural scene text detection model formed by steps one to five;
in the training process, the model parameters are continuously updated by using back propagation until loss convergence, and the parameters of the model are stored;
and step seven, forming a multi-instance natural scene text detection model by using the model parameters and the structure stored in the step six.
2. The method for detecting the text of the multi-instance natural scene based on the relevance hierarchy residual error as claimed in claim 1, wherein in the step one, the relevance hierarchy residual error based feature extraction network introduces a relevance hierarchy residual error structure based on a ResNet-50 backbone network, so as to extract precise and complete multi-scale text features combining coarse granularity and fine granularity, in a feature extraction link, an original input image gradually acquires coarse-grained feature information of different scales from a low level to a high level through 5 convolutional layers Conv1-Conv5, and the feature map size after passing through each convolutional layer sequentially becomes 1/2, 1/4, 1/8, 1/16 and 1/32 of an original image; in addition, a relevance hierarchy residual error structure is introduced into Conv2-Conv5 and used for fine-grained feature extraction between adjacent feature maps with different scales; in this way, the different scale feature maps f1, f2, f3 and f4 generated in the extraction process simultaneously contain multi-scale feature information combining coarse granularity and fine granularity.
3. The method according to claim 1, wherein in step two, reverse step-by-step feature fusion starts from the feature map f_1 generated by Conv5: f_1 is first upsampled to output a feature map twice its original size, so that the output has the same size as the feature map f_2 generated by Conv4 and the two can be concatenated directly along the channel dimension; after the feature maps are concatenated, a 1 × 1 and a 3 × 3 convolution are applied to reduce the channel dimension and the amount of parameter computation; in this way, the feature maps f_1, f_2, f_3 and f_4 of different scales are fused step by step in turn, and the size of the fused feature map is 1/4 of the original input image; in addition, a 3 × 3 convolutional layer is added to generate the final multi-scale feature fusion map.
4. The method for detecting multi-instance natural scene text based on relevance hierarchy residual error according to claim 1, wherein in step three, the feature mapping for the multi-scale feature fusion graph adopts 1x1 convolution operation; and then, the output pixel-level text score feature map and the text region geometric feature map respectively show whether each pixel point in the feature map is in the text region, the boundary distance between each pixel point and the text region and the inclination angle of the text region to which the pixel point belongs, so that the candidate prediction text region can be represented.
5. The method for detecting the multi-instance natural scene text based on the relevance hierarchy residual error according to claim 1, wherein in step five, the locality-aware non-maximum suppression algorithm first merges the remaining candidate predicted text regions row by row, and two candidate predicted text regions are merged when their intersection area is greater than the set threshold of 0.2; when merging, the vertex coordinates of the two original text regions are averaged with their scores as weights to obtain the vertex coordinates of the merged text region, and the scores of the two original regions are added to give the score of the new merged region; the merged candidate predicted text regions are then filtered by a standard non-maximum suppression algorithm to obtain the quasi-predicted text regions.
6. The method according to claim 1, wherein in step six, a loss function is used in the training process, and parameter adjustment is performed when loss is propagated backwards.
7. The method of claim 6, wherein the loss function consists of two parts, wherein the text classification loss is used to guide the correct classification of text regions; the regression loss of the detection box is used for guiding the correct regression of the text detection box; the overall loss function is calculated as:
L = L_cls + λ·L_reg
where L is the total detection loss; L_cls is the text classification loss; L_reg is the detection box regression loss; and λ is a parameter balancing the importance of the two losses, set to 1;
the text classification loss calculation formula is as follows:
Figure FDA0003718126560000031
wherein L is cls Representing a text classification loss; y represents all positive sample areas in the real text score feature map; | Y * I represents all positive sample regions in the predicted text score feature map; y is reverse U Y * I represents the part of the predicted text score feature map where the positive sample region intersects with the positive sample region in the real text score feature map;
the detection frame regression loss calculation formula is as follows:
L reg =L gθ L θ
wherein L is reg Regression loss for the detection box; l is g Detecting box geometric regression loss for the text without considering the angle; l is θ Detecting a frame angle loss for the text; lambda [ alpha ] θ The two loss tradeoff parameters are 20.
8. The method of claim 7, wherein the text detection box geometric regression Loss without considering angles in the text detection box regression Loss is CIoU Loss, and the calculation formula is as follows:
L_g = 1 - IoU + R(A, B)
where L_g is the geometric regression loss of the text detection box; IoU = |A ∩ B| / |A ∪ B| is the intersection-over-union of the areas of the predicted box and the ground-truth box; A and B denote the predicted box region and the ground-truth box region, respectively; R(A, B) is a penalty term calculated as:
R(A, B) = ρ²(a, b) / c² + α·v
where a and b are the centers of the predicted box A and the ground-truth box B, respectively; ρ(·) is the Euclidean distance; c is the diagonal length of the smallest enclosing rectangle covering both A and B; α = v / ((1 - IoU) + v) is a trade-off parameter; and v measures the consistency of the aspect ratios, calculated as:
v = (4 / π²)·(arctan(w_B / h_B) - arctan(w_A / h_A))²
where w_B, h_B are the width and height of the ground-truth box B, and w_A, h_A are the width and height of the predicted box A.
9. The method of claim 8, wherein the calculation formula of the text detection box angle loss in the detection box regression loss is as follows:
L_θ = 1 - cos(θ* - θ)
where L_θ is the angle loss of the text detection box; θ* is the predicted angle of the text region; and θ is the true angle of the text region.
CN202010464099.6A 2020-05-27 2020-05-27 Multi-instance natural scene text detection method based on relevance hierarchy residual errors Active CN111723798B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010464099.6A CN111723798B (en) 2020-05-27 2020-05-27 Multi-instance natural scene text detection method based on relevance hierarchy residual errors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010464099.6A CN111723798B (en) 2020-05-27 2020-05-27 Multi-instance natural scene text detection method based on relevance hierarchy residual errors

Publications (2)

Publication Number Publication Date
CN111723798A CN111723798A (en) 2020-09-29
CN111723798B true CN111723798B (en) 2022-08-16

Family

ID=72565109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010464099.6A Active CN111723798B (en) 2020-05-27 2020-05-27 Multi-instance natural scene text detection method based on relevance hierarchy residual errors

Country Status (1)

Country Link
CN (1) CN111723798B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149620A (en) * 2020-10-14 2020-12-29 南昌慧亦臣科技有限公司 Method for constructing natural scene character region detection model based on no anchor point
CN112598004A (en) * 2020-12-21 2021-04-02 安徽七天教育科技有限公司 English composition test paper layout analysis method based on scanning
CN112598066A (en) * 2020-12-25 2021-04-02 中用科技有限公司 Lightweight road pavement detection method and system based on machine vision
CN112926533A (en) * 2021-04-01 2021-06-08 北京理工大学重庆创新中心 Optical remote sensing image ground feature classification method and system based on bidirectional feature fusion
CN113191450B (en) * 2021-05-19 2022-09-06 清华大学深圳国际研究生院 Weak supervision target detection algorithm based on dynamic label adjustment
CN114037826A (en) * 2021-11-16 2022-02-11 平安普惠企业管理有限公司 Text recognition method, device, equipment and medium based on multi-scale enhanced features
CN114526682B (en) * 2022-01-13 2023-03-21 华南理工大学 Deformation measurement method based on image feature enhanced digital volume image correlation method
CN114842001B (en) * 2022-07-01 2022-09-20 苏州大学 Remote sensing image detection system and method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10163022B1 (en) * 2017-06-22 2018-12-25 StradVision, Inc. Method for learning text recognition, method for recognizing text using the same, and apparatus for learning text recognition, apparatus for recognizing text using the same
CN110008953A (en) * 2019-03-29 2019-07-12 华南理工大学 Potential target Area generation method based on the fusion of convolutional neural networks multilayer feature

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10163022B1 (en) * 2017-06-22 2018-12-25 StradVision, Inc. Method for learning text recognition, method for recognizing text using the same, and apparatus for learning text recognition, apparatus for recognizing text using the same
CN110008953A (en) * 2019-03-29 2019-07-12 华南理工大学 Potential target Area generation method based on the fusion of convolutional neural networks multilayer feature

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Zhaohui Zheng et al.; "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression"; arXiv [cs.CV]; 2019-11-19; pp. 1-8 *
Xinyu Zhou et al.; "EAST: An Efficient and Accurate Scene Text Detector"; arXiv [cs.CV]; 2017-07-10; pp. 1-10 *
ShangHua Gao et al.; "Res2Net: A New Multi-scale Backbone Architecture"; arXiv [cs.CV]; 2019-04-02; pp. 1-8 *
Yang Xiaodong; "Multi-oriented scene text detection based on deep features"; China Master's Theses Full-text Database (Master), Information Science and Technology; 2019-07-15 (No. 7); pp. I138-941 *

Also Published As

Publication number Publication date
CN111723798A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111723798B (en) Multi-instance natural scene text detection method based on relevance hierarchy residual errors
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
WO2019192397A1 (en) End-to-end recognition method for scene text in any shape
CN109840556B (en) Image classification and identification method based on twin network
CN110738697A (en) Monocular depth estimation method based on deep learning
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN113505792B (en) Multi-scale semantic segmentation method and model for unbalanced remote sensing image
CN111259940A (en) Target detection method based on space attention map
CN111768388A (en) Product surface defect detection method and system based on positive sample reference
CN112365514A (en) Semantic segmentation method based on improved PSPNet
CN113313706B (en) Power equipment defect image detection method based on detection reference point offset analysis
CN113239818B (en) Table cross-modal information extraction method based on segmentation and graph convolution neural network
CN115311730B (en) Face key point detection method and system and electronic equipment
CN112418212A (en) Improved YOLOv3 algorithm based on EIoU
CN115578735B (en) Text detection method and training method and device of text detection model
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN111553351A (en) Semantic segmentation based text detection method for arbitrary scene shape
CN112070040A (en) Text line detection method for video subtitles
CN114332921A (en) Pedestrian detection method based on improved clustering algorithm for Faster R-CNN network
CN111192279B (en) Object segmentation method based on edge detection, electronic terminal and storage medium
CN113496480A (en) Method for detecting weld image defects
CN111104941B (en) Image direction correction method and device and electronic equipment
CN115330703A (en) Remote sensing image cloud and cloud shadow detection method based on context information fusion
CN114202648A (en) Text image correction method, training method, device, electronic device and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240103

Address after: 710075 Room 204, 2nd Floor, Building 4A, West Yungu Phase II, Fengxi New City, Xixian New Area, Xi'an City, Shaanxi Province

Patentee after: Xi'an Xingzhou Zhiyi Intelligent Technology Co.,Ltd.

Address before: 710049 No. 28 West Xianning Road, Shaanxi, Xi'an

Patentee before: XI'AN JIAOTONG University

TR01 Transfer of patent right