CN111368848A - Character detection method under complex scene - Google Patents
Character detection method under complex scene
- Publication number
- CN111368848A CN111368848A CN202010464622.5A CN202010464622A CN111368848A CN 111368848 A CN111368848 A CN 111368848A CN 202010464622 A CN202010464622 A CN 202010464622A CN 111368848 A CN111368848 A CN 111368848A
- Authority
- CN
- China
- Prior art keywords
- detection
- character
- value
- parameter value
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to the technical field of artificial intelligence and computer vision, in particular to a deep-learning-based character detection method for complex scenes. By fusing a segmentation module and a detection module into a single network structure (SDetNet) and using a loss function (Shape Loss) that learns the spatial distribution characteristics of the data, the method reduces the character false-detection rate, reduces detection-box redundancy, and has good interpretability. The character detection method under a complex scene comprises the following steps: scene preprocessing of the image data; network model design; loss function design.
Description
Technical Field
The invention relates to the technical field of artificial intelligence and computer vision, in particular to a character detection method based on deep learning under a complex scene.
Background
Optical Character Recognition (OCR) refers to converting the characters on an image into computer-editable text. Its most important step is locating the character-region features in the image through feature extraction, i.e. character detection. Text detection has three mainstream families of methods: algorithms based on text-box regression; algorithms based on pixel segmentation; and algorithms combining segmentation with regression. Character detection currently faces many challenges, such as variable character orientation, irregular character distribution, and non-uniform character size. Because of these challenges, character detection in complex scenes is prone to two problems, false detection and excessive detection-box redundancy, which in turn harm character recognition.
In the field of computer vision, text detection in complex scenes can follow two different ideas: object detection and object segmentation. The paper "Detecting Text in Natural Image with Connectionist Text Proposal Network", published in 2016 by Zhi Tian et al., first introduced an RNN into a detection network built on a target-detection approach. Deep features of the image are obtained through a CNN, fixed-width anchors are used to detect text proposals, the features of the anchors in the same row are serialized into a sequence and fed into the RNN, and finally a fully connected layer performs classification and regression; the accepted text proposals are then merged into text lines. This seamless combination of RNN and CNN improves detection accuracy. Baoguang Shi et al., in "Detecting Oriented Text in Natural Images by Linking Segments" (2017), first detect segments, each representing a text line or part of a word, which may be one character, one word, or several characters. Segments belonging to the same text line or word are connected by links, established between the center points of two overlapping segments; finally, a merging algorithm combines segments and links into a complete text line, yielding the position and rotation angle of its detection box. This direct-regression method performs well on scene-text detection, although scene text still varies greatly in scale, aspect ratio, and orientation.
Qiangpeng Yang et al., in "IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Oriented Scene Text Detection" (2018), proposed a new Inception-Text module for multi-oriented scene-text detection. It uses a deformable PSROI pooling module to handle multi-oriented text, convolution branches with several different kernel sizes to handle text of different aspect ratios, and a deformable convolution layer after each branch to adapt to multi-oriented text, thereby detecting text in complex scenes.
In summary, target-detection and target-segmentation algorithms are two distinct and effective approaches to natural-scene text detection. However, character detection in complex scenes still has shortcomings: a complex character background easily causes false detection of characters. How to improve detection precision and reduce false detection is therefore a hot topic in complex-scene character-detection research.
In a complex scene, because real scenes, character distributions, and character sizes are all highly diverse, text detection suffers from false detections and detection-box redundancy. Moreover, when the image is large, characters occupy a small fraction of the pixels, so small targets are easily missed. A pure target-segmentation algorithm requires complex post-processing and still produces false detections; a pure target-detection algorithm easily produces redundant detection boxes and false detections in complex scenes.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a character detection method for complex scenes. By fusing a segmentation module and a detection module into a single network structure (SDetNet) and using a loss function (Shape Loss) that learns the spatial distribution characteristics of the data, the method reduces the character false-detection rate, reduces detection-box redundancy, and has good interpretability.
In order to solve the technical problem, the invention provides a character detection method under a complex scene, which is characterized by comprising the following steps:
Step one: scene preprocessing of the image data: divide the large-pixel image of the original complex scene into several small image blocks, detect each block separately, and then fuse the detection results.
Step two: design a network structure, SDetNet, that fuses a segmentation module and a detection module; calculate the intersection-over-union (IOU) of the detection module's box and the segmentation module's box, and let a merging module use this IOU parameter value together with the text-existence probability value to judge whether characters exist at a given local position in the scene. The IOU is calculated with formula (1):
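Formula (1) is not reproduced in this text (it appears only as an image in the original patent). Reading Pre_Rect as the predicted character region and Label_Rect as the ground-truth region, the intersection-over-union described here is presumably the standard one:

```latex
IOU = \frac{\left| Pre\_Rect \cap Label\_Rect \right|}{\left| Pre\_Rect \cup Label\_Rect \right|} \tag{1}
```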
where Pre_Rect is the predicted box shared by the segmentation module and the detection module in the intersection-ratio computation, and Label_Rect represents the real distribution area where characters exist.
Step three: design the loss function: set the IOU parameter value between the detection box and the ground-truth box as a dynamic weight parameter, use it in the model's final objective function, and then perform CNN iterative training. The loss function that regresses the length-width ratio is calculated as follows:
the coordinate origin is set as a point (0, 0), x and y respectively represent the length and the width of a text box, a point A (x 1 and y 1) and a point B (x 2 and y 2) in the coordinate respectively represent the truth value of the detection box and the result value predicted by the model, and a theta parameter is used as an included angle between the point A and the point B to measure the similarity of vector sums. Optimizing a theta parameter value, and adjusting a detection frame according to the following formulas (2) and (3):
where θ is the angle between the ground-truth coordinate A and the predicted coordinate B. As the θ parameter value increases, cos θ decreases, and the −ln term therefore increases. A gradient-descent algorithm adjusts the model so that the θ parameter value gradually decreases; AL is the resulting parameter value measuring the difference in vector direction.
A dynamic weight value is designed from the intersection-over-union of the ground-truth box and the predicted box. A larger IOU parameter value indicates that the detected region covers the character area well, so a higher weight is set; a smaller IOU parameter value indicates poor coverage, so a lower weight is set. The loss function is given by formula (4):
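Formula (4) is also missing from this text. A form consistent with the surrounding description, in which the IOU of the ground-truth box and the predicted box acts as a dynamic weight on the angular term AL, would be:

```latex
Shape\ Loss = IOU \cdot AL = IOU \cdot \left(-\ln\cos\theta\right) \tag{4}
```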
in the above character detection method, the number of the small image blocks dividing the image in the original complex scene is 4.
In the above character detection method, the detection module learns the character-region distribution and character inclination-angle features, and the segmentation module learns the character distribution probability and character detection-box features.
Owing to the above method, the invention has the following advantages over the prior art:
1. The invention fuses a segmentation module and a detection module into the network structure SDetNet; the segmentation branch effectively computes the character regions and the probability that characters exist, and combining it with the detection branch effectively reduces the character false-detection rate.
2. The target-box loss function Shape Loss uses the prior that character distributions have regular length-width ratios to standardize region-box detection, improving detection efficiency and reducing detection redundancy.
3. The method designs a dynamic weight parameter from the intersection-over-union (IOU) parameter. In the initial stage of network training, model learning is highly random and a large number of character detection boxes are generated. Through the IOU parameter value, positive and negative detection-box samples can be distinguished effectively. For a positive sample, the corresponding text region has a higher probability and the length-width ratio of the detection box is adjusted; conversely, for a negative sample, the probability is lower and the ratio adjustment is suppressed. Through this purposeful constraint, the model attends well to character-region features, so the IOU parameter value effectively and dynamically adjusts the model's learning.
The invention is further described with reference to the following figures and detailed description.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of the network architecture SDetNet of the present invention;
FIG. 3 is a coordinate diagram of the present invention.
Detailed Description
Referring to FIG. 1, the invention, a character detection method under a complex scene, comprises the following steps:
Step one: scene preprocessing of the image data. In an original complex-scene image of 1920 × 1080 pixels, small characters occupy few pixels. To raise the pixel proportion of the characters, the original image is divided into four 960 × 540 image blocks, each block is detected separately, and the detection results are fused.
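As an illustration of step one, the tiling and result fusion described above can be sketched as follows (function names are ours, not the patent's; a production version would also have to merge boxes that straddle tile borders, which this sketch ignores):

```python
def make_tiles(width, height, nx=2, ny=2):
    """Split a width x height image into nx*ny equal tiles.

    Returns a list of (x0, y0, x1, y1) tile rectangles in image
    coordinates, e.g. four 960x540 tiles for a 1920x1080 frame.
    """
    tw, th = width // nx, height // ny
    return [(ix * tw, iy * th, (ix + 1) * tw, (iy + 1) * th)
            for iy in range(ny) for ix in range(nx)]


def fuse_detections(per_tile_boxes, tiles):
    """Map per-tile detection boxes back into full-image coordinates.

    per_tile_boxes[i] is a list of (x0, y0, x1, y1) boxes expressed
    in the local coordinate system of tiles[i].
    """
    fused = []
    for boxes, (ox, oy, _, _) in zip(per_tile_boxes, tiles):
        for (x0, y0, x1, y1) in boxes:
            fused.append((x0 + ox, y0 + oy, x1 + ox, y1 + oy))
    return fused
```

For a 1920 × 1080 input, `make_tiles(1920, 1080)` yields the four 960 × 540 blocks of the embodiment, and `fuse_detections` shifts each tile's boxes by that tile's origin.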
Step two: network model design. Referring to FIG. 2, a network structure SDetNet fusing a segmentation module and a detection module is designed; the two modules share one backbone network. The segmentation branch produces the character distribution region and its probability, and the detection branch produces the character distribution angle parameter and region. The intersection-over-union (IOU) of the detection module's box and the segmentation module's box is calculated, and the merging module uses this IOU parameter value and the text-existence probability value to judge whether characters exist at a given local position in the scene. The IOU is calculated with formula (1):
where Pre_Rect is the predicted box shared by the segmentation module and the detection module in the intersection-ratio computation, and Label_Rect represents the real distribution area where characters exist;
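The intersection-over-union used by the merging module can be sketched for axis-aligned boxes as follows (a minimal illustration; the patent's detection boxes also carry a rotation angle, which this sketch ignores):

```python
def rect_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)  # overlap area (0 if disjoint)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

Identical boxes score 1.0, disjoint boxes 0.0, and partial overlaps fall in between; the merging module would compare this value (together with the text-existence probability) against a threshold.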
In FIG. 2, the input image size is batchsize × 3 × 512 × 512, and the number of channels output by each module is as follows:
Conv:16
Conv Stage 1:64
Conv Stage 2:256
Conv Stage 3:384
DeConv Stage 1:128
DeConv Stage 2:64
DeConv Stage 3:32
DeConv:32
Detection Block: 5
Segmentation: 2
the feature map size in the output result is:
Score Map: batchsize × 256 × 256 × 1
Box Geometry: batchsize × 256 × 256 × 4
Rotation Angle: batchsize × 256 × 256 × 1
As shown in FIG. 2, three results are obtained: the text-region score (Score Map), the text-box size (Box Geometry), and the text rotation angle (Rotation Angle). The segmentation block and the detection block share a U-shaped network structure.
Step three: loss function design. The collected samples are screened according to actual business requirements, and polygon annotation of the character areas is performed on the screened scene-specific samples. In complex scenes, character size, spacing, and distribution position are diverse, and the cross-entropy and IOU loss functions regress boxes with irregular length-width ratios redundantly. The invention therefore designs a new loss function that regresses the length-width ratio. To overcome the difficulty of model convergence, the IOU parameter value between the detection box and the ground-truth box is set as a dynamic weight parameter in the model's final objective function, and CNN iterative training is then performed; the trained model effectively reduces detection-box redundancy.
the effects of Shape Loss and IOU in FIG. 1 are: standardizing the length-width ratio; the loss weight parameter value is dynamically adjusted. Setting IOU parameter values of the detection frame and the real frame as dynamic weight parameters, and performing CNN iterative training as a final target function of the model, wherein the method for calculating the loss function of the regression length-width ratio comprises the following steps:
Referring to FIG. 3, set the coordinate origin at the point (0, 0); x and y represent the length and width of a text box; points A(x1, y1) and B(x2, y2) represent, respectively, the ground-truth value of the detection box and the value predicted by the model; and the parameter θ, the angle between points A and B, measures the similarity of the two vectors.
The θ parameter value is optimized and the detection box adjusted according to the following formulas (2) and (3):
where θ is the angle between the ground-truth coordinate A and the predicted coordinate B. As the θ parameter value increases, cos θ decreases, and the −ln term therefore increases. A gradient-descent algorithm adjusts the model so that the θ parameter value gradually decreases; AL is the resulting parameter value measuring the difference in vector direction.
Because a large number of detection boxes are generated early in network training, simply minimizing the θ parameter value makes the model hard to converge. A dynamic weight value is therefore designed from the intersection-over-union of the ground-truth box and the predicted box: a larger IOU parameter value indicates that the detected region covers the character area well, so a higher weight is set; a smaller IOU parameter value indicates poor coverage, so a lower weight is set. The loss function is given by formula (4):
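The exact formula (4) is not reproduced in this text; a runnable sketch of the angular term AL and its IOU-based dynamic weighting, under our reading of the description, is:

```python
import math

def _iou(a, b):
    """IOU of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def angle_loss(truth_wh, pred_wh):
    """AL = -ln(cos theta), where theta is the angle between the
    (length, width) vectors of the ground-truth and predicted boxes."""
    (x1, y1), (x2, y2) = truth_wh, pred_wh
    cos_t = (x1 * x2 + y1 * y2) / (math.hypot(x1, y1) * math.hypot(x2, y2))
    cos_t = min(1.0, max(1e-7, cos_t))  # clamp against rounding error
    return -math.log(cos_t)

def shape_loss(truth_box, pred_box):
    """IOU-weighted angular loss -- our assumed reading of formula (4)."""
    weight = _iou(truth_box, pred_box)  # larger IOU -> larger weight
    t = (truth_box[2] - truth_box[0], truth_box[3] - truth_box[1])
    p = (pred_box[2] - pred_box[0], pred_box[3] - pred_box[1])
    return weight * angle_loss(t, p)
```

The loss is zero when the predicted aspect ratio matches the ground truth, grows as the (length, width) vectors diverge, and is scaled down for low-IOU (negative-sample) boxes, matching the dynamic-weight behavior described above.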
and finally, removing redundant detection frames by using a non-maximum suppression algorithm (NMS) and outputting a final detection result.
The following alternatives, used on the basis of the technical scheme of the invention, all fall within the protection scope of the invention:
1. the scheme of the convolutional neural network CNN model can be replaced by a scheme combining other deep learning models or machine learning;
2. the segmentation and detection fusion network SDetNet designed by the invention can be replaced by other fusion methods;
3. the Loss function Shape Loss method designed by the invention can be replaced by other methods;
4. the dynamic threshold scheme designed by the invention can be replaced by other methods.
Claims (3)
1. A character detection method under a complex scene is characterized by comprising the following steps:
step one: scene preprocessing of the image data: divide the large-pixel image of the original complex scene into several small image blocks, detect each block separately, and then fuse the detection results;
step two: design a network structure, SDetNet, that fuses a segmentation module and a detection module; calculate the intersection-over-union (IOU) of the detection module's box and the segmentation module's box, and use the IOU parameter value together with the text-existence probability value, through a merging module, to judge whether characters exist at a given local position in the scene; the IOU is calculated with formula (1):
where Pre_Rect is the predicted box shared by the segmentation module and the detection module in the intersection-ratio computation, and Label_Rect represents the real distribution area where characters exist;
step three: loss function: set the IOU parameter value between the detection box and the ground-truth box as a dynamic weight parameter in the model's final objective function, and perform CNN iterative training; the loss function that regresses the length-width ratio is calculated as follows:
setting the origin of coordinates as a (0, 0) point, wherein x and y respectively represent the length and the width of a text box, points A (x 1 and y 1) and B (x 2 and y 2) in the coordinates respectively represent the truth value of a detection box and the result value predicted by a model, and a theta parameter is used as an included angle between the points A and B to measure the similarity of vector sums; optimizing a theta parameter value, and adjusting a detection frame according to the following formulas (2) and (3):
where θ is the angle between the ground-truth coordinate A and the predicted coordinate B; as the θ parameter value increases, cos θ decreases and the −ln term increases; the model is adjusted through a gradient-descent algorithm so that the θ parameter value gradually decreases, and AL is the resulting parameter value measuring the difference in vector direction;
a dynamic weight value is designed from the intersection-over-union of the ground-truth box and the predicted box; a larger IOU parameter value indicates that the detected region covers the character area well, so a higher weight is set; a smaller IOU parameter value indicates poor coverage, so a lower weight is set; the loss function Shape Loss is given by formula (4):
2. The character detection method under a complex scene according to claim 1, wherein the image of the original complex scene is divided into 4 image blocks.
3. The character detection method under the complex scene according to claim 1 or 2, wherein the detection module learns character region distribution and character inclination angle characteristics; the segmentation module learns the character distribution probability and character detection box features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010464622.5A CN111368848B (en) | 2020-05-28 | 2020-05-28 | Character detection method under complex scene |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111368848A true CN111368848A (en) | 2020-07-03 |
CN111368848B CN111368848B (en) | 2020-08-21 |
Family
ID=71212292
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010464622.5A Active CN111368848B (en) | 2020-05-28 | 2020-05-28 | Character detection method under complex scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111368848B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107480648A (en) * | 2017-08-23 | 2017-12-15 | 南京大学 | A kind of method of natural scene text detection |
CN108615237A (en) * | 2018-05-08 | 2018-10-02 | 上海商汤智能科技有限公司 | A kind of method for processing lung images and image processing equipment |
CN109559300A (en) * | 2018-11-19 | 2019-04-02 | 上海商汤智能科技有限公司 | Image processing method, electronic equipment and computer readable storage medium |
CN109584251A (en) * | 2018-12-06 | 2019-04-05 | 湘潭大学 | A kind of tongue body image partition method based on single goal region segmentation |
US20190130229A1 (en) * | 2017-10-31 | 2019-05-02 | Adobe Inc. | Deep salient content neural networks for efficient digital object segmentation |
CN109815948A (en) * | 2019-01-14 | 2019-05-28 | 辽宁大学 | A kind of paper partitioning algorithm under complex scene |
CN110097568A (en) * | 2019-05-13 | 2019-08-06 | 中国石油大学(华东) | A kind of the video object detection and dividing method based on the double branching networks of space-time |
CN110428432A (en) * | 2019-08-08 | 2019-11-08 | 梅礼晔 | The deep neural network algorithm of colon body of gland Image Automatic Segmentation |
CN110689093A (en) * | 2019-12-10 | 2020-01-14 | 北京同方软件有限公司 | Image target fine classification method under complex scene |
CN110738207A (en) * | 2019-09-10 | 2020-01-31 | 西南交通大学 | character detection method for fusing character area edge information in character image |
CN110751154A (en) * | 2019-09-27 | 2020-02-04 | 西北工业大学 | Complex environment multi-shape text detection method based on pixel-level segmentation |
US10572760B1 (en) * | 2017-11-13 | 2020-02-25 | Amazon Technologies, Inc. | Image text localization |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112733639A (en) * | 2020-12-28 | 2021-04-30 | 贝壳技术有限公司 | Text information structured extraction method and device |
CN112926637A (en) * | 2021-02-08 | 2021-06-08 | 天津职业技术师范大学(中国职业培训指导教师进修中心) | Method for generating text detection training set |
CN112926637B (en) * | 2021-02-08 | 2023-06-09 | 天津职业技术师范大学(中国职业培训指导教师进修中心) | Method for generating text detection training set |
Also Published As
Publication number | Publication date |
---|---|
CN111368848B (en) | 2020-08-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||