CN111242120B - Character detection method and system - Google Patents

Character detection method and system

Publication number: CN111242120B (application CN202010008296.7A; earlier publication CN111242120A)
Authority: CN (China)
Language: Chinese (zh)
Prior art keywords: texture information, suggestion, network, cutting, orthogonal directions
Legal status: Active (granted)
Inventors: 张勇东 (Zhang Yongdong), 王裕鑫 (Wang Yuxin), 谢洪涛 (Xie Hongtao)
Assignees: Beijing Zhongke Research Institute; University of Science and Technology of China (USTC)
Filed: 2020-01-03
Published as CN111242120A: 2020-06-05; granted as CN111242120B: 2022-07-29

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/2193 Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words


Abstract

A character detection method and system. The method includes: performing feature extraction on an input image to obtain a feature image; predicting with an adaptive region proposal network to obtain a suggestion box; cropping the feature image with the suggestion box to obtain a cropped feature map; modeling the character texture information of the cropped feature map in two orthogonal directions to obtain a contour point heatmap for each orthogonal direction; and screening the contour points in the heatmaps to obtain a contour point set from which the characters in the input image are reconstructed. The adaptive region proposal network adapts to changes in character scale and generates suggestion boxes that match character regions, while modeling character texture information in orthogonal directions suppresses false-positive contour points, improving the accuracy of detecting scene text of arbitrary shape.

Description

Character detection method and system
Technical Field
The present disclosure relates to the field of text recognition technologies, and in particular to a character detection method and system.
Background
Natural scene character detection means detecting character regions against a complex background and marking each region with a bounding box. The results of natural scene character detection are widely used in fields such as autonomous driving and robotics. Character detection in natural scenes faces difficulties such as low resolution, complex backgrounds, and variable font sizes, so traditional character detection techniques perform poorly in practice.
With the development of deep learning, natural scene character detection has improved remarkably. Although deep-learning-based detectors can detect characters of arbitrary shape, their results contain many false-positive detections and are affected by the diversity of character scales, so their detection accuracy still needs to be improved.
Disclosure of Invention
Technical problem to be solved
In view of this, the present disclosure provides a character detection method and system capable of improving the accuracy of detecting scene text of arbitrary shape.
(II) technical scheme
The present disclosure provides a character detection method, including: performing feature extraction on an input image to obtain a feature image; predicting with an adaptive region proposal network to obtain a suggestion box; cropping the feature image with the suggestion box to obtain a cropped feature map; modeling the character texture information of the cropped feature map in two orthogonal directions to obtain a contour point heatmap for each orthogonal direction; and screening the contour points in the heatmaps to obtain a contour point set from which the characters in the input image are reconstructed.
Optionally, predicting with the adaptive region proposal network to obtain a suggestion box includes: performing local offset prediction on the points of a preset anchor box using the adaptive region proposal network to obtain corresponding predicted points; and determining the suggestion box from the predicted points.
Optionally, the two orthogonal directions are the horizontal direction and the vertical direction, and modeling the character texture information of the cropped feature map in the two orthogonal directions includes: building a first character texture information model of the cropped feature map in the horizontal direction according to a first convolution kernel; and building a second character texture information model of the cropped feature map in the vertical direction according to a second convolution kernel.
Optionally, the size of the first convolution kernel is 1 × k and the size of the second convolution kernel is k × 1, where k is not greater than the size of the cropped feature map; in the present disclosure k is 3.
Optionally, the method further includes: adjusting the suggestion box with a fine-tuning network according to the cropped feature map to obtain an adjusted suggestion box; cropping the feature image with the adjusted suggestion box to obtain an adjusted cropped feature map; and upsampling the adjusted cropped feature map to obtain an upsampled feature map.
Optionally, modeling the character texture information of the cropped feature map in two orthogonal directions includes: modeling the character texture information of the upsampled feature map in the two orthogonal directions.
Optionally, modeling the character texture information of the cropped feature map in two orthogonal directions includes:
modeling the character texture information of the cropped feature map with a character texture information perception network for each of the two orthogonal directions;
before performing feature extraction on the input image, the method further includes:
training the adaptive region proposal network, the character texture information perception networks, and the fine-tuning network with a stochastic gradient descent method according to a loss function, the loss function being:
L = L_Arpn + λ_Hcp L_Hcp + λ_Vcp L_Vcp + λ_boxclass L_boxclass + λ_boxreg L_boxreg
where L is the overall loss function, L_Arpn is the loss function of the adaptive region proposal network, L_Hcp is the loss function of the character texture information perception network in one orthogonal direction, L_Vcp is the loss function of the character texture information perception network in the other orthogonal direction, L_boxclass and L_boxreg are the loss functions of the fine-tuning network, λ_Hcp and λ_Vcp are the balance parameters of the character texture information perception networks in the two orthogonal directions, and λ_boxclass and λ_boxreg are the balance parameters of the fine-tuning network.
Optionally, screening the contour point heatmaps to obtain the contour point set includes: filtering background pixels in the contour point heatmaps with non-maximum suppression; and screening the contour point heatmaps against a preset threshold to obtain the contour point set.
Optionally, screening the contour point heatmaps against the preset threshold to obtain the contour point set includes: selecting the pixels whose response values in the contour point heatmaps of both orthogonal directions are greater than the preset threshold to form the contour point set.
Another aspect of the present disclosure provides a character detection system, including: an extraction module for performing feature extraction on an input image to obtain a feature image; a prediction module for predicting with an adaptive region proposal network to obtain a suggestion box; a cropping module for cropping the feature image with the suggestion box to obtain a cropped feature map; a modeling module for modeling the character texture information of the cropped feature map in two orthogonal directions to obtain a contour point heatmap for each orthogonal direction; and a screening module for screening the contour points in the heatmaps to obtain a contour point set from which the characters in the input image are reconstructed.
(III) advantageous effects
In the character detection method and system of the present disclosure, the adaptive region proposal network is designed to better adapt to changes in character scale, and modeling character texture information in orthogonal directions suppresses false-positive contour points. The problems of character scale variation and false-positive prediction are thereby effectively addressed, and the accuracy of detecting scene text of arbitrary shape is improved.
Drawings
Fig. 1 schematically shows a flowchart of the character detection method provided in an embodiment of the present disclosure;
Fig. 2 schematically shows the prediction of a suggestion box in the character detection method provided in an embodiment of the present disclosure;
Fig. 3 schematically shows the modeling of character texture information in the character detection method provided in an embodiment of the present disclosure;
Fig. 4 schematically shows a block diagram of the character detection system provided in an embodiment of the present disclosure;
Fig. 5 schematically shows the fine-tuning network provided in an embodiment of the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
Fig. 1 schematically shows a flowchart of the character detection method provided in an embodiment of the present disclosure.
The method shown in Fig. 1 will be described in detail below with reference to Figs. 2 and 3. As shown in Fig. 1, the character detection method includes operations S110 to S150.
In operation S110, feature extraction is performed on the input image to obtain a feature image.
In this embodiment, a deep neural network (DNN) is used for character detection. The deep neural network includes a ResNet50 feature extraction network, an adaptive region proposal network, a fine-tuning network, a character texture information perception network in the horizontal direction, and a character texture information perception network in the vertical direction.
The deep neural network should be trained before operation S110. Specifically, end-to-end training is performed, for example, with stochastic gradient descent (SGD), where the overall loss function L of the deep neural network is:
L = L_Arpn + λ_Hcp L_Hcp + λ_Vcp L_Vcp + λ_boxclass L_boxclass + λ_boxreg L_boxreg
where L_Arpn is the loss function of the adaptive region proposal network, L_Hcp is the loss function of the character texture information perception network in one orthogonal direction (e.g., the horizontal direction), L_Vcp is the loss function of the character texture information perception network in the other orthogonal direction (e.g., the vertical direction), L_boxclass and L_boxreg are the loss functions of the fine-tuning network, λ_Hcp and λ_Vcp are the balance parameters of the character texture information perception networks in the two directions, and λ_boxclass and λ_boxreg are the balance parameters of the fine-tuning network.
Further, the loss function L_Arpn of the adaptive region proposal network is:
L_Arpn = L_Arpnclass + L_Arpnreg
L_Arpnclass = Σ_i L_cls(p_i, p_i*)
L_Arpnreg = (1/N_pos) Σ_i p_i* (1 - Intersection_i/Union_i)
where L_Arpnclass is the classification loss function, L_Arpnreg is the regression loss function, p_i is the predicted probability that the i-th preset anchor box is a target box (i.e., a suggestion box), L_cls is the cross-entropy loss function, N_pos is the number of positive anchor boxes, Intersection is the intersection of the anchor box and the target box, Union is the union of the anchor box and the target box, and p_i* is 1 when the intersection-over-union of the anchor box and the target box is greater than 0.5 and 0 otherwise.
The loss function L_Hcp of the character texture information perception network in the horizontal direction and the loss function L_Vcp of the character texture information perception network in the vertical direction are:
L_Hcp = L_Vcp = -(1/(N_pos + N_neg)) Σ_i [ y_i log(q_i) + (1 - y_i) log(1 - q_i) ]
where y_i is the label of the i-th contour point, q_i is the prediction for the i-th contour point, N_neg is the number of predicted background pixels, and N_pos is the number of predicted contour points.
The loss functions L_boxclass and L_boxreg of the fine-tuning network are:
L_boxclass = (1/N_pos1) Σ_i L_cls(p_i1, p_i1*)
L_boxreg = (1/N_reg) Σ_i Smooth_l1(t_i - t_i*)
where p_i1 is the probability that an anchor box in the box branch is a target box, L_cls is the cross-entropy loss function, N_pos1 is the number of prediction boxes in the box branch correctly matched with their labels, p_i1* is 1 when the intersection-over-union of the anchor box and the target box in the box branch is greater than 0.5 and 0 otherwise, N_reg is the number of boxes in the box branch that need to be fine-tuned, t_i are the parameters of the predicted box, t_i* are the parameters of the label box, and Smooth_l1 is the smooth-L1 function.
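As an illustration, the following is a minimal PyTorch-style sketch of these loss terms under the definitions above; the function names, tensor shapes, and exact normalizations are assumptions for readability, not taken from the patent.

    import torch.nn.functional as F

    def arpn_loss(p, iou):
        # p: predicted probabilities that each preset anchor box is a target
        # box; iou: intersection-over-union of each anchor box with its
        # target box. p_star is 1 for positive anchors (IoU > 0.5), else 0.
        p_star = (iou > 0.5).float()
        l_class = F.binary_cross_entropy(p, p_star, reduction='sum')
        n_pos = p_star.sum().clamp(min=1)
        l_reg = ((1.0 - iou) * p_star).sum() / n_pos  # IoU-based regression
        return l_class + l_reg

    def contour_point_loss(q, y):
        # Binary cross-entropy between a predicted heatmap q and the contour
        # point labels y, averaged over contour and background pixels.
        return F.binary_cross_entropy(q, y)

    def finetune_loss(p1, p1_star, t, t_star):
        # Box classification (cross-entropy over the box branch) and box
        # regression (smooth L1 over the boxes that need fine-tuning).
        return F.binary_cross_entropy(p1, p1_star), F.smooth_l1_loss(t, t_star)

    def total_loss(l_arpn, l_hcp, l_vcp, l_boxclass, l_boxreg,
                   lam_hcp=1.0, lam_vcp=1.0, lam_boxclass=1.0, lam_boxreg=1.0):
        # Weighted sum forming the overall loss L; the balance parameters are
        # not specified in the text, so the defaults here are placeholders.
        return (l_arpn + lam_hcp * l_hcp + lam_vcp * l_vcp
                + lam_boxclass * l_boxclass + lam_boxreg * l_boxreg)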
During training of the deep neural network, the initial learning rate is set to 0.0025; when the number of training iterations reaches 120,000 and again 160,000, the learning rate is reduced to 0.1 times its current value. In this embodiment the network is trained for 180,000 iterations, at which point the overall loss function L meets the requirement, and the trained deep neural network can then be used for character detection.
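A hypothetical training loop matching this schedule might look as follows; PyTorch is an assumption, and model (returning the overall loss L) and the infinite data loader are placeholders.

    import torch

    optimizer = torch.optim.SGD(model.parameters(), lr=0.0025)  # initial rate
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[120000, 160000], gamma=0.1)      # decay to 0.1x

    for step, (image, targets) in zip(range(180000), loader):   # 180,000 iterations
        loss = model(image, targets)  # overall loss L defined above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()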
According to the embodiment of the disclosure, feature extraction is performed on an input image by using a ResNet50 feature extraction network, so that a feature image is obtained.
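For illustration, such a backbone can be sketched with torchvision's ResNet50; the truncation point and the input size below are assumptions.

    import torch
    import torchvision

    # Use ResNet50 up to its last convolutional stage as the feature extractor.
    backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

    image = torch.randn(1, 3, 800, 800)  # placeholder input image
    feature_image = extractor(image)     # feature image, here (1, 2048, 25, 25)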
In operation S120, prediction is performed with the adaptive region proposal network to obtain a suggestion box.
According to an embodiment of the present disclosure, operation S120 includes sub-operation S120A and sub-operation S120B.
Sub-operation S120A: perform local offset prediction on the points of a preset anchor box with the adaptive region proposal network to obtain corresponding predicted points. Specifically, the obtained predicted points are:
(x_l', y_l') = (x_l + ω_c Δx_l, y_l + h_c Δy_l), l = 1, ..., n
where n is the number of points in the preset anchor box, x_l' and y_l' are the abscissa and ordinate of the l-th predicted point, x_l and y_l are the abscissa and ordinate of the l-th point in the preset anchor box, ω_c and h_c are the length and width of the preset anchor box, and Δx_l and Δy_l are the abscissa and ordinate offsets of the l-th point in the preset anchor box output by the adaptive region proposal network.
Referring to Fig. 2, the number n of points in the preset anchor box is set to 9, representing the center point and eight boundary points (the upper-left, upper-middle, upper-right, middle-right, lower-right, lower-middle, lower-left, and middle-left points).
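A sketch of this offset decoding follows; the names and shapes are illustrative.

    import torch

    def predict_points(anchor_points, anchor_wh, offsets):
        # anchor_points: (n, 2) preset anchor box points (x_l, y_l);
        # offsets: (n, 2) network-predicted offsets (dx_l, dy_l);
        # anchor_wh: (w_c, h_c) of the preset anchor box. Each offset is
        # scaled by the anchor box size, as in the formula above.
        return anchor_points + offsets * torch.tensor(anchor_wh)

    # Example with n = 9 points (the center plus eight boundary points).
    points = predict_points(torch.rand(9, 2) * 100, (100.0, 40.0),
                            torch.randn(9, 2) * 0.1)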
Sub-operation S120B: determine the suggestion box from the predicted points. Specifically, the predicted points corresponding to the four extreme coordinates (the minimum abscissa, the minimum ordinate, the maximum abscissa, and the maximum ordinate) are obtained by maximum/minimum screening and determine the suggestion box, as shown in Fig. 2. The suggestion box (proposal) position is represented by these four extreme coordinates:
proposal = (x_min, y_min, x_max, y_max), with x_min = min_l x_l', y_min = min_l y_l', x_max = max_l x_l', y_max = max_l y_l'.
in this embodiment, the number of the obtained suggestion boxes is one or more. And a plurality of suggestion boxes are obtained through prediction, so that the character detection precision can be further improved.
In operation S130, the feature image is cropped with the suggestion box to obtain a cropped feature map.
In this embodiment, when there are multiple suggestion boxes, the feature image is cropped with each suggestion box to obtain multiple cropped feature maps, which are then normalized to the same size.
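The patent does not name the cropping operator; RoIAlign is one standard operator that crops each box from a feature map and resizes the crops to a common size, so the following sketch uses it as an assumption (the 14 × 14 output size is likewise illustrative).

    import torch
    from torchvision.ops import roi_align

    # feature_image: (1, C, H, W); proposals: (k, 4) suggestion boxes given in
    # feature-map coordinates. roi_align expects a batch index per box.
    proposals = proposal.unsqueeze(0)  # (1, 4), from the sketch above
    boxes = torch.cat([torch.zeros(len(proposals), 1), proposals], dim=1)
    crops = roi_align(feature_image, boxes, output_size=(14, 14))  # (k, C, 14, 14)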
According to an embodiment of the present disclosure, after operation S130 the character detection method further includes: adjusting the suggestion box with a fine-tuning network according to the cropped feature map to obtain an adjusted suggestion box; and cropping the feature image with the adjusted suggestion box to obtain an adjusted cropped feature map.
Referring to Fig. 5, the fine-tuning network operates on the cropped feature map and outputs the adjustment parameters of the suggestion box, which are used to adjust it. The adjusted suggestion box is:
x = x_c + w_c t_1, y = y_c + h_c t_2, w = w_c exp(t_3), h = h_c exp(t_4)
where x and y are the abscissa and ordinate of the center point of the adjusted suggestion box, w and h are its width and height, x_c and y_c are the abscissa and ordinate of the center point of the suggestion box before adjustment, w_c and h_c are its width and height (x_c, y_c, w_c, and h_c can be computed from the extreme coordinates of the suggestion box (proposal)), and t_1, t_2, t_3, and t_4 are the adjustment parameters output by the fine-tuning network.
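A sketch of this decoding follows; the exponential form for the width and height matches the standard R-CNN box parameterization and is an assumption here.

    import torch

    def refine_box(box, t):
        # box: suggestion box (x_min, y_min, x_max, y_max); t: adjustment
        # parameters (t1, t2, t3, t4) output by the fine-tuning network.
        x_c, y_c = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        w_c, h_c = box[2] - box[0], box[3] - box[1]
        x = x_c + w_c * t[0]       # shift the center point
        y = y_c + h_c * t[1]
        w = w_c * torch.exp(t[2])  # rescale width and height
        h = h_c * torch.exp(t[3])
        return torch.stack([x - w / 2, y - h / 2, x + w / 2, y + h / 2])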
Further, the character detection method also includes: upsampling the adjusted cropped feature map to obtain an upsampled feature map, whose size is larger than that of the adjusted cropped feature map.
In operation S140, character texture information modeling is performed on the cropped feature map in two orthogonal directions to obtain a contour point heatmap for each orthogonal direction.
Specifically, the character texture information of the adjusted, upsampled feature map is modeled in the two orthogonal directions, yielding a contour point heatmap for each direction.
Referring to Fig. 3, the two orthogonal directions are the horizontal direction and the vertical direction, and operation S140 includes sub-operations S140A and S140B.
Sub-operation S140A: build a first character texture information model of the cropped feature map in the horizontal direction according to the first convolution kernel. Specifically, the first character texture information model of the adjusted, upsampled feature map is built by sliding the kernel in the horizontal direction. The first convolution kernel has size 1 × k, where k is greater than 0 and not greater than the size of the cropped feature map; for example, k = 3.
Sub-operation S140B: build a second character texture information model of the cropped feature map in the vertical direction according to the second convolution kernel. Specifically, the second character texture information model of the adjusted, upsampled feature map is built by sliding the kernel in the vertical direction. The size of the second convolution kernel is k × 1.
Further, the first and second character texture information models are normalized with a Sigmoid function to obtain the contour point heatmap Hmap in the horizontal direction and the contour point heatmap Vmap in the vertical direction.
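A minimal sketch of such a pair of directional heads, assuming each is a single 1 × k or k × 1 convolution (the module and channel names are illustrative):

    import torch
    import torch.nn as nn

    class TextureHeads(nn.Module):
        # A 1xk convolution models character texture along the horizontal
        # direction and a kx1 convolution along the vertical direction; a
        # Sigmoid normalizes the responses into the heatmaps Hmap and Vmap.
        def __init__(self, channels, k=3):
            super().__init__()
            self.horizontal = nn.Conv2d(channels, 1, (1, k), padding=(0, k // 2))
            self.vertical = nn.Conv2d(channels, 1, (k, 1), padding=(k // 2, 0))

        def forward(self, feat):
            hmap = torch.sigmoid(self.horizontal(feat))
            vmap = torch.sigmoid(self.vertical(feat))
            return hmap, vmap

    heads = TextureHeads(channels=256)
    hmap, vmap = heads(torch.randn(1, 256, 56, 56))  # e.g. an upsampled crop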
In operation S150, the contour points in the contour point heatmaps are screened to obtain a contour point set, from which the characters in the input image are reconstructed.
In this embodiment, the contour point heatmaps are screened so that the pixels that simultaneously have high response values in both heatmaps are retained, forming the contour point set.
According to an embodiment of the present disclosure, operation S150 includes sub-operation S150A and sub-operation S150B.
Sub-operation S150A: filter the background pixels in the contour point heatmaps with non-maximum suppression. Specifically, the horizontal contour point heatmap is processed with, for example, a 1 × 3 sliding window and the vertical contour point heatmap with a 3 × 1 sliding window; in each window only the largest pixel is output and the remaining pixels are suppressed.
Sub-operation S150B: screen the contour point heatmaps against a preset threshold to obtain the contour point set. Specifically, every pixel position in the non-maximum-suppressed heatmaps is traversed, and the pixels whose response values in the heatmaps of both the horizontal and the vertical direction are greater than the preset threshold are selected to form the contour point set. The preset threshold is, for example, 0.5.
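A sketch of this screening step follows; implementing the directional non-maximum suppression with max pooling is an assumption.

    import torch
    import torch.nn.functional as F

    def contour_point_set(hmap, vmap, threshold=0.5):
        # Keep a pixel only if it is the maximum of its 1x3 (horizontal) or
        # 3x1 (vertical) sliding window, then keep positions whose response
        # exceeds the threshold in BOTH heatmaps. hmap, vmap: (1, 1, H, W).
        h_keep = hmap * (hmap == F.max_pool2d(hmap, (1, 3), 1, (0, 1)))
        v_keep = vmap * (vmap == F.max_pool2d(vmap, (3, 1), 1, (1, 0)))
        mask = (h_keep > threshold) & (v_keep > threshold)
        return mask.squeeze().nonzero()  # (row, column) contour coordinates

    contour_points = contour_point_set(hmap, vmap)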
Finally, the character regions in the input image are reconstructed from the screened contour point set, thereby detecting the characters in the input image.
In the embodiments of the present disclosure, the character detection method was applied to a large number of arbitrarily shaped scene texts, and the results show very good detection performance. For example, on the ICDAR2015 dataset the recall, precision, and F-measure of the method are 86.1%, 87.6%, and 86.9%, respectively, at 3.5 FPS; on the Total-Text dataset they are 83.9%, 86.9%, and 85.4%, at 3.8 FPS; and on the CTW1500 dataset they are 84.1%, 83.7%, and 83.9%, at 4.5 FPS.
Fig. 4 schematically shows a block diagram of the character detection system provided in an embodiment of the present disclosure.
An embodiment of the present disclosure also provides a character detection system. The character detection system 400 includes an extraction module 410, a prediction module 420, a cropping module 430, a modeling module 440, and a screening module 450.
The extraction module 410 may perform operation S110, for example, performing feature extraction on the input image to obtain the feature image.
The prediction module 420 may perform operation S120, for example, predicting with the adaptive region proposal network to obtain a suggestion box.
The cropping module 430 may perform operation S130, for example, cropping the feature image with the suggestion box to obtain a cropped feature map.
The modeling module 440 may perform operation S140, for example, modeling the character texture information of the cropped feature map in two orthogonal directions to obtain a contour point heatmap for each orthogonal direction.
The screening module 450 may perform operation S150, for example, screening the contour points in the heatmaps to obtain a contour point set from which the characters in the input image are reconstructed.
For details, refer to the character detection method in the embodiments shown in Figs. 1-3.
To sum up, the character detection method and system in the embodiments of the present disclosure perform feature extraction on an input image to obtain a feature image; predict with an adaptive region proposal network to obtain a suggestion box; crop the feature image with the suggestion box to obtain a cropped feature map; adjust the suggestion box with a fine-tuning network according to the cropped feature map and crop the feature image again with the adjusted suggestion box to obtain an adjusted cropped feature map; model the character texture information of the adjusted cropped feature map in two orthogonal directions to obtain a contour point heatmap for each orthogonal direction; and screen the contour points in the heatmaps to obtain a contour point set from which the characters in the input image are reconstructed. The adaptive region proposal network is designed to better adapt to changes in character scale, and modeling character texture information in the orthogonal directions suppresses false-positive contour points. The problems of character scale variation and false-positive prediction are thereby effectively addressed, and the accuracy of detecting scene text of arbitrary shape is improved.
The above-mentioned embodiments are intended to illustrate the objects, aspects and advantages of the present disclosure in further detail, and it should be understood that the above-mentioned embodiments are only illustrative of the present disclosure and are not intended to limit the present disclosure, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (6)

1. A character detection method, comprising:
performing feature extraction on an input image to obtain a feature image;
predicting with an adaptive region proposal network to obtain a suggestion box, which specifically comprises: performing local offset prediction on the points of a preset anchor box with the adaptive region proposal network to obtain corresponding predicted points; and determining the suggestion box from the predicted points;
cropping the feature image with the suggestion box to obtain a cropped feature map;
modeling character texture information of the cropped feature map in two orthogonal directions to obtain a contour point heatmap for each orthogonal direction, wherein the two orthogonal directions are the horizontal direction and the vertical direction, and modeling the character texture information of the cropped feature map in the two orthogonal directions comprises: building a first character texture information model of the cropped feature map in the horizontal direction according to a first convolution kernel; and building a second character texture information model of the cropped feature map in the vertical direction according to a second convolution kernel; and
filtering background pixels in the contour point heatmaps with non-maximum suppression, and screening the pixels whose response values in the contour point heatmaps of both orthogonal directions are greater than a preset threshold to obtain a contour point set, from which characters in the input image are reconstructed.
2. The method of claim 1, wherein the size of the first convolution kernel is 1 × k and the size of the second convolution kernel is k × 1, k being not greater than the size of the cropped feature map.
3. The method of claim 1, further comprising:
adjusting the suggestion box with a fine-tuning network according to the cropped feature map to obtain an adjusted suggestion box;
cropping the feature image with the adjusted suggestion box to obtain an adjusted cropped feature map; and
upsampling the adjusted cropped feature map to obtain an upsampled feature map.
4. The method of claim 3, wherein modeling the character texture information of the cropped feature map in two orthogonal directions comprises:
modeling the character texture information of the upsampled feature map in the two orthogonal directions.
5. The method of claim 3, wherein modeling the character texture information of the cropped feature map in two orthogonal directions comprises:
modeling the character texture information of the cropped feature map with a character texture information perception network for each of the two orthogonal directions;
and wherein, before performing feature extraction on the input image, the method further comprises:
training the adaptive region proposal network, the character texture information perception networks, and the fine-tuning network with a stochastic gradient descent method according to a loss function, the loss function being:
L = L_Arpn + λ_Hcp L_Hcp + λ_Vcp L_Vcp + λ_boxclass L_boxclass + λ_boxreg L_boxreg
wherein L is the loss function, L_Arpn is the loss function of the adaptive region proposal network, L_Hcp is the loss function of the character texture information perception network in one orthogonal direction, L_Vcp is the loss function of the character texture information perception network in the other orthogonal direction, L_boxclass and L_boxreg are the loss functions of the fine-tuning network, λ_Hcp is the balance parameter of the character texture information perception network in the one orthogonal direction, λ_Vcp is the balance parameter of the character texture information perception network in the other orthogonal direction, and λ_boxclass and λ_boxreg are the balance parameters of the fine-tuning network.
6. A character detection system, comprising:
an extraction module for performing feature extraction on an input image to obtain a feature image;
a prediction module for predicting with an adaptive region proposal network to obtain a suggestion box, which specifically comprises: performing local offset prediction on the points of a preset anchor box with the adaptive region proposal network to obtain corresponding predicted points; and determining the suggestion box from the predicted points;
a cropping module for cropping the feature image with the suggestion box to obtain a cropped feature map;
a modeling module for modeling character texture information of the cropped feature map in two orthogonal directions to obtain a contour point heatmap for each orthogonal direction, wherein the two orthogonal directions are the horizontal direction and the vertical direction, and modeling the character texture information of the cropped feature map in the two orthogonal directions comprises: building a first character texture information model of the cropped feature map in the horizontal direction according to a first convolution kernel; and building a second character texture information model of the cropped feature map in the vertical direction according to a second convolution kernel; and
a screening module for filtering background pixels in the contour point heatmaps with non-maximum suppression and screening the pixels whose response values in the contour point heatmaps of both orthogonal directions are greater than a preset threshold to obtain a contour point set, from which characters in the input image are reconstructed.
CN202010008296.7A 2020-01-03 2020-01-03 Character detection method and system Active CN111242120B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010008296.7A CN111242120B (en) 2020-01-03 2020-01-03 Character detection method and system


Publications (2)

Publication Number Publication Date
CN111242120A CN111242120A (en) 2020-06-05
CN111242120B true CN111242120B (en) 2022-07-29

Family

ID=70868604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010008296.7A Active CN111242120B (en) 2020-01-03 2020-01-03 Character detection method and system

Country Status (1)

Country Link
CN (1) CN111242120B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783801B (en) * 2020-07-17 2024-04-23 上海明波通信技术股份有限公司 Object contour extraction method and system and object contour prediction method and system
CN111914843B (en) * 2020-08-20 2021-04-16 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Character detection method, system, equipment and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10354161B2 (en) * 2017-06-05 2019-07-16 Intuit, Inc. Detecting font size in a digital image

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105574513A (en) * 2015-12-22 2016-05-11 北京旷视科技有限公司 Character detection method and device
CN109447078A (en) * 2018-10-23 2019-03-08 四川大学 A kind of detection recognition method of natural scene image sensitivity text
CN109670494A (en) * 2018-12-13 2019-04-23 深源恒际科技有限公司 A kind of Method for text detection and system of subsidiary recognition confidence
CN109886077A (en) * 2018-12-28 2019-06-14 北京旷视科技有限公司 Image-recognizing method, device, computer equipment and storage medium
CN110059685A (en) * 2019-04-26 2019-07-26 腾讯科技(深圳)有限公司 Word area detection method, apparatus and storage medium
CN110263877A (en) * 2019-06-27 2019-09-20 中国科学技术大学 Scene character detecting method
CN110363252A (en) * 2019-07-24 2019-10-22 山东大学 It is intended to scene text detection end to end and recognition methods and system
CN110598698A (en) * 2019-08-29 2019-12-20 华中科技大学 Natural scene text detection method and system based on adaptive regional suggestion network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Learning Shape-Aware Embedding for Scene Text Detection; Zhuotao Tian et al.; 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019-12-31; 4234-4243 *
MSR: Multi-Scale Shape Regression for Scene Text Detection; Chuhui Xue et al.; arXiv:1901.02596v1; 2019-01-09; 1-9 *
Machine-vision-based digital character recognition against complex backgrounds (基于机器视觉的复杂背景下的数字字符识别); Hao Xuezhi (郝学智); China Master's Theses Full-text Database, Information Science and Technology; 2018-01-31; vol. 2019, no. 01; I138-2538 *
Research and implementation of a mobile-terminal-based image text recognition system (基于移动终端的图像文字识别系统的研究及实现); Li Fei (李飞); China Master's Theses Full-text Database, Information Science and Technology; 2015-12-15; vol. 2015, no. 12; I138-797 *

Also Published As

Publication number Publication date
CN111242120A (en) 2020-06-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant