CN112784836A - Text and graphic offset angle prediction and correction method thereof - Google Patents
- Publication number
- CN112784836A (application number CN202110090669.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- image
- model
- standard
- coordinates
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/24—Aligning, centring, orientation detection or correction of the image
- G06V10/243—Aligning, centring, orientation detection or correction of the image by compensating for image skew or non-uniform image deformations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Abstract
The invention discloses a text and graphic offset angle prediction and correction method comprising four steps: (1) acquire a text image to be recognized; (2) input the text image into a deep-learning text direction classification model, and obtain the coordinates of a text detection box and the text offset angle through the model's regression prediction, i.e. detect the text position in a bill; (3) compare the coordinate direction of the text detection box with that of the standard text box and judge whether the detection box is in the standard direction; (4) if the direction of the text image is not the standard direction, convert it to the standard direction. The method can classify text images at any angle in batches and return a predicted angle deviation, which is compared with the standard angle so that text images at non-standard angles are automatically converted to the standard angle. The model is only 3 MB, can be deployed on a PC or a mobile terminal, and achieves high accuracy.
Description
Technical Field
The invention relates to the technical field of text image correction, in particular to a text image offset angle prediction method and a text image offset angle correction method.
Background
With the development of the mobile internet, big data and artificial intelligence, the internet has entered a new era that is rapidly changing human society and has had a great impact on the traditional banking industry, especially the traditional bank service mode centered on branch tellers and account managers. Artificial intelligence is applied ever more widely; AI technologies such as text direction classification are beginning to be used in business fields such as banking and enterprise services, where bank bills must be accurately classified according to text direction.
Firstly, however, when processing bills in batches, most existing banks still adjust the angle of each bill manually and cannot automatically correct offset uploaded images in batches;
some text direction classifiers can only distinguish 0 and 180 degrees and cannot classify text images at arbitrary angles;
after text angle classification, the image cannot be automatically corrected accordingly;
secondly, some current algorithm models are heavyweight, cannot be deployed on mobile terminals, and their calling speed is too slow to meet deployment requirements.
Disclosure of Invention
The invention aims to: in order to solve the technical problems mentioned in the background art, a text graphic offset angle prediction and a correction method thereof are provided.
In order to achieve the purpose, the invention adopts the following technical scheme:
a text graphic offset angle prediction and correction method thereof includes the following steps:
(1) acquiring a text image to be recognized;
(2) inputting a text image to be recognized into a text direction classification model for deep learning, and obtaining coordinates and a text offset angle of a text detection box through regression prediction of the text direction classification model, namely detecting a text position in a bill;
(3) comparing the coordinate direction of the text detection box with the coordinate direction of the standard text box according to the coordinate of the text detection box, and judging whether the text detection box is in the standard direction;
(4) if the direction of the text image to be recognized is not the standard direction, converting the direction into the standard direction;
a MobileNet V3 model is used in the text direction classification model, and the MobileNet V3 model comprises a depth separable convolutional neural network for extracting the characteristics of candidate text regions and a non-maximum suppression algorithm for dividing the texts in the candidate regions;
the depth-separable convolutional neural network comprises a convolution layer of 3 x 3 and an average pooling layer, the convolution layer is used for extracting text features, the average pooling layer is composed of a plurality of convolution layers, the average pooling layer is connected with an output layer through two convolution layers of 1 x1, and an h-swish function is adopted as an activation function of the depth-separable convolutional neural network; the depth-separable convolutional neural network optimizes the model by adjusting model basic parameters in a training process; the depth separable convolutional neural network adopts a cross entropy loss function during training; an image enhancement processing means is adopted in the generation process of the training set of the MobileNet V3 model; the step (4) converts the text region in the image into a standard orientation by rotating the text region by a corresponding angle in an instant direction using a coordinate rotation formula and an OPENCV technique, respectively.
As a further description of the above technical solution:
the flow of the non-maximum suppression algorithm includes the steps of:
(1) dividing the coordinates of the text detection boxes obtained by the text direction classification model's regression according to category;
(2) sorting the bounding boxes (B_BOX) within each object class in descending order of classification confidence;
(3) within a class, selecting the bounding box B_BOX1 with the highest confidence, removing B_BOX1 from the input list and adding it to the output list;
(4) calculating the intersection-over-union IoU of B_BOX1 with each remaining box B_BOX2 one by one, and if IoU(B_BOX1, B_BOX2) > threshold TH, removing B_BOX2 from the input list;
(5) repeating steps 3-4 until the input list is empty, completing the traversal of one object class;
(6) repeating steps 2-5 until the non-maximum suppression processing of all object classes is completed;
(7) outputting the list; the algorithm is finished.
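The steps above are standard non-maximum suppression and can be sketched for a single object class as follows (an illustrative snippet with axis-aligned `(x1, y1, x2, y2)` boxes; the box format and threshold value are assumptions, not specified by the patent):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, th=0.5):
    """Steps (2)-(5) for one class: sort by confidence, repeatedly keep the
    best remaining box and drop every box whose IoU with it exceeds TH.
    Returns the indices of the kept boxes. Run once per object class (step 6)."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:                     # step (5): until the input list is empty
        best = order.pop(0)          # step (3): highest-confidence box
        keep.append(best)            # ...moved to the output list
        order = [i for i in order    # step (4): suppress overlapping boxes
                 if iou(boxes[best], boxes[i]) <= th]
    return keep
```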
As a further description of the above technical solution:
ReLU is an activation function;
the formula for ReLU is:
ReLU(x) = max(0, x)
where x is the input feature value.
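The two activation functions used by the model — ReLU and the h-swish function h-swish(x) = x · ReLU6(x + 3) / 6 from MobileNetV3 — can be written out directly (a minimal scalar sketch; the patent does not show code):

```python
def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)

def relu6(x):
    """ReLU capped at 6, the piecewise-linear building block of h-swish."""
    return min(max(0.0, x), 6.0)

def h_swish(x):
    """h-swish activation used by MobileNetV3: x * ReLU6(x + 3) / 6.
    It approximates swish (x * sigmoid(x)) without computing an exponential."""
    return x * relu6(x + 3.0) / 6.0
```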
As a further description of the above technical solution:
the training process for obtaining the text direction classification model comprises the following steps:
(1) acquiring a text image;
(2) carrying out image enhancement processing on the acquired text image;
(3) taking the image after image enhancement as a training set, and training the original depth separable convolution neural network;
(4) and adjusting model basic parameters of the original depth separable convolutional neural network in a training process, and evaluating the model by using a cross entropy loss function.
As a further description of the above technical solution:
the image enhancement processing includes the steps of:
(1) performing image rotation on the acquired text image;
(2) and carrying out perspective transformation on the text image subjected to image rotation.
As a further description of the above technical solution:
the formula adopted for image rotation is as follows:
x′ = (x0 − xcenter)·cosθ − (y0 − ycenter)·sinθ + xcenter;
y′ = (x0 − xcenter)·sinθ + (y0 − ycenter)·cosθ + ycenter
(left, top) represents the top left corner coordinates of the image;
(right, bottom) represents the lower right corner coordinates of the image;
(x0, y0) represents the coordinates of an arbitrary point on the image;
(xcenter, ycenter) represents the coordinates of the center point of the image;
(x′, y′) represents the new coordinate position.
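The rotation formula above translates directly into code (an illustrative sketch; in image coordinates with the y axis pointing down, a positive θ is a clockwise rotation):

```python
import math

def rotate_point(x0, y0, xcenter, ycenter, theta_deg):
    """Rotate (x0, y0) about (xcenter, ycenter) by theta_deg degrees,
    following the coordinate rotation formula:
        x' = (x0 - xc)*cos(t) - (y0 - yc)*sin(t) + xc
        y' = (x0 - xc)*sin(t) + (y0 - yc)*cos(t) + yc
    """
    t = math.radians(theta_deg)
    x_new = (x0 - xcenter) * math.cos(t) - (y0 - ycenter) * math.sin(t) + xcenter
    y_new = (x0 - xcenter) * math.sin(t) + (y0 - ycenter) * math.cos(t) + ycenter
    return x_new, y_new
```

Applying this to the four corners of a text detection box gives the rotated box; OpenCV's `cv2.getRotationMatrix2D` builds the equivalent 2x3 matrix for rotating whole images.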
As a further description of the above technical solution:
the general transformation formula of the perspective transformation is as follows:
(u, v) are original image pixel coordinates),for the transformed image pixel coordinates, the perspective transformation matrix is as follows:
T2=[a13,a23]Tfor generating a perspective transformation of the image;
T3=[a31,a32]representing image translation.
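Applying the general transformation formula to one pixel can be sketched as follows (an illustrative snippet using the row-vector convention [x′, y′, w′] = [u, v, 1] · A shown above, with division by w′ to return to pixel coordinates):

```python
def perspective_transform(u, v, A):
    """Apply a 3x3 perspective transformation matrix A (row-vector
    convention, [x', y', w'] = [u, v, 1] . A) to one pixel (u, v),
    then normalize by w' to obtain the transformed pixel coordinates."""
    xp = u * A[0][0] + v * A[1][0] + A[2][0]   # contribution of T1 and T3
    yp = u * A[0][1] + v * A[1][1] + A[2][1]
    wp = u * A[0][2] + v * A[1][2] + A[2][2]   # T2 = [a13, a23]^T drives w'
    return xp / wp, yp / wp
```

With T2 = [0, 0]T and a33 = 1 the transform degenerates to an affine map, so the same function also covers pure rotation, scaling and translation; OpenCV computes such a matrix from four point pairs via `cv2.getPerspectiveTransform`.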
As a further description of the above technical solution:
the cross entropy loss function is:
w is a weight;
xi, yi are eigenvalues;
b is an offset value.
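The loss above can be computed as follows. This is a one-dimensional sketch under the stated form ŷi = σ(w·xi + b); the patent does not spell out the exact per-layer expression, so the scalar w and b are simplifying assumptions.

```python
import math

def sigmoid(z):
    """Logistic function mapping a score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(w, b, xs, ys):
    """Binary cross entropy averaged over N samples, with the prediction
    y_hat_i = sigmoid(w * x_i + b). w is a weight, b an offset value."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(xs)
```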
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. The method can classify text images at any angle in batches: after model prediction, an angle deviation value is returned and compared with the standard angle, and text images at non-standard angles are automatically converted to the standard angle. The system provides very convenient technical support for the preprocessing stage of offset bill images, automatically aligning them, which facilitates subsequent image detection and recognition and reduces manual labor cost and time.
2. The invention provides a lightweight algorithm model only 3 MB in size, which can be deployed on a PC or a mobile terminal with high accuracy.
3. While reading the parameter file required for training and during training itself, the system modifies the learning rate with an automatic learning-rate adjustment method. This is used because optimizing a deep neural network is largely an empirical process requiring manual tuning of several parameters such as the learning rate, weight decay and dropout rate, of which the learning rate is the most important. Compared with a fixed learning rate, a variable learning-rate schedule provides faster convergence, improves the training speed of the model and enhances its generalization ability. Images of bank bills are processed with corresponding image enhancement techniques, such as direction transformations (image rotation and perspective transformation, together with image annotation) as well as motion blur and Gaussian noise, to increase the number of training images, improve the generalization ability of the model and avoid overfitting. After multiple rounds of ablation training, the system achieves a good classification effect and can perform automatic correction according to the recognized classification angle. The output of the system includes a predicted angle value and a confidence; the system automatically corrects an image to the standard position using the predicted angle and raises an early warning according to the confidence. If the confidence is too low, the corrected image may require manual intervention, which improves the robustness of the whole system.
Drawings
FIG. 1 is a schematic diagram illustrating a text image direction classification algorithm flow of a text image offset angle prediction and a correction method thereof according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating an application of a text image direction classification algorithm to a text image offset angle prediction and correction method thereof according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart illustrating the process of determining the orientation correction of the image to be recognized according to the text graphic offset angle prediction and the correction method thereof provided by the embodiment of the invention;
FIG. 4 is a schematic diagram illustrating 180-degree rotation correction of an image to be corrected according to the text graphic offset angle prediction and the correction method thereof provided by the embodiment of the invention;
FIG. 5 is a schematic diagram illustrating 15-degree rotation correction of an image to be corrected according to the text graphic offset angle prediction and the correction method thereof provided by the embodiment of the invention;
FIG. 6 is a schematic diagram illustrating 90-degree rotation correction of an image to be corrected according to the text graphic offset angle prediction and the correction method thereof provided by the embodiment of the invention;
FIG. 7 is a schematic diagram of a training system based on a MobileNet V3 model for predicting the offset angle of a text graphic and a correction method thereof according to an embodiment of the present invention;
fig. 8 shows a structure diagram of a MobileNetV3 model for text graphic offset angle prediction and correction method thereof according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1-8, the present invention provides a technical solution: a text graphic offset angle prediction and correction method thereof includes the following steps:
(1) acquiring a text image to be recognized;
(2) inputting a text image to be recognized into a text direction classification model for deep learning, and obtaining coordinates and a text offset angle of a text detection box through regression prediction of the text direction classification model, namely detecting a text position in a bill;
(3) comparing the coordinate direction of the text detection box with the coordinate direction of the standard text box according to the coordinate of the text detection box, and judging whether the text detection box is in the standard direction;
(4) if the direction of the text image to be recognized is not the standard direction, converting the direction into the standard direction;
a MobileNet V3 model is used in the text direction classification model, and the MobileNet V3 model comprises a depth separable convolutional neural network for extracting the characteristics of candidate text regions and a non-maximum suppression algorithm for dividing the texts in the candidate regions;
the depth-separable convolutional neural network comprises a 3 x 3 convolutional layer and an average pooling layer for extracting text features, wherein the average pooling layer is composed of a plurality of convolutional layers, the average pooling layer is connected with an output layer through two 1 x1 convolutional layers, after the four steps, the text images can be subjected to angle classification of any angle in batch, after the images are subjected to model prediction, an angle deviation predicted value is returned, the value is compared with a standard angle, the text images of non-standard angles are automatically converted into text images of standard angles, the steps provide a lightweight algorithm model, the size of the model is only 3M, the model can be deployed at a PC or a mobile terminal, and meanwhile, the accuracy is high, wherein in the step (4), a text area in the images is respectively rotated by corresponding angles in an instant needle through coordinate rotation formula and OPENCV technology, converting it to a standard orientation.
Referring to FIG. 1, the calculation formula of the h-swish activation function of the MobileNetV3 model is:
h-swish(x) = x · ReLU6(x + 3) / 6,  where ReLU6(x) = min(max(0, x), 6);
ReLU is an activation function;
the formula for ReLU is:
ReLU(x) = max(0, x)
where x is the input feature value;
the training process for obtaining the text direction classification model comprises the following steps:
(1) acquiring a text image;
(2) carrying out image enhancement processing on the acquired text image;
(3) taking the image after image enhancement as a training set, and training the original depth separable convolution neural network;
(4) adjusting model basic parameters of the original depth separable convolutional neural network in a training process, and simultaneously performing model evaluation by using a cross entropy loss function;
the image enhancement processing includes the steps of:
(1) performing image rotation on the acquired text image;
(2) carrying out perspective transformation on the text image subjected to image rotation;
the formula adopted for image rotation is:
x′ = (x0 − xcenter)·cosθ − (y0 − ycenter)·sinθ + xcenter;
y′ = (x0 − xcenter)·sinθ + (y0 − ycenter)·cosθ + ycenter
(left, top) represents the top left corner coordinates of the image;
(right, bottom) represents the lower right corner coordinates of the image;
(x0, y0) represents the coordinates of an arbitrary point on the image;
(xcenter, ycenter) represents the coordinates of the center point of the image;
(x′, y′) represents the new coordinate position;
the general transformation formula for the perspective transformation is:
[x′, y′, w′] = [u, v, 1] · A
where (u, v) are the original image pixel coordinates and (x′/w′, y′/w′) are the transformed image pixel coordinates; the perspective transformation matrix is as follows:
A = | a11 a12 a13 |
    | a21 a22 a23 |
    | a31 a32 a33 |
T2 = [a13, a23]T is used for generating the perspective transformation of the image;
T3 = [a31, a32] represents image translation;
the cross entropy loss function is:
w is a weight;
xi, yi are eigenvalues;
b is an offset value, in which the h-Swish function is approximated by a Swish functionThe method has the advantages that sigmoid is replaced, the number of filters can be reduced from 32 to 16 under the condition of not losing precision through experiments due to the benefit of h-swish design, the time can be saved by 3ms through the improvement, 1000 ten thousand times of multiply-add operation is realized, the basic parameters of the model are adjusted, the model does not need to be reconstructed, the efficiency is greatly improved, meanwhile, under the condition that the data volume is not large, the characteristics learned by the model can be more robust through fine adjustment, the cross entropy loss function is selected for training, multiple parameter adjustment is carried out for training, the generalization capability of the model is improved, and the coordinate rotation formula for correcting the picture in the MobileNet V3 model is optimized through image enhancement processing.
Referring to fig. 1, the flow of the non-maximum suppression algorithm includes the following steps:
(1) dividing the coordinates of the text detection boxes obtained by the text direction classification model's regression according to category, and removing the background of the text image;
(2) sorting the bounding boxes (B_BOX) within each object class in descending order of classification confidence;
(3) within a class, selecting the bounding box B_BOX1 with the highest confidence, removing B_BOX1 from the input list and adding it to the output list;
(4) calculating the intersection-over-union IoU of B_BOX1 with each remaining box B_BOX2 one by one, and if IoU(B_BOX1, B_BOX2) > threshold TH, removing B_BOX2 from the input list;
(5) repeating steps 3-4 until the input list is empty, completing the traversal of one object class;
(6) repeating steps 2-5 until the non-maximum suppression processing of all object classes is completed;
(7) outputting the list; the algorithm is finished. Through the non-maximum suppression algorithm, the target detection boxes of the text are obtained, i.e. the specific position of the text in the image, so that the coordinates of each detection box can conveniently be stored.
The working principle is as follows: when the method is used, a MobileNetV3 model must first be trained. To generate a training set with offset angles, the images are enhanced; applying random offset angles to some images through image enhancement processing optimizes the coordinate rotation formula used to correct images in the MobileNetV3 model. During training, the basic parameters of the model are adjusted rather than rebuilding the model, which greatly improves efficiency; meanwhile, when the amount of data is not large, fine-tuning makes the features learned by the model more robust. The cross entropy loss function is selected for training, and parameters are tuned over multiple runs to improve the generalization ability of the model. The trained model must be evaluated during training: accuracy, recall rate and F1 score are used to evaluate the performance of the model, and the evaluated accuracy can reach more than 95%. After training, the model makes rapid predictions through a prediction script; single images or batches can be predicted by specifying an image or folder path, and the system outputs the predicted angle and confidence of each image, for example (30, 0.99876). The system automatically corrects the image to the standard position using the predicted angle and raises an early warning according to the confidence; if the confidence is too low, the corrected image may require manual intervention, which improves the robustness of the whole system. Furthermore, once the model is trained, text images at any angle can be classified in batches: after model prediction, an angle deviation value is returned and compared with the standard angle, and text images at non-standard angles are automatically converted to the standard angle. These steps provide a lightweight algorithm model only 3 MB in size that can be deployed on a PC or a mobile terminal with high accuracy. The target detection boxes of the text are obtained through the non-maximum suppression algorithm, i.e. the specific position of the text in the image, so that the coordinates of each detection box can conveniently be stored.
The above description covers only preferred embodiments of the present invention, but the scope of the invention is not limited thereto. Any equivalent substitution or change of the technical solutions and inventive concepts of the present invention that a person skilled in the art could conceive within the technical scope disclosed herein shall fall within the protection scope of the present invention.
Claims (8)
1. A text graphic offset angle prediction and correction method thereof is characterized by comprising the following steps:
(1) acquiring a text image to be recognized;
(2) inputting a text image to be recognized into a text direction classification model for deep learning, and obtaining coordinates and a text offset angle of a text detection box through regression prediction of the text direction classification model, namely detecting a text position in a bill;
(3) comparing the coordinate direction of the text detection box with the coordinate direction of the standard text box according to the coordinate of the text detection box, and judging whether the text detection box is in the standard direction;
(4) if the direction of the text image to be recognized is not the standard direction, converting the direction into the standard direction;
a MobileNet V3 model is used in the text direction classification model, and the MobileNet V3 model comprises a depth separable convolutional neural network for extracting the characteristics of candidate text regions and a non-maximum suppression algorithm for dividing the texts in the candidate regions;
the depth-separable convolutional neural network comprises a convolution layer of 3 x 3 and an average pooling layer, the convolution layer is used for extracting text features, the average pooling layer is composed of a plurality of convolution layers, the average pooling layer is connected with an output layer through two convolution layers of 1 x1, and an h-swish function is adopted as an activation function of the depth-separable convolutional neural network;
the depth-separable convolutional neural network optimizes the model by adjusting model basic parameters in a training process;
the depth separable convolutional neural network adopts a cross entropy loss function during training;
an image enhancement processing means is adopted in the generation process of the training set of the MobileNet V3 model;
the step (4) rotates the text region in the image clockwise by the corresponding angle using a coordinate rotation formula and the OpenCV technique respectively, converting it to the standard orientation.
2. The method of claim 1, wherein the flow of the non-maximum suppression algorithm comprises the steps of:
(1) dividing coordinates of a text detection box obtained by text direction classification model regression according to categories;
(2) sorting the bounding boxes (B_BOX) within each object class in descending order of classification confidence;
(3) within a class, selecting the bounding box B_BOX1 with the highest confidence, removing B_BOX1 from the input list and adding it to the output list;
(4) calculating the intersection-over-union IoU of B_BOX1 with each remaining box B_BOX2 one by one, and if IoU(B_BOX1, B_BOX2) > threshold TH, removing B_BOX2 from the input list;
(5) repeating steps 3-4 until the input list is empty, completing the traversal of one object class;
(6) repeating steps 2-5 until the non-maximum suppression processing of all object classes is completed;
(7) and outputting the list and finishing the algorithm.
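The seven steps of claim 2 amount to greedy per-class non-maximum suppression. A minimal Python sketch for a single class (an illustration, not the patent's implementation; the `nms` helper name and its threshold default are assumptions):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over one object class.

    boxes: list of (x1, y1, x2, y2) corner coordinates;
    scores: per-box classification confidences.
    Returns indices of the kept boxes, highest confidence first.
    """
    def iou(a, b):
        # Intersection-over-union of two axis-aligned boxes.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter) if inter > 0 else 0.0

    # Step 2: sort candidates by confidence, descending.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # Step 3: move the best remaining box to the output list.
        best = order.pop(0)
        keep.append(best)
        # Step 4: drop every remaining box that overlaps it too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

Running this once per class, as in step 6, completes the multi-class procedure.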
4. The method as claimed in claim 1, wherein the training process for obtaining the text direction classification model comprises the following steps:
(1) acquiring a text image;
(2) carrying out image enhancement processing on the acquired text image;
(3) taking the enhanced images as the training set and training the original depthwise separable convolutional neural network;
(4) adjusting the basic model parameters of the original depthwise separable convolutional neural network during training, and evaluating the model with the cross-entropy loss function.
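Claim 4 evaluates the model with a cross-entropy loss. A per-sample sketch (illustrative only; the `cross_entropy` name and the epsilon guard are assumptions, not the patent's code):

```python
import math

def cross_entropy(predicted_probs, target_index):
    """Cross-entropy loss for one sample: -log p(correct class).

    predicted_probs: the classifier's softmax output (sums to 1);
    target_index: index of the true orientation class.
    """
    eps = 1e-12  # guard against log(0) for a zero-probability target
    return -math.log(predicted_probs[target_index] + eps)
```

The loss is near zero when the network assigns the true class probability close to 1, and grows without bound as that probability approaches 0, which is why it is a common training criterion for direction classification.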
5. The method of claim 4, wherein the image enhancement process comprises the following steps:
(1) performing image rotation on the acquired text image;
(2) applying perspective transformation to the rotated text image.
6. The method as claimed in claim 5, wherein the image rotation is performed by the following formula:
x′ = (x0 − xcenter)cosθ − (y0 − ycenter)sinθ + xcenter;
y′ = (x0 − xcenter)sinθ + (y0 − ycenter)cosθ + ycenter;
where:
(left, top) represents the top-left corner coordinates of the image;
(right, bottom) represents the bottom-right corner coordinates of the image;
(x0, y0) represents the coordinates of an arbitrary point on the image;
(xcenter, ycenter) represents the coordinates of the center point of the image;
(x′, y′) represents the new coordinate position.
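The rotation formula of claim 6 can be sketched as a small helper (illustrative; `rotate_point` is a hypothetical name, and note that the standard rotation about a point uses +(y0 − ycenter)cosθ in the second coordinate):

```python
import math

def rotate_point(x0, y0, xc, yc, theta):
    """Rotate the point (x0, y0) about the image centre (xc, yc)
    by angle theta (radians), per the claim's coordinate rotation
    formula: translate to the centre, rotate, translate back."""
    x_new = (x0 - xc) * math.cos(theta) - (y0 - yc) * math.sin(theta) + xc
    y_new = (x0 - xc) * math.sin(theta) + (y0 - yc) * math.cos(theta) + yc
    return x_new, y_new
```

Applying this to each corner of a text box gives the rotated box; in practice OpenCV's getRotationMatrix2D and warpAffine perform the equivalent transformation on the whole image.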
7. The method according to claim 5, wherein the general formula of the perspective transformation is:
[x, y, w] = [u, v, 1] · A
where (u, v) are the original image pixel coordinates and (x/w, y/w) are the transformed image pixel coordinates; the perspective transformation matrix A is:
A = [[a11, a12, a13], [a21, a22, a23], [a31, a32, a33]];
T2 = [a13, a23]^T generates the perspective transformation of the image;
T3 = [a31, a32] represents the image translation.
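Under the row-vector convention of claim 7, [x, y, w] = [u, v, 1] · A, the mapping can be sketched as below (illustrative; `apply_perspective` is a hypothetical helper, not the patent's code). The division by w is what distinguishes a perspective warp from a purely affine one.

```python
def apply_perspective(u, v, A):
    """Map a source pixel (u, v) through a 3x3 perspective matrix A
    using the row-vector convention [x, y, w] = [u, v, 1] . A,
    then divide by w to recover pixel coordinates."""
    x = u * A[0][0] + v * A[1][0] + A[2][0]
    y = u * A[0][1] + v * A[1][1] + A[2][1]
    w = u * A[0][2] + v * A[1][2] + A[2][2]
    return x / w, y / w
```

In practice, OpenCV's getPerspectiveTransform estimates A from four point correspondences and warpPerspective applies it to the whole image.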
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110090669.4A CN112784836A (en) | 2021-01-22 | 2021-01-22 | Text and graphic offset angle prediction and correction method thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112784836A true CN112784836A (en) | 2021-05-11 |
Family
ID=75758695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110090669.4A Pending CN112784836A (en) | 2021-01-22 | 2021-01-22 | Text and graphic offset angle prediction and correction method thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112784836A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113420741A (en) * | 2021-08-24 | 2021-09-21 | 深圳市中科鼎创科技股份有限公司 | Method and system for intelligently detecting file modification |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110427939A (en) * | 2019-08-02 | 2019-11-08 | 泰康保险集团股份有限公司 | Method, apparatus, medium and the electronic equipment of correction inclination text image |
CN110490198A (en) * | 2019-08-12 | 2019-11-22 | 上海眼控科技股份有限公司 | Text orientation bearing calibration, device, computer equipment and storage medium |
CN111325203A (en) * | 2020-01-21 | 2020-06-23 | 福州大学 | American license plate recognition method and system based on image correction |
CN112001385A (en) * | 2020-08-20 | 2020-11-27 | 长安大学 | Target cross-domain detection and understanding method, system, equipment and storage medium |
- 2021-01-22: CN CN202110090669.4A patent CN112784836A filed; status: Pending
Non-Patent Citations (2)
Title |
---|
ANDREW HOWARD et al.: "Searching for MobileNetV3", 2019 IEEE/CVF International Conference on Computer Vision (ICCV) * |
YUNING DU et al.: "PP-OCR: A Practical Ultra Lightweight OCR System", https://arxiv.org/pdf/2009.09941.pdf * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111401372B (en) | Method for extracting and identifying image-text information of scanned document | |
CN112329760B (en) | Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network | |
Kadam et al. | Detection and localization of multiple image splicing using MobileNet V1 | |
CN107220641B (en) | Multi-language text classification method based on deep learning | |
CN107784288A (en) | A kind of iteration positioning formula method for detecting human face based on deep neural network | |
CN112052899A (en) | Single ship target SAR image generation method based on generation countermeasure network | |
CN112069900A (en) | Bill character recognition method and system based on convolutional neural network | |
CN111553438A (en) | Image identification method based on convolutional neural network | |
CN110807362A (en) | Image detection method and device and computer readable storage medium | |
CN113221956B (en) | Target identification method and device based on improved multi-scale depth model | |
CN113095156B (en) | Double-current network signature identification method and device based on inverse gray scale mode | |
Bawane et al. | Object and character recognition using spiking neural network | |
CN110135446A (en) | Method for text detection and computer storage medium | |
CN112036522A (en) | Calligraphy individual character evaluation method, system and terminal based on machine learning | |
CN111783819A (en) | Improved target detection method based on region-of-interest training on small-scale data set | |
Rizvi et al. | Optical character recognition based intelligent database management system for examination process control | |
CN117079098A (en) | Space small target detection method based on position coding | |
CN115880495A (en) | Ship image target detection method and system under complex environment | |
Kancharla et al. | Handwritten signature recognition: a convolutional neural network approach | |
CN114283431B (en) | Text detection method based on differentiable binarization | |
CN112883931A (en) | Real-time true and false motion judgment method based on long and short term memory network | |
CN116229528A (en) | Living body palm vein detection method, device, equipment and storage medium | |
CN116883933A (en) | Security inspection contraband detection method based on multi-scale attention and data enhancement | |
CN112784836A (en) | Text and graphic offset angle prediction and correction method thereof | |
CN114359917A (en) | Handwritten Chinese character detection and recognition and font evaluation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2021-05-11