CN112580507A - Deep learning text character detection method based on image moment correction - Google Patents
- Publication number: CN112580507A (application CN202011506599.8A)
- Authority
- CN
- China
- Prior art keywords
- character
- loss
- text
- loss function
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V30/40 — Document-oriented image-based pattern recognition
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- G06V10/267 — Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
- G06V20/62 — Text, e.g. of license plates, overlay texts or captions on TV images
- G06V30/153 — Segmentation of character regions using recognition of characters or words
- G06V30/10 — Character recognition
Abstract
The invention discloses a deep learning text character detection method based on image moment correction. The method comprises: preparing a data set; manually correcting the pre-labeled character boxes; generating heat map labels in Gaussian heat map form from the boxes; defining a neural network structure and a loss function; pre-training; expanding the training sample set with actual-scene samples; applying adaptive binarization to the expanded training sample set, calculating the Hu moment feature vector of each character, and taking the mean of that vector as the character's auxiliary label; modifying the form of the loss function and performing fine-tuning training; and testing and verifying the model. By combining the heat map label and the moment feature vector label into an optimized loss function, the method improves the accuracy of the character box and alleviates the problems of over-segmentation and under-segmentation of character boxes. Preprocessing the expanded sample set compensates for the scarcity of character-level annotation and gives the method better generalization in character detection.
Description
Technical Field
The invention belongs to the field of target detection, and particularly relates to a deep learning text character detection method based on image moment correction.
Background
At present, text detection is widely applied to the field of computer vision, such as real-time translation, image retrieval, scene analysis, geographic positioning, blind navigation and the like, so that the text detection has extremely high application value and research significance in scene understanding and text analysis.
The existing text detection methods are divided into the following categories:
1. Traditional image processing methods based on manually designed features, such as MSER (maximally stable extremal regions) and SWT (stroke width transform), mainly handle text detection for printed fonts and print-and-scan scenes, and detect text poorly in natural scenes;
2. Two-stage methods based on deep learning generate candidate regions, extract the corresponding features, fine-tune the network, and output the corresponding text region boxes; their advantages are higher precision, good performance on small-scale targets and shared computation, while their drawbacks are low inference speed and a long training period;
3. One-stage methods based on deep learning skip candidate-box generation and predict the target text region boxes end to end; they infer quickly, but are less precise than two-stage methods and detect small targets poorly.
Most existing text detection algorithms output the position coordinates of text-line regions. For example, the reference network CTPN is an improvement on the Two-stage approach: on the basis of Faster R-CNN, it exploits the specific horizontal or vertical arrangement of the target text and outputs text-line regions. Because existing text detection techniques are not accurate down to character-level detection, the information they provide is limited.
The existing character-level text detection algorithm is based on the idea of semantic segmentation: a Gaussian center heat map replaces a pixel-level block heat map as the label, the network is optimized with two indexes, the region score and the affinity score, and in post-processing the probability map is binarized to obtain the final character boxes. Character-level text detection outputs not only the coordinates of individual character boxes but also the coordinates of text-line regions, so its output is richer and can satisfy broader customer requirements. However, the existing character-level algorithms are affected by parameters and by complex Chinese text scenes, so over-segmentation or under-segmentation occurs on the segmented character boxes, corresponding respectively to the plain rectangular boxes and the blackened rectangular boxes shown in FIG. 4.
Disclosure of Invention
In order to solve the above problems, the present invention provides a deep learning text character detection method based on image moment correction, which includes the following steps:
a: preparing a data set, namely pre-labeling a randomly sampled sample in the data set, and storing a box frame of each character of the sample;
b: manually correcting the box frame which is not accurately pre-marked, and generating a heat map label in a Gaussian heat map form according to the box frame;
c: defining the neural network structure and the loss function loss_cross;
d: using the network structure and loss function loss_cross determined in said step C to carry out preliminary pre-training;
e: expanding a training sample set of an actual scene;
f: performing adaptive binarization on the training sample set expanded in step E, calculating the Hu moment feature vector of each character, and taking the mean of that vector as the auxiliary label of the character;
g: modifying the form of the loss function by adding a regular term branch, and performing fine-tuning training with the modified loss function loss on the expanded training sample set;
h: and (3) model testing and verifying, namely modifying the parameter theta of the Gaussian heat map generated by the pre-labeling, and drawing an accuracy rate change curve of the character box frame under different theta threshold values, so that a proper parameter theta is selected according to requirements.
Further, in the present invention,
the data set in step A mainly comprises data from ICDAR2017, ICDAR2019 and CTW, and the randomly sampled samples in the data set are pre-labeled with a public character-level segmentation model trained by EasyOCR.
Further, in the present invention,
the pre-labeling inaccuracy in step B specifically means that a character box is over-segmented or under-segmented;
over-segmentation means that the character box does not contain all of the current character, and under-segmentation means that the character box contains other characters or symbols besides the current character.
Further, in the present invention,
in step B, the box is mapped onto a two-dimensional Gaussian map by perspective transformation to generate the label in Gaussian heat map form.
Further, in the present invention,
the specific operation of determining the neural network structure in the step C is as follows:
a sample with a preset size is input by the network, a VGG16 reference network is taken as a feature extraction network, and U-net is taken as a decoding network;
outputting a pixel score matrix representing the confidence region;
the loss function loss_cross in said step C is determined as follows:
the loss function loss_cross adopts pixel-level cross entropy loss, i.e. a theta threshold is set on the label heat map; regions above the theta threshold are regarded as character regions, represented by class 1, and regions below it as non-character regions, represented by class 0.
Further, in the present invention,
the method of expanding the training sample set of the actual scene in step E comprises taking random screenshots of, or photographing from different angles, a computer-screen interface containing documents, pre-labeling them with the pre-trained model, and correcting manually in the manner of step B.
Further, in the present invention,
the theta threshold is obtained by the following steps:
performing Gaussian smoothing processing on the heat map label, and calculating a gradient map of the heat map label;
determining the connected regions under different thresholds according to a watershed algorithm, and taking the minimum enclosing rectangle of each connected region, i.e. the character box under that threshold;
randomly sampling a number of characters for statistics, judging the accuracy of the minimum enclosing boxes under the corresponding thresholds, and taking the threshold with the highest accuracy as the theta threshold.
Further, in the present invention,
the loss function loss modified in step G adds an L2 loss to the loss function loss_cross of step C:

loss = loss_cross + m * loss_L2

where loss_L2 = (1/(N*K)) * sum_{i=1..N} sum_{j=1..K} (y_ij - f(x_ij))^2 is the L2 loss characterizing the sample moments, N denotes the number of samples, K denotes the number of characters of a single sample, y_ij denotes the mean of the moment feature vector corresponding to the j-th character in the i-th sample, and f(x_ij) denotes the mean of the moment feature vector predicted by the network for the j-th character in the i-th sample.
Further, in the present invention,
in the model testing and verification of step H, the samples are characters in text scenes photographed or screenshotted from randomly selected computer documents.
The invention has the advantages that:
the detection method of the invention provides that the center of a single character is represented based on the image moment characteristics, more robust auxiliary information is provided, namely, an optimization loss function is formed by combining a Gaussian heat map and the moment characteristics to improve the accuracy of a character box frame, the character detection segmentation capability of a model is improved by combining a segmentation task (a heat map label) and a regression task (a moment characteristic label), and the problems of excessive segmentation and under-segmentation of a character frame are solved; in addition, a sample is synthesized by text scenes in the screenshot, a preliminary character text detection model is pre-trained, then pre-labeling is carried out in a real text sample, the text is manually corrected, and the moment characteristic of each character in the real sample is calculated and used as a regular term of a loss function in the training fine adjustment. The preprocessing mode makes up the problem of insufficient character-level labeling on one hand, and on the other hand, the character detection generalization capability of the preprocessing mode is better in the actual text scene of printing, photographing or screenshot.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 shows a prior art character segmentation algorithm flow diagram;
FIG. 2 shows a flow diagram of a character segmentation algorithm of an embodiment of the present invention;
FIG. 3 shows an exemplary sample-label Gaussian map of the present invention;
fig. 4 illustrates an exemplary diagram of an over-segmentation or under-segmentation phenomenon.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Because the sample background of a natural scene is complex and computing image moment features there would introduce deviation, the image moment feature values are calculated only for screenshots of computer-document backgrounds or photographs of specific scenes. Moments of different orders have different properties: using raw (origin) moments or central moments as image features cannot guarantee translation, rotation and scale invariance simultaneously. Central moments alone are only translation invariant, whereas normalized central moments are invariant to translation, scale and rotation; the Hu moment vector is therefore used as auxiliary information to give the network more prior knowledge during training.
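As an illustrative aside (not part of the claimed method), the normalized central moments and the seven Hu invariants referred to above can be computed with plain NumPy; `cv2.HuMoments(cv2.moments(img))` yields the same quantities:

```python
import numpy as np

def hu_moments(img):
    """Seven Hu invariant moments of a 2-D grayscale/binary image.

    Sketch of the pipeline the description relies on: central moments give
    translation invariance, normalization adds scale invariance, and the Hu
    combinations add rotation invariance.
    """
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    img = img.astype(np.float64)
    m00 = img.sum()
    cx = (img * xs).sum() / m00          # centroid
    cy = (img * ys).sum() / m00
    def mu(p, q):                        # central moment
        return (img * (xs - cx) ** p * (ys - cy) ** q).sum()
    def eta(p, q):                       # normalized central moment
        return mu(p, q) / m00 ** (1 + (p + q) / 2)
    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    h1 = n20 + n02
    h2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    h3 = (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2
    h4 = (n30 + n12) ** 2 + (n21 + n03) ** 2
    h5 = ((n30 - 3 * n12) * (n30 + n12)
          * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
          + (3 * n21 - n03) * (n21 + n03)
          * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    h6 = ((n20 - n02) * ((n30 + n12) ** 2 - (n21 + n03) ** 2)
          + 4 * n11 * (n30 + n12) * (n21 + n03))
    h7 = ((3 * n21 - n03) * (n30 + n12)
          * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
          - (n30 - 3 * n12) * (n21 + n03)
          * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    return np.array([h1, h2, h3, h4, h5, h6, h7])

# translation invariance: the same shape at two positions gives the same vector
a = np.zeros((64, 64)); a[10:20, 10:30] = 1.0
b = np.zeros((64, 64)); b[30:40, 20:40] = 1.0
assert np.allclose(hu_moments(a), hu_moments(b))
```

Scale invariance holds exactly only in the continuous limit, so small discretization differences remain between resized shapes; the mean of this 7-vector is what step F uses as the auxiliary label.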
The invention discloses a deep learning text character detection method based on image moment correction, which comprises the following steps:
A, preparing a data set: the public Chinese data sets used in the method mainly comprise the ICDAR2017 data set, the ICDAR2019 data set and CTW (Chinese Text in the Wild) data. The CTW data has high diversity and complexity, including planar text, projected text, city and town street-view text, text under weak illumination, long-distance text, partially displayed text and the like. For each image, all Chinese characters are annotated in the data set; for each Chinese character, the data set is labeled with its character category and bounding box. First, the randomly sampled samples in the data set are pre-labeled with a public character-level segmentation model trained by EasyOCR, and the box of each character of each sample is stored;
B, developing a simple human-computer interaction labeling interface for fine correction, similar to an object detection labeling tool: it automatically loads a picture and its corresponding json-format label, and the character boxes with inaccurate pre-labels are then corrected manually in a pop-up dialog box. An inaccurate prediction means the box does not cover the whole current character (over-segmentation) or extends into adjacent characters, commas and the like (under-segmentation); for concrete examples see the plain rectangular boxes (over-segmentation) and the blackened rectangular boxes (under-segmentation) in FIG. 4. A label in Gaussian heat map form is then generated from each character box, as in the sample-label Gaussian map of FIG. 3: in this step, the character box is mapped onto a two-dimensional Gaussian map by a perspective transform to form the character's heat map label.
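A minimal sketch of this labeling step, assuming an inverse warp with nearest-neighbour sampling (the patch size and sigma are illustrative; `cv2.getPerspectiveTransform` and `cv2.warpPerspective` perform the same job in practice):

```python
import numpy as np

def solve_homography(src, dst):
    """Direct linear transform mapping 4 (x, y) src points to 4 dst points."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, vt = np.linalg.svd(np.array(rows, dtype=np.float64))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def gaussian_heatmap_label(canvas_hw, quad, size=64, sigma=0.25):
    """Warp a canonical isotropic Gaussian patch onto a character quad."""
    h, w = canvas_hw
    grid = np.mgrid[0:size, 0:size].astype(np.float64)
    c = (size - 1) / 2
    gauss = np.exp(-(((grid[1] - c) ** 2 + (grid[0] - c) ** 2)
                     / (2 * (sigma * size) ** 2)))
    src = [(0, 0), (size - 1, 0), (size - 1, size - 1), (0, size - 1)]
    Hinv = np.linalg.inv(solve_homography(src, quad))   # image -> patch
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    q = Hinv @ pts
    u, v = q[0] / q[2], q[1] / q[2]
    inside = (u >= 0) & (u <= size - 1) & (v >= 0) & (v <= size - 1)
    label = np.zeros(h * w)
    # nearest-neighbour sampling keeps the sketch short
    label[inside] = gauss[np.round(v[inside]).astype(int),
                          np.round(u[inside]).astype(int)]
    return label.reshape(h, w)
```

The peak of the warped Gaussian marks the character centre; overlapping per-character maps are typically combined with a pixel-wise maximum.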
C, defining the network structure and loss function: the network takes as input a sample of size h x w x 3, uses the VGG16 reference network as the feature extraction network and an improved U-net as the decoding network, and outputs a pixel score matrix representing the confidence region (the specific structure is shown in FIG. 2), where h is the height of the input image, w is its width, and 3 is the number of RGB channels;
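At toy scale, the encoder-decoder wiring of this step looks as follows; the channel widths, depth and single skip connection are illustrative stand-ins for the actual VGG16 stages and the improved U-net:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Scaled-down sketch of the VGG16-encoder / U-net-decoder of step C."""

    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                      # halves the feature map
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)
        self.dec = nn.Sequential(nn.Conv2d(32 + 16, 16, 3, padding=1),
                                 nn.ReLU())
        self.head = nn.Conv2d(16, 1, 1)                  # 1x1 conv -> score map

    def forward(self, x):                                # x: (B, 3, h, w)
        f1 = self.enc1(x)
        f2 = self.enc2(self.pool(f1))
        d = self.dec(torch.cat([self.up(f2), f1], dim=1))  # U-net skip merge
        return torch.sigmoid(self.head(d))               # per-pixel confidence

out = TinyUNet()(torch.zeros(1, 3, 64, 96))              # NCHW view of h x w x 3
```

The real decoder repeats the upsample-concatenate-convolve pattern once per encoder stage before the final 1 x 1 convolution.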
The loss function uses pixel-level cross entropy loss: a theta threshold is set on the label heat map, and pixels above the theta threshold are identified as character regions (class 1) while pixels below it are identified as non-character regions (class 0).
The accuracy under different values of the parameter theta therefore needs to be compared in order to select the best one. The theta threshold is obtained by testing on actual training samples with the help of the watershed algorithm from graphics, in the following general steps:
first, Gaussian smoothing is applied to the label heat map and its gradient map is computed; then the connected regions under different thresholds are determined by the watershed algorithm and the minimum enclosing rectangle of each connected region (i.e. the character box under that threshold) is taken; a number of characters are randomly sampled for statistics, the accuracy of the minimum enclosing boxes under each threshold is judged manually, and the threshold with the relatively highest accuracy is taken as the theta threshold.
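A simplified stand-in for this box extraction (plain 4-connected labeling in place of the full watershed, and axis-aligned boxes in place of minimum-area rectangles):

```python
import numpy as np
from collections import deque

def character_boxes(heatmap, theta):
    """Bounding boxes of the connected regions above a theta threshold.

    Sketch only: the description needs the minimum enclosing rectangle of
    each connected region, so BFS component labeling is enough here.
    """
    mask = heatmap >= theta
    seen = np.zeros_like(mask, dtype=bool)
    boxes = []
    h, w = mask.shape
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        q = deque([(sy, sx)]); seen[sy, sx] = True
        y0 = y1 = sy; x0 = x1 = sx
        while q:
            y, x = q.popleft()
            y0, y1 = min(y0, y), max(y1, y)
            x0, x1 = min(x0, x), max(x1, x)
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True; q.append((ny, nx))
        boxes.append((x0, y0, x1, y1))
    return boxes

# two Gaussian blobs: a very low theta merges them into one box
# (under-segmentation), a higher theta separates two character boxes
ys, xs = np.mgrid[0:32, 0:64].astype(np.float64)
hm = (np.exp(-((xs - 16) ** 2 + (ys - 16) ** 2) / 40)
      + np.exp(-((xs - 44) ** 2 + (ys - 16) ** 2) / 40))
```

Sweeping theta over such synthetic maps and counting correct boxes reproduces the accuracy-versus-threshold comparison described above.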
D, pre-training, and performing preliminary pre-training by adopting the network structure and the loss function defined in the step C.
E, expanding the training sample set of the actual scene: take random screenshots of, or photograph from different angles, a computer-screen interface containing documents, such as a web page or a Word document; pre-label them with the pre-trained model and correct manually in the manner of step B.
F, adaptively binarizing the samples expanded in step E to obtain binary images, then calculating the Hu moment feature vector of each character and taking its mean as the character's auxiliary label. In theory the moment-feature means of character regions differ little from one another, while being much larger than those of non-character regions. Introducing the moment-feature branch therefore, on the one hand, tilts the model's attention toward character regions, which helps detection; on the other hand, the moment-feature mean can guide the network to learn more accurate character boxes, which helps segmentation.
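A mean-based sketch of the adaptive binarization (block size and offset c are illustrative; `cv2.adaptiveThreshold` with `ADAPTIVE_THRESH_MEAN_C` plays this role in practice):

```python
import numpy as np

def adaptive_binarize(gray, block=15, c=5):
    """Mark a pixel as ink when it is darker than its local mean minus c.

    The local mean over a block x block window is computed with an
    integral image (cumulative sums), so the sketch stays O(h * w).
    """
    pad = block // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    ii = np.cumsum(np.cumsum(padded, axis=0), axis=1)
    ii = np.pad(ii, ((1, 0), (1, 0)))          # zero row/col for window sums
    h, w = gray.shape
    local_sum = (ii[block:block + h, block:block + w]
                 - ii[:h, block:block + w]
                 - ii[block:block + h, :w]
                 + ii[:h, :w])
    local_mean = local_sum / (block * block)
    return (gray < local_mean - c).astype(np.uint8)   # 1 = ink / character
```

The resulting binary image is what the Hu moment feature vector of each character is computed from in this step.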
G, modifying the form of the loss function by adding a regular term branch, and performing fine-tuning training with the modified loss function on the expanded training sample set. Model training with the expanded samples differs from pre-training in the following detail: the loss function of the network is modified by adding a regular term branch that takes the Hu moment feature vector as auxiliary label information, i.e. an L2 loss on the character-box moment vectors is added to the original cross entropy loss_cross and joint training is carried out, with m taking a value of 0.01-0.05:

loss = loss_cross + m * loss_L2

where loss_L2 = (1/(N*K)) * sum_{i=1..N} sum_{j=1..K} (y_ij - f(x_ij))^2 is the L2 (least-squares) loss characterizing the sample moments, N is the number of samples, K is the number of characters of a single sample, y_ij is the mean of the moment feature vector corresponding to the j-th character in the i-th sample (used as the moment feature label), and f(x_ij) is the mean of the moment feature vector predicted by the network for the j-th character in the i-th sample.
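A NumPy sketch of this joint objective under stated assumptions: theta and m are illustrative values within the ranges given above, and the per-character moment-feature means are assumed to be precomputed:

```python
import numpy as np

def joint_loss(pred_heat, gt_heat, pred_mu, gt_mu, theta=0.4, m=0.03):
    """loss = loss_cross + m * loss_L2 from step G (sketch).

    pred_heat: predicted pixel scores in (0, 1); gt_heat: Gaussian label map;
    pred_mu / gt_mu: per-character moment-feature means, shape (N, K).
    """
    # pixel-level cross entropy against the theta-binarized label heat map
    y = (gt_heat >= theta).astype(np.float64)          # class 1 = character
    p = np.clip(pred_heat, 1e-7, 1 - 1e-7)
    loss_cross = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # L2 regular term on the Hu-moment means (least-squares error)
    loss_l2 = np.mean((gt_mu - pred_mu) ** 2)
    return loss_cross + m * loss_l2

gt = np.exp(-((np.mgrid[0:16, 0:16][0] - 8) ** 2
              + (np.mgrid[0:16, 0:16][1] - 8) ** 2) / 10.0)
```

A perfect prediction drives both terms toward zero, while a wrong moment mean is penalized even when the heat map alone looks acceptable, which is the mechanism claimed to reduce over- and under-segmentation.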
H, model testing and verification. The model of the method mainly aims to solve character detection for text scenes photographed from computer documents, so samples from that scene are used for testing and verification, and the character segmentation accuracy is counted. Since the pre-labeled heat map is affected by the parameter theta, the accuracy under different values of theta needs to be compared in order to select the best one: the parameter theta of the Gaussian heat map generated by pre-labeling is modified, and an accuracy curve of the character box under different theta thresholds is drawn, so that a suitable parameter theta can be selected as required.
FIG. 1 illustrates a prior art character segmentation algorithm
The input sample is scaled to h x w x 3 as the network input, and the VGG16 reference network serves as the feature extraction network; the higher the stage of the extraction network, the more abstract the corresponding feature map, and its size is halved at each stage. To fuse low-level and high-level feature information, the decoding network U-net upsamples an output layer's feature map to the same size as the feature map of some stage of the extraction network, merges and fuses them, and finally outputs, through a 1 x 1 convolution layer, a pixel score matrix representing the character-connection confidence region. The main idea is to predict character detection boxes with a segmentation task, add a character-connection confidence matrix as an output branch to solve character localization in non-rectangular regions, and complete the model's pre-training with weakly supervised learning on a synthesized character data set, thereby improving character segmentation in general natural scenes.
FIG. 2 shows the character segmentation algorithm of the present method
The method is basically the same in network structure, differing in the input sample size and the output: the input has an h x w x 3 structure, the VGG16 reference network serves as the feature extraction network with a decoding network fusing the upper- and lower-layer features, a pixel score matrix representing the character moment mean vector is output through a 1 x 1 convolution layer, and a branch with a fully connected layer is introduced to output the moment feature vector. The two branches combine the segmentation and regression tasks, replacing the box coordinates of object detection with moment features; owing to the properties of the moment feature vector, this is more robust for localizing and segmenting Chinese character text, whose aspect ratio is relatively consistent. A batch of data sets matching the algorithm's practical application is constructed, such as text data sets of computer photographing and screenshot scenes, with the aim of solving the character-level text detection problem. Borrowing the idea of semantic segmentation, each character is likewise labeled with a Gaussian heat map, in which a higher pixel value indicates a pixel closer to the character's center point.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (9)
1. A deep learning text character detection method based on image moment correction is characterized by comprising the following steps:
a: preparing a data set, namely pre-labeling a randomly sampled sample in the data set, and storing a box frame of each character of the sample;
b: manually correcting the box frame which is not accurately pre-marked, and generating a heat map label in a Gaussian heat map form according to the box frame;
c: defining the neural network structure and the loss function loss_cross;
d: using the network structure and loss function loss_cross determined in said step C to carry out preliminary pre-training;
e: expanding a training sample set of an actual scene;
f: performing adaptive binarization on the training sample set expanded in step E, calculating the Hu moment feature vector of each character, and taking the mean of that vector as the auxiliary label of the character;
g: modifying the form of the loss function by adding a regular term branch, and performing fine-tuning training with the modified loss function loss on the expanded training sample set;
h: and (3) model testing and verifying, namely modifying the parameter theta of the Gaussian heat map generated by the pre-labeling, and drawing an accuracy rate change curve of the character box frame under different theta threshold values, so that a proper parameter theta is selected according to requirements.
2. The method of claim 1, wherein the text character detection method based on image moment correction,
the data set in step A mainly comprises data from ICDAR2017, ICDAR2019 and CTW, and the randomly sampled samples in the data set are pre-labeled with a public character-level segmentation model trained by EasyOCR.
3. The method of claim 1, wherein the text character detection method based on image moment correction,
the pre-labeling inaccuracy in step B specifically means that a character box is over-segmented or under-segmented;
over-segmentation means that the character box does not contain all of the current character, and under-segmentation means that the character box contains other characters or symbols besides the current character.
4. The deep learning text character detection method based on image moment correction according to claim 1, wherein
in step B the box frame is mapped to a two-dimensional Gaussian map by a perspective transformation to generate the label in the form of a Gaussian heat map.
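One way to realize this mapping is to fit the homography sending the unit square to the character box and evaluate a canonical 2-D Gaussian through its inverse. The sketch below is illustrative only (the claim does not fix the warp direction or the Gaussian width; `sigma` is an assumed parameter):

```python
import numpy as np

def homography(src, dst):
    """Solve the 3x3 perspective transform mapping 4 src points to 4 dst points (DLT)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, vt = np.linalg.svd(np.array(A, dtype=float))
    return vt[-1].reshape(3, 3)          # null-space vector, up to scale

def gaussian_heatmap(shape, box, sigma=0.25):
    """Warp a canonical Gaussian (peak at (0.5, 0.5) on the unit square)
    onto the quadrilateral `box` given as 4 corner points, clockwise."""
    h_inv = np.linalg.inv(homography([(0, 0), (1, 0), (1, 1), (0, 1)], box))
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    q = h_inv @ pts                       # pull each pixel back to the unit square
    u, v = q[0] / q[2], q[1] / q[2]
    g = np.exp(-((u - 0.5) ** 2 + (v - 0.5) ** 2) / (2 * sigma ** 2))
    return g.reshape(shape)

heat = gaussian_heatmap((40, 40), [(10, 10), (30, 10), (30, 30), (10, 30)])
```

Because the Gaussian is evaluated analytically through the inverse homography, no image interpolation is needed, and the peak lands at the center of the character box.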
5. The deep learning text character detection method based on image moment correction according to claim 1, wherein
the specific operation of determining the neural network structure in step C is as follows:
the network takes a sample of preset size as input, with the VGG16 base network as the feature extraction network and U-net as the decoding network;
it outputs a pixel score matrix representing the confidence region;
the loss function loss_cross in said step C is determined as follows:
the loss function loss_cross uses a pixel-level cross-entropy loss, i.e., a theta threshold is applied to the label heat map: pixels greater than the theta threshold are regarded as character regions and represented by class 1, and pixels smaller than the theta threshold are regarded as non-character regions and represented by class 0.
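The thresholded pixel-level cross-entropy of claim 5 can be sketched in NumPy. This is an illustrative stand-in for the framework loss actually used in training; the value of `theta` is an assumption:

```python
import numpy as np

def pixel_cross_entropy(pred, heat, theta=0.4, eps=1e-7):
    """Pixel-level binary cross-entropy against a thresholded heat-map label.

    pred: predicted character-confidence scores in (0, 1), same shape as heat.
    heat: Gaussian heat-map label; pixels > theta become class 1 (character),
          the rest class 0 (non-character).
    """
    y = (heat > theta).astype(np.float64)
    p = np.clip(pred, eps, 1 - eps)       # guard the logs
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

heat = np.array([[0.9, 0.1], [0.6, 0.2]])
good = pixel_cross_entropy(np.array([[0.99, 0.01], [0.99, 0.01]]), heat)
bad = pixel_cross_entropy(np.array([[0.01, 0.99], [0.01, 0.99]]), heat)
```

A prediction matching the thresholded label yields a near-zero loss, while an inverted prediction is heavily penalized.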
6. The deep learning text character detection method based on image moment correction according to any one of claims 1-5, wherein
the method for expanding the training sample set of the actual scene in step E comprises taking random screenshots, or photographs at different angles, of a computer screen interface containing documents, pre-labeling them with the pre-trained model, and manually correcting them in the manner of step B.
7. The deep learning text character detection method based on image moment correction according to any one of claims 1-5, wherein
the theta threshold is obtained by the following steps:
performing Gaussian smoothing on the heat map label and calculating its gradient map;
determining the connected regions under different thresholds with a watershed algorithm, and taking the minimum circumscribed rectangle of each connected region as the character frame under that threshold;
randomly sampling a number of characters, judging the accuracy of the minimum circumscribed frames under the corresponding thresholds, and taking the threshold with the highest accuracy as the theta threshold.
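The region-extraction part of this threshold sweep can be sketched as follows. For simplicity, a 4-connected flood fill stands in for the watershed segmentation named in the claim, and boxes are axis-aligned bounding rectangles:

```python
import numpy as np

def boxes_at_threshold(heat, t):
    """Bounding boxes (x0, y0, x1, y1) of the connected regions of heat > t.

    A plain 4-connected flood fill replaces the watershed algorithm of the
    claim; each connected region yields its minimal enclosing rectangle.
    """
    mask = heat > t
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                stack, pix = [(sy, sx)], []
                seen[sy, sx] = True
                while stack:                       # iterative flood fill
                    y, x = stack.pop()
                    pix.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                ys, xs = zip(*pix)
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes

# Two separated blobs in a heat map should give two character boxes.
heat = np.zeros((20, 20))
heat[4:8, 3:7] = 0.8
heat[12:16, 10:16] = 0.8
boxes = boxes_at_threshold(heat, 0.5)
```

In the patented procedure, the accuracy of such boxes over the sampled characters would be compared across candidate thresholds, and the best-scoring value adopted as theta.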
8. The deep learning text character detection method based on image moment correction according to any one of claims 1-5, wherein
the loss function loss modified in step G is the loss function loss_cross of step C plus an
L2 loss: loss = loss_cross + m * loss_L2
where loss_L2 = (1/(m*k)) * Σ_i Σ_j (y_ij - f(x_ij))^2 is the L2 loss characterizing the sample moments, m denotes the number of samples, k denotes the number of characters in a single sample, y_ij denotes the mean of the moment feature vectors corresponding to the j-th character in the i-th sample, and f(x_ij) denotes the network-predicted mean of the moment feature vectors corresponding to the j-th character in the i-th sample.
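The combined fine-tuning loss can be sketched as below. This is illustrative only: the mean-squared form of loss_L2 with 1/(m*k) normalization is a reconstruction from the symbols defined in claim 8, and the moment labels are treated as scalars rather than full feature vectors for brevity.

```python
import numpy as np

def total_loss(loss_cross, y, f):
    """loss = loss_cross + m * loss_L2, with
    loss_L2 = (1/(m*k)) * sum_ij (y_ij - f(x_ij))^2 over the
    mean moment labels, arranged as an (m samples x k characters) array."""
    m, k = y.shape
    loss_l2 = np.sum((y - f) ** 2) / (m * k)
    return loss_cross + m * loss_l2

y = np.ones((2, 3))                       # m = 2 samples, k = 3 characters
perfect = total_loss(0.1, y, np.ones((2, 3)))   # prediction matches labels
off = total_loss(0.1, y, np.zeros((2, 3)))      # prediction misses all moments
```

When the predicted moment means match the auxiliary labels, the regularization branch contributes nothing and the loss reduces to loss_cross.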
9. The deep learning text character detection method based on image moment correction according to claim 8, wherein
in the model testing and verification of step H, the samples are characters in text scenes from randomly selected photographs or screenshots of computer documents.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011506599.8A CN112580507B (en) | 2020-12-18 | 2020-12-18 | Deep learning text character detection method based on image moment correction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112580507A true CN112580507A (en) | 2021-03-30 |
CN112580507B CN112580507B (en) | 2024-05-31 |
Family
ID=75136268
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011506599.8A Active CN112580507B (en) | 2020-12-18 | 2020-12-18 | Deep learning text character detection method based on image moment correction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112580507B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090157571A1 (en) * | 2007-12-12 | 2009-06-18 | International Business Machines Corporation | Method and apparatus for model-shared subspace boosting for multi-label classification |
CN104899821A (en) * | 2015-05-27 | 2015-09-09 | 合肥高维数据技术有限公司 | Method for erasing visible watermark of document image |
WO2017185257A1 (en) * | 2016-04-27 | 2017-11-02 | 北京中科寒武纪科技有限公司 | Device and method for performing adam gradient descent training algorithm |
RU2656708C1 (en) * | 2017-06-29 | 2018-06-06 | Самсунг Электроникс Ко., Лтд. | Method for separating texts and illustrations in images of documents using a descriptor of document spectrum and two-level clustering |
CN108399421A (en) * | 2018-01-31 | 2018-08-14 | 南京邮电大学 | A kind of zero sample classification method of depth of word-based insertion |
EP3422254A1 (en) * | 2017-06-29 | 2019-01-02 | Samsung Electronics Co., Ltd. | Method and apparatus for separating text and figures in document images |
EP3499457A1 (en) * | 2017-12-15 | 2019-06-19 | Samsung Display Co., Ltd | System and method of defect detection on a display |
CN110717492A (en) * | 2019-10-16 | 2020-01-21 | 电子科技大学 | Method for correcting direction of character string in drawing based on joint features |
WO2020046960A1 (en) * | 2018-08-31 | 2020-03-05 | Alibaba Group Holding Limited | System and method for optimizing damage detection results |
CN111079638A (en) * | 2019-12-13 | 2020-04-28 | 河北爱尔工业互联网科技有限公司 | Target detection model training method, device and medium based on convolutional neural network |
CN111222434A (en) * | 2019-12-30 | 2020-06-02 | 深圳市爱协生科技有限公司 | Method for obtaining evidence of synthesized face image based on local binary pattern and deep learning |
CN111553346A (en) * | 2020-04-26 | 2020-08-18 | 佛山市南海区广工大数控装备协同创新研究院 | Scene text detection method based on character region perception |
Non-Patent Citations (7)
Title |
---|
JUNMING CHANG 等: "A Segmentation Algorithm for Touching Character Based on the Invariant Moments and Profile Feature", 《2012 INTERNATIONAL CONFERENCE ON CONTROL ENGINEERING AND COMMUNICATION TECHNOLOGY》, 31 December 2012 (2012-12-31), pages 188 - 191, XP032311483, DOI: 10.1109/ICCECT.2012.159 * |
TIAN, H 等: "Promising Techniques for Anomaly Detection on Network Traffic", 《COMPUTER SCIENCE AND INFORMATION SYSTEMS》, vol. 14, no. 3, 30 November 2017 (2017-11-30), pages 597 - 609 * |
YANG Lingling et al.: "A Natural Scene Text Detection Algorithm Based on Image Moments and Texture Features", Journal of Chinese Computer Systems, vol. 37, no. 06, 30 June 2016 (2016-06-30), pages 1313 - 1317 * |
TIAN Xuan et al.: "Text Detection on Food Labels Based on Semantic Segmentation", Transactions of the Chinese Society for Agricultural Machinery, vol. 51, no. 08, 31 August 2020 (2020-08-31), pages 336 - 343 * |
TIAN Hui: "Research on Copyright Protection of Computer Games", China Doctoral Dissertations Full-text Database, Social Sciences I, no. 2019, 15 September 2019 (2019-09-15), pages 117 - 3 * |
ZHANG Hui et al.: "Text Region Detection and Localization in News Video Based on Multi-scale Image Fusion", Journal of Guizhou University (Natural Sciences), vol. 29, no. 06, 15 December 2012 (2012-12-15), pages 86 - 90 * |
JIA Wenqi et al.: "License Plate Character Recognition Based on Stacked Denoising Autoencoder Neural Network", Computer Engineering and Design, vol. 37, no. 03, 31 March 2016 (2016-03-31), pages 751 - 756 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113221867A (en) * | 2021-05-11 | 2021-08-06 | 北京邮电大学 | Deep learning-based PCB image character detection method |
CN113313720A (en) * | 2021-06-30 | 2021-08-27 | 上海商汤科技开发有限公司 | Object segmentation method and device |
CN113313720B (en) * | 2021-06-30 | 2024-03-29 | 上海商汤科技开发有限公司 | Object segmentation method and device |
CN113743416A (en) * | 2021-08-24 | 2021-12-03 | 的卢技术有限公司 | Data enhancement method for real sample-free situation in OCR field |
CN113743416B (en) * | 2021-08-24 | 2024-03-05 | 的卢技术有限公司 | Data enhancement method for non-real sample situation in OCR field |
CN113989485A (en) * | 2021-11-29 | 2022-01-28 | 合肥高维数据技术有限公司 | Text character segmentation method and system based on OCR recognition |
CN114579046A (en) * | 2022-01-21 | 2022-06-03 | 南华大学 | Cloud storage similar data detection method and system |
CN114579046B (en) * | 2022-01-21 | 2024-01-02 | 南华大学 | Cloud storage similar data detection method and system |
CN114549906A (en) * | 2022-02-28 | 2022-05-27 | 长沙理工大学 | Improved image classification algorithm for step-by-step training of Top-k loss function |
CN117649672A (en) * | 2024-01-30 | 2024-03-05 | 湖南大学 | Font type visual detection method and system based on active learning and transfer learning |
CN117649672B (en) * | 2024-01-30 | 2024-04-26 | 湖南大学 | Font type visual detection method and system based on active learning and transfer learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112580507B (en) | Deep learning text character detection method based on image moment correction | |
CN111325203B (en) | American license plate recognition method and system based on image correction | |
CN111723585B (en) | Style-controllable image text real-time translation and conversion method | |
CN111488826B (en) | Text recognition method and device, electronic equipment and storage medium | |
CN110390251B (en) | Image and character semantic segmentation method based on multi-neural-network model fusion processing | |
WO2019192397A1 (en) | End-to-end recognition method for scene text in any shape | |
CN110647829A (en) | Bill text recognition method and system | |
CN113673338B (en) | Automatic labeling method, system and medium for weak supervision of natural scene text image character pixels | |
CN110969129B (en) | End-to-end tax bill text detection and recognition method | |
CN110399845A (en) | Continuously at section text detection and recognition methods in a kind of image | |
CN103049763B (en) | Context-constraint-based target identification method | |
CN110580699A (en) | Pathological image cell nucleus detection method based on improved fast RCNN algorithm | |
CN111860348A (en) | Deep learning-based weak supervision power drawing OCR recognition method | |
CN110298343A (en) | A kind of hand-written blackboard writing on the blackboard recognition methods | |
CN111738055B (en) | Multi-category text detection system and bill form detection method based on same | |
CN112287941B (en) | License plate recognition method based on automatic character region perception | |
CN113158977B (en) | Image character editing method for improving FANnet generation network | |
CN110502655B (en) | Method for generating image natural description sentences embedded with scene character information | |
CN112069900A (en) | Bill character recognition method and system based on convolutional neural network | |
CN111523622B (en) | Method for simulating handwriting by mechanical arm based on characteristic image self-learning | |
CN112070174A (en) | Text detection method in natural scene based on deep learning | |
CN111340034A (en) | Text detection and identification method and system for natural scene | |
CN113762269A (en) | Chinese character OCR recognition method, system, medium and application based on neural network | |
CN115116074A (en) | Handwritten character recognition and model training method and device | |
CN110991374B (en) | Fingerprint singular point detection method based on RCNN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||