CN107609549B - Text detection method for certificate image in natural scene - Google Patents

Text detection method for certificate image in natural scene

Info

Publication number: CN107609549B (application CN201710854505.8A)
Authority: CN (China)
Prior art keywords: text, image, pixel, training, model
Legal status: Active (granted)
Other versions: CN107609549A (Chinese, zh)
Inventors: 张楠, 靳晓宁, 张文文, 段禹心, 贺思源
Assignee (current and original): Beijing University of Technology
Application filed by Beijing University of Technology

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a text detection method for certificate images in natural scenes, comprising the following steps: selecting common Chinese characters and rendering Chinese character images to form data set 1; applying random rotation and cropping to labeled certificate images and fusing them with different background pictures by Poisson cloning to form data set 2; training a character classification model on the VGG16 network with data set 1 and, after the model converges, using the obtained parameters to initialize a fully convolutional network model that is then trained with data set 2; processing an image with the trained fully convolutional network model and classifying each pixel by the maximum-probability rule to form a text/non-text binary image; obtaining text regions by the connected-region method, binarizing the original image, and extracting only the character information inside the text regions of the text/non-text binary image to obtain a text binary image; rectifying the image by the maximum-variance method; and projecting the rectified image again to refine the text/non-text binary image.

Description

Text detection method for certificate image in natural scene
Technical Field
The invention belongs to the field of image processing, and particularly relates to a text detection method for certificate images in natural scenes.
Background
The rapid development of Internet technology and the popularization of smartphones have greatly facilitated daily life. In many scenarios, an operator requires a user to upload a certificate (such as an identity card, a business license, or another credential) to verify the user's identity and qualifications. Photographing the certificate with a mobile phone and uploading it for verification is convenient and efficient. However, the shooting background in a natural scene is complex and full of environmental interference: users may shoot against richly textured everyday surfaces such as desktops or bed sheets, whose textures are hard to distinguish from characters, and text in the captured picture may be partially occluded, which also poses a significant challenge to text detection. Because users shoot in different environments, with different shooting modes and devices, the images may exhibit text rotation, text inclination, uneven illumination, blur, deformation, heavy noise, and the like. Traditional text detection techniques designed for scanned images struggle to achieve good results under these conditions.
Detecting characters in natural scenes is an important research subject of computer vision and pattern recognition in the field of object detection and recognition. Its ultimate purpose is to support subsequent character recognition and semantic understanding. As an important component of a character recognition system, natural scene text detection helps people understand natural scene content. It is the first processing step after image acquisition in a natural scene character recognition system, and its performance directly determines the recognition rate of the whole system. How to detect characters quickly and accurately is therefore a critical problem in natural scene character recognition.
At present there are two main families of algorithms for text detection in pictures: sliding-window methods and connected-region methods. Sliding-window methods scan all possible positions of a picture with sliding sub-windows of variable size and use a trained classifier to judge whether text is present in each window. Connected-region methods first rapidly separate text from non-text pixels with low-level filters, then connect text pixels with similar attributes into text components. Such methods treat text in an image as particular areas or as regions with particular textural features. First, features or methods such as color features, texture features, edge features, stroke width transform, and extremal regions are used to extract candidate regions in natural images as text candidates. Candidate regions without characters are filtered out; the remaining regions are regarded as characters and merged into text-line candidates, which are screened to obtain the final text detection result. The filtering and screening can use thresholds on manually designed features, or learn features with a statistical model or machine learning algorithm to adaptively screen the character candidate regions.
The Stroke Width Transform (SWT) and Maximally Stable Extremal Region (MSER) algorithms are representative of the second class of methods and have been the predominant classical algorithms in recent years.
The SWT (Stroke Width Transform) method extracts text candidates based on a series of general assumptions: characters are composed of strokes; strokes have a certain width; stroke widths within the same line of text are similar; and non-character parts are not composed of strokes and thus have no consistent stroke width. Based on these assumptions, a stroke width transform is applied to the image, computing for each pixel in the input image the width of the stroke it lies on, and connected regions are taken as character candidates.
The MSER (Maximally Stable Extremal Region) method uses MSER regions: regions that maintain their shape and size over a range of gray-level thresholds. They have sharp edges and strong gray-value contrast with the background. Owing to their morphology, characters generally contain rich edge information, and as a medium of information transmission they must be clearly legible, so they contrast strongly in color and gray value with the background; characters are therefore essentially MSER regions.
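For illustration only, this prior-art MSER extraction can be sketched with OpenCV's built-in detector; this is background, not the claimed method, and the input path is hypothetical:

```python
import cv2

# Sketch of prior-art MSER candidate extraction (background only).
img = cv2.imread("certificate.jpg")            # hypothetical input path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
mser = cv2.MSER_create()
regions, boxes = mser.detectRegions(gray)      # extremal regions + bounding boxes
for (x, y, w, h) in boxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)
```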
The existing method has the following defects:
(1) The sliding window must traverse the image at multiple scales and evaluate every detection window, so detection is slow and inefficient.
(2) The methods lack precision and have difficulty coping with complex backgrounds.
SWT is heavily affected by noise and blur, because it relies on successful edge detection before detecting by character stroke width. When the background is complex and edges are missed, the method fails. It also falsely detects many objects with regular, character-like lines, such as rings, grids, and bricks, as characters, and so cannot meet the requirements of users shooting in varied natural environments.
MSER handles blurred, unevenly illuminated, color- and texture-varying, and low-contrast text poorly.
Both the SWT and MSER methods detect single characters, whose results are inconvenient for an OCR module to use directly; the detected characters must be merged by character spacing, height difference, and other features, which increases the amount of computation.
Disclosure of Invention
The invention provides a text detection method for certificate images in natural scenes, which detects the text-region information in a certificate image shot by a user in a natural scene and outputs the independent text-line regions in the image; it tolerates image distortion, inclination, rotation within a certain angle, lighting changes, complex backgrounds, and the like.
In order to achieve the purpose, the invention adopts the following technical scheme:
a text detection method of a certificate image in a natural scene comprises the following steps:
step 1, establishing a training data set: selecting 3816 common Chinese characters and rendering Chinese character pictures in different typefaces to form data set 1, wherein the training images in data set 1 are the Chinese characters in different typefaces and the labels are the designated labels corresponding to the Chinese characters;
step 2, applying random rotation, cropping, blurring, inversion, brightness transformation, gamma transformation, and the like to the labeled certificate images, and fusing them with different background images by Poisson cloning to form data set 2, wherein the training images in data set 2 are text-containing images and the labels are text/non-text binary images of corresponding size;
step 3, training a character classification model of the VGG16 (Visual Geometry Group 16) network with data set 1; after the model converges, removing the fully connected layers of the VGG16 network to turn the model into a fully convolutional network (FCN), initializing the FCN model with the obtained parameters, and training the FCN model with data set 2;
step 4, processing the image with the trained fully convolutional network model to obtain a text/non-text probability map, and classifying each pixel by the maximum-probability rule to form a text/non-text binary image;
step 5, obtaining text regions from the text/non-text binary image by the connected-region method;
step 6, binarizing the original image and extracting only the character information inside the text regions of the step-5 text/non-text binary image, obtaining a text binary image;
step 7, rotating the text image obtained in step 6 by different angles, projecting it horizontally, and rectifying the image by the maximum-variance method;
and step 8, projecting the rectified image again, judging whether each region is horizontal or vertical from the number of its horizontally/vertically projected pixels, segmenting the character lines, and refining the text/non-text binary image obtained in step 5.
By convolving the image and fusing convolutional features from different layers, the method needs no multi-scale sliding window to traverse the image. Pixel-by-pixel prediction makes the result more accurate. Features are extracted by convolution; the network is fully convolutional with no fully connected layer, so processing can run in real time. Fusing the features of different convolutional layers provides both spatial and textural features, so text regions are still detected well when the image background is complex. Enlarging the training set by Poisson cloning effectively prevents the model from overfitting and enriches the scenes of the training samples.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 shows a network structure of a convolution layer portion shared by VGG16 and a text region detection model;
FIG. 3 is a network structure of VGG 16;
FIG. 4 is an overall network structure of a text detection model;
FIG. 5a is a sample view of a document image;
FIG. 5b is a background image to be fused;
FIG. 5c is a fused document image;
FIG. 5d is a text-to-non-text binary image of full convolution neural network prediction;
FIG. 5e is a certificate text information graph derived from a text-to-non-text binary image;
FIG. 5f is a diagram of the document text information after maximum variance correction;
FIG. 5g is a binary image of the refined text region;
FIG. 5h is a text region map of the rectified certificate image determined from the refined text region binary map.
Detailed Description
As shown in fig. 1, the present invention provides a text detection method for a certificate image in a natural scene, which includes the following steps:
Step 1: 3816 common Chinese characters are selected, Chinese character pictures are rendered in different typefaces such as Song (SimSun), Hei (bold sans), Kai (regular script), and Li (clerical script), and a certain amount of salt-and-pepper noise and Gaussian noise is added to the pictures to form data set 1, wherein the training images in data set 1 are the Chinese characters in different typefaces and the labels are the designated labels corresponding to the Chinese characters.
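As a non-limiting sketch, data set 1 can be generated along these lines with Pillow and NumPy; the font file paths and noise levels are assumptions, not values fixed by the invention:

```python
from PIL import Image, ImageDraw, ImageFont
import numpy as np

# Hypothetical font files for Song/Hei/Kai/Li typefaces.
FONTS = ["simsun.ttc", "simhei.ttf", "simkai.ttf", "SIMLI.TTF"]

def render_char(ch, font_path, size=28):
    """Render one Chinese character and add Gaussian + salt-and-pepper noise."""
    img = Image.new("L", (size, size), color=255)
    font = ImageFont.truetype(font_path, int(size * 0.8))
    ImageDraw.Draw(img).text((2, 2), ch, fill=0, font=font)
    arr = np.asarray(img, dtype=np.float32)
    arr = arr + np.random.normal(0, 8, arr.shape)   # Gaussian noise (assumed sigma)
    mask = np.random.rand(*arr.shape)               # ~2% salt-and-pepper (assumed rate)
    arr[mask < 0.01] = 0
    arr[mask > 0.99] = 255
    return np.clip(arr, 0, 255).astype(np.uint8)

# usage: sample = render_char("永", FONTS[0])
```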
Step 2: the neural network model has many parameters and needs a large amount of training data to prevent overfitting. Because labeled samples are expensive, the limited labeled samples must be expanded. Each labeled certificate image is randomly rotated with rotation angle rotate ∈ [-30, 30] degrees, and randomly cropped with newWidth ∈ [0.7 × width, width] and newHeight ∈ [0.7 × height, height], where width and height are the dimensions of the original image. Random Gaussian blur is applied with kernelSize ∈ [3, 9] and sigma ∈ [1, 9]. The BGR image is converted to an HSV representation; after splitting the channels, a random value hue_vari ∈ [-8, 8] is added to the hue H, the saturation S is multiplied by a random sat_vari ∈ [0.5, 1.5], and the value V is multiplied by a random val_vari ∈ [0.7, 1.3]. A random gamma transformation with gamma ∈ [0.5, 2.0] is applied using the lookup table
table[i] = (i / 255)^gamma × 255,  i ∈ [0, 255]

and each image pixel value pixel_i is mapped through this gamma table. The images are then randomly fused with different backgrounds by Poisson cloning; for example, fusing FIG. 5a with FIG. 5b gives FIG. 5c, which enriches the samples and image scenes. This forms data set 2, in which the training images are text-containing images and the labels are text/non-text binary images of corresponding size.
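A minimal sketch of this augmentation pipeline with OpenCV, using the parameter ranges quoted above; the helper names and the choice of cv2.NORMAL_CLONE for the Poisson fusion are assumptions:

```python
import cv2
import numpy as np

rng = np.random.default_rng()

def augment(img):
    """Sketch of step-2 augmentation: rotation, crop, blur, HSV jitter, gamma."""
    h, w = img.shape[:2]
    # random rotation, rotate in [-30, 30] degrees
    M = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(-30, 30), 1.0)
    img = cv2.warpAffine(img, M, (w, h))
    # random crop: new size in [0.7*dim, dim]
    nw, nh = int(rng.uniform(0.7, 1.0) * w), int(rng.uniform(0.7, 1.0) * h)
    x, y = rng.integers(0, w - nw + 1), rng.integers(0, h - nh + 1)
    img = img[y:y + nh, x:x + nw]
    # random Gaussian blur, odd kernel size in [3, 9], sigma in [1, 9]
    k = int(rng.choice([3, 5, 7, 9]))
    img = cv2.GaussianBlur(img, (k, k), rng.uniform(1, 9))
    # HSV jitter: H shift in [-8, 8], S scale in [0.5, 1.5], V scale in [0.7, 1.3]
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] + rng.uniform(-8, 8)) % 180
    hsv[..., 1] = np.clip(hsv[..., 1] * rng.uniform(0.5, 1.5), 0, 255)
    hsv[..., 2] = np.clip(hsv[..., 2] * rng.uniform(0.7, 1.3), 0, 255)
    img = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
    # random gamma via lookup table: table[i] = (i/255)**gamma * 255
    gamma = rng.uniform(0.5, 2.0)
    table = ((np.arange(256) / 255.0) ** gamma * 255).astype(np.uint8)
    return cv2.LUT(img, table)

def poisson_fuse(doc, background, center):
    """Poisson (seamless) cloning of the document onto a new background."""
    mask = 255 * np.ones(doc.shape[:2], np.uint8)
    return cv2.seamlessClone(doc, background, mask, center, cv2.NORMAL_CLONE)
```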
Step 3: the pictures in data set 1 are resized to a fixed size of 28 × 28 pixels, the pixel values are normalized to between 0 and 1, and they are fed into the VGG16 neural network model, whose structure is shown in FIG. 3. The VGG16 network is pre-trained by gradient descent, and the neurons use the ReLU activation function:
f(x) = max(0, W^T x + b)
where W^T is the weight parameter to be trained in the neuron, x is the neuron input, and b is the bias parameter to be trained. The loss function of the VGG16 network is the softmax cross-entropy loss L(y_i, H_i):

H_i = softmax(f_i) = e^{f_i} / Σ_j e^{f_j}

L(y_i, H_i) = -(1/m) Σ_{i=1}^{m} y_i · log(H_i)

where m is the number of samples in a training batch, f_i is the predicted output of the i-th sample in the batch, and y_i is the true label of the i-th sample.
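For concreteness, the softmax cross-entropy loss above can be written in a few lines of NumPy (a sketch of the formula, not the patent's training code):

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch, as in L(y_i, H_i).
    logits: (m, num_classes) raw scores f_i; labels: (m,) integer class ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    m = logits.shape[0]
    return -np.log(probs[np.arange(m), labels] + 1e-12).mean()
```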
Step 4: when the accuracy of the VGG16 model of step 3 reaches a preset level, pre-training is stopped, the fully connected layers of the VGG16 model are removed, two convolutional layers with 1 × 1 kernels are added, and part of the parameters (W^T and b) are dropped out with probability 0.5. The feature maps are enlarged by transposed convolution, and the pool-4 and pool-3 layers are fused pixel by pixel. Finally the output is restored to the original image size by transposed convolution. After the two dropout layers, every multi-channel layer is first reduced to 2 channels by a 1 × 1 convolutional layer before further operations. The FCN model structure is shown in FIG. 4. The model is trained by gradient descent with an exponentially decaying learning rate, and the neurons still use the ReLU activation function:
f(x) = max(0, W^T x + b)
where W^T is the weight parameter to be trained in the neuron, x is the neuron input, and b is the bias parameter to be trained.
The loss function is the softmax cross-entropy loss:

L = -(1/(m·M·N)) Σ_{i=1}^{m} Σ_{j=1}^{M×N} y_{ij} · log(softmax(f_{ij}))

where m is the number of samples in a training batch, M and N are the height and width of the input image, f_{ij} is the predicted value of the j-th pixel of the i-th sample in the batch, and y_{ij} is the true value of that pixel.
In the FCN design, the fully convolutional network uses 3 × 3 kernels: two cascaded convolutional layers have a receptive field equal to a 5 × 5 kernel, and three cascaded layers equal a 7 × 7 kernel, so the receptive field grows while the number of trainable parameters shrinks. The 1 × 1 convolutions effectively replace the fully connected layers, and dropping out part of the parameters prevents overfitting, reduces the data dimensionality, and cuts computation. Reusing the trained convolutional and pooling parameters of the VGG16 model greatly accelerates convergence and shortens training time. Because the FCN has no fully connected layer, the input image can be of any size, eliminating the distortion that occurs when an image must be resized to a fixed size. The FCN predicts text/non-text pixel by pixel, so its detection precision is higher than that of sliding-window and connected-region methods.
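The decoder described above (1 × 1 score layers after dropout, two 2× transposed convolutions fusing pool-4 and pool-3, then an 8× transposed convolution back to input size) matches the familiar FCN-8s pattern; a minimal PyTorch sketch follows, with VGG16 channel counts (256/512/512) assumed:

```python
import torch.nn as nn

class FCNHead(nn.Module):
    """Sketch of the FCN decoder described above: 1x1 convolutions replace the
    fully connected layers, dropout (p=0.5) discards parameters, and transposed
    convolutions upsample and fuse pool-3/pool-4 features pixel by pixel."""
    def __init__(self, c3=256, c4=512, c5=512, num_classes=2):
        super().__init__()
        self.drop = nn.Dropout2d(p=0.5)
        self.score5 = nn.Conv2d(c5, num_classes, kernel_size=1)   # reduce to 2 channels
        self.score4 = nn.Conv2d(c4, num_classes, kernel_size=1)
        self.score3 = nn.Conv2d(c3, num_classes, kernel_size=1)
        self.up2a = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up2b = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up8 = nn.ConvTranspose2d(num_classes, num_classes, 16, stride=8, padding=4)

    def forward(self, pool3, pool4, pool5):
        x = self.up2a(self.score5(self.drop(pool5)))   # stride 32 -> 16
        x = x + self.score4(pool4)                      # fuse pool-4 pixel by pixel
        x = self.up2b(x)                                # stride 16 -> 8
        x = x + self.score3(pool3)                      # fuse pool-3
        return self.up8(x)                              # back to input resolution
```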
Step 5: the image pixel values in data set 2 are normalized to between 0 and 1 and fed into the FCN model of step 4 for training. As shown in FIG. 4, the FCN parameters shared with VGG16 are initialized from the convolutional and pooling layer parameters pre-trained in step 3, and the newly added layers are initialized with truncated-normal random numbers. The model outputs a text/non-text probability map of the input image: for any pixel pixel_ij of the input image it gives the probability P_True(pixel_ij) that the pixel lies in a text region and the probability P_False(pixel_ij) that it lies in a non-text region, and the two probabilities are compared. If:

P_True(pixel_ij) > P_False(pixel_ij)

the pixel pixel_ij is considered to belong to the text region; otherwise it belongs to the non-text region. Text-region pixels are labeled 1 and non-text pixels 0, finally yielding a text/non-text distribution map of the whole image, as shown in FIG. 5d, whose cross entropy with the label is computed for training.
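The maximum-probability decision is a per-pixel comparison; a one-line NumPy sketch, where the (2, H, W) probability layout is an assumption:

```python
import numpy as np

# prob: assumed (2, H, W) array of per-pixel [text, non-text] probabilities
prob = np.random.rand(2, 480, 640)               # stand-in for an FCN output
binary_map = (prob[0] > prob[1]).astype(np.uint8)  # 1 where P_True > P_False
```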
Step 6: clustering the text-region pixels by connected regions detects the text areas of the image well. However, this method cannot accurately segment text lines: many lines stick together in the resulting text regions, whereas the purpose of text detection is to output independent text-line regions. Therefore the certificate image is binarized and only the character information inside the text regions of the step-5 text/non-text distribution map is kept, giving a binary image that contains only text information, as shown in FIG. 5e.
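A sketch of step 6 with OpenCV, where connected components stand in for the connected-region clustering and Otsu thresholding is an assumed choice of binarization:

```python
import cv2
import numpy as np

def extract_text_binary(image, binary_map):
    """Keep only character pixels inside detected text regions.
    image: BGR certificate photo; binary_map: uint8 text/non-text map (step 5)."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary_map, connectivity=8)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, bin_img = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    text_only = np.zeros_like(bin_img)
    for i in range(1, n):                       # label 0 is the background
        x, y, w, h, area = stats[i]
        if area < 20:                           # assumed noise-size threshold
            continue
        region = labels[y:y + h, x:x + w] == i
        text_only[y:y + h, x:x + w][region] = bin_img[y:y + h, x:x + w][region]
    return text_only
```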
Step 7: an image shot by a user may carry some rotation, and a complex background strongly degrades certificate rectification. After the detection model has run, the text-region information is available, so the text can be extracted from the image and the certificate rectified without interference from the complex background. In a certificate image the characters of a text region are arranged in lines with obvious blanks between them, so the closer the projection direction is to the orientation of the certificate, the sharper the difference between the peaks and valleys of the projection curve and the larger the variance of the text-region projection. The text binary image of step 6 (of size N × M) is rotated and projected repeatedly, and the rotation angle at which the projection variance is maximal is the rectification angle. The image is projected horizontally, and the sum of text pixels in the row with ordinate i is recorded as sum_i:

sum_i = Σ_{j=1}^{N} I(pixel_ij ∈ text)
where I is an indicator function: I = 1 when pixel_ij ∈ text, and I = 0 otherwise. The mean over all M rows is

mean = (1/M) Σ_{i=1}^{M} sum_i

The image is rotated about its center point by different angles θ_k and the variance is calculated each time. The projection variance of the image text region is:

var(θ_k) = (1/M) Σ_{i=1}^{M} (sum_i − mean)²

The rotation angle θ_k at which the variance is maximal is the tilt angle θ of the image:

θ = argmax_{θ_k} var(θ_k)
The image and the text/non-text region map are rectified by this angle, as shown in FIG. 5f.
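The maximum-variance rectification of step 7 can be sketched as follows; the candidate angle grid is an assumption:

```python
import cv2
import numpy as np

def deskew_by_projection_variance(text_bin, angles=np.arange(-30, 30.5, 0.5)):
    """Rotate the text binary image through candidate angles and keep the one
    maximising the variance of the horizontal projection profile (step 7)."""
    h, w = text_bin.shape
    center = (w / 2, h / 2)
    best_angle, best_var = 0.0, -1.0
    for theta in angles:
        M = cv2.getRotationMatrix2D(center, theta, 1.0)
        rot = cv2.warpAffine(text_bin, M, (w, h))
        sums = (rot > 0).sum(axis=1)            # sum_i: text pixels per row
        var = sums.var()                        # projection variance var(theta_k)
        if var > best_var:
            best_var, best_angle = var, theta
    M = cv2.getRotationMatrix2D(center, best_angle, 1.0)
    return best_angle, cv2.warpAffine(text_bin, M, (w, h))
```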
Step 8: the text/non-text distribution map produced by the detection model may contain stuck-together character lines, especially when the gaps between lines in the certificate are small. Line adhesion seriously harms subsequent character recognition, so the stuck lines are split to output independent text-line regions. After the rectification of step 7, the text lines in the certificate are essentially horizontal or vertical. The rectified image is projected horizontally and vertically, and the orientation of the text in each region is judged from the number of pixels projected onto each row/column; for example, when a region's vertical projection is much longer than its horizontal projection, the text in that region is considered horizontal. The region is then projected perpendicular to the text direction. The peak and valley points of each projection curve are determined from the trend of the projected pixel counts: a peak is a local extremum whose value exceeds that of the surrounding points, and a valley is a local extremum whose value is below them. Segmenting the text lines relies mainly on finding valleys. To eliminate false valleys, statistics are taken over the mean of the sums of the preceding 5 rows; if

sum_i < (1/5) Σ_{k=i−5}^{i−1} sum_k

the row is considered a non-text region. The text lines are segmented accordingly, and the text/non-text binary image is refined to obtain the position information of the text-line regions, as shown in FIGS. 5g and 5h.
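A sketch of the valley-based line splitting of step 8 for a horizontal region; the 0.2 damping factor is an assumed false-valley filter, since the patent states only that the mean of the preceding 5 rows is used:

```python
import numpy as np

def split_text_lines(region_bin):
    """Project a horizontal text region onto rows and cut at valleys. A row is
    a valley when its sum drops well below the mean of the preceding 5 rows
    (the 0.2 factor is an assumption, not a value fixed by the patent)."""
    sums = (region_bin > 0).sum(axis=1)         # sum_i: text pixels per row
    lines, start = [], None
    for i, s in enumerate(sums):
        prev = sums[max(0, i - 5):i]
        valley = s == 0 or (len(prev) == 5 and s < 0.2 * prev.mean())
        if not valley and start is None:
            start = i                           # a text line begins
        elif valley and start is not None:
            lines.append((start, i))            # rows [start, i) form one line
            start = None
    if start is not None:
        lines.append((start, len(sums)))
    return lines

# usage: line_spans = split_text_lines(text_region)  # list of (top, bottom) rows
```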

Claims (1)

1. A text detection method of a certificate image in a natural scene is characterized by comprising the following steps:
step 1, establishing a training data set: selecting common Chinese characters and rendering Chinese character pictures in different typefaces to form data set 1, wherein the training images in data set 1 are the Chinese characters in different typefaces and the labels are the designated labels corresponding to the Chinese characters;
step 2, applying random rotation, cropping, blurring, inversion, brightness transformation, and gamma transformation to the labeled certificate images and fusing them with different background images by Poisson cloning to form data set 2, wherein the training images in data set 2 are text-containing images and the labels are text/non-text binary images of corresponding size;
step 3, training a character classification model of the VGG16 (Visual Geometry Group 16) network with data set 1; after the model converges, removing the fully connected layers of the VGG16 network to turn the model into a fully convolutional network (FCN), initializing the FCN model with the obtained VGG16 character classification model parameters, and training the FCN model with data set 2;
the step 3 specifically comprises: resizing the pictures in data set 1 to a fixed size of 28 × 28 pixels, normalizing the pixel values to between 0 and 1, feeding them into the VGG16 neural network model, and pre-training the VGG16 network by gradient descent, wherein the neurons use the ReLU activation function:
f(x) = max(0, W^T x + b)
where W^T is the weight parameter to be trained in the neuron, x is the neuron input, and b is the bias parameter to be trained; the loss function of the VGG16 network is the softmax cross-entropy loss L(y_i, H_i):

H_i = softmax(f_i) = e^{f_i} / Σ_j e^{f_j}

L(y_i, H_i) = -(1/m) Σ_{i=1}^{m} y_i · log(H_i)

where m is the number of samples in a training batch, f_i is the predicted output of the i-th sample in the batch, and y_i is the true label of the i-th sample;
stopping pre-training when the accuracy of the VGG16 model reaches a preset level, removing the fully connected layers of the VGG16 model, adding two convolutional layers with 1 × 1 kernels, and dropping out part of the parameters with probability 0.5; enlarging the feature maps by transposed convolution and fusing the pool-4 and pool-3 layers pixel by pixel; finally restoring the convolutional output to the original image size by transposed convolution; after the two dropout layers, reducing every multi-channel layer to 2 channels by a 1 × 1 convolutional layer before further operations; training the model by gradient descent, wherein the neurons still use the ReLU activation function:
f(x) = max(0, W^T x + b)
where W^T is the weight parameter to be trained in the neuron, x is the neuron input, and b is the bias parameter to be trained,
the loss function adopts the softmax cross-entropy loss,

L = -(1/(m·M·N)) Σ_{i=1}^{m} Σ_{j=1}^{M×N} y_{ij} · log(softmax(f_{ij}))

where m is the number of samples in a training batch, M and N are the height and width of the image, f_{ij} is the predicted value of the j-th pixel of the i-th sample in the batch, and y_{ij} is the true value of that pixel;
step 4, processing the image with the trained fully convolutional network model to obtain a text/non-text probability map, and classifying each pixel by the maximum-probability rule to form a text/non-text binary image;
step 5, obtaining text regions from the text/non-text binary image by the connected-region method;
the step 5 specifically comprises: normalizing the image pixel values in data set 2 to between 0 and 1 and feeding them into the FCN model of step 4 for training, initializing the pre-trained FCN parameters from the step-3 VGG16 convolutional and pooling layer parameters, and initializing the newly added layers with truncated-normal random numbers; the FCN model outputs a text/non-text probability map of the image; for any pixel pixel_ij of the image it gives the probability P_True(pixel_ij) that the pixel lies in a text region and the probability P_False(pixel_ij) that it lies in a non-text region, and the two probabilities are compared; if:

P_True(pixel_ij) > P_False(pixel_ij)

the pixel pixel_ij is considered to belong to the text region, otherwise to the non-text region; text-region pixels are labeled 1 and non-text pixels 0, finally yielding a text/non-text distribution map of the whole image;
step 6, binarizing the image and extracting only the character information inside the text regions of the step-5 text/non-text binary image, obtaining a text binary image;
step 7, rotating the text image obtained in step 6 by different angles, projecting it horizontally, and rectifying the image by the maximum-variance method;
the step 7 specifically comprises: rotating and projecting the step-6 text binary image repeatedly, wherein the rotation angle at which the projection variance is maximal is the rectification angle of the image; the image is projected horizontally, and the sum of text pixels in the row with ordinate i is recorded as sum_i:
sum_i = Σ_{j=1}^{N} I(pixel_ij ∈ text)
where I is an indicator function: I = 1 when pixel_ij ∈ text, and I = 0 otherwise; the mean over all M rows is

mean = (1/M) Σ_{i=1}^{M} sum_i

the image is rotated about its center point by different angles θ_k and the variance is calculated each time, the projection variance of the image text region being:

var(θ_k) = (1/M) Σ_{i=1}^{M} (sum_i − mean)²

when the variance is maximal, the corresponding rotation angle θ_k is the tilt angle θ of the image:

θ = argmax_{θ_k} var(θ_k)
step 8, projecting the rectified image again, judging whether each region is horizontal or vertical from the number of its horizontally/vertically projected pixels, segmenting the character lines, and refining the text/non-text binary image obtained in step 5;
the step 8 specifically comprises: projecting the rectified image horizontally and vertically and judging the orientation of the text in each region from the number of pixels projected onto each row/column; when a region's vertical projection is much longer than its horizontal projection, the text in that region is considered horizontal, and the region is then projected perpendicular to the text direction; the peak and valley points of each projection curve are determined from the trend of the projected pixel counts, a peak being a local extremum whose value exceeds that of the surrounding points and a valley a local extremum whose value is below them; the character lines are divided by searching for valleys, and to eliminate false valleys, statistics are taken over the mean of the sums of the preceding 5 rows; if

sum_i < (1/5) Σ_{k=i−5}^{i−1} sum_k

the row is considered a non-text region; the text lines are segmented accordingly, and the text/non-text binary image is refined to obtain the position information of the text-line regions.
CN201710854505.8A 2017-09-20 2017-09-20 Text detection method for certificate image in natural scene Active CN107609549B (en)

Priority Application (1)

Application Number: CN201710854505.8A; Priority Date / Filing Date: 2017-09-20; Title: Text detection method for certificate image in natural scene

Publications (2)

CN107609549A, published 2018-01-19
CN107609549B, granted 2021-01-08

Family ID: 61061405

Country Status (1)

CN (1) CN107609549B (en)




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant