CN111652233B

CN111652233B - Text verification code automatic identification method aiming at complex background

Info

Publication number: CN111652233B
Application number: CN202010495757.8A
Authority: CN
Inventors: 王瑶; 王佰玲; 魏玉良; 张茗晋; 辛国栋; 王巍
Original assignee: Weihai Tianzhiwei Network Space Safety Technology Co ltd; Harbin Institute of Technology Weihai
Current assignee: Weihai Tianzhiwei Network Space Safety Technology Co ltd; Harbin Institute of Technology Weihai
Priority date: 2020-06-03
Filing date: 2020-06-03
Publication date: 2023-04-25
Anticipated expiration: 2040-06-03
Also published as: CN111652233A

Abstract

The invention relates to a method for automatically recognizing text verification codes (CAPTCHAs) with complex backgrounds, comprising the following steps: a verification code denoising module removes the complex security features of the real verification code using a cycle-consistent generative adversarial network (CycleGAN); a character segmentation module uses image processing algorithms to segment the whole verification code picture into single characters; and the segmented characters are fed into a text recognition network to obtain the final output. The method quickly and effectively recognizes text verification codes with background noise, character distortion, and blurred edges, generalizes and ports well, can easily be embedded into a crawler, and solves the verification code problem encountered during data acquisition.

Description

Text verification code automatic identification method aiming at complex background

Technical Field

The invention relates to a method for automatically recognizing text verification codes with complex backgrounds, and belongs to the technical field of verification code recognition.

Background

In the big data era, data sources are a prerequisite for big data analysis and data mining, and manually searching the Internet for useful data is time-consuming and labor-intensive. Crawler technology can automatically collect content of interest from the Internet and use it as a data source for more advanced analysis. Verification codes, a measure for blocking automated programs, are a main constraint on crawling. Character-based verification codes are still widely used on the web, so a fully automatic, end-to-end recognition method for this type of verification code is particularly important.

Existing automatic verification code recognition algorithms generally fall into three broad categories: attack algorithms for a specific type of verification code, algorithms based on character segmentation, and methods based on deep learning. A type-specific attack algorithm can recognize only a single type of verification code picture (such as Microsoft verification codes) and cannot be generalized to other types, so it is difficult to apply in engineering practice. Algorithms based on character segmentation generally preprocess the verification code picture with traditional image processing (graying, binarization, and the like); because traditional image processing is limited, background interference cannot be removed effectively, which makes character segmentation difficult and recognition accuracy low. In recent years, with the development of deep learning, verification code recognition based on neural network models has achieved good results, but two main problems remain. First, most existing deep-learning-based recognition methods use supervised learning and need a large amount of labeled training data (generally no fewer than 50,000 pictures), which is time-consuming and labor-intensive; when labeled samples are insufficient, overfitting occurs very easily, so the model fails to converge and accuracy is very low. Second, existing recognition methods achieve high accuracy on regular, slightly noisy text verification codes, but cannot recognize text verification codes with complex security features well.

In addition, Chinese patent document CN107967475A discloses a verification code recognition method based on a sliding window and a convolutional neural network. First, a small number of verification code pictures are collected and, after noise reduction, the set of characters to be recognized is extracted; each character is rotated and warped and background noise is added, and a convolutional neural network is trained on the resulting character set to obtain a single-character classifier. Finally, the verification code picture to be recognized is preprocessed and divided into connected domains, a window is slid over each connected domain, and the previously trained single-character classifier classifies each window to give the final recognition result. Chinese patent document CN110555298A discloses a verification code recognition device and a computing device, whose verification code recognition model training method includes: acquiring verification code image samples of the same verification code length and determining the character sample label corresponding to each sample; determining the verification code characters that form the character sample labels and their attribute values, and acquiring the character type information of the verification code characters; encoding the character sample labels according to the character type information and the attribute values to obtain encoded sample labels; and training a verification code recognition model for recognizing verification code images using the image samples and the encoded sample labels.
However, the methods in both of the above patent documents preprocess the verification code picture with traditional image processing algorithms. This is suitable only when there is no obvious noise; for verification code types with complex security features it cannot remove noise interference effectively, which seriously affects the accuracy of character segmentation and recognition.

Disclosure of Invention

To address the problems of existing verification code recognition technology, in particular that noise cannot be removed well from text verification codes with complex security features and that distorted text verification codes are recognized poorly when only a few labeled samples are available, the invention provides a method for automatically recognizing text verification codes with complex backgrounds. The method needs few labeled samples, has a short processing time and high recognition accuracy, overcomes the need of existing algorithms for a large amount of manual labeling and their poor recognition of characters on complex, distorted backgrounds, and has wide application prospects. The method combines the verification code denoising module, the character segmentation module, and the verification code recognition module into a whole, realizing end-to-end automatic recognition of text verification codes. It obtains high recognition accuracy with only a small number of labeled samples (500 pictures), and recognizes noisy and distorted verification codes well. The recognition method generalizes well and can be applied to different types of text verification codes without changing the model structure. The model can also easily be embedded into a crawler, quickly and efficiently solving the text verification code anti-crawler problem that enterprises and individuals encounter when acquiring data.

The technical scheme of the invention is as follows:

A method for automatically recognizing text verification codes with complex backgrounds comprises the following steps:

a verification code denoising module removes the complex security features of the real verification code using a cycle-consistent generative adversarial network (CycleGAN);

a character segmentation module uses an image processing algorithm to segment the whole verification code picture into single characters;

and the segmented characters are fed into a text recognition network to obtain the final output.

According to the invention, preferably, for text verification code types with large distortion and rotation, the text recognition network first rectifies the characters using spatial transformer layers, so that the model is spatially invariant.

The invention aims to automatically recognize text verification codes with complex security features (such as background noise, blurred edges, and character distortion), and is an automatic verification code recognition method based on a small number of training samples. It comprises three parts: a verification code denoising module, a character segmentation module, and a text recognition module. The overall solution is shown in fig. 1. The verification code on the left side of fig. 1 comes from Wikipedia and has blurred edges, noise, and distorted text. The noise at the edges hardly affects recognition by the human eye, but for a neural network the disordered pixel distribution makes the picture hard to segment and high recognition accuracy hard to reach. The invention therefore first denoises the real verification code with the cycle-consistent generative adversarial network so that its edges become clear, which favors further recognition. The whole verification code picture is then segmented into single characters using an image processing algorithm. Finally, the segmented characters are fed into a text recognition network to obtain the final output; in particular, for text verification code types with large distortion and rotation, the network first rectifies the characters using spatial transformer layers, so that the model is spatially invariant.

Verification code denoising module

According to the invention, preferably, when denoising the real verification code with the cycle-consistent generative adversarial network, a verification code generator is first used to generate, in batches and through parameter adjustment, pictures whose format is similar to that of the real verification code; these pictures are combined in pairs with the real verification codes to form the training set that serves as the input of the denoising network.

According to the invention, preferably, the cycle-consistent generative adversarial network (CycleGAN) consists of two generators and two discriminators and has a dual structure as a whole. Its core goal is to convert a verification code picture with complex security features into a simple verification code with those features removed, thereby reducing the difficulty of character segmentation and recognition. During model training, as shown in fig. 2, a real input image is first taken from domain A and converted by the first generator A→B into a simple verification code picture in target domain B; this picture is then fed to the second generator B→A, which converts it back to the original complex picture. In addition, two discriminators are used to judge whether an input picture is a real input picture or a fake picture produced by a generator.

According to the invention, preferably, the optimization objective of the denoising cycle-consistent generative adversarial network includes two different types of loss function, namely an adversarial loss and a cycle consistency loss; the adversarial loss matches the pixel distribution of the generated pictures to the pixel distribution of the pictures in the target domain; the cycle consistency loss keeps the image obtained after conversion as similar as possible to the image in the source domain;

further preferably, the real verification codes and the generated verification codes are taken as domain X and domain Y respectively, and two style converters convert between domain X and domain Y in both directions; the optimization process is as follows: (1) first, a convolutional neural network extracts features from the input picture to obtain a feature vector; (2) then a ResNet module converts the feature vector of the picture in domain X into a feature vector in domain Y, preserving the features of the original image during conversion; (3) finally, the decoding stage restores the converted image from the feature vector by deconvolution. The discriminator consists of a multi-layer convolutional neural network; it takes a picture as input and tries to judge whether the picture is a real picture from the original domain or a fake picture produced by conversion, and its final layer outputs the predicted probability that the input is a real picture. The algorithm flow is shown in fig. 3. Unlike the one-way conversion of a conventional generative adversarial network, the invention uses two style converters to convert between domain X and domain Y.
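As a minimal sketch of the two loss terms described above, the following Python (NumPy) code computes an adversarial loss and a cycle consistency loss on toy arrays. The binary cross-entropy form of the adversarial loss, the L1 form of the cycle consistency loss, and the placeholder converters `G` and `F` are assumptions borrowed from the usual CycleGAN formulation; the description does not give the exact formulas.

```python
import numpy as np

def adversarial_loss(d_real, d_fake):
    """Discriminator loss from predicted 'real' probabilities in (0, 1)."""
    eps = 1e-12  # avoids log(0)
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

def cycle_consistency_loss(x, x_rec, y, y_rec):
    """L1 error of both round trips, X -> Y -> X and Y -> X -> Y."""
    return np.mean(np.abs(x_rec - x)) + np.mean(np.abs(y_rec - y))

rng = np.random.default_rng(0)
x = rng.random((4, 8, 8))  # stand-in for real verification codes (domain X)
y = rng.random((4, 8, 8))  # stand-in for generated verification codes (domain Y)

# Hypothetical style converters; real ones would be trained networks.
G = lambda img: 1.0 - img  # X -> Y
F = lambda img: 1.0 - img  # Y -> X

cyc = cycle_consistency_loss(x, F(G(x)), y, G(F(y)))
adv = adversarial_loss(np.full(4, 0.5), np.full(4, 0.5))
print(cyc, adv)
```

Because the two toy converters invert each other, the cycle consistency loss is essentially zero; a discriminator that outputs 0.5 everywhere yields an adversarial loss of 2·ln 2, the value for a discriminator that cannot tell real from fake.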

The trained verification code denoising network can effectively identify the complex security features that may interfere with character segmentation and recognition (including background noise, interference lines, character colors, character distortion and blur, small character spacing, and the like) and remove them well, turning the input into a simple verification code picture. This effectively reduces the difficulty of character segmentation and recognition, so that high recognition accuracy can be reached with only a small number of labeled samples. The denoising network is also general: it can be applied to different types of verification code pictures without changing the model structure, which greatly reduces manual intervention.

Character segmentation module

After passing through the verification code denoising network, the original verification code with complex security features is converted into a simple verification code, which is input into the character segmentation module. Depending on the characteristics of the verification code type, the character string in the picture is segmented into single characters using contour detection, conventional segmentation, threshold segmentation, or similar methods. Equidistant segmentation is a conventional image processing algorithm that divides the pixels of a picture equally into N parts, but it does not separate verification code characters well: as shown in fig. 4 (a), two characters may fall in one box. The invention therefore improves this segmentation method: the starting position of the segmentation is moved from (0, 0) to the upper-left pixel of the first character, the segmentation width is set to the approximate width of each character, and the height is set to the approximate height of each character. The segmentation effect is shown in fig. 4 (b), where the black frames represent the segmentation result.

According to the invention, preferably, the image processing algorithm uses contour detection, an improved equidistant segmentation algorithm, and a threshold segmentation algorithm, wherein in the improved equidistant segmentation algorithm the starting position of segmentation is the upper-left pixel of the first character, the segmentation width is the approximate width of each character, and the segmentation height is the approximate height of each character.
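The improved equidistant segmentation can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes a binarized picture in which nonzero pixels belong to characters, assumes the number of characters `n_chars` is known, and estimates the per-character width as the text width divided by `n_chars`. The function name and parameters are hypothetical.

```python
import numpy as np

def improved_equidistant_split(binary, n_chars):
    """Return n_chars boxes (x, y, width, height) covering only the text area.

    binary: 2-D array, nonzero where a pixel belongs to a character.
    """
    ys, xs = np.nonzero(binary)
    if xs.size == 0:
        return []
    # Start at the upper-left corner of the text instead of (0, 0).
    x0, x1 = xs.min(), xs.max() + 1
    y0, y1 = ys.min(), ys.max() + 1
    # Assumption: characters have roughly equal width inside the text area.
    edges = np.linspace(x0, x1, n_chars + 1).round().astype(int)
    return [(int(edges[i]), int(y0), int(edges[i + 1] - edges[i]), int(y1 - y0))
            for i in range(n_chars)]

image = np.zeros((10, 40), dtype=np.uint8)
image[2:8, 5:35] = 1  # a 30-pixel-wide strip standing in for three characters
boxes = improved_equidistant_split(image, 3)
print(boxes)  # [(5, 2, 10, 6), (15, 2, 10, 6), (25, 2, 10, 6)]
```

Compared with dividing the full picture width into N parts, anchoring the grid to the text's bounding box keeps empty margins from shifting the cut positions.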

For text verification codes whose edges are clear after processing but whose characters are distorted, conventional segmentation algorithms are not applicable, and the invention preferably segments the characters with a contour detection algorithm;

further preferably, the contour detection algorithm scans the pixels of the whole picture, finds the starting point of the outer boundary and the starting point of the hole boundary of each character, numbers the boundary points, and finally connects the outer boundaries through a contour drawing function to obtain the final segmentation result.
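The boundary-following procedure described above resembles what OpenCV's `findContours` provides. To stay dependency-free, the sketch below swaps in a simpler stand-in technique, 8-connected component labeling by breadth-first search, which yields one bounding box per character when characters do not touch. It is not the contour-tracing algorithm itself, and the function name is hypothetical.

```python
from collections import deque
import numpy as np

def character_boxes(binary):
    """Bounding boxes (x, y, width, height) of 8-connected foreground blobs."""
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not binary[sy, sx] or seen[sy, sx]:
                continue
            # Flood-fill one blob, tracking its extent.
            queue = deque([(sy, sx)])
            seen[sy, sx] = True
            y0 = y1 = sy
            x0 = x1 = sx
            while queue:
                y, x = queue.popleft()
                y0, y1 = min(y0, y), max(y1, y)
                x0, x1 = min(x0, x), max(x1, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            queue.append((ny, nx))
            boxes.append((x0, y0, x1 - x0 + 1, y1 - y0 + 1))
    return sorted(boxes)  # left to right

image = np.zeros((8, 20), dtype=np.uint8)
image[2:6, 1:5] = 1      # a solid character
image[1:7, 10:16] = 1    # a character with a hole, like "O"
image[3:5, 12:14] = 0
boxes = character_boxes(image)
print(boxes)  # [(1, 2, 4, 4), (10, 1, 6, 6)]
```

The hole inside the second blob is background, so it does not produce a box of its own; this mirrors keeping only the outer boundaries in the final segmentation result.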

Threshold segmentation is a region-based image segmentation technique, applicable to pictures where the target and background occupy different gray level ranges.

According to the invention, preferably, when the character sizes and spacing in the verification code picture are uneven, a threshold segmentation algorithm is used, with the following flow: first, the picture is binarized; then the pixel values are accumulated along the vertical axis of the picture, and the threshold is determined by peak-valley analysis.
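One way to read this flow is as a vertical projection: binarize, sum each pixel column, and cut at the valleys of the resulting profile. The sketch below follows that reading; dark text on a light background, the fixed binarization threshold of 128, and the `valley_level` parameter are illustrative assumptions, not values from the description.

```python
import numpy as np

def projection_split(gray, threshold=128, valley_level=0):
    """Return (start, end) column ranges of characters, end exclusive."""
    binary = gray < threshold        # True where a pixel is dark (text)
    profile = binary.sum(axis=0)     # number of text pixels in each column
    spans, start = [], None
    for x, count in enumerate(profile):
        if count > valley_level and start is None:
            start = x                # leaving a valley: a character begins
        elif count <= valley_level and start is not None:
            spans.append((start, x)) # entering a valley: the character ends
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans

gray = np.full((10, 20), 255, dtype=np.uint8)
gray[2:8, 2:6] = 0     # narrow character
gray[2:8, 9:16] = 0    # wider character after a 3-pixel gap
spans = projection_split(gray)
print(spans)  # [(2, 6), (9, 16)]
```

Because the cuts follow the valleys of the profile rather than a fixed grid, characters of different widths and spacings are separated correctly as long as they do not overlap horizontally.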

Text recognition module

Because the simplified verification code picture has had most of the security features that interfere with segmentation removed, the segmentation module achieves higher segmentation accuracy, and character recognition also becomes easier. The invention uses a simple convolutional neural network model as the final text recognition module; the specific model structure is shown in fig. 5.

According to the invention, preferably, the text recognition network is a convolutional neural network comprising a convolutional layer, a pooling layer, a dropout layer, and a fully connected layer;

further preferably, the convolutional neural network uses ReLU as the activation function and cross entropy as the loss function, with Adadelta as the optimizer. Because the model has few convolutional layers, it is not prone to overfitting and does not need a large amount of training data. In actual use, high recognition accuracy can be obtained by training on only 500 samples, which greatly reduces model training time, speeds up processing during recognition, and meets the requirements of engineering use.
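A network of this kind can be sketched in PyTorch as below. Only the layer types, the ReLU activation, the cross-entropy loss, and the Adadelta optimizer come from the description; the layer counts, channel widths, 28×28 grayscale input size, 36-class alphabet, and dropout rate are assumptions chosen for illustration, since fig. 5 is not reproduced here.

```python
import torch
from torch import nn

class CharCNN(nn.Module):
    """Small single-character classifier: conv, pooling, dropout, fully connected."""

    def __init__(self, num_classes=36):  # assumed alphabet: 0-9 and a-z
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 14x14 -> 7x7
            nn.Dropout(0.25),
        )
        self.classifier = nn.Linear(64 * 7 * 7, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = CharCNN()
optimizer = torch.optim.Adadelta(model.parameters())
loss_fn = nn.CrossEntropyLoss()

# One training step on random stand-in data (8 segmented characters).
images = torch.rand(8, 1, 28, 28)
labels = torch.randint(0, 36, (8,))
logits = model(images)
loss = loss_fn(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(logits.shape, loss.item())
```

In practice the random tensors would be replaced by the segmented character crops and their labels, and the step would be repeated over the labeled samples for several epochs.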

Matters not described in detail in the invention follow the prior art.

The beneficial effects of the invention are as follows:

1. The method quickly and effectively recognizes text verification codes with background noise, character distortion, and blurred edges, generalizes and ports well, can easily be embedded into a crawler, and solves the verification code problem encountered during data acquisition.

2. The method has high recognition accuracy for text verification codes with complex backgrounds, distorted characters, and blurred edges.

3. The invention achieves good recognition results with only a small amount of labeled data, reducing manual intervention.

4. The invention generalizes and ports well and is suitable for different types of text verification codes; the model trains quickly and processes quickly, meeting engineering requirements. It can be used with any web crawler algorithm and applied to any website or software that requires automatic verification code recognition, so it has wide application prospects.

Drawings

FIG. 1 is a flow chart of a complex verification code identification solution based on a small number of samples according to the present invention.

FIG. 2 is a diagram of the overall structure of the verification code denoising network according to the present invention.

FIG. 3 is a diagram of the cycle consistency loss according to the present invention.

FIG. 4 shows the results of the character segmentation algorithms, wherein: (a) the equidistant segmentation algorithm; (b) the improved segmentation algorithm.

Fig. 5 is a network structure diagram of the text recognition module according to the present invention.

Detailed Description

The invention will now be further illustrated by, but is not limited to, the following specific examples in connection with the accompanying drawings.

Example 1

The verification code denoising module removes the complex security features of the real verification code using a cycle-consistent generative adversarial network, and at the same time makes the character edges clear:

First, a verification code generator is used to generate, in batches and through parameter adjustment, pictures whose format is similar to that of the real verification code; these pictures are combined in pairs with the real verification codes to form the training set that serves as the input of the denoising network. The cycle-consistent generative adversarial network (CycleGAN) consists of two generators and two discriminators and has a dual structure as a whole. Its core goal is to convert a verification code picture with complex security features into a simple verification code with those features removed, thereby reducing the difficulty of character segmentation and recognition. During model training, as shown in fig. 2, a real input image is first taken from domain A and converted by the first generator A→B into a simple verification code picture in target domain B; this picture is then fed to the second generator B→A, which converts it back to the original complex picture. In addition, two discriminators are used to judge whether an input picture is a real input picture or a fake picture produced by a generator.
The optimization objective of the denoising cycle-consistent generative adversarial network includes two different types of loss function, namely an adversarial loss and a cycle consistency loss; the adversarial loss matches the pixel distribution of the generated pictures to the pixel distribution of the pictures in the target domain; the cycle consistency loss keeps the image obtained after conversion as similar as possible to the image in the source domain. The real verification codes and the generated verification codes are taken as domain X and domain Y respectively, and two style converters convert between domain X and domain Y in both directions. The optimization process is as follows: (1) first, a convolutional neural network extracts features from the input picture to obtain a feature vector; (2) then a ResNet module converts the feature vector of the picture in domain X into a feature vector in domain Y, preserving the features of the original image during conversion; (3) finally, decoding is performed by deconvolution, restoring the converted image from the feature vector. The discriminator consists of a multi-layer convolutional neural network; it takes a picture as input and tries to judge whether the picture is a real picture from the original domain or a fake picture produced by conversion, and its final layer outputs the predicted probability that the input is a real picture. The algorithm flow is shown in fig. 3.
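The three-stage style converter of steps (1) to (3) can be sketched in PyTorch as an encoder, a stack of residual blocks, and a transposed-convolution decoder. The channel width, the number of residual blocks, the single-channel 32×96 input, and the sigmoid output are assumptions made for illustration; the description does not specify them.

```python
import torch
from torch import nn

class ResBlock(nn.Module):
    """Residual block: the skip connection helps preserve original image features."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    """Style converter: (1) convolutional encoder, (2) ResNet transform, (3) deconvolution."""

    def __init__(self, channels=32, n_blocks=3):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(1, channels, 4, stride=2, padding=1), nn.ReLU())
        self.transform = nn.Sequential(
            *[ResBlock(channels) for _ in range(n_blocks)])
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(channels, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.decode(self.transform(self.encode(x)))

generator_xy = Generator()                # hypothetical X -> Y converter
batch = torch.rand(2, 1, 32, 96)          # stand-in verification code pictures
converted = generator_xy(batch)
print(converted.shape)                    # same size as the input
```

A second instance of the same class would serve as the Y→X converter, and the two would be trained jointly with the two discriminators.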

The character segmentation module uses an image processing algorithm to segment the overall captcha picture into individual characters:

The image processing algorithms comprise contour detection, an improved equidistant segmentation algorithm, and a threshold segmentation algorithm. In the improved equidistant segmentation algorithm, the starting position of segmentation is the upper-left pixel of the first character, the segmentation width is the approximate width of each character, and the segmentation height is the approximate height of each character; the segmentation effect is shown in fig. 4 (b), where the black frames represent the segmentation result. For text verification codes whose edges are clear after processing but whose characters are distorted, conventional segmentation algorithms are not applicable, and the invention preferably segments the characters with a contour detection algorithm: the algorithm scans the pixels of the whole picture, finds the starting point of the outer boundary and the starting point of the hole boundary of each character, numbers the boundary points, and finally connects the outer boundaries through a contour drawing function to obtain the final segmentation result. When the character sizes and spacing in the verification code picture are uneven, the invention uses a threshold segmentation algorithm, with the following flow: first, the picture is binarized; then the pixel values are accumulated along the vertical axis of the picture, and the threshold is determined by peak-valley analysis.

The segmented characters are sent to a text recognition network to obtain final output:

The text recognition network is a convolutional neural network comprising a convolutional layer, a pooling layer, a dropout layer, and a fully connected layer; the convolutional neural network uses ReLU as the activation function and cross entropy as the loss function, with Adadelta as the optimizer.

The overall solution of the invention is shown in fig. 1. The verification code on the left side of fig. 1 comes from Wikipedia; for a neural network, its disordered pixel distribution makes it hard to segment and makes high recognition accuracy hard to reach. The invention denoises the real verification code with the cycle-consistent generative adversarial network so that its edges become clear, which facilitates further recognition. The whole verification code picture is then segmented into single characters using the corresponding image processing algorithm. Finally, the segmented characters are fed into the text recognition network to obtain the final output. Because the text recognition model designed here has few convolutional layers, it is not prone to overfitting and does not need a large amount of training data. In actual use, high recognition accuracy can be obtained by training on only 500 samples, which greatly reduces model training time, speeds up processing during recognition, and meets the requirements of engineering use.

In particular, for text verification code types with large distortion and rotation, the text recognition network first rectifies the characters using spatial transformer layers, so that the model is spatially invariant.
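The resampling step at the core of a spatial transformer layer can be sketched with PyTorch's `affine_grid` and `grid_sample`. In a full spatial transformer the 2×3 affine matrix `theta` is predicted per image by a small localization network; here it is fixed by hand, which is an assumption made to keep the example short, and the helper name is hypothetical.

```python
import math
import torch
import torch.nn.functional as F

def apply_affine(images, theta):
    """Resample images (N, C, H, W) under per-image 2x3 affine matrices theta (N, 2, 3)."""
    grid = F.affine_grid(theta, list(images.shape), align_corners=False)
    return F.grid_sample(images, grid, align_corners=False)

images = torch.rand(2, 1, 28, 28)  # stand-in segmented characters

# Identity transform: the output reproduces the input.
identity = torch.tensor([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]]).repeat(2, 1, 1)
unchanged = apply_affine(images, identity)

# A 15-degree rotation, as a localization network might predict for a tilted character.
a = math.radians(15.0)
rotation = torch.tensor([[math.cos(a), -math.sin(a), 0.0],
                         [math.sin(a),  math.cos(a), 0.0]]).repeat(2, 1, 1)
rotated = apply_affine(images, rotation)
print(unchanged.shape, rotated.shape)
```

Because both operations are differentiable, the localization network that produces `theta` can be trained end to end with the recognition loss, without separate rotation labels.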

Claims

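1. A method for automatically recognizing text verification codes with complex backgrounds, comprising the following steps:

a verification code denoising module removes the complex security features of the real verification code using a cycle-consistent generative adversarial network;

a character segmentation module uses an image processing algorithm to segment the whole verification code picture into single characters;

the segmented characters are fed into a text recognition network to obtain the final output;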

when denoising the real verification code with the cycle-consistent generative adversarial network, a verification code generator is first used to generate, in batches and through parameter adjustment, pictures whose format is similar to that of the real verification code, and these pictures are combined in pairs with the real verification codes to form the training set that serves as the input of the denoising network; the cycle-consistent generative adversarial network consists of two generators and two discriminators and has a dual structure as a whole;

the image processing algorithm uses contour detection, an improved equidistant segmentation algorithm, and a threshold segmentation algorithm, wherein in the improved equidistant segmentation algorithm the starting position of segmentation is the upper-left pixel of the first character, the segmentation width is the approximate width of each character, and the segmentation height is the approximate height of each character;

the optimization objective of the denoising cycle-consistent generative adversarial network includes two different types of loss function, namely an adversarial loss and a cycle consistency loss; the adversarial loss matches the pixel distribution of the generated pictures to the pixel distribution of the pictures in the target domain; the cycle consistency loss keeps the image obtained after conversion as similar as possible to the image in the source domain;

the real verification codes and the generated verification codes are taken as domain X and domain Y respectively, and two style converters convert between domain X and domain Y in both directions; the optimization process is as follows: (1) first, a convolutional neural network extracts features from the input picture to obtain a feature vector; (2) then a ResNet module converts the feature vector of the picture in domain X into a feature vector in domain Y, preserving the features of the original image during conversion; (3) finally, the decoding stage restores the converted image from the feature vector by deconvolution;

for text verification codes whose edges are clear after processing but whose characters are distorted, the characters are segmented with a contour detection algorithm;

the contour detection algorithm scans the pixels of the whole picture, finds the starting point of the outer boundary and the starting point of the hole boundary of each character, numbers the boundary points, and finally connects the outer boundaries through a contour drawing function to obtain the final segmentation result;

when the character sizes and spacing in the verification code picture are uneven, a threshold segmentation algorithm is used, with the following flow: first, the picture is binarized; then the pixel values are accumulated along the vertical axis of the picture, and the threshold is determined by peak-valley analysis.

2. The method for automatically recognizing text verification codes with complex backgrounds according to claim 1, wherein the text recognition network is a convolutional neural network comprising a convolutional layer, a pooling layer, a dropout layer, and a fully connected layer.

3. The method for automatically recognizing text verification codes with complex backgrounds according to claim 2, wherein the convolutional neural network uses ReLU as the activation function and cross entropy as the loss function, with Adadelta as the optimizer.

4. The method for automatically recognizing text verification codes with complex backgrounds according to claim 1, wherein for text verification code types with large distortion and rotation, the text recognition network first rectifies the characters using spatial transformer layers so that the model is spatially invariant.