CN111652233A

CN111652233A - Text verification code automatic identification method for complex background

Info

Publication number: CN111652233A
Application number: CN202010495757.8A
Authority: CN
Inventors: 王瑶; 王佰玲; 魏玉良; 张茗晋; 辛国栋; 王巍
Original assignee: Weihai Tianzhiwei Network Space Safety Technology Co ltd; Harbin Institute of Technology Weihai
Current assignee: Weihai Tianzhiwei Network Space Safety Technology Co ltd; Harbin Institute of Technology Weihai
Priority date: 2020-06-03
Filing date: 2020-06-03
Publication date: 2020-09-11
Anticipated expiration: 2040-06-03
Also published as: CN111652233B

Abstract

The invention relates to a text verification code automatic identification method aiming at a complex background, comprising the following steps: a verification code denoising module removes the complex security features of a real verification code by means of a cycle-consistent generative adversarial network (Cycle GAN); a character segmentation module segments the whole verification code picture into single characters using an image processing algorithm; and the segmented characters are sent to a text recognition network to obtain the final output. The method can quickly and effectively recognize text verification codes with background noise, character distortion and blurred edges, generalizes and ports well, can be easily embedded into a crawler algorithm, and solves the verification code problem in the data acquisition process.

Description

Text verification code automatic identification method for complex background

Technical Field

The invention relates to a text verification code automatic identification method aiming at a complex background, belonging to the technical field of verification code identification.

Background

In the big data era, data sources are a prerequisite for big data analysis and data mining, and manually searching the Internet for useful data is time-consuming and labor-intensive. Crawler technology can automatically collect content of interest from the Internet and use it as a data source for deeper analysis. Verification codes are a measure for blocking automated programs and are a main obstacle in the crawling process. Character-type verification codes are still widely used on the Internet, so a fully automatic, end-to-end recognition method for this type of verification code is particularly important.

Existing automatic verification code recognition algorithms generally fall into three categories: attack algorithms for a specific type of verification code, algorithms based on character segmentation, and methods based on deep learning. An attack algorithm for a specific type can only recognize a single type of verification code picture (such as the Microsoft verification code) and cannot be generalized to other types, so it is difficult to apply in engineering practice. Algorithms based on character segmentation generally preprocess the verification code picture with traditional image processing (such as graying and binarization); because traditional image processing is limited and cannot effectively remove background interference, character segmentation becomes difficult and recognition accuracy is low. In recent years, with the development of deep learning, verification code recognition based on neural network models has achieved good results, but two main problems remain. First, most existing deep-learning-based methods use supervised learning and require a large amount of labeled data for training (generally no fewer than 50,000 labeled samples), which is time-consuming and labor-intensive; when labeled samples are insufficient, overfitting occurs easily, the model fails to converge, and accuracy is low. Second, existing methods recognize regular, lightly noisy text verification codes with high accuracy but cannot recognize text verification codes with complex security features well.

In addition, Chinese patent document CN107967475A discloses a verification code recognition method based on window sliding and a convolutional neural network. First, a small number of verification code pictures are collected; after noise reduction, the character set of the verification code to be recognized is cut out; each character in the set is rotated and distorted and background noise is added; and a convolutional neural network is then trained on the character set to obtain a single-character classifier. Finally, the verification code picture to be recognized is preprocessed and segmented into connected domains, a window is slid over each connected domain, and the previously trained single-character classifier classifies each window to obtain the final recognition result. Chinese patent document CN110555298A discloses a verification code recognition apparatus and a computing device, whose verification code recognition model training method includes: acquiring verification code image samples with the same verification code length and determining the character sample label corresponding to each sample; determining the verification code characters that form the character sample label and the attribute values of those characters, and acquiring the character type information of the characters; encoding the character sample label according to the character type information and the attribute values to obtain an encoded sample label; and training a verification code recognition model for recognizing verification code images using the image samples and the encoded sample labels.
However, the methods mentioned in the above two patent documents both adopt the traditional image processing algorithm to preprocess the verification code picture, and this method is only suitable for the case without significant noise, but cannot effectively remove noise interference for the verification code type with complex security features, thus seriously affecting the accuracy of character segmentation and recognition.

Disclosure of Invention

The invention addresses the problems of existing verification code recognition technology, in particular that text verification codes with complex security features cannot be denoised well and that distorted and deformed text verification codes are recognized poorly when only a small number of labels are available. The invention provides a text verification code automatic identification method aiming at a complex background. The method requires few labeled samples, has a short processing time and a high recognition accuracy, solves the problems that existing algorithms need a large amount of manual labeling and recognize characters with complex backgrounds and distortion poorly, and has wide application prospects. The method combines a verification code denoising module, a character segmentation module and a verification code recognition module into a whole and realizes end-to-end automatic recognition of text verification codes. A high recognition accuracy can be obtained with only a small number of labeled samples (500 pictures), and noisy and distorted verification codes are also recognized well. The method generalizes well and can be applied to different types of text verification codes without changing the model structure. The model can also be easily embedded into a crawler algorithm, quickly and efficiently solving the text verification code anti-crawler problem that enterprises and individuals encounter when acquiring data.

The technical scheme of the invention is as follows:

a text verification code automatic identification method aiming at a complex background comprises the following steps:

the verification code denoising module removes complex security features of the real verification code through a cycle-consistent generative adversarial network (Cycle GAN);

the character segmentation module segments the whole verification code picture into single characters by using an image processing algorithm;

and sending the segmented characters into a text recognition network to obtain final output.

According to the invention, preferably, for text verification code types with large distortion and rotation amplitudes, the text recognition network first uses spatial transformer layers (Spatial Transformer Layers) to correct the characters, so that the model has spatial invariance.

The invention relates to the automatic recognition of text verification codes with complex security features (such as background noise, edge blurring and character distortion) and is a verification code recognition method based on a small number of training samples. It comprises three parts: a verification code denoising module, a character segmentation module and a text recognition module. The overall solution is shown in FIG. 1. The verification code shown on the left of FIG. 1 comes from Wikipedia; this type of verification code has blurred edges, noise and distorted text. Although the edge noise has little effect on recognition by the human eye, for a neural network the disordered pixel distribution makes segmentation difficult and a high recognition accuracy hard to obtain. The invention therefore first denoises the real verification code with a cycle-consistent generative adversarial network (Cycle GAN) so that its edges become clear, which benefits further recognition. The whole verification code picture is then segmented into individual characters using image processing algorithms. Finally, the segmented characters are sent to a text recognition network to obtain the final output; in particular, for text verification code types with large distortion and rotation amplitudes, the network first uses spatial transformer layers (Spatial Transformer Layers) to correct the characters, so that the model has spatial invariance.
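The three-stage flow described above can be sketched as a simple composition of functions. This is only an illustrative skeleton: the names `recognize_captcha`, `denoise`, `segment` and `classify` are hypothetical placeholders for the trained denoising generator, the segmentation routine and the character classifier, and the nested-list image representation is an assumption made to keep the sketch dependency-free.

```python
from typing import Callable, List, Sequence

# Assumed representation: a grayscale image as rows of pixel values.
Image = List[List[int]]


def recognize_captcha(
    image: Image,
    denoise: Callable[[Image], Image],
    segment: Callable[[Image], Sequence[Image]],
    classify: Callable[[Image], str],
) -> str:
    """Run the three stages in order and join the per-character predictions."""
    clean = denoise(image)        # stage 1: complex picture -> simple picture
    characters = segment(clean)   # stage 2: whole picture -> single characters
    # stage 3: each character crop is classified independently
    return "".join(classify(ch) for ch in characters)
```

Keeping the stages as interchangeable callables mirrors the text's point that the segmentation algorithm can be chosen per verification code type without touching the other modules.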

Verification code denoising module

According to the invention, preferably, in the process of denoising the real verification code with the cycle-consistent generative adversarial network, a verification code generator is first used, with adjusted parameters, to generate in batches pictures whose format is similar to that of the real verification code; these pictures and the real verification codes are paired to form a training set that serves as the input of the denoising network.

According to the invention, preferably, the cycle-consistent generative adversarial network (Cycle GAN) consists of two generators and two discriminators, forming a dual structure overall. Its core goal is to convert a verification code picture with complex security features into a simple verification code with those features removed, reducing the difficulty of character segmentation and recognition. As shown in FIG. 2, during model training a real input image is taken from domain A and converted by the first generator, Generator A→B, into a simple verification code picture in target domain B; this picture is then fed to the second generator, Generator B→A, which converts it back into the original complex picture. The two discriminators (Discriminators) determine whether an input picture is a real input picture or a simulated picture produced by a generator.

According to the invention, preferably, the optimization objective of the denoising cycle-consistent generative adversarial network includes two different types of loss functions, namely an adversarial loss (Adversarial Loss) and a cycle consistency loss (Cycle Consistency Loss); the adversarial loss is used to match the pixel distribution of the generated picture to that of the pictures in the target domain; the cycle consistency loss is used to keep the converted image as similar as possible to the image in the source domain;

further preferably, the real verification code and the generated verification code are taken as domain X and domain Y respectively, and two style converters are used to convert between domain X and domain Y; the optimization process is as follows: (1) first, a convolutional neural network extracts features from the input picture to obtain a feature vector; (2) a ResNet module then converts the feature vector of the picture in domain X into a feature vector in domain Y, preserving the features of the original image during the conversion; (3) finally, the decoding process restores the converted image from the feature vector through deconvolution operations. The discriminator consists of a multi-layer convolutional neural network; it takes a picture as input and tries to determine whether it is a real picture from the original domain or a fake picture generated by conversion, and its last layer outputs the predicted probability that the input picture is real. The algorithm flow is shown in FIG. 3. Unlike the one-way conversion of a conventional generative adversarial network, the invention uses two style converters to convert between domain X and domain Y in both directions.
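The two loss terms named above can be illustrated numerically. The text does not give their exact form, so the following sketch assumes the forms commonly used with Cycle GAN: a mean absolute (L1) difference for the cycle consistency loss and a log loss on the discriminator's output for the adversarial loss. The function names and the weight `lam=10.0` are assumptions for illustration, not values taken from the document.

```python
import math
from typing import List

# Assumed representation: pixel values scaled to [0, 1].
Image = List[List[float]]


def cycle_consistency_loss(original: Image, reconstructed: Image) -> float:
    """Mean absolute (L1) difference between x and its round-trip F(G(x))."""
    total, count = 0.0, 0
    for row_o, row_r in zip(original, reconstructed):
        for o, r in zip(row_o, row_r):
            total += abs(o - r)
            count += 1
    return total / count


def adversarial_loss(d_on_generated: List[float], eps: float = 1e-12) -> float:
    """Generator-side log loss: small when the discriminator rates fakes near 1."""
    return -sum(math.log(p + eps) for p in d_on_generated) / len(d_on_generated)


def generator_objective(adv_xy: float, adv_yx: float,
                        cyc_x: float, cyc_y: float, lam: float = 10.0) -> float:
    """Both adversarial terms plus the weighted cycle terms for both directions."""
    return adv_xy + adv_yx + lam * (cyc_x + cyc_y)
```

The cycle term is what ties the simplified picture back to the original characters: a generator that produced a clean but unrelated picture would satisfy the discriminator yet incur a large reconstruction error.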

The trained verification code denoising network can effectively identify complex security features that may interfere with character segmentation and recognition (including background noise, interference lines, character colors, blurring, character distortion and deformation, and small character spacing) and remove them well, turning the input into a simple verification code picture. This effectively reduces the difficulty of character segmentation and recognition, so a higher recognition accuracy can be achieved with only a small number of labeled samples. The denoising network is also general: it can be applied to different types of verification code pictures without changing the model structure, which greatly reduces manual intervention.

Character segmentation module

After passing through the verification code denoising network, the original verification code with complex security features is converted into a simple verification code and input into the character segmentation module. Depending on the characteristics of the verification code type, the character string in the picture is divided into single characters by methods such as contour detection, equidistant segmentation and threshold segmentation. Equidistant segmentation is a traditional image processing algorithm that divides the picture into N equal parts by pixel width, but it has a problem: as shown in FIG. 4(a), it cannot separate the characters of the verification code well, and two characters may fall into one box. The invention therefore improves this segmentation method. The starting position of the segmentation is moved from (0, 0) to the top-left pixel of the first character, the segmentation width is set to the approximate width of each character, and the segmentation height is set to the approximate height of each character. The segmentation result is shown in FIG. 4(b), where the black borders represent the segmentation result.
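A minimal sketch of the improved equidistant segmentation follows, assuming a binarized picture stored as nested lists with 1 for character pixels. The function names are hypothetical, and the "top-left pixel of the first character" is approximated here by the leftmost column and topmost row that contain any character pixel, which is a simplification.

```python
from typing import List, Tuple

# Assumed representation: binarized image, 1 = character pixel, 0 = background.
Image = List[List[int]]
# (left, top, right, bottom) with right and bottom exclusive.
Box = Tuple[int, int, int, int]


def improved_equidistant_boxes(img: Image, n_chars: int,
                               char_w: int, char_h: int) -> List[Box]:
    """Cut n_chars boxes of a fixed size, starting at the first character."""
    height, width = len(img), len(img[0])
    ink_cols = [x for x in range(width) if any(img[y][x] for y in range(height))]
    ink_rows = [y for y in range(height) if any(img[y])]
    if not ink_cols:
        return []  # blank picture: nothing to segment
    x0, y0 = ink_cols[0], ink_rows[0]  # start at the ink, not at (0, 0)
    boxes = []
    for i in range(n_chars):
        left = x0 + i * char_w
        # clamp the box to the picture so crops never run off the edge
        boxes.append((left, y0, min(left + char_w, width), min(y0 + char_h, height)))
    return boxes


def crop(img: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    left, top, right, bottom = box
    return [row[left:right] for row in img[top:bottom]]
```

Compared with dividing the full picture width by N, starting at the first character pixel removes the empty margin from the calculation, which is why adjacent characters are less likely to share a box.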

According to the invention, preferably, the image processing algorithm uses contour detection, an improved equidistant segmentation algorithm and a threshold segmentation algorithm; in the improved equidistant segmentation algorithm, the starting position of the segmentation is the top-left pixel of the first character, the segmentation width is the approximate width of each character, and the segmentation height is the approximate height of each character.

For text verification codes whose edges are clear after processing but whose characters are distorted, the traditional segmentation algorithm is not applicable, and the invention preferably uses a contour detection algorithm to segment the characters;

further preferably, the contour detection algorithm scans the pixel points of the whole picture, finds the starting point of the outer boundary of each character and the starting point of the hole boundary, numbers the boundary points, and finally connects the outer boundaries through a contour drawing function to obtain a final segmentation result.
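The border-following procedure described above (scanning for outer and hole boundary starting points and numbering them) is more involved than fits in a short example. As a simpler stand-in that yields the same kind of per-character result, the sketch below labels connected components by flood fill and returns one bounding box per blob. It is explicitly not the contour-following algorithm itself; the function name and the `min_pixels` filter are assumptions.

```python
from collections import deque
from typing import List, Tuple

# Assumed representation: binarized image, 1 = character pixel, 0 = background.
Image = List[List[int]]
# (left, top, right, bottom) with right and bottom exclusive.
Box = Tuple[int, int, int, int]


def character_boxes(img: Image, min_pixels: int = 1) -> List[Box]:
    """Bounding boxes of 8-connected blobs, sorted left to right."""
    height, width = len(img), len(img[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not img[y][x] or seen[y][x]:
                continue
            # breadth-first flood fill from an unvisited character pixel
            queue = deque([(x, y)])
            seen[y][x] = True
            left, top, right, bottom, size = x, y, x, y, 0
            while queue:
                cx, cy = queue.popleft()
                size += 1
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            if size >= min_pixels:  # drop specks smaller than the filter
                boxes.append((left, top, right + 1, bottom + 1))
    return sorted(boxes)
```

A character with a hole, such as "O", is still a single blob under this approach, so it produces one box. Characters that touch each other would merge, which is one reason the text offers several segmentation methods.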

Threshold segmentation is a region-based image segmentation technique, suitable for pictures in which the target and the background occupy different gray-level ranges.

According to the invention, preferably, when the characters in the verification code picture differ in size and spacing, a threshold segmentation algorithm is used, with the following flow: first, the picture is binarized; then the pixel values are accumulated along the vertical direction of the picture (a vertical projection), and the threshold is determined by peak-to-valley analysis.
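The flow above can be sketched as follows, reading "accumulated along the vertical direction" as a per-column sum of character pixels. The sketch assumes dark characters on a light background. The gray threshold of 128 is a hypothetical default, and the `valley` parameter stands in for the value that peak-to-valley analysis would select.

```python
from typing import List, Tuple


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Dark pixels (below threshold) become 1 (character), the rest 0."""
    return [[1 if p < threshold else 0 for p in row] for row in gray]


def column_projection(binary: List[List[int]]) -> List[int]:
    """Number of character pixels in each column (the vertical projection)."""
    return [sum(col) for col in zip(*binary)]


def split_at_valleys(projection: List[int], valley: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) column spans whose projection stays above valley."""
    spans, start = [], None
    for x, value in enumerate(projection):
        if value > valley and start is None:
            start = x                  # entering a character
        elif value <= valley and start is not None:
            spans.append((start, x))   # leaving a character at a valley
            start = None
    if start is not None:
        spans.append((start, len(projection)))
    return spans
```

Because the cut points come from the data rather than from a fixed width, characters of different widths and spacings each get their own span.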

Text recognition module

Because most of the security features that interfere with segmentation have been removed from the simplified verification code image, the segmentation module achieves a higher segmentation accuracy and character recognition also becomes easier. The invention designs and uses a simple convolutional neural network model as the final text recognition module; the specific model structure is shown in FIG. 5.

According to the invention, preferably, the text recognition network is a convolutional neural network comprising convolutional layers, pooling layers, dropout layers and fully connected layers;

further preferably, the convolutional neural network uses ReLU as the activation function and cross entropy as the loss function, and Adadelta is selected as the optimizer. Because the model has few convolutional layers, it is not prone to overfitting and does not need a large amount of training data. In actual use, training on only 500 samples yields a high recognition accuracy, which greatly shortens model training time, speeds up processing during recognition, and meets the requirements of engineering use.
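The three training ingredients named above (ReLU, cross entropy and Adadelta) can be written out in a few lines to show what each computes. This is a dependency-free illustration rather than the network of FIG. 5: the layer structure is omitted, and the hyperparameters `rho=0.95` and `eps=1e-6` are commonly used values assumed here, not taken from the document.

```python
import math
from typing import List


def relu(values: List[float]) -> List[float]:
    """ReLU activation: negative inputs are clipped to zero."""
    return [max(0.0, v) for v in values]


def softmax(logits: List[float]) -> List[float]:
    """Turn class scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], target_index: int, eps: float = 1e-12) -> float:
    """Negative log-probability assigned to the correct character class."""
    return -math.log(probs[target_index] + eps)


class Adadelta:
    """Adadelta update rule: per-parameter step sizes from running averages."""

    def __init__(self, size: int, rho: float = 0.95, eps: float = 1e-6):
        self.rho, self.eps = rho, eps
        self.avg_sq_grad = [0.0] * size
        self.avg_sq_step = [0.0] * size

    def step(self, params: List[float], grads: List[float]) -> List[float]:
        updated = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.avg_sq_grad[i] = self.rho * self.avg_sq_grad[i] + (1 - self.rho) * g * g
            delta = -(math.sqrt(self.avg_sq_step[i] + self.eps)
                      / math.sqrt(self.avg_sq_grad[i] + self.eps)) * g
            self.avg_sq_step[i] = self.rho * self.avg_sq_step[i] + (1 - self.rho) * delta * delta
            updated.append(p + delta)
        return updated
```

Adadelta needs no hand-tuned learning rate, which fits the text's emphasis on reducing manual intervention when the model is retrained for a new verification code type.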

Matters not described in detail in the invention follow the prior art.

The invention has the following beneficial effects:

1. The automatic text verification code identification method can quickly and effectively recognize text verification codes with background noise, character distortion and blurred edges, generalizes and ports well, can be easily embedded into a crawler algorithm, and solves the verification code problem in the data acquisition process.

2. The method recognizes text verification codes with complex backgrounds, distorted characters and blurred edges with high accuracy.

3. The invention can achieve better recognition effect only by a small amount of data marking, and reduces manual intervention.

4. The method generalizes and ports well and is suitable for different types of text verification codes; the model training time is short and the processing speed is high, meeting engineering requirements. The method can be embedded into any web crawler algorithm and applied to any website or software that needs automatic verification code recognition, and has wide application prospects.

Drawings

FIG. 1 is a flow chart of the complex verification code recognition solution based on a small number of samples according to the present invention.

FIG. 2 is a diagram of the entire structure of the verification code denoising network according to the present invention.

FIG. 3 is a schematic diagram of the cycle consistency loss according to the present invention.

FIG. 4 is a diagram illustrating character segmentation results, wherein: (a) the conventional equidistant segmentation algorithm; (b) the improved segmentation algorithm.

FIG. 5 is a diagram of a text recognition module network architecture according to the present invention.

Detailed Description

The present invention is further described below with reference to the embodiments and the accompanying drawings, but is not limited thereto.

Example 1

the verification code denoising module removes the complex security features of the real verification code through a cycle-consistent generative adversarial network (Cycle GAN) and at the same time makes the character edges clear:

First, a verification code generator is used, with adjusted parameters, to generate in batches pictures whose format is similar to that of the real verification code, and these pictures and the real verification codes are paired to form a training set that serves as the input of the denoising network. The cycle-consistent generative adversarial network (Cycle GAN) consists of two generators and two discriminators, forming a dual structure overall. Its core goal is to convert a verification code picture with complex security features into a simple verification code with those features removed, reducing the difficulty of character segmentation and recognition. As shown in FIG. 2, during model training a real input image is taken from domain A and converted by the first generator, Generator A→B, into a simple verification code picture in target domain B; this picture is then fed to the second generator, Generator B→A, which converts it back into the original complex picture. The two discriminators (Discriminators) determine whether an input picture is a real input picture or a simulated picture produced by a generator.
The optimization objective of the denoising cycle-consistent generative adversarial network includes two different types of loss functions, namely an adversarial loss (Adversarial Loss) and a cycle consistency loss (Cycle Consistency Loss); the adversarial loss is used to match the pixel distribution of the generated picture to that of the pictures in the target domain; the cycle consistency loss is used to keep the converted image as similar as possible to the image in the source domain. The real verification code and the generated verification code are taken as domain X and domain Y respectively, and two style converters are used to convert between domain X and domain Y. The optimization process is as follows: (1) first, a convolutional neural network extracts features from the input picture to obtain a feature vector; (2) a ResNet module then converts the feature vector of the picture in domain X into a feature vector in domain Y, preserving the features of the original image during the conversion; (3) finally, decoding is carried out through deconvolution operations, restoring the converted image from the feature vector. The discriminator consists of a multi-layer convolutional neural network; it takes a picture as input and tries to determine whether it is a real picture from the original domain or a fake picture generated by conversion, and its last layer outputs the predicted probability that the input picture is real. The algorithm flow is shown in FIG. 3.

The character segmentation module segments the overall verification code picture into single characters using an image processing algorithm:

The image processing algorithm comprises contour detection, an improved equidistant segmentation algorithm and a threshold segmentation algorithm. In the improved equidistant segmentation algorithm, the starting position of the segmentation is the top-left pixel of the first character, the segmentation width is the approximate width of each character, and the segmentation height is the approximate height of each character; the segmentation result is shown in FIG. 4(b), where the black borders represent the segmentation result. For text verification codes whose edges are clear after processing but whose characters are distorted, the traditional segmentation algorithm is not applicable, and the invention preferably uses a contour detection algorithm to segment the characters; the contour detection algorithm scans the pixels of the whole picture, finds the starting point of the outer boundary and of the hole boundary of each character, numbers the boundary points, and finally connects the outer boundaries through a contour drawing function to obtain the final segmentation result. When the characters in the verification code picture differ in size and spacing, the invention uses a threshold segmentation algorithm with the following flow: first, the picture is binarized; then the pixel values are accumulated along the vertical direction of the picture (a vertical projection), and the threshold is determined by peak-to-valley analysis.

Sending the segmented characters into a text recognition network to obtain final output:

The text recognition network is a convolutional neural network comprising convolutional layers, pooling layers, dropout layers and fully connected layers; the convolutional neural network uses ReLU as the activation function and cross entropy as the loss function, and Adadelta is selected as the optimizer.

The overall model solution of the present invention is shown in FIG. 1. The verification code shown on the left of FIG. 1 comes from Wikipedia; for a neural network, its disordered pixel distribution makes it difficult to segment and a high recognition accuracy hard to obtain. The invention denoises the real verification code with the cycle-consistent generative adversarial network so that its edges become clear, which benefits further recognition. The whole verification code picture is then segmented into individual characters using the corresponding image processing algorithm. Finally, the segmented characters are sent to the text recognition network to obtain the final output. The text recognition model designed in this patent has few convolutional layers and is not prone to overfitting, so it does not need a large amount of training data. In actual use, training on only 500 samples yields a high recognition accuracy, which greatly shortens model training time, speeds up processing during recognition, and meets the requirements of engineering use.

Specifically, for text verification code types with large distortion and rotation amplitudes, the text recognition network first uses spatial transformer layers (Spatial Transformer Layers) to correct the characters, so that the model has spatial invariance.
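The correction performed by a spatial transformer layer can be illustrated by its sampling step: a transformation maps each output pixel to a source location, and the output value is interpolated from the input. The sketch below assumes an affine transformation and bilinear interpolation, neither of which is specified in the text. The matrix `theta` is set by hand here, whereas in a trained layer it would be predicted by a small localization network; the function name `affine_sample` is hypothetical.

```python
import math
from typing import List, Sequence

Image = List[List[float]]


def affine_sample(img: Image, theta: Sequence[Sequence[float]]) -> Image:
    """Resample img through a 2x3 affine matrix in normalized [-1, 1] coordinates."""
    height, width = len(img), len(img[0])

    def pixel(x: int, y: int) -> float:
        # locations outside the picture contribute zero (background)
        return img[y][x] if 0 <= x < width and 0 <= y < height else 0.0

    out = []
    for j in range(height):
        yt = 2.0 * j / (height - 1) - 1.0 if height > 1 else 0.0
        row = []
        for i in range(width):
            xt = 2.0 * i / (width - 1) - 1.0 if width > 1 else 0.0
            # source location for this output pixel
            xs = theta[0][0] * xt + theta[0][1] * yt + theta[0][2]
            ys = theta[1][0] * xt + theta[1][1] * yt + theta[1][2]
            px = (xs + 1.0) * (width - 1) / 2.0
            py = (ys + 1.0) * (height - 1) / 2.0
            x0, y0 = math.floor(px), math.floor(py)
            fx, fy = px - x0, py - y0
            # bilinear interpolation over the four neighbouring pixels
            row.append(
                pixel(x0, y0) * (1 - fx) * (1 - fy)
                + pixel(x0 + 1, y0) * fx * (1 - fy)
                + pixel(x0, y0 + 1) * (1 - fx) * fy
                + pixel(x0 + 1, y0 + 1) * fx * fy
            )
        out.append(row)
    return out
```

With the identity matrix `[[1, 0, 0], [0, 1, 0]]` the picture is returned unchanged; a rotation matrix would undo a rotated character before it reaches the convolutional layers.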

Claims

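1. A text verification code automatic identification method aiming at a complex background, comprising the following steps:

the verification code denoising module removes complex security features of the real verification code through a cycle-consistent generative adversarial network (Cycle GAN);

the character segmentation module segments the whole verification code picture into single characters by using an image processing algorithm;

and sending the segmented characters into a text recognition network to obtain final output.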

2. The method for automatically identifying the text verification code aiming at the complex background as claimed in claim 1, wherein in the process of denoising the real verification code through the cycle-consistent generative adversarial network, a verification code generator is first used, with adjusted parameters, to generate in batches pictures whose format is similar to that of the real verification code, and these pictures and the real verification codes are paired to form a training set that serves as the input of the denoising network;

preferably, the Cycle GAN is composed of two generators and two discriminators, and the whole structure is dual.

3. The method according to claim 1, wherein the optimization objective of the denoising cycle-consistent generative adversarial network includes two different types of loss functions, namely an adversarial loss (Adversarial Loss) and a cycle consistency loss (Cycle Consistency Loss); wherein the adversarial loss is used to match the pixel distribution of the generated picture to that of the pictures in the target domain; and the cycle consistency loss is used to keep the converted image as similar as possible to the image in the source domain.

4. The method of claim 3, wherein the true captcha and the generated captcha are respectively a domain X and a domain Y, and two style converters are used to convert between the domain X and the domain Y;

the optimization process is as follows: (1) first, a convolutional neural network extracts features from the input picture to obtain a feature vector; (2) a ResNet module then converts the feature vector of the picture in domain X into a feature vector in domain Y, preserving the features of the original image during the conversion; (3) finally, the decoding process restores the converted image from the feature vector through deconvolution operations.

5. The method for automatically identifying a text verification code aiming at a complex background according to claim 1, wherein the image processing algorithm uses contour detection, an improved equidistant segmentation algorithm and a threshold segmentation algorithm; in the improved equidistant segmentation algorithm, the starting position of the segmentation is the top-left pixel of the first character, the segmentation width is the approximate width of each character, and the segmentation height is the approximate height of each character.

6. The method for automatically identifying the text verification code aiming at the complex background is characterized in that, for text verification codes whose edges are clear after processing but whose characters are distorted, a contour detection algorithm is used to perform character segmentation;

the contour detection algorithm scans the pixel points of the whole picture, finds the starting point of the outer boundary of each character and the starting point of the hole boundary, numbers the boundary points, and finally connects the outer boundaries through a contour drawing function to obtain a final segmentation result.

7. The method for automatically identifying the text verification code aiming at the complex background according to claim 5, characterized in that when the characters in the verification code picture differ in size and spacing, a threshold segmentation algorithm is used, with the following flow: first, the picture is binarized; then the pixel values are accumulated along the vertical direction of the picture (a vertical projection), and the threshold is determined by peak-to-valley analysis.

8. The method as claimed in claim 5, wherein the text recognition network is a convolutional neural network comprising convolutional layers, pooling layers, dropout layers and fully connected layers.

9. The method of claim 8, wherein the convolutional neural network uses ReLU as the activation function and cross entropy as the loss function, and Adadelta is selected as the optimizer.

10. The method of claim 1, wherein for text verification code types with large distortion and rotation amplitudes, the text recognition network first corrects the characters with spatial transformer layers (Spatial Transformer Layers), so that the model has spatial invariance.