CN110889402A - Business license content identification method and system based on deep learning

Info

Publication number: CN110889402A
Application number: CN201911067919.1A
Authority: CN
Inventors: 陈曦; 蓝志坚; 喻春霞; 李海燕
Original assignee: Guangzhou Feng Shi Technology Co Ltd
Current assignee: Guangzhou Feng Shi Technology Co Ltd
Priority date: 2019-11-04
Filing date: 2019-11-04
Publication date: 2020-03-17

Abstract

The invention discloses a business license content identification method and a system based on deep learning, wherein the method comprises the following steps: collecting images of business licenses for preprocessing, wherein the preprocessing comprises the following steps: graying, filtering and denoising, image binarization and tilt correction; constructing a text detection model, and respectively performing primary training and secondary training; constructing a text recognition model based on a convolutional neural network, and training the text recognition model by using a training sample generated randomly; inputting the preprocessed business license image into a text detection model obtained by secondary training, outputting a text line image, recognizing the text line image by using the trained text recognition model, and outputting text information; and performing semantic analysis on the text information, and connecting the contents of the same text line in series to obtain a final result of content identification of the business license. The invention reduces the amount of training samples, overcomes the defect of high difficulty in character cutting and improves the content recognition rate.

Description

Business license content identification method and system based on deep learning

Technical Field

The invention relates to the field of image content identification, in particular to a business license content identification method and system based on deep learning.

Background

Optical character recognition (OCR) is currently applied mainly to document recognition and certificate recognition. In certificate recognition, the original, scanned, or photocopied certificate is digitized into an image, and the certificate content is then recognized as text, which improves working efficiency and reduces workload. Conventional image-processing OCR relies on three key techniques: text region detection, character segmentation, and recognition. Text region detection, also called text detection, extracts the regions of an image that contain text. Text detection methods currently fall into two broad groups: layout analysis, which extracts target regions from the image using feature extraction, and deep learning, which recognizes and extracts the text regions of a document image automatically. Character segmentation splits the extracted text regions, line by line, into single characters. Recognition then identifies the segmented characters one by one.

The optical character recognition methods currently used in image processing mainly have the following problems:

Training samples are difficult to collect in large quantities

Deep-learning-based recognition of business license content cannot avoid training a deep neural network on a large number of samples. However, certificate images such as business licenses are difficult to collect in large numbers as training samples.

Character segmentation is difficult

Segmenting a line of text into single characters depends heavily on the language and script. For mixed-language text, such as a certificate address field containing Chinese characters, digits, English letters, and symbols, segmentation becomes much harder. Moreover, character segmentation at present essentially relies on the projection method.

Recognition accuracy for the credit code in the business license is low

OCR applications for certificates such as business licenses often impose strict requirements on recognition accuracy. The "unified social credit code" on the license is a major factor affecting the recognition rate, because it consists of closely spaced digits and letters, which easily leads to misrecognized and missed characters.

In summary, existing deep-learning-based business license content identification methods require a large number of training samples, struggle with character segmentation, and have a low recognition rate.

Disclosure of Invention

To overcome the drawbacks of prior-art deep-learning-based business license content identification methods, namely the large number of training samples required, the difficulty of character segmentation, and the low recognition rate, the invention provides a business license content identification method and system based on deep learning.

The primary objective of the present invention is to solve the above technical problems, and the technical solution of the present invention is as follows:

The invention provides a business license content identification method based on deep learning, which comprises the following steps:

S1: collecting images of business licenses;

S2: preprocessing the collected business license images, the preprocessing comprising graying, filtering and denoising, image binarization, and tilt correction;

S3: constructing a text detection model and performing preliminary training on it using an open-source text detection data set; constructing a pre-labeled business license image data set from a set proportion of the preprocessed images; and performing secondary training on the preliminarily trained detection model using the pre-labeled business license image data set;

S4: constructing a text recognition model based on a convolutional neural network, and training the text recognition model using randomly generated training samples to obtain a trained text recognition model;

S5: inputting the preprocessed business license image into the text detection model obtained by secondary training, outputting text line images, recognizing the text line images using the trained text recognition model, and outputting text information;

S6: performing semantic analysis on the text information recognized by the text recognition model, and concatenating the contents of the same text line to obtain the final result of business license content identification.
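The data flow of steps S1 to S6 can be sketched as a simple composition of stages. This is only an illustrative outline: the function name `recognize_license`, the `Image` type alias, and the four stage callables are hypothetical names introduced here, and the toy lambdas merely stand in for the trained models of S3 and S4.

```python
from typing import Callable, Dict, List, Sequence

Image = List[List[int]]  # hypothetical stand-in: a grayscale image as rows of pixel values


def recognize_license(
    image: Image,
    preprocess: Callable[[Image], Image],
    detect_lines: Callable[[Image], Sequence[Image]],
    recognize_line: Callable[[Image], str],
    integrate: Callable[[Sequence[str]], Dict[str, str]],
) -> Dict[str, str]:
    """Apply S2, S5 and S6 to one image collected in S1.

    The detector and recognizer are assumed to be already trained (S3, S4).
    """
    clean = preprocess(image)                               # S2: preprocessing
    line_images = detect_lines(clean)                       # S5: text line detection
    texts = [recognize_line(line) for line in line_images]  # S5: text line recognition
    return integrate(texts)                                 # S6: semantic integration


# Toy stand-ins so the sketch runs end to end; real stages would be trained models.
result = recognize_license(
    [[0, 255], [255, 0]],
    preprocess=lambda img: img,
    detect_lines=lambda img: [[row] for row in img],  # one "line image" per row
    recognize_line=lambda line: "".join("#" if p else "." for p in line[0]),
    integrate=lambda texts: {"lines": " | ".join(texts)},
)
```

Passing the stages in as callables keeps the sketch independent of any particular detection or recognition model.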

Further, the preprocessing specifically comprises:

graying the collected business license image using a weighted average method;

denoising the grayed image by median filtering;

binarizing the denoised image using a point-by-point method;

and correcting the tilt of the binarized image through perspective transformation.
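The first three preprocessing steps can be illustrated with a minimal pure-Python sketch on nested lists. Several details are assumptions rather than part of the text: the 0.299/0.587/0.114 channel weights are one commonly used choice for weighted-average graying, the filter is a plain 3x3 median filter, and the "point-by-point method" is interpreted here as comparing each pixel against a fixed threshold of 128. Tilt correction is not covered by this sketch.

```python
from statistics import median


def to_gray(rgb):
    """Weighted-average graying of an image given as rows of (r, g, b) tuples.

    The channel weights are an assumed, commonly used choice.
    """
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
            for row in rgb]


def median_filter(gray, k=3):
    """Replace each pixel by the median of its k x k neighbourhood (clipped at borders)."""
    h, w = len(gray), len(gray[0])
    r = k // 2
    out = [row[:] for row in gray]
    for y in range(h):
        for x in range(w):
            window = [gray[yy][xx]
                      for yy in range(max(0, y - r), min(h, y + r + 1))
                      for xx in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = int(median(window))
    return out


def binarize(gray, threshold=128):
    """Pixel-by-pixel thresholding; the threshold value is an assumption."""
    return [[255 if p >= threshold else 0 for p in row] for row in gray]


# A light 3x3 patch with one dark noise pixel in the centre.
patch = [[(200, 200, 200)] * 3,
         [(200, 200, 200), (0, 0, 0), (200, 200, 200)],
         [(200, 200, 200)] * 3]
cleaned = binarize(median_filter(to_gray(patch)))
```

In the toy patch, the isolated dark pixel is removed by the median filter before thresholding, which is the reason denoising precedes binarization in the step order above.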

Further, the text detection model is a fast RCNN model or a CTPN model or a SegLink model or an EAST model.

Further, the convolutional neural network-based text recognition model is a DenseNet + CTC text recognition model.

Based on the method, the invention also provides a business license content identification system based on deep learning. The system comprises an image acquisition module, an image preprocessing module, a text detection module, a random sample generation module, a text recognition module, and a text information integration module, wherein the image acquisition module is used for acquiring a complete business license image;

the image preprocessing module is used for preprocessing the acquired business license image, the preprocessing comprising graying, filtering and denoising, image binarization, and tilt correction;

the text detection module is used for performing text line detection on the preprocessed image and outputting text line images;

the random sample generation module provides randomly generated training samples for the text recognition module;

the text recognition module is used for performing text recognition on the text line images and outputting text information;

the text information integration module is used for performing semantic analysis on the text information output by the text recognition module and concatenating the contents of the same text line to obtain the final result of business license content identification.

Further, the randomly generated training samples provided by the random sample generation module are divided into a training data set and a verification data set according to a preset proportion.

Further, the randomly generated training samples provided by the random sample generation module contain noise and distortion features of preset amplitude.

Compared with the prior art, the technical solution of the invention has the following beneficial effects:

Training the text detection model in two stages reduces the number of samples required for training and improves the accuracy of text detection. Training the text recognition model on randomly generated samples improves the accuracy of text recognition, and recognizing whole text line images removes the need for character segmentation that conventional character recognition methods have.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.

Example 1

Fig. 1 shows a flowchart of a license content identification method based on deep learning.

S1: collecting images of business licenses;

In a specific embodiment, the complete image of the business license is collected by an image acquisition device, including but not limited to a camera, a smartphone, a computer, or a tablet computer; business license images collected and transmitted by other means can also be received.

In a specific embodiment, the collected business license image is preprocessed as follows. The image is grayed using a weighted average method. An adaptive median filter is then used for filtering and denoising, so that noise is removed without blurring the boundary features of the image. The denoised grayscale image is binarized using a point-by-point method; binarization highlights the target content of interest. In addition, when a planar document is photographed, the lens angle and similar factors readily cause the image to be tilted or deformed. To simplify subsequent processing, the image is tilt-corrected, and the deformation can be rectified through perspective transformation.

Perspective transformation is a nonlinear transformation in three-dimensional space that, in essence, projects the original image onto a new viewing plane by means of a 3 × 3 transformation matrix. Visually, it creates or removes the impression of near and far.
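The role of the 3 × 3 matrix can be illustrated with a small pure-Python sketch. It assumes that the four corners of the license have already been located (the corner coordinates below are made up), estimates the matrix from the four corner correspondences by solving an 8 × 8 linear system with the bottom-right entry fixed to 1, and maps a point using homogeneous coordinates. The helper names `solve`, `homography`, and `warp_point` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """3x3 matrix mapping four source corners to four target corners (h33 fixed to 1)."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve(a, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]


def warp_point(h, x, y):
    """Apply the matrix in homogeneous coordinates, then divide by the third component."""
    xs = h[0][0] * x + h[0][1] * y + h[0][2]
    ys = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return xs / w, ys / w


# Made-up corners of a tilted license and the upright rectangle they should map to.
tilted = [(10, 20), (110, 10), (120, 90), (5, 100)]
upright = [(0, 0), (100, 0), (100, 80), (0, 80)]
H = homography(tilted, upright)
```

The division by the third component in `warp_point` is what makes the mapping nonlinear in the image plane, even though the matrix multiplication itself is linear.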

S3: constructing a text detection model, performing primary training by using an open source text detection data set, constructing a pre-labeled business license image data set by using the preprocessed images in a set proportion, and performing secondary training on the detection model after the primary training by using the pre-labeled business license image data set;

In a specific embodiment, the open-source text detection data set for multilingual scene text detection and script recognition (MLT) may be used for the preliminary training of the text detection model. Available text detection models include fast RCNN, CTPN, SegLink, and EAST; fast RCNN is a general object detection model, whereas the latter three are network models optimized for text detection. The preliminarily trained text detection model can already be applied to detecting text in business license images. However, because of the text layout and font sizes of business licenses, its detection performance does not meet practical requirements. A set proportion (for example, 20%) of the preprocessed business license images is therefore extracted and the text boxes are labeled manually to form a pre-labeled business license image data set, which is then used for secondary training of the preliminarily trained detection model.

It should be noted that training the text detection model in two stages effectively reduces the amount of labeled business license image data required for model training and improves the accuracy of text detection.

In a specific embodiment, a CTPN (Connectionist Text Proposal Network) text detection model may be used, which converts the text detection task into the detection of a series of small-scale text boxes.
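To show how a series of small-scale text boxes can become text lines, the following is a simplified, hypothetical grouping rule rather than CTPN's actual post-processing: small boxes are merged into one line box when they are horizontally close and overlap strongly in the vertical direction. The function name `merge_proposals` and the thresholds `max_gap` and `min_overlap` are assumptions chosen for illustration.

```python
def merge_proposals(boxes, max_gap=16, min_overlap=0.7):
    """Group small (x1, y1, x2, y2) boxes into text line boxes.

    A box joins an existing line when the horizontal gap to the line is at most
    max_gap pixels and the vertical overlap ratio is at least min_overlap.
    """
    lines = []
    for box in sorted(boxes):  # left to right by x1
        for line in lines:
            gap = box[0] - line[2]
            inter = min(box[3], line[3]) - max(box[1], line[1])
            span = max(box[3], line[3]) - min(box[1], line[1])
            if gap <= max_gap and inter / span >= min_overlap:
                # Grow the line box so that it covers the new small box.
                line[0] = min(line[0], box[0])
                line[1] = min(line[1], box[1])
                line[2] = max(line[2], box[2])
                line[3] = max(line[3], box[3])
                break
        else:
            lines.append(list(box))
    return [tuple(line) for line in lines]


# Five 16-pixel-wide boxes belonging to two text lines.
small_boxes = [(0, 10, 16, 30), (16, 11, 32, 31), (32, 10, 48, 30),
               (0, 50, 16, 70), (16, 50, 32, 70)]
text_lines = merge_proposals(small_boxes)
```

Each resulting line box would then be cropped from the preprocessed image and passed to the text recognition model as one text line image.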

S4: constructing a text recognition model based on a convolutional neural network, and training the text recognition model by using a random training sample generated by a preset text library to obtain a trained text recognition model;

In a specific embodiment, the random training samples are text images that each contain a single line of characters, with noise and distortion features of various preset amplitudes.

It should be noted that the text library used for random sample generation includes various types of material, such as news, encyclopedia entries, and articles. The common Chinese characters, English letters, digits, and symbols it covers are sufficient for recognizing the content of a business license.

In addition, a large number of training samples are generated by simulating the composition rules and style of the "unified social credit code" on a business license, which effectively improves the accuracy of the text recognition model.
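Generating code strings that follow the composition rule might look like the sketch below. The text does not spell out the rule, so everything specific here is an assumption: an 18-character layout (two category characters, six digits for the region, nine identifier characters, one check character), a 31-symbol alphabet, and a weighted mod-31 check character, as commonly described for the GB 32100-2015 code format. Rendering the strings into noisy text line images is outside the scope of the sketch.

```python
import random

# Assumed 31-symbol alphabet (digits plus letters, without I, O, S, V, Z).
ALPHABET = "0123456789ABCDEFGHJKLMNPQRTUWXY"
# Assumed position weights for the first 17 characters (powers of 3 modulo 31).
WEIGHTS = [1, 3, 9, 27, 19, 26, 16, 17, 20, 29, 25, 13, 8, 24, 10, 30, 28]


def check_char(body17):
    """Check character for the first 17 characters under the assumed mod-31 rule."""
    total = sum(ALPHABET.index(c) * w for c, w in zip(body17, WEIGHTS))
    return ALPHABET[(31 - total % 31) % 31]


def random_credit_code(rng):
    """Random 18-character code following the assumed field layout."""
    head = "".join(rng.choice(ALPHABET) for _ in range(2))
    region = "".join(rng.choice("0123456789") for _ in range(6))
    ident = "".join(rng.choice(ALPHABET) for _ in range(9))
    body = head + region + ident
    return body + check_char(body)


def is_valid(code):
    """True when the code has 18 allowed characters and a matching check character."""
    return (len(code) == 18
            and all(c in ALPHABET for c in code)
            and check_char(code[:17]) == code[17])


rng = random.Random(0)  # fixed seed so the generated labels are reproducible
labels = [random_credit_code(rng) for _ in range(5)]
```

Generating labels that already satisfy the check rule means the recognizer is trained on strings that look like real codes, and the same `is_valid` check could be reused to flag likely misrecognitions after OCR.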

The randomly generated samples may be divided into a training data set and a validation data set in a 99:1 ratio.
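A 99:1 split of this kind can be sketched in a few lines. The function name `split_samples`, the shuffling step, and the fixed seed are assumptions added for illustration; the text only specifies the ratio.

```python
import random


def split_samples(samples, train_ratio=0.99, seed=0):
    """Shuffle the samples reproducibly and split them into training and validation sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = int(round(len(items) * train_ratio))
    return items[:cut], items[cut:]


train_set, val_set = split_samples(range(1000))
```

With 1000 generated samples, this yields 990 training samples and 10 validation samples.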

In the invention, a DenseNet + CTC model is adopted. DenseNet departs from the established approach of improving network performance by deepening the network (ResNet) or widening the network structure (Inception). Working instead from the perspective of features, it uses feature reuse and bypass connections to greatly reduce the number of network parameters and to alleviate the vanishing-gradient problem to some extent. The network structure of DenseNet consists mainly of DenseBlock and Transition layers.

In a DenseBlock, each layer uses the BN + ReLU + 3x3 Conv structure, and the feature maps of all layers have the same size, so they can be concatenated along the channel dimension. The Transition layer mainly connects two adjacent DenseBlocks and reduces the feature map size. It comprises a 1x1 convolution and 2x2 average pooling, with the structure BN + ReLU + 1x1 Conv + 2x2 AvgPooling. In addition, the Transition layer can serve to compress the model.
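The effect of channel-wise concatenation and of the Transition layer on tensor shapes can be traced with simple bookkeeping, without building a network. The growth rate (new channels added per layer), the compression factor of the 1x1 convolution, and the stride of 2 for the pooling are assumptions not stated in the text, and the function names are hypothetical.

```python
def dense_block_channels(in_channels, num_layers, growth_rate):
    """Channel counts in a dense block.

    Each layer receives the concatenation of all earlier feature maps and
    contributes growth_rate new channels; the spatial size is unchanged.
    Returns the input channel count seen by each layer and the block output count.
    """
    layer_inputs = [in_channels + i * growth_rate for i in range(num_layers)]
    return layer_inputs, in_channels + num_layers * growth_rate


def transition(channels, height, width, compression=0.5):
    """Shape after a transition layer.

    The 1x1 convolution is assumed to keep a `compression` fraction of the
    channels, and the 2x2 average pooling (stride 2 assumed) halves height and width.
    """
    return int(channels * compression), height // 2, width // 2


layer_inputs, block_out = dense_block_channels(64, 4, 12)
shape_after = transition(block_out, 32, 280)
```

Here a 4-layer block with growth rate 12 turns 64 channels into 112, and the transition reduces a 112 x 32 x 280 feature map to 56 x 16 x 140, which is the sense in which the Transition layer compresses the model.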

CTC (Connectionist Temporal Classification) is a sequence classification algorithm that solves the problem of aligning the input data with a given label sequence.
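On the inference side, the alignment problem shows up as decoding: the recognizer emits one class distribution per horizontal position, and the character string is obtained by collapsing that sequence. A minimal sketch of greedy (best-path) decoding follows; it assumes class 0 is the CTC blank and class i corresponds to `alphabet[i - 1]`, which is a convention chosen here rather than one given in the text.

```python
def ctc_greedy_decode(frame_probs, alphabet):
    """Best-path CTC decoding.

    Takes the most likely class per frame, merges consecutive repeats,
    then drops blanks (class 0). Class i > 0 maps to alphabet[i - 1].
    """
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    chars = []
    prev = None
    for idx in best:
        if idx != prev and idx != 0:
            chars.append(alphabet[idx - 1])
        prev = idx
    return "".join(chars)


# Seven frames over the classes (blank, "A", "B"); per-frame argmax is 1,1,0,1,2,2,0.
frames = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
    [0.10, 0.20, 0.70],
    [0.90, 0.05, 0.05],
]
decoded = ctc_greedy_decode(frames, "AB")
```

The blank between the two runs of "A" is what allows a repeated character to survive the merge step, so the example decodes to "AAB" without any per-character segmentation of the line image.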

The DenseNet + CTC text recognition model is trained using the random training samples, and the trained model performs text recognition on the text line images output by the text detection model.

Semantic analysis is performed on the text information recognized by the text recognition model, and text lines with the same attribute are concatenated. For example, semantic analysis of the text of each line in the text line images yields content such as the "unified social credit code", "number", "name", "type", and "address". The text recognition results are integrated after semantic analysis and serve as the final result of the business license content identification system.

Semantic analysis can be implemented with a semantic analysis model, which is trained on generated samples that follow certain regularities: for example, names generally end in "company" (公司), registered capital ends in "yuan" (元), and address text follows general patterns.
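The regularities mentioned above can be illustrated with a rule-based stand-in. Note that this replaces the trained semantic analysis model with hand-written rules purely for illustration; the function names, the regular expressions, the address keywords, and the sample lines are all hypothetical.

```python
import re

# Assumed pattern: 18 characters from digits and letters excluding I, O, S, V, Z.
CODE_RE = re.compile(r"^[0-9A-HJ-NP-RTUW-Y]{18}$")
# Assumed address cues: province, city, district, county, road, street, number.
ADDRESS_RE = re.compile(r"[省市区县路街号]")


def classify_line(text):
    """Assign a field label to one recognized text line using simple rules."""
    t = text.strip()
    if CODE_RE.match(t):
        return "unified social credit code"
    if t.endswith("公司"):
        return "name"
    if t.endswith("元"):
        return "registered capital"
    if ADDRESS_RE.search(t):
        return "address"
    return "other"


def integrate(lines):
    """Concatenate, in reading order, the lines that share the same field label."""
    fields = {}
    for line in lines:
        label = classify_line(line)
        if label != "other":
            fields[label] = fields.get(label, "") + line.strip()
    return fields


# Made-up recognition output; the address is split across two text lines.
recognized = ["91350100M000100Y43", "广州示例科技有限公司", "壹佰万元",
              "广州市天河区示例路", "100号"]
license_fields = integrate(recognized)
```

The two address lines end up joined under a single "address" key, which corresponds to concatenating text lines of the same attribute.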

Based on the method, the invention also provides a business license content identification system based on deep learning. The system comprises an image acquisition module, an image preprocessing module, a text detection module, a random sample generation module, a text recognition module, and a text information integration module, wherein the image acquisition module is used for acquiring a complete business license image;

the text information integration module is used for performing semantic analysis on the text information output by the text recognition module and concatenating the contents of the same text line to obtain the final result of business license content identification.

Further, the random training samples provided by the random sample generation module are divided into a training data set and a validation data set according to a preset proportion.

In one particular embodiment, the random training samples may be divided into a training data set and a validation data set in a 99:1 ratio.

Further, the random training samples provided by the random sample generation module contain noise and distortion features of preset amplitude.

The text information integration module is the last stage of the whole business license content identification system. It performs semantic analysis on all text information output by the text recognition model and concatenates the contents of the same text line to obtain the final result of business license content identification.

It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A business license content identification method based on deep learning is characterized by comprising the following steps:

S1: collecting images of business licenses;

S3: constructing a text detection model, performing primary training by using an open source text detection data set, constructing a pre-labeled business license image data set by using the preprocessed image with a set proportion, and performing secondary training on the primarily trained detection model by using the pre-labeled business license image data set;

2. The method for license content recognition based on deep learning of claim 1, wherein the preprocessing is specifically:

carrying out median filtering denoising on the image subjected to graying;

3. The method for recognizing the contents of a business license based on deep learning of claim 1, wherein the text detection model is fast RCNN model or CTPN model or SegLink model or EAST model.

4. The method of claim 1, wherein the text recognition model based on convolutional neural network is a DenseNet + CTC text recognition model.

5. A business license content identification system based on deep learning, the system comprising: an image acquisition module, an image preprocessing module, a text detection module, a random sample generation module, a text recognition module, and a text information integration module, wherein the image acquisition module is used for acquiring a complete business license image;

6. The business license content identification system based on deep learning of claim 5, wherein the randomly generated training samples provided by the random sample generation module are divided into a training data set and a validation data set according to a preset proportion.

7. The business license content identification system based on deep learning of claim 5, wherein the randomly generated training samples provided by the random sample generation module contain noise and distortion features of preset amplitude.