CN115713776A - General certificate structured recognition method and system based on deep learning

Info

Publication number
CN115713776A
Authority
CN
China
Prior art keywords
text
image
feature
recognition
certificate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211409647.0A
Other languages
Chinese (zh)
Inventor
Peng Qinmu
You Xinge
Hu Yachen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202211409647.0A priority Critical patent/CN115713776A/en
Publication of CN115713776A publication Critical patent/CN115713776A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention relates to the technical field of image recognition, in particular to a general certificate structured recognition method and system based on deep learning, wherein the method comprises the following steps: acquiring certificate image information, and preprocessing the acquired image; inputting the standardized certificate image into a text detection network, locating the text instances in the image, and saving the position information to a file; intercepting a text image from the certificate image according to the detected position coordinates of the text instance, inputting the text image into a text recognition network for recognition, obtaining a recognition result, and storing the recognition result after the position information; classifying the text entities in the recognition result with a key text extraction network, removing non-key types, and then storing the classification result after the recognition result; and structuring and displaying the key text extraction result. The invention can extract key text information from complex certificate images, realizes intelligent identification and reading of certificates, and is applicable to many different types of certificates.

Description

General certificate structured recognition method and system based on deep learning
Technical Field
The invention relates to the technical field of image recognition, in particular to a general certificate structured recognition method and system based on deep learning.
Background
With continued development and the acceleration of internationalization, population flows and cross-regional exchanges at home and abroad have become more frequent, and large population flows are accompanied by high-frequency use of certificates. As a valid proof of citizen identity, the certificate is frequently used in settings requiring identity verification, such as banks, airports, customs and railway stations. Accurate inspection of certificates helps safeguard the security of citizens and society. Manual inspection is inefficient, wastes considerable human resources, and struggles to keep pace with accelerating internationalization. Therefore, developing a structured recognition scheme suitable for certificates to automate certificate inspection can improve inspection efficiency while protecting citizens' privacy and personal information security, and has important research significance and research value.
A certificate consists of a machine-readable zone, a portrait area and other areas, printed with important information such as name, sex, date of birth, certificate number and expiration date; it is a special kind of structured document. Structured recognition of certificates refers to the process of converting a certificate image into text content through text detection and text recognition technology, extracting key information from that content, and storing it as structured output in the form of key-value pairs. In practical applications, certificates come in many types with complex and varied layout structures; complex background patterns and anti-counterfeiting features are printed on their surfaces, and characters are often stained or damaged, posing great challenges to existing algorithms. Therefore, how to provide a multi-certificate structured recognition method that extracts key text information from complex certificate images to realize intelligent certificate recognition is the difficult problem this invention aims to solve.
In view of the above, it is an urgent problem in the art to overcome the above-mentioned drawbacks of the prior art.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, and provides a general certificate structured recognition method and system based on deep learning.
The invention is realized by the following steps:
in a first aspect, the invention provides a general certificate structured recognition method based on deep learning, which comprises the following steps:
acquiring certificate image information, and preprocessing the acquired image to obtain a standardized certificate image;
inputting a standardized certificate image into a text detection network, positioning a text instance in the image, and storing position information into a file;
intercepting a text image from the certificate image according to the detected position coordinates of the text instance, inputting the text image into a text recognition network for recognition, obtaining a recognition result, and storing the recognition result after the position information;
classifying text entities in the recognition result by using a key text extraction network, removing non-key types, and then storing the classification result after the recognition result;
and performing structured processing and displaying on the key text extraction result, and uploading to a service system for processing.
Further, the structure of the text detection network includes:
on the basis of DBNet, using ResNet18 as the feature extraction network, and replacing the conventional 3 x 3 convolution structure in ResNet18 with a grouped multipath selectable convolution structure; during training, the feature map output by feature extraction is supervised using a discriminative loss function.
Further, the grouped multipath selectable convolution structure comprises 3 convolution branches with convolution kernels of sizes 1 × 3, 1 × 5 and 3 × 3 respectively; a channel attention mechanism then assigns weights to the branches, and finally the branches are weighted and summed to form the final convolution feature;
the discriminant loss function measures the distance between the feature vectors by adopting an Euclidean distance, and the calculation mode comprises the following steps: each pixel point in the text image is mapped into an n-dimensional vector, and one text instance is regarded as a cluster. Let the number of clusters be C, N k Indicates the number of elements in the kth cluster, x i Feature vector, μ, representing the ith element in the cluster k The average feature vector of the kth cluster is represented as the cluster center. k1 and k2 represent two different clusters. II denotes the L1 or L2 distance, [ x ]] + Denotes max (0, x), δ v And delta d The boundaries of variance loss and distance loss are shown separately, based on experience, delta in this example v And delta d The selected values were 0.5 and 1.5, respectively. From this, it can be seen that the feature vector distance loss L within the same cluster var And distance loss L between different clusters dist
Figure BDA0003935445530000031
Figure BDA0003935445530000032
Discrimination loss L d Is equal to L var And L dist And the sum is:
L d =L var +L dist
wherein by minimizing L var The penalties may minimize the distance of feature vectors within a cluster, reducing feature vector representation differences in the cluster. By minimizing L dist The loss can maximize the distance between different clusters, adding different clustersThe discrimination of the clusters.
Further, the structure of the text recognition network includes:
based on the CRNN, a Transformer encoder is used as a sequence modeling module to replace a BiLSTM module in the CRNN, and the semantic modeling capability of the Transformer encoder is trained by masking input features; and adding the CVSM module into a text recognition network, distributing different weights to the context features and the visual features obtained after sequence modeling to obtain combined features, and performing decoding output.
Further, the CVSM module uses an attention mechanism to assign weights to the visual features and the context features obtained in the sequence modeling stage to obtain a combined feature. Specifically, the text image passes through the feature extraction stage to give the visual feature, denoted $f_v$; $f_v$ passes through the sequence modeling stage to give the context feature containing sequence semantic information, denoted $f_c$; $f_v$ and $f_c$ have the same feature dimension, and the combined feature $f_u$ is computed as:
f=concatenate(f v ,f c )
a=Sigmoid(fc(f))
f u =matmul(f,a)
in the formula, concatenate represents the stacking (concatenation) operation, Sigmoid is the activation function, and matmul represents matrix multiplication.
Further, the key text extraction network uses the text image and the detection recognition result as input, and comprises three parts of feature coding, relational modeling and feature decoding, wherein:
the feature encoding specifically includes: respectively coding the visual image, the text character and the position coordinate of the text entity, and then fusing into a feature vector in an addition mode to be used as the feature representation of the text entity;
the relational modeling specifically includes: modeling the relationship between text entities through a BiLSTM, where the input of the BiLSTM is the output of the feature encoding stage; before the feature encoding vectors are input into the BiLSTM, the texts are sorted according to their relative order in the certificate, i.e. from top to bottom and from left to right; the sorted feature sequence of the texts is then input into the BiLSTM to learn the interrelations and order information among the sequence elements;
the feature decoding specifically includes: in the decoding and prediction stage, the text extraction task is treated as a classification task for text entities; the probabilities of the classes of the different text entities are predicted directly with a fully connected layer, and the class with the largest probability is then found via the softmax function to represent the classification result of the corresponding text.
Further, the encoding the visual image, the text characters and the position coordinates of the text entity respectively includes visual feature encoding, text character feature encoding and position feature encoding, where:
the visual feature encoding specifically includes: converting the detected text position information via ROIAlign into the corresponding position in the extracted feature map, and then extracting the feature vector at that position as the visual feature; or cropping a text image from the original image according to the detected text position information and then reusing the feature extraction network to perform feature extraction as the visual feature;
the text character feature encoding specifically includes: first, an extra <cls> character is added at the beginning of the character string obtained by text recognition, to capture the overall information of the whole text; each character in the string is then encoded into a 512-dimensional word vector through Embedding; finally the text encoding result is obtained through a Transformer encoder, and the first dimension of the result, i.e. the output at the position of <cls>, represents the text feature encoding result of the whole string;
the position feature encoding specifically includes: first normalizing the coordinates using the width and height of the picture, then encoding the normalized coordinates through a linear layer to obtain the position feature vector. Specifically, let the coordinates of the text box be $p_{box} = (x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4)$, with picture width $w$ and height $h$; the text-box coordinates are first normalized to obtain the normalized coordinate information:

$$p'_{box} = \left(\frac{x_1}{w}, \frac{y_1}{h}, \frac{x_2}{w}, \frac{y_2}{h}, \frac{x_3}{w}, \frac{y_3}{h}, \frac{x_4}{w}, \frac{y_4}{h}\right)$$

The normalized coordinates are passed through a linear layer to obtain the position feature $f_p = \max(0, W p'_{box} + b)$, where $W \in \mathbb{R}^{8 \times C}$ is a learnable parameter and $\max$ denotes the ReLU activation function.
Further, the acquiring certificate image information and preprocessing the acquired image to obtain a standardized certificate image specifically includes:
acquiring certificate image information, and preprocessing the acquired image, including brightness correction, background cutting and rotation correction operations to obtain a standardized certificate image; the brightness correction is used for acquiring a certificate image with uniform brightness; the background cutting is used for removing the background of the certificate and acquiring the certificate image after the background is removed; the rotation correction is used for correcting the image direction of the certificate.
Further, the structuring the key text extraction result specifically includes:
and sequentially reading the stored text position coordinates, the text recognition result and the text classification result according to lines, and converting into a structured form.
In a second aspect, the invention provides a general certificate structured recognition system based on deep learning, which is used for realizing the method in the first aspect, and the system comprises an image acquisition module, an image preprocessing module, a text detection module, a text recognition module, a key text extraction module and an information processing and displaying module; wherein:
the image acquisition module is used for acquiring a certificate image and transmitting the certificate image to a computer terminal for processing;
the image preprocessing module comprises brightness correction, background cutting and rotation correction operations and is used for extracting a certificate image which is uniform in brightness, free of background and correct in direction from an original image;
the text detection module is used for detecting a text area on the certificate image;
the text recognition module is used for recognizing text content in the detected text area;
the key text extraction module is used for classifying the recognized texts and removing non-key texts;
the information processing and displaying module is used for carrying out structuralization processing and displaying on the key text and uploading the key text to a service system for processing.
In conclusion, the beneficial effects of the invention are as follows:
(1) In the certificate text detection task, a grouped multipath selectable convolution structure is first proposed for the problem of detecting elongated text in certificates, enlarging the receptive field of the feature extraction network and improving the detection accuracy of the text detection network on elongated text. Second, for the problem of text instances sticking together or breaking apart, a text detection method guided by a discriminative loss is proposed, strengthening the network's ability to represent and distinguish the features of different text instances and resolving the sticking/breaking phenomenon.
(2) In the certificate text recognition task, for the problems of certificate background interference and worn or stained characters, a sequence modeling method based on a Transformer encoder is proposed; masked training fully mines the semantic correlations between text characters and improves the prediction accuracy of the text recognition network on interfered and stained characters. Furthermore, to preserve the recognition quality of semantics-free text in certificates, a context-and-visual feature selection module is proposed, combining context information and visual information for decoding output and improving the network's text recognition accuracy on certificates.
(3) In the key text extraction task, for the problems of complex and varied certificate layouts and the difficulty of information extraction, text extraction is treated as a classification task over text entities in the certificate: a feature encoding method fusing visual, textual and positional information is proposed to represent text entities with multi-dimensional information; a BiLSTM-based approach to modeling the contextual relations between text entities is proposed, making full use of the associations and layout information among text entities in the certificate; finally, a key text extraction network consisting of feature encoding, relation modeling and decoding prediction is established, realizing structured recognition of multiple certificate types.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below; it is obvious that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a general certificate structured recognition method based on deep learning according to embodiment 1 of the present invention;
fig. 2 is a schematic diagram of a GMSK convolution structure provided in embodiment 1 of the present invention;
fig. 3 is a schematic structural diagram of a CVSM module provided in embodiment 1 of the present invention;
fig. 4 is a schematic diagram of a key text extraction network structure provided in embodiment 1 of the present invention;
fig. 5 is a schematic structural diagram of a feature coding part in a key text extraction network according to embodiment 1 of the present invention;
fig. 6 is a schematic structural diagram of a relational modeling part in a key text extraction network according to embodiment 1 of the present invention;
fig. 7 is a schematic structural diagram of a decoding prediction part in a key text extraction network according to embodiment 1 of the present invention;
fig. 8 is a schematic block diagram of a deep learning-based general certificate structured recognition system according to embodiment 2 of the present invention;
fig. 9 is a schematic structural diagram of a general certificate structured recognition apparatus based on deep learning according to embodiment 3 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
The present invention is a system structure of a specific function system, so the functional logic relationship of each structural module is mainly explained in the specific embodiment, and the specific software and hardware implementation is not limited.
In addition, the technical features related to the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other, and the order of the steps may be changed if they are consistent with logic and do not conflict with each other.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
as shown in fig. 1, embodiment 1 of the present invention provides a deep learning-based general certificate structured recognition method, which includes the following steps:
step 100: certificate image information is collected, and the collected image is preprocessed to obtain a standardized certificate image.
Step 200: and inputting the standardized certificate image into a text detection network, positioning a text instance in the image, and storing the position information into a file.
Step 300: and intercepting the text image from the certificate image according to the detected position coordinates of the text instance, inputting the text image into a text recognition network for recognition, obtaining a recognition result, and storing the recognition result after the position information.
Step 400: and classifying text entities in the recognition result by using a key text extraction network, removing non-key types, and then storing the classification result after the recognition result.
Step 500: and performing structured processing and displaying on the key text extraction result, and uploading to a service system for processing.
The preferred embodiment can solve the problems in the prior art through the steps, can extract key text information from the complex license image, realizes intelligent identification and reading of the certificate, and is suitable for various different types of certificates.
In the preferred embodiment, the step 100 (acquiring document image information and preprocessing the acquired image to obtain a standardized document image) specifically includes: acquiring certificate image information, and preprocessing the acquired image, including brightness correction, background cutting and rotation correction operations to obtain a standardized certificate image; the brightness correction is used for acquiring a certificate image with uniform brightness; the background cutting is used for removing the background of the certificate and acquiring the certificate image after the background is removed; the rotation correction is used for correcting the direction of the certificate image.
In this process, certificate image information is collected by first photographing the certificate with a dedicated certificate reader, which may be a camera device or a CIS scanner device, or by capturing the certificate image with other devices having image-capture capabilities; the data are then transmitted to the computer via a USB cable, Bluetooth, Wi-Fi or similar means. As for the brightness correction, background cropping and rotation correction operations: because the illumination environments of different devices are unstable, the brightness of acquired images varies, and different areas of the same image may differ in brightness, so brightness correction is needed to obtain a certificate image of uniform brightness and avoid affecting subsequent operations. Then, to process the certificate information, the background in the certificate image must first be cropped away and the certificate extracted. Because the orientation in which a user places the certificate is uncertain, the certificate must also be rotationally corrected.
The brightness correction operation is specifically: first, a piece of standard white paper is placed in the device and an image is acquired. Each pixel of the acquired image is then compared with the preset brightness, and the quotient of the preset brightness and the pixel value is recorded as the correction coefficient; when an original image is subsequently acquired, the value of each pixel in it is multiplied by the corresponding correction coefficient, yielding the brightness-corrected image.
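As a rough illustration of this calibration step, the following NumPy sketch builds per-pixel correction coefficients from a white-paper reference shot; the target brightness value is an assumption, not taken from the patent:

```python
import numpy as np

def build_correction_map(white_ref: np.ndarray, target: float = 220.0) -> np.ndarray:
    # Correction coefficient = preset brightness / reference pixel value,
    # computed once from an image of standard white paper.
    # target=220.0 is an assumed preset brightness.
    return target / np.clip(white_ref.astype(np.float32), 1.0, None)

def correct_brightness(image: np.ndarray, coeff: np.ndarray) -> np.ndarray:
    # Multiply every pixel of a newly acquired image by its coefficient.
    corrected = image.astype(np.float32) * coeff
    return np.clip(corrected, 0, 255).astype(np.uint8)
```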
The background cropping operation is specifically: first, the image is binarized using the Otsu method, and the findContours function provided by OpenCV is used to find the contour with the largest area, represented as a set of points. Considering that convex protrusions such as a human hand may be present in the image, adjacent concave parts of the contour must be located; the convexityDefects function provided by OpenCV is used to find the concave parts, and the contour points between two adjacent concave parts are removed, i.e. the protrusion is cut off. After convexity cutting, the contour of the certificate is obtained; finally the minAreaRect function is used to find the contour's minimum circumscribed rectangle, and a corresponding affine transformation is applied according to the rectangle's horizontal and vertical sides, extracting the certificate from the image and thereby completing the background cropping operation.
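A sketch of this pipeline with OpenCV (Otsu binarization, largest contour, convexityDefects-based pruning, minAreaRect and rectification); the defect-pruning details and point ordering are simplified assumptions:

```python
import cv2
import numpy as np

def crop_document(image: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu binarization, then the largest-area contour as the document candidate.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea)

    # Concave notches found via convexityDefects delimit protrusions (e.g. a
    # hand); contour points between two adjacent defects would be removed here.
    hull = cv2.convexHull(contour, returnPoints=False)
    defects = cv2.convexityDefects(contour, hull)  # pruning step elided in this sketch

    # Minimum-area bounding rectangle, then warp the document upright.
    rect = cv2.minAreaRect(contour)
    box = cv2.boxPoints(rect).astype(np.float32)
    w, h = int(rect[1][0]), int(rect[1][1])
    dst = np.float32([[0, h - 1], [0, 0], [w - 1, 0], [w - 1, h - 1]])
    M = cv2.getPerspectiveTransform(box, dst)  # point order may need normalizing
    return cv2.warpPerspective(image, M, (w, h))
```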
The rotation operation is specifically: for certificates with a machine-readable code, the up/down orientation is judged from the position of the machine-readable code; if a portrait is present, its position can be detected and the judgment made from the portrait position; for other certificates, the orientation can be judged by computing the variance of the certificate's top, bottom, left and right regions, or by locating a specific pattern.
In the preferred embodiment, step 200 specifically includes: the image finally obtained in step 100 is input into the improved text detection network of the invention, the text instances in the image are located, and the position information is saved into a file. The position information of a text instance in the image is optionally represented by the coordinates of the four vertices of the text instance's minimum bounding rectangle, or by the rectangle's center point, length and width, and rotation angle. For example, $[(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)]$ represents the position information of a text instance, where $(x_i, y_i)$, $1 \le i \le 4$, are the vertices of the rectangle. After all detection results are obtained, part of the erroneous detections is first filtered out by preset rules, for example excluding all detections with area smaller than 25; finally, the position information of all text instances is stored line by line in a result.txt file.
The text detection network in step 200 of this embodiment is based on DBNet and uses ResNet18 as the feature extraction network. Further, to address the problem of elongated text, the invention proposes the grouped multipath selectable convolution structure (GMSK convolution) to replace the conventional 3 × 3 convolution structure in ResNet18. When detecting elongated text, the conventional square convolution structure in ResNet18 struggles to cover the text instance well and easily introduces excessive noise in the vertical direction, degrading the detection accuracy of the text detection network on elongated text.
In this embodiment, the grouped multipath selectable convolution structure (GMSK convolution structure) specifically includes 3 convolution branches with kernels of sizes 1 × 3, 1 × 5 and 3 × 3 respectively; the branches are then assigned weights by a channel attention mechanism and finally weighted and summed to obtain the final convolution feature, as shown in fig. 2. GMSK convolution combines multi-size convolution kernels with a channel attention mechanism, can adaptively select different convolution structures according to the target shape, and can be stacked in a multilayer network, giving it a very flexible receptive field. For example, stacking two 1 × 5 convolution kernels yields a 1 × 9 receptive field; stacking 1 × 3 and 1 × 5 kernels yields a 1 × 7 receptive field; and stacking 3 × 3 and 1 × 5 kernels yields a 3 × 7 receptive field. This flexible receptive field helps improve the accuracy of the text detection network; especially for elongated text, it better captures the complete information of a text instance, reduces the introduction of noise, and achieves higher detection accuracy.
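A minimal PyTorch sketch of the GMSK structure described above — three parallel branches whose outputs are weighted by channel attention and summed; the attention layout and reduction ratio are assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

class GMSKConv(nn.Module):
    """Grouped multipath selectable convolution: 1x3, 1x5 and 3x3 branches
    fused by channel-attention weights (a sketch of the structure in fig. 2)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)),
            nn.Conv2d(channels, channels, (1, 5), padding=(0, 2)),
            nn.Conv2d(channels, channels, (3, 3), padding=(1, 1)),
        ])
        # Channel attention produces one weight per branch per channel
        # (reduction ratio 4 is an assumption).
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 3 * channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, 3, C, H, W)
        b, _, c, h, w = feats.shape
        weights = self.attn(x).view(b, 3, c, 1, 1).softmax(dim=1)  # branch weights
        return (feats * weights).sum(dim=1)                        # weighted sum
```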
In this embodiment, during training, the feature map produced by feature extraction is supervised with a discriminative loss function. The loss used to train the detection network consists of two parts: one is the discriminative loss function proposed by the invention (denoted $L_d$), which supervises the generation of feature vectors for different text instances and addresses the sticking or breaking of text instances; the other is the loss on the network's predicted outputs (denoted $L_o$). Empirically, the hyperparameter $\lambda$ is set to 0.2. The total network loss (denoted $L$) can be written as:

$$L = L_o + \lambda L_d$$
discriminant loss function L d The distance between feature vectors is measured using euclidean distance. The specific calculation method is as follows: each pixel point in the text image is mapped into an n-dimensional vector, and one text instance is regarded as a cluster. Let the number of clusters be C, N k Indicates the number of elements in the kth cluster, x i Feature vector, μ, representing the ith element in the cluster k The average feature vector of the kth cluster is represented as the cluster center. k1 and k2 represent two different clusters. | denotes the distance L1 or L2, [ x | ]] + Denotes max (0, x), δ v And delta d The boundaries of variance loss and distance loss are expressed respectively, and delta is empirically obtained in the present example v And delta d The selected values were 0.5 and 1.5, respectively. From this, the eigenvector distance loss L inside the same cluster can be obtained var And distance loss L between different clusters dist
Figure BDA0003935445530000101
Figure BDA0003935445530000111
Discrimination loss L d Is equal to L var And L dist And (c) the sum, i.e.:
L d =L var +L dist
wherein by minimizing L var The loss can minimize the distance of the feature vectors inside the cluster, and reduce the difference of the feature vector representation in the cluster. By minimizing L dist The loss can maximize the distance between different clusters, increasing the discrimination of different clusters.
The output-prediction loss $L_o$ comprises the text score map loss (denoted $L_s$), the threshold map loss (denoted $L_t$) and the binarization map loss (denoted $L_b$):

$$L_o = L_s + \mu L_b + \nu L_t$$

Empirically, the hyperparameters $\mu$ and $\nu$ are set to 1.0 and 5.0 respectively. $L_s$ and $L_b$ are computed with BCELoss and $L_t$ with L1Loss:

$$L_{BCE} = -\sum_i \big( y_i \log x_i + (1 - y_i)\log(1 - x_i) \big)$$

$$L_1 = \sum_i |y_i - x_i|$$
the text score map (recorded as S) and the threshold map (recorded as T) are obtained by network prediction, and the binary map (recorded as T)
Figure BDA0003935445530000113
) Calculated from S and T:
Figure BDA0003935445530000112
during training, pre-training is firstly carried out on a public data set, and then the marked real certificate data is used for training the whole network.
In the preferred embodiment, step 300 specifically includes: according to the detected position coordinates of the text instances, text images are intercepted from the certificate image and input into the improved text recognition network of the invention for recognition; after the recognition result is obtained, it is stored after the position information. During recognition, the coordinates in result.txt are read line by line, a text-instance image is cropped from the preprocessed certificate image according to the coordinates, and the image is then input into the text recognition network; the resulting recognition string is stored in the corresponding line of result.txt, after the coordinate information. The structure of the text recognition network in this embodiment is as follows: on the basis of CRNN, a Transformer encoder is used as the sequence modeling module in place of the BiLSTM module in CRNN, and the semantic modeling capability of the Transformer encoder is trained by masking the input features; the CVSM module is added to the text recognition network, different weights are assigned to the context features and visual features obtained after sequence modeling to yield a combined feature, and decoding output is performed.
The text recognition network of this embodiment improves on the CRNN network. To address background interference and character defacement, a Transformer encoder replaces the BiLSTM module in CRNN as the sequence modeling module, and a masked training mode is used to model the semantic information between characters, helping the network recognize interfered and defaced characters. To alleviate the vocabulary-dependence problem of text recognition networks, the invention proposes an attention-based context-and-visual feature selection module (CVSM), which is added to the text recognition network; different weights are assigned to the context features and the visual features to obtain a joint feature, which is decoded for output.
In this embodiment, the CVSM module uses an attention mechanism to assign weights to the visual features and the context features obtained in the sequence modeling stage, yielding a combined feature. The structure of the CVSM module is shown in fig. 3. The text image passes through the feature extraction stage to give the visual feature (denoted $f_v$); $f_v$ passes through the sequence modeling stage to give the context feature containing sequence semantic information (denoted $f_c$); $f_v$ and $f_c$ have the same feature dimension. The combined feature $f_u$ is computed as follows:
f=concatenate(f v ,f c )
a=Sigmoid(fc(f))
f u =matmul(f,a)
in the formula, concatenate represents the stacking (concatenation) operation, Sigmoid is the activation function, and matmul represents matrix multiplication.
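A possible PyTorch reading of these three formulas, treating the attention weights as one scalar gate per feature source at each time step (the exact weight layout is not specified in the text and is assumed here):

```python
import torch
import torch.nn as nn

class CVSM(nn.Module):
    """Context-and-visual selection module: weights f_v and f_c with an
    attention gate and returns the combined feature f_u (a sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 2)   # one weight per feature source

    def forward(self, f_v: torch.Tensor, f_c: torch.Tensor) -> torch.Tensor:
        # f_v, f_c: (B, T, D) visual and contextual sequence features.
        f = torch.cat([f_v, f_c], dim=-1)                  # concatenate
        a = torch.sigmoid(self.fc(f))                      # (B, T, 2) weights
        stacked = torch.stack([f_v, f_c], dim=-1)          # (B, T, D, 2)
        f_u = torch.matmul(stacked, a.unsqueeze(-1)).squeeze(-1)  # matmul
        return f_u                                         # (B, T, D)
```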
The training of the text recognition network is divided into two stages: pre-training of a Transformer encoder and training of the whole text recognition network.
When the Transformer encoder is pre-trained, the input visual features are randomly masked and supervision is applied with CTCLoss and CELoss, so that the sequence modeling module fully captures the semantic correlations between characters, improving the network's recognition accuracy on interfered and defaced characters.
When the whole text recognition network is trained, the parameters of the Transformer encoder are fixed, the CVSM module is added to the network, and CTCLoss is used for supervision.
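The masking step in the pre-training stage might be sketched as follows (the mask ratio and zero-fill strategy are assumptions):

```python
import torch

def mask_visual_features(f_v: torch.Tensor, mask_ratio: float = 0.15):
    # f_v: (B, T, D) visual features entering the Transformer encoder.
    # Zero-mask a random subset of time steps; the 15% ratio is assumed.
    mask = torch.rand(f_v.shape[:2], device=f_v.device) < mask_ratio
    masked = f_v.masked_fill(mask.unsqueeze(-1), 0.0)
    return masked, mask  # masked positions are supervised during pre-training
```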
In the preferred embodiment, step 400 specifically includes: the key text extraction network provided by the invention is used for classifying text entities, removing non-key types and then storing classification results after identification results.
A key text entity type is first defined. For example, for the passport, 14 types of text entities are defined: passport code, country of issue or issuing-agency code, passport number, primary name identifier, secondary name identifier, nationality, date of birth, personal number, sex, place of birth, place of issue, date of issue, issuing agency, and date of expiration. For the travel permits (the Exit-Entry Permit for Travelling to and from Hong Kong and Macao, the Exit-Entry Permit for Travelling to and from Taiwan, the Mainland Travel Permit for Hong Kong and Macao Residents, and the Mainland Travel Permit for Taiwan Residents), 12 text entity types are defined: certificate number, Chinese name, pinyin name, date of birth, validity period, gender, issuing authority, place of issue, number of issues, Hong Kong/Macao permit number, date of issue and date of expiration.
Outside the key text entity types, non-key text entity types are defined.
During classification, the text position coordinates and text recognition results in result.txt are read sequentially line by line, the text-instance image is cropped from the preprocessed certificate image according to the coordinates, and the text position, text characters and text image are input into the key text extraction network to obtain the classification result, which is stored after the text recognition result in the corresponding line of result.txt. Lines belonging to non-key text entity types are then deleted.
The structure of the key text extraction network is shown in fig. 4, and text images and detection recognition results are used as input, and the key text extraction network comprises three parts, namely feature encoding, relational modeling and feature decoding.
The feature encoding method is as shown in fig. 5, in which the visual image, the text characters, and the position coordinates of the text entity are encoded respectively, and then directly fused into a feature vector in an addition manner as the feature representation of the text entity.
The visual feature encoding is specifically, preferably: according to the text position information detected in step 200, convert it via ROIAlign to the corresponding position in the feature map extracted in step 200, and then take the feature vector at that position as the visual feature. Alternatively, crop the text image from the original image according to the text position information detected in step 200 and reuse a feature extraction network to extract features as the visual feature.
The text character feature encoding is specifically: first, an extra <cls> character is added at the beginning of the character string obtained by text recognition, to capture the overall information of the whole text; each character in the string is then encoded into a 512-dimensional word vector through Embedding; finally the text encoding result is obtained through a Transformer encoder, and the first dimension of the result (i.e. the output at the position of <cls>) represents the text feature encoding result of the whole string.
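A sketch of this character encoder in PyTorch (the reserved <cls> index, encoder depth and head count are assumptions; the 512-dimensional embedding follows the text):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encode a recognized string: prepend <cls>, embed to 512-d, run a
    Transformer encoder, keep the <cls>-position output (a sketch)."""
    def __init__(self, vocab_size: int, dim: int = 512, layers: int = 2):
        super().__init__()
        self.cls_id = 0                        # assumed id reserved for <cls>
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (B, L) character indices of the recognized string.
        cls = torch.full_like(char_ids[:, :1], self.cls_id)
        x = self.embed(torch.cat([cls, char_ids], dim=1))
        return self.encoder(x)[:, 0]           # <cls> output summarizes the string
```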
The position feature encoding is specifically: first the coordinates are normalized by the width and height of the picture, and the normalized coordinates are then encoded through a linear layer to obtain the position feature vector. Assume the coordinates of the text box are $p_{box} = (x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4)$, and the picture has width $w$ and height $h$. The text-box coordinates are first normalized to obtain the normalized coordinate information:

$$p'_{box} = \left(\frac{x_1}{w}, \frac{y_1}{h}, \frac{x_2}{w}, \frac{y_2}{h}, \frac{x_3}{w}, \frac{y_3}{h}, \frac{x_4}{w}, \frac{y_4}{h}\right)$$

The normalized coordinates are passed through a linear layer to obtain the position feature, i.e. $f_p = \max(0, W p'_{box} + b)$, where $W \in \mathbb{R}^{8 \times C}$ is a learnable parameter and $\max$ denotes the ReLU activation function.
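The corresponding position encoder is a small wrapper around the formula above (a sketch; the output dimension C = 512 is an assumption chosen to match the character encoding):

```python
import torch
import torch.nn as nn

class PositionEncoder(nn.Module):
    """Normalize the 8 box coordinates by image width/height, then project
    through a linear layer with ReLU: f_p = max(0, W p'_box + b)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.linear = nn.Linear(8, dim)        # W in R^{8 x C}

    def forward(self, box: torch.Tensor, w: float, h: float) -> torch.Tensor:
        # box: (B, 8) = (x1, y1, ..., x4, y4) in pixels.
        scale = box.new_tensor([w, h, w, h, w, h, w, h])
        return torch.relu(self.linear(box / scale))
```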
The relational modeling approach models the relationships between text entities through a BiLSTM, as shown in fig. 6. The relational modeling module used in this embodiment consists of two layers of BiLSTM with a hidden-state feature dimension of 512. The input of the BiLSTM is the output of the feature encoding stage. Before the feature encoding vectors are input into the BiLSTM, they are first sorted according to the relative order in which the texts appear in the certificate, i.e. from top to bottom and from left to right; the sorted feature sequence of the texts is then input into the BiLSTM, which learns the interrelations and order information among the sequence elements.
A given text entity in the certificate is strongly correlated with the adjacent texts around it; after sorting, the text entities related to it lie close by on its left and right, and the BiLSTM models context information in both directions of the sequence well. Second, although the layout of text entities may vary across certificates, their relative order is fairly stable, and the BiLSTM models this order information well. Finally, because the feature vector representing a text entity contains the text's position information, this modeling approach can also learn the layout information among text entities to some extent. In summary, modeling with bidirectional long short-term memory units learns the context and layout information between text entities well, achieving good results.
As shown in fig. 7, in the decoding prediction stage, the embodiment of the present invention treats the text extraction task as a classification task for text entities: the probabilities of the classes of the different text entities are predicted directly with a fully connected layer, and the class with the largest probability is then selected via the softmax function to represent the classification result of the corresponding text. Mainstream text extraction methods perform character-level prediction on the text through a BiLSTM-CRF layer, find the optimal path with the Viterbi algorithm, and finally extract the text entities, which is time-consuming. In the text detection task, the embodiment of the invention solves the sticking or breaking of text instances by introducing discriminative-loss supervision, so a detected text instance corresponds to a complete entity type and no character-level prediction is needed. On this basis, the time-consuming BiLSTM-CRF layer in the text extraction task is removed and the category of a text entity is predicted directly, greatly improving prediction efficiency without loss of network accuracy.
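A sketch of the relational modeling and decoding head combined (the two-layer BiLSTM with hidden size 512 follows the text; the class count, e.g. 14 key types plus one non-key type for passports, is an assumption):

```python
import torch
import torch.nn as nn

class KeyTextHead(nn.Module):
    """Two-layer BiLSTM over the sorted entity features, then a fully
    connected layer and softmax per entity (a sketch following figs. 6-7)."""
    def __init__(self, dim: int = 512, num_classes: int = 15):
        super().__init__()
        self.bilstm = nn.LSTM(dim, 512, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 512, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D) entity features already sorted top-to-bottom,
        # left-to-right as described above.
        out, _ = self.bilstm(feats)            # (B, N, 1024)
        logits = self.fc(out)                  # (B, N, num_classes)
        probs = logits.softmax(dim=-1)
        return probs.argmax(dim=-1)            # predicted class per text entity
```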
During network training, the Adam optimizer is used. To increase the diversity of training samples and improve the robustness of the model, the training data are augmented online; online augmentation includes blurring operations, color and brightness transformations, random cropping, etc. To address sample imbalance during training, the Focal Loss function is used to focus on hard examples.
In the preferred embodiment, step 500 specifically includes: and (4) performing structured processing on the key text extraction result in the step (400), displaying the key text extraction result on a user interface, and uploading the key text extraction result to a business system for processing, such as opening a gate in an airport, opening an account in a bank and the like.
The structuring process specifically comprises sequentially reading, line by line, the text position coordinates, text recognition results and text classification results stored in result.txt and converting them into a structured form, optionally, for example, using the Json format:
{
    "certificate number": { "coordinates": [[545,54],[762,54],[762,86],[545,86]], "result": "C12345678" },
    "Chinese name": { "coordinates": [[280,121],[392,121],[392,154],[280,154]], "result": "Zhang San" },
    "pinyin name": { "coordinates": [[280,141],[412,141],[412,174],[280,174]], "result": "ZHANG, SAN" },
    "date of birth": { "coordinates": [[280,242],[428,242],[428,269],[280,269]], "result": "2000.01.01" }
}
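A sketch of this conversion, assuming result.txt stores one tab-separated record of coordinates, recognized text and class label per line (the separator and line layout are assumptions):

```python
import json

def structure_results(path: str = "result.txt") -> str:
    """Convert the per-line 'coordinates | recognition | class' records in
    result.txt into the Json form shown above (a sketch)."""
    structured = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            coords, text, label = line.rstrip("\n").split("\t")
            structured[label] = {"coordinates": json.loads(coords), "result": text}
    return json.dumps(structured, ensure_ascii=False, indent=2)
```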
In summary, in the certificate text detection task, the embodiment of the present invention proposes a grouped multipath selectable convolution structure for detecting elongated text in certificates, enlarging the receptive field of the feature extraction network and improving the detection accuracy of the text detection network on elongated text. Second, for the problem of text instances sticking together or breaking apart, a text detection method guided by a discriminative loss is proposed, strengthening the network's ability to represent and distinguish the features of different text instances and resolving the sticking/breaking phenomenon.
In the certificate text recognition task, for the problems of certificate background interference and worn or stained characters, the embodiment of the invention proposes a sequence modeling method based on a Transformer encoder, fully mining the semantic correlations between text characters through masked training and improving the prediction accuracy of the text recognition network on interfered and stained characters. Furthermore, to preserve the recognition quality of semantics-free text in certificates, a context-and-visual feature selection module is proposed, combining context information and visual information for decoding output and improving the network's text recognition accuracy on certificates.
In the key text extraction task, for the problems of complex and varied certificate layouts and the difficulty of information extraction, the embodiment of the invention treats text extraction as a classification task over text entities in the certificate and proposes a feature encoding method fusing visual, textual and positional information, representing text entities with multi-dimensional information; it also proposes a BiLSTM-based approach to modeling the contextual relations between text entities, making full use of the associations and layout information among text entities in the certificate; finally a key text extraction network consisting of feature encoding, relation modeling and decoding prediction is established, realizing structured recognition of multiple certificate types.
Example 2:
on the basis of the general certificate structured recognition method based on deep learning provided in the above embodiment 1, an embodiment 2 of the present invention provides a general certificate structured recognition system based on deep learning to implement the method provided in the embodiment 1, and as shown in fig. 8, the system includes an image acquisition module, an image preprocessing module, a text detection module, a text recognition module, a key text extraction module, and an information processing and displaying module.
The image acquisition module of the embodiment is used for operating special equipment to acquire the certificate image and transmitting the certificate image to a computer end for processing.
The image preprocessing module of this embodiment includes brightness correction, background cropping, and rotation correction operations, and is used to extract a certificate image with uniform brightness, no background, and correct direction from an original image.
The text detection module of this embodiment is used for detecting text areas on the certificate image; specifically, the image output by the image preprocessing module is input into the improved text detection network of the invention, the text areas on the certificate image are detected, and the position information is stored sequentially into result.txt.
The text recognition module of this embodiment is configured to recognize text content in the detected text region; specifically, the position information in result.txt is read in sequence according to lines, a text image is intercepted from a certificate image and input into the improved text recognition network, and a text character recognition result is obtained and stored in the corresponding line of result.txt after the position information.
The key text extraction module of this embodiment is used for classifying the recognized texts and removing non-key texts; the text images, text characters and text position information are input into the key text extraction network provided by the invention, the texts are classified, the classification result is stored in result.txt, and the lines of non-key types are removed.
The information processing and displaying module of this embodiment is configured to perform structured processing and displaying on the key text, and upload the key text to the service system for processing. Specifically, the text position coordinates, the text recognition results and the text classification results stored in result.txt are sequentially read according to lines, converted into a structured form, displayed on a user interface, convenient for a user to check and interact, and uploaded to a service system for service processing.
Example 3:
on the basis of the general certificate structured recognition method based on deep learning provided in embodiment 1, the present invention further provides a general certificate structured recognition device based on deep learning, which can be used for implementing the method and system described above, as shown in fig. 9, it is a schematic diagram of a device architecture in an embodiment of the present invention. The deep learning based universal certificate structured recognition device of the present embodiment includes one or more processors 21 and a memory 22. In fig. 9, one processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or other means, and fig. 9 illustrates the connection by a bus as an example.
The memory 22, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the deep learning based method for structured identification of universal certificates in embodiment 1. The processor 21 executes various functional applications and data processing of the deep learning based universal certificate structured recognition device by running the nonvolatile software program, instructions and modules stored in the memory 22, that is, implements the deep learning based universal certificate structured recognition method of embodiment 1.
The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, and these remote memories may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Program instructions/modules are stored in the memory 22 that, when executed by the one or more processors 21, perform the deep learning based generic certificate structured recognition method of embodiment 1 above, e.g., perform the various steps illustrated in fig. 1 described above.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be performed by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium, which may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
The above description is intended to be illustrative of the preferred embodiment of the present invention and should not be taken as limiting the invention, but rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention. Those not described in detail in this specification are within the skill of the art.

Claims (10)

1. A general certificate structured recognition method based on deep learning is characterized by comprising the following steps:
acquiring certificate image information, and preprocessing the acquired image to obtain a standardized certificate image;
inputting a standardized certificate image into a text detection network, positioning a text instance in the image, and storing position information into a file;
intercepting a text image from the certificate image according to the detected position coordinates of the text instance, inputting the text image into a text recognition network for recognition, obtaining a recognition result, and storing the recognition result after the position information;
classifying text entities in the recognition result by using a key text extraction network, removing non-key types, and then storing the classification result after the recognition result;
and performing structured processing and displaying on the key text extraction result, and uploading to a service system for processing.
2. The deep learning-based universal certificate structured recognition method as claimed in claim 1, wherein the structure of the text detection network comprises:
on the basis of DBNet, using ResNet18 as a feature extraction network, and replacing the conventional 3 x 3 convolution structure in ResNet18 with a grouped multipath selectable convolution structure; during training, a discriminative loss function is used to supervise the feature map output by feature extraction.
3. The deep learning-based general certificate structured recognition method according to claim 2, wherein the grouped multipath selectable convolution structure comprises 3 convolution branches, the sizes of convolution kernels are 1 x 3, 1 x 5 and 3 x 3 respectively, then the branches are subjected to weight distribution by using a channel attention mechanism, and finally the branches are subjected to weighted summation to serve as a final convolution feature;
the discriminative loss function measures the distance between feature vectors using the Euclidean distance and is calculated as follows: each pixel in the text image is mapped to an N-dimensional vector and each text instance is regarded as a cluster; C denotes the number of clusters, N_c the number of elements in cluster c, x_i the feature vector of the i-th element, μ_c the mean feature vector of cluster c, i.e. the cluster centre, || · || the L1 or L2 distance, [x]_+ denotes max(0, x), and δ_v and δ_d are the margins of the variance loss and the distance loss respectively; from these the following can be obtained:

L_var = (1/C) Σ_{c=1..C} (1/N_c) Σ_{i=1..N_c} [ ||μ_c − x_i|| − δ_v ]_+^2

L_dist = (1/(C(C−1))) Σ_{c_A=1..C} Σ_{c_B=1..C, c_B≠c_A} [ 2δ_d − ||μ_{c_A} − μ_{c_B}|| ]_+^2

L_d = L_var + L_dist

wherein L_var is the intra-cluster variance loss: minimising it pulls the feature vectors within a cluster towards their centre, reducing the differences among feature representations inside the cluster; L_dist is the inter-cluster distance loss: minimising it pushes different cluster centres apart, increasing the discriminability of different clusters; L_d is the discriminative loss, equal to the sum of L_var and L_dist.
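As a cross-check of the formulas above, the following is a minimal PyTorch sketch of the discriminative loss, assuming per-pixel embeddings flattened to a (num_pixels, N) tensor with integer instance labels; the margins delta_v = 0.5 and delta_d = 1.5 are illustrative defaults, not values from the patent:

    import torch

    def discriminative_loss(embeddings, labels, delta_v=0.5, delta_d=1.5):
        # embeddings: (num_pixels, N) feature vectors; labels: (num_pixels,) instance ids
        cluster_ids = labels.unique()
        C = len(cluster_ids)
        centers, l_var = [], 0.0
        for c in cluster_ids:
            x = embeddings[labels == c]                  # elements of cluster c
            mu = x.mean(dim=0)                           # cluster centre mu_c
            centers.append(mu)
            d = (x - mu).norm(p=2, dim=1)                # ||mu_c - x_i|| (L2 distance)
            l_var = l_var + (torch.clamp(d - delta_v, min=0) ** 2).mean()
        l_var = l_var / C                                # L_var
        l_dist = 0.0
        if C > 1:
            centers = torch.stack(centers)               # (C, N)
            d = (centers.unsqueeze(0) - centers.unsqueeze(1)).norm(p=2, dim=2)
            off_diag = ~torch.eye(C, dtype=torch.bool, device=d.device)
            hinge = torch.clamp(2 * delta_d - d[off_diag], min=0) ** 2
            l_dist = hinge.sum() / (C * (C - 1))         # L_dist
        return l_var + l_dist                            # L_d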
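Likewise, a sketch of the grouped multi-path selectable convolution of this claim: three parallel branches with 1x3, 1x5, and 3x3 kernels, fused by channel-attention weights. The squeeze-and-excitation-style attention layout, the reduction ratio, and the softmax normalisation across branches are assumptions beyond what the claim specifies:

    import torch
    import torch.nn as nn

    class GroupedMultiPathConv(nn.Module):
        def __init__(self, channels, reduction=4):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)),
                nn.Conv2d(channels, channels, (1, 5), padding=(0, 2)),
                nn.Conv2d(channels, channels, (3, 3), padding=(1, 1)),
            ])
            # channel attention producing one weight vector per branch
            self.attn = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels * 3, 1),
            )

        def forward(self, x):
            feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, 3, C, H, W)
            w = self.attn(x).view(x.size(0), 3, x.size(1), 1, 1)
            w = torch.softmax(w, dim=1)       # normalise weights across the 3 branches
            return (feats * w).sum(dim=1)     # weighted sum as the final feature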
4. The deep learning-based general certificate structured recognition method according to claim 1, wherein the structure of the text recognition network comprises:
on the basis of CRNN, using a Transformer encoder as the sequence modeling module in place of the BiLSTM module in CRNN, and training the semantic modeling capability of the Transformer encoder by masking input features; and adding a CVSM module to the text recognition network, which assigns different weights to the visual features and to the context features obtained after sequence modeling, producing combined features that are then decoded and output.
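A minimal sketch of this sequence-modelling stage, assuming 512-dimensional features, 2 encoder layers, and a 15% mask ratio, none of which are specified by the claim:

    import torch
    import torch.nn as nn

    encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
    sequence_model = nn.TransformerEncoder(encoder_layer, num_layers=2)

    f_v = torch.randn(4, 64, 512)                 # (batch, seq_len, dim) visual features
    mask = torch.rand(f_v.shape[:2]) < 0.15       # randomly mask input features (training)
    f_c = sequence_model(f_v.masked_fill(mask.unsqueeze(-1), 0.0))  # context features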
5. The deep learning-based general certificate structured recognition method according to claim 4, wherein the CVSM module uses an attention mechanism to assign weights to the visual features and the context features obtained in the sequence modeling stage, producing combined features; specifically, the text image yields visual features after the feature extraction stage, denoted f_v; f_v passes through the sequence modeling stage to obtain context features containing sequence semantic information, denoted f_c; f_v and f_c have the same feature dimension, and the combined feature f_u is calculated as:

f = concatenate(f_v, f_c)
a = Sigmoid(fc(f))
f_u = matmul(f, a)

wherein concatenate denotes the stacking operation, fc a fully connected layer, Sigmoid the activation function, and matmul matrix multiplication.
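One plausible PyTorch reading of these formulas, in which fc maps the concatenated features to one weight per source and matmul applies the weighted combination; the per-position two-weight gate is an interpretation, not the patent's definitive implementation:

    import torch
    import torch.nn as nn

    class CVSM(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.fc = nn.Linear(2 * dim, 2)   # one scalar weight per feature source

        def forward(self, f_v, f_c):
            # f_v, f_c: (batch, seq_len, dim), identical feature dimensions
            a = torch.sigmoid(self.fc(torch.cat([f_v, f_c], dim=-1)))   # (B, T, 2)
            f = torch.stack([f_v, f_c], dim=-1)                         # (B, T, dim, 2)
            return torch.matmul(f, a.unsqueeze(-1)).squeeze(-1)         # f_u: (B, T, dim)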
6. The deep learning-based general certificate structured recognition method according to claim 1, wherein the key text extraction network takes the text image and the detection and recognition results as input and comprises three parts, feature encoding, relation modeling, and feature decoding, wherein:
the feature encoding specifically comprises: encoding the visual image, text characters, and position coordinates of each text entity separately, then fusing them by addition into a single feature vector as the feature representation of the text entity;
the relation modeling specifically comprises: modeling the relations between text entities with a BiLSTM, whose input is the output of the feature encoding stage; before being input into the BiLSTM, the feature encoding vectors are sorted by the relative order in which the texts appear on the certificate, i.e. from top to bottom and from left to right, yielding a feature sequence that the BiLSTM then processes to learn the interrelations and order information among its elements;
the feature decoding specifically comprises: in the decoding and prediction stage, the key text extraction task is treated as a classification task over text entities; a fully connected layer directly predicts the probability of each category for every text entity, and the softmax function then selects the highest-probability category as the classification result of the corresponding text.
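A minimal sketch of the three stages, assuming the visual, character, and position codes are already computed and sorted top-to-bottom, left-to-right; feature dimensions and the class count are illustrative:

    import torch.nn as nn

    class KeyTextExtractor(nn.Module):
        def __init__(self, dim=512, hidden=256, num_classes=10):
            super().__init__()
            self.bilstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(2 * hidden, num_classes)

        def forward(self, f_vis, f_txt, f_pos):
            f = f_vis + f_txt + f_pos          # feature encoding: fusion by addition
            h, _ = self.bilstm(f)              # relation modelling over the sorted sequence
            logits = self.classifier(h)        # per-entity category scores
            return logits.softmax(dim=-1).argmax(dim=-1)   # class index per text entity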
7. The deep learning-based general certificate structured recognition method according to claim 6, wherein encoding the visual image, text characters, and position coordinates of a text entity comprises visual feature encoding, text character feature encoding, and position feature encoding, wherein:
the visual feature encoding specifically comprises: mapping the detected text position information to the corresponding region of the extracted feature map via ROIAlign and extracting the feature vector at that position as the visual feature; or cropping the text image from the original image according to the detected text position information and running the feature extraction network on it again to obtain the visual feature;
the text character feature encoding specifically comprises: first adding an extra <cls> character at the head of the character string obtained by text recognition, so as to capture the overall information of the whole text; then encoding each character of the string into a 512-dimensional word vector through Embedding; finally obtaining the text encoding result through a Transformer encoder, whose output at the first position, i.e. the <cls> position, represents the text character feature encoding of the whole string;
the position feature encoding specifically comprises: first normalising the coordinates by the width and height of the picture, then encoding the normalised coordinates through a linear layer to obtain the position feature vector; specifically, let the coordinates of the text box be p_box = (x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4), and let the picture width be w and the picture height be h; the text box coordinates are first normalised to obtain

x_i' = x_i / w,  y_i' = y_i / h  (i = 1, 2, 3, 4)

p'_box = (x_1', y_1', x_2', y_2', x_3', y_3', x_4', y_4')

and the normalised coordinates are passed through a linear layer to obtain the position feature f_p = max(0, W p'_box + b), where W ∈ R^{8×C} and b are learnable parameters and max denotes the ReLU activation function.
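A short sketch of this position feature code, normalising the eight corner coordinates by the picture's width and height and projecting through a linear layer with ReLU; the output dimension C = 512 is an assumption:

    import torch
    import torch.nn as nn

    linear = nn.Linear(8, 512)   # W in R^{8 x C} and bias b, both learnable

    def position_feature(p_box, w, h):
        # p_box: (N, 8) tensor of corners (x1, y1, ..., x4, y4)
        scale = torch.tensor([w, h] * 4, dtype=p_box.dtype)
        p_norm = p_box / scale                # x_i' = x_i / w, y_i' = y_i / h
        return torch.relu(linear(p_norm))     # f_p = max(0, W p'_box + b)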
8. The deep learning-based general certificate structured recognition method according to any one of claims 1 to 7, wherein acquiring certificate image information and preprocessing the acquired image to obtain a standardized certificate image specifically comprises:
acquiring certificate image information and preprocessing the acquired image, including brightness correction, background cropping, and rotation correction, to obtain a standardized certificate image; the brightness correction obtains a certificate image with uniform brightness; the background cropping removes the certificate background and obtains the certificate image with the background removed; the rotation correction corrects the orientation of the certificate image.
9. The deep learning-based general certificate structured recognition method according to any one of claims 1 to 7, wherein structuring the key text extraction result specifically comprises:
reading the stored text position coordinates, text recognition results, and text classification results line by line and converting them into a structured form.
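As an illustration of this structuring step, assuming each stored line carries its box coordinates, recognised string, and predicted class (the field names here are hypothetical):

    import json

    def structure_results(lines):
        # lines: iterable of dicts read from the stored results file
        record = {}
        for line in lines:
            record[line["class"]] = {
                "value": line["text"],          # recognised text content
                "position": line["box"],        # detected position coordinates
            }
        return json.dumps(record, ensure_ascii=False, indent=2)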
10. A general certificate structured recognition system based on deep learning is characterized by comprising an image acquisition module, an image preprocessing module, a text detection module, a text recognition module, a key text extraction module and an information processing and displaying module; wherein:
the image acquisition module is used for acquiring a certificate image and transmitting the certificate image to a computer terminal for processing;
the image preprocessing module comprises brightness correction, background cropping, and rotation correction operations, and is used for extracting from the original image a certificate image with uniform brightness, no background, and correct orientation;
the text detection module is used for detecting a text area on the certificate image;
the text recognition module is used for recognizing text content in the detected text area;
the key text extraction module is used for classifying the recognized texts and removing non-key texts;
the information processing and displaying module is used for performing structuring processing on the key text, displaying it, and uploading it to a service system for processing.
CN202211409647.0A 2022-11-10 2022-11-10 General certificate structured recognition method and system based on deep learning Pending CN115713776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211409647.0A CN115713776A (en) 2022-11-10 2022-11-10 General certificate structured recognition method and system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211409647.0A CN115713776A (en) 2022-11-10 2022-11-10 General certificate structured recognition method and system based on deep learning

Publications (1)

Publication Number Publication Date
CN115713776A true CN115713776A (en) 2023-02-24

Family

ID=85232891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211409647.0A Pending CN115713776A (en) 2022-11-10 2022-11-10 General certificate structured recognition method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN115713776A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116776909A (en) * 2023-08-28 2023-09-19 Sichuan Xingdian Network Technology Co., Ltd. Bottle cap two-dimensional code traceability system
CN116776909B (en) * 2023-08-28 2023-11-03 Sichuan Xingdian Network Technology Co., Ltd. Bottle cap two-dimensional code traceability system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination