CN112818951A - Ticket identification method - Google Patents
- Publication number
- CN112818951A (application number CN202110265378.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- network
- recognition
- text line
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
- G06F40/216—Parsing using statistical methods
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
- G06V30/10—Character recognition
Abstract
The invention discloses a ticket identification method, relating to the technical fields of text detection, text recognition and structured information extraction, and solving the technical problem that existing models cannot effectively extract structured information. Data are synthesized from high-frequency words and the rules of specific-field text content, expanding the training data of the text recognition model and improving its accuracy. Being based on convolutional neural networks, the method parallelizes well and can be accelerated with a high-performance GPU (Graphics Processing Unit).
Description
Technical Field
The disclosure relates to the technical fields of text detection, text recognition and structured information extraction, and in particular to a ticket recognition method.
Background
Ticket identification refers to the technology of recognizing images that contain text information from different fields, such as the invoices, identity cards and bank cards common in daily life, and extracting the structured information in them. Because tickets come from many different fields, their formats are complicated, which brings numerous difficulties to identification and structured extraction.
The ticket structured recognition task can be subdivided into research tasks in several fields, such as text detection and text recognition. The mainstream approach in text detection combines an object detection or segmentation algorithm from deep learning with the text detection task. EAST, for example, adopts the FCN (Fully Convolutional Network) structure commonly used in semantic segmentation: it regresses text box parameters based on a regression idea, completes feature extraction and feature fusion by means of the FCN architecture, predicts a group of text line regression parameters at each position in the image, and finally extracts the text lines in the input image by non-maximum suppression. This greatly simplifies the character detection pipeline, but similar methods still detect long texts poorly and have weak detection capability for small text areas, and these are among the more critical problems in ticket identification.
Current methods in text recognition fall mainly into character recognition and sequence recognition. In character recognition, single characters are first segmented from the image, the single-character images are then classified by a classifier, and the results are finally combined into text-line-level recognition results. Text recognition algorithms based on sequence recognition take the whole text line as the minimum unit of recognition and recognize the whole text sequence through automatic alignment, often introducing the Seq2Seq model and the attention mechanism from natural language processing to improve accuracy. Both approaches have problems, however: character recognition needs character-level supervision information and therefore a large amount of labeling work, while the robustness of sequence recognition is strongly affected by the training data, and recognition errors easily occur for images with complicated backgrounds and for similar characters.
Moreover, for the task of ticket structured identification, current methods do not consider the problem of structuring the extracted information, so the disordered information they produce cannot be used directly in subsequent work. The above problems therefore remain to be researched and solved.
Disclosure of Invention
The disclosure provides a ticket identification method that aims to establish a model capable of effectively extracting structured information, addressing problems in tickets such as inconsistent image styles, inconsistent form formats and unclear printing.
The technical purpose of the present disclosure is achieved by the following technical solutions:
A ticket recognition method comprises a model training process and a text recognition process, the model training process comprising:
s100: collecting data for text line detection and text image recognition; wherein the data comprises a text line image;
s101: collecting high-frequency words appearing in various ticket scenes, establishing a keyword database through the high-frequency words, counting rules of specific field text contents in the high-frequency words, and randomly generating expansion data according to the high-frequency words and the rules;
s102: training the CTPN network through the text line image to obtain a text line position detection model;
s103: training a recognition network through the data and the expansion data to obtain a text recognition model with a self-attention mechanism;
the text recognition process includes:
s200, inputting the image of the ticket into a text line position detection model, detecting the text line position in the ticket by the text line position detection model, and outputting the text image of which the text line position is detected;
s201: and inputting the text image into a text recognition model for text recognition, recognizing the text through a self-attention mechanism of the text recognition model to obtain a recognition result, and performing structured extraction on the recognition result according to the keyword database to obtain effective information.
The beneficial effects of this disclosure are as follows: the text line position detection model is obtained by training the CTPN network, so key information in the ticket can be located, with robustness to tickets of various forms (tables and the like); data are synthesized from high-frequency words and the rules of specific-field text content, expanding the training data of the text recognition model and improving its accuracy; and being based on convolutional neural networks, the method parallelizes well and can be accelerated with a high-performance GPU (Graphics Processing Unit).
Drawings
FIGS. 1 and 2 are flow charts of model training processes of a method for ticket identification according to the present invention;
FIGS. 3 and 4 are flow charts of text recognition process of a ticket recognition method according to the present invention;
FIG. 5 is a block diagram of a text recognition model;
fig. 6 is a schematic flow chart of text line positioning, text recognition, and structured extraction according to an embodiment of the present invention.
Detailed Description
The technical scheme of the disclosure will be described in detail with reference to the accompanying drawings. In the description of the present disclosure, it should be understood that the terms "first", "second" and "third" are used for descriptive purposes only; they are not to be construed as indicating or implying relative importance or an implicit number of technical features, but merely distinguish different components.
Fig. 1 and 2 are flowcharts of a model training process of a method for ticket identification according to the present invention, and as shown in fig. 1 and 2, the model training process includes: s100: collecting data for text line detection and text image recognition; wherein the data comprises a text line image.
Specifically, when collecting data for text line detection and text image recognition, a large number of public, precisely labeled text line detection datasets and multilingual text image recognition datasets can be obtained by searching the field of text detection and recognition research. Data that differs greatly from the ticket identification scene is screened out of the collected datasets, abnormal samples are marked and removed, and the sorted data is used for training the CTPN (Connectionist Text Proposal Network) network and the recognition network.
S101: collecting high-frequency words appearing in various ticket scenes, establishing a keyword database through the high-frequency words, counting rules of specific field text contents in the high-frequency words, and randomly generating expansion data according to the high-frequency words and the rules.
Specifically, randomly generating the expansion data according to the high-frequency words and the rules includes: (1) combining high-frequency words whose word frequency is not less than a preset threshold to generate text; (2) assembling the text into a specific format that conforms to the text in the ticket; (3) randomly selecting a blank or noisy image as the background and rendering the formatted text onto it, the resulting text image being the expansion data.
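As a rough illustration, steps (1) and (2) above can be sketched in Python. The function name, the pattern strings and the frequency threshold are hypothetical, and step (3), rendering the text onto a background image, is left as a comment since it would require an imaging library:

```python
import random

def synthesize_text_samples(keyword_freq, field_patterns, threshold=5, n=3, seed=0):
    """Sketch of steps (1)-(2): combine high-frequency words and fit them
    into ticket-style field formats. Step (3), rendering onto a blank or
    noisy background image, is omitted here (it would use e.g. PIL)."""
    rng = random.Random(seed)
    # (1) keep only words whose frequency meets the preset threshold
    frequent = [w for w, f in keyword_freq.items() if f >= threshold]
    samples = []
    for _ in range(n):
        word = rng.choice(frequent)
        pattern = rng.choice(field_patterns)
        # (2) assemble text that conforms to a ticket field format
        samples.append(pattern.format(word))
    return samples
```

For example, `synthesize_text_samples({"Invoice No.": 12, "Date": 9, "rare": 1}, ["{}: X"])` would produce strings such as "Date: X" while never using the low-frequency word.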
The data and the expansion data are both image data; the CTPN network and the recognition network are trained directly on features extracted from these images.
S102: and training the CTPN network through the text line image to obtain a text line position detection model.
S103: and training the recognition network through the data and the expansion data to obtain a text recognition model.
Fig. 3 and 4 are flowcharts of a text recognition process of a ticket recognition method according to the present invention, and as shown in fig. 3 and 4, the text recognition process includes: s200, inputting the image of the ticket into a text line position detection model, detecting the text line position in the ticket by the text line position detection model, and outputting the text image of which the text line position is detected.
S201: and inputting the text image into a text recognition model for text recognition, recognizing the text through a self-attention mechanism of the text recognition model to obtain a recognition result, and performing structured extraction on the recognition result according to the keyword database to obtain effective information.
Specifically, performing structured extraction to obtain effective information includes: calculating the edit distance between each keyword and the recognition results, generating an edit distance matrix, pairing each keyword with the recognition result of minimum edit distance, and determining the position of the keyword in the recognition results according to that pairing to obtain the effective information. Because the recognition rate is not 100%, a keyword may fail to be paired with any recognition result; in that case a default value is returned. Matching the output of the deep neural network by minimum edit distance effectively improves the reliability of the keyword information obtained.
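A minimal sketch of this matching scheme, assuming a plain Levenshtein edit distance and an arbitrary cutoff beyond which a keyword counts as unmatched (the patent does not specify the cutoff value):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via a rolling dynamic-programming row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def match_keywords(keywords, recognized_lines, defaults, max_dist=2):
    """Build the edit-distance matrix and pair each keyword with the
    recognized line of minimum distance; fall back to a default value
    when no line is close enough (max_dist is an assumed cutoff)."""
    matrix = [[edit_distance(k, r) for r in recognized_lines] for k in keywords]
    result = {}
    for i, k in enumerate(keywords):
        j = min(range(len(recognized_lines)), key=lambda c: matrix[i][c])
        if matrix[i][j] <= max_dist:
            result[k] = recognized_lines[j]
        else:
            result[k] = defaults.get(k)  # unmatched keyword: return default
    return result
```

For instance, a recognition output "Dote" would still pair with the keyword "Date" at distance 1, while a keyword with no nearby output falls back to its default.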
As a specific embodiment, step S102 includes:
S102-1: the CTPN network comprises a convolutional neural network, an LSTM (Long Short-Term Memory) network and a 1×1 convolutional layer connected in sequence. Each text line comprises at least two text line components, and a plurality of anchor boxes with a fixed width of 16 pixels and different heights are preset in the convolutional neural network for positioning the text line components.
S102-2: and the initial learning rate of the CTPN network training is 0.001, the momentum is 0.9, and the text line image is put into the CTPN network for training.
In the forward propagation of the CTPN network, feature extraction is first performed on the input text line image by a convolutional neural network (such as VGG16), yielding a first feature map of size N×C×H×W. A 3×3 convolution is then applied on the first feature map at the position corresponding to each preset anchor box, yielding a second feature map of size N×9C×H×W. The second feature map is reshaped to NH×W×9C and fed into the LSTM network to learn the sequence feature of each row, producing a third feature map of size NH×W×256. The third feature map is transformed to N×512×H×W and finally passed through the 1×1 convolutional layer to obtain the prediction result. Here, N represents the number of text line images processed each time, H the height, W the width, and C the number of channels of the text line images during forward propagation.
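The chain of reshapes is the error-prone part of this forward pass, so the following numpy sketch traces only the tensor shapes, with the convolutions and LSTM stubbed by random tensors; the fully connected mapping from 256 to 512 channels is an assumption borrowed from the original CTPN design, since a pure reshape cannot change 256 into 512:

```python
import numpy as np

def ctpn_forward_shapes(N=2, C=512, H=16, W=24):
    """Trace tensor shapes through the CTPN forward pass described above.
    Convolutions and the LSTM are stubbed with random tensors; only the
    reshapes between stages are real."""
    first = np.random.rand(N, C, H, W)          # backbone output: N x C x H x W
    second = np.random.rand(N, 9 * C, H, W)     # 3x3 conv over 9 anchors: N x 9C x H x W
    # each image row becomes one LSTM sequence of length W
    lstm_in = second.transpose(0, 2, 3, 1).reshape(N * H, W, 9 * C)
    third = np.random.rand(N * H, W, 256)       # bidirectional LSTM output: NH x W x 256
    # an assumed fully connected layer (as in the original CTPN) maps 256 -> 512
    fc = np.random.rand(256, 512)
    out = (third @ fc).reshape(N, H, W, 512).transpose(0, 3, 1, 2)  # N x 512 x H x W
    return lstm_in.shape, out.shape             # the 1x1 conv then predicts from out
```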
S102-3: after the prediction result is obtained, calculating the loss of the CTPN network according to a first loss function, updating the parameters of the CTPN network by using an optimizer SGD (stochastic gradient descent), putting the text row image into the CTPN network with the updated parameters for training, repeating the process repeatedly until the optimal prediction result is obtained, and storing the optimal model parameters corresponding to the optimal prediction result to obtain the text row position detection model;
wherein the first loss function is Loss = λ_v×L_v + λ_conf×L_conf + λ_x×L_x, where L_v denotes the ordinate loss, i.e. the Smooth L1 loss between the center point ordinate and height of the preset anchor box and those of the actual anchor box; L_conf denotes the confidence loss, i.e. the binary cross-entropy loss, between the preset anchor box's confidence that a text line component is present and the actual anchor box; L_x denotes the abscissa offset loss, i.e. the Smooth L1 loss between the predicted and actual offsets of the horizontal coordinate and width of the text line in the anchor box; and λ_v, λ_conf, λ_x are weights;
the output for the text line component at each preset anchor box position comprises v_j, v_h, s_i and x_side, where v_j and v_h denote the center point coordinate and height of the preset anchor box, s_i denotes the confidence that the preset anchor box contains a text line component, and x_side denotes the offset of the lateral coordinate and width of the text line component.
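A numeric sketch of the first loss function, with Smooth L1 and binary cross-entropy implemented directly; the λ weights default to equal values here, which is an assumption, since the patent leaves them unspecified:

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth L1 loss, elementwise mean: quadratic below 1, linear above."""
    d = np.abs(pred - target)
    return np.mean(np.where(d < 1.0, 0.5 * d * d, d - 0.5))

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy on confidences clipped into (0, 1)."""
    p = np.clip(pred, eps, 1 - eps)
    return np.mean(-(target * np.log(p) + (1 - target) * np.log(1 - p)))

def ctpn_loss(v_pred, v_true, s_pred, s_true, x_pred, x_true,
              lam_v=1.0, lam_conf=1.0, lam_x=1.0):
    """Weighted sum of the three terms of the first loss function:
    ordinate loss, confidence loss, and abscissa offset loss."""
    return (lam_v * smooth_l1(v_pred, v_true)
            + lam_conf * bce(s_pred, s_true)
            + lam_x * smooth_l1(x_pred, x_true))
```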
As a specific embodiment, step S103 includes:
s103-1: the identification network comprises a feature extraction network, a feature fusion network, a coding network, a full connection layer and a decoding algorithm which are connected in sequence, and is shown in fig. 5.
S103-2: the initial learning rate of the recognition network is 0.0001, the beta value of an optimizer Adam is (0.9,0.999), and the data and the expansion data are put into the recognition network for training;
in the forward propagation process of the identification network, carrying out feature extraction on the image with the size of H multiplied by W through the feature extraction network to obtain a first feature;
fusing the first feature through the feature fusion network, and sampling the fused first feature to enable the height of the fused first feature to be 1, so as to obtain a second feature;
inputting the second characteristic into the coding network for coding to obtain a coding characteristic;
inputting the coding characteristics into the full-connection layer for decoding to obtain a decoding result;
finally, aligning the decoding results through the decoding algorithm to obtain an identification result;
wherein the feature extraction network is a Resnet50 network, the feature fusion network is an FPEM (Feature Pyramid Enhancement Module) network, the coding network is an Encoder network, and the decoding algorithm is the CTC (Connectionist Temporal Classification) algorithm. The loss function of the CTC algorithm is Loss = −log p(Y′|Y) = −log Σ_{c: k(c)=Y′} Π_t p(c_t|Y), where Y denotes the decoding result, Y′ denotes the correctly labeled recognition result, t indexes the sequence of coding features, k denotes the alignment function of the CTC network, c: k(c)=Y′ denotes all sequences c in the set C that yield the correctly labeled recognition result Y′ through the CTC algorithm, p denotes probability, and p(c_t|Y) denotes the probability of obtaining element c_t at position t given Y.
The Resnet50 network is a residual network for extracting visual image features. The FPEM network is a convolutional network that fuses multi-stage visual image features; fusing multi-stage features enlarges the receptive field of the model and thereby improves its accuracy. The Encoder network is a feature coding network based on a self-attention mechanism, which lets the model extract the effective information in features more accurately and thereby improves the robustness of the text recognition model. The CTC algorithm is a decoding algorithm for the output sequence; for example, the output sequence cccaaat becomes cat after alignment by the CTC algorithm.
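The collapse rule behind that example (merge consecutive repeats, then drop blanks) takes only a few lines; the blank symbol `-` is an assumption, since the patent's example contains no blanks:

```python
def ctc_collapse(sequence, blank="-"):
    """Collapse a CTC output path: merge consecutive repeated symbols,
    then drop the blank symbol."""
    out = []
    prev = None
    for ch in sequence:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)
```

So `ctc_collapse("cccaaat")` yields "cat" as in the text, and a blank lets a genuine double letter survive, e.g. `ctc_collapse("aa--aa")` yields "aa".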
The Encoder network is the encoder part of the Transformer model widely applied in natural language processing and computer vision; its stackable encoding modules give it excellent feature capture performance. Each encoding module comprises a Multi-Head Attention part and a Feed Forward part, with the Multi-Head Attention part expressed as:
Multi-Head Attention(x)=x+Self-Attention(FC(x),FC(x),FC(x));
wherein the input of the Encoder, after passing through three fully connected layers FC, serves as the Q, K and V inputs of the Self-Attention module, Self-Attention(Q, K, V) = softmax(QK^T/√d_k)V, where d_k is the input dimension and T denotes the matrix transpose; the Feed Forward part consists of one fully connected layer FC, one Relu activation function and one further fully connected layer FC.
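A single-head numpy sketch of the attention computation above, folding the three FC projections into the weight matrices wq, wk, wv for brevity; this illustrates the standard scaled dot-product formula with the residual connection, not the patent's exact implementation:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention: softmax(QK^T / sqrt(d_k)) V,
    single head, for illustration only."""
    q, k, v = x @ wq, x @ wk, x @ wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    # numerically stable row-wise softmax over the attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def encoder_block(x, wq, wk, wv):
    """Residual connection as in Multi-Head Attention(x) = x + Self-Attention(...);
    the three FC layers are folded into wq/wk/wv here."""
    return x + self_attention(x, wq, wk, wv)
```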
S103-3: and after the recognition result is obtained, calculating the loss of the recognition network through a loss function of the CTC algorithm, updating parameters of the recognition network by using an optimizer Adam, inputting the data and the expanded data into the recognition network with the updated parameters for training, repeating the process repeatedly until the optimal recognition result is obtained, and storing the optimal model parameters corresponding to the optimal recognition result to obtain the text recognition model.
Fig. 6 is a schematic flow chart of text line positioning, text recognition and structured extraction according to an embodiment of the present invention. A single ticket image is input into the text line position detection model (the CTPN model) loaded with the optimal parameters to obtain text line detection results, and redundant text boxes are filtered by a confidence threshold to obtain the text positioning boxes of the key areas on the image.
When recognizing text line content, the height of the text line image is generally adjusted to 32 pixels before it is sent to the text recognition model, specifically: (1) scale the text line image while keeping its original aspect ratio, so that the scaled height h′ is 32 and the scaled width w′ is w×(h′/h), where w and h are the original width and height of the image; (2) input the single image into the text recognition model loaded with the optimal parameters to obtain a recognition vector; (3) process the recognition vector with the CTC decoding algorithm to obtain the text sequence with the highest confidence.
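The aspect-ratio-preserving resize in step (1) reduces to one line of arithmetic; rounding the width to the nearest integer (with a floor of 1) is an assumption, since the patent does not state how fractional widths are handled:

```python
def scaled_size(w, h, target_h=32):
    """Keep the aspect ratio: h' = target_h, w' = w * (h'/h),
    rounded to a whole pixel count and clamped to at least 1."""
    return max(1, round(w * target_h / h)), target_h
```

For example, a 128×64 text line image is resized to 64×32.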
Structured extraction is then performed to obtain effective information: (1) calculate the edit distance between each keyword and the text recognition results, a larger edit distance meaning a lower matching degree; (2) generate an edit distance matrix and find, for each keyword, the pairing with minimum edit distance; (3) determine the position of the keyword in the recognition results according to that pairing to obtain the text content. Finally, the located key information is extracted, organized into structured data according to its corresponding type and output; any key information that is not matched is filled in with a statistically obtained default value.
The foregoing is an exemplary embodiment of the present disclosure, and the scope of the present disclosure is defined by the claims and their equivalents.
Claims (6)
1. A method of ticket recognition, characterized by a model training process and a text recognition process, the model training process comprising:
s100: collecting data for text line detection and text image recognition; wherein the data comprises a text line image;
s101: collecting high-frequency words appearing in various ticket scenes, establishing a keyword database through the high-frequency words, counting rules of specific field text contents in the high-frequency words, and randomly generating expansion data according to the high-frequency words and the rules;
s102: training the CTPN network through the text line image to obtain a text line position detection model;
s103: training a recognition network through the data and the expansion data to obtain a text recognition model with a self-attention mechanism;
the text recognition process includes:
s200, inputting the image of the ticket into a text line position detection model, detecting the text line position in the ticket by the text line position detection model, and outputting the text image of which the text line position is detected;
s201: and inputting the text image into a text recognition model for text recognition, recognizing the text through a self-attention mechanism of the text recognition model to obtain a recognition result, and performing structured extraction on the recognition result according to the keyword database to obtain effective information.
2. The method according to claim 1, wherein randomly generating expansion data according to the high-frequency words and the rules in step S101 comprises:
combining the high-frequency words with the word frequency not less than a preset threshold value to generate a text;
combining the texts into a specific format which accords with the texts in the ticket;
and randomly selecting a blank or noisy image as a background, and rendering the text conforming to a specific format on the image to obtain an image of the text, namely the expansion data.
3. The method according to claim 1, wherein the performing of structured extraction to obtain valid information in step S201 includes:
calculating the editing distance between each keyword and the recognition result, generating an editing distance matrix, matching a matched recognition result with the minimum editing distance for each keyword, and determining the position of the keyword in the recognition result according to the matched recognition result to obtain the effective information;
and when the keyword is not matched with the pairing identification result, returning a default value.
4. A method of ticket identification according to any of claims 1-3, wherein step S102 comprises:
s102-1: the CTPN network comprises a convolutional neural network, an LSTM network and a 1 x 1 convolutional layer which are sequentially connected; each text line comprises at least two text line components, and a plurality of preset anchor boxes with fixed width as 16 and different heights are preset in the convolutional neural network and are used for positioning the text line components;
s102-2: the initial learning rate of the CTPN network training is 0.001, the momentum is 0.9, and the text line image is put into the CTPN network for training;
in the forward propagation of the CTPN network, feature extraction is first performed on the input text line image by the convolutional neural network, yielding a first feature map of size N×C×H×W; a 3×3 convolution is then applied on the first feature map at the position corresponding to each preset anchor box, yielding a second feature map of size N×9C×H×W; the second feature map is reshaped to NH×W×9C and fed into the LSTM network to learn the sequence feature of each row, producing a third feature map of size NH×W×256; the third feature map is transformed to N×512×H×W and finally passed through the 1×1 convolutional layer to obtain the prediction result; wherein N denotes the number of text line images processed each time, H the height, W the width, and C the number of channels of the text line images during forward propagation;
S102-3: after the prediction result is obtained, calculating the loss of the CTPN network according to a first loss function, updating the parameters of the CTPN network with the optimizer SGD, putting the text line images into the updated CTPN network for training, and repeating this process until the optimal prediction result is obtained; the optimal model parameters corresponding to the optimal prediction result are saved to obtain the text line position detection model;
wherein the first loss function is: Loss = λv·Lv + λconf·Lconf + λx·Lx, where Lv denotes the vertical-coordinate loss, i.e. the Smooth L1 loss between the centre-point ordinate and height of the preset anchor box and those of the actual anchor box; Lconf denotes the confidence loss, i.e. the binary cross-entropy between the preset anchor box's confidence that it contains a text line component and that of the actual anchor box; Lx denotes the horizontal-offset loss, i.e. the Smooth L1 loss between the offset values of the horizontal coordinate and width of the text line in the predicted anchor box and those in the actual anchor box; λv, λconf and λx are weights;
the output result at each preset anchor box position comprises vj, vh, si and xside, where vj and vh denote the centre-point ordinate and height of the preset anchor box, si denotes the confidence that the preset anchor box contains a text line component, and xside denotes the offset values of the horizontal coordinate and width of the text line component.
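As a hedged illustration of how the first loss function combines its three terms, the following NumPy sketch implements Smooth L1 and binary cross-entropy and sums them with unit weights; the anchor values and weights are hypothetical toy data, not taken from the patent:

```python
import numpy as np

def smooth_l1(pred, target):
    # Smooth L1: 0.5*d^2 for |d| < 1, |d| - 0.5 otherwise, averaged
    d = np.abs(pred - target)
    return float(np.mean(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)))

def bce(p, y, eps=1e-7):
    # binary cross-entropy between predicted confidence p and label y
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def ctpn_loss(v_pred, v_gt, s_pred, s_gt, x_pred, x_gt,
              lam_v=1.0, lam_conf=1.0, lam_x=1.0):
    # Loss = lam_v*Lv + lam_conf*Lconf + lam_x*Lx (the first loss function)
    return (lam_v * smooth_l1(v_pred, v_gt)      # Lv: centre-y and height
            + lam_conf * bce(s_pred, s_gt)       # Lconf: text/no-text confidence
            + lam_x * smooth_l1(x_pred, x_gt))   # Lx: side-refinement offsets

# toy values for two anchors (hypothetical, not from the patent)
v_pred, v_gt = np.array([0.1, -0.2]), np.array([0.0, 0.0])
s_pred, s_gt = np.array([0.9, 0.2]), np.array([1.0, 0.0])
x_pred, x_gt = np.array([0.05]), np.array([0.0])
total = ctpn_loss(v_pred, v_gt, s_pred, s_gt, x_pred, x_gt)
print(round(total, 4))
```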
5. The ticket identification method of claim 4, wherein step S103 comprises:
s103-1: the recognition network comprises a feature extraction network, a feature fusion network, an encoding network, a fully connected layer and a decoding algorithm which are connected in sequence;
s103-2: the initial learning rate of the recognition network is 0.0001 and the beta values of the Adam optimizer are (0.9, 0.999); the data and the augmented data are put into the recognition network for training;
in the forward propagation process of the recognition network, feature extraction is performed on an image of size H×W by the feature extraction network to obtain a first feature;
the first feature is fused by the feature fusion network, and the fused first feature is downsampled until its height is 1, yielding a second feature;
the second feature is input into the encoding network for encoding to obtain an encoded feature;
the encoded feature is input into the fully connected layer for decoding to obtain a decoding result;
finally, the decoding results are aligned by the decoding algorithm to obtain a recognition result;
the feature extraction network is a ResNet50 network, the feature fusion network is an FPEM network, the encoding network is an Encoder network, and the decoding algorithm is the CTC algorithm, whose loss function is Loss_CTC = −ln p(Y′|Y) = −ln Σ_{c∈C: k(c)=Y′} Π_{t=1}^{T} p(c_t|Y), where Y denotes the decoding result, Y′ denotes the correctly labelled recognition result, T denotes the sequence length of the encoded feature, k denotes the alignment function of the CTC network, C: k(c)=Y′ denotes the set of all sequences c that yield the correctly labelled recognition result Y′ under the CTC algorithm, p denotes probability, and p(c_t|Y) denotes the probability of obtaining symbol c_t at step t given Y;
s103-3: after the recognition result is obtained, the loss of the recognition network is calculated with the loss function of the CTC algorithm and the parameters of the recognition network are updated with the Adam optimizer; the data and the augmented data are then input into the recognition network with the updated parameters for training, and this process is repeated until the optimal recognition result is obtained; the optimal model parameters corresponding to the optimal recognition result are saved to obtain the text recognition model.
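The CTC loss above (negative log of the total probability of all frame-wise paths that collapse to the label) can be checked by brute-force enumeration for tiny sequence lengths. This sketch only illustrates the formula; a training-time implementation would use the forward-backward dynamic program, and the probabilities below are invented toy values:

```python
import numpy as np
from itertools import product

def collapse(path, blank=0):
    # CTC alignment function k: merge repeated symbols, then drop blanks
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return tuple(out)

def ctc_loss(probs, target, blank=0):
    # -ln of the summed probability of every length-T path that collapses
    # to `target` (brute force over all K^T paths; fine only for tiny T)
    T, K = probs.shape
    total = 0.0
    for path in product(range(K), repeat=T):
        if collapse(path, blank) == tuple(target):
            total += float(np.prod([probs[t, path[t]] for t in range(T)]))
    return -np.log(total)

# 3 frames over a 2-symbol alphabet {0: blank, 1: 'a'}; label "a"
probs = np.array([[0.4, 0.6],
                  [0.5, 0.5],
                  [0.3, 0.7]])
loss = ctc_loss(probs, [1])
print(round(float(loss), 4))
```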
6. The method according to claim 5, wherein in step S201, when a text line image is recognized by the text recognition model, the height of the text line image is first scaled to 32 pixels before the image is sent to the text recognition model for recognition.
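A minimal sketch of this height-normalisation step, using nearest-neighbour resizing in plain NumPy as a stand-in for whatever image library an implementation would actually use (cv2, PIL, etc.); the input size is a made-up example:

```python
import numpy as np

def resize_height_32(img):
    # nearest-neighbour resize of an H x W grayscale text-line image to
    # height 32, preserving the aspect ratio (stand-in for cv2/PIL resize)
    h, w = img.shape
    new_h = 32
    new_w = max(1, round(w * new_h / h))
    rows = (np.arange(new_h) * h // new_h).astype(int)
    cols = (np.arange(new_w) * w // new_w).astype(int)
    return img[np.ix_(rows, cols)]

# hypothetical 64 x 200 text-line crop
img = np.random.default_rng(0).integers(0, 256, size=(64, 200), dtype=np.uint8)
out = resize_height_32(img)
print(out.shape)
```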
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110265378.4A CN112818951B (en) | 2021-03-11 | 2021-03-11 | Ticket identification method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112818951A true CN112818951A (en) | 2021-05-18 |
CN112818951B CN112818951B (en) | 2023-11-21 |
Family
ID=75863141
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110265378.4A Active CN112818951B (en) | 2021-03-11 | 2021-03-11 | Ticket identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112818951B (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108446621A (en) * | 2018-03-14 | 2018-08-24 | 平安科技(深圳)有限公司 | Bank slip recognition method, server and computer readable storage medium |
WO2019174130A1 (en) * | 2018-03-14 | 2019-09-19 | 平安科技(深圳)有限公司 | Bill recognition method, server, and computer readable storage medium |
CN108921166A (en) * | 2018-06-22 | 2018-11-30 | 深源恒际科技有限公司 | Medical bill class text detection recognition method and system based on deep neural network |
CN110097049A (en) * | 2019-04-03 | 2019-08-06 | 中国科学院计算技术研究所 | Natural scene text detection method and system |
CN110263694A (en) * | 2019-06-13 | 2019-09-20 | 泰康保险集团股份有限公司 | Bank slip recognition method and device |
CN110399845A (en) * | 2019-07-29 | 2019-11-01 | 上海海事大学 | Method for detecting and recognizing continuous segmented text in images |
CN110807455A (en) * | 2019-09-19 | 2020-02-18 | 平安科技(深圳)有限公司 | Bill detection method, device and equipment based on deep learning and storage medium |
CN110866495A (en) * | 2019-11-14 | 2020-03-06 | 杭州睿琪软件有限公司 | Bill image recognition method, bill image recognition device, bill image recognition equipment, training method and storage medium |
CN111340034A (en) * | 2020-03-23 | 2020-06-26 | 深圳智能思创科技有限公司 | Text detection and identification method and system for natural scene |
CN111832423A (en) * | 2020-06-19 | 2020-10-27 | 北京邮电大学 | Bill information identification method, device and system |
CN112115934A (en) * | 2020-09-16 | 2020-12-22 | 四川长虹电器股份有限公司 | Bill image text detection method based on deep learning example segmentation |
Non-Patent Citations (5)
Title |
---|
FUKANG TIAN et al.: "Financial Ticket Intelligent Recognition System Based on Deep Learning", arXiv, pages 1-15 *
XIUXIN CHEN et al.: "Ticket Text Detection and Recognition Based on Deep Learning", 2019 Chinese Automation Congress, pages 1-5 *
PAN Wei; LIU Fengwei: "Design and Implementation of Table-Type Work Order Recognition Based on Deep Learning", Digital Technology and Application, vol. 38, no. 07, pages 150-152 *
WANG Jianxin; WANG Ziya; TIAN Xuan: "A Survey of Natural Scene Text Detection and Recognition Based on Deep Learning", Journal of Software, vol. 31, no. 05, pages 1465-1496 *
CHEN Miaomiao; XU Jinhua: "A Scene Text Detection Model Based on High-Resolution Convolutional Neural Networks", Computer Applications and Software, vol. 37, no. 10, pages 138-144 *
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113255645B (en) * | 2021-05-21 | 2024-04-23 | 北京有竹居网络技术有限公司 | Text line picture decoding method, device and equipment |
CN113255645A (en) * | 2021-05-21 | 2021-08-13 | 北京有竹居网络技术有限公司 | Method, device and equipment for decoding text line picture |
CN113255646B (en) * | 2021-06-02 | 2022-10-18 | 北京理工大学 | Real-time scene text detection method |
CN113255646A (en) * | 2021-06-02 | 2021-08-13 | 北京理工大学 | Real-time scene text detection method |
CN113298179A (en) * | 2021-06-15 | 2021-08-24 | 南京大学 | Customs commodity abnormal price detection method and device |
CN113298179B (en) * | 2021-06-15 | 2024-05-28 | 南京大学 | Customs commodity abnormal price detection method and device |
CN113657377B (en) * | 2021-07-22 | 2023-11-14 | 西南财经大学 | Structured recognition method for mechanical bill image |
CN113657377A (en) * | 2021-07-22 | 2021-11-16 | 西南财经大学 | Structured recognition method for airplane ticket printing data image |
CN113591772B (en) * | 2021-08-10 | 2024-01-19 | 上海杉互健康科技有限公司 | Method, system, equipment and storage medium for structured identification and input of medical information |
CN113591772A (en) * | 2021-08-10 | 2021-11-02 | 上海杉互健康科技有限公司 | Method, system, equipment and storage medium for structured recognition and entry of medical information |
CN115019327A (en) * | 2022-06-28 | 2022-09-06 | 珠海金智维信息科技有限公司 | Fragment bill recognition method and system based on fragment bill participle and Transformer network |
CN115019327B (en) * | 2022-06-28 | 2024-03-08 | 珠海金智维信息科技有限公司 | Fragment bill recognition method and system based on fragment bill segmentation and Transformer network |
CN115713777A (en) * | 2023-01-06 | 2023-02-24 | 山东科技大学 | Contract document content identification method |
CN116912852A (en) * | 2023-07-25 | 2023-10-20 | 京东方科技集团股份有限公司 | Method, device and storage medium for identifying text of business card |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112818951B (en) | Ticket identification method | |
CN110334705B (en) | Language identification method of scene text image combining global and local information | |
CN111027562A (en) | Optical character recognition method based on multi-scale CNN and RNN combined with attention mechanism | |
CN114155527A (en) | Scene text recognition method and device | |
CN110689012A (en) | End-to-end natural scene text recognition method and system | |
CN113537227B (en) | Structured text recognition method and system | |
Malik et al. | An efficient segmentation technique for Urdu optical character recognizer (OCR) | |
CN115862045B (en) | Case automatic identification method, system, equipment and storage medium based on image-text identification technology | |
CN114677687A (en) | ViT and convolutional neural network fused writing brush font type rapid identification method | |
CN111680684B (en) | Spine text recognition method, device and storage medium based on deep learning | |
CN116311310A (en) | Universal form identification method and device combining semantic segmentation and sequence prediction | |
Zhou et al. | Learning-based scientific chart recognition | |
CN114187595A (en) | Document layout recognition method and system based on fusion of visual features and semantic features | |
CN115116074A (en) | Handwritten character recognition and model training method and device | |
Tang et al. | HRCenterNet: An anchorless approach to Chinese character segmentation in historical documents | |
Tayyab et al. | Recognition of Visual Arabic Scripting News Ticker From Broadcast Stream | |
CN111832497B (en) | Text detection post-processing method based on geometric features | |
CN116524521B (en) | English character recognition method and system based on deep learning | |
CN111242114B (en) | Character recognition method and device | |
CN111507348A (en) | Character segmentation and identification method based on CTC deep neural network | |
CN110929013A (en) | Image question-answer implementation method based on bottom-up entry and positioning information fusion | |
CN113159071B (en) | Cross-modal image-text association anomaly detection method | |
CN116110047A (en) | Method and system for constructing structured electronic medical record based on OCR-NER | |
CN112329389B (en) | Chinese character stroke automatic extraction method based on semantic segmentation and tabu search | |
CN113903043A (en) | Method for identifying printed Chinese character font based on twin metric model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||