CN112818951B - A method of ticket identification - Google Patents
A method of ticket identification
- Publication number
- CN112818951B CN202110265378.4A
- Authority
- CN
- China
- Prior art keywords
- text
- network
- recognition
- text line
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Abstract
The invention discloses a ticket recognition method relating to the technical fields of text detection, text recognition, and structured information extraction, and solves the technical problem that existing models cannot effectively extract structured information. The key points of the technical solution are: a CTPN network is trained to obtain a text line position detection model, which locates the key information in a ticket and is robust to tickets of various forms (tables, etc.); data synthesized from high-frequency words and the rules of the text content of specific fields expands the training data of the text recognition model and improves its accuracy; and the method is based on convolutional neural networks, which parallelize well, so computation can be accelerated with high-performance GPUs (Graphics Processing Units).
Description
Technical Field

The present disclosure relates to the technical fields of text detection, text recognition, and structured information extraction, and in particular to a ticket recognition method.

Background Art

Ticket recognition refers to the technology of recognizing images containing textual information in different domains, such as the invoices, ID cards, and bank cards common in daily life, and extracting structured information from them. Because tickets span many domains and their formats are complex and varied, recognition and structured extraction face many difficulties.

The structured ticket recognition task can be subdivided into research tasks in several fields, including text detection and text recognition. The mainstream approach in current text detection combines deep-learning object detection or segmentation algorithms with the text detection task. EAST, for example, adopts the FCN (Fully Convolutional Networks) structure commonly used in semantic segmentation; in essence it still regresses the text-box parameters, using the FCN architecture for feature extraction and feature fusion. The EAST model then predicts a set of text line regression parameters at each position in the image and finally extracts the text lines of the input image with a non-maximum suppression operation. This approach greatly simplifies the text detection pipeline, but similar methods still detect long text poorly and handle small text regions badly, and precisely these problems are critical in ticket recognition.

Current text recognition methods fall mainly into two categories: character recognition and sequence recognition. Character recognition first segments individual characters out of the image, classifies each character image with a classifier, and finally merges the results into text line-level recognition results. Sequence-based text recognition algorithms instead take the entire text line as the smallest unit of recognition and recognize the whole character sequence with automatic alignment, introducing the Seq2Seq model and attention mechanism from natural language processing to improve recognition. Both methods have their own problems: character recognition requires character-level supervision and therefore a large amount of annotation work, while the robustness of sequence-based methods is strongly affected by the training data, and they are prone to misrecognition on images with complex backgrounds and on similar-looking characters.

Therefore, for the structured ticket recognition task, current methods do not consider the extraction of structured information, and the unorganized information they produce cannot be used directly in subsequent work, so the above problems remain to be researched and solved.
Summary of the Invention

The present disclosure provides a ticket recognition method whose technical purpose is to build a model that can effectively extract structured information, addressing problems in tickets such as inconsistent image styles, non-uniform table formats, and unclear printing.

The above technical purpose of the present disclosure is achieved through the following technical solution:

A ticket recognition method, comprising a model training process and a text recognition process, the model training process including:

S100: collecting data for text line detection and text image recognition, wherein the data includes text line images;

S101: collecting high-frequency words that appear in various ticket scenarios, building a keyword database from the high-frequency words, compiling the rules of the text content of specific fields among the high-frequency words, and randomly generating expanded data according to the high-frequency words and the rules;

S102: training a CTPN network with the text line images to obtain a text line position detection model;

S103: training a recognition network with the data and the expanded data to obtain a text recognition model with a self-attention mechanism;

the text recognition process including:

S200: inputting an image of a ticket into the text line position detection model, which detects the positions of the text lines in the ticket and outputs text images of the detected text line positions;

S201: inputting the text images into the text recognition model for text recognition, recognizing the text through the self-attention mechanism of the text recognition model to obtain recognition results, and performing structured extraction on the recognition results according to the keyword database to obtain the effective information.

The beneficial effects of the present disclosure are as follows: the invention trains a CTPN network to obtain a text line position detection model, thereby locating the key information in a ticket while remaining robust to tickets of various forms (tables, etc.); it expands the training data of the text recognition model by synthesizing data from high-frequency words and the rules of the text content of specific fields, improving the recognition model's accuracy; and it is based on convolutional neural networks, which parallelize well, so computation can be accelerated with high-performance GPUs (Graphics Processing Units).
Brief Description of the Drawings

Figures 1 and 2 are flowcharts of the model training process of the ticket recognition method according to the present invention;

Figures 3 and 4 are flowcharts of the text recognition process of the ticket recognition method according to the present invention;

Figure 5 is a structural diagram of the text recognition model;

Figure 6 is a schematic flowchart of the text line positioning, text recognition, and structured extraction provided by an embodiment of the present invention.
Detailed Description of Embodiments

The technical solution of the present disclosure is described in detail below with reference to the accompanying drawings. In the description of the present disclosure, it should be understood that the terms "first", "second", and "third" are used for descriptive purposes only; they cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features, and serve only to distinguish different components.

Figures 1 and 2 are flowcharts of the model training process of the ticket recognition method according to the present invention. As shown in Figures 1 and 2, the model training process includes: S100: collecting data for text line detection and text image recognition, wherein the data includes text line images.

Specifically, the data for text line detection and text image recognition can be collected by retrieving, from the text detection and recognition research field, a large number of public, accurately annotated, multilingual text line detection sets and text image recognition datasets. Data that differ significantly from the ticket recognition scenario are screened out of the collected datasets, abnormal data are marked and removed, and the cleaned data are used to train the CTPN (Connectionist Text Proposal Network) network and the recognition network.

S101: collecting high-frequency words that appear in various ticket scenarios, building a keyword database from the high-frequency words, compiling the rules of the text content of specific fields among the high-frequency words, and randomly generating expanded data according to the high-frequency words and the rules.
Specifically, randomly generating expanded data according to the high-frequency words and the rules includes: (1) combining the high-frequency words whose word frequency is not less than a preset threshold to generate text; (2) assembling the text into the specific formats that text in tickets follows; (3) randomly selecting a blank or noisy image as the background and rendering the formatted text onto the image, so that the resulting image of the text constitutes the expanded data.
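As an illustration of steps (1) to (3), the following is a minimal Python sketch of such synthesis, assuming Pillow for rendering; the font path, the "key: value" field format, and the Gaussian noise model are assumptions of this sketch, not details given in the patent.

```python
# Hypothetical sketch of step S101's data synthesis; font path, field
# format, and noise level are illustrative assumptions.
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def synthesize_text_line(high_freq_words, font_path="simhei.ttf",
                         height=32, noise_sigma=10.0):
    # (1) Combine high-frequency words into a candidate text line.
    text = "".join(random.sample(high_freq_words, k=random.randint(1, 3)))
    # (2) Format it like a ticket field, e.g. "key: value" (assumed format).
    text = f"{text}: {random.randint(0, 9999):04d}"
    font = ImageFont.truetype(font_path, size=height - 8)
    width = int(font.getlength(text)) + 16
    # (3) Blank grayscale background with additive Gaussian noise.
    background = np.full((height, width), 255, dtype=np.float32)
    background += np.random.normal(0.0, noise_sigma, background.shape)
    img = Image.fromarray(np.clip(background, 0, 255).astype(np.uint8))
    ImageDraw.Draw(img).text((8, 4), text, fill=0, font=font)
    return img, text  # image plus its ground-truth label

# Usage: image, label = synthesize_text_line(["金额", "发票号码", "日期"])
```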
Both the data and the expanded data described above are in fact image data, and the CTPN network and the recognition network are trained directly on features extracted from this image data.

S102: training the CTPN network with the text line images to obtain the text line position detection model.

S103: training the recognition network with the data and the expanded data to obtain the text recognition model.

Figures 3 and 4 are flowcharts of the text recognition process of the ticket recognition method according to the present invention. As shown in Figures 3 and 4, the text recognition process includes: S200: inputting an image of a ticket into the text line position detection model, which detects the positions of the text lines in the ticket and outputs text images of the detected text line positions.

S201: inputting the text images into the text recognition model for text recognition, recognizing the text through the self-attention mechanism of the text recognition model to obtain recognition results, and performing structured extraction on the recognition results according to the keyword database to obtain the effective information.
Specifically, performing structured extraction to obtain effective information includes: computing the edit distance between each keyword and the recognition results, generating an edit distance matrix, matching each keyword to the paired recognition result with the smallest edit distance, and determining the position of the keyword in the recognition results according to the paired recognition result, thereby obtaining the effective information. When a keyword matches no paired recognition result, a default value is returned; that is, the recognition rate is not 100%, and whenever a keyword cannot be matched to the paired recognition result with the smallest edit distance, a default value is returned in its place. Matching the output of the deep neural network to keyword information by minimum edit distance effectively improves the reliability of the results.
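The following Python sketch illustrates this matching as described: a standard dynamic-programming edit distance fills the matrix, and each keyword takes the recognized line at minimum distance, falling back to a default value when no credible pairing exists. The rejection threshold and the default value are illustrative assumptions.

```python
# Sketch of the keyword matching in S201; standard Levenshtein DP.
def edit_distance(a: str, b: str) -> int:
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def match_keywords(keywords, recognized_lines, default="N/A"):
    """Build the edit-distance matrix and pair each keyword with the
    recognized line closest to it; unmatched keywords get the default."""
    matrix = [[edit_distance(k, line) for line in recognized_lines]
              for k in keywords]
    result = {}
    for k, row in zip(keywords, matrix):
        best = min(range(len(row)), key=row.__getitem__) if row else None
        # Reject pairings too distant to be credible (assumed threshold).
        if best is None or row[best] > len(k):
            result[k] = default
        else:
            result[k] = recognized_lines[best]
    return result
```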
As a specific embodiment, step S102 includes:

S102-1: the CTPN network includes a convolutional neural network, an LSTM (Long Short-Term Memory) network, and a 1×1 convolutional layer connected in sequence; each text line includes at least two text line components, and multiple preset anchor boxes with a fixed width of 16 and different heights are preset in the convolutional neural network to locate the text line components.

S102-2: the CTPN network is trained with an initial learning rate of 0.001 and a momentum of 0.9, and the text line images are fed into the CTPN network for training.
In the forward propagation of the CTPN network, a convolutional neural network (e.g., VGG16) first extracts features from the input text line images, producing a first feature map of size N×C×H×W. A 3×3 convolution is then applied on the first feature map at the position corresponding to each preset anchor box, producing a second feature map of size N×9C×H×W. The second feature map is reshaped to NH×W×9C and fed into the LSTM network, which learns the sequence features of each row of the second feature map and outputs a third feature map of size NH×W×256. The third feature map is transformed to dimension N×512×H×W and finally passed through the 1×1 convolutional layer to obtain the prediction result. Here N is the number of text line images processed at a time, H the height of the text line images, W their width, and C the number of channels during the network's forward propagation.
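The following PyTorch sketch traces the stated shapes (N×C×H×W → N×9C×H×W → NH×W×9C → NH×W×256 → N×512×H×W). It is a shape-level illustration only: the VGG16 backbone is named in the text, but the bidirectional LSTM with hidden size 128 (giving the stated 256-dimensional output), the 1×1 projection standing in for the 256-to-512 transform, and the four-values-per-anchor output layout are assumptions of this sketch.

```python
# Shape-level sketch of the CTPN forward pass described above.
import torch.nn as nn
import torchvision

class CTPNSketch(nn.Module):
    def __init__(self, c=512):
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None)
        self.backbone = vgg.features[:-1]           # -> N x 512 x H x W
        self.rpn_conv = nn.Conv2d(c, 9 * c, 3, padding=1)  # anchor context
        self.bilstm = nn.LSTM(9 * c, 128, bidirectional=True,
                              batch_first=True)     # 2 * 128 = 256 outputs
        self.project = nn.Conv2d(256, 512, 1)       # stand-in for the
                                                    # 256 -> 512 transform
        # 4 values per anchor: v_j, v_h, s_i, x_side (layout assumed).
        self.head = nn.Conv2d(512, 9 * 4, 1)

    def forward(self, x):
        f = self.backbone(x)                        # N x C x H x W
        f = self.rpn_conv(f)                        # N x 9C x H x W
        n, c9, h, w = f.shape
        f = f.permute(0, 2, 3, 1).reshape(n * h, w, c9)    # NH x W x 9C
        f, _ = self.bilstm(f)                       # NH x W x 256
        f = f.reshape(n, h, w, 256).permute(0, 3, 1, 2)    # N x 256 x H x W
        f = self.project(f)                         # N x 512 x H x W
        return self.head(f)                         # per-anchor predictions
```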
S102-3: after the prediction result is obtained, the loss of the CTPN network is computed with a first loss function, the parameters of the CTPN network are updated with the SGD (stochastic gradient descent) optimizer, and the text line images are fed into the CTPN network with the updated parameters for further training. This process is repeated until the best prediction result is obtained, and the best model parameters corresponding to that result are saved, yielding the text line position detection model.

The first loss function is Loss = λ_v × L_v + λ_conf × L_conf + λ_x × L_x, where L_v is the vertical-coordinate loss, i.e., the Smooth L1 loss between the center-point coordinate and height of the preset anchor box and those of the actual anchor box; L_conf is the confidence loss, i.e., the binary cross-entropy loss between the confidence of the preset anchor box and whether the actual anchor box contains a text line component; L_x is the horizontal-offset loss, i.e., the Smooth L1 loss between the predicted offsets of the horizontal coordinate and width of the text line in the anchor box and the actual offsets; and λ_v, λ_conf, λ_x are weights.

The outputs for the text line components at each preset anchor box position include v_j, v_h, s_i, and x_side, where v_j and v_h are the center-point coordinate and height of the preset anchor box, s_i is the confidence that the preset anchor box contains a text line component, and x_side is the offset of the horizontal coordinate and width of the text line component.
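A minimal sketch of the first loss function under PyTorch assumptions; the λ values shown and the tensor layout are illustrative, since the patent does not specify the weights.

```python
# Sketch of Loss = λv*Lv + λconf*Lconf + λx*Lx for the CTPN training step.
import torch
import torch.nn.functional as F

def ctpn_loss(pred_v, gt_v, pred_conf, gt_conf, pred_x, gt_x,
              lambda_v=1.0, lambda_conf=1.0, lambda_x=2.0):
    # L_v: Smooth L1 on anchor center coordinate and height (v_j, v_h).
    l_v = F.smooth_l1_loss(pred_v, gt_v)
    # L_conf: binary cross-entropy on whether an anchor holds a text part.
    l_conf = F.binary_cross_entropy_with_logits(pred_conf, gt_conf)
    # L_x: Smooth L1 on the horizontal/width offset x_side.
    l_x = F.smooth_l1_loss(pred_x, gt_x)
    return lambda_v * l_v + lambda_conf * l_conf + lambda_x * l_x

# Training step under the patent's settings (SGD, lr 0.001, momentum 0.9):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
```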
As a specific embodiment, step S103 includes:

S103-1: the recognition network includes a feature extraction network, a feature fusion network, an encoding network, a single fully connected layer, and a decoding algorithm connected in sequence, as shown in Figure 5.

S103-2: the recognition network is trained with an initial learning rate of 0.0001 and Adam optimizer beta values of (0.9, 0.999), and the data and the expanded data are fed into the recognition network for training.

In the forward propagation of the recognition network, an image of size H×W is passed through the feature extraction network to obtain the first features.

The first features are then fused by the feature fusion network, and the fused first features are sampled so that their height becomes 1, yielding the second features.

The second features are input into the encoding network to obtain the encoded features.

The encoded features are input into the fully connected layer for decoding, producing the decoding result.

Finally, the decoding result is aligned by the decoding algorithm to obtain the recognition result.
Here the feature extraction network is a Resnet50 network, the feature fusion network is an FPEM (Feature Pyramid Enhancement Module) network, the encoding network is an Encoder network, and the decoding algorithm is the CTC (Connectionist Temporal Classification) algorithm. The loss function of the CTC algorithm is L_CTC = -log Σ_{c∈C: k(c)=Y'} Π_t p(c_t|Y), where Y denotes the decoding result, Y' the correctly annotated recognition result, t the sequence length of the encoded features, and k the alignment function of the CTC network; C: k(c)=Y' means that every sequence c in the set C yields the correctly annotated recognition result Y' under the CTC algorithm, p denotes probability, and p(c_t|Y) is the probability of obtaining the sequence c_t of length t given Y.
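A hedged skeleton of this pipeline, assuming PyTorch: ResNet50 is named in the text, but the 1×1 convolution standing in for the FPEM fusion, the two-layer Transformer encoder configuration, and the height-averaging used to sample the fused features down to height 1 are assumptions of this sketch.

```python
# Skeleton of the recognition network in S103-1:
# ResNet50 features -> fusion -> encoder -> FC -> CTC.
import torch.nn as nn
import torchvision

class RecognizerSketch(nn.Module):
    def __init__(self, num_classes, d_model=512):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.features = nn.Sequential(*list(resnet.children())[:-2])
        self.fuse = nn.Conv2d(2048, d_model, 1)     # stand-in for FPEM
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fc = nn.Linear(d_model, num_classes)   # classes incl. CTC blank

    def forward(self, x):                           # x: N x 3 x 32 x W
        f = self.fuse(self.features(x))             # N x d x h x w
        f = f.mean(dim=2)                           # sample height down to 1
        f = f.permute(0, 2, 1)                      # N x w x d (a sequence)
        f = self.encoder(f)
        return self.fc(f).log_softmax(-1)           # log-probs for nn.CTCLoss

# nn.CTCLoss(blank=0) would consume these log-probs permuted to (T, N, C).
```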
The Resnet50 network is a residual network that extracts visual features from the image. The FPEM network is a convolutional network that fuses multi-stage visual features; fusing multi-stage features enlarges the model's receptive field and thereby improves its accuracy. The Encoder network is a feature encoding network based on the self-attention mechanism; self-attention lets the model extract the effective information in the features more accurately, improving the robustness of the text recognition model. The CTC algorithm decodes the output sequence: for example, the output sequence "cccaaat" becomes "cat" after CTC alignment.
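The "cccaaat" to "cat" alignment corresponds to greedy CTC decoding, sketched below; the blank symbol "-" and the second example string are illustrative.

```python
# Minimal greedy CTC decoding: collapse repeated symbols, then drop blanks.
def ctc_greedy_decode(tokens, blank="-"):
    decoded, prev = [], None
    for t in tokens:
        if t != prev and t != blank:   # collapse runs, skip the blank
            decoded.append(t)
        prev = t
    return "".join(decoded)

print(ctc_greedy_decode("cccaaat"))    # -> "cat"
print(ctc_greedy_decode("cc-c-aat"))   # -> "ccat" (blank splits repeats)
```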
The Encoder network is the encoder part of the Transformer, a model widely used in both natural language processing and computer vision. It owes its excellent feature-capture performance to stackable encoding modules, each containing a Multi-Head Attention part and a Feed Forward part. The Multi-Head Attention part is expressed mathematically as follows:
Multi-Head Attention(x)=x+Self-Attention(FC(x),FC(x),FC(x));Multi-Head Attention(x)=x+Self-Attention(FC(x),FC(x),FC(x));
Self-Attention(Q, K, V) = softmax(QK^T/√d_k)V, where the Encoder's input passes through three fully connected layers FC and enters the Self-Attention module as Q, K, and V respectively, d_k is the dimension of the input, and T denotes matrix transposition. The feed-forward part consists of one fully connected layer FC, one ReLU activation function, and one more fully connected layer FC.
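A sketch of one encoding module matching the two formulas above, written in PyTorch; it is single-head for brevity (the patent's module is multi-head), and the residual connection around the feed-forward part follows the standard Transformer rather than anything stated here.

```python
# One encoder block: x + Self-Attention(FC(x), FC(x), FC(x)), then FC-ReLU-FC.
import math
import torch
import torch.nn as nn

class EncoderBlockSketch(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model)   # the three FC projections
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))  # FC-ReLU-FC

    def forward(self, x):                        # x: N x T x d_model
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        attn = torch.softmax(scores, dim=-1) @ v # softmax(QK^T/sqrt(dk))V
        x = x + attn                             # x + Self-Attention(...)
        return x + self.ff(x)                    # feed-forward (residual
                                                 # assumed, per standard use)
```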
S103-3: after the recognition result is obtained, the loss of the recognition network is computed with the loss function of the CTC algorithm, the parameters of the recognition network are updated with the Adam optimizer, and the data and the expanded data are fed into the recognition network with the updated parameters for further training. This process is repeated until the best recognition result is obtained, and the best model parameters corresponding to that result are saved, yielding the text recognition model.

Figure 6 is a schematic flowchart of the text line positioning, text recognition, and structured extraction provided by an embodiment of the present invention. A single ticket image is input into the text line position detection model (the CTPN model) loaded with the best parameters to obtain the text line detection results, and redundant text boxes are filtered out with a confidence threshold, yielding the text positioning boxes of the key regions of the image.
When recognizing text line content, the height of a text line image is generally adjusted to 32 pixels before it is sent to the text recognition model, specifically: (1) the text line image is scaled with its original aspect ratio preserved, so that the scaled image height is h' = 32 and the scaled image width is w' = w × (h'/h), where w and h are the original width and height of the image; (2) the single image is input into the text recognition model loaded with the best parameters to obtain the recognition vector; (3) the recognition vector is processed by the CTC decoding algorithm to obtain the text sequence with the highest confidence.
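A sketch of the scaling in step (1), assuming OpenCV; the formula w' = w × (h'/h) is taken from the text, while the interpolation mode is an assumption.

```python
# Aspect-preserving resize of a text line image to a 32-pixel height.
import cv2

def resize_text_line(image, target_height=32):
    h, w = image.shape[:2]
    new_w = max(1, int(round(w * (target_height / h))))  # w' = w * (h'/h)
    return cv2.resize(image, (new_w, target_height),
                      interpolation=cv2.INTER_LINEAR)
```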
Structured extraction is then performed to obtain the effective information: (1) the edit distance between each keyword and the text recognition results is computed, a larger edit distance meaning a lower degree of matching; (2) an edit distance matrix is generated, and each keyword is matched to the pairing with the smallest edit distance; (3) the position of the keyword in the recognition results is determined from the pairing, and the text content is obtained. Finally, the located key information is extracted and organized by type into structured data for output; if a keyword is not matched, the default value obtained from statistics is used as a supplement.

The above are exemplary embodiments of the present disclosure; the protection scope of the present disclosure is defined by the claims and their equivalents.
Claims (3)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110265378.4A CN112818951B (en) | 2021-03-11 | 2021-03-11 | A method of ticket identification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110265378.4A CN112818951B (en) | 2021-03-11 | 2021-03-11 | A method of ticket identification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112818951A CN112818951A (en) | 2021-05-18 |
CN112818951B true CN112818951B (en) | 2023-11-21 |
Family
ID=75863141
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110265378.4A Active CN112818951B (en) | 2021-03-11 | 2021-03-11 | A method of ticket identification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112818951B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113255645B (en) * | 2021-05-21 | 2024-04-23 | 北京有竹居网络技术有限公司 | Text line picture decoding method, device and equipment |
CN113255646B (en) * | 2021-06-02 | 2022-10-18 | 北京理工大学 | A real-time scene text detection method |
CN113298179B (en) * | 2021-06-15 | 2024-05-28 | 南京大学 | Customs commodity abnormal price detection method and device |
CN113657377B (en) * | 2021-07-22 | 2023-11-14 | 西南财经大学 | Structured recognition method for mechanical bill image |
CN113591772B (en) * | 2021-08-10 | 2024-01-19 | 上海杉互健康科技有限公司 | Method, system, equipment and storage medium for structured identification and input of medical information |
CN115019327B (en) * | 2022-06-28 | 2024-03-08 | 珠海金智维信息科技有限公司 | Fragment bill recognition method and system based on fragment bill segmentation and Transformer network |
CN115713777A (en) * | 2023-01-06 | 2023-02-24 | 山东科技大学 | Contract document content identification method |
CN116912852B (en) * | 2023-07-25 | 2024-10-01 | 京东方科技集团股份有限公司 | Method, device and storage medium for identifying text of business card |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108446621A (en) * | 2018-03-14 | 2018-08-24 | 平安科技(深圳)有限公司 | Bank slip recognition method, server and computer readable storage medium |
CN108921166A (en) * | 2018-06-22 | 2018-11-30 | 深源恒际科技有限公司 | Medical bill class text detection recognition method and system based on deep neural network |
CN110097049A (en) * | 2019-04-03 | 2019-08-06 | 中国科学院计算技术研究所 | A kind of natural scene Method for text detection and system |
CN110263694A (en) * | 2019-06-13 | 2019-09-20 | 泰康保险集团股份有限公司 | A kind of bank slip recognition method and device |
CN110399845A (en) * | 2019-07-29 | 2019-11-01 | 上海海事大学 | A method for detecting and recognizing text in continuous segments in images |
CN110807455A (en) * | 2019-09-19 | 2020-02-18 | 平安科技(深圳)有限公司 | Bill detection method, device, device and storage medium based on deep learning |
CN110866495A (en) * | 2019-11-14 | 2020-03-06 | 杭州睿琪软件有限公司 | Bill image recognition method, bill image recognition device, bill image recognition equipment, training method and storage medium |
CN111340034A (en) * | 2020-03-23 | 2020-06-26 | 深圳智能思创科技有限公司 | Text detection and identification method and system for natural scene |
CN111832423A (en) * | 2020-06-19 | 2020-10-27 | 北京邮电大学 | A kind of bill information identification method, device and system |
CN112115934A (en) * | 2020-09-16 | 2020-12-22 | 四川长虹电器股份有限公司 | Bill image text detection method based on deep learning example segmentation |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108446621A (en) * | 2018-03-14 | 2018-08-24 | 平安科技(深圳)有限公司 | Bank slip recognition method, server and computer readable storage medium |
WO2019174130A1 (en) * | 2018-03-14 | 2019-09-19 | 平安科技(深圳)有限公司 | Bill recognition method, server, and computer readable storage medium |
CN108921166A (en) * | 2018-06-22 | 2018-11-30 | 深源恒际科技有限公司 | Medical bill class text detection recognition method and system based on deep neural network |
CN110097049A (en) * | 2019-04-03 | 2019-08-06 | 中国科学院计算技术研究所 | A kind of natural scene Method for text detection and system |
CN110263694A (en) * | 2019-06-13 | 2019-09-20 | 泰康保险集团股份有限公司 | A kind of bank slip recognition method and device |
CN110399845A (en) * | 2019-07-29 | 2019-11-01 | 上海海事大学 | A method for detecting and recognizing text in continuous segments in images |
CN110807455A (en) * | 2019-09-19 | 2020-02-18 | 平安科技(深圳)有限公司 | Bill detection method, device, device and storage medium based on deep learning |
CN110866495A (en) * | 2019-11-14 | 2020-03-06 | 杭州睿琪软件有限公司 | Bill image recognition method, bill image recognition device, bill image recognition equipment, training method and storage medium |
CN111340034A (en) * | 2020-03-23 | 2020-06-26 | 深圳智能思创科技有限公司 | Text detection and identification method and system for natural scene |
CN111832423A (en) * | 2020-06-19 | 2020-10-27 | 北京邮电大学 | A kind of bill information identification method, device and system |
CN112115934A (en) * | 2020-09-16 | 2020-12-22 | 四川长虹电器股份有限公司 | Bill image text detection method based on deep learning example segmentation |
Non-Patent Citations (5)
Title |
---|
Financial Ticket Intelligent Recognition System Based on Deep Learning; Fukang Tian et al.; arXiv; 1-15 *
Ticket Text Detection and Recognition Based on Deep Learning; Xiuxin Chen et al.; 2019 Chinese Automation Congress; 1-5 *
A Survey of Natural Scene Text Detection and Recognition Based on Deep Learning; Wang Jianxin, Wang Ziya, Tian Xuan; Journal of Software; Vol. 31, No. 05; 1465-1496 *
Design and Implementation of Table-Type Work Order Recognition Based on Deep Learning; Pan Wei, Liu Fengwei; Digital Technology & Application; Vol. 38, No. 07; 150-152 *
A Scene Text Detection Model Based on High-Resolution Convolutional Neural Networks; Chen Miaomiao, Xu Jinhua; Computer Applications and Software; Vol. 37, No. 10; 138-144 *
Also Published As
Publication number | Publication date |
---|---|
CN112818951A (en) | 2021-05-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112818951B (en) | A method of ticket identification | |
CN108664996B (en) | A method and system for ancient text recognition based on deep learning | |
US20230106873A1 (en) | Text extraction method, text extraction model training method, electronic device and storage medium | |
CN107729865A (en) | A kind of handwritten form mathematical formulae identified off-line method and system | |
CN112464781A (en) | Document image key information extraction and matching method based on graph neural network | |
CN115311463B (en) | Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system | |
CN109635808B (en) | A Method for Extracting Chinese Keyword and Context in Natural Scene Images | |
CN111680684B (en) | Spine text recognition method, device and storage medium based on deep learning | |
CN114647715A (en) | Entity recognition method based on pre-training language model | |
CN106127222A (en) | The similarity of character string computational methods of a kind of view-based access control model and similarity determination methods | |
CN111488732A (en) | Deformed keyword detection method, system and related equipment | |
CN109460725A (en) | Receipt consumption details content mergence and extracting method | |
Tang et al. | HRCenterNet: an anchorless approach to Chinese character segmentation in historical documents | |
CN114898372A (en) | Vietnamese scene character detection method based on edge attention guidance | |
CN117373042A (en) | Card image structuring processing method and device | |
Essa et al. | Enhanced technique for Arabic handwriting recognition using deep belief network and a morphological algorithm for solving ligature segmentation | |
CN116645694A (en) | Text-target retrieval method based on dynamic self-evolution information extraction and alignment | |
CN115019319A (en) | A Structured Image Content Recognition Method Based on Dynamic Feature Extraction | |
CN114419636A (en) | Text recognition method, device, equipment and storage medium | |
CN116343237A (en) | Bill identification method based on deep learning and knowledge graph | |
CN111242060B (en) | Method and system for extracting key information of document image | |
CN117975216A (en) | Salient object detection method based on multi-modal feature refinement and fusion | |
CN116110047A (en) | Method and system for constructing structured electronic medical records based on OCR-NER | |
CN116580388A (en) | End-to-end text recognition method | |
CN113221885A (en) | Hierarchical modeling method and system based on whole words and radicals |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |