CN111507328A - Text recognition and model training method, system, device and readable storage medium - Google Patents

Text recognition and model training method, system, device and readable storage medium

Info

Publication number
CN111507328A
Authority
CN
China
Prior art keywords
dimensional
image
image features
feature
tensor
Prior art date
Legal status
Pending
Application number
CN202010270210.8A
Other languages
Chinese (zh)
Inventor
邬国锐
卿山
王庆庆
Current Assignee
Beijing Aikaka Information Technology Co ltd
Original Assignee
Beijing Aikaka Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Aikaka Information Technology Co ltd
Priority to CN202010270210.8A
Publication of CN111507328A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06K GRAPHICAL DATA READING; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K 7/00 Methods or arrangements for sensing record carriers, e.g. for reading patterns
    • G06K 7/10 Methods or arrangements for sensing record carriers, e.g. for reading patterns by electromagnetic radiation, e.g. optical sensing; by corpuscular radiation
    • G06K 7/14 Methods or arrangements for sensing record carriers, e.g. for reading patterns by electromagnetic radiation, e.g. optical sensing; by corpuscular radiation using light without selection of wavelength, e.g. sensing reflected white light
    • G06K 7/1404 Methods for optical code recognition
    • G06K 7/1439 Methods for optical code recognition including a method step for retrieval of the optical code
    • G06K 7/1443 Methods for optical code recognition including a method step for retrieval of the optical code locating of the code in an image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Electromagnetism (AREA)
  • Toxicology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a text recognition and model training method, system, device and readable storage medium. In the encoding stage of text recognition, the invention extracts image features of a picture to be recognized through a dense convolutional neural network, so that the extracted features are more abstract and contain richer semantic information. By adding two-dimensional position encoding information to the image features, image features containing position information are generated; the added two-dimensional position encoding makes it possible to locate the positions of characters in the image more accurately when the image features are decoded, so that the corresponding text characters can be recognized more accurately and the accuracy of curved-text recognition is improved. In the decoding stage, the image features containing position information are decoded through a transformer decoding layer containing a two-dimensional attention mechanism, which makes full use of the two-dimensional spatial information of the image; training in a weakly supervised manner further improves the accuracy of curved-text recognition.

Figure 202010270210


Description

Text recognition and model training method, system, device and readable storage medium

Technical Field

The present invention relates to the technical field of image processing, and in particular to a text recognition and model training method, system, device and readable storage medium.

Background

In daily work and life, it is often necessary to use computer technology to recognize text on paper documents, for example the text on various bills or the identity information on physical certificates. Image-based text recognition has therefore become an important research topic in computer vision.

At present, the recognition of text printed on paper mainly relies on Optical Character Recognition (OCR) technology, which uses optical and computer techniques to read text printed or written on paper and convert it into a format that humans can understand. The main processing steps of OCR include image preprocessing, layout analysis, text localization (image segmentation), character segmentation and recognition.

Text in natural scenes comes in diverse fonts and shapes and suffers from occlusion, uneven lighting and heavy noise. Curved text in natural scenes in particular, such as curved trademarks and seals, often carries very important information and therefore demands high recognition accuracy. However, the prior art achieves low accuracy on curved text in natural scenes, so improving the accuracy of curved-text recognition in natural scenes has become an urgent technical problem.

Summary of the Invention

The present invention provides a text recognition and model training method, system, device and readable storage medium to overcome the above technical problems in the prior art and to improve the accuracy of curved-text recognition in natural scenes.

A text recognition method provided by the present invention includes:

extracting image features of a picture to be recognized through a dense convolutional neural network;

adding two-dimensional position encoding information to the image features to generate image features containing position information;

decoding the image features containing position information through a transformer decoding layer containing a two-dimensional attention mechanism, to obtain a recognition result.

The present invention also provides a text recognition model training method. The text recognition model includes an encoding module and a decoding module. The encoding module is used to extract image features of a picture to be recognized through a dense convolutional neural network and to add two-dimensional position encoding information to the image features, generating image features containing position information. The decoding module includes a transformer decoding layer containing a two-dimensional attention mechanism, which is used to decode the image features containing position information to obtain a recognition result.

The method includes:

acquiring a training set for natural-scene text recognition, the training set including at least a plurality of pieces of curved-text training data, each piece of curved-text training data including a sample picture containing curved text and its corresponding text annotation information;

training the text recognition model with the training set.

The present invention also provides a text recognition system, including:

an encoding module, used to extract image features of the picture to be recognized through a dense convolutional neural network and to add two-dimensional position encoding information to the image features, generating image features containing position information;

a decoding module, used to decode the image features containing position information through a transformer decoding layer containing a two-dimensional attention mechanism, to obtain a recognition result.

The present invention also provides a text recognition device, comprising:

a processor, a memory, and a computer program stored on the memory and executable on the processor; when the processor runs the computer program, the text recognition method and/or the text recognition model training method described above are implemented.

The present invention also provides a computer-readable storage medium storing a computer program which, when executed, performs the text recognition method and/or the text recognition model training method described above.

In the encoding stage of text recognition, the present invention extracts image features of the picture to be recognized through a dense convolutional neural network, so that the extracted features are more abstract and contain richer semantic information. By adding two-dimensional position encoding information to the image features, image features containing position information are generated; the added two-dimensional position encoding makes it possible to locate the positions of characters in the image more accurately when the image features are decoded, so that the corresponding text characters can be recognized more accurately and the accuracy of curved-text recognition is improved. In the decoding stage, the image features containing position information are decoded through a transformer decoding layer containing a two-dimensional attention mechanism, which makes full use of the two-dimensional spatial information of the image; training in a weakly supervised manner further improves the accuracy of curved-text recognition.

Brief Description of the Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a text recognition method provided in Embodiment 1 of the present invention;

FIG. 2 is a schematic structural diagram of a conventional transformer model provided in Embodiment 1 of the present invention;

FIG. 3 is a schematic structural diagram of the text recognition model provided in Embodiment 1 of the present invention;

FIG. 4 is a flowchart of adding two-dimensional position encoding provided in Embodiment 2 of the present invention;

FIG. 5 is a flowchart of determining a two-dimensional attention vector provided in Embodiment 3 of the present invention;

FIG. 6 is a schematic flow diagram of determining a two-dimensional attention vector provided in Embodiment 3 of the present invention;

FIG. 7 is a flowchart of a text recognition model training method provided in Embodiment 4 of the present invention;

FIG. 8 is a schematic structural diagram of a text recognition system provided in Embodiment 5 of the present invention;

FIG. 9 is a schematic structural diagram of a text recognition system provided in Embodiment 6 of the present invention;

FIG. 10 is a schematic structural diagram of a text recognition device provided in Embodiment 7 of the present invention.

Detailed Description of the Embodiments

In order to make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

The terms "first", "second", etc. in the present invention are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. In the description of the following embodiments, "plurality" means two or more, unless otherwise expressly and specifically defined.

At present, in the field of scene curved-text recognition, the main difficulty lies in "aligning" each text character with the corresponding text region of the image (hereinafter the "alignment" operation), that is, in accurately recognizing the text characters within the image text region. Compared with curved text, this "alignment" operation is relatively simple for conventional straight text. For this technical difficulty, the present invention adopts the following four measures for the "alignment" of text regions: extracting image features with a convolutional neural network; adding two-dimensional position encoding to the extracted image features; using a transformer decoding layer (transformer-decoder) to extract the correlations between characters and to perform the above "alignment" with the image features; and performing the "alignment" between character features and image features with a two-dimensional attention module. Among these, convolutional feature extraction and the transformer-decoder are the basic modules; the two-dimensional attention module is the core of the "alignment" between text characters and image text regions inside the transformer-decoder; and the two-dimensional position encoding is processing added specifically for the two-dimensional attention module, which strengthens the "alignment" effect.

The text recognition method provided in this embodiment is implemented with a text recognition model that adopts an encoder-decoder architecture; the text recognition model includes an encoding module and a decoding module. In the encoder stage, the image features of the picture to be recognized are first extracted through a convolutional neural network, and then the two-dimensional position encoding is added. In the decoder stage, a transformer-decoder receives the output of the encoder and, using a two-dimensional attention mechanism, decodes it to obtain the recognition result.
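The encoder-decoder split described here can be summarized as a simple composition of the two stages. A structural sketch with stand-in stub functions (all names and the toy return values are hypothetical; each stub marks where the corresponding network component would run):

```python
def extract_features(image):
    # Stand-in for the DenseNet encoder: image -> H x W x D feature map.
    # Here: a toy 2 x 3 map of 4-dimensional feature vectors.
    return [[[0.0] * 4 for _ in range(3)] for _ in range(2)]

def add_2d_position_codes(feature_map):
    # Stand-in: add a two-dimensional position code to every pixel's vector.
    return feature_map

def transformer_decode(feature_map):
    # Stand-in for the stacked transformer decoding layers with 2D attention.
    return "recognized text"

def recognize(image):
    """Encoder stage: CNN features plus 2D position codes.
    Decoder stage: transformer-decoder with a 2D attention mechanism."""
    features = extract_features(image)
    features = add_2d_position_codes(features)
    return transformer_decode(features)

assert recognize(object()) == "recognized text"
```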

To make the technical solutions of the present invention clearer, the embodiments of the present invention are described in detail below with reference to the accompanying drawings.

FIG. 1 is a flowchart of a text recognition method provided in Embodiment 1 of the present invention, FIG. 2 is a schematic structural diagram of a conventional transformer model provided in Embodiment 1, and FIG. 3 is a schematic structural diagram of the text recognition model provided in Embodiment 1. As shown in FIG. 1, the text recognition method in this embodiment includes:

Step 10: extract the image features of the picture to be recognized through a dense convolutional neural network.

Human perception of images is abstract and hierarchical: first color and brightness are understood, then local detail features such as edges, corners and straight lines, then more complex information and structures such as textures and geometric shapes, and finally the concept of the whole object is formed. Research on visual mechanisms in visual neuroscience has verified this conclusion: the visual cortex of the animal brain has a hierarchical structure. A convolutional neural network can be regarded as an imitation of the human visual mechanism. It consists of multiple convolutional layers; in each layer, a convolution kernel scans the picture from left to right and from top to bottom and outputs a feature map, i.e., it extracts local features of the picture. As convolutional layers are stacked, the receptive field (the size of the region on the input picture mapped by each pixel of the feature map) gradually increases and the extracted features become more abstract, finally yielding abstract representations of the image at different scales. Since convolutional neural networks shone at the 2012 ImageNet image recognition challenge, they have developed continuously, have been widely used in many fields, have achieved state-of-the-art performance on many problems, and have become the mainstream approach for extracting image features.

In this embodiment, a dense convolutional neural network (DenseNet) is used to extract the image features of the picture to be recognized, obtaining the image features, i.e., the feature map, of the picture to be recognized.
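DenseNet's defining property is that each layer receives the concatenation of all preceding feature maps as input. A minimal pure-Python sketch of this connectivity pattern, tracking channel counts only (the layer count and growth rate below are hypothetical example values; a real implementation would use a deep-learning framework):

```python
def dense_block_channels(in_channels, num_layers, growth_rate):
    """Track how channel counts evolve inside one DenseNet dense block.

    Each layer consumes the concatenation of the block input and all
    previous layers' outputs, and contributes `growth_rate` new channels.
    """
    layer_inputs = []            # input channel count seen by each layer
    channels = in_channels
    for _ in range(num_layers):
        layer_inputs.append(channels)   # this layer sees all features so far
        channels += growth_rate         # its output is concatenated on
    return layer_inputs, channels

# Hypothetical example: a 4-layer block with growth rate 12 on a 64-channel input.
layer_inputs, out_channels = dense_block_channels(64, 4, 12)
assert layer_inputs == [64, 76, 88, 100]   # each layer sees 12 more channels
assert out_channels == 112                 # 64 + 4 * 12
```

This feature reuse is why DenseNet features tend to combine low-level and high-level information, which the text above relies on for richer semantic content.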

Exemplarily, before text recognition is performed, the text recognition model may be trained in advance. The encoding module and the decoding module are trained during model training, yielding a trained dense convolutional neural network used to extract image features in the encoding stage and a trained transformer decoding layer containing the two-dimensional attention mechanism.

In this step, the image features of the picture to be recognized are extracted through the pre-trained dense convolutional neural network.

Step 20: add two-dimensional position encoding information to the image features to generate image features containing position information.

The transformer model is composed entirely of attention mechanisms. The attention mechanism was proposed by Bengio's team in 2014 and has been widely applied to many areas of deep learning in recent years, for example to capture receptive fields on images in computer vision, or to locate key tokens or features in natural language processing (NLP). The transformer abandons the traditional convolutional neural network (CNN) and recurrent neural network (RNN); its entire network structure is composed of attention mechanisms. However, the attention mechanism itself contains no position information, i.e., a transformer would treat a sentence the same no matter where its words are placed, which is of course unrealistic. Introducing position information into the transformer therefore plays an even more important role than in models such as CNNs and RNNs, so a position vector must be added to each feature vector of the image features obtained through the convolutional neural network in order to incorporate position information. However, traditional position encoding acts on one dimension, while the image features in this embodiment are two-dimensional, so one-dimensional position encoding cannot be used directly.

In this embodiment, after the image features (i.e., the feature map) of the picture to be recognized are extracted, two-dimensional position encoding information is added to the image features, which strengthens the representation of two-dimensional spatial positions in the image features and thus further strengthens the ability to "align" image features with character features.

Step 30: decode the image features containing position information through the transformer decoding layer containing the two-dimensional attention mechanism, to obtain the recognition result.

The entire network structure of the traditional transformer model is composed of attention mechanisms; the transformer encoding layer consists of a self-attention layer and a feed-forward neural network (FNN). As shown in FIG. 2, a traditional transformer encoding layer (Encoder #1 and Encoder #2 in FIG. 2 denote two encoding layers) includes a self-attention layer and a feed-forward layer, and a traditional transformer decoding layer ("2X" in FIG. 2 indicates that two decoding layers are stacked; each decoding layer is shown in the dashed box of the decoding-module part of the figure) includes a self-attention layer, an encoder-decoder attention layer and a feed-forward layer. Each sub-layer of the transformer decoding layer (the self-attention layer, the encoder-decoder attention layer and the feed-forward layer) is followed by an "Add & Normalize" layer, representing the residual connection and layer normalization steps. A trainable transformer-based neural network can be built by stacking transformer layers. The transformer solves two shortcomings of the RNN: RNN-related algorithms can only compute sequentially from left to right or from right to left, which first limits the parallelism of the model and second loses information about particularly long-term dependencies during sequential computation. The transformer has no RNN-like sequential structure, so it parallelizes better, and it reduces the distance between any two positions in a sequence to a constant, solving the long-term dependency problem.

The encoding module in this embodiment uses a DenseNet to extract image features and adds two-dimensional position encoding information. The decoding module in this embodiment is formed by stacking multiple transformer decoding layers, the output of each transformer decoding layer serving as the input of the next. Each transformer decoding layer includes a masked multi-head attention layer, a two-dimensional attention layer and a feed-forward neural network layer.

Exemplarily, as shown in FIG. 3, the transformer decoding module in this embodiment may be formed by stacking two or three transformer decoding layers; "3X" in FIG. 3 indicates that three transformer decoding layers are stacked, "Masked Multi-Head Attention" denotes the masked multi-head attention layer, "2D Attention" denotes the two-dimensional attention layer, and "Feed Forward" denotes the feed-forward neural network layer.

As shown in FIG. 3, each sub-layer of the transformer decoding layer (the masked multi-head attention layer, the two-dimensional attention layer and the feed-forward neural network layer) is followed by an "Add & Normalize" layer, representing the residual connection and layer normalization steps. The transformer decoding module also includes a linear layer and a Softmax layer.
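The role of the masked multi-head attention sub-layer is to let each decoded character attend only to the characters before it. A single-head, pure-Python sketch of causally masked scaled dot-product attention (toy dimensions, no learned projections or multi-head split; real implementations use a tensor library):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def masked_attention(queries, keys, values):
    """Causally masked scaled dot-product attention over a character sequence.

    Position i may only attend to positions 0..i, which is what the
    'Masked' part of the decoder's multi-head attention enforces."""
    d = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        # scores against all positions, with future positions masked out
        scores = []
        for j, k in enumerate(keys):
            if j <= i:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
            else:
                scores.append(float("-inf"))   # mask: no peeking ahead
        w = softmax(scores)
        out.append([sum(wj * v[t] for wj, v in zip(w, values))
                    for t in range(len(values[0]))])
    return out

# Toy 3-step sequence of 2-dimensional vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = masked_attention(x, x, x)
# The first output can only attend to the first position:
assert y[0] == x[0]
```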

In addition, for model training, the transformer decoding module also includes an embedding layer. In the training stage, the ground-truth characters are converted by the embedding layer into vector representations in a high-dimensional space, which serve as the input of the first transformer decoding layer.

In this embodiment, the transformer decoding module is used in the decoder stage to extract character features and "align" them with the image features extracted in the encoder stage. Unlike the traditional transformer decoding module, this embodiment replaces the encoder-decoder attention of the original "alignment" operation with a two-dimensional attention mechanism, to further strengthen the alignment between the character features extracted in the decoding stage and the image features extracted in the encoding stage; the result is finally output through a feed-forward neural network.

In text recognition in natural scenes, the traditional attention module that performs the "alignment" operation needs to pool the extracted image features vertically in the encoder stage, thereby losing the spatial information of the image and failing to exploit its two-dimensional spatial information. This embodiment uses a 2D attention mechanism designed for curved text, which can make full use of the spatial information of the image and "align" character features with image features in a weakly supervised manner, realizing the recognition of curved text.
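The 2D attention described here keeps the feature map in its H x W layout instead of pooling it vertically: for each decoded character, a weight is computed at every spatial position and the attended feature is their weighted sum. A pure-Python sketch with a toy 2 x 3 feature map (a plain dot-product score is used as a stand-in; this excerpt does not give the patent's exact score function):

```python
import math

def attend_2d(query, feature_map):
    """Attend over an H x W grid of feature vectors without flattening away
    the spatial layout: every position (h, w) gets its own weight."""
    H, W = len(feature_map), len(feature_map[0])
    d = len(query)
    # one score per spatial position
    scores = [sum(q * f for q, f in zip(query, feature_map[h][w])) / math.sqrt(d)
              for h in range(H) for w in range(W)]
    m = max(scores)
    es = [math.exp(s - m) for s in scores]
    z = sum(es)
    weights = [e / z for e in es]          # softmax over all H*W positions
    # weighted sum of the feature vectors -> one attended vector per character
    out = [0.0] * d
    for idx, wgt in enumerate(weights):
        h, w = divmod(idx, W)
        for t in range(d):
            out[t] += wgt * feature_map[h][w][t]
    return out, weights

# Toy 2 x 3 feature map of 2-d vectors; the query matches position (0, 1).
fmap = [[[0.0, 0.0], [5.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]]
vec, w = attend_2d([4.0, 0.0], fmap)
assert max(w) == w[1]   # position (0, 1) gets the highest weight
```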

In the encoding stage of text recognition, the embodiment of the present invention extracts image features of the picture to be recognized through a dense convolutional neural network, so that the extracted features are more abstract and contain richer semantic information. By adding two-dimensional position encoding information to the image features, image features containing position information are generated; the added two-dimensional position encoding makes it possible to locate the positions of characters in the image more accurately when the image features are decoded, thereby improving the accuracy of curved-text recognition. In the decoding stage, the image features containing position information are decoded through the transformer decoding layer containing the two-dimensional attention mechanism, which makes full use of the two-dimensional spatial information of the image; training in a weakly supervised manner further improves the accuracy of curved-text recognition.

FIG. 4 is a flowchart of adding a two-dimensional position code according to Embodiment 2 of the present invention. On the basis of Embodiment 1 above, in this embodiment, as shown in FIG. 4, step 20 above — adding two-dimensional position coding information to the image features to generate image features containing position information — can be implemented by the following steps 201-203:

Step 201: Generate a two-dimensional position code for each pixel in the image features.

Specifically, this step can be implemented as follows:

Determine position-encoding weights for the horizontal and vertical directions according to the image features; for any pixel in the image features, generate one-dimensional position codes of the pixel in the horizontal and vertical directions respectively; then, using the horizontal and vertical position-encoding weights, compute the weighted sum of the pixel's one-dimensional horizontal and vertical position codes to obtain the pixel's two-dimensional position code.

Exemplarily, for the pixel at height h and width w in the image features (that is, the feature map), the two-dimensional position code of the pixel can be denoted P_hw. The one-dimensional position code of the pixel in the vertical direction can be denoted P^h_{pos_h}, where the superscript h denotes the vertical direction and pos_h is the position of the pixel in the vertical direction; the one-dimensional position code of the pixel in the horizontal direction can be denoted P^w_{pos_w}, where the superscript w denotes the horizontal direction and pos_w is the position of the pixel in the horizontal direction. Denoting the depth of the feature map by D, the one-dimensional position code of the pixel in the vertical direction can be computed by Formula 1 or Formula 2 below:

P^h_{pos_h, 2i} = sin(pos_h / 10000^(2i/D))    (Formula 1)

P^h_{pos_h, 2i+1} = cos(pos_h / 10000^(2i/D))    (Formula 2)

where pos_h is the position of the pixel in the vertical direction; for example, if the feature map has 20 pixels in the vertical direction, pos_h takes values in [0, 19]. P^h_{pos_h, 2i} denotes the 2i-th component of the one-dimensional position-encoding vector corresponding to the pos_h-th pixel in the vertical direction, and P^h_{pos_h, 2i+1} denotes the (2i+1)-th component of that vector; i is a non-negative integer, and the subscripts 2i and 2i+1 take values in [0, D-1], where D is the depth of the feature map. That is, the even (2i) components of a pixel's one-dimensional position-encoding vector are computed by Formula 1, and the odd (2i+1) components are computed by Formula 2.

Trigonometric functions are used because P^h_{pos_h+k} and P^h_{pos_h} are linearly related for any given k, so the encoding can represent the relative positional relationship between pixels in the vertical direction.

Similarly, the one-dimensional position code of the pixel in the horizontal direction can be computed by Formula 3 or Formula 4 below:

P^w_{pos_w, 2i} = sin(pos_w / 10000^(2i/D))    (Formula 3)

P^w_{pos_w, 2i+1} = cos(pos_w / 10000^(2i/D))    (Formula 4)

where pos_w is the position of the pixel in the horizontal direction; for example, if the feature map has 48 pixels in the horizontal direction, pos_w takes values in [0, 47]. P^w_{pos_w, 2i} denotes the 2i-th component of the one-dimensional position-encoding vector corresponding to the pos_w-th pixel in the horizontal direction, and P^w_{pos_w, 2i+1} denotes the (2i+1)-th component of that vector; i is a non-negative integer, and the subscripts 2i and 2i+1 take values in [0, D-1], where D is the depth of the feature map. That is, the even (2i) components of a pixel's one-dimensional position-encoding vector are computed by Formula 3, and the odd (2i+1) components are computed by Formula 4.

Exemplarily, the horizontal position-encoding weight of the image features can be determined by Formula 5 below, and the vertical position-encoding weight by Formula 6 below:

β(E) = sigmoid(max(0, g(E)·W1^w)·W2^w)    (Formula 5)

α(E) = sigmoid(max(0, g(E)·W1^h)·W2^h)    (Formula 6)

where α(E) and β(E) denote the vertical and horizontal position-encoding weights respectively, W1^h, W2^h, W1^w and W2^w are learnable linear weights whose preferred values are obtained through model training, and g(E) is the result of average-pooling the entire feature map.

Further, after the horizontal and vertical position-encoding weights and the pixel's one-dimensional horizontal and vertical position codes have been determined, the two-dimensional position code of the pixel can be determined by Formula 7 below:

P_hw = α(E)·P^h_{pos_h} + β(E)·P^w_{pos_w}    (Formula 7)

where P_hw denotes the two-dimensional position code of the pixel, P^h_{pos_h} denotes the pixel's one-dimensional position code in the vertical direction, and P^w_{pos_w} denotes the pixel's one-dimensional position code in the horizontal direction.

Step 202: Generate a position-encoding tensor for the image features.

Here, the two-dimensional position code of each pixel is a vector. After the two-dimensional position-encoding vector of each pixel in the image features is obtained, the vectors of all pixels are assembled into a single tensor, in which the position of each pixel's two-dimensional position-encoding vector corresponds to the position of that pixel in the image.

Step 203: Add the position-encoding tensor of the image features to the image features to obtain image features containing position information.
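
Steps 201-203 can be sketched as follows. The scalar arguments alpha and beta stand in for the learned weights α(E) and β(E), which in the full model are produced by small trainable layers; the fixed values and zero-valued features used in the example are purely illustrative:

```python
import math

def sinusoidal(pos, D):
    # Even components use sine, odd components use cosine (Formulas 1-4).
    return [math.sin(pos / 10000 ** (2 * (k // 2) / D)) if k % 2 == 0
            else math.cos(pos / 10000 ** (2 * (k // 2) / D))
            for k in range(D)]

def position_tensor(H, W, D, alpha, beta):
    """Formula 7: P_hw = alpha * P^h_{pos_h} + beta * P^w_{pos_w},
    assembled into an H x W x D tensor (step 202)."""
    return [[[alpha * ph + beta * pw
              for ph, pw in zip(sinusoidal(h, D), sinusoidal(w, D))]
             for w in range(W)]
            for h in range(H)]

def add_position(features, pos):
    # Step 203: element-wise addition of the position tensor to the features.
    return [[[f + p for f, p in zip(fv, pv)]
             for fv, pv in zip(frow, prow)]
            for frow, prow in zip(features, pos)]

P = position_tensor(2, 3, 4, alpha=0.5, beta=0.5)
feats = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
out = add_position(feats, P)
print(out[0][0])  # pixel (0,0): 0.5*[0,1,0,1] + 0.5*[0,1,0,1] = [0.0, 1.0, 0.0, 1.0]
```

Because the position tensor has the same height, width and depth as the feature map, the addition in step 203 leaves the feature shape unchanged.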

In this embodiment, two-dimensional position codes are added to the image features of the encoding stage for the benefit of the two-dimensional attention module in the decoding stage. This strengthens the representation of two-dimensional spatial positions in the image features, allowing the two-dimensional attention module to work more effectively and thereby improving the ability to "align" text characters with the text regions of the image.

On the basis of Embodiment 1 or Embodiment 2 above, in this embodiment, as shown in FIG. 3, the text recognition model includes at least one transformer decoding layer containing a two-dimensional attention mechanism, and each transformer decoding layer includes a masked multi-head attention layer, a two-dimensional attention layer and a feed-forward neural network layer.

The processing performed by each transformer decoding layer includes: processing the input character features through the masked multi-head attention layer to obtain first character features; determining a two-dimensional attention vector through the two-dimensional attention layer according to the image features containing position information and the first character features, and adding the two-dimensional attention vector to the first character features to obtain second character features; and feeding the second character features into the feed-forward neural network layer.

Further, the first character features may include one or more feature vectors. If the first character features include multiple feature vectors, the two-dimensional attention layer determines, according to the image features containing position information and the first character features, the two-dimensional attention vector corresponding to each feature vector of the first character features, and then adds the corresponding two-dimensional attention vector to each feature vector of the first character features to obtain the second character features.

FIG. 5 is a flowchart of text recognition according to Embodiment 3 of the present invention, and FIG. 6 is a schematic flowchart of determining a two-dimensional attention vector according to Embodiment 3 of the present invention. As shown in FIG. 5 and FIG. 6, determining the two-dimensional attention vector through the two-dimensional attention layer according to the image features containing position information and the first character features can be implemented by the following steps 301-303:

Step 301: Perform a first convolution on the image features containing position information to obtain a first tensor of size H×W×d, where H, W and d denote the height, width and depth of the first tensor respectively.

Here, the first convolution refers to passing the input image features through a 3×3 convolution.

Step 302: The first character features include at least one feature vector; determine, according to the first tensor, the weight values of each feature vector of the first character features with respect to the image features containing position information. The weight values of a feature vector with respect to the image features containing position information are the weight values of each pixel of those image features.

Specifically, the first character features may include one or more feature vectors; in this step, the weight values of each feature vector of the first character features with respect to the image features containing position information can be determined according to the first tensor.

In this embodiment, as shown in FIG. 6, for each feature vector of the first character features, the weight values of that feature vector with respect to the image features containing position information can be determined as follows:

Perform a second convolution on the feature vector (a 1×1 convolution, as shown in FIG. 6) to obtain a second tensor of size 1×1×d; expand the height and width of the second tensor to obtain a third tensor of size H×W×d, whose height, width and depth are identical to those of the first tensor; add the third tensor to the first tensor and apply an activation function (FIG. 6 uses the tanh function as an example) to obtain a fourth tensor of size H×W×d; perform a third convolution on the fourth tensor (a 1×1 convolution, as shown in FIG. 6) to obtain a fifth tensor of size H×W×1 (not shown in FIG. 6); and apply a two-dimensional softmax to the fifth tensor to obtain the weight values of the feature vector with respect to the image features containing position information, which form an H×W×1 tensor.

Here, the second convolution and the third convolution are different 1×1 convolutions. The activation function may be the hyperbolic tangent (tanh) function, the sigmoid function, or another similar activation function; this embodiment places no specific limitation on it.

Step 303: Compute the weighted sum of the image features containing position information according to the weight value of each pixel to obtain the two-dimensional attention vector corresponding to each feature vector of the first character features.

There may be one or more feature vectors in the first character features; generally, the number of feature vectors in the first character features equals the number of characters.

In this step, for each feature vector of the first character features, the weight values of that feature vector with respect to the image features containing position information are taken as the weight values of each pixel of those image features, and the weighted sum of the image features containing position information is computed according to the weight value of each pixel to obtain the two-dimensional attention vector corresponding to that feature vector.
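
Steps 301-303 can be sketched with NumPy as follows. In this sketch the 3×3 and 1×1 convolutions are reduced to simple per-pixel linear maps, and random matrices stand in for trained parameters; all array names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, d = 4, 8, 16

# Step 301: first tensor from the position-encoded image features
# (a real model applies a 3x3 convolution; a per-pixel linear map
# stands in for it here).
E = rng.standard_normal((H, W, d))      # image features with position code
W_conv1 = rng.standard_normal((d, d))
first = E @ W_conv1                     # H x W x d

# Step 302: weight map for one feature vector of the first character features.
q = rng.standard_normal(d)              # one character feature vector
W_conv2 = rng.standard_normal((d, d))
second = q @ W_conv2                    # 1 x 1 x d, broadcast as the third tensor
fourth = np.tanh(first + second)        # H x W x d
w_conv3 = rng.standard_normal(d)
fifth = fourth @ w_conv3                # H x W (x 1)
exp = np.exp(fifth - fifth.max())       # two-dimensional softmax over all pixels
weights = exp / exp.sum()

# Step 303: weighted sum of the image features over all pixels.
attention_vec = (weights[..., None] * E).sum(axis=(0, 1))  # length-d vector

print(attention_vec.shape)  # (16,)
```

One such attention vector is computed per character feature vector, and each is added to its feature vector to form the second character features.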

This embodiment provides a specific implementation of determining the two-dimensional attention vector. A 2D attention mechanism designed for curved text is used, which makes full use of the spatial information of the image and "aligns" character features with image features in a weakly supervised manner; this further strengthens the alignment between the character features extracted in the decoding stage and the image features extracted in the encoding stage, and can further improve the accuracy of curved text recognition.

FIG. 7 is a flowchart of a text recognition model training method according to Embodiment 4 of the present invention. This embodiment provides a text recognition model training method. The text recognition model includes an encoding module and a decoding module. The encoding module is used to extract image features of the picture to be recognized through a dense convolutional neural network and add two-dimensional position coding information to the image features to generate image features containing position information. The decoding module includes a transformer decoding layer containing a two-dimensional attention mechanism, which is used to decode the image features containing position information to obtain the recognition result.

The text recognition model provided in this embodiment is used to implement the text recognition method provided by any of the embodiments above; for the specific implementation of the text recognition method, see the embodiments above, which will not be repeated here.

In this embodiment, before the text recognition method above is performed, the text recognition model may be trained in advance. As shown in FIG. 7, the text recognition model training method specifically includes the following steps:

Step 40: Obtain a training set for natural-scene text recognition. The training set includes at least multiple pieces of curved-text training data, and each piece of curved-text training data includes a sample picture containing curved text and its corresponding text annotation information.

Exemplarily, in this embodiment the training set includes sample pictures from natural scenes, as rich and varied as possible, together with their text annotation information.

Step 50: Train the text recognition model on the training set.

The encoding module in this embodiment uses DenseNet to extract image features and adds two-dimensional position coding information. The decoding module in this embodiment is formed by stacking multiple transformer decoding layers, with the output of each transformer decoding layer serving as the input of the next; each transformer decoding layer includes a masked multi-head attention layer, a two-dimensional attention layer and a feed-forward neural network layer.

Exemplarily, as shown in FIG. 3, the transformer decoding module in this embodiment may be formed by stacking two or three transformer decoding layers; "3X" in FIG. 3 indicates a stack of three transformer decoding layers, "Masked Multi-Head Attention" denotes the masked multi-head attention layer, "2D Attention" denotes the two-dimensional attention layer, and "Feed Forward" denotes the feed-forward neural network layer.

As shown in FIG. 3, each sub-layer of a transformer decoding layer (the masked multi-head attention layer, the two-dimensional attention layer and the feed-forward neural network layer) is followed by an "Add & Normalize" layer, which denotes the residual connection and layer normalization steps. The transformer decoding module also includes a linear layer and a Softmax layer.

In addition, the transformer decoding module also includes an embedding layer.

In the training phase, the ground-truth characters are converted into vector representations in a high-dimensional space through the embedding layer and used as the input of the first transformer decoding layer.

The text recognition model provided in this embodiment is essentially a classification model, and the final output of the decoder is a probability tensor; therefore, in this embodiment a multi-class cross-entropy loss function can be used to compute the model loss. The loss function is H(p, q) = -∑ p(x) log q(x), where p(x) is 1 for the correct answer and 0 otherwise, and q(x) is the predicted probability of the correct answer item. Thus, for each sample, only the loss term of the correct item contributes.
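
The loss above can be sketched as follows; the class count and probability values are illustrative only:

```python
import math

def cross_entropy(p, q):
    """Multi-class cross entropy H(p, q) = -sum p(x) * log q(x).

    With a one-hot p, only the predicted probability of the correct
    class contributes to the loss."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

p = [0, 1, 0]            # one-hot ground truth: class 1 is the correct answer
q = [0.1, 0.7, 0.2]      # predicted class probabilities from the decoder
loss = cross_entropy(p, q)
print(round(loss, 4))  # -log(0.7) ≈ 0.3567
```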

In addition, in this embodiment, Adam is used as the optimization method for the model. Adam is a first-order optimization algorithm that can replace the traditional stochastic gradient descent (SGD) process; it iteratively updates the neural network weights based on the training data. The Adam algorithm differs from traditional stochastic gradient descent: stochastic gradient descent maintains a single learning rate (alpha) for updating all weights, and the learning rate does not change during training, whereas Adam designs an independent adaptive learning rate for each parameter by computing first-moment and second-moment estimates of the gradients. At the same time, the Adam algorithm is easy to implement and has high computational efficiency and low memory requirements. Therefore, this embodiment preferably uses the Adam algorithm as the optimizer.

FIG. 8 is a schematic structural diagram of a text recognition system according to Embodiment 5 of the present invention. As shown in FIG. 8, the text recognition system in this embodiment includes an encoding module 801 and a decoding module 802.

Specifically, the encoding module 801 is configured to extract image features of the picture to be recognized through a dense convolutional neural network and add two-dimensional position coding information to the image features to generate image features containing position information.

The decoding module 802 is configured to decode the image features containing position information through a transformer decoding layer containing a two-dimensional attention mechanism to obtain the recognition result.

The functional modules above are respectively used to perform the operations corresponding to Embodiment 1 of the method of the present invention and achieve similar functional effects; the details are not repeated here.

On the basis of Embodiment 5 above, in this embodiment, the encoding module 801 is further configured to: generate a two-dimensional position code for each pixel in the image features and generate a position-encoding tensor for the image features; and add the position-encoding tensor of the image features to the image features to obtain image features containing position information.

Optionally, the encoding module 801 is further configured to: determine horizontal and vertical position-encoding weights according to the image features; for any pixel in the image features, generate one-dimensional position codes of the pixel in the horizontal and vertical directions respectively; and compute, according to the horizontal and vertical position-encoding weights, the weighted sum of the pixel's one-dimensional horizontal and vertical position codes to obtain the pixel's two-dimensional position code.

In this embodiment, the text recognition model includes at least one transformer decoding layer containing a two-dimensional attention mechanism, and each transformer decoding layer includes a masked multi-head attention layer, a two-dimensional attention layer and a feed-forward neural network layer.

Specifically, the decoding module 802 is further configured to:

process the input character features through the masked multi-head attention layer to obtain first character features; determine a two-dimensional attention vector through the two-dimensional attention layer according to the image features containing position information and the first character features, and add the two-dimensional attention vector to the first character features to obtain second character features; and feed the second character features into the feed-forward neural network layer.

Optionally, the decoding module 802 is further configured to: perform a first convolution on the image features containing position information to obtain a first tensor of size H×W×d, where H, W and d denote the height, width and depth of the first tensor respectively; the first character features include at least one feature vector, and the weight values of each feature vector of the first character features with respect to the image features containing position information — namely, the weight values of each pixel of those image features — are determined according to the first tensor; and the weighted sum of the image features containing position information is computed according to the weight value of each pixel to obtain the two-dimensional attention vector corresponding to each feature vector of the first character features.

Optionally, the decoding module 802 is further configured to:

perform a second convolution on the feature vector to obtain a second tensor of size 1×1×d; expand the height and width of the second tensor to obtain a third tensor of size H×W×d, whose height, width and depth are identical to those of the first tensor; add the third tensor to the first tensor and apply an activation function to obtain a fourth tensor of size H×W×d; perform a third convolution on the fourth tensor to obtain a fifth tensor of size H×W×1; and apply a two-dimensional softmax to the fifth tensor to obtain the weight values of the feature vector with respect to the image features containing position information.

The functional modules above are respectively used to perform the operations corresponding to Embodiments 2 and 3 of the method of the present invention and achieve similar functional effects; the details are not repeated here.

FIG. 9 is a schematic structural diagram of a text recognition system according to Embodiment 6 of the present invention. On the basis of Embodiment 5 above, in another embodiment of the present invention, as shown in FIG. 9, the text recognition system may further include a model training module 803. The text recognition model includes an encoding module 801 and a decoding module 802. The encoding module 801 is used to extract image features of the picture to be recognized through a dense convolutional neural network and add two-dimensional position coding information to the image features to generate image features containing position information. The decoding module 802 includes a transformer decoding layer containing a two-dimensional attention mechanism, which is used to decode the image features containing position information to obtain the recognition result. The model training module 803 is used to: obtain a training set for natural-scene text recognition, the training set including at least multiple pieces of curved-text training data, each of which includes a sample picture containing curved text and its corresponding text annotation information; and train the text recognition model on the training set.

In addition, in another embodiment of the present invention, the model training module may be implemented as a separate system.

The encoding module 801 and decoding module 802 above are used to perform the operations corresponding to any one of Embodiments 1 to 3 of the method of the present invention, and the model training module 803 above is used to perform the operations corresponding to Embodiment 4 of the method of the present invention, achieving similar functional effects; the details are not repeated here.

FIG. 10 is a schematic structural diagram of a text recognition device according to Embodiment 7 of the present invention. As shown in FIG. 10, the device 100 includes a processor 1001, a memory 1002, and a computer program stored in the memory 1002 and executable on the processor 1001.

When the processor 1001 runs the computer program, it implements the text recognition method and/or the text recognition model training method provided by any of the above method embodiments.

Embodiments of the present invention further provide a computer-readable storage medium, such as a ROM/RAM, magnetic disk, or optical disc, storing a computer program that can be executed by a hardware device such as a terminal device, computer, or server to perform the above text recognition method and/or text recognition model training method.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1. A text recognition method, comprising:
extracting image features of the picture to be recognized through a dense convolutional neural network;
adding two-dimensional position coding information to the image characteristics to generate image characteristics containing position information;
and decoding the image features containing the position information through a transformer decoding layer containing a two-dimensional attention mechanism to obtain a recognition result.
2. The method according to claim 1, wherein the adding two-dimensional position coding information to the image features to generate the image features containing the position information comprises:
generating a two-dimensional position code for each pixel in the image feature and generating a position code tensor for the image feature;
and adding the position coding tensor of the image features and the image features to obtain the image features containing the position information.
3. The method of claim 2, wherein generating a two-dimensional position code for each pixel in the image feature comprises:
determining position coding weights in the horizontal direction and the vertical direction according to the image features;
for any pixel in the image characteristics, respectively generating one-dimensional position codes of the pixel in the horizontal direction and the vertical direction;
and according to the position coding weights in the horizontal direction and the vertical direction, carrying out weighted summation on the one-dimensional position codes of the pixel in the horizontal direction and the vertical direction to obtain the two-dimensional position code of the pixel.
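Claims 2 and 3 describe generating a two-dimensional position code for each pixel as a weighted sum of one-dimensional horizontal and vertical codes, then adding the resulting position coding tensor to the image features. The following is a minimal NumPy sketch of that idea, assuming sinusoidal one-dimensional codes (as in the original Transformer) and fixed scalar weights `w_h`/`w_v`; the claims derive these weights from the image features, which is omitted here:

```python
import numpy as np

def sinusoid_1d(length, depth):
    """Standard sinusoidal one-dimensional position code, shape (length, depth)."""
    pos = np.arange(length)[:, None]                      # (length, 1)
    i = np.arange(depth)[None, :]                         # (1, depth)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / depth)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def position_code_2d(feat, w_h=0.5, w_v=0.5):
    """Build an H x W x d position coding tensor for image features `feat`
    of shape (H, W, d) and add it to the features (claims 2-3).
    `w_h`/`w_v` are assumed fixed weights; the patent derives them
    from the image features."""
    H, W, d = feat.shape
    code_v = sinusoid_1d(H, d)            # vertical one-dimensional codes, (H, d)
    code_h = sinusoid_1d(W, d)            # horizontal one-dimensional codes, (W, d)
    # Weighted sum of the two one-dimensional codes at every pixel.
    pe = w_v * code_v[:, None, :] + w_h * code_h[None, :, :]   # (H, W, d)
    return feat + pe                      # image features containing position info

feat = np.zeros((4, 8, 16))
out = position_code_2d(feat)
print(out.shape)  # (4, 8, 16)
```

With zero-valued features the output equals the position coding tensor itself, which makes the sketch easy to inspect.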
4. The method according to any one of claims 1 to 3, wherein the method comprises at least one transformer decoding layer containing a two-dimensional attention mechanism, each transformer decoding layer comprising: a multi-headed attention layer with a mask, a two-dimensional attention layer, and a feedforward neural network layer.
5. The method of claim 4, wherein the decoding the image features containing the position information through a transformer decoding layer containing a two-dimensional attention mechanism to obtain a recognition result comprises:
processing the input character features through a multi-head attention layer with a mask to obtain first character features;
determining a two-dimensional attention vector through a two-dimensional attention layer according to the image feature containing the position information and the first character feature, and adding the two-dimensional attention vector to the first character feature to obtain a second character feature;
inputting the second character feature into the feed-forward neural network layer.
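The data flow of claim 5 — masked attention over the input character features, addition of the two-dimensional attention vectors, then a feed-forward layer — can be sketched as follows. This is an illustrative simplification, not the patented implementation: attention is single-head rather than multi-head, the feed-forward layer is an identity map, and the two-dimensional attention layer is passed in as a stub callable:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(chars):
    """Single-head causal self-attention over character features of shape
    (T, d); a simplified stand-in for the multi-head attention layer
    with a mask."""
    T, d = chars.shape
    scores = chars @ chars.T / np.sqrt(d)             # (T, T)
    mask = np.triu(np.ones((T, T)), k=1).astype(bool)
    scores[mask] = -1e9                               # hide future characters
    return softmax(scores, axis=-1) @ chars           # first character features

def decoder_layer(chars, attn_2d):
    """Claim-5 flow: masked attention -> add two-dimensional attention
    vectors -> feed-forward layer (identity placeholder here)."""
    first = masked_self_attention(chars)              # first character features
    second = first + attn_2d(first)                   # second character features
    W_ff = np.eye(first.shape[1])                     # placeholder feed-forward
    return second @ W_ff

chars = np.random.default_rng(0).normal(size=(5, 16))
out = decoder_layer(chars, attn_2d=lambda f: np.zeros_like(f))
print(out.shape)  # (5, 16)
```

The stub `attn_2d` returns zeros here so the sketch runs standalone; in the claimed method it would compute context vectors from the image features containing position information.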
6. The method of claim 5, wherein the determining a two-dimensional attention vector through a two-dimensional attention layer according to the image features containing the position information and the first character features comprises:
performing first convolution processing on the image features containing the position information to obtain a first tensor of H × W × d, wherein H, W and d respectively represent the height, width and depth of the first tensor;
determining, according to the first tensor, a weight value of each feature vector of the first character features with respect to the image features containing the position information, wherein the first character features comprise at least one feature vector, and the weight value is the weight value of each pixel of the image features containing the position information;
and performing a weighted summation of the image features containing the position information according to the weight value of each pixel to obtain the two-dimensional attention vector corresponding to each feature vector of the first character features.
7. The method according to claim 6, wherein the determining, according to the first tensor, a weight value of any one feature vector of the first character features with respect to the image features containing the position information comprises:
performing second convolution processing on the feature vector to obtain a second tensor of 1 × 1 × d;
expanding the height and width of the second tensor to obtain a third tensor of H × W × d, wherein the height, the width and the depth of the third tensor are consistent with those of the first tensor;
adding the third tensor to the first tensor, and processing by adopting an activation function to obtain a fourth tensor of H × W × d;
performing a third convolution processing on the fourth tensor to obtain a fifth tensor of H × W × 1;
and performing two-dimensional softmax processing on the fifth tensor to obtain the weight value of the feature vector with respect to the image features containing the position information.
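The weight computation of claims 6 and 7 can be sketched in NumPy, modeling the 1 × 1 convolutions as matrix multiplies with random stand-in weights (learned parameters in the actual model); the tensor names follow the claims:

```python
import numpy as np

def softmax_2d(x):
    """Softmax over the full H x W map (claim 7, fifth tensor)."""
    e = np.exp(x - x.max())
    return e / e.sum()

def two_d_attention(feat, char_vec, rng=np.random.default_rng(0)):
    """Sketch of claims 6-7. `feat` is the image features containing
    position information, shape (H, W, d); `char_vec` is one feature
    vector of the first character features, shape (d,)."""
    H, W, d = feat.shape
    W1 = rng.normal(size=(d, d))          # first convolution (1x1), stand-in weights
    W2 = rng.normal(size=(d, d))          # second convolution (1x1), stand-in weights
    w3 = rng.normal(size=(d,))            # third convolution (d -> 1), stand-in weights

    first = feat @ W1                                  # first tensor, (H, W, d)
    second = char_vec @ W2                             # second tensor, (d,)
    third = np.broadcast_to(second, (H, W, d))         # expanded third tensor
    fourth = np.tanh(first + third)                    # activation, (H, W, d)
    fifth = fourth @ w3                                # (H, W), i.e. H x W x 1
    alpha = softmax_2d(fifth)                          # per-pixel weight values
    # Weighted summation of the image features: the two-dimensional attention vector.
    return (alpha[:, :, None] * feat).sum(axis=(0, 1)), alpha

feat = np.random.default_rng(1).normal(size=(4, 6, 8))
vec = np.random.default_rng(2).normal(size=(8,))
ctx, alpha = two_d_attention(feat, vec)
print(ctx.shape)  # (8,)
```

The `tanh` activation is an assumption — the claims only say "an activation function" — and the two-dimensional softmax normalizes the weights over the whole H × W map rather than per row.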
8. A method for training a text recognition model, wherein the text recognition model comprises: an encoding module and a decoding module; the encoding module is configured to: extract image features of a picture to be recognized through a dense convolutional neural network, add two-dimensional position coding information to the image features, and generate image features containing position information; the decoding module comprises a transformer decoding layer containing a two-dimensional attention mechanism, and the transformer decoding layer containing the two-dimensional attention mechanism is used for decoding the image features containing the position information to obtain a recognition result;
the method comprises the following steps:
acquiring a training set for natural scene text recognition, wherein the training set comprises at least a plurality of pieces of curved text training data, and each piece of curved text training data comprises: a sample picture containing curved text and its corresponding text annotation information;
and training the text recognition model through the training set.
9. A text recognition system, comprising:
an encoding module, configured to extract image features of the picture to be recognized through a dense convolutional neural network, add two-dimensional position coding information to the image features, and generate image features containing position information;
and a decoding module, configured to decode the image features containing the position information through a transformer decoding layer containing a two-dimensional attention mechanism to obtain a recognition result.
10. A text recognition apparatus, comprising:
a processor, a memory, and a computer program stored on the memory and executable on the processor;
wherein the processor, when running the computer program, implements the method of any one of claims 1 to 8.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which can be executed to perform the method according to any one of claims 1 to 8.
CN202010270210.8A 2020-04-13 2020-04-13 Text recognition and model training method, system, device and readable storage medium Pending CN111507328A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010270210.8A CN111507328A (en) 2020-04-13 2020-04-13 Text recognition and model training method, system, device and readable storage medium


Publications (1)

Publication Number Publication Date
CN111507328A true CN111507328A (en) 2020-08-07

Family

ID=71875960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010270210.8A Pending CN111507328A (en) 2020-04-13 2020-04-13 Text recognition and model training method, system, device and readable storage medium

Country Status (1)

Country Link
CN (1) CN111507328A (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180005082A1 (en) * 2016-04-11 2018-01-04 A2Ia S.A.S. Systems and methods for recognizing characters in digitized documents
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
CN109492232A (en) * 2018-10-22 2019-03-19 内蒙古工业大学 A kind of illiteracy Chinese machine translation method of the enhancing semantic feature information based on Transformer
CN109543667A (en) * 2018-11-14 2019-03-29 北京工业大学 A text recognition method based on attention mechanism
CN109726657A (en) * 2018-12-21 2019-05-07 万达信息股份有限公司 A kind of deep learning scene text recognition sequence method
CN109783827A (en) * 2019-01-31 2019-05-21 沈阳雅译网络技术有限公司 A kind of deep layer nerve machine translation method based on dynamic linear polymerization
CN109948604A (en) * 2019-02-01 2019-06-28 北京捷通华声科技股份有限公司 Recognition methods, device, electronic equipment and the storage medium of irregular alignment text
CN110378334A (en) * 2019-06-14 2019-10-25 华南理工大学 A kind of natural scene text recognition method based on two dimensional character attention mechanism
CN110598690A (en) * 2019-08-01 2019-12-20 达而观信息科技(上海)有限公司 End-to-end optical character detection and identification method and system
CN110765966A (en) * 2019-10-30 2020-02-07 哈尔滨工业大学 One-stage automatic recognition and translation method for handwritten characters


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HUI LI et al.: "Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition", Computer Vision and Pattern Recognition *
JUNYEOP LEE et al.: "On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention", Computer Vision and Pattern Recognition *
XU Qingquan: "Research on a Chinese Recognition Algorithm Based on an Attention Mechanism", China Master's Theses Full-text Database, Information Science and Technology *
YANG Zhicheng: "Irregular Scene Text Recognition Algorithm Based on a 2D Attention Mechanism", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036292B (en) * 2020-08-27 2024-06-04 平安科技(深圳)有限公司 Word recognition method and device based on neural network and readable storage medium
WO2021147569A1 (en) * 2020-08-27 2021-07-29 平安科技(深圳)有限公司 Neural network-based character recognition method and apparatus, and readable storage medium
CN112036292A (en) * 2020-08-27 2020-12-04 平安科技(深圳)有限公司 Character recognition method and device based on neural network and readable storage medium
CN114254071A (en) * 2020-09-23 2022-03-29 Sap欧洲公司 Querying semantic data from unstructured documents
WO2022068426A1 (en) * 2020-09-30 2022-04-07 京东方科技集团股份有限公司 Text recognition method and text recognition system
CN112560652A (en) * 2020-12-09 2021-03-26 第四范式(北京)技术有限公司 Text recognition method and system and text recognition model training method and system
CN112560652B (en) * 2020-12-09 2024-03-05 第四范式(北京)技术有限公司 Text recognition method and system and text recognition model training method and system
CN112489740A (en) * 2020-12-17 2021-03-12 北京惠及智医科技有限公司 Medical record detection method, training method of related model, related equipment and device
CN114693904A (en) * 2020-12-28 2022-07-01 北京搜狗科技发展有限公司 Text recognition method, model training method, model recognition device and electronic equipment
CN112686263A (en) * 2020-12-29 2021-04-20 科大讯飞股份有限公司 Character recognition method and device, electronic equipment and storage medium
CN112686263B (en) * 2020-12-29 2024-04-16 科大讯飞股份有限公司 Character recognition method, character recognition device, electronic equipment and storage medium
CN112926684B (en) * 2021-03-29 2022-11-29 中国科学院合肥物质科学研究院 A Text Recognition Method Based on Semi-Supervised Learning
CN112926684A (en) * 2021-03-29 2021-06-08 中国科学院合肥物质科学研究院 Character recognition method based on semi-supervised learning
CN113221879A (en) * 2021-04-30 2021-08-06 北京爱咔咔信息技术有限公司 Text recognition and model training method, device, equipment and storage medium
CN113255645B (en) * 2021-05-21 2024-04-23 北京有竹居网络技术有限公司 Text line picture decoding method, device and equipment
CN113255645A (en) * 2021-05-21 2021-08-13 北京有竹居网络技术有限公司 Method, device and equipment for decoding text line picture
CN113536785A (en) * 2021-06-15 2021-10-22 合肥讯飞数码科技有限公司 Text recommendation method, intelligent terminal and computer readable storage medium
CN113361522B (en) * 2021-06-23 2022-05-17 北京百度网讯科技有限公司 Method and device for determining character sequence and electronic equipment
CN113361522A (en) * 2021-06-23 2021-09-07 北京百度网讯科技有限公司 Method and device for determining character sequence and electronic equipment
CN113343903A (en) * 2021-06-28 2021-09-03 成都恒创新星科技有限公司 License plate recognition method and system in natural scene
CN113343903B (en) * 2021-06-28 2024-03-26 成都恒创新星科技有限公司 License plate recognition method and system in natural scene
CN113283427A (en) * 2021-07-20 2021-08-20 北京世纪好未来教育科技有限公司 Text recognition method, device, equipment and medium
CN113283427B (en) * 2021-07-20 2021-10-01 北京世纪好未来教育科技有限公司 Text recognition method, device, equipment and medium
CN113793403B (en) * 2021-08-19 2023-09-22 西南科技大学 Text image synthesizing method for simulating painting process
CN113793403A (en) * 2021-08-19 2021-12-14 西南科技大学 A Text Synthesis Image Method Simulating the Painting Process
CN114049640A (en) * 2021-11-12 2022-02-15 上海浦东发展银行股份有限公司 A bill text detection method, device, electronic device and storage medium
CN114387431A (en) * 2022-01-12 2022-04-22 杭州电子科技大学 OCR method for multi-line text paper forms based on semantic analysis
CN114445808A (en) * 2022-01-21 2022-05-06 上海易康源医疗健康科技有限公司 Swin transform-based handwritten character recognition method and system
CN114462580A (en) * 2022-02-10 2022-05-10 腾讯科技(深圳)有限公司 Text recognition model training method, text recognition method, device and device
CN114693814A (en) * 2022-03-31 2022-07-01 北京字节跳动网络技术有限公司 Model decoding method, text recognition method, device, medium and equipment
CN114693814B (en) * 2022-03-31 2024-04-30 北京字节跳动网络技术有限公司 Decoding method, text recognition method, device, medium and equipment for model
CN114973224A (en) * 2022-04-12 2022-08-30 北京百度网讯科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN116030471A (en) * 2022-12-29 2023-04-28 北京百度网讯科技有限公司 Text recognition method, training method, device and equipment for text recognition model
CN117710986A (en) * 2024-02-01 2024-03-15 长威信息科技发展股份有限公司 Method and system for identifying interactive enhanced image text based on mask
CN117710986B (en) * 2024-02-01 2024-04-30 长威信息科技发展股份有限公司 Method and system for identifying interactive enhanced image text based on mask

Similar Documents

Publication Publication Date Title
CN111507328A (en) Text recognition and model training method, system, device and readable storage medium
CN110263912B (en) An Image Question Answering Method Based on Multi-object Association Deep Reasoning
CN111798369B (en) A face aging image synthesis method based on recurrent conditional generative adversarial network
He et al. Visual semantics allow for textual reasoning better in scene text recognition
CN111160343B (en) Off-line mathematical formula symbol identification method based on Self-Attention
CN113159023B (en) Scene text recognition method based on explicit supervised attention mechanism
CN110378334A (en) A kind of natural scene text recognition method based on two dimensional character attention mechanism
CN109670576B (en) Multi-scale visual attention image description method
CN113343707A (en) Scene text recognition method based on robustness characterization learning
Guo et al. Robust student network learning
CN106650813A (en) Image understanding method based on depth residual error network and LSTM
CN114973222B (en) Scene text recognition method based on explicit supervision attention mechanism
CN110728629A (en) Image set enhancement method for resisting attack
CN113221879A (en) Text recognition and model training method, device, equipment and storage medium
CN113065550B (en) Text recognition method based on self-attention mechanism
CN112818764A (en) Low-resolution image facial expression recognition method based on feature reconstruction model
Baek et al. Generative adversarial ensemble learning for face forensics
Zhao et al. Disentangled representation learning and residual GAN for age-invariant face verification
CN119096249A (en) Language models for handling multimodal query input
Gallant et al. Positional binding with distributed representations
CN117727069A (en) Text-image pedestrian re-identification method based on multi-scale information interaction network
CN114529982A (en) Lightweight human body posture estimation method and system based on stream attention
CN114581918A (en) Text recognition model training method and device
CN112037239A (en) Text guidance image segmentation method based on multi-level explicit relation selection
CN117351550A (en) Grid self-attention facial expression recognition method based on supervised contrast learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200807)