CN114581906B

CN114581906B - Method and system for text recognition in natural scene images

Info

Publication number: CN114581906B
Application number: CN202210483188.4A
Authority: CN
Inventors: 许信顺; 王彬; 罗昕; 陈振铎
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2022-05-06
Filing date: 2022-05-06
Publication date: 2022-08-05
Anticipated expiration: 2042-05-06
Also published as: CN114581906A

Abstract

The invention relates to the technical field of data identification, and discloses a text identification method and a text identification system for natural scene images, wherein the method comprises the following steps: acquiring a natural scene image to be identified; performing text recognition on the natural scene image to be recognized by adopting the trained deep learning model to obtain a recognized text; firstly, correcting a natural scene image to be recognized by a deep learning model, and then extracting a feature vector of the image after correction; and then respectively extracting visual features and semantic features from the feature vectors of the image, performing feature fusion on the two features, and finally performing text recognition on the fused features. The method can identify the scene texts in any shapes, has wide application scenes and strong generalization capability of the model, and can be applied to various scenes of text identification.

Description

Method and system for text recognition in natural scene images

技术领域technical field

本发明涉及数据识别技术领域，特别是涉及自然场景图像的文本识别方法及系统。The invention relates to the technical field of data recognition, in particular to a text recognition method and system for natural scene images.

背景技术Background technique

本部分的陈述仅仅是提到了与本发明相关的背景技术，并不必然构成现有技术。The statements in this section merely provide background related to the present disclosure and do not necessarily constitute prior art.

文本识别是计算机视觉研究领域的分支之一，归属于模式识别和人工智能，是计算机科学的重要组成部分。它是利用计算机技术将自然场景中的文本识别出来，并转换成计算机能够显示、人可以理解的格式。通过文本识别，可以大大加快信息处理速度。Text recognition is one of the branches of computer vision research, which belongs to pattern recognition and artificial intelligence, and is an important part of computer science. It uses computer technology to recognize text in natural scenes and convert it into a format that can be displayed by computers and understood by humans. Through text recognition, information processing speed can be greatly accelerated.

场景文本识别任务由于文本本身具有的序列性，使得二维的文本图像与一维的序列文本之间有着巨大的语义鸿沟，也使得场景文本识别有其独特的困难性。人们识别场景中的文本是受各种各样因素影响的，例如背景，色彩，甚至文化等因素。Due to the sequential nature of the text itself, there is a huge semantic gap between two-dimensional text images and one-dimensional sequential texts in the task of scene text recognition, which also makes scene text recognition uniquely difficult. Human recognition of text in a scene is influenced by a variety of factors, such as context, color, and even culture.

比如，当人看到麦当劳标志性的白底黄标的符号，会下意识的知道写的是麦当劳，往往人第一眼看到场景后不需要反应写的什么就可以判断出这里的文本是什么。但是对于计算机来说，这些复杂的场景对于文本的识别却是副作用的。现实中场景极其复杂，在字体，风格，背景等方面千差万别，这些干扰使得计算机经常错误的识别某些字符，而识别错了某一个字符就往往会导致识别结果完全背离了文本的含义。For example, when people see McDonald's iconic symbol with a yellow label on a white background, they will subconsciously know that it is McDonald's. Often people can judge what the text here is without reacting to what is written after seeing the scene for the first time. But for computers, these complex scenes are side effects for text recognition. In reality, the scene is extremely complex, and there are huge differences in fonts, styles, backgrounds, etc. These interferences make the computer often mistakenly recognize certain characters, and the recognition of a wrong character will often lead to the recognition result completely deviating from the meaning of the text.

当下的文本识别方法大致可以细分为两个方向，一个方向是利用深度学习时代神经网络强大的特征提取能力在视觉特征方面做优化来得到更高的特征提取能力，另一个方向则是根据文本本身具有的语义，对特征提取器对二维的视觉图像提取到的特征进行语义的增强或是对视觉模型得到的结果进行纠正。The current text recognition methods can be roughly divided into two directions. One direction is to use the powerful feature extraction capabilities of neural networks in the deep learning era to optimize visual features to obtain higher feature extraction capabilities, and the other direction is based on text. It has its own semantics, and the features extracted by the feature extractor from the two-dimensional visual image are semantically enhanced or the results obtained by the visual model are corrected.

近几年来，取得效果最好的模型多是基于语义的模型，这类方法通常先使用特征提取器将二维图片提取为视觉特征图，接着利用语义模型对视觉模型提取到的特征图进一步编码，得到语义特征。接着综合利用视觉特征与语义特征得到最终的识别结果。但是这种方式会导致语义模型高度依赖于视觉特征。In recent years, the models with the best results are mostly semantic-based models. These methods usually first use a feature extractor to extract a two-dimensional image into a visual feature map, and then use the semantic model to further encode the feature map extracted by the visual model. , to get semantic features. Then, the visual features and semantic features are comprehensively used to obtain the final recognition result. But in this way, the semantic model is highly dependent on visual features.

这种将语义特征与视觉特征耦合到一起的方式会有两个缺点：一是语义模型仅仅用于更正视觉模型得到的结果。语义模型在整个流程端对端训练，但是其实际是作为后处理的方式独立于整个模型，这就导致了模型变大，梯度链因此增长难以训练。第二，使用语义模型纠正视觉模型的结果，通过这种结构的确增强了模型的识别能力，但是自然场景中存在着各种各样错误文本信息，比方说对于手写试卷的识别与批改，对于错误的文本，使用语义模型后，模型会自动将其纠正为正确的单词，这无疑偏离了批改作业的本意。This way of coupling semantic features with visual features has two disadvantages: First, the semantic model is only used to correct the results obtained by the visual model. The semantic model is trained end-to-end in the entire process, but it is actually used as a post-processing method independent of the entire model, which leads to a larger model, which makes it difficult to train the gradient chain. Second, using the semantic model to correct the results of the visual model, the recognition ability of the model is indeed enhanced through this structure, but there are various erroneous text information in natural scenes, for example, for the recognition and correction of handwritten test papers, for errors After using the semantic model, the model will automatically correct it to the correct word, which undoubtedly deviates from the original intention of the correction work.

发明内容SUMMARY OF THE INVENTION

为了解决现有技术的不足，本发明提供了自然场景图像的文本识别方法及系统；提出了一种语义独立的文本识别网络，它不同于之前模型仅仅利用截断梯度来解耦视觉与语义模型，而是从模型结构上进行调整，实现结构上彻底的解耦。设计了一种新的语义模块，用于充分利用语义信息。设计了一种新的视觉语义特征融合模块，使得视觉特征与语义特征充分交互，充分利用了视觉特征与语义特征。使用门机制将增强的视觉信息与语义信息融合，得到最终的预测结果。In order to solve the deficiencies of the prior art, the present invention provides a text recognition method and system for natural scene images, and proposes a semantically independent text recognition network, which is different from the previous model that only uses the truncated gradient to decouple the visual and semantic models. Instead, it is adjusted from the model structure to achieve a complete decoupling of the structure. A new semantic module is designed to make full use of semantic information. A new visual-semantic feature fusion module is designed, which makes visual features and semantic features fully interact, and makes full use of visual features and semantic features. A gate mechanism is used to fuse the enhanced visual information with the semantic information to obtain the final prediction result.

第一方面，本发明提供了自然场景图像的文本识别方法；In a first aspect, the present invention provides a text recognition method for natural scene images;

自然场景图像的文本识别方法，包括：Text recognition methods for natural scene images, including:

获取待识别的自然场景图像；Obtain the natural scene image to be recognized;

对待识别的自然场景图像，采用训练后的深度学习模型进行文本识别，得到识别的文本；For the natural scene image to be recognized, the trained deep learning model is used for text recognition, and the recognized text is obtained;

其中，深度学习模型，首先对待识别的自然场景图像进行矫正处理，然后对矫正后的提取出图像的特征向量；再从图像的特征向量中，分别提取出视觉特征和语义特征，并对两种特征进行特征融合，最后对融合后的特征进行文本识别。Among them, the deep learning model firstly corrects the natural scene image to be recognized, and then extracts the feature vector of the image after correction; then extracts visual features and semantic features from the feature vectors of the image, respectively. The features are fused, and finally the fused features are used for text recognition.

第二方面，本发明提供了自然场景图像的文本识别系统；In a second aspect, the present invention provides a text recognition system for natural scene images;

自然场景图像的文本识别系统，包括：Text recognition systems for natural scene images, including:

获取模块，其被配置为：获取待识别的自然场景图像；an acquisition module, which is configured to: acquire a natural scene image to be recognized;

识别模块，其被配置为：对待识别的自然场景图像，采用训练后的深度学习模型进行文本识别，得到识别的文本；A recognition module, which is configured to: use the trained deep learning model to perform text recognition on the natural scene image to be recognized, and obtain the recognized text;

与现有技术相比，本发明的有益效果是：Compared with the prior art, the beneficial effects of the present invention are:

1）本发明可以识别任意形状的场景文本，应用场景广泛，模型的泛化能力强，可以运用于多种文本识别的场景。1) The present invention can recognize scene texts of any shape, has a wide range of application scenarios, and has strong generalization ability of the model, and can be applied to various text recognition scenarios.

2）本发明调整模型结构，从而降低语义模块对视觉模块的耦合程度，使其成为独立的部分，从而使得模型更容易端对端训练而不需要复杂的多阶段或是预训练过程。2) The present invention adjusts the model structure, thereby reducing the coupling degree of the semantic module to the visual module, making it an independent part, thereby making the model easier to train end-to-end without requiring complex multi-stage or pre-training processes.

3）本发明提出一种新的用于文本识别的语义模块，通过该模块可以与视觉模块对等地处理文本的语义信息，并在该分支可以得到初步的语义识别结果，用以引导模型训练。3) The present invention proposes a new semantic module for text recognition, through which the semantic information of text can be processed equivalently with the visual module, and preliminary semantic recognition results can be obtained in this branch to guide model training .

4）本发明提出一种新的视觉语义融合方式，使得对等的语义特征与视觉特征进行交互，可以充分利用两者的信息，并且采用门机制将两者融合的信息做最终的决策，从而得到更好的识别结果。4) The present invention proposes a new visual-semantic fusion method, which enables the interaction of equivalent semantic features and visual features, and can make full use of the information of both, and uses the gate mechanism to make the final decision on the information fused by the two, thereby get better recognition results.

附图说明Description of drawings

构成本发明的一部分的说明书附图用来提供对本发明的进一步理解，本发明的示意性实施例及其说明用于解释本发明，并不构成对本发明的不当限定。The accompanying drawings forming a part of the present invention are used to provide further understanding of the present invention, and the exemplary embodiments of the present invention and their descriptions are used to explain the present invention, and do not constitute an improper limitation of the present invention.

图1为整个网络的功能模块示意图；Fig. 1 is the functional module schematic diagram of the whole network;

图2为视觉特征和语义特征的特征融合示意图。Figure 2 is a schematic diagram of feature fusion of visual features and semantic features.

具体实施方式Detailed ways

应该指出，以下详细说明都是示例性的，旨在对本发明提供进一步的说明。除非另有指明，本文使用的所有技术和科学术语具有与本发明所属技术领域的普通技术人员通常理解的相同含义。It should be noted that the following detailed description is exemplary and intended to provide further explanation of the invention. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

在不冲突的情况下，本发明中的实施例及实施例中的特征可以相互组合。Embodiments of the invention and features of the embodiments may be combined with each other without conflict.

本实施例所有数据的获取都在符合法律法规和用户同意的基础上，对数据的合法应用。All data acquisition in this embodiment is based on compliance with laws and regulations and the user's consent, and the legal application of the data.

实施例一Example 1

本实施例提供了自然场景图像的文本识别方法；This embodiment provides a text recognition method for natural scene images;

S101：获取待识别的自然场景图像；S101: Obtain a natural scene image to be recognized;

S102：对待识别的自然场景图像，采用训练后的深度学习模型进行文本识别，得到识别的文本；S102: For the natural scene image to be recognized, use the trained deep learning model to perform text recognition to obtain the recognized text;

进一步地，如图1所示，所述深度学习模型，包括：矫正模块，所述矫正模块与骨干网络的输入端连接，所述骨干网络的输出端分别与视觉特征提取模块的输入端和语义特征提取模块的输入端连接；所述视觉特征提取模块的输出端和语义特征提取模块的输出端，均与视觉语义特征融合模块的输入端连接，所述视觉语义特征融合模块的输出端与预测模块的输入端连接，预测模块的输出端输出文本识别结果。Further, as shown in FIG. 1, the deep learning model includes: a correction module, the correction module is connected with the input end of the backbone network, and the output end of the backbone network is respectively connected with the input end of the visual feature extraction module and the semantic The input end of the feature extraction module is connected; the output end of the visual feature extraction module and the output end of the semantic feature extraction module are both connected with the input end of the visual semantic feature fusion module, and the output end of the visual semantic feature fusion module is connected to the prediction The input end of the module is connected, and the output end of the prediction module outputs the text recognition result.

进一步地，所述矫正模块，采用薄板样条插值算法TPS(Thin Plate Spline)来实现，用于将弯曲的图像矫正为规则的形状。Further, the rectification module is implemented by using a thin plate spline interpolation algorithm TPS (Thin Plate Spline), and is used to rectify the curved image into a regular shape.

应理解地，自然场景中广泛存在弯曲文本，这些图像背景，外观和布局的巨大变化给文本识别任务带来重大挑战，传统的光学字符识别（Optical Character Recognition，OCR）方法无法有效处理。难以训练一个能应对各种场景的识别模型，因此，本实施例使用薄板样条插值算法TPS来对图像进行矫正，以得到较为规则的文本。薄板样条插值算法TPS的关键问题在于定位2J个控制点以及控制点之间的变换矩阵，其中J为超参数。这2J个控制点是通过回归的手段获得的，具体的，将输入图像送入3层的卷积神经网络提取图像特征后，使用全连接操作得到了2J个输出，这2J个输出便对应着其控制点。变换矩阵则是通过像素点与控制点之间的范数与控制点之间的范数得到的解析解。通过这一步，可以得到初步矫正后的文本图像。It should be understood that curved texts widely exist in natural scenes, and these dramatic changes in image background, appearance, and layout bring significant challenges to text recognition tasks, which traditional Optical Character Recognition (OCR) methods cannot handle effectively. It is difficult to train a recognition model that can cope with various scenarios. Therefore, in this embodiment, the thin plate spline interpolation algorithm TPS is used to correct the image to obtain relatively regular text. The key problem of the thin plate spline interpolation algorithm TPS is to locate 2J control points and the transformation matrix between the control points, where J is a hyperparameter. These 2J control points are obtained by means of regression. Specifically, after the input image is sent to a 3-layer convolutional neural network to extract image features, 2J outputs are obtained using the full connection operation, and these 2J outputs correspond to its control point. The transformation matrix is the analytical solution obtained by the norm between the pixel point and the control point and the norm between the control points. Through this step, a preliminary corrected text image can be obtained.

进一步地，所述骨干网络，采用ResNet（Residual Neural Network）卷积神经网络来实现，用于提取自然场景图像的空间特征，采用位置编码对空间特征进行位置信息嵌入，采用第一Transformer神经网络对位置信息嵌入后的空间特征进行特征增强，得到图像的特征向量。Further, the backbone network is implemented by a ResNet (Residual Neural Network) convolutional neural network, which is used to extract the spatial features of natural scene images, and the spatial features are embedded with position information by using position coding. The spatial features of the embedded position information are enhanced to obtain the feature vector of the image.

其中，位置编码，是指：Among them, the position code refers to:

；（1.1）

; (1.1)

；（1.2）

; (1.2)

其中，

指的就是对应的位置编码，是一个一维的向量。in,

Refers to the corresponding position code, which is a one-dimensional vector.

其中，

对应的是第

个位置，第

个维度的值。in,

corresponding to the

position, the

dimension value.

；（1.3）

; (1.3)

应理解地，场景文本图像由于其风格背景，字体风格，以及拍摄过程中的噪声，因此通常采用深层的卷积神经网络作为图像的编码器，本实施例选取ResNet45作为骨干网络，通过使用网络中的残差连接，减少网络过深带来的模型退化，同样也减少了网络梯度消失的问题。此外，由于文本的序列性，本实施例在使用了残差网络后使用位置编码与Transformer来获取增强的特征，使用位置编码可以引入图像的位置信息，使得神经网络更加关注二维特征图中特征之间的位置关系，对于Transformer，这里的查询

、键

、值

三者都是自身，这种方式可以深度挖掘出特征图内部特征的关系。公式化描述如下：It should be understood that, due to its style background, font style, and noise in the shooting process, the scene text image usually adopts a deep convolutional neural network as the image encoder. In this embodiment, ResNet45 is selected as the backbone network. The residual connection of the network reduces the model degradation caused by the deep network, and also reduces the problem of the disappearance of the network gradient. In addition, due to the sequential nature of the text, this embodiment uses the position encoding and Transformer to obtain enhanced features after using the residual network. Using the position encoding can introduce the position information of the image, so that the neural network pays more attention to the features in the two-dimensional feature map. The positional relationship between, for Transformer, the query here

,key

,value

All three are themselves, and in this way, the relationship between the internal features of the feature map can be deeply excavated. The formulation is described as follows:

；（1.4）

; (1.4)

其中，

是公式(1.1)中得到的位置编码，

指的是输入图像，

指的是残差神经网络，

指的是Transformer网络，该步的操作指的是将图像输入ResNet45网络中后得到了特征，将特征与位置编码相加后送Transformer网络中。in,

is the position code obtained in formula (1.1),

refers to the input image,

refers to the residual neural network,

Refers to the Transformer network. The operation of this step refers to inputting the image into the ResNet45 network to obtain features, adding the features and position codes and sending them to the Transformer network.

进一步地，所述视觉特征提取模块，采用第二Transformer神经网络对图像特征向量的视觉部分和语义部分进行分离，采用位置注意力机制模块对视觉部分进行解码，得到视觉特征；Further, the visual feature extraction module adopts the second Transformer neural network to separate the visual part and the semantic part of the image feature vector, and uses the position attention mechanism module to decode the visual part to obtain visual features;

所述位置注意力机制模块，是将自注意力机制self-attention的查询

、键

、值

替换为不同元素；其中，查询

被替换为位置编码，键

被替换为UNet网络的输出值，值

被替换为

的恒等映射。The position attention mechanism module is a query of the self-attention mechanism self-attention

,key

,value

replaced with a different element; where the query

is replaced by the position code, key

is replaced by the output value of the UNet network, value

is replaced with

The identity map of .

所述位置注意力机制，见公式（1.6）-公式（1.8）。For the location attention mechanism, see Equation (1.6)-Equation (1.8).

其中，第二Transformer网络得到的视觉特征使用位置注意力机制模块解码后，采用全连接层对解码结果进行预测，并使用公式(1.10)中的交叉熵损失进行监督训练。Among them, after the visual features obtained by the second Transformer network are decoded using the position attention mechanism module, the fully connected layer is used to predict the decoding results, and the cross-entropy loss in formula (1.10) is used for supervised training.

示例性地，所述视觉特征提取模块，本实施例使用Transformer 进一步地提取高维的视觉特征从而降低下一步解码对骨干网络的依赖，使得语义部分的解码与视觉部分的解码解耦。视觉部分的特征为二维的特征而语义部分的特征为一维的结果序列，这是两者根本的不同之处。通过这一步根据特征

得到了

，公式如下所示：Exemplarily, in the visual feature extraction module, the Transformer is used in this embodiment to further extract high-dimensional visual features, thereby reducing the dependence of the next decoding on the backbone network, so that the decoding of the semantic part is decoupled from the decoding of the visual part. The features of the visual part are two-dimensional features and the features of the semantic part are one-dimensional result sequences, which is the fundamental difference between the two. Through this step according to the characteristics

Got

, the formula is as follows:

；（1.5）

; (1.5)

其中，公式(1.5)中的

都为公式（1.4）中得到

的恒等映射，

是特征的维度，

是超参数，

指的是

函数。通过该步，使得视觉部分提取更深的解码，使得视觉部分的解码与语义部分分离。Among them, in formula (1.5)

Both are obtained in formula (1.4)

The identity mapping of ,

is the dimension of the feature,

are hyperparameters,

Refers

function. This step enables the visual part to extract deeper decoding, so that the decoding of the visual part is separated from the semantic part.

接下来，使用位置注意力机制模块将视觉特征转换为字符序列，这里同样使用的是self-attention公式，但是不同于（1.5）中

、

和

使用的都是

的恒等映射，公式（1.8）中

、

和

都使用各不相同的编码方式。Next, use the position attention mechanism module to convert the visual features into character sequences. The self-attention formula is also used here, but it is different from (1.5)

,

and

are used

The identity map of , in Equation (1.8)

,

and

All use different encodings.

公式（1.5）的

与公式（1.6）中的

不同，公式（1.5）的查询旨在编码二维视觉特征的位置关系因此实际采用的是特征

，而这里的

编码的则是单词中字符之间的序列关系，因此是对位置顺序采用了词嵌入，公式化描述如下：Equation (1.5)

with the formula (1.6)

Different, the query of formula (1.5) aims to encode the positional relationship of the two-dimensional visual features, so the feature is actually used

, and here the

The encoding is the sequence relationship between the characters in the word, so the word embedding is used for the position order, and the formula is described as follows:

（1.6）

(1.6)

其中，

使用的是词嵌入函数，

是字符的顺序，包括[0,1,…N]，

就是对应的位置编码。in,

The word embedding function is used,

is the sequence of characters, including [0,1,…N],

is the corresponding position code.

则是使用了UNet网络，这里的UNet网络没有使用字符级别的分割来引导训练，而是作为一种特征增强的方式，这里通过UNet网络得到与原特征图大小一样的新的特征，公式化描述如下：

The UNet network is used. The UNet network here does not use character-level segmentation to guide training, but as a way of feature enhancement. Here, new features of the same size as the original feature map are obtained through the UNet network. The formula is described as follows :

（1.7）

(1.7)

其中，

是公式（1.5）得到的特征，

是UNet网络，

是得到的键。

则是使用了恒等映射，使用的特征

。in,

is the feature obtained by formula (1.5),

is the UNet network,

is the obtained key.

Then the identity mapping is used, and the features used

.

同样的，通过使用self-attention公式，得到视觉部分最终的特征图

，对其进行解码后，得到视觉部分的结果。Similarly, by using the self-attention formula, the final feature map of the visual part is obtained

, after decoding it, the result of the visual part is obtained.

；（1.8）

; (1.8)

；（1.9）

; (1.9)

其中，公式（1.8）中

对应的是公式（1.6）中的

，

对应的是公式（1.7）中的

，

是

的恒等映射，得到的

是视觉部分最终的特征；公式（1.8）中

是全连接函数，

是公式（1.8）中得到的输出，

指的是空间维度，

是单词的最大长度，这里是预定义的值，

是字符的类别数，同样是预定义的值，

是视觉部分的识别结果，

表示全体实数的集合。Among them, in formula (1.8)

Corresponding to the formula (1.6) in

,

Corresponds to the formula (1.7) in

,

Yes

The identity mapping of , we get

is the final feature of the visual part; in Equation (1.8)

is the fully connected function,

is the output obtained in Equation (1.8),

refers to the spatial dimension,

is the maximum length of a word, here is the predefined value,

is the number of categories of characters, again a predefined value,

is the recognition result of the visual part,

represents the set of all real numbers.

本实施例引入交叉熵损失来引导视觉部分的训练公式如下：In this embodiment, the cross-entropy loss is introduced to guide the training formula of the visual part as follows:

；（1.10）

; (1.10)

其中，

指的就是公式（1.9）中

第

个时间片，

就是真实的标签，

对应的就是单词的长度。

就是对每一个预测出的字符，将其与真实值做交叉熵损失。in,

Refers to the formula (1.9) in

the first

time slice,

is the real label,

Corresponds to the length of the word.

It is to perform cross-entropy loss on each predicted character with the real value.

进一步地，所述语义特征提取模块，采用注意力机制模块，对图像的特征向量进行对齐处理，然后，对对齐后的数据进行解码处理，得到语义特征。Further, the semantic feature extraction module adopts the attention mechanism module to align the feature vectors of the image, and then decode the aligned data to obtain the semantic features.

其中，对齐处理，是指将二维特征对齐为一维特征。The alignment processing refers to aligning two-dimensional features into one-dimensional features.

其中，对对齐后的数据进行解码处理，采用长短期记忆网络（Long Short-TermMemory，LSTM）来实现。Among them, the aligned data is decoded by using a long short-term memory network (Long Short-Term Memory, LSTM).

其中，长短期记忆网络，采用全连接层进行语义特征的预测，并且使用第二交叉熵损失函数进行监督训练。Among them, the long short-term memory network uses a fully connected layer to predict semantic features, and uses a second cross-entropy loss function for supervised training.

示例性地，由于语义特征提取模块对视觉特征提取模块的耦合，使得语义特征提取模块的训练极大地依赖于视觉特征提取模块，因此本实施例将语义特征提取模块独立出来，独立地对骨干网络的特征进行编码，从而降低模型的耦合程度。Exemplarily, due to the coupling of the semantic feature extraction module to the visual feature extraction module, the training of the semantic feature extraction module greatly depends on the visual feature extraction module. Therefore, in this embodiment, the semantic feature extraction module is independent, and the backbone network is independent. features to encode, thereby reducing the degree of coupling of the model.

所述语义特征提取模块，首先采用注意力机制将骨干网络得到的二维特征进行对齐，使其成为对齐好的一维特征，此外，本实施例在注意力机制中加入引入位置信息

，使得注意力机制不仅仅关注于有判别力的区域，而且关注图像之间文本的位置关系，公式化描述如下：The semantic feature extraction module first uses the attention mechanism to align the two-dimensional features obtained by the backbone network to make them into aligned one-dimensional features. In addition, this embodiment adds the introduction of position information into the attention mechanism.

, so that the attention mechanism not only pays attention to the discriminative regions, but also pays attention to the positional relationship of the text between the images, which is formulated as follows:

；（2.1）

; (2.1)

；（2.2）

; (2.2)

其中，公式（2.1）中

，

，

都是可训练参数，

是字符的顺序，包括[0,1,…N]，

做的是词嵌入操作，

是公式（1.4）中得到的特征向量；公式（2.2）中

指的是以

为底的指数函数，括号中的值是公式（2.1）得到的结果，也就是

的指数，

是预设的单词最大长度，全文其他的

都是同一个超参数，

得到的就是

时刻位置

的权重，通过将权重乘回特征序列

，可以得到已对齐的一维序列特征公式化描述如下：Among them, in formula (2.1)

,

are all trainable parameters.

is the sequence of characters, including [0,1,…N],

What it does is the word embedding operation,

is the eigenvector obtained in formula (1.4); in formula (2.2)

refers to

is the base exponential function, the value in parentheses is the result obtained by formula (2.1), which is

index,

is the preset maximum length of words, the full text is other

are the same hyperparameter,

what you get is

time position

the weights of , by multiplying the weights back into the feature sequence

, the aligned one-dimensional sequence features can be formulated as follows:

；（2.3）

; (2.3)

其中，

是公式（1.4）得到的特征向量，

是公式（2.2）得到的权重，

就是得到的对齐的一维序列。in,

is the eigenvector obtained from formula (1.4),

is the weight obtained by formula (2.2),

is the resulting aligned one-dimensional sequence.

得到了对齐的一维序列特征后，使用长短记忆型递归神经网络(Long Short TermMemory，LSTM)对其进行解码，由于时间因素，这里不采用自回归的方式解码，而是得到结果后对其进行一步解码，具体的公式化描述如下：After obtaining the aligned one-dimensional sequence features, use Long Short Term Memory (LSTM) to decode them. Due to the time factor, auto-regressive decoding is not used here, but the results are obtained. One-step decoding, the specific formulation is described as follows:

；（2.4）

; (2.4)

；（2.5）

; (2.5)

；（2.6）

; (2.6)

；（2.7）

; (2.7)

；（2.8）

; (2.8)

；（2.9）

; (2.9)

；（2.10）

; (2.10)

其中，

，

，

是LSTM对应的三个门控

对应的是

函数，所有的

与

都是用于学习的参数，

对应着上一时间片的隐藏层状态，

是公式（2.3）得到的结果，对应的是

时刻的输入，

是

函数，

对应的是点乘，

是LSTM的输出。in,

,

are the three gates corresponding to LSTM

corresponds to

function, all

and

are all parameters used for learning,

Corresponding to the hidden layer state of the previous time slice,

is the result obtained by formula (2.3), which corresponds to

time input,

Yes

function,

Corresponding to the point multiplication,

is the output of the LSTM.

对其进行全连接后得到最终的识别结果：After it is fully connected, the final recognition result is obtained:

；（2.11）

; (2.11)

其中，

是公式（2.10）中LSTM的输出，

是全连接层，

是语义模型最终的预测结果，

指的是空间维度，

是单词的最大长度，这里是预定义的值，

是字符的类别数，同样是预定义的值，

是解码后的视觉特征。in,

is the output of the LSTM in Equation (2.10),

is a fully connected layer,

is the final prediction result of the semantic model,

refers to the spatial dimension,

is the maximum length of a word, here is the predefined value,

is the number of categories of characters, again a predefined value,

is the decoded visual feature.

通过LSTM的解码，这里得到了解码后的语义特征，对齐进行全连接后得到了语义部分的识别结果，同样的，对于语义模型，采用交叉熵，监督这一部分的识别结果，公式化描述如下：Through the decoding of LSTM, the decoded semantic features are obtained here, and the recognition results of the semantic parts are obtained after alignment and full connection. Similarly, for the semantic model, cross entropy is used to supervise the recognition results of this part. The formula is described as follows:

；（2.12）

; (2.12)

其中，

是公式（2.10）得到的

时刻的预测值，

对应的是

时刻的真实值。

是字符的长度，

指的就是求每一个预测的字符与真实字符之间的损失。in,

is obtained by formula (2.10)

the predicted value at the moment,

corresponds to

The true value of the moment.

is the length of characters,

It refers to the loss between each predicted character and the real character.

进一步地，所述视觉语义特征融合模块，用于对视觉特征和语义特征进行交互，得到增强的视觉特征和增强的语义特征；采用门机制决策，对增强的视觉特征和增强的语义特征进行融合。Further, the visual semantic feature fusion module is used to interact with the visual feature and the semantic feature to obtain the enhanced visual feature and the enhanced semantic feature; adopt the gate mechanism decision to fuse the enhanced visual feature and the enhanced semantic feature .

示例性地，通过视觉特征提取模块与语义特征提取模块，分别解码得到了视觉特征

与语义特征

。为了充分利用两者的特征，本实施例采取一种新的视觉语义特征交互方式，具体的，同样遵循上面的self-attention的范式，不同的是这里是用于视觉特征与语义特征的融合，因此在交互时使用一个特征做查询

，另一个特征做索引

与键值

，

对应的是query也就是查询，key对应的是索引

，value对应的是键值

。图像化表示如图2，具体的公式化描述如下：Exemplarily, through the visual feature extraction module and the semantic feature extraction module, the visual features are obtained by decoding respectively.

with semantic features

. In order to make full use of the features of the two, this embodiment adopts a new visual-semantic feature interaction method. Specifically, it also follows the above self-attention paradigm. The difference is that it is used for the fusion of visual features and semantic features. So use a feature to query when interacting

, another feature to index

with key value

,

The corresponding query is the query, and the key corresponds to the index

, value corresponds to the key value

. The graphical representation is shown in Figure 2, and the specific formulation is described as follows:

；（3.1）

; (3.1)

；（3.2）

; (3.2)

其中，

对应的是公式（2.10）得到的语义特征，

对应的是公式（1.8）得到的视觉特征，

是特征的维度，

是一个超参数，

是

函数，

与

为视觉主导的融合向量与语义主导的融合向量。in,

Corresponding to the semantic features obtained by formula (2.10),

Corresponding to the visual features obtained by formula (1.8),

is the dimension of the feature,

is a hyperparameter,

Yes

function,

and

It is the visual-dominant fusion vector and the semantic-dominant fusion vector.

通过这一步，得到了交互的视觉特征与语义特征，这种方式使得视觉信息与语义信息充分利用，接着，使用门机制来对两者交互的特征来进行融合，具体公式如下：Through this step, the visual features and semantic features of the interaction are obtained. This method makes full use of the visual information and semantic information. Then, the gate mechanism is used to fuse the interactive features of the two. The specific formula is as follows:

；（3.3）

; (3.3)

；（3.4）

; (3.4)

其中，公式（3.3）中

是可训练参数，

与

为公式（3.1）和公式（3.2）得到的结果，

是将

与

进行级联。公式（3.4）中

是最终融合得到的向量，

是公式（3.3）得到的结果，

与

为公式（3.1）与公式（3.2）得到的结果，最后对

进行全连接，得到了最终结果

，公式化描述如下：Among them, in formula (3.3)

are trainable parameters,

and

The results obtained for Equation (3.1) and Equation (3.2),

will

and

to cascade. In formula (3.4)

is the final fusion vector,

is the result obtained by formula (3.3),

and

For the results obtained from formula (3.1) and formula (3.2), finally

Perform full connection and get the final result

, which is formulated as follows:

；（3.5）

; (3.5)

其中，

是全连接函数，

是公式（3.4）的输出，

是最终的预测结果。in,

is the fully connected function,

is the output of Equation (3.4),

is the final prediction result.

得到最终的结果后，对结果采取交叉熵损失来监督模型的训练，公式化描述如下：After the final result is obtained, the cross-entropy loss is used to supervise the training of the model. The formulation is described as follows:

；（3.6）

; (3.6)

其中，

指的是公式（3.5）中得到的结果，

也就对应着第

的时间片预测的结果，其实也就是第

个字符的预测结果。

就是第

个字符的真实值，这里的公式，同公式（1.10）和公式（2.11）一样，

是融合模块使用交叉熵损失来监督模型的训练。in,

refers to the result obtained in Equation (3.5),

which corresponds to the

The result of the time slice prediction is actually the first

predictions for characters.

is the first

The real value of characters, the formula here is the same as formula (1.10) and formula (2.11),

It is the fusion module that uses the cross-entropy loss to supervise the training of the model.

进一步地，所述预测模块，对增强的视觉特征和增强的语义特征进行识别，得到最终的文本识别结果；Further, the prediction module identifies the enhanced visual feature and the enhanced semantic feature to obtain the final text recognition result;

对增强的视觉特征和增强的语义特征，采用全连接层得到整体的预测结果，使用第三交叉熵损失监督训练。For enhanced visual features and enhanced semantic features, a fully connected layer is used to obtain the overall prediction results, and a third cross-entropy loss is used to supervise training.

进一步地，所述训练后的深度学习模型，训练过程包括：Further, the deep learning model after the training, the training process includes:

构建训练集；所述训练集为已知文本识别结果的自然场景图像；constructing a training set; the training set is a natural scene image with known text recognition results;

将训练集输入到深度学习模型中，对模型进行训练，当总损失函数值最小时，或者达到设定迭代次数时，停止训练，得到训练后的深度学习模型；Input the training set into the deep learning model, and train the model. When the total loss function value is the smallest, or when the set number of iterations is reached, the training is stopped, and the trained deep learning model is obtained;

其中，总损失函数，为第一、第二和第三交叉熵损失函数的求和结果。Among them, the total loss function is the summation result of the first, second and third cross-entropy loss functions.

其中，总损失函数，为：Among them, the total loss function is:

；（3.7）

; (3.7)

目标函数共有三个部分组成，其中，

，

，

是用于平衡的超参数，

是视觉部分的损失，

是语义部分的损失，

是融合损失。文本识别损失函数用的都是交叉熵损失函数。The objective function consists of three parts, among which,

,

are the hyperparameters used for balancing,

is the loss of the visual part,

is the loss of the semantic part,

is the fusion loss. The text recognition loss function uses the cross entropy loss function.

进一步地，所述构建训练集，包括：Further, the construction of the training set includes:

获取自然场景图像；Get natural scene images;

对自然场景图像进行增广处理；Augmenting natural scene images;

对增广处理后的自然场景图像进行尺寸归一化处理。Normalize the size of the augmented natural scene image.

示例性地，之所以需要对增广处理后的自然场景图像进行尺寸归一化处理，是因为自然场景中的文本图像有着各种各样的形状，长宽比例，为了得到泛化性能更好的模型，本实施例将文本图像的宽度与高度分别设定为宽度预设值大小与高度预设值大小。Exemplarily, the reason why it is necessary to normalize the size of the augmented natural scene image is because the text images in the natural scene have various shapes and aspect ratios, in order to obtain better generalization performance. In this embodiment, the width and height of the text image are respectively set to the default width and height.

示例性地，之所以需要对自然场景图像进行增广处理，是因为自然场景中广泛存在的旋转，透视扭曲，模糊等问题，本实施例使用概率预设值来对原始图像使用随机添加高斯噪声，旋转，透视扭曲等图像增广的方式。Exemplarily, the reason why it is necessary to perform augmentation processing on natural scene images is due to the widespread problems of rotation, perspective distortion, and blurring in natural scenes. This embodiment uses a probability preset value to randomly add Gaussian noise to the original image. , rotation, perspective distortion and other image augmentation methods.

对于图像的真实本文标签，使用的是包含所有英文字符，0-9数组，以及结束字符的字符字典，由于任务是对序列分类，因此本实施例使用长度预设值来规定文本的最大长度。For the real text label of an image, a character dictionary containing all English characters, an array of 0-9, and an end character is used. Since the task is to classify sequences, this embodiment uses a length preset value to specify the maximum length of the text.

实施例二Embodiment 2

本实施例提供了自然场景图像的文本识别系统；This embodiment provides a text recognition system for natural scene images;

此处需要说明的是，上述获取模块和识别模块对应于实施例一中的步骤S101至S102，上述模块与对应的步骤所实现的示例和应用场景相同，但不限于上述实施例一所公开的内容。需要说明的是，上述模块作为系统的一部分可以在诸如一组计算机可执行指令的计算机系统中执行。It should be noted here that the above-mentioned acquisition module and identification module correspond to steps S101 to S102 in the first embodiment, and the examples and application scenarios implemented by the above-mentioned modules and the corresponding steps are the same, but are not limited to those disclosed in the above-mentioned first embodiment. content. It should be noted that the above modules may be executed in a computer system such as a set of computer-executable instructions as part of the system.

上述实施例中对各个实施例的描述各有侧重，某个实施例中没有详述的部分可以参见其他实施例的相关描述。The description of each embodiment in the foregoing embodiments has its own emphasis. For the part that is not described in detail in a certain embodiment, reference may be made to the relevant description of other embodiments.

所提出的系统，可以通过其他的方式实现。例如以上所描述的系统实施例仅仅是示意性的，例如上述模块的划分，仅仅为一种逻辑功能划分，实际实现时，可以有另外的划分方式，例如多个模块可以结合或者可以集成到另外一个系统，或一些特征可以忽略，或不执行。The proposed system can be implemented in other ways. For example, the system embodiments described above are only illustrative. For example, the division of the above modules is only a logical function division. In actual implementation, there may be other division methods. For example, multiple modules may be combined or integrated into other A system, or some feature, can be ignored, or not implemented.

以上所述仅为本发明的优选实施例而已，并不用于限制本发明，对于本领域的技术人员来说，本发明可以有各种更改和变化。凡在本发明的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included within the protection scope of the present invention.

Claims

1. The text recognition method of the natural scene image is characterized by comprising the following steps:

acquiring a natural scene image to be identified;

performing text recognition on the natural scene image to be recognized by adopting the trained deep learning model to obtain a recognized text;

firstly, correcting a natural scene image to be recognized by a deep learning model, and then extracting a feature vector of the image from the corrected image; respectively extracting visual features and semantic features from the feature vectors of the image, performing feature fusion on the two features, and finally performing text recognition on the fused features;

the deep learning model comprises: the correction module is connected with the input end of a backbone network, and the output end of the backbone network is respectively connected with the input end of the visual feature extraction module and the input end of the semantic feature extraction module; the output end of the visual characteristic extraction module and the output end of the semantic characteristic extraction module are both connected with the input end of the visual semantic characteristic fusion module, the output end of the visual semantic characteristic fusion module is connected with the input end of the prediction module, and the output end of the prediction module outputs a text recognition result;

the backbone network is realized by adopting a ResNet convolution neural network and is used for extracting spatial features of a natural scene image, position information embedding is carried out on the spatial features by adopting position coding, and feature enhancement is carried out on the spatial features after the position information embedding by adopting a first Transformer neural network to obtain feature vectors of the image;

the visual feature extraction module is used for separating a visual part and a semantic part of the image feature vector by adopting a second transform neural network, and decoding the visual part by adopting a position attention mechanism module to obtain visual features; the position attention mechanism module is used for inquiring self-attention mechanism self-attention

Key, key

Sum value

Replacing with different elements; wherein the query

Is replaced by position code, key

Is replaced by the output value, value of the UNet network

Is replaced by

Identity mapping of (2);

；（1.4）

wherein,

is a position code, and the position code is,

it is referred to as an input image,

refer to a residual neural network that is,

refers to a Transformer network;

；（1.5）

wherein, in the formula (1.5)

Are all obtained from the formula (1.4)

The identity of the image to be scanned is mapped,

is the dimension of the feature that is,

is a hyper-parameter which is the parameter,

refer to

A function;

（1.7）

wherein,

is a characteristic obtained by the formula (1.5),

is a network of UNet's, and,

is the resulting bond;

the semantic feature extraction module adopts an attention mechanism module to align the feature vectors of the images, and then decodes the aligned data to obtain semantic features; wherein, the alignment processing refers to aligning the two-dimensional features into one-dimensional features; the aligned data is decoded and realized by adopting a long-term and short-term memory network.

2. The method for recognizing texts in images of natural scenes according to claim 1, wherein the visual semantic feature fusion module is configured to interact with the visual features and the semantic features to obtain enhanced visual features and enhanced semantic features; and adopting a door mechanism decision to fuse the enhanced visual features and the enhanced semantic features.

3. The text recognition system for a natural scene image using the text recognition method for a natural scene image according to claim 1, comprising:

an acquisition module configured to: acquiring a natural scene image to be identified;

an identification module configured to: performing text recognition on the natural scene image to be recognized by adopting the trained deep learning model to obtain a recognized text;

firstly, correcting a natural scene image to be recognized by a deep learning model, and then extracting a feature vector of the image from the corrected image; and respectively extracting visual features and semantic features from the feature vectors of the image, performing feature fusion on the two features, and finally performing text recognition on the fused features.