CN114495101A

CN114495101A - Text detection method, text detection network training method and device

Info

Publication number: CN114495101A
Application number: CN202210034256.9A
Authority: CN
Inventors: 张晓强; 钦夏孟; 章成全; 姚锟
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-01-12
Filing date: 2022-01-12
Publication date: 2022-05-13

Abstract

The present disclosure provides a text detection method and a training method for a text detection network. It relates to the technical field of image processing, and in particular to the technical field of artificial intelligence. The specific implementation scheme is: determining a sequence feature of an image to be detected; determining a decoded sequence vector based on the sequence feature and an instance feature corresponding to a text instance; determining the type of the image to be detected based on the decoded sequence vector; and, in response to the type indicating that the image to be detected includes text, determining position information of the text in the image to be detected based on the decoded sequence vector and a vector corresponding to the sequence feature.

Description

Text detection method, text detection network training method and device

Technical Field

The present disclosure relates to the technical field of image processing, and in particular to an artificial-intelligence-based text detection method, a training method for a text detection network, and corresponding devices.

Background

Text detection refers to locating the text in an image and finding its bounding box. It is a preliminary step for many vision tasks, such as text recognition and scene understanding, and is widely used in business scenarios such as ID card recognition and bill recognition, where it greatly reduces manual data-entry time and improves efficiency.

Summary of the Invention

The present disclosure provides a text detection method, a training method for a text detection network, and corresponding devices.

According to one aspect of the present disclosure, a text detection method is provided, comprising:

determining a sequence feature of an image to be detected;

determining a decoded sequence vector based on the sequence feature and an instance feature corresponding to a text instance;

determining the type of the image to be detected based on the decoded sequence vector; and

in response to the type indicating that the image to be detected includes text, determining position information of the text in the image to be detected based on the decoded sequence vector and a vector corresponding to the sequence feature.

According to another aspect of the present disclosure, a training method for a text detection network is provided, the text detection network comprising an encoding sub-network, a decoding sub-network and an output sub-network, the method comprising:

determining, based on the encoding sub-network, a sequence sample feature of a sample image in a training sample set;

using the sequence sample feature and an instance sample feature corresponding to a text instance sample as the input of a cross-layer attention layer of the decoding sub-network, and taking the output of the cross-layer attention layer as a decoded sample sequence vector;

using the decoded sample sequence vector as the input of the output sub-network, and determining a predicted type of the sample image according to the output of the output sub-network;

in response to the predicted type indicating that the sample image includes text, using the decoded sample sequence vector and the sequence sample feature as the input of the output sub-network, and determining predicted position information of the text in the sample image according to the output of the output sub-network; and

matching the predicted type of the sample image against the annotated type of the sample image, and the predicted position information of the text in the sample image against the annotated position information, and adjusting the parameters of the text detection network based on the matching result.

A third aspect of the present disclosure provides an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the text detection method or the training method for a text detection network described above.

A fourth aspect of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the text detection method or the training method for a text detection network described above.

A fifth aspect of the present disclosure provides a computer program product comprising a computer program/instructions which, when executed by a processor, implement the text detection method or the training method for a text detection network described above.

It should be understood that the content described in this section is not intended to identify key or critical features of the embodiments of the present disclosure, nor to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.

Brief Description of the Drawings

The accompanying drawings are provided for a better understanding of the present solution and do not limit the present disclosure. In the drawings:

FIG. 1 shows an optional schematic flowchart of the text detection method provided by an embodiment of the present application;

FIG. 2 shows an optional schematic flowchart of the training method for a text detection network provided by an embodiment of the present application;

FIG. 3 shows another optional schematic flowchart of the text detection method provided by an embodiment of the present application;

FIG. 4 shows a schematic data diagram of the text detection method provided by an embodiment of the present application;

FIG. 5 shows an optional schematic structural diagram of the text detection device provided by an embodiment of the present application;

FIG. 6 shows an optional schematic structural diagram of the text detection network training device provided by an embodiment of the present application;

FIG. 7 shows a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to facilitate understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.

Text detection refers to locating the text in an image and finding its bounding box. It is a preliminary step for many vision tasks, such as text recognition and scene understanding, and is widely used in business scenarios such as ID card recognition and bill recognition, where it greatly reduces manual data-entry time and improves efficiency. Text detection in natural scenes differs from general object detection: text, as the main visual target, varies widely in font, size, color, orientation and shape, which makes it harder to detect than general objects. Text detection techniques have developed rapidly in recent years and achieve good results on conventional text detection datasets, but their results are disappointing on challenging natural-scene datasets that contain text of arbitrary shape.

Existing methods fall mainly into regression-based methods and segmentation-based methods. Regression-based methods often require additional shape modeling to handle curved text and still cannot effectively handle text of arbitrary shape. Segmentation-based methods handle arbitrary shapes naturally, but they often need post-processing rules to separate different text entities and therefore do not achieve the best results.

Regression-based methods locate the target text by regressing its bounding box. Segmentation-based methods usually use a fully convolutional network to classify the image pixel by pixel into text and non-text regions, and then convert the pixel-level output into bounding boxes through specific post-processing operations. Segmentation-based text detection algorithms mainly use Mask-RCNN as the base neural network to generate segmentation maps. Although they achieve high accuracy on conventional horizontal text, they usually require complex post-processing steps to produce the bounding boxes, which consumes considerable memory and time at inference because the generated regions must be refined and labeled. In addition, because they rely on segmentation, they perform poorly on overlapping text boxes. Regression-based detection methods usually predict the bounding box directly; common examples include EAST and CTPN. Because their post-processing is simple, their inference speed is markedly better than that of segmentation-based algorithms, but in complex natural scenes, where fonts vary widely and scene interference is severe, their detection performance is also unsatisfactory.

In summary, existing text detection methods have the following shortcomings:

1) Most existing methods achieve satisfactory detection accuracy on horizontal text boxes. For text of arbitrary shape in natural scenes, however, their limited modeling capacity still leads to many missed and false detections.

2) Most existing text detection models are based on CNN architectures and require many hand-crafted prior assumptions and cumbersome post-processing, such as anchor design and non-maximum suppression (NMS), which makes the overall pipeline complex.

The present disclosure provides a text detection method and a training method for a text detection network, so as to at least address the above shortcomings of text detection in the prior art.

FIG. 1 shows an optional schematic flowchart of the text detection method provided by an embodiment of the present application, which is described step by step below.

Step S101: determine the sequence feature of the image to be detected.

In some embodiments, the text detection device (hereinafter referred to as the first device) converts the image matrix corresponding to the image to be detected into a one-dimensional vector, and takes the feature corresponding to that one-dimensional vector as the sequence feature of the image to be detected.

In a specific implementation, the first device may determine the sequence feature of the image to be detected based on the encoding sub-network included in the text detection network; that is, the image to be detected is input into the encoding sub-network to obtain its sequence feature.
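As a minimal sketch of the flattening in step S101, the following assumes the encoder output is an `(h, w, c)` feature map that is read row by row into an `(h*w, c)` sequence. The function name `image_to_sequence` and the channel dimension `c` are illustrative assumptions, not names taken from the disclosure.

```python
import numpy as np

def image_to_sequence(feature_map: np.ndarray) -> np.ndarray:
    """Flatten an (h, w, c) feature map row by row into an (h*w, c) sequence."""
    h, w, c = feature_map.shape
    # NumPy's default C order walks each row left to right, then moves down.
    return feature_map.reshape(h * w, c)

# Toy 2x3 "image" with a single channel (hypothetical data).
image = np.arange(6, dtype=np.float32).reshape(2, 3, 1)
sequence = image_to_sequence(image)
print(sequence.shape)
```

Keeping the row-major order fixed matters because later steps map positions in the sequence back to pixel positions in the image.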

Step S102: determine the decoded sequence vector based on the sequence feature and the instance feature corresponding to the text instance.

In some embodiments, before determining the decoded sequence vector, the first device obtains an embedding value corresponding to the text instance; the embedding value is the instance feature corresponding to the text instance. The text instance includes a character string and/or a phrase, and different text instances correspond to different embedding values.

In a specific implementation, the first device may obtain the embedding value corresponding to the text instance based on the self-attention layer of the decoding sub-network included in the text detection network; that is, the text instance is the input of the self-attention layer, and the output of the self-attention layer is the embedding value corresponding to the text instance.

In some embodiments, the first device may obtain the decoded sequence vector based on the cross-layer attention layer included in the decoding sub-network; that is, the sequence feature and the instance feature corresponding to the text instance are the input of the cross-layer attention layer, and the output of the cross-layer attention layer is the decoded sequence vector.
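The disclosure does not spell out the internals of the cross-layer attention layer. The sketch below assumes it behaves like standard single-head scaled dot-product cross-attention, with the instance features as queries and the sequence features as keys and values, and with learned projections omitted for brevity. All names are illustrative.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(instance_feats: np.ndarray, sequence_feats: np.ndarray) -> np.ndarray:
    """instance_feats: (n, d) queries; sequence_feats: (L, d) keys and values.

    Returns one decoded vector of size d per text instance, shape (n, d).
    """
    d = instance_feats.shape[-1]
    scores = instance_feats @ sequence_feats.T / np.sqrt(d)  # (n, L)
    weights = softmax(scores, axis=-1)                       # each row sums to 1
    return weights @ sequence_feats                          # (n, d)

rng = np.random.default_rng(0)
instance_feats = rng.normal(size=(4, 8))   # 4 hypothetical text-instance queries
sequence_feats = rng.normal(size=(6, 8))   # flattened 2x3 feature map
decoded = cross_attention(instance_feats, sequence_feats)
print(decoded.shape)
```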

Step S103: determine the type of the image to be detected based on the decoded sequence vector.

In some embodiments, the first device may determine the type of the image to be detected based on the output sub-network included in the text detection network; that is, the decoded sequence vector is the input of the fully connected layer included in the output sub-network, and the output of the fully connected layer is the type of the image to be detected.

The type of the image to be detected is either that the image includes text or that the image does not include text.
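A minimal sketch of the classification in step S103, assuming a single fully connected layer followed by a softmax over two classes (index 0 for "no text", index 1 for "text"). The class ordering, the softmax and the toy weights are assumptions made for illustration.

```python
import numpy as np

def classify(decoded: np.ndarray, weight: np.ndarray, bias: np.ndarray):
    """decoded: (n, d); weight: (d, 2); bias: (2,).

    Returns class probabilities (n, 2) and labels (n,), where 1 means "includes text".
    """
    logits = decoded @ weight + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs, probs.argmax(axis=-1)

# Toy inputs: identity weights make the logits equal to the decoded vectors.
decoded = np.array([[2.0, 0.0], [0.0, 3.0]])
weight = np.eye(2)
bias = np.zeros(2)
probs, labels = classify(decoded, weight, bias)
print(labels)
```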

Step S104: in response to the type indicating that the image to be detected includes text, determine the position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence feature.

In some embodiments, when the type indicates that the image to be detected includes text, the first device determines the position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence feature.

In some embodiments, the first device multiplies the decoded sequence vector by the vector corresponding to the sequence feature to obtain a product result; determines the position information of the text in the sequence feature based on the product result; and determines the position information of the text in the image to be detected based on the position information of the text in the sequence feature. Optionally, the multiplication may be a tensor multiplication; when performing it, the decoded sequence vector and the vector corresponding to the sequence feature may be brought to the same length by zero-padding, copying, or similar means.
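The multiplication in step S104 can be sketched as follows, assuming the decoded vector and the per-position sequence features share the same feature dimension `d` (so no padding is needed), and assuming a sigmoid with a 0.5 threshold turns the product into a binary mask. The sigmoid, the threshold and the function name `predict_mask` are illustrative assumptions.

```python
import numpy as np

def predict_mask(decoded_vec: np.ndarray, sequence_feats: np.ndarray,
                 h: int, w: int, threshold: float = 0.5) -> np.ndarray:
    """decoded_vec: (d,); sequence_feats: (h*w, d). Returns a boolean (h, w) mask."""
    scores = sequence_feats @ decoded_vec      # one score per sequence position
    probs = 1.0 / (1.0 + np.exp(-scores))      # assumed sigmoid activation
    return (probs > threshold).reshape(h, w)   # undo the row-major flattening

h, w = 2, 3
sequence_feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                           [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
decoded_vec = np.array([4.0, -4.0])
mask = predict_mask(decoded_vec, sequence_feats, h, w)
print(mask)
```

Positions whose features align with the decoded vector receive high scores and are marked as text; reshaping with the original `h` and `w` recovers their location in the image.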

In some optional embodiments, after determining the position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence feature, the first device may further determine the connected domains in the image to be detected based on the position information of the text, and determine a text bounding box based on the boundary of each connected domain. The text bounding box is used to identify the text in the image to be detected.
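The connected-domain step can be sketched with a breadth-first search over a binary text mask. The sketch assumes 4-connectivity and axis-aligned boxes in `(row_min, col_min, row_max, col_max)` form; the disclosure does not fix either choice, and a tighter polygon could be traced from the same regions.

```python
from collections import deque
import numpy as np

def connected_boxes(mask: np.ndarray):
    """Return one (row_min, col_min, row_max, col_max) box per 4-connected True region."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for r in range(h):
        for c in range(w):
            if not mask[r, c] or seen[r, c]:
                continue
            # Breadth-first search over one connected region, tracking its extent.
            queue = deque([(r, c)])
            seen[r, c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0, r1, c1 = min(r0, y), min(c0, x), max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes

mask = np.array([[1, 1, 0, 0],
                 [0, 1, 0, 1],
                 [0, 0, 0, 1]], dtype=bool)
boxes = connected_boxes(mask)
print(boxes)  # [(0, 0, 1, 1), (1, 3, 2, 3)]
```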

In this way, the text detection method provided by the embodiments of the present disclosure can directly predict, at the text-instance level, the position of text of arbitrary shape in an image, and can handle detection of text of various shapes in complex scenes. In particular, when combined with Optical Character Recognition (OCR), it can improve the text recognition capability of OCR in natural scenes.

FIG. 2 shows an optional schematic flowchart of the training method for a text detection network provided by an embodiment of the present application, which is described step by step below.

Step S201: determine, based on the encoding sub-network, the sequence sample feature of a sample image in the training sample set.

In some embodiments, the text detection network includes an encoding sub-network, a decoding sub-network and an output sub-network; the decoding sub-network includes a self-attention layer and a cross-layer attention layer, and the output sub-network includes a fully connected layer.

In some embodiments, the training device for the text detection network (hereinafter referred to as the second device) converts, based on the encoding sub-network, the image matrix corresponding to the sample image into a one-dimensional vector representing the sample image; the feature corresponding to that one-dimensional vector is the sequence sample feature of the sample image.

Step S202: use the sequence sample feature and the instance sample feature corresponding to the text instance sample as the input of the cross-layer attention layer of the decoding sub-network, and take the output of the cross-layer attention layer as the decoded sample sequence vector.

In some embodiments, before determining the decoded sample sequence vector, the second device may also input the text instance sample into the self-attention layer of the decoding sub-network and obtain, from the output of the self-attention layer, the embedding value corresponding to the text instance sample; the embedding value is the instance sample feature corresponding to the text instance sample. The text instance sample includes a character string and/or a phrase, and different text instance samples correspond to different embedding values. Optionally, the instance sample feature may be a one-dimensional vector.

In some embodiments, the second device inputs the sequence sample feature and the instance sample feature into the cross-layer attention layer, and takes the output of the cross-layer attention layer as the decoded sample sequence vector.

Step S203: use the decoded sample sequence vector as the input of the output sub-network, and determine the predicted type of the sample image according to the output of the output sub-network.

In some embodiments, the second device uses the decoded sample sequence vector as the input of the fully connected layer included in the output sub-network, and determines the predicted type of the sample image based on the output of the fully connected layer.

The predicted type is either that the sample image includes text or that the sample image does not include text.

Step S204: in response to the predicted type indicating that the sample image includes text, use the decoded sample sequence vector and the sequence sample feature as the input of the output sub-network, and determine the predicted position information of the text in the sample image according to the output of the output sub-network.

In some embodiments, when the predicted type indicates that the sample image includes text, the second device multiplies the decoded sample sequence vector by the vector corresponding to the sequence sample feature to obtain a product result; determines the predicted position information of the text in the sequence sample feature based on the product result; and determines the predicted position information of the text in the sample image based on the predicted position information of the text in the sequence sample feature.

In some optional embodiments, the multiplication of the decoded sample sequence vector and the vector corresponding to the sequence sample feature may be a tensor multiplication; when performing it, the two vectors may be brought to the same length by zero-padding, copying, or similar means.

In a specific implementation, after the tensor multiplication of the vector corresponding to the sequence sample feature and the decoded sample sequence vector, the entries of the result (represented as a vector) that correspond to text are larger (or smaller) than the entries that correspond to non-text, so the predicted position information of the text in the sequence sample feature can be determined from the product. Because the sequence sample feature is a one-dimensional vector obtained by transforming the sample image, the predicted position information of the text in the sample image can then be determined, by means of a matrix transformation, from the predicted position information of the text in the sequence sample feature.
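The mapping from sequence positions back to image positions can be sketched with `np.unravel_index`, assuming the sequence was produced by row-major flattening of an `h` by `w` grid and that text entries are those with a score above a threshold. The threshold of 0 and the name `sequence_to_image_positions` are illustrative assumptions.

```python
import numpy as np

def sequence_to_image_positions(seq_scores: np.ndarray, h: int, w: int,
                                threshold: float = 0.0):
    """Return (row, col) pairs for sequence entries whose score exceeds the threshold."""
    idx = np.flatnonzero(seq_scores > threshold)   # positions in the 1-D sequence
    rows, cols = np.unravel_index(idx, (h, w))     # inverse of row-major flattening
    return list(zip(rows.tolist(), cols.tolist()))

# Hypothetical product result for a 2x3 image: text entries are the positive ones.
seq_scores = np.array([-1.0, 2.0, -0.5, 3.0, -2.0, 0.7])
positions = sequence_to_image_positions(seq_scores, h=2, w=3)
print(positions)  # [(0, 1), (1, 0), (1, 2)]
```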

Step S205: match the predicted type of the sample image against the annotated type of the sample image, and the predicted position information of the text in the sample image against the annotated position information, and adjust the parameters of the text detection network based on the matching result.

In some embodiments, if the predicted type is the same as the annotated type and the loss value between the predicted position information and the annotated position information is less than a preset threshold, the second device determines not to adjust the parameters of the text detection network.

In other embodiments, if the predicted type differs from the annotated type, or the loss value between the predicted position information and the annotated position information is greater than or equal to the preset threshold, the parameters of the text detection network are adjusted based on the difference between the predicted type and the annotated type of the sample image, and/or the difference between the predicted position information and the annotated position information of the text in the sample image.
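The two cases of step S205 reduce to a single predicate. The sketch below only encodes the decision of whether to adjust the parameters; how the adjustment itself is carried out (for example by gradient descent) is not specified here. The function and argument names are hypothetical.

```python
def should_update(pred_type: str, label_type: str,
                  position_loss: float, threshold: float) -> bool:
    """Return True when the network parameters should be adjusted.

    Parameters are left unchanged only if the predicted type matches the
    annotation AND the position loss is strictly below the preset threshold.
    """
    return pred_type != label_type or position_loss >= threshold

print(should_update("text", "text", 0.05, 0.1))     # False: both conditions satisfied
print(should_update("text", "no_text", 0.05, 0.1))  # True: type mismatch
print(should_update("text", "text", 0.1, 0.1))      # True: loss reaches the threshold
```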

In a specific implementation, a bipartite graph matching algorithm may be used to match the predicted type with the annotated type, or the predicted position information with the annotated position information, and the type loss and the position information loss (or mask loss) are then computed separately.
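Bipartite matching assigns each prediction to at most one annotation so that the total matching cost is minimal. The sketch below uses a brute-force search over permutations as a stand-in for an efficient solver such as the Hungarian algorithm; it is only practical for a handful of instances. The cost matrix is hypothetical and would in practice combine the type and position costs.

```python
from itertools import permutations

def match_predictions(cost):
    """cost[i][j]: cost of assigning prediction i to annotation j (square matrix).

    Returns (assignment, total), where assignment[i] is the annotation index
    matched to prediction i and total is the minimal summed cost.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total

cost = [[0.9, 0.1, 0.8],
        [0.2, 0.7, 0.9],
        [0.8, 0.9, 0.3]]
assignment, total = match_predictions(cost)
print(assignment)  # (1, 0, 2)
```

Once the assignment is known, the losses are computed only between each prediction and its matched annotation.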

The loss value may include at least one of a binary cross-entropy loss value and a Dice loss value.
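For reference, the two losses can be sketched in their common textbook forms, assuming `pred` holds per-position probabilities in [0, 1] and `target` holds the binary annotation. The smoothing constant `eps` and the exact Dice variant are assumptions; the disclosure does not specify them.

```python
import numpy as np

def binary_cross_entropy(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def dice_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """1 - Dice coefficient, where Dice = 2|P∩T| / (|P| + |T|)."""
    intersection = np.sum(pred * target)
    return float(1.0 - (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps))

target = np.array([1.0, 1.0, 0.0, 0.0])
perfect = target.copy()   # prediction identical to the annotation
wrong = 1.0 - target      # prediction inverted everywhere

print(dice_loss(perfect, target), dice_loss(wrong, target))
print(binary_cross_entropy(perfect, target) < binary_cross_entropy(wrong, target))
```

Dice loss measures overlap over the whole mask and is less sensitive to the imbalance between text and background pixels, which is one common reason to pair it with cross-entropy.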

In this way, the training method for a text detection network provided by the embodiments of the present disclosure yields a neural network that can directly predict, at the text-instance level, the position of text of arbitrary shape in an image, providing strong support for detecting text of various shapes in complex scenes.

FIG. 3 shows another optional schematic flowchart of the text detection method provided by an embodiment of the present application, which is described step by step below; FIG. 4 shows a schematic data diagram of the text detection method provided by an embodiment of the present application.

In the text detection method provided by the embodiments of the present disclosure, a natural-scene image containing text of arbitrary shape (the image to be detected) first passes through the encoding sub-network of the text detection network, which extracts the sequence feature of the image. The decoding sub-network, together with the text vectors corresponding to different learnable text instances (Object Queries), then attends to the information of the different text instances in the image. The text detection network then outputs the position information of each text instance at the instance level. Based on this position information, a simple connected-domain analysis yields the bounding box of each text instance, which provides strong support for subsequent text recognition and improves its accuracy.

The following steps perform text detection with a text detection network that has already been trained (for example, through steps S201 to S205):

Step S301: input the image to be detected into the text detection network.

Step S302: the encoding sub-network included in the text detection network determines the sequence feature of the image to be detected.

In some embodiments, the encoding sub-network (also called the Encoder module) included in the text detection network may be based on a convolutional neural network (CNN), a Transformer (a type of machine learning model), or a hybrid CNN-Transformer network structure; its purpose is to extract the sequence feature of the image to be detected. The sequence is the vector obtained by arranging the image row by row (for example, if the image to be detected is represented as an h*w matrix, the corresponding sequence is a one-dimensional vector of size 1*hw).

In some embodiments, the encoding sub-network converts the image matrix corresponding to the image to be detected into a one-dimensional vector, and the feature corresponding to that one-dimensional vector is taken as the sequence feature of the image to be detected.

步骤S303，文本检测网络包括的译码子网络确定解码后的序列向量。Step S303, the decoding sub-network included in the text detection network determines the decoded sequence vector.

在一些实施例中，所述文本检测网络包括的译码子网络(或称译码Decoder模块)基于自注意力和跨层注意力机制构建，目的是对编码子网络提取出的序列特征进行解码操作。具体的，所述译码子网络包括自注意力层和跨层注意力层；所述自注意力层用于将至少一个可学习的文本实例(Object Queries)转换为至少一个文本特征，其中，所述至少一个可学习的文本实例包括长度不同的字符串和/或短语；所述跨层注意力层用于输出经过解码后的序列向量。In some embodiments, the decoding sub-network (or called decoding Decoder module) included in the text detection network is constructed based on self-attention and cross-layer attention mechanism, and the purpose is to decode the sequence features extracted by the encoding sub-network operate. Specifically, the decoding sub-network includes a self-attention layer and a cross-layer attention layer; the self-attention layer is used to convert at least one learnable text instance (Object Queries) into at least one text feature, wherein, The at least one learnable text instance includes strings and/or phrases of different lengths; the cross-layer attention layer is used to output a decoded sequence vector.

In some embodiments, the at least one text instance is input to the self-attention layer of the decoding sub-network, and the embedded value corresponding to the at least one text instance is obtained from the output of the self-attention layer; the embedded value is the instance feature corresponding to the text instance. The sequence feature and the instance feature corresponding to the text instance are then used as the input of the cross-layer attention layer of the decoding sub-network, and the output of the cross-layer attention layer is determined as the decoded sequence vector.
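The cross-layer attention step can be sketched as follows. This is an illustration under stated assumptions, not the disclosed network: it uses single-head scaled dot-product attention with no learned projection matrices, the instance features act as queries, and the sequence features act as both keys and values. The name `cross_attention` is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attention(instance_features: List[Vector],
                    sequence_features: List[Vector]) -> List[Vector]:
    """Return one decoded vector per instance feature.

    Each instance feature (query) attends over all sequence features
    (keys and values); learned projections are omitted for brevity.
    """
    dim = len(sequence_features[0])
    decoded = []
    for query in instance_features:
        weights = softmax([dot(query, key) / math.sqrt(dim)
                           for key in sequence_features])
        decoded.append([sum(w * feat[k] for w, feat in zip(weights, sequence_features))
                        for k in range(dim)])
    return decoded


# Two sequence positions with 2-dimensional features, one instance query.
decoded = cross_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
print(decoded)  # a zero query attends uniformly: [[0.5, 0.5]]
```

In this sketch the number of decoded vectors equals the number of text instances, which matches the idea of one decoded vector per candidate instance.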

Step S304: based on the decoded sequence vector, determine the type of the image to be detected and/or the position information of the text in the image to be detected.

In some embodiments, the output sub-network included in the text detection network includes a fully connected layer. The fully connected layer is used to determine the type of the image to be detected based on the decoded sequence vector. The text detection network is also used to determine the position information of the text in the image to be detected.
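A fully connected layer that maps a decoded sequence vector to a type can be sketched as below. The weights, bias, and the two labels `"no_text"` and `"text"` are illustrative assumptions; the disclosure does not specify the layer size or the label set, and in practice the parameters would be learned during training.

```python
from typing import List, Sequence


def fully_connected(vector: List[float],
                    weights: List[List[float]],
                    bias: List[float]) -> List[float]:
    """Compute one logit per output class: logits = W * vector + bias."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def classify(decoded_vector: List[float],
             weights: List[List[float]],
             bias: List[float],
             labels: Sequence[str] = ("no_text", "text")) -> str:
    """Pick the label whose logit is largest."""
    logits = fully_connected(decoded_vector, weights, bias)
    best = max(range(len(logits)), key=lambda i: logits[i])
    return labels[best]


# Hypothetical 2-class layer over a 2-dimensional decoded vector.
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
print(classify([0.2, 0.9], weights, bias))  # text
```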

In a specific implementation, the output sub-network may multiply the decoded sequence vector by the sequence feature through a tensor multiplication operation and, based on the product result, determine the position information of the text in the sequence feature; then, based on the position information of the text in the sequence feature, it determines the position information of the text in the image to be detected.
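The multiplication step can be sketched as a dot product between one decoded sequence vector and the feature at every sequence position, followed by reshaping back to the image grid. The sigmoid activation and the 0.5 threshold are assumptions made for illustration; the disclosure only states that positions are determined from the product result. The name `locate_text` is hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def locate_text(decoded_vector: List[float],
                sequence_features: List[List[float]],
                height: int,
                width: int,
                threshold: float = 0.5) -> List[List[int]]:
    """Return an h*w binary mask marking positions assigned to the text instance."""
    # One score per sequence position (position information in the sequence feature).
    scores = [sum(q * f for q, f in zip(decoded_vector, feature))
              for feature in sequence_features]
    flat = [1 if sigmoid(s) > threshold else 0 for s in scores]
    # Undo the row-by-row flattening (position information in the image).
    return [flat[r * width:(r + 1) * width] for r in range(height)]


features = [[2.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.0, 0.0]]  # 2*2 image, flattened
print(locate_text([1.0, -1.0], features, height=2, width=2))  # [[1, 0], [1, 0]]
```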

As shown in FIG. 4, after the image to be detected is input to the encoding sub-network, the sequence feature is obtained. After the text instance is input to the decoding sub-network, the resulting instance feature is combined with the sequence feature to produce the output of the decoding sub-network. In the output sub-network, this output is multiplied by the sequence feature to obtain the position information of the text in the image to be detected.

In this way, the embodiments of the present application provide a text detection method that addresses the inability of existing text detection methods to handle text of arbitrary shape effectively. The method converts arbitrary-shape text detection into a mask classification problem (that is, instance segmentation, which distinguishes different instances of different classes). By predicting instance-level text instances, it handles arbitrary shapes and distinguishes different text entities at the same time, without complex post-processing or manual rules. Specifically, the method uses the encoding sub-network of the text detection network to extract features from the input image to be detected and obtain the sequence feature, uses the decoding sub-network of the text detection network to output the position information of at least one text in the image to be detected, and finally performs a simple connected domain analysis on the corresponding positions to obtain the bounding box of each text instance. Compared with previous regression-based and segmentation-based text detection methods, the method directly predicts the position information of instance-level text in the image to be detected, which reduces the complexity of modeling arbitrary shapes, requires no complicated manual post-processing, and can improve the detection of arbitrary text in complex natural scenes in a data-driven way.
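The connected domain analysis mentioned above can be sketched with a breadth-first search over a binary mask. The choice of 4-connectivity and the box format (top, left, bottom, right) are assumptions for illustration; the disclosure does not fix either, and the name `bounding_boxes` is hypothetical.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def bounding_boxes(mask: List[List[int]]) -> List[Box]:
    """Return one axis-aligned box per 4-connected region of nonzero cells."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes: List[Box] = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Start a new connected domain at (r, c).
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


mask = [[1, 1, 0, 0],
        [0, 1, 0, 1],
        [0, 0, 0, 1]]
print(bounding_boxes(mask))  # [(0, 0, 1, 1), (1, 3, 2, 3)]
```

For curved or rotated text, the boundary of each connected domain could be kept as a polygon instead of being reduced to an axis-aligned box; the box is used here only to keep the sketch short.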

FIG. 5 is a schematic diagram of an optional structure of the text detection apparatus provided by an embodiment of the present application, which is described part by part below.

In some embodiments, the text detection apparatus 500 includes an encoding unit 501, a decoding unit 502, an image type determination unit 503, and an output unit 504.

The encoding unit 501 is configured to determine the sequence feature of the image to be detected;

The decoding unit 502 is configured to determine the decoded sequence vector based on the sequence feature and the instance feature corresponding to the text instance;

The image type determination unit 503 is configured to determine the type of the image to be detected based on the decoded sequence vector;

The output unit 504 is configured to, in response to the type of the image to be detected being that the image to be detected includes text, determine the position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence feature.

The encoding unit 501 is specifically configured to convert the image matrix corresponding to the image to be detected into a one-dimensional vector, and to determine the feature corresponding to the one-dimensional vector as the sequence feature of the image to be detected.

The decoding unit 502 is further configured to obtain the embedded value corresponding to the text instance, where the embedded value is the instance feature corresponding to the text instance; the text instance includes character strings and/or phrases, and different text instances correspond to different embedded values.

The output unit 504 is specifically configured to multiply the decoded sequence vector by the vector corresponding to the sequence feature to obtain a product result; determine, based on the product result, the position information of the text in the sequence feature; and determine, based on the position information of the text in the sequence feature, the position information of the text in the image to be detected.

In some optional embodiments, the text detection apparatus 500 may further include a bounding box determination unit 505.

The bounding box determination unit 505 is configured to, after the position information of the text in the image to be detected is determined based on the decoded sequence vector and the vector corresponding to the sequence feature, determine a connected domain in the image to be detected based on the position information of the text in the image to be detected, and determine a text bounding box based on the boundary of the connected domain; the text bounding box is used to identify the text in the image to be detected.

FIG. 6 is a schematic diagram of an optional structure of the text detection network training apparatus provided by an embodiment of the present application, which is described part by part below.

In some embodiments, the training apparatus 600 of the text detection network includes a first determination unit 601, a second determination unit 602, a third determination unit 603, a response unit 604, and an adjustment unit 605.

The first determination unit 601 is configured to determine, based on the encoding sub-network, the sequence sample feature of a sample image in the training sample set;

The second determination unit 602 is configured to use the sequence sample feature and the instance sample feature corresponding to the text instance sample as the input of the cross-layer attention layer of the decoding sub-network, and to determine the output of the cross-layer attention layer as the decoded sample sequence vector;

The third determination unit 603 is configured to use the decoded sample sequence vector as the input of the output sub-network, and to determine the prediction type of the sample image according to the output of the output sub-network;

The response unit 604 is configured to, in response to the prediction type of the sample image being that the sample image includes text, use the decoded sample sequence vector and the sample sequence feature as the input of the output sub-network, and determine the predicted position information of the text in the sample image according to the output of the output sub-network;

The adjustment unit 605 is configured to match the prediction type of the sample image against the annotation type of the sample image, and the predicted position information of the text in the sample image against its annotation position information, and to adjust the parameters of the text detection network based on the matching result.

The first determination unit 601 is specifically configured to convert, through the encoding sub-network, the image matrix corresponding to the sample image into a one-dimensional vector representing the sample image; the feature corresponding to the one-dimensional vector of the sample image is the sequence sample feature of the sample image.

The second determination unit 602 is further configured to input the text instance sample to the self-attention layer of the decoding sub-network and obtain, based on the output of the self-attention layer, the embedded value corresponding to the text instance sample, where the embedded value is the instance sample feature corresponding to the text instance sample; the text instance sample includes character strings and/or phrases, and different text instance samples correspond to different embedded values.

The response unit 604 is specifically configured to use the decoded sample sequence vector as the input of the fully connected layer included in the output sub-network, and to determine the prediction type of the sample image based on the output of the fully connected layer.

The third determination unit 603 is specifically configured to multiply the decoded sample sequence vector by the vector corresponding to the sample sequence feature to obtain a product result; determine, based on the product result, the predicted position information of the text in the sample sequence feature; and determine, based on the predicted position information of the text in the sample sequence feature, the predicted position information of the text in the sample image.

The adjustment unit 605 is specifically configured to: if the prediction type and the annotation type are the same, and the loss value between the predicted position information and the annotation position information is less than a preset threshold, determine not to adjust the parameters of the text detection network; and if the prediction type and the annotation type are different, or the loss value between the predicted position information and the annotation position information is greater than or equal to the preset threshold, adjust the parameters of the text detection network based on the difference between the prediction type of the sample image and the annotation type of the sample image, and/or the difference between the predicted position information and the annotation position information of the text in the sample image.
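The decision rule applied by the adjustment unit can be sketched as follows. The loss function here (mean absolute difference between two binary masks) is an assumption chosen only to make the example concrete; the disclosure does not name a specific loss. The names `mask_loss` and `should_adjust` are hypothetical.

```python
from typing import List


def mask_loss(predicted: List[List[float]], annotated: List[List[float]]) -> float:
    """Mean absolute difference between two equally sized masks (illustrative loss)."""
    cells = [abs(p - a)
             for pred_row, ann_row in zip(predicted, annotated)
             for p, a in zip(pred_row, ann_row)]
    return sum(cells) / len(cells)


def should_adjust(predicted_type: str,
                  annotated_type: str,
                  position_loss: float,
                  threshold: float) -> bool:
    """Parameters are left unchanged only when the types match AND the loss is below the threshold."""
    return predicted_type != annotated_type or position_loss >= threshold


loss = mask_loss([[1, 0], [1, 0]], [[1, 0], [0, 0]])
print(loss)                                       # 0.25
print(should_adjust("text", "text", loss, 0.5))   # False: types match, loss below threshold
print(should_adjust("no_text", "text", 0.0, 0.5)) # True: types differ
```

Note that the boundary case follows the text above: a loss exactly equal to the preset threshold still triggers an adjustment.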

According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.

FIG. 7 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only, and are not intended to limit the implementations of the disclosure described and/or claimed herein.

As shown in FIG. 7, the device 800 includes a computing unit 801, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other through a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

Multiple components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard or a mouse; an output unit 807, such as various types of displays and speakers; a storage unit 808, such as a magnetic disk or an optical disk; and a communication unit 809, such as a network card, a modem, or a wireless communication transceiver. The communication unit 809 allows the device 800 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.

The computing unit 801 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and the like. The computing unit 801 performs the methods and processes described above, such as the text detection method or the training method of the text detection network. For example, in some embodiments, the text detection method or the training method of the text detection network may be implemented as a computer software program that is tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed on the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the text detection method or the training method of the text detection network described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the text detection method or the training method of the text detection network in any other suitable manner (for example, by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor, which can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide interaction with a user, the systems and techniques described here may be implemented on a computer having: a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).

The systems and techniques described here may be implemented in a computing system that includes a back-end component (for example, a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described here), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.

The specific embodiments described above do not limit the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.

Claims

1. A text detection method, comprising:

determining a sequence feature of an image to be detected;

determining a decoded sequence vector based on the sequence feature and an instance feature corresponding to a text instance;

determining the type of the image to be detected based on the decoded sequence vector; and

in response to the type of the image to be detected being that the image to be detected includes text, determining position information of the text in the image to be detected based on the decoded sequence vector and a vector corresponding to the sequence feature.

2. The method according to claim 1, wherein determining the sequence feature of the image to be detected comprises:

converting the image matrix corresponding to the image to be detected into a one-dimensional vector; and

determining the feature corresponding to the one-dimensional vector as the sequence feature of the image to be detected.

3. The method according to claim 1, wherein, before determining the decoded sequence vector, the method further comprises:

obtaining an embedded value corresponding to the text instance, wherein the embedded value is the instance feature corresponding to the text instance;

wherein the text instance includes character strings and/or phrases, and different text instances correspond to different embedded values.

4. The method according to claim 1, wherein determining the position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence feature comprises:

multiplying the decoded sequence vector by the vector corresponding to the sequence feature to obtain a product result;

determining, based on the product result, the position information of the text in the sequence feature; and

determining, based on the position information of the text in the sequence feature, the position information of the text in the image to be detected.

5. The method according to claim 1, wherein, after the position information of the text in the image to be detected is determined based on the decoded sequence vector and the vector corresponding to the sequence feature, the method further comprises:

determining a connected domain in the image to be detected based on the position information of the text in the image to be detected; and

determining a text bounding box based on the boundary of the connected domain;

wherein the text bounding box is used to identify the text in the image to be detected.

6. A training method for a text detection network, wherein the text detection network comprises an encoding sub-network, a decoding sub-network, and an output sub-network, the method comprising:

determining, based on the encoding sub-network, a sequence sample feature of a sample image in a training sample set;

using the sequence sample feature and an instance sample feature corresponding to a text instance sample as the input of a cross-layer attention layer of the decoding sub-network, and determining the output of the cross-layer attention layer as a decoded sample sequence vector;

using the decoded sample sequence vector as the input of the output sub-network, and determining a prediction type of the sample image according to the output of the output sub-network;

in response to the prediction type of the sample image being that the sample image includes text, using the decoded sample sequence vector and the sample sequence feature as the input of the output sub-network, and determining predicted position information of the text in the sample image according to the output of the output sub-network; and

matching the prediction type of the sample image against the annotation type of the sample image, and the predicted position information of the text in the sample image against its annotation position information, and adjusting the parameters of the text detection network based on the matching result.

7. The method according to claim 6, wherein determining, based on the encoding sub-network, the sequence sample feature of the sample image in the training sample set comprises:

converting, by the encoding sub-network, the image matrix corresponding to the sample image into a one-dimensional vector representing the sample image;

wherein the feature corresponding to the one-dimensional vector of the sample image is the sequence sample feature of the sample image.

8. The method according to claim 6, wherein, before determining the decoded sample sequence vector, the method further comprises:

inputting the text instance sample to a self-attention layer of the decoding sub-network, and obtaining, based on the output of the self-attention layer, an embedded value corresponding to the text instance sample, wherein the embedded value is the instance sample feature corresponding to the text instance sample;

wherein the text instance sample includes character strings and/or phrases, and different text instance samples correspond to different embedded values.

9. The method according to claim 6, wherein using the decoded sample sequence vector as the input of the output sub-network and determining the prediction type of the sample image according to the output of the output sub-network comprises:

using the decoded sample sequence vector as the input of a fully connected layer included in the output sub-network, and determining the prediction type of the sample image based on the output of the fully connected layer.

10. The method according to claim 6, wherein using the decoded sample sequence vector and the sample sequence feature as the input of the output sub-network and determining, according to the output of the output sub-network, the predicted position information of the text in the sample image comprises:

multiplying the decoded sample sequence vector by the vector corresponding to the sample sequence feature to obtain a product result;

determining, based on the product result, the predicted position information of the text in the sample sequence feature; and

determining, based on the predicted position information of the text in the sample sequence feature, the predicted position information of the text in the sample image.

11. The method according to claim 6, wherein the prediction type of the sample image and the annotation type of the sample image are matched, as well as the predicted position information and the annotation position information of the text in the sample image, and adjusted based on the matching result. The parameters of the text detection network include:

If the prediction type and the annotation type are the same, and the loss value between the predicted position information and the annotation position information is less than a preset threshold, it is determined not to adjust the parameters of the text detection network;

If the prediction type and the annotation type are different, or the loss value between the predicted position information and the annotation position information is greater than or equal to the preset threshold, the parameters of the text detection network are adjusted based on the difference between the prediction type of the sample image and the annotation type of the sample image, and/or the difference between the predicted position information and the annotation position information of the text in the sample image.
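The two branches of claim 11 amount to a simple decision rule, sketched below. The function name `matching_result` and the returned dictionary keys are hypothetical. The pairing of each mismatch with the corresponding difference is one possible reading of the "and/or" wording, not something the claim fixes.

```python
def matching_result(pred_type, label_type, position_loss, threshold):
    """Decide whether to adjust the network and which differences to use."""
    type_mismatch = pred_type != label_type
    # The claim uses "greater than or equal to" for the adjusting branch.
    position_mismatch = position_loss >= threshold
    return {
        "adjust": type_mismatch or position_mismatch,
        "use_type_difference": type_mismatch,          # assumed pairing
        "use_position_difference": position_mismatch,  # assumed pairing
    }


# Same type and a loss below the threshold: no adjustment.
print(matching_result("text", "text", position_loss=0.1, threshold=0.5))
# Same type but a loss at the threshold: adjust based on the position difference.
print(matching_result("text", "text", position_loss=0.5, threshold=0.5))
```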

12. A text detection device, comprising:

an encoding unit, used to determine the sequence feature of the image to be detected;

a decoding unit for determining a decoded sequence vector based on the sequence feature and the instance feature corresponding to the text instance;

an image type determination unit, configured to determine the type of the image to be detected based on the decoded sequence vector;

an output unit, configured to, in response to the type of the image to be detected being that the image to be detected includes text, determine the position information of the text in the image to be detected based on the decoded sequence vector and the vector corresponding to the sequence feature.

13. The apparatus according to claim 12, wherein the encoding unit is specifically used for:

14. The apparatus of claim 12, wherein the decoding unit is further configured to:

Before determining the decoded sequence vector, obtain an embedded value corresponding to the text instance, where the embedded value is an instance feature corresponding to the text instance;

15. The apparatus of claim 12, wherein the output unit is specifically configured to:

16. The apparatus of claim 12, wherein the apparatus further comprises:

a bounding box determination unit, configured to, after the position information of the text in the image to be detected is determined based on the decoded sequence vector and the vector corresponding to the sequence feature, determine a connected domain in the image to be detected based on the position information of the text in the image to be detected, and determine a text bounding box based on the boundary of the connected domain;
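The connected-domain step of claim 16 can be illustrated with a breadth-first search over a binary mask. This is a sketch under assumptions the claim does not state: the position information is represented as a 0/1 mask, regions are 4-connected, and the bounding box is the axis-aligned rectangle enclosing each region. The function name `text_bounding_boxes` is hypothetical.

```python
from collections import deque


def text_bounding_boxes(mask):
    """Return (x_min, y_min, x_max, y_max) for each 4-connected region of 1s."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if mask[y][x] != 1 or seen[y][x]:
                continue
            # Start a new connected domain and grow it with breadth-first search.
            queue = deque([(x, y)])
            seen[y][x] = True
            x_min = x_max = x
            y_min = y_max = y
            while queue:
                cx, cy = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            # The box is taken from the boundary of the connected domain.
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Hypothetical mask with two separate text regions.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
print(text_bounding_boxes(mask))  # [(0, 0, 1, 1), (4, 1, 4, 2)]
```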

17. A training device for a text detection network, comprising:

a first determining unit, configured to determine the sequence sample features of the sample images in the training sample set based on the coding sub-network;

a second determining unit, configured to use the sequence sample feature and the instance sample feature corresponding to the text instance sample as the input of the cross-layer attention layer of the decoding sub-network, and determine the output of the cross-layer attention layer as the decoded sample sequence vector;

a third determining unit, configured to use the decoded sample sequence vector as the input of the output sub-network, and determine the prediction type of the sample image according to the output of the output sub-network;

a response unit, configured to, in response to the prediction type of the sample image being that the sample image includes text, use the decoded sample sequence vector and the sample sequence feature as the input of the output sub-network, and determine the predicted position information of the text in the sample image according to the output of the output sub-network;

an adjustment unit, configured to match the prediction type of the sample image and the annotation type of the sample image, as well as the predicted position information and annotation position information of the text in the sample image, and adjust the parameters of the text detection network based on the matching result.
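The role of the second determining unit can be sketched as an attention step that takes both the instance sample features and the sequence sample features as input. The sketch below assumes a single-head scaled dot-product attention without learned projections, with the instance sample features acting as queries and the sequence sample features acting as keys and values. The claims do not specify any of these choices, and the function name `cross_attention` is hypothetical.

```python
import math


def cross_attention(instance_feats, seq_feats):
    """Return one decoded vector per instance feature (single head, no projections)."""
    dim = len(seq_feats[0])
    outputs = []
    for query in instance_feats:
        # Scaled dot-product score between the query and every sequence position.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in seq_feats
        ]
        # Numerically stable softmax over the sequence positions.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the sequence features gives the decoded vector.
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, seq_feats)) for i in range(dim)]
        )
    return outputs


# Hypothetical features: one text instance and three sequence positions.
instance_feats = [[1.0, 0.0]]
seq_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoded = cross_attention(instance_feats, seq_feats)
print(decoded)
```

Each decoded vector is a convex combination of the sequence features, so its components stay within the range spanned by the inputs.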

18. The apparatus according to claim 17, wherein the first determining unit is specifically configured to:

Convert the image matrix corresponding to the sample image into a one-dimensional vector representing the sample image through the coding sub-network;

19. The apparatus of claim 17, wherein the second determining unit is further configured to:

20. The apparatus according to claim 17, wherein the response unit is specifically used for:

21. The apparatus according to claim 17, wherein the third determining unit is specifically configured to:

22. The apparatus according to claim 17, wherein the adjustment unit is specifically configured to:

If the prediction type and the annotation type are different, or the loss value between the predicted position information and the annotation position information is greater than or equal to the preset threshold, adjust the parameters of the text detection network based on the difference between the prediction type of the sample image and the annotation type of the sample image, and/or the difference between the predicted position information and the annotation position information of the text in the sample image.

23. An electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

The memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5;

Alternatively, to enable the at least one processor to perform the method of any of claims 6-11.

24. A non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-5;

Alternatively, the computer instructions are for causing the computer to perform the method of any of claims 6-11.

25. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-5;

Alternatively, the computer program, when executed by a processor, implements the method according to any of claims 6-11.