WO2020192433A1 - Multi-language text detection and recognition method and device - Google Patents

Multi-language text detection and recognition method and device Download PDF

Info

Publication number
WO2020192433A1
WO2020192433A1 (PCT/CN2020/078928)
Authority
WO
WIPO (PCT)
Prior art keywords
text
attention
processor
recognition network
normalized
Prior art date
Application number
PCT/CN2020/078928
Other languages
French (fr)
Chinese (zh)
Inventor
张勇东
周宇
谢洪涛
李岩
Original Assignee
中国科学技术大学
北京中科研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学技术大学, 北京中科研究院
Publication of WO2020192433A1 publication Critical patent/WO2020192433A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/24Aligning, centring, orientation detection or correction of the image

Definitions

  • the present disclosure relates to the field of artificial intelligence, and in particular to methods and devices for multilingual text detection and recognition.
  • existing scene text recognition systems mainly target pre-cropped text and cannot simultaneously detect and recognize text in an image.
  • the few methods that can detect and recognize text at the same time handle only English text.
  • the purpose of the present disclosure is to provide a multilingual text detection and recognition method and device, which can simultaneously detect and recognize texts in multiple languages in a scene text image.
  • the purpose of the present disclosure is achieved by a multilingual text detection and recognition method.
  • the method includes:
  • the computer equipment includes:
  • a memory, where the memory stores instructions that can be executed by the processor; when the instructions are executed by the processor, the processor performs the steps of the method.
  • the technical solution provided according to the present disclosure can simultaneously detect and recognize texts in multiple languages, and has a higher accuracy rate than traditional text detection and multi-language recognition solutions.
  • Fig. 1 is a flowchart of a method for multilingual text detection and recognition according to an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a system that can be used to implement the method in FIG. 1 according to an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of the structure of a text detector provided according to an embodiment of the present disclosure.
  • Fig. 4 is a block diagram of a computer device for multilingual text detection and recognition according to an embodiment of the present disclosure.
  • Fig. 1 shows a method 10 for multilingual text detection and recognition according to an embodiment of the present disclosure.
  • the method 10 includes: at step S102, performing feature extraction on the input image and generating a series of text candidate boxes; at step S104, on the basis of maintaining the original aspect ratio of the text area corresponding to each text candidate box, cropping the text areas of all text candidate boxes and normalizing them to a uniform height; and at step S106, recognizing the text in the normalized text areas.
  • recognizing the text in the normalized text area may further include: at step S106-1, recognizing the category of the text in the normalized text area to determine whether the corresponding text is a symbol or a specific language type; and/or, at step S106-2, recognizing the content of the text in the normalized text area.
  • the above-mentioned method according to the embodiment of the present disclosure can be applied to machine translation.
  • texts in different languages can be recognized and then translated into the desired text.
  • the above method can also be used for autonomous driving.
  • road signs in different languages can be detected and recognized, so as to choose the correct direction to move forward.
  • Fig. 2 is a schematic diagram of a system that can be used to implement the method of Fig. 1 according to an embodiment of the present disclosure. In the following, the execution of each step in FIG. 1 will be described in further detail by way of examples in conjunction with FIG. 2.
  • Step S102 that is, performing feature extraction on the input image and generating a series of text candidate boxes can be performed by, for example, the text detector 200 shown in FIG. 2.
  • FIG. 3 is a schematic structural diagram of a text detector 200 according to an embodiment of the present disclosure.
  • the text detector 200 is formed by stacking four inception modules 305, 308, 313, 314 designed for text, three channel-wise attention and spatial attention (channel-wise attention & spatial attention) modules 306, 309, 311, and seven convolutional layers 301-304, 307, 310, 312; the channel-wise attention sub-module outputs the importance level of each channel of the feature map, while the spatial attention sub-module outputs an attention weight for each pixel of the feature map.
  • each inception module can use 1×5 and 5×1 convolution kernels; since text generally has a large aspect ratio, such kernels are better suited to text.
  • step S102 may include: for example, by means of the text detector 200 shown in FIG. 2, outputting, at step S102-1, P oriented text candidate boxes for each pixel of the feature map, and then, at step S102-2, processing these candidate boxes with non-maximum suppression to obtain M oriented text candidate boxes.
  • each image is resized to 256×256 and then input to the text detector 200.
  • the text detector 200 outputs 14 oriented text candidate boxes for each pixel in the feature map. Non-maximum suppression (NMS) is then used to process these candidate boxes, removing redundant proposals and speeding up the computation.
  • 3×3 means that a convolution kernel with width and height 3 is used in the convolution operation (1×1 has a similar meaning); the 7 convolution layers correspond to the 3×3 parts in Fig. 2.
  • 16 means that 16 convolution kernels are used in the convolution operation (1, 2, 4, 64, 256, 512 have similar meanings); /2 means that the resolution of the feature map is halved; upsample denotes the up-sampling operation, whose function is to increase the resolution of the feature map; f1 to f4 and f1,2, f1,2,3, f1,2,3,4 are the feature maps obtained at each stage; segmentation1 and segmentation2 denote the segmentation maps of the text region; box1 and box2 denote the predicted distances from each pixel of the feature map to the four sides (top, bottom, left, right) of the text candidate box; angle1 and angle2 denote the angle of the text: some text is not horizontal and may form an angle with the horizontal direction.
  • the workflow of the text detector 200 is briefly as follows: an input image is fed into the network and passes in turn through the first four convolutional layers 301-304, inception1 305, the first channel-wise attention and spatial attention module 306 (referred to as an attention module for short), the fifth convolutional layer (3x3, 128, /2) 307, inception2 308, the second channel-wise attention and spatial attention module 309, the sixth convolutional layer (3x3, 256, /2) 310, the third channel-wise attention and spatial attention module 311, and the seventh convolutional layer (3x3, 512, /2) 312.
  • the feature map f1 is up-sampled and then added to f2 for feature fusion to obtain the feature map f1,2.
  • the feature map f1,2 is up-sampled (for example, up-sampled to 32x32) and then added to the feature map f3 for feature fusion, thereby obtaining the feature map f1,2,3.
  • the feature map f1,2,3 passes through inception3 313 and, after up-sampling (for example, to 64x64), is added to the feature map f4 for feature fusion, thereby obtaining the feature map f1,2,3,4.
  • the feature map f1,2,3,4 undergoes feature extraction by inception4 314.
  • the feature maps f1,2,3 output by inception3 and f1,2,3,4 output by inception4 are respectively used to predict the text candidate boxes (that is, to generate the text candidate boxes).
  • step S104 may be performed by, for example, the normalization unit 202 shown in FIG. 2.
  • the normalization unit 202 trims the text regions of all text candidate boxes and then adjusts them to a uniform height K on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box.
  • This normalization method maintains the aspect ratio of the corresponding text area, avoids the deformation of the text area, and provides a guarantee for the subsequent text recognition and text language category recognition.
  • step S104 may include: at step S104-1, normalizing the text regions according to the following formulas:
  • H' = K
  • W' = wH'/h
  • where W' and H' respectively denote the width and height of the normalized text area, and w and h respectively denote the original width and height of the text area.
  • K can be 64, though it can be changed to other values as needed.
  • step S106-1 may be performed by, for example, the script recognition network 204.
  • the script recognition network 204 can be implemented by a convolutional neural network (CNN).
  • CNN convolutional neural network
  • Table 1 shows the structure of the script recognition network 204, which mainly includes: a plurality of alternately arranged convolutional layers (conv) and max-pooling layers (max-pooling), a global average pooling layer (global-avg-pool) following the last max-pooling layer, and a fully-connected layer following the global average pooling layer; the fully-connected layer has a plurality of (for example, 7) neurons, the softmax output of each neuron representing the probability that the text in each text candidate box belongs to a certain language type or symbol, and the category with the highest probability is taken as the category of the text in that candidate box.
  • the global average pooling layer outputs a feature map with a size of 1 ⁇ 512.
  • the fully connected layer can contain 7 neurons.
  • the softmax outputs of these 7 neurons are 7 probabilities, representing the probability that the text in each text area is Arabic, Bengali, Chinese, Korean, Japanese, Latin, or a symbol.
  • the highest probability is the category of the text in the text area.
  • step S106-2 in FIG. 1 may be performed by, for example, the attention mechanism-based multilingual text recognition network 206 shown in FIG. 2.
  • the attention mechanism-based multilingual text recognition network 206 uses CNN as an encoder, and then uses a CTC decoder to generate character sequences.
  • the attention mechanism-based multilingual text recognition network 206 uses the channel-wise attention and spatial attention cascade to make the CTC decoder pay more attention to the place where the text exists, thereby improving the accuracy of text recognition.
  • the structure of the encoder in the attention mechanism-based multilingual text recognition network 206 is shown in Table 2.
  • Table 2: Structure of the encoder in the attention mechanism-based multilingual text recognition network
  • the method 10 provided by the embodiment of the present disclosure may optionally further include step S100.
  • at step S100, the text detector 200, the script recognition network 204, and the attention mechanism-based multilingual text recognition network 206 are trained, validated, and tested using scene text images or cropped images. More specifically, the following data sets are constructed in advance: scene text images and cropped images. Both types of images contain text in multiple languages and are divided into a training set, a validation set, and a test set; the text in the training and validation sets is annotated.
  • the scene text image is used for training, verification, and testing of the text detector 200; and the cropped image is used for the training, verification, and testing of the script recognition network 204 and the multilingual text recognition network 206 based on the attention mechanism.
  • a cropped image is an image containing text that has been cut in advance from an image containing both background and text, and is mainly used to train the attention mechanism-based multilingual text recognition network; a scene text image, by contrast, is a large image that, in addition to text, contains many background regions without text.
  • ICDAR MLT cropped images and scene text images can be downloaded from the Internet.
  • These images contain six types of characters, namely Arabic, Bengali, Chinese, Korean, Japanese and Latin.
  • the text detector can be trained using the Adam optimizer, the initial learning rate can be set to 0.001, and the loss function can be defined as L_det = L_geo + L_dice.
  • L_dice is the dice loss, a loss function used for semantic segmentation: for a given region, each pixel has value 1 if it is text and 0 otherwise; the closer the predicted text probability is to 1, the closer the dice loss is to 0, and vice versa. L_dice is the sum of the classification losses over all pixels.
  • L_geo is the sum of the IoU loss L_IoU between the text candidate box and the ground truth and the angle loss L_θ: L_geo = L_IoU + λ_θ·L_θ, where λ_θ is a set coefficient that can, for example, be set to 1.
  • Adam is a first-order optimization algorithm that can replace the traditional stochastic gradient descent process. It can iteratively update neural network weights based on training data.
  • the script recognition network can be optimized using a stochastic gradient descent algorithm, with the following parameters: momentum 0.9, initial learning rate 0.001, and the learning rate reduced to one tenth every 5 epochs.
  • the above-mentioned solutions of the embodiments of the present disclosure are completely based on convolutional neural networks, and can simultaneously detect and recognize texts in multiple languages in one model.
  • the precision, recall, and F-Measure of this solution for localization and language type recognition on the multilingual ICDAR RRC-MLT test set are 0.6968, 0.6425, and 0.6687 respectively, while the best results of existing methods are 0.5759, 0.6207, and 0.5974. Compared with existing methods, our method is thus greatly improved.
  • the precision, recall, and F-Measure of this method on the end-to-end recognition task of the ICDAR RRC-MLT test set are 0.502, 0.424, and 0.460, respectively.
  • Fig. 4 is a block diagram of a computer device 40 for multilingual text detection and recognition according to an embodiment of the present disclosure.
  • the computer device 40 includes a processor 41 and a memory 42.
  • the memory 42 stores instructions executable by the processor 41.
  • the processor 41 is caused to execute a method including the following steps: performing feature extraction on the input image and generating a series of text candidate boxes; on the basis of maintaining the original aspect ratio of the text area corresponding to each text candidate box, cropping the text areas of all text candidate boxes and then normalizing them to a uniform height K; and recognizing the text in the normalized text areas.
  • recognizing the text in the normalized text area includes: recognizing the type of the text in the normalized text area to determine whether the corresponding text is a symbol or a specific language type; and/or recognizing the content of the text in the normalized text area.
  • when the instructions are executed by the processor 41, the processor 41 can realize the function of one or more of the text detector 200, the normalization unit 202, the script recognition network 204, and the attention mechanism-based multilingual text recognition network 206.
  • when the instructions are executed by the processor 41, the processor 41 can implement any step of the method shown in FIG. 1.
  • the software product can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.).
  • the non-volatile storage medium includes a number of instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute the methods described in the various embodiments of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are a multi-language text detection and recognition method and a computer device therefor. The method comprises: performing feature extraction on an input image and generating a series of candidate text boxes; on the basis of keeping the original aspect ratio of the text region corresponding to each candidate text box, performing normalized adjustment on the text regions of all the candidate text boxes so that they are of a uniform height; and recognizing the text in the normalized text regions. In some embodiments, recognizing the text in the normalized text regions comprises: recognizing the type of the text in the normalized text regions to determine whether the corresponding text is a symbol or a specific language type; and/or recognizing the content of the text in the normalized text regions. With the method, text in multiple languages in a scene text image can be simultaneously detected and recognized.

Description

Multilingual text detection and recognition method and device
Cross-reference to related applications
This application claims priority to Chinese patent application 201910232853.0, filed on March 26, 2019, the disclosure of which is incorporated herein by reference in its entirety.
Technical field
The present disclosure relates to the field of artificial intelligence, and in particular to a multilingual text detection and recognition method and device.
Background
Existing scene text recognition systems mainly target pre-cropped text and cannot simultaneously detect and recognize text in an image. The few methods that can detect and recognize text at the same time handle only English text, whereas in real life one often has to process text in multiple languages within the same scene. There is therefore an urgent need for an end-to-end multilingual scene text recognition system, which would bring great convenience to image retrieval, machine translation, autonomous driving, and the like.
Summary of the invention
The purpose of the present disclosure is to provide a multilingual text detection and recognition method and device, which can simultaneously detect and recognize text in multiple languages in a scene text image.
In one aspect, the purpose of the present disclosure is achieved by a multilingual text detection and recognition method. The method includes:
performing feature extraction on an input image and generating a series of text candidate boxes;
on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, cropping the text regions of all text candidate boxes and then normalizing them to a uniform height K; and
recognizing the text in the normalized text regions.
In another aspect, the purpose of the present disclosure is achieved by a computer device for multilingual text detection and recognition. The computer device includes:
a processor; and
a memory storing instructions executable by the processor, the instructions, when executed by the processor, causing the processor to:
perform feature extraction on an input image and generate a series of text candidate boxes;
on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, crop the text regions of all text candidate boxes and then normalize them to a uniform height K; and
recognize the text in the normalized text regions.
The technical solution provided by the present disclosure can simultaneously detect and recognize text in multiple languages, and achieves higher accuracy than traditional text detection and multilingual recognition solutions.
Description of the drawings
To describe the technical solutions of the embodiments of the present disclosure more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present disclosure; those of ordinary skill in the art can obtain other embodiments based on these without creative effort, and such other embodiments fall within the protection scope of the present disclosure.
Fig. 1 is a flowchart of a method for multilingual text detection and recognition according to an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of a system that can be used to implement the method of Fig. 1 according to an embodiment of the present disclosure;
Fig. 3 is a schematic diagram of the structure of a text detector according to an embodiment of the present disclosure;
Fig. 4 is a block diagram of a computer device for multilingual text detection and recognition according to an embodiment of the present disclosure.
Detailed description
The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
Fig. 1 shows a method 10 for multilingual text detection and recognition according to an embodiment of the present disclosure. As shown in Fig. 1, the method 10 includes: at step S102, performing feature extraction on an input image and generating a series of text candidate boxes; at step S104, on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, cropping the text regions of all text candidate boxes and normalizing them to a uniform height; and at step S106, recognizing the text in the normalized text regions. In some embodiments, recognizing the text in the normalized text regions may further include: at step S106-1, recognizing the category of the text in the normalized text regions to determine whether the corresponding text is a symbol or a specific language type; and/or, at step S106-2, recognizing the content of the text in the normalized text regions.
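As an illustrative sketch, the three stages S102, S104, and S106 can be wired together as follows. The component interfaces are hypothetical stand-ins, since the publication defines no code-level API; Python is used for illustration only.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float, float]  # assumed layout: cx, cy, w, h, angle

def detect_and_recognize(
    image: object,
    detector: Callable[[object], List[Box]],   # S102: oriented text candidate boxes
    crop: Callable[[object, Box], object],     # cuts one text region out of the image
    normalize: Callable[[object], object],     # S104: rescale to uniform height K
    script_net: Callable[[object], str],       # S106-1: language/symbol category
    recognizer: Callable[[object], str],       # S106-2: text content
) -> List[Tuple[str, str]]:
    results = []
    for box in detector(image):
        region = normalize(crop(image, box))
        results.append((script_net(region), recognizer(region)))
    return results
```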
The above method according to the embodiment of the present disclosure can be applied to machine translation: by using the method in the back end of translation software, text in different languages can be recognized and then translated into the desired language. The method can also be used for autonomous driving: by using it on a driverless car, road signs in different languages can be detected and recognized so that the correct direction of travel can be chosen.
Fig. 2 is a schematic diagram of a system that can be used to implement the method of Fig. 1 according to an embodiment of the present disclosure. In the following, the execution of each step in Fig. 1 is described in further detail by way of example with reference to Fig. 2.
Step S102, that is, performing feature extraction on the input image and generating a series of text candidate boxes, can be performed by, for example, the text detector 200 shown in Fig. 2.
Fig. 3 is a schematic structural diagram of the text detector 200 according to an embodiment of the present disclosure. As shown in Fig. 3, the text detector 200 is formed by stacking four inception modules 305, 308, 313, 314 designed for text, three channel-wise attention and spatial attention (channel-wise attention & spatial attention) modules 306, 309, 311, and seven convolutional layers 301-304, 307, 310, 312. The channel-wise attention sub-module of each channel-wise attention and spatial attention module operates on the channels of the feature map: it outputs the importance level of each channel, telling the network which channels carry more important information. The spatial attention sub-module operates on the pixels of the feature map: it outputs an attention weight for each pixel, telling the network which parts of the feature map deserve more attention. In the embodiments of the present disclosure, each inception module can use 1×5 and 5×1 convolution kernels; since text generally has a large aspect ratio, such kernels are better suited to text.
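To make the two building blocks concrete, a minimal PyTorch sketch of an inception-style block with the stated 1×5 and 5×1 kernels follows. The branch layout and widths are assumptions; the publication specifies only the kernel shapes.

```python
import torch
import torch.nn as nn

class TextInception(nn.Module):
    """Inception-style block using 1x5 and 5x1 kernels suited to wide text.

    Sketch only: branch widths and layout are assumed (out_ch must be
    divisible by 4); the publication specifies only the kernel shapes.
    """
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        b = out_ch // 4
        self.branch1 = nn.Conv2d(in_ch, b, kernel_size=1)
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, b, kernel_size=1),
            nn.Conv2d(b, b, kernel_size=(1, 5), padding=(0, 2)),  # wide receptive field
        )
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, b, kernel_size=1),
            nn.Conv2d(b, b, kernel_size=(5, 1), padding=(2, 0)),  # tall receptive field
        )
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, b, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)
```

A channel-wise attention and spatial attention module in the spirit described above could pair a squeeze-and-excitation style channel gate with a per-pixel spatial gate; again, the internal layout is an assumption, not the patent's exact design.

```python
class ChannelSpatialAttention(nn.Module):
    """Cascade of channel-wise attention (per-channel importance) and
    spatial attention (per-pixel weight). Internal layout is assumed."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel = nn.Sequential(                 # which channels matter
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(                 # where to look
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel(x)
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        return x * self.spatial(pooled)
```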
Step S102 may include: for example, by means of the text detector 200 shown in Fig. 2, outputting, at step S102-1, P oriented text candidate boxes for each pixel of the feature map, and then, at step S102-2, processing these candidate boxes with non-maximum suppression to obtain M oriented text candidate boxes.
Illustratively, each image is resized to 256×256 before being input to the text detector 200. The text detector 200 outputs 14 oriented text candidate boxes for each pixel of the feature map. Non-maximum suppression (NMS) is then used to process these candidate boxes, removing redundant proposals and speeding up the computation.
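A greedy NMS sketch is shown below. The detector here emits oriented boxes, so a production system would use a rotated-box IoU; axis-aligned boxes (x1, y1, x2, y2) are used to keep the sketch short.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    """Greedy non-maximum suppression over axis-aligned boxes (x1, y1, x2, y2)."""
    order = scores.argsort()[::-1]   # highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the top box with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-9)
        order = order[1:][iou <= iou_thresh]  # drop redundant overlapping boxes
    return keep
```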
In Fig. 3, 3×3 means that a convolution kernel with width and height 3 is used in the convolution operation (1×1 has a similar meaning); the 7 convolutional layers correspond to the 3×3 parts in Fig. 2. The number 16 means that 16 convolution kernels are used in the convolution operation (1, 2, 4, 64, 256, 512 have similar meanings); /2 means that the resolution of the feature map is halved; upsample denotes the up-sampling operation, whose function is to increase the resolution of the feature map; f1 to f4 and f1,2, f1,2,3, f1,2,3,4 are the feature maps obtained at each stage; segmentation1 and segmentation2 denote the segmentation maps of the text regions; box1 and box2 denote the predicted distances from each pixel of the feature map to the four sides (top, bottom, left, right) of the text candidate box; angle1 and angle2 denote the angle of the text: some text is not horizontal and may form an angle with the horizontal direction.
As shown in Fig. 3, the workflow of the text detector 200 is briefly as follows. An input image is fed into the network and passes in turn through the first four convolutional layers 301-304, inception1 305, the first channel-wise attention and spatial attention module 306 (referred to as an attention module for short), the fifth convolutional layer (3x3, 128, /2) 307, inception2 308, the second channel-wise attention and spatial attention module 309, the sixth convolutional layer (3x3, 256, /2) 310, the third channel-wise attention and spatial attention module 311, and the seventh convolutional layer (3x3, 512, /2) 312. The feature map f1, with resolution 8x8, is output from the seventh convolutional layer 312; the feature map f2 is output from the third channel-wise attention and spatial attention module 311; the feature map f3 is output from the second channel-wise attention and spatial attention module 309; and the feature map f4 is output from the first channel-wise attention and spatial attention module 306. The feature map f1 is up-sampled and added to f2 for feature fusion, yielding the feature map f1,2. The feature map f1,2 is up-sampled (for example, to 32x32) and added to the feature map f3, yielding the feature map f1,2,3. The feature map f1,2,3 passes through inception3 313 and, after up-sampling (for example, to 64x64), is added to the feature map f4, yielding the feature map f1,2,3,4. The feature map f1,2,3,4 undergoes feature extraction by inception4 314. In this process, text candidate box prediction (that is, text candidate box generation) is performed on the feature maps f1,2,3 output by inception3 and f1,2,3,4 output by inception4, respectively.
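The repeated "up-sample then add" fusion (f1 with f2, f1,2 with f3, and so on) can be sketched as a single helper. Bilinear interpolation is an assumption, as the publication does not state the interpolation mode, and the channel counts of the two maps are assumed to match.

```python
import torch
import torch.nn.functional as F

def fuse(deeper: torch.Tensor, shallower: torch.Tensor) -> torch.Tensor:
    """Upsample the deeper (lower-resolution) map to the shallower map's size
    and add them element-wise, as in f1 + f2 -> f1,2 and f1,2 + f3 -> f1,2,3."""
    up = F.interpolate(deeper, size=shallower.shape[-2:], mode="bilinear",
                       align_corners=False)
    return up + shallower

# e.g. f12 = fuse(f1, f2); f123 = fuse(f12, f3)  # then inception3, fuse with f4
```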
In the embodiment of the present disclosure, step S104 can be performed by, for example, the normalization unit 202 shown in Fig. 2. On the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, the normalization unit 202 crops the text regions of all text candidate boxes and then adjusts them to a uniform height K. This normalization preserves the aspect ratio of each text region, avoids deforming the text region, and thus provides a sound basis for the subsequent text recognition and text language category recognition.
In the embodiment of the present disclosure, step S104 may include: at step S104-1, normalizing the text regions according to the following formulas:
H' = K
W' = wH'/h
where W' and H' respectively denote the width and height of the normalized text region, and w and h respectively denote the original width and height of the text region.
Illustratively, K can be 64, though it can be changed to other values as needed.
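The formulas H' = K and W' = wH'/h amount to a height-preserving resize. A minimal sketch using Pillow (an assumed image library, not named in the publication) follows.

```python
from PIL import Image

def normalize_height(region: Image.Image, K: int = 64) -> Image.Image:
    """Rescale a cropped text region to height K while keeping its aspect
    ratio: H' = K, W' = w * H' / h (the formulas from the disclosure)."""
    w, h = region.size
    new_w = max(1, round(w * K / h))
    return region.resize((new_w, K), Image.BILINEAR)
```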
In the embodiment of the present disclosure, step S106-1 can be performed by, for example, the script recognition network 204. The script recognition network 204 can be implemented by a convolutional neural network (CNN). Table 1 below shows the structure of the script recognition network 204, which mainly includes: a plurality of alternately arranged convolutional layers (conv) and max-pooling layers (max-pooling), a global average pooling layer (global-avg-pool) following the last max-pooling layer, and a fully-connected layer (fully-connect) following the global average pooling layer. The fully-connected layer has a plurality of (for example, 7) neurons; the softmax output of each neuron represents the probability that the text in a text candidate box belongs to a certain language type or symbol, and the category with the highest probability is taken as the category of the text in that candidate box.
[Table 1 is presented as an image (PCTCN2020078928-appb-000001) in the original publication and is not reproduced here.]
Table 1: Network structure of the script recognition network
Illustratively, the global average pooling layer outputs a feature map of size 1×512. The fully-connected layer can contain 7 neurons, whose softmax outputs are 7 probabilities representing, respectively, the probability that the text in each text region is Arabic, Bengali, Chinese, Korean, Japanese, Latin, or a symbol. The category with the highest probability is the category of the text in the text region.
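A sketch of the classification head just described (global average pooling over an assumed 512-channel feature map, then a 7-way softmax) could look as follows; the convolutional trunk from Table 1 is omitted.

```python
import torch
import torch.nn as nn

SCRIPTS = ["Arabic", "Bengali", "Chinese", "Korean", "Japanese", "Latin", "Symbols"]

class ScriptHead(nn.Module):
    """Head of the script recognition network: global average pooling over
    the final feature map, then a fully-connected layer with 7 neurons."""
    def __init__(self, channels: int = 512, num_classes: int = 7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling -> 1x512
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.pool(feat).flatten(1)
        return self.fc(x).softmax(dim=-1)     # per-script probabilities

# probs = ScriptHead()(features)                     # features: (N, 512, H, W)
# category = SCRIPTS[int(probs.argmax(dim=-1)[0])]   # highest probability wins
```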
In the embodiment of the present disclosure, step S106-2 in Fig. 1 can be performed by, for example, the attention mechanism-based multilingual text recognition network 206 shown in Fig. 2. The attention mechanism-based multilingual text recognition network 206 uses a CNN as an encoder and then uses a CTC decoder to generate character sequences. It uses a cascade of channel-wise attention and spatial attention to make the CTC decoder focus more on the places where text exists, thereby improving the accuracy of text recognition. The structure of the encoder in the attention mechanism-based multilingual text recognition network 206 is shown in Table 2.
[Table 2 is presented as an image (PCTCN2020078928-appb-000002) in the original publication and is not reproduced here.]
Table 2: Structure of the encoder in the attention mechanism-based multilingual text recognition network
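The CTC decoder mentioned above ultimately collapses per-frame predictions into a character sequence. A minimal greedy decoding sketch is shown below; a full system might use beam search instead, and the blank index 0 is an assumption.

```python
import torch

def ctc_greedy_decode(logits: torch.Tensor, blank: int = 0) -> list:
    """Greedy CTC decoding: take the argmax at each time step, collapse
    consecutive repeats, then drop blanks. `logits`: (T, num_classes)."""
    best = logits.argmax(dim=-1).tolist()
    out, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:
            out.append(idx)
        prev = idx
    return out  # indices into the character vocabulary
```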
On the other hand, the method 10 provided by the embodiment of the present disclosure may optionally further include step S100. At step S100, training, validation, and testing of the text detector 200, the script recognition network 204, and the attention mechanism-based multilingual text recognition network 206 are performed using scene text images or cropped images. More specifically, the following data sets are constructed in advance: scene text images and cropped images. Both types of images contain text in multiple languages and are each divided into a training set, a validation set, and a test set, with the text in the training and validation sets annotated. The scene text images are used for the training, validation, and testing of the text detector 200, while the cropped images are used for the training, validation, and testing of the script recognition network 204 and the attention mechanism-based multilingual text recognition network 206.
Those skilled in the art will understand that a cropped image is an image containing text that has been cut in advance from an image containing both background and text, and is mainly used to train the attention mechanism-based multilingual text recognition network; a scene text image, by contrast, is a large image that, in addition to text, contains many background regions without text.
Illustratively, ICDAR MLT cropped images and scene text images can be downloaded from the Internet, with 68,613 cropped images used for training, 16,255 for validation, and 97,619 for testing, and 7,200 scene text images used for training, 1,800 for validation, and 9,000 for testing. These images contain six scripts: Arabic, Bengali, Chinese, Korean, Japanese, and Latin.
In the embodiment of the present disclosure, the text detector can be trained using the Adam optimizer, the initial learning rate can be set to 0.001, and the loss function can be defined as:
L_det = L_geo + L_dice
where L_dice is the dice loss, a loss function used for semantic segmentation: for a given region, each pixel has value 1 if it is text and 0 otherwise; the closer the predicted text probability is to 1, the closer the dice loss is to 0, and vice versa. L_dice is the sum of the classification losses over all pixels. L_geo is the sum of the IoU (intersection-over-union) loss L_IoU between the text candidate box and the ground truth (the text annotation) and the angle loss L_θ, that is, L_geo = L_IoU + λ_θ·L_θ, where λ_θ is a set coefficient which, illustratively, can be set to 1. Those skilled in the art will understand that Adam is a first-order optimization algorithm that can replace the traditional stochastic gradient descent procedure and iteratively updates neural network weights based on the training data.
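A minimal sketch of the dice loss as described (per-pixel text probabilities against a binary text mask) follows; the smoothing constant eps is an assumption added for numerical stability.

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor,
              eps: float = 1e-6) -> torch.Tensor:
    """Dice loss over a text/non-text segmentation map. `pred` holds per-pixel
    text probabilities; `target` is 1 for text pixels and 0 otherwise. The
    loss tends to 0 as predictions approach the ground truth, else toward 1."""
    inter = (pred * target).sum()
    union = pred.sum() + target.sum()
    return 1.0 - (2.0 * inter + eps) / (union + eps)
```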
In the embodiment of the present disclosure, the script recognition network can be optimized using a stochastic gradient descent algorithm, with the following parameters: momentum 0.9, initial learning rate 0.001, and the learning rate reduced to one tenth every 5 epochs.
In the embodiment of the present disclosure, the attention mechanism-based multilingual text recognition network can be trained using the Adam optimizer, with the following parameters: initial learning rate 0.001, β1 = 0.9, β2 = 0.99.
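The three training configurations just described map directly onto standard PyTorch optimizers; the models below are hypothetical stand-ins for the three networks.

```python
import torch

# Hypothetical placeholders for the three networks described above.
detector, script_net, recog_net = (torch.nn.Linear(8, 8) for _ in range(3))

# Text detector: Adam, initial learning rate 0.001.
det_opt = torch.optim.Adam(detector.parameters(), lr=0.001)

# Script recognition network: SGD with momentum 0.9, lr 0.001,
# decayed to one tenth every 5 epochs.
scr_opt = torch.optim.SGD(script_net.parameters(), lr=0.001, momentum=0.9)
scr_sched = torch.optim.lr_scheduler.StepLR(scr_opt, step_size=5, gamma=0.1)

# Attention-based recognition network: Adam with beta1=0.9, beta2=0.99.
rec_opt = torch.optim.Adam(recog_net.parameters(), lr=0.001, betas=(0.9, 0.99))
```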
The above solution of the embodiments of the present disclosure is entirely based on convolutional neural networks and can simultaneously detect and recognize text in multiple languages within a single model. In tests, the precision, recall, and F-Measure of this solution for localization and language type recognition on the multilingual ICDAR RRC-MLT test set are 0.6968, 0.6425, and 0.6687 respectively, while the best results of existing methods are 0.5759, 0.6207, and 0.5974. Compared with existing methods, our method is thus greatly improved. In addition, the precision, recall, and F-Measure of this method on the end-to-end recognition task of the ICDAR RRC-MLT test set are 0.502, 0.424, and 0.460, respectively.
Fig. 4 is a block diagram of a computer device 40 for multilingual text detection and recognition according to an embodiment of the present disclosure. As shown in Fig. 4, the computer device 40 includes a processor 41 and a memory 42. The memory 42 stores instructions executable by the processor 41. When the instructions are executed by the processor 41, the processor 41 performs a method including the following steps: performing feature extraction on an input image and generating a series of text candidate boxes; on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, cropping the text regions of all text candidate boxes and then normalizing them to a uniform height K; and recognizing the text in the normalized text regions. In some embodiments, recognizing the text in the normalized text regions includes: recognizing the type of the text in the normalized text regions to determine whether the corresponding text is a symbol or a specific language type; and/or recognizing the content of the text in the normalized text regions.
In the embodiment of the present disclosure, when the instructions are executed by the processor 41, the processor 41 can realize the functions of one or more of the text detector 200, the normalization unit 202, the script recognition network 204, and the attention mechanism-based multilingual text recognition network 206 shown in Fig. 2.
In the embodiment of the present disclosure, when the instructions are executed by the processor 41, the processor 41 can implement any step of the method shown in Fig. 1.
From the description of the above embodiments, those skilled in the art will clearly understand that the above embodiments can be implemented by software, or by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.). The non-volatile storage medium includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the various embodiments of the present disclosure.
The above are only preferred specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any changes or substitutions readily conceivable by those skilled in the art within the technical scope disclosed herein shall be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (19)

  1. A method for multilingual text detection and recognition, comprising:
    performing feature extraction on an input image and generating (S102) a series of text candidate boxes;
    on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, cropping the text regions of all text candidate boxes and then normalizing (S104) them to a uniform height K; and
    recognizing (S106) the text in the normalized text regions.
  2. The method according to claim 1, wherein recognizing (S106) the text in the normalized text regions comprises:
    recognizing (S106-1) the category of the text in the normalized text regions to determine whether the corresponding text is a symbol or a specific language type; and/or
    recognizing (S106-2) the content of the text in the normalized text regions.
  3. The method according to claim 2, wherein the series of text candidate boxes is generated by a text detector (200), the text detector (200) being formed by stacking four inception modules (305, 308, 313, 314) designed for text, three channel-wise attention and spatial attention modules (306, 309, 311), and seven convolutional layers (301-304, 307, 310, 312); wherein the channel-wise attention sub-module in the channel-wise attention and spatial attention modules (306, 309, 311) outputs the importance level of each channel of the feature map, and the spatial attention sub-module outputs an attention weight for each pixel of the feature map.
  4. The method according to any one of claims 1-3, wherein performing feature extraction on the input image and generating (S102) a series of text candidate boxes comprises:
    outputting (S102-1) P oriented text candidate boxes for each pixel of the feature map; and
    processing (S102-2) the P oriented text candidate boxes using non-maximum suppression to obtain M oriented text candidate boxes.
  5. The method according to any one of claims 1-3, wherein cropping the text regions of all text candidate boxes and then normalizing (S104) them to a uniform height K comprises:
    on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, normalizing (S104-1) the text regions corresponding to all text candidate boxes to a uniform height K according to the following formulas:
    H' = K
    W' = wH'/h
    where W' and H' respectively denote the width and height of the corresponding normalized text region, and w and h respectively denote the original width and height of the corresponding text region.
  6. The method according to claim 3, wherein the category of the text contained in the normalized text region is recognized by a script recognition network (204), the script recognition network (204) comprising a plurality of alternately arranged convolutional layers and max-pooling layers, a global average pooling layer following the last max-pooling layer, and a fully-connected layer following the global average pooling layer;
    wherein the fully-connected layer has a plurality of neurons, the softmax output of each neuron representing the probability that the text in each text candidate box belongs to a certain language type or symbol, the category with the highest probability being the category of the text in the corresponding text candidate box.
  7. The method according to claim 6, wherein the content of the text contained in the normalized text region is recognized by an attention mechanism-based multilingual text recognition network (206), wherein the attention mechanism-based multilingual text recognition network (206) uses a CNN as an encoder and then uses a CTC decoder to generate character sequences; and wherein the attention mechanism-based multilingual text recognition network uses a cascade of channel-wise attention and spatial attention to make the CTC decoder pay more attention to text candidate boxes containing text.
  8. The method according to claim 7, wherein:
    the text detector (200) is trained using the Adam optimizer, with the loss function defined as:
    L_det = L_geo + L_dice
    where L_dice is the dice loss; L_geo is the sum of the IoU loss L_IoU between the text candidate box and the ground truth and the angle loss L_θ: L_geo = L_IoU + λ_θ·L_θ, λ_θ being a set coefficient;
    the script recognition network (204) is optimized using a stochastic gradient descent algorithm; and
    the attention mechanism-based multilingual text recognition network (206) is trained using the Adam optimizer.
  9. The method according to claim 8, further comprising:
    performing (S100) training, validation, and testing of the text detector (200), the script recognition network (204), and the attention mechanism-based multilingual text recognition network (206) using scene text images or cropped images,
    wherein both the scene text images and the cropped images contain text of multiple language types and are each divided into a training set, a validation set, and a test set, the text in the training set and the validation set being annotated,
    and wherein the scene text images are used for the training, validation, and testing of the text detector (200), and the cropped images are used for the training, validation, and testing of the script recognition network (204) and the attention mechanism-based multilingual text recognition network (206).
  10. A computer device (40) for multilingual text detection and recognition, comprising:
    a processor (41); and
    a memory (42), the memory (42) storing instructions executable by the processor (41) which, when executed by the processor (41), cause the processor (41) to:
    perform feature extraction on an input image and generate a series of text candidate boxes;
    on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, crop the text regions of all text candidate boxes and then normalize them to a uniform height K; and
    recognize the text in the normalized text regions.
  11. The computer device according to claim 10, wherein the instructions, when executed by the processor (41), further cause the processor (41) to:
    recognize the category of the text in the normalized text regions to determine whether the corresponding text is a symbol or a specific language type; and/or
    recognize the content of the text in the normalized text regions.
  12. The computer device according to claim 11, wherein the instructions, when executed by the processor (41), further cause the processor (41) to:
    generate the series of text candidate boxes by means of a text detector (200), the text detector (200) being formed by stacking four inception modules (305, 308, 313, 314) designed for text, three channel-wise attention and spatial attention modules (306, 309, 311), and seven convolutional layers (301-304, 307, 310, 312); wherein the channel-wise attention sub-module in the channel-wise attention and spatial attention modules (306, 309, 311) outputs the importance level of each channel of the feature map, and the spatial attention sub-module outputs an attention weight for each pixel of the feature map.
13. The computer device according to any one of claims 10-12, wherein the instructions, when executed by the processor (41), further cause the processor (41) to:
    output P oriented text candidate boxes for each pixel of the feature map; and
    process the P oriented text candidate boxes using non-maximum suppression to obtain M oriented text candidate boxes.
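A sketch of non-maximum suppression; for brevity it scores axis-aligned boxes, whereas the claimed boxes are oriented and would require rotated-rectangle IoU. The overlap threshold is an assumption.

    import numpy as np

    def nms(boxes, scores, iou_thresh=0.5):
        """Axis-aligned NMS; boxes are (x1, y1, x2, y2) rows."""
        order = np.argsort(scores)[::-1]  # highest score first
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            rest = order[1:]
            # Intersection of the top box with the remaining boxes.
            x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
            y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
            x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
            y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
            inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
            iou = inter / (area_i + area_r - inter + 1e-9)
            order = rest[iou <= iou_thresh]  # drop boxes overlapping too much
        return keep

    boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
    print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # -> [0, 2]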
14. The computer device according to any one of claims 10-12, wherein the instructions, when executed by the processor (41), further cause the processor (41) to:
    normalize (S1041) the text regions corresponding to all text candidate boxes to a uniform height K according to the following formulas, while maintaining the original aspect ratio of the text region corresponding to each text candidate box:
    H' = K
    W' = wH'/h
    where W' and H' respectively denote the width and the height of the corresponding text region after normalization, and w and h respectively denote the original width and height of that text region.
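The normalization formulas above translate directly into a small helper; K = 32 is an illustrative default, not prescribed by the claim.

    def normalized_size(w, h, K=32):
        # H' = K; W' = w * H' / h  (aspect ratio preserved)
        H_new = K
        W_new = max(1, round(w * H_new / h))
        return W_new, H_new

    print(normalized_size(120, 40))  # -> (96, 32)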
15. The computer device according to claim 12, wherein the instructions, when executed by the processor (41), further cause the processor (41) to:
    recognize the category of the text in the normalized text regions by means of a script recognition network (204), wherein the script recognition network (204) comprises a plurality of alternately arranged convolutional layers and max-pooling layers, a global average pooling layer following the last max-pooling layer, and a fully connected layer following the global average pooling layer;
    wherein the fully connected layer has a plurality of neurons, the softmax output of each neuron respectively representing the probability that the text in a given text candidate box belongs to a particular language type or is a symbol, and the category with the highest probability is taken as the category of the text in the corresponding text candidate box.
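A minimal PyTorch sketch of a script recognition network of the recited shape (alternating conv/max-pool layers, global average pooling, fully connected layer with softmax); the depth, channel widths, and the example class count are assumptions.

    import torch
    import torch.nn as nn

    NUM_CLASSES = 5  # number of language types plus a symbol class; illustrative

    script_net = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                 # last max-pooling layer
        nn.AdaptiveAvgPool2d(1),         # global average pooling layer
        nn.Flatten(),
        nn.Linear(128, NUM_CLASSES),     # fully connected layer
    )

    logits = script_net(torch.randn(1, 3, 32, 128))
    probs = torch.softmax(logits, dim=1)  # per-class probabilities
    category = probs.argmax(dim=1)        # highest probability wins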
16. The computer device according to claim 15, wherein the instructions, when executed by the processor (41), further cause the processor (41) to:
    recognize the content of the text in the normalized text regions by means of an attention-mechanism-based multilingual text recognition network (206), wherein the attention-mechanism-based multilingual text recognition network (206) uses a CNN as an encoder and then uses a CTC decoder to generate character sequences; and wherein the attention-mechanism-based multilingual text recognition network (206) uses channel-wise attention and spatial attention modules so that the CTC decoder focuses more on text candidate boxes that actually contain text.
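A sketch of greedy CTC decoding as the final step of such a recognizer; the CNN encoder and attention modules are assumed to have already produced per-timestep class scores, and the alphabet is illustrative.

    import torch

    def ctc_greedy_decode(log_probs, alphabet, blank=0):
        """log_probs: (T, num_classes) per-timestep scores; class 0 is the blank."""
        best = log_probs.argmax(dim=1).tolist()
        chars, prev = [], blank
        for idx in best:
            # CTC rule: collapse consecutive repeats, then drop blanks.
            if idx != prev and idx != blank:
                chars.append(alphabet[idx - 1])  # shift past the blank class
            prev = idx
        return "".join(chars)

    alphabet = "abc"  # illustrative character set
    out = ctc_greedy_decode(torch.randn(10, len(alphabet) + 1), alphabet)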
17. The computer device according to claim 16, wherein:
    the text detector (200) is trained using the Adam optimizer, with the loss function defined as:
    L_det = L_geo + L_dice
    where L_dice is the dice loss, and L_geo is the sum of the IoU loss L_IoU between the text candidate boxes and the ground truth and the angle loss L_θ, i.e. L_geo = L_IoU + λ_θ·L_θ, λ_θ being a preset coefficient;
    the script recognition network (204) is optimized using a stochastic gradient descent algorithm; and
    the attention-mechanism-based multilingual text recognition network (206) is trained using the Adam optimizer.
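A sketch of composing the detection loss as defined in claim 17; the dice term here operates on per-pixel text score maps, and the value of λ_θ (10.0 below) is an assumption, as the claim only states it is a preset coefficient.

    import torch

    def dice_loss(pred, target, eps=1e-6):
        # pred/target: per-pixel text score maps in [0, 1].
        inter = (pred * target).sum()
        return 1 - 2 * inter / (pred.sum() + target.sum() + eps)

    def detection_loss(pred_map, gt_map, l_iou, l_theta, lambda_theta=10.0):
        l_geo = l_iou + lambda_theta * l_theta      # L_geo = L_IoU + λ_θ·L_θ
        return l_geo + dice_loss(pred_map, gt_map)  # L_det = L_geo + L_dice

    # Training with Adam, as the claim recites (the learning rate is an assumption):
    # optimizer = torch.optim.Adam(detector.parameters(), lr=1e-4)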
18. The computer device according to claim 17, wherein the instructions, when executed by the processor (41), further cause the processor (41) to:
    use scene text images or cropped images to perform the training, validation, and testing of the text detector (200), the script recognition network (204), and the attention-mechanism-based multilingual text recognition network (206),
    wherein the scene text images and the cropped images both contain text in multiple language types and are each divided into a training set, a validation set, and a test set, and wherein the text in the training set and in the validation set is annotated,
    and wherein the scene text images are used for the training, validation, and testing of the text detector (200), while the cropped images are used for the training, validation, and testing of the script recognition network (204) and the attention-mechanism-based multilingual text recognition network (206).
19. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, causes the processor to perform the method according to any one of claims 1-9.
PCT/CN2020/078928 2019-03-26 2020-03-12 Multi-language text detection and recognition method and device WO2020192433A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910232853.0A CN109948615B (en) 2019-03-26 2019-03-26 Multi-language text detection and recognition system
CN201910232853.0 2019-03-26

Publications (1)

Publication Number Publication Date
WO2020192433A1 (en)

Family

ID=67010832

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/078928 WO2020192433A1 (en) 2019-03-26 2020-03-12 Multi-language text detection and recognition method and device

Country Status (2)

Country Link
CN (1) CN109948615B (en)
WO (1) WO2020192433A1 (en)


Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948615B (en) * 2019-03-26 2021-01-26 中国科学技术大学 Multi-language text detection and recognition system
CN110942067A (en) * 2019-11-29 2020-03-31 上海眼控科技股份有限公司 Text recognition method and device, computer equipment and storage medium
CN111126243B (en) * 2019-12-19 2023-04-07 北京科技大学 Image data detection method and device and computer readable storage medium
CN111259764A (en) * 2020-01-10 2020-06-09 中国科学技术大学 Text detection method and device, electronic equipment and storage device
CN111507406A (en) * 2020-04-17 2020-08-07 上海眼控科技股份有限公司 Method and equipment for optimizing neural network text recognition model
CN111914843B (en) * 2020-08-20 2021-04-16 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Character detection method, system, equipment and storage medium
CN114170594A (en) * 2021-12-07 2022-03-11 奇安信科技集团股份有限公司 Optical character recognition method, device, electronic equipment and storage medium
CN118378707A (en) * 2024-06-21 2024-07-23 中国科学技术大学 Dynamic evolution multi-mode value generation method based on value system guidance


Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN106570497A (en) * 2016-10-08 2017-04-19 中国科学院深圳先进技术研究院 Text detection method and device for scene image
CN108491836B (en) * 2018-01-25 2020-11-24 华南理工大学 Method for integrally identifying Chinese text in natural scene image
CN109359293B (en) * 2018-09-13 2019-09-10 内蒙古大学 Mongolian name entity recognition method neural network based and its identifying system
CN109492679A (en) * 2018-10-24 2019-03-19 杭州电子科技大学 Based on attention mechanism and the character recognition method for being coupled chronological classification loss

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
CN107220641A (en) * 2016-03-22 2017-09-29 华南理工大学 A kind of multi-language text sorting technique based on deep learning
US20180137349A1 (en) * 2016-11-14 2018-05-17 Kodak Alaris Inc. System and method of character recognition using fully convolutional neural networks
CN108470172A (en) * 2017-02-23 2018-08-31 阿里巴巴集团控股有限公司 A kind of text information identification method and device
CN106980858A (en) * 2017-02-28 2017-07-25 中国科学院信息工程研究所 The language text detection of a kind of language text detection with alignment system and the application system and localization method
CN109948615A (en) * 2019-03-26 2019-06-28 中国科学技术大学 Multi-language text detects identifying system

Non-Patent Citations (1)

Title
CHEN, XIAOLONG ET AL.: "Electricity Equipment Nameplate Recognition Based on Deep Learning", Journal of Guangxi University (Natural Science Edition), vol. 43, no. 6, 31 December 2018 (2018-12-31) *

Cited By (10)

Publication number Priority date Publication date Assignee Title
CN112613348A (en) * 2020-12-01 2021-04-06 浙江华睿科技有限公司 Character recognition method and electronic equipment
CN113159021A (en) * 2021-03-10 2021-07-23 国网河北省电力有限公司 Text detection method based on context information
CN113095370A (en) * 2021-03-18 2021-07-09 北京达佳互联信息技术有限公司 Image recognition method and device, electronic equipment and storage medium
CN113095370B (en) * 2021-03-18 2023-11-03 北京达佳互联信息技术有限公司 Image recognition method, device, electronic equipment and storage medium
CN113255646A (en) * 2021-06-02 2021-08-13 北京理工大学 Real-time scene text detection method
CN113255646B (en) * 2021-06-02 2022-10-18 北京理工大学 Real-time scene text detection method
CN113537189A (en) * 2021-06-03 2021-10-22 深圳市雄帝科技股份有限公司 Handwritten character recognition method, device, equipment and storage medium
CN114743045A (en) * 2022-03-31 2022-07-12 电子科技大学 Small sample target detection method based on double-branch area suggestion network
CN114743045B (en) * 2022-03-31 2023-09-26 电子科技大学 Small sample target detection method based on double-branch area suggestion network
CN115936073A (en) * 2023-02-16 2023-04-07 江西省科学院能源研究所 Language-oriented convolutional neural network and visual question-answering method

Also Published As

Publication number Publication date
CN109948615A (en) 2019-06-28
CN109948615B (en) 2021-01-26

Similar Documents

Publication Publication Date Title
WO2020192433A1 (en) Multi-language text detection and recognition method and device
US10558893B2 (en) Systems and methods for recognizing characters in digitized documents
WO2020221298A1 (en) Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
CN111488826B (en) Text recognition method and device, electronic equipment and storage medium
US11507800B2 (en) Semantic class localization digital environment
KR102275413B1 (en) Detecting and extracting image document components to create flow document
WO2019192397A1 (en) End-to-end recognition method for scene text in any shape
US20180114071A1 (en) Method for analysing media content
CN108427950B (en) Character line detection method and device
CN111507335A (en) Method and device for automatically labeling training images for deep learning network
CN109934229B (en) Image processing method, device, medium and computing equipment
WO2021081562A2 (en) Multi-head text recognition model for multi-lingual optical character recognition
EP3910532B1 (en) Learning method and learning device for training an object detection network by using attention maps and testing method and testing device using the same
CN109712164A (en) Image intelligent cut-out method, system, equipment and storage medium
JP7198350B2 (en) CHARACTER DETECTION DEVICE, CHARACTER DETECTION METHOD AND CHARACTER DETECTION SYSTEM
CN111291759A (en) Character detection method and device, electronic equipment and storage medium
JP2021135993A (en) Text recognition method, text recognition apparatus, electronic device, and storage medium
US20190294963A1 (en) Signal processing device, signal processing method, and computer program product
US20220101065A1 (en) Automatic document separation
WO2021237227A1 (en) Method and system for multi-language text recognition model with autonomous language classification
CN112070040A (en) Text line detection method for video subtitles
CN111178363B (en) Character recognition method, character recognition device, electronic equipment and readable storage medium
CN114462490A (en) Retrieval method, retrieval device, electronic device and storage medium of image object
CN114373092A (en) Progressive training fine-grained vision classification method based on jigsaw arrangement learning
US20240062560A1 (en) Unified scene text detection and layout analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20779406

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20779406

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.03.2022)
