WO2020192433A1 - Multi-language text detection and recognition method and device - Google Patents
- Publication number: WO2020192433A1 (application PCT/CN2020/078928)
- Authority: WO (WIPO (PCT))
- Prior art keywords: text, attention, processor, recognition network, normalized
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/24—Aligning, centring, orientation detection or correction of the image
Definitions
- the present disclosure relates to the field of artificial intelligence, and in particular to methods and devices for multilingual text detection and recognition.
- existing scene text recognition systems mainly target pre-cropped text and cannot detect and recognize text in an image at the same time.
- the few methods that can detect and recognize text simultaneously handle only English text.
- the purpose of the present disclosure is to provide a multilingual text detection and recognition method and device, which can simultaneously detect and recognize texts in multiple languages in a scene text image.
- the purpose of the present disclosure is achieved by a multilingual text detection and recognition method.
- the method includes:
- the computer equipment includes:
- a memory, where the memory stores instructions executable by the processor; when the instructions are executed by the processor, the processor is caused to perform the method.
- the technical solution provided according to the present disclosure can simultaneously detect and recognize texts in multiple languages, and has a higher accuracy rate than traditional text detection and multi-language recognition solutions.
- Fig. 1 is a flowchart of a method for multilingual text detection and recognition according to an embodiment of the present disclosure
- FIG. 2 is a schematic diagram of a system that can be used to implement the method in FIG. 1 according to an embodiment of the present disclosure
- FIG. 3 is a schematic diagram of the structure of a text detector provided according to an embodiment of the present disclosure.
- Fig. 4 is a block diagram of a computer device for multilingual text detection and recognition according to an embodiment of the present disclosure.
- Fig. 1 shows a method 10 for multilingual text detection and recognition according to an embodiment of the present disclosure.
- the method 10 includes: at step S102, performing feature extraction on the input image and generating a series of text candidate boxes; at step S104, on the basis of maintaining the original aspect ratio of the text area corresponding to each text candidate box, cropping the text areas of all text candidate boxes and then normalizing and adjusting them to a uniform height; and at step S106, recognizing the text in the normalized and adjusted text areas.
- recognizing the text in the normalized and adjusted text area may further comprise: at step S106-1, recognizing the category of the text in the normalized and adjusted text area to determine whether the corresponding text is a symbol or a specific language type; and/or, at step S106-2, recognizing the content of the text in the normalized and adjusted text area.
- the above-mentioned method according to the embodiment of the present disclosure can be applied to machine translation.
- texts in different languages can be recognized and then translated into the desired language.
- the above method can also be used for autonomous driving.
- road signs in different languages can be detected and recognized, so as to choose the correct direction to move forward.
- Fig. 2 is a schematic diagram of a system that can be used to implement the method of Fig. 1 according to an embodiment of the present disclosure. In the following, the execution of each step in FIG. 1 will be described in further detail by way of examples in conjunction with FIG. 2.
- Step S102, that is, performing feature extraction on the input image and generating a series of text candidate boxes, can be performed by, for example, the text detector 200 shown in FIG. 2.
- FIG. 3 is a schematic structural diagram of a text detector 200 according to an embodiment of the present disclosure.
- the text detector 200 is formed by stacking four inception modules 305, 308, 313, and 314 designed for text, three channel-wise attention and spatial attention modules 306, 309, and 311, and seven convolutional layers 301-304, 307, 310, and 312.
- each inception module can use 1×5 and 5×1 convolution kernels; since text generally has a large aspect ratio, such kernels are better suited to text.
- Step S102 may include, for example: at step S102-1, the text detector 200 shown in FIG. 2 outputs, for each pixel of the feature map, P oriented text candidate boxes; and at step S102-2, non-maximum suppression is used to process these text candidate boxes to obtain M oriented text candidate boxes.
- each image is adjusted to a size of 256×256 and then input to the text detector 200.
- the text detector 200 outputs 14 oriented text candidate boxes for each pixel in the feature map. Non-maximum suppression (NMS) is then used to process these text candidate boxes, removing redundant text proposal boxes and speeding up the computation.
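The NMS step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it uses axis-aligned boxes and plain IoU, whereas the detector actually produces oriented boxes whose overlap computation is more involved.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and drop every remaining box that overlaps it above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

For example, two heavily overlapping candidates collapse to the higher-scoring one, while a distant box survives.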
- NMS: non-maximum suppression
- 3×3 means that a convolution kernel with a width and height of 3 is used in the convolution operation (1×1 has a similar meaning); the 7 convolution layers correspond to the 3×3 parts in the figure.
- 16 means that 16 convolution kernels are used in the convolution operation (1, 2, 4, 64, 256, and 512 have similar meanings); /2 means that the resolution of the feature map is halved; upsample denotes an up-sampling operation whose function is to increase the resolution of the feature map; f1 to f4, f1,2, f1,2,3, and f1,2,3,4 are the feature maps obtained at each stage; segmentation1 and segmentation2 denote segmentation maps of the text area; box1 and box2 denote the predicted distances from each pixel on the feature map to the four sides (top, bottom, left, and right) of the text candidate box; and angle1 and angle2 denote the angle of the text, since some text is not horizontal and may be inclined relative to the horizontal.
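The box1/box2 outputs above can be turned into a concrete box per pixel. The sketch below is illustrative only (the function name and the axis-aligned simplification are mine; the angle output would additionally rotate the recovered box, which is omitted here):

```python
def decode_box(px, py, top, bottom, left, right):
    """Recover an axis-aligned box (x1, y1, x2, y2) from the predicted
    distances of pixel (px, py) to the four sides of the text box.
    Applying the predicted angle would rotate this box about the pixel."""
    return (px - left, py - top, px + right, py + bottom)
```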
- the workflow of the text detector 200 is briefly as follows: the input image is fed into the network and passes through the first four convolutional layers 301-304, inception1 305, the first channel-wise attention and spatial attention module 306 (referred to as the attention module for short), the fifth convolutional layer (3×3, 128, /2) 307, inception2 308, the second channel-wise attention and spatial attention module 309, the sixth convolutional layer (3×3, 256, /2) 310, the third channel-wise attention and spatial attention module 311, and the seventh convolutional layer (3×3, 512, /2) 312.
- the feature map f1 is up-sampled and then added to f2 for feature fusion to obtain the feature map f1,2.
- the feature map f1,2 is up-sampled (for example, up-sampled to 32x32) and then added to the feature map f3 for feature fusion, thereby obtaining the feature map f1,2,3.
- the feature map f1,2,3 passes through inception3 313, and then after upsampling (for example, upsampling to 64x64), it is added with the feature map f4 for feature fusion, thereby obtaining the feature map f1,2,3,4.
- feature maps f1,2,3,4 are further processed by inception4 314.
- the feature map f1,2,3 output by inception3 and the feature map f1,2,3,4 output by inception4 are respectively used to predict the text candidate boxes (that is, to generate the text candidate boxes).
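The "upsample then add" fusion used to build f1,2, f1,2,3, and f1,2,3,4 can be sketched in miniature. This is a toy single-channel version with nearest-neighbour upsampling on nested lists, purely to show the mechanism; the real network operates on multi-channel tensors:

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D feature map (list of lists)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in (0, 1)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                   # duplicate each row
    return out

def fuse(coarse, fine):
    """Feature fusion as in the text detector: upsample the coarser map,
    then add it elementwise to the finer map of matching size."""
    up = upsample2x(coarse)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(up, fine)]
```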
- step S104 may be performed by, for example, the normalization unit 202 shown in FIG. 2.
- the normalization unit 202 trims the text regions of all text candidate boxes and then adjusts them to a uniform height K on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box.
- This normalization method maintains the aspect ratio of the corresponding text area, avoids the deformation of the text area, and provides a guarantee for the subsequent text recognition and text language category recognition.
- step S104 may include: at step S104-1, normalizing and adjusting the text areas according to the following formulas: H' = K; W' = wH'/h.
- W' and H' respectively represent the width and height of the text area after the normalization adjustment; w and h respectively represent the original width and height of the text area.
- K can be 64, though it can also be changed to other values as needed.
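The normalization of step S104 reduces to this small computation (the formulas H' = K and W' = wH'/h appear in the claims; the rounding to integer pixels is my assumption):

```python
def normalize_size(w, h, K=64):
    """Normalization from step S104: fix the height to K and scale the
    width so that the original aspect ratio w/h is preserved:
        H' = K,  W' = w * H' / h
    Returns (W', H'); rounding to whole pixels is assumed."""
    H_new = K
    W_new = round(w * H_new / h)
    return W_new, H_new
```

A 128x32 text region, for instance, becomes 256x64: the height is unified to K while the 4:1 aspect ratio is kept, so the text is not deformed.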
- step S106-1 may be performed by, for example, the script recognition network 204.
- the script recognition network 204 can be implemented by a convolutional neural network (CNN).
- CNN: convolutional neural network
- Table 1 shows the structure of the script recognition network 204, which mainly includes: a plurality of alternately arranged convolutional layers (conv) and max-pooling layers (max-pooling), a global average pooling layer (global-avg-pool) after the last max-pooling layer, and a fully connected layer after the global average pooling layer. The fully connected layer has multiple (for example, 7) neurons; the softmax output of each neuron represents the probability that the text in each text candidate box belongs to a certain language type or is a symbol, and the highest probability gives the category of the text in the text candidate box.
- the global average pooling layer outputs a feature map of size 1×512.
- the fully connected layer can contain 7 neurons.
- the softmax outputs of these 7 neurons are 7 decimals, representing the probability that the text in each text area is Arabic, Bengali, Chinese, Korean, Japanese, Latin, or a symbol.
- the highest probability is the category of the text in the text area.
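The script classification head described above amounts to a softmax over 7 logits followed by an argmax. A minimal sketch (the function names and the raw-logit inputs are assumptions; the patent only specifies the softmax outputs and the seven categories):

```python
import math

SCRIPTS = ["Arabic", "Bengali", "Chinese", "Korean", "Japanese", "Latin", "Symbol"]

def softmax(logits):
    """Numerically stable softmax over the fully connected layer's 7 outputs."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_script(logits):
    """Pick the category whose softmax probability is highest."""
    probs = softmax(logits)
    return SCRIPTS[probs.index(max(probs))]
```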
- step S106-2 in FIG. 1 may be performed by, for example, the attention mechanism-based multilingual text recognition network 206 shown in FIG. 2.
- the attention mechanism-based multilingual text recognition network 206 uses CNN as an encoder, and then uses a CTC decoder to generate character sequences.
- the attention mechanism-based multilingual text recognition network 206 uses the channel-wise attention and spatial attention cascade to make the CTC decoder pay more attention to the place where the text exists, thereby improving the accuracy of text recognition.
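The CTC decoder's standard greedy decoding step can be illustrated as follows. This shows only the generic CTC collapse rule (merge consecutive repeats, then drop blanks), not the patent's full attention-augmented decoder; the blank index of 0 is an assumption:

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Greedy CTC decoding: given the best class id per time frame,
    collapse consecutive repeats, then remove blank tokens."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out
```

Note that a blank between two identical ids separates them, which is how CTC can emit doubled characters.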
- the structure of the encoder in the attention mechanism-based multilingual text recognition network 206 is shown in Table 2.
- Table 2: The structure of the encoder in the multilingual text recognition network based on the attention mechanism
- the method 10 provided by the embodiment of the present disclosure may optionally further include step S100.
- at step S100, the text detector 200, the script recognition network 204, and the attention mechanism-based multilingual text recognition network 206 are trained, validated, and tested using scene text images or cropped images. More specifically, the following data sets are constructed in advance: scene text images and cropped images. Both types of images contain texts in multiple languages and are divided into a training set, a validation set, and a test set. The texts in the training set and the validation set are labeled.
- the scene text images are used for training, validation, and testing of the text detector 200, while the cropped images are used for training, validation, and testing of the script recognition network 204 and the attention mechanism-based multilingual text recognition network 206.
- a cropped image is an image containing only text, cut in advance from an image containing both background and text, and is mainly used to train the attention mechanism-based multilingual text recognition network; a scene text image, by contrast, is a large image that contains many background areas without text in addition to the text.
- the data sets include, for example, ICDAR MLT cropped images and scene text images collected from the Internet.
- These images contain six types of characters, namely Arabic, Bengali, Chinese, Korean, Japanese and Latin.
- the text detector can be trained using the Adam optimizer, the initial learning rate can be set to 0.001, and the loss function can be defined as: L_det = L_geo + L_dice, where L_geo is the sum of the IoU loss L_IoU between the text candidate box and the ground truth and the angle loss L_θ: L_geo = L_IoU + λ_θ·L_θ, with λ_θ a set coefficient.
- L_dice is the dice loss, a loss function used for semantic segmentation. For each pixel of a region, its label is 1 if the pixel is text and 0 otherwise; if the predicted probability of text for a text pixel is close to 1, the dice loss tends to 0, and otherwise it tends to 1. L_dice is the sum of the classification losses over all pixels.
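A common soft formulation of the dice loss over a flattened segmentation map looks like this. It is a sketch of the standard formulation, not necessarily the exact variant used in the patent; the smoothing term `eps` is my addition to avoid division by zero:

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft dice loss: 1 - 2*|P∩T| / (|P| + |T|), computed on flattened maps.
    pred: predicted text probabilities in [0, 1]; target: 1 for text pixels,
    0 otherwise. Tends to 0 as predictions approach the target labels."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (total + eps)
```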
- Adam is a first-order optimization algorithm that can replace the traditional stochastic gradient descent process. It can iteratively update neural network weights based on training data.
- the script recognition network can be optimized using a stochastic gradient descent algorithm with the following parameters: momentum 0.9, initial learning rate 0.001, and the learning rate reduced to one tenth every 5 epochs.
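The learning-rate schedule described above is a step decay. A small sketch (whether the first drop happens exactly at epoch 5, as the integer division below assumes, is my reading of "every 5 epochs"):

```python
def learning_rate(epoch, base_lr=0.001, drop_every=5, factor=0.1):
    """Step decay: starting from base_lr, multiply the learning rate
    by `factor` (here one tenth) once every `drop_every` epochs."""
    return base_lr * (factor ** (epoch // drop_every))
```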
- the above-mentioned solutions of the embodiments of the present disclosure are completely based on convolutional neural networks, and can simultaneously detect and recognize texts in multiple languages in one model.
- the precision, recall, and F-measure of this solution on the multilingual ICDAR RRC-MLT test set are 0.6968, 0.6425, and 0.6687, respectively, while the best results of existing methods are 0.5759, 0.6207, and 0.5974. It can be seen that this method is a substantial improvement over existing methods.
- for end-to-end recognition on the ICDAR RRC-MLT test set, the precision, recall, and F-measure of this method are 0.502, 0.424, and 0.460, respectively.
- Fig. 4 is a block diagram of a computer device 40 for multilingual text detection and recognition according to an embodiment of the present disclosure.
- the computer device 40 includes a processor 41 and a memory 42.
- the memory 42 stores instructions executable by the processor 41.
- the processor 41 is caused to execute a method including the following steps: extracting features of the input image and generating a series of text candidate boxes; on the basis of maintaining the original aspect ratio of the text area corresponding to each text candidate box, cropping the text areas of all text candidate boxes and then normalizing and adjusting them to a uniform height K; and recognizing the text in the text areas after the normalization adjustment.
- recognizing the text in the text area after the normalization adjustment includes: recognizing the category of the text in the normalized and adjusted text area to determine whether the corresponding text is a symbol or a specific language type; and/or recognizing the content of the text in the normalized and adjusted text area.
- when the instructions are executed by the processor 41, the processor 41 can implement the functions of one or more of the text detector 200, the normalization unit 202, the script recognition network 204, and the attention mechanism-based multilingual text recognition network 206.
- when the instructions are executed by the processor 41, the processor 41 can be made to implement any step of the method shown in FIG. 1.
- the software product can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.).
- the non-volatile storage medium includes a number of instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute the methods described in the various embodiments of the present disclosure.
Claims (19)
- A method for multilingual text detection and recognition, comprising: performing feature extraction on an input image and generating (S102) a series of text candidate boxes; on the basis of maintaining the original aspect ratio of the text area corresponding to each text candidate box, cropping the text areas of all text candidate boxes and then normalizing and adjusting them (S104) to a uniform height K; and recognizing (S106) the text in the normalized and adjusted text areas.
- The method according to claim 1, wherein recognizing (S106) the text in the normalized and adjusted text area comprises: recognizing (S106-1) the category of the text in the normalized and adjusted text area to determine whether the corresponding text is a symbol or a specific language type; and/or recognizing (S106-2) the content of the text in the normalized and adjusted text area.
- The method according to claim 2, wherein the series of text candidate boxes are generated by a text detector (200), the text detector (200) being formed by stacking 4 inception modules (305, 308, 313, 314) designed for text, 3 channel-wise attention and spatial attention modules (306, 309, 311), and 7 convolutional layers (301-304, 307, 310, 312); wherein the channel-wise attention sub-module in each channel-wise attention and spatial attention module (306, 309, 311) outputs the importance level of each channel of the feature map, and the spatial attention sub-module outputs the attention weight of each pixel in the feature map.
- The method according to any one of claims 1-3, wherein performing feature extraction on the input image and generating (S102) a series of text candidate boxes comprises: outputting (S102-1), for each pixel of the feature map, P oriented text candidate boxes; and processing (S102-2) the P oriented text candidate boxes using non-maximum suppression to obtain M oriented text candidate boxes.
- The method according to any one of claims 1-3, wherein cropping the text areas of all text candidate boxes and then normalizing and adjusting (S104) them to a uniform height K comprises: on the basis of maintaining the original aspect ratio of the text area corresponding to each text candidate box, normalizing and adjusting (S104-1) the text areas corresponding to all text candidate boxes to a uniform height K according to the following formulas: H' = K; W' = wH'/h, where W' and H' respectively represent the width and height of the corresponding text area after the normalization adjustment, and w and h respectively represent the original width and height of the corresponding text area.
- The method according to claim 3, wherein the category of the text contained in the normalized and adjusted text area is recognized by a script recognition network (204), the script recognition network (204) comprising a plurality of alternately arranged convolutional layers and max-pooling layers, a global average pooling layer after the last max-pooling layer, and a fully connected layer after the global average pooling layer; wherein the fully connected layer has a plurality of neurons, the softmax output of each neuron represents the probability that the text in each text candidate box belongs to a certain language type or is a symbol, and the highest probability gives the category of the text in the corresponding text candidate box.
- The method according to claim 6, wherein the content of the text contained in the normalized and adjusted text area is recognized by an attention mechanism-based multilingual text recognition network (206), wherein the attention mechanism-based multilingual text recognition network (206) uses a CNN as an encoder and then uses a CTC decoder to generate character sequences; and wherein the attention mechanism-based multilingual text recognition network uses a cascade of channel-wise attention and spatial attention to make the CTC decoder pay more attention to text candidate boxes containing text.
- The method according to claim 7, wherein: the text detector (200) is trained using the Adam optimizer, with the loss function defined as L_det = L_geo + L_dice, where L_dice is the dice loss and L_geo is the sum of the IoU loss L_IoU between the text candidate box and the ground truth and the angle loss L_θ: L_geo = L_IoU + λ_θ·L_θ, λ_θ being a set coefficient; the script recognition network (204) is optimized using a stochastic gradient descent algorithm; and the attention mechanism-based multilingual text recognition network (206) is trained using the Adam optimizer.
- The method according to claim 8, further comprising: performing (S100) the training, validation, and testing of the text detector (200), the script recognition network (204), and the attention mechanism-based multilingual text recognition network (206) using scene text images or cropped images, wherein the scene text images and the cropped images both contain texts in multiple languages and are each divided into a training set, a validation set, and a test set, wherein the texts in the training set and the validation set are labeled, and wherein the scene text images are used for the training, validation, and testing of the text detector (200), while the cropped images are used for the training, validation, and testing of the script recognition network (204) and the attention mechanism-based multilingual text recognition network (206).
- A computer device (40) for multilingual text detection and recognition, comprising: a processor (41); and a memory (42), the memory (42) including instructions executable by the processor (41), the instructions, when executed by the processor (41), causing the processor (41) to: perform feature extraction on an input image and generate a series of text candidate boxes; on the basis of maintaining the original aspect ratio of the text area corresponding to each text candidate box, crop the text areas of all text candidate boxes and then normalize and adjust them to a uniform height K; and recognize the text in the normalized and adjusted text areas.
- The computer device according to claim 10, wherein the instructions, when executed by the processor (41), further cause the processor (41) to: recognize the category of the text in the normalized and adjusted text area to determine whether the corresponding text is a symbol or a specific language type; and/or recognize the content of the text in the normalized and adjusted text area.
- The computer device according to claim 11, wherein the instructions, when executed by the processor (41), further cause the processor (41) to: generate the series of text candidate boxes through a text detector (200), wherein the text detector (200) is formed by stacking 4 inception modules (305, 308, 313, 314) designed for text, 3 channel-wise attention and spatial attention modules (306, 309, 311), and 7 convolutional layers (301-304, 307, 310, 312); wherein the channel-wise attention sub-module in each channel-wise attention and spatial attention module (306, 309, 311) outputs the importance level of each channel of the feature map, and the spatial attention sub-module outputs the attention weight of each pixel in the feature map.
- The computer device according to any one of claims 10-12, wherein the instructions, when executed by the processor (41), further cause the processor (41) to: output, for each pixel of the feature map, P oriented text candidate boxes; and process the P oriented text candidate boxes using non-maximum suppression to obtain M oriented text candidate boxes.
- The computer device according to any one of claims 10-12, wherein the instructions, when executed by the processor (41), further cause the processor (41) to: on the basis of maintaining the original aspect ratio of the text area corresponding to each text candidate box, normalize and adjust (S104-1) the text areas corresponding to all text candidate boxes to a uniform height K according to the following formulas: H' = K; W' = wH'/h, where W' and H' respectively represent the width and height of the corresponding text area after the normalization adjustment, and w and h respectively represent the original width and height of the corresponding text area.
- The computer device according to claim 12, wherein the instructions, when executed by the processor (41), further cause the processor (41) to: recognize the category of the text in the normalized and adjusted text area through a script recognition network (204), wherein the script recognition network (204) comprises a plurality of alternately arranged convolutional layers and max-pooling layers, a global average pooling layer after the last max-pooling layer, and a fully connected layer after the global average pooling layer; wherein the fully connected layer has a plurality of neurons, the softmax output of each neuron represents the probability that the text in each text candidate box belongs to a certain language type or is a symbol, and the highest probability gives the category of the text in the corresponding text candidate box.
- The computer device according to claim 15, wherein the instructions, when executed by the processor (41), further cause the processor (41) to: recognize the content of the text in the normalized and adjusted text area through an attention mechanism-based multilingual text recognition network (206), wherein the attention mechanism-based multilingual text recognition network (206) uses a CNN as an encoder and then uses a CTC decoder to generate character sequences; and wherein the attention mechanism-based multilingual text recognition network (206) uses channel-wise attention and spatial attention modules to make the CTC decoder pay more attention to text candidate boxes containing text.
- The computer device according to claim 16, wherein: the text detector (200) is trained using the Adam optimizer, with the loss function defined as L_det = L_geo + L_dice, where L_dice is the dice loss and L_geo is the sum of the IoU loss L_IoU between the text candidate box and the ground truth and the angle loss L_θ: L_geo = L_IoU + λ_θ·L_θ, λ_θ being a set coefficient; the script recognition network (204) is optimized using a stochastic gradient descent algorithm; and the attention mechanism-based multilingual text recognition network (206) is trained using the Adam optimizer.
- The computer device according to claim 17, wherein the instructions, when executed by the processor (41), further cause the processor (41) to: perform the training, validation, and testing of the text detector (200), the script recognition network (204), and the attention mechanism-based multilingual text recognition network (206) using scene text images or cropped images, wherein the scene text images and the cropped images both contain texts in multiple languages and are each divided into a training set, a validation set, and a test set, wherein the texts in the training set and the validation set are labeled, and wherein the scene text images are used for the training, validation, and testing of the text detector (200), while the cropped images are used for the training, validation, and testing of the script recognition network (204) and the attention mechanism-based multilingual text recognition network (206).
- A computer-readable storage medium having a computer program stored thereon which, when executed by a processor, causes the processor to perform the method according to any one of claims 1-9.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910232853.0A CN109948615B (en) | 2019-03-26 | 2019-03-26 | Multi-language text detection and recognition system |
CN201910232853.0 | 2019-03-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020192433A1 true WO2020192433A1 (en) | 2020-10-01 |
Family
ID=67010832
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/078928 WO2020192433A1 (en) | 2019-03-26 | 2020-03-12 | Multi-language text detection and recognition method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109948615B (en) |
WO (1) | WO2020192433A1 (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109948615B (en) * | 2019-03-26 | 2021-01-26 | 中国科学技术大学 | Multi-language text detection and recognition system |
CN110942067A (en) * | 2019-11-29 | 2020-03-31 | 上海眼控科技股份有限公司 | Text recognition method and device, computer equipment and storage medium |
CN111126243B (en) * | 2019-12-19 | 2023-04-07 | 北京科技大学 | Image data detection method and device and computer readable storage medium |
CN111259764A (en) * | 2020-01-10 | 2020-06-09 | 中国科学技术大学 | Text detection method and device, electronic equipment and storage device |
CN111507406A (en) * | 2020-04-17 | 2020-08-07 | 上海眼控科技股份有限公司 | Method and equipment for optimizing neural network text recognition model |
CN111914843B (en) * | 2020-08-20 | 2021-04-16 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Character detection method, system, equipment and storage medium |
CN114170594A (en) * | 2021-12-07 | 2022-03-11 | 奇安信科技集团股份有限公司 | Optical character recognition method, device, electronic equipment and storage medium |
CN118378707A (en) * | 2024-06-21 | 2024-07-23 | 中国科学技术大学 | Dynamic evolution multi-mode value generation method based on value system guidance |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106570497A (en) * | 2016-10-08 | 2017-04-19 | 中国科学院深圳先进技术研究院 | Text detection method and device for scene image |
CN108491836B (en) * | 2018-01-25 | 2020-11-24 | 华南理工大学 | Method for integrally identifying Chinese text in natural scene image |
CN109359293B (en) * | 2018-09-13 | 2019-09-10 | 内蒙古大学 | Mongolian name entity recognition method neural network based and its identifying system |
CN109492679A (en) * | 2018-10-24 | 2019-03-19 | 杭州电子科技大学 | Based on attention mechanism and the character recognition method for being coupled chronological classification loss |
2019
- 2019-03-26 CN CN201910232853.0A patent/CN109948615B/en active Active
2020
- 2020-03-12 WO PCT/CN2020/078928 patent/WO2020192433A1/en active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220641A (en) * | 2016-03-22 | 2017-09-29 | 华南理工大学 | A kind of multi-language text sorting technique based on deep learning |
US20180137349A1 (en) * | 2016-11-14 | 2018-05-17 | Kodak Alaris Inc. | System and method of character recognition using fully convolutional neural networks |
CN108470172A (en) * | 2017-02-23 | 2018-08-31 | 阿里巴巴集团控股有限公司 | A kind of text information identification method and device |
CN106980858A (en) * | 2017-02-28 | 2017-07-25 | 中国科学院信息工程研究所 | The language text detection of a kind of language text detection with alignment system and the application system and localization method |
CN109948615A (en) * | 2019-03-26 | 2019-06-28 | 中国科学技术大学 | Multi-language text detects identifying system |
Non-Patent Citations (1)
Title |
---|
CHEN, XIAOLONG ET AL.: "Electricity Equipment Nameplate Recognition Based on Deep Learning", JOURNAL OF GUANGXI UNIVERSITY (NATURAL SCIENCE EDITION), vol. 43, no. 6, 31 December 2018 (2018-12-31), DOI: 20200529221058X * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112613348A (en) * | 2020-12-01 | 2021-04-06 | 浙江华睿科技有限公司 | Character recognition method and electronic equipment |
CN113159021A (en) * | 2021-03-10 | 2021-07-23 | 国网河北省电力有限公司 | Text detection method based on context information |
CN113095370A (en) * | 2021-03-18 | 2021-07-09 | 北京达佳互联信息技术有限公司 | Image recognition method and device, electronic equipment and storage medium |
CN113095370B (en) * | 2021-03-18 | 2023-11-03 | 北京达佳互联信息技术有限公司 | Image recognition method, device, electronic equipment and storage medium |
CN113255646A (en) * | 2021-06-02 | 2021-08-13 | 北京理工大学 | Real-time scene text detection method |
CN113255646B (en) * | 2021-06-02 | 2022-10-18 | 北京理工大学 | Real-time scene text detection method |
CN113537189A (en) * | 2021-06-03 | 2021-10-22 | 深圳市雄帝科技股份有限公司 | Handwritten character recognition method, device, equipment and storage medium |
CN114743045A (en) * | 2022-03-31 | 2022-07-12 | 电子科技大学 | Small sample target detection method based on double-branch area suggestion network |
CN114743045B (en) * | 2022-03-31 | 2023-09-26 | 电子科技大学 | Small sample target detection method based on double-branch area suggestion network |
CN115936073A (en) * | 2023-02-16 | 2023-04-07 | 江西省科学院能源研究所 | Language-oriented convolutional neural network and visual question-answering method |
Also Published As
Publication number | Publication date |
---|---|
CN109948615A (en) | 2019-06-28 |
CN109948615B (en) | 2021-01-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020192433A1 (en) | Multi-language text detection and recognition method and device | |
US10558893B2 (en) | Systems and methods for recognizing characters in digitized documents | |
WO2020221298A1 (en) | Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus | |
CN111488826B (en) | Text recognition method and device, electronic equipment and storage medium | |
US11507800B2 (en) | Semantic class localization digital environment | |
KR102275413B1 (en) | Detecting and extracting image document components to create flow document | |
WO2019192397A1 (en) | End-to-end recognition method for scene text in any shape | |
US20180114071A1 (en) | Method for analysing media content | |
CN108427950B (en) | Character line detection method and device | |
CN111507335A (en) | Method and device for automatically labeling training images for deep learning network | |
CN109934229B (en) | Image processing method, device, medium and computing equipment | |
WO2021081562A2 (en) | Multi-head text recognition model for multi-lingual optical character recognition | |
EP3910532B1 (en) | Learning method and learning device for training an object detection network by using attention maps and testing method and testing device using the same | |
CN109712164A (en) | Image intelligent cut-out method, system, equipment and storage medium | |
JP7198350B2 (en) | CHARACTER DETECTION DEVICE, CHARACTER DETECTION METHOD AND CHARACTER DETECTION SYSTEM | |
CN111291759A (en) | Character detection method and device, electronic equipment and storage medium | |
JP2021135993A (en) | Text recognition method, text recognition apparatus, electronic device, and storage medium | |
US20190294963A1 (en) | Signal processing device, signal processing method, and computer program product | |
US20220101065A1 (en) | Automatic document separation | |
WO2021237227A1 (en) | Method and system for multi-language text recognition model with autonomous language classification | |
CN112070040A (en) | Text line detection method for video subtitles | |
CN111178363B (en) | Character recognition method, character recognition device, electronic equipment and readable storage medium | |
CN114462490A (en) | Retrieval method, retrieval device, electronic device and storage medium of image object | |
CN114373092A (en) | Progressive training fine-grained vision classification method based on jigsaw arrangement learning | |
US20240062560A1 (en) | Unified scene text detection and layout analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20779406 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20779406 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.03.2022) |