WO2020192433A1 - Multi-language text detection and recognition method and device - Google Patents

Multi-language text detection and recognition method and device Download PDF

Info

Publication number
WO2020192433A1
WO2020192433A1 (PCT/CN2020/078928)
Authority
WO
WIPO (PCT)
Prior art keywords
text
attention
processor
recognition network
normalized
Prior art date
Application number
PCT/CN2020/078928
Other languages
French (fr)
Chinese (zh)
Inventor
张勇东
周宇
谢洪涛
李岩
Original Assignee
中国科学技术大学
北京中科研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学技术大学, 北京中科研究院
Publication of WO2020192433A1 publication Critical patent/WO2020192433A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/24Aligning, centring, orientation detection or correction of the image

Definitions

  • the present disclosure relates to the field of artificial intelligence, and in particular to methods and devices for multilingual text detection and recognition.
  • existing scene text recognition systems mainly target pre-cropped text and cannot simultaneously detect and recognize text in an image.
  • the few methods that can detect and recognize text at the same time handle only English text.
  • the purpose of the present disclosure is to provide a multilingual text detection and recognition method and device, which can simultaneously detect and recognize texts in multiple languages in a scene text image.
  • the purpose of the present disclosure is achieved by a multilingual text detection and recognition method.
  • the method includes:
  • the computer equipment includes:
  • a memory, where the memory stores instructions that can be executed by the processor; when the instructions are executed by the processor, the processor performs the steps of the method.
  • the technical solution provided according to the present disclosure can simultaneously detect and recognize texts in multiple languages, and has a higher accuracy rate than traditional text detection and multi-language recognition solutions.
  • Fig. 1 is a flowchart of a method for multilingual text detection and recognition according to an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a system that can be used to implement the method in FIG. 1 according to an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of the structure of a text detector provided according to an embodiment of the present disclosure.
  • Fig. 4 is a block diagram of a computer device for multilingual text detection and recognition according to an embodiment of the present disclosure.
  • Fig. 1 shows a method 10 for multilingual text detection and recognition according to an embodiment of the present disclosure.
  • the method 10 includes: at step S102, performing feature extraction on the input image and generating a series of text candidate boxes; at step S104, on the basis of maintaining the original aspect ratio of the text area corresponding to each text candidate box, cropping the text areas of all text candidate boxes and normalizing them to a uniform height; and at step S106, recognizing the text in the normalized text areas.
  • recognizing the text in the normalized text area may further include: at step S106-1, recognizing the category of the text in the normalized text area to determine whether the corresponding text is a symbol or a specific language type; and/or, at step S106-2, recognizing the content of the text in the normalized text area.
  • the above-mentioned method according to the embodiment of the present disclosure can be applied to machine translation.
  • texts in different languages can be recognized and then translated into the desired text.
  • the above method can also be used for autonomous driving.
  • road signs in different languages can be detected and recognized, so as to choose the correct direction to move forward.
  • Fig. 2 is a schematic diagram of a system that can be used to implement the method of Fig. 1 according to an embodiment of the present disclosure. In the following, the execution of each step in FIG. 1 will be described in further detail by way of examples in conjunction with FIG. 2.
  • Step S102 that is, performing feature extraction on the input image and generating a series of text candidate boxes can be performed by, for example, the text detector 200 shown in FIG. 2.
  • FIG. 3 is a schematic structural diagram of a text detector 200 according to an embodiment of the present disclosure.
  • the text detector 200 is formed by stacking four inception modules 305, 308, 313, 314 designed for text, three channel-wise attention and spatial attention (channel-wise attention & spatial attention) modules 306, 309, 311, and seven convolutional layers 301-304, 307, 310, 312; the channel-wise attention sub-module outputs the importance level of each channel of the feature map, while the spatial attention sub-module outputs an attention weight for each pixel of the feature map.
  • each inception module can use 1×5 and 5×1 convolution kernels; since text generally has a large aspect ratio, such kernels are better suited to text.
  • step S102 may include: for example, by means of the text detector 200 shown in FIG. 2, outputting, at step S102-1, P oriented text candidate boxes for each pixel of the feature map, and then, at step S102-2, processing these candidate boxes with non-maximum suppression to obtain M oriented text candidate boxes.
  • each image is resized to 256×256 and then input to the text detector 200.
  • the text detector 200 outputs 14 oriented text candidate boxes for each pixel in the feature map. Non-maximum suppression (NMS) is then used to process these candidate boxes, removing redundant proposals and speeding up the computation.
  • 3×3 means that a convolution kernel with width and height 3 is used in the convolution operation (1×1 has a similar meaning); the 7 convolution layers correspond to the 3×3 parts in Fig. 2.
  • 16 means that 16 convolution kernels are used in the convolution operation (1, 2, 4, 64, 256, 512 have similar meanings); /2 means that the resolution of the feature map is halved; upsample denotes the up-sampling operation, whose function is to increase the resolution of the feature map; f1 to f4 and f1,2, f1,2,3, f1,2,3,4 are the feature maps obtained at each stage; segmentation1 and segmentation2 denote the segmentation maps of the text region; box1 and box2 denote the predicted distances from each pixel of the feature map to the four sides (top, bottom, left, right) of the text candidate box; angle1 and angle2 denote the angle of the text: some text is not horizontal and may form an angle with the horizontal direction.
  • the workflow of the text detector 200 is briefly as follows: an input image is fed into the network and passes in turn through the first four convolutional layers 301-304, inception1 305, the first channel-wise attention and spatial attention module 306 (referred to as an attention module for short), the fifth convolutional layer (3x3, 128, /2) 307, inception2 308, the second channel-wise attention and spatial attention module 309, the sixth convolutional layer (3x3, 256, /2) 310, the third channel-wise attention and spatial attention module 311, and the seventh convolutional layer (3x3, 512, /2) 312.
  • the feature map f1 is up-sampled and then added to f2 for feature fusion to obtain the feature map f1,2.
  • the feature map f1,2 is up-sampled (for example, up-sampled to 32x32) and then added to the feature map f3 for feature fusion, thereby obtaining the feature map f1,2,3.
  • the feature map f1,2,3 passes through inception3 313 and, after up-sampling (for example, to 64x64), is added to the feature map f4 for feature fusion, thereby obtaining the feature map f1,2,3,4.
  • the feature map f1,2,3,4 undergoes feature extraction by inception4 314.
  • the feature maps f1,2,3 output by inception3 and f1,2,3,4 output by inception4 are respectively used to predict the text candidate boxes (that is, to generate the text candidate boxes).
  • step S104 may be performed by, for example, the normalization unit 202 shown in FIG. 2.
  • the normalization unit 202 trims the text regions of all text candidate boxes and then adjusts them to a uniform height K on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box.
  • This normalization method maintains the aspect ratio of the corresponding text area, avoids the deformation of the text area, and provides a guarantee for the subsequent text recognition and text language category recognition.
  • step S104 may include: at step S104-1, normalizing the text regions according to the following formulas:
  • H' = K
  • W' = wH'/h
  • where W' and H' respectively denote the width and height of the normalized text area, and w and h respectively denote the original width and height of the text area.
  • K can be 64, though it can be changed to other values as needed.
  • step S106-1 may be performed by, for example, the script recognition network 204.
  • the script recognition network 204 can be implemented by a convolutional neural network (CNN).
  • CNN convolutional neural network
  • Table 1 shows the structure of the script recognition network 204, which mainly includes: a plurality of alternately arranged convolutional layers (conv) and max-pooling layers (max-pooling), a global average pooling layer (global-avg-pool) following the last max-pooling layer, and a fully-connected layer following the global average pooling layer; the fully-connected layer has a plurality of (for example, 7) neurons, the softmax output of each neuron representing the probability that the text in each text candidate box belongs to a certain language type or symbol, and the category with the highest probability is taken as the category of the text in that candidate box.
  • the global average pooling layer outputs a feature map with a size of 1 ⁇ 512.
  • the fully connected layer can contain 7 neurons.
  • the softmax outputs of these 7 neurons are 7 probabilities, representing the probability that the text in each text area is Arabic, Bengali, Chinese, Korean, Japanese, Latin, or a symbol.
  • the highest probability is the category of the text in the text area.
  • step S106-2 in FIG. 1 may be performed by, for example, the attention mechanism-based multilingual text recognition network 206 shown in FIG. 2.
  • the attention mechanism-based multilingual text recognition network 206 uses CNN as an encoder, and then uses a CTC decoder to generate character sequences.
  • the attention mechanism-based multilingual text recognition network 206 uses the channel-wise attention and spatial attention cascade to make the CTC decoder pay more attention to the place where the text exists, thereby improving the accuracy of text recognition.
  • the structure of the encoder in the attention mechanism-based multilingual text recognition network 206 is shown in Table 2.
  • Table 2: Structure of the encoder in the attention mechanism-based multilingual text recognition network
  • the method 10 provided by the embodiment of the present disclosure may optionally further include step S100.
  • at step S100, the text detector 200, the script recognition network 204, and the attention mechanism-based multilingual text recognition network 206 are trained, validated, and tested using scene text images or cropped images. More specifically, the following data sets are constructed in advance: scene text images and cropped images. Both types of images contain text in multiple languages and are divided into a training set, a validation set, and a test set; the text in the training and validation sets is annotated.
  • the scene text image is used for training, verification, and testing of the text detector 200; and the cropped image is used for the training, verification, and testing of the script recognition network 204 and the multilingual text recognition network 206 based on the attention mechanism.
  • a cropped image is an image containing text that has been cut in advance from an image containing both background and text, and is mainly used to train the attention mechanism-based multilingual text recognition network; a scene text image, by contrast, is a large image that, in addition to text, contains many background regions without text.
  • ICDAR MLT cropped images and scene text images can be downloaded from the Internet.
  • These images contain six types of characters, namely Arabic, Bengali, Chinese, Korean, Japanese and Latin.
  • the text detector can be trained using the Adam optimizer, the initial learning rate can be set to 0.001, and the loss function can be defined as L_det = L_geo + L_dice.
  • L_dice is the dice loss, a loss function used for semantic segmentation: for a given region, each pixel has value 1 if it is text and 0 otherwise; the closer the predicted text probability is to 1, the closer the dice loss is to 0, and vice versa. L_dice is the sum of the classification losses over all pixels.
  • L_geo is the sum of the IoU loss L_IoU between the text candidate box and the ground truth and the angle loss L_θ: L_geo = L_IoU + λ_θ·L_θ, where λ_θ is a set coefficient that can, for example, be set to 1.
  • Adam is a first-order optimization algorithm that can replace the traditional stochastic gradient descent process. It can iteratively update neural network weights based on training data.
  • the script recognition network can be optimized using a stochastic gradient descent algorithm, with the following parameters: momentum 0.9, initial learning rate 0.001, and the learning rate reduced to one tenth every 5 epochs.
  • the above-mentioned solutions of the embodiments of the present disclosure are completely based on convolutional neural networks, and can simultaneously detect and recognize texts in multiple languages in one model.
  • the precision, recall, and F-Measure of this solution for localization and language type recognition on the multilingual ICDAR RRC-MLT test set are 0.6968, 0.6425, and 0.6687 respectively, while the best results of existing methods are 0.5759, 0.6207, and 0.5974. Compared with existing methods, our method is thus greatly improved.
  • the precision, recall, and F-Measure of this method on the end-to-end recognition task of the ICDAR RRC-MLT test set are 0.502, 0.424, and 0.460, respectively.
  • Fig. 4 is a block diagram of a computer device 40 for multilingual text detection and recognition according to an embodiment of the present disclosure.
  • the computer device 40 includes a processor 41 and a memory 42.
  • the memory 42 stores instructions executable by the processor 41.
  • the processor 41 is caused to execute a method including the following steps: performing feature extraction on the input image and generating a series of text candidate boxes; on the basis of maintaining the original aspect ratio of the text area corresponding to each text candidate box, cropping the text areas of all text candidate boxes and then normalizing them to a uniform height K; and recognizing the text in the normalized text areas.
  • recognizing the text in the normalized text area includes: recognizing the type of the text in the normalized text area to determine whether the corresponding text is a symbol or a specific language type; and/or recognizing the content of the text in the normalized text area.
  • when the instructions are executed by the processor 41, the processor 41 can realize the function of one or more of the text detector 200, the normalization unit 202, the script recognition network 204, and the attention mechanism-based multilingual text recognition network 206.
  • when the instructions are executed by the processor 41, the processor 41 can implement any step of the method shown in FIG. 1.
  • the software product can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.).
  • the non-volatile storage medium includes a number of instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute the methods described in the various embodiments of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are a multi-language text detection and recognition method and a computer device therefor. The method comprises: performing feature extraction on an input image and generating a series of candidate text boxes; on the basis of keeping the original aspect ratio of the text region corresponding to each candidate text box, performing normalized adjustment on the text regions of all the candidate text boxes so that they are of a uniform height; and recognizing the text in the normalized text regions. In some embodiments, recognizing the text in the normalized text regions comprises: recognizing the type of the text in the normalized text regions to determine whether the corresponding text is a symbol or a specific language type; and/or recognizing the content of the text in the normalized text regions. With the method, text in multiple languages in a scene text image can be simultaneously detected and recognized.

Description

Multilingual text detection and recognition method and device
Cross-reference to related applications
This application claims priority to Chinese patent application 201910232853.0, filed on March 26, 2019, the disclosure of which is incorporated herein by reference in its entirety.
Technical field
The present disclosure relates to the field of artificial intelligence, and in particular to a multilingual text detection and recognition method and device.
Background
Existing scene text recognition systems mainly target pre-cropped text and cannot simultaneously detect and recognize text in an image. The few methods that can detect and recognize text at the same time handle only English text, whereas in real life one often has to process text in multiple languages within the same scene. There is therefore an urgent need for an end-to-end multilingual scene text recognition system, which would bring great convenience to image retrieval, machine translation, autonomous driving, and the like.
Summary of the invention
The purpose of the present disclosure is to provide a multilingual text detection and recognition method and device, which can simultaneously detect and recognize text in multiple languages in a scene text image.
In one aspect, the purpose of the present disclosure is achieved by a multilingual text detection and recognition method. The method includes:
performing feature extraction on an input image and generating a series of text candidate boxes;
on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, cropping the text regions of all text candidate boxes and then normalizing them to a uniform height K; and
recognizing the text in the normalized text regions.
In another aspect, the purpose of the present disclosure is achieved by a computer device for multilingual text detection and recognition. The computer device includes:
a processor; and
a memory storing instructions executable by the processor, the instructions, when executed by the processor, causing the processor to:
perform feature extraction on an input image and generate a series of text candidate boxes;
on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, crop the text regions of all text candidate boxes and then normalize them to a uniform height K; and
recognize the text in the normalized text regions.
The technical solution provided by the present disclosure can simultaneously detect and recognize text in multiple languages, and achieves higher accuracy than traditional text detection and multilingual recognition solutions.
Description of the drawings
To describe the technical solutions of the embodiments of the present disclosure more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present disclosure; those of ordinary skill in the art can obtain other embodiments based on these without creative effort, and such other embodiments fall within the protection scope of the present disclosure.
Fig. 1 is a flowchart of a method for multilingual text detection and recognition according to an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of a system that can be used to implement the method of Fig. 1 according to an embodiment of the present disclosure;
Fig. 3 is a schematic diagram of the structure of a text detector according to an embodiment of the present disclosure;
Fig. 4 is a block diagram of a computer device for multilingual text detection and recognition according to an embodiment of the present disclosure.
Detailed description
The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
Fig. 1 shows a method 10 for multilingual text detection and recognition according to an embodiment of the present disclosure. As shown in Fig. 1, the method 10 includes: at step S102, performing feature extraction on an input image and generating a series of text candidate boxes; at step S104, on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, cropping the text regions of all text candidate boxes and normalizing them to a uniform height; and at step S106, recognizing the text in the normalized text regions. In some embodiments, recognizing the text in the normalized text regions may further include: at step S106-1, recognizing the category of the text in the normalized text regions to determine whether the corresponding text is a symbol or a specific language type; and/or, at step S106-2, recognizing the content of the text in the normalized text regions.
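As an illustrative sketch, the three stages S102, S104, and S106 can be wired together as follows. The component interfaces are hypothetical stand-ins, since the publication defines no code-level API; Python is used for illustration only.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float, float]  # assumed layout: cx, cy, w, h, angle

def detect_and_recognize(
    image: object,
    detector: Callable[[object], List[Box]],   # S102: oriented text candidate boxes
    crop: Callable[[object, Box], object],     # cuts one text region out of the image
    normalize: Callable[[object], object],     # S104: rescale to uniform height K
    script_net: Callable[[object], str],       # S106-1: language/symbol category
    recognizer: Callable[[object], str],       # S106-2: text content
) -> List[Tuple[str, str]]:
    results = []
    for box in detector(image):
        region = normalize(crop(image, box))
        results.append((script_net(region), recognizer(region)))
    return results
```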
The above method according to the embodiment of the present disclosure can be applied to machine translation: by using the method in the back end of translation software, text in different languages can be recognized and then translated into the desired language. The method can also be used for autonomous driving: by using it on a driverless car, road signs in different languages can be detected and recognized so that the correct direction of travel can be chosen.
Fig. 2 is a schematic diagram of a system that can be used to implement the method of Fig. 1 according to an embodiment of the present disclosure. In the following, the execution of each step in Fig. 1 is described in further detail by way of example with reference to Fig. 2.
Step S102, that is, performing feature extraction on the input image and generating a series of text candidate boxes, can be performed by, for example, the text detector 200 shown in Fig. 2.
Fig. 3 is a schematic structural diagram of the text detector 200 according to an embodiment of the present disclosure. As shown in Fig. 3, the text detector 200 is formed by stacking four inception modules 305, 308, 313, 314 designed for text, three channel-wise attention and spatial attention (channel-wise attention & spatial attention) modules 306, 309, 311, and seven convolutional layers 301-304, 307, 310, 312. The channel-wise attention sub-module of each channel-wise attention and spatial attention module operates on the channels of the feature map: it outputs the importance level of each channel, telling the network which channels carry more important information. The spatial attention sub-module operates on the pixels of the feature map: it outputs an attention weight for each pixel, telling the network which parts of the feature map deserve more attention. In the embodiments of the present disclosure, each inception module can use 1×5 and 5×1 convolution kernels; since text generally has a large aspect ratio, such kernels are better suited to text.
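To make the two building blocks concrete, a minimal PyTorch sketch of an inception-style block with the stated 1×5 and 5×1 kernels follows. The branch layout and widths are assumptions; the publication specifies only the kernel shapes.

```python
import torch
import torch.nn as nn

class TextInception(nn.Module):
    """Inception-style block using 1x5 and 5x1 kernels suited to wide text.

    Sketch only: branch widths and layout are assumed (out_ch must be
    divisible by 4); the publication specifies only the kernel shapes.
    """
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        b = out_ch // 4
        self.branch1 = nn.Conv2d(in_ch, b, kernel_size=1)
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, b, kernel_size=1),
            nn.Conv2d(b, b, kernel_size=(1, 5), padding=(0, 2)),  # wide receptive field
        )
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, b, kernel_size=1),
            nn.Conv2d(b, b, kernel_size=(5, 1), padding=(2, 0)),  # tall receptive field
        )
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, b, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)
```

A channel-wise attention and spatial attention module in the spirit described above could pair a squeeze-and-excitation style channel gate with a per-pixel spatial gate; again, the internal layout is an assumption, not the patent's exact design.

```python
class ChannelSpatialAttention(nn.Module):
    """Cascade of channel-wise attention (per-channel importance) and
    spatial attention (per-pixel weight). Internal layout is assumed."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel = nn.Sequential(                 # which channels matter
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(                 # where to look
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel(x)
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        return x * self.spatial(pooled)
```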
Step S102 may include: for example, by means of the text detector 200 shown in Fig. 2, outputting, at step S102-1, P oriented text candidate boxes for each pixel of the feature map, and then, at step S102-2, processing these candidate boxes with non-maximum suppression to obtain M oriented text candidate boxes.
Illustratively, each image is resized to 256×256 before being input to the text detector 200. The text detector 200 outputs 14 oriented text candidate boxes for each pixel of the feature map. Non-maximum suppression (NMS) is then used to process these candidate boxes, removing redundant proposals and speeding up the computation.
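A greedy NMS sketch is shown below. The detector here emits oriented boxes, so a production system would use a rotated-box IoU; axis-aligned boxes (x1, y1, x2, y2) are used to keep the sketch short.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    """Greedy non-maximum suppression over axis-aligned boxes (x1, y1, x2, y2)."""
    order = scores.argsort()[::-1]   # highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the top box with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-9)
        order = order[1:][iou <= iou_thresh]  # drop redundant overlapping boxes
    return keep
```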
In Fig. 3, 3×3 means that a convolution kernel with width and height 3 is used in the convolution operation (1×1 has a similar meaning); the 7 convolutional layers correspond to the 3×3 parts in Fig. 2. The number 16 means that 16 convolution kernels are used in the convolution operation (1, 2, 4, 64, 256, 512 have similar meanings); /2 means that the resolution of the feature map is halved; upsample denotes the up-sampling operation, whose function is to increase the resolution of the feature map; f1 to f4 and f1,2, f1,2,3, f1,2,3,4 are the feature maps obtained at each stage; segmentation1 and segmentation2 denote the segmentation maps of the text regions; box1 and box2 denote the predicted distances from each pixel of the feature map to the four sides (top, bottom, left, right) of the text candidate box; angle1 and angle2 denote the angle of the text: some text is not horizontal and may form an angle with the horizontal direction.
As shown in Fig. 3, the workflow of the text detector 200 is briefly as follows. An input image is fed into the network and passes in turn through the first four convolutional layers 301-304, inception1 305, the first channel-wise attention and spatial attention module 306 (referred to as an attention module for short), the fifth convolutional layer (3x3, 128, /2) 307, inception2 308, the second channel-wise attention and spatial attention module 309, the sixth convolutional layer (3x3, 256, /2) 310, the third channel-wise attention and spatial attention module 311, and the seventh convolutional layer (3x3, 512, /2) 312. The feature map f1, with resolution 8x8, is output from the seventh convolutional layer 312; the feature map f2 is output from the third channel-wise attention and spatial attention module 311; the feature map f3 is output from the second channel-wise attention and spatial attention module 309; and the feature map f4 is output from the first channel-wise attention and spatial attention module 306. The feature map f1 is up-sampled and added to f2 for feature fusion, yielding the feature map f1,2. The feature map f1,2 is up-sampled (for example, to 32x32) and added to the feature map f3, yielding the feature map f1,2,3. The feature map f1,2,3 passes through inception3 313 and, after up-sampling (for example, to 64x64), is added to the feature map f4, yielding the feature map f1,2,3,4. The feature map f1,2,3,4 undergoes feature extraction by inception4 314. In this process, text candidate box prediction (that is, text candidate box generation) is performed on the feature maps f1,2,3 output by inception3 and f1,2,3,4 output by inception4, respectively.
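The repeated "up-sample then add" fusion (f1 with f2, f1,2 with f3, and so on) can be sketched as a single helper. Bilinear interpolation is an assumption, as the publication does not state the interpolation mode, and the channel counts of the two maps are assumed to match.

```python
import torch
import torch.nn.functional as F

def fuse(deeper: torch.Tensor, shallower: torch.Tensor) -> torch.Tensor:
    """Upsample the deeper (lower-resolution) map to the shallower map's size
    and add them element-wise, as in f1 + f2 -> f1,2 and f1,2 + f3 -> f1,2,3."""
    up = F.interpolate(deeper, size=shallower.shape[-2:], mode="bilinear",
                       align_corners=False)
    return up + shallower

# e.g. f12 = fuse(f1, f2); f123 = fuse(f12, f3)  # then inception3, fuse with f4
```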
In the embodiment of the present disclosure, step S104 can be performed by, for example, the normalization unit 202 shown in Fig. 2. On the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, the normalization unit 202 crops the text regions of all text candidate boxes and then adjusts them to a uniform height K. This normalization preserves the aspect ratio of each text region, avoids deforming the text region, and thus provides a sound basis for the subsequent text recognition and text language category recognition.
In the embodiment of the present disclosure, step S104 may include: at step S104-1, normalizing the text regions according to the following formulas:
H' = K
W' = wH'/h
where W' and H' respectively denote the width and height of the normalized text region, and w and h respectively denote the original width and height of the text region.
Illustratively, K can be 64, though it can be changed to other values as needed.
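The formulas H' = K and W' = wH'/h amount to a height-preserving resize. A minimal sketch using Pillow (an assumed image library, not named in the publication) follows.

```python
from PIL import Image

def normalize_height(region: Image.Image, K: int = 64) -> Image.Image:
    """Rescale a cropped text region to height K while keeping its aspect
    ratio: H' = K, W' = w * H' / h (the formulas from the disclosure)."""
    w, h = region.size
    new_w = max(1, round(w * K / h))
    return region.resize((new_w, K), Image.BILINEAR)
```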
In the embodiment of the present disclosure, step S106-1 can be performed by, for example, the script recognition network 204. The script recognition network 204 can be implemented by a convolutional neural network (CNN). Table 1 below shows the structure of the script recognition network 204, which mainly includes: a plurality of alternately arranged convolutional layers (conv) and max-pooling layers (max-pooling), a global average pooling layer (global-avg-pool) following the last max-pooling layer, and a fully-connected layer (fully-connect) following the global average pooling layer. The fully-connected layer has a plurality of (for example, 7) neurons; the softmax output of each neuron represents the probability that the text in a text candidate box belongs to a certain language type or symbol, and the category with the highest probability is taken as the category of the text in that candidate box.
[Table 1 is presented as an image (PCTCN2020078928-appb-000001) in the original publication and is not reproduced here.]
Table 1: Network structure of the script recognition network
Illustratively, the global average pooling layer outputs a feature map of size 1×512. The fully-connected layer can contain 7 neurons, whose softmax outputs are 7 probabilities representing, respectively, the probability that the text in each text region is Arabic, Bengali, Chinese, Korean, Japanese, Latin, or a symbol. The category with the highest probability is the category of the text in the text region.
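A sketch of the classification head just described (global average pooling over an assumed 512-channel feature map, then a 7-way softmax) could look as follows; the convolutional trunk from Table 1 is omitted.

```python
import torch
import torch.nn as nn

SCRIPTS = ["Arabic", "Bengali", "Chinese", "Korean", "Japanese", "Latin", "Symbols"]

class ScriptHead(nn.Module):
    """Head of the script recognition network: global average pooling over
    the final feature map, then a fully-connected layer with 7 neurons."""
    def __init__(self, channels: int = 512, num_classes: int = 7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling -> 1x512
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.pool(feat).flatten(1)
        return self.fc(x).softmax(dim=-1)     # per-script probabilities

# probs = ScriptHead()(features)                     # features: (N, 512, H, W)
# category = SCRIPTS[int(probs.argmax(dim=-1)[0])]   # highest probability wins
```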
In the embodiment of the present disclosure, step S106-2 in Fig. 1 can be performed by, for example, the attention mechanism-based multilingual text recognition network 206 shown in Fig. 2. The attention mechanism-based multilingual text recognition network 206 uses a CNN as an encoder and then uses a CTC decoder to generate character sequences. It uses a cascade of channel-wise attention and spatial attention to make the CTC decoder focus more on the places where text exists, thereby improving the accuracy of text recognition. The structure of the encoder in the attention mechanism-based multilingual text recognition network 206 is shown in Table 2.
[Table 2 is presented as an image (PCTCN2020078928-appb-000002) in the original publication and is not reproduced here.]
Table 2: Structure of the encoder in the attention mechanism-based multilingual text recognition network
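The CTC decoder mentioned above ultimately collapses per-frame predictions into a character sequence. A minimal greedy decoding sketch is shown below; a full system might use beam search instead, and the blank index 0 is an assumption.

```python
import torch

def ctc_greedy_decode(logits: torch.Tensor, blank: int = 0) -> list:
    """Greedy CTC decoding: take the argmax at each time step, collapse
    consecutive repeats, then drop blanks. `logits`: (T, num_classes)."""
    best = logits.argmax(dim=-1).tolist()
    out, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:
            out.append(idx)
        prev = idx
    return out  # indices into the character vocabulary
```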
On the other hand, the method 10 provided by the embodiment of the present disclosure may optionally further include step S100. At step S100, training, validation, and testing of the text detector 200, the script recognition network 204, and the attention mechanism-based multilingual text recognition network 206 are performed using scene text images or cropped images. More specifically, the following data sets are constructed in advance: scene text images and cropped images. Both types of images contain text in multiple languages and are each divided into a training set, a validation set, and a test set, with the text in the training and validation sets annotated. The scene text images are used for the training, validation, and testing of the text detector 200, while the cropped images are used for the training, validation, and testing of the script recognition network 204 and the attention mechanism-based multilingual text recognition network 206.
Those skilled in the art will understand that a cropped image is an image containing text that has been cut in advance from an image containing both background and text, and is mainly used to train the attention mechanism-based multilingual text recognition network; a scene text image, by contrast, is a large image that, in addition to text, contains many background regions without text.
Illustratively, ICDAR MLT cropped images and scene text images can be downloaded from the Internet, with 68,613 cropped images used for training, 16,255 for validation, and 97,619 for testing, and 7,200 scene text images used for training, 1,800 for validation, and 9,000 for testing. These images contain six scripts: Arabic, Bengali, Chinese, Korean, Japanese, and Latin.
In the embodiment of the present disclosure, the text detector can be trained using the Adam optimizer, the initial learning rate can be set to 0.001, and the loss function can be defined as:
L_det = L_geo + L_dice
where L_dice is the dice loss, a loss function used for semantic segmentation: for a given region, each pixel has value 1 if it is text and 0 otherwise; the closer the predicted text probability is to 1, the closer the dice loss is to 0, and vice versa. L_dice is the sum of the classification losses over all pixels. L_geo is the sum of the IoU (intersection-over-union) loss L_IoU between the text candidate box and the ground truth (the text annotation) and the angle loss L_θ, that is, L_geo = L_IoU + λ_θ·L_θ, where λ_θ is a set coefficient which, illustratively, can be set to 1. Those skilled in the art will understand that Adam is a first-order optimization algorithm that can replace the traditional stochastic gradient descent procedure and iteratively updates neural network weights based on the training data.
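A minimal sketch of the dice loss as described (per-pixel text probabilities against a binary text mask) follows; the smoothing constant eps is an assumption added for numerical stability.

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor,
              eps: float = 1e-6) -> torch.Tensor:
    """Dice loss over a text/non-text segmentation map. `pred` holds per-pixel
    text probabilities; `target` is 1 for text pixels and 0 otherwise. The
    loss tends to 0 as predictions approach the ground truth, else toward 1."""
    inter = (pred * target).sum()
    union = pred.sum() + target.sum()
    return 1.0 - (2.0 * inter + eps) / (union + eps)
```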
In the embodiment of the present disclosure, the script recognition network can be optimized using a stochastic gradient descent algorithm, with the following parameters: momentum 0.9, initial learning rate 0.001, and the learning rate reduced to one tenth every 5 epochs.
In the embodiment of the present disclosure, the attention mechanism-based multilingual text recognition network can be trained using the Adam optimizer, with the following parameters: initial learning rate 0.001, β1 = 0.9, β2 = 0.99.
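The three training configurations just described map directly onto standard PyTorch optimizers; the models below are hypothetical stand-ins for the three networks.

```python
import torch

# Hypothetical placeholders for the three networks described above.
detector, script_net, recog_net = (torch.nn.Linear(8, 8) for _ in range(3))

# Text detector: Adam, initial learning rate 0.001.
det_opt = torch.optim.Adam(detector.parameters(), lr=0.001)

# Script recognition network: SGD with momentum 0.9, lr 0.001,
# decayed to one tenth every 5 epochs.
scr_opt = torch.optim.SGD(script_net.parameters(), lr=0.001, momentum=0.9)
scr_sched = torch.optim.lr_scheduler.StepLR(scr_opt, step_size=5, gamma=0.1)

# Attention-based recognition network: Adam with beta1=0.9, beta2=0.99.
rec_opt = torch.optim.Adam(recog_net.parameters(), lr=0.001, betas=(0.9, 0.99))
```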
The above solution of the embodiments of the present disclosure is entirely based on convolutional neural networks and can simultaneously detect and recognize text in multiple languages within a single model. In tests, the precision, recall, and F-Measure of this solution for localization and language type recognition on the multilingual ICDAR RRC-MLT test set are 0.6968, 0.6425, and 0.6687 respectively, while the best results of existing methods are 0.5759, 0.6207, and 0.5974. Compared with existing methods, our method is thus greatly improved. In addition, the precision, recall, and F-Measure of this method on the end-to-end recognition task of the ICDAR RRC-MLT test set are 0.502, 0.424, and 0.460, respectively.
Fig. 4 is a block diagram of a computer device 40 for multilingual text detection and recognition according to an embodiment of the present disclosure. As shown in Fig. 4, the computer device 40 includes a processor 41 and a memory 42. The memory 42 stores instructions executable by the processor 41. When the instructions are executed by the processor 41, the processor 41 performs a method including the following steps: performing feature extraction on an input image and generating a series of text candidate boxes; on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, cropping the text regions of all text candidate boxes and then normalizing them to a uniform height K; and recognizing the text in the normalized text regions. In some embodiments, recognizing the text in the normalized text regions includes: recognizing the type of the text in the normalized text regions to determine whether the corresponding text is a symbol or a specific language type; and/or recognizing the content of the text in the normalized text regions.
In the embodiment of the present disclosure, when the instructions are executed by the processor 41, the processor 41 can realize the functions of one or more of the text detector 200, the normalization unit 202, the script recognition network 204, and the attention mechanism-based multilingual text recognition network 206 shown in Fig. 2.
In the embodiment of the present disclosure, when the instructions are executed by the processor 41, the processor 41 can implement any step of the method shown in Fig. 1.
From the description of the above embodiments, those skilled in the art will clearly understand that the above embodiments can be implemented by software, or by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.). The non-volatile storage medium includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the various embodiments of the present disclosure.
The above are only preferred specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any changes or substitutions readily conceivable by those skilled in the art within the technical scope disclosed herein shall be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (19)

  1. A method for multilingual text detection and recognition, comprising:
    performing feature extraction on an input image and generating (S102) a series of text candidate boxes;
    on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, cropping the text regions of all text candidate boxes and then normalizing (S104) them to a uniform height K; and
    recognizing (S106) the text in the normalized text regions.
  2. The method according to claim 1, wherein recognizing (S106) the text in the normalized text regions comprises:
    recognizing (S106-1) the category of the text in the normalized text regions to determine whether the corresponding text is a symbol or a specific language type; and/or
    recognizing (S106-2) the content of the text in the normalized text regions.
  3. The method according to claim 2, wherein the series of text candidate boxes is generated by a text detector (200), the text detector (200) being formed by stacking four inception modules (305, 308, 313, 314) designed for text, three channel-wise attention and spatial attention modules (306, 309, 311), and seven convolutional layers (301-304, 307, 310, 312); wherein the channel-wise attention sub-module in the channel-wise attention and spatial attention modules (306, 309, 311) outputs the importance level of each channel of the feature map, and the spatial attention sub-module outputs an attention weight for each pixel of the feature map.
  4. The method according to any one of claims 1-3, wherein performing feature extraction on the input image and generating (S102) a series of text candidate boxes comprises:
    outputting (S102-1) P oriented text candidate boxes for each pixel of the feature map; and
    processing (S102-2) the P oriented text candidate boxes using non-maximum suppression to obtain M oriented text candidate boxes.
  5. The method according to any one of claims 1-3, wherein cropping the text regions of all text candidate boxes and then normalizing (S104) them to a uniform height K comprises:
    on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, normalizing (S104-1) the text regions corresponding to all text candidate boxes to a uniform height K according to the following formulas:
    H' = K
    W' = wH'/h
    where W' and H' respectively denote the width and height of the corresponding normalized text region, and w and h respectively denote the original width and height of the corresponding text region.
  6. The method according to claim 3, wherein the category of the text contained in the normalized text region is recognized by a script recognition network (204), the script recognition network (204) comprising a plurality of alternately arranged convolutional layers and max-pooling layers, a global average pooling layer following the last max-pooling layer, and a fully-connected layer following the global average pooling layer;
    wherein the fully-connected layer has a plurality of neurons, the softmax output of each neuron representing the probability that the text in each text candidate box belongs to a certain language type or symbol, the category with the highest probability being the category of the text in the corresponding text candidate box.
  7. The method according to claim 6, wherein the content of the text contained in the normalized text region is recognized by an attention mechanism-based multilingual text recognition network (206), wherein the attention mechanism-based multilingual text recognition network (206) uses a CNN as an encoder and then uses a CTC decoder to generate character sequences; and wherein the attention mechanism-based multilingual text recognition network uses a cascade of channel-wise attention and spatial attention to make the CTC decoder pay more attention to text candidate boxes containing text.
  8. The method according to claim 7, wherein:
    the text detector (200) is trained using the Adam optimizer, with the loss function defined as:
    L_det = L_geo + L_dice
    where L_dice is the dice loss; L_geo is the sum of the IoU loss L_IoU between the text candidate box and the ground truth and the angle loss L_θ: L_geo = L_IoU + λ_θ·L_θ, λ_θ being a set coefficient;
    the script recognition network (204) is optimized using a stochastic gradient descent algorithm; and
    the attention mechanism-based multilingual text recognition network (206) is trained using the Adam optimizer.
  9. The method according to claim 8, further comprising:
    performing (S100) training, validation, and testing of the text detector (200), the script recognition network (204), and the attention mechanism-based multilingual text recognition network (206) using scene text images or cropped images,
    wherein both the scene text images and the cropped images contain text of multiple language types and are each divided into a training set, a validation set, and a test set, the text in the training set and the validation set being annotated,
    and wherein the scene text images are used for the training, validation, and testing of the text detector (200), and the cropped images are used for the training, validation, and testing of the script recognition network (204) and the attention mechanism-based multilingual text recognition network (206).
  10. A computer device (40) for multilingual text detection and recognition, comprising:
    a processor (41); and
    a memory (42), the memory (42) storing instructions executable by the processor (41) which, when executed by the processor (41), cause the processor (41) to:
    perform feature extraction on an input image and generate a series of text candidate boxes;
    on the basis of maintaining the original aspect ratio of the text region corresponding to each text candidate box, crop the text regions of all text candidate boxes and then normalize them to a uniform height K; and
    recognize the text in the normalized text regions.
  11. The computer device according to claim 10, wherein the instructions, when executed by the processor (41), further cause the processor (41) to:
    recognize the category of the text in the normalized text regions to determine whether the corresponding text is a symbol or a specific language type; and/or
    recognize the content of the text in the normalized text regions.
  12. The computer device according to claim 11, wherein the instructions, when executed by the processor (41), further cause the processor (41) to:
    generate the series of text candidate boxes by means of a text detector (200), the text detector (200) being formed by stacking four inception modules (305, 308, 313, 314) designed for text, three channel-wise attention and spatial attention modules (306, 309, 311), and seven convolutional layers (301-304, 307, 310, 312); wherein the channel-wise attention sub-module in the channel-wise attention and spatial attention modules (306, 309, 311) outputs the importance level of each channel of the feature map, and the spatial attention sub-module outputs an attention weight for each pixel of the feature map.
13. The computer device according to any one of claims 10-12, wherein the instructions, when executed by the processor (41), further cause the processor (41) to:
    output P oriented text candidate boxes for each pixel of the feature map; and
    process the P oriented text candidate boxes using non-maximum suppression to obtain M oriented text candidate boxes.
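A sketch of non-maximum suppression; for brevity it scores axis-aligned boxes, whereas the claimed boxes are oriented and would require rotated-rectangle IoU. The overlap threshold is an assumption.

    import numpy as np

    def nms(boxes, scores, iou_thresh=0.5):
        """Axis-aligned NMS; boxes are (x1, y1, x2, y2) rows."""
        order = np.argsort(scores)[::-1]  # highest score first
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            rest = order[1:]
            # Intersection of the top box with the remaining boxes.
            x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
            y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
            x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
            y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
            inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
            iou = inter / (area_i + area_r - inter + 1e-9)
            order = rest[iou <= iou_thresh]  # drop boxes overlapping too much
        return keep

    boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
    print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # -> [0, 2]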
14. The computer device according to any one of claims 10-12, wherein the instructions, when executed by the processor (41), further cause the processor (41) to:
    normalize (S1041) the text regions corresponding to all text candidate boxes to a uniform height K according to the following formulas, while maintaining the original aspect ratio of the text region corresponding to each text candidate box:
    H' = K
    W' = wH'/h
    where W' and H' respectively denote the width and the height of the corresponding text region after normalization, and w and h respectively denote the original width and height of that text region.
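The normalization formulas above translate directly into a small helper; K = 32 is an illustrative default, not prescribed by the claim.

    def normalized_size(w, h, K=32):
        # H' = K; W' = w * H' / h  (aspect ratio preserved)
        H_new = K
        W_new = max(1, round(w * H_new / h))
        return W_new, H_new

    print(normalized_size(120, 40))  # -> (96, 32)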
15. The computer device according to claim 12, wherein the instructions, when executed by the processor (41), further cause the processor (41) to:
    recognize the category of the text in the normalized text regions by means of a script recognition network (204), wherein the script recognition network (204) comprises a plurality of alternately arranged convolutional layers and max-pooling layers, a global average pooling layer following the last max-pooling layer, and a fully connected layer following the global average pooling layer;
    wherein the fully connected layer has a plurality of neurons, the softmax output of each neuron respectively representing the probability that the text in a given text candidate box belongs to a particular language type or is a symbol, and the category with the highest probability is taken as the category of the text in the corresponding text candidate box.
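A minimal PyTorch sketch of a script recognition network of the recited shape (alternating conv/max-pool layers, global average pooling, fully connected layer with softmax); the depth, channel widths, and the example class count are assumptions.

    import torch
    import torch.nn as nn

    NUM_CLASSES = 5  # number of language types plus a symbol class; illustrative

    script_net = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                 # last max-pooling layer
        nn.AdaptiveAvgPool2d(1),         # global average pooling layer
        nn.Flatten(),
        nn.Linear(128, NUM_CLASSES),     # fully connected layer
    )

    logits = script_net(torch.randn(1, 3, 32, 128))
    probs = torch.softmax(logits, dim=1)  # per-class probabilities
    category = probs.argmax(dim=1)        # highest probability wins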
16. The computer device according to claim 15, wherein the instructions, when executed by the processor (41), further cause the processor (41) to:
    recognize the content of the text in the normalized text regions by means of an attention-mechanism-based multilingual text recognition network (206), wherein the attention-mechanism-based multilingual text recognition network (206) uses a CNN as an encoder and then uses a CTC decoder to generate character sequences; and wherein the attention-mechanism-based multilingual text recognition network (206) uses channel-wise attention and spatial attention modules so that the CTC decoder focuses more on text candidate boxes that actually contain text.
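A sketch of greedy CTC decoding as the final step of such a recognizer; the CNN encoder and attention modules are assumed to have already produced per-timestep class scores, and the alphabet is illustrative.

    import torch

    def ctc_greedy_decode(log_probs, alphabet, blank=0):
        """log_probs: (T, num_classes) per-timestep scores; class 0 is the blank."""
        best = log_probs.argmax(dim=1).tolist()
        chars, prev = [], blank
        for idx in best:
            # CTC rule: collapse consecutive repeats, then drop blanks.
            if idx != prev and idx != blank:
                chars.append(alphabet[idx - 1])  # shift past the blank class
            prev = idx
        return "".join(chars)

    alphabet = "abc"  # illustrative character set
    out = ctc_greedy_decode(torch.randn(10, len(alphabet) + 1), alphabet)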
17. The computer device according to claim 16, wherein:
    the text detector (200) is trained using the Adam optimizer, with the loss function defined as:
    L_det = L_geo + L_dice
    where L_dice is the dice loss, and L_geo is the sum of the IoU loss L_IoU between the text candidate boxes and the ground truth and the angle loss L_θ, i.e. L_geo = L_IoU + λ_θ·L_θ, λ_θ being a preset coefficient;
    the script recognition network (204) is optimized using a stochastic gradient descent algorithm; and
    the attention-mechanism-based multilingual text recognition network (206) is trained using the Adam optimizer.
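A sketch of composing the detection loss as defined in claim 17; the dice term here operates on per-pixel text score maps, and the value of λ_θ (10.0 below) is an assumption, as the claim only states it is a preset coefficient.

    import torch

    def dice_loss(pred, target, eps=1e-6):
        # pred/target: per-pixel text score maps in [0, 1].
        inter = (pred * target).sum()
        return 1 - 2 * inter / (pred.sum() + target.sum() + eps)

    def detection_loss(pred_map, gt_map, l_iou, l_theta, lambda_theta=10.0):
        l_geo = l_iou + lambda_theta * l_theta      # L_geo = L_IoU + λ_θ·L_θ
        return l_geo + dice_loss(pred_map, gt_map)  # L_det = L_geo + L_dice

    # Training with Adam, as the claim recites (the learning rate is an assumption):
    # optimizer = torch.optim.Adam(detector.parameters(), lr=1e-4)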
18. The computer device according to claim 17, wherein the instructions, when executed by the processor (41), further cause the processor (41) to:
    use scene text images or cropped images to perform the training, validation, and testing of the text detector (200), the script recognition network (204), and the attention-mechanism-based multilingual text recognition network (206),
    wherein the scene text images and the cropped images both contain text in multiple language types and are each divided into a training set, a validation set, and a test set, and wherein the text in the training set and in the validation set is annotated,
    and wherein the scene text images are used for the training, validation, and testing of the text detector (200), while the cropped images are used for the training, validation, and testing of the script recognition network (204) and the attention-mechanism-based multilingual text recognition network (206).
19. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, causes the processor to perform the method according to any one of claims 1-9.
PCT/CN2020/078928 2019-03-26 2020-03-12 Multi-language text detection and recognition method and device WO2020192433A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910232853.0A CN109948615B (en) 2019-03-26 2019-03-26 Multi-language text detection and recognition system
CN201910232853.0 2019-03-26

Publications (1)

Publication Number Publication Date
WO2020192433A1 (en)

Family

ID=67010832

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/078928 WO2020192433A1 (en) 2019-03-26 2020-03-12 Multi-language text detection and recognition method and device

Country Status (2)

Country Link
CN (1) CN109948615B (en)
WO (1) WO2020192433A1 (en)


Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948615B (en) * 2019-03-26 2021-01-26 中国科学技术大学 Multi-language text detection and recognition system
CN110942067A (en) * 2019-11-29 2020-03-31 上海眼控科技股份有限公司 Text recognition method and device, computer equipment and storage medium
CN111126243B (en) * 2019-12-19 2023-04-07 北京科技大学 Image data detection method and device and computer readable storage medium
CN111259764A (en) * 2020-01-10 2020-06-09 中国科学技术大学 Text detection method and device, electronic equipment and storage device
CN111507406A (en) * 2020-04-17 2020-08-07 上海眼控科技股份有限公司 Method and equipment for optimizing neural network text recognition model
CN111914843B (en) * 2020-08-20 2021-04-16 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Character detection method, system, equipment and storage medium
CN114170594A (en) * 2021-12-07 2022-03-11 奇安信科技集团股份有限公司 Optical character recognition method, device, electronic equipment and storage medium
CN118378707A (en) * 2024-06-21 2024-07-23 中国科学技术大学 Dynamic evolution multi-mode value generation method based on value system guidance


Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN106570497A (en) * 2016-10-08 2017-04-19 中国科学院深圳先进技术研究院 Text detection method and device for scene image
CN108491836B (en) * 2018-01-25 2020-11-24 华南理工大学 Method for integrally identifying Chinese text in natural scene image
CN109359293B (en) * 2018-09-13 2019-09-10 内蒙古大学 Mongolian name entity recognition method neural network based and its identifying system
CN109492679A (en) * 2018-10-24 2019-03-19 杭州电子科技大学 Based on attention mechanism and the character recognition method for being coupled chronological classification loss

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
CN107220641A (en) * 2016-03-22 2017-09-29 华南理工大学 A kind of multi-language text sorting technique based on deep learning
US20180137349A1 (en) * 2016-11-14 2018-05-17 Kodak Alaris Inc. System and method of character recognition using fully convolutional neural networks
CN108470172A (en) * 2017-02-23 2018-08-31 阿里巴巴集团控股有限公司 A kind of text information identification method and device
CN106980858A (en) * 2017-02-28 2017-07-25 中国科学院信息工程研究所 The language text detection of a kind of language text detection with alignment system and the application system and localization method
CN109948615A (en) * 2019-03-26 2019-06-28 中国科学技术大学 Multi-language text detects identifying system

Non-Patent Citations (1)

Title
CHEN, XIAOLONG ET AL.: "Electricity Equipment Nameplate Recognition Based on Deep Learning", Journal of Guangxi University (Natural Science Edition), vol. 43, no. 6, 31 December 2018 (2018-12-31) *

Cited By (10)

Publication number Priority date Publication date Assignee Title
CN112613348A (en) * 2020-12-01 2021-04-06 浙江华睿科技有限公司 Character recognition method and electronic equipment
CN113159021A (en) * 2021-03-10 2021-07-23 国网河北省电力有限公司 Text detection method based on context information
CN113095370A (en) * 2021-03-18 2021-07-09 北京达佳互联信息技术有限公司 Image recognition method and device, electronic equipment and storage medium
CN113095370B (en) * 2021-03-18 2023-11-03 北京达佳互联信息技术有限公司 Image recognition method, device, electronic equipment and storage medium
CN113255646A (en) * 2021-06-02 2021-08-13 北京理工大学 Real-time scene text detection method
CN113255646B (en) * 2021-06-02 2022-10-18 北京理工大学 Real-time scene text detection method
CN113537189A (en) * 2021-06-03 2021-10-22 深圳市雄帝科技股份有限公司 Handwritten character recognition method, device, equipment and storage medium
CN114743045A (en) * 2022-03-31 2022-07-12 电子科技大学 Small sample target detection method based on double-branch area suggestion network
CN114743045B (en) * 2022-03-31 2023-09-26 电子科技大学 Small sample target detection method based on double-branch area suggestion network
CN115936073A (en) * 2023-02-16 2023-04-07 江西省科学院能源研究所 Language-oriented convolutional neural network and visual question-answering method

Also Published As

Publication number Publication date
CN109948615A (en) 2019-06-28
CN109948615B (en) 2021-01-26

Similar Documents

Publication Publication Date Title
WO2020192433A1 (en) Multi-language text detection and recognition method and device
US10558893B2 (en) Systems and methods for recognizing characters in digitized documents
WO2020221298A1 (en) Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
CN111488826B (en) Text recognition method and device, electronic equipment and storage medium
US11507800B2 (en) Semantic class localization digital environment
KR102275413B1 (en) Detecting and extracting image document components to create flow document
WO2019192397A1 (en) End-to-end recognition method for scene text in any shape
US20180114071A1 (en) Method for analysing media content
CN108427950B (en) Character line detection method and device
CN111507335A (en) Method and device for automatically labeling training images for deep learning network
CN109934229B (en) Image processing method, device, medium and computing equipment
WO2021081562A2 (en) Multi-head text recognition model for multi-lingual optical character recognition
EP3910532B1 (en) Learning method and learning device for training an object detection network by using attention maps and testing method and testing device using the same
CN109712164A (en) Image intelligent cut-out method, system, equipment and storage medium
JP7198350B2 (en) CHARACTER DETECTION DEVICE, CHARACTER DETECTION METHOD AND CHARACTER DETECTION SYSTEM
CN111291759A (en) Character detection method and device, electronic equipment and storage medium
JP2021135993A (en) Text recognition method, text recognition apparatus, electronic device, and storage medium
US20190294963A1 (en) Signal processing device, signal processing method, and computer program product
US20220101065A1 (en) Automatic document separation
WO2021237227A1 (en) Method and system for multi-language text recognition model with autonomous language classification
CN112070040A (en) Text line detection method for video subtitles
CN111178363B (en) Character recognition method, character recognition device, electronic equipment and readable storage medium
CN114462490A (en) Retrieval method, retrieval device, electronic device and storage medium of image object
CN114373092A (en) Progressive training fine-grained vision classification method based on jigsaw arrangement learning
US20240062560A1 (en) Unified scene text detection and layout analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20779406

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20779406

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.03.2022)
