AU2021100391A4 - Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism - Google Patents


Info

Publication number
AU2021100391A4
AU2021100391A4
Authority
AU
Australia
Prior art keywords
network
recognition
training
attention mechanism
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
AU2021100391A
Inventor
Lianwen JIN
Tiancai Liang
Canjie LUO
Huiyun MAO
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
GRG Banking Equipment Co Ltd
Shenzhen Xinyi Technology Co Ltd
Zhuhai Institute of Modern Industrial Innovation of South China University of Technology
Original Assignee
South China University of Technology SCUT
GRG Banking Equipment Co Ltd
Shenzhen Xinyi Technology Co Ltd
Zhuhai Institute of Modern Industrial Innovation of South China University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT, GRG Banking Equipment Co Ltd, Shenzhen Xinyi Technology Co Ltd, Zhuhai Institute of Modern Industrial Innovation of South China University of Technology filed Critical South China University of Technology SCUT
Priority to AU2021100391A priority Critical patent/AU2021100391A4/en
Application granted granted Critical
Publication of AU2021100391A4 publication Critical patent/AU2021100391A4/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/10 Image acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/32 Digital ink
    • G06V30/333 Preprocessing; Feature extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

Disclosed is a natural scene text recognition method based on sequence transformation correction and an attention mechanism, comprising the steps of data acquisition, data processing, label generation, network training and network testing; wherein the network training comprises the steps of constructing a recognition network, and inputting the training data and the processed labels into a pre-designed recognition network to complete the training of the recognition network; the recognition network comprises a sequence transformation corrector and an attention mechanism-based text recognizer; the sequence transformation corrector comprises multiple convolutional layers, a nonlinear layer and a pooling layer, and further comprises a decomposition layer and a positioning network composed of multiple fully connected layers, and the attention mechanism-based text recognizer comprises a feature coding network and an attention mechanism-based decoder. The present invention greatly reduces the recognition difficulty of the recognition model, achieves a high recognition accuracy rate and strong robustness, and provides good recognition performance for texts with irregular shapes.

Description

Data acquisition: synthesizing a natural scene text line image by taking open source codes as well as text corpus obtained from the Internet as the training set, public natural scene text recognition dataset as the test set;
Data processing: scaling the training set and test set images, with the processed image size of 64*192; and making all images in the training set and test set into LMDB format to improve the image reading speed;
Label generation: processing the training set images by adding labels in the form of text content corresponding to each text line image;
Network training: constructing a recognition network, wherein the recognition network comprises a sequence transformation corrector and an attention mechanism-based text recognizer; and inputting the training data and the processed labels into the pre-designed recognition network to complete the training of the recognition network;
Network testing: inputting test data into the trained network to finally obtain a recognition result of the text line in the images.
Fig. 1
Fig. 2 (schematic of the overall structure): an input image passes through the sequence transformation corrector and then the attention mechanism-based text recognizer, which outputs the text string (e.g. "DONOVAN"); the corrector is trained under weak supervision and the recognizer under string label supervision.
Natural Scene Text Recognition Method Based on Sequence Transformation
Correction and Attention Mechanism
TECHNICAL FIELD
[01] The present invention relates to the fields of pattern recognition and artificial intelligence technology, and in particular to a natural scene text recognition method based on sequence transformation correction and an attention mechanism.
BACKGROUND
[02] As a carrier of information, text has been widely used from ancient times to the present. The presentation of text allows people to understand and process visual information more accurately, which facilitates information exchange. With the rapid development of computer technology, artificial intelligence technology is gradually changing people's lives. People hope to understand and process images efficiently through computers, and text information is crucial to image understanding. Natural scene text recognition has therefore long been a meaningful and intensively studied research direction.
[03] In contrast to document image recognition tasks, texts in natural scenes often undergo diverse deformations such as rotation, perspective distortion and bending. Moreover, natural scene text deformations are complex, diverse and irregular, and difficult to simulate with a mathematical transformation, which poses a great challenge to natural scene text recognition systems.
[04] Therefore, there is an urgent need for a text recognition method capable of effectively improving the recognition accuracy for irregular natural scene text datasets.
SUMMARY
[05] The purpose of the present invention is to provide a natural scene text recognition method based on sequence transformation correction and an attention mechanism, aiming at solving the above problems of the prior art and effectively improving the recognition accuracy of natural scene texts.
[06] Aiming at achieving the above purpose, the technical solution of the present invention is to provide a natural scene text recognition method based on sequence transformation correction and an attention mechanism, comprising the steps of:
[07] data acquisition: acquiring training set and test set samples;
[08] data processing: scaling the training set and test set images;
[09] label generation: labeling the training set images;
[010] network training: constructing a recognition network, and inputting the training data and the processed labels into a pre-designed recognition network to complete the training of the recognition network;
[011] the recognition network comprising a sequence transformation corrector and an attention mechanism-based text recognizer, wherein the sequence transformation corrector comprises multiple convolutional layers, a nonlinear layer and a pooling layer, and further comprises a decomposition layer and a positioning network composed of multiple fully connected layers, and the attention mechanism-based text recognizer comprises a feature coding network and an attention mechanism-based decoder; and
[012] network testing: inputting test data to the trained recognition network to obtain the recognition result of text line in the images.
[013] Preferably, the sequence transformation corrector further comprises a scaling layer and a grid mapping module, and the method for correcting images by the sequence transformation corrector comprises the steps of:
[014] obtaining a feature map of the image to be corrected through being processed in the scaling layer, the convolutional layer, the nonlinear layer and the pooling layer;
[015] decomposing the feature map into N image blocks disjoint from each other in the horizontal direction by means of the decomposition layer, and inputting the features of each image block to a localization network, through which the transformation parameters of each image block are predicted;
[016] inputting the transformation parameters of each image block to the grid mapping module to obtain a smooth sampled grid; and
[017] obtaining the corrected image by using the sampling grid through bilinear interpolation sampling on the original image to be corrected.
[018] Preferably, the convolutional layer can be padded, and the specific padding method comprises the step of affixing a circle of pixel dots with a pixel value of 0 on the top, bottom, left and right of the original image or feature map (i.e., zero-padding).
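A minimal NumPy sketch of this zero-padding step (illustrative, not from the patent; the 3*16*48 shape is the scaled input size used later in the embodiment):

```python
import numpy as np

def zero_pad(feature_map: np.ndarray, width: int = 1) -> np.ndarray:
    """Affix `width` rings of zero-valued pixels on the top, bottom,
    left and right of a (channels, height, width) feature map."""
    return np.pad(feature_map, ((0, 0), (width, width), (width, width)),
                  mode="constant", constant_values=0)

fm = np.ones((3, 16, 48), dtype=np.float32)   # e.g. the scaled 3*16*48 input
padded = zero_pad(fm)
print(padded.shape)   # (3, 18, 50): a 3*3 convolution with step 1 now preserves 16*48
```

With one ring of zeros, a 3*3 convolution with step size 1*1 keeps the spatial size unchanged, which matches the feature map sizes listed in Table 1.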
[019] Preferably, the feature encoding network is used for converting image data into time-series features with contextual information by taking a convolutional neural network and a long-short term memory network as basic units.
[020] Preferably, the attention mechanism-based decoder introduces a long-short term memory (LSTM) network in the decoding process to gradually recognize each character in the image, and the specific recognition method comprises the steps of:
[021] calculating an attention weight matrix by the attention mechanism-based decoder according to the time-series features output by the feature encoding network and the hidden state at a time point on the long-short term memory network;
[022] normalizing the attention weight matrix to obtain the probability distribution thereof;
[023] weighting and summing the time-series features obtained by encoding the feature encoding network according to the probability distribution of the attention weight matrix, so as to obtain the attention features at the current moment;
[024] updating the hidden state of the long-short term memory network according to the attention features of the current moment and the probability distribution predicted based on the characters of the previous moment;
[025] decoding through the fully connected layer and inputting the decoding result to the softmax layer for probability normalization, so as to obtain the probability distribution of the predicted characters; and
[026] selecting a character corresponding to the value with the maximum confidence in the probability distribution as the current decoded output character, and completing the recognition of characters in the images.
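The five decoding steps above can be sketched in NumPy. This is an illustrative sketch only: the dimensions, the charset size of 37, and the random parameter values are assumptions, and a real implementation would produce the updated state s_t with an actual LSTM cell before the output projection (here the previous state stands in for it):

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, S, V = 25, 512, 256, 37   # seq length, feature dim, state dim, charset size (all assumed)

H = rng.normal(size=(L, D))       # time-series features h_1..h_L from the encoder
s_prev = rng.normal(size=S)       # decoder hidden state from the previous moment

# trainable parameters, randomly initialised for illustration
w  = rng.normal(size=S)
Ws = rng.normal(size=(S, S)) * 0.01
Wh = rng.normal(size=(S, D)) * 0.01
b  = np.zeros(S)

# step 1: unnormalised attention weights over the L time steps
e = np.array([w @ np.tanh(Ws @ s_prev + Wh @ H[j] + b) for j in range(L)])

# step 2: softmax normalisation -> attention probability distribution
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# step 3: attention feature as the weighted sum of the time-series features
g = alpha @ H

# steps 4-5: an LSTM cell would update the hidden state from the previous
# prediction, g and s_prev; the output projection is shown with s_prev
# standing in for the updated state.
U = rng.normal(size=(V, S)) * 0.01
d = np.zeros(V)
logits = U @ s_prev + d
y = np.exp(logits - logits.max())
y /= y.sum()
char_index = int(np.argmax(y))    # greedy choice of the decoded character
```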
[027] Preferably, the recognition network training comprises the steps of:
[028] minimizing the cross entropy loss by using an adaptive gradient descent method and taking the probability of each character in the training data string output at the corresponding time point as cross entropy.
[029] Preferably, the weight parameters in the recognition network are initialized by a random Gaussian distribution initialization method.
[030] The present invention discloses the following technical effects:
[031] (1) The automatic learning algorithm with a deep network structure learns effective representations from data and improves the recognition accuracy.
[032] (2) The present invention combines the end-to-end network design and the physically meaningful gradient information returned by the recognition model to effectively guide the correction network by a weak supervision training method, which greatly reduces the recognition difficulty of the recognition model and improves the recognition accuracy in practical applications.
[033] (3) The method of the present invention introduces an idea of decomposition into the design of the corrector, in which irregular text images are decomposed into image blocks with small deformation, thus greatly reducing the difficulty of correcting irregular texts. A grid mapping module is designed in the correction network to ensure that the whole correction process is smooth, which makes the correction transformation flexible and efficient and plays a good role in correcting irregular texts, with high recognition accuracy, strong robustness and good recognition performance for irregularly shaped texts.
BRIEF DESCRIPTION OF THE FIGURES
[034] In order to explain more clearly the embodiments in the present invention or the technical solutions in the prior art, the following will briefly introduce the figures needed in the description of the embodiments. Obviously, figures in the following description are only some embodiments of the present invention, and for a person skilled in the art, other figures may also be obtained based on these figures without paying any creative effort.
[035] Fig. 1 is a flow chart of the text recognition method of the present invention.
[036] Fig. 2 is a schematic diagram of the overall structure of the text recognition method of the present invention.
[037] Fig. 3 is a structural diagram of the sequence transformation corrector network of the present invention.
[038] Fig. 4 is a schematic diagram of the verification results according to an embodiment of the present invention.
DESCRIPTION OF THE INVENTION
[039] The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the embodiments described are only a part of the embodiments of the present invention and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative efforts shall fall within the scope of the present invention.
[040] In order to make the above objects, features and advantages of the present invention more obvious and readily understood, the present invention will be further explained in detail with reference to the drawings and specific embodiments.
[041] Referring to Figs. 1-3, the example provides a natural scene text recognition method based on sequence transformation correction and an attention mechanism, comprising the steps of:
[042] S1. data acquisition: acquiring training set and test set samples;
[043] synthesizing natural scene text line images by using open source codes together with text corpora obtained from the Internet as the training set, taking public natural scene text recognition datasets as the test set, and saving each image in the corresponding folder.
[044] S2. data processing:
[045] scaling the training set and test set images, with the processed image size of 64*192; and
[046] making all images in the training set and test set into lightning memory mapped database (LMDB) format to improve the image reading speed.
[047] S3: label generation:
[048] The recognition network is trained by a supervision method in the present invention; therefore, the training set images are processed by adding labels in the form of text content corresponding to each text line image.
[049] S4. network training: constructing a recognition network, wherein the recognition network comprises a sequence transformation corrector and an attention mechanism-based text recognizer; and inputting the training data and the processed labels into the pre-designed recognition network to complete the training of the recognition network; and the network training specifically comprising the steps of:
[050] S4.1. building a sequence transformation corrector, with the network structure and parameter settings in this example shown in Table 1.
[051] Table 1
| Network layer | Specific operations | Feature map size |
| --- | --- | --- |
| Input layer | - | 3*64*192 |
| Scaling layer | Scaling the image to 16*48 | 3*16*48 |
| Convolutional layer | Number of kernels 32, convolutional kernel 3*3, step size 1*1, padding | 32*16*48 |
| Nonlinear layer | - | 32*16*48 |
| Convolutional layer | Number of kernels 64, convolutional kernel 3*3, step size 1*1, padding | 64*16*48 |
| Nonlinear layer | - | 64*16*48 |
| Convolutional layer | Number of kernels 128, convolutional kernel 3*3, step size 1*1, padding | 128*16*48 |
| Nonlinear layer | - | 128*16*48 |
| Pooling layer | Pooling kernel 2*2, step size 2*2 | 128*8*24 |
| Convolutional layer | Number of kernels 256, convolutional kernel 3*3, step size 1*1, padding | 256*8*24 |
| Nonlinear layer | - | 256*8*24 |
| Convolutional layer | Number of kernels 256, convolutional kernel 3*3, step size 1*1, padding | 256*8*24 |
| Nonlinear layer | - | 256*8*24 |
| Convolutional layer | Number of kernels 256, convolutional kernel 3*3, step size 1*1, padding | 256*8*24 |
| Nonlinear layer | - | 256*8*24 |
| Pooling layer | Pooling kernel 2*2, step size 2*2 | 256*4*12 |
| Decomposition layer | Decomposing the feature map into N pieces in the horizontal direction | N blocks of 256*4*(12/N) |
| Fully connected layer | Number of nodes 256 | 256 |
| Fully connected layer | Number of nodes 6 | 6 |
[052] wherein the specific method for the padding of the convolutional layers in Table 1 comprises the following step: affixing a circle of pixel dots with a pixel value of 0 on the top, bottom, left and right of the original image or feature map (i.e., zero-padding); the nonlinear layers adopt the ReLU activation function, and the pooling layers adopt the maximum pooling method.
[053] The scaling layer of the sequence transformation corrector effectively improves the network receptive field by zooming out the images, reduces the computation, avoids the input of a large amount of noise, and improves the robustness of the module.
[054] The method for correcting images by the sequence transformation corrector comprises the steps of:
[055] firstly, inputting the image to the sequence transformation corrector, and processing it through the scaling layer, the convolutional layers, the nonlinear layers and the pooling layers in Table 1 to obtain a feature map of size 256*4*12; and
[056] secondly, decomposing the feature map into N image blocks disjoint from each other in the horizontal direction by means of the decomposition layer, and inputting the features of each image block to a localization network consisting of two fully connected layers, and predicting the transformation parameters of each image block through the localization network, as shown in Formula (1):
[057] T(patch_i | θ) = [θ_11, θ_12, θ_13; θ_21, θ_22, θ_23] ........................ (1)
[058] where θ represents the parameters of the neural network, patch_i represents the i-th image block, i ∈ [1, N], and T(patch_i | θ) represents the 2*3 affine transformation parameters obtained by inputting the features of the i-th image block into the localization network;
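For illustration only (not from the patent), a 2*3 affine parameter matrix of the kind the localization network predicts can be applied to a point in homogeneous coordinates as follows; the matrix values here are arbitrary:

```python
import numpy as np

# A hypothetical 2*3 affine parameter matrix: identity plus a translation.
T = np.array([[1.0, 0.0, 0.25],
              [0.0, 1.0, -0.25]])

def apply_affine(T: np.ndarray, x: float, y: float) -> tuple:
    """Apply a 2*3 affine transform to the point (x, y) in homogeneous coordinates."""
    out = T @ np.array([x, y, 1.0])
    return float(out[0]), float(out[1])

print(apply_affine(T, 0.5, 0.5))   # -> (0.75, 0.25)
```

Six parameters per block (two rows of three) are exactly what the final fully connected layer with 6 nodes in Table 1 outputs.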
[059] thirdly, inputting the transformation parameter of each image block to the grid mapping module to obtain a smooth sampled grid, with the specific process as follows:
[060] taking the height and width of the image block input to the sequence transformation corrector as H_i and W_i respectively, and taking the height and width of the image block after being corrected by the sequence transformation corrector as H_o and W_o;
[061] calculating the image block to which the coordinate position (x_o, y_o) on the sampling grid belongs, as shown in Formula (2):
[062] i = ⌈x_o × N / W_o⌉, i ∈ {1, 2, ..., N} ........................ (2)
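Reading Formula (2) as a ceiling division (the rounding convention is an interpretation of the original), the block assignment can be sketched as:

```python
import math

def block_index(x_o: float, W_o: int, N: int) -> int:
    """Which of the N horizontal image blocks the sampling-grid
    x-coordinate x_o falls into (1-based), per Formula (2)."""
    i = math.ceil(x_o * N / W_o)
    return min(max(i, 1), N)   # clamp to the valid range {1, ..., N}

# A 192-pixel-wide sampling grid split into N = 8 blocks of 24 pixels each:
assert block_index(24, 192, 8) == 1
assert block_index(25, 192, 8) == 2
assert block_index(192, 192, 8) == 8
```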
[063] mapping the coordinate position (x_o, y_o) on the sampling grid to the hidden grid, and obtaining the coordinate (x_h, y_h), with the mapping calculation process shown in Formula (3):
[064] [x_h; y_h] = T(patch_i | θ) × [x_o × n × N / W_o; y_o × m / H_o; 1] ........................ (3)
[065] where, n and m represent the width and height of each block grid in the hidden grid, respectively;
[066] smoothly mapping the coordinates (x_h, y_h) in the hidden grid to the coordinate positions (x, y) in the input image block grid through bilinear interpolation, with the mapping calculation process shown in Formula (4):
[067] [x; y] = [W_i / (n × N), 0; 0, H_i / m] × [x_h; y_h] ........................ (4)
[068] In summary, the entire grid mapping process is expressed as (x, y) = P(x_o, y_o),
[069] where P represents the grid mapping function; based on Formula (3) and Formula (4), the grid mapping function is shown in Formula (5):
[070] P(x_o, y_o) = [W_i / (n × N), 0; 0, H_i / m] × T(patch_i | θ) × [x_o × n × N / W_o; y_o × m / H_o; 1] ............... (5)
[071] and finally, obtaining the corrected image through bilinear interpolation sampling on the original input image by using the sampling grid, with the sampling calculation process shown in Formula (6):
Hi Wi
Y Y .= max(0,1 - Ix, - ul) max(0,1 - |yj - vl)
[072] u V ...... (6)
[073] where, represents the pixel value at position (X- Y-) in the output
image, and represents the pixel value at position (U1 ) in the input image.
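The bilinear sampling of Formula (6) can be written directly in NumPy; the double loop below mirrors the double sum for clarity (a practical implementation visits only the four neighbouring pixels, since all other weights are zero):

```python
import numpy as np

def bilinear_sample(U: np.ndarray, x: float, y: float) -> float:
    """Bilinear interpolation per Formula (6): the pixel value at the
    real-valued position (x, y) of input image U, where (u, v) run
    over the integer pixel positions."""
    H, W = U.shape
    value = 0.0
    for u in range(H):
        for v in range(W):
            value += U[u, v] * max(0.0, 1 - abs(x - u)) * max(0.0, 1 - abs(y - v))
    return value

U = np.array([[0.0, 1.0],
              [2.0, 3.0]])
print(bilinear_sample(U, 0.5, 0.5))   # midpoint of the four pixels -> 1.5
```

Because every term in the sum is a product of two piecewise-linear weights, the sampling is differentiable almost everywhere, which is what lets the corrector be trained by gradient descent.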
[074] The above transformation processes are all differentiable, which ensures that the sequence transformation corrector can update and optimize its parameters by the gradient descent algorithm.
[075] S4.2. building an attention mechanism-based text recognizer;
[076] firstly, constructing a feature encoding network for converting the image data into the time-series features with contextual information by taking the convolutional neural network and the long-short term memory network as basic units;
[077] the convolutional neural network being structured as: input (32*100) → 64C3 → MP22 → 128C3 → MP22 → 256C3 → 256C3 → MP21 → 512C3 → MP21 → 512C2, where in pCq, p represents the number of convolution output channels, q represents the size of the convolutional kernel, and C represents a convolutional layer (for example, 64C3 represents a convolutional layer with a convolutional kernel size of 3 and 64 output channels); and in MPef, e and f represent the width-height and the step size of the maximum pooling layer respectively, and MP represents the maximum pooling layer (for example, MP22 represents the maximum pooling layer with width, height and step size of 2);
[078] allowing the input image to be processed in the convolutional neural network to obtain a feature with the height of 1, inputting the feature to a BLSTM network consisting of two-layer bidirectional long-short term memory, and extracting the context-dependent time-series features;
[079] secondly, inputting the time-series features H = [h_1, h_2, ..., h_L] obtained by encoding the feature encoding network to an attention mechanism-based decoder, and obtaining the character prediction results, where L represents the length of the time-series features; the attention mechanism-based decoder introduces an LSTM network in the decoding process to gradually recognize each character in the image, and the specific recognition method comprises the steps of:
[080] calculating an attention weight matrix e_t by the attention mechanism-based decoder according to the time-series features H output by the feature encoding network and the hidden state s_{t-1} at the previous time point on the long-short term memory network, as shown in Formula (7):
[081] e_{t,j} = w^T Tanh(W_s × s_{t-1} + W_h × h_j + b) ........................ (7)
[082] where w, W_s, W_h and b represent trainable parameters, Tanh represents the activation function, and j represents the ordinal number of the time series, j ∈ [1, L];
[083] normalizing the attention weight matrix e_t to obtain the probability distribution α_t thereof, as shown in Formula (8):
[084] α_{t,j} = exp(e_{t,j}) / Σ_{j'=1..L} exp(e_{t,j'}) ........................ (8)
[085] weighting and summing the time-series features obtained by encoding the feature encoding network according to the probability distribution of the attention weight matrix, so as to obtain the attention features g_t at the current moment, as shown in Formula (9):
[086] g_t = Σ_{j=1..L} α_{t,j} × h_j .................................... (9)
[087] updating the hidden state of the long-short term memory network according to the attention features g_t of the current moment and the probability distribution y_{t-1} predicted based on the characters of the previous moment, as shown in Formula (10):
[088] s_t = LSTM(y_{t-1}, g_t, s_{t-1}) ........................... (10)
[089] decoding through the fully connected layer and inputting the decoding result to the softmax layer for probability normalization, so as to obtain the probability distribution y_t of the predicted characters, as shown in Formula (11):
[090] y_t = Softmax(U × s_t + d) ........................... (11)
[091] where, U and d represent trainable parameters;
[092] selecting a character corresponding to the value with the maximum confidence from y_t as the current decoded output character;
[093] S4.3. setting training parameters:
[094] sending the training data to the network for training, and allowing the network to traverse the training dataset for 10 times, where the read-in batch size is set to 64, the initial learning rate of the attention mechanism-based text recognizer is set to 1, and the initial learning rate of the sequence transformation corrector is set to 0.1, and then the learning rate of the whole network is decreased by a factor of 10 when the dataset is traversed for 6 and 8 times;
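The stated schedule (initial rates of 1 and 0.1, divided by 10 when the dataset has been traversed 6 and 8 times) can be sketched as follows; reading the decay points as epoch boundaries is an interpretation:

```python
def learning_rate(initial_lr: float, epoch: int) -> float:
    """Step schedule: divide the learning rate by 10 after the 6th
    and again after the 8th traversal of the training set (epochs
    counted from 1, 10 epochs in total)."""
    lr = initial_lr
    if epoch > 6:
        lr /= 10.0
    if epoch > 8:
        lr /= 10.0
    return lr

# recognizer starts at 1, corrector at 0.1
assert learning_rate(1.0, 6) == 1.0
assert learning_rate(1.0, 7) == 0.1
```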
[095] an adaptive gradient descent method is used as the optimization algorithm, with a loss function L as shown in Formula (12):
[096] L = −(1/B) Σ_{b=1..B} Σ_{a=1..|C^(b)|} log p(c_a^(b) | I^(b), θ) ............ (12)
[097] where B represents the number of samples used for this batch optimization, p(c_a^(b) | I^(b), θ) represents the probability of outputting the character c_a^(b) from the b-th sample image I^(b) at the moment a, and |C^(b)| represents the length of the b-th sample string label;
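Reading Formula (12) as the batch-averaged sum of negative log-probabilities of the ground-truth characters (an interpretation of the loss described above), it can be sketched as:

```python
import numpy as np

def recognition_loss(char_probs: list) -> float:
    """Cross-entropy loss as interpreted from Formula (12):
    char_probs[b][a] is the probability the network assigns to the
    ground-truth character of sample b at moment a; the summed
    negative log-probabilities are averaged over the batch."""
    B = len(char_probs)
    total = 0.0
    for sample in char_probs:          # one entry per string label
        for p in sample:               # one probability per character
            total += -np.log(p)
    return total / B

# two samples: one confidently correct, one uncertain
batch = [[0.9, 0.8, 0.95], [0.5, 0.6]]
loss = recognition_loss(batch)
```

The loss goes to zero only when every ground-truth character receives probability 1 at its time step, which is exactly what minimizing the cross entropy enforces.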
[098] S4.4. initialization of recognition network weight: all weight parameters in the network are initialized at the beginning of training by the random Gaussian distribution initialization method;
[099] S4.5. recognition network training: minimizing the cross entropy loss by using an adaptive gradient descent method and taking the probability of each character in the training data string output at the corresponding time point as cross entropy, i.e. the loss function is minimized; and guiding the training of the sequence transformation corrector by the attention mechanism-based text recognizer, which realizes the weak supervision of the recognition network training process and effectively improves the accuracy of text data recognition of irregular natural scenes;
[0100] S5. network testing: inputting test data into the trained network to finally obtain a recognition result of text line in the images, which specifically comprises the steps of:
[0101] S5.1: inputting test dataset samples, selecting the character with the maximum confidence at each time step as the predicted character based on the greedy algorithm, and putting these characters together to obtain the final predicted text line; and
[0102] S5.2: after recognition is completed, calculating the line recognition accuracy and the edit distance based on the comparison of the recognized text line results with the labeled ones.
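The two evaluation metrics in S5.2 can be computed with standard implementations; the following is an illustrative sketch, not the patent's own code:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a recognized text line and its label."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def line_accuracy(predictions: list, labels: list) -> float:
    """Fraction of text lines recognized exactly."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

assert edit_distance("D0NOVAN", "DONOVAN") == 1   # one substituted character
assert line_accuracy(["DONOVAN", "TEXT"], ["DONOVAN", "TEST"]) == 0.5
```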
[0103] In order to further verify the effectiveness and robustness of the text recognition method of the present invention, an image of 64*192 is selected in this example, and the correction result and recognition result are shown in Fig. 4. Fig. 4 shows that the input image text is regularly arranged after being processed by the corrector, which enables the recognizer to accurately recognize the text in the images. The text recognition method of the present invention is highly robust and effective.
[0104] According to the present invention, the natural scene text recognition method based on sequence transformation correction and an attention mechanism reduces the recognition difficulty of the subsequent recognizer by correcting the irregular text. The training of the correction network is guided by the recognition model in combination with a weak supervision method, and no location coordinate labels are used in the training process.
[0105] The method of the present invention introduces an idea of decomposition in the design of the correction network, in which irregular text images are decomposed into various image blocks with small deformation, thus greatly reducing the difficulty of correcting irregular texts; a grid mapping module is also designed in the correction network to ensure that the whole correction process is smooth; and the recognition algorithm is adopted based on the attention mechanism in the design of the recognition network, which can effectively improve the recognition accuracy of natural scene text, especially in the irregular natural scene text data set.
[0106] In the description of the present invention, it should be understood that the terms "longitudinal", "transverse", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inside" and "outside" indicate the orientation or positional relationship shown in the drawings merely for the convenience of describing the present invention, and are not intended to indicate or imply that the device or component referred to must have a specific orientation or be constructed and operated in a specific orientation. Therefore, these terms shall not be construed as limiting the present invention.
[0107] Although the invention has been described with reference to specific examples, it will be appreciated by those skilled in the art that the invention may be embodied in many other forms, in keeping with the broad principles and the spirit of the invention described herein.
[0108] The present invention and the described embodiments specifically include the best method known to the applicant of performing the invention. The present invention and the described preferred embodiments specifically include at least one feature that is industrially applicable.

Claims (7)

THE CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS:
1. A natural scene text recognition method based on sequence transformation correction and an attention mechanism, characterized by comprising the steps of:
data acquisition: acquiring training set and test set samples;
data processing: scaling the training set and test set images;
label generation: labeling the training set images;
network training: constructing a recognition network, and inputting the training data and the processed labels into a pre-designed recognition network to complete the training of the recognition network;
the recognition network comprising a sequence transformation corrector and an attention mechanism-based text recognizer, wherein the sequence transformation corrector comprises multiple convolutional layers, a nonlinear layer and a pooling layer, and further comprises a decomposition layer and a localization network composed of multiple fully connected layers, and the attention mechanism-based text recognizer comprises a feature encoding network and an attention mechanism-based decoder; and
network testing: inputting test data to the trained recognition network to obtain the recognition result of the text lines in the images.
2. The natural scene text recognition method based on sequence transformation correction and an attention mechanism according to claim 1, characterized in that the sequence transformation corrector further comprises a scaling layer and a grid mapping module, and the method for correcting images by the sequence transformation corrector comprises the steps of:
obtaining a feature map of the image to be corrected by processing the image through the scaling layer, the convolutional layer, the nonlinear layer and the pooling layer;
decomposing the feature map into N image blocks disjoint from each other in the horizontal direction by means of the decomposition layer, and inputting the features of each image block to a localization network, through which the transformation parameters of each image block are predicted; inputting the transformation parameters of each image block to the grid mapping module to obtain a smooth sampling grid; and obtaining the corrected image by applying bilinear interpolation sampling to the original image to be corrected using the sampling grid.
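The final step of claim 2, sampling the corrected image through bilinear interpolation, can be sketched as below. The sampling grid here is supplied directly as fractional coordinates for illustration; in the claimed method it is produced by the grid mapping module.

```python
import numpy as np

def bilinear_sample(img, grid_y, grid_x):
    """Sample a single-channel image (H, W) at fractional
    coordinates (grid_y, grid_x) via bilinear interpolation."""
    h, w = img.shape
    y0 = np.clip(np.floor(grid_y).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(grid_x).astype(int), 0, w - 2)
    dy, dx = grid_y - y0, grid_x - x0
    # Blend the four neighbouring pixels by their fractional offsets.
    top = img[y0, x0] * (1 - dx) + img[y0, x0 + 1] * dx
    bot = img[y0 + 1, x0] * (1 - dx) + img[y0 + 1, x0 + 1] * dx
    return top * (1 - dy) + bot * dy
```

Because bilinear interpolation is differentiable with respect to the grid coordinates, gradients can flow back through the sampler into the localization network, which is what allows the corrector to be trained without location coordinate labels.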
3. The natural scene text recognition method based on sequence transformation correction and an attention mechanism according to claim 1, characterized in that the convolutional layer can apply zero-padding, and the specific padding method comprises the step of adding a border of pixels around the top, bottom, left and right of the original image or feature map, with the added pixels having a value of 0.
4. The natural scene text recognition method based on sequence transformation correction and an attention mechanism according to claim 1, characterized in that the feature encoding network is used for converting image data into time-series features with contextual information by taking a convolutional neural network and a long-short term memory network as basic units.
5. The natural scene text recognition method based on sequence transformation correction and an attention mechanism according to claim 4, characterized in that the attention mechanism-based decoder introduces a long short-term memory (LSTM) network in the decoding process to recognize each character in the image step by step, and the specific recognition method comprises the steps of:
calculating an attention weight matrix by the attention mechanism-based decoder according to the time-series features output by the feature encoding network and the hidden state of the long short-term memory network at the previous time point;
normalizing the attention weight matrix to obtain the probability distribution thereof;
weighting and summing the time-series features output by the feature encoding network according to the probability distribution of the attention weight matrix, so as to obtain the attention feature at the current moment;
updating the hidden state of the long short-term memory network according to the attention feature at the current moment and the probability distribution predicted for the character at the previous moment;
decoding the hidden state through a fully connected layer and inputting the result to a softmax layer for probability normalization, so as to obtain the probability distribution of the predicted character; and
selecting the character corresponding to the maximum confidence in the probability distribution as the current decoded output character, thereby completing the recognition of the characters in the image.
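One attention step of the decoder in claim 5 can be sketched schematically as follows. This sketch assumes an additive-attention score function and omits the LSTM state update and output classifier; all weight names (`W_f`, `W_h`, `v`) are illustrative, not from the patent.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax (probability normalization)."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(features, hidden, W_f, W_h, v):
    """features: (T, D) time-series features from the encoder;
    hidden: (H,) LSTM hidden state from the previous time point.
    Returns the attention feature (context) and attention weights."""
    # Attention weights from the features and the previous hidden state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (T,)
    alpha = softmax(scores)          # normalized probability distribution
    context = alpha @ features       # weighted sum: attention feature
    return context, alpha
```

In the full decoder, `context` and the previously predicted character would drive the LSTM update, and a fully connected layer plus softmax over `context` and the new hidden state would yield the next character distribution.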
6. The natural scene text recognition method based on sequence transformation correction and an attention mechanism according to claim 5, characterized in that the recognition network training comprises the steps of:
minimizing the cross-entropy loss by using an adaptive gradient descent method, the cross entropy being computed from the probability, output at the corresponding time point, of each character in the training label string.
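The training objective of claim 6 can be sketched as below: the loss is the mean negative log-probability assigned to the ground-truth character at each time step. The claim does not name a particular adaptive optimizer, so only the loss computation is shown; the function name and the epsilon constant are illustrative.

```python
import numpy as np

def cross_entropy_loss(step_probs, target_ids):
    """Mean negative log-probability of the ground-truth character
    at each corresponding time step.

    step_probs: list of per-step probability vectors (after softmax).
    target_ids: list of ground-truth character indices, same length.
    """
    eps = 1e-12  # guards against log(0) for zero-probability entries
    return -np.mean([np.log(p[t] + eps)
                     for p, t in zip(step_probs, target_ids)])
```

An adaptive gradient descent method (e.g. Adadelta or Adam, as one possible choice) would then update all network weights, including those of the corrector, against this loss, which is what enables the weakly supervised training described in paragraph [0104].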
7. The natural scene text recognition method based on sequence transformation correction and an attention mechanism according to claim 1, characterized in that the weight parameters in the recognition network are initialized by a random Gaussian distribution initialization method.
AU2021100391A 2021-01-22 2021-01-22 Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism Active AU2021100391A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2021100391A AU2021100391A4 (en) 2021-01-22 2021-01-22 Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2021100391A AU2021100391A4 (en) 2021-01-22 2021-01-22 Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism

Publications (1)

Publication Number Publication Date
AU2021100391A4 true AU2021100391A4 (en) 2021-04-15

Family

ID=75397211

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2021100391A Active AU2021100391A4 (en) 2021-01-22 2021-01-22 Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism

Country Status (1)

Country Link
AU (1) AU2021100391A4 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111871A (en) * 2021-04-21 2021-07-13 北京金山数字娱乐科技有限公司 Training method and device of text recognition model and text recognition method and device
CN113111871B (en) * 2021-04-21 2024-04-19 北京金山数字娱乐科技有限公司 Training method and device of text recognition model, text recognition method and device
WO2022262239A1 (en) * 2021-06-16 2022-12-22 科大讯飞股份有限公司 Text identification method, apparatus and device, and storage medium
CN113486167A (en) * 2021-07-26 2021-10-08 科大讯飞股份有限公司 Text completion method and device, computer equipment and storage medium
CN113486167B (en) * 2021-07-26 2024-04-16 科大讯飞股份有限公司 Text completion method, apparatus, computer device and storage medium
CN114241495A (en) * 2022-02-28 2022-03-25 天津大学 Data enhancement method for offline handwritten text recognition
CN114241495B (en) * 2022-02-28 2022-05-03 天津大学 Data enhancement method for off-line handwritten text recognition

Similar Documents

Publication Publication Date Title
AU2021100391A4 (en) Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism
CN111428718B (en) Natural scene text recognition method based on image enhancement
CN109492202B (en) Chinese error correction method based on pinyin coding and decoding model
CN110765966B (en) One-stage automatic recognition and translation method for handwritten characters
US10558893B2 (en) Systems and methods for recognizing characters in digitized documents
CN111428727B (en) Natural scene text recognition method based on sequence transformation correction and attention mechanism
CN110414498B (en) Natural scene text recognition method based on cross attention mechanism
CN110427938A (en) A kind of irregular character recognition device and method based on deep learning
CN109919174A (en) A kind of character recognition method based on gate cascade attention mechanism
CN113096017A (en) Image super-resolution reconstruction method based on depth coordinate attention network model
CN113221874A (en) Character recognition system based on Gabor convolution and linear sparse attention
CN114662788B (en) Seawater quality three-dimensional time-space sequence multi-parameter accurate prediction method and system
CN110837830B (en) Image character recognition method based on space-time convolutional neural network
US20240135610A1 (en) Image generation using a diffusion model
CN114581918A (en) Text recognition model training method and device
CN116110059A (en) Offline handwriting mathematical formula identification method based on deep learning
Wang et al. Recognizing handwritten mathematical expressions as LaTex sequences using a multiscale robust neural network
Zhang et al. A simple and effective static gesture recognition method based on attention mechanism
CN118072318A (en) Character recognition method, device and equipment in filling field and readable storage medium
CN113743315B (en) Handwriting elementary mathematical formula identification method based on structure enhancement
CN116798044A (en) Text recognition method and device and electronic equipment
CN114140317A (en) Image animation method based on cascade generation confrontation network
CN113313127A (en) Text image recognition method and device, computer equipment and storage medium
Li et al. LabanFormer: Multi-scale graph attention network and transformer with gated recurrent positional encoding for labanotation generation
CN117423119A (en) Transformer-based scene handwriting Chinese character recognition method

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)