AU2021100391A4 - Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism - Google Patents


Info

Publication number
AU2021100391A4
AU2021100391A4
Authority
AU
Australia
Prior art keywords
network
recognition
training
attention mechanism
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
AU2021100391A
Inventor
Lianwen JIN
Tiancai Liang
Canjie LUO
Huiyun MAO
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
GRG Banking Equipment Co Ltd
Shenzhen Xinyi Technology Co Ltd
Zhuhai Institute of Modern Industrial Innovation of South China University of Technology
Original Assignee
South China University of Technology SCUT
GRG Banking Equipment Co Ltd
Shenzhen Xinyi Technology Co Ltd
Zhuhai Institute of Modern Industrial Innovation of South China University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT, GRG Banking Equipment Co Ltd, Shenzhen Xinyi Technology Co Ltd, Zhuhai Institute of Modern Industrial Innovation of South China University of Technology filed Critical South China University of Technology SCUT
Priority to AU2021100391A priority Critical patent/AU2021100391A4/en
Application granted granted Critical
Publication of AU2021100391A4 publication Critical patent/AU2021100391A4/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/10 Image acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/32 Digital ink
    • G06V30/333 Preprocessing; Feature extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

Disclosed is a natural scene text recognition method based on sequence transformation correction and an attention mechanism, comprising the steps of data acquisition, data processing, label generation, network training and network testing; wherein the network training comprises the steps of constructing a recognition network, and inputting the training data and the processed labels into a pre-designed recognition network to complete the training of the recognition network; the recognition network comprises a sequence transformation corrector and an attention mechanism-based text recognizer; the sequence transformation corrector comprises multiple convolutional layers, a nonlinear layer and a pooling layer, and further comprises a decomposition layer and a positioning network composed of multiple fully connected layers, and the attention mechanism-based text recognizer comprises a feature coding network and an attention mechanism-based decoder. The present invention greatly reduces the recognition difficulty of the recognition model, achieves a high recognition accuracy rate and strong robustness, and provides good recognition performance for texts with irregular shapes.

Description

Data acquisition: synthesizing a natural scene text line image by taking open source codes as well as text corpus obtained from the Internet as the training set, public natural scene text recognition dataset as the test set;
Data processing: scaling the training set and test set images, with the processed image size of 64*192; and making all images in the training set and test set into LMDB format to improve the image reading speed;
Label generation: processing the training set images by adding labels in the form of text content corresponding to each text line image;
Network training: constructing a recognition network, wherein the recognition network comprises a sequence transformation corrector and an attention mechanism-based text recognizer; and inputting the training data and the processed labels into the pre-designed recognition network to complete the training of the recognition network;
Network testing: inputting test data into the trained network to finally obtain a recognition result of the text line in the images.
Fig. 1
Fig. 2 (schematic of the overall structure): an input image passes through the sequence transformation corrector and then the attention mechanism-based text recognizer, which outputs the text string (e.g. "DONOVAN"); the corrector is trained under weak supervision and the recognizer under string label supervision.
Natural Scene Text Recognition Method Based on Sequence Transformation
Correction and Attention Mechanism
TECHNICAL FIELD
[01] The present invention relates to the fields of pattern recognition and artificial intelligence technology, and in particular to a natural scene text recognition method based on sequence transformation correction and an attention mechanism.
BACKGROUND
[02] As a carrier of information, text has been widely used from ancient times to the present. The presentation of text allows people to understand and process visual information more accurately, which facilitates information exchange. With the rapid development of computer technology, artificial intelligence technology is gradually changing people's lives. People hope to understand and process images efficiently through computers, and text information is crucial to image understanding. Natural scene text recognition has therefore long been a meaningful and intensively studied research direction.
[03] In contrast to document image recognition tasks, texts in natural scenes often undergo diverse deformations such as rotation, perspective distortion and bending. Moreover, natural scene text deformations are complex, diverse and irregular, and difficult to simulate with a mathematical transformation, which poses a great challenge to natural scene text recognition systems.
[04] Therefore, there is an urgent need for a text recognition method capable of effectively improving the recognition accuracy for irregular natural scene text datasets.
SUMMARY
[05] The purpose of the present invention is to provide a natural scene text recognition method based on sequence transformation correction and an attention mechanism, aiming at solving the above problems of the prior art and effectively improving the recognition accuracy of natural scene texts.
[06] Aiming at achieving the above purpose, the technical solution of the present invention is to provide a natural scene text recognition method based on sequence transformation correction and an attention mechanism, comprising the steps of:
[07] data acquisition: acquiring training set and test set samples;
[08] data processing: scaling the training set and test set images;
[09] label generation: labeling the training set images;
[010] network training: constructing a recognition network, and inputting the training data and the processed labels into a pre-designed recognition network to complete the training of the recognition network;
[011] the recognition network comprising a sequence transformation corrector and an attention mechanism-based text recognizer, wherein the sequence transformation corrector comprises multiple convolutional layers, a nonlinear layer and a pooling layer, and further comprises a decomposition layer and a positioning network composed of multiple fully connected layers, and the attention mechanism-based text recognizer comprises a feature coding network and an attention mechanism-based decoder; and
[012] network testing: inputting test data to the trained recognition network to obtain the recognition result of text line in the images.
[013] Preferably, the sequence transformation corrector further comprises a scaling layer and a grid mapping module, and the method for correcting images by the sequence transformation corrector comprises the steps of:
[014] obtaining a feature map of the image to be corrected through being processed in the scaling layer, the convolutional layer, the nonlinear layer and the pooling layer;
[015] decomposing the feature map into N image blocks disjoint from each other in the horizontal direction by means of the decomposition layer, and inputting the features of each image block to a localization network, through which the transformation parameters of each image block are predicted;
[016] inputting the transformation parameters of each image block to the grid mapping module to obtain a smooth sampled grid; and
[017] obtaining the corrected image by using the sampling grid through bilinear interpolation sampling on the original image to be corrected.
[018] Preferably, the convolutional layer can be padded, and the specific padding method comprises the step of affixing a circle of pixel dots with a pixel value of 0 on the top, bottom, left and right of the original image or feature map (i.e., zero-padding).
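A minimal NumPy sketch of this zero-padding step (illustrative, not from the patent; the 3*16*48 shape is the scaled input size used later in the embodiment):

```python
import numpy as np

def zero_pad(feature_map: np.ndarray, width: int = 1) -> np.ndarray:
    """Affix `width` rings of zero-valued pixels on the top, bottom,
    left and right of a (channels, height, width) feature map."""
    return np.pad(feature_map, ((0, 0), (width, width), (width, width)),
                  mode="constant", constant_values=0)

fm = np.ones((3, 16, 48), dtype=np.float32)   # e.g. the scaled 3*16*48 input
padded = zero_pad(fm)
print(padded.shape)   # (3, 18, 50): a 3*3 convolution with step 1 now preserves 16*48
```

With one ring of zeros, a 3*3 convolution with step size 1*1 keeps the spatial size unchanged, which matches the feature map sizes listed in Table 1.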
[019] Preferably, the feature encoding network is used for converting image data into time-series features with contextual information by taking a convolutional neural network and a long-short term memory network as basic units.
[020] Preferably, the attention mechanism-based decoder introduces a long-short term memory (LSTM) network in the decoding process to gradually recognize each character in the image, and the specific recognition method comprises the steps of:
[021] calculating an attention weight matrix by the attention mechanism-based decoder according to the time-series features output by the feature encoding network and the hidden state at a time point on the long-short term memory network;
[022] normalizing the attention weight matrix to obtain the probability distribution thereof;
[023] weighting and summing the time-series features obtained by encoding the feature encoding network according to the probability distribution of the attention weight matrix, so as to obtain the attention features at the current moment;
[024] updating the hidden state of the long-short term memory network according to the attention features of the current moment and the probability distribution predicted based on the characters of the previous moment;
[025] decoding through the fully connected layer and inputting the decoding result to the softmax layer for probability normalization, so as to obtain the probability distribution of the predicted characters; and
[026] selecting a character corresponding to the value with the maximum confidence in the probability distribution as the current decoded output character, and completing the recognition of characters in the images.
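The five decoding steps above can be sketched in NumPy. This is an illustrative sketch only: the dimensions, the charset size of 37, and the random parameter values are assumptions, and a real implementation would produce the updated state s_t with an actual LSTM cell before the output projection (here the previous state stands in for it):

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, S, V = 25, 512, 256, 37   # seq length, feature dim, state dim, charset size (all assumed)

H = rng.normal(size=(L, D))       # time-series features h_1..h_L from the encoder
s_prev = rng.normal(size=S)       # decoder hidden state from the previous moment

# trainable parameters, randomly initialised for illustration
w  = rng.normal(size=S)
Ws = rng.normal(size=(S, S)) * 0.01
Wh = rng.normal(size=(S, D)) * 0.01
b  = np.zeros(S)

# step 1: unnormalised attention weights over the L time steps
e = np.array([w @ np.tanh(Ws @ s_prev + Wh @ H[j] + b) for j in range(L)])

# step 2: softmax normalisation -> attention probability distribution
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# step 3: attention feature as the weighted sum of the time-series features
g = alpha @ H

# steps 4-5: an LSTM cell would update the hidden state from the previous
# prediction, g and s_prev; the output projection is shown with s_prev
# standing in for the updated state.
U = rng.normal(size=(V, S)) * 0.01
d = np.zeros(V)
logits = U @ s_prev + d
y = np.exp(logits - logits.max())
y /= y.sum()
char_index = int(np.argmax(y))    # greedy choice of the decoded character
```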
[027] Preferably, the recognition network training comprises the steps of:
[028] minimizing the cross entropy loss by using an adaptive gradient descent method and taking the probability of each character in the training data string output at the corresponding time point as cross entropy.
[029] Preferably, the weight parameters in the recognition network are initialized by a random Gaussian distribution initialization method.
[030] The present invention discloses the following technical effects:
[031] (1) The automatic learning algorithm with a deep network structure learns effective representations from data and improves the recognition accuracy.
[032] (2) The present invention combines the end-to-end network design and the physically meaningful gradient information returned by the recognition model to effectively guide the correction network by a weak supervision training method, which greatly reduces the recognition difficulty of the recognition model and improves the recognition accuracy in practical applications.
[033] (3) The method of the present invention introduces an idea of decomposition into the design of the corrector, in which irregular text images are decomposed into image blocks with small deformation, thus greatly reducing the difficulty of correcting irregular texts. A grid mapping module is designed in the correction network to ensure that the whole correction process is smooth, which makes the correction transformation flexible and efficient and plays a good role in correcting irregular texts, with high recognition accuracy, strong robustness and good recognition performance for irregularly shaped texts.
BRIEF DESCRIPTION OF THE FIGURES
[034] In order to explain more clearly the embodiments in the present invention or the technical solutions in the prior art, the following will briefly introduce the figures needed in the description of the embodiments. Obviously, figures in the following description are only some embodiments of the present invention, and for a person skilled in the art, other figures may also be obtained based on these figures without paying any creative effort.
[035] Fig. 1 is a flow chart of the text recognition method of the present invention.
[036] Fig. 2 is a schematic diagram of the overall structure of the text recognition method of the present invention.
[037] Fig. 3 is a structural diagram of the sequence transformation corrector network of the present invention.
[038] Fig. 4 is a schematic diagram of the verification results according to an embodiment of the present invention.
DESCRIPTION OF THE INVENTION
[039] The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the embodiments described are only a part of the embodiments of the present invention and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative efforts shall fall within the scope of the present invention.
[040] In order to make the above objects, features and advantages of the present invention more obvious and readily understood, the present invention will be further explained in detail with reference to the drawings and specific embodiments.
[041] Referring to Figs. 1-3, the example provides a natural scene text recognition method based on sequence transformation correction and an attention mechanism, comprising the steps of:
[042] S1. data acquisition: acquiring training set and test set samples;
[043] synthesizing natural scene text line images by using open source codes together with text corpora obtained from the Internet as the training set, taking public natural scene text recognition datasets as the test set, and saving each image in the corresponding folder.
[044] S2. data processing:
[045] scaling the training set and test set images, with the processed image size of 64*192; and
[046] making all images in the training set and test set into lightning memory mapped database (LMDB) format to improve the image reading speed.
[047] S3: label generation:
[048] The recognition network is trained by a supervision method in the present invention; therefore, the training set images are processed by adding labels in the form of text content corresponding to each text line image.
[049] S4. network training: constructing a recognition network, wherein the recognition network comprises a sequence transformation corrector and an attention mechanism-based text recognizer; and inputting the training data and the processed labels into the pre-designed recognition network to complete the training of the recognition network; and the network training specifically comprising the steps of:
[050] S4.1. building a sequence transformation corrector, with the network structure and parameter settings in this example shown in Table 1.
[051] Table 1
| Network layer | Specific operations | Feature map size |
| --- | --- | --- |
| Input layer | - | 3*64*192 |
| Scaling layer | Scaling the image to 16*48 | 3*16*48 |
| Convolutional layer | Number of kernels 32, convolutional kernel 3*3, step size 1*1, padding | 32*16*48 |
| Nonlinear layer | - | 32*16*48 |
| Convolutional layer | Number of kernels 64, convolutional kernel 3*3, step size 1*1, padding | 64*16*48 |
| Nonlinear layer | - | 64*16*48 |
| Convolutional layer | Number of kernels 128, convolutional kernel 3*3, step size 1*1, padding | 128*16*48 |
| Nonlinear layer | - | 128*16*48 |
| Pooling layer | Pooling kernel 2*2, step size 2*2 | 128*8*24 |
| Convolutional layer | Number of kernels 256, convolutional kernel 3*3, step size 1*1, padding | 256*8*24 |
| Nonlinear layer | - | 256*8*24 |
| Convolutional layer | Number of kernels 256, convolutional kernel 3*3, step size 1*1, padding | 256*8*24 |
| Nonlinear layer | - | 256*8*24 |
| Convolutional layer | Number of kernels 256, convolutional kernel 3*3, step size 1*1, padding | 256*8*24 |
| Nonlinear layer | - | 256*8*24 |
| Pooling layer | Pooling kernel 2*2, step size 2*2 | 256*4*12 |
| Decomposition layer | Decomposing the feature map into N pieces in the horizontal direction | N blocks of 256*4*(12/N) |
| Fully connected layer | Number of nodes 256 | 256 |
| Fully connected layer | Number of nodes 6 | 6 |
[052] wherein the specific method for the padding of the convolutional layers in Table 1 comprises the following step: affixing a circle of pixel dots with a pixel value of 0 on the top, bottom, left and right of the original image or feature map (i.e., zero-padding); the nonlinear layers adopt the ReLU activation function, and the pooling layers adopt the maximum pooling method.
[053] The scaling layer of the sequence transformation corrector effectively improves the network receptive field by zooming out the images, reduces the computation, avoids the input of a large amount of noise, and improves the robustness of the module.
[054] The method for correcting images by the sequence transformation corrector comprises the steps of:
[055] firstly, inputting the image to the sequence transformation corrector, and processing it through the scaling layer, the convolutional layers, the nonlinear layers and the pooling layers in Table 1 to obtain a feature map of size 256*4*12; and
[056] secondly, decomposing the feature map into N image blocks disjoint from each other in the horizontal direction by means of the decomposition layer, and inputting the features of each image block to a localization network consisting of two fully connected layers, and predicting the transformation parameters of each image block through the localization network, as shown in Formula (1):
[057] T(patch_i | θ) = [θ_11, θ_12, θ_13; θ_21, θ_22, θ_23] ........................ (1)
[058] where θ represents the parameters of the neural network, patch_i represents the i-th image block, i ∈ [1, N], and T(patch_i | θ) represents the 2*3 affine transformation parameters obtained by inputting the features of the i-th image block into the localization network;
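For illustration only (not from the patent), a 2*3 affine parameter matrix of the kind the localization network predicts can be applied to a point in homogeneous coordinates as follows; the matrix values here are arbitrary:

```python
import numpy as np

# A hypothetical 2*3 affine parameter matrix: identity plus a translation.
T = np.array([[1.0, 0.0, 0.25],
              [0.0, 1.0, -0.25]])

def apply_affine(T: np.ndarray, x: float, y: float) -> tuple:
    """Apply a 2*3 affine transform to the point (x, y) in homogeneous coordinates."""
    out = T @ np.array([x, y, 1.0])
    return float(out[0]), float(out[1])

print(apply_affine(T, 0.5, 0.5))   # -> (0.75, 0.25)
```

Six parameters per block (two rows of three) are exactly what the final fully connected layer with 6 nodes in Table 1 outputs.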
[059] thirdly, inputting the transformation parameter of each image block to the grid mapping module to obtain a smooth sampled grid, with the specific process as follows:
[060] taking the height and width of the image block input to the sequence transformation corrector as H_i and W_i respectively, and taking the height and width of the image block after being corrected by the sequence transformation corrector as H_o and W_o;
[061] calculating the image block to which the coordinate position (x_o, y_o) on the sampling grid belongs, as shown in Formula (2):
[062] i = ⌈x_o × N / W_o⌉, i ∈ {1, 2, ..., N} ........................ (2)
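Reading Formula (2) as a ceiling division (the rounding convention is an interpretation of the original), the block assignment can be sketched as:

```python
import math

def block_index(x_o: float, W_o: int, N: int) -> int:
    """Which of the N horizontal image blocks the sampling-grid
    x-coordinate x_o falls into (1-based), per Formula (2)."""
    i = math.ceil(x_o * N / W_o)
    return min(max(i, 1), N)   # clamp to the valid range {1, ..., N}

# A 192-pixel-wide sampling grid split into N = 8 blocks of 24 pixels each:
assert block_index(24, 192, 8) == 1
assert block_index(25, 192, 8) == 2
assert block_index(192, 192, 8) == 8
```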
[063] mapping the coordinate position (x_o, y_o) on the sampling grid to the hidden grid, and obtaining the coordinate (x_h, y_h), with the mapping calculation process shown in Formula (3):
[064] [x_h; y_h] = T(patch_i | θ) × [x_o × n × N / W_o; y_o × m / H_o; 1] ........................ (3)
[065] where, n and m represent the width and height of each block grid in the hidden grid, respectively;
[066] smoothly mapping the coordinates (x_h, y_h) in the hidden grid to the coordinate positions (x, y) in the input image block grid through bilinear interpolation, with the mapping calculation process shown in Formula (4):
[067] [x; y] = [W_i / (n × N), 0; 0, H_i / m] × [x_h; y_h] ........................ (4)
[068] In summary, the entire grid mapping process is expressed as (x, y) = P(x_o, y_o),
[069] where P represents the grid mapping function; based on Formula (3) and Formula (4), the grid mapping function is shown in Formula (5):
[070] P(x_o, y_o) = [W_i / (n × N), 0; 0, H_i / m] × T(patch_i | θ) × [x_o × n × N / W_o; y_o × m / H_o; 1] ............... (5)
[071] and finally, obtaining the corrected image through bilinear interpolation sampling on the original input image by using the sampling grid, with the sampling calculation process shown in Formula (6):
Hi Wi
Y Y .= max(0,1 - Ix, - ul) max(0,1 - |yj - vl)
[072] u V ...... (6)
[073] where, represents the pixel value at position (X- Y-) in the output
image, and represents the pixel value at position (U1 ) in the input image.
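The bilinear sampling of Formula (6) can be written directly in NumPy; the double loop below mirrors the double sum for clarity (a practical implementation visits only the four neighbouring pixels, since all other weights are zero):

```python
import numpy as np

def bilinear_sample(U: np.ndarray, x: float, y: float) -> float:
    """Bilinear interpolation per Formula (6): the pixel value at the
    real-valued position (x, y) of input image U, where (u, v) run
    over the integer pixel positions."""
    H, W = U.shape
    value = 0.0
    for u in range(H):
        for v in range(W):
            value += U[u, v] * max(0.0, 1 - abs(x - u)) * max(0.0, 1 - abs(y - v))
    return value

U = np.array([[0.0, 1.0],
              [2.0, 3.0]])
print(bilinear_sample(U, 0.5, 0.5))   # midpoint of the four pixels -> 1.5
```

Because every term in the sum is a product of two piecewise-linear weights, the sampling is differentiable almost everywhere, which is what lets the corrector be trained by gradient descent.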
[074] The above transformation processes are all differentiable, which ensures that the sequence transformation corrector can update and optimize its parameters by the gradient descent algorithm.
[075] S4.2. building an attention mechanism-based text recognizer;
[076] firstly, constructing a feature encoding network for converting the image data into the time-series features with contextual information by taking the convolutional neural network and the long-short term memory network as basic units;
[077] the convolutional neural network being structured as: input (32*100) → 64C3 → MP22 → 128C3 → MP22 → 256C3 → 256C3 → MP21 → 512C3 → MP21 → 512C2, where in pCq, p represents the number of convolution output channels, q represents the size of the convolutional kernel, and C represents a convolutional layer (for example, 64C3 represents a convolutional layer with a convolutional kernel size of 3 and 64 output channels); and in MPef, e and f represent the width-height and the step size of the maximum pooling layer respectively, and MP represents the maximum pooling layer (for example, MP22 represents the maximum pooling layer with width, height and step size of 2);
[078] allowing the input image to be processed in the convolutional neural network to obtain a feature with the height of 1, inputting the feature to a BLSTM network consisting of two-layer bidirectional long-short term memory, and extracting the context-dependent time-series features;
[079] secondly, inputting the time-series features H = [h_1, h_2, ..., h_L] obtained by encoding the feature encoding network to an attention mechanism-based decoder, and obtaining the character prediction results, where L represents the length of the time-series features; the attention mechanism-based decoder introduces an LSTM network in the decoding process to gradually recognize each character in the image, and the specific recognition method comprises the steps of:
[080] calculating an attention weight matrix e_t by the attention mechanism-based decoder according to the time-series features H output by the feature encoding network and the hidden state s_{t-1} at the previous time point on the long-short term memory network, as shown in Formula (7):
[081] e_{t,j} = w^T Tanh(W_s × s_{t-1} + W_h × h_j + b) ........................ (7)
[082] where w, W_s, W_h and b represent trainable parameters, Tanh represents the activation function, and j represents the ordinal number of the time series, j ∈ [1, L];
[083] normalizing the attention weight matrix e_t to obtain the probability distribution α_t thereof, as shown in Formula (8):
[084] α_{t,j} = exp(e_{t,j}) / Σ_{j'=1..L} exp(e_{t,j'}) ........................ (8)
[085] weighting and summing the time-series features obtained by encoding the feature encoding network according to the probability distribution of the attention weight matrix, so as to obtain the attention features g_t at the current moment, as shown in Formula (9):
[086] g_t = Σ_{j=1..L} α_{t,j} × h_j .................................... (9)
[087] updating the hidden state of the long-short term memory network according to the attention features g_t of the current moment and the probability distribution y_{t-1} predicted based on the characters of the previous moment, as shown in Formula (10):
[088] s_t = LSTM(y_{t-1}, g_t, s_{t-1}) ........................... (10)
[089] decoding through the fully connected layer and inputting the decoding result to the softmax layer for probability normalization, so as to obtain the probability distribution y_t of the predicted characters, as shown in Formula (11):
[090] y_t = Softmax(U × s_t + d) ........................... (11)
[091] where, U and d represent trainable parameters;
[092] selecting a character corresponding to the value with the maximum confidence from y_t as the current decoded output character;
[093] S4.3. setting training parameters:
[094] sending the training data to the network for training, and allowing the network to traverse the training dataset for 10 times, where the read-in batch size is set to 64, the initial learning rate of the attention mechanism-based text recognizer is set to 1, and the initial learning rate of the sequence transformation corrector is set to 0.1, and then the learning rate of the whole network is decreased by a factor of 10 when the dataset is traversed for 6 and 8 times;
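The stated schedule (initial rates of 1 and 0.1, divided by 10 when the dataset has been traversed 6 and 8 times) can be sketched as follows; reading the decay points as epoch boundaries is an interpretation:

```python
def learning_rate(initial_lr: float, epoch: int) -> float:
    """Step schedule: divide the learning rate by 10 after the 6th
    and again after the 8th traversal of the training set (epochs
    counted from 1, 10 epochs in total)."""
    lr = initial_lr
    if epoch > 6:
        lr /= 10.0
    if epoch > 8:
        lr /= 10.0
    return lr

# recognizer starts at 1, corrector at 0.1
assert learning_rate(1.0, 6) == 1.0
assert learning_rate(1.0, 7) == 0.1
```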
[095] an adaptive gradient descent method is used as the optimization algorithm, with a loss function L as shown in Formula (12):
[096] L = −(1/B) Σ_{b=1..B} Σ_{a=1..|C^(b)|} log p(c_a^(b) | I^(b), θ) ............ (12)
[097] where B represents the number of samples used for this batch optimization, p(c_a^(b) | I^(b), θ) represents the probability of outputting the character c_a^(b) from the b-th sample image I^(b) at the moment a, and |C^(b)| represents the length of the b-th sample string label;
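Reading Formula (12) as the batch-averaged sum of negative log-probabilities of the ground-truth characters (an interpretation of the loss described above), it can be sketched as:

```python
import numpy as np

def recognition_loss(char_probs: list) -> float:
    """Cross-entropy loss as interpreted from Formula (12):
    char_probs[b][a] is the probability the network assigns to the
    ground-truth character of sample b at moment a; the summed
    negative log-probabilities are averaged over the batch."""
    B = len(char_probs)
    total = 0.0
    for sample in char_probs:          # one entry per string label
        for p in sample:               # one probability per character
            total += -np.log(p)
    return total / B

# two samples: one confidently correct, one uncertain
batch = [[0.9, 0.8, 0.95], [0.5, 0.6]]
loss = recognition_loss(batch)
```

The loss goes to zero only when every ground-truth character receives probability 1 at its time step, which is exactly what minimizing the cross entropy enforces.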
[098] S4.4. initialization of recognition network weight: all weight parameters in the network are initialized at the beginning of training by the random Gaussian distribution initialization method;
[099] S4.5. recognition network training: minimizing the cross entropy loss by using an adaptive gradient descent method and taking the probability of each character in the training data string output at the corresponding time point as cross entropy, i.e. the loss function is minimized; and guiding the training of the sequence transformation corrector by the attention mechanism-based text recognizer, which realizes the weak supervision of the recognition network training process and effectively improves the accuracy of text data recognition of irregular natural scenes;
[0100] S5. network testing: inputting test data into the trained network to finally obtain a recognition result of text line in the images, which specifically comprises the steps of:
[0101] S5.1: inputting test dataset samples, selecting the character with the maximum confidence at each time step as the predicted character based on the greedy algorithm, and putting these characters together to obtain the final predicted text line; and
[0102] S5.2: after recognition is completed, calculating the line recognition accuracy and the edit distance based on the comparison of the recognized text line results with the labeled ones.
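The two evaluation metrics in S5.2 can be computed with standard implementations; the following is an illustrative sketch, not the patent's own code:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a recognized text line and its label."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def line_accuracy(predictions: list, labels: list) -> float:
    """Fraction of text lines recognized exactly."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

assert edit_distance("D0NOVAN", "DONOVAN") == 1   # one substituted character
assert line_accuracy(["DONOVAN", "TEXT"], ["DONOVAN", "TEST"]) == 0.5
```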
[0103] In order to further verify the effectiveness and robustness of the text recognition method of the present invention, an image of 64*192 is selected in this example, and the correction result and recognition result are shown in Fig. 4. Fig. 4 shows that the input image text is regularly arranged after being processed by the corrector, which enables the recognizer to accurately recognize the text in the images. The text recognition method of the present invention is highly robust and effective.
[0104] According to the present invention, the natural scene text recognition method based on sequence transformation correction and an attention mechanism reduces the recognition difficulty of the subsequent recognizer by correcting the irregular text. The training of the correction network is guided by the recognition model in combination with a weak supervision method, and no location coordinate labels are used in the training process.
[0105] The method of the present invention introduces an idea of decomposition in the design of the correction network, in which irregular text images are decomposed into various image blocks with small deformation, thus greatly reducing the difficulty of correcting irregular texts; a grid mapping module is also designed in the correction network to ensure that the whole correction process is smooth; and the recognition algorithm is adopted based on the attention mechanism in the design of the recognition network, which can effectively improve the recognition accuracy of natural scene text, especially in the irregular natural scene text data set.
[0106] In the description of the present invention, it should be understood that the terms "longitudinal", "transverse", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inside" and "outside" indicate the orientation or positional relationship shown in the drawings merely for the convenience of describing the present invention, and are not intended to indicate or imply that the device or component referred to must have a specific orientation or be constructed and operated in a specific orientation. Therefore, these terms shall not be construed as limiting the present invention.
[0107] Although the invention has been described with reference to specific examples, it will be appreciated by those skilled in the art that the invention may be embodied in many other forms, in keeping with the broad principles and the spirit of the invention described herein.
[0108] The present invention and the described embodiments specifically include the best method known to the applicant of performing the invention. The present invention and the described preferred embodiments specifically include at least one feature that is industrially applicable.

Claims (7)

THE CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS:
1. A natural scene text recognition method based on sequence transformation correction and an attention mechanism, characterized by comprising the steps of:
data acquisition: acquiring training set and test set samples;
data processing: scaling the training set and test set images;
label generation: labeling the training set images;
network training: constructing a recognition network, and inputting the training data and the processed labels into a pre-designed recognition network to complete the training of the recognition network;
the recognition network comprising a sequence transformation corrector and an attention mechanism-based text recognizer, wherein the sequence transformation corrector comprises multiple convolutional layers, a nonlinear layer and a pooling layer, and further comprises a decomposition layer and a localization network composed of multiple fully connected layers, and the attention mechanism-based text recognizer comprises a feature encoding network and an attention mechanism-based decoder; and
network testing: inputting test data to the trained recognition network to obtain the recognition result of the text lines in the images.
2. The natural scene text recognition method based on sequence transformation correction and an attention mechanism according to claim 1, characterized in that the sequence transformation corrector further comprises a scaling layer and a grid mapping module, and the method for correcting images by the sequence transformation corrector comprises the steps of:
obtaining a feature map of the image to be corrected by processing the image through the scaling layer, the convolutional layer, the nonlinear layer and the pooling layer;
decomposing the feature map into N image blocks disjoint from each other in the horizontal direction by means of the decomposition layer, and inputting the features of each image block to a localization network, through which the transformation parameters of each image block are predicted; inputting the transformation parameters of each image block to the grid mapping module to obtain a smooth sampling grid; and obtaining the corrected image by applying bilinear interpolation sampling to the original image to be corrected using the sampling grid.
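The final step of claim 2, sampling the corrected image through bilinear interpolation, can be sketched as below. The sampling grid here is supplied directly as fractional coordinates for illustration; in the claimed method it is produced by the grid mapping module.

```python
import numpy as np

def bilinear_sample(img, grid_y, grid_x):
    """Sample a single-channel image (H, W) at fractional
    coordinates (grid_y, grid_x) via bilinear interpolation."""
    h, w = img.shape
    y0 = np.clip(np.floor(grid_y).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(grid_x).astype(int), 0, w - 2)
    dy, dx = grid_y - y0, grid_x - x0
    # Blend the four neighbouring pixels by their fractional offsets.
    top = img[y0, x0] * (1 - dx) + img[y0, x0 + 1] * dx
    bot = img[y0 + 1, x0] * (1 - dx) + img[y0 + 1, x0 + 1] * dx
    return top * (1 - dy) + bot * dy
```

Because bilinear interpolation is differentiable with respect to the grid coordinates, gradients can flow back through the sampler into the localization network, which is what allows the corrector to be trained without location coordinate labels.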
3. The natural scene text recognition method based on sequence transformation correction and an attention mechanism according to claim 1, characterized in that the convolutional layer can apply zero-padding, and the specific padding method comprises the step of adding a border of pixels around the top, bottom, left and right of the original image or feature map, with the added pixels having a value of 0.
4. The natural scene text recognition method based on sequence transformation correction and an attention mechanism according to claim 1, characterized in that the feature encoding network is used for converting image data into time-series features with contextual information by taking a convolutional neural network and a long-short term memory network as basic units.
5. The natural scene text recognition method based on sequence transformation correction and an attention mechanism according to claim 4, characterized in that the attention mechanism-based decoder introduces a long short-term memory (LSTM) network in the decoding process to recognize each character in the image step by step, and the specific recognition method comprises the steps of:
calculating an attention weight matrix by the attention mechanism-based decoder according to the time-series features output by the feature encoding network and the hidden state of the long short-term memory network at the previous time point;
normalizing the attention weight matrix to obtain the probability distribution thereof;
weighting and summing the time-series features output by the feature encoding network according to the probability distribution of the attention weight matrix, so as to obtain the attention feature at the current moment;
updating the hidden state of the long short-term memory network according to the attention feature at the current moment and the probability distribution predicted for the character at the previous moment;
decoding the hidden state through a fully connected layer and inputting the result to a softmax layer for probability normalization, so as to obtain the probability distribution of the predicted character; and
selecting the character corresponding to the maximum confidence in the probability distribution as the current decoded output character, thereby completing the recognition of the characters in the image.
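One attention step of the decoder in claim 5 can be sketched schematically as follows. This sketch assumes an additive-attention score function and omits the LSTM state update and output classifier; all weight names (`W_f`, `W_h`, `v`) are illustrative, not from the patent.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax (probability normalization)."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(features, hidden, W_f, W_h, v):
    """features: (T, D) time-series features from the encoder;
    hidden: (H,) LSTM hidden state from the previous time point.
    Returns the attention feature (context) and attention weights."""
    # Attention weights from the features and the previous hidden state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (T,)
    alpha = softmax(scores)          # normalized probability distribution
    context = alpha @ features       # weighted sum: attention feature
    return context, alpha
```

In the full decoder, `context` and the previously predicted character would drive the LSTM update, and a fully connected layer plus softmax over `context` and the new hidden state would yield the next character distribution.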
6. The natural scene text recognition method based on sequence transformation correction and an attention mechanism according to claim 5, characterized in that the recognition network training comprises the steps of:
minimizing the cross-entropy loss by using an adaptive gradient descent method, the cross entropy being computed from the probability, output at the corresponding time point, of each character in the training label string.
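The training objective of claim 6 can be sketched as below: the loss is the mean negative log-probability assigned to the ground-truth character at each time step. The claim does not name a particular adaptive optimizer, so only the loss computation is shown; the function name and the epsilon constant are illustrative.

```python
import numpy as np

def cross_entropy_loss(step_probs, target_ids):
    """Mean negative log-probability of the ground-truth character
    at each corresponding time step.

    step_probs: list of per-step probability vectors (after softmax).
    target_ids: list of ground-truth character indices, same length.
    """
    eps = 1e-12  # guards against log(0) for zero-probability entries
    return -np.mean([np.log(p[t] + eps)
                     for p, t in zip(step_probs, target_ids)])
```

An adaptive gradient descent method (e.g. Adadelta or Adam, as one possible choice) would then update all network weights, including those of the corrector, against this loss, which is what enables the weakly supervised training described in paragraph [0104].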
7. The natural scene text recognition method based on sequence transformation correction and an attention mechanism according to claim 1, characterized in that the weight parameters in the recognition network are initialized by a random Gaussian distribution initialization method.
AU2021100391A 2021-01-22 2021-01-22 Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism Active AU2021100391A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2021100391A AU2021100391A4 (en) 2021-01-22 2021-01-22 Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2021100391A AU2021100391A4 (en) 2021-01-22 2021-01-22 Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism

Publications (1)

Publication Number Publication Date
AU2021100391A4 true AU2021100391A4 (en) 2021-04-15

Family

ID=75397211

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2021100391A Active AU2021100391A4 (en) 2021-01-22 2021-01-22 Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism

Country Status (1)

Country Link
AU (1) AU2021100391A4 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111871A (en) * 2021-04-21 2021-07-13 北京金山数字娱乐科技有限公司 Training method and device of text recognition model and text recognition method and device
CN113111871B (en) * 2021-04-21 2024-04-19 北京金山数字娱乐科技有限公司 Training method and device of text recognition model, text recognition method and device
WO2022262239A1 (en) * 2021-06-16 2022-12-22 科大讯飞股份有限公司 Text identification method, apparatus and device, and storage medium
CN113486167A (en) * 2021-07-26 2021-10-08 科大讯飞股份有限公司 Text completion method and device, computer equipment and storage medium
CN113486167B (en) * 2021-07-26 2024-04-16 科大讯飞股份有限公司 Text completion method, apparatus, computer device and storage medium
CN114241495A (en) * 2022-02-28 2022-03-25 天津大学 Data enhancement method for offline handwritten text recognition
CN114241495B (en) * 2022-02-28 2022-05-03 天津大学 Data enhancement method for off-line handwritten text recognition

Similar Documents

Publication Publication Date Title
AU2021100391A4 (en) Natural Scene Text Recognition Method Based on Sequence Transformation Correction and Attention Mechanism
CN111428718B (en) Natural scene text recognition method based on image enhancement
CN109492202B (en) Chinese error correction method based on pinyin coding and decoding model
CN110765966B (en) One-stage automatic recognition and translation method for handwritten characters
US10558893B2 (en) Systems and methods for recognizing characters in digitized documents
CN111428727B (en) Natural scene text recognition method based on sequence transformation correction and attention mechanism
CN110414498B (en) Natural scene text recognition method based on cross attention mechanism
CN110427938A (en) A kind of irregular character recognition device and method based on deep learning
CN109919174A (en) A kind of character recognition method based on gate cascade attention mechanism
CN113096017A (en) Image super-resolution reconstruction method based on depth coordinate attention network model
CN113221874A (en) Character recognition system based on Gabor convolution and linear sparse attention
CN114662788B (en) Seawater quality three-dimensional time-space sequence multi-parameter accurate prediction method and system
CN110837830B (en) Image character recognition method based on space-time convolutional neural network
US20240135610A1 (en) Image generation using a diffusion model
CN114581918A (en) Text recognition model training method and device
CN116110059A (en) Offline handwriting mathematical formula identification method based on deep learning
Wang et al. Recognizing handwritten mathematical expressions as LaTex sequences using a multiscale robust neural network
Zhang et al. A simple and effective static gesture recognition method based on attention mechanism
CN118072318A (en) Character recognition method, device and equipment in filling field and readable storage medium
CN113743315B (en) Handwriting elementary mathematical formula identification method based on structure enhancement
CN116798044A (en) Text recognition method and device and electronic equipment
CN114140317A (en) Image animation method based on cascade generation confrontation network
CN113313127A (en) Text image recognition method and device, computer equipment and storage medium
Li et al. LabanFormer: Multi-scale graph attention network and transformer with gated recurrent positional encoding for labanotation generation
CN117423119A (en) Transformer-based scene handwriting Chinese character recognition method

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)