AU2021100480A4 - Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism - Google Patents
- Publication number
- AU2021100480A4
- Authority
- AU
- Australia
- Prior art keywords
- training
- text
- network
- convolutional
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
- G06V20/63—Scene text, e.g. street names
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Multimedia (AREA)
- Character Discrimination (AREA)
Abstract
Disclosed is a natural scene text recognition method based on a two-dimensional feature attention mechanism, comprising the steps of: 1. data acquisition: synthesizing line text images for training by using a public code, dividing the line text images into a regular training set and an irregular training set by shape, and downloading real text images from the Internet as test data; 2. data processing: stretching each image to a processed size of 32×104; 3. label generation: training a recognition model by a supervised method, wherein each line text image has corresponding text content; 4. network training: training a recognition network by using the data in the training set; and 5. network testing: inputting test data into the trained network to obtain a prediction result for each line text image. According to the present invention, characters are decoded from the two-dimensional features of images by using the attention network, and the recognition accuracy reaches a high level on public data sets. Thus, the method is highly practical and applicable.
Description
Natural Scene Text Recognition Method Based on Two-Dimensional
Feature Attention Mechanism
The present invention relates to a natural scene text recognition method, and in particular to a natural scene text recognition method based on a two-dimensional feature attention mechanism, and belongs to the fields of pattern recognition and artificial intelligence technology.
Text frees the transmission of information between people from the limits of hearing, enabling the spiritual wealth and wisdom of mankind to be passed on through visual information, so that this information can be understood and processed more accurately and the exchange of information between people is promoted. With the rapid development of computer technology, artificial intelligence is gradually changing our lives, making them more convenient and efficient. Moreover, the recent rapid development and wide application of hardware technology, especially GPUs, has made the practical application of deep neural networks possible.
In the real world, people obtain far more information through vision than through the other senses, and within visual information they mainly understand the external environment and obtain important information through text. Since the invention of writing, human beings have conveyed information to, and received information from, the outside world through text in large quantities. To obtain text information, one must first correctly recognize the text perceived visually. For an educated person, it is easy to recognize words in an image correctly; computers, however, cannot recognize the text in an image as easily as a human can.
Text plays an important role in the real world: most of the information that people obtain visually is carried by text, and people have relied heavily on text to obtain information in the past and will continue to do so in the future. The most important step in obtaining text information is to recognize the text correctly, so it is essential that computers recognize text in images correctly. However, text appears in natural scenes in various forms; for example, street signs stand against widely varying backgrounds, and this variability makes it difficult for computers to recognize text information correctly. Moreover, people often arrange text in different shapes, such as curved or broken lines, to achieve artistic effects. Many other factors likewise make it difficult for computers to recognize text in natural scenes correctly. It is therefore necessary to find an effective method for recognizing text in natural scenes.
The research progress of artificial intelligence makes it possible to solve the above problems. In recent years, several research teams have proposed solutions based on deep neural networks for natural scene text recognition. Among these solutions, methods based on an attention mechanism are particularly prominent in the field of natural scene text recognition. Owing to the flexibility of the attention mechanism in decoding and semantic derivation, the recognition rate of models based on the attention mechanism has improved greatly over previous methods. However, traditional attention-based scene text recognition solutions typically compress the input scene text image directly into a one-dimensional feature sequence through a convolutional neural network, which introduces extra noise into the feature sequence.
To solve the above problems, the purpose of the present invention is to provide a natural scene text recognition method based on a two-dimensional feature attention mechanism that achieves a high recognition rate on irregularly arranged text and remains of high practical value on images with rich backgrounds, from which text can still be recognized.
In order to achieve the purpose, the present invention is realized by the
following technical solution: a natural scene text recognition method based on
a two-dimensional feature attention mechanism, comprising the steps of:
1. data acquisition: synthesizing line text images for training by using a public code, dividing the line text images into a regular training set and an irregular training set by shape, and downloading real text images from the Internet as test data;
2. data processing: resizing all training samples so that each processed image is 32×104 pixels while keeping the aspect ratio as close as possible to that of the original image: the height is first stretched to 32 pixels, the width is then scaled according to the original aspect ratio, and any remaining width is filled with a black border (an illustrative preprocessing sketch is given after step 5);
3. label generation: training the recognition model by a supervised method, wherein each line text image has corresponding text content whose label is already saved by the code during data synthesis;
4. network training: inputting the prepared training data and labels into a two-dimensional feature attention network for training, inputting the regular training data first; after the network has been trained to a suitable degree on the regular training data, training it further with the irregular text data, padding each batch of read-in labels with terminators to a consistent length; and
5. network testing: inputting test data into the trained network, calculating the confidence of each character, selecting the character with the maximum confidence as the predicted character according to the greedy algorithm, and concatenating these characters to obtain the final predicted line text.
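By way of illustration only, the preprocessing of step 2 might be realized as in the following sketch (OpenCV/NumPy are assumed tools and the function name is hypothetical, not part of the claimed method):

```python
# Hedged sketch of step-2 preprocessing: resize the height to 32 px,
# scale the width by the original aspect ratio, pad the rest with black.
import cv2
import numpy as np

TARGET_H, TARGET_W = 32, 104  # processed image size from step 2

def resize_and_pad(image: np.ndarray) -> np.ndarray:
    h, w = image.shape[:2]
    new_w = min(TARGET_W, max(1, round(w * TARGET_H / h)))
    resized = cv2.resize(image, (new_w, TARGET_H))
    canvas = np.zeros((TARGET_H, TARGET_W) + image.shape[2:], dtype=image.dtype)
    canvas[:, :new_w] = resized  # black border fills any missing width
    return canvas
```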
Preferably, in step 1, the training data are synthesized by using a public code; the number of synthesized text images should be as large as possible, the text in the synthesized images should cover a variety of fonts, the backgrounds should be as complex and varied as possible, and the total number of images is 20 million.
Preferably, in step 2, the synthesized text images are resized to 32×104, with the aspect ratio of each image kept as consistent as possible with that of the original image: the height is stretched to 32 pixels first, the width is scaled according to the original aspect ratio, and any remaining width is filled with a black border so as to completely preserve the shape information of the original text.
Preferably, step 3 comprises the steps of:
3-1: synthesizing text-containing images by using an online public code and a text corpus, and cutting out the text from each image according to the position of the line text recorded by the code in the file to make line text training samples;
3-2: saving the text content in each text image in the corresponding text
file;
3-3: taking all synthesized training samples as training data and taking
the public real text images downloaded from the network as the test set; and
3-4: packing all samples into files in LMDB database format for accelerated reading (a minimal packing sketch follows).
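A minimal sketch of such LMDB packing is shown below (the Python lmdb binding is assumed; the key scheme is illustrative, not prescribed by the method):

```python
# Pack (image bytes, label) pairs into an LMDB file for fast reading.
import lmdb

def write_lmdb(db_path, samples):
    """samples: iterable of (image_bytes, label_str) pairs."""
    env = lmdb.open(db_path, map_size=1 << 40)  # generous address-space cap
    n = 0
    with env.begin(write=True) as txn:
        for i, (img_bytes, label) in enumerate(samples):
            txn.put(f"image-{i:09d}".encode(), img_bytes)
            txn.put(f"label-{i:09d}".encode(), label.encode("utf-8"))
            n = i + 1
        txn.put(b"num-samples", str(n).encode())
    env.close()
```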
Preferably, step 4 comprises the steps of:
4-1: constructing a feature encoding network by taking a convolutional block and a long short-term memory model as basic units, wherein the feature extraction network in the front part of the network downsamples the features by pooling layers, each with a downsampling factor of 2; the feature maps output by the first and third convolutional layers of a convolutional module are added element-wise to obtain the output feature map of the module, and the convolutional modules themselves do not downsample the feature map; a batch normalization operation follows each convolutional layer in the convolutional block, the result is passed through a linear rectification unit, and the output feature map is finally obtained;
after being processed in the feature extraction network, the obtained feature map, whose height is not 1 (i.e., a two-dimensional feature map), is cut into H sub-feature maps by rows, where H is the height of the two-dimensional feature map; each sub-feature map is input to a network consisting of two layers of bidirectional long short-term memory (BLSTM), so that the feature vectors of each sub-feature map carry contextual information, as expressed by the formula below:
$\{F_{i,1}, F_{i,2}, \dots, F_{i,W}\} = \mathrm{BLSTM}(I_i)$
where $I_i$ represents the sub-feature map of the $i$th row cut from the two-dimensional feature map, $W$ represents the width of the two-dimensional feature map, and $F_{i,j}$ represents the $j$th feature vector obtained after the $i$th sub-feature map is encoded by the BLSTM network; all encoded sub-feature maps are stitched in the horizontal direction to obtain the encoded feature map (an illustrative PyTorch sketch of this encoder follows);
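The following sketch shows, under assumed layer sizes, one way the encoder of step 4-1 could look in PyTorch; class names and dimensions are illustrative only:

```python
# Residual-style convolutional block (cf. Fig. 2) plus row-wise BLSTM.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Three 3x3 convolutions; the outputs of the first and third
    convolutions are added element-wise, with no downsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU())
        self.conv3 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch))
    def forward(self, x):
        f1 = self.conv1(x)
        f3 = self.conv3(self.conv2(f1))
        return f1 + f3  # add feature maps of the 1st and 3rd convolutions

class RowBLSTM(nn.Module):
    """Encode each row of the 2-D feature map with a 2-layer BLSTM."""
    def __init__(self, channels, hidden):
        super().__init__()
        self.rnn = nn.LSTM(channels, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
    def forward(self, fmap):                                  # (B, C, H, W)
        b, c, h, w = fmap.shape
        rows = fmap.permute(0, 2, 3, 1).reshape(b * h, w, c)  # one row = one sequence
        encoded, _ = self.rnn(rows)                           # (B*H, W, 2*hidden)
        return encoded.reshape(b, h, w, -1)                   # stitched encoded map
```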
4-2: constructing a decoding network based on the two-dimensional
feature attention mechanism:
$\alpha_{t,j} = \dfrac{\exp(e_{t,j})}{\sum_{k=1}^{W \times H} \exp(e_{t,k})}$

where $\alpha_t = \{\alpha_{t,1}, \alpha_{t,2}, \dots, \alpha_{t,W \times H}\}$ represents the weights of the feature vectors in the sub-feature maps of the two-dimensional feature map, that is, the attention weight distribution; $H$ in the formula represents the height of the feature map, and $e_{t,j}$ is obtained by the following formula:
$e_{t,j} = V^{\top} \tanh(W r_t + Q F_j + b)$
where $V$, $W$, $Q$ and $b$ are trainable weight parameters, $F$ is the feature map encoded and spliced by the BLSTM, and $r_t$ is the output of the language network, an LSTM model inside the attention network that takes the embedding vector $\mathrm{emb}_{t-1}$ of the character decoded at the previous time step and the hidden-layer output vector $h_{t-1}$ used to decode the previous character:
$r_t = \mathrm{LSTM}(\mathrm{emb}_{t-1}, h_{t-1})$
a rough attention distribution is thus obtained; $\alpha_t$ is multiplied with the corresponding elements of the feature map $F$ to obtain a feature map $F_t$ in which all features except those of the character currently being decoded are filtered out, and $F_t$ is then processed in the attention network to obtain the attention weight distribution $\alpha'_t$ acting on $F_t$:
$\alpha'_{t,j} = \dfrac{\exp(e'_{t,j})}{\sum_{k=1}^{W \times H} \exp(e'_{t,k})}$

$e'_{t,j} = V'^{\top} \tanh(W' g_t + Q' F_{t,j} + b')$
where $V'$, $W'$, $Q'$ and $b'$ are trainable parameters, and the vector $g_t$ represents the rough feature vector of a character, obtained as the weighted sum of the feature map $F$ with the attention weight distribution $\alpha_t$:
$g_t = \sum_{j=1}^{W \times H} \alpha_{t,j} F_j$
after obtaining $\alpha'_t$, the detailed feature vector $g'_t$ for decoding the current character is calculated from the feature map $F_t$:
$g'_t = \sum_{j=1}^{W \times H} \alpha'_{t,j} F_{t,j}$
summing $g_t$ and $g'_t$ gives the vector $g''_t$ used for decoding the current character:

$g''_t = g_t + g'_t$
the probability distribution $y_t$ over characters is obtained by decoding in a fully connected layer and performing probability normalization in a softmax layer:
$y_t = \mathrm{softmax}(\phi(W_c g''_t + b_c))$
where $\phi$ represents the linear rectification unit, and $W_c$ and $b_c$ represent the trainable weights of the fully connected layer; the currently decoded output character $c_t$ is obtained by selecting the character with the maximum confidence in $y_t$ (a sketch of one such decoding step follows).
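Under assumed dimensions, one decoding step of this two-stage attention decoder might be sketched as follows (all module and variable names are illustrative; `F` is the encoded feature map flattened to $W \times H$ vectors):

```python
# One step of the rough + detailed attention decoding of step 4-2.
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class Attention2DStep(nn.Module):
    def __init__(self, feat_dim, hidden, num_classes, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(num_classes, emb_dim)
        self.lang = nn.LSTMCell(emb_dim, hidden)            # language network -> r_t
        self.W = nn.Linear(hidden, feat_dim)
        self.Q = nn.Linear(feat_dim, feat_dim)
        self.Wp = nn.Linear(feat_dim, feat_dim)
        self.Qp = nn.Linear(feat_dim, feat_dim)
        self.v = nn.Linear(feat_dim, 1, bias=False)         # plays the role of V
        self.vp = nn.Linear(feat_dim, 1, bias=False)        # plays the role of V'
        self.cls = nn.Linear(feat_dim, num_classes)         # W_c, b_c

    def forward(self, F, prev_char, state):
        # F: (B, W*H, D); prev_char: (B,) index of the previously decoded char
        h, c = self.lang(self.emb(prev_char), state)        # r_t
        e = self.v(torch.tanh(self.W(h).unsqueeze(1) + self.Q(F))).squeeze(-1)
        alpha = Fn.softmax(e, dim=1)                        # rough attention
        g = (alpha.unsqueeze(-1) * F).sum(1)                # g_t
        Ft = alpha.unsqueeze(-1) * F                        # filtered map F_t
        ep = self.vp(torch.tanh(self.Wp(g).unsqueeze(1) + self.Qp(Ft))).squeeze(-1)
        alphap = Fn.softmax(ep, dim=1)                      # refined attention
        gp = (alphap.unsqueeze(-1) * Ft).sum(1)             # g'_t
        logits = self.cls(Fn.relu(g + gp))                  # y_t before softmax
        return logits, (h, c)
```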
4-3: training parameter settings: the training data are input to the network for training, and the network traverses the training dataset 10 times, with about 310,000 batches of data read during each traversal; the batch size is set to 64, the adaptive gradient descent method ADADELTA is used as the optimization algorithm, and the initial learning rate is set to 1;
the loss function is defined as

$\mathcal{L} = -\sum_{i=1}^{N} \log \prod_{j=1}^{|c^i|} p\left(c_j^i \mid I^i\right)$
where $N$ represents the number of samples in the optimization batch, and $p(c_j^i \mid I^i)$ represents the probability of outputting the character $c_j^i$ from the $i$th sample image at time $j$;
4-4: initialization of weights: all weight parameters in the network are
randomly initialized at the beginning of training; and
4-5: training the convolutional neural network: the probability of outputting each character of the target character string at the corresponding time step defines the cross entropy, and the cross entropy is minimized by the gradient descent method (an illustrative training-loop sketch follows).
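A training loop matching these settings might look as follows (a sketch only; `model` and `loader` are assumed to exist, and the loss is the per-step cross entropy of 4-3):

```python
# Training configuration of 4-3: batch size 64, ADADELTA, initial lr 1,
# 10 passes over the training set, per-character cross entropy.
import torch
import torch.nn as nn

optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
criterion = nn.CrossEntropyLoss()     # -log p(c_j | I) averaged over steps

for epoch in range(10):               # traverse the training dataset 10 times
    for images, targets in loader:    # targets: (B, T), padded with terminators
        logits = model(images)        # (B, T, num_classes)
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```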
Preferably, step 5 comprises the steps of:
5-1: inputting test dataset samples, selecting the character with the maximum confidence as the predicted character according to the greedy algorithm, and concatenating these characters to obtain the final predicted line text; and
5-2: after the recognition is completed, calculating the accuracy rate and the edit distance by a program (a minimal metric sketch follows).
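The step 5-2 metrics could be computed as follows (plain Python; the function names are illustrative):

```python
# Line accuracy and total edit distance between predictions and labels.
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def evaluate(predictions, labels):
    correct = sum(p == l for p, l in zip(predictions, labels))
    total_ed = sum(edit_distance(p, l) for p, l in zip(predictions, labels))
    return correct / len(labels), total_ed
```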
The advantageous effects of the present invention are as follows:
(1) The recognition algorithm, with its deep network structure, automatically learns effective representations from data, improving recognition accuracy.
(2) The present invention has a fast training speed and a high accuracy
compared with the method of detecting the position of each character firstly
and then recognizing each character separately.
(3) Benefiting from high recognition accuracy and robustness, the method of the present invention is much more robust in recognizing text of irregular shapes.
Fig. 1 is a general flowchart of the natural scene text recognition method of the present invention.
Fig. 2 is a flowchart of the convolutional module in the feature extraction
network of the present invention.
Fig. 3 is a schematic diagram of the recognition process of the present
invention.
Fig. 4 is a schematic diagram of the parameter configuration of the deep
convolutional neural network of the present invention.
The technical solutions in the embodiments of the present invention will
be described clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the embodiments described are only a part of the embodiments of the present invention and not all of them.
All other embodiments obtained by those skilled in the art based on the
embodiments of the present invention without creative efforts shall fall within
the scope of the present invention.
Referring to Figs. 1-4, a natural scene text recognition method based on
a two-dimensional feature attention mechanism comprises the steps of:
1. data acquisition: synthesizing line text images for training by using a public code, dividing the line text images into a regular training set and an irregular training set by shape, and downloading real text images from the Internet as test data;
2. data processing: resizing all training samples so that each processed image is 32×104 pixels while keeping the aspect ratio as close as possible to that of the original image: the height is first stretched to 32 pixels, the width is then scaled according to the original aspect ratio, and any remaining width is filled with a black border;
3. label generation: training the recognition model by a supervised method, wherein each line text image has corresponding text content whose label is already saved by the code during data synthesis;
Step 3 comprises the steps of:
3-1: synthesizing text-containing images by using an online public code and a text corpus, cutting out the text from each image according to the position of the line text recorded by the code in the file to make line text training samples, and downloading public natural scene text datasets from the Internet to test the network performance, wherein the line text images of these datasets are taken from real images;
3-2: saving the text content in each text image in the corresponding text
file;
3-3: taking all synthesized training samples as training data, which are divided into regular training data and irregular training data by the shape of the text in the images, and taking the public real text images downloaded from the network as the test set; and
3-4: packing all samples into files in LMDB database format for accelerated reading.
4. network training: inputting the prepared training data and labels into a two-dimensional feature attention network for training, inputting the regular training data first; after the network has been trained to a suitable degree on the regular training data, training it further with the irregular text data, padding each batch of read-in labels with terminators to a consistent length (see the padding sketch below); and
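A minimal sketch of padding each batch of labels with terminator symbols (the terminator index and function name are assumptions):

```python
# Pad every label in a batch with terminators to a uniform length.
import torch

EOS = 0  # assumed index of the terminator character

def pad_labels(label_indices, eos=EOS):
    """label_indices: list of per-image lists of character indices."""
    max_len = max(len(l) for l in label_indices) + 1  # room for one terminator
    batch = torch.full((len(label_indices), max_len), eos, dtype=torch.long)
    for i, l in enumerate(label_indices):
        batch[i, :len(l)] = torch.tensor(l, dtype=torch.long)
    return batch
```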
Step 4 comprises the steps of:
4-1: constructing a feature encoding network by taking a convolutional block and a long short-term memory model as basic units, wherein the feature extraction network in the front part of the network downsamples the features by pooling layers, each with a downsampling factor of 2, and the convolutional block can be represented as a computational process involving convolutional layers;
the feature maps output by the first and third convolutional layers of a convolutional module are added element-wise to obtain the output feature map of the module, and the convolutional modules themselves do not downsample the feature map; a batch normalization operation follows each convolutional layer in the convolutional block, the result is passed through a linear rectification unit, and the output feature map is finally obtained;

after being processed in the feature extraction network, the obtained feature map, whose height is not 1 (i.e., a two-dimensional feature map), is cut into H sub-feature maps by rows, where H is the height of the two-dimensional feature map; each sub-feature map is input to a network consisting of two layers of bidirectional long short-term memory (BLSTM), so that the feature vectors of each sub-feature map carry contextual information, as expressed by the formula below:
$\{F_{i,1}, F_{i,2}, \dots, F_{i,W}\} = \mathrm{BLSTM}(I_i)$
where $I_i$ represents the sub-feature map of the $i$th row cut from the two-dimensional feature map, $W$ represents the width of the two-dimensional feature map, and $F_{i,j}$ represents the $j$th feature vector obtained after the $i$th sub-feature map is encoded by the BLSTM network; all encoded sub-feature maps are stitched in the horizontal direction to obtain the encoded feature map;
4-2: constructing a decoding network based on the two-dimensional
feature attention mechanism:
$\alpha_{t,j} = \dfrac{\exp(e_{t,j})}{\sum_{k=1}^{W \times H} \exp(e_{t,k})}$

where $\alpha_t = \{\alpha_{t,1}, \alpha_{t,2}, \dots, \alpha_{t,W \times H}\}$ represents the weights of the feature vectors in the sub-feature maps of the two-dimensional feature map, that is, the attention weight distribution; $H$ in the formula represents the height of the feature map, and $e_{t,j}$ is obtained by the following formula:

$e_{t,j} = V^{\top} \tanh(W r_t + Q F_j + b)$
where $V$, $W$, $Q$ and $b$ are trainable weight parameters, $F$ is the feature map encoded and spliced by the BLSTM, and $r_t$ is the output of the language network, an LSTM model inside the attention network that takes the embedding vector $\mathrm{emb}_{t-1}$ of the character decoded at the previous time step and the hidden-layer output vector $h_{t-1}$ used to decode the previous character:
$r_t = \mathrm{LSTM}(\mathrm{emb}_{t-1}, h_{t-1})$
a rough attention distribution is thus obtained; $\alpha_t$ is multiplied with the corresponding elements of the feature map $F$ to obtain a feature map $F_t$ in which all features except those of the character currently being decoded are filtered out, and $F_t$ is then processed in the attention network to obtain the attention weight distribution $\alpha'_t$ acting on $F_t$:
$\alpha'_{t,j} = \dfrac{\exp(e'_{t,j})}{\sum_{k=1}^{W \times H} \exp(e'_{t,k})}$

$e'_{t,j} = V'^{\top} \tanh(W' g_t + Q' F_{t,j} + b')$
where $V'$, $W'$, $Q'$ and $b'$ are trainable parameters, and the vector $g_t$ represents the rough feature vector of a character, obtained as the weighted sum of the feature map $F$ with the attention weight distribution $\alpha_t$:
$g_t = \sum_{j=1}^{W \times H} \alpha_{t,j} F_j$
after obtaining $\alpha'_t$, the detailed feature vector $g'_t$ for decoding the current character is calculated from the feature map $F_t$:
$g'_t = \sum_{j=1}^{W \times H} \alpha'_{t,j} F_{t,j}$
summing $g_t$ and $g'_t$ gives the vector $g''_t$ used for decoding the current character:
$g''_t = g_t + g'_t$

the probability distribution $y_t$ over characters is obtained by decoding in a fully connected layer and performing probability normalization in a softmax layer:
$y_t = \mathrm{softmax}(\phi(W_c g''_t + b_c))$
where $\phi$ represents the linear rectification unit, and $W_c$ and $b_c$ represent the trainable weights of the fully connected layer; the currently decoded output character $c_t$ is obtained by selecting the character with the maximum confidence in $y_t$.
4-3: training parameter settings: the training data are input to the network for training, and the network traverses the training dataset 10 times, with about 310,000 batches of data read during each traversal; the batch size is set to 64, the adaptive gradient descent method ADADELTA is used as the optimization algorithm, and the initial learning rate is set to 1;
the loss function is defined as
$\mathcal{L} = -\sum_{i=1}^{N} \log \prod_{j=1}^{|c^i|} p\left(c_j^i \mid I^i\right)$

where $N$ represents the number of samples in the optimization batch, and $p(c_j^i \mid I^i)$ represents the probability of outputting the character $c_j^i$ from the $i$th sample image at time $j$;
4-4: initialization of weights: all weight parameters in the network are
randomly initialized at the beginning of training; and
4-5: training the convolutional neural network: the probability of outputting each character of the target character string at the corresponding time step defines the cross entropy, and the cross entropy is minimized by the gradient descent method.
5. network testing: inputting test data into the trained network, calculating the confidence of each character, selecting the character with the maximum confidence as the predicted character according to the greedy algorithm, and concatenating these characters to obtain the final predicted line text.
Step 5 comprises the steps of:
5-1: during training, inputting the images from the validation set as well as
the labels into the network for validation; and
5-2: after training is completed, inputting the images from the test dataset into the trained network, and calculating by a program the correct recognition rate of the network and the total edit distance between the predicted results and the labels (an illustrative greedy-decoding sketch follows).
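Greedy decoding at test time might be sketched as follows (`model.encode` and `model.step` are assumed interfaces standing in for the encoder and the attention step described above; the character set and terminator index are illustrative):

```python
# Greedy decoding: at each step keep the most confident character.
import torch

@torch.no_grad()
def greedy_decode(model, image, charset, max_len=25, eos=0):
    F = model.encode(image)                  # encoded feature map
    prev = torch.zeros(1, dtype=torch.long)  # assumed start symbol
    state = None
    chars, confidences = [], []
    for _ in range(max_len):
        logits, state = model.step(F, prev, state)
        probs = logits.softmax(-1)
        conf, idx = probs.max(-1)            # greedy choice
        if idx.item() == eos:                # stop at the terminator
            break
        chars.append(charset[idx.item()])
        confidences.append(conf.item())
        prev = idx
    return "".join(chars), confidences
```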
It should be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and may be embodied in other specific forms without departing from the spirit or essential features of the present invention. The embodiments should therefore be regarded as exemplary and non-limiting in every respect; the scope of the present invention is defined by the appended claims rather than by the above description, and is intended to encompass all variations falling within the meaning and scope of equivalents of the claims. Any reference signs in the claims shall not be construed as limiting the claims concerned.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; the specification is written this way only for the sake of clarity. Those skilled in the art should consider the specification as a whole; the technical solutions in the embodiments may be combined appropriately to form other embodiments understandable to those skilled in the art.
Claims (6)
1. A natural scene text recognition method based on a two-dimensional
feature attention mechanism, characterized by comprising the steps of:
1. data acquisition: synthesizing line text images for training by using a public code, dividing the line text images into a regular training set and an irregular training set by shape, and downloading real text images from the Internet as test data;
2. data processing: resizing all training samples so that each processed image is 32×104 pixels while keeping the aspect ratio as close as possible to that of the original image: the height is first stretched to 32 pixels, the width is then scaled according to the original aspect ratio, and any remaining width is filled with a black border;
3. label generation: training the recognition model by a supervised method, wherein each line text image has corresponding text content whose label is already saved by the code during data synthesis;
4. network training: inputting the prepared training data and labels into a two-dimensional feature attention network for training, inputting the regular training data first; after the network has been trained to a suitable degree on the regular training data, training it further with the irregular text data, padding each batch of read-in labels with terminators to a consistent length; and
5. network testing: inputting test data into the trained network, calculating the confidence of each character, selecting the character with the maximum confidence as the predicted character according to the greedy algorithm, and concatenating these characters to obtain the final predicted line text.
2. A natural scene text recognition method based on a two-dimensional feature attention mechanism according to claim 1, characterized in that in step 1, the training data are synthesized by using a public code; the number of synthesized text images should be as large as possible, the text in the synthesized images should cover a variety of fonts, the backgrounds should be as complex and varied as possible, and the total number of images is 20 million.
3. A natural scene text recognition method based on a two-dimensional feature attention mechanism according to claim 1, characterized in that in step 2, the synthesized text images are resized to 32×104, with the aspect ratio of each image kept as consistent as possible with that of the original image, wherein the height is stretched to 32 pixels first, the width is scaled according to the original aspect ratio, and any remaining width is filled with a black border to completely preserve the shape information of the original text.
4. A natural scene text recognition method based on a two-dimensional
feature attention mechanism according to claim 1, characterized in that step 3
comprises the steps of:
3-1: synthesizing text-containing images by using an online public code and a text corpus, and cutting out the text from each image according to the position of the line text recorded by the code in the file to make line text training samples;
3-2: saving the text content in each text image in the corresponding text
file;
3-3: taking all synthesized training samples as training data and taking
the public real text images downloaded from the network as the test set; and
3-4: packing all samples into files in LMDB database format for accelerated reading.
5. A natural scene text recognition method based on a two-dimensional
feature attention mechanism according to claim 1, characterized in that step 4
comprises the steps of:
4-1: constructing a feature encoding network by taking a convolutional block and a long short-term memory model as basic units, wherein the feature extraction network in the front part of the network downsamples the features by pooling layers, each with a downsampling factor of 2; the feature maps output by the first and third convolutional layers of a convolutional module are added element-wise to obtain the output feature map of the module, and the convolutional modules themselves do not downsample the feature map; a batch normalization operation follows each convolutional layer in the convolutional block, the result is passed through a linear rectification unit, and the output feature map is finally obtained;
after being processed in the feature extraction network, the obtained feature map, whose height is not 1 (i.e., a two-dimensional feature map), is cut into H sub-feature maps by rows, where H is the height of the two-dimensional feature map; each sub-feature map is input to a network consisting of two layers of bidirectional long short-term memory (BLSTM), so that the feature vectors of each sub-feature map carry contextual information, as expressed by the formula below:
$\{F_{i,1}, F_{i,2}, \dots, F_{i,W}\} = \mathrm{BLSTM}(I_i)$
where $I_i$ represents the sub-feature map of the $i$th row cut from the two-dimensional feature map, $W$ represents the width of the two-dimensional feature map, and $F_{i,j}$ represents the $j$th feature vector obtained after the $i$th sub-feature map is encoded by the BLSTM network; all encoded sub-feature maps are stitched in the horizontal direction to obtain the encoded feature map;
4-2: constructing a decoding network based on the two-dimensional
feature attention mechanism:
$\alpha_{t,j} = \dfrac{\exp(e_{t,j})}{\sum_{k=1}^{W \times H} \exp(e_{t,k})}$

where $\alpha_t = \{\alpha_{t,1}, \alpha_{t,2}, \dots, \alpha_{t,W \times H}\}$ represents the weights of the feature vectors in the sub-feature maps of the two-dimensional feature map, that is, the attention weight distribution; $H$ in the formula represents the height of the feature map, and $e_{t,j}$ is obtained by the following formula:
$e_{t,j} = V^{\top} \tanh(W r_t + Q F_j + b)$
where $V$, $W$, $Q$ and $b$ are trainable weight parameters, $F$ is the feature map encoded and spliced by the BLSTM, and $r_t$ is the output of the language network, an LSTM model inside the attention network that takes the embedding vector $\mathrm{emb}_{t-1}$ of the character decoded at the previous time step and the hidden-layer output vector $h_{t-1}$ used to decode the previous character:
$r_t = \mathrm{LSTM}(\mathrm{emb}_{t-1}, h_{t-1})$
a rough attention distribution is thus obtained; $\alpha_t$ is multiplied with the corresponding elements of the feature map $F$ to obtain a feature map $F_t$ in which all features except those of the character currently being decoded are filtered out, and $F_t$ is then processed in the attention network to obtain the attention weight distribution $\alpha'_t$ acting on $F_t$:
$\alpha'_{t,j} = \dfrac{\exp(e'_{t,j})}{\sum_{k=1}^{W \times H} \exp(e'_{t,k})}$

$e'_{t,j} = V'^{\top} \tanh(W' g_t + Q' F_{t,j} + b')$
where $V'$, $W'$, $Q'$ and $b'$ are trainable parameters, and the vector $g_t$ represents the rough feature vector of a character, obtained as the weighted sum of the feature map $F$ with the attention weight distribution $\alpha_t$:
$g_t = \sum_{j=1}^{W \times H} \alpha_{t,j} F_j$
after obtaining $\alpha'_t$, the detailed feature vector $g'_t$ for decoding the current character is calculated from the feature map $F_t$:
$g'_t = \sum_{j=1}^{W \times H} \alpha'_{t,j} F_{t,j}$
summing $g_t$ and $g'_t$ gives the vector $g''_t$ used for decoding the current character:
$g''_t = g_t + g'_t$
the probability distribution $y_t$ over characters is obtained by decoding in a fully connected layer and performing probability normalization in a softmax layer:
$y_t = \mathrm{softmax}(\phi(W_c g''_t + b_c))$
where $\phi$ represents the linear rectification unit, and $W_c$ and $b_c$ represent the trainable weights of the fully connected layer; the currently decoded output character $c_t$ is obtained by selecting the character with the maximum confidence in $y_t$.
4-3: training parameter settings: the training data are input to the network for training, and the network traverses the training dataset 10 times, with about 310,000 batches of data read during each traversal; the batch size is set to 64, the adaptive gradient descent method ADADELTA is used as the optimization algorithm, and the initial learning rate is set to 1;
the loss function is defined as
$\mathcal{L} = -\sum_{i=1}^{N} \log \prod_{j=1}^{|c^i|} p\left(c_j^i \mid I^i\right)$
where $N$ represents the number of samples in the optimization batch, and $p(c_j^i \mid I^i)$ represents the probability of outputting the character $c_j^i$ from the $i$th sample image at time $j$;
4-4: initialization of weights: all weight parameters in the network are
randomly initialized at the beginning of training; and
4-5: training the convolutional neural network: the probability of outputting each character of the target character string at the corresponding time step defines the cross entropy, and the cross entropy is minimized by the gradient descent method.
6. A natural scene text recognition method based on a two-dimensional
feature attention mechanism according to claim 1, characterized in that step 5
comprises the steps of:
5-1: inputting test dataset samples, selecting the character with the maximum confidence as the predicted character according to the greedy algorithm, and concatenating these characters to obtain the final predicted line text; and
5-2: after the recognition is completed, calculating the accuracy rate and the edit distance by a program.
-1/4-
Data collection: synthesizing the training set by using a public code, and downloading public text images of real scenes from the Internet as test data;
Data processing: stretching all images to make the height uniform, and making up the width by padding black borders;
Label generation: assigning the text labels corresponding to each image by code, and dividing the training data into a regular text training dataset and an irregular text training dataset according to the shape of the text in the images;
Network training: sending training data to the network for training, sending the regular training data firstly; and then sending the irregular training data;
Network testing: sending the test data to the trained recognition network, and outputting the recognized text characters in the images along with the confidence level of each character.
Fig. 1
-2/4-
Inputting a feature map
Convolutional layer
Batch normalization layer
Linear rectification unit
Outputting a feature map 1
Convolutional layer
Batch normalization layer
Linear rectification unit
Outputting a feature map 2
Convolutional layer
Batch normalization layer
Outputting a feature map 3
Add (feature map 1 + feature map 3)
Total output feature map
Fig. 2
-3/4-
Scene text image → Feature extraction network → 2D feature map → BLSTM coding network → splicing into a one-dimensional feature map → 2D feature attention network → Recognition results
Fig. 3
-4/4-

Layer name | Output size | Configuration parameters
---|---|---
Convolutional block 1 | 32×104 | Convolutional kernel size: 3×3, stride: 1×1, padding: 1, channels: 64; [kernel: 3×3, channels: 64], padding: 1, stride: 1; [kernel: 3×3, channels: 64]
Pooling layer 1 | 16×52 | Pooling kernel size: 2×2, stride: 2×2, padding: 0
Convolutional block 2 | 16×52 | Convolutional kernel size: 3×3, stride: 1×1, padding: 1, channels: 128; [kernel: 3×3, channels: 128], padding: 1, stride: 1; [kernel: 3×3, channels: 128]
Pooling layer 2 | 8×26 | Pooling kernel size: 2×2, stride: 2×2, padding: 0
Convolutional block 3 | 8×26 | Convolutional kernel size: 3×3, stride: 1×1, padding: 1, channels: 256; [kernel: 3×3, channels: 256], padding: 1, stride: 1; [kernel: 3×3, channels: 256]
Convolutional block 4 | 8×26 | Convolutional kernel size: 3×3, stride: 1×1, padding: 1, channels: 256; [kernel: 3×3, channels: 256], padding: 1, stride: 1; [kernel: 3×3, channels: 256]
Convolutional block 5 | 8×26 | Convolutional kernel size: 3×3, stride: 1×1, padding: 1, channels: 512; [kernel: 3×3, channels: 512], padding: 1, stride: 1; [kernel: 3×3, channels: 512]
Convolutional block 6 | 8×26 | Convolutional kernel size: 3×3, stride: 1×1, padding: 1, channels: 512; [kernel: 3×3, channels: 512], padding: 1, stride: 1; [kernel: 3×3, channels: 512]
Fig. 4
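For orientation only, the stack in Fig. 4 could be assembled from the convolutional block sketched earlier; the single input channel is an assumption (grayscale input):

```python
# Assumed PyTorch assembly of the Fig. 4 backbone: six convolutional
# blocks with 2x2 max pooling after blocks 1 and 2, so a 32x104 input
# yields a 512-channel 8x26 feature map.
import torch.nn as nn

def build_backbone(conv_block):
    return nn.Sequential(
        conv_block(1, 64),    nn.MaxPool2d(2),  # 32x104 -> 16x52
        conv_block(64, 128),  nn.MaxPool2d(2),  # 16x52  -> 8x26
        conv_block(128, 256),
        conv_block(256, 256),
        conv_block(256, 512),
        conv_block(512, 512),                   # output: (B, 512, 8, 26)
    )
```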
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021100480A AU2021100480A4 (en) | 2021-01-25 | 2021-01-25 | Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021100480A AU2021100480A4 (en) | 2021-01-25 | 2021-01-25 | Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2021100480A4 true AU2021100480A4 (en) | 2021-04-15 |
Family
ID=75397050
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2021100480A Active AU2021100480A4 (en) | 2021-01-25 | 2021-01-25 | Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2021100480A4 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114092931A (en) * | 2022-01-20 | 2022-02-25 | 中科视语(北京)科技有限公司 | Scene character recognition method and device, electronic equipment and storage medium |
-
2021
- 2021-01-25 AU AU2021100480A patent/AU2021100480A4/en active Active
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110378334B (en) | Natural scene text recognition method based on two-dimensional feature attention mechanism | |
CN110414498B (en) | Natural scene text recognition method based on cross attention mechanism | |
CN110287479B (en) | Named entity recognition method, electronic device and storage medium | |
CN112734775B (en) | Image labeling, image semantic segmentation and model training methods and devices | |
CN110533737A (en) | The method generated based on structure guidance Chinese character style | |
CN108415977A (en) | One is read understanding method based on the production machine of deep neural network and intensified learning | |
CN110795556A (en) | Abstract generation method based on fine-grained plug-in decoding | |
CN110555896B (en) | Image generation method and device and storage medium | |
CN111598979B (en) | Method, device and equipment for generating facial animation of virtual character and storage medium | |
CN111428727B (en) | Natural scene text recognition method based on sequence transformation correction and attention mechanism | |
CN113298151A (en) | Remote sensing image semantic description method based on multi-level feature fusion | |
CN109711465A (en) | Image method for generating captions based on MLL and ASCA-FR | |
CN112633431B (en) | Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC | |
CN113705313A (en) | Text recognition method, device, equipment and medium | |
CN114627282B (en) | Method, application method, equipment, device and medium for establishing target detection model | |
CN113762269A (en) | Chinese character OCR recognition method, system, medium and application based on neural network | |
CN110096591A (en) | Long text classification method, device, computer equipment and storage medium based on bag of words | |
CN114282059A (en) | Video retrieval method, device, equipment and storage medium | |
CN113515669A (en) | Data processing method based on artificial intelligence and related equipment | |
CN116310339A (en) | Remote sensing image segmentation method based on matrix decomposition enhanced global features | |
CN117522697A (en) | Face image generation method, face image generation system and model training method | |
CN115131698A (en) | Video attribute determination method, device, equipment and storage medium | |
AU2021100480A4 (en) | Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism | |
CN116561274A (en) | Knowledge question-answering method based on digital human technology and natural language big model | |
CN113283432A (en) | Image recognition and character sorting method and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) |