AU2021100480A4 - Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism - Google Patents


Info

Publication number
AU2021100480A4
Authority
AU
Australia
Prior art keywords
training
text
network
convolutional
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
AU2021100480A
Inventor
Xue GAO
Lianwen JIN
Canjie LUO
Huiyun MAO
Haoyu Wang
Min Wu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzheng Yunshi Technology Co ltd
South China University of Technology SCUT
Zhuhai Institute of Modern Industrial Innovation of South China University of Technology
Original Assignee
Shenzheng Yunshi Technology Co Ltd
South China University of Technology SCUT
Zhuhai Institute of Modern Industrial Innovation of South China University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzheng Yunshi Technology Co Ltd, South China University of Technology SCUT, Zhuhai Institute of Modern Industrial Innovation of South China University of Technology
Priority to AU2021100480A priority Critical patent/AU2021100480A4/en
Application granted granted Critical
Publication of AU2021100480A4 publication Critical patent/AU2021100480A4/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63Scene text, e.g. street names
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)

Abstract

Disclosed is a natural scene text recognition method based on a two-dimensional feature attention mechanism, comprising the steps of: 1. data acquisition: synthesizing line text images for training by using a public code, dividing them into a regular training set and an irregular training set by shape, and downloading real text images from the network as test data; 2. data processing: stretching each image to a size of 32*104; 3. label generation: training a recognition model by a supervised method, wherein each line text image has corresponding text content; 4. network training: training a recognition network by using the data in the training set; and 5. network testing: inputting test data into the trained network to obtain a prediction result for each line text image. According to the present invention, characters are decoded from the two-dimensional features of images by using the attention network, and the recognition accuracy reaches a high level on public data sets; the method is therefore highly practical and applicable.

Description

Natural Scene Text Recognition Method Based on Two-Dimensional
Feature Attention Mechanism
TECHNICAL FIELD
The present invention relates to a natural scene text recognition method, and in particular to a natural scene text recognition method based on a two-dimensional feature attention mechanism; it belongs to the fields of pattern recognition and artificial intelligence technology.
BACKGROUND
Text breaks the limits that hearing places on information transmission between people, enabling mankind to pass on its spiritual wealth and wisdom through visual information, to understand and process that information more accurately, and to promote information exchange.
With the rapid development of computer technology, artificial intelligence
technology is gradually changing our lives, making our lives more convenient
and efficient. Moreover, the recent rapid development and wide application of
hardware technology, especially GPU, make the practical application of deep
neural network possible.
In the real world, people get much more information through vision than
through other senses. In terms of the visual information, people mainly
understand the external environment and obtain important information through
text. Since the invention of writing, human beings have been conveying
information to and receiving information from the outside world through text in
large quantities. In order to obtain text information, one must first correctly
identify the text obtained through the visual senses. For an educated person, it is easy to correctly identify words from an image. However, computers cannot recognize the text in an image as easily as a human can.
Text plays an important role in the real world. Most of the information that people obtain visually is carried by text, and people have relied heavily on text to obtain information in the past and will continue to do so in the future. The most important step in obtaining text information is to recognize the text correctly, so it is essential that computers can recognize the text in images correctly. However, text appears in natural scenes in various forms; street signs, for example, stand against widely varying backgrounds, and this variability makes it difficult for computers to recognize the text correctly. Moreover, people often arrange text in different shapes, such as curved lines and broken lines, to achieve certain artistic effects. Many other factors also make it difficult for computers to recognize text in natural scenes correctly, so an effective method for recognizing such text is needed.
The research progress of artificial intelligence makes it possible to solve
the above problems. In recent years, several research teams have proposed
solutions based on deep neural networks for natural scene text recognition. In
various solutions, the method based on an attention mechanism is particularly
prominent in the field of natural scene text recognition. Due to the flexibility of
attention mechanism in decoding mode and semantic derivation, the
recognition rate of the model based on the attention mechanism has been
greatly improved compared with previous methods. However, traditional attention-based scene text recognition solutions compress the input scene text image directly into a one-dimensional feature sequence through a convolutional neural network, which introduces extra noise into the feature sequence.
SUMMARY
Aiming at solving the above problems, the purpose of the present invention is to provide a natural scene text recognition method based on a two-dimensional feature attention mechanism that achieves a high recognition rate for irregularly arranged text and retains high practical value for images with rich backgrounds, from which text can still be recognized.
In order to achieve the purpose, the present invention is realized by the
following technical solution: a natural scene text recognition method based on
a two-dimensional feature attention mechanism, comprising the steps of:
1. data acquisition: synthesizing a line text image for training by using a
public code, dividing the line text image into a regular training set and an
irregular training set by shape, and downloading a true text image from the
network as test data;
2. data processing: stretching all training samples to a size of 32*104, keeping the aspect ratio of each image as consistent with that of the original as possible: the height is stretched to 32 pixels first, the width is then stretched according to the original aspect ratio, and any part with insufficient width is filled with a black border;
3. label generation: training a recognition model by a supervised method, wherein each line text image has corresponding text content and its label is already saved by the code during data synthesis;
4. network training: inputting the ready-made training data and labels into a two-dimensional feature attention network for training, and inputting the regular training data firstly; after the network has been trained to a suitable degree through the regular training data, training the network with the irregular text data and filling the length of each batch of read-in labels with terminators to a consistent length; and
5. inputting test data into a trained network, calculating the confidence of
each image, selecting a character with the maximum confidence as a
predicted character based on the greedy algorithm, and putting these
characters together to get a final predicted line text.
Preferably, in step 1, the training data are synthesized by using a public
code, the number of synthesized text images should be as large as possible,
the text in synthesized text images should cover a variety of fonts, the
background should be as complex and varied as possible, and the total
number of images is 20 million.
Preferably, in step 2, the synthesized text images are stretched to a size of 32*104, with the aspect ratio of each image kept as consistent with that of the original as possible: the height is stretched to 32 pixels first, the width is then stretched according to the original aspect ratio, and any part with insufficient width is filled with a black border so as to completely preserve the shape information of the original text.
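By way of illustration, the resize-and-pad operation described above can be sketched in Python as follows (a minimal sketch assuming PIL and NumPy; the function name and the grayscale conversion are illustrative, while the 32-pixel height and 104-pixel width come from the text above):

    # Resize to height 32, stretch width by the original aspect ratio (capped
    # at 104), and pad the remaining width with a black border.
    import numpy as np
    from PIL import Image

    def preprocess(path, target_h=32, target_w=104):
        img = Image.open(path).convert("L")  # grayscale; RGB works the same way
        w, h = img.size
        new_w = min(target_w, max(1, round(w * target_h / h)))
        img = img.resize((new_w, target_h), Image.BILINEAR)
        canvas = np.zeros((target_h, target_w), dtype=np.uint8)  # black canvas
        canvas[:, :new_w] = np.asarray(img)
        return canvas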
Preferably, step 3 comprises the steps of:
3-1: synthesizing a text-containing image by using an online public code
and a text corpus, and cutting out the text from the image according to the
position of line text recorded by the code in the file to make a line text training
sample;
3-2: saving the text content in each text image in the corresponding text
file;
3-3: taking all synthesized training samples as training data and taking
the public real text images downloaded from the network as the test set; and
3-4: packing all samples into files in LMDB database format for accelerated reading.
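As an illustrative sketch of step 3-4, the snippet below packs image/label pairs into an LMDB database with the Python lmdb package; the key layout and the num-samples counter are assumptions for illustration, not taken from the patent:

    import lmdb

    def write_lmdb(db_path, samples):
        """samples: list of (image_bytes, label_str) pairs."""
        env = lmdb.open(db_path, map_size=1 << 40)  # generous maximum size
        with env.begin(write=True) as txn:
            for i, (img_bytes, label) in enumerate(samples):
                txn.put(f"image-{i:09d}".encode(), img_bytes)
                txn.put(f"label-{i:09d}".encode(), label.encode("utf-8"))
            txn.put(b"num-samples", str(len(samples)).encode())
        env.close()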
Preferably, step 4 comprises the steps of:
4-1: constructing a feature encoding network by taking a convolutional
block and a long-short term memory model as basic units, wherein the feature
extraction network in the front part of the network down-samples the features
by pooling layers, and each pooling layer has a downsampling multiple of 2;
the feature maps output by the first convolutional layer and the third
convolutional layer of the convolutional module are added numerically to
obtain an output feature map of the convolutional module; each convolutional
module does not downsample the feature map; a batch standardization
operation is attached after each convolutional layer in the convolutional block,
and the result is output after being processed in the linear rectification unit,
and finally the output feature map is obtained;
after being processed in the feature extraction network, the obtained
feature map with height not being 1, i.e., a two-dimensional feature map is cut
into H sub-feature maps by rows, where H is the height of the two-dimensional
feature map; each sub-feature map is input to a BLSTM network consisting of
two-layer bidirectional long-short term memory (BLSTM), so that the feature
vectors of each sub-feature map have contextual information, as expressed
by the mathematical formula below:
$$\{F_{i,1}, F_{i,2}, \dots, F_{i,W}\} = \mathrm{BLSTM}(I_i)$$
where $I_i$ represents the sub-feature map of the $i$th row cut from the two-dimensional feature map, $W$ represents the width of the two-dimensional feature map, and $F_{i,j}$ represents the $j$th feature vector obtained after the $i$th sub-feature map is encoded by the BLSTM network; all encoded sub-feature maps are stitched in the horizontal direction to obtain the encoded feature map $F$;
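For concreteness, the residual convolutional block of Fig. 2 and the row-wise BLSTM encoding of step 4-1 might be sketched in PyTorch as below; this is a hedged sketch in which layer widths are illustrative (Fig. 4 gives the actual configuration):

    import torch
    import torch.nn as nn

    class ConvBlock(nn.Module):
        """Three 3x3 conv layers with batch normalization; the outputs of the
        first and third layers are added, and the block does not downsample."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.conv1 = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, 1, 1), nn.BatchNorm2d(out_ch), nn.ReLU())
            self.conv2 = nn.Sequential(
                nn.Conv2d(out_ch, out_ch, 3, 1, 1), nn.BatchNorm2d(out_ch), nn.ReLU())
            self.conv3 = nn.Sequential(
                nn.Conv2d(out_ch, out_ch, 3, 1, 1), nn.BatchNorm2d(out_ch))

        def forward(self, x):
            f1 = self.conv1(x)
            f3 = self.conv3(self.conv2(f1))
            return f1 + f3  # numerical addition of feature maps 1 and 3

    class RowBLSTMEncoder(nn.Module):
        """Cuts a (B, C, H, W) feature map into H rows, encodes each row with a
        two-layer bidirectional LSTM, and re-stitches the encoded rows."""
        def __init__(self, channels):
            super().__init__()
            self.blstm = nn.LSTM(channels, channels // 2, num_layers=2,
                                 bidirectional=True, batch_first=True)

        def forward(self, fmap):
            b, c, h, w = fmap.shape
            rows = fmap.permute(0, 2, 3, 1).reshape(b * h, w, c)  # one row per sequence
            encoded, _ = self.blstm(rows)                         # (B*H, W, C)
            return encoded.reshape(b, h, w, c).permute(0, 3, 1, 2)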
4-2: constructing a decoding network based on the two-dimensional
feature attention mechanism:
$$\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k=1}^{W \times H} \exp(e_{t,k})}$$
where $\alpha_t = \{\alpha_{t,1}, \alpha_{t,2}, \dots, \alpha_{t,W \times H}\}$ represents the weights of the feature vectors in each sub-feature map of the two-dimensional feature map, that is, the attention weight distribution, $H$ represents the height of the feature map, and $e_{t,j}$ is obtained by the following formula:
$$e_{t,j} = V \tanh(W r_t + Q F_j + b)$$
$V$, $W$, $Q$ and $b$ represent trainable weight parameters, $F$ is the feature map encoded and spliced by the BLSTM, and $r_t$ is the output of the language network, which consists of a long-short term memory (LSTM) model in the attention network and takes as inputs the embedding vector $emb_{t-1}$ of the character decoded at the previous time step and the hidden-layer output vector $h_{t-1}$ used to decode the previous character:
$$r_t = \mathrm{LSTM}(emb_{t-1}, h_{t-1})$$
a rough attention distribution is thus obtained; $\alpha_t$ is multiplied element-wise with the feature map $F$ to obtain a feature map $F_t$ in which all features except those of the character currently being decoded are filtered out, and $F_t$ is then processed in the attention network to obtain an attention weight distribution $\alpha'_t = \{\alpha'_{t,1}, \alpha'_{t,2}, \dots, \alpha'_{t,W \times H}\}$ acting on $F_t$:
$$\alpha'_{t,j} = \frac{\exp(e'_{t,j})}{\sum_{k=1}^{W \times H} \exp(e'_{t,k})}$$
$$e'_{t,j} = V' \tanh(W' g_t + Q' F_{t,j} + b')$$
$V'$, $W'$, $Q'$ and $b'$ represent trainable parameters, and the vector $g_t$ represents the rough feature vector of a character, obtained as the weighted sum of the feature map $F$ under the attention weight distribution $\alpha_t$:
$$g_t = \sum_{j=1}^{W \times H} \alpha_{t,j} F_j$$
after obtaining $\alpha'_t$, a detailed feature vector $g'_t$ for decoding the current character is calculated from the feature map $F_t$:
$$g'_t = \sum_{j=1}^{W \times H} \alpha'_{t,j} F_{t,j}$$
summing $g_t$ and $g'_t$ gives the vector $g''_t$ used to decode the current character:
$$g''_t = g_t + g'_t$$
the probability distribution $y_t$ over characters is obtained by decoding $g''_t$ in a fully connected layer and performing probability normalization in a softmax layer:
$$y_t = \mathrm{softmax}(\phi(W_c g''_t + b_c))$$
where $\phi$ represents a linear rectification unit, and $W_c$ and $b_c$ represent the trainable weights of the fully connected layer; the current decoded output character $c_t$ is obtained by selecting the character corresponding to the maximum confidence in $y_t$.
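Under the equations above, one decoding step of the two-dimensional attention network could be sketched in PyTorch as follows; the flattened (B, W*H, C) feature layout, the module names, and all dimensions are assumptions for illustration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as Fn

    class TwoDAttentionStep(nn.Module):
        def __init__(self, feat_dim, hidden, num_classes, emb_dim=128):
            super().__init__()
            self.embedding = nn.Embedding(num_classes, emb_dim)
            self.lstm = nn.LSTMCell(emb_dim, hidden)    # language network giving r_t
            self.W = nn.Linear(hidden, hidden)          # rough attention e_{t,j}
            self.Q = nn.Linear(feat_dim, hidden)
            self.V = nn.Linear(hidden, 1)
            self.W2 = nn.Linear(feat_dim, hidden)       # refined attention e'_{t,j}
            self.Q2 = nn.Linear(feat_dim, hidden)
            self.V2 = nn.Linear(hidden, 1)
            self.fc = nn.Linear(feat_dim, num_classes)  # W_c, b_c

        def forward(self, feats, prev_char, state):
            h_prev, c_prev = state
            emb = self.embedding(prev_char)                      # emb_{t-1}
            r_t, c_t = self.lstm(emb, (h_prev, c_prev))          # r_t
            e = self.V(torch.tanh(self.W(r_t).unsqueeze(1) + self.Q(feats)))
            alpha = Fn.softmax(e, dim=1)                # rough attention weights
            g_t = (alpha * feats).sum(dim=1)            # rough glimpse g_t
            F_t = alpha * feats                         # filtered feature map F_t
            e2 = self.V2(torch.tanh(self.W2(g_t).unsqueeze(1) + self.Q2(F_t)))
            alpha2 = Fn.softmax(e2, dim=1)              # refined weights alpha'_t
            g2_t = (alpha2 * F_t).sum(dim=1)            # detailed glimpse g'_t
            y_t = Fn.softmax(torch.relu(self.fc(g_t + g2_t)), dim=1)  # from g''_t
            return y_t, (r_t, c_t)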
4-3: training parameters setting: inputting the training data to the network
for training, and allowing the network to traverse the training dataset for 10
times, with about 310,000 batches of data being read during each traversal of
the training dataset, wherein the size of the read-in batches is set to 64, an
adaptive gradient descent method (ADADELTA) is used as the optimization
algorithm, and the initial learning rate is set to 1;
the loss function is defined as
$$L = -\sum_{i=1}^{N} \sum_{j=1}^{|y_i|} \log p(c_{i,j} \mid I_i)$$
where $N$ represents the data size used in the optimization batch, and $p(c_{i,j} \mid I_i)$ represents the probability of outputting the character $c_{i,j}$ from the $i$th sample image at time $j$;
4-4: initialization of weights: all weight parameters in the network are
randomly initialized at the beginning of training; and
4-5: training convolutional neural network: the probability of outputting
each character of the target character string at a corresponding time point is
used as cross entropy, and the cross entropy is minimized by using the
gradient descent method.
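A minimal training-loop sketch for steps 4-3 to 4-5 follows; the model and data loader are assumed to exist, the model is assumed to return per-step logits of shape (B, T, num_classes) under teacher forcing, and CrossEntropyLoss fuses the softmax with the negative log-likelihood of the loss above:

    import torch
    import torch.nn as nn

    def train_epoch(model, loader, optimizer, pad_idx):
        criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)  # cross entropy (4-5)
        model.train()
        for images, labels in loader:   # read-in batch size is 64 in step 4-3
            logits = model(images, labels)
            loss = criterion(logits.flatten(0, 1), labels.flatten())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # ADADELTA with an initial learning rate of 1, as specified in step 4-3:
    # optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)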
Preferably, step 5 comprises the steps of:
5-1: inputting test dataset samples, selecting a character with the
maximum confidence as a predicted character based on the greedy algorithm, and putting these characters together to get a final predicted line text; and
5-2: after the recognition is completed, calculating the accuracy rate and the edit distance by a program.
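The greedy decoding and the edit-distance evaluation of step 5 might look as follows; the charset and terminator index are assumptions, and the per-step confidence vectors are NumPy arrays or torch tensors:

    def greedy_decode(prob_seq, charset, eos_idx):
        """Pick the most confident character at each step until the terminator."""
        chars = []
        for probs in prob_seq:
            idx = int(probs.argmax())
            if idx == eos_idx:  # terminator ends the line text
                break
            chars.append(charset[idx])
        return "".join(chars)

    def edit_distance(a, b):
        """Levenshtein distance via single-row dynamic programming."""
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                         dp[j - 1] + 1,      # insertion
                                         prev + (ca != cb))  # substitution
        return dp[-1]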
The advantageous effects of the present invention are as follows:
(1) The automatic learning recognition algorithm with deep network
structure helps to learn effective expressions from data well and improve the
accuracy of recognition.
(2) The present invention has a fast training speed and a high accuracy
compared with the method of detecting the position of each character firstly
and then recognizing each character separately.
(3) Owing to its high recognition accuracy and good recognition performance, the method of the present invention is much more robust in recognizing text of irregular shapes.
BRIEF DESCRIPTION OF THE FIGURES
Fig. 1 is a general flowchart of the natural scene text recognition method of
the present invention.
Fig. 2 is a flowchart of the convolutional module in the feature extraction
network of the present invention.
Fig. 3 is a schematic diagram of the recognition process of the present
invention.
Fig. 4 is a schematic diagram of the parameter configuration of the deep
convolutional neural network of the present invention.
DESCRIPTION OF THE INVENTION
The technical solutions in the embodiments of the present invention will
be described clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the embodiments described are only a part of the embodiments of the present invention and not all of them.
All other embodiments obtained by those skilled in the art based on the
embodiments of the present invention without creative efforts shall fall within
the scope of the present invention.
Referring to Figs. 1-4, a natural scene text recognition method based on
a two-dimensional feature attention mechanism comprises the steps of:
1. data acquisition: synthesizing a line text image for training by using a
public code, dividing the line text image into a regular training set and an
irregular training set by shape, and downloading a true text image from the
network as test data;
2. data processing: stretching all training samples to a size of 32*104, keeping the aspect ratio of each image as consistent with that of the original as possible: the height is stretched to 32 pixels first, the width is then stretched according to the original aspect ratio, and any part with insufficient width is filled with a black border;
3. label generation: training a recognition model by a supervised method, wherein each line text image has corresponding text content and its label is already saved by the code during data synthesis;
Step 3 comprises the steps of:
3-1: synthesizing a text-containing image by using an online public code
and a text corpus, cutting out the text from the image according to the position
of line text recorded by the code in the file to make a line text training sample,
and downloading public natural scene text datasets from the Internet to test the network performance, wherein the line text images of these datasets are taken from real images;
3-2: saving the text content in each text image in the corresponding text
file;
3-3: taking all synthesized training samples as training data, divided into regular training data and irregular training data by the shape of the text in the images, and taking the public real text images downloaded from the network as the test set; and
3-4: packing all samples into files in LMDB database format for accelerated reading.
4. network training: inputting the ready-made training data and labels into
a two-dimensional feature attention network for training, and inputting the
regular training data firstly; after the network has been trained to a suitable
degree through the regular training data, training the network with the irregular
text data and filling the length of each batch of read-in labels with terminators
to a consistent length; and
Step 4 comprises the steps of:
4-1: constructing a feature encoding network by taking a convolutional
block and a long-short term memory model as basic units, wherein the feature
extraction network in the front part of the network down-samples the features
by pooling layers, each pooling layer has a downsampling multiple of 2, and
the convolutional block can be represented as a computational process
involving convolutional layers;
the feature maps output by the first convolutional layer and the third
convolutional layer of the convolutional module are added numerically to obtain an output feature map of the convolutional module; each convolutional module does not downsample the feature map; a batch standardization operation is attached after each convolutional layer in the convolutional block, and the result is output after being processed in the linear rectification unit, and finally the output feature map is obtained; after being processed in the feature extraction network, the obtained feature map with height not being 1, i.e., a two-dimensional feature map is cut into H sub-feature maps by rows, where H is the height of the two-dimensional feature map; each sub-feature map is input to a BLSTM network consisting of two-layer bidirectional long-short term memory (BLSTM), so that the feature vectors of each sub-feature map have contextual information, as expressed by the mathematical formula below:
$$\{F_{i,1}, F_{i,2}, \dots, F_{i,W}\} = \mathrm{BLSTM}(I_i)$$
where $I_i$ represents the sub-feature map of the $i$th row cut from the two-dimensional feature map, $W$ represents the width of the two-dimensional feature map, and $F_{i,j}$ represents the $j$th feature vector obtained after the $i$th sub-feature map is encoded by the BLSTM network; all encoded sub-feature maps are stitched in the horizontal direction to obtain the encoded feature map $F$;
4-2: constructing a decoding network based on the two-dimensional
feature attention mechanism:
$$\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k=1}^{W \times H} \exp(e_{t,k})}$$
where $\alpha_t = \{\alpha_{t,1}, \alpha_{t,2}, \dots, \alpha_{t,W \times H}\}$ represents the weights of the feature vectors in each sub-feature map of the two-dimensional feature map, that is, the attention weight distribution, $H$ represents the height of the feature map, and $e_{t,j}$ is obtained by the following formula:
$$e_{t,j} = V \tanh(W r_t + Q F_j + b)$$
$V$, $W$, $Q$ and $b$ represent trainable weight parameters, $F$ is the feature map encoded and spliced by the BLSTM, and $r_t$ is the output of the language network, which consists of a long-short term memory (LSTM) model in the attention network and takes as inputs the embedding vector $emb_{t-1}$ of the character decoded at the previous time step and the hidden-layer output vector $h_{t-1}$ used to decode the previous character:
$$r_t = \mathrm{LSTM}(emb_{t-1}, h_{t-1})$$
a rough attention distribution is thus obtained; $\alpha_t$ is multiplied element-wise with the feature map $F$ to obtain a feature map $F_t$ in which all features except those of the character currently being decoded are filtered out, and $F_t$ is then processed in the attention network to obtain an attention weight distribution $\alpha'_t = \{\alpha'_{t,1}, \alpha'_{t,2}, \dots, \alpha'_{t,W \times H}\}$ acting on $F_t$:
$$\alpha'_{t,j} = \frac{\exp(e'_{t,j})}{\sum_{k=1}^{W \times H} \exp(e'_{t,k})}$$
$$e'_{t,j} = V' \tanh(W' g_t + Q' F_{t,j} + b')$$
$V'$, $W'$, $Q'$ and $b'$ represent trainable parameters, and the vector $g_t$ represents the rough feature vector of a character, obtained as the weighted sum of the feature map $F$ under the attention weight distribution $\alpha_t$:
$$g_t = \sum_{j=1}^{W \times H} \alpha_{t,j} F_j$$
after obtaining $\alpha'_t$, a detailed feature vector $g'_t$ for decoding the current character is calculated from the feature map $F_t$:
$$g'_t = \sum_{j=1}^{W \times H} \alpha'_{t,j} F_{t,j}$$
summing $g_t$ and $g'_t$ gives the vector $g''_t$ used to decode the current character:
$$g''_t = g_t + g'_t$$
the probability distribution $y_t$ over characters is obtained by decoding $g''_t$ in a fully connected layer and performing probability normalization in a softmax layer:
$$y_t = \mathrm{softmax}(\phi(W_c g''_t + b_c))$$
where $\phi$ represents a linear rectification unit, and $W_c$ and $b_c$ represent the trainable weights of the fully connected layer; the current decoded output character $c_t$ is obtained by selecting the character corresponding to the maximum confidence in $y_t$.
4-3: training parameters setting: inputting the training data to the network
for training, and allowing the network to traverse the training dataset for 10
times, with about 310,000 batches of data being read during each traversal of
the training dataset, wherein the size of the read-in batches is set to 64, an
adaptive gradient descent method (ADADELTA) is used as the optimization
algorithm, and the initial learning rate is set to 1;
the loss function is defined as
$$L = -\sum_{i=1}^{N} \sum_{j=1}^{|y_i|} \log p(c_{i,j} \mid I_i)$$
where $N$ represents the data size used in the optimization batch, and $p(c_{i,j} \mid I_i)$ represents the probability of outputting the character $c_{i,j}$ from the $i$th sample image at time $j$;
4-4: initialization of weights: all weight parameters in the network are
randomly initialized at the beginning of training; and
4-5: training convolutional neural network: the probability of outputting
each character of the target character string at a corresponding time point is
used as cross entropy, and the cross entropy is minimized by using the
gradient descent method.
5. inputting test data into a trained network, calculating the confidence of
each image, selecting a character with the maximum confidence as a
predicted character based on the greedy algorithm, and putting these
characters together to get a final predicted line text.
Step 5 comprises the steps of:
5-1: during training, inputting the images from the validation set as well as
the labels into the network for validation; and
5-2: after the training is completed, inputting the images from the test
dataset into the trained network, and calculating the correct recognition rate of
the network and the total edit distance of the predicted results and the labels
by a program.
It should be apparent to those skilled in the art that the present invention
is not limited to the details of the exemplary embodiments described above,
and may be embodied in any other specific forms without departing from the
spirit or essential features of the present invention. Therefore, the
embodiments should be regarded as exemplary and non-limiting from any
point of view, and the scope of the present invention is defined by the appended claims rather than by the above description, and is therefore intended to encompass all variations falling within the meaning and scope of the equivalent elements of the claims. Any appended markings in the claims should not be constructed as limiting the claims involved.
It should be understood that although the specification is described according to embodiments, not every embodiment contains only an independent technical solution; the specification is written this way only for the sake of clarity. Those skilled in the art should consider the specification as
a whole. The technical solutions in each embodiment can be combined
appropriately to form other embodiments that can be understood by those
skilled in the art.

Claims (6)

1. A natural scene text recognition method based on a two-dimensional
feature attention mechanism, characterized by comprising the steps of:
1. data acquisition: synthesizing a line text image for training by using a
public code, dividing the line text image into a regular training set and an
irregular training set by shape, and downloading a true text image from the
network as test data;
2. data processing: stretching all training samples to a size of 32*104, keeping the aspect ratio of each image as consistent with that of the original as possible: the height is stretched to 32 pixels first, the width is then stretched according to the original aspect ratio, and any part with insufficient width is filled with a black border;
3. label generation: training a recognition model by a supervised method, wherein each line text image has corresponding text content and its label is already saved by the code during data synthesis;
4. network training: inputting the ready-made training data and labels into
a two-dimensional feature attention network for training, and inputting the
regular training data firstly; after the network has been trained to a suitable
degree through the regular training data, training the network with the irregular
text data and filling the length of each batch of read-in labels with terminators
to a consistent length; and
5. inputting test data into a trained network, calculating the confidence of
each image, selecting a character with the maximum confidence as a
predicted character based on the greedy algorithm, and putting these characters together to get a final predicted line text.
2. A natural scene text recognition method based on a two-dimensional
feature attention mechanism according to claim 1, characterized in that in step
1, the training data are synthesized by using a public code, the number of
synthesized text images should be as large as possible, the text in
synthesized text images should cover a variety of fonts, the background
should be as complex and varied as possible, and the total number of images
is 20 million.
3. A natural scene text recognition method based on a two-dimensional
feature attention mechanism according to claim 1, characterized in that in step
2, the synthesized text images are stretched to a size of 32*104, with the aspect ratio of each image kept as consistent with that of the original as possible, wherein the height is stretched to 32 pixels first, the width is then stretched according to the original aspect ratio, and any part with insufficient width is filled with a black border to completely preserve the shape information of the original text.
4. A natural scene text recognition method based on a two-dimensional
feature attention mechanism according to claim 1, characterized in that step 3
comprises the steps of:
3-1: synthesizing a text-containing image by using an online public code
and a text corpus, and cutting out the text from the image according to the
position of line text recorded by the code in the file to make a line text training
sample;
3-2: saving the text content in each text image in the corresponding text
file;
3-3: taking all synthesized training samples as training data and taking
the public real text images downloaded from the network as the test set; and
3-4: packing all samples into files in LMDB database format for accelerated reading.
5. A natural scene text recognition method based on a two-dimensional
feature attention mechanism according to claim 1, characterized in that step 4
comprises the steps of:
4-1: constructing a feature encoding network by taking a convolutional
block and a long-short term memory model as basic units, wherein the feature
extraction network in the front part of the network down-samples the features
by pooling layers, and each pooling layer has a downsampling multiple of 2;
the feature maps output by the first convolutional layer and the third
convolutional layer of the convolutional module are added numerically to
obtain an output feature map of the convolutional module; each convolutional
module does not downsample the feature map; a batch standardization
operation is attached after each convolutional layer in the convolutional block,
and the result is output after being processed in the linear rectification unit,
and finally the output feature map is obtained;
after being processed in the feature extraction network, the obtained
feature map with height not being 1, i.e., a two-dimensional feature map is cut
into H sub-feature maps by rows, where H is the height of the two-dimensional
feature map; each sub-feature map is input to a BLSTM network consisting of
two-layer bidirectional long-short term memory (BLSTM), so that the feature
vectors of each sub-feature map have contextual information, as expressed
by the mathematical formula below:
$$\{F_{i,1}, F_{i,2}, \dots, F_{i,W}\} = \mathrm{BLSTM}(I_i)$$
where $I_i$ represents the sub-feature map of the $i$th row cut from the two-dimensional feature map, $W$ represents the width of the two-dimensional feature map, and $F_{i,j}$ represents the $j$th feature vector obtained after the $i$th sub-feature map is encoded by the BLSTM network; all encoded sub-feature maps are stitched in the horizontal direction to obtain the encoded feature map $F$;
4-2: constructing a decoding network based on the two-dimensional
feature attention mechanism:
$$\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k=1}^{W \times H} \exp(e_{t,k})}$$
where $\alpha_t = \{\alpha_{t,1}, \alpha_{t,2}, \dots, \alpha_{t,W \times H}\}$ represents the weights of the feature vectors in each sub-feature map of the two-dimensional feature map, that is, the attention weight distribution, $H$ represents the height of the feature map, and $e_{t,j}$ is obtained by the following formula:
$$e_{t,j} = V \tanh(W r_t + Q F_j + b)$$
$V$, $W$, $Q$ and $b$ represent trainable weight parameters, $F$ is the feature map encoded and spliced by the BLSTM, and $r_t$ is the output of the language network, which consists of a long-short term memory (LSTM) model in the attention network and takes as inputs the embedding vector $emb_{t-1}$ of the character decoded at the previous time step and the hidden-layer output vector $h_{t-1}$ used to decode the previous character:
$$r_t = \mathrm{LSTM}(emb_{t-1}, h_{t-1})$$
a rough attention distribution is thus obtained; $\alpha_t$ is multiplied element-wise with the feature map $F$ to obtain a feature map $F_t$ in which all features except those of the character currently being decoded are filtered out, and $F_t$ is then processed in the attention network to obtain an attention weight distribution $\alpha'_t = \{\alpha'_{t,1}, \alpha'_{t,2}, \dots, \alpha'_{t,W \times H}\}$ acting on $F_t$:
$$\alpha'_{t,j} = \frac{\exp(e'_{t,j})}{\sum_{k=1}^{W \times H} \exp(e'_{t,k})}$$
$$e'_{t,j} = V' \tanh(W' g_t + Q' F_{t,j} + b')$$
$V'$, $W'$, $Q'$ and $b'$ represent trainable parameters, and the vector $g_t$ represents the rough feature vector of a character, obtained as the weighted sum of the feature map $F$ under the attention weight distribution $\alpha_t$:
$$g_t = \sum_{j=1}^{W \times H} \alpha_{t,j} F_j$$
after obtaining $\alpha'_t$, a detailed feature vector $g'_t$ for decoding the current character is calculated from the feature map $F_t$:
$$g'_t = \sum_{j=1}^{W \times H} \alpha'_{t,j} F_{t,j}$$
summing $g_t$ and $g'_t$ gives the vector $g''_t$ used to decode the current character:
$$g''_t = g_t + g'_t$$
the probability distribution $y_t$ over characters is obtained by decoding $g''_t$ in a fully connected layer and performing probability normalization in a softmax layer:
$$y_t = \mathrm{softmax}(\phi(W_c g''_t + b_c))$$
where $\phi$ represents a linear rectification unit, and $W_c$ and $b_c$ represent the trainable weights of the fully connected layer; the current decoded output character $c_t$ is obtained by selecting the character corresponding to the maximum confidence in $y_t$.
4-3: training parameters setting: inputting the training data to the network
for training, and allowing the network to traverse the training dataset for 10
times, with about 310,000 batches of data being read during each traversal of
the training dataset, wherein the size of the read-in batches is set to 64, an
adaptive gradient descent method (ADADELTA) is used as the optimization
algorithm, and the initial learning rate is set to 1;
the loss function is defined as
$$L = -\sum_{i=1}^{N} \sum_{j=1}^{|y_i|} \log p(c_{i,j} \mid I_i)$$
where $N$ represents the data size used in the optimization batch, and $p(c_{i,j} \mid I_i)$ represents the probability of outputting the character $c_{i,j}$ from the $i$th sample image at time $j$;
4-4: initialization of weights: all weight parameters in the network are
randomly initialized at the beginning of training; and
4-5: training convolutional neural network: the probability of outputting
each character of the target character string at a corresponding time point is
used as cross entropy, and the cross entropy is minimized by using the
gradient descent method.
6. A natural scene text recognition method based on a two-dimensional
feature attention mechanism according to claim 1, characterized in that step 5
comprises the steps of:
5-1: inputting test dataset samples, selecting a character with the
maximum confidence as a predicted character based on the greedy algorithm,
and putting these characters together to get a final predicted line text; and
5-2: after the recognition is completed, calculating the accuracy rate and the edit distance by a program.
-1/4-
Data collection: synthesizing the training set by using a public code, and downloading public text images of real scenes from the Internet as test data;
Data processing: stretching all images to make the height uniform, and making up the width by padding black borders;
Label generation: assigning the text labels corresponding to each image by code, and dividing the training data into a regular text training dataset and an irregular text training dataset according to the shape of the text in the images;
Network training: sending training data to the network for training, sending the regular training data firstly; and then sending the irregular training data;
Network testing: sending the test data to the trained recognition network, and outputting the recognized text characters in the images along with the confidence level of each character.
Fig. 1
-2/4-
Inputting a feature map
Convolutional layer
Batch standardization layer
Linear rectification unit
Outputting a feature map 1
Convolutional layer
Batch standardization layer
Linear rectification unit
Outputting a feature map 2
Convolutional layer
Batch standardization layer
Outputting a feature map 3 Add
Total output feature map
Fig. 2
-3/4-
Scene text image → feature extraction network → 2D feature map → 2D BLSTM coding network → splicing into a one-dimensional feature map → 2D feature attention network → recognition results
Fig. 3
-4/4-
Layer name | Output size | Configuration parameters
Convolutional block 1 | 32×104 | three 3×3 convolutions, stride 1×1, padding 1, 64 channels
Pooling layer 1 | 16×52 | pooling kernel 2×2, stride 2×2, padding 0
Convolutional block 2 | 16×52 | three 3×3 convolutions, stride 1×1, padding 1, 128 channels
Pooling layer 2 | 8×26 | pooling kernel 2×2, stride 2×2, padding 0
Convolutional block 3 | 8×26 | three 3×3 convolutions, stride 1×1, padding 1, 256 channels
Convolutional block 4 | 8×26 | three 3×3 convolutions, stride 1×1, padding 1, 256 channels
Convolutional block 5 | 8×26 | three 3×3 convolutions, stride 1×1, padding 1, 512 channels
Convolutional block 6 | 8×26 | three 3×3 convolutions, stride 1×1, padding 1, 512 channels
Fig. 4
AU2021100480A 2021-01-25 2021-01-25 Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism Active AU2021100480A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2021100480A AU2021100480A4 (en) 2021-01-25 2021-01-25 Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2021100480A AU2021100480A4 (en) 2021-01-25 2021-01-25 Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism

Publications (1)

Publication Number Publication Date
AU2021100480A4 true AU2021100480A4 (en) 2021-04-15

Family

ID=75397050

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2021100480A Active AU2021100480A4 (en) 2021-01-25 2021-01-25 Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism

Country Status (1)

Country Link
AU (1) AU2021100480A4 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114092931A (en) * 2022-01-20 2022-02-25 中科视语(北京)科技有限公司 Scene character recognition method and device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
CN110378334B (en) Natural scene text recognition method based on two-dimensional feature attention mechanism
CN110414498B (en) Natural scene text recognition method based on cross attention mechanism
CN110287479B (en) Named entity recognition method, electronic device and storage medium
CN112734775B (en) Image labeling, image semantic segmentation and model training methods and devices
CN110533737A (en) The method generated based on structure guidance Chinese character style
CN108415977A (en) One is read understanding method based on the production machine of deep neural network and intensified learning
CN110795556A (en) Abstract generation method based on fine-grained plug-in decoding
CN110555896B (en) Image generation method and device and storage medium
CN111598979B (en) Method, device and equipment for generating facial animation of virtual character and storage medium
CN111428727B (en) Natural scene text recognition method based on sequence transformation correction and attention mechanism
CN113298151A (en) Remote sensing image semantic description method based on multi-level feature fusion
CN109711465A (en) Image method for generating captions based on MLL and ASCA-FR
CN112633431B (en) Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC
CN113705313A (en) Text recognition method, device, equipment and medium
CN114627282B (en) Method, application method, equipment, device and medium for establishing target detection model
CN113762269A (en) Chinese character OCR recognition method, system, medium and application based on neural network
CN110096591A (en) Long text classification method, device, computer equipment and storage medium based on bag of words
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN116310339A (en) Remote sensing image segmentation method based on matrix decomposition enhanced global features
CN117522697A (en) Face image generation method, face image generation system and model training method
CN115131698A (en) Video attribute determination method, device, equipment and storage medium
AU2021100480A4 (en) Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism
CN116561274A (en) Knowledge question-answering method based on digital human technology and natural language big model
CN113283432A (en) Image recognition and character sorting method and equipment

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)