AU2021100480A4 - Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism - Google Patents
- Publication number
- AU2021100480A4
- Authority
- AU
- Australia
- Prior art keywords
- training
- text
- network
- convolutional
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
- G06V20/63—Scene text, e.g. street names
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Multimedia (AREA)
- Character Discrimination (AREA)
Abstract
Disclosed is a natural scene text recognition method based on a two-dimensional feature attention mechanism, comprising the steps of: 1. data acquisition: synthesizing line text images for training by using a public code, dividing the line text images into a regular training set and an irregular training set by shape, and downloading real text images from the Internet as test data; 2. data processing: stretching each image to a processed size of 32×104; 3. label generation: training a recognition model by a supervised method, wherein each line text image has corresponding text content; 4. network training: training a recognition network by using the data in the training set; and 5. network testing: inputting test data into the trained network to obtain a prediction result for each line text image. According to the present invention, characters are decoded from the two-dimensional features of images by using the attention network, and the recognition accuracy reaches a high level on public data sets. Thus, the method is highly practical and applicable.
Description
Natural Scene Text Recognition Method Based on Two-Dimensional
Feature Attention Mechanism
The present invention relates to a natural scene text recognition method, and in particular to a natural scene text recognition method based on a two-dimensional feature attention mechanism, and belongs to the fields of pattern recognition and artificial intelligence technology.
Text frees the transmission of information between people from the limits of hearing, enabling the spiritual wealth and wisdom of mankind to be passed on through visual information, so that this information can be understood and processed more accurately and the exchange of information between people is promoted. With the rapid development of computer technology, artificial intelligence is gradually changing our lives, making them more convenient and efficient. Moreover, the recent rapid development and wide application of hardware technology, especially GPUs, has made the practical application of deep neural networks possible.
In the real world, people obtain far more information through vision than through the other senses, and within visual information they mainly understand the external environment and obtain important information through text. Since the invention of writing, human beings have conveyed information to, and received information from, the outside world through text in large quantities. To obtain text information, one must first correctly recognize the text perceived visually. For an educated person, it is easy to recognize words in an image correctly; computers, however, cannot recognize the text in an image as easily as a human can.
Text plays an important role in the real world: most of the information that people obtain visually is carried by text, and people have relied heavily on text to obtain information in the past and will continue to do so in the future. The most important step in obtaining text information is to recognize the text correctly, so it is essential that computers recognize text in images correctly. However, text appears in natural scenes in various forms; for example, street signs stand against widely varying backgrounds, and this variability makes it difficult for computers to recognize text information correctly. Moreover, people often arrange text in different shapes, such as curved or broken lines, to achieve artistic effects. Many other factors likewise make it difficult for computers to recognize text in natural scenes correctly. It is therefore necessary to find an effective method for recognizing text in natural scenes.
The research progress of artificial intelligence makes it possible to solve the above problems. In recent years, several research teams have proposed solutions based on deep neural networks for natural scene text recognition. Among these solutions, methods based on an attention mechanism are particularly prominent in the field of natural scene text recognition. Owing to the flexibility of the attention mechanism in decoding and semantic derivation, the recognition rate of models based on the attention mechanism has improved greatly over previous methods. However, traditional attention-based scene text recognition solutions typically compress the input scene text image directly into a one-dimensional feature sequence through a convolutional neural network, which introduces extra noise into the feature sequence.
To solve the above problems, the purpose of the present invention is to provide a natural scene text recognition method based on a two-dimensional feature attention mechanism that achieves a high recognition rate on irregularly arranged text and remains of high practical value on images with rich backgrounds, from which text can still be recognized.
In order to achieve the purpose, the present invention is realized by the
following technical solution: a natural scene text recognition method based on
a two-dimensional feature attention mechanism, comprising the steps of:
1. data acquisition: synthesizing line text images for training by using a public code, dividing the line text images into a regular training set and an irregular training set by shape, and downloading real text images from the Internet as test data;
2. data processing: resizing all training samples so that each processed image is 32×104 pixels while keeping the aspect ratio as close as possible to that of the original image: the height is first stretched to 32 pixels, the width is then scaled according to the original aspect ratio, and any remaining width is filled with a black border (an illustrative preprocessing sketch is given after step 5);
3. label generation: training the recognition model by a supervised method, wherein each line text image has corresponding text content whose label is already saved by the code during data synthesis;
4. network training: inputting the prepared training data and labels into a two-dimensional feature attention network for training, inputting the regular training data first; after the network has been trained to a suitable degree on the regular training data, training it further with the irregular text data, padding each batch of read-in labels with terminators to a consistent length; and
5. network testing: inputting test data into the trained network, calculating the confidence of each character, selecting the character with the maximum confidence as the predicted character according to the greedy algorithm, and concatenating these characters to obtain the final predicted line text.
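By way of illustration only, the preprocessing of step 2 might be realized as in the following sketch (OpenCV/NumPy are assumed tools and the function name is hypothetical, not part of the claimed method):

```python
# Hedged sketch of step-2 preprocessing: resize the height to 32 px,
# scale the width by the original aspect ratio, pad the rest with black.
import cv2
import numpy as np

TARGET_H, TARGET_W = 32, 104  # processed image size from step 2

def resize_and_pad(image: np.ndarray) -> np.ndarray:
    h, w = image.shape[:2]
    new_w = min(TARGET_W, max(1, round(w * TARGET_H / h)))
    resized = cv2.resize(image, (new_w, TARGET_H))
    canvas = np.zeros((TARGET_H, TARGET_W) + image.shape[2:], dtype=image.dtype)
    canvas[:, :new_w] = resized  # black border fills any missing width
    return canvas
```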
Preferably, in step 1, the training data are synthesized by using a public code; the number of synthesized text images should be as large as possible, the text in the synthesized images should cover a variety of fonts, the backgrounds should be as complex and varied as possible, and the total number of images is 20 million.
Preferably, in step 2, the synthesized text images are resized to 32×104, with the aspect ratio of each image kept as consistent as possible with that of the original image: the height is stretched to 32 pixels first, the width is scaled according to the original aspect ratio, and any remaining width is filled with a black border so as to completely preserve the shape information of the original text.
Preferably, step 3 comprises the steps of:
3-1: synthesizing text-containing images by using an online public code and a text corpus, and cutting out the text from each image according to the position of the line text recorded by the code in the file to make line text training samples;
3-2: saving the text content in each text image in the corresponding text
file;
3-3: taking all synthesized training samples as training data and taking
the public real text images downloaded from the network as the test set; and
3-4: packing all samples into files in LMDB database format for accelerated reading (a minimal packing sketch follows).
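A minimal sketch of such LMDB packing is shown below (the Python lmdb binding is assumed; the key scheme is illustrative, not prescribed by the method):

```python
# Pack (image bytes, label) pairs into an LMDB file for fast reading.
import lmdb

def write_lmdb(db_path, samples):
    """samples: iterable of (image_bytes, label_str) pairs."""
    env = lmdb.open(db_path, map_size=1 << 40)  # generous address-space cap
    n = 0
    with env.begin(write=True) as txn:
        for i, (img_bytes, label) in enumerate(samples):
            txn.put(f"image-{i:09d}".encode(), img_bytes)
            txn.put(f"label-{i:09d}".encode(), label.encode("utf-8"))
            n = i + 1
        txn.put(b"num-samples", str(n).encode())
    env.close()
```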
Preferably, step 4 comprises the steps of:
4-1: constructing a feature encoding network by taking a convolutional block and a long short-term memory model as basic units, wherein the feature extraction network in the front part of the network downsamples the features by pooling layers, each with a downsampling factor of 2; the feature maps output by the first and third convolutional layers of a convolutional module are added element-wise to obtain the output feature map of the module, and the convolutional modules themselves do not downsample the feature map; a batch normalization operation follows each convolutional layer in the convolutional block, the result is passed through a linear rectification unit, and the output feature map is finally obtained;
after being processed in the feature extraction network, the obtained feature map, whose height is not 1 (i.e., a two-dimensional feature map), is cut into H sub-feature maps by rows, where H is the height of the two-dimensional feature map; each sub-feature map is input to a network consisting of two layers of bidirectional long short-term memory (BLSTM), so that the feature vectors of each sub-feature map carry contextual information, as expressed by the formula below:
$\{F_{i,1}, F_{i,2}, \dots, F_{i,W}\} = \mathrm{BLSTM}(I_i)$
where $I_i$ represents the sub-feature map of the $i$th row cut from the two-dimensional feature map, $W$ represents the width of the two-dimensional feature map, and $F_{i,j}$ represents the $j$th feature vector obtained after the $i$th sub-feature map is encoded by the BLSTM network; all encoded sub-feature maps are stitched in the horizontal direction to obtain the encoded feature map (an illustrative PyTorch sketch of this encoder follows);
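The following sketch shows, under assumed layer sizes, one way the encoder of step 4-1 could look in PyTorch; class names and dimensions are illustrative only:

```python
# Residual-style convolutional block (cf. Fig. 2) plus row-wise BLSTM.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Three 3x3 convolutions; the outputs of the first and third
    convolutions are added element-wise, with no downsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU())
        self.conv3 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch))
    def forward(self, x):
        f1 = self.conv1(x)
        f3 = self.conv3(self.conv2(f1))
        return f1 + f3  # add feature maps of the 1st and 3rd convolutions

class RowBLSTM(nn.Module):
    """Encode each row of the 2-D feature map with a 2-layer BLSTM."""
    def __init__(self, channels, hidden):
        super().__init__()
        self.rnn = nn.LSTM(channels, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
    def forward(self, fmap):                                  # (B, C, H, W)
        b, c, h, w = fmap.shape
        rows = fmap.permute(0, 2, 3, 1).reshape(b * h, w, c)  # one row = one sequence
        encoded, _ = self.rnn(rows)                           # (B*H, W, 2*hidden)
        return encoded.reshape(b, h, w, -1)                   # stitched encoded map
```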
4-2: constructing a decoding network based on the two-dimensional
feature attention mechanism:
$\alpha_{t,j} = \dfrac{\exp(e_{t,j})}{\sum_{k=1}^{W \times H} \exp(e_{t,k})}$

where $\alpha_t = \{\alpha_{t,1}, \alpha_{t,2}, \dots, \alpha_{t,W \times H}\}$ represents the weights of the feature vectors in the sub-feature maps of the two-dimensional feature map, that is, the attention weight distribution; $H$ in the formula represents the height of the feature map, and $e_{t,j}$ is obtained by the following formula:
$e_{t,j} = V^{\top} \tanh(W r_t + Q F_j + b)$
where $V$, $W$, $Q$ and $b$ are trainable weight parameters, $F$ is the feature map encoded and spliced by the BLSTM, and $r_t$ is the output of the language network, an LSTM model inside the attention network that takes the embedding vector $\mathrm{emb}_{t-1}$ of the character decoded at the previous time step and the hidden-layer output vector $h_{t-1}$ used to decode the previous character:
$r_t = \mathrm{LSTM}(\mathrm{emb}_{t-1}, h_{t-1})$
a rough attention distribution is thus obtained; $\alpha_t$ is multiplied with the corresponding elements of the feature map $F$ to obtain a feature map $F_t$ in which all features except those of the character currently being decoded are filtered out, and $F_t$ is then processed in the attention network to obtain the attention weight distribution $\alpha'_t$ acting on $F_t$:
$\alpha'_{t,j} = \dfrac{\exp(e'_{t,j})}{\sum_{k=1}^{W \times H} \exp(e'_{t,k})}$

$e'_{t,j} = V'^{\top} \tanh(W' g_t + Q' F_{t,j} + b')$
where $V'$, $W'$, $Q'$ and $b'$ are trainable parameters, and the vector $g_t$ represents the rough feature vector of a character, obtained as the weighted sum of the feature map $F$ with the attention weight distribution $\alpha_t$:
$g_t = \sum_{j=1}^{W \times H} \alpha_{t,j} F_j$
after obtaining $\alpha'_t$, the detailed feature vector $g'_t$ for decoding the current character is calculated from the feature map $F_t$:
$g'_t = \sum_{j=1}^{W \times H} \alpha'_{t,j} F_{t,j}$
summing $g_t$ and $g'_t$ gives the vector $g''_t$ used for decoding the current character:

$g''_t = g_t + g'_t$
the probability distribution $y_t$ over characters is obtained by decoding in a fully connected layer and performing probability normalization in a softmax layer:
$y_t = \mathrm{softmax}(\phi(W_c g''_t + b_c))$
where $\phi$ represents the linear rectification unit, and $W_c$ and $b_c$ represent the trainable weights of the fully connected layer; the currently decoded output character $c_t$ is obtained by selecting the character with the maximum confidence in $y_t$ (a sketch of one such decoding step follows).
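Under assumed dimensions, one decoding step of this two-stage attention decoder might be sketched as follows (all module and variable names are illustrative; `F` is the encoded feature map flattened to $W \times H$ vectors):

```python
# One step of the rough + detailed attention decoding of step 4-2.
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class Attention2DStep(nn.Module):
    def __init__(self, feat_dim, hidden, num_classes, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(num_classes, emb_dim)
        self.lang = nn.LSTMCell(emb_dim, hidden)            # language network -> r_t
        self.W = nn.Linear(hidden, feat_dim)
        self.Q = nn.Linear(feat_dim, feat_dim)
        self.Wp = nn.Linear(feat_dim, feat_dim)
        self.Qp = nn.Linear(feat_dim, feat_dim)
        self.v = nn.Linear(feat_dim, 1, bias=False)         # plays the role of V
        self.vp = nn.Linear(feat_dim, 1, bias=False)        # plays the role of V'
        self.cls = nn.Linear(feat_dim, num_classes)         # W_c, b_c

    def forward(self, F, prev_char, state):
        # F: (B, W*H, D); prev_char: (B,) index of the previously decoded char
        h, c = self.lang(self.emb(prev_char), state)        # r_t
        e = self.v(torch.tanh(self.W(h).unsqueeze(1) + self.Q(F))).squeeze(-1)
        alpha = Fn.softmax(e, dim=1)                        # rough attention
        g = (alpha.unsqueeze(-1) * F).sum(1)                # g_t
        Ft = alpha.unsqueeze(-1) * F                        # filtered map F_t
        ep = self.vp(torch.tanh(self.Wp(g).unsqueeze(1) + self.Qp(Ft))).squeeze(-1)
        alphap = Fn.softmax(ep, dim=1)                      # refined attention
        gp = (alphap.unsqueeze(-1) * Ft).sum(1)             # g'_t
        logits = self.cls(Fn.relu(g + gp))                  # y_t before softmax
        return logits, (h, c)
```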
4-3: training parameter settings: the training data are input to the network for training, and the network traverses the training dataset 10 times, with about 310,000 batches of data read during each traversal; the batch size is set to 64, the adaptive gradient descent method ADADELTA is used as the optimization algorithm, and the initial learning rate is set to 1;
the loss function is defined as

$\mathcal{L} = -\sum_{i=1}^{N} \log \prod_{j=1}^{|c^i|} p\left(c_j^i \mid I^i\right)$
where $N$ represents the number of samples in the optimization batch, and $p(c_j^i \mid I^i)$ represents the probability of outputting the character $c_j^i$ from the $i$th sample image at time $j$;
4-4: initialization of weights: all weight parameters in the network are
randomly initialized at the beginning of training; and
4-5: training the convolutional neural network: the probability of outputting each character of the target character string at the corresponding time step defines the cross entropy, and the cross entropy is minimized by the gradient descent method (an illustrative training-loop sketch follows).
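A training loop matching these settings might look as follows (a sketch only; `model` and `loader` are assumed to exist, and the loss is the per-step cross entropy of 4-3):

```python
# Training configuration of 4-3: batch size 64, ADADELTA, initial lr 1,
# 10 passes over the training set, per-character cross entropy.
import torch
import torch.nn as nn

optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
criterion = nn.CrossEntropyLoss()     # -log p(c_j | I) averaged over steps

for epoch in range(10):               # traverse the training dataset 10 times
    for images, targets in loader:    # targets: (B, T), padded with terminators
        logits = model(images)        # (B, T, num_classes)
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```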
Preferably, step 5 comprises the steps of:
5-1: inputting test dataset samples, selecting the character with the maximum confidence as the predicted character according to the greedy algorithm, and concatenating these characters to obtain the final predicted line text; and
5-2: after the recognition is completed, calculating the accuracy rate and the edit distance by a program (a minimal metric sketch follows).
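The step 5-2 metrics could be computed as follows (plain Python; the function names are illustrative):

```python
# Line accuracy and total edit distance between predictions and labels.
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def evaluate(predictions, labels):
    correct = sum(p == l for p, l in zip(predictions, labels))
    total_ed = sum(edit_distance(p, l) for p, l in zip(predictions, labels))
    return correct / len(labels), total_ed
```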
The advantageous effects of the present invention are as follows:
(1) The recognition algorithm, with its deep network structure, automatically learns effective representations from data, improving recognition accuracy.
(2) The present invention has a fast training speed and a high accuracy
compared with the method of detecting the position of each character firstly
and then recognizing each character separately.
(3) Benefiting from high recognition accuracy and robustness, the method of the present invention is much more robust in recognizing text of irregular shapes.
Fig. 1 is a general flowchart of the natural scene text recognition method of the present invention.
Fig. 2 is a flowchart of the convolutional module in the feature extraction
network of the present invention.
Fig. 3 is a schematic diagram of the recognition process of the present
invention.
Fig. 4 is a schematic diagram of the parameter configuration of the deep
convolutional neural network of the present invention.
The technical solutions in the embodiments of the present invention will
be described clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the embodiments described are only a part of the embodiments of the present invention and not all of them.
All other embodiments obtained by those skilled in the art based on the
embodiments of the present invention without creative efforts shall fall within
the scope of the present invention.
Referring to Figs. 1-4, a natural scene text recognition method based on
a two-dimensional feature attention mechanism comprises the steps of:
1. data acquisition: synthesizing line text images for training by using a public code, dividing the line text images into a regular training set and an irregular training set by shape, and downloading real text images from the Internet as test data;
2. data processing: resizing all training samples so that each processed image is 32×104 pixels while keeping the aspect ratio as close as possible to that of the original image: the height is first stretched to 32 pixels, the width is then scaled according to the original aspect ratio, and any remaining width is filled with a black border;
3. label generation: training the recognition model by a supervised method, wherein each line text image has corresponding text content whose label is already saved by the code during data synthesis;
Step 3 comprises the steps of:
3-1: synthesizing text-containing images by using an online public code and a text corpus, cutting out the text from each image according to the position of the line text recorded by the code in the file to make line text training samples, and downloading public natural scene text datasets from the Internet to test the network performance, wherein the line text images of these datasets are taken from real images;
3-2: saving the text content in each text image in the corresponding text
file;
3-3: taking all synthesized training samples as training data, which are divided into regular training data and irregular training data by the shape of the text in the images, and taking the public real text images downloaded from the network as the test set; and
3-4: packing all samples into files in LMDB database format for accelerated reading.
4. network training: inputting the prepared training data and labels into a two-dimensional feature attention network for training, inputting the regular training data first; after the network has been trained to a suitable degree on the regular training data, training it further with the irregular text data, padding each batch of read-in labels with terminators to a consistent length (see the padding sketch below); and
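A minimal sketch of padding each batch of labels with terminator symbols (the terminator index and function name are assumptions):

```python
# Pad every label in a batch with terminators to a uniform length.
import torch

EOS = 0  # assumed index of the terminator character

def pad_labels(label_indices, eos=EOS):
    """label_indices: list of per-image lists of character indices."""
    max_len = max(len(l) for l in label_indices) + 1  # room for one terminator
    batch = torch.full((len(label_indices), max_len), eos, dtype=torch.long)
    for i, l in enumerate(label_indices):
        batch[i, :len(l)] = torch.tensor(l, dtype=torch.long)
    return batch
```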
Step 4 comprises the steps of:
4-1: constructing a feature encoding network by taking a convolutional block and a long short-term memory model as basic units, wherein the feature extraction network in the front part of the network downsamples the features by pooling layers, each with a downsampling factor of 2, and the convolutional block can be represented as a computational process involving convolutional layers;
the feature maps output by the first and third convolutional layers of a convolutional module are added element-wise to obtain the output feature map of the module, and the convolutional modules themselves do not downsample the feature map; a batch normalization operation follows each convolutional layer in the convolutional block, the result is passed through a linear rectification unit, and the output feature map is finally obtained;

after being processed in the feature extraction network, the obtained feature map, whose height is not 1 (i.e., a two-dimensional feature map), is cut into H sub-feature maps by rows, where H is the height of the two-dimensional feature map; each sub-feature map is input to a network consisting of two layers of bidirectional long short-term memory (BLSTM), so that the feature vectors of each sub-feature map carry contextual information, as expressed by the formula below:
$\{F_{i,1}, F_{i,2}, \dots, F_{i,W}\} = \mathrm{BLSTM}(I_i)$
where $I_i$ represents the sub-feature map of the $i$th row cut from the two-dimensional feature map, $W$ represents the width of the two-dimensional feature map, and $F_{i,j}$ represents the $j$th feature vector obtained after the $i$th sub-feature map is encoded by the BLSTM network; all encoded sub-feature maps are stitched in the horizontal direction to obtain the encoded feature map;
4-2: constructing a decoding network based on the two-dimensional
feature attention mechanism:
$\alpha_{t,j} = \dfrac{\exp(e_{t,j})}{\sum_{k=1}^{W \times H} \exp(e_{t,k})}$

where $\alpha_t = \{\alpha_{t,1}, \alpha_{t,2}, \dots, \alpha_{t,W \times H}\}$ represents the weights of the feature vectors in the sub-feature maps of the two-dimensional feature map, that is, the attention weight distribution; $H$ in the formula represents the height of the feature map, and $e_{t,j}$ is obtained by the following formula:

$e_{t,j} = V^{\top} \tanh(W r_t + Q F_j + b)$
where $V$, $W$, $Q$ and $b$ are trainable weight parameters, $F$ is the feature map encoded and spliced by the BLSTM, and $r_t$ is the output of the language network, an LSTM model inside the attention network that takes the embedding vector $\mathrm{emb}_{t-1}$ of the character decoded at the previous time step and the hidden-layer output vector $h_{t-1}$ used to decode the previous character:
$r_t = \mathrm{LSTM}(\mathrm{emb}_{t-1}, h_{t-1})$
a rough attention distribution is thus obtained; $\alpha_t$ is multiplied with the corresponding elements of the feature map $F$ to obtain a feature map $F_t$ in which all features except those of the character currently being decoded are filtered out, and $F_t$ is then processed in the attention network to obtain the attention weight distribution $\alpha'_t$ acting on $F_t$:
$\alpha'_{t,j} = \dfrac{\exp(e'_{t,j})}{\sum_{k=1}^{W \times H} \exp(e'_{t,k})}$

$e'_{t,j} = V'^{\top} \tanh(W' g_t + Q' F_{t,j} + b')$
where $V'$, $W'$, $Q'$ and $b'$ are trainable parameters, and the vector $g_t$ represents the rough feature vector of a character, obtained as the weighted sum of the feature map $F$ with the attention weight distribution $\alpha_t$:
$g_t = \sum_{j=1}^{W \times H} \alpha_{t,j} F_j$
after obtaining $\alpha'_t$, the detailed feature vector $g'_t$ for decoding the current character is calculated from the feature map $F_t$:
$g'_t = \sum_{j=1}^{W \times H} \alpha'_{t,j} F_{t,j}$
summing $g_t$ and $g'_t$ gives the vector $g''_t$ used for decoding the current character:
$g''_t = g_t + g'_t$

the probability distribution $y_t$ over characters is obtained by decoding in a fully connected layer and performing probability normalization in a softmax layer:
$y_t = \mathrm{softmax}(\phi(W_c g''_t + b_c))$
where $\phi$ represents the linear rectification unit, and $W_c$ and $b_c$ represent the trainable weights of the fully connected layer; the currently decoded output character $c_t$ is obtained by selecting the character with the maximum confidence in $y_t$.
4-3: training parameter settings: the training data are input to the network for training, and the network traverses the training dataset 10 times, with about 310,000 batches of data read during each traversal; the batch size is set to 64, the adaptive gradient descent method ADADELTA is used as the optimization algorithm, and the initial learning rate is set to 1;
the loss function is defined as
$\mathcal{L} = -\sum_{i=1}^{N} \log \prod_{j=1}^{|c^i|} p\left(c_j^i \mid I^i\right)$

where $N$ represents the number of samples in the optimization batch, and $p(c_j^i \mid I^i)$ represents the probability of outputting the character $c_j^i$ from the $i$th sample image at time $j$;
4-4: initialization of weights: all weight parameters in the network are
randomly initialized at the beginning of training; and
4-5: training the convolutional neural network: the probability of outputting each character of the target character string at the corresponding time step defines the cross entropy, and the cross entropy is minimized by the gradient descent method.
5. network testing: inputting test data into the trained network, calculating the confidence of each character, selecting the character with the maximum confidence as the predicted character according to the greedy algorithm, and concatenating these characters to obtain the final predicted line text.
Step 5 comprises the steps of:
5-1: during training, inputting the images from the validation set as well as
the labels into the network for validation; and
5-2: after training is completed, inputting the images from the test dataset into the trained network, and calculating by a program the correct recognition rate of the network and the total edit distance between the predicted results and the labels (an illustrative greedy-decoding sketch follows).
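Greedy decoding at test time might be sketched as follows (`model.encode` and `model.step` are assumed interfaces standing in for the encoder and the attention step described above; the character set and terminator index are illustrative):

```python
# Greedy decoding: at each step keep the most confident character.
import torch

@torch.no_grad()
def greedy_decode(model, image, charset, max_len=25, eos=0):
    F = model.encode(image)                  # encoded feature map
    prev = torch.zeros(1, dtype=torch.long)  # assumed start symbol
    state = None
    chars, confidences = [], []
    for _ in range(max_len):
        logits, state = model.step(F, prev, state)
        probs = logits.softmax(-1)
        conf, idx = probs.max(-1)            # greedy choice
        if idx.item() == eos:                # stop at the terminator
            break
        chars.append(charset[idx.item()])
        confidences.append(conf.item())
        prev = idx
    return "".join(chars), confidences
```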
It should be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and may be embodied in other specific forms without departing from the spirit or essential features of the present invention. The embodiments should therefore be regarded as exemplary and non-limiting in every respect; the scope of the present invention is defined by the appended claims rather than by the above description, and is intended to encompass all variations falling within the meaning and scope of equivalents of the claims. Any reference signs in the claims shall not be construed as limiting the claims concerned.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; the specification is written this way only for the sake of clarity. Those skilled in the art should consider the specification as a whole; the technical solutions in the embodiments may be combined appropriately to form other embodiments understandable to those skilled in the art.
Claims (6)
1. A natural scene text recognition method based on a two-dimensional
feature attention mechanism, characterized by comprising the steps of:
1. data acquisition: synthesizing line text images for training by using a public code, dividing the line text images into a regular training set and an irregular training set by shape, and downloading real text images from the Internet as test data;
2. data processing: resizing all training samples so that each processed image is 32×104 pixels while keeping the aspect ratio as close as possible to that of the original image: the height is first stretched to 32 pixels, the width is then scaled according to the original aspect ratio, and any remaining width is filled with a black border;
3. label generation: training the recognition model by a supervised method, wherein each line text image has corresponding text content whose label is already saved by the code during data synthesis;
4. network training: inputting the prepared training data and labels into a two-dimensional feature attention network for training, inputting the regular training data first; after the network has been trained to a suitable degree on the regular training data, training it further with the irregular text data, padding each batch of read-in labels with terminators to a consistent length; and
5. network testing: inputting test data into the trained network, calculating the confidence of each character, selecting the character with the maximum confidence as the predicted character according to the greedy algorithm, and concatenating these characters to obtain the final predicted line text.
2. A natural scene text recognition method based on a two-dimensional feature attention mechanism according to claim 1, characterized in that in step 1, the training data are synthesized by using a public code; the number of synthesized text images should be as large as possible, the text in the synthesized images should cover a variety of fonts, the backgrounds should be as complex and varied as possible, and the total number of images is 20 million.
3. A natural scene text recognition method based on a two-dimensional feature attention mechanism according to claim 1, characterized in that in step 2, the synthesized text images are resized to 32×104, with the aspect ratio of each image kept as consistent as possible with that of the original image, wherein the height is stretched to 32 pixels first, the width is scaled according to the original aspect ratio, and any remaining width is filled with a black border to completely preserve the shape information of the original text.
4. A natural scene text recognition method based on a two-dimensional
feature attention mechanism according to claim 1, characterized in that step 3
comprises the steps of:
3-1: synthesizing text-containing images by using an online public code and a text corpus, and cutting out the text from each image according to the position of the line text recorded by the code in the file to make line text training samples;
3-2: saving the text content in each text image in the corresponding text
file;
3-3: taking all synthesized training samples as training data and taking
the public real text images downloaded from the network as the test set; and
3-4: packing all samples into files in LMDB database format for accelerated reading.
5. A natural scene text recognition method based on a two-dimensional
feature attention mechanism according to claim 1, characterized in that step 4
comprises the steps of:
4-1: constructing a feature encoding network by taking a convolutional block and a long short-term memory model as basic units, wherein the feature extraction network in the front part of the network downsamples the features by pooling layers, each with a downsampling factor of 2; the feature maps output by the first and third convolutional layers of a convolutional module are added element-wise to obtain the output feature map of the module, and the convolutional modules themselves do not downsample the feature map; a batch normalization operation follows each convolutional layer in the convolutional block, the result is passed through a linear rectification unit, and the output feature map is finally obtained;
after being processed in the feature extraction network, the obtained feature map, whose height is not 1 (i.e., a two-dimensional feature map), is cut into H sub-feature maps by rows, where H is the height of the two-dimensional feature map; each sub-feature map is input to a network consisting of two layers of bidirectional long short-term memory (BLSTM), so that the feature vectors of each sub-feature map carry contextual information, as expressed by the formula below:
$\{F_{i,1}, F_{i,2}, \dots, F_{i,W}\} = \mathrm{BLSTM}(I_i)$
where $I_i$ represents the sub-feature map of the $i$th row cut from the two-dimensional feature map, $W$ represents the width of the two-dimensional feature map, and $F_{i,j}$ represents the $j$th feature vector obtained after the $i$th sub-feature map is encoded by the BLSTM network; all encoded sub-feature maps are stitched in the horizontal direction to obtain the encoded feature map;
4-2: constructing a decoding network based on the two-dimensional
feature attention mechanism:
$\alpha_{t,j} = \dfrac{\exp(e_{t,j})}{\sum_{k=1}^{W \times H} \exp(e_{t,k})}$

where $\alpha_t = \{\alpha_{t,1}, \alpha_{t,2}, \dots, \alpha_{t,W \times H}\}$ represents the weights of the feature vectors in the sub-feature maps of the two-dimensional feature map, that is, the attention weight distribution; $H$ in the formula represents the height of the feature map, and $e_{t,j}$ is obtained by the following formula:
$e_{t,j} = V^{\top} \tanh(W r_t + Q F_j + b)$
where $V$, $W$, $Q$ and $b$ are trainable weight parameters, $F$ is the feature map encoded and spliced by the BLSTM, and $r_t$ is the output of the language network, an LSTM model inside the attention network that takes the embedding vector $\mathrm{emb}_{t-1}$ of the character decoded at the previous time step and the hidden-layer output vector $h_{t-1}$ used to decode the previous character:
$r_t = \mathrm{LSTM}(\mathrm{emb}_{t-1}, h_{t-1})$
a rough attention distribution is thus obtained; $\alpha_t$ is multiplied with the corresponding elements of the feature map $F$ to obtain a feature map $F_t$ in which all features except those of the character currently being decoded are filtered out, and $F_t$ is then processed in the attention network to obtain the attention weight distribution $\alpha'_t$ acting on $F_t$:
$\alpha'_{t,j} = \dfrac{\exp(e'_{t,j})}{\sum_{k=1}^{W \times H} \exp(e'_{t,k})}$

$e'_{t,j} = V'^{\top} \tanh(W' g_t + Q' F_{t,j} + b')$
where $V'$, $W'$, $Q'$ and $b'$ are trainable parameters, and the vector $g_t$ represents the rough feature vector of a character, obtained as the weighted sum of the feature map $F$ with the attention weight distribution $\alpha_t$:
$g_t = \sum_{j=1}^{W \times H} \alpha_{t,j} F_j$
after obtaining $\alpha'_t$, the detailed feature vector $g'_t$ for decoding the current character is calculated from the feature map $F_t$:
$g'_t = \sum_{j=1}^{W \times H} \alpha'_{t,j} F_{t,j}$
summing $g_t$ and $g'_t$ gives the vector $g''_t$ used for decoding the current character:
$g''_t = g_t + g'_t$
the probability distribution $y_t$ over characters is obtained by decoding in a fully connected layer and performing probability normalization in a softmax layer:
$y_t = \mathrm{softmax}(\phi(W_c g''_t + b_c))$
where $\phi$ represents the linear rectification unit, and $W_c$ and $b_c$ represent the trainable weights of the fully connected layer; the currently decoded output character $c_t$ is obtained by selecting the character with the maximum confidence in $y_t$.
4-3: training parameter settings: the training data are input to the network for training, and the network traverses the training dataset 10 times, with about 310,000 batches of data read during each traversal; the batch size is set to 64, the adaptive gradient descent method ADADELTA is used as the optimization algorithm, and the initial learning rate is set to 1;
the loss function is defined as
$\mathcal{L} = -\sum_{i=1}^{N} \log \prod_{j=1}^{|c^i|} p\left(c_j^i \mid I^i\right)$
where $N$ represents the number of samples in the optimization batch, and $p(c_j^i \mid I^i)$ represents the probability of outputting the character $c_j^i$ from the $i$th sample image at time $j$;
4-4: initialization of weights: all weight parameters in the network are
randomly initialized at the beginning of training; and
4-5: training the convolutional neural network: the probability of outputting each character of the target character string at the corresponding time step defines the cross entropy, and the cross entropy is minimized by the gradient descent method.
6. A natural scene text recognition method based on a two-dimensional
feature attention mechanism according to claim 1, characterized in that step 5
comprises the steps of:
5-1: inputting test dataset samples, selecting the character with the maximum confidence as the predicted character according to the greedy algorithm, and concatenating these characters to obtain the final predicted line text; and
5-2: after the recognition is completed, calculating the accuracy rate and the edit distance by a program.
-1/4-
Data collection: synthesizing the training set by using a public code, and downloading public text images of real scenes from the Internet as test data;
Data processing: stretching all images to make the height uniform, and making up the width by padding black borders;
Label generation: assigning the text labels corresponding to each image by code, and dividing the training data into a regular text training dataset and an irregular text training dataset according to the shape of the text in the images;
Network training: sending training data to the network for training, sending the regular training data firstly; and then sending the irregular training data;
Network testing: sending the test data to the trained recognition network, and outputting the recognized text characters in the images along with the confidence level of each character.
Fig. 1
-2/4-
Inputting a feature map
Convolutional layer
Batch normalization layer
Linear rectification unit
Outputting a feature map 1
Convolutional layer
Batch normalization layer
Linear rectification unit
Outputting a feature map 2
Convolutional layer
Batch normalization layer
Outputting a feature map 3
Add (feature map 1 + feature map 3)
Total output feature map
Fig. 2
-3/4-
Scene text image → Feature extraction network → 2D feature map → BLSTM coding network → splicing into a one-dimensional feature map → 2D feature attention network → Recognition results
Fig. 3
-4/4-

Layer name | Output size | Configuration parameters
---|---|---
Convolutional block 1 | 32×104 | Convolutional kernel size: 3×3, stride: 1×1, padding: 1, channels: 64; [kernel: 3×3, channels: 64], padding: 1, stride: 1; [kernel: 3×3, channels: 64]
Pooling layer 1 | 16×52 | Pooling kernel size: 2×2, stride: 2×2, padding: 0
Convolutional block 2 | 16×52 | Convolutional kernel size: 3×3, stride: 1×1, padding: 1, channels: 128; [kernel: 3×3, channels: 128], padding: 1, stride: 1; [kernel: 3×3, channels: 128]
Pooling layer 2 | 8×26 | Pooling kernel size: 2×2, stride: 2×2, padding: 0
Convolutional block 3 | 8×26 | Convolutional kernel size: 3×3, stride: 1×1, padding: 1, channels: 256; [kernel: 3×3, channels: 256], padding: 1, stride: 1; [kernel: 3×3, channels: 256]
Convolutional block 4 | 8×26 | Convolutional kernel size: 3×3, stride: 1×1, padding: 1, channels: 256; [kernel: 3×3, channels: 256], padding: 1, stride: 1; [kernel: 3×3, channels: 256]
Convolutional block 5 | 8×26 | Convolutional kernel size: 3×3, stride: 1×1, padding: 1, channels: 512; [kernel: 3×3, channels: 512], padding: 1, stride: 1; [kernel: 3×3, channels: 512]
Convolutional block 6 | 8×26 | Convolutional kernel size: 3×3, stride: 1×1, padding: 1, channels: 512; [kernel: 3×3, channels: 512], padding: 1, stride: 1; [kernel: 3×3, channels: 512]
Fig. 4
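For orientation only, the stack in Fig. 4 could be assembled from the convolutional block sketched earlier; the single input channel is an assumption (grayscale input):

```python
# Assumed PyTorch assembly of the Fig. 4 backbone: six convolutional
# blocks with 2x2 max pooling after blocks 1 and 2, so a 32x104 input
# yields a 512-channel 8x26 feature map.
import torch.nn as nn

def build_backbone(conv_block):
    return nn.Sequential(
        conv_block(1, 64),    nn.MaxPool2d(2),  # 32x104 -> 16x52
        conv_block(64, 128),  nn.MaxPool2d(2),  # 16x52  -> 8x26
        conv_block(128, 256),
        conv_block(256, 256),
        conv_block(256, 512),
        conv_block(512, 512),                   # output: (B, 512, 8, 26)
    )
```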
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021100480A AU2021100480A4 (en) | 2021-01-25 | 2021-01-25 | Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021100480A AU2021100480A4 (en) | 2021-01-25 | 2021-01-25 | Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2021100480A4 true AU2021100480A4 (en) | 2021-04-15 |
Family
ID=75397050
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2021100480A Active AU2021100480A4 (en) | 2021-01-25 | 2021-01-25 | Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2021100480A4 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114092931A (en) * | 2022-01-20 | 2022-02-25 | 中科视语(北京)科技有限公司 | Scene character recognition method and device, electronic equipment and storage medium |
-
2021
- 2021-01-25 AU AU2021100480A patent/AU2021100480A4/en active Active
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110378334B (en) | Natural scene text recognition method based on two-dimensional feature attention mechanism | |
CN110414498B (en) | Natural scene text recognition method based on cross attention mechanism | |
CN110287479B (en) | Named entity recognition method, electronic device and storage medium | |
CN112734775B (en) | Image labeling, image semantic segmentation and model training methods and devices | |
CN110533737A (en) | The method generated based on structure guidance Chinese character style | |
CN108415977A (en) | One is read understanding method based on the production machine of deep neural network and intensified learning | |
CN110795556A (en) | Abstract generation method based on fine-grained plug-in decoding | |
CN110555896B (en) | Image generation method and device and storage medium | |
CN111598979B (en) | Method, device and equipment for generating facial animation of virtual character and storage medium | |
CN111428727B (en) | Natural scene text recognition method based on sequence transformation correction and attention mechanism | |
CN113298151A (en) | Remote sensing image semantic description method based on multi-level feature fusion | |
CN109711465A (en) | Image method for generating captions based on MLL and ASCA-FR | |
CN112633431B (en) | Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC | |
CN113705313A (en) | Text recognition method, device, equipment and medium | |
CN114627282B (en) | Method, application method, equipment, device and medium for establishing target detection model | |
CN113762269A (en) | Chinese character OCR recognition method, system, medium and application based on neural network | |
CN110096591A (en) | Long text classification method, device, computer equipment and storage medium based on bag of words | |
CN114282059A (en) | Video retrieval method, device, equipment and storage medium | |
CN113515669A (en) | Data processing method based on artificial intelligence and related equipment | |
CN116310339A (en) | Remote sensing image segmentation method based on matrix decomposition enhanced global features | |
CN117522697A (en) | Face image generation method, face image generation system and model training method | |
CN115131698A (en) | Video attribute determination method, device, equipment and storage medium | |
AU2021100480A4 (en) | Natural Scene Text Recognition Method Based on Two-Dimensional Feature Attention Mechanism | |
CN116561274A (en) | Knowledge question-answering method based on digital human technology and natural language big model | |
CN113283432A (en) | Image recognition and character sorting method and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) |