CN111967470A - Text recognition method and system based on decoupling attention mechanism - Google Patents
- Publication number
- CN111967470A (application CN202010841738.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- neural network
- layer
- image
- convolution
- Prior art date
- Legal status
- Pending
Classifications
- G06V20/63—Scene text, e.g. street names
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/045—Combinations of networks
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06V10/40—Extraction of image or video features
- G06V30/153—Segmentation of character regions using recognition of characters or words
Abstract
The invention discloses a text recognition method and system based on a decoupling attention mechanism, comprising a feature coding module, a convolution alignment module and a text decoding module. The feature coding module extracts visual features from an input image with a deep convolutional neural network. The convolution alignment module replaces the traditional score-based recurrent alignment module: it takes multi-scale visual features from the feature coding module as input and generates attention maps channel by channel with a fully convolutional neural network. The text decoding module combines the feature map and the attention maps through a gated recurrent unit to obtain the final prediction result. The method is simple to implement, achieves high recognition accuracy, and is effective, flexible and robust; it performs well in scene text recognition, handwritten text recognition and other text recognition tasks, and has good practical value.
Description
Technical Field
The invention belongs to the technical field of pattern recognition and artificial intelligence, and in particular relates to an accurate image recognition method based on deep neural networks.
Background
In recent years, text recognition has attracted broad research interest. Thanks to deep learning and advances in sequence modeling, many text recognition techniques have achieved significant success. Connectionist temporal classification (CTC) and the attention mechanism are two popular approaches to the sequence problem; of the two, the attention mechanism shows stronger performance and has been widely studied in recent years.
The attention mechanism was first proposed for machine translation and is increasingly used for scene text recognition. Since then, attention-based techniques have driven much of the progress in the field of text recognition, where they are used to align and recognize characters. In prior work, the alignment operation of the attention mechanism is always coupled with the decoding operation. Specifically, the alignment operation of conventional attention-based techniques uses two kinds of information. The first is the feature map, i.e. the visual information obtained by encoding the image with an encoder; the second is historical decoding information, which may be the hidden state of the recurrence or the embedding vector of the previous decoding result. The core idea behind the attention mechanism is matching: given a part of the feature map, an attention score is computed by scoring how well that part matches the historical decoding information.
Conventional attention-based techniques therefore often face serious alignment problems, because coupling the alignment and decoding operations inevitably leads to the accumulation and propagation of errors. Matching-based alignment is highly susceptible to the decoding result: for example, when a string contains two similar substrings, the attention focus can easily jump from one substring to the other through the historical decoding information. This is also why the attention mechanism has been observed in the literature to struggle with long sequences, since longer sequences are more likely to contain similar substrings. This motivates us to decouple the alignment operation from the historical decoding information, thereby alleviating this negative effect.
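To make the coupling concrete, the following is a minimal PyTorch sketch of the matching-based scoring just described; it is not the patent's own code, and the module name, dimensions and the additive scoring form are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CoupledAttention(nn.Module):
    """Conventional (coupled) attention: the score at every spatial position
    depends on the decoder hidden state h_prev, i.e. on historical decoding
    information, so a decoding error distorts the next alignment."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.w_f = nn.Linear(feat_dim, attn_dim)    # projects visual features
        self.w_h = nn.Linear(hidden_dim, attn_dim)  # projects decoding history
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feat, h_prev):
        # feat: (B, W*H, feat_dim); h_prev: (B, hidden_dim)
        score = self.v(torch.tanh(self.w_f(feat)
                                  + self.w_h(h_prev).unsqueeze(1)))
        alpha = torch.softmax(score, dim=1)         # (B, W*H, 1)
        return alpha
```

The decoupled design described below removes the `h_prev` term from the alignment entirely, computing attention maps from visual features alone.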
Disclosure of Invention
The invention aims to provide a text recognition method and system based on a decoupling attention mechanism.
In order to achieve the purpose, the invention provides the following scheme:
a text recognition method based on a decoupling attention mechanism comprises the following steps:
S1, extracting image features from the text image and encoding them to obtain a feature map;
S2, aligning the feature map to obtain a target image, constructing a deep convolutional neural network model, and processing the target image with this model to obtain an attention map and to train the model;
S3, performing accurate character recognition on the feature map and the attention map with the deep convolutional neural network recognition model;
preferably, the text image is a scene text image and/or a handwritten character image;
preferably, the scene text image and/or the handwritten text image are characterized as follows:
the scene text image features comprise a scene text training data set and a scene text real evaluation data set, both of which cover a variety of font styles, lighting changes and resolution changes;
the handwritten text image features comprise a handwritten text real training data set and a handwritten text real evaluation data set, both of which contain different writing styles;
preferably, in the scene text training data set the text part is complete and occupies more than two thirds of the image area; the data set comprises a variety of font styles and is allowed to cover lighting changes and resolution changes;
preferably, the scene text real evaluation data set is captured with mobile phones and dedicated camera hardware; during capture, the text in the normalized scene text image occupies more than two thirds of the image area, tilt and blur are allowed, and the captured scene text images cover application scenes with different font styles;
preferably, the handwritten text real training data and real evaluation data are written and collected by different people, so that the training data and the evaluation data are independent.
Preferably, the text image alignment processing method comprises:
stretching the image data of the scene text training data set and the scene text real evaluation data set to a uniform size;
scaling the handwritten text real training data set and the handwritten text real evaluation data set while keeping the original aspect ratio, and padding the borders until the sizes are uniform.
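A minimal sketch of these two resizing strategies follows; the 256x32 target size is an assumption, since the patent does not fix one.

```python
from PIL import Image, ImageOps

TARGET_W, TARGET_H = 256, 32  # assumed target size

def resize_scene_text(img: Image.Image) -> Image.Image:
    """Scene text: stretch directly to the uniform size."""
    return img.resize((TARGET_W, TARGET_H))

def resize_handwritten(img: Image.Image, fill=255) -> Image.Image:
    """Handwritten text: scale while keeping the original aspect ratio,
    then pad the borders up to the uniform size."""
    scale = min(TARGET_W / img.width, TARGET_H / img.height)
    new_w, new_h = int(img.width * scale), int(img.height * scale)
    img = img.resize((new_w, new_h))
    pad_w, pad_h = TARGET_W - new_w, TARGET_H - new_h
    # pad (left, top, right, bottom) so the content stays centered
    return ImageOps.expand(img, (pad_w // 2, pad_h // 2,
                                 pad_w - pad_w // 2, pad_h - pad_h // 2),
                           fill=fill)
```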
Preferably, the deep convolutional neural network construction method comprises the following steps:
extracting multi-scale visual features based on the feature encoding;
performing convolution and deconvolution with a fully convolutional neural network to construct the deep convolutional neural network model;
in the deconvolution stage, each output feature is summed with the corresponding feature map from the convolution stage;
the convolution process performs down-sampling and the deconvolution process performs up-sampling; every convolution and deconvolution is followed by a nonlinear layer using the ReLU function, except the last deconvolution;
preferably, the network structure of the deep convolutional neural network model consists of an input layer, convolutional layers and residual layers;
preferably, the residual layer comprises a first convolutional layer, a first batch normalization layer, a first nonlinear layer, a second convolutional layer, a second batch normalization layer, a down-sampling layer and a second nonlinear layer.
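As a sketch, the residual layer described above could look as follows in PyTorch; channel counts, kernel sizes and strides are assumptions, not values taken from the patent's tables.

```python
import torch.nn as nn

class ResidualLayer(nn.Module):
    """conv-BN-ReLU-conv-BN with a conv+BN down-sampling branch on the
    shortcut, followed by the second ReLU, matching the layer order above."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # down-sampling layer realized by a convolution plus batch normalization
        self.downsample = None
        if stride != 1 or in_ch != out_ch:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch))
        self.relu2 = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.bn2(self.conv2(self.relu1(self.bn1(self.conv1(x)))))
        return self.relu2(out + identity)
```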
Preferably, the training of the deep convolutional neural network model in S2 uses the back-propagation algorithm: all parameters of the network model are updated by computing gradients starting from the last layer and propagating them layer by layer;
preferably, the training strategy of the deep convolutional neural network model is supervised: a general deep network recognition model is trained with text image data and the corresponding label information;
preferably, the input image of the deep convolutional neural network model is a handwritten text image and/or a scene text image, and the output is the character sequence in the text image and/or the scene text image.
Preferably, the parameters of the deep convolutional neural network model training are set as follows:
the number of iterations of the deep convolutional neural network is 1,000,000;
the deep convolutional neural network optimizer is Adadelta;
the deep convolutional neural network learning rate is 1.0;
deep convolutional neural network learning rate updating strategy: the learning rate is reduced to one tenth of its value at 50% and at 75% of the total number of iterations.
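Under these settings, the training setup could be expressed as follows in PyTorch; `model` is a stand-in for the full recognition network, and the training-step body is elided.

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the full recognition network

TOTAL_ITERS = 1_000_000
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
# reduce the learning rate to one tenth at 50% and 75% of the total iterations
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[TOTAL_ITERS // 2, TOTAL_ITERS * 3 // 4], gamma=0.1)

for it in range(TOTAL_ITERS):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad()
    scheduler.step()
```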
Preferably, the specific method for the character recognition in S3 is as follows:
let F_{x,y} denote the feature map and α_{t,x,y} the attention map at time t obtained by convolution alignment; the semantic vector c_t is computed by equation (1),

c_t = Σ_{x=1}^{W} Σ_{y=1}^{H} α_{t,x,y} F_{x,y},   (1)

where W and H are the width and height of the feature map; at time t, the output y_t is

y_t = W_o h_t + b_o,   (2)

where W_o and b_o are learnable parameters and h_t denotes the hidden state of the gated recurrent unit at time t; h_t is computed as

h_t = GRU((e_{t-1}, c_t), h_{t-1}),   (3)

where e_{t-1} denotes the embedding vector of the previous output y_{t-1}; the final loss function Loss is computed as

Loss = −Σ_{t=1}^{T} log P(g_t | I, θ),   (4)

where I denotes the input image, θ denotes all learnable parameters of the deep neural network model, and g_t denotes the sample label value at time t.
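A minimal PyTorch sketch of equations (1)-(3) follows; the class name, dimensions and the use of nn.GRUCell and nn.Embedding are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One decoding step: context c_t from the attention map and the feature
    map (eq. 1), one GRU step (eq. 3), and the output projection (eq. 2)."""
    def __init__(self, feat_dim, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(num_classes, embed_dim)  # e_{t-1}
        self.gru = nn.GRUCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)      # W_o, b_o

    def forward(self, feat, alpha_t, y_prev, h_prev):
        # feat: (B, C, H, W); alpha_t: (B, 1, H, W) attention map at step t
        c_t = (alpha_t * feat).sum(dim=(2, 3))             # equation (1)
        h_t = self.gru(torch.cat([self.embed(y_prev), c_t], dim=1),
                       h_prev)                             # equation (3)
        y_t = self.out(h_t)                                # equation (2)
        return y_t, h_t

# equation (4) amounts to cross-entropy against the labels g_t, summed over t:
# loss = sum(nn.functional.cross_entropy(y_t, g_t) for each step t)
```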
A text recognition system based on a decoupling attention mechanism comprises a feature coding module, a convolution alignment module and a text decoding module,
the feature coding module extracts visual features from the text image based on a deep convolutional neural network;
the convolution alignment module takes multi-scale visual features from the feature coding module and generates attention maps channel by channel through a deep convolutional neural network;
and the text decoding module combines the feature map and the attention maps through a gated recurrent unit to obtain the final prediction result.
Preferably, the network structure of the deep convolutional neural network unit is an input layer unit, a convolutional layer unit and a residual layer unit;
preferably, the residual layer unit is divided into a first convolution layer unit, a first batch of normalization layer units, a first nonlinear layer unit, a second convolution layer unit, a second batch of normalization layer units, a down-sampling layer unit and a second nonlinear layer unit;
preferably, the nonlinear layer units in the residual layer unit all adopt a ReLU activation function;
preferably, the downsampling layer unit is implemented by the convolutional layer unit and the batch normalization layer unit.
The invention has the technical effects that:
(1) The present invention decouples the conventional attention module. Compared with the traditional attention mechanism, the alignment no longer depends on information fed back from the decoding stage, which avoids the accumulation and propagation of decoding errors and yields higher recognition accuracy.
(2) The method is simple to use, can easily be embedded into other models, is very flexible, and switches freely between one-dimensional and two-dimensional text.
(3) The back-propagation algorithm adjusts the convolution kernel parameters automatically, producing more robust filters that adapt to a variety of complex environments.
(4) Compared with manual transcription, the invention recognizes scene text and handwritten text automatically, saving labor and material resources.
(5) Through the decoupled attention algorithm, the invention provides more reliable alignment for the attention mechanism and is notably more robust than the traditional attention mechanism on long text.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a block diagram of the deep convolutional network recognition model structure of the present invention.
FIG. 2 is a flow chart of a text recognition method based on a decoupling attention mechanism according to the invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Example 1: as shown in fig. 1, the text recognition system based on the decoupling attention mechanism includes a feature encoding module, a convolution alignment module and a text decoding module;
the feature coding module extracts visual features from the text image based on a deep convolutional neural network;
the convolution alignment module takes multi-scale visual features from the feature coding module and generates attention maps channel by channel through a deep convolutional neural network;
and the text decoding module combines the feature map and the attention maps through a gated recurrent unit to obtain the final prediction result.
As shown in fig. 2, the text recognition method based on the decoupling attention mechanism specifically includes the following steps:
Firstly, the feature coding module performs feature extraction and encoding on a scene text image and/or a handwritten character image to form a feature map;
the scene text image features comprise a scene text training data set and a scene text real evaluation data set, both of which cover a variety of font styles, lighting changes and resolution changes;
the handwritten text image features comprise a handwritten text real training data set and a handwritten text real evaluation data set, both of which contain different writing styles;
in the scene text image training data the text part is complete and occupies more than two thirds of the image area; the data comprises a variety of font styles and is allowed to cover a certain degree of lighting change and resolution change;
the scene text real evaluation data set is captured with camera equipment such as mobile phones and dedicated hardware; during capture, the text in the normalized scene text image occupies more than two thirds of the image area, a certain degree of tilt and blur is allowed, and the captured scene text images cover application scenes with different font styles;
the handwritten text real training data and real evaluation data are written and collected by different people, so that the training data and the evaluation data are independent;
Secondly, the convolution alignment module performs convolution alignment on the scene text image and/or the handwritten character image; the structure of the convolution alignment module is shown in Table 1:
the image data of the scene text training data set and the scene text real evaluation data set are stretched to a uniform size;
the handwritten text real training data set and the handwritten text real evaluation data set are scaled while keeping the original aspect ratio, and the borders are padded until the sizes are uniform;
TABLE 1
The deep convolutional neural network is constructed and trained as shown in Table 2. The construction method is as follows: based on a convolutional neural network, visual features are extracted from the scene text image and/or the handwritten character image, and multi-scale visual features from the feature coding module are taken as input. Convolution and deconvolution are performed with a fully convolutional neural network; in the deconvolution stage, each output feature is summed with the corresponding feature map from the convolution stage. The convolution process performs down-sampling and the deconvolution process performs up-sampling; every convolution and deconvolution is followed by a nonlinear layer using the ReLU function, except the last deconvolution. The number of output channels of the last deconvolution layer is maxT, whose value depends on the text type: 25 for scene text and 150 for handwritten text. The last nonlinear layer uses a Sigmoid function to keep the output attention maps between 0 and 1. The deep neural network model is trained with the back-propagation algorithm: all parameters of the network model are updated by computing gradients starting from the last layer and propagating them layer by layer;
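The following is a simplified PyTorch sketch of this convolution alignment module (down-sampling convolutions, up-sampling deconvolutions with additive skip connections, ReLU everywhere except a final Sigmoid over maxT channels); the depth and channel counts are assumptions, since the contents of Tables 1 and 2 are not reproduced in this text.

```python
import torch
import torch.nn as nn

class ConvAlignmentModule(nn.Module):
    """Two down-sampling convolutions, two up-sampling deconvolutions with
    additive skips from the matching conv stage; the last deconvolution
    outputs maxT attention channels squashed to (0, 1) by Sigmoid."""
    def __init__(self, in_ch=512, mid_ch=128, max_t=25):  # max_t=150 for handwriting
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 3, 2, 1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(mid_ch, mid_ch, 3, 2, 1), nn.ReLU())
        self.up1 = nn.ConvTranspose2d(mid_ch, mid_ch, 4, 2, 1)
        self.relu = nn.ReLU()
        self.up2 = nn.ConvTranspose2d(mid_ch, max_t, 4, 2, 1)  # maxT channels

    def forward(self, feat):
        d1 = self.down1(feat)              # convolution = down-sampling
        d2 = self.down2(d1)
        u1 = self.relu(self.up1(d2) + d1)  # add the matching conv feature map
        attn = torch.sigmoid(self.up2(u1)) # attention maps kept between 0 and 1
        return attn                        # (B, maxT, H, W)
```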
TABLE 2
TABLE 3
As shown in Table 3, the residual layer comprises a first convolutional layer, a first batch normalization layer, a first nonlinear layer, a second convolutional layer, a second batch normalization layer, a down-sampling layer and a second nonlinear layer;
the nonlinear layers in the residual layer all use the ReLU activation function;
the down-sampling layer is realized by a convolutional layer and a batch normalization layer;
the deep neural network model training strategy adopts a supervision mode: training a universal deep network recognition model by using text image data and corresponding labeling information;
the input image of the deep neural network model is a handwritten text image and/or a scene text image, and the input image is output as a character sequence in the text image and/or the scene text image;
the parameters of the deep neural network model training are set as follows:
the number of iterations of the deep neural network is 1,000,000;
the deep neural network optimizer is Adadelta;
the deep neural network learning rate is 1.0;
deep neural network learning rate updating strategy: the learning rate is reduced to one tenth of its value at 50% and at 75% of the total number of iterations.
Thirdly, character recognition is performed on the feature map and the attention map by the character recognition module: with the feature map and the attention map as input, the image is accurately recognized by the deep network recognition model based on the decoupling attention mechanism;
The specific method for character recognition is as follows:
let F_{x,y} denote the feature map and α_{t,x,y} the attention map at time t obtained by convolution alignment; the semantic vector c_t is computed by equation (1),

c_t = Σ_{x=1}^{W} Σ_{y=1}^{H} α_{t,x,y} F_{x,y},   (1)

where W and H are the width and height of the feature map; at time t, the output y_t is

y_t = W_o h_t + b_o,   (2)

where W_o and b_o are learnable parameters and h_t denotes the hidden state of the gated recurrent unit at time t; h_t is computed as

h_t = GRU((e_{t-1}, c_t), h_{t-1}),   (3)

where e_{t-1} denotes the embedding vector of the previous output y_{t-1}; the final loss function Loss is computed as

Loss = −Σ_{t=1}^{T} log P(g_t | I, θ),   (4)

where I denotes the input image, θ denotes all learnable parameters of the deep neural network model, and g_t denotes the sample label value at time t;
a text image is input, and the deep network recognition model based on the decoupling attention mechanism recognizes it accurately to obtain the characters in the text image.
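Tying the three modules together, one recognition pass might look as follows; this reuses the sketches above, and the greedy argmax decoding loop, the start symbol and the per-step attention indexing are assumptions for illustration.

```python
import torch

def recognize(image, encoder, align, decoder, max_t=25, sos=0):
    """One forward pass: feature coding, convolution alignment, text decoding."""
    feat = encoder(image)                       # feature coding module
    attn = align(feat)                          # convolution alignment module
    B = image.size(0)
    y_prev = torch.full((B,), sos, dtype=torch.long)
    h = torch.zeros(B, decoder.gru.hidden_size)
    chars = []
    for t in range(max_t):                      # text decoding module
        logits, h = decoder(feat, attn[:, t:t+1], y_prev, h)
        y_prev = logits.argmax(dim=1)           # greedy choice per step
        chars.append(y_prev)
    return torch.stack(chars, dim=1)            # (B, maxT) character indices
```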
The above-described embodiments merely illustrate the preferred embodiments of the present invention and do not limit its scope. Various modifications and improvements made by those skilled in the art to the technical solutions of the present invention without departing from its spirit shall fall within the protection scope defined by the claims.
Claims (10)
1. A text recognition method based on a decoupling attention mechanism is characterized by comprising the following steps:
S1, extracting image features from the text image and encoding them to obtain a feature map;
S2, aligning the feature map to obtain a target image, constructing a deep convolutional neural network model, and processing the target image with this model to obtain an attention map and to train the model;
and S3, performing accurate character recognition on the feature map and the attention map with the deep convolutional neural network recognition model.
2. The text recognition method based on the decoupling attention mechanism as claimed in claim 1, wherein:
the text image is a scene text image and/or a handwritten character image;
the scene text image and/or the handwritten character image are characterized as follows:
the scene text image features comprise a scene text training data set and a scene text real evaluation data set, both of which cover a variety of font styles, lighting changes and resolution changes;
the handwritten text image features comprise a handwritten text real training data set and a handwritten text real evaluation data set, both of which contain different writing styles.
3. The text recognition method based on the decoupling attention mechanism as claimed in claim 2, wherein:
in the scene text training data set, the text part is complete and occupies more than two thirds of the image area; the data set comprises a variety of font styles and is allowed to cover lighting changes and resolution changes;
the scene text real evaluation data set is captured with mobile phones and dedicated camera hardware; during capture, the text in the normalized scene text image occupies more than two thirds of the image area, tilt and blur are allowed, and the captured scene text images cover application scenes with a variety of font styles;
the handwritten text real training data set and the handwritten text real evaluation data set are written and collected by different people, so that the training data and the evaluation data are independent.
4. The text recognition method based on the decoupling attention mechanism as claimed in claim 2, wherein:
the text image alignment processing method comprises the following steps:
stretching the image data of the scene text training data set and the scene text real evaluation data set to a uniform size;
and scaling the handwritten text real training data set and the handwritten text real evaluation data set while keeping the original aspect ratio, and padding the borders until the sizes are uniform.
5. The text recognition method based on the decoupling attention mechanism as claimed in claim 1, wherein:
in S2, the method for constructing the deep convolutional neural network includes:
extracting multi-scale visual features based on the feature encoding;
performing convolution and deconvolution with a fully convolutional neural network to construct the deep convolutional neural network model;
in the deconvolution stage, each output feature is summed with the corresponding feature map from the convolution stage;
the convolution process performs down-sampling and the deconvolution process performs up-sampling; every convolution and deconvolution is followed by a nonlinear layer using the ReLU function, except the last deconvolution;
the network structure of the deep convolutional neural network model consists of an input layer, convolutional layers and residual layers;
the residual layer comprises a first convolutional layer, a first batch normalization layer, a first nonlinear layer, a second convolutional layer, a second batch normalization layer, a down-sampling layer and a second nonlinear layer.
6. The text recognition method based on the decoupling attention mechanism as claimed in claim 1, wherein:
the training of the deep convolutional neural network model in S2 uses the back-propagation algorithm: all parameters of the network model are updated by computing gradients starting from the last layer and propagating them layer by layer;
the training strategy of the deep convolutional neural network model is supervised: a general deep network recognition model is trained with text image data and the corresponding label information;
and the input image of the deep convolutional neural network model is the handwritten text image and/or the scene text image, and the output is the character sequence in the text image and/or the scene text image.
7. The text recognition method based on the decoupling attention mechanism as claimed in claim 6, wherein:
the parameters of the deep convolutional neural network model training are set as follows:
the number of iterations of the deep convolutional neural network is 1,000,000;
the deep convolutional neural network optimizer is Adadelta;
the deep convolutional neural network learning rate is 1.0;
the deep convolutional neural network learning rate updating strategy is as follows: the learning rate is reduced to one tenth of its value at 50% and at 75% of the total number of iterations.
8. The text recognition method based on the decoupling attention mechanism as claimed in claim 1, wherein:
the specific method for the character recognition in S3 is as follows:
let F_{x,y} denote the feature map and α_{t,x,y} the attention map at time t obtained by convolution alignment; the semantic vector c_t is computed by equation (1),

c_t = Σ_{x=1}^{W} Σ_{y=1}^{H} α_{t,x,y} F_{x,y},   (1)

where W and H are the width and height of the feature map; at time t, the output y_t is

y_t = W_o h_t + b_o,   (2)

where W_o and b_o are learnable parameters and h_t denotes the hidden state of the gated recurrent unit at time t; h_t is computed as

h_t = GRU((e_{t-1}, c_t), h_{t-1}),   (3)

where e_{t-1} denotes the embedding vector of the previous output y_{t-1}; the final loss function Loss is computed as

Loss = −Σ_{t=1}^{T} log P(g_t | I, θ),   (4)

where I denotes the input image, θ denotes all learnable parameters of the deep neural network model, and g_t denotes the sample label value at time t.
9. A text recognition system based on a decoupling attention mechanism is characterized by comprising a feature coding module, a convolution alignment module and a text decoding module,
the feature coding module extracts visual features from the text image based on a deep convolutional neural network;
the convolution alignment module takes multi-scale visual features from the feature coding module and generates attention maps channel by channel through a deep convolutional neural network;
the text decoding module combines the feature map and the attention maps through a gated recurrent unit to obtain the final prediction result.
10. The system for text recognition based on a decoupled attention mechanism of claim 9,
the network structure of the deep convolutional neural network unit comprises an input layer unit, a convolutional layer unit and a residual layer unit;
the residual layer unit comprises a first convolutional layer unit, a first batch normalization layer unit, a first nonlinear layer unit, a second convolutional layer unit, a second batch normalization layer unit, a down-sampling layer unit and a second nonlinear layer unit;
the nonlinear layer units in the residual layer units all adopt ReLU activation functions;
the down-sampling layer unit is realized by the convolution layer unit and the batch normalization layer unit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010841738.6A CN111967470A (en) | 2020-08-20 | 2020-08-20 | Text recognition method and system based on decoupling attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010841738.6A CN111967470A (en) | 2020-08-20 | 2020-08-20 | Text recognition method and system based on decoupling attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111967470A true CN111967470A (en) | 2020-11-20 |
Family
ID=73387925
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010841738.6A Pending CN111967470A (en) | 2020-08-20 | 2020-08-20 | Text recognition method and system based on decoupling attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111967470A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112580738A (en) * | 2020-12-25 | 2021-03-30 | 特赞(上海)信息科技有限公司 | AttentionOCR text recognition method and device based on improvement |
CN112597925A (en) * | 2020-12-28 | 2021-04-02 | 作业帮教育科技(北京)有限公司 | Handwritten handwriting recognition/extraction and erasing method, handwritten handwriting erasing system and electronic equipment |
CN112686345A (en) * | 2020-12-31 | 2021-04-20 | 江南大学 | Off-line English handwriting recognition method based on attention mechanism |
CN112686219A (en) * | 2021-03-11 | 2021-04-20 | 北京世纪好未来教育科技有限公司 | Handwritten text recognition method and computer storage medium |
CN112733830A (en) * | 2020-12-31 | 2021-04-30 | 上海芯翌智能科技有限公司 | Shop signboard identification method and device, storage medium and computer equipment |
CN113052175A (en) * | 2021-03-26 | 2021-06-29 | 北京百度网讯科技有限公司 | Target detection method and device, electronic equipment and readable storage medium |
CN113065550A (en) * | 2021-03-12 | 2021-07-02 | 国网河北省电力有限公司 | Text recognition method based on self-attention mechanism |
CN113158776A (en) * | 2021-03-08 | 2021-07-23 | 国网河北省电力有限公司 | Invoice text recognition method and device based on coding and decoding structure |
CN113240056A (en) * | 2021-07-12 | 2021-08-10 | 北京百度网讯科技有限公司 | Multi-mode data joint learning model training method and device |
CN113705730A (en) * | 2021-09-24 | 2021-11-26 | 江苏城乡建设职业学院 | Handwriting equation image recognition method based on convolution attention and label sampling |
CN113807340A (en) * | 2021-09-07 | 2021-12-17 | 南京信息工程大学 | Method for recognizing irregular natural scene text based on attention mechanism |
CN114170468A (en) * | 2022-02-14 | 2022-03-11 | 阿里巴巴达摩院(杭州)科技有限公司 | Text recognition method, storage medium and computer terminal |
RU2768211C1 (en) * | 2020-11-23 | 2022-03-23 | Общество с ограниченной ответственностью "Аби Продакшн" | Optical character recognition by means of combination of neural network models |
CN114548067A (en) * | 2022-01-14 | 2022-05-27 | 哈尔滨工业大学(深圳) | Multi-modal named entity recognition method based on template and related equipment |
CN117934974A (en) * | 2024-03-21 | 2024-04-26 | 中国科学技术大学 | Scene text task processing method, system, equipment and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110717336A (en) * | 2019-09-23 | 2020-01-21 | 华南理工大学 | Scene text recognition method based on semantic relevance prediction and attention decoding |
- 2020-08-20: application CN202010841738.6A filed in China; published as CN111967470A, status pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110717336A (en) * | 2019-09-23 | 2020-01-21 | 华南理工大学 | Scene text recognition method based on semantic relevance prediction and attention decoding |
Non-Patent Citations (1)
Title |
---|
Wang Tianwei et al.: "Decoupled Attention Network for Text Recognition", 34th AAAI Conference on Artificial Intelligence, pages 12216-12224 *
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
RU2768211C1 (en) * | 2020-11-23 | 2022-03-23 | Общество с ограниченной ответственностью "Аби Продакшн" | Optical character recognition by means of combination of neural network models |
US11568140B2 (en) | 2020-11-23 | 2023-01-31 | Abbyy Development Inc. | Optical character recognition using a combination of neural network models |
CN112580738A (en) * | 2020-12-25 | 2021-03-30 | 特赞(上海)信息科技有限公司 | AttentionOCR text recognition method and device based on improvement |
CN112580738B (en) * | 2020-12-25 | 2021-07-23 | 特赞(上海)信息科技有限公司 | AttentionOCR text recognition method and device based on improvement |
CN112597925A (en) * | 2020-12-28 | 2021-04-02 | 作业帮教育科技(北京)有限公司 | Handwritten handwriting recognition/extraction and erasing method, handwritten handwriting erasing system and electronic equipment |
CN112597925B (en) * | 2020-12-28 | 2023-08-29 | 北京百舸飞驰科技有限公司 | Handwriting recognition/extraction and erasure method, handwriting recognition/extraction and erasure system and electronic equipment |
CN112686345A (en) * | 2020-12-31 | 2021-04-20 | 江南大学 | Off-line English handwriting recognition method based on attention mechanism |
CN112733830A (en) * | 2020-12-31 | 2021-04-30 | 上海芯翌智能科技有限公司 | Shop signboard identification method and device, storage medium and computer equipment |
CN112686345B (en) * | 2020-12-31 | 2024-03-15 | 江南大学 | Offline English handwriting recognition method based on attention mechanism |
CN113158776A (en) * | 2021-03-08 | 2021-07-23 | 国网河北省电力有限公司 | Invoice text recognition method and device based on coding and decoding structure |
CN113158776B (en) * | 2021-03-08 | 2022-11-11 | 国网河北省电力有限公司 | Invoice text recognition method and device based on coding and decoding structure |
CN112686219A (en) * | 2021-03-11 | 2021-04-20 | 北京世纪好未来教育科技有限公司 | Handwritten text recognition method and computer storage medium |
CN113065550A (en) * | 2021-03-12 | 2021-07-02 | 国网河北省电力有限公司 | Text recognition method based on self-attention mechanism |
CN113052175B (en) * | 2021-03-26 | 2024-03-29 | 北京百度网讯科技有限公司 | Target detection method, target detection device, electronic equipment and readable storage medium |
CN113052175A (en) * | 2021-03-26 | 2021-06-29 | 北京百度网讯科技有限公司 | Target detection method and device, electronic equipment and readable storage medium |
CN113240056A (en) * | 2021-07-12 | 2021-08-10 | 北京百度网讯科技有限公司 | Multi-mode data joint learning model training method and device |
CN113807340A (en) * | 2021-09-07 | 2021-12-17 | 南京信息工程大学 | Method for recognizing irregular natural scene text based on attention mechanism |
CN113807340B (en) * | 2021-09-07 | 2024-03-15 | 南京信息工程大学 | Attention mechanism-based irregular natural scene text recognition method |
CN113705730A (en) * | 2021-09-24 | 2021-11-26 | 江苏城乡建设职业学院 | Handwriting equation image recognition method based on convolution attention and label sampling |
CN114548067B (en) * | 2022-01-14 | 2023-04-18 | 哈尔滨工业大学(深圳) | Template-based multi-modal named entity recognition method and related equipment |
CN114548067A (en) * | 2022-01-14 | 2022-05-27 | 哈尔滨工业大学(深圳) | Multi-modal named entity recognition method based on template and related equipment |
CN114170468B (en) * | 2022-02-14 | 2022-05-31 | 阿里巴巴达摩院(杭州)科技有限公司 | Text recognition method, storage medium and computer terminal |
CN114170468A (en) * | 2022-02-14 | 2022-03-11 | 阿里巴巴达摩院(杭州)科技有限公司 | Text recognition method, storage medium and computer terminal |
CN117934974A (en) * | 2024-03-21 | 2024-04-26 | 中国科学技术大学 | Scene text task processing method, system, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111967470A (en) | Text recognition method and system based on decoupling attention mechanism | |
CN109543667B (en) | Text recognition method based on attention mechanism | |
CN109558832B (en) | Human body posture detection method, device, equipment and storage medium | |
CN109726657B (en) | Deep learning scene text sequence recognition method | |
CN112733822B (en) | End-to-end text detection and identification method | |
CN114187450A (en) | Remote sensing image semantic segmentation method based on deep learning | |
CN113780149A (en) | Method for efficiently extracting building target of remote sensing image based on attention mechanism | |
CN114596500B (en) | Remote sensing image semantic segmentation method based on channel-space attention and DeeplabV plus | |
CN108898138A (en) | Scene text recognition methods based on deep learning | |
CN109635726B (en) | Landslide identification method based on combination of symmetric deep network and multi-scale pooling | |
CN111428727B (en) | Natural scene text recognition method based on sequence transformation correction and attention mechanism | |
CN109934272B (en) | Image matching method based on full convolution network | |
CN111310766A (en) | License plate identification method based on coding and decoding and two-dimensional attention mechanism | |
CN111639564A (en) | Video pedestrian re-identification method based on multi-attention heterogeneous network | |
CN113011288A (en) | Mask RCNN algorithm-based remote sensing building detection method | |
CN110969089A (en) | Lightweight face recognition system and recognition method under noise environment | |
CN111079514A (en) | Face recognition method based on CLBP and convolutional neural network | |
CN111985332A (en) | Gait recognition method for improving loss function based on deep learning | |
CN116524189A (en) | High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization | |
CN114581905A (en) | Scene text recognition method and system based on semantic enhancement mechanism | |
CN117079288B (en) | Method and model for extracting key information for recognizing Chinese semantics in scene | |
AU2021104479A4 (en) | Text recognition method and system based on decoupled attention mechanism | |
CN116758621A (en) | Self-attention mechanism-based face expression depth convolution identification method for shielding people | |
CN117058437A (en) | Flower classification method, system, equipment and medium based on knowledge distillation | |
CN116630610A (en) | ROI region extraction method based on semantic segmentation model and conditional random field |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20201120 |