CN109389091B - Character recognition system and method based on combination of neural network and attention mechanism


Info

Publication number: CN109389091B (application CN201811230112.0A)
Authority: CN (China)
Prior art keywords: attention, long short-term memory network
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN109389091A
Inventors: 杨宏志, 庞宇, 王慧倩
Current and original assignee: Chongqing University of Posts and Telecommunications
Application filed by Chongqing University of Posts and Telecommunications
Priority and filing date: 2018-10-22 (priority to CN201811230112.0A)
Publication of CN109389091A: 2019-02-26
Publication of CN109389091B (grant): 2022-05-03

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition


Abstract

The invention claims a character recognition system and method based on the combination of a neural network and an attention mechanism. The method specifically comprises the following steps: a convolutional neural network feature extraction module extracts the spatial features of the character image; the spatial features extracted by the convolutional neural network are fed into a bidirectional long short-term memory network module, which extracts the sequence features of the characters; the extracted feature vectors are semantically encoded, and attention weights are then assigned to the feature vectors through an attention mechanism, so that attention is focused on the feature vectors with higher weights; the decoding part of the model is implemented by a nested long short-term memory network, with the attention-focused features and the prediction information of the previous moment serving as its input, and long short-term memory networks are used before and after it in order to preserve the temporal characteristics of the feature vectors and to ensure that the attention position of the model changes continuously with time. The method can detect character regions in natural scenes more accurately and has a good detection effect on small target characters and on text with small inclination angles.

Description

Character recognition system and method based on combination of neural network and attention mechanism
Technical Field
The invention belongs to the field of character image recognition in natural scenes, and relates to an algorithm combining a convolutional neural network, a long short-term memory network and an attention mechanism.
Background
A natural scene is the living environment we are in, and natural scene images contain various kinds of visual information, such as text, automobiles, landscapes, organisms and buildings; these elements constitute the main components of natural scene content.
Digit recognition in natural scenes belongs to the category of text recognition in natural scenes. Research on text recognition in natural scenes started in the 1990s, but it remains an unsolved problem. In general, the text recognition task in natural scenes includes two parts: text region detection and text recognition. Recognition builds on detection, with the detected text box serving as the input to recognition. With the development of deep learning, detection was the earliest studied part and the related technology is relatively mature, so what decides the recognition effect is the design of the recognition algorithm. Target recognition is currently an active field of deep learning and new applications keep emerging; since text is everyday visual information, its recognition has important research significance, and improving character recognition accuracy is also of great help to the NLP field. However, because of factors such as the position, deformation and illumination of characters in natural scenes, and because the background of characters in natural scenes is quite complex, many technical difficulties must be overcome for recognition.
At present, most research methods are based on top-down algorithm models. Jaderberg et al. designed an end-to-end character recognition method based on a convolutional neural network with structured output, but the text length must be fixed and the recognition effect on long-sequence text is poor. Shi et al. proposed an end-to-end recognition method based on a convolutional neural network, a recurrent neural network and sequence classification, but its recognition effect on complex character images is poor.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A character recognition system and method based on the combination of a neural network and an attention mechanism is proposed, which can detect character regions in natural scenes more accurately and has a good detection effect on small target characters and on text with small inclination angles. The technical scheme of the invention is as follows:
a word recognition system based on a combination of neural networks and attention mechanisms, comprising: the character recognition system comprises a character extraction module, a coding and attention module and a decoding module, wherein the character extraction module adopts a structure of combining a convolutional neural network and a bidirectional long and short term memory network, (the convolutional neural network is used for extracting the space characteristics of character images, the bidirectional long and short term memory network is used for extracting the sequence characteristics of characters)
an encoding and attention module, which performs a weighted summation over the hidden states h_i of the encoding stage of the bidirectional long short-term memory network to obtain attention weights at different moments, and then predicts the output at the current moment through attention focusing;
the decoding module adopts a nested long short-term memory network. The decoding part analyses the intermediate semantic information generated by encoding; decoding focuses attention on the encoded states through the attention mechanism, then learns the decoded information of past moments through the nested long short-term memory network, which is used to extract the sequence information of the text, and predicts the output at the current moment from the state at the previous moment.
Further, the convolutional neural network module comprises convolutional layer 1, pooling layer 1, convolution group 2, convolutional layer 3, pooling layer 2, convolutional layer 4, pooling layer 3, a batch normalization layer, convolutional layer 5, pooling layer 4, a batch normalization layer and a Dropout layer.
Further, the detailed parameters of the convolutional neural network are set as follows: convolutional layer 1 has 64 convolution kernels of size 5 × 5, stride 1 and padding 1; the pooling layers of this patent all use the mean-pooling method with the same parameter settings: kernel size 3 × 3, stride 2, padding 0. Convolution group 2 comprises parallel convolutional layers A and B with 7 × 7 and 5 × 5 kernels, followed by a 1 × 1 convolutional layer with C kernels stacked on the parallel layers, where C denotes the number of convolution kernels; the channel dimension can be reduced by adjusting C, which speeds up computation and reduces computational cost. Convolutional layers 3, 4 and 5 all use 3 × 3 kernels, with 128, 128 and 256 kernels respectively, stride 1 and padding 1. The batch normalization layer standardizes each mini-batch of data: it computes the mean and variance of the data, normalizes the data, and then applies a learned scale and shift. The Dropout layer can be seen as randomly sampling sub-models and then averaging them, i.e., hidden units are dropped at random.
Further, when the resolution of the input image is 800 × 600, the above convolution and pooling process finally yields a feature map of size 1 × 256 × 50 × 37, giving a sequence of 1 × 256 feature vectors. An acceleration layer is then added; this layer is an optimization method provided by Caffe, which converts the pixels in the small window covered by a convolution kernel into a row and stores it in contiguous memory.
Furthermore, the parameter dimension of the bidirectional long short-term memory network is 512. By fusing the left-to-right and right-to-left passes over t = 1, 2, ..., T, the states of the hidden layers are superimposed. The long short-term memory network does not change the position of the feature sequence of the feature map, so it is translation invariant and the receptive field of the original image corresponding to each feature vector is unchanged. The output dimension is 1 × 1024 × 50 × 37. The hidden layer of the bidirectional long short-term memory network contains the context state of the text sequence and serves as the encoding process of the attention model; the feature vector set is [h_1, h_2, h_3, ..., h_T], where a feature vector h_i is generated at each moment by combining the features of the two directions: h_i = [h_i, h_i*].
Further, the encoding and attention module is specified as follows: the semantic code C_i is the key of the attention model. The 1 × 1024 feature vector sequence generated by the bidirectional recurrent neural network is semantically encoded; the aim is to perform a weighted summation over the hidden states h_i of the encoding stage to obtain the attention weights at different moments, and to predict the output at the current moment through attention focusing. The attention mechanism performs feature focusing on a feature sequence S of vectors with length T = 20. When the last character is predicted, attention is focused on the input text at the current moment and on the hidden state at a certain past moment; the attention weights are distributed over the hidden states at different moments, and the larger the weight, the more the attention is focused there. In the attention model, [x_1, x_2, x_3, ..., x_T] denotes the input at the current moment, A_{t,i} denotes the attention focusing weight, and C_t denotes the weighted value of the features h_i at time t.
Further, the specific formulas for A_{t,i} and C_t are as follows:
A_{t,i} = exp(e_{t,i}) / Σ_{k=1}^{T} exp(e_{t,k})    (13)
C_t = Σ_{i=1}^{T} A_{t,i} h_i    (14)
e_{t,i} = f_att(s_{t-1}, h_i)    (15)
s_t = f(s_{t-1}, y_{t-1}, C_t)    (16)
y_t = g(y_{t-1}, s_t, C_t)    (17)
f_att(s_{t-1}, h_i) is a correlation function expressing the degree of correlation between the decoder state at time t-1 and the encoder feature h_i; y_t denotes the prediction output of the decoding module, and g(y_{t-1}, s_t, C_t) denotes a probability output function.
Furthermore, the decoding module adopts a nested long short-term memory network to recognize the attention-focused feature vectors, where the input at time t is the t-th focused feature vector. The nested long short-term memory network selectively reads and writes using standard long short-term memory network gates. The prediction output y_t at a time t of decoding is jointly determined by the prediction output y_{t-1} at the previous time, the decoder hidden state s_t and the attention weighted value C_t; the memory cell function is formulated as follows:
Ce_t = IM_t(f_t ⊙ Ce_{t-1}, i_t ⊙ g_t)    (18)
f_t denotes a non-linear function of the forward propagation, IM_t denotes the internal memory state of the nested long short-term memory network, Ce_{t-1} denotes the state of the memory cell at the previous time t-1, and g_t denotes the gating function of the long short-term memory network.
Finally, Softmax is used to express the output in probability form, and the class with the largest probability is selected as the prediction result; the long short-term memory network provides a predicted value at each time t, and the characters before the end symbol are then taken in time order to form a character string, which is the required result.
A character recognition method based on the combination of a neural network and an attention mechanism comprises the following steps: a feature extraction step, an encoding and attention step and a decoding step. The feature extraction step adopts a structure combining a convolutional neural network and a bidirectional long short-term memory network: the convolutional neural network is used to extract the spatial features of the character image, and the bidirectional long short-term memory network is used to extract the sequence features of the characters;
an encoding and attention step, which performs a weighted summation over the hidden states h_i of the encoding stage of the bidirectional long short-term memory network to obtain attention weights at different moments, and then predicts the output at the current moment through attention focusing;
the decoding step adopts a nested long-short term memory network, the decoding part analyzes intermediate semantic information generated by encoding, the decoding needs to focus attention on the encoded state by using an attention mechanism, then the nested long-short term memory network learns the decoding information at a past moment to extract the sequence information of the text, and the output of the current moment is predicted by the state at the previous moment.
The invention has the following advantages and beneficial effects:
the invention can memorize the past information and solve the long-term dependence problem by introducing an attention mechanism, nesting the long-term and short-term memory network and then fusing the convolutional neural network and the bidirectional long-term and short-term memory network. The feature recognition is greatly advantageous by focusing the target on a feature vector through the attention mechanism.
The main innovation is the feature extraction step, the encoding and attention step and the decoding step.
(1) Feature extraction step: the quality of feature extraction directly determines the effect of the model. In order to extract as many features as possible, parallel convolution kernels are added in the design; parallel convolution can extract multi-scale spatial features and adapts better to character features of different sizes. At the same time, a bidirectional long short-term memory network is fused in, which can learn the feature relations between characters and gives better stability on long-sequence text.
(2) Encoding and attention step: this module is designed with full consideration of both the past and the future state at a given moment, which serve as the encoding part; the encoded states are focused through attention weights. Compared with a traditional long short-term memory network, learning and fusion of future features is added. In text recognition, the text information at the current moment has a strong feature relation with future states, so making full use of the states before and after a given moment can improve the robustness of the model.
(3) Decoding step: the key of this module is decoding and outputting the semantic information generated by encoding, and the key point of decoding is focusing attention on past information. A nested long short-term memory network is adopted to memorize past information selectively, which gives better flexibility. The memory unit can be divided into an internal memory unit and an external memory unit, with the internal unit nested inside the external unit. The external memory unit can freely control the memory state of the internal unit: it can write information so that the internal unit selectively memorizes things related to the input at the current moment, while for irrelevant information the external unit controls the internal unit to forget selectively, since some memorized information would interfere with the prediction at the current moment. This gives the module a great advantage for character recognition.
Drawings
FIG. 1 is a block diagram of the overall framework of the algorithm model of the preferred embodiment of the present invention.
FIG. 2 illustrates the encoding-decoding and attention mechanism.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the algorithm model mainly comprises the following steps:
step S1, a convolutional neural network feature extraction module is used for the spatial features of the text images;
step S2, inputting the space characteristics extracted by the convolutional neural network into a bidirectional long and short term memory network module, wherein the bidirectional long and short term memory network can extract the sequence characteristics of the characters;
step S3, semantic coding is carried out on the extracted feature vectors, and then attention weights of the feature vectors are distributed through an attention mechanism, so that attention is focused on the feature vectors with higher weights;
step S4, the model decoding part is implemented by a nested long and short term memory network, and the features extracted by attention and the prediction information at the previous time are used as the input of the nested long and short term memory network, and the purpose of using the long and short term memory network before and after is to keep the time characteristics of the feature vector, so that the attention position point of the model changes constantly with time.
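The following PyTorch-style sketch puts steps S1 to S4 together. It is illustrative only: the layer sizes, the plain LSTMCell decoder standing in for the nested long short-term memory network and the additive attention scorer are assumptions made for readability, not the exact configuration claimed by this patent.

```python
# Minimal PyTorch sketch of steps S1-S4 (illustrative assumptions: layer sizes,
# an LSTMCell decoder in place of the nested LSTM, additive attention scoring).
import torch
import torch.nn as nn

class SceneTextRecognizer(nn.Module):
    def __init__(self, num_classes=37, cnn_channels=256, hidden=512):
        super().__init__()
        # S1: convolutional feature extractor (spatial features)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=1, padding=1), nn.ReLU(),
            nn.AvgPool2d(3, stride=2),
            nn.Conv2d(64, cnn_channels, 3, stride=1, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),      # collapse height into a sequence
        )
        # S2: bidirectional LSTM over the feature sequence (sequence features)
        self.bilstm = nn.LSTM(cnn_channels, hidden, bidirectional=True,
                              batch_first=True)
        # S3: additive attention over the encoder states h_i
        self.att_enc = nn.Linear(2 * hidden, hidden)
        self.att_dec = nn.Linear(hidden, hidden)
        self.att_v = nn.Linear(hidden, 1)
        # S4: decoder cell (stands in for the nested LSTM of the patent)
        self.dec_cell = nn.LSTMCell(2 * hidden + num_classes, hidden)
        self.classifier = nn.Linear(hidden, num_classes)
        self.num_classes = num_classes

    def forward(self, images, max_len=20):
        feats = self.cnn(images)                  # (B, C, 1, W')
        seq = feats.squeeze(2).permute(0, 2, 1)   # (B, T, C)
        enc, _ = self.bilstm(seq)                 # (B, T, 2 * hidden)
        B = enc.size(0)
        s = enc.new_zeros(B, self.dec_cell.hidden_size)   # decoder state s_t
        c = enc.new_zeros(B, self.dec_cell.hidden_size)
        y = enc.new_zeros(B, self.num_classes)            # previous prediction y_{t-1}
        outputs = []
        for _ in range(max_len):
            # attention weights A_{t,i} over encoder states, then context C_t
            e = self.att_v(torch.tanh(self.att_enc(enc)
                                      + self.att_dec(s).unsqueeze(1)))
            A = torch.softmax(e, dim=1)           # (B, T, 1)
            C = (A * enc).sum(dim=1)              # (B, 2 * hidden)
            s, c = self.dec_cell(torch.cat([C, y], dim=1), (s, c))
            y = torch.softmax(self.classifier(s), dim=1)
            outputs.append(y)
        return torch.stack(outputs, dim=1)        # (B, max_len, num_classes)

model = SceneTextRecognizer()
probs = model(torch.randn(2, 3, 64, 200))         # (2, 20, 37) per-step probabilities
```

Running the model on a batch of images returns per-step class probabilities, from which the characters before the end symbol can be read off in time order.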
In the field of machine learning, judging the quality of a model requires performance metrics; the most common are precision P, recall R and the combined measure F. Precision is the ratio of the number of correctly retrieved results to the total number retrieved, and recall is the ratio of the number of correctly retrieved results to the number of all relevant items. In model evaluation studies, the combined metric F of P and R is therefore generally adopted as the main evaluation index; F can be computed from P and R, as shown in the following formulas.
P = (number of correctly retrieved results) / (total number of retrieved results)    (1)
R = (number of correctly retrieved results) / (total number of relevant results)    (2)
F = 2PR / (P + R)    (3)
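As a small worked example of formulas (1) to (3), the sketch below computes P, R and F from counts; the count variable names are illustrative assumptions.

```python
# Worked example of formulas (1)-(3); the count variable names are assumptions.
def precision(n_correct: int, n_retrieved: int) -> float:
    """P: fraction of retrieved results that are correct."""
    return n_correct / n_retrieved if n_retrieved else 0.0

def recall(n_correct: int, n_relevant: int) -> float:
    """R: fraction of all relevant items that were retrieved correctly."""
    return n_correct / n_relevant if n_relevant else 0.0

def f_measure(p: float, r: float) -> float:
    """F: combination of P and R, F = 2PR / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Example: 90 correct results out of 100 retrieved, 120 relevant items in total.
p, r = precision(90, 100), recall(90, 120)
print(p, r, f_measure(p, r))   # 0.9, 0.75, about 0.818
```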
The algorithm model of this patent is mainly divided into three modules: a feature extraction module, an encoding and attention module, and a decoding module. The feature extraction module adopts a convolutional neural network plus bidirectional long short-term memory network structure; the encoding and attention module expresses the hidden-layer states and prediction outputs of the bidirectional long short-term memory network in terms of attention weights; the decoding module adopts a nested long short-term memory network, which can extract the sequence information of the text and predicts the output at the current moment from the state at the previous moment. Based on the above analysis, the specific implementation steps are as follows:
step one, a convolutional neural network feature extraction module extracts the spatial features of the character image;
the method of claim 1, further comprising: the convolutional neural network module comprises a convolutional layer 1, a pooling layer 1, a convolutional group 2, a convolutional layer 3, a pooling layer 2, a convolutional layer 4, a pooling layer 3, a batch standard layer, a convolutional layer 5, a pooling layer 4, a batch standard layer and a Dropout layer, wherein the detailed parameters are set as follows:
the convolution kernel size of the convolution layer 1 is 5 × 5 × 64, the step length is 1, the extended edge is 1, the pooling layer of the patent all adopts a mean value method, and the parameter settings are the same: kernel size 3 × 3, step size 2, and extended edge 0.
Convolution group 2 comprises parallel convolutional layer A with 7 × 7 kernels and convolutional layer B with 5 × 5 kernels, followed by a 1 × 1 convolutional layer with C kernels stacked on the parallel layers; the channel dimension can be reduced by adjusting C, which speeds up computation and reduces computational cost.
the convolution layers 3, 4 and 5 all adopt convolution kernels with the size of 3 multiplied by 3, the number of the convolution kernels is 128, 128 and 256, the step length of the convolution kernels is 1, and the expansion edge is 1.
The batch normalization layer standardizes each mini-batch of data: it computes the mean and variance of the data, normalizes the data, and then applies a learned scale and shift. After this layer is added, a larger learning rate can be used and a regularization effect is also obtained; it is a good method for accelerating the training of the network.
The Dropout layer can be regarded as randomly sampling sub-models and then averaging them, i.e., hidden units are dropped at random. After each batch of training data, a new sub-network is in effect, and because which hidden units are dropped is random, the randomness of the model is maintained. The Dropout parameter is set to 0.5. Dropout works by generating a 0/1 variable from a Bernoulli distribution and multiplying the neuron output by it to decide whether that neuron is discarded. The following equations give the forward propagation after a Dropout layer is added:
r_j^(l) ~ Bernoulli(p)    (4)
d^(l) = r^(l) * y^(l)    (5)
z_i^(l+1) = w_i^(l+1) d^(l) + b_i^(l+1)    (6)
y_i^(l+1) = f(z_i^(l+1))    (7)
where y denotes the output of a layer in the forward propagation, w denotes the connection weights of the neurons, b denotes the bias, r denotes a mask that obeys a Bernoulli distribution and acts on y, z is the pre-activation value of the forward propagation, and f is the activation function.
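The following NumPy sketch walks through equations (4) to (7) for a single layer; the layer sizes and the tanh activation are illustrative assumptions, while the Dropout parameter 0.5 follows the text.

```python
# NumPy sketch of the dropout forward pass in equations (4)-(7); the layer
# sizes and the tanh activation are illustrative assumptions, the Dropout
# parameter 0.5 follows the text.
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(y, w, b, p=0.5, activation=np.tanh):
    r = rng.binomial(1, p, size=y.shape)   # (4) r ~ Bernoulli(p)
    d = r * y                              # (5) randomly drop hidden units
    z = w @ d + b                          # (6) weighted sum plus bias
    return activation(z)                   # (7) activation of the next layer

y = rng.standard_normal(8)         # output of the previous layer
w = rng.standard_normal((4, 8))    # connection weights
b = np.zeros(4)                    # bias
print(dropout_forward(y, w, b))
```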
Through the forward propagation of the neural network, when the resolution of the input image is 800 × 600, the above convolution and pooling process finally yields a feature map of size 1 × 256 × 50 × 37, i.e., a sequence of 1 × 256 feature vectors. An acceleration layer is then added; this layer is an optimization method provided by Caffe that converts the pixels in the small window covered by a convolution kernel into a row and stores it in contiguous memory. The CPU then reads contiguous memory, which speeds up the convolution operation and avoids the access-time cost caused by non-contiguous memory.
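The PyTorch sketch below assembles the feature-extraction backbone described in step one (convolutional layer 1 with 64 5 × 5 kernels, convolution group 2 with parallel 7 × 7 and 5 × 5 branches and a 1 × 1 reduction with C kernels, 3 × 3 convolutions with 128/128/256 kernels, mean pooling with 3 × 3 kernels and stride 2, batch normalization and Dropout 0.5). The padding of the parallel branches and the input channel count are assumptions; the exact output spatial size quoted in the text (1 × 256 × 50 × 37) depends on implementation details not fully specified here.

```python
# PyTorch sketch of the feature-extraction backbone described in step one.
# Padding of the parallel branches and the input channel count are assumptions.
import torch
import torch.nn as nn

class ConvGroup2(nn.Module):
    """Parallel 7x7 and 5x5 convolutions followed by a 1x1 reduction with C kernels."""
    def __init__(self, in_ch=64, branch_ch=64, c=128):
        super().__init__()
        self.branch_a = nn.Conv2d(in_ch, branch_ch, 7, padding=3)
        self.branch_b = nn.Conv2d(in_ch, branch_ch, 5, padding=2)
        self.reduce = nn.Conv2d(2 * branch_ch, c, 1)   # 1x1 conv, C kernels

    def forward(self, x):
        return self.reduce(torch.cat([self.branch_a(x), self.branch_b(x)], dim=1))

backbone = nn.Sequential(
    nn.Conv2d(3, 64, 5, stride=1, padding=1),    # convolutional layer 1
    nn.AvgPool2d(3, stride=2),                   # pooling layer 1 (mean pooling)
    ConvGroup2(64, 64, 128),                     # convolution group 2
    nn.Conv2d(128, 128, 3, stride=1, padding=1), # convolutional layer 3
    nn.AvgPool2d(3, stride=2),                   # pooling layer 2
    nn.Conv2d(128, 128, 3, stride=1, padding=1), # convolutional layer 4
    nn.AvgPool2d(3, stride=2),                   # pooling layer 3
    nn.BatchNorm2d(128),                         # batch normalization layer
    nn.Conv2d(128, 256, 3, stride=1, padding=1), # convolutional layer 5
    nn.AvgPool2d(3, stride=2),                   # pooling layer 4
    nn.BatchNorm2d(256),                         # batch normalization layer
    nn.Dropout2d(0.5),                           # Dropout layer
)

x = torch.randn(1, 3, 600, 800)   # an 800 x 600 input image (N, C, H, W)
print(backbone(x).shape)          # feature map of shape (1, 256, H', W')
```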
Step two, extracting the sequence characteristics of the characters by the bidirectional long-short term memory network;
considering that a character image in a natural scene is an indefinite-length sequence text, when characteristics are extracted and predicted, the predicted output of the character image has a great relationship with the state and the predicted output of a previous moment and a next moment, if only a convolutional neural network is used, only the spatial characteristics of a text region are extracted, the text is split into the character sequence for prediction, so that a better effect can be achieved, for example, the content of a text box is 'blue sky', the character behind the 'day' character cannot be accurately judged during recognition, the predicted output of the current moment is determined by the predicted output 'day' of the previous moment and the hidden layer state of the current moment after the 'empty' is input by using a long-short term memory network, and the principle is similar to translation.
The parameter dimension of the bidirectional long short-term memory network is 512. By fusing the left-to-right and right-to-left passes over t = 1, 2, 3, ..., T, the states of the hidden layers are superimposed. The long short-term memory network does not change the position of the feature sequence of the feature map, so it is translation invariant and the receptive field of the original image corresponding to each feature vector is unchanged. The dimension of the final output is 1 × 1024 × 50 × 37. The hidden layer of the bidirectional long short-term memory network contains the context state information of the text sequence and serves as the encoding stage of the attention model; the feature vector set is [h_1, h_2, h_3, ..., h_T], where a feature h_i is generated at each moment by combining the features of the two directions: h_i = [h_i, h_i*]. The bidirectional long short-term memory network is developed from the long short-term memory network, and this patent makes full use of its hidden-layer states in the encoding stage and the prediction information at each moment, fully fusing and encoding them (a sketch of this encoding over the CNN feature sequence is given below).
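The sketch below encodes the CNN feature map with a bidirectional LSTM of hidden size 512, giving 1024-dimensional states h_i = [h_i, h_i*]; how the spatial map is serialized into a sequence is an assumption made for illustration.

```python
# Sketch of encoding the CNN feature map with a bidirectional LSTM (hidden 512).
# How the spatial map is flattened into a sequence is an assumption.
import torch
import torch.nn as nn

feature_map = torch.randn(1, 256, 37, 50)          # (N, C, H', W') from the CNN
N, C, H, W = feature_map.shape
seq = feature_map.flatten(2).permute(0, 2, 1)      # (N, H'*W', 256) feature sequence
bilstm = nn.LSTM(input_size=256, hidden_size=512,
                 bidirectional=True, batch_first=True)
h, _ = bilstm(seq)                                 # (N, H'*W', 1024) states h_i
h_map = h.permute(0, 2, 1).reshape(N, 1024, H, W)  # back to a 1 x 1024 x H' x W' map
print(h_map.shape)
```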
the bidirectional long-short term memory network is similar to the principle of the long-short term memory network, the memory state of a cell is controlled by an input gate, an output gate and a forgetting gate, the gate value range is generally [0,1], so that three gates use Sigmoid as an activation function, and the output state uses Tanh activation function.
it=σi(wi[ht-1,xt]+bi) (8)
Formula (8) shows the input at the current time, which is controlled by the input gate, and determines whether the input information is retained by the active function Sigmoid.
ft=σf(wt[ht-1,xt]+bf) (9)
The formula (9) shows that the forgetting gate controls the state of the memory unit to let the memory unit cetSome past states that would interfere with the prediction at the current time are discarded randomly.
Cet=ft*Cet-1+it*tanh(wc·[ht-1,Cet]+bc) (10)
Equation (10) represents the internal Memory IM (inner Memory) of the function m at time t.
ot=σo(wo[ht-1,xt]+bo) (11)
ht=ot☉σh(cet) (12)
Equations (11) and (12) indicate the output gate, (11) the initialization of the output value, and (12) the output value and the memory cell CetTaking correlation operation as hidden state h of current timet,σtIs a tanh activation function, and has the effect of stabilizing the value.
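The following NumPy sketch performs one LSTM step following equations (8) to (12); the weight shapes and the random initialisation are illustrative assumptions.

```python
# NumPy sketch of one LSTM step following equations (8)-(12); weight shapes and
# initialisation are illustrative assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, ce_prev, w, b):
    """One step of a standard LSTM cell; w and b hold the per-gate parameters."""
    hx = np.concatenate([h_prev, x_t])                          # [h_{t-1}, x_t]
    i_t = sigmoid(w["i"] @ hx + b["i"])                         # (8)  input gate
    f_t = sigmoid(w["f"] @ hx + b["f"])                         # (9)  forget gate
    ce_t = f_t * ce_prev + i_t * np.tanh(w["c"] @ hx + b["c"])  # (10) memory cell
    o_t = sigmoid(w["o"] @ hx + b["o"])                         # (11) output gate
    h_t = o_t * np.tanh(ce_t)                                   # (12) hidden state
    return h_t, ce_t

rng = np.random.default_rng(0)
hidden, inp = 4, 3
w = {k: rng.standard_normal((hidden, hidden + inp)) for k in "ifco"}
b = {k: np.zeros(hidden) for k in "ifco"}
h_t, ce_t = lstm_step(rng.standard_normal(inp),
                      np.zeros(hidden), np.zeros(hidden), w, b)
print(h_t, ce_t)
```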
Step three: the semantic code C_i is the key of the attention model. The 1 × 1024 feature vector sequence generated by the bidirectional recurrent neural network is semantically encoded; the aim is to perform a weighted summation over the hidden states h_i of the encoding stage to obtain the attention weights at different moments, and to predict the output at the current moment through attention focusing. The attention mechanism performs feature focusing on a feature sequence S of vectors with length T = 20. If T is too large, too much information has to be memorized and the computation of the model increases sharply; an ordinary text sentence rarely exceeds 20 words, and an excessively large T disperses the model's attention. For example, for the sentence "the blue, blue sky has a white cloud, just like a bird flying in the sky", when the last character is predicted, attention should be focused on the input text at the current moment and on the hidden state at a certain past moment; the attention weights are distributed over the hidden states at different moments, and the larger the weight, the more the attention is focused there. When the characters in this text box are predicted, the current character "floating" is predicted from "white cloud" and "freely"; if T is chosen too large, the weight of "sky" increases while the weight of the useful information "white cloud" and "freely" decreases, the expression deviates during prediction, and the effect instead declines. In the attention model, [x_1, x_2, x_3, ..., x_T] denotes the input at the current moment, A_{t,i} denotes the attention focusing weight, and C_t denotes the weighted value of the features h_i at time t. The specific formulas are as follows:
A_{t,i} = exp(e_{t,i}) / Σ_{k=1}^{T} exp(e_{t,k})    (13)
C_t = Σ_{i=1}^{T} A_{t,i} h_i    (14)
e_{t,i} = f_att(s_{t-1}, h_i)    (15)
s_t = f(s_{t-1}, y_{t-1}, C_t)    (16)
y_t = g(y_{t-1}, s_t, C_t)    (17)
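The NumPy sketch below walks through this attention computation: the scores e_{t,i} of (15), the weights A_{t,i} of (13) and the context C_t of (14). The additive form of f_att and all dimensions are assumptions made for illustration.

```python
# NumPy sketch of the attention computation: scores e_{t,i} of (15), weights
# A_{t,i} of (13) and context C_t of (14). The additive f_att is an assumption.
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, dec_dim = 20, 1024, 512
H = rng.standard_normal((T, enc_dim))        # encoder states h_1 ... h_T
s_prev = rng.standard_normal(dec_dim)        # previous decoder state s_{t-1}

W_h = rng.standard_normal((dec_dim, enc_dim)) * 0.01
W_s = rng.standard_normal((dec_dim, dec_dim)) * 0.01
v = rng.standard_normal(dec_dim) * 0.01

def f_att(s, h):
    """Correlation score between the decoder state and one encoder feature."""
    return v @ np.tanh(W_s @ s + W_h @ h)

e = np.array([f_att(s_prev, H[i]) for i in range(T)])   # (15) scores
A = np.exp(e) / np.exp(e).sum()                          # (13) attention weights
C = (A[:, None] * H).sum(axis=0)                         # (14) context vector C_t
print(A.round(3), C.shape)   # weights sum to 1; C_t is a 1024-d focused feature
```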
Step four: as shown in Figure 1, a transposition layer is used to convert dimensionality; it facilitates decoding and is commonly used to connect long short-term memory networks. Dimension matching is a key point of the model: appropriate dimensions allow the model to train stably and speed up training. The decoding part adopts a nested long short-term memory network and can recognize the attention-focused feature vectors, where the input at time t is the t-th focused feature vector. The nested long short-term memory network selectively reads and writes using standard long short-term memory network gates. This key feature enables the model to achieve a more effective time hierarchy than traditional stacked long short-term memory networks. As shown in Figure 2, s_t denotes the hidden state of the decoder at time t. The nested long short-term memory network shown in Figure 1 is formed by nesting two layers of long short-term memory networks; the dimensions of networks 1 and 2 are both 256, and the dimension after nesting and fusion is 512. The prediction output y_t at a time t of decoding is jointly determined by the prediction output y_{t-1} at the previous time, the decoder hidden state s_t and the attention weighted value C_t.
The nested long short-term memory network differs from the traditional stacked long short-term memory network, in which layers are stacked one after another like a chain and the output of the previous layer is the input of the next layer. In the nested network, the gates of a standard long short-term memory network selectively read and write an internal memory, which yields a more effective time hierarchy than stacking. Its memory cell function is also different, as shown in the following formula:
Ce_t = IM_t(f_t ⊙ Ce_{t-1}, i_t ⊙ g_t)    (18)
and finally, the output is expressed in a probability form by adopting Softmax, then the probability value is selected as a prediction result, a predicted value is provided for each time t of the long-term and short-term memory network, and then characters before the end character are taken according to the time sequence to form a character string, namely the required result.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (4)

1. A character recognition system based on the combination of a neural network and an attention mechanism, comprising: a feature extraction module, an encoding and attention module and a decoding module, wherein the feature extraction module adopts a structure combining a convolutional neural network and a bidirectional long short-term memory network, the convolutional neural network is used to extract the spatial features of the character image, and the bidirectional long short-term memory network is used to extract the sequence features of the characters;
an encoding and attention module, which performs a weighted summation over the hidden states h_i of the encoding stage of the bidirectional long short-term memory network to obtain attention weights at different moments, and then predicts the output at the current moment through attention focusing;
the decoding module adopts a nested long short-term memory network; the decoding part analyses the intermediate semantic information generated by encoding, decoding focuses attention on the encoded states through the attention mechanism, then learns the decoded information of past moments through the nested long short-term memory network, which is used to extract the sequence information of the text, and predicts the output at the current moment from the state at the previous moment;
the convolutional neural network module comprises convolutional layer 1, pooling layer 1, convolution group 2, convolutional layer 3, pooling layer 2, convolutional layer 4, pooling layer 3, a batch normalization layer, convolutional layer 5, pooling layer 4, a batch normalization layer and a Dropout layer;
the detailed parameter settings of the convolutional neural network are as follows: convolutional layer 1 has 64 convolution kernels of size 5 × 5, stride 1 and padding 1; the pooling layers all use the mean-pooling method with the same parameter settings: kernel size 3 × 3, stride 2, padding 0; convolution group 2 comprises parallel convolutional layers A and B with 7 × 7 and 5 × 5 kernels, followed by a 1 × 1 convolutional layer with C kernels stacked on the parallel layers, where C denotes the number of convolution kernels, and the dimension can be reduced by adjusting C, which speeds up computation and reduces computational cost; convolutional layers 3, 4 and 5 all use 3 × 3 kernels, with 128, 128 and 256 kernels respectively, stride 1 and padding 1; the batch normalization layer standardizes each mini-batch of data by computing the mean and variance of the data, normalizing the data, and then applying a learned scale and shift; the Dropout layer can be regarded as randomly sampling sub-models and then averaging them, i.e., hidden units are dropped at random;
the encoding and attention module is specified as follows: the semantic code C_i is the key of the attention model; the 1 × 1024 feature vector sequence generated by the bidirectional recurrent neural network is semantically encoded, with the aim of performing a weighted summation over the hidden states h_i of the encoding stage to obtain the attention weights at different moments and predicting the output at the current moment through attention focusing; the attention mechanism performs feature focusing on a feature sequence S of vectors with length T = 20; when the last character is predicted, attention is focused on the input text at the current moment and on the hidden state at a certain past moment, the attention weights being distributed over the hidden states at different moments, and the larger the weight, the more the attention is focused there; in the attention model, [x_1, x_2, x_3, ..., x_T] denotes the input at the current moment, A_{t,i} denotes the attention focusing weight, and C_t denotes the weighted value of the features h_i at time t;
a is describedt,i、CtThe specific formula is as follows:
A_{t,i} = exp(e_{t,i}) / Σ_{k=1}^{T} exp(e_{t,k})    (13)
C_t = Σ_{i=1}^{T} A_{t,i} h_i    (14)
e_{t,i} = f_att(s_{t-1}, h_i)    (15)
s_t = f(s_{t-1}, y_{t-1}, C_t)    (16)
y_t = g(y_{t-1}, s_t, C_t)    (17)
f_att(s_{t-1}, h_i) is a correlation function expressing the degree of correlation between the decoder state at time t-1 and the encoder feature h_i; y_t denotes the prediction output of the decoding module, and g(y_{t-1}, s_t, C_t) denotes a probability output function;
the decoding module adopts a nested long short-term memory network to recognize the attention-focused feature vectors, where the input at time t is the t-th focused feature vector; the nested long short-term memory network selectively reads and writes using standard long short-term memory network gates, and the prediction output y_t at a time t of decoding is jointly determined by the prediction output y_{t-1} at the previous time, the decoder hidden state s_t and the attention weighted value C_t; the memory cell function is formulated as follows:
Ce_t = IM_t(f_t ⊙ Ce_{t-1}, i_t ⊙ g_t)    (18)
f_t denotes a non-linear function of the forward propagation, IM_t denotes the internal memory state of the nested long short-term memory network, Ce_{t-1} denotes the state of the memory cell at the previous time t-1, and g_t denotes the gating function of the long short-term memory network;
finally, Softmax is used to express the output in probability form, and the class with the largest probability is selected as the prediction result; the long short-term memory network provides a predicted value at each time t, and the characters before the end symbol are then taken in time order to form a character string, which is the required result.
2. The system of claim 1, wherein when the resolution of the input image is 800 × 600, the above convolution and pooling process yields a feature map of size 1 × 256 × 50 × 37, i.e., a sequence of 1 × 256 feature vectors, and an acceleration layer is then added; the acceleration layer is an optimization method provided by Caffe that converts the pixels in the small window covered by a convolution kernel into a row and stores it in contiguous memory.
3. The system of claim 1 or 2, wherein the parameter dimension of the bidirectional long short-term memory network is 512; by fusing the left-to-right and right-to-left passes over t = 1, 2, 3, ..., T, the states of the hidden layers are superimposed; the long short-term memory network does not change the position of the feature sequence of the feature map, so it is translation invariant and the receptive field of the original image corresponding to each feature vector is unchanged; the output dimension is 1 × 1024 × 50 × 37; the hidden layer of the bidirectional long short-term memory network contains the context state of the text sequence and serves as the encoding process of the attention model; the feature vector set is [h_1, h_2, h_3, ..., h_T], where a feature vector h_i is generated at each moment by combining the features of the two directions: h_i = [h_i, h_i*].
4. A character recognition method based on the combination of a neural network and an attention mechanism, characterized by comprising: a feature extraction step, an encoding and attention step and a decoding step, wherein the feature extraction step adopts a structure combining a convolutional neural network and a bidirectional long short-term memory network, the convolutional neural network is used to extract the spatial features of the character image, and the bidirectional long short-term memory network is used to extract the sequence features of the characters;
an encoding and attention step, which performs a weighted summation over the hidden states h_i of the encoding stage of the bidirectional long short-term memory network to obtain attention weights at different moments, and then predicts the output at the current moment through attention focusing;
the decoding step adopts a nested long short-term memory network; the decoding part analyses the intermediate semantic information generated by encoding, decoding focuses attention on the encoded states through the attention mechanism, then learns the decoded information of past moments through the nested long short-term memory network to extract the sequence information of the text, and predicts the output at the current moment from the state at the previous moment.
CN201811230112.0A 2018-10-22 2018-10-22 Character recognition system and method based on combination of neural network and attention mechanism Active CN109389091B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811230112.0A CN109389091B (en) 2018-10-22 2018-10-22 Character recognition system and method based on combination of neural network and attention mechanism


Publications (2)

Publication Number Publication Date
CN109389091A CN109389091A (en) 2019-02-26
CN109389091B (en) 2022-05-03

Family

ID=65427622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811230112.0A Active CN109389091B (en) 2018-10-22 2018-10-22 Character recognition system and method based on combination of neural network and attention mechanism

Country Status (1)

Country Link
CN (1) CN109389091B (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180065498A (en) * 2016-12-08 2018-06-18 한국항공대학교산학협력단 Method for deep learning and method for generating next prediction image using the same
CN107179683A (en) * 2017-04-01 2017-09-19 浙江工业大学 A kind of interaction intelligent robot motion detection and control method based on neutral net
CN107368831A (en) * 2017-07-19 2017-11-21 中国人民解放军国防科学技术大学 English words and digit recognition method in a kind of natural scene image
CN107484017A (en) * 2017-07-25 2017-12-15 天津大学 Supervision video abstraction generating method is had based on attention model
CN107562784A (en) * 2017-07-25 2018-01-09 同济大学 Short text classification method based on ResLCNN models
CN107797992A (en) * 2017-11-10 2018-03-13 北京百分点信息科技有限公司 Name entity recognition method and device
CN108024158A (en) * 2017-11-30 2018-05-11 天津大学 There is supervision video abstraction extraction method using visual attention mechanism
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 A kind of hybrid neural networks file classification method of fusion abstract and body feature

Also Published As

Publication number Publication date
CN109389091A (en) 2019-02-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant