CN109389091B - Character recognition system and method based on combination of neural network and attention mechanism


Info

Publication number: CN109389091B (application CN201811230112.0A)
Authority: CN (China)
Prior art keywords: attention, long short-term memory network
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN109389091A
Inventors: 杨宏志, 庞宇, 王慧倩
Current and original assignee: Chongqing University of Posts and Telecommunications
Application filed by Chongqing University of Posts and Telecommunications
Priority and filing date: 2018-10-22 (priority to CN201811230112.0A)
Publication of CN109389091A: 2019-02-26
Publication of CN109389091B (grant): 2022-05-03

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition


Abstract

The invention claims a character recognition system and method based on the combination of a neural network and an attention mechanism. The method specifically comprises the following steps: a convolutional neural network feature extraction module extracts the spatial features of the character image; the spatial features extracted by the convolutional neural network are fed into a bidirectional long short-term memory network module, which extracts the sequence features of the characters; the extracted feature vectors are semantically encoded, and attention weights are then assigned to the feature vectors through an attention mechanism, so that attention is focused on the feature vectors with higher weights; the decoding part of the model is implemented by a nested long short-term memory network, with the attention-focused features and the prediction information of the previous moment serving as its input, and long short-term memory networks are used before and after it in order to preserve the temporal characteristics of the feature vectors and to ensure that the attention position of the model changes continuously with time. The method can detect character regions in natural scenes more accurately and has a good detection effect on small target characters and on text with small inclination angles.

Description

Character recognition system and method based on combination of neural network and attention mechanism
Technical Field
The invention belongs to the field of character image recognition in natural scenes, and relates to an algorithm combining a convolutional neural network, a long short-term memory network and an attention mechanism.
Background
A natural scene is the living environment we are in, and natural scene images contain various kinds of visual information, such as text, automobiles, landscapes, organisms and buildings; these elements constitute the main components of natural scene content.
Digit recognition in natural scenes belongs to the category of text recognition in natural scenes. Research on text recognition in natural scenes started in the 1990s, but it remains an unsolved problem. In general, the text recognition task in natural scenes includes two parts: text region detection and text recognition. Recognition builds on detection, with the detected text box serving as the input to recognition. With the development of deep learning, detection was the earliest studied part and the related technology is relatively mature, so what decides the recognition effect is the design of the recognition algorithm. Target recognition is currently an active field of deep learning and new applications keep emerging; since text is everyday visual information, its recognition has important research significance, and improving character recognition accuracy is also of great help to the NLP field. However, because of factors such as the position, deformation and illumination of characters in natural scenes, and because the background of characters in natural scenes is quite complex, many technical difficulties must be overcome for recognition.
At present, most research methods are based on top-down algorithm models. Jaderberg et al. designed an end-to-end character recognition method based on a convolutional neural network with structured output, but the text length must be fixed and the recognition effect on long-sequence text is poor. Shi et al. proposed an end-to-end recognition method based on a convolutional neural network, a recurrent neural network and sequence classification, but its recognition effect on complex character images is poor.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A character recognition system and method based on the combination of a neural network and an attention mechanism is proposed, which can detect character regions in natural scenes more accurately and has a good detection effect on small target characters and on text with small inclination angles. The technical scheme of the invention is as follows:
a word recognition system based on a combination of neural networks and attention mechanisms, comprising: the character recognition system comprises a character extraction module, a coding and attention module and a decoding module, wherein the character extraction module adopts a structure of combining a convolutional neural network and a bidirectional long and short term memory network, (the convolutional neural network is used for extracting the space characteristics of character images, the bidirectional long and short term memory network is used for extracting the sequence characteristics of characters)
an encoding and attention module, which performs a weighted summation over the hidden states h_i of the encoding stage of the bidirectional long short-term memory network to obtain attention weights at different moments, and then predicts the output at the current moment through attention focusing;
the decoding module adopts a nested long short-term memory network. The decoding part analyses the intermediate semantic information generated by encoding; decoding focuses attention on the encoded states through the attention mechanism, then learns the decoded information of past moments through the nested long short-term memory network, which is used to extract the sequence information of the text, and predicts the output at the current moment from the state at the previous moment.
Further, the convolutional neural network module comprises convolutional layer 1, pooling layer 1, convolution group 2, convolutional layer 3, pooling layer 2, convolutional layer 4, pooling layer 3, a batch normalization layer, convolutional layer 5, pooling layer 4, a batch normalization layer and a Dropout layer.
Further, the detailed parameters of the convolutional neural network are set as follows: convolutional layer 1 has 64 convolution kernels of size 5 × 5, stride 1 and padding 1; the pooling layers of this patent all use the mean-pooling method with the same parameter settings: kernel size 3 × 3, stride 2, padding 0. Convolution group 2 comprises parallel convolutional layers A and B with 7 × 7 and 5 × 5 kernels, followed by a 1 × 1 convolutional layer with C kernels stacked on the parallel layers, where C denotes the number of convolution kernels; the channel dimension can be reduced by adjusting C, which speeds up computation and reduces computational cost. Convolutional layers 3, 4 and 5 all use 3 × 3 kernels, with 128, 128 and 256 kernels respectively, stride 1 and padding 1. The batch normalization layer standardizes each mini-batch of data: it computes the mean and variance of the data, normalizes the data, and then applies a learned scale and shift. The Dropout layer can be seen as randomly sampling sub-models and then averaging them, i.e., hidden units are dropped at random.
Further, when the resolution of the input image is 800 × 600, the above convolution and pooling process finally yields a feature map of size 1 × 256 × 50 × 37, giving a sequence of 1 × 256 feature vectors. An acceleration layer is then added; this layer is an optimization method provided by Caffe, which converts the pixels in the small window covered by a convolution kernel into a row and stores it in contiguous memory.
Furthermore, the parameter dimension of the bidirectional long short-term memory network is 512. By fusing the left-to-right and right-to-left passes over t = 1, 2, ..., T, the states of the hidden layers are superimposed. The long short-term memory network does not change the position of the feature sequence of the feature map, so it is translation invariant and the receptive field of the original image corresponding to each feature vector is unchanged. The output dimension is 1 × 1024 × 50 × 37. The hidden layer of the bidirectional long short-term memory network contains the context state of the text sequence and serves as the encoding process of the attention model; the feature vector set is [h_1, h_2, h_3, ..., h_T], where a feature vector h_i is generated at each moment by combining the features of the two directions: h_i = [h_i, h_i*].
Further, the encoding and attention module is specified as follows: the semantic code C_i is the key of the attention model. The 1 × 1024 feature vector sequence generated by the bidirectional recurrent neural network is semantically encoded; the aim is to perform a weighted summation over the hidden states h_i of the encoding stage to obtain the attention weights at different moments, and to predict the output at the current moment through attention focusing. The attention mechanism performs feature focusing on a feature sequence S of vectors with length T = 20. When the last character is predicted, attention is focused on the input text at the current moment and on the hidden state at a certain past moment; the attention weights are distributed over the hidden states at different moments, and the larger the weight, the more the attention is focused there. In the attention model, [x_1, x_2, x_3, ..., x_T] denotes the input at the current moment, A_{t,i} denotes the attention focusing weight, and C_t denotes the weighted value of the features h_i at time t.
Further, the specific formulas for A_{t,i} and C_t are as follows:
A_{t,i} = exp(e_{t,i}) / Σ_{k=1}^{T} exp(e_{t,k})    (13)
C_t = Σ_{i=1}^{T} A_{t,i} h_i    (14)
e_{t,i} = f_att(s_{t-1}, h_i)    (15)
s_t = f(s_{t-1}, y_{t-1}, C_t)    (16)
y_t = g(y_{t-1}, s_t, C_t)    (17)
f_att(s_{t-1}, h_i) is a correlation function expressing the degree of correlation between the decoder state at time t-1 and the encoder feature h_i; y_t denotes the prediction output of the decoding module, and g(y_{t-1}, s_t, C_t) denotes a probability output function.
Furthermore, the decoding module adopts a nested long short-term memory network to recognize the attention-focused feature vectors, where the input at time t is the t-th focused feature vector. The nested long short-term memory network selectively reads and writes using standard long short-term memory network gates. The prediction output y_t at a time t of decoding is jointly determined by the prediction output y_{t-1} at the previous time, the decoder hidden state s_t and the attention weighted value C_t; the memory cell function is formulated as follows:
Ce_t = IM_t(f_t ⊙ Ce_{t-1}, i_t ⊙ g_t)    (18)
f_t denotes a non-linear function of the forward propagation, IM_t denotes the internal memory state of the nested long short-term memory network, Ce_{t-1} denotes the state of the memory cell at the previous time t-1, and g_t denotes the gating function of the long short-term memory network.
Finally, Softmax is used to express the output in probability form, and the class with the largest probability is selected as the prediction result; the long short-term memory network provides a predicted value at each time t, and the characters before the end symbol are then taken in time order to form a character string, which is the required result.
A character recognition method based on the combination of a neural network and an attention mechanism comprises the following steps: a feature extraction step, an encoding and attention step and a decoding step. The feature extraction step adopts a structure combining a convolutional neural network and a bidirectional long short-term memory network: the convolutional neural network is used to extract the spatial features of the character image, and the bidirectional long short-term memory network is used to extract the sequence features of the characters;
an encoding and attention step, which performs a weighted summation over the hidden states h_i of the encoding stage of the bidirectional long short-term memory network to obtain attention weights at different moments, and then predicts the output at the current moment through attention focusing;
the decoding step adopts a nested long-short term memory network, the decoding part analyzes intermediate semantic information generated by encoding, the decoding needs to focus attention on the encoded state by using an attention mechanism, then the nested long-short term memory network learns the decoding information at a past moment to extract the sequence information of the text, and the output of the current moment is predicted by the state at the previous moment.
The invention has the following advantages and beneficial effects:
the invention can memorize the past information and solve the long-term dependence problem by introducing an attention mechanism, nesting the long-term and short-term memory network and then fusing the convolutional neural network and the bidirectional long-term and short-term memory network. The feature recognition is greatly advantageous by focusing the target on a feature vector through the attention mechanism.
The main innovation is the feature extraction step, the encoding and attention step and the decoding step.
(1) Feature extraction step: the quality of feature extraction directly determines the effect of the model. In order to extract as many features as possible, parallel convolution kernels are added in the design; parallel convolution can extract multi-scale spatial features and adapts better to character features of different sizes. At the same time, a bidirectional long short-term memory network is fused in, which can learn the feature relations between characters and gives better stability on long-sequence text.
(2) Encoding and attention step: this module is designed with full consideration of both the past and the future state at a given moment, which serve as the encoding part; the encoded states are focused through attention weights. Compared with a traditional long short-term memory network, learning and fusion of future features is added. In text recognition, the text information at the current moment has a strong feature relation with future states, so making full use of the states before and after a given moment can improve the robustness of the model.
(3) Decoding step: the key of this module is decoding and outputting the semantic information generated by encoding, and the key point of decoding is focusing attention on past information. A nested long short-term memory network is adopted to memorize past information selectively, which gives better flexibility. The memory unit can be divided into an internal memory unit and an external memory unit, with the internal unit nested inside the external unit. The external memory unit can freely control the memory state of the internal unit: it can write information so that the internal unit selectively memorizes things related to the input at the current moment, while for irrelevant information the external unit controls the internal unit to forget selectively, since some memorized information would interfere with the prediction at the current moment. This gives the module a great advantage for character recognition.
Drawings
FIG. 1 is a block diagram of the overall framework of the algorithm model of the preferred embodiment of the present invention.
FIG. 2 illustrates the encoding-decoding and attention mechanism.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the algorithm model mainly comprises the following steps:
step S1, a convolutional neural network feature extraction module is used for the spatial features of the text images;
step S2, inputting the space characteristics extracted by the convolutional neural network into a bidirectional long and short term memory network module, wherein the bidirectional long and short term memory network can extract the sequence characteristics of the characters;
step S3, semantic coding is carried out on the extracted feature vectors, and then attention weights of the feature vectors are distributed through an attention mechanism, so that attention is focused on the feature vectors with higher weights;
step S4, the model decoding part is implemented by a nested long and short term memory network, and the features extracted by attention and the prediction information at the previous time are used as the input of the nested long and short term memory network, and the purpose of using the long and short term memory network before and after is to keep the time characteristics of the feature vector, so that the attention position point of the model changes constantly with time.
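The following PyTorch-style sketch puts steps S1 to S4 together. It is illustrative only: the layer sizes, the plain LSTMCell decoder standing in for the nested long short-term memory network and the additive attention scorer are assumptions made for readability, not the exact configuration claimed by this patent.

```python
# Minimal PyTorch sketch of steps S1-S4 (illustrative assumptions: layer sizes,
# an LSTMCell decoder in place of the nested LSTM, additive attention scoring).
import torch
import torch.nn as nn

class SceneTextRecognizer(nn.Module):
    def __init__(self, num_classes=37, cnn_channels=256, hidden=512):
        super().__init__()
        # S1: convolutional feature extractor (spatial features)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=1, padding=1), nn.ReLU(),
            nn.AvgPool2d(3, stride=2),
            nn.Conv2d(64, cnn_channels, 3, stride=1, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),      # collapse height into a sequence
        )
        # S2: bidirectional LSTM over the feature sequence (sequence features)
        self.bilstm = nn.LSTM(cnn_channels, hidden, bidirectional=True,
                              batch_first=True)
        # S3: additive attention over the encoder states h_i
        self.att_enc = nn.Linear(2 * hidden, hidden)
        self.att_dec = nn.Linear(hidden, hidden)
        self.att_v = nn.Linear(hidden, 1)
        # S4: decoder cell (stands in for the nested LSTM of the patent)
        self.dec_cell = nn.LSTMCell(2 * hidden + num_classes, hidden)
        self.classifier = nn.Linear(hidden, num_classes)
        self.num_classes = num_classes

    def forward(self, images, max_len=20):
        feats = self.cnn(images)                  # (B, C, 1, W')
        seq = feats.squeeze(2).permute(0, 2, 1)   # (B, T, C)
        enc, _ = self.bilstm(seq)                 # (B, T, 2 * hidden)
        B = enc.size(0)
        s = enc.new_zeros(B, self.dec_cell.hidden_size)   # decoder state s_t
        c = enc.new_zeros(B, self.dec_cell.hidden_size)
        y = enc.new_zeros(B, self.num_classes)            # previous prediction y_{t-1}
        outputs = []
        for _ in range(max_len):
            # attention weights A_{t,i} over encoder states, then context C_t
            e = self.att_v(torch.tanh(self.att_enc(enc)
                                      + self.att_dec(s).unsqueeze(1)))
            A = torch.softmax(e, dim=1)           # (B, T, 1)
            C = (A * enc).sum(dim=1)              # (B, 2 * hidden)
            s, c = self.dec_cell(torch.cat([C, y], dim=1), (s, c))
            y = torch.softmax(self.classifier(s), dim=1)
            outputs.append(y)
        return torch.stack(outputs, dim=1)        # (B, max_len, num_classes)

model = SceneTextRecognizer()
probs = model(torch.randn(2, 3, 64, 200))         # (2, 20, 37) per-step probabilities
```

Running the model on a batch of images returns per-step class probabilities, from which the characters before the end symbol can be read off in time order.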
In the field of machine learning, judging the quality of a model requires performance metrics; the most common are precision P, recall R and the combined measure F. Precision is the ratio of the number of correctly retrieved results to the total number retrieved, and recall is the ratio of the number of correctly retrieved results to the number of all relevant items. In model evaluation studies, the combined metric F of P and R is therefore generally adopted as the main evaluation index; F can be computed from P and R, as shown in the following formulas.
P = (number of correctly retrieved results) / (total number of retrieved results)    (1)
R = (number of correctly retrieved results) / (total number of relevant results)    (2)
F = 2PR / (P + R)    (3)
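As a small worked example of formulas (1) to (3), the sketch below computes P, R and F from counts; the count variable names are illustrative assumptions.

```python
# Worked example of formulas (1)-(3); the count variable names are assumptions.
def precision(n_correct: int, n_retrieved: int) -> float:
    """P: fraction of retrieved results that are correct."""
    return n_correct / n_retrieved if n_retrieved else 0.0

def recall(n_correct: int, n_relevant: int) -> float:
    """R: fraction of all relevant items that were retrieved correctly."""
    return n_correct / n_relevant if n_relevant else 0.0

def f_measure(p: float, r: float) -> float:
    """F: combination of P and R, F = 2PR / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Example: 90 correct results out of 100 retrieved, 120 relevant items in total.
p, r = precision(90, 100), recall(90, 120)
print(p, r, f_measure(p, r))   # 0.9, 0.75, about 0.818
```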
The algorithm model of this patent is mainly divided into three modules: a feature extraction module, an encoding and attention module, and a decoding module. The feature extraction module adopts a convolutional neural network plus bidirectional long short-term memory network structure; the encoding and attention module expresses the hidden-layer states and prediction outputs of the bidirectional long short-term memory network in terms of attention weights; the decoding module adopts a nested long short-term memory network, which can extract the sequence information of the text and predicts the output at the current moment from the state at the previous moment. Based on the above analysis, the specific implementation steps are as follows:
step one, a convolutional neural network feature extraction module extracts the spatial features of the character image;
the method of claim 1, further comprising: the convolutional neural network module comprises a convolutional layer 1, a pooling layer 1, a convolutional group 2, a convolutional layer 3, a pooling layer 2, a convolutional layer 4, a pooling layer 3, a batch standard layer, a convolutional layer 5, a pooling layer 4, a batch standard layer and a Dropout layer, wherein the detailed parameters are set as follows:
the convolution kernel size of the convolution layer 1 is 5 × 5 × 64, the step length is 1, the extended edge is 1, the pooling layer of the patent all adopts a mean value method, and the parameter settings are the same: kernel size 3 × 3, step size 2, and extended edge 0.
Convolution group 2 comprises parallel convolutional layer A with 7 × 7 kernels and convolutional layer B with 5 × 5 kernels, followed by a 1 × 1 convolutional layer with C kernels stacked on the parallel layers; the channel dimension can be reduced by adjusting C, which speeds up computation and reduces computational cost.
the convolution layers 3, 4 and 5 all adopt convolution kernels with the size of 3 multiplied by 3, the number of the convolution kernels is 128, 128 and 256, the step length of the convolution kernels is 1, and the expansion edge is 1.
The batch normalization layer standardizes each mini-batch of data: it computes the mean and variance of the data, normalizes the data, and then applies a learned scale and shift. After this layer is added, a larger learning rate can be used and a regularization effect is also obtained; it is a good method for accelerating the training of the network.
The Dropout layer can be regarded as randomly sampling sub-models and then averaging them, i.e., hidden units are dropped at random. After each batch of training data, a new sub-network is in effect, and because which hidden units are dropped is random, the randomness of the model is maintained. The Dropout parameter is set to 0.5. Dropout works by generating a 0/1 variable from a Bernoulli distribution and multiplying the neuron output by it to decide whether that neuron is discarded. The following equations give the forward propagation after a Dropout layer is added:
r_j^(l) ~ Bernoulli(p)    (4)
d^(l) = r^(l) * y^(l)    (5)
z_i^(l+1) = w_i^(l+1) d^(l) + b_i^(l+1)    (6)
y_i^(l+1) = f(z_i^(l+1))    (7)
where y denotes the output of a layer in the forward propagation, w denotes the connection weights of the neurons, b denotes the bias, r denotes a mask that obeys a Bernoulli distribution and acts on y, z is the pre-activation value of the forward propagation, and f is the activation function.
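The following NumPy sketch walks through equations (4) to (7) for a single layer; the layer sizes and the tanh activation are illustrative assumptions, while the Dropout parameter 0.5 follows the text.

```python
# NumPy sketch of the dropout forward pass in equations (4)-(7); the layer
# sizes and the tanh activation are illustrative assumptions, the Dropout
# parameter 0.5 follows the text.
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(y, w, b, p=0.5, activation=np.tanh):
    r = rng.binomial(1, p, size=y.shape)   # (4) r ~ Bernoulli(p)
    d = r * y                              # (5) randomly drop hidden units
    z = w @ d + b                          # (6) weighted sum plus bias
    return activation(z)                   # (7) activation of the next layer

y = rng.standard_normal(8)         # output of the previous layer
w = rng.standard_normal((4, 8))    # connection weights
b = np.zeros(4)                    # bias
print(dropout_forward(y, w, b))
```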
Through the forward propagation of the neural network, when the resolution of the input image is 800 × 600, the above convolution and pooling process finally yields a feature map of size 1 × 256 × 50 × 37, i.e., a sequence of 1 × 256 feature vectors. An acceleration layer is then added; this layer is an optimization method provided by Caffe that converts the pixels in the small window covered by a convolution kernel into a row and stores it in contiguous memory. The CPU then reads contiguous memory, which speeds up the convolution operation and avoids the access-time cost caused by non-contiguous memory.
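The PyTorch sketch below assembles the feature-extraction backbone described in step one (convolutional layer 1 with 64 5 × 5 kernels, convolution group 2 with parallel 7 × 7 and 5 × 5 branches and a 1 × 1 reduction with C kernels, 3 × 3 convolutions with 128/128/256 kernels, mean pooling with 3 × 3 kernels and stride 2, batch normalization and Dropout 0.5). The padding of the parallel branches and the input channel count are assumptions; the exact output spatial size quoted in the text (1 × 256 × 50 × 37) depends on implementation details not fully specified here.

```python
# PyTorch sketch of the feature-extraction backbone described in step one.
# Padding of the parallel branches and the input channel count are assumptions.
import torch
import torch.nn as nn

class ConvGroup2(nn.Module):
    """Parallel 7x7 and 5x5 convolutions followed by a 1x1 reduction with C kernels."""
    def __init__(self, in_ch=64, branch_ch=64, c=128):
        super().__init__()
        self.branch_a = nn.Conv2d(in_ch, branch_ch, 7, padding=3)
        self.branch_b = nn.Conv2d(in_ch, branch_ch, 5, padding=2)
        self.reduce = nn.Conv2d(2 * branch_ch, c, 1)   # 1x1 conv, C kernels

    def forward(self, x):
        return self.reduce(torch.cat([self.branch_a(x), self.branch_b(x)], dim=1))

backbone = nn.Sequential(
    nn.Conv2d(3, 64, 5, stride=1, padding=1),    # convolutional layer 1
    nn.AvgPool2d(3, stride=2),                   # pooling layer 1 (mean pooling)
    ConvGroup2(64, 64, 128),                     # convolution group 2
    nn.Conv2d(128, 128, 3, stride=1, padding=1), # convolutional layer 3
    nn.AvgPool2d(3, stride=2),                   # pooling layer 2
    nn.Conv2d(128, 128, 3, stride=1, padding=1), # convolutional layer 4
    nn.AvgPool2d(3, stride=2),                   # pooling layer 3
    nn.BatchNorm2d(128),                         # batch normalization layer
    nn.Conv2d(128, 256, 3, stride=1, padding=1), # convolutional layer 5
    nn.AvgPool2d(3, stride=2),                   # pooling layer 4
    nn.BatchNorm2d(256),                         # batch normalization layer
    nn.Dropout2d(0.5),                           # Dropout layer
)

x = torch.randn(1, 3, 600, 800)   # an 800 x 600 input image (N, C, H, W)
print(backbone(x).shape)          # feature map of shape (1, 256, H', W')
```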
Step two, extracting the sequence characteristics of the characters by the bidirectional long-short term memory network;
considering that a character image in a natural scene is an indefinite-length sequence text, when characteristics are extracted and predicted, the predicted output of the character image has a great relationship with the state and the predicted output of a previous moment and a next moment, if only a convolutional neural network is used, only the spatial characteristics of a text region are extracted, the text is split into the character sequence for prediction, so that a better effect can be achieved, for example, the content of a text box is 'blue sky', the character behind the 'day' character cannot be accurately judged during recognition, the predicted output of the current moment is determined by the predicted output 'day' of the previous moment and the hidden layer state of the current moment after the 'empty' is input by using a long-short term memory network, and the principle is similar to translation.
The parameter dimension of the bidirectional long short-term memory network is 512. By fusing the left-to-right and right-to-left passes over t = 1, 2, 3, ..., T, the states of the hidden layers are superimposed. The long short-term memory network does not change the position of the feature sequence of the feature map, so it is translation invariant and the receptive field of the original image corresponding to each feature vector is unchanged. The dimension of the final output is 1 × 1024 × 50 × 37. The hidden layer of the bidirectional long short-term memory network contains the context state information of the text sequence and serves as the encoding stage of the attention model; the feature vector set is [h_1, h_2, h_3, ..., h_T], where a feature h_i is generated at each moment by combining the features of the two directions: h_i = [h_i, h_i*]. The bidirectional long short-term memory network is developed from the long short-term memory network, and this patent makes full use of its hidden-layer states in the encoding stage and the prediction information at each moment, fully fusing and encoding them (a sketch of this encoding over the CNN feature sequence is given below).
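The sketch below encodes the CNN feature map with a bidirectional LSTM of hidden size 512, giving 1024-dimensional states h_i = [h_i, h_i*]; how the spatial map is serialized into a sequence is an assumption made for illustration.

```python
# Sketch of encoding the CNN feature map with a bidirectional LSTM (hidden 512).
# How the spatial map is flattened into a sequence is an assumption.
import torch
import torch.nn as nn

feature_map = torch.randn(1, 256, 37, 50)          # (N, C, H', W') from the CNN
N, C, H, W = feature_map.shape
seq = feature_map.flatten(2).permute(0, 2, 1)      # (N, H'*W', 256) feature sequence
bilstm = nn.LSTM(input_size=256, hidden_size=512,
                 bidirectional=True, batch_first=True)
h, _ = bilstm(seq)                                 # (N, H'*W', 1024) states h_i
h_map = h.permute(0, 2, 1).reshape(N, 1024, H, W)  # back to a 1 x 1024 x H' x W' map
print(h_map.shape)
```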
the bidirectional long-short term memory network is similar to the principle of the long-short term memory network, the memory state of a cell is controlled by an input gate, an output gate and a forgetting gate, the gate value range is generally [0,1], so that three gates use Sigmoid as an activation function, and the output state uses Tanh activation function.
it=σi(wi[ht-1,xt]+bi) (8)
Formula (8) shows the input at the current time, which is controlled by the input gate, and determines whether the input information is retained by the active function Sigmoid.
ft=σf(wt[ht-1,xt]+bf) (9)
The formula (9) shows that the forgetting gate controls the state of the memory unit to let the memory unit cetSome past states that would interfere with the prediction at the current time are discarded randomly.
Cet=ft*Cet-1+it*tanh(wc·[ht-1,Cet]+bc) (10)
Equation (10) represents the internal Memory IM (inner Memory) of the function m at time t.
ot=σo(wo[ht-1,xt]+bo) (11)
ht=ot☉σh(cet) (12)
Equations (11) and (12) indicate the output gate, (11) the initialization of the output value, and (12) the output value and the memory cell CetTaking correlation operation as hidden state h of current timet,σtIs a tanh activation function, and has the effect of stabilizing the value.
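The following NumPy sketch performs one LSTM step following equations (8) to (12); the weight shapes and the random initialisation are illustrative assumptions.

```python
# NumPy sketch of one LSTM step following equations (8)-(12); weight shapes and
# initialisation are illustrative assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, ce_prev, w, b):
    """One step of a standard LSTM cell; w and b hold the per-gate parameters."""
    hx = np.concatenate([h_prev, x_t])                          # [h_{t-1}, x_t]
    i_t = sigmoid(w["i"] @ hx + b["i"])                         # (8)  input gate
    f_t = sigmoid(w["f"] @ hx + b["f"])                         # (9)  forget gate
    ce_t = f_t * ce_prev + i_t * np.tanh(w["c"] @ hx + b["c"])  # (10) memory cell
    o_t = sigmoid(w["o"] @ hx + b["o"])                         # (11) output gate
    h_t = o_t * np.tanh(ce_t)                                   # (12) hidden state
    return h_t, ce_t

rng = np.random.default_rng(0)
hidden, inp = 4, 3
w = {k: rng.standard_normal((hidden, hidden + inp)) for k in "ifco"}
b = {k: np.zeros(hidden) for k in "ifco"}
h_t, ce_t = lstm_step(rng.standard_normal(inp),
                      np.zeros(hidden), np.zeros(hidden), w, b)
print(h_t, ce_t)
```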
Step three: the semantic code C_i is the key of the attention model. The 1 × 1024 feature vector sequence generated by the bidirectional recurrent neural network is semantically encoded; the aim is to perform a weighted summation over the hidden states h_i of the encoding stage to obtain the attention weights at different moments, and to predict the output at the current moment through attention focusing. The attention mechanism performs feature focusing on a feature sequence S of vectors with length T = 20. If T is too large, too much information has to be memorized and the computation of the model increases sharply; an ordinary text sentence rarely exceeds 20 words, and an excessively large T disperses the model's attention. For example, for the sentence "the blue, blue sky has a white cloud, just like a bird flying in the sky", when the last character is predicted, attention should be focused on the input text at the current moment and on the hidden state at a certain past moment; the attention weights are distributed over the hidden states at different moments, and the larger the weight, the more the attention is focused there. When the characters in this text box are predicted, the current character "floating" is predicted from "white cloud" and "freely"; if T is chosen too large, the weight of "sky" increases while the weight of the useful information "white cloud" and "freely" decreases, the expression deviates during prediction, and the effect instead declines. In the attention model, [x_1, x_2, x_3, ..., x_T] denotes the input at the current moment, A_{t,i} denotes the attention focusing weight, and C_t denotes the weighted value of the features h_i at time t. The specific formulas are as follows:
A_{t,i} = exp(e_{t,i}) / Σ_{k=1}^{T} exp(e_{t,k})    (13)
C_t = Σ_{i=1}^{T} A_{t,i} h_i    (14)
e_{t,i} = f_att(s_{t-1}, h_i)    (15)
s_t = f(s_{t-1}, y_{t-1}, C_t)    (16)
y_t = g(y_{t-1}, s_t, C_t)    (17)
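The NumPy sketch below walks through this attention computation: the scores e_{t,i} of (15), the weights A_{t,i} of (13) and the context C_t of (14). The additive form of f_att and all dimensions are assumptions made for illustration.

```python
# NumPy sketch of the attention computation: scores e_{t,i} of (15), weights
# A_{t,i} of (13) and context C_t of (14). The additive f_att is an assumption.
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, dec_dim = 20, 1024, 512
H = rng.standard_normal((T, enc_dim))        # encoder states h_1 ... h_T
s_prev = rng.standard_normal(dec_dim)        # previous decoder state s_{t-1}

W_h = rng.standard_normal((dec_dim, enc_dim)) * 0.01
W_s = rng.standard_normal((dec_dim, dec_dim)) * 0.01
v = rng.standard_normal(dec_dim) * 0.01

def f_att(s, h):
    """Correlation score between the decoder state and one encoder feature."""
    return v @ np.tanh(W_s @ s + W_h @ h)

e = np.array([f_att(s_prev, H[i]) for i in range(T)])   # (15) scores
A = np.exp(e) / np.exp(e).sum()                          # (13) attention weights
C = (A[:, None] * H).sum(axis=0)                         # (14) context vector C_t
print(A.round(3), C.shape)   # weights sum to 1; C_t is a 1024-d focused feature
```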
Step four: as shown in Figure 1, a transposition layer is used to convert dimensionality; it facilitates decoding and is commonly used to connect long short-term memory networks. Dimension matching is a key point of the model: appropriate dimensions allow the model to train stably and speed up training. The decoding part adopts a nested long short-term memory network and can recognize the attention-focused feature vectors, where the input at time t is the t-th focused feature vector. The nested long short-term memory network selectively reads and writes using standard long short-term memory network gates. This key feature enables the model to achieve a more effective time hierarchy than traditional stacked long short-term memory networks. As shown in Figure 2, s_t denotes the hidden state of the decoder at time t. The nested long short-term memory network shown in Figure 1 is formed by nesting two layers of long short-term memory networks; the dimensions of networks 1 and 2 are both 256, and the dimension after nesting and fusion is 512. The prediction output y_t at a time t of decoding is jointly determined by the prediction output y_{t-1} at the previous time, the decoder hidden state s_t and the attention weighted value C_t.
The nested long short-term memory network differs from the traditional stacked long short-term memory network, in which layers are stacked one after another like a chain and the output of the previous layer is the input of the next layer. In the nested network, the gates of a standard long short-term memory network selectively read and write an internal memory, which yields a more effective time hierarchy than stacking. Its memory cell function is also different, as shown in the following formula:
Ce_t = IM_t(f_t ⊙ Ce_{t-1}, i_t ⊙ g_t)    (18)
and finally, the output is expressed in a probability form by adopting Softmax, then the probability value is selected as a prediction result, a predicted value is provided for each time t of the long-term and short-term memory network, and then characters before the end character are taken according to the time sequence to form a character string, namely the required result.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (4)

1. A character recognition system based on the combination of a neural network and an attention mechanism, comprising: a feature extraction module, an encoding and attention module and a decoding module, wherein the feature extraction module adopts a structure combining a convolutional neural network and a bidirectional long short-term memory network, the convolutional neural network is used to extract the spatial features of the character image, and the bidirectional long short-term memory network is used to extract the sequence features of the characters;
an encoding and attention module, which performs a weighted summation over the hidden states h_i of the encoding stage of the bidirectional long short-term memory network to obtain attention weights at different moments, and then predicts the output at the current moment through attention focusing;
the decoding module adopts a nested long short-term memory network; the decoding part analyses the intermediate semantic information generated by encoding, decoding focuses attention on the encoded states through the attention mechanism, then learns the decoded information of past moments through the nested long short-term memory network, which is used to extract the sequence information of the text, and predicts the output at the current moment from the state at the previous moment;
the convolutional neural network module comprises convolutional layer 1, pooling layer 1, convolution group 2, convolutional layer 3, pooling layer 2, convolutional layer 4, pooling layer 3, a batch normalization layer, convolutional layer 5, pooling layer 4, a batch normalization layer and a Dropout layer;
the detailed parameter settings of the convolutional neural network are as follows: convolutional layer 1 has 64 convolution kernels of size 5 × 5, stride 1 and padding 1; the pooling layers all use the mean-pooling method with the same parameter settings: kernel size 3 × 3, stride 2, padding 0; convolution group 2 comprises parallel convolutional layers A and B with 7 × 7 and 5 × 5 kernels, followed by a 1 × 1 convolutional layer with C kernels stacked on the parallel layers, where C denotes the number of convolution kernels, and the dimension can be reduced by adjusting C, which speeds up computation and reduces computational cost; convolutional layers 3, 4 and 5 all use 3 × 3 kernels, with 128, 128 and 256 kernels respectively, stride 1 and padding 1; the batch normalization layer standardizes each mini-batch of data by computing the mean and variance of the data, normalizing the data, and then applying a learned scale and shift; the Dropout layer can be regarded as randomly sampling sub-models and then averaging them, i.e., hidden units are dropped at random;
the encoding and attention module is specified as follows: the semantic code C_i is the key of the attention model; the 1 × 1024 feature vector sequence generated by the bidirectional recurrent neural network is semantically encoded, with the aim of performing a weighted summation over the hidden states h_i of the encoding stage to obtain the attention weights at different moments and predicting the output at the current moment through attention focusing; the attention mechanism performs feature focusing on a feature sequence S of vectors with length T = 20; when the last character is predicted, attention is focused on the input text at the current moment and on the hidden state at a certain past moment, the attention weights being distributed over the hidden states at different moments, and the larger the weight, the more the attention is focused there; in the attention model, [x_1, x_2, x_3, ..., x_T] denotes the input at the current moment, A_{t,i} denotes the attention focusing weight, and C_t denotes the weighted value of the features h_i at time t;
a is describedt,i、CtThe specific formula is as follows:
A_{t,i} = exp(e_{t,i}) / Σ_{k=1}^{T} exp(e_{t,k})    (13)
C_t = Σ_{i=1}^{T} A_{t,i} h_i    (14)
e_{t,i} = f_att(s_{t-1}, h_i)    (15)
s_t = f(s_{t-1}, y_{t-1}, C_t)    (16)
y_t = g(y_{t-1}, s_t, C_t)    (17)
f_att(s_{t-1}, h_i) is a correlation function expressing the degree of correlation between the decoder state at time t-1 and the encoder feature h_i; y_t denotes the prediction output of the decoding module, and g(y_{t-1}, s_t, C_t) denotes a probability output function;
the decoding module adopts a nested long short-term memory network to recognize the attention-focused feature vectors, where the input at time t is the t-th focused feature vector; the nested long short-term memory network selectively reads and writes using standard long short-term memory network gates, and the prediction output y_t at a time t of decoding is jointly determined by the prediction output y_{t-1} at the previous time, the decoder hidden state s_t and the attention weighted value C_t; the memory cell function is formulated as follows:
Ce_t = IM_t(f_t ⊙ Ce_{t-1}, i_t ⊙ g_t)    (18)
f_t denotes a non-linear function of the forward propagation, IM_t denotes the internal memory state of the nested long short-term memory network, Ce_{t-1} denotes the state of the memory cell at the previous time t-1, and g_t denotes the gating function of the long short-term memory network;
finally, Softmax is used to express the output in probability form, and the class with the largest probability is selected as the prediction result; the long short-term memory network provides a predicted value at each time t, and the characters before the end symbol are then taken in time order to form a character string, which is the required result.
2. The system of claim 1, wherein when the resolution of the input image is 800 × 600, the above convolution and pooling process yields a feature map of size 1 × 256 × 50 × 37, i.e., a sequence of 1 × 256 feature vectors, and an acceleration layer is then added; the acceleration layer is an optimization method provided by Caffe that converts the pixels in the small window covered by a convolution kernel into a row and stores it in contiguous memory.
3. The system of claim 1 or 2, wherein the parameter dimension of the bidirectional long short-term memory network is 512; by fusing the left-to-right and right-to-left passes over t = 1, 2, 3, ..., T, the states of the hidden layers are superimposed; the long short-term memory network does not change the position of the feature sequence of the feature map, so it is translation invariant and the receptive field of the original image corresponding to each feature vector is unchanged; the output dimension is 1 × 1024 × 50 × 37; the hidden layer of the bidirectional long short-term memory network contains the context state of the text sequence and serves as the encoding process of the attention model; the feature vector set is [h_1, h_2, h_3, ..., h_T], where a feature vector h_i is generated at each moment by combining the features of the two directions: h_i = [h_i, h_i*].
4. A character recognition method based on the combination of a neural network and an attention mechanism, characterized by comprising: a feature extraction step, an encoding and attention step and a decoding step, wherein the feature extraction step adopts a structure combining a convolutional neural network and a bidirectional long short-term memory network, the convolutional neural network is used to extract the spatial features of the character image, and the bidirectional long short-term memory network is used to extract the sequence features of the characters;
an encoding and attention step, which performs a weighted summation over the hidden states h_i of the encoding stage of the bidirectional long short-term memory network to obtain attention weights at different moments, and then predicts the output at the current moment through attention focusing;
the decoding step adopts a nested long short-term memory network; the decoding part analyses the intermediate semantic information generated by encoding, decoding focuses attention on the encoded states through the attention mechanism, then learns the decoded information of past moments through the nested long short-term memory network to extract the sequence information of the text, and predicts the output at the current moment from the state at the previous moment.
CN201811230112.0A 2018-10-22 2018-10-22 Character recognition system and method based on combination of neural network and attention mechanism Active CN109389091B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811230112.0A CN109389091B (en) 2018-10-22 2018-10-22 Character recognition system and method based on combination of neural network and attention mechanism


Publications (2)

Publication Number Publication Date
CN109389091A CN109389091A (en) 2019-02-26
CN109389091B (en) 2022-05-03

Family

ID=65427622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811230112.0A Active CN109389091B (en) 2018-10-22 2018-10-22 Character recognition system and method based on combination of neural network and attention mechanism

Country Status (1)

Country Link
CN (1) CN109389091B (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180065498A (en) * 2016-12-08 2018-06-18 한국항공대학교산학협력단 Method for deep learning and method for generating next prediction image using the same
CN107179683A (en) * 2017-04-01 2017-09-19 浙江工业大学 A kind of interaction intelligent robot motion detection and control method based on neutral net
CN107368831A (en) * 2017-07-19 2017-11-21 中国人民解放军国防科学技术大学 English words and digit recognition method in a kind of natural scene image
CN107484017A (en) * 2017-07-25 2017-12-15 天津大学 Supervision video abstraction generating method is had based on attention model
CN107562784A (en) * 2017-07-25 2018-01-09 同济大学 Short text classification method based on ResLCNN models
CN107797992A (en) * 2017-11-10 2018-03-13 北京百分点信息科技有限公司 Name entity recognition method and device
CN108024158A (en) * 2017-11-30 2018-05-11 天津大学 There is supervision video abstraction extraction method using visual attention mechanism
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 A kind of hybrid neural networks file classification method of fusion abstract and body feature

Also Published As

Publication number Publication date
CN109389091A (en) 2019-02-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant