CN107368831B - Method for recognizing English words and digits in natural scene images - Google Patents


Info

Publication number
CN107368831B
CN107368831B (application number CN201710592890.3A)
Authority
CN
China
Prior art keywords
layer
character
image
feature
short
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710592890.3A
Other languages
Chinese (zh)
Other versions
CN107368831A (en)
Inventor
张军
涂丹
李硕豪
陈旭
雷军
郭强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201710592890.3A priority Critical patent/CN107368831B/en
Publication of CN107368831A publication Critical patent/CN107368831A/en
Application granted granted Critical
Publication of CN107368831B publication Critical patent/CN107368831B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Abstract

The present invention provides a method for recognizing English words and digits in natural scene images. The recognition problem is divided into three steps: feature extraction, feature focusing, and feature recognition. A convolutional neural network extracts features from the input image, an attention mechanism focuses on the useful information in the feature sequence, and a long short-term memory network recognizes the feature vectors. By combining the deep neural network with the attention mechanism, the final recognition result is obtained directly once an image is input to the network. The invention requires neither a sliding-window operation over the input image with recognition of the characters in each window, nor a merging algorithm to integrate the recognized characters: the output character string is the final recognition result.

Description

Method for recognizing English words and digits in natural scene images
Technical field
The invention belongs to the technical field of character recognition, and relates to a method that uses a deep neural network and an attention mechanism to recognize English words and digits in natural scene images.
Background art
Text in natural scenes often carries important information and can be used to describe the content of an image. Automatically extracting the text in an image helps people understand the image more effectively and supports processing such as storage, compression, and retrieval. In contrast to natural scene text detection, natural scene text recognition identifies the characters in regions that have already been detected. English and digits are used worldwide and appear widely in the scenes of many countries, so recognizing English words and digits is of great significance. However, unlike handwritten digit recognition, the position, size, font, illumination, viewing angle, and shape of text and digits in natural scenes are highly variable, and the background of natural scene text is very complex, so recognizing English words and digits in natural scenes presents many technical difficulties.
Existing natural scene text recognition algorithms are usually bottom-up, see [Neumann L, Matas J. 'Real-time lexicon-free scene text localization and recognition', IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 38, (9), pp. 1872-1885]: a sliding-window operation and a traditional classifier first recognize each character of the English words and digits in the image, and, since a window does not necessarily contain a character, a merging algorithm is then needed to integrate the recognized characters into strings. This method has two limitations: 1. the accuracy of character recognition with a sliding window and a traditional classifier is not high; 2. the character recognizer and the merging algorithm are trained separately, so the errors produced by each are passed directly into the final recognition result, lowering the overall text recognition precision.
Summary of the invention
It is an object of the invention to overcome these limitations. A deep neural network and an attention mechanism are combined, and the combined neural network is trained and used for recognition as a whole; without any sliding-window operation, the recognition result is output directly for a given image containing English words and digits.
The principle of the invention is as follows. First, a convolutional neural network, widely applied in the field of computer vision, extracts a two-dimensional feature matrix from the input image; under the action of the convolutional network, each column of the matrix represents the deep features of the corresponding region of the input image, and the matrix is serialized column by column into a feature sequence. Then, an attention mechanism extracts the character-related information in the feature sequence and filters out redundant information to obtain feature vectors; the attention mechanism, a common model in deep learning, observes things in a focused way according to the observation pattern of human vision and filters out useless information. Finally, a long short-term memory network recognizes the English text and digits in the image one by one, in left-to-right spatial order.
The technical scheme of the invention is a method for recognizing English words and digits in natural scene images. The input image is a grayscale image containing English words and digits. The method combines a deep neural network with an attention mechanism, trains and applies the combined neural network as a whole, and, without any sliding-window operation, directly outputs the recognition result for a given image containing English words and digits. The method specifically includes the following steps:
Step (1): feature extraction from the input image. The invention uses the convolutional neural network in the deep neural network to extract features from the input image, and takes the output of the network as the feature extraction result. Unlike a traditional convolutional neural network, which outputs a three-dimensional feature matrix, the network designed here outputs a two-dimensional feature matrix. From input to output the network consists of: convolutional layer 1, batch normalization layer 1, pooling layer 1, convolutional layer 2, batch normalization layer 2, pooling layer 2, convolutional layer 3, batch normalization layer 3, convolutional layer 4, batch normalization layer 4, pooling layer 4, convolutional layer 5, batch normalization layer 5, convolutional layer 6, batch normalization layer 6, pooling layer 6, convolutional layer 7, batch normalization layer 7. The parameters of the convolutional layers, in the order (kernel size, number of channels, stride, padding), are: (3,64,1,1), (3,128,1,1), (3,256,1,1), (3,256,1,1), (3,512,1,1), (3,512,1,1) and (2,512,1,0). The purpose of the batch normalization layers is to adjust the distribution of intermediate result data; no parameters are listed for them. The parameters of the pooling layers, in the order (window size, horizontal stride, vertical stride, horizontal padding, vertical padding), are: (2*2,2,2,0,0), (2*2,2,2,0,0), (1*2,1,2,0,0) and (1*2,1,2,0,0). The resolution of the image is adjusted to 80 × 32 before it is input to the network, so the two-dimensional matrix output by the network has size 512 × 19. Serializing this two-dimensional feature matrix yields a feature sequence of 19 vectors of size 1 × 512, denoted S = {s1, s2, ..., sL}, where si ∈ R512 (R512 denotes a 1 × 512 vector), i = 1, 2, ..., L, and L = 19 is the length of the sequence.
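As a cross-check of the 512 × 19 output size, the layer parameters above can be traced with the standard output-size formula out = (in + 2·pad − kernel) / stride + 1. The following sketch is an illustration of that arithmetic only, not the patented implementation:

```python
def out_size(size, kernel, stride, pad):
    # Standard formula for the output size of a convolution or pooling layer.
    return (size + 2 * pad - kernel) // stride + 1

def trace_shapes(width=80, height=32):
    # Convolutional layers: (kernel, channels, stride, pad); layer 7 has no padding.
    convs = [(3, 64, 1, 1), (3, 128, 1, 1), (3, 256, 1, 1), (3, 256, 1, 1),
             (3, 512, 1, 1), (3, 512, 1, 1), (2, 512, 1, 0)]
    # Pooling layers keyed by the conv index they follow (convs 1, 2, 4, 6):
    # (window_w, window_h, stride_w, stride_h), zero padding.
    pools = {0: (2, 2, 2, 2), 1: (2, 2, 2, 2), 3: (1, 2, 1, 2), 5: (1, 2, 1, 2)}
    channels = 1
    for i, (k, c, s, p) in enumerate(convs):
        width, height, channels = out_size(width, k, s, p), out_size(height, k, s, p), c
        if i in pools:
            ww, wh, sw, sh = pools[i]
            width, height = out_size(width, ww, sw, 0), out_size(height, wh, sh, 0)
    return channels, width, height

print(trace_shapes())  # → (512, 19, 1): 19 columns, each a 512-dimensional feature
```

The height collapses to 1, so the 512 × 19 × 1 output is exactly the two-dimensional feature matrix described above.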
Step (2): feature focusing. The attention mechanism is applied to the feature sequence S of 19 vectors of size 1 × 512, and the set of feature vectors output by the attention mechanism is taken as the result of feature focusing. The invention recognizes the characters in the image one by one in left-to-right spatial order, and the maximum character length in the training dataset Synth [Jaderberg M, Simonyan K, Vedaldi A, et al. 'Reading text in the wild with convolutional neural networks', International Journal of Computer Vision, 2016, 116, (1), pp. 1-20] is 24, so the output of the invention is a combination of English words and digits of length 24, and the algorithm performs 24 feature focusings, each regarded as one time step. The final output is the set of 24 focused feature vectors Vf = {V1, V2, ..., VT}, T = 24, in which the feature vector Vt denotes the result of the t-th feature focusing:
Vt = Σi=1..L αt,i si

where αt = (αt,1, ..., αt,L) is the attention coefficient vector of the t-th feature focusing and Σi αt,i = 1. Each element of this coefficient vector is obtained from:

et,i = wT tanh(Wa si + Ua ht-1 + ba), αt,i = exp(et,i) / Σj=1..L exp(et,j)

where ht-1 denotes the hidden variable of the long short-term memory unit at time t-1 in the third step, and wT, Wa, Ua and ba are the parameters of the attention model, trained by the back-propagation algorithm based on stochastic gradient descent.
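A minimal NumPy sketch of one focusing step, following the equations above. Dimensions and parameter names follow the text; the random parameter values and the attention size 256 are illustrative assumptions, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, a = 19, 512, 256                    # sequence length, feature size, attention size (assumed)

S = rng.standard_normal((L, d))           # feature sequence s_1..s_L from step (1)
h_prev = rng.standard_normal(a)           # LSTM hidden variable h_{t-1} from step (3)
W_a = rng.standard_normal((a, d)) * 0.01  # attention parameters (untrained here)
U_a = rng.standard_normal((a, a)) * 0.01
b_a = np.zeros(a)
w = rng.standard_normal(a) * 0.01

def focus(S, h_prev):
    # e_{t,i} = w^T tanh(W_a s_i + U_a h_{t-1} + b_a)
    e = np.array([w @ np.tanh(W_a @ s + U_a @ h_prev + b_a) for s in S])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                  # softmax: coefficients sum to 1
    return alpha @ S                      # V_t = sum_i alpha_{t,i} s_i

V_t = focus(S, h_prev)
print(V_t.shape)                          # one focused 1 × 512 feature vector per time step
```

Running this 24 times, with h_prev updated by the LSTM of step (3) after each focusing, produces the set Vf.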
Step (3): recognition of the focused feature vectors. The invention uses the long short-term memory network in the deep neural network to recognize the focused feature vectors. Following the assumption of a maximum string length, the long short-term memory network contains 24 units; the output of each unit is a recognized character, and each character has 37 classes (26 English letters, the 10 digits 0-9, and the end mark "-", which indicates that recognition of the character string has finished). The input of the long short-term memory unit at time t is the feature vector Vt of the t-th feature focusing, and its output is the recognized character class Jt. At each time step, the class with the largest probability is chosen as the output of the long short-term memory unit, according to:
Jt = argmaxi zi, where zi = softmax(ht)
where ht denotes the hidden variable of the long short-term memory unit at time t (see the description of Fig. 3). After recognition finishes, the output of the whole network is a combination of 24 characters; the invention takes the character string before the end mark as the final recognition result.
The input of step (1) is the image containing English words and digits, and its output is the feature sequence; step (2) computes from the feature sequence the feature vectors required as input to step (3); and step (3) outputs the recognized character string. After the three steps are integrated into one framework, the parameters of the whole model need to be trained. Let X = {Ii, Li} be the training dataset, where Ii denotes the i-th image and Li its corresponding label, that is, the true value of the character string in the image. The objective function of the training process can then be expressed as:

W* = argminW Σi -log p(J = Li | Ii; W)
where W denotes the parameters of the whole model, comprising the parameters of the convolutional neural network, the attention mechanism and the long short-term memory network, and W* denotes the optimum of these parameters. J = {J1, ..., JT} denotes the string recognized by the model, a character string of 24 characters; the probability that the whole string is recognized correctly equals the product of the probabilities that each character in the string is recognized correctly, so -log p(J = Li | Ii) can be expressed as:

-log p(J = Li | Ii) = -Σt=1..T log p(Jt = Li,t | Ii, J1, ..., Jt-1)
where Li,t denotes the t-th character of the label of the i-th image; the objective function can therefore be expressed as:

W* = argminW Σi Σt=1..T -log p(Jt = Li,t | Ii, J1, ..., Jt-1)
After the objective function has been obtained, the network parameters W are trained with the back-propagation algorithm based on stochastic gradient descent; see [Shi B, Bai X, Yao C. 'An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition', arXiv preprint arXiv:1507.05717, 2015].
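The per-image loss above is an ordinary sum of per-character negative log-likelihoods. A small illustrative computation on a made-up 3-class toy alphabet (in the invention T = 24 and there are 37 classes):

```python
import math

# Hypothetical per-character distributions z_t = softmax(h_t) for the toy
# alphabet {'a', 'b', '-'} and the label "ab-". The probabilities are invented
# for the example; they are not outputs of the patented model.
z = [
    {'a': 0.7, 'b': 0.2, '-': 0.1},    # time step 1
    {'a': 0.1, 'b': 0.8, '-': 0.1},    # time step 2
    {'a': 0.05, 'b': 0.05, '-': 0.9},  # time step 3
]
label = 'ab-'

# -log p(J = L | I) = -sum_t log p(J_t = L_t | I, J_1..J_{t-1})
loss = -sum(math.log(z_t[c]) for z_t, c in zip(z, label))
print(round(loss, 4))  # → 0.6852
```

Summing this quantity over the training set gives the objective minimized over W.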
If the input image is a color image, it is converted to grayscale before the above steps are executed.
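The patent does not specify the conversion formula; a common choice (an assumption here, not part of the patent text) is the ITU-R BT.601 luma weighting:

```python
def to_grayscale(pixel_rgb):
    # BT.601 luma: gray = 0.299 R + 0.587 G + 0.114 B (assumed; the patent only
    # requires that a color input be converted to a grayscale image).
    r, g, b = pixel_rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

print(round(to_grayscale((255, 255, 255))))  # pure white keeps full intensity: 255
```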
Compared with the prior art, the beneficial effects of the invention are:
The invention combines a deep neural network with an attention mechanism, so the final recognition result is obtained directly when an image is input to the network. The invention therefore requires neither a sliding-window operation over the input image with recognition of the characters in each window, nor a merging algorithm to integrate the recognized characters: the output character string is the final recognition result.
Description of the drawings
To explain the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are clearly only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative labor.
Fig. 1 is the overall flowchart of the invention;
Fig. 2 is the design diagram of the convolutional neural network of the invention;
Fig. 3 is the internal structure diagram of a long short-term memory network unit;
Fig. 4 is a first example of the invention recognizing English words and digits;
Fig. 5 is a second example of the invention recognizing English words and digits.
Specific embodiment
The technical solutions in the embodiments of the invention are described below clearly and completely with reference to the drawings. The described embodiments are clearly only a part of the embodiments of the invention, not all of them; all other embodiments obtained by those of ordinary skill in the art from these embodiments without creative labor fall within the protection scope of the invention.
The overall flowchart of the invention, "a method for recognizing English words and digits in natural scene images", is shown in Fig. 1. The recognition problem of English words and digits in natural scenes is divided into three steps: feature extraction, feature focusing and feature recognition.
Step (1): feature extraction. The invention uses a convolutional neural network to extract features from the input image. The input is an image containing English characters and digits under a natural scene, adjusted to a size of 80 × 32 before being input to the network. As shown in Fig. 2, unlike a traditional convolutional neural network, which can only output a three-dimensional feature matrix, the network designed here can output a two-dimensional feature matrix. As shown in the figure, the network consists, from top to bottom, of: convolutional layer 1, batch normalization layer 1, pooling layer 1, convolutional layer 2, batch normalization layer 2, pooling layer 2, convolutional layer 3, batch normalization layer 3, convolutional layer 4, batch normalization layer 4, pooling layer 4, convolutional layer 5, batch normalization layer 5, convolutional layer 6, batch normalization layer 6, pooling layer 6, convolutional layer 7, batch normalization layer 7. The parameters of the convolutional layers, in the order (kernel size, number of channels, stride, padding), are: (3,64,1,1), (3,128,1,1), (3,256,1,1), (3,256,1,1), (3,512,1,1), (3,512,1,1) and (2,512,1,0). The purpose of the batch normalization layers is to adjust the distribution of intermediate result data; no parameters are listed for them. The parameters of the pooling layers, in the order (window size, horizontal stride, vertical stride, horizontal padding, vertical padding), are: (2*2,2,2,0,0), (2*2,2,2,0,0), (1*2,1,2,0,0) and (1*2,1,2,0,0). The output is a two-dimensional 512 × 19 feature matrix; serializing it by columns yields a feature sequence of 19 vectors of size 1 × 512, denoted S = {s1, s2, ..., sL}, where si ∈ R512, i = 1, 2, ..., L, and L = 19 is the length of the sequence.
Step (2): feature focusing. The invention uses the attention mechanism to focus on the useful information in the feature sequence. The input is the feature sequence of 19 vectors of size 1 × 512 obtained in the feature extraction stage, and the output is a set of feature vectors. The algorithm recognizes the characters in the image one by one in left-to-right spatial order, and the maximum string length in an image is set to 24, so the algorithm performs T = 24 feature focusings. The final output is the set of 24 focused feature vectors Vf = {V1, V2, ..., VT}, where the feature vector Vt denotes the result of the t-th feature focusing:
Vt = Σi=1..L αt,i si

where αt = (αt,1, ..., αt,L) is the attention coefficient vector of the t-th feature focusing and Σi αt,i = 1. Each element of this coefficient vector can be obtained from:

et,i = wT tanh(Wa si + Ua ht-1 + ba), αt,i = exp(et,i) / Σj=1..L exp(et,j)

where wT, Wa, Ua and ba are the parameters of the attention model, trained by the back-propagation algorithm based on stochastic gradient descent, and ht-1 denotes the hidden variable of the long short-term memory unit at time t-1 in the third step, as shown in Fig. 3.
Fig. 3 is the internal structure diagram of a long short-term memory unit. The long short-term memory network, an improved variant of the recurrent neural network, limits the vanishing-gradient problem that occurs when training conventional recurrent neural networks by means of gate operations. As shown in the figure for time t, one long short-term memory unit consists of a memory cell ct and three gate operations it, ot, ft. Here it is the input gate, which determines how much information of the current time step is input into the unit; ot is the output gate, which determines how much information the unit outputs at this time step; and ft is the forget gate, which determines how much of the previous time step's unit output the current unit receives. The specific calculation process is as follows:
it=σ (WixVt+Wimht-1+bi)
ft=σ (WfxVt+Wfmht-1+bf)
ot=σ (WoxVt+Womht-1+bo)
gt=tanh(WgxVt+Wgmht-1+bg)
ct=ft⊙ct-1+it⊙gt
ht=ot⊙tanh(ct)
Here ht is the hidden variable of the long short-term memory unit at time t, σ denotes the sigmoid function, and ⊙ denotes element-wise multiplication. Wix, Wim, Wfx, Wfm, Wox, Wom, Wgx, Wgm, bi, bf, bo, bg denote the parameters of the long short-term memory unit; since the parameters of all units in the long short-term memory network are shared, these parameters also serve as the parameters of the network, and in the training stage the invention trains them using the back-propagation algorithm based on stochastic gradient descent.
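One such unit can be sketched in NumPy directly from the gate equations above. The parameter values are random and untrained, and the state size of 256 is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 512, 256                 # input size (one focused vector V_t), state size (assumed)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Shared unit parameters W_ix ... W_gm and biases (untrained random values here).
P = {k: rng.standard_normal((n, d if k.endswith('x') else n)) * 0.01
     for k in ('Wix', 'Wim', 'Wfx', 'Wfm', 'Wox', 'Wom', 'Wgx', 'Wgm')}
B = {k: np.zeros(n) for k in ('bi', 'bf', 'bo', 'bg')}

def lstm_step(V_t, h_prev, c_prev):
    i = sigmoid(P['Wix'] @ V_t + P['Wim'] @ h_prev + B['bi'])  # input gate i_t
    f = sigmoid(P['Wfx'] @ V_t + P['Wfm'] @ h_prev + B['bf'])  # forget gate f_t
    o = sigmoid(P['Wox'] @ V_t + P['Wom'] @ h_prev + B['bo'])  # output gate o_t
    g = np.tanh(P['Wgx'] @ V_t + P['Wgm'] @ h_prev + B['bg'])  # candidate g_t
    c = f * c_prev + i * g                                     # memory cell c_t
    h = o * np.tanh(c)                                         # hidden variable h_t
    return h, c

h, c = np.zeros(n), np.zeros(n)
for _ in range(24):             # 24 time steps, one per focused feature vector
    h, c = lstm_step(rng.standard_normal(d), h, c)
print(h.shape)
```

Because the same P and B are reused at every step, the sketch also illustrates the parameter sharing across the 24 units noted above.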
Step (3): character recognition. The invention uses the long short-term memory network to recognize the feature vectors; the input is the 24 focused feature vectors and the output is a character string of length 24. In the invention, the long short-term memory network contains 24 long short-term memory units, that is, the recognition of the whole string takes 24 time steps. The input of the unit at time t is the feature vector Vt of the t-th feature focusing, and its output is the recognized character class Jt. Jt has 37 classes (26 English letters, the ten digits 0-9, and the end mark "-"). At each time step, the class with the largest probability is chosen as the output of the long short-term memory unit at that time step, according to:
Jt = argmaxi zi, where zi = softmax(ht)
where ht denotes the hidden variable of the long short-term memory unit at time t. After recognition finishes, the output of the whole network, as shown in Fig. 1, is a combination of 24 characters, e.g. 'a' 'd' 'o' 'n' 'i' 's' '-' '-' '-' ..., and the final recognition result is 'adonis'.
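Keeping only the characters before the first end mark can be sketched as follows (a small illustrative helper, not part of the patent text):

```python
def decode(chars, end_mark='-'):
    # The final result is the character string before the first end mark.
    out = []
    for c in chars:
        if c == end_mark:
            break
        out.append(c)
    return ''.join(out)

raw = list('adonis') + ['-'] * 18  # the 24 per-step outputs of the example above
print(decode(raw))                 # → adonis
```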
Fig. 4 is a first example in which the invention correctly recognizes English words and digits; both the true value and the predicted value are 'brutalities'. It can be seen that the invention can recognize images with large character deformation, i.e. it is fairly robust.
Fig. 5 is a second example of the invention recognizing English words and digits; the true value is 'recapitaliozes' and the predicted value is 'regapitaliozes', the third letter being misrecognized. It can be seen that the noise of the image is large; the erroneous character can hardly be distinguished even by the human eye.

Claims (3)

1. A method for recognizing English words and digits in a natural scene image, comprising the following steps:
Step (1): feature extraction is performed on the input image using a convolutional neural network in a deep neural network, and the output of the convolutional neural network is taken as the feature extraction result; the convolutional neural network consists, from input to output, of: convolutional layer 1, batch normalization layer 1, pooling layer 1, convolutional layer 2, batch normalization layer 2, pooling layer 2, convolutional layer 3, batch normalization layer 3, convolutional layer 4, batch normalization layer 4, pooling layer 4, convolutional layer 5, batch normalization layer 5, convolutional layer 6, batch normalization layer 6, pooling layer 6, convolutional layer 7, batch normalization layer 7; the parameters of convolutional layers 1-7, in the order (kernel size, number of channels, stride, padding), are: (3,64,1,1), (3,128,1,1), (3,256,1,1), (3,256,1,1), (3,512,1,1), (3,512,1,1) and (2,512,1,0); the purpose of batch normalization layers 1-7 is to adjust the distribution of intermediate result data, and no parameters are listed for them; the parameters of pooling layers 1, 2, 4, 6, in the order (window size, horizontal stride, vertical stride, horizontal padding, vertical padding), are: (2*2,2,2,0,0), (2*2,2,2,0,0), (1*2,1,2,0,0) and (1*2,1,2,0,0); before the image is input to the convolutional neural network its resolution is adjusted to 80 × 32, and the output of the convolutional neural network is a two-dimensional feature matrix of size 512 × 19; serializing the two-dimensional feature matrix yields a feature sequence of 19 vectors of size 1 × 512, denoted S = {s1, s2, ..., sL}, where si ∈ R512, i = 1, 2, ..., L, and L = 19 denotes the length of the sequence;
Step (2): the attention mechanism is used to perform feature focusing on the feature sequence S of 19 vectors of size 1 × 512: the characters in the image are recognized one by one in left-to-right spatial order, the maximum character length in the training dataset is set to 24, 24 feature focusings are performed on the feature sequence S, and each feature focusing is regarded as one time step; the output is the set of feature vectors Vf = {V1, V2, ..., VT}, T = 24, where the feature vector Vt denotes the result of the t-th feature focusing: Vt = Σi=1..L αt,i si, in which αt = (αt,1, ..., αt,L), with Σi αt,i = 1, represents the attention coefficient vector of the t-th feature focusing, whose elements are αt,i = exp(et,i) / Σj exp(et,j) with et,i = wT tanh(Wa si + Ua ht-1 + ba), where ht-1 denotes the hidden variable of the long short-term memory unit at time t-1 in the third step; wT, Wa, Ua and ba are the parameters of the attention model, trained by the back-propagation algorithm based on stochastic gradient descent;
Step (3): the long short-term memory network in the deep neural network is used to recognize the focused feature vectors: the long short-term memory network contains 24 units; the input of the long short-term memory unit at time t is the feature vector Vt of the t-th feature focusing, and its output is the recognized character class Jt; at each time step, the character class with the largest probability is chosen as the output of the long short-term memory unit at that time step: Jt = argmaxi zi, where zi = softmax(ht), and ht denotes the hidden variable of the long short-term memory unit at time t; after recognition finishes, the output of the whole network is a combination of 24 characters, and the character string before the end mark is taken as the final recognition result; Jt has 37 classes, comprising the 26 English letters, the 10 digits 0-9, and the end mark "-", which indicates that recognition of the character string has finished.
2. The method as claimed in claim 1, characterized in that the parameters of the method are trained as follows: let X = {Ii, Li} be the training dataset, with Ii denoting the i-th image and Li the true value of the character string in the i-th image; the objective function of the training process is W* = argminW Σi Σt=1..T -log p(Jt = Li,t | Ii, J1, ..., Jt-1), where W denotes the parameters of the convolutional neural network, the attention mechanism and the long short-term memory network, W* denotes the optimum of these parameters, Li,t denotes the t-th character of the label of the i-th image, and p(Jt = Li,t | Ii, J1, ..., Jt-1) is the probability that the t-th character takes the label value Li,t given the values of the preceding t-1 characters; the network parameters W are trained using the back-propagation algorithm based on stochastic gradient descent.
3. The method as claimed in claim 1, characterized in that the input image is a grayscale image.
CN201710592890.3A 2017-07-19 2017-07-19 English words and digit recognition method in a kind of natural scene image Active CN107368831B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710592890.3A CN107368831B (en) 2017-07-19 2017-07-19 English words and digit recognition method in a kind of natural scene image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710592890.3A CN107368831B (en) 2017-07-19 2017-07-19 English words and digit recognition method in a kind of natural scene image

Publications (2)

Publication Number Publication Date
CN107368831A CN107368831A (en) 2017-11-21
CN107368831B true CN107368831B (en) 2019-08-02

Family

ID=60308319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710592890.3A Active CN107368831B (en) 2017-07-19 2017-07-19 English words and digit recognition method in a kind of natural scene image

Country Status (1)

Country Link
CN (1) CN107368831B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229469A (en) * 2017-11-22 2018-06-29 北京市商汤科技开发有限公司 Recognition methods, device, storage medium, program product and the electronic equipment of word
CN108154136B (en) * 2018-01-15 2022-04-05 众安信息技术服务有限公司 Method, apparatus and computer readable medium for recognizing handwriting
CN110321755A (en) * 2018-03-28 2019-10-11 中移(苏州)软件技术有限公司 A kind of recognition methods and device
CN110659641B (en) * 2018-06-28 2023-05-26 杭州海康威视数字技术股份有限公司 Text recognition method and device and electronic equipment
CN109242140A (en) * 2018-07-24 2019-01-18 浙江工业大学 A kind of traffic flow forecasting method based on LSTM_Attention network
CN109117846B (en) * 2018-08-22 2021-11-16 北京旷视科技有限公司 Image processing method and device, electronic equipment and computer readable medium
CN111027555B (en) * 2018-10-09 2023-09-26 杭州海康威视数字技术股份有限公司 License plate recognition method and device and electronic equipment
CN109522600B (en) * 2018-10-16 2020-10-16 浙江大学 Complex equipment residual service life prediction method based on combined deep neural network
CN109446187B (en) * 2018-10-16 2021-01-15 浙江大学 Method for monitoring health state of complex equipment based on attention mechanism and neural network
CN109389091B (en) * 2018-10-22 2022-05-03 重庆邮电大学 Character recognition system and method based on combination of neural network and attention mechanism
CN109726712A (en) * 2018-11-13 2019-05-07 平安科技(深圳)有限公司 Character recognition method, device and storage medium, server
CN111222589B (en) * 2018-11-27 2023-07-18 中国移动通信集团辽宁有限公司 Image text recognition method, device, equipment and computer storage medium
CN111352827A (en) * 2018-12-24 2020-06-30 中移信息技术有限公司 Automatic testing method and device
CN109858420A (en) * 2019-01-24 2019-06-07 国信电子票据平台信息服务有限公司 A kind of bill processing system and processing method
CN109992686A (en) * 2019-02-24 2019-07-09 复旦大学 Based on multi-angle from the image-text retrieval system and method for attention mechanism
CN109977969A (en) * 2019-03-27 2019-07-05 北京经纬恒润科技有限公司 A kind of image-recognizing method and device
CN110135427B (en) * 2019-04-11 2021-07-27 北京百度网讯科技有限公司 Method, apparatus, device and medium for recognizing characters in image
CN110197227B (en) * 2019-05-30 2023-10-27 成都中科艾瑞科技有限公司 Multi-model fusion intelligent instrument reading identification method
CN112101395A (en) * 2019-06-18 2020-12-18 上海高德威智能交通系统有限公司 Image identification method and device
CN110555462A (en) * 2019-08-02 2019-12-10 深圳索信达数据技术有限公司 Non-fixed multi-character verification code recognition method based on convolutional neural network
CN111027562B (en) * 2019-12-06 2023-07-18 中电健康云科技有限公司 Optical character recognition method based on multiscale CNN and RNN combined with attention mechanism
CN113033249A (en) * 2019-12-09 2021-06-25 中兴通讯股份有限公司 Character recognition method, device, terminal and computer storage medium
CN111242113B (en) * 2020-01-08 2022-07-08 重庆邮电大学 Method for recognizing natural scene text in any direction
CN111523539A (en) * 2020-04-15 2020-08-11 北京三快在线科技有限公司 Character detection method and device
CN111553290A (en) * 2020-04-30 2020-08-18 北京市商汤科技开发有限公司 Text recognition method, device, equipment and storage medium
CN113688822A (en) * 2021-09-07 2021-11-23 河南工业大学 Scene image recognition method based on a time-sequence attention mechanism

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654130A (en) * 2015-12-30 2016-06-08 成都数联铭品科技有限公司 Recurrent neural network-based complex image character sequence recognition system
CN106022363A (en) * 2016-05-12 2016-10-12 南京大学 Method for recognizing Chinese characters in natural scene
CN106157319A (en) * 2016-07-28 2016-11-23 哈尔滨工业大学 Saliency detection method fusing region-level and pixel-level features based on convolutional neural networks
CN106650813A (en) * 2016-12-27 2017-05-10 华南理工大学 Image understanding method based on deep residual network and LSTM

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Memory Matters: Convolutional Recurrent Neural Network for Scene Text Recognition; Qiang Guo et al.; https://arxiv.org/abs/1601.01100; 2016-01-06; pp. 1-6
Real-Time Lexicon-Free Scene Text Localization and Recognition; Lukas Neumann et al.; IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE; 2016-09-30; Vol. 38, No. 9; pp. 1872-1885
Large-pattern online handwritten character recognition based on multiple convolutional neural networks; Ge Mingtao et al.; Modern Electronics Technique; 2014-10-15; Vol. 37, No. 20; pp. 19-21, 26

Also Published As

Publication number Publication date
CN107368831A (en) 2017-11-21

Similar Documents

Publication Publication Date Title
CN107368831B (en) Method for recognizing English words and digits in natural scene images
Bheda et al. Using deep convolutional networks for gesture recognition in american sign language
Shivashankara et al. American sign language recognition system: an optimal approach
CN105138998B (en) Pedestrian re-identification method and system based on view-adaptive subspace learning algorithm
Latif et al. An automatic Arabic sign language recognition system based on deep CNN: an assistive system for the deaf and hard of hearing
Hossain et al. Recognition and solution for handwritten equation using convolutional neural network
CN112069900A (en) Bill character recognition method and system based on convolutional neural network
Talukder et al. Real-time bangla sign language detection with sentence and speech generation
Alom et al. Digit recognition in sign language based on convolutional neural network and support vector machine
CN109508640A (en) Crowd sentiment analysis method, apparatus and storage medium
Truong et al. Vietnamese handwritten character recognition using convolutional neural network
Giridharan et al. Identification of Tamil ancient characters and information retrieval from temple epigraphy using image zoning
Aksoy et al. Detection of Turkish sign language using deep learning and image processing methods
Inunganbi et al. Recognition of handwritten Meitei Mayek script based on texture feature
Ismail et al. Static hand gesture recognition of Arabic sign language by using deep CNNs
Rawf et al. A comparative technique using 2D CNN and transfer learning to detect and classify Arabic-script-based sign language
Antony et al. Haar features based handwritten character recognition system for Tulu script
Patel et al. Multiresolution technique to handwritten English character recognition using learning rule and Euclidean distance metric
Jindal et al. Sign Language Detection using Convolutional Neural Network (CNN)
Singh et al. A comprehensive survey on Bangla handwritten numeral recognition
Reddy et al. A three-dimensional neural network model for unconstrained handwritten numeral recognition: a new approach
Srininvas et al. A framework to recognize the sign language system for deaf and dumb using mining techniques
Magrina Convolution Neural Network based Ancient Tamil Character Recognition from Epigraphical Inscriptions
Katti et al. Character and Word Level Gesture Recognition of Indian Sign Language
Nadgeri et al. An Image Texture based approach in understanding and classifying Baby Sign Language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant