CN107368831A - Method for recognizing English words and digits in natural scene images - Google Patents

Method for recognizing English words and digits in natural scene images

Info

Publication number
CN107368831A
CN107368831A (application number CN201710592890.3A)
Authority
CN
China
Prior art keywords
layer
image
character
short
term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710592890.3A
Other languages
Chinese (zh)
Other versions
CN107368831B (en)
Inventor
张军
涂丹
李硕豪
陈旭
雷军
郭强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201710592890.3A priority Critical patent/CN107368831B/en
Publication of CN107368831A publication Critical patent/CN107368831A/en
Application granted granted Critical
Publication of CN107368831B publication Critical patent/CN107368831B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a method for recognizing English words and digits in natural scene images. The problem of recognizing English words and digits in natural scenes is divided into three steps: feature extraction, feature focusing, and feature recognition. A convolutional neural network extracts features from the input image, an attention mechanism focuses on the useful information in the feature sequence, and a long short-term memory (LSTM) network recognizes the feature vectors. By combining the deep neural network with the attention mechanism, the final recognition result is obtained directly when an image is fed into the deep neural network. The invention does not need to slide a window over the input image and recognize the characters inside each window; moreover, the character string it outputs is already the final recognition result, so no merging algorithm is needed to integrate the recognized characters.

Description

Method for recognizing English words and digits in natural scene images
Technical field
The invention belongs to the technical field of character recognition and relates to a method that uses a deep neural network and an attention mechanism to recognize English words and digits in natural scene images.
Background technology
Text in natural scenes often carries important information and can be used to describe the content of an image. Automatically obtaining the text information in an image helps people understand the image more effectively and supports processing such as storage, compression and retrieval of the image. In contrast to natural scene text detection methods, natural scene text recognition methods recognize the text regions that have already been detected. English and digits are used as a universal language and appear widely in scenes all over the world, so recognizing English words and digits is of great significance. However, unlike handwritten digit recognition, the position, size, font, illumination, viewing angle and contour of English words and digits in natural scenes are highly variable, and the background of natural scene text is also very complex, so recognizing English words and digits in natural scenes involves many technical difficulties that need to be overcome.
Existing natural scene text recognition algorithms are generally bottom-up algorithms, see [Neumann L., Matas J., 'Real-time lexicon-free scene text localization and recognition', IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 38(9), pp. 1872-1885]. They first use a sliding-window operation and a traditional classifier to recognize each character of the English words and digits in the image (a window does not necessarily contain a character), and then a merging algorithm is needed to integrate the recognized characters into strings. This approach has two limitations: 1. the accuracy of recognizing characters with a sliding window and a traditional classifier is not high; 2. the character recognizer and the merging algorithm are trained separately, so the errors produced by each of them are passed directly into the final recognition result, which lowers the text recognition accuracy.
Summary of the invention
The object of the invention is to overcome these limitations by combining a deep neural network with an attention mechanism and training and applying the combined neural network as a single model. Without any sliding-window operation, a given image containing English words and digits directly yields the recognition result.
The principle of the invention is as follows. First, a convolutional neural network, widely used in computer vision, extracts a two-dimensional feature matrix from the input image; under the action of the convolutional network, each column of the matrix represents the deep features of the corresponding region of the input image, and the matrix is serialized column by column into a feature sequence. Then an attention mechanism extracts the information in the feature sequence that is relevant to the characters and filters out redundant information, producing a feature vector; the so-called attention mechanism observes things in a focused way that imitates the observation pattern of human vision, filtering out useless information, and is a commonly used model in deep learning. Finally, a long short-term memory (LSTM) network recognizes the English words and digits in the image one by one, following the spatial order from left to right.
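For concreteness, the following is a minimal sketch, assuming PyTorch, of how the three stages compose into one end-to-end model; the class, module and variable names are illustrative placeholders (the individual stages are sketched in more detail under steps (1) to (3) below), not the patent's own implementation.

```python
import torch
import torch.nn as nn

class SceneTextRecognizer(nn.Module):
    # Composes the three stages: CNN features -> attention focusing -> LSTM prediction.
    def __init__(self, cnn, attention, lstm_cell, classifier, T=24):
        super().__init__()
        self.cnn, self.attention = cnn, attention
        self.lstm_cell, self.classifier, self.T = lstm_cell, classifier, T

    def forward(self, images):                        # images: (N, 1, 32, 80) gray-scale
        S = self.cnn(images)                          # feature sequence, shape (N, L, 512)
        N = images.size(0)
        h = torch.zeros(N, self.lstm_cell.hidden_size, device=images.device)
        c = torch.zeros_like(h)
        logits = []
        for _ in range(self.T):                       # one character per time step
            V_t = self.attention(S, h)                # focused feature vector, (N, 512)
            h, c = self.lstm_cell(V_t, (h, c))        # LSTM unit of step (3)
            logits.append(self.classifier(h))         # scores over the 37 character classes
        return torch.stack(logits, dim=1)             # (N, T, 37)

# e.g. lstm_cell = nn.LSTMCell(512, 256); classifier = nn.Linear(256, 37)
```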
The technical scheme of the invention is a method for recognizing English words and digits in natural scene images. The input image is a gray-scale image containing English words and digits. The method combines a deep neural network with an attention mechanism, and trains and applies the combined neural network as a single model. Without any sliding-window operation, a given image containing English words and digits directly yields the recognition result. The method specifically includes the following steps:
Step (1): feature extraction from the input image. The convolutional neural network of the deep neural network extracts features from the input image, and the output of the convolutional neural network is taken as the feature-extraction result. Unlike a traditional convolutional neural network, which outputs a three-dimensional feature matrix, the convolutional neural network designed here outputs a two-dimensional feature matrix. From input to output the network consists of: convolutional layer 1, batch normalization layer 1, pooling layer 1, convolutional layer 2, batch normalization layer 2, pooling layer 2, convolutional layer 3, batch normalization layer 3, convolutional layer 4, batch normalization layer 4, pooling layer 4, convolutional layer 5, batch normalization layer 5, convolutional layer 6, batch normalization layer 6, pooling layer 6, convolutional layer 7, batch normalization layer 7. The parameters of the convolutional layers, in the order (kernel size, number of channels, stride, padding), are: (3, 64, 1, 1), (3, 128, 1, 1), (3, 256, 1, 1), (3, 256, 1, 1), (3, 512, 1, 1), (3, 512, 1, 1) and (2, 512, 1, 0). The batch normalization layers adjust the distribution of the intermediate results and have no parameters to specify. The parameters of the pooling layers, in the order (pooling window, horizontal stride, vertical stride, horizontal padding, vertical padding), are: (2×2, 2, 2, 0, 0), (2×2, 2, 2, 0, 0), (1×2, 1, 2, 0, 0) and (1×2, 1, 2, 0, 0). Before being fed into the convolutional neural network the image is resized to 80×32, so the two-dimensional matrix output by the network has size 512×19. Serializing this two-dimensional feature matrix yields a feature sequence containing 19 vectors of size 1×512, expressed as S = {s_1, s_2, ..., s_L}, where s_i ∈ R^512 (a 1×512 vector), i = 1, 2, ..., L, and L = 19 is the length of the sequence.
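A minimal sketch of this backbone, assuming PyTorch; the layer order and the convolution/pooling parameters follow the text above, while the ReLU activations, the class name and the use of torch.nn are assumptions not stated here.

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        def conv_bn(c_in, c_out, k, s, p):   # convolution + batch normalization (+ assumed ReLU)
            return [nn.Conv2d(c_in, c_out, k, s, p), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        # 1x2 pooling window (width 1, height 2) with strides (1, 2) becomes (H, W) = (2, 1) in PyTorch.
        layers  = conv_bn(1,   64, 3, 1, 1) + [nn.MaxPool2d(2, 2)]             # pooling layer 1: 2x2, stride 2
        layers += conv_bn(64, 128, 3, 1, 1) + [nn.MaxPool2d(2, 2)]             # pooling layer 2: 2x2, stride 2
        layers += conv_bn(128, 256, 3, 1, 1)
        layers += conv_bn(256, 256, 3, 1, 1) + [nn.MaxPool2d((2, 1), (2, 1))]  # pooling layer 4
        layers += conv_bn(256, 512, 3, 1, 1)
        layers += conv_bn(512, 512, 3, 1, 1) + [nn.MaxPool2d((2, 1), (2, 1))]  # pooling layer 6
        layers += conv_bn(512, 512, 2, 1, 0)                                   # convolutional layer 7
        self.cnn = nn.Sequential(*layers)

    def forward(self, x):                      # x: (N, 1, 32, 80) gray-scale images resized to 80x32
        f = self.cnn(x)                        # (N, 512, 1, 19)
        return f.squeeze(2).permute(0, 2, 1)   # feature sequence S: (N, L=19, 512)
```

With this layout the 80×32 input shrinks to a 1×19 spatial grid after pooling layers 1, 2, 4 and 6 and the final 2×2 convolution, which matches the 512×19 feature matrix stated above.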
Step (2): feature focusing with the attention mechanism on the feature sequence S containing 19 vectors of size 1×512; the feature vectors output by the attention mechanism are the result of the focusing. The characters in the image are recognized one by one in spatial order from left to right. Since the maximum character length in the training dataset Synth [Jaderberg M., Simonyan K., Vedaldi A., et al., 'Reading text in the wild with convolutional neural networks', International Journal of Computer Vision, 2016, 116(1), pp. 1-20] is 24, the output of the method is a combination of English words and digits of length 24, so the algorithm performs 24 feature-focusing operations, each regarded as one time step. The final output is the set of feature vectors after the 24 focusing operations, V_f = {V_1, V_2, ..., V_T}, T = 24. The feature vector V_t in the set represents the result of the t-th feature focusing and is expressed as

V_t = Σ_{i=1}^{L} α_{t,i} s_i,

where α_t = (α_{t,1}, α_{t,2}, ..., α_{t,L}), with Σ_{i=1}^{L} α_{t,i} = 1, denotes the attention coefficients of the attention mechanism at the t-th feature focusing. The elements of this coefficient vector are obtained from

α_{t,i} = exp(e_{t,i}) / Σ_{k=1}^{L} exp(e_{t,k}),    e_{t,i} = w^T tanh(W_a h_{t-1} + U_a s_i + b_a),

where h_{t-1} is the hidden state of the LSTM unit at time t-1 in step (3), and w^T, W_a, U_a and b_a are the parameters of the attention model, trained by the back-propagation algorithm based on stochastic gradient descent.
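A minimal sketch of this attention step, assuming PyTorch and the additive form of the score implied by the parameters w^T, W_a, U_a and b_a above; the layer dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, attn_dim=256):
        super().__init__()
        self.W_a = nn.Linear(hidden_dim, attn_dim, bias=False)  # acts on the LSTM state h_{t-1}
        self.U_a = nn.Linear(feat_dim, attn_dim, bias=True)     # acts on s_i; its bias plays the role of b_a
        self.w   = nn.Linear(attn_dim, 1, bias=False)           # the score vector w

    def forward(self, S, h_prev):
        # S: (N, L, 512) feature sequence; h_prev: (N, hidden_dim) LSTM hidden state at time t-1
        e = self.w(torch.tanh(self.W_a(h_prev).unsqueeze(1) + self.U_a(S)))  # scores e_{t,i}: (N, L, 1)
        alpha = torch.softmax(e, dim=1)                                      # attention coefficients alpha_{t,i}
        V_t = (alpha * S).sum(dim=1)                                         # focused feature vector: (N, 512)
        return V_t
```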
Step (3): recognition of the focused feature vectors. The long short-term memory (LSTM) network of the deep neural network recognizes the feature vectors after focusing. In accordance with the assumed maximum character-string length, the LSTM network contains 24 units, and the output of each LSTM unit is one recognized character; each character belongs to one of 37 classes (the 26 English letters, the 10 digits 0-9, and the end marker "-", which indicates the end of the recognized string). The input of the LSTM unit at time t is the feature vector V_t after the t-th feature focusing, and its output is the recognized character class J_t. At each time step the class with the highest probability is chosen as the output of the LSTM unit at that moment:

J_t = argmax_i z_i,    z_i = softmax(h_t),

where h_t is the hidden state of the LSTM unit at time t (explained with Fig. 3). After recognition the output of the whole network is the combination of the 24 characters, and the character string before the end marker is taken as the final recognition result.
The input of step (1) is an image containing English words and digits and its output is the feature sequence; step (2) computes from this feature sequence the feature vectors required as the input of step (3); and step (3) finally outputs the recognized character string. After the three steps are integrated into a single framework, the parameters of the whole model need to be trained. Let X = {I_i, L_i} be the training dataset, where I_i is the i-th image and L_i its label, i.e. the ground-truth character string in the image. The objective function of the training process can then be expressed as

W* = argmin_W Σ_i −log p(J = L_i | I_i),

where W denotes the parameters of the whole model (the parameters of the convolutional neural network, the attention mechanism and the LSTM network) and W* their optimal values. J = {J_1, ..., J_T} is the character string recognized by the model, a string of 24 characters; the probability that the whole string is recognized correctly equals the product of the probabilities that each character in the string is recognized correctly, so −log p(J = L_i | I_i) can be expressed as

−log p(J = L_i | I_i) = Σ_{t=1}^{T} −log p(J_t = L_{i,t} | I_i),

where L_{i,t} is the t-th character of the label of the i-th image. The objective function can therefore be expressed as

W* = argmin_W Σ_i Σ_{t=1}^{T} −log p(J_t = L_{i,t} | I_i).
Once the objective function has been formulated, the network parameters W are trained by the back-propagation algorithm based on stochastic gradient descent; see [Shi B., Bai X., Yao C., 'An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition', arXiv preprint arXiv:1507.05717, 2015].
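A minimal training-step sketch, assuming PyTorch: the cross-entropy loss summed over the 24 output positions corresponds to Σ_t −log p(J_t = L_{i,t} | I_i) in the objective above, minimized by back-propagation with stochastic gradient descent. The model, the optimizer and the label encoding (class indices padded with the end marker) are assumed placeholders.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                     # -log p of the ground-truth class

def train_step(model, optimizer, images, labels):
    # images: (N, 1, 32, 80); labels: (N, 24) integer class indices, padded with the end marker
    logits = model(images)                            # (N, 24, 37)
    loss = criterion(logits.reshape(-1, 37), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                   # back-propagation
    optimizer.step()                                  # stochastic gradient descent update
    return loss.item()

# e.g. optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```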
If the input image is a color image, it is converted to gray-scale before the above steps are performed.
Compared with the prior art, the beneficial effects of the invention are as follows:
The invention combines a deep neural network with an attention mechanism, so that the final recognition result is obtained directly when an image is fed into the deep neural network. The invention therefore does not need to slide a window over the input image and recognize the characters in each window. At the same time, the character string output by the invention is already the final recognition result, so no merging algorithm is needed to integrate the recognized characters.
Brief description of the drawings
To describe the embodiments of the invention or the technical solutions of the prior art more clearly, the accompanying drawings required for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is the overall flow chart of the invention;
Fig. 2 is the design diagram of the convolutional neural network of the invention;
Fig. 3 is the internal structure of a long short-term memory (LSTM) unit;
Fig. 4 is a first example of English words and digits recognized by the invention;
Fig. 5 is a second example of English words and digits recognized by the invention.
Embodiment
The technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
The overall flow chart of the method of the invention for recognizing English words and digits in natural scene images is shown in Fig. 1; the problem of recognizing English words and digits in natural scenes is divided into three steps: feature extraction, feature focusing and feature recognition.
Step (1): feature extraction. A convolutional neural network extracts features from the input image, which is a natural scene image containing English characters and digits and is resized to 80×32 before being fed into the network. As shown in Fig. 2, unlike a traditional convolutional neural network, which can only output a three-dimensional feature matrix, the convolutional neural network designed here outputs a two-dimensional feature matrix. As illustrated, from top to bottom the network consists of: convolutional layer 1, batch normalization layer 1, pooling layer 1, convolutional layer 2, batch normalization layer 2, pooling layer 2, convolutional layer 3, batch normalization layer 3, convolutional layer 4, batch normalization layer 4, pooling layer 4, convolutional layer 5, batch normalization layer 5, convolutional layer 6, batch normalization layer 6, pooling layer 6, convolutional layer 7, batch normalization layer 7. The parameters of the convolutional layers, in the order (kernel size, number of channels, stride, padding), are: (3, 64, 1, 1), (3, 128, 1, 1), (3, 256, 1, 1), (3, 256, 1, 1), (3, 512, 1, 1), (3, 512, 1, 1) and (2, 512, 1, 0). The batch normalization layers adjust the distribution of the intermediate results and have no parameters to specify. The parameters of the pooling layers, in the order (pooling window, horizontal stride, vertical stride, horizontal padding, vertical padding), are: (2×2, 2, 2, 0, 0), (2×2, 2, 2, 0, 0), (1×2, 1, 2, 0, 0) and (1×2, 1, 2, 0, 0). The output is a two-dimensional 512×19 feature matrix; serializing it column by column yields a feature sequence containing 19 vectors of size 1×512, expressed as S = {s_1, s_2, ..., s_L}, where s_i ∈ R^512, i = 1, 2, ..., L, and L = 19 is the length of the sequence.
Step (2): feature focusing. The attention mechanism focuses on the useful information in the feature sequence. The input is the feature sequence of 19 vectors of size 1×512 obtained in the feature-extraction stage, and the output is a feature vector. The algorithm recognizes the characters in the image one by one in spatial order from left to right; with the maximum character-string length in an image set to 24, the algorithm performs T = 24 feature-focusing operations, and the final output is the set of feature vectors after the 24 focusing operations, V_f = {V_1, V_2, ..., V_T}. The feature vector V_t, representing the result of the t-th feature focusing, is expressed as

V_t = Σ_{i=1}^{L} α_{t,i} s_i,

where α_t = (α_{t,1}, ..., α_{t,L}), with Σ_{i=1}^{L} α_{t,i} = 1, represents the attention coefficients of the attention mechanism at the t-th feature focusing. The elements of this coefficient vector are obtained from

α_{t,i} = exp(e_{t,i}) / Σ_{k=1}^{L} exp(e_{t,k}),    e_{t,i} = w^T tanh(W_a h_{t-1} + U_a s_i + b_a),

where w^T, W_a, U_a and b_a are the parameters of the attention model, trained by the back-propagation algorithm based on stochastic gradient descent, and h_{t-1} is the hidden state of the LSTM unit at time t-1 in step (3), explained with Fig. 3.
Fig. 3 shows the internal structure of a long short-term memory (LSTM) unit. The LSTM network is an improved kind of recurrent neural network that uses gate operations to limit the vanishing-gradient problem from which conventional recurrent neural networks suffer during training. As shown in the figure, the LSTM unit at time t consists of a memory cell c_t and three gates i_t, o_t, f_t. Here i_t is the input gate, which controls how much of the current input enters the unit; o_t is the output gate, which controls how much information the unit outputs at this moment; and f_t is the forget gate, which controls how much of the previous moment's unit output is retained at the current moment. The computation is as follows:
i_t = σ(W_ix V_t + W_im h_{t-1} + b_i)
f_t = σ(W_fx V_t + W_fm h_{t-1} + b_f)
o_t = σ(W_ox V_t + W_om h_{t-1} + b_o)
g_t = tanh(W_gx V_t + W_gm h_{t-1} + b_g)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
where h_t is the hidden state of the LSTM unit at time t, g_t is the candidate memory, σ denotes the sigmoid function, and ⊙ denotes element-wise multiplication. W_ix, W_im, W_fx, W_fm, W_ox, W_om, W_gx, W_gm, b_i, b_f, b_o and b_g are the parameters of the LSTM unit; since the parameters of all units in the LSTM network are shared, they can also be regarded as the parameters of the LSTM network, and in the training stage the invention trains them by the back-propagation algorithm based on stochastic gradient descent.
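A minimal sketch of the LSTM unit defined by the equations above, assuming PyTorch; for brevity the per-gate weights W_ix ... W_gm and biases b_i ... b_g are stacked into one linear layer, which is equivalent, and torch.nn.LSTMCell implements the same update.

```python
import torch
import torch.nn as nn

class LSTMUnit(nn.Module):
    def __init__(self, in_dim=512, hidden_dim=256):
        super().__init__()
        self.hidden_size = hidden_dim
        self.gates = nn.Linear(in_dim + hidden_dim, 4 * hidden_dim)  # stacked W_ix..W_gm, b_i..b_g

    def forward(self, V_t, state):
        h_prev, c_prev = state
        z = self.gates(torch.cat([V_t, h_prev], dim=1))
        i, f, o, g = z.chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # input, forget, output gates
        g = torch.tanh(g)                                               # candidate memory g_t
        c = f * c_prev + i * g                                          # c_t = f_t * c_{t-1} + i_t * g_t
        h = o * torch.tanh(c)                                           # h_t = o_t * tanh(c_t)
        return h, c
```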
Step (3): character recognition. The LSTM network recognizes the feature vectors: the input is the 24 focused feature vectors and the output is a character string of length 24. In the invention the LSTM network contains 24 LSTM units, i.e. the recognition of the whole character string takes 24 time steps. The input of the LSTM unit at time t is the feature vector V_t after the t-th feature focusing, and its output is the recognized character class J_t, which has 37 classes (the 26 English letters, the 10 digits 0-9 and the end marker "-"). At each time step the class with the highest probability is chosen as the output of the LSTM unit at that moment:

J_t = argmax_i z_i,    z_i = softmax(h_t),

where h_t is the hidden state of the LSTM unit at time t. As shown in Fig. 1, after recognition the output of the whole network is the combination of the 24 characters, e.g. 'a' 'd' 'o' 'n' 'i' 's' '-' '-' '-' ..., and the final recognition result is 'adonis'.
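A small sketch of this greedy read-out, assuming PyTorch: at each of the 24 time steps the class with the highest probability is taken, and only the characters before the first end marker '-' are kept. The alphabet string, class ordering and function name are illustrative.

```python
import torch

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"    # 26 letters, 10 digits, end marker

def decode(logits):                                   # logits: (T=24, 37) for one image
    classes = torch.softmax(logits, dim=1).argmax(dim=1)   # J_t = argmax_i z_i at every step
    text = "".join(ALPHABET[j] for j in classes.tolist())
    return text.split("-", 1)[0]                      # e.g. 'adonis------------------' -> 'adonis'
```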
Fig. 4 shows a first example of English words and digits correctly recognized by the invention; the ground truth and the prediction are both 'brutalities'. It can be seen that the invention can recognize images with considerable character deformation and is therefore fairly robust.
Fig. 5 shows a second example: the ground truth is 'recapitaliozes', the prediction is 'regapitaliozes', and the third letter is recognized incorrectly. The noise in this image is heavy, and the incorrectly recognized character is one that even the human eye can hardly distinguish.

Claims (3)

1. A method for recognizing English words and digits in natural scene images, comprising the following steps:
Step (1): performing feature extraction on the input image with the convolutional neural network of the deep neural network, the output of the convolutional neural network being the feature-extraction result; the convolutional neural network consists, from input to output, of: convolutional layer 1, batch normalization layer 1, pooling layer 1, convolutional layer 2, batch normalization layer 2, pooling layer 2, convolutional layer 3, batch normalization layer 3, convolutional layer 4, batch normalization layer 4, pooling layer 4, convolutional layer 5, batch normalization layer 5, convolutional layer 6, batch normalization layer 6, pooling layer 6, convolutional layer 7, batch normalization layer 7; the parameters of convolutional layers 1 to 7, in the order (kernel size, number of channels, stride, padding), are: (3, 64, 1, 1), (3, 128, 1, 1), (3, 256, 1, 1), (3, 256, 1, 1), (3, 512, 1, 1), (3, 512, 1, 1) and (2, 512, 1, 0); batch normalization layers 1 to 7 adjust the distribution of the intermediate results and have no parameters to specify; the parameters of pooling layers 1, 2, 4 and 6, in the order (pooling window, horizontal stride, vertical stride, horizontal padding, vertical padding), are: (2×2, 2, 2, 0, 0), (2×2, 2, 2, 0, 0), (1×2, 1, 2, 0, 0) and (1×2, 1, 2, 0, 0); before being fed into the convolutional neural network the image is resized to a resolution of 80×32, and the output of the convolutional neural network is a two-dimensional feature matrix of size 512×19; serializing this two-dimensional feature matrix yields a feature sequence containing 19 vectors of size 1×512, expressed as S = {s_1, s_2, ..., s_L}, where s_i ∈ R^512, i = 1, 2, ..., L, and L = 19 is the length of the sequence;
Step (2): performing feature focusing on the feature sequence S containing 19 vectors of size 1×512 with an attention mechanism: recognizing the characters in the image one by one in spatial order from left to right, setting the maximum character length of the training dataset to 24, and performing 24 feature-focusing operations on the feature sequence S, each focusing operation being one time step; outputting the set of feature vectors V_f = {V_1, V_2, ..., V_T}, T = 24, where the feature vector V_t, t ∈ {1, 2, ..., T}, represents the result of the t-th feature focusing, V_t = Σ_{i=1}^{L} α_{t,i} s_i, and α_t = (α_{t,1}, ..., α_{t,L}), with Σ_{i=1}^{L} α_{t,i} = 1, represents the attention coefficients of the attention mechanism at the t-th feature focusing, with α_{t,i} = exp(e_{t,i}) / Σ_{k=1}^{L} exp(e_{t,k}) and e_{t,i} = w^T tanh(W_a h_{t-1} + U_a s_i + b_a), where h_{t-1} is the hidden state of the LSTM unit at time t-1 in step (3), and w^T, W_a, U_a and b_a are the parameters of the attention model, trained by the back-propagation algorithm based on stochastic gradient descent;
Step (3): recognizing the focused feature vectors with the long short-term memory (LSTM) network of the deep neural network: the LSTM network contains 24 units, the input of the LSTM unit at time t is the feature vector V_t after the t-th feature focusing, and its output is the recognized character class J_t; at each time step the character class with the highest probability is chosen as the output of the LSTM unit at that moment, according to J_t = argmax_i z_i, where z_i = softmax(h_t) and h_t is the hidden state of the LSTM unit at time t; after recognition the output of the whole network is the combination of the 24 characters, and the character string before the end marker is taken as the final recognition result; J_t has 37 classes, namely the 26 English letters, the 10 digits 0-9 and the end marker "-", the end marker indicating the end of the recognized character string.
2. The method as claimed in claim 1, characterized in that the parameters of the method are trained as follows: let X = {I_i, L_i} be the training dataset, where I_i is the i-th image and L_i the ground-truth character string of the i-th image; the objective function of the training process is W* = argmin_W Σ_i Σ_{t=1}^{T} −log p(J_t = L_{i,t} | I_i), where W represents the parameters of the convolutional neural network, the attention mechanism and the LSTM network, W* their optimal values, and L_{i,t} the t-th character of the label of the i-th image; the parameters W are trained by the back-propagation algorithm based on stochastic gradient descent.
3. The method as claimed in claim 1, characterized in that the input image is a gray-scale image.
CN201710592890.3A 2017-07-19 2017-07-19 Method for recognizing English words and digits in natural scene images Active CN107368831B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710592890.3A CN107368831B (en) 2017-07-19 2017-07-19 Method for recognizing English words and digits in natural scene images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710592890.3A CN107368831B (en) 2017-07-19 2017-07-19 Method for recognizing English words and digits in natural scene images

Publications (2)

Publication Number Publication Date
CN107368831A true CN107368831A (en) 2017-11-21
CN107368831B CN107368831B (en) 2019-08-02

Family

ID=60308319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710592890.3A Active CN107368831B (en) Method for recognizing English words and digits in natural scene images

Country Status (1)

Country Link
CN (1) CN107368831B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654130A (en) * 2015-12-30 2016-06-08 成都数联铭品科技有限公司 Recurrent neural network-based complex image character sequence recognition system
CN106022363A (en) * 2016-05-12 2016-10-12 南京大学 Method for recognizing Chinese characters in natural scene
CN106157319A (en) * 2016-07-28 2016-11-23 哈尔滨工业大学 The significance detection method that region based on convolutional neural networks and Pixel-level merge
CN106650813A (en) * 2016-12-27 2017-05-10 华南理工大学 Image understanding method based on depth residual error network and LSTM

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LUKAS NEUMANN ET AL: "Real-Time Lexicon-Free Scene Text Localization and Recognition", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE》 *
QIANG GUO ET AL: "Memory Matters: Convolutional Recurrent Neural Network for Scene Text Recognition", 《HTTPS://ARXIV.ORG/ABS/1601.01100》 *
葛明涛 等: "基于多重卷积神经网络的大模式联机手写文字识别", 《现代电子技术》 *

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229469A (en) * 2017-11-22 2018-06-29 北京市商汤科技开发有限公司 Recognition methods, device, storage medium, program product and the electronic equipment of word
CN108154136B (en) * 2018-01-15 2022-04-05 众安信息技术服务有限公司 Method, apparatus and computer readable medium for recognizing handwriting
CN108154136A (en) * 2018-01-15 2018-06-12 众安信息技术服务有限公司 For identifying the method, apparatus of writing and computer-readable medium
CN110321755A (en) * 2018-03-28 2019-10-11 中移(苏州)软件技术有限公司 A kind of recognition methods and device
CN110555433B (en) * 2018-05-30 2024-04-26 北京三星通信技术研究有限公司 Image processing method, device, electronic equipment and computer readable storage medium
CN110555433A (en) * 2018-05-30 2019-12-10 北京三星通信技术研究有限公司 Image processing method, image processing device, electronic equipment and computer readable storage medium
CN110659641B (en) * 2018-06-28 2023-05-26 杭州海康威视数字技术股份有限公司 Text recognition method and device and electronic equipment
CN110659641A (en) * 2018-06-28 2020-01-07 杭州海康威视数字技术股份有限公司 Character recognition method and device and electronic equipment
CN109242140A (en) * 2018-07-24 2019-01-18 浙江工业大学 A kind of traffic flow forecasting method based on LSTM_Attention network
CN109117846B (en) * 2018-08-22 2021-11-16 北京旷视科技有限公司 Image processing method and device, electronic equipment and computer readable medium
CN109117846A (en) * 2018-08-22 2019-01-01 北京旷视科技有限公司 A kind of image processing method, device, electronic equipment and computer-readable medium
CN111027555B (en) * 2018-10-09 2023-09-26 杭州海康威视数字技术股份有限公司 License plate recognition method and device and electronic equipment
CN111027555A (en) * 2018-10-09 2020-04-17 杭州海康威视数字技术股份有限公司 License plate recognition method and device and electronic equipment
CN109522600B (en) * 2018-10-16 2020-10-16 浙江大学 Complex equipment residual service life prediction method based on combined deep neural network
CN109446187B (en) * 2018-10-16 2021-01-15 浙江大学 Method for monitoring health state of complex equipment based on attention mechanism and neural network
CN109446187A (en) * 2018-10-16 2019-03-08 浙江大学 Complex equipment health status monitoring method based on attention mechanism and neural network
CN109522600A (en) * 2018-10-16 2019-03-26 浙江大学 Complex equipment remaining life prediction technique based on combined depth neural network
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
CN109389091B (en) * 2018-10-22 2022-05-03 重庆邮电大学 Character recognition system and method based on combination of neural network and attention mechanism
CN109726712A (en) * 2018-11-13 2019-05-07 平安科技(深圳)有限公司 Character recognition method, device and storage medium, server
CN111222589A (en) * 2018-11-27 2020-06-02 中国移动通信集团辽宁有限公司 Image text recognition method, device, equipment and computer storage medium
CN111222589B (en) * 2018-11-27 2023-07-18 中国移动通信集团辽宁有限公司 Image text recognition method, device, equipment and computer storage medium
CN111352827A (en) * 2018-12-24 2020-06-30 中移信息技术有限公司 Automatic testing method and device
CN109858420A (en) * 2019-01-24 2019-06-07 国信电子票据平台信息服务有限公司 A kind of bill processing system and processing method
CN109992686A (en) * 2019-02-24 2019-07-09 复旦大学 Based on multi-angle from the image-text retrieval system and method for attention mechanism
CN109977969A (en) * 2019-03-27 2019-07-05 北京经纬恒润科技有限公司 A kind of image-recognizing method and device
CN110135427B (en) * 2019-04-11 2021-07-27 北京百度网讯科技有限公司 Method, apparatus, device and medium for recognizing characters in image
CN110135427A (en) * 2019-04-11 2019-08-16 北京百度网讯科技有限公司 The method, apparatus, equipment and medium of character in image for identification
CN110197227B (en) * 2019-05-30 2023-10-27 成都中科艾瑞科技有限公司 Multi-model fusion intelligent instrument reading identification method
CN110197227A (en) * 2019-05-30 2019-09-03 成都中科艾瑞科技有限公司 A kind of meter reading intelligent identification Method of multi-model fusion
CN112101395A (en) * 2019-06-18 2020-12-18 上海高德威智能交通系统有限公司 Image identification method and device
CN110555462A (en) * 2019-08-02 2019-12-10 深圳索信达数据技术有限公司 non-fixed multi-character verification code identification method based on convolutional neural network
CN111027562A (en) * 2019-12-06 2020-04-17 中电健康云科技有限公司 Optical character recognition method based on multi-scale CNN and RNN combined with attention mechanism
WO2021115159A1 (en) * 2019-12-09 2021-06-17 中兴通讯股份有限公司 Character recognition network model training method, character recognition method, apparatuses, terminal, and computer storage medium therefor
CN111242113B (en) * 2020-01-08 2022-07-08 重庆邮电大学 Method for recognizing natural scene text in any direction
CN111242113A (en) * 2020-01-08 2020-06-05 重庆邮电大学 Method for recognizing natural scene text in any direction
CN111523539A (en) * 2020-04-15 2020-08-11 北京三快在线科技有限公司 Character detection method and device
CN111553290A (en) * 2020-04-30 2020-08-18 北京市商汤科技开发有限公司 Text recognition method, device, equipment and storage medium
CN113688822A (en) * 2021-09-07 2021-11-23 河南工业大学 Time sequence attention mechanism scene image identification method

Also Published As

Publication number Publication date
CN107368831B (en) 2019-08-02

Similar Documents

Publication Publication Date Title
CN107368831B (en) Method for recognizing English words and digits in natural scene images
CN111723585B (en) Style-controllable image text real-time translation and conversion method
CN109948714B (en) Chinese scene text line identification method based on residual convolution and recurrent neural network
CN107862261A (en) Image people counting method based on multiple dimensioned convolutional neural networks
CN103605972B (en) Non-restricted environment face verification method based on block depth neural network
CN110414498B (en) Natural scene text recognition method based on cross attention mechanism
CN110929665B (en) Natural scene curve text detection method
CN107480726A (en) A kind of Scene Semantics dividing method based on full convolution and shot and long term mnemon
CN110533737A (en) The method generated based on structure guidance Chinese character style
CN108345850A (en) The scene text detection method of the territorial classification of stroke feature transformation and deep learning based on super-pixel
CN108681735A (en) Optical character recognition method based on convolutional neural networks deep learning model
CN111985525B (en) Text recognition method based on multi-mode information fusion processing
CN112069900A (en) Bill character recognition method and system based on convolutional neural network
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
Hossain et al. Recognition and solution for handwritten equation using convolutional neural network
CN107818299A (en) Face recognition algorithms based on fusion HOG features and depth belief network
Talukder et al. Real-time bangla sign language detection with sentence and speech generation
CN109360179A (en) A kind of image interfusion method, device and readable storage medium storing program for executing
CN110263174A (en) - subject categories the analysis method based on focus
CN109508640A (en) A kind of crowd's sentiment analysis method, apparatus and storage medium
Truong et al. Vietnamese handwritten character recognition using convolutional neural network
CN115205521A (en) Kitchen waste detection method based on neural network
Aksoy et al. Detection of Turkish sign language using deep learning and image processing methods
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
Singh et al. A comprehensive survey on Bangla handwritten numeral recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant