CN110399845A - Method for detecting and recognizing continuous paragraph text in images - Google Patents

Method for detecting and recognizing continuous paragraph text in images Download PDF

Info

Publication number
CN110399845A
CN110399845A (application CN201910688854.6A)
Authority
CN
China
Prior art keywords
text
moment
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910688854.6A
Other languages
Chinese (zh)
Inventor
刘晋
龚沛朱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University
Priority to CN201910688854.6A priority Critical patent/CN110399845A/en
Publication of CN110399845A publication Critical patent/CN110399845A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/146Aligning or centring of the image pick-up or image-field
    • G06V30/1475Inclination or skew detection or correction of characters or of image to be recognised
    • G06V30/1478Inclination or skew detection or correction of characters or of image to be recognised of characters or characters lines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition

Abstract

The present invention discloses a method for detecting and recognizing continuous paragraph text in images based on the fusion of SegLink and an Attention-based CRNN. The method belongs to the field of optical character recognition (OCR) and addresses several problems in OCR-based document digitization: low detection accuracy for text, especially tilted text, difficult localization, difficult character segmentation, and low recognition accuracy. A SegLink+CRNN model is built on the TensorFlow deep learning framework: text lines in an image are detected by a SegLink network; paragraph text is split into rows; single-line text features are extracted by a densely connected convolutional neural network; the contextual sequence information of the text is processed by a bidirectional recurrent neural network; and a CTC decoding algorithm avoids per-character segmentation, eliminating the influence of the character-segmentation step on recognition accuracy. An Attention mechanism is further incorporated into the CTC transcription to exploit the sequential characteristics of text and improve recognition accuracy. The method is suitable for both printed and handwritten text, and can be applied to multilingual text recognition, such as English and Chinese.

Description

Method for detecting and recognizing continuous paragraph text in images
Technical field
The invention belongs to the fields of computer vision, object detection and optical character recognition, relates to the detection and recognition of text in information documents, and in particular to a method for detecting and recognizing continuous paragraph text in images based on the fusion of SegLink and an Attention-based CRNN.
Background technique
Text detection and recognition in natural scenes is one of the most active areas of current computer vision. It comprises two subtasks: text detection and text recognition.
Most current text detection methods follow a bottom-up pipeline: they start from low-level features such as individual characters and strokes, then perform non-text filtering, text-line construction and text-line verification. The accuracy of such methods depends heavily on the character-detection results, and errors in those results accumulate throughout the bottom-up process; their reliability is therefore poor and their structure extremely complex.
In object detection, deep convolutional neural networks are widely used. The Faster R-CNN architecture, for example, introduced the Region Proposal Network (RPN), which generates high-quality class candidate boxes directly from the convolutional feature maps. However, because a text line is formed by stitching multiple characters together and has no definite closed boundary, problems such as overlapping candidate boxes or missed detections arise. Similarly, the CTPN architecture fixes the horizontal direction and uses vertical anchors to regress the positions of text lines in an image, but it struggles with the detection task for tilted text images.
The text recognition task resembles a multi-class classification problem, and the most common approach is to classify text with deep convolutional and recurrent neural networks, one word (or character) representing one class. This structure is adequate for recognition tasks with few classes — English text recognition, for example, covers 79 classes in total counting upper and lower case letters and common punctuation marks — but performs poorly on large-category tasks such as Chinese text recognition (3,755 commonly used characters), because the larger number of classes forces a deeper network, which in turn causes vanishing/exploding gradients and network degradation.
Summary of the invention
The invention proposes a method for detecting and recognizing continuous paragraph text in images based on the fusion of SegLink and an Attention-based CRNN. The main idea of the method is: detect text-line positions with SegLink and rectify them; feed the text lines into a convolutional neural network for feature extraction; input the resulting feature sequences into a bidirectional long short-term memory recurrent network, which completes the mapping from feature sequences to character sequences; then perform CTC transcription on the character sequences to obtain the final recognition result. This solves the problems in OCR document digitization of low detection accuracy for (especially tilted) text, difficult localization, difficult character segmentation, and low recognition accuracy.
In order to achieve the above object, the invention is realized by the following technical scheme:
A method for detecting and recognizing continuous paragraph text in images based on the fusion of SegLink and an Attention-based CRNN, comprising the steps of:
S1. producing a continuous-text image dataset and dividing it into a training set, a validation set and a test set;
S2. building a SegLink network model under the TensorFlow deep learning framework, detecting text lines of different sizes and aspect ratios by generating feature maps of different scales, and rectifying tilted text lines;
S3. defining the loss function: the weighted sum of the Segment confidence function, the Link confidence function and the predicted-position error function gives the overall loss function used to optimize the model;
S4. training the SegLink network model of step S2 with the training set of step S1 to obtain the final text detection model, and testing it with the test set;
S5. building a CRNN model: the densely connected network DenseNet extracts picture features and outputs feature maps, and a bidirectional long short-term memory network BLSTM, combining the contextual information of the continuous text, predicts each frame of the feature sequence produced by the DenseNet convolutions;
S6. decoding the sequences predicted in step S5 with an Attention-based CTC transcription method suitable for recognizing text sequences of arbitrary length, to obtain the target text;
S7. training the CRNN model of step S5 with the training set of step S1 to obtain the final text recognition model, and feeding the test set into the CRNN model to obtain test results;
S8. training the whole text detection and recognition network to complete the detection and recognition of continuous paragraph text.
Preferably, step S1 further includes: generating single-character dataset pictures of different font styles with a CycleGAN deep learning model, stitching the single-character pictures into semantically meaningful text lines or text paragraphs, and adding noise. The dataset is a foreign-language dataset or a Chinese dataset, and its font style is printed or handwritten.
Preferably, step S2 further includes:
taking the pre-trained VGG16 convolutional neural network as the backbone, replacing its fully connected layers with convolutional layers, and successively halving the convolution size to generate multi-scale feature maps. The network divides the input text image into two parts: slices (Segments) and links (Links). A Segment frames one part of a text line and expresses its location information; a text line contains multiple Segments, and adjacent Segments are connected by Links.
A 3*3 sliding window over the feature map of each scale generates multiple predicted Segments and Links; fusion rules then merge the Segment and Link information across scales and reject redundancy, yielding the final predicted text line.
Preferably, the location information of a Segment is expressed by five parameters: coordinates (x, y), width w, height h and tilt angle θ.
The generated multi-scale feature maps contain six sizes, respectively [64, 32, 16, 8, 4, 2]. The parameters (x, y, w, h, θ) of a Segment are updated from an anchor point (x_a, y_a) as follows:
x_s = x_a + α·Δx_s;
y_s = y_a + α·Δy_s;
w_s = α·exp(Δw_s);
h_s = α·exp(Δh_s);
θ_s = Δθ_s;
wherein x_s, y_s respectively denote the updated horizontal and vertical coordinates of the anchor point; Δx_s, Δy_s denote the predicted horizontal and vertical offsets of the anchor point; w_s, h_s respectively denote the width and height of the anchor box; Δw_s, Δh_s respectively denote the predicted width and height offsets of the anchor box; θ_s, Δθ_s respectively denote the rotation angle of the anchor box and its offset; w_I, h_I respectively denote the width and height of the original image; w_f, h_f respectively denote the width and height of the feature map; α = λ·w_I/w_f expresses the size of the receptive field; λ is a weight coefficient.
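As an illustration, the update rule above can be sketched in a few lines of Python. This is a minimal sketch under the stated definitions; `decode_segment`, the anchor values and λ = 1.5 are invented for the example and are not taken from the patent.

```python
import math

def decode_segment(anchor_cx, anchor_cy, offsets, w_img, w_feat, lam=1.5):
    """Decode one predicted Segment from an anchor point and regressed offsets.

    The constant alpha approximates the receptive-field size of the feature
    map: alpha = lam * w_img / w_feat (lam is a weight coefficient).
    """
    dx, dy, dw, dh, dtheta = offsets
    alpha = lam * w_img / w_feat
    xs = anchor_cx + alpha * dx          # decoded centre x
    ys = anchor_cy + alpha * dy          # decoded centre y
    ws = alpha * math.exp(dw)            # decoded width
    hs = alpha * math.exp(dh)            # decoded height
    theta = dtheta                       # rotation angle is predicted directly
    return xs, ys, ws, hs, theta

# zero offsets reproduce the default box of size alpha centred on the anchor
seg = decode_segment(100.0, 40.0, (0.0, 0.0, 0.0, 0.0, 0.0), w_img=512, w_feat=64)
```

With zero offsets the decoded Segment is simply the default box centred on the anchor, which gives a quick sanity check for the decoding.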
The fusion method obtains a straight line L by least-squares linear regression such that the distance from the centre coordinates of all Segments to L is minimal, projects the centre coordinates onto L, and takes the two farthest projections, denoted (x_m, y_m) and (x_n, y_n). Adding half of the widths w_m, w_n of the Segments containing those two points, and taking as height h the mean of all anchor-box heights, gives:
x = (x_m + x_n)/2;
y = (y_m + y_n)/2;
w = √((x_m − x_n)² + (y_m − y_n)²) + (w_m + w_n)/2;
h = (1/N)·Σ_{i=1}^{N} h_i;
The finally obtained text-box coordinates are (x, y, w, h, θ), θ being the inclination of the line L; wherein N is the number of anchor boxes and h_i is the height of the i-th anchor box.
Preferably, in step S2, the rectification process further includes: detecting the straight line along the text in the image with the Hough line transform, computing the inclination angle of the line, and then rotating the image according to that angle to correct it.
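The rotational correction itself reduces to computing the baseline inclination and rotating coordinates by its negative. A minimal sketch under that assumption — the Hough transform that would supply the two baseline points is omitted, and the function names are illustrative:

```python
import math

def deskew_angle(x1, y1, x2, y2):
    """Inclination (radians) of a detected baseline through two points."""
    return math.atan2(y2 - y1, x2 - x1)

def rotate_point(x, y, cx, cy, angle):
    """Rotate (x, y) about (cx, cy) by -angle to undo the detected skew."""
    c, s = math.cos(-angle), math.sin(-angle)
    dx, dy = x - cx, y - cy
    return cx + c * dx - s * dy, cy + s * dx + c * dy

angle = deskew_angle(0, 0, 100, 100)      # a 45-degree tilted baseline
x, y = rotate_point(100, 100, 0, 0, angle)
```

After the rotation the baseline endpoint lands back on the horizontal axis, which is exactly the effect of the rotational correction described above.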
Preferably, in step S3, the loss function is as follows:
L(y_s, c_s, y_l, c_l, ŝ, s) = (1/N_s)·L_conf(y_s, c_s) + λ₁·(1/N_s)·L_loc(ŝ, s) + λ₂·(1/N_l)·L_conf(y_l, c_l);
wherein L_conf(y_s, c_s) denotes the Segment confidence function; L_conf(y_l, c_l) denotes the Link confidence function; L_loc(ŝ, s) denotes the predicted-position error function; λ₁, λ₂ denote weight coefficients; N_s, N_l respectively denote the number of Segments and the number of Links; y_s, y_l respectively denote the labels of Segments and Links; c_s, c_l respectively denote the predicted values of Segments and Links; ŝ, s respectively denote the predicted geometry of a segment and its ground-truth value.
Preferably, in step S5, the network DenseNet comprises several dense blocks (Dense Block) and transition blocks (Transition Block); Dense Blocks are connected by Transition Blocks. A Dense Block uses Batch Normalization + ReLU + 3*3 convolution as its composite function, and any two convolutional layers within a Dense Block are connected. A Transition Block is composed of a bottleneck layer and a pooling layer.
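The dense connectivity pattern — each layer receiving the concatenation of all earlier feature maps — can be illustrated with a toy NumPy sketch. A random 1×1 projection stands in for the BN+ReLU+3*3 composite function; names and sizes are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_layer(x, out_ch):
    """Stand-in for the BN+ReLU+3x3 composite: a random projection with ReLU."""
    w = rng.standard_normal((x.shape[-1], out_ch))
    return np.maximum(x @ w, 0.0)

def dense_block(x, num_layers, growth_rate):
    """Each layer sees the concatenation of the block input and every earlier
    layer's output, so feature maps are reused rather than recomputed."""
    features = [x]
    for _ in range(num_layers):
        out = conv_layer(np.concatenate(features, axis=-1), growth_rate)
        features.append(out)
    return np.concatenate(features, axis=-1)

x = rng.standard_normal((4, 16))          # 4 positions, 16 input channels
y = dense_block(x, num_layers=3, growth_rate=8)
# channels grow by growth_rate per layer: 16 + 3*8 = 40
```

The channel count grows linearly with depth (here 16 + 3·8 = 40), which is why the Transition Block's bottleneck layer is needed to compress the feature-map dimensionality between blocks.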
Preferably, in step S5, a BLSTM network processes the contextual information of the continuous text sequence in both directions. Compared with a plain recurrent network, the LSTM adds three gates, respectively an update gate, a forget gate and an output gate. The formulas are:
c̃^<t> = tanh(W_c[a^<t−1>, x^<t>] + b_c);
Γ_u = σ(W_u[a^<t−1>, x^<t>] + b_u);
Γ_f = σ(W_f[a^<t−1>, x^<t>] + b_f);
Γ_o = σ(W_o[a^<t−1>, x^<t>] + b_o);
c^<t> = Γ_u * c̃^<t> + Γ_f * c^<t−1>;
a^<t> = Γ_o * tanh c^<t>;
wherein σ is the activation function; c denotes the long-term memory state; a denotes the state; x denotes the input; <t> denotes the current time step t; c^<t> denotes the memory state at time t; W_f, W_u, W_c, W_o are weight matrices; b_f, b_u, b_c, b_o are biases. First, the output a^<t−1> of the previous time step t−1 and the input x^<t> of the current time step t pass through the tanh function to give the intermediate update state c̃^<t>. Γ_u and Γ_f respectively denote the update gate and the forget gate, each taking values in [0, 1], which control the sequences remembered in the memory cell; finally the output gate Γ_o yields the state a^<t> at time t.
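A single time step of the gated update above, written out in NumPy. This is a didactic sketch; the weight packing and dimensions are invented for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x, W, b):
    """One LSTM step with candidate, update, forget and output gates,
    following the formulas above; W/b pack the four gate parameters."""
    za = np.concatenate([a_prev, x])
    c_tilde = np.tanh(W[0] @ za + b[0])           # intermediate update state
    gamma_u = sigmoid(W[1] @ za + b[1])           # update gate, in [0, 1]
    gamma_f = sigmoid(W[2] @ za + b[2])           # forget gate, in [0, 1]
    gamma_o = sigmoid(W[3] @ za + b[3])           # output gate
    c = gamma_u * c_tilde + gamma_f * c_prev      # new memory state
    a = gamma_o * np.tanh(c)                      # new hidden state
    return a, c

rng = np.random.default_rng(1)
n_a, n_x = 4, 3
W = rng.standard_normal((4, n_a, n_a + n_x))
b = np.zeros((4, n_a))
a, c = lstm_step(np.zeros(n_a), np.zeros(n_a), rng.standard_normal(n_x), W, b)
```

Because both the output gate and tanh are bounded, every component of the emitted state a stays strictly inside (−1, 1), matching the gated formulation.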
Preferably, in step S6, CTC aligns the text image with its label over time steps through the forward and backward algorithms, comprising the following process:
Define the forward variable α_t(s) and initialize it first:
α_1(1) = y^1_b;  α_1(2) = y^1_{l'_2};  α_1(s) = 0 for s > 2;
wherein α_1(1) denotes the probability that the first output is the blank label, y^1_b, and α_1(2) denotes the probability that it is the first character of the true sequence, y^1_{l'_2}.
The subsequent recurrence is as follows:
α_t(s) = (α_{t−1}(s) + α_{t−1}(s−1))·y^t_{l'_s},  if l'_s = b or l'_s = l'_{s−2};
α_t(s) = (α_{t−1}(s) + α_{t−1}(s−1) + α_{t−1}(s−2))·y^t_{l'_s},  otherwise;
wherein α_{t−1}(s) is the forward probability, over times 0 to t−1, of being at the s-th character of the sequence l' at time t−1; α_{t−1}(s−1) is the forward probability of being at the (s−1)-th character of l' at time t−1; summing the two gives the total forward probability before time t. b denotes the blank label; l denotes the length of the character sequence in the image; l' denotes the sequence after blank labels are inserted, of length l' = 2l + 1; l'_s, l'_{s−2} respectively denote the s-th and (s−2)-th characters of the sequence l'; y^t_{l'_s} denotes the probability that the prediction at time t is the s-th character of l'.
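The forward recursion above can be checked on a toy example in pure Python. `ctc_forward`, the three-step probability table and the label are invented for the illustration:

```python
def ctc_forward(probs, label, blank=0):
    """CTC forward variables alpha_t(s) over the blank-extended label l'.

    probs[t][k] is the softmax output y_k^t at time t; the extended label
    interleaves blanks, so |l'| = 2*|l| + 1.
    """
    ext = [blank]
    for ch in label:
        ext += [ch, blank]
    T, S = len(probs), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][blank]          # alpha_1(1): leading blank
    alpha[0][1] = probs[0][ext[1]]         # alpha_1(2): first real character
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s >= 1:
                a += alpha[t - 1][s - 1]
            # skip transition allowed unless current symbol is blank or repeats l'[s-2]
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]
    # p(l|x) sums the paths ending on the last character or the final blank
    return alpha, alpha[T - 1][S - 1] + alpha[T - 1][S - 2]

# toy example: 3 time steps, alphabet {blank, 'a', 'b'}, label "ab"
probs = [[0.2, 0.6, 0.2], [0.2, 0.2, 0.6], [0.6, 0.2, 0.2]]
alpha, p = ctc_forward(probs, [1, 2])
```

On this toy input the returned p equals the value obtained by enumerating every length-3 path that collapses to the label, confirming that the recursion is a dynamic-programming shortcut over those paths.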
Define the backward variable β_t(s) and initialize it:
β_T(|l'|) = y^T_b;  β_T(|l'|−1) = y^T_{l_{|l|}};  β_T(s) = 0 for s < |l'|−1;
wherein y^T_b denotes the probability of predicting blank at the final time T, taken as the backward probability β_T(|l'|) of the last blank of the sequence l'; y^T_{l_{|l|}} denotes the probability of predicting the last character of l at time T, taken as β_T(|l'|−1), the backward probability of the penultimate symbol of l'.
The subsequent recurrence is as follows:
β_t(s) = (β_{t+1}(s) + β_{t+1}(s+1))·y^t_{l'_s},  if l'_s = b or l'_s = l'_{s+2};
β_t(s) = (β_{t+1}(s) + β_{t+1}(s+1) + β_{t+1}(s+2))·y^t_{l'_s},  otherwise;
wherein β_{t+1}(s) denotes the backward probability of being at the s-th character of the sequence at time t+1 and continuing to the final time T; β_{t+1}(s+1) denotes the corresponding backward probability for the (s+1)-th character; summing the two gives the total backward probability. b denotes the blank label; l denotes the length of the character sequence in the image; l' denotes the sequence after blank labels are inserted, of length l' = 2l + 1; l'_s, l'_{s+2} respectively denote the s-th and (s+2)-th characters of the sequence l'; y^t_{l'_s} denotes the probability that the prediction at time t is the s-th character of l'.
Further, the probability of the label symbol s at step t is obtained as:
p(s, t | x) = α_t(s)·β_t(s) / y^t_{l'_s};  p(l | x) = Σ_{s=1}^{|l'|} α_t(s)·β_t(s) / y^t_{l'_s};
wherein t denotes the time step; T denotes the total number of time steps; l denotes the label sequence; s denotes the s-th symbol of the sequence; y^t_{l'_s} denotes the probability that the prediction at time t is the s-th character of the sequence l'; α_t(s) denotes the probability of forming the prefix l_{1:s} along all possible paths up to time t; β_t(s) denotes the probability of generating the suffix l_{s:l} along all possible paths from time t.
Preferably, in step S6, the loss function of CTC is:
L_CTC = −ln p(z | x),  with  p(z | x) = Σ_{u=1}^{2U+1} α(t, u)·β(t, u) / y^t_{z'_u} for any time step t;
wherein p(z | x) denotes the probability of the label sequence z given the input x; U denotes the length of the original label sequence, so the blank-extended sequence z' has length 2U + 1; t denotes the time step; α(t, u) denotes the forward probability at node u at time t; β(t, u) denotes the backward probability at node u at time t.
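Combining the forward and backward recursions gives the CTC negative log-likelihood. A self-contained pure-Python sketch under the convention used above — both α and β include the emission at their own time step, so their product is divided by y once; the function and the toy inputs are invented for the illustration:

```python
import math

def ctc_loss(probs, label, blank=0):
    """CTC negative log-likelihood via the forward-backward recursions."""
    ext = [blank]
    for ch in label:
        ext += [ch, blank]
    T, S = len(probs), len(ext)
    y = [[probs[t][ext[s]] for s in range(S)] for t in range(T)]

    def can_skip(s):
        # a path may jump over position s-1 only onto a non-blank that
        # differs from the symbol two positions back
        return s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]

    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0], alpha[0][1] = y[0][0], y[0][1]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s] + (alpha[t - 1][s - 1] if s >= 1 else 0.0)
            if can_skip(s):
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * y[t][s]

    beta = [[0.0] * S for _ in range(T)]
    beta[T - 1][S - 1], beta[T - 1][S - 2] = y[T - 1][S - 1], y[T - 1][S - 2]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1][s] + (beta[t + 1][s + 1] if s + 1 < S else 0.0)
            if s + 2 < S and can_skip(s + 2):
                b += beta[t + 1][s + 2]
            beta[t][s] = b * y[t][s]

    # p(z|x) is the same no matter which time step it is evaluated at
    p = sum(alpha[0][s] * beta[0][s] / y[0][s] for s in range(S) if y[0][s] > 0)
    return -math.log(p)

probs = [[0.2, 0.6, 0.2], [0.2, 0.2, 0.6], [0.6, 0.2, 0.2]]
loss = ctc_loss(probs, [1, 2])
```

Evaluating Σ_s α_t(s)·β_t(s)/y^t at any other t yields the same p(z|x), which is a useful internal check when implementing the recursions.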
Compared with the prior art, the beneficial effects of the present invention are: (1) text lines are detected by SegLink; feature maps of different scales detect text lines of different sizes and aspect ratios; following a divide-and-conquer idea, text-line detection is split into Segment detection and Link detection, and combining multi-scale feature maps improves the precision of text-box detection; an angle parameter added to the Segment, together with perspective transform and line detection, solves the text detection task for tilted text images and rectifies the text; (2) DenseNet replaces a conventional convolutional network for text-line feature extraction; its characteristic is that any two layers inside a Dense Block are connected, and feature-map reuse reduces the parameter count and computation, mitigating vanishing/exploding gradients and network degradation and extracting more essential features; (3) the obtained feature sequences are fed into a bidirectional LSTM that combines the contextual sequence information of the text and completes the mapping from feature sequences to character sequences, improving recognition accuracy; (4) finally, trained jointly within the CRNN, the CTC decoding algorithm incorporating the Attention mechanism predicts each time step of the sequence, avoiding the errors caused by character segmentation and completing the text recognition.
Brief description of the drawings
Fig. 1 is the flow chart of the method of the invention for detecting and recognizing continuous paragraph text in images based on the fusion of SegLink and an Attention-based CRNN;
Fig. 2a-Fig. 2c are sample pictures of the datasets generated in the present invention;
Fig. 3 is the network structure of CycleGAN in the present invention;
Fig. 4 is the network structure of SegLink in the present invention;
Fig. 5 shows the effect of text-line detection in the present invention;
Fig. 6 is the network structure of DenseNet in the present invention;
Fig. 7 shows the effect of text recognition in the present invention;
Fig. 8 is the structure of BLSTM in the present invention;
Fig. 9 is a schematic diagram of the CTC decoding structure incorporating the Attention mechanism in the present invention.
Specific embodiments
To make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below in conjunction with the accompanying drawings. The present invention includes, but is not limited to, the following embodiments.
Fig. 1 shows the overall implementation flow of the method of the invention for detecting and recognizing continuous paragraph text in images based on the fusion of SegLink and an Attention-based CRNN. The specific steps are as follows:
Step 1: select and produce datasets. CycleGAN is used to produce continuous-text image datasets in multiple fonts, and each dataset is divided into a training set, a validation set and a test set.
Step 1 includes the following process: single-character dataset pictures of different font styles are produced and generated by CycleGAN; the single-character data are stitched into semantically meaningful text lines or text paragraphs, and noise such as tilt and rotation is added; each text-line picture is paired with its label sequence for training. The dataset may be in various styles such as printed or handwritten, and may be trained on one of several languages such as English or Chinese.
As shown in Fig. 2c, the English dataset in this embodiment is taken from the IAM handwriting database, which contains handwritten English text scanned at a resolution of 300 dpi and saved as 256-level grayscale PNG images, annotated at word and line level.
As shown in Fig. 2a-Fig. 2b, the Chinese dataset is assembled by stitching the HWDB1.1 Chinese handwritten single-character dataset, and Chinese text datasets in different writing styles are generated by CycleGAN to make the model more robust.
Fig. 3 shows the structure of the CycleGAN deep learning model. CycleGAN consists of generators and discriminators, which together form adversarial networks (GANs); CycleGAN is essentially two mirror-symmetric GANs forming a cycle. The generator tries to produce samples from a distribution, and the discriminator decides whether a sample is an original image or a generated one. Generator A-to-B maps a picture in data domain A to an output image in domain B; to ensure the mapping is meaningful, there must be a significant association between input and output images, so another generator, B-to-A, maps the output image back to the original data domain.
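The cycle-consistency idea — mapping A→B→A should reproduce the input — can be sketched independently of any particular generator network. This is a toy illustration; the pixel-inversion "generators" are invented for the example:

```python
def cycle_consistency_loss(a_images, b_images, g_ab, g_ba):
    """L1 cycle loss: A->B->A (and B->A->B) should reproduce each input.

    g_ab / g_ba stand in for the two generators; here, any callables on
    flat lists of pixel values.
    """
    def l1(u, v):
        return sum(abs(p - q) for p, q in zip(u, v)) / len(u)
    loss = 0.0
    for a in a_images:
        loss += l1(g_ba(g_ab(a)), a)      # forward cycle A -> B -> A
    for b in b_images:
        loss += l1(g_ab(g_ba(b)), b)      # backward cycle B -> A -> B
    return loss / (len(a_images) + len(b_images))

# toy "generators": invert intensities in [0, 1]; inversion is its own
# inverse, so the cycle reconstructs each image up to floating-point rounding
def invert(img):
    return [1.0 - p for p in img]

loss = cycle_consistency_loss([[0.1, 0.9]], [[0.4, 0.4]], invert, invert)
```

In the real model each generator is a network and this loss is added to the two adversarial losses, which is what keeps the A→B mapping meaningfully tied to its input.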
The generator in this example uses the network DenseNet as the transfer module (Transfer Module) and builds the network with an encoder-decoder structure. The encoder part contains three convolutional layers with structure Conv-Norm-ReLU; the kernel sizes are one 7*7 convolution and two 3*3 convolutions. The transfer part consists of three Dense Blocks with the growth rate set to 256. The decoder part contains three deconvolution layers with structure DeConv-Norm-ReLU; the kernel size is 3*3 for the first two layers and 7*7 for the last.
Step 2: build the SegLink network model under the TensorFlow deep learning framework; detect text lines of different sizes and aspect ratios by generating feature maps of different scales; extract the four vertex coordinates of each text line; and correct tilted text lines by four-point perspective transform and rectification.
Fig. 4 shows the structure of the SegLink network model in this example. The pre-trained VGG16 convolutional neural network serves as the backbone; its fully connected layers are replaced by convolutional layers, and the convolution size is successively halved to generate multi-scale feature maps (64, 32, 16, 8, 4, 2 in Fig. 4). The network divides the input text image into two parts: Segments (slices) and Links. A Segment frames one part of a text line; its location information is expressed by the five parameters (x, y, w, h, θ), respectively representing its coordinates (x, y), width (w), height (h) and tilt angle (θ).
A text line can be composed of multiple Segment boxes, adjacent Segments being connected by Links. A 3*3 sliding window over the feature maps of different scales generates multiple predicted Segment boxes and Links (as shown in Fig. 4); fusion rules merge the Segment and Link information across scales and reject redundancy, yielding the final predicted text line.
Feature maps of six sizes, respectively [64, 32, 16, 8, 4, 2], are provided in this example. The parameters (x, y, w, h, θ) of a Segment are updated from an anchor point (x_a, y_a) as follows:
x_s = x_a + α·Δx_s;
y_s = y_a + α·Δy_s;
w_s = α·exp(Δw_s);
h_s = α·exp(Δh_s);
θ_s = Δθ_s;
wherein x_s, y_s respectively denote the updated horizontal and vertical coordinates of the anchor point; Δx_s, Δy_s respectively denote the predicted horizontal and vertical offsets of the anchor point; w_s, h_s respectively denote the width and height of the anchor box; Δw_s, Δh_s respectively denote the predicted width and height offsets of the anchor box; θ_s, Δθ_s respectively denote the rotation angle of the anchor box and its offset; w_I, h_I respectively denote the width and height of the original image; w_f, h_f respectively denote the width and height of the feature map; α = λ·w_I/w_f expresses the size of the receptive field; λ is a weight coefficient.
The fusion method in this embodiment obtains a straight line L by least-squares linear regression such that the distance from the centre coordinates of all Segments to the line is minimal, projects the centre coordinates onto L, and takes the two farthest projections, denoted (x_m, y_m) and (x_n, y_n). Adding half of the widths w_m, w_n of the Segments containing those two points, and taking as height h the mean of all anchor-box heights (N being the number of anchor boxes), the detailed formulas are:
x = (x_m + x_n)/2;
y = (y_m + y_n)/2;
w = √((x_m − x_n)² + (y_m − y_n)²) + (w_m + w_n)/2;
h = (1/N)·Σ_{i=1}^{N} h_i;
The finally obtained text-box coordinates are (x, y, w, h, θ), θ being the inclination of the line L.
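The merging step — fit a line through the Segment centres, project onto it, take the extreme projections and pad by the end widths — can be sketched in pure Python. `merge_segments` and the toy boxes are invented for the illustration, and the angle is limited to the atan range for simplicity:

```python
import math

def merge_segments(segs):
    """Merge Segments (cx, cy, w, h) of one text line into a single box."""
    n = len(segs)
    xs = [s[0] for s in segs]
    ys = [s[1] for s in segs]
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    theta = math.atan(cov / var)                   # inclination of fitted line L
    ux, uy = math.cos(theta), math.sin(theta)      # unit direction of L
    # project centres onto L and find the two farthest projections
    t = [(x - mx) * ux + (y - my) * uy for x, y in zip(xs, ys)]
    i, j = t.index(min(t)), t.index(max(t))
    xm, ym = mx + t[i] * ux, my + t[i] * uy
    xn, yn = mx + t[j] * ux, my + t[j] * uy
    w = math.hypot(xn - xm, yn - ym) + (segs[i][2] + segs[j][2]) / 2
    h = sum(s[3] for s in segs) / n                # mean of all box heights
    return ((xm + xn) / 2, (ym + yn) / 2, w, h, theta)

# three collinear horizontal segments of width 20 and height 10
box = merge_segments([(0, 0, 20, 10), (30, 0, 20, 10), (60, 0, 20, 10)])
```

For the collinear toy input the merged box spans the two extreme centres plus half a segment width at each end, which matches the formulas above.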
In step 2, rectification detects the straight line along the text in the image with the Hough line transform, computes the inclination angle of the line, and then rotates the image according to that angle to correct it.
Step 3: define the loss function. The weighted sum of three loss subfunctions — the Segment confidence function, the Link confidence function and the predicted-position error function — gives the overall loss function used to optimize the model. The loss function in step 3 is defined as:
L(y_s, c_s, y_l, c_l, ŝ, s) = (1/N_s)·L_conf(y_s, c_s) + λ₁·(1/N_s)·L_loc(ŝ, s) + λ₂·(1/N_l)·L_conf(y_l, c_l);
The loss function is composed of the three loss subfunctions above: the Segment confidence function, the Link confidence function and the predicted-position error function.
The confidence functions of Segment and Link perform binary classification with softmax, judging whether there is text and whether there is a link, and are denoted L_conf; the predicted-position error is computed with the Smooth L1 regression loss and denoted L_loc; c_s, c_l respectively denote the predicted values of Segment and Link; y_s, y_l respectively denote the labels of Segment and Link; N_s, N_l respectively denote the number of Segments and the number of Links; ŝ, s respectively denote the predicted geometry of a segment and its ground-truth value; λ₁, λ₂ denote weight coefficients.
Step 4: train the SegLink network model of step 2 with the training set of step 1 to obtain the final text detection model, and test it with the test set.
Fig. 5 shows the effect of SegLink text detection; the boxes mark the located text positions.
Step 5: build the CRNN model (Convolutional Recurrent Neural Network). The densely connected convolutional neural network DenseNet extracts picture features and outputs feature maps; a bidirectional long short-term memory network BLSTM, combining the contextual information of the continuous text, predicts each frame of the feature sequence produced by the DenseNet convolutions.
The CRNN model of this embodiment is an Attention-based CRNN convolutional recurrent neural network, mainly comprising three parts: the DenseNet convolutional neural network, the BLSTM long short-term memory network, and the CTC transcription layer.
Fig. 6 shows the structure of the DenseNet network in this embodiment. DenseNet consists mainly of two parts: Dense Blocks (dense blocks) and Transition Blocks (transition blocks).
Unlike conventional convolutional neural networks, any two convolutional layers within a Dense Block of the invention are connected: each layer's output becomes the input of all subsequent layers, and each layer's input contains the outputs of all preceding layers. This feature-map reuse reduces the parameter count and computation of the network, preserves shallow features, and mitigates the vanishing- and exploding-gradient problems.
As shown in Fig. 6, in this embodiment a Dense Block uses Batch Normalization + ReLU + 3*3 convolution as its composite function. Dense Blocks are connected by Transition Blocks; a Transition Block is composed of a bottleneck layer and a pooling layer, the 1*1 convolution in the bottleneck layer compressing parameters by reducing the feature-map dimensionality. Because of the dense connection structure, inserting pooling layers directly between the densely connected layers is infeasible, so pooling is instead added between Dense Blocks.
As shown in Fig. 6, the DenseNet network in this embodiment is composed of three Dense Blocks and two Transition Blocks; each Dense Block consists of eight convolutional layers with 3*3 kernels, and each Transition Block consists of a 128-dimensional 1*1 convolution and a pooling layer.
Fig. 8 shows the structure of BLSTM. This embodiment uses BLSTM to process the contextual information of the continuous text sequence in both directions. Compared with a plain recurrent neural network, the LSTM adds three gates, respectively an update gate, a forget gate and an output gate. The formulas are:
c̃^<t> = tanh(W_c[a^<t−1>, x^<t>] + b_c);
Γ_u = σ(W_u[a^<t−1>, x^<t>] + b_u);
Γ_f = σ(W_f[a^<t−1>, x^<t>] + b_f);
Γ_o = σ(W_o[a^<t−1>, x^<t>] + b_o);
c^<t> = Γ_u * c̃^<t> + Γ_f * c^<t−1>;
a^<t> = Γ_o * tanh c^<t>;
wherein c denotes the long-term memory state (memory); a denotes the state (stage); x denotes the input; <t> denotes the current time step t; c^<t> denotes the memory state at time t; σ is the activation function; W_f, W_u, W_c, W_o are weight matrices; b_f, b_u, b_c, b_o are biases. First, the output a^<t−1> of the previous time step and the input x^<t> of the current time step pass through the tanh function to give the intermediate update state c̃^<t>. Γ_u and Γ_f respectively denote the update gate and the forget gate, each with values in [0, 1], to control the sequences remembered in the memory cell; finally the output gate Γ_o yields the state a^<t> at time t.
In the present embodiment an attention mechanism is also incorporated into the BLSTM network, so that each prediction in the time series is computed from the contextual information best suited to the current step.
In step 6, an attention-based CTC transcription method for recognizing text sequences of arbitrary length (CTC incorporating an attention mechanism) decodes the sequence predicted in step 5 to obtain the target text. CTC (Connectionist Temporal Classification) aligns the text image with the label along the time steps through the forward-backward algorithm of dynamic programming, as follows:
(1) Define the forward variable αt(s) and initialize it first:

α1(1) = y<1>(b); α1(2) = y<1>(l1); α1(s) = 0 for s > 2.

Here α1(1) denotes the probability that the first output is the blank label, and α1(2) denotes the probability that the first output is the first character of the true sequence; y<t>(k), the probability that the output at time t is symbol k, is provided by the BLSTM above.
The subsequent recurrence is as follows:

αt(s) = (αt-1(s) + αt-1(s-1)) · y<t>(l's), if l's = b or l's = l's-2;
αt(s) = (αt-1(s) + αt-1(s-1) + αt-1(s-2)) · y<t>(l's), otherwise.
In the formula, αt-1(s) is the forward probability that, over steps 0 to t-1, the output at step t-1 is the s-th symbol of the sequence l'; αt-1(s-1) is the forward probability that the output at step t-1 is the (s-1)-th symbol of l'. Since each prediction can only stay on the same symbol or move to the next symbol at the following step (with an additional skip over a blank when the neighbouring characters differ), summing these terms gives the total forward probability up to time t. b denotes the blank label; l denotes the length of the character sequence in the image, and l' denotes the sequence after blank labels are introduced, of length 2l + 1; l's and l's-2 denote the s-th and (s-2)-th symbols of l'; y<t>(l's), the probability that the output at time t is the s-th symbol of l', is obtained from the BLSTM.
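The initialization and recurrence above can be sketched as follows (a minimal NumPy sketch; the function names and the tiny two-symbol alphabet are illustrative):

```python
import numpy as np

def ctc_forward(y, label, blank=0):
    """Forward variables alpha_t(s) over the augmented sequence
    l' = [b, l_1, b, l_2, ..., l_L, b] (length 2*len(l)+1).
    y[t, k] is the BLSTM probability of symbol k at time t."""
    T = y.shape[0]
    lp = [blank]
    for c in label:
        lp += [c, blank]
    S = len(lp)                          # 2*len(label) + 1
    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, blank]            # alpha_1(1): first output is blank
    if S > 1:
        alpha[0, 1] = y[0, lp[1]]        # alpha_1(2): first output is l_1
    for t in range(1, T):
        for s in range(S):
            a = alpha[t-1, s]
            if s >= 1:
                a += alpha[t-1, s-1]
            # the skip over a blank is allowed unless l'_s is blank
            # or repeats l'_{s-2}
            if s >= 2 and lp[s] != blank and lp[s] != lp[s-2]:
                a += alpha[t-1, s-2]
            alpha[t, s] = a * y[t, lp[s]]
    return alpha, lp

def ctc_prob(y, label, blank=0):
    """p(l|x): sum of the two terminal states (final blank, final char)."""
    alpha, _ = ctc_forward(y, label, blank)
    return alpha[-1, -1] + alpha[-1, -2]
```

For T = 2 frames with uniform probabilities over {blank, 'a'}, the three paths (a,a), (a,-), (-,a) all collapse to "a", so p = 3 × 0.25 = 0.75, which the recursion reproduces.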
(2) Define the backward variable βt(s) and initialize it:

βT(l') = y<T>(b); βT(l' - 1) = y<T>(ll); βT(s) = 0 for s < l' - 1.

Here y<T>(b) denotes the probability that the output at the final step T is blank, taken as the backward probability βT(l'), i.e. the backward probability of the last blank of l'; y<T>(ll) denotes the probability that the output at step T is the last character of l, taken as βT(l' - 1), i.e. the backward probability of the penultimate symbol of l' (the last character of l).
The subsequent recurrence is as follows:

βt(s) = (βt+1(s) + βt+1(s+1)) · y<t>(l's), if l's = b or l's = l's+2;
βt(s) = (βt+1(s) + βt+1(s+1) + βt+1(s+2)) · y<t>(l's), otherwise.
In the formula, βt+1(s) denotes the backward probability that, from step t+1 to the final step T, the output at step t+1 is the s-th symbol of the sequence; βt+1(s+1) denotes the backward probability that the output at step t+1 is the (s+1)-th symbol. As with the forward probabilities, summing these terms gives the total backward probability. b denotes the blank label; l denotes the length of the character sequence in the image; l' denotes the sequence after blank labels are introduced, of length 2l + 1; l's and l's+2 denote the s-th and (s+2)-th symbols of l'; y<t>(l's), the probability that the output at time t is the s-th symbol of l', is obtained from the BLSTM.
Therefore, the probability of the label passing through symbol s at step t is obtained as:

p(l | x) = Σs αt(s) · βt(s) / y<t>(l's).

In the formula, t denotes the time step; T denotes the total number of steps; l denotes the sequence length; s denotes the s-th symbol of the sequence; y<t>(l's) denotes the probability that the output at time t is the s-th symbol; αt(s) denotes the probability of generating the prefix l1:s along all possible paths up to time t; βt(s) denotes the probability of generating the subsequent sequence ls:l along all possible paths from time t.
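The backward recursion and the identity above (whose value is the same at every t) can be checked with a short self-contained NumPy sketch; here both α and β include the emission at step t, so the identity divides by y<t>(l's) (names are illustrative):

```python
import numpy as np

def ctc_forward_backward(y, label, blank=0):
    """Forward (alpha) and backward (beta) variables over the augmented
    sequence l' = [b, l_1, b, ..., l_L, b]; both include the emission at t."""
    T = y.shape[0]
    lp = [blank]
    for c in label:
        lp += [c, blank]
    S = len(lp)
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0, 0] = y[0, blank]
    alpha[0, 1] = y[0, lp[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t-1, s] + (alpha[t-1, s-1] if s >= 1 else 0.0)
            if s >= 2 and lp[s] != blank and lp[s] != lp[s-2]:
                a += alpha[t-1, s-2]          # skip over a blank
            alpha[t, s] = a * y[t, lp[s]]
    beta[T-1, S-1] = y[T-1, blank]            # last blank of l'
    beta[T-1, S-2] = y[T-1, lp[S-2]]          # last character of l
    for t in range(T-2, -1, -1):
        for s in range(S-1, -1, -1):
            b = beta[t+1, s] + (beta[t+1, s+1] if s+1 < S else 0.0)
            if s+2 < S and lp[s] != blank and lp[s] != lp[s+2]:
                b += beta[t+1, s+2]           # mirror-image skip
            beta[t, s] = b * y[t, lp[s]]
    return alpha, beta, lp

def label_posterior(y, label, t, blank=0):
    """p(l|x) computed at step t from the identity above; because both
    alpha and beta include the emission at t, one factor is divided out."""
    alpha, beta, lp = ctc_forward_backward(y, label, blank)
    return sum(alpha[t, s] * beta[t, s] / y[t, lp[s]] for s in range(len(lp)))
```

Evaluating the sum at different t returns the same total probability p(l | x), which is exactly the property the CTC loss below relies on.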
In step 6, the loss function of CTC is:

L(x, z) = -ln p(z | x);

where p(z | x) denotes the probability of outputting the sequence z given the input x; U denotes the length of the original label sequence; t denotes the time step; α(t, u) denotes the forward probability at node u at time t; β(t, u) denotes the backward probability at node u at time t. The loss is obtained by taking the logarithm of the probability function derived above.
In the present invention an attention mechanism, based on both content and position, is also incorporated into the decoder during training, which improves recognition accuracy and reduces missed, wrong and spurious characters in the OCR process. The CTC decoding structure is shown in Fig. 9. The rationale is that the conventional CTC decoding algorithm weighs all preceding and following information with the same function during decoding, which is clearly unreasonable: when decoding the prediction at time t, information from steps close to t, such as t-1 and t+1, should carry a larger weight, while steps farther from t should have a smaller influence on the prediction. The present invention therefore incorporates an attention mechanism into CTC and adjusts the weights with different functions, making the prediction more precise.
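The patent does not give the exact weighting function; one hypothetical choice consistent with the description (nearby steps weighted more, weight decaying with temporal distance) is an exponential-decay weighting, sketched here with illustrative names and an assumed decay constant tau:

```python
import numpy as np

def locality_weights(T, t, tau=2.0):
    """Hypothetical distance-based attention weights: frames closer to t
    get larger weight (exponential decay in |t'-t|), normalized to sum to 1.
    tau is an assumed decay constant, not a value from the patent."""
    d = np.abs(np.arange(T) - t)
    w = np.exp(-d / tau)
    return w / w.sum()

def attended_context(features, t, tau=2.0):
    """Weighted sum of per-frame features used when predicting step t;
    features has shape (T, feature_dim)."""
    w = locality_weights(features.shape[0], t, tau)
    return w @ features
```

Under this sketch, the context vector feeding the prediction at step t is dominated by frames t-1, t, t+1, matching the intuition stated above.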
In step 7, the CRNN model described in step 5 is trained with the training set of step 1 to obtain the final text recognition model, and the test set is fed into the CRNN model to obtain the test results.
The CRNN network is trained with stochastic gradient descent, and the gradients are computed by the back-propagation algorithm. In particular, during CTC transcription the error is back-propagated with the forward-backward algorithm, and in the BLSTM the error is back-propagated with the BPTT algorithm.
In step 8, the whole text detection and recognition network is trained, completing the detection and recognition of continuous paragraph text. The recognition error rates of the present embodiment are listed in Table 1 below:
Table 1. Recognition error rates in the embodiment
The preferred embodiments of the present invention have been described in detail above. It should be understood that a person skilled in the art may, without creative effort or by means of software programming, make many modifications and variations according to the concept of the present invention. Therefore, any technical solution that a person skilled in the art can obtain, on the basis of the prior art and under the concept of the present invention, through logical analysis, reasoning or limited experimentation, shall fall within the scope of protection determined by the claims.

Claims (10)

1. A method for detecting and recognizing continuous paragraph text in an image based on fused SegLink and attention-based CRNN processing, characterized by comprising the following steps:
S1: producing a continuous text image data set and dividing it into a training set, a validation set and a test set;
S2: building a SegLink network model under the TensorFlow deep learning framework, detecting text lines of different sizes and aspect ratios by generating feature maps of different scales, and rectifying inclined text lines;
S3: constructing the loss function: weighting and summing the predicted segment confidence function, the link confidence function and the predicted position error function to obtain the overall loss function for optimizing the model;
S4: training the SegLink network model of step S2 with the training set of step S1 to obtain the final text detection model, and testing it with the test set;
S5: building a CRNN model: extracting image features with the densely connected network DenseNet to output feature maps, and predicting each frame of the feature sequence produced by the DenseNet convolutions with a bidirectional long short-term memory network BLSTM that combines the contextual information of the continuous text;
S6: decoding the sequence predicted in step S5 with an attention-based CTC transcription method for recognizing text sequences of arbitrary length, to obtain the target text;
S7: training the CRNN model of step S5 with the training set of step S1 to obtain the final text recognition model, and feeding the test set into the CRNN model to obtain test results;
S8: training the whole text detection and recognition network to complete the detection and recognition of continuous paragraph text.
2. The method for detecting and recognizing continuous paragraph text in an image according to claim 1, characterized in that step S1 further comprises:
generating single-character data set pictures of different font styles with a CycleGAN deep learning model, splicing the single-character data set pictures into semantically meaningful text lines or text paragraphs, and adding noise;
the data set is a foreign-language data set or a Chinese data set, and the font style of the data set is printed or handwritten.
3. The method for detecting and recognizing continuous paragraph text in an image according to claim 1, characterized in that step S2 further comprises:
using the pre-trained VGG16 convolutional neural network as the network backbone, replacing its fully connected layers with convolutional layers, and successively halving the convolution size to generate multi-scale feature maps; the network divides the input text image into two parts, segments (Segment) and links (Link); a segment outlines a part of a text line and indicates its position information; a text line contains multiple segments, and the segments are connected by links;
generating multiple predicted segments and links from the feature maps of different scales with a 3*3 sliding window, and merging the segment and link information on each scale by fusion rules while rejecting redundancy, to obtain the finally predicted text lines.
4. The method for detecting and recognizing continuous paragraph text in an image according to claim 3, characterized in that the position information of a segment is represented by five parameters: the coordinates (x, y), the width w, the height h and the inclination angle θ; the generated multi-scale feature maps comprise six sizes, namely [64, 32, 16, 8, 4, 2]; the parameters (x, y, w, h, θ) of a segment are updated as follows:
x = xs + α·Δxs;
y = ys + α·Δys;
ws = α·exp(Δws);
hs = α·exp(Δhs);
θs = Δθs;
α = λ·wI / wf;

wherein xs, ys respectively denote the abscissa and ordinate of the anchor point, and Δxs, Δys denote the predicted horizontal and vertical offsets of the anchor point; ws, hs respectively denote the width and height of the anchor box, and Δws, Δhs respectively denote the predicted width and height offsets of the anchor box; θs, Δθs respectively denote the rotation angle of the anchor box and its offset; wI, hI respectively denote the width and height of the original image; wf, hf respectively denote the width and height of the feature map; α = λ·wI/wf denotes the size of the receptive field; λ is a weight coefficient;
the fusion method obtains a straight line L by least-squares linear regression such that the distance from the centre coordinates of all segments to the line L is minimal; the centre coordinates are projected onto the line L, and the two farthest projections are denoted (xm, ym) and (xn, yn); half of the widths wm, wn of the segments containing these two points is then added, and the height h takes the mean of the heights of all anchor boxes, giving:

x = (xm + xn) / 2;
y = (ym + yn) / 2;
w = sqrt((xm - xn)^2 + (ym - yn)^2) + (wm + wn) / 2;
h = (1/N)·Σi hi;

the finally detected text box coordinates are (x, y, w, h, θ); wherein N is the number of anchor boxes and hi is the height of the i-th anchor box.
5. The method for detecting and recognizing continuous paragraph text in an image according to claim 1, characterized in that in step S2 the rectification process further comprises:
detecting the straight line on which the text lies in the image by the Hough line transform, computing the inclination angle of the line, and then rotating the image for correction according to the inclination angle.
6. The method for detecting and recognizing continuous paragraph text in an image according to claim 3 or 4, characterized in that in step S3 the loss function is as follows:

L(ys, cs, yl, cl, ŝ, s) = (1/Ns)·Lconf(ys, cs) + λ1·(1/Ns)·Lloc(ŝ, s) + λ2·(1/Nl)·Lconf(yl, cl);

wherein Lconf(ys, cs) denotes the predicted segment confidence function; Lconf(yl, cl) denotes the link confidence function; Lloc(ŝ, s) denotes the predicted position error function; λ1, λ2 denote weight coefficients; Ns, Nl respectively denote the number of segments and the number of links; ys, yl respectively denote the labels of segments and links; cs, cl respectively denote the predicted values of segments and links; ŝ, s respectively denote the predicted segment geometry and its ground-truth value.
7. The method for detecting and recognizing continuous paragraph text in an image according to claim 1, characterized in that in step S5 the network DenseNet comprises several dense blocks (Dense Block) and transition blocks (Transition Block), the Dense Blocks being connected to one another by Transition Blocks;
a Dense Block uses Batch Normalization + ReLU + a 3*3 convolutional layer as its composite function;
any two convolutional layers within a Dense Block are connected;
a Transition Block consists of a bottleneck layer and a pooling layer.
8. The method for detecting and recognizing continuous paragraph text in an image according to claim 1, characterized in that in step S5 the contextual information of the continuous text sequence is processed in both directions with the BLSTM network; three gates are added to the LSTM network, namely an update gate, a forget gate and an output gate, with the following formulas:
Γf = σ(Wf[a<t-1>, x<t>] + bf);
Γu = σ(Wu[a<t-1>, x<t>] + bu);
c̃<t> = tanh(Wc[a<t-1>, x<t>] + bc);
c<t> = Γu * c̃<t> + Γf * c<t-1>;
Γo = σ(Wo[a<t-1>, x<t>] + bo);
a<t> = Γo * tanh c<t>;
wherein σ is the activation function; c denotes the long-term memory state; a denotes the hidden state; x denotes the input; <t> denotes the current time step t; c<t> denotes the memory state at time t; Wf, Wu, Wc, Wo are weight matrices; bf, bu, bc, bo are biases; first the output a<t-1> of the previous step t-1 and the input x<t> of the current step t pass through the tanh function to produce the intermediate update state c̃<t>; Γu and Γf respectively denote the update gate and the forget gate, each taking a value in [0, 1] and controlling what the memory cell retains; finally the output gate Γo yields the state a<t> at time t.
9. The method for detecting and recognizing continuous paragraph text in an image according to claim 1, characterized in that in step S6 the alignment of the text image with the label along the time steps by CTC through the forward-backward algorithm comprises the following process:
defining the forward variable αt(s) and initializing it first:

α1(1) = y<1>(b); α1(2) = y<1>(l1); α1(s) = 0 for s > 2;

wherein α1(1) denotes the probability that the first output is the blank label, α1(2) denotes the probability that the first output is the first character of the true sequence, and y<t>(k) denotes the probability, given by the BLSTM, that the output at time t is symbol k;
the subsequent recurrence is as follows:

αt(s) = (αt-1(s) + αt-1(s-1)) · y<t>(l's), if l's = b or l's = l's-2;
αt(s) = (αt-1(s) + αt-1(s-1) + αt-1(s-2)) · y<t>(l's), otherwise;

wherein αt-1(s) is the forward probability that, over steps 0 to t-1, the output at step t-1 is the s-th symbol of the sequence l'; αt-1(s-1) is the forward probability that the output at step t-1 is the (s-1)-th symbol of l'; summing αt-1(s) and αt-1(s-1) gives the total forward probability up to time t; b denotes the blank label; l denotes the length of the character sequence in the image; l' denotes the sequence after blank labels are introduced, of length 2l + 1; l's, l's-2 respectively denote the s-th and (s-2)-th symbols of l'; y<t>(l's) denotes the probability that the output at time t is the s-th symbol of l';
defining the backward variable βt(s) and initializing it:

βT(l') = y<T>(b); βT(l' - 1) = y<T>(ll); βT(s) = 0 for s < l' - 1;

wherein y<T>(b) denotes the probability that the output at the final step T is blank, taken as the backward probability βT(l') of the last blank of l'; y<T>(ll) denotes the probability that the output at step T is the last character of l, taken as βT(l' - 1), the backward probability of the penultimate symbol of l';
the subsequent recurrence is as follows:

βt(s) = (βt+1(s) + βt+1(s+1)) · y<t>(l's), if l's = b or l's = l's+2;
βt(s) = (βt+1(s) + βt+1(s+1) + βt+1(s+2)) · y<t>(l's), otherwise;

wherein βt+1(s) denotes the backward probability that, from step t+1 to the final step T, the output at step t+1 is the s-th symbol of the sequence; βt+1(s+1) denotes the backward probability that the output at step t+1 is the (s+1)-th symbol; summing βt+1(s) and βt+1(s+1) gives the total backward probability; b denotes the blank label; l denotes the length of the character sequence in the image; l' denotes the sequence after blank labels are introduced, of length 2l + 1; l's, l's+2 respectively denote the s-th and (s+2)-th symbols of l'; y<t>(l's) denotes the probability that the output at time t is the s-th symbol of l';
further, the probability of the label passing through symbol s at step t is obtained as:

p(l | x) = Σs αt(s) · βt(s) / y<t>(l's);

wherein t denotes the time step; T denotes the total number of steps; l denotes the sequence length; s denotes the s-th symbol of the sequence; y<t>(l's) denotes the probability that the output at time t is the s-th symbol; αt(s) denotes the probability of generating the prefix l1:s along all possible paths up to time t; βt(s) denotes the probability of generating the subsequent sequence ls:l along all possible paths from time t.
10. The method for detecting and recognizing continuous paragraph text in an image according to claim 9, characterized in that in step S6 the loss function of CTC is:

L(x, z) = -ln p(z | x);

wherein p(z | x) denotes the probability of outputting the sequence z given the input x; U denotes the length of the original label sequence; t denotes the time step; α(t, u) denotes the forward probability at node u at time t; β(t, u) denotes the backward probability at node u at time t.
CN201910688854.6A 2019-07-29 2019-07-29 Method for detecting and recognizing continuous paragraph text in an image Pending CN110399845A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910688854.6A CN110399845A (en) 2019-07-29 2019-07-29 Method for detecting and recognizing continuous paragraph text in an image


Publications (1)

Publication Number Publication Date
CN110399845A true CN110399845A (en) 2019-11-01

Family

ID=68326437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910688854.6A Pending CN110399845A (en) Method for detecting and recognizing continuous paragraph text in an image

Country Status (1)

Country Link
CN (1) CN110399845A (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214382A (en) * 2018-07-16 2019-01-15 顺丰科技有限公司 A kind of billing information recognizer, equipment and storage medium based on CRNN
CN109272048A (en) * 2018-09-30 2019-01-25 北京工业大学 A kind of mode identification method based on depth convolutional neural networks
CN109408776A (en) * 2018-10-09 2019-03-01 西华大学 A kind of calligraphy font automatic generating calculation based on production confrontation network
CN109726657A (en) * 2018-12-21 2019-05-07 万达信息股份有限公司 A kind of deep learning scene text recognition sequence method
CN109800749A (en) * 2019-01-17 2019-05-24 湖南师范大学 A kind of character recognition method and device
CN109886174A (en) * 2019-02-13 2019-06-14 东北大学 A kind of natural scene character recognition method of warehouse shelf Sign Board Text region
CN109919060A (en) * 2019-02-26 2019-06-21 上海七牛信息技术有限公司 A kind of identity card content identifying system and method based on characteristic matching
CN109993803A (en) * 2019-02-25 2019-07-09 复旦大学 The intellectual analysis and evaluation method of city tone
CN110032998A (en) * 2019-03-18 2019-07-19 华南师范大学 Character detecting method, system, device and the storage medium of natural scene picture
CN110059694A (en) * 2019-04-19 2019-07-26 山东大学 The intelligent identification Method of lteral data under power industry complex scene


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ALEX GRAVES et al.: "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks", Proceedings of the 23rd International Conference on Machine Learning *
BAI Xiang et al.: "Scene text detection and recognition based on deep learning", Scientia Sinica *
SHI Baoguang: "Research on deep-learning-based text detection and recognition in natural scenes", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (54)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738262A (en) * 2019-10-16 2020-01-31 北京市商汤科技开发有限公司 Text recognition method and related product
CN110969154A (en) * 2019-11-29 2020-04-07 上海眼控科技股份有限公司 Text recognition method and device, computer equipment and storage medium
CN111191649A (en) * 2019-12-31 2020-05-22 上海眼控科技股份有限公司 Method and equipment for identifying bent multi-line text image
CN111275046B (en) * 2020-01-10 2024-04-16 鼎富智能科技有限公司 Character image recognition method and device, electronic equipment and storage medium
CN111275046A (en) * 2020-01-10 2020-06-12 中科鼎富(北京)科技发展有限公司 Character image recognition method and device, electronic equipment and storage medium
CN111265317A (en) * 2020-02-10 2020-06-12 上海牙典医疗器械有限公司 Tooth orthodontic process prediction method
CN111680684B (en) * 2020-03-16 2023-09-05 广东技术师范大学 Spine text recognition method, device and storage medium based on deep learning
CN111414908A (en) * 2020-03-16 2020-07-14 湖南快乐阳光互动娱乐传媒有限公司 Method and device for recognizing caption characters in video
CN111680684A (en) * 2020-03-16 2020-09-18 广东技术师范大学 Method, device and storage medium for recognizing spine text based on deep learning
CN111310762A (en) * 2020-03-16 2020-06-19 天津得迈科技有限公司 Intelligent medical bill identification method based on Internet of things
CN111414908B (en) * 2020-03-16 2023-08-29 湖南快乐阳光互动娱乐传媒有限公司 Method and device for recognizing caption characters in video
CN111428715A (en) * 2020-03-26 2020-07-17 广州市南方人力资源评价中心有限公司 Character recognition method based on neural network
CN111539309A (en) * 2020-04-21 2020-08-14 广州云从鼎望科技有限公司 Data processing method, system, platform, equipment and medium based on OCR
CN113553885A (en) * 2020-04-26 2021-10-26 复旦大学 Natural scene text recognition method based on generation countermeasure network
CN111612045A (en) * 2020-04-29 2020-09-01 杭州电子科技大学 Universal method for acquiring target detection data set
CN111612045B (en) * 2020-04-29 2023-06-23 杭州电子科技大学 Universal method for acquiring target detection data set
CN111626292A (en) * 2020-05-09 2020-09-04 北京邮电大学 Character recognition method of building indication mark based on deep learning technology
CN111626292B (en) * 2020-05-09 2023-06-30 北京邮电大学 Text recognition method of building indication mark based on deep learning technology
CN111738255A (en) * 2020-05-27 2020-10-02 复旦大学 Guideboard text detection and recognition algorithm based on deep learning
CN111967391A (en) * 2020-08-18 2020-11-20 清华大学 Text recognition method and computer-readable storage medium for medical laboratory test reports
CN112052853A (en) * 2020-09-09 2020-12-08 国家气象信息中心 Text positioning method of handwritten meteorological archive data based on deep learning
CN112052853B (en) * 2020-09-09 2024-02-02 国家气象信息中心 Text positioning method of handwriting meteorological archive data based on deep learning
CN112115264B (en) * 2020-09-14 2024-03-22 中科苏州智能计算技术研究院 Text classification model adjustment method for data distribution change
CN112115264A (en) * 2020-09-14 2020-12-22 中国科学院计算技术研究所苏州智能计算产业技术研究院 Text classification model adjusting method facing data distribution change
CN111931773A (en) * 2020-09-24 2020-11-13 北京易真学思教育科技有限公司 Image recognition method, device, equipment and storage medium
CN111931773B (en) * 2020-09-24 2022-01-28 北京易真学思教育科技有限公司 Image recognition method, device, equipment and storage medium
CN112418225B (en) * 2020-10-16 2023-07-21 中山大学 Offline text recognition method for address scene recognition
CN112418225A (en) * 2020-10-16 2021-02-26 中山大学 Offline character recognition method for address scene recognition
CN112508023A (en) * 2020-10-27 2021-03-16 重庆大学 Deep learning-based end-to-end identification method for code-spraying characters of parts
CN112528776A (en) * 2020-11-27 2021-03-19 京东数字科技控股股份有限公司 Text line correction method and device
CN112528776B (en) * 2020-11-27 2024-04-09 京东科技控股股份有限公司 Text line correction method and device
CN112183538B (en) * 2020-11-30 2021-03-02 华南师范大学 Manchu recognition method and system
CN112183538A (en) * 2020-11-30 2021-01-05 华南师范大学 Manchu recognition method and system
CN112381175A (en) * 2020-12-05 2021-02-19 中国人民解放军32181部队 Circuit board identification and analysis method based on image processing
CN112560842B (en) * 2020-12-07 2021-10-22 马上消费金融股份有限公司 Information identification method, device, equipment and readable storage medium
CN112560842A (en) * 2020-12-07 2021-03-26 马上消费金融股份有限公司 Information identification method, device, equipment and readable storage medium
CN112528980B (en) * 2020-12-16 2022-02-15 北京华宇信息技术有限公司 OCR recognition result correction method and terminal and system thereof
CN112528980A (en) * 2020-12-16 2021-03-19 北京华宇信息技术有限公司 OCR recognition result correction method and terminal and system thereof
WO2022147965A1 (en) * 2021-01-09 2022-07-14 江苏拓邮信息智能技术研究院有限公司 Arithmetic question marking system based on mixnet-yolov3 and convolutional recurrent neural network (crnn)
EP4047519A1 (en) 2021-02-22 2022-08-24 Carl Zeiss Vision International GmbH Devices and methods for processing eyeglass prescriptions
WO2022175511A1 (en) 2021-02-22 2022-08-25 Carl Zeiss Vision International Gmbh Devices and methods for processing eyeglass prescriptions
CN112818951B (en) * 2021-03-11 2023-11-21 南京大学 Ticket identification method
CN112818951A (en) * 2021-03-11 2021-05-18 南京大学 Ticket identification method
CN112966678A (en) * 2021-03-11 2021-06-15 南昌航空大学 Text detection method and system
CN112862024A (en) * 2021-04-28 2021-05-28 明品云(北京)数据科技有限公司 Text recognition method and system
CN113516124B (en) * 2021-05-29 2023-08-11 大连民族大学 Computer-vision-based algorithm for recognizing electricity consumption readings on electric energy meters
CN113516124A (en) * 2021-05-29 2021-10-19 大连民族大学 Computer-vision-based algorithm for recognizing electricity consumption information on electric energy meters
CN113326842A (en) * 2021-06-01 2021-08-31 武汉理工大学 Financial form character recognition method
CN113435449B (en) * 2021-08-03 2023-08-22 全知科技(杭州)有限责任公司 OCR image character recognition and paragraph output method based on deep learning
CN113435449A (en) * 2021-08-03 2021-09-24 全知科技(杭州)有限责任公司 OCR image character recognition and paragraph output method based on deep learning
CN114155530A (en) * 2021-11-10 2022-03-08 北京中科闻歌科技股份有限公司 Text recognition and question-answering method, device, equipment and medium
CN114140803B (en) * 2022-01-30 2022-06-17 杭州实在智能科技有限公司 Document single word coordinate detection and correction method and system based on deep learning
CN114140803A (en) * 2022-01-30 2022-03-04 杭州实在智能科技有限公司 Document single word coordinate detection and correction method and system based on deep learning
CN114495114A (en) * 2022-04-18 2022-05-13 华南理工大学 Text sequence identification model calibration method based on CTC decoder

Similar Documents

Publication Publication Date Title
CN110399845A (en) Method for detecting and recognizing continuous paragraph text in images
CN107368831B (en) Method for recognizing English words and digits in natural scene images
Álvaro et al. An integrated grammar-based approach for mathematical expression recognition
Li et al. Improving attention-based handwritten mathematical expression recognition with scale augmentation and drop attention
Naz et al. Urdu Nasta’liq text recognition system based on multi-dimensional recurrent neural network and statistical features
Ray et al. Text recognition using deep BLSTM networks
CN111259930A (en) General target detection method of self-adaptive attention guidance mechanism
CN109492630A (en) Deep-learning-based method for detecting and locating text regions in financial industry images
CN110427937A (en) Deep-learning-based method for correcting inclined license plates and recognizing variable-length license plates
CN112818159A (en) Image description text generation method based on generation countermeasure network
Chen et al. Simultaneous script identification and handwriting recognition via multi-task learning of recurrent neural networks
CN113449801B (en) Image character behavior description generation method based on multi-level image context coding and decoding
Liu et al. ASTS: A unified framework for arbitrary shape text spotting
CN112149665A (en) High-performance multi-scale target detection method based on deep learning
CN110503090B (en) Character detection network training method based on limited attention model, character detection method and character detector
Inunganbi et al. Handwritten Meitei Mayek recognition using three‐channel convolution neural network of gradients and gray
CN113378919B (en) Image description generation method for fusing visual sense and enhancing multilayer global features
CN113420833A (en) Visual question-answering method and device based on question semantic mapping
Azizah et al. Tajweed-YOLO: Object Detection Method for Tajweed by Applying HSV Color Model Augmentation on Mushaf Images
CN110738123B (en) Method and device for identifying densely displayed commodities
Budiwati et al. Japanese character (Kana) pattern recognition application using neural network
Vankadaru et al. Text Identification from Handwritten Data using Bi-LSTM and CNN with FastAI
Echi Attention-based CNN-ConvLSTM for Handwritten Arabic Word Extraction
Yu et al. An efficient prototype-based model for handwritten text recognition with multi-loss fusion
Bi et al. Chinese character captcha sequential selection system based on convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
     Application publication date: 20191101