CN110135248A

CN110135248A - A natural scene text detection method based on deep learning

Info

Publication number: CN110135248A
Application number: CN201910270269.4A
Authority: CN
Inventors: 刘发贵; 陈成
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2019-04-03
Filing date: 2019-04-03
Publication date: 2019-08-16

Abstract

The invention discloses a natural scene text detection method based on deep learning. The method extracts multi-scale features of text using a CNN, then encodes these features with an RNN to make full use of the contextual properties of text. The feature maps are then fed into an ROI pooling layer, which outputs a series of text proposals. After non-maximum suppression, the resulting text proposals are connected by a text connector, achieving flexible and efficient multi-scale, multi-directional text detection. The invention improves the accuracy and recall of natural scene text detection under multi-directional, variable-scale conditions.

Description

A natural scene text detection method based on deep learning

Technical field

The invention belongs to the technical field of image processing, and in particular relates to a natural scene text detection method based on deep learning.

Background art

Scene text detection is an important prerequisite for text recognition and is widely used in fields such as image retrieval, machine translation, and autonomous driving. However, text detection still faces many difficulties under complex backgrounds, multiple scales, multiple languages, uneven illumination, and blur.

Diversity and variability of natural scene text: compared with text in documents, text in natural scenes may be multi-scale and multilingual, and its shape, orientation, aspect ratio, and color may all vary. These variations pose many challenges for text detection.

Complex backgrounds: scene text may appear against arbitrary backgrounds, including signs, bricks, grass, and fences. These backgrounds may have features closely resembling text and can act as noise that interferes with text detection. In addition, occlusion by foreign objects can leave text partially missing, causing potential detection errors.

Irregular image quality: because the means of acquisition cannot be controlled, imaging quality cannot be guaranteed. Images to be detected may be distorted or out of focus owing to different shooting angles or distances, or may contain noise and shadows owing to differing illumination at the time of shooting.

Methods for natural scene text detection fall into two categories: traditional detection methods and detection methods based on deep learning. Traditional methods include texture-based methods, which use features such as local intensity, filter responses, and wavelet coefficients, and region-based methods, such as the Stroke Width Transform (SWT), Maximally Stable Extremal Regions (MSER), and the Stroke Feature Transform (SFT). In recent years, with the development of deep neural networks, deep learning has shown growing advantages in computer vision. At present, the most popular approaches are deep learning methods based on convolutional neural networks (Convolutional Neural Networks, CNN). Deep learning has greatly improved the accuracy of text detection and has freed practitioners from complicated feature design work. Common deep-learning-based natural scene text detection models are typically built on general object detection models such as RCNN, YOLO, and SSD. The basic structure of these models usually uses several convolutional and pooling layers to extract features, and finally uses fully connected layers for the classification and regression of detection boxes.

Summary of the invention

To perform text detection in natural scenes more accurately and efficiently, and to solve the problem of detecting multi-directional, variable-scale text in natural scenes, the invention proposes a natural scene text detection method based on deep learning.

The object of the invention is achieved by at least one of the following technical solutions.

A natural scene text detection method based on deep learning, comprising the following steps:

(1) Construct and train a neural-network-based natural scene text detection model, including the following sub-steps:

(1.1) Build a feature extractor based on a feature pyramid network (Feature Pyramid Networks, FPN);

(1.2) Encode the features extracted by the feature extractor using a recurrent neural network (Recurrent Neural Network, RNN);

(1.3) Further improve detection precision using an ROI pooling layer;

(1.4) Finally, use fully connected layers for the classification and regression of detection boxes, forming the text detection model;

(1.5) Input annotated training images into the model, and compute the loss value with a multitask loss function comprising a classification loss and a regression loss to train the model;

(2) Use the trained natural scene text detection model to detect natural scene text in a given image, including the following sub-steps:

(2.1) Input the image to be detected, perform text detection on it with the trained model, and output the scores and coordinates of a series of text proposal detection boxes.

(2.2) Apply non-maximum suppression to the resulting text proposals to remove redundant detection boxes.

(2.3) Connect the series of text proposals using a text connector to generate the final detection result.

Compared with the prior art, the invention has the following advantages and technical effects:

(1) For variable-scale text detection, the invention uses a feature pyramid network (Feature Pyramid Networks, FPN), which can efficiently use information from convolutional layers of different sizes at the same time. Compared with methods that use only the last feature map, it exploits both the strong semantic information of the high levels and the high-resolution information of the low levels, thereby achieving higher recall and accuracy. Compared with methods based on image pyramids, it greatly reduces the amount of computation.

(2) For multi-directional text detection, the invention outputs a series of text proposals and then connects them with a text connector. Compared with methods that use arbitrary quadrilaterals or rotated rectangles, it uses fewer parameters, so the detection of multi-directional text is more flexible and efficient.

Description of the drawings

Fig. 1 is a flowchart of natural scene text detection in the embodiment.

Fig. 2 is an architecture diagram of the natural scene text detection model used in the embodiment.

Fig. 3 shows actual detection results obtained in different scenes in the embodiment using the text detection method of the invention.

Detailed description of the embodiments

To make the technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings; however, the implementation and scope of protection of the invention are not limited thereto.

The terms used in the invention are explained first:

Feature pyramid network (Feature Pyramid Networks, FPN): FPN is a direct modification of the original backbone network. The feature map at each resolution is added element-wise to the feature map of the next, coarser resolution after the latter has been upscaled by a factor of two. Through these connections, the feature map used for prediction at each level fuses features of different resolutions and different semantic strengths, and each fused feature map is used to detect objects of the corresponding size. This ensures that every level has both a suitable resolution and strong semantic features.
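The top-down fusion described above can be sketched in a few lines. This is a minimal illustration on plain nested lists, not the patented implementation: the function names are hypothetical, nearest-neighbour upscaling is an assumption (the text only says the coarser map is scaled by two), and the convolutions a full FPN applies before and after the addition are omitted.

```python
def upsample2x(fmap):
    """Nearest-neighbour upscaling of a 2-D feature map by a factor of two (assumed mode)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat every column twice
        out.append(wide)
        out.append(list(wide))                     # repeat every row twice
    return out


def add_maps(a, b):
    """Element-wise addition of two feature maps of identical size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_merge(levels):
    """Fuse feature maps ordered fine -> coarse, each half the size of the previous one.

    The coarsest map is kept as is; every finer map is added to the upscaled
    result of the level above it. Returns the fused maps in the same order.
    """
    fused = [levels[-1]]
    for finer in reversed(levels[:-1]):
        fused.append(add_maps(finer, upsample2x(fused[-1])))
    return fused[::-1]
```

For example, fusing a 4x4 map of ones with the 2x2 map `[[10, 20], [30, 40]]` leaves the coarse map unchanged and adds its upscaled copy to the fine one.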

Residual network (ResNet): a deep convolutional network model proposed by Kaiming He in 2015. Depending on the number of layers used, its variants are named ResNet-34, ResNet-50, ResNet-101, ResNet-152, and so on.

Non-maximum suppression (Non-Maximum Suppression, NMS): suppresses elements that are not maxima, and can be understood as a local maximum search. Each output detection box has a score, and these detection boxes may contain or intersect one another. NMS selects the highest-scoring detection box in a neighbourhood and suppresses the lower-scoring boxes there.

As shown in Fig. 1, the deep-learning-based natural scene text detection method of the invention comprises the following steps:

(1) Construct and train a neural-network-based natural scene text detection model, as shown in Fig. 2, including the following sub-steps:

(1.1) Build a feature extractor based on a feature pyramid network (Feature Pyramid Networks, FPN). ResNet-101 is used as the backbone network to generate the feature pyramid, and the features of levels P2 to P5 are used.

(1.2) Encode the extracted features using a recurrent neural network (Recurrent Neural Network, RNN). A bidirectional long short-term memory recurrent neural network (Bi-directional Long Short-Term Memory, Bi-LSTM) with 512 hidden layers is used as the RNN to encode the extracted features.

(1.3) Further improve detection precision using an ROI pooling layer. The ROI pooling operation is as follows:

(1.3.1) Map the ROI to the corresponding position on the feature map according to the input image;

(1.3.2) Divide the mapped region into parts of equal size, the number of parts being equal to the output dimension;

(1.3.3) Perform max pooling on each part.
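Sub-steps (1.3.1) to (1.3.3) can be sketched as follows. This is a hedged illustration for a single-channel feature map stored as nested lists; the name `roi_max_pool`, the `stride` parameter (the image-to-feature-map downsampling factor), the integer rounding, and the bin boundaries are all assumptions, and the ROI is assumed to lie inside the feature map.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(fmap, roi, out_h, out_w, stride=1):
    """Max-pool one ROI to a fixed out_h x out_w grid.

    roi is (x1, y1, x2, y2) in input-image pixels, corners inclusive.
    """
    # (1.3.1) Map the ROI from image coordinates onto the feature map
    # (floor division is an assumed rounding rule).
    x1, y1, x2, y2 = (int(c // stride) for c in roi)
    h, w = y2 - y1 + 1, x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        # (1.3.2) Rows [r0, r1) of the i-th bin; floor/ceil keeps every bin non-empty.
        r0, r1 = y1 + (i * h) // out_h, y1 + ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            c0, c1 = x1 + (j * w) // out_w, x1 + ceil_div((j + 1) * w, out_w)
            # (1.3.3) Max pooling over the bin.
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Whatever the size of the ROI, the output has the same fixed shape, which is what allows the fully connected layers in step (1.4) to follow.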

(1.4) Finally, use fully connected layers for the classification and regression of detection boxes. The ROI-pooled features pass through two fully connected layers, one for classification and one for regression. Let k be the number of output detection boxes. The classification layer has an output dimension of 2k, corresponding to text and background for each box; the regression layer has an output dimension of 4k, corresponding to the coordinates of the upper-left and lower-right corners of each detection box.
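To make the 2k and 4k output dimensions concrete, the sketch below unpacks two flat output vectors into k scored boxes. The layout is an assumption not stated in the text: class values are taken to be stored per box as (background, text), converted to a text score with a two-way softmax, and the regression values are taken to be stored per box as (x1, y1, x2, y2). The function name `decode_outputs` is hypothetical.

```python
import math


def decode_outputs(cls_out, reg_out):
    """Turn a 2k-long class vector and a 4k-long coordinate vector into k proposals.

    Returns a list of (text_score, (x1, y1, x2, y2)) tuples.
    """
    k = len(reg_out) // 4
    if len(cls_out) != 2 * k or len(reg_out) != 4 * k:
        raise ValueError("expected 2k class values and 4k coordinate values")
    proposals = []
    for i in range(k):
        background, text = cls_out[2 * i], cls_out[2 * i + 1]  # assumed ordering
        # Numerically stable two-way softmax; the text score is the 'text' probability.
        m = max(background, text)
        e_bg, e_txt = math.exp(background - m), math.exp(text - m)
        score = e_txt / (e_bg + e_txt)
        x1, y1, x2, y2 = reg_out[4 * i:4 * i + 4]
        proposals.append((score, (x1, y1, x2, y2)))
    return proposals
```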

(1.5) Input annotated training images to train the model. Training images may be annotated with either quadrilaterals or rectangles. Before being input to the model, however, each annotation must be divided into pieces of a given width: if the annotation is a quadrilateral, the minimum bounding rectangle of each piece after division is taken; if it is a rectangle, it is divided directly.
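The division of an annotation into fixed-width pieces can be sketched as below. The sketch makes several assumptions: the annotation is a convex polygon given as a list of (x, y) vertices, the strips start at its left edge, and the "minimum bounding rectangle" of each piece is read as its axis-aligned bounding box. The function names are hypothetical, and the strip width is left as a required argument because the text only calls it "a given width".

```python
def clip_to_strip(poly, x_lo, x_hi):
    """Clip a convex polygon to the vertical strip x_lo <= x <= x_hi (Sutherland-Hodgman)."""

    def clip(points, inside, x_edge):
        out = []
        for i, cur in enumerate(points):
            prev = points[i - 1]
            if inside(cur) != inside(prev):
                # The edge prev -> cur crosses the strip boundary; add the crossing point.
                t = (x_edge - prev[0]) / (cur[0] - prev[0])
                out.append((x_edge, prev[1] + t * (cur[1] - prev[1])))
            if inside(cur):
                out.append(cur)
        return out

    pts = clip(poly, lambda p: p[0] >= x_lo, x_lo)
    return clip(pts, lambda p: p[0] <= x_hi, x_hi)


def split_annotation(poly, width):
    """Cut an annotation into strips of the given width and box each piece.

    Returns a list of (x1, y1, x2, y2) rectangles, left to right.
    """
    xs = [p[0] for p in poly]
    x, x_max = min(xs), max(xs)
    boxes = []
    while x < x_max:
        piece = clip_to_strip(poly, x, min(x + width, x_max))
        if piece:
            px = [p[0] for p in piece]
            py = [p[1] for p in piece]
            boxes.append((min(px), min(py), max(px), max(py)))
        x += width
    return boxes
```

A rectangle is just a special case of the polygon input, so the same function covers both annotation types mentioned in step (1.5).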

A multitask loss function comprising a classification loss and a regression loss is designed, and the loss is calculated with it:

where L, L_cls, and L_reg are the total loss, the classification loss, and the regression loss, respectively, and λ is the weight coefficient balancing the classification loss against the regression loss. p_i is the predicted class of the i-th detection box and p_i* is its true class; t_i is the predicted coordinates of the i-th detection box and t_i* is its true coordinates.
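The formula itself is not reproduced in this text. One form consistent with the symbols listed, assuming the multitask loss commonly used by Faster R-CNN-style detectors, might read as follows. The normalisers N_cls and N_reg, the use of the true class as a switch on the regression term, the cross-entropy form of L_cls, the smooth-L1 form of L_reg, and the starred notation for ground-truth values are all assumptions rather than statements from the patent.

```latex
% Assumed overall form: classification term plus weighted regression term.
L\bigl(\{p_i\},\{t_i\}\bigr)
  = \frac{1}{N_{cls}} \sum_i L_{cls}\bigl(p_i, p_i^{*}\bigr)
  + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^{*} \, L_{reg}\bigl(t_i, t_i^{*}\bigr)

% Assumed component losses: binary cross-entropy and smooth L1.
L_{cls}\bigl(p_i, p_i^{*}\bigr)
  = -\,p_i^{*} \log p_i - \bigl(1 - p_i^{*}\bigr) \log\bigl(1 - p_i\bigr)

L_{reg}\bigl(t_i, t_i^{*}\bigr)
  = \sum_{c} \operatorname{smooth}_{L1}\bigl(t_{i,c} - t_{i,c}^{*}\bigr),
\qquad
\operatorname{smooth}_{L1}(x) =
  \begin{cases}
    0.5\,x^{2} & \text{if } |x| < 1,\\
    |x| - 0.5  & \text{otherwise.}
  \end{cases}
```

Under this reading, the regression term contributes only for boxes whose true class is text (p_i* = 1), and λ trades the two terms off against each other.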

(2) Use the trained natural scene text detection model to detect natural scene text in a given image, including the following sub-steps:

(2.1) Input the image to be detected, perform text detection on it with the trained model, and output the scores and coordinates of a series of text proposal detection boxes.

(2.2) Apply non-maximum suppression to the resulting text proposals to remove redundant detection boxes. The specific operation is as follows:

Given the list B of text proposal detection boxes and their corresponding scores S, the detection box M with the highest score is selected, removed from the set B, and added to the final detection result D. Every remaining detection box in B whose IoU with M exceeds a threshold is then removed from B. This process is repeated until B is empty.
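The procedure above maps directly onto a short greedy loop. The sketch below assumes boxes are (x1, y1, x2, y2) tuples and returns the indices of the kept boxes; the function names and the threshold value passed by the caller are illustrative, since the text does not state the threshold.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold):
    """Greedy non-maximum suppression; returns indices of kept boxes, best first."""
    # B holds the indices of the candidate boxes, sorted by descending score.
    B = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    D = []
    while B:
        m = B.pop(0)   # the highest-scoring remaining box M
        D.append(m)    # move M into the final result D
        # Drop every remaining box whose IoU with M exceeds the threshold.
        B = [i for i in B if iou(boxes[m], boxes[i]) <= threshold]
    return D
```

Sorting once up front means the highest-scoring remaining box is always at the head of B, so each round only needs one pass over the survivors.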

(2.3) Connect the series of text proposals using a text connector to generate the final detection result. The text proposals are connected using the following steps:

If proposal P_j and proposal P_i (where i and j denote different proposals) satisfy the following two conditions, proposal P_j is defined as a neighbour of proposal P_i:

(1) Proposal P_j is the nearest proposal to proposal P_i, and the distance between them is less than w_j + w_i;

(2) Proposal P_j and proposal P_i overlap by more than 0.5 in the vertical direction;

where w_i and w_j are the widths of proposals P_i and P_j, respectively. If proposal P_i is a neighbour of proposal P_j and proposal P_j is a neighbour of proposal P_i, the two proposals are connected into a single detection box. The above steps are repeated until all proposals have been connected, and the resulting detection boxes are the final output. Fig. 2 and Fig. 3 show the detection performance of the invention in natural scenes: the invention detects variable-scale, multi-directional text in natural scenes well.
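One possible reading of the connection rule is sketched below. Several details are assumptions, because the text leaves them open: "distance" is taken as the horizontal distance between box centres, "nearest" is taken among the proposals that already satisfy both conditions, vertical overlap is measured relative to the shorter of the two boxes, and two connected proposals are replaced by their common bounding box. The function names are hypothetical.

```python
def width(p):
    return p[2] - p[0]


def center_x(p):
    return (p[0] + p[2]) / 2.0


def vertical_overlap(a, b):
    """Overlap of the two y-ranges relative to the shorter box (assumed definition)."""
    inter = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return max(0.0, inter) / shorter if shorter > 0 else 0.0


def neighbour(i, props):
    """Index of the neighbour of props[i] under conditions (1) and (2), or None."""
    p = props[i]
    best, best_d = None, float("inf")
    for j, q in enumerate(props):
        if j == i:
            continue
        d = abs(center_x(q) - center_x(p))
        if d < width(p) + width(q) and vertical_overlap(p, q) > 0.5 and d < best_d:
            best, best_d = j, d
    return best


def connect(props):
    """Repeatedly merge mutually neighbouring proposals until none remain."""
    props = list(props)
    merged = True
    while merged:
        merged = False
        for i in range(len(props)):
            j = neighbour(i, props)
            if j is not None and neighbour(j, props) == i:
                a, b = props[i], props[j]
                union = (min(a[0], b[0]), min(a[1], b[1]),
                         max(a[2], b[2]), max(a[3], b[3]))
                props = [p for k, p in enumerate(props) if k not in (i, j)] + [union]
                merged = True
                break  # indices changed, so restart the scan
    return props
```

Each merge reduces the number of boxes by one, so the loop always terminates. Three adjacent 16-pixel-wide proposals on one line, for instance, collapse into a single box, while a proposal far to the right or on another line is left alone.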

Claims

1. A natural scene text detection method based on deep learning, characterized by comprising the following steps:

(1) Construct and train a neural-network-based natural scene text detection model, comprising:

(1.1) Build a feature extractor based on a feature pyramid network (Feature Pyramid Networks, FPN);

(1.2) Encode the features extracted by the feature extractor using a recurrent neural network (Recurrent Neural Network, RNN);

(1.3) Further improve detection precision using an ROI pooling layer;

(1.5) Input annotated training images into the model;

(1.6) Compute the loss value with a multitask loss function comprising a classification loss and a regression loss to train the model;

(2) Use the trained natural scene text detection model to detect natural scene text in a given image, including the following sub-steps:

(2.1) Input the image to be detected, perform text detection on it with the trained natural scene text detection model, and output the scores and coordinates of a series of text proposal detection boxes;

(2.2) Apply non-maximum suppression to the resulting text proposals to remove redundant detection boxes;

2. The natural scene text detection method based on deep learning according to claim 1, characterized in that, in constructing the neural-network-based natural scene text detection model, only levels P2 to P5 of the feature pyramid network (Feature Pyramid Networks, FPN) are used.

3. The natural scene text detection method based on deep learning according to claim 1, characterized in that, in constructing the neural-network-based natural scene text detection model, the feature pyramid network (Feature Pyramid Networks, FPN) uses ResNet-101 as the backbone network.

4. The natural scene text detection method based on deep learning according to claim 1, characterized in that, in constructing the neural-network-based natural scene text detection model, a bidirectional long short-term memory recurrent neural network (Bi-directional Long Short-Term Memory, Bi-LSTM) with 512 hidden layers is used as the recurrent neural network (Recurrent Neural Network, RNN) to encode the extracted features.

5. The natural scene text detection method based on deep learning according to claim 1, characterized in that, in constructing the neural-network-based natural scene text detection model, the loss is calculated with the following loss function:

where L, L_cls, and L_reg are the total loss, the classification loss, and the regression loss, respectively, λ is the weight coefficient balancing the classification loss against the regression loss, and p_i* is the true class of the i-th detection box.

6. The natural scene text detection method based on deep learning according to claim 5, characterized in that the classification loss is defined as follows:

where p_i is the predicted class of the i-th detection box and p_i* is the true class of the i-th detection box.

7. The natural scene text detection method based on deep learning according to claim 5, characterized in that the regression loss is defined as follows:

where t_i is the predicted coordinates of the i-th detection box and t_i* is the true coordinates of the i-th detection box.

8. The natural scene text detection method based on deep learning according to claim 1, characterized in that, in detecting the natural scene text in the given image, the text proposals are connected using the following steps:

If proposal P_j and proposal P_i satisfy the following two conditions, proposal P_j is defined as a neighbour of proposal P_i:

(1) Proposal P_j is the nearest proposal to proposal P_i, and the distance between them is less than w_j + w_i;

where w_i and w_j are the widths of proposals P_i and P_j, respectively. If proposal P_i is a neighbour of proposal P_j and proposal P_j is a neighbour of proposal P_i, the two proposals are connected into a single detection box.