CN110533041A - Regression-based multi-scale scene text detection method - Google Patents

Regression-based multi-scale scene text detection method

Info

Publication number
CN110533041A
Authority
CN
China
Prior art keywords
convolution
module
padding
stride
convolution kernel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910838235.0A
Other languages
Chinese (zh)
Other versions
CN110533041B (en)
Inventor
景小荣
朱莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jinming Information Technology Co.,Ltd.
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN201910838235.0A (granted as CN110533041B)
Publication of CN110533041A
Application granted
Publication of CN110533041B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06F18/253: Pattern recognition; fusion techniques of extracted features
    • G06N3/045: Neural network architectures; combinations of networks
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08: Neural networks; learning methods
    • G06V20/62: Scene-specific elements; text, e.g. of license plates, overlay texts or captions on TV images


Abstract

The present invention relates to a regression-based multi-scale scene text detection method and belongs to the field of digital image processing. The method specifically includes: S1: preparing sufficient training data with text-position annotations; S2: constructing a feature extraction network, comprising a bottom-up feed-forward pass and a top-down feature fusion pass, for extracting low-, mid- and high-level features of each training sample; S3: applying a cascade module to each feature layer fed into the detection layers; S4: using a regression-based detection framework, setting suitable default boxes according to the characteristics of text, and detecting the text positions in the image. The cascade module used by the present invention enlarges the receptive field of the network, so that the default boxes set for the characteristics of text fit well and the text positions in the image are finally detected accurately.

Description

Regression-based multi-scale scene text detection method
Technical field
The invention belongs to the field of digital image processing and relates to a regression-based multi-oriented scene text detection method.
Background art
With the popularity of smart devices, people can capture image information anytime and anywhere. Text in images, as a kind of high-level semantic information, provides important clues for understanding and analyzing image content. Text is a direct reflection of image content; compared with other elements it is easier to extract and understand, and many textual descriptions can be used directly, so it can easily be applied to all kinds of keyword-based image and video content retrieval and analysis. Text detection has therefore become a hot research topic in the field of computer vision.
There are many text detection methods. Traditional scene text detection methods rely on hand-crafted features, and different images often require different feature extraction schemes, so the workload is huge. At the same time, feature design places very high demands on the designer, requiring rich professional knowledge. All of this created a development bottleneck for hand-crafted features. The emergence of deep learning solved this problem.
As deep learning achieved outstanding detection results in the field of object detection, text detection methods improved from general object detection algorithms came into being. General object detection methods can be divided into two broad classes: methods based on candidate regions and methods based on regression. Unlike general object detection, the aspect ratio of text varies drastically, so making the network robust to text scale variation is an issue that must be considered. An example of a text detection algorithm improved from the candidate-region approach is the Connectionist Text Proposal Network (Detecting Text in Natural Image with Connectionist Text Proposal Network, CTPN). That algorithm notes that text sequences vary sharply in length and that the horizontal position is harder to predict than the vertical position; in order to generate text proposals more accurately, it fixes the default box width to 16 and predicts only the vertical position. Although this method realizes end-to-end training of a convolutional neural network with a recurrent neural network for the first time, extracts both the spatial features and the sequence features of text, and achieves high accuracy on multi-scale and multilingual text, it detects only horizontal text and is relatively slow. An example of a text detection algorithm improved from the regression approach is TextBoxes (A Fast Text Detector with a Single Deep Neural Network, TextBoxes), which predicts at different layers: low layers predict small targets and high layers predict large targets, and default boxes suited to text scales are designed. Although this method achieves decent speed and accuracy, the feature extraction of the middle and low layers is insufficient, so its detection of small targets is unsatisfactory.
Therefore, a text detection method with higher robustness to text scale variation is needed.
Summary of the invention
In view of this, the purpose of the present invention is to provide a regression-based multi-oriented scene text detection method, which solves the problem that existing regression-based text detection networks are not robust enough to text scale variation, sets suitable default boxes for the characteristics of text, and finally detects the text positions in an image.
In order to achieve the above objectives, the invention provides the following technical scheme:
A regression-based multi-scale scene text detection method, specifically comprising the following steps:
S1: preparing sufficient training data with text-position annotations;
S2: constructing a feature extraction network, comprising a bottom-up feed-forward pass and a top-down feature fusion pass, for extracting low-, mid- and high-level features of each training sample;
S3: applying a cascade (inception) module to each feature layer fed into the detection layers;
S4: using a regression-based detection framework, setting suitable default boxes according to the characteristics of text, and detecting the text positions in the image.
Further, in step S2, the bottom-up feed-forward network comprises: an input module, first to fifth convolution modules, first to fifth pooling modules, a recurrent neural network module, sixth to tenth convolution modules and a sixth pooling module; the input module serves as the input of the first convolution module, the first to fifth convolution modules respectively contain the first to fifth pooling modules, and the recurrent neural network module, the sixth to tenth convolution modules and the sixth pooling module are cascaded in sequence.
Further, in step S2, the top-down feature fusion means fusing high-level features with low-level features, specifically: the high-level feature map is first brought to the same size as the low-level feature map by deconvolution and then passed through a batch normalization (Batch Normalization, BatchNorm) module; the low-level feature map first passes through a convolution module whose kernel size is 1*1 with stride 1 and padding 0, followed by a BatchNorm module; finally, an element-wise dot-product operation (Eltwise) fuses the two feature layers, and the fused output serves as the output of the entire feature extraction network.
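A minimal PyTorch sketch of this fusion step, assuming the high-level map is exactly half the spatial size of the low-level map and using illustrative channel counts; the patent itself fixes only the 1*1 kernel, the BatchNorm modules and the element-wise product:

```python
import torch
import torch.nn as nn

class TopDownFusion(nn.Module):
    """Fuse a high-level feature map with a low-level one (sketch)."""
    def __init__(self, c_high, c_low, c_out):
        super().__init__()
        # High-level branch: deconvolution up to the low-level size, then BatchNorm.
        # kernel_size=2, stride=2 (an assumption) doubles the spatial resolution.
        self.deconv = nn.ConvTranspose2d(c_high, c_out, kernel_size=2, stride=2)
        self.bn_high = nn.BatchNorm2d(c_out)
        # Low-level branch: 1x1 convolution (stride 1, padding 0), then BatchNorm.
        self.lateral = nn.Conv2d(c_low, c_out, kernel_size=1, stride=1, padding=0)
        self.bn_low = nn.BatchNorm2d(c_out)

    def forward(self, high, low):
        up = self.bn_high(self.deconv(high))
        lat = self.bn_low(self.lateral(low))
        return up * lat  # element-wise (Eltwise) product fusion
```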
Further, the kernel size of the first to sixth convolution modules is 3*3 with stride 1 and padding 1; the kernel of the fifth pooling module is 3*3 with stride 1 and padding 1; the kernels of the remaining pooling modules are 2*2 with stride 2 and padding 0; the recurrent neural network module is a bi-directional long short-term memory recurrent neural network (Bi-directional Long Short-Term Memory Recurrent Neural Network, BLSTM-RNN) with 256 hidden units; the seventh convolution kernel is 1*1 with stride 1 and padding 0; each of the eighth to tenth convolution modules contains two convolution kernels, one of size 1*1 with stride 1 and padding 0 and the other of size 3*3 with stride 2 and padding 1.
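Combining the module list in step S2 with the kernel sizes just given, the following is a condensed PyTorch sketch of the bottom-up feed-forward path. The class names, the channel widths (64 to 512) and the row-wise reshaping used to feed the BLSTM are illustrative assumptions; the patent fixes only the kernel sizes, strides, padding and the 256 hidden units:

```python
import torch
import torch.nn as nn

class ConvPool(nn.Module):
    """One of the first five convolution modules: a 3x3 conv (stride 1, padding 1)
    followed by its pooling module."""
    def __init__(self, c_in, c_out, pool_k=2, pool_s=2, pool_p=0):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(pool_k, stride=pool_s, padding=pool_p)

    def forward(self, x):
        return self.pool(torch.relu(self.conv(x)))

class BottomUpBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolution modules 1-5 with pooling modules 1-5; pool 5 keeps the spatial size.
        self.stage1 = ConvPool(3, 64)
        self.stage2 = ConvPool(64, 128)
        self.stage3 = ConvPool(128, 256)
        self.stage4 = ConvPool(256, 512)
        self.stage5 = ConvPool(512, 512, pool_k=3, pool_s=1, pool_p=1)
        # Bidirectional LSTM with 256 hidden units, applied row by row (the reshaping is an assumption).
        self.blstm = nn.LSTM(512, 256, bidirectional=True, batch_first=True)
        # Convolution modules 6 (3x3, stride 1, padding 1) and 7 (1x1, stride 1, padding 0).
        self.conv6 = nn.Conv2d(512, 512, 3, stride=1, padding=1)
        self.conv7 = nn.Conv2d(512, 512, 1, stride=1, padding=0)
        # Convolution modules 8-10: a 1x1 conv (stride 1) followed by a 3x3 conv (stride 2, padding 1).
        def down(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out // 2, 1, stride=1, padding=0), nn.ReLU(inplace=True),
                nn.Conv2d(c_out // 2, c_out, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.conv8, self.conv9, self.conv10 = down(512, 512), down(512, 256), down(256, 256)
        self.pool6 = nn.MaxPool2d(2, stride=2)

    def forward(self, x):
        feats = []
        for stage in (self.stage1, self.stage2, self.stage3, self.stage4, self.stage5):
            x = stage(x)
            feats.append(x)
        # Run the BLSTM over each row of the conv-5 feature map to extract sequence features.
        n, c, h, w = x.shape
        seq = x.permute(0, 2, 3, 1).reshape(n * h, w, c)
        seq, _ = self.blstm(seq)
        x = seq.reshape(n, h, w, 512).permute(0, 3, 1, 2)
        x = torch.relu(self.conv7(torch.relu(self.conv6(x))))
        for block in (self.conv8, self.conv9, self.conv10):
            x = block(x)
            feats.append(x)
        feats.append(self.pool6(x))
        return feats  # multi-scale maps passed on to fusion and to the detection layers
```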
Further, in step S3, the cascade (inception) module comprises an input end and a feature-map cascade end, connected by four parallel convolution branches, each branch containing 1, 2 or 3 convolution modules.
Further, the cascade module comprises four parallel convolution branches:
the first convolution branch contains one convolution kernel of size 3*3 with stride 1 and padding 1;
the second convolution branch contains three convolution kernels: one of size 1*1 with stride 1 and padding 0, one of size 1*5 with stride 1 and padding 1, and one of size 5*1 with stride 1 and padding 1;
the third convolution branch contains three convolution kernels: one of size 1*1 with stride 1 and padding 0, one of size 5*1 with stride 1 and padding 1, and one of size 1*5 with stride 1 and padding 1;
the fourth convolution branch contains a pooling layer and a convolution kernel, the pooling kernel being of size 3*3 with stride 1 and padding 1 and the convolution kernel being of size 1*1 with stride 1 and padding 0;
a BatchNorm module and a ReLU module follow each of the above convolution kernels.
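A minimal PyTorch sketch of this cascade (inception) module. The branch channel width is an illustrative assumption, and the padding of the 1*5 and 5*1 kernels is chosen here so that all four branches keep the same spatial size and can be concatenated at the feature-map cascade end:

```python
import torch
import torch.nn as nn

def conv_bn_relu(c_in, c_out, k, s=1, p=0):
    """A convolution kernel followed by BatchNorm and ReLU, as described above."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, stride=s, padding=p),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class CascadeModule(nn.Module):
    def __init__(self, c_in, c_branch=64):
        super().__init__()
        # Branch 1: one 3x3 convolution kernel.
        self.b1 = conv_bn_relu(c_in, c_branch, 3, p=1)
        # Branch 2: 1x1, then 1x5, then 5x1 convolution kernels.
        self.b2 = nn.Sequential(conv_bn_relu(c_in, c_branch, 1),
                                conv_bn_relu(c_branch, c_branch, (1, 5), p=(0, 2)),
                                conv_bn_relu(c_branch, c_branch, (5, 1), p=(2, 0)))
        # Branch 3: 1x1, then 5x1, then 1x5 convolution kernels.
        self.b3 = nn.Sequential(conv_bn_relu(c_in, c_branch, 1),
                                conv_bn_relu(c_branch, c_branch, (5, 1), p=(2, 0)),
                                conv_bn_relu(c_branch, c_branch, (1, 5), p=(0, 2)))
        # Branch 4: a 3x3 pooling layer (stride 1, padding 1), then a 1x1 convolution kernel.
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                conv_bn_relu(c_in, c_branch, 1))

    def forward(self, x):
        # Concatenate the four parallel branches at the feature-map cascade end.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```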
The beneficial effects of the present invention are: the text detection method of the present invention is more robust to text scale variation. The present invention uses a convolutional-recurrent neural network to extract the spatial features and the sequence features of text simultaneously. A feature-pyramid structure with multi-layer prediction outputs is used: low-level feature maps predict small targets and high-level feature maps predict large targets. Through feature fusion, high-level semantic information is used for classification and low-level structural information assists regression, which to some extent alleviates the problems that low-level features are insufficiently abstract and that the prediction accuracy on small targets is low. Finally, applying an inception module to each feature layer fed into the detection layers further enlarges the receptive field of the network; a regression-based detection framework is then used, suitable default boxes are set for the characteristics of text, and the text positions in the image are finally detected.
Other advantages, objects and features of the present invention will be set forth to some extent in the following description and, to some extent, will become apparent to those skilled in the art upon examination of the following, or may be learned from practice of the present invention. The objects and other advantages of the present invention may be realized and obtained through the following specification.
Brief description of the drawings
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below in its preferred embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram of the network structure of the invention;
Fig. 2 is a schematic diagram of feature fusion;
Fig. 3 is a schematic diagram of the structure of the cascade inception module.
Specific embodiment
The embodiments of the present invention are described below by way of specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the contents disclosed in this specification. The present invention may also be implemented or applied through other, different specific embodiments, and various modifications or changes may be made to the details in this specification, based on different viewpoints and applications, without departing from the spirit of the present invention.
Referring to Fig. 1 to Fig. 3, which illustrate a preferred embodiment of the regression-based multi-scale scene text detection method of the present invention, the method includes the following steps:
Step 1: prepare data;
Collect several public data sets: SynthText, ICDAR2011, ICDAR2013 and SVT. SynthText, which contains 8*10^5 synthetic pictures, is used for network pre-training, while a total of 749 training pictures from ICDAR2011, ICDAR2013 and SVT are used to fine-tune the network. A total of 585 test pictures from the three data sets ICDAR2011, ICDAR2013 and SVT are used for testing.
Step 2: network pre-training, specifically including the following steps:
1) construct the network structure shown in Fig. 1;
2) pre-train the network on the SynthText synthetic data set: images normalized to 300*300 are input into the network model, the network outputs the localization results of text and the text classification scores, and the loss function shown in formula (1) is used.
The loss function consists of two parts: the binary classification loss of text lines and the default-box position regression loss of text lines; N denotes the number of matched default boxes, α=1, x is the matching matrix between default boxes and ground-truth boxes, c represents the confidence of whether each default box contains text, l represents the position predicted by the network for each default box, and g represents the position of the ground-truth box. The binary classification loss L_conf of text lines uses the cross-entropy loss, and the default-box position regression loss L_loc of text lines uses the smooth L1 loss;
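Formula (1) itself is not reproduced in this text; from the symbol definitions above it is presumably the standard SSD-style objective (a reconstruction, not a quotation from the patent):

$$ L(x, c, l, g) = \frac{1}{N}\Big( L_{conf}(x, c) + \alpha\, L_{loc}(x, l, g) \Big) $$

with L_conf the cross-entropy classification term and L_loc the smooth L1 regression term.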
3) optimize the loss obtained in 2) using the Adam optimizer (A Method for Stochastic Optimization, Adam): the loss function is minimized by the Adam optimizer, continuously updating the parameters in the network. The network is trained for 4*10^6 iterations in total; the learning rate is initialized to 10^-3 and multiplied by 0.1 every 4*10^5 iterations, and parameters are randomly dropped at a rate of 0.3.
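A minimal sketch of this pre-training schedule (PyTorch is an assumption; the patent does not name a framework, and model, criterion and batches are hypothetical placeholders for the network, the formula-(1) loss and the SynthText loader):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                       # initial learning rate 10^-3
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=400_000, gamma=0.1)

for step, (images, targets) in zip(range(4_000_000), batches):                   # 4*10^6 iterations in total
    loss = criterion(model(images), targets)                                     # loss of formula (1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                                             # lr * 0.1 every 4*10^5 iterations
```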
Step 3: network fine-tuning, specifically including the following steps:
1) fine-tune the network model obtained in Step 2 with the 749 real pictures from ICDAR2011, ICDAR2013 and SVT prepared in Step 1; data augmentation is applied to the 749 real pictures, including random flipping, adding noise, blurring and other operations;
2) default boxes of 6 different aspect ratios are set at the different output layers, namely 1, 2, 3, 5, 7 and 10 (a sketch of such default-box generation follows this list);
3) the detection layers use the cascade (inception) module, cascading convolution kernels of different sizes to increase the width of the network and enlarge its receptive field, which addresses the detection of text with extreme aspect ratios;
4) the learning rate is set to 10^-5, with 20000 iterations in total; stochastic gradient descent is used for optimization in this process, yielding the final deep neural network model;
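A minimal sketch of generating default boxes with the six aspect ratios at every cell of an output feature map; the per-layer base scale and the w = s*sqrt(r), h = s/sqrt(r) convention are assumptions borrowed from SSD-style detectors, not quoted from the patent:

```python
def default_boxes(fmap_h, fmap_w, scale, ratios=(1, 2, 3, 5, 7, 10)):
    """Return (cx, cy, w, h) default boxes, normalised to [0, 1]."""
    boxes = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            cx, cy = (j + 0.5) / fmap_w, (i + 0.5) / fmap_h
            for r in ratios:
                w = scale * r ** 0.5   # wide, flat boxes suit long text lines
                h = scale / r ** 0.5
                boxes.append((cx, cy, w, h))
    return boxes

# Example: the boxes for one 38x38 output layer with an assumed base scale of 0.1.
boxes = default_boxes(38, 38, 0.1)
```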
Step 4: test the trained network on the test set: in this step, normalized test images are input into the network model, and the network outputs the localization results of text and the text classification scores.
Finally, it is noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution of the present invention may be modified or equivalently replaced without departing from the purpose and scope of the technical solution, and all such modifications shall be covered by the scope of the claims of the present invention.

Claims (6)

1. A regression-based multi-scale scene text detection method, characterized in that the method specifically comprises the following steps:
S1: preparing sufficient training data with text-position annotations;
S2: constructing a feature extraction network, comprising a bottom-up feed-forward pass and a top-down feature fusion pass, for extracting low-, mid- and high-level features of each training sample;
S3: applying a cascade module to each feature layer fed into the detection layers;
S4: using a regression-based detection framework, setting suitable default boxes according to the characteristics of text, and detecting the text positions in the image.
2. The regression-based multi-scale scene text detection method according to claim 1, characterized in that in step S2, the bottom-up feed-forward network comprises: an input module, first to fifth convolution modules, first to fifth pooling modules, a recurrent neural network module, sixth to tenth convolution modules and a sixth pooling module; the input module serves as the input of the first convolution module, the first to fifth convolution modules respectively contain the first to fifth pooling modules, and the recurrent neural network module, the sixth to tenth convolution modules and the sixth pooling module are cascaded in sequence.
3. The regression-based multi-scale scene text detection method according to claim 1, characterized in that in step S2, the top-down feature fusion means fusing high-level features with low-level features, specifically: the high-level feature map is first brought to the same size as the low-level feature map by deconvolution and then passed through a batch normalization (Batch Normalization, BatchNorm) module; the low-level feature map first passes through a convolution module whose kernel size is 1*1 with stride 1 and padding 0, followed by a BatchNorm module; finally, an element-wise dot-product operation fuses the two feature layers, and the fused output serves as the output of the entire feature extraction network.
4. The regression-based multi-scale scene text detection method according to claim 2, characterized in that the kernel size of the first to sixth convolution modules is 3*3 with stride 1 and padding 1; the kernel of the fifth pooling module is 3*3 with stride 1 and padding 1; the kernels of the remaining pooling modules are 2*2 with stride 2 and padding 0; the recurrent neural network module is a bi-directional long short-term memory recurrent neural network (Bi-directional Long Short-Term Memory Recurrent Neural Network, BLSTM-RNN) with 256 hidden units; the seventh convolution kernel is 1*1 with stride 1 and padding 0; and each of the eighth to tenth convolution modules contains two convolution kernels, one of size 1*1 with stride 1 and padding 0 and the other of size 3*3 with stride 2 and padding 1.
5. The regression-based multi-scale scene text detection method according to claim 1, characterized in that in step S3, the cascade module comprises an input end and a feature-map cascade end connected by four parallel convolution branches, each branch containing 1, 2 or 3 convolution modules.
6. The regression-based multi-scale scene text detection method according to claim 5, characterized in that the cascade module comprises four parallel convolution branches:
the first convolution branch contains one convolution kernel of size 3*3 with stride 1 and padding 1;
the second convolution branch contains three convolution kernels: one of size 1*1 with stride 1 and padding 0, one of size 1*5 with stride 1 and padding 1, and one of size 5*1 with stride 1 and padding 1;
the third convolution branch contains three convolution kernels: one of size 1*1 with stride 1 and padding 0, one of size 5*1 with stride 1 and padding 1, and one of size 1*5 with stride 1 and padding 1;
the fourth convolution branch contains a pooling layer and a convolution kernel, the pooling kernel being of size 3*3 with stride 1 and padding 1 and the convolution kernel being of size 1*1 with stride 1 and padding 0;
a BatchNorm module and a rectified linear unit module (Rectified Linear Unit, ReLU) follow each of the above convolution kernels.
CN201910838235.0A 2019-09-05 2019-09-05 Regression-based multi-scale scene text detection method Active CN110533041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910838235.0A CN110533041B (en) 2019-09-05 2019-09-05 Regression-based multi-scale scene text detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910838235.0A CN110533041B (en) 2019-09-05 2019-09-05 Regression-based multi-scale scene text detection method

Publications (2)

Publication Number Publication Date
CN110533041A true CN110533041A (en) 2019-12-03
CN110533041B CN110533041B (en) 2022-07-01

Family

ID=68667081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910838235.0A Active CN110533041B (en) 2019-09-05 2019-09-05 Regression-based multi-scale scene text detection method

Country Status (1)

Country Link
CN (1) CN110533041B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105631426A (en) * 2015-12-29 2016-06-01 中国科学院深圳先进技术研究院 Image text detection method and device
CN107688808A (en) * 2017-08-07 2018-02-13 电子科技大学 A kind of quickly natural scene Method for text detection
CN107578060A (en) * 2017-08-14 2018-01-12 电子科技大学 A kind of deep neural network based on discriminant region is used for the method for vegetable image classification
US20190180154A1 (en) * 2017-12-13 2019-06-13 Abbyy Development Llc Text recognition using artificial intelligence
EP3534298A1 (en) * 2018-02-26 2019-09-04 Capital One Services, LLC Dual stage neural network pipeline systems and methods
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN108734169A (en) * 2018-05-21 2018-11-02 南京邮电大学 One kind being based on the improved scene text extracting method of full convolutional network
CN109086663A (en) * 2018-06-27 2018-12-25 大连理工大学 The natural scene Method for text detection of dimension self-adaption based on convolutional neural networks
CN109271967A (en) * 2018-10-16 2019-01-25 腾讯科技(深圳)有限公司 The recognition methods of text and device, electronic equipment, storage medium in image
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
WENHAO HE et al.: "Deep Direct Regression for Multi-oriented Scene Text Detection", 2017 IEEE International Conference on Computer Vision *
方清: "Natural scene text detection and recognition based on deep learning", China Masters' Theses Full-text Database, Information Science and Technology series *
杨小栋: "Multi-oriented scene text detection based on deep features", China Masters' Theses Full-text Database, Information Science and Technology series *
雷绮仑: "Research on multi-oriented natural scene text extraction methods", China Masters' Theses Full-text Database, Information Science and Technology series *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200005141A1 (en) * 2018-06-29 2020-01-02 Utechzone Co., Ltd. Automated optical inspection and classification apparatus based on a deep learning system and training apparatus thereof
US11455528B2 (en) * 2018-06-29 2022-09-27 Utechzone Co., Ltd. Automated optical inspection and classification apparatus based on a deep learning system and training apparatus thereof
CN113159079A (en) * 2020-01-07 2021-07-23 顺丰科技有限公司 Target detection method, target detection device, computer equipment and storage medium
CN111259764A (en) * 2020-01-10 2020-06-09 中国科学技术大学 Text detection method and device, electronic equipment and storage device
CN111881943A (en) * 2020-07-08 2020-11-03 泰康保险集团股份有限公司 Method, device, equipment and computer readable medium for image classification
CN112287962A (en) * 2020-08-10 2021-01-29 南京行者易智能交通科技有限公司 Training method, detection method and device of multi-scale target detection model, and terminal equipment
CN112287962B (en) * 2020-08-10 2023-06-09 南京行者易智能交通科技有限公司 Training method, detection method and device for multi-scale target detection model, and terminal equipment
CN113408525A (en) * 2021-06-17 2021-09-17 成都崇瑚信息技术有限公司 Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method
CN115393868A (en) * 2022-08-18 2022-11-25 中化现代农业有限公司 Text detection method and device, electronic equipment and storage medium
CN116704248A (en) * 2023-06-07 2023-09-05 南京大学 Serum sample image classification method based on multi-semantic unbalanced learning

Also Published As

Publication number Publication date
CN110533041B (en) 2022-07-01

Similar Documents

Publication Publication Date Title
CN110533041A (en) Multiple dimensioned scene text detection method based on recurrence
CN110334705B (en) Language identification method of scene text image combining global and local information
Yuan et al. Gated CNN: Integrating multi-scale feature layers for object detection
CN111639544B (en) Expression recognition method based on multi-branch cross-connection convolutional neural network
CN110083700A (en) A kind of enterprise's public sentiment sensibility classification method and system based on convolutional neural networks
CN108537269B (en) Weak interactive object detection deep learning method and system thereof
CN110287960A (en) The detection recognition method of curve text in natural scene image
CN109858488A (en) A kind of handwriting samples recognition methods and system based on sample enhancing
CN108830334A (en) A kind of fine granularity target-recognition method based on confrontation type transfer learning
CN110866542B (en) Depth representation learning method based on feature controllable fusion
CN109886141A (en) A kind of pedestrian based on uncertainty optimization discrimination method again
CN106919920A (en) Scene recognition method based on convolution feature and spatial vision bag of words
CN110414344A (en) A kind of human classification method, intelligent terminal and storage medium based on video
CN108427740B (en) Image emotion classification and retrieval algorithm based on depth metric learning
CN111598183A (en) Multi-feature fusion image description method
CN106919710A (en) A kind of dialect sorting technique based on convolutional neural networks
CN112507904B (en) Real-time classroom human body posture detection method based on multi-scale features
CN109344898A (en) Convolutional neural networks image classification method based on sparse coding pre-training
CN110070106A (en) Smog detection method, device and electronic equipment
CN106874929A (en) A kind of pearl sorting technique based on deep learning
CN111723667A (en) Human body joint point coordinate-based intelligent lamp pole crowd behavior identification method and device
CN109726671A (en) The action identification method and system of expression study from the overall situation to category feature
Agrawal et al. Image caption generator using attention mechanism
CN116150747A (en) Intrusion detection method and device based on CNN and SLTM
CN114398485B (en) Expert portrait construction method and device based on multi-view fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240623

Address after: 518000 1104, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Patentee after: Shenzhen Hongyue Enterprise Management Consulting Co.,Ltd.

Country or region after: China

Address before: 400065 Chongqing Nan'an District huangjuezhen pass Chongwen Road No. 2

Patentee before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS

Country or region before: China

TR01 Transfer of patent right

Effective date of registration: 20240625

Address after: 200030, Room 901-1606, Building 4, No. 2377 Shenkun Road, Minhang District, Shanghai

Patentee after: Shanghai Jinming Information Technology Co.,Ltd.

Country or region after: China

Address before: 518000 1104, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Patentee before: Shenzhen Hongyue Enterprise Management Consulting Co.,Ltd.

Country or region before: China