CN107563379A - Method for locating text in natural scene images - Google Patents

Method for locating text in natural scene images Download PDF

Info

Publication number
CN107563379A
Authority
CN
China
Prior art keywords
convolutional layer
output
layer
neural networks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710781807.7A
Other languages
Chinese (zh)
Other versions
CN107563379B (en)
Inventor
宋彬
黄家冕
郭洁
王丹
秦浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201710781807.7A priority Critical patent/CN107563379B/en
Publication of CN107563379A publication Critical patent/CN107563379A/en
Application granted granted Critical
Publication of CN107563379B publication Critical patent/CN107563379B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method for locating text in natural scene images based on a fully convolutional neural network. Its steps are: (1) input the image samples to be recognized; (2) normalize them; (3) build and train a fully convolutional neural network; (4) screen the coordinate parameters output by the fully convolutional neural network; (5) locate the text in the natural scene image. The invention builds and trains a fully convolutional neural network that takes images containing text under natural scenes as input. It effectively overcomes the problems of the prior art, in which a single shallow feature is insufficient to characterize deep text information and manually extracted features lead to a large amount of computation and prevent end-to-end automatic text localization. The invention combines multiple image features, obtains richer and deeper text information, and improves the accuracy of text localization in natural scene images.

Description

Method for locating text in natural scene images
Technical field
The invention belongs to the technical field of image processing, and further relates to a method, within the field of image positioning, for locating text in natural scene images. The invention is directed at natural images containing text and uses a fully convolutional neural network that incorporates angle variation, so that text can be located in images of arbitrary size.
Background technology
With the rapid growth of image data, extracting text from large numbers of natural scene images has become a research hotspot, and text localization has become a very important research topic in the field of image processing.
At present there are many papers and patents on text localization in natural scene images. In terms of the technical routes taken, text localization techniques are broadly divided into four kinds: techniques based on connected regions, techniques based on edges, techniques based on texture, and techniques based on corner points. There are also text localization techniques that combine two of these. Some texture-based text localization methods can locate text regions, but for certain non-text regions under complex scenes whose texture resembles that of text regions, positioning deviations easily occur. Some corner-based text localization methods perform better on Chinese text than on English text, because Chinese characters are complex and their strokes form more corner points. These methods belong to the category of shallow learning: they depend on feature extraction, require a large amount of manual work, and use only a single feature, which is insufficient to characterize the target information comprehensively and results in poor text localization.
The Institute of Information Engineering of the Chinese Academy of Sciences, in its patent application "Method of locating text in complex background images" (application number CN201610153384.X, publication number CN105825216A), discloses a method for locating text in images with complex backgrounds. The method first applies the MSERs algorithm separately on the R, G and B channels of the colour image to be processed, obtaining the coordinates of the three sets of MSERs regions on the colour image; it then denoises the MSERs regions, extracts preset features, and classifies the candidate MSERs regions based on those features to obtain the MSERs regions containing text; finally it connects the resulting text blocks into text lines and removes duplicates. Although this is a Chinese text localization method applicable to complex scenes, it still has shortcomings: because it relies on a connected-region technique and on preset features, it takes a long time, requires a large amount of manual work, and is computationally expensive; it does not extract deeper features and cannot characterize text information comprehensively, which leads to poor localization, false detection of non-text regions that resemble text, and failure to achieve end-to-end automatic text localization.
M. Basavanna et al., in their paper "Multi-Oriented Text Detection in Scene Images" (International Journal of Pattern Recognition & Artificial Intelligence, 2012, 26(07): 1255010-1 to 1255010-19), disclose an edge-based text localization method. The method first extracts image edges with the Sobel operator to obtain the Sobel edge features of the image; then, following the rule that the spacing between character edges is fixed, it uses edge growing to enlarge the computed horizontal character-edge spacing; finally, to solve the character-sticking problem caused by very small edge spacing between characters, it applies a zero-crossing method, thereby locating the text in the image. The shortcoming of this method is that, because it uses the zero-crossing method, the text must be distributed horizontally, otherwise crossings occur; for tilted text, the localization accuracy of this method is therefore not high.
Summary of the invention
The purpose of the present invention is to address the above shortcomings of the prior art by proposing a method for locating text in natural scene images. Compared with other natural scene text localization techniques in the prior art, the present invention requires no manually designed features, adapts well to different scenes, and achieves high accuracy.
The specific steps implemented by the present invention are as follows:
(1) Input the image samples to be recognized:
(1a) from the synthetic text-image dataset and the training set of the natural scene Chinese text reading dataset RCTW-17, randomly extract 32000 images with known text coordinates to form the training sample set;
(1b) photograph 200 images containing text in natural scenes to form the test sample set;
(2) Normalization:
(2a) scale each sample of the training sample set and the test sample set to 416 × 416 to form the scaled training sample set and test sample set;
(2b) normalize the pixel values of each sample in the scaled training sample set and test sample set to obtain the normalized training sample set and test sample set;
(3) Build and train the fully convolutional neural network:
(3a) build a fully convolutional neural network containing 23 layers;
(3b) input the samples of the normalized training sample set into the fully convolutional neural network and train it until the loss value of the output vector of the output layer of the fully convolutional neural network is less than or equal to 10;
(4) Screen the coordinate parameters in the output vector of the fully convolutional neural network:
(4a) take one untested sample from the normalized test sample set as the input sample;
(4b) input the current input sample into the trained fully convolutional neural network to obtain, for the current sample, the text detection probability value and the coordinate parameters in the output vector of the output layer of the fully convolutional neural network;
(4c) judge whether the text detection probability value in the output vector of the current input sample is greater than or equal to 0.6; if so, perform step (4d), otherwise perform step (4e);
(4d) retain the coordinate parameters corresponding to the text detection probability value in the output vector of the current input sample, and perform step (4f);
(4e) discard the coordinate parameters corresponding to the text detection probability value in the output vector of the current input sample, and perform step (4f);
(4f) judge whether the test sample set still contains untested samples; if so, perform step (4a), otherwise perform step (5);
(5) Locate the text in the natural scene image:
use the coordinate parameters retained from the output vector of the output layer of the fully convolutional neural network to mark, in turn, the text of each sample in the test sample set.
Compared with the prior art, the present invention has the following advantages:
First, because the present invention adopts a fully convolutional neural network containing 23 layers, it extracts deeper and more comprehensive text-information features from natural scene images. This overcomes the problem in the prior art that a single shallow feature characterizes text information incompletely and leads to poor text localization, so that the present invention can describe the text features of natural scene images comprehensively and meticulously and improves the accuracy of text localization.
Second, because the present invention trains the fully convolutional neural network and obtains from it the coordinate parameters and the text detection probability, it overcomes the problems in the prior art that manual feature extraction makes localization computationally expensive and time-consuming, prevents end-to-end automatic text localization, and performs poorly on tilted text. The present invention can therefore learn deeper image features automatically without relying on manual feature extraction, reduces the amount of computation, achieves end-to-end automatic text localization, and improves the localization of tilted text.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the flow chart of the step in which the present invention screens the coordinate parameters output by the fully convolutional neural network;
Fig. 3 shows one image from the training sample set input in the simulation experiment of the present invention;
Fig. 4 shows a test sample input image and a test sample output image of the simulation experiment of the present invention.
Embodiment
The present invention will be further described below in conjunction with the accompanying drawings.
With reference to Fig. 1, the specific steps implemented by the present invention are as follows.
Step 1: input the image samples to be recognized.
From the synthetic text-image dataset and the training set of the natural scene Chinese text reading dataset RCTW-17, randomly extract 32000 images with known text coordinates to form the training sample set.
Photograph 200 images containing text in natural scenes to form the test sample set.
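The assembly of the two sample sets in Step 1 can be sketched in a few lines of Python; the directory names and the JPEG extension below are placeholders for wherever the synthetic dataset, the RCTW-17 training set and the photographed test images are actually stored, and are not specified by the patent.

```python
import glob
import random

# Placeholder paths: the patent draws its training images from a synthetic
# text-image dataset plus the RCTW-17 training set, and uses 200 photographs
# taken under natural scenes as the test sample set.
synthetic_files = glob.glob("synthetic_text/*.jpg")
rctw17_files = glob.glob("RCTW-17/train/*.jpg")

# Random extraction of 32000 images with known text coordinates.
train_files = random.sample(synthetic_files + rctw17_files, 32000)

# 200 images containing text photographed in natural scenes.
test_files = glob.glob("natural_scene_photos/*.jpg")
```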
Step 2: normalization.
Scale each sample of the training sample set and the test sample set to 416 × 416 to form the scaled training sample set and test sample set.
Normalize the pixel values of each sample in the scaled training sample set and test sample set to obtain the normalized training sample set and test sample set.
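Step 2 can be written as the following short sketch; scaling to 416 × 416 follows the patent, while dividing the pixel values by 255 is only one common way to realise the pixel-value normalization the patent leaves unspecified.

```python
import cv2
import numpy as np

def normalize_sample(path):
    """Scale one sample to 416 x 416 and normalize its pixel values."""
    img = cv2.imread(path)                    # input: BGR image, uint8
    img = cv2.resize(img, (416, 416))         # step (2a): fixed 416 x 416 size
    return img.astype(np.float32) / 255.0     # step (2b): pixel values mapped to [0, 1]
```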
Step 3: build and train the fully convolutional neural network.
Build a fully convolutional neural network containing 23 layers. The structure of the 23-layer fully convolutional neural network is, in order: convolutional layer Conv1, convolutional layer Conv2, no-sampling convolutional layer NSConv3, low-step convolutional layer LSConv4, convolutional layer Conv5, no-sampling convolutional layer NSConv6, low-step convolutional layer LSConv7, convolutional layer Conv8, no-sampling convolutional layer NSConv9, low-step convolutional layer LSConv10, no-sampling convolutional layer NSConv11, low-step convolutional layer LSConv12, convolutional layer Conv13, no-sampling convolutional layer NSConv14, low-step convolutional layer LSConv15, no-sampling convolutional layer NSConv16, low-step convolutional layer LSConv17, no-sampling convolutional layer NSConv18, no-sampling convolutional layer NSConv19, no-sampling convolutional layer NSConv20, no-sampling convolutional layer NSConv21, low-step convolutional layer LSConv22, and the output layer. The specific steps of building the 23-layer fully convolutional neural network are as follows:
1st step: input the samples of the normalized training sample set to convolutional layer Conv1 and apply to them, in turn, a convolution operation, batch normalization, a linear ReLU transform and a maximum down-sampling operation, obtaining the 32 feature vector maps of 208 × 208 pixels output by Conv1;
2nd step: input the feature vector maps output by Conv1 to convolutional layer Conv2 and apply to them, in turn, a convolution operation, batch normalization, a linear ReLU transform and a maximum down-sampling operation, obtaining the 64 feature vector maps of 104 × 104 pixels output by Conv2;
3rd step: input the feature vector maps output by Conv2 to no-sampling convolutional layer NSConv3 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 128 feature vector maps of 104 × 104 pixels output by NSConv3;
4th step: input the feature vector maps output by NSConv3 to low-step convolutional layer LSConv4 and apply to them, in turn, a low-step convolution operation and a linear ReLU transform, obtaining the 64 feature vector maps of 104 × 104 pixels output by LSConv4;
5th step: input the feature vector maps output by LSConv4 to convolutional layer Conv5 and apply to them, in turn, a convolution operation, batch normalization, a linear ReLU transform and a maximum down-sampling operation, obtaining the 128 feature vector maps of 52 × 52 pixels output by Conv5;
6th step: input the feature vector maps output by Conv5 to no-sampling convolutional layer NSConv6 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 256 feature vector maps of 52 × 52 pixels output by NSConv6;
7th step: input the feature vector maps output by NSConv6 to low-step convolutional layer LSConv7 and apply to them, in turn, a low-step convolution operation and a linear ReLU transform, obtaining the 128 feature vector maps of 52 × 52 pixels output by LSConv7;
8th step: input the feature vector maps output by LSConv7 to convolutional layer Conv8 and apply to them, in turn, a convolution operation, batch normalization, a linear ReLU transform and a maximum down-sampling operation, obtaining the 256 feature vector maps of 26 × 26 pixels output by Conv8;
9th step: input the feature vector maps output by Conv8 to no-sampling convolutional layer NSConv9 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 512 feature vector maps of 26 × 26 pixels output by NSConv9;
10th step: input the feature vector maps output by NSConv9 to low-step convolutional layer LSConv10 and apply to them, in turn, a low-step convolution operation and a linear ReLU transform, obtaining the 256 feature vector maps of 26 × 26 pixels output by LSConv10;
11th step: input the feature vector maps output by LSConv10 to no-sampling convolutional layer NSConv11 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 512 feature vector maps of 26 × 26 pixels output by NSConv11;
12th step: input the feature vector maps output by NSConv11 to low-step convolutional layer LSConv12 and apply to them, in turn, a low-step convolution operation and a linear ReLU transform, obtaining the 256 feature vector maps of 26 × 26 pixels output by LSConv12;
13th step: input the feature vector maps output by LSConv12 to convolutional layer Conv13 and apply to them, in turn, a convolution operation, batch normalization, a linear ReLU transform and a maximum down-sampling operation, obtaining the 512 feature vector maps of 13 × 13 pixels output by Conv13;
14th step: input the feature vector maps output by Conv13 to no-sampling convolutional layer NSConv14 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 1024 feature vector maps of 13 × 13 pixels output by NSConv14;
15th step: input the feature vector maps output by NSConv14 to low-step convolutional layer LSConv15 and apply to them, in turn, a low-step convolution operation and a linear ReLU transform, obtaining the 512 feature vector maps of 13 × 13 pixels output by LSConv15;
16th step: input the feature vector maps output by LSConv15 to no-sampling convolutional layer NSConv16 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 1024 feature vector maps of 13 × 13 pixels output by NSConv16;
17th step: input the feature vector maps output by NSConv16 to low-step convolutional layer LSConv17 and apply to them, in turn, a low-step convolution operation and a linear ReLU transform, obtaining the 512 feature vector maps of 13 × 13 pixels output by LSConv17;
18th step: input the feature vector maps output by LSConv17 to no-sampling convolutional layer NSConv18 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 1024 feature vector maps of 13 × 13 pixels output by NSConv18;
19th step: input the feature vector maps output by NSConv18 to no-sampling convolutional layer NSConv19 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 1024 feature vector maps of 13 × 13 pixels output by NSConv19;
20th step: input the feature vector maps output by NSConv19 to no-sampling convolutional layer NSConv20 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 1024 feature vector maps of 13 × 13 pixels output by NSConv20;
21st step: input the feature vector maps output by NSConv20 to no-sampling convolutional layer NSConv21 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 1024 feature vector maps of 13 × 13 pixels output by NSConv21;
22nd step: input the feature vector maps output by NSConv21 to low-step convolutional layer LSConv22 and apply to them, in turn, a low-step convolution operation and a linear ReLU transform, obtaining the 40 feature vector maps of 13 × 13 pixels output by LSConv22;
23rd step: input the feature vector maps output by LSConv22 to the output layer and apply to them, in turn, a linear transform and a non-linear sigmoid transform, obtaining the output vector of the output layer, where the output vector consists of the coordinate parameters and the text detection probability value output in the output layer of the fully convolutional neural network.
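The 23-layer structure above can be sketched with the Keras API as below. The patent fixes the layer order, the operations in each layer type and the number of output feature maps, but not the kernel sizes; the 3 × 3 kernels for the Conv/NSConv layers, the 1 × 1 kernels for the low-step LSConv layers, and applying the sigmoid to every output channel are assumptions of this sketch, not details taken from the patent.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv(x, filters):       # Conv: convolution, batch normalization, ReLU, maximum down-sampling
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.MaxPooling2D(2)(x)

def ns_conv(x, filters):    # NSConv: convolution, batch normalization, ReLU (no down-sampling)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def ls_conv(x, filters):    # LSConv: "low-step" convolution followed by ReLU (1x1 kernel assumed)
    x = layers.Conv2D(filters, 1, padding="same")(x)
    return layers.ReLU()(x)

inputs = layers.Input(shape=(416, 416, 3))
x = conv(inputs, 32)        # Conv1  -> 208 x 208 x 32
x = conv(x, 64)             # Conv2  -> 104 x 104 x 64
x = ns_conv(x, 128)         # NSConv3
x = ls_conv(x, 64)          # LSConv4
x = conv(x, 128)            # Conv5  -> 52 x 52 x 128
x = ns_conv(x, 256)         # NSConv6
x = ls_conv(x, 128)         # LSConv7
x = conv(x, 256)            # Conv8  -> 26 x 26 x 256
x = ns_conv(x, 512)         # NSConv9
x = ls_conv(x, 256)         # LSConv10
x = ns_conv(x, 512)         # NSConv11
x = ls_conv(x, 256)         # LSConv12
x = conv(x, 512)            # Conv13 -> 13 x 13 x 512
x = ns_conv(x, 1024)        # NSConv14
x = ls_conv(x, 512)         # LSConv15
x = ns_conv(x, 1024)        # NSConv16
x = ls_conv(x, 512)         # LSConv17
x = ns_conv(x, 1024)        # NSConv18
x = ns_conv(x, 1024)        # NSConv19
x = ns_conv(x, 1024)        # NSConv20
x = ns_conv(x, 1024)        # NSConv21
x = ls_conv(x, 40)          # LSConv22 -> 13 x 13 x 40
outputs = layers.Conv2D(40, 1, activation="sigmoid")(x)   # output layer: linear transform + sigmoid
model = tf.keras.Model(inputs, outputs)
```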
Input the samples of the normalized training sample set into the fully convolutional neural network and train it until the loss value of the output vector of the output layer of the fully convolutional neural network is less than or equal to 10. The specific steps of training the fully convolutional neural network are as follows:
1st step: take one sample from the training sample set and input it into the constructed fully convolutional neural network, then obtain the corresponding output from the output layer of the fully convolutional neural network; in this stage the information of the sample is transformed layer by layer through the fully convolutional neural network and delivered to its output layer;
2nd step: compute the loss value of the output vector of the output layer of the fully convolutional neural network, and use the ADAM algorithm to make a supervised adjustment of all parameters of the fully convolutional neural network so that the loss value of the output vector of the output layer gradually decreases;
3rd step: repeat the 1st and 2nd steps until the loss value of the output vector of the output layer of the fully convolutional neural network is less than or equal to 10, then stop the iteration, and obtain and save the trained fully convolutional neural network.
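A minimal training-loop sketch under the same Keras assumptions follows. Here `model` is the network sketched above and `output_vector_loss` is a loss function like the one sketched after the loss formula below; the batching, learning rate and the way targets are arranged into output vectors are not specified by the patent and are left at placeholder/library defaults.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()          # the ADAM algorithm of the 2nd step

def train(model, dataset, loss_threshold=10.0):
    for images, targets in dataset:             # 1st step: feed a training sample forward
        with tf.GradientTape() as tape:
            predictions = model(images, training=True)
            loss = output_vector_loss(targets, predictions)   # 2nd step: loss of the output vector
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))  # supervised adjustment of all parameters
        if float(loss) <= loss_threshold:        # 3rd step: stop once the loss is <= 10
            break
    return model
```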
The loss value of the output vector of the output layer of the fully convolutional neural network is obtained by the following formula:

\[
\begin{cases}
X = (x^{p} - x^{r})^{2} \\
Y = (y^{p} - y^{r})^{2} \\
W = (w^{p} - w^{r})^{2} \\
H = (h^{p} - h^{r})^{2} \\
S = (s^{p} - s^{r})^{2} \\
T = (t^{p} - t^{r})^{2} \\
C = (c^{p} - c^{r})^{2} \\
l = \frac{5}{2}\left[ X + Y + W + H + S + T \right] + \frac{3}{2} C
\end{cases}
\]

where X denotes the loss value of the abscissa value in the output vector of the output layer of the fully convolutional neural network, x^p the abscissa value in the output vector, and x^r the true abscissa value of the text region in the input sample; Y denotes the loss value of the ordinate value, y^p the ordinate value in the output vector, and y^r the true ordinate value of the text region in the input sample; W denotes the loss value of the width value, w^p the width value in the output vector, and w^r the true width value of the text region in the input sample; H denotes the loss value of the height value, h^p the height value in the output vector, and h^r the true height value of the text region in the input sample; S denotes the loss value of the sine value, s^p the sine value in the output vector, and s^r the true sine value of the text region in the input sample; T denotes the loss value of the cosine value, t^p the cosine value in the output vector, and t^r the true cosine value of the text region in the input sample; C denotes the loss value of the text detection probability value, c^p the text detection probability value in the output vector, and c^r the true (ideal) text detection probability value of the text region in the input sample; and l denotes the loss value of the output vector.
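Written as code, the loss above becomes the short function below. Here `true` and `pred` each hold the seven components (x, y, w, h, sin, cos, c) of one output vector; how the network's 13 × 13 × 40 output tensor is mapped onto such vectors is not detailed in the patent and is therefore not shown.

```python
def output_vector_loss(true, pred):
    """Loss l of one output vector, following the formula above."""
    X, Y, W, H, S, T, C = [(p - t) ** 2 for p, t in zip(pred, true)]
    return 2.5 * (X + Y + W + H + S + T) + 1.5 * C
```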
Step 4: screen the coordinate parameters output by the output layer of the fully convolutional neural network.
With reference to Fig. 2, the specific steps for screening the coordinate parameters output by the output layer of the fully convolutional neural network are described as follows.
1st step: take one untested sample from the normalized test sample set as the input sample;
2nd step: input the current input sample into the trained fully convolutional neural network to obtain, for the current sample, the text detection probability value and the coordinate parameters in the output vector of the output layer of the fully convolutional neural network;
3rd step: judge whether the text detection probability value in the output vector of the current input sample is greater than or equal to 0.6; if so, perform the 4th step of this step, otherwise perform the 5th step of this step;
4th step: retain the coordinate parameters corresponding to the text detection probability value in the output vector of the current input sample, and perform the 6th step of this step;
5th step: discard the coordinate parameters corresponding to the text detection probability value in the output vector of the current input sample, and perform the 6th step of this step;
6th step: judge whether the test sample set still contains untested samples; if so, perform the 1st step of this step, otherwise perform Step 5.
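The screening of the 3rd to 5th steps reduces to a single threshold test; the sketch below assumes each output vector has already been unpacked into its seven components.

```python
def screen_coordinates(output_vectors, threshold=0.6):
    """Keep the coordinate parameters of every output vector whose
    text detection probability value is at least the threshold."""
    retained = []
    for x, y, w, h, s, t, c in output_vectors:
        if c >= threshold:                     # retain the coordinate parameters
            retained.append((x, y, w, h, s, t))
        # otherwise the coordinate parameters are discarded
    return retained
```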
Step 5: locate the text in the natural scene image.
Use the coordinate parameters retained from the output vector of the output layer of the fully convolutional neural network to mark, in turn, the text of each sample in the test sample set.
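One way to realise the calibration of Step 5 is sketched below, assuming the retained parameters (x, y, w, h, sin, cos) describe the centre, size and orientation of a rotated box in image pixels; the patent does not spell out the drawing procedure, so this OpenCV-based sketch is only illustrative.

```python
import cv2
import numpy as np

def calibrate(image, retained_params):
    """Draw each retained text region onto the test image."""
    for x, y, w, h, s, c in retained_params:
        angle = np.degrees(np.arctan2(s, c))          # rotation recovered from (sin, cos)
        box = cv2.boxPoints(((x, y), (w, h), angle))  # four corners of the rotated box
        cv2.polylines(image, [box.astype(np.int32)], True, (0, 255, 0), 2)
    return image
```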
The effect of the present invention is further described below with reference to a simulation experiment.
1. Simulation experiment conditions:
Hardware platform: Intel Core i7-6700 CPU @ 3.40 GHz, 32 GB RAM, Nvidia GeForce GTX 1060 6 GB GPU. Software platform: Python 3.5.2, TensorFlow 1.0.1.
2. Experiment content and analysis of results:
The training sample set used in the simulation experiment of the present invention consists of 32000 images with known text coordinates randomly extracted from the synthetic text-image dataset and the training set of the natural scene Chinese text reading dataset RCTW-17.
The image shown in Fig. 3(a) is an image in the training sample set that belongs to the synthetic text-image dataset, and the image shown in Fig. 3(b) is a training image in the training sample set that belongs to the training set of the natural scene Chinese text reading dataset RCTW-17. The 200 images containing text photographed in natural scenes form the test sample set.
The image shown in Fig. 4(a) is a test image from the test sample set. The image shown in Fig. 4(b) is the test image of Fig. 4(a) after it has passed through the 23-layer fully convolutional neural network and been calibrated.
The present invention first establishes the training sample set containing 32000 images and the test sample set containing 200 images. The training sample set and the test sample set are normalized to obtain the normalized training sample set and test sample set. The 32000 training samples are used to train the 23-layer fully convolutional neural network, yielding the trained 23-layer fully convolutional neural network. The 200 test samples are input into the trained 23-layer fully convolutional neural network to obtain the output coordinate parameters and text detection probabilities; the coordinate parameters retained after screening are used to calibrate the input test samples.
Table 1 shows the simulation results of the present invention: of the 200 test samples, the number of samples in which the text was located successfully is 162, and the number of samples in which all text localization failed is 38.
Table 1. Text localization results on the test sample set under natural scenes

                     Successfully located    Failed    All samples
Number of samples    162                     38        200
Ratio                81.0%                   19.0%     100.0%
It can be seen from Table 1 that, using the method for locating text in natural scene images proposed by the present invention, the accuracy of successful text localization on images containing text under natural scenes is 81.0%. This demonstrates that, by building and training the fully convolutional neural network, the present invention can extract deeper text information from the image, and has the advantages of combining multiple image features, obtaining more comprehensive text information, and improving the accuracy of text localization under natural scenes.
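The percentages in Table 1 follow directly from the reported counts:

```python
successful, failed = 162, 38
total = successful + failed                        # 200 test samples
print(f"success rate: {successful / total:.1%}")   # 81.0%
print(f"failure rate: {failed / total:.1%}")       # 19.0%
```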

Claims (7)

1. A method for locating text in natural scene images, characterised in that it comprises the following steps:
(1) Input the image samples to be recognized:
(1a) from the synthetic text-image dataset and the training set of the natural scene Chinese text reading dataset RCTW-17, randomly extract 32000 images with known text coordinates to form the training sample set;
(1b) photograph 200 images containing text in natural scenes to form the test sample set;
(2) Normalization:
(2a) scale each sample of the training sample set and the test sample set to 416 × 416 to form the scaled training sample set and test sample set;
(2b) normalize the pixel values of each sample in the scaled training sample set and test sample set to obtain the normalized training sample set and test sample set;
(3) Build and train the fully convolutional neural network:
(3a) build a fully convolutional neural network containing 23 layers;
(3b) input the samples of the normalized training sample set into the fully convolutional neural network and train it until the loss value of the output vector of the output layer of the fully convolutional neural network is less than or equal to 10;
(4) Screen the coordinate parameters in the output vector of the fully convolutional neural network:
(4a) take one untested sample from the normalized test sample set as the input sample;
(4b) input the current input sample into the trained fully convolutional neural network to obtain, for the current sample, the text detection probability value and the coordinate parameters in the output vector of the output layer of the fully convolutional neural network;
(4c) judge whether the text detection probability value in the output vector of the current input sample is greater than or equal to 0.6; if so, perform step (4d), otherwise perform step (4e);
(4d) retain the coordinate parameters corresponding to the text detection probability value in the output vector of the current input sample, and perform step (4f);
(4e) discard the coordinate parameters corresponding to the text detection probability value in the output vector of the current input sample, and perform step (4f);
(4f) judge whether the test sample set still contains untested samples; if so, perform step (4a), otherwise perform step (5);
(5) Locate the text in the natural scene image:
use the coordinate parameters retained from the output vector of the output layer of the fully convolutional neural network to mark, in turn, the text of each sample in the test sample set.
2. The method for locating text in natural scene images according to claim 1, characterised in that: the structure of the 23-layer fully convolutional neural network described in step (3a) is, in order: convolutional layer Conv1, convolutional layer Conv2, no-sampling convolutional layer NSConv3, low-step convolutional layer LSConv4, convolutional layer Conv5, no-sampling convolutional layer NSConv6, low-step convolutional layer LSConv7, convolutional layer Conv8, no-sampling convolutional layer NSConv9, low-step convolutional layer LSConv10, no-sampling convolutional layer NSConv11, low-step convolutional layer LSConv12, convolutional layer Conv13, no-sampling convolutional layer NSConv14, low-step convolutional layer LSConv15, no-sampling convolutional layer NSConv16, low-step convolutional layer LSConv17, no-sampling convolutional layer NSConv18, no-sampling convolutional layer NSConv19, no-sampling convolutional layer NSConv20, no-sampling convolutional layer NSConv21, low-step convolutional layer LSConv22, and the output layer.
3. The method for locating text in natural scene images according to claim 1, characterised in that: the specific steps of building the 23-layer fully convolutional neural network described in step (3a) are as follows:
1st step: input the samples of the normalized training sample set to convolutional layer Conv1 and apply to them, in turn, a convolution operation, batch normalization, a linear ReLU transform and a maximum down-sampling operation, obtaining the 32 feature vector maps of 208 × 208 pixels output by Conv1;
2nd step: input the feature vector maps output by Conv1 to convolutional layer Conv2 and apply to them, in turn, a convolution operation, batch normalization, a linear ReLU transform and a maximum down-sampling operation, obtaining the 64 feature vector maps of 104 × 104 pixels output by Conv2;
3rd step: input the feature vector maps output by Conv2 to no-sampling convolutional layer NSConv3 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 128 feature vector maps of 104 × 104 pixels output by NSConv3;
4th step: input the feature vector maps output by NSConv3 to low-step convolutional layer LSConv4 and apply to them, in turn, a low-step convolution operation and a linear ReLU transform, obtaining the 64 feature vector maps of 104 × 104 pixels output by LSConv4;
5th step: input the feature vector maps output by LSConv4 to convolutional layer Conv5 and apply to them, in turn, a convolution operation, batch normalization, a linear ReLU transform and a maximum down-sampling operation, obtaining the 128 feature vector maps of 52 × 52 pixels output by Conv5;
6th step: input the feature vector maps output by Conv5 to no-sampling convolutional layer NSConv6 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 256 feature vector maps of 52 × 52 pixels output by NSConv6;
7th step: input the feature vector maps output by NSConv6 to low-step convolutional layer LSConv7 and apply to them, in turn, a low-step convolution operation and a linear ReLU transform, obtaining the 128 feature vector maps of 52 × 52 pixels output by LSConv7;
8th step: input the feature vector maps output by LSConv7 to convolutional layer Conv8 and apply to them, in turn, a convolution operation, batch normalization, a linear ReLU transform and a maximum down-sampling operation, obtaining the 256 feature vector maps of 26 × 26 pixels output by Conv8;
9th step: input the feature vector maps output by Conv8 to no-sampling convolutional layer NSConv9 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 512 feature vector maps of 26 × 26 pixels output by NSConv9;
10th step: input the feature vector maps output by NSConv9 to low-step convolutional layer LSConv10 and apply to them, in turn, a low-step convolution operation and a linear ReLU transform, obtaining the 256 feature vector maps of 26 × 26 pixels output by LSConv10;
11th step: input the feature vector maps output by LSConv10 to no-sampling convolutional layer NSConv11 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 512 feature vector maps of 26 × 26 pixels output by NSConv11;
12th step: input the feature vector maps output by NSConv11 to low-step convolutional layer LSConv12 and apply to them, in turn, a low-step convolution operation and a linear ReLU transform, obtaining the 256 feature vector maps of 26 × 26 pixels output by LSConv12;
13th step: input the feature vector maps output by LSConv12 to convolutional layer Conv13 and apply to them, in turn, a convolution operation, batch normalization, a linear ReLU transform and a maximum down-sampling operation, obtaining the 512 feature vector maps of 13 × 13 pixels output by Conv13;
14th step: input the feature vector maps output by Conv13 to no-sampling convolutional layer NSConv14 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 1024 feature vector maps of 13 × 13 pixels output by NSConv14;
15th step: input the feature vector maps output by NSConv14 to low-step convolutional layer LSConv15 and apply to them, in turn, a low-step convolution operation and a linear ReLU transform, obtaining the 512 feature vector maps of 13 × 13 pixels output by LSConv15;
16th step: input the feature vector maps output by LSConv15 to no-sampling convolutional layer NSConv16 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 1024 feature vector maps of 13 × 13 pixels output by NSConv16;
17th step: input the feature vector maps output by NSConv16 to low-step convolutional layer LSConv17 and apply to them, in turn, a low-step convolution operation and a linear ReLU transform, obtaining the 512 feature vector maps of 13 × 13 pixels output by LSConv17;
18th step: input the feature vector maps output by LSConv17 to no-sampling convolutional layer NSConv18 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 1024 feature vector maps of 13 × 13 pixels output by NSConv18;
19th step: input the feature vector maps output by NSConv18 to no-sampling convolutional layer NSConv19 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 1024 feature vector maps of 13 × 13 pixels output by NSConv19;
20th step: input the feature vector maps output by NSConv19 to no-sampling convolutional layer NSConv20 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 1024 feature vector maps of 13 × 13 pixels output by NSConv20;
21st step: input the feature vector maps output by NSConv20 to no-sampling convolutional layer NSConv21 and apply to them, in turn, a convolution operation, batch normalization and a linear ReLU transform, obtaining the 1024 feature vector maps of 13 × 13 pixels output by NSConv21;
22nd step: input the feature vector maps output by NSConv21 to low-step convolutional layer LSConv22 and apply to them, in turn, a low-step convolution operation and a linear ReLU transform, obtaining the 40 feature vector maps of 13 × 13 pixels output by LSConv22;
23rd step: input the feature vector maps output by LSConv22 to the output layer and apply to them, in turn, a linear transform and a non-linear sigmoid transform, obtaining the output vector of the output layer, where the output vector consists of the coordinate parameters and the text detection probability value output in the output layer of the fully convolutional neural network.
4. The method for locating text in natural scene images according to claim 1, characterised in that: the specific steps of training the fully convolutional neural network described in step (3b) are as follows:
1st step: take one sample from the training sample set and input it into the constructed fully convolutional neural network, then obtain the corresponding output from the output layer of the fully convolutional neural network; in this stage the information of the sample is transformed layer by layer through the fully convolutional neural network and delivered to its output layer;
2nd step: compute the loss value of the output vector of the output layer of the fully convolutional neural network, and use the ADAM algorithm to make a supervised adjustment of all parameters of the fully convolutional neural network so that the loss value of the output vector of the output layer gradually decreases;
3rd step: repeat the 1st and 2nd steps until the loss value of the output vector of the output layer of the fully convolutional neural network is less than or equal to 10, then stop the iteration, and obtain and save the trained fully convolutional neural network.
5. The method for locating text in natural scene images according to claim 1, characterised in that: the loss value of the output vector of the output layer of the fully convolutional neural network described in step (3b) is obtained by the following formula:

\[
\begin{cases}
X = (x^{p} - x^{r})^{2} \\
Y = (y^{p} - y^{r})^{2} \\
W = (w^{p} - w^{r})^{2} \\
H = (h^{p} - h^{r})^{2} \\
S = (s^{p} - s^{r})^{2} \\
T = (t^{p} - t^{r})^{2} \\
C = (c^{p} - c^{r})^{2} \\
l = \frac{5}{2}\left[ X + Y + W + H + S + T \right] + \frac{3}{2} C
\end{cases}
\]

where X denotes the loss value of the abscissa value in the output vector of the output layer of the fully convolutional neural network, x^p the abscissa value in the output vector, and x^r the true abscissa value of the text region in the input sample; Y denotes the loss value of the ordinate value, y^p the ordinate value in the output vector, and y^r the true ordinate value of the text region in the input sample; W denotes the loss value of the width value, w^p the width value in the output vector, and w^r the true width value of the text region in the input sample; H denotes the loss value of the height value, h^p the height value in the output vector, and h^r the true height value of the text region in the input sample; S denotes the loss value of the sine value, s^p the sine value in the output vector, and s^r the true sine value of the text region in the input sample; T denotes the loss value of the cosine value, t^p the cosine value in the output vector, and t^r the true cosine value of the text region in the input sample; C denotes the loss value of the text detection probability value, c^p the text detection probability value in the output vector, and c^r the true (ideal) text detection probability value of the text region in the input sample; and l denotes the loss value of the output vector.
6. The method for locating text in natural scene images according to claim 1, characterised in that: the text detection probability value in the output vector of the output layer of the fully convolutional neural network described in step (4b) is obtained by the following formula:

\[
c = \frac{1}{1 + e^{-\beta}}
\]

where c denotes the text detection probability value in the output vector of the output layer of the fully convolutional neural network, and β denotes the 7th output value, counted from left to right, in the output vector of the output layer of the fully convolutional neural network.
7. The method for locating text in natural scene images according to claim 1, characterised in that: the coordinate parameters in the output vector of the output layer of the fully convolutional neural network described in step (4b) comprise the abscissa value, the ordinate value, the width value, the height value, the sine value and the cosine value.
CN201710781807.7A 2017-09-02 2017-09-02 Method for positioning text in natural scene image Active CN107563379B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710781807.7A CN107563379B (en) 2017-09-02 2017-09-02 Method for positioning text in natural scene image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710781807.7A CN107563379B (en) 2017-09-02 2017-09-02 Method for positioning text in natural scene image

Publications (2)

Publication Number Publication Date
CN107563379A true CN107563379A (en) 2018-01-09
CN107563379B CN107563379B (en) 2019-12-24

Family

ID=60977874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710781807.7A Active CN107563379B (en) 2017-09-02 2017-09-02 Method for positioning text in natural scene image

Country Status (1)

Country Link
CN (1) CN107563379B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090116756A1 (en) * 2007-11-06 2009-05-07 Copanion, Inc. Systems and methods for training a document classification system using documents from a plurality of users
CN102663383A (en) * 2012-04-26 2012-09-12 北京科技大学 Method for positioning texts in images of natural scene
CN104182750A (en) * 2014-07-14 2014-12-03 上海交通大学 Extremum connected domain based Chinese character detection method in natural scene image
CN104809481A (en) * 2015-05-21 2015-07-29 中南大学 Natural scene text detection method based on adaptive color clustering
CN105825216A (en) * 2016-03-17 2016-08-03 中国科学院信息工程研究所 Method of locating text in complex background image

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ANKUSH GUPTA et al.: "Synthetic Data for Text Localisation in Natural Images", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
DENA BAZAZIAN et al.: "Improving text proposals for scene images with fully convolutional networks", online: https://arxiv.org/abs/1702.05089 *
ZECHENG XIE et al.: "Learning spatial-semantic context with fully convolutional recurrent network for online handwritten Chinese text recognition", online: https://arxiv.org/abs/1610.02616 *
贺通, 姚剑: "Scene text detection based on fully convolutional networks" (基于全卷积网络的场景文本检测), 黑龙江科技信息 (Heilongjiang Science and Technology Information) *
骆遥: "A text region localization method based on deep fully convolutional neural networks" (基于深度全卷积神经网络的文字区域定位方法), 无线互联科技 (Wireless Internet Technology) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288088B (en) * 2018-01-17 2020-02-28 浙江大学 Scene text detection method based on end-to-end full convolution neural network
CN108288088A (en) * 2018-01-17 2018-07-17 浙江大学 A kind of scene text detection method based on end-to-end full convolutional neural networks
CN108664968A (en) * 2018-04-18 2018-10-16 江南大学 A kind of unsupervised text positioning method based on text selection model
CN108664968B (en) * 2018-04-18 2020-07-07 江南大学 Unsupervised text positioning method based on text selection model
CN108805131A (en) * 2018-05-22 2018-11-13 北京旷视科技有限公司 Text line detection method, apparatus and system
CN109858318A (en) * 2018-11-16 2019-06-07 平安科技(深圳)有限公司 The classification recognition methods of landscape image and device
CN110399871A (en) * 2019-06-14 2019-11-01 华南理工大学 A kind of appraisal procedure of scene text testing result
CN110689012A (en) * 2019-10-08 2020-01-14 山东浪潮人工智能研究院有限公司 End-to-end natural scene text recognition method and system
CN112836696A (en) * 2019-11-22 2021-05-25 搜狗(杭州)智能科技有限公司 Text data detection method and device and electronic equipment
CN112200598A (en) * 2020-09-08 2021-01-08 北京数美时代科技有限公司 Picture advertisement identification method and device and computer equipment
CN112200598B (en) * 2020-09-08 2022-02-15 北京数美时代科技有限公司 Picture advertisement identification method and device and computer equipment
CN113342994A (en) * 2021-07-05 2021-09-03 成都信息工程大学 Recommendation system based on non-sampling cooperative knowledge graph network
CN113342994B (en) * 2021-07-05 2022-07-05 成都信息工程大学 Recommendation system based on non-sampling cooperative knowledge graph network

Also Published As

Publication number Publication date
CN107563379B (en) 2019-12-24

Similar Documents

Publication Publication Date Title
CN107563379A (en) For the localization method to natural scene image Chinese version
AU2020100200A4 (en) Content-guide Residual Network for Image Super-Resolution
CN106127684B (en) Image super-resolution Enhancement Method based on forward-backward recutrnce convolutional neural networks
CN104978580B (en) A kind of insulator recognition methods for unmanned plane inspection transmission line of electricity
CN103077511B (en) Image super-resolution reconstruction method based on dictionary learning and structure similarity
CN108549893A (en) A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN111986099A (en) Tillage monitoring method and system based on convolutional neural network with residual error correction fused
CN109064396A (en) A kind of single image super resolution ratio reconstruction method based on depth ingredient learning network
CN108399362A (en) A kind of rapid pedestrian detection method and device
CN108197606A (en) The recognition methods of abnormal cell in a kind of pathological section based on multiple dimensioned expansion convolution
CN108038420A (en) A kind of Human bodys&#39; response method based on deep video
CN105335929B (en) A kind of depth map ultra-resolution method
CN105069746A (en) Video real-time human face substitution method and system based on partial affine and color transfer technology
CN105069825A (en) Image super resolution reconstruction method based on deep belief network
CN106339984B (en) Distributed image ultra-resolution method based on K mean value driving convolutional neural networks
CN110276354A (en) A kind of training of high-resolution Streetscape picture semantic segmentation and real time method for segmenting
CN110458165A (en) A kind of natural scene Method for text detection introducing attention mechanism
CN110223304A (en) A kind of image partition method, device and computer readable storage medium based on multipath polymerization
CN110136060A (en) The image super-resolution rebuilding method of network is intensively connected based on shallow-layer
CN106169174A (en) A kind of image magnification method
CN107424161A (en) A kind of indoor scene image layout method of estimation by thick extremely essence
CN108765349A (en) A kind of image repair method and system with watermark
CN110349087A (en) RGB-D image superior quality grid generation method based on adaptability convolution
CN105095857A (en) Face data enhancement method based on key point disturbance technology
CN104091364B (en) Single-image super-resolution reconstruction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant