CN109117836A - Method and device for text detection and localization in natural scenes based on a focal loss function - Google Patents

Method and device for text detection and localization in natural scenes based on a focal loss function Download PDF

Info

Publication number
CN109117836A
CN109117836A (application CN201810729838.2A)
Authority
CN
China
Prior art keywords
text
pixel
network
loss function
true value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810729838.2A
Other languages
Chinese (zh)
Other versions
CN109117836B (en)
Inventor
操晓春
田晓玮
伍蹈
代朋纹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201810729838.2A priority Critical patent/CN109117836B/en
Publication of CN109117836A publication Critical patent/CN109117836A/en
Application granted granted Critical
Publication of CN109117836B publication Critical patent/CN109117836B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]

Abstract

The present invention discloses a method and device for detecting and localizing text in natural scenes based on a focal loss function. The method first preprocesses the annotated data, then constructs a text detection and localization network, introduces a focal loss function as part of the training loss, and finally runs detection on the natural scene pictures to be tested. By adjusting the existing annotations, the method makes the labels better suited to the designed text detection network; by fusing multiple convolutional layers on top of an FCN, the network is made to better fit the text detection task; and by introducing the focal loss function, positive and negative samples are balanced during training, which improves detection accuracy. The present invention achieves high precision and high recall in text detection and localization.

Description

Method and device for text detection and localization in natural scenes based on a focal loss function
Technical field
The invention belongs to the technical field of computer vision, and in particular relates to a method and device capable of accurately localizing text regions in natural scene pictures.
Background technique
Humans convey information in many ways, and text, as a carrier of information, directly contains rich semantic content. In natural scenes, text is ubiquitous: shop signboards, traffic signs, and even roadside advertisements and posters all convey information through text. Accurately localizing and recognizing text regions in natural scenes helps machines better understand scene semantics and benefits many fields. For example, in street-view recognition, reading the text on building signs helps us better understand street-view information; in assisted driving, reading the text on traffic signs helps support automated driving. In today's era of rapidly developing artificial intelligence, natural scene text recognition has become an important part of computer vision. Text recognition in natural scene images is broadly divided into two tasks: first, text detection, i.e., locating the regions containing text in the picture; second, text recognition, i.e., extracting the text content from the located regions. Because character strokes are rich and detail-sensitive, subsequent recognition is only possible once text regions have been accurately located, so text detection occupies an important position in the overall recognition task.
Text detection in natural scenes differs greatly from traditional text detection, mainly in the following respects. First, the objects being processed are different. Traditional text detection mainly processes document images, usually scanned documents in formats such as PDF. Natural scene text detection mainly processes street-view pictures, usually photographs in formats such as JPG. Second, the relationship between text regions and background differs. In traditional text detection, the text region occupies the main part of the picture, the text is regular, and the background is mostly a solid color without other interference. In natural scene text detection, text regions are irregularly laid out, the text varies in size and color, and the background is complex and full of distractors for text detection, such as railings, electric wires, and occluders. Third, the image quality of the processed objects differs. Traditional text detection deals with clear, high-quality images, whereas natural scene images often contain noise and blur due to shooting angle, camera shake, or lighting conditions. It can be seen that text detection in natural scenes is more complex than traditional text detection and its objects are harder to process. Therefore, text detection in natural scenes has always been a highly challenging task in the field of computer vision.
Currently, text detection methods for natural scenes fall into three classes: those based on connected components, those based on texture features, and those that mix the two. Connected-component methods mainly exploit the correlation between adjacent text pixels to detect text regions. For natural scene text, the gray values of characters in a grayscale image are similar; in addition, text color, stroke width, and so on also exhibit certain correlations. Representative methods include Maximally Stable Extremal Regions (MSER), Color Clustering, the Stroke Width Transform (SWT), and Histograms of Oriented Gradients (HOG). Texture-feature methods mainly exploit the dissimilarity between text and background texture to detect text regions. For natural scene text, text regions often have distinctive textural features; these features can be fed as input to a trained classifier to separate text from the background. Mixed methods combine the above correlation and dissimilarity to detect text regions, broadly in two steps: first use correlation to obtain text candidate regions, then apply texture-feature detection to these candidates to accurately localize the text regions.
In recent years, with the rapid development of deep learning, extracting the textural features of text with deep learning methods, and then distinguishing text regions from the background, has become the mainstream approach to text detection. Here, text is treated as a special kind of object, and mainstream deep-learning object detection methods such as Faster R-CNN, YOLO, and RFCN are applied to natural scene text detection. However, text in natural scenes differs considerably from the objects in object detection, mainly in that text regions may be much longer; applying object detection methods directly is poorly targeted and yields poor results. Therefore, designing a reasonable and efficient text detection method tailored to the characteristics of text remains a considerable challenge.
Summary of the invention
Aiming at images in natural scenes, the present invention proposes a deep-learning text detection and localization method and device based on a focal loss function.
The present invention adopts the FCN from deep neural networks: based on the textural feature information of text, it identifies the pixels belonging to text regions while simultaneously regressing the size and inclination angle of the text box, so that text of arbitrary size and inclination in natural scenes can be detected and localized. In particular, the focal loss function is applied during training when discriminating text regions, which compensates for the fact that text regions occupy only a small portion of the picture and are hard to learn, improving detection accuracy. At test time, because the network design of the invention is sensitive to text at different scales, text regions can be accurately localized without cascaded testing.
The technical solution adopted by the invention is as follows:
A text detection and localization method for natural scenes based on a focal loss function, comprising the following steps:
1) according to a dataset of annotated natural scene pictures, construct a text/background binary classification ground-truth map and a five-dimensional ground-truth map encoding the correspondence between each text pixel and its enclosing text box;
2) construct a text detection network based on an FCN; the loss function of the text detection network includes a focal loss function and a loss function for regressing the text box;
3) train the text detection network using the constructed binary classification ground-truth map and five-dimensional ground-truth map; the text detection network uses the focal loss function to perform pixel-wise text/background binary classification, and uses the text-box regression loss to regress the height, width, and tilt angle of the text box containing each pixel;
4) input the natural scene picture to be detected into the trained text detection network to detect and localize the text.
Further, step 1) first converts the existing annotations into binary classification labels: pixels in text regions are set to 1 and background pixels to 0, constructing the text/background binary classification ground-truth map; it then computes, for each text pixel, the distances to the four borders of the minimum enclosing rectangle and the angle of the enclosing text box with the horizontal direction, forming the five-dimensional ground-truth map.
Further, step 2) uses ResNet-50 as the basic convolutional neural network structure and cascades the results of multiple convolutional layers to construct the text detection network.
Further, in the text detection network of step 2), the result of conv5_c is unpooled and merged with the result of conv4_f, then passed through a 3*3 and a 1*1 convolution to obtain network layer f1; repeating this with conv3_d and conv2_c yields network layers f2 and f3; f3 is passed through two different 3*3 convolutions to obtain two parallel network layers f4_1 and f4_2, which are used to compute the two loss functions and are trained jointly.
Further, the formula of the focal loss function is as follows:
L_seg = -(1/(w·h)) Σ_{i,j} α_t·(1 - p_t)^γ·log(p_t)
where w and h are the width and height of the predicted binary classification map, Y* is the given ground truth, α_t is the parameter for balancing positive and negative samples, γ is the parameter for balancing easy and hard samples, and p_t is the prediction of the binary classification network; p_t and α_t are computed as follows:
p_t = p if y = 1, else 1 - p
α_t = α if y = 1, else 1 - α
where p is the network prediction at the pixel and y is the ground truth at the pixel.
Further, the loss function for regressing the text box is defined as follows:
L_reg = L_IoU + η·L_θ
where L_IoU is the overlap loss between the regressed text box and the real text box, L_θ is the loss between the predicted inclination angle and the true inclination angle, and η is a balance parameter; L_IoU and L_θ are computed as follows:
L_IoU = -log IoU(R̂, R*)
L_θ = c·(1 - cos(θ̂ - θ*))
where R̂ and R* denote the predicted text box and the corresponding annotated text box, θ̂ and θ* denote the predicted tilt angle and the corresponding annotated tilt angle, and the constant c constrains the upper bound of L_θ.
Further, step 4) comprises the following steps:
4.1) proportionally scale down the natural scene picture to be detected until its long side is under 2400 pixels;
4.2) input the picture into the text detection network to obtain the binary classification result and the regression result;
4.3) from the binary classification map, select the text pixel regions exceeding a threshold, and according to the text pixel regions and the corresponding regression results, remove redundant text boxes using a locally-aware non-maximum suppression algorithm.
Corresponding to the above method, the present invention also provides a text detection and localization device for natural scenes based on a focal loss function, comprising:
a ground-truth map construction module, responsible for constructing, from a dataset of annotated natural scene pictures, a text/background binary classification ground-truth map and a five-dimensional ground-truth map encoding the correspondence between each text pixel and its enclosing text box;
a text detection network construction module, responsible for constructing a text detection network based on an FCN, the loss function of the text detection network including a focal loss function and a loss function for regressing the text box;
a text detection network training module, responsible for training the text detection network using the constructed binary classification ground-truth map and five-dimensional ground-truth map; the text detection network uses the focal loss function to perform pixel-wise text/background binary classification, and uses the text-box regression loss to regress the height, width, and tilt angle of the text box containing each pixel;
a text detection and localization module, responsible for inputting the natural scene picture to be detected into the trained text detection network to detect and localize the text.
In conclusion, the present invention designs a deep-learning-based method for detecting and localizing text in natural scenes that achieves high precision and high recall. Compared with the prior art, the present invention has the following advantages:
1. A text detection network is designed by improving on the FCN.
2. The focal loss function is adapted and used, which benefits network training.
3. The network is highly adaptable and can achieve high-precision detection results with few training samples.
Brief description of the drawings
Fig. 1: schematic diagram of the annotation conversion;
Fig. 2: architecture of the natural scene text detection network;
Fig. 3: example results of natural scene text detection.
Specific embodiment
The present invention is described in further detail below through specific embodiments and the accompanying drawings.
The text detection and localization method for natural scenes based on a focal loss function of the invention is broadly divided into a training stage (corresponding to a training module) and a test stage (corresponding to a test module).
The steps of the training stage are as follows:
1) Preprocess the annotated dataset to construct the text/background binary classification ground-truth map and the five-dimensional ground-truth map encoding the correspondence between each text pixel and its enclosing text box.
The annotation conversion of step 1) is shown in Fig. 1. Pixels inside an annotated text box are labeled 1 and background pixels are labeled 0, constructing the text/background binary classification ground-truth map. When an annotated text box is an arbitrary quadrilateral, it is first expanded into the minimum enclosing rectangle; then, to avoid interference, the text box is shrunk by 30% toward its interior (since a text region is not necessarily entirely text, shrinking by 30% filters out positions close to the annotation edge). For each remaining pixel, the distances to the four borders of the enclosing rectangle and the angle of the enclosing text box with the horizontal direction are computed; these four distances and one angle constitute the five-dimensional ground-truth map. As shown in Fig. 1, l denotes left, the distance from the pixel to the left border; t denotes top, the distance from the pixel to the top border; b denotes bottom, the distance from the pixel to the bottom border; r denotes right, the distance from the pixel to the right border; and θ is the angle of the text box with the horizontal direction.
2) Based on the FCN (Evan Shelhamer, Jonathan Long, and Trevor Darrell, "Fully convolutional networks for semantic segmentation," PAMI, 2017, pp. 640-651), use ResNet-50 as the basic convolutional neural network structure and cascade the results of multiple convolutional layers to construct the text detection network.
Step 2) borrows the idea of the Fully Convolutional Network (FCN) and treats the text detection task as an object segmentation task. The network structure is shown in Fig. 2. First, ResNet-50 is used as the basic network structure to extract high-level image features, which are then fused with low-level texture features. In Fig. 2, conv5_c, conv4_f, conv3_d, conv2_c, and conv1 denote convolutional layers. To adapt to character features at different scales, the features of conv5_c, conv4_f, conv3_d, and conv2_c are merged to construct the special network layers f1, f2, and f3. Specifically, the result of conv5_c is unpooled and merged with the result of conv4_f, then passed through a 3*3 and a 1*1 convolution to obtain network layer f1; repeating this with conv3_d and conv2_c yields network layers f2 and f3, i.e., the result of f1 is unpooled and merged with the result of conv3_d, then passed through a 3*3 and a 1*1 convolution to obtain f2, and the result of f2 is unpooled and merged with the result of conv2_c, then passed through a 3*3 and a 1*1 convolution to obtain f3. The numbers 1 and 5 above "predicted value" in Fig. 2 denote dimensions: 1 is the text-pixel decision, and 5 is the five parameters of the five-dimensional ground-truth map.
3) While performing pixel-wise text/background binary classification, for each pixel in a text region the network also regresses the distances to the four borders of the enclosing text box and the tilt angle of the text box with the horizontal direction. In this way, while predicting text regions, the network can also construct text boxes relatively easily, realizing the detection and localization of text.
To construct the text boxes for detection, step 3) performs multi-task learning by adding a regression task alongside the binary classification task. For this purpose, two parallel convolutional layers f4_1 and f4_2 are constructed after f3 to output, respectively, the text/background binary classification result and the regression result for text box size and inclination angle. Specifically, f3 is passed through two different 3*3 convolutions to obtain the two parallel network layers f4_1 and f4_2, which are used to compute the two loss functions and are trained jointly.
4) During network training, samples vary in difficulty; moreover, text regions occupy only a small proportion of the whole picture, so the ratio of positive to negative samples is extremely unbalanced, which usually requires hard example mining to improve training efficiency. The present invention instead introduces the focal loss function (Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár, "Focal loss for dense object detection," in ICCV 2017, pp. 2999-3007), after adaptation, into the training process of the text detection task, so that no hard example mining is needed.
Step 4) adopts, as part of the loss function of the text detection task, a loss function used in object detection to address imbalance between training sample classes (including positive/negative imbalance and easy/hard imbalance). The object of this loss function is converted from anchors (candidate text boxes) to pixels, and the multi-class problem is converted into a binary classification problem to fit the text detection task.
The steps of the test stage are as follows:
1) Resize the test picture and input it into the text detection network; the detection results of the network are the text/background binary classification map and the regression maps of text box size and inclination angle.
Step 1) keeps the picture's aspect ratio and resizes the picture so that its long side does not exceed 2400p, where p denotes pixels, i.e., the long side is at most 2400 pixels long. Pictures exceeding the limit are scaled down proportionally.
2) From the binary classification map, select the text pixel regions whose score exceeds a threshold, and according to the text pixel regions and the corresponding regression results, remove redundant text boxes using a locally-aware non-maximum suppression algorithm (Local-Aware NMS).
In step 2), the threshold score for filtering valid text pixels from the binary classification map output by the network is 0.97. Fig. 3 shows example results of natural scene text detection.
The text detection network of the invention is further explained below. The text detection network in the proposed deep-learning method for detecting and localizing text in natural scenes is mainly obtained by improving on the FCN, as shown in Fig. 2. The network uses two loss functions: the first is the adjusted focal loss function (L_seg), which performs the binary (text/background) classification; the second is the text-box regression loss (L_reg), which regresses the distances from each text pixel to the four borders of its enclosing text box and the angle of the text box with the horizontal direction. The total loss function is as follows:
L = L_seg + λ·L_reg (1)
where the balance parameter λ is set to 1.
For the binary classification process, the loss function is defined as follows:
L_seg = -(1/(w·h)) Σ_{i,j} α_t·(1 - p_t)^γ·log(p_t) (2)
where w and h are the width and height of the predicted binary classification map, Y* is the given ground truth, α_t is the parameter for balancing positive and negative samples, γ is the parameter for balancing easy and hard samples, and p_t is the prediction of the binary classification network. p_t and α_t in (2) are computed similarly, as follows:
p_t = p if y = 1, else 1 - p (3)
α_t = α if y = 1, else 1 - α (4)
where p is the network prediction at the pixel, y is the ground truth at the pixel, and α is a parameter. In the text detection model of this embodiment, α and γ take the values 0.95 and 2, respectively. In other embodiments, α and γ may take other values as needed.
For the regression process, the loss function is defined as follows:
L_reg = L_IoU + η·L_θ (5)
where L_IoU is the overlap (IoU) loss between the regressed text box and the real text box, L_θ is the loss between the predicted and true inclination angles, and η is a balance parameter, set to 20 in this text detection method. The two are computed as follows:
L_IoU = -log IoU(R̂, R*) (6)
L_θ = c·(1 - cos(θ̂ - θ*)) (7)
where R̂ and R* denote the predicted text box and the corresponding annotated text box, θ̂ and θ* denote the predicted tilt angle and the corresponding annotated tilt angle, and the constant c constrains the upper bound of L_θ; it is set to 6 in this text detection method.
The test environment and experimental results of the proposed natural scene text detection method are as follows:
(1) Test environment:
System environment: Ubuntu 16.04;
Hardware environment: memory: 64 GB; GPU: K80; hard disk: 2 TB;
(2) Experimental data:
Training data:
The basic ResNet-50 network is pre-trained on ImageNet.
1229 natural scene pictures (including 299 from the ICDAR2013 training set and 1000 from the ICDAR2015 training set) are used for training until the model is stable and its performance no longer improves.
Training optimization method: ADAM
Test data: ICDAR2015 (500 pictures)
Evaluation method: ICDAR2015 online evaluation
(3) Experimental results:
To illustrate the effect of the invention, the text detection network of the invention was trained on the same dataset both with and without the focal loss function, stopping when the model stabilized and performance no longer improved; it was then tested on the ICDAR2015 test set and compared with mainstream existing text detection methods.
The comparison between existing mainstream methods and the present invention is shown in Table 1 below:
Table 1. Comparison of test results between existing methods and the invention

No.  Method                                  P      R      F
1    CTPN                                    0.516  0.742  0.609
2    EAST                                    0.836  0.735  0.782
3    The invention (without focal loss)      0.819  0.767  0.792
4    The invention (with focal loss)         0.847  0.773  0.809
where P is precision, R is recall, and F is the harmonic mean of P and R. The table clearly shows that the text detection network of the invention substantially improves on the existing text detection methods CTPN and EAST in both precision and recall, and that the network model trained with the focal loss function gains a further improvement in both. For the CTPN method, see Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao, "Detecting text in natural image with connectionist text proposal network," in ECCV 2016, pp. 56-72; for the EAST method, see Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang, "EAST: an efficient and accurate scene text detector," in CVPR 2017, pp. 2642-2651.
Another embodiment of the present invention provides a text detection and localization device for natural scenes based on a focal loss function, comprising:
a ground-truth map construction module, responsible for constructing, from a dataset of annotated natural scene pictures, a text/background binary classification ground-truth map and a five-dimensional ground-truth map encoding the correspondence between each text pixel and its enclosing text box;
a text detection network construction module, responsible for constructing a text detection network based on an FCN, the loss function of the text detection network including a focal loss function and a loss function for regressing the text box;
a text detection network training module, responsible for training the text detection network using the constructed binary classification ground-truth map and five-dimensional ground-truth map; the text detection network uses the focal loss function to perform pixel-wise text/background binary classification, and uses the text-box regression loss to regress the height, width, and tilt angle of the text box containing each pixel;
a text detection and localization module, responsible for inputting the natural scene picture to be detected into the trained text detection network to detect and localize the text.
The above embodiments are merely intended to illustrate, not to limit, the technical solution of the present invention; a person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from the spirit and scope of the invention, and the protection scope of the present invention shall be defined by the claims.

Claims (10)

1. A text detection and localization method for natural scenes based on a focal loss function, characterized by comprising the following steps:
1) according to a dataset of annotated natural scene pictures, constructing a text/background binary classification ground-truth map and a five-dimensional ground-truth map encoding the correspondence between each text pixel and its enclosing text box;
2) constructing a text detection network based on an FCN, the loss function of the text detection network including a focal loss function and a loss function for regressing the text box;
3) training the text detection network using the constructed binary classification ground-truth map and five-dimensional ground-truth map; the text detection network using the focal loss function to perform pixel-wise text/background binary classification, and using the text-box regression loss to regress the height, width, and tilt angle of the text box containing each pixel;
4) inputting the natural scene picture to be detected into the trained text detection network to detect and localize the text.
2. The method as claimed in claim 1, characterized in that step 1) first converts the existing annotations into binary classification labels, setting pixels in text regions to 1 and background pixels to 0 to construct the text/background binary classification ground-truth map; it then computes, for each text pixel, the distances to the four borders of the minimum enclosing rectangle and the angle of the enclosing text box with the horizontal direction, forming the five-dimensional ground-truth map.
3. The method as claimed in claim 1, characterized in that step 2) uses ResNet-50 as the basic convolutional neural network structure and cascades the results of multiple convolutional layers to construct the text detection network.
4. The method as claimed in claim 3, characterized in that, in the text detection network of step 2), the result of conv5_c is unpooled and merged with the result of conv4_f, then passed through a 3*3 and a 1*1 convolution to obtain network layer f1; the result of f1 is unpooled and merged with the result of conv3_d, then passed through a 3*3 and a 1*1 convolution to obtain network layer f2; the result of f2 is unpooled and merged with the result of conv2_c, then passed through a 3*3 and a 1*1 convolution to obtain network layer f3; and f3 is passed through two different 3*3 convolutions to obtain two parallel network layers f4_1 and f4_2, which are used to compute the two loss functions and are trained jointly.
5. The method as claimed in claim 1, characterized in that the formula of the focal loss function is as follows:
L_seg = -(1/(w·h)) Σ_{i,j} α_t·(1 - p_t)^γ·log(p_t)
where w and h are the width and height of the predicted binary classification map, Y* is the given ground truth, α_t is the parameter for balancing positive and negative samples, γ is the parameter for balancing easy and hard samples, and p_t is the prediction of the binary classification network; p_t and α_t are computed as follows:
p_t = p if y = 1, else 1 - p
α_t = α if y = 1, else 1 - α
where p is the network prediction at the pixel and y is the ground truth at the pixel.
6. The method as described in claim 1, characterized in that the loss function for regressing the text box is defined as:

Lreg = LIoU + η·Lθ

where LIoU measures the overlap between the regressed text box and the true text box, Lθ is the loss between the predicted tilt angle and the true tilt angle, and η is a balancing parameter; LIoU and Lθ are computed as

LIoU = −log(|R̂ ∩ R*| / |R̂ ∪ R*|),  Lθ = c·(1 − cos(θ̂ − θ*)),

where R̂ and R* are the predicted text box and the corresponding annotated text box, θ̂ and θ* are the predicted tilt angle and the corresponding annotated tilt angle, and the constant c constrains the upper bound of Lθ.
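A minimal sketch of the regression loss in this claim, under the common EAST-style assumption that LIoU is the negative log of the overlap ratio and Lθ is a scaled cosine loss; boxes are simplified to axis-aligned (x0, y0, x1, y1) tuples, whereas the patent regresses four edge distances of a rotated box:

```python
import math

def reg_loss(pred_box, true_box, pred_theta, true_theta, eta=10.0, c=1.0):
    """Box-regression loss sketch for claim 6: IoU overlap term plus an
    angle term whose upper bound (2c) is set by the constant c. The eta
    default is an assumption, not a value from the patent."""
    ax0, ay0, ax1, ay1 = pred_box
    bx0, by0, bx1, by1 = true_box
    # Intersection and union of the two axis-aligned rectangles.
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    l_iou = -math.log(max(inter / union, 1e-8))   # clamp avoids log(0)
    l_theta = c * (1.0 - math.cos(pred_theta - true_theta))
    return l_iou + eta * l_theta
```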
7. The method as described in claim 1, characterized in that step 4) comprises the following steps:
4.1) scaling the natural-scene picture to be detected down so that its long side is under 2400 pixels;
4.2) feeding the scaled picture into the text detection network to obtain the two-class result and the regression result;
4.3) selecting from the two-class map the text pixel regions whose score exceeds a threshold, and removing redundant text boxes with locality-aware non-maximum suppression according to the text pixel regions and the corresponding regression results.
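Step 4.3) can be sketched with standard non-maximum suppression (a simplification: the patent uses locality-aware NMS on rotated boxes, while here boxes are hypothetical axis-aligned tuples):

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop every remaining box
    that overlaps it above iou_thresh, and repeat. Boxes are axis-aligned
    (x0, y0, x1, y1); returns the indices of the boxes that survive."""
    def iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

Locality-aware NMS differs from this greedy version by first merging geometrically adjacent boxes row by row, which lowers the cost when thousands of per-pixel boxes are decoded.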
8. A device for detecting and locating text in natural scenes based on a focal loss function, characterized by comprising:
a ground-truth map construction module, which constructs, from an annotated data set of natural-scene pictures, the text/background two-class ground-truth map and the five-dimensional ground-truth map relating each text pixel to the text box it belongs to;
a text detection network construction module, which constructs the text detection network on an FCN architecture, the loss function of the text detection network comprising a focal loss function and a loss function for regressing text boxes;
a text detection network training module, which trains the text detection network with the constructed two-class ground-truth map and five-dimensional ground-truth map; the text detection network performs pixel-wise text/background two-class classification with the focal loss function, and regresses the height, width and tilt angle of the text box containing each pixel with the loss function for regressing text boxes;
a text detection and location module, which feeds the natural-scene picture to be detected into the trained text detection network to detect and locate the text.
9. The device as claimed in claim 8, characterized in that the ground-truth map construction module first converts the existing annotations into two-class annotations, with pixels in text regions set to 1 and pixels in background regions set to 0, to construct the text/background two-class ground-truth map; it then computes, for each text pixel, the distances to the four edges of the minimum enclosing rectangle of its text box and the angle of that box with respect to the horizontal, to construct the five-dimensional ground-truth map.
10. The device as claimed in claim 8, characterized in that the formula of the focal loss function is

L_cls = -(1/(w·h)) · Σ α_t · (1 − p_t)^γ · log(p_t)

where the sum runs over all pixels, w and h are the width and height of the predicted two-class map, Y* is the given ground truth, α_t is a parameter for balancing positive and negative samples, γ is a parameter for balancing easy and hard samples, and p_t is the two-class network prediction; p_t and α_t are computed as

p_t = p if y = 1, and 1 − p otherwise;  α_t = α if y = 1, and 1 − α otherwise,

where p is the prediction the network gives at the pixel and y is the ground-truth value (from Y*) at that pixel.
CN201810729838.2A 2018-07-05 2018-07-05 Method and device for detecting and positioning characters in natural scene based on focus loss function Active CN109117836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810729838.2A CN109117836B (en) 2018-07-05 2018-07-05 Method and device for detecting and positioning characters in natural scene based on focus loss function


Publications (2)

Publication Number Publication Date
CN109117836A true CN109117836A (en) 2019-01-01
CN109117836B CN109117836B (en) 2022-05-24

Family

ID=64821941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810729838.2A Active CN109117836B (en) 2018-07-05 2018-07-05 Method and device for detecting and positioning characters in natural scene based on focus loss function

Country Status (1)

Country Link
CN (1) CN109117836B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279753A (en) * 2013-06-09 2013-09-04 中国科学院自动化研究所 English scene text block identification method based on instructions of tree structures
CN105184312A (en) * 2015-08-24 2015-12-23 中国科学院自动化研究所 Character detection method and device based on deep learning
CN105335754A (en) * 2015-10-29 2016-02-17 小米科技有限责任公司 Character recognition method and device
CN106096531A (en) * 2016-05-31 2016-11-09 安徽省云力信息技术有限公司 A kind of traffic image polymorphic type vehicle checking method based on degree of depth study
US20180032840A1 (en) * 2016-07-27 2018-02-01 Beijing Kuangshi Technology Co., Ltd. Method and apparatus for neural network training and construction and method and apparatus for object detection


Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740542A (en) * 2019-01-07 2019-05-10 福建博思软件股份有限公司 Method for text detection based on modified EAST algorithm
CN109740542B (en) * 2019-01-07 2020-11-27 福建博思软件股份有限公司 Text detection method based on improved EAST algorithm
CN111460247A (en) * 2019-01-21 2020-07-28 重庆邮电大学 Automatic detection method for network picture sensitive characters
CN111460247B (en) * 2019-01-21 2022-07-01 重庆邮电大学 Automatic detection method for network picture sensitive characters
CN109948480A (en) * 2019-03-05 2019-06-28 中国电子科技集团公司第二十八研究所 A kind of non-maxima suppression method for arbitrary quadrilateral
CN110069985A (en) * 2019-03-12 2019-07-30 北京三快在线科技有限公司 Aiming spot detection method based on image, device, electronic equipment
CN110378243A (en) * 2019-06-26 2019-10-25 深圳大学 A kind of pedestrian detection method and device
CN110674932A (en) * 2019-09-30 2020-01-10 北京小米移动软件有限公司 Two-stage convolutional neural network target detection network training method and device
CN110807523A (en) * 2019-10-23 2020-02-18 中科智云科技有限公司 Method and equipment for generating detection model of similar target
CN110807523B (en) * 2019-10-23 2022-08-05 中科智云科技有限公司 Method and equipment for generating detection model of similar target
CN110827253A (en) * 2019-10-30 2020-02-21 北京达佳互联信息技术有限公司 Training method and device of target detection model and electronic equipment
CN110991440A (en) * 2019-12-11 2020-04-10 易诚高科(大连)科技有限公司 Pixel-driven mobile phone operation interface text detection method
CN110991440B (en) * 2019-12-11 2023-10-13 易诚高科(大连)科技有限公司 Pixel-driven mobile phone operation interface text detection method
CN111274985A (en) * 2020-02-06 2020-06-12 咪咕文化科技有限公司 Video text recognition network model, video text recognition device and electronic equipment
CN111274985B (en) * 2020-02-06 2024-03-26 咪咕文化科技有限公司 Video text recognition system, video text recognition device and electronic equipment
CN111582265A (en) * 2020-05-14 2020-08-25 上海商汤智能科技有限公司 Text detection method and device, electronic equipment and storage medium
CN112184688A (en) * 2020-10-10 2021-01-05 广州极飞科技有限公司 Network model training method, target detection method and related device
CN112184688B (en) * 2020-10-10 2023-04-18 广州极飞科技股份有限公司 Network model training method, target detection method and related device
CN112149620A (en) * 2020-10-14 2020-12-29 南昌慧亦臣科技有限公司 Method for constructing natural scene character region detection model based on no anchor point
CN113139539A (en) * 2021-03-16 2021-07-20 中国科学院信息工程研究所 Method and device for detecting characters of arbitrary-shaped scene with asymptotic regression boundary
WO2023125244A1 (en) * 2021-12-30 2023-07-06 中兴通讯股份有限公司 Character detection method, terminal, and readable storage medium

Also Published As

Publication number Publication date
CN109117836B (en) 2022-05-24

Similar Documents

Publication Publication Date Title
CN109117836A (en) Text detection localization method and device under a kind of natural scene based on focal loss function
Li et al. Automatic pavement crack detection by multi-scale image fusion
Ping et al. A deep learning approach for street pothole detection
US8509478B2 (en) Detection of objects in digital images
CN106355188A (en) Image detection method and device
CN105574550A (en) Vehicle identification method and device
CN109285139A (en) A kind of x-ray imaging weld inspection method based on deep learning
CN107346420A (en) Text detection localization method under a kind of natural scene based on deep learning
CN109711288A (en) Remote sensing ship detecting method based on feature pyramid and distance restraint FCN
CN105975929A (en) Fast pedestrian detection method based on aggregated channel features
CN108765386A (en) A kind of tunnel slot detection method, device, electronic equipment and storage medium
CN107273832B (en) License plate recognition method and system based on integral channel characteristics and convolutional neural network
CN105512683A (en) Target positioning method and device based on convolution neural network
CN111368690A (en) Deep learning-based video image ship detection method and system under influence of sea waves
CN109858547A (en) A kind of object detection method and device based on BSSD
CN113111727A (en) Method for detecting rotating target in remote sensing scene based on feature alignment
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN110059539A (en) A kind of natural scene text position detection method based on image segmentation
CN113033516A (en) Object identification statistical method and device, electronic equipment and storage medium
CN110020669A (en) A kind of license plate classification method, system, terminal device and computer program
CN111382766A (en) Equipment fault detection method based on fast R-CNN
CN105868269A (en) Precise image searching method based on region convolutional neural network
CN108664970A (en) A kind of fast target detection method, electronic equipment, storage medium and system
Zhang et al. CFANet: Efficient detection of UAV image based on cross-layer feature aggregation
Kuchi et al. A machine learning approach to detecting cracks in levees and floodwalls

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant