CN108764228A - Method for detecting text targets in an image - Google Patents

Method for detecting text targets in an image

Info

Publication number
CN108764228A
Authority
CN
China
Prior art keywords
layer
frame
bounding box
feature
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810520329.9A
Other languages
Chinese (zh)
Inventor
吕岳
吕淑静
张茹玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiaxing San Suo Intelligent Technology Co Ltd
Original Assignee
Jiaxing San Suo Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiaxing San Suo Intelligent Technology Co Ltd filed Critical Jiaxing San Suo Intelligent Technology Co Ltd
Priority to CN201810520329.9A priority Critical patent/CN108764228A/en
Publication of CN108764228A publication Critical patent/CN108764228A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a method for detecting text targets in an image, belonging to the technical fields of pattern recognition and image processing. It comprises the following steps. Step 1: construct an end-to-end convolutional neural network based on feature-layer fusion to predict targets of different scales in an image. Step 2: from the candidate boxes output by the feature-fusion network, obtain the final text target detection results of the image using a bounding-box fusion algorithm. The image target detection method of the present invention extracts the locations of text targets from natural-scene images, improving the efficiency and accuracy of subsequent target recognition. A feature-fusion neural network based on deep learning predicts the bounding boxes of the targets, and a bounding-box fusion algorithm merges the predicted boxes, so that the locations of text targets in images can be detected effectively.

Description

Method for detecting text targets in an image
Technical field
The invention belongs to the technical fields of pattern recognition and image processing, and in particular relates to a method for detecting text targets in an image.
Background art
With the development of the Internet and multimedia technology, more and more information carriers exist in the form of images. Images contain rich visual information: text, color, shape, pattern, position, and so on; this information helps humans analyze the meaning of a scene. Image-based text target detection is currently widely applied in areas such as license-plate recognition and traffic-sign analysis. However, because images are shot under arbitrary conditions, the text in an image may be deformed, incomplete, blurred, or broken, and such objective factors interfere with the detection of text regions. In addition, the background of a scene image is usually complex, and text and background may share similar textures, which further increases the difficulty of text target detection. Traditional text detection methods require hand-crafted feature selection for the text targets and a large number of heuristic rules to obtain text positions, with limited effect.
Summary of the invention
The object of the present invention is to provide a deep-learning-based method for detecting text targets in an image, so as to solve the problem of locating text targets in images. The invention predicts the positions and confidences of text target objects with a neural network, then merges all the output candidate boxes with a candidate-box fusion algorithm to obtain the final bounding boxes of the image targets, i.e., the image target detection result.
The technical solution adopted by the present invention to solve the technical problem is as follows:
A text target detection method for images that combines a feature-fusion network with a bounding-box fusion algorithm, comprising the following steps:
First, design an end-to-end convolutional neural network with multiple output layers of strong representational power. Different output layers of the network predict target objects of different scales: high-level output layers predict large-scale target objects, and low-level output layers predict small-scale target objects. The output layers of the network produce the positions and confidences of the target objects, yielding a series of candidate bounding boxes.
Then post-process the candidate text boxes output by the neural network: by merging multiple candidate bounding boxes, the optimal detected positions of the target objects are obtained.
Further, constructing the feature-layer-fusion convolutional neural network for detecting the positions of text targets comprises the following steps:
(1) Build a feed-forward convolutional neural network whose front end is VGG-16, in which the last two fully connected layers are replaced by convolutional layers; after the front-end structure, additional convolutional and pooling layers are appended.
(2) Insert a deconvolution layer between the highest feature layer and each of the other feature layers. The deconvolution operation in the deconvolution layer, similar to bilinear interpolation, selectively enlarges a feature map so that the feature-map scale of the highest feature layer reaches the size of the low-level scale. The output feature-map size of a deconvolution layer is
o = s · (i − 1) + k − 2p
where i is the size of the input feature map of the deconvolution layer, k the kernel size, s the stride, and p the padding. For example, with i = 10, k = 2, s = 2 and p = 0, the output size is o = 20. Given the sizes of the input and output feature maps, the high-level feature layer can be brought to the same size as a low-level feature map by setting the corresponding deconvolution parameters.
(3) Fuse the deconvolved feature map with the feature map of the low-level feature layer by element-wise product to obtain a new feature layer. The new feature layer serves as an output layer that produces the positions and confidences of the target objects. The element-wise product of two feature maps is equivalent to the Hadamard product of two matrices, multiplying corresponding elements.
(4) Define a series of fixed-size default boxes on each output layer; the output layer produces the text confidence and the offset coordinates relative to the default boxes. Suppose the sizes of the image and of the feature map are (wim, him) and (wmap, hmap) respectively, and that position (i, j) in the feature map corresponds to a default box b0 = (x0, y0, w0, h0). The output of the output layer is (Δx, Δy, Δw, Δh, c), where (Δx, Δy, Δw, Δh) are the offset coordinates of the predicted text bounding box relative to the default box and c is the text confidence. The predicted text bounding box is b = (x, y, w, h), where:
x = x0 + w0·Δx
y = y0 + h0·Δy
w = w0·exp(Δw)
h = h0·exp(Δh)
Here x and y are the coordinates of the upper-left corner of the predicted text box, and w and h are its width and height.
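For illustration, this default-box decoding can be sketched in a few lines of Python; the function name decode_box is hypothetical, and the exponential width/height form follows the equations above:

```python
import math

def decode_box(default_box, offsets):
    """Decode a predicted text box from a default box and predicted offsets.

    default_box: (x0, y0, w0, h0) -- upper-left corner plus width and height.
    offsets: (dx, dy, dw, dh) -- offsets produced by an output layer.
    """
    x0, y0, w0, h0 = default_box
    dx, dy, dw, dh = offsets
    x = x0 + w0 * dx          # the shift is scaled by the default-box size
    y = y0 + h0 * dy
    w = w0 * math.exp(dw)     # width and height are encoded logarithmically
    h = h0 * math.exp(dh)
    return (x, y, w, h)

# Example: a 100x40 default box with small predicted offsets.
print(decode_box((50, 80, 100, 40), (0.1, -0.05, 0.2, 0.0)))
```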
For the feature-layer-fusion neural network, a sampling strategy is set to select positive and negative samples (a code sketch follows this list). The specific steps are:
(1) Generate default boxes on the feature map of each output layer in sliding-window fashion: a feature map of size N × N has N × N feature points, and, according to the aspect ratios of the target objects, each feature point corresponds to six default boxes of different aspect ratios:
ar = {a1, a2, a3, a4, a5, a6}
(2) Establish the relationship between the ground-truth boxes of the target objects in the image and the default boxes, and label the default boxes, using the Jaccard overlap as the matching index: the higher the Jaccard overlap, the more similar the samples and the better the match. Given a default box A and a ground-truth box B, the Jaccard overlap of the default box and the ground-truth box is the ratio of the intersection area to the union area of A and B:
J(A, B) = area(A ∩ B) / area(A ∪ B)
Default boxes whose Jaccard overlap is greater than or equal to 0.5 are taken as matched default boxes, and those whose Jaccard overlap is below 0.5 as unmatched default boxes. Matched default boxes serve as positive samples and unmatched ones as negative samples.
(3) After the samples are labeled, sort the negative default boxes by confidence loss and select the default boxes with the highest confidence-loss values as the negative samples for network training, keeping the ratio of positive to negative training samples at 1:3.
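A minimal sketch of this matching and hard-negative-mining strategy, assuming boxes are (x, y, w, h) tuples and per-box confidence losses are already computed; all names here are illustrative, not from the patent:

```python
def jaccard(a, b):
    """Jaccard overlap of two boxes given as (x, y, w, h) tuples."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def split_samples(default_boxes, gt_boxes, conf_losses, neg_ratio=3):
    """Label default boxes (matched if Jaccard overlap >= 0.5 with any
    ground-truth box) and keep only the hardest negatives at a 1:3
    positive:negative ratio."""
    positives, negatives = [], []
    for idx, d in enumerate(default_boxes):
        if any(jaccard(d, g) >= 0.5 for g in gt_boxes):
            positives.append(idx)
        else:
            negatives.append(idx)
    # Hard negative mining: keep the negatives with the highest confidence loss.
    negatives.sort(key=lambda idx: conf_losses[idx], reverse=True)
    return positives, negatives[: neg_ratio * max(len(positives), 1)]
```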
For the feature-fusion network, the objective function is set as follows:
(1) The target loss function is the weighted sum of the localization loss and the confidence loss:
L(x, c, l, g) = (1/N) · (Lconf(x, c) + α · Lloc(x, l, g))
where x is the matching-result matrix, c the confidence, l the predicted position, g the ground-truth position of the target, and N the number of default boxes matched to ground-truth boxes; the weight coefficient α is set to 1.
(2) The localization loss Lloc is the L2 loss between the predicted position and the ground-truth position of the target, and the confidence loss Lconf is the softmax loss of the two-class classification (text versus non-text).
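A hedged PyTorch sketch of this objective, reading the description as an L2 localization loss plus a two-class softmax loss; the signatures and tensor shapes are assumptions made for illustration:

```python
import torch.nn.functional as F

def detection_loss(pred_loc, gt_loc, pred_conf, labels, alpha=1.0):
    """Weighted sum of localization and confidence losses.

    pred_loc:  (P, 4) predicted offsets for the matched (positive) boxes.
    gt_loc:    (P, 4) encoded ground-truth offsets for the same boxes.
    pred_conf: (M, 2) class scores (non-text / text) for all kept boxes.
    labels:    (M,)  0 for negative samples, 1 for positive samples.
    """
    n = max(pred_loc.size(0), 1)  # N: number of matched default boxes
    loc_loss = F.mse_loss(pred_loc, gt_loc, reduction="sum")         # L2 loss
    conf_loss = F.cross_entropy(pred_conf, labels, reduction="sum")  # softmax loss
    return (conf_loss + alpha * loc_loss) / n
```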
For the feature-fusion network, multiple output layers predict target bounding boxes of different scales. Setting the scale of the object bounding boxes produced by each output layer comprises the following steps (a short sketch follows this list):
(1) Select as the output layers of the network the highest feature layer and the feature layers formed by fusing the highest feature layer with the other feature layers.
(2) Set the size of the default boxes in each output layer; the output layers produce the offset coordinates of the object bounding boxes relative to the default boxes together with confidences, giving candidate object bounding boxes. Suppose the network has m output layers, each corresponding to one feature map; the default-box scale on the k-th feature map is
sk = smin + ((smax − smin) / (m − 1)) · (k − 1),  k ∈ [1, m]
and the width and height of each default box with aspect ratio ar are
wk = sk · √ar,  hk = sk / √ar
where smin and smax are the default-box scales of the lowest and the highest layer respectively. Low-level output layers predict small-scale target objects and high-level output layers predict large-scale target objects. The default boxes of the output layers thus have different scales on different feature maps and different aspect ratios within the same feature map; correspondingly, the whole network can predict targets of different scales and shapes through its multiple output layers.
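The two formulas above can be enumerated directly. The sketch below is a plain-Python illustration; the concrete values of smin, smax and the six aspect ratios are placeholders, since the patent does not fix them:

```python
import math

def default_box_sizes(m, s_min=0.2, s_max=0.9,
                      aspect_ratios=(1.0, 2.0, 3.0, 1 / 2, 1 / 3, 5.0)):
    """Yield (output layer k, width, height) for every default-box scale/ratio."""
    for k in range(1, m + 1):
        s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1)  # scale of layer k
        for ar in aspect_ratios:
            yield k, s_k * math.sqrt(ar), s_k / math.sqrt(ar)

# Example: 4 output layers; low layers get small boxes, high layers large ones.
for k, w, h in default_box_sizes(m=4):
    print(f"layer {k}: w={w:.3f}, h={h:.3f}")
```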
Further, the multiple candidate target bounding boxes output by the feature-fusion network are post-processed with a bounding-box fusion algorithm to obtain the final positions of the image targets. The specific steps of the bounding-box fusion algorithm are as follows (a code sketch of the whole procedure follows this list):
(1) Sort the candidate bounding boxes of the target from high to low confidence, and take the first candidate box as the current fused box.
(2) Take the remaining candidate boxes in turn as boxes to be fused, and compare the confidences of the current fused box and the box to be fused. If the confidences of both boxes exceed the threshold α, compute the area overlap rate of the current fused box and the box to be fused; otherwise, execute step (3). The area overlap rate is the ratio of the overlapping area of the two boxes to the area of their union:
IOU(C, G) = area(C ∩ G) / area(C ∪ G)
where area(C) and area(G) are the areas of text boxes C and G respectively.
(3) If the area overlap rate of the two candidate boxes is greater than or equal to the threshold β, merge the two boxes: the merged box is the circumscribed rectangle of the two boxes, and its confidence is the confidence of the current fused box.
(4) If the area overlap rate of the two candidate boxes is less than the threshold β, compute the containment overlap rate of the two boxes; if the containment overlap rate of the two boxes exceeds the threshold γ, remove the contained box; otherwise, execute step (5). The containment overlap rate is the ratio of the overlapping area of the two boxes to the area of one of the boxes:
Ii(ti, tj) = area(ti ∩ tj) / area(ti)
where area(ti) and area(tj) are the areas of rectangles ti and tj, and Ii(ti, tj) denotes the containment overlap rate of ti relative to tj.
(5) If only the last text box remains, the algorithm ends, and the text boxes whose confidence is higher than the threshold δ are selected as the final target detection results; otherwise, update the candidate bounding boxes of the image target and, following the earlier ordering, take the next box that has not been fused as the current fused box and execute step (2).
The feature-fusion network outputs the candidate bounding boxes of the targets, and the bounding-box fusion algorithm processes these candidates to obtain the final detection result for the image targets.
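The control flow between steps (2) and (5) leaves some room for interpretation; the sketch below is one plausible Python reading of the fusion algorithm, with hypothetical names and placeholder values for the thresholds α, β, γ, δ:

```python
def intersection(a, b):
    """Intersection area of two boxes given as (x, y, w, h)."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    return iw * ih

def iou(a, b):
    """Area overlap rate: intersection over union."""
    inter = intersection(a, b)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def containment(a, b):
    """Containment overlap rate of box a relative to box b: intersection over area(a)."""
    area_a = a[2] * a[3]
    return intersection(a, b) / area_a if area_a > 0 else 0.0

def enclosing(a, b):
    """Circumscribed rectangle of two boxes."""
    x1, y1 = min(a[0], b[0]), min(a[1], b[1])
    x2 = max(a[0] + a[2], b[0] + b[2])
    y2 = max(a[1] + a[3], b[1] + b[3])
    return (x1, y1, x2 - x1, y2 - y1)

def fuse_boxes(candidates, alpha=0.5, beta=0.5, gamma=0.8, delta=0.6):
    """Merge candidate (box, confidence) pairs following steps (1)-(5)."""
    boxes = sorted(candidates, key=lambda bc: bc[1], reverse=True)  # step (1)
    i = 0
    while i < len(boxes):
        cur, cur_conf = boxes[i]
        j = i + 1
        while j < len(boxes):
            other, other_conf = boxes[j]
            if cur_conf > alpha and other_conf > alpha:            # step (2)
                if iou(cur, other) >= beta:                        # step (3)
                    cur = enclosing(cur, other)                    # merge
                    boxes[i] = (cur, cur_conf)
                    boxes.pop(j)
                    continue
                if containment(other, cur) > gamma:                # step (4)
                    boxes.pop(j)                                   # drop contained box
                    continue
            j += 1
        i += 1                                                     # step (5): next unfused box
    return [bc for bc in boxes if bc[1] > delta]                   # final δ filter
```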
Compared with the prior art, the present invention has the following advantages and effects. The proposed image target detection method locates the positions of target objects in natural scenes. The method predicts the positions of the target objects directly with multiple output layers of a single neural network, giving high recognition efficiency, while only one post-processing algorithm is needed to merge all candidate bounding boxes into the final image target detection result.
Description of the drawings
Fig. 1 is the flow chart of the text target detection method according to the technical solution of the present invention.
Fig. 2 shows the network structure of the feature-fusion network according to the technical solution of the present invention.
Fig. 3 shows the output layers of the feature-fusion network according to the technical solution of the present invention.
Fig. 4 shows the sampling scheme of the feature-fusion network according to the technical solution of the present invention.
Fig. 5 shows the candidate bounding boxes of text targets output by the feature-fusion network according to the technical solution of the present invention.
Fig. 6 is the flow chart of the bounding-box fusion algorithm according to the technical solution of the present invention.
Fig. 7 shows detection results processed with the bounding-box fusion algorithm according to the technical solution of the present invention.
Specific embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments; the following embodiments explain the invention, and the invention is not limited to them.
The present invention detects text targets by combining a feature-fusion network with a bounding-box fusion algorithm, in two main steps: (1) predict the locations of the image targets with the feature-fusion network to obtain the candidate bounding boxes of the text targets; (2) obtain the final detection result with the bounding-box fusion algorithm. Fig. 1 shows the flow chart of the text target detection of the present invention.
With the development of the Internet and multimedia technology, more and more information carriers exist in the form of images, and image target detection is widely applied in real life. Traditional text detection algorithms need a large number of heuristic rules to screen text regions, with limited effect; the method of the present patent, based on deep learning, builds an end-to-end feature-layer fusion network that can directly predict the positions and confidences of text targets in an image.
A neural network based on feature-layer fusion is built; Fig. 2 shows its network structure. As the network deepens, the feature-map scale in the feature layers shrinks while the representational power of the feature maps grows, so fusing high-level feature layers with low-level feature layers into new feature layers used as output layers enhances the representational power of the output layers. As shown in Fig. 3, the fusion network has two kinds of connections in its overall structure: bottom-up connections and top-down connections. The bottom-up path is the feed-forward pass of the network: feature maps shrink after convolutional and pooling layers, so the whole network has a pyramid-shaped hierarchy. The top-down connections use deconvolution to fuse the high-level features of the network into the low-level feature layers and build new output layers. As shown in Fig. 3, the output layers of the fusion network are A, B', C' and D': feature layers A and B fuse to form the new feature layer B', A and C fuse to form C', and A and D fuse to form D'; since feature layer A is the highest feature layer, it is also kept as an output layer of the network.
The feature-fusion network is constructed as follows (a code sketch of one fusion branch follows this list):
Step (1): Build a feed-forward convolutional neural network whose front end is VGG-16, with the last two fully connected layers replaced by convolutional layers; after the front-end structure, append additional convolutional and pooling layers.
Step (2): On top of the feed-forward network, insert a deconvolution layer between the highest feature layer and each of the other feature layers, so that the scale of the deconvolved feature map matches the scale of the feature map in the low-level feature layer. The deconvolution operation in the deconvolution layer, similar to bilinear interpolation, selectively enlarges a feature map so that the feature-map scale of the highest feature layer reaches the low-level scale. The output feature-map size of a deconvolution layer is
o = s · (i − 1) + k − 2p
where i is the size of the input feature map of the deconvolution layer, k the kernel size, s the stride, and p the padding. Given the sizes of the input and output feature maps, the high-level feature layer can be brought to the same size as a low-level feature map by setting the corresponding deconvolution parameters.
Step (3): Fuse the deconvolved feature map with the feature map of the low-level feature layer by element-wise product to obtain a new feature layer, which serves as an output layer producing the positions and confidences of the target objects. The element-wise product of two feature maps is equivalent to the Hadamard product of two matrices, multiplying corresponding elements.
Step (4): Define a series of fixed-size default boxes on each output layer; the output layer produces the text confidence and the offset coordinates relative to the default boxes. Suppose the sizes of the image and of the feature map are (wim, him) and (wmap, hmap) respectively, and that position (i, j) in the feature map corresponds to a default box b0 = (x0, y0, w0, h0). The output of the output layer is (Δx, Δy, Δw, Δh, c), where (Δx, Δy, Δw, Δh) are the offsets of the predicted text bounding box relative to the default box and c is the text confidence. The predicted text bounding box is b = (x, y, w, h), where:
x = x0 + w0·Δx
y = y0 + h0·Δy
w = w0·exp(Δw)
h = h0·exp(Δh)
Here x and y are the coordinates of the upper-left corner of the predicted text box, and w and h are its width and height.
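To make steps (2) and (3) concrete, here is a minimal PyTorch sketch of a single fusion branch; the channel counts and spatial sizes are illustrative and not taken from the patent:

```python
import torch
import torch.nn as nn

class FusionBranch(nn.Module):
    """Fuse the highest feature layer into one lower feature layer.

    The deconvolution upsamples the high-level map to the low-level spatial
    size (here 2x, with k=2, s=2, p=0, so o = s(i-1) + k - 2p = 2i), and the
    two maps are then combined by element-wise (Hadamard) product.
    """
    def __init__(self, high_ch=512, low_ch=256):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(high_ch, low_ch, kernel_size=2, stride=2)

    def forward(self, high, low):
        up = self.deconv(high)   # match the low-level feature-map scale
        return up * low          # element-wise product -> new output layer

high = torch.randn(1, 512, 8, 8)    # highest feature layer A
low = torch.randn(1, 256, 16, 16)   # a lower feature layer, e.g. B
fused = FusionBranch()(high, low)   # new output layer, e.g. B'
print(fused.shape)                  # torch.Size([1, 256, 16, 16])
```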
To set the sampling strategy of the feature-layer fusion network and obtain positive and negative samples, default boxes must be defined on the feature maps of the output layers, and the relationship between the ground-truth boxes of the target objects in the image and the default boxes must be established to select positive and negative samples. The specific steps are:
Step (1): Generate default boxes on the feature map of each output layer in sliding-window fashion: a feature map of size N × N has N × N feature points, and, according to the aspect ratios of the target objects, each feature point corresponds to six default boxes of different aspect ratios:
ar = {a1, a2, a3, a4, a5, a6}
Step (2): Establish the relationship between the ground-truth boxes of the target objects in the image and the default boxes, and label the default boxes, using the Jaccard overlap as the matching index: the higher the Jaccard overlap, the more similar the samples and the better the match. Given a default box A and a ground-truth box B, the Jaccard overlap of the default box and the ground-truth box is the ratio of the intersection area to the union area of A and B:
J(A, B) = area(A ∩ B) / area(A ∪ B)
Default boxes whose Jaccard overlap is greater than or equal to 0.5 are taken as matched default boxes, and those below 0.5 as unmatched default boxes; matched default boxes serve as positive samples and unmatched ones as negative samples.
To select positive and negative samples for the fusion network when detecting the text targets in an image, the relationship between the ground-truth boxes of the image and the default boxes must be established, as in Fig. 4. In Fig. 4(a), the ground-truth box of the text target "Marlboro" is the upper solid box in the figure, and the ground-truth box of the text "LIGHTS" is the lower solid box. The dashed boxes in Fig. 4(b) and Fig. 4(c) show the default boxes on a feature map of size 8 × 8 and a feature map of size 4 × 4 respectively. The text "LIGHTS" matches two dashed boxes and the text "Marlboro" matches one dashed box; the matched default boxes are labeled as positive samples and the unmatched default boxes as negative samples.
Step (3): After the samples are labeled, sort the negative default boxes by confidence loss and select the default boxes with the highest confidence-loss values as the negative samples for network training, keeping the ratio of positive to negative training samples at 1:3.
For the feature-fusion network, the objective function is set as follows, specifically comprising:
(1): The target loss function is set as the weighted sum of the localization loss and the confidence loss:
L(x, c, l, g) = (1/N) · (Lconf(x, c) + α · Lloc(x, l, g))
where x is the matching-result matrix, c the confidence, l the predicted position, g the ground-truth position of the target, and N the number of default boxes matched to ground-truth boxes; the weight coefficient α is set to 1.
(2): The localization loss Lloc is the L2 loss between the predicted position and the ground-truth position of the target, and the confidence loss Lconf is the softmax loss of the two-class classification.
Since different output layers in the network correspond to feature maps of different scales, different output layers predict targets of different scales: high-level output layers predict large-scale target objects, and low-level output layers predict small-scale target objects. Setting the scale of the object bounding boxes produced by the output layers of the feature-fusion network (the candidate boxes of the fusion network are shown in Fig. 5) specifically comprises the following steps:
(1) Select as the output layers of the network the highest feature layer and the feature layers formed by fusing the highest feature layer with the other feature layers.
(2) Different output layers in the network correspond to feature maps of different scales. Suppose the network has m output layers, each corresponding to one feature map; the default-box scale on the k-th feature map is
sk = smin + ((smax − smin) / (m − 1)) · (k − 1),  k ∈ [1, m]
and the width and height of each default box with aspect ratio ar are
wk = sk · √ar,  hk = sk / √ar
where smin and smax are the default-box scales of the lowest and the highest layer respectively. Low-level output layers predict small-scale target objects and high-level output layers predict large-scale target objects. The default boxes of the output layers thus have different scales on different feature maps and different aspect ratios within the same feature map; correspondingly, the whole network can predict text of different scales and shapes through its multiple output layers.
The feature-fusion network predicts the bounding boxes of the target objects directly with its multiple output layers, and each bounding box receives a confidence score. The boxes predicted by the output layers may overlap one another; the bounding-box fusion algorithm selects the boxes with higher confidence within a neighborhood and merges overlapping candidate boxes to obtain the optimal target positions, specifically comprising the following steps:
(1) Sort the candidate bounding boxes of the text target from high to low confidence, and take the first candidate box as the current fused box.
(2) Take the remaining candidate boxes in turn as boxes to be fused, and compare the confidences of the current fused box and the box to be fused. If the confidences of both boxes exceed the threshold α, compute the area overlap rate of the current fused box and the box to be fused; otherwise, execute step (3). The area overlap rate is the ratio of the overlapping area of the two boxes to the area of their union:
IOU(C, G) = area(C ∩ G) / area(C ∪ G)
where area(C) and area(G) are the areas of text boxes C and G respectively.
(3) If the area overlap rate of the two candidate boxes is greater than or equal to the threshold β, merge the two boxes: the merged box is the circumscribed rectangle of the two boxes, and its confidence is the confidence of the current fused box.
(4) If the area overlap rate of the two candidate boxes is less than the threshold β, compute the containment overlap rate of the two boxes; if the containment overlap rate of the two boxes exceeds the threshold γ, remove the contained box; otherwise, execute step (5). The containment overlap rate is the ratio of the overlapping area of the two boxes to the area of one of the boxes:
Ii(ti, tj) = area(ti ∩ tj) / area(ti)
where area(ti) and area(tj) are the areas of rectangles ti and tj, and Ii(ti, tj) denotes the containment overlap rate of ti relative to tj.
(5) If only the last text box remains, the algorithm ends, and the text boxes whose confidence is higher than the threshold δ are selected as the final target detection results; otherwise, update the candidate bounding boxes of the image target and, following the earlier ordering, take the next box that has not been fused as the current fused box and execute step (2).
Two bounding boxes are merged with the bounding-box fusion algorithm above; the flow chart of the algorithm is shown in Fig. 6, where IOU(ti, tj) denotes the IoU overlap rate of boxes ti and tj, Fusion(ti, tj) denotes the box obtained after merging ti and tj, i.e., the circumscribed rectangle of the two boxes, and Ii(ti, tj) and Ij(ti, tj) denote the containment overlap rates of ti and tj respectively. The bounding-box fusion algorithm involves three thresholds: the confidence threshold α, the IoU overlap-rate threshold β, and the containment overlap-rate threshold γ. The confidence threshold decides whether two bounding boxes are merged: only when the confidences of both boxes exceed α are the two boxes fused.
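Continuing the fuse_boxes sketch given earlier, a small usage example with invented boxes and confidences:

```python
candidates = [
    ((10, 10, 100, 30), 0.95),  # strong detection of a text line
    ((15, 12, 95, 28), 0.90),   # overlapping duplicate, merged into the first
    ((300, 50, 40, 20), 0.40),  # low confidence, removed by the final δ filter
]
final = fuse_boxes(candidates, alpha=0.5, beta=0.5, gamma=0.8, delta=0.6)
print(final)  # [((10, 10, 100, 30), 0.95)]: one box covering both detections
```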
Fig. 7 shows the final text target detection results obtained with the bounding-box fusion algorithm. The bounding-box fusion algorithm exploits the positional relationships and confidences of neighboring candidate boxes, merging the candidate boxes into the final image target detection result. The above content of this specification is only an illustration of the present invention.
Those skilled in the art can make various modifications or supplements to the described specific embodiments, or substitute them in a similar way; as long as these do not depart from the content of the description of the invention or exceed the scope defined by the claims, they fall within the protection scope of the invention.

Claims (4)

1. A method for detecting text targets in an image, characterized in that it comprises the following steps:
Step 1: construct an end-to-end convolutional neural network based on feature-layer fusion for predicting text targets of different scales in an image;
Step 2: from the candidate boxes output by the feature-fusion network, obtain the final text target detection result of the image using a bounding-box fusion algorithm.
2. The method for detecting text targets in an image according to claim 1, characterized in that constructing the end-to-end convolutional neural network based on feature-layer fusion, for detecting the positions of the text targets in the image, specifically comprises the following steps:
(1) build a feed-forward convolutional neural network whose front end is VGG-16, in which the last two fully connected layers are replaced by convolutional layers, and append additional convolutional and pooling layers after the front-end structure;
(2) on top of the feed-forward network, insert a deconvolution layer between the highest feature layer and each of the other feature layers, so that the scale of the deconvolved feature map matches the scale of the feature map in the low-level feature layer;
(3) fuse the deconvolved feature map with the feature map of the low-level feature layer by element-wise product to obtain a new feature layer, the new feature layer serving as an output layer that produces the positions and confidences of the target objects;
(4) define a series of fixed-size default boxes on each output layer, the output layer producing the text confidence and the offset coordinates relative to the default boxes.
3. The method for detecting text targets in an image according to claim 2, characterized in that, for the convolutional neural network based on feature-layer fusion, setting the scale of the target bounding boxes produced by the output layers of the feature-fusion network specifically comprises:
(1) select as the output layers of the network the highest feature layer and the feature layers formed by fusing the highest feature layer with the other feature layers;
(2) set the size of the default boxes in each output layer; the output layers produce offset coordinates relative to the default boxes and confidences, giving candidate target bounding boxes; the low-level output layers are set to predict small-scale text target objects and the high-level output layers to predict large-scale text target objects.
4. The method for detecting text targets in an image according to claim 1, characterized in that the candidate bounding boxes output by the feature-fusion network are processed with a bounding-box fusion algorithm to obtain the final positions of the text targets, specifically comprising the following steps:
(1) sort the candidate bounding boxes of the text targets from high to low confidence, and take the first candidate box as the current fused box;
(2) take the remaining candidate boxes in turn as boxes to be fused, and compare the confidences of the current fused box and the box to be fused; if the confidences of both boxes exceed the threshold α, compute the area overlap rate of the current fused box and the box to be fused; otherwise, execute step (3);
(3) if the area overlap rate of the two candidate boxes is greater than or equal to the threshold β, merge the two boxes; the merged box is the circumscribed rectangle of the two boxes, and its confidence is the confidence of the current fused box;
(4) if the area overlap rate of the two candidate boxes is less than the threshold β, compute the containment overlap rate of the two boxes; if the containment overlap rate exceeds the threshold γ, remove the contained box; otherwise, execute step (5);
(5) if only the last text box remains, the algorithm ends, and the text boxes whose confidence exceeds the threshold δ are selected as the final target detection result;
otherwise, update the candidate bounding boxes of the text targets and, following the earlier ordering, take the next box that has not been fused as the current fused box and execute step (2).
CN201810520329.9A 2018-05-28 2018-05-28 Method for detecting text targets in an image Pending CN108764228A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810520329.9A CN108764228A (en) 2018-05-28 2018-05-28 Method for detecting text targets in an image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810520329.9A CN108764228A (en) 2018-05-28 2018-05-28 Method for detecting text targets in an image

Publications (1)

Publication Number Publication Date
CN108764228A true CN108764228A (en) 2018-11-06

Family

ID=64005915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810520329.9A Pending CN108764228A (en) 2018-05-28 Method for detecting text targets in an image

Country Status (1)

Country Link
CN (1) CN108764228A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570497A (en) * 2016-10-08 2017-04-19 中国科学院深圳先进技术研究院 Text detection method and device for scene image
CN106650725A (en) * 2016-11-29 2017-05-10 华南理工大学 Full convolutional neural network-based candidate text box generation and text detection method
CN107688808A (en) * 2017-08-07 2018-02-13 电子科技大学 A kind of quickly natural scene Method for text detection
CN107563381A (en) * 2017-09-12 2018-01-09 国家新闻出版广电总局广播科学研究院 The object detection method of multiple features fusion based on full convolutional network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHENG-YANG FU et al., "DSSD: Deconvolutional Single Shot Detector", arXiv *
MINGHUI LIAO et al., "TextBoxes: A Fast Text Detector with a Single Deep Neural Network", Advancement of Artificial Intelligence (AAAI) *
WEI LIU et al., "SSD: Single Shot MultiBox Detector", Springer *

Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299274B (en) * 2018-11-07 2021-12-17 南京大学 Natural scene text detection method based on full convolution neural network
CN109458978A (en) * 2018-11-07 2019-03-12 五邑大学 A kind of Downtilt measurement method based on multiple scale detecting algorithm
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks
TWI706336B (en) * 2018-11-19 2020-10-01 中華電信股份有限公司 Image processing device and method for detecting and filtering text object
CN111222368A (en) * 2018-11-26 2020-06-02 北京金山办公软件股份有限公司 Method and device for identifying document paragraph and electronic equipment
CN111222368B (en) * 2018-11-26 2023-09-19 北京金山办公软件股份有限公司 Method and device for identifying document paragraphs and electronic equipment
CN109918951A (en) * 2019-03-12 2019-06-21 中国科学院信息工程研究所 A kind of artificial intelligence process device side channel system of defense based on interlayer fusion
CN110163081A (en) * 2019-04-02 2019-08-23 宜通世纪物联网研究院(广州)有限公司 Regional invasion real-time detection method, system and storage medium based on SSD
CN110110722A (en) * 2019-04-30 2019-08-09 广州华工邦元信息技术有限公司 A kind of region detection modification method based on deep learning model recognition result
WO2020221298A1 (en) * 2019-04-30 2020-11-05 北京金山云网络技术有限公司 Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
CN110135423A (en) * 2019-05-23 2019-08-16 北京阿丘机器人科技有限公司 The training method and optical character recognition method of text identification network
CN113850264A (en) * 2019-06-10 2021-12-28 创新先进技术有限公司 Method and system for evaluating target detection model
CN110263877B (en) * 2019-06-27 2022-07-08 中国科学技术大学 Scene character detection method
CN110263877A (en) * 2019-06-27 2019-09-20 中国科学技术大学 Scene character detecting method
CN110414417A (en) * 2019-07-25 2019-11-05 电子科技大学 A kind of traffic mark board recognition methods based on multi-level Fusion multi-scale prediction
CN110458170A (en) * 2019-08-06 2019-11-15 汕头大学 Chinese character positioning and recognition methods in a kind of very noisy complex background image
CN112487848B (en) * 2019-09-12 2024-04-26 京东方科技集团股份有限公司 Character recognition method and terminal equipment
CN112487848A (en) * 2019-09-12 2021-03-12 京东方科技集团股份有限公司 Character recognition method and terminal equipment
CN110674804A (en) * 2019-09-24 2020-01-10 上海眼控科技股份有限公司 Text image detection method and device, computer equipment and storage medium
CN110796640A (en) * 2019-09-29 2020-02-14 郑州金惠计算机系统工程有限公司 Small target defect detection method and device, electronic equipment and storage medium
US20220083819A1 (en) * 2019-11-15 2022-03-17 Salesforce.Com, Inc. Image augmentation and object detection
US11710077B2 (en) * 2019-11-15 2023-07-25 Salesforce, Inc. Image augmentation and object detection
CN111046923B (en) * 2019-11-26 2023-02-28 佛山科学技术学院 Image target detection method and device based on bounding box and storage medium
CN111046923A (en) * 2019-11-26 2020-04-21 佛山科学技术学院 Image target detection method and device based on bounding box and storage medium
CN111598082A (en) * 2020-04-24 2020-08-28 云南电网有限责任公司电力科学研究院 Electric power nameplate text detection method based on full convolution network and instance segmentation network
CN111598082B (en) * 2020-04-24 2023-10-17 云南电网有限责任公司电力科学研究院 Electric power nameplate text detection method based on full convolution network and instance segmentation network
CN111783685A (en) * 2020-05-08 2020-10-16 西安建筑科技大学 Target detection improved algorithm based on single-stage network model
CN111680628B (en) * 2020-06-09 2023-04-28 北京百度网讯科技有限公司 Text frame fusion method, device, equipment and storage medium
CN111680628A (en) * 2020-06-09 2020-09-18 北京百度网讯科技有限公司 Text box fusion method, device, equipment and storage medium
CN111986252A (en) * 2020-07-16 2020-11-24 浙江工业大学 Method for accurately positioning candidate bounding box in target segmentation network
CN111986252B (en) * 2020-07-16 2024-03-29 浙江工业大学 Method for accurately positioning candidate bounding boxes in target segmentation network
CN111844101A (en) * 2020-07-31 2020-10-30 中国科学技术大学 Multi-finger dexterous hand sorting planning method
CN111985465A (en) * 2020-08-17 2020-11-24 中移(杭州)信息技术有限公司 Text recognition method, device, equipment and storage medium
CN112419310B (en) * 2020-12-08 2023-07-07 中国电子科技集团公司第二十研究所 Target detection method based on cross fusion frame optimization
CN112419310A (en) * 2020-12-08 2021-02-26 中国电子科技集团公司第二十研究所 Target detection method based on intersection and fusion frame optimization
CN112906699A (en) * 2020-12-23 2021-06-04 深圳市信义科技有限公司 Method for detecting and identifying enlarged number of license plate
WO2022150978A1 (en) * 2021-01-12 2022-07-21 Nvidia Corporation Neighboring bounding box aggregation for neural networks
CN113269049A (en) * 2021-04-30 2021-08-17 天津科技大学 Method for detecting handwritten Chinese character area
CN114359889A (en) * 2022-03-14 2022-04-15 北京智源人工智能研究院 Text recognition method for long text data
CN114898171A (en) * 2022-04-07 2022-08-12 中国科学院光电技术研究所 Real-time target detection method suitable for embedded platform
CN114898171B (en) * 2022-04-07 2023-09-22 中国科学院光电技术研究所 Real-time target detection method suitable for embedded platform
CN115080051A (en) * 2022-05-31 2022-09-20 武汉大学 GUI code automatic generation method based on computer vision

Similar Documents

Publication Publication Date Title
CN108764228A (en) Method for detecting text targets in an image
CN108416394B (en) Multi-target detection model building method based on convolutional neural networks
CN107134144B (en) A kind of vehicle checking method for traffic monitoring
CN108876780B (en) Bridge crack image crack detection method under complex background
CN109784203B (en) Method for inspecting contraband in weak supervision X-ray image based on layered propagation and activation
CN103049763B (en) Context-constraint-based target identification method
CN109583425A (en) A kind of integrated recognition methods of the remote sensing images ship based on deep learning
CN108830188A (en) Vehicle checking method based on deep learning
CN111091105A (en) Remote sensing image target detection method based on new frame regression loss function
CN111275688A (en) Small target detection method based on context feature fusion screening of attention mechanism
CN110097568A (en) A kind of the video object detection and dividing method based on the double branching networks of space-time
CN109977918A (en) A kind of target detection and localization optimization method adapted to based on unsupervised domain
Xu et al. Scale-aware feature pyramid architecture for marine object detection
CN108182454A (en) Safety check identifying system and its control method
CN110046572A (en) A kind of identification of landmark object and detection method based on deep learning
CN111079602A (en) Vehicle fine granularity identification method and device based on multi-scale regional feature constraint
CN107729801A (en) A kind of vehicle color identifying system based on multitask depth convolutional neural networks
CN107945153A (en) A kind of road surface crack detection method based on deep learning
CN109753949B (en) Multi-window traffic sign detection method based on deep learning
CN107092870A (en) A kind of high resolution image semantics information extracting method and system
CN112560675B (en) Bird visual target detection method combining YOLO and rotation-fusion strategy
CN105528575A (en) Sky detection algorithm based on context inference
CN107545571A (en) A kind of image detecting method and device
CN106991049A (en) A kind of Software Defects Predict Methods and forecasting system
CN110929746A (en) Electronic file title positioning, extracting and classifying method based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181106