CN107564032A

CN107564032A - A video tracking object segmentation method based on an appearance network

Info

Publication number: CN107564032A
Application number: CN201710780214.9A
Authority: CN
Inventors: 夏春秋
Original assignee: Shenzhen Vision Technology Co Ltd
Current assignee: Shenzhen Vision Technology Co Ltd
Priority date: 2017-09-01
Filing date: 2017-09-01
Publication date: 2018-01-09

Abstract

The present invention proposes a video tracking object segmentation method based on an appearance network. Its main components are an appearance network, an object detection network, bounding box filtering, and training. Each input frame is first passed through a learned appearance network for class-independent object segmentation, from which the final pooling layer and the fully connected layers have been removed. Skip connections allow multi-resolution spatial information to flow from the shallow layers to the end of the network, where the side outputs are concatenated and passed through a fusion convolutional layer that outputs the network prediction. The frame is then passed through an instance-level semantic object detection network. Foreground appearance segmentation yields an appearance map, the bounding boxes are then filtered, and the final segmentation image is obtained. The invention combines the outputs of an appearance network that is trained once and a semantic instance detection network while applying temporal constraints to the result, which speeds up training of the appearance network, improves the precision of detection and segmentation, and greatly improves accuracy.

Description

A video tracking object segmentation method based on an appearance network

Technical field

The present invention relates to the field of video object segmentation, and more particularly to a video tracking object segmentation method based on an appearance network.

Background art

Video object segmentation is a fundamental problem in computer vision and one of the frontiers and focal points of current video signal processing research. It refers to partitioning a video in the spatio-temporal domain into a combination of video semantic objects, that is, dividing each video frame into several distinct semantic object regions so that the video can be processed flexibly. Video object segmentation has broad application prospects, including video coding, video retrieval, multimedia operations, image processing, pattern recognition, video compression coding, and video database operations. It can also be used in practical settings such as traffic flow video monitoring, industrial automation monitoring, security, and networked multimedia interaction. Because segmentation quality directly affects later processing, research on video object segmentation technology is both important and challenging. When a video contains multiple instances of the same object class as the annotated object, the single network used by conventional methods may mistakenly identify all or several of those instances as part of the object, which lowers segmentation precision and accuracy.

The present invention proposes a video tracking object segmentation method based on an appearance network. Each input frame is first passed through a learned appearance network for class-independent object segmentation, from which the final pooling layer and the fully connected layers have been removed. Skip connections allow multi-resolution spatial information to flow from the shallow layers to the end of the network, where the side outputs are concatenated and passed through a fusion convolutional layer that outputs the network prediction. The frame is then passed through an instance-level semantic object detection network. Foreground appearance segmentation yields an appearance map, the bounding boxes are then filtered, and the final segmentation image is obtained. The invention combines the outputs of an appearance network that is trained once and a semantic instance detection network while applying temporal constraints to the result, which speeds up training of the appearance network, improves the precision of detection and segmentation, and greatly improves accuracy.

Summary of the invention

To address the problem of low segmentation precision and accuracy, the object of the present invention is to provide a video tracking object segmentation method based on an appearance network. Each input frame is first passed through a learned appearance network for class-independent object segmentation, from which the final pooling layer and the fully connected layers have been removed. Skip connections allow multi-resolution spatial information to flow from the shallow layers to the end of the network, where the side outputs are concatenated and passed through a fusion convolutional layer that outputs the network prediction. The frame is then passed through an instance-level semantic object detection network. Foreground appearance segmentation yields an appearance map, the bounding boxes are then filtered, and the final segmentation image is obtained.

To solve the above problem, the present invention provides a video tracking object segmentation method based on an appearance network, whose main content includes:

(1) an appearance network;

(2) an object detection network;

(3) bounding box filtering;

(4) training.

In the appearance network, each input frame is first passed through a learned appearance network for class-independent object segmentation. The network is based on the VGG16 convolutional network architecture, converted into a fully convolutional network. Unlike a standard fully convolutional network, the final pooling layer and the fully connected layers are removed entirely in order to preserve spatial resolution.

Skip connections allow multi-resolution spatial information to flow from the shallow layers to the end of the network, which improves segmentation precision on object contour details. More specifically, the final feature map of each VGG16 stage, taken before the pooling layer, is convolved with a single 1 × 1 kernel to obtain a grayscale segmentation probability map of the same size as the current downsampling stage, which is then upsampled to the original image size with a bilinear filter.
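
The side-output step described above can be illustrated with a minimal pure-Python sketch. It is not the patented implementation: the function names `conv1x1` and `bilinear_upsample`, the nested-list tensor layout (channels × height × width), and the corner-aligned interpolation convention are all assumptions made for illustration.

```python
def conv1x1(features, weights, bias=0.0):
    """Collapse a C x H x W feature map into one H x W score map
    using a single 1 x 1 kernel (one weight per channel)."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            bias + sum(weights[c] * features[c][y][x] for c in range(channels))
            for x in range(width)
        ]
        for y in range(height)
    ]


def bilinear_upsample(grid, out_h, out_w):
    """Resize an H x W grid to out_h x out_w by bilinear interpolation.
    Assumption: corner pixels of input and output are aligned."""
    in_h, in_w = len(grid), len(grid[0])
    result = []
    for i in range(out_h):
        # Map the output row back to a fractional input row.
        fy = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for j in range(out_w):
            fx = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        result.append(row)
    return result
```

Under the same assumptions, stacking the upsampled side outputs as channels and calling `conv1x1` on the stack once more would play the role of a fusion layer; a real implementation would use a deep learning framework rather than nested lists.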

Finally, these side outputs are concatenated at the end of the network and passed through a fusion convolutional layer that outputs the network prediction: a full-size grayscale segmentation probability map. To achieve pixel-level segmentation, the softmax classifier is replaced by a class-balanced sigmoid cross-entropy loss layer that provides a binary classification mask.
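
The text does not give the formula for the class-balanced sigmoid cross-entropy loss. The sketch below assumes one common weighting, in which foreground pixels are weighted by the fraction of background pixels and vice versa; the function names and the flat-list input format are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def class_balanced_bce(logits, labels, eps=1e-12):
    """Class-balanced sigmoid cross-entropy over flattened pixels.

    logits: raw network scores, one per pixel.
    labels: 1 for foreground, 0 for background.
    Assumed weighting: beta = (background pixels) / (all pixels) scales the
    foreground term, and (1 - beta) scales the background term.
    """
    total = len(labels)
    positives = sum(labels)
    beta = (total - positives) / total
    loss = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        if y == 1:
            loss -= beta * math.log(p + eps)
        else:
            loss -= (1.0 - beta) * math.log(1.0 - p + eps)
    return loss
```

The balancing matters because the annotated object usually covers a small part of the frame; without it, a network that predicts background everywhere would already achieve a low loss.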

In the object detection network, the frame is next passed through an instance-level semantic object detection network. The network takes the original RGB image as input and produces a set of bounding boxes, one for each object it finds that belongs to the set of classes it supports. The object detection network can separate instances of the same object class, which makes it possible to select the correct instance in videos where at least one similar instance is also selected by the appearance network.

The bounding box filtering includes an appearance-based filter, a temporal filter, and a connected component filter.

Further, in the appearance-based filter, after the input frame has passed through both networks, an initial segmentation prediction map is obtained from the appearance network alone, together with bounding box proposals for the objects recognized by the semantic detection network. A method for combining the results of the two networks is proposed, which refines the final predicted object segmentation map for each frame of the video.

Further, in the method for combining the results of the two networks, the ground truth annotation of the first image is first used to select the bounding box that belongs to the annotated object. The correct bounding box is then selected in subsequent frames by searching for the bounding box proposal that best matches the appearance map and by enforcing temporal consistency on these detections.

For the first image, the semantic detection (bounding box) that has the best overlap with the object segmentation given by the first-frame ground truth is selected. The selected class is stored in memory so that it can be searched for in subsequent frames.

For all subsequent frames, only detections of the class found in the first frame are of interest, and the rest are discarded. Among the remaining detection proposals, the detection that best fits the appearance map prediction is selected according to the intersection over union (IoU) between each bounding box proposal and the appearance map.
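
The selection rule above can be sketched as follows. The text does not say how the IoU between a box and the appearance map is computed; this sketch assumes the (binarized) map is first reduced to the tight bounding box of its foreground pixels. The box format `(x1, y1, x2, y2)` with exclusive upper bounds and all function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_to_box(mask):
    """Tight box around the nonzero pixels of a 2D mask, or None if empty."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


def select_detection(detections, reference_box, allowed_class=None):
    """Pick the (class_name, box) detection that best overlaps reference_box.

    First frame: reference_box comes from the ground truth mask and
    allowed_class is None. Later frames: reference_box comes from the
    appearance map and allowed_class is the class stored from the first frame.
    """
    candidates = [
        d for d in detections if allowed_class is None or d[0] == allowed_class
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda d: iou(d[1], reference_box))
```

A typical use would call `select_detection` once on the first frame, store the returned class name, and pass it as `allowed_class` for every later frame.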

Further, in the temporal filter, even when the correct bounding box of a semantic object was selected in the previous frame, the selection may switch to another object instance whose semantic bounding box overlaps heavily with the appearance prediction. To further ensure that the correct bounding box is selected, detections are filtered by an intersection over union (IoU) threshold between the object position in the current frame and in the previous frame, so that the correct bounding box is tracked over time.

If the semantic object detection fails to detect any object in the first frame, the first-frame annotation is used instead to define the bounding box. Then, for all subsequent frames, the connected components that intersect the previous bounding box are found, all other segments are deleted, and finally a new bounding box is selected according to the selected connected components. At the end of this step, an appearance map and a correct semantic bounding box detection of the annotated object are obtained.
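
A minimal sketch of the temporal filter follows. The IoU threshold value is not given in the text, so `min_iou=0.5` is an assumed placeholder, as are the function names and the `(x1, y1, x2, y2)` box format. The fallback path for a failed first-frame detection is not covered here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def temporal_filter(boxes, previous_box, min_iou=0.5):
    """Keep only boxes that overlap the previous frame's box enough.
    min_iou is an assumed value; the source does not state the threshold."""
    return [b for b in boxes if iou(b, previous_box) >= min_iou]


def track_box(boxes, previous_box, appearance_box, min_iou=0.5):
    """Choose the current-frame box: first enforce temporal consistency,
    then take the survivor that best matches the appearance prediction.
    Returns None when no candidate passes the temporal filter."""
    survivors = temporal_filter(boxes, previous_box, min_iou)
    if not survivors:
        return None
    return max(survivors, key=lambda b: iou(b, appearance_box))
```

In this sketch a detection that matches the appearance map well but jumps far from the previous position is rejected, which is the instance-switching case the temporal filter is meant to prevent.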

Further, in the connected component filter, the final step of the algorithm uses the detection selected in the previous steps to restrict and enhance the segmentation map obtained from the appearance network. The appearance map is filtered using the bounding box, which removes background noise.

To obtain the final predicted (that is, binary) segmentation mask, the appearance segmentation map is thresholded twice, with a low threshold and a high threshold. Each resulting mask is then divided into its connected components.

Further, regarding the low and high thresholds, the first pass uses the high-threshold mask and deletes all components that do not intersect the bounding box selected in the previous step. This restriction filters out wrongly segmented instances that resemble the annotated object, or simply filters out noise.

In the second pass, connected components of the low-threshold mask that intersect the mask obtained in the first pass are added to the final segmentation mask.

This enhancement provides a looser threshold within the selected bounding box. It follows the Canny edge detector, which distinguishes strong and weak edges and selects weak edges only when they are connected to strong edges. Strong and weak (high and low confidence) segmentation pixels bounded by the selected bounding region of the segmentation map are found, and weak pixels are selected when their connected component intersects strong pixels.
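
The two-pass thresholding can be sketched in pure Python as below. The threshold values (`low=0.3`, `high=0.7`), the use of 4-connectivity, the box format `(x1, y1, x2, y2)` with exclusive upper bounds, and the function names are assumptions for illustration; the text specifies none of them.

```python
from collections import deque


def connected_components(mask):
    """Return the 4-connected components of truthy pixels as sets of (row, col)."""
    height, width = len(mask), len(mask[0])
    seen = set()
    components = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or (r, c) in seen:
                continue
            component = set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                component.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            components.append(component)
    return components


def hysteresis_segment(prob, box, low=0.3, high=0.7):
    """Two-pass mask from a probability map and the selected bounding box."""
    x1, y1, x2, y2 = box
    high_mask = [[v >= high for v in row] for row in prob]
    low_mask = [[v >= low for v in row] for row in prob]

    # Pass 1: keep high-confidence components that intersect the box.
    strong = set()
    for component in connected_components(high_mask):
        if any(y1 <= r < y2 and x1 <= c < x2 for r, c in component):
            strong |= component

    # Pass 2: add low-confidence components that touch the pass-1 mask.
    final = set(strong)
    for component in connected_components(low_mask):
        if component & strong:
            final |= component

    return [
        [1 if (r, c) in final else 0 for c in range(len(prob[0]))]
        for r in range(len(prob))
    ]
```

Because every high-threshold pixel also passes the low threshold, each kept strong component grows into the weak region around it, while confident blobs outside the box and isolated weak blobs are both dropped.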

In the training, only the appearance network is trained, and offline training uses stochastic gradient descent with a momentum of 0.9. The data are augmented by mirroring, rotation, and resizing. Meanwhile, deep supervision is not performed on the training, and each side output is connected to a cross-entropy segmentation loss function.
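
For reference, one update of stochastic gradient descent with momentum and a mirroring augmentation can be sketched as follows. Only the momentum of 0.9 comes from the text; the learning rate, the particular form of the momentum update, and the function names are assumptions.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update on flat lists of parameters.

    Assumed update rule: v <- momentum * v - lr * g, then w <- w + v.
    lr is a placeholder; the source states only the momentum value.
    """
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def mirror(image):
    """Horizontal flip of a 2D image stored as a list of rows."""
    return [row[::-1] for row in image]
```

When mirroring is used for augmentation, the same flip has to be applied to the frame and to its ground truth mask so that the pixel labels stay aligned.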

Brief description of the drawings

Fig. 1 is a system framework diagram of the video tracking object segmentation method based on an appearance network according to the present invention.

Fig. 2 is a flow diagram of the video tracking object segmentation method based on an appearance network according to the present invention.

Fig. 3 shows the temporal filter of the video tracking object segmentation method based on an appearance network according to the present invention.

Fig. 4 shows the connected component filter of the video tracking object segmentation method based on an appearance network according to the present invention.

Detailed description of the embodiments

It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a system framework diagram of the video tracking object segmentation method based on an appearance network according to the present invention. The method mainly includes an appearance network, an object detection network, bounding box filtering, and training.

Appearance network: each input frame is first passed through a learned appearance network for class-independent object segmentation. The network is based on the VGG16 convolutional network architecture, converted into a fully convolutional network. Unlike a standard fully convolutional network, the final pooling layer and the fully connected layers are removed entirely in order to preserve spatial resolution.

Object detection network: the frame is next passed through an instance-level semantic object detection network. The network takes the original RGB image as input and produces a set of bounding boxes, one for each object it finds that belongs to the set of classes it supports. The object detection network can separate instances of the same object class, which makes it possible to select the correct instance in videos where at least one similar instance is also selected by the appearance network.

Bounding box filtering includes an appearance-based filter, a temporal filter, and a connected component filter.

Appearance-based filter: after the input frame has passed through both networks, an initial segmentation prediction map is obtained from the appearance network alone, together with bounding box proposals for the objects recognized by the semantic detection network. A method for combining the results of the two networks is proposed, which refines the final predicted object segmentation map for each frame of the video.

First, the ground truth annotation of the first image is used to select the bounding box that belongs to the annotated object. The correct bounding box is then selected in subsequent frames by searching for the bounding box proposal that best matches the appearance map and by enforcing temporal consistency on these detections.

Training: only the appearance network is trained, and offline training uses stochastic gradient descent with a momentum of 0.9. The data are augmented by mirroring, rotation, and resizing. Meanwhile, deep supervision is not performed on the training, and each side output is connected to a cross-entropy segmentation loss function.

Fig. 2 is a flow diagram of the video tracking object segmentation method based on an appearance network according to the present invention. Each input frame is first passed through a learned appearance network for class-independent object segmentation, from which the final pooling layer and the fully connected layers have been removed. Skip connections allow multi-resolution spatial information to flow from the shallow layers to the end of the network, where the side outputs are concatenated and passed through a fusion convolutional layer that outputs the network prediction. The frame is then passed through an instance-level semantic object detection network. Foreground appearance segmentation yields an appearance map, the bounding boxes are then filtered, and the final segmentation image is obtained.

Fig. 3 shows the temporal filter of the video tracking object segmentation method based on an appearance network according to the present invention. Even when the correct bounding box of a semantic object was selected in the previous frame, the selection may switch to another object instance whose semantic bounding box overlaps heavily with the appearance prediction. To further ensure that the correct bounding box is selected, detections are filtered by an intersection over union (IoU) threshold between the object position in the current frame and in the previous frame, so that the correct bounding box is tracked over time.

Fig. 4 shows the connected component filter of the video tracking object segmentation method based on an appearance network according to the present invention. The final step of the algorithm uses the detection selected in the previous steps to restrict and enhance the segmentation map obtained from the appearance network. The appearance map is filtered using the bounding box, which removes background noise.

The first pass uses the high-threshold mask and deletes all components that do not intersect the bounding box selected in the previous step. This restriction filters out wrongly segmented instances that resemble the annotated object, or simply filters out noise.

For those skilled in the art, the present invention is not limited to the details of the above embodiments, and it can be realized in other specific forms without departing from its spirit and scope. In addition, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Claims

1. A video tracking object segmentation method based on an appearance network, characterized in that it mainly comprises an appearance network (1); an object detection network (2); bounding box filtering (3); and training (4).

2. The appearance network (1) according to claim 1, characterized in that each input frame is first passed through a learned appearance network for class-independent object segmentation; the network is based on the VGG16 convolutional network architecture, converted into a fully convolutional network; unlike a standard fully convolutional network, the final pooling layer and the fully connected layers are removed entirely in order to preserve spatial resolution;

skip connections allow multi-resolution spatial information to flow from the shallow layers to the end of the network, which improves segmentation precision on object contour details; more specifically, the final feature map of each VGG16 stage, taken before the pooling layer, is convolved with a single 1 × 1 kernel to obtain a grayscale segmentation probability map of the same size as the current downsampling stage, which is then upsampled to the original image size with a bilinear filter;

finally, these side outputs are concatenated at the end of the network and passed through a fusion convolutional layer that outputs the network prediction: a full-size grayscale segmentation probability map; to achieve pixel-level segmentation, the softmax classifier is replaced by a class-balanced sigmoid cross-entropy loss layer that provides a binary classification mask.

3. The object detection network (2) according to claim 1, characterized in that the frame is next passed through an instance-level semantic object detection network; the network takes the original RGB image as input and produces a set of bounding boxes, one for each object it finds that belongs to the set of classes it supports; the object detection network can separate instances of the same object class, which makes it possible to select the correct instance in videos where at least one similar instance is also selected by the appearance network.

4. The bounding box filtering (3) according to claim 1, characterized in that it includes an appearance-based filter, a temporal filter, and a connected component filter.

5. The appearance-based filter according to claim 4, characterized in that, after the input frame has passed through both networks, an initial segmentation prediction map is obtained from the appearance network alone, together with bounding box proposals for the objects recognized by the semantic detection network; a method for combining the results of the two networks is proposed, which refines the final predicted object segmentation map for each frame of the video.

6. The method for combining the results of the two networks according to claim 5, characterized in that the ground truth annotation of the first image is first used to select the bounding box that belongs to the annotated object; the correct bounding box is then selected in subsequent frames by searching for the bounding box proposal that best matches the appearance map and by enforcing temporal consistency on these detections;

for the first image, the semantic detection (bounding box) that has the best overlap with the object segmentation given by the first-frame ground truth is selected; the selected class is stored in memory so that it can be searched for in subsequent frames;

for all subsequent frames, only detections of the class found in the first frame are of interest, and the rest are discarded; among the remaining detection proposals, the detection that best fits the appearance map prediction is selected according to the intersection over union (IoU) between each bounding box proposal and the appearance map.

7. The temporal filter according to claim 4, characterized in that, even when the correct bounding box of a semantic object was selected in the previous frame, the selection may switch to another object instance whose semantic bounding box overlaps heavily with the appearance prediction; to further ensure that the correct bounding box is selected, detections are filtered by an intersection over union (IoU) threshold between the object position in the current frame and in the previous frame, so that the correct bounding box is tracked over time;

if the semantic object detection fails to detect any object in the first frame, the first-frame annotation is used instead to define the bounding box; then, for all subsequent frames, the connected components that intersect the previous bounding box are found, all other segments are deleted, and finally a new bounding box is selected according to the selected connected components; at the end of this step, an appearance map and a correct semantic bounding box detection of the annotated object are obtained.

8. The connected component filter according to claim 4, characterized in that the final step of the algorithm uses the detection selected in the previous steps to restrict and enhance the segmentation map obtained from the appearance network; the appearance map is filtered using the bounding box, which removes background noise;

to obtain the final predicted (that is, binary) segmentation mask, the appearance segmentation map is thresholded twice, with a low threshold and a high threshold; each resulting mask is then divided into its connected components.

9. The low threshold and high threshold according to claim 8, characterized in that the first pass uses the high-threshold mask and deletes all components that do not intersect the bounding box selected in the previous step; this restriction filters out wrongly segmented instances that resemble the annotated object, or simply filters out noise;

in the second pass, connected components of the low-threshold mask that intersect the mask obtained in the first pass are added to the final segmentation mask;

this enhancement provides a looser threshold within the selected bounding box; it follows the Canny edge detector, which distinguishes strong and weak edges and selects weak edges only when they are connected to strong edges; strong and weak (high and low confidence) segmentation pixels bounded by the selected bounding region of the segmentation map are found, and weak pixels are selected when their connected component intersects strong pixels.

10. The training (4) according to claim 1, characterized in that only the appearance network is trained, and offline training uses stochastic gradient descent with a momentum of 0.9; the data are augmented by mirroring, rotation, and resizing; meanwhile, deep supervision is not performed on the training, and each side output is connected to a cross-entropy segmentation loss function.