CN107301400A

CN107301400A - A semantically guided semi-supervised video object segmentation method

Info

Publication number: CN107301400A
Application number: CN201710487525.6A
Authority: CN
Inventors: 夏春秋
Original assignee: Shenzhen Vision Technology Co Ltd
Current assignee: Shenzhen Vision Technology Co Ltd
Priority date: 2017-06-23
Filing date: 2017-06-23
Publication date: 2017-10-27

Abstract

The present invention proposes a semantically guided semi-supervised video object segmentation method. Its main components are: feature extraction with a convolutional neural network; semantic selection and semantic propagation; combining the appearance model with a semantic prior through conditional classifiers; and network training. The process is as follows: features are first extracted with a convolutional neural network; a semantic instance segmentation algorithm is then used as input to estimate the semantics of the object to be segmented; the appearance model is then combined with the semantic prior through conditional classifiers; finally, the architecture is trained to determine the foreground pixels of a given image, and at test time the convolutional neural network is initialized with the trained weights and fine-tuned for several iterations. The invention can overcome the effects of illumination changes and occlusion, effectively extracts the useful information in a video, and greatly reduces the time, manpower, and material resources spent on reviewing video; the segmentation is finer and the accuracy is also improved.

Description

A semantically guided semi-supervised video object segmentation method

Technical field

The present invention relates to the field of video object segmentation, and more particularly to a semantically guided semi-supervised video object segmentation method.

Background art

In today's information society, video provides us with rich and comprehensive information, and it therefore receives growing attention from industries such as modern transportation, online media, and computer vision. However, raw video generally contains a very large amount of information, and much of it, often most of it, is of little value for industrial research and practical applications. We therefore need to reduce the video and extract the useful information from it. Video object segmentation, developed in recent years, is an important basic technology for extracting the effective information in video. It has been widely applied in practice, for example in traffic-flow video surveillance, industrial automation monitoring, security, networked multimedia interaction, and video compression coding. However, existing methods are vulnerable to illumination changes and occlusion and cannot operate in a semi-supervised manner, so their performance in practical applications is poor.

The present invention proposes a semantically guided semi-supervised video object segmentation method. Features are first extracted with a convolutional neural network; a semantic instance segmentation algorithm is then used as input to estimate the semantics of the object to be segmented; the appearance model is then combined with a semantic prior through conditional classifiers; finally, the architecture is trained to determine the foreground pixels of a given image, and at test time the convolutional neural network is initialized with the trained weights and fine-tuned for several iterations. The invention can overcome the effects of illumination changes and occlusion, effectively extracts the useful information in a video, and greatly reduces the time, manpower, and material resources spent on reviewing video; the segmentation is finer and the accuracy is also improved.

Summary of the invention

To address the vulnerability of existing methods to illumination changes and occlusion, the object of the present invention is to provide a semantically guided semi-supervised video object segmentation method. Features are first extracted with a convolutional neural network; a semantic instance segmentation algorithm is then used as input to estimate the semantics of the object to be segmented; the appearance model is then combined with a semantic prior through conditional classifiers; finally, the architecture is trained to determine the foreground pixels of a given image, and at test time the convolutional neural network is initialized with the trained weights and fine-tuned for several iterations.

To solve the above problem, the present invention provides a semantically guided semi-supervised video object segmentation method, whose main components include:

(1) feature extraction with a convolutional neural network;

(2) semantic selection and semantic propagation;

(3) combining the appearance model with a semantic prior through conditional classifiers;

(4) network training.

In the feature extraction step, a VGG16 convolutional neural network is used as the backbone. The fully connected layers and the last pooling layer are removed to increase the spatial resolution of the features. Skip connections are added to extract hypercolumn features, which aggregate multi-scale information from different layers. Output feature maps are taken from the second, third, fourth, and fifth convolutional blocks, before their respective pooling layers. The feature maps are then resized to the input image size and concatenated to form the hypercolumn features.
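As an illustration of the last two operations, the sketch below resizes feature maps of different resolutions to the input size and concatenates them per pixel into hypercolumns. It is a minimal sketch, not the network's actual implementation: nearest-neighbour resizing, the function names `resize_nearest` and `hypercolumns`, and the toy shapes are all assumptions made for the example.

```python
# Illustrative sketch only. Nearest-neighbour resizing stands in for whatever
# interpolation a real network would use; shapes and values are hypothetical.

def resize_nearest(fmap, out_h, out_w):
    """Resize one 2-D feature map (a list of rows) to out_h x out_w."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def hypercolumns(blocks, out_h, out_w):
    """Resize every channel of every block and stack the values per pixel.

    blocks: list of blocks, each a list of channels, each a 2-D map.
    Returns hyper[y][x], the list of all channel values at that pixel.
    """
    resized = [resize_nearest(ch, out_h, out_w) for blk in blocks for ch in blk]
    return [
        [[ch[y][x] for ch in resized] for x in range(out_w)]
        for y in range(out_h)
    ]


# Two toy "blocks" at different resolutions for a 4 x 4 input image.
block_a = [[[1, 2], [3, 4]]]   # 1 channel, 2 x 2
block_b = [[[9]], [[7]]]       # 2 channels, 1 x 1
hyper = hypercolumns([block_a, block_b], 4, 4)
```

Each pixel of `hyper` then carries one value per channel from every block, which is the multi-scale descriptor the classifier layers operate on.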

In the semantic selection and semantic propagation step, a semantic instance segmentation algorithm is used as input to estimate the semantics of the object to be segmented. Multi-task Network Cascades (MNC) is chosen as the input instance segmentation algorithm. MNC is a multi-stage network consisting of three main parts: shared convolutional layers, a region proposal network (RPN), and a region-of-interest (ROI)-wise classifier.

Further, semantic selection takes place on the first frame of the video: the masks that match the object are selected according to the ground-truth mask (the method is semi-supervised, so the ground-truth mask of the first frame is given as input). Regions of interest are selected and classified, and the ground-truth mask is overlapped with the instance segmentation proposals.
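The overlap test on the first frame can be sketched as follows, assuming intersection over union (IoU) as the overlap measure and a hypothetical acceptance threshold `min_iou`. Neither is specified in the text, and the proposal format (a category label plus a binary mask) is likewise an assumption.

```python
# Illustrative sketch only. IoU and the min_iou threshold are assumptions.

def iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as lists of rows."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0


def select_semantics(gt_mask, proposals, min_iou=0.5):
    """Return (category, iou) of the proposal that best overlaps gt_mask.

    proposals: list of (category, mask) pairs from the instance segmenter.
    Returns None when no proposal reaches min_iou.
    """
    best = None
    for category, mask in proposals:
        score = iou(gt_mask, mask)
        if score >= min_iou and (best is None or score > best[1]):
            best = (category, score)
    return best


# First-frame annotation and two hypothetical proposals.
gt = [[0, 1, 1],
      [0, 1, 1],
      [0, 0, 0]]
proposals = [
    ("person", [[0, 1, 1], [0, 1, 0], [0, 0, 0]]),
    ("dog",    [[1, 0, 0], [1, 0, 0], [1, 1, 1]]),
]
selected = select_semantics(gt, proposals)
```

In this toy case the "person" proposal shares three of four annotated pixels with the ground truth and is selected; the "dog" proposal has no overlap and is rejected.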

Further, the semantic propagation stage takes place after the first frame: the semantics estimated in the first frame are propagated to the subsequent frames. The instance segmentation masks are filtered using the first-round foreground estimate, and the top matching objects are selected from the pool.
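A minimal sketch of this propagation step is given below. It assumes that the propagated semantics are a category label, that the first-round foreground estimate is a binary mask, and that IoU is the matching score; the function names are hypothetical.

```python
# Illustrative sketch only. The category filter and IoU score are assumptions.

def overlap(mask_a, mask_b):
    """IoU of two binary masks (lists of rows of 0/1 values)."""
    pairs = [(a, b) for ra, rb in zip(mask_a, mask_b) for a, b in zip(ra, rb)]
    inter = sum(1 for a, b in pairs if a and b)
    union = sum(1 for a, b in pairs if a or b)
    return inter / union if union else 0.0


def propagate(category, fg_estimate, proposals):
    """Pick the proposal of the selected category that best matches the
    first-round foreground estimate; return its mask, or None if the pool
    contains no proposal of that category."""
    pool = [(overlap(fg_estimate, m), m) for c, m in proposals if c == category]
    if not pool:
        return None
    return max(pool, key=lambda item: item[0])[1]


# Later frame: rough foreground estimate plus three hypothetical proposals.
fg_estimate = [[0, 1, 1],
               [0, 1, 1]]
frame_proposals = [
    ("person", [[0, 1, 1], [0, 0, 1]]),   # good match, right category
    ("person", [[1, 0, 0], [1, 0, 0]]),   # right category, no overlap
    ("dog",    [[0, 1, 1], [0, 1, 1]]),   # perfect overlap, wrong category
]
prior_mask = propagate("person", fg_estimate, frame_proposals)
```

The "dog" proposal is discarded despite its perfect overlap because it does not carry the category selected on the first frame.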

To combine the appearance model with the semantic prior through conditional classifiers, dense labeling with a fully convolutional network is used, which is usually formulated as a per-pixel classification problem. It can therefore be understood as a global classifier that slides over the whole image and assigns a foreground or background label to each pixel according to the appearance model. If the semantic prior is incorporated before the final classification, it can take the form of the mask of the most likely instance (or group of instances) in the current frame.

Further, for each pixel i, the probability that it is a foreground pixel given the image I is estimated: p(i|I). This probability can be decomposed into a sum of K conditional probabilities weighted by the prior:

$$p(i \mid I) = \sum_{k=1}^{K} p(i \mid I, k)\, p(k \mid I) \qquad (1)$$

where K = 2.

Further, two conditional classifiers are built, one focusing on foreground pixels and the other on background pixels. The prior p(k|I) is estimated from the instance segmentation output. Specifically, if a pixel lies inside the instance segmentation mask, it relies on the foreground classifier; if it falls outside the instance segmentation mask, the background classifier is more important. In the experiments, the selected masks are spatially smoothed with a Gaussian filter and used as the semantic prior.
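The sketch below illustrates how a Gaussian-smoothed mask could act as the per-pixel prior p(k|I) and weight the outputs of the foreground and background classifiers in the K = 2 sum. The kernel width `sigma`, the `radius`, the border handling, and the toy probabilities are assumptions chosen for the example, not values from the experiments.

```python
import math

# Illustrative sketch only. sigma, radius and border clamping are assumptions.

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(d * d) / (2.0 * sigma * sigma))
               for d in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(values, kernel):
    """Convolve a 1-D sequence with the kernel, clamping at the borders."""
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)
            acc += w * values[idx]
        out.append(acc)
    return out


def smooth_mask(mask, sigma=1.0, radius=2):
    """Separable Gaussian blur of a binary mask: rows first, then columns."""
    kernel = gaussian_kernel(sigma, radius)
    rows = [blur_1d(row, kernel) for row in mask]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def combine(p_fg, p_bg, prior):
    """Weighted sum with K = 2: weight `prior` for the foreground
    classifier and 1 - prior for the background classifier."""
    return [
        [w * f + (1.0 - w) * b for f, b, w in zip(rf, rb, rw)]
        for rf, rb, rw in zip(p_fg, p_bg, prior)
    ]


# 5 x 5 mask with a 3 x 3 object in the centre.
mask = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)]
        for y in range(5)]
prior = smooth_mask(mask)
p_fg = [[0.9] * 5 for _ in range(5)]   # toy foreground-classifier output
p_bg = [[0.1] * 5 for _ in range(5)]   # toy background-classifier output
p = combine(p_fg, p_bg, prior)
```

Smoothing turns the hard mask boundary into a gradual transition, so pixels near the edge of the instance mask draw on both classifiers rather than switching abruptly between them.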

Further, the conditional classifier can be integrated into the network as a layer that is trainable end to end. The layer takes two prediction maps f₁ and f₂ as input, together with the weight map ω obtained from the semantic prior. Each element of one input map is multiplied by its weight in the weight map and then added to the correspondingly weighted element of the other map:

$$f_{out}(x, y) = \omega(x, y)\, f_1(x, y) + \bigl(1 - \omega(x, y)\bigr)\, f_2(x, y) \qquad (2)$$

Similarly, in the back-propagation step, the gradient g_top from the top is propagated to the two branches according to the weight map:

$$g_1(x, y) = \omega(x, y)\, g_{top}(x, y) \qquad (3)$$

$$g_2(x, y) = \bigl(1 - \omega(x, y)\bigr)\, g_{top}(x, y) \qquad (4)$$

as given by equations (3) and (4), respectively.
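Equations (2) to (4) can be checked numerically with a small sketch. The function names are hypothetical and the maps are plain nested lists; a real layer would operate on tensors inside the training framework.

```python
# Illustrative sketch only. Function names and toy values are hypothetical.

def cond_forward(f1, f2, w):
    """Equation (2): per-pixel blend of two prediction maps."""
    return [
        [wi * a + (1.0 - wi) * b for a, b, wi in zip(r1, r2, rw)]
        for r1, r2, rw in zip(f1, f2, w)
    ]


def cond_backward(g_top, w):
    """Equations (3) and (4): split the incoming gradient between branches."""
    g1 = [[wi * g for g, wi in zip(rg, rw)] for rg, rw in zip(g_top, w)]
    g2 = [[(1.0 - wi) * g for g, wi in zip(rg, rw)] for rg, rw in zip(g_top, w)]
    return g1, g2


f1 = [[0.8, 0.6]]       # foreground-branch prediction
f2 = [[0.2, 0.4]]       # background-branch prediction
w = [[1.0, 0.25]]       # weight map from the semantic prior
g_top = [[1.0, 2.0]]    # gradient arriving from the layer above

f_out = cond_forward(f1, f2, w)
g1, g2 = cond_backward(g_top, w)
```

Where the weight is 1 the output equals the first prediction and the whole gradient flows to that branch; elsewhere the two branch gradients always sum to `g_top`, since the weights ω and 1 - ω sum to one.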

In the network training step, the VGG convolutional neural network part of the architecture is first initialized with pre-trained weights. The purpose of training the architecture is to determine the foreground pixels of a given image.

Then, the appearance model of the specific object to be segmented in the video sequence is learned: at test time, the convolutional neural network is initialized with the trained weights and fine-tuned for several iterations. To produce the segmentation in each frame, the fine-tuned network is applied to the video sequence of the specific object, and the mask corresponding to the object is obtained with a single forward pass.
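The two-stage schedule can be illustrated with a deliberately tiny stand-in. In the sketch below a per-pixel logistic classifier on a single intensity value replaces the convolutional neural network, the "pre-trained" weights are placeholders, and the iteration count and learning rate are arbitrary; only the order of operations (initialize, fine-tune on the annotated first frame, then one forward pass per later frame) reflects the text.

```python
import math

# Illustrative sketch only. A 1-D logistic classifier stands in for the CNN;
# the weights, iteration count and learning rate are hypothetical.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, frame):
    """Single forward pass: per-pixel foreground probability."""
    w, b = params
    return [[sigmoid(w * v + b) for v in row] for row in frame]


def fine_tune(params, frame, gt_mask, iterations=200, lr=0.5):
    """Gradient descent on the cross-entropy loss of the annotated frame."""
    w, b = params
    n = sum(len(row) for row in frame)
    for _ in range(iterations):
        grad_w = grad_b = 0.0
        for row, gt_row in zip(frame, gt_mask):
            for v, t in zip(row, gt_row):
                err = sigmoid(w * v + b) - t   # d(loss) / d(logit)
                grad_w += err * v
                grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return (w, b)


pretrained = (0.0, 0.0)                       # placeholder initial weights
first_frame = [[0.9, 0.8, 0.1],
               [0.7, 0.2, 0.1]]               # toy pixel intensities
first_mask = [[1, 1, 0],
              [1, 0, 0]]                      # annotation of the first frame

tuned = fine_tune(pretrained, first_frame, first_mask)

next_frame = [[0.85, 0.15],
              [0.10, 0.95]]
probs = forward(tuned, next_frame)            # one forward pass per frame
mask = [[1 if p > 0.5 else 0 for p in row] for row in probs]
```

After fine-tuning, the model is not updated again: every later frame costs exactly one call to `forward`.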

Brief description of the drawings

Fig. 1 is a system flow chart of the semantically guided semi-supervised video object segmentation method of the present invention.

Fig. 2 is a schematic flow diagram of the semantically guided semi-supervised video object segmentation method of the present invention.

Fig. 3 illustrates the semantic selection and semantic propagation of the semantically guided semi-supervised video object segmentation method of the present invention.

Fig. 4 illustrates the conditional classifier of the semantically guided semi-supervised video object segmentation method of the present invention.

Detailed description of the embodiments

It should be noted that, where no conflict arises, the embodiments of the present application and the features of those embodiments may be combined with one another. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a system flow chart of the semantically guided semi-supervised video object segmentation method of the present invention. The method mainly comprises feature extraction with a convolutional neural network, semantic selection and semantic propagation, combining the appearance model with a semantic prior through conditional classifiers, and network training.

For feature extraction, a VGG16 convolutional neural network is used as the backbone. The fully connected layers and the last pooling layer are removed to increase the spatial resolution of the features. Skip connections are added to extract hypercolumn features, which aggregate multi-scale information from different layers. Output feature maps are taken from the second, third, fourth, and fifth convolutional blocks, before their respective pooling layers. The feature maps are then resized to the input image size and concatenated to form the hypercolumn features.

To combine the appearance model with the semantic prior through conditional classifiers, dense labeling with a fully convolutional network is used, which is usually formulated as a per-pixel classification problem. It can therefore be understood as a global classifier that slides over the whole image and assigns a foreground or background label to each pixel according to the appearance model. If the semantic prior is incorporated before the final classification, it can take the form of the mask of the most likely instance (or group of instances) in the current frame.

For each pixel i, the probability that it is a foreground pixel given the image I is estimated: p(i|I). This probability can be decomposed into a sum of K conditional probabilities weighted by the prior:

$$p(i \mid I) = \sum_{k=1}^{K} p(i \mid I, k)\, p(k \mid I) \qquad (1)$$

where K = 2.

For network training, the VGG convolutional neural network part of the architecture is first initialized with pre-trained weights. The purpose of training the architecture is to determine the foreground pixels of a given image.

Fig. 2 is a schematic flow diagram of the semantically guided semi-supervised video object segmentation method of the present invention. Features are first extracted with a convolutional neural network; a semantic instance segmentation algorithm is then used as input to estimate the semantics of the object to be segmented; the appearance model is then combined with the semantic prior through conditional classifiers; finally, the architecture is trained to determine the foreground pixels of a given image, and at test time the convolutional neural network is initialized with the trained weights and fine-tuned for several iterations.

Fig. 3 illustrates the semantic selection and semantic propagation of the semantically guided semi-supervised video object segmentation method of the present invention. A semantic instance segmentation algorithm is used as input to estimate the semantics of the object to be segmented. Multi-task Network Cascades (MNC) is chosen as the input instance segmentation algorithm. MNC is a multi-stage network consisting of three main parts: shared convolutional layers, a region proposal network (RPN), and a region-of-interest (ROI)-wise classifier.

Semantic selection takes place on the first frame of the video: the masks that match the object are selected according to the ground-truth mask (the method is semi-supervised, so the ground-truth mask of the first frame is given as input). Regions of interest are selected and classified, and the ground-truth mask is overlapped with the instance segmentation proposals.

The semantic propagation stage takes place after the first frame: the semantics estimated in the first frame are propagated to the subsequent frames. The instance segmentation masks are filtered using the first-round foreground estimate, and the top matching objects are selected from the pool.

Fig. 4 illustrates the conditional classifier of the semantically guided semi-supervised video object segmentation method of the present invention. Two conditional classifiers are built, one focusing on foreground pixels and the other on background pixels. The prior p(k|I) is estimated from the instance segmentation output. Specifically, if a pixel lies inside the instance segmentation mask, it relies on the foreground classifier; if it falls outside the instance segmentation mask, the background classifier is more important. In the experiments, the selected masks are spatially smoothed with a Gaussian filter and used as the semantic prior.

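The conditional classifier can be integrated into the network as a layer that is trainable end to end. The layer takes two prediction maps f₁ and f₂ as input, together with the weight map ω obtained from the semantic prior. Each element of one input map is multiplied by its weight in the weight map and then added to the correspondingly weighted element of the other map:

$$f_{out}(x, y) = \omega(x, y)\, f_1(x, y) + \bigl(1 - \omega(x, y)\bigr)\, f_2(x, y) \qquad (2)$$

In the back-propagation step, the gradient g_top from the top is propagated to the two branches according to the weight map:

$$g_1(x, y) = \omega(x, y)\, g_{top}(x, y) \qquad (3)$$

$$g_2(x, y) = \bigl(1 - \omega(x, y)\bigr)\, g_{top}(x, y) \qquad (4)$$

as given by equations (3) and (4), respectively.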

It will be apparent to those skilled in the art that the present invention is not limited to the details of the above embodiments, and that it can be implemented in other specific forms without departing from its spirit or scope. Furthermore, those skilled in the art may make various changes and modifications to the present invention without departing from its spirit and scope, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Claims

1. A semantically guided semi-supervised video object segmentation method, characterized in that it mainly comprises: feature extraction with a convolutional neural network (1); semantic selection and semantic propagation (2); combining the appearance model with a semantic prior through conditional classifiers (3); and network training (4).

2. The feature extraction with a convolutional neural network (1) according to claim 1, characterized in that a VGG16 convolutional neural network is used as the backbone; the fully connected layers and the last pooling layer are removed to increase the spatial resolution of the features; skip connections are added to extract hypercolumn features, which aggregate multi-scale information from different layers; output feature maps are taken from the second, third, fourth, and fifth convolutional blocks, before their respective pooling layers; the feature maps are then resized to the input image size and concatenated to form the hypercolumn features.

3. The semantic selection and semantic propagation (2) according to claim 1, characterized in that a semantic instance segmentation algorithm is used as input to estimate the semantics of the object to be segmented; Multi-task Network Cascades (MNC) is chosen as the input instance segmentation algorithm; MNC is a multi-stage network consisting of three main parts: shared convolutional layers, a region proposal network (RPN), and a region-of-interest (ROI)-wise classifier.

4. The semantic selection according to claim 3, characterized in that semantic selection takes place on the first frame of the video, where the masks that match the object are selected according to the ground-truth mask (the method is semi-supervised, so the ground-truth mask of the first frame is given as input); regions of interest are selected and classified, and the ground-truth mask is overlapped with the instance segmentation proposals.

5. The semantic propagation according to claim 3, characterized in that the semantic propagation stage takes place after the first frame, the semantics estimated in the first frame being propagated to the subsequent frames; the instance segmentation masks are filtered using the first-round foreground estimate, and the top matching objects are selected from the pool.

6. The combining of the appearance model with a semantic prior through conditional classifiers (3) according to claim 1, characterized in that dense labeling with a fully convolutional network is used, which is usually formulated as a per-pixel classification problem; it can therefore be understood as a global classifier that slides over the whole image and assigns a foreground or background label to each pixel according to the appearance model; if the semantic prior is incorporated before the final classification, it can take the form of the mask of the most likely instance (or group of instances) in the current frame.

7. The pixels according to claim 6, characterized in that, for each pixel i, the probability that it is a foreground pixel given the image I is estimated: p(i|I); this probability can be decomposed into a sum of K conditional probabilities weighted by the prior:

$$p(i \mid I) = \sum_{k=1}^{K} p(i \mid I, k)\, p(k \mid I) \qquad (1)$$

where K = 2.

8. The conditional classifiers according to claim 6, characterized in that two conditional classifiers are built, one focusing on foreground pixels and the other on background pixels; the prior p(k|I) is estimated from the instance segmentation output; specifically, if a pixel lies inside the instance segmentation mask, it relies on the foreground classifier, and if it falls outside the instance segmentation mask, the background classifier is more important; in the experiments, the selected masks are spatially smoothed with a Gaussian filter and used as the semantic prior.

9. The layer of the conditional classifier according to claim 8, characterized in that the conditional classifier can be integrated into the network as a layer that is trainable end to end; the layer takes two prediction maps f₁ and f₂ as input, together with the weight map ω obtained from the semantic prior; each element of one input map is multiplied by its weight in the weight map and then added to the correspondingly weighted element of the other map:

$$f_{out}(x, y) = \omega(x, y)\, f_1(x, y) + \bigl(1 - \omega(x, y)\bigr)\, f_2(x, y) \qquad (2)$$

and, in the back-propagation step, the gradient g_top from the top is propagated to the two branches according to the weight map:

$$g_1(x, y) = \omega(x, y)\, g_{top}(x, y) \qquad (3)$$

$$g_2(x, y) = \bigl(1 - \omega(x, y)\bigr)\, g_{top}(x, y) \qquad (4)$$

as given by equations (3) and (4), respectively.

10. The network training (4) according to claim 1, characterized in that the VGG convolutional neural network part of the architecture is first initialized with pre-trained weights, the purpose of training the architecture being to determine the foreground pixels of a given image;

then, the appearance model of the specific object to be segmented in the video sequence is learned: at test time, the convolutional neural network is initialized with the trained weights and fine-tuned for several iterations; to produce the segmentation in each frame, the fine-tuned network is applied to the video sequence of the specific object, and the mask corresponding to the object is obtained with a single forward pass.