CN112990356B - Video instance segmentation system and method


Info

Publication number
CN112990356B
CN112990356B (application CN202110408364.3A)
Authority
CN
China
Prior art keywords
video
segmented
module
information
convolution block
Prior art date
Legal status
Active
Application number
CN202110408364.3A
Other languages
Chinese (zh)
Other versions
CN112990356A (en)
Inventor
房体品
秦者云
卢宪凯
丁冬睿
Current Assignee
Jinan Safety Technology Co.,Ltd.
Original Assignee
Guangdong Zhongju Artificial Intelligence Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Zhongju Artificial Intelligence Technology Co ltd
Priority to CN202110408364.3A
Publication of CN112990356A
Application granted
Publication of CN112990356B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Abstract

The invention discloses a video instance segmentation system and method. The system comprises: a feature extraction module, configured to acquire a video to be processed and extract the features of its video frames; a shallow embedding amount estimation module, connected to the feature extraction module, configured to estimate the explicit variable embedding amount at each pixel position in each video frame and to establish a shallow Gaussian distribution model for each instance to be segmented; a deep embedding amount estimation module, connected to the feature extraction module, configured to obtain the implicit information of each instance to be segmented, estimate the implicit variable embedding amount at each pixel position in each video frame, and optimize the shallow Gaussian distribution model of each instance to be segmented; and a clustering inference module, connected to the deep embedding amount estimation module, configured to perform inference clustering on each pixel position in each video frame to obtain the segmentation mask of the video to be processed. The invention reduces the dependence on device computing power while ensuring segmentation accuracy.

Description

Video instance segmentation system and method
Technical Field
The invention relates to the technical field of video instance segmentation, in particular to a video instance segmentation system and a video instance segmentation method.
Background
Segmentation of video is one of the fundamental problems of computer vision.
The goal of video instance segmentation is to simultaneously detect, segment and track object instances in a video. This new task opens up possibilities for applications that require video-level object masks, such as video editing, autonomous driving and augmented reality. Video instance segmentation is more challenging than image instance segmentation because it requires not only instance segmentation on individual frames, but also tracking of instances across frames. On the other hand, video content contains more information than a single image, such as the motion patterns and temporal consistency of different objects, thereby providing more clues for object recognition and segmentation.
The related art is mainly based on discriminative methods, whose main idea is to use a target detection model such as the Mask Region-based Convolutional Neural Network (Mask R-CNN) as the basic framework and add a tracking module to complete video instance segmentation. These methods typically involve a multi-stage pipeline (i.e., multiple training stages) that follows a detect-then-track pattern and models a video clip as a sequence of images. Multiple networks are used to detect objects in single frames, and the detections are then associated over time. While these methods produce high-quality results, they involve multiple independent networks, are computationally demanding, and do not allow end-to-end training. Moreover, they are difficult to retrain as a whole and are tied to the particular task.
Disclosure of Invention
The present invention provides a video instance segmentation system and method to solve the above problems in the prior art.
In a first aspect, an embodiment of the present invention provides a video instance segmentation system. The system comprises:
the feature extraction module is used for acquiring a video to be processed, wherein the video to be processed comprises a plurality of video frames, and the plurality of video frames comprise at least one instance to be segmented; and for extracting features of the plurality of video frames;
the shallow embedding amount estimation module is connected with the feature extraction module and is used for estimating an explicit variable embedding amount at each pixel position in each video frame based on the features of the plurality of video frames and explicit information of the at least one instance to be segmented, wherein the explicit information comprises at least one of the following: position information and timing information, and the explicit information comes from the annotation information of the plurality of video frames or is calculated by the shallow embedding amount estimation module; and for establishing a shallow Gaussian distribution model of each instance to be segmented according to the explicit variable embedding amounts of the pixel positions;
the deep embedding quantity estimation module is connected with the feature extraction module and is used for carrying out hidden variable reasoning based on the features of the video frames to obtain hidden information of each to-be-segmented example, wherein the hidden information comprises at least one of the following information: color, lighting and occlusion information; estimating an implicit variable embedding amount of each pixel position in each video frame based on the characteristics of the plurality of video frames and the implicit information of the at least one to-be-segmented example; optimizing the shallow Gaussian distribution model of each example to be segmented according to the hidden variable embedding amount of the pixel position to obtain a deep Gaussian distribution model of each example to be segmented;
and the clustering inference module is connected with the deep embedding amount estimation module and is used for carrying out inference clustering on each pixel position in each video frame by using a Gaussian distribution density estimation function according to the deep Gaussian distribution model of all the examples to be segmented to obtain the segmentation mask of the video to be processed.
In one embodiment, the feature extraction module comprises a feature pyramid network FPN encoder, the FPN encoder comprising:
a first convolution block, a second convolution block, a third convolution block, a fourth convolution block and a fifth convolution block which are connected in series in sequence, wherein a video frame is input into the first convolution block to obtain a feature C1; C1 is input into the second convolution block to obtain a feature C2; C2 is input into the third convolution block to obtain a feature C3; C3 is input into the fourth convolution block to obtain a feature C4; C4 is input into the fifth convolution block to obtain a feature C5;
a sixth convolution block, a seventh convolution block, an eighth convolution block and a ninth convolution block, which are all 1 × 1 convolution blocks, are respectively connected to the fifth convolution block, the fourth convolution block, the third convolution block and the second convolution block, and are used for changing the channel numbers of C5, C4, C3 and C2 to obtain features M5, M41, M31 and M21;
the first adder is connected with the sixth convolution block and the seventh convolution block and is used for adding M5 and M41 to obtain a feature M4; the second adder is connected with the first adder and the eighth convolution block and is used for adding M4 and M31 to obtain a feature M3; the third adder is connected with the second adder and the ninth convolution block and is used for adding M3 and M21 to obtain a feature M2;
and the tenth convolution block, the eleventh convolution block, the twelfth convolution block and the thirteenth convolution block are 3 × 3 convolution blocks, are respectively connected with the third adder, the second adder, the first adder and the sixth convolution block, and are respectively used for further performing feature extraction on M2, M3, M4 and M5 to obtain features P2, P3, P4 and P5.
In one embodiment, the shallow embedding amount estimation module includes:
an upper decoder, connected to the tenth, eleventh, twelfth and thirteenth convolution blocks, for upsampling P2, P3, P4 and P5 to predict a seed map of each instance to be segmented in the video frame, wherein the seed map is a score map, and the closer a pixel position is to the center point of the instance to be segmented, the higher the corresponding score value;
a lower decoder, connected to the tenth, eleventh, twelfth and thirteenth convolutional blocks, for upsampling P2, P3, P4 and P5 to predict an offset of corresponding explicit information of each to-be-segmented instance in the video frame, wherein the explicit information includes position information and timing information;
and the embedding amount generating module is connected with the lower decoder and is used for adding the offset to the space-time coordinate vector of each pixel position in the video frame to generate the explicit variable embedding amount of each pixel position in the video frame.
In one embodiment, the deep embedding amount estimation module includes:
the normalized flow branch is connected with the characteristic extraction module and used for estimating the hidden variable distribution corresponding to the hidden information of each example to be segmented based on the characteristics of the video frame;
the distribution optimization module is connected with the normalization flow branch and the lower decoder and is used for estimating the hidden variable embedding amount of each pixel position in each video frame according to the hidden variable distribution corresponding to the at least one instance to be segmented; and for adding the hidden variable embedding amount and the explicit variable embedding amount of each pixel position, and obtaining a deep Gaussian distribution model of each instance to be segmented according to the added embedding amounts of the pixel positions.
In one embodiment, the normalized flow branch comprises:
the activation normalization layer is connected with the feature extraction module and is used for performing a permutation operation on the features of the video frames;
and a normalized flow structure connected to the active normalization layer, wherein the normalized flow structure includes L hierarchical modules, each hierarchical module includes K normalized flow modules, an output of each normalized flow module is used as an input of a next normalized flow module, an output of each hierarchical module is used as an input of a next hierarchical module, and the number of output channels of each hierarchical module is 1/2 of the number of input channels.
In one embodiment, a dividing structure is disposed between two adjacent hierarchical modules of the normalized flow structure, and the dividing structure is configured to:
dividing the feature vector z obtained from the previous hierarchical module into z1 and z2 in the channel dimension, wherein the channel numbers of z1 and z2 are each 1/2 of the channel number of z;
performing a convolution operation on z1 to obtain a feature vector h, wherein the number of channels of h is 2 times that of z1;
obtaining the mean and variance of h, subtracting the mean of h from z2, and dividing by the variance of h;
z1 is compressed and then passed on.
In one embodiment, the clustering inference module is configured to:
selecting the pixel position corresponding to the highest value in the seed map of each instance to be segmented as the center point of that instance to be segmented;
taking the center point of each instance to be segmented as the origin, taking the sum of the explicit variable embedding amount and the implicit variable embedding amount of each pixel position as the standard deviation, and using a Gaussian distribution density estimation function to sequentially calculate the probability that each pixel position belongs to each instance to be segmented;
and when the value of the probability is greater than 0.5, marking the pixel position as belonging to that instance to be segmented, thereby obtaining a segmentation mask of each instance to be segmented.
In a second aspect, an embodiment of the present invention further provides a video instance segmentation method. The method comprises the following steps:
s10: acquiring a plurality of training videos, wherein each training video comprises a plurality of training video frames, and the plurality of training video frames comprise at least one example to be segmented; marking the explicit information of each training video;
s20: constructing a video instance segmentation system provided by the embodiment;
s30: training the video instance segmentation system using the plurality of training videos;
s40: the method comprises the steps of obtaining a video to be segmented, wherein the video to be segmented comprises a plurality of video frames, the plurality of video frames comprise at least one instance to be segmented, inputting the video to be segmented into a trained video instance segmentation system, and obtaining a segmentation mask of the video to be segmented.
In one embodiment, in step S30, the explicit information of each to-be-segmented instance in each training video is derived from the annotation information of each training video;
in step S40, explicit information of each instance to be segmented in the video to be segmented is calculated by the shallow embedding amount estimation module.
In one embodiment, step S30 includes: reconstructing the seed maps of the plurality of video frames of each training video according to the annotation information of each training video.
The invention has the beneficial effects that: a universal layered Bayesian inference framework is provided, explicit information and implicit information of a video are respectively obtained by two layers of Bayesian modules, and a finally obtained Gaussian distribution model is progressively optimized. The shallow Bayesian inference module (namely, the shallow embedding amount estimation module) models the instance by using the mixed Gaussian distribution model, and estimates the embedding amount of the explicit variable as the standard deviation of the shallow Gaussian distribution model based on explicit information such as the position information and the time sequence information of the pixel. The deep Bayesian inference module (namely, the deep embedding amount estimation module) uses the normalized flow model to model the instance, estimates the embedding amount of the hidden variable based on the hidden information such as color, illumination, occlusion and the like of the instance, and optimizes the shallow Gaussian distribution model. The method reduces the dependence on the computing power of the equipment while ensuring the segmentation precision.
Drawings
Fig. 1 is a flowchart of a video example segmentation system according to an embodiment of the present invention.
Fig. 2 is a flowchart of a video instance segmentation method according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples. The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
The embodiment provides a video instance segmentation system which is based on a new and universal hierarchical Bayesian framework and is used for realizing instance segmentation of videos. The system comprises: a feature extraction module, a shallow embedding amount estimation module, a deep embedding amount estimation module and a clustering inference module.
The feature extraction module is used for acquiring a video to be processed, wherein the video to be processed comprises a plurality of video frames, and the plurality of video frames comprise at least one instance to be segmented; and for extracting features of the plurality of video frames.
The shallow embedding amount estimation module is connected with the feature extraction module and is used for estimating an explicit variable embedding amount at each pixel position in each video frame based on the features of the plurality of video frames and explicit information of the at least one instance to be segmented, wherein the explicit information comprises at least one of the following: position information and timing information, and the explicit information comes from the annotation information of the plurality of video frames or is calculated by the shallow embedding amount estimation module; and for establishing a shallow Gaussian distribution model of each instance to be segmented according to the explicit variable embedding amounts of the pixel positions.
The deep embedding quantity estimation module is connected with the feature extraction module and is used for carrying out hidden variable reasoning based on the features of the video frames to obtain hidden information of each to-be-segmented example, wherein the hidden information comprises at least one of the following information: color, lighting and occlusion information; estimating an implicit variable embedding amount of each pixel position in each video frame based on the characteristics of the plurality of video frames and the implicit information of the at least one to-be-segmented example; and optimizing the shallow Gaussian distribution model of each example to be segmented according to the hidden variable embedding amount of the pixel position to obtain the deep Gaussian distribution model of each example to be segmented.
And the clustering inference module is connected with the deep embedding amount estimation module and is used for carrying out inference clustering on each pixel position in each video frame by using a Gaussian distribution density estimation function according to the deep Gaussian distribution model of all the examples to be segmented to obtain the segmentation mask of the video to be processed.
In one embodiment, the feature extraction module comprises a feature pyramid network FPN encoder, the FPN encoder comprising:
a first convolution block, a second convolution block, a third convolution block, a fourth convolution block and a fifth convolution block which are connected in series in sequence, wherein a video frame is input into the first convolution block to obtain a feature C1; C1 is input into the second convolution block to obtain a feature C2; C2 is input into the third convolution block to obtain a feature C3; C3 is input into the fourth convolution block to obtain a feature C4; C4 is input into the fifth convolution block to obtain a feature C5;
a sixth convolution block, a seventh convolution block, an eighth convolution block and a ninth convolution block, which are all 1 × 1 convolution blocks, are respectively connected to the fifth convolution block, the fourth convolution block, the third convolution block and the second convolution block, and are used for changing the channel numbers of C5, C4, C3 and C2 to obtain features M5, M41, M31 and M21;
the first adder is connected with the sixth convolution block and the seventh convolution block and is used for adding M5 and M41 to obtain a feature M4; the second adder is connected with the first adder and the eighth convolution block and is used for adding M4 and M31 to obtain a feature M3; the third adder is connected with the second adder and the ninth convolution block and is used for adding M3 and M21 to obtain a feature M2;
and the tenth convolution block, the eleventh convolution block, the twelfth convolution block and the thirteenth convolution block are 3 × 3 convolution blocks, are respectively connected with the third adder, the second adder, the first adder and the sixth convolution block, and are respectively used for further performing feature extraction on M2, M3, M4 and M5 to obtain features P2, P3, P4 and P5.
In one embodiment, the shallow embedding amount estimation module includes:
an upper decoder, connected to the tenth, eleventh, twelfth and thirteenth convolution blocks, for upsampling P2, P3, P4 and P5 to predict a seed map of each instance to be segmented in the video frame, wherein the seed map is a score map, and the closer a pixel position is to the center point of the instance to be segmented, the higher the corresponding score value;
a lower decoder, connected to the tenth, eleventh, twelfth and thirteenth convolutional blocks, for upsampling P2, P3, P4 and P5 to predict an offset of corresponding explicit information of each to-be-segmented instance in the video frame, wherein the explicit information includes position information and timing information;
and the embedding amount generating module is connected with the lower decoder and is used for adding the offset to the space-time coordinate vector (with coordinates (x, y, t)) of each pixel position in the video frame to generate the explicit variable embedding amount of each pixel position in the video frame.
In one embodiment, the deep embedding amount estimation module includes:
the normalized flow branch is connected with the characteristic extraction module and used for estimating the hidden variable distribution corresponding to the hidden information of each example to be segmented based on the characteristics of the video frame;
the distribution optimization module is connected with the normalization flow branch and the lower decoder and is used for estimating the hidden variable embedding amount of each pixel position in each video frame according to the hidden variable distribution corresponding to the at least one instance to be segmented; and for adding the hidden variable embedding amount and the explicit variable embedding amount of each pixel position, and obtaining a deep Gaussian distribution model of each instance to be segmented according to the added embedding amounts of the pixel positions.
In one embodiment, the normalized flow branch comprises:
the activation normalization layer is connected with the feature extraction module and is used for performing a permutation operation on the features of the video frames;
and a normalized flow structure connected to the active normalization layer, wherein the normalized flow structure includes L hierarchical modules, each hierarchical module includes K normalized flow modules, an output of each normalized flow module is used as an input of a next normalized flow module, an output of each hierarchical module is used as an input of a next hierarchical module, and the number of output channels of each hierarchical module is 1/2 of the number of input channels.
In one embodiment, a dividing structure is disposed between two adjacent hierarchical modules of the normalized flow structure, and the dividing structure is configured to:
dividing the feature vector z obtained from the previous hierarchical module into z1 and z2 in the channel dimension, wherein the channel numbers of z1 and z2 are each 1/2 of the channel number of z;
performing a convolution operation on z1 to obtain a feature vector h, wherein the number of channels of h is 2 times that of z1;
obtaining the mean and variance of h, subtracting the mean of h from z2, and dividing by the variance of h;
z1 is compressed and then passed on.
In one embodiment, the clustering inference module is configured to:
selecting the pixel position corresponding to the highest value in the seed map of each instance to be segmented as the center point of that instance to be segmented;
taking the center point of each instance to be segmented as the origin, taking the sum of the explicit variable embedding amount and the implicit variable embedding amount of each pixel position as the standard deviation, and using a Gaussian distribution density estimation function to sequentially calculate the probability that each pixel position belongs to each instance to be segmented;
and when the value of the probability is greater than 0.5, marking the pixel position as belonging to that instance to be segmented, thereby obtaining a segmentation mask of each instance to be segmented.
The system is based on a new and universal hierarchical Bayesian framework. Fig. 1 is a flowchart of a video example segmentation system according to an embodiment of the present invention. As shown in FIG. 1, the workflow of the system includes steps S1-S4.
S1: and acquiring a video to be segmented, and extracting the characteristics of all video frames.
S2: and modeling each instance into a shallow Gaussian model based on the explicit information to obtain a shallow mixed Gaussian distribution model.
S3: and (4) carrying out hidden variable embedding reasoning, and estimating a hidden variable distribution based on hidden information, such as color illumination, shielding information and the like of each example. And optimizing the shallow Gaussian distribution of each model according to the distribution to obtain a final deep mixed Gaussian distribution model.
S4: and obtaining a final segmentation mask according to the obtained deep mixed Gaussian distribution model.
It should be noted that the system is a supervised model, and the supervision information refers to the annotation information of the video sequence, i.e. the ground truth. This information is given and is only used during the training process.
For explicit information, the explicit information of the entire video is known during the training phase. Explicit information refers to the position information of each instance throughout the video and the timing information indicating in which frames it occurs. The explicit information is given by the annotation information of the video, i.e. the ground truth labels annotated in advance. In the test and verification stage, the network model automatically estimates the required explicit information according to the parameters learned in the training stage, without explicit information being given.
For implicit information, the implicit information of the whole video is unknown, and the distribution of the implicit information is estimated by the normalized flow branches through maximum likelihood estimation. The process is unsupervised.
Optionally, in step S1, an FPN encoder is used to extract the features of all video frames.
Specifically, a video frame is passed through 5 consecutive convolution blocks to obtain features C1, C2, C3, C4 and C5. The feature C5 is then subjected to a 1 × 1 convolution to change the number of channels of the feature map, obtaining feature M5. The feature C4 is subjected to a 1 × 1 convolution and added to M5 to obtain feature M4. This procedure is repeated two more times to obtain features M3 and M2, respectively. Finally, the features M2, M3, M4 and M5 are each passed through a 3 × 3 convolution to output the final multi-scale features P2, P3, P4 and P5. The 3 × 3 convolutions further extract features at different resolutions and improve the distinctiveness of the features at different resolutions.
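As a rough illustration of the encoder just described, the following PyTorch-style sketch builds five stride-2 backbone blocks (C1 to C5), the 1 × 1 lateral convolutions, the top-down additions and the final 3 × 3 convolutions. The backbone block design, the channel widths and the nearest-neighbour upsampling inserted so that the added feature maps match in size are assumptions made only for illustration, not details taken from the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNEncoderSketch(nn.Module):
    # Illustrative FPN encoder; backbone blocks and channel widths are assumptions.
    def __init__(self, widths=(64, 128, 256, 512, 1024), out_ch=256):
        super().__init__()
        chans = [3] + list(widths)
        # First to fifth convolution blocks, each halving the spatial resolution.
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                          nn.ReLU(inplace=True))
            for i in range(5))
        # Sixth to ninth convolution blocks: 1 x 1 convolutions on C2..C5.
        self.lateral = nn.ModuleList(nn.Conv2d(w, out_ch, 1) for w in widths[1:])
        # Tenth to thirteenth convolution blocks: 3 x 3 convolutions producing P2..P5.
        self.output = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in range(4))

    def forward(self, frame):
        feats, x = [], frame
        for block in self.blocks:
            x = block(x)
            feats.append(x)                                             # C1 .. C5
        c2, c3, c4, c5 = feats[1:]
        m5 = self.lateral[3](c5)                                        # M5
        m4 = self.lateral[2](c4) + F.interpolate(m5, scale_factor=2)    # M41 + M5 -> M4
        m3 = self.lateral[1](c3) + F.interpolate(m4, scale_factor=2)    # M31 + M4 -> M3
        m2 = self.lateral[0](c2) + F.interpolate(m3, scale_factor=2)    # M21 + M3 -> M2
        return tuple(conv(m) for conv, m in zip(self.output, (m2, m3, m4, m5)))  # P2..P5

For example, FPNEncoderSketch()(torch.randn(1, 3, 256, 256)) returns four feature maps at 1/4, 1/8, 1/16 and 1/32 of the input resolution.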
Optionally, step S2 further includes steps S2-1, S2-2 and S2-3.
S2-1: the FPN extracted features P2, P3, P4 and P5 are upsampled by structurally identical upper and lower decoders, each of which corresponds to an encoder structure, again a pyramidal multi-scale structure.
S2-2: the upper decoder is used to predict the seed map for each semantic class. Wherein, each instance to be segmented corresponds to or belongs to a semantic category. The seed graph is a score graph, and a score close to the center point of the example object is high, while a score farther from the center point is low.
S2-3: the lower decoder is used to predict the amount of offset based on explicit spatio-temporal information, which is added to the spatio-temporal coordinate vector (x, y, t) to form the explicit variable embedding amount for the pixel location.
Optionally, step S3 includes steps S3-1 and S3-2.
S3-1: the normalized flow branch (or "generate flow (Glow) branch") is used to estimate the hidden variable distribution of other hidden information.
The Glow branch consists essentially of two parts. The first part is the activation normalization (actnorm) layer, which mainly completes a permutation operation on the input, typically using a 1 × 1 convolution operation. The data are normalized through the actnorm layer and activated with per-channel scale and bias parameters, so that a mini-batch of data has zero mean and unit variance after activation; both parameters are trainable. This is equivalent to preprocessing the data and avoids a degradation of network performance. The second part is the normalized flow (flow) structure, which has stronger interpretability: distributions are transformed through invertible mappings so as to fit different target distributions.
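A minimal sketch of such an activation normalization layer is shown below: a per-channel scale and bias, initialized from the first mini-batch so that the activations then have zero mean and unit variance, with both parameters remaining trainable. The data-dependent initialization is one common way to realize this and is an assumption here; the 1 × 1 convolution used for the channel permutation is omitted.

import torch
import torch.nn as nn

class ActNorm(nn.Module):
    def __init__(self, num_channels):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.initialized = False

    def forward(self, x):
        if not self.initialized:  # data-dependent initialization on the first batch
            with torch.no_grad():
                mean = x.mean(dim=(0, 2, 3), keepdim=True)
                std = x.std(dim=(0, 2, 3), keepdim=True) + 1e-6
                self.bias.copy_(-mean / std)
                self.log_scale.copy_(torch.log(1.0 / std))
            self.initialized = True
        return x * self.log_scale.exp() + self.bias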
The Glow branch contains L hierarchical modules, each of which contains K flow modules. The output of each flow module serves as the input to the next flow module, and the output of each level module serves as the input to the next level module. The number of output channels of each flow module is equal to the number of input channels, and the space mapping of the input variables can be completed. The number of output channels of each level is half of the number of input channels, and the multi-scale structure not only reduces the complexity of the model and reduces the calculated amount, but also improves the mapping depth, maps the input variables to a better space or maps the variables to a more complex polynomial, and improves the quality of output results. There is a partitioning operation between the two hierarchical modules.
The dividing operation may also be referred to as a "divide-by-two" operation. First, the feature vector z obtained from the previous hierarchical module is divided in two along the channel dimension, giving z1 and z2. A convolution operation is then performed on z1 to obtain a feature vector h whose number of channels is 2 times that of z1 (i.e., restored to the channel number of z); the mean and variance of h are obtained, and z2 is translated and scaled, i.e., the mean of h is subtracted from z2 and the result is divided by the variance of h. The original z1 is compressed and then continues to be passed forward.
The compression operation is as follows: assuming the original feature map is of size h × w × c, with the first two axes being spatial dimensions, it is divided into 2 × 2 × c blocks along the spatial dimensions (the factor 2 can be customized), and each block is then directly reshaped into 1 × 1 × 4c, so that the result is of size h/2 × w/2 × 4c. The compression operation is performed because the use of the convolutional layers is limited. Taking an image as an example, adjacent elements are correlated, i.e., the image itself has local correlation. Arbitrarily splitting and scrambling the data in the flow would destroy this local correlation. With compression, the number of channels can be increased while the local correlation is still preserved.
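The squeeze and divide-by-two operations described above can be sketched as follows. Here the feature vector h produced from z1 is read as containing a mean half and a variance half along the channel dimension; the patent text could also be read as using the statistics of h, so this reading, together with the shapes and the small constant added for numerical stability, is an assumption.

import torch
import torch.nn as nn

def squeeze(x, factor=2):
    # (N, C, H, W) -> (N, C*factor*factor, H/factor, W/factor), keeping local blocks together.
    n, c, h, w = x.shape
    x = x.view(n, c, h // factor, factor, w // factor, factor)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(n, c * factor * factor, h // factor, w // factor)

def split_step(z, conv):
    # conv maps z1 (C/2 channels) to h (C channels), e.g. nn.Conv2d(C // 2, C, 3, padding=1).
    z1, z2 = z.chunk(2, dim=1)             # each half has 1/2 the channels of z
    h = conv(z1)                           # channel number of h is 2 times that of z1
    mean, var = h.chunk(2, dim=1)
    z2 = (z2 - mean) / (var.abs() + 1e-6)  # shift and scale z2 with the mean and variance
    return squeeze(z1), z2                 # z1 is compressed and passed on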
S3-2: and optimizing the Gaussian distribution. And estimating the hidden variable embedding amount according to the hidden variable distribution obtained by the Glow branch based on the idea of information complementation. And adding the explicit variable embedding quantity obtained by the lower decoder to obtain a depth Gaussian mixture distribution model, namely fusing the explicit spatio-temporal information of the example and other implicit information to jointly act on the Gaussian mixture model.
Optionally, step S4 specifically includes the following steps:
in segmentation, clustering is required around the center of each instance. The highest value in the seed map is therefore selected as the center point of the entire instance. Meanwhile, the sum of the estimated embedding amount of the dominant variable and the embedding amount of the hidden variable is used as the standard deviation of Gaussian distribution, and the probability that the position pixel belongs to a certain class is calculated by using a Gaussian distribution density estimation function. When the probability value is greater than 0.5, the point is marked as such. Cluster labeling is performed for all pixel locations in the seed map and the process is repeated for all classes. Up to this point, the split masks for all instances are obtained.
In one embodiment, each module in the video instance segmentation system processes the video to be processed on a frame-by-frame basis. After each module or structure processes each video frame in turn, the processing result of the whole video to be processed is obtained. In this case, the video instance segmentation system includes:
the feature extraction module is used for acquiring a video to be processed, wherein the video to be processed comprises a plurality of video frames, and the plurality of video frames comprise at least one instance to be segmented; and for extracting the features of each video frame;
the shallow embedding amount estimation module is connected with the feature extraction module and is used for estimating the explicit variable embedding amount at each pixel position in each video frame based on the features of each video frame and the explicit information of the instances to be segmented contained in each video frame, wherein the explicit information comprises at least one of the following: position information and timing information, and the explicit information comes from the annotation information of each video frame or is calculated by the shallow embedding amount estimation module; and for establishing a shallow Gaussian distribution model of each instance to be segmented contained in each video frame according to the explicit variable embedding amounts of all pixel positions in each video frame;
the deep embedding quantity estimation module is connected with the feature extraction module and used for carrying out hidden variable reasoning based on the features of each video frame to obtain the hidden information of each to-be-segmented example contained in each video frame, wherein the hidden information comprises at least one of the following information: color, lighting and occlusion information; estimating an implicit variable embedding quantity of each pixel position in each video frame based on the characteristics of each video frame and the implicit information of an example to be segmented contained in each video frame; optimizing a shallow Gaussian distribution model of each example to be segmented in each video frame according to the hidden variable embedding amount of all pixel positions in each video frame to obtain a deep Gaussian distribution model of each example to be segmented;
and the clustering inference module is connected with the deep embedding amount estimation module and is used for carrying out inference clustering on each pixel position in each video frame by using a Gaussian distribution density estimation function according to a deep Gaussian distribution model of the to-be-segmented example contained in each video frame to obtain the segmentation mask of each video frame and further obtain the segmentation mask of the to-be-processed video.
In one embodiment, the feature extraction module comprises a feature pyramid network FPN encoder, the FPN encoder comprising:
a first convolution block, a second convolution block, a third convolution block, a fourth convolution block and a fifth convolution block which are connected in series in sequence, wherein each video frame is input into the first convolution block to obtain a feature C1; C1 is input into the second convolution block to obtain a feature C2; C2 is input into the third convolution block to obtain a feature C3; C3 is input into the fourth convolution block to obtain a feature C4; C4 is input into the fifth convolution block to obtain a feature C5;
a sixth convolution block, a seventh convolution block, an eighth convolution block and a ninth convolution block, which are all 1 × 1 convolution blocks, are respectively connected to the fifth convolution block, the fourth convolution block, the third convolution block and the second convolution block, and are used for changing the channel numbers of C5, C4, C3 and C2 to obtain features M5, M41, M31 and M21;
the first adder is connected with the sixth convolution block and the seventh convolution block and is used for adding M5 and M41 to obtain a feature M4; the second adder is connected with the first adder and the eighth convolution block and is used for adding M4 and M31 to obtain a feature M3; the third adder is connected with the second adder and the ninth convolution block and is used for adding M3 and M21 to obtain a feature M2;
and the tenth convolution block, the eleventh convolution block, the twelfth convolution block and the thirteenth convolution block are 3 × 3 convolution blocks, are respectively connected with the third adder, the second adder, the first adder and the sixth convolution block, and are respectively used for further performing feature extraction on M2, M3, M4 and M5 to obtain features P2, P3, P4 and P5.
In one embodiment, the shallow embedding amount estimation module includes:
an upper decoder, connected to the tenth, eleventh, twelfth and thirteenth convolution blocks, for upsampling P2, P3, P4 and P5 to predict a seed map of each instance to be segmented contained in each video frame, wherein the seed map is a score map, and the closer a pixel position is to the center point of the instance to be segmented, the higher the corresponding score;
a lower decoder, connected to the tenth, eleventh, twelfth and thirteenth convolutional blocks, for upsampling P2, P3, P4 and P5 to predict an offset of corresponding explicit information of each to-be-segmented instance contained in each video frame, wherein the explicit information includes position information and timing information;
and the embedding amount generating module is connected with the lower decoder and is used for adding the offset to the space-time coordinate vector (x, y, t) of each pixel position in each video frame to generate the explicit variable embedding amount of each pixel position in each video frame.
In one embodiment, the deep embedding amount estimation module includes:
the normalized flow branch is connected with the characteristic extraction module and used for estimating the hidden variable distribution corresponding to the hidden information of each example to be segmented contained in each video frame based on the characteristic of each video frame;
the distribution optimization module is connected with the normalized flow branch and the lower decoder and is used for estimating the hidden variable embedding amount of each pixel position in each video frame according to the hidden variable distribution corresponding to the instances to be segmented contained in each video frame; and for adding the hidden variable embedding amount and the explicit variable embedding amount of each pixel position in each video frame, and obtaining a deep Gaussian distribution model of each instance to be segmented contained in each video frame according to the added embedding amounts.
In one embodiment, the normalized flow branch comprises:
the activation normalization layer is connected with the feature extraction module and is used for performing a permutation operation on the features of each video frame;
and a normalized flow structure connected to the active normalization layer, wherein the normalized flow structure includes L hierarchical modules, each hierarchical module includes K normalized flow modules, an output of each normalized flow module is used as an input of a next normalized flow module, an output of each hierarchical module is used as an input of a next hierarchical module, and the number of output channels of each hierarchical module is 1/2 of the number of input channels.
In one embodiment, a dividing structure is disposed between two adjacent hierarchical modules of the normalized flow structure, and the dividing structure is configured to:
dividing the feature vector z obtained from the previous hierarchical module into z1 and z2 in the channel dimension, wherein the channel numbers of z1 and z2 are each 1/2 of the channel number of z;
performing a convolution operation on z1 to obtain a feature vector h, wherein the number of channels of h is 2 times that of z1;
obtaining the mean and variance of h, subtracting the mean of h from z2, and dividing by the variance of h;
z1 is compressed and then passed on.
In one embodiment, the clustering inference module is configured to:
selecting the pixel position corresponding to the highest value in the seed map of each instance to be segmented contained in each video frame as the center point of that instance to be segmented;
taking the center point of each instance to be segmented contained in each video frame as the origin, taking the sum of the explicit variable embedding amount and the hidden variable embedding amount of each pixel position in each video frame as the standard deviation, and using a Gaussian distribution density estimation function to sequentially calculate the probability that each pixel position belongs to each instance to be segmented;
and when the value of the probability is greater than 0.5, marking the pixel position as belonging to that instance to be segmented, thereby obtaining a segmentation mask of each instance to be segmented.
In an embodiment, the video segmentation system further comprises a video reconstruction module for reconstructing a seed map of each video frame according to the surveillance information during the training process.
The invention discloses a video instance segmentation system based on hierarchical Bayesian inference. Within a hierarchical Bayesian inference framework, multi-scale features are extracted from the video and the extracted features are upsampled to obtain a seed map and spatio-temporal offsets. Taking the explicit spatio-temporal embedding amounts as estimates, a shallow mixed Gaussian distribution model is established for the video to be segmented, with one Gaussian distribution model per instance. A Glow model is used to infer the hidden variable distribution, and the shallow Gaussian distribution model is optimized according to the hidden variable embedding values to obtain a deep Gaussian distribution model. Each pixel position is then evaluated with a Gaussian distribution density estimation function to obtain the final segmentation mask. The method reduces the dependence on device computing power while ensuring segmentation accuracy.
It should be noted that, in the foregoing embodiment, each included unit and each included module are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
Example two
The embodiment provides a video instance segmentation method. The method is based on the video instance segmentation system described in the first embodiment and is used for realizing instance segmentation of videos. Fig. 2 is a flowchart of a video instance segmentation method according to an embodiment of the present invention. As shown in FIG. 2, the method includes steps S10-S40.
S10: acquiring a plurality of training videos, wherein each training video comprises a plurality of training video frames, and the plurality of training video frames comprise at least one example to be segmented; the explicit information for each training video is labeled.
S20: the video instance segmentation system provided by any one of the above embodiments is constructed.
S30: training the video instance segmentation system using the plurality of training videos.
S40: the method comprises the steps of obtaining a video to be segmented, wherein the video to be segmented comprises a plurality of video frames, the plurality of video frames comprise at least one instance to be segmented, inputting the video to be segmented into a trained video instance segmentation system, and obtaining a segmentation mask of the video to be segmented.
In one embodiment, in step S30, the explicit information of each to-be-segmented instance in each training video is derived from the annotation information of each training video;
in step S40, explicit information of each instance to be segmented in the video to be segmented is calculated by the shallow embedding amount estimation module.
In one embodiment, step S30 includes: reconstructing the seed maps of the plurality of video frames of each training video according to the annotation information of each training video.
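The patent does not spell out the exact form of the reconstructed seed maps. One plausible construction, shown purely as an assumption, is to regress the predicted seed map towards a Gaussian bump centred on the annotated center of each instance.

import torch
import torch.nn.functional as F

def seed_map_target(instance_mask, sigma=0.05):
    # instance_mask: (H, W) binary ground-truth mask of one instance to be segmented.
    # Returns an (H, W) target whose values are close to 1 near the instance center.
    h, w = instance_mask.shape
    ys, xs = torch.meshgrid(torch.linspace(0, 1, h),
                            torch.linspace(0, 1, w), indexing="ij")
    cy = ys[instance_mask.bool()].mean()
    cx = xs[instance_mask.bool()].mean()
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def seed_loss(predicted_seed, instance_mask):
    # Simple L2 regression of the predicted seed map to the reconstructed target.
    return F.mse_loss(predicted_seed, seed_map_target(instance_mask))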
The video instance segmentation method of the embodiment of the invention has the same technical principle and beneficial effect as the video instance segmentation system of the first embodiment. For technical details not described in detail in the present embodiment, please refer to the video instance segmentation system in the first embodiment.
EXAMPLE III
The embodiment provides a video instance segmentation method based on hierarchical Bayesian inference. The method is implemented by the video segmentation system described in the first embodiment, and includes the following steps:
acquiring a video to be segmented, and extracting video characteristics;
establishing a shallow layer mixed Gaussian distribution model for a video to be segmented by taking the explicit space-time embedding quantity as an estimated value based on a layered Bayesian inference framework, and establishing a Gaussian distribution model for each instance; and carrying out normalized flow reasoning (or called 'variational reasoning') on the hidden variable distribution, and optimizing the Gaussian distribution model according to the hidden variable embedded value to obtain a deep mixed Gaussian distribution model.
Optionally, in the method, a pyramid structure encoder is used to perform feature extraction on the video, so as to obtain multi-scale feature output.
Optionally, in the method, a gaussian model is built for each instance based on the amount of explicit spatio-temporal embedding. Two decoders of the same structure are used to obtain the seed map and the space-time offset. The seed graph is a score graph used for predicting each category, and the score close to the central point of the example object is correspondingly high, while the score far away from the central point is lower. The offset is added to the spatio-temporal coordinate vector (x, y, t) to form the spatio-temporal embedding quantity of the pixel position.
Optionally, in the method, the Glow model is used to reason about other hidden variables; and estimating an implicit variable embedding value according to the deduced implicit variable distribution, and optimizing the shallow mixed Gaussian distribution model to obtain a deep mixed Gaussian distribution model.
Optionally, in the method, clustering is performed around each instance center. The highest value in the seed map is selected as the center point of the entire instance. Meanwhile, the estimated embedding amount is used as the standard deviation of Gaussian distribution, and the probability that the position pixel belongs to a certain class is calculated by using a Gaussian distribution density estimation function, so that the final segmentation mask is obtained.
The invention provides a video instance segmentation method based on a hierarchical Bayesian inference framework. First, a shallow Gaussian distribution is established for each instance based on its explicit embedded information, such as position information and timing information. Then, the implicit information of the instance, such as color, illumination and occlusion, is inferred with a normalized flow method (also called a variational method) to obtain the distribution of a hidden variable, and the shallow Gaussian distribution is further optimized to obtain a deep Gaussian distribution model. The method reduces the dependence on device computing power while ensuring segmentation accuracy.
The video instance segmentation method of the embodiment of the invention has the same technical principle and beneficial effect as the video instance segmentation system of the first embodiment. For technical details not described in detail in the present embodiment, please refer to the video instance segmentation system in the first embodiment.
Example four
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 3, the apparatus includes a processor 310 and a memory 320. The number of the processors 310 may be one or more, and one processor 310 is taken as an example in fig. 3.
The memory 320, which is a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules of the video instance segmentation method in the embodiments of the present invention. The processor 310 implements the video instance segmentation method provided by any of the embodiments of the present invention by running software programs, instructions, and modules stored in the memory 320.
The memory 320 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 320 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 320 may further include memory located remotely from the processor 310, which may be connected to the device/terminal/server via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
EXAMPLE five
The embodiment of the invention also provides a storage medium. Alternatively, in this embodiment, the storage medium may be configured to store a computer program for executing the video instance segmentation method provided by any embodiment of the present invention.
Optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A video instance segmentation system, comprising:
the device comprises a feature extraction module, a segmentation module and a feature extraction module, wherein the feature extraction module is used for acquiring a video to be processed, the video to be processed comprises a plurality of video frames, and the video frames comprise at least one example to be segmented; extracting features of the plurality of video frames;
a shallow embedding amount estimation module, connected to the feature extraction module and configured to estimate an explicit-variable embedding amount at each pixel position in each video frame based on the features of the plurality of video frames and explicit information of the at least one instance to be segmented, wherein the explicit information includes at least one of position information and timing information, and the explicit information comes from annotation information of the plurality of video frames or is computed by the shallow embedding amount estimation module; and to establish a shallow Gaussian distribution model of each instance to be segmented according to the explicit-variable embedding amounts at the pixel positions;
a deep embedding amount estimation module, connected to the feature extraction module and configured to perform implicit-variable inference based on the features of the plurality of video frames to obtain implicit information of each instance to be segmented, wherein the implicit information includes at least one of color information, lighting information and occlusion information; to estimate an implicit-variable embedding amount at each pixel position in each video frame based on the features of the plurality of video frames and the implicit information of the at least one instance to be segmented; and to optimize the shallow Gaussian distribution model of each instance to be segmented according to the implicit-variable embedding amounts at the pixel positions, to obtain a deep Gaussian distribution model of each instance to be segmented;
and a clustering inference module, connected to the deep embedding amount estimation module and configured to perform inference clustering on each pixel position in each video frame using a Gaussian distribution density estimation function, according to the deep Gaussian distribution models of all the instances to be segmented, to obtain the segmentation mask of the video to be processed.
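The four modules recited in claim 1 can be pictured as a simple forward pipeline. The following is a minimal PyTorch-style sketch of that composition only; the class and attribute names are illustrative rather than taken from the patent, and the concrete sub-modules are the subject of the later claims.

import torch
import torch.nn as nn

class VideoInstanceSegmenter(nn.Module):
    def __init__(self, backbone, shallow_head, deep_head, cluster_head):
        super().__init__()
        self.backbone = backbone          # feature extraction module
        self.shallow_head = shallow_head  # shallow embedding amount estimation module
        self.deep_head = deep_head        # deep embedding amount estimation module
        self.cluster_head = cluster_head  # clustering inference module

    def forward(self, frames):            # frames: (T, 3, H, W) clip of the video to be processed
        feats = self.backbone(frames)
        seed_map, explicit_emb = self.shallow_head(feats)   # per-pixel explicit-variable embeddings
        implicit_emb = self.deep_head(feats)                 # per-pixel implicit-variable embeddings
        masks = self.cluster_head(seed_map, explicit_emb, implicit_emb)
        return masks                       # segmentation masks for the instances to be segmented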
2. The video instance segmentation system of claim 1, wherein the feature extraction module comprises a Feature Pyramid Network (FPN) encoder, the FPN encoder comprising:
a first convolution block, a second convolution block, a third convolution block, a fourth convolution block and a fifth convolution block connected in series in sequence, wherein a video frame is input into the first convolution block to obtain a feature C1; C1 is input into the second convolution block to obtain a feature C2; C2 is input into the third convolution block to obtain a feature C3; C3 is input into the fourth convolution block to obtain a feature C4; and C4 is input into the fifth convolution block to obtain a feature C5;
a sixth convolution block, a seventh convolution block, an eighth convolution block and a ninth convolution block, which are all 1 × 1 convolution blocks, are respectively connected to the fifth convolution block, the fourth convolution block, the third convolution block and the second convolution block, and are used for changing the channel numbers of C5, C4, C3 and C2 to obtain features M5, M41, M31 and M21;
the first adder is connected with the sixth convolution block and the seventh convolution block and is used for adding M5 and M41 to obtain a feature M4; the second adder is connected with the first adder and the eighth convolution block and is used for adding M4 and M31 to obtain a feature M3; the third adder is connected with the second adder and the ninth convolution block and is used for adding M3 and M21 to obtain a feature M2;
and a tenth convolution block, an eleventh convolution block, a twelfth convolution block and a thirteenth convolution block, which are 3 × 3 convolution blocks, are respectively connected to the third adder, the second adder, the first adder and the sixth convolution block, and are respectively used for further extracting features from M2, M3, M4 and M5 to obtain features P2, P3, P4 and P5.
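Claim 2 describes what amounts to a standard feature pyramid encoder. Below is a hedged PyTorch sketch, assuming each "convolution block" is a conv–BN–ReLU stage and that the top-down additions include the usual nearest-neighbour upsampling (the claim states only the additions); all channel counts and identifiers are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

def stage(cin, cout, stride):
    # one "convolution block": conv + BN + ReLU, downsampling by `stride`
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class FPNEncoder(nn.Module):
    def __init__(self, out_ch=256):
        super().__init__()
        chans = [3, 64, 128, 256, 512, 1024]
        # first to fifth convolution blocks in series -> C1..C5
        self.blocks = nn.ModuleList(
            stage(chans[i], chans[i + 1], 1 if i == 0 else 2) for i in range(5))
        # sixth to ninth blocks: 1x1 convs changing the channel numbers of C5, C4, C3, C2
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in chans[2:][::-1])
        # tenth to thirteenth blocks: 3x3 convs producing P2..P5
        self.output = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in range(4))

    def forward(self, x):
        cs = []
        for blk in self.blocks:
            x = blk(x)
            cs.append(x)                        # C1..C5
        m = self.lateral[0](cs[4])              # M5 from C5
        ms = [m]
        for lat, c in zip(self.lateral[1:], [cs[3], cs[2], cs[1]]):
            # top-down adders: upsample (assumed) then add the lateral feature -> M4, M3, M2
            m = F.interpolate(m, size=c.shape[-2:], mode="nearest") + lat(c)
            ms.append(m)
        ps = [conv(m) for conv, m in zip(self.output, ms[::-1])]   # P2, P3, P4, P5
        return ps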
3. The video instance segmentation system of claim 2, wherein the shallow embedding amount estimation module comprises:
an upper decoder, connected to the tenth, eleventh, twelfth and thirteenth convolution blocks and configured to up-sample P2, P3, P4 and P5 to predict a seed map of each instance to be segmented in the video frame, wherein the seed map is a score map in which a pixel position closer to the center point of an instance to be segmented has a higher score;
a lower decoder, connected to the tenth, eleventh, twelfth and thirteenth convolution blocks and configured to up-sample P2, P3, P4 and P5 to predict an offset of the corresponding explicit information of each instance to be segmented in the video frame, wherein the explicit information includes position information and timing information;
and an embedding amount generation module, connected to the lower decoder and configured to add the offset to the spatio-temporal coordinate vector of each pixel position in the video frame, to generate the explicit-variable embedding amount of each pixel position in the video frame.
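The embedding amount generation step of claim 3 reduces to adding the predicted offsets to a per-pixel spatio-temporal coordinate grid. A small sketch follows, assuming normalised (x, y, t) coordinates and a (3, H, W) offset tensor from the lower decoder; the shapes and the normalisation are assumptions for illustration only.

import torch

def explicit_embeddings(offsets, frame_index, num_frames):
    """offsets: (3, H, W) per-pixel (dx, dy, dt) predicted by the lower decoder."""
    _, h, w = offsets.shape
    ys, xs = torch.meshgrid(torch.linspace(0, 1, h),
                            torch.linspace(0, 1, w), indexing="ij")
    ts = torch.full_like(xs, frame_index / max(num_frames - 1, 1))
    coords = torch.stack([xs, ys, ts])   # spatio-temporal coordinate vector per pixel
    return coords + offsets              # explicit-variable embedding amount per pixel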
4. The video instance segmentation system of claim 3, wherein the deep embedding amount estimation module comprises:
a normalizing flow branch, connected to the feature extraction module and configured to estimate, based on the features of the video frames, the implicit-variable distribution corresponding to the implicit information of each instance to be segmented;
and a distribution optimization module, connected to the normalizing flow branch and the lower decoder and configured to estimate the implicit-variable embedding amount of each pixel position in each video frame according to the implicit-variable distribution corresponding to the at least one instance to be segmented, to add the implicit-variable embedding amount and the explicit-variable embedding amount of each pixel position, and to obtain the deep Gaussian distribution model of each instance to be segmented according to the summed embedding amounts at the pixel positions.
5. The video instance segmentation system of claim 4, wherein the normalizing flow branch comprises:
an activation normalization layer, connected to the feature extraction module and configured to perform a permutation operation on the features of the video frames;
and a normalizing flow structure, connected to the activation normalization layer, wherein the normalizing flow structure includes L hierarchical modules, each hierarchical module includes K normalizing flow modules, the output of each normalizing flow module serves as the input of the next normalizing flow module, the output of each hierarchical module serves as the input of the next hierarchical module, and the number of output channels of each hierarchical module is 1/2 of its number of input channels.
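The nesting described in claim 5 (L hierarchical modules of K chained flow modules, with the channel count halving per hierarchical module) can be sketched as follows. The FlowStep body is a placeholder affine transform rather than a full flow step, the activation normalization layer is omitted, and the between-level split is reduced to a channel slice; only the control structure is taken from the claim.

import torch
import torch.nn as nn

class FlowStep(nn.Module):
    """Placeholder for one normalizing flow module (e.g. actnorm plus coupling)."""
    def __init__(self, channels):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.shift = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        return x * self.scale + self.shift     # simple invertible affine stand-in

class FlowBranch(nn.Module):
    def __init__(self, in_channels, L=3, K=4):
        super().__init__()
        # in_channels is assumed divisible by 2**L
        self.levels = nn.ModuleList()
        ch = in_channels
        for _ in range(L):                      # L hierarchical modules
            self.levels.append(nn.ModuleList(FlowStep(ch) for _ in range(K)))
            ch //= 2                            # each level outputs half the channels

    def forward(self, x):                       # x: (B, C, H, W)
        for level in self.levels:
            for step in level:                  # K flow modules chained in series
                x = step(x)
            x = x[:, : x.shape[1] // 2]         # stand-in for the split structure of claim 6
        return x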
6. The video instance segmentation system of claim 5, wherein a split structure is disposed between two adjacent hierarchical modules of the normalizing flow structure, the split structure being configured to:
split the feature vector z output by the previous hierarchical module into z1 and z2 along the channel dimension, wherein the channel numbers of z1 and z2 are each 1/2 of the channel number of z;
perform a convolution operation on z1 to obtain a feature vector h, wherein the number of channels of h is 2 times that of z1;
obtain the mean and variance from h, subtract the mean from z2, and divide the result by the variance;
and squeeze z1 and pass it on to the next hierarchical module.
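A sketch of the split structure of claim 6. The channel bookkeeping follows the claim; reading "the mean and variance of h" as a channel-wise split of h into a mean map and a variance map is an interpretation consistent with the stated channel doubling, and the 2×2 space-to-depth squeeze of z1 is an assumption, since the claim only says z1 is compressed before being passed on.

import torch
import torch.nn as nn

class Split(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # convolution mapping z1 (channels/2) to h with twice the channels of z1
        self.prior = nn.Conv2d(channels // 2, channels, 3, padding=1)

    def forward(self, z):                        # z: (B, channels, H, W), channels even
        z1, z2 = z.chunk(2, dim=1)               # split along the channel dimension
        h = self.prior(z1)                        # channels(h) = 2 * channels(z1)
        mean, var = h.chunk(2, dim=1)             # mean and variance maps taken from h
        z2_norm = (z2 - mean) / (var.abs() + 1e-6)    # normalise z2 as stated in the claim
        # squeeze z1: trade spatial resolution for channels before passing it on
        z1 = nn.functional.pixel_unshuffle(z1, 2)
        return z1, z2_norm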
7. The video instance segmentation system of claim 6, wherein the clustering inference module is configured to:
select the pixel position with the highest value in the seed map of each instance to be segmented as the center point of that instance;
taking the center point of each instance to be segmented as the origin and the sum of the explicit-variable embedding amount and the implicit-variable embedding amount at each pixel position as the standard deviation, use a Gaussian distribution density estimation function to calculate, in turn, the probability that each pixel position belongs to each instance to be segmented;
and, when the probability is greater than 0.5, mark the pixel position as belonging to the corresponding instance to be segmented, thereby obtaining the segmentation mask of each instance to be segmented.
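For a single instance, the clustering inference of claim 7 can be sketched as follows: take the arg-max of the seed map as the center, evaluate a Gaussian density around the center's embedding, and threshold at 0.5. The embedding and standard-deviation tensors are assumed inputs for illustration; how the standard deviation is derived from the summed embeddings is stated in claims 4 and 7, not re-derived here.

import torch

def infer_mask(seed_map, embeddings, sigma):
    """seed_map: (H, W) seed scores; embeddings: (D, H, W) summed explicit and
    implicit per-pixel embeddings; sigma: (H, W) standard deviation map."""
    h, w = seed_map.shape
    cy, cx = divmod(torch.argmax(seed_map).item(), w)    # center point of the instance
    center = embeddings[:, cy, cx].view(-1, 1, 1)
    dist2 = ((embeddings - center) ** 2).sum(dim=0)      # squared embedding distance to the center
    prob = torch.exp(-dist2 / (2 * sigma ** 2 + 1e-6))   # Gaussian distribution density estimate
    return prob > 0.5                                     # segmentation mask for this instance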
8. A method for segmenting a video instance, comprising:
S10: acquiring a plurality of training videos, wherein each training video comprises a plurality of training video frames, and the plurality of training video frames comprise at least one instance to be segmented; and annotating the explicit information of each training video;
S20: constructing a video instance segmentation system according to any one of claims 1 to 7;
S30: training the video instance segmentation system using the plurality of training videos;
S40: acquiring a video to be segmented, wherein the video to be segmented comprises a plurality of video frames, and the plurality of video frames comprise at least one instance to be segmented; and inputting the video to be segmented into the trained video instance segmentation system to obtain a segmentation mask of the video to be segmented.
9. The method of video instance segmentation of claim 8,
in step S30, the explicit information of each to-be-segmented instance in each training video is derived from the annotation information of each training video;
in step S40, explicit information of each instance to be segmented in the video to be segmented is calculated by the shallow embedding amount estimation module.
10. The video instance segmentation method according to claim 9, wherein step S30 comprises: reconstructing the seed maps of the plurality of video frames of each training video according to the annotation information of each training video.
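Steps S10–S40 of claims 8–10 amount to a conventional train-then-infer loop. A high-level sketch follows, assuming the pipeline class sketched after claim 1 together with a hypothetical training_loss hook and data loader; none of these names come from the patent.

import torch

def train_and_segment(model, train_loader, test_video, epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):                        # S30: train on the annotated training videos
        for frames, annotations in train_loader:   # S10: clips with labelled explicit information
            loss = model.training_loss(frames, annotations)   # hypothetical loss hook
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    model.eval()
    with torch.no_grad():                          # S40: segment an unseen video to be segmented
        return model(test_video)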
CN202110408364.3A 2021-04-16 2021-04-16 Video instance segmentation system and method Active CN112990356B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110408364.3A CN112990356B (en) 2021-04-16 2021-04-16 Video instance segmentation system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110408364.3A CN112990356B (en) 2021-04-16 2021-04-16 Video instance segmentation system and method

Publications (2)

Publication Number Publication Date
CN112990356A CN112990356A (en) 2021-06-18
CN112990356B true CN112990356B (en) 2021-08-03

Family

ID=76340739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110408364.3A Active CN112990356B (en) 2021-04-16 2021-04-16 Video instance segmentation system and method

Country Status (1)

Country Link
CN (1) CN112990356B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113824989B (en) * 2021-07-13 2024-02-27 腾讯科技(深圳)有限公司 Video processing method, device and computer readable storage medium
CN113792738A (en) * 2021-08-05 2021-12-14 北京旷视科技有限公司 Instance splitting method, instance splitting apparatus, electronic device, and computer-readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704862A (en) * 2017-11-06 2018-02-16 深圳市唯特视科技有限公司 A kind of video picture segmentation method based on semantic instance partitioning algorithm
CN111222476A (en) * 2020-01-10 2020-06-02 北京百度网讯科技有限公司 Video time sequence action detection method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11276419B2 (en) * 2019-07-30 2022-03-15 International Business Machines Corporation Synchronized sound generation from videos

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704862A (en) * 2017-11-06 2018-02-16 深圳市唯特视科技有限公司 A kind of video picture segmentation method based on semantic instance partitioning algorithm
CN111222476A (en) * 2020-01-10 2020-06-02 北京百度网讯科技有限公司 Video time sequence action detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112990356A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
Shysheya et al. Textured neural avatars
Messaoud et al. Structural consistency and controllability for diverse colorization
CN112990356B (en) Video instance segmentation system and method
CN111696110B (en) Scene segmentation method and system
Halit et al. Multiscale motion saliency for keyframe extraction from motion capture sequences
WO2007047461A9 (en) Bi-directional tracking using trajectory segment analysis
Huang et al. Channelized axial attention–considering channel relation within spatial attention for semantic segmentation
Din et al. Effective removal of user-selected foreground object from facial images using a novel GAN-based network
CN112164100B (en) Image registration method based on graph convolution neural network
JP4567660B2 (en) A method for determining a segment of an object in an electronic image.
Sun et al. Masked lip-sync prediction by audio-visual contextual exploitation in transformers
CN112801068A (en) Video multi-target tracking and segmenting system and method
CN113808005A (en) Video-driving-based face pose migration method and device
JP2023131117A (en) Joint perception model training, joint perception method, device, and medium
CN115272437A (en) Image depth estimation method and device based on global and local features
CN115049556A (en) StyleGAN-based face image restoration method
Zhao et al. Saan: Semantic attention adaptation network for face super-resolution
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN114299101A (en) Method, apparatus, device, medium, and program product for acquiring target region of image
CN111932458B (en) Image information extraction and generation method based on inter-region attention mechanism
CN116758449A (en) Video salient target detection method and system based on deep learning
CN113554655B (en) Optical remote sensing image segmentation method and device based on multi-feature enhancement
CN114331894A (en) Face image restoration method based on potential feature reconstruction and mask perception
CN112164078B (en) RGB-D multi-scale semantic segmentation method based on encoder-decoder
CN114724058A (en) Method for extracting key frames of fusion characteristic motion video based on human body posture recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220330

Address after: 250101 room 1602, 16 / F, building 2, Xinsheng building, northwest corner of the intersection of Xinluo street and Yingxiu Road, high tech Zone, Jinan, Shandong Province

Patentee after: Jinan Safety Technology Co.,Ltd.

Address before: Room 156-8, No.5 Lingbin Road, Dangan Town, Xiangzhou District, Zhuhai City, Guangdong Province 519000

Patentee before: Guangdong Zhongju Artificial Intelligence Technology Co.,Ltd.

TR01 Transfer of patent right