CN105719297A - Object cutting method and device based on video

Object cutting method and device based on video

Info

Publication number
CN105719297A
CN105719297A
Authority
CN
China
Prior art keywords
cutting
video
image
pixel
frame
Prior art date
Legal status
Pending
Application number
CN201610041711.2A
Other languages
Chinese (zh)
Inventor
陈世峰 (Chen Shifeng)
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201610041711.2A
Publication of CN105719297A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10016 - Video; Image sequence
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning

Landscapes

  • Image Analysis (AREA)

Abstract

The invention is applicable to the technical field of video processing and provides a video-based object cutting method and device. The method comprises the steps of: extracting one frame of the video and cutting a designated object out of that image through a graph cutting algorithm; performing feature learning on the cut object to obtain statistical characteristics of the object region, the non-object region and the boundary of the object; and, based on the obtained statistical characteristics of the object, cutting the object in the other images of the video through a conditional random field model. In the method and device, the statistical characteristics of the cut object are learned from the cutting result of the first image, and the cutting of the object in the other frames of the video clip is then achieved through the conditional random field model, so that automatic cutting of any object in any video can be completed. Object cutting is therefore not limited by conditions such as a stationary background, a stationary camera, a moving foreground or a known background, and the processing capability of the object cutting algorithm is improved.

Description

Object cutting method and device based on video
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to a video-based object cutting method and device.
Background
The object segmentation technology in video plays an extremely important role in many application fields of computer vision, including video surveillance, video editing, video retrieval and the like. In order to improve the precision of object cutting, most algorithms in the prior art impose assumptions on the video data, such as the assumption that the video background is still or the assumption that the camera remains still.
Disclosure of Invention
In view of this, embodiments of the present invention provide a video-based object cutting method and apparatus, so as to solve the prior-art problem that object cutting is limited to particular application scenarios and cannot be completed for an arbitrary video.
In a first aspect, a video-based object cutting method is provided, including:
extracting one frame of image of a video, and cutting out a specified object in the image through a graph cutting algorithm;
performing feature learning on the cut object to obtain statistical features of an object region, a non-object region and a boundary of the object;
and based on the acquired statistical characteristics of the object, cutting the object in other frame images of the video through a conditional random field model.
In a second aspect, there is provided a video-based object cutting apparatus comprising:
the first cutting unit is used for extracting one frame of image of the video and cutting out a specified object in the image through a graph cutting algorithm;
the characteristic learning unit is used for carrying out characteristic learning on the cut object and acquiring the statistical characteristics of an object region, a non-object region and a boundary of the object;
and the second cutting unit is used for cutting the object in other frame images of the video through the conditional random field model based on the acquired statistical characteristics of the object.
According to the embodiment of the invention, an object in the first frame image of a video segment is first cut using a simple graph cutting algorithm; the statistical characteristics of the cut object are then learned based on the cutting result of that first frame, and the object in the other frames of the video segment is cut using a conditional random field model. Automatic cutting of any object in any video can thus be completed, so that object cutting is not limited by conditions such as a stationary background, a stationary camera, a moving foreground or a known background, and the processing capability of the object cutting algorithm is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a flowchart of an implementation of a video-based object cutting method according to an embodiment of the present invention;
fig. 2 is a block diagram of a video-based object cutting apparatus according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
Fig. 1 shows an implementation flow of the video-based object cutting method provided by the embodiment of the present invention, which is detailed as follows:
in S101, one frame of image of the video is extracted, and a designated object in the image is cut out by a graph cutting algorithm.
Preferably, the first frame image of the video may be extracted according to the playing sequence of the video image frames, and the object designated through user interaction may be cut out of that image by the graph cutting algorithm.
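As an illustration only (the embodiment does not prescribe a particular graph cutting implementation), this first-frame cut could be sketched with OpenCV's GrabCut, a graph-cut-based algorithm; the video path and the rectangle standing in for the user interaction are assumptions of this sketch.
```python
# Sketch: cutting a user-designated object out of the first frame with
# OpenCV's GrabCut (one graph-cut-based algorithm; assumed, not mandated).
import cv2
import numpy as np

cap = cv2.VideoCapture("input.mp4")        # hypothetical video file
ok, first_frame = cap.read()               # first frame in playing order
assert ok, "could not read the first frame"

# Stand-in for the user interaction that designates the object:
rect = (50, 50, 200, 300)                  # (x, y, width, height), assumed

mask = np.zeros(first_frame.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)  # GrabCut's internal model buffers
fgd_model = np.zeros((1, 65), np.float64)
cv2.grabCut(first_frame, mask, rect, bgd_model, fgd_model, 5,
            cv2.GC_INIT_WITH_RECT)

# Definite or probable foreground pixels form the cut-out object.
object_mask = np.where(
    (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
```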
In S102, feature learning is performed on the cut object, and statistical features of an object region, a non-object region, and a boundary of the object are acquired.
In the embodiment of the present invention, a Support Vector Machine (SVM) may be used as a classifier, and the color or brightness value of a pixel and the color or brightness value of an image block centered on the pixel are selected as image statistical features, so as to learn the statistical features of the object region, the non-object region, and the boundary of the object, respectively.
The object cutting result of the first frame image of the video is taken as training data and learned with an SVM. Suppose $f_o$ is the learned classification function, where $\operatorname{sgn}[f_o(o_i)] = +1$ indicates that pixel $i$ belongs to the object region and $\operatorname{sgn}[f_o(o_i)] = -1$ indicates that pixel $i$ belongs to a non-object region, $\operatorname{sgn}$ denoting the sign function.
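A minimal sketch of this learning step, assuming scikit-learn's SVM and the first_frame and object_mask variables of the previous sketch; the patch size and the linear kernel are likewise assumptions:
```python
# Sketch: learning f_o from the first-frame cut with an SVM
# (scikit-learn is assumed; any SVM implementation would do).
import numpy as np
from sklearn.svm import LinearSVC

def pixel_features(img, half=1):
    """Color of each pixel plus the colors of the (2*half+1)^2 image
    block centered on it, as the statistical features described above."""
    h, w, c = img.shape
    pad = np.pad(img, ((half, half), (half, half), (0, 0)), mode="edge")
    blocks = [pad[dy:dy + h, dx:dx + w].reshape(h * w, c)
              for dy in range(2 * half + 1)
              for dx in range(2 * half + 1)]
    return np.hstack(blocks).astype(np.float32) / 255.0

X = pixel_features(first_frame)                 # one feature row per pixel
y = np.where(object_mask.ravel() == 1, 1, -1)   # +1 object, -1 non-object

f_o = LinearSVC(C=1.0).fit(X, y)                # learned classifier f_o
labels = np.sign(f_o.decision_function(X))      # sgn[f_o(o_i)]
```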
In S103, based on the obtained statistical features of the object, the object in other frame images of the video is cut through a conditional random field model.
Let $o = \{o_i\}_{i \in I}$ denote an image frame to be processed and $r = \{r_i\}_{i \in I}$ the segmentation result of that frame, where $I$ is the set of pixels in the image frame, $o_i$ is the statistical feature (i.e., luminance, color, etc.) of pixel $i$, and $r_i$ is the label of pixel $i$, taking the value $+1$ or $-1$ ($r_i = +1$ means that pixel $i$ belongs to the object region, $r_i = -1$ means that pixel $i$ belongs to a non-object region). The object cutting problem can thereby be described as solving for the optimal labeling of all pixels in the video image frame.
With a Conditional Random Field (CRF) model, the optimal labeling of the pixels in a video image frame can be obtained by maximizing the following posterior probability $p(r \mid o)$:

$$p(r \mid o) = \frac{1}{Z} \exp\Big( -\sum_{i \in I} u(r_i, o) - \sum_{i \in I} \sum_{j \in \mathcal{N}_i} v(r_i, r_j) \Big), \qquad (1)$$

where $\mathcal{N}_i$ is the neighborhood of pixel $i$, comprising a spatial neighborhood $\mathcal{N}_i^{s}$ and a temporal neighborhood $\mathcal{N}_i^{t}$; for each pixel, 8 spatial neighbors and 18 temporal neighbors are employed, and $Z$ is the partition (normalization) function. The maximization of the above posterior probability can be translated into the minimization of the following energy function:

$$E(r) = \sum_{i \in I} u(r_i, o) + \sum_{i \in I} \sum_{j \in \mathcal{N}_i} v(r_i, r_j). \qquad (2)$$
in order to define u in equation (2), a classifier needs to be selected, and in the embodiment of the present invention, a Support Vector Machine (SVM) may be used as the classifier, and the color of the pixel and the color of the image block centered on the pixel are selected as the image statistical feature.
The object cutting result of the first frame image of the video is taken as training data and learned with an SVM. Suppose $f_o$ is the learned classification function; in classification, $\operatorname{sgn}[f_o(o_i)] = +1$ indicates that pixel $i$ belongs to the object region and $\operatorname{sgn}[f_o(o_i)] = -1$ indicates that pixel $i$ belongs to a non-object region, $\operatorname{sgn}$ denoting the sign function. Through the SVM, a hyperplane can be found that has the largest distance to the closest data points of the two classes in the training data. Letting $m_{+1}$ and $m_{-1}$ be the maximum distances from the hyperplane within the two classes of training data respectively, $u$ is defined as

$$u(r_i, o) = -\frac{r_i \, f_o(o_i)}{m_{r_i}}, \qquad (3)$$

where $m_{r_i} = m_{+1}$ when $r_i = +1$ and $m_{r_i} = m_{-1}$ when $r_i = -1$.
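Under the normalized form of equation (3) reconstructed above (itself an assumption, as the original equation is not reproduced in this text), the unary term could be computed as follows, reusing f_o, X and y from the SVM sketch:
```python
# Sketch: unary term u normalized by the margins m_{+1} and m_{-1},
# following the reconstructed equation (3); the exact form is an assumption.
import numpy as np

scores = f_o.decision_function(X)       # f_o(o_i) for every pixel i
m_pos = scores[y == 1].max()            # m_{+1}: largest distance, object side
m_neg = -scores[y == -1].min()          # m_{-1}: largest distance, other side

u_object = -(+1) * scores / m_pos       # u(r_i = +1, o): cost of "object"
u_background = -(-1) * scores / m_neg   # u(r_i = -1, o): cost of "non-object"
```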
in formula (2), v is defined as the sum of three terms:
$$v(r_i, r_j) = v_{ij}^{c} + v_{ij}^{b} + v_{ij}^{t}, \qquad (4)$$
wherein:
$$v_{ij}^{c} = \alpha \cdot \exp\Big( -\frac{\| c_i - c_j \|}{\sigma} \Big) \cdot | r_i - r_j |, \qquad (5)$$
where $i$ and $j$ are pixel pairs in the spatial neighborhood and $\alpha$ and $\sigma$ are control parameters: $\alpha$ is obtained by cutting 100 randomly selected images, and $\sigma$ is automatically set to $2\langle \| c_i - c_j \| \rangle$, where $\langle \cdot \rangle$ denotes the mean operation over the entire image.
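A sketch of how the term of equation (5) could be evaluated for horizontally adjacent pixel pairs, with sigma set from the image mean as just described; the value of alpha here is a stand-in for the tuned parameter:
```python
# Sketch: spatial contrast term v^c of equation (5) for horizontal
# neighbor pairs; alpha is a stand-in (the tuned value is not given).
import numpy as np

frame = first_frame.astype(np.float64)
diff = np.linalg.norm(frame[:, 1:] - frame[:, :-1], axis=2)  # ||c_i - c_j||
sigma = 2.0 * diff.mean()      # sigma = 2 * <||c_i - c_j||> over the image
alpha = 1.0                    # assumed control parameter

def v_c(r_i, r_j, d):
    """Paid only when the two labels differ (|r_i - r_j| is 0 or 2)."""
    return alpha * np.exp(-d / sigma) * abs(r_i - r_j)
```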
Based on the initial image cutting result, pixel pairs with different label values are selected in all neighborhoods, and from the selected data three 2-dimensional histogram tables $H_r$, $H_g$ and $H_b$ are computed, corresponding to the R, G and B color channels respectively. Define:
$$v_{ij}^{b} = \alpha' \cdot \exp\Big( -\frac{H_{ij}}{\sigma'} \Big) \cdot | r_i - r_j |, \qquad (6)$$
where $H_{ij} = \max\{ H_r(c_i - c_j),\ H_g(c_i - c_j),\ H_b(c_i - c_j) \}$, $\alpha' = \alpha$, and $\sigma' = N_p$, where $N_p$ is the number of pixel pairs selected from the initial result.
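One possible way to compute the three histogram tables and $H_{ij}$; the bin layout and the pair list are assumptions of this sketch:
```python
# Sketch: per-channel 2-D histograms H_r, H_g, H_b of equation (6),
# built from neighbor pairs whose initial labels differ. `pairs` is an
# assumed list of ((y1, x1), (y2, x2)) coordinates from the initial cut.
import numpy as np

def channel_histograms(frame, pairs, bins=32):
    hists = []
    for ch in range(3):                               # R, G, B channels
        ci = np.array([frame[p[0], p[1], ch] for p, q in pairs], float)
        cj = np.array([frame[q[0], q[1], ch] for p, q in pairs], float)
        h, _, _ = np.histogram2d(ci, cj, bins=bins,
                                 range=[[0, 256], [0, 256]])
        hists.append(h / max(len(pairs), 1))          # normalized table
    return hists                                      # [H_r, H_g, H_b]

def h_ij(hists, c_i, c_j, bins=32):
    """H_ij: the maximum of the three channel histograms for one pair."""
    idx = lambda v: min(int(v) * bins // 256, bins - 1)
    return max(h[idx(c_i[ch]), idx(c_j[ch])] for ch, h in enumerate(hists))
```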
The temporal term $v_{ij}^{t}$ is defined by a contrast similar to that of $v_{ij}^{c}$, with $i$ and $j$ being pixel pairs in the temporal neighborhood.
By combining the above formulas, the final energy function is obtained, and the optimal label of each pixel in each image frame of the video is determined by minimizing this energy function, thereby completing the cutting of the object. The minimization of the energy function can be realized by a graph cutting algorithm.
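A minimal sketch of this minimization using the PyMaxflow max-flow/min-cut library; the library choice, the scalar pairwise weight and the source/sink convention are assumptions rather than part of the embodiment:
```python
# Sketch: minimizing the final energy by graph cut, using PyMaxflow as
# one possible solver; a single scalar weight stands in for the full
# pairwise term v, and the label convention is a choice of this sketch.
import maxflow
import numpy as np

h, w = object_mask.shape
u_obj = u_object.reshape(h, w)          # cost of labeling a pixel +1
u_bgd = u_background.reshape(h, w)      # cost of labeling a pixel -1

# Shift both unary maps by the same constant so capacities are >= 0;
# a per-pixel constant shift does not change the minimizing labeling.
offset = min(u_obj.min(), u_bgd.min())

g = maxflow.Graph[float]()
nodes = g.add_grid_nodes((h, w))
g.add_grid_edges(nodes, weights=1.0)                  # simplified pairwise term
g.add_grid_tedges(nodes, u_bgd - offset, u_obj - offset)

g.maxflow()
segments = g.get_grid_segments(nodes)   # True where the pixel is on the sink side
r = np.where(segments, -1, 1)           # source-side pixels take label +1 here
```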
According to the embodiment of the invention, an object in the first frame image of a video segment is first cut using a simple graph cutting algorithm; the statistical characteristics of the cut object are then learned based on the cutting result of that first frame, and the object in the other frames of the video segment is cut using a conditional random field model. Automatic cutting of any object in any video can thus be completed, so that object cutting is not limited by conditions such as a stationary background, a stationary camera, a moving foreground or a known background, and the processing capability of the object cutting algorithm is improved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Corresponding to the video-based object cutting method described in the above embodiments, fig. 2 shows a structural block diagram of the video-based object cutting apparatus provided in the embodiments of the present invention, and for convenience of description, only the parts related to the present embodiment are shown.
Referring to fig. 2, the apparatus includes:
the first cutting unit 21 is used for extracting one frame of image of the video and cutting out a specified object in the image through a graph cutting algorithm;
a feature learning unit 22 configured to perform feature learning on the cut object, and acquire statistical features of an object region, a non-object region, and a boundary of the object;
and the second cutting unit 23 is configured to cut the object in the other frame images of the video through the conditional random field model based on the acquired statistical characteristics of the object.
Optionally, the second cutting unit 23 is configured to:
and minimizing a preset energy function to determine an optimal label of each pixel, wherein the label is used for indicating whether the corresponding pixel belongs to the object or not.
Optionally, the first cutting unit 21 is specifically configured to:
a first frame image of the video is extracted.
Optionally, the feature learning unit 22 is specifically configured to:
and performing feature learning on the cut object through a support vector machine.
Optionally, the statistical features include a color or a luminance value of a pixel, and a color or a luminance value of an image block centered on the pixel.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention may be implemented in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A video-based object cutting method, comprising:
extracting one frame of image of a video, and cutting out a specified object in the image through a graph cutting algorithm;
performing feature learning on the cut object to obtain statistical features of an object region, a non-object region and a boundary of the object;
and based on the acquired statistical characteristics of the object, cutting the object in other frame images of the video through a conditional random field model.
2. The method of claim 1, wherein said cutting the object in other frame images of the video through a conditional random field model based on the obtained statistical features of the object comprises:
and minimizing a preset energy function to determine an optimal label of each pixel, wherein the label is used for indicating whether the corresponding pixel belongs to the object or not.
3. The method of claim 1, wherein the extracting one of the frames of the video comprises:
a first frame image of the video is extracted.
4. The method of claim 1, wherein the feature learning of the cut object comprises:
and performing feature learning on the cut object through a support vector machine.
5. The method of claim 1, wherein the statistical features comprise color or luminance values of a pixel and color or luminance values of image blocks centered on the pixel.
6. A video-based object cutting apparatus, comprising:
the first cutting unit is used for extracting one frame of image of the video and cutting out a specified object in the image through a graph cutting algorithm;
the characteristic learning unit is used for carrying out characteristic learning on the cut object and acquiring the statistical characteristics of an object region, a non-object region and a boundary of the object;
and the second cutting unit is used for cutting the object in other frame images of the video through the conditional random field model based on the acquired statistical characteristics of the object.
7. The apparatus of claim 6, wherein the second cutting unit is to:
and minimizing a preset energy function to determine an optimal label of each pixel, wherein the label is used for indicating whether the corresponding pixel belongs to the object or not.
8. The apparatus of claim 6, wherein the first cutting unit is specifically configured to:
a first frame image of the video is extracted.
9. The apparatus as claimed in claim 6, wherein the feature learning unit is specifically configured to:
and performing feature learning on the cut object through a support vector machine.
10. The apparatus of claim 6, wherein the statistical features comprise color or luminance values of a pixel and color or luminance values of an image block centered on the pixel.
CN201610041711.2A 2016-01-21 2016-01-21 Object cutting method and device based on video Pending CN105719297A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610041711.2A CN105719297A (en) 2016-01-21 2016-01-21 Object cutting method and device based on video


Publications (1)

Publication Number Publication Date
CN105719297A true CN105719297A (en) 2016-06-29

Family

ID=56154842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610041711.2A Pending CN105719297A (en) 2016-01-21 2016-01-21 Object cutting method and device based on video

Country Status (1)

Country Link
CN (1) CN105719297A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101689305A (en) * 2007-06-05 2010-03-31 微软公司 Learning object cutout from a single example
US20110200230A1 (en) * 2008-10-10 2011-08-18 Adc Automotive Distance Control Systems Gmbh Method and device for analyzing surrounding objects and/or surrounding scenes, such as for object and scene class segmenting
CN102044151A (en) * 2010-10-14 2011-05-04 吉林大学 Night vehicle video detection method based on illumination visibility identification
JP2013080433A (en) * 2011-10-05 2013-05-02 Nippon Telegr & Teleph Corp <Ntt> Gesture recognition device and program for the same
CN102902978A (en) * 2012-08-31 2013-01-30 电子科技大学 Object-oriented high-resolution remote-sensing image classification method
CN103810704A (en) * 2014-01-23 2014-05-21 西安电子科技大学 SAR (synthetic aperture radar) image change detection method based on support vector machine and discriminative random field
CN104751492A (en) * 2015-04-17 2015-07-01 中国科学院自动化研究所 Target area tracking method based on dynamic coupling condition random fields

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
PEI YIN et al.: "Tree-based Classifiers for Bilayer Video Segmentation", 2007 IEEE Conference on Computer Vision and Pattern Recognition *
SHIFENG CHEN et al.: "Learning Boundary and Appearance for Video Object Cutout", IEEE Signal Processing Letters *
DING Mingyue: "Internet of Things Identification Technology" (《物联网识别技术》), China Railway Publishing House, 31 July 2012 *
LI Lisha: "Research on Ontology-based Image Retrieval Technology" (基于本体的图像检索技术研究), China Master's Theses Full-text Database, Information Science and Technology *
WANG Hongqiao et al.: "Multi-kernel Methods for Pattern Analysis and Their Applications" (《模式分析的多核方法及其应用》), National Defense Industry Press, 31 March 2014 *
GUO Lei et al.: "MR Image Segmentation Based on Support Vector Machine and Conditional Random Field" (基于支持向量机和条件随机场的MR图像分割), Journal of Beijing Institute of Technology *
CHEN Houqun et al.: "Seismic Safety of High Arch Dams" (《高拱坝抗震安全》), China Electric Power Press, 31 January 2012 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112085025A (en) * 2019-06-14 2020-12-15 阿里巴巴集团控股有限公司 Object segmentation method, device and equipment
CN112085025B (en) * 2019-06-14 2024-01-16 阿里巴巴集团控股有限公司 Object segmentation method, device and equipment


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160629