CN113473040A - Video segmentation method and device - Google Patents

Video segmentation method and device

Info

Publication number
CN113473040A
CN113473040A (application CN202110729746.6A)
Authority
CN
China
Prior art keywords
reference frame
frame
video
optical flow
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110729746.6A
Other languages
Chinese (zh)
Inventor
李文国
王伊飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ziguang Zhanrui Communication Technology Co Ltd
Original Assignee
Beijing Ziguang Zhanrui Communication Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ziguang Zhanrui Communication Technology Co Ltd
Priority to CN202110729746.6A
Publication of CN113473040A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/2628 Alteration of picture size, shape, position or orientation, e.g. zooming, rotation, rolling, perspective, translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/12 Edge-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/215 Motion-based segmentation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/80 Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N25/00 Circuitry of solid-state image sensors [SSIS]; Control thereof
    • H04N25/40 Extracting pixel data from image sensors by controlling scanning circuits, e.g. by modifying the number of pixels sampled or to be sampled
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Television Systems (AREA)

Abstract

The present invention relates to the field of video segmentation technologies, and in particular, to a video segmentation method and device. The method comprises the following steps: dividing the video frames contained in a target video into a plurality of video frame groups, and determining a first video frame group from the plurality of video frame groups, wherein the first video frame group comprises a reference frame and a non-reference frame; determining optical flow motion vectors from the reference frame to the non-reference frame; constructing an interpolated feature frame with respect to the non-reference frame from the reference frame and the optical flow motion vectors; and performing video semantic segmentation on the interpolated feature frame to obtain a video segmentation result about the non-reference frame. The scheme of the embodiment of the invention can reduce the calculation cost in the video segmentation process while ensuring the video segmentation effect.

Description

Video segmentation method and device
Technical Field
The present invention relates to the field of video segmentation technologies, and in particular, to a video segmentation method and device.
Background
Image semantic segmentation means understanding the image content at the pixel level and assigning a semantic label, such as person, car, sofa, tree, or the like, to each pixel in the image. Compared with images, a video contains consecutive video frames, which adds a temporal dimension, and there is a large amount of redundant information between video frames. If the image semantic segmentation approach is directly used to segment a video sequence frame by frame, a large computational expense is inevitably incurred; moreover, this approach cannot exploit the correlation between video frames, which affects the segmentation effect to a certain extent. An improved video segmentation scheme has been proposed in the related art, which reduces redundant computation by using motion information between video frames. However, the granularity of the motion information between video frames is coarse, the detail information in the video frames is difficult to preserve well, and the errors of the inter-frame motion information accumulate as the number of video frames increases, so that the segmentation results become unusable after several frames. In summary, how to implement video segmentation while reducing the computational overhead and ensuring the segmentation effect has become a problem to be solved.
Disclosure of Invention
In view of this, embodiments of the present invention provide a video segmentation method and device, which can reduce the computation overhead in the video segmentation process and ensure the video segmentation effect.
In a first aspect, an embodiment of the present invention provides a video segmentation method, including: dividing the video frames contained in a target video into a plurality of video frame groups, and determining a first video frame group from the plurality of video frame groups, wherein the first video frame group comprises a reference frame and a non-reference frame; determining optical flow motion vectors from the reference frame to the non-reference frame; constructing an interpolated feature frame with respect to the non-reference frame from the reference frame and the optical flow motion vectors; and performing video semantic segmentation on the interpolated feature frame to obtain a video segmentation result about the non-reference frame.
Optionally, determining the optical flow motion vector from the reference frame to the non-reference frame includes: extracting the features of the reference frame to obtain first image features; extracting the features of the non-reference frame to obtain second image features; performing feature fusion on the first image feature and the second image feature along the direction of a data channel to obtain a fusion feature; determining an optical flow motion vector of the reference frame to the non-reference frame according to the fusion feature.
Optionally, the performing feature extraction on the reference frame to obtain a first image feature, and performing feature extraction on the non-reference frame to obtain a second image feature includes: inputting the reference frame and the non-reference frame into a twin network model, wherein the twin network model comprises a first network and a second network, the first network is used for extracting the features of the reference frame, and the second network is used for extracting the features of the non-reference frame.
Optionally, the network structures and the network parameters of the first network and the second network are the same; the first network and the second network both comprise a deep convolution unit and a point convolution unit; the deep convolution unit is used for configuring convolution kernels for each data channel of the input frame one by one, and each convolution kernel is used for performing convolution calculation on data in the corresponding data channel to obtain intermediate image characteristics; the point convolution unit comprises at least one unit convolution kernel, and the at least one unit convolution kernel is used for performing convolution calculation on the intermediate image features.
Optionally, determining an optical flow motion vector from the reference frame to the non-reference frame according to the fusion feature includes: inputting the fusion feature into an optical flow network model that determines an optical flow motion vector for each pixel in the reference frame to move to the non-reference frame based on the fusion feature.
Optionally, constructing an interpolated feature frame with respect to the non-reference frame according to the reference frame and the optical flow motion vector, including: determining the target point position of each pixel point in the interpolation feature frame to be constructed on the non-reference frame according to the reference frame and the optical flow motion vector; performing interpolation calculation on the pixel value of each pixel point in the non-reference frame by adopting an interpolation algorithm to obtain the pixel value of each target point position; and constructing an interpolation characteristic frame related to the non-reference frame according to the pixel values of the positions of the target points.
Optionally, determining, according to the reference frame and the optical flow motion vector, a target point position of each pixel point in the interpolation feature frame to be constructed, where the pixel point is mapped to the non-reference frame, includes: constructing a conversion matrix according to the optical flow motion vector, wherein the size of the conversion matrix is consistent with that of an interpolation feature frame to be constructed; and performing warp conversion on the coordinates of each pixel point in the reference frame based on the conversion matrix to obtain the position of the target point of each pixel point in the interpolation characteristic frame to be constructed, which is mapped on the non-reference frame.
Optionally, the first video frame group includes a plurality of frame pairs, each of the frame pairs includes the reference frame and a non-reference frame, and the non-reference frames included in each frame pair are different; in the method, video semantic segmentation is sequentially performed on the non-reference frames in each frame pair based on the image characteristics of the reference frames.
In a second aspect, an embodiment of the present invention provides a video segmentation apparatus, including: a determining module, configured to determine a first video frame group from a plurality of video frame groups when the video frames contained in a target video are divided into the plurality of video frame groups, wherein the first video frame group comprises a reference frame and a non-reference frame; an optical flow module, configured to determine optical flow motion vectors from the reference frame to the non-reference frame; an interpolation module, configured to construct an interpolated feature frame with respect to the non-reference frame based on the reference frame and the optical flow motion vectors; and a segmentation module, configured to perform video semantic segmentation on the interpolated feature frame to obtain a video segmentation result related to the non-reference frame.
In a third aspect, an embodiment of the present invention provides a terminal device, including: at least one processor; and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, the processor calling the program instructions to perform a method as described in the first aspect or any of the possible embodiments of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where the computer-readable storage medium includes a stored program, where the program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the method according to the first aspect or any possible implementation manner of the first aspect.
According to the scheme of the embodiment of the invention, the video frames contained in the target video are divided into a plurality of groups, and each group contains a reference frame and non-reference frames. The reference frame features can be multiplexed for each non-reference frame in a group, so that repeated calculation of feature frames is avoided and the calculation overhead is reduced. In addition, the scheme makes full use of the inter-frame correlation information and adopts an optical flow network to calculate the optical flow vector information between frames, thereby realizing the alignment of the reference frame to the non-reference frame, reducing the jitter of the segmentation edges between frames, and improving the video segmentation effect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a flowchart of a video segmentation method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another video segmentation method provided by the embodiment of the invention;
FIG. 3 is a schematic structural diagram of a twin network model provided by an embodiment of the present invention;
FIG. 4-a is a schematic structural diagram of a deep convolution unit according to an embodiment of the present invention;
FIG. 4-b is a schematic structural diagram of a point convolution unit according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of warp conversion provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of interpolation calculation of a target point position P according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a video segmentation apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to solve the problems of redundant computation and unstable segmentation effect in video segmentation technology, the embodiment of the invention provides a video segmentation scheme. This scheme takes into account that the video content of adjacent video frames changes little, and that the corresponding high-level features (such as high-frequency features like texture) change even less. Therefore, the scheme of the embodiment of the invention divides the video frames contained in the target video into a plurality of groups, each group containing a reference frame and non-reference frames. The reference frame features can be multiplexed for each non-reference frame in a group, so that repeated calculation of feature frames is avoided and the calculation overhead is reduced. In addition, the scheme makes full use of the inter-frame correlation information and adopts an optical flow network to calculate the optical flow vector information between frames, thereby realizing the alignment of the reference frame to the non-reference frame, reducing the jitter of the segmentation edges between frames, and improving the video segmentation effect. The video segmentation scheme of the embodiment of the present invention will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a video segmentation method according to an embodiment of the present invention. As shown in fig. 1, the processing of the method includes:
101, a first video frame group is determined from a plurality of video frame groups of a target video, the first video frame group including a reference frame and non-reference frames.
In the embodiment of the present invention, the video frames included in the target video are divided into a plurality of video frame groups, and the first video frame group is determined from the plurality of video frame groups. The first video frame group may be any one of the plurality of video frame groups, and video segmentation may be performed on each of the plurality of video frame groups with reference to the first video frame group. In the embodiment of the present invention, the video frames included in the target video may be divided into a plurality of video frame groups according to a time period T; after the target video is divided according to the time period T, each video frame group comprises a preset number of consecutive video frames. For example, each video frame group contains 4 consecutive video frames. Thereafter, the reference frame and the non-reference frames in each video frame group may be determined. For example, the first video frame in a video frame group may be determined to be the reference frame, and the other video frames may be the non-reference frames.
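By way of illustration only (this sketch is not part of the claimed method; the group length of 4 is just the example value above), the grouping of step 101 can be expressed in Python as follows:

# Illustrative sketch of step 101: split the frame sequence of the target video
# into groups of consecutive frames; the first frame of each group is taken as the
# reference frame and the remaining frames as non-reference frames.
def group_frames(frames, group_size=4):
    groups = []
    for start in range(0, len(frames), group_size):
        group = frames[start:start + group_size]
        reference, non_references = group[0], group[1:]
        groups.append((reference, non_references))
    return groups

# Example with 10 frame indices: prints 0 [1, 2, 3], then 4 [5, 6, 7], then 8 [9]
for ref, others in group_frames(list(range(10))):
    print(ref, others)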
102, the optical flow motion vectors from the reference frame to a non-reference frame in the first video frame group are determined. Optionally, the reference frame and the non-reference frame may be input into an optical flow network model, and the optical flow motion vector for moving each pixel in the reference frame to the non-reference frame is determined through the optical flow network model.
In some embodiments, calculating the optical flow motion vectors from the reference frame to the non-reference frame may include: performing feature extraction on the reference frame to obtain a first image feature, and performing feature extraction on the non-reference frame to obtain a second image feature. Optionally, in the embodiment of the present invention, the reference frame and the non-reference frame may be input into a twin network model. The twin network model comprises a first network and a second network, wherein the first network is used for performing feature extraction on the reference frame to obtain the first image feature, and the second network is used for performing feature extraction on the non-reference frame to obtain the second image feature. Then, feature fusion is performed on the first image feature and the second image feature along the data channel direction to obtain a fusion feature.
The fused features are input to an optical flow network model that determines optical flow motion vectors for each pixel in the reference frame to move to a non-reference frame based on the fused features. Optionally, when the first group of video frames includes a plurality of non-reference frames, each non-reference frame calculates an optical flow motion vector based on image features of the reference frame. It should be noted that after determining the first image feature of the reference frame, each non-reference frame in the first video frame group performs optical flow motion vector calculation based on the same reference frame, and the step of extracting the reference frame feature does not need to be repeatedly performed, thereby reducing the calculation overhead to some extent.
103, an interpolated feature frame with respect to the non-reference frame is constructed according to the reference frame and the optical flow motion vectors. Optionally, constructing the interpolated feature frame of the non-reference frame may include: determining, according to the reference frame and the optical flow motion vectors, the target point position onto which each pixel point in the interpolated feature frame to be constructed is mapped on the non-reference frame; then performing interpolation calculation on the pixel values of the pixel points in the non-reference frame by using an interpolation algorithm to obtain the pixel value of each target point position; and constructing the interpolated feature frame with respect to the non-reference frame from the pixel values of the respective target point positions. In the embodiment of the invention, the alignment of the reference frame to the non-reference frame can be realized based on the optical flow motion vectors, and an interpolated feature frame with a higher resolution can then be output using the non-reference frame with a lower resolution.
104, video semantic segmentation is performed on the interpolated feature frame to obtain a video segmentation result about the non-reference frame. In the embodiment of the present invention, when the first video frame group includes a plurality of non-reference frames, video segmentation is performed on each non-reference frame with reference to steps 102-104. When the target video includes a plurality of video frame groups, video segmentation is performed on each video frame group with reference to the processing of the first video frame group.
Fig. 2 is a flowchart of another video segmentation method according to an embodiment of the present invention. Based on the model shown in fig. 2, the video segmentation method of the embodiment of the present invention includes:
(1) The target video is divided into a plurality of video frame groups according to the time period T, each video frame group including m+1 consecutive video frames. The video frames in each group are denoted f_i, f_{i+1}, f_{i+2}, …, f_{i+m}. Optionally, the first video frame in each group may be determined to be the reference frame, and the other m video frames in the group may be determined to be non-reference frames. Optionally, the m+1 video frames in each group may be stored as frame pairs, each frame pair including the reference frame and one non-reference frame, with the non-reference frames contained in the respective frame pairs being different. For example, <i, i+k> represents a frame pair, where i represents the i-th video frame, i+k represents the (i+k)-th video frame, and k takes the values 1, 2, 3, …, m.
(2) The reference frame f_i and a non-reference frame f_{i+k} in one of the video frame groups (referred to as the first video frame group) are input into a twin network model. As shown in fig. 2, the twin network model includes a first network and a second network. The reference frame f_i is input into the first network, which performs feature extraction on the reference frame f_i to obtain a first image feature F_i. The non-reference frame f_{i+k} is input into the second network, which performs feature extraction on the non-reference frame f_{i+k} to obtain a second image feature F_{i+k}.
Fig. 3 is a schematic structural diagram of a twin network model according to an embodiment of the present invention. As shown in fig. 3, the twin network model includes a first network and a second network. The first network is used for extracting the features of the reference frames to obtain image features F0, and the second network is used for extracting the features of the non-reference frames to obtain the image features of the non-reference frames. For example, frame 1, frame 2, and frame 3 are feature extracted based on the second network, resulting in deep feature maps F1, F2, and F3, respectively.
In the embodiment of the invention, because of the similarity of features between consecutive video frames, a twin network model can be adopted for feature extraction. In the twin network model shown in fig. 3, the network structures of the first network and the second network are the same, and the first network and the second network may use the same network parameters; that is, the first network and the second network may share weight values to reduce redundant parameters in the network model. Further, based on the twin network model shown in fig. 3, when the image features of the video frames in a video frame group are extracted, the image feature of the reference frame in that group can be multiplexed, so that repeated calculation of the reference frame feature is avoided and computation overhead is saved.
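As a minimal sketch of this weight-sharing idea (assuming PyTorch; the small backbone below is a generic stand-in rather than the MobileNet structure described next), both branches of the twin network can simply be the same module applied to different frames, so the reference-frame feature F0 is computed once per group and reused:

import torch
import torch.nn as nn

# Minimal twin-network sketch: one shared backbone plays the role of both the first
# network and the second network, so their weights are identical by construction.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU6(inplace=True),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU6(inplace=True),
)

reference = torch.randn(1, 3, 224, 224)                            # reference frame f_i
non_references = [torch.randn(1, 3, 224, 224) for _ in range(3)]   # f_{i+1} ... f_{i+3}

with torch.no_grad():
    F0 = backbone(reference)                                       # computed once per group
    feats = [backbone(f) for f in non_references]                  # F1, F2, F3 reuse the same weights
print(F0.shape, feats[0].shape)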
Further, both the first network and the second network in the twin network model may adopt a MobileNet structure, where the MobileNet is composed of inverted residual blocks (Inverted Residual Blocks), and each inverted residual block is decomposed into a depthwise convolution unit (the deep convolution unit described above) and a point convolution unit. When image features are extracted based on the MobileNet structure, the depthwise convolution unit assigns a convolution kernel to each data channel of the input frame, one kernel per channel. Each convolution kernel performs a convolution calculation on the data in its corresponding data channel to obtain an intermediate image feature. The intermediate image feature is input to the point convolution unit. The point convolution unit includes at least one unit convolution kernel, and the at least one unit convolution kernel performs a convolution calculation on the intermediate image feature.
As shown in fig. 4-a, in the depthwise convolution unit, the size of the input feature map is 5x5x3, where 5x5x3 indicates a height of 5, a width of 5, and 3 data channels. A 3x3 convolution kernel is provided for each data channel. The three convolution kernels perform convolution calculations on the data in the 3 channels of the input feature map, yielding an output feature map of 5x5x3. This 5x5x3 output feature map is then input to the point convolution unit as its input feature map. As shown in fig. 4-b, 4 unit convolution kernels (1x1 kernels spanning the 3 data channels) are arranged in the point convolution unit, and the 4 kernels respectively perform convolution calculations on the input feature map to obtain an output feature map of 5x5x4. Compared with a conventional convolution model, the twin network model adopts a model structure of depthwise convolution units and point convolution units, and both the number of parameters and the amount of computation are significantly reduced.
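Expressed in PyTorch as one possible realization of the 5x5x3 example above (the exact kernel shapes in the embodiment may differ), the depthwise and point convolution pair looks as follows; for this example the parameter count is (3x3)x3 + (1x1x3)x4 = 27 + 12 = 39 weights, versus 3x3x3x4 = 108 weights for a standard 3x3 convolution producing 4 output channels.

import torch
import torch.nn as nn

# Sketch of the depthwise + point convolution pair from the 5x5x3 example above.
x = torch.randn(1, 3, 5, 5)                                   # input feature map: 3 channels, 5x5

depthwise = nn.Conv2d(3, 3, kernel_size=3, padding=1, groups=3, bias=False)  # one 3x3 kernel per channel
pointwise = nn.Conv2d(3, 4, kernel_size=1, bias=False)        # 4 unit (1x1) convolution kernels

mid = depthwise(x)      # intermediate image feature, shape (1, 3, 5, 5)
out = pointwise(mid)    # output feature map, shape (1, 4, 5, 5)
print(mid.shape, out.shape)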
(3) As shown in fig. 2, the first image feature F_i and the second image feature F_{i+k} are input into the optical flow network model. The optical flow network model first performs feature fusion on the first image feature F_i and the second image feature F_{i+k} to obtain a fusion feature F_c. In the embodiment of the invention, a Concat fusion mode is adopted to fuse the first image feature F_i and the second image feature F_{i+k}: the first image feature F_i and the second image feature F_{i+k} are directly concatenated along the data channel direction to obtain the fusion feature F_c, and no information is lost in this process. Then, the optical flow network model calculates the optical flow motion vectors from the reference frame f_i to the non-reference frame f_{i+k} based on the fusion feature F_c.
The optical flow motion vectors are obtained by using the temporal changes of pixels in the video sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, so as to calculate the motion information of objects between adjacent frames. In the embodiment of the invention, the optical flow motion vector for moving each pixel in the reference frame f_i to the non-reference frame f_{i+k} is calculated based on the optical flow technique.
Optionally, the optical flow network model may adopt a FlowNet model. The FlowNet model adopts an encoder-decoder network structure, wherein the encoding module is composed of nine convolution modules, the decoding module is composed of 4 transposed convolution layers, and the input of each transposed convolution layer in the decoding module comes from the output of the previous layer. Optionally, the number of output channels of the FlowNet model may be set to 2, so that the model finally outputs the optical flow motion vectors between the reference frame f_i and the non-reference frame f_{i+k}.
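A compressed sketch of such an optical flow head is given below (assuming PyTorch). Only the nine-convolution encoder, the four transposed-convolution decoder and the 2-channel output follow the description above; the layer widths, strides and activations are assumed values, and the channel-wise concatenation of F_i and F_{i+k} stands in for the Concat fusion of step (3).

import torch
import torch.nn as nn

# Illustrative FlowNet-style head: a nine-layer convolutional encoder followed by a
# four-layer transposed-convolution decoder with 2 output channels (one per flow component).
class FlowHead(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        widths = [in_channels, 64, 64, 128, 128, 128, 256, 256, 256, 256]  # assumed widths
        enc = []
        for k in range(9):                                    # nine convolution modules
            stride = 2 if k in (0, 2, 4, 6) else 1            # downsampling pattern is an assumption
            enc += [nn.Conv2d(widths[k], widths[k + 1], 3, stride=stride, padding=1),
                    nn.ReLU6(inplace=True)]
        self.encoder = nn.Sequential(*enc)
        dec_widths = [256, 128, 64, 32, 2]
        dec = []
        for k in range(4):                                    # four transposed convolutions
            dec += [nn.ConvTranspose2d(dec_widths[k], dec_widths[k + 1], 4, stride=2, padding=1)]
            if k < 3:
                dec += [nn.ReLU6(inplace=True)]
        self.decoder = nn.Sequential(*dec)

    def forward(self, fused):                                 # fused feature F_c
        return self.decoder(self.encoder(fused))

F_i = torch.randn(1, 32, 64, 64)                              # first image feature
F_ik = torch.randn(1, 32, 64, 64)                             # second image feature
F_c = torch.cat([F_i, F_ik], dim=1)                           # Concat fusion along the channel direction
flow = FlowHead(in_channels=64)(F_c)                          # 2-channel optical flow motion vectors
print(flow.shape)                                             # torch.Size([1, 2, 64, 64])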
(4) The first image feature F_i of the reference frame is warped according to the optical flow motion vectors from the reference frame to the non-reference frame f_{i+k}, so as to obtain the interpolated feature frame of the non-reference frame f_{i+k}. The specific process of the warp is as follows: the target point position onto which each pixel point in the interpolated feature frame to be constructed is mapped on the non-reference frame is determined according to the reference frame and the optical flow motion vectors; an interpolation algorithm is adopted to perform interpolation calculation on the pixel values of the pixel points in the non-reference frame to obtain the pixel value of each target point position; and the interpolated feature frame with respect to the non-reference frame is constructed according to the pixel values of the target point positions.
Determining, according to the reference frame and the optical flow motion vectors, the target point position onto which each pixel point in the interpolated feature frame to be constructed is mapped on the non-reference frame includes: constructing a conversion matrix according to the optical flow motion vectors, the size of the conversion matrix being consistent with the size of the interpolated feature frame to be constructed; and then performing warp conversion on the coordinates of each pixel point in the reference frame based on the conversion matrix, to obtain the target point position onto which each pixel point in the interpolated feature frame to be constructed is mapped on the non-reference frame.
In the embodiment of the invention, a grid sample algorithm is adopted in the warp process: an input of size (N, H_in, W_in, C) is transformed by a grid conversion matrix of size (N, H_out, W_out, 2) to obtain an output of size (N, H_out, W_out, C). As shown in fig. 5, the grid conversion matrix is determined from the optical flow motion vectors, and the grid conversion matrix is consistent with the size of the interpolated feature frame to be constructed. Then, the coordinates of each pixel point in the reference frame are taken as input and converted based on the grid conversion matrix, to obtain the target point position onto which each pixel point in the interpolated feature frame is mapped in the non-reference frame.
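A sketch of this warp in PyTorch is shown below (one possible realization; torch.nn.functional.grid_sample performs the grid-based sampling, and the conversion of a pixel-unit flow field into the normalized [-1, 1] sampling grid shown here is an assumption about the exact convention used in the embodiment):

import torch
import torch.nn.functional as F

def warp_with_flow(feature, flow):
    # feature: reference-frame feature of shape (N, C, H, W)
    # flow:    optical flow motion vectors of shape (N, 2, H, W), in pixel units (assumed)
    _, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # coordinates of each pixel point
    coords = base + flow                                       # target point positions
    # Normalize to [-1, 1], the range expected by grid_sample.
    grid_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)               # grid conversion matrix, (N, H, W, 2)
    # Bilinear sampling of the feature at the target point positions.
    return F.grid_sample(feature, grid, mode="bilinear", align_corners=True)

F_i = torch.randn(1, 32, 64, 64)       # reference-frame feature
flow = torch.randn(1, 2, 64, 64)       # optical flow motion vectors from the optical flow network
interpolated_frame = warp_with_flow(F_i, flow)
print(interpolated_frame.shape)        # torch.Size([1, 32, 64, 64])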
In some embodiments, the target point position (ix _ new, iy _ new) of each pixel point in the interpolated feature frame mapped in the non-reference frame may be obtained according to the following formula, where:
ix_new = (ix + 1) × (W_in - 1) / 2
iy_new = (iy + 1) × (H_in - 1) / 2

In this formula, W_in and H_in are the width and height of the reference frame, and (ix, iy) is the optical flow motion vector corresponding to a certain pixel point a in the reference frame. Optionally, the optical flow motion vector of each pixel in the grid conversion matrix is normalized to [-1, 1]. Further, the pixel point a corresponds to a pixel point P in the interpolated feature frame, and the pixel point P is mapped to the position in the non-reference frame whose coordinates are (ix_new, iy_new) as described above. After the target point position onto which each pixel point in the interpolated feature frame is mapped on the non-reference frame has been determined, the pixel value at that target point position can be calculated by using an interpolation algorithm.
As shown in fig. 6, for the target point position P in the non-reference frame, a bilinear interpolation algorithm may be used to obtain the pixel value at that position. First, single linear interpolation is performed on the two sides of the point P in the x direction to determine temporary points R1(x, y1) and R2(x, y2) in the non-reference frame, and then single linear interpolation is performed once in the y direction to obtain the pixel value of P(x, y).
The single linear interpolation formulas in the x direction are:

f(R1) = ((x2 - x) / (x2 - x1)) · f(Q11) + ((x - x1) / (x2 - x1)) · f(Q21)
f(R2) = ((x2 - x) / (x2 - x1)) · f(Q12) + ((x - x1) / (x2 - x1)) · f(Q22)

The single linear interpolation formula in the y direction is:

f(P) = ((y2 - y) / (y2 - y1)) · f(R1) + ((y - y1) / (y2 - y1)) · f(R2)

Substituting the interpolation results in the x direction into the interpolation formula in the y direction gives:

f(x, y) = [ (x2 - x)(y2 - y) · f(Q11) + (x - x1)(y2 - y) · f(Q21) + (x2 - x)(y - y1) · f(Q12) + (x - x1)(y - y1) · f(Q22) ] / ((x2 - x1)(y2 - y1))

Since Q11, Q21, Q12 and Q22 are four adjacent positions, y2 - y1 = 1 and x2 - x1 = 1, and the simplified formula is obtained:

f(x, y) = (x2 - x)(y2 - y) · f(Q11) + (x - x1)(y2 - y) · f(Q21) + (x2 - x)(y - y1) · f(Q12) + (x - x1)(y - y1) · f(Q22).
based on the mode, the pixel value of each pixel point in the interpolation characteristic frame can be obtained through the interpolation calculation of the pixel value in the non-reference frame. Therefore, the high-resolution interpolation characteristic frame can be constructed based on the low-resolution non-reference frame.
(5) As shown in fig. 2, the interpolated feature frame of the non-reference frame f_{i+k} is input into a video semantic segmentation model to obtain the video segmentation result of the non-reference frame f_{i+k}.
Optionally, the video semantic segmentation model may include 2 consecutive 3x3 convolutions followed by a softmax layer. Video segmentation of the interpolated feature frame can be realized based on this video semantic segmentation model. It should be noted that the convolution units and depthwise convolution units in the embodiment of the present invention all use bounded activation functions, such as ReLU6, hard_sigmoid, or hard_swish. Moreover, each convolution unit and depthwise convolution unit is provided with a batch normalization (BatchNorm) structure. This design can effectively limit the data range of the network feature maps, which facilitates subsequent high-precision quantization.
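Sketched in PyTorch (the channel width and the number of classes are assumed values; only the two consecutive 3x3 convolutions, the batch normalization, the bounded activation and the final softmax follow the description above):

import torch
import torch.nn as nn

# Sketch of the video semantic segmentation head applied to the interpolated feature frame.
num_classes = 21                                   # assumed number of semantic classes
head = nn.Sequential(
    nn.Conv2d(32, 32, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.ReLU6(inplace=True),                        # bounded activation
    nn.Conv2d(32, num_classes, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(num_classes),
    nn.Softmax(dim=1),                             # per-pixel class probabilities
)

interpolated_frame = torch.randn(1, 32, 64, 64)    # interpolated feature frame
probs = head(interpolated_frame)
labels = probs.argmax(dim=1)                       # video segmentation result for the non-reference frame
print(probs.shape, labels.shape)                   # (1, 21, 64, 64) and (1, 64, 64)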
(6) Steps (2) to (5) are repeated to obtain the video segmentation result of each non-reference frame in the current video frame group.
(7) Steps (2) to (6) are repeated to obtain the video segmentation results of all the video frame groups of the target video.
In the scheme of the embodiment of the invention, a twin network model is adopted to extract video frame features; the twin network model can share weight parameters and extract deep image features in parallel, thereby saving network parameters and reducing the amount of computation.
Further, in the embodiment of the present invention, in each video frame group, the reference frame feature may be multiplexed. And the reference frame features and the current frame can be fused to calculate the inter-frame optical flow, so that the alignment of the reference frame to the current frame can be realized, and the jitter of the segmentation edge is reduced. The scheme of the embodiment of the invention fully utilizes the inter-frame related information, reduces the information redundancy between adjacent video frames, further saves the inference time, avoids unnecessary calculation and obviously improves the accuracy of video segmentation.
The embodiment of the invention also provides a video segmentation device corresponding to the video segmentation method. Those skilled in the art will appreciate that these video segmentation means may be constructed using commercially available hardware components configured by the steps taught by the present scheme.
Fig. 7 is a schematic structural diagram of a video segmentation apparatus according to an embodiment of the present invention. As shown in fig. 7, the video segmentation apparatus includes: a determining module 201, configured to determine a first video frame group from a plurality of video frame groups when a video frame included in a target video is divided into the plurality of video frame groups, where the first video frame group includes a reference frame and a non-reference frame; an optical flow module 202 for determining optical flow motion vectors from the reference frame to the non-reference frame; an interpolation module 203, configured to construct an interpolated feature frame with respect to the non-reference frame according to the reference frame and the optical flow motion vector; and a segmentation module 204, configured to perform video semantic segmentation on the interpolated feature frame to obtain a video segmentation result about the non-reference frame.
The video segmentation apparatus according to the embodiment of the present invention may perform the methods according to the embodiments shown in fig. 1 to 6. For parts of the present embodiment not described in detail, reference may be made to the related description of the embodiments shown in fig. 1 to 6. The implementation process and technical effect of the technical solution refer to the descriptions in the embodiments shown in fig. 1 to fig. 6, and are not described herein again.
It should be understood that the division of the respective modules of the video segmentation apparatus shown in fig. 7 is only a division of logical functions, and alternatively, other functional modules may be added or several functional modules may be integrated into one module on the basis of the embodiment shown in fig. 7. In addition, all or part of the modules may be integrated into one physical entity or may be physically separated in actual implementation. And these modules can be realized in the form of software called by processing element; or may be implemented entirely in hardware; and part of the modules can be realized in the form of calling by the processing element in software, and part of the modules can be realized in the form of hardware. For example, the determining module 201 may be a separate processing element, or may be integrated into a chip of the electronic device. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as: one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), one or more Field Programmable Gate Arrays (FPGAs), etc. For another example, these modules may be integrated together and implemented in the form of a System-On-a-Chip (SOC).
Fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 8, the terminal device is embodied in the form of a general purpose computing device. The components of the terminal device may include, but are not limited to: one or more processors 310, a communication interface 320, a memory 330, and a communication bus 340 that couples various system components including the memory 330, the communication interface 320, and the processing unit 310.
Communication bus 340 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus, to name a few.
Electronic devices typically include a variety of computer system readable media. Such media may be any available media that is accessible by the electronic device and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 330 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) and/or cache Memory. The electronic device may further include other removable/non-removable, volatile/nonvolatile computer system storage media. Memory 330 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of the embodiments of this description and illustrated in fig. 1-6.
A program/utility having a set (at least one) of program modules, including but not limited to an operating system, one or more application programs, other program modules, and program data, may be stored in memory 330, each of which examples or some combination may include an implementation of a network environment. The program modules generally perform the functions and/or methodologies of the embodiments described herein.
The processor 310 executes various functional applications and data processing by executing programs stored in the memory 330, for example, implementing the video segmentation method provided by the embodiments shown in fig. 1 to 6 in this specification.
In specific implementation, the present application further provides a computer storage medium, where the computer storage medium may store a program, and the program may include some or all of the steps in the embodiments provided in the present application when executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).
In specific implementation, an embodiment of the present invention further provides a computer program product, where the computer program product includes executable instructions, and when the executable instructions are executed on a computer, the computer is caused to perform some or all of the steps in the above method embodiments.
In the embodiments of the present invention, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of the associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean that A exists alone, that A and B exist simultaneously, or that B exists alone, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one of the following" and similar expressions refer to any combination of these items, including any combination of singular or plural items. For example, at least one of a, b, and c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.
Those of ordinary skill in the art will appreciate that the various elements and algorithm steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or combinations of electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided by the present invention, any function, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only an embodiment of the present invention, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. A method for video segmentation, comprising:
dividing the video frames contained in a target video into a plurality of video frame groups, and determining a first video frame group from the plurality of video frame groups, wherein the first video frame group comprises a reference frame and a non-reference frame;
determining optical flow motion vectors of the reference frame to the non-reference frame;
constructing an interpolated feature frame with respect to the non-reference frame from the reference frame and the optical flow motion vectors;
and performing video semantic segmentation on the interpolated feature frame to obtain a video segmentation result about the non-reference frame.
2. The method of claim 1, wherein determining optical flow motion vectors for the reference frame to the non-reference frame comprises:
extracting the features of the reference frame to obtain first image features;
extracting the features of the non-reference frame to obtain second image features;
performing feature fusion on the first image feature and the second image feature along the direction of a data channel to obtain a fusion feature;
determining an optical flow motion vector of the reference frame to the non-reference frame according to the fusion feature.
3. The method of claim 2, wherein extracting features of the reference frame to obtain first image features and extracting features of the non-reference frame to obtain second image features comprises:
inputting the reference frame and the non-reference frame into a twin network model, wherein the twin network model comprises a first network and a second network, the first network is used for extracting the features of the reference frame, and the second network is used for extracting the features of the non-reference frame.
4. The method of claim 3, wherein the network structure and network parameters of the first network and the second network are the same; the first network and the second network both comprise a deep convolution unit and a point convolution unit;
the deep convolution unit is used for configuring convolution kernels for each data channel of the input frame one by one, and each convolution kernel is used for performing convolution calculation on data in the corresponding data channel to obtain intermediate image characteristics;
the point convolution unit comprises at least one unit convolution kernel, and the at least one unit convolution kernel is used for performing convolution calculation on the intermediate image features.
5. The method of claim 2, wherein determining optical flow motion vectors from the reference frame to the non-reference frame based on the fused feature comprises:
inputting the fusion feature into an optical flow network model that determines an optical flow motion vector for each pixel in the reference frame to move to the non-reference frame based on the fusion feature.
6. The method of claim 1, wherein constructing an interpolated feature frame for the non-reference frame from the reference frame and the optical flow motion vectors comprises:
determining the target point position of each pixel point in the interpolation feature frame to be constructed on the non-reference frame according to the reference frame and the optical flow motion vector;
performing interpolation calculation on the pixel value of each pixel point in the non-reference frame by adopting an interpolation algorithm to obtain the pixel value of each target point position;
and constructing an interpolation characteristic frame related to the non-reference frame according to the pixel values of the positions of the target points.
7. The method of claim 6, wherein determining the position of the target point mapped by each pixel point in the interpolated feature frame to be constructed on the non-reference frame according to the reference frame and the optical flow motion vector comprises:
constructing a conversion matrix according to the optical flow motion vector, wherein the size of the conversion matrix is consistent with that of an interpolation feature frame to be constructed;
and performing warp conversion on the coordinates of each pixel point in the reference frame based on the conversion matrix to obtain the position of the target point of each pixel point in the interpolation characteristic frame to be constructed, which is mapped on the non-reference frame.
8. The method of claim 1, wherein the first group of video frames comprises a plurality of frame pairs, each of the frame pairs comprising the reference frame and a non-reference frame, each of the frame pairs comprising a different one of the non-reference frames;
in the method, video semantic segmentation is sequentially performed on the non-reference frames in each frame pair based on the image characteristics of the reference frames.
9. A video segmentation apparatus, comprising:
a determining module, configured to determine a first video frame group from a plurality of video frame groups when the video frames contained in a target video are divided into the plurality of video frame groups, wherein the first video frame group comprises a reference frame and a non-reference frame;
an optical flow module to determine optical flow motion vectors from the reference frame to the non-reference frame;
an interpolation module for constructing an interpolated feature frame with respect to the non-reference frame based on the reference frame and the optical flow motion vector;
and the segmentation module is used for performing video semantic segmentation on the interpolated feature frame to obtain a video segmentation result related to the non-reference frame.
10. A terminal device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor calling the program instructions to perform the method of any of claims 1 to 8.
11. A computer-readable storage medium, comprising a stored program, wherein the program, when executed, controls an apparatus on which the computer-readable storage medium resides to perform the method of any one of claims 1 to 8.
CN202110729746.6A 2021-06-29 2021-06-29 Video segmentation method and device Pending CN113473040A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110729746.6A CN113473040A (en) 2021-06-29 2021-06-29 Video segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110729746.6A CN113473040A (en) 2021-06-29 2021-06-29 Video segmentation method and device

Publications (1)

Publication Number Publication Date
CN113473040A true CN113473040A (en) 2021-10-01

Family

ID=77873962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110729746.6A Pending CN113473040A (en) 2021-06-29 2021-06-29 Video segmentation method and device

Country Status (1)

Country Link
CN (1) CN113473040A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108242062A (en) * 2017-12-27 2018-07-03 北京纵目安驰智能科技有限公司 Method for tracking target, system, terminal and medium based on depth characteristic stream
CN111277815A (en) * 2018-12-04 2020-06-12 阿里巴巴集团控股有限公司 Method and device for evaluating quality of inserted frame
CN110147763A (en) * 2019-05-20 2019-08-20 哈尔滨工业大学 Video semanteme dividing method based on convolutional neural networks
CN112584196A (en) * 2019-09-30 2021-03-30 北京金山云网络技术有限公司 Video frame insertion method and device and server
CN112702607A (en) * 2020-12-25 2021-04-23 深圳大学 Intelligent video compression method and device based on optical flow decision

Similar Documents

Publication Publication Date Title
CN110782490B (en) Video depth map estimation method and device with space-time consistency
KR102061923B1 (en) Method and apparatus for performing hierarchical super-resolution of an input image
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN113542651B (en) Model training method, video frame inserting method and corresponding devices
CN111260586A (en) Method and device for correcting distorted document image
CN110689509A (en) Video super-resolution reconstruction method based on cyclic multi-column 3D convolutional network
CN111899295A (en) Monocular scene depth prediction method based on deep learning
Li et al. A simple baseline for video restoration with grouped spatial-temporal shift
CN115578255A (en) Super-resolution reconstruction method based on inter-frame sub-pixel block matching
CN113610912B (en) System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction
CN113096032A (en) Non-uniform blur removing method based on image area division
WO2023185284A1 (en) Video processing method and apparatuses
CN111583345A (en) Method, device and equipment for acquiring camera parameters and storage medium
CN116205953A (en) Optical flow estimation method and device based on hierarchical total-correlation cost body aggregation
CN113473040A (en) Video segmentation method and device
TWI472231B (en) Video pre-processing method and apparatus for motion estimation
CN113344807A (en) Image restoration method and device, electronic equipment and storage medium
Kang et al. Lightweight Image Matting via Efficient Non-Local Guidance
CN116012230B (en) Space-time video super-resolution method, device, equipment and storage medium
CN117058668B (en) Three-dimensional model face reduction evaluation method and device
US20240135495A1 (en) Optimized image processing filter
CN113099231B (en) Method and device for determining sub-pixel interpolation position, electronic equipment and storage medium
US20220383573A1 (en) Frame interpolation for rendered content
CN113793269B (en) Super-resolution image reconstruction method based on improved neighborhood embedding and priori learning
CN110324620B (en) Intra-frame prediction method and device, electronic equipment and machine-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211001