CN114973079A - Video target segmentation method, system, terminal and storage medium - Google Patents

Info

Publication number: CN114973079A
Authority: CN (China)
Prior art keywords: past, segmented, video, video frame, space
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN202210534657.0A
Other languages: Chinese (zh)
Inventors: 吕香伟, 谢中流, 严考碧, 钱瑞和, 钟凯宇
Current Assignee: Migu Cultural Technology Co Ltd; China Mobile Communications Group Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Migu Cultural Technology Co Ltd; China Mobile Communications Group Co Ltd
Priority date: 2022-05-17 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2022-05-17
Publication date: 2022-08-30
Application filed by: Migu Cultural Technology Co Ltd; China Mobile Communications Group Co Ltd
Priority to: CN202210534657.0A
Publication of: CN114973079A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V20/49: Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video target segmentation method, system, terminal and storage medium. The method comprises the following steps: determining past frame features associated with a video frame to be segmented according to past segmentation results; determining space-time global features associated with the video frame to be segmented according to the past segmentation results and un-segmented video frames whose time sequence is after the video frame to be segmented; and generating a segmentation result corresponding to the video frame to be segmented based on the space-time global features and the past frame features. The technical scheme of the application improves the video segmentation effect.

Description

Video target segmentation method, system, terminal and storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a method, a system, a terminal, and a storage medium for video object segmentation.
Background
Video object segmentation, and in particular video portrait segmentation, has particularly broad applications in the industry. At present, the general algorithm flow is to maintain a memory block that stores the segmentation results of previous frames; when segmenting the t-th video frame, the segmentation results of the previous t frames stored in the memory block are referenced to obtain a temporally stable segmentation result. However, if shot splicing, camera position switching or similar phenomena occur at the t-th frame, the picture content of the t-th frame changes abruptly compared with the preceding video sequence, the appearance and position of the object to be segmented also change dramatically, and no information related to the object in the t-th frame can be found in the memory block, resulting in a poor segmentation effect.
Disclosure of Invention
The embodiment of the application aims to solve the problem of poor video segmentation effect by providing a video target segmentation method, a video target segmentation system, a video target segmentation terminal and a storage medium.
The embodiment of the application provides a video target segmentation method, which comprises the following steps:
determining the past frame characteristics related to the video frame to be segmented according to the past segmentation result;
determining the space-time global characteristics associated with the video frame to be segmented according to the past segmentation result and an unsegmented video frame with a time sequence behind the video frame to be segmented;
and generating a segmentation result corresponding to the video frame to be segmented based on the space-time global feature and the past frame feature.
In an embodiment, the step of determining the spatio-temporal global features associated with the video frame to be segmented according to the past segmentation result and an un-segmented video frame whose time sequence is after the video frame to be segmented comprises:
inputting the past segmentation result and the non-segmented video frame into a space-time global convolutional network, and acquiring an output result of the space-time global convolutional network;
and taking the output result as the input of a space domain convolution network, and obtaining the space-time global characteristics through the space domain convolution network.
In one embodiment, the space-time global convolution network includes a two-dimensional space-domain convolution kernel and a one-dimensional time-domain convolution kernel, and the two-dimensional space-domain convolution kernel and the one-dimensional time-domain convolution kernel are connected in series; the step of inputting the past segmentation result and the non-segmented video frame into a space-time global convolutional network and obtaining an output result of the space-time global convolutional network comprises the following steps:
sequentially passing the past segmentation result and the unsegmented video frame through the two-dimensional space domain convolution kernel and the one-dimensional time domain convolution kernel to obtain a video frame after space-time global convolution;
and adding the past segmentation result, the unsegmented video frame and the video frame after the space-time global convolution to obtain an output result of the space-time global convolution network.
In one embodiment, the step of inputting the past segmentation result and the non-segmented video frame into a spatiotemporal global convolutional network and obtaining an output result of the spatiotemporal global convolutional network comprises:
changing the pixel value of a background area in the past segmentation result;
inputting the past segmentation result with the changed pixel values and the unsegmented video frame into a space-time global convolutional network, and obtaining an output result of the space-time global convolutional network.
In an embodiment, the step of generating a segmentation result corresponding to the video frame to be segmented based on the spatio-temporal global feature and the past frame feature includes:
determining the down-sampling characteristics of the video frame to be segmented;
determining a splicing characteristic of the spatio-temporal global characteristic and the past frame characteristic;
inputting the down-sampling feature and the splicing feature into a decoding network, and acquiring an output result of the decoding network;
and obtaining a segmentation result corresponding to the video frame to be segmented according to the output result of the decoding network.
In an embodiment, the step of determining the past frame characteristics associated with the video frame to be segmented according to the past segmentation result includes:
determining the index characteristics of the video frame to be segmented;
determining a target past segmentation result matched with the index features;
determining the past frame characteristics based on the target past segmentation result.
In an embodiment, after the step of generating the segmentation result corresponding to the video frame to be segmented based on the spatio-temporal global feature and the past frame feature, the method further includes:
and updating the past segmentation result by adopting the segmentation result corresponding to the video frame to be segmented.
In addition, to achieve the above object, the present invention further provides a video object segmentation system, including:
the past frame characteristic determining module is used for determining the past frame characteristics related to the video frame to be segmented according to the past segmentation result;
the space-time global feature determining module is used for determining space-time global features related to the video frames to be segmented according to the past segmentation result and the non-segmented video frames with the time sequence behind the video frames to be segmented;
and the segmentation result determining module is used for generating a segmentation result corresponding to the video frame to be segmented based on the space-time global feature and the past frame feature.
In addition, to achieve the above object, the present invention also provides an intelligent terminal, including: the system comprises a memory, a processor and a video target segmentation program stored on the memory and capable of running on the processor, wherein the video target segmentation program realizes the steps of the video target segmentation method when being executed by the processor.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon an object segmentation program of a video, which when executed by a processor, implements the steps of the above object segmentation method of a video.
According to the technical scheme of the video target segmentation method, system, terminal and storage medium, the past frame features associated with the video frame to be segmented are determined according to the past segmentation results, the space-time global features associated with the video frame to be segmented are determined according to the past segmentation results and the un-segmented video frames whose time sequence is after the video frame to be segmented, and the segmentation result corresponding to the video frame to be segmented is generated based on the space-time global features and the past frame features. In this way, both the information of past frames and the information of un-segmented later frames are taken into account, so that the segmentation result remains stable even when the picture content changes abruptly, and the video segmentation effect is improved.
Drawings
Fig. 1 is a schematic structural diagram of an intelligent terminal according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of a video object segmentation method according to the present invention;
FIG. 3 is a functional block diagram of a video object segmentation system according to the present invention;
FIG. 4 is a schematic flow chart of the present invention for segmenting video objects by using spatiotemporal global features;
FIG. 5 is a schematic diagram of the connection of the spatio-temporal global convolution module of the present invention;
FIG. 6 is a schematic diagram of a spatiotemporal global feature extraction network according to the present invention;
FIG. 7 is a schematic diagram of an input video frame of the spatio-temporal global feature extraction network of the present invention.
The realization of the objects, features and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments; the drawings illustrate individual embodiments and are not intended to be a complete description of the invention.
Detailed Description
In order to solve the problem of poor video segmentation effect, the technical scheme of the application determines the past frame features associated with the video frame to be segmented according to the past segmentation results, determines the space-time global features associated with the video frame to be segmented according to the past segmentation results and the un-segmented video frames whose time sequence is after the video frame to be segmented, and generates the segmentation result corresponding to the video frame to be segmented based on the space-time global features and the past frame features.
For a better understanding of the above technical solutions, exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As shown in fig. 1, fig. 1 is a schematic structural diagram of a hardware operating environment according to an embodiment of the present invention.
It should be noted that fig. 1 is a schematic structural diagram of a hardware operating environment of the intelligent terminal.
As shown in fig. 1, the intelligent terminal may include: a processor 1001 such as a CPU, a memory 1005, a user interface 1003, a network interface 1004 and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the intelligent terminal architecture shown in fig. 1 is not meant to be limiting of the intelligent terminal and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, the memory 1005, which is a storage medium, may include an operating system, a network communication module, a user interface module and a video target segmentation program. The operating system is a program that manages and controls the hardware and software resources of the intelligent terminal and supports the operation of the video target segmentation program and other software or programs.
In the intelligent terminal shown in fig. 1, the user interface 1003 is mainly used for connecting a terminal and performing data communication with it; the network interface 1004 is mainly used for connecting a background server and performing data communication with it; and the processor 1001 may be used to invoke the video target segmentation program stored in the memory 1005.
In this embodiment, the intelligent terminal includes: a memory 1005, a processor 1001, and an object segmentation program for video stored on the memory and executable on the processor, wherein:
when the processor 1001 calls the target segmentation program of the video stored in the memory 1005, the following operations are performed:
determining the past frame characteristics related to the video frame to be segmented according to the past segmentation result;
determining the space-time global characteristics associated with the video frame to be segmented according to the past segmentation result and an unsegmented video frame with a time sequence behind the video frame to be segmented;
and generating a segmentation result corresponding to the video frame to be segmented based on the space-time global feature and the past frame feature.
When the processor 1001 calls the target segmentation program of the video stored in the memory 1005, the following operations are also performed:
inputting the past segmentation result and the non-segmented video frame into a space-time global convolutional network, and acquiring an output result of the space-time global convolutional network;
and taking the output result as the input of a space domain convolution network, and obtaining the space-time global characteristics through the space domain convolution network.
When the processor 1001 calls the target segmentation program of the video stored in the memory 1005, the following operations are also performed:
sequentially passing the past segmentation result and the non-segmented video frame through the two-dimensional space domain convolution kernel and the one-dimensional time domain convolution kernel to obtain a video frame after space-time global convolution;
and adding the past segmentation result, the unsegmented video frame and the video frame after the space-time global convolution to obtain an output result of the space-time global convolution network.
When the processor 1001 calls the target segmentation program of the video stored in the memory 1005, the following operations are also performed:
changing the pixel value of a background area in the past segmentation result;
inputting the past segmentation result with the changed pixel values and the unsegmented video frame into a space-time global convolutional network, and obtaining an output result of the space-time global convolutional network.
When the processor 1001 calls the target segmentation program of the video stored in the memory 1005, the following operations are also performed:
determining the down-sampling characteristics of the video frame to be segmented;
determining a splicing characteristic of the spatio-temporal global characteristic and the past frame characteristic;
inputting the down-sampling feature and the splicing feature into a decoding network, and acquiring an output result of the decoding network;
and obtaining a segmentation result corresponding to the video frame to be segmented according to the output result of the decoding network.
When the processor 1001 calls the target segmentation program of the video stored in the memory 1005, the following operations are also performed:
determining the index characteristics of the video frame to be segmented;
determining a target past segmentation result matched with the index features;
determining the past frame characteristics based on the target past segmentation result.
When the processor 1001 calls the target segmentation program of the video stored in the memory 1005, the following operations are also performed:
and updating the past segmentation result by adopting the segmentation result corresponding to the video frame to be segmented.
The technical solution of the present application will be described below by way of examples.
As shown in fig. 2, in a first embodiment of the present application, a method for object segmentation of a video of the present application includes the following steps:
step S110, determining the past frame characteristics related to the video frame to be segmented according to the past segmentation result.
In this embodiment, the past frame features are features of video frames whose time sequence is before the video frame to be segmented, and the past segmentation results are the segmentation results of those earlier video frames. During video target segmentation, the video frames in the video frame sequence are extracted sequentially. A video frame that has already been segmented and whose time sequence is before the video frame to be segmented is regarded as a past frame, and its segmentation result, i.e. the past segmentation result, is stored in a past frame segmentation result memory module. When a subsequent video frame, i.e. the video frame to be segmented, is segmented, the past segmentation results stored in the memory module are referenced to obtain a temporally stable segmentation result. Therefore, when the video frame to be segmented is extracted and segmented, the present application determines the past frame features associated with the video frame to be segmented according to the past segmentation results in the past frame segmentation result memory module.
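As an illustrative aid (not part of the patent text), the past frame segmentation result memory module described above can be pictured as a small buffer that stores (frame, mask) pairs for already-segmented frames; the class and method names below are hypothetical:

```python
from collections import deque

class PastFrameMemory:
    """Sketch of a past frame segmentation result memory module (hypothetical API).

    It keeps up to `capacity` (frame, mask) pairs for frames that have already
    been segmented, so that later frames can reference them.
    """

    def __init__(self, capacity: int = 8):
        self.buffer = deque(maxlen=capacity)  # oldest entries are dropped first

    def update(self, frame, mask):
        """Store the segmentation result of a newly segmented frame."""
        self.buffer.append((frame, mask))

    def past_results(self):
        """Return the stored past segmentation results in temporal order."""
        return list(self.buffer)
```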
And step S120, determining the space-time global characteristics related to the video frame to be segmented according to the past segmentation result and the non-segmented video frame with the time sequence behind the video frame to be segmented.
In this embodiment, an un-segmented video frame is a video frame whose time sequence is after the video frame to be segmented. The past segmentation results, whose time sequence is before the video frame to be segmented, and the un-segmented video frames, whose time sequence is after it, are input into a space-time global feature extraction network to obtain the space-time global features, which are then associated with the video frame to be segmented. The space-time global features are used to represent the association between the past segmentation results and the un-segmented video frames.
Step S130, based on the space-time global feature and the past frame feature, generating a segmentation result corresponding to the video frame to be segmented.
In this embodiment, after the space-time global features and the past frame features are obtained, these features are input into a decoding network to obtain the segmentation result corresponding to the video frame to be segmented. Alternatively, when segmenting the target in a video frame, methods such as threshold-based segmentation, region-based segmentation, edge-based segmentation and segmentation methods based on specific theories may also be used.
On the premise of utilizing the segmentation information of past frames, the space-time global features of the target to be segmented across the whole video are also used, which improves the global stability of the segmentation result and maintains a good segmentation effect even across shot changes.
According to the above technical scheme, when the video frame to be segmented is segmented, both the video information of the past frames before it and the video information of the un-segmented video frames after it are fully considered, so that a good segmentation effect is still achieved when a shot switch occurs at the video frame to be segmented.
In an embodiment, determining the spatiotemporal global features associated with the video frame to be segmented according to the past segmentation result and an un-segmented video frame whose time sequence is after the video frame to be segmented specifically includes the following steps:
and step S121, inputting the past segmentation result and the non-segmented video frame into a space-time global convolutional network, and acquiring an output result of the space-time global convolutional network.
And S122, taking the output result as the input of a space domain convolution network, and obtaining the space-time global characteristics through the space domain convolution network.
In this embodiment, referring to fig. 7, the past segmentation results (i.e., the segmented frames in fig. 7) are taken from the past frame segmentation result memory module, and the un-segmented video frames whose time sequence is after the video frame to be segmented (i.e., the un-segmented frames in fig. 7) are taken from the video frame sequence; the past segmentation results and the un-segmented video frames are then input into the space-time global convolution network. As shown in fig. 6, the space-time global feature extraction network comprises two parts: a space-time global convolution network and a space-domain convolution network. The space-time global convolution network is composed of a plurality of space-time global convolution modules, and the space-domain convolution network is composed of a plurality of ordinary convolution layers. The video frames first pass through the space-time global convolution modules to obtain rich target information, and then pass through the ordinary convolution layers to obtain the space-time global features. Specifically, the past segmentation results and the un-segmented video frame sequence are input into the space-time global convolution network, and the space-time convolution operations performed by its space-time global convolution modules yield the output result of the space-time global convolution network.
In this embodiment, after the output result of the space-time global convolution network is obtained, it is further input into the space-domain convolution network, whose ordinary convolution layers perform space-domain convolution operations on it to obtain the space-time global features.
For example, the past segmentation results and the un-segmented video frames may be arranged in time order to obtain a video frame sequence. The sequence can be represented as 1 × H × W × 3T, where T is the number of video frames, H is the height of the video frames, and W is their width. Inputting the frames of this sequence into the space-time global convolution network yields an output of size 1 × H/r × W/r × 3T. This output is then input into the space-domain convolution network to obtain space-time global features of size 1 × H/r × W/r × C, where r is the scaling ratio between the original video frames and the space-time global features (r may be, for example, 4, 8 or 16) and C is the number of feature channels.
According to the technical scheme, the past segmentation results and the un-segmented video frames are input into the space-time global convolution network and then the space-domain convolution network, from which the space-time global features are extracted.
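The shape bookkeeping described above can be made concrete with a short sketch. The layer counts and the values T = 8, r = 4 and C = 64 below are assumptions chosen for illustration, not values fixed by the patent; the space-time global convolution stage is represented only by its assumed output shape here, and its module structure is sketched after the next embodiment:

```python
import torch
import torch.nn as nn

# Illustrative shape walk-through; T, H, W, r and C are example values only.
T, H, W = 8, 256, 256          # 8 RGB frames of 256 x 256
r, C = 4, 64                   # scaling ratio and number of feature channels

frames = torch.randn(1, 3 * T, H, W)             # 1 x 3T x H x W (channel-first layout)

# Stage 1: space-time global convolution network (module structure sketched below);
# here it is represented only by its assumed output shape.
stg_out = torch.randn(1, 3 * T, H // r, W // r)  # 1 x 3T x H/r x W/r

# Stage 2: space-domain convolution network built from ordinary convolution layers.
spatial_net = nn.Sequential(
    nn.Conv2d(3 * T, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)
feature = spatial_net(stg_out)                   # 1 x C x H/r x W/r space-time global feature
print(frames.shape, stg_out.shape, feature.shape)
```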
In an embodiment, the step of inputting the past segmentation result and the non-segmented video frame into a space-time global convolutional network, and obtaining an output result of the space-time global convolutional network specifically includes the following steps:
step S1211, sequentially passing the past segmentation result and the unsegmented video frame through the two-dimensional space domain convolution kernel and the one-dimensional time domain convolution kernel to obtain a video frame after space-time global convolution;
step S1212, adding the past segmentation result, the un-segmented video frame, and the video frame after the spatio-temporal global convolution to obtain an output result of the spatio-temporal global convolution network.
In this embodiment, referring to fig. 5, which shows the specific structure of the space-time global convolution module, the space-time global convolution network includes a two-dimensional space-domain convolution kernel and a one-dimensional time-domain convolution kernel, and the two are connected in series. Specifically, a two-dimensional space-domain convolution kernel of size S × n and a one-dimensional time-domain convolution kernel of size 1 × t are connected in series, and the input past segmentation results and un-segmented video frames are added to the output video frames after the space-time global convolution, thereby realizing residual learning. This computation module can process the images of different frames hierarchically and extract hierarchical features, and the time-domain convolution enables the target features to flow along the time dimension. S, n and t are all adjustable parameters; for example, S can be set to 3, 5 or 7, and n and t can both be equal to 3T.
According to the technical scheme, the space-time global convolution network is used to perform space-time convolution operations on the past segmentation results and the un-segmented video frames, so that different video frames are processed hierarchically.
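The serial combination of a two-dimensional space-domain kernel and a one-dimensional time-domain kernel with a residual addition resembles a factorized (2+1)D convolution. The sketch below is one possible reading of that structure rather than the patent's exact parameterization: frames are kept as a 5-D tensor (batch, channels, frames, height, width), the spatial kernel size and temporal kernel size are free parameters, and padding is chosen so the residual addition is shape-compatible:

```python
import torch
import torch.nn as nn

class SpatioTemporalGlobalConvModule(nn.Module):
    """Sketch of a space-time global convolution module: a 2-D space-domain
    convolution and a 1-D time-domain convolution in series, with the input
    added back to the output (residual learning)."""

    def __init__(self, channels: int, spatial_kernel: int = 3, temporal_kernel: int = 3):
        super().__init__()
        # 2-D space-domain convolution: acts only on the height and width axes.
        self.spatial = nn.Conv3d(
            channels, channels,
            kernel_size=(1, spatial_kernel, spatial_kernel),
            padding=(0, spatial_kernel // 2, spatial_kernel // 2),
        )
        # 1-D time-domain convolution: acts only on the frame (time) axis,
        # letting target features flow across frames.
        self.temporal = nn.Conv3d(
            channels, channels,
            kernel_size=(temporal_kernel, 1, 1),
            padding=(temporal_kernel // 2, 0, 0),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        y = self.act(self.spatial(x))
        y = self.temporal(y)
        return self.act(x + y)  # residual addition of input frames and convolved frames


# Example: one clip of 12 RGB frames (past segmented frames plus un-segmented frames).
clip = torch.randn(1, 3, 12, 64, 64)
out = SpatioTemporalGlobalConvModule(channels=3)(clip)
print(out.shape)  # torch.Size([1, 3, 12, 64, 64])
```

In this reading, the time-domain convolution is what lets information about the target flow between past segmented frames and un-segmented frames, which is the role the description assigns to it.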
In an embodiment, the step of inputting the past segmentation result and the non-segmented video frame into a space-time global convolutional network, and obtaining an output result of the space-time global convolutional network specifically includes the following steps:
step S2211, changing the pixel value of the background area in the past segmentation result;
step S2212, inputting the past segmentation result with the pixel values changed and the non-segmented video frame into a spatio-temporal global convolutional network, and obtaining an output result of the spatio-temporal global convolutional network.
In this embodiment, referring to fig. 7, the input of the space-time global feature extraction network includes two parts: j past frames and k un-segmented video frames. According to the past segmentation results, the pixel values of the background regions of the past frames are set to 0, so that the past frames form a strong cue for the target to be segmented; the un-segmented video frames need no special processing. The past segmentation results with changed pixel values and the un-segmented video frames are then input into the space-time global convolution network to obtain its output result.
According to the technical scheme, the pixel values of the background regions in the past segmentation results are changed before the space-time global convolution processing, so that the past frames provide a strong cue for the target to be segmented.
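A minimal sketch of this background-zeroing step, assuming the past segmentation result is available as a binary foreground mask (the function name and array layout are illustrative):

```python
import numpy as np

def zero_background(past_frame: np.ndarray, past_mask: np.ndarray) -> np.ndarray:
    """Set the background pixels of a past frame to 0 using its segmentation mask,
    so that the remaining pixels act as a strong cue for the target to be segmented.

    past_frame: H x W x 3 image; past_mask: H x W binary mask (1 = target, 0 = background).
    Un-segmented video frames are passed through unchanged.
    """
    return past_frame * past_mask[..., None]  # broadcast the mask over the colour channels
```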
In an embodiment, the generating a segmentation result corresponding to the video frame to be segmented based on the spatio-temporal global feature and the past frame feature specifically includes the following steps:
step S131, determining the down-sampling characteristics of the video frame to be segmented;
step S132, determining the splicing characteristics of the space-time global characteristics and the past frame characteristics;
step S133, inputting the down-sampling feature and the splicing feature into a decoding network, and acquiring an output result of the decoding network;
and step S134, obtaining a segmentation result corresponding to the video frame to be segmented according to the output result of the decoding network.
In this embodiment, the shape of the space-time global feature is 1 × H/r × W/r × C, whose spatial dimensions are equal to those of the past frame feature; during network inference, the space-time global feature and the past frame feature are concatenated to form a new feature, i.e. the spliced feature. The video frame to be segmented is input into the feature extraction network to generate its down-sampling features, which include a 16× down-sampling feature, an 8× down-sampling feature and a 4× down-sampling feature. The spliced feature and the 16×, 8× and 4× down-sampling features are input into the decoding network together to obtain the output result of the decoding network: in the decoding network, the spliced feature is successively combined with the 16×, 8× and 4× down-sampling features. After the output result of the decoding network is obtained, it is resized to the same size as the video frame to be segmented, giving the segmentation result corresponding to the video frame to be segmented.
Optionally, the 16× down-sampling feature, the 8× down-sampling feature, the 4× down-sampling feature, the space-time global feature and the past frame feature may instead all be input into the decoding network; the space-time global feature and the past frame feature are then concatenated inside the decoding network, and the output result of the decoding network is obtained by successively combining the spliced feature with the 16×, 8× and 4× down-sampling features.
According to the technical scheme, the segmentation result corresponding to the video frame to be segmented is obtained by inputting the down-sampling features, the space-time global feature and the past frame feature of the video frame to be segmented into the decoding network; since the space-time global feature extraction network is added, space-time global features containing richer target information can be obtained during inference, making the segmentation result more stable.
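The decoding step can be sketched as follows. The channel widths, the assumption that the spliced feature sits at 1/16 resolution, and the use of bilinear resizing are illustrative choices rather than details given by the patent; the sketch only shows the successive fusion with the 16×, 8× and 4× down-sampling features and the final resize to the input frame size:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of a decoding network: the spliced feature (space-time global feature
    concatenated with the past frame feature, assumed here to sit at 1/16 resolution)
    is successively fused with the 16x, 8x and 4x down-sampling features of the frame
    to be segmented and finally resized to the input frame size. Channel widths are
    assumptions."""

    def __init__(self, cat_ch=128, ch16=256, ch8=128, ch4=64):
        super().__init__()
        self.fuse16 = nn.Conv2d(cat_ch + ch16, 128, 3, padding=1)
        self.fuse8 = nn.Conv2d(128 + ch8, 64, 3, padding=1)
        self.fuse4 = nn.Conv2d(64 + ch4, 32, 3, padding=1)
        self.head = nn.Conv2d(32, 1, 1)  # single-channel mask logits

    def forward(self, spliced, f16, f8, f4, frame_size):
        x = F.relu(self.fuse16(torch.cat([spliced, f16], dim=1)))       # fuse at 1/16
        x = F.interpolate(x, size=f8.shape[-2:], mode="bilinear", align_corners=False)
        x = F.relu(self.fuse8(torch.cat([x, f8], dim=1)))               # fuse at 1/8
        x = F.interpolate(x, size=f4.shape[-2:], mode="bilinear", align_corners=False)
        x = F.relu(self.fuse4(torch.cat([x, f4], dim=1)))               # fuse at 1/4
        logits = self.head(x)
        # Resize the output to the same size as the video frame to be segmented.
        return F.interpolate(logits, size=frame_size, mode="bilinear", align_corners=False)
```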
In an embodiment, determining the past frame characteristics associated with the video frame to be segmented according to the past segmentation result specifically includes the following steps:
step S111, determining the index characteristics of the video frame to be segmented;
step S112, determining a target past segmentation result matched with the index features;
step S113, determining the past frame characteristics based on the target past segmentation result.
In this embodiment, the past segmentation results are stored in the past frame segmentation result memory module, and the past segmentation results include the past frame features. The video frame to be segmented is passed through a feature extraction network to obtain its index feature. The index feature is matched against the past segmentation results in the past frame segmentation result memory module, and the past frame features are then determined from the matched target past segmentation result. Specifically, the feature matching process is essentially a key-value lookup: the index feature represents the key of the video frame to be segmented, and the value corresponding to this key, i.e. the target past segmentation result, is matched in the past frame segmentation result memory module. The video features represented by the target past segmentation result are taken as the past frame features associated with the video frame to be segmented.
According to the technical scheme, the past frame characteristics associated with the video frame to be segmented are matched from the past frame segmentation result memory module through the index characteristics, and then the past frame characteristics used for the subsequent space-time global characteristic analysis are obtained.
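The patent describes the matching only as a key-value lookup; the dot-product attention below is one common way such a memory read-out is realized (e.g. in space-time memory networks) and is an assumption, not the mechanism specified by the patent. Tensor shapes are stated in the docstring:

```python
import torch
import torch.nn.functional as F

def match_past_features(query_key, memory_keys, memory_values):
    """Sketch of key-value matching against the past frame segmentation result memory.

    query_key:     (C_k, H, W)     index feature of the video frame to be segmented
    memory_keys:   (N, C_k, H, W)  keys of the N stored past segmentation results
    memory_values: (N, C_v, H, W)  value (feature) maps of those past results
    Returns a (C_v, H, W) past frame feature read from the memory.
    """
    n, ck, h, w = memory_keys.shape
    q = query_key.reshape(ck, h * w)                                        # C_k x HW
    k = memory_keys.reshape(n, ck, h * w).permute(0, 2, 1).reshape(-1, ck)  # NHW x C_k
    v = memory_values.reshape(n, -1, h * w).permute(0, 2, 1).reshape(n * h * w, -1)  # NHW x C_v
    weights = F.softmax(k @ q / ck ** 0.5, dim=0)   # similarity of each memory location to each query location
    read = v.t() @ weights                          # C_v x HW weighted read-out
    return read.reshape(-1, h, w)
```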
In one embodiment, the method for target segmentation of video further comprises the following steps:
step S111, determining the index characteristics of the video frame to be segmented;
step S112, determining a target past segmentation result matched with the index features;
step S113, determining the past frame characteristics based on the target past segmentation result;
step S120, determining the space-time global characteristics related to the video frame to be segmented according to the past segmentation result and the non-segmented video frame with the time sequence behind the video frame to be segmented;
step S130, based on the space-time global feature and the past frame feature, generating a segmentation result corresponding to the video frame to be segmented;
and step S310, updating the past segmentation result by adopting the segmentation result corresponding to the video frame to be segmented.
In this embodiment, referring to fig. 4, the video frame to be segmented is passed through the feature extraction network to generate four pieces of feature data: the index feature, the 16× down-sampling feature, the 8× down-sampling feature and the 4× down-sampling feature. According to the index feature, the feature matching module extracts the past frame features of the target in the video frame to be segmented from the past frame segmentation result memory module. A certain number of past frames for which segmentation results have already been obtained and a certain number of un-segmented video frames of the video are sent into the space-time global feature extraction network to obtain the space-time global features of the target to be segmented across the past frames and the un-segmented video frames. The obtained past frame features, the space-time global features and the 16×, 8× and 4× down-sampling features are then sent into the decoding network to obtain the segmentation result of the video frame to be segmented. After the segmentation result corresponding to the video frame to be segmented is obtained, it is used to update the past frame segmentation result memory module.
According to the technical scheme, the past frame segmentation result memory module is updated after each segmentation result is obtained, so that when subsequent video frames to be segmented are analysed, the corresponding past segmentation results can be retrieved from the memory module for analysis, improving the stability of the segmentation.
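Tying the sketches above together, the per-frame flow of fig. 4 might look like the following; every component name here is a placeholder rather than an identifier from the patent:

```python
import torch

def segment_frame(frame, future_frames, memory, encoder, stg_net, decoder):
    """Illustrative end-to-end flow for one video frame to be segmented; encoder,
    stg_net, decoder and memory stand for the components sketched earlier."""
    # 1. Feature extraction network: index feature plus 16x / 8x / 4x down-sampling features.
    index_feat, f16, f8, f4 = encoder(frame)

    # 2. Feature matching: read the past frame feature from the memory module.
    past_feat = memory.match(index_feat)

    # 3. Space-time global feature extraction from past segmented frames
    #    (with background pixels set to 0) and un-segmented video frames.
    stg_feat = stg_net(memory.past_results(), future_frames)

    # 4. Decoding network: splice the space-time global and past frame features,
    #    fuse them with the down-sampling features, and output the segmentation mask.
    spliced = torch.cat([stg_feat, past_feat], dim=1)
    mask = decoder(spliced, f16, f8, f4, frame_size=frame.shape[-2:])

    # 5. Update the past frame segmentation result memory with the new result.
    memory.update(frame, mask)
    return mask
```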
While a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different from that shown or described herein.
As shown in fig. 3, the present application provides a video target segmentation system, which includes:
and the past frame characteristic determining module 10 is configured to determine, according to the past segmentation result, a past frame characteristic associated with the video frame to be segmented. In an embodiment, the past frame feature determining module 10 is further configured to determine an index feature of the video frame to be segmented; determining a target past segmentation result matched with the index features; and determining the past frame characteristics based on the target past segmentation result.
And a spatiotemporal global feature determining module 20, configured to determine a spatiotemporal global feature associated with the video frame to be segmented according to the past segmentation result and an un-segmented video frame whose time sequence is after the video frame to be segmented. In an embodiment, the spatio-temporal global feature determination module 20 is further configured to input the past segmentation result and the non-segmented video frame into a spatio-temporal global convolutional network, and obtain an output result of the spatio-temporal global convolutional network; and taking the output result as the input of a space domain convolution network, and obtaining the space-time global characteristics through the space domain convolution network. In one embodiment, the space-time global convolution network includes a two-dimensional space-domain convolution kernel and a one-dimensional time-domain convolution kernel, and the two-dimensional space-domain convolution kernel and the one-dimensional time-domain convolution kernel are connected in series; the space-time global feature determination module 20 is further configured to sequentially pass the past segmentation result and the non-segmented video frame through the two-dimensional space-domain convolution kernel and the one-dimensional time-domain convolution kernel to obtain a space-time global convolved video frame; and adding the past segmentation result, the unsegmented video frame and the video frame after the space-time global convolution to obtain an output result of the space-time global convolution network. In an embodiment, the spatiotemporal global feature determination module 20 is further configured to modify pixel values of a background region in the past segmentation result; inputting the past segmentation result with the changed pixel values and the unsegmented video frame into a space-time global convolutional network, and obtaining an output result of the space-time global convolutional network.
And a segmentation result determining module 30, configured to generate a segmentation result corresponding to the video frame to be segmented based on the spatio-temporal global feature and the past frame feature. In an embodiment, the segmentation result determining module 30 is further configured to determine a downsampling characteristic of the video frame to be segmented; determining a splicing characteristic of the spatio-temporal global characteristic and the past frame characteristic; inputting the down-sampling feature and the splicing feature into a decoding network, and acquiring an output result of the decoding network; and obtaining a segmentation result corresponding to the video frame to be segmented according to the output result of the decoding network.
In an embodiment, after the segmentation result determining module 30, an updating module is further included, and the updating module is configured to update the past segmentation result with the segmentation result corresponding to the video frame to be segmented.
The specific implementation of the video object segmentation system of the present invention is substantially the same as the embodiments of the video object segmentation method, and is not described herein again.
Based on the same inventive concept, an embodiment of the present application further provides a computer-readable storage medium, where a video object segmentation program is stored, and when the video object segmentation program is executed by a processor, the steps of the video object segmentation method described above are implemented, and the same technical effects can be achieved, and are not repeated here to avoid repetition.
Since the storage medium provided in the embodiments of the present application is a storage medium used for implementing the method in the embodiments of the present application, based on the method described in the embodiments of the present application, a person skilled in the art can understand a specific structure and a modification of the storage medium, and thus details are not described here. Any storage medium used in the methods of the embodiments of the present application is intended to be within the scope of the present application.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that, in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering; these words may be interpreted as names.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A method for segmenting a video object, the method comprising:
determining the past frame characteristics related to the video frame to be segmented according to the past segmentation result;
determining the space-time global characteristics associated with the video frame to be segmented according to the past segmentation result and an unsegmented video frame with a time sequence behind the video frame to be segmented;
and generating a segmentation result corresponding to the video frame to be segmented based on the space-time global feature and the past frame feature.
2. The method for segmenting a video target according to claim 1, wherein the step of determining the spatio-temporal global features associated with the video frame to be segmented according to the past segmentation result and the non-segmented video frame whose time sequence is after the video frame to be segmented comprises:
inputting the past segmentation result and the non-segmented video frame into a space-time global convolutional network, and acquiring an output result of the space-time global convolutional network;
and taking the output result as the input of a space domain convolution network, and obtaining the space-time global characteristics through the space domain convolution network.
3. The video object segmentation method according to claim 2, wherein the spatio-temporal global convolution network includes a two-dimensional spatial convolution kernel and a one-dimensional temporal convolution kernel, and the two-dimensional spatial convolution kernel is connected in series with the one-dimensional temporal convolution kernel; the step of inputting the past segmentation result and the non-segmented video frame into a space-time global convolutional network and obtaining an output result of the space-time global convolutional network comprises the following steps:
sequentially passing the past segmentation result and the unsegmented video frame through the two-dimensional space domain convolution kernel and the one-dimensional time domain convolution kernel to obtain a video frame after space-time global convolution;
and adding the past segmentation result, the unsegmented video frame and the video frame after the space-time global convolution to obtain an output result of the space-time global convolution network.
4. The method for object segmentation of video according to claim 2, wherein the step of inputting the past segmentation result and the non-segmented video frame into a spatio-temporal global convolutional network and obtaining an output result of the spatio-temporal global convolutional network comprises:
changing the pixel value of a background area in the past segmentation result;
inputting the past segmentation result with the changed pixel values and the unsegmented video frame into a space-time global convolutional network, and obtaining an output result of the space-time global convolutional network.
5. The method for segmenting the video target according to claim 1, wherein the step of generating the segmentation result corresponding to the video frame to be segmented based on the spatio-temporal global feature and the past frame feature comprises:
determining the down-sampling characteristics of the video frame to be segmented;
determining a splicing characteristic of the spatio-temporal global characteristic and the past frame characteristic;
inputting the down-sampling feature and the splicing feature into a decoding network, and acquiring an output result of the decoding network;
and obtaining a segmentation result corresponding to the video frame to be segmented according to the output result of the decoding network.
6. The method for segmenting the video target according to claim 1, wherein the step of determining the past frame characteristics associated with the video frame to be segmented according to the past segmentation result comprises:
determining the index characteristics of the video frame to be segmented;
determining a target past segmentation result matched with the index features;
determining the past frame characteristics based on the target past segmentation result.
7. The method for segmenting the video target according to claim 6, wherein after the step of generating the segmentation result corresponding to the video frame to be segmented based on the spatio-temporal global feature and the past frame feature, the method further comprises:
and updating the past segmentation result by adopting the segmentation result corresponding to the video frame to be segmented.
8. A video object segmentation system, comprising:
the past frame characteristic determining module is used for determining the past frame characteristics related to the video frame to be segmented according to the past segmentation result;
the space-time global feature determining module is used for determining space-time global features related to the video frames to be segmented according to the past segmentation result and the non-segmented video frames with the time sequence behind the video frames to be segmented;
and the segmentation result determining module is used for generating a segmentation result corresponding to the video frame to be segmented based on the space-time global feature and the past frame feature.
9. An intelligent terminal, characterized in that, intelligent terminal includes: memory, a processor and an object segmentation program of a video stored on the memory and executable on the processor, the object segmentation program of a video implementing the steps of the object segmentation method of a video according to any one of claims 1 to 7 when executed by the processor.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores an object segmentation program of a video, which when executed by a processor implements the steps of the object segmentation method of a video according to any one of claims 1 to 7.
CN202210534657.0A 2022-05-17 2022-05-17 Video target segmentation method, system, terminal and storage medium Pending CN114973079A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210534657.0A CN114973079A (en) 2022-05-17 2022-05-17 Video target segmentation method, system, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210534657.0A CN114973079A (en) 2022-05-17 2022-05-17 Video target segmentation method, system, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN114973079A true CN114973079A (en) 2022-08-30

Family

ID=82983134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210534657.0A Pending CN114973079A (en) 2022-05-17 2022-05-17 Video target segmentation method, system, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN114973079A (en)

Similar Documents

Publication Publication Date Title
CN109508681B (en) Method and device for generating human body key point detection model
CN111031346B (en) Method and device for enhancing video image quality
Kim et al. Global and local enhancement networks for paired and unpaired image enhancement
US9196021B2 (en) Video enhancement using related content
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN112016682B (en) Video characterization learning and pre-training method and device, electronic equipment and storage medium
CN109919209A (en) A kind of domain-adaptive deep learning method and readable storage medium storing program for executing
CN111652830A (en) Image processing method and device, computer readable medium and terminal equipment
Su et al. Image inpainting for random areas using dense context features
CN110399826B (en) End-to-end face detection and identification method
CN117576264B (en) Image generation method, device, equipment and medium
CN113066034A (en) Face image restoration method and device, restoration model, medium and equipment
CN107801093A (en) Video Rendering method, apparatus, computer equipment and readable storage medium storing program for executing
CN116934907A (en) Image generation method, device and storage medium
US11887277B2 (en) Removing compression artifacts from digital images and videos utilizing generative machine-learning models
CN105979283A (en) Video transcoding method and device
Mejjati et al. Look here! a parametric learning based approach to redirect visual attention
CN116051593A (en) Clothing image extraction method and device, equipment, medium and product thereof
CN115984307A (en) Video object segmentation method and device, electronic equipment and storage medium
Vašata et al. Image inpainting using Wasserstein generative adversarial imputation network
Fujii et al. RGB-D image inpainting using generative adversarial network with a late fusion approach
CN114694074A (en) Method, device and storage medium for generating video by using image
CN114299573A (en) Video processing method and device, electronic equipment and storage medium
CN116758449A (en) Video salient target detection method and system based on deep learning
CN114973079A (en) Video target segmentation method, system, terminal and storage medium

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination