CN116091866A - Video object segmentation model training method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116091866A
CN116091866A (application number CN202310028445.XA)
Authority
CN
China
Prior art keywords
image
video
mask
original
transformation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310028445.XA
Other languages
Chinese (zh)
Inventor
王伟农
戴宇荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202310028445.XA
Publication of CN116091866A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/28Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The disclosure relates to a video object segmentation model training method, a device, an electronic device and a storage medium, comprising: acquiring a plurality of original images and mask images corresponding to objects in each original image; determining a transformation image set corresponding to each original image based on each original image and the mask image corresponding to the object in each original image; the transformation image set corresponding to each original image comprises K transformation images corresponding to the original images, and each transformation image carries a transformation mask image; each transformation mask image is obtained by transforming mask images corresponding to the transformation image set to which each transformation mask image belongs; k is an integer greater than 1; and performing first training on the original segmentation model based on the transformation image sets corresponding to the plurality of original images to obtain a video object segmentation model. According to the method and the device, the original segmentation model is trained through the transformation image set comprising the transformation image and the transformation mask image, and label data are not needed, so that training cost is reduced.

Description

Video object segmentation model training method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of internet, and in particular relates to a video object segmentation model training method, a device, electronic equipment and a storage medium.
Background
Video object segmentation (Video Object Segmentation, VOS) is a fundamental capability for video scene understanding and video editing. The technology has wide application prospects in fields such as intelligent short-video editing, special-effect production and short-video creation. Given a mask of the target object in the initial frame of a video sequence, the VOS technique predicts a pixel-level segmentation mask of the target object in the subsequent frames. With the development of deep learning, deep neural networks have been applied to VOS; the high-level semantic features they extract can distinguish the target object from the background more accurately in complex scenes, which greatly improves the object segmentation effect.
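For orientation only, the following is a minimal sketch of the semi-supervised VOS interface described above, assuming frames and masks are given as arrays; the function and the model.predict call are illustrative placeholders, not part of this disclosure.

```python
from typing import List
import numpy as np

def segment_video(frames: List[np.ndarray], initial_mask: np.ndarray, model) -> List[np.ndarray]:
    """Semi-supervised VOS: given the target object's mask in the initial frame,
    predict a pixel-level mask of that object in every subsequent frame."""
    masks = [initial_mask]
    for t in range(1, len(frames)):
        # the model conditions on already-processed frames and their masks
        # (model.predict is a placeholder for whatever VOS model is used)
        masks.append(model.predict(frames[t], frames[:t], masks))
    return masks
```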
However, video object segmentation relies on an annotated training dataset, and a video object segmentation dataset requires the pixel-level mask of every distinct object in a video to be annotated separately. This kind of annotation is far more time-consuming and laborious than simple image annotation, because not only must the segmentation masks of the different objects be annotated in every frame, but the labels must also be kept aligned, so that the same target object carries the same label in the subsequent frames of the same video. The labor and time costs required to annotate a dataset for video object segmentation are therefore significant.
Disclosure of Invention
The disclosure provides a video object segmentation model training method, a video object segmentation device, electronic equipment and a storage medium, and the technical scheme of the disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided a video object segmentation model training method, including:
acquiring a plurality of original images and mask images corresponding to objects in each original image;
determining a transformation image set corresponding to each original image based on each original image and the mask image corresponding to the object in each original image; the transformation image set corresponding to each original image comprises K transformation images corresponding to the original images, and each transformation image carries a transformation mask image; each transformation mask image is obtained by transforming mask images corresponding to the transformation image set to which each transformation mask image belongs; k is an integer greater than 1;
and performing first training on the original segmentation model based on the transformation image sets corresponding to the plurality of original images to obtain a video object segmentation model.
In some possible embodiments, determining the set of transformed images for each original image based on each original image and the mask image for the object in each original image includes:
determining a target mask image corresponding to the target object from mask images corresponding to the objects in each original image; the object in each original image includes a target object;
carrying out preset transformation on each original image to obtain K transformed images corresponding to each original image;
performing preset transformation on a target mask image corresponding to a target object to obtain K transformed mask images corresponding to the target mask image;
and determining a transformation image set corresponding to each original image based on the K transformation images and the K transformation mask images corresponding to each original image.
In some possible embodiments, the preset transformation includes:
a preset transformation performed by simulating inter-frame motion information;
or a preset transformation performed by simulating distortion information;
or a preset transformation performed by data enhancement.
In some possible embodiments, acquiring a mask image corresponding to the object in each original image includes:
and identifying the preset object in each original image, and determining a mask image corresponding to the preset object.
In some possible embodiments, acquiring a mask image corresponding to the object in each original image includes:
identifying each object in each original image, and determining the pixel of each object in each original image;
performing binarization processing on the original image based on the pixels of each object to obtain a mask image corresponding to each original image;
and carrying out connected region segmentation processing on the mask image corresponding to each original image to obtain the mask image corresponding to each object in each original image.
In some possible embodiments, based on the transformed image sets corresponding to the plurality of original images, performing a first training on the original segmentation model to obtain a video object segmentation model, including:
forming K transformed images in the transformed image set into an object synthesized video;
determining a plurality of transformed images in the object composite video as a plurality of first image frames;
determining transformed mask images corresponding to the plurality of first image frames as a plurality of first masks;
determining a transformed image in the object composite video as a second image frame; the position of the second image frame in the object composite video is located after the plurality of first image frames;
inputting a plurality of first image frames and a plurality of first masks into a first encoder in an original segmentation model to obtain first characteristic information;
inputting a second image frame into a second encoder in the original segmentation model to obtain second characteristic information;
determining a prediction mask corresponding to the second image frame based on the decoder, the first feature information and the second feature information in the original segmentation model;
and training the original segmentation model based on a prediction mask corresponding to the second image frame and the transformation mask image corresponding to the second image frame in the transformation image set to obtain a video object segmentation model.
In some possible embodiments, after obtaining the video object segmentation model, the method further includes:
acquiring a training video, the training video being obtained by shooting with a device;
performing object mask processing on a plurality of target video frames in the training video to obtain a plurality of object masks corresponding to the plurality of target video frames; the target video frame is a video frame containing a target object in the training video;
and performing second training on the video object segmentation model based on the plurality of target video frames and the plurality of object masks to obtain an updated video object segmentation model.
According to a second aspect of an embodiment of the present disclosure, there is provided a video object segmentation method including:
acquiring a video to be identified;
performing object mask processing on Q video frames containing preset objects in the video to be identified to obtain Q object masks corresponding to the Q video frames; q is a positive integer greater than 1;
inputting the video to be identified and the Q object masks into a video object segmentation model obtained by training according to the video object segmentation model training method of any one of claims 1 to 7, and obtaining the object masks corresponding to the remaining video frames containing the preset object in the video to be identified.
According to a third aspect of embodiments of the present disclosure, there is provided a video object segmentation model training apparatus, including:
an image acquisition module configured to perform acquisition of a plurality of original images, and mask images corresponding to objects in each of the original images;
an image set acquisition module configured to perform determination of a transformed image set corresponding to each original image based on each original image and a mask image corresponding to an object in each original image; the transformation image set corresponding to each original image comprises K transformation images corresponding to the original images, and each transformation image carries a transformation mask image; each transformation mask image is obtained by transforming mask images corresponding to the transformation image set to which each transformation mask image belongs; k is an integer greater than 1;
the training module is configured to execute first training on the original segmentation model based on the transformation image sets corresponding to the plurality of original images to obtain a video object segmentation model.
In some possible embodiments, the image set acquisition module is configured to perform:
determining a target mask image corresponding to the target object from mask images corresponding to the objects in each original image; the object in each original image includes a target object;
carrying out preset transformation on each original image to obtain K transformed images corresponding to each original image;
performing preset transformation on a target mask image corresponding to a target object to obtain K transformed mask images corresponding to the target mask image;
and determining a transformation image set corresponding to each original image based on the K transformation images and the K transformation mask images corresponding to each original image.
In some possible embodiments, the preset transformation includes:
a preset transformation performed by simulating inter-frame motion information;
or a preset transformation performed by simulating distortion information;
or a preset transformation performed by data enhancement.
In some possible embodiments, the image acquisition module is configured to perform:
and identifying the preset object in each original image, and determining a mask image corresponding to the preset object.
In some possible embodiments, the image acquisition module is configured to perform:
identifying each object in each original image, and determining the pixel of each object in each original image;
performing binarization processing on the original image based on the pixels of each object to obtain a mask image corresponding to each original image;
and carrying out connected region segmentation processing on the mask image corresponding to each original image to obtain the mask image corresponding to each object in each original image.
In some possible embodiments, the training module is configured to perform:
forming K transformed images in the transformed image set into an object synthesized video;
determining a plurality of transformed images in the object composite video as a plurality of first image frames;
determining transformed mask images corresponding to the plurality of first image frames as a plurality of first masks;
determining a transformed image in the object composite video as a second image frame; the position of the second image frame in the object composite video is located after the plurality of first image frames;
inputting a plurality of first image frames and a plurality of first masks into a first encoder in an original segmentation model to obtain first characteristic information;
inputting a second image frame into a second encoder in the original segmentation model to obtain second characteristic information;
determining a prediction mask corresponding to the second image frame based on the decoder, the first feature information and the second feature information in the original segmentation model;
and training the original segmentation model based on a prediction mask corresponding to the second image frame and the transformation mask image corresponding to the second image frame in the transformation image set to obtain a video object segmentation model.
In some possible embodiments, the training module is configured to perform:
acquiring a training video, the training video being obtained by shooting with a device;
performing object mask processing on a plurality of target video frames in the training video to obtain a plurality of object masks corresponding to the plurality of target video frames; the target video frame is a video frame containing a target object in the training video;
and performing second training on the video object segmentation model based on the plurality of target video frames and the plurality of object masks to obtain an updated video object segmentation model.
According to a fourth aspect of embodiments of the present disclosure, there is provided a video object segmentation apparatus including:
the video acquisition module is configured to acquire a video to be identified;
the object mask determining module is configured to execute object mask processing on Q video frames containing preset objects in the video to be identified, and Q object masks corresponding to the Q video frames are obtained; q is a positive integer greater than 1;
The segmentation module is configured to execute inputting the video to be identified and the Q object masks into the video object segmentation model obtained by training according to the video object segmentation model training method of any one of claims 1 to 7, so as to obtain the object mask corresponding to the remaining video frames containing the preset object in the video to be identified.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the method of any one of the first or second aspects above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, cause the electronic device to perform the method of any one of the first or second aspects of the embodiments of the present disclosure.
According to a seventh aspect of the embodiments of the present disclosure, there is provided a computer program product comprising a computer program stored in a readable storage medium, wherein at least one processor of a computer device reads the computer program from the readable storage medium and executes it, so that the computer device performs the method of any one of the first or second aspects of the embodiments of the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
acquiring a plurality of original images and mask images corresponding to objects in each original image; determining a transformation image set corresponding to each original image based on each original image and the mask image corresponding to the object in each original image; the transformation image set corresponding to each original image comprises K transformation images corresponding to the original images, and each transformation image carries a transformation mask image; each transformation mask image is obtained by transforming mask images corresponding to the transformation image set to which each transformation mask image belongs; k is an integer greater than 1; and performing first training on the original segmentation model based on the transformation image sets corresponding to the plurality of original images to obtain a video object segmentation model. According to the method and the device, the original segmentation model is trained through the transformation image set comprising the transformation image and the transformation mask image, and label data are not needed, so that training cost is reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application environment shown in accordance with an exemplary embodiment;
FIG. 2 is a flowchart illustrating a method of training a video object segmentation model, according to an example embodiment;
FIG. 3 is a flowchart illustrating a method of determining a mask image corresponding to an object, according to an example embodiment;
FIG. 4 is a schematic diagram of an original image and mask image shown according to an exemplary embodiment;
FIG. 5 is a flowchart illustrating a set of transformed images corresponding to an original image, according to an exemplary embodiment;
FIG. 6 is a schematic diagram of a transformed image and transformed mask image, according to an example embodiment;
FIG. 7 is a training flow diagram of a video object segmentation model, according to an example embodiment;
FIG. 8 is a block diagram illustrating a model training according to an exemplary embodiment;
FIG. 9 is a flowchart illustrating a method of video object segmentation according to an exemplary embodiment;
FIG. 10 is a block diagram of a video object segmentation model training apparatus, according to an example embodiment;
FIG. 11 is a block diagram of a video object segmentation apparatus, according to an example embodiment;
FIG. 12 is a block diagram of an electronic device for video object segmentation model training or video object segmentation, according to an example embodiment.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for presentation, analyzed data, etc.) related to the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
Referring to fig. 1, fig. 1 is a schematic diagram of an application environment according to an exemplary embodiment; as shown in fig. 1, the application environment may include a server 01 and a client 02.
In some possible embodiments, the server 01 may include a stand-alone physical server, a server cluster or distributed system composed of a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud audio recognition model training, middleware services, domain name services, security services, CDN (Content Delivery Network), and basic cloud computing services such as big data and artificial intelligence platforms. Operating systems running on the server may include, but are not limited to, Android, IOS, Linux, Windows, Unix, and the like.
In some possible embodiments, the client 02 may include, but is not limited to, a smart phone, a desktop computer, a tablet computer, a notebook computer, a smart speaker, a digital assistant, an augmented reality (AR)/virtual reality (VR) device, a smart wearable device, and the like. It may also be software running on such a device, for example an application or an applet. Optionally, the operating system running on the client may include, but is not limited to, Android, IOS, Linux, Windows, Unix, and the like.
In some possible embodiments, the server 01 or the client 02 may acquire a plurality of original images, and a mask image corresponding to an object in each original image; determining a transformation image set corresponding to each original image based on each original image and the mask image corresponding to the object in each original image; the transformation image set corresponding to each original image comprises K transformation images corresponding to the original images, and each transformation image carries a transformation mask image; each transformation mask image is obtained by transforming mask images corresponding to the transformation image set to which each transformation mask image belongs; k is an integer greater than 1; and performing first training on the original segmentation model based on the transformation image sets corresponding to the plurality of original images to obtain a video object segmentation model. According to the method and the device, the original segmentation model is trained through the transformation image set comprising the transformation image and the transformation mask image, and label data are not needed, so that training cost is reduced.
In some possible embodiments, the client 02 and the server 01 may be connected through a wired link, or may be connected through a wireless link.
In an exemplary embodiment, the client, the server and the database corresponding to the server may each be a node device in a blockchain system, and may share acquired and generated information with other node devices in the blockchain system, thereby implementing information sharing among multiple node devices. The plurality of node devices in the blockchain system can be configured with the same blockchain; the blockchain consists of a plurality of blocks, and adjacent blocks are linked to one another, so that tampering with the data in any block can be detected through the next block. This prevents the data in the blockchain from being tampered with and ensures the security and reliability of the data in the blockchain.
Fig. 2 is a flowchart of a video object segmentation model training method according to an exemplary embodiment, and as shown in fig. 2, the video object segmentation model training method may be applied to a server, and may also be applied to other node devices, such as a client, and is described below by taking the server as an example, where the method includes the following steps:
in step S201, a plurality of original images, and mask images corresponding to objects in each original image are acquired.
In the embodiment of the application, the server may acquire a plurality of original images. Optionally, the server may obtain the original images from an online image set, or obtain them from other devices. Because the original images serve the subsequent training of the original segmentation model, and in order to improve the generalization capability and robustness of the trained video object segmentation model, the server can use diverse original images for that training; accordingly, the number of original images may be plural.
In this embodiment of the present application, when the number of original images is plural, each of the plurality of original images is processed in the same way; that is, the same embodiment applies to each original image used to train the original segmentation model.
In some alternative embodiments, the server may obtain N original images, which may be denoted as x_i, i = 1, 2, 3, …, N.
In the embodiment of the application, the server may determine the mask image corresponding to the object in each original image.
In some alternative embodiments, the server may directly acquire the mask image corresponding to each original image. Specifically, the objects in each original image may be annotated manually before the server obtains the images. If a plurality of objects exist in an original image, the different objects can be annotated separately so as to obtain a mask image for each object; for example, if two objects exist in one original image, a mask image corresponding to each of the two objects is obtained. The mask images obtained in this embodiment have high accuracy, but manual annotation incurs high labor and time costs, so this approach is not suitable for large-scale application.
In some alternative embodiments, the server may identify the preset object in each original image and determine a mask image of the preset object. Specifically, the server may invoke an instance segmentation algorithm to identify a preset object in each original image, and determine a mask image of the preset object.
In the embodiment of the present application, the flow of the instance segmentation algorithm on each original image is as follows: each original image is preprocessed, and each preprocessed original image is input into a trained neural network to obtain a feature map corresponding to that original image. Candidate regions of interest are then sent to a Region Proposal Network (RPN) for binary classification (foreground/background classification) and Bounding Box (BB) regression, and a portion of the candidate regions of interest are filtered out.
However, although different objects can be segmented automatically and their mask images obtained by the instance segmentation algorithm, the types of objects that the instance segmentation algorithm can segment are limited, generally to common categories such as people, vehicles, trees, dogs and cats. Therefore, the server can only obtain mask images of preset objects (such as people, vehicles, trees, dogs, cats, etc.) in each original image, and its ability to acquire mask images of non-preset objects is poor.
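As one concrete possibility (not mandated by this disclosure), the preset-object masks described above could be obtained with an off-the-shelf instance segmentation model; the sketch below assumes torchvision's pretrained Mask R-CNN, whose categories are limited to the common COCO classes, and the score and mask thresholds are illustrative values.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# pretrained Mask R-CNN only covers common (preset) COCO categories such as
# person, car, dog, cat; newer torchvision versions use weights="DEFAULT" instead
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True).eval()

def preset_object_masks(image_path: str, score_thr: float = 0.7, mask_thr: float = 0.5):
    img = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        out = model([img])[0]                      # dict with boxes/labels/scores/masks
    keep = out["scores"] > score_thr
    # one binary (H, W) mask per detected preset object
    return [(m[0] > mask_thr).numpy() for m in out["masks"][keep]]
```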
In other possible embodiments, to avoid that the server can only obtain the mask image of the preset object, fig. 3 is a flowchart illustrating a method for determining the mask image corresponding to the object according to an exemplary embodiment, as shown in fig. 3, including:
in step S301, each object in each original image is identified, and the pixel of each object in each original image is determined.
Fig. 4 is a schematic diagram of an original image and mask images according to an exemplary embodiment; as shown in fig. 4, it includes four subgraphs (a), (b), (c) and (d). In this embodiment of the present application, the server may identify the foreground objects in each original image, such as (a) in fig. 4, and determine the pixels of the foreground objects (the objects shown as 401 and 402) in the original image.
In step S303, binarization processing is performed on the original image based on the pixel of each object, and a mask image corresponding to each original image is obtained.
Alternatively, the server may perform binarization processing (to values 0 and 255) on the original image ((a) in fig. 4) based on the pixels of the foreground objects (401 and 402 in (a) of fig. 4), that is, distinguish the foreground objects from the background as shown in (b) of fig. 4, to obtain a mask image corresponding to the original image. The mask image can be denoted as y_i, i = 1, 2, 3, …, N.
In step S305, a connected region segmentation process is performed on the mask image corresponding to each original image, so as to obtain a mask image corresponding to each object in each original image.
Alternatively, in the embodiment of the present application, the server may perform connected-region segmentation processing on the mask image corresponding to each original image based on the connected regions in that mask image. In fig. 4, the pixels of the object corresponding to 401 and the pixels of the object corresponding to 402 are not connected, so the two objects 401 and 402 can be separated to obtain the mask image of the object corresponding to 401 shown in (c) of fig. 4 and the mask image of the object corresponding to 402 shown in (d) of fig. 4.
For an original image x_i that contains objects, the mask image corresponding to each object in the original image can be recorded as a separate per-object mask derived from y_i. That is, if a single object exists in one original image, the mask image corresponding to the original image is the mask image corresponding to that object. If a plurality of objects exist in one original image, the mask image corresponding to the original image may be divided into a plurality of mask images corresponding to the plurality of objects.
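A minimal sketch of steps S301 to S305 above, assuming the per-pixel foreground estimate comes from any foreground or salient-object estimator (left abstract here); OpenCV's connected-components routine then splits the binarized foreground into one mask per object, mirroring the per-object splitting of y_i just described.

```python
import cv2
import numpy as np

def per_object_masks(foreground_prob: np.ndarray, thr: float = 0.5):
    """foreground_prob: (H, W) float map in [0, 1] from any foreground estimator."""
    # step S303: binarize the image into foreground (255) and background (0)
    binary = np.where(foreground_prob > thr, 255, 0).astype(np.uint8)
    # step S305: unconnected foreground regions belong to different objects
    num_labels, labels = cv2.connectedComponents(binary)
    # label 0 is the background; every other label yields one per-object mask
    return [((labels == j).astype(np.uint8) * 255) for j in range(1, num_labels)]
```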
In step S203, a set of transformed images corresponding to each original image is determined based on each original image and the mask image corresponding to the object in each original image; the transformation image set corresponding to each original image comprises K transformation images corresponding to the original images, and each transformation image carries a transformation mask image; each transformation mask image is obtained by transforming mask images corresponding to the transformation image set to which each transformation mask image belongs; k is an integer greater than 1.
FIG. 5 is a flowchart illustrating a set of transformed images corresponding to an original image, as shown in FIG. 5, according to an exemplary embodiment, including:
in step S501, a target mask image corresponding to a target object is determined from mask images corresponding to objects in each original image; the object in each original image includes a target object.
In an alternative embodiment, the server may determine the number of transformed image sets corresponding to each original image based on the number of objects in each original image. How to determine the set of transformed images corresponding to each original image in fig. 5 will be described below in conjunction with fig. 4.
Optionally, if an original image includes an object, the server may determine the object in the original image as a target object, and determine a mask image corresponding to the object as a target mask image corresponding to the target object.
A plurality of objects may also be included in one original image, such as the two objects 401 and 402 shown in (a) of fig. 4. In that case, the object 401 may be regarded as the target object, and the mask image ((c) in fig. 4) corresponding to the object 401 may be regarded as the target mask image corresponding to the target object. Alternatively, the object 402 may be taken as the target object, and the mask image ((d) in fig. 4) corresponding to the object 402 taken as the target mask image corresponding to the target object. Alternatively, the object 401 and the object 402 may each be regarded as a target object, with the mask image ((c) in fig. 4) corresponding to the object 401 as the target mask image corresponding to the target object 401, and the mask image ((d) in fig. 4) corresponding to the object 402 as the target mask image corresponding to the target object 402.
In step S503, a preset transformation is performed on each original image, so as to obtain K transformed images corresponding to each original image.
In this embodiment of the present application, the server may perform preset transformation on each original image to obtain K transformed images corresponding to each original image.
Alternatively, the preset transformation may include a combination of one or more of: a preset transformation performed by simulating inter-frame motion information, a preset transformation performed by simulating distortion information, and a preset transformation performed by data enhancement. Optionally, the preset transformation may also be an identity (no) transformation.
The preset transformation performed by simulating inter-frame motion information includes affine transformation, cropping, rotation and thin-plate spline interpolation. The preset transformation performed by simulating distortion information includes a color dithering transformation. The preset transformation performed by data enhancement includes data enhancement.
Alternatively, affine transformation, also called affine mapping, means that in geometry a vector space undergoes one linear transformation followed by a translation into another vector space. Thin-plate spline interpolation is a common two-dimensional interpolation scheme. Color dithering refers to obtaining a richer visual effect with a lower color bit depth, such as obtaining an 8-bit visual effect with a bit depth of 1 bit.
Alternatively, the data enhancement may be implemented by gamma enhancement, or the original image may be regarded as a two-dimensional signal and signal enhancement based on the two-dimensional Fourier transform may be performed on it. Noise in the original image can be removed by low-pass filtering (i.e., only low-frequency signals are passed). Alternatively, high-pass filtering may be used, so that high-frequency signals such as edges are enhanced and a blurred original image becomes clear.
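A sketch of steps S503 and S505, assuming torchvision: the same randomly drawn geometric parameters are applied to the original image and to its target mask image so that the pair stays aligned, while the color change (the simulated-distortion transformation) is applied to the image only. The parameter ranges below are illustrative assumptions, not values fixed by this disclosure.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def preset_transform(image, mask):
    """image, mask: PIL images of the original image and the target mask image."""
    # simulated inter-frame motion: a random affine transform (rotation,
    # translation, scale, shear) applied identically to image and mask
    angle = random.uniform(-20.0, 20.0)
    translate = [random.randint(-30, 30), random.randint(-30, 30)]
    scale = random.uniform(0.9, 1.1)
    shear = random.uniform(-10.0, 10.0)
    image = TF.affine(image, angle, translate, scale, shear,
                      interpolation=InterpolationMode.BILINEAR)
    mask = TF.affine(mask, angle, translate, scale, shear,
                     interpolation=InterpolationMode.NEAREST)
    # simulated distortion: color change applied to the image only
    image = TF.adjust_brightness(image, random.uniform(0.8, 1.2))
    image = TF.adjust_hue(image, random.uniform(-0.05, 0.05))
    return image, mask

def build_transformed_set(image, mask, k=7):
    # step S507: keep the untransformed pair plus k-1 transformed pairs
    return [(image, mask)] + [preset_transform(image, mask) for _ in range(k - 1)]
```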
Fig. 6 is a schematic diagram of transformed images and transformed mask images according to an exemplary embodiment. As shown in fig. 6, the first image 600 in fig. 6 is an original image containing a single object 6001; that object is the target object, and the mask image 6000 below the original image is the target mask image corresponding to the target object.
Alternatively, the server may perform 6 preset transforms on the original image to obtain K-1 transformed images, where K-1=6. Wherein each of the 6 preset transforms may be one of the preset transforms or a combination of the preset transforms.
In step S505, a preset transformation is performed on the target mask image corresponding to the target object, so as to obtain K transformed mask images corresponding to the target mask image.
Similarly, the server may perform 6 kinds of preset transformations on the target mask image corresponding to the target object, to obtain K-1 transformed mask images, where K-1=6. Wherein each of the 6 preset transforms may be one of the preset transforms or a combination of the preset transforms.
As shown in fig. 6, there is a one-to-one correspondence between the transformations applied to the K-1 transformed images (601-606, 6 transformed images in total) and those applied to the K-1 transformed mask images (6011-6066, 6 transformed mask images in total). For example, the last transformed image 606 and the last transformed mask image 6066 are obtained from the original image 600 and the target mask image 6000 through the same preset transformation.
In an alternative embodiment, the original image 600 is transformed by 6 different preset transformations into 6 different transformed images 601-606; the 6 transformed mask images corresponding to these transformed images, i.e. 6011-6066, can then be obtained from each transformed image in the same manner in which the target mask image 6000 was obtained from the original image 600.
In step S507, a set of transformed images corresponding to each original image is determined based on the K transformed images and the K transformed mask images corresponding to each original image.
In this embodiment of the present application, the server may determine the transformed image set corresponding to each original image based on the K transformed images and the K transformed mask images corresponding to that original image.
Alternatively, the server may determine a set of transformed images corresponding to the original image based on 7 transformed images and 7 transformed mask images corresponding to the original image shown in fig. 6.
Optionally, if the original image includes 2 objects, the server may determine a first set of transformed images corresponding to the original image based on the K transformed images and the K transformed mask images corresponding to the first object in the original image, and may determine a second set of transformed images corresponding to the original image based on the K transformed images and the K transformed mask images corresponding to the second object in the original image.
In this way, each of the K transform images in the set of transform images corresponding to the target object has its corresponding transform mask image, and each transform mask image is obtained based on a preset transform according to the mask image corresponding to the target object.
In step S205, the original segmentation model is first trained based on the transformed image sets corresponding to the plurality of original images, to obtain a video object segmentation model.
FIG. 7 is a training flow diagram of a video object segmentation model, as shown in FIG. 7, according to an exemplary embodiment, including:
in step S701, K transform images in the set of transform images are composed into an object composite video.
In some possible embodiments, the server may sort the K transform images in the set of transform images according to a preset sorting rule, to obtain the object composite video. Assuming that there are 1000 original images, and each original image contains 2 objects, the server can obtain at most 2000 transform image sets, and then compose 2000 object composite videos according to K transform images in each transform image set of the 2000 transform image sets.
In step S702, a plurality of transformed images in an object composite video are determined as a plurality of first image frames.
Fig. 8 is a block diagram illustrating a model training according to an exemplary embodiment, and as shown in fig. 8, a server may determine a plurality of transformed images (e.g., the first 3 image frames in fig. 8) located at the front in an object composite video as a plurality of first image frames.
In step S703, the transformed mask images corresponding to the plurality of first image frames are determined as a plurality of first masks.
Alternatively, the server may determine transformed mask images (e.g., the first 3 masks in the mask sequence in fig. 8) corresponding to the plurality of first image frames as the plurality of first masks.
In step S704, one of the transformed images in the object synthesized video is determined as a second image frame; the position of the second image frame in the object composite video is located after the plurality of first image frames.
Alternatively, the server may determine, as the second image frame, a transformed image located after the plurality of first image frames in the object composite video, and the second image frame may be referred to as the current image frame.
In step S705, a plurality of first image frames and a plurality of first masks are input to a first encoder in an original segmentation model, resulting in first feature information.
Specifically, the server may input the first of the first image frames together with the first of the first masks into the first encoder in the original segmentation model to obtain one piece of first feature information, input the second of the first image frames and the second of the first masks to obtain a second piece of first feature information, and input the third of the first image frames and the third of the first masks to obtain a third piece of first feature information. Each piece of first feature information may include two pieces of sub-information: a key and a value. The key is used for addressing, and the value holds the more detailed information used to generate the mask.
Then, the server may combine the three pieces of first feature information to obtain total first feature information, which also includes a key and a value.
In step S706, a second image frame is input to a second encoder in the original segmentation model, resulting in second feature information.
Optionally, the server may input the second image frame into the second encoder in the original segmentation model to obtain second feature information, which also includes a key and a value. Here too, the key is used for addressing, and the value holds the more detailed information used to generate the mask.
In step S707, a prediction mask corresponding to the second image frame is determined based on the decoder in the original segmentation model, the first feature information, and the second feature information.
In some optional embodiments, the server may perform an inner product operation on the key in the total first feature information and the key in the second feature information to obtain a similarity value. The similarity value acts as a space-time attention mechanism, assigning weights to the values of different time instants and regions. The server may then multiply the similarity value with the value in the total first feature information to obtain the data read from the space-time memory. Next, the server can concatenate the data read from the space-time memory with the value in the second feature information, and input the concatenated information into the decoder in the original segmentation model to obtain the prediction mask corresponding to the second image frame.
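A compact sketch of the memory read described in steps S705 to S707, following the key/value scheme above; the tensor shapes and channel sizes are assumptions made for illustration and are not fixed by this disclosure.

```python
import torch
import torch.nn.functional as F

def memory_read(mem_key, mem_value, query_key, query_value):
    """
    mem_key:     (B, Ck, T*H*W)  keys of the first image frames (memory)
    mem_value:   (B, Cv, T*H*W)  values of the first image frames
    query_key:   (B, Ck, H*W)    key of the second (current) image frame
    query_value: (B, Cv, H*W)    value of the second (current) image frame
    """
    # inner product of memory keys and the query key -> similarity values
    affinity = torch.bmm(mem_key.transpose(1, 2), query_key)        # (B, T*H*W, H*W)
    # space-time attention: weights over different time instants and regions
    affinity = F.softmax(affinity, dim=1)
    # weighted sum of memory values = data read from the space-time memory
    read = torch.bmm(mem_value, affinity)                            # (B, Cv, H*W)
    # concatenate with the current frame's value; this goes to the decoder
    return torch.cat([read, query_value], dim=1)                     # (B, 2*Cv, H*W)
```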
In step S708, the original segmentation model is trained based on the prediction mask corresponding to the second image frame and the transformed mask image corresponding to the second image frame in the transformed image set, to obtain a video object segmentation model. In this embodiment of the present application, the server may determine a loss value based on the prediction mask corresponding to the second image frame and the transformed mask image corresponding to the second image frame in the transformed image set, and use the loss value to update the parameters of the original segmentation model to obtain an updated original segmentation model; the first training iteration of the original segmentation model is thereby completed.
Optionally, the server may continue training the updated original segmentation model with different object composite videos and mask sequences, with reference to steps S701 to S708, and stop training when the iteration termination condition is met, thereby obtaining the video object segmentation model.
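A hedged sketch of one such first-stage training iteration, reusing the memory_read sketch above and assuming the model exposes memory_encoder, query_encoder and decoder modules; the binary cross-entropy loss and the optimizer call are illustrative choices, since the disclosure only states that a loss value is used to update the model parameters.

```python
import torch
import torch.nn.functional as F

def first_stage_step(model, optimizer, clip_images, clip_masks):
    """
    clip_images: K transformed images of one object composite video, tensors (1, 3, H, W)
    clip_masks:  the K transformed mask images, tensors (1, 1, H, W) with values in {0, 1}
    """
    first_frames, first_masks = clip_images[:3], clip_masks[:3]     # memory frames
    second_frame, second_gt = clip_images[3], clip_masks[3]         # current frame + label

    # step S705: encode each first image frame with its mask, then merge key/value pairs
    pairs = [model.memory_encoder(f, m) for f, m in zip(first_frames, first_masks)]
    mem_key = torch.cat([k for k, _ in pairs], dim=-1)
    mem_value = torch.cat([v for _, v in pairs], dim=-1)

    # steps S706/S707: encode the second image frame, read memory, decode a prediction mask
    q_key, q_value = model.query_encoder(second_frame)
    pred_logits = model.decoder(memory_read(mem_key, mem_value, q_key, q_value))

    # step S708: loss between the prediction mask and the transformed mask image
    loss = F.binary_cross_entropy_with_logits(pred_logits, second_gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```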
In this way, the first-stage model training is completed using the object composite videos; because the composite videos avoid manual annotation, labor and time costs are saved and the model training efficiency can be improved.
Since the first-stage model training is performed on object composite videos while the obtained video object segmentation model is to be applied to real videos, the gap between composite videos and real videos may cause errors when the video object segmentation model is applied to real videos. Therefore, after the video object segmentation model is obtained, the embodiment of the application can further perform a second-stage training of the video object segmentation model on a small amount of real video, so as to obtain a more accurate video object segmentation model.
In some possible embodiments, the server may acquire a training video, the training video being captured by a device. The server performs object mask processing on a plurality of target video frames in the training video to obtain a plurality of object masks corresponding to those target video frames, the target video frames being the video frames of the training video that contain the target object, and then performs a second training of the video object segmentation model based on the plurality of target video frames and the plurality of object masks to obtain an updated video object segmentation model. The training procedure of the second stage is the same as that of the first stage.
Alternatively, the server may acquire the training video, which is captured by a device and is therefore not a composite video. The server can then perform object mask processing on a plurality of target video frames containing the target object in the training video to obtain a plurality of object masks corresponding to those target video frames.
Then, the server may determine a plurality of third image frames from the plurality of target video frames, and determine an object mask corresponding to the plurality of third image frames from among the plurality of object masks as a plurality of second masks. One of the plurality of target video frames is determined as a fourth image frame, wherein a position of the fourth image frame in the training video follows the plurality of third image frames.
Specifically, the server may input the first of the third image frames together with the first of the second masks into the first encoder in the video object segmentation model to obtain one piece of first feature information, input the second of the third image frames and the second of the second masks to obtain a second piece of first feature information, and input the third of the third image frames and the third of the second masks to obtain a third piece of first feature information. Each piece of first feature information may include two pieces of sub-information: a key and a value. The key is used for addressing, and the value holds the more detailed information used to generate the mask.
Then, the server may combine the three pieces of first feature information to obtain total first feature information, which also includes a key and a value.
Optionally, the server may input the fourth image frame into the second encoder in the video object segmentation model to obtain second feature information, which also includes a key and a value.
In some optional embodiments, the server may perform an inner product operation on the key in the total first feature information and the key in the second feature information to obtain a similarity value. The similarity value acts as a space-time attention mechanism, assigning weights to the values of different time instants and regions. The server may then multiply the similarity value with the value in the total first feature information to obtain the data read from the space-time memory. Next, the server can concatenate the data read from the space-time memory with the value in the second feature information, and input the concatenated information into the decoder in the video object segmentation model to obtain the prediction mask corresponding to the fourth image frame.
In this embodiment of the present application, the server may determine a loss value based on the prediction mask corresponding to the fourth image frame and the object mask corresponding to the fourth image frame, and update the parameters of the video object segmentation model using the loss value to obtain an updated video object segmentation model; the first training iteration of the second stage is thereby completed.
Optionally, the server may continue training the video object segmentation model with different training videos, with reference to steps S701 to S708, and stop training when the iteration termination condition is met, thereby obtaining the updated video object segmentation model.
In this way, the finally obtained video object segmentation model has stronger generalization capability and can adapt to different video object segmentation tasks. Moreover, the first-stage training requires no manual annotation: the dataset used by the server for model training is obtained through object masking and data processing, which saves labor and time costs.
Fig. 9 is a flowchart of a video object segmentation method, as shown in fig. 9, applicable to a server or a client, according to an exemplary embodiment, including the steps of:
in step S901, a video to be identified is acquired.
In step S903, performing object mask processing on Q video frames including a preset object in the video to be identified, to obtain Q object masks corresponding to the Q video frames; q is a positive integer greater than 1.
Alternatively, the Q video frames may be located at the beginning of the video to be identified.
In step S905, the video to be identified and Q object masks are input into a video object segmentation model obtained by training the video object segmentation model training method, so as to obtain an object mask corresponding to the remaining video frames containing the preset object in the video to be identified.
In the embodiment of the application, the server can acquire the video to be identified and perform object mask processing on the Q video frames containing the preset object in the video to be identified, so as to obtain the Q object masks corresponding to the Q video frames, where Q is a positive integer greater than 1 and the Q video frames are located at the beginning of the video to be identified. Then, the server can input the video to be identified and the Q object masks into the video object segmentation model obtained by training with the above video object segmentation model training method, so as to obtain the object masks corresponding to the remaining video frames containing the preset object in the video to be identified, where the remaining video frames do not include the Q video frames.
In this way, the embodiment of the present application can determine, from the subsequent video frames in the video to be identified, the object masks of the remaining video frames containing the preset object, based on the first Q video frames in the video to be identified and the annotation data of those Q video frames (the Q object masks corresponding to the Q video frames).
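A minimal sketch of how such inference could be driven is given below; the model call signature, the decision to append each prediction back into the memory, and all names are illustrative assumptions rather than requirements of the embodiments.

import torch

@torch.no_grad()
def segment_video(model, frames, q_masks):
    # frames:  all video frames of the video to be identified
    # q_masks: the Q object masks annotated for the first Q video frames
    q = len(q_masks)
    mem_frames, mem_masks = list(frames[:q]), list(q_masks)
    results = []
    for frame in frames[q:]:                         # remaining video frames
        pred = model(mem_frames, mem_masks, frame)   # predicted object mask
        results.append(pred)
        # Optionally keep the prediction as extra memory for later frames
        mem_frames.append(frame)
        mem_masks.append(pred)
    return results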
FIG. 10 is a block diagram illustrating a video object segmentation model training apparatus, according to an example embodiment. The apparatus has the function of implementing the video object segmentation model training method in the above method embodiments, and the function may be implemented by hardware or by hardware executing corresponding software. Referring to fig. 10, the apparatus includes:
an image acquisition module 1001 configured to perform acquisition of a plurality of original images, and mask images corresponding to objects in each original image;
an image set acquisition module 1002 configured to perform determination of a transformed image set corresponding to each original image based on each original image and the mask image corresponding to the object in each original image; the transformation image set corresponding to each original image comprises K transformation images corresponding to the original images, and each transformation image carries a transformation mask image; each transformation mask image is obtained by transforming mask images corresponding to the transformation image set to which each transformation mask image belongs; k is an integer greater than 1;
The training module 1003 is configured to perform a first training on the original segmentation model based on the transformed image sets corresponding to the plurality of original images, so as to obtain a video object segmentation model.
In some possible embodiments, the image set acquisition module is configured to perform:
determining a target mask image corresponding to the target object from mask images corresponding to the objects in each original image; the object in each original image includes a target object;
carrying out preset transformation on each original image to obtain K transformed images corresponding to each original image;
performing preset transformation on a target mask image corresponding to a target object to obtain K transformed mask images corresponding to the target mask image;
and determining a transformation image set corresponding to each original image based on the K transformation images and the K transformation mask images corresponding to each original image.
In some possible embodiments, the preset transformation includes:
preset transformation carried out by simulating inter-frame motion information;
or preset transformation carried out by simulating distortion information;
or preset transformation carried out by data enhancement (see the sketch after this list).
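A minimal sketch of one possible preset transformation is given below. It assumes that small random affine transforms (implemented with torchvision) are used to simulate inter-frame motion, and that the same transform is applied to the original image and to its target mask image; the parameter ranges, the value of K and all names are illustrative assumptions.

import random
import torchvision.transforms.functional as TF

def make_transformed_set(image, target_mask, k=5, max_shift=10, max_angle=5.0):
    # Apply the same small random affine transform to the image and its target mask,
    # approximating the motion between consecutive video frames
    pairs = []
    for _ in range(k):
        angle = random.uniform(-max_angle, max_angle)
        dx = random.randint(-max_shift, max_shift)
        dy = random.randint(-max_shift, max_shift)
        t_img = TF.affine(image, angle=angle, translate=[dx, dy], scale=1.0, shear=0.0)
        t_msk = TF.affine(target_mask, angle=angle, translate=[dx, dy], scale=1.0, shear=0.0)
        pairs.append((t_img, t_msk))   # each transformed image carries a transformation mask image
    return pairs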
In some possible embodiments, the image acquisition module is configured to perform:
And identifying the preset object in each original image, and determining a mask image corresponding to the preset object.
In some possible embodiments, the image acquisition module is configured to perform:
identifying each object in each original image, and determining the pixel of each object in each original image;
performing binarization processing on the original image based on the pixels of each object to obtain a mask image corresponding to each original image;
and carrying out connected region segmentation processing on the mask image corresponding to each original image to obtain the mask image corresponding to each object in each original image (see the sketch below).
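A minimal sketch of the binarization and connected-region segmentation is given below, assuming an OpenCV implementation and a single-channel object map as input; the threshold value and all names are illustrative assumptions.

import cv2
import numpy as np

def per_object_masks(object_map, thresh=127):
    # Binarize the object map of an original image based on the pixels of the objects
    _, binary = cv2.threshold(object_map.astype(np.uint8), thresh, 255, cv2.THRESH_BINARY)
    # Split the binary mask into connected regions, one region per object
    num_labels, labels = cv2.connectedComponents(binary)
    # Label 0 is the background; labels 1 .. num_labels-1 are the individual objects
    return [(labels == i).astype(np.uint8) * 255 for i in range(1, num_labels)]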
In some possible embodiments, the training module is configured to perform:
forming K transformed images in the transformed image set into an object synthesized video;
determining a plurality of transformed images in the object composite video as a plurality of first image frames;
determining transformed mask images corresponding to the plurality of first image frames as a plurality of first masks;
determining a transformed image in the object composite video as a second image frame; the position of the second image frame in the object composite video is located after the plurality of first image frames;
inputting a plurality of first image frames and a plurality of first masks into a first encoder in an original segmentation model to obtain first characteristic information;
Inputting a second image frame into a second encoder in the original segmentation model to obtain second characteristic information;
determining a prediction mask corresponding to the second image frame based on the decoder, the first feature information and the second feature information in the original segmentation model;
and training the original segmentation model based on the prediction mask corresponding to the second image frame and the transformation mask image corresponding to the second image frame in the transformation image set, to obtain the video object segmentation model (a sketch of assembling such a training sample is given below).
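A minimal sketch of assembling one training sample from a transformation image set follows; treating all but the last transformed image as the first image frames and the last one as the second image frame is an illustrative assumption made only for this sketch.

def build_training_sample(transformed_set):
    # transformed_set: K (transformed image, transformation mask image) pairs for one original image
    synthetic_video = [img for img, _ in transformed_set]   # the object synthesized video
    masks           = [msk for _, msk in transformed_set]
    # Earlier transformed images act as the first image frames with their first masks
    first_frames, first_masks = synthetic_video[:-1], masks[:-1]
    # A later transformed image acts as the second image frame; its mask supervises training
    second_frame, second_mask = synthetic_video[-1], masks[-1]
    return first_frames, first_masks, second_frame, second_mask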
In some possible embodiments, the training module is configured to perform:
acquiring a training video; training videos are obtained through equipment shooting;
performing object mask processing on a plurality of target video frames in the training video to obtain a plurality of object masks corresponding to the plurality of target video frames; the target video frame is a video frame containing a target object in the training video;
and performing second training on the video object segmentation model based on the plurality of target video frames and the plurality of object masks to obtain an updated video object segmentation model.
It should be noted that, when the apparatus provided in the foregoing embodiment implements its functions, the division into the foregoing functional modules is merely used as an example. In practical applications, the foregoing functions may be allocated to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiment and the method embodiment provided in the foregoing embodiments belong to the same concept; for the specific implementation process, reference may be made to the method embodiment, and details are not repeated herein.
Fig. 11 is a block diagram of a video object segmentation apparatus according to an exemplary embodiment. The apparatus has the function of implementing the video object segmentation method in the above method embodiments, and the function may be implemented by hardware or by hardware executing corresponding software. Referring to fig. 11, the apparatus includes:
a video acquisition module 1101 configured to perform acquisition of a video to be identified;
the object mask determining module 1102 is configured to execute object mask processing on Q video frames containing a preset object in the video to be identified, so as to obtain Q object masks corresponding to the Q video frames; q is a positive integer greater than 1;
the segmentation module 1103 is configured to perform inputting the video to be identified and Q object masks into the video object segmentation model trained according to the video object segmentation model training method of any one of claims 1 to 7, so as to obtain an object mask corresponding to the remaining video frames containing the preset object in the video to be identified.
It should be noted that, when the apparatus provided in the foregoing embodiment implements its functions, the division into the foregoing functional modules is merely used as an example. In practical applications, the foregoing functions may be allocated to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiment and the method embodiment provided in the foregoing embodiments belong to the same concept; for the specific implementation process, reference may be made to the method embodiment, and details are not repeated herein.
Fig. 12 is a block diagram illustrating an apparatus 3000 for video object segmentation model training or video object segmentation, according to an example embodiment. For example, apparatus 3000 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, or the like.
Referring to fig. 12, the apparatus 3000 may include one or more of the following components: a processing component 3002, a memory 3004, a power component 3006, a multimedia component 3008, an audio component 3010, an input/output (I/O) interface 3012, a sensor component 3014, and a communications component 3016.
The processing component 3002 generally controls overall operations of the device 3000, such as operations associated with display, phone calls, data communications, camera operations, and recording operations. The processing assembly 3002 may include one or more processors 3020 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 3002 may include one or more modules to facilitate interactions between the processing component 3002 and other components. For example, the processing component 3002 may include a multimedia module to facilitate interaction between the multimedia component 3008 and the processing component 3002.
The memory 3004 is configured to store various types of data to support operations at the device 3000. Examples of such data include instructions for any application or method operating on device 3000, contact data, phonebook data, messages, pictures, videos, and the like. The memory 3004 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply assembly 3006 provides power to the various components of the device 3000. The power supply components 3006 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 3000.
The multimedia component 3008 includes a screen that provides an output interface between the device 3000 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 3008 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the device 3000 is in an operation mode, such as a photographing mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focusing and optical zoom capabilities.
The audio component 3010 is configured to output and/or input audio signals. For example, audio component 3010 includes a Microphone (MIC) configured to receive external audio signals when device 3000 is in an operational mode, such as a call mode, a recording mode, and a speech recognition mode. The received audio signals may be further stored in the memory 3004 or transmitted via the communication component 3016. In some embodiments, the audio component 3010 further comprises a speaker for outputting audio signals.
The I/O interface 3012 provides an interface between the processing component 3002 and a peripheral interface module, which may be a keyboard, click wheel, button, or the like. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 3014 includes one or more sensors for providing status assessments of various aspects of the device 3000. For example, the sensor assembly 3014 may detect the on/off state of the device 3000 and the relative positioning of components, such as the display and keypad of the device 3000. The sensor assembly 3014 may also detect a change in position of the device 3000 or a component of the device 3000, the presence or absence of user contact with the device 3000, the orientation or acceleration/deceleration of the device 3000, and a temperature change of the device 3000. The sensor assembly 3014 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 3014 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 3014 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 3016 is configured to facilitate wired or wireless communication between the apparatus 3000 and other devices. The device 3000 may access a wireless network based on a communication standard, such as WiFi, an operator network (e.g., 2G, 3G, 4G, or 5G), or a combination thereof. In one exemplary embodiment, the communication component 3016 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 3016 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 3000 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
Embodiments of the present invention also provide a computer readable storage medium, which may be disposed in an electronic device to store at least one instruction or at least one program related to implementing a video object segmentation model training method, where the at least one instruction or the at least one program is loaded and executed by a processor of the electronic device to implement the video object segmentation model training method provided in the above method embodiments.
Embodiments of the present invention also provide a computer program product comprising a computer program stored in a readable storage medium, from which at least one processor of the computer device reads and executes the computer program, causing the computer device to perform the method of any of the first aspects of the embodiments of the present disclosure.
It should be noted that: the sequence of the embodiments of the present invention is only for description, and does not represent the advantages and disadvantages of the embodiments. And the foregoing description has been directed to specific embodiments of this specification. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the present invention is not intended to limit the present invention to the precise form disclosed. Any modifications, equivalent replacements, and improvements made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (12)

1. A method for training a video object segmentation model, comprising:
acquiring a plurality of original images and mask images corresponding to objects in each original image;
determining a transformation image set corresponding to each original image based on each original image and a mask image corresponding to an object in each original image; the transformation image set corresponding to each original image comprises K transformation images corresponding to the original images, and each transformation image carries a transformation mask image; each transformation mask image is obtained by transforming mask images corresponding to the transformation image set to which each transformation mask image belongs; the K is an integer greater than 1;
And performing first training on the original segmentation model based on the transformed image sets corresponding to the plurality of original images to obtain a video object segmentation model.
2. The method according to claim 1, wherein the determining a transformation image set corresponding to each original image based on each original image and a mask image corresponding to an object in each original image comprises:
determining a target mask image corresponding to a target object from mask images corresponding to the objects in each original image; the object in each original image comprises the target object;
carrying out preset transformation on each original image to obtain K transformed images corresponding to each original image;
carrying out the preset transformation on the target mask image corresponding to the target object to obtain K transformed mask images corresponding to the target mask image;
and determining a transformation image set corresponding to each original image based on the K transformation images and the K transformation mask images.
3. The method of claim 2, wherein the pre-set transformation comprises:
preset transformation carried out by simulating inter-frame motion information;
or preset transformation carried out by simulating distortion information;
or preset transformation carried out by data enhancement.
4. The method for training a video object segmentation model according to claim 1, wherein the acquiring mask images corresponding to the objects in each original image comprises:
and identifying the preset object in each original image, and determining a mask image corresponding to the preset object.
5. The method for training a video object segmentation model according to claim 1 or 4, wherein the acquiring mask images corresponding to the objects in each original image comprises:
identifying each object in each original image, and determining the pixel of each object in each original image;
performing binarization processing on the original image based on the pixels of each object to obtain a mask image corresponding to each original image;
and carrying out connected region segmentation processing on the mask image corresponding to each original image to obtain the mask image corresponding to each object in each original image.
6. The method for training a video object segmentation model according to any one of claims 1-4, wherein the first training of the original segmentation model based on the transformed image sets corresponding to the plurality of original images to obtain the video object segmentation model comprises:
Forming K transformed images in the transformed image set into an object synthesized video;
determining a plurality of transformed images in the object composite video as a plurality of first image frames;
determining transformed mask images corresponding to the plurality of first image frames as a plurality of first masks;
determining a transformed image in the object composite video as a second image frame; the position of the second image frame in the object synthesized video is located after the plurality of first image frames;
inputting the plurality of first image frames and the plurality of first masks into a first encoder in the original segmentation model to obtain first characteristic information;
inputting the second image frame into a second encoder in the original segmentation model to obtain second characteristic information;
determining a prediction mask corresponding to the second image frame based on a decoder in the original segmentation model, the first feature information and the second feature information;
and training the original segmentation model based on the prediction mask corresponding to the second image frame and the transformation mask image corresponding to the second image frame to obtain the video object segmentation model.
7. The method for training a video object segmentation model according to claim 1, further comprising, after the obtaining the video object segmentation model:
acquiring a training video; the training video is obtained through equipment shooting;
performing object mask processing on a plurality of target video frames in the training video to obtain a plurality of object masks corresponding to the plurality of target video frames; the target video frame is a video frame containing a target object in the training video;
and performing second training on the video object segmentation model based on the target video frames and the object masks to obtain an updated video object segmentation model.
8. A method of video object segmentation, comprising:
acquiring a video to be identified;
performing object mask processing on Q video frames containing preset objects in the video to be identified to obtain Q object masks corresponding to the Q video frames; q is a positive integer greater than 1;
inputting the video to be identified and the Q object masks into a video object segmentation model obtained by training according to the video object segmentation model training method of any one of claims 1 to 7, and obtaining an object mask corresponding to the residual video frame containing the preset object in the video to be identified.
9. A video object segmentation model training apparatus, comprising:
an image acquisition module configured to perform acquisition of a plurality of original images, and mask images corresponding to objects in each of the original images;
an image set acquisition module configured to perform determination of a transformed image set corresponding to each original image based on the each original image and a mask image corresponding to an object in the each original image; the transformation image set corresponding to each original image comprises K transformation images corresponding to the original images, and each transformation image carries a transformation mask image; each transformation mask image is obtained by transforming mask images corresponding to the transformation image set to which each transformation mask image belongs; the K is an integer greater than 1;
and the training module is configured to perform first training on the original segmentation model based on the transformation image sets corresponding to the plurality of original images to obtain a video object segmentation model.
10. A video object segmentation apparatus, comprising:
the video acquisition module is configured to acquire a video to be identified;
the object mask determining module is configured to execute object mask processing on Q video frames containing preset objects in the video to be identified to obtain Q object masks corresponding to the Q video frames; q is a positive integer greater than 1;
The segmentation module is configured to perform inputting the video to be identified and the Q object masks into a video object segmentation model obtained by training according to the video object segmentation model training method of any one of claims 1 to 7, so as to obtain an object mask corresponding to the remaining video frames containing the preset object in the video to be identified.
11. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video object segmentation model training method of any one of claims 1 to 7 or the video object segmentation method of claim 8.
12. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the video object segmentation model training method of any one of claims 1 to 7 or the video object segmentation method of claim 8.
CN202310028445.XA 2023-01-09 2023-01-09 Video object segmentation model training method and device, electronic equipment and storage medium Pending CN116091866A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310028445.XA CN116091866A (en) 2023-01-09 2023-01-09 Video object segmentation model training method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310028445.XA CN116091866A (en) 2023-01-09 2023-01-09 Video object segmentation model training method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116091866A true CN116091866A (en) 2023-05-09

Family

ID=86211636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310028445.XA Pending CN116091866A (en) 2023-01-09 2023-01-09 Video object segmentation model training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116091866A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination