CN116091955A - Segmentation method, segmentation device, segmentation equipment and computer readable storage medium - Google Patents


Info

Publication number
CN116091955A
Authority
CN
China
Prior art keywords
video frame
segmented
training
video
target
Prior art date
Legal status
Pending
Application number
CN202111299080.1A
Other languages
Chinese (zh)
Inventor
汤成
程宝平
谢小燕
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Hangzhou Information Technology Co Ltd
Priority to CN202111299080.1A
Publication of CN116091955A
Pending legal-status Critical Current

Abstract

The application discloses a segmentation method, a segmentation device, segmentation equipment and a computer readable storage medium, wherein the segmentation method comprises the following steps: acquiring video data to be segmented and a trained segmentation model; determining a plurality of video frames included in the video data, and sequentially determining the plurality of video frames as video frames to be segmented; when at least one historical video frame exists before the video frame to be segmented, determining a reference transparency template of the video frame to be segmented from target transparency templates of the at least one historical video frame; inputting the video frame to be segmented and the reference transparency template into the trained segmentation model to obtain a target transparency template of the video frame to be segmented; and extracting the video frame to be segmented based on the target transparency template to obtain a target foreground image of the video frame to be segmented. In this way, the reference transparency templates corresponding to the historical video frames add spatio-temporal features to the input, which reduces computational complexity, improves the accuracy of the target transparency template, and improves the segmentation effect.

Description

Segmentation method, segmentation device, segmentation equipment and computer readable storage medium
Technical Field
The present application relates to the field of image technology, and relates to, but is not limited to, a segmentation method, apparatus, device, and computer readable storage medium.
Background
With the continuous development of network technology, various Internet communication modes have emerged, and user demand has shifted from voice alone to combined video and audio communication. Video communication services that integrate voice data and video data have therefore become a hotspot in the communication field and are increasingly widely applied in video conferencing, remote video medical treatment, remote video education, and the like.
The background replacement function arises from requirements such as interest and privacy protection in video communication. It relies on video portrait segmentation, which mattes out the portrait region in a video image and replaces the background through image fusion.
Video portrait segmentation algorithms in the related art do not consider the temporal continuity between adjacent frames, so segmentation jitter may occur, which degrades the segmentation effect and reduces segmentation accuracy.
Disclosure of Invention
In view of this, embodiments of the present application provide a segmentation method, apparatus, device, and computer-readable storage medium.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a segmentation method, which comprises the following steps:
Acquiring video data to be segmented and a trained segmentation model;
determining a plurality of video frames included in the video data, and sequentially determining the plurality of video frames as video frames to be segmented;
when at least one historical video frame exists before the video frame to be segmented, determining a reference transparency template of the video frame to be segmented from target transparency templates of the at least one historical video frame;
inputting the video frame to be segmented and the reference transparency template into the trained segmentation model to obtain a target transparency template of the video frame to be segmented;
and extracting the video frame to be segmented based on the target transparency template to obtain a target foreground image of the video frame to be segmented.
An embodiment of the present application provides a segmentation apparatus, including:
the acquisition module is used for acquiring video data to be segmented and a trained segmentation model;
the first determining module is used for determining a plurality of video frames included in the video data and sequentially determining the video frames as video frames to be segmented;
the second determining module is used for determining a reference transparency template of the video frame to be segmented from target transparency templates of at least one historical video frame when the at least one historical video frame exists before the video frame to be segmented;
The segmentation module is used for inputting the video frame to be segmented and the reference transparency template into the trained segmentation model to obtain a target transparency template of the video frame to be segmented;
and the extraction module is used for extracting the video frame to be segmented based on the target transparency template to obtain a target foreground image of the video frame to be segmented.
An embodiment of the present application provides a segmentation apparatus, including:
a processor; and
a memory for storing a computer program executable on the processor;
wherein the computer program, when executed by a processor, implements the segmentation method described above.
Embodiments of the present application provide a computer-readable storage medium having stored therein computer-executable instructions configured to perform the above-described segmentation method.
The embodiment of the application provides a segmentation method, a device, equipment and a computer readable storage medium, wherein the segmentation method comprises the following steps: firstly, obtaining video data to be segmented and a trained segmentation model, then determining a plurality of video frames included in video frame data, and sequentially determining the plurality of video frames as video frames to be segmented; then, under the condition that at least one historical video frame exists before the video frame to be segmented, determining a reference transparency template corresponding to the video frame to be segmented from target transparency templates corresponding to the at least one historical video frame, wherein the reference transparency template is the space-time characteristic of the video frame to be segmented; then, inputting the video frame to be segmented and the reference transparency template into a trained segmentation model, so that a target transparency template of the video frame to be segmented is obtained through the output of the trained segmentation model, and when the video frame to be segmented and the space-time characteristics thereof are fully considered during segmentation through the trained segmentation model, the segmentation process is simplified, and the target transparency template with high accuracy can be obtained rapidly; and finally, extracting the video frame to be segmented based on the target transparency template, so that a target foreground image corresponding to the video frame to be segmented can be obtained, wherein the target foreground image is more real and accords with the actual situation.
Drawings
In the drawings (which are not necessarily drawn to scale), like numerals may describe similar components in different views. The drawings illustrate generally, by way of example and not by way of limitation, various embodiments discussed herein.
Fig. 1 is a schematic flow chart of an implementation of a segmentation method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of an implementation of determining a reference transparency template according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of an implementation of determining a target transparency template according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of an implementation of determining target semantic information according to an embodiment of the present application;
fig. 5 is a schematic flow chart of another implementation of the segmentation method according to the embodiment of the present application;
FIG. 6 is a schematic flow chart of an implementation of determining a trained segmentation model according to an embodiment of the present disclosure;
FIG. 7 is a schematic flow chart of an implementation of determining a preliminary trained segmentation model according to an embodiment of the present disclosure;
FIG. 8 is a flowchart illustrating another implementation of determining a trained segmentation model according to an embodiment of the present disclosure;
fig. 9 is a schematic flow chart of still another implementation of the segmentation method according to the embodiment of the present application;
FIG. 10 is a schematic diagram of a structure of an image data transparency template transformation according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a portrait segmentation network model architecture according to an embodiment of the present application;
fig. 12 is a schematic diagram of a composition structure of a dividing device according to an embodiment of the present disclosure;
fig. 13 is a schematic diagram of a composition structure of a dividing apparatus according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a particular ordering of the objects, it being understood that the "first", "second", "third" may be interchanged with a particular order or sequence, as permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
In the related art, whether to use a preset model for portrait segmentation is determined by judging the difference between the previous and following frames of a video: if the difference is greater than a preset threshold, image segmentation is performed on the first image according to the preset model; if the difference is less than or equal to the preset threshold, the portrait segmentation result is determined from the portrait segmentation result of the second image and a motion estimation algorithm. In this scheme, the spatio-temporal correlation of adjacent frames is considered only in the post-processing stage, which saves a certain amount of computation. However, the setting of the inter-frame difference threshold is strongly tied to the video scene, making the scheme difficult to generalize to different scenes.
Another method integrates dilated (hole) convolution into a MobileNet network model, designs a detail enhancement module, fuses the batch normalization nodes in the model with the adjacent convolution layers, and adjusts the downsampling rate, which reduces the computation of the model. However, the input of the model is a single image, and no richer prior information is used to supervise the correlation between preceding and following video frames, so it is difficult to obtain a good segmentation result when segmenting video images.
The following method also exists in the related art: an image to be processed is obtained and semantically segmented to obtain a semantic segmentation image, which contains a semantically segmented target area and non-target area; pose recognition is then performed on the image to be processed to obtain a pose recognition image in which the skeleton area has been recognized; the target area and non-target area of the semantic segmentation image are then fused with the skeleton area of the pose recognition image to obtain a trimap divided into a foreground area, a background area and an area to be determined; finally, a transparency mask image for separating the image to be processed is generated from the image to be processed and the trimap. This method combines the human skeleton region obtained by pose recognition with the portrait segmentation result, constructs the trimap, and then derives the transparency mask. Although a good segmentation result can be obtained, adding human pose recognition and the trimap makes the computation of the transparency mask large and the calculation process complex, so the method is difficult to apply to real-time scenes.
It is also proposed in the related art that an electronic device may update an initial segmentation model to obtain an updated segmentation model. The electronic device can use the updated segmentation model to segment the foreground and background of an image and obtain a more accurate foreground/background segmentation result, and on that basis obtain a finer image segmentation effect through user post-processing, such as the Graph Cut algorithm. This scheme can achieve high segmentation precision, but it requires manual interaction from the user, is difficult to apply to real-time communication scenes, and has poor real-time performance.
Based on the problems of the related art, the embodiment of the present application provides a segmentation method, and the method provided in the embodiment of the present application may be implemented by a computer program, where the computer program completes the segmentation method provided in the embodiment of the present application when executed. In some embodiments, the computer program may be executed in a processor in the segmentation device. Fig. 1 is a flow chart of an implementation of a segmentation method according to an embodiment of the present application, where, as shown in fig. 1, the segmentation method includes:
step S101, obtaining video data to be segmented and a trained segmentation model.
Here, the video data to be segmented may be live video data, conference video data, on-line training video data, etc., and the video data to be segmented may be collected by a monitoring device, a smart phone, a computer, etc., and after the collection, the video data to be segmented may be stored in a cloud storage device or a local storage device. Based on this, the video data to be segmented can be obtained by a data location or a data content identification instruction.
In the embodiment of the application, the trained segmentation model may be a model based on an artificial intelligence algorithm, where the artificial intelligence algorithm may be a neural network algorithm, a genetic algorithm, a bayesian network algorithm, and the like. In practice, the input of the trained segmentation model comprises the video data to be segmented and the corresponding space-time characteristics thereof, and the segmentation speed and accuracy of the trained segmentation model can be improved based on the space-time characteristics, so that the segmentation effect is improved.
Step S102, determining a plurality of video frames included in the video data, and sequentially determining the plurality of video frames as video frames to be segmented.
In practice, video data is composed of a plurality of video frames, and the plurality of video frames included in the video data can be determined by multimedia video processing technology, for example FFmpeg (Fast Forward MPEG), a set of open-source computer programs that can be used to record and convert digital audio and video and to turn them into streams.
In the embodiment of the application, the plurality of video frames can be sequentially determined to be the video frames to be segmented according to the sequence of the plurality of video frames in the video data; the method for determining the video frames to be segmented is not limited in this application. In actual implementation, one of the plurality of video frames is taken as a video frame to be segmented each time until each video frame is taken as a video frame to be segmented.
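As an illustration of this frame-by-frame processing, the following sketch (not taken from the patent; it assumes OpenCV is available, while ffmpeg could equally be used) iterates over the frames of a video in order, so that each frame can in turn be treated as the video frame to be segmented:

```python
# Illustrative sketch: iterate over the frames of a video in display order so that
# each frame can in turn be treated as the "video frame to be segmented".
import cv2

def iter_frames_to_segment(video_path: str):
    """Yield (frame_index, frame) pairs in display order."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()      # BGR frame; a YUV pipeline would keep the YUV planes instead
        if not ok:
            break
        yield index, frame
        index += 1
    cap.release()
```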
Step S103, when at least one historical video frame exists before the video frame to be segmented, determining a reference transparency template of the video frame to be segmented from target transparency templates of the at least one historical video frame.
Here, the presence of at least one historical video frame before the video frame to be segmented means that: the video frame to be segmented is not a first frame video frame in the video data but a second frame video frame or a video frame following the second frame in the video data. Based on this, the number of frames of the at least one historical video frame is at least 1, and may be 1, 2, 3, etc.
In the embodiment of the application, the historical video frame that is n-i frames ahead of the video frame to be segmented can be determined as the target historical video frame according to a default rule or a preset rule. Because the target historical video frame is related to the video frame to be segmented, its target transparency template can serve as a reference for the target transparency template of the video frame to be segmented. On this basis, the transparency template corresponding to the target historical video frame is determined as the reference transparency template of the video frame to be segmented. Here n is the number of frames of the at least one historical video frame, n is an integer greater than or equal to 1, and i is an integer greater than or equal to 0 and less than n.
For example, if n is 10, the video frame to be segmented is the 11th video frame. i may take the value 1, i.e., the video frame one frame ahead of the video frame to be segmented is determined as the target historical video frame; i may take the value 3, i.e., the video frame three frames ahead of the video frame to be segmented is determined as the target historical video frame; i may take the value 5, i.e., the video frame five frames ahead of the video frame to be segmented is determined as the target historical video frame.
In some embodiments, if no historical video frame exists before the video frame to be segmented, which indicates that the video frame to be segmented is the first frame of the video data, the Y component of the video frame to be segmented is determined as its reference transparency template.
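The selection logic of step S103 and the first-frame fallback can be summarized by the following hedged sketch (the function and parameter names are illustrative assumptions, not taken from the patent):

```python
# Hypothetical sketch of reference-template selection: reuse the target transparency
# template of a chosen historical frame, or fall back to the Y component for the
# first frame of the video.
import numpy as np

def select_reference_template(frame_yuv: np.ndarray,
                              history_templates: list,
                              offset: int = 1) -> np.ndarray:
    """frame_yuv: HxWx3 YUV frame; history_templates: target templates of past frames."""
    if not history_templates:                                # first frame of the video
        return frame_yuv[..., 0].astype(np.float32) / 255.0  # Y component as reference
    offset = min(offset, len(history_templates))
    return history_templates[-offset]                        # e.g. the template of frame t-1
```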
Step S104, inputting the video frame to be segmented and the reference transparency template into the trained segmentation model to obtain a target transparency template of the video frame to be segmented.
In this embodiment of the present application, the encoding format of the video frame to be segmented determined in step S102 may be YUV encoding, and then the video frame to be segmented includes a Y component, a U component, and a V component.
In step S104, the Y component, the U component, the V component and the reference transparency template are input into the trained segmentation model. On one hand, semantic information extraction is performed on the Y component, the U component, the V component and the reference transparency template to obtain target semantic information; on the other hand, edge information extraction is performed on them at the same time to obtain target edge information. The target semantic information and the target edge information are then fused to obtain target fusion information, and finally feature extraction is performed on the target fusion information to obtain the target transparency template corresponding to the video frame to be segmented.
In practice, the target transparency template may be a black-and-white image consisting of 0s and 1s, where black represents the background image and is denoted by 0, and white represents the foreground image and is denoted by 1. The size of the target transparency template is the same as the size of the video frame to be segmented.
Step S105, extracting the video frame to be segmented based on the target transparency template to obtain a target foreground image of the video frame to be segmented.
Here, an AND operation (element-wise multiplication) may be performed between the target transparency template and the video frame to be segmented, so that the background pixels take the value 0 while the foreground pixels keep their original values, thereby realizing the extraction processing of the video frame to be segmented.
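A minimal sketch of this extraction step, assuming the transparency template takes values in [0, 1]:

```python
# Multiplying the frame by the transparency template zeroes out the background
# pixels and keeps the foreground pixel values unchanged.
import numpy as np

def extract_foreground(frame: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """frame: HxWxC image; alpha: HxW transparency template with values in [0, 1]."""
    return (frame.astype(np.float32) * alpha[..., None]).astype(frame.dtype)
```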
The embodiment of the application provides a segmentation method, which comprises the steps of firstly acquiring video data to be segmented and a trained segmentation model, then determining a plurality of video frames contained in video frame data, and sequentially determining the plurality of video frames as video frames to be segmented; then, under the condition that at least one historical video frame exists before the video frame to be segmented, determining a reference transparency template corresponding to the video frame to be segmented from target transparency templates corresponding to the at least one historical video frame, wherein the reference transparency template is the space-time characteristic of the video frame to be segmented; then, inputting the video frame to be segmented and the reference transparency template into a trained segmentation model, so that a target transparency template of the video frame to be segmented is obtained through the output of the trained segmentation model, and when the video frame to be segmented and the space-time characteristics thereof are fully considered during segmentation through the trained segmentation model, the segmentation process is simplified, and the target transparency template with high accuracy can be obtained rapidly; and finally, extracting the video frame to be segmented based on the target transparency template, so that a target foreground image corresponding to the video frame to be segmented can be obtained, wherein the target foreground image is more real and accords with the actual situation.
In some embodiments, in implementing step S103 "when there is at least one history video frame before the video frame to be segmented, determining the reference transparency template of the video frame to be segmented from the target transparency templates of the at least one history video frame", as shown in fig. 2, may be implemented by the following steps S1031 to S1033:
step S1031, determining the number of frames of at least one historical video frame, and recording the number of frames as n.
Here, n is an integer greater than or equal to 1. In some embodiments, a frame identification may be included in any video frame in the video data that characterizes the position of the any video frame in the plurality of video frames. Illustratively, if the frame identification characterizes that the video frame to be segmented is at the position of the eleventh frame, then ten frame history video frames are present before the video frame to be segmented, i.e. the number of frames n is 10.
Step S1032, determining the historical video frame that is n-i frames ahead of the video frame to be segmented as the target historical video frame.
Here, i is an integer of zero or more and less than n. For example, taking n as 10, in theory, the value of i may be any integer from 0 to 9, in practice, in order to improve the correlation between the video frame to be segmented and the target historical video frame, so that the target historical video frame has a strong reference meaning, the target historical video frame is not suitable to be far away from the video frame to be segmented, that is, the value of i is not suitable to be too large, and in general, the value of i may be 1, 2, 3, 5, and so on.
In the embodiment of the application, when the value of i is 1, determining a historical video frame which leads a video frame to be segmented by one frame as a target historical video frame; when the value of i is 2, determining the historical video frames of two frames of the video frames to be segmented in advance as target historical video frames; when the value of i is 3, determining the historical video frames of three frames of the video frames to be segmented in advance as target historical video frames; similarly, when i takes a value of 5, a history video frame five frames ahead of the video frame to be segmented is determined as a target history video frame.
In actual implementation, i can be determined first, and then the historical video frame leading the n-i frame of the video frame to be segmented is determined as the target historical video frame.
Step S1033, determining the target transparency template corresponding to the target historical video frame as the reference transparency template of the video frame to be segmented.
Here, a target transparency template corresponding to the target historical video frame may be obtained first, where the target transparency corresponding to the target historical video frame is also determined based on the steps S101 to S105 described above; and then, determining the target transparency template corresponding to the target historical video frame as a reference transparency template of the video frame to be segmented.
In the embodiment of the present application, through the steps S1031 to S1033, the number n of frames of at least one historical video frame is determined first; then, determining a historical video frame of an n-i frame of the video frame to be segmented in advance as a target historical video frame, wherein i is an integer greater than or equal to zero and less than n, so as to determine a historical video frame related to the video frame to be segmented; and finally, determining the target transparency template corresponding to the target historical video frame as a reference transparency template of the video frame to be segmented, wherein the target transparency template corresponding to the target historical video frame is the space-time characteristic of the video data, so that the space-time characteristic of the video data can be simply and efficiently obtained.
In some embodiments, step S104 "inputting the video frame to be segmented and the reference transparency template into the trained segmentation model to obtain the target transparency template of the video frame to be segmented" when actually implemented, referring to fig. 3, may be implemented by the following steps S1041 to S1044:
step S1041, extracting semantic information from the video frame to be segmented and the reference transparency template to obtain target semantic information.
In actual implementation, referring to fig. 4, step S1041 may be implemented by the following steps S411 to S413:
Step S411, local feature extraction and splicing processing are sequentially carried out on the video frame to be segmented and the reference transparency template, and a first local feature is obtained.
Here, local feature extraction can be performed on the video frame to be segmented and the reference transparency template through convolution layers to obtain an extraction result. Each convolution layer consists of a plurality of convolution units, and the parameters of each convolution unit are optimized by a back-propagation algorithm. The purpose of the convolution operation is to extract different features of the input: the first convolution layer can extract low-level features such as edges, lines and corners, and deeper structures can iteratively extract more complex features from these low-level features. Step S411 extracts the more abstract features, such as classification information in the video frame, that is, whether a certain position in the video frame belongs to the background or the foreground. The convolution layer may be, for example, a 3*3 convolution layer.
In the embodiment of the present application, the extraction results may be spliced based on the order of Y, U, V and the reference template, so as to obtain the first local feature. Of course, the extraction results may also be spliced based on the sequence of the reference templates Y, U and V, and the embodiment of the present application does not limit the sequence of the splicing.
In step S412, the first local feature is downsampled and the feature is extracted to obtain a second local feature.
Here, the first local feature may be downsampled and feature extracted multiple times, in this embodiment, taking two downsampling and feature extraction as examples, the first local feature is downsampled for the first time to obtain a first downsampling result, and then the feature extraction is performed on the first downsampling result to obtain a first extraction result; and then performing second downsampling on the first extraction result to obtain a second downsampling result, and performing feature extraction on the second downsampling result to obtain a second local feature.
The downsampling multiple may be 4, 16, 64, and so on. Continuing the above example, the first downsampling multiple may be 4; after the first downsampling, the length and width of the first local feature are each reduced to 1/2 of the input size, which reduces the computation of each part. Feature extraction can then continue on the first downsampling result through a plurality of dense comprehensive convolution modules (for example, 2 dense comprehensive convolution modules), thereby extracting more abstract classification features. The second downsampling multiple may then be 16; after the second downsampling, the length and width of the first extraction result are each reduced to 1/4 of the input size, further reducing the computation. A second feature extraction is then performed through two dense comprehensive convolution modules to obtain the second local feature.
Step S413, up-sampling the second local feature to obtain the target semantic information.
Here, the up-sampling multiple corresponds to the down-sampling multiples in step S412: if 4-times and 16-times downsampling were performed in step S412, the up-sampling multiple is 64, so that the size is restored to the input size. Up-sampling restoration yields the target semantic information, which contains the relatively abstract classification feature information in the video frame to be segmented, that is, the foreground and background information of the video frame to be segmented.
Step S1042, extracting edge information of the video frame to be segmented and the reference transparency template to obtain target edge information.
In actual implementation, step S1042 may be implemented by the following steps S421 and S422:
step S421, extracting local features of the video frame to be segmented and the reference transparency template to obtain a third local feature.
Here, the method of local feature extraction in step S421 is similar to the method of local feature extraction in step S411, and thus, the method of local feature extraction in step S421 may refer to the method of local feature extraction in step S411.
Step S422, extracting the third local feature to obtain the target edge information.
Here, the feature extraction may be performed on the third local feature by using a dense comprehensive convolution module, so as to obtain target edge information, where the target edge information refers to image information of a boundary between a foreground and a background in the video frame to be segmented.
Step S1043, fusing the target semantic information and the target edge information to obtain target fusion information.
Here, the target semantic information and the target edge information can be fused by a feature map element-by-element addition method, so as to obtain target fusion information containing abstract classification feature information and edge information.
And step S1044, extracting features of the target fusion information to obtain a target transparency template.
Here, the method of feature extraction in step S1044 is similar to the method of local feature extraction in step S411, and thus, the method of local feature extraction in step S1044 may refer to the method of local feature extraction in step S411.
In the embodiment of the present application, through the steps S1041 to S1044, on the one hand, extraction of abstract classification features in the video frame to be segmented and the reference transparency template is achieved through feature extraction, stitching, downsampling, upsampling, and the like; and meanwhile, the target edge information of the video frame to be segmented and the reference transparency template is obtained through feature extraction, finally, abstract classification features and the target edge information are fused, and the target transparency template of the video frame to be segmented is obtained after feature extraction.
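The overall flow of steps S1041 to S1044 can be sketched as follows (a conceptual sketch only: semantic_branch, edge_branch and fusion_head are placeholder callables, not components defined by the patent):

```python
# Conceptual sketch of steps S1041-S1044: semantic and edge features are extracted
# from the 4-channel YUV + reference-template input, fused by element-wise addition,
# and a final convolutional head produces the target transparency template.
import torch

def predict_alpha(x_yuv_mask: torch.Tensor,
                  semantic_branch, edge_branch, fusion_head) -> torch.Tensor:
    """x_yuv_mask: (N, 4, H, W) tensor stacking Y, U, V and the reference template."""
    semantic = semantic_branch(x_yuv_mask)    # coarse, high-level foreground/background cues
    edge = edge_branch(x_yuv_mask)            # shallow edge detail information
    fused = semantic + edge                   # element-wise fusion of the two branches
    return torch.sigmoid(fusion_head(fused))  # target transparency template in [0, 1]
```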
In some embodiments, the video frame to be segmented includes a Y component, a U component and a V component. After step S102, as shown in fig. 5, the following steps may also be performed:
step S103', determining whether there is at least one historical video frame before the video frame to be segmented.
Here, whether at least one historical video frame exists before the video frame to be segmented may be determined by judging whether the video frame to be segmented is the first frame of the video data. If the video frame to be segmented is the first frame of the video data, no historical video frame exists before it, and the process proceeds to step S104'; if the video frame to be segmented is not the first frame of the video data, at least one historical video frame exists before it, and the process proceeds to step S106'.
Step S104', the Y component is determined as a reference transparency template for the video frame to be segmented.
At this time, any historical video frame does not exist before the video frame to be segmented, and in order to improve the segmentation reliability, the Y component of the video frame to be segmented is determined as a reference transparency template of the video frame to be segmented.
Step S105', the video frame to be segmented and the reference transparency template are input into the trained segmentation model, and the target transparency template of the video frame to be segmented is obtained.
Here, the implementation of step S105 'is similar to that of step S104, and thus, the implementation of step S105' may refer to that of step S104.
Step S106' advances to step S103.
At this time, it is characterized that there is at least one historical video frame before the video frame to be segmented, and since step S103 is for the processing situation that there is at least one historical video frame before the video frame to be segmented, step S103 is performed.
Through the above steps S103' to S106', when no historical video frame exists before the video frame to be segmented, the Y component of the video frame to be segmented is determined as its reference transparency template; when at least one historical video frame exists before the video frame to be segmented, the reference transparency template is determined from the target transparency templates corresponding to the at least one historical video frame. Finally, the video frame to be segmented and the reference transparency template are input into the trained segmentation model to obtain the target transparency template of the video frame to be segmented. In other words, whether or not at least one historical video frame exists before the video frame to be segmented, the video frame can be segmented accurately, which improves the completeness of the segmentation scheme.
In some embodiments, before the step S101 "acquire the video data to be segmented and the trained segmentation model" is performed, the trained segmentation model is further determined, and referring to fig. 6, the trained segmentation model may be determined by the following steps S001 to S003:
step S001, obtaining video training data, picture training data and a preset segmentation model.
Here, the picture training data includes a training picture, a first training label corresponding to the training picture, and a training picture transparency template corresponding to the training picture; the video training data comprises training video frames, second training labels corresponding to the training video frames and training video frame transparency templates corresponding to the training video frames.
In the embodiment of the application, the video training data and the picture training data can be obtained from a general-purpose server or a dedicated server. In practice, the picture training data may also be obtained from the Portrait Matting dataset. In the preset segmentation model, each weight value has not yet been determined; at this time, each weight value is a default initial value or a randomly generated value.
In some embodiments, the training picture in the picture training data and the first training label corresponding to the training picture may be obtained through step S001, and the training picture transparency template corresponding to the training picture may be obtained through steps S0011 and S0012 as follows:
And step S0011, labeling the training pictures to obtain an original transparency template.
Here, for example, the training picture is obtained from the Portrait Matting dataset. Because this dataset already provides a rough transparency template, the transparency template is further refined with image processing software, which completes the labeling of the training picture and yields the original transparency template corresponding to the training picture.
And step S0012, performing rigid transformation and/or non-rigid transformation treatment on the original transparency template to obtain the transparency template of the training picture.
Here, since the training picture is a single picture with no temporally related pictures, its motion can be simulated by rigid and/or non-rigid transformation, thereby obtaining the training picture transparency template. A rigid transformation refers to translation and rotation of the image in the original transparency template, in which the shape of the image is unchanged; a non-rigid transformation is a more complex transformation than a rigid transformation, for example scaling, affine, projection, or polynomial transformations.
In the embodiment of the application, the training picture transparency template can be regarded as the space-time characteristic of the training picture.
Step S002, training the preset segmentation model by using the picture training data to obtain a preliminarily trained segmentation model.
In actual implementation, referring to fig. 7, step S002 may be implemented by the following steps S021 to S023:
step S021, inputting the training picture and the training picture transparency template into a preset segmentation model to obtain a first prediction label corresponding to the training picture.
Here, the first training is performed on the preset segmentation model, and the obtained result is the first prediction label corresponding to the training picture, in practice, the training process is similar to the reasoning process, so the implementation process of the process step S021 is similar to the implementation process of the step S104, and therefore, the implementation process of the step S021 can refer to the implementation process of the step S104.
Step S022, first error information between the first training label and the first prediction label is acquired.
Here, the first training label and the first prediction label are both matrices, and a distance between the first training label and the first prediction label may characterize a difference therebetween, and thus, the distance between the first training label and the first prediction label may be determined as the first error information.
And step S023, carrying out back propagation training on the preset segmentation model based on the first error information and the first error threshold value to obtain a preliminarily trained segmentation model.
Here, the first error threshold may be a default value or a custom set value, and under the condition that the first error information is smaller than the first error threshold, the first error information is represented within an allowable range, the training stop condition is met, training is not required to be continued, and the initially trained segmentation model is represented at the moment; and if the first error information is greater than or equal to the first error threshold, the first error information is characterized as not being within the allowable range, the training stop condition is not satisfied, the segmentation model which is not initially trained at the moment is characterized, and the training still needs to be continued.
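A hedged sketch of this first training stage (steps S021 to S023) is given below; the optimizer settings and the epoch-level stopping check are assumptions for illustration:

```python
# Sketch of stage-one training: forward pass on the picture training data, error
# against the first training label, back-propagation, and stopping once the error
# falls below the first error threshold.
import torch

def pretrain_on_images(model, loader, loss_fn, error_threshold: float,
                       lr: float = 1e-2, max_epochs: int = 100):
    optim = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-4)
    for _ in range(max_epochs):
        epoch_error = 0.0
        for image_and_mask, label in loader:      # 4-channel input and first training label
            pred = model(image_and_mask)          # first prediction label
            error = loss_fn(pred, label)          # first error information
            optim.zero_grad()
            error.backward()                      # back-propagation training
            optim.step()
            epoch_error += error.item()
        if epoch_error / len(loader) < error_threshold:
            break                                 # training stop condition satisfied
    return model                                  # preliminarily trained segmentation model
```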
And step S003, continuing training the preliminarily trained segmentation model by utilizing video training data to obtain a trained segmentation model.
In actual implementation, referring to fig. 8, step S003 may be implemented by the following steps S031 to S033:
step S031, inputting the training video frames and the transparency templates of the training video frames into the preliminarily trained segmentation model to obtain second prediction labels corresponding to the training video frames.
Here, the implementation procedure of step S031 is similar to that of step S021, and therefore, the implementation procedure of step S031 can refer to the implementation procedure of step S021.
Step S032, obtaining second error information between the second training label and the second prediction label.
Here, the implementation of step S032 is similar to that of step S022, and thus, the implementation of step S032 can refer to the implementation of step S022.
And step S033, performing back propagation training on the preliminarily trained segmentation model based on the second error information and the second error threshold value to obtain a trained segmentation model.
Here, the second error threshold may be a default value or a custom set value, and in addition, the second error threshold may be the same as the first error threshold or different from the first error threshold, which is not limited in this embodiment.
Under the condition that the second error information is smaller than a second error threshold value, representing that the second error information is within an allowable range, meeting training stop conditions, and representing that a trained segmentation model is obtained at the moment without continuing training; and if the second error information is greater than or equal to the second error threshold, the second error information is not in the allowable range, the training stop condition is not satisfied, the segmentation model which is not trained at the moment is characterized, and the training is still needed to be continued.
Through steps S001 to S003, video training data, picture training data and a preset segmentation model are first obtained. Because training of the preset segmentation model can be completed simply and efficiently with the picture training data, the preset segmentation model is first trained with the picture training data to obtain a preliminarily trained segmentation model; the preliminarily trained segmentation model is then trained further with the video training data to finally obtain the trained segmentation model. In this way, a trained segmentation model with higher accuracy is obtained through staged model training.
Based on the above embodiments, the embodiments of the present application further provide a segmentation method, where the segmentation method in the embodiments of the present application is a high-efficiency video portrait segmentation method based on YUV color space and space-time feature, and may be used in scenes such as video communication, live broadcast, etc., and the segmentation method includes:
Firstly, data collection and labeling. The data mainly comprises image data and video data, where the image data corresponds to the picture training data in the above embodiment and the video data corresponds to the video training data. The image data uses an open-source dataset, while the video data is mainly collected by a collection device in a communication scene, which may be a Universal Serial Bus (USB) camera, a smartphone, or a set-top box with a camera. As for annotation, the annotation information of the open-source dataset often contains noise and edge flaws, so image processing software (for example, Adobe Photoshop) is needed to refine the transparency template; for the video data, a frame extraction tool (for example, ffmpeg) is used to obtain all video frames, and image processing software (for example, Adobe Photoshop) is then used to matte out the portrait transparency template.
Secondly, training the video portrait segmentation model. Here, the video portrait segmentation model corresponds to the preset segmentation model in the above embodiment. Because the labor cost of segmenting and labeling video frame images is high, the embodiment of the application proposes a method of training video portrait segmentation with single images: the transparency template difference caused by the motion between adjacent frames in a video is simulated by applying rigid and non-rigid transformations to the transparency template of a single image. The rigid transformation comprises translation, rotation, and combinations of the two; translation simulates the horizontal movement of the portrait relative to the lens plane. The non-rigid transformation includes scaling, affine, projection, polynomial, and local motion transformations, which simulate the vertical motion of the portrait relative to the lens plane; the local motion is simulated with thin-plate spline (TPS) deformations of K control points. In this way, training is first performed on the image data to obtain a first model weight; based on the first model weight, transfer learning is then performed with real video portrait segmentation data to obtain a second model weight.
Thirdly, deploying the trained model to intelligent equipment to perform video image segmentation. Here, the trained model corresponds to the trained segmentation model in the above embodiment.
In the embodiment of the present application, as shown in fig. 9, the segmentation method includes the following steps S901 to S905 in the model training phase:
step S901, image video data is acquired.
Here, the image and video data includes image data and video data. The image data may directly use the open-source Portrait Matting dataset, and the video data may use video data collected by a collection device, which may be a USB camera, a smartphone, or a set-top box with a camera.
Step S902, data labeling.
Here, after the original data is obtained, it is labeled and cleaned to produce the current-frame transparency template labels. For the image data, the Portrait Matting dataset is used; it already has a rough transparency template, which can be further refined with Adobe Photoshop. For the video data, the video is split into video frames with tools such as ffmpeg, and the transparency template is then obtained by matting each frame with Adobe Photoshop. In order to use the spatio-temporal characteristics in the video to supervise network training, the embodiment of the application proposes using the current video frame together with the previous-frame transparency template (Mask) as the input of the video data portrait segmentation model. Meanwhile, video frames are usually stored in YUV color space instead of the RGB color space common for images; to simplify the calculation process, the embodiment of the application combines YUV and Mask into a four-channel input to the network for both training and inference.
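Under the YUV420 assumption described above, assembling the four-channel network input might look like the following sketch (the function name and normalization are illustrative assumptions):

```python
# Sketch: resize the subsampled U and V planes of a YUV420 frame to full resolution
# and stack them with Y and the previous-frame transparency template (Mask), so that
# no YUV-to-RGB conversion is needed.
import numpy as np
import cv2

def make_yuv_mask_input(y: np.ndarray, u: np.ndarray, v: np.ndarray,
                        mask: np.ndarray) -> np.ndarray:
    """y, mask: HxW arrays in [0, 255]; u, v: (H/2)x(W/2) planes. Returns HxWx4 float32."""
    h, w = y.shape
    u_full = cv2.resize(u, (w, h), interpolation=cv2.INTER_LINEAR)
    v_full = cv2.resize(v, (w, h), interpolation=cv2.INTER_LINEAR)
    x = np.stack([y, u_full, v_full, mask], axis=-1).astype(np.float32)
    return x / 255.0
```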
Step S903, the image data transparency template is transformed.
Here, to make full use of the image data, the embodiment of the present application proposes a method of training video portrait segmentation with single images. Since image data has no previous-frame data, the input Mask is determined as shown in fig. 10. Referring to fig. 10, a) is an original image and b) is the labeled transparency template obtained by labeling a) in step S902. On this basis, b) is scaled to obtain the scaled transparency template c); in practice, the scale factor is uniformly distributed in [0.9, 1.1]. Then c) undergoes a translational rigid transformation to obtain the translated transparency template d); in practice, the translation in the up, down, left and right directions is uniformly distributed in [0, 0.05] times the original dimensions. Finally, a non-rigid TPS deformation is applied to d) to obtain the transparency template after TPS deformation, namely e); K may take the value 5 during TPS deformation, which is equivalent to simulating the local motion transformation of the portrait with a TPS of 5 control points.
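A sketch of this transparency-template transformation is shown below. Only the scaling and translation steps are written out; the K = 5 control-point TPS deformation is left as a placeholder comment, since a full thin-plate-spline warp is beyond a short example, and the crop/pad handling is an assumption:

```python
# Simulate inter-frame motion on a single labeled transparency template:
# random scaling in [0.9, 1.1], random translation up to 0.05 of the image size,
# and (as a placeholder) a TPS deformation with 5 control points.
import numpy as np
import cv2

def augment_mask(mask: np.ndarray) -> np.ndarray:
    h, w = mask.shape
    # scaling, with the scale factor drawn uniformly from [0.9, 1.1]
    s = np.random.uniform(0.9, 1.1)
    scaled = cv2.resize(mask, None, fx=s, fy=s, interpolation=cv2.INTER_LINEAR)
    canvas = np.zeros_like(mask)
    ch, cw = min(h, scaled.shape[0]), min(w, scaled.shape[1])
    canvas[:ch, :cw] = scaled[:ch, :cw]
    # rigid translation, up to 0.05 of the original size in each direction
    dx = int(np.random.uniform(-0.05, 0.05) * w)
    dy = int(np.random.uniform(-0.05, 0.05) * h)
    m = np.float32([[1, 0, dx], [0, 1, dy]])
    translated = cv2.warpAffine(canvas, m, (w, h))
    # TODO: non-rigid TPS deformation with K = 5 control points to simulate local motion
    return translated
```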
In addition, to simulate the situation in which no previous-frame transparency template exists for the first frame of a video, when the current frame is the first frame of the video, the Y channel of the current frame's YUV color space can be used as the Mask, and training of the current frame is then completed based on YUV and this Mask.
Step S904, training a single image segmentation network.
After obtaining the training data labeled and transformed with transparency templates, a portrait segmentation network model is first trained on the image data, where the portrait segmentation network model corresponds to the preset segmentation model in the above embodiment and its structure is shown in fig. 11. The input part of the portrait segmentation network differs from the RGB color-space input of traditional portrait segmentation, because video frame data in video coding is usually stored in YUV420 or YUV420P format, and a traditional image segmentation algorithm has to convert the YUV video frame into RGB before feeding it to the model, which brings extra computation. Meanwhile, to make the segmentation effect more robust and reduce segmentation jitter between adjacent frames during video segmentation, the embodiment of the application provides a segmentation method combining video spatio-temporal features: the input also takes into account the segmentation result of the previous frame, namely the previous-frame transparency template. Therefore, the input of the segmentation method in the embodiment of the application mainly consists of the following two parts: the YUV-encoded current frame and the previous-frame transparency template (Mask).
The backbone of the portrait segmentation network model has a dual-branch structure, as shown in fig. 11. The branch in the upper half of fig. 11 is the coarse branch 1101, which has more network layers and is used to extract high-level semantic information; the branch in the lower half is the refined branch 1102, which uses only a small number of convolution layers to extract shallow edge detail information. The coarse branch 1101 first passes through a 3*3 convolution layer 11011 and then splices the input data and downsamples by 4 times 11012, so that the length and width of Y, U, V and Mask are each reduced to 1/2 of the input size; then two dense comprehensive convolution modules 11013 are applied, and the spliced input data is downsampled by 16 times 11014, reducing the length and width of Y, U, V and Mask to 1/4 of the input size; next, after two dense comprehensive convolution modules 11015, the feature map (at 1/4 of the input size) is restored by up-sampling layer 11016, where up-sampling layer 11016 may use bilinear interpolation. The refined branch 1102 passes through a 3*3 convolution layer 11021 and then through a dense comprehensive convolution module 11022 to obtain the refined branch result. Finally, the coarse and refined branch results are added element by element 1103 and passed through a 3*3 convolution layer 1104 to yield the final transparency template 110.
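A hedged PyTorch skeleton of this dual-branch architecture is given below. The dense comprehensive convolution module is approximated by a plain residual two-layer convolution block, the downsampling is done with strided convolutions, and the channel widths are assumptions rather than values taken from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseConvBlock(nn.Module):
    """Stand-in for the 'dense comprehensive convolution module' of Fig. 11."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x) + x

class PortraitSegNet(nn.Module):
    def __init__(self, ch: int = 32):
        super().__init__()
        # coarse branch 1101: 3x3 conv, two downsampling stages with dense blocks, upsample
        self.coarse_in = nn.Conv2d(4, ch, 3, padding=1)           # Y, U, V, Mask input
        self.down1 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)    # H, W -> 1/2
        self.coarse_mid = nn.Sequential(DenseConvBlock(ch), DenseConvBlock(ch))
        self.down2 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)    # H, W -> 1/4
        self.coarse_deep = nn.Sequential(DenseConvBlock(ch), DenseConvBlock(ch))
        # refined branch 1102: 3x3 conv plus one dense block at full resolution
        self.refine_in = nn.Conv2d(4, ch, 3, padding=1)
        self.refine_block = DenseConvBlock(ch)
        # final 3x3 conv 1104 producing the transparency template
        self.head = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, x):                                         # x: (N, 4, H, W)
        c = self.coarse_mid(self.down1(self.coarse_in(x)))
        c = self.coarse_deep(self.down2(c))
        c = F.interpolate(c, size=x.shape[-2:], mode="bilinear", align_corners=False)
        r = self.refine_block(self.refine_in(x))
        return torch.sigmoid(self.head(c + r))                    # element-wise fusion 1103
```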
During training, a stochastic gradient descent algorithm with momentum can be used as the optimizer, and L2 regularization is used to constrain the model weights. The loss function jointly considers the edge loss of the portrait segmentation and a cross-entropy loss that supervises the learning of the network; the total loss function is shown in Formula 1 below:
L_total = λ * L_ce + (1 - λ) * L_b    (1);
wherein L_total denotes the total loss function, λ denotes the weight coefficient, and L_ce denotes the cross-entropy loss; a cross-entropy function with a focal-loss coefficient may be used here, as shown in Formula 2, to mitigate the imbalance between the portrait foreground and the remaining background in the data set. L_b denotes the boundary loss: image morphology operations are performed on the transparency template output by the portrait segmentation network model and on the labeled template, namely the transparency template is dilated to obtain a dilation template, the dilation template is then eroded to obtain an erosion template, and finally the difference between the dilation template and the erosion template gives the edge information of the image; the boundary loss is then calculated through Formula 3.
(Formulas 2 and 3, which define the focal cross-entropy loss L_ce and the boundary loss L_b respectively, are rendered as images in the source document.)
wherein y denotes the label information at a specific position in the image, ŷ denotes the prediction result of the portrait segmentation network model, and γ > 0; adjusting the value of γ influences the loss weight that the portrait segmentation network model places on hard-to-segment samples. The gradients of the existing neural-network parameters are calculated according to the loss function, and the parameters are updated using stochastic gradient descent with momentum until the portrait segmentation network converges, thereby obtaining the first model weights.
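A sketch of the combined loss in Formula 1 is shown below under stated assumptions: the focal cross-entropy term uses the standard binary focal form, and the boundary term compares prediction and label only on an edge band extracted by the dilation-minus-erosion trick described above. The kernel size, γ, λ and the L1 form of the boundary term are illustrative choices, not the patent's exact formulas.

```python
# Hypothetical loss sketch: total = lambda * focal CE + (1 - lambda) * boundary.
# The boundary region is obtained by dilating and eroding the label template
# and taking their difference, as described in the text above.
import torch
import torch.nn.functional as F


def focal_bce(pred, target, gamma=2.0, eps=1e-6):
    # Standard binary focal cross-entropy; gamma > 0 upweights hard samples.
    pred = pred.clamp(eps, 1 - eps)
    loss = -(target * (1 - pred) ** gamma * torch.log(pred)
             + (1 - target) * pred ** gamma * torch.log(1 - pred))
    return loss.mean()


def edge_band(mask, kernel_size=5):
    # Morphological dilation/erosion via max-pooling on a binary template.
    pad = kernel_size // 2
    dilated = F.max_pool2d(mask, kernel_size, stride=1, padding=pad)
    eroded = -F.max_pool2d(-mask, kernel_size, stride=1, padding=pad)
    return (dilated - eroded).clamp(0, 1)      # 1 inside the edge band


def total_loss(pred, target, lam=0.7, gamma=2.0):
    l_ce = focal_bce(pred, target, gamma)
    band = edge_band(target)
    # Boundary loss: L1 between prediction and label restricted to the band
    # (an assumed stand-in for Formula 3, which is not recoverable from the text).
    l_b = (torch.abs(pred - target) * band).sum() / (band.sum() + 1e-6)
    return lam * l_ce + (1 - lam) * l_b
```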
In step S905, transfer training for video portrait segmentation is performed.
Here, after the first model weights are obtained, training of the video-data portrait segmentation model is performed. First, considering the selection of the previous-frame transparency template of the video, the current frame is denoted as frame t, and the transparency template of frame t-1, t-3 or t-5 is used at random as the Mask, so as to simulate the differences brought by the motion of people at different frame rates, with different persons and in different scenes. When t-1, t-3 or t-5 is less than 0, the Y channel of the current frame's YUV color space can be used as the Mask to complete the training of the current frame. The structure of the video-data portrait segmentation model is consistent with the model structure in step S904. Since the input Mask of the video-data portrait segmentation model is accurate at this time, and a better segmentation effect is desired, the cross-entropy loss function with the focal-loss coefficient in the loss function of the video-data portrait segmentation model is replaced with the Lovász loss function, as shown in Formula 4, Formula 5 and Formula 6.
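Before turning to the loss formulas, the frame-index selection just described can be sketched as follows; the dataset access is a hypothetical placeholder and only the t-1/t-3/t-5 choice with the Y-channel fallback is taken from the text.

```python
# Hypothetical selection of the training Mask for frame t: randomly pick the
# transparency template of frame t-1, t-3 or t-5, and fall back to the Y
# channel of the current frame when that index is negative (first frames).
import random
import numpy as np


def choose_training_mask(t: int, alpha_templates: list[np.ndarray],
                         y_plane: np.ndarray) -> np.ndarray:
    offset = random.choice([1, 3, 5])   # simulate different frame rates / motion
    ref = t - offset
    if ref < 0:
        # No usable history: use the Y channel of the current frame as the Mask.
        return y_plane
    return alpha_templates[ref]
```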
(Formulas 4, 5 and 6, which define the Lovász loss used in this stage, are rendered as images in the source document.)
wherein y denotes the label information at a specific position in the image, ŷ denotes the prediction result of the video-data portrait segmentation model, Formula 5 gives the hinge loss predicted for the current position, and Formula 6 gives the Lovász extension of the Jaccard loss. As shown in Formula 6, the Lovász loss reduces the loss more effectively than the Jaccard loss. The first network weight file is loaded, the gradients of the existing neural-network parameters are calculated according to the loss function, and the parameters are updated using stochastic gradient descent with momentum until the network converges, thereby obtaining the second model weights, i.e., the final training result.
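Since Formulas 4 to 6 are only available as images in the source, the widely used binary Lovász hinge formulation (after Berman et al.) is sketched below as a reference point for the loss family named here; it may differ in detail from the exact form intended by the patent.

```python
# Commonly used binary Lovász hinge loss, shown only as a reference
# implementation of the loss family named in the text above.
import torch
import torch.nn.functional as F


def lovasz_grad(gt_sorted):
    # Gradient of the Lovász extension of the Jaccard loss w.r.t. sorted errors.
    gts = gt_sorted.sum()
    intersection = gts - gt_sorted.cumsum(0)
    union = gts + (1.0 - gt_sorted).cumsum(0)
    jaccard = 1.0 - intersection / union
    if gt_sorted.numel() > 1:
        jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard


def lovasz_hinge(logits, labels):
    # logits: raw scores; labels: {0, 1}; both flattened to 1-D.
    logits, labels = logits.view(-1), labels.view(-1).float()
    signs = 2.0 * labels - 1.0
    errors = 1.0 - logits * signs                 # per-pixel hinge losses
    errors_sorted, perm = torch.sort(errors, descending=True)
    grad = lovasz_grad(labels[perm])
    return torch.dot(F.relu(errors_sorted), grad)
```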
In the embodiment of the application, firstly, the YUV color space is used for model inference; compared with the traditional RGB color space, the operation of converting the video frame from YUV format to RGB format is avoided, which saves the corresponding computation and time and improves the efficiency of video portrait segmentation. Secondly, the YUV+Mask input format is used for video portrait segmentation; compared with traditional single RGB data, this adds the spatio-temporal characteristics of the video to the model's prediction, simplifies the prediction of the transparency template, and can further reduce the complexity of the model, so that better algorithm efficiency is achieved while the algorithm effect and robustness are improved. Thirdly, a multi-stage training method is used to train the video portrait segmentation model, using image data and video data in turn. In the training stage, image data is applied to the video portrait segmentation task through transparency-template transformation of images, which alleviates, to a certain extent, the problems of insufficient video portrait segmentation data and high labeling cost. Meanwhile, different loss functions are used according to the characteristics of the data for more effective supervised learning.
Based on the above, the embodiment of the present application does not need to perform color-space conversion during video portrait segmentation inference, thereby realizing efficient inference. Compared with a traditional single-frame portrait segmentation algorithm, adding the transparency template of the previous video frame at the input end increases the spatio-temporal characteristics of the video, which can alleviate robustness problems such as jitter of the transparency template and insufficient segmentation quality during video segmentation, and simplifies the prediction of the current frame, so that a better video portrait segmentation result can be obtained with a smaller network and efficient prediction is achieved while the robustness of the model is improved. Finally, in the training stage, image data is applied to the video portrait segmentation task through transparency-template transformation of images, which alleviates, to a certain extent, the problems of insufficient video portrait segmentation data and high labeling cost.
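Tying the pieces together, a schematic per-frame inference loop over a video might look like the following; `model`, `pack_input`, `extract_y` and `composite_foreground` are hypothetical names standing in for the trained segmentation model and simple helpers, not APIs defined by the patent.

```python
# Schematic per-frame inference: the first frame uses its own Y channel as the
# reference Mask, later frames reuse the previous frame's predicted template,
# and the foreground is extracted by compositing with the transparency template.
import torch


@torch.no_grad()
def segment_video(frames, model, pack_input, extract_y, composite_foreground):
    prev_mask = None
    results = []
    for frame in frames:                          # decoded YUV420 frames
        mask = extract_y(frame) if prev_mask is None else prev_mask
        x = pack_input(frame, mask)               # (4, H, W): Y, U, V, Mask planes
        alpha = model(torch.from_numpy(x).unsqueeze(0)).squeeze(0).squeeze(0).numpy()
        results.append(composite_foreground(frame, alpha))
        prev_mask = alpha                         # becomes the next frame's Mask
    return results
```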
Based on the foregoing embodiments, the embodiments of the present application provide a segmentation apparatus. Each module included in the segmentation apparatus, and each unit included in each module, may be implemented by a processor in a computer device; of course, they may also be implemented by corresponding logic circuits. In practice, the processor may be a central processing unit (Central Processing Unit, CPU), a microprocessor (Microprocessor Unit, MPU), a digital signal processor (Digital Signal Processor, DSP), a field programmable gate array (Field Programmable Gate Array, FPGA), or the like.
An embodiment of the present application further provides a segmentation apparatus. Fig. 12 is a schematic structural diagram of the segmentation apparatus provided in the embodiment of the present application. As shown in fig. 12, the segmentation apparatus 1200 includes:
an acquisition module 1201, configured to acquire video data to be segmented and a trained segmentation model;
a first determining module 1202, configured to determine a plurality of video frames included in the video data, and sequentially determine the plurality of video frames as video frames to be segmented;
a second determining module 1203, configured to determine, when at least one historical video frame exists before the video frame to be segmented, a reference transparency template of the video frame to be segmented from target transparency templates of the at least one historical video frame;
the segmentation module 1204 is configured to input the video frame to be segmented and the reference transparency template to the trained segmentation model, so as to obtain a target transparency template of the video frame to be segmented;
and the extracting module 1205 is configured to extract the video frame to be segmented based on the target transparency template, so as to obtain a target foreground image of the video frame to be segmented.
In some embodiments, the second determining module 1203 includes:
A first determining submodule for determining a frame number of the at least one historical video frame and recording the frame number as n, wherein n is an integer greater than or equal to 1;
a second determining submodule, configured to determine a historical video frame that leads the n-i frame of the video frame to be segmented as a target historical video frame, where i is greater than or equal to zero and less than n, and is an integer;
and the third determining submodule is used for determining the target transparency template corresponding to the target historical video frame as the reference transparency template of the video frame to be segmented.
In some embodiments, the segmentation module 1204 includes:
the first extraction submodule is used for extracting semantic information from the video frame to be segmented and the reference transparency template to obtain target semantic information;
the second extraction submodule is used for extracting edge information of the video frame to be segmented and the reference transparency template to obtain target edge information;
the fusion sub-module is used for fusing the target semantic information and the target edge information to obtain target fusion information;
and the feature extraction sub-module is used for extracting features of the target fusion information to obtain the target transparency template.
In some embodiments, the first extraction submodule includes:
the first extraction unit is used for sequentially carrying out local feature extraction and splicing treatment on the video frame to be segmented and the reference transparency template to obtain a first local feature;
the second extraction unit is used for carrying out downsampling and feature extraction on the first local features to obtain second local features;
and the up-sampling unit is used for up-sampling the second local features to obtain the target semantic information.
In some embodiments, the second extraction submodule includes:
the third extraction unit is used for extracting local features of the video frame to be segmented and the reference transparency template to obtain third local features;
and a fourth extraction unit, configured to perform feature extraction on the third local feature, to obtain the target edge information.
In some embodiments, the target video frame includes a Y component, a U component, and a V component, and the segmentation module 1204 is further configured to input the video frame to be segmented and the reference transparency template to the trained segmentation model, to obtain a target transparency template of the video frame to be segmented. The segmentation apparatus 1200 further includes:
And the third determining module is used for determining the Y component as a reference transparency template of the video frame to be segmented when any historical video frame does not exist before the video frame to be segmented.
In some embodiments, the obtaining module 1201 is further configured to obtain video training data, picture training data, and a preset segmentation model;
the segmentation module 1204 is further configured to train the preset segmentation model by using the picture training data to obtain a preliminarily trained segmentation model; and continue training the preliminarily trained segmentation model by utilizing the video training data to obtain the trained segmentation model.
In some embodiments, the picture training data includes a training picture, a first training tag corresponding to the training picture, a training picture transparency template corresponding to the training picture, and the segmentation module 1204 includes:
the first input submodule is used for inputting the training picture and the training picture transparency template into the preset segmentation model to obtain a first prediction label corresponding to the training picture;
a first obtaining sub-module, configured to obtain first error information between the first training tag and the first prediction tag;
And the first training sub-module is used for carrying out back propagation training on the preset segmentation model based on the first error information and the first error threshold value to obtain the preliminarily trained segmentation model.
In some embodiments, the segmentation apparatus 1200 further comprises:
the marking sub-module is used for marking the training pictures to obtain an original transparency template;
and the transformation submodule is used for carrying out rigid transformation and/or non-rigid transformation processing on the original transparency template to obtain the training picture transparency template.
In some embodiments, the video training data includes a training video frame, a second training tag corresponding to the training video frame, a training video frame transparency template corresponding to the training video frame, and the segmentation module 1204 further includes:
the second input sub-module is used for inputting the training video frames and the training video frame transparency templates into the preliminary trained segmentation model to obtain second prediction labels corresponding to the training video frames;
a second obtaining sub-module, configured to obtain second error information between the second training tag and the second prediction tag;
and the second training sub-module is used for carrying out back propagation training on the preliminary trained segmentation model based on the second error information and the second error threshold value to obtain the trained segmentation model.
It should be noted that the description of the segmentation apparatus in the embodiment of the present application is similar to the description of the method embodiment described above and has similar advantageous effects. For technical details not disclosed in the embodiments of the present apparatus, please refer to the description of the method embodiments of the present application for understanding.
In the embodiment of the present application, if the above-mentioned segmentation method is implemented in the form of a software functional module and sold or used as a separate product, it may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, or the part thereof contributing to the related art, may essentially be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, an optical disk, or other media capable of storing program code. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Accordingly, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the segmentation method provided in the above embodiments.
An embodiment of the present application provides a segmentation apparatus. Fig. 13 is a schematic diagram of the composition structure of the segmentation apparatus provided in the embodiment of the present application. As shown in fig. 13, the segmentation apparatus 1300 includes: a processor 1301, at least one communication bus 1302, a user interface 1303, at least one external communication interface 1304, and a memory 1305. The communication bus 1302 is configured to enable connection and communication between these components. The user interface 1303 may include a display screen, and the external communication interface 1304 may include a standard wired interface and a wireless interface, among others. The processor 1301 is configured to execute a program of the segmentation method stored in the memory to implement the segmentation method provided in the above-described embodiments.
The above description of the embodiments of the segmentation apparatus and the storage medium is similar to the description of the embodiments of the method described above, with similar advantageous effects as the embodiments of the method. For technical details not disclosed in the embodiments of the dividing apparatus and the storage medium of the present application, please refer to the description of the method embodiments of the present application for understanding.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application. The foregoing embodiment numbers of the present application are merely for describing, and do not represent advantages or disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above described device embodiments are only illustrative, e.g. the division of the units is only one logical function division, and there may be other divisions in practice, such as: multiple units or components may be combined or may be integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the embodiments of the present application.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware related to program instructions, and the foregoing program may be stored in a computer readable storage medium, where the program, when executed, performs steps including the above method embodiments; and the aforementioned storage medium includes: various media capable of storing program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
Alternatively, the integrated units described above may be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. Based on such understanding, the technical solutions of the embodiments of the present application, or the part thereof contributing to the related art, may essentially be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes: various media capable of storing program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
The foregoing is merely an embodiment of the present application, but the protection scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A method of segmentation, the method comprising:
Acquiring video data to be segmented and a trained segmentation model;
determining a plurality of video frames included in the video data, and sequentially determining the plurality of video frames as video frames to be segmented;
when at least one historical video frame exists before the video frame to be segmented, determining a reference transparency template of the video frame to be segmented from target transparency templates of the at least one historical video frame;
inputting the video frame to be segmented and the reference transparency template into the trained segmentation model to obtain a target transparency template of the video frame to be segmented;
and extracting the video frame to be segmented based on the target transparency template to obtain a target foreground image of the video frame to be segmented.
2. The method of claim 1, wherein when there is at least one historical video frame before the video frame to be segmented, determining a reference transparency template for the video frame to be segmented from target transparency templates for the at least one historical video frame, comprising:
determining a frame number of the at least one historical video frame, and recording the frame number as n, wherein n is an integer greater than or equal to 1;
Determining a historical video frame leading the n-i frames of the video frame to be segmented as a target historical video frame, wherein i is greater than or equal to zero and smaller than n, and is an integer;
and determining a target transparency template corresponding to the target historical video frame as a reference transparency template of the video frame to be segmented.
3. The method according to claim 1, wherein inputting the video frame to be segmented and the reference transparency template into the trained segmentation model to obtain a target transparency template for the video frame to be segmented comprises:
extracting semantic information from the video frame to be segmented and the reference transparency template to obtain target semantic information;
extracting edge information of the video frame to be segmented and the reference transparency template to obtain target edge information;
fusing the target semantic information and the target edge information to obtain target fusion information;
and extracting the characteristics of the target fusion information to obtain the target transparency template.
4. A method according to claim 3, wherein extracting semantic information from the video frame to be segmented and the reference transparency template to obtain target semantic information comprises:
Carrying out local feature extraction and splicing treatment on the video frame to be segmented and the reference transparency template in sequence to obtain a first local feature;
downsampling and extracting the first local feature to obtain a second local feature;
and up-sampling the second local features to obtain the target semantic information.
5. A method according to claim 3, wherein extracting edge information from the video frame to be segmented and the reference transparency template to obtain target edge information comprises:
extracting local features of the video frame to be segmented and the reference transparency template to obtain a third local feature;
and extracting the characteristics of the third local characteristics to obtain the target edge information.
6. The method of any of claims 1 to 5, wherein the target video frame comprises a Y component, a U component, and a V component, the method further comprising:
when any historical video frame does not exist before the video frame to be segmented, determining the Y component as a reference transparency template of the video frame to be segmented;
and inputting the video frame to be segmented and the reference transparency template into the trained segmentation model to obtain a target transparency template of the video frame to be segmented.
7. The method according to any one of claims 1 to 5, further comprising:
acquiring video training data, picture training data and a preset segmentation model;
training the preset segmentation model by using the picture training data to obtain a preliminarily trained segmentation model;
and continuing training the preliminarily trained segmentation model by utilizing the video training data to obtain the trained segmentation model.
8. The method of claim 7, wherein the picture training data includes a training picture, a first training label corresponding to the training picture, and a training picture transparency template corresponding to the training picture, and the training the preset segmentation model using the picture training data to obtain a preliminary trained segmentation model includes:
inputting the training picture and the training picture transparency template into the preset segmentation model to obtain a first prediction label corresponding to the training picture;
acquiring first error information between the first training label and the first prediction label;
and carrying out back propagation training on the preset segmentation model based on the first error information and the first error threshold value to obtain the preliminarily trained segmentation model.
9. The method as recited in claim 8, wherein the method further comprises:
labeling the training pictures to obtain an original transparency template;
and carrying out rigid transformation and/or non-rigid transformation treatment on the original transparency template to obtain the training picture transparency template.
10. The method of claim 7, wherein the video training data comprises a training video frame, a second training tag corresponding to the training video frame, and a training video frame transparency template corresponding to the training video frame, and wherein the training the preliminary trained segmentation model using the video training data to obtain a trained segmentation model comprises:
inputting the training video frames and the training video frame transparency templates into the preliminarily trained segmentation model to obtain second prediction labels corresponding to the training video frames;
acquiring second error information between the second training label and the second prediction label;
and carrying out back propagation training on the preliminarily trained segmentation model based on the second error information and a second error threshold value to obtain the trained segmentation model.
11. A segmentation apparatus, characterized in that the segmentation apparatus comprises:
the acquisition module is used for acquiring video data to be segmented and a trained segmentation model;
the first determining module is used for determining a plurality of video frames included in the video data and sequentially determining the video frames as video frames to be segmented;
the second determining module is used for determining a reference transparency template of the video frame to be segmented from target transparency templates of at least one historical video frame when the at least one historical video frame exists before the video frame to be segmented;
the segmentation module is used for inputting the video frame to be segmented and the reference transparency template into the trained segmentation model to obtain a target transparency template of the video frame to be segmented;
and the extraction module is used for extracting the video frame to be segmented based on the target transparency template to obtain a target foreground image of the video frame to be segmented.
12. A segmentation apparatus, characterized in that the segmentation apparatus comprises:
a processor; and
a memory for storing a computer program executable on the processor;
wherein the computer program, when executed by a processor, implements the segmentation method of any one of claims 1 to 10.
13. A computer readable storage medium having stored therein computer executable instructions configured to perform the segmentation method of any of the preceding claims 1-10.
CN202111299080.1A 2021-11-04 2021-11-04 Segmentation method, segmentation device, segmentation equipment and computer readable storage medium Pending CN116091955A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111299080.1A CN116091955A (en) 2021-11-04 2021-11-04 Segmentation method, segmentation device, segmentation equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111299080.1A CN116091955A (en) 2021-11-04 2021-11-04 Segmentation method, segmentation device, segmentation equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116091955A true CN116091955A (en) 2023-05-09

Family

ID=86208761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111299080.1A Pending CN116091955A (en) 2021-11-04 2021-11-04 Segmentation method, segmentation device, segmentation equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116091955A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116630354A (en) * 2023-07-24 2023-08-22 荣耀终端有限公司 Video matting method, electronic device, storage medium and program product
CN116630354B (en) * 2023-07-24 2024-04-12 荣耀终端有限公司 Video matting method, electronic device, storage medium and program product
CN117036351A (en) * 2023-10-09 2023-11-10 合肥安迅精密技术有限公司 Element defect detection method and system and storage medium

Similar Documents

Publication Publication Date Title
CN108932693B (en) Face editing and completing method and device based on face geometric information
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN109712145A (en) A kind of image matting method and system
CN116091955A (en) Segmentation method, segmentation device, segmentation equipment and computer readable storage medium
CN113538480A (en) Image segmentation processing method and device, computer equipment and storage medium
CN109948441B (en) Model training method, image processing method, device, electronic equipment and computer readable storage medium
CN112396645A (en) Monocular image depth estimation method and system based on convolution residual learning
CN114549574A (en) Interactive video matting system based on mask propagation network
CN111742345A (en) Visual tracking by coloring
GB2606785A (en) Adaptive convolutions in neural networks
Zhao et al. Vcgan: Video colorization with hybrid generative adversarial network
CN116012232A (en) Image processing method and device, storage medium and electronic equipment
CN111242068B (en) Behavior recognition method and device based on video, electronic equipment and storage medium
CN114782596A (en) Voice-driven human face animation generation method, device, equipment and storage medium
CN117237648B (en) Training method, device and equipment of semantic segmentation model based on context awareness
CN112200817A (en) Sky region segmentation and special effect processing method, device and equipment based on image
CN113128517A (en) Tone mapping image mixed visual feature extraction model establishment and quality evaluation method
CN112070181A (en) Image stream-based cooperative detection method and device and storage medium
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN116051593A (en) Clothing image extraction method and device, equipment, medium and product thereof
US20220217321A1 (en) Method of training a neural network configured for converting 2d images into 3d models
CN115578298A (en) Depth portrait video synthesis method based on content perception
CN115100218A (en) Video consistency fusion method based on deep learning
CN115439388B (en) Free viewpoint image synthesis method based on multilayer nerve surface expression
WO2024099004A1 (en) Image processing model training method and apparatus, and electronic device, computer-readable storage medium and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination