CN116630354B - Video matting method, electronic device, storage medium and program product - Google Patents

Video matting method, electronic device, storage medium and program product

Info

Publication number
CN116630354B
Authority
CN
China
Prior art keywords
feature
image
target
sub
network
Prior art date
Legal status
Active
Application number
CN202310906047.3A
Other languages
Chinese (zh)
Other versions
CN116630354A (en)
Inventor
田宇桐
任海涛
李英俊
周元甲
冯向鹤
Current Assignee
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date
Filing date
Publication date
Application filed by Honor Device Co Ltd
Priority to CN202310906047.3A
Publication of CN116630354A
Application granted
Publication of CN116630354B
Legal status: Active
Anticipated expiration


Classifications

    • G06T7/12: Image analysis; Segmentation / edge detection; Edge-based segmentation
    • G06N3/0442: Neural networks; Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/0455: Neural networks; Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464: Neural networks; Convolutional networks [CNN, ConvNet]
    • G06N3/0475: Neural networks; Generative networks
    • G06N3/048: Neural networks; Activation functions
    • G06T7/13: Image analysis; Segmentation / edge detection; Edge detection
    • G06V10/806: Image or video recognition or understanding; Fusion of extracted features
    • G06V10/82: Image or video recognition or understanding using neural networks
    • G06V20/46: Scene-specific elements in video content; Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06T2207/10016: Indexing scheme for image analysis or enhancement; Video; Image sequence
    • G06T2207/20084: Indexing scheme for image analysis or enhancement; Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The embodiment of the application provides a video matting method, an electronic device, a storage medium and a program product, relating to the technical field of video processing. The method includes: performing information compression on a target video frame in a video to obtain a first image feature; segmenting the target video frame based on the first image feature to obtain features of a contour mask image of an object produced during segmentation and a first contour mask image; performing feature reconstruction based on the first image feature and the obtained features of the contour mask image, fusing the reconstructed features with first hidden state information of the video, and updating the first hidden state information to obtain a fusion result; obtaining, based on the fusion result, a target transparency mask image of the object edges in the target video frame; and performing region matting on the target video frame according to the target transparency mask image and the first contour mask image to obtain a matting result. By applying the video matting scheme provided by the embodiment of the application, the accuracy of video matting can be improved.

Description

Video matting method, electronic device, storage medium and program product
Technical Field
The present disclosure relates to the field of video processing technologies, and in particular, to a video matting method, an electronic device, a storage medium, and a program product.
Background
Video matting is an important research direction in the field of computer vision. It mainly determines the contour and transparency of objects such as people and vehicles in video frames, and separates the region where the object (with its transparency) is located from the background region based on that contour.
In the related art, video matting is generally performed based on Bayesian or Poisson methods. However, these approaches are mainly designed for still images, and their accuracy is low when applied to video.
Disclosure of Invention
In view of the foregoing, the present application provides a video matting method, an electronic device, a storage medium, and a program product, so as to improve the accuracy of video matting.
In a first aspect, an embodiment of the present application provides a video matting method, where the method includes:
performing information compression on a target video frame in a video to obtain a first image feature;
segmenting the target video frame based on the first image feature to obtain features of a contour mask image of an object produced during segmentation and a first contour mask image of the object in the target video frame;
performing feature reconstruction based on the first image feature and the obtained features of the contour mask image, fusing the reconstructed features with first hidden state information of the video, and updating the first hidden state information to obtain a fusion result, wherein the first hidden state information characterizes: fusion features of transparency mask images of object edges in video frames that were matted before the target video frame;
obtaining, based on the fusion result, a target transparency mask image of the object edges in the target video frame;
and performing region matting on the target video frame according to the target transparency mask image and the first contour mask image to obtain a matting result.
As can be seen from the above, in the solution provided by this embodiment, when the target video frame is segmented based on the first image feature, features of the contour mask image of the object are obtained during segmentation, and these features characterize the contour of the object in the contour mask image. The first hidden state information characterizes fusion features of transparency mask images of object edges in video frames matted before the target video frame. Therefore, when feature reconstruction is performed based on the first image feature and the obtained contour mask image features, and the reconstructed features are fused with the first hidden state information, the resulting fusion result incorporates both the contour information of the object and information about object edges in previously matted video frames. Since video frames in a video are usually temporally correlated, obtaining the target transparency mask image based on this fusion result takes into account, in addition to the target video frame itself, information about the object in temporally correlated video frames, which improves the accuracy of the target transparency mask image. Region matting can then be performed accurately on the target video frame according to the target transparency mask image and the first contour mask image. Applying the video matting scheme provided by the embodiment of the application therefore improves the accuracy of video matting.
In addition, when the target transparency mask image is obtained, the fusion features of the transparency mask images of object edges in previously matted video frames are taken into account, i.e., image information from multiple video frames rather than from the target video frame alone. This improves the inter-frame smoothness of changes in the object edge region across the target transparency mask images corresponding to the video frames, and thus the inter-frame smoothness of changes in the object edge region across the corresponding matting results.
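To make the flow of the first aspect concrete, the following PyTorch-style sketch shows one possible per-frame loop. The module boundaries (encoder, seg_decoder, matting_decoder, alpha_head, compose) and their interfaces are assumptions for illustration, not the patent's implementation.

```python
import torch

def matte_video(frames, encoder, seg_decoder, matting_decoder, alpha_head, compose):
    """Per-frame matting with a recurrent hidden state carried across frames
    (illustrative sketch; the module interfaces are assumptions)."""
    hidden = None                                   # first hidden state information (empty before frame 0)
    results = []
    for frame in frames:                            # frame: [1, 3, H, W] tensor
        feat = encoder(frame)                                    # information compression -> first image feature
        contour_mask, contour_feats = seg_decoder(feat)          # segmentation: first contour mask + per-scale features
        fused, hidden = matting_decoder(feat, contour_feats, hidden)  # reconstruct, fuse with and update hidden state
        alpha_edge = torch.sigmoid(alpha_head(fused))            # target transparency mask of the object edges
        results.append(compose(frame, alpha_edge, contour_mask))      # region matting
    return results
```

The key point the sketch tries to convey is that `hidden` survives across loop iterations, which is how information from previously matted frames reaches the current frame.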
In one embodiment of the present application, the obtaining, based on the fusion result, a target transparency mask image of an object edge in the target video frame includes:
acquiring a second contour mask image of the object in the target video frame based on the fusion result;
fusing the second contour mask image and the fusion result to obtain a target fusion characteristic;
and obtaining a target transparency mask image of the edge of the object in the target video frame based on the target fusion characteristic.
As can be seen from the above, in the solution provided by this embodiment, the second contour mask image of the object is obtained based on the fusion result; this image indicates the region of the target video frame occupied by the rough contour of the object. Fusing the second contour mask image with the fusion result yields the target fusion feature, which then only needs to attend to image detail within the region where the object roughly lies. Based on the target fusion feature, the target transparency mask image of the object edges can therefore be obtained accurately while attending only to that region, and performing video matting based on this target transparency mask image improves the accuracy of video matting.
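A hedged sketch of this embodiment's coarse-mask-then-alpha ordering follows. Predicting the second contour mask with a single convolution and fusing by concatenation are assumptions, as are the channel sizes.

```python
import torch
import torch.nn as nn

class EdgeAlphaHead(nn.Module):
    """Predict a coarse contour mask from the fusion result, then condition the
    transparency prediction on it (illustrative; channel sizes are assumptions)."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.mask_head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)        # second contour mask
        self.alpha_head = nn.Sequential(
            nn.Conv2d(channels + 1, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, fusion_result: torch.Tensor):
        coarse_mask = torch.sigmoid(self.mask_head(fusion_result))               # where the object roughly is
        target_fusion = torch.cat([fusion_result, coarse_mask], dim=1)           # "target fusion feature"
        alpha = torch.sigmoid(self.alpha_head(target_fusion))                    # transparency mask of object edges
        return alpha, coarse_mask
```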
In one embodiment of the present application, the segmenting of the target video frame based on the first image feature to obtain the features of the contour mask image of the object produced during segmentation and the first contour mask image of the object in the target video frame includes:
performing cascaded feature reconstruction based on the first image feature to obtain features of contour mask images of the object with sequentially increasing scales, and obtaining the first contour mask image of the object in the target video frame based on the feature obtained by the last processing;
the first hidden state information includes a plurality of pieces of first sub-hidden state information, each of which characterizes the fusion feature of a transparency mask image of one scale;
the performing feature reconstruction based on the first image feature and the obtained features of the contour mask image, fusing the reconstructed features with the first hidden state information of the video, and updating the first hidden state information to obtain a fusion result includes:
performing transparency information fusion a preset number of times in the following manner, and determining the feature obtained by the last fusion as the fusion result:
performing feature reconstruction based on a first target feature and a second target feature among the obtained features of the contour mask images to obtain a second image feature with an increased scale, wherein for the first fusion the first target feature is the first image feature, for each subsequent fusion the first target feature is the feature obtained by the previous fusion, and the scale of the first target feature is the same as that of the second target feature;
and fusing the second image feature with first sub-state information in the first hidden state information and updating that first sub-state information to obtain a third image feature, wherein the scale of the transparency mask image corresponding to the fusion feature characterized by the fused first sub-state information is the same as the scale of the second image feature.
As can be seen from the above, in the solution provided by this embodiment, the target video frame is segmented based on the first image feature through cascaded feature reconstruction, which comprises multiple feature-reconstruction passes, so the accuracy of the first contour mask image can be improved. After the first image feature is obtained, transparency information fusion is performed multiple times, and each fusion pass comprises three processing steps: feature reconstruction, fusion of the feature with hidden state information, and updating of the hidden state information. This improves the accuracy of the final fusion result, so that region matting of the target video frame can be performed based on a more accurate first contour mask image and fusion result, improving the accuracy of region matting and hence of video matting.
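A minimal sketch of the preset number of transparency-fusion passes follows: each pass reconstructs a larger-scale feature from the previous result and a same-scale contour feature, then fuses it with (and updates) the per-scale hidden state. The `reconstruct_blocks` and `fuse_blocks` callables stand in for the reconstruction and fusion sub-networks described in later embodiments; their interfaces and the small-to-large ordering are assumptions.

```python
def transparency_fusion(first_image_feature, contour_feats, hidden_states,
                        reconstruct_blocks, fuse_blocks):
    """contour_feats, hidden_states and *_blocks are ordered from small scale to large."""
    target = first_image_feature                       # first target feature for the first pass
    new_hidden = []
    for contour_feat, h, reconstruct, fuse in zip(
            contour_feats, hidden_states, reconstruct_blocks, fuse_blocks):
        second_feat = reconstruct(target, contour_feat)        # second image feature, scale increased
        third_feat, h = fuse(second_feat, h)                   # fuse with first sub-state info and update it
        new_hidden.append(h)
        target = third_feat                                    # becomes the first target feature next pass
    return target, new_hidden                                  # the last feature is the fusion result
```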
In an embodiment of the present application, the performing feature reconstruction based on the first target feature and a second target feature in the features of the obtained contour mask image to obtain a second image feature with an increased scale includes:
screening, from the second target feature among the obtained features of the contour mask images, a characteristic feature that characterizes the object edges;
and carrying out feature reconstruction based on the first target feature and the characteristic feature to obtain a second image feature with increased scale.
As can be seen from the above, in the solution provided by this embodiment, during feature reconstruction based on the first target feature and the second target feature, a characteristic feature with a smaller data size is screened from the second target feature. Performing feature reconstruction based on the first target feature and this characteristic feature reduces the computation of feature reconstruction, and therefore of the preset number of transparency information fusions, which improves the efficiency of obtaining the fusion result and of video matting.
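The patent does not fix how the characteristic edge features are screened out; one plausible sketch, under that assumption, is a learned 1x1 projection with an attention-style gate, which also shrinks the feature handed to reconstruction.

```python
import torch
import torch.nn as nn

class EdgeFeatureScreen(nn.Module):
    """Keep only a thin slice of the contour feature that responds to object edges
    (illustrative; the screening mechanism is an assumption)."""
    def __init__(self, in_channels: int, kept_channels: int):
        super().__init__()
        self.project = nn.Conv2d(in_channels, kept_channels, kernel_size=1)    # channel reduction
        self.gate = nn.Sequential(nn.Conv2d(in_channels, kept_channels, kernel_size=1),
                                  nn.Sigmoid())                                # edge-attention gate

    def forward(self, contour_feature: torch.Tensor) -> torch.Tensor:
        return self.project(contour_feature) * self.gate(contour_feature)      # smaller "characteristic feature"
```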
In an embodiment of the present application, the performing cascaded feature reconstruction based on the first image feature to obtain features of contour mask images of the object with sequentially increasing scales includes:
performing contour information fusion a preset number of times in the following manner to obtain the features of the contour mask images of the object with sequentially increasing scales:
performing feature reconstruction based on a third target feature to obtain a fourth image feature with an increased scale, wherein for the first fusion the third target feature is the first image feature, and for each subsequent fusion the third target feature is the feature obtained by the previous feature reconstruction;
fusing the fourth image feature with second sub-state information in second hidden state information and updating that second sub-state information to obtain the feature of the contour mask image of the object, wherein the second hidden state information characterizes fusion features of contour mask images of objects in video frames segmented before the target video frame, and includes a plurality of pieces of second sub-hidden state information, each of which characterizes the fusion feature of a contour mask image of one scale; the scale of the contour mask image corresponding to the fusion feature characterized by the fused second sub-hidden state information is the same as the scale of the fourth image feature.
As can be seen from the above, in the solution provided by this embodiment, the second hidden state information characterizes fusion features of contour mask images of the object in video frames segmented before the target video frame, and each piece of second sub-hidden state information characterizes the fusion feature of a contour mask image of one scale. Fusing the fourth image feature obtained by reconstruction in each contour-information-fusion pass with the second sub-state information therefore injects, into the fourth image feature, information from the fusion feature of the corresponding-scale contour mask image of the object in previously segmented video frames. The finally obtained features of the contour mask image of the object thus contain information about the object both in the target video frame and in previously segmented video frames, so obtaining the first contour mask image from these features improves its accuracy.
In one embodiment of the present application, the first image feature includes a plurality of first sub-image features;
the step of performing information compression on a target video frame in a video to obtain a first image feature includes:
Performing cascade information compression on a target video frame in a video to obtain first sub-image features with sequentially reduced scales;
the third target feature is a first sub-image feature with the minimum scale when the feature reconstruction is carried out for the first time;
the feature reconstruction is performed based on the third target feature to obtain a fourth image feature with an increased scale, including:
and when the feature reconstruction is carried out for other times, carrying out feature reconstruction based on a third target feature and a first sub-image feature with the same scale as the third target feature to obtain a fourth image feature with the increased scale.
As can be seen from the above, in the solution provided by this embodiment, cascaded information compression is performed on the target video frame to obtain first sub-image features with sequentially decreasing scales. In every contour information fusion other than the first, feature reconstruction can then be performed based on the third target feature and the first sub-image feature of the same scale as the third target feature, which improves the accuracy of feature reconstruction, and hence the accuracy of the features obtained after contour information fusion and of video matting.
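One way to read the cascade compression with same-scale skip features is as a U-Net-style encoder-decoder. The sketch below, with its stride-2 convolutional stages, channel counts and upsample-then-concatenate step, is an assumed interpretation rather than the patent's encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadeEncoder(nn.Module):
    """Cascade information compression: each stage halves the spatial scale and
    all intermediate first sub-image features are kept (smallest last)."""
    def __init__(self, channels=(16, 24, 32, 48)):
        super().__init__()
        ins = (3,) + channels[:-1]
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Conv2d(i, o, 3, stride=2, padding=1), nn.ReLU(inplace=True))
            for i, o in zip(ins, channels))

    def forward(self, frame):
        feats = []
        x = frame
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats                       # scales decrease; feats[-1] is the smallest

def reconstruct_stage(third_target, skip_feat, block):
    """One cascade-reconstruction step: upsample, then combine with the
    same-scale first sub-image feature (interfaces are assumptions)."""
    up = F.interpolate(third_target, size=skip_feat.shape[-2:], mode="bilinear",
                       align_corners=False)
    return block(torch.cat([up, skip_feat], dim=1))    # fourth image feature, scale increased
```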
In one embodiment of the present application, the fusing the second image feature and the first sub-state information in the first hidden state information and updating the first sub-state information to obtain a third image feature includes: segmenting the second image feature to obtain a second sub-image feature and a third sub-image feature; fusing the second sub-image feature and first sub-state information in the first hidden state information and updating the first sub-state information to obtain a fourth sub-image feature; and splicing the fourth sub-image feature and the third sub-image feature to obtain a third image feature.
As can be seen from the above, in the solution provided by this embodiment, the second image feature is split into a second sub-image feature and a third sub-image feature, each with a smaller data size than the second image feature. Fusing only the second sub-image feature with the first sub-state information therefore reduces the computation and improves the efficiency of fusion, which improves the efficiency of obtaining the third image feature and of video matting, while saving computing resources of the terminal and enabling a lightweight deployment of the video matting scheme on the terminal.
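The split-and-recombine step of this embodiment might look like the following; treating the segmentation of the feature as a channel split, and the half-and-half ratio, are assumptions.

```python
import torch

def split_fuse_concat(image_feature, hidden_state, fuse_cell):
    """Fuse only part of the channels with the hidden state to cut computation.
    `fuse_cell(x, h) -> (y, h_new)` stands in for the (Conv)GRU fusion sub-network."""
    half = image_feature.shape[1] // 2
    part_a, part_b = torch.split(image_feature, [half, image_feature.shape[1] - half], dim=1)
    fused_a, new_hidden = fuse_cell(part_a, hidden_state)      # recurrent fusion on the small slice
    return torch.cat([fused_a, part_b], dim=1), new_hidden     # splice back to full width
```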
In an embodiment of the present application, the fusing the fourth image feature and the second sub-state information in the second hidden state information and updating the second sub-state information to obtain the feature of the outline mask image of the object includes: segmenting the fourth image feature to obtain a fifth sub-image feature and a sixth sub-image feature; fusing the fifth sub-image feature and second sub-state information in the second hidden state information and updating the second sub-state information to obtain a seventh sub-image feature; and splicing the seventh sub-image feature and the sixth sub-image feature to obtain the feature of the outline mask image of the object.
As can be seen from the above, in the solution provided by this embodiment, the fourth image feature is split into a fifth sub-image feature and a sixth sub-image feature, each with a smaller data size than the fourth image feature. Fusing only the fifth sub-image feature with the second sub-state information therefore reduces the computation and improves the efficiency of fusion, which improves the efficiency of obtaining the features of the contour mask image of the object and of video matting, while saving computing resources of the terminal and enabling a lightweight deployment of the video matting scheme on the terminal.
In one embodiment of the present application, the fusing the second image feature and the first sub-state information in the first hidden state information and updating the first sub-state information to obtain a third image feature includes: segmenting the second image feature to obtain a second sub-image feature and a third sub-image feature; fusing the second sub-image feature and first sub-state information in the first hidden state information and updating the first sub-state information to obtain a fourth sub-image feature; and splicing the fourth sub-image feature and the third sub-image feature to obtain a third image feature. The fusing the fourth image feature and the second sub-state information in the second hidden state information and updating the second sub-state information to obtain the feature of the outline mask image of the object, including: segmenting the fourth image feature to obtain a fifth sub-image feature and a sixth sub-image feature; fusing the fifth sub-image feature and second sub-state information in the second hidden state information and updating the second sub-state information to obtain a seventh sub-image feature; and splicing the seventh sub-image feature and the sixth sub-image feature to obtain the feature of the outline mask image of the object.
As can be seen from the above, applying the solution provided by this embodiment reduces the computation of both fusion processes, which further improves fusion efficiency and video matting efficiency and enables a lightweight deployment of the video matting scheme on the terminal.
In one embodiment of the present application, the performing information compression on a target video frame in a video to obtain a first image feature includes:
inputting a target video frame in a video into an information compression network in a pre-trained video matting model to obtain the first image feature output by the information compression network, wherein the video matting model further comprises: a first image generation network, a second image generation network, a result output network, and a plurality of groups of contour feature generation networks and an equal number of transparency feature generation networks, wherein each contour feature generation network corresponds to the scale of one contour mask image and comprises a first reconstruction sub-network and a first fusion sub-network;
performing feature reconstruction based on the first target feature and a second target feature in the features of the obtained contour mask image to obtain a second image feature with an increased scale, including:
Inputting a first target feature and a second target feature in the features of the obtained outline mask image into a target second reconstruction sub-network in a target transparency feature generation network to obtain a second image feature with an increased scale output by the target second reconstruction sub-network, wherein the scale of the transparency mask image corresponding to the target transparency feature generation network is the same as the scale of the second image feature;
the fusing the second image feature and the first sub-state information in the first hidden state information and updating the first sub-state information to obtain a third image feature includes:
inputting the second image feature into a target second fusion sub-network in the target transparency feature generation network, so that the target second fusion sub-network fuses the second image feature and the first sub-state information provided by the target second fusion sub-network and updates the first sub-state information to obtain a third image feature output by the target second fusion sub-network;
the obtaining the target transparency mask image of the object edge in the target video frame based on the fusion result comprises the following steps:
inputting the fusion result into the second image generation network to obtain a target transparency mask image of an object in the target video frame output by the second image generation network;
The feature reconstruction is performed based on the third target feature to obtain a fourth image feature with an increased scale, including:
inputting a third target feature into a target first reconstruction sub-network in a target contour feature generation network to obtain a fourth image feature with increased scale output by the target first reconstruction sub-network, wherein the scale of a contour mask image corresponding to the target contour feature generation network is the same as the scale of the fourth image feature;
the fusing the fourth image feature and the second sub-state information in the second hidden state information and updating the second sub-state information to obtain the feature of the outline mask image of the object, including:
inputting the fourth image feature into a target first fusion sub-network in the target contour feature generation network, so that the target first fusion sub-network fuses the fourth image feature and second sub-state information provided by the target first fusion sub-network and updates the second sub-state information to obtain the feature of the contour mask image of the object output by the target first fusion sub-network;
the obtaining a first contour mask image of the object in the target video frame based on the feature obtained by the last processing includes:
Inputting the characteristics obtained by the last processing into the first image generation network to obtain a first contour mask image of an object in the target video frame output by the first image generation network;
and performing region matting on the target video frame according to the target transparency mask image and the first contour mask image to obtain a matting result, wherein the method comprises the following steps:
and inputting the target transparency mask image and the first contour mask image into the result output network, so that the result output network performs region matting on the target video frame based on the obtained image, and a matting result output by the result output network is obtained.
As can be seen from the above, in the solution provided by this embodiment, video matting is performed using the networks and sub-networks of the video matting model. Because the model is pre-trained, using it improves the accuracy of video matting, and because it does not need to interact with other devices, it can be deployed on an offline device, which improves the convenience of video matting.
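The result output network's region matting is only described here as being based on the two mask images; the sketch below shows one assumed blending rule, trusting the soft transparency mask near the object edge and the hard contour mask elsewhere.

```python
import torch

def region_matting(frame, alpha_edge, contour_mask, edge_band=None):
    """frame: [1,3,H,W]; alpha_edge, contour_mask: [1,1,H,W] in [0,1].
    Use the soft alpha near the edges and the binarized contour mask elsewhere (assumed rule)."""
    if edge_band is None:
        # assumption: the soft alpha is only trusted where the contour mask is uncertain
        edge_band = ((contour_mask > 0.05) & (contour_mask < 0.95)).float()
    alpha = edge_band * alpha_edge + (1.0 - edge_band) * contour_mask.round()
    foreground = frame * alpha                      # matting result: the object with transparency
    return foreground, alpha
```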
In one embodiment of the present application, the transparency feature generation network further includes a feature screening sub-network;
before the inputting the first target feature and the second target feature in the features of the obtained outline mask image into the target transparency feature generation network, the method further comprises:
inputting the second target feature among the obtained features of the contour mask images into a target feature screening sub-network in the target transparency feature generation network to obtain a target screening feature, output by the target feature screening sub-network, that characterizes the edge contour of the object within the second target feature;
the inputting of the first target feature and the second target feature among the obtained features of the contour mask images into the target second reconstruction sub-network in the target transparency feature generation network includes:
inputting the first target feature and the target screening feature into the target second reconstruction sub-network in the target transparency feature generation network.
As can be seen from the above, in the solution provided by this embodiment, adding a feature screening sub-network to the transparency feature generation network reduces the computation performed by the second reconstruction sub-network during feature reconstruction and improves its efficiency, thereby improving the efficiency of video matting by the video matting model.
In one embodiment of the present application, the first fusion sub-network is a gated recurrent unit (GRU) or a long short-term memory (LSTM) unit.
As can be seen from the above, in the solution provided by this embodiment, both the GRU and the LSTM unit have an information-memory capability. Using either as the first fusion sub-network allows the unit itself to store the hidden state information that characterizes the fusion features of contour mask images of objects in previously segmented video frames, so the fourth image feature can be accurately fused with the second sub-hidden state information held by the unit. This improves the accuracy of the contour mask image features output by the sub-network, and hence the accuracy of video matting.
In one embodiment of the present application, the second fusion sub-network is a gated recurrent unit (GRU) or a long short-term memory (LSTM) unit.
As can be seen from the above, in the solution provided by this embodiment, both the GRU and the LSTM unit have an information-memory capability. Using either as the second fusion sub-network allows the unit itself to store the hidden state information that characterizes the fusion features of transparency mask images of object edges in previously matted video frames, so the second image feature can be accurately fused with the first sub-hidden state information held by the unit. This improves the accuracy of the third image feature output by the sub-network, and hence the accuracy of video matting.
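Because both fusion sub-networks may be GRU or LSTM units operating on image-shaped features, a convolutional GRU cell is one natural reading. The cell below is a generic ConvGRU sketch, not the patent's exact unit.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU: fuses the incoming feature with the stored hidden state
    and returns the updated hidden state (illustrative fusion sub-network)."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        p = kernel_size // 2
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=p)
        self.candidate = nn.Conv2d(2 * channels, channels, kernel_size, padding=p)

    def forward(self, x, h=None):
        if h is None:
            h = torch.zeros_like(x)                       # first frame: empty hidden state
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        h_new = (1 - z) * h + z * h_tilde                 # fused feature doubles as the new hidden state
        return h_new, h_new
```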
In one embodiment of the present application, the first reconstruction sub-network is implemented based on the QARepVGG network structure.
As can be seen from the above, in the solution provided by this embodiment, because the quantized-computation accuracy of the QARepVGG network is high, implementing the first reconstruction sub-network based on the QARepVGG network structure improves its quantized-computation capability and the accuracy of its feature reconstruction, and thus the accuracy of video matting.
In one embodiment of the present application, the first reconstruction sub-network in a specific contour feature generation network is implemented based on the QARepVGG network structure, wherein the specific contour feature generation network is a contour feature generation network whose corresponding contour mask image has a scale smaller than a first preset scale.
As can be seen from the above, in the solution provided by this embodiment, the computation of the QARepVGG-structure-based residual block in a contour feature generation network increases with the scale of the contour mask image corresponding to that network. Therefore, when constructing the contour feature generation networks, the QARepVGG structure can be used for the first reconstruction sub-network only in the specific contour feature generation networks whose scale is smaller than the first preset scale. This reduces the computation of the contour feature generation networks, improves the efficiency of obtaining the features of the contour mask image of the object, improves the efficiency of video matting, and allows the video matting model to be deployed in a lightweight manner on the terminal.
In one embodiment of the present application, the second reconstruction sub-network is implemented based on the QARepVGG network structure.
As can be seen from the above, in the solution provided by this embodiment, because the quantized-computation accuracy of the QARepVGG network is high, implementing the second reconstruction sub-network based on the QARepVGG network structure improves its quantized-computation capability and the accuracy of its feature reconstruction based on the first target feature and the second target feature, and thus the accuracy of video matting.
In one embodiment of the present application, the second reconstruction sub-network in a specific transparency feature generation network is implemented based on the QARepVGG network structure, wherein the specific transparency feature generation network is a transparency feature generation network whose corresponding transparency mask image has a scale smaller than a second preset scale.
As can be seen from the above, in the solution provided by this embodiment, because the computation of the QARepVGG-structure-based residual block in a transparency feature generation network increases with the scale of the transparency mask image corresponding to that network, the QARepVGG structure can be used for the second reconstruction sub-network only in the specific transparency feature generation networks whose scale is smaller than the second preset scale. This reduces the computation of the transparency feature generation networks, improves the efficiency of obtaining the fusion result and of video matting, and allows the video matting model to be deployed in a lightweight manner on the terminal.
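QARepVGG follows the RepVGG idea of training-time multi-branch convolutions that are re-parameterized into a single convolution for deployment, with modifications aimed at preserving accuracy after quantization. The block below sketches only the generic RepVGG-style branch structure as an assumption; it omits the QARepVGG-specific changes and the re-parameterization step.

```python
import torch
import torch.nn as nn

class RepStyleBlock(nn.Module):
    """Training-time multi-branch block (3x3 + 1x1 + identity) of the kind a
    reconstruction sub-network could stack; not the exact QARepVGG design."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.branch1 = nn.Conv2d(in_ch, out_ch, 1)
        self.identity = nn.Identity() if in_ch == out_ch else None
        self.bn = nn.BatchNorm2d(out_ch)           # single post-add BN (placement is an assumption)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.branch3(x) + self.branch1(x)
        if self.identity is not None:
            y = y + self.identity(x)
        return self.act(self.bn(y))
```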
In one embodiment of the present application, the video matting model is trained in the following manner:
inputting a first sample video frame in a sample video into an initial model of the video matting model for processing to obtain a first sample contour mask image of an object in the first sample video frame output by a first image generating network in the initial model;
obtaining a first difference between the annotation mask image corresponding to the first sample video frame and the annotation mask image corresponding to a second sample video frame, wherein the second sample video frame is a video frame that precedes the first sample video frame in the sample video by a preset number of frames;
obtaining a second difference between the first sample contour mask image and a second sample contour mask image, wherein the second sample contour mask image is the mask image output by the first image generation network when the initial model processes the second sample video frame;
obtaining a third difference between a first sample transparency mask image and a second sample transparency mask image, wherein the first sample transparency mask image is the mask image output by the second image generation network when the initial model processes the first sample video frame, and the second sample transparency mask image is the mask image output by the second image generation network when the initial model processes the second sample video frame;
Calculating a training loss based on the first difference, the second difference, and the third difference;
and based on the training loss, carrying out model parameter adjustment on the initial model to obtain the video matting model.
As can be seen from the above, in the solution provided by this embodiment, there is usually a temporal correlation between the first sample video frame and the second sample video frame separated by the preset number of frames. By obtaining the first difference between their annotation mask images, the second difference between the first and second sample contour mask images, and the third difference between the first and second sample transparency mask images, and computing the training loss from these three differences, the initial model can learn the temporal correlation between different video frames when its parameters are adjusted based on that loss. This improves the accuracy of the trained model, and performing video matting with it improves the accuracy of video matting.
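The loss formula is not given at this point in the text; the sketch below shows one assumed way of combining the first, second and third differences into a temporal-consistency term, penalizing predicted frame-to-frame changes that deviate from the annotated change.

```python
import torch
import torch.nn.functional as F

def temporal_training_loss(gt_mask_t, gt_mask_prev,           # annotation mask images
                           contour_t, contour_prev,           # first / second sample contour masks
                           alpha_t, alpha_prev):              # first / second sample transparency masks
    """Assumed combination of the first, second and third differences."""
    first_diff = gt_mask_t - gt_mask_prev                     # change in the ground truth
    second_diff = contour_t - contour_prev                    # change in the predicted contours
    third_diff = alpha_t - alpha_prev                         # change in the predicted transparency
    return F.l1_loss(second_diff, first_diff) + F.l1_loss(third_diff, first_diff)
```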
In one embodiment of the present application, the first sample contour mask image includes: a first mask subgraph for identifying an area where an object is located in the first sample video frame and a second mask subgraph for identifying an area outside the object in the first sample video frame;
The second sample contour mask image includes: a third mask subgraph for identifying the area where the object is located in the second sample video frame and a fourth mask subgraph for identifying the area outside the object in the second sample video frame;
the obtaining a second difference between the first and second sample profile mask images comprises:
obtaining the difference between the first mask subgraph and the third mask subgraph and the difference between the second mask subgraph and the fourth mask subgraph, and obtaining a second difference comprising both obtained differences.
As can be seen from the above, in the solution provided by this embodiment, a video frame consists of the region where the object is located and the region outside the object, so the larger the difference between the object regions of different video frames, the larger the difference between their non-object regions; the difference between non-object regions can therefore also reflect the difference between object regions. Obtaining the second difference from both the difference between the first and third mask subgraphs and the difference between the second and fourth mask subgraphs computes it from two different perspectives, which improves the accuracy of the second difference, of model training, and of video matting performed with the model.
In one embodiment of the present application, the performing information compression on a target video frame in a video to obtain a first image feature includes:
performing a convolution transformation on a target video frame in the video to obtain a fifth image feature;
performing a linear transformation on the fifth image feature based on a convolution kernel to obtain a sixth image feature;
performing batch normalization on the sixth image feature to obtain a seventh image feature;
performing a nonlinear transformation on the seventh image feature to obtain an eighth image feature;
and performing a linear transformation on the eighth image feature based on a convolution kernel to obtain the first image feature of the target video frame.
As can be seen from the above, in the solution provided by this embodiment, when the target video frame is compressed, it undergoes convolution transformation, linear transformation, batch normalization and nonlinear transformation, so the target video frame can be compressed more accurately. This improves the accuracy of the first image feature, and performing video matting based on the first image feature therefore improves the accuracy of video matting.
In addition, in the solution provided by the embodiment of the application, batch normalization is applied to the sixth image feature before the nonlinear transformation is applied to the resulting seventh image feature, which prevents loss of quantization precision during information compression, improves the quantization precision of information compression, and further improves the accuracy of the first image feature and of video matting.
The solution provided by the embodiment of the application is applied on a terminal, and operations such as convolution transformation, linear transformation, batch normalization and nonlinear transformation are friendly to the terminal's computing capacity. Performing them on the terminal therefore allows the terminal to compress information conveniently and promotes a lightweight realization of video matting on the terminal side.
In one embodiment of the present application, the convolution kernel is a 1x1 convolution kernel.
Because a 1x1 convolution kernel has a small data size, performing the linear transformation of the fifth image feature with a 1x1 convolution kernel reduces the computation of the linear transformation and improves its efficiency while still realizing the transformation, which improves the efficiency of video matting. In addition, when the video matting scheme provided by this embodiment is applied on a terminal, performing the 1x1-kernel linear transformation of the fifth image feature on the terminal does not occupy much of the terminal's computing resources, so the terminal can perform the linear transformation conveniently, which promotes a lightweight realization of video matting on the terminal side.
In one embodiment of the present application, the performing a nonlinear transformation on the seventh image feature to obtain an eighth image feature includes:
performing a nonlinear transformation on the seventh image feature based on a ReLU activation function to obtain the eighth image feature.
Because the ReLU activation function quantizes well when processing data, performing the nonlinear transformation of the seventh image feature with the ReLU activation function improves the effect of the nonlinear transformation and thus the accuracy of the eighth image feature.
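Combining the listed compression steps with the 1x1 convolution kernel and ReLU activation of the preceding embodiments, one possible information-compression block is sketched below; the channel counts and the stride of the initial convolution are assumptions.

```python
import torch
import torch.nn as nn

class CompressionBlock(nn.Module):
    """Convolution -> 1x1 linear transform -> batch norm -> ReLU -> 1x1 linear transform
    (the sequence described above; sizes are illustrative)."""
    def __init__(self, in_ch: int = 3, mid_ch: int = 32, out_ch: int = 16):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, mid_ch, kernel_size=3, stride=2, padding=1)   # convolution transform
        self.linear1 = nn.Conv2d(mid_ch, mid_ch, kernel_size=1)                    # linear transform (1x1)
        self.bn = nn.BatchNorm2d(mid_ch)                                            # batch normalization
        self.act = nn.ReLU(inplace=True)                                            # nonlinear transform
        self.linear2 = nn.Conv2d(mid_ch, out_ch, kernel_size=1)                     # linear transform (1x1)

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.linear2(self.act(self.bn(self.linear1(self.conv(frame)))))      # first image feature
```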
In a second aspect, embodiments of the present application further provide an electronic device, including:
one or more processors and memory;
the memory is coupled to the one or more processors and is configured to store computer program code comprising computer instructions; the one or more processors invoke the computer instructions to cause the electronic device to perform the method of any one of the first aspect.
In a third aspect, embodiments of the present application also provide a computer readable storage medium comprising a computer program which, when run on an electronic device, causes the electronic device to perform the method of any one of the first aspects above.
In a fourth aspect, embodiments of the present application also provide a computer program product comprising executable instructions which, when executed on a computer, cause the computer to perform the method of any one of the first aspects above.
In a fifth aspect, an embodiment of the present application further provides a chip system, where the chip system is applied to a terminal, and the chip system includes one or more processors, where the processors are configured to invoke computer instructions to cause the terminal to input data into the chip system, and perform the method according to any one of the first aspect to process the data and output a processing result.
For the advantageous effects of the solutions provided in the second, third, fourth and fifth aspects above, reference may be made to the advantageous effects of the solutions provided in the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 2 is a software structural block diagram of a terminal according to an embodiment of the present application;
fig. 3 is a flowchart of a first video matting method provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of image variation according to an embodiment of the present disclosure;
fig. 5 is a flowchart of a second video matting method provided in an embodiment of the present application;
fig. 6 is a flow chart of a first feature processing method according to an embodiment of the present application;
fig. 7 is a flow chart of a second feature processing method according to an embodiment of the present application;
fig. 8 is a flow chart of a first cascade feature reconstruction method according to an embodiment of the present application;
fig. 9 is a schematic flow chart of a second cascade feature reconstruction method according to an embodiment of the present application;
fig. 10 is a flowchart of a third video matting method provided in an embodiment of the present application;
fig. 11 is a schematic structural diagram of a first video matting model provided in an embodiment of the present application;
fig. 12 is a schematic structural diagram of an information compression network according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a second video matting model provided in an embodiment of the present application;
Fig. 14 is a flowchart of a fourth video matting method provided in an embodiment of the present application;
fig. 15 is a schematic structural diagram of a third video matting model provided in an embodiment of the present application;
fig. 16 is a schematic structural diagram of a fourth video matting model provided in an embodiment of the present application;
FIG. 17 is a flowchart of a first model training method according to an embodiment of the present disclosure;
FIG. 18 is a flowchart of a second model training method according to an embodiment of the present disclosure;
FIG. 19 is a mask image according to an embodiment of the present application;
fig. 20 is a schematic structural diagram of a second image generating network according to an embodiment of the present application;
fig. 21 is a schematic diagram illustrating comparison of a matting result provided in an embodiment of the present application;
fig. 22 is a schematic structural diagram of a chip system according to an embodiment of the present application.
Detailed Description
For a better understanding of the technical solutions of the present application, embodiments of the present application are described in detail below with reference to the accompanying drawings.
In order to clearly describe the technical solutions of the embodiments of the present application, words such as "first" and "second" are used in the embodiments of the present application to distinguish identical or similar items having substantially the same function and effect. For example, a first instruction and a second instruction are merely different user instructions, and no order is implied. It will be appreciated by those skilled in the art that the words "first", "second" and the like do not limit quantity or execution order, and the items they qualify are not necessarily different.
In this application, the terms "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
The embodiment of the application can be applied to terminals with communication functions, such as mobile phones, tablet computers, personal computers (PCs), personal digital assistants (PDAs), smart watches, netbooks, wearable electronic devices, Augmented Reality (AR) devices, Virtual Reality (VR) devices, vehicle-mounted devices, smart cars, robots, smart glasses, smart televisions and the like.
By way of example, fig. 1 shows a schematic structural diagram of a terminal 100. The terminal 100 may include a processor 110, a display 120, a camera 130, an internal memory 140, a SIM (Subscriber Identity Module) card interface 150, a USB (Universal Serial Bus) interface 160, a charge management module 170, a power management module 171, a battery 172, a sensor module 180, a mobile communication module 190, a wireless communication module 200, an antenna 1, an antenna 2, and the like. The sensor module 180 may include, among other things, a pressure sensor 180A, a fingerprint sensor 180B, a touch sensor 180C, an ambient light sensor 180D, and the like.
It should be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the terminal 100. In other embodiments of the present application, terminal 100 may include more or less components than illustrated, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include a central processing unit (CPU), an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be separate components or may be integrated in one or more processors. In some embodiments, the terminal 100 may also include one or more processors 110. The controller can generate operation control signals according to instruction operation codes and timing signals to control instruction fetching and instruction execution. In other embodiments, a memory may also be provided in the processor 110 for storing instructions and data. Illustratively, the memory in the processor 110 may be a cache. The memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to reuse the instructions or data, they can be called directly from the memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving the efficiency of the terminal 100 in processing data or executing instructions.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (Inter-Integrated Circuit, I2C) interface, an inter-integrated circuit sound (Inter-Integrated Circuit Sound, I2S) interface, a pulse-code modulation (Pulse Code Modulation, PCM) interface, a universal asynchronous receiver/transmitter (Universal Asynchronous Receiver/Transmitter, UART) interface, a mobile industry processor interface (Mobile Industry Processor Interface, MIPI), a general-purpose input/output (General-Purpose Input/Output, GPIO) interface, a SIM card interface, and/or a USB interface, among others. The USB interface 160 is an interface conforming to the USB standard, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface 160 may be used to connect a charger to charge the terminal 100, or may be used to transfer data between the terminal 100 and a peripheral device. The USB interface 160 may also be used to connect headphones through which audio is played.
It should be understood that the interfacing relationship between the modules illustrated in the embodiments of the present application is only an illustrative example and does not constitute a structural limitation on the terminal 100. In other embodiments of the present application, the terminal 100 may also use different interfacing manners, or a combination of multiple interfacing manners in the foregoing embodiments.
The wireless communication function of the terminal 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 190, the wireless communication module 200, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in terminal 100 may be configured to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
Terminal 100 implements display functions through a GPU, display 120, and an application processor, etc. The GPU is a microprocessor for image processing, and is connected to the display 120 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display 120 is used to display images, videos, and the like. The display 120 includes a display panel. The display panel may employ a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (Organic Light-Emitting Diode, OLED), an active-matrix organic light-emitting diode (Active-Matrix Organic Light-Emitting Diode, AMOLED), a flexible light-emitting diode (Flexible Light-Emitting Diode, FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (Quantum Dot Light Emitting Diode, QLED), or the like. In some embodiments, the terminal 100 may include 1 or more displays 120.
In some embodiments of the present application, when the display panel is made of OLED, AMOLED, or FLED, the display 120 in fig. 1 may be folded. Here, "the display 120 may be folded" means that the display may be folded at any angle at any portion and held at that angle; for example, the display 120 may be folded in half from the middle, or folded up and down from the middle.
The display 120 of the terminal 100 may be a flexible screen, which is currently attracting much attention due to its unique characteristics and great potential. Compared with a traditional screen, a flexible screen is highly flexible and bendable, can provide the user with new bending-based interaction modes, and can meet more user requirements on the terminal. For a terminal equipped with a foldable display, the foldable display can be switched at any time between a small screen in the folded configuration and a large screen in the unfolded configuration. Accordingly, users also use the split-screen function more and more frequently on terminals configured with a foldable display.
The terminal 100 may implement a photographing function through an ISP, a camera 130, a video codec, a GPU, a display 120, an application processor, and the like, wherein the camera 130 includes a front camera and a rear camera.
The ISP is used to process the data fed back by the camera 130. For example, when shooting, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing, so that the electric signal is converted into an image visible to naked eyes. The ISP can carry out algorithm optimization on noise, brightness and color of the image, and can optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in the camera 130.
The camera 130 is used to take pictures or videos. An object generates an optical image through the lens, which is projected onto the photosensitive element. The photosensitive element may be a charge-coupled device (Charge Coupled Device, CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as red-green-blue (RGB) or YUV. In some embodiments, the terminal 100 may include 1 or N cameras 130, N being a positive integer greater than 1.
The digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals. For example, when the terminal 100 selects a frequency bin, the digital signal processor is used to fourier transform the frequency bin energy, etc.
Video codecs are used to compress or decompress digital video. The terminal 100 may support one or more video codecs. In this way, the terminal 100 may play or record video in a variety of encoding formats, such as Moving Picture Experts Group (MPEG)-1, MPEG-2, MPEG-3, and MPEG-4.
The NPU is a Neural-Network (NN) computing processor, and can rapidly process input information by referencing a biological Neural Network structure, for example, referencing a transmission mode between human brain neurons, and can also continuously perform self-learning. Applications such as intelligent cognition of the terminal 100 can be implemented by the NPU, for example: image recognition, face recognition, speech recognition, text understanding, etc.
The internal memory 140 may be used to store one or more computer programs, including instructions. The processor 110 may cause the terminal 100 to perform the video matting method provided in some embodiments of the present application, various applications, data processing, and the like by executing the above-described instructions stored in the internal memory 140. The internal memory 140 may include a storage program area and a storage data area. The storage program area can store an operating system; the storage program area may also store one or more applications (such as gallery, contacts, etc.), etc. The storage data area may store data (e.g., photos, contacts, etc.) created during use of the terminal 100, etc. In addition, the internal memory 140 may include high-speed random access memory, and may also include non-volatile memory, such as one or more disk storage units, flash memory units, universal flash memory (Universal Flash Storage, UFS), and the like. In some embodiments, processor 110 may cause terminal 100 to perform the video matting methods provided in embodiments of the present application, as well as other applications and data processing, by executing instructions stored in internal memory 140, and/or instructions stored in a memory provided in processor 110.
The internal memory 140 may be used to store a related program of the video matting method provided in the embodiment of the present application, and the processor 110 may be used to call the related program of the video matting method stored in the internal memory 140 when information is presented, so as to execute the video matting method in the embodiment of the present application.
The sensor module 180 may include a pressure sensor 180A, a fingerprint sensor 180B, a touch sensor 180C, an ambient light sensor 180D, and the like.
The pressure sensor 180A is used to sense a pressure signal, and may convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 120. The pressure sensor 180A may be of various types, such as a resistive pressure sensor, an inductive pressure sensor, or a capacitive pressure sensor. The capacitive pressure sensor may be a device comprising at least two parallel plates of conductive material, the capacitance between the electrodes changing as a force is applied to the pressure sensor 180A, the terminal 100 determining the strength of the pressure based on the change in capacitance. When a touch operation is applied to the display screen 120, the terminal 100 detects the touch operation according to the pressure sensor 180A. The terminal 100 may also calculate the location of the touch based on the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch location, but at different touch operation strengths, may correspond to different operation instructions. For example: executing an instruction for checking the short message when the touch operation with the touch operation intensity smaller than the first pressure threshold acts on the short message application icon; and executing the instruction of newly creating the short message when the touch operation with the touch operation intensity being larger than or equal to the first pressure threshold acts on the short message application icon.
The fingerprint sensor 180B is used to collect a fingerprint. The terminal 100 can utilize the collected fingerprint characteristics to realize the functions of unlocking, accessing an application lock, shooting and receiving an incoming call, and the like.
The touch sensor 180C is also referred to as a touch device. The touch sensor 180C may be disposed on the display 120, and the touch sensor 180C and the display 120 form a touch screen, also referred to as a touch panel. The touch sensor 180C is used to detect a touch operation acting on or near it. The touch sensor 180C may pass the detected touch operation to the application processor to determine the touch event type. Visual output related to the touch operation may be provided through the display 120. In other embodiments, the touch sensor 180C may also be disposed on the surface of the terminal 100 at a location different from that of the display 120.
The ambient light sensor 180D is used to sense the ambient light level. The terminal 100 may adaptively adjust the brightness of the display 120 according to the perceived ambient light level. The ambient light sensor 180D may also be used to automatically adjust the white balance when photographing. The ambient light sensor 180D may also communicate information about the environment in which the device is located to the GPU.
The ambient light sensor 180D is also used to acquire the brightness, lighting ratio, color temperature, etc. of the environment in which the camera 130 captures an image.
Fig. 2 is a software architecture block diagram of a terminal applicable to an embodiment of the present application. The software system of the terminal can adopt a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
The layered architecture divides the software system of the terminal into several layers, each layer having a distinct role and division of work. The layers communicate with each other through a software interface. In some embodiments, the software system may be divided into five layers, an application layer (applications), an application framework layer (application framework), a system library, a hardware abstraction layer (Hardware Abstract Layer, HAL), and a kernel layer (kernel), respectively.
The application layer may include a series of application packages that run applications by calling an application program interface (Application Programming Interface, API) provided by the application framework layer. As shown in FIG. 2, the application package may include applications such as a browser, gallery, music, video, and the like. It will be appreciated that the ports of each of the applications described above may be used to receive data.
The application framework layer provides APIs and programming frameworks for application programs of the application layer. The application framework layer includes a number of predefined functions. As shown in fig. 2, the application framework layer may include a window manager, a content provider, a view system, a resource manager, a notification manager, and a DHCP (Dynamic Host Configuration Protocol ) module, etc.
The system library may include a plurality of functional modules such as a surface manager, a three-dimensional graphics processing library, a two-dimensional graphics engine, a file library, and the like.
The hardware abstraction layer may include a plurality of library modules, such as a display library module, a motor library module, and the like. The terminal system can load a corresponding library module for the equipment hardware, so that the purpose of accessing the equipment hardware by the application program framework layer is achieved.
The kernel layer is a layer between hardware and software. The kernel layer is used for driving the hardware so that the hardware works. The kernel layer at least includes a display driver, an audio driver, a sensor driver, a motor driver, and the like, which is not limited in the embodiment of the present application. It is understood that the display drive, audio drive, sensor drive, motor drive, etc. may be considered a drive node. Each of the drive nodes described above includes an interface that may be used to receive data.
The following describes a video matting scheme provided in the embodiment of the present application.
First, a video matting flow will be described.
A video comprises a plurality of video frames. The video frame to be matted is referred to as a target video frame, and the target video frame may be any video frame in the video that needs to be matted. Video matting refers to extracting, from a video frame, the region where an object is located together with the transparency of the object. The object may be a person, an animal, a vehicle, a lane line, etc.
In the video matting process, a first target video frame is determined; this may be the first video frame of the video or another video frame. The determined target video frame is matted to obtain the region where the object is located in that frame. Then the next video frame after the target video frame is determined as a new target video frame, and the new target video frame is matted. Each time the object region in a target video frame is obtained, the next video frame is taken as the new target video frame, until the result for the last video frame of the video is obtained, thereby completing the matting of the object over the whole video, as sketched below.
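A minimal Python sketch of this frame-by-frame loop is shown below. It assumes a hypothetical PyTorch model whose forward pass takes the current frame together with the recurrent hidden state and returns a transparency mask, a contour mask, and the updated state; the function name, the model signature, and the broadcasting of the masks over the color channels are illustrative assumptions, not the patented implementation.

```python
import torch

def matte_video(frames, model, device="cpu"):
    """frames: iterable of HxWx3 uint8 numpy arrays; model: hypothetical nn.Module."""
    hidden = None  # for the first target frame the hidden state is preset (e.g. zeros inside the model)
    mattes = []
    with torch.no_grad():
        for frame in frames:
            x = torch.from_numpy(frame).permute(2, 0, 1).float().div(255).unsqueeze(0).to(device)
            # hypothetical signature: returns transparency mask, contour mask, updated hidden state
            alpha, contour_mask, hidden = model(x, hidden)
            # region matting: apply both masks to the frame (masks broadcast over the 3 channels)
            mattes.append((alpha * contour_mask * x).squeeze(0).cpu())
    return mattes
```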
Next, an application scenario of the video matting scheme provided by the embodiment of the present application is illustrated.
1. Real-time video scene
In the scene, video matting is carried out on the video to be played to obtain object areas in all video frames in the video, so that when the video is played, only the area content of the object areas in all video frames in the video can be played.
2. Video clip scene
In the scene, after video segmentation is performed on the video to obtain the region where the object in each video frame is located in the video, clipping processing such as background replacement, object erasure, background blurring, color preservation and the like can be performed on the video frame in the video according to the position of the region where the object in the video frame is located, the picture content and other information, so that a new video is obtained. In addition, after the video frames in the video are clipped to obtain a new video, other applications, such as video creation, terminal screen locking and the like, can be realized based on the new video.
In a scene of locking a screen of a terminal, after video is segmented to obtain an area where an object in each video frame is located in the video and the video is clipped to obtain a new video according to information of the area where the object in the video frame is located, dynamic screen locking wallpaper of the terminal can be generated according to picture content of each video frame in the new video, and therefore the dynamic screen locking wallpaper is displayed under the condition that the terminal is in a screen locking state.
3. Video monitoring scene
In the scene, after the monitoring equipment collects the video of the specific area, the object in the specific area can be detected by carrying out video matting on the video.
Next, a video segmentation scheme provided in the embodiment of the present application is described in detail below through specific embodiments.
In an embodiment of the present application, referring to fig. 3, a flowchart of a first video matting method is provided, and in this embodiment, the method includes the following steps S301 to S305.
Step S301: and carrying out information compression on a target video frame in the video to obtain a first image characteristic.
The target video frame may be any video frame among video frames included in the video.
Information compression of a target video frame can be understood as: performing feature extraction on the target video frame to obtain a first image feature whose scale is smaller than that of the target video frame. Feature extraction on the target video frame can extract edge information of the content in the image, and this edge information can reflect the region where the object in the video frame is located.
In addition, when performing feature extraction on the target video frame, multiple cascaded feature extractions may be performed, and the scale of the obtained features becomes smaller and smaller as the number of feature extractions increases. In terms of scale, the larger the scale of the first image feature, the more detailed edge information it contains, which in some cases is not conducive to determining the region where the object in the video frame is located; conversely, the smaller the scale of the first image feature, the more macroscopic edge information it contains, which is more conducive to determining the region where the object in the video frame is located.
Furthermore, the number of dimensions of the first image feature may be the same as that of the target video frame. That is, since the target video frame is a two-dimensional image, the first image feature may be two-dimensional data, in which case the first image feature can be regarded as a feature map.
In one embodiment of the present application, when the target video frame is compressed, the information may be implemented based on a coding manner, for example, based on a coding network.
In another embodiment of the present application, the target video frame may be compressed by performing convolution transformation on the target video frame. In the process of carrying out convolution transformation on the target video frame, the target video frame can be subjected to convolution transformation for a plurality of times, so that the scale of the feature obtained by the convolution transformation is continuously reduced.
In addition, the target video frame may be subjected to information compression in combination with convolution transformation, linear transformation, batch normalization processing, nonlinear transformation, and the like, and specifically, see steps S301A to S301E in the embodiment shown in fig. 10, which will not be described in detail herein.
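As one possible reading of the convolution-based compression described above, a minimal PyTorch encoder sketch is given below; the number of stages, channel widths, and the combination of strided convolution, batch normalization, and ReLU are illustrative assumptions, not the patented network.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of step S301: cascaded strided convolutions shrink the spatial scale."""
    def __init__(self, in_ch=3, base=16):
        super().__init__()
        self.stages = nn.Sequential(
            nn.Conv2d(in_ch, base, 3, stride=2, padding=1), nn.BatchNorm2d(base), nn.ReLU(),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.BatchNorm2d(base * 2), nn.ReLU(),
            nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.BatchNorm2d(base * 4), nn.ReLU(),
        )

    def forward(self, frame):          # frame: (N, 3, H, W)
        return self.stages(frame)      # first image feature: (N, 64, H/8, W/8)
```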
Step S302: and dividing the target video frame based on the first image characteristics to obtain the characteristics of the contour mask image of the object obtained in the dividing process and the first contour mask image of the object in the target video frame.
The first contour mask image is obtained by segmenting the target video frame, so there is a correspondence between pixel points in the first contour mask image and pixel points in the target video frame. The first contour mask image can be understood as a binary image indicating the position of the region where the object is located in the target video frame. Since the region indicated by the first contour mask image is determined according to the rough contour of the object, it can be regarded as the approximate region where the object is located.
For example, the first contour mask image may be a mask image of pixel points including two pixel values of 0 and 1. In the target video frame, the pixel corresponding to the pixel with the pixel value of 1 in the first contour mask image may be a pixel in the region where the object is located, and the pixel corresponding to the pixel with the pixel value of 0 in the first contour mask image may be a pixel in a region other than the region where the object is located.
Specifically, the target video frame may be segmented by any of three implementations.
In a first implementation, the target video frame may be segmented based on the first image feature by the segmentation method mentioned in the embodiment shown in fig. 6, which is not described in detail herein.
In a second implementation, the target video frame may be segmented by using a pre-trained network model based on the first image feature, such as a contour feature generation network and a first image generation network in the video segmentation model in the embodiment shown in fig. 11, which will not be described in detail herein.
In a third implementation, the target video frame may be segmented using an image segmentation algorithm, a video segmentation algorithm, or the like, based on the first image features.
Step S303: and carrying out feature reconstruction based on the features of the first image and the features of the obtained outline mask image, fusing the reconstructed features with the first hidden state information of the video, and updating the first hidden state information to obtain a fusion result.
Wherein the first hidden state information characterizes: and carrying out fusion characteristics of transparency mask images of object edges in the video frames of the matting before the target video frames.
The object edge in the video frame may reflect the image content details of the object in the video frame, and the image content details may include the specific position of the region where the object is located and the transparency of the image content of the object represented by the pixel point in the region where the object is located.
A transparency mask image of an object edge in a video frame may be understood as a mask image indicating details of the image content of the object in the video frame, and pixel values of pixel points in the transparency mask image may represent transparency of the image content of the object presented by pixel points in the same position in the video frame.
For example, the pixel value range of the pixel points in the transparency mask image may be 0-1. If the pixel value of the pixel point in the transparency mask image is 0, the pixel point with the same position in the video frame belongs to an area outside the area where the object is located, and the image content presented by the pixel point with the same position does not contain the image content of the object; if the pixel value of the pixel point in the transparency mask image is 0.6, the pixel point with the same position in the video frame belongs to the area where the object is located, and the transparency of the object image content presented by the pixel point with the same position is 0.6; if the pixel value of the pixel point in the transparency mask image is 1, the pixel point with the same position in the video frame belongs to the area where the object is located, and the image content presented by the pixel point with the same position is the image content of the object.
The video frames matted before the target video frame mentioned in this step include at least two video frames; of course, all the video frames matted before the target video frame may also be used.
In the case that the target video frame is the first video frame of the video, there is no video frame matted before the target video frame. In this case, the first hidden state information may be preset data, for example, preset all-zero data.
Specifically, the first hidden state information may be represented in a tensor form or may be represented in a matrix form.
As can be seen from the description for step S301, the first image feature is a feature that becomes smaller in scale relative to the target video frame, and the first image feature can reflect the region where the object is located in the target video frame. In order to successfully extract the region where the object is located from the target video frame, feature mapping is required to be performed on the small-scale first image features, and the final purpose is to map to the target video frame, so as to obtain the region where the object is located in the target video frame. In view of this, the upsampling process is required for the above-described first image feature.
Specifically, feature reconstruction is performed based on the first image feature to reconstruct a feature of increased scale, and the reconstructed feature is then fused with the first hidden state information to obtain a fusion result. The first hidden state information characterizes the fused features of the transparency mask images of the object edges in the video frames before the target video frame; that is, it can characterize information about the object edges in the preceding video frames, and this edge information can contain the specific position of the region where the object is located. After the reconstructed feature and the first hidden state information are fused, the obtained fusion result can reflect the region where the object is located in the target video frame, and can also be used to adjust that region by taking into account the object regions in the preceding video frames, thereby ensuring smoothness, or temporal correlation, of the object region between adjacent video frames.
Because the first hidden state information also needs to be used when subsequent video frames are matted, the first hidden state information needs to be updated based on the information of the object in the target video frame. Specifically, the hidden state information may be updated based on the first image feature, or based on the fusion result; for example, the result of fusing the reconstructed feature and the first hidden state information may be used as the new first hidden state information.
Specifically, when feature reconstruction is performed based on the first image feature, an upsampling algorithm may be used to transform the first image feature to obtain a reconstructed first image feature; a deconvolution transformation may be applied to the first image feature to obtain a reconstructed first image feature; or the first image feature may be reconstructed based on a decoding network, for example, the decoder portion of a U-Net network architecture.
The reconstructed first image feature and the first hidden state information may be fused in any of the following two implementations.
In the first implementation manner, a fusion algorithm, a network and the like can be utilized to fuse the reconstructed first image features and the first hidden state information, so as to obtain a fusion result.
For example, the reconstructed first image feature and the first hidden state information may be fused by using a long short-term memory (Long Short-Term Memory, LSTM) network, a gated recurrent unit (Gated Recurrent Unit, GRU), or the like, so as to obtain a fusion result.
In the second implementation manner, the reconstructed first image feature and the first hidden state information can be directly subjected to operations such as superposition (element-wise addition), concatenation, or element-wise (dot) multiplication, and the processing result is used as the fusion result.
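A minimal sketch of a fuse-and-update cell in the spirit of the first implementation (a GRU-style gate over the reconstructed feature and the hidden state, with the output reused as the new hidden state) is shown below; the convolutional gating and all shapes are assumptions rather than the patented design.

```python
import torch
import torch.nn as nn

class FuseAndUpdate(nn.Module):
    """ConvGRU-style fusion of a reconstructed feature with the hidden state."""
    def __init__(self, ch):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 2 * ch, 3, padding=1)   # update and reset gates
        self.cand = nn.Conv2d(2 * ch, ch, 3, padding=1)        # candidate state

    def forward(self, feat, hidden):
        if hidden is None:                     # first frame: preset all-zero hidden state
            hidden = torch.zeros_like(feat)
        z, r = torch.sigmoid(self.gates(torch.cat([feat, hidden], 1))).chunk(2, 1)
        cand = torch.tanh(self.cand(torch.cat([feat, r * hidden], 1)))
        new_hidden = (1 - z) * hidden + z * cand
        # the fusion result doubles as the updated first hidden state information
        return new_hidden, new_hidden
```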
Other implementations of the above step S303 may be referred to in the following embodiments, which are not described in detail herein.
Step S304: and obtaining a target transparency mask image of the edge of the object in the target video frame based on the fusion result.
The above-mentioned target transparency mask image may be understood as a mask image indicating details of image content of an object in the target video frame, and a pixel value range of a pixel point in the target transparency mask image may be 0-1, and a scale thereof is the same as a scale of the target video frame.
In an implementation manner of the present application, the fusion result may include a confidence that each pixel point in the target video frame belongs to the object. In this case, after the above-described fusion result is obtained, the target transparency mask image may be obtained according to the confidence level corresponding to each pixel included in the fusion result.
For example, the confidence corresponding to each pixel point in the fusion result can be used as the pixel value of each pixel point at the same position in the mask image, so as to obtain the target transparency mask image.
For another example, a first threshold and a second threshold may be preset, where the first threshold is greater than the second threshold. If the confidence coefficient corresponding to the pixel point included in the fusion result is greater than or equal to a first threshold value, the confidence coefficient corresponding to the pixel point is close to 1, the confidence coefficient of the pixel point belonging to the object is higher, and at the moment, the pixel value of the pixel point at the same position in the mask image can be determined to be 1; if the confidence coefficient corresponding to the pixel point included in the fusion result is smaller than or equal to a second threshold value, the confidence coefficient corresponding to the pixel point is close to 0, the confidence coefficient of the pixel point belonging to the object is lower, and at the moment, the pixel value of the pixel point at the same position in the mask image can be determined to be 0; if the confidence coefficient corresponding to the pixel point contained in the fusion result is located between the second threshold value and the first threshold value, the confidence coefficient corresponding to the pixel point can be mapped into the 0-1 interval according to the mapping relation between the confidence coefficient interval from the second threshold value to the first threshold value and the interval 0-1, and the obtained mapped numerical value is the pixel value of the pixel point at the same position in the mask image.
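A minimal sketch of this two-threshold mapping is given below, assuming the fusion result is a per-pixel confidence map and that the threshold values `t1` and `t2` are hypothetical.

```python
import torch

def confidences_to_alpha(conf, t1=0.9, t2=0.1):
    """conf: (H, W) confidences; t1 > t2. Returns a transparency mask with values in 0-1."""
    alpha = (conf - t2) / (t1 - t2)   # map the (t2, t1) interval linearly onto 0-1
    return alpha.clamp(0.0, 1.0)      # confidences >= t1 become 1, confidences <= t2 become 0
```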
In another implementation manner of the present application, the above-mentioned target transparency mask image may be obtained through steps S304A-S304C in the embodiment shown in fig. 5.
Step S305: and carrying out regional matting on the target video frame according to the target transparency mask image and the first contour mask image to obtain a matting result.
Specifically, the first contour mask image indicates an approximately located area of an object in the target video frame, and the target transparency mask image indicates image content details of the object in the target video frame, so that according to the target transparency mask image and the first contour mask image, region matting can be performed on the target video frame by combining the approximately located area of the object in the target video frame and the image content details, and a matting result is obtained.
In one implementation manner, the target transparency mask image and the first contour mask image can be multiplied element-wise, pixel position by pixel position, to obtain a combined mask image; the combined mask image and the target video frame are then again multiplied element-wise by pixel position, and the result of this multiplication is used as the matting result, thereby realizing region matting of the target video frame, as sketched below.
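A minimal sketch of this element-wise matting step, assuming the frame, the transparency mask, and the contour mask are already aligned tensors of the same spatial size:

```python
def region_matting(frame, alpha, contour_mask):
    """frame: (3, H, W); alpha: (H, W) transparency mask; contour_mask: (H, W) binary mask."""
    matte = alpha * contour_mask           # point-wise product of the two masks
    return frame * matte.unsqueeze(0)      # apply the combined mask to every color channel
```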
In addition, referring to fig. 4, a schematic diagram of a matting result from a target video frame to a target transparency mask image and a first contour mask image when a plurality of video frames are taken as the target video frame is shown. In fig. 4, the first line of images is a plurality of target video frames; the second row of images are first contour mask images corresponding to the target video frames; the third row of images are target transparency mask images corresponding to a plurality of target video frames; the fourth row of images are matting results corresponding to a plurality of target video frames.
In view of the above, in the solution provided by this embodiment, when the target video frame is segmented based on the first image feature, the features of the contour mask image of the object produced during segmentation are obtained, and these features characterize the contour of the object in the contour mask image. The first hidden state information characterizes the fused features of the transparency mask images of the object edges in the video frames matted before the target video frame. Therefore, when feature reconstruction is performed based on the first image feature and the obtained features of the contour mask image and the result is fused with the first hidden state information, the resulting fusion result incorporates not only the contour information of the object in the contour mask image, but also the information of the object edges in the previously matted video frames. Because video frames in a video usually have temporal correlation, when the target transparency mask image is obtained based on the fusion result, information about the object in temporally correlated video frames is taken into account in addition to the target video frame itself. This improves the accuracy of the obtained target transparency mask image, so that the object region can be accurately matted from the target video frame according to the target transparency mask image and the first contour mask image. Therefore, by applying the video matting scheme provided by the embodiments of the present application, the accuracy of video matting can be improved.
In addition, when the target transparency mask image is obtained, the fused features of the transparency mask images of the object edges in the video frames matted before the target video frame are considered; that is, image information of the object edges in previous video frames is considered, not only the image information of the target video frame. This improves the inter-frame smoothness of the changes of the object edge region in the target transparency mask images corresponding to the video frames of the video, and thus the inter-frame smoothness of the changes of the object edge region in the corresponding matting results. Moreover, when the video is a video of a moving object, the information of the already-matted video frames is considered when matting the target video frame, so that the smoothness between the matting result of the target video frame and the matting results of the previously matted video frames is higher.
Other implementations of obtaining the target transparency mask image in step S304 are described below.
In an embodiment of the present application, referring to fig. 5, a flowchart of a second video matting method is provided, and in this embodiment, the above step S304 may be implemented by the following steps S304A-S304C.
Step S304A: and obtaining a second contour mask image of the object in the target video frame based on the fusion result.
The second contour mask image may be a binary image, and the scale of the second contour mask image is the same as that of the target video frame.
In one implementation manner of the present application, as can be seen from the description of the above step S304, the confidence that each pixel point in the target video frame belongs to the object may be included in the above fusion result, and in this case, after the above fusion result is obtained, binarization processing may be performed on the fusion result based on a preset third threshold, so as to obtain the second contour mask image.
When the binarization processing is performed, a value greater than the third threshold value in the fusion result may be set to 0, and a value not greater than the third threshold value may be set to 1. Of course, a value smaller than the third threshold value in the fusion result may be set to 0, and a value not smaller than the third threshold value may be set to 1. The embodiments of the present application are not limited thereto.
Step S304B: and fusing the second contour mask image and the fusion result to obtain the target fusion characteristic.
Specifically, the second contour mask image and the fusion result may be fused by any one of the following two implementation manners.
In the first implementation manner, the second contour mask image and the fusion result can be fused by using a fusion algorithm, a network and the like, so as to obtain the fused target fusion characteristic.
In the second implementation manner, the second contour mask image and the fusion result can be directly subjected to operations such as superposition, concatenation, or element-wise (dot) multiplication, and the processing result is used as the target fusion feature. A minimal sketch of steps S304A and S304B is given below.
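The sketch below assumes the fusion result is a single-channel confidence map, that binarization uses the "greater than the third threshold means object" polarity (the text notes either polarity is possible), and that concatenation is chosen as the fusion operation; the threshold value is hypothetical.

```python
import torch

def target_fusion_feature(fusion_result, third_threshold=0.5):
    """fusion_result: (N, 1, H, W). Returns the second contour mask and the target fusion feature."""
    second_contour_mask = (fusion_result > third_threshold).float()      # step S304A: binarization
    target_feature = torch.cat([second_contour_mask, fusion_result], 1)  # step S304B: concatenation
    return second_contour_mask, target_feature
```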
Step S304C: and obtaining a target transparency mask image of the edge of the object in the target video frame based on the target fusion characteristic.
The target fusion feature is obtained by fusing the second contour mask image and the fusion result. The second contour mask image indicates the approximate region where the object is located in the target video frame, and the fusion result may include the confidence that each pixel point in the target video frame belongs to the object. Therefore, the target fusion feature obtained by fusing the two may include the confidence that each pixel point within the approximate object region belongs to the object, so that the target transparency mask image can be obtained based on the target fusion feature and the confidence corresponding to each pixel point it includes.
Specifically, the implementation manner of obtaining the target transparency mask image based on the target fusion feature may refer to the implementation manner of obtaining the target transparency mask image based on the fusion result in step S304, which is not described herein.
As can be seen from the foregoing, in the solution provided by this embodiment, the second contour mask image of the object is obtained based on the fusion result, and this second contour mask image can indicate the region in the target video frame where the object, as determined by its rough contour, is located. The second contour mask image is then fused with the fusion result to obtain the target fusion feature, which only needs to attend to the image detail information within the approximate region where the object is located. Therefore, based on the target fusion feature, the target transparency mask image of the object edge can be obtained accurately while attending only to the image content of the approximate object region, and performing video matting based on this target transparency mask image can improve the accuracy of video matting.
A first implementation of dividing the target video frame mentioned in the above step S302 is described below.
In an embodiment of the present application, referring to fig. 6, a flowchart of a first feature processing method is provided, where the feature processing process shown in fig. 6 includes the segmentation process of step S302 and the reconstruction fusion process of step S303, and the number of times of feature reconstruction included in the segmentation process is the same as the number of times of transparency information fusion included in the reconstruction fusion process.
In fig. 6, the segmentation process includes two feature reconstructions and the reconstruction fusion process includes two transparency information fusions. In addition, the number of feature reconstructions included in the segmentation process and the number of transparency information fusions included in the reconstruction fusion process are also other numbers, such as 3, 4, 5, etc., which are not limited in this embodiment.
The following describes the segmentation process in step S302 and the reconstruction fusion process in step S303, respectively, with reference to fig. 6.
First, regarding the segmentation process of step S302: when the target video frame is segmented based on the first image feature, cascaded feature reconstruction may be performed based on the first image feature to obtain features of contour mask images of the object with sequentially increasing scales, and the first contour mask image of the object in the target video frame is obtained based on the feature obtained by the last reconstruction.
Cascaded feature reconstruction can be understood as multiple feature reconstructions, where the result of each feature reconstruction is the feature of a contour mask image at one scale. The input to the first feature reconstruction is the first image feature, and the input to each subsequent feature reconstruction is the feature obtained by the previous feature reconstruction.
In fig. 6, the cascading feature reconstruction process includes two feature reconstructions. In the first feature reconstruction process, the object subjected to feature reconstruction is a first image feature, and after feature reconstruction is performed on the first image feature, the outline feature 1 of the outline mask image with the increased scale can be obtained, and at this time, the first feature reconstruction process is ended.
In the second feature reconstruction process, the object to be subjected to feature reconstruction is the contour feature 1, after the contour feature 1 is subjected to feature reconstruction, the contour feature 2 of the contour mask image with the scale increased again can be obtained, at this time, the second feature reconstruction process is finished, and the contour feature 2 is the feature obtained by the last processing, so that the first contour mask image of the object in the target video frame can be obtained based on the contour feature 2.
The implementation of each feature reconstruction and the implementation of obtaining the first contour mask image based on the features obtained from the last processing are described below.
In each feature reconstruction process, feature reconstruction may be achieved by either of the following two implementations.
In the first implementation manner, the feature reconstruction may be implemented in a manner of feature reconstruction, feature fusion and update, which are mentioned in the subsequent embodiments, which will not be described in detail here.
In a second implementation, the input to each feature reconstruction may be processed using an upsampling algorithm, a deconvolution transformation, a feature decoding network, or the like.
The feature obtained by the last feature reconstruction may include the confidence that each pixel point in the target video frame belongs to the object. In this case, after the feature is obtained, binarization processing may be performed on it based on a preset fourth threshold to obtain the first contour mask image. A minimal sketch of this cascaded segmentation branch is given below.
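The sketch assumes the two cascaded reconstructions of fig. 6 are implemented as deconvolutions and that the final feature is squashed with a sigmoid before thresholding; channel counts and the threshold value are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContourBranch(nn.Module):
    """Sketch of the segmentation process of step S302 with two cascaded reconstructions."""
    def __init__(self, in_ch=64):
        super().__init__()
        self.up1 = nn.ConvTranspose2d(in_ch, in_ch // 2, 4, stride=2, padding=1)   # -> contour feature 1
        self.up2 = nn.ConvTranspose2d(in_ch // 2, 1, 4, stride=2, padding=1)       # -> contour feature 2

    def forward(self, first_image_feature, fourth_threshold=0.5):
        c1 = torch.relu(self.up1(first_image_feature))
        c2 = self.up2(c1)                                   # feature from the last reconstruction
        first_contour_mask = (torch.sigmoid(c2) > fourth_threshold).float()
        return c1, c2, first_contour_mask
```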
Next, regarding the reconstruction and fusion process of step S303: in this embodiment, the first hidden state information includes a plurality of pieces of first sub-hidden state information, each of which represents the fused feature of a transparency mask image at one scale. The pieces of first sub-hidden state information may represent the fused features of transparency mask images with sequentially increasing scales; for example, the first hidden state information may include three pieces of first sub-hidden state information, which represent the fused features of transparency mask images with scales of 24×24, 28×28, and 32×32, respectively.
In this case, when feature reconstruction, fusion, and updating are performed based on the first image feature and the obtained features of the contour mask image, a preset number of transparency information fusions may be performed in the following manner, and the feature obtained by the last fusion is determined as the fusion result:
feature reconstruction is performed based on the first target feature and a second target feature among the obtained features of the contour mask image to obtain a second image feature of increased scale; the second image feature is fused with the corresponding first sub-hidden state information in the first hidden state information, and that first sub-hidden state information is updated, to obtain a third image feature.
The preset number is preset, and is the same as the number of reconfiguration times of the cascade feature reconfiguration.
The first target feature is the first image feature when information fusion is performed for the first time, and the first target feature is the feature obtained by information fusion performed last time when information fusion is performed for other times.
The scale of the first target feature is the same as the scale of the second target feature.
The scale of the transparency mask image corresponding to the fusion feature represented by the first sub-state information is the same as the scale of the second image feature.
In fig. 6, the preset number is 2, that is, the transparency information fusion process is performed twice. In the first transparency information fusion process, the first target feature is a first image feature, the second target feature is the contour feature 1, the first image feature is the same as the contour feature 1 in scale, feature reconstruction is performed based on the first target feature and the second target feature, that is, feature reconstruction is performed based on the first image feature and the contour feature 1, a second image feature 1 with increased scale is obtained, the second image feature 1 and first sub-state information 1 corresponding to the second image feature 1 are fused, the first sub-state information 1 is updated, and a third image feature 1 is obtained, at this time, the first transparency information fusion process is ended.
And after the third image feature 1 in the first transparency information fusion process is obtained, carrying out second transparency information fusion. In the second transparency information fusion process, the first target feature is the third image feature 1, the second target feature is the contour feature 2, the third image feature 1 and the contour feature 2 have the same scale, feature reconstruction is performed based on the first target feature and the second target feature, that is, feature reconstruction is performed based on the third image feature 1 and the contour feature 2, a second image feature 2 with a further increased scale is obtained, the second image feature 2 and the first sub-state information 2 corresponding to the second image feature 2 are fused, the first sub-state information 2 is updated, and the third image feature 2 is obtained, at this time, the second transparency information fusion process is ended, and the third image feature 2 is the finally obtained fusion result.
For each implementation of feature reconstruction based on the first target feature and the second target feature, refer to the implementation of feature reconstruction based on the first image feature and the feature of the contour mask image in step S303 in the embodiment shown in fig. 3; for each implementation manner of fusing the second image feature and the first sub-state information and updating the first sub-state information, refer to the implementation manner of fusing the reconstructed feature and the first hidden state information and updating the first hidden state information in step S303 in the embodiment shown in fig. 3, which is not described herein.
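A minimal sketch of a single transparency information fusion step of fig. 6 is given below, using the simpler fusion option described earlier (superposition, with the fusion result reused as the updated first sub-hidden state). The concatenation of the two same-scale inputs, the deconvolution layer, and the assumption that both inputs have the same channel count are illustrative choices.

```python
import torch
import torch.nn as nn

class TransparencyFusionStep(nn.Module):
    """One transparency information fusion: reconstruct, fuse with sub-hidden state, update it."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # feature reconstruction from the concatenated first and second target features
        self.reconstruct = nn.ConvTranspose2d(2 * in_ch, out_ch, 4, stride=2, padding=1)

    def forward(self, first_target, second_target, first_sub_hidden):
        x = torch.cat([first_target, second_target], dim=1)      # same-scale features of both branches
        second_image_feature = torch.relu(self.reconstruct(x))   # scale increased
        if first_sub_hidden is None:                             # first frame: preset (all-zero) state
            first_sub_hidden = torch.zeros_like(second_image_feature)
        third_image_feature = second_image_feature + first_sub_hidden   # superposition fusion
        new_first_sub_hidden = third_image_feature               # fusion result becomes the new state
        return third_image_feature, new_first_sub_hidden
```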
As can be seen from the above, in the scheme provided by the embodiment, based on the first image feature, the target video frame is segmented in a manner of cascade feature reconstruction, and the cascade feature reconstruction includes multiple feature reconstruction processes, so that the accuracy of the first contour mask image can be improved; and after the first image features are obtained, carrying out multiple transparency information fusion, wherein each transparency information fusion process comprises three processing processes of feature reconstruction, feature and hidden state information fusion and updating of the hidden state information, so that the accuracy of a finally obtained fusion result can be improved, region matting of a target video frame can be realized based on a more accurate first contour mask image and the fusion result, the accuracy of region matting can be improved, and the accuracy of video matting can be further improved.
The data volume of the two features required for feature reconstruction in each transparency information fusion is generally large, so the computation required for feature reconstruction is large, and the computation required for the preset number of transparency information fusions is correspondingly large.
In view of this, in one embodiment of the present application, when performing feature reconstruction based on the first target feature and the second target feature in the features of the obtained outline mask image, the characteristic features of the object edge may be screened from the second target feature in the features of the obtained outline mask image, and the feature reconstruction may be performed based on the first target feature and the characteristic features, so as to obtain the second image feature with an increased scale.
The characteristic features are the features contained in the second target feature that characterize the edge details of the object in the target video frame.
In the process of feature reconstruction based on the first image feature, the attribute of the object represented by the features of the reconstructed contour mask images can be determined. Therefore, in the transparency information fusion process, after the second target feature with the same scale as the first target feature is determined among the features of the contour mask images, the features in the second target feature that characterize the object edge can be determined according to the attribute of the object characterized by the second target feature, and these features are extracted from the second target feature; the extracted features are the characteristic features.
Specifically, after determining that the second target feature has a characteristic feature on the edge of the object, a feature screening algorithm, a feature screening network, or an attention mechanism may be used to extract the characteristic feature from the second target feature.
After the characteristic features are screened out, the implementation manner of performing feature reconstruction based on the first target features and the characteristic features may be referred to the implementation manner of performing feature reconstruction based on the first target features and the second target features in the foregoing embodiment, which is not described herein.
Referring to fig. 7, fig. 7 is adjusted based on fig. 6. The specific adjustment content is as follows: in the first transparency information fusion process, when the feature reconstruction is carried out based on the first image feature and the outline feature 1, the characteristic feature 1 is firstly screened out from the outline feature 1, and then the feature reconstruction is carried out based on the first image feature and the characteristic feature 1, so that the second image feature 1 with the increased scale is obtained. In the second transparency information fusion process, when the feature reconstruction is carried out based on the third image feature 1 and the outline feature 2, the characteristic feature 2 is firstly screened out from the outline feature 2, and then the feature reconstruction is carried out based on the third image feature 1 and the characteristic feature 2, so that the second image feature 2 with the scale increased again is obtained.
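A minimal sketch of screening characteristic edge features from the second target feature with a simple spatial attention mechanism is shown below; attention is only one of the options listed above (feature screening algorithm, feature screening network, or attention mechanism), and the 1×1 scoring convolution is an assumption.

```python
import torch
import torch.nn as nn

class EdgeFeatureScreen(nn.Module):
    """Screen the characteristic (edge) features from a contour feature via spatial attention."""
    def __init__(self, ch):
        super().__init__()
        self.score = nn.Conv2d(ch, 1, 1)     # per-location relevance to the object edge

    def forward(self, second_target_feature):
        attn = torch.sigmoid(self.score(second_target_feature))
        # characteristic features: edge-relevant locations are kept, the rest are suppressed
        return second_target_feature * attn
```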
From the above, in the scheme provided by the embodiment, in the process of performing feature reconstruction based on the first target feature and the second target feature, the characteristic feature with smaller data size is screened from the second target feature, so that the feature reconstruction is performed based on the first target feature and the characteristic feature, the calculation amount of the feature reconstruction can be reduced, the calculation amount of the preset number of transparency information fusion can be reduced, the efficiency of obtaining the fusion result is improved, and the efficiency of video matting can be improved.
The following describes the implementation of feature reconstruction in the manner of feature reconstruction, feature fusion and update mentioned in the embodiment shown in fig. 6.
In an embodiment of the present application, after the first image feature is obtained, a preset number of contour information fusions may be performed in the following manner to obtain features of contour mask images of the object with sequentially increasing scales, where one contour information fusion corresponds to one feature reconstruction:
feature reconstruction is performed based on the third target feature to obtain a fourth image feature of increased scale; the fourth image feature is fused with the corresponding second sub-hidden state information in the second hidden state information, and that second sub-hidden state information is updated, to obtain the feature of the contour mask image of the object.
The third target feature is the first image feature when information fusion is carried out for the first time, and the third target feature is the feature obtained by the last feature reconstruction when information fusion is carried out for other times.
The second hidden state information characterizes: the fused features of the contour mask images of the object in the video frames segmented before the target video frame.
The second hidden state information includes a plurality of pieces of second sub-hidden state information, each of which represents the fused feature of a contour mask image at one scale. The pieces of second sub-hidden state information may represent the fused features of contour mask images with sequentially increasing scales; for example, the second hidden state information may include three pieces of second sub-hidden state information, which represent the fused features of contour mask images with scales of 24×24, 28×28, and 32×32, respectively.
The scale of the outline mask image corresponding to the fusion characteristic represented by the second sub-state information is the same as that of the fourth image characteristic. For example, if the second hidden state information includes three second sub-hidden state information, and the three second sub-hidden state information represents fusion features of contour mask images with scales of 24×24, 28×28 and 32×32 in sequence, the preset number is three, three fusion needs to be performed, and the scales of fourth image features obtained in the three fusion processes are 24×24, 28×28 and 32×32 in sequence.
The contour information fusion process is described below with reference to fig. 8, with the preset number of 3.
Referring to fig. 8, a schematic flow chart of a cascade feature reconstruction method is provided. In fig. 8, after the first image feature is obtained, first contour information fusion is performed. In the first contour information fusion process, the third target feature is the first image feature, feature reconstruction is performed based on the third target feature, that is, feature reconstruction is performed based on the first image feature, a fourth image feature 1 with an increased scale is obtained, the fourth image feature 1 and the second sub-state information 1 corresponding to the fourth image feature 1 are fused and the second sub-state information 1 is updated, and the contour feature 3 of the contour mask image of the object is obtained, and at this time, the first contour information fusion process is ended.
And after the contour feature 3 is obtained, carrying out second contour information fusion. In the second contour information fusion process, the third target feature is the contour feature 3, feature reconstruction is performed based on the third target feature, that is, feature reconstruction is performed based on the contour feature 3, a fourth image feature 2 with a further increased scale is obtained, the fourth image feature 2 and the second sub-state information 2 corresponding to the fourth image feature 2 are fused and the second sub-state information 2 is updated, and the contour feature 4 of the contour mask image of the object is obtained, and at this time, the second contour information fusion process is ended.
After the contour feature 4 is obtained, third contour information fusion is carried out. In the third contour information fusion process, the third target feature is the contour feature 4, and feature reconstruction is performed based on the third target feature, that is, based on the contour feature 4, so as to obtain the fourth image feature 3 with a further increased scale. The fourth image feature 3 and the second sub-state information 3 corresponding to the fourth image feature 3 are fused and the second sub-state information 3 is updated, so that the contour feature 5 of the contour mask image of the object is obtained. At this time, the third contour information fusion process is ended, and the obtained contour feature 5 is the finally obtained feature of the contour mask image of the object.
The implementation manner of performing the feature reconstruction based on the third target feature in each contour information fusion process may refer to the implementation manner of performing the feature reconstruction based on the first image feature in step S303 in the embodiment shown in fig. 3; the implementation manner of fusing the fourth image feature and the second sub-state information and updating the second sub-state information may refer to the implementation manner of fusing the reconstructed first image feature and the reconstructed first hidden state information and updating the first hidden state information in step S303 in the embodiment shown in fig. 3, which is not described herein.
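For illustration, assuming the reconstruction blocks and fusion cells of fig. 8 are available as callables (hypothetical names; the loop below is a sketch, not the patent's implementation), the cascade of contour information fusions can be written as:

```python
def contour_info_fusion(first_image_feature, reconstruct_blocks, fuse_cells, hidden_states):
    """Sketch of the preset number of contour information fusions in fig. 8.
    reconstruct_blocks[i] performs feature reconstruction (scale increases);
    fuse_cells[i] fuses the result with the i-th second sub-hidden state and
    updates that state. Both are hypothetical callables."""
    feat = first_image_feature            # third target feature of the first fusion
    for i, (reconstruct, fuse) in enumerate(zip(reconstruct_blocks, fuse_cells)):
        fourth_image_feature = reconstruct(feat)                  # scale-increased feature
        feat, hidden_states[i] = fuse(fourth_image_feature, hidden_states[i])
        # feat (the contour feature) becomes the third target feature of the next fusion
    return feat, hidden_states            # feature of the contour mask image of the object
```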
In view of the foregoing, in the solution provided in this embodiment, the second hidden state information characterizes the fusion feature of the contour mask image of the object in the video frame that is segmented before the target video frame, and each second sub-hidden state information characterizes the fusion feature of the contour mask image of one scale, so that the fourth image feature obtained by reconstructing the feature in each contour information fusion process is fused with the second sub-state information, and the information in the fusion feature of the contour mask image of one scale of the object in the video frame that is segmented before the target video frame is fused with the fourth image feature, so that the finally obtained feature of the contour mask image of the object not only includes the information of the object in the target video frame but also includes the information of the object in the video frame that is segmented before the target video frame, thereby obtaining the first contour mask image based on the finally obtained feature, and further improving the accuracy of the first contour mask image.
When feature reconstruction is carried out in the contour information fusion process, the feature reconstruction may be performed based on the third target feature alone, or may be performed by combining the third target feature with other information.
In one embodiment of the present application, the first image feature includes a plurality of first sub-image features. When information compression is carried out on a target video frame in a video, cascade information compression can be carried out on the target video frame, and each first sub-image feature with the sequentially reduced scale is obtained.
The cascade information compression can be understood as multiple information compression, wherein the result of each information compression is a first sub-image feature, the object of the first information compression is a target video frame, and the objects of other information compression are the first sub-image features obtained by the last information compression.
For each implementation of information compression, reference may be made to the implementation of information compression on the target video frame in step S301 shown in fig. 3.
For example, the information-compressed object may be subjected to multiple convolution transformations each time the information is compressed.
For another example, each time information is compressed, the information compression may be achieved by performing one or more subsequent processes shown in steps S301A to S301E in the embodiment shown in fig. 10.
After each first sub-image feature with the sequentially reduced scale is obtained, the preset number of times of contour information fusion can be performed based on each first sub-image feature.
When feature reconstruction is carried out in the first contour information fusion process, the first sub-image feature with the smallest scale among the first sub-image features may be used as the third target feature, and feature reconstruction is carried out based on this third target feature.
When feature reconstruction is carried out in the other contour information fusion processes, the feature obtained by the previous contour information fusion may be used as the third target feature, and feature reconstruction is carried out based on the third target feature and the first sub-image feature with the same scale as the third target feature, so as to obtain a fourth image feature with an increased scale.
When the feature reconstruction is performed based on the third target feature and the first sub-image feature with the same scale as the third target feature, the third target feature and the first sub-image feature with the same scale as the third target feature can be fused into one feature through superposition, dot multiplication and other fusion modes, and then the feature reconstruction is performed based on the fused feature.
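A minimal sketch of this combination step, assuming element-wise superposition as the fusion mode (dot multiplication or concatenation would be used analogously), is:

```python
def fuse_with_same_scale_feature(third_target_feature, first_sub_image_feature):
    """Sketch: in the second and later contour information fusions, the third
    target feature is combined with the first sub-image feature of the same
    scale before feature reconstruction. Element-wise superposition is
    assumed here."""
    assert third_target_feature.shape == first_sub_image_feature.shape
    return third_target_feature + first_sub_image_feature  # fused feature fed to reconstruction
```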
The following describes the process of the contour information fusion with reference to fig. 9 by taking the preset number of 3 as an example.
Referring to fig. 9, a flow diagram of another cascade feature reconstruction method is provided. In fig. 9, after obtaining the features of the first sub-images whose scales are sequentially reduced, first contour information fusion is performed. In the first contour information fusion process, the third target feature is the first sub-image feature 1 with the smallest scale in each first sub-image feature, the feature reconstruction is performed based on the third target feature, that is, the feature reconstruction is performed based on the first sub-image feature 1, so as to obtain a fourth image feature 4 with an increased scale, the fourth image feature 4 and the second sub-state information 4 corresponding to the fourth image feature 4 are fused, the second sub-state information 4 is updated, and the contour feature 6 of the contour mask image of the object is obtained, and at this time, the first contour information fusion process is ended.
And after the contour features 6 in the first contour information fusion process are obtained, carrying out second contour information fusion. In the second contour information fusion process, the third target feature is the contour feature 6, the first sub-image feature with the same scale as the third target feature is the first sub-image feature 2, the feature reconstruction is performed based on the third target feature and the first sub-image feature with the same scale as the third target feature, that is, the feature reconstruction is performed based on the contour feature 6 and the first sub-image feature 2, so as to obtain a fourth image feature 5 with a further increased scale, the fourth image feature 5 and the second sub-state information 5 corresponding to the fourth image feature 5 are fused, and the second sub-state information 5 is updated, so that the contour feature 7 of the contour mask image of the object is obtained, and at this time, the second contour information fusion process is ended.
After the contour feature 7 in the second contour information fusion process is obtained, third contour information fusion is carried out. In the third contour information fusion process, the third target feature is the contour feature 7, and the first sub-image feature with the same scale as the third target feature is the first sub-image feature 3; feature reconstruction is performed based on the third target feature and the first sub-image feature with the same scale, that is, based on the contour feature 7 and the first sub-image feature 3, so as to obtain a fourth image feature 6 with a further increased scale. The fourth image feature 6 and the second sub-state information 6 corresponding to the fourth image feature 6 are fused and the second sub-state information 6 is updated, so that the contour feature 8 of the contour mask image of the object is obtained; at this time, the third contour information fusion process is ended.
As can be seen from the above, in the scheme provided by this embodiment, cascade information compression is performed on the target video frame to obtain first sub-image features with sequentially reduced scales, and in each contour information fusion process other than the first one, feature reconstruction can be performed based on the third target feature and the first sub-image feature with the same scale as the third target feature. This improves the accuracy of feature reconstruction, thereby improving the accuracy of the feature finally obtained by contour information fusion, and further improving the accuracy of video matting.
According to the above, the number of times of the transparency information fusion and the number of times of the contour information fusion are both preset numbers of times, and the larger the preset number is, the more accurate the fusion result obtained by the transparency information fusion is, the more accurate the feature obtained by the contour information fusion is, however, the larger the calculation amount is.
In view of this, in one embodiment of the present application, the preset number is 4, 5 or 6. A preset number in this range keeps the calculation amount friendly to application on a terminal, so that the scheme provided by the embodiment of the application can be applied to the terminal and a lightweight application of the video matting scheme in the terminal can be realized.
Because the data volume of the second image feature and the fourth image feature is relatively huge, the computation amount of fusing the second image feature and the first sub-state information in the first hidden state information and updating the first sub-state information is relatively large, and the computation amount of fusing the fourth image feature and the second sub-state information in the second hidden state information and updating the second sub-state information is relatively large.
In view of the above, in order to reduce the amount of calculation for fusing the second image feature and the first sub-state information and updating the first sub-state information, in one embodiment of the present application, when fusing the second image feature and the first sub-state information and updating the first sub-state information, the second image feature is segmented to obtain the second sub-image feature and the third sub-image feature; fusing the second sub-image features and the first sub-state information in the first hidden state information, and updating the first sub-state information to obtain fourth sub-image features; and splicing the fourth sub-image feature and the third sub-image feature to obtain a third image feature.
Wherein the features can be represented in the form of a matrix, tensor. Taking tensors as an example, the segmentation of the second image feature may be understood as the segmentation of the feature tensor into two sub-tensors in any dimension of the feature tensor representing the second image feature.
For example, for a feature tensor with a dimension of h×c×w, the feature tensor may be segmented in the W dimension direction to obtain two sub-feature tensors with dimensions of h×c×w1 and h×c×w2, respectively, where w1+w2=w.
Specifically, when the second image feature is segmented, the second image feature may be segmented in equal proportion to obtain two sub-features with the same scale, or the second image feature may be segmented in any proportion to obtain two sub-features with different scales. And after the second image feature is segmented to obtain two sub-features, one of the sub-features can be determined to be the second sub-image feature, and the other sub-feature can be determined to be the third sub-image feature.
After the second sub-image feature and the third sub-image feature are obtained by segmentation, the second sub-image feature and the first sub-state information in the first hidden state information may be fused and the first sub-state information updated, so as to obtain the fourth sub-image feature. The specific implementation manner may refer to the implementation manner of fusing the reconstructed first image feature with the first hidden state information and updating the first hidden state information in step S303 in the embodiment shown in fig. 3, which is not described herein again.
Feature stitching may be regarded as the inverse of feature segmentation. After the fourth sub-image feature is obtained by fusion, the fourth sub-image feature and the third sub-image feature may be stitched into one feature along the dimension in which the segmentation was performed, that is, the third sub-image feature is appended after the fourth sub-image feature, or the fourth sub-image feature is appended after the third sub-image feature; the stitched feature is the third image feature.
For example, if the third sub-image feature has a scale of h×c×w3 and the fourth sub-image feature has a scale of h×c×w4, when the fourth sub-image feature and the third sub-image feature are spliced, they may be spliced into a feature with a scale of h×c×w5 in the W dimension direction, where w3+w4=w5.
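A sketch of this split-fuse-splice procedure, assuming the split is performed along the W dimension into two equal halves and that the fusion sub-network behaves like a recurrent cell returning the fused feature and the updated state (hypothetical gru_cell), is:

```python
import torch

def fuse_with_split(second_image_feature, gru_cell, first_sub_state):
    """Sketch of the split-fuse-splice procedure. The split along the W
    dimension (dim=-1) into two equal halves is an assumption; gru_cell is
    a hypothetical fusion sub-network returning (fused feature, new state)."""
    second_sub, third_sub = torch.chunk(second_image_feature, 2, dim=-1)  # segmentation
    fourth_sub, new_state = gru_cell(second_sub, first_sub_state)         # fuse and update state
    third_image_feature = torch.cat([fourth_sub, third_sub], dim=-1)      # splicing
    return third_image_feature, new_state
```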
From the above, in the scheme provided by the embodiment, the second image feature is segmented to obtain the second sub-image feature and the third sub-image feature, and the data size of the second sub-image feature and the third sub-image feature is smaller than that of the second image feature, so that the second sub-image feature and the first sub-state information are fused, the fused calculated amount can be reduced, the fusion efficiency is improved, the efficiency of obtaining the third image feature can be improved, the efficiency of video matting can be improved, and meanwhile, the calculation resource of the terminal is saved, so that the lightweight application of the video matting scheme in the terminal can be realized.
In order to reduce the calculation amount of fusing the fourth image feature and the second sub-state information and updating the second sub-state information, in one embodiment of the application, when fusing the fourth image feature and the second sub-state information and updating the second sub-state information, the fourth image feature is segmented to obtain a fifth sub-image feature and a sixth sub-image feature; fusing the fifth sub-image feature and second sub-state information in the second hidden state information and updating the second sub-state information to obtain a seventh sub-image feature; and splicing the seventh sub-image feature and the sixth sub-image feature to obtain the feature of the outline mask image of the object.
The implementation manner of splitting the fourth image feature may refer to the aforementioned implementation manner of splitting the second image feature; the implementation manner of fusing the fifth sub-image feature and the second sub-state information and updating the second sub-state information may refer to the aforementioned implementation manner of fusing the second sub-image feature and the first sub-state information and updating the first sub-state information; the implementation manner of stitching the seventh sub-image feature and the sixth sub-image feature may refer to the implementation manner of stitching the fourth sub-image feature and the third sub-image feature, which is not described herein again.
From the above, in the scheme provided by this embodiment, the fourth image feature is segmented to obtain the fifth sub-image feature and the sixth sub-image feature, whose data sizes are smaller than that of the fourth image feature. Fusing the fifth sub-image feature with the second sub-state information therefore reduces the calculation amount of the fusion and improves the fusion efficiency, which improves the efficiency of obtaining the feature of the contour mask image of the object and the efficiency of video matting, saves the calculation resources of the terminal, and enables a lightweight application of the video matting scheme in the terminal.
In addition, in order to solve the problem of large calculation amount in the two fusion processes, in one embodiment of the present application, when the second image feature and the first sub-state information are fused and the first sub-state information is updated, the second image feature may be segmented to obtain a second sub-image feature and a third sub-image feature; fusing the second sub-image features and the first sub-state information in the first hidden state information, and updating the first sub-state information to obtain fourth sub-image features; and splicing the fourth sub-image feature and the third sub-image feature to obtain a third image feature. When the fourth image feature and the second sub-state information are fused and the second sub-state information is updated, the fourth image feature is segmented to obtain a fifth sub-image feature and a sixth sub-image feature; fusing the fifth sub-image feature and second sub-state information in the second hidden state information and updating the second sub-state information to obtain a seventh sub-image feature; and splicing the seventh sub-image feature and the sixth sub-image feature to obtain the feature of the outline mask image of the object. The calculation amount of the two fusion processes can be reduced, so that the fusion efficiency is further improved, the video matting efficiency is further improved, and the lightweight application of the video matting scheme in the terminal is realized.
An implementation manner of performing information compression on the target video frame in combination with the processing such as convolution transformation, linear transformation, batch normalization processing, and nonlinear transformation in the above step S301 will be described below.
In an embodiment of the present application, referring to fig. 10, a flowchart of a third video matting method is provided, and in this embodiment, the above step S301 may be implemented by the following steps S301A-S301E.
Step S301A: and carrying out convolution transformation on the target video frame in the video to obtain a fifth image characteristic.
Specifically, a preset convolution kernel may be used to perform convolution calculation on the target video frame to obtain the fifth image feature, or a trained convolutional neural network may be used to perform convolution transformation on the target video frame to obtain the fifth image feature output by the model.
Step S301B: and performing linear transformation on the fifth image characteristic based on convolution check to obtain a sixth image characteristic.
Wherein the convolution kernel is a preset convolution kernel.
Specifically, based on the convolution kernel, the linear transformation of the fifth image feature may be implemented in a manner that the convolution transformation of the fifth image feature is performed. Because the network processor (Network Processing Unit, NPU) in the terminal has stronger calculation capability for performing convolution transformation, the time consumption of the linear transformation can be shortened by performing the linear transformation in a convolution transformation mode, so that the time consumption of video matting can be shortened, and the video matting efficiency is improved.
In one embodiment of the present application, the convolution kernel is: 1x1 convolution kernel. Because the data size of the convolution kernel of 1x1 is smaller, the convolution kernel based on 1x1 performs linear transformation on the fifth image feature, and on the premise that the linear transformation on the fifth image feature can be realized, the calculation amount of the linear transformation can be reduced, the calculation efficiency of the linear transformation is improved, and therefore the video matting efficiency can be improved. In addition, the video matting scheme provided by the embodiment is applied to the terminal, the fifth image characteristic is checked to perform linear transformation based on the convolution of 1x1 in the terminal, and more calculation resources of the terminal are not required to be occupied, so that the terminal can conveniently realize linear transformation, and the lightweight realization of video matting at the terminal side is promoted.
Step S301C: and carrying out batch standardization processing on the sixth image features to obtain seventh image features.
Specifically, batch normalization processing may be performed on the sixth image feature by using a batch normalization algorithm, a model, or the like, to obtain a seventh image feature.
For example, batch normalization may be performed on the sixth image feature using the BatchNorm2d algorithm.
Step S301D: and carrying out nonlinear transformation on the seventh image feature to obtain an eighth image feature.
Specifically, the seventh image feature may be subjected to nonlinear transformation by using a nonlinear transformation function, an algorithm, an activation function, or the like, to obtain an eighth image feature.
For example, the seventh image feature may be non-linearly transformed using a GELU activation function or a RELU activation function. In the case of performing nonlinear transformation on the seventh image feature by using the RELU activation function, since the quantization effect of processing data by using the RELU activation function is good, the nonlinear transformation on the seventh image feature by using the RELU activation function can improve the transformation effect of the nonlinear transformation, thereby improving the accuracy of the eighth image feature.
Step S301E: and performing linear transformation on the eighth image characteristic based on the convolution check to obtain a first characteristic of the target video frame.
The implementation manner of performing the linear transformation in this step is the same as that of performing the linear transformation in the step S301B, and will not be described here again.
In addition, when the first feature is obtained, the processing flow shown in steps S301A to S301E may be executed once, or the processing flow shown in steps S301A to S301E may be executed a plurality of times. For example, the process flow shown in steps S301A-S301E may be performed 4 times, 5 times, or another number of times.
In the case where the processing flow shown in steps S301A to S301E is executed a plurality of times, the input of the first processing flow is the target video frame in the video, the input of each subsequent processing flow is the feature output by the previous processing flow, and the feature output by the final processing flow is the first feature. In this case, the scale of the feature output by each processing flow keeps decreasing as the processing flow is executed repeatedly.
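For illustration, one pass of steps S301A-S301E might be sketched as the following block, where the channel counts, the stride used to reduce the scale and the ReLU choice are assumptions:

```python
import torch
import torch.nn as nn

class CompressBlock(nn.Module):
    """One pass of steps S301A-S301E: convolution, 1x1-conv linear transform,
    batch normalization, nonlinear transform, 1x1-conv linear transform.
    Channel counts, stride and activation are assumptions of this sketch."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)  # S301A
        self.linear1 = nn.Conv2d(out_ch, out_ch, kernel_size=1)                   # S301B
        self.bn = nn.BatchNorm2d(out_ch)                                           # S301C
        self.act = nn.ReLU(inplace=True)                                           # S301D
        self.linear2 = nn.Conv2d(out_ch, out_ch, kernel_size=1)                    # S301E

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear2(self.act(self.bn(self.linear1(self.conv(x)))))
```

Cascading four or five such blocks corresponds to executing the processing flow the number of times described above, with the feature scale reduced at each pass under the assumed stride.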
From the above, in the scheme provided by this embodiment, when the target video frame is compressed, the target video frame is subjected to processing such as convolution transformation, linear transformation, batch standardization processing and nonlinear transformation, so that the target video frame can be compressed more accurately and the accuracy of the first image feature can be improved; video segmentation is then performed based on the first image feature, so the accuracy of video matting can be improved.
In addition, in the scheme provided by the embodiment of the application, the sixth image features are subjected to batch standardization processing, and then nonlinear transformation is performed on the seventh image features obtained through processing, so that the quantization precision of the features can be prevented from being lost during information compression, the quantization precision of the information compression is improved, and the accuracy of the first image features and the accuracy of video matting are further improved.
The scheme provided by the embodiment of the application is applied to the terminal, and the processing such as convolution conversion, linear conversion, batch standardization processing and nonlinear conversion is friendly to the computing capacity of the terminal, so that the processing such as convolution conversion, linear conversion, batch standardization processing and nonlinear conversion is performed in the terminal, the terminal can conveniently compress information, and the lightweight realization of video matting at the terminal side can be promoted.
The video matting scheme provided by the embodiment of the application can also be realized based on the neural network model, and the video matting scheme is explained below by combining the neural network model.
In one embodiment of the present application, the steps described above may be implemented using a pre-trained video matting model.
Referring to fig. 11, a schematic structural diagram of a first video matting model is provided, and as can be seen in fig. 11, the video matting model includes an information compression network, a first image generation network, a second image generation network, a result output network, three sets of contour feature generation networks and three sets of transparency feature generation networks, each set of contour feature generation networks corresponds to a scale of one contour mask image, and includes a first reconstruction sub-network and a first fusion sub-network, and each set of transparency feature generation networks corresponds to a scale of one transparency mask image, and includes a second reconstruction sub-network and a second fusion sub-network.
Fig. 11 is a video matting model that is illustrated with the number of contour feature generation networks included as three, but in addition to this, the number of contour feature generation networks included in the video matting model may be four, five or other numbers, and the number of transparency feature generation networks is the same as that of contour feature generation networks, which is not limited in this embodiment.
The connection relationship of each network in the video matting model shown in fig. 11 is described below.
The three groups of contour feature generation networks in the video matting model are a contour feature generation network 1, a contour feature generation network 2 and a contour feature generation network 3, whose corresponding contour mask image scales increase in sequence, and the first reconstruction sub-network contained in each group of contour feature generation networks is connected with the first fusion sub-network; the three groups of transparency feature generation networks are a transparency feature generation network 1, a transparency feature generation network 2 and a transparency feature generation network 3, whose corresponding transparency mask image scales increase in sequence, and the second reconstruction sub-network contained in each group of transparency feature generation networks is connected with the second fusion sub-network.
The video matting model is provided with a segmentation branch and a matting branch, wherein the segmentation branch mainly comprises three groups of contour feature generation networks and a first image generation network, and the matting branch mainly comprises three groups of transparency feature generation networks and a second image generation network. The first layer network of the video matting model is an information compression network, the information compression network is respectively connected with the two branches, and a connection relationship exists between the two branches.
The connection relationship between the network included in the two branches, namely the splitting branch and the matting branch, and the connection relationship between the two branches will be described below.
First, the connection relationship of the network included in the split branch itself will be described. The information compression network is connected with a first reconstruction sub-network 1 included in the contour feature generation network 1, the first fusion sub-network 1 included in the contour feature generation network 1 is connected with a first reconstruction sub-network 2 included in the contour feature generation network 2, the first fusion sub-network 2 included in the contour feature generation network 2 is connected with a first reconstruction sub-network 3 included in the contour feature generation network 3, and the first fusion sub-network 3 included in the contour feature generation network 3 is connected with the first image generation network.
Next, a connection relationship of the network included in the matting branch itself will be described. The information compression network is connected with a second reconstruction sub-network 1 included in the transparency characteristic generation network 1, the second fusion sub-network 1 included in the transparency characteristic generation network 1 is connected with a second reconstruction sub-network 2 included in the transparency characteristic generation network 2, the second fusion sub-network 2 included in the transparency characteristic generation network 2 is connected with a second reconstruction sub-network 3 included in the transparency characteristic generation network 3, and the second fusion sub-network 3 included in the transparency characteristic generation network 3 is connected with a second image generation network.
Finally, the connection relationship between the segmentation branch and the matting branch is explained. The first merging sub-network 1 included in the outline feature generation network 1 is connected with the second reconstruction sub-network 1 included in the transparency feature generation network 1; the first converged subnetwork 2 comprised by the contour feature generation network 2 is connected to the second reconstructed subnetwork 2 comprised by the transparency feature generation network 2; the first converged subnetwork 3 comprised by the profile-feature generating network 3 is connected to the second reconstructed subnetwork 3 comprised by the transparency-feature generating network 3. The first image generation network and the second image generation network are connected to the result output network, respectively.
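The connection relationships above can be summarised in the following structural sketch, where the constituent modules (information compression network, reconstruction and fusion sub-networks, image generation networks, result output network) are passed in as placeholders; the PyTorch form, the argument names and the state lists h1_states/h2_states (the first and second sub-hidden state information) are assumptions of this sketch, not the patent's implementation:

```python
import torch.nn as nn

class VideoMattingModel(nn.Module):
    """Structural sketch of the connections in fig. 11: an information
    compression network feeding a segmentation branch and a matting branch,
    each with cascaded (reconstruction, fusion) pairs, followed by the image
    generation networks and the result output network."""
    def __init__(self, enc, contour_pairs, transp_pairs, seg_head, alpha_head, out_net):
        super().__init__()
        self.enc, self.seg_head, self.alpha_head, self.out_net = enc, seg_head, alpha_head, out_net
        # each pair is (reconstruction sub-network, fusion sub-network)
        self.contour_pairs = nn.ModuleList(nn.ModuleList(p) for p in contour_pairs)
        self.transp_pairs = nn.ModuleList(nn.ModuleList(p) for p in transp_pairs)

    def forward(self, frame, h1_states, h2_states):
        feat = self.enc(frame)                         # information compression network
        seg_feat = alpha_feat = feat
        for i, (cpair, tpair) in enumerate(zip(self.contour_pairs, self.transp_pairs)):
            rec1, fuse1 = cpair
            rec2, fuse2 = tpair
            seg_feat = rec1(seg_feat)                               # first reconstruction sub-network
            seg_feat, h2_states[i] = fuse1(seg_feat, h2_states[i])  # first fusion sub-network
            alpha_feat = rec2(alpha_feat, seg_feat)                 # second reconstruction sub-network
            alpha_feat, h1_states[i] = fuse2(alpha_feat, h1_states[i])  # second fusion sub-network
        contour_mask = self.seg_head(seg_feat)         # first image generation network
        alpha_mask = self.alpha_head(alpha_feat)       # second image generation network
        return self.out_net(frame, alpha_mask, contour_mask), h1_states, h2_states
```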
The following describes each network and sub-network in the video matting model.
For the information compression network, when information compression is performed on the target video frame, the target video frame is input into the information compression network, and the information compression network performs information compression on the target video frame, so that the first feature output by the information compression network is obtained.
The implementation manner of the information compression network for performing information compression on the target video frame can be referred to the foregoing, and will not be described herein.
Referring to fig. 12, fig. 12 is a schematic structural diagram of an information compression network, in the information compression network shown in fig. 12, each network layer sequentially includes, from top to bottom: convolution layer, linear layer 1, batch normalization layer, nonlinear layer, and linear layer 2.
The convolution layer is used for carrying out convolution transformation on the target video frame to obtain a fifth image feature.
The linear layer 1 is used for performing linear transformation on the fifth image feature based on a convolution kernel to obtain a sixth image feature.
And the batch normalization layer is used for carrying out batch normalization processing on the sixth image features to obtain seventh image features.
The nonlinear layer is used for carrying out nonlinear transformation on the seventh image feature to obtain an eighth image feature.
The linear layer 2 is used for performing linear transformation on the eighth image feature based on convolution kernel to obtain a first feature.
The implementation manner of the data processing performed by the convolution layer, the linear layer 1, the batch normalization layer, the nonlinear layer and the linear layer 2 can be referred to in the foregoing, and will not be described herein.
For a target first reconstruction sub-network in a target contour feature generation network, when feature reconstruction is performed based on the third target feature, the third target feature is input into the target first reconstruction sub-network, and the target first reconstruction sub-network performs feature reconstruction based on the third target feature, so as to obtain a fourth image feature with an increased scale output by the target first reconstruction sub-network. The scale of the contour mask image corresponding to the target contour feature generation network is the same as that of the fourth image feature.
The implementation manner of performing feature reconstruction of the target first reconstruction sub-network based on the third target feature may be referred to the foregoing embodiment, and will not be described herein.
In one embodiment of the present application, the first reconstruction sub-network is implemented based on a QARepVGG network structure.
Because the quantization calculation precision of the QARepVGG network is higher, the quantization calculation capability of the first reconstruction sub-network can be improved by realizing the first reconstruction sub-network based on the QARepVGG network structure, so that the accuracy of feature reconstruction of the first reconstruction sub-network is improved, and the accuracy of video matting can be improved.
In another embodiment of the present application, the first reconstruction sub-network in the profile-specific feature generation network is implemented based on a QARepVGG network structure.
The specific contour feature generation network is: a contour feature generation network whose corresponding contour mask image scale is smaller than a first preset scale.
The first preset scale may be a preset scale.
When the video matting model is constructed, the contour mask image scale corresponding to each group of contour feature generation networks can be determined, so that a contour feature generation network whose corresponding contour mask image scale is smaller than the first preset scale can be determined as a specific contour feature generation network. The specific contour feature generation network is constructed based on the QARepVGG network structure, while the other contour feature generation networks may be built based on other network structures.
Because the calculation amount of the U-shaped residual block built based on the QARepVGG network structure increases as the scale of the contour mask image corresponding to the network increases, when the contour feature generation networks are built, the first reconstruction sub-network may be implemented based on the QARepVGG network structure only in the specific contour feature generation networks whose corresponding contour mask image scale is smaller than the first preset scale. In this way, the calculation amount of each contour feature generation network can be reduced, the efficiency of obtaining the features of the contour mask image of the object is improved, the video matting efficiency is improved, and the video matting model can be deployed in a terminal in a lightweight manner.
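For orientation only, a generic RepVGG-style multi-branch block is sketched below; the exact branch and normalization layout of QARepVGG, and its re-parameterization for deployment, are not reproduced here, and the details shown are assumptions:

```python
import torch
import torch.nn as nn

class RepVGGStyleBlock(nn.Module):
    """Generic RepVGG-style multi-branch block (3x3 conv + 1x1 conv +
    identity, summed, then ReLU). QARepVGG-specific details are not
    reproduced; this only illustrates the general block shape."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.post_bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.post_bn(self.conv3(x) + self.conv1(x) + x))
```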
For a target first fusion sub-network, when the fourth image feature and the second sub-state information are fused and the second sub-state information is updated, the fourth image feature may be input into the target first fusion sub-network, where the second sub-hidden state information provided by the target first fusion sub-network is the second sub-state information; the target first fusion sub-network fuses the fourth image feature with the second sub-hidden state information it provides and updates that second sub-hidden state information, so as to obtain the feature of the contour mask image of the object output by the target first fusion sub-network.
The implementation manner of fusing the fourth image feature and the second sub-hidden state information by the target first fusing subnetwork and updating the second sub-hidden state information can be referred to the foregoing embodiment, and will not be repeated here.
In one embodiment of the present application, the first fusion sub-network is a gated recurrent unit (Gated Recurrent Unit, GRU) or a Long Short-Term Memory (LSTM) unit.
The GRU unit and the LSTM unit both have an information memory function. With either of them used as the first fusion sub-network, the unit can store hidden state information representing the fusion feature of the contour mask image of the object in video frames that were segmented before the target video frame, so that the fourth image feature and the second sub-hidden state information provided by the unit can be fused accurately, improving the accuracy of the feature of the contour mask image of the object output by the sub-network and thus the accuracy of video matting.
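As an example of such a fusion sub-network, a convolutional GRU cell can be sketched as follows; the convolutional form, the kernel size and the choice to return the updated state as the fused feature are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell used as a fusion sub-network: it fuses the
    input feature with the sub-hidden state and returns the fused feature
    together with the updated state."""
    def __init__(self, channels: int):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size=3, padding=1)
        self.cand = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, h: torch.Tensor):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        h_new = (1 - z) * h + z * h_tilde   # updated sub-hidden state
        return h_new, h_new                 # fused feature and updated state
```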
For the first image generation network, when the first contour mask image is obtained based on the finally obtained feature of the contour mask image of the object, this feature is input into the first image generation network, and the first image generation network generates an image based on the feature; the image output by the network is the first contour mask image.
The implementation manner of the first image generation network to generate an image based on the feature may refer to the foregoing embodiment, and will not be described herein.
For a target second reconstruction sub-network in a target transparency feature generation network, when feature reconstruction is performed based on the first target feature and the second target feature, the first target feature and the second target feature are input into the target second reconstruction sub-network, and the target second reconstruction sub-network performs feature reconstruction based on the first target feature and the second target feature, so as to obtain a second image feature with an increased scale output by the target second reconstruction sub-network; the scale of the transparency mask image corresponding to the target transparency feature generation network is the same as the scale of the second image feature.
The implementation manner of the target second reconstruction sub-network for performing feature reconstruction based on the first target feature and the second target feature may be referred to the foregoing, and will not be described herein.
In one embodiment of the present application, the second reconstruction sub-network is implemented based on a QARepVGG network structure.
Because the quantization calculation precision of the QARepVGG network is higher, the second reconstruction sub-network is realized based on the QARepVGG network structure, the quantization calculation capability of the second reconstruction sub-network can be improved, the accuracy of feature reconstruction of the second reconstruction sub-network based on the first target feature and the second target feature is improved, and the accuracy of video matting can be improved.
In another embodiment of the present application, the second reconstruction sub-network in a specific transparency feature generation network is implemented based on the QARepVGG network structure.
The specific transparency feature generation network is: a transparency feature generation network whose corresponding transparency mask image scale is smaller than a second preset scale.
The second preset scale may be a preset scale. The second preset scale may be the same as or different from the first preset scale.
When the video matting model is constructed, the transparency mask image scale corresponding to each group of transparency feature generation networks can be determined, so that a transparency feature generation network whose corresponding transparency mask image scale is smaller than the second preset scale can be determined as a specific transparency feature generation network. The specific transparency feature generation network is constructed based on the QARepVGG network structure, while the other transparency feature generation networks may be built based on other network structures.
Because the calculation amount of the U-shaped residual block built based on the QARepVGG network structure increases as the scale of the transparency mask image corresponding to the network increases, when the transparency feature generation networks are built, the second reconstruction sub-network may be implemented based on the QARepVGG network structure only in the specific transparency feature generation networks whose corresponding transparency mask image scale is smaller than the second preset scale. In this way, the calculation amount of each transparency feature generation network can be reduced, the efficiency of obtaining the fusion result is improved, the video matting efficiency is improved, and the video matting model can be deployed in a terminal in a lightweight manner.
When the second image feature and the first sub-state information are fused and the first sub-state information is updated, the second image feature can be input into the target second fusion sub-network, and the first sub-hidden state information provided by the target second fusion sub-network is the first sub-state information, so that the target second fusion sub-network fuses the second image feature and the first sub-hidden state information provided by the target second fusion sub-network and updates the first sub-hidden state information provided by the target second fusion sub-network, and the third image feature output by the target second fusion sub-network is obtained.
The implementation manner of fusing the second image feature and the first sub-hidden state information by the target second fusing subnetwork and updating the first sub-hidden state information can be referred to the foregoing embodiment, and will not be repeated here.
In one embodiment of the present application, the second fusion sub-network is a gated recurrent unit (Gated Recurrent Unit, GRU) or a Long Short-Term Memory (LSTM) unit.
The GRU unit and the LSTM unit both have an information memory function. With either of them used as the second fusion sub-network, the unit can store hidden state information representing the fusion feature of the transparency mask image of the object in video frames processed before the target video frame, so that the second image feature and the first sub-hidden state information provided by the unit can be fused accurately, improving the accuracy of the third image feature output by the sub-network and thus the accuracy of video matting.
For the second image generation network, when the target transparency mask image is obtained based on the finally obtained fusion result, the fusion result is input into the second image generation network, and the second image generation network generates an image based on the fusion result; the image output by the network is the target transparency mask image.
The implementation manner of the second image generation network to generate the image based on the fusion result can be referred to the foregoing embodiment, and will not be described herein.
And when the result output network is used for carrying out regional matting, the target transparency mask image and the first contour mask image are input into the result output network, and the result output network carries out regional matting on the target video frame according to the target transparency mask image and the first contour mask image to obtain a matting result output by the result output network.
The implementation manner of the result output network for performing region matting on the target video frame can be referred to the foregoing embodiment, and will not be described herein again.
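As an illustration only, one possible way for the result output network to combine the two masks is sketched below; the thresholding and compositing rule are assumptions, not the patent's exact formula:

```python
def region_matting(frame, alpha_mask, contour_mask, threshold=0.5):
    """Sketch of one possible region matting rule (torch tensors assumed):
    the transparency mask is restricted to the region indicated by the
    contour mask, and the frame is weighted per pixel by the result."""
    region = (contour_mask > threshold).float()   # region where the object is located
    alpha = alpha_mask * region                   # keep transparency only inside the region
    matting_result = frame * alpha                # extracted foreground
    return matting_result, alpha
```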
From the above, in the scheme provided by the embodiment, each network and each sub-network included in the video matting model are utilized to perform video matting, and because the video matting model is a pre-trained video matting model, the video matting model is utilized to improve the accuracy of video matting, and the video matting model does not need to interact with other devices, so that the video matting model can be deployed in the offline device, and the convenience of video segmentation can be improved.
Because the data volume of the first target feature and the second target feature is generally larger, the calculation volume of feature reconstruction of the target second reconstruction sub-network is larger, so that the efficiency of feature reconstruction of the second reconstruction sub-network is lower, and the efficiency of video matting by using the video matting model is also lower.
In view of this, in an embodiment of the present application, referring to fig. 13, a schematic structural diagram of a second video matting model is provided, and in the video matting model shown in fig. 13, compared with fig. 11, each transparency feature generating network further includes a feature screening sub-network, the first merging sub-network 1 is connected to the second reconstruction sub-network 1 through the feature screening sub-network 1, the first merging sub-network 2 is connected to the second reconstruction sub-network 2 through the feature screening sub-network 2, and the first merging sub-network 3 is connected to the second reconstruction sub-network 3 through the feature screening sub-network 3.
For a target feature screening sub-network in a target transparency feature generation network, before the feature output by the target first fusion sub-network is input into the target second reconstruction sub-network, this feature may be input into the target feature screening sub-network; the target feature screening sub-network screens out, from this feature, the target screening feature that characterizes the edge contour of the object, the target screening feature is input into the target second reconstruction sub-network in the target transparency feature generation network, and the target second reconstruction sub-network performs feature reconstruction based on the target screening feature.
Specifically, the implementation manner of the target feature screening sub-network screening feature and the implementation manner of the target second reconstruction sub-network performing feature reconstruction based on the target screening feature and the first target feature can be referred to the foregoing embodiments, and will not be described herein.
As can be seen from the above, in the scheme provided by the embodiment, by adding the feature screening sub-network to the transparency feature generation network, the calculation amount of feature reconstruction performed by the second reconstruction sub-network in the transparency feature generation network can be reduced, and the efficiency of feature reconstruction performed by the second reconstruction sub-network can be improved, so that the efficiency of video matting performed by the video matting model can be improved.
In one embodiment of the present application, referring to fig. 14, a flowchart of a fourth video matting method is provided, and in fig. 14, a video matting model sequentially processes a video frame 1 and a video frame 2 included in a video. When the video matting model processes the video frame 1, the video frame 1 is processed by each network and each sub-network in the model respectively to obtain a matting result 1 corresponding to the video frame 1, wherein each first fusion sub-network and each second fusion sub-network in the model output information to other network layers on one hand, and update the hidden state information contained in the video matting model on the other hand, so that the video matting model is used for fusing the input characteristics when the model processes the next frame of video frame 2. When the video matting model processes the video frame 2, the video frame 2 is processed by each network and each sub-network in the model respectively, and a matting result 2 corresponding to the video frame 2 is obtained.
In one embodiment of the present application, in a case where the video matting model includes a plurality of sets of contour feature generation networks and transparency feature generation networks, the calculation amount of processing video frames by the video matting model is also large. In view of this, the first fusion subnetwork in the last group or groups of profile feature generation networks included in the video matting model can be removed, and the second fusion subnetwork in the last group or groups of transparency feature generation networks included in the video matting model can be removed, so that the calculated amount of processing video frames by the video matting model is reduced, and the video matting model can be deployed in a lightweight manner in a terminal.
Taking removal of the first fusion sub-network in the final group of contour feature generation networks as an example, referring to fig. 15, a structural schematic diagram of a third video matting model is provided. Compared with the video matting model shown in fig. 11, the final contour feature generation network 3 in the video matting model shown in fig. 15 only includes the first reconstruction sub-network 3, and the output result of the first reconstruction sub-network 3 is the finally obtained feature of the contour mask image of the object.
In an embodiment of the present application, referring to fig. 16, a structural schematic diagram of a fourth video matting model is provided. The video matting model shown in fig. 16 includes multiple layers of cascaded information compression networks, and the output result of each layer of information compression network is a first sub-feature. Each layer of information compression network is connected with the first reconstruction sub-network in one group of contour feature generation networks, the last layer of information compression network is connected with the first group of contour feature generation networks, and the scale of the first sub-feature output by a connected information compression network is the same as the scale of the third target feature to be processed by that contour feature generation network. The first sub-feature output by the last layer of information compression network serves as the third target feature to be processed by the first group of contour feature generation networks, and the first reconstruction sub-networks in the other groups of contour feature generation networks perform feature reconstruction based on the third target feature and the first sub-feature output by the information compression network connected with them, so that the accuracy of feature reconstruction can be improved, and the accuracy of video matting can be improved.
The training process of the video matting model is described below.
In an embodiment of the present application, referring to fig. 17, a flow chart of a first model training method is provided, and in this embodiment, the method includes the following steps S1701-S1706.
Step S1701: and inputting a first sample video frame in the sample video into an initial model of the video matting model for processing, and obtaining a first sample contour mask image of an object in the first sample video frame output by a first image generating network in the initial model.
The sample video may be any video obtained through a network, video library, or other channel. In addition, after the video is acquired through a network, a video library or other channels, a plurality of videos may be spliced into one video, and the spliced video is obtained as a sample video.
The first sample contour mask image and the first sample video frame have the same scale, and the pixel values of the pixel points in the first sample contour mask image represent the confidence that the pixel points in the same position in the first sample video frame predicted by the model belong to the region where the object is located.
The initial model processes the video frames input into it according to its currently configured, not-yet-trained model parameters. While the model processes a video frame, the image output by the first image generation network in the model can be obtained.
Specifically, after the first sample video frame is input into the initial model, the initial model processes the first sample video frame according to its configured model parameters, and the image output by the first image generation network in the model is obtained and used as the first sample contour mask image of the object in the first sample video frame.
Step S1702: and obtaining a first difference between the annotation mask image corresponding to the first sample video frame and the annotation mask image corresponding to the second sample video frame.
Wherein the second sample video frame is: a video frame in the sample video that precedes the first sample video frame and is spaced from it by a preset number of frames.
The preset number of frames is a number of frames set in advance, for example, 3 frames, 4 frames, or another number of frames.
The first sample video frame may be the video frame at the position of the preset frame number in the sample video, or any video frame after that position.
The first difference may be calculated by the terminal itself, or it may be calculated by another device, in which case the terminal obtains the calculated first difference from that device.
An implementation of calculating the first difference by the terminal or other device is described below.
The terminal or other device can obtain the annotation mask image corresponding to each sample video frame in the sample video; the annotation mask image corresponding to a sample video frame can be understood as the actual mask image of the object in that frame. After the second sample video frame is determined according to the frame number of the first sample video frame and the preset number of frames, the annotation mask image corresponding to the first sample video frame and the annotation mask image corresponding to the second sample video frame can be taken from the obtained annotation mask images, and the first difference between the two annotation mask images calculated.
When calculating the first difference between the two annotation mask images, in one implementation the pixel values of the pixels at the same positions of the two images may be subtracted and the number of non-zero results counted as the first difference, or the ratio of the number of non-zero results to the total number of pixels of the annotation mask image used as the first difference; in another implementation, the similarity of the two images may be calculated and the calculated similarity subtracted from 1, with the result used as the first difference.
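The two ways of quantifying the difference described above can be sketched as follows. The function names are hypothetical, and the use of cosine similarity in the second function is an assumption, since the description does not fix a particular similarity measure.

import numpy as np

def diff_by_changed_pixels(mask_a: np.ndarray, mask_b: np.ndarray, as_ratio: bool = True) -> float:
    """First implementation: subtract pixel values at the same positions and count the non-zero results."""
    changed = np.count_nonzero(mask_a.astype(np.int64) - mask_b.astype(np.int64))
    return changed / mask_a.size if as_ratio else float(changed)

def diff_by_similarity(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Second implementation: compute a similarity score and subtract it from 1 (cosine similarity assumed)."""
    a = mask_a.ravel().astype(np.float64)
    b = mask_b.ravel().astype(np.float64)
    similarity = float(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return 1.0 - similarity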
Step S1703: obtaining a second difference between the first sample contour mask image and a second sample contour mask image, wherein the second sample contour mask image is: the first image generates a mask image output by the network when the initial model processes the second sample video frame.
Specifically, similar to the video matting process, during model training each sample video frame in the sample video can be input into the model frame by frame, so as to obtain the sample contour mask image of the object in each sample video frame output by the first image generation network in the model. After the first sample contour mask image is obtained, a second sample video frame spaced from the first sample video frame by the preset number of frames can be determined among the video frames the model has processed before the first sample video frame, and the second sample contour mask image output by the first image generation network when the model processed the second sample video frame can be obtained, so that the second difference between the first sample contour mask image and the second sample contour mask image is calculated.
The implementation manner of calculating the second difference is the same as that of calculating the first difference in the step S1702, and will not be repeated here.
Step S1704: a third difference between the first sample transparency mask image and the second sample transparency mask image is obtained.
Wherein, the first sample transparency mask image is: the second image generates a mask image output by the network when the initial model processes the first sample video frame.
The second sample transparency mask image is: and when the initial model processes the second sample video frame, the second image generates a mask image output by the network.
Specifically, after the first sample video frame is input into the initial model, the initial model can process the first sample video frame according to its configured model parameters, and the image output by the second image generation network in the model is obtained as the first sample transparency mask image. After the second sample video frame is input into the initial model, the initial model can process the second sample video frame according to its configured model parameters, and the image output by the second image generation network in the model is obtained as the second sample transparency mask image. After the first sample transparency mask image and the second sample transparency mask image are obtained, the third difference between them can be calculated.
The implementation manner of calculating the third difference is the same as that of calculating the first difference in the step S1702, and will not be repeated here.
Step S1705: training losses are calculated based on the first difference, the second difference, and the third difference.
The training loss may be calculated from the first difference, the second difference, and the third difference using a loss function, an algorithm, or the like.
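One plausible way to turn the three differences into a training loss is to penalise the gap between how much the predictions change across the two sample frames (the second and third differences) and how much the annotations change (the first difference), which is consistent with the time-domain correlation between the sample frames noted in this embodiment. The formula and weights below are assumptions for illustration; the application only states that a loss function or algorithm may be used.

def temporal_training_loss(d1: float, d2: float, d3: float,
                           w_contour: float = 1.0, w_alpha: float = 1.0) -> float:
    """Hypothetical loss: the change in predicted contour masks (d2) and in predicted
    transparency masks (d3) should track the change in the annotation masks (d1)."""
    return w_contour * abs(d2 - d1) + w_alpha * abs(d3 - d1)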
Step S1706: and based on the training loss, carrying out model parameter adjustment on the initial model to obtain the video matting model.
Specifically, based on the training loss, model parameter adjustment may be performed on the initial model by any one of the following three implementation manners.
In a first implementation manner, for each model parameter in the initial model, a correspondence between the training loss and the adjustment amplitude of the model parameter may be preset, so that after the training loss is calculated, an actual adjustment amplitude for adjusting the model parameter may be calculated according to the correspondence, so that the model parameter is adjusted according to the actual adjustment amplitude.
In a second implementation manner, the initial model usually needs to be trained with a large amount of sample data, so the training loss is calculated repeatedly during training and the model parameters are adjusted based on each calculated loss. After the current training loss is calculated, the change relative to the previously calculated training loss can be determined, and the model parameters of the initial model adjusted according to that change.
In a third implementation, based on the training loss, model parameter adjustment may be performed on the initial model using a model parameter adjustment algorithm, a function, or the like.
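For the third implementation, adjusting model parameters with a standard optimisation algorithm can be written as an ordinary gradient-descent step. The optimiser choice, learning rate, and gradient clipping below are assumptions, not details taken from this application; a PyTorch-style sketch:

import torch

def train_step(initial_model: torch.nn.Module,
               optimizer: torch.optim.Optimizer,
               loss: torch.Tensor) -> None:
    """One parameter update of the initial model driven by the training loss."""
    optimizer.zero_grad()
    loss.backward()  # gradients of the training loss w.r.t. all model parameters
    torch.nn.utils.clip_grad_norm_(initial_model.parameters(), max_norm=1.0)  # optional stabilisation
    optimizer.step()  # adjust the model parameters

# Example wiring (illustrative): the loss must be a differentiable tensor built
# from the first, second, and third differences.
# optimizer = torch.optim.Adam(initial_model.parameters(), lr=1e-4)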
As can be seen from the foregoing, in the solution provided in this embodiment, there is often a time-domain correlation between the first sample video frame and the second sample video frame spaced from it by the preset number of frames. The first difference between the annotation mask images corresponding to the two frames, the second difference between the first and second sample contour mask images, and the third difference between the first and second sample transparency mask images are obtained, and the training loss is calculated based on these three differences. When the model parameters of the initial model are adjusted based on this training loss, the initial model can learn the time-domain correlation between different video frames of a video, which improves the accuracy of the trained model and, in turn, the accuracy of video matting performed with that model.
The above second difference may be obtained not only in the manner mentioned in step S1703, but also in the manner mentioned in step S1703A of the embodiment shown in fig. 18.
In one embodiment of the present application, referring to fig. 18, a flow diagram of a second model training method is provided.
In this embodiment, the first sample contour mask image includes: a first mask sub-map identifying areas of the first sample video frame where objects are located and a second mask sub-map identifying areas of the first sample video frame outside the objects.
The pixel values of the pixel points in the first mask subgraph represent the confidence that the pixel points in the same position in the first sample video frame predicted by the model belong to the region where the object is located, and the pixel values of the pixel points in the second mask subgraph represent the confidence that the pixel points in the same position in the first sample video frame predicted by the model belong to the region outside the object.
As shown in fig. 19, fig. 19 is a mask image provided in an embodiment of the present application; this mask image is a first sample contour mask image.
The mask image shown in fig. 19 includes two sub-images, respectively: a first mask sub-map identifying areas of the first sample video frame where objects are located and a second mask sub-map identifying areas of the first sample video frame outside the objects.
The second sample contour mask image includes: a third mask subgraph identifying the region of the second sample video frame where the object is located and a fourth mask subgraph identifying the region of the second sample video frame outside the object.
The pixel values of the pixel points in the third mask subgraph represent the confidence that the pixel points in the same position in the second sample video frame predicted by the model belong to the region where the object is located, and the pixel values of the pixel points in the fourth mask subgraph represent the confidence that the pixel points in the same position in the second sample video frame predicted by the model belong to the region outside the object.
In this case, the above step S1703 may be implemented by the following step S1703A.
Step S1703A: obtaining the difference between the first mask subgraph and the third mask subgraph and the difference between the second mask subgraph and the fourth mask subgraph, and obtaining a second difference that incorporates the obtained differences.
The implementation manner of obtaining the difference between the first mask subgraph and the third mask subgraph and the difference between the second mask subgraph and the fourth mask subgraph is the same as the aforementioned implementation manner of obtaining the first difference or the second difference, and will not be repeated here.
After the difference between the first mask subgraph and the third mask subgraph and the difference between the second mask subgraph and the fourth mask subgraph are obtained, the two differences may be summed to obtain the second difference, their average may be used as the second difference, the larger of the two may be taken as the second difference, and so on.
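The three ways of merging the two sub-graph differences mentioned above (summation, averaging, or taking the larger one) can be written directly; which one to use is left open by the description, so the default below is only an assumption.

def second_difference(diff_object_region: float, diff_outside_region: float, mode: str = "sum") -> float:
    """diff_object_region: difference between the first and third mask subgraphs (object region);
    diff_outside_region: difference between the second and fourth mask subgraphs (region outside the object)."""
    if mode == "sum":
        return diff_object_region + diff_outside_region
    if mode == "mean":
        return (diff_object_region + diff_outside_region) / 2.0
    if mode == "max":
        return max(diff_object_region, diff_outside_region)
    raise ValueError(f"unknown mode: {mode}")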
In view of the foregoing, in the solution provided in this embodiment, a video frame is composed of two regions: the region where the object is located and the region outside the object. The larger the difference between the object regions of different video frames, the larger the difference between the regions outside the object, so the difference between the regions outside the object can also reflect the difference between the object regions. Obtaining the second difference from both the difference between the first and third mask subgraphs and the difference between the second and fourth mask subgraphs therefore evaluates the change from two different angles, which improves the accuracy of the second difference, and thus the accuracy of model training and of video matting performed with the trained model.
In the video matting model, when the second fusion sub-network fuses the features with the hidden state information it maintains, the features of the transparency mask images corresponding to the video frames already matted are taken into account while matting the target video frame; that is, time-domain continuity between video frames is preserved. In this case, however, the image output by the second image generation network in the model is not hard-constrained to be a binary image: the pixel values of the target transparency mask image output by the second image generation network range from 0 to 1, so a semi-transparent region may appear in that image. It is then difficult to determine the cause of such a semi-transparent region, which makes it difficult to train the networks and sub-networks of the matting branch in the model.
In view of this, in one embodiment of the present application, referring to fig. 20, a schematic structural diagram of a second image generation network is provided, and in fig. 20, the second image generation network includes a hard segmentation sub-network, a result fusion sub-network, and an image generation sub-network.
The input of the hard segmentation sub-network is the fusion result output by the second fusion sub-network in the last group of transparency feature generation networks.
The hard segmentation sub-network is used for obtaining a second contour mask image of the object in the target video frame based on the fusion result.
And the result fusion sub-network is used for fusing the second contour mask image and the fusion result to obtain the target fusion characteristic.
The image generation sub-network is used for obtaining a target transparency mask image of the edge of the object in the target video frame based on the target fusion characteristic.
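The three sub-networks of fig. 20 can be pictured with the following PyTorch-style sketch. The concrete layers (a 1x1 convolution for hard segmentation, concatenation as the result fusion, and a small convolutional head for alpha generation) are assumptions made for illustration and are not the layers disclosed in the figure.

import torch
import torch.nn as nn

class SecondImageGenerationNet(nn.Module):
    """Sketch of the structure in fig. 20: hard segmentation -> result fusion -> image generation."""

    def __init__(self, in_ch: int = 32):
        super().__init__()
        # Hard segmentation sub-network: predicts a contour mask from the fusion result.
        self.hard_seg = nn.Conv2d(in_ch, 1, kernel_size=1)
        # Image generation sub-network: predicts the transparency (alpha) mask from the target fusion feature.
        self.image_gen = nn.Sequential(
            nn.Conv2d(in_ch + 1, in_ch, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(in_ch, 1, kernel_size=1),
            nn.Sigmoid(),  # transparency values in [0, 1]
        )

    def forward(self, fusion_result: torch.Tensor):
        # Second contour mask image; thresholding makes it binary (during training the
        # pre-threshold logits would be supervised, since the hard threshold is not differentiable).
        second_contour = (torch.sigmoid(self.hard_seg(fusion_result)) > 0.5).float()
        # Result fusion sub-network: here modelled as simple concatenation.
        target_fusion_feature = torch.cat([fusion_result, second_contour], dim=1)
        # Target transparency mask image of the object edge.
        alpha = self.image_gen(target_fusion_feature)
        return second_contour, alpha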
During model training, the initial model can be trained not only based on the first difference, the second difference, and the third difference; mask images corresponding to different sample video frames output by the hard segmentation sub-network can also be obtained, and a fourth difference between the obtained mask images calculated, so that the initial model is also trained according to the fourth difference. Because the hard segmentation sub-network outputs a contour mask image that is a binary image, the possibility of a semi-transparent region appearing in the mask image output by the second image generation network can be excluded during training, so that training of the initial model can be achieved accurately and quickly.
As shown in fig. 21, the left side of fig. 21 shows the final matting result in the case where the model training is not performed using the fourth difference, and the right side of fig. 21 shows the final matting result in the case where the model training is performed using the fourth difference.
The user information involved in the embodiments of the present application is information authorized by the user, and the obtaining, storage, use, processing, transmission, provision, and disclosure of the user information all comply with the relevant laws and regulations and do not violate public order and good morals.
In a specific implementation, the present application further provides a computer storage medium, where the computer storage medium may store a program; when the program runs, the device in which the computer-readable storage medium is located is controlled to execute some or all of the steps in the foregoing embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
In a specific implementation, an embodiment of the present application further provides a computer program product containing executable instructions that, when executed on a computer, cause the computer to perform some or all of the steps in the method embodiments.
As shown in fig. 22, the present application further provides a chip system, where the chip system is applied to the terminal 100, the chip system includes one or more processors 2201, and the processors 2201 are configured to invoke computer instructions to enable the terminal 100 to input data to be processed into the chip system, and the chip system processes the data based on the video matting method provided by the embodiment of the present application and then outputs a processing result.
In one possible implementation, the chip system further includes input and output interfaces for inputting and outputting data.
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the present application may be implemented as a computer program or program code that is executed on a programmable system including at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system having a processor such as, for example, a digital signal processor (Digital Signal Processor, DSP), microcontroller, application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. Program code may also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in the present application are not limited in scope to any particular programming language. In either case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed over a network or through other computer readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including but not limited to floppy diskettes, optical disks, optical disk read-only memories (Compact Disc Read Only Memory, CD-ROMs), magneto-optical disks, read-only memories, random access memories, erasable programmable read-only memories (Erasable Programmable Read Only Memory, EPROM), electrically erasable programmable read-only memories (Electrically Erasable Programmable Read Only Memory, EEPROM), magnetic or optical cards, flash memory, or tangible machine-readable memory for transmitting information over the Internet in an electrical, optical, acoustical, or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some structural or methodological features may be shown in a particular arrangement and/or order. However, it should be understood that such a particular arrangement and/or ordering may not be required. Rather, in some embodiments, these features may be arranged in a different manner and/or order than shown in the drawings of the specification. Additionally, the inclusion of structural or methodological features in a particular figure is not meant to imply that such features are required in all embodiments, and in some embodiments, may not be included or may be combined with other features.
It should be noted that, in the embodiments of the present application, each unit/module is a logic unit/module, and in physical aspect, one logic unit/module may be one physical unit/module, or may be a part of one physical unit/module, or may be implemented by a combination of multiple physical units/modules, where the physical implementation manner of the logic unit/module itself is not the most important, and the combination of functions implemented by the logic unit/module is the key to solve the technical problem posed by the present application. Furthermore, to highlight the innovative part of the present application, the above-described device embodiments of the present application do not introduce units/modules that are less closely related to solving the technical problems presented by the present application, which does not indicate that the above-described device embodiments do not have other units/modules.
It should be noted that in the examples and descriptions of this patent, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the present application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present application.

Claims (16)

1. A video matting method, the method comprising:
information compression is carried out on a target video frame in a video to obtain a first image feature, wherein the first image feature comprises edge information of content in an image extracted from the target video frame;
dividing the target video frame based on the first image features to obtain features of an outline mask image of an object obtained in the dividing process and a first outline mask image of the object in the target video frame, wherein the first outline mask image is a binary image indicating the position of an area where the object in the target video frame is located;
performing feature reconstruction based on the first image features and the features of the obtained outline mask image, fusing the reconstructed features and first hidden state information of the video, and updating the first hidden state information to obtain a fusion result, wherein the first hidden state information represents: fusion features of transparency mask images of object edges in the video frames subjected to matting before the target video frame, the fusion features being obtained by fusing the features obtained by feature reconstruction corresponding to the video frames subjected to matting before the target video frame;
Based on the fusion result, obtaining a target transparency mask image of the object edge in the target video frame;
according to the target transparency mask image and the first contour mask image, carrying out regional matting on the target video frame to obtain a matting result;
the obtaining the target transparency mask image of the object edge in the target video frame based on the fusion result comprises the following steps:
acquiring a second contour mask image of the object in the target video frame based on the fusion result;
fusing the second contour mask image and the fusion result to obtain a target fusion characteristic;
and obtaining a target transparency mask image of the edge of the object in the target video frame based on the target fusion characteristic.
2. The method according to claim 1, wherein the segmenting the target video frame based on the first image feature to obtain features of a contour mask image of the object obtained during the segmentation and a first contour mask image of the object in the target video frame comprises:
performing cascading feature reconstruction based on the first image features to obtain features of outline mask images with sequentially increased dimensions of the objects, and obtaining first outline mask images of the objects in the target video frame based on the features obtained by the last processing;
The first hidden state information includes: a plurality of first sub-hidden state information, wherein each first sub-hidden state information represents the fusion characteristic of a transparency mask image of one scale;
the feature reconstruction is performed based on the first image feature and the feature of the obtained outline mask image, the reconstructed feature and the first hidden state information of the video are fused, and the first hidden state information is updated to obtain a fusion result, including:
performing transparency information fusion a preset number of times in the following manner, and determining the feature obtained by the last processing as the fusion result:
performing feature reconstruction based on a first target feature and a second target feature in the features of the obtained outline mask image to obtain a second image feature with increased scale, wherein the first target feature is the first image feature when information fusion is performed for the first time, the first target feature is the feature obtained by performing information fusion last time when information fusion is performed for other times, and the scale of the first target feature is the same as that of the second target feature;
and fusing the second image feature and the first sub-state information in the first hidden state information, and updating the first sub-state information to obtain a third image feature, wherein the scale of the transparency mask image corresponding to the fused feature represented by the first sub-state information is the same as the scale of the second image feature.
3. The method according to claim 2, wherein the performing feature reconstruction based on the first target feature and a second target feature of the features of the obtained contour mask image to obtain a second image feature with an increased scale comprises:
screening the characteristic features of the edges of the object from the second target features in the features of the obtained outline mask image;
and carrying out feature reconstruction based on the first target feature and the characteristic feature to obtain a second image feature with increased scale.
4. The method according to claim 2, wherein the performing the cascade feature reconstruction based on the first image feature, to obtain features of the outline mask image with sequentially increasing dimensions of the object, includes:
and performing contour information fusion a preset number of times in the following manner to obtain the features of the contour mask images of the object with sequentially increasing scales:
performing feature reconstruction based on a third target feature to obtain a fourth image feature with an increased scale, wherein the third target feature is the first image feature when information fusion is performed for the first time, and the third target feature is the feature obtained by the previous feature reconstruction when information fusion is performed for the other times;
Fusing the fourth image feature and second sub-state information in the second hidden state information and updating the second sub-state information to obtain the feature of the outline mask image of the object, wherein the second hidden state information represents: fusion features of outline mask images of the object in the video frames subjected to matting before the target video frame; the second hidden state information includes: a plurality of second sub-hidden state information, wherein each second sub-hidden state information is used for representing the fusion characteristic of an outline mask image of one scale, and the scale of the outline mask image corresponding to the fusion characteristic represented by the second sub-hidden state information is the same as that of the fourth image feature.
5. The method of claim 4, wherein the first image feature comprises a plurality of first sub-image features;
the step of performing information compression on a target video frame in a video to obtain a first image feature includes:
performing cascade information compression on a target video frame in a video to obtain first sub-image features with sequentially reduced scales;
the third target feature is a first sub-image feature with the minimum scale when the feature reconstruction is carried out for the first time;
The feature reconstruction is performed based on the third target feature to obtain a fourth image feature with an increased scale, including:
and when the feature reconstruction is carried out for other times, carrying out the feature reconstruction based on the third target feature and the first sub-image feature with the same scale as the third target feature to obtain a fourth image feature with the increased scale.
6. The method of claim 4, wherein,
the fusing the second image feature and the first sub-state information in the first hidden state information and updating the first sub-state information to obtain a third image feature includes: segmenting the second image feature to obtain a second sub-image feature and a third sub-image feature; fusing the second sub-image feature and first sub-state information in the first hidden state information and updating the first sub-state information to obtain a fourth sub-image feature; splicing the fourth sub-image feature and the third sub-image feature to obtain a third image feature;
and/or
The fusing the fourth image feature and the second sub-state information in the second hidden state information and updating the second sub-state information to obtain the feature of the outline mask image of the object, including: segmenting the fourth image feature to obtain a fifth sub-image feature and a sixth sub-image feature; fusing the fifth sub-image feature and second sub-state information in the second hidden state information and updating the second sub-state information to obtain a seventh sub-image feature; and splicing the seventh sub-image feature and the sixth sub-image feature to obtain the feature of the outline mask image of the object.
7. The method of claim 4, wherein the compressing the information of the target video frame in the video to obtain the first feature comprises:
inputting a target video frame in a video into an information compression network in a pre-trained video matting model to obtain a first image feature output by the information compression network, wherein the video matting model further comprises: a first image generation network, a second image generation network, a result output network, and a plurality of groups of contour feature generation networks and an equal number of groups of transparency feature generation networks, wherein each group of contour feature generation networks corresponds to the scale of one contour mask image and comprises a first reconstruction sub-network and a first fusion sub-network;
performing feature reconstruction based on the first target feature and a second target feature in the features of the obtained contour mask image to obtain a second image feature with an increased scale, including:
inputting a first target feature and a second target feature in the features of the obtained outline mask image into a target second reconstruction sub-network in a target transparency feature generation network to obtain a second image feature with an increased scale output by the target second reconstruction sub-network, wherein the scale of the transparency mask image corresponding to the target transparency feature generation network is the same as the scale of the second image feature;
The fusing the second image feature and the first sub-state information in the first hidden state information and updating the first sub-state information to obtain a third image feature includes:
inputting the second image feature into a target second fusion sub-network in the target transparency feature generation network, so that the target second fusion sub-network fuses the second image feature and the first sub-state information provided by the target second fusion sub-network and updates the first sub-state information to obtain a third image feature output by the target second fusion sub-network;
the obtaining the target transparency mask image of the object edge in the target video frame based on the fusion result comprises the following steps:
inputting the fusion result into the second image generation network to obtain a target transparency mask image of an object in the target video frame output by the second image generation network;
the feature reconstruction is performed based on the third target feature to obtain a fourth image feature with an increased scale, including:
inputting a third target feature into a target first reconstruction sub-network in a target contour feature generation network to obtain a fourth image feature with increased scale output by the target first reconstruction sub-network, wherein the scale of a contour mask image corresponding to the target contour feature generation network is the same as the scale of the fourth image feature;
The fusing the fourth image feature and the second sub-state information in the second hidden state information and updating the second sub-state information to obtain the feature of the outline mask image of the object, including:
inputting the fourth image feature into a target first fusion sub-network in the target contour feature generation network, so that the target first fusion sub-network fuses the fourth image feature and second sub-state information provided by the target first fusion sub-network and updates the second sub-state information to obtain the feature of the contour mask image of the object output by the target first fusion sub-network;
the obtaining a first contour mask image of the object in the target video frame based on the feature obtained by the last processing includes:
inputting the characteristics obtained by the last processing into the first image generation network to obtain a first contour mask image of an object in the target video frame output by the first image generation network;
and performing region matting on the target video frame according to the target transparency mask image and the first contour mask image to obtain a matting result, wherein the method comprises the following steps:
and inputting the target transparency mask image and the first contour mask image into the result output network, so that the result output network performs region matting on the target video frame based on the obtained image, and a matting result output by the result output network is obtained.
8. The method of claim 7, wherein the transparency feature generation network further comprises a feature screening sub-network;
before the inputting the first target feature and the second target feature in the features of the obtained outline mask image into the target transparency feature generation network, the method further comprises:
inputting a second target feature in the features of the obtained outline mask image into a target feature screening sub-network in the target transparency feature generation network to obtain a target screening feature which has characterizations on the edge outline of the object in the second target feature output by the target feature screening sub-network;
the inputting the first target feature and the second target feature in the features of the obtained outline mask image into the target transparency feature to generate a target second reconstruction sub-network in the network comprises the following steps:
and inputting the first target feature and the target screening feature into a target transparency feature to generate a target second reconstruction sub-network in the network.
9. The method according to claim 7 or 8, wherein,
the first fusion sub-network is: a gated recurrent unit (GRU) or a long short-term memory (LSTM) unit;
and/or
the second fusion sub-network is: a gated recurrent unit (GRU) or a long short-term memory (LSTM) unit;
and/or
The first reconstruction sub-network is realized based on a QARepVGG network structure, or the first reconstruction sub-network in a specific profile feature generation network is realized based on the QARepVGG network structure, wherein the specific profile feature generation network is as follows: the corresponding outline mask image has an outline feature generating network with a dimension smaller than a first preset dimension;
and/or
The second reconstruction sub-network is realized based on a QARepVGG network structure, or the second reconstruction sub-network in the specific transparency characteristic generating network is realized based on the QARepVGG network structure, wherein the specific transparency characteristic generating network is as follows: and generating a network by transparency features of which the corresponding transparency mask image is smaller than a second preset scale.
10. A method according to claim 7 or 8, wherein the video matting model is trained in the following manner:
inputting a first sample video frame in a sample video into an initial model of the video matting model for processing to obtain a first sample contour mask image of an object in the first sample video frame output by a first image generating network in the initial model;
Obtaining a first difference between an annotation mask image corresponding to the first sample video frame and an annotation mask image corresponding to a second sample video frame, wherein the second sample video frame is: video frames which are arranged in front of the first sample video frame in the sample video and are spaced by a preset frame number;
obtaining a second difference between the first sample contour mask image and a second sample contour mask image, wherein the second sample contour mask image is: the first image generates a mask image output by a network when the initial model processes the second sample video frame;
obtaining a third difference between the first sample transparency mask image and the second sample transparency mask image, wherein the first sample transparency mask image is: the initial model generates a mask image output by a network by the second image when processing the first sample video frame, wherein the second sample transparency mask image is: the initial model generates a mask image output by a network by the second image when processing the second sample video frame;
calculating a training loss based on the first difference, the second difference, and the third difference;
And based on the training loss, carrying out model parameter adjustment on the initial model to obtain the video matting model.
11. The method of claim 10, wherein,
the first sample contour mask image includes: a first mask subgraph for identifying an area where an object is located in the first sample video frame and a second mask subgraph for identifying an area outside the object in the first sample video frame;
the second sample contour mask image includes: a third mask subgraph of an area where an object is located in the second sample video frame and a fourth mask subgraph of an area outside the object in the second sample video frame are marked;
the obtaining a second difference between the first and second sample profile mask images comprises:
and obtaining the difference between the first mask subgraph and the third mask subgraph, and obtaining the difference between the second mask subgraph and the fourth mask subgraph, and obtaining a second difference containing the obtained difference.
12. The method of claim 1, wherein the compressing the information of the target video frame in the video to obtain the first feature comprises:
performing convolution transformation on a target video frame in the video to obtain a fifth image feature;
performing a linear transformation on the fifth image feature based on a convolution kernel to obtain a sixth image feature;
carrying out batch standardization processing on the sixth image features to obtain seventh image features;
performing nonlinear transformation on the seventh image feature to obtain an eighth image feature;
and performing a linear transformation on the eighth image feature based on a convolution kernel to obtain the first feature of the target video frame.
13. The method of claim 12, wherein,
the convolution kernel is: 1x1 convolution kernel;
and/or
The nonlinear transformation is performed on the seventh image feature to obtain an eighth image feature, including:
and performing nonlinear transformation on the seventh image feature based on the RELU activation function to obtain an eighth image feature.
14. An electronic device, comprising:
one or more processors and memory;
the memory being coupled to the one or more processors, the memory being for storing computer program code comprising computer instructions that the one or more processors invoke to cause the electronic device to perform the method of any of claims 1-13.
15. A computer readable storage medium comprising a computer program which, when run on an electronic device, causes the electronic device to perform the method of any one of claims 1 to 13.
16. A chip system for application to a terminal, the chip system comprising one or more processors for invoking computer instructions to cause the terminal to input data into the chip system and to output the result of processing after processing the data by performing the method of any of claims 1 to 13.
CN202310906047.3A 2023-07-24 2023-07-24 Video matting method, electronic device, storage medium and program product Active CN116630354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310906047.3A CN116630354B (en) 2023-07-24 2023-07-24 Video matting method, electronic device, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310906047.3A CN116630354B (en) 2023-07-24 2023-07-24 Video matting method, electronic device, storage medium and program product

Publications (2)

Publication Number Publication Date
CN116630354A CN116630354A (en) 2023-08-22
CN116630354B true CN116630354B (en) 2024-04-12

Family

ID=87603012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310906047.3A Active CN116630354B (en) 2023-07-24 2023-07-24 Video matting method, electronic device, storage medium and program product

Country Status (1)

Country Link
CN (1) CN116630354B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095019B (en) * 2023-10-18 2024-05-10 腾讯科技(深圳)有限公司 Image segmentation method and related device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446380A (en) * 2019-09-02 2021-03-05 华为技术有限公司 Image processing method and device
CN113253890A (en) * 2021-04-02 2021-08-13 中南大学 Video image matting method, system and medium
CN113436097A (en) * 2021-06-24 2021-09-24 湖南快乐阳光互动娱乐传媒有限公司 Video matting method, device, storage medium and equipment
CN114022497A (en) * 2021-09-30 2022-02-08 泰康保险集团股份有限公司 Image processing method and device
CN115984307A (en) * 2023-01-09 2023-04-18 北京达佳互联信息技术有限公司 Video object segmentation method and device, electronic equipment and storage medium
CN116051575A (en) * 2022-12-30 2023-05-02 苏州万集车联网技术有限公司 Image segmentation method, apparatus, computer device, and storage medium program product
CN116091955A (en) * 2021-11-04 2023-05-09 中移(杭州)信息技术有限公司 Segmentation method, segmentation device, segmentation equipment and computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8077969B2 (en) * 2005-12-30 2011-12-13 Telecom Italia S.P.A. Contour finding in segmentation of video sequences
US8358691B1 (en) * 2009-10-30 2013-01-22 Adobe Systems Incorporated Methods and apparatus for chatter reduction in video object segmentation using a variable bandwidth search region
US9858675B2 (en) * 2016-02-11 2018-01-02 Adobe Systems Incorporated Object segmentation, including sky segmentation

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446380A (en) * 2019-09-02 2021-03-05 华为技术有限公司 Image processing method and device
CN113253890A (en) * 2021-04-02 2021-08-13 中南大学 Video image matting method, system and medium
CN113436097A (en) * 2021-06-24 2021-09-24 湖南快乐阳光互动娱乐传媒有限公司 Video matting method, device, storage medium and equipment
CN114022497A (en) * 2021-09-30 2022-02-08 泰康保险集团股份有限公司 Image processing method and device
CN116091955A (en) * 2021-11-04 2023-05-09 中移(杭州)信息技术有限公司 Segmentation method, segmentation device, segmentation equipment and computer readable storage medium
CN116051575A (en) * 2022-12-30 2023-05-02 苏州万集车联网技术有限公司 Image segmentation method, apparatus, computer device, and storage medium program product
CN115984307A (en) * 2023-01-09 2023-04-18 北京达佳互联信息技术有限公司 Video object segmentation method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Real-time video matting algorithm based on information fusion in complex environments; Deng Shuang; Li Leimin; Huang Yuqing; 《计算机工程与应用》 (Computer Engineering and Applications); Vol. 55, No. 22; pp. 208-211, 217 *

Also Published As

Publication number Publication date
CN116630354A (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN111598776B (en) Image processing method, image processing device, storage medium and electronic apparatus
CN112543347B (en) Video super-resolution method, device, system and medium based on machine vision coding and decoding
CN113066017B (en) Image enhancement method, model training method and equipment
WO2021078001A1 (en) Image enhancement method and apparatus
CN116630354B (en) Video matting method, electronic device, storage medium and program product
CN115131419B (en) Image processing method for forming Tyndall luminous effect and electronic equipment
CN115061770B (en) Method and electronic device for displaying dynamic wallpaper
CN114627269A (en) Virtual reality security protection monitoring platform based on degree of depth learning target detection
CN117274109B (en) Image processing method, noise reduction model training method and electronic equipment
CN113538227B (en) Image processing method based on semantic segmentation and related equipment
CN116630355B (en) Video segmentation method, electronic device, storage medium and program product
CN117132515A (en) Image processing method and electronic equipment
Huang et al. Edge device-based real-time implementation of CycleGAN for the colorization of infrared video
CN115830362A (en) Image processing method, apparatus, device, medium, and product
CN114419517A (en) Video frame processing method and device, computer equipment and storage medium
CN116703729B (en) Image processing method, terminal, storage medium and program product
CN115601536B (en) Image processing method and electronic equipment
CN115988339B (en) Image processing method, electronic device, storage medium, and program product
CN116205806B (en) Image enhancement method and electronic equipment
CN116205822B (en) Image processing method, electronic device and computer readable storage medium
CN112950516B (en) Method and device for enhancing local contrast of image, storage medium and electronic equipment
Amendola et al. Image Translation and Reconstruction using a Single Dual Mode Lightweight Encoder
CN117689611A (en) Quality prediction network model generation method, image processing method and electronic equipment
CN117710697A (en) Object detection method, electronic device, storage medium, and program product
CN117689545A (en) Image processing method, electronic device, and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant