CN114040140A - Video matting method, device and system and storage medium - Google Patents

Video matting method, device and system and storage medium

Info

Publication number
CN114040140A
Authority
CN
China
Prior art keywords
resolution
video
code stream
feature data
low
Prior art date
Legal status
Granted
Application number
CN202111348109.0A
Other languages
Chinese (zh)
Other versions
CN114040140B (en)
Inventor
张红
田文宝
范文新
李一凡
Current Assignee
Tianjin Yifuzhen Internet Hospital Co ltd
Beijing Yibai Technology Co ltd
Original Assignee
Tianjin Yifuzhen Internet Hospital Co ltd
Beijing Yibai Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Tianjin Yifuzhen Internet Hospital Co., Ltd. and Beijing Yibai Technology Co., Ltd.
Priority to CN202111348109.0A
Publication of CN114040140A
Application granted
Publication of CN114040140B
Current legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00: Television systems
    • H04N 7/01: Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N 7/0127: Conversion of standards by changing the field or frame frequency of the incoming video signal, e.g. frame rate converter
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/44: Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/90: Coding using techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N 19/96: Tree coding, e.g. quad-tree coding

Abstract

The application provides a video matting method, device, system and storage medium, wherein the method comprises the following steps: encoding an initial video to obtain a video code stream and dividing the video code stream into three paths; processing the first path of video code stream into a low-resolution code stream and extracting global feature data from the low-resolution code stream; extracting local feature data from the second path of video code stream; obtaining a low-resolution video by decoding according to the global feature data and the local feature data; processing the third path of video code stream into a high-resolution code stream and extracting edge feature data from the high-resolution code stream; decoding the edge feature data to obtain a high-resolution video; and fusing the human-shaped region recognition result in the low-resolution video with the human-shaped edge recognition result in the high-resolution video to obtain the human-shaped matting result for the initial video. The scheme produces matting results for the initial video with high efficiency, high accuracy and low cost, and is applicable to a wide range of scenes.

Description

Video matting method, device and system and storage medium
Technical Field
The present application relates to the field of video stream data processing technologies, and in particular, to a method, an apparatus, a system, and a storage medium for video matting.
Background
In video call, monitoring or capture scenarios, a portrait needs to be extracted from each frame image of the video stream data by means of a matting algorithm.
At present, matting algorithms include the real-time portrait background replacement model MODNet, which provides a simple, fast and stable real-time portrait matting algorithm. Its advantage is that a portrait recognition result can be output by feeding only the video stream captured by the camera into the model, without requiring an additional real background image as input. Its disadvantage is that it only fits video stream data that is consistent with its existing training samples: if the background in the input video stream is not similar to the backgrounds in those training samples (i.e., the background is one the model has not "seen"), the matting result output by the model contains a certain amount of noise. Because the training set of MODNet is not large, the application scenarios of this matting algorithm are limited, and retraining the model is too costly. In practice, the shooting backgrounds of cameras are difficult to unify, and it cannot be guaranteed that every background in the input video streams was covered during model training, so matting with this algorithm suffers a large drop in accuracy.
In other schemes, a sufficiently complex model is proposed to matte the video stream, and under the same test conditions the matting effect obtained with such a complex model is superior to that of MODNet. However, such complex models generally require two camera video signals as input, one of which must record the real background, and they place very high demands on the lighting stability of the shooting environment and on the stability of the camera parameters. These harsh constraints make complex models difficult to put into practical use.
Therefore, there is a need for improvements to existing video matting algorithms.
Disclosure of Invention
The application provides a video matting method, a device, a system and a storage medium, which aim to solve the technical problems that in the prior art, the accuracy of a video stream matting algorithm is difficult to ensure or the applicability is poor.
In some embodiments of the present application, a method for video matting is provided, which includes the following steps:
coding an initial video to obtain a video code stream and dividing the video code stream into three paths;
performing resolution reduction processing on the first path of video code stream to obtain a low-resolution code stream, and extracting global feature data in the low-resolution code stream; performing multi-scale pooling analysis on a second path of video code stream, and extracting local characteristic data in the second path of video code stream;
fusing the global feature data and the local feature data to obtain low-resolution feature data; decoding the low-resolution feature data and the local feature data to obtain a low-resolution video, wherein the low-resolution video comprises a human-shaped region identification result;
performing resolution raising processing on the third path of video code stream to obtain a high-resolution code stream; extracting edge characteristic data in the high-resolution code stream; decoding the edge feature data to obtain a high-resolution video, wherein the high-resolution video comprises a human-shaped edge identification result;
and fusing the human-shaped region recognition result in the low-resolution video and the human-shaped edge recognition result in the high-resolution video to obtain a human-shaped matting result in the initial video.
In the video matting method provided in some embodiments of the present application, the step of obtaining low-resolution feature data after fusing the global feature data and the local feature data further includes:
raising the resolution of the low resolution feature data such that the final resolution of the low resolution feature data corresponds to the resolution of the initial video.
In the video matting method provided in some embodiments of the present application, the step of extracting local feature data from the second path of video code stream after performing multi-scale pooling analysis on the second path of video code stream includes:
performing at least five pooling-scale analysis processes on the second path of video code stream, wherein each pooling-scale process comprises:
performing dimensionality-reduction processing on the second path of video code stream to obtain the dimensionality-reduced feature data required by that pooling scale, and performing convolution processing on the dimensionality-reduced feature data to obtain convolved feature data;
and performing dimensionality-raising processing on the convolved feature data to obtain pooled feature data with the same dimensions as the second path of video code stream.
In the video matting method provided in some embodiments of the present application, the step of decoding the low-resolution feature data and the local feature data to obtain a low-resolution video comprising a human-shaped region recognition result includes:
performing low-resolution decoding processing on the low-resolution feature data and the local feature data by a trimap decoder;
and obtaining the human-shaped region recognition result according to the trimap prediction result output by the trimap decoder.
In the video matting method provided in some embodiments of the present application, the step of decoding the edge feature data to obtain a high-resolution video comprising a human-shaped edge recognition result includes:
carrying out high-resolution fine adjustment processing on the edge characteristic data through an edge detection decoder;
and obtaining the human-shaped edge recognition result according to the edge prediction result output by the edge detection decoder.
Based on the same inventive concept, some embodiments of the present application provide a video matting device, including:
the encoder is used for encoding the initial video to obtain a video code stream and dividing the video code stream into three paths;
the resolution reduction model is used for carrying out resolution reduction processing on the first path of video code stream to obtain a low-resolution code stream;
the global feature extraction model is used for extracting global feature data in the low-resolution code stream;
the local characteristic extraction model is used for performing multi-scale pooling analysis on the second path of video code stream and then extracting local characteristic data in the second path of video code stream;
the connector is used for fusing the global feature data and the local feature data to obtain low-resolution feature data;
the first decoder is used for decoding the low-resolution characteristic data and the local characteristic data to obtain a low-resolution video, and the low-resolution video comprises a human-shaped region identification result;
the resolution raising model is used for raising the resolution of the third path of video code stream to obtain a high-resolution code stream;
the edge feature extraction model is used for extracting edge feature data in the high-resolution code stream;
the second decoder is used for decoding the edge characteristic data to obtain a high-resolution video, and the high-resolution video comprises a human-shaped edge identification result;
and the fusion model is used for fusing the human-shaped region recognition result in the low-resolution video and the human-shaped edge recognition result in the high-resolution video to obtain a human-shaped matting result in the initial video.
The video matting device provided in some embodiments of the present application further includes:
and the resolution recovery model is used for increasing the resolution of the low-resolution feature data so that the final resolution of the low-resolution feature data is consistent with the resolution of the initial video.
In the video matting device provided in some embodiments of the present application:
the first decoder is a trimap decoder which performs low-resolution decoding processing on the low-resolution feature data and the local feature data, and the human-shaped region recognition result is obtained according to the trimap prediction result output by the trimap decoder;
the second decoder is an edge detection decoder which carries out high-resolution fine adjustment processing on the edge characteristic data; and obtaining the human-shaped edge recognition result according to the edge prediction result output by the edge detection decoder.
Based on the same inventive concept, some embodiments of the present application further provide a video matting system, where the system includes at least one processor and at least one memory, at least one of the memories stores program instructions, and at least one of the processors reads the program instructions to perform the video matting method according to any one of the above aspects.
Based on the same inventive concept, some embodiments of the present application further provide a readable storage medium, where program information is stored in the readable storage medium, and a computer reads the program information and then executes the video matting method according to any one of the above aspects.
Compared with the prior art, the technical scheme provided by the embodiments of the application has at least the following beneficial effects: the video code stream obtained by encoding the initial video is divided into three paths; the resolution of the first path of video code stream is reduced and global feature data are extracted from it; the second path of video code stream is subjected to multi-scale pooling analysis and local feature data are extracted from it; the global feature data and the local feature data are fused and decoded together with the local feature data, yielding a human-shaped region recognition result at low resolution. In parallel, the resolution of the third path of video code stream is raised, edge features of the human-shaped region are extracted from the resolution-raised data, and the extracted high-resolution edge features are decoded, yielding a high-precision human-shaped edge recognition result. Finally, the two recognition results are fused to obtain the final human-shaped matting result. The above scheme can complete the human-shaped matting of the initial video while ensuring the human-shaped region recognition speed, the human-shaped edge recognition speed and the human-shaped edge recognition accuracy. Moreover, the scheme places no high requirements on the input video signal: only one camera input is needed, the requirement on the background image during human-shape recognition is low, and no support from a large number of training samples is needed, so the cost of both the hardware structure and the software analysis algorithm is low, which solves the problems of the video matting algorithms in the prior art.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solutions in the different embodiments of the present application are further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not limit the subject matter. In the drawings:
FIG. 1 is a flow chart of a method of video matting according to an embodiment of the present application;
FIG. 2 is a diagram illustrating various stages in the processing of an initial video by a matting method according to an embodiment of the present application;
FIG. 3 is a block diagram of a video matting apparatus according to an embodiment of the present application;
FIG. 4 is a schematic hardware structure diagram of a video matting system according to an embodiment of the present application.
Detailed Description
The preferred embodiments of the present application will be described in conjunction with the accompanying drawings; it will be understood that they are described herein only to illustrate and explain the present application and not to limit it. The terms "first", "second" and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, the technical solutions in the following embodiments of the present application may be combined with one another provided they do not contradict each other, and their technical features may be interchanged.
The embodiment provides a video matting method, which can be applied to some control systems such as a video conference control system, a video monitoring system or a video chat system that need to identify human figures in videos, as shown in fig. 1, the method includes the following steps:
s101: the method comprises the steps of coding an initial video to obtain a video code stream and dividing the video code stream into three paths. The initial video is original video data containing human figures shot by the camera. It is understood that video is a sequence of consecutive frames of an image, each frame being an image. The continuous frame images have extremely high similarity, so that the video needs to be encoded in order to facilitate storage or transmission of the video, and the video code stream obtained after encoding can reduce the occupation of storage space compared with the original video. The bandwidth occupied in the transmission process is reduced, and the transmission efficiency is improved. The coding mode in this step can be realized by using coding mode in MPEG (Moving Picture Experts Group) series and coding mode in H.26X series, and can be implemented according to the standard of the selected coding method.
S102: and performing resolution reduction processing on the first path of video code stream to obtain a low-resolution code stream, and extracting global characteristic data in the low-resolution code stream. The method for reducing resolution in this step may be determined according to the format of the video stream obtained in step S101. For example, the h.263 encoding format inherently has a reduced resolution mode, which can maintain a satisfactory video picture quality despite the reduced resolution. In the case of the MPEG-4 encoding format, a convolutional neural network algorithm may be used to reduce the resolution of the original video. By reducing the resolution, the data storage space of the video code stream can be reduced, and the transmission rate of the video code stream can be improved. The global features are based on the description of gray pixel values, and can be divided into histogram features, color features, contour features and the like, and the global features can be obtained in a global pooling manner. The histogram feature represents a human-shaped appearance feature, the color feature may be represented by an RGB (red, green, blue) color space or an HSV (hue, saturation, brightness) color space, and the contour feature may be determined by a background difference method or a frame difference method.
S103: and performing multi-scale pooling analysis on the second path of video code stream, and extracting local characteristic data in the second path of video code stream. Local characteristics corresponding to the pooling scale can be obtained in each scale pooling mode, and different scale characteristics of the video code stream can be obtained after the video code stream is processed in multi-scale pooling analysis. The pooling is used for combining the features with similarity, different pooling scales can divide the data of one image into different similar feature sets, and the relative position relationship between the different similar feature sets in the original image has fixity. By connecting pooling results of different scales, it is ensured that more comprehensive local features can be obtained.
S104: and fusing the global feature data and the local feature data to obtain low-resolution feature data. After the data of the global characteristic and the local characteristic are obtained, the data of the global characteristic and the data of the local characteristic are fused to obtain the whole data of the video code stream, and the data volume obtained in the step is reduced to some extent due to the processing process of the resolution and the pooling, so that the subsequent data processing speed can be improved.
S105: and decoding the low-resolution characteristic data and the local characteristic data to obtain a low-resolution video, wherein the low-resolution video comprises a human-shaped region identification result. The decoding operation in this step can be selected according to the practical application in combination with the encoding operation in step S101.
S106: and performing resolution raising processing on the third path of video code stream to obtain a high-resolution code stream, and extracting edge characteristic data in the high-resolution code stream. At present, there is a method of processing an image into super-resolution, and a progressive method may be adopted to divide the resolution of a video frame including a human-shaped region into a plurality of steps for training, and each step is improved by a little more than the previous step, so that the resolution of the video frame can be stably and slowly improved. There are also methods in which the pixel difference values of the humanoid region are directly multiplied by the heat map values. Because the processing result of this step is used to accurately detect the edge of the human figure region, it is not necessary to analyze and calculate all the feature data actually, and it is only necessary to determine the critical point between the human figure region and the background according to the attribute of the encoded data, and determine the edge feature data accordingly.
S107: and decoding the edge feature data to obtain a high-resolution video, wherein the high-resolution video comprises a human-shaped edge identification result. The decoding operation in this step can be selected according to the practical application in combination with the encoding operation in step S101.
S108: and fusing the human-shaped region recognition result in the low-resolution video and the human-shaped edge recognition result in the high-resolution video to obtain a human-shaped matting result in the initial video. As shown in fig. 2, the difference between the high-resolution video and the low-resolution video is only the human shape recognition result, and the human shape region recognition result in the low-resolution video includes all information in the region, such as a human face, clothing, etc., but the noise at the edge is large. And the high-resolution video only contains the edge recognition result of the humanoid region, but does not need to include the information inside the edge. On the basis, the final human shape matting result can be obtained after the two recognition results are fused (the recognition results in the human shape region in the figure, such as human faces, glasses, clothes and the like, are not shown in detail and are consistent with the shooting result in the initial video in practical application). In addition, it can be understood that the processing process for the three video code streams is actually parallel, and is not limited by the sequence of the above step numbers.
In the above scheme provided by this embodiment, the video code stream obtained by encoding the initial video is divided into three paths: the resolution of the first path is reduced and global feature data are extracted from it; the second path is subjected to multi-scale pooling analysis and local feature data are extracted from it; the global feature data and the local feature data are fused and decoded together with the local feature data, yielding the human-shaped region recognition result at low resolution. In parallel, the resolution of the third path is raised, edge features of the human-shaped region are extracted from the resolution-raised data, and the extracted high-resolution edge features are decoded, yielding a high-precision human-shaped edge recognition result. Finally, the two recognition results are fused to obtain the final human-shaped matting result.
The above scheme can complete the human-shaped matting of the initial video while ensuring the human-shaped region recognition speed, the human-shaped edge recognition speed and the human-shaped edge recognition accuracy. Moreover, the scheme places no high requirements on the input video signal: only one camera input is needed, the requirement on the background image during human-shape recognition is low, and no support from a large number of training samples is needed, so the cost of both the hardware structure and the software analysis algorithm is low, which solves the problems of the video matting algorithms in the prior art.
In some embodiments, step S104 above may further include: raising the resolution of the low-resolution feature data so that its final resolution corresponds to the resolution of the initial video. The resolution raising in this step may be implemented as set forth in step S106. In step S102 the resolution of the original video is reduced; in this step, after the extraction of the global and local features is completed, the resolution can be restored to that of the original video, so that the resolution of the finally decoded video frames is highly consistent with that of the original images.
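As a concrete illustration of this optional resolution-recovery step, the following sketch upsamples the fused low-resolution features back to the initial video size with bilinear interpolation; the 90 x 90 and 713 x 713 sizes reuse figures quoted later in the description and are otherwise arbitrary.

```python
import torch
import torch.nn.functional as F

def restore_resolution(low_res_features: torch.Tensor,
                       original_size: tuple) -> torch.Tensor:
    """Illustrative resolution recovery: upsample the fused low-resolution
    feature map back to the spatial size of the initial video, as described
    for the optional extension of step S104. Bilinear interpolation matches
    the interpolation method mentioned later in the description."""
    return F.interpolate(low_res_features, size=original_size,
                         mode="bilinear", align_corners=False)

fused = torch.randn(1, 512, 90, 90)           # fused global + local features
restored = restore_resolution(fused, (713, 713))
print(restored.shape)  # torch.Size([1, 512, 713, 713])
```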
In some embodiments of the present application, the encoding process in step S101 may be implemented by a convolutional neural network: the initial video is fed as the input signal into the convolutional neural network, which comprises a plurality of convolutional layers including one or more downsampling (pooling) layers. The feature set output by the n-th convolutional layer of the convolutional neural network is taken as the video code stream, where n is a positive integer. That is, the feature set output by the n-th convolutional layer serves both as the basis for global feature extraction and as the basis for the multi-scale pooling analysis.
In the present application, a pyramid scene parsing network may be used to implement the multi-scale pooling analysis, where "multi-scale" preferably includes at least five pooling scales, that is: performing at least five pooling-scale analysis processes on the second path of video code stream, each pooling-scale process comprising: performing dimensionality-reduction processing on the second path of video code stream to obtain the dimensionality-reduced feature data required by that pooling scale, and performing convolution processing on the dimensionality-reduced feature data to obtain convolved feature data; and performing dimensionality-raising processing on the convolved feature data to obtain pooled feature data with the same dimensions as the second path of video code stream. The following example illustrates the parsing process using five pooling scales.
Take the feature output by Conv5_3 of the convolutional neural network model (the 3rd convolutional layer in the 5th convolutional block), with shape 2048 × 90 × 90, as the parsing object of the pyramid parsing network, and split it into 4 paths for processing inside the pyramid scene parsing network. In this embodiment, the feature is compressed by global average pooling into a 1 × 1 feature (reduced by a factor of 90), a 2 × 2 feature (by 45), a 3 × 3 feature (by 30) and a 6 × 6 feature (by 15), and the 4 feature maps are then restored to 90 × 90 by bilinear interpolation. The five features (the Conv5_3 feature and the 1 × 1, 2 × 2, 3 × 3 and 6 × 6 features) are connected together, after which the category output is produced.
The Conv5_3 (2048 × 90 × 90) feature above is computed as follows:
the input to the convolutional neural network model is 3 × 713 × 713 (i.e., three channels, video image size 713 × 713); the first convolutional layer in the first convolutional block outputs 64 × 357 × 357; the second convolutional layer in the first convolutional block outputs 64 × 179 × 179, and the outputs of the subsequent convolutional layers follow by analogy. Each convolutional layer applies a 1 × 1 convolution for dimensionality reduction, then a 3 × 3 convolution, and then a 1 × 1 convolution to restore the dimensionality of the previous layer's output, finally yielding the Conv5_3 feature output of 2048 × 90 × 90.
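The following sketch shows one residual bottleneck block following the "1 × 1 reduce, 3 × 3 convolve, 1 × 1 restore" rule stated above. The channel counts and the residual connection are assumptions in the spirit of common backbones such as ResNet; the patent only specifies the kernel pattern.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Illustrative bottleneck matching the '1x1 reduce, 3x3 conv, 1x1 restore'
    rule described for the backbone. The channel counts and the skip
    connection are assumptions, not values from the patent."""
    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1), nn.ReLU(inplace=True),   # 1x1 reduce
            nn.Conv2d(reduced, reduced, kernel_size=3, padding=1), nn.ReLU(inplace=True),  # 3x3
            nn.Conv2d(reduced, channels, kernel_size=1),                           # 1x1 restore
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x + self.block(x))

# One block operating on a Conv5_3-sized feature map (2048 x 90 x 90).
out = Bottleneck(2048, 512)(torch.randn(1, 2048, 90, 90))
print(out.shape)  # torch.Size([1, 2048, 90, 90])
```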
When the Conv5_3 (2048 × 90 × 90) feature is reduced by a factor of 90 to a 1 × 1 feature, the convolution parameters are selected as AVE, kernel_size: 90, stride: 90, num_output: 512; by analogy, when the Conv5_3 (2048 × 90 × 90) feature is reduced by a factor of 45 to a 2 × 2 feature, the parameters are AVE, kernel_size: 45, stride: 45, num_output: 512; when it is reduced by a factor of 30 to a 3 × 3 feature, the parameters are AVE, kernel_size: 30, stride: 30, num_output: 512; and when it is reduced by a factor of 15 to a 6 × 6 feature, the parameters are AVE, kernel_size: 15, stride: 15, num_output: 512. The feature obtained after connecting the above five features together can be used as the finally extracted local feature data. The local feature data are then concatenated (Concat) with the global feature data, after which the low-resolution decoding operation is performed to obtain the low-resolution video comprising the human-shaped region recognition result.
As a preferred implementation, the resolution of the 1 × 1, 2 × 2, 3 × 3 and 6 × 6 features may be restored by bilinear interpolation. During resolution restoration, the amount of data to be processed can be reduced by first reducing the dimensionality of the original video stream data and then raising the dimensionality of the reduced data again to restore its resolution. In the resolution restoration, the parameters of the bilinear interpolation are chosen as height: 90, width: 90, so each feature can be restored to a resolution of 90 × 90.
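Putting the quoted parameters together, a pyramid pooling branch of this kind can be sketched as follows. The kernel and stride values 90, 45, 30 and 15 and the bilinear restoration to 90 × 90 are taken from the description; the placement of the 1 × 1 convolution that maps each pooled branch to 512 channels (num_output: 512) follows the usual pyramid-scene-parsing design and is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Sketch of the multi-scale pooling branch with the parameters quoted
    above: average pooling with kernel/stride 90, 45, 30 and 15 on a
    2048 x 90 x 90 Conv5_3 feature, a 1x1 convolution to 512 channels,
    bilinear restoration to 90 x 90, and concatenation with the original
    feature. The 1x1 convolution placement is an assumption."""
    def __init__(self, in_channels: int = 2048, out_channels: int = 512):
        super().__init__()
        self.kernels = [90, 45, 30, 15]   # produce 1x1, 2x2, 3x3 and 6x6 maps
        self.reduce = nn.ModuleList(
            [nn.Conv2d(in_channels, out_channels, kernel_size=1) for _ in self.kernels]
        )

    def forward(self, conv5_3: torch.Tensor) -> torch.Tensor:
        h, w = conv5_3.shape[-2:]          # 90 x 90
        branches = [conv5_3]
        for k, reduce in zip(self.kernels, self.reduce):
            pooled = F.avg_pool2d(conv5_3, kernel_size=k, stride=k)   # AVE pooling
            pooled = reduce(pooled)                                   # num_output: 512
            # Bilinear restoration to height: 90, width: 90.
            branches.append(F.interpolate(pooled, size=(h, w),
                                          mode="bilinear", align_corners=False))
        return torch.cat(branches, dim=1)  # 2048 + 4 * 512 = 4096 channels

local_features = PyramidPooling()(torch.randn(1, 2048, 90, 90))
print(local_features.shape)  # torch.Size([1, 4096, 90, 90])
```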
In some embodiments of the present application, the decoding modes of the high resolution and the low resolution videos may be the same or different, and the algorithm can be simplified when the same decoding mode is adopted. In this embodiment:
When decoding the low-resolution feature data and the local feature data, low-resolution decoding processing is performed on them by a trimap decoder (Trimap Decoder), and the human-shaped region recognition result is obtained according to the trimap prediction result (Trimap Generation) output by the trimap decoder.
When decoding the edge feature data to obtain the high-resolution video, high-resolution fine-tuning processing is performed on the edge feature data by an edge detection decoder (Edge Decoder), and the human-shaped edge recognition result is obtained according to the edge prediction result (Edge Prediction) output by the edge detection decoder. The fine-tuning process can be realized by selecting a pre-trained fine-tuning model (Fine-tuning): once the selected learning model has been trained on training samples and has passed the tests, it can directly further optimize the decoded data.
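Neither decoder's internal structure is given, so the following heads are only minimal stand-ins: a trimap decoder that maps the concatenated low-resolution features to three classes (foreground, background, unknown) and an edge decoder that produces a single-channel edge prediction. All layer widths are assumptions.

```python
import torch
import torch.nn as nn

class TrimapDecoder(nn.Module):
    """Illustrative low-resolution decoder head: maps the concatenated
    global + local features to a 3-class trimap, from which the human-shaped
    region result is taken. Layer widths are assumptions."""
    def __init__(self, in_channels: int = 4096):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 3, 1),  # trimap logits: foreground / background / unknown
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats).softmax(dim=1)   # Trimap Generation

class EdgeDecoder(nn.Module):
    """Illustrative high-resolution decoder head: refines the edge features
    into a single-channel edge prediction (Edge Prediction)."""
    def __init__(self, in_channels: int = 64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, edge_feats: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(edge_feats))

trimap = TrimapDecoder()(torch.randn(1, 4096, 90, 90))
edge = EdgeDecoder()(torch.randn(1, 64, 720, 720))
print(trimap.shape, edge.shape)  # torch.Size([1, 3, 90, 90]) torch.Size([1, 1, 720, 720])
```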
In the scheme, different decoding steps are selected for processing the low-resolution and high-resolution videos, so that the decoding method is more targeted, and the accuracy of the result obtained by decoding is higher.
In some embodiments of the present application, there is also provided a video matting apparatus, as shown in fig. 3, the apparatus includes an encoder 100, a low resolution training unit (including a resolution reduction model 201, a global feature extraction model 202, a local feature extraction model 203, a connector 204), a high resolution training unit (including a resolution increase model 301, an edge feature extraction model 302), a first decoder 401, a second decoder 402, and a fusion model 500, where:
the encoder 100 is configured to encode an initial video to obtain a video code stream, and divide the video code stream into three paths, where the encoding mode may be implemented by using an encoding format in an MPEG (Moving Picture Experts Group) series or an encoding format in an h.26x series, and the encoding mode may be implemented according to a standard of a selected encoding method.
The resolution reduction model 201 is configured to perform resolution-reduction processing on the first path of video code stream to obtain a low-resolution code stream; the resolution-reduction method may be determined according to the format of the video code stream obtained in step S101.
The global feature extraction model 202 is configured to extract global feature data from the low-resolution code stream; the global features may be obtained by global pooling.
The local feature extraction model 203 is configured to perform multi-scale pooling analysis on the second path of video code stream and extract local feature data from it. Each pooling scale yields local features corresponding to that scale, so multi-scale pooling analysis produces features of the code stream at different scales, and connecting the pooling results of different scales ensures that more comprehensive local features are obtained.
The connector 204 is configured to fuse the global feature data and the local feature data to obtain low-resolution feature data; the fused data constitute the overall data of the video code stream.
The first decoder 401 is configured to decode the low-resolution feature data and the local feature data to obtain a low-resolution video, where the low-resolution video comprises a human-shaped region recognition result. Preferably, the first decoder 401 is a trimap decoder, which performs low-resolution decoding processing on the low-resolution feature data and the local feature data; the human-shaped region recognition result is obtained according to the trimap prediction result output by the trimap decoder.
The resolution increasing model 301 is configured to increase the resolution of the third video code stream to obtain a high-resolution code stream, where the high-resolution code stream is used to accurately detect the edge of the human figure region.
The edge feature extraction model 302 is configured to extract edge feature data from the high-resolution code stream; the boundary between the human-shaped region and the background can be determined according to the attributes of the encoded data, and the edge feature data can be determined from that boundary.
The second decoder 402 is configured to decode the edge feature data to obtain a high-resolution video, where the high-resolution video includes a human-shaped edge recognition result; preferably, the second decoder 402 is an edge detection decoder, and the edge detection decoder performs high-resolution fine adjustment processing on the edge feature data; and obtaining the human-shaped edge recognition result according to the edge prediction result output by the edge detection decoder.
The fusion model 500 is configured to fuse the human-shaped region recognition result in the low-resolution video and the human-shaped edge recognition result in the high-resolution video to obtain a human-shaped matting result in the initial video.
The video matting device provided by this embodiment can complete the human-shaped matting of the initial video while ensuring the human-shaped region recognition speed, the human-shaped edge recognition speed and the human-shaped edge recognition accuracy. Moreover, the scheme places no high requirements on the input video signal: only one camera input is needed, the requirement on the background image during human-shape recognition is low, and no support from a large number of training samples is needed, so the cost of both the hardware structure and the software analysis algorithm is low, which solves the problems of the video matting algorithms in the prior art.
The video matting apparatus in some embodiments may further include: and the resolution recovery model is used for increasing the resolution of the low-resolution feature data so that the final resolution of the low-resolution feature data is consistent with the resolution of the initial video. That is, after the extraction of the global features and the local features is completed, the resolution can be restored to the same degree as the resolution of the original video, so that the resolution of the finally decoded video frame has higher consistency with the resolution of the original image.
Some embodiments of the present application provide a readable storage medium, where program instructions are stored in the storage medium, and after reading the program instructions, a computer executes the video matting method described in any one of the above embodiments.
FIG. 4 is a schematic diagram of a hardware structure of a video matting system provided in this embodiment, where the system includes one or more processors 601 and a memory 602; one processor 601 is taken as an example in FIG. 4. The video matting system may further comprise: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or in other ways; FIG. 4 illustrates connection by a bus as an example.
The memory 602, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The processor 601 executes various functional applications and data processing of the server by running nonvolatile software programs, instructions and modules stored in the memory 602, namely, implements the video matting method of the above-described method embodiment. The system can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. It is intended that the present application also cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.

Claims (10)

1. A method of video matting, characterized by the steps of:
coding an initial video to obtain a video code stream and dividing the video code stream into three paths;
performing resolution reduction processing on the first path of video code stream to obtain a low-resolution code stream, and extracting global feature data in the low-resolution code stream; performing multi-scale pooling analysis on a second path of video code stream, and extracting local characteristic data in the second path of video code stream;
fusing the global feature data and the local feature data to obtain low-resolution feature data; decoding the low-resolution feature data and the local feature data to obtain a low-resolution video, wherein the low-resolution video comprises a human-shaped region identification result;
performing resolution raising processing on the third path of video code stream to obtain a high-resolution code stream; extracting edge characteristic data in the high-resolution code stream; decoding the edge feature data to obtain a high-resolution video, wherein the high-resolution video comprises a human-shaped edge identification result;
and fusing the human-shaped region recognition result in the low-resolution video and the human-shaped edge recognition result in the high-resolution video to obtain a human-shaped matting result in the initial video.
2. The method for video matting according to claim 1, wherein the step of fusing the global feature data and the local feature data to obtain low resolution feature data further comprises:
raising the resolution of the low resolution feature data such that the final resolution of the low resolution feature data corresponds to the resolution of the initial video.
3. The video matting method according to claim 2, wherein the step of extracting local feature data from the second path of video code stream after performing multi-scale pooling analysis on the second path of video code stream comprises:
performing at least five pooling scale analysis treatments on the second path of video code stream, wherein each pooling scale treatment comprises:
performing dimensionality reduction processing on the second path of video code stream to obtain dimensionality-reduced feature data required by the pooling scale, and performing convolution processing on the dimensionality-reduced feature image to obtain the convolved feature data;
and performing dimension increasing processing on the feature data after convolution to obtain pooled feature data with the same dimension as the second path of video code stream.
4. The method for video matting according to any one of claims 1 to 3, wherein decoding the low resolution feature data and the local feature data to obtain a low resolution video, the low resolution video including a result of identifying a human-shaped region is performed by:
performing low-resolution decoding processing on the low-resolution feature data and the local feature data by a trimap decoder;
and obtaining the human-shaped region recognition result according to the trimap prediction result output by the trimap decoder.
5. The method for video matting according to any one of claims 1 to 3, wherein the step of decoding the edge feature data to obtain a high-resolution video comprising a human-shaped edge recognition result comprises:
carrying out high-resolution fine adjustment processing on the edge characteristic data through an edge detection decoder;
and obtaining the human-shaped edge recognition result according to the edge prediction result output by the edge detection decoder.
6. A video matting apparatus, comprising:
the encoder is used for encoding the initial video to obtain a video code stream and dividing the video code stream into three paths;
the resolution reduction model is used for carrying out resolution reduction processing on the first path of video code stream to obtain a low-resolution code stream;
the global feature extraction model is used for extracting global feature data in the low-resolution code stream;
the local characteristic extraction model is used for performing multi-scale pooling analysis on the second path of video code stream and then extracting local characteristic data in the second path of video code stream;
the connector is used for fusing the global feature data and the local feature data to obtain low-resolution feature data;
the first decoder is used for decoding the low-resolution characteristic data and the local characteristic data to obtain a low-resolution video, and the low-resolution video comprises a human-shaped region identification result;
the resolution raising model is used for raising the resolution of the third path of video code stream to obtain a high-resolution code stream;
the edge feature extraction model is used for extracting edge feature data in the high-resolution code stream;
the second decoder is used for decoding the edge characteristic data to obtain a high-resolution video, and the high-resolution video comprises a human-shaped edge identification result;
and the fusion model is used for fusing the human-shaped region recognition result in the low-resolution video and the human-shaped edge recognition result in the high-resolution video to obtain a human-shaped matting result in the initial video.
7. The video matting device according to claim 6, further comprising:
and the resolution recovery model is used for increasing the resolution of the low-resolution feature data so that the final resolution of the low-resolution feature data is consistent with the resolution of the initial video.
8. The video matting device according to claim 7, characterized in that:
the first decoder is a trimap decoder which performs low-resolution decoding processing on the low-resolution feature data and the local feature data, and the human-shaped region recognition result is obtained according to the trimap prediction result output by the trimap decoder;
the second decoder is an edge detection decoder which carries out high-resolution fine adjustment processing on the edge characteristic data; and obtaining the human-shaped edge recognition result according to the edge prediction result output by the edge detection decoder.
9. A video matting system characterized by:
the system comprises at least one processor and at least one memory, wherein program instructions are stored in at least one of the memories, and the program instructions are read by at least one of the processors to execute the video matting method according to any one of claims 1 to 5.
10. A readable storage medium, characterized by:
the readable storage medium stores program information, and the computer reads the program information and executes the video matting method according to any one of claims 1 to 5.
CN202111348109.0A 2021-11-15 2021-11-15 Video matting method, device, system and storage medium Active CN114040140B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111348109.0A CN114040140B (en) 2021-11-15 2021-11-15 Video matting method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111348109.0A CN114040140B (en) 2021-11-15 2021-11-15 Video matting method, device, system and storage medium

Publications (2)

Publication Number Publication Date
CN114040140A true CN114040140A (en) 2022-02-11
CN114040140B CN114040140B (en) 2024-04-12

Family

ID=80144420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111348109.0A Active CN114040140B (en) 2021-11-15 2021-11-15 Video matting method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN114040140B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013075295A1 (en) * 2011-11-23 2013-05-30 浙江晨鹰科技有限公司 Clothing identification method and system for low-resolution video
CN104240192A (en) * 2013-07-04 2014-12-24 西南科技大学 Rapid single-image defogging algorithm
CN111311629A (en) * 2020-02-21 2020-06-19 京东方科技集团股份有限公司 Image processing method, image processing device and equipment
CN113538225A (en) * 2020-04-14 2021-10-22 阿里巴巴集团控股有限公司 Model training method, image conversion method, device, equipment and storage medium
WO2021208247A1 (en) * 2020-04-17 2021-10-21 北京大学 Mimic compression method and apparatus for video image, and storage medium and terminal
CN111683269A (en) * 2020-06-12 2020-09-18 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN111932594A (en) * 2020-09-18 2020-11-13 西安拙河安见信息科技有限公司 Billion pixel video alignment method and device based on optical flow and medium
CN112581567A (en) * 2020-12-25 2021-03-30 腾讯科技(深圳)有限公司 Image processing method, image processing device, electronic equipment and computer readable storage medium
CN113160358A (en) * 2021-05-21 2021-07-23 上海随幻智能科技有限公司 Non-green-curtain cutout rendering method
CN113284156A (en) * 2021-07-23 2021-08-20 杭州星犀科技有限公司 Real-time non-green-curtain matting method, electronic device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张军; 张敏; 郝小可; 解鹏: "Medium-resolution remote sensing scene classification algorithm based on multi-scale feature fusion", Journal of Hebei University (Natural Science Edition), no. 06 *
张红: "Research on sharpening and color correction algorithms for foggy images", China Excellent Master's Theses Electronic Journals Network *

Also Published As

Publication number Publication date
CN114040140B (en) 2024-04-12

Similar Documents

Publication Publication Date Title
Wang et al. Deep learning for image super-resolution: A survey
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
WO2022000426A1 (en) Method and system for segmenting moving target on basis of twin deep neural network
CN112927202B (en) Method and system for detecting Deepfake video with combination of multiple time domains and multiple characteristics
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN113538480A (en) Image segmentation processing method and device, computer equipment and storage medium
Yu et al. Memory-augmented non-local attention for video super-resolution
CN111369565A (en) Digital pathological image segmentation and classification method based on graph convolution network
CN111079764A (en) Low-illumination license plate image recognition method and device based on deep learning
CN112818955B (en) Image segmentation method, device, computer equipment and storage medium
CN112884657B (en) Face super-resolution reconstruction method and system
Xi et al. Implicit motion-compensated network for unsupervised video object segmentation
CN114494942A (en) Video classification method and device, storage medium and electronic equipment
Ren et al. A lightweight object detection network in low-light conditions based on depthwise separable pyramid network and attention mechanism on embedded platforms
CN111898638B (en) Image processing method, electronic device and medium fusing different visual tasks
CN114040140B (en) Video matting method, device, system and storage medium
CN116129417A (en) Digital instrument reading detection method based on low-quality image
CN115082966A (en) Pedestrian re-recognition model training method, pedestrian re-recognition method, device and equipment
CN115115972A (en) Video processing method, video processing apparatus, computer device, medium, and program product
CN114612316A (en) Method and device for removing rain from nuclear prediction network image
CN114399648A (en) Behavior recognition method and apparatus, storage medium, and electronic device
Kang et al. Lightweight Image Matting via Efficient Non-Local Guidance
Wondimu et al. Interactive Video Saliency Prediction: The Stacked-convLSTM Approach.
CN115471765B (en) Semantic segmentation method, device and equipment for aerial image and storage medium
CN117078564B (en) Intelligent generation method and system for video conference picture

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant