CN114640882B - Video processing method, video processing device, electronic equipment and computer readable storage medium - Google Patents

Video processing method, video processing device, electronic equipment and computer readable storage medium

Info

Publication number
CN114640882B
CN114640882B (application CN202011478292.1A / CN202011478292A)
Authority
CN
China
Prior art keywords: image, frame, information, enhanced, video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011478292.1A
Other languages
Chinese (zh)
Other versions
CN114640882A (en)
Inventor
周易
易阳
李昊沅
李峰
余晓铭
涂娟辉
左小祥
周泉
李新智
张永欣
万旭杰
杨超
刘磊
李博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202011478292.1A
Publication of CN114640882A
Application granted
Publication of CN114640882B
Legal status: Active
Anticipated expiration


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44016: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00: Image enhancement or restoration
    • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00: Image enhancement or restoration
    • G06T 5/70: Denoising; Smoothing
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234: Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N 21/23424: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/242: Synchronization processes, e.g. processing of PCR [Program Clock References]
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations; Client middleware
    • H04N 21/4302: Content synchronisation processes, e.g. decoder synchronisation
    • H04N 21/4307: Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10016: Video; Image sequence
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20212: Image combination
    • G06T 2207/20221: Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application provides a video processing method, a video processing apparatus, an electronic device, and a computer-readable storage medium, and relates to the application of cloud technology in the field of multimedia. The method includes: receiving a video code stream, where the video code stream includes multiple frames of images and enhanced image information synchronized with each frame of image; decoding the video code stream to obtain the frames of images and the enhanced image information synchronized with each frame; fusing each frame of image with the enhanced image information synchronized with it to obtain a composite image corresponding to each frame; and sequentially displaying the composite image corresponding to each frame of image in the human-computer interaction interface. According to the application, the original images of a video and other image information can be displayed in accurate synchronization, enriching the ways in which video can be presented.

Description

Video processing method, video processing device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of cloud technology and computer multimedia technology, and in particular, to a video processing method, a device, an electronic apparatus, and a computer readable storage medium.
Background
With the development of communication technology and the build-out of communication infrastructure, video transmission and playback based on cloud technology are widely used.
Taking live streaming as an example: during a live broadcast, the terminal device associated with the anchor continuously captures video data of the anchor and transmits it to a live-streaming platform in the cloud, and the platform then distributes the received video data to the different viewer terminals. To keep the video content from being monotonous, the live-streaming platform also needs to transmit, along with the video data, image information that is displayed additionally on the viewer side; however, because of the limitations of network transmission conditions, it is difficult to guarantee that the video content and this image information are displayed synchronously on the viewer side.
Disclosure of Invention
The embodiment of the application provides a video processing method, a video processing device, electronic equipment and a computer readable storage medium, which can accurately and synchronously display an original image of a video and other image information so as to enrich the display modes of the video.
The technical scheme of the embodiment of the application is realized as follows:
The embodiment of the application provides a video processing method, which comprises the following steps:
receiving a video code stream, where the video code stream includes multiple frames of images and enhanced image information synchronized with each frame of image;
decoding the video code stream to obtain the multiple frames of images and the enhanced image information synchronized with each frame of image;
fusing each frame of image with the enhanced image information synchronized with it, to obtain a composite image corresponding to each frame of image; and
sequentially displaying the composite image corresponding to each frame of image in the human-computer interaction interface.
In the above scheme, decoding the video code stream to obtain the multiple frames of images and the enhanced image information synchronized with each frame of image includes: reading each network abstraction layer (NAL) unit included in the video code stream and determining the type of each NAL unit; when the type of a read NAL unit is image slice, performing a slice decoding operation to obtain the frames of images; and when the type of a read NAL unit is supplemental enhancement information, performing a supplemental-enhancement-information decoding operation to obtain the enhanced image information synchronized with each frame of image.
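For illustration only, the following minimal sketch (not code from the application) splits an Annex B style H.264 byte stream into NAL units and routes each one by its nal_unit_type; decode_slice and parse_sei are placeholders for an actual slice decoder and SEI parser.

```python
import re

def iter_nal_units(bitstream: bytes):
    """Yield NAL unit bodies by scanning for Annex B start codes (0x000001 / 0x00000001)."""
    starts = list(re.finditer(b"\x00\x00\x01", bitstream))
    for i, m in enumerate(starts):
        end = starts[i + 1].start() if i + 1 < len(starts) else len(bitstream)
        # Strip trailing zero bytes (e.g. the leading zero of a following 4-byte start code).
        nal = bitstream[m.end():end].rstrip(b"\x00")
        if nal:
            yield nal

def dispatch(bitstream: bytes, decode_slice, parse_sei):
    """Route each NAL unit to the slice decoder or the SEI parser according to its type."""
    for nal in iter_nal_units(bitstream):
        nal_unit_type = nal[0] & 0x1F          # low 5 bits of the NAL header byte
        if nal_unit_type in (1, 5):            # coded slice (non-IDR / IDR)
            decode_slice(nal)                  # -> one video frame
        elif nal_unit_type == 6:               # supplemental enhancement information
            parse_sei(nal)                     # -> enhanced image info for that frame
```

In practice a player would hand the slice NAL units to a hardware or library decoder rather than decode them by hand; the point of the sketch is only the type-based routing that keeps each SEI payload attached to the frame it accompanies.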
An embodiment of the present application provides a video processing apparatus, including:
The receiving module is used for receiving a video code stream, wherein the video code stream comprises multi-frame images and enhanced image information synchronous with each frame of images;
The decoding processing module is used for decoding the video code stream to obtain the multi-frame images and the enhanced image information synchronous with each frame of image;
the fusion processing module is used for carrying out fusion processing on each frame of image and the enhanced image information synchronous with each frame of image to obtain a composite image corresponding to each frame of image;
and the display module is used for sequentially displaying the composite images corresponding to each frame of image in the human-computer interaction interface.
In the above aspect, when the enhanced image information is an image mask, the fusion processing module is further configured to perform the following processing for each frame of image: masking the image based on an image mask synchronized with the image to obtain a composite image corresponding to the image and with the background removed; wherein the image mask is generated by object recognition of the image.
In the above scheme, the receiving module is further configured to obtain a background image; the fusion processing module is further configured to merge the background-removed composite image with the background image; and the display module is further configured to display the merged image in the human-computer interaction interface.
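As an illustration of this fusion, the NumPy sketch below (an assumed implementation, not the application's own code) blends the background-removed frame over a replacement background by treating the mask values in [0, 1] as an alpha channel:

```python
import numpy as np

def composite_over_background(frame: np.ndarray, mask: np.ndarray,
                              background: np.ndarray) -> np.ndarray:
    """frame, background: HxWx3 uint8 images of equal size.
    mask: HxW float array in [0, 1]; 1 = foreground (person), 0 = background."""
    alpha = mask.astype(np.float32)[..., None]        # HxWx1, broadcasts over the 3 channels
    blended = frame * alpha + background * (1.0 - alpha)
    return blended.astype(np.uint8)
```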
In the above-described aspect, the size of the image mask synchronized with each frame of image is smaller than the size of the image; the apparatus further includes an amplifying processing module, configured to enlarge the enhanced image information synchronized with each frame of image, and a noise reduction processing module, configured to denoise the enhanced image information obtained after the enlargement.
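The enlargement and noise-reduction steps could look roughly like the sketch below; the bilinear resize and the 5x5 Gaussian kernel are arbitrary illustrative choices, and the application's actual post-processing (e.g. the saturation filter shown later in Fig. 13) may differ.

```python
import cv2
import numpy as np

def restore_mask(small_mask: np.ndarray, frame_width: int, frame_height: int) -> np.ndarray:
    """Enlarge a low-resolution transmitted mask back to frame size, then denoise it."""
    mask = cv2.resize(small_mask.astype(np.float32), (frame_width, frame_height),
                      interpolation=cv2.INTER_LINEAR)
    mask = cv2.GaussianBlur(mask, (5, 5), 0)   # suppress compression noise and jagged edges
    return np.clip(mask, 0.0, 1.0)
```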
In the above scheme, when the enhanced image information is the position of the key point of the object in the image, the fusion processing module is further configured to obtain a special effect matched with the key point; and adding the special effect at the position of the key point corresponding to the object in each frame of image to obtain a composite image corresponding to each frame of image, wherein the special effect is added to the composite image.
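One way to picture this special-effect step is the sketch below, which pastes a small RGBA sprite (the "special effect") centred on each key-point position carried in the enhanced image information; the sprite and the helper name are assumptions made for illustration, not details from the application.

```python
import numpy as np

def add_effect_at_keypoints(frame: np.ndarray, keypoints, sprite_rgba: np.ndarray) -> np.ndarray:
    """frame: HxWx3 uint8; keypoints: iterable of (x, y) pixel positions; sprite_rgba: hxwx4 uint8."""
    out = frame.copy()
    h, w = sprite_rgba.shape[:2]
    for x, y in keypoints:
        x0, y0 = int(x) - w // 2, int(y) - h // 2
        if x0 < 0 or y0 < 0 or x0 + w > out.shape[1] or y0 + h > out.shape[0]:
            continue                                   # skip sprites that would fall off-frame
        alpha = sprite_rgba[..., 3:4].astype(np.float32) / 255.0
        patch = out[y0:y0 + h, x0:x0 + w].astype(np.float32)
        out[y0:y0 + h, x0:x0 + w] = (sprite_rgba[..., :3] * alpha + patch * (1 - alpha)).astype(np.uint8)
    return out
```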
In the above aspect, when the enhanced image information includes a pose of an object in the image, the fusion processing module is further configured to perform at least one of the following processing for each frame of image: determining the states corresponding to each object in the image according to the gesture type, counting the states of all objects in the image, and adding a counting result into the image to obtain a composite image corresponding to the image and added with the counting result; and acquiring a special effect adapted to the gesture of the object in the image, and adding the special effect into the image to obtain a composite image corresponding to the image and added with the special effect.
In the above scheme, the decoding processing module is further configured to read each network abstraction layer unit included in the video code stream and determine the type of each network abstraction layer unit; when the type of a read network abstraction layer unit is image slice, to perform a slice decoding operation to obtain the frames of images; and when the type of a read network abstraction layer unit is supplemental enhancement information, to perform a supplemental-enhancement-information decoding operation to obtain the enhanced image information synchronized with each frame of image.
The embodiment of the application provides a video processing method, which comprises the following steps:
Acquiring a multi-frame image;
Performing computer vision processing on each frame of acquired image to obtain enhanced image information corresponding to each frame of image;
generating a video code stream comprising the each frame of image and enhanced image information corresponding to the each frame of image;
transmitting the video code stream;
The video code stream is used for being decoded by the terminal equipment to display a composite image corresponding to each frame of image, and the composite image is obtained by fusion processing of each frame of image and enhanced image information synchronous with each frame of image.
In the above aspect, generating a video code stream that includes each frame of image and the enhanced image information corresponding to each frame of image includes: encoding the image and the enhanced image information corresponding to the image, respectively, to obtain an encoded image and the corresponding encoded enhanced image information; encapsulating the encoded image into a network abstraction layer unit whose type is image slice, and encapsulating the encoded enhanced image information into a network abstraction layer unit whose type is supplemental enhancement information; and assembling the image-slice network abstraction layer units and the supplemental-enhancement-information network abstraction layer units into a video code stream.
An embodiment of the present application provides a video processing apparatus including:
The acquisition module is used for acquiring multi-frame images;
the computer vision processing module is used for performing computer vision processing on each acquired frame of image to obtain enhanced image information corresponding to each frame of image;
A generation module, configured to generate a video code stream including the each frame image and enhanced image information corresponding to the each frame image;
a transmitting module, configured to transmit the video code stream;
The video code stream is used for being decoded by the terminal equipment to display a composite image corresponding to each frame of image, and the composite image is obtained by fusion processing of each frame of image and enhanced image information synchronous with each frame of image.
In the above aspect, when the enhanced image information is an image mask, the computer vision processing module is further configured to perform the following processing for each frame of image: calling an image segmentation model based on the image to identify an object in the image, and generating an image mask corresponding to a background by taking an area outside the object as the background; the image segmentation model is obtained by training based on a sample image and an object marked in the sample image.
In the above aspect, the computer vision processing module is further configured to use the image as an input of the image segmentation model to determine an object in the image, and use a region outside the object as a background to generate an image mask corresponding to the background, where a size of the image mask is consistent with a size of the image; or the method is used for reducing the size of the image, the reduced image is used as an input of the image segmentation model to determine an object in the reduced image, a region outside the object is used as a background, and an image mask corresponding to the background is generated, wherein the size of the image mask is smaller than that of the image.
In the above aspect, the computer vision processing module is further configured to perform the following processing for each frame of image: invoking a keypoint detection model to detect keypoints of the image so as to obtain positions of the keypoints of the object in the image; determining the position of a key point of the object as enhanced image information corresponding to the image; the key point detection model is obtained by training based on a sample image and positions of key points of objects marked in the sample image.
In the above aspect, the computer vision processing module is further configured to perform the following processing for each frame of image: invoking a gesture detection model to perform gesture detection so as to obtain gesture information of an object in the image; and determining the gesture information of the object as enhanced image information corresponding to the image.
In the above scheme, the generation module is further configured to encode the image and the enhanced image information corresponding to the image, respectively, to obtain an encoded image and the corresponding encoded enhanced image information; to encapsulate the encoded image into a network abstraction layer unit whose type is image slice and the encoded enhanced image information into a network abstraction layer unit whose type is supplemental enhancement information; and to assemble the image-slice network abstraction layer units and the supplemental-enhancement-information network abstraction layer units into a video code stream.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
And the processor is used for realizing the video processing method provided by the embodiment of the application when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium which stores executable instructions for realizing the video processing method provided by the embodiment of the application when being executed by a processor.
The embodiment of the application has the following beneficial effects:
By integrating the enhanced image information synchronized with each frame of image into the video code stream, the receiving end can fuse each frame of image and the enhanced image information synchronized with each frame of image after receiving the video code stream so as to obtain a composite image corresponding to each frame of image, thereby realizing accurate synchronous display between the original image and other image information of the video and enriching the display modes of the video.
Drawings
Fig. 1 is a schematic diagram of a picture-in-picture provided by an embodiment of the present application;
Fig. 2 is a schematic architecture diagram of a video processing system 100 according to an embodiment of the present application;
Fig. 3A is a schematic structural diagram of a terminal device 300 according to an embodiment of the present application;
Fig. 3B is a schematic structural diagram of a terminal device 400 according to an embodiment of the present application;
Fig. 4 is a schematic flow chart of a video processing method according to an embodiment of the present application;
Fig. 5 is a schematic diagram of the layered structure of an H.264 code stream according to an embodiment of the present application;
Fig. 6 is a schematic flow chart of a video processing method according to an embodiment of the present application;
Fig. 7 is a schematic flow chart of decoding processing for NAL units according to an embodiment of the present application;
Fig. 8 is a schematic diagram of an application scenario of a video processing method according to an embodiment of the present application;
Fig. 9 is a schematic diagram of an application scenario of a video processing method according to an embodiment of the present application;
Fig. 10 is a flow chart of the SEI information production phase provided by an embodiment of the present application;
Fig. 11 is a flow chart of the SEI information consumption phase provided by an embodiment of the present application;
Fig. 12 is a schematic flow chart of post-processing of the mask map obtained after decoding at the receiving end according to an embodiment of the present application;
Fig. 13 is a schematic diagram of a saturation filter provided by an embodiment of the present application;
Fig. 14 is a schematic diagram of a picture-in-picture interface provided by an embodiment of the present application;
Fig. 15 is a schematic diagram of the pages corresponding to each stage of the video processing method according to the embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without creative effort fall within the scope of protection of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms involved in the embodiments of the present application are explained; the explanations below apply to these terms as used herein.
1) Picture-in-picture (PiP): a technique of displaying one image on top of another. For example, referring to Fig. 1, a schematic diagram of picture-in-picture provided by an embodiment of the present application, a small picture 102 is added to a large picture 101; that is, the small picture 102 plays over the large picture 101 in a floating window, and the small picture 102 can be moved and scaled arbitrarily.
2) Supplemental Enhancement Information (SEI): a term from the field of video transmission. SEI provides a way of adding extra information to a video code stream and is one of the features of video compression standards such as H.264 and H.265. SEI has the following basic characteristics: it is not a mandatory part of the decoding process; it may assist the decoding process; and it is integrated in the video code stream.
For example, referring to table 1, table 1 is a schematic table of h.264/AVC field provided by an embodiment of the present application, as shown in table 1, the network abstraction layer unit type (Network Abstraction Layer Unit Type) of SEI in h.264/AVC standard is 6.
TABLE 1. H.264/AVC NAL unit types

| NAL unit type | NAL unit content |
| --- | --- |
| 1 | Coded slice of a non-IDR picture, without data partitioning |
| 5 | Coded slice of an IDR picture |
| 6 | Supplemental Enhancement Information (SEI) |
| 7 | Sequence Parameter Set (SPS) |
| 8 | Picture Parameter Set (PPS) |
| 11 | End of stream |
3) Portrait segmentation: a term from the field of computer vision, referring to a technique that segments the portrait (person) region out of an image; the input is an image containing a person and the output is a portrait mask. In general, the directly output mask needs post-processing before it can be used as the Alpha channel of the image. To save computation, the image is usually reduced first and the portrait mask is computed on the reduced image; in the post-processing stage, the portrait mask is then enlarged back to the original size.
4) Face key-point detection: a term from the field of computer vision, also called face key-point localization or face alignment; given a face image, it refers to locating the key regions of the face, including the eyebrows, eyes, nose, mouth, facial contour, and so on. Face key-point detection methods fall roughly into three categories: Active Shape Models (ASM); Cascaded Pose Regression (CPR); and deep-learning-based methods.
5) Human pose estimation: a term from the field of computer vision, covering single-person pose estimation, multi-person pose estimation, and three-dimensional human pose estimation. In practice, human pose estimation is usually converted into the problem of predicting human key points: the position coordinates of each key point of the body are predicted first, and the spatial relations between the key points are then determined from prior knowledge, yielding the predicted human skeleton.
With the development of communication technology and the promotion of communication infrastructure construction, real-time video communication technology is widely used. In some event scenes of real-time video, it is desirable to transmit not only video but also additional information (also referred to as enhanced image information), wherein the additional information may include a portrait mask (hereinafter also referred to as a mask map) corresponding to each frame of image, a face key point corresponding to each frame of image, or human body posture information corresponding to each frame of image, and the like.
However, the scheme provided by the related art has the following problems when additional information is simultaneously transmitted:
1) Difficult to achieve strict synchronization
When additional information is transmitted, it should correspond strictly to the video frames; taking the portrait mask as an example, the portrait mask of the i-th frame should correspond strictly to the i-th frame image. If separate transmission channels are established for the video frames and the portrait masks and the video and masks are transmitted independently, the limitations of network transmission conditions will make it difficult for the receiving end to maintain a strict correspondence between portrait masks and video frames.
2) Bandwidth increase
Since additional information (e.g., a portrait mask) needs to be transmitted, the bandwidth occupied by the transmission will increase dramatically, for example, when a portrait mask of the same size as a video frame is transmitted, the amount of data that needs to be transmitted will increase by about 30%, i.e., it is equivalent to adding one channel to a Red Green Blue (RGB) image.
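A back-of-the-envelope calculation (not figures from the application) shows where the roughly 30% comes from, and why the downscaled masks discussed later are so much cheaper:

```python
w, h = 1920, 1080
rgb_frame = w * h * 3        # raw RGB samples per 1080p frame: 6,220,800
full_mask = w * h            # a full-resolution mask adds one more channel
small_mask = 100 * 100       # a 100x100 mask, as in the reduction example later on

print(full_mask / rgb_frame)   # 0.333... -> roughly a 30% increase in raw data
print(small_mask / rgb_frame)  # 0.0016  -> well under 1% of the frame's raw data
```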
3) Loss is difficult to recover
In order to reduce the bandwidth pressure, additional information (e.g., a portrait mask) needs to be compressed, and the compression introduces noise into the original signal (i.e., the portrait mask), which affects the display effect of subsequent images if the portrait mask containing noise is directly used to mask the video frame image.
In view of the above technical problems, embodiments of the present application provide a video processing method, apparatus, electronic device, and computer readable storage medium, which can accurately and synchronously display an original image of a video and other image information, so as to enrich the display modes of the video. The following describes an exemplary application of the video processing method provided by the embodiment of the present application, where the video processing method provided by the embodiment of the present application may be implemented by various electronic devices, for example, may be implemented by a terminal device alone, or may be implemented by a server and the terminal device in cooperation.
The video processing method provided by the embodiment of the application is taken as an example for cooperatively implementing the server and the terminal equipment. Referring to fig. 2, fig. 2 is a schematic architecture diagram of a video processing system 100 according to an embodiment of the present application, so as to accurately and synchronously display an original image of a video and other image information. Wherein the video processing system 100 comprises: the server 200, the terminal device 300, and the terminal device 400 will be described below, respectively.
The server 200 is a background server of the client 310 and the client 410, and is configured to receive a video code stream sent by the terminal device 300 associated with the sender, and push the received video code stream to the terminal device 400 associated with the receiver. By way of example, when clients 310 and 410 are live clients, server 200 may be a background server of a live platform; when clients 310 and 410 are videoconference clients, server 200 may be a background server that provides services for videoconferences; when the client 310 and the client 410 are instant messaging clients, the server 200 may be a background server of the instant messaging clients.
The terminal device 300 is the terminal device associated with (i.e., used by) the sender; for live streaming, it may be the terminal device associated with the anchor, and for a video conference, it may be the terminal device associated with the conference moderator. A client 310 runs on the terminal device 300 and may be any of various types of client, such as a live-streaming client, a video-conference client, or an instant-messaging client. The terminal device 300 is configured to acquire multiple frames of images (for example, by invoking its camera to capture the moderator or the anchor) and to use its own computing capability to perform computer vision processing on each acquired frame of image, obtaining the (synchronized) enhanced image information corresponding to each frame. The terminal device 300 then encodes the acquired frames and the enhanced image information corresponding to each frame to generate a video code stream containing each frame of image and its corresponding enhanced image information (the generation of the video code stream is described in detail below). Finally, the terminal device 300 sends the generated video code stream to the server 200.
The terminal device 400 is a terminal device with which a receiving party is associated, for example, for live broadcast, the terminal device 400 may be a terminal device with which a viewer is associated; for a video conference, the terminal device 400 may be a terminal device associated with a participant of the video conference. The terminal device 400 has a client 410 running thereon, and the client 410 may be various types of clients, for example, a live client, a video conference client, an instant messaging client, and the like. The terminal device 400 is configured to receive a video code stream issued by the server 200, and decode the received video code stream to obtain multiple frame images and enhanced image information synchronized with each frame image; then, the terminal device 400 performs fusion processing on each frame of image and the enhanced image information synchronized with each frame of image to obtain a composite image corresponding to each frame of image; subsequently, the terminal device 400 invokes the man-machine interaction interface of the client 410 to sequentially display the composite image corresponding to each frame of image.
In some embodiments, the server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The terminal device 300 and the terminal device 400 may be, but are not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal device 300 and the server 200, and the server 200 and the terminal device 400 may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
In other embodiments, the video bitstream may also be generated by the server 200. For example, the terminal device 300 transmits only the acquired multi-frame image to the server 200 to cause the server 200 to perform computer vision processing on each received frame image to obtain enhanced image information corresponding to each frame image, then the server 200 generates a video code stream including each frame image and the enhanced image information corresponding to each frame image, and then the server 200 pushes the generated video code stream to the terminal device 400 associated with the receiving side.
It should be noted that the above embodiments are described by taking two users (for example, a host and a spectator, or a conference host and a participant) as an example, and in practical applications, the number of users may be three users (for example, a host and two spectators), or more users (for example, a host and a plurality of spectators), and the embodiments of the present application are not limited in detail herein. In addition, the client 310 and the client 410 may be the same type of client, for example, the client 310 and the client 410 may be video conference clients having the same function, that is, the client 310 and the client 410 each have a function of initiating a video conference and participating in the video conference; of course, the client 310 and the client 410 may be different types of clients, for example, the client 310 is a live client of a live broadcast end, and can collect a live broadcast and push collected video data to a background server of a live broadcast platform; while the client 410 is a live client of the viewer, and has only functions of playing live content and transmitting a bullet screen.
The structure of the terminal device 300 in fig. 2 is explained below. Referring to fig. 3A, fig. 3A is a schematic structural diagram of a terminal device 300 according to an embodiment of the present application, and the terminal device 300 shown in fig. 3A includes: at least one processor 360, a memory 350, at least one network interface 320, and a user interface 330. The various components in terminal device 300 are coupled together by bus system 340. It is understood that the bus system 340 is used to enable connected communications between these components. The bus system 340 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled as bus system 340 in fig. 3A.
The processor 360 may be an integrated circuit chip with signal-processing capability, such as a general-purpose processor (for example, a microprocessor or any conventional processor), a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like.
The user interface 330 includes one or more output devices 331 that enable presentation of media content, including one or more speakers and/or one or more visual displays. The user interface 330 also includes one or more input devices 332, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
Memory 350 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 350 optionally includes one or more storage devices physically located remote from processor 360.
Memory 350 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The non-volatile Memory may be a Read Only Memory (ROM) and the volatile Memory may be a random access Memory (RAM, random Access Memory). The memory 350 described in embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 350 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
The operating system 351, which includes system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., is used to implement various basic services and process hardware-based tasks.
Network communication module 352, for reaching other computing devices via one or more (wired or wireless) network interfaces 320; exemplary network interfaces 320 include Bluetooth, wireless fidelity (Wi-Fi), universal serial bus (USB), and the like.
A presentation module 353 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 331 (e.g., a display screen, speakers, etc.) associated with the user interface 330.
An input processing module 354 for detecting one or more user inputs or interactions from one of the one or more input devices 332 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of the present application may be implemented in software. Fig. 3A shows a video processing apparatus 355 stored in the memory 350, which may be software in the form of a program, a plug-in, or the like, and includes the following software modules: an acquisition module 3551, a computer vision processing module 3552, a generation module 3553, and a transmission module 3554. These modules are logical, and may therefore be combined arbitrarily or split further according to the functions they implement. The functions of the individual modules are described below.
In other embodiments, the apparatus provided by the embodiments of the present application may be implemented in hardware. By way of example, the apparatus provided by the embodiments of the present application may be a processor in the form of a hardware decoding processor that is programmed to perform the video processing method provided by the embodiments of the present application; for example, such a processor may employ one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components.
The structure of the terminal apparatus 400 in fig. 2 will be described further. Referring to fig. 3B, fig. 3B is a schematic structural diagram of a terminal device 400 according to an embodiment of the present application. As shown in fig. 3B, the terminal device 400 includes: a memory 450 for storing executable instructions; the processor 460 is configured to implement the video processing method provided by the embodiment of the present application when processing the executable instructions stored in the memory 450. Further, the video processing device 455 stored in the memory 450 may be software in the form of a program, a plug-in, or the like, including the following software modules: the receiving module 4551, the decoding processing module 4552, the fusion processing module 4553, the display module 4554, the amplifying processing module 4555 and the noise reduction processing module 4556 are logical, and thus may be arbitrarily combined or further split according to the implemented functions. In addition, the terminal device 400 further includes a network interface 420, a user interface 430 (including an output device 431 and an input device 432), a bus system 440, and an operating system 451, a network communication module 452, a presentation module 453, and an input processing module 454 also stored in the memory 450, where functions of the above components are similar to those of the corresponding components in fig. 3A, and the description of fig. 3A may be referred to, and the embodiments of the present application are not repeated herein.
The video processing method provided by the embodiment of the present application will be described below in connection with exemplary applications and implementations of the terminal device provided by the embodiment of the present application. It will be appreciated that the steps performed by the terminal device may in particular be performed by a client running on the terminal device.
The video processing method provided by the embodiment of the application mainly comprises a video code stream generation stage and a video code stream consumption stage, and the video code stream generation stage is described below.
Referring to fig. 4, fig. 4 is a flowchart of a video processing method according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 4.
It should be noted that, the execution subject of steps S101 to S104 shown in fig. 4 may be the terminal device 300 associated with the sender in fig. 2, and for convenience of description, the terminal device 300 associated with the sender will be referred to as a first terminal device, and the terminal device 400 associated with the receiver will be referred to as a second terminal device.
In step S101, a plurality of frame images are acquired.
In some embodiments, the target object may be acquired by invoking a camera of the first terminal device to obtain a multi-frame image. For example, taking a video conference as an example, the first terminal device may be a terminal device associated with a conference initiator, and the conference initiator is collected by calling a camera of the first terminal device, so as to obtain a multi-frame image including the conference initiator; for example, for a live scene, the first terminal device may be a terminal device associated with the anchor, and the anchor is acquired by calling a camera of the first terminal device to obtain a multi-frame image including the anchor.
In step S102, computer vision processing is performed on each acquired image frame to obtain enhanced image information corresponding to each image frame.
In some embodiments, the first terminal device may perform the above-described computer vision processing on each frame of the acquired image by using the following manner to obtain enhanced image information corresponding to each frame of the image: the following processing is performed for each frame image: calling an image segmentation model based on the image to identify an object in the image, taking an area outside the object as a background, generating an image mask corresponding to the background, and taking the generated image mask as enhanced image information synchronous with the image; the image segmentation model is obtained by training based on a sample image and an object marked in the sample image.
For example, the image segmentation model may be trained using a supervised training approach, where the image segmentation model may be various types of neural network models, including deep convolutional neural network models, fully connected neural network models, etc., and the loss function may be any form of loss function used to characterize the difference between the predicted object position and the annotated object position, such as mean square error loss functions (MSE, mean Squared Error), hinge loss functions (HLF, hinge Loss Function), cross entropy loss functions (Cross Entropy), etc.
Taking a live-streaming scenario as an example, after the camera of the first terminal device is invoked to capture the anchor and multiple frames of images containing the anchor are obtained, the following processing is performed for any captured frame: the frame is input into a pre-trained portrait segmentation model, which performs pixel-level segmentation, that is, it decides for each pixel whether it belongs to the portrait and assigns it to the portrait region if it does. Based on the segmentation result, the portrait region where the anchor is located in the frame is determined, the region outside that portrait region is taken as the background region, and a portrait mask corresponding to the background region is generated (the mask can be used to remove the background region of the image and keep only the portrait). A masking operation is then performed on the image based on the portrait mask, i.e., the value of each pixel in the image is recomputed according to the mask. For example, the portrait mask may be a mask map whose values are 1 inside the portrait region and 0 outside it; after the masking operation is performed on the image based on this mask, the pixel values outside the portrait region are reset to 0, so that only the portrait remains in the image. The portrait segmentation model is trained on sample images and the portraits annotated in the sample images. It should also be noted that the portrait mask output by the portrait segmentation model may be a two-dimensional array whose values lie in the range [0, 1], i.e., the mask consists of decimals between 0 and 1.
In other embodiments, in order to reduce the bandwidth required for transmitting the image mask, the acquired frames may first be reduced in size. For example, if the resolution of the acquired image is 1080p (i.e., 1920×1080), the image may first be reduced to 100×100 and the reduced image then input into the image segmentation model, yielding an image mask smaller than the original image, for example only 100×100. Because the transmitted mask is then far smaller than the original image, the bandwidth pressure caused by transmitting the image mask is greatly reduced.
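A minimal sketch of that reduction step is shown below; seg_model is a stand-in for whichever portrait-segmentation network is actually deployed, and quantising the mask to 8 bits before it is encoded into the code stream is an assumption, not a requirement stated above.

```python
import cv2
import numpy as np

def small_mask_for(frame_bgr: np.ndarray, seg_model, mask_size=(100, 100)) -> np.ndarray:
    """Run segmentation on a shrunken copy of the frame and return a compact 8-bit mask."""
    small = cv2.resize(frame_bgr, mask_size, interpolation=cv2.INTER_AREA)
    mask = seg_model(small)                      # expected: HxW float array in [0, 1]
    return (np.clip(mask, 0.0, 1.0) * 255).astype(np.uint8)   # 10,000 bytes per frame
```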
In other embodiments, the first terminal device may further perform the above-mentioned computer vision processing on each frame of the acquired image by using the following manner to obtain enhanced image information corresponding to each frame of the image: the following processing is performed for each frame image: invoking a key point detection model to detect key points of the image so as to obtain the positions of the key points of the objects in the image; determining the position of the key point of the obtained object as enhanced image information corresponding to the image; the key point detection model is trained based on the sample image and the positions of the key points of the objects included in the sample image.
Taking a video conference as an example, after the camera of the first terminal device is invoked to capture the conference initiator and multiple frames of images containing the conference initiator are obtained, the following processing is performed for any captured frame: the frame is input into a pre-trained face key-point detection model, which determines the positions of the key points of the conference initiator in the image, such as the positions of the nose, the eyebrows, and the mouth; the determined key-point positions of the conference initiator are then taken as the enhanced image information synchronized with that frame. The face key-point detection model is trained on sample images and the annotated positions of the key points of the portraits in the sample images. For example, the face key-point detection model may be trained in a supervised manner; it may be any of various types of neural network model, including deep convolutional neural networks and fully connected neural networks, and the loss function may be any loss function characterizing the difference between the predicted and the annotated face key-point positions, such as MSE or HLF.
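Because the key-point positions travel inside the code stream rather than as images, they need a compact serialisation; the fixed little-endian layout below is purely an assumed format for illustration (the application does not prescribe one).

```python
import struct

def pack_keypoints(keypoints) -> bytes:
    """Serialize (x, y) pixel positions into a small payload, e.g. for an SEI field."""
    payload = struct.pack("<H", len(keypoints))          # 16-bit count
    for x, y in keypoints:
        payload += struct.pack("<HH", int(x), int(y))    # 16-bit x, 16-bit y
    return payload

def unpack_keypoints(payload: bytes):
    """Inverse of pack_keypoints, used on the receiving side."""
    (count,) = struct.unpack_from("<H", payload, 0)
    return [struct.unpack_from("<HH", payload, 2 + 4 * i) for i in range(count)]
```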
In some embodiments, the first terminal device may also perform the above-mentioned computer vision processing on each frame of the acquired image by using the following manner to obtain enhanced image information corresponding to each frame of the image: the following processing is performed for each frame image: invoking a gesture detection model to perform gesture detection on the object in the image so as to obtain gesture information of the object in the image; the obtained pose information of the object is determined as enhanced image information corresponding to the image.
Taking a live-streaming scenario as an example, after the camera of the first terminal device is invoked to capture the anchor and multiple frames of images containing the anchor are obtained, the following processing is performed for any captured frame: the frame is input into a pre-trained human pose detection model for pose detection, yielding the pose information of the anchor in that frame; that is, the pose detection model extracts the anchor's skeleton from the frame and connects the extracted skeleton points to obtain the anchor's pose, and the determined pose information is then taken as the enhanced image information synchronized with that frame.
In step S103, a video code stream including each frame image and enhanced image information corresponding to each frame image is generated.
In order to facilitate understanding of the video processing method provided by the embodiment of the present application, before describing a process of generating a video code stream including each frame of image and enhanced image information corresponding to each frame of image, a description will be given first of all of the structure of the video code stream.
By way of example, taking h.264 as an example, the functionality of h.264 is divided into two layers: a video coding layer (VCL, video Coding Layer) and a network abstraction layer (NAL, network Abstraction Layer), wherein VCL data is the output of the coding process, which represents a compressed coded sequence of video data. These encoded VCL data are mapped or encapsulated into NAL units prior to VCL data transmission or storage. Each NAL unit includes an original byte sequence payload (RBSP, raw Byte Sequence Payload), a set of NAL header information corresponding to a video bitstream. The basic structure of RBSP is: end bits are appended after the original encoded data for byte alignment. That is, the h.264 bitstream is composed of NAL units one by one, wherein each NAL unit includes NAL header information and RBSPs.
The overall structure of the h.264 code stream is described below. For example, referring to fig. 5, fig. 5 is a schematic diagram of an h.264 code stream layered structure provided by an embodiment of the present application. As shown in fig. 5, the overall structure of the h.264 code stream may be divided into six layers, which are described below.
Layer one is the packaging format of the H.264 code stream, which includes the byte-stream (Annex B) format and the Real-time Transport Protocol (RTP) format. The Annex B format is the default output format of most encoders and, since it is not encapsulated by a transport protocol, is also referred to as a bare stream; the RTP format is a data format for network transmission.
Layer two is the NAL unit; each NAL unit comprises a NAL unit header and a NAL unit body corresponding to the encoded video. NAL unit header types are varied and include delimiters, padding, supplemental enhancement information (SEI), sequence parameter sets (SPS), picture parameter sets (PPS), and so on.
Layer three is Slice (Slice), the Slice is mainly used as a carrier of macro block, and each Slice includes Slice header and Slice data, wherein Slice header contains information such as Slice type, macro block type in Slice, number of Slice frames, which image the Slice belongs to, and setting and parameter of corresponding frame. Slice data is then a macroblock, i.e. where pixel data for an image is stored.
It should be noted that a slice is not the same concept as a frame: a frame describes one picture, and one frame corresponds to one picture, whereas a slice is a concept newly introduced in H.264, produced by splitting an encoded picture into slices so that it can be carried efficiently. A picture contains at least one slice, and possibly several; slices are carried and transmitted in NAL units, and NAL units may also carry other information describing the video.
Layer four is the slice data, i.e., the macroblocks. A macroblock is the primary carrier of video information and contains the luminance and chrominance information of each pixel; the main task of video decoding is to provide an efficient way to obtain the pixel arrays in macroblocks from the video bitstream. In addition, it should be noted that the slice type corresponds to the macroblock type: for example, an I slice contains only I macroblocks, and an I macroblock performs intra prediction using decoded pixels in the current slice as references; a P slice may contain P macroblocks and I macroblocks, and a P macroblock performs inter prediction using a previously encoded image as a reference image.
Layer five is pulse code modulation (PCM, Pulse Code Modulation), which represents the manner in which the original pixel values stored in a macroblock are encoded; for example, when the macroblock type (mb_type) is the I_PCM mode, the original pixel values of the macroblock are stored directly, without prediction or transform coding.
Layer six is a Residual (Residual) for storing Residual data.
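To make the Annexb format of layer one concrete, the following is a minimal Python sketch that splits a raw byte stream into NAL units by scanning for the 3-byte and 4-byte start codes; the function name and structure are illustrative assumptions, not part of the embodiment.

```python
def split_annexb_nal_units(data: bytes):
    """Yield NAL units (header byte + RBSP) from an Annex B byte stream."""
    positions = []                     # (start_code_offset, payload_offset)
    i, n = 0, len(data)
    while i + 3 <= n:
        if data[i:i + 3] == b"\x00\x00\x01":
            positions.append((i, i + 3))
            i += 3
        elif i + 4 <= n and data[i:i + 4] == b"\x00\x00\x00\x01":
            positions.append((i, i + 4))
            i += 4
        else:
            i += 1
    for k, (_, payload) in enumerate(positions):
        # a NAL unit ends where the next start code begins (or at end of stream)
        end = positions[k + 1][0] if k + 1 < len(positions) else n
        yield data[payload:end]
```

Each yielded chunk begins with the one-byte NAL header (forbidden_zero_bit, nal_ref_idc, nal_unit_type) followed by the RBSP.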
In some embodiments, after the first terminal device obtains the enhanced image information corresponding to each frame of image, the above-described generation of the video code stream including each frame of image and the enhanced image information corresponding to each frame of image may be implemented as follows: each frame of image and the enhanced image information corresponding to each frame of image are encoded respectively, so as to obtain multi-frame image codes and the enhanced image information code corresponding to each frame of image code; the multi-frame image codes are respectively encapsulated into a plurality of network abstraction layer units whose type is image slice in one-to-one correspondence, and the multi-frame enhanced image information codes are respectively encapsulated into a plurality of network abstraction layer units whose type is supplemental enhancement information in one-to-one correspondence; the plurality of network abstraction layer units whose type is image slice and the plurality of network abstraction layer units whose type is supplemental enhancement information are assembled into a video bitstream.
Taking the enhanced image information being an image mask as an example, after obtaining the image mask corresponding to each frame of image, the first terminal device respectively encodes the multiple frames of images and the image mask synchronized with each frame of image, so as to obtain multi-frame coded images and the coded image mask corresponding to each frame of coded image, and then encapsulates the multi-frame coded images into NAL units of the type image slice, where NAL units of the type image slice include: slices of a non-partitioned, non-instantaneous decoding refresh (IDR, Instantaneous Decoding Refresh) image, slice partition A, slice partition B, slice partition C, and slices in an IDR image. In addition, the first terminal device may further encapsulate the coded image mask corresponding to each frame of coded image into NAL units of the type SEI in sequence, so as to obtain the video bitstream to be transmitted based on the multiple NAL units of the type image slice and the multiple NAL units of the type SEI.
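As an illustration of how a per-frame mask encoding might be placed into an SEI NAL unit, the sketch below builds a user_data_unregistered SEI message (payload type 5) in Annex B format. The 16-byte identifier, the function names, and the omission of any container-level handling are assumptions made for illustration only; they are not specified by the embodiment.

```python
# Hypothetical 16-byte UUID marking "enhanced image information" payloads.
MASK_SEI_UUID = b"EXAMPLE-MASK-ID!"          # exactly 16 bytes, placeholder value

def add_emulation_prevention(rbsp: bytes) -> bytes:
    """Insert 0x03 after any pair of zero bytes so the payload cannot mimic a start code."""
    out, zeros = bytearray(), 0
    for b in rbsp:
        if zeros >= 2 and b <= 3:
            out.append(3)
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)

def build_user_data_sei(payload: bytes) -> bytes:
    """Wrap arbitrary bytes in a user_data_unregistered SEI NAL unit (Annex B)."""
    body = MASK_SEI_UUID + payload
    sei = bytearray([0x05])                   # payload_type 5: user_data_unregistered
    size = len(body)
    while size >= 255:                        # payload_size uses 0xFF continuation bytes
        sei.append(0xFF)
        size -= 255
    sei.append(size)
    sei += body
    sei.append(0x80)                          # rbsp_trailing_bits
    # 0x06 = NAL header with nal_unit_type 6 (SEI)
    return b"\x00\x00\x00\x01" + bytes([0x06]) + add_emulation_prevention(bytes(sei))
```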
It should be noted that, in addition to the above generation of the SEI during video encoding, the generation method of the SEI may also filter an existing bitstream and insert SEI field information, or insert the SEI during writing in a container layer, which is not specifically limited in the embodiment of the present application.
In step S104, a video code stream is transmitted.
In some embodiments, after the first terminal device generates the video code stream, it may send the generated video code stream to the server, so that the server pushes the received video code stream to the second terminal device.
Taking a live broadcast scene as an example, after the terminal device associated with the host (i.e., the first terminal device) generates the video code stream, it pushes the generated video code stream to the background server of the live broadcast platform, so that after receiving the video code stream, the background server of the live broadcast platform pushes it to the terminal device associated with the audience (i.e., the second terminal device).
The consumption phase of the video stream is described below. Referring to fig. 6, for example, fig. 6 is a schematic flow chart of a video processing method according to an embodiment of the present application, and the steps shown in fig. 6 will be described.
It should be noted that the execution subject of steps S201 to S204 shown in fig. 6 may be the terminal device 400 in fig. 2, that is, the terminal device associated with the receiving party, which receives the video code stream and decodes and plays it. For convenience of description, the terminal device associated with the receiving party (that is, the terminal device 400 shown in fig. 2) is referred to below as the second terminal device.
In step S201, a video code stream is received.
In some embodiments, after the first terminal device generates the video code stream, the first terminal device sends the generated video code stream to the server, so that the server pushes the received video code stream to the second terminal device after receiving the video code stream sent by the first terminal device, wherein the video code stream received by the second terminal device includes multiple frame images and enhanced image information synchronized with each frame image.
For example, taking a live broadcast scenario as an example, the first terminal device may be a terminal device associated with a host, after generating a video code stream including multiple frame images and enhanced image information synchronized with each frame image, the first terminal device sends the generated video code stream to a background server of the live broadcast platform, so that the background server of the live broadcast platform sends the received video code stream to a terminal device (i.e., a second terminal device) associated with a viewer after receiving the video code stream sent by the first terminal device, so that the second terminal device decodes and plays the received video code stream.
For example, taking a video conference as an example, the first terminal device may be a terminal device associated with an initiator of the conference, after generating a video code stream including multiple frame images and enhanced image information synchronized with each frame image, the first terminal device sends the generated video code stream to a background server of the video conference, so that after receiving the video code stream sent by the first terminal device, the background server of the video conference pushes the received video code stream to a terminal device (i.e. a second terminal device) associated with a participant object, so that the second terminal device decodes and plays the received video code stream.
In addition, in order to facilitate understanding of the video processing method provided by the embodiment of the present application, before explaining the decoding process of the subsequent video code stream, the process of decoding a plurality of NAL units included in the video code stream by the receiving end is first explained.
For example, referring to fig. 7, fig. 7 is a schematic flow chart of decoding processing for a NAL unit according to an embodiment of the present application, as shown in fig. 7, after receiving a video bitstream, a receiving end first reads the NAL unit from the video bitstream, then extracts an RBSP syntax structure from the read NAL unit, and then performs a corresponding decoding process according to the type of the NAL unit. For example, when the receiving end determines that the type of the NAL unit is 6 (i.e., sei=6), performing an SEI decoding process to obtain additional information based on SEI transmission; when the receiving end judges that the type of the NAL unit is 7 (namely SPS=7), performing an SPS decoding process to obtain sequence parameters; when the receiving end determines that the type of the NAL unit is 5 (i.e., idr=5), it enters a slice decoding process to obtain a decoded image.
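The dispatch on the NAL unit type described above can be expressed compactly. The following sketch assumes the standard H.264 nal_unit_type values (5 for IDR slices, 6 for SEI, 7 for SPS) and is only meant to mirror the flow of fig. 7, not the actual decoder of the embodiment.

```python
NAL_SLICE_IDR, NAL_SEI, NAL_SPS = 5, 6, 7     # H.264 nal_unit_type values

def dispatch_nal_unit(nal: bytes):
    """Route a NAL unit to the matching decoding step based on its type field."""
    nal_unit_type = nal[0] & 0x1F             # low 5 bits of the NAL header byte
    if nal_unit_type == NAL_SEI:
        return ("sei", nal[1:])               # additional information carried by SEI
    if nal_unit_type == NAL_SPS:
        return ("sps", nal[1:])               # sequence parameters
    if nal_unit_type in (1, NAL_SLICE_IDR):   # non-IDR or IDR coded slice
        return ("slice", nal[1:])             # slice decoding -> decoded image
    return ("other", nal[1:])
```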
In step S202, the video stream is decoded to obtain a plurality of frame images and enhanced image information synchronized with each frame image.
In some embodiments, the second terminal device may decode the video code stream in the following manner to obtain the multiple frames of images and the enhanced image information synchronized with each frame of image: reading each network abstraction layer unit included in the video code stream, and determining the type of each network abstraction layer unit; when the type of the read network abstraction layer unit is an image slice, performing a slice segmentation decoding operation to obtain the multiple frames of images; when the type of the read network abstraction layer unit is supplemental enhancement information, performing a supplemental enhancement information decoding operation to obtain the enhanced image information synchronized with each frame of image.
Taking enhanced image information as an image mask for example, after receiving a video code stream, the second terminal device traverses and reads a plurality of NAL units included in the received video code stream, and determines the type of each read NAL unit; when the type of the read NAL unit is determined to be an image slice, performing slice segmentation decoding operation to obtain a corresponding decoded image; when it is determined that the type of the read NAL unit is SEI, an SEI decoding operation is performed to obtain a picture mask synchronized with each frame picture.
In step S203, fusion processing is performed on each frame of image and the enhanced image information synchronized with each frame of image, to obtain a composite image corresponding to each frame of image.
In some embodiments, when the enhanced image information is an image mask, the second terminal device may perform the above fusion processing on each frame of image and the enhanced image information synchronized with each frame of image by: the following processing is performed for each frame image: masking the image based on an image mask synchronized with the image to obtain a composite image corresponding to the image and with the background removed; wherein the image mask is generated by object recognition of the image.
Taking a video conference as an example, when the terminal device associated with a participant decodes the received video code stream and obtains multiple frames of images and the portrait mask corresponding to each frame of image, the following processing is performed for each frame of image: a masking operation is performed on the image based on the portrait mask synchronized with the image, so as to obtain a composite image corresponding to the image with the background removed (that is, the image of the conference initiator after the background is removed). In this way, only the image of the conference initiator with the background removed is presented on the terminal device of the participant, which improves the display effect of the video.
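A minimal sketch of the masking operation itself follows, assuming the decoded frame is an RGB array and the portrait mask has already been brought to the same height and width with values in [0, 1]; the function and variable names are illustrative.

```python
import numpy as np

def remove_background(frame_rgb: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep only the foreground of a decoded RGB frame using a mask in [0, 1]."""
    # mask has shape (H, W); broadcast it over the three colour channels
    return (frame_rgb.astype(np.float32) * mask[..., None]).astype(np.uint8)
```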
In some embodiments, in order to save the network bandwidth required for transmitting the image mask, the image mask is compressed, and the compression process introduces noise into the image mask. If the second terminal device directly uses the decoded image mask to mask the image, the display effect of the image will be affected. In view of this, before performing the fusion processing on each frame of image and the image mask synchronized with each frame of image, the second terminal device may further perform the following operation: calling a saturation filter to perform noise reduction on the decoded image mask, so as to recover the pre-compression information of the image mask to the greatest extent, thereby improving the display effect of the composite image.
In other embodiments, when the enhanced image information is the position of the key point of the object in the image, the second terminal device may perform the above-mentioned fusion processing on each frame of image and the enhanced image information synchronized with each frame of image, to obtain the composite image corresponding to each frame of image by: obtaining a special effect matched with the key point; and adding special effects at the positions of the key points corresponding to the targets in each frame of image to obtain the composite image with the special effects.
For example, taking a live broadcast scene as an example, when the terminal device associated with the audience decodes the received video code stream and obtains the multiple frames of images and the positions of the key points of the host synchronized with each frame of image, a special effect matched with a key point of the host can be obtained; for example, when the key point is an eye of the host, cartoon glasses can be obtained, and the cartoon glasses are added at the position corresponding to the eyes of the host in each frame of image, so that a host picture with the cartoon glasses added is presented on the terminal device associated with the audience.
For example, referring to fig. 8, fig. 8 is a schematic view of an application scenario of the video processing method provided by the embodiment of the present application. As shown in fig. 8, after the position 801 of the eyebrows of the host in the image is obtained, a special effect matched with the eyebrow position 801, such as cartoon glasses, may be obtained, and then cartoon glasses 802 may be added at the position of the eyebrows of the host in the live broadcast picture, thereby enriching the display effect of the live broadcast picture.
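The following sketch illustrates one way such a sticker-style effect could be pasted at a decoded keypoint position; the RGBA sticker, the centring convention, and the function name are assumptions for illustration, not the embodiment's actual rendering code.

```python
import numpy as np

def overlay_effect(frame: np.ndarray, effect_rgba: np.ndarray, center_xy) -> np.ndarray:
    """Paste an RGBA sticker (e.g. cartoon glasses) centred on a detected keypoint."""
    h, w = effect_rgba.shape[:2]
    x0 = int(center_xy[0]) - w // 2
    y0 = int(center_xy[1]) - h // 2
    # clip the sticker to the frame boundaries (keypoints near the border)
    fx0, fy0 = max(x0, 0), max(y0, 0)
    fx1, fy1 = min(x0 + w, frame.shape[1]), min(y0 + h, frame.shape[0])
    ex0, ey0 = fx0 - x0, fy0 - y0
    ex1, ey1 = ex0 + (fx1 - fx0), ey0 + (fy1 - fy0)
    alpha = effect_rgba[ey0:ey1, ex0:ex1, 3:4].astype(np.float32) / 255.0
    roi = frame[fy0:fy1, fx0:fx1].astype(np.float32)
    frame[fy0:fy1, fx0:fx1] = (alpha * effect_rgba[ey0:ey1, ex0:ex1, :3] +
                               (1.0 - alpha) * roi).astype(np.uint8)
    return frame
```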
In some embodiments, when the enhanced image information includes a pose of an object in the image, the second terminal device may perform the above-described fusion processing on each frame of the image and the enhanced image information synchronized with each frame of the image, to obtain a composite image corresponding to each frame of the image by: performing at least one of the following processes for each frame of image: determining the states corresponding to each object in the image according to the type of the gesture, counting the states corresponding to all objects in the image, and adding a counting result in the image to obtain a composite image corresponding to the image and added with the counting result; and acquiring the special effect matched with the gesture of the object in the image, and adding the acquired special effect in the image to obtain a composite image corresponding to the image and added with the special effect.
For example, taking a video conference as an example, when the terminal device associated with a participant decodes the received video code stream and obtains multiple frames of images and the gestures of the objects in the image corresponding to each frame of image, the following processing is performed for each frame of image: the states (such as moods, opinions, and the like) corresponding to each participant in the image are determined according to the gesture types, the states corresponding to all participants appearing in the image are counted, and the counting result is added to the image. For example, assume that when voting on an issue, there are a total of 5 participants in the image (including 1 conference initiator and 4 participants of the conference), the states of 3 participants are detected to be the thumb-up state, that is, the state indicating agreement, and the states of the other 2 participants are the thumb-down state, that is, the state indicating objection; thus a statistical result of 3 votes in favor and 2 votes against is generated and added to the corresponding image. For example, referring to fig. 9, fig. 9 is a schematic view of an application scenario of the video processing method provided by the embodiment of the present application. As shown in fig. 9, there are 3 participants in total in a frame of image, and the posture information 901 corresponding to each of the 3 participants has been acquired, where the body orientations of 2 participants face the screen, indicating agreement, while the body orientation of 1 participant faces away from the screen; the statistical result 902 can be generated according to the states of the 3 participants and presented in the image, so that the participants of the conference can conveniently learn the statistical result.
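A minimal sketch of the state counting described above, assuming the decoded pose information has already been mapped to per-participant labels such as "thumb_up" / "thumb_down"; the label names and mapping are hypothetical.

```python
from collections import Counter

def tally_votes(poses):
    """Map per-participant pose labels (decoded from SEI) to vote states and count them."""
    state_of = {"thumb_up": "agree", "thumb_down": "disagree"}
    counts = Counter(state_of.get(p, "unknown") for p in poses)
    return dict(counts)            # e.g. {"agree": 3, "disagree": 2}
```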
Taking a live broadcast scene as an example, when the terminal device associated with the audience decodes the received video code stream and obtains the multiple frames of images and the posture of the host synchronized with each frame of image, a special effect adapted to the posture of the host in the image can be obtained. That is, after obtaining the posture of the host, the terminal device associated with the audience can sequentially load special effects such as graphics, styles, and artistic models onto the host in the corresponding frames, so that, by recognizing the changes of the human body posture of the host, the added special effects are naturally fused with the host while the host is moving.
In step S204, a composite image corresponding to each frame of image is sequentially displayed in the human-computer interaction interface.
In some embodiments, before the second terminal device sequentially displays the composite image corresponding to each frame of image in the human-computer interaction interface, the following processing may be further performed: acquiring a background image; and merging the background-removed composite image, which is obtained by masking the image based on the image mask synchronized with the image, with the background image, so as to display the merged image obtained after merging in the human-computer interaction interface.
For example, taking a video conference scenario as an example, when the terminal device associated with the conference initiator sends, to the background server of the video conference, a video code stream including the multiple frames of images collected by calling the camera and the portrait mask corresponding to each frame of image, it may further send a video code stream obtained by capturing the screen of the terminal device associated with the conference initiator, such as a slide show (PPT) shared by the conference initiator, to the background server of the video conference. In this way, the terminal device associated with a participant may simultaneously receive the video code stream containing the shared PPT sent by the conference initiator and the video code stream containing the multiple frames of images and the portrait mask corresponding to each frame of image; then, after the image is masked based on the portrait mask synchronized with the image, the composite image corresponding to the image with the background removed may be combined with the shared PPT to achieve a picture-in-picture effect (for example, the picture-in-picture effect shown in fig. 14 is rendered).
According to the video processing method provided by the embodiment of the application, the enhanced image information synchronous with each frame of image is integrated in the video code stream, so that after the receiving end receives the video code stream, each frame of image and the enhanced image information synchronous with each frame of image can be subjected to fusion processing to obtain the composite image corresponding to each frame of image, and then the composite image corresponding to each frame of image is sequentially displayed in a man-machine interaction interface, so that the original image of the video and other image information (such as mask images, key point information and the like) can be accurately and synchronously displayed, and the display mode of the video is enriched.
In the following, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
The embodiment of the application provides a video processing method for transmitting additional information based on SEI, where the additional information transmitted based on SEI includes a portrait mask, face key points, human body posture information, and the like. In general, SEI is used to transmit encoder parameters, video copyright information, camera parameters, event identifiers, and the like. That is, the data transmitted by conventional SEI is usually constant data or data of short length, and there is usually no one-to-one correspondence between such data and the original video frames. In contrast, in the video processing method provided by the embodiment of the present application, the additional information transmitted based on SEI is an image processing result obtained after performing computer vision (CV, Computer Vision) processing on each frame of image, and it needs to be strictly synchronized with the original video frames. Meanwhile, the video processing method based on SEI transmission provided by the embodiment of the present application further has the following advantages:
1) Protocol independence: currently, mainstream protocols (such as the real-time streaming protocol (RTSP, Real Time Streaming Protocol) and the real-time messaging protocol (RTMP, Real Time Messaging Protocol)) support SEI, and other protocols can also be used, as long as the receiving end parses the received SEI information.
2) High compatibility: SEI is not a mandatory option in the decoding process, so backward compatibility is maintained; that is, when SEI parsing is not supported, the original decoder (such as an H264 decoder) simply ignores the SEI, and the video can still be decoded normally.
3) Bandwidth friendliness: in a high-definition video code stream, the resolution of the image can reach 1080p, 1280p, or even higher. If an image mask with the same original size as each frame of image is transmitted directly, the bandwidth will increase by 30% for an RGB image, and by 50% for a YUV image (YUV is a color coding method, where Y represents luminance (Luminance or Luma), i.e., the gray-scale value, and U and V represent chrominance (Chrominance or Chroma); the format of the YUV image here is NV12); that is, the bandwidth cost introduced by transmitting the image mask directly would be quite high.
In order to solve the bandwidth problem, the embodiment of the application splits the post-processing of the portrait segmentation model: what is transmitted is the small-size mask map output by the portrait segmentation model (which is far smaller than the video size, for example on the order of 100×100, and is hereinafter referred to as the small mask map for convenience of description), rather than the large mask map (i.e., the mask map with the same size as the video) output after complete post-processing. This ensures that the transmitted portrait mask is independent of the size of the video frame and reduces the bandwidth occupied by transmitting the portrait mask.
In addition, in order to further reduce the bandwidth occupied by transmitting the portrait mask, the embodiment of the application may also compress the small mask map, for example using zlib (zlib is a general-purpose compression library that provides a set of in-memory compression and decompression functions and can verify the integrity of the decompressed data), which reduces the size of the mask to a certain extent. However, this type of compression algorithm does not take the temporal nature of the small mask maps into account and only compresses the small mask map of each frame independently. Therefore, for further compression, an H264 compression technique may be used to perform video compression on the sequence of small mask maps. In this way, by truncating the portrait segmentation post-processing and applying a video compression algorithm, transmission of the portrait mask with minimal bandwidth can be achieved.
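As an illustration of the per-frame fallback compression mentioned above, the sketch below compresses an 8-bit small mask map with zlib; compressing the mask sequence with an H264 encoder instead would treat the quantised masks as a small grayscale video and is not shown here. Names are illustrative.

```python
import zlib
import numpy as np

def compress_small_mask(mask_u8: np.ndarray) -> bytes:
    """Per-frame compression of an 8-bit small mask map with zlib."""
    return zlib.compress(mask_u8.tobytes(), level=6)

def decompress_small_mask(blob: bytes, shape) -> np.ndarray:
    """Restore the 8-bit small mask map; zlib also verifies the integrity of the data."""
    return np.frombuffer(zlib.decompress(blob), dtype=np.uint8).reshape(shape)
```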
4) The embodiment of the application provides a post-processing method aiming at noise introduced by video compression, which can effectively inhibit the noise introduced by video compression, thereby maximally recovering information before compression of a small mask image.
The video processing method provided by the embodiment of the application is specifically described below by taking the example of only presenting a portrait (i.e. a character image after removing the background) and playing the PPT when sharing the desktop of the video conference.
The video processing method provided by the embodiment of the application mainly includes an SEI information production stage and an SEI information consumption stage. In the SEI information production stage, the video code stream processing module of the transmitting end generates additional information for the video code stream to be transmitted, encapsulates the generated additional information into SEI information, and writes it back into the video code stream; in the SEI information consumption stage, the receiving end parses the received SEI information and uses the mask information in the SEI information to display picture-in-picture. The two stages are each described in detail below.
For example, referring to fig. 10, fig. 10 is a flow chart of an SEI information production phase provided by an embodiment of the present application. The processing of the video stream after the video stream is input to the video stream processing module is described with reference to fig. 10.
After the video code stream is input into the video code stream processing module, the decoding module first performs decoding processing, and after decoding is completed, a decoded image is obtained, for example, a single-frame image in RGB format or YUV format. For example, the decoding module decodes the incoming video bitstream, and the decoding tool may be FFMPEG (FFMPEG is a set of open-source computer programs that can be used to record and convert digital audio and video and turn them into streams; it uses the LGPL or GPL license, provides a complete solution for recording, converting, and streaming audio and video, and includes functions such as video capture, video format conversion, screen capture, and watermarking), so that the video bitstream can be converted into YUV data or RGB data that can be processed by the CV algorithm.
Then, the transmitting end (corresponding to the first terminal device) performs computer vision processing on the single-frame decoded image obtained after decoding, including computer vision processing such as calling a portrait segmentation model to perform portrait segmentation on the decoded image, calling a key point detection model to detect the key points of a face in the decoded image, or calling a pose estimation model to estimate the pose of a human body in the decoded image. For example, taking portrait segmentation as an example, the decoded image is first reduced in size, and the reduced decoded image is input into the portrait segmentation model to output the small mask map M_small.
After CV processing is performed on the decoded image, an image processing result is obtained (here, again taking the small mask map as an example). Then, the transmitting end performs video compression on the small mask map and the decoded image respectively, so as to obtain the image code and the image processing result code (i.e., the small mask map code), and encapsulates the small mask map code into the SEI of the video frame.
In addition, since the input supported by video compression is usually an image, for example an RGB image whose value range is [0, 255], while the value range of the small mask map is [0, 1], video compression cannot be performed directly on the small mask map. For this purpose, the value range of the small mask map is mapped to the [0, 255] interval, that is, the small mask map is directly multiplied by 255 before video compression. The calculation formula is as follows: M_stream = M_small × 255.
When the receiving end uses the small mask map, the value range of the small mask map is first restored to the [0, 1] interval and then used.
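A minimal sketch of the range mapping on both sides, matching the multiply-by-255 / divide-by-255 formulas above; the rounding and clipping choices are assumptions for illustration.

```python
import numpy as np

def to_codec_range(mask_small: np.ndarray) -> np.ndarray:
    """Sender side: map the small mask map from [0, 1] to [0, 255] for the video encoder."""
    return np.clip(np.rint(mask_small * 255.0), 0, 255).astype(np.uint8)

def from_codec_range(mask_decoded: np.ndarray) -> np.ndarray:
    """Receiver side: restore the decoded mask to (approximately) the [0, 1] interval."""
    return mask_decoded.astype(np.float32) / 255.0
```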
Finally, the transmitting end pushes the encapsulated SEI information into the video code stream, and the production of the SEI information is thus completed. For example, FFMPEG may be used to convert the small mask map into an SEI bitstream and write the SEI bitstream into the original video bitstream for merging. Finally, the merged video code stream is pushed out, thereby completing the whole SEI information production stage.
It should be noted that, in addition to filtering an existing code stream and inserting SEI field information as shown in fig. 10, the production method of SEI information may also be used to generate SEI information when video encoding or insert SEI information when writing in a container layer, which is not particularly limited in the embodiment of the present application.
The SEI information consumption phase is explained below.
For example, referring to fig. 11, fig. 11 is a schematic flow chart of an SEI information consumption phase provided by an embodiment of the present application. As shown in fig. 11, the SEI information consumption phase is mainly divided into the following three steps:
After receiving the video code stream, the receiving end (corresponding to the second terminal device) first decodes the received video code stream, that is, decodes the image code included in the video code stream and parses the SEI information, so as to obtain the decoded image and the decoded small mask map respectively. For example, the receiving end may use FFMPEG to decode the image from the video code stream, where the decoded image may be an image I_stream in YUV format, and may use FFMPEG to decode the small mask map M_stream from the SEI information of the video code stream. The value range of M_stream is approximately [0, 255], but due to the noise introduced by video compression its values no longer exactly match the values before compression, and the value range of M_stream needs to be restored approximately to the [0, 1] interval. The calculation formula is as follows: M̂_small = M_stream / 255.
Here, M̂_small is the small mask map obtained by decoding; it is different from the small mask map M_small directly output by the portrait segmentation model, because M̂_small is superimposed with the noise introduced by video compression.
Then, the receiving end performs post-processing on the mask map obtained by decoding, including enlargement processing and noise reduction processing, and fuses the post-processed mask map with the decoded image, i.e., the post-processed mask map is used as the Alpha channel of the decoded image. The Alpha channel refers to the transparency or translucency of an image; for example, for a bitmap stored with 16 bits per pixel, 5 bits may be used for red, 5 bits for green, 5 bits for blue, and the last bit for Alpha, in which case it indicates transparency or opacity. In the embodiment of the application, the portrait mask is used as the Alpha channel of the decoded image in the fusion processing, so as to obtain the person image with the background removed.
The following describes a process of performing post-processing on the mask map obtained after decoding by the receiving end in detail.
For example, referring to fig. 12, fig. 12 is a schematic flow chart of post-processing performed on a mask map obtained after decoding at a receiving end according to an embodiment of the present application, and as shown in fig. 12, a post-processing process performed at the receiving end mainly includes three steps of mask amplification, mask noise reduction and image merging, which are described below.
(1) Mask amplification
After the receiving end parses the SEI information in the video code stream to obtain the small mask map M̂_small whose value range is approximately distributed in the [0, 1] interval, the small mask map M̂_small may be enlarged by a resize operation to the same size as the decoded image, where the optional resize algorithms include a bilinear interpolation algorithm, a nearest-neighbor sampling algorithm, and the like. Thus, by enlarging the small mask map M̂_small, a large mask map M̂_large with the same size as the decoded image can be obtained.
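A sketch of the enlargement step, assuming OpenCV's bilinear resize as one of the optional resize algorithms mentioned above; the function name is illustrative.

```python
import cv2
import numpy as np

def enlarge_mask(mask_small: np.ndarray, frame_shape) -> np.ndarray:
    """Resize the decoded small mask map to the decoded image size (bilinear interpolation)."""
    h, w = frame_shape[:2]
    return cv2.resize(mask_small.astype(np.float32), (w, h),
                      interpolation=cv2.INTER_LINEAR)
```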
(2) Mask noise reduction
The large mask map M̂_large obtained after enlargement has been video compressed, and noise has been introduced; if the distorted large mask map M̂_large is used directly, the final rendering result will be significantly damaged. Therefore, in order to restore the large mask map M̂_large, a filter called a saturation filter may be provided. After parsing the SEI information in the video code stream to obtain the small mask map M_stream, the receiving end first restores the value range of M_stream approximately to the [0, 1] interval, that is, divides it by 255, to obtain the small mask map M̂_small; subsequently, the small mask map M̂_small is enlarged to the same size as the decoded image to obtain the large mask map M̂_large; finally, the obtained large mask map M̂_large is input into the saturation filter for noise reduction. By way of example, the saturation filter f_sat is defined as follows: f_sat(x) = 1 when x > upper, f_sat(x) = 0 when x < lower, and f_sat(x) = x otherwise.
Here, upper represents the upper limit of the saturation filter: when the value of a pixel x in the large mask map M̂_large is greater than upper, it is set to 1 after the noise reduction processing of the saturation filter f_sat; lower represents the lower limit of the saturation filter: when the value of a pixel x in the large mask map M̂_large is smaller than lower, it is set to 0 after the noise reduction processing of the saturation filter f_sat. For example, referring to fig. 13, fig. 13 is a schematic diagram of the saturation filter provided by an embodiment of the present application; as shown in fig. 13, the two parameters lower and upper may be taken as 0.2 and 0.8 respectively, so that the value range of the large mask map M̂_large is limited to [0, 1] and most of the noise introduced by video compression is removed. That is, after the distorted large mask map M̂_large is input into the saturation filter f_sat for noise reduction, a denoised large mask map M̂_denoised can be obtained. The calculation formula is as follows: M̂_denoised = f_sat(M̂_large).
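A sketch of the saturation filter with lower = 0.2 and upper = 0.8 as in fig. 13; how values between lower and upper are treated (kept unchanged here) is an implementation choice not fixed by the description above.

```python
import numpy as np

def saturation_filter(mask_large: np.ndarray,
                      lower: float = 0.2, upper: float = 0.8) -> np.ndarray:
    """Suppress compression noise: values above `upper` become 1, values below `lower`
    become 0; values in between are kept as-is (an assumption of this sketch)."""
    out = mask_large.copy()
    out[mask_large > upper] = 1.0
    out[mask_large < lower] = 0.0
    return np.clip(out, 0.0, 1.0)
```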
(3) Image merging
After the denoised large mask map M̂_denoised is obtained, it can be merged into the decoded image in the form of an Alpha channel, so as to obtain the person image with the background removed.
Finally, the receiving end renders and uses the composite image on the user interface, thereby completing the whole flow of the picture-in-picture function. For example, after the receiving end merges the denoised large mask map M̂_denoised into the decoded image as the Alpha channel, the decoded image containing the Alpha channel (for example, the decoded image I is masked using the denoised large mask map to obtain the person image with the background removed) is merged and rendered with a background image B (for example, a screen image shared by the initiator of the video conference), to obtain the image I_composite finally displayed on the user interface. The calculation formula is as follows: I_composite = M̂_denoised × I + (1 − M̂_denoised) × B.
Here, I denotes the decoded image and B denotes the background image, which may for example be a screen image shared by the initiator of the video conference. The background image B and the video code stream may be transmitted to the receiving end through different transmission channels respectively and merged and rendered at the receiving end, thereby implementing the picture-in-picture function (for example, the person image with the background removed may be displayed as a small floating picture on the PPT, i.e., the large picture, shared by the initiator of the video conference, so as to achieve the picture-in-picture display effect). For example, referring to fig. 14, fig. 14 is a schematic diagram of a picture-in-picture interface provided by an embodiment of the present application; as shown in fig. 14, the person image 1402 with the background removed may be displayed in suspension over the background image 1401, where the background image 1401 may be, in addition to the PPT shown in fig. 14, various windows or web pages (Web) such as Excel and Word. In addition, because the mask transmitted based on SEI in the embodiment of the application is strictly synchronized with the video frames, the person image 1402 with the background removed (i.e., the small picture in the picture-in-picture) will not lag when the PPT is turned to the next page.
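A sketch of the final compositing formula, assuming full-frame alpha blending of the decoded image with the shared-screen background; scaling and positioning of the small picture for an actual picture-in-picture layout are omitted, and the function name is illustrative.

```python
import numpy as np

def composite_picture_in_picture(frame: np.ndarray, mask: np.ndarray,
                                 background: np.ndarray) -> np.ndarray:
    """Alpha-blend the background-removed person onto a shared-screen image.
    `mask` is the denoised large mask map in [0, 1], same height/width as `frame`."""
    alpha = mask[..., None].astype(np.float32)
    out = alpha * frame.astype(np.float32) + (1.0 - alpha) * background.astype(np.float32)
    return out.astype(np.uint8)
```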
The beneficial effects of the video processing method provided by the embodiment of the application are further described below by combining page diagrams corresponding to different stages respectively.
For example, referring to fig. 15, fig. 15 is a schematic diagram of pages corresponding to different stages of the video processing method according to an embodiment of the present application. As shown in fig. 15, page 1501 is the decoded image obtained after the decoding module at the transmitting end decodes the input video code stream; page 1502 is the person image obtained after background replacement using the mask image output directly by the portrait segmentation model, that is, a mask image that is not transmitted via SEI; page 1503 is the person image obtained by the receiving end performing background replacement directly with the mask image transmitted via SEI, where the mask image transmitted via SEI has not undergone noise reduction by the saturation filter; page 1504 is the person image obtained by the receiving end performing background replacement with the mask image that was transmitted via SEI and denoised by the saturation filter.
Comparing page 1503 with page 1504, it can be seen that if the noise reduction processing of the saturation filter is not performed (i.e., page 1503), the background will be noisy, because the mask image has not undergone the "binarization" processing (i.e., after the mask image is input into the saturation filter, values greater than upper in the mask image are set to 1 and values smaller than lower are set to 0), so the background area in the mask image is not set to 0, which leaves residual background color. Therefore, the saturation filter plays the roles of both reducing noise and removing residual background color. In addition, comparing page 1502 with page 1504, it can be seen that the display effects of page 1502 and page 1504 are equivalent, that is, the display effect of the person image obtained by background replacement using the mask image transmitted via SEI is almost the same as that of the person image obtained by background replacement using the mask image directly output by the portrait segmentation model; in other words, the information of the mask image before compression is recovered to the greatest extent.
In other embodiments, additional information, such as a mask map corresponding to each frame of decoded image or other CV processing results, may be directly written on the decoded image of the corresponding frame according to a certain specific rule, and transmitted using the decoded image as a carrier. At the receiving end, according to the specific rule, the extra information written in the decoded image is separated, and the extra information and the decoded image are recovered respectively, so that the video and the extra information are transmitted simultaneously.
Continuing with the description below of an exemplary architecture in which video processing device 355 provided by embodiments of the present application is implemented as a software module, in some embodiments, as shown in fig. 3A, the software module stored in video processing device 355 of memory 350 may include: an acquisition module 3551, a computer vision processing module 3552, a generation module 3553, and a transmission module 3554.
An acquisition module 3551 for acquiring a multi-frame image; the computer vision processing module 3552 is configured to perform computer vision processing on each acquired frame of image, so as to obtain enhanced image information corresponding to each frame of image; a generation module 3553 for generating a video bitstream including each frame image and enhanced image information corresponding to each frame image; a transmitting module 3554, configured to transmit a video code stream; the video code stream is used for being decoded by the terminal equipment to display a composite image corresponding to each frame of image, and the composite image is obtained by fusion processing of each frame of image and enhanced image information synchronous with each frame of image.
In some embodiments, when the enhanced image information is an image mask, the computer vision processing module 3552 is further configured to perform the following processing for each frame of image: calling an image segmentation model based on the image to identify an object in the image, taking a region outside the object as a background, and generating an image mask corresponding to the background; the image segmentation model is obtained by training based on a sample image and an object marked in the sample image.
In some embodiments, the computer vision processing module 3552 is further configured to directly take the image as an input of an image segmentation model to determine an object in the image, take a region outside the object as a background, and generate an image mask corresponding to the background, where the size of the image mask is consistent with the size of the image; or for reducing the size of the image, taking the reduced image as an input of an image segmentation model to determine an object in the reduced image, taking a region outside the object as a background, and generating an image mask corresponding to the background, wherein the size of the image mask is smaller than the size of the image.
In some embodiments, the computer vision processing module 3552 is further configured to perform the following processing for each frame of image: invoking a key point detection model to detect key points of the image so as to obtain the positions of the key points of the objects in the image; determining the position of a key point of the object as enhanced image information corresponding to the image; the key point detection model is obtained by training based on the sample image and the positions of the key points of the objects marked in the sample image.
In some embodiments, the computer vision processing module 3552 is further configured to perform the following processing for each frame of image: invoking a gesture detection model to perform gesture detection so as to obtain gesture information of an object in the image; the pose information of the object is determined as enhanced image information corresponding to the image.
In some embodiments, the generating module 3553 is further configured to encode the image and the enhanced image information corresponding to the image respectively, so as to obtain an image code and the enhanced image information code corresponding to the image code; encapsulate the image code into a network abstraction layer unit whose type is image slice, and encapsulate the enhanced image information code into a network abstraction layer unit whose type is supplemental enhancement information; and assemble the network abstraction layer unit whose type is image slice and the network abstraction layer unit whose type is supplemental enhancement information into a video bitstream.
Continuing with the description below of an exemplary architecture in which the video processing device 455 provided by embodiments of the present application is implemented as a software module, in some embodiments, as shown in fig. 3B, the software module stored in the video processing device 455 of the memory 450 may include: a receiving module 4551, a decoding processing module 4552, a fusion processing module 4553, a display module 4554, an amplifying processing module 4555, and a noise reduction processing module 4556.
A receiving module 4551, configured to receive a video code stream, where the video code stream includes multiple frame images and enhanced image information synchronized with each frame image; the decoding processing module 4552 is configured to decode the video code stream to obtain a multi-frame image and enhanced image information synchronized with each frame image; the fusion processing module 4553 is configured to perform fusion processing on each frame of image and the enhanced image information synchronized with each frame of image, so as to obtain a composite image corresponding to each frame of image; and the display module 4554 is used for sequentially displaying the composite image corresponding to each frame of image in the human-computer interaction interface.
In some embodiments, when the enhanced image information is an image mask, the fusion processing module 4553 is further configured to perform the following processing for each frame of image: masking the image based on an image mask synchronized with the image to obtain a composite image corresponding to the image and with the background removed; wherein the image mask is generated by object recognition of the image.
In some embodiments, the receiving module 4551 is further configured to acquire a background image; the fusion processing module 4553 is further configured to combine the composite image from which the background is removed with the background image; the display module 4554 is further configured to display the combined image obtained by combining in the human-computer interaction interface.
In some embodiments, the size of the image mask synchronized with each frame of image is smaller than the size of the image; the video processing apparatus 455 further includes an enlargement processing module 4555 for performing enlargement processing on the enhanced image information synchronized with each frame image; the video processing apparatus 455 further includes a noise reduction processing module 4556 for performing noise reduction processing on the enhanced image information obtained after the enlargement processing.
In some embodiments, when the enhanced image information is the position of the keypoints of the object in the image, the fusion processing module 4553 is further configured to obtain a special effect matched with the keypoints; and adding special effects at the positions of the key points corresponding to the objects in each frame of image to obtain the composite image with the special effects.
In some embodiments, when the enhanced image information includes a pose of an object in the image, the fusion processing module 4553 is further configured to perform at least one of the following for each frame of the image: determining the states of each object in the image according to the type of the gesture, counting the states of all objects in the image, and adding a counting result into the image to obtain a composite image which corresponds to the image and is added with the counting result; and obtaining the special effect matched with the gesture of the object in the image, and adding the special effect into the image to obtain the composite image with the special effect corresponding to the image.
In some embodiments, the decoding processing module 4552 is further configured to read each network abstraction layer unit included in the video bitstream and determine the type of each network abstraction layer unit; when the type of the read network abstraction layer unit is an image slice, perform a slice segmentation decoding operation to obtain the multiple frames of images; and when the type of the read network abstraction layer unit is supplemental enhancement information, perform a supplemental enhancement information decoding operation to obtain the enhanced image information synchronized with each frame of image.
It should be noted that, the description of the apparatus according to the embodiment of the present application is similar to the description of the embodiment of the method described above, and has similar beneficial effects as the embodiment of the method, so that a detailed description is omitted. The technical details of the video processing apparatus provided in the embodiment of the present application may be understood from the description of any one of fig. 4 or fig. 6.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the video processing method according to the embodiment of the present application.
Embodiments of the present application provide a computer readable storage medium having stored therein executable instructions which, when executed by a processor, cause the processor to perform a method provided by embodiments of the present application, for example, a video processing method as shown in fig. 4 or fig. 6.
In some embodiments, the computer readable storage medium may be an FRAM, a ROM, a PROM, an EPROM, an EEPROM, a flash memory, a magnetic surface memory, an optical disc, or a CD-ROM, or may be various devices including one of or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.
In summary, in the embodiment of the present application, the enhanced image information synchronized with each frame of image is integrated in the video code stream, so that after the receiving end receives the video code stream, the receiving end may perform fusion processing on each frame of image and the enhanced image information synchronized with each frame of image, so as to obtain a composite image corresponding to each frame of image, and then sequentially display the composite image corresponding to each frame of image in the man-machine interface, so that the original image of the video and other image information (such as mask image, key point information, etc.) can be accurately and synchronously displayed, and the display mode of the video is enriched.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (16)

1. A method of video processing, the method comprising:
receiving a video code stream, wherein the video code stream comprises multi-frame images and enhanced image information synchronous with each frame of image, and the enhanced image information comprises one of the following: an image mask, positions of key points of an object in the image, and a gesture of an object in the image;
Reading each network abstraction layer unit included in the video code stream, and determining the type of each network abstraction layer unit;
When the type of the read network abstraction layer unit is an image slice, performing a slice segmentation decoding operation to obtain the multi-frame image;
When the type of the read network abstraction layer unit is the supplemental enhancement information, performing a supplemental enhancement information decoding operation to obtain enhanced image information synchronized with each frame of image;
Carrying out fusion processing on each frame of image and the enhancement image information synchronized with each frame of image to obtain a composite image corresponding to each frame of image;
And displaying the composite image corresponding to each frame of image in the human-computer interaction interface in sequence.
2. The method according to claim 1, wherein when the enhanced image information is the image mask, the fusing the each frame image and the enhanced image information synchronized with the each frame image to obtain a composite image corresponding to the each frame image includes:
The following processing is performed for each frame of image:
masking the image based on an image mask synchronized with the image to obtain a composite image corresponding to the image and with the background removed;
wherein the image mask is generated by object recognition of the image.
3. The method of claim 2, wherein before displaying the composite image corresponding to each frame of image in the human-machine interface in sequence, the method further comprises:
Acquiring a background image;
and merging the composite image with the background image, and displaying the merged image obtained by merging in a man-machine interaction interface.
4. The method of claim 2, wherein the step of determining the position of the substrate comprises,
The size of the image mask synchronized with the each frame of image is smaller than the size of the image;
before the fusing processing is performed on the each frame of image and the enhanced image information synchronized with the each frame of image, the method further includes:
And amplifying the enhanced image information synchronized with each frame of image, and carrying out noise reduction on the enhanced image information obtained after the amplifying.
5. The method according to claim 1, wherein when the enhanced image information is a position of a key point of an object in the image, the fusing the each frame of image and the enhanced image information synchronized with the each frame of image to obtain a composite image corresponding to the each frame of image includes:
obtaining special effects matched with the key points;
And adding the special effect at the position of the key point corresponding to the object in each frame of image to obtain a composite image corresponding to each frame of image, wherein the special effect is added in the composite image.
6. The method according to claim 1, wherein when the enhanced image information includes a pose of an object in the image, the fusing the each frame image and the enhanced image information synchronized with the each frame image to obtain a composite image corresponding to the each frame image includes:
Performing at least one of the following processes for each frame of image:
Determining the states corresponding to each object in the image according to the gesture type, counting the states of all objects in the image, and adding a counting result into the image to obtain a composite image corresponding to the image and added with the counting result;
And acquiring a special effect adapted to the gesture of the object in the image, and adding the special effect in the image to obtain a composite image corresponding to the image and added with the special effect.
7. A method of video processing, the method comprising:
acquiring multi-frame images;
performing computer vision processing on each acquired frame of image to obtain enhanced image information corresponding to each frame of image, wherein the enhanced image information comprises one of the following: an image mask, positions of key points of an object in the image, or a gesture of an object in the image;
encoding each frame of image and the enhanced image information corresponding to each frame of image respectively, to obtain multi-frame image codes and an enhanced image information code corresponding to each frame of image code;
packaging the multi-frame image codes, in one-to-one correspondence, into a plurality of network extraction layer units whose type is image slice, and packaging the multi-frame enhanced image information codes, in one-to-one correspondence, into a plurality of network extraction layer units whose type is supplementary enhancement information;
assembling the plurality of network extraction layer units whose type is image slice and the plurality of network extraction layer units whose type is supplementary enhancement information into a video code stream;
transmitting the video code stream;
wherein the video code stream is configured to be decoded by a terminal device to display a composite image corresponding to each frame of image, the composite image being obtained by fusion processing of each frame of image and the enhanced image information synchronized with each frame of image.
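A deliberately simplified sketch of the sender side of claim 7: each encoded frame and its enhancement information are wrapped in units tagged as image slice or supplementary enhancement information and concatenated, with Annex-B style start codes, into a single stream. The one-byte type marker and the `wrap_unit`/`assemble_stream` helpers are illustrative only; this is not a spec-compliant H.264/H.265 bitstream writer.

```python
# Wrap encoded frames and enhancement payloads into tagged units and assemble a stream.
START_CODE = b"\x00\x00\x00\x01"
TYPE_SLICE, TYPE_SEI = 1, 6  # type tags, used here only as a one-byte marker

def wrap_unit(unit_type: int, payload: bytes) -> bytes:
    return START_CODE + bytes([unit_type]) + payload

def assemble_stream(frame_codes: list, sei_codes: list) -> bytes:
    assert len(frame_codes) == len(sei_codes)       # one-to-one correspondence
    stream = b""
    for frame_code, sei_code in zip(frame_codes, sei_codes):
        stream += wrap_unit(TYPE_SEI, sei_code)     # enhancement info for this frame
        stream += wrap_unit(TYPE_SLICE, frame_code) # the frame's image slice
    return stream

if __name__ == "__main__":
    stream = assemble_stream([b"frame0", b"frame1"], [b"mask0", b"mask1"])
    print(len(stream), stream[:5])
```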
8. The method according to claim 7, wherein when the enhanced image information is the image mask, performing computer vision processing on each acquired frame of image to obtain the enhanced image information corresponding to each frame of image comprises:
The following processing is performed for each frame of image:
calling an image segmentation model based on the image to identify an object in the image, and generating an image mask corresponding to the background by taking the area outside the object as the background;
the image segmentation model is obtained by training based on a sample image and an object marked in the sample image.
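A sketch of the mask generation of claim 8. A trained segmentation network would be called in practice; here `stub_segmentation_model`, a simple brightness threshold, stands in so the example is self-contained, and the produced mask marks the background region (the area outside the object).

```python
# Generate a background mask from a stand-in segmentation model (claim 8 sketch).
import numpy as np

def stub_segmentation_model(image: np.ndarray) -> np.ndarray:
    """Stand-in for a trained segmentation model: returns 1 where it 'sees' the object."""
    gray = image.mean(axis=2)
    return (gray > 127).astype(np.uint8)

def generate_background_mask(image: np.ndarray) -> np.ndarray:
    object_mask = stub_segmentation_model(image)
    return (1 - object_mask).astype(np.uint8)  # 1 = background (area outside the object)

if __name__ == "__main__":
    img = np.zeros((4, 4, 3), dtype=np.uint8)
    img[1:3, 1:3] = 255
    print(generate_background_mask(img))
```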
9. The method according to claim 8, wherein the calling of an image segmentation model based on the image to identify an object in the image and the generating of an image mask corresponding to the background by taking the area outside the object as the background comprise:
taking the image as the input of the image segmentation model to determine an object in the image, taking the region outside the object as the background, and generating an image mask corresponding to the background, wherein the size of the image mask is consistent with the size of the image; or
And reducing the size of the image, taking the reduced image as an input of the image segmentation model to determine an object in the reduced image, taking a region outside the object as a background, and generating an image mask corresponding to the background, wherein the size of the image mask is smaller than that of the image.
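A sketch of the second alternative of claim 9, in which the frame is reduced before segmentation so that the produced mask is smaller than the frame (to be enlarged again later, as in claim 4). The scale factor and the threshold-based stand-in model are assumptions.

```python
# Segment a reduced copy of the frame so the mask is smaller than the frame (claim 9 sketch).
import cv2
import numpy as np

def stub_segmentation_model(image: np.ndarray) -> np.ndarray:
    return (image.mean(axis=2) > 127).astype(np.uint8)

def small_mask_from_reduced_image(frame: np.ndarray, scale: float = 0.25) -> np.ndarray:
    h, w = frame.shape[:2]
    reduced = cv2.resize(frame, (int(w * scale), int(h * scale)), interpolation=cv2.INTER_AREA)
    return stub_segmentation_model(reduced)  # mask size < frame size

if __name__ == "__main__":
    frame = np.zeros((720, 1280, 3), dtype=np.uint8)
    mask = small_mask_from_reduced_image(frame)
    print(frame.shape[:2], mask.shape)  # (720, 1280) vs (180, 320)
```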
10. The method according to claim 7, wherein performing computer vision processing on each acquired frame of image to obtain enhanced image information corresponding to each frame of image comprises:
The following processing is performed for each frame of image:
invoking a key point detection model to perform key point detection on the image so as to obtain the positions of key points of an object in the image;
and determining the positions of the key points of the object as the enhanced image information corresponding to the image;
wherein the key point detection model is obtained by training based on a sample image and the positions of key points of an object marked in the sample image.
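A sketch of claim 10's per-frame key-point detection. `stub_keypoint_model`, which simply reports the brightest pixel, is a stand-in for a trained landmark detector; only the contract matters: frame in, key-point positions out, carried as the frame's enhanced image information.

```python
# Per-frame key-point detection with a stand-in detector (claim 10 sketch).
import numpy as np

def stub_keypoint_model(image: np.ndarray) -> list:
    """Stand-in detector: returns [(x, y)] of the brightest pixel as a fake key point."""
    gray = image.mean(axis=2)
    y, x = np.unravel_index(np.argmax(gray), gray.shape)
    return [(int(x), int(y))]

def enhanced_info_for_frame(image: np.ndarray) -> dict:
    return {"keypoints": stub_keypoint_model(image)}  # carried per frame, e.g. in SEI

if __name__ == "__main__":
    img = np.zeros((10, 10, 3), dtype=np.uint8)
    img[7, 3] = 255
    print(enhanced_info_for_frame(img))  # {'keypoints': [(3, 7)]}
```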
11. The method according to claim 7, wherein performing computer vision processing on each acquired frame of image to obtain enhanced image information corresponding to each frame of image comprises:
The following processing is performed for each frame of image:
invoking a gesture detection model to perform gesture detection so as to obtain gesture information of an object in the image;
and determining the gesture information of the object as enhanced image information corresponding to the image.
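A sketch of claim 11's per-frame gesture detection. `stub_gesture_model` is a trivial stand-in classifier keyed on mean brightness; any trained gesture/pose model with the same frame-in, label-out contract would take its place, with the label becoming the frame's enhanced image information.

```python
# Per-frame gesture detection with a stand-in classifier (claim 11 sketch).
import numpy as np

def stub_gesture_model(image: np.ndarray) -> str:
    return "hand_raised" if image.mean() > 100 else "hand_down"

def enhanced_info_for_frame(image: np.ndarray) -> dict:
    return {"gesture": stub_gesture_model(image)}

if __name__ == "__main__":
    bright = np.full((8, 8, 3), 200, dtype=np.uint8)
    dark = np.zeros((8, 8, 3), dtype=np.uint8)
    print(enhanced_info_for_frame(bright), enhanced_info_for_frame(dark))
```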
12. A video processing apparatus, the apparatus comprising:
a receiving module, configured to receive a video code stream, wherein the video code stream comprises multi-frame images and enhanced image information synchronized with each frame of image, and the enhanced image information comprises one of the following: an image mask, positions of key points of an object in the image, or a gesture of an object in the image;
a decoding processing module, configured to read each network extraction layer unit included in the video code stream and determine the type of each network extraction layer unit; when the type of the read network extraction layer unit is an image slice, perform a slice segmentation decoding operation to obtain the multi-frame images; and when the type of the read network extraction layer unit is the supplementary enhancement information, perform a supplementary enhancement information decoding operation to obtain the enhanced image information synchronized with each frame of image;
a fusion processing module, configured to fuse each frame of image with the enhanced image information synchronized with each frame of image to obtain a composite image corresponding to each frame of image;
and a display module, configured to sequentially display the composite images corresponding to each frame of image in the human-computer interaction interface.
13. A video processing apparatus, the apparatus comprising:
an acquisition module, configured to acquire multi-frame images;
a computer vision processing module, configured to perform computer vision processing on each acquired frame of image to obtain enhanced image information corresponding to each frame of image, wherein the enhanced image information comprises one of the following: an image mask, positions of key points of an object in the image, or a gesture of an object in the image;
a generation module, configured to encode each frame of image and the enhanced image information corresponding to each frame of image respectively, to obtain multi-frame image codes and an enhanced image information code corresponding to each frame of image code; package the multi-frame image codes, in one-to-one correspondence, into a plurality of network extraction layer units whose type is image slice, and package the multi-frame enhanced image information codes, in one-to-one correspondence, into a plurality of network extraction layer units whose type is supplementary enhancement information; and assemble the plurality of network extraction layer units whose type is image slice and the plurality of network extraction layer units whose type is supplementary enhancement information into a video code stream;
and a transmitting module, configured to transmit the video code stream;
wherein the video code stream is configured to be decoded by a terminal device to display a composite image corresponding to each frame of image, the composite image being obtained by fusion processing of each frame of image and the enhanced image information synchronized with each frame of image.
14. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the video processing method of any one of claims 1-6 or any one of claims 7-11 when executing executable instructions stored in said memory.
15. A computer readable storage medium storing executable instructions which, when executed, implement the video processing method of any one of claims 1 to 6 or any one of claims 7 to 11.
16. A computer program product comprising computer executable instructions or a computer program which, when executed by a processor, implements the video processing method of any of claims 1-6 or any of claims 7-11.
CN202011478292.1A 2020-12-15 2020-12-15 Video processing method, video processing device, electronic equipment and computer readable storage medium Active CN114640882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011478292.1A CN114640882B (en) 2020-12-15 2020-12-15 Video processing method, video processing device, electronic equipment and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN114640882A CN114640882A (en) 2022-06-17
CN114640882B (en) 2024-06-28

Family

ID=81944827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011478292.1A Active CN114640882B (en) 2020-12-15 2020-12-15 Video processing method, video processing device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114640882B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117611638B (en) * 2023-12-07 2024-05-17 北京擎锋精密科技有限公司 Multi-target tracking method for vehicles and pedestrians based on image processing


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050104899A1 (en) * 2003-11-19 2005-05-19 Genesis Microchip Inc. Real time data stream processor
EP2713593B1 (en) * 2012-09-28 2015-08-19 Alcatel Lucent, S.A. Immersive videoconference method and system
CN109173263B (en) * 2018-08-31 2021-08-24 腾讯科技(深圳)有限公司 Image data processing method and device
CN109348252B (en) * 2018-11-01 2020-01-10 腾讯科技(深圳)有限公司 Video playing method, video transmission method, device, equipment and storage medium
CN111935491B (en) * 2020-06-28 2023-04-07 百度在线网络技术(北京)有限公司 Live broadcast special effect processing method and device and server

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110475150A (en) * 2019-09-11 2019-11-19 广州华多网络科技有限公司 The rendering method and device of virtual present special efficacy, live broadcast system

Also Published As

Publication number Publication date
CN114640882A (en) 2022-06-17

Similar Documents

Publication Title
US9013536B2 (en) Augmented video calls on mobile devices
CN109076246B (en) Video encoding method and system using image data correction mask
US10645360B2 (en) Methods and systems for transmitting data in a virtual reality system
US7852368B2 (en) Method and apparatus for composing images during video communications
CN111402399B (en) Face driving and live broadcasting method and device, electronic equipment and storage medium
CN110351564B (en) Clear-text video compression transmission method and system
JP6333858B2 (en) System, apparatus, and method for sharing a screen having multiple visual components
CN110809173B (en) Virtual live broadcast method and system based on AR augmented reality of smart phone
US10958950B2 (en) Method, apparatus and stream of formatting an immersive video for legacy and immersive rendering devices
CN108141634B (en) Method and apparatus for generating preview image and computer-readable storage medium
CN111193928B (en) Method and apparatus for delivering region of interest information in video
CN103248830A (en) Real-time video combination method for augmented reality scene of mobile intelligent terminal
CN114640882B (en) Video processing method, video processing device, electronic equipment and computer readable storage medium
CN110049347B (en) Method, system, terminal and device for configuring images on live interface
CN107580228B (en) Monitoring video processing method, device and equipment
US20070269120A1 (en) Video image compression using model plus difference image
CN111406404B (en) Compression method, decompression method, system and storage medium for obtaining video file
CN102843566A (en) Communication method and equipment for three-dimensional (3D) video data
CN114531528B (en) Method for video processing and image processing apparatus
CN116962742A (en) Live video image data transmission method, device and live video system
CN113810725A (en) Video processing method, device, storage medium and video communication terminal
KR101700821B1 (en) Scalable remote screen providing method and apparatus
TWI859470B (en) Method and image-processing device for video processing
CN115150370A (en) Image processing method
CN118612446A (en) Encoding/decoding method, apparatus, device, storage medium, and computer program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40071987

Country of ref document: HK

GR01 Patent grant