CN113630609B - Video encoding method, decoding method, storage medium and terminal equipment


Info

Publication number: CN113630609B
Application number: CN202010374073.2A
Authority: CN (China)
Other versions: CN113630609A (Chinese)
Inventor: 冯万良
Assignee (original and current): TCL Technology Group Co Ltd
Prior art keywords: video frame, image, code stream, background, foreground
Legal status: Active (application granted)

Classifications

    • H04N 19/85: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H04N 19/169: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding

Abstract

The invention discloses a video encoding method, a decoding method, a storage medium and a terminal device. The video encoding method acquires a video frame to be encoded and determines a first foreground image and a first background image of the video frame to be encoded; encodes the first foreground image to obtain a first foreground code stream corresponding to the first foreground image; encodes the first background image based on a preset target video frame to obtain a first background code stream corresponding to the first background image; and determines the encoded code stream corresponding to the video frame to be encoded according to the first foreground code stream and the first background code stream. By splitting the picture of a video frame into a foreground image and a background image and encoding the two independently, the background image does not need to encode all of its image content, which reduces the video compression cost.

Description

Video encoding method, decoding method, storage medium and terminal equipment
Technical Field
The present invention relates to the field of video encoding and decoding technologies, and in particular, to a video encoding method, a decoding method, a storage medium, and a terminal device.
Background
Video is a form of data describing dynamic images; it generally comprises a series of video frames, and playing the video frames in succession displays the dynamic images. Before transmission, the video format file is converted by video encoding into a video code stream suitable for transmission.
The coding techniques currently in common use for video coding encode all of the acquired images. However, for a video whose picture changes little (for example, a video surveillance feed), adjacent frames, and even runs of tens of frames, are nearly identical; yet the optical flow image and the image residual of every frame are still encoded, so processing power and storage bandwidth are spent on the repeated video frames, which increases the video compression cost.
Disclosure of Invention
The technical problem the invention aims to solve is to provide, in view of the defects of the prior art, a video encoding method, a decoding method, a storage medium and a terminal device.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a video encoding method, the method comprising:
acquiring a video frame to be encoded, and determining a first foreground image and a first background image of the video frame to be encoded;
encoding the first foreground image to obtain a first foreground code stream corresponding to the first foreground image;
encoding the first background image based on a preset target video frame to obtain a first background code stream corresponding to the first background image;
and determining the coding code stream corresponding to the video frame to be coded according to the first foreground code stream and the first background code stream.
The video encoding method, wherein the obtaining the video frame to be encoded and determining the first foreground image and the first background image of the video frame to be encoded specifically includes:
acquiring a video frame to be encoded, and judging whether the video frame to be encoded is a target video frame or not;
if the video frame to be encoded is a target video frame, encoding the target video frame to obtain an encoding code stream corresponding to the target video frame;
and if the video frame to be encoded is a non-target video frame, determining a first foreground image and a first background image of the video frame to be encoded.
The video coding method is characterized in that the acquisition time of the target video frame corresponding to the video frame to be coded is earlier than the acquisition time of the video frame to be coded.
The video encoding method, wherein the determining the first foreground image and the first background image of the video frame to be encoded specifically includes:
inputting the video frame to be encoded into a preset image recognition network model, and recognizing object information carried by the video frame to be encoded through the network model;
and determining a first foreground image and a first background image corresponding to the video frame to be encoded based on the object information, wherein the first foreground image comprises an object image corresponding to the object information.
The video encoding method, wherein the encoding the first background image based on the preset target video frame to obtain a first background code stream corresponding to the first background image specifically includes:
determining first pose information of the monitoring device based on a preset target video frame and the video frame to be encoded, wherein the target video frame and the video frame to be encoded are acquired by the monitoring device;
determining a first residual image corresponding to the first background image based on the first pose information, the target video frame and the first foreground image;
and encoding the first residual image and the first pose information to obtain a first background code stream corresponding to the first background image.
The video encoding method, wherein the determining, based on the first pose information, the target video frame, and the first foreground image, a first residual image corresponding to the first background image specifically includes:
acquiring first depth information of the target video frame, and transforming the target video frame based on the first depth information and the first pose information to obtain a predicted video frame;
determining a mask image of the first foreground image according to the first foreground image;
and determining a first residual image corresponding to the first background image based on the video frame to be encoded, the predicted video frame and the mask image.
A video decoding method for decoding an encoded code stream encoded by the video encoding method as described in any one of the above, the video decoding method comprising:
obtaining a code stream to be decoded, wherein the code stream to be decoded comprises a second foreground code stream and a second background code stream;
decoding the second foreground code stream to obtain a second foreground image corresponding to the second foreground code stream;
determining a second background image corresponding to the second background code stream based on the second background code stream and a preset target video frame image;
and generating a video frame image corresponding to the code stream to be decoded according to the second background image and the second foreground image.
The video decoding method, wherein when the code stream to be decoded is obtained, the obtaining the second foreground code stream and the second background code stream carried by the code stream to be decoded specifically includes:
acquiring a code stream to be decoded, and judging whether the code stream to be decoded is a target video frame code stream or not;
when the code stream to be decoded is a target video frame code stream, decoding the code stream to be decoded to obtain a target video frame image;
and when the code stream to be decoded is a non-target video frame code stream, acquiring a second foreground code stream and a second background code stream carried by the code stream to be decoded.
The video decoding method, wherein the second background code stream comprises a second residual image and second pose information; the determining, based on the second background code stream and a preset target video frame image, of the second background image corresponding to the second background code stream specifically includes:
decoding the second background code stream to obtain a second residual image and second pose information corresponding to the second background code stream;
and determining a second background image corresponding to the second background code stream based on the second residual image, the second pose information and a preset target video frame image.
The video decoding method, wherein the determining, based on the second residual image, the second pose information, and a preset target video frame image, a second background image corresponding to the second background code stream specifically includes:
acquiring second depth information of the target video frame image, and performing a pose transformation on the target video frame image based on the second depth information and the second pose information to obtain a predicted image;
and determining a second background image corresponding to the second background code stream based on the predicted image and the second residual image.
A computer readable storage medium storing one or more programs executable by one or more processors to implement steps in a video encoding method as described in any one of the above or to implement steps in a video decoding method as described in any one of the above.
A terminal device, comprising: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements steps in a video encoding method as described in any one of the above, or steps in a video decoding method as described in any one of the above.
The beneficial effects are that: compared with the prior art, the invention provides a video encoding method, a decoding method, a storage medium and a terminal device. The video encoding method acquires a video frame to be encoded and determines a first foreground image and a first background image of the video frame to be encoded; encodes the first foreground image to obtain a first foreground code stream corresponding to the first foreground image; encodes the first background image based on a preset target video frame to obtain a first background code stream corresponding to the first background image; and determines the encoded code stream corresponding to the video frame to be encoded according to the first foreground code stream and the first background code stream. By splitting the picture of a video frame into a foreground image and a background image and encoding the two independently, the background image does not need to encode all of its image content, which reduces the video compression cost.
Drawings
Fig. 1 is a flowchart of a video encoding method provided by the present invention.
Fig. 2 is a schematic flow chart of a video encoding method provided by the present invention.
Fig. 3 is a schematic structural diagram of a terminal device provided by the present invention.
Detailed Description
The invention provides a video encoding method, a decoding method, a storage medium and a terminal device, and in order to make the purposes, technical schemes and effects of the invention clearer and more definite, the invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
This embodiment provides a video encoding method and a video decoding method, which can be applied to an interaction system of a monitoring device and a terminal device. In one application scenario, the monitoring device is connected to the terminal device; the monitoring device captures the video frame to be encoded, encodes the first foreground image and the first background image separately, and transmits the resulting encoded code stream to the terminal device. The terminal device receives the encoded code stream as the code stream to be decoded, where the code stream to be decoded comprises a second foreground code stream and a second background code stream, and generates the video frame image corresponding to the code stream to be decoded based on the second foreground code stream and the second background code stream.
It may be understood that, in the above application scenario, although the actions of the encoding method in the embodiment of the present invention are described as being performed by the monitoring device, and the decoding method is performed by the terminal device, the present invention is not limited in terms of the execution subject, as long as the actions disclosed in the embodiment of the present invention are performed.
The present embodiment provides a video encoding method; as shown in fig. 1 and 2, the method may include the following steps:
s10, acquiring a video frame to be encoded, and determining a first foreground image and a first background image of the video frame to be encoded.
Specifically, the video frame to be encoded may be captured by a monitoring device (for example, a monitoring camera), sent by an external device (a video monitoring device), or acquired through the Internet. The first foreground image is the image formed by the foreground region of the video frame to be encoded, and the first background image is the image formed by its background region; the foreground region is the image region corresponding to the objects in the video frame to be encoded (such as people, or articles carried by people), and the background region is the image region of the video frame to be encoded other than the foreground region. For example, for an image of a person standing on a lawn, the foreground region is the person and the background region is the lawn; the foreground image is the image formed by the image region corresponding to the person, and the background image is the image formed by the image region occupied by the lawn. In this embodiment, the video frame to be encoded is a video frame of a surveillance video. The surveillance video is composed of shots, and within each shot the background of the video picture remains basically unchanged while the foreground changes. After the video frame to be encoded is obtained, it is therefore divided into a first foreground image and a first background image, which are encoded with different encoding modes in the subsequent encoding process. This ensures that the encoded code stream carries the image information of the frame to be encoded, while also reducing the amount of data that needs to be encoded and improving the encoding and transmission speed.
Further, in one implementation of this embodiment, a target video frame is set before the video to be encoded is encoded, where the target video frame is an intra-coded frame, that is, a frame whose encoded code stream can be reconstructed without reference to any other frame. It can be understood that when the video frame to be encoded is the target video frame, it may be encoded directly into the encoded code stream. Thus, the obtaining the video frame to be encoded and determining the first foreground image and the first background image of the video frame to be encoded specifically includes:
acquiring a video frame to be encoded;
if the video frame to be encoded is a target video frame, encoding the target video frame to obtain an encoding code stream corresponding to the target video frame;
and if the video frame to be encoded is a non-target video frame, determining a first foreground image and a first background image of the video frame to be encoded.
Specifically, the target video frame is used to provide background image information for the video frame to be encoded, and the acquisition time of the target video frame corresponding to the video frame to be encoded is earlier than the acquisition time of the video frame to be encoded. The target video frame encodes all of its image information as a foreground image; in other words, when the target video frame is acquired, all of its image information can be encoded directly as a foreground image. The target video frame is the encoding reference frame of the non-target video frames (denoted sequence video frames): when a sequence video frame is encoded, its residual information is determined based on the target video frame. The target video frame is configured with a frame number, and after the video frame to be encoded is acquired, its frame number can be used to determine whether it is the target video frame. For example, if the frame number of the target video frame is the first frame, then after the video frame to be encoded is obtained, its frame number is checked: if it is the first frame, the video frame to be encoded is the target video frame; otherwise it is a non-target video frame.
In addition, in a specific implementation of this embodiment, the video frame whose frame number is the first frame is taken as the target video frame, and during video acquisition, after every preset number N of video frames is acquired, the frame number of the target video frame is advanced by N. For example, suppose the frame number of the target video frame is 1 and the target video frame is updated once every 20 acquired video frames: when 20 video frames have been acquired, the frame number of the target video frame is updated to 20+1=21, that is, the 21st video frame is the target video frame; when 40 video frames have been acquired, the frame number of the target video frame is updated to 21+20=41, that is, the 41st video frame is the target video frame. In the subsequent encoding and decoding of the sequence video frames, the background images of the sequence video frames do not encode their own image content; instead, at decoding time the background image of a sequence video frame is determined based on the background image of the target video frame. Updating the target video frame every preset number of video frames therefore keeps the background images of the sequence video frames similar to the background image of the target video frame, which improves the fidelity of the video images decoded from the encoded code streams of the sequence video frames.
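As an illustration only (the function and parameter names below are illustrative, not taken from the embodiment), this target-frame scheduling can be sketched in Python:

```python
def is_target_frame(frame_number: int, interval: int = 20) -> bool:
    """Return True if the frame with this 1-based number is a target
    video frame, assuming the first frame is a target frame and the
    target frame is advanced by `interval` frames thereafter
    (targets at frames 1, 21, 41, ...)."""
    return (frame_number - 1) % interval == 0


# Frames 1, 21 and 41 are target frames; all others are sequence frames.
assert is_target_frame(1) and is_target_frame(21) and is_target_frame(41)
assert not is_target_frame(2) and not is_target_frame(40)
```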
Further, in an implementation manner of this embodiment, the determining the first foreground image and the first background image of the video frame to be encoded specifically includes:
inputting the video frame to be encoded into a preset image recognition network model, and recognizing object information carried by the video frame to be encoded through the network model;
and determining a first foreground image and a first background image corresponding to the video frame to be encoded based on the object information, wherein the first foreground image comprises an object image corresponding to the object information.
Specifically, the object information is the position information of a person or object in the video frame to be encoded. For example, if the image content is a square on a table, and the four vertices of the square's region in the image are (10, 10), (10, 20), (20, 20) and (20, 10), then the object information is the square region with (10, 10), (10, 20), (20, 20) and (20, 10) as vertices. The preset image recognition network model is a trained network model whose input is a video image and whose output is the object position information corresponding to that video image; it can be understood that the model recognizes the position information of each object in the video image. The preset image recognition network model may be a convolutional neural network for object detection, such as a Faster R-CNN network. After the object information corresponding to the video frame to be encoded is acquired, the image formed by the object regions corresponding to the object information is taken as the first foreground image, and the image obtained by removing the first foreground image from the picture of the video frame to be encoded is taken as the first background image. For example, for an image of a person standing on a lawn, the foreground region is the person and the background region is the lawn; the foreground image is the image formed by the image region corresponding to the person, and the background image is the image formed by the image region occupied by the lawn.
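For illustration, a pre-trained detector from torchvision can stand in for the preset image recognition network model. The sketch below assumes a recent torchvision; the score threshold and the box-to-mask conversion are simplifying assumptions, not details fixed by the embodiment:

```python
import torch
import torchvision

# A pre-trained Faster R-CNN standing in for the preset image recognition
# network model; it returns the position information (bounding boxes)
# of the objects detected in the frame.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def split_foreground_background(frame: torch.Tensor, score_thresh: float = 0.5):
    """frame: float tensor of shape (3, H, W) with values in [0, 1].
    Returns (first foreground image, first background image)."""
    with torch.no_grad():
        detections = model([frame])[0]  # dict with 'boxes', 'labels', 'scores'
    mask = torch.zeros_like(frame[0])   # (H, W); 1 = object pixel, 0 = background
    for box, score in zip(detections["boxes"], detections["scores"]):
        if score >= score_thresh:
            x0, y0, x1, y1 = box.int().tolist()
            mask[y0:y1, x0:x1] = 1.0
    foreground = frame * mask           # first foreground image
    background = frame * (1.0 - mask)   # first background image
    return foreground, background
```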
S20, encoding the first foreground image to obtain a first foreground code stream corresponding to the first foreground image.
Specifically, the first foreground code stream is the encoded file formed by encoding the first foreground image, and it contains the image information of the first foreground image. The first foreground image is typically an image of a person, or of an object moved by a person, captured in the surveillance video; surveillance video is mainly concerned with the people, and the objects moved by people, that pass in front of the monitoring device (e.g., a monitoring camera). The first foreground image is therefore encoded directly. For example, the first foreground image may be encoded using adaptive intra (Intra) / inter (Inter) prediction coding to obtain the encoded code stream corresponding to the first foreground image.
S30, encoding the first background image based on a preset target video frame to obtain a first background code stream corresponding to the first background image.
Specifically, the first background code stream includes first pose information of the monitoring device (e.g., a monitoring camera) that acquired the video frame to be encoded. The first pose information represents the pose of the monitoring device at the moment it captured the video frame to be encoded; in other words, it represents the spatial position state of the monitoring device at acquisition time. In one implementation of this embodiment, the first pose information consists of six pose parameters: the coordinates on the three spatial axes and the rotation about each axis. That is, the first pose information includes an X coordinate, a Y coordinate, a Z coordinate, and rotation angles about the X, Y and Z axes, from which the spatial position of the monitoring device when it captured the video frame to be encoded can be determined.
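As a minimal sketch, the six pose parameters might be held in a structure such as the following (the names are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class CameraPose:
    """Six-parameter pose of the monitoring device: a position on the
    three spatial coordinate axes plus a rotation angle (in radians)
    about each axis."""
    x: float
    y: float
    z: float
    rot_x: float
    rot_y: float
    rot_z: float

# Pose of the device when the video frame to be encoded was captured.
first_pose = CameraPose(x=0.0, y=1.5, z=0.0, rot_x=0.0, rot_y=0.1, rot_z=0.0)
```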
Further, in an implementation manner of this embodiment, the encoding the first background image based on the preset target video frame to obtain a first background code stream corresponding to the first background image specifically includes:
s31, determining first posture information of monitoring equipment based on a preset target video frame and the video frame to be encoded, wherein the target video frame and the video frame to be encoded are acquired by the monitoring equipment;
s32, determining a first residual image corresponding to the first background image based on the first gesture information, the target video frame and the first foreground image;
and S33, encoding the first residual image and the first posture information to obtain a first background code stream corresponding to the first background image.
Specifically, the first pose information is obtained through a preset pose recognition network model. The input of the preset pose recognition network model is a group of two images, a first image and a second image, where the first image serves as the reference image of the second image so that the pose information corresponding to the second image can be determined relative to it; the output is that pose information. The first image and the second image are acquired by the same image acquisition device (such as a monitoring device); the pose of the device when it captured the first image is known, while its pose when it captured the second image is unknown. When the first image and the second image are input to the preset pose recognition network model, the model outputs the pose information corresponding to the second image. For example, if the pose information A of the device when capturing the first image is known and the pose information B when capturing the second image is unknown, then inputting the two images as an image group causes the model to output the pose information B corresponding to the second image. Of course, it should be noted that the preset pose recognition network model is a trained network model.
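The embodiment does not fix an architecture for the pose recognition network model; the following PyTorch sketch is one illustrative possibility, assuming the image group is formed by channel-wise concatenation of the two images:

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """Takes a reference image and a query image (concatenated into a
    6-channel input) and regresses the six pose parameters of the
    device at the moment the query image was captured."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, 6)  # x, y, z, rot_x, rot_y, rot_z

    def forward(self, reference: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([reference, query], dim=1)  # (B, 6, H, W)
        return self.head(self.features(pair).flatten(1))

# E.g. target video frame as reference, video frame to be encoded as query:
net = PoseNet()
pose = net(torch.rand(1, 3, 240, 320), torch.rand(1, 3, 240, 320))  # (1, 6)
```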
The first residual image is the image formed by the difference pixels among the video frame to be encoded, the predicted video frame and the mask image. For example, if the video frame to be encoded contains pixels A, B and C, the predicted video frame contains pixel A, and the mask image contains pixel B, then the first residual image contains pixel C. The difference pixels also include pixels whose pixel values differ: for example, if pixel A has value 125 in the video frame to be encoded and value 120 in the predicted video frame, then pixel A is a difference pixel and the first residual image contains pixel A. The difference pixels may therefore be pixels covered by the predicted video frame and the mask image, or pixels whose values differ between the predicted video frame and the video frame to be encoded.
The first background code stream is the code stream file obtained by encoding the first residual image and the first pose information, from which the first residual image and the first pose information can be recovered when the first background code stream is decoded. For example, the first background code stream may be a code stream file obtained by losslessly compressing the first residual image and the first pose information.
Further, in this embodiment, the target video frame is the reference image of the video frame to be encoded, and the pose information of the monitoring device corresponding to the target video frame is known. Then, after the target video frame and the video frame to be encoded are input as a group of images into the preset pose recognition network model, the model can output the first pose information corresponding to the video frame to be encoded; the first pose information is obtained in this way. Of course, it should be noted that this is only one implementation of determining the first pose information, and other manners may be adopted in practical applications. For example, the monitoring device may be equipped with an acceleration sensor and an angular velocity sensor: the angle information at the moment the video frame to be encoded is captured is determined by the angular velocity sensor, and the position information by the acceleration sensor, from which the monitoring device obtains the first pose information corresponding to the image to be encoded.
Further, in the step S32, the determining, based on the first pose information, the target video frame, and the first foreground image, a first residual image corresponding to the first background image specifically includes:
acquiring first depth information of the target video frame, and transforming the target video frame based on the first depth information and the first pose information to obtain a predicted video frame;
determining a mask image of the first foreground image according to the first foreground image;
and determining a first residual image corresponding to the first background image based on the video frame to be encoded, the predicted video frame and the mask image.
Specifically, the first depth information represents the distance between each pixel in the target video frame and the monitoring device. It can be understood that the target video frame carries depth information, so when the target video frame is acquired, the depth information of all of its points can be extracted to obtain the depth map corresponding to the target video frame, and this depth map is used as the first depth information of the target video frame.
Further, in one implementation of this embodiment, the process of transforming the target video frame based on the first depth information and the first pose information may be as follows: determine, based on the first pose information, the rotation matrix and translation vector corresponding to the target video frame; rotate and translate the target video frame according to the rotation matrix and translation vector; determine, from the same rotation and translation, the change in each pixel's depth within the first depth information and adjust each pixel's depth information accordingly; and take the adjusted video frame as the predicted video frame. The predicted video frame is the prediction of the video frame to be encoded. It can be understood that if the scene captured by the monitoring device is unchanged and only the spatial position of the monitoring device has changed, then the picture of the video frame to be encoded captured after the change is the same as the picture of the predicted video frame obtained by spatially transforming the target video frame.
In addition, the rotation matrix is the transformation matrix from the spatial coordinate system corresponding to the target video frame to the spatial coordinate system corresponding to the video frame to be encoded, and the translation vector is the translation from the former coordinate system to the latter. The spatial coordinate system corresponding to the target video frame is the camera coordinate system of the monitoring device when it captured the target video frame, determined from the pose information corresponding to the target video frame; the spatial coordinate system corresponding to the video frame to be encoded is the camera coordinate system of the monitoring device when it captured the video frame to be encoded, determined from the first pose information. For example, let the spatial coordinate system corresponding to the target video frame be coordinate system A, let that of the video frame to be encoded be coordinate system B, and let the rotation matrix be R and the translation vector be T; then coordinate system B is obtained by rotating coordinate system A according to R and then translating it according to T. The target video frame can therefore be spatially transformed based on the rotation matrix and the translation vector to obtain a spatially transformed reference video frame. For each first pixel in the reference video frame, the depth information corresponding to that pixel is determined from its distance to the monitoring device, and this depth information is used to update the depth information of the corresponding second pixel in the first depth information, yielding updated depth information; finally, the updated depth information is taken as the depth information corresponding to the reference video frame, which gives the predicted video frame.
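The transformation just described can be sketched with plain numpy as a depth-based reprojection; the camera intrinsic matrix K is an assumption of the sketch (the embodiment does not mention intrinsics explicitly):

```python
import numpy as np

def warp_to_predicted_frame(target, depth, K, R, t):
    """Warp the target video frame into the camera coordinate system of
    the video frame to be encoded, given the rotation matrix R and
    translation vector t derived from the first pose information.
    target: (H, W, 3) image; depth: (H, W) first depth information;
    K: (3, 3) camera intrinsic matrix (assumed known)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T

    # Back-project every pixel to a 3D point using its depth, then
    # rotate and translate into the new coordinate system.
    points = np.linalg.inv(K) @ pixels * depth.reshape(1, -1)
    points = R @ points + t.reshape(3, 1)

    # Project back onto the image plane; keep points in front of the
    # camera that land inside the image bounds.
    proj = K @ points
    z = proj[2]
    ok = z > 1e-6
    u2 = np.round(proj[0, ok] / z[ok]).astype(int)
    v2 = np.round(proj[1, ok] / z[ok]).astype(int)
    inb = (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)

    predicted = np.zeros_like(target)
    predicted[v2[inb], u2[inb]] = target.reshape(-1, 3)[ok][inb]
    return predicted
```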
Of course, it should be noted that any method capable of determining, from the first pose information and the target video frame, the rotation matrix and translation vector that transform the target video frame into the spatial coordinate system corresponding to the first pose information may be used, and all such methods fall within the protection scope of the present application.
Further, the mask image may be represented by, for example, using 1 for the pixels corresponding to the object image in the first foreground image and 0 for the remaining pixels. The determining of the first residual image corresponding to the first background image based on the video frame to be encoded, the predicted video frame and the mask image may be: compute the residual image between the video frame to be encoded and the predicted video frame, and remove the image information of the first foreground image from that residual image via the mask image to obtain the first residual image. Taking the residual between the video frame to be encoded and the predicted video frame avoids the error of using the spatially transformed target video frame directly as the background image of the image to be encoded; removing the masked foreground region from the residual image further reduces the image information the residual image has to carry.
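Continuing the sketch, the first residual image could then be computed as follows, assuming the mask image uses 1 for foreground pixels and 0 elsewhere as described above:

```python
import numpy as np

def first_residual_image(frame_to_encode, predicted_frame, mask):
    """Residual of the first background image: the difference between
    the video frame to be encoded and the predicted video frame, with
    the first foreground region removed via the mask image.
    frame_to_encode, predicted_frame: (H, W, 3) uint8 images;
    mask: (H, W) with 1 at foreground (object) pixels, 0 elsewhere."""
    residual = frame_to_encode.astype(np.int16) - predicted_frame.astype(np.int16)
    return residual * (1 - mask[..., None].astype(np.int16))
```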
S40, determining the coding code stream corresponding to the video frame to be coded according to the first foreground code stream and the first background code stream.
Specifically, the first foreground code stream and the first background code stream are both code streams corresponding to the video frame to be encoded, so after they are obtained, the two can be associated. For example, in one embodiment, the same code stream identifier is configured for the first foreground code stream and the first background code stream to associate them, where the code stream identifier may be the frame number of the video frame to be encoded. Of course, it should be noted that the first foreground code stream and the first background code stream may be encoded synchronously, or the first foreground code stream may be encoded before the first background code stream, or the first background code stream before the first foreground code stream.
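For example, the association can be as simple as storing both code streams alongside the shared identifier (an illustrative container, not a format defined by the embodiment):

```python
from dataclasses import dataclass

@dataclass
class EncodedFrame:
    """Encoded code stream of one non-target video frame: the first
    foreground and first background code streams, associated by a
    shared code stream identifier (here, the frame number)."""
    frame_number: int         # code stream identifier
    foreground_stream: bytes  # first foreground code stream
    background_stream: bytes  # first background code stream
```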
Based on the above video encoding method, this embodiment further provides a video decoding method for decoding the encoded code stream obtained by the video encoding method of the foregoing embodiment. The video decoding method includes:
M10, acquiring a code stream to be decoded, wherein the code stream to be decoded comprises a second foreground code stream and a second background code stream;
M20, decoding the second foreground code stream to obtain a second foreground image corresponding to the second foreground code stream;
M30, determining a second background image corresponding to the second background code stream based on the second background code stream and a preset target video frame image;
and M40, generating a video frame image corresponding to the code stream to be decoded according to the second background image and the second foreground image.
Specifically, in the step M10, the code stream to be decoded is an encoded code stream obtained by the video encoding method described in the above embodiment; it may be a target video frame code stream corresponding to a target video frame, or an encoded code stream corresponding to a sequence video frame. From the encoding method of the above embodiment it can be seen that the encoded code stream corresponding to the target video frame is obtained by encoding the target video frame directly, so the target video frame code stream carries no foreground code stream or background code stream. Therefore, when the code stream to be decoded is obtained, it is necessary to judge whether it is the target video frame code stream corresponding to the target video frame. Correspondingly, in an implementation of this embodiment, when the code stream to be decoded is obtained, the obtaining of the second foreground code stream and the second background code stream carried by the code stream to be decoded specifically includes:
acquiring a code stream to be decoded, and judging whether the code stream to be decoded is a target video frame code stream or not;
when the code stream to be decoded is a target video frame code stream, decoding the code stream to be decoded to obtain a target video frame image;
and when the code stream to be decoded is a non-target video frame code stream, acquiring a second foreground code stream and a second background code stream carried by the code stream to be decoded.
Specifically, whether the code stream to be decoded is a target video frame code stream may be determined by analyzing the code stream to be decoded against the target video frame code stream, or by judging whether the frame number of the video frame corresponding to the code stream to be decoded equals the frame number corresponding to the target video frame; alternatively, a target video frame identifier may be set in the code stream: when the code stream to be decoded carries the target video frame identifier, it is judged to be a key code stream, and when it does not carry the identifier, it is judged to correspond to a sequence video frame. Of course, it should be noted that the second foreground code stream and the second background code stream have the same meaning as the first foreground code stream and the first background code stream in the video encoding method described above; the prefixes "first" and "second" are used only for distinction. That is, the second foreground code stream is the first foreground code stream obtained by encoding the foreground image of the video frame to be encoded corresponding to the code stream to be decoded, and the second background code stream is the first background code stream obtained by encoding the background image of that video frame. It can be understood that the second foreground code stream is the encoded code stream of the image formed by the foreground region of the video frame to be encoded, and the second background code stream is the encoded code stream of the image formed by its background region, both encoded in the manner described in the examples above.
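A sketch of this dispatch is shown below; the field names on the code stream object (is_target, payload and the two sub-streams) are assumptions of the sketch, and the four callables stand in for the decoding steps of this embodiment:

```python
def decode(stream, decode_intra, decode_foreground, reconstruct_background, compose):
    """Dispatch a code stream to be decoded. `stream` is assumed to carry
    an `is_target` flag plus either intra-coded data or a foreground /
    background code stream pair."""
    if stream.is_target:
        # Target video frame code stream: decode it directly into the
        # target video frame image (also kept as the decoder's reference).
        return decode_intra(stream.payload)
    # Non-target code stream: decode foreground and background
    # separately, then compose the video frame image.
    foreground = decode_foreground(stream.foreground_stream)
    background = reconstruct_background(stream.background_stream)
    return compose(foreground, background)
```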
Further, as can be seen from the foregoing embodiments, the second background code stream includes a second residual image and second pose information. The determining, based on the second background code stream and a preset target video frame image, of the second background image corresponding to the second background code stream specifically includes: decoding the second background code stream to obtain the second residual image and the second pose information corresponding to the second background code stream; and determining the second background image corresponding to the second background code stream based on the second residual image, the second pose information and a preset target video frame image. The second residual image is the first residual image corresponding to the video frame to be encoded, and the second pose information is the first pose information corresponding to the video frame to be encoded.
It can be understood that the second pose information represents the pose of the monitoring device when it captured the video frame to be encoded that corresponds to the code stream to be decoded; in other words, it represents the spatial position state of the monitoring device at the moment of acquisition. In one implementation of this embodiment, the second pose information consists of six pose parameters: the coordinates on the three spatial axes and the rotation about each axis. That is, the second pose information includes an X coordinate, a Y coordinate, a Z coordinate, and rotation angles about the X, Y and Z axes, from which the spatial position of the monitoring device when it captured the video frame to be encoded can be determined.
The second residual image is the image formed by the difference pixels among the video frame to be encoded, the predicted video frame and the mask image. For example, if the video frame to be encoded contains pixels A, B and C, the predicted video frame contains pixel A, and the mask image contains pixel B, then the second residual image contains pixel C. The difference pixels also include pixels whose pixel values differ: for example, if pixel A has value 125 in the video frame to be encoded and value 120 in the predicted video frame, then pixel A is a difference pixel and the second residual image contains pixel A. The difference pixels may therefore be pixels covered by the predicted video frame and the mask image, or pixels whose values differ between the predicted video frame and the video frame to be encoded.
The second background code stream is the code stream file obtained by encoding the second residual image and the second pose information, from which the second residual image and the second pose information can be recovered when the second background code stream is decoded. For example, the second background code stream may be a code stream file obtained by losslessly compressing the second residual image and the second pose information.
Further, the determining, based on the second residual image, the second pose information, and a preset target video frame image, a second background image corresponding to the second background code stream specifically includes:
acquiring second depth information of the target video frame image, and performing a pose transformation on the target video frame image based on the second depth information and the second pose information to obtain a predicted image;
and determining a second background image corresponding to the second background code stream based on the predicted image and the second residual image.
Specifically, the preset target video frame image is the video frame image obtained by decoding the target video frame code stream corresponding to the received target video frame. The second depth information is the depth information of the target video frame image, representing the distance between each pixel in the target video frame image and the monitoring device. It can be understood that the target video frame image carries depth information, so when the target video frame image is obtained, the depth information of all of its pixels can be extracted to obtain the corresponding depth map, which is used as the second depth information of the target video frame image. In addition, the process of pose-transforming the target video frame image based on the second depth information and the second pose information to obtain the predicted image is the same as the process, in the above embodiment, of transforming the target video frame based on the first depth information and the first pose information to obtain the predicted video frame; refer to the examples above for the details, which are not repeated here.
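Putting the decoder-side steps together (a sketch under the same assumptions as the encoder-side examples; warp_to_predicted_frame is the illustrative function defined earlier):

```python
import numpy as np

def reconstruct_frame(target_image, depth, K, R, t,
                      second_residual, second_foreground, mask):
    """Decoder side: pose-transform the target video frame image into the
    predicted image, add the second residual image to recover the second
    background image, then overlay the second foreground image."""
    predicted = warp_to_predicted_frame(target_image, depth, K, R, t)
    background = np.clip(predicted.astype(np.int16) + second_residual, 0, 255)
    fg = mask[..., None].astype(bool)            # 1 = foreground pixel
    frame = np.where(fg, second_foreground, background)
    return frame.astype(np.uint8)                # final video frame image
```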
Based on the video encoding method and the video decoding method described above, the present embodiment provides a computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps in the video encoding method described in the above embodiment or to implement the steps in the video decoding method described in the above embodiment.
Based on the video encoding method and the video decoding method described above, the invention further provides a terminal device. As shown in fig. 3, the terminal device comprises at least one processor 20, a display screen 21 and a memory 22, and may further comprise a communication interface 23 and a bus 24. The processor 20, the display screen 21, the memory 22 and the communication interface 23 may communicate with each other via the bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may invoke logic instructions in the memory 22 to perform the methods of the embodiments described above.
Further, the logic instructions in the memory 22 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium.
The memory 22, as a computer readable storage medium, may be configured to store a software program, a computer executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 performs functional applications and data processing, i.e. implements the methods of the embodiments described above, by running software programs, instructions or modules stored in the memory 22.
The memory 22 may include a program storage area, which may store an operating system and at least one application program required by a function, and a data storage area, which may store data created according to the use of the terminal device, etc. In addition, the memory 22 may include high-speed random access memory and may also include non-volatile memory. For example, any of a variety of media capable of storing program code, such as a USB disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, may be used, as may a transitory storage medium.
In addition, the specific process by which the processor of the terminal device loads and executes the plurality of instructions in the storage medium has been described in detail in the method above and is not restated here.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method of video encoding, the method comprising:
acquiring a video frame to be encoded, and determining a first foreground image and a first background image of the video frame to be encoded;
encoding the first foreground image to obtain a first foreground code stream corresponding to the first foreground image;
encoding the first background image based on a preset target video frame to obtain a first background code stream corresponding to the first background image;
determining an encoding code stream corresponding to the video frame to be encoded according to the first foreground code stream and the first background code stream;
the encoding the first background image based on the preset target video frame to obtain a first background code stream corresponding to the first background image specifically includes:
determining first pose information of the monitoring device based on a preset target video frame and the video frame to be encoded, wherein the target video frame and the video frame to be encoded are acquired by the monitoring device;
determining a first residual image corresponding to the first background image based on the first pose information, the target video frame and the first foreground image;
encoding the first residual image and the first pose information to obtain a first background code stream corresponding to the first background image;
the determining, based on the first pose information, the target video frame, and the first foreground image, a first residual image corresponding to the first background image specifically includes:
acquiring first depth information of the target video frame, and transforming the target video frame based on the first depth information and the first pose information to obtain a predicted video frame;
determining a mask image of the first foreground image according to the first foreground image; and determining a first residual image corresponding to the first background image based on the video frame to be encoded, the predicted video frame and the mask image.
2. The video coding method according to claim 1, wherein the acquiring the video frame to be coded and determining the first foreground image and the first background image of the video frame to be coded specifically comprises:
acquiring a video frame to be encoded;
if the video frame to be encoded is a target video frame, encoding the target video frame to obtain an encoding code stream corresponding to the target video frame;
and if the video frame to be encoded is a non-target video frame, determining a first foreground image and a first background image of the video frame to be encoded.
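Purely as an illustration of this branching, the encoder-side dispatch of claim 2 could look as follows; every name is hypothetical, and the concrete codecs are injected as callables so the sketch stays self-contained:

```python
def encode_frame(frame, is_target_frame, intra_codec, split_fn, fg_codec, bg_codec):
    """Hypothetical dispatch: a target frame is coded in full, any other
    frame is split first and its parts coded separately (see the
    segmentation sketch after claim 4 for one possible split_fn)."""
    if is_target_frame:
        return ("target", intra_codec(frame))       # whole-frame code stream
    foreground, background = split_fn(frame)
    return ("non-target", fg_codec(foreground), bg_codec(background))
```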
3. The video encoding method according to claim 1, wherein the acquisition time of the target video frame corresponding to the video frame to be encoded is earlier than the acquisition time of the video frame to be encoded.
4. The video encoding method according to claim 1, wherein the determining a first foreground image and a first background image of the video frame to be encoded comprises:
inputting the video frame to be encoded into a preset image recognition network model, and recognizing, through the network model, object information carried by the video frame to be encoded;
and determining a first foreground image and a first background image corresponding to the video frame to be encoded based on the object information, wherein the first foreground image comprises an object image corresponding to the object information.
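As a non-limiting sketch of this step, a segmentation-style recognition network that returns per-pixel object probabilities could drive the split as follows; seg_model and all other names are placeholders, not the patent's implementation:

```python
import numpy as np

def split_foreground_background(frame, seg_model, thresh=0.5):
    """Split a frame with a preset recognition network (placeholder).

    seg_model(frame) is assumed to return an H x W map of per-pixel
    object probabilities; pixels above `thresh` are treated as the
    object (foreground), the rest as background.
    """
    prob = seg_model(frame)                      # hypothetical network call
    fg_mask = prob > thresh                      # also yields the mask image
    foreground = np.where(fg_mask[..., None], frame, 0)
    background = np.where(fg_mask[..., None], 0, frame)
    return foreground, background, fg_mask
```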
5. The video encoding method according to claim 1, wherein the determining a first residual image corresponding to the first background image based on the first pose information, the target video frame, and the first foreground image specifically comprises:
acquiring first depth information of the target video frame, and transforming the target video frame based on the first depth information and the first pose information to obtain a predicted video frame;
determining a mask image of the first foreground image according to the first foreground image;
and determining a first residual image corresponding to the first background image based on the video frame to be encoded, the predicted video frame and the mask image.
6. A video decoding method for decoding a code stream encoded by the video encoding method according to any one of claims 1 to 5, the video decoding method comprising:
obtaining a code stream to be decoded, wherein the code stream to be decoded comprises a second foreground code stream and a second background code stream;
decoding the second foreground code stream to obtain a second foreground image corresponding to the second foreground code stream;
determining a second background image corresponding to the second background code stream based on the second background code stream and a preset target video frame image;
generating a video frame image corresponding to the code stream to be decoded according to the second background image and the second foreground image;
the determining, based on the second background code stream and the preset target video frame image, the second background image corresponding to the second background code stream specifically comprises:
decoding the second background code stream to obtain a second residual image and second pose information corresponding to the second background code stream;
determining a second background image corresponding to the second background code stream based on the second residual image, the second pose information, and the preset target video frame image;
the determining, based on the second residual image, the second pose information, and the preset target video frame image, the second background image corresponding to the second background code stream specifically comprises:
acquiring second depth information of the target video frame image, and performing pose transformation on the target video frame image based on the second depth information and the second pose information to obtain a predicted image;
and determining a second background image corresponding to the second background code stream based on the predicted image and the second residual image.
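For illustration, the decoder side mirrors the encoder: it re-creates the predicted image from the decoded pose and the stored target frame (reusing the hypothetical predict_background sketch given after claim 1), adds the transmitted residual, and composites the foreground back in. All names remain illustrative:

```python
import numpy as np

# predict_background() is the warp sketched after claim 1; the decoder
# repeats it with the *decoded* pose so both sides predict identically,
# which is why only the residual and the pose need to be transmitted.
def decode_background(residual, pose, target_frame, depth, K):
    predicted = predict_background(target_frame, depth, pose, K)
    restored = predicted.astype(np.int16) + residual
    return np.clip(restored, 0, 255).astype(np.uint8)

def compose_frame(background, foreground, fg_mask):
    # Foreground pixels overwrite the reconstructed background.
    return np.where(fg_mask[..., None], foreground, background)
```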
7. The video decoding method according to claim 6, wherein the obtaining the code stream to be decoded and the second foreground code stream and the second background code stream carried by the code stream to be decoded specifically comprises:
acquiring a code stream to be decoded, and judging whether the code stream to be decoded is a target video frame code stream;
when the code stream to be decoded is a target video frame code stream, decoding the code stream to be decoded to obtain a target video frame image;
and when the code stream to be decoded is a non-target video frame code stream, acquiring a second foreground code stream and a second background code stream carried by the code stream to be decoded.
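A minimal sketch of this decode-side branch, symmetric to the encoder dispatch shown after claim 2 (all names hypothetical, decoders injected as callables):

```python
def decode_stream(kind, payload, intra_decoder, fg_decoder, bg_decoder, compose):
    """Hypothetical dispatch for claim 7: target-frame streams carry a
    whole image; non-target streams carry foreground and background parts."""
    if kind == "target":
        return intra_decoder(payload)            # target video frame image
    fg_stream, bg_stream = payload
    return compose(fg_decoder(fg_stream), bg_decoder(bg_stream))
```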
8. A computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps of the video encoding method according to any one of claims 1 to 5 or the steps of the video decoding method according to any one of claims 6 to 7.
9. A terminal device, comprising: a processor, a memory, and a communication bus, the memory having stored thereon a computer readable program executable by the processor;
the communication bus enables connection and communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps of the video encoding method according to any one of claims 1 to 5 or the steps of the video decoding method according to any one of claims 6 to 7.
CN202010374073.2A 2020-05-06 2020-05-06 Video encoding method, decoding method, storage medium and terminal equipment Active CN113630609B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010374073.2A CN113630609B (en) 2020-05-06 2020-05-06 Video encoding method, decoding method, storage medium and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010374073.2A CN113630609B (en) 2020-05-06 2020-05-06 Video encoding method, decoding method, storage medium and terminal equipment

Publications (2)

Publication Number Publication Date
CN113630609A CN113630609A (en) 2021-11-09
CN113630609B true CN113630609B (en) 2024-03-12

Family

ID=78376671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010374073.2A Active CN113630609B (en) 2020-05-06 2020-05-06 Video encoding method, decoding method, storage medium and terminal equipment

Country Status (1)

Country Link
CN (1) CN113630609B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114125462B (en) * 2021-11-30 2024-03-12 北京达佳互联信息技术有限公司 Video processing method and device
CN117857816A (en) * 2022-09-30 2024-04-09 中国电信股份有限公司 Video transmission method, device, electronic equipment and storage medium
CN115439501B (en) * 2022-11-09 2023-04-07 慧视云创(北京)科技有限公司 Video stream dynamic background construction method and device and moving object detection method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102006473A (en) * 2010-11-18 2011-04-06 无锡中星微电子有限公司 Video encoder and encoding method, and video decoder and decoding method
CN108898842A (en) * 2018-07-02 2018-11-27 武汉大学深圳研究院 A kind of high efficiency encoding method and its system of multi-source monitor video

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8848802B2 (en) * 2009-09-04 2014-09-30 Stmicroelectronics International N.V. System and method for object based parametric video coding


Also Published As

Publication number Publication date
CN113630609A (en) 2021-11-09

Similar Documents

Publication Publication Date Title
CN113630609B (en) Video encoding method, decoding method, storage medium and terminal equipment
Hao et al. Controllable video generation with sparse trajectories
Wang et al. Towards analysis-friendly face representation with scalable feature and texture compression
WO2018150083A1 (en) A method and technical equipment for video processing
CN110620924B (en) Method and device for processing coded data, computer equipment and storage medium
CN110598139A (en) Web browser augmented reality real-time positioning method based on 5G cloud computing
CN110944201A (en) Method, device, server and storage medium for video duplicate removal compression
CN112714321B (en) Compressed video processing method, device, equipment and computer readable storage medium
CN114359333A (en) Moving object extraction method and device, computer equipment and storage medium
CN112261417A (en) Video pushing method and system, equipment and readable storage medium
CN114639004A (en) Multi-focus image fusion method and device
JP2018207356A (en) Image compression program, image compression device, and image compression method
CN114501031A (en) Compression coding and decompression method and device
CN111988621A (en) Video processor training method and device, video processing device and video processing method
CN110189272B (en) Method, apparatus, device and storage medium for processing image
CN115471765B (en) Semantic segmentation method, device and equipment for aerial image and storage medium
CN114973396B (en) Image processing method, image processing device, terminal equipment and computer readable storage medium
CN116912488B (en) Three-dimensional panorama segmentation method and device based on multi-view camera
WO2024066123A1 (en) Lossless hierarchical coding of image feature attributes
WO2023133888A1 (en) Image processing method and apparatus, remote control device, system, and storage medium
WO2023133889A1 (en) Image processing method and apparatus, remote control device, system and storage medium
CN110796075B (en) Face diversity data acquisition method, device, equipment and readable storage medium
CN114205585A (en) Face video coding method, decoding method and device
CN117764834A (en) Image restoration method and device and electronic equipment
CN112418279A (en) Image fusion method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant