CN110460856B - Video encoding method, video encoding device, video encoding apparatus, and computer-readable storage medium - Google Patents


Info

Publication number
CN110460856B
Authority
CN
China
Prior art keywords
reference frame
image
macro block
frame
current frame
Prior art date
Legal status
Active
Application number
CN201910829454.2A
Other languages
Chinese (zh)
Other versions
CN110460856A (en)
Inventor
闻兴
郑云飞
于冰
陈敏
赵明菲
陈宇聪
黄晓政
王晓楠
黄跃
黄博
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910829454.2A
Publication of CN110460856A
Application granted
Publication of CN110460856B

Classifications

    • G06T 7/223 Image analysis; analysis of motion using block-matching
    • G06T 7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving models
    • H04N 19/124 Adaptive coding of digital video signals; quantisation
    • H04N 19/176 Adaptive coding characterised by the coding unit, the unit being an image region that is a block, e.g. a macroblock
    • H04N 19/51 Predictive coding involving temporal prediction; motion estimation or motion compensation
    • H04N 19/86 Pre-processing or post-processing specially adapted for video compression, involving reduction of coding artifacts, e.g. of blockiness
    • H04N 19/91 Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
    • G06T 2207/10016 Indexing scheme for image analysis or enhancement; video; image sequence

Abstract

The present disclosure provides a video encoding method and apparatus, an encoding device, and a computer-readable storage medium, belonging to the field of internet technologies. The method includes: acquiring motion information of a current frame image, the motion information representing the motion state of the image content in the current frame image; determining an actual reference frame of the current frame image according to a theoretical reference frame of the current frame image and the motion information, the theoretical reference frame being a reference frame determined according to the coding order; and encoding the current frame image according to the actual reference frame. That is, the real motion information of the current frame image is acquired, an actual reference frame for encoding the current frame image is re-determined based on that motion information, and the current frame image is then encoded against the actual reference frame. Because encoding is based on the real motion state, the real motion information of the current frame image is encoded into the image, which improves the accuracy of the encoded image.

Description

Video encoding method, video encoding device, video encoding apparatus, and computer-readable storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular to a video encoding method and apparatus, an encoding device, and a computer-readable storage medium.
Background
Motion estimation is the most important component of video coding. It refers to the process of dividing each frame image into at least one non-overlapping macroblock and, for each macroblock, searching a designated area of a reference frame for the matching block most similar to it according to a designated search algorithm. Performing motion estimation not only reduces the complexity of the video coding process but also reduces the number of bits transmitted, so motion estimation is usually performed during video coding.
Currently, the main process of video coding based on motion estimation is as follows:
First, the current frame image is divided into at least one non-overlapping macroblock, and all pixels within a macroblock are assumed to share the same displacement.
Second, for each macroblock, the most similar matching block is searched for within a designated area of the reference frame according to a similarity matching criterion, and the MVD (Motion Vector Difference) corresponding to each matching block is obtained.
Third, each matching block is subtracted from the corresponding macroblock to obtain the residual data of each macroblock.
Fourth, a transform such as the DCT (Discrete Cosine Transform) or FFT (Fast Fourier Transform) is applied to the residual data of each macroblock.
Fifth, the transformed residual data of each macroblock is quantized, discarding some of the components, to obtain the quantization parameters of each macroblock.
Sixth, the quantization parameters and the MVD of each macroblock are entropy coded to obtain the code stream data of the current frame image.
However, motion estimation aims to find a theoretically similar matching block rather than to capture the real motion state, which makes the encoded image inaccurate. A new video encoding method is therefore needed to improve the accuracy of encoded images.
Disclosure of Invention
The embodiments of the present disclosure provide a video encoding method and apparatus, an encoding device, and a computer-readable storage medium, to at least solve the problem of low accuracy of encoded images in the related art. The technical solution is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided a video encoding method, the method including:
acquiring motion information of a current frame image, wherein the motion information is used for representing the motion state of image content in the current frame image;
determining an actual reference frame of the current frame image according to a theoretical reference frame and motion information of the current frame image, wherein the theoretical reference frame is a reference frame determined according to a coding sequence;
and coding the current frame image according to the actual reference frame.
In another embodiment of the present disclosure, encoding the current frame image according to the actual reference frame includes:
splitting the current frame image into at least one first macroblock, and splitting the actual reference frame into at least one second macroblock, the first macroblocks and the second macroblocks being equal in number and corresponding in position;
taking the difference between each second macroblock and the corresponding first macroblock to obtain residual data corresponding to each first macroblock;
transforming and quantizing the residual data corresponding to each first macroblock to obtain a quantization parameter corresponding to each first macroblock;
and entropy coding the quantization parameter corresponding to each first macroblock to obtain code stream data of the current frame image.
In another embodiment of the present disclosure, after the transforming and quantizing the residual data corresponding to each first macroblock to obtain the quantization parameter corresponding to each first macroblock, the method further includes:
performing inverse transformation and inverse quantization processing on the quantization parameter corresponding to each first macro block to obtain residual data corresponding to each first macro block;
and determining a reconstructed reference frame according to the residual data corresponding to each first macro block and the corresponding second macro block, wherein the reconstructed reference frame is used as a theoretical reference frame for a next frame image and is used for encoding the next frame image.
In another embodiment of the present disclosure, the determining the actual reference frame of the current frame image according to the theoretical reference frame and the motion information of the current frame image includes:
and inputting the theoretical reference frame and the motion information of the current frame image into a reference frame generation model, and outputting the actual reference frame of the current frame image, wherein the reference frame generation model is used for determining the actual reference frame of the image based on the theoretical reference frame and the motion information of the image.
In another embodiment of the present disclosure, before inputting the theoretical reference frame and the motion information of the current frame image into the reference frame generation model and outputting the actual reference frame of the current frame image, the method further includes:
acquiring motion information of a plurality of frames of training sample images, wherein each frame of training sample image is provided with a theoretical reference frame and an actual reference frame;
inputting the motion information of each frame of training sample image and a theoretical reference frame thereof into an initial reference frame generation model, and outputting a prediction reference frame of each frame of training sample image;
inputting a prediction reference frame and an actual reference frame of each frame of training sample image into a pre-constructed target loss function;
and adjusting the model parameters of the initial reference frame generation model according to the function value of the target loss function to obtain the reference frame generation model.
According to a second aspect of the embodiments of the present disclosure, there is provided a video encoding apparatus, the apparatus including:
the acquisition unit is configured to acquire motion information of a current frame image, wherein the motion information is used for representing the motion state of image content in the current frame image;
a determining unit configured to perform determining an actual reference frame of the current frame image according to a theoretical reference frame of the current frame image and motion information, the theoretical reference frame being a reference frame determined according to a coding order;
an encoding unit configured to perform encoding of the current frame image according to the actual reference frame.
In another embodiment of the present disclosure, the encoding unit is configured to split the current frame image into at least one first macroblock and split the actual reference frame into at least one second macroblock, the first macroblocks and the second macroblocks being equal in number and corresponding in position; take the difference between each second macroblock and the corresponding first macroblock to obtain residual data corresponding to each first macroblock; transform and quantize the residual data corresponding to each first macroblock to obtain a quantization parameter corresponding to each first macroblock; and entropy code the quantization parameter corresponding to each first macroblock to obtain code stream data of the current frame image.
In another embodiment of the present disclosure, the apparatus further comprises:
the processing unit is configured to perform inverse transformation and inverse quantization processing on the quantization parameter corresponding to each first macro block to obtain residual data corresponding to each first macro block;
the determining unit is configured to determine a reconstructed reference frame according to the residual data corresponding to each first macro block and the corresponding second macro block, wherein the reconstructed reference frame is used as a theoretical reference frame for a next frame of image and used for encoding the next frame of image.
In another embodiment of the present disclosure, the determining unit is configured to perform inputting the theoretical reference frame and the motion information of the current frame image into a reference frame generation model, and outputting the actual reference frame of the current frame image, wherein the reference frame generation model is used for determining the actual reference frame of the image based on the theoretical reference frame and the motion information of the image.
In another embodiment of the present disclosure, the apparatus further comprises:
the acquisition unit is configured to acquire motion information of a plurality of frames of training sample images, wherein each frame of training sample image has a theoretical reference frame and an actual reference frame;
the input and output unit is configured to input the motion information of each frame of training sample image and a theoretical reference frame thereof into an initial reference frame generation model and output a prediction reference frame of each frame of training sample image;
the input and output unit is configured to input the prediction reference frame and the actual reference frame of each frame of training sample image into a pre-constructed target loss function;
and the adjusting unit is configured to adjust the model parameters of the initial reference frame generation model according to the function value of the target loss function to obtain the reference frame generation model.
According to a third aspect of embodiments of the present disclosure, there is provided an encoding apparatus, which includes a processor and a memory, where at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the video encoding method of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having at least one instruction stored therein, the at least one instruction being loaded and executed by a processor to implement the video encoding method of the first aspect.
The technical solutions provided by the embodiments of the present disclosure have at least the following beneficial effects:
The real motion information of the current frame image is acquired, an actual reference frame for encoding the current frame image is re-determined based on that motion information, and the current frame image is then encoded against the actual reference frame. Because encoding is based on the real motion state, the real motion information of the current frame image is encoded into the image, which improves the accuracy of the encoded image.
In addition, motion estimation and motion compensation are computationally complex and consume a large amount of computing resources during video coding. With the video encoding approach provided by the present disclosure, motion estimation and motion compensation are not required, which greatly reduces the computing resources consumed in the video encoding process, shortens the encoding time, and improves encoding efficiency.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic diagram of an HEVC coding framework shown in accordance with an exemplary embodiment.
Fig. 2 is a flow chart illustrating a method of video encoding according to an example embodiment.
Fig. 3 is a timing diagram illustrating a video encoding process according to an example embodiment.
Fig. 4 is a block diagram illustrating a video encoding apparatus according to an example embodiment.
Fig. 5 is a block diagram of an apparatus for video encoding, according to an example embodiment.
Fig. 6 is a block diagram of another apparatus for video encoding, according to an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The user information to which the present disclosure relates may be information authorized by the user or sufficiently authorized by each party.
Before describing the embodiments of the present disclosure, the concepts involved in the embodiments are first explained.
Motion estimation
Because the motion of objects in real life is continuous, the difference between two consecutive images in a video sequence is small: often only the relative position of an object changes, or changes occur only near object boundaries. For a video encoder, encoding each whole video image wastes a large amount of code stream, whereas encoding only the difference between the current image and a reference frame greatly reduces this waste.
The basic idea of motion estimation is to divide each frame of an image sequence into a plurality of non-overlapping macroblocks, assume that all pixels within a macroblock share the same displacement, and then, for each macroblock, search a designated area of the reference frame for the most similar matching block according to a designated search algorithm and a designated matching criterion; the relative displacement between the matching block and the current block is the motion vector. During video compression, the current block can be recovered by storing only the motion vector, the residual block, and the reference frame. Motion estimation removes inter-frame redundancy, greatly reducing the number of bits needed for video transmission. Designated search algorithms include the global (full) search algorithm, fractional-precision search algorithms, fast search algorithms, hierarchical search algorithms, hybrid search algorithms, and the like. Designated matching criteria include MAD (Mean Absolute Difference), MSE (Mean Squared Error), and the like.
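As a concrete illustration, the following is a minimal Python sketch of a full search under the MAD criterion; the function names, block handling, and search range are illustrative assumptions, not details specified in the patent.

```python
# Minimal full-search block matching with the MAD criterion (illustrative).
import numpy as np

def mad(block_a: np.ndarray, block_b: np.ndarray) -> float:
    """Mean Absolute Difference (MAD) between two equally sized blocks."""
    return float(np.mean(np.abs(block_a.astype(np.int32) - block_b.astype(np.int32))))

def full_search(cur_block: np.ndarray, ref: np.ndarray,
                top: int, left: int, search_range: int = 8):
    """Scan the designated area of the reference frame around (top, left)
    and return the motion vector (dy, dx) of the most similar block."""
    n = cur_block.shape[0]
    h, w = ref.shape
    best_cost, best_mv = float("inf"), (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > h or x + n > w:
                continue  # candidate block falls outside the reference frame
            cost = mad(cur_block, ref[y:y + n, x:x + n])
            if cost < best_cost:
                best_cost, best_mv = cost, (dy, dx)
    return best_mv, best_cost
```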
Motion compensation
Motion compensation is a method of describing the difference between adjacent frames ("adjacent" here means adjacent in coding order; the two frames are not necessarily adjacent in playback order), specifically, how each macroblock of the previous frame image moves to a certain position in the current frame image. Motion compensation is often used by video compression/video codecs to reduce temporal redundancy in video sequences.
Fig. 1 shows the coding framework of HEVC, and with reference to fig. 1, the HEVC coding process is as follows:
the method comprises the following steps that firstly, for any frame of image, the frame of image is divided into at least one macroblock which is not overlapped with each other;
and secondly, inputting the frame image into an encoder for encoding prediction, wherein the process mainly utilizes the spatial correlation and the temporal correlation of video data, and removes the time-space domain redundant information of each macro block by adopting intra-frame prediction or inter-frame prediction to obtain a matching block of each macro block in a reference frame.
And thirdly, subtracting the matching block from the corresponding macro block to obtain residual data, and respectively carrying out transformation and quantization processing on the residual data to obtain quantization parameters.
Wherein the transform includes DCT, FFT, etc. Quantization is a common technique in the field of digital signal processing, and refers to a process of approximating a continuous value (or a large number of possible discrete values) of a signal to a finite number (or fewer) of discrete values. The quantization process is mainly applied to conversion from a continuous signal to a digital signal, the continuous signal is sampled to be a discrete signal, and the discrete signal is quantized to be a digital signal.
And fourthly, entropy coding is carried out on the quantization parameters to obtain a part of the code stream and output the part of the code stream.
And fifthly, carrying out inverse quantization processing and inverse transformation on the quantization parameters to obtain a residual block of the reconstructed image, and further adding the residual block of the reconstructed image and the matching block to obtain the reconstructed image.
And sixthly, adding the reconstructed image into a reference frame queue after DB (Deblocking Filter) and SAO (Sample Adaptive Offset) processing, and taking the reconstructed image as a theoretical reference frame of the next frame image. The video image can be encoded frame by executing the above-described first to sixth steps in a loop.
The main function of the deblocking filtering is to enhance the boundary of the image and reduce discontinuity of the image boundary. The adaptive pixel compensation is mainly used for performing local information compensation on the image subjected to the block filtering processing so as to reduce distortion between a source image and a reconstructed image.
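For illustration, the third and fifth steps above (transform, quantization, and their inverses) can be sketched as a minimal Python round trip; the single uniform step size `q_step` is an illustrative stand-in for HEVC's actual quantization design.

```python
# Toy transform/quantization round trip. `q_step` is an illustrative
# uniform quantization step, not HEVC's real quantization scheme.
import numpy as np
from scipy.fft import dctn, idctn

def forward(residual: np.ndarray, q_step: float = 16.0) -> np.ndarray:
    coeffs = dctn(residual.astype(np.float64), norm="ortho")  # 2-D DCT
    return np.round(coeffs / q_step)           # quantized coefficients

def inverse(quantized: np.ndarray, q_step: float = 16.0) -> np.ndarray:
    coeffs = quantized * q_step                # inverse quantization
    return idctn(coeffs, norm="ortho")         # reconstructed residual

residual = np.random.randint(-32, 32, size=(8, 8))
rebuilt = inverse(forward(residual))
print(np.max(np.abs(residual - rebuilt)))      # only quantization error remains
```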
When HEVC is used for video coding, image partitioning is finer and supports more partition directions, so the amount of computation in the coding process is larger, and motion estimation and motion compensation account for a large proportion of the total. To reduce the amount of computation in the video encoding process, shorten the video encoding time, and improve video encoding efficiency, the embodiments of the present disclosure provide a video encoding method that removes the motion estimation and motion compensation processes by encoding the motion information of each frame image into the reference frame.
Fig. 2 is a flowchart illustrating a video encoding method according to an exemplary embodiment. The video encoding method is used in a video encoding device, which may be a terminal with a video encoding function or a server with a video encoding function. As shown in Fig. 2, the video encoding method includes the following steps.
In step S201, the video encoding apparatus acquires motion information of a current frame image.
The motion information is used to represent the motion state of the image content in the current frame image. It includes the motion type, such as a uniform linear motion state or a uniformly accelerated motion state, and may further include motion parameters, such as the motion direction and the magnitude of the acceleration.
The video encoding device may acquire the motion information of the current frame image as follows: during video image generation, a capture device records the real motion information of the image content in each frame image; during video encoding, the video encoding device obtains the real motion information recorded by the capture device for each frame image, thereby obtaining the motion information of the current frame image. The capture device may be a gyroscope, an accelerometer, an altimeter, a depth camera, a GPS (Global Positioning System) receiver, or the like.
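As a concrete illustration, such motion information might be represented as follows; this is a minimal sketch, and the type and field names are assumptions for illustration rather than structures defined in the patent.

```python
# Hypothetical container for per-frame motion information; the field
# names below are illustrative assumptions, not taken from the patent.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MotionInfo:
    motion_type: str                # e.g. "uniform_linear" or "uniform_acceleration"
    direction: Tuple[float, float]  # unit vector of the motion direction (dy, dx)
    acceleration: float = 0.0       # magnitude; zero for uniform linear motion

# Example: image content moving left to right at constant speed.
info = MotionInfo(motion_type="uniform_linear", direction=(0.0, 1.0))
```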
In step S202, the video encoding device determines an actual reference frame of the current frame image according to the theoretical reference frame and the motion information of the current frame image.
The theoretical reference frame is a reference frame determined according to the coding order. The actual reference frame is generated based on the motion information of the current frame image: because the real motion state of the current frame image is added into it, the actual reference frame reflects the real motion state of the image content in the current frame image. Encoding can therefore be performed without motion estimation and motion compensation, which reduces the consumption of computing resources and makes the encoded image more accurate.
When the video coding device determines the actual reference frame of the current frame image according to the theoretical reference frame and the motion information of the current frame image, the following method can be adopted: and inputting the theoretical reference frame and the motion information of the current frame image into a reference frame generation model, and outputting the actual reference frame of the current frame image. The reference frame generation model is used for determining an actual reference frame of the image based on a theoretical reference frame and motion information of the image.
In the embodiment of the present disclosure, the reference frame generation model is trained before the video encoding device inputs the theoretical reference frame and motion information of the current frame image into it and outputs the actual reference frame of the current frame image. The training process is as follows:
s2021, the video coding device obtains the motion information of the multi-frame training sample image.
Each frame of training sample image has a theoretical reference frame and an actual reference frame. To acquire the motion information of the multi-frame training sample images, the video encoding device may collect, via the capture device, the motion information of the image content in the multi-frame training sample images and use it as the acquired motion information.
S2022, the video coding device inputs the motion information of each frame of training sample image and the theoretical reference frame thereof into the initial reference frame generation model, and outputs the prediction reference frame of each frame of training sample image.
The initial reference frame generation model may be one of a CNN (Convolutional Neural Network), an optical flow model, and a traditional motion estimation model.
S2023, inputting the prediction reference frame and the actual reference frame of each frame of training sample image into a pre-constructed target loss function by the video coding device.
In this step, a target loss function is constructed in advance for the initial reference frame generation model, and initial values are set for its model parameters. The predicted reference frame of each frame of training sample image is determined based on these initial parameter values, and the function value of the target loss function is calculated by inputting the predicted reference frame and the actual reference frame of each frame of training sample image into the target loss function.
And S2024, the video coding device adjusts model parameters of the initial reference frame generation model according to the function value of the target loss function to obtain a reference frame generation model.
If the function value of the target loss function does not meet the threshold condition, the model parameters of the initial reference frame generation model are adjusted and the function value is recalculated, until the obtained function value meets the threshold condition. The threshold condition can be set according to the required processing precision.
The parameter values that meet the threshold condition are obtained, and the initial reference frame generation model with those parameter values is taken as the trained reference frame generation model.
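For illustration, the sketch below mirrors steps S2021 to S2024 in PyTorch. It is a minimal sketch under stated assumptions: the small CNN architecture, the L1 target loss, and the stopping threshold are all illustrative choices, since the disclosure only requires an initial reference frame generation model, a pre-constructed target loss function, and parameter adjustment until a threshold condition is met.

```python
# A hedged training sketch for the reference frame generation model.
# The architecture, loss, and threshold are illustrative assumptions.
import torch
import torch.nn as nn

class RefFrameGenerator(nn.Module):
    """Initial reference frame generation model (a small CNN here)."""
    def __init__(self, motion_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + motion_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=3, padding=1),
        )

    def forward(self, theoretical_ref, motion_maps):
        # Motion information is assumed pre-expanded to per-pixel planes
        # and concatenated with the theoretical reference frame.
        return self.net(torch.cat([theoretical_ref, motion_maps], dim=1))

def train(model, samples, threshold=1e-3, lr=1e-4):
    """samples: list of (theoretical_ref, motion_maps, actual_ref) tensors."""
    loss_fn = nn.L1Loss()                          # pre-constructed target loss
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    while True:
        total = 0.0
        for theo_ref, motion, actual_ref in samples:
            pred_ref = model(theo_ref, motion)     # S2022: predicted reference frame
            loss = loss_fn(pred_ref, actual_ref)   # S2023: loss over pred vs. actual
            opt.zero_grad()
            loss.backward()
            opt.step()                             # S2024: adjust model parameters
            total += loss.item()
        if total / len(samples) < threshold:       # illustrative threshold condition
            return model
```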
It should be noted that, depending on which initial reference frame generation model is selected, training the reference frame generation model can produce higher-precision and finer motion, improving the corresponding subjective quality.
In step S203, the video encoding apparatus encodes the current frame image based on the actual reference frame.
When the video coding device codes the current frame image according to the actual reference frame, the following method can be adopted:
s2031, the video coding apparatus splits the current frame image into at least one first macro block, and splits the actual reference frame into at least one second macro block.
The first macro blocks and the second macro blocks are the same in number and corresponding in position.
S2032, the video coding device performs a difference between each second macroblock and the corresponding first macroblock to obtain residual data corresponding to each first macroblock.
And the video coding equipment makes a difference between each second macro block and the first macro block at the corresponding position to obtain residual data corresponding to each first macro block.
S2033, the video coding device performs transform and quantization processing on the residual data corresponding to each first macroblock to obtain a quantization parameter corresponding to each first macroblock.
The transform includes the DCT, FFT, or the like. The DCT is a mathematical operation closely related to the Fourier transform: in a Fourier series expansion, if the expanded function is a real even function, the Fourier series contains only cosine terms, and discretizing the expansion yields the cosine transform, that is, the discrete cosine transform.
And the video coding equipment transforms the residual data corresponding to each first macro block to obtain transformed residual data corresponding to each first macro block, and further quantizes the transformed residual data corresponding to each first macro block to obtain a quantization parameter corresponding to each first macro block.
S2034, the video coding device entropy codes the quantization parameter corresponding to each first macro block to obtain code stream data of the current frame image.
The video coding device can obtain code stream data of the current frame image by entropy coding the quantization parameter corresponding to each first macro block, and then output the code stream data of the current frame image.
In existing encoding methods, the MVD must be entropy coded together with the quantization parameters during video coding, and the MVD occupies relatively many computing resources, so the video coding process consumes a large amount of computing resources, the resulting code stream data is relatively large, and transmission costs increase. The embodiments of the present disclosure do not need to entropy code any MVD, which both reduces the consumption of computing resources and saves transmission resources.
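As a concrete illustration of steps S2031 to S2034, the following minimal sketch encodes the current frame against the actual reference frame with no motion search and no MVD. It assumes the frame dimensions are multiples of the macroblock size, reuses the `forward` transform-and-quantization helper sketched earlier, and uses zlib purely as a stand-in for a real entropy coder such as CABAC.

```python
# Sketch of S2031-S2034: macroblocks are differenced at the SAME position
# (the actual reference frame already reflects the motion), then transformed,
# quantized, and entropy coded. zlib is an illustrative stand-in for CABAC.
import zlib
import numpy as np

def encode_frame(cur: np.ndarray, actual_ref: np.ndarray, n: int = 16) -> bytes:
    h, w = cur.shape                                  # assumes h, w divisible by n
    quantized = []
    for top in range(0, h, n):
        for left in range(0, w, n):
            first_mb = cur[top:top + n, left:left + n].astype(np.int32)
            second_mb = actual_ref[top:top + n, left:left + n].astype(np.int32)
            residual = first_mb - second_mb           # S2032: difference the blocks
            quantized.append(forward(residual))       # S2033: transform + quantize
    payload = np.stack(quantized).astype(np.int16).tobytes()
    return zlib.compress(payload)                     # S2034: entropy-code (no MVD)
```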
In another embodiment of the present disclosure, after the video encoding device transforms and quantizes the residual data corresponding to each first macroblock to obtain a quantization parameter corresponding to each first macroblock, the video encoding device further reconstructs a reference frame to encode a next frame image. The process may include the steps of:
a. and the video coding equipment performs inverse transformation and inverse quantization processing on the quantization parameter corresponding to each first macro block to obtain residual data corresponding to each first macro block.
b. And the video coding equipment determines a reconstructed reference frame according to the residual data corresponding to each first macro block and the corresponding second macro block.
The video encoding device adds the residual data corresponding to each first macroblock to the corresponding second macroblock to obtain each reconstructed first macroblock, and from these obtains the reconstructed reference frame. The reconstructed reference frame serves as the theoretical reference frame for the next frame image and is used for encoding the next frame image.
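Continuing the sketch, the reconstruction in steps a and b might look as follows; `inverse` is the inverse-quantization and inverse-transform helper from the earlier round-trip sketch, and the macroblock layout matches the illustrative `encode_frame` above.

```python
# Sketch of steps a and b: rebuild each first macroblock from its quantized
# residual plus the corresponding second macroblock, yielding the reconstructed
# reference frame (the theoretical reference frame for the next frame image).
import numpy as np

def reconstruct(quantized_mbs, actual_ref: np.ndarray, n: int = 16) -> np.ndarray:
    recon = np.zeros_like(actual_ref, dtype=np.float64)
    h, w = actual_ref.shape
    i = 0
    for top in range(0, h, n):
        for left in range(0, w, n):
            residual = inverse(quantized_mbs[i])      # step a: dequantize + inverse DCT
            second_mb = actual_ref[top:top + n, left:left + n]
            recon[top:top + n, left:left + n] = residual + second_mb  # step b
            i += 1
    return recon
```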
In the above steps, motion information is encoded at the frame level, but it may of course be encoded at the sequence level instead. When the video content in every frame undergoes the same type of motion, such as uniform linear motion or uniformly accelerated linear motion, the motion information obtained for each frame image is identical; in that case, the motion information can be encoded once for multiple frame images to improve coding efficiency.
Fig. 3 shows a video encoding process provided by an embodiment of the present disclosure, which includes the following steps:
firstly, inputting the reconstructed reference frame (theoretical reference frame of the current frame image) and the motion information of the current frame image into a reference frame generation model, and outputting the actual reference frame of the current frame image.
And secondly, subtracting each macro block in the actual reference frame from each macro block in the current frame image to obtain residual data corresponding to each macro block in the current frame image.
And thirdly, converting and quantizing residual error data corresponding to each macro block in the current frame image to obtain a quantization parameter corresponding to each macro block in the current frame image.
And fourthly, entropy coding is carried out on the quantization parameter corresponding to each macro block in the current frame image to obtain code stream data of the current frame image and output the code stream data.
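Putting the four steps together, the Fig. 3 pipeline can be summarized by the sketch below, composed from the illustrative helpers above; `ref_frame_model` is assumed to be any callable (for example, a wrapper around the trained reference frame generation model) that returns an array shaped like the frame, and none of these names come from the patent itself.

```python
# End-to-end sketch of the Fig. 3 pipeline, built from the helpers above.
def encode_with_motion_info(cur_frame, reconstructed_ref, motion_info, ref_frame_model):
    # First step: generate the actual reference frame from the theoretical
    # (reconstructed) reference frame and the motion information.
    actual_ref = ref_frame_model(reconstructed_ref, motion_info)
    # Second to fourth steps: macroblock differencing, transform and
    # quantization, then entropy coding, as in encode_frame above.
    return encode_frame(cur_frame, actual_ref)
```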
According to the method provided by the embodiments of the present disclosure, the real motion information of the current frame image is acquired, an actual reference frame for encoding the current frame image is re-determined based on that motion information, and the current frame image is then encoded against the actual reference frame. Because encoding is based on the real motion state, the real motion information of the current frame image is encoded into the image, which improves the accuracy of the encoded image.
In addition, motion estimation and motion compensation are computationally complex and consume a large amount of computing resources during video coding. With the video encoding approach provided by the present disclosure, motion estimation and motion compensation are not required, which greatly reduces the computing resources consumed in the video encoding process, shortens the encoding time, and improves encoding efficiency.
Fig. 4 is a block diagram illustrating a video encoding apparatus according to an example embodiment. Referring to fig. 4, the apparatus includes: acquisition section 401, determination section 402, and encoding section 403.
The acquiring unit 401 is configured to perform acquiring motion information of the current frame image, where the motion information is used to represent a motion state of image content in the current frame image;
the determining unit 402 is configured to perform determining an actual reference frame of the current frame image from a theoretical reference frame of the current frame image and the motion information, the theoretical reference frame being a reference frame determined according to the encoding order;
the encoding unit 403 is configured to perform encoding of the current frame image according to the actual reference frame.
In another embodiment of the present disclosure, the encoding unit 403 is configured to split the current frame image into at least one first macroblock and split the actual reference frame into at least one second macroblock, the first macroblocks and the second macroblocks being equal in number and corresponding in position; take the difference between each second macroblock and the corresponding first macroblock to obtain residual data corresponding to each first macroblock; transform and quantize the residual data corresponding to each first macroblock to obtain a quantization parameter corresponding to each first macroblock; and entropy code the quantization parameter corresponding to each first macroblock to obtain code stream data of the current frame image.
In another embodiment of the present disclosure, the apparatus further comprises: and a processing unit.
The processing unit is configured to perform inverse transformation and inverse quantization processing on the quantization parameter corresponding to each first macro block to obtain residual data corresponding to each first macro block;
the determining unit 402 is configured to determine a reconstructed reference frame from the residual data corresponding to each first macroblock and the corresponding second macroblock, the reconstructed reference frame serving as a theoretical reference frame for a next frame image for encoding the next frame image.
In another embodiment of the present disclosure, the determining unit 402 is configured to perform inputting the theoretical reference frame and the motion information of the current frame image into a reference frame generation model, and outputting the actual reference frame of the current frame image, and the reference frame generation model is used for determining the actual reference frame of the image based on the theoretical reference frame and the motion information of the image.
In another embodiment of the present disclosure, the apparatus further comprises: the device comprises an acquisition unit, an input and output unit and an adjustment unit.
The acquisition unit is configured to perform acquisition of motion information of a plurality of frames of training sample images, each frame of training sample image having a theoretical reference frame and an actual reference frame;
the input and output unit is configured to input the motion information of each frame of training sample image and a theoretical reference frame thereof into an initial reference frame generation model, and output a prediction reference frame of each frame of training sample image;
the input and output unit is configured to perform input of a prediction reference frame and an actual reference frame of each frame of training sample image into a pre-constructed target loss function;
the adjusting unit is configured to adjust model parameters of the initial reference frame generation model according to the function value of the target loss function, so as to obtain the reference frame generation model.
In summary, the apparatus provided by the embodiments of the present disclosure acquires the real motion information of the current frame image, re-determines an actual reference frame for encoding the current frame image based on that motion information, and then encodes the current frame image against the actual reference frame. Because encoding is based on the real motion state, the real motion information of the current frame image is encoded into the image, which improves the accuracy of the encoded image.
In addition, motion estimation and motion compensation are computationally complex and consume a large amount of computing resources during video coding. With the video encoding approach provided by the present disclosure, motion estimation and motion compensation are not required, which greatly reduces the computing resources consumed in the video encoding process, shortens the encoding time, and improves encoding efficiency.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 5 shows a block diagram of a video encoding apparatus according to an exemplary embodiment of the present disclosure. The video encoding apparatus is a terminal 500, which may be a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 500 may also be referred to by other names, such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 500 includes: a processor 501 and a memory 502.
The processor 501 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 501 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 501 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, processor 501 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
Memory 502 may include one or more computer-readable storage media, which may be non-transitory. Memory 502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 502 is used to store at least one instruction for execution by processor 501 to implement the video encoding method provided by the method embodiments of the present disclosure.
In some embodiments, the terminal 500 may further optionally include: a peripheral interface 503 and at least one peripheral. The processor 501, memory 502 and peripheral interface 503 may be connected by a bus or signal lines. Each peripheral may be connected to the peripheral interface 503 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 504, touch screen display 505, camera 506, audio circuitry 507, positioning components 508, and power supply 509.
The peripheral interface 503 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 501 and the memory 502. In some embodiments, the processor 501, memory 502, and peripheral interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, the memory 502, and the peripheral interface 503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 504 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 504 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 504 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 504 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 504 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 504 may further include NFC (Near Field Communication) related circuits, which are not limited by this disclosure.
The display screen 505 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 505 is a touch display screen, the display screen 505 also has the ability to capture touch signals on or over the surface of the display screen 505. The touch signal may be input to the processor 501 as a control signal for processing. At this point, the display screen 505 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display screen 505 may be one, providing the front panel of the terminal 500; in other embodiments, the display screens 505 may be at least two, respectively disposed on different surfaces of the terminal 500 or in a folded design; in still other embodiments, the display 505 may be a flexible display disposed on a curved surface or on a folded surface of the terminal 500. Even more, the display screen 505 can be arranged in a non-rectangular irregular figure, i.e. a shaped screen. The Display screen 505 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and other materials.
The camera assembly 506 is used to capture images or video. Optionally, camera assembly 506 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 506 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
Audio circuitry 507 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 501 for processing, or inputting the electric signals to the radio frequency circuit 504 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 500. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 507 may also include a headphone jack.
The positioning component 508 is used to determine the current geographic location of the terminal 500 for navigation or LBS (Location Based Service). The positioning component 508 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 509 is used to power the various components in terminal 500. The power source 509 may be alternating current, direct current, disposable or rechargeable. When power supply 509 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 500 also includes one or more sensors 510. The one or more sensors 510 include, but are not limited to: acceleration sensor 511, gyro sensor 512, pressure sensor 513, fingerprint sensor 514, optical sensor 515, and proximity sensor 516.
The acceleration sensor 511 may detect the magnitude of acceleration on three coordinate axes of the coordinate system established with the terminal 500. For example, the acceleration sensor 511 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 501 may control the touch screen 505 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 511. The acceleration sensor 511 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 512 may detect a body direction and a rotation angle of the terminal 500, and the gyro sensor 512 may cooperate with the acceleration sensor 511 to acquire a 3D motion of the user on the terminal 500. The processor 501 may implement the following functions according to the data collected by the gyro sensor 512: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 513 may be disposed on a side bezel of the terminal 500 and/or an underlying layer of the touch display screen 505. When the pressure sensor 513 is disposed on the side frame of the terminal 500, a user's holding signal of the terminal 500 may be detected, and the processor 501 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 513. When the pressure sensor 513 is disposed at the lower layer of the touch display screen 505, the processor 501 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 505. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 514 is used for collecting a fingerprint of the user, and the processor 501 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 514, or the fingerprint sensor 514 identifies the identity of the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 501 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 514 may be provided on the front, back, or side of the terminal 500. When a physical button or a vendor Logo is provided on the terminal 500, the fingerprint sensor 514 may be integrated with the physical button or the vendor Logo.
The optical sensor 515 is used to collect the ambient light intensity. In one embodiment, the processor 501 may control the display brightness of the touch display screen 505 based on the ambient light intensity collected by the optical sensor 515. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 505 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 505 is decreased. In another embodiment, the processor 501 may also dynamically adjust the shooting parameters of the camera assembly 506 based on the ambient light intensity collected by the optical sensor 515.
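As a hedged illustration of this brightness control (the lux range and the linear mapping are assumptions, not values from the disclosure):

```python
# Illustrative sketch only: mapping ambient light intensity (lux) to a
# display brightness level in [0.0, 1.0]. The working range and the
# linear scaling are assumed for illustration.
def brightness_for_ambient_light(lux: float,
                                 min_lux: float = 10.0,
                                 max_lux: float = 1000.0) -> float:
    # Clamp the reading into the assumed working range, then scale
    # linearly: brighter surroundings -> brighter screen.
    lux = max(min_lux, min(lux, max_lux))
    return (lux - min_lux) / (max_lux - min_lux)

print(brightness_for_ambient_light(500.0))  # mid-range brightness
```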
The proximity sensor 516, also referred to as a distance sensor, is typically disposed on the front panel of the terminal 500. The proximity sensor 516 is used to collect the distance between the user and the front surface of the terminal 500. In one embodiment, when the proximity sensor 516 detects that the distance between the user and the front surface of the terminal 500 is gradually decreasing, the processor 501 controls the touch display screen 505 to switch from the bright-screen state to the dark-screen state; when the proximity sensor 516 detects that the distance is gradually increasing, the processor 501 controls the touch display screen 505 to switch from the dark-screen state to the bright-screen state.
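One way to picture this switching logic (a sketch only; the 5 cm threshold and the two-reading trend test are assumptions):

```python
# Illustrative sketch only: toggling the screen state from successive
# proximity readings, in centimeters. Threshold and trend test assumed.
def screen_state(prev_distance_cm: float, distance_cm: float, state: str) -> str:
    if distance_cm < prev_distance_cm and distance_cm < 5.0:
        return "off"   # user approaching the front panel: darken the screen
    if distance_cm > prev_distance_cm and distance_cm >= 5.0:
        return "on"    # user moving away: light the screen again
    return state       # otherwise keep the current state

state = screen_state(10.0, 3.0, "on")   # -> "off"
state = screen_state(3.0, 12.0, state)  # -> "on"
```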
Those skilled in the art will appreciate that the configuration shown in fig. 5 does not limit the terminal 500, which may include more or fewer components than those shown, combine some components, or use a different arrangement of components.
The terminal provided by the embodiments of the present disclosure acquires the real motion information of the current frame image, re-determines the actual reference frame for encoding the current frame image based on that motion information, and then encodes the current frame image using the actual reference frame. Because encoding is performed based on the real motion state, the real motion information of the current frame image can be encoded into the image, thereby improving the accuracy of the encoded image.
In addition, motion estimation and motion compensation are computationally complex and consume a large amount of computing resources during video encoding. With the video encoding approach provided by the present disclosure, motion estimation and motion compensation are not required, which greatly reduces the computing resources consumed, shortens the encoding time, and improves encoding efficiency.
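To make this flow concrete, the following is a minimal numpy sketch of the encoding path described above, with a stand-in reference frame generation model. The 16x16 macro block size, the quantization step, and the omission of a real transform and entropy coder are all illustrative assumptions, not details from the disclosure:

```python
import numpy as np

def encode_frame(current, theoretical_ref, motion_info, ref_model, qstep=16):
    # Re-determine the actual reference frame from the theoretical
    # reference frame and the acquired motion information; no motion
    # estimation or motion compensation is performed.
    actual_ref = ref_model(theoretical_ref, motion_info)
    h, w = current.shape
    stream = []
    # Split both frames into 16x16 macro blocks that are equal in
    # number and correspond in position.
    for y in range(0, h, 16):
        for x in range(0, w, 16):
            cur_mb = current[y:y+16, x:x+16].astype(np.int32)
            ref_mb = actual_ref[y:y+16, x:x+16].astype(np.int32)
            residual = cur_mb - ref_mb                       # per-macro-block residual
            quantized = np.rint(residual / qstep).astype(np.int16)
            stream.append(quantized)                         # entropy coding omitted here
    return stream

# Toy usage with an identity "model" that simply returns the theoretical
# reference frame; a trained reference frame generation model would adjust
# it according to the motion information instead.
current = np.random.randint(0, 256, (64, 64))
previous = np.random.randint(0, 256, (64, 64))
bitstream = encode_frame(current, previous, motion_info=None,
                         ref_model=lambda ref, m: ref)
```

Note that the residual loop above replaces the motion search of a conventional encoder: each macro block is differenced directly with the co-located block of the generated actual reference frame.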
Fig. 6 is a diagram illustrating a video encoding device according to an example embodiment. The video encoding device is a server 600 for video encoding. Referring to fig. 6, the server 600 includes a processing component 622, which in turn includes one or more processors, and memory resources, represented by a memory 632, for storing instructions, such as application programs, that are executable by the processing component 622. The application programs stored in the memory 632 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 622 is configured to execute the instructions to perform the functions performed by the server in the video encoding described above.
The server 600 may also include a power component 626 configured to perform power management of the server 600, a wired or wireless network interface 650 configured to connect the server 600 to a network, and an input/output (I/O) interface 658. The server 600 may operate based on an operating system stored in the memory 632, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The server provided by the embodiments of the present disclosure acquires the real motion information of the current frame image, re-determines the actual reference frame for encoding the current frame image based on that motion information, and then encodes the current frame image using the actual reference frame. Because encoding is performed based on the real motion state, the real motion information of the current frame image can be encoded into the image, thereby improving the accuracy of the encoded image.
In addition, motion estimation and motion compensation are computationally complex and consume a large amount of computing resources during video encoding. With the video encoding approach provided by the present disclosure, motion estimation and motion compensation are not required, which greatly reduces the computing resources consumed, shortens the encoding time, and improves encoding efficiency.
The embodiments of the present disclosure provide a computer-readable storage medium having at least one instruction stored therein, the instruction being loaded and executed by a processor to implement the video encoding method shown in fig. 2.
With the computer-readable storage medium provided by the embodiments of the present disclosure, the real motion information of the current frame image is acquired, the actual reference frame for encoding the current frame image is re-determined based on that motion information, and the current frame image is then encoded using the actual reference frame. Because encoding is performed based on the real motion state, the real motion information of the current frame image can be encoded into the image, thereby improving the accuracy of the encoded image.
In addition, motion estimation and motion compensation are computationally complex and consume a large amount of computing resources during video encoding. With the video encoding approach provided by the present disclosure, motion estimation and motion compensation are not required, which greatly reduces the computing resources consumed, shortens the encoding time, and improves encoding efficiency.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of video encoding, the method comprising:
acquiring motion information of a current frame image, wherein the motion information is used for representing the motion state of image content in the current frame image;
determining an actual reference frame of the current frame image according to a theoretical reference frame and the motion information of the current frame image, wherein the theoretical reference frame is a reference frame determined according to the coding order, and the actual reference frame incorporates the real motion state of the current frame image and can reflect the real motion state of the image content in the current frame image;
splitting the current frame image into at least one first macro block, and splitting the actual reference frame into at least one second macro block, wherein the first macro blocks and the second macro blocks are equal in number and correspond in position;
differencing each second macro block with the corresponding first macro block to obtain residual data corresponding to each first macro block;
transforming and quantizing the residual data corresponding to each first macro block to obtain a quantization parameter corresponding to each first macro block;
and performing entropy coding on the quantization parameter corresponding to each first macro block to obtain code stream data of the current frame image.
2. The method according to claim 1, wherein after transforming and quantizing the residual data corresponding to each first macro block to obtain the quantization parameter corresponding to each first macro block, the method further comprises:
performing inverse transformation and inverse quantization processing on the quantization parameter corresponding to each first macro block to obtain residual data corresponding to each first macro block;
and determining a reconstructed reference frame according to the residual data corresponding to each first macro block and the corresponding second macro block, wherein the reconstructed reference frame serves as a theoretical reference frame of a next frame image for encoding the next frame image.
3. The method according to claim 1, wherein the determining the actual reference frame of the current frame image according to the theoretical reference frame and the motion information of the current frame image comprises:
inputting the theoretical reference frame and the motion information of the current frame image into a reference frame generation model, and outputting the actual reference frame of the current frame image, wherein the reference frame generation model is used for determining the actual reference frame of an image based on the theoretical reference frame and the motion information of the image.
4. The method according to claim 3, wherein before inputting the theoretical reference frame and the motion information of the current frame image into the reference frame generation model and outputting the actual reference frame of the current frame image, the method further comprises:
acquiring motion information of a plurality of frames of training sample images, wherein each frame of training sample image is provided with a theoretical reference frame and an actual reference frame;
inputting the motion information of each frame of training sample image and a theoretical reference frame thereof into an initial reference frame generation model, and outputting a prediction reference frame of each frame of training sample image;
inputting a prediction reference frame and an actual reference frame of each frame of training sample image into a pre-constructed target loss function;
and adjusting the model parameters of the initial reference frame generation model according to the function value of the target loss function to obtain the reference frame generation model.
5. A video encoding apparatus, characterized in that the apparatus comprises:
the acquisition unit is configured to acquire motion information of a current frame image, wherein the motion information is used for representing the motion state of image content in the current frame image;
the determining unit is configured to determine an actual reference frame of the current frame image according to a theoretical reference frame and the motion information of the current frame image, wherein the theoretical reference frame is a reference frame determined according to the coding order, and the actual reference frame incorporates the real motion state of the current frame image and can reflect the real motion state of the image content in the current frame image;
the encoding unit is configured to split the current frame image into at least one first macro block and split the actual reference frame into at least one second macro block, wherein the first macro blocks and the second macro blocks are equal in number and correspond in position; difference each second macro block with the corresponding first macro block to obtain residual data corresponding to each first macro block; transform and quantize the residual data corresponding to each first macro block to obtain a quantization parameter corresponding to each first macro block; and perform entropy coding on the quantization parameter corresponding to each first macro block to obtain code stream data of the current frame image.
6. The apparatus of claim 5, further comprising:
the processing unit is configured to perform inverse transformation and inverse quantization processing on the quantization parameter corresponding to each first macro block to obtain residual data corresponding to each first macro block;
the determining unit is configured to determine a reconstructed reference frame according to the residual data corresponding to each first macro block and the corresponding second macro block, wherein the reconstructed reference frame serves as a theoretical reference frame of a next frame image for encoding the next frame image.
7. The apparatus according to claim 5, wherein the determining unit is configured to input the theoretical reference frame and the motion information of the current frame image into a reference frame generation model and output the actual reference frame of the current frame image, wherein the reference frame generation model is configured to determine the actual reference frame of an image based on the theoretical reference frame and the motion information of the image.
8. The apparatus of claim 7, further comprising:
the acquisition unit is configured to acquire motion information of a plurality of frames of training sample images, wherein each frame of training sample image has a theoretical reference frame and an actual reference frame;
the input and output unit is configured to input the motion information of each frame of training sample image and a theoretical reference frame thereof into an initial reference frame generation model and output a prediction reference frame of each frame of training sample image;
the input and output unit is configured to input the prediction reference frame and the actual reference frame of each frame of training sample image into a pre-constructed target loss function;
and the adjusting unit is configured to adjust the model parameters of the initial reference frame generation model according to the function value of the target loss function to obtain the reference frame generation model.
9. An encoding device, comprising a processor and a memory, the memory having stored therein at least one instruction that is loaded and executed by the processor to implement the video encoding method of any of claims 1-4.
10. A computer-readable storage medium having stored therein at least one instruction which is loaded and executed by a processor to implement the video encoding method of any one of claims 1 to 4.
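The training procedure recited in claim 4 above can be pictured with the following hedged PyTorch-style sketch. The convolutional architecture, the two-channel input packing, the MSE target loss, the Adam optimizer, and the random toy data are all illustrative assumptions rather than details from the disclosure; the claim itself only requires a pre-constructed target loss function and parameter adjustment based on its value:

```python
import torch

# Illustrative sketch of the claim-4 training loop. Every concrete
# choice here (architecture, loss, optimizer, toy data) is an assumption.
model = torch.nn.Sequential(              # stand-in "initial reference frame generation model"
    torch.nn.Conv2d(2, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 1, 3, padding=1),
)
loss_fn = torch.nn.MSELoss()              # stand-in pre-constructed target loss function
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Toy training samples: (theoretical reference frame, motion information,
# actual reference frame), each a 64x64 map. Real samples would come from
# the multi-frame training set described in the claim.
training_samples = [
    (torch.rand(64, 64), torch.rand(64, 64), torch.rand(64, 64))
    for _ in range(8)
]

for theoretical_ref, motion_map, actual_ref in training_samples:
    # Pack the theoretical reference frame and the motion information
    # into a 2-channel input (one possible way to feed both to the model).
    inputs = torch.stack([theoretical_ref, motion_map]).unsqueeze(0)
    predicted_ref = model(inputs)                    # prediction reference frame
    target = actual_ref.unsqueeze(0).unsqueeze(0)
    loss = loss_fn(predicted_ref, target)            # target loss function value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # adjust the model parameters
```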
CN201910829454.2A 2019-09-03 2019-09-03 Video encoding method, video encoding device, video encoding apparatus, and computer-readable storage medium Active CN110460856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910829454.2A CN110460856B (en) 2019-09-03 2019-09-03 Video encoding method, video encoding device, video encoding apparatus, and computer-readable storage medium


Publications (2)

Publication Number Publication Date
CN110460856A CN110460856A (en) 2019-11-15
CN110460856B true CN110460856B (en) 2021-11-02

Family

ID=68490566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910829454.2A Active CN110460856B (en) 2019-09-03 2019-09-03 Video encoding method, video encoding device, video encoding apparatus, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN110460856B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112188200A (en) * 2020-09-30 2021-01-05 Shenzhen OneConnect Smart Technology Co., Ltd. Image processing method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101023677A (en) * 2004-07-20 2007-08-22 Qualcomm Incorporated Method and apparatus for frame rate up conversion with multiple reference frames and variable block sizes
CN101986242A (en) * 2010-11-03 2011-03-16 Institute of Computing Technology, Chinese Academy of Sciences Method for tracking target track in video compression coding process
WO2012145822A1 (en) * 2011-04-25 2012-11-01 Magna International Inc. Method and system for dynamically calibrating vehicular cameras
CN103561267A (en) * 2013-09-10 2014-02-05 Luo Tianming 3D video coding transmission method based on motion information and depth information
CN107197275A (en) * 2017-06-15 2017-09-22 Shenzhen Research Institute of Wuhan University Cruise UAV video compression method based on long-range large-scale background template

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101102504B (en) * 2007-07-24 2010-09-08 ZTE Corporation A hybrid motion detection method combined with a video encoder
CN101686393B (en) * 2008-09-28 2012-10-17 Huawei Technologies Co., Ltd. Fast-motion searching method and fast-motion searching device applied to template matching
CN107027029B (en) * 2017-03-01 2020-01-10 Sichuan University High-performance video coding improvement method based on frame rate conversion


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant