CN114765689A - Video coding method, device, equipment and storage medium - Google Patents


Info

Publication number
CN114765689A
Authority
CN
China
Prior art keywords
video frame
video
rendering
frame
reference frame
Prior art date
Legal status
Pending
Application number
CN202110320704.7A
Other languages
Chinese (zh)
Inventor
黄斌
彭巧巧
蒯多慈
Current Assignee
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd
Priority to PCT/CN2021/142232 (published as WO2022151972A1)
Publication of CN114765689A



Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50: using predictive coding
    • H04N 19/597: specially adapted for multi-view video sequence encoding
    • H04N 19/503: involving temporal prediction
    • H04N 19/51: Motion estimation or motion compensation
    • H04N 19/513: Processing of motion vectors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 9/00: Image coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The present application provides a video coding method, apparatus, device, and storage medium, and belongs to the technical field of video processing. When a video frame is encoded, the motion vector of an image block in the video frame is determined from the rendering information produced when the video frame was generated and the rendering information produced when the reference frame of the video frame was generated, and the video frame is then encoded. Encoding video in this way greatly reduces the amount of computation required for motion estimation, speeds up video encoding, improves video encoding efficiency, and effectively reduces end-to-end delay.

Description

Video coding method, device, equipment and storage medium
The present application claims priority from Chinese patent application No. 202110048292.6, entitled "a method, apparatus and system for coding", filed on January 14, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video encoding method, apparatus, device, and storage medium.
Background
With the rapid development of internet technology, a user can experience a virtual reality (VR) application running on a cloud rendering computing platform through a VR terminal connected to that platform over a network. Generally, for each rendered video frame, the cloud rendering computing platform generates left-eye and right-eye video frames with a certain parallax, corresponding to the user's left-eye and right-eye views respectively. The two video frames are encoded and sent to the VR terminal, which decodes and displays them so that the user experiences a stereoscopic visual effect.
In the related art, when a cloud rendering computing platform encodes the left-eye and right-eye video frames, one of the following two methods is generally used: first, the left-eye and right-eye video frames are stitched into a single video frame, which is then encoded by a standard encoder to produce one standard video bitstream; second, the left-eye and right-eye video frames are each encoded separately by a standard encoder, producing two standard video bitstreams.
In either method, whether one bitstream is encoded after stitching the left-eye and right-eye video frames or each of the two frames is encoded as its own bitstream, the amount of encoding computation is large, video encoding is slow, and video encoding efficiency is therefore low.
Disclosure of Invention
The embodiment of the application provides a video coding method, a video coding device, video coding equipment and a storage medium, which can effectively improve the video coding efficiency and reduce the end-to-end time delay. The technical scheme is as follows:
In a first aspect, a video encoding method is provided, which includes: acquiring a video frame and a reference frame of the video frame, where the video frame and the reference frame come from the same view or from different views of a multi-view image; determining a motion vector of an image block in the video frame based on first rendering information and second rendering information, where the first rendering information is the rendering information when the video frame is generated and the second rendering information is the rendering information when the reference frame is generated; and encoding the video frame based on the motion vector.
Encoding video in this way greatly reduces the amount of computation required for motion estimation, speeds up video encoding, improves video encoding efficiency, and thus effectively reduces end-to-end delay.
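For illustration only, and not as the claimed implementation, the following Python sketch shows the dispatch implied by the first aspect: the estimation path depends on whether the video frame and its reference frame come from the same view, and different pieces of rendering information are used in each path. All names are hypothetical.

```python
# Schematic of the first-aspect flow only; the two estimators stand in for the detailed
# optional implementations described below. All names are hypothetical.
def determine_motion_vector(first_rendering_info, second_rendering_info, same_view,
                            temporal_estimator, interview_estimator):
    """Choose the estimation path from whether the video frame and its reference frame
    come from the same view, then derive the motion vector from the rendering information."""
    if same_view:
        # Same view: use camera position/pose and depth of both frames (temporal case).
        return temporal_estimator(first_rendering_info, second_rendering_info)
    # Different views: use focal length, camera positions and depth (inter-view case).
    return interview_estimator(first_rendering_info, second_rendering_info)

# Toy usage with trivial stand-in estimators.
mv = determine_motion_vector({'pose': 'current'}, {'pose': 'reference'}, same_view=True,
                             temporal_estimator=lambda cur, ref: (0, 0),
                             interview_estimator=lambda cur, ref: (4, 0))
print(mv)   # -> (0, 0)
```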
Optionally, the video frame and the reference frame are generated while running a virtual reality (VR) application, the encoded data of the video frame is decoded and displayed on a terminal, and the first position and pose information and the second position and pose information are determined based on the position and pose of the terminal.
With this optional approach, data generated while the server runs the VR application can be displayed on the terminal, and the position and pose of the terminal influence the server's rendering process, achieving the purpose of VR display.
Optionally, the terminal is a mobile phone, a tablet, a wearable device, or a split device, and the split device includes a display device and a corresponding control device.
With this optional approach, the encoded video frame data can be decoded and displayed on terminals of many different forms, so the method has broad applicability.
Optionally, the wearable device is a head mounted display.
Optionally, if the video frame and the reference frame come from the same view, the first rendering information includes first position and pose information of a virtual camera when rendering the video frame and depth information of the video frame, and the second rendering information includes second position and pose information of the virtual camera when rendering the reference frame and depth information of the reference frame.
Optionally, the determining a motion vector between the video frame and the reference frame based on the first rendering information and the second rendering information includes: determining two-dimensional coordinates corresponding to pixels or image blocks in the video frame and two-dimensional coordinates corresponding to pixels or image blocks in the reference frame based on the first position and pose information of the virtual camera when rendering the video frame, the depth information of the video frame, the second position and pose information of the virtual camera when rendering the reference frame, and the depth information of the reference frame; and determining the motion vector based on the two-dimensional coordinates corresponding to the pixels or image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or image blocks in the reference frame.
Optionally, the determining the two-dimensional coordinates corresponding to the pixels or image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or image blocks in the reference frame includes: obtaining a conversion relationship between the two-dimensional coordinates and the three-dimensional coordinates corresponding to a pixel or an image block in any video frame; obtaining the two-dimensional coordinates corresponding to the pixels or image blocks in the reference frame, and converting them into three-dimensional coordinates corresponding to the pixels or image blocks in the reference frame by applying the conversion relationship, based on the second position and pose information of the virtual camera when rendering the reference frame and the depth information of the reference frame; and determining the two-dimensional coordinates corresponding to the pixels or image blocks in the video frame by applying the conversion relationship, based on the first position and pose information of the virtual camera when rendering the video frame, the depth information of the video frame, and the three-dimensional coordinates corresponding to the pixels or image blocks in the reference frame.
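As a minimal numerical sketch of the optional reprojection above (assuming a pinhole camera model with known intrinsics K and a camera-to-world pose per frame, which the claims do not prescribe), a pixel of the reference frame can be back-projected to a 3D point using its depth and then re-projected into the video frame; the coordinate difference is the candidate motion vector. All function names are illustrative.

```python
import numpy as np

def backproject(px, depth, K, R_cw, t_cw):
    """2D pixel + depth -> 3D world point, assuming a pinhole model and camera-to-world pose."""
    x = np.linalg.inv(K) @ np.array([px[0], px[1], 1.0]) * depth   # camera-space point
    return R_cw @ x + t_cw                                         # world-space point

def project(X, K, R_cw, t_cw):
    """3D world point -> 2D pixel in a frame rendered with camera-to-world pose (R_cw, t_cw)."""
    x = R_cw.T @ (X - t_cw)          # world -> camera space
    u = K @ x
    return u[:2] / u[2]

def motion_vector_from_rendering_info(px_ref, depth_ref, K, pose_ref, pose_cur):
    """Candidate MV for a pixel/block centre px_ref of the reference frame."""
    X = backproject(px_ref, depth_ref, K, *pose_ref)
    px_cur = project(X, K, *pose_cur)
    return px_cur - np.asarray(px_ref, dtype=float)   # displacement seen in the current frame

# Toy usage with made-up intrinsics and poses.
K = np.array([[800.0, 0, 640], [0, 800.0, 360], [0, 0, 1]])
pose_ref = (np.eye(3), np.zeros(3))                    # reference-frame camera
pose_cur = (np.eye(3), np.array([0.05, 0.0, 0.0]))     # camera moved 5 cm to the right
print(motion_vector_from_rendering_info((640, 360), 2.0, K, pose_ref, pose_cur))
# -> approximately [-20, 0]: the scene appears shifted 20 pixels to the left
```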
Optionally, the determining the motion vector based on the two-dimensional coordinates corresponding to the pixels or the image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or the image blocks in the reference frame includes: acquiring a coordinate difference between the two-dimensional coordinates corresponding to the pixels or the image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or the image blocks in the reference frame; based on the coordinate difference, the motion vector is determined.
Optionally, the determining the motion vector based on the coordinate difference includes: determining the coordinate difference as the motion vector.
Optionally, the determining the motion vector based on the coordinate difference further includes: searching in the reference frame according to a first search mode based on the coordinate difference, determining a first image block that meets a first search condition, and determining the motion vector based on the first image block.
With this optional approach, when the video frame and the reference frame come from the same view, the correlation between frames of that view at adjacent moments is exploited: a view displacement vector is computed from the position and pose information of the virtual camera when the video frame and the reference frame were rendered together with the depth information, and this displacement vector is either used directly as the motion vector or used as an initial motion vector for searching, finally yielding the motion vector.
Optionally, if the video frame and the reference frame come from different views, the first rendering information includes the focal length when rendering the video frame, a first position of a virtual camera, and depth information of the video frame, and the second rendering information includes a second position of the virtual camera when rendering the reference frame; the determining a motion vector between the video frame and the reference frame based on the first rendering information and the second rendering information includes: determining the motion vector based on the focal length, a distance value between the first position and the second position, and the depth information.
Optionally, the determining the motion vector based on the focal length, the distance value between the first position and the second position, and the depth information includes: determining a disparity interval between the video frame and the reference frame based on the focal length, the distance value between the first position and the second position, and the depth information; determining a target search range corresponding to the image block in the video frame based on the disparity interval; and determining the motion vector based on the target search range.
Optionally, the determining a disparity interval between the video frame and the reference frame based on the focal length, the distance value between the first position and the second position, and the depth information includes: determining a depth interval based on the depth information, where the depth interval indicates the range of depth values of pixels in the video frame; and determining the disparity interval between the video frame and the reference frame based on the focal length, the distance value between the first position and the second position, and the depth interval.
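For two parallel virtual cameras separated by a baseline B (for example, the IPD) and with focal length f in pixels, disparity and depth are related by d = f·B/Z, so a disparity interval follows from the depth interval of the frame. The sketch below uses that standard relation; the parallel-camera assumption and all names are illustrative, not taken from the claims.

```python
def disparity_interval(focal_length_px, baseline, depth_min, depth_max):
    """Disparity range (in pixels) implied by the scene depth range, assuming
    parallel left/right virtual cameras separated by `baseline` (e.g. the IPD).
    Larger depth gives smaller disparity, so the interval is [f*B/z_max, f*B/z_min]."""
    return focal_length_px * baseline / depth_max, focal_length_px * baseline / depth_min

# Toy usage: 800-pixel focal length, 63 mm IPD, scene depth between 0.5 m and 20 m.
d_min, d_max = disparity_interval(800.0, 0.063, 0.5, 20.0)
print(d_min, d_max)   # -> 2.52 and 100.8 pixels
```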
Optionally, the determining a target search range corresponding to an image block in the video frame based on the disparity interval includes: determining a first search range corresponding to the image block in the video frame based on the disparity interval, and taking the first search range as the target search range.
Optionally, the determining a target search range corresponding to an image block in the video frame based on the disparity interval includes: determining a first search range corresponding to the image block in the video frame based on the disparity interval; determining a second search range corresponding to the image block in the video frame based on a second search mode; and taking the intersection of the first search range and the second search range as the target search range.
With these optional approaches, when the video frame and the reference frame come from different views, the correlation between frames of different views at the same moment and the disparity interval between them are used to determine the target search range of the motion vector, or the disparity interval is intersected with the search range of the default search mode to narrow the target search range further, finally yielding the motion vector. This greatly reduces the amount of computation required for motion estimation and effectively improves video encoding efficiency.
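One reading of the optional intersection step, sketched under the assumption that search windows are horizontal pixel-offset intervals relative to the block position; the sign convention and the fallback behaviour are assumptions.

```python
def target_search_range(disparity_interval, default_range):
    """Intersect the disparity-derived horizontal window with the default search window.
    Both windows are (min, max) pixel offsets relative to the current block position."""
    d_min, d_max = disparity_interval
    # The matching block in the other-view frame is assumed to be shifted by -disparity
    # for this camera arrangement; the sign convention is an assumption.
    first = (-d_max, -d_min)
    lo, hi = max(first[0], default_range[0]), min(first[1], default_range[1])
    return (lo, hi) if lo <= hi else default_range   # fall back if the windows are disjoint

print(target_search_range(disparity_interval=(2.5, 100.8), default_range=(-64, 64)))
# -> (-64, -2.5): only offsets consistent with the scene's disparity are searched
```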
Optionally, in response to the acquired video frame belonging to a dependent view, a reference frame of the video frame is acquired, where the reference frame includes both a reference frame with the same view as the video frame and a reference frame with a different view from the video frame. For the reference frame with the same view as the video frame, the server determines a first motion vector corresponding to an image block in the video frame based on first position and pose information of the virtual camera when rendering the video frame, depth information of the video frame, second position and pose information of the virtual camera when rendering the reference frame, and depth information of the reference frame. For the reference frame with a different view from the video frame, the server determines a second motion vector corresponding to the image block in the video frame based on the focal length and a first position of the virtual camera when rendering the video frame, the depth information of the video frame, and a second position of the virtual camera when rendering the reference frame. A target motion vector is then determined based on the first motion vector and the second motion vector, and the video frame is encoded based on the target motion vector.
With this optional approach, the reference frames are acquired according to the view of the video frame, and the correlation between the video frame and its reference frames is considered both in the temporal dimension (earlier and later moments) and in the spatial dimension (left and right views), which can effectively improve video encoding efficiency.
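A hedged sketch of how the first (temporal) and second (inter-view) candidates might be compared for a dependent view, here using a simple SAD matching cost; the claims do not specify the comparison rule, so the cost and all names are illustrative.

```python
import numpy as np

def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return np.abs(block_a.astype(int) - block_b.astype(int)).sum()

def block_at(frame, top_left, size=16):
    y, x = top_left
    return frame[y:y + size, x:x + size]

def pick_target_mv(cur_frame, pos, temporal_ref, mv_temporal, interview_ref, mv_interview):
    """Choose between the temporal and inter-view candidate motion vectors by matching cost."""
    cur = block_at(cur_frame, pos)
    cost_t = sad(cur, block_at(temporal_ref, (pos[0] + mv_temporal[0], pos[1] + mv_temporal[1])))
    cost_v = sad(cur, block_at(interview_ref, (pos[0] + mv_interview[0], pos[1] + mv_interview[1])))
    return (mv_temporal, 'temporal') if cost_t <= cost_v else (mv_interview, 'inter-view')

# Toy usage with random frames and zero-motion candidates.
rng = np.random.default_rng(0)
f0, f1, f2 = (rng.integers(0, 256, (64, 64), dtype=np.uint8) for _ in range(3))
print(pick_target_mv(f0, (16, 16), f1, (0, 0), f2, (0, 0)))
# -> the zero-motion candidate with the lower SAD cost
```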
In a second aspect, there is provided a video encoding apparatus comprising:
an acquisition module, configured to acquire a video frame and a reference frame of the video frame, where the video frame and the reference frame come from the same view or from different views of a multi-view image;
the determining module is used for determining a motion vector of an image block in the video frame based on the first rendering information and the second rendering information; the first rendering information is rendering information when the video frame is generated, and the second rendering information is rendering information when the reference frame is generated;
an encoding module to encode the video frame based on the motion vector.
Optionally, if the video frame and the reference frame come from the same view, the first rendering information includes first position and pose information of the virtual camera when rendering the video frame and depth information of the video frame, and the second rendering information includes second position and pose information of the virtual camera when rendering the reference frame and depth information of the reference frame.
Optionally, if the video frame and the reference frame come from different views, the first rendering information includes the focal length when rendering the video frame, a first position of a virtual camera, and depth information of the video frame, and the second rendering information includes a second position of the virtual camera when rendering the reference frame; the determination module is further configured to determine the motion vector based on the focal length, a distance value between the first position and the second position, and the depth information.
Optionally, the video frame and the reference frame are generated while running a virtual reality (VR) application, the encoded data of the video frame is decoded and displayed on a terminal, and the first position and pose information and the second position and pose information are determined based on the position and pose of the terminal.
Optionally, the terminal is a mobile phone, a tablet, a wearable device, or a split device, and the split device includes a display device and a corresponding control device.
Optionally, the wearable device is a head mounted display.
Optionally, the determining module includes:
a first determining unit, configured to determine two-dimensional coordinates corresponding to pixels or image blocks in the video frame and two-dimensional coordinates corresponding to pixels or image blocks in the reference frame based on the first position and pose information of the virtual camera when rendering the video frame, the depth information of the video frame, the second position and pose information of the virtual camera when rendering the reference frame, and the depth information of the reference frame;
and the second determining unit is used for determining the motion vector based on the two-dimensional coordinates corresponding to the pixels or the image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or the image blocks in the reference frame.
Optionally, the second determining unit is configured to: acquiring a coordinate difference between the two-dimensional coordinates corresponding to the pixels or the image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or the image blocks in the reference frame; based on the coordinate difference, the motion vector is determined.
Optionally, the determining module further comprises:
a third determining unit, configured to determine a disparity interval between the video frame and the reference frame based on the focal length, a distance value between the first location and the second location, and the depth information;
a fourth determining unit, configured to determine, based on the disparity interval, a target search range corresponding to an image block in the video frame;
a fifth determining unit configured to determine the motion vector based on the target search range.
In a possible implementation manner, the third determining unit is configured to: determining a depth interval based on the depth information, wherein the depth interval is used for indicating a value range of depth values of pixels in the video frame; determining a disparity interval between the video frame and the reference frame based on the focal distance, the distance value between the first position and the second position, and the depth interval.
In a third aspect, a video coding device is provided, which includes a processor and a memory, where the memory is used to store at least one program code, and the at least one program code is loaded and executed by the processor, so that the video coding device executes the video coding method provided in the first aspect or any one of the alternatives of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, which is used for storing at least one program code, which is loaded and executed by a processor, so as to make a computer execute the video encoding method provided in the first aspect or any one of the alternatives of the first aspect.
In a fifth aspect, a computer program product or a computer program is provided, the computer program product or the computer program comprising program code which, when run on a video encoding device, causes the video encoding device to perform the video encoding method as provided in the first aspect or the various alternative implementations of the first aspect.
Drawings
Fig. 1 is a schematic diagram of a cloud VR rendering system architecture provided by an embodiment of the present application;
fig. 2 is a schematic diagram of an execution flow of a VR application according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a VR rendering projection provided by an embodiment of the present application;
fig. 4 is a schematic diagram of an image rendering process according to an embodiment of the present application;
FIG. 5 is a diagram of an image block-based coding framework according to an embodiment of the present application;
fig. 6 is a schematic hardware structure diagram of a video encoding apparatus according to an embodiment of the present application;
fig. 7 is a flowchart of a video encoding method according to an embodiment of the present application;
fig. 8 is a schematic diagram of motion estimation provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a diamond search method according to an embodiment of the present application;
fig. 10 is a flowchart of another video encoding method according to an embodiment of the present application;
fig. 11 is a flowchart of another video encoding method provided in an embodiment of the present application;
fig. 12 is a flowchart of another video encoding method provided in an embodiment of the present application;
fig. 13 is a flowchart of another video encoding method provided in an embodiment of the present application;
fig. 14 is a flowchart of another video encoding method provided in an embodiment of the present application;
fig. 15 is a flowchart of another video encoding method provided in an embodiment of the present application;
fig. 16 is a schematic structural diagram of a video encoding apparatus according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
For convenience of understanding, before describing the technical solutions provided in the embodiments of the present application, the following description will be made on the key terms related to the present application.
VR technology combines the virtual and the real. It is realized by a computer system that can create, and let users experience, virtual worlds: through virtual simulation it creates for the user a three-dimensional virtual world that reflects changes in and interactions of physical objects in real time, so that the user is immersed in that world.
Rendering, in the field of computing, is the process by which software generates an image from a model. A model is a description of a three-dimensional object or virtual scene defined in a mathematical language or data structure, and includes information such as geometry, viewpoint, texture, lighting, and shading. The image is a digital image such as a bitmap image or a vector image.
A cloud computing platform, also called a cloud platform, refers to a service built on hardware and software resources that provides computing, networking, and storage capabilities. A cloud computing platform can achieve rapid provisioning and release of configurable computing resources with low management cost and little interaction between the user and the service provider. It should be understood that a cloud computing platform can provide many types of services; for example, a cloud computing platform that provides real-time rendering services is referred to as a cloud rendering computing platform. It should be noted that, in the present application, such a cloud rendering computing platform is referred to as a server.
Cloud VR specifically refers to using the powerful computing capacity of a cloud computing platform to host, render, and even produce VR application content in the cloud, and then delivering an immersive VR experience to end users over the internet. The present application concerns the strongly interactive, real-time rendering part of cloud VR, commonly referred to as "cloud VR rendering" or "VR cloud rendering".
Depth information of a video frame is also referred to as depth information of an image. For example, when each pixel of a video frame is described by a set of binary numbers, the binary number corresponding to each pixel includes a number of bits representing color, and the number of such bits is referred to as depth information. In general, the depth information corresponding to a video frame is described by a depth map of the same size, where a depth map is an image whose pixel values are the distances from the image capture device to points in the three-dimensional scene, directly reflecting the geometry of the visible surfaces of objects. In the present application, the depth information of a video frame is used to reflect the distance of an object in the three-dimensional scene from the camera.
Video coding compresses video pictures by removing intra-frame redundancy and inter-frame redundancy, making the pictures easier to store and transmit. Accordingly, video coding includes inter-frame coding and intra-frame coding.
A motion vector arises in inter-frame coding because the scenes in video frames at adjacent moments are correlated to some degree. A video frame is therefore divided into non-overlapping image blocks, and the displacement of all pixels within a block is taken to be the same. Then, for each image block, the block most similar to the current image block, called the matching block, is found within a given search range of the reference frame according to a matching criterion, and the relative displacement between the matching block and the current image block is called a motion vector (MV).
Motion estimation is the process of obtaining motion vectors. The quality of motion estimation directly determines the size of the video coding residual, and therefore directly affects video coding efficiency, so motion estimation is one of the most important parts of the whole video coding process. Motion estimation can be regarded as finding a matching block in each reference frame using a matching criterion; the simplest method matches the current image block against every image block in the reference frame to obtain the optimal matching block. Related search methods use fast algorithms that search along a certain path within a certain range, such as the diamond search method and the hexagon search method, which significantly improve search efficiency.
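For illustration of block matching only (not the encoder actually used in the embodiments), the sketch below finds the best match for one block by exhaustively scanning a small search window with a SAD criterion; fast patterns such as the diamond search visit far fewer of these candidate offsets.

```python
import numpy as np

def full_search_mv(cur_frame, ref_frame, top_left, block=16, search=8):
    """Exhaustive block matching: return the offset (dy, dx) in the reference frame
    minimising the SAD with the current block, within +/- `search` pixels."""
    y0, x0 = top_left
    cur = cur_frame[y0:y0 + block, x0:x0 + block].astype(int)
    best, best_cost = (0, 0), float('inf')
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + block > ref_frame.shape[0] or x + block > ref_frame.shape[1]:
                continue
            cost = np.abs(cur - ref_frame[y:y + block, x:x + block].astype(int)).sum()
            if cost < best_cost:
                best_cost, best = cost, (dy, dx)
    return best, best_cost

# Toy usage: the current frame is the reference frame shifted by a known amount.
rng = np.random.default_rng(1)
ref = rng.integers(0, 256, (64, 64), dtype=np.uint8)
cur = np.roll(ref, shift=(2, -3), axis=(0, 1))
print(full_search_mv(cur, ref, (24, 24)))   # expect offset (-2, 3) with zero cost
```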
A disparity vector can be regarded as a special motion vector. In the motion estimation process above, the video frame and the reference frame belong to the same view; when they belong to different views, the relative displacement between the image blocks obtained from the video frame and the reference frame is called a disparity vector.
Disparity estimation is the process of obtaining a disparity vector. Accordingly, disparity estimation can be regarded as a special kind of motion estimation; the biggest difference between the two is the view of the reference frame: the reference frame for motion estimation belongs to the same view as the video frame, while the reference frame for disparity estimation belongs to a different view. Search methods for disparity estimation include global search algorithms, local search algorithms, and the like.
Rendering information refers to the related information used when the server generates a video frame by rendering while running the VR application. Illustratively, in the embodiments of the present application, the rendering information includes the display parameters of the terminal, the position and pose information of the virtual camera in the VR application, the depth information of the video frame and the reference frame, and the like.
Rate-distortion optimization (RDO) is a method of improving video coding quality. Specifically, it reduces the distortion (loss of video quality) of the video under a constraint on bit rate. The prediction mode is determined by a rate-distortion optimization algorithm, which selects the prediction mode with the lowest coding cost or a prediction mode whose rate-distortion meets the selection criterion.
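Mode selection under RDO is commonly written as minimising the Lagrangian cost J = D + λ·R over the candidate prediction modes. The sketch below is that generic formulation, not a specific encoder's implementation; the mode names and numbers are made up.

```python
def rdo_select(candidates, lam):
    """Pick the candidate with the lowest rate-distortion cost J = D + lambda * R.
    `candidates` maps a mode name to (distortion, rate_in_bits)."""
    return min(candidates, key=lambda m: candidates[m][0] + lam * candidates[m][1])

# Toy usage: three hypothetical prediction modes.
modes = {'intra': (120.0, 300), 'inter_temporal': (40.0, 180), 'inter_view': (60.0, 160)}
print(rdo_select(modes, lam=0.5))   # -> 'inter_temporal' (cost 40 + 0.5*180 = 130 is the lowest)
```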
The following briefly introduces VR applications and application scenarios related to the present application.
In a VR application, a user interacts through a VR terminal with the three-dimensional virtual world created by the application, so that the user feels personally present in that three-dimensional virtual world.
Compared with traditional human-computer interaction applications, VR applications have the following two main characteristics:
First, a more natural interaction mode. In traditional human-computer interaction applications, interaction is carried out mainly through key events from devices such as a keyboard, a mouse, or a handheld controller. In VR applications, the interaction dimensions also include positioning and sensing events in addition to key events: the VR application can render the scene according to the positions and pose information of the user's head and hands, so that the picture the user sees switches and changes as the user's head rotates or position moves, achieving a more natural interaction mode.
Second, stereoscopic and immersive vision. When a VR application renders each video frame, it generates two video frames with a certain parallax, which are finally presented to the user's left and right eyes respectively through the VR terminal, achieving a stereoscopic visual effect. Moreover, the VR terminal uses a closed enclosure that isolates the user's vision and hearing from the outside, so the user obtains an immersive audiovisual experience.
The video coding method provided by the embodiment of the application can be applied to VR games, VR teaching, VR cinemas and other VR application scenes. Exemplary scenarios to which the video coding method provided in the embodiments of the present application can be applied include, but are not limited to, the following.
Scene one, VR Game
With their highly realistic simulation and experience, VR games are increasingly popular. Using a VR terminal, a user can enter an interactive virtual game world and remains inside it no matter how the user turns their gaze. In this scenario, the server running the VR game application needs to render left-eye and right-eye game video in real time according to the user's interaction events, encode the rendered video, and send it to the VR terminal.
Scene two, VR teaching
At present, applying VR technology to scenarios such as classroom teaching or skills training has been a key direction in the education field in recent years. For example, a VR laboratory built with VR technology lets students carry out various experiments through a VR terminal without leaving home, obtaining an experience equivalent to a real experiment. As another example, a VR classroom built with VR technology lets students who cannot travel still take part in class through a VR terminal, ensuring the teaching effect. In this scenario, the server running the VR teaching application needs to render left-eye and right-eye teaching video in real time according to the interaction events of students or teachers, encode the rendered video, and send it to the VR terminal.
Scene three, VR cinema
Currently, in a VR theater built with VR technology, a user can enter a virtual movie world through a VR terminal and remains inside it no matter how the user turns their gaze, as if personally present in the virtual movie world. In this scenario, the server running the VR cinema application needs to render left-eye and right-eye movie video in real time according to the user's interaction events, encode the rendered video, and send it to the VR terminal.
It should be noted that the foregoing scenarios are only exemplary descriptions, and the video encoding method provided in the embodiment of the present application can also be applied to other VR application scenarios, for example, a VR browser, a VR viewing room, VR medicine, and the like, which is not limited in the embodiment of the present application.
The following describes a system framework of a video encoding method provided in an embodiment of the present application.
Referring to fig. 1, fig. 1 is a schematic diagram of a cloud VR rendering system architecture to which the embodiment of the present application is applied. As shown in fig. 1, the video encoding method provided in the embodiment of the present application is applied to a cloud VR rendering system 100. The cloud VR rendering system 100 includes a server 110 and a terminal 120, and the server 110 and the terminal 120 can be directly or indirectly connected through a wireless network or a wired network, which is not limited herein.
Optionally, the server 110 generally refers to one of a plurality of servers, a collection of servers, or a distributed system; the terminal 120 generally refers to one of a plurality of terminals or a set of terminals. Fig. 1 shows a cloud VR rendering system 100 including one server 110 and one terminal 120, which is only schematic; it should be understood that if the server 110 or the terminal 120 is a set of multiple devices, the cloud VR rendering system 100 may include other servers or terminals. The number and types of servers or terminals in the cloud VR rendering system 100 are not limited in the present application.
Optionally, the wireless or wired networks described above use standard communication techniques and/or protocols. The network is typically the internet, but can be any network, including but not limited to a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile, wired, or wireless network, a private network, a virtual private network, or any combination thereof. In some embodiments, data exchanged over the network is represented using techniques and/or formats such as hypertext markup language (HTML) and extensible markup language (XML). In addition, all or some of the links can be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques can also be used in place of, or in addition to, the data communication techniques described above.
Illustratively, the server 110 includes a VR application 111, a VR interface 112, and a VR driver server 113.
1. VR application 111
The VR application 111 is configured to obtain a current interaction event and rendering parameters of a user, perform image rendering based on the interaction event and the rendering parameters, generate a video frame, and transmit the video frame to the VR interface 112.
Optionally, the user's current interaction event includes key input, head/hand position or posture information, and the like. The rendering parameters include rendering resolution and projection matrices for left and right eyes, etc.
Please refer to fig. 2 for an operation process of the VR application 111, and fig. 2 is a schematic diagram of an execution flow of the VR application according to an embodiment of the present application. Illustratively, taking the terminal 120 as including a head-mounted display (HMD) as an example, the running process of the VR application 111 includes the following two stages.
The first stage, initialization stage. The initialization phase is used to determine rendering parameters for VR rendering. Since the video frames generated by the VR application are ultimately rendered on the head mounted display for presentation, and the display specifications of different head mounted displays differ, in order for the rendered video frames to be properly displayed on different head mounted displays, the VR application needs to determine rendering parameters consistent with the current head mounted display specification at initialization.
The rendering parameters generally include the following two parameters. One is the rendering resolution, which is generally consistent with the display resolution of the head-mounted display, so that scaling on the terminal can be avoided. The other is the projection matrix for the left and right eyes, which is calculated from the field of view (FOV), the interpupillary distance (IPD), the display resolution, and the rendering depth range of the VR application (i.e., the Z-near and Z-far clipping planes). When the VR application renders a video frame according to these rendering parameters, the frame's viewing angle range is consistent with the FOV of the head-mounted display, and the parallax between the rendered left-eye and right-eye video frames is consistent with the IPD of the head-mounted display. For an exemplary rendering depth range of a VR application, refer to fig. 3; fig. 3 is a schematic diagram of a VR rendering projection applicable to the embodiment of the present application.
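For reference, a symmetric perspective projection built from a vertical FOV, aspect ratio, and near/far clipping planes is sketched below (OpenGL-style convention). Real HMD runtimes typically supply off-axis per-eye projections plus an IPD offset between the two eye cameras, so this is a simplification with assumed values.

```python
import math
import numpy as np

def perspective(fov_y_deg, aspect, z_near, z_far):
    """Symmetric OpenGL-style perspective projection matrix (a simplification of the
    off-axis per-eye projections an HMD runtime would normally provide)."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    m = np.zeros((4, 4))
    m[0, 0] = f / aspect
    m[1, 1] = f
    m[2, 2] = (z_far + z_near) / (z_near - z_far)
    m[2, 3] = (2.0 * z_far * z_near) / (z_near - z_far)
    m[3, 2] = -1.0
    return m

def eye_view_offsets(ipd):
    """Horizontal offsets of the left/right virtual cameras, assumed to be +/- IPD/2."""
    return -ipd / 2.0, +ipd / 2.0

# Toy usage: 110-degree FOV, square per-eye resolution, 0.1 m near and 1000 m far planes.
print(perspective(110.0, 2160 / 2160, 0.1, 1000.0))
print(eye_view_offsets(0.063))
```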
The second stage, the rendering stage. The rendering stage is to render each VR video frame. Before each frame is rendered, the VR application acquires the current interaction event of the user, performs application logic processing, and finally renders and submits the video frames of the left eye and the right eye.
Referring to fig. 4, the rendering process in the rendering stage is briefly described below, taking a VR game scenario as an example. Fig. 4 is a schematic diagram of an image rendering process applicable to the embodiment of the present application. As shown in fig. 4, first, the VR application processes data for events related to the game content, such as position calculation and collision detection; then, the VR application sends the game data (vertex coordinates, normal vectors, textures and texture coordinates, etc.) over a data bus to a graphics programming interface such as OpenGL; next, the VR application converts the three-dimensional coordinates corresponding to the video frame into two-dimensional screen coordinates; finally, the VR application performs primitive assembly, rasterization, pixel shading, and so on, and renders the video frame into a frame buffer.
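The step that converts three-dimensional coordinates into two-dimensional screen coordinates corresponds to the standard model-view-projection transform followed by perspective division and viewport mapping. A minimal sketch for a single vertex, with made-up matrices and viewport size:

```python
import numpy as np

def to_screen(vertex_world, view, proj, width, height):
    """World-space vertex -> clip space (MVP) -> NDC (perspective divide) -> screen pixels."""
    v = np.append(np.asarray(vertex_world, dtype=float), 1.0)   # homogeneous coordinates
    clip = proj @ (view @ v)
    ndc = clip[:3] / clip[3]                                    # perspective division
    x = (ndc[0] * 0.5 + 0.5) * width                            # viewport mapping
    y = (1.0 - (ndc[1] * 0.5 + 0.5)) * height
    return x, y

# Toy usage: identity view matrix, a 90-degree FOV projection with 0.1/1000 near/far planes
# (values rounded), and a 1280x720 viewport.
view = np.eye(4)
proj = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, -1, -0.2], [0, 0, -1, 0]], dtype=float)
print(to_screen((0.25, 0.1, -2.0), view, proj, 1280, 720))   # -> approximately (720.0, 342.0)
```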
2. VR interface 112
The VR interface 112 is configured to obtain the display parameters and the interaction events, perform rendering parameter calculation and interaction event processing, transmit the obtained rendering parameters and interaction events to the VR application 111, and transmit the video frames received from the VR application 111 to the VR driver server 113.
In the data interaction between the VR application 111 and the terminal 120, a unified standard interface such as OpenVR is defined in the industry to avoid pairwise adaptation. Schematically, the VR interface 112 shown in fig. 1 is a VR runtime, an adaptation layer between the VR application 111 and the terminal 120; the interfaces between the VR runtime and the VR application 111, and between the VR runtime and the terminal 120, can both use this standard interface, so the VR application 111 and the terminal 120 interact in a standard way without needing to adapt to each other. As shown in fig. 1, the VR interface 112 includes the following three modules: a rendering parameter calculation module, configured to calculate the aforementioned rendering parameters according to the rendering depth range determined by the VR application 111 and the display parameters of the terminal 120; a video frame processing module, configured to pass through the rendered video frame directly or forward it after additional processing (such as distortion correction); and an interaction event processing module, configured to pass through the interaction event directly or forward it after additional processing (such as prediction or smoothing).
It should be noted that the VR interface 112 is not strictly necessary in the whole system. For example, if a VR application is custom-developed for a specific terminal, the VR application can obtain the display parameters and interaction events through the interface provided by the terminal's system driver and submit the rendered video frames; without a VR interface, the rendering parameter calculation described above can be performed on the VR application side or on the terminal's system driver side.
3. VR driver server 113
The VR driver 113 is configured to receive the display parameters and the interaction events transmitted by the terminal 120 through the network, transmit the display parameters and the interaction events to the VR interface 112, encode the video frame submitted by the VR application 111, and transmit the encoded video frame data to the terminal 120 through the network.
Optionally, in the cloud VR rendering system 100, the VR driver 113 compresses the left-eye video frame and the right-eye video frame in a video coding manner. The related art video compression standard is an image block-based coding framework, under which a coded image is divided into non-overlapping image blocks and is coded in units of image blocks, for example, a common h.26x-series coding standard is a hybrid coding method based on image blocks.
Referring to fig. 5, fig. 5 is a schematic diagram of an image block-based coding framework to which an embodiment of the present application is applicable. As shown in fig. 5, in the encoding framework, an encoder in the VR driver 113 performs image blocking on a video frame to be encoded, performs motion estimation on the image block based on a reference frame to obtain a motion vector, performs motion compensation according to the motion vector to obtain residual data, performs Discrete Cosine Transform (DCT) on the residual data, quantizes a corresponding coefficient, and converts the quantized DCT coefficient into a binary codeword to implement entropy encoding. In addition, the encoder performs inverse quantization and inverse transformation on the quantized DCT coefficient to obtain a reconstructed residual error, and a new reference frame is generated and stored in the frame buffer by combining the image block after motion compensation.
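The per-block loop of the framework in fig. 5 can be summarised as: predict by motion compensation, transform and quantise the residual, entropy-code the result, then inverse-quantise and inverse-transform and add the prediction back to build the reference used for later frames. The sketch below mimics that loop for one block using SciPy's DCT (assumed available); it is a schematic of the framework, not a standards-compliant encoder.

```python
import numpy as np
from scipy.fft import dctn, idctn

def encode_block(cur_block, pred_block, qstep=8):
    """Transform and quantise the prediction residual; return the quantised coefficients
    and the reconstructed block used to build the new reference frame."""
    residual = cur_block.astype(float) - pred_block.astype(float)
    coeffs = dctn(residual, norm='ortho')            # DCT of the residual
    q = np.round(coeffs / qstep)                     # quantisation (entropy coding would follow)
    recon_residual = idctn(q * qstep, norm='ortho')  # inverse quantisation + inverse DCT
    recon_block = np.clip(pred_block + recon_residual, 0, 255)
    return q, recon_block

# Toy usage: a motion-compensated prediction, simulated here as the block plus small noise.
rng = np.random.default_rng(2)
cur = rng.integers(0, 256, (16, 16))
pred = cur + rng.integers(-3, 4, (16, 16))
q, recon = encode_block(cur, pred)
print(np.count_nonzero(q), np.abs(recon - cur).max())   # few non-zero coefficients, small error
```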
Illustratively, the terminal 120 includes an interaction device 121, a system driver 122, and a VR driver client 123.
The interactive device 121 is configured to capture a current interactive event of the user, and decode and present video frame data to the user based on the video frame data received from the server 110.
Optionally, taking as an example a terminal 120 that has both display and control functions, the interaction device 121 generally includes two devices related to the user: a head-mounted display and a controller handle. A positioning sensor and input keys are arranged on the head-mounted display and the controller handle. The positioning sensor is used to sense and collect position information (position coordinates in three-dimensional space) and posture information (azimuth angle data in three-dimensional space) of the user's head and/or hands. The input keys provide control functions for the VR application: for example, the input keys are keys or a joystick on the controller handle, or keys on the head-mounted display, which the user can operate to control the VR application.
The system driver 122 is configured to transmit the current interaction event of the user and the display parameters of the terminal 120 to the VR driver client 123, and transmit the received video frame to the interaction device 121.
The VR driver client 123 is configured to transmit the interaction event and the display parameter to the server 110 through the network, and decode the received video frame data and transmit the decoded video frame data to the system driver 122.
The above describes each functional module of the device in the cloud VR rendering system architecture, and a brief description of the hardware structure of the server 110 is provided below. The embodiment of the present application provides a video encoding device, which can be configured as the server 110 in the cloud VR rendering system architecture. Referring to fig. 6, fig. 6 is a schematic diagram of a hardware structure of a video encoding apparatus according to an embodiment of the present application. As shown in fig. 6, the video coding device 600 comprises a processor 601 and a memory 602, wherein the memory 602 is used for storing at least one program code, and the at least one program code is loaded by the processor 601 and executes the video coding method according to the following embodiments.
The processor 601 may be a network processor (NP), a central processing unit (CPU), an application-specific integrated circuit (ASIC), or an integrated circuit for controlling the execution of the program of the present application. The processor 601 may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. The number of processors 601 may be one or more.
The memory 602 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The processor 601 and the memory 602 may be separately provided or integrated together.
The server 600 also includes a transceiver. The transceiver is used to communicate with other devices or communication networks, which may be, but is not limited to, ethernet, Radio Access Network (RAN), Wireless Local Area Network (WLAN), etc.
The following briefly introduces a terminal related to an embodiment of the present application.
In the embodiment of the present application, the position and pose of the terminal are determined by the user's current interaction event. An interaction event is any adjustment, or combination of adjustments, that the user makes to the position and pose of the terminal. It should be understood that when the server runs the VR application to render and generate video frames, the virtual cameras in the VR application simulate the user's left and right eyes respectively; that is, the pictures captured by the virtual cameras simulate the VR pictures seen by the user's two eyes. The position and pose of the virtual camera in the VR application are determined by the terminal used by the user: the user can adjust the position and pose of the terminal to trigger an adjustment event, so that the position and pose of the virtual camera are adjusted accordingly and the adjusted VR picture is displayed on the terminal.
Optionally, the terminal is a mobile phone, a tablet, a wearable device, or a split device, and the split device includes a display device and a corresponding control device.
For example, if the terminal is a mobile phone, the mobile phone is connected to the server through a network. In response to the user's clicking, sliding, moving, and other operations on the phone, the phone generates an interaction event and sends it to the server; the server adjusts the position and pose of the virtual camera in the VR application according to the received interaction event and renders a corresponding video frame, and the encoded data of the video frame can be decoded and displayed on the phone. It should be understood that the implementation when the terminal is a tablet is similar to that of a mobile phone, and is therefore not described again.
As another example, the terminal is a wearable device, which is a head-mounted display. The head-mounted display is connected with the server through a network, the head-mounted display responds to the actions of rotation, nodding and the like of the head of a user, generates an interaction event and sends the interaction event to the server, the server adjusts the position and the posture of a virtual camera in VR application according to the received interaction event and renders to generate a corresponding video frame, and data coded by the video frame can be decoded on the head-mounted display to be displayed.
For another example, the terminal is a split device, the display device is a mobile phone, a tablet or an electronic display screen, and the control device is a handle or a key. The split type equipment is connected with the server through a network, the split type equipment responds to the operation of a user on the control equipment, generates an interaction event and sends the interaction event to the server, the server adjusts the position and the posture of a virtual camera in the VR application according to the received interaction event and renders to generate a corresponding video frame, and data coded by the video frame can be decoded and displayed on the display equipment.
In the process, the data generated when the server runs the VR application is displayed through the terminal, and the rendering process of the server can be influenced by the position and the posture of the terminal, so that the purpose of VR display is achieved. Furthermore, the data after the video frame coding can be decoded and displayed on terminals with various different forms, so the application has wider applicability.
In addition, a first point to note is that, in a cloud VR rendering system, the encoding and decoding of video frames and their transmission add to the end-to-end delay; VR applications are strongly interactive applications that are extremely sensitive to delay, and delay is an important indicator affecting the experience. If the network transmission part is not considered, cloud-side processing delay can account for more than 60% of the whole end-to-end delay, and within the cloud-side processing delay, encoding takes roughly as long as rendering. A second point to note is that the lossy video encoding process degrades image quality and affects the final subjective experience; this needs to be mitigated by improving encoding efficiency (compression ratio) or setting a higher encoding bit rate. Therefore, a video encoding method that improves both the performance (i.e., the image quality at the same bit rate) and the speed of cloud-side video encoding is needed.
Based on the system framework of the cloud VR rendering system above, the present application provides a video encoding method. When determining the motion vector, the method consults the rendering information involved when the server renders the video frame, so that the motion vector can be obtained through temporal motion estimation and/or inter-view motion estimation and the video frame can then be encoded. This greatly increases video encoding speed, effectively improves video encoding efficiency, and reduces end-to-end delay.
The following describes a video encoding method provided in an embodiment of the present application based on the following embodiments.
Referring to fig. 7, fig. 7 is a flowchart of a video encoding method according to an embodiment of the present application. In the embodiment of the present application, the video encoding method is performed by the server 110 shown in fig. 1, for example, the video encoding method can be applied to the VR driver server 113 of the server 110. Illustratively, the video encoding method includes the following steps 701 to 703.
701. The server acquires a video frame and a reference frame of the video frame, where the video frame and the reference frame are from the same view in a multi-view image.
In an embodiment of the application, the video frames and the reference frames are generated by a server running a VR application. The video frame is a video frame to be coded, and the data coded by the video frame is used for decoding and displaying on the terminal. The reference frame is a coded video frame, i.e. a video frame before the video frame to be coded.
When the server runs the VR application, at least one multi-view image needs to be generated for the VR picture rendered at each moment, where the multi-view image includes a video frame corresponding to the left-eye view and a video frame corresponding to the right-eye view, and both video frames are texture images. It should be noted that the video frame acquired by the server may be from either the left-eye view or the right-eye view, which is not limited in the present application.
For convenience of expression, a view whose video frames only refer to video frames of the same view during encoding is referred to as a reference view, and a view whose video frames refer both to video frames of the same view and to video frames of another view is referred to as a dependent view. A specific embodiment of this step is described below, taking the case where the video frame acquired by the server belongs to the reference view. Schematically, this step 701 includes: in response to the acquired video frame belonging to the reference view, the server acquires a reference frame of the video frame, where the view of the video frame is the same as that of the reference frame.
In response to the acquired video frame belonging to the reference view, the server acquires, from the encoded multi-view images, at least one video frame that also belongs to the reference view and uses it as the reference frame of the video frame. Illustratively, taking the video frame corresponding to the reference view at time t as an example, the reference frames acquired by the server include the video frames corresponding to the reference view at times t-1, t-2, ..., t-n, where n is greater than 1. That is, the server uses video frames corresponding to the reference view before time t as reference frames.
In some embodiments, the video frames each carry a view identifier indicating a view of the video frame. The process of acquiring the reference frame by the server comprises the following steps: the server determines the video frame as a reference view based on the view identifier of the video frame, and then the server acquires at least one video frame which is also the reference view from the multi-view image by taking the view identifier of the reference view as an index, and takes the video frame as a reference frame of the video frame.
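Purely as an illustration, the following Python sketch shows one way encoded video frames could be indexed by view identifier and time so that the temporal reference frames described above can be looked up; the class, method and field names are hypothetical and are not defined by this application.

```python
from collections import defaultdict

class ReferenceFrameCache:
    """Hypothetical cache of encoded video frames, indexed by view identifier."""

    def __init__(self, max_refs=4):
        self.max_refs = max_refs                # n: number of past frames kept per view
        self.frames = defaultdict(list)         # view_id -> list of (timestamp, frame)

    def add_encoded_frame(self, view_id, timestamp, frame):
        self.frames[view_id].append((timestamp, frame))
        self.frames[view_id] = self.frames[view_id][-self.max_refs:]   # keep only the most recent frames

    def temporal_references(self, view_id, timestamp):
        """Video frames of the same view encoded before `timestamp` (times t-1, ..., t-n)."""
        return [frame for (t, frame) in self.frames[view_id] if t < timestamp]
```

The frames returned by temporal_references, together with the video frame to be encoded, would then feed the motion estimation described below.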
It should be noted that, in a VR application scenario, most of the display content of video frames of the same view at consecutive times overlaps. Through the above step 701, the server acquires the corresponding reference frame specifically for a video frame of the reference view and takes into account the correlation between the video frame and the reference frame in the time dimension (consecutive times), so that video coding efficiency can be effectively improved.
702. The server determines a motion vector of an image block in the video frame based on the first rendering information and the second rendering information; wherein the first rendering information includes a first position and pose information of a virtual camera when rendering the video frame, and depth information of the video frame; the second rendering information includes a second position of the virtual camera when rendering the reference frame, pose information, and depth information of the reference frame.
In the embodiment of the present application, the first position and orientation information and the second position and orientation information are determined based on the position and orientation of the terminal. The first position and the second position are positions of the virtual camera at different times from the same perspective. The depth information is used for indicating depth values corresponding to pixels in the video frame or the reference frame and is generated when the server runs the VR application.
It should be noted that the process of the server performing the step 702 to determine the motion vector may be referred to as temporal motion estimation, that is, the server performs motion estimation between the video frame and the reference frame at different time instances of the same view in a time dimension. Referring to fig. 8, fig. 8 is a schematic diagram of motion estimation provided in an embodiment of the present application. As shown in fig. 8, after acquiring a video frame and a reference frame, if the video frame and the reference frame are from the same view angle, the server performs temporal motion estimation, so as to obtain a motion vector of the present application.
In some embodiments, this step 702 includes, but is not limited to, step 7021 and step 7022 as follows.
7021. The server determines two-dimensional coordinates corresponding to pixels or image blocks in the video frame and two-dimensional coordinates corresponding to pixels or image blocks in the reference frame based on a first position of the virtual camera when rendering the video frame, attitude information, depth information of the video frame, a second position of the virtual camera when rendering the reference frame, attitude information, and depth information of the reference frame.
In the embodiment of the present application, the server provides a calculation function for the motion vector. This step 7021 is described in detail below, including but not limited to the following steps 7021-1 through 7021-3.
7021-1, the server obtains a conversion relation between two-dimensional coordinates and three-dimensional coordinates corresponding to pixels or image blocks in any video frame.
The conversion relation is a conversion formula for rendering a three-dimensional point (x, y, z) of the real world (adopting a world coordinate system) to a pixel point (u, v) on a two-dimensional screen (adopting a camera coordinate system). It should be understood that, for simplicity of calculation, in a video encoding process, a motion vector of a video frame is usually calculated in units of image blocks, and displacement amounts of all pixels in an image block are considered to be the same, that is, for any image block in the video frame, a two-dimensional coordinate and a three-dimensional coordinate corresponding to the image block may be determined based on a two-dimensional coordinate and a three-dimensional coordinate corresponding to a pixel in the image block.
This conversion relationship is explained below with reference to the following formula (1):
d \cdot [u\ v\ 1]^{\top} = K \cdot T \cdot [x\ y\ z\ 1]^{\top} \qquad (1)
In the formula, T represents the conversion matrix from the world coordinate system to the camera coordinate system and is determined by the position and orientation of the camera in the world coordinate system; K represents the camera calibration matrix and is determined by built-in parameters of the camera such as the FOV and display resolution; d represents the depth value of the pixel (u, v).
7021-2, the server obtains two-dimensional coordinates corresponding to the pixels or the image blocks in the reference frame, and applies the conversion relationship to convert the two-dimensional coordinates corresponding to the pixels or the image blocks in the reference frame into three-dimensional coordinates corresponding to the pixels or the image blocks in the reference frame based on a second position of the virtual camera when rendering the reference frame, attitude information, and depth information of the reference frame.
7021-3, the server determines two-dimensional coordinates corresponding to pixels or image blocks in the video frame by applying the transformation relation based on the first position and attitude information of the virtual camera when rendering the video frame, the depth information of the video frame, and three-dimensional coordinates corresponding to pixels or image blocks in the reference frame.
The two-dimensional coordinates corresponding to a pixel or image block in the video frame and the two-dimensional coordinates corresponding to the matching pixel or image block in the reference frame correspond to the same three-dimensional coordinate in the world coordinate system.
Taking the case where the two-dimensional coordinate corresponding to a certain image block (i.e., pixel point) in the reference frame is (u1, v1) and the two-dimensional coordinate corresponding to a certain image block (i.e., pixel point) in the video frame is (u2, v2) as an example, the above step 7021-2 and step 7021-3 are explained below.
Illustratively, based on the second position and attitude information of the virtual camera in the VR application when rendering the reference frame and the depth information of the reference frame, the server applies the above conversion relationship to obtain the following formula (2), and obtains from formula (2) the three-dimensional coordinate (x, y, z) corresponding to the two-dimensional coordinate (u1, v1). Further, based on the first position and attitude information of the virtual camera in the VR application when rendering the video frame, the depth information of the video frame, and the above three-dimensional coordinate (x, y, z), the server applies the above conversion relationship to obtain the following formula (3). Formula (2) and formula (3) are as follows:
d_1 \cdot [u_1\ v_1\ 1]^{\top} = K \cdot T_1 \cdot [x\ y\ z\ 1]^{\top} \qquad (2)

d_2 \cdot [u_2\ v_2\ 1]^{\top} = K \cdot T_2 \cdot [x\ y\ z\ 1]^{\top} \qquad (3)
In the above formulas (2) and (3), T1 represents the transformation matrix corresponding to the reference frame, d1 represents the depth value corresponding to the image block in the reference frame, T2 represents the transformation matrix corresponding to the video frame, and d2 represents the depth value corresponding to the image block in the video frame.
By combining the above formula (2) and formula (3), the following formula (4) can be obtained:
d_2 \cdot [u_2\ v_2\ 1]^{\top} = K \cdot T_2 \cdot T_1^{-1} \cdot K^{-1} \cdot d_1 \cdot [u_1\ v_1\ 1]^{\top} \qquad (4)
According to formula (4), the two-dimensional coordinate (u2, v2) corresponding to a certain image block in the video frame can be calculated.
It should be noted that, in the above steps 7021-1 to 7021-3, the two-dimensional coordinate in the video frame is determined based on the three-dimensional coordinate corresponding to the two-dimensional coordinate in the reference frame; that is, through the above steps 7021-1 to 7021-3, the server can determine the mapping relationship between the two-dimensional coordinate (u1, v1) and the two-dimensional coordinate (u2, v2), which provides the basis for subsequently determining the motion vector.
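As an informal illustration of formulas (2) to (4), the following Python/numpy sketch back-projects a block of the reference frame to a world point and re-projects it into the video frame, yielding the coordinate difference used below; it assumes K is a 3x3 calibration matrix and T1, T2 are 4x4 world-to-camera matrices, and all function names are illustrative.

```python
import numpy as np

def block_to_world(u, v, d, K, T):
    """Back-project pixel (u, v) with depth d into world coordinates (formula (2))."""
    cam = np.linalg.inv(K) @ (d * np.array([u, v, 1.0]))     # camera-space point
    world = np.linalg.inv(T) @ np.append(cam, 1.0)           # undo the world-to-camera transform
    return world[:3] / world[3]

def world_to_pixel(point, K, T):
    """Project a world point into pixel coordinates (formula (3))."""
    cam = (T @ np.append(point, 1.0))[:3]
    uvw = K @ cam
    return uvw[:2] / uvw[2]                                  # (u2, v2); uvw[2] is the depth d2

def view_displacement_vector(u1, v1, d1, K, T_ref, T_cur):
    """Coordinate difference (u1 - u2, v1 - v2) of a block between the reference frame and the video frame (formula (4))."""
    world = block_to_world(u1, v1, d1, K, T_ref)
    u2, v2 = world_to_pixel(world, K, T_cur)
    return np.array([u1 - u2, v1 - v2])
```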
7022. And the server determines the motion vector based on the two-dimensional coordinates corresponding to the pixels or the image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or the image blocks in the reference frame.
In the present embodiment, this step 7022 includes, but is not limited to, the following steps 7022-1 and 7022-2.
7022-1, the server obtains a coordinate difference between the two-dimensional coordinates corresponding to the pixels or image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or image blocks in the reference frame.
Schematically, taking the case where the two-dimensional coordinate corresponding to a certain pixel or image block in the reference frame is (u1, v1) and the two-dimensional coordinate corresponding to the corresponding pixel or image block in the video frame is (u2, v2) as an example, the coordinate difference is (u1 - u2, v1 - v2). It should be noted that this coordinate difference is only illustrative; in some embodiments, the coordinate difference may be expressed as (u2 - u1, v2 - v1), and in other embodiments, the coordinate difference may also be expressed in other vector forms. The calculation manner and expression form of the coordinate difference are not limited in the present application.
7022-2, the server determines the motion vector based on the coordinate difference.
In some embodiments, the manner in which the server determines the motion vector includes the following two manners.
First, the server determines the coordinate difference as a motion vector.
It should be understood that, in a VR application scene, two consecutive video frames of the same view are usually two images of the same scene generated after the user's head has rotated by a small angle, and within the interval of two frames the rotation angle is usually not large. As a result, most of the image content of the two consecutive video frames overlaps, with only a small displacement between them. The coordinate difference obtained by the server on the basis of step 7021 is the displacement vector between an image block in the reference frame and the corresponding image block in the video frame at consecutive times of the same view, and is also referred to as the view displacement vector. In this step 7022-2, the server uses the view displacement vector as the motion vector for the subsequent encoding of the video frame.
Secondly, the server searches in the reference frame according to the coordinate difference and a first searching mode, determines a first image block meeting a first searching condition, and determines a motion vector based on the first image block.
And the server takes the visual angle displacement vector corresponding to the coordinate difference as an initial motion vector, takes the initial motion vector as a search starting point, and searches in the reference frame according to a first search mode. Optionally, the first search method is a diamond search method (DS), a global search method, a hexagonal search method, a rectangular search method, and the like, which is not limited in this application.
In some embodiments, when the number of the reference frames is greater than 1, for any image block in the video frame, the server is able to respectively identify a first image block meeting the first search condition from each reference frame. The server determines at least two motion vectors based on the at least two first image blocks, predicts the coding cost of the video frame by calling an RDO algorithm to obtain at least two corresponding prediction modes, and takes the motion vector corresponding to the prediction mode with the minimum coding cost in the at least two prediction modes as the motion vector of the application.
An embodiment of determining a motion vector by a server is described below by taking a diamond search method as an example, and fig. 9 is a schematic diagram of a diamond search method provided in an example of the present application. As shown in the left diagram of fig. 9, the first center position in the diagram is the image block corresponding to the initial motion vector, and the diamond search method includes the following steps one to three.
Step one, the server determines a Minimum Block Deviation (MBD), which is called a first MBD, from 9 image blocks including the first center position by using a large diamond search template (LDSP) with the first center position as a starting point, and if the first MBD is an image block corresponding to the first center position, the server performs the following step three, and if not, the server performs the following step two.
And step two, the server takes the first MBD as a second central position, applies LDSP again, determines a second MBD from 9 image blocks including the first MBD, if the second MBD is the image block corresponding to the second central position, the server executes the following step three, and if not, the server repeatedly executes the step two.
And step three, the server uses the current MBD (the first MBD or the second MBD) as a third center position, applies a small diamond search template (SDSP), determines a third MBD from 5 image blocks including the current MBD, if the third MBD is an image block corresponding to the third center position, the server calculates a motion vector by using the third MBD, and if not, the server repeatedly executes the step three.
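A compact Python sketch of the diamond search described in steps one to three is given below; it uses the sum of absolute differences (SAD) as the block-matching cost and starts from the block position indicated by the initial (view displacement) motion vector. The function names, the boundary handling and the 16x16 block size are assumptions made only for illustration.

```python
import numpy as np

# large and small diamond search templates (offsets relative to the current centre)
LDSP = [(0, 0), (0, -2), (0, 2), (-2, 0), (2, 0), (-1, -1), (-1, 1), (1, -1), (1, 1)]
SDSP = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]

def sad(cur_block, ref, x, y, bs):
    """Sum of absolute differences between the current block and the reference block at (x, y)."""
    h, w = ref.shape
    if x < 0 or y < 0 or x + bs > w or y + bs > h:
        return np.inf                                    # candidate lies outside the reference frame
    patch = ref[y:y + bs, x:x + bs]
    return int(np.abs(cur_block.astype(np.int32) - patch.astype(np.int32)).sum())

def diamond_search(cur_block, ref, start_x, start_y, bs=16):
    """Return the position of the best-matching block in the reference frame."""
    cx, cy = start_x, start_y
    # steps one and two: repeat the large template until the minimum-cost block is at the centre
    while True:
        cost, dx, dy = min((sad(cur_block, ref, cx + dx, cy + dy, bs), dx, dy) for dx, dy in LDSP)
        if cost == np.inf or (dx, dy) == (0, 0):
            break
        cx, cy = cx + dx, cy + dy
    # step three: refine with the small template
    while True:
        cost, dx, dy = min((sad(cur_block, ref, cx + dx, cy + dy, bs), dx, dy) for dx, dy in SDSP)
        if cost == np.inf or (dx, dy) == (0, 0):
            break
        cx, cy = cx + dx, cy + dy
    return cx, cy
```

The motion vector is then the offset between the returned best-match position and the position of the image block in the video frame.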
Illustratively, with reference to fig. 8, as shown in fig. 8, when performing time domain motion estimation, the server performs view displacement vector calculation according to the position and posture information of the virtual camera in the VR application when rendering the video frame and the reference frame, and the depth information of the video frame and the reference frame, to obtain a view displacement vector, and determines a motion vector by using the view displacement vector.
In step 702, the server determines the motion vector according to the coordinate difference between the two-dimensional coordinates corresponding to pixels or image blocks in the video frame and the two-dimensional coordinates corresponding to pixels or image blocks in the reference frame. This way of determining the motion vector exploits the correlation between the video frame and the reference frame at consecutive times of the same view: the view displacement vector is calculated from the position and attitude information of the virtual camera in the VR application when rendering the video frame and the reference frame, together with their depth information, and the motion vector is then obtained either by using the view displacement vector directly as the motion vector or by searching with the view displacement vector as the initial motion vector.
703. The server encodes the video frame based on the motion vector.
In an embodiment of the application, the server provides an encoding function capable of encoding the video frame based on the motion vector.
Alternatively, the encoding function of the server is implemented by an encoder, which may be an encoder built in the server, for example, a hardware encoder built in the server, etc.; or may be a software encoder installed on the server. This is not a limitation of the present application. The above-mentioned motion vector calculation process may be performed by an encoder, or may be performed after calculation outside the encoder and then transferred to the encoder. It should be understood that in the encoding standard, the calculation of the motion vector belongs to an open module, and the encoding standard does not limit the way of calculating the motion vector of the encoder, as long as the motion vector in a specified format can be output. Therefore, the video encoding method provided by the embodiment of the application has wide applicability and can be applied to various encoders in the related art.
In the video encoding method provided by the embodiment of the present application, when a video frame is encoded, a motion vector of an image block in the video frame is determined according to rendering information when the video frame is generated and rendering information when a reference frame of the video frame is generated, and the video frame is encoded. The method for coding the video can greatly reduce the calculation amount of motion estimation, accelerate the speed of motion estimation, accelerate the video coding speed, improve the video coding efficiency and further effectively reduce the end-to-end time delay.
The embodiment shown in fig. 7 is described by taking the example that the video frame and the reference frame are from the same view angle. The following describes a video encoding method provided in an embodiment of the present application, taking a video frame and a reference frame from different views in a multi-view image as an example. Referring to fig. 10, fig. 10 is a flowchart of another video encoding method provided in an embodiment of the present application. In the embodiment of the present application, the video encoding method is performed by the server 110 shown in fig. 1, for example, the video encoding method may be applied to the VR driver server 113 of the server 110. Illustratively, the video encoding method includes the following steps 1001 to 1003.
1001. The server acquires a video frame and a reference frame of the video frame, where the video frame and the reference frame are from different views in a multi-view image.
The following describes a specific embodiment of this step, taking a video frame acquired by a server as an example of a dependent view angle. Schematically, this step 1001 includes: and the server acquires a reference frame of the video frame in response to the acquired video frame being a dependent view, wherein the view of the video frame is different from that of the reference frame.
And the server responds that the obtained video frame is a dependent view, obtains a video frame of a reference view at the current moment from the coded multi-view image, and takes the video frame of the reference view as a reference frame. Illustratively, taking the video frame corresponding to the dependent view at time t as an example, the reference frame acquired by the server is the video frame corresponding to the reference view at time t.
In some embodiments, the video frames each carry a view identifier indicating a view of the video frame. This step 1001 includes: the server determines the video frame as a dependent view based on the view identifier of the video frame, and then the server acquires the video frame of the reference view at the current moment from the multi-view image by using the view identifier of the reference view as an index, and takes the video frame as a reference frame.
It should be noted that, in a VR application scene, the video frames of different views at the same time are the pictures that the VR application presents to the user's two eyes when simulating the observation of a three-dimensional space; in other words, they can be understood as two images of the same scene captured by cameras with identical parameters from two slightly offset angles, so the video frames of different views at the same time are strongly correlated. Through the above step 1001, the server obtains the corresponding reference frame according to the view of the video frame and exploits the correlation between the video frame and the reference frame in the spatial dimension (left and right views), thereby effectively improving video coding efficiency.
1002. The server determines a motion vector of an image block in the video frame based on the first rendering information and the second rendering information; wherein the first rendering information includes a focal length when rendering the video frame, a first position of a virtual camera, and depth information of the video frame; the second rendering information includes a second position of the virtual camera when rendering the reference frame.
Wherein the focal length is determined based on the FOV and the display resolution of the terminal. The first position and the second position are determined based on the position and the attitude of the terminal. The first position and the second position are positions of the virtual camera at different viewing angles in the VR application at the same time. That is, the first position is associated with the view of the video frame, and the second position is associated with the view of the reference frame. Illustratively, taking the view angle of the video frame as a reference view angle (left eye view angle) as an example, the first position corresponds to a virtual camera simulating the left eye of the user, and the second position corresponds to a virtual camera simulating the right eye of the user. It should be understood that both the FOV and the display resolution of the terminal belong to the display parameters of the terminal.
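For reference only, one common pinhole-camera relation for deriving a focal length in pixels from the horizontal FOV and the display width is sketched below; this particular formula is an illustrative assumption rather than a relation recited by the present application.

```python
import math

def focal_length_pixels(horizontal_fov_deg, width_pixels):
    """Assumed pinhole relation: f = (W / 2) / tan(FOV_h / 2), with f expressed in pixels."""
    return (width_pixels / 2.0) / math.tan(math.radians(horizontal_fov_deg) / 2.0)
```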
It should be noted that the process of the server performing the step 1002 to determine the motion vector may be referred to as inter-view motion estimation (also referred to as disparity estimation), where a view corresponds to a reference view (a left-eye view) or a dependent view (a right-eye view). That is, the server performs motion estimation between video frames and reference frames of different views at the same time in a spatial dimension.
Schematically, reference is continued to fig. 8. As shown in fig. 8, after the server acquires the video frame and the reference frame, if the video frame and the reference frame are from different viewing angles, inter-view motion estimation is performed, and finally, the motion vector required by the present application is obtained.
In the following, a specific embodiment of this step 1002 will be described, and this step 1002 is replaced with: the server determines the motion vector based on the focal distance, a distance value between the first location and the second location, and the depth information.
In the embodiment of the present application, the terminal includes a head-mounted display, and the distance value between the first position and the second position is also the interpupillary distance of the head-mounted display. Illustratively, this step 1002 includes, but is not limited to, the following steps 1002-1 through 1002-3.
1002-1, the server determines a disparity interval between the video frame and the reference frame based on the focal distance, the distance value, and the depth information.
According to the stereoscopic vision registration theory, under the condition that the camera parameters are the same, the relationship between the parallax and the depth information between the images shot by the two cameras is shown in the following formula (5):
\Delta = \frac{f \cdot b}{d} \qquad (5)
where Δ represents parallax, f is a camera focal length, b is a distance value between two cameras, and d is a depth value of a pixel.
In the embodiment of the present application, the server determines the disparity interval between the video frame and the reference frame according to the above formula (5). The following describes an embodiment of determining the parallax interval, and includes the following two steps.
Step one, the server determines a depth interval based on the depth information, wherein the depth interval is used for indicating the value range of the depth value of the pixel in the video frame.
According to the depth information of the video frame, the server determines the maximum depth value and the minimum depth value of the pixels in the video frame, that is, the depth interval. The depth interval is schematically represented as [Dmin, Dmax], and its expression form is not limited by the present application.
And step two, the server determines a parallax interval between the video frame and the reference frame based on the focal length, the distance value and the depth interval.
The server substitutes the focal length of the virtual camera in the VR application, the distance value (i.e., the interpupillary distance of the head-mounted display), and the endpoint values of the depth interval into the above formula (5), so as to obtain the parallax interval between the video frame and the reference frame. The parallax interval is schematically represented as [Dvmin, Dvmax], and its expression form is not limited by the present application.
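A minimal sketch of steps one and two, assuming a per-pixel depth map and applying formula (5) at the two endpoints of the depth interval; all names are illustrative.

```python
import numpy as np

def disparity_interval(depth_map, focal_length, baseline):
    """Map the depth interval [Dmin, Dmax] to a disparity interval via formula (5): disparity = f * b / d."""
    d_min = float(np.min(depth_map))
    d_max = float(np.max(depth_map))
    # a larger depth gives a smaller disparity, so Dmax yields the lower endpoint
    return focal_length * baseline / d_max, focal_length * baseline / d_min
```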
1002-2, the server determines a target search range corresponding to the image block in the video frame based on the parallax interval.
In some embodiments, the manner in which the server determines the target search scope includes the following two manners.
Firstly, the server determines a first search range corresponding to an image block in the video frame based on the parallax interval, and takes the first search range as a target search range.
For any image block in the video frame, the server determines the first search range based on the endpoint values of the disparity interval and the position of the image block in the video frame. Schematically, if the position of the image block is represented as (MbX, MbY) and the disparity interval as [MinX, MaxX], the first search range is represented as [MbX + MinX, MbX + MaxX].
In some embodiments, the server multiplies the endpoint values of the disparity interval by preset coefficients respectively to obtain an expanded or reduced disparity interval. For example, the scaled disparity interval may be represented as [p·MinX, q·MaxX], where p and q are preset coefficients, and the first search range is then represented as [MbX + p·MinX, MbX + q·MaxX]. It should be understood that the preset coefficients may be the same or different, which is not limited in the present application.
Secondly, the server determines a first search range corresponding to the image block in the video frame based on the parallax interval; determining a second search range corresponding to the image block in the video frame based on a second search mode; and taking the intersection of the first search scope and the second search scope as the target search scope.
The second search method is a default search method, for example, the second search method is a global search method, a diamond search method, a hexagon search method, and the like, which is not limited in this application.
Through the optional implementation mode, the target search range corresponding to the image block in the video frame can be further narrowed, so that the calculation amount of motion estimation is reduced, the motion estimation speed is greatly increased, and the video coding efficiency is effectively improved.
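The following sketch, written for the one-dimensional (horizontal) case, illustrates how the first search range derived from the disparity interval could be intersected with a default search window to obtain the target search range; the scaling coefficients and the fallback behaviour when the two ranges do not overlap are assumptions, as the application does not specify them.

```python
def first_search_range(block_x, disp_min, disp_max, p=1.0, q=1.0):
    """First search range for a block at horizontal position block_x, optionally scaled by preset coefficients."""
    return block_x + p * disp_min, block_x + q * disp_max

def target_search_range(block_x, disp_min, disp_max, default_range):
    """Intersection of the disparity-derived range with the default search window."""
    lo1, hi1 = first_search_range(block_x, disp_min, disp_max)
    lo2, hi2 = default_range
    lo, hi = max(lo1, lo2), min(hi1, hi2)
    return (lo, hi) if lo <= hi else default_range    # fall back if the ranges do not overlap
```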
1002-3, the server determines the motion vector based on the target search range.
In this embodiment, the server uses the search start point and the search path of the second search mode as the search start point and the search path of the motion vector, searches in the reference frame according to the target search range, determines a second image block meeting a second search condition, and determines the motion vector based on the second image block.
In some embodiments, the target search range is sent from the server to an encoder in the server, and the encoder searches the reference frames according to the target search range. This is not a limitation of the present application.
Illustratively, with continuing reference to fig. 8, as shown in fig. 8, when performing inter-viewpoint motion estimation, the server performs disparity interval calculation according to the display parameters of the terminal and the depth information of the video frame to obtain a disparity interval, and determines a motion vector using the disparity interval.
In step 1002, the server narrows the search range of the motion vector according to the display parameters of the terminal and the depth information of the video frame, and thereby determines the motion vector required by the present application. This way of determining the motion vector exploits the correlation between video frames and reference frames of different views at the same time: the target search range of the motion vector is determined from the disparity interval between the video frame and the reference frame, or further narrowed by intersecting it with the search range of the default search mode, and the required motion vector is finally obtained. This greatly reduces the amount of motion estimation computation and effectively improves video coding efficiency.
1003. The server encodes the video frame based on the motion vector.
In an embodiment of the application, the server provides an encoding function capable of encoding the video frame based on the motion vector. It should be noted that the optional implementation manner of this step 1003 is similar to that in step 703 shown in fig. 7, and therefore, the description is not repeated here.
In the video encoding method provided by the embodiment of the present application, when a video frame is encoded, a motion vector of an image block in the video frame is determined according to rendering information when the video frame is generated and rendering information when a reference frame of the video frame is generated, and the video frame is encoded. The method is adopted to carry out video coding, the calculation amount of motion estimation can be greatly reduced, the speed of motion estimation is accelerated, the video coding speed is accelerated, the video coding efficiency is improved, and the end-to-end time delay is further effectively reduced.
The embodiment shown in fig. 10 is described by taking the example that the video frame and the reference frame are from different perspectives. The following takes a video frame as an example of a dependent view, and illustrates a video encoding method provided in the embodiment of the present application. Referring to fig. 11, fig. 11 is a flowchart of another video encoding method provided in an embodiment of the present application. In the embodiment of the present application, the video encoding method is performed by the server 110 shown in fig. 1, for example, the video encoding method may be applied to the VR driver server 113 of the server 110. Illustratively, the video encoding method includes steps 1101 to 1105 as follows.
1101. And the server responds to the fact that the obtained video frame is a dependent view angle, and obtains a reference frame of the video frame, wherein the reference frame comprises a reference frame with the same view angle as the video frame and a reference frame with a view angle different from the video frame.
In response to the acquired video frame being a dependent view, the server acquires, from the encoded multi-view images, at least one video frame that also belongs to the dependent view together with the video frame of the reference view at the current time, and uses them as reference frames of the video frame. Illustratively, taking the video frame corresponding to the dependent view at time t as an example, the reference frames acquired by the server include the video frames corresponding to the dependent view at times t-1, t-2, ..., t-n, and the video frame corresponding to the reference view at time t. That is, the server takes the video frames corresponding to the dependent view before time t and the video frame corresponding to the reference view at time t as reference frames.
In some embodiments, the video frames each carry a view identifier indicating a view of the video frame. This step 1101 includes: the server determines the video frame as a dependent view based on the view identifier of the video frame, then the server acquires the video frame of the current reference view in the multi-view image by using the view identifier of the reference view as an index, and meanwhile, the server acquires at least one video frame which is also the dependent view in the multi-view image by using the view identifier of the dependent view as an index, and the acquired video frame is used as a reference frame.
It should be noted that, through the above step 1101, the server acquires the corresponding reference frame according to the view angle of the video frame, and takes into account the correlation between the video frame and the reference frame in the time dimension (front and rear time points) and the correlation between the video frame and the reference frame in the space dimension (left and right view points), so that the video coding efficiency can be effectively improved.
1102. For a reference frame in the reference frame, which has the same visual angle as the video frame, the server determines a first motion vector corresponding to an image block in the video frame based on a first position of the virtual camera when the video frame is rendered, attitude information, depth information of the video frame, a second position of the virtual camera when the reference frame is rendered, attitude information and depth information of the reference frame.
The method for determining the first motion vector by the server may refer to the corresponding implementation manner in step 702, that is, the server performs time domain motion estimation to obtain the first motion vector. This application is not described in detail herein.
1103. For a reference frame, which is different from the video frame view angle, in the reference frame, the server determines a second motion vector corresponding to the image block in the video frame based on the focal length and the first position of the virtual camera when the video frame is rendered, the depth information of the video frame, and the second position of the virtual camera when the reference frame is rendered.
The manner of determining the second motion vector by the server may refer to the corresponding implementation manner in step 1002, that is, the server performs inter-view motion estimation to obtain the second motion vector. This application is not described in detail herein.
It should be noted that, in the embodiment of the present application, the server may perform the above step 1102 and step 1103 one after the other. In some embodiments, the server performs step 1103 before performing step 1102; in other embodiments, the server performs step 1102 and step 1103 in parallel. The execution order of step 1102 and step 1103 is not limited in the embodiment of the present application.
1104. The server determines a target motion vector based on the first motion vector and the second motion vector.
The server calls an RDO algorithm, and predicts the coding cost of the video frame based on the first motion vector and the second motion vector to obtain at least two prediction modes; and taking the motion vector corresponding to the prediction mode with the minimum coding cost in the at least two prediction modes as the target motion vector.
It should be noted that, through the above steps 1102 to 1104, the server determines the motion vector in different manners for different types of reference frames, and further determines the target motion vector in combination with the RDO algorithm, and the motion vector is determined in this manner, so that the motion vector with the minimum corresponding encoding cost can be obtained, and the efficiency of video encoding is further effectively improved.
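Schematically, the selection of the target motion vector in step 1104 can be expressed as follows; rd_cost stands in for the encoder's RDO cost computation, and its signature is assumed for illustration.

```python
def select_target_motion_vector(mv_temporal, mv_inter_view, rd_cost):
    """Keep the candidate whose prediction mode has the smaller rate-distortion cost.

    rd_cost is assumed to be a callable supplied by the encoder's RDO module.
    """
    candidates = [("temporal", mv_temporal), ("inter_view", mv_inter_view)]
    best_mode, best_mv = min(candidates, key=lambda c: rd_cost(c[1], mode=c[0]))
    return best_mode, best_mv
```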
1105. The server encodes the video frame based on the target motion vector.
In the video encoding method provided by the embodiment of the present application, when a video frame is encoded, a motion vector of an image block in the video frame is determined according to rendering information when the video frame is generated and rendering information when a reference frame of the video frame is generated, and the video frame is encoded. The method is adopted to carry out video coding, the calculation amount of motion estimation can be greatly reduced, the video coding speed is accelerated, the video coding efficiency is improved, and further the end-to-end time delay is effectively reduced.
The video encoding method of the present application is illustrated below with reference to fig. 12 and fig. 13, taking an interaction between a server and a terminal as an example. Fig. 12 is a flowchart of another video encoding method provided in an embodiment of the present application; fig. 13 is a flowchart of another video encoding method according to an embodiment of the present application. Illustratively, the video encoding method includes steps 1201 to 1208 as follows.
1201. The terminal sends rendering parameters to the server, where the rendering parameters include but are not limited to the rendering resolution, the left- and right-eye projection matrices, and the display parameters of the terminal.
1202. The terminal collects the current interaction event of the user and sends the interaction event to the server, wherein the interaction event comprises but is not limited to key input, and position and posture information of the head/hand.
Schematically, referring to fig. 13, in the above steps 1201 and 1202, the server and the terminal are connected through a network, and the terminal sends the rendering parameters and the user's current interaction event to the server.
It should be noted that, in the embodiment of the present application, the execution order of the above steps 1201 and 1202 is not limited. In some embodiments, the terminal performs step 1202 first, and then performs step 1201; in other embodiments, the terminal performs steps 1201 and 1202 as described above synchronously.
1203. And the server runs the VR application based on the received rendering parameters and the interaction events, executes application logic calculation and image rendering calculation and generates rendered video frames.
1204. The server calculates a motion vector based on the rendering parameters, the interaction event, and the depth information of the video frame.
The rendering parameters, the interaction event and the depth information all belong to rendering information. The server distinguishes the following two scenarios when performing the motion vector calculation. The first is the consecutive-frame scenario, in which the video frame and the reference frame are from the same view; in this scenario, the server determines the motion vector as described in step 702 of the embodiment shown in fig. 7, that is, it performs the view displacement vector calculation. The second is the inter-view scenario, in which the video frame and the reference frame are from different views; in this scenario, refer to step 1002 of the embodiment shown in fig. 10, that is, the disparity interval calculation is performed. Details are not repeated here.
Optionally, the motion vector calculation comprises an initial motion vector calculation and a target motion vector calculation. The initial motion vector calculation is used for performing perspective displacement vector calculation and inter-view parallax interval calculation, and after the server obtains the calculation result of the initial motion vector, the server performs target motion vector calculation (which can also be understood as accurate motion vector calculation) based on the calculation result, and finally obtains a motion vector.
Optionally, in a front-back frame scene, after the server performs initial motion vector calculation to obtain a view displacement vector, the view displacement vector can be directly used as a motion vector in a subsequent server encoding process of a video frame, or the server can perform target motion vector calculation based on the view displacement vector; in an inter-view scene, after the server performs initial motion vector calculation to obtain a parallax interval, the server can perform target motion vector calculation based on the parallax interval, so that the search range of motion vectors is reduced, and accelerated encoding is realized.
Alternatively, the calculation process of the motion vector may be performed by an encoder of the server, or may be performed after calculation outside the encoder and then transferred to the encoder.
Schematically, please continue to refer to fig. 13 in the above steps 1203 and 1204, as shown in fig. 13, after receiving the rendering parameters and the interaction event sent by the terminal, the server performs logic calculation and image rendering calculation to generate a rendered video frame. The server performs motion vector calculations based on the depth information of the video frame and the received rendering parameters and interaction events. And then the server encodes the video frame based on the obtained motion vector to generate video encoding data.
1205. And the server encodes the video frame based on the calculated motion vector to generate video encoded data.
1206. The server transmits the video encoding data to the terminal.
1207. The terminal decodes the video frame from the video encoded data and, through the system driver in the terminal, submits the decoded video frame to the head-mounted display for display.
Referring to fig. 13, in the above step 1206 and step 1207, as shown in fig. 13, after receiving the encoded video data transmitted by the server, the terminal decodes the encoded video data and displays the decoded video data on a screen.
In the video encoding method provided by the embodiment of the present application, when a video frame is encoded, a motion vector of an image block in the video frame is determined according to rendering information when the video frame is generated and rendering information when a reference frame of the video frame is generated, and the video frame is encoded. The method is adopted to carry out video coding, the calculation amount of motion estimation can be greatly reduced, the video coding speed is accelerated, the video coding efficiency is improved, and further the end-to-end time delay is effectively reduced. Wherein, no matter the video frame is a left-eye visual angle or a right-eye visual angle, the motion vector can be determined by adopting the method; further, the above method can be used to determine the motion vector regardless of whether the video frame and the reference frame belong to the same view angle.
The above embodiment illustrates a specific implementation of the server performing video coding on any one of the acquired video frames. Referring to fig. 14, fig. 14 is a flowchart of another video encoding method according to an embodiment of the present disclosure. The video encoding method of the present application is exemplified by taking a server encoding a plurality of video frames in a period of time, and taking one video frame as an encoding unit. Illustratively, as shown in fig. 14, the video encoding method is performed by a server, and includes the following steps 1401 to 1406.
1401. The server acquires the coding unit and judges whether the current coding unit is an initial coding unit or not and whether the current coding unit is a reference view or not.
The start coding unit refers to a first video frame in a certain time period. In step 1401, please refer to step 701 in the embodiment shown in fig. 7 for the implementation of the server determining whether the current coding unit is the reference view, which is not described herein again.
1402. If the current coding unit is the initial coding unit and the reference view, the server codes the coding unit as the key frame.
The key frame refers to the video frame of the first reference view in a certain time period. It should be noted that, since the key frame is the first video frame in that time period, the server does not need to acquire a reference frame when encoding this coding unit and encodes it directly.
1403. If the current coding unit is the initial coding unit but is a dependent view, the server acquires a video frame of a reference view corresponding to the coding unit, and codes the coding unit into a non-key frame by taking the video frame as a reference.
The non-key frame refers to other video frames except the video frame of the first reference view angle in a certain time period. In this step 1403, please refer to steps 1002 and 1003 in the above embodiment shown in fig. 10 for the implementation in which the server encodes the current coding unit into the non-key frame, that is, the server performs inter-view motion estimation to obtain a motion vector and then encodes the video frame. This application is not described in detail herein.
1404. If the current coding unit is not the initial coding unit but belongs to the reference view, the server uses the previous single frame or several frames of the reference view (determined according to the configuration) as reference frames, and encodes the coding unit as a non-key frame.
In the embodiment where the server encodes the current coding unit into the non-key frame, please refer to steps 702 and 703 in the embodiment shown in fig. 7, that is, the server performs time domain motion estimation to obtain a motion vector and then encodes the video frame. This application is not described in detail herein.
1405. If the current coding unit is neither the initial coding unit nor the reference view, the server takes the previous single frame or several frames of the same dependent view (determined according to the configuration) and the video frame of the reference view corresponding to the current coding unit as reference frames, and encodes the coding unit as a non-key frame.
In the embodiment of the server encoding the current encoding unit into the non-key frame, please refer to steps 1102 to 1105 in the embodiment shown in fig. 11, that is, the server performs time domain motion estimation and inter-view motion estimation to obtain a motion vector and then encodes the video frame. This application is not described in detail herein.
1406. The server repeats the above steps 1401 to 1405 until all the coding units are coded.
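The decision flow of steps 1401 to 1405 can be summarised in the following Python-style sketch; the encoder methods (encode_key_frame, encode_inter_view, encode_temporal, encode_temporal_and_inter_view) are placeholders for the encoding paths described above and are not part of this application.

```python
def encode_sequence(coding_units, encoder):
    """coding_units: video frames in encoding order; encoder: hypothetical object exposing the paths below."""
    start_time = coding_units[0].timestamp
    for unit in coding_units:
        is_start = (unit.timestamp == start_time)                        # step 1401: start coding unit?
        if is_start and unit.is_reference_view:
            encoder.encode_key_frame(unit)                               # step 1402
        elif is_start:
            ref = encoder.reference_view_frame(unit.timestamp)
            encoder.encode_inter_view(unit, ref)                         # step 1403
        elif unit.is_reference_view:
            refs = encoder.previous_frames(unit.view_id)
            encoder.encode_temporal(unit, refs)                          # step 1404
        else:
            refs = encoder.previous_frames(unit.view_id)
            ref_view = encoder.reference_view_frame(unit.timestamp)
            encoder.encode_temporal_and_inter_view(unit, refs, ref_view) # step 1405
```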
By adopting the method to carry out video coding, the calculation amount of motion estimation can be greatly reduced, the video coding speed is accelerated, the video coding efficiency is improved, and further the end-to-end time delay is effectively reduced.
The embodiment of fig. 14 is illustrated by taking a single video frame as the coding unit. Referring to fig. 15, fig. 15 is a flowchart of another video encoding method according to an embodiment of the present application, which is illustrated by taking a group of pictures (GOP) as the coding unit. Schematically, as shown in fig. 15, time t0 is the starting coding time of the coding unit, and a GOP comprises the video frames corresponding to times t0, t1 and t2, where each time includes two video frames corresponding to the reference view and the dependent view respectively. The video encoding method is performed by a server and includes the following steps 1501 to 1505.
1501. The server encodes the video frame of the reference view at time t0 as a key frame.
Please refer to step 1402 in the embodiment shown in fig. 14 for this step 1501, which is not described herein again.
1502. The server takes the video frame of the reference view at time t0 as the reference frame, and encodes the video frame of the dependent view at time t0 as a non-key frame.
In this step 1502, please refer to step 1002 and step 1003 in the embodiment shown in fig. 10, that is, the server performs inter-view motion estimation to obtain a motion vector and then encodes the video frame. This application is not described in detail herein.
1503. The server takes the video frames of the reference view before time t1 as reference frames, and encodes the video frame of the reference view at time t1 as a non-key frame.
In step 1503, please refer to step 702 and step 703 in the embodiment shown in fig. 7, that is, the server performs temporal motion estimation to obtain a motion vector and then encodes the video frame. This application is not described in detail herein.
1504. The server takes the video frames of the dependent view before time t1 and the video frame of the reference view at time t1 as reference frames, and encodes the video frame of the dependent view at time t1 as a non-key frame.
In step 1504, please refer to steps 1102 to 1105 in the embodiment shown in fig. 11, that is, the server performs temporal motion estimation and inter-view motion estimation to obtain a motion vector and then encodes the video frame. This application is not described in detail herein.
1505. The server encodes the video frames at time t2 following the encoding modes used for the video frames of the reference view and the dependent view at time t1.
Note that fig. 15 is only an illustration, in which the left-eye view (L) is taken as the reference view and the right-eye view (R) as the dependent view; the right-eye view may equally be taken as the reference view and the left-eye view as the dependent view, which is not limited in the present application. In addition, fig. 15 selects only the video frame at the previous time as the reference frame in temporal motion estimation; it should be understood that all video frames before the current time can also be used as reference frames in temporal motion estimation. For example, when encoding the video frame of the reference view at time t2, the video frames of the reference view at times t0 and t1 may both be used as reference frames, which is not limited in the present application.
By adopting the method to carry out video coding, the calculation amount of motion estimation can be greatly reduced, the video coding speed is accelerated, the video coding efficiency is improved, and further the end-to-end time delay is effectively reduced.
Fig. 16 is a schematic structural diagram of a video encoding apparatus according to an embodiment of the present application, the video encoding apparatus is configured to perform the steps of the video encoding method, and referring to fig. 16, the video encoding apparatus 1600 includes: an obtaining module 1601, a determining module 1602, and an encoding module 1603.
An obtaining module 1601, configured to obtain a video frame and a reference frame of the video frame, where the video frame and the reference frame are from a same view or different views in a multi-view image;
a determining module 1602, configured to determine a motion vector of an image block in the video frame based on the first rendering information and the second rendering information; the first rendering information is rendering information when the video frame is generated, and the second rendering information is rendering information when the reference frame is generated;
an encoding module 1603 for encoding the video frame based on the motion vector.
In a possible implementation manner, if the video frame and the reference frame are from the same perspective, the first rendering information includes a first position and posture information of a virtual camera when the video frame is rendered, and depth information of the video frame; the second rendering information includes a second position of the virtual camera when rendering the reference frame, pose information, and depth information of the reference frame.
In a possible implementation manner, if the video frame and the reference frame are from different perspectives, the first rendering information includes a focal length when the video frame is rendered, a first position of a virtual camera, and depth information of the video frame; the second rendering information includes a second position of the virtual camera when rendering the reference frame;
the determining module 1602 is further configured to: determining the motion vector based on the focal distance, a distance value between the first location and the second location, and the depth information.
In one possible implementation, the video frame and the reference frame are generated while running a virtual reality VR application, the video frame encoded data is for decoding display on a terminal, and the first position and orientation information and the second position and orientation information are determined based on a position and orientation of the terminal.
In one possible implementation manner, the terminal is a mobile phone, a tablet, a wearable device, or a split device, and the split device includes a display device and a corresponding control device.
In one possible implementation, the wearable device is a head-mounted display.
In one possible implementation, the determining module 1602 includes:
a first determining unit, configured to determine two-dimensional coordinates corresponding to pixels or image blocks in the video frame and two-dimensional coordinates corresponding to pixels or image blocks in the reference frame, based on the first position and pose information of the virtual camera when the video frame is rendered, the depth information of the video frame, the second position and pose information of the virtual camera when the reference frame is rendered, and the depth information of the reference frame;
and a second determining unit, configured to determine the motion vector based on the two-dimensional coordinates corresponding to the pixels or image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or image blocks in the reference frame.
In a possible implementation manner, the second determining unit is configured to:
acquire a coordinate difference between the two-dimensional coordinates corresponding to the pixels or image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or image blocks in the reference frame;
and determine the motion vector based on the coordinate difference.
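A minimal sketch of this reprojection step follows, assuming a pinhole camera model with a known intrinsic matrix and treating the pose information as a camera-to-world rotation together with the camera position; the function and parameter names are illustrative and are not taken from the application.

```python
import numpy as np

def motion_vector_same_view(px, py, depth, K, R1, t1, R2, t2):
    """Map pixel (px, py) of the video frame into the reference frame using
    the rendering depth and the two virtual-camera poses, and return the
    2D coordinate difference as the motion vector.

    K      : 3x3 intrinsic matrix of the virtual camera (focal length, principal point)
    R1, t1 : camera-to-world rotation and camera position when the video frame was rendered
    R2, t2 : camera-to-world rotation and camera position when the reference frame was rendered
    depth  : depth value of the pixel, read from the rendering depth map
    """
    # Back-project the pixel into the camera space of the video frame.
    ray = np.linalg.inv(K) @ np.array([px, py, 1.0])
    p_cam1 = ray * depth
    # Camera space of the video frame -> world space -> camera space of the reference frame.
    p_world = R1 @ p_cam1 + t1
    p_cam2 = R2.T @ (p_world - t2)
    # Project into the image plane of the reference frame.
    u, v, w = K @ p_cam2
    u2, v2 = u / w, v / w
    # Motion vector = coordinate difference between the two 2D positions.
    return np.array([u2 - px, v2 - py])
```

For a block-level motion vector, this computation can be evaluated at the block's centre pixel, or averaged over the block's pixels, using the depth values from the rendering depth map.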
In a possible implementation manner, the determining module 1602 further includes:
a third determining unit, configured to determine a disparity interval between the video frame and the reference frame based on the focal length, a distance value between the first position and the second position, and the depth information;
a fourth determining unit, configured to determine, based on the disparity interval, a target search range corresponding to an image block in the video frame;
a fifth determining unit configured to determine the motion vector based on the target search range.
In a possible implementation manner, the third determining unit is configured to:
determine a depth interval based on the depth information, where the depth interval indicates the value range of the depth values of the pixels in the video frame;
and determine the disparity interval between the video frame and the reference frame based on the focal length, the distance value between the first position and the second position, and the depth interval.
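Under the same rectified pinhole assumption, and with hypothetical function names, the depth interval can be turned into a disparity interval and then into a restricted horizontal search range roughly as follows:

```python
def disparity_interval(focal_length_px, baseline, z_min, z_max):
    # Disparity shrinks as depth grows: the far bound z_max gives the
    # minimum disparity, the near bound z_min gives the maximum disparity.
    return (focal_length_px * baseline / z_max,
            focal_length_px * baseline / z_min)

def target_search_range(block_x, d_min, d_max, margin=2):
    # Restrict the horizontal search for a block at column block_x of the
    # dependent-view frame. With the left-eye view as the reference view and
    # the right-eye view as the dependent view, the matching block lies at
    # block_x + d in the reference frame for some d in [d_min, d_max];
    # 'margin' is an illustrative safety margin.
    return block_x + d_min - margin, block_x + d_max + margin

# Illustrative numbers only: focal length 1000 px, baseline 0.064 m and
# depths between 1 m and 4 m give disparities between 16 px and 64 px,
# so each block is searched over roughly 50 columns instead of a full window.
d_min, d_max = disparity_interval(1000.0, 0.064, 1.0, 4.0)
low, high = target_search_range(200, d_min, d_max)
```

The disparity interval only bounds the horizontal displacement; the encoder can still run its usual sub-pixel refinement inside the restricted range.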
In the video encoding apparatus provided in this embodiment of the present application, when a video frame is encoded, the motion vector of an image block in the video frame is determined from the rendering information used when the video frame was generated and the rendering information used when its reference frame was generated, and the video frame is then encoded based on that motion vector. Encoding video with this apparatus greatly reduces the amount of computation required for motion estimation, speeds up video encoding, improves video encoding efficiency, and effectively reduces the end-to-end delay.
In an exemplary embodiment, there is also provided a computer-readable storage medium, such as a memory including program code, which is executable by a processor in a terminal to cause a computer to perform the video encoding method in the above-described embodiments. For example, the computer-readable storage medium is a read-only memory (ROM), a Random Access Memory (RAM), a compact disc-read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, which comprises program code which, when run on a video encoding apparatus, causes the video encoding apparatus to perform the video encoding method provided in the above-described embodiment.
Those of ordinary skill in the art will appreciate that the method steps and elements described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the steps and elements of each embodiment have been described above generally in terms of their functions. Whether these functions are implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functions in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution. It will be further understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first location may be referred to as a second location, and similarly, a second location may be referred to as a first location, without departing from the scope of the various described examples. Both the first and second locations may be locations, and in some cases, may be separate and distinct locations.
The term "at least one" in this application means one or more, and the term "plurality" in this application means two or more, for example, the plurality of second motion vectors means two or more second motion vectors. The terms "system" and "network" are often used interchangeably herein.
It should also be understood that the term "if" may be interpreted to mean "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a stated condition or event] is detected" may be interpreted to mean "upon determining", "in response to determining", "upon detecting [the stated condition or event]", or "in response to detecting [the stated condition or event]", depending on the context.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are wholly or partially generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
The computer program instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer program instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire or wirelessly. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, or tapes), optical media (e.g., digital video discs (DVDs)), or semiconductor media (e.g., solid-state drives), etc.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and these modifications or substitutions do not depart from the scope of the technical solutions of the embodiments of the present application.

Claims (19)

1. A method of video encoding, the method comprising:
acquiring a video frame and a reference frame of the video frame, wherein the video frame and the reference frame are from a same view or different views in a multi-view image;
determining a motion vector of an image block in the video frame based on the first rendering information and the second rendering information; wherein the first rendering information is rendering information when the video frame is generated, and the second rendering information is rendering information when the reference frame is generated;
encoding the video frame based on the motion vector.
2. The method of claim 1, wherein, if the video frame and the reference frame are from a same view, the first rendering information comprises a first position and pose information of a virtual camera when the video frame is rendered, and depth information of the video frame; and the second rendering information comprises a second position and pose information of the virtual camera when the reference frame is rendered, and depth information of the reference frame.
3. The method of claim 1, wherein, if the video frame and the reference frame are from different views, the first rendering information comprises a focal length used when the video frame is rendered, a first position of a virtual camera, and depth information of the video frame; and the second rendering information comprises a second position of the virtual camera when the reference frame is rendered;
the determining a motion vector of an image block in the video frame based on the first rendering information and the second rendering information comprises: determining the motion vector based on the focal length, a distance value between the first position and the second position, and the depth information.
4. The method of claim 2, wherein the video frame and the reference frame are generated while a virtual reality (VR) application is running, encoded data of the video frame is used for decoding and display on a terminal, and the first position and pose information and the second position and pose information are determined based on a position and a pose of the terminal.
5. The method according to claim 4, wherein the terminal is a mobile phone, a tablet, a wearable device or a split device, and the split device comprises a display device and a corresponding control device.
6. The method of claim 5, wherein the wearable device is a head-mounted display.
7. The method of claim 2, wherein the determining a motion vector of an image block in the video frame based on the first rendering information and the second rendering information comprises:
determining two-dimensional coordinates corresponding to pixels or image blocks in the video frame and two-dimensional coordinates corresponding to pixels or image blocks in the reference frame, based on the first position and pose information of the virtual camera when the video frame is rendered, the depth information of the video frame, the second position and pose information of the virtual camera when the reference frame is rendered, and the depth information of the reference frame;
and determining the motion vector based on the two-dimensional coordinates corresponding to the pixels or the image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or the image blocks in the reference frame.
8. The method of claim 7, wherein determining the motion vector based on two-dimensional coordinates corresponding to pixels or image blocks in the video frame and two-dimensional coordinates corresponding to pixels or image blocks in the reference frame comprises:
acquiring a coordinate difference between two-dimensional coordinates corresponding to pixels or image blocks in the video frame and two-dimensional coordinates corresponding to pixels or image blocks in the reference frame;
determining the motion vector based on the coordinate difference.
9. The method of claim 3, wherein the determining the motion vector based on the focal length, the distance value between the first position and the second position, and the depth information comprises:
determining a disparity interval between the video frame and the reference frame based on the focal length, the distance value between the first position and the second position, and the depth information;
determining a target search range corresponding to an image block in the video frame based on the parallax interval;
determining the motion vector based on the target search range.
10. The method of claim 9, wherein the determining a disparity interval between the video frame and the reference frame based on the focal length, the distance value between the first position and the second position, and the depth information comprises:
determining a depth interval based on the depth information, wherein the depth interval is used for indicating a value range of depth values of pixels in the video frame;
determining a disparity interval between the video frame and the reference frame based on the focal length, the distance value between the first position and the second position, and the depth interval.
11. A video encoding apparatus, characterized in that the apparatus comprises:
an obtaining module, configured to obtain a video frame and a reference frame of the video frame, wherein the video frame and the reference frame are from a same view or different views in a multi-view image;
a determining module, configured to determine a motion vector of an image block in the video frame based on the first rendering information and the second rendering information, wherein the first rendering information is rendering information when the video frame is generated, and the second rendering information is rendering information when the reference frame is generated; and
an encoding module, configured to encode the video frame based on the motion vector.
12. The apparatus of claim 11, wherein, if the video frame and the reference frame are from a same view, the first rendering information comprises a first position and pose information of a virtual camera when the video frame is rendered, and depth information of the video frame; and the second rendering information comprises a second position and pose information of the virtual camera when the reference frame is rendered, and depth information of the reference frame.
13. The apparatus of claim 11, wherein, if the video frame and the reference frame are from different views, the first rendering information comprises a focal length used when the video frame is rendered, a first position of a virtual camera, and depth information of the video frame; and the second rendering information comprises a second position of the virtual camera when the reference frame is rendered;
the determining module is further configured to: determine the motion vector based on the focal length, a distance value between the first position and the second position, and the depth information.
14. The apparatus of claim 12, wherein the determining module comprises:
a first determining unit, configured to determine two-dimensional coordinates corresponding to pixels or image blocks in the video frame and two-dimensional coordinates corresponding to pixels or image blocks in the reference frame, based on the first position and pose information of the virtual camera when the video frame is rendered, the depth information of the video frame, the second position and pose information of the virtual camera when the reference frame is rendered, and the depth information of the reference frame;
and the second determining unit is used for determining the motion vector based on the two-dimensional coordinates corresponding to the pixels or the image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or the image blocks in the reference frame.
15. The apparatus of claim 14, wherein the second determining unit is configured to:
acquire a coordinate difference between the two-dimensional coordinates corresponding to the pixels or image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or image blocks in the reference frame;
and determine the motion vector based on the coordinate difference.
16. The apparatus of claim 13, wherein the determining module further comprises:
a third determining unit configured to determine a disparity interval between the video frame and the reference frame based on the focal length, a distance value between the first position and the second position, and the depth information;
a fourth determining unit, configured to determine, based on the disparity interval, a target search range corresponding to an image block in the video frame;
a fifth determining unit configured to determine the motion vector based on the target search range.
17. The apparatus of claim 16, wherein the third determining unit is configured to:
determine a depth interval based on the depth information, wherein the depth interval is used for indicating a value range of depth values of pixels in the video frame;
and determine a disparity interval between the video frame and the reference frame based on the focal length, a distance value between the first position and the second position, and the depth interval.
18. A video encoding device, characterized in that the device comprises a processor and a memory, wherein the memory is configured to store at least one piece of program code, and the program code is loaded and executed by the processor to perform the video encoding method according to any one of claims 1 to 10.
19. A computer-readable storage medium for storing at least one program code for performing the video encoding method of any one of claims 1-10.
CN202110320704.7A 2021-01-14 2021-03-25 Video coding method, device, equipment and storage medium Pending CN114765689A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/142232 WO2022151972A1 (en) 2021-01-14 2021-12-28 Video encoding method and apparatus, device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110048292 2021-01-14
CN2021100482926 2021-01-14

Publications (1)

Publication Number Publication Date
CN114765689A true CN114765689A (en) 2022-07-19

Family

ID=82365127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110320704.7A Pending CN114765689A (en) 2021-01-14 2021-03-25 Video coding method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114765689A (en)
WO (1) WO2022151972A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115174963B (en) * 2022-09-08 2023-05-12 阿里巴巴(中国)有限公司 Video generation method, video frame generation device and electronic equipment
CN117278780B (en) * 2023-09-06 2024-06-18 上海久尺网络科技有限公司 Video encoding and decoding method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10559126B2 (en) * 2017-10-13 2020-02-11 Samsung Electronics Co., Ltd. 6DoF media consumption architecture using 2D video decoder
US10846814B2 (en) * 2018-07-31 2020-11-24 Intel Corporation Patch processing mechanism
US11386524B2 (en) * 2018-09-28 2022-07-12 Apple Inc. Point cloud compression image padding
CN109672886B (en) * 2019-01-11 2023-07-04 京东方科技集团股份有限公司 Image frame prediction method and device and head display equipment
CN111583350A (en) * 2020-05-29 2020-08-25 联想(北京)有限公司 Image processing method, device and system and server

Also Published As

Publication number Publication date
WO2022151972A1 (en) 2022-07-21

Similar Documents

Publication Publication Date Title
US11025959B2 (en) Probabilistic model to compress images for three-dimensional video
US10681342B2 (en) Behavioral directional encoding of three-dimensional video
US9648346B2 (en) Multi-view video compression and streaming based on viewpoints of remote viewer
US20130101017A1 (en) Providing of encoded video applications in a network environment
JP2019534606A (en) Method and apparatus for reconstructing a point cloud representing a scene using light field data
WO2022151972A1 (en) Video encoding method and apparatus, device, and storage medium
US20190281319A1 (en) Method and apparatus for rectified motion compensation for omnidirectional videos
US11812066B2 (en) Methods, devices and stream to encode global rotation motion compensated images
WO2009155827A1 (en) Method, apparatus and system for stereo video encoding and decoding
US11503267B2 (en) Image processing device, content processing device, content processing system, and image processing method
JP4838275B2 (en) Distance information encoding method, decoding method, encoding device, decoding device, encoding program, decoding program, and computer-readable recording medium
WO2015098948A1 (en) Video coding method, video decoding method, video coding device, video decoding device, video coding program, and video decoding program
CN117730530A (en) Image processing method and device, equipment and storage medium
JP4937161B2 (en) Distance information encoding method, decoding method, encoding device, decoding device, encoding program, decoding program, and computer-readable recording medium
Kukolj et al. 3D content acquisition and coding
WO2022259632A1 (en) Information processing device and information processing method
JP2614918B2 (en) Image coding method
CN117596373B (en) Method for information display based on dynamic digital human image and electronic equipment
US20240163476A1 (en) 3d prediction method for video coding
US20240185511A1 (en) Information processing apparatus and information processing method
US20240163477A1 (en) 3d prediction method for video coding
EP4246988A1 (en) Image synthesis
WO2022230253A1 (en) Information processing device and information processing method
WO2022191070A1 (en) 3d object streaming method, device, and program
WO2021245332A1 (en) 3d streaming and reconstruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination