WO2022151972A1 - Video encoding method, apparatus, device and storage medium - Google Patents

Video encoding method, apparatus, device and storage medium

Info

Publication number
WO2022151972A1
Authority
WO
WIPO (PCT)
Prior art keywords
video frame
video
frame
rendering
reference frame
Prior art date
Application number
PCT/CN2021/142232
Other languages
English (en)
French (fr)
Inventor
黄斌
彭巧巧
蒯多慈
Original Assignee
华为云计算技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为云计算技术有限公司
Publication of WO2022151972A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/597 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N 19/51 Motion estimation or motion compensation
    • H04N 19/513 Processing of motion vectors

Definitions

  • the present application relates to the technical field of video processing, and in particular, to a video encoding method, apparatus, device, and storage medium.
  • VR: virtual reality.
  • The cloud rendering computing platform generates left-eye and right-eye video frames with a certain parallax, corresponding to the user's left-eye perspective and right-eye perspective respectively. After encoding the two video frames, the platform sends them to the VR terminal, and the VR terminal decodes and displays them, so that the user can experience the effect of stereoscopic vision.
  • When encoding the left-eye and right-eye video frames, the cloud rendering computing platform usually adopts one of the following two methods: in method 1, the left-eye and right-eye video frames are spliced into one video frame, which is then encoded by a standard encoder to generate one standard video code stream; in method 2, the left-eye and right-eye video frames are encoded separately by a standard encoder to generate two standard video code streams.
  • Embodiments of the present application provide a video encoding method, apparatus, device, and storage medium, which can effectively improve video encoding efficiency and reduce end-to-end delay.
  • the technical solution is as follows:
  • In a first aspect, a video encoding method is provided, comprising: acquiring a video frame and a reference frame of the video frame, where the video frame and the reference frame are from the same view or different views in a multi-view image; determining a motion vector of an image block in the video frame based on first rendering information and second rendering information, where the first rendering information is the rendering information when the video frame is generated and the second rendering information is the rendering information when the reference frame is generated; and encoding the video frame based on the motion vector.
  • Using the above method for video coding can greatly reduce the amount of calculation of motion estimation, speed up video coding, improve video coding efficiency, and effectively reduce end-to-end delay.
  • In a possible implementation, the video frame and the reference frame are generated when a virtual reality (VR) application is run, the encoded data of the video frame is used for decoding and display on a terminal, and the first position and attitude information and the second position and attitude information are determined based on the position and attitude of the terminal.
  • In this way, the data generated by the server running the VR application can be displayed on the terminal, and the position and posture of the terminal affect the rendering process of the server, so as to achieve the purpose of VR display.
  • In a possible implementation, the terminal is a mobile phone, a tablet, a wearable device, or a split device, where the split device includes a display device and a corresponding control device.
  • the encoded data of the video frame can be decoded and displayed on various terminals of different forms, and the present application has a relatively wide applicability.
  • the wearable device is a head-mounted display.
  • In a possible implementation, the first rendering information includes the first position and posture information of the virtual camera when the video frame is rendered, and the depth information of the video frame; the second rendering information includes the second position and posture information of the virtual camera when the reference frame is rendered, and the depth information of the reference frame.
  • In a possible implementation, determining the motion vector between the video frame and the reference frame based on the first rendering information and the second rendering information includes: determining, based on the first position and posture information of the virtual camera when the video frame is rendered, the depth information of the video frame, the second position and posture information of the virtual camera when the reference frame is rendered, and the depth information of the reference frame, the two-dimensional coordinates corresponding to the pixel or image block in the video frame and the two-dimensional coordinates corresponding to the pixel or image block in the reference frame; and determining the motion vector based on the two-dimensional coordinates corresponding to the pixel or image block in the video frame and the two-dimensional coordinates corresponding to the pixel or image block in the reference frame.
  • In a possible implementation, determining the two-dimensional coordinates corresponding to the pixel or image block in the video frame and the two-dimensional coordinates corresponding to the pixel or image block in the reference frame includes: acquiring the conversion relationship between the two-dimensional coordinates and the three-dimensional coordinates corresponding to a pixel or image block in any video frame; obtaining the two-dimensional coordinates corresponding to the pixel or image block in the reference frame, and applying the conversion relationship to convert them into the three-dimensional coordinates corresponding to the pixel or image block in the reference frame; and applying the conversion relationship, based on the depth information of the video frame and the three-dimensional coordinates corresponding to the pixel or image block in the reference frame, to determine the two-dimensional coordinates corresponding to the pixel or image block in the video frame.
  • In a possible implementation, determining the motion vector based on the two-dimensional coordinates corresponding to the pixel or image block in the video frame and the two-dimensional coordinates corresponding to the pixel or image block in the reference frame includes: acquiring the coordinate difference between the two-dimensional coordinates corresponding to the pixel or image block in the video frame and the two-dimensional coordinates corresponding to the pixel or image block in the reference frame; and determining the motion vector based on the coordinate difference.
  • In a possible implementation, determining the motion vector based on the coordinate difference includes: determining the coordinate difference as the motion vector.
  • In another possible implementation, determining the motion vector based on the coordinate difference includes: performing, based on the coordinate difference, a search in the reference frame according to a first search method, determining a first image block that meets a first search condition, and determining the motion vector based on the first image block.
  • In the above implementations, the correlation between the video frame and the reference frame of the same view at adjacent moments is exploited: a view displacement vector is calculated from the position and posture information of the virtual camera when rendering the video frame and the reference frame, together with the depth information, and this view displacement vector is used as the motion vector, or used as the initial motion vector from which a search is performed to obtain the final motion vector. This greatly reduces the calculation amount of motion estimation and effectively improves video coding efficiency.
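  • A minimal numpy sketch of this reprojection-based view displacement, assuming a standard pinhole camera model with a 3x3 calibration matrix K and 4x4 world-to-camera pose matrices; the function and parameter names are illustrative, not taken from the application:

```python
import numpy as np

def view_displacement_vector(uv_ref, depth_ref, K, T_ref, T_cur):
    """Reproject a reference-frame pixel into the current frame and return the
    coordinate difference, which can serve as the motion vector or as the
    starting point of a motion search (illustrative, pinhole-model sketch)."""
    u, v = uv_ref
    # Back-project the reference-frame pixel to a 3D point in camera space,
    # then to world space using the reference frame's camera pose.
    p_cam = depth_ref * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    p_world = np.linalg.inv(T_ref) @ np.append(p_cam, 1.0)
    # Project the world point into the current frame's camera and image plane.
    q_cam = (T_cur @ p_world)[:3]
    q_img = K @ q_cam
    u2, v2 = q_img[:2] / q_img[2]
    return u2 - u, v2 - v
```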
  • In a possible implementation, the first rendering information includes the focal length when rendering the video frame, the first position of the virtual camera, and the depth information of the video frame; the second rendering information includes the second position of the virtual camera when rendering the reference frame; and determining the motion vector between the video frame and the reference frame based on the first rendering information and the second rendering information includes: determining the motion vector based on the focal length, the distance value between the first position and the second position, and the depth information.
  • In a possible implementation, determining the motion vector based on the focal length, the distance value between the first position and the second position, and the depth information includes: determining the disparity interval between the video frame and the reference frame based on the focal length, the distance value, and the depth information; determining the target search range corresponding to the image block in the video frame based on the disparity interval; and determining the motion vector based on the target search range.
  • In a possible implementation, determining the disparity interval between the video frame and the reference frame based on the focal length, the distance value between the first position and the second position, and the depth information includes: determining a depth interval based on the depth information, where the depth interval indicates the value range of the depth values of the pixels in the video frame; and determining the disparity interval between the video frame and the reference frame based on the focal length, the distance value, and the depth interval.
  • In a possible implementation, determining the target search range corresponding to the image block in the video frame based on the disparity interval includes: determining a first search range corresponding to the image block in the video frame based on the disparity interval, and using the first search range as the target search range.
  • In another possible implementation, determining the target search range corresponding to the image block in the video frame based on the disparity interval includes: determining a first search range corresponding to the image block in the video frame based on the disparity interval; determining a second search range corresponding to the image block in the video frame based on a second search method; and using the intersection of the first search range and the second search range as the target search range.
  • In the above implementations, the correlation between the video frame and the reference frame from different views at the same moment is exploited, and the motion vector is determined from the disparity interval between the video frame and the reference frame; alternatively, the target search range for the motion vector is further narrowed by intersecting the search range derived from the disparity interval with that of the default search method, and the motion vector is then obtained. This greatly reduces the calculation amount of motion estimation and effectively improves video coding efficiency.
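  • A short sketch of how such a disparity interval might be derived, assuming a rectified stereo setup in which disparity = focal_length x baseline / depth; the function and parameter names are illustrative:

```python
def disparity_search_range(focal_px, baseline, depth_min, depth_max):
    """Return (min_disparity, max_disparity) in pixels for scene depths in
    [depth_min, depth_max]; nearer objects produce larger disparities."""
    d_max = focal_px * baseline / depth_min
    d_min = focal_px * baseline / depth_max
    # For inter-view motion estimation, the horizontal search can be limited
    # to [d_min, d_max] (sign depends on which eye is the reference view),
    # while the vertical search range can be kept very small.
    return d_min, d_max
```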
  • In a possible implementation, the reference frames of the video frame include both a reference frame with the same view as the video frame and a reference frame with a different view from the video frame. The server determines a first motion vector corresponding to the image block in the video frame based on the first position and posture information of the virtual camera when rendering the video frame, the depth information of the video frame, the second position and posture information of the virtual camera when rendering the reference frame, and the depth information of the reference frame; determines a second motion vector corresponding to the image block in the video frame based on the focal length when rendering the video frame, the first position of the virtual camera, the depth information of the video frame, and the second position of the virtual camera when rendering the reference frame; determines a target motion vector based on the first motion vector and the second motion vector; and encodes the video frame based on the target motion vector.
  • In this implementation, the corresponding reference frames are obtained according to the view of the video frame, so the correlation between the video frame and the reference frame is considered not only in the time dimension (adjacent moments) but also in the spatial dimension (left and right views), which can effectively improve video coding efficiency.
  • In a second aspect, a video encoding apparatus is provided, comprising:
  • an acquisition module, configured to acquire a video frame and a reference frame of the video frame, where the video frame and the reference frame are from the same view or different views in a multi-view image;
  • a determining module, configured to determine a motion vector of an image block in the video frame based on first rendering information and second rendering information, where the first rendering information is the rendering information when the video frame is generated, and the second rendering information is the rendering information when the reference frame is generated; and
  • an encoding module, configured to encode the video frame based on the motion vector.
  • In a possible implementation, the first rendering information includes the first position and posture information of the virtual camera when the video frame is rendered, and the depth information of the video frame; the second rendering information includes the second position and posture information of the virtual camera when the reference frame is rendered, and the depth information of the reference frame.
  • In a possible implementation, the first rendering information includes the focal length when rendering the video frame, the first position of the virtual camera, and the depth information of the video frame; the second rendering information includes the second position of the virtual camera when rendering the reference frame.
  • the determining module is further configured to: determine the motion vector based on the focal length, the distance value between the first position and the second position, and the depth information.
  • In a possible implementation, the video frame and the reference frame are generated when a virtual reality (VR) application is run, the encoded data of the video frame is used for decoding and display on a terminal, and the first position and attitude information and the second position and attitude information are determined based on the position and attitude of the terminal.
  • In a possible implementation, the terminal is a mobile phone, a tablet, a wearable device, or a split device, where the split device includes a display device and a corresponding control device.
  • the wearable device is a head-mounted display.
  • In a possible implementation, the determining module includes:
  • a first determining unit, configured to determine, based on the first position and attitude information of the virtual camera when rendering the video frame, the depth information of the video frame, the second position and attitude information of the virtual camera when rendering the reference frame, and the depth information of the reference frame, the two-dimensional coordinates corresponding to the pixel or image block in the video frame and the two-dimensional coordinates corresponding to the pixel or image block in the reference frame; and
  • a second determining unit, configured to determine the motion vector based on the two-dimensional coordinates corresponding to the pixel or image block in the video frame and the two-dimensional coordinates corresponding to the pixel or image block in the reference frame.
  • In a possible implementation, the second determining unit is configured to: acquire the coordinate difference between the two-dimensional coordinates corresponding to the pixel or image block in the video frame and the two-dimensional coordinates corresponding to the pixel or image block in the reference frame; and determine the motion vector based on the coordinate difference.
  • In a possible implementation, the determining module further includes:
  • a third determining unit, configured to determine the disparity interval between the video frame and the reference frame based on the focal length, the distance value between the first position and the second position, and the depth information;
  • a fourth determining unit, configured to determine the target search range corresponding to the image block in the video frame based on the disparity interval; and
  • a fifth determining unit, configured to determine the motion vector based on the target search range.
  • In a possible implementation, the third determining unit is configured to: determine a depth interval based on the depth information, where the depth interval indicates the value range of the depth values of the pixels in the video frame; and determine the disparity interval between the video frame and the reference frame based on the focal length, the distance value between the first position and the second position, and the depth interval.
  • In a third aspect, a video encoding device is provided, including a processor and a memory, where the memory is configured to store at least one piece of program code, and the at least one piece of program code is loaded and executed by the processor, so that the video encoding device executes the video encoding method provided in the first aspect or any optional manner of the first aspect.
  • In a fourth aspect, a computer-readable storage medium is provided, where the computer-readable storage medium is configured to store at least one piece of program code, and the at least one piece of program code is loaded and executed by a processor, so that a computer executes the video encoding method provided in the first aspect or any optional manner of the first aspect.
  • In a fifth aspect, a computer program product or computer program is provided, comprising program code that, when run on a video encoding device, causes the video encoding device to perform the video encoding method provided in the first aspect or any optional implementation of the first aspect.
  • FIG. 1 is a schematic diagram of a cloud VR rendering system architecture provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a VR application execution flow provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a VR rendering projection provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an image rendering process provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an image block-based coding framework provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a hardware structure of a video encoding device provided by an embodiment of the present application.
  • FIG. 7 is a flowchart of a video encoding method provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a motion estimation provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a diamond search method provided by an embodiment of the present application.
  • FIG. 10 is a flowchart of another video encoding method provided by an embodiment of the present application.
  • FIG. 11 is a flowchart of another video encoding method provided by an embodiment of the present application.
  • FIG. 12 is a flowchart of another video encoding method provided by an embodiment of the present application.
  • FIG. 13 is a flowchart of another video encoding method provided by an embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of a video encoding apparatus provided by an embodiment of the present application.
  • VR technology is a technology that combines the virtual and the real. It relies on computer systems that can create and provide the experience of virtual worlds: a three-dimensional virtual world is created for the user by means of virtual simulation, reflecting the changes in and interactions with physical objects in real time, so that the user is immersed in the three-dimensional virtual world.
  • Rendering: in the computer field, rendering refers to the process of generating an image from a model through software.
  • a model is a description of a three-dimensional object or virtual scene defined in a mathematical language or data structure, including information such as geometry, viewpoint, texture, lighting, and shadows.
  • the image is a digital image such as a bitmap image or a vector image.
  • Cloud computing platform: also known as a cloud platform, it refers to services that provide computing, network, and storage capabilities based on hardware and software resources.
  • the cloud computing platform can realize the rapid provision and release of configurable computing resources with low management cost or low interaction complexity between users and service providers.
  • a cloud computing platform can provide various types of services, for example, a cloud computing platform for providing real-time rendering services is called a cloud rendering computing platform. It should be noted that, in this application, such a cloud rendering computing platform is referred to as a server.
  • Cloud VR: refers to using the powerful computing capability of the cloud computing platform to move VR application content to the cloud, rendering to the cloud, and even production to the cloud, and then delivering an immersive VR experience to end users through the network.
  • This application relates to a strong interactive real-time rendering scene in cloud VR, which is generally referred to as “cloud VR rendering” or “VR cloud rendering”.
  • Depth information of a video frame: also known as the depth information of an image. The binary number corresponding to each pixel contains multiple bits representing color, and the number of these color bits is also referred to as depth information. In this application, the depth information corresponding to a video frame is generally described by a depth map of the same size, where a depth map is an image whose pixel values are the distances from the image collector to the points in the three-dimensional scene, directly reflecting the geometry of the visible surfaces of objects. The depth information of a video frame can therefore be used to reflect the distance of objects in the three-dimensional scene from the camera.
  • Video coding: there is a large amount of inter-frame redundancy (also known as temporal redundancy) and intra-frame redundancy (also known as spatial redundancy) in video pictures. Video coding removes the intra-frame redundancy and inter-frame redundancy to compress the video pictures, so that they are convenient to store and transmit. Accordingly, video coding includes inter-frame coding and intra-frame coding.
  • Motion vector: in the process of inter-frame coding, there is a certain correlation between the scenes in video frames at adjacent moments. A video frame is therefore divided into many non-overlapping image blocks, and the displacement of all pixels within an image block is considered to be the same. For each image block, the block most similar to the current image block, namely the matching block, is found within a given search range in the reference frame according to a certain matching criterion, and the relative displacement between the matching block and the current image block is called the motion vector (MV).
  • Motion estimation: the process of motion estimation is the process of obtaining motion vectors. The quality of motion estimation directly determines the size of the video coding residual, which in turn directly affects the efficiency of video coding; therefore, motion estimation is one of the most important parts of the entire video coding process. Motion estimation can be regarded as the process of using a matching criterion to find the matching block in each reference frame. The simplest method is to match the current image block against all image blocks in the reference frame to obtain the best matching block. In the related art, fast search algorithms such as the diamond search method and the hexagon search method are used, which significantly improve search efficiency by searching along a certain path within a certain range.
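  • A compact sketch of the diamond search pattern with a sum-of-absolute-differences (SAD) matching criterion; boundary clamping and early-termination thresholds are omitted, and all candidate positions are assumed to lie inside the reference frame:

```python
import numpy as np

# Large and small diamond search patterns (offsets relative to the current center).
LDSP = [(0, 0), (0, 2), (0, -2), (2, 0), (-2, 0), (1, 1), (1, -1), (-1, 1), (-1, -1)]
SDSP = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]

def sad(block, ref, x, y):
    """Sum of absolute differences between the current block and the candidate
    block whose top-left corner is (x, y) in the reference frame."""
    h, w = block.shape
    cand = ref[y:y + h, x:x + w].astype(np.int64)
    return int(np.abs(block.astype(np.int64) - cand).sum())

def diamond_search(block, ref, x0, y0):
    """Return the displacement (dx, dy) from the initial candidate (x0, y0)
    to the best-matching block found by the diamond search."""
    cx, cy = x0, y0
    while True:
        cost, dx, dy = min((sad(block, ref, cx + dx, cy + dy), dx, dy) for dx, dy in LDSP)
        if (dx, dy) == (0, 0):   # large-pattern minimum is at the center:
            break                # switch to the small pattern for refinement
        cx, cy = cx + dx, cy + dy
    cost, dx, dy = min((sad(block, ref, cx + dx, cy + dy), dx, dy) for dx, dy in SDSP)
    return cx + dx - x0, cy + dy - y0
```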
  • Disparity vector: can be regarded as a special motion vector. Whereas for a motion vector the video frame and the reference frame belong to the same view, when the video frame and the reference frame belong to different views the relative displacement between the matched image blocks is called a disparity vector.
  • Disparity estimation: the process of disparity estimation is the process of obtaining disparity vectors. Disparity estimation can be regarded as a special kind of motion estimation; the biggest difference between the two lies in the view of the reference frame. In motion estimation, the reference frame and the video frame belong to the same view, whereas in disparity estimation they belong to different views. The search methods for disparity estimation include global search algorithms and local search algorithms.
  • Rendering information: refers to the relevant information used when the server renders and generates video frames in the process of running the VR application. For example, the rendering information includes the display parameters of the terminal, the position and attitude information of the virtual camera in the VR application, the depth information of the video frame and the reference frame, and the like.
  • Rate-distortion optimization: a method for improving the quality of video encoding; specifically, it refers to reducing the amount of video distortion (video quality loss) under the constraint of a certain bit rate. The method determines the prediction mode based on a rate-distortion optimization algorithm, selecting the prediction mode with the smallest coding cost, or the prediction mode whose rate-distortion cost meets the selection criterion.
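  • In the usual rate-distortion optimization formulation (standard practice, not specific to this application), the coding cost of a candidate prediction mode is a Lagrangian combination of distortion and rate, and the mode with the smallest cost is selected:

$$ J = D + \lambda \cdot R $$

  • where D is the distortion (for example, the sum of squared differences between the original and reconstructed block), R is the number of bits needed to code the block in that mode, and λ is the Lagrange multiplier that trades distortion against rate.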
  • the user can interact with the three-dimensional virtual world created by the VR application through the VR terminal, so that the user is immersed in the three-dimensional virtual world as if he were there.
  • Compared with traditional human-computer interaction applications, VR applications mainly have the following two characteristics.
  • First, the interaction is more natural. Traditional human-computer interaction applications mainly use devices such as a keyboard, mouse, or handle to interact through key events, whereas in VR applications the dimension of interaction includes not only key events but also positioning and sensing events. A VR application renders the scene according to the position and posture information of the user's head and hands, so that the picture seen by the user switches and changes correspondingly as the head rotates or the position moves, achieving a more natural interaction.
  • Second, when the VR application renders each frame of video, it generates two video frames with a certain parallax, and these two video frames are finally presented to the user's left and right eyes through the VR terminal to achieve the effect of stereoscopic vision. In addition, the terminal uses a closed scene to isolate the user's audiovisual experience from the outside world, so that the user obtains an immersive audiovisual experience.
  • The video encoding method provided by the embodiments of the present application can be applied in VR application scenarios such as VR games, VR teaching, and VR theaters. The scenarios to which the method can be applied include, but are not limited to, the following.
  • VR games are increasingly popular because of their highly realistic simulation and experience. Users can enter an interactive virtual game world by using a VR terminal; no matter how the user turns their eyes, they remain in the virtual game world. In this scenario, the server running the VR game application needs to render the left-eye and right-eye game videos in real time according to user interaction events, encode the rendered video, and send it to the VR terminal.
  • Applying VR technology to classroom teaching or skill-training teaching has been a key direction in the education field in recent years. For example, VR technology can be used to build a VR laboratory, so that students can perform various experiments without leaving home by using VR terminals and obtain the same experience as in real experiments; VR technology can also be used to establish a VR classroom. In this scenario, the server running the VR teaching application needs to render the left-eye and right-eye teaching videos in real time according to the interaction events of students or teachers, encode the rendered video, and send it to the VR terminal.
  • In a VR theater, users can enter a virtual movie world by using a VR terminal; no matter how the user turns their eyes, they remain in the virtual movie world, as if they were really there. In this scenario, the server running the VR cinema application needs to render the left-eye and right-eye movie videos in real time according to the user's interaction events, encode the rendered videos, and send them to the VR terminal.
  • FIG. 1 is a schematic diagram of an architecture of a cloud VR rendering system to which an embodiment of the present application is applied.
  • the video coding method provided by the embodiment of the present application is applied to the cloud VR rendering system 100 .
  • the cloud VR rendering system 100 includes a server 110 and a terminal 120, and the server 110 and the terminal 120 can be directly or indirectly connected through a wireless network or a wired network, which is not limited in this application.
  • The server 110 generally refers to one of multiple servers, a set composed of multiple servers, or a distributed system.
  • The terminal 120 generally refers to one of multiple terminals, or a set composed of multiple terminals.
  • It should be understood that if the server 110 or the terminal 120 is a collection of multiple devices, the cloud VR rendering system 100 also includes other servers or terminals. This application does not limit the number and type of servers or terminals in the cloud VR rendering system 100.
  • the aforementioned wireless network or wired network uses standard communication technologies and/or protocols.
  • The network is usually the Internet, but can be any network, including but not limited to any combination of a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile network, a wired or wireless network, a private network, or a virtual private network.
  • data exchanged over a network is represented using technologies and/or formats including hypertext markup language (HTML), extensible markup language (XML), and the like.
  • HTML: hypertext markup language
  • XML: extensible markup language
  • SSL: secure sockets layer
  • TLS: transport layer security
  • VPN: virtual private network
  • IPsec: Internet Protocol Security
  • customized and/or dedicated data communication techniques can also be used in place of or in addition to the data communication techniques described above.
  • the server 110 includes a VR application 111 , a VR interface 112 and a VR driving server 113 .
  • the VR application 111 is configured to acquire the user's current interaction events and rendering parameters, perform image rendering based on the interaction events and rendering parameters, generate video frames and transmit them to the VR interface 112 .
  • the current interaction event of the user includes key input, head/hand position or gesture information, and the like.
  • Rendering parameters include rendering resolution and left and right eye projection matrices, etc.
  • FIG. 2 is a schematic diagram of an execution flow of a VR application to which the embodiments of the present application are applied.
  • the running process of the VR application 111 includes the following two stages.
  • the first stage the initialization stage.
  • This initialization phase is used to determine the rendering parameters for VR rendering. Since the video frames generated by VR application rendering are finally submitted to the head-mounted display for presentation, and the display specifications of different head-mounted displays differ, in order to enable the rendered video frames to be displayed properly on different head-mounted displays, the VR application needs to determine, during initialization, rendering parameters that are consistent with the display specifications of the current head-mounted display.
  • the rendering parameters usually include the following two parameters: one is the rendering resolution, which is generally consistent with the display resolution of the head-mounted display, so that the screen scaling operation on the terminal can be avoided.
  • The other is the projection matrices for the left and right eyes, which need to be calculated from the field of view (FOV), interpupillary distance (IPD), and display resolution of the head-mounted display, together with the rendering depth range of the VR application (i.e., the Z-near near clipping plane and the Z-far far clipping plane).
  • the VR application renders and generates video frames according to the above rendering parameters, so that the viewing angle range of the video frames is consistent with the FOV of the head-mounted display; and the parallax of the rendered left and right video frames is consistent with the IPD of the head-mounted display.
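  • As an illustration of how such a projection matrix is typically assembled (one common OpenGL-style convention; the application itself does not spell out the matrix), a symmetric perspective projection built from the vertical FOV, the aspect ratio, and the Z-near/Z-far clipping planes is:

$$
P = \begin{bmatrix}
\dfrac{\cot(\mathrm{FOV}_y/2)}{\mathrm{aspect}} & 0 & 0 & 0 \\
0 & \cot(\mathrm{FOV}_y/2) & 0 & 0 \\
0 & 0 & \dfrac{z_{far}+z_{near}}{z_{near}-z_{far}} & \dfrac{2\,z_{far}\,z_{near}}{z_{near}-z_{far}} \\
0 & 0 & -1 & 0
\end{bmatrix}
$$

  • For stereo rendering, the left-eye and right-eye cameras are additionally offset by roughly ±IPD/2 along the camera's horizontal axis (with the frustum made asymmetric accordingly), which is what produces the left/right parallax mentioned above.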
  • FIG. 3 is a schematic diagram of a VR rendering projection to which the embodiments of the present application are applied.
  • the second stage the rendering stage.
  • the rendering stage is used to render each VR video frame. Before each frame is rendered, the VR application first obtains the user's current interaction events, performs application logic processing, and finally renders and submits the left and right eye video frames.
  • FIG. 4 is a schematic diagram of an image rendering process to which an embodiment of the present application is applied.
  • First, the VR application performs data processing on the relevant events of the game content, such as position calculation and collision detection; then, the VR application sends the game data (vertex coordinates, normal vectors, textures and texture coordinates, etc.) through the data bus to the graphics program interface, such as OpenGL; next, the VR application converts the three-dimensional coordinates corresponding to the video frame into the two-dimensional coordinates of the screen; finally, the VR application performs primitive assembly, rasterization, and pixel coloring, and the rendered video frame is stored in the frame buffer.
  • The VR interface 112 is used to acquire display parameters and interaction events, perform rendering parameter calculation and interaction event processing, transmit the obtained rendering parameters and interaction events to the VR application 111, and transmit the video frames received from the VR application 111 to the VR driving server 113.
  • the industry defines a unified standard interface, such as OpenVR.
  • For example, the VR interface 112 shown in FIG. 1 is a VR runtime.
  • The VR runtime is an adaptation layer between the VR application 111 and the terminal 120: interaction between the VR runtime and the VR application 111, and between the VR runtime and the terminal 120, takes place in a standard manner, so that the VR application 111 and the terminal 120 do not need to be adapted to each other.
  • As shown in FIG. 1, the VR interface 112 includes the following three modules: a rendering parameter calculation module, which is used to calculate the aforementioned rendering parameters according to the rendering depth range determined by the VR application 111 and the display parameters of the terminal 120; a video frame processing module, which is used to transparently transmit the video frames generated by rendering, or to perform additional processing (such as distortion correction) on the video frames before forwarding them; and an interaction event processing module, which is used to transparently transmit the interaction events, or to perform additional processing (such as prediction or smoothing) on the interaction events before forwarding them.
  • It should be noted that the VR interface 112 is not necessary in the entire system. Without it, the VR application can obtain the display parameters and interaction events through the interface provided by the system driver of the terminal and submit the rendered video frames; in this case, the above rendering parameter calculation can be performed on the VR application side or on the system driver side of the terminal.
  • The VR driving server 113 is configured to receive the display parameters and interaction events transmitted by the terminal 120 through the network, transmit the display parameters and interaction events to the VR interface 112, encode the video frames submitted by the VR application 111, and transmit the encoded video frame data to the terminal 120 through the network.
  • the VR driving server 113 compresses the left and right eye video frames by means of video encoding.
  • The video compression standards in the related art are based on an image-block coding framework. Under this framework, the image to be coded is divided into non-overlapping image blocks, and coding is performed in units of image blocks; for example, the common H.26x series of coding standards use a hybrid coding method based on image blocks.
  • FIG. 5 is a schematic diagram of an image block-based coding framework to which the embodiments of the present application are applied.
  • The encoder in the VR driving server 113 performs image segmentation on the video frame to be encoded, performs motion estimation on each image block based on the reference frame to obtain a motion vector, and then performs motion compensation according to the motion vector to obtain residual data.
  • The residual data is transformed by a discrete cosine transform (DCT), the resulting coefficients are quantized, and the quantized DCT coefficients are converted into binary code words to achieve entropy encoding.
  • In addition, the encoder performs inverse quantization and inverse transformation on the quantized DCT coefficients to obtain a reconstructed residual, which is combined with the motion-compensated image blocks to generate a new reference frame that is stored in the frame buffer.
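  • A minimal, self-contained sketch of one iteration of this loop for a single image block, with the motion search omitted and the quantization step chosen arbitrarily; all names are illustrative, not a real encoder API:

```python
import numpy as np
from scipy.fft import dctn, idctn

def encode_block(cur_block, ref_frame, pos, mv, qstep=8.0):
    """Motion compensation, DCT, quantization and the reconstruction path for
    one block; `pos` is the block's top-left corner and `mv` the motion vector."""
    (y, x), (dy, dx) = pos, mv
    h, w = cur_block.shape
    pred = ref_frame[y + dy:y + dy + h, x + dx:x + dx + w].astype(np.float64)
    residual = cur_block.astype(np.float64) - pred            # motion compensation
    coeffs = np.round(dctn(residual, norm="ortho") / qstep)   # transform + quantization
    # `coeffs` and `mv` would now be entropy-coded into the bitstream; the
    # reconstruction below is what goes back into the frame buffer as part of
    # the new reference frame.
    recon = pred + idctn(coeffs * qstep, norm="ortho")
    return coeffs, recon
```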
  • the terminal 120 includes an interaction device 121 , a system driver 122 and a VR driver client 123 .
  • the interaction device 121 is used to collect the current interaction events of the user, and based on the video frame data received from the server 110 , decode the video frame data and present it to the user.
  • the interaction device 121 usually includes two devices related to the user: a head-mounted display and a controller handle. Among them, positioning sensors and input buttons are arranged on both the head-mounted display and the controller handle.
  • the positioning sensor is used to sense and collect position information (position coordinates in three-dimensional space) and attitude information (azimuth angle data in three-dimensional space) of the user's head and/or hands.
  • Input buttons are used to provide users with control functions for VR applications. For example, input buttons are buttons on a controller handle, joystick, or buttons on a head-mounted display. Users can control the VR application by manipulating these buttons.
  • the system driver 122 is configured to transmit the current interaction event of the user and the display parameters of the terminal 120 to the VR driving client 123 , and transmit the received video frame to the interaction device 121 .
  • the VR driver client 123 is configured to transmit interaction events and display parameters to the server 110 through the network, and decode the received video frame data and transmit them to the system driver 122 .
  • FIG. 6 is a schematic diagram of a hardware structure of a video encoding device provided by an embodiment of the present application.
  • The video encoding device 600 includes a processor 601 and a memory 602, where the memory 602 is used to store at least one piece of program code, and the at least one piece of program code is loaded by the processor 601 to execute the video encoding method shown in the following embodiments.
  • The processor 601 may be a network processor (NP), a central processing unit (CPU), an application-specific integrated circuit (ASIC), or an integrated circuit used to control the execution of the program solution of the present application.
  • the processor 601 may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. The number of the processors 601 may be one or multiple.
  • The memory 602 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and that can be accessed by a computer, without limitation.
  • The processor 601 and the memory 602 can be arranged separately or integrated together.
  • In some embodiments, the video encoding device 600 also includes a transceiver.
  • the transceiver is used to communicate with other devices or a communication network, and the network communication mode may be, but not limited to, Ethernet, a radio access network (RAN), a wireless local area network (wireless local area network, WLAN) and the like.
  • the position and posture of the terminal are determined by the current interaction event of the user.
  • the interaction event refers to any one or a combination of the adjustment events of the position and posture of the terminal by the user.
  • the server uses the virtual cameras in the VR application to simulate the left and right eyes of the user respectively, that is, the images captured by the virtual cameras simulate the VR images seen by the user's eyes.
  • The position and posture of the virtual camera in the VR application are determined by the terminal used by the user: the user can adjust the position and posture of the terminal to trigger an adjustment event, so that the position and posture of the virtual camera are adjusted accordingly and the adjusted VR picture is then displayed on the terminal.
  • the terminal is a mobile phone, a tablet, a wearable device or a split device
  • the split device includes a display device and a corresponding control device.
  • For example, when the terminal is a mobile phone, the mobile phone and the server are connected through the network. The mobile phone generates interaction events in response to the user's click, slide, and move operations on the mobile phone and sends them to the server; the server adjusts the position and attitude of the virtual camera in the VR application according to the received interaction events and renders the corresponding video frame, whose encoded data can be decoded and displayed on the mobile phone. It should be understood that the implementation when the terminal is a tablet is similar to that when the terminal is a mobile phone, and is therefore not repeated.
  • When the terminal is a wearable device, for example a head-mounted display, the head-mounted display is connected to the server through the network. The head-mounted display generates interaction events in response to the user's head rotation, nodding, and other actions and sends them to the server; the server adjusts the position and posture of the virtual camera in the VR application according to the received interaction events and renders the corresponding video frames, whose encoded data can be decoded and displayed on the head-mounted display.
  • When the terminal is a split device, the display device is, for example, a mobile phone, a tablet, or an electronic display screen, and the control device is, for example, a handle or a button. The split device is connected to the server through the network; it generates interaction events in response to the user's operations on the control device and sends them to the server. The server adjusts the position and posture of the virtual camera in the VR application according to the received interaction events and renders the corresponding video frame, whose encoded data can be decoded and displayed on the display device.
  • the data generated by the server running the VR application is displayed through the terminal, and the position and posture of the terminal can affect the rendering process of the server, thereby realizing the purpose of VR display.
  • the encoded data of the video frame can be decoded and displayed on a variety of terminals, so the present application has wider applicability.
  • The first point that needs to be explained is that in the cloud VR rendering system, the encoding/decoding process and the transmission process of video frames increase the end-to-end delay, and VR applications are highly interactive applications that are extremely sensitive to delay, so latency is an important indicator that affects the experience. If the network transmission part is not considered, the processing delay on the cloud side accounts for at least 60% of the entire end-to-end delay, and within the cloud-side processing delay, encoding and rendering are the main time-consuming parts.
  • the second point that needs to be explained is that the lossy video encoding process will lead to the degradation of image quality and affect the final subjective experience. This needs to be improved by improving the encoding efficiency (compression ratio) or setting a higher encoding bit rate. Therefore, there is an urgent need for a video coding method that improves the performance of cloud-side video coding (that is, the image quality is improved at the same bit rate) and the speed.
  • In view of this, the present application provides a video coding method in which the rendering information used by the server when rendering the video frame is referenced, so that the motion vector can be obtained through temporal motion estimation or inter-view motion estimation and the video frame can then be encoded. This can greatly speed up video encoding, thereby effectively improving video encoding efficiency and reducing the end-to-end delay.
  • FIG. 7 is a flowchart of a video encoding method provided by an embodiment of the present application.
  • the video encoding method is executed by the server 110 as shown in FIG. 1 .
  • the video encoding method can be applied to the VR driving server 113 of the server 110 .
  • the video encoding method includes the following steps 701 to 703 .
  • the server acquires a video frame and a reference frame of the video frame, where the video frame and the reference frame are from the same view in the multi-view image.
  • the video frame and the reference frame are generated when the server runs the VR application.
  • the video frame is the video frame to be encoded, and the encoded data of the video frame is used for decoding and displaying on the terminal.
  • the reference frame is an encoded video frame, that is, a video frame before the to-be-encoded video frame.
  • When the server runs a VR application, it needs to generate at least one multi-view image to render the VR picture at a given moment, including a video frame corresponding to the left-eye view and a video frame corresponding to the right-eye view, where each video frame is a texture image.
  • the video frame obtained by the server is from the left eye perspective or the right eye perspective, which is not limited in this application.
  • The left-eye and right-eye images generated by the VR application are images from two views. The view whose video frames refer only to video frames of the same view during encoding is called the reference view, and the other views are called dependent views.
  • The specific implementation of this step is described below by taking the case in which the video frame obtained by the server belongs to the reference view as an example.
  • In this case, step 701 includes: in response to the obtained video frame belonging to the reference view, the server obtains a reference frame of the video frame, where the video frame and the reference frame have the same view.
  • Specifically, the server obtains, from the encoded multi-view images, at least one video frame with the same reference view, and uses such video frames as the reference frames of the video frame.
  • For example, for a video frame at time t, the reference frames obtained by the server include the video frames corresponding to the reference view at times t-1, t-2, ..., and t-n, where n is greater than 1; that is, the server uses the video frames corresponding to the reference view before time t as reference frames.
  • In some embodiments, each video frame carries a view ID, and the view ID is used to indicate the view of the video frame. The process for the server to obtain the reference frame then includes: the server determines, based on the view ID of the video frame, that the video frame belongs to the reference view, and then uses the view ID of the reference view as an index to obtain, from the encoded multi-view images, at least one video frame with the same reference view, and uses such video frames as the reference frames of the video frame.
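  • A small sketch of this view-ID based lookup, assuming each frame object exposes illustrative view_id and timestamp attributes (max_refs is likewise an assumed parameter):

```python
def temporal_reference_frames(video_frame, encoded_frames, max_refs=4):
    """Pick previously encoded frames with the same view ID as the frame to be
    encoded (most recent first) to serve as its temporal reference frames."""
    same_view = [f for f in encoded_frames if f.view_id == video_frame.view_id]
    return sorted(same_view, key=lambda f: f.timestamp, reverse=True)[:max_refs]
```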
  • In the above manner, the server obtains the corresponding reference frames for the video frame of the reference view in a targeted manner, and the correlation between the video frame and the reference frame is considered in the time dimension (adjacent moments), which can effectively improve video coding efficiency.
  • The server determines the motion vector of the image block in the video frame based on the first rendering information and the second rendering information, where the first rendering information includes the first position and posture information of the virtual camera when rendering the video frame and the depth information of the video frame, and the second rendering information includes the second position and posture information of the virtual camera when rendering the reference frame and the depth information of the reference frame.
  • the first position and attitude information and the second position and attitude information are determined based on the position and attitude of the terminal.
  • the first position and the second position are the positions of the virtual camera with the same viewing angle at different times.
  • the depth information is used to indicate the depth value corresponding to the pixels in the video frame or reference frame, and is generated by the server when the VR application is running.
  • The process in which the server performs this step 702 to determine the motion vector may be called temporal motion estimation, that is, the server performs motion estimation in the time dimension between the video frame and the reference frame of the same view at different moments.
  • FIG. 8 is a schematic diagram of motion estimation provided by an embodiment of the present application.
  • As shown in FIG. 8, after acquiring the video frame and the reference frame, if the video frame and the reference frame are from the same view, the server performs temporal motion estimation to obtain the motion vector.
  • this step 702 includes but is not limited to the following steps 7021 and 7022.
  • The server determines, based on the first position and posture information of the virtual camera when rendering the video frame, the depth information of the video frame, the second position and posture information of the virtual camera when rendering the reference frame, and the depth information of the reference frame, the two-dimensional coordinates corresponding to the pixel or image block in the video frame and the two-dimensional coordinates corresponding to the pixel or image block in the reference frame.
  • the server provides a calculation function for motion vectors.
  • This step 7021 will be described in detail below, including but not limited to the following steps 7021-1 to 7021-3.
  • the server acquires the conversion relationship between the two-dimensional coordinates and the three-dimensional coordinates corresponding to the pixels or image blocks in any video frame.
  • the conversion relationship refers to the conversion formula for rendering a three-dimensional point (x, y, z) in the real world (using the world coordinate system) to a pixel point (u, v) on the two-dimensional screen (using the camera coordinate system).
  • the motion vector of a video frame is usually calculated in units of image blocks, and the displacements of all pixels in an image block are considered to be the same; that is, for a video frame, the two-dimensional and three-dimensional coordinates corresponding to an image block can be determined based on the two-dimensional and three-dimensional coordinates corresponding to a pixel in the image block.
  • the conversion relationship can be expressed as formula (1): d · [u, v, 1]^T = K · T · [x, y, z, 1]^T, wherein:
  • T represents the transformation matrix from the world coordinate system to the camera coordinate system, which is determined by the position and attitude of the camera in the world coordinate system;
  • K represents the camera calibration matrix, which is determined by the camera's built-in parameters such as FOV and display resolution;
  • d represents the depth value of the pixel (u, v).
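  • Illustratively, the conversion relationship in formula (1) can be sketched in code as follows; this is a minimal illustration that assumes a 3x3 calibration matrix K and a 4x4 homogeneous world-to-camera transform T, and the function names are illustrative rather than part of any standard encoder interface.

```python
import numpy as np

def project(point_world, K, T):
    """Formula (1): d * [u, v, 1]^T = K * T * [x, y, z, 1]^T."""
    p_cam = (T @ np.append(point_world, 1.0))[:3]   # world -> camera coordinates
    d = p_cam[2]                                     # depth of the point
    u, v, _ = (K @ p_cam) / d                        # perspective division
    return u, v, d

def unproject(u, v, d, K, T):
    """Inverse of formula (1): recover (x, y, z) from pixel (u, v) and depth d."""
    p_cam = d * (np.linalg.inv(K) @ np.array([u, v, 1.0]))   # camera coordinates
    p_world = np.linalg.inv(T) @ np.append(p_cam, 1.0)       # camera -> world
    return p_world[:3]
```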
  • in step 7021-2, the server obtains the two-dimensional coordinates corresponding to the pixels or image blocks in the reference frame, and, based on the second position and attitude information of the virtual camera when rendering the reference frame and the depth information of the reference frame, applies the conversion relationship to convert the two-dimensional coordinates corresponding to the pixels or image blocks in the reference frame into the three-dimensional coordinates corresponding to those pixels or image blocks.
  • in step 7021-3, the server applies the conversion relationship to determine the two-dimensional coordinates corresponding to the pixels or image blocks in the video frame, based on the first position and attitude information of the virtual camera when rendering the video frame, the depth information of the video frame, and the three-dimensional coordinates corresponding to the pixels or image blocks in the reference frame.
  • in other words, the two two-dimensional coordinates correspond to the same three-dimensional coordinates in the world coordinate system.
  • taking as an example the case in which the two-dimensional coordinates corresponding to an image block (i.e., pixel) in the reference frame are (u1, v1) and the two-dimensional coordinates corresponding to the matching image block (i.e., pixel) in the video frame are (u2, v2), the above steps 7021-2 and 7021-3 are described below.
  • the server applies the above conversion relationship based on the second position and attitude information of the virtual camera in the VR application when rendering the reference frame and the depth information of the reference frame, and obtains the following formula (2); according to formula (2), the three-dimensional coordinates (x, y, z) corresponding to the two-dimensional coordinates (u1, v1) are obtained. Further, the server applies the above conversion relationship based on the first position and attitude information of the virtual camera in the VR application when rendering the video frame, the depth information of the video frame, and the above three-dimensional coordinates (x, y, z), and obtains the following formula (3). Formula (2) and formula (3) are as follows:
  • formula (2): d1 · [u1, v1, 1]^T = K · T1 · [x, y, z, 1]^T
  • formula (3): d2 · [u2, v2, 1]^T = K · T2 · [x, y, z, 1]^T
  • T1 represents the conversion matrix corresponding to the reference frame;
  • d1 represents the depth value corresponding to the image block in the reference frame;
  • T2 represents the conversion matrix corresponding to the video frame;
  • d2 represents the depth value corresponding to the image block in the video frame.
  • in this way, the two-dimensional coordinates (u2, v2) corresponding to the image block in the video frame can be calculated.
  • the above steps 7021-1 to 7021-3 determine, based on the three-dimensional coordinates corresponding to a two-dimensional coordinate in the reference frame, the two-dimensional coordinates of those three-dimensional coordinates in the video frame; that is, the server determines the mapping relationship between the two-dimensional coordinates (u1, v1) and the two-dimensional coordinates (u2, v2) through the above steps 7021-1 to 7021-3, which provides a basis for the subsequent determination of the motion vector.
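  • Illustratively, steps 7021-1 to 7021-3 and the coordinate difference of step 7022 can be sketched as follows for the centre pixel of an image block; K, T1, T2 and the depth value d1 are assumed to be available from the rendering information, and all names are illustrative.

```python
import numpy as np

def viewing_angle_displacement(u1, v1, d1, K, T1, T2):
    """Map a block centre (u1, v1) with depth d1 from the reference frame into the
    video frame, and return the coordinate difference (u1 - u2, v1 - v2), which can
    be used directly as the motion vector or as the initial motion vector."""
    # Steps 7021-1 and 7021-2: back-project (u1, v1, d1) to world coordinates (formula (2)).
    p_cam1 = d1 * (np.linalg.inv(K) @ np.array([u1, v1, 1.0]))
    p_world = np.linalg.inv(T1) @ np.append(p_cam1, 1.0)
    # Step 7021-3: re-project the world point into the video frame (formula (3)).
    p_cam2 = (T2 @ p_world)[:3]
    u2, v2, _ = (K @ p_cam2) / p_cam2[2]
    # Step 7022: the coordinate difference is the viewing angle displacement vector.
    return u1 - u2, v1 - v2
```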
  • the server determines the motion vector based on the two-dimensional coordinates corresponding to the pixels or image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or image blocks in the reference frame.
  • this step 7022 includes but is not limited to the following steps 7022-1 and 7022-2.
  • the server obtains the coordinate difference between the two-dimensional coordinates corresponding to the pixels or image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or image blocks in the reference frame.
  • for example, the two-dimensional coordinates corresponding to a pixel or image block in the reference frame are (u1, v1), and the two-dimensional coordinates corresponding to the pixel or image block in the video frame are (u2, v2); the coordinate difference is then (u1 - u2, v1 - v2).
  • it should be noted that the coordinate difference here is only a schematic illustration; in some embodiments, the coordinate difference may be expressed as (u2 - u1, v2 - v1), and in other embodiments the coordinate difference may also be expressed in other vector forms. The present application does not limit the calculation method or expression form of the coordinate difference.
  • the server determines the motion vector based on the coordinate difference.
  • the manner in which the server determines the motion vector includes the following two manners.
  • the server determines the coordinate difference as a motion vector.
  • in a VR scenario, two temporally adjacent video frames of the same viewing angle are usually two images of the same scene generated while the user's head rotates by a small angle, and the rotation angle between the two video frames is usually not large. Therefore, in most scenarios, the content of the two video frames at the preceding and current moments largely overlaps, with only a small displacement.
  • the coordinate difference obtained by the server through the above step 7021 is the displacement vector between an image block in the reference frame and the corresponding image block in the video frame of the same viewing angle at different moments, and is also called the viewing angle displacement vector.
  • the server can directly use the viewing angle displacement vector as the motion vector for the subsequent encoding of the video frame.
  • the server searches in the reference frame according to the first search method based on the coordinate difference, determines a first image block that meets the first search condition, and determines a motion vector based on the first image block.
  • the server takes the viewing angle displacement vector corresponding to the coordinate difference as the initial motion vector, and uses the initial motion vector as the search starting point to search in the reference frame according to the first search method.
  • the first search method is a diamond search method (diamond search, DS), a global search method, a hexagonal search method, a rectangle search method, etc., which is not limited in this application.
  • when the number of reference frames is greater than 1, for any image block in the video frame, the server can separately determine, from each reference frame, a first image block that meets the first search condition.
  • the server then determines at least two motion vectors based on the at least two first image blocks, and predicts the encoding cost of the video frame by invoking the RDO algorithm to obtain at least two corresponding prediction modes.
  • the motion vector corresponding to the prediction mode with the smallest coding cost is used as the motion vector required by the present application.
  • FIG. 9 is a schematic diagram of a diamond search method provided by an embodiment of the present application. As shown in the left figure in Figure 9, the first center position in the figure is the image block corresponding to the initial motion vector, and this diamond search method comprises the following steps 1 to 3.
  • Step 1: The server uses the first center position as the starting point, applies a large diamond search pattern (LDSP), and determines the minimum block distortion (MBD) image block, referred to as the first MBD, from the 9 image blocks including the first center position.
  • if the first MBD is the image block corresponding to the first center position, the server executes the following step 3; if not, the server executes the following step 2.
  • Step 2: The server uses the first MBD as the second center position, applies the LDSP again, and determines the second MBD from the 9 image blocks including the first MBD; if the second MBD is the image block corresponding to the second center position, the server executes the following step 3; if not, the server repeats step 2.
  • Step 3: The server uses the current MBD (the first MBD or the second MBD) as the third center position, applies the small diamond search pattern (SDSP), and determines the third MBD from the 5 image blocks including the current MBD. If the third MBD is the image block corresponding to the third center position, the server calculates the motion vector based on the third MBD; if not, the server repeats step 3.
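  • Illustratively, a diamond search seeded with the viewing angle displacement vector can be sketched as follows; the SAD cost over 8-bit grayscale frames is a simplifying assumption, and a real encoder would use its own block-matching cost.

```python
import numpy as np

LDSP = [(0, 0), (2, 0), (-2, 0), (0, 2), (0, -2), (1, 1), (1, -1), (-1, 1), (-1, -1)]
SDSP = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]

def sad(cur, ref, bx, by, mvx, mvy, bs):
    """Sum of absolute differences of the block at (bx, by) against the reference
    block displaced by the candidate motion vector (mvx, mvy)."""
    h, w = ref.shape
    x, y = bx + mvx, by + mvy
    if x < 0 or y < 0 or x + bs > w or y + bs > h:
        return np.inf                                  # candidate outside the frame
    a = cur[by:by + bs, bx:bx + bs].astype(np.int32)
    b = ref[y:y + bs, x:x + bs].astype(np.int32)
    return int(np.abs(a - b).sum())

def diamond_search(cur, ref, bx, by, bs, init_mv=(0, 0)):
    """Steps 1 to 3 of FIG. 9, seeded with the viewing angle displacement vector."""
    mvx, mvy = int(round(init_mv[0])), int(round(init_mv[1]))
    while True:                                        # steps 1 and 2: LDSP
        costs = [sad(cur, ref, bx, by, mvx + dx, mvy + dy, bs) for dx, dy in LDSP]
        best = int(np.argmin(costs))
        if best == 0:                                  # MBD is the centre position
            break
        mvx, mvy = mvx + LDSP[best][0], mvy + LDSP[best][1]
    while True:                                        # step 3: SDSP refinement
        costs = [sad(cur, ref, bx, by, mvx + dx, mvy + dy, bs) for dx, dy in SDSP]
        best = int(np.argmin(costs))
        if best == 0:
            return mvx, mvy                            # final motion vector
        mvx, mvy = mvx + SDSP[best][0], mvy + SDSP[best][1]
```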
  • in other words, when the server performs temporal motion estimation, it calculates the viewing angle displacement vector based on the position and attitude information of the virtual camera in the VR application when rendering the video frame and when rendering the reference frame, together with the depth information of the video frame and the reference frame, and uses the viewing angle displacement vector to determine the motion vector.
  • that is, the server determines the motion vector according to the coordinate difference between the two-dimensional coordinates corresponding to the pixels or image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or image blocks in the reference frame.
  • this method of determining the motion vector utilizes the correlation between temporally adjacent video frames and reference frames of the same viewing angle: the viewing angle displacement vector is calculated from the position and attitude information of the virtual camera in the VR application when the video frame and the reference frame are rendered, together with the depth information, and is then either used directly as the motion vector or used as the initial motion vector for the search that finally yields the motion vector. This greatly reduces the calculation amount of motion estimation and effectively improves video coding efficiency.
  • the server encodes the video frame based on the motion vector.
  • the server provides an encoding function, which can encode the video frame based on the motion vector.
  • the encoding function of the server is implemented by an encoder, which can be a built-in encoder on the server, for example, a built-in hardware encoder on the server, or a software encoder installed on the server.
  • the above calculation process of the motion vector may be performed by the encoder, or may be calculated outside the encoder and then transferred to the encoder.
  • in the coding standard, the calculation of the motion vector belongs to an open module; the coding standard does not limit the way the encoder calculates the motion vector, as long as a motion vector in the specified format can be output. Therefore, the video encoding method provided by the embodiments of the present application has wide applicability and can be applied to various encoders in the related art.
  • in summary, when encoding a video frame, the motion vector of an image block in the video frame is determined according to the rendering information used when generating the video frame and the rendering information used when generating the reference frame of the video frame, and the video frame is encoded accordingly.
  • FIG. 10 is a flowchart of another video encoding method provided by an embodiment of the present application.
  • the video encoding method is executed by the server 110 as shown in FIG. 1 .
  • the video encoding method may be applied to the VR driving server 113 of the server 110 .
  • the video coding method includes the following steps 1001 to 1003 .
  • the server obtains a video frame and a reference frame of the video frame, where the video frame and the reference frame are from different perspectives in a multi-view image.
  • this step 1001 includes: in response to the acquired video frame belonging to the dependent view, the server acquires a reference frame of the video frame, where the viewing angle of the video frame is different from that of the reference frame.
  • the server obtains the video frame of the reference view at the current moment from the encoded multi-view image, and uses the video frame of the reference view as the reference frame.
  • the reference frame obtained by the server is the video frame corresponding to the reference viewpoint at time t.
  • each video frame carries a view ID
  • the view ID is used to indicate the view of the video frame.
  • Step 1001 includes: the server determines, based on the view ID of the video frame, that the video frame belongs to the dependent view, and then uses the view ID of the reference view as an index to obtain, from the multi-view image, the video frame of the reference view at the current moment, and uses that video frame as the reference frame.
  • video frames of different viewing angles at the same moment are the pictures that the VR application simulates the user's two eyes seeing when observing the three-dimensional space; they can be understood as two images of the same scene captured by cameras with the same parameters from two viewpoints separated by a certain deviation angle, so there is a strong correlation between video frames of different viewing angles at the same moment.
  • the server obtains the corresponding reference frame according to the perspective of the video frame, and considers the correlation between the video frame and the reference frame from the spatial dimension (left and right perspective), which can effectively improve the video coding efficiency.
  • the server determines the motion vector of the image block in the video frame based on first rendering information and second rendering information, where the first rendering information includes the focal length when rendering the video frame, the first position of the virtual camera, and the depth information of the video frame, and the second rendering information includes the second position of the virtual camera when rendering the reference frame.
  • the focal length is determined based on the FOV and display resolution of the terminal.
  • the first position and the second position are determined based on the position and attitude of the terminal.
  • the first position and the second position are the positions of virtual cameras with different viewing angles in the VR application at the same moment. That is, the first position is related to the viewing angle of the video frame, and the second position is related to the viewing angle of the reference frame.
  • the first position corresponds to the virtual camera simulating the user's left eye
  • the second position corresponds to the virtual camera simulating the user's right eye. It should be understood that both the FOV and the display resolution of the terminal belong to the display parameters of the terminal.
  • the process in which the server performs this step 1002 to determine the motion vector may be called inter-view motion estimation, also called disparity estimation, where a viewpoint corresponds to the reference view (left-eye view) or the dependent view (right-eye view); that is, the server performs motion estimation between the video frame and a reference frame of a different viewing angle at the same moment.
  • as shown in FIG. 8, after acquiring the video frame and the reference frame, if the video frame and the reference frame are from different viewing angles, the server performs inter-view motion estimation and finally obtains the motion vector required by the present application.
  • optionally, this step 1002 can be replaced with: the server determines the motion vector based on the focal length, the distance value between the first position and the second position, and the depth information.
  • the terminal includes a head-mounted display
  • the distance value between the first position and the second position is the interpupillary distance of the head-mounted display.
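  • Illustratively, assuming a symmetric projection, the focal length in pixels can be derived from the terminal's FOV and per-eye display resolution as follows; the numeric values are examples only and are not taken from any particular head-mounted display.

```python
import math

def focal_length_px(display_width_px, horizontal_fov_deg):
    """Per-eye focal length in pixels for a symmetric horizontal FOV."""
    return (display_width_px / 2.0) / math.tan(math.radians(horizontal_fov_deg) / 2.0)

f = focal_length_px(1440, 90.0)   # example per-eye resolution and FOV, about 720 px
b = 0.063                         # example baseline: interpupillary distance in metres
```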
  • this step 1002 includes but is not limited to the following steps 1002-1 to 1002-3.
  • the server determines a disparity interval between the video frame and the reference frame based on the focal length, the distance value, and the depth information.
  • the parallax satisfies formula (5): Dv = f · b / d, wherein:
  • Dv represents the parallax;
  • f is the focal length of the camera;
  • b is the distance value between the two cameras;
  • d is the depth value of the pixel.
  • the server determines the disparity interval between the video frame and the reference frame according to the above formula (5). This embodiment of determining the parallax interval will be described below, including the following two steps.
  • Step 1 The server determines a depth interval based on the depth information, where the depth interval is used to indicate the value range of the depth value of the pixels in the video frame.
  • the server determines the maximum depth value and the minimum depth value of the pixels in the video frame according to the depth information of the video frame, that is, determines the depth interval.
  • the depth interval is expressed as [Dmin, Dmax], which is not limited in this application.
  • Step 2 The server determines a parallax interval between the video frame and the reference frame based on the focal length, the distance value and the depth interval.
  • the server substitutes the focal length of the virtual camera in the VR application, the distance value (that is, the interpupillary distance of the head-mounted display), and the endpoint values of the depth interval into the above formula (5) to obtain the disparity interval between the video frame and the reference frame.
  • illustratively, the disparity interval is represented as [Dvmin, Dvmax]; the present application does not limit the expression form of the disparity interval.
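  • Illustratively, the above two steps can be sketched as follows, assuming a dense depth map expressed in the same units as the interpupillary distance; skipping zero or invalid depth values is an assumption of the sketch rather than a requirement of the present application.

```python
import numpy as np

def disparity_interval(depth_map, f, b):
    """Formula (5): disparity = f * b / depth, evaluated over the depth interval
    [Dmin, Dmax] to obtain the disparity interval [Dvmin, Dvmax]."""
    valid = depth_map[depth_map > 0]                        # skip zero/invalid depths
    d_min, d_max = float(valid.min()), float(valid.max())   # depth interval [Dmin, Dmax]
    dv_min = f * b / d_max      # farthest points give the smallest disparity
    dv_max = f * b / d_min      # nearest points give the largest disparity
    return dv_min, dv_max
```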
  • the server determines, based on the disparity interval, a target search range corresponding to an image block in the video frame.
  • the manner in which the server determines the target search range includes the following two manners.
  • the server determines a first search range corresponding to an image block in the video frame based on the disparity interval, and uses the first search range as a target search range.
  • the server determines the first search range based on the endpoint value of the disparity interval and the position of the image block in the video frame.
  • for example, the position of the image block is represented as (MbX, MbY), the disparity interval is represented as [MinX, MaxX], and the first search range is represented as [MbX + MinX, MbX + MaxX].
  • the server multiplies the endpoint values of the parallax interval by a preset coefficient, respectively, to obtain an enlarged or reduced parallax interval.
  • in this case, the scaled disparity interval can be expressed as [pMinX, qMaxX], where p and q are preset coefficients, and the first search range is expressed as [MbX + pMinX, MbX + qMaxX].
  • the preset coefficients may be the same or different, which is not limited in this application.
  • the server determines, based on the disparity interval, a first search range corresponding to an image block in the video frame; based on a second search method, determines a second search range corresponding to an image block in the video frame; the first search range The intersection with the second search range is used as the target search range.
  • the second search method is a default search method, for example, the second search method is a global search method, a diamond search method, a hexagonal search method, etc., which is not limited in this application.
  • the target search range corresponding to the image block in the video frame can be further narrowed, thereby reducing the calculation amount of motion estimation, greatly accelerating the speed of motion estimation, and effectively improving the video coding efficiency.
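  • Illustratively, manner 1 and manner 2 can be sketched as follows for the horizontal search range of an image block at column MbX; the preset coefficients p and q and the default search window are illustrative parameters.

```python
def target_search_range(mb_x, dv_min, dv_max, p=1.0, q=1.0, default_window=None):
    """Manner 1: [MbX + p*MinX, MbX + q*MaxX]; manner 2: intersect that range with
    the default search window of the second search method."""
    lo, hi = mb_x + p * dv_min, mb_x + q * dv_max   # first search range
    if default_window is not None:                   # second search range, centred on MbX
        lo = max(lo, mb_x - default_window)
        hi = min(hi, mb_x + default_window)
    return lo, hi

# For example, target_search_range(64, 2.5, 18.0, default_window=32) returns (66.5, 82.0).
```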
  • the server determines the motion vector based on the target search range.
  • the server uses the search start point and search path of the second search method as the search start point and search path of the motion vector, searches in the reference frame according to the target search range, and determines the search point that meets the second search condition.
  • the server sends the target search range to an encoder in the server, and the encoder searches in the reference frame according to the target search range. This application does not limit this.
  • in other words, when performing inter-view motion estimation, the server calculates the disparity interval according to the display parameters of the terminal and the depth information of the video frame, and uses the disparity interval to determine the motion vector.
  • the server narrows the search range of the motion vector according to the display parameters of the terminal and the depth information of the video frame, thereby determining the motion vector required by the present application.
  • this method of determining the motion vector utilizes the correlation between video frames and reference frames of different viewing angles at the same moment: the target search range of the motion vector is determined through the disparity interval between the video frame and the reference frame, or the target search range is further narrowed through the intersection of the disparity interval and the search range of the default search method, and the required motion vector is finally obtained. This greatly reduces the calculation amount of motion estimation and effectively improves video coding efficiency.
  • the server encodes the video frame based on the motion vector.
  • the server provides an encoding function, which can encode the video frame based on the motion vector. It should be noted that, the optional implementation manner of this step 1003 is similar to that in the above-mentioned step 703 shown in FIG. 7 , so it is not repeated here.
  • in summary, when encoding a video frame, the motion vector of an image block in the video frame is determined according to the rendering information used when generating the video frame and the rendering information used when generating the reference frame of the video frame, and the video frame is encoded accordingly.
  • FIG. 11 is a flowchart of another video encoding method provided by an embodiment of the present application.
  • the video encoding method is executed by the server 110 as shown in FIG. 1 .
  • the video encoding method may be applied to the VR driving server 113 of the server 110 .
  • the video encoding method includes the following steps 1101 to 1105 .
  • the server obtains a reference frame of the video frame, where the reference frame includes both a reference frame with the same view angle as the video frame, and a reference frame with a different view angle from the video frame.
  • specifically, the server obtains, from the encoded multi-view image, at least one video frame that also belongs to the dependent view and the video frame of the reference view at the current moment, and uses these video frames as the reference frames of the video frame.
  • for example, the reference frames obtained by the server include the video frames corresponding to the dependent view at times t-1, t-2, ..., t-n, and the video frame corresponding to the reference view at time t; that is, the server uses the video frames corresponding to the dependent view before time t and the video frame corresponding to the reference view at time t as reference frames.
  • each video frame carries a view ID
  • the view ID is used to indicate the view of the video frame.
  • Step 1101 includes: the server determines, based on the view ID of the video frame, that the video frame belongs to the dependent view; the server then uses the view ID of the reference view as an index to obtain, from the multi-view image, the video frame of the reference view at the current moment, and uses the view ID of the dependent view as an index to obtain, from the multi-view image, at least one video frame that also belongs to the dependent view; the acquired video frames are used as reference frames.
  • in this way, the server obtains the corresponding reference frames according to the view of the video frame, which considers the correlation between the video frame and the reference frame both in the time dimension (preceding and current moments) and in the spatial dimension (left and right viewing angles), and can effectively improve video coding efficiency.
  • in step 1102, for a reference frame with the same viewing angle as the video frame, the server determines the first motion vector corresponding to the image block in the video frame based on the first position and attitude information of the virtual camera when rendering the video frame, the depth information of the video frame, the second position and attitude information of the virtual camera when rendering the reference frame, and the depth information of the reference frame.
  • the manner in which the server determines the first motion vector may refer to the corresponding implementation manner in the foregoing step 702, that is, the server performs time domain motion estimation to obtain the first motion vector. This application will not repeat them here.
  • in step 1103, for a reference frame with a viewing angle different from that of the video frame, the server determines the second motion vector corresponding to the image block in the video frame based on the focal length when rendering the video frame, the first position of the virtual camera, the depth information of the video frame, and the second position of the virtual camera when rendering the reference frame.
  • the manner in which the server determines the second motion vector may refer to the corresponding implementation in step 1002 above, that is, the server performs inter-view motion estimation to obtain the second motion vector. This application will not repeat them here.
  • the server executes the above steps 1102 and 1103 in the order from front to back.
  • the server performs step 1103 first, and then performs step 1102.
  • the server performs step 1102 and step 1103 synchronously. This embodiment of the present application does not limit the execution order of step 1102 and step 1103 .
  • the server determines a target motion vector based on the first motion vector and the second motion vector.
  • the server invokes the RDO algorithm to predict the coding cost of the video frame based on the first motion vector and the second motion vector, so as to obtain at least two prediction modes; among the at least two prediction modes, the motion vector corresponding to the prediction mode with the smallest coding cost is used as the target motion vector.
  • in this way, the server uses different methods to determine motion vectors for different types of reference frames and further combines the RDO algorithm to determine the target motion vector; determining the motion vector in this way yields the motion vector with the minimum coding cost, thereby effectively improving video coding efficiency.
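  • Illustratively, the selection between the first motion vector and the second motion vector can be approximated with a plain SAD cost as follows; a real encoder would use its rate-distortion cost, so this is only a simplified stand-in for the RDO decision.

```python
import numpy as np

def block_cost(cur, ref, bx, by, mv, bs):
    """SAD of the current block against the reference block displaced by mv."""
    x, y = bx + int(round(mv[0])), by + int(round(mv[1]))
    h, w = ref.shape
    if x < 0 or y < 0 or x + bs > w or y + bs > h:
        return np.inf
    a = cur[by:by + bs, bx:bx + bs].astype(np.int32)
    b = ref[y:y + bs, x:x + bs].astype(np.int32)
    return int(np.abs(a - b).sum())

def pick_target_mv(cur, temporal_ref, inter_view_ref, bx, by, bs, mv1, mv2):
    """Keep whichever of the first (temporal) and second (inter-view) motion
    vectors gives the lower cost for this block."""
    cost1 = block_cost(cur, temporal_ref, bx, by, mv1, bs)
    cost2 = block_cost(cur, inter_view_ref, bx, by, mv2, bs)
    return mv1 if cost1 <= cost2 else mv2
```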
  • the server encodes the video frame based on the target motion vector.
  • in summary, when encoding a video frame, the motion vector of an image block in the video frame is determined according to the rendering information used when generating the video frame and the rendering information used when generating the reference frame of the video frame, and the video frame is encoded accordingly. Using this method for video coding can greatly reduce the amount of calculation for motion estimation, speed up video coding, improve video coding efficiency, and effectively reduce end-to-end delay.
  • FIG. 12 is a flowchart of another video encoding method provided by an embodiment of the present application
  • FIG. 13 is a flowchart of another video encoding method provided by an embodiment of the present application.
  • the video encoding method includes the following steps 1201 to 1208 .
  • the terminal sends rendering parameters to the server, where the rendering parameters include but are not limited to rendering resolution, left and right eye projection matrices, and display parameters of the terminal.
  • the terminal collects the current interaction event of the user, and sends the interaction event to the server, where the interaction event includes but is not limited to key input, and head/hand position and attitude information.
  • the server and the terminal are connected through a network, and the terminal sends rendering parameters and current interaction events of the user to the server.
  • this embodiment of the present application does not limit the execution order of the foregoing steps 1201 and 1202 .
  • the terminal executes step 1202 first, and then executes step 1201; in other embodiments, the terminal executes the above steps 1201 and 1202 synchronously.
  • the server runs the VR application, performs application logic calculation and image rendering calculation, and generates a rendered video frame.
  • the server performs motion vector calculation based on the rendering parameter, the interaction event, and the depth information of the video frame.
  • rendering parameters, interaction events, and depth information all belong to rendering information.
  • when the server calculates the motion vector, the following two scenarios are included. The first is the preceding-and-current-frame scenario, that is, the video frame and the reference frame are from the same viewing angle; in this scenario, the manner in which the server determines the motion vector refers to step 702 in the embodiment shown in FIG. 7, that is, the viewing angle displacement vector is calculated. The second is the inter-view scenario, that is, the video frame and the reference frame are from different viewing angles; in this scenario, the manner in which the server determines the motion vector refers to step 1002 in the embodiment shown in FIG. 10, that is, the disparity interval is calculated. This application will not repeat them here.
  • the motion vector calculation includes an initial motion vector calculation and a target motion vector calculation.
  • the initial motion vector calculation is used to perform the calculation of the perspective displacement vector and the calculation of the disparity interval between the viewpoints.
  • after the server obtains the calculation result of the initial motion vector, the target motion vector calculation (which can also be understood as an accurate motion vector calculation) is performed based on that result, and the motion vector is finally obtained.
  • in the preceding-and-current-frame scenario, the viewing angle displacement vector can be directly used by the server as the motion vector in the subsequent encoding of the video frame, or the server can perform the target motion vector calculation based on the viewing angle displacement vector; in the inter-view scenario, after the server performs the initial motion vector calculation to obtain the disparity interval, the server can perform the target motion vector calculation based on the disparity interval, thereby reducing the search range of the motion vector and realizing accelerated coding.
  • the above-mentioned calculation process of the motion vector may be performed by the encoder of the server, or may be calculated outside the encoder and then transferred to the encoder.
  • in other words, after receiving the rendering parameters and interaction events sent by the terminal, the server performs application logic calculations and image rendering calculations to generate rendered video frames.
  • the server calculates the motion vector based on the depth information of the video frame and the received rendering parameters and interaction events.
  • the server then encodes the video frame based on the obtained motion vector to generate video encoded data.
  • the server encodes the video frame based on the calculated motion vector to generate video encoded data.
  • the server sends the encoded video data to the terminal.
  • the terminal decodes the video frame according to the encoded video data, and submits the decoded video frame to the head-mounted display for display through the system driver in the terminal.
  • the terminal decodes the encoded video data and displays the image.
  • with the above video encoding method, when encoding a video frame, the motion vector of the image block in the video frame is determined according to the rendering information used when generating the video frame and the rendering information used when generating the reference frame of the video frame, and the video frame is encoded accordingly. Using this method for video coding can greatly reduce the amount of calculation for motion estimation, speed up video coding, improve video coding efficiency, and effectively reduce end-to-end delay.
  • in addition, the above method can be used to determine the motion vector regardless of whether the video frame is from the left-eye or right-eye viewing angle; further, the above method can be used to determine the motion vector regardless of whether the video frame and the reference frame belong to the same viewing angle.
  • FIG. 14 is a flowchart of another video encoding method provided by an embodiment of the present application.
  • the video encoding method of the present application is illustrated by taking the server encoding multiple video frames within a period of time, and one video frame being one encoding unit as an example.
  • the video encoding method is executed by the server, and includes the following steps 1401 to 1406 .
  • the server acquires the coding unit, and determines whether the current coding unit is the initial coding unit and whether it is the reference view.
  • the initial coding unit refers to the first video frame within a certain period of time.
  • the server determines whether the current coding unit is a reference view for an implementation manner, please refer to step 701 in the embodiment shown in FIG. 7 above, which is not repeated in this application.
  • in step 1402, if the current coding unit is the initial coding unit and belongs to the reference view, the server encodes the coding unit as a key frame.
  • the key frame refers to the video frame of the first reference viewing angle within a certain period of time. It should be noted that, since the key frame is the first video frame within a certain period of time, the server does not need to obtain a reference frame when encoding the coding unit, but directly encodes the coding unit.
  • in step 1403, if the current coding unit is at the start of the period and belongs to the dependent view, the server acquires the video frame of the reference view corresponding to the coding unit, and uses that video frame as a reference to encode the coding unit as a non-key frame.
  • the non-key frame refers to other video frames except the video frame of the first reference view angle within a certain period of time.
  • for the implementation of step 1403 in which the server encodes the current coding unit as a non-key frame, please refer to steps 1002 and 1003 in the above-mentioned embodiment shown in FIG. 10; that is, the server performs inter-view motion estimation to obtain the motion vector and then encodes the video frame. This application will not repeat them here.
  • in step 1404, if the current coding unit belongs to the reference view and is not at the start of the period, the server uses the previous frame or previous several frames (determined according to the configuration) of the same reference view as reference frames, and encodes the coding unit as a non-key frame.
  • for the implementation of step 1404 in which the server encodes the current coding unit as a non-key frame, please refer to steps 702 and 703 in the above-mentioned embodiment shown in FIG. 7; that is, the server performs temporal motion estimation to obtain the motion vector and then encodes the video frame. This application will not repeat them here.
  • in step 1405, if the current coding unit belongs to the dependent view and is not at the start of the period, the server uses the previous frame or previous several frames (determined according to the configuration) of the same viewing angle as reference frames, and at the same time uses the video frame of the reference view corresponding to the current coding unit as a reference frame, and encodes the coding unit as a non-key frame.
  • for the implementation of step 1405 in which the server encodes the current coding unit as a non-key frame, please refer to steps 1102 to 1105 in the above-mentioned embodiment shown in FIG. 11; that is, the server performs temporal motion estimation and inter-view motion estimation to obtain the motion vector and then encodes the video frame. This application will not repeat them here.
  • the server repeats the above steps 1401 to 1405 until all coding units are encoded.
  • Using the above method for video coding can greatly reduce the amount of calculation of motion estimation, speed up video coding, improve video coding efficiency, and effectively reduce end-to-end delay.
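  • Illustratively, the branching of steps 1401 to 1405 can be summarized by the following reference-selection sketch, where 'ref' denotes the reference view and 'dep' denotes the dependent view; the single temporal reference mirrors FIG. 15 and is configurable, and all names are illustrative.

```python
def select_references(t, is_reference_view, num_temporal_refs=1):
    """Return the (time, view) pairs used as reference frames for the coding
    unit at time t, following steps 1402 to 1405."""
    if t == 0 and is_reference_view:
        return []                                  # step 1402: encode as a key frame
    if t == 0:
        return [(0, 'ref')]                        # step 1403: inter-view reference only
    same_view = 'ref' if is_reference_view else 'dep'
    refs = [(t - k, same_view) for k in range(1, num_temporal_refs + 1) if t - k >= 0]
    if not is_reference_view:
        refs.append((t, 'ref'))                    # step 1405: add the same-moment reference view
    return refs                                    # step 1404 when is_reference_view is True

# For example, select_references(1, False) returns [(0, 'dep'), (1, 'ref')],
# matching the reference frames used in step 1504 of FIG. 15.
```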
  • FIG. 15 is a flowchart of another video coding method provided by an embodiment of the present application, described by taking a group of pictures (GOP) as a coding unit as an example. Schematically, as shown in FIG. 15, time t0 is the start coding time of the coding unit, and one GOP includes the video frames corresponding to time t0, time t1 and time t2, where each time point includes two video frames, corresponding to the reference view and the dependent view, respectively.
  • the video encoding method is executed by the server and includes the following steps 1501 to 1505 .
  • the server encodes the video frame of the reference view at time t0 as a key frame.
  • step 1501 please refer to the step 1402 in the above-mentioned embodiment shown in FIG. 14, and details are not described herein again in this application.
  • the server uses the video frame of the reference view at time t0 as a reference frame, and encodes the video frame of the dependent view at time t0 as a non-key frame.
  • step 1502 please refer to steps 1002 and 1003 in the embodiment shown in FIG. 10, that is, the server performs inter-view motion estimation, and then encodes the video frame after obtaining the motion vector. This application will not repeat them here.
  • the server uses the video frame of the reference view before time t1 as a reference frame, and encodes the video frame of the reference view at time t1 as a non-key frame.
  • step 1503 please refer to step 702 and step 703 in the above-mentioned embodiment shown in FIG. 7, that is, the server performs temporal motion estimation, and then encodes the video frame after obtaining the motion vector. This application will not repeat them here.
  • the server uses the video frame of the dependent view before time t1 and the video frame of the reference view at time t1 as reference frames, and encodes the video frame of the dependent view at time t1 as a non-key frame.
  • step 1504 please refer to steps 1102 to 1105 in the embodiment shown in FIG. 11, that is, the server performs temporal motion estimation and inter-view motion estimation, and encodes the video frame after obtaining a motion vector. This application will not repeat them here.
  • the server encodes the two video frames at time t2 by following the encoding manners used for the video frames of the reference view and the dependent view at time t1, respectively.
  • it should be noted that FIG. 15 is only schematic: the left-eye view (L) is used as the reference view and the right-eye view (R) is used as the dependent view; in other embodiments, the right-eye view can also be used as the reference view and the left-eye view as the dependent view.
  • FIG. 15 only selects the video frame at the previous moment as the reference frame during temporal motion estimation; it should be understood that all video frames before the current moment can also be used as reference frames during temporal motion estimation. For example, when encoding the video frame of the reference view at time t2, the video frames of the reference view at time t0 and time t1 can both be used as reference frames, which is not limited in this application.
  • Using the above method for video coding can greatly reduce the amount of calculation for motion estimation, speed up video coding, improve video coding efficiency, and effectively reduce end-to-end delay.
  • FIG. 16 is a schematic structural diagram of a video encoding apparatus provided by an embodiment of the present application.
  • the video encoding apparatus is used to perform the steps in the execution of the above video encoding method.
  • the video encoding apparatus 1600 includes: an obtaining module 1601, a determining module 1602, and an encoding module 1603.
  • an acquisition module 1601 configured to acquire a video frame and a reference frame of the video frame, the video frame and the reference frame are from the same perspective or different perspectives in the multi-view image;
  • a determination module 1602, configured to determine the motion vector of the image block in the video frame based on the first rendering information and the second rendering information; wherein the first rendering information is the rendering information when the video frame is generated, and the second rendering information is the rendering information when the reference frame is generated;
  • An encoding module 1603, configured to encode the video frame based on the motion vector.
  • optionally, if the video frame and the reference frame are from the same viewing angle, the first rendering information includes the first position and posture information of the virtual camera when rendering the video frame and the depth information of the video frame, and the second rendering information includes the second position and posture information of the virtual camera when rendering the reference frame and the depth information of the reference frame.
  • optionally, if the video frame and the reference frame are from different viewing angles, the first rendering information includes the focal length when rendering the video frame, the first position of the virtual camera, and the depth information of the video frame, and the second rendering information includes the second position of the virtual camera when rendering the reference frame; the determining module 1602 is further configured to determine the motion vector based on the focal length, the distance value between the first position and the second position, and the depth information.
  • the video frame and the reference frame are generated when a virtual reality (VR) application is running; the encoded data of the video frame is used for decoding and display on the terminal; the first position and attitude information and the second position and attitude information are determined based on the position and attitude of the terminal.
  • the terminal is a mobile phone, a tablet, a wearable device or a split device
  • the split device includes a display device and a corresponding control device.
  • the wearable device is a head-mounted display.
  • the determining module 1602 includes:
  • a first determining unit, configured to determine the two-dimensional coordinates corresponding to the pixels or image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or image blocks in the reference frame, based on the first position and attitude information of the virtual camera when rendering the video frame, the depth information of the video frame, the second position and attitude information of the virtual camera when rendering the reference frame, and the depth information of the reference frame;
  • the second determining unit is configured to determine the motion vector based on the two-dimensional coordinates corresponding to the pixels or image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or image blocks in the reference frame.
  • the second determining unit is configured to: obtain the coordinate difference between the two-dimensional coordinates corresponding to the pixels or image blocks in the video frame and the two-dimensional coordinates corresponding to the pixels or image blocks in the reference frame; and determine the motion vector based on the coordinate difference.
  • the determining module 1602 further includes:
  • a third determining unit configured to determine a disparity interval between the video frame and the reference frame based on the focal length, the distance value between the first position and the second position, and the depth information
  • a fourth determining unit configured to determine the target search range corresponding to the image block in the video frame based on the disparity interval
  • a fifth determination unit configured to determine the motion vector based on the target search range.
  • the third determining unit is configured to: determine a depth interval based on the depth information, where the depth interval is used to indicate the value range of the depth values of the pixels in the video frame; and determine the disparity interval between the video frame and the reference frame based on the focal length, the distance value between the first position and the second position, and the depth interval.
  • in summary, when encoding a video frame, the motion vector of an image block in the video frame is determined according to the rendering information used when generating the video frame and the rendering information used when generating the reference frame of the video frame, and the video frame is encoded accordingly.
  • an embodiment of the present application further provides a computer-readable storage medium, such as a memory including program code, where the program code can be executed by a processor so that a computer performs the video encoding method in the above-mentioned embodiments.
  • for example, the computer-readable storage medium is a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, or an optical data storage device.
  • an embodiment of the present application further provides a computer program product or computer program comprising program code which, when run on a video encoding device, causes the video encoding device to perform the video encoding method provided in the above-described embodiments.
  • although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms; these terms are only used to distinguish one element from another.
  • a first location could be termed a second location, and, similarly, a second location could be termed a first location, without departing from the scope of various described examples.
  • Both the first location and the second location may be locations, and in some cases, may be separate and distinct locations.
  • the term “if” may be interpreted to mean “when” or “upon” or “in response to determining” or “in response to detecting.”
  • the phrases “if it is determined" or “if a [statement or event] is detected” can be interpreted to mean “when determining" or “in response to determining... ” or “on detection of [recited condition or event]” or “in response to detection of [recited condition or event]”.
  • all or part of the above-mentioned embodiments may be implemented by software, hardware, firmware, or any combination thereof; when software is used, the embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer program instructions.
  • the computer program instructions When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer program instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wired or wireless transmission.
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that includes one or more available media integrated.
  • the usable media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., digital video discs (DVDs)), or semiconductor media (e.g., solid-state drives), and the like.


Abstract

The present application provides a video encoding method, apparatus, device, and storage medium, and belongs to the technical field of video processing. When a video frame is encoded, the motion vector of an image block in the video frame is determined according to the rendering information used when generating the video frame and the rendering information used when generating the reference frame of the video frame, and the video frame is encoded accordingly. Encoding video in this way can greatly reduce the amount of calculation for motion estimation, speed up video encoding, improve video encoding efficiency, and thus effectively reduce end-to-end delay.

Description

视频编码方法、装置、设备及存储介质 技术领域
本申请涉及视频处理技术领域,特别涉及一种视频编码方法、装置、设备及存储介质。
背景技术
随着互联网技术的快速发展,用户能够通过与云渲染计算平台实现网络连接的虚拟现实(virtual reality,VR)终端,来体验运行在云渲染计算平台上的VR应用。通常,云渲染计算平台在渲染每一帧视频时,都会生成具有一定视差的左、右眼视频帧,分别对应于用户的左眼视角和右眼视角,在对这两个视频帧进行编码后,将其发送至VR终端,由VR终端进行解码和显示,使得用户可以体验到立体视觉的效果。
相关技术中,云渲染计算平台在对左、右眼视频帧进行编码时,通常采用以下两种方法:方法一、将左、右眼视频帧进行拼接,得到一个视频帧,然后基于标准编码器对该视频帧进行编码,生成一路标准视频码流;方法二、基于标准编码器对左、右眼视频帧分别进行编码,生成两路标准视频码流。
在上述方法中,无论是将左、右眼视频帧拼接后进行一路视频编码,还是将左、右眼视频帧各进行一路视频编码,编码的计算量均较大,导致视频编码的速度较慢,进而导致视频编码的效率不高。
发明内容
本申请实施例提供了一种视频编码方法、装置、设备及存储介质,能够有效提高视频编码效率,降低端到端时延。该技术方案如下:
第一方面,提供了一种视频编码方法,该方法包括:获取视频帧,以及该视频帧的参考帧,该视频帧和该参考帧来自多视角图像中的同一视角或者不同视角;基于第一渲染信息和第二渲染信息确定该视频帧中图像块的运动矢量;其中,该第一渲染信息为生成该视频帧时的渲染信息,该第二渲染信息为生成该参考帧时的渲染信息;基于该运动矢量编码该视频帧。
采用上述方法进行视频编码,能够大大减少运动估计的计算量,加快视频编码速度,提高视频编码效率,进而有效降低端到端时延。
可选地,该视频帧和该参考帧是运行虚拟现实VR应用时生成的,该视频帧编码后的数据用于在终端上解码显示,该第一位置、姿态信息,和该第二位置、姿态信息基于该终端的位置和姿态确定。
基于上述可选方式,可以基于终端来对服务器在运行虚拟现实VR应用所生成的数据进行显示,且终端的位置和姿态可以影响服务器的渲染过程,从而实现VR显示的目的。
可选地,该终端为手机、平板、可穿戴设备或分体式设备,该分体式设备包括显示设备以及对应的控制设备。
基于上述可选方式,视频帧编码后的数据能够在多种不同形式的终端上解码显示,本申请具有较为广泛的适用性。
可选地,该可穿戴设备为头戴式显示器。
可选地,若该视频帧与该参考帧来自同一视角,则该第一渲染信息包括渲染该视频帧时虚拟相机的第一位置、姿态信息,以及该视频帧的深度信息;该第二渲染信息包括渲染该参考帧时虚拟相机的第二位置、姿态信息,以及该参考帧的深度信息。
可选地,该基于第一渲染信息和第二渲染信息确定该视频帧与该参考帧之间的运动矢量,包括:基于该渲染该视频帧时虚拟相机的第一位置、姿态信息、该视频帧的深度信息、该渲染该参考帧时虚拟相机的第二位置、姿态信息、该参考帧的深度信息,确定该视频帧中像素或图像块对应的二维坐标与该参考帧中像素或图像块对应的二维坐标;基于该视频帧中像素或图像块对应的二维坐标和该参考帧中像素或图像块对应的二维坐标,确定该运动矢量。
可选地,该确定该视频帧中像素或图像块对应的二维坐标与该参考帧中像素或图像块对应的二维坐标,包括:获取任一视频帧中像素或图像块对应的二维坐标与三维坐标之间的转换关系;获取该参考帧中像素或图像块对应的二维坐标,基于渲染该参考帧时虚拟相机的第二位置、姿态信息、该参考帧的深度信息,应用该转换关系,将该参考帧中像素或图像块对应的二维坐标,转换为该参考帧中像素或图像块对应的三维坐标;基于渲染该视频帧时虚拟相机的第一位置、姿态信息、该视频帧的深度信息,以及该参考帧中像素或图像块对应的三维坐标,应用该转换关系,确定该视频帧中像素或图像块对应的二维坐标。
可选地,该基于该视频帧中像素或图像块对应的二维坐标和该参考帧中像素或图像块对应的二维坐标,确定该运动矢量,包括:获取该视频帧中像素或图像块对应的二维坐标和该参考帧中像素或图像块对应的二维坐标之间的坐标差;基于该坐标差,确定该运动矢量。
可选地,该基于该坐标差,确定该运动矢量,包括:将该坐标差确定为运动矢量。
可选地,该基于该坐标差,确定该运动矢量,还包括:基于该坐标差,按照第一搜索方式,在该参考帧中进行搜索,确定符合第一搜索条件的第一图像块,基于该第一图像块确定运动矢量。
基于上述可选方式,当视频帧与参考帧来自同一视角时,利用了同一视角前后时刻的视频帧和参考帧之间的相关性,通过渲染视频帧与渲染参考帧时虚拟相机的位置、姿态信息以及深度信息,计算得到视角位移矢量,将该视角位移矢量作为运动矢量,或,以该视角位移为初始运动矢量进行搜索,最终得到运动矢量,这种方式大大减少了运动估计的计算量,有效提高了视频编码效率。
可选地,若该视频帧与该参考帧来自不同视角,则该第一渲染信息包括渲染该视频帧时的焦距、虚拟相机的第一位置,以及该视频帧的深度信息;该第二渲染信息包括渲染该参考帧时虚拟相机的第二位置;该基于第一渲染信息和第二渲染信息确定该视频帧与该参考帧之间的运动矢量,包括:基于该焦距、该第一位置与该第二位置之间的距离值以及该深度信息,确定该运动矢量。
可选地,该基于该焦距、该第一位置与该第二位置之间的距离值以及该深度信息,确定该运动矢量,包括:基于该焦距、该第一位置与该第二位置之间的距离值以及该深度信息,确定该视频帧和该参考帧之间的视差区间;基于该视差区间,确定该视频帧中图像块对应的目标搜索范围;基于该目标搜索范围,确定该运动矢量。
可选地,该基于该焦距、该第一位置与该第二位置之间的距离值以及该深度信息,确定该视频帧和该第二参考视频帧之间的视差区间,包括:基于该深度信息,确定深度区间,该深度区间用于指示该视频帧中像素的深度值的取值范围;基于该焦距、该第一位置与该第二位置之间的距离值以及该深度区间,确定该视频帧和该参考帧之间的视差区间。
可选地,该基于该视差区间,确定该视频帧中图像块对应的目标搜索范围,包括:基于该视差区间,确定该视频帧中图像块对应的第一搜索范围,将该第一搜索范围作为目标搜索范围。
可选地,该基于该视差区间,确定该视频帧中图像块对应的目标搜索范围,包括:基于该视差区间,确定该视频帧中图像块对应的第一搜索范围;基于第二搜索方式,确定该视频帧中图像块对应的第二搜索范围;将该第一搜索范围与该第二搜索范围的交集,作为该目标搜索范围。
基于上述可选方式,当视频帧与参考帧来自不同视角时,利用了同一时刻不同视角的视频帧和参考帧之间的相关性,通过视频帧与参考帧之间的视差区间来确定运动矢量的目标搜索范围,或,通过该视差区间与默认搜索方式的搜索范围来进一步缩小运动矢量的目标搜索范围,最终得到运动矢量,大大减少了运动估计的计算量,有效提高了视频编码效率。
可选地,响应于获取到的视频帧为依赖视角,获取该视频帧的参考帧,该参考帧中既包括与该视频帧视角相同的参考帧,又包括与该视频帧视角不同的参考帧;对于该参考帧中与视频帧视角相同的参考帧,服务器基于渲染该视频帧时虚拟相机的第一位置、姿态信息、该视频帧的深度信息、渲染该参考帧时虚拟相机的第二位置、姿态信息、该参考帧的深度信息,确定该视频帧中图像块对应的第一运动矢量;对于该参考帧中与视频帧视角不同的参考帧,服务器基于渲染该视频帧时的焦距和虚拟相机的第一位置、该视频帧的深度信息、渲染该参考帧时虚拟相机的第二位置,确定该视频帧中图像块对应的第二运动矢量;基于该第一运动矢量和第二运动矢量,确定目标运动矢量;基于该目标运动矢量编码该视频帧。
基于上述可选方式,根据视频帧的视角获取到相应的参考帧,既从时间维度(前后时刻)上考虑了视频帧与参考帧之间的相关性,又从空间维度(左右视角)上考虑了视频帧与参考帧之间的相关性,能够有效提高视频编码效率。
第二方面,提供了一种视频编码装置,该视频编码装置包括:
获取模块,用于获取视频帧,以及该视频帧的参考帧,该视频帧和该参考帧来自多视角图像中的同一视角或者不同视角;
确定模块,用于基于第一渲染信息和第二渲染信息确定该视频帧中图像块的运动矢量;其中,该第一渲染信息为生成该视频帧时的渲染信息,该第二渲染信息为生成该参考帧时的渲染信息;
编码模块,用于基于该运动矢量编码该视频帧。
可选地,若该视频帧与该参考帧来自同一视角,则该第一渲染信息包括渲染该视频帧时虚拟相机的第一位置、姿态信息,以及该视频帧的深度信息;该第二渲染信息包括渲染该参考帧时虚拟相机的第二位置、姿态信息,以及该参考帧的深度信息。
可选地,若该视频帧与该参考帧来自不同视角,则该第一渲染信息包括渲染该视频帧时的焦距、虚拟相机的第一位置,以及该视频帧的深度信息;该第二渲染信息包括渲染该参考帧时虚拟相机的第二位置;该确定模块还用于:基于该焦距、该第一位置与该第二位置之间的距离值以及该深度信息,确定该运动矢量。
可选地,该视频帧和该参考帧是运行虚拟现实VR应用时生成的,该视频帧编码后的数据用于在终端上解码显示,该第一位置、姿态信息,和该第二位置、姿态信息基于该终端的位置和姿态确定。
可选地,该终端为手机、平板、可穿戴设备或分体式设备,该分体式设备包括显示设备以及对应的控 制设备。
可选地,该可穿戴设备为头戴式显示器。
可选地,该确定模块包括:
第一确定单元,用于基于该渲染该视频帧时虚拟相机的第一位置、姿态信息、该视频帧的深度信息、该渲染该参考帧时虚拟相机的第二位置、姿态信息、该参考帧的深度信息,确定该视频帧中像素或图像块对应的二维坐标与该参考帧中像素或图像块对应的二维坐标;
第二确定单元,用于基于该视频帧中像素或图像块对应的二维坐标和该参考帧中像素或图像块对应的二维坐标,确定该运动矢量。
可选地,该第二确定单元用于:获取该视频帧中像素或图像块对应的二维坐标和该参考帧中像素或图像块对应的二维坐标之间的坐标差;基于该坐标差,确定该运动矢量。
可选地,该确定模块还包括:
第三确定单元,用于基于该焦距、该第一位置与该第二位置之间的距离值以及该深度信息,确定该视频帧和该参考帧之间的视差区间;
第四确定单元,用于基于该视差区间,确定该视频帧中图像块对应的目标搜索范围;
第五确定单元,用于基于该目标搜索范围,确定该运动矢量。
在一种可能的实现方式中,该第三确定单元用于:基于该深度信息,确定深度区间,该深度区间用于指示该视频帧中像素的深度值的取值范围;基于该焦距、该第一位置与该第二位置之间的距离值以及该深度区间,确定该视频帧和该参考帧之间的视差区间。
第三方面,提供了一种视频编码设备,该视频编码设备包括处理器和存储器,该存储器用于存储至少一段程序代码,该至少一段程序代码由该处理器加载并执行,以使得该视频编码设备执行上述第一方面或第一方面中任一种可选方式所提供的视频编码方法。
第四方面,提供了一种计算机可读存储介质,该计算机可读存储介质用于存储至少一段程序代码,该至少一段程序代码由处理器加载并执行,以使得计算机执行上述第一方面或第一方面中任一种可选方式所提供的视频编码方法。
第五方面,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括程序代码,当其在视频编码设备上运行时,使得该视频编码设备执行上述第一方面或第一方面的各种可选实现方式中提供的视频编码方法。
附图说明
图1是本申请实施例提供的一种云VR渲染系统架构的示意图;
图2是本申请实施例提供的一种VR应用执行流程的示意图;
图3是本申请实施例提供的一种VR渲染投影的示意图;
图4是本申请实施例提供的一种图像渲染流程的示意图;
图5是本申请实施例提供的一种基于图像块的编码框架的示意图;
图6是本申请实施例提供的一种视频编码设备的硬件结构示意图;
图7是本申请实施例提供的一种视频编码方法的流程图;
图8是本申请实施例提供的一种运动估计的示意图;
图9是本申请实施例提供的一种钻石搜索法的示意图;
图10是本申请实施例提供的另一种视频编码方法的流程图;
图11是本申请实施例提供的另一种视频编码方法的流程图;
图12是本申请实施例提供的另一种视频编码方法的流程图;
图13是本申请实施例提供的另一种视频编码方法的流程图;
图14是本申请实施例提供的另一种视频编码方法的流程图;
图15是本申请实施例提供的另一种视频编码方法的流程图;
图16是本申请实施例提供的一种视频编码装置的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
为了方便理解,在介绍本申请实施例提供的技术方案之前,下面对本申请涉及的关键术语进行说明。
VR技术,是一种将虚拟和现实相互结合的技术。VR技术是一种可创建和体验虚拟世界的计算机系统。它以虚拟仿真的方式给用户创造一个实时反映实体对象变化与相互作用的三维虚拟世界,使用户沉浸到该三维虚拟世界中。
渲染,在计算机领域表示通过软件将模型生成图像的过程。模型是用数学语言或者数据结构进行定义 的三维物体或虚拟场景的描述,它包括几何、视点、纹理、照明和阴影等信息。其中,图像为位图图像或矢量图像等数字图像。
云计算平台,也称为云平台,是指基于硬件资源和软件资源的服务,提供计算、网络和存储能力。云计算平台可以以较小的管理代价,或者用户与业务提供者较低的交互复杂度,实现可配置计算资源的快速发放与发布。应理解,云计算平台能够提供多种类型的服务,例如,用于提供实时渲染服务的云计算平台称为云渲染计算平台。需要说明的是,在本申请中,将这种云渲染计算平台称为服务器。
云VR,具体是指利用云计算平台强大的计算能力,将VR应用内容上云、渲染上云甚至制作上云,再通过互联网将沉浸式的VR体验传递至终端用户。本申请涉及云VR中强交互的实时渲染场景,通常称为“云VR渲染”或者“VR云渲染”。
视频帧的深度信息,也称为图像的深度信息。例如,在视频帧的每个像素通过一组二进制数来进行描述的情况下,每个像素对应的二进制数中包含有多位表示颜色的二进制数,这些表示颜色的二进制数的位数,即称为深度信息。通常,一个视频帧对应的深度信息一般由相同大小的深度图(depth map)来描述,其中,深度图是指,将从图像采集器到三维场景中各点的距离作为像素值的图像,它直接反映了物体可见表面的几何形状。在本申请中,视频帧的深度信息能够用于反映三维场景中物体距离相机的远近。
视频编码,视频画面中存在大量的帧间冗余(也称时间冗余)和帧内冗余(也称空间冗余),视频编码即为通过去除帧内冗余和帧间冗余,对视频画面进行压缩,使得该视频画面便于存储和传输。相应地,视频编码包括帧间编码和帧内编码。
运动矢量,在帧间编码过程中,相邻时刻的不同视频帧中的景物存在着一定的相关性。因此,将一个视频帧分成许多互不重叠的图像块,并认为图像块内所有像素的位移量都相同,然后对每个图像块,在参考帧的某一给定搜索范围内,根据一定的匹配准则找出与当前图像块最相似的块,即匹配块,匹配块与当前图像块之间的相对位移即称为运动矢量(motion vector,MV)。
运动估计,运动估计的过程也即是获取运动矢量的过程。运动估计的好坏直接决定了视频编码的残差的大小,即直接影响视频编码的效率,因此运动估计是整个视频编码过程中最重要的部分之一。运动估计的过程,可以看成是利用匹配准则在各参考帧中寻找匹配块的过程,其最简单的方法是将当前图像块与参考帧中所有图像块进行匹配,从而得到最佳匹配块。相关搜索方法中采用了快速算法,通过在一定的范围内按照一定的路径进行搜索,例如钻石搜索法和六边形搜索法等,能够显著提升搜索效率。
视差矢量,该视差矢量可以看成是一种特殊的运动矢量。在上述运动估计的过程中,视频帧与参考帧属于同一视角,而当视频帧与参考帧属于不同视角时,基于视频帧与参考帧得到的图像块之间的相对位移即称为视差矢量。
视差估计,视差估计的过程也即是获取视差矢量的过程。相应地,视差估计也可以看成一种特殊的运动估计,两者最大的区别是参考帧的视角不一样,运动估计的参考帧与视频帧属于同一视角,视差估计的参考帧与视频帧则属于不同视角。视差估计的搜索方法包括全局搜索算法和局部搜索算法等。
渲染信息,是指服务器在运行VR应用的过程中,渲染生成视频帧时的相关信息。示意性地,在本申请实施例中,该渲染信息包括:终端的显示参数、VR应用中虚拟相机的位置、姿态信息、视频帧和参考帧的深度信息等。
率失真优化(rate distortion optimization,RDO),一种提高视频编码质量的方法。具体是指在一定码率的限制下,减少视频的失真量(视频质量损失)。该方法基于率失真优化算法确定预测模式,选择具有最小编码代价的预测模式,或,率失真满足选择标准的预测模式。
下面对本申请涉及的VR应用以及本申请的应用场景进行简要介绍。
在VR应用中,用户可以通过VR终端与VR应用所创建的三维虚拟世界进行交互,让用户如同身临其境一般置身于该三维虚拟世界中。
相比传统人机交互应用,VR应用主要包括以下两个特点:
第一、交互方式更加自然。传统的人机交互应用主要通过键盘、鼠标或手柄等设备,通过按键事件进行交互,而在VR应用中,交互的维度除了按键事件之外,还有定位传感事件;VR应用会结合用户头部、双手的位置、姿态信息等进行场景渲染,这样用户看到的画面会随着自己头部的转动或者位置的移动进行相应的切换和变化,从而达到一种更自然的交互方式。
第二、立体和沉浸式视觉。VR应用在渲染每一帧视频时,都会产生具有一定视差的两个视频帧,并且这两个视频帧最终会通过VR终端分别投入到用户的左右眼,达到立体视觉的效果;再加上VR终端采用封闭的场景将用户的视听觉和外界隔离,所以能让用户获得沉浸式的视听觉体验。
本申请实施例提供的视频编码方法能够应用在VR游戏、VR教学以及VR影院等这类VR应用场景下。示例性地,本申请实施例提供的视频编码方法能够应用的场景包括但不限于如下几种。
场景一、VR游戏
VR游戏以其高度真实的仿真性和体验感而越来越受到人们的追捧,用户能够通过使用VR终端,进入一个可交互的虚拟游戏世界,不论用户如何转动视线,始终位于该虚拟游戏世界中。在该场景下,运行 有VR游戏应用的服务器需要根据用户的交互事件,实时渲染左、右眼游戏视频,并将渲染视频进行编码后发送给VR终端。
场景二、VR教学
目前,将VR技术应用于课堂教学或技能培训教学等场景是近些年来教育领域的一个重点方向。例如,利用VR技术建立VR实验室,学生能够通过使用VR终端,足不出户便可做各种实验,获得与真实实验一样的体会。再例如,利用VR技术建立VR课堂,当学生不便于出行时,仍然能够通过使用VR终端来参与到课堂中,保证了教学效果。在该场景下,运行有VR教学应用的服务器需要根据学生或老师的交互事件,实时渲染左、右眼教学视频,并将渲染视频进行编码后发送给VR终端。
场景三、VR影院
目前，在利用VR技术建立的VR影院中，用户能够通过使用VR终端，进入一个虚拟电影世界，不论用户如何转动视线，始终位于该虚拟电影世界中，如同身临其境一般。在该场景下，运行有VR影院应用的服务器需要根据用户的交互事件，实时渲染左、右眼电影视频，并将渲染视频进行编码后发送给VR终端。
需要说明的是,上述场景仅为示例性的描述,本申请实施例提供的视频编码方法还能够应用于其他VR应用场景中,例如,VR浏览器、VR看房以及VR医学等等,本申请实施例对此不作限定。
下面对本申请实施例提供的视频编码方法的系统框架进行介绍。
参考图1,图1是本申请实施例所适用的一种云VR渲染系统架构的示意图。如图1所示,本申请实施例提供的视频编码方法应用于云VR渲染系统100中。其中,云VR渲染系统100包括服务器110和终端120,服务器110和终端120能够通过无线网络或有线网络进行直接或间接地连接,本申请在此不做限制。
可选地，服务器110泛指多个服务器中的一个，或者多个服务器组成的集合，或者分布式系统；终端120泛指多个终端中的一个，或者多个终端组成的集合。图1所示仅为示意性说明，该云VR渲染系统100中包括一个服务器110和一个终端120，应理解，如果服务器110或者终端120是多台设备的集合，则云VR渲染系统100中还包括其他服务器或终端。本申请对云VR渲染系统100中服务器或终端的数量和类型不做限定。
可选地,上述的无线网络或有线网络使用标准通信技术和/或协议。网络通常为因特网、但也能够是任何网络,包括但不限于局域网(local area network,LAN)、城域网(metropolitan area network,MAN)、广域网(wide area network,WAN)、移动、有线或者无线网络、专用网络或者虚拟专用网络的任何组合。在一些实施例中,使用包括超级文本标记语言(hyper text markup language,HTML)、可扩展标记语言(extensible markup language,XML)等的技术和/或格式来代表通过网络交换的数据。此外还能够使用诸如安全套接字层(secure socket layer,SSL)、传输层安全(transport layer security,TLS)、虚拟专用网络(virtual private network,VPN)、网际协议安全(internet protocol security,IPsec)等常规加密技术来加密所有或者一些链路。在另一些实施例中,还能够使用定制和/或专用数据通信技术取代或者补充上述数据通信技术。
示意性地,服务器110包括VR应用111、VR接口112以及VR驱动服务端113。
1、VR应用111
VR应用111用于获取用户当前的交互事件和渲染参数,并基于该交互事件和渲染参数进行图像渲染,生成视频帧后传输至VR接口112。
可选地,用户当前的交互事件包括按键输入、头/手的位置或姿态信息等。渲染参数包括渲染分辨率和左、右眼的投影矩阵等。
VR应用111的运行过程请参考图2,图2是本申请实施例所适用的一种VR应用执行流程的示意图。示意性地,以终端120包括头戴式显示器(head-mounted display,HMD)为例,VR应用111的运行过程包括以下两个阶段。
第一阶段、初始化阶段。该初始化阶段用于确定VR渲染的渲染参数。由于VR应用渲染生成的视频帧最终是提交到头戴式显示器上呈现的,而不同头戴式显示器的显示规格是有差异的,为了让渲染的视频帧能够在不同的头戴式显示器上进行合适的显示,VR应用在初始化时需要确定与当前头戴式显示器显示规格一致的渲染参数。
其中,渲染参数通常包括以下两个参数:一个是渲染分辨率,一般与头戴式显示器的显示分辨率一致,这样可以避免终端上的画面缩放操作。另一个是左、右眼的投影矩阵,需要根据头戴式显示器的视场角(field of view,FOV)、瞳距(interpupillary distance,IPD)、显示分辨率以及VR应用的渲染深度范围(即Z-near近裁剪平面,Z-far远裁剪平面)来共同计算得到。VR应用根据上述渲染参数来渲染生成视频帧,可以使得视频帧的视角范围与头戴式显示器的FOV一致;并且渲染的左、右眼视频帧的视差与头戴式显示器的IPD一致。示意性地,VR应用的渲染深度范围请参考图3,图3是本申请实施例所适用的一种VR渲染投影的示意图。
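下面给出一个示意性的Python草图，说明如何由FOV、显示分辨率、瞳距以及Z-near/Z-far得到投影矩阵和以像素为单位的焦距（这里假设采用对称视锥的针孔投影模型、OpenGL风格的矩阵形式，具体数值与变量名均为举例，并非某款头戴式显示器的真实规格）：

import numpy as np

def perspective(fov_y_deg, aspect, z_near, z_far):
    # 对称视锥的透视投影矩阵
    f = 1.0 / np.tan(np.radians(fov_y_deg) / 2.0)
    return np.array([
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (z_far + z_near) / (z_near - z_far), 2.0 * z_far * z_near / (z_near - z_far)],
        [0.0, 0.0, -1.0, 0.0],
    ])

fov_y, width, height = 90.0, 1920, 1920      # 假设的单眼FOV与渲染分辨率
ipd, z_near, z_far = 0.063, 0.1, 1000.0      # 假设的瞳距(米)与渲染深度范围
proj = perspective(fov_y, width / height, z_near, z_far)   # 左、右眼共用的投影矩阵
eye_offset = ipd / 2.0                                      # 左、右眼虚拟相机沿相机x轴各平移 -eye_offset / +eye_offset
f_pixel = (width / 2.0) / np.tan(np.radians(fov_y) / 2.0)   # 以像素为单位的焦距，可用于后文的视差计算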
第二阶段、渲染阶段。该渲染阶段用于对每一VR视频帧进行渲染。在每一帧渲染之前,VR应用先 获取用户当前的交互事件,进行应用逻辑处理,最后进行左、右眼视频帧的渲染和提交。
示意性地，下面参考图4，以VR游戏场景为例，对上述渲染阶段中的一次渲染流程进行简要说明。图4是本申请实施例所适用的一种图像渲染流程示意图。如图4所示，首先，VR应用对游戏内容的相关事件进行数据处理，如位置计算、碰撞检测等；接着，VR应用将游戏数据(顶点坐标，法向量，纹理及纹理坐标等)通过数据总线发送给图形程序接口，如OpenGL等；然后，VR应用将视频帧对应的三维坐标转换为屏幕的二维坐标；最后，VR应用进行图元组合、光栅化以及像素配色等，渲染出视频帧存入帧缓冲器中。
2、VR接口112
VR接口112用于获取显示参数和交互事件,进行渲染参数计算和交互事件处理,将得到的渲染参数和交互事件传输至VR应用111,并将从VR应用111接收到的视频帧传输至VR驱动服务端113。
其中，在VR应用111和终端120之间的数据交互上，为了避免两两适配，业界定义了统一标准接口，如OpenVR。示意性地，图1所示的VR接口112为VR runtime，VR runtime为VR应用111和终端120之间的适配层，其中，VR runtime与VR应用111之间，以及VR runtime与终端120之间可以采用上述标准接口实现；从而使得VR应用111和终端120之间能够以标准的方式进行交互，而不需要互相适配。如图1所示，VR接口112包括以下三个模块：渲染参数计算模块，用于根据VR应用111确定的渲染深度范围和终端120的显示参数计算得到前述的渲染参数；视频帧处理模块，用于对渲染生成的视频帧直接透传，或者是对视频帧进行额外处理(例如畸变校正)后再转发；交互事件处理模块，用于对交互事件直接透传，或者对交互事件进行额外处理(例如预测或平滑)后再转发。
需要说明的是,VR接口112在整个系统中并不是必须的,例如,某个VR应用是基于特定终端定制开发的,VR应用可以根据该终端的系统驱动提供的接口来获取显示参数和交互事件,并提交渲染的视频帧;如果没有VR接口,上述的渲染参数计算可以在VR应用侧,或者终端的系统驱动侧。
3、VR驱动服务端113
VR驱动服务端113用于接收终端120通过网络传输的显示参数和交互事件,将显示参数和交互事件传输至VR接口112,以及对VR应用111提交的视频帧进行编码,并将编码后的视频帧数据通过网络传输至终端120。
可选地,在云VR渲染系统100中,VR驱动服务端113采用视频编码的方式对左、右眼视频帧进行压缩。相关技术的视频压缩标准是基于图像块的编码框架,在该编码框架下,编码图像被划分为不重叠的图像块,以图像块为单位进行编码,例如,常见的h.26x系列编码标准即为基于图像块的混合编码方式。
示意性地,参考图5,图5是本申请实施例所适用的一种基于图像块的编码框架的示意图。如图5所示,在该编码框架下,由VR驱动服务端113中的编码器对待编码视频帧进行图像分块,然后基于参考帧,对图像块进行运动估计,得到运动矢量,然后根据该运动矢量进行运动补偿,得到残差数据,在对残差数据进行离散余弦变换(discrete cosine transform,DCT)后,对相应的系数进行量化,并将量化后的DCT系数转化为二进制码字,实现熵编码。另外,由编码器对量化后的DCT系数进行反量化、反变换后得到重建的残差,结合运动补偿后的图像块,生成新的参考帧,存入帧缓冲器中。
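为便于理解图5所示的编码框架，下面给出一个只覆盖“运动补偿→残差→DCT→量化”几个环节的Python草图（熵编码、反量化与重建环路省略；块大小、量化步长、运动矢量不越界等均为说明用的假设，并非任何标准编码器的实现）：

import numpy as np

def dct_matrix(n=8):
    # DCT-II正交变换矩阵
    c = np.array([[np.cos(np.pi * (2 * j + 1) * i / (2 * n)) for j in range(n)] for i in range(n)])
    c *= np.sqrt(2.0 / n)
    c[0, :] *= 1.0 / np.sqrt(2.0)
    return c

def encode_block(cur_block, ref_frame, mv, x, y, qstep=16.0):
    # cur_block：当前帧中左上角位于(x, y)的图像块；mv：该图像块的运动矢量(dx, dy)
    n = cur_block.shape[0]
    dx, dy = mv
    pred = ref_frame[y + dy:y + dy + n, x + dx:x + dx + n]   # 运动补偿：取参考帧中的匹配块
    residual = cur_block.astype(np.float32) - pred           # 残差
    c = dct_matrix(n)
    coeff = c @ residual @ c.T                               # 二维DCT
    return np.round(coeff / qstep)                           # 量化后的DCT系数，交由熵编码模块处理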
示意性地,终端120包括交互设备121、系统驱动122以及VR驱动客户端123。
交互设备121用于采集用户当前的交互事件,并基于从服务器110接收到的视频帧数据,对该视频帧数据进行解码后呈现给用户。
可选地,以终端120具备显示功能和控制功能为例,交互设备121通常包含与用户相关的两个设备:头戴式显示器和控制器手柄。其中,在头戴式显示器和控制器手柄上均配置有定位传感器和输入按键。定位传感器用于感知和采集用户的头部和/或双手的位置信息(在三维空间中的位置坐标)和姿态信息(在三维空间中方位角度数据)。输入按键用于提供用户对VR应用的控制功能,例如,输入按键为控制器手柄上的按键、摇杆,或头戴式显示器上的按键,用户可通过操控这些按键来控制VR应用。
系统驱动122用于将用户当前的交互事件以及终端120的显示参数传输至VR驱动客户端123,以及将接收到的视频帧传输至交互设备121。
VR驱动客户端123用于将交互事件和显示参数通过网络传输至服务器110,以及对接收到的视频帧数据进行解码后传输至系统驱动122。
以上对云VR渲染系统架构中设备的各个功能模块进行了介绍,下面对服务器110的硬件结构进行简要说明。本申请实施例提供了一种视频编码设备,能够配置为上述云VR渲染系统架构中的服务器110。示意性地,参考图6,图6是本申请实施例提供的一种视频编码设备的硬件结构示意图。如图6所示,该视频编码设备600包括处理器601和存储器602,其中,该存储器602用于存储至少一段程序代码,该至少一段程序代码由该处理器601加载并执行下述实施例所示的视频编码方法。
处理器601可以是网络处理器(Network Processor,简称NP)、中央处理器(central processing unit,CPU)、特定应用集成电路(application-specific integrated circuit,ASIC)或用于控制本申请方案程序执行的集成电路。该处理器601可以是一个单核(single-CPU)处理器,也可以是一个多核(multi-CPU)处理 器。该处理器601的数量可以是一个,也可以是多个。
存储器602可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其它类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其它类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、只读光盘(compact disc read-only Memory,CD-ROM)或其它光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其它磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其它介质,但不限于此。
其中,处理器601和存储器602可以分离设置,也可以集成在一起。
该视频编码设备600还包括收发器。该收发器用于与其它设备或通信网络通信，网络通信的方式可以是但不限于以太网、无线接入网(RAN)、无线局域网(wireless local area networks,WLAN)等。
下面对本申请实施例涉及的终端进行简要介绍。
在本申请实施例中,终端的位置、姿态由用户当前的交互事件决定。交互事件是指用户对终端的位置和姿态的调整事件中的任一项或组合。应理解,服务器在运行VR应用渲染生成视频帧时,使用VR应用中的虚拟相机来分别模拟用户的左右眼,也即是用虚拟相机拍摄的画面来模拟用户双眼看到的VR画面。其中,VR应用中虚拟相机的位置和姿态是由用户所使用的终端决定的,用户能够通过对终端的位置和姿态进行调整,以触发调整事件,从而使得虚拟相机的位置和姿态发生相应调整,以实现在终端上显示调整后的VR画面。
可选地,终端为手机、平板、可穿戴设备或分体式设备,该分体式设备包括显示设备以及对应的控制设备。
例如,终端为手机,手机与服务器通过网络进行连接,手机响应于用户对手机实施的点击、滑动以及移动等操作,生成交互事件并发送至服务器,由服务器根据接收到的交互事件调整VR应用中虚拟相机的位置和姿态,并渲染生成相应的视频帧,该视频帧编码后的数据能够在手机上解码进行显示。应理解,终端为平板时的实施方式与终端为手机时相似,故不再赘述。
又例如,终端为可穿戴设备,该可穿戴设备为头戴式显示器。该头戴式显示器与服务器通过网络进行连接,头戴式显示器响应于用户头部的旋转、点头等动作,生成交互事件并发送至服务器,由服务器根据接收到的交互事件调整VR应用中虚拟相机的位置和姿态,并渲染生成相应的视频帧,该视频帧编码后的数据能够在头戴式显示器上解码进行显示。
再例如,终端为分体式设备,该显示设备为手机、平板或电子显示屏,该控制设备为手柄或按键等。该分体式设备与服务器通过网络进行连接,该分体式设备响应于用户对控制设备的操作,生成交互事件并发送至服务器,由服务器根据接收到的交互事件调整VR应用中虚拟相机的位置和姿态,并渲染生成相应的视频帧,该视频帧编码后的数据能够在显示设备上解码进行显示。
在上述过程中,通过终端来对服务器在运行VR应用所生成的数据进行显示,且终端的位置和姿态可以影响服务器的渲染过程,从而实现VR显示的目的。进一步地,视频帧编码后的数据能够在多种不同形式的终端上解码显示,因此本申请具有较为广泛的适用性。
另外，需要说明的第一点是，在云VR渲染系统中，视频帧的编解码过程和传输过程会增加端到端时延，而VR应用是对时延极敏感的强交互类应用，时延是影响体验的一个重要指标；如果不考虑网络传输部分，云侧的处理时延会占整个端到端时延的60%以上；而在云侧的处理时延中，编码是与渲染耗时相当的环节。需要说明的第二点是，有损的视频编码过程会导致画质下降，影响最终的主观体验，这一点需要通过提高编码效率(压缩比)或者设置更高的编码码率来改善。因此，亟需一种提升云侧视频编码的性能(也即相同码率下提升画质)和速度的视频编码方法。
本申请基于上述云VR渲染系统的系统框架,提供了一种视频编码方法,在确定运动矢量的过程中,参考了服务器在渲染视频帧时所涉及到的渲染信息,从而能够基于时域运动估计和/或视点间运动估计得到运动矢量,来对视频帧进行编码,能够大大加快视频编码速度,进而有效提高视频编码效率,降低端到端时延。
下面,基于下述实施例对本申请实施例提供的视频编码方法进行介绍。
参考图7,图7是本申请实施例提供的一种视频编码方法的流程图。在本申请实施例中,该视频编码方法由如图1所示的服务器110执行,例如,该视频编码方法能够应用于服务器110的VR驱动服务端113中。示意性地,该视频编码方法包括如下步骤701至步骤703。
701、服务器获取视频帧,以及该视频帧的参考帧,该视频帧和该参考帧来自多视角图像中的同一视角。
在本申请实施例中,视频帧和参考帧为服务器运行VR应用时生成的。该视频帧为待编码的视频帧,该视频帧编码后的数据用于在终端上解码显示。该参考帧为已编码的视频帧,即为该待编码视频帧之前的视频帧。
其中,服务器在运行VR应用时,渲染同一时刻的VR画面需要生成至少一个多视角图像,其中包括左眼视角对应的视频帧和右眼视角对应的视频帧,该视频帧为纹理图像。需要说明的是,服务器获取到的视频帧来自左眼视角或右眼视角,本申请对此不作限定。
VR应用产生的左右眼画面是两个视角的画面,为了方便表述,将编码过程中只参考同视角视频帧的视角称为基准视角,将既参考同视角视频帧、又参考其他视角视频帧的视角称为依赖视角。下面以服务器获取到的视频帧为基准视角为例,对本步骤的具体实施方式进行说明。示意性地,本步骤701包括:服务器响应于获取到的视频帧为基准视角,获取该视频帧的参考帧,该视频帧与该参考帧的视角相同。
其中,服务器响应于获取到的视频帧为基准视角,从已编码的多视角图像中获取至少一个同为基准视角的视频帧,将这类视频帧作为该视频帧的参考帧。示意性地,以该视频帧为t时刻的基准视角对应的视频帧为例,服务器获取到的参考帧包括t-1、t-2……以及t-n时刻的基准视角对应的视频帧,其中,n大于1。也即是,服务器将t时刻之前的基准视角对应的视频帧作为参考帧。
在一些实施例中,视频帧均携带视角标识,该视角标识用于指示该视频帧的视角。服务器获取参考帧的过程包括:服务器基于该视频帧的视角标识确定该视频帧为基准视角,然后服务器以基准视角的视角标识为索引,在多视角图像中获取至少一个同为基准视角的视频帧,将这类视频帧作为该视频帧的参考帧。
需要说明的是,在VR应用场景下,同一视角前后时刻的视频帧的显示内容大部分是重合的,经过上述步骤701,服务器对于基准视角的视频帧,有针对性地获取了对应的参考帧,从时间维度(前后时刻)上考虑了视频帧与参考帧之间的相关性,能够有效提高视频编码效率。
702、服务器基于第一渲染信息和第二渲染信息确定该视频帧中图像块的运动矢量;其中,该第一渲染信息包括渲染该视频帧时虚拟相机的第一位置、姿态信息,以及该视频帧的深度信息;该第二渲染信息包括渲染该参考帧时虚拟相机的第二位置、姿态信息,以及该参考帧的深度信息。
在本申请实施例中,该第一位置、姿态信息和该第二位置、姿态信息基于终端的位置和姿态确定。该第一位置和该第二位置为同一视角的虚拟相机在不同时刻下的位置。深度信息用于指示视频帧或参考帧中像素对应的深度值,由服务器运行VR应用时生成。
需要说明的是,服务器执行本步骤702来确定运动矢量的过程可以称为时域运动估计,也即是,服务器在时间维度上,对同一视角不同时刻的视频帧和参考帧之间进行运动估计。
示意性地,参考图8,图8是本申请实施例提供的一种运动估计的示意图。如图8所示,服务器在获取到视频帧和参考帧后,若该视频帧与该参考帧来自同一视角,则进行时域运动估计,从而得到本申请的运动矢量。
在一些实施例中,本步骤702包括但不限于如下步骤7021和步骤7022。
7021、服务器基于渲染该视频帧时虚拟相机的第一位置、姿态信息、该视频帧的深度信息、渲染该参考帧时虚拟相机的第二位置、姿态信息、该参考帧的深度信息,确定该视频帧中像素或图像块对应的二维坐标与该参考帧中像素或图像块对应的二维坐标。
在本申请实施例中,服务器提供针对运动矢量的计算功能。下面对本步骤7021进行详细说明,包括但不限于如下步骤7021-1至步骤7021-3。
7021-1、服务器获取任一视频帧中像素或图像块对应的二维坐标与三维坐标之间的转换关系。
其中,该转换关系是指把现实世界(采用世界坐标系)一个三维点(x,y,z)渲染到二维屏幕(采用相机坐标系)上的一个像素点(u,v)的转换公式。应理解,为简化计算,在视频编码过程中,视频帧的运动矢量通常是以图像块为单位进行计算的,一个图像块内所有像素的位移量认为相同的,也即是,对于视频帧中的任意一个图像块,该图像块对应的二维坐标与三维坐标可以基于该图像块中一个像素点对应的二维坐标与三维坐标来确定。
下面对该转换关系进行说明,参考下述公式(1):
$$d \cdot \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \cdot T \cdot \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \qquad (1)$$
式中,T表示世界坐标系到相机坐标系的转换矩阵,由相机在世界坐标系的位置、姿态决定。K表示相机标定矩阵,由相机的FOV、显示分辨率等内置参数决定。d表示该像素点(u,v)的深度信息。
7021-2、服务器获取该参考帧中像素或图像块对应的二维坐标,基于渲染该参考帧时虚拟相机的第二位置、姿态信息、该参考帧的深度信息,应用该转换关系,将该参考帧中像素或图像块对应的二维坐标,转换为该参考帧中像素或图像块对应的三维坐标。
7021-3、服务器基于渲染该视频帧时虚拟相机的第一位置、姿态信息、该视频帧的深度信息,以及该参考帧中像素或图像块对应的三维坐标,应用该转换关系,确定该视频帧中像素或图像块对应的二维坐标。
其中,对于该视频帧中像素或图像块对应的二维坐标以及该参考帧中像素或图像块对应的二维坐标,这两个二维坐标对应于世界坐标系中的同一个三维坐标。
下面以该参考帧中某一图像块(即像素点)对应的二维坐标为(u1, v1)、该视频帧中某一图像块(即像素点)对应的二维坐标为(u2, v2)为例，对上述步骤7021-2和步骤7021-3进行说明。
示意性地，服务器基于渲染该参考帧时VR应用中虚拟相机的第二位置、姿态信息、该参考帧的深度信息，应用上述转换关系，得到下述公式(2)，根据该公式(2)，得到该二维坐标(u1, v1)对应的三维坐标(x,y,z)。进一步地，服务器基于渲染该视频帧时VR应用中虚拟相机的第一位置、姿态信息、该视频帧的深度信息，以及上述三维坐标(x,y,z)，应用上述转换关系，得到下述公式(3)。公式(2)和公式(3)如下所示：
$$d_1 \cdot \begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} = K \cdot T_1 \cdot \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \qquad (2)$$
$$d_2 \cdot \begin{bmatrix} u_2 \\ v_2 \\ 1 \end{bmatrix} = K \cdot T_2 \cdot \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \qquad (3)$$
上述公式(2)和公式(3)中，T1表示该参考帧对应的转换矩阵，d1表示该参考帧中某一图像块对应的深度值，T2表示该视频帧对应的转换矩阵，d2表示该视频帧中某一图像块对应的深度值。
通过结合上述公式(2)和公式(3),能够得到下述公式(4):
$$d_2 \cdot \begin{bmatrix} u_2 \\ v_2 \\ 1 \end{bmatrix} = K \cdot T_2 \cdot T_1^{-1} \cdot K^{-1} \cdot d_1 \cdot \begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} \qquad (4)$$
根据上述公式(4)，能够计算得到该视频帧中某一图像块对应的二维坐标(u2, v2)。
需要说明的是，上述步骤7021-1至步骤7021-3是基于该参考帧中一个二维坐标对应的三维坐标，确定该三维坐标在该视频帧中的二维坐标，也即是，服务器能够通过上述步骤7021-1至步骤7021-3确定二维坐标(u1, v1)与二维坐标(u2, v2)之间的映射关系，为后续确定运动矢量提供了基础。
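下面用一个示意性的Python片段说明上述坐标映射(即公式(4))的一种实现方式：把公式中的K·T拆成3×3相机标定矩阵K与4×4“世界坐标系到相机坐标系”变换矩阵T来计算，函数名与参数的组织方式均为举例假设：

import numpy as np

def map_pixel(u1, v1, d1, K, T1, T2):
    # 把参考帧中深度为d1的像素(u1, v1)映射到当前视频帧，返回(u2, v2)
    p_cam1 = d1 * np.linalg.inv(K) @ np.array([u1, v1, 1.0])   # 像素坐标+深度 -> 参考帧相机坐标
    p_world = np.linalg.inv(T1) @ np.append(p_cam1, 1.0)       # 相机坐标 -> 世界坐标(T1为世界到相机，故取逆)
    p_cam2 = (T2 @ p_world)[:3]                                # 世界坐标 -> 当前帧相机坐标
    uvw = K @ p_cam2                                           # 投影回像素平面，uvw[2]即深度d2
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

得到(u2, v2)后，即可按下述步骤7022计算坐标差，并将其作为运动矢量或运动矢量搜索的起点。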
7022、服务器基于该视频帧中像素或图像块对应的二维坐标和该参考帧中像素或图像块对应的二维坐标,确定该运动矢量。
在本申请实施例中,本步骤7022包括但不限于如下步骤7022-1和步骤7022-2。
7022-1、服务器获取该视频帧中像素或图像块对应的二维坐标和该参考帧中像素或图像块对应的二维坐标之间的坐标差。
示意性地，以该参考帧中某一像素或图像块对应的二维坐标为(u1, v1)、该视频帧中某一像素或图像块对应的二维坐标为(u2, v2)为例，该坐标差为(u1-u2, v1-v2)。需要说明的是，此处的坐标差仅为示意性说明，在一些实施例中，该坐标差可以表示为(u2-u1, v2-v1)；在另一些实施例中，该坐标差还可以表示为其他向量形式。本申请对于坐标差的计算方式以及表现形式不作限定。
7022-2、服务器基于该坐标差,确定该运动矢量。
在一些实施例中,服务器确定运动矢量的方式包括以下两种方式。
第一种、服务器将该坐标差确定为运动矢量。
应理解，在VR应用场景下，同一视角前后时刻的视频帧通常是在同一场景下，由于用户头部旋转很小的角度而产生的两幅图像，在两帧时间间隔下，用户头部旋转的角度通常不会太大，因此，大多数场景下前后时刻的两个视频帧的画面内容大部分是重合的，只存在较小的位移，服务器通过上述步骤7022-1得到的坐标差即为同一视角前后时刻下，该参考帧中图像块与该视频帧中图像块之间的位移矢量，也称为视角位移矢量。在本步骤7022-2中，服务器将该视角位移矢量作为运动矢量，用于后续对该视频帧进行编码。
第二种、服务器基于该坐标差,按照第一搜索方式,在该参考帧中进行搜索,确定符合第一搜索条件的第一图像块,基于该第一图像块确定运动矢量。
其中,服务器将该坐标差对应的视角位移矢量作为初始运动矢量,以该初始运动矢量为搜索起点,按照第一搜索方式在该参考帧中进行搜索。可选地,该第一搜索方式为钻石搜索法(diamond search,DS)、全局搜索法、六边形搜索法、矩形搜索法等等,本申请对此不作限定。
在一些实施例中,当该参考帧的数量大于1时,对于该视频帧中任意一个图像块,服务器能够从每个参考帧中分别确定出符合第一搜索条件的第一图像块。服务器基于至少两个第一图像块,确定至少两个运动矢量,通过调用RDO算法,对该视频帧的编码代价进行预测,得到至少两个对应的预测模式,将该至少两个预测模式中,编码代价最小的预测模式所对应的运动矢量,作为本申请的运动矢量。
下面以钻石搜索法为例，对服务器确定运动矢量的实施方式进行说明，示意性地，参考图9，图9是本申请实施例提供的一种钻石搜索法的示意图。如图9中左图所示，图中的第一中心位置即为初始运动矢量所对应的图像块，该钻石搜索法包括如下步骤一至步骤三。
步骤一、服务器以该第一中心位置为起点，应用大钻石搜索模板(large diamond search pattern,LDSP)，从包括该第一中心位置在内的9个图像块中确定最小误差图像块(minimum block distortion,MBD)，称为第一MBD，若该第一MBD为第一中心位置对应的图像块，则服务器执行下述步骤三，若否，则服务器执行下述步骤二。
步骤二、服务器将该第一MBD作为第二中心位置,再次应用LDSP,从包括该第一MBD在内的9个图像块中确定第二MBD,若该第二MBD为该第二中心位置对应的图像块,则服务器执行下述步骤三,若否,则服务器重复执行本步骤二。
步骤三、服务器将当前的MBD(第一MBD或第二MBD)作为第三中心位置，应用小钻石搜索模板(small diamond search pattern,SDSP)，从包括该当前的MBD在内的5个图像块中确定第三MBD，若该第三MBD为该第三中心位置对应的图像块，则服务器以该第三MBD计算运动矢量，若否，则服务器重复执行本步骤三。
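上述步骤一至步骤三可以示意性地写成如下Python片段（其中cost(mv)即前文的匹配准则，例如以mv为位移在参考帧中取候选块后计算SAD，越界时应返回足够大的代价；该片段仅用于说明LDSP/SDSP的迭代逻辑，并非编码器的正式实现）：

LDSP = [(0, 0), (2, 0), (-2, 0), (0, 2), (0, -2), (1, 1), (1, -1), (-1, 1), (-1, -1)]
SDSP = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]

def diamond_search(cost, start_mv):
    # start_mv为初始运动矢量(搜索起点)，例如前文计算得到的视角位移矢量
    center = start_mv
    while True:   # 步骤一、二：反复应用大钻石模板，直到最小误差块落在中心
        best = min(((center[0] + dx, center[1] + dy) for dx, dy in LDSP), key=cost)
        if best == center:
            break
        center = best
    while True:   # 步骤三：应用小钻石模板做最后的细化
        best = min(((center[0] + dx, center[1] + dy) for dx, dy in SDSP), key=cost)
        if best == center:
            return center
        center = best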
示意性地,继续参考图8,如图8所示,服务器在进行时域运动估计时,根据渲染视频帧与渲染参考帧时VR应用中虚拟相机的位置、姿态信息以及视频帧与参考帧的深度信息,进行视角位移矢量计算,得到视角位移矢量,并利用该视角位移矢量来确定运动矢量。
需要说明的是,在上述步骤702中,服务器是根据该视频帧中像素或图像块对应的二维坐标和该参考帧中像素或图像块对应的二维坐标之间的坐标差,来确定运动矢量的。这种确定运动矢量的方式利用了同一视角前后时刻的视频帧和参考帧之间的相关性,通过渲染视频帧与渲染参考帧时VR应用中虚拟相机的位置、姿态信息以及深度信息,计算得到视角位移矢量,将该视角位移矢量作为运动矢量,或,以该视角位移为初始运动矢量进行搜索,最终得到运动矢量,这种方式大大减少了运动估计的计算量,有效提高了视频编码效率。
703、服务器基于该运动矢量编码该视频帧。
在本申请实施例中,服务器提供编码功能,能够基于该运动矢量对该视频帧进行编码。
可选地，服务器的编码功能由编码器实现，该编码器可以是服务器上内置的编码器，例如，服务器上内置的硬件编码器等；也可以是安装于服务器上的软件编码器。本申请对此不作限定。其中，上述运动矢量的计算过程可以由编码器执行，也可以在编码器外部计算后再传递到该编码器中。应理解，在编码标准中，运动矢量的计算属于开放模块，编码标准并不对编码器的运动矢量计算方式进行限制，只要能输出指定格式的运动矢量即可。因此，本申请实施例提供的视频编码方法具有广泛的适用性，能够应用于相关技术的各个编码器中。
在本申请实施例提供的视频编码方法中,在对视频帧进行编码时,根据生成该视频帧时的渲染信息,以及生成该视频帧的参考帧时的渲染信息,来确定该视频帧中图像块的运动矢量,并对该视频帧进行编码。采用这种方法进行视频编码,能够大大减少运动估计的计算量,加快运动估计的速度,也就加快了视频编码速度,提高了视频编码效率,进而有效降低了端到端时延。
上述图7所示的实施例是以视频帧与参考帧来自同一视角为例进行说明的。下面以视频帧与参考帧来自多视角图像中的不同视角为例,对本申请实施例提供的视频编码方法进行说明。参考图10,图10是本申请实施例提供的另一种视频编码方法的流程图。在本申请实施例中,该视频编码方法由如图1所示的服务器110执行,例如,该视频编码方法可应用于服务器110的VR驱动服务端113中。示意性地,该视频编码方法包括如下步骤1001至步骤1003。
1001、服务器获取视频帧,以及该视频帧的参考帧,该视频帧和该参考帧来自多视角图像中的不同视角。
下面以服务器获取到的视频帧为依赖视角为例,对本步骤的具体实施方式进行说明。示意性地,本步骤1001包括:服务器响应于获取到的视频帧为依赖视角,获取该视频帧的参考帧,该视频帧与该参考帧的视角不同。
其中,服务器响应于获取到的视频帧为依赖视角,从已编码的多视角图像中获取当前时刻的基准视角的视频帧,将该基准视角的视频帧作为参考帧。示意性地,以该视频帧为t时刻的依赖视角对应的视频帧为例,服务器获取到的参考帧为t时刻的基准视角对应的视频帧。
在一些实施例中,视频帧均携带视角标识,该视角标识用于指示该视频帧的视角。本步骤1001包括:服务器基于该视频帧的视角标识确定该视频帧为依赖视角,然后服务器以基准视角的视角标识为索引,在多视角图像中获取当前时刻基准视角的视频帧,将该视频帧作为参考帧。
需要说明的是,在VR应用场景下,同一时刻不同视角的视频帧为VR应用模拟用户两只眼观察三维空间时所看到的画面,也即能够理解为具有相同参数的相机从两个有一定偏差的角度拍摄同一场景生成的两个图像,因此同一时刻不同视角的视频帧之间具有较大的相关性。经过上述步骤1001,服务器根据视频帧的视角获取相应的参考帧,从空间维度(左右视角)上考虑了视频帧与参考帧之间的相关性,能够有效提高视频编码效率。
1002、服务器基于第一渲染信息和第二渲染信息确定该视频帧中图像块的运动矢量;其中,该第一渲染信息包括渲染该视频帧时的焦距、虚拟相机的第一位置,以及该视频帧的深度信息;该第二渲染信息包括渲染该参考帧时虚拟相机的第二位置。
其中,焦距基于终端的FOV和显示分辨率确定。该第一位置与第二位置基于终端的位置和姿态决定。该第一位置和该第二位置为VR应用中不同视角的虚拟相机在同一时刻下的位置。也即是,该第一位置与该视频帧的视角相关,该第二位置与该参考帧的视角相关。示意性地,以该视频帧的视角为基准视角(左眼视角)为例,该第一位置对应于模拟用户左眼的虚拟相机,该第二位置对应于模拟用户右眼的虚拟相机。应理解,终端的FOV和显示分辨率均属于终端的显示参数。
需要说明的是,服务器执行本步骤1002来确定运动矢量的过程可以称为视点间运动估计(也称为视差估计),其中,视点对应基准视角(左眼视角)或依赖视角(右眼视角)。也即是,服务器在空间维度上,对同一时刻不同视角的视频帧和参考帧之间进行运动估计。
示意性地,继续参考图8。如图8所示,服务器在获取到视频帧和参考帧后,若该视频帧与该参考帧来自不同视角,则进行视点间运动估计,最终得到本申请所需的运动矢量。
下面对本步骤1002的具体实施方式进行说明,本步骤1002替换为:服务器基于该焦距、该第一位置与该第二位置之间的距离值以及该深度信息,确定该运动矢量。
其中,在本申请实施例中,终端包括头戴式显示器,上述第一位置与第二位置之间的距离值也即是头戴式显示器的瞳距。示意性地,本步骤1002包括但不限于如下步骤1002-1至步骤1002-3。
1002-1、服务器基于该焦距、该距离值以及该深度信息,确定该视频帧和该参考帧之间的视差区间。
其中,根据立体视觉配准理论,在相机参数相同的情况下,两个相机所拍摄的图像之间的视差与深度信息之间的关系如下述公式(5)所示:
$$\Delta = \frac{f \cdot b}{d} \qquad (5)$$
式中,Δ表示视差,f为相机焦距,b为两个相机之间的距离值,d为像素的深度值。
在本申请实施例中,服务器根据上述公式(5),确定该视频帧和该参考帧之间的视差区间。下面对这种确定视差区间的实施方式进行说明,包括如下两个步骤。
步骤一、服务器基于该深度信息,确定深度区间,该深度区间用于指示该视频帧中像素的深度值的取值范围。
其中,服务器根据该视频帧的深度信息,确定该视频帧中像素的最大深度值和最小深度值,也即确定了深度区间。示意性地,该深度区间表示为[Dmin,Dmax],本申请对此不作限定。
步骤二、服务器基于该焦距、该距离值以及该深度区间,确定该视频帧和该参考帧之间的视差区间。
其中,服务器将VR应用中虚拟相机的焦距、该距离值(也即头戴式显示器的瞳距)以及深度区间的区间值代入上述公式(5)中,得到该视频帧和该参考帧之间的视差区间。示意性地,该视差区间表示为[Dvmin,Dvmax],本申请对于视差区间的表现形式不作限定。
1002-2、服务器基于该视差区间,确定该视频帧中图像块对应的目标搜索范围。
在一些实施例中,服务器确定目标搜索范围的方式包括以下两种方式。
第一种、服务器基于该视差区间,确定该视频帧中图像块对应的第一搜索范围,将该第一搜索范围作为目标搜索范围。
其中，对于该视频帧中任意一个图像块，服务器基于该视差区间的端点值和该图像块在该视频帧中所处的位置，确定该第一搜索范围。示意性地，该图像块的位置表示为(MbX,MbY)，视差区间表示为[MinX,MaxX]，则第一搜索范围表示为[MbX+MinX,MbX+MaxX]。
在一些实施例中，服务器将视差区间的端点值分别与预设系数进行相乘，得到扩大或缩小后的视差区间。例如，视差区间可以表示为[pMinX,qMaxX]，其中，p和q为预设系数，此时第一搜索范围表示为[MbX+pMinX,MbX+qMaxX]。应理解，该预设系数可以相同，也可以不同，本申请对此不作限定。
第二种、服务器基于该视差区间,确定该视频帧中图像块对应的第一搜索范围;基于第二搜索方式,确定该视频帧中图像块对应的第二搜索范围;将该第一搜索范围与该第二搜索范围的交集,作为该目标搜索范围。
其中,第二搜索方式为默认搜索方式,例如,该第二搜索方式为全局搜索法、钻石搜索法以及六边形搜索法等等,本申请对此不作限定。
通过这种可选的实现方式，能够进一步缩小该视频帧中图像块对应的目标搜索范围，从而减少了运动估计的计算量，大大加快了运动估计速度，有效提高了视频编码效率。
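下面给出一个示意性的Python片段，说明如何由深度信息、以像素为单位的焦距f_pixel和瞳距(即第一位置与第二位置之间的距离值)计算视差区间，并据此得到某个图像块沿水平方向的目标搜索范围（变量名与视差方向的符号约定均为举例假设，与默认搜索范围取交集为可选步骤）：

import numpy as np

def disparity_range(depth_map, f_pixel, baseline):
    # 按公式(5) Δ = f·b/d，由深度区间[Dmin, Dmax]得到视差区间[Dvmin, Dvmax]
    d_min, d_max = float(depth_map.min()), float(depth_map.max())
    return f_pixel * baseline / d_max, f_pixel * baseline / d_min   # 深度越大，视差越小

def target_search_range(mb_x, dv_min, dv_max, default_range=None):
    # 图像块在参考帧中沿水平方向的搜索区间[mb_x + dv_min, mb_x + dv_max]
    lo, hi = mb_x + dv_min, mb_x + dv_max
    if default_range is not None:   # 可选：与默认搜索方式的范围取交集，进一步缩小目标搜索范围
        lo, hi = max(lo, default_range[0]), min(hi, default_range[1])
    return lo, hi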
1002-3、服务器基于该目标搜索范围,确定该运动矢量。
在本申请实施例中,服务器以上述第二搜索方式的搜索起点和搜索路径作为运动矢量的搜索起点和搜索路径,按照该目标搜索范围,在参考帧中进行搜索,确定符合第二搜索条件的第二图像块,基于该第二图像块确定运动矢量。
在一些实施例中,服务器中将该目标搜索范围发送至服务器中的编码器,由该编码器按照该目标搜索范围,在参考帧中进行搜索。本申请对此不作限定。
示意性地,继续参考图8,如图8所示,服务器在进行视点间运动估计时,根据终端的显示参数和该视频帧的深度信息,进行视差区间计算,得到视差区间,并利用该视差区间来确定运动矢量。
需要说明的是,在上述步骤1002中,服务器是根据终端的显示参数以及该视频帧的深度信息来缩小运动矢量的搜索范围的,从而确定了本申请所需的运动矢量。这种确定运动矢量的方式利用了同一时刻不同视角的视频帧和参考帧之间的相关性,通过视频帧与参考帧之间的视差区间来确定运动矢量的目标搜索范围,或,通过该视差区间与默认搜索方式的搜索范围来进一步缩小运动矢量的目标搜索范围,最终得到所需的运动矢量,大大减少了运动估计的计算量,有效提高了视频编码效率。
1003、服务器基于该运动矢量编码该视频帧。
在本申请实施例中,服务器提供编码功能,能够基于该运动矢量对该视频帧进行编码。需要说明的是,本步骤1003的可选实现方式与上述图7所示的步骤703中类似,故在此不再赘述。
在本申请实施例提供的视频编码方法中,在对视频帧进行编码时,根据生成该视频帧时的渲染信息,以及生成该视频帧的参考帧时的渲染信息,来确定该视频帧中图像块的运动矢量,并对该视频帧进行编码。采用这种方法进行视频编码,能够大大减少运动估计的计算量,加快运动估计的速度,也就加快了视频编码速度,提高了视频编码效率,进而有效降低了端到端时延。
上述图10所示的实施例是以视频帧与参考帧来自不同视角为例进行说明。下面以视频帧为依赖视角为例,对本申请实施例提供的视频编码方法进行举例说明。参考图11,图11是本申请实施例提供的另一种视频编码方法的流程图。在本申请实施例中,该视频编码方法由如图1所示的服务器110执行,例如,该视频编码方法可应用于服务器110的VR驱动服务端113中。示意性地,该视频编码方法包括如下步骤1101至步骤1105。
1101、服务器响应于获取到的视频帧为依赖视角,获取该视频帧的参考帧,该参考帧中既包括与该视频帧视角相同的参考帧,又包括与该视频帧视角不同的参考帧。
其中,服务器响应于获取到的视频帧为依赖视角,从已编码的多视角图像中获取至少一个同为依赖视角的视频帧,以及当前时刻的基准视角的视频帧,将这类视频帧作为该视频帧的参考帧。示意性地,以该视频帧为t时刻的依赖视角对应的视频帧为例,服务器获取到的参考帧包括t-1、t-2……以及t-n时刻的依赖视角对应的视频帧,以及t时刻的基准视角对应的视频帧。也即是,服务器将t时刻之前的依赖视角对应的视频帧以及t时刻的基准视角对应的视频帧作为参考帧。
在一些实施例中,视频帧均携带视角标识,该视角标识用于指示该视频帧的视角。本步骤1101包括:服务器基于该视频帧的视角标识确定该视频帧为依赖视角,然后服务器以基准视角的视角标识为索引,在多视角图像中获取当前时刻基准视角的视频帧,同时,服务器以依赖视角的视角标识为索引,在多视角图像中获取至少一个同为依赖视角的视频帧,将获取到的视频帧作为参考帧。
需要说明的是,经过上述步骤1101,服务器根据视频帧的视角获取到相应的参考帧,既从时间维度(前后时刻)上考虑了视频帧与参考帧之间的相关性,又从空间维度(左右视角)上考虑了视频帧与参考帧之间的相关性,能够有效提高视频编码效率。
1102、对于该参考帧中与视频帧视角相同的参考帧,服务器基于渲染该视频帧时虚拟相机的第一位置、姿态信息、该视频帧的深度信息、渲染该参考帧时虚拟相机的第二位置、姿态信息、该参考帧的深度信息,确定该视频帧中图像块对应的第一运动矢量。
其中，服务器确定第一运动矢量的方式可参考上述步骤702中对应的实施方式，也即是由服务器进行时域运动估计以得到第一运动矢量。本申请在此不再赘述。
1103、对于该参考帧中与视频帧视角不同的参考帧,服务器基于渲染该视频帧时的焦距和虚拟相机的第一位置、该视频帧的深度信息、渲染该参考帧时虚拟相机的第二位置,确定该视频帧中图像块对应的第二运动矢量。
其中，服务器确定第二运动矢量的方式可参考上述步骤1002中对应的实施方式，也即是由服务器进行视点间运动估计以得到第二运动矢量。本申请在此不再赘述。
需要说明的是,在本申请实施例中,服务器是按照从前往后的顺序执行上述步骤1102和步骤1103的。在一些实施例中,服务器先执行步骤1103,再执行步骤1102。在另一些实施例中,服务器同步执行步骤1102和步骤1103。本申请实施例对于步骤1102和步骤1103的执行顺序不作限定。
1104、服务器基于该第一运动矢量和第二运动矢量,确定目标运动矢量。
其中,服务器调用RDO算法,基于该第一运动矢量和该第二运动矢量,对该视频帧的编码代价进行预测,得到至少两个预测模式;将该至少两个预测模式中,编码代价最小的预测模式所对应的运动矢量,作为目标运动矢量。
需要说明的是，经过上述步骤1102至步骤1104，服务器针对不同类型的参考帧，采用不同的方式来确定运动矢量，并进一步结合RDO算法来确定目标运动矢量。通过这种方式，能够得到对应编码代价最小的运动矢量，进而有效提高了视频编码的效率。
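结合前文J = D + λ·R的选择逻辑，步骤1102至步骤1104可以示意性地概括为如下Python片段（时域/视点间运动估计函数与率失真代价函数均以参数形式传入，属于说明用的假设接口，并非本申请的正式实现）：

def choose_target_mv(block, temporal_refs, interview_ref, lam,
                     estimate_temporal_mv, estimate_interview_mv, rd_cost):
    # rd_cost(block, ref, mv)返回(失真D, 码率R)；本函数返回编码代价最小的(参考帧, 运动矢量)
    candidates = []
    for ref in temporal_refs:                                # 同视角参考帧：时域运动估计(步骤1102)
        candidates.append((ref, estimate_temporal_mv(block, ref)))
    candidates.append((interview_ref, estimate_interview_mv(block, interview_ref)))  # 视点间运动估计(步骤1103)
    def cost(c):
        d, r = rd_cost(block, c[0], c[1])
        return d + lam * r
    return min(candidates, key=cost)                         # 步骤1104：取编码代价最小者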
1105、服务器基于该目标运动矢量编码该视频帧。
在本申请实施例提供的视频编码方法中,在对视频帧进行编码时,根据生成该视频帧时的渲染信息,以及生成该视频帧的参考帧时的渲染信息,来确定该视频帧中图像块的运动矢量,并对该视频帧进行编码。采用这种方法进行视频编码,能够大大减少运动估计的计算量,加快视频编码速度,提高视频编码效率,进而有效降低端到端时延。
下面参考图12和图13，以服务器和终端之间的交互为例，对本申请的视频编码方法进行举例说明。其中，图12是本申请实施例提供的另一种视频编码方法的流程图；图13是本申请实施例提供的另一种视频编码方法的流程图。示意性地，该视频编码方法包括如下步骤1201至步骤1207。
1201、终端将渲染参数发送至服务器,其中,该渲染参数包括但不限于渲染分辨率、左右眼投影矩阵以及终端的显示参数。
1202、终端采集用户当前的交互事件,并将该交互事件发送至服务器,其中,该交互事件包括但不限于按键输入,以及头/手的位置、姿态信息。
示意性地，上述步骤1201和步骤1202请参考图13，如图13所示，服务器与终端之间通过网络进行连接，终端将渲染参数和用户当前的交互事件发送至服务器。
需要说明的是,本申请实施例对于上述步骤1201和步骤1202的执行顺序不作限定。在一些实施例中,终端先执行步骤1202,再执行步骤1201;在另一些实施例中,终端同步执行上述步骤1201和步骤1202。
1203、服务器基于接收到的渲染参数和交互事件,运行VR应用,执行应用逻辑计算和图像渲染计算,生成渲染后的视频帧。
1204、服务器中基于渲染参数、交互事件以及该视频帧的深度信息,进行运动矢量计算。
其中,渲染参数、交互事件以及深度信息均属于渲染信息。服务器在进行运动矢量计算时包括以下两种场景:第一种是前后帧场景,也即是视频帧和参考帧为同一视角,在这种场景下,服务器确定运动矢量的方式请参考上述图7所示实施例中的步骤702,也即是进行视角位移矢量计算;第二种是视点间场景,也即是视频帧和参考帧为不同视角,在这种场景下,服务器确定运动矢量的方式请参考上述图10所示实施例中的步骤1002,也即是进行视差区间计算。本申请在此不再赘述。
可选地,运动矢量计算包括初始运动矢量计算和目标运动矢量计算。其中,初始运动矢量计算用于执行视角位移矢量计算和视点间视差区间计算,服务器得到初始运动矢量的计算结果后,基于该计算结果,执行目标运动矢量计算(也可以理解为精确运动矢量计算),最终得到运动矢量。
可选地,在前后帧场景下,服务器执行初始运动矢量计算得到视角位移矢量后,该视角位移矢量能够作为运动矢量直接用于后续服务器对视频帧的编码过程中,或者,服务器能够基于该视角位移矢量执行目标运动矢量计算;在视点间场景下,服务器执行初始运动矢量计算得到视差区间后,服务器能够基于该视差区间执行目标运动矢量计算,从而减少运动矢量的搜索范围,实现加速编码。
可选地，上述运动矢量的计算过程可以由服务器的编码器执行，也可以在编码器外部计算后再传递到该编码器中。
示意性地,上述步骤1203和步骤1204请继续参考图13,如图13所示,服务器在接收到终端发送的渲染参数和交互事件后,执行逻辑计算和图像渲染计算,生成渲染后的视频帧。服务器基于该视频帧的深度信息以及接收到的渲染参数和交互事件,进行运动矢量计算。然后服务器基于得到的运动矢量对视频帧进行编码,生成视频编码数据。
1205、服务器基于计算得到的运动矢量,对该视频帧进行编码,生成视频编码数据。
1206、服务器将该视频编码数据发送至终端。
1207、终端根据该视频编码数据对该视频帧进行解码,将解码后的该视频帧通过终端中的系统驱动提交至头戴式显示器进行显示。
示意性地,上述步骤1206和步骤1207请继续参考图13,如图13所示,终端在接收到服务器发送的视频编码数据后,对该视频编码数据进行解码,并进行画面显示。
在本申请实施例提供的视频编码方法中，在对视频帧进行编码时，根据生成该视频帧时的渲染信息，以及生成该视频帧的参考帧时的渲染信息，来确定该视频帧中图像块的运动矢量，并对该视频帧进行编码。采用这种方法进行视频编码，能够大大减少运动估计的计算量，加快视频编码速度，提高视频编码效率，进而有效降低端到端时延。其中，无论该视频帧为左眼视角还是右眼视角，都能采用上述方法来确定运动矢量；进一步地，无论该视频帧与该参考帧是否属于同一视角，也都能采用上述方法来确定运动矢量。
以上实施例阐述了服务器对获取到的任意一个视频帧进行视频编码的具体实施方式。下面请参考图14,图14是本申请实施例提供的另一种视频编码方法的流程图。以服务器对一段时间内的多个视频帧进行编码,且一个视频帧为一个编码单元为例,对本申请的视频编码方法进行举例说明。示意性地,如图14所示,该视频编码方法由服务器执行,包括如下步骤1401至步骤1406。
1401、服务器获取编码单元,判断当前编码单元是否为起始编码单元,是否为基准视角。
其中,起始编码单元是指一定时间段内的首个视频帧。在本步骤1401中,服务器判断当前编码单元是否为基准视角的实施方式请参考上述图7所示实施例中的步骤701,本申请在此不再赘述。
1402、若当前编码单元为起始编码单元且为基准视角,则服务器将该编码单元编码为关键帧。
其中,关键帧是指一定时间段内的首个基准视角的视频帧。需要说明的是,由于该关键帧为一定时间内的首个视频帧,服务器在对该编码单元进行编码时不需要获取参考帧,而是对该编码单元直接进行编码。
1403、若当前编码单元为起始编码单元但为依赖视角,则服务器获取与该编码单元对应的基准视角的视频帧,以该视频帧为参考,将该编码单元编码为非关键帧。
其中,非关键帧是指一定时间段内,除首个基准视角的视频帧以外的其他视频帧。在本步骤1403中,服务器将当前编码单元编码为非关键帧的实施方式请参考上述图10所示实施例中的步骤1002和步骤1003,也即是由服务器进行视点间运动估计,得到运动矢量后对该视频帧进行编码。本申请在此不再赘述。
1404、若当前编码单元不是起始编码单元但为基准视角,则服务器以此前单帧或数帧(根据配置决定)同为基准视角的视频帧作为参考帧,将该编码单元编码为非关键帧。
其中,服务器将当前编码单元编码为非关键帧的实施方式,请参考上述图7所示实施例中的步骤702和步骤703,也即是由服务器进行时域运动估计,得到运动矢量后对该视频帧进行编码。本申请在此不再赘述。
1405、若当前编码单元不是起始编码单元且不是基准视角，则服务器以此前单帧或数帧(根据配置决定)同为依赖视角的视频帧作为参考帧，同时将当前编码单元对应的基准视角的视频帧作为参考帧，将该编码单元编码为非关键帧。
其中,服务器将当前编码单元编码为非关键帧的实施方式请参考上述图11所示实施例中的步骤1102至步骤1105,也即是由服务器进行时域运动估计和视点间运动估计,得到运动矢量后对该视频帧进行编码。本申请在此不再赘述。
1406、服务器重复以上步骤1401至步骤1405,直到所有编码单元编码完成。
采用上述方法进行视频编码,能够大大减少运动估计的计算量,加快视频编码速度,提高视频编码效率,进而有效降低端到端时延。
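图14所示的编码决策流程可以示意性地概括为如下Python片段（encode_key_frame、encode_non_key_frame为说明用的假设接口；为简化起见，这里把“此前单帧或数帧”简化为此前全部同视角帧，并假设同一时刻基准视角的编码单元排在依赖视角之前）：

def encode_sequence(units, encode_key_frame, encode_non_key_frame):
    # units: [(t, view, frame), ...]，view取'base'或'dependent'，按时间顺序排列
    encoded = {}                                  # (t, view) -> 已编码帧，供后续作为参考
    for t, view, frame in units:
        is_start = (t == units[0][0])             # 是否为起始编码时刻
        refs = []
        if view == 'dependent':
            refs.append(encoded[(t, 'base')])     # 同一时刻基准视角帧：视点间参考
        refs += [f for (tt, vv), f in encoded.items() if vv == view and tt < t]   # 同视角历史帧：时域参考
        if is_start and view == 'base':
            encoded[(t, view)] = encode_key_frame(frame)               # 对应步骤1402：关键帧
        else:
            encoded[(t, view)] = encode_non_key_frame(frame, refs)     # 对应步骤1403~1405：非关键帧
    return encoded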
上述图14是以一个视频帧为一个编码单元为例进行说明的，下面请参考图15，图15是本申请实施例提供的另一种视频编码方法的流程图。以一个画面组(group of pictures,GOP)为一个编码单元为例，对本申请的视频编码方法进行举例说明。示意性地，如图15所示，t0时刻为该编码单元的起始编码时刻，一个GOP包括t0时刻、t1时刻以及t2时刻对应的视频帧，其中，每个时刻均包括两个视频帧，分别对应基准视角和依赖视角。该视频编码方法由服务器执行，包括如下步骤1501至步骤1505。
1501、服务器将t0时刻基准视角的视频帧编码为关键帧。
其中,本步骤1501请参考上述图14所示实施例中的步骤1402,本申请在此不再赘述。
1502、服务器将t0时刻基准视角的视频帧作为参考帧，将t0时刻依赖视角的视频帧编码为非关键帧。
其中,本步骤1502请参考图10所示实施例中的步骤1002和步骤1003,也即是由服务器进行视点间运动估计,得到运动矢量后对该视频帧进行编码。本申请在此不再赘述。
1503、服务器将t1时刻之前的基准视角的视频帧作为参考帧，将t1时刻基准视角的视频帧编码为非关键帧。
其中,本步骤1503请参考上述图7所示实施例中的步骤702和步骤703,也即是由服务器进行时域运动估计,得到运动矢量后对该视频帧进行编码。本申请在此不再赘述。
1504、服务器将t1时刻之前的依赖视角的视频帧、以及t1时刻基准视角的视频帧作为参考帧，将t1时刻依赖视角的视频帧编码为非关键帧。
其中,本步骤1504请参考上述图11所示实施例中的步骤1102至步骤1105,也即是由服务器进行时域运动估计和视点间运动估计,得到运动矢量后对该视频帧进行编码。本申请在此不再赘述。
1505、服务器按照t1时刻基准视角和依赖视角的视频帧的编码方式，对t2时刻的视频帧进行编码。
需要说明的是，图15所示仅为示意性地，将左眼视角(L)作为基准视角，右眼视角(R)作为依赖视角，相应地，也可以反过来将右眼视角作为基准视角，左眼视角作为依赖视角。本申请对此不作限定。另外，图15在时域运动估计时仅选取前一时刻的视频帧作为参考帧，应理解，在时域运动估计时，还能够将当前时刻之前的所有视频帧均作为参考帧，例如，编码t2时刻基准视角的视频帧时，将t0时刻和t1时刻基准视角的视频帧均作为参考帧，本申请对此不作限定。
采用上述方法进行视频编码,能够大大减少运动估计的计算量,加快视频编码速度,提高视频编码效率,进而有效降低端到端时延。
图16是本申请实施例提供的一种视频编码装置的结构示意图,该视频编码装置用于执行上述视频编码方法执行时的步骤,参见图16,该视频编码装置1600包括:获取模块1601、确定模块1602以及编码模块1603。
获取模块1601,用于获取视频帧,以及该视频帧的参考帧,该视频帧和该参考帧来自多视角图像中的同一视角或者不同视角;
确定模块1602,用于基于第一渲染信息和第二渲染信息确定该视频帧中图像块的运动矢量;其中,该第一渲染信息为生成该视频帧时的渲染信息,该第二渲染信息为生成该参考帧时的渲染信息;
编码模块1603,用于基于该运动矢量编码该视频帧。
在一种可能的实现方式中,若该视频帧与该参考帧来自同一视角,则该第一渲染信息包括渲染该视频帧时虚拟相机的第一位置、姿态信息,以及该视频帧的深度信息;该第二渲染信息包括渲染该参考帧时虚拟相机的第二位置、姿态信息,以及该参考帧的深度信息。
在一种可能的实现方式中,若该视频帧与该参考帧来自不同视角,则该第一渲染信息包括渲染该视频帧时的焦距、虚拟相机的第一位置,以及该视频帧的深度信息;该第二渲染信息包括渲染该参考帧时虚拟相机的第二位置;
该确定模块1602还用于:基于该焦距、该第一位置与该第二位置之间的距离值以及该深度信息,确定该运动矢量。
在一种可能的实现方式中,该视频帧和该参考帧是运行虚拟现实VR应用时生成的,该视频帧编码后的数据用于在终端上解码显示,该第一位置、姿态信息,和该第二位置、姿态信息基于该终端的位置和姿态确定。
在一种可能的实现方式中,该终端为手机、平板、可穿戴设备或分体式设备,该分体式设备包括显示设备以及对应的控制设备。
在一种可能的实现方式中,该可穿戴设备为头戴式显示器。
在一种可能的实现方式中,该确定模块1602包括:
第一确定单元,用于基于该渲染该视频帧时虚拟相机的第一位置、姿态信息、该视频帧的深度信息、该渲染该参考帧时虚拟相机的第二位置、姿态信息、该参考帧的深度信息,确定该视频帧中像素或图像块对应的二维坐标与该参考帧中像素或图像块对应的二维坐标;
第二确定单元,用于基于该视频帧中像素或图像块对应的二维坐标和该参考帧中像素或图像块对应的二维坐标,确定该运动矢量。
在一种可能的实现方式中,该第二确定单元用于:
获取该视频帧中像素或图像块对应的二维坐标和该参考帧中像素或图像块对应的二维坐标之间的坐标差;
基于该坐标差,确定该运动矢量。
在一种可能的实现方式中,该确定模块1602还包括:
第三确定单元,用于基于该焦距、该第一位置与该第二位置之间的距离值以及该深度信息,确定该视频帧和该参考帧之间的视差区间;
第四确定单元,用于基于该视差区间,确定该视频帧中图像块对应的目标搜索范围;
第五确定单元,用于基于该目标搜索范围,确定该运动矢量。
在一种可能的实现方式中,该第三确定单元用于:
基于该深度信息,确定深度区间,该深度区间用于指示该视频帧中像素的深度值的取值范围;
基于该焦距、该第一位置与该第二位置之间的距离值以及该深度区间,确定该视频帧和该参考帧之间的视差区间。
在本申请实施例提供的视频编码装置中,在对视频帧进行编码时,根据生成该视频帧时的渲染信息,以及生成该视频帧的参考帧时的渲染信息,来确定该视频帧中图像块的运动矢量,并对该视频帧进行编码。采用这种装置进行视频编码,能够大大减少运动估计的计算量,加快视频编码速度,提高视频编码效率,进而有效降低端到端时延。
在示例性实施例中,还提供了一种计算机可读存储介质,例如包括程序代码的存储器,上述程序代码可由终端中的处理器执行,以使计算机完成上述实施例中的视频编码方法。例如,该计算机可读存储介质是只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、只读光盘(compact disc read-only memory,CD-ROM)、磁带、软盘和光数据存储设备等。
在示例性实施例中,还提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括程序代码,当其在视频编码设备上运行时,使得该视频编码设备执行上述实施例中提供的视频编码方法。
本领域普通技术人员可以意识到,结合本文中所公开的实施例中描述的各方法步骤和单元,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各实施例的步骤及组成。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。本领域普通技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的系统、装置和单元的具 体工作过程,可以参见前述方法实施例中的对应过程,在此不再赘述。
本申请中术语“第一”“第二”等字样用于对作用和功能基本相同的相同项或相似项进行区分,应理解,“第一”、“第二”、“第n”之间不具有逻辑或时序上的依赖关系,也不对数量和执行顺序进行限定。还应理解,尽管以下描述使用术语第一、第二等来描述各种元素,但这些元素不应受术语的限制。这些术语只是用于将一元素与另一元素区别分开。例如,在不脱离各种所述示例的范围的情况下,第一位置可以被称为第二位置,并且类似地,第二位置可以被称为第一位置。第一位置和第二位置都可以是位置,并且在某些情况下,可以是单独且不同的位置。
本申请中术语“至少一个”的含义是指一个或多个,本申请中术语“多个”的含义是指两个或两个以上,例如,多个第二运动矢量是指两个或两个以上的第二运动矢量。本文中术语“系统”和“网络”经常可互换使用。
还应理解,术语“如果”可被解释为意指“当...时”(“when”或“upon”)或“响应于确定”或“响应于检测到”。类似地,根据上下文,短语“如果确定...”或“如果检测到[所陈述的条件或事件]”可被解释为意指“在确定...时”或“响应于确定...”或“在检测到[所陈述的条件或事件]时”或“响应于检测到[所陈述的条件或事件]”。
以上描述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。该计算机程序产品包括一个或多个计算机程序指令。在计算机上加载和执行该计算机程序指令时,全部或部分地产生按照本申请实施例中的流程或功能。该计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。
该计算机指令可以存储在计算机可读存储介质中，或者从一个计算机可读存储介质向另一个计算机可读存储介质传输，例如，该计算机程序指令可以从一个网站站点、计算机、服务器或数据中心通过有线或无线方式向另一个网站站点、计算机、服务器或数据中心进行传输。该计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。该可用介质可以是磁性介质(例如软盘、硬盘、磁带)、光介质(例如，数字视频光盘(digital video disc,DVD))、或者半导体介质(例如固态硬盘)等。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,该程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的范围。

Claims (19)

  1. 一种视频编码方法,其特征在于,所述方法包括:
    获取视频帧,以及所述视频帧的参考帧,所述视频帧和所述参考帧来自多视角图像中的同一视角或者不同视角;
    基于第一渲染信息和第二渲染信息确定所述视频帧中图像块的运动矢量;其中,所述第一渲染信息为生成所述视频帧时的渲染信息,所述第二渲染信息为生成所述参考帧时的渲染信息;
    基于所述运动矢量编码所述视频帧。
  2. 根据权利要求1所述的方法,其特征在于,若所述视频帧与所述参考帧来自同一视角,则所述第一渲染信息包括渲染所述视频帧时虚拟相机的第一位置、姿态信息,以及所述视频帧的深度信息;所述第二渲染信息包括渲染所述参考帧时虚拟相机的第二位置、姿态信息,以及所述参考帧的深度信息。
  3. 根据权利要求1所述的方法,其特征在于,若所述视频帧与所述参考帧来自不同视角,则所述第一渲染信息包括渲染所述视频帧时的焦距、虚拟相机的第一位置,以及所述视频帧的深度信息;所述第二渲染信息包括渲染所述参考帧时虚拟相机的第二位置;
    所述基于第一渲染信息和第二渲染信息确定所述视频帧与所述参考帧之间的运动矢量,包括:基于所述焦距、所述第一位置与所述第二位置之间的距离值以及所述深度信息,确定所述运动矢量。
  4. 根据权利要求2所述的方法,其特征在于,所述视频帧和所述参考帧是运行虚拟现实VR应用时生成的,所述视频帧编码后的数据用于在终端上解码显示,所述第一位置、姿态信息,和所述第二位置、姿态信息基于所述终端的位置和姿态确定。
  5. 根据权利要求4所述的方法,其特征在于,所述终端为手机、平板、可穿戴设备或分体式设备,所述分体式设备包括显示设备以及对应的控制设备。
  6. 根据权利要求5所述的方法,其特征在于,所述可穿戴设备为头戴式显示器。
  7. 根据权利要求2所述的方法,其特征在于,所述基于第一渲染信息和第二渲染信息确定所述视频帧与所述参考帧之间的运动矢量,包括:
    基于所述渲染所述视频帧时虚拟相机的第一位置、姿态信息、所述视频帧的深度信息、所述渲染所述参考帧时虚拟相机的第二位置、姿态信息、所述参考帧的深度信息,确定所述视频帧中像素或图像块对应的二维坐标与所述参考帧中像素或图像块对应的二维坐标;
    基于所述视频帧中像素或图像块对应的二维坐标和所述参考帧中像素或图像块对应的二维坐标,确定所述运动矢量。
  8. 根据权利要求7所述的方法,其特征在于,所述基于所述视频帧中像素或图像块对应的二维坐标和所述参考帧中像素或图像块对应的二维坐标,确定所述运动矢量,包括:
    获取所述视频帧中像素或图像块对应的二维坐标和所述参考帧中像素或图像块对应的二维坐标之间的坐标差;
    基于所述坐标差,确定所述运动矢量。
  9. 根据权利要求3所述的方法,其特征在于,所述基于所述焦距、所述第一位置与所述第二位置之间的距离值以及所述深度信息,确定所述运动矢量,包括:
    基于所述焦距、所述第一位置与所述第二位置之间的距离值以及所述深度信息,确定所述视频帧和所述参考帧之间的视差区间;
    基于所述视差区间,确定所述视频帧中图像块对应的目标搜索范围;
    基于所述目标搜索范围,确定所述运动矢量。
  10. 根据权利要求9所述的方法，其特征在于，所述基于所述焦距、所述第一位置与所述第二位置之间的距离值以及所述深度信息，确定所述视频帧和所述参考帧之间的视差区间，包括：
    基于所述深度信息,确定深度区间,所述深度区间用于指示所述视频帧中像素的深度值的取值范围;
    基于所述焦距、所述第一位置与所述第二位置之间的距离值以及所述深度区间,确定所述视频帧和所述参考帧之间的视差区间。
  11. 一种视频编码装置,其特征在于,所述装置包括:
    获取模块,用于获取视频帧,以及所述视频帧的参考帧,所述视频帧和所述参考帧来自多视角图像中的同一视角或者不同视角;
    确定模块,用于基于第一渲染信息和第二渲染信息确定所述视频帧中图像块的运动矢量;其中,所述第一渲染信息为生成所述视频帧时的渲染信息,所述第二渲染信息为生成所述参考帧时的渲染信息;
    编码模块,用于基于所述运动矢量编码所述视频帧。
  12. 根据权利要求11所述的装置,其特征在于,若所述视频帧与所述参考帧来自同一视角,则所述第一渲染信息包括渲染所述视频帧时虚拟相机的第一位置、姿态信息,以及所述视频帧的深度信息;所述第二渲染信息包括渲染所述参考帧时虚拟相机的第二位置、姿态信息,以及所述参考帧的深度信息。
  13. 根据权利要求11所述的装置,其特征在于,若所述视频帧与所述参考帧来自不同视角,则所述第一渲染信息包括渲染所述视频帧时的焦距、虚拟相机的第一位置,以及所述视频帧的深度信息;所述第二渲染信息包括渲染所述参考帧时虚拟相机的第二位置;
    所述确定模块还用于:基于所述焦距、所述第一位置与所述第二位置之间的距离值以及所述深度信息,确定所述运动矢量。
  14. 根据权利要求12所述的装置,其特征在于,所述确定模块包括:
    第一确定单元,用于基于所述渲染所述视频帧时虚拟相机的第一位置、姿态信息、所述视频帧的深度信息、所述渲染所述参考帧时虚拟相机的第二位置、姿态信息、所述参考帧的深度信息,确定所述视频帧中像素或图像块对应的二维坐标与所述参考帧中像素或图像块对应的二维坐标;
    第二确定单元,用于基于所述视频帧中像素或图像块对应的二维坐标和所述参考帧中像素或图像块对应的二维坐标,确定所述运动矢量。
  15. 根据权利要求14所述的装置,其特征在于,所述第二确定单元用于:
    获取所述视频帧中像素或图像块对应的二维坐标和所述参考帧中像素或图像块对应的二维坐标之间的坐标差;
    基于所述坐标差,确定所述运动矢量。
  16. 根据权利要求13所述的装置,其特征在于,所述确定模块还包括:
    第三确定单元,用于基于所述焦距、所述第一位置与所述第二位置之间的距离值以及所述深度信息,确定所述视频帧和所述参考帧之间的视差区间;
    第四确定单元,用于基于所述视差区间,确定所述视频帧中图像块对应的目标搜索范围;
    第五确定单元,用于基于所述目标搜索范围,确定所述运动矢量。
  17. 根据权利要求16所述的装置,其特征在于,所述第三确定单元用于:
    基于所述深度信息,确定深度区间,所述深度区间用于指示所述视频帧中像素的深度值的取值范围;
    基于所述焦距、所述第一位置与所述第二位置之间的距离值以及所述深度区间,确定所述视频帧和所述参考帧之间的视差区间。
  18. 一种视频编码设备,其特征在于,所述视频编码设备包括处理器和存储器,所述存储器用于存储至少一段程序代码,所述至少一段程序代码由所述处理器加载并执行如权利要求1至10任一项所述的视频编码方法。
  19. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质用于存储至少一段程序代码,所述至少一段程序代码用于执行如权利要求1至权利要求10中任一项所述的视频编码方法。
PCT/CN2021/142232 2021-01-14 2021-12-28 视频编码方法、装置、设备及存储介质 WO2022151972A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110048292 2021-01-14
CN202110048292.6 2021-01-14
CN202110320704.7A CN114765689A (zh) 2021-01-14 2021-03-25 视频编码方法、装置、设备及存储介质
CN202110320704.7 2021-03-25

Publications (1)

Publication Number Publication Date
WO2022151972A1 true WO2022151972A1 (zh) 2022-07-21

Family

ID=82365127

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/142232 WO2022151972A1 (zh) 2021-01-14 2021-12-28 视频编码方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN114765689A (zh)
WO (1) WO2022151972A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111213183A (zh) * 2017-10-13 2020-05-29 三星电子株式会社 渲染三维内容的方法和装置
CN110855971A (zh) * 2018-07-31 2020-02-28 英特尔公司 视频处理机制
US20200104976A1 (en) * 2018-09-28 2020-04-02 Apple Inc. Point cloud compression image padding
CN109672886A (zh) * 2019-01-11 2019-04-23 京东方科技集团股份有限公司 一种图像帧预测方法、装置及头显设备
CN111583350A (zh) * 2020-05-29 2020-08-25 联想(北京)有限公司 图像处理方法、装置、系统及服务器

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115174963A (zh) * 2022-09-08 2022-10-11 阿里巴巴(中国)有限公司 视频生成方法、视频帧生成方法、装置及电子设备

Also Published As

Publication number Publication date
CN114765689A (zh) 2022-07-19


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21919159

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21919159

Country of ref document: EP

Kind code of ref document: A1