CN116996661B - Three-dimensional video display method, device, equipment and medium

Three-dimensional video display method, device, equipment and medium

Info

Publication number
CN116996661B
CN116996661B
Authority
CN
China
Prior art keywords: video frame, frame block, client, target, dimensional
Prior art date
Legal status: Active
Application number
CN202311257319.8A
Other languages
Chinese (zh)
Other versions
CN116996661A (en)
Inventor
周鹏远
高宇
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC
Priority to CN202311257319.8A
Publication of CN116996661A
Application granted
Publication of CN116996661B
Legal status: Active
Anticipated expiration


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00: Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/30: Image reproducers
    • H04N 13/366: Image reproducers using viewer tracking
    • H04N 13/368: Image reproducers using viewer tracking for two or more viewers
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00: Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10: Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106: Processing image signals
    • H04N 13/161: Encoding, multiplexing or demultiplexing different image signal components

Abstract

The invention provides a three-dimensional video display method, a device, equipment and a medium, which can be applied to the fields of federated learning, video processing, and video transmission. The method comprises the following steps: predicting a first viewing angle at which a first user views a three-dimensional video frame at a target moment; generating a first video frame block acquisition request according to the first viewing angle; acquiring a second video frame block identification set generated by a second client and related to a second viewing angle; screening a first target identification subset from a first video frame block identification set corresponding to the first video frame block set and the second video frame block identification set according to a decoding allocation strategy; decoding a first video frame block corresponding to a first target sharing identifier in the first video frame block set acquired from a server to obtain a first shared video frame block; and generating a target three-dimensional video display result in the first interactive interface according to the first shared video frame block and the second shared video frame block obtained from the second client. The invention can improve the video display quality.

Description

Three-dimensional video display method, device, equipment and medium
Technical Field
The invention relates to the fields of federated learning, video processing, and video transmission, and in particular to a three-dimensional video display method, apparatus, device, and medium.
Background
With the large-scale commercial rollout of 5G (5th Generation Mobile Communication Technology, abbreviated as 5G), the share of three-dimensional video content products such as volumetric video and holographic video keeps increasing. Point cloud video is an effective, high-quality form of three-dimensional video data: it is composed of points in three-dimensional space, each point associated with several attributes such as coordinates and color, and it has become a popular way to store and transmit three-dimensional video data. Point cloud video can be applied in the VR (Virtual Reality), AR (Augmented Reality), and MR (Mixed Reality) fields, and in application scenarios such as games and live streaming, where users can freely choose the direction in which they view the video, obtaining a better sense of perception and a better viewing experience; it is therefore welcomed by a large number of users. Accordingly, it is important to guarantee the display quality of the video and improve the user's experience of watching it.
Disclosure of Invention
In view of the above problems, the present invention provides a three-dimensional video display method, apparatus, device, and medium.
According to a first aspect of the present invention, there is provided a three-dimensional video display method, applied to a first client, including: predicting a first viewing angle at which a first user views a three-dimensional video frame at a target moment; generating a first video frame block acquisition request according to the first viewing angle, wherein the first video frame block acquisition request is used for acquiring a first video frame block set corresponding to the first viewing angle from a server, and the first video frame block set is obtained from the three-dimensional video frame; acquiring a second video frame block identification set generated by a second client and related to a second viewing angle, wherein the second viewing angle is a predicted viewing angle when a second user views the three-dimensional video frame at the target moment; screening a first target identification subset from a first video frame block identification set corresponding to the first video frame block set and the second video frame block identification set according to a decoding allocation strategy, wherein the first target identification subset comprises at least one first target sharing identification which is contained in an intersection of the first video frame block identification set and the second video frame block identification set; decoding a first video frame block corresponding to the first target sharing identifier in a first video frame block set acquired from the server to obtain a first shared video frame block; and generating a target three-dimensional video display result in the first interactive interface according to the first shared video frame block and the second shared video frame block obtained from the second client.
According to a second aspect of the present invention, there is provided a three-dimensional video display method, applied to a server, including: responding to a first video frame block acquisition request from a first client, and analyzing the first video frame block acquisition request to obtain a first viewing angle and a first coding quality level; predicting a viewing viewport region of the three-dimensional video frame to obtain a first viewing viewport region related to the first client; determining a first video frame block set from the three-dimensional video frame according to the first viewing angle and the first viewing viewport region; and transmitting the first video frame block set to the first client, wherein the first client is adapted to decode a first video frame block corresponding to a first target sharing identifier in the first video frame block set acquired from the server, to obtain a first shared video frame block; and generating a target three-dimensional video display result in the first interactive interface according to the first shared video frame block and the second shared video frame block obtained from the second client.
A third aspect of the present invention provides a three-dimensional video display apparatus applied to a first client, including: the first viewing angle prediction module is used for predicting a first viewing angle at which a first user views the three-dimensional video frame at a target moment; a first request generating module, configured to generate a first video frame block acquisition request according to the first viewing angle, where the first video frame block acquisition request is used to acquire a first video frame block set corresponding to the first viewing angle from a server, where the first video frame block set is obtained from the three-dimensional video frame; the second video frame block identifier set obtaining module is configured to obtain a second video frame block identifier set generated by a second client and related to a second viewing angle, where the second viewing angle is a viewing angle predicted when a second user views the three-dimensional video frame at the target time; a first target identifier subset determining module, configured to screen a first target identifier subset from a first video frame block identifier set corresponding to the first video frame block set and the second video frame block identifier set according to a decoding allocation policy, where the first target identifier subset includes at least one first target sharing identifier, and the first target sharing identifier is included in an intersection of the first video frame block identifier set and the second video frame block identifier set; the first shared video frame block determining module is used for decoding the first video frame block corresponding to the first target sharing identifier in the first video frame block set acquired from the server to obtain a first shared video frame block; and the display module is used for generating a target three-dimensional video display result in the first interactive interface according to the first shared video frame block and the second shared video frame block obtained from the second client.
A fourth aspect of the present invention provides an electronic device comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method described above.
The fifth aspect of the present invention also provides a computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the above method.
A sixth aspect of the invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the above method.
According to the three-dimensional video display method, apparatus, device, and medium provided by the invention, the first target identification subset is screened, according to the decoding allocation strategy, from the first video frame block identification set corresponding to the first video frame block set and from the second video frame block identification set, and the first video frame block corresponding to the first target sharing identification in the first target identification subset is decoded to obtain the first shared video frame block, so that the target three-dimensional video display result in the first interactive interface is obtained according to the first shared video frame block and the second shared video frame block obtained from the second client. This reduces the time the first client spends decoding the second shared video frame block, avoids the technical problem of the large network bandwidth the first client would occupy by obtaining the complete three-dimensional video frame, and avoids the technical problem of the redundant storage the complete three-dimensional video frame would occupy at the first client.
Drawings
The foregoing and other objects, features and advantages of the invention will be apparent from the following description of embodiments of the invention with reference to the accompanying drawings, in which:
fig. 1 shows an application scene diagram of a three-dimensional video display method and apparatus according to an embodiment of the present invention;
FIG. 2 shows a flow chart of a three-dimensional video presentation method according to an embodiment of the invention;
FIG. 3 illustrates an application scenario diagram of a three-dimensional video presentation method according to an embodiment of the present invention;
FIG. 4 shows a flow chart of a three-dimensional video presentation method according to another embodiment of the present invention;
FIG. 5 shows a schematic diagram of a first coding quality prediction model according to an embodiment of the present invention;
FIG. 6 illustrates an application scenario diagram of a three-dimensional video presentation method according to an embodiment of the present invention;
FIG. 7 shows a block diagram of a three-dimensional video presentation device according to an embodiment of the present invention;
fig. 8 shows a block diagram of an electronic device adapted to implement a three-dimensional video presentation method according to an embodiment of the invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the invention. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where expressions like "at least one of A, B and C" are used, they should generally be interpreted according to the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
In the technical scheme of the invention, the related processes of collecting, storing, using, processing, transmitting, providing, disclosing, applying and the like of the personal information of the user all accord with the regulations of related laws and regulations, necessary security measures are adopted, and the public order harmony is not violated.
In the technical scheme of the invention, the processes of data acquisition, collection, storage, use, processing, transmission, provision, disclosure, application and the like all conform to the regulations of related laws and regulations, necessary security measures are adopted, and the public order harmony is not violated.
The inventors found that in the process of transmitting point cloud video, the point cloud video needs to be encoded and decoded efficiently to meet network transmission bandwidth requirements. The bandwidth required for point cloud video at, e.g., 30 frames per second can be as high as 6 Gbps, which creates a significant challenge for point cloud video transmission.
The embodiment of the invention provides a three-dimensional video display method, a device, equipment and a medium, wherein the three-dimensional video display method can be applied to a first client and comprises the following steps: predicting a first viewing angle at which a first user views a three-dimensional video frame at a target moment; generating a first video frame block acquisition request according to a first viewing angle, wherein the first video frame block acquisition request is used for acquiring a first video frame block set corresponding to the first viewing angle from a server, and the first video frame block set is obtained from a three-dimensional video frame; acquiring a second video frame block identification set generated by a second client and related to a second viewing angle, wherein the second viewing angle is a predicted viewing angle when a second user views a three-dimensional video frame at a target moment; screening a first target identification subset from a first video frame block identification set and a second video frame block identification set corresponding to the first video frame block set according to a decoding allocation strategy, wherein the first target identification subset comprises at least one first target sharing identification which is contained in an intersection of the first video frame block identification set and the second video frame block identification set; decoding a first video frame block corresponding to a first target sharing identifier in a first video frame block set acquired from a server to obtain a first sharing video frame block; and generating a target three-dimensional video display result in the first interactive interface according to the first shared video frame block and the second shared video frame block obtained from the second client.
According to the embodiment of the invention, the first target identification subset is screened from the first video frame block identification set corresponding to the first video frame block set and the second video frame block identification set according to the decoding allocation strategy, and the first video frame block corresponding to the first target sharing identification in the first target identification subset is decoded to obtain the first shared video frame block, so that the target three-dimensional video display result in the first interactive interface is obtained according to the first shared video frame block and the second shared video frame block obtained from the second client. This reduces the time the first client spends decoding the second shared video frame block, avoids the excessive network bandwidth that would be occupied if the first client obtained the complete three-dimensional video frame, and avoids the redundant storage the complete three-dimensional video frame would occupy at the first client, so that the display quality of the three-dimensional video is improved under conditions of limited network bandwidth or limited storage resources, further improving the user's viewing experience.
Fig. 1 shows an application scene diagram of a three-dimensional video display method and device according to an embodiment of the invention.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the first terminal device 101, the second terminal device 102, or the third terminal device 103, to receive or send messages, etc. Various client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software (by way of example only), may be installed on the first terminal device 101, the second terminal device 102, and the third terminal device 103.
The first terminal device 101, the second terminal device 102, and the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop computers, desktop computers, smart wearable devices (e.g., VR glasses), and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by the user using the first terminal device 101, the second terminal device 102, and the third terminal device 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the three-dimensional video display method provided by the embodiment of the present invention may be generally executed by any one or more of the first terminal device 101, the second terminal device 102, and the third terminal device 103. Accordingly, the three-dimensional video display apparatus provided by the embodiment of the present invention may be generally disposed in any one or more of the first terminal device 101, the second terminal device 102, and the third terminal device 103.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The three-dimensional video display method of the disclosed embodiment will be described in detail with reference to fig. 2 to 6 based on the scenario described in fig. 1.
Fig. 2 shows a flow chart of a three-dimensional video presentation method according to an embodiment of the invention.
As shown in fig. 2, the three-dimensional video display method of this embodiment includes operations S210 to S260.
In operation S210, a first viewing angle at which a first user views a three-dimensional video frame at a target time is predicted.
According to an embodiment of the present invention, the first user may view the three-dimensional video through the first client, and the three-dimensional video frame may be any one or more video frames in the three-dimensional video. The user views the three-dimensional video frame at the first viewing angle, and can view the video frame content corresponding to the first viewing angle in the three-dimensional video frame.
In operation S220, a first video frame block acquisition request is generated according to the first viewing angle, the first video frame block acquisition request being used to acquire a first video frame block set corresponding to the first viewing angle from the server, the first video frame block set being obtained from the three-dimensional video frame.
According to the embodiment of the invention, the server side can be one end storing the three-dimensional video frame, the server side can obtain the first viewing angle by analyzing the first video frame block acquisition request, and the three-dimensional video frame is segmented according to the first viewing angle to obtain the first video frame block (also called tile), so that the first video frame block set can be obtained according to the first video frame block.
It should be noted that, the server may send the first video frame block set to the first client, and because the first video frame block set is obtained by dividing the three-dimensional video frame, the network bandwidth occupied by transmitting the first video frame block set is smaller than the network bandwidth occupied by transmitting the three-dimensional video frame.
In operation S230, a second video frame block identification set generated by the second client and related to a second viewing angle, the second viewing angle being a viewing angle predicted when the second user views the three-dimensional video frame at the target time, is acquired.
According to an embodiment of the present invention, the second video frame block identifier in the second video frame block identifier set may correspond to a second video frame block, and the second video frame block may be acquired by the second client from the server based on the same or corresponding manner as the first client.
In operation S240, a first target identifier subset is screened from a first video frame block identifier set and a second video frame block identifier set corresponding to the first video frame block set according to a decoding allocation policy, where the first target identifier subset includes at least one first target sharing identifier, and the first target sharing identifier is included in an intersection of the first video frame block identifier set and the second video frame block identifier set.
In operation S250, a first video frame block corresponding to a first target sharing identifier in a first video frame block set acquired from a server is decoded to obtain a first shared video frame block.
According to an embodiment of the present invention, the first target identifier subset may instruct the first client which first target video frame blocks in the first video frame block set to decode, and the first target video frame block corresponding to the first target sharing identifier may be a video frame block whose decoded form is needed by both the first client and the second client. The first shared video frame block may be the decoded video frame block.
In operation S260, a target three-dimensional video presentation result is generated in the first interactive interface according to the first shared video frame block and the second shared video frame block obtained from the second client.
According to the embodiment of the invention, the second shared video frame block may likewise be a video frame block whose decoded form is needed by both the first client and the second client; by obtaining the second shared video frame block from the second client, the first client can avoid decoding it itself, reducing the first client's decoding computation cost and improving the overall efficiency of three-dimensional video display.
According to the embodiment of the invention, the first target identification subset is screened from the first video frame block identification set corresponding to the first video frame block set and the second video frame block identification set according to the decoding allocation strategy, and the first video frame block corresponding to the first target sharing identification in the first target identification subset is decoded to obtain the first shared video frame block, so that the target three-dimensional video display result in the first interactive interface is obtained according to the first shared video frame block and the second shared video frame block obtained from the second client. This reduces the time the first client spends decoding the second shared video frame block, avoids the excessive network bandwidth that would be occupied if the first client obtained the complete three-dimensional video frame, and avoids the redundant storage the complete three-dimensional video frame would occupy at the first client, so that the display quality of the three-dimensional video is improved under conditions of limited network bandwidth or limited storage resources, further improving the user's viewing experience.
According to the embodiment of the invention, the second clients can comprise a plurality of second clients, every two second clients in the plurality of second clients are in communication connection, each second client in the plurality of second clients is respectively in communication connection with the first client, and each second client can also apply the three-dimensional video display method provided by the embodiment of the invention, so that the three-dimensional video display effect of the first client and each second client is improved as a whole.
According to an embodiment of the present invention, the three-dimensional video display method may further include: and sending the first shared video frame block to the second client.
Fig. 3 shows an application scenario diagram of a three-dimensional video presentation method according to an embodiment of the present invention.
As shown in fig. 3, the application scene 300 may include a three-dimensional video frame 310, and the three-dimensional video frame 310 may be divided into a plurality of tiles according to a preset tile size. The first viewing angle at which the first user views the three-dimensional video frame 310 through the first client at the target time is B311, and the second viewing angle at which the second user views the three-dimensional video frame 310 through the second client at the target time is B312.
According to the first viewing angle B311, the first video frame block set may be determined to include, as first video frame blocks, the video frame block 311, the video frame block 312, and the video frame block 313; according to the second viewing angle B312, the second video frame block set may be determined to include, as second video frame blocks, the video frame block 312, the video frame block 313, and the video frame block 314. The first client may obtain the first video frame block set from the server, and the second client may obtain the second video frame block set from the server. Based on a preset decoding allocation policy, the first target video frame blocks corresponding to the first target identifier subset may be determined to be the video frame block 311 and the video frame block 312, where the video frame block corresponding to the first target sharing identifier is the video frame block 312 (a block in the intersection of the two sets). The second target video frame blocks corresponding to the second target identifier subset are the video frame block 313 and the video frame block 314, where the video frame block corresponding to the second target sharing identifier is the video frame block 313.
The first client may decode the video frame block 311 and the video frame block 312 to obtain two decoded video frame blocks, and send the decoded first shared video frame block (the decoded video frame block 312) to the second client. The second client may decode the video frame block 313 and the video frame block 314 to obtain two decoded video frame blocks, and send the decoded second shared video frame block (the decoded video frame block 313) to the first client.
Accordingly, the first client may render, in the first interactive interface, a video frame corresponding to the first viewing perspective of the first user based on the decoded video frame blocks corresponding to each of the video frame blocks 311, 312, and 313, thereby generating a target three-dimensional video presentation result. Accordingly, the second client may render, in the second interactive interface, a video frame corresponding to the second viewing angle of the second user based on the decoded video frame blocks corresponding to each of the video frame blocks 312, 313 and 314, thereby generating a target three-dimensional video presentation result corresponding to the second client.
According to the embodiment of the invention, cooperatively decoding the three-dimensional video frame blocks between the first client and the second client based on the preset decoding allocation strategy enables the computing resources of multiple clients to be pooled, saving each client's computing resources and improving the display efficiency of the three-dimensional video.
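As an illustration of this cooperative allocation, the following sketch splits the tiles in the intersection of the two clients' requested identifier sets; the even split rule and function name are assumptions, since the description only requires that shared identifiers come from the intersection:

```python
def allocate_shared_decoding(first_ids, second_ids):
    """Split decoding work: shared tiles are decoded once, then exchanged."""
    shared = sorted(first_ids & second_ids)        # tiles both clients need
    first_share = set(shared[: len(shared) // 2])  # decoded by the first client
    second_share = set(shared) - first_share       # decoded by the second client
    # Tiles outside the intersection are always decoded locally.
    return ((first_ids - second_ids) | first_share,
            (second_ids - first_ids) | second_share)

# With the Fig. 3 sets {311, 312, 313} and {312, 313, 314}, the shared tiles
# {312, 313} are split: the first client decodes {311, 312}, the second {313, 314}.
print(allocate_shared_decoding({311, 312, 313}, {312, 313, 314}))
```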
According to the embodiment of the invention, the server side can cut and downsample the three-dimensional video frame, so that the user can smoothly switch viewing angles within the viewport corresponding to each viewing angle while video frames of higher quality are generated; to this end, the three-dimensional video can be divided into a plurality of tiles. When video frames are transmitted, only the tiles corresponding to the user's viewport need to be transmitted, so that bandwidth is used more effectively. The server side can first find, in three-dimensional space, a fitted cuboid of a video object in the three-dimensional video, use it as the cuboid for dividing the three-dimensional video frame, and determine its position and three-dimensional spatial parameters. Then, for a three-dimensional video object (for example, a point cloud object), based on the size of the fitted cuboid, the object is divided into N×M partitions on the plane perpendicular to the height and into H partition layers along the height direction, yielding N×M×H tiles, i.e., N×M×H video frame blocks. The server side can then uniformly downsample each of the N×M×H tiles according to the preset quality levels L to obtain N×M×H downsampled video frame blocks. A first video frame block set may be determined from the N×M×H downsampled video frame blocks according to the first viewing angle.
It should be noted that the first video frame blocks in the first video frame block set may have corresponding quality levels; in the case that the three-dimensional video frame is a point cloud video frame, the higher the quality level, the more point cloud data the corresponding first video frame block contains and the clearer the display effect of the corresponding video frame block.
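A minimal sketch of the N×M×H tiling described above, assuming an axis-aligned fitted cuboid and illustrative grid sizes:

```python
import numpy as np

def tile_index(points, origin, size, N=4, M=4, H=2):
    """Map each (x, y, z) point of a cloud to its tile in the N x M x H grid.

    origin/size describe the fitted bounding cuboid; both are length-3.
    """
    rel = (np.asarray(points) - np.asarray(origin)) / np.asarray(size)
    rel = np.clip(rel, 0.0, 1.0 - 1e-9)   # keep boundary points inside the cuboid
    n = (rel[:, 0] * N).astype(int)       # N x M partitions in the ground plane
    m = (rel[:, 1] * M).astype(int)
    h = (rel[:, 2] * H).astype(int)       # H layers along the height
    return n, m, h
```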
According to the embodiment of the invention, the server side can also encode the video frame blocks. For example, compression encoding can be performed on a three-dimensional video frame of quality level $l$ corresponding to the target time $t$, where the three-dimensional video frame can include $N \times M \times H$ video frame blocks. The parameters of an encoded video frame block can be expressed as the tuple $(S_{n,m,h,l,t}, q_{n,m,h,l,t}, d_{n,m,h,l,t})$: the data size of the encoded video frame block may be expressed as $S_{n,m,h,l,t}$, the point-to-point peak signal-to-noise ratio (PSNR) as $q_{n,m,h,l,t}$, and the computational resources required to decode the video frame block as $d_{n,m,h,l,t}$.
It should be noted that at higher quality levels a video frame block has a larger data size and peak signal-to-noise ratio, and a video frame block of higher quality level consumes more decoding computing resources. In addition, the server may also retain each uncompressed video frame block. For example, for a video frame block of quality level $l$, the data size in the uncompressed case may be expressed as $S'_{n,m,h,l,t}$ and the point-to-point peak signal-to-noise ratio as $q'_{n,m,h,l,t}$; uncompressed video frame blocks reduce the computational resources required for decoding. In order to avoid the negative effects of lossy compression on depth data in video frame blocks, the embodiments of the present invention may process video frame blocks with lossless encoding, i.e., the video frame blocks are not compressed during the encoding process, thereby reducing the computational resources the client consumes to decode video frame blocks.
According to an embodiment of the present invention, the server may also generate attribute parameters related to the video frame blocks; for example, it may generate a media presentation description file (MPD, Media Presentation Description) for the point cloud video. The MPD may include the tile type, period information, tile data size, peak signal-to-noise ratio, computational resources required for decoding, and so forth. When a client that needs to play the point cloud video (such as the first client or a second client) sends a video playing request, the attribute parameters contained in the media presentation description file are sent to the corresponding client.
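The per-tile attribute parameters listed above can be mirrored in a simple record. A minimal sketch, assuming illustrative field names (the description does not specify a schema):

```python
from dataclasses import dataclass

@dataclass
class TileDescriptor:
    n: int                 # tile position in the N x M x H grid
    m: int
    h: int
    level: int             # coding quality level l
    size_bytes: int        # data size S[n,m,h,l,t]
    psnr_db: float         # point-to-point PSNR q[n,m,h,l,t]
    decode_cost: float     # computational resources d[n,m,h,l,t] to decode
```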
According to an embodiment of the present invention, predicting a first viewing perspective at which a first user views a three-dimensional video frame at a target time may include: determining a historical three-dimensional video segment which is watched from the three-dimensional video containing the three-dimensional video frames according to the target moment; inputting the historical three-dimensional video clips into a first viewing angle prediction model, and outputting a first prediction result; and obtaining a first viewing angle according to the first prediction result.
According to an embodiment of the present invention, the historical three-dimensional video clip may be a plurality of historical three-dimensional video frames that were played at the first client before the target time, for example, the historical three-dimensional video clip may be a historical three-dimensional video frame in the video that precedes the three-dimensional video frame corresponding to the target time.
According to the embodiment of the invention, the first viewing angle prediction model can be constructed based on a neural network algorithm, for example, a recurrent neural network or a long short-term memory (LSTM) network. The first viewing angle prediction model can analyze the video content of the historical three-dimensional video clips and output a first prediction result indicating the first viewing angle, so that the first viewing angle at the target moment can be predicted and the prediction accuracy of the viewing angle improved.
According to an embodiment of the present invention, the first viewing angle may be represented based on the degrees of freedom of a three-dimensional object in the three-dimensional video frame, for example, based on the positional degrees of freedom of the point cloud object in the point cloud video frame (represented by the point cloud position coordinates (x, y, z)) and the directional degrees of freedom (represented by the direction coordinates (pitch angle, yaw angle, roll angle)).
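Combining the content-based predictor described above with this 6-DoF viewpoint representation, a minimal sketch of such a model, assuming an LSTM over per-frame features of the already-watched clip (the class name and layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class ViewAngleLSTM(nn.Module):
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 6)     # (x, y, z, pitch, yaw, roll)

    def forward(self, clip_feats):           # (batch, frames, feat_dim)
        out, _ = self.lstm(clip_feats)
        return self.head(out[:, -1])         # 6-DoF viewpoint at the target time
```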
According to an embodiment of the present invention, predicting the first viewing angle at which the first user views the three-dimensional video frame at the target time may further include: the collected first historical view angle information of the first user in the historical view time period is input into a second view angle prediction model, and a second prediction result is output; and obtaining the first viewing angle according to the second prediction result.
According to an embodiment of the present invention, the first historical viewing angle information may include any historical viewing time of the first user during viewing of the historical three-dimensional video clip, a viewing angle of the first user, a time-series track of the viewing angle, a historical eye movement track of the first user corresponding to the plurality of historical viewing times, and so on.
According to an embodiment of the present invention, the second viewing angle prediction model may also be constructed based on a neural network algorithm. The second viewing angle prediction model may output a second prediction result indicating the first viewing angle, so that the first viewing angle at the predicted target time may be achieved to improve the prediction accuracy of the viewing angle.
According to the embodiment of the invention, the first prediction result can be used as a corresponding prediction result of the first viewing angle.
According to an embodiment of the present invention, the second prediction result may also be taken as a corresponding prediction result of the first viewing angle.
According to an embodiment of the present invention, the first viewing angle may also be determined in combination with the first prediction result and the second prediction result, for example, based on respective confidence levels of the first prediction result and the second prediction result.
According to an embodiment of the present invention, the first client may further predict the network bandwidth available for transmitting the first video frame block set based on a machine learning algorithm; for example, it may predict the network bandwidth based on an exponentially weighted moving average (EWMA) model, weighting the network bandwidths measured at a plurality of historical moments in the historical time period to obtain the network bandwidth predicted for the target moment. Larger weight parameters can be set for historical moments closer to the target moment and smaller weight parameters for historical moments farther from it, improving the prediction accuracy of the network bandwidth.
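A minimal sketch of the EWMA predictor just described, assuming a smoothing factor chosen so that more recent samples receive more weight (the value 0.6 is illustrative):

```python
def predict_bandwidth(history_bps, alpha=0.6):
    """EWMA over bandwidth samples ordered oldest to newest (bits per second)."""
    estimate = history_bps[0]
    for sample in history_bps[1:]:
        # alpha weights the newer sample; older samples decay geometrically
        estimate = alpha * sample + (1.0 - alpha) * estimate
    return estimate

# e.g. predict_bandwidth([5.2e9, 5.8e9, 6.1e9]) weights the 6.1 Gbps sample most.
```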
Fig. 4 shows a flow chart of a three-dimensional video presentation method according to another embodiment of the present invention.
As shown in fig. 4, the three-dimensional video display method may further include operations S410 to S420.
In operation S410, video presentation environment information related to a three-dimensional video is acquired.
In operation S420, video presentation environment information is input to the initial first coding quality prediction model, and a first coding quality level corresponding to the three-dimensional video frame is output.
According to an embodiment of the present invention, generating the first video frame block acquisition request according to the first viewing angle may include: a first video frame block acquisition request is generated based on the first viewing perspective and the first encoding quality level.
According to the embodiment of the present invention, the first coding quality prediction model may be constructed based on any type of deep learning algorithm, for example, but not limited to, the first coding quality prediction model may be constructed based on an attention network algorithm, and the first coding quality prediction model may be constructed based on other types of deep learning algorithms, which is not limited in the embodiment of the present invention.
According to an embodiment of the present invention, the first video frame block acquisition request may include a first viewing angle and a first coding quality level, so that the server may segment and extract a three-dimensional video frame corresponding to the target time according to the first viewing angle, and perform coding processing on an initial first video frame block set obtained by extraction based on the first coding quality level (for example, downsampling the first video frame block based on the first coding quality level l) to obtain a coded first video frame block set.
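For illustration, a hypothetical construction of such a request; the endpoint path, field names, and JSON encoding are assumptions, since the description only specifies that the request carries the first viewing angle and the first coding quality level:

```python
import json
from urllib import request as urlreq

def build_frame_block_request(server_url, viewpoint, quality_level, t):
    payload = json.dumps({
        "viewpoint": viewpoint,          # predicted (x, y, z, pitch, yaw, roll)
        "quality_level": quality_level,  # first coding quality level l
        "time": t,                       # target moment
    }).encode()
    return urlreq.Request(server_url + "/tiles", data=payload,
                          headers={"Content-Type": "application/json"})
```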
According to an embodiment of the present invention, the video presentation environment information includes at least one of: historical three-dimensional video frame blocks, where the three-dimensional video comprises historical three-dimensional video frames ordered before the three-dimensional video frame and the historical three-dimensional video frame blocks are obtained from those historical three-dimensional video frames; the transmission duration of the three-dimensional video frame blocks; the number of frames of the remaining three-dimensional video frames in the three-dimensional video, the remaining three-dimensional video frames including video frames in the three-dimensional video that have not yet been transmitted to the first client; and the buffer capacity of the first client at the current time.
According to an embodiment of the present invention, the first coding quality prediction model may be constructed based on a reinforcement learning algorithm. The video presentation environment information corresponding to the target time $t$ may be represented as $S_t = (l_{t-1}, b_t, B_t, d_t, m_t, n_t)$, where $l_{t-1}$ represents the historical encoding quality level of the adjacent historical video frame, i.e., the historical video frame of the three-dimensional video ordered immediately before the three-dimensional video frame corresponding to the target time; $b_t$ represents the buffer capacity of the first client at the current time; $B_t$ represents the predicted network bandwidth (in bps, bits per second); $d_t$ represents the download (transmission) duration of the adjacent historical video frame; $m_t$ represents the data size of the first video frame block set at coding quality level $l$ and, when $l$ ranges over multiple levels, $m_t$ can be represented as a vector; and $n_t$ represents the number of frames of the remaining three-dimensional video frames in the three-dimensional video.

For the target time $t$ (for example, the current time), the video presentation environment information $S_t = (l_{t-1}, b_t, B_t, d_t, m_t, n_t)$ is input to the first coding quality prediction model, whose behavior (Actor) network has a selection policy $\pi_\theta(S_t, a_t)$ representing the probability of taking action $a_t$ given the input video presentation environment information, where $\theta$ represents the network parameters of the Actor network. The first coding quality prediction model constructed based on the reinforcement learning algorithm can output probabilities corresponding to a plurality of quality levels and select the quality level with the highest probability (the predicted action) as the first coding quality level of the first video frame blocks corresponding to the target time.
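A minimal sketch of such an Actor network, assuming a small fully connected network over the flattened state $S_t$ (the class name and layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class QualityActor(nn.Module):
    def __init__(self, state_dim, num_levels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_levels),
        )

    def forward(self, state):                        # state: flattened S_t
        return torch.softmax(self.net(state), dim=-1)

# probs = actor(s_t); chosen quality level = int(probs.argmax())
```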
According to an embodiment of the invention, the second client is provided with a second coding quality prediction component comprising a second coding quality prediction model.
According to the embodiment of the invention, the second coding quality prediction component can be constructed based on a second coding quality prediction model, and the second coding quality prediction model can be constructed at the second client based on the same or corresponding algorithm of the first coding quality prediction model.
According to an embodiment of the present invention, the three-dimensional video display method may further include: acquiring sample video display environment information and a sample label; and training a first coding quality prediction model related to the first client according to the sample video display environment information and the sample label to obtain a trained first coding quality prediction model.
According to the embodiment of the invention, the training method can be executed at the first client, and the first coding quality prediction model is obtained based on the training of the supervised training mode.
For example, the reward of the reinforcement learning algorithm (i.e., the initial first coding quality prediction model) may be defined as $r_t = \mathrm{QoE}$, and the optimization objective may be defined as maximizing the expected cumulative discounted reward under the sample video presentation environment information $S_t$, thereby optimizing the quality of user experience (QoE). The gradient of the cumulative discounted reward with respect to the Actor network parameters $\theta$ of the initial first coding quality prediction model may be represented by the following equation (1):

$$\nabla_\theta \,\mathbb{E}\Big[\textstyle\sum_t \gamma^t r_t\Big] = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(S_t, a_t)\, A^{\pi_\theta}(S_t, a_t)\big] \qquad (1)$$

where $\gamma$ is the discount factor and $A^{\pi_\theta}(S_t, a_t)$ is the advantage function under policy $\pi_\theta$.
The Critic network (evaluation network) of the initial first coding quality prediction model estimates the value function of the initial first coding quality prediction model and may be expressed as $V^{\pi_\theta}(S_t)$. The method provided by the embodiment of the invention can train the initial first coding quality prediction model in a temporal-difference manner, in which case the gradient may be expressed by equation (2):

$$\nabla_\theta \,\mathbb{E}\Big[\textstyle\sum_t \gamma^t r_t\Big] \approx \sum_t \nabla_\theta \log \pi_\theta(S_t, a_t)\,\big(r_t + \gamma V^{\pi_\theta}(S_{t+1}) - V^{\pi_\theta}(S_t)\big) \qquad (2)$$

In equation (2), $r_t = \mathrm{QoE}$ indicates that the objective is the maximized expectation of the quality of user experience under the sample video presentation environment information $S_t$.
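A minimal sketch of one temporal-difference actor-critic update consistent with equations (1) and (2), assuming PyTorch and a single transition per update (function and variable names are illustrative):

```python
import torch

def a2c_step(actor, critic, opt, s_t, a_t, r_t, s_next, gamma=0.99):
    """One TD actor-critic update; opt covers both actor and critic parameters."""
    v_t, v_next = critic(s_t), critic(s_next).detach()
    td_error = r_t + gamma * v_next - v_t            # advantage estimate
    log_prob = torch.log(actor(s_t)[a_t] + 1e-8)
    actor_loss = -log_prob * td_error.detach()       # policy-gradient term, eq. (2)
    critic_loss = td_error.pow(2)                    # value regression for V
    opt.zero_grad()
    (actor_loss + critic_loss).sum().backward()
    opt.step()
```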
Fig. 5 shows a schematic diagram of a first coding quality prediction model according to an embodiment of the present invention.
As shown in fig. 5, the first coding quality prediction model may be constructed based on a reinforcement learning algorithm and may include a behavior network and an evaluation network. Both the behavior network and the evaluation network can include a convolution layer used to extract feature information from the video presentation environment information; for example, image features can be extracted from the historical three-dimensional video frames and used as input to the subsequent reinforcement learning stages. The trained first coding quality prediction model may output the first coding quality level.
According to the embodiment of the invention, the sample video presentation environment information and the sample tag are also suitable for training an initial second coding quality prediction model related to the second client, so as to obtain a trained second coding quality prediction model.
According to an embodiment of the present invention, training an initial first coding quality prediction model and an initial second coding quality prediction model based on sample video presentation environment information and sample tags, respectively, includes the following operations.
Training, in stage 1, the initial first coding quality prediction model according to the sample video presentation environment information and the sample label to obtain a stage-1 intermediate first coding quality prediction model and a stage-1 intermediate first gradient, where the stage-1 training represents executing a preset number of training iterations for stage 1;

processing the stage-1 intermediate first gradient and the stage-1 intermediate second gradient output by the stage-1 intermediate second coding quality prediction model of the second client according to a preset gradient aggregation formula to obtain a stage-1 intermediate aggregation gradient;

adjusting first model parameters of the stage-1 intermediate first coding quality prediction model according to the stage-1 intermediate aggregation gradient to obtain a stage-1 initial first coding quality prediction model, where the stage-1 intermediate aggregation gradient is also suitable for adjusting second model parameters of the stage-1 intermediate second coding quality prediction model to obtain a stage-1 initial second coding quality prediction model;

performing, according to the sample video presentation environment information and the sample label, stage-p training on the stage-(p-1) initial first coding quality prediction model to obtain a stage-p intermediate first coding quality prediction model and a stage-p intermediate first gradient, where the stage-p training represents executing a preset number of training iterations for stage p and p is an integer greater than 1;

processing the stage-p intermediate first gradient and the stage-p intermediate second gradient output by the stage-p intermediate second coding quality prediction model of the second client according to the preset gradient aggregation formula to obtain a stage-p intermediate aggregation gradient;

adjusting the first model parameters of the stage-p intermediate first coding quality prediction model according to the stage-p intermediate aggregation gradient to obtain a stage-p initial first coding quality prediction model, where the stage-p intermediate aggregation gradient is also suitable for adjusting the second model parameters of the stage-p intermediate second coding quality prediction model to obtain a stage-p initial second coding quality prediction model;

and, when the first loss function corresponding to the initial first coding quality prediction model and the second loss function corresponding to the initial second coding quality prediction model converge, determining the stage-P initial first coding quality prediction model as the trained first coding quality prediction model, where P ≥ p.
According to the embodiment of the invention, under the condition of communication connection between the first client and the second client, the model parameters of the initial first coding quality prediction model and the model parameters of the initial second coding quality prediction model can be updated based on federal aggregation, and the trained first coding quality prediction model and the trained second coding quality prediction model are obtained at the first client and the second client respectively.
For example, an initial first coding quality prediction model and an initial second coding quality prediction model may be set at a first client and a second client, and based on sample video display environment information and sample labels, stage 1 training may be performed at the first client and the second client, respectively, for example, training may be iterated 10 times to obtain a stage 1 intermediate first coding quality prediction model and a stage 1 intermediate second coding quality prediction model, and at the same time, a gradient corresponding to the stage 1 intermediate first coding quality prediction model obtained at the end of stage 1 training may be used as a stage 1 intermediate first gradient, and a gradient corresponding to the stage 1 intermediate second coding quality prediction model obtained at the end of stage 1 training may be used as a stage 1 intermediate second gradient.
For example, in the case of p = 2, the stage-1 intermediate first gradient and the stage-1 intermediate second gradient may be sent to the aggregation server, and the aggregation server may obtain the stage-1 intermediate aggregation gradient based on formula (3):

$$\bar{g} = \sum_{k=1}^{K} w_k\, g_k \qquad (3)$$

In formula (3), $\bar{g}$ represents the intermediate aggregation gradient; at the end of the stage-1 training, $\bar{g}$ represents the stage-1 intermediate aggregation gradient. $g_k$ represents the gradient of the intermediate coding quality prediction model corresponding to the k-th client; for example, it may correspond to the intermediate first coding quality prediction model or the intermediate second coding quality prediction model. $w_k$ represents the weight parameter corresponding to the k-th client, and $K$ represents the number of clients.
For another example, in the case of p = 2, the aggregation server may send the stage-1 intermediate aggregation gradient to the first client and the second client; the first client and the second client may each adjust, according to the stage-1 intermediate aggregation gradient, the first model parameters of the stage-1 intermediate first coding quality prediction model and the second model parameters of the stage-1 intermediate second coding quality prediction model, obtaining the stage-1 initial first coding quality prediction model and the stage-1 initial second coding quality prediction model. On this basis, stage-2 training may be performed at the first client and the second client (for example, 10 model parameter iterations), and the stage-2 intermediate first gradient and stage-2 intermediate second gradient obtained after the stage-2 training is completed are input into the above formula (3) to obtain the stage-2 intermediate aggregation gradient.
If, during training, the first loss function corresponding to an initial first coding quality prediction model and the second loss function corresponding to the initial second coding quality prediction model both converge, the stage-P initial first coding quality prediction model is determined as the trained first coding quality prediction model, where P is an integer greater than or equal to p.
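To make the staged federated aggregation concrete, the following is a minimal sketch of the training loop under formula (3). It is an illustrative simplification, not the patented implementation: the linear model, learning rates, equal weights w_k = 0.5, and the loss-change convergence test are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class Client:
    """Holds a private dataset and an illustrative linear coding-quality model."""
    def __init__(self, n_samples, n_features):
        self.X = rng.normal(size=(n_samples, n_features))
        self.y = self.X @ np.ones(n_features) + 0.1 * rng.normal(size=n_samples)
        self.theta = np.zeros(n_features)

    def _grad(self):
        err = self.X @ self.theta - self.y
        return 2 * self.X.T @ err / len(self.y)

    def local_train_stage(self, iters, lr=0.01):
        # Stage training: run the preset number of local iterations,
        # then report the gradient observed at the end of the stage.
        for _ in range(iters):
            self.theta -= lr * self._grad()
        return self._grad()

    def apply_gradient(self, agg_grad, lr=0.01):
        # Adjust local model parameters with the shared aggregation gradient.
        self.theta -= lr * agg_grad

    def loss(self):
        return float(np.mean((self.X @ self.theta - self.y) ** 2))

def aggregate(grads, weights):
    # Formula (3): weighted combination of the per-client stage gradients.
    return sum(w * g for w, g in zip(weights, grads))

clients = [Client(200, 4), Client(150, 4)]    # first client and second client
weights = [0.5, 0.5]                          # w_k, summing to 1
prev = [c.loss() for c in clients]
for stage in range(1, 51):                    # stages p = 1, 2, ...
    grads = [c.local_train_stage(iters=10) for c in clients]
    agg = aggregate(grads, weights)           # stage-p intermediate aggregation gradient
    for c in clients:
        c.apply_gradient(agg)
    cur = [c.loss() for c in clients]
    if all(abs(a - b) < 1e-6 for a, b in zip(prev, cur)):
        break                                 # both loss functions converged
    prev = cur
```

Note that only gradients leave the clients; the sample data stay local, which is what lets the distributed datasets improve the model without revealing user privacy.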
According to the embodiment of the invention, having the clients cooperatively train coding quality prediction models under a federated learning mechanism effectively improves the prediction accuracy of the coding quality level for the target moment, avoids redundant occupation of network bandwidth, and improves the overall playback quality of the three-dimensional video.
According to the embodiment of the invention, each of the plurality of clients can request from the server, based on its coding quality prediction model, the corresponding first video frame block set at the first coding quality level, sending a request containing the first coding quality level and the first viewing angle through the HTTP interface; the server can divide the video frame corresponding to the target moment according to the request and encode the divided video frame blocks at the first coding quality level. Decoding tasks can then be cooperatively distributed among the clients to reduce the computational cost of decoding the encoded video frame blocks, avoid stalling caused by decoding the video frame blocks, and improve the overall playback efficiency of the three-dimensional video.
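As an illustration of this request flow, a sketch follows. The endpoint path `/tiles`, the JSON field names, and the use of the `requests` library are assumptions, since the patent only specifies that an HTTP interface carries the first coding quality level and the first viewing angle.

```python
import requests  # assumed HTTP client; the patent only specifies an HTTP interface

def fetch_tiles(server_url, viewing_angle, quality_level, tile_ids):
    # The client sends the predicted viewing angle, the predicted first coding
    # quality level, and the tile identifiers it needs for the target moment.
    payload = {
        "viewing_angle": viewing_angle,   # e.g. yaw/pitch of the predicted viewpoint
        "quality_level": quality_level,   # output of the coding quality prediction model
        "tile_ids": tile_ids,             # identifiers of the requested video frame blocks
    }
    resp = requests.post(f"{server_url}/tiles", json=payload, timeout=5)
    resp.raise_for_status()
    # The server slices the frame, encodes the tiles at the requested level,
    # and returns them (e.g. referenced from an MPD file).
    return resp.content

# Example call with hypothetical values:
# blocks = fetch_tiles("http://pcloud-server:8080",
#                      {"yaw": 30.0, "pitch": -10.0}, 3, [5, 6, 9])
```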
According to the embodiment of the invention, the client can fuse the decoded video frame blocks to generate the three-dimensional video frame corresponding to the user's viewing angle and store it in the client's playback buffer; the client can then render three-dimensional video frames from the playback buffer to display the target three-dimensional video display result in the interactive interface.
Fig. 6 shows an application scenario diagram of a three-dimensional video presentation method according to an embodiment of the present invention.
As shown in fig. 6, in this application scenario the first client may be a VR device worn by the first user; the first client is communicatively connected to the server and to a plurality of second clients. The first client can collect video display environment information and parse, through its MPD parsing module, the MPD file transmitted from the server, obtaining the video display environment information that is input to the adaptive prediction module.
The adaptive prediction module may determine the first coding quality level based on the first coding quality prediction model, determine the video frame block identifiers (i.e., tile identifiers) corresponding to the first video frame block set, and transmit the first coding quality level and the tile identifiers to the server over the HTTP interface. The server can slice the three-dimensional video frame based on the first coding quality level and the tile identifiers to obtain an initial first video frame block set, which may either be sampled by a sampling module to yield an uncompressed first video frame block set or be compression-encoded by an encoder to yield a compressed first video frame block set. The encoded first video frame block set may form an MPD file and be sent to the first client.
The first client can decode and fuse the encoded first video frame block set through the decoder and the tile fusion module to obtain a three-dimensional video frame related to the first viewing angle, and render the three-dimensional video frame after extracting the three-dimensional video frame from the playing buffer area, so as to generate a target three-dimensional video display result in the first interactive interface.
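A minimal sketch of this decode, fuse, buffer, and render flow follows; the `decode`, `fuse`, and `render` callables and the queue-based playback buffer are assumptions standing in for the client's decoder, tile fusion module, and renderer.

```python
import queue

play_buffer = queue.Queue(maxsize=30)  # playback buffer of fused 3D frames

def receive_and_fuse(encoded_tiles, decode, fuse):
    # Decode each encoded video frame block, fuse the tiles into one
    # three-dimensional video frame, and store it in the playback buffer.
    tiles = [decode(t) for t in encoded_tiles]
    frame = fuse(tiles)
    play_buffer.put(frame)

def render_loop(render, frames_to_show):
    # Extract frames from the playback buffer and render them into the
    # first interactive interface.
    for _ in range(frames_to_show):
        frame = play_buffer.get()
        render(frame)
```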
The first client and each of the plurality of second clients may be pairwise communicatively connected, so that video frame blocks can be transmitted between different clients. Meanwhile, a federated learning aggregation server can process the local gradients of the clients and distribute the generated aggregation gradient to them.
According to the three-dimensional video display method provided by the embodiment of the invention, the viewing angle similarity of viewing three-dimensional video frames among a plurality of users can be utilized, so that a plurality of clients can cooperatively decode video frame blocks, and the respective computing resources of the clients can be fully utilized.
According to an embodiment of the present invention, the three-dimensional video display method may further include: in response to receiving a collaborative decoding request sent by a third client, acquiring, based on that request, a third video frame block corresponding to a third target sharing identifier from the third client. The third video frame block corresponding to the third target sharing identifier was sent to the third client by the server according to a third video frame block acquisition request generated by the third client following the three-dimensional video display method provided by the invention; the first video frame block set also comprises the third video frame block corresponding to the third target sharing identifier; and the current running load of the third client is greater than a preset load threshold.
According to an embodiment of the present invention, the three-dimensional video display method may further include: and decoding the third video frame block corresponding to the third target sharing identifier to obtain a third shared video frame block, and sending the third shared video frame block to the third client so as to save the calculation overhead of the third client.
According to the embodiment of the invention, after the third client makes its decision, it can request tiles of the corresponding quality (the third video frame block set) from the point cloud video server through the HTTP interface, and the server returns the requested third video frame block set. Multiple clients in the same network (e.g., the first client and the second clients) may then negotiate with each other to assign decoding tasks.
For example, each client broadcasts its own video decoding attributes to the other clients, including the tiles to be decoded, its decoding performance, its running load, and its system resource occupation. According to the collaboration policy, a third client i can assign the decoding task of the tile corresponding to the target sharing identifier to a first client j (j ≠ i): client j, having stronger decoding capability and lower system resource occupation, can take on more decoding tasks, so client i can hand over some or all of the video frame blocks it shares with client j for client j to decode, maximizing the overall decoding efficiency among the clients in the network. After decoding finishes, the decoded video frame blocks can likewise be sent to the other clients by multicast to complete the collaborative decoding. A sketch of such an allocation policy follows.
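The sketch below is one plausible greedy realization; the attribute fields and the capability score are assumptions, since the patent only requires that an overloaded client hand shared tiles to stronger, less-loaded peers.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class DecodeAttrs:
    client_id: str
    pending_tiles: Set[int]  # tiles this client still has to decode
    decode_rate: float       # tiles per second this client can decode
    load: float              # current running load, in [0, 1]

def assign_shared_tiles(me: DecodeAttrs, peers: List[DecodeAttrs],
                        load_threshold: float = 0.8) -> Dict[str, List[int]]:
    """If this client is overloaded, hand its shared tiles to the most capable peer."""
    if me.load <= load_threshold:
        return {}  # below the preset load threshold: decode locally
    assignments: Dict[str, List[int]] = {}
    for tile in sorted(me.pending_tiles):
        # Shared tiles: a peer that also needs this tile can decode it once
        # and multicast the result to everyone who shares it.
        candidates = [p for p in peers if tile in p.pending_tiles]
        if not candidates:
            continue
        # Greedy choice: highest decode rate discounted by current load.
        best = max(candidates, key=lambda p: p.decode_rate * (1.0 - p.load))
        assignments.setdefault(best.client_id, []).append(tile)
    return assignments
```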
The embodiment of the invention also provides a three-dimensional video display method which is applied to the server and comprises the following steps: responding to a first video frame block acquisition request from a first client, and analyzing the first video frame block acquisition request to obtain a first viewing angle and a first coding quality level; predicting a viewing viewport region of the three-dimensional video frame to obtain a first viewing viewport region related to the first client; determining a first video frame block set from the three-dimensional video frame according to the first viewing angle and the first viewing viewport region; and sending a first video frame block set to a first client, wherein the first client is suitable for decoding the first video frame block corresponding to the first target sharing identification in the first video frame block set acquired from the server to obtain a first sharing video frame block; and generating a target three-dimensional video display result in the first interactive interface according to the first shared video frame block and the second shared video frame block obtained from the second client.
According to the embodiment of the invention, the three-dimensional video frame comprises a point cloud video frame, and viewing viewport region prediction is performed on the three-dimensional video frame; the viewing viewport region can represent the region of interest of a user viewing the three-dimensional video frame. The first viewing viewport region related to the first client may be obtained as follows.
Voxelization is performed on the point cloud video frame to obtain the voxel blocks corresponding to it; feature extraction is performed on the point cloud within each voxel block to obtain voxel block features; feature extraction is performed on the point cloud data in the point cloud video frame to obtain point cloud features; the point cloud features and the voxel block features are fused to obtain fused features; and the fused features are input into a viewing viewport region prediction model, which outputs the first viewing viewport region.
According to an embodiment of the present invention, the first video frame block set is determined from the three-dimensional video frame by computing an intersection region or a union region between the first viewing angle and the first viewing viewport region, and selecting the first video frame block set from the three-dimensional video frame based on that intersection or union region, as sketched below.
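For illustration only, treating the viewing angle and the predicted viewport as sets of tile indices, the selection reduces to a set operation; the patent works on regions of the three-dimensional frame, so this set formulation is an assumed simplification.

```python
def select_tiles(view_angle_tiles: set, viewport_tiles: set,
                 mode: str = "union") -> set:
    # Intersection keeps only tiles both cues agree on; union is more
    # conservative and transmits every tile either cue marks as visible.
    if mode == "intersection":
        return view_angle_tiles & viewport_tiles
    return view_angle_tiles | viewport_tiles

# e.g. select_tiles({1, 2, 3, 4}, {3, 4, 5}) -> {1, 2, 3, 4, 5}
# e.g. select_tiles({1, 2, 3, 4}, {3, 4, 5}, mode="intersection") -> {3, 4}
```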
According to an embodiment of the invention, the viewing viewport region prediction model may be constructed based on a region proposal network (Region Proposal Network, RPN).
In one embodiment of the invention, the server may determine the first set of video frame blocks based on the following procedure.
For example, voxelization is performed on a three-dimensional point cloud frame and the point cloud data are grouped into an N×M×H grid of voxel blocks. If the (n, m, h)-th voxel block contains more than T points, downsampling reduces the number of points to T; if it contains fewer than T points, all of its points are kept.
Local feature extraction is then performed on each of the N×M×H voxel blocks. For any non-empty voxel block $V$, the centroid coordinates $(\bar{x}, \bar{y}, \bar{z})$ of the voxel block are calculated and an enhanced input voxel block is obtained, i.e., formula (4):

$V_{in} = \{\, \hat{p}_i = [\,x_i,\, y_i,\, z_i,\, r_i,\, g_i,\, b_i,\, x_i-\bar{x},\, y_i-\bar{y},\, z_i-\bar{z}\,]^{\mathsf{T}} \,\}_{i=1,\dots,t}$    (4)

In formula (4), $(x, y, z)$ are the position coordinates of a point and $(r, g, b)$ is the color of the point. $V_{in}$ is passed through a fully connected network layer to learn a local point cloud feature representation $f_i$ for each point; element-wise max pooling over the $f_i$ aggregates them into a voxel block feature, and the local point cloud features obtained by feature extraction on the point cloud data are fused with the voxel-level features to obtain the output feature $V_{out}$. This feature extraction is repeated for all non-empty voxel blocks $V$ to learn the voxel block features of the whole point cloud video frame.
Because the number of voxel blocks is large, the extracted voxel block features undergo further feature extraction through a three-dimensional convolutional intermediate layer. The resulting features are fed into a region proposal network (Region Proposal Network, RPN) to generate region extraction features, which then pass through a multi-layer perceptron layer and a pooling layer to obtain the ROI feature $F_{ROI} = \max(\mathrm{MLP}(F_{region}))$, where $\max(\cdot)$ denotes the max pooling operation, $\mathrm{MLP}(\cdot)$ denotes the multi-layer perceptron layer, and $F_{region}$ denotes the region extraction features. The first viewing viewport region may be determined based on the final ROI feature.
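The per-voxel part of this pipeline can be sketched as follows. It is a NumPy simplification under the stated T-point sampling and the equation-(4) augmentation; the random weight matrix `W` is a stand-in for the trained fully connected layer, and the downstream 3D convolution, RPN, and MLP stages are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_voxel(points: np.ndarray, T: int = 32) -> np.ndarray:
    # Cap each voxel at T points by downsampling; keep all points if fewer.
    if len(points) > T:
        idx = rng.choice(len(points), T, replace=False)
        points = points[idx]
    return points

def augment(points: np.ndarray) -> np.ndarray:
    # points: (t, 6) rows of (x, y, z, r, g, b); equation (4) appends each
    # point's offset to the voxel centroid, giving the enhanced input V_in.
    centroid = points[:, :3].mean(axis=0)
    return np.hstack([points, points[:, :3] - centroid])  # shape (t, 9)

def voxel_feature(points: np.ndarray, W: np.ndarray) -> np.ndarray:
    # Per-point features f_i = relu(V_in @ W) via the (assumed) FC layer,
    # then element-wise max pooling to one voxel-level feature vector.
    f = np.maximum(augment(points) @ W, 0.0)
    return f.max(axis=0)

W = rng.normal(size=(9, 16))                # stand-in for learned FC weights
pts = rng.normal(size=(50, 6))              # one non-empty voxel with 50 points
feat = voxel_feature(sample_voxel(pts), W)  # 16-dim voxel block feature
```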
The three-dimensional video display method provided by the embodiment of the invention thus offers a federated-reinforcement-learning-driven point cloud video streaming method and system for multiple clients: the point cloud is divided into tiles, viewport prediction is performed before transmission, only tiles within a user's viewport are transmitted, and reinforcement learning intelligently selects point cloud tiles of the appropriate quality level. The method can also use federated learning to let multiple clients train jointly, enhancing the performance of the coding quality level prediction model with distributed datasets without revealing user privacy. By exploiting the similarity of the viewing angles at which multiple users watch the three-dimensional video frames, multiple clients cooperatively decompress the point cloud video, maximizing the user quality of experience (QoE) under limited communication and computing resources and making the method suitable for large-scale use under complex network conditions.
Based on the three-dimensional video display method, the invention further provides a three-dimensional video display device. The device will be described in detail below in connection with fig. 7.
Fig. 7 shows a block diagram of a three-dimensional video presentation device according to an embodiment of the present invention.
As shown in fig. 7, the three-dimensional video presentation apparatus 700 of this embodiment is applied to a first client and includes a first viewing perspective prediction module 710, a first request generation module 720, a second video frame block identification set acquisition module 730, a first target identification subset determination module 740, a first shared video frame block determination module 750, and a presentation module 760.
The first viewing angle prediction module 710 is configured to predict a first viewing angle at which a first user views a three-dimensional video frame at a target time.
The first request generating module 720 is configured to generate a first video frame block acquisition request according to a first viewing angle, where the first video frame block acquisition request is used to acquire a first video frame block set corresponding to the first viewing angle from a server, where the first video frame block set is obtained from a three-dimensional video frame.
The second video frame block identifier set obtaining module 730 is configured to obtain a second video frame block identifier set generated by the second client and related to a second viewing angle, where the second viewing angle is a viewing angle predicted when the second user views the three-dimensional video frame at the target moment.
The first target identifier subset determining module 740 is configured to screen out a first target identifier subset from a first video frame block identifier set and a second video frame block identifier set corresponding to the first video frame block set according to a decoding allocation policy, where the first target identifier subset includes at least one first target sharing identifier, and the first target sharing identifier is included in an intersection of the first video frame block identifier set and the second video frame block identifier set.
The first shared video frame block determining module 750 is configured to decode, in a first video frame block set acquired from the server, a first video frame block corresponding to the first target sharing identifier, to obtain a first shared video frame block.
And the display module 760 is configured to generate a target three-dimensional video display result in the first interactive interface according to the first shared video frame block and the second shared video frame block obtained from the second client.
According to an embodiment of the present invention, the first viewing perspective prediction module 710 is further configured to: determining a historical three-dimensional video segment which is watched from the three-dimensional video containing the three-dimensional video frames according to the target moment; inputting the historical three-dimensional video clips into a first viewing angle prediction model, and outputting a first prediction result; and obtaining a first viewing angle according to the first prediction result.
According to an embodiment of the present invention, the first viewing perspective prediction module 710 is further configured to: the collected first historical view angle information of the first user in the historical view time period is input into a second view angle prediction model, and a second prediction result is output; and obtaining the first viewing angle according to the second prediction result.
According to an embodiment of the present invention, wherein the three-dimensional video frame is contained in a three-dimensional video, the three-dimensional video display device is further configured to: acquiring video display environment information related to a three-dimensional video; and inputting the video presentation environment information into a first coding quality prediction model, and outputting a first coding quality level corresponding to the three-dimensional video frame.
Wherein the first request generation module is further configured to: a first video frame block acquisition request is generated based on the first viewing perspective and the first encoding quality level.
According to an embodiment of the present invention, the video presentation environment information includes at least one of: a historical encoding quality level associated with historical three-dimensional video frame blocks, where the three-dimensional video includes historical three-dimensional video frames ordered before the three-dimensional video frame and the historical three-dimensional video frame blocks are obtained from those historical frames; the transmission time consumption of three-dimensional video frame blocks; the number of frames of the remaining three-dimensional video frames in the three-dimensional video, the remaining frames being those not yet transmitted to the first client; and the buffer capacity of the first client at the current moment. An illustrative encoding of this information is sketched below.
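For illustration, the four kinds of environment information could be packed into a model input as below; the field names, the mean-pooling of the histories, and the fixed-length encoding are assumptions, since the patent specifies the quantities but not their encoding.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class PresentationEnv:
    history_quality_levels: List[int]  # coding quality levels of historical frame blocks
    block_transfer_times: List[float]  # transmission time per historical frame block (s)
    remaining_frames: int              # frames of the video not yet sent to this client
    buffer_capacity: float             # current playback buffer capacity of the client

    def to_features(self) -> np.ndarray:
        # Flatten into a fixed-length input vector for the coding quality
        # prediction model; mean-pooling the histories is an assumption.
        return np.array([
            float(np.mean(self.history_quality_levels)) if self.history_quality_levels else 0.0,
            float(np.mean(self.block_transfer_times)) if self.block_transfer_times else 0.0,
            float(self.remaining_frames),
            float(self.buffer_capacity),
        ])

# env = PresentationEnv([3, 3, 4], [0.12, 0.09, 0.15], 240, 2.5)
# x = env.to_features()  # input to the first coding quality prediction model
```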
According to an embodiment of the invention, the second client is provided with a second coding quality prediction component comprising a second coding quality prediction model, the three-dimensional video presentation device being further configured to:
Acquiring sample video display environment information and a sample label; training an initial first coding quality prediction model related to a first client according to sample video display environment information and sample labels to obtain a trained first coding quality prediction model; the sample video display environment information and the sample label are further suitable for training an initial second coding quality prediction model related to the second client, and a trained second coding quality prediction model is obtained.
According to an embodiment of the present invention, training an initial first coding quality prediction model and an initial second coding quality prediction model according to sample video display environment information and sample labels, respectively, includes: performing stage-1 training on the initial first coding quality prediction model according to the sample video display environment information and the sample labels to obtain a stage-1 intermediate first coding quality prediction model and a stage-1 intermediate first gradient, where stage-1 training means executing a preset number of stage-1 training iterations; processing the stage-1 intermediate first gradient and the stage-1 intermediate second gradient output by the stage-1 intermediate second coding quality prediction model of the second client according to a preset gradient aggregation formula to obtain a stage-1 intermediate aggregation gradient; adjusting first model parameters of the stage-1 intermediate first coding quality prediction model according to the stage-1 intermediate aggregation gradient to obtain a stage-1 initial first coding quality prediction model, where the stage-1 intermediate aggregation gradient is likewise suitable for adjusting second model parameters of the stage-1 intermediate second coding quality prediction model to obtain a stage-1 initial second coding quality prediction model; performing stage-p training on the stage-(p-1) initial first coding quality prediction model according to the sample video display environment information and the sample labels to obtain a stage-p intermediate first coding quality prediction model and a stage-p intermediate first gradient, where stage-p training means executing a preset number of stage-p training iterations and p is an integer greater than 1; processing the stage-p intermediate first gradient and the stage-p intermediate second gradient output by the stage-p intermediate second coding quality prediction model of the second client according to the preset gradient aggregation formula to obtain a stage-p intermediate aggregation gradient; adjusting first model parameters of the stage-p intermediate first coding quality prediction model according to the stage-p intermediate aggregation gradient to obtain a stage-p initial first coding quality prediction model, where the stage-p intermediate aggregation gradient is likewise suitable for adjusting second model parameters of the stage-p intermediate second coding quality prediction model to obtain a stage-p initial second coding quality prediction model; and, in the case where the first loss function corresponding to an initial first coding quality prediction model and the second loss function corresponding to the initial second coding quality prediction model both converge, determining the stage-P initial first coding quality prediction model as the trained first coding quality prediction model, where P is an integer greater than or equal to p.
According to an embodiment of the present invention, the three-dimensional video display device is further configured to: and sending the first shared video frame block to the second client.
According to an embodiment of the present invention, any of the first viewing angle prediction module 710, the first request generation module 720, the second video frame block identification set acquisition module 730, the first target identification subset determination module 740, the first shared video frame block determination module 750, and the presentation module 760 may be combined and implemented in one module, or any one of them may be split into multiple modules. Alternatively, at least some of the functionality of one or more of these modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the invention, at least one of the first viewing angle prediction module 710, the first request generation module 720, the second video frame block identification set acquisition module 730, the first target identification subset determination module 740, the first shared video frame block determination module 750, and the presentation module 760 may be implemented at least in part as hardware circuitry, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on substrate, a system on package, or an application specific integrated circuit (ASIC), or may be implemented in hardware or firmware by any other reasonable manner of integrating or packaging circuitry, or implemented in any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, at least one of these modules may be at least partially implemented as a computer program module that, when executed, performs the corresponding function.
The embodiment of the invention also provides a three-dimensional video display device which is applied to the server and comprises:
the first video frame block acquisition request analyzing module is used for responding to the first video frame block acquisition request from the first client and analyzing the first video frame block acquisition request to obtain a first viewing angle and a first coding quality level.
And the watching viewport region prediction module is used for predicting the watching viewport region of the three-dimensional video frame to obtain a first watching viewport region related to the first client.
And the first video frame block set screening module is used for determining a first video frame block set from the three-dimensional video frames according to the first viewing angle and the first viewing view port area.
The first video frame block set sending module is used for sending a first video frame block set to the first client, wherein the first client is suitable for decoding a first video frame block corresponding to a first target sharing identifier in the first video frame block set acquired from the server to obtain a first shared video frame block; and generating a target three-dimensional video display result in the first interactive interface according to the first shared video frame block and the second shared video frame block obtained from the second client.
It should be noted that, the three-dimensional video display method provided by the embodiment of the present invention corresponds to the three-dimensional video display device provided by the embodiment of the present invention, and the three-dimensional video display device provided by the embodiment of the present invention may be used to execute the three-dimensional video display method provided by the embodiment of the present invention.
Fig. 8 shows a block diagram of an electronic device adapted to implement a three-dimensional video presentation method according to an embodiment of the invention.
As shown in fig. 8, an electronic device 800 according to an embodiment of the present invention includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 801 may also include on-board memory for caching purposes. The processor 801 may comprise a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the invention.
In the RAM 803, various programs and data required for the operation of the electronic device 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flow according to the embodiment of the present invention by executing programs in the ROM 802 and/or the RAM 803. Note that the program may be stored in one or more memories other than the ROM 802 and the RAM 803. The processor 801 may also perform various operations of the method flow according to embodiments of the present invention by executing programs stored in the one or more memories.
According to an embodiment of the invention, the electronic device 800 may further comprise an input/output (I/O) interface 805, the input/output (I/O) interface 805 also being connected to the bus 804. The electronic device 800 may also include one or more of the following components connected to the I/O interface 805: an input portion 806 including a keyboard, mouse, etc.; an output portion 807 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. The drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as needed so that a computer program read out therefrom is mounted into the storage section 808 as needed.
The present invention also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present invention.
According to embodiments of the present invention, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the invention, the computer-readable storage medium may include ROM 802 and/or RAM 803 and/or one or more memories other than ROM 802 and RAM 803 described above.
Embodiments of the present invention also include a computer program product comprising a computer program that contains program code for performing the method shown in the flowcharts. When the computer program product runs on a computer system, the program code causes the computer system to carry out the methods provided by the embodiments of the present invention.
The above-described functions defined in the system/apparatus of the embodiment of the present invention are performed when the computer program is executed by the processor 801. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the invention.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, downloaded and installed via the communication section 809, and/or installed from the removable medium 811. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, or any suitable combination of the foregoing.
According to embodiments of the present invention, the program code of the computer programs provided by embodiments of the present invention may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, "C", or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the latter case, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the invention and/or in the claims may be combined in many ways, even if such combinations are not explicitly recited in the invention. In particular, the features recited in the various embodiments of the invention and/or in the claims can be combined without departing from the spirit and teachings of the invention. All such combinations fall within the scope of the invention.
The embodiments of the present invention are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the invention is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the invention, and such alternatives and modifications are intended to fall within the scope of the invention.

Claims (10)

1. The three-dimensional video display method is applied to a first client and is characterized by comprising the following steps of:
predicting a first viewing angle at which a first user views a three-dimensional video frame at a target moment;
generating a first video frame block acquisition request according to the first viewing angle, wherein the first video frame block acquisition request is used for acquiring a first video frame block set corresponding to the first viewing angle from a server, and the first video frame block set is obtained from the three-dimensional video frame;
acquiring a second video frame block identification set generated by a second client and related to a second viewing angle, wherein the second viewing angle is a predicted viewing angle when a second user views the three-dimensional video frame at the target moment;
Screening a first target identification subset from a first video frame block identification set and a second video frame block identification set corresponding to the first video frame block set according to a decoding allocation strategy, wherein the first target identification subset comprises at least one first target sharing identification which is contained in an intersection of the first video frame block identification set and the second video frame block identification set;
decoding a first video frame block corresponding to the first target sharing identifier in a first video frame block set acquired from the server to obtain a first sharing video frame block;
obtaining a second shared video frame block from the second client, wherein the second shared video frame block is different from the first shared video frame block, and the second shared video frame block is obtained by the second client executing the following operation:
generating a second video frame block acquisition request according to the second viewing angle, wherein the second video frame block acquisition request is used for acquiring a second video frame block set corresponding to the second viewing angle from the server, and the second video frame block set is obtained from the three-dimensional video frame;
Acquiring the first video frame block identification set generated by the first client;
screening a second target identifier subset from the first video frame block identifier set and the second video frame block identifier set according to the decoding allocation strategy, wherein the second target identifier subset comprises at least one second target sharing identifier which is contained in an intersection of the first video frame block identifier set and the second video frame block identifier set;
decoding a second video frame block corresponding to the second target sharing identifier in a second video frame block set acquired from the server to obtain a second sharing video frame block;
and generating a target three-dimensional video display result in a first interactive interface according to the first shared video frame block and the second shared video frame block obtained from the second client.
2. The method of claim 1, wherein predicting a first viewing perspective at which a first user views a three-dimensional video frame at a target time comprises:
determining a historical three-dimensional video segment which is watched from the three-dimensional video containing the three-dimensional video frame according to the target moment;
Inputting the historical three-dimensional video clips to a first viewing angle prediction model, and outputting a first prediction result;
and obtaining the first viewing angle according to the first prediction result.
3. The method of claim 1, wherein predicting a first viewing perspective at which a first user views a three-dimensional video frame at a target time comprises:
the collected first historical view angle information of the first user in the historical view time period is input into a second view angle prediction model, and a second prediction result is output;
and obtaining the first viewing angle according to the second prediction result.
4. The method of claim 1, wherein the three-dimensional video frame is included in a three-dimensional video, the three-dimensional video presentation method further comprising:
acquiring video display environment information related to the three-dimensional video;
inputting the video display environment information into a first coding quality prediction model, and outputting a first coding quality grade corresponding to the three-dimensional video frame;
wherein generating a first video frame block acquisition request according to the first viewing perspective comprises:
and generating the first video frame block acquisition request according to the first viewing angle and the first coding quality level.
5. The method of claim 4, wherein the video presentation environment information comprises at least one of:
a historical encoding quality level associated with a historical three-dimensional video frame block, the three-dimensional video including historical three-dimensional video frames ordered before the three-dimensional video frame, the historical three-dimensional video frame block derived from the historical three-dimensional video frames;
the transmission time consumption of the three-dimensional video frame blocks;
a frame number of remaining three-dimensional video frames in the three-dimensional video, the remaining three-dimensional video frames including video frames in the three-dimensional video that have not been transmitted to the first client;
and the buffer area capacity of the first client at the current moment.
6. The method of claim 4, wherein the second client is provided with a second coding quality prediction component comprising a second coding quality prediction model, the three-dimensional video presentation method further comprising:
acquiring sample video display environment information and a sample label;
training an initial first coding quality prediction model related to the first client according to the sample video display environment information and the sample label to obtain a trained first coding quality prediction model;
The sample video display environment information and the sample tag are further adapted to train an initial second coding quality prediction model related to the second client, and the trained second coding quality prediction model is obtained.
7. The method of claim 6, wherein training an initial first coding quality prediction model and an initial second coding quality prediction model based on the sample video presentation environment information and the sample tags, respectively, comprises:
performing stage-1 training on the initial first coding quality prediction model according to the sample video display environment information and the sample label to obtain a stage-1 intermediate first coding quality prediction model and a stage-1 intermediate first gradient, wherein the stage-1 training represents executing a preset number of stage-1 training iterations;

processing the stage-1 intermediate first gradient and the stage-1 intermediate second gradient output by the stage-1 intermediate second coding quality prediction model of the second client according to a preset gradient aggregation formula to obtain a stage-1 intermediate aggregation gradient;

adjusting first model parameters of the stage-1 intermediate first coding quality prediction model according to the stage-1 intermediate aggregation gradient to obtain a stage-1 initial first coding quality prediction model, wherein the stage-1 intermediate aggregation gradient is also suitable for adjusting second model parameters of the stage-1 intermediate second coding quality prediction model to obtain a stage-1 initial second coding quality prediction model;

performing stage-p training on the stage-(p-1) initial first coding quality prediction model according to the sample video display environment information and the sample label to obtain a stage-p intermediate first coding quality prediction model and a stage-p intermediate first gradient, wherein the stage-p training represents executing a preset number of stage-p training iterations, and p is an integer greater than 1;

processing the stage-p intermediate first gradient and the stage-p intermediate second gradient output by the stage-p intermediate second coding quality prediction model of the second client according to the preset gradient aggregation formula to obtain a stage-p intermediate aggregation gradient;

adjusting first model parameters of the stage-p intermediate first coding quality prediction model according to the stage-p intermediate aggregation gradient to obtain a stage-p initial first coding quality prediction model, wherein the stage-p intermediate aggregation gradient is also suitable for adjusting second model parameters of the stage-p intermediate second coding quality prediction model to obtain a stage-p initial second coding quality prediction model;

and determining the stage-P initial first coding quality prediction model as the trained first coding quality prediction model under the condition that a first loss function corresponding to an initial first coding quality prediction model and a second loss function corresponding to the initial second coding quality prediction model converge, wherein P is an integer greater than or equal to p.
8. The method as recited in claim 1, further comprising:
and sending the first shared video frame block to the second client.
9. The three-dimensional video display method is applied to a server and is characterized by comprising the following steps of:
responding to a first video frame block acquisition request from a first client, and analyzing the first video frame block acquisition request to obtain a first viewing angle and a first coding quality level;
predicting a viewing viewport region of the three-dimensional video frame to obtain a first viewing viewport region related to the first client;
determining a first set of video frame blocks from the three-dimensional video frame according to the first viewing perspective and the first viewing viewport region;
transmitting the first set of video frame blocks to the first client, wherein the first client is adapted to:
predicting a first viewing angle at which a first user views a three-dimensional video frame at a target moment;
generating a first video frame block acquisition request according to the first viewing angle, wherein the first video frame block acquisition request is used for acquiring a first video frame block set corresponding to the first viewing angle from a server, and the first video frame block set is obtained from the three-dimensional video frame;
Acquiring a second video frame block identification set generated by a second client and related to a second viewing angle, wherein the second viewing angle is a predicted viewing angle when a second user views the three-dimensional video frame at the target moment;
screening a first target identification subset from a first video frame block identification set and a second video frame block identification set corresponding to the first video frame block set according to a decoding allocation strategy, wherein the first target identification subset comprises at least one first target sharing identification which is contained in an intersection of the first video frame block identification set and the second video frame block identification set;
decoding a first video frame block corresponding to the first target sharing identifier in a first video frame block set acquired from the server to obtain a first sharing video frame block;
obtaining a second shared video frame block from the second client, wherein the second shared video frame block is different from the first shared video frame block, and the second shared video frame block is obtained by the second client executing the following operation:
generating a second video frame block acquisition request according to the second viewing angle, wherein the second video frame block acquisition request is used for acquiring a second video frame block set corresponding to the second viewing angle from the server, and the second video frame block set is obtained from the three-dimensional video frame;
Acquiring the first video frame block identification set generated by the first client;
screening a second target identifier subset from the first video frame block identifier set and the second video frame block identifier set according to a decoding allocation strategy, wherein the second target identifier subset comprises at least one second target sharing identifier which is contained in an intersection of the first video frame block identifier set and the second video frame block identifier set;
decoding a second video frame block corresponding to the second target sharing identifier in a second video frame block set acquired from the server to obtain a second sharing video frame block;
and generating a target three-dimensional video display result in a first interactive interface according to the first shared video frame block and the second shared video frame block obtained from the second client.
10. A three-dimensional video presentation device for a first client, comprising:
the first viewing angle prediction module is used for predicting a first viewing angle at which a first user views the three-dimensional video frame at a target moment;
the first request generation module is used for generating a first video frame block acquisition request according to the first viewing angle, wherein the first video frame block acquisition request is used for acquiring a first video frame block set corresponding to the first viewing angle from a server, and the first video frame block set is obtained from the three-dimensional video frame;
The second video frame block identification set obtaining module is used for obtaining a second video frame block identification set which is generated by a second client and is related to a second viewing angle, wherein the second viewing angle is a predicted viewing angle when a second user views the three-dimensional video frame at the target moment;
the first target identification subset determining module is used for screening a first target identification subset from a first video frame block identification set and a second video frame block identification set corresponding to the first video frame block set according to a decoding allocation strategy, wherein the first target identification subset comprises at least one first target sharing identification which is contained in an intersection of the first video frame block identification set and the second video frame block identification set;
the first shared video frame block determining module is used for decoding the first video frame block corresponding to the first target sharing identifier in the first video frame block set acquired from the server to obtain a first shared video frame block;
the display module is used for generating a target three-dimensional video display result in a first interactive interface according to the first shared video frame block and a second shared video frame block obtained from the second client;
The three-dimensional video display device is further configured to:
obtaining the second shared video frame block from the second client, wherein the second shared video frame block is different from the first shared video frame block, and the second shared video frame block is obtained by the second client executing the following operation:
generating a second video frame block acquisition request according to the second viewing angle, wherein the second video frame block acquisition request is used for acquiring a second video frame block set corresponding to the second viewing angle from the server, and the second video frame block set is obtained from the three-dimensional video frame;
acquiring the first video frame block identification set generated by the first client;
screening a second target identifier subset from the first video frame block identifier set and the second video frame block identifier set according to a decoding allocation strategy, wherein the second target identifier subset comprises at least one second target sharing identifier which is contained in an intersection of the first video frame block identifier set and the second video frame block identifier set;
decoding a second video frame block corresponding to the second target sharing identifier in a second video frame block set acquired from the server to obtain a second sharing video frame block;
And generating a target three-dimensional video display result in a first interactive interface according to the first shared video frame block and the second shared video frame block obtained from the second client.
CN202311257319.8A 2023-09-27 2023-09-27 Three-dimensional video display method, device, equipment and medium Active CN116996661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311257319.8A CN116996661B (en) 2023-09-27 2023-09-27 Three-dimensional video display method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311257319.8A CN116996661B (en) 2023-09-27 2023-09-27 Three-dimensional video display method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN116996661A CN116996661A (en) 2023-11-03
CN116996661B true CN116996661B (en) 2024-01-05

Family

ID=88530652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311257319.8A Active CN116996661B (en) 2023-09-27 2023-09-27 Three-dimensional video display method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116996661B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014055487A1 (en) * 2012-10-04 2014-04-10 Mcci Corporation Video conferencing enhanced with 3-d perspective control
CN105359530A (en) * 2013-06-27 2016-02-24 高通股份有限公司 Depth oriented inter-view motion vector prediction
CN107529064A (en) * 2017-09-04 2017-12-29 北京理工大学 A kind of self-adaptive encoding method based on VR terminals feedback
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase
CN108965847A (en) * 2017-05-27 2018-12-07 华为技术有限公司 A kind of processing method and processing device of panoramic video data
WO2019138163A1 (en) * 2018-01-15 2019-07-18 Nokia Technologies Oy A method and technical equipment for encoding and decoding volumetric video
CN110248212A (en) * 2019-05-27 2019-09-17 上海交通大学 360 degree of video stream server end code rate adaptive transmission methods of multi-user and system
CN211791828U (en) * 2019-12-05 2020-10-27 北京芯海视界三维科技有限公司 3D display device
CN112584119A (en) * 2020-11-24 2021-03-30 鹏城实验室 Self-adaptive panoramic video transmission method and system based on reinforcement learning
CN112929646A (en) * 2019-12-05 2021-06-08 北京芯海视界三维科技有限公司 Method for realizing 3D image display and 3D display equipment
WO2021110036A1 (en) * 2019-12-05 2021-06-10 北京芯海视界三维科技有限公司 Multi-view 3d display screen and multi-view 3d display device
CN113329266A (en) * 2021-06-08 2021-08-31 合肥工业大学 Panoramic video self-adaptive transmission method based on limited user visual angle feedback
WO2022069445A1 (en) * 2020-10-01 2022-04-07 Koninklijke Kpn N.V. Streaming panoramic video of a scene from multiple viewpoints
WO2023004559A1 (en) * 2021-07-26 2023-02-02 Shanghaitech University Editable free-viewpoint video using a layered neural representation
CN116017003A (en) * 2023-01-09 2023-04-25 西安交通大学 Self-adaptive VR360 video-on-demand method and system based on multiple artificial intelligence methods
CN116760965A (en) * 2023-08-14 2023-09-15 腾讯科技(深圳)有限公司 Panoramic video encoding method, device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9918082B2 (en) * 2014-10-20 2018-03-13 Google Llc Continuous prediction domain
US11570417B2 (en) * 2021-05-20 2023-01-31 Apple Inc. Immersive video streaming using view-adaptive prefetching and buffer control

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014055487A1 (en) * 2012-10-04 2014-04-10 Mcci Corporation Video conferencing enhanced with 3-d perspective control
CN105359530A (en) * 2013-06-27 2016-02-24 Qualcomm Inc. Depth oriented inter-view motion vector prediction
CN108965847A (en) * 2017-05-27 2018-12-07 Huawei Technologies Co., Ltd. Panoramic video data processing method and apparatus
CN107529064A (en) * 2017-09-04 2017-12-29 Beijing Institute of Technology Adaptive encoding method based on VR terminal feedback
WO2019138163A1 (en) * 2018-01-15 2019-07-18 Nokia Technologies Oy A method and technical equipment for encoding and decoding volumetric video
CN108805083A (en) * 2018-06-13 2018-11-13 University of Science and Technology of China Single-stage video behavior detection method
CN110248212A (en) * 2019-05-27 2019-09-17 Shanghai Jiao Tong University Multi-user server-side bitrate-adaptive transmission method and system for 360-degree video streaming
WO2021110036A1 (en) * 2019-12-05 2021-06-10 Beijing Ivisual 3D Technology Co., Ltd. Multi-view 3D display screen and multi-view 3D display device
CN211791828U (en) * 2019-12-05 2020-10-27 Beijing Ivisual 3D Technology Co., Ltd. 3D display device
CN112929646A (en) * 2019-12-05 2021-06-08 Beijing Ivisual 3D Technology Co., Ltd. Method for realizing 3D image display and 3D display device
WO2022069445A1 (en) * 2020-10-01 2022-04-07 Koninklijke Kpn N.V. Streaming panoramic video of a scene from multiple viewpoints
CN116325769A (en) * 2020-10-01 2023-06-23 Koninklijke KPN N.V. Streaming panoramic video of a scene from multiple viewpoints
CN112584119A (en) * 2020-11-24 2021-03-30 Peng Cheng Laboratory Adaptive panoramic video transmission method and system based on reinforcement learning
CN113329266A (en) * 2021-06-08 2021-08-31 Hefei University of Technology Adaptive panoramic video transmission method based on limited user viewing-angle feedback
WO2023004559A1 (en) * 2021-07-26 2023-02-02 Shanghaitech University Editable free-viewpoint video using a layered neural representation
CN116017003A (en) * 2023-01-09 2023-04-25 Xi'an Jiaotong University Adaptive VR360 video-on-demand method and system based on multiple artificial intelligence methods
CN116760965A (en) * 2023-08-14 2023-09-15 Tencent Technology (Shenzhen) Co., Ltd. Panoramic video encoding method, device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Human behavior recognition based on multi-view non-negative matrix factorization; Guo Weiting; Xia Limin; Computer Engineering and Applications (No. 16); full text *
Multi-view interactive behavior recognition; Guo Weiting; Xia Limin; Wang Hao; Liu Zeyu; Information & Communications (No. 03); full text *

Also Published As

Publication number Publication date
CN116996661A (en) 2023-11-03

Similar Documents

Publication Publication Date Title
Zhang et al. Video super-resolution and caching—An edge-assisted adaptive video streaming solution
Jiang et al. Plato: Learning-based adaptive streaming of 360-degree videos
Park et al. Volumetric media streaming for augmented reality
CN113115067A (en) Live broadcast system, video processing method and related device
CN112954398B (en) Encoding method, decoding method, device, storage medium and electronic equipment
CN104751639A (en) Big-data-based video structured license plate recognition system and method
US20200404241A1 (en) Processing system for streaming volumetric video to a client device
CN115205150A (en) Image deblurring method, device, equipment, medium and computer program product
Kumar et al. Multi-neural network based tiled 360° video caching with Mobile Edge Computing
Zhu et al. A semantic-aware transmission with adaptive control scheme for volumetric video service
WO2021092821A1 (en) Adaptively encoding video frames using content and network analysis
CN115037962A (en) Video adaptive transmission method, device, terminal equipment and storage medium
Jiang et al. SVP: Sinusoidal viewport prediction for 360-degree video streaming
WO2022000298A1 (en) Reinforcement learning based rate control
CN116996661B (en) Three-dimensional video display method, device, equipment and medium
Chen et al. VCMaker: Content-aware configuration adaptation for video streaming and analysis in live augmented reality
Zhang et al. Bandwidth-efficient multi-task AI inference with dynamic task importance for the Internet of Things in edge computing
CN114900717B (en) Video data transmission method, device, medium and computing equipment
CN113808157B (en) Image processing method and device and computer equipment
CN116918329A (en) Video frame compression and video frame decompression method and device
CN113996056A (en) Data sending and receiving method of cloud game and related equipment
Ozcinar et al. Delivery of omnidirectional video using saliency prediction and optimal bitrate allocation
Li et al. SAD360: Spherical Viewport-Aware Dynamic Tiling for 360-Degree Video Streaming
Chen et al. Edgevr360: Edge-assisted multiuser-oriented intelligent 360-degree video delivery scheme over wireless networks
CN115661238B (en) Method and device for generating travelable region, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant