CN109451293B

CN109451293B - Self-adaptive stereoscopic video transmission system and method

Info

Publication number: CN109451293B
Application number: CN201810903751.2A
Authority: CN
Inventors: 刘奕彤; 田旺; 杨鸿文; 吴建伟
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2018-08-09
Filing date: 2018-08-09
Publication date: 2021-11-26
Anticipated expiration: 2038-08-09
Also published as: CN109451293A

Abstract

The invention provides an adaptive stereoscopic video transmission method that exploits the computing power of an open base station. It discloses a stereoscopic video transmission scheme from two perspectives: video coding/decoding and streaming media transmission. The streaming media server encodes, slices and deploys the original video source and supports various stereoscopic video coding schemes. The user requests a specific video slice stream according to its device capability and changing network conditions. The base station aggregates and analyzes the service requests of its access users and requests the existing stereoscopic video slices from the streaming media server. It then generates and pushes video slices that meet each user's specified requirements. The computing power of the base station is used to convert the stereoscopic video slices into video slice streams compatible with the user equipment and network capability. The bitrate of these streams can be adjusted dynamically as the network changes, realizing adaptive streaming. The invention reduces the storage and traffic pressure on the server, lowers the computing requirements on the user terminal, offers good compatibility, and enables bitrate-adaptive transmission of various kinds of stereoscopic video.

Description

Self-adaptive stereoscopic video transmission system and method

1. Field of application

The present invention relates to the problem of adaptive streaming of stereoscopic video.

2. Background of the invention

With the rapid development and integration of computer vision, computer graphics and video processing technologies, conventional flat video (i.e., two-dimensional video) no longer meets users' needs. Stereoscopic three-dimensional (3D) video and multi-view video (MVV) are therefore attracting increasing attention.

To avoid confusion among terms, the following definitions are used:

- Multi-view video refers to a set of video signals obtained by synchronously capturing the same scene from different perspectives with a plurality of cameras located at different viewpoints. Each viewpoint is captured as binocular video.
- 3D video refers to the binocular video of a single viewpoint, i.e., a set of two video channels corresponding to the left and right eyes, respectively.
- Stereoscopic video is a general term covering both 3D video and multi-view video. It presents a stereoscopic effect when combined with a suitable display technology.

A streaming media scheme for stereoscopic video must consider more than the design of the transmission protocol and scheme. It must also account for the compressed storage format of the video source and the stereoscopic display mode. This document describes the scheme from both the video coding and the streaming perspectives.

The compressed storage format of a video source depends mainly on the video coding technology, and each coding technology has a corresponding storage format. Consider the mainstream codec standard H.264/AVC, the next-generation standard H.265/HEVC, and the stereoscopic extensions of these two generations (H.264/MVC, MV-HEVC and 3D-HEVC). Under these standards, stereoscopic video is mainly stored in the following formats:

a) Traditional binocular video consists of two planar videos, one for the left eye and one for the right eye, each coded independently;

b) Traditional video + depth map consists of a conventional planar video and a depth map. The depth map is generated from depth information recorded by the camera or by similar means, and new coding tools are introduced to improve compression efficiency;

c) Multi-view video consists of multiple binocular videos, each corresponding to one viewpoint, so it comprises multiple conventional planar videos and depth maps. Multi-view video coding introduces techniques such as disparity-compensated prediction and improves compression efficiency by exploiting inter-view redundancy.

Both binocular video and traditional video + depth map can represent the three-dimensional information of a scene.

- Binocular video is the more direct representation: its two videos correspond directly to the left and right eyes.
- Video + depth map is an indirect representation and must be converted to form a binocular video.
- Multi-view video contains multiple binocular videos. It can therefore support multi-view stereoscopic display, binocular stereoscopic display and conventional two-dimensional display, at the cost of a larger data volume.
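Converting traditional video + depth map into binocular video is commonly done with depth-image-based rendering (DIBR). The patent does not name a specific algorithm, so the following is only a minimal one-row sketch of that technique. It assumes:

- 8-bit depth values, where larger means nearer;
- half of the disparity applied to each eye, in opposite directions;
- a z-buffer to resolve occlusions;
- naive left-neighbour filling of disocclusion holes.

The function names, the `max_disp` parameter and the left/right direction convention are hypothetical.

```python
from typing import List, Optional, Tuple

Row = List[int]


def warp_row(texture: Row, depth: Row, max_disp: int, direction: int) -> Row:
    """Synthesize one eye's row by shifting pixels horizontally (DIBR sketch).

    Assumptions: depth is 8-bit (0..255) with larger values nearer to the
    camera; each eye receives half of the full disparity; direction is +1
    for one eye and -1 for the other (the sign convention is assumed).
    """
    width = len(texture)
    out: List[Optional[int]] = [None] * width
    zbuf = [-1] * width  # depth of the pixel currently written at each column
    for x in range(width):
        shift = direction * round(max_disp * depth[x] / 255 / 2)
        nx = x + shift
        # Z-buffer test: a nearer pixel overwrites a farther one.
        if 0 <= nx < width and depth[x] > zbuf[nx]:
            out[nx] = texture[x]
            zbuf[nx] = depth[x]
    # Naive hole filling: copy the closest filled pixel on the left
    # (or the first filled pixel for holes at the start of the row).
    filled = [p for p in out if p is not None]
    last = filled[0] if filled else 0
    result: Row = []
    for p in out:
        if p is not None:
            last = p
        result.append(last)
    return result


def render_stereo_row(texture: Row, depth: Row, max_disp: int) -> Tuple[Row, Row]:
    """Return (left, right) rows rendered from one texture row and its depth row."""
    return (warp_row(texture, depth, max_disp, +1),
            warp_row(texture, depth, max_disp, -1))


if __name__ == "__main__":
    left, right = render_stereo_row([1, 2, 3, 4, 5, 6], [0, 0, 255, 0, 0, 0], max_disp=4)
    print(left, right)
```

In the usage example the single near pixel moves two columns in opposite directions in the two views. This leaves a one-pixel hole in each, which the simple filler patches. Production DIBR uses better inpainting and sub-pixel shifts.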

Stereoscopic display imitates the retinal imaging process of human vision. Left and right images of the same scene are projected onto the retinas of the left and right eyes, producing a sense of depth. Suitable means must therefore ensure that each eye sees only its corresponding view, without crosstalk. Stereoscopic display technologies fall mainly into two classes: glasses-based displays (e.g., polarized-glasses displays) and glasses-free displays (e.g., parallax-barrier autostereoscopic displays).

From the transmission perspective, the data volume of stereoscopic video far exceeds that of conventional planar video. The transmitted content changes from a single monocular video to one or more binocular videos, so the bandwidth requirement is very high.

- In the traditional binocular video scheme, the data volume is twice that of conventional planar video.
- The traditional video + depth scheme places very high demands on terminal computing capability.
- In the scheme designed here, video transcoding is performed at the base station. The content delivered to the user is therefore still traditional binocular video, whose data volume remains twice that of conventional planar video.
- Multi-view video has an even larger data volume, far beyond the carrying capacity of existing networks.
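To make the factor of two (and more) concrete, the following sketch computes uncompressed data rates. It assumes 8-bit YUV 4:2:0 sampling and counts an N-view multi-view video as 2N full-resolution streams, matching the stream count in step a) of the method. Real compressed rates are far lower, and multi-view coding exploits inter-view redundancy, so only the ratios are meaningful. The function names are illustrative.

```python
# Uncompressed data volume of planar, binocular and multi-view video (sketch).
# Assumptions: 8-bit YUV 4:2:0 samples, no compression, and an N-view
# multi-view video counted as 2N full-resolution streams.


def raw_stream_bps(width: int, height: int, fps: int, bit_depth: int = 8) -> int:
    """Uncompressed bitrate of one YUV 4:2:0 stream, in bits per second."""
    # 4:2:0 = one full-size luma plane + two quarter-size chroma planes.
    samples_per_frame = width * height * 3 // 2
    return samples_per_frame * bit_depth * fps


def stream_count(kind: str, views: int = 1) -> int:
    """Number of full-resolution streams each representation carries."""
    if kind == "planar":
        return 1
    if kind == "binocular":
        return 2              # left eye + right eye
    if kind == "multiview":
        return 2 * views      # two streams per viewpoint
    raise ValueError(f"unknown kind: {kind}")


if __name__ == "__main__":
    one = raw_stream_bps(3840, 2160, 60)
    for kind, views in [("planar", 1), ("binocular", 1), ("multiview", 6)]:
        total = one * stream_count(kind, views)
        print(f"{kind:<10} {total / 1e9:7.2f} Gbit/s")
```

At 3840x2160 and 60 fps, one raw stream is about 5.97 Gbit/s. A binocular video needs twice that, and a 6-view video twelve times that, before compression.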

3. Summary and features of the invention

The invention involves three parties:

- The streaming media server provides the stereoscopic video sources. These include traditional binocular video, traditional video + depth map, and multi-view video.
- The base station is a mobile communication base station with a certain computing capability. It is the last hop of the mobile user's access network.
- The user (client) accesses the network through the base station, requests stereoscopic video from the streaming media server, and obtains the video resource by streaming.

The method comprises the following specific steps:

a) Video source coding. The video source server compresses and encodes the original stereoscopic video. The encoding must use a closed, fixed-length GOP (Group of Pictures) structure so that each GOP can be decoded independently. The streams generated depend on the coding mode:
- For traditional binocular video, two video streams are generated, one for each of the left and right viewpoints.
- For traditional video + depth map, a conventional video stream and a depth map data stream are generated, with each depth map frame corresponding to one video frame.
- For multi-view video with N views, 2N video streams are generated.

b) Video source slicing. The video source server slices each video stream into segments of fixed duration t, each slice containing the same number of GOPs. During this process it generates a custom video information description file in a data format negotiated with the user client. The file describes the sliced video, including:
- file name and total playing time;
- video compression format (HEVC, AVC, VP9, VP8, etc.) and audio compression format (AAC, MP3, etc.);
- encapsulation format (MPEG-2, MPEG-4, etc.);
- resolution and frame rate;
- slice duration, slice serial number and URI;
- coding mode (traditional binocular video, traditional video + depth map, multi-view video, etc.).

Depending on the coding mode, additional fields are needed:
- Traditional binocular video needs a channel number (corresponding to the left or right eye).
- Traditional video + depth map needs a description of the depth map, such as the picture compression format and the corresponding GOP and video frame numbers.
- Multi-view video needs a view number and a channel number.

c) Requesting the video information description file. The user requests the video information description file from the streaming media server through the base station to learn which video resources can be requested. At the same time, the base station obtains the information about the videos deployed on the streaming media server.

d) Requesting video slice files. The user issues requests according to its device capability, changes in network conditions, changes of viewpoint, and so on. Each request states which video slices the user can receive. It must specify the video and audio compression formats, encapsulation format, resolution, frame rate, slice duration, serial number, bitrate, etc. Whatever the coding mode of the stereoscopic video, the user requests two video streams (left and right eye), which are then decoded, rendered and played separately.

e) User-group request analysis and video transcoding. The base station analyzes and classifies the requests of its access users, and users requesting the same video source form a user group. The base station requests the video slices from the streaming media server and then transcodes and pushes them according to the user requests. The processing depends on the coding mode:
- For traditional binocular video, the slices are transcoded into slices that meet the user's specified requirements (video and audio compression formats, encapsulation format, resolution, slice duration, serial number and bitrate), which are then pushed.
- For traditional video + depth map, 2D-to-3D conversion is required. A traditional binocular video slice is rendered from the 2D planar video and its depth map, and the video slice required by the user is then generated and pushed.
- For multi-view video, the base station analyzes the views requested by the user group and requests one or more groups of video slices of those views from the streaming media server. It then performs 2D-to-3D and bitrate conversion and pushes the video slices specified by each user.
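The request aggregation of step e) can be sketched as a grouping problem. Requests that need the same source slice form a user group, and the base station fetches that slice from the streaming server once. Requests with identical output parameters then share a single transcode.

This sketch is illustrative only. The `SliceRequest` record, its field names and the `aggregate` function are hypothetical, and real requests would carry the full parameter set listed in step d). The usage example mirrors the first slice of the embodiment below.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SliceRequest:
    """One user's request for one slice (hypothetical subset of step d's fields)."""
    user: str
    video: str                    # video source, e.g. "Sport"
    view: int                     # viewpoint number (multi-view video)
    serial: int                   # slice serial number
    resolution: Tuple[int, int]   # (width, height)
    bitrate_bps: int


SourceKey = Tuple[str, int, int]          # (video, view, serial) fetched from the server
OutputKey = Tuple[Tuple[int, int], int]   # (resolution, bitrate) produced by transcoding


def aggregate(requests: List[SliceRequest]) -> Dict[SourceKey, Dict[OutputKey, List[str]]]:
    """Group requests by source slice, then merge requests with identical outputs.

    Each SourceKey is fetched from the streaming server once; each OutputKey
    under it is transcoded once and pushed to every user listed.
    """
    plan: Dict[SourceKey, Dict[OutputKey, List[str]]] = defaultdict(lambda: defaultdict(list))
    for r in requests:
        plan[(r.video, r.view, r.serial)][(r.resolution, r.bitrate_bps)].append(r.user)
    # Convert the nested defaultdicts to plain dicts for a stable result.
    return {src: dict(outs) for src, outs in plan.items()}


if __name__ == "__main__":
    uhd, fuhd = (3840, 2160), (7680, 4320)
    reqs = [
        SliceRequest("A", "Sport", 1, 1, uhd, 15_000_000),
        SliceRequest("B", "Sport", 2, 1, uhd, 20_000_000),
        SliceRequest("C", "Sport", 2, 1, fuhd, 30_000_000),
        SliceRequest("D", "Sport", 2, 1, fuhd, 40_000_000),
    ]
    for src, outs in aggregate(reqs).items():
        print("fetch", src, "->", outs)
```

Keying on the source slice first keeps server traffic proportional to the number of distinct slices rather than the number of users. Keying on output parameters second avoids repeating identical transcodes.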

The scheme offers the following advantages:

- Low server cost. Stereoscopic video sources are diverse, yet the scheme only performs simple slicing of the original video, so its complexity and storage pressure are low.
- Uniform client format. User equipment varies widely, but every user receives traditional binocular video slices, which a conventional video decoder can decode and render directly.
- Low terminal requirements. Most stereoscopic formats (traditional video + depth map and multi-view video) demand high terminal computing capability. Moving the transcoding to the base station lowers the requirements on user equipment and preserves good compatibility.
- Multi-user efficiency. User groups in the same cell tend to watch highly similar content. Aggregating and analyzing their requests reduces the traffic pressure on the streaming media server and the amount of computation at the base station.
- Single-user adaptation. The radio resources allocated to each user change constantly. The user can adapt the bitrate of the requested video slices to the terminal capability, network conditions, etc., making maximal use of network resources and ensuring smooth playback.
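On the client side, bitrate adaptation needs a rule that turns recent network measurements into the bitrate field of the next request. The patent does not specify one. The sketch below uses a common heuristic: the harmonic mean of the last few throughput samples, scaled by a safety margin and clamped to the source bitrate. The function name and its defaults are assumptions:

- a 0.9 safety margin;
- a 5-sample window;
- a 1 Mbps floor;
- a 50 Mbps ceiling, taken from the source bitrate in the embodiment.

```python
from statistics import harmonic_mean
from typing import Sequence


def next_bitrate_bps(throughput_bps: Sequence[float],
                     source_bps: float = 50e6,
                     floor_bps: float = 1e6,
                     safety: float = 0.9,
                     window: int = 5) -> int:
    """Choose the bitrate to request for the next slice (heuristic sketch).

    The harmonic mean is dominated by the lowest samples, so a sudden
    throughput drop lowers the requested bitrate quickly.
    """
    recent = list(throughput_bps)[-window:]
    if not recent:
        return int(floor_bps)  # no measurements yet: start conservatively
    estimate = safety * harmonic_mean(recent)
    # Requesting more than the source bitrate gains nothing, so cap it there.
    return int(min(max(estimate, floor_bps), source_bps))
```

A user whose recent samples are 30, 30 and 10 Mbit/s would request about 18 Mbit/s with `safety=1.0`. A mean-based rule would request about 23 Mbit/s and risk stalling.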

Drawings

(1) Fig. 1 is a schematic view of the present invention.

(2) FIG. 2 is a schematic flow chart of the method of the present invention.

(3) FIG. 3 is a schematic diagram of an embodiment.

4. Examples of specific embodiments

To further illustrate how the present invention is practiced, an exemplary embodiment is given below. This example merely illustrates the principles of the invention and does not limit it.

Suppose an HTTP streaming server is to deploy "Sport", an 800 s multi-view video with 6 views in YUV 4:2:0 format. Four users, A, B, C and D, are connected to the base station:

- Users A and B can play video at resolutions up to 3840x2160.
- Users C and D can play video at resolutions up to 7680x4320.
- All four users support frame rates up to 60 fps and all audio/video compression and encapsulation formats.
- The average available bandwidth of users A, B, C and D is 15 Mbps, 20 Mbps, 30 Mbps and 40 Mbps, respectively.

The method proceeds as follows:

a) Encoding and slicing the original video. After encoding, the parameters are:
- video compression format HEVC, audio compression format AAC, encapsulation format MPEG-4;
- resolution 7680x4320 at 60 fps;
- GOP length 30, with an IDR frame at the start of each GOP;
- compressed video bitrate 50 Mbps.

Because the original video contains 6 viewpoints, encoding yields 6 groups of video files with viewpoint numbers 01-06. Each group is sliced into 2 s slices with serial numbers 1-400. This description information is stored in the description file "MVV_Sport".

b) Suppose first that users A, B, C and D request this video through the same base station. Each user sends an HTTP GET request to obtain the video information description file "MVV_Sport".

c) Users A, B, C and D send HTTP POST requests to the HTTP streaming media server for the slices with serial numbers 1-400. Their requirements are:
- User A: resolution 3840x2160, bitrate 15 Mbps, viewpoint 01.
- User B: resolution 3840x2160, bitrate 20 Mbps, viewpoint 02.
- User C: resolution 7680x4320, bitrate 30 Mbps, viewpoint 02.
- User D: resolution 7680x4320, bitrate 40 Mbps, viewpoint 02.

The users' other requirements for the video slices are identical and are not listed here. This information is written into the body of the POST request.

d) After receiving the HTTP POST requests, the base station parses and aggregates them. It determines that users A, B, C and D request the same video source, and that the required video slices come from the slice groups with viewpoint numbers 01 and 02. It then requests, in sequence, the slices with serial numbers 1-400 of these two view groups from the HTTP streaming server.

e) The base station transcodes the video according to the requirements in the HTTP POST request bodies.
- A slice with view number 01 comprises a conventional video slice and a depth map slice, which are converted into a binocular video slice by 2D-to-3D conversion. From it, a 3840x2160 slice at 15 Mbps is generated for user A by downsampling and rate control.
- The slice with view number 02 undergoes the same 2D-to-3D conversion. From it, a 3840x2160 slice at 20 Mbps is generated for user B by downsampling and rate control. Slices at 7680x4320 and 30 Mbps and 40 Mbps are generated for users C and D by rate control.

The transcoded slices are pushed to the corresponding users.

f) Steps c), d) and e) are repeated until all slices with serial numbers 1-400 have been pushed to the users or the users cancel the service. During this process a user's available bandwidth may change at any time, so the user's HTTP POST requests may carry new bitrate requirements. The base station generates, for each request, video slices that meet the requirement, thereby realizing adaptive-bitrate transmission.
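The slicing parameters of step a) can be checked with a short calculation. A 2 s slice at 60 fps is 120 frames, i.e. four 30-frame GOPs per slice, and 800 s gives 400 slices per view.

The sketch below, with a hypothetical function name, performs these checks and estimates storage. It assumes the 50 Mbps bitrate applies to each view and ignores audio and container overhead. Under that assumption each slice is 12.5 MB and the six views total 30 GB.

```python
def slicing_plan(total_s: int, slice_s: int, fps: int, gop_len: int,
                 bitrate_bps: int, views: int) -> dict:
    """Check that slices hold whole, fixed-length GOPs and size the deployment.

    Assumes a constant bitrate per view and ignores audio/container overhead.
    """
    if total_s % slice_s:
        raise ValueError("duration is not a whole number of slices")
    frames_per_slice = slice_s * fps
    if frames_per_slice % gop_len:
        raise ValueError("a slice would split a GOP")
    slices = total_s // slice_s
    slice_bytes = bitrate_bps * slice_s // 8
    return {
        "slices_per_view": slices,
        "gop_duration_s": gop_len / fps,
        "gops_per_slice": frames_per_slice // gop_len,
        "slice_bytes": slice_bytes,
        "total_bytes": slice_bytes * slices * views,
    }


if __name__ == "__main__":
    # Parameters of the "Sport" embodiment.
    print(slicing_plan(800, 2, 60, 30, 50_000_000, 6))
```

Rejecting a GOP length that does not divide the slice enforces the requirement that every slice contains the same whole number of independently decodable GOPs.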

Claims

1. An adaptive stereoscopic video transmission method that is compatible with existing streaming media servers and terminal video decoders and uses the computing capability of an open base station to participate in the stereoscopic video transmission service. The method reduces the storage and traffic pressure on the server, lowers the computing requirements on user terminals, and realizes adaptive streaming of multiple kinds of stereoscopic video. The method is characterized by the following steps:

a) before the service is initiated, the server and the user agree on a media information description format, and the resulting description file is obtained by both the user and the base station; the base station aggregates and analyzes user requests and forwards them;

b) the encoding performed when the server deploys a video source constrains only the GOP structure and leaves other encoding parameters unaffected; the slicing operation is simple and does not alter the encoded video content; and the server deploys a single copy of the original video, without redundancy;

c) the video slices received by the client are conventional video slices, are compatible with existing video decoders, and require no computation beyond decoding;

d) the base station participates in the service and plays a key role: it classifies and merges user requests, reducing the number of video slice stream fetches from the server and hence the server's traffic pressure; it converts stereoscopic video slices into traditional binocular video slices, guaranteeing compatibility with user equipment; and it generates video slices that meet each user's requirements, realizing adaptive transmission.

2. The adaptive stereoscopic video transmission method of claim 1, wherein the streaming media server is any server capable of providing streaming media services, with transmission protocols including HTTP and RTP/RTSP, and the base station is any radio station that has computing capability and provides mobile communication access.

3. The adaptive stereoscopic video transmission method according to claim 1, wherein the stereoscopic video includes binocular stereoscopic video, monocular/binocular stereoscopic video with a depth map, and multi-view video with or without a depth map.

4. The adaptive stereoscopic video transmission method of claim 1, wherein the adaptation comprises adaptive adjustment of the video compression format, audio compression format and encapsulation format, and dynamic adaptation of the resolution, frame rate and bitrate.