CN117440176A - Method, apparatus, device and medium for video transmission


Info

Publication number
CN117440176A
Authority
CN
China
Prior art keywords
video stream
communication connection
video
terminal device
area network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210834758.XA
Other languages
Chinese (zh)
Inventor
桂永强 (Gui Yongqiang)
施澍 (Shi Shu)
徐斌 (Xu Bin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Douyin Vision Co Ltd
Original Assignee
Douyin Vision Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Douyin Vision Co Ltd filed Critical Douyin Vision Co Ltd
Priority to CN202210834758.XA
Publication of CN117440176A
Legal status: Pending


Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 — Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/2187 — Live feed
    • H04N 21/21805 — Source of audio or video content enabling multiple viewpoints, e.g. using a plurality of cameras
    • H04N 21/440263 — Reformatting operations of video signals by altering the spatial resolution, e.g. for displaying on a connected PDA
    • H04N 21/816 — Monomedia components involving special video data, e.g. 3D video

Abstract

Embodiments of the present disclosure relate to methods, apparatuses, devices, and media for video transmission. The method includes obtaining a first video stream from a capture device via a first communication connection, the first communication connection being a local area network connection. The method further includes receiving pose information of a first terminal device via a second communication connection, the second communication connection being a wide area network connection. The method also includes generating a second video stream based on the pose information and the first video stream, the second video stream including the portion of the first video stream that corresponds to the pose information. The method further includes transmitting the second video stream to the first terminal device via the second communication connection. According to embodiments of the disclosure, a high-definition video stream can be transcoded in real time, at the capture end, into the video stream actually presented at the terminal device, so that the complete video stream need not be uploaded and the required uplink transmission bandwidth is reduced. High-definition video distribution can therefore be achieved with lower network transmission resources, expanding the application scenarios of high-definition video.

Description

Method, apparatus, device and medium for video transmission
Technical Field
Embodiments of the present disclosure relate to the field of communications technology and, more particularly, to methods, apparatuses, devices, processing devices, computer-readable storage media, and computer program products for video transmission.
Background
In recent years, live broadcast services have achieved tremendous success. However, existing live broadcast solutions are largely limited to 2D planar video. With the development of Virtual Reality (VR) technology, live panoramic video broadcasting can be expected to have very broad prospects. For VR panoramic video live broadcast, only resolutions of 8K or even higher give the user a good experience.
However, resolutions up to 8K present a significant challenge for the real-time processing and transmission of video. Existing network transmission and video processing capabilities are hard pressed to meet the resulting high-bandwidth, low-latency requirements.
Disclosure of Invention
In view of the above, embodiments of the present disclosure propose a solution for video transmission, which can solve or alleviate the above-mentioned problems faced by high-definition video transmission.
According to a first aspect of the present disclosure, a method for video transmission is provided. The method includes obtaining a first video stream from a capture device via a first communication connection, the first communication connection being a local area network connection. The method further includes receiving pose information of a first terminal device via a second communication connection, the second communication connection being a wide area network connection. The method further includes generating a second video stream based on the pose information and the first video stream, the second video stream including the portion of the first video stream corresponding to the pose information. The method further includes transmitting the second video stream to the first terminal device via the second communication connection. In this way, the high-definition video stream can be transcoded in real time, at the capture end, into the video stream actually presented at the terminal device, so that the complete video stream need not be uploaded and the required uplink transmission bandwidth is reduced. High-definition video distribution can therefore be achieved with lower network transmission resources, expanding the application scenarios of high-definition video.
In some embodiments of the first aspect, the first communication connection has a greater bandwidth than the second communication connection. In this way, a high-bitrate video stream can be received from the photographing device, while the lower-bitrate video stream obtained by transcoding it is transmitted onward, even in an environment where network transmission capability is limited.
In some embodiments of the first aspect, the first communication connection comprises a wired local area network connection or a wireless local area network connection, and the second communication connection comprises a cellular network connection. Cellular networks include, for example, 4G networks, 3G networks, and the like. In this way, high definition video transmission can be achieved in existing network environments without requiring additional infrastructure.
In some embodiments of the first aspect, generating the second video stream may include: determining a viewport region for the first terminal device in the first video stream based on the pose information; determining a boundary region surrounding the viewport region in the first video stream; and generating the second video stream based on the viewport region, the boundary region, and the first video stream. In some embodiments, the first terminal device may be, for example, a head-mounted VR device; as the user turns his head, the pose of the VR device changes and the field of view changes accordingly. In this way, only the part of the video content within the user's field of view needs to be transmitted to the terminal device, rather than the complete video content, which reduces the required uplink transmission bandwidth. Moreover, by transmitting a boundary portion larger than the field of view, when the terminal device user is in motion (e.g., turning the head), the already-acquired boundary portion of the picture can be presented without video content having to be requested over the network. The perceived delay of picture changes can thus be reduced, improving the user experience.
In some embodiments of the first aspect, determining the boundary region surrounding the viewport region in the first video stream may comprise: acquiring a motion-to-response delay of the first terminal device; and determining the boundary region based on the motion-to-response delay and the viewport region. In this way, the size of the boundary region can be set flexibly based on dynamic network conditions. For example, the boundary region may be made smaller when the delay is low and network conditions are good, and larger when the delay is high and network conditions are poor.
In some embodiments of the first aspect, the method may further comprise transmitting the second video stream to a second terminal device associated with the first terminal device. In some scenarios, the second terminal device may be a slave device of the first terminal device. When a plurality of users each watch from a different viewing angle, a corresponding video must be generated for each user separately, at high system overhead. To reduce the transcoding resource overhead and the upload bandwidth requirement, master and slave devices can be arranged: only the master device feeds back pose information and obtains the corresponding video stream, while the slave devices cannot control the viewing angle and receive the same video stream as the master device. This expands the application scenarios of high-definition live video, suiting scenarios such as game commentary and online classes, where a commentator controls the viewing-angle changes and other viewers follow along.
In some embodiments of the first aspect, the method may be performed by a processing device, the processing device and the photographing device being included in a movable device. In some embodiments, the movable device may be, for example, a balance car, a drone, a robot, or a handheld device. In this way, the mobility of the video capture and transmission equipment is enhanced, expanding the scenarios for high-definition live video applications.
In some embodiments of the first aspect, the method may further comprise: receiving a movement control instruction from the first terminal device; and controlling movement of the movable device and/or the photographing device based on the movement control instruction. In this way, a user of the terminal device can control the video content captured by the photographing device, for example, directing the movable device to a desired place to capture video, thereby expanding the scenarios for high-definition live video applications.
In some embodiments of the first aspect, the first video stream may be a panoramic video stream. For example, the photographing device includes one or more fisheye cameras whose captured images may be stitched into a 360-degree panoramic video stream. In this way, the user can get an immersive virtual reality experience.
According to a second aspect of the present disclosure, an apparatus for video transmission is provided. The device comprises a video stream acquisition unit, a receiving unit, a generating unit and a transmission unit. The video stream acquisition unit is configured to acquire a first video stream from the photographing apparatus via a first communication connection, the first communication connection being a local area network connection. The receiving unit is configured to receive the pose information of the first terminal device via a second communication connection, the second communication connection being a wide area network connection. The generating unit is configured to generate a second video stream based on the pose information and the first video stream, the second video stream including a portion of the first video stream corresponding to the pose information. The transmission unit is configured to transmit the second video stream to the first terminal device via the second communication connection.
Some embodiments of the second aspect may have units that perform the actions or functions described in the first aspect, which may achieve similar advantageous effects as those of the first aspect. For brevity, this is not repeated here.
According to a third aspect of the present disclosure, there is provided a device for video capture and transmission. The device comprises a photographing device for capturing a first video stream. The device further comprises a processing device coupled with the photographing device via a first communication connection, the first communication connection being a local area network connection, and coupled with a first terminal device via a second communication connection, the second communication connection being a wide area network connection. The processing device is configured to obtain the first video stream from the photographing device via the first communication connection; receive pose information of the first terminal device via the second communication connection; generate a second video stream based on the pose information and the first video stream, the second video stream including a portion of the first video stream corresponding to the pose information; and transmit the second video stream to the first terminal device via the second communication connection.
Some embodiments of the third aspect may have units that perform the actions or functions described in the first aspect, which may achieve similar advantageous effects as those of the first aspect. For brevity, this is not repeated here.
According to a fourth aspect of the present disclosure, there is provided a processing device comprising at least one processing unit and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the processing device to perform the method according to the first aspect of the present disclosure.
According to a fifth aspect of the present disclosure, there is provided a computer readable storage medium comprising machine executable instructions which, when executed by a device, cause the device to perform a method according to the first aspect of the present disclosure.
According to a sixth aspect of the present disclosure, there is provided a computer program product comprising machine executable instructions which, when executed by a device, cause the device to perform the method according to the first aspect of the present disclosure.
The summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the disclosure, nor is it intended to be used to limit the scope of the disclosure.
Drawings
The foregoing and other objects, features, and advantages of the disclosure will be apparent from the following more detailed description of exemplary embodiments of the disclosure as illustrated in the accompanying drawings, in which like reference numbers generally represent like parts throughout the exemplary embodiments of the disclosure.
FIG. 1 illustrates a schematic diagram of an example environment in which various embodiments of the present disclosure may be implemented;
FIG. 2 shows a schematic flowchart of a method for video transmission according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of video transmission from a processing device to a terminal device according to an embodiment of the present disclosure;
FIG. 4 shows a schematic flowchart of a process of generating a video stream for a terminal device according to an embodiment of the disclosure;
FIG. 5 shows a schematic diagram of a picture of a video stream according to an embodiment of the present disclosure;
FIG. 6 shows a schematic block diagram of an apparatus for video transmission according to an embodiment of the present disclosure; and
FIG. 7 shows a schematic block diagram of an example device that may be used to implement embodiments of the present disclosure.
Detailed Description
It will be appreciated that the data involved in the present technical solution (including but not limited to the data itself and its acquisition or use) should comply with the applicable laws and the requirements of the relevant regulations.
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "comprising" and variations thereof as used herein means open ended, i.e., "including but not limited to. The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment. The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
It is noted that the numbers or values used herein are for ease of understanding the technology of the present disclosure, and are not limiting the scope of the present disclosure.
In live broadcasting of high-definition video (for example, 8K or higher), panoramic video is first captured with a photographing device. Currently, the industry usually shoots with a plurality of cameras simultaneously and then combines the multiple video feeds into one panoramic video through stitching. The panoramic video is then encoded to reduce the amount of data to be stored and transmitted, and is distributed to users' terminal devices over the network. A terminal device (e.g., a smartphone or a head-mounted VR device) decodes, renders, and plays the panoramic video after receiving it.
Conventional schemes typically adopt a full-picture approach, which transmits the encoded panoramic video directly to a server (for example, over a 5G network) and then on to the terminal device. This scheme can ensure a timely response when the user's head moves, but it places high demands on uplink and downlink transmission bandwidth. For example, at a frame rate of 30fps, a code rate as high as 60-100Mbps must be sustained; existing home WiFi or 4G networks can hardly meet such a high downlink bandwidth requirement, and such a high code rate also puts great pressure on uplink bandwidth and user tariffs. Moreover, this approach requires uploading the complete panoramic video from the photographing device to the server, which typically requires 5G infrastructure to provide sufficient transmission bandwidth. Where 5G network coverage is insufficient, high-definition live video service therefore cannot be provided.
In view of this, embodiments of the present disclosure provide methods for video transmission. In these methods, a processing device obtains the original live video stream from the capture device via a local area network connection; being local, this connection can offer a higher transmission bandwidth and carry the complete high-definition video content. The processing device also receives pose information of a terminal device via a wide area network connection (remote) between the terminal device and the processing device. The pose information reflects the field of view of the terminal device's user; for example, it may include the orientation of a head-mounted VR device in use and the scale at which the video is viewed. Using the pose information, the processing device generates from the original live video stream a video stream for the terminal device, which includes the portion of the original live stream's picture corresponding to the pose information. The processing device then transmits the generated video stream to the terminal device via the wide area network connection, whereupon the terminal device receives and presents it.
Implementation details of embodiments of the present disclosure are described in detail below with reference to fig. 1 through 7.
FIG. 1 illustrates a schematic diagram of an example environment 100 in which various embodiments of the present disclosure may be implemented. The environment 100 includes a device 110 for video capture and transmission and at least one terminal device 140-1, 140-2, …, 140-N (collectively 140) communicatively coupled to the device 110, where N is any positive integer. The device 110 may be deployed at a live venue, capturing live video content and distributing it to the terminal devices 140. In some embodiments, the device 110 may be a movable device, such as a balance car, drone, robot, or handheld device, that can be controlled to move around the live venue.
As shown, the device 110 includes a processing device 120 and a photographing device 130. The processing device 120 and the photographing device 130 may be physically co-located. The processing device 120 may be any device having computing capability, such as a personal computer, tablet computer, notebook computer, smartphone, server, or the like. In some embodiments, the processing device 120 may be a portable device. The processing device 120 may include a Graphics Processing Unit (GPU) to provide the capability to process video content in real time, e.g., encoding, decoding, and transcoding. The photographing device 130 includes one or more high-definition cameras (e.g., fisheye cameras) that can capture video of the environment surrounding the device 110. The photographing device 130 may stitch together and encode the multiple videos captured by the multiple high-definition cameras to produce a video stream. In some embodiments, the photographing device 130 may be a panoramic capture device and the resulting video stream may be panoramic video. Herein, panoramic video may refer to 360-degree video or video of less than 360 degrees (e.g., 180 degrees, 270 degrees, etc.); the present disclosure does not limit the viewing-angle size of panoramic video.
The photographing device 130 is coupled with the processing device 120 via the first communication connection 101 and may transmit the acquired video stream to the processing device 120 in real time. The first communication connection 101 is a local area network connection. In some embodiments, the first communication connection 101 may be a wired local area network connection, such as an ethernet, coaxial, USB, PCIe connection, or the like. The first communication connection 101 may also be a wireless local area network (Wireless Local Area Network, WLAN) connection, such as WiFi. In general, the first communication connection 101 provides sufficient transmission bandwidth for the capture device 130 to transmit a complete high definition video stream (e.g., panoramic video) to the processing device 120. Using a wireless local area network connection, the processing device 120 may act as a hotspot to which the photographing device 130 is connected. The photographing apparatus 130 and the processing apparatus 120 may be arranged such that they have good and stable signal strength therebetween (e.g., maintain a close distance and aligned antenna angle) to ensure transmission performance therebetween.
The terminal device 140 may be, for example, a smartphone, a personal computer, a tablet computer, a notebook computer, a wearable device (e.g., a head-mounted Virtual Reality (VR) display device, abbreviated VR headset), etc. The terminal device 140 may include a display apparatus for displaying or projecting video content, sensors for sensing the user's pose (e.g., head rotation angle), input-output devices, etc. The terminal device 140 also includes a communication module supporting at least one remote communication capability to access a wide area network, such as a public network, by any wired or wireless means. As shown, the terminal device 140 may be coupled to and communicate with the processing device 120 via the second communication connection 102, e.g., to receive a video stream. The second communication connection 102 is a wide area network connection. In some embodiments, the wide area network connection may comprise a 4G network connection. For example, when the live venue is covered by a 4G network but not by a 5G network, the processing device 120 may connect to the public network via the 4G network, transmitting the live video stream to the terminal device 140. In some embodiments, as described below with reference to FIGS. 2-5, the video stream transmitted by the processing device 120 to the terminal device 140 may be a cropped portion of the video stream from the photographing device 130. Accordingly, the bandwidth required to transmit this video stream in real time to the terminal device 140 may be less than the bandwidth required by the photographing device 130 to transmit its video stream to the processing device 120. Accordingly, the first communication connection 101 is provided with a larger bandwidth than the second communication connection 102.
It should be understood that the environment 100 shown in FIG. 1 is only one example in which embodiments of the present disclosure may be implemented and is not intended to limit the scope of the present disclosure. Embodiments of the present disclosure are equally applicable to other systems or architectures. The environment 100 may include other components not shown, and the components shown in FIG. 1 may be connected and combined in different ways. For example, the processing device 120 and the photographing device 130 may be separate devices rather than being integrated into the device 110. As another example, the environment 100 may also include an application server or edge server located between the device 110 and the terminal devices 140 for managing and controlling the device 110 and the terminal devices 140.
Exemplary processes of video acquisition and transmission schemes of embodiments of the present disclosure are further described below in conjunction with fig. 2-5.
FIG. 2 shows a schematic flowchart of a process 200 for video transmission according to an embodiment of the disclosure. The process 200 may be implemented by the device 110 or the processing device 120 in FIG. 1. For ease of description, the process 200 will be described with reference to FIG. 1, taking the processing device 120 as an example.
At block 210 of FIG. 2, the processing device 120 obtains a first video stream from the photographing device 130 via the first communication connection 101, the first communication connection being a local area network connection. As mentioned above, the photographing device 130 may be, for example, a panoramic capture device of 8K or higher resolution. The photographing device 130 shoots the surrounding environment through fisheye cameras and performs stitching to obtain panoramic video content in the equirectangular projection format. The photographing device 130 may encode the panoramic video content. The encoding format may be any video codec format developed now or in the future, e.g., H.264, H.265 (High Efficiency Video Coding, HEVC), H.266 (Versatile Video Coding, VVC), etc. As an example, when the photographing device 130 shoots in real time at a frame rate of, e.g., 30fps, the code rate of the encoded panoramic video content is about 60Mbps to 100Mbps, the exact value depending on the video content itself and the encoding algorithm.
The first communication connection 101 between the photographing device 130 and the processing device 120 may be a wired connection, such as an Ethernet, coaxial, USB, or PCIe connection. Alternatively, the first communication connection may be a wireless local area network connection, such as WiFi. The capture device 130 encapsulates the encoded video stream with an application layer transport protocol, such as the Real-time Transport Protocol (RTP) or the Real-Time Messaging Protocol (RTMP), and pushes the video stream to the processing device 120 in real time.
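For illustration only, the real-time push step might look like the following minimal Python sketch, which shells out to ffmpeg. The RTMP endpoint URL, codec, and bitrate are assumptions made for this example, not values prescribed by the present disclosure.

```python
# A minimal sketch (not the disclosed implementation) of pushing the
# stitched, encoded panorama to the processing device over the LAN.
import subprocess

def push_panorama(source: str, rtmp_url: str) -> subprocess.Popen:
    """Encode the panorama and push it to the access server via RTMP."""
    cmd = [
        "ffmpeg",
        "-re",              # read the input at its native frame rate
        "-i", source,       # stitched equirectangular input
        "-c:v", "libx264",  # H.264 here; H.265/H.266 are equally possible
        "-b:v", "80M",      # within the 60-100 Mbps range cited above
        "-g", "30",         # one keyframe per second at 30 fps
        "-f", "flv",        # RTMP carries an FLV-muxed stream
        rtmp_url,
    ]
    return subprocess.Popen(cmd)

# Hypothetical usage:
# push_panorama("stitched_pano.mp4", "rtmp://192.168.1.2/live/pano")
```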
Next, in blocks 220 to 240, the processing device 120 generates a video stream, based on the first video stream from the photographing device 130, for presentation on any terminal device 140 and transmits it to that terminal device; the terminal device 140-1 is taken as an example.
Specifically, at block 220, the processing device 120 receives pose information of the first terminal device 140-1 via the second communication connection 102. In some embodiments, the terminal device 140-1 may be a head-mounted VR display device and the pose information may include an orientation of the device. The orientation may reflect the angle at which the user views the video, such as the pitch angle and the yaw angle, where the pitch angle may range over [-90, 90] degrees and the yaw angle over [-180, 180] degrees, thereby covering the panoramic video content. The pitch angle and yaw angle may be detected by sensors (e.g., gravity sensor, inertial sensor, gyroscope, etc.) mounted on the terminal device 140-1. In some embodiments, the pose information may also include a scale, which represents how near or far the user is viewing the video on the terminal device 140-1. The scale may indicate the picture range when the user views the video, and the user may set or change the scale by operating the terminal device 140-1. As an example, when the scale is 1, the user views the original picture range; when it is 0.5, the user views half of the original picture range (i.e., the picture is zoomed in for viewing); and when it is 2, the user views twice the original picture range (i.e., the picture is zoomed out for viewing).
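As a concrete illustration, the pose information might be represented as in the sketch below. The field names, and the sequence index used later for delay measurement, are assumptions of this sketch.

```python
# A minimal sketch of a pose message; field names are illustrative.
from dataclasses import dataclass

@dataclass
class PoseInfo:
    pitch: float  # degrees, in [-90, 90]
    yaw: float    # degrees, in [-180, 180]
    scale: float  # 1.0 = original range, 0.5 = zoomed in, 2.0 = zoomed out
    index: int    # sequence number, echoed back for delay measurement

def validate(pose: PoseInfo) -> None:
    """Reject poses outside the ranges described above."""
    assert -90.0 <= pose.pitch <= 90.0, "pitch out of range"
    assert -180.0 <= pose.yaw <= 180.0, "yaw out of range"
    assert pose.scale > 0.0, "scale must be positive"
```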
At block 230, the processing device 120 generates a second video stream based on the pose information and the first video stream, the second video stream including the portion of the first video stream corresponding to the pose information. In the case where the first video stream is a panoramic video stream whose picture has been converted into a rectangular picture by equirectangular projection, a partial region of the picture of the first video stream is determined, according to the pose information, as the content to be transmitted to the first terminal device 140-1. This partial region, also referred to as the viewport region of the first terminal device 140-1, represents the field of view of the video viewed by the user on the terminal device 140-1.
In some embodiments, the processing device 120 may transcode the first video stream using a high performance processor (e.g., a graphics processing unit, GPU) to generate the second video stream. For example, the processing device 120 may perform a cropping operation on a picture of the first video stream, re-encode the cropped picture as a picture of the second video stream, and thereby generate the second video stream.
At block 240, the processing device 120 transmits the second video stream to the first terminal device 140-1 via the second communication connection 102. As mentioned above, the second communication connection 102 is a wide area network connection, such as a public network. In some embodiments, the wide area network connection may include a cellular network connection such as 4G, 3G. When the live scene is covered by the conventional 4G network but not by the 5G network, the processing device 120 may still transmit the transcoded second video stream to the first terminal device 140-1 via the 4G network.
In some embodiments, the processing device 120 may transmit the second video stream to an application server (e.g., a live application platform) in the wide area network, via which the video is delivered to the first terminal device 140-1. Alternatively, the processing device 120 may also transmit the second video stream directly to the first terminal device 140 without the application server as an intermediary. For example, the processing device 120 may transmit the second video stream to an edge server in the vicinity of the first terminal device 140-1.
The process 200 for video transmission according to an embodiment of the present disclosure has been described above with reference to FIG. 2. Compared with existing schemes, embodiments of the present disclosure transcode the high-definition video stream, in real time at the video capture end, into the video stream presented at the terminal device, without the complete video stream having to be uploaded, thereby reducing the required uplink transmission bandwidth. High-definition video distribution is thus achieved with lower network transmission resources, and the application scenarios of high-definition video are expanded.
Further details of embodiments of the present disclosure are described below with reference to fig. 3-5. Fig. 3 shows a schematic diagram of video transmission from processing device 120 to terminal device 140-1 according to an embodiment of the present disclosure.
The processing device 120 includes an access server 121 and a transcoding server 122. The access server 121 and the transcoding server 122 may be implemented as applications or virtual machines running on the processing device 120. The access server 121 may be, for example, an Nginx server for receiving the first video stream transmitted in real time by the photographing device 130 and pushing it to the transcoding server 122, for example, using ffmpeg. The transcoding server 122 is configured to generate the second video stream from the pose information received from the terminal device 140-1 and the first video stream, and to transmit the second video stream to the first terminal device 140-1.
As shown, the first terminal device 140-1 includes a pose information detector 141. The pose information detector 141 may include, for example, a gravity sensor, an inertial sensor, or a gyroscope for sensing the orientation of the terminal device 140-1 in real time, which may be represented by a pitch angle and a yaw angle. The sensed device orientation may be used to determine the user's viewing-angle range. As an example, 30 degrees above and below the device's pitch angle may be taken as the user's vertical viewing angle, i.e., a vertical viewing angle of 60 degrees, and 60 degrees to the left and right of the device's yaw angle as the user's lateral viewing angle, i.e., a lateral viewing angle of 120 degrees. In addition, the pose information detector may obtain the scale at which the user is currently viewing the video. The scale represents the degree to which the lens is zoomed in or out as the user views the video. It will be appreciated that as the lens zooms in, the range of the picture the user can see becomes smaller, and as the lens zooms out, that range becomes larger. Over the wide area network connection (e.g., public network) mentioned above, the first terminal device 140-1 may send the detected device orientation and/or scale to the pose information receiver 123 of the transcoding server 122 in real time as pose information.
Meanwhile, at the processing device 120, the access server 121 pushes the real-time video stream from the photographing device 130 to the transcoding server 122. The transcoding server 122 then generates a video stream for the terminal device 140-1 from the received pose information and the real-time video stream from the photographing device 130.
Referring to FIG. 4, a schematic flowchart of a process 400 of generating a video stream for a terminal device according to an embodiment of the disclosure is shown. The process 400 may serve as an example implementation of block 230 in FIG. 2 and may be implemented by the processing device 120 or the transcoding server 122. For ease of description, the process 400 will be described with reference to FIG. 3, taking the processing device 120 as an example.
At block 410, the processing device 120 determines a viewport region for the first terminal device 140-1 in the first video stream based on the pose information of the first terminal device 140-1. FIG. 5 shows a schematic diagram of a picture of a video stream according to an embodiment of the disclosure, where region 500 represents the complete picture of the first video stream and region 510 is the determined viewport region.
The decoder 124 decodes the received first video stream to obtain a sequence of pictures. The pictures of the first video stream may be in the equirectangular projection format, having a rectangular shape with a pixel height (H) and a pixel width (W). As an example, the pose information may include an orientation (pitch angle and yaw angle) and/or a scale of the first terminal device 140-1. The cropping unit 125 in the transcoding server 122 may determine, based on the device orientation and/or the scale, a partial region of the rectangular picture as the viewport region for the first terminal device 140-1. The picture within the viewport region falls within the user's field of view and needs to be transmitted to and displayed on the first terminal device 140-1. The extent of the viewport region 510 can be represented by the coordinates of its boundary pixels.
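Under the equirectangular mapping just described, the viewport determination of block 410 might be sketched as follows. The 120-degree lateral and 60-degree vertical fields of view follow the earlier example; the function and parameter names are assumptions of this sketch.

```python
# A sketch of mapping pose to a viewport rectangle in a W x H
# equirectangular picture; simplifications apply (e.g., the angular
# width of a pixel row actually varies with latitude).
def viewport_rect(pitch: float, yaw: float, scale: float,
                  W: int, H: int,
                  hfov: float = 120.0, vfov: float = 60.0):
    """Return (left, top, width, height) of the viewport in pixels."""
    cx = (yaw + 180.0) / 360.0 * W    # 360 degrees of yaw span W pixels
    cy = (90.0 - pitch) / 180.0 * H   # 180 degrees of pitch span H pixels
    vw = hfov * scale / 360.0 * W     # scale widens or narrows the view
    vh = vfov * scale / 180.0 * H
    left = cx - vw / 2.0              # may go negative: wraps in yaw
    top = max(0.0, cy - vh / 2.0)
    return int(left), int(top), int(vw), int(min(vh, H - top))
```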
At block 420, the processing device 120 determines a boundary region around the viewport region in the first video stream. The processing device 120 may determine, as the boundary region, the pixels beyond the left and right boundaries and beyond the upper and lower boundaries of the viewport region in the picture of the first video stream. As shown in FIG. 5, region 520 is the determined boundary region.
Although the boundary region 520 is not within the user's field of view, it is beneficial to transmit its image to the first terminal device 140-1. This is because high-definition panoramic video has stringent latency requirements. The picture change corresponding to a user movement must be presented within about 30ms, otherwise the user may perceive a noticeable delay or even feel dizzy. However, even under good network conditions, the network delay between the photographing device and the user often exceeds 100ms, and the encoding and decoding of the video adds further delay. With the boundary region 520, when the user changes the viewport region 510 of the terminal device 140-1 by moving his head, the first terminal device 140-1 can present the already-acquired picture of the boundary region without requesting video content over the network. This reduces the delay of the picture change.
In some embodiments, the size of the boundary region 520 may be determined according to network conditions. Specifically, at block 422, the processing device 120 obtains a motion-to-response delay of the first terminal device 140-1. This delay may be measured by means of the pose information transmission. For example, the first terminal device 140-1 may carry an index in the pose information sent to the processing device 120, and the processing device 120 may carry the same index in the video stream generated for that pose information. The first terminal device 140-1 can thereby measure the delay experienced from the user's head movement to the receipt of the responsive video stream. The measured delay may be transmitted to the processing device 120 periodically, for example, every 5 seconds, 10 seconds, 30 seconds, or at any other interval.
At block 424, the processing device 120 determines the boundary region 520 based on the motion-to-response delay and the viewport region 510. In some embodiments, the processing device 120 may determine, based on the motion-to-response delay, the pixel width by which the boundary extends outward from the viewport region 510. As an example, when the delay is low and network conditions are good, the outward extension may be narrower, yielding a smaller boundary region; when the delay is high and network conditions are poor, the extension may be wider, yielding a larger boundary region. The size of the boundary region is thus set flexibly.
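One possible realization of blocks 422 and 424 is sketched below; the coefficient mapping delay to pixels and the caps are assumptions chosen for illustration.

```python
# A sketch of sizing the boundary region from the motion-to-response
# delay: more delay -> a wider margin around the viewport.
def boundary_margin(delay_ms: float, W: int,
                    px_per_ms: float = 2.0, min_px: int = 64) -> int:
    """Margin width in pixels, capped at a quarter of the picture."""
    return max(min_px, min(int(delay_ms * px_per_ms), W // 4))

def expand_rect(left: int, top: int, width: int, height: int,
                margin: int, W: int, H: int):
    """Grow the viewport rectangle by the margin on all four sides."""
    new_top = max(0, top - margin)
    new_bottom = min(H, top + height + margin)
    return (left - margin,            # horizontal wrap handled at cropping
            new_top,
            width + 2 * margin,
            new_bottom - new_top)
```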
At block 430, the processing device 120 generates the second video stream based on the viewport region, the boundary region, and the first video stream. Specifically, the cropping unit 125 crops, from the pictures of the first video stream, the picture to be transmitted to the first terminal device 140-1 according to the extents of the viewport region and the boundary region. As the photographing device's video stream and the pose information arrive in real time, the cropping unit 125 generates a sequence of pictures in real time. The encoder 126 then encodes and encapsulates the picture sequence to generate the second video stream. The pictures of the second video stream cover a reduced range (e.g., 1080p or less) compared to the pictures of the first video stream, so the code rate of the video stream is reduced: compared with the 60Mbps-100Mbps code rate of an 8K-resolution panoramic video, the code rate of the transcoded video stream is approximately 2Mbps to 8Mbps. The video stream for the terminal device can thus be transmitted over a network with lower bandwidth. As described with reference to FIG. 2, it may be uploaded via the second communication connection, such as a 4G network.
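The cropping of block 430 might be sketched as follows, with numpy arrays standing in for decoded pictures; the wrap-around handling at the yaw seam and the function name are assumptions of this sketch.

```python
# A sketch of cropping one decoded panorama frame to the expanded
# viewport; encoding of the cropped sequence is elided.
import numpy as np

def crop_frame(frame: np.ndarray, left: int, top: int,
               width: int, height: int) -> np.ndarray:
    """frame is an H x W x 3 picture; the crop may wrap across the seam."""
    H, W = frame.shape[:2]
    cols = np.arange(left, left + width) % W  # a negative left wraps around
    rows = slice(max(0, top), min(H, top + height))
    return frame[rows][:, cols]
```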
With continued reference to fig. 3, the decoder 142 of the first terminal apparatus 140-1 decodes the received second video stream and then transmits the decoded video to the renderer 143. The renderer 143 presents the second video stream to the user, e.g., projects a picture in front of the user's eyes or displays on a screen.
In some embodiments, multiple users view from different viewing angles, and a corresponding video then needs to be generated for each user separately, which is costly. Where the video processing performance of the processing device 120 is limited (e.g., the processing device 120 is a smartphone or a personal computer), the processing device 120 may generate a dedicated video stream only for a terminal device defined as the master device; other devices may be associated with the master device as its slave devices, and the slave devices share the same video stream with the master device.
As an example, assume that the first terminal device 140-1 is the master device and the second terminal device 140-2 through the Nth terminal device 140-N are slave devices. In this case, the second video stream for the first terminal device 140-1 may also be transmitted to the second through Nth terminal devices 140-2 to 140-N. It should be noted that the number of slave devices associated with a master device may be arbitrary, and the number of master devices the processing device 120 allows may likewise be arbitrary, depending on its capabilities; the present disclosure does not limit this. In this way, the transcoding resource overhead and upload bandwidth requirements of the processing device 120 are reduced. The application scenarios of high-definition live video are thereby expanded, suiting scenarios such as game commentary and online classes, where a commentator controls the viewing-angle changes and other viewers follow along.
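The master/slave bookkeeping could look like the following sketch; the class and method names are illustrative assumptions, not taken from the disclosure.

```python
# A sketch of master/slave sharing: only the master's pose drives
# transcoding, and every device in the session receives the same stream.
class ViewingSession:
    def __init__(self, master_id: str):
        self.master_id = master_id
        self.slave_ids: set[str] = set()

    def accepts_pose_from(self, device_id: str) -> bool:
        """Pose feedback from slaves is ignored by design."""
        return device_id == self.master_id

    def recipients(self) -> list[str]:
        """The master and all of its slaves get the same second stream."""
        return [self.master_id, *self.slave_ids]
```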
In addition, as mentioned above, to enhance mobility, the photographing device 130 and the processing device 120 may be integrated in the same device 110, and the device 110 may be a movable device, such as a balance car, a drone, a robot, or a handheld device. In some embodiments, the processing device 120 may also receive movement control instructions from the first terminal device 140-1 via the second communication connection 102 (e.g., a TCP-based connection) and control movement of the device 110 and/or the photographing device based on those instructions. For example, combining live broadcasting with virtual reality applications, the user may direct the device 110 to move to a specified location and control the photographing device 130 to shoot in a specified pose. In this way, spatial limitations are overcome and the scenarios for high-definition live video applications are expanded.
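The movement-control path might be served as in the sketch below; the line-delimited JSON wire format and the command fields are assumptions made for illustration.

```python
# A sketch of receiving movement control instructions over TCP and
# forwarding them to the mobile platform via a `drive` callback.
import json
import socket

def serve_motion_control(port: int, drive) -> None:
    """drive(direction, speed) actuates the movable device."""
    with socket.create_server(("", port)) as srv:
        conn, _ = srv.accept()
        with conn, conn.makefile("r") as reader:
            for line in reader:
                cmd = json.loads(line)  # e.g. {"dir": "forward", "speed": 0.5}
                drive(cmd["dir"], cmd.get("speed", 0.0))
```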
Fig. 6 shows a schematic block diagram of an apparatus 600 for video transmission according to an embodiment of the disclosure. The apparatus 600 may be arranged at the device 110 or the processing device 120.
As shown, the apparatus 600 includes a video stream acquisition unit 610, a receiving unit 620, a generating unit 630, and a transmission unit 640. The video stream acquisition unit 610 is configured to acquire a first video stream from a photographing device via a first communication connection, the first communication connection being a local area network connection. The receiving unit 620 is configured to receive pose information of a first terminal device via a second communication connection, the second communication connection being a wide area network connection. The generating unit 630 is configured to generate a second video stream based on the pose information and the first video stream, the second video stream including a portion of the first video stream corresponding to the pose information. The transmission unit 640 is configured to transmit the second video stream to the first terminal device via the second communication connection.
In some embodiments, the first communication connection may have a greater bandwidth than the second communication connection.
In some embodiments, the first communication connection may comprise a wired local area network connection or a wireless local area network connection, and the second communication connection may comprise a cellular network connection.
In some embodiments, the generating unit 630 may be further configured to: determine a viewport region for the first terminal device in the first video stream based on the pose information; determine a boundary region surrounding the viewport region in the first video stream; and generate the second video stream based on the viewport region, the boundary region, and the first video stream.
In some embodiments, the generating unit 630 may be further configured to obtain a motion-to-response delay of the first terminal device, and determine the boundary region based on the motion-to-response delay and the viewport region.
In some embodiments, the transmission unit 640 may be further configured to transmit the second video stream to a second terminal device associated with the first terminal device. Here, the second terminal device may be a slave device of the first terminal device. The slave device shares the video stream with the master device, but its pose information is not fed back to, or responded to by, the processing device.
In some embodiments, the apparatus 600 may be implemented by the processing device 120, and the processing device 120 and the photographing device 130 may be included in the movable device 110.
In some embodiments, the receiving unit 620 may be further configured to receive a movement control instruction from the first terminal device. The apparatus 600 may further include a movement control unit configured to control movement of the movable device and/or the photographing device based on the movement control instruction.
In some embodiments, the first video stream may be a panoramic video stream. The first video stream may be a high definition video of 8K or higher.
Fig. 7 shows a schematic block diagram of an example device 700 that may be used to implement embodiments of the present disclosure. For example, processing device 120 according to embodiments of the present disclosure may be implemented by device 700. As shown, the device 700 includes a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU) 701, which may perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 702 or loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The CPU/GPU 701, ROM 702, and RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks. The communication unit 709 may include various types of local area network communication modules and wide area network communication modules.
The various processes and procedures described above, such as the processes 200 and/or 400, may be performed by the CPU/GPU 701. For example, in some embodiments, the processes 200 and/or 400 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the CPU/GPU 701, one or more actions of the processes 200 and/or 400 described above may be performed.
The present disclosure may be methods, apparatus, systems, and/or computer program products. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: portable computer disk, hard disk, random Access Memory (Random Access Memory, RAM), read-Only Memory (ROM), erasable programmable Read-Only Memory (Electrical Programmable Read Only Memory, EPROM or flash Memory), static Random-Access Memory (SRAM), portable compact disc Read-Only Memory (Compact Disc Read-Only Memory, CD-ROM), digital versatile disk (Digital Video Disc, DVD), memory stick, floppy disk, mechanical coding devices such as punch cards or in-groove protrusion structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing operations of the present disclosure can be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (Local Area Network, LAN) or a wide area network (Wide Area Network, WAN), or it may be connected to an external computer (e.g., through the internet using an internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field programmable gate arrays (Field Programmable Gate Array, FPGAs), or programmable logic arrays (Programmable Logic Array, PLAs), with state information of computer readable program instructions, which electronic circuitry can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The embodiments of the present disclosure have been described above. The foregoing description is illustrative, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (15)

1. A method for video transmission, comprising:
obtaining a first video stream from a photographing device via a first communication connection, the first communication connection being a local area network connection;
receiving pose information of a first terminal device via a second communication connection, the second communication connection being a wide area network connection;
generating a second video stream based on the pose information and the first video stream, the second video stream including a portion of the first video stream corresponding to the pose information; and
transmitting the second video stream to the first terminal device via the second communication connection.
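
An illustrative, non-limiting Python sketch of the method of claim 1 follows; it is not part of the claims, and the frame sizes, the equirectangular projection, and all identifiers are assumptions made here for clarity. A panoramic frame arrives over the local area network, a pose arrives over the wide area network, and only the pose-dependent portion goes back over the wide area network.

    import numpy as np

    FRAME_W, FRAME_H = 3840, 1920   # assumed equirectangular panorama size
    VIEW_W, VIEW_H = 1280, 720      # assumed viewport size

    def viewport_origin(yaw_deg: float, pitch_deg: float):
        """Map a yaw/pitch pose to the top-left pixel of the viewport."""
        x = int((yaw_deg % 360.0) / 360.0 * FRAME_W)
        y = int((90.0 - pitch_deg) / 180.0 * FRAME_H)
        return x, max(0, min(y, FRAME_H - VIEW_H))

    def extract_viewport(frame, yaw_deg: float, pitch_deg: float):
        """Crop the portion of the panorama that the pose points at."""
        x, y = viewport_origin(yaw_deg, pitch_deg)
        rolled = np.roll(frame, -x, axis=1)   # horizontal wrap-around of the panorama
        return rolled[y:y + VIEW_H, :VIEW_W]

    # One loop iteration: LAN frame in, WAN pose in, WAN portion out.
    panorama = np.zeros((FRAME_H, FRAME_W, 3), np.uint8)  # stand-in for a frame received over the LAN
    pose = {"yaw": 30.0, "pitch": -10.0}                  # stand-in for a pose message received over the WAN
    portion = extract_viewport(panorama, pose["yaw"], pose["pitch"])
    assert portion.shape == (VIEW_H, VIEW_W, 3)           # one frame of the "second video stream"

In practice the cropped portion would feed a video encoder rather than being sent raw; the point is that the wide area network carries the 1280x720 portion instead of the full 3840x1920 panorama.
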
2. The method of claim 1, wherein the first communication connection has a greater bandwidth than the second communication connection.
3. The method of claim 2, wherein the first communication connection comprises a wired local area network connection or a wireless local area network connection and the second communication connection comprises a cellular network connection.
4. The method of claim 1, wherein generating the second video stream based on the pose information and the first video stream comprises:
determining a viewport region for the first terminal device in the first video stream based on the pose information;
determining a boundary region surrounding the viewport region in the first video stream; and
generating the second video stream based on the viewport region, the boundary region, and the first video stream.
5. The method of claim 4, wherein determining the boundary region surrounding the viewport region in the first video stream comprises:
acquiring a motion-to-response delay of the first terminal device; and
determining the boundary region based on the motion-to-response delay and the viewport region.
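
A hypothetical sketch of claims 4 and 5: derive a boundary region by padding the viewport region in proportion to the terminal's motion-to-response delay. The padding rule used here (an assumed worst-case head-rotation speed multiplied by the delay) is an illustration only; the claims require merely that the boundary region depend on the delay and the viewport region.

    from dataclasses import dataclass

    @dataclass
    class Region:
        x: int   # left edge, in panorama pixels
        y: int   # top edge
        w: int   # width
        h: int   # height

    FRAME_W, FRAME_H = 3840, 1920   # assumed panorama size
    PIX_PER_DEG = FRAME_W / 360.0
    MAX_HEAD_SPEED_DEG_S = 120.0    # assumed worst-case head-rotation speed

    def boundary_region(viewport: Region, delay_s: float) -> Region:
        """Pad the viewport so it still covers the view after delay_s seconds of motion."""
        margin = int(MAX_HEAD_SPEED_DEG_S * delay_s * PIX_PER_DEG)
        top = max(0, viewport.y - margin)
        return Region(
            x=(viewport.x - margin) % FRAME_W,          # horizontal wrap-around
            y=top,
            w=min(FRAME_W, viewport.w + 2 * margin),
            h=min(FRAME_H - top, viewport.h + 2 * margin),
        )

    vp = Region(x=320, y=600, w=1280, h=720)
    print(boundary_region(vp, delay_s=0.05))  # 50 ms delay -> 6 degrees (64 px) of padding per side

Transmitting the boundary region rather than the bare viewport lets the terminal re-project locally while the next pose update is in flight, which is why the padding grows with the motion-to-response delay.
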
6. The method of claim 1, further comprising:
transmitting the second video stream to a second terminal device associated with the first terminal device.
7. The method of any of claims 1 to 6, wherein the method is performed by a processing device, and wherein the processing device and the photographing device are included in a movable device.
8. The method of claim 7, further comprising:
receiving a movement control instruction from the first terminal device; and
controlling movement of the movable device and/or the photographing device based on the movement control instruction.
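
A minimal sketch of the control path in claim 8, with a hypothetical message format (the field names target, vx, yaw_rate, pan, and tilt are assumptions, as are the stand-in actuator functions):

    def drive_base(vx: float, yaw_rate: float) -> None:
        print(f"base: vx={vx} m/s, yaw_rate={yaw_rate} rad/s")   # stand-in for the movable device's actuators

    def point_camera(pan: float, tilt: float) -> None:
        print(f"camera: pan={pan} deg, tilt={tilt} deg")         # stand-in for a camera gimbal

    def handle_movement_instruction(msg: dict) -> None:
        """Dispatch a terminal-issued instruction to the movable device and/or the photographing device."""
        target = msg.get("target", "base")
        if target in ("base", "both"):
            drive_base(msg.get("vx", 0.0), msg.get("yaw_rate", 0.0))
        if target in ("camera", "both"):
            point_camera(msg.get("pan", 0.0), msg.get("tilt", 0.0))

    handle_movement_instruction({"target": "both", "vx": 0.3, "pan": 15.0})
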
9. The method of any of claims 1 to 6, wherein the first video stream is a panoramic video stream.
10. An apparatus for video transmission, comprising:
a video stream acquisition unit configured to acquire a first video stream from a photographing device via a first communication connection, the first communication connection being a local area network connection;
a receiving unit configured to receive pose information of a first terminal device via a second communication connection, the second communication connection being a wide area network connection;
a generating unit configured to generate, based on the pose information and the first video stream, a second video stream including a portion of the first video stream corresponding to the pose information; and
a transmitting unit configured to transmit the second video stream to the first terminal device via the second communication connection.
11. An apparatus for video acquisition and transmission, comprising:
a photographing device configured to acquire a first video stream; and
a processing device coupled to the photographing device via a first communication connection and coupled to a first terminal device via a second communication connection, wherein the first communication connection is a local area network connection and the second communication connection is a wide area network connection;
the processing device is configured to:
obtaining the first video stream from the photographing device via the first communication connection;
receiving pose information of the first terminal device via the second communication connection;
generating a second video stream based on the pose information and the first video stream, the second video stream including a portion of the first video stream corresponding to the pose information; and
transmitting the second video stream to the first terminal device via the second communication connection.
12. The apparatus of claim 11, comprising at least one of: a self-balancing vehicle, a drone, a robot, or a handheld device.
13. A processing device, comprising:
at least one processing unit; and
at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, wherein the instructions, when executed by the at least one processing unit, cause the processing device to perform the method of any one of claims 1 to 9.
14. A computer readable storage medium comprising machine-executable instructions which, when executed by a device, cause the device to perform the method of any one of claims 1 to 9.
15. A computer program product comprising machine-executable instructions which, when executed by a device, cause the device to perform the method of any one of claims 1 to 9.
CN202210834758.XA 2022-07-14 2022-07-14 Method, apparatus, device and medium for video transmission Pending CN117440176A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210834758.XA CN117440176A (en) 2022-07-14 2022-07-14 Method, apparatus, device and medium for video transmission

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210834758.XA CN117440176A (en) 2022-07-14 2022-07-14 Method, apparatus, device and medium for video transmission

Publications (1)

Publication Number Publication Date
CN117440176A 2024-01-23

Family

ID=89554136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210834758.XA Pending CN117440176A (en) 2022-07-14 2022-07-14 Method, apparatus, device and medium for video transmission

Country Status (1)

Country Link
CN (1) CN117440176A (en)

Similar Documents

Publication Publication Date Title
US10841532B2 (en) Rectilinear viewport extraction from a region of a wide field of view using messaging in video transmission
KR102545195B1 (en) Method and apparatus for delivering and playbacking content in virtual reality system
CA2974104C (en) Video transmission based on independently encoded background updates
CN112702523B (en) Decoder, video processing method, system and device
CN111355954A (en) Processing video data for a video player device
CN107211081B (en) Video transmission based on independently coded background updates
EP3434021B1 (en) Method, apparatus and stream of formatting an immersive video for legacy and immersive rendering devices
CN117440176A (en) Method, apparatus, device and medium for video transmission
CN108184053B (en) Embedded image processing method and device
WO2024012295A1 (en) Video transmission method, apparatus, and system, device, and medium
KR20200076529A (en) Indexing of tiles for region of interest in virtual reality video streaming
CN114666477B (en) Video data processing method, device, equipment and storage medium
WO2022199594A1 (en) Method for performing remote video, and related device
CN115499634A (en) Video processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination