WO2023103641A1 - Video data processing method, device and system supporting interactive viewing - Google Patents

Video data processing method, device and system supporting interactive viewing

Info

Publication number
WO2023103641A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
grid
client
grids
interest
Prior art date
Application number
PCT/CN2022/128146
Other languages
English (en)
French (fr)
Inventor
袁潮
温建伟
邓迪旻
Original Assignee
北京拙河科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京拙河科技有限公司
Publication of WO2023103641A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements

Definitions

  • the present disclosure relates to video data processing. More specifically, the present disclosure relates to video data processing methods, devices and systems supporting interactive viewing.
  • FIG. 1 shows a schematic diagram of comparison between a relatively high original video resolution of a video source and a relatively low screen resolution of a client.
  • The original video resolution of the video source is 3840 × 2160, while the screen resolution of the client is 1920 × 1080.
  • If the client screen displays the video source picture pixel-for-pixel (point-to-point), only part of the video source picture can be shown, which degrades the user's viewing experience of the video content.
  • The client can play the video source content in either of the following two ways.
  • First, the client can down-sample the video source picture to reduce its resolution to match the client's screen resolution, which is the approach conventional systems currently use.
  • The problem with this approach is that it cannot fully display the details of the video source content, which degrades the user's visual experience.
  • Second, the user of the client can interact in real time with the server that provides the video source, and the server can provide the video content of the region of interest at the client's request, so that the client can display, on demand, the video content of any area of the video source.
  • The present disclosure provides a video data processing method supporting interactive viewing, including: dividing a video picture into a plurality of grids; for each grid in the plurality of grids, assigning a video encoder dedicated to the grid to encode the video data stream of the grid; and providing the encoded video data stream of at least one grid in the plurality of grids in response to a video playback request from a client.
  • The present disclosure also provides a video data processing method including: obtaining multi-level video pictures with different resolutions of the same video content; dividing each level of video picture in the multi-level video pictures into multiple grids; for each grid in the multiple grids of each level of video picture, assigning a video encoder dedicated to the grid; and using each video encoder to encode the video data stream of the corresponding grid, to obtain an encoded video data stream of the corresponding grid.
  • The present disclosure also provides a video data processing device supporting interactive viewing, including: a processor; and a memory storing computer program instructions, wherein the computer program instructions, when executed by the processor, cause the processor to perform the following steps: obtaining multi-level video pictures with different resolutions of the same video content; dividing each level of video picture in the multi-level video pictures into multiple grids; for each grid in the multiple grids of each level of video picture, assigning a video encoder dedicated to the grid; and encoding the video data stream of the corresponding grid with the respective video encoder to obtain an encoded video data stream of the corresponding grid.
  • The present disclosure also provides a system supporting interactive viewing, including: a server configured to: obtain multi-level video pictures with different resolutions of the same video content; divide each level of video picture in the multi-level video pictures into a plurality of grids; for each grid in the plurality of grids of each level of video picture, allocate a video encoder dedicated to the grid; and encode the video data stream of the corresponding grid with each video encoder to obtain the encoded video data stream of the corresponding grid.
  • The system also includes: a client, configured to send a video playback request to the server.
  • The server is further configured to: in response to the client's video playback request, select from the multi-level video pictures a video picture that matches the decoding capability of the client; determine, among the multiple grids of the selected video picture, at least one grid corresponding to the video content requested by the video playback request; and provide an encoded video data stream of the at least one grid to the client.
  • The present disclosure also provides a computer-readable storage medium on which computer program instructions are stored, wherein the computer program instructions, when executed, implement the above video data processing method supporting interactive viewing.
  • Fig. 1 shows a schematic diagram of the comparison between the original resolution of the video source and the screen resolution of the client.
  • Fig. 2 shows a schematic diagram of a process in which a client interacts with a video source in an existing method.
  • FIG. 3 is a flowchart illustrating a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
  • Fig. 4 shows a schematic diagram of a method for processing video data supporting interactive viewing according to an embodiment of the present disclosure.
  • Fig. 5A shows a schematic diagram of determining coordinate information of a region of interest in a video frame in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
  • FIG. 5B shows a schematic diagram of determining a grid corresponding to a region of interest according to coordinate information of the region of interest in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
  • Fig. 6 shows a schematic diagram of presenting a video frame corresponding to a region of interest at a client in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
  • Fig. 7 shows a schematic diagram of specifying a relatively large part of a complete frame as an ROI at the client side in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
  • FIG. 8 is a flowchart illustrating another example of a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
  • Fig. 9 shows a schematic view of multi-level video frames with different resolutions in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
  • Fig. 10 shows a schematic view of determining several grids corresponding to regions of interest in video frames of various levels in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
  • Fig. 11 shows a schematic diagram of an example of interaction between a client and a server in a method for processing video data supporting interactive viewing according to an embodiment of the present disclosure.
  • Fig. 12 shows a schematic diagram of another example of interaction between a client and a server in the method for processing video data supporting interactive viewing according to an embodiment of the present disclosure.
  • Fig. 13 shows a schematic diagram of determining regions of high interest and regions of low interest in a video source in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
  • Fig. 14 shows a schematic diagram of non-uniform grid division of a video picture in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
  • Fig. 15 shows a schematic hardware block diagram of a video data processing device supporting interactive viewing according to an embodiment of the present disclosure.
  • Fig. 16 shows a schematic structural block diagram of a video data processing device supporting interactive viewing according to an embodiment of the present disclosure.
  • FIG. 2 shows a schematic diagram of a process in which a client interactively watches a video source beyond its screen resolution in an existing method.
  • the interactive viewing process mainly includes the following steps:
  • the client side determines the region of interest in the video screen based on the user's operation.
  • the client sends a play request containing information about the ROI to the server.
  • the server cuts out a part corresponding to the region of interest from the complete video source picture.
  • the server encodes and compresses the cut-out image to obtain encoded video data.
  • the server returns the encoded video data to the client.
  • the client side decodes the received encoded video data and presents a picture of the region of interest.
  • Each client operates independently and sends its own client-specific playback request to the server.
  • The server needs to cut out, in real time, the different parts of the complete video source picture that correspond to the different regions of interest specified by each client, and to send each part back to the corresponding client after encoding and compression.
  • An independent video encoder must therefore be provided on the server side for each user, so as to meet that user's individual viewing requirements for the region of interest.
  • In live-broadcast scenarios, the number of clients can be extremely large: the number of clients watching a live broadcast at the same time can reach hundreds of millions.
  • The number of hardware video encoders on the server side is limited. For example, a channel of a TV station often needs only one video encoder, a high-end graphics card has only about 20 built-in video encoders, and video encoders are expensive. Stacking video encoders therefore cannot support interactive live broadcast services for a large number of clients, and the existing interactive viewing method cannot handle application scenarios in which an "unlimited number" of clients watch interactively.
  • To this end, this disclosure proposes grid segmentation of video pictures combined with grid-specific video encoder allocation.
  • Grid segmentation is performed on the high-resolution video source, and the video data of each grid is then encoded by a video encoder allocated exclusively to that grid.
  • Processing the video data of the video source with the scheme proposed in this disclosure relieves the constraint on video encoder resources at the server, especially when a large number of client devices watch interactively.
  • the improved video data processing technology described in this disclosure can be applied to an interactive live broadcast/on-demand system, thereby supporting a large number of clients to interactively watch live videos or on-demand videos that exceed the client screen resolution.
  • a user can interact with a server that provides live video content or on-demand content through a client, so as to obtain an area of interest from a complete picture for viewing.
  • the original video content with high resolution may be referred to as a video source or a video frame, and the video source or video frame may correspond to the video content of a live video or an on-demand video.
  • The present disclosure does not limit the specific picture content depicted in the video source or video picture.
  • FIG. 3 is a flowchart illustrating a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
  • Fig. 4 shows a schematic diagram of a method for processing video data supporting interactive viewing according to an embodiment of the present disclosure. The video data processing method will be described below in detail with reference to FIG. 3 and FIG. 4 .
  • a video picture may refer to each frame picture of a video source with a resolution higher than the common resolution of the client, and it may have live or on-demand video content.
  • the video picture can be obtained in various ways. For example, multiple shooting devices can be used to shoot, and then the multiple shooting pictures can be spliced to obtain a panoramic high-definition video picture.
  • Alternatively, the video picture can be captured directly with a camera of 100 million pixels or more, which eliminates the need to maintain multiple shooting devices and to splice multiple pictures.
  • Terms such as "video picture" and "pixel picture" may be used interchangeably in the embodiments of the present disclosure. It should be noted that the present disclosure does not limit the manner of acquiring video pictures.
  • Grid segmentation can be performed on the 100-million-pixel picture at the server. For example, a 100-million-pixel video picture can be divided into 10 × 10 grids, each with a resolution of 1000 × 1000: grid 1, grid 2, ..., grid 100.
  • Gridding information related to the grid segmentation process can be generated and recorded, such as the original resolution of the video picture, the number of grids, the grid size, and the grid coordinates.
  • For example, grid 1 has a grid size of 1000 × 1000 and grid coordinates (0,0), and so on. It can be understood that the resolution, grid size, number of grids, etc. of the video picture described above are only illustrative examples, and the grid division in the present disclosure is not limited to these specific numerical examples.
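  • As an illustrative sketch only, the gridding information described above could be recorded as follows. The `GridInfo` name, its fields, and the row-major numbering of grids starting from 1 are assumptions made for this example; the disclosure does not prescribe a data layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridInfo:
    """Hypothetical record of the gridding information for one video picture."""
    frame_width: int   # original resolution of the video picture
    frame_height: int
    cols: int          # number of grids per row
    rows: int          # number of grids per column

    @property
    def grid_width(self) -> int:
        return self.frame_width // self.cols

    @property
    def grid_height(self) -> int:
        return self.frame_height // self.rows

    def grid_number(self, col: int, row: int) -> int:
        # Assumption: grids are numbered 1..N in row-major order.
        return row * self.cols + col + 1

    def grid_origin(self, col: int, row: int) -> tuple[int, int]:
        # Pixel position of the grid's upper left corner in the full picture.
        return col * self.grid_width, row * self.grid_height


# The 100-million-pixel example from the text: 10 x 10 grids of 1000 x 1000.
info = GridInfo(frame_width=10000, frame_height=10000, cols=10, rows=10)
print(info.grid_width, info.grid_height)
print(info.grid_number(0, 0), info.grid_number(9, 9))
```

  • Storing only the picture resolution and the grid counts keeps the record small; grid size, grid numbers, and pixel origins are derived on demand.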
  • the server may consider the decoding capabilities of common clients when performing grid division on the video images.
  • The size of each grid after the grid division of the video picture should be much smaller than the decoding capability of common clients. That is, since the video picture that the user expects to watch may correspond to more than one grid, the result of the grid division should allow the client to decode the video data of several grids simultaneously in real time.
  • The video picture can be divided into grids on the basis that no grid exceeds 100,000 pixels.
  • In step S102, for each of the plurality of grids, a video encoder dedicated to the grid is assigned to encode the video data stream of the grid.
  • Each grid is assigned a video encoder dedicated to that grid.
  • For example, grid 1 can be assigned its dedicated video encoder 1, so that video encoder 1 encodes the video data stream of grid 1 to obtain the encoded video data stream of grid 1, and so on.
  • The server can encode these grid pictures independently to form 10 × 10 encoded video streams, one per grid.
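  • The one-encoder-per-grid allocation can be sketched as below. This is a toy illustration under stated assumptions: `StubEncoder` is a hypothetical stand-in that only tags its input (no real codec is invoked), and a picture is modeled as a nested list of pixel values.

```python
class StubEncoder:
    """Hypothetical stand-in for a real video encoder; it only tags tiles."""

    def __init__(self, grid_id):
        self.grid_id = grid_id
        self.frames_encoded = 0

    def encode(self, tile):
        self.frames_encoded += 1
        return {"grid": self.grid_id, "frame": self.frames_encoded, "tile": tile}


def split_into_tiles(frame, cols, rows):
    """Cut one picture (list of pixel rows) into cols x rows equal tiles."""
    height, width = len(frame), len(frame[0])
    tile_h, tile_w = height // rows, width // cols
    tiles = {}
    for r in range(rows):
        for c in range(cols):
            tiles[(c, r)] = [
                line[c * tile_w:(c + 1) * tile_w]
                for line in frame[r * tile_h:(r + 1) * tile_h]
            ]
    return tiles


# A tiny 4 x 4 picture split into 2 x 2 grids; each pixel stores its (x, y).
frame = [[(x, y) for x in range(4)] for y in range(4)]

# Exactly one encoder per grid, created once and reused for every frame.
encoders = {(c, r): StubEncoder((c, r)) for r in range(2) for c in range(2)}
packets = {
    grid_id: encoders[grid_id].encode(tile)
    for grid_id, tile in split_into_tiles(frame, 2, 2).items()
}
```

  • The encoder table is built once from the grid layout, so its size does not depend on how many clients later request streams.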
  • In step S103, in response to a video playback request from the client, an encoded video data stream of at least one grid among the plurality of grids is provided.
  • The user can interact through the client with the server that provides the video content. For example, the user can drag on the video picture displayed on the client to select a region of interest to watch.
  • The client's video playback request may include information related to the region of interest specified by the client.
  • Based on the region of interest in the video picture specified by the client, at least one grid corresponding to the region of interest may first be determined from the multiple grids of the video picture, and the encoded video data stream of the determined at least one grid is then provided. It can be understood that the region of interest specified by the client may be characterized in various ways, such as by its coordinate information.
  • FIG. 5A shows a schematic diagram of determining coordinate information of a region of interest in a video picture in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure, and FIG. 5B shows a schematic diagram of determining the grids corresponding to a region of interest according to the coordinate information of the region of interest.
  • The server can receive the region of interest specified by the client, determine the grids corresponding to it based on the gridding information recorded during the grid segmentation process, and stream the encoded video data of these grids back to the client for viewing.
  • Alternatively, the client may receive the gridding information related to the grid segmentation process from the server in advance, determine the grids corresponding to the region of interest specified by the user based on the obtained gridding information, and request the video data of these grids from the server for viewing.
  • the interactive viewing process may mainly include the following steps:
  • the client can send the coordinate information of the region of interest to the server.
  • the client may determine the coordinate information related to the region of interest according to the user's drag operation on the client screen in various ways. For example, after the user designates a region of interest in a video frame through a drag operation on the client screen, the client may determine the normalized coordinates of the region of interest in the complete frame. As shown in FIG. 5A , it is assumed that the upper left corner of the complete picture displayed on the client is the origin (0,0), and the normalized coordinates of the lower right corner are (1,1).
  • The normalized coordinates of the upper left corner and lower right corner of the region of interest in the complete picture can be calculated as (0.22, 0.24) and (0.56, 0.42). It can be understood that although the region of interest is represented above by the normalized coordinates of its upper left and lower right corners, the present disclosure does not limit the manner of representing the coordinate information of the region of interest. As an illustrative example, the region of interest may also be characterized by the normalized coordinates of its upper left corner together with its normalized length and width.
  • the user may arbitrarily select an area from the complete image as the area of interest.
  • The default picture aspect ratio can be set to a reasonable fixed value, for example the same aspect ratio as the original video source.
  • If the aspect ratio of the region of interest specified by the user differs from the preset aspect ratio, either the length or the width of the selected region of interest can be taken as the reference, and the other side can be adjusted to match the preset ratio.
  • The server can map the received normalized coordinates of the region of interest onto the video picture at the server, thereby obtaining the pixel-level coordinates of the region of interest.
  • For a video picture of 10000 × 10000 pixels, the normalized coordinates (0.22, 0.24) and (0.56, 0.42) map to the pixel-level coordinates (0.22 × 10000, 0.24 × 10000) and (0.56 × 10000, 0.42 × 10000), namely (2200, 2400) and (5600, 4200).
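  • The mapping from normalized to pixel-level coordinates is a per-axis multiplication by the picture resolution. A minimal sketch follows; the function name `to_pixel` and the rounding to whole pixels are assumptions of this example.

```python
def to_pixel(norm_point, frame_width, frame_height):
    """Map a normalized (x, y) in [0, 1] to pixel coordinates in the picture."""
    nx, ny = norm_point
    if not (0.0 <= nx <= 1.0 and 0.0 <= ny <= 1.0):
        raise ValueError("normalized coordinates must lie in [0, 1]")
    # round() guards against floating-point results such as 5600.000000000001.
    return round(nx * frame_width), round(ny * frame_height)


# Corners of the region of interest from the example, on a 10000 x 10000 picture.
top_left = to_pixel((0.22, 0.24), 10000, 10000)
bottom_right = to_pixel((0.56, 0.42), 10000, 10000)
print(top_left, bottom_right)
```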
  • The server can determine, based on the gridding information recorded during the grid segmentation process, which grids among the multiple grids of the video picture correspond to the region of interest. For example, the server can determine the minimum set of grids required to cover the region of interest in the video picture.
  • The minimum number of grids covering the region of interest in the video picture can thus be determined. As shown in FIG. 5B, the 12 grids covering the region of interest are shown in gray, and their coordinates are (2,2), (2,3), ..., (5,4) in sequence.
  • The 12 grids determined above include parts of the picture that are not of interest to the user.
  • The relative coordinates of the region of interest within the gray region formed by these 12 grids can also be determined, for example the relative coordinates (x1, y1) and (x2, y2) of the upper left corner and lower right corner of the region of interest in the gray region. These help to cut away the picture parts that are not of interest to the user from the 12 grids; the process is described in detail below.
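  • The grid lookup and the relative coordinates can be sketched as follows. Assumptions of this example: grids are uniform, grid coordinates are (column, row), the lower right corner of the region is treated as exclusive, and the helper name `covering_grids` is hypothetical. The concrete relative coordinates it yields follow from those assumptions and are not stated in the disclosure.

```python
def covering_grids(x1, y1, x2, y2, grid_w, grid_h):
    """Return the minimum set of grids covering the pixel rectangle
    (x1, y1)-(x2, y2), plus the rectangle's corners relative to the
    upper left corner of that covering area."""
    if x2 <= x1 or y2 <= y1:
        raise ValueError("region of interest must have positive size")
    first_col, first_row = x1 // grid_w, y1 // grid_h
    # Lower right corner is exclusive, so pixel x2 - 1 is the last one covered.
    last_col, last_row = (x2 - 1) // grid_w, (y2 - 1) // grid_h
    grids = [
        (col, row)
        for col in range(first_col, last_col + 1)
        for row in range(first_row, last_row + 1)
    ]
    origin_x, origin_y = first_col * grid_w, first_row * grid_h
    relative = ((x1 - origin_x, y1 - origin_y), (x2 - origin_x, y2 - origin_y))
    return grids, relative


# Region of interest (2200, 2400)-(5600, 4200) on 1000 x 1000 grids.
grids, relative = covering_grids(2200, 2400, 5600, 4200, 1000, 1000)
print(len(grids), grids[0], grids[-1])
print(relative)
```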
  • The server may provide the determined encoded video data stream of the at least one grid to the client for interactive viewing. For example, the server may send the encoded video data streams of the 12 gray grids shown in FIG. 5B to the client. It can be understood that in this step, in an interactive viewing scenario with a small number of clients, the video streams of several grids (that is, the video streams of the grids corresponding to the region of interest) can be pushed to each client as required.
  • The video streams of all grids of the video picture can also be pushed to an edge server (such as a CDN), and the edge server can then push the video streams of the relevant grids to each client according to the video playback requests of the different clients.
  • The server can send these video streams to the client through a communication channel such as a wired or wireless network, according to a standard format (such as MPEG-TS or RTP) or a custom format.
  • The video stream of each grid provided to the client must identify its grid number in some way, so that the client can reassemble and splice the streams. Therefore, in addition to the encoded video data stream of the determined at least one grid, the server also needs to send the necessary location information related to these grids to the client, so that the client can reassemble and splice the encoded video streams into a video picture of the region of interest. For example, the server can send the coordinates (2,2), (2,3), ..., (5,4) of the 12 gray grids covering the region of interest described in conjunction with FIG. 5B to the client, so that the client can reassemble the corresponding video picture based on the grid coordinates of these grids.
  • The server can also send the relative coordinates (x1, y1) and (x2, y2) of the region of interest within the gray area to the client.
  • the interactive viewing process may mainly include the following steps:
  • the client can obtain the meshing information during the meshing and segmentation process at the server in advance, so as to prepare for the interactive viewing that may be initiated by the user at any time.
  • the client may request gridding information from the server when accessing the server for the first time, so as to obtain gridding information provided by the server in response to the request.
  • Alternatively, after performing grid segmentation on the video picture, the server can proactively push the resulting gridding information to the clients it serves, so that it is available whenever needed.
  • The obtained gridding information may include the original resolution of the video picture as described above, the number of grids, the size of the grids, and the grid coordinates.
  • The server can also transmit only part of the gridding information (the original resolution of the video picture, the number of grids, the grid size, and the grid coordinates), and the client can calculate the remaining gridding information by itself from the part it received.
  • For specific details of the gridding information, reference may be made to the description above in conjunction with FIG. 5A and FIG. 5B, which will not be repeated here.
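  • As one hedged illustration of deriving gridding information from a partial set, a client that receives only the picture resolution and the grid size could compute the number of grids and the grid coordinates itself. The function name `derive_gridding` and the choice of which fields are transmitted are assumptions of this sketch; uniform grids are assumed.

```python
import math


def derive_gridding(frame_w, frame_h, grid_w, grid_h):
    """Derive grid counts and grid coordinates from resolution and grid size."""
    cols = math.ceil(frame_w / grid_w)
    rows = math.ceil(frame_h / grid_h)
    # Grid coordinates as (column, row), listed row by row.
    coords = [(c, r) for r in range(rows) for c in range(cols)]
    return cols, rows, coords


cols, rows, coords = derive_gridding(10000, 10000, 1000, 1000)
print(cols, rows, len(coords))
```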
  • the client can determine the coordinate information of the region of interest for interactive viewing.
  • the client may determine the coordinate information related to the region of interest in various ways.
  • The normalized coordinates of the region of interest selected by the user in the complete picture can be determined in a manner similar to that described above in conjunction with FIG. 5A; for example, the normalized coordinates of the upper left corner and lower right corner of the region of interest are (0.22, 0.24) and (0.56, 0.42).
  • The default picture aspect ratio can also be set to a reasonable fixed value.
  • The client can map the normalized coordinates of the region of interest onto the video picture using the gridding information obtained from the server, in a manner similar to that described above in conjunction with FIG. 5B, to obtain the pixel-level coordinates of the region of interest in the video picture. The grids corresponding to the region of interest specified by the user may then be determined based on the obtained gridding information; for example, the minimum set of grids required to cover the region of interest in the video picture may be determined.
  • The client can calculate the pixel-level coordinates (2200, 2400) and (5600, 4200) to which the normalized coordinates (0.22, 0.24) and (0.56, 0.42) map in the video picture, and use the obtained gridding information to determine which grids among the multiple grids of the video picture correspond to the region of interest.
  • The client may determine the minimum number of grids covering the region of interest in the video picture according to the gridding information it obtained and/or derived by itself, namely the 12 grids shown in gray in FIG. 5B.
  • The client can also calculate the relative coordinates of the region of interest within the gray area formed by these 12 grids, such as the relative coordinates (x1, y1) and (x2, y2) of the upper left corner and lower right corner of the region of interest in the gray area, so that the picture parts not of interest to the user can subsequently be cut away.
  • The client can then request the video streams of these grids from the server, that is, request the encoded video data streams of the 12 grids determined above.
  • the server can provide the video stream of the requested grid to the client according to an appropriate data transmission method. It can be understood that the video streams of each grid provided to the client must identify its grid number in a certain way, so as to facilitate the client to perform reorganization and splicing.
  • How to determine the grids corresponding to the region of interest from the multiple grids of the video picture, and how to provide the video streams of these grids to the client, has been described above with reference to FIG. 5A and FIG. 5B. Thereafter, the client can present the video picture corresponding to the region of interest on its screen according to the received video data streams of these grids.
  • The following describes an exemplary process of presenting the video picture of a region of interest at a client in conjunction with FIG. 6, which shows a schematic diagram of presenting the video picture corresponding to the region of interest.
  • the left side of FIG. 6 shows encoded video streams received by the client for each grid corresponding to the region of interest, for example, a total of 12 grids described above in conjunction with FIG. 5B .
  • The client can decode the encoded video data stream of each grid separately, and then splice the decoded video data streams according to the grid coordinates of each grid. Finally, the client can present the spliced decoded video directly on its screen for interactive viewing by the user.
  • That is, the video data of these 12 grids can be decoded, spliced, and presented at the client directly, without considering that the 12 grids may include picture content the user is not interested in, which may affect the viewing experience.
  • The spliced video can be forced to display in full screen.
  • Alternatively, the client can decode the encoded video data stream of each grid separately, and then splice the decoded video data streams according to the grid coordinates of the grids.
  • The part not covered by the region of interest (i.e., the part the user is not interested in) can then be cut away from the resulting 12 grids, and, as shown on the right side of FIG. 6, the cropped decoded video is presented at the client, for example forced to display in full screen.
  • The above cutting process can be carried out according to the relative coordinates of the region of interest within the minimum number of grids covering it in the video picture, for example the relative coordinates (x1, y1) and (x2, y2) of the upper left corner and the lower right corner of the region of interest in the gray area.
  • the relative coordinates may be determined by the server and sent back to the client, or determined by the client itself according to the meshing information, for example.
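  • The client-side splice-then-crop step can be sketched on already decoded tiles. In this toy example a decoded tile is a nested list of pixel values, all tiles have the same size, and the helper names `splice_tiles` and `crop` are hypothetical; decoding itself is outside the sketch.

```python
def splice_tiles(tiles):
    """Stitch decoded tiles, keyed by grid coordinates (col, row), into one picture."""
    cols = sorted({c for c, _ in tiles})
    rows = sorted({r for _, r in tiles})
    picture = []
    for r in rows:
        tile_height = len(tiles[(cols[0], r)])
        for line in range(tile_height):
            # Concatenate the same scan line of every tile in this grid row.
            picture.append([px for c in cols for px in tiles[(c, r)][line]])
    return picture


def crop(picture, x1, y1, x2, y2):
    """Keep only the region of interest, given relative coordinates
    (x1, y1) inclusive to (x2, y2) exclusive within the spliced picture."""
    return [line[x1:x2] for line in picture[y1:y2]]


# Four 2 x 2 tiles; each pixel stores its position in the spliced picture.
tiles = {
    (c, r): [[(c * 2 + x, r * 2 + y) for x in range(2)] for y in range(2)]
    for c in range(2)
    for r in range(2)
}
picture = splice_tiles(tiles)
roi = crop(picture, 1, 1, 3, 3)
```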
  • In the video data processing method supporting interactive viewing according to the embodiments of the present disclosure, the video picture is first divided into grids, and a dedicated video encoder is then assigned to each grid to encode its video data, so that the encoded video data of a subset of the grids can be selected according to the user's playback request to realize interactive viewing.
  • The advantage of the above embodiments of the present disclosure is that no matter how many clients interact with the server, the number of video encoders required by the server is fixed and equal to the number of grids produced by the grid segmentation. As long as network bandwidth allows, the server can therefore provide interactive video viewing services for an unlimited number of clients, which effectively alleviates the shortage of video encoder resources at the server, especially when a large number of client devices watch interactively.
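  • A rough back-of-the-envelope comparison of the two schemes is sketched below. It uses the figure of about 20 encoders per high-end graphics card mentioned earlier and the 10 × 10 grid example; the client count of 100 million is just one illustrative value, and the function names are hypothetical.

```python
import math

ENCODERS_PER_CARD = 20  # "about 20" built-in encoders per high-end graphics card


def encoders_per_client_scheme(num_clients):
    # Existing method: one independent encoder for every interacting client.
    return num_clients


def encoders_grid_scheme(cols, rows):
    # Grid scheme: one encoder per grid, independent of the number of clients.
    return cols * rows


def cards_needed(num_encoders, per_card=ENCODERS_PER_CARD):
    return math.ceil(num_encoders / per_card)


clients = 100_000_000
print(cards_needed(encoders_per_client_scheme(clients)))
print(cards_needed(encoders_grid_scheme(10, 10)))
```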
  • Fig. 7 shows a schematic diagram of specifying a relatively large part of a complete frame as an ROI at the client side in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
  • the obliquely shaded area in Figure 7 requires a total of 56 grids to cover the region of interest.
  • the actual video resolution of such a large number of grids exceeds half of the total number of pixels (if the complete video picture has 100 million pixels, the gray grid portion amounts to 56 million pixels); such a high resolution is unbearable for both network transmission and client-side decoding.
  • the embodiment of the present disclosure provides a technology for processing the video data of the video source based on the idea of combining grid segmentation of the video picture with quality grading, so that when a video playback request is received from the client, a video quality that matches the decoding capability of the client, and the video data of several grids at that quality, can be provided, so as to avoid problems such as picture freezing and incomplete display due to insufficient decoding capability.
  • the video data processing method based on the idea of grid segmentation and quality classification according to the embodiment of the present disclosure is described below in conjunction with FIG. 8, FIG. 9 and FIG. 10, wherein FIG. 8 shows a flow chart of another example of the video data processing method supporting interactive viewing according to an embodiment of the present disclosure, FIG. 9 shows a schematic view of multi-level video frames with different resolutions in that method, and FIG. 10 shows a schematic view of determining the grids corresponding to the region of interest in the video frames of each level in that method.
  • step S201 multi-level video frames with different resolutions of the same video content are obtained.
  • multiple ways may be used to construct multi-level video frames with the same video content (that is, the same video frame depicted, such as the same sports event) but with different resolutions.
  • the original video picture may be down-sampled to obtain multi-level video pictures with different resolutions, for subsequent grid division.
  • the original resolution of the video source can be used as the first-level video picture (full-resolution picture), and each next-level video picture is obtained by down-sampling the previous-level video picture, so the resolution of the video picture at each level is lower than that of the previous level.
  • the original resolution of the first-level video picture is 8000×4000, the resolution of the second-level video picture can be set to half of that of the previous level, that is, 4000×2000, the resolution of the third-level video picture can be set to 2000×1000, and so on.
  • the lowest-level video image may be equal to or smaller than the single video resolution (for example, 800×600) supported by common client devices, so as to be compatible with the decoding capabilities of various common clients.
  • the above resolutions and downsampling ratios of each video picture are illustrative examples.
  • the downsampling ratio of each level of video picture relative to the previous level is not necessarily 2:1; other suitable ratios can also be used.
  • the ratios between the resolutions of the video images at different levels may also be different, as long as they decrease in order.
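  • the construction of the multi-level resolutions can be sketched as follows, using the 8000×4000 source and the 800×600 lower bound mentioned above. The function name `build_levels` and the stopping rule (stop at the first level that fits within the lower bound) are assumptions made for illustration; the disclosure only requires that the resolutions decrease level by level.

```python
def build_levels(width, height, ratio=0.5, min_width=800, min_height=600):
    """Return the (width, height) of each level, highest resolution first.

    Each level is the previous one scaled by `ratio` in both dimensions.
    Levels are added until one fits within min_width x min_height
    (a hypothetical stopping rule chosen for this sketch).
    """
    if not 0 < ratio < 1:
        raise ValueError("ratio must be between 0 and 1")
    levels = [(width, height)]
    while levels[-1][0] > min_width or levels[-1][1] > min_height:
        w, h = levels[-1]
        levels.append((max(1, round(w * ratio)), max(1, round(h * ratio))))
    return levels


levels = build_levels(8000, 4000)
```

  • with the default 2:1 ratio the first three entries are 8000×4000, 4000×2000 and 2000×1000; a different `ratio` per level could be supported by passing a list of ratios instead.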
  • the ratio of the width and height of each level of video picture to those of the previous level can be set between 1/4 and 3/4. In this way, the first-level to fourth-level video images as shown in FIG. 9 can be obtained.
  • the resolution of the first-level video picture can be 7680×4320, the resolution of the second-level video picture can be 5120×2880, the resolution of the third-level video picture can be 3840×2460, and the resolution of the fourth-level video picture can be 1920×1080.
  • each level of video frames in the multi-level video frames is divided into multiple grids. It can be understood that, after obtaining multi-level video pictures, each level of video pictures can be divided into a corresponding plurality of grids. It should be noted that when the server divides the video images at all levels into grids, the size of each grid should be much smaller than the decoding capability of common clients, that is, the segmentation result should enable the client to decode the videos of several grids simultaneously in real time. For example, grid segmentation can be performed on the basis that each grid does not exceed 100,000 pixels.
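  • a uniform grid division under the 100,000-pixel budget can be sketched as follows. The function name `split_into_grids`, the 320×270 grid size, and the tuple layout are hypothetical choices for the example; only the per-grid pixel limit comes from the text above.

```python
import math


def split_into_grids(width, height, grid_w, grid_h, max_pixels=100_000):
    """Divide a width x height picture into uniform grids.

    Returns (cols, rows, grids), where each grid is
    (col, row, x0, y0, x1, y1) with (x1, y1) exclusive. Grids on the right
    and bottom edges are clipped to the picture size.
    """
    if grid_w * grid_h > max_pixels:
        raise ValueError("grid exceeds the per-grid pixel budget")
    cols = math.ceil(width / grid_w)
    rows = math.ceil(height / grid_h)
    grids = []
    for row in range(rows):
        for col in range(cols):
            x0, y0 = col * grid_w, row * grid_h
            grids.append((col, row, x0, y0,
                          min(x0 + grid_w, width), min(y0 + grid_h, height)))
    return cols, rows, grids


# Hypothetical example: a 1920x1080 level with 320x270 grids (86,400 pixels each).
cols, rows, grids = split_into_grids(1920, 1080, 320, 270)
```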
  • the server can obtain the gridding information of the complete multi-level video picture, which can include, for example, the number of levels of the multi-level video picture (also called the number of quality levels), the resolution of the video picture at each level, the number of grids at each level (such as the numbers of grids in the horizontal and vertical directions), the grid size at each level, and the grid coordinates of each grid.
  • the server can generate and record the following information in the process of meshing and segmenting video images at all levels:
  • the grid information obtained by grid-dividing video images at various levels can be described in various formats, such as xml, json, and the like.
  • the gridding information of the multi-level video picture can be expressed as follows:
  • although the above description takes as an example the case where all grids of a given level have the same size after division, this is only a schematic example.
  • the grid sizes of the gridded video pictures at a certain level may not be exactly the same. This case is called non-uniform grid partitioning.
  • in that case, more detailed information should be included in the gridding information of this level of video picture, for example, the grid size of a certain row, the grid size of a certain column, or the grid size at a specified position.
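  • for the uniform case, gridding information of the kind listed above could be serialized to JSON roughly as follows. Every field name here (`level_count`, `grid_cols`, and so on) and every value is hypothetical; the sketch only illustrates one possible shape, and a non-uniform division would need additional per-row or per-column sizes.

```python
import json


def describe_level(level, width, height, cols, rows):
    """Describe one uniformly gridded level (field names are illustrative)."""
    return {
        "level": level,
        "width": width,
        "height": height,
        "grid_cols": cols,
        "grid_rows": rows,
        "grid_width": width // cols,
        "grid_height": height // rows,
    }


grid_info = {
    "level_count": 2,
    "levels": [
        describe_level(1, 3840, 2160, 12, 8),
        describe_level(2, 1920, 1080, 6, 4),
    ],
}
encoded = json.dumps(grid_info, indent=2)
```

  • grid coordinates are omitted because, for a uniform division, the client can derive them from the column and row counts and the grid size.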
  • a video encoder dedicated to the grid is allocated.
  • each grid (and its video stream) can be assigned a number, which includes at least the quality-level number of the video picture to which the grid belongs and the grid number within that picture.
  • the third-level video picture is divided into 100 grids. Taking the grid in the upper left corner as the origin, the coordinates of the cross-hatched grid are (2,1).
  • since it belongs to the third-level video picture, it can be numbered as (3,2,1). Of course, other numbering methods can also be adopted, as long as the grid can be uniquely identified in the server.
  • its dedicated video encoder can be allocated, so as to independently manage the video data stream of each grid in units of grids.
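  • the numbering and per-grid encoder allocation can be sketched as follows. The `GridId` type, the reading of (2,1) as (column, row), and the string placeholders standing in for encoder instances are assumptions of this example, not part of the disclosure.

```python
from typing import NamedTuple


class GridId(NamedTuple):
    """Grid number: quality level plus grid coordinates (origin at top-left)."""
    level: int
    col: int
    row: int


def allocate_encoders(level, cols, rows):
    """Map every grid of one level to its own (placeholder) encoder handle."""
    return {
        GridId(level, col, row): f"encoder-{level}-{col}-{row}"
        for row in range(rows)
        for col in range(cols)
    }


# Third-level picture divided into 100 grids, assumed here to be 10 x 10.
encoders = allocate_encoders(3, 10, 10)
```

  • the dictionary size depends only on the number of grids, which mirrors the point that the encoder count is independent of the number of clients.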
  • each video encoder is used to encode the video data stream of the corresponding grid to obtain the encoded video data stream of the corresponding grid. It can be understood that after the encoded video data streams of each grid are obtained in units of grids, they can be pushed to clients with interactive viewing requirements in an appropriate manner. For example, if it is an interactive viewing application scenario for a small number of clients, video streams of several grids (that is, video streams of grids corresponding to the region of interest) can be pushed to each client on demand.
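  • for an interactive viewing application scenario with a large number of clients, the video streams of all grids can be pushed to an edge server (such as a CDN), and the edge server pushes the individual video streams of different grids at a specific image quality to each client according to the video playback requests of the different clients.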
  • the video data processing method supporting interactive viewing as described above may further include: in response to a video playback request from the client, providing an encoded video data stream of at least one grid among multiple grids of a specific video picture , for interactive viewing by the user.
  • the user can interact with the server providing video content through the client, and obtain the region of interest from the complete video frame for viewing.
  • the server maintains multi-level video images, in this example, considering the specific decoding capability of the client, several grids under a specific video image are selected as the grids corresponding to the user-specified ROI. Similar to what was described above in conjunction with FIG. 5A and FIG.
  • the server can receive from the client the designated ROI and information related to the decoding capability of the client, and then, based on the gridding information of the video pictures at all levels recorded during the earlier gridding process, select, without exceeding the decoding capability of the client, a number of grids corresponding to the region of interest in a specific video picture, and send the encoded video data streams of these grids back to the client for viewing.
  • the client may receive from the server in advance the gridding information of the video pictures at all levels produced by the gridding process, and then, based on this gridding information and within the client's decoding capability, select several grids corresponding to the user-specified region of interest in a specific video picture, and request the video data of these grids from the server for viewing.
  • the interactive viewing process may mainly include the following steps:
  • the user can specify the region of interest in the video screen during the interactive viewing process
  • the corresponding server can receive a video playback request from the client
  • the video playback request can include coordinate information related to the region of interest specified by the user
  • the video playback request may also include the number of grids that the client can simultaneously decode for various common grid sizes, as the decoding capability of the client for various grid sizes. It should be noted that the client can actively send its decoding capabilities for various common grid sizes to the server, so that the server can consider the relevant decoding capabilities when determining the grid corresponding to the region of interest.
  • in order to reduce the amount of data communication, in the case that the client has obtained in advance the gridding information from the gridding process of the video images at all levels, the client can send to the server only its decoding capabilities for the grid sizes involved in that process, without sending the decoding capabilities for irrelevant grid sizes.
  • the server selects a video picture that matches the decoding capability of the client from the multi-level video pictures, and determines, among the plurality of grids of the selected video picture, at least one grid corresponding to the video content requested by the video playback request.
  • the server can select an appropriate level of video picture from the multi-level video pictures according to the proportion of the entire frame occupied by the region of interest specified in the video playback request, taking into account the decoding capability of the client, and then select, from among the plurality of grids in the video picture of that level, the smallest number of grids that can cover the region of interest, as the grids corresponding to the region of interest.
  • the server can start from the first-level video picture and calculate the number of grids that the region of interest occupies in the video picture of each level; if the number of grids occupied exceeds the decoding capability of the client, the next level of video picture is evaluated, until the number of grids required in the video picture of some level is not greater than the decoding capability of the client, so as to provide video pictures of as high a resolution as possible for interactive viewing without exceeding the decoding capability of the client.
  • the minimum number of grids required to cover the region of interest in the video pictures of all levels can be determined in sequence; it can be determined that the 36 grids in the first-level video picture and the 24 grids in the second-level video picture exceed the decoding capability of the client, while the 16 grids in the third-level video picture do not, so the 16 grids shown in gray in the third-level video picture can be used as the grids corresponding to the region of interest.
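  • the level-by-level search described above can be sketched as follows. The function name `select_level` and the client capacity of 20 grids are hypothetical; for simplicity the sketch also assumes a single capacity value for all levels, whereas the disclosure allows the capacity to depend on the grid size of each level.

```python
def select_level(grids_needed_per_level, max_decodable_grids):
    """Return (level, grids_needed) for the highest-resolution level that fits.

    grids_needed_per_level lists the minimum covering grid count for
    level 1, level 2, ... (level 1 has the highest resolution).
    Returns None if no level fits the client's capacity.
    """
    for level, needed in enumerate(grids_needed_per_level, start=1):
        if needed <= max_decodable_grids:
            return level, needed
    return None


# Grid counts from the FIG. 10 example; the capacity of 20 is an assumption.
choice = select_level([36, 24, 16, 4], max_decodable_grids=20)
```

  • with these inputs the first and second levels are rejected and the third level (16 grids) is chosen, matching the example.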
  • the server can provide the determined encoded video data stream of at least one grid to the client. Thereafter, in a manner similar to that described above in conjunction with FIG. 6, the client can decode, splice, and optionally crop the received encoded video data streams of the several grids, and then present a video picture corresponding to the region of interest on the screen of the client.
  • the interactive viewing process may mainly include the following steps:
  • the client can obtain the multi-level grid information in the process of grid segmentation of the multi-level video screen at the server in advance, so as to prepare for the interactive viewing that may be initiated by the user at any time .
  • the obtained gridding information may include the original resolution, number of grids, grid size, grid coordinates, etc. of the video frames at all levels as described above. It can be understood that, in order to reduce the amount of data communication and avoid excessive occupation of bandwidth resources, the server can transmit only part of the gridding information, and the client can calculate the rest from the part received.
  • the client can select a video picture that matches its decoding capability from the multi-level video pictures according to its decoding capability for each grid size produced by the gridding process, and determine, among the grids of the selected video picture, at least one grid corresponding to the region of interest.
  • the client can select an appropriate level of video picture from the multi-level video pictures according to the proportion of the entire frame occupied by the user-specified region of interest and its own decoding capability, and then select, from among the multiple grids in the video picture of that level, the smallest number of grids that can cover the region of interest.
  • the 16 grids shown in gray in the third-level video frame may be used as the grids corresponding to the region of interest.
  • the client can request the video streams of these grids from the server. Thereafter, in a manner similar to that described above in conjunction with FIG. 6, the client can decode, splice, and optionally crop the received encoded video data streams of the several grids, and then present a video picture corresponding to the region of interest on the screen of the client.
  • the above description takes the decoding capability of the client as a factor, and describes the technology for processing video data of a video source based on the combination of grid segmentation and image quality grading of video images.
  • the network connection quality of the client may also be considered as a factor, and several grids under a specific video frame may be selected as the grids corresponding to the region of interest.
  • processing the video data of the video source by combining grid segmentation of the video picture with quality grading makes it possible to provide a video quality that matches the decoding capability of the client, together with the video data of several grids at that quality, so as to avoid providing grid video data of inappropriate quality to the client and to avoid problems such as picture freezing and incomplete display on the client due to insufficient decoding capability, thereby effectively improving the user's interactive viewing experience.
  • FIG. 11 shows a schematic diagram of an example of the interaction between the client and the server in the video data processing method supporting interactive viewing according to an embodiment of the present disclosure, which mainly includes the following steps:
  • Step 1 The server can send grid information of video images at all levels to the client.
  • the client can obtain the gridding information of video images at various levels during the gridding and segmentation process at the server in advance, so as to prepare for interactive viewing that may be initiated by the user at any time.
  • the client may request the gridding information from the server, so as to obtain the gridding information provided by the server in response to the request.
  • the server after the server has gridded and segmented the video images at all levels, it can actively push the gridded information to the client it serves.
  • Step 2 The client determines its own decoding capabilities for various grid sizes of video images at all levels.
  • the client's ability to decode common grids can be characterized by the number of grids that the client can decode at the same time.
  • the video decoding capabilities of common client devices eg, mobile phones, set-top boxes, etc.
  • the maximum number of grids that can be processed can be obtained by dividing the number of video pixels that the client can decode per second by the number of pixels that each video grid generates per second.
  • the above decoding capability information of the client can be obtained in various ways; for example, the value calculated above can be used as an initial value, actual testing can be performed during software development, and the measured value can be taken as a more accurate representation of the decoding capability of the client.
  • the present disclosure does not limit the manner of determining the decoding capability of the client.
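  • the division described in this step can be written out as follows. The client throughput (1920×1080 at 60 frames per second) and the grid frame rate (30 frames per second) are hypothetical figures chosen for the example; the 384×288 grid size is the one used elsewhere in this description.

```python
def max_decodable_grids(client_pixels_per_second, grid_width, grid_height,
                        frame_rate):
    """Estimate how many grid streams a client can decode at the same time.

    This is the client's decodable pixels per second divided by the pixels
    per second produced by one grid stream, rounded down.
    """
    grid_pixels_per_second = grid_width * grid_height * frame_rate
    return client_pixels_per_second // grid_pixels_per_second


# Hypothetical client that can decode 1920x1080 at 60 frames per second.
client_rate = 1920 * 1080 * 60
capacity = max_decodable_grids(client_rate, 384, 288, 30)
```

  • such an estimate is only a starting point; a measured value would replace it once available.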
  • Step 3 The client determines the area of interest.
  • the client can determine the coordinate information of the ROI. For example, after the user specifies a region of interest in a video frame through a drag operation on the client screen, the client can determine the normalized coordinates of the region of interest in the complete frame, so that it can be subsequently mapped to each level video screen. In this example, it is assumed that the normalized coordinates of the upper left corner and the lower right corner of the region of interest in the complete frame are (0.12, 0.25) and (0.38, 0.51) respectively. In order to prevent the screen aspect ratio selected by the user from being too unreasonable, the default screen ratio size can also be set to a fixed value.
  • Step 4 Calculate the minimum number of grids that can cover the region of interest in the video images at all levels.
  • the client can map the normalized coordinates of the region of interest onto the gridding information of each level obtained from the server, to obtain the pixel-level coordinates of the region of interest in the video frames of all levels, as follows:
  • the pixel-level coordinates of the upper-left corner and the lower-right corner of the region of interest in the first-level video frame are: (922, 1080) and (2918, 2203).
  • the pixel-level coordinates of the upper-left corner and the lower-right corner of the region of interest in the fourth-level video frame are: (230, 270) and (730, 551).
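  • the mapping from normalized to pixel-level coordinates can be reproduced with a short calculation. The function name `to_pixels` and the use of rounding to the nearest integer are assumptions; rounding to nearest is chosen because it matches the first-level and fourth-level values given above.

```python
def to_pixels(norm_box, width, height):
    """Map normalized corner coordinates to pixel coordinates of one level.

    norm_box is ((nx1, ny1), (nx2, ny2)) with values in [0, 1];
    results are rounded to the nearest integer pixel.
    """
    (nx1, ny1), (nx2, ny2) = norm_box
    return ((round(nx1 * width), round(ny1 * height)),
            (round(nx2 * width), round(ny2 * height)))


roi = ((0.12, 0.25), (0.38, 0.51))
level1 = to_pixels(roi, 7680, 4320)   # first-level picture
level4 = to_pixels(roi, 1920, 1080)   # fourth-level picture
```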
  • the minimum number of grids that can cover the region of interest in the video frames of all levels can be determined. As shown in Figure 10, the minimum set of grids covering the region of interest is shown in gray in the video pictures of all levels: the first-level video picture requires 36 grids, the second-level video picture requires 24 grids, the third-level video picture requires 16 grids, and the fourth-level video picture requires 4 grids.
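  • for a uniform grid, the minimum covering set can be found with integer division, as sketched below. The function name `covering_grids`, the exclusive lower-right corner, and the 384×288 grid size assumed for the fourth-level picture are choices made for this example only.

```python
def covering_grids(top_left, bottom_right, grid_w, grid_h):
    """List the (col, row) grids of a uniform division that overlap the ROI.

    top_left is inclusive and bottom_right is treated as exclusive
    (an assumption of this sketch).
    """
    (x1, y1), (x2, y2) = top_left, bottom_right
    first_col, first_row = x1 // grid_w, y1 // grid_h
    last_col, last_row = (x2 - 1) // grid_w, (y2 - 1) // grid_h
    return [(col, row)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]


# Fourth-level ROI from the example, with an assumed 384x288 grid size.
needed = covering_grids((230, 270), (730, 551), 384, 288)
```

  • under these assumptions the region spans two columns and two rows, that is, 4 grids.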
  • Step 5 According to the decoding capability of the client, select video images with as high resolution as possible, and determine the grid that can cover the area of interest.
  • the client can select a video picture that matches its decoding capability from the multi-level video pictures according to its decoding capability for each grid size produced by the gridding process, and determine, among the multiple grids of the selected video picture, the grids corresponding to the region of interest. For example, continuing with the example in Figure 10, it can be determined that the 36 grids in the first-level video picture and the 24 grids in the second-level video picture exceed the decoding capability of the client, while neither the 16 grids in the third-level video picture nor the 4 grids in the fourth-level video picture exceed it, so the 16 grids shown in gray in the higher-resolution third-level video picture can be used as the grids corresponding to the region of interest.
  • the client can also calculate the relative coordinates of the region of interest within the gray region formed by the 16 grids determined above, so as to cut out the images that are not regions of interest to the user.
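  • the relative coordinates used for cropping can be derived by subtracting the origin of the first covering grid, as in the sketch below. The function name `relative_roi`, the uniform 384×288 grid, and the sample region are hypothetical.

```python
def relative_roi(top_left, bottom_right, grid_w, grid_h):
    """Express the ROI relative to the top-left corner of its covering grids.

    The covering area starts at the origin of the grid that contains the
    ROI's top-left corner (uniform grid assumed).
    """
    (x1, y1), (x2, y2) = top_left, bottom_right
    origin_x = (x1 // grid_w) * grid_w
    origin_y = (y1 // grid_h) * grid_h
    return (x1 - origin_x, y1 - origin_y), (x2 - origin_x, y2 - origin_y)


# Hypothetical ROI in pixel coordinates of one level.
rel = relative_roi((500, 300), (1000, 700), 384, 288)
```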
  • Step 6 The client requests the video stream from the server.
  • the client can request the video streams of these grids from the server. For example, continuing with the example in FIG. 10, the client requests the video data of 12 grids of the third-level video picture from the server, for example by providing the numbers of these grids in order: (3,1,2), (3,2,2), (3,3,2), (3,1,3), (3,2,3), (3,3,3), (3,1,4), (3,2,4), (3,3,4), (3,1,5), (3,2,5), (3,3,5).
  • the server can send these video streams to the client through a communication channel such as a wired or wireless network according to a certain standard (MPEG-TS or RTP, etc.) or a custom format.
  • the video stream sent to the client must somehow identify its grid number so that the client can splice and reassemble it.
  • Step 7 The client decodes and presents the video stream after receiving it.
  • in a manner similar to that described above in conjunction with FIG. 6, the client can decode, splice, and optionally crop the received encoded video data streams of the several grids, and then present a video picture corresponding to the region of interest on the screen of the client.
  • FIG. 12 shows another interaction between the client and the server in the video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
  • the difference between Embodiment 4 and Embodiment 3 is that the client does not need to know the gridding information of the server's multi-level video pictures; it only sends a playback request including the coordinate information of the region of interest to the server and informs the server of its decoding capability, and the server pushes the video streams of the corresponding grids in a specific video picture to the client according to that decoding capability.
  • the specific process is as follows:
  • Step 1 The client provides its decoding capability to the server.
  • the client's ability to decode common grids can be characterized by the number of grids that the client can simultaneously decode.
  • the client may provide the decoding capability to the server after receiving the query from the server for the decoding capability of the client.
  • the client can actively provide its decoding capability to the server, and the server will make subsequent decisions accordingly.
  • Step 2 The client sends the information of the area of interest.
  • the user can specify a region of interest in a video frame during interactive viewing, and correspondingly, the client can determine the coordinate information of the region of interest. For example, after the user specifies a region of interest in a video frame through a drag operation on the client screen, the client can determine the normalized coordinates of the region of interest in the complete frame, so that it can be subsequently mapped to each level video screen. The client can provide the information of the region of interest to the server.
  • Step 3 Calculate the minimum number of grids that can cover the region of interest in the video frames at all levels.
  • the server can map the normalized coordinates of the region of interest onto the recorded gridding information of each level, to obtain the pixel-level coordinates of the region of interest in the video images at all levels.
  • the server can determine the minimum number of grids that can cover the region of interest in the video pictures of all levels according to the pixel-level coordinates of the region of interest in those pictures and the grid coordinates of those pictures; for example, in Fig. 10, gray grids show the minimum set of grids that can cover the region of interest in the video pictures of all levels.
  • Step 4 According to the decoding capability of the client, select video images with as high resolution as possible, and determine the grid that can cover the area of interest.
  • the server can select a video picture that matches the decoding capability of the client from the multi-level pictures according to the decoding ability of the client received from the client, and determine the area of interest in multiple grids of the selected video picture the corresponding grid. For example, continuing with the example in FIG. 10 , the 16 grids shown in gray in the third-level video frame can be used as grids corresponding to the region of interest.
  • the server can also calculate the relative coordinates of the region of interest within the gray region formed by the 16 grids determined above, so as to subsequently cut out the images that are not regions of interest to the user.
  • Step 5 Push the video stream to the client.
  • after the server selects a suitable video picture from the multi-level video pictures and determines the grids corresponding to the region of interest among the multiple grids of the selected video picture, it can push the video streams of these grids to the client. It can be understood that the video stream sent to the client must identify its grid number in some way, so that the client can splice and reassemble the streams. Continuing with the example in FIG. 10, the server sends a total of 12 grids to the client, and the information provided includes: the number of grid rows is 4, the number of columns is 3, and the size of each grid is 384×288.
  • each video stream is required to carry its own grid coordinate information, namely the values (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2), (3,0), (3,1), (3,2).
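  • the grid coordinates for a block of 4 rows and 3 columns, and the position at which the client would paste each decoded grid when splicing, can be enumerated as follows. The (row, column) ordering and the helper names `grid_coordinates` and `paste_offsets` are assumptions of this sketch.

```python
def grid_coordinates(rows, cols):
    """Enumerate (row, col) pairs in row-major order."""
    return [(row, col) for row in range(rows) for col in range(cols)]


def paste_offsets(coords, grid_w, grid_h):
    """Map each (row, col) to the (x, y) pixel offset in the spliced picture."""
    return {(row, col): (col * grid_w, row * grid_h) for row, col in coords}


coords = grid_coordinates(rows=4, cols=3)
offsets = paste_offsets(coords, 384, 288)
```

  • with 384×288 grids the spliced picture is 1152×1152 pixels, and the last grid (3,2) is pasted at offset (768, 864).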
  • the relative coordinates of the region of interest determined above within the gray region formed by the 16 grids may also be included, so as to be used for subsequently cutting out images that are not regions of interest to the user.
  • Step 6 The client decodes and presents the video stream after receiving it.
  • the client can present on the screen of the client after decoding, splicing and optionally cutting the received encoded video data streams of several grids Video footage corresponding to the region of interest.
  • a client-specified region of interest can be characterized in a number of ways.
  • the above describes the operation manner of using the coordinate information of the region of interest to characterize the region of interest, and specifying the region of interest through the user's dragging gesture.
  • the above manner of representing the region of interest and the operation manner of the user's drag gesture are only illustrative examples, and the present disclosure is not limited thereto.
  • the user can select a region of interest with a finger or another operating body.
  • the client can request the video content of the corresponding grids from the server according to the gridding information and the coordinate information of the region of interest; or the client can receive the video content of the grids determined and pushed by the server according to the region of interest, for interactive viewing.
  • when a user uses devices such as laptops, desktop computers, and workstations for interactive viewing, the user can select part of the screen as a region of interest through an input device such as a mouse or a touch pad, and can watch the video content corresponding to the region of interest in a manner similar to the above.
  • a user when a user watches a live broadcast or on-demand video through a TV, a projector, etc., he can select an area of interest through a remote control, etc., and thereby view detailed information of the area of interest.
  • besides a remote control, information about a region of interest input in other ways can be determined, for example by analyzing a voice command input by the user (for example, the user speaks the command "I want to see the details of the screen in the upper left corner") or by capturing the user's body movements.
  • the user can indicate the name of the object of interest (for example, the name or number of an athlete in a live sports event, or the name of a specified building in a high-definition street view video) through text input, voice input, etc., and correspondingly, the object of interest and a predetermined range around it can be viewed interactively as a region of interest.
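  • treating an object and a predetermined range around it as the region of interest can be sketched as a bounding-box expansion. The function name `expand_box`, the (x1, y1, x2, y2) box layout, and the fixed pixel margin are hypothetical; how the object's bounding box is obtained is outside the scope of this sketch.

```python
def expand_box(box, margin, frame_width, frame_height):
    """Grow an object's bounding box by `margin` pixels on every side.

    The result is clamped to the frame so the region of interest never
    extends beyond the picture.
    """
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(frame_width, x2 + margin), min(frame_height, y2 + margin))


# Hypothetical object box in a 1920x1080 picture, expanded by 80 pixels.
roi = expand_box((100, 50, 300, 400), 80, 1920, 1080)
```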
  • since the ROI that each user wants to view is not the same and may change continuously during viewing, each client operates independently and sends its own client-specific playback request to the server. Therefore, for a relatively static ROI, when the user wants to change the ROI, the user only needs to select a new ROI on the video screen to watch the video picture of the new region.
  • when the picture content of the video source is an ultra-high-definition surveillance picture of a street scene, the user (for example, a security officer) may initially focus only on the entrance area of a certain building, and that focus may remain unchanged for hours, so for this relatively static region of interest, the server can push the video content of several fixed grids to the client during this period. If the user later wants to pay attention to another region of interest, the user only needs to select the new region of interest, and can then request a new batch of grid video content from the server, or receive a new batch of grid video content pushed by the server.
  • ROIs in the video source content may be relatively dynamic areas, for example, they may contain ROIs moving at a certain speed.
  • a certain athlete may be the user's interest object, and the user may want to focus on watching the details of the athlete's performance in the event.
  • the present disclosure proposes a grid determination method based on object tracking technology and a corresponding interactive viewing method for a region of interest that may have dynamic characteristics.
  • optical flow analysis algorithms, mean shift algorithms, Kalman filter algorithms, particle filter algorithms, etc. can be used to analyze the continuous pictures of the video, so as to track the motion of the object of interest between successive frames of the video.
  • machine learning models can be used to track the motion of the object of interest between successive pictures, such as convolutional neural network, recurrent neural network, logistic regression, linear regression, random forest, support vector machine model , deep learning model, or any other form of machine learning model or algorithm for tracking. It can be understood that in the present disclosure, other suitable methods may be adopted to automatically determine the position of the object of interest or the region of interest by analyzing the video picture, as a basis for subsequent determination of the grid corresponding to the region of interest.
  • the tracking of the object of interest may be determined or predicted locally by the client through analysis of continuous video frames, or may be determined or predicted by the server through analysis of continuous video frames.
  • the server can use the gridding information it has recorded to determine several grids corresponding to the object of interest, or to the region of interest that includes the object of interest (and, optionally, select several grids with appropriate image quality), and thereafter push the video data of the corresponding grids to the client.
  • the tracking task of the object of interest can be allocated to the client for local execution.
  • after the client has tracked the object of interest, it can use the gridding information obtained from the server to determine several grids corresponding to the object of interest or to the region of interest that includes the object of interest (and, optionally, to select several grids with appropriate image quality), and request the video data of the corresponding grids from the server for viewing.
  • the object of interest and a predetermined range around it can be used as the region of interest; that is, the determined range of the object of interest is expanded by a margin in every direction. This avoids re-determining the grids corresponding to the region of interest too often when the object of interest moves frequently, so that the set of grids changes relatively smoothly over time when the client requests the video data of the required grids from the server or receives the grid video data pushed by the server, thereby reducing the pressure on the server.
  • the video streams of the grids corresponding to an object of interest with dynamic motion characteristics can be obtained and presented after decoding and splicing at the client, which eliminates the need for the user to select the region of interest manually and frequently, and reduces the user's operational burden.
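The margin idea described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `Box`, `expand`, `contains` and `RoiTracker` names, the pixel margin and the frame size are all assumptions introduced here. The region of interest is recomputed only when the tracked object leaves the previously expanded box, so the set of requested grids changes less often.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixel coordinates (hypothetical helper type)."""
    x1: int
    y1: int
    x2: int
    y2: int


def expand(box: Box, margin: int, frame_w: int, frame_h: int) -> Box:
    """Grow the box by `margin` pixels on every side, clamped to the frame."""
    return Box(max(0, box.x1 - margin), max(0, box.y1 - margin),
               min(frame_w, box.x2 + margin), min(frame_h, box.y2 + margin))


def contains(outer: Box, inner: Box) -> bool:
    """True when `inner` lies completely inside `outer`."""
    return (outer.x1 <= inner.x1 and outer.y1 <= inner.y1
            and outer.x2 >= inner.x2 and outer.y2 >= inner.y2)


class RoiTracker:
    """Keeps the current region of interest until the object leaves it."""

    def __init__(self, margin: int, frame_w: int, frame_h: int) -> None:
        self.margin = margin
        self.frame_w = frame_w
        self.frame_h = frame_h
        self.roi: Optional[Box] = None

    def update(self, object_box: Box) -> Tuple[Box, bool]:
        """Return (roi, changed); `changed` tells whether grids must be re-determined."""
        if self.roi is not None and contains(self.roi, object_box):
            return self.roi, False
        self.roi = expand(object_box, self.margin, self.frame_w, self.frame_h)
        return self.roi, True


tracker = RoiTracker(margin=200, frame_w=10000, frame_h=10000)
roi_a, changed_a = tracker.update(Box(2200, 2400, 2600, 3000))  # first detection
roi_b, changed_b = tracker.update(Box(2300, 2450, 2700, 3050))  # small move, same ROI
roi_c, changed_c = tracker.update(Box(2900, 2450, 3300, 3050))  # object left the ROI
```

A larger margin means fewer grid changes but more pixels outside the object itself, so the value would have to be tuned to the speed of the tracked object.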
  • a uniform grid division method can be used to divide each level of the multi-level video pictures into grids, so that the grid size and resolution of every grid in the same video picture are the same.
  • each video picture may also be divided in a non-uniform manner, so that the grid sizes and/or resolutions of the grids in any one of the multi-level video pictures are not all the same.
  • various factors may be considered to determine whether to adopt a non-uniform grid division process.
  • the uppermost part of the panoramic picture may correspond to the sky in outdoor situations or the stadium roof in indoor situations
  • the lowest part of the panoramic picture may correspond to the auditorium
  • the middle part may correspond to the playing field and the players in the event being broadcast.
  • the picture content in the middle part of the video source or video picture may be of interest to most viewers and has a high probability of being selected as a region of interest (for example, viewers expect to watch the details in these areas), while the uppermost and lowermost parts of the video source may be of interest to only a small number of viewers and have a lower probability of being selected as a region of interest.
  • a non-uniform grid division process can be performed based on the users' degree of interest in each area of the entire picture. For example, for a certain sporting event, the regions of high interest and regions of low interest in the entire picture of the video source can be determined from the number and frequency with which each region is selected by viewers as a region of interest in the current viewing records. As a supplement or alternative, these regions can be determined from the number and frequency with which each region was selected as a region of interest in historical viewing records (for example, previous events at the same venue).
  • a non-uniform grid segmentation process may be performed on the video picture based on the regions of high interest and regions of low interest determined by the degree of interest of the user.
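One possible way to turn viewing records into regions of high and low interest is sketched below. The band model, the `classify_bands` name and the 0.5 threshold are assumptions made only for illustration; the disclosure states that selection counts and frequencies are used, but not how.

```python
import math
from typing import List, Sequence, Tuple


def classify_bands(band_count: int,
                   selections: Sequence[Tuple[float, float]],
                   threshold: float = 0.5) -> List[str]:
    """Label horizontal bands of the picture as "high" or "low" interest.

    `selections` holds the normalized vertical extent (top, bottom) of each
    region of interest chosen by viewers. A band is "high" when its hit count
    reaches `threshold` times the count of the most selected band.
    """
    hits = [0] * band_count
    for top, bottom in selections:
        first = int(top * band_count)
        last = min(band_count - 1, math.ceil(bottom * band_count) - 1)
        for band in range(first, last + 1):
            hits[band] += 1
    peak = max(hits) or 1  # avoid dividing by zero when nothing was selected
    return ["high" if count / peak >= threshold else "low" for count in hits]


# Three viewers picked the middle of the picture, one picked the top.
labels = classify_bands(4, [(0.3, 0.6), (0.4, 0.7), (0.35, 0.55), (0.05, 0.2)])
```

With these sample selections the two middle bands come out as high interest and the top and bottom bands as low interest, which matches the layout described for the panoramic sports picture.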
  • the following describes an example of non-uniform grid division of a video picture in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure, in conjunction with FIG. 13 and FIG. 14. FIG. 13 is a schematic diagram of determining regions of high interest and regions of low interest in a video source in the method.
  • FIG. 14 is a schematic diagram of non-uniform grid division of the video picture in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
  • the entire picture of the video source is divided into one region of high interest in the middle and two regions of low interest at the top and bottom.
  • the regions of low interest can be divided into grids at a lower image quality, while the region of high interest can be divided into grids at a relatively high image quality. Video encoders are thus devoted to the regions that attract more attention, which maximizes video encoder utilization efficiency at the cost of the viewing needs of a very small number of viewers.
  • the region of high interest in the middle of the original video picture can still be divided using the grid division method described above for the first-level video picture in FIG. 9, giving the grid division result for the region of high interest shown as part 2 in FIG. 14.
  • the grid division results can be pieced together into a new picture. For example, the grid division result of the region of high interest at the original image quality (part 2 in FIG. 14) and the grid division results of the two downsampled regions of low interest (part 1 and part 3 in FIG. 14) together form a new video picture. Because the grid division of the regions of low interest is based on the downsampled video picture, the resolution of the grids in part 2 of the new picture differs from the resolution of the grids in part 1 and part 3.
  • the size of the grid in part 2 in the newly patched video frame may also be different from the size of the grid in part 1 and part 3.
  • a new next-level non-uniform video picture can be pieced together in a similar manner, and so on.
  • more detailed gridding information needs to be included in the gridding information of this level of video picture, such as the number of grids in a given row, the number of grids in a given column, the grid size in a given row, the grid size in a given column, or the grid size at a specified position, so that the data of each grid can be identified accurately and reliably.
  • non-uniform grid division can be performed based on the users' different degrees of interest in different regions of the entire picture, so that a limited number of video encoders are allocated more reasonably and their resource utilization efficiency is improved.
  • FIG. 15 shows a hardware block diagram of a device according to an embodiment of the present disclosure.
  • the device 1500 includes a processor U1501 and a memory U1502.
  • the processor U1501 may be any device with processing capabilities capable of implementing the functions of the various embodiments of the present disclosure, for example, it may be a general-purpose processor, a digital signal processor (DSP), an ASIC designed to perform the functions described herein , field programmable gate array (FPGA) or other programmable logic device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof.
  • Memory U1502 may include computer system-readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory, and may also include other removable/non-removable, volatile/nonvolatile Computer system memory, such as a hard drive, floppy disk, CD-ROM, DVD-ROM, or other optical storage media.
  • computer program instructions are stored in the memory U1502, and the processor U1501 can execute the instructions stored in the memory U1502.
  • when the computer program instructions are run by the processor, the processor is caused to execute the video data processing method supporting interactive viewing of the embodiments of the present disclosure.
  • the video data processing method for supporting interactive viewing is basically the same as that described above with respect to FIGS. 1-14 , so in order to avoid repetition, details are not repeated here.
  • the device may include a computer, a server, a workstation, and the like.
  • a video data processing device supporting interactive viewing is provided, and the device 1600 will be described in detail below with reference to FIG. 16 .
  • Fig. 16 shows a structural block diagram of a video data processing device supporting interactive viewing according to an embodiment of the present disclosure.
  • the device 1600 includes a video frame construction unit U1601 , a grid division unit U1602 and a video encoding unit U1603 .
  • the various components can respectively perform the various steps/functions of the video data processing method supporting interactive viewing described above in conjunction with FIGS. A detailed description of the same details is omitted.
  • the video frame construction unit U1601 can obtain multi-level video frames with different resolutions of the same video content.
  • the video frame construction unit U1601 may construct multi-level video frames with the same video content (that is, the same video frame depicted, such as the same sports event) but with different resolutions in various ways.
  • the video frame construction unit U1601 may down-sample the video frame to obtain multi-level video frames with different resolutions, as discussed above in conjunction with FIG. 9 , for subsequent grid division.
  • the grid division unit U1602 can divide each level of the multi-level video pictures into multiple grids, as discussed above in conjunction with FIG. 9. When the grid division unit U1602 divides the video pictures of each level, each grid should be much smaller than what common clients can decode; that is, the division result should allow a client to decode the video of multiple grids simultaneously in real time.
  • the gridding information of the complete multi-level video pictures can be obtained. It can include, for example, the number of pictures in the multi-level video pictures (also called the number of picture levels), the resolution of the video picture at each level, the number of grids at each level (for example, the number of grids in the horizontal and vertical directions), the grid size at each level, and the grid coordinates of each grid.
  • the video encoding unit U1603 may include, for each of the plurality of grids of the video picture of each level, a video encoder dedicated to the grid allocated for the grid. It can be understood that after the server divides each level of video images into grids, a serial number can be assigned to each grid (and its video stream). Correspondingly, for each grid, a dedicated video encoder can be assigned to it in the video encoding unit U1603, so as to independently manage the video data stream of each grid in units of grids. Each video encoder in the video encoding unit U1603 can encode the video data stream of the corresponding grid to obtain the encoded video data stream of the corresponding grid.
  • the device 1600 may further include a video stream providing unit (not shown), which may be configured to, in response to a video playback request from the client: select, from the multi-level pictures, a video picture matching the decoding capability of the client; determine, among the plurality of grids of the selected video picture, at least one grid corresponding to the video content requested by the video playback request; and provide the client with the encoded video data stream of the at least one grid.
  • the video data processing technology supporting interactive viewing according to the present disclosure can also be realized by providing a computer program product containing program codes for implementing the method or device, or by any storage medium storing such a computer program product.
  • each component or each step can be decomposed and/or reassembled. These decompositions and/or recombinations should be considered equivalents of the present disclosure.
  • any part of the method and device of the present disclosure can be implemented in any computing device (including processor, storage medium, etc.) or network of computing devices with hardware, firmware, software or a combination of them.
  • the hardware may be implemented using a general-purpose processor, digital signal processor (DSP), ASIC, field programmable gate array (FPGA) or other programmable logic device (PLD) designed to perform the functions described herein, discrete gate or transistor logic, discrete hardware components, or any combination thereof.
  • a general-purpose processor may be a microprocessor, but in the alternative, the processor may be any commercially available processor, controller, microcontroller, or state machine.
  • a processor may also be implemented as a combination of computing devices, eg, a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors cooperating with a DSP core, or any other such configuration.
  • the software can reside in any form of tangible computer readable storage medium.
  • such computer-readable tangible storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • disc includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and blu-ray disc.

Abstract

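The present disclosure provides a video data processing method, device and system supporting interactive viewing. The method includes: dividing a video picture into a plurality of grids; for each grid of the plurality of grids, allocating a video encoder dedicated to that grid to encode the video data stream of that grid; and, in response to a video playback request from a client, providing the encoded video data stream of at least one grid of the plurality of grids. With this method, interactive video viewing services can be provided to an unlimited number of clients as long as the network bandwidth allows. In particular, when a large number of client devices are viewing interactively, the shortage of video encoder resources at the server can be effectively alleviated.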

Description

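Video Data Processing Method, Device and System Supporting Interactive Viewing
Technical Field
The present disclosure relates to video data processing. More specifically, the present disclosure relates to a video data processing method, device and system supporting interactive viewing.
Background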
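As the performance of video capture hardware keeps improving, capture devices with 8K (33 megapixels) and higher pixel counts have appeared or are about to appear, and techniques for stitching the pictures of multiple capture devices into panoramic ultra-high-definition video are also developing. Accordingly, it has become possible to obtain high-resolution video sources in several ways. However, the corresponding client is limited by its screen resolution and cannot fully present the content of a high-resolution video source. FIG. 1 is a schematic comparison of the relatively high original resolution of a video source with the relatively low screen resolution of a client. As shown schematically in FIG. 1, the original resolution of the video source is 3840×2160, while the screen resolution of the client is 1920×1080. Because the screen resolution of the client is lower than the resolution of the video source, a client that presents the video source picture pixel for pixel can display only part of that picture, which degrades the user's viewing experience. In this case, the client can play the video source content in one of the following two ways.
In the first way, the client downsamples the video source picture, lowering its resolution to fit the client's screen resolution. This is what conventional systems currently do. The problem is that the details of the video source content cannot be fully presented, which degrades the user's visual experience.
In the second way, the user of the client interacts in real time with the server that provides the video source, and the server provides the video content of the user's region of interest at the client's request, so that the client can present the video content of any region of the video source on demand. In this way, however, when a very large number of clients interact with the server, video encoder resources at the server become scarce.
Therefore, where the resolution of the video source is higher than the screen resolution of the client, an improved video data processing technique is needed that supports interactive viewing with a good user experience.
Summary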
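According to one aspect of the present disclosure, a video data processing method supporting interactive viewing is provided, including: dividing a video picture into a plurality of grids; for each grid of the plurality of grids, allocating a video encoder dedicated to that grid to encode the video data stream of that grid; and, in response to a video playback request from a client, providing the encoded video data stream of at least one grid of the plurality of grids.
According to another aspect of the present disclosure, a video data processing method is provided, including: obtaining multi-level video pictures of the same video content with different resolutions; dividing each level of the multi-level video pictures into a plurality of grids; for each grid of the plurality of grids of each level of video picture, allocating a video encoder dedicated to that grid; and encoding the video data stream of each grid with its video encoder to obtain the encoded video data stream of that grid.
According to another aspect of the present disclosure, a video data processing device supporting interactive viewing is provided, including: a processor; and a memory storing computer program instructions which, when run by the processor, cause the processor to perform the following steps: obtaining multi-level video pictures of the same video content with different resolutions; dividing each level of the multi-level video pictures into a plurality of grids; for each grid of the plurality of grids of each level of video picture, allocating a video encoder dedicated to that grid; and encoding the video data stream of each grid with its video encoder to obtain the encoded video data stream of that grid.
According to another aspect of the present disclosure, a system supporting interactive viewing is provided, including a server configured to: obtain multi-level video pictures of the same video content with different resolutions; divide each level of the multi-level video pictures into a plurality of grids; for each grid of the plurality of grids of each level of video picture, allocate a video encoder dedicated to that grid; and encode the video data stream of each grid with its video encoder to obtain the encoded video data stream of that grid. The system further includes a client configured to send a video playback request to the server. The server is further configured to: in response to the video playback request from the client, select from the multi-level pictures a video picture matching the decoding capability of the client; determine, among the plurality of grids of the selected video picture, at least one grid corresponding to the video content requested by the video playback request; and provide the client with the encoded video data stream of the at least one grid.
According to yet another aspect of the present disclosure, a computer-readable storage medium is provided on which computer program instructions are stored, wherein the computer program instructions, when executed, implement the above video data processing method supporting interactive viewing.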
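Brief Description of the Drawings
These and/or other aspects and advantages of the present disclosure will become clearer and easier to understand from the following detailed description of the embodiments of the present disclosure in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic comparison of the original resolution of a video source and the screen resolution of a client.
FIG. 2 is a schematic diagram of the process by which a client views a video source interactively in an existing method.
FIG. 3 is a flowchart of a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
FIG. 4 is a schematic diagram of a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
FIG. 5A is a schematic diagram of determining the coordinate information of a region of interest in a video picture in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
FIG. 5B is a schematic diagram of determining the grids corresponding to a region of interest from the coordinate information of the region of interest in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
FIG. 6 is a schematic diagram of presenting, at the client, the video picture corresponding to a region of interest in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
FIG. 7 is a schematic diagram in which the client specifies a relatively large part of the complete picture as the region of interest in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
FIG. 8 is a flowchart of another example of a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
FIG. 9 is a schematic view of multi-level video pictures with different resolutions in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
FIG. 10 is a schematic view of determining, in the video picture of each level, the grids corresponding to a region of interest in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
FIG. 11 is a schematic diagram of an example of the interaction between the client and the server in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
FIG. 12 is a schematic diagram of another example of the interaction between the client and the server in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
FIG. 13 is a schematic diagram of determining regions of high interest and regions of low interest in a video source in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
FIG. 14 is a schematic diagram of non-uniform grid division of a video picture in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
FIG. 15 is a schematic hardware block diagram of a video data processing device supporting interactive viewing according to an embodiment of the present disclosure.
FIG. 16 is a schematic structural block diagram of a video data processing device supporting interactive viewing according to an embodiment of the present disclosure.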
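Detailed Description
To help those skilled in the art better understand the present disclosure, it is described in further detail below in conjunction with the accompanying drawings and specific embodiments.
First, the basic idea of the improved video data processing technique of the present disclosure is briefly outlined. As mentioned above, although some techniques can present the video content of a region of interest of the video source on demand at the client's request, video encoder resources at the server become scarce when a very large number of clients interact with the server. FIG. 2 is a schematic diagram of the process by which, in an existing method, a client interactively views a video source whose resolution exceeds its screen resolution. As shown in FIG. 2, the interactive viewing process mainly includes the following steps: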
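1. The client determines a region of interest in the video picture based on the user's operation.
2. The client sends a playback request containing information about the region of interest to the server.
3. The server cuts the part corresponding to the region of interest out of the complete video source picture.
4. The server encodes and compresses the cut-out part to obtain encoded video data.
5. The server returns the encoded video data to the client.
6. The client decodes the received encoded video data and presents the picture of the region of interest.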
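In actual viewing, the region of interest differs from user to user and keeps changing during viewing, so each client operates independently and sends a playback request specific to that client to the server. After receiving the playback requests from the clients, the server must, in real time and according to the different regions of interest specified by the clients, cut the corresponding different parts out of the complete video source picture, encode and compress them, and return them to the respective clients. To achieve truly interactive viewing, the server must provide an independent video encoder for every user in order to meet that user's individual demand for the region of interest. However, for a large live broadcast (for example, the football World Cup) the number of clients is enormous, and the number of clients watching simultaneously can reach the hundreds of millions. The number of hardware video encoders at the server is limited: a TV channel usually needs only one video encoder, and even a high-end graphics card has only about 20 built-in video encoders. Video encoders are also expensive, so stacking video encoders cannot support interactive live streaming for a large number of clients. Existing interactive viewing methods therefore cannot handle scenarios in which an "unlimited number" of clients view interactively.
In view of this, the present disclosure proposes dividing the video picture into grids and allocating grid-specific video encoders: the high-resolution video source is first divided into grids, and each resulting grid is then allocated a video encoder dedicated to it for encoding its video data. Compared with existing interactive viewing methods, which allocate a dedicated video encoder to every client, processing the video data of the video source in this way alleviates the shortage of video encoder resources at the server, especially when a large number of client devices are viewing interactively. The improved video data processing technique described in the present disclosure can be applied to interactive live or on-demand systems to let a large number of clients interactively view live or on-demand video whose resolution exceeds the client screen resolution. For example, a user can interact through a client with the server that provides the live or on-demand content and obtain the region of interest from the complete picture for viewing. In the following description, the original high-resolution video content may be called the video source or the video picture, and it may correspond to the content of a live or on-demand video. The present disclosure does not limit the specific picture content depicted in the video source or video picture.
Embodiment 1
FIG. 3 is a flowchart of a video data processing method supporting interactive viewing according to an embodiment of the present disclosure. FIG. 4 is a schematic diagram of the method. The method is described below with reference to FIG. 3 and FIG. 4.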
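As shown in FIG. 3, in step S101 the video picture is divided into a plurality of grids. As described above, the video picture may refer to each frame of a video source whose resolution is higher than common client resolutions, and it may carry live or on-demand video content. The video picture can be obtained in several ways. For example, multiple capture devices can shoot the scene and the captured pictures can be stitched into a panoramic high-definition video picture. Preferably, a capture device with a hundred megapixels or more captures the video picture directly, which removes the need to maintain multiple capture devices and to stitch multiple video pictures. In the embodiments of the present disclosure, the terms "video picture" and "pixel picture" are used interchangeably. The present disclosure does not limit how the video picture is obtained.
As shown in FIG. 4, the server can divide the hundred-megapixel picture into grids. For example, a 100-megapixel video picture can be divided into 10×10 grids of 1000×1000 resolution each, namely grid 1, grid 2, ..., grid 100. During grid division, gridding information about the division can be generated and recorded, for example the original resolution of the video picture, the number of grids, the grid size and the grid coordinates. For example, grid 1 has a grid size of 1000×1000 and grid coordinates (0,0), and so on. The resolution, grid size and number of grids given above are only illustrative; the grid division of the present disclosure is not limited to these specific values.
Preferably, the server takes the decoding capability of common clients into account when dividing the video picture into grids. Each grid should be much smaller than what a common client can decode: because the picture a user wants to watch may correspond to more than one grid, the result of grid division should allow the client to decode the video data of several grids simultaneously in real time. Based on the decoding capability of common client devices today, the video picture can be divided so that each grid contains no more than 100,000 pixels.
Returning to FIG. 3, in step S102 a video encoder dedicated to each grid of the plurality of grids is allocated to encode the video data stream of that grid. As noted above, unlike existing interactive viewing methods, which allocate a video encoder dedicated to each client, step S102 allocates a video encoder dedicated to each grid once the video picture has been divided into grids. Continuing with FIG. 4, grid 1 can be allocated its own video encoder 1, which encodes the video data stream of grid 1 to obtain the encoded video data stream of grid 1, and so on. In the embodiments of the present disclosure, the terms "encoded video data stream", "video stream" and "video data" of a grid are used interchangeably. The server can thus encode these grid pictures independently, producing 10×10 encoded video streams, one per grid.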
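The per-grid allocation of step S102 can be sketched as below. `GridEncoder` is a hypothetical stand-in introduced here for a real hardware or software encoder, and the byte strings are placeholders for raw and encoded tiles. The point of the sketch is only that each grid is encoded once per frame, and the same encoded stream is handed to every client that requests that grid.

```python
from typing import Dict, Iterable, Tuple

GridId = Tuple[int, int]  # (column, row) index of a grid


class GridEncoder:
    """Hypothetical stand-in for one video encoder bound to a single grid."""

    def __init__(self, grid_id: GridId) -> None:
        self.grid_id = grid_id
        self.frames_encoded = 0

    def encode(self, raw_tile: bytes) -> bytes:
        # A real encoder would compress the tile; here the data is only tagged.
        self.frames_encoded += 1
        return b"enc:" + raw_tile


def allocate_encoders(cols: int, rows: int) -> Dict[GridId, GridEncoder]:
    """Create exactly one encoder per grid."""
    return {(c, r): GridEncoder((c, r)) for r in range(rows) for c in range(cols)}


def encode_frame(encoders: Dict[GridId, GridEncoder],
                 raw_tiles: Dict[GridId, bytes]) -> Dict[GridId, bytes]:
    """Encode every grid of one frame once, regardless of how many clients exist."""
    return {gid: enc.encode(raw_tiles[gid]) for gid, enc in encoders.items()}


def serve(encoded: Dict[GridId, bytes], requested: Iterable[GridId]) -> Dict[GridId, bytes]:
    """Hand out already-encoded grid streams to one client."""
    return {gid: encoded[gid] for gid in requested}


encoders = allocate_encoders(10, 10)
raw_tiles = {gid: f"tile{gid}".encode() for gid in encoders}
encoded = encode_frame(encoders, raw_tiles)
client_a = serve(encoded, [(2, 2), (3, 2)])
client_b = serve(encoded, [(3, 2), (4, 2)])  # overlaps with client_a on grid (3, 2)
```

Adding more clients only adds more `serve` calls; the number of encoders stays at 100.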
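In step S103, in response to a video playback request from the client, the encoded video data stream of at least one grid of the plurality of grids is provided. As noted above, during interactive viewing the user can interact through the client with the server that provides the video content; for example, the user can drag on the video picture displayed on the client screen in real time to obtain the region of interest from the complete video picture. Accordingly, the client's video playback request can include information about the region of interest specified by the user. Because the picture of the region of interest may correspond to more than one grid, step S103 can first determine, from the plurality of grids of the video picture, at least one grid corresponding to the region of interest specified by the client, and then provide the encoded video data stream of the determined grid or grids. The region of interest specified by the client can be characterized in several ways, for example by its coordinate information.
For completeness, taking the case in which the user specifies the coordinate information of the region of interest as an example, the following describes with reference to FIG. 5A and FIG. 5B how the corresponding grids are determined among the plurality of grids. FIG. 5A is a schematic diagram of determining the coordinate information of the region of interest in the video picture, and FIG. 5B is a schematic diagram of determining the grids corresponding to the region of interest from that coordinate information, both in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure. In the embodiments of the present disclosure, the server can receive the region of interest specified by the client, determine the corresponding grids from the gridding information recorded during the earlier grid division, and return the encoded video data streams of these grids to the client for viewing. Alternatively, the client can receive the gridding information from the server in advance, determine from it the grids corresponding to the region of interest specified by the user, and request the video data of these grids from the server for viewing.
According to the first alternative above, as an example in which the server determines the grids corresponding to the region of interest, the interactive viewing process mainly includes the following steps: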
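First, when viewing interactively the user can specify a region of interest in the video picture, and the client then sends the coordinate information of the region of interest to the server. In the embodiments of the present disclosure, the client can determine this coordinate information from the user's drag operation on the client screen in several ways. For example, after the user has specified a region of interest by dragging on the client screen, the client can determine the normalized coordinates of that region within the complete picture. As shown in FIG. 5A, let the upper-left corner of the complete picture displayed on the client be the origin (0,0) and let the lower-right corner have normalized coordinates (1,1). From the proportion of the complete picture covered by the drag operation, the normalized coordinates of the upper-left and lower-right corners of the region of interest are then calculated as (0.22,0.24) and (0.56,0.42), respectively. Although the region of interest is characterized here by the normalized coordinates of its upper-left and lower-right corners, the present disclosure does not limit how its coordinate information is expressed. As another illustrative example, the region of interest can be characterized by the normalized coordinates of its upper-left corner together with its normalized length and width. In practice, moreover, a user may select an arbitrary region of the complete picture as the region of interest. To keep the aspect ratio of the selection from being unreasonable, a default aspect ratio can be set to a reasonable fixed value, for example the aspect ratio of the original video source. In this example, when the aspect ratio of the user-specified region differs from the preset aspect ratio, either the length or the width of the selected region is taken as the reference, and the other side is adjusted to match the preset ratio.
Then, after receiving the coordinate information of the region of interest, the server can map the received normalized coordinates onto the coordinates of the video picture at the server to obtain the pixel-level coordinates of the region of interest. As shown in FIG. 5B, the video picture in this example has 100 megapixels, so the normalized coordinates (0.22,0.24) and (0.56,0.42) of the upper-left and lower-right corners of the region of interest (the diagonally hatched area) map to the pixel-level coordinates (0.22×10000, 0.24×10000) and (0.56×10000, 0.42×10000), that is, (2200,2400) and (5600,4200). The server can then use the gridding information recorded during grid division to determine which of the grids of the video picture correspond to the region of interest. For example, the server can determine the minimum set of grids needed to cover the region of interest, based on one or more of the original resolution of the video picture, the number of grids, the grid size and the grid coordinates. In FIG. 5B the 12 grids covering the region of interest are shown in gray; their coordinates are (2,2), (2,3), ..., (5,4). On the other hand, the boundary of the region of interest specified by the user may not be aligned with the boundary of the gray grid area, so the 12 grids determined above include picture parts that the user is not interested in. In this example, the relative coordinates of the region of interest within the gray area formed by these 12 grids can therefore also be determined, for example the relative coordinates (x1,y1) and (x2,y2) of its upper-left and lower-right corners, which helps remove the parts the user is not interested in from the 12 grids, as described in detail below.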
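The mapping just described can be reproduced with a short sketch. The function name `covering_grids` and its return layout are assumptions introduced here; the numbers are those of the FIG. 5A/5B example (a 10000×10000 picture divided into 1000×1000 grids, with grid indices starting at 0).

```python
from typing import List, Tuple

Point = Tuple[float, float]


def covering_grids(roi_norm: Tuple[Point, Point],
                   frame_size: Tuple[int, int],
                   grid_size: Tuple[int, int]):
    """Map a normalized ROI to pixels and list the fewest grids that cover it.

    Returns (pixel_roi, grids, relative_roi), where `relative_roi` gives the ROI
    corners measured from the top-left corner of the covered grid area.
    """
    (nx1, ny1), (nx2, ny2) = roi_norm
    width, height = frame_size
    grid_w, grid_h = grid_size

    # Normalized coordinates -> pixel coordinates in the full picture.
    x1, y1 = round(nx1 * width), round(ny1 * height)
    x2, y2 = round(nx2 * width), round(ny2 * height)

    # First and last grid column/row touched by the ROI (right/bottom edge exclusive).
    col1, row1 = x1 // grid_w, y1 // grid_h
    col2, row2 = (x2 - 1) // grid_w, (y2 - 1) // grid_h

    grids: List[Tuple[int, int]] = [(c, r)
                                    for c in range(col1, col2 + 1)
                                    for r in range(row1, row2 + 1)]
    origin_x, origin_y = col1 * grid_w, row1 * grid_h
    relative = ((x1 - origin_x, y1 - origin_y), (x2 - origin_x, y2 - origin_y))
    return (x1, y1, x2, y2), grids, relative


pixel_roi, grids, relative_roi = covering_grids(
    ((0.22, 0.24), (0.56, 0.42)), (10000, 10000), (1000, 1000))
```

For this input the sketch yields the pixel rectangle (2200,2400)-(5600,4200), the 12 grids from (2,2) to (5,4), and relative corners (200,400) and (3600,2200) inside the covered area. The concrete relative values are computed here and are not stated in the original text.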
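Finally, after determining at least one grid corresponding to the region of interest among the plurality of grids of the video picture, the server can provide the encoded video data stream of the determined grid or grids for the client to view interactively. For example, the server can send the encoded video data streams of the 12 gray grids shown in FIG. 5B to the client. In this step, for an interactive viewing scenario with a small number of clients, the video streams of the grids each client needs (that is, the grids corresponding to its region of interest) can be pushed to that client on demand. For a scenario with a very large number of clients, the video streams of all grids of the video picture can instead be pushed to edge servers (for example, a CDN), which then push the video streams of different grids to the clients according to their video playback requests. The server can send these video streams to the client in a standard format (MPEG-TS, RTP, etc.) or a custom format over a communication channel such as a wired or wireless network. The present disclosure does not limit how the video data streams are pushed, transmitted over the network or encoded.
The video stream of each grid provided to the client must identify its grid number in some way so that the client can reassemble and stitch the streams. In addition to the encoded video data streams of the determined grids, the server therefore also needs to send the client the necessary position information about these grids, so that the client can reassemble the encoded video streams of the grids into the video picture of the region of interest. For example, the server can send the client the coordinates (2,2), (2,3), ..., (5,4) of the 12 gray grids covering the region of interest described with reference to FIG. 5B, so that the client can reassemble the picture from these grid coordinates. Optionally, to let the client remove the parts the user is not interested in from the 12 grids, the server can also send the client the relative coordinates (x1,y1) and (x2,y2) of the region of interest within the gray area.
According to the second alternative above, as an example in which the client determines the grids corresponding to the region of interest, the interactive viewing process mainly includes the following steps: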
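First, to be able to view interactively, the client can obtain in advance the gridding information produced by grid division at the server, so as to be ready for interactive viewing that the user may start at any time. For example, the client can request the gridding information when it first connects to the server and receive the gridding information that the server provides in response. As another example, after dividing the video picture into grids, the server can proactively push the resulting gridding information to the clients it serves so that it is available when needed. In this step, the gridding information can include the original resolution of the video picture, the number of grids, the grid size and the grid coordinates, as described above. To reduce data traffic and avoid excessive use of bandwidth, the server can transmit only some of these items, and the client can derive the remaining gridding information from the part it receives. For details of the gridding information, see FIG. 5A and FIG. 5B; they are not repeated here.
Then, after the user drags on the video picture on the client screen, the client can determine the coordinate information of the region of interest for interactive viewing. In the embodiments of the present disclosure, the client can determine this coordinate information in several ways. For example, in a manner similar to that described with reference to FIG. 5A, it can determine the normalized coordinates of the selected region of interest within the complete picture, such as (0.22,0.24) and (0.56,0.42) for the upper-left and lower-right corners. To keep the aspect ratio of the selection from being unreasonable, a default aspect ratio can likewise be set to a reasonable fixed value.
Next, having determined the coordinate information of the region of interest, the client can, in a manner similar to that described with reference to FIG. 5B, map the normalized coordinates of the region of interest onto the gridding information obtained from the server to get the pixel-level coordinates of the region of interest in the video picture. From the gridding information it can then determine the grids corresponding to the region of interest, for example the minimum set of grids needed to cover it. For example, the client can calculate that the normalized coordinates (0.22,0.24) and (0.56,0.42) map to the pixel-level coordinates (2200,2400) and (5600,4200), and use the gridding information to determine which of the grids of the video picture correspond to the region of interest. In the embodiments of the present disclosure, the client can determine the minimum number of grids covering the region of interest from one or more items of gridding information that it has received and/or derived, such as the 12 grids shown in gray in FIG. 5B. Because these 12 grids include picture parts that the user is not interested in, the client can also calculate the relative coordinates of the region of interest within the gray area formed by the 12 grids, for example the relative coordinates (x1,y1) and (x2,y2) of its upper-left and lower-right corners, so that the unwanted parts can be removed later.
Finally, after determining the grids corresponding to the region of interest among the plurality of grids of the video picture, the client can request the video streams of these grids from the server, that is, the encoded video data streams of the 12 grids determined above. To reduce data traffic, the request can carry only the grid numbers of the upper-left and lower-right grids, and the server can work out the other grid numbers to be transmitted. The server then provides the video streams of the requested grids to the client using a suitable transmission method. The video stream of each grid provided to the client must identify its grid number in some way so that the client can reassemble and stitch the streams.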
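The compact request mentioned above, in which only the upper-left and lower-right grid numbers are sent, could be expanded at the server as in this sketch. The `expand_corners` name and the (column, row) tuple format are assumptions for illustration.

```python
from typing import List, Tuple

GridId = Tuple[int, int]  # (column, row)


def expand_corners(top_left: GridId, bottom_right: GridId) -> List[GridId]:
    """List every grid in the rectangle spanned by the two corner grids."""
    (col1, row1), (col2, row2) = top_left, bottom_right
    if col2 < col1 or row2 < row1:
        raise ValueError("bottom_right must not lie above or left of top_left")
    return [(c, r) for c in range(col1, col2 + 1) for r in range(row1, row2 + 1)]


requested = expand_corners((2, 2), (5, 4))
```

For the FIG. 5B example the two corners (2,2) and (5,4) expand to the same 12 grids, so the request carries two grid numbers instead of twelve.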
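The above describes, with reference to FIG. 5A and FIG. 5B, how the grids corresponding to the region of interest are determined among the plurality of grids of the video picture and how their video streams are provided to the client. The client can then present the video picture corresponding to the region of interest on its screen from the received video data streams of these grids. Exemplary processing for presenting the video picture of the region of interest at the client is described below with reference to FIG. 6, which is a schematic diagram of presenting, at the client, the video picture corresponding to the region of interest in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure.
According to one implementation, the left side of FIG. 6 shows the encoded video streams of the grids corresponding to the region of interest received by the client, for example the 12 grids described with reference to FIG. 5B. In this example, after receiving the encoded video streams of these grids, the client can decode the encoded video data stream of each grid separately and then stitch the decoded video data streams according to the grid coordinates of the grids. Finally, the client can present the stitched decoded video data stream directly on its screen for interactive viewing. If it does not matter that the 12 grids may include picture parts the user is not interested in, which may affect the viewing impression, the video data of the 12 grids can simply be decoded, stitched and presented to the user. For example, the stitched video data stream can be displayed in forced full-screen mode.
According to another implementation, as noted above, the obtained grids include picture content the user is not interested in, which may impair the viewing experience. Unlike the implementation above, in this example the parts the user is not interested in can be removed from the obtained grids so that they do not affect the viewing experience. Specifically, as in the implementation above, after receiving the encoded video streams of these grids, the client can decode the encoded video data stream of each grid separately and then stitch the decoded video data streams according to the grid coordinates. Then, instead of presenting the stitched decoded video data stream directly on the client screen, the client cuts the interactive video data stream corresponding to the region of interest out of the stitched video stream, using the relative coordinates of the region of interest (the diagonally hatched area) within the area formed by the obtained grids (the gray area), and presents the cut video data stream on its screen. For example, as shown in the middle of FIG. 6, the parts not covered by the region of interest can be removed from the 12 obtained grids, and the cut decoded video data stream can then be presented to the user as shown on the right of FIG. 6, for example in forced full-screen mode. The cutting can be carried out using the relative coordinates of the region of interest within the minimum set of grids covering it, for example the relative coordinates (x1,y1) and (x2,y2) of its upper-left and lower-right corners within the gray area. The relative coordinates can, for example, be determined by the server and returned to the client, or determined by the client itself from the gridding information.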
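The stitch-then-cut step can be illustrated with a toy sketch that represents each decoded tile as a small two-dimensional list of pixels. The `stitch`, `crop` and `make_tile` helpers and the tiny 2×2 tile size are assumptions introduced here; a real client would operate on decoded video frames.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]
Tile = List[List[Pixel]]


def stitch(tiles: Dict[Tuple[int, int], Tile]) -> Tile:
    """Join decoded tiles, keyed by (column, row), into one picture."""
    cols = sorted({c for c, _ in tiles})
    rows = sorted({r for _, r in tiles})
    canvas: Tile = []
    for r in rows:
        tile_height = len(tiles[(cols[0], r)])
        for y in range(tile_height):
            line: List[Pixel] = []
            for c in cols:
                line.extend(tiles[(c, r)][y])
            canvas.append(line)
    return canvas


def crop(canvas: Tile, top_left: Tuple[int, int], bottom_right: Tuple[int, int]) -> Tile:
    """Cut out the ROI using its coordinates relative to the stitched area."""
    (x1, y1), (x2, y2) = top_left, bottom_right
    return [line[x1:x2] for line in canvas[y1:y2]]


def make_tile(col: int, row: int, size: int = 2) -> Tile:
    """Toy tile whose pixels store their own position in the full picture."""
    return [[(col * size + x, row * size + y) for x in range(size)] for y in range(size)]


tiles = {(c, r): make_tile(c, r) for c in (2, 3) for r in (2, 3)}
canvas = stitch(tiles)                # 4x4 picture built from four 2x2 tiles
roi = crop(canvas, (1, 1), (3, 3))    # keep only the central 2x2 region
```

Because each toy pixel stores its global position, it is easy to check that stitching preserved the tile order and that the crop returned the intended region.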
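The video data processing method supporting interactive viewing according to the embodiment of the present disclosure divides the video picture into grids and allocates grid-specific video encoders: the video picture is first divided into grids, and each grid is then allocated a dedicated video encoder that encodes its video data, so that the encoded video data of some of the grids can be selected according to the user's playback request to achieve interactive viewing. The advantage of this embodiment is that, no matter how many clients interact with the server, the number of video encoders the server needs is fixed and equal to the number of grids. As long as the network bandwidth allows, interactive video viewing services can therefore be provided to an unlimited number of clients, and the shortage of video encoder resources at the server is effectively alleviated, especially when a large number of client devices are viewing interactively.
Embodiment 2
In actual interactive viewing, the size of the region of interest that the user wants to see varies. Sometimes the user wants a panoramic view of a large area (such as the overall situation on the field in a sports event), and sometimes the details of a small area (such as a close-up of one athlete). This requires that the user be able to zoom the video picture flexibly and dynamically to any degree. FIG. 7 is a schematic diagram in which the client specifies a relatively large part of the complete picture as the region of interest in a video data processing method supporting interactive viewing according to an embodiment of the present disclosure. The inventors noticed that, if the server maintains only one video picture obtained by dividing the high-resolution original video source into grids, then when the user wants to watch a relatively panoramic area, the region of interest covers the diagonally hatched area in FIG. 7 and a total of 56 grids are needed to cover it. The actual video resolution of so many grids exceeds half of the total pixel count (if the complete video picture has 100 million pixels, the gray grids total 56 million pixels), which is unacceptable for both network transmission and client decoding. In this case, when the amount of encoded video data of the grids pushed to the client exceeds the upper limit of the client's decoding capability, the client will show stuttering or incomplete pictures when decoding and presenting the received encoded video streams, which impairs the viewing experience. A further improved video data processing technique supporting interactive viewing is therefore needed that takes into account the upper limit of the decoding capability of the clients interacting with the server.
In view of this, the embodiments of the present disclosure provide a technique that processes the video data of the video source by combining grid division of the video picture with image quality levels, so that on receiving a video playback request from the client, the server can provide a video image quality matching the client's decoding capability and the video data of several grids at that image quality, avoiding stuttering and incomplete display caused by insufficient decoding capability. The video data processing method based on grid division and image quality levels according to the embodiments of the present disclosure is described below with reference to FIG. 8, FIG. 9 and FIG. 10. FIG. 8 is a flowchart of another example of a video data processing method supporting interactive viewing according to an embodiment of the present disclosure, FIG. 9 is a schematic view of the multi-level video pictures with different resolutions in the method, and FIG. 10 is a schematic view of determining, in the video picture of each level, the grids corresponding to the region of interest.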
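As shown in FIG. 8, in step S201 multi-level video pictures of the same video content with different resolutions are obtained. In the embodiments of the present disclosure, multi-level video pictures that have the same video content (that is, depict the same picture, for example the same sports event) but different resolutions can be constructed in several ways. For example, the original video picture can be downsampled to obtain multi-level video pictures with different resolutions, each of which is subsequently divided into grids. As shown in FIG. 9, the video source at its original resolution can serve as the first-level video picture (the full-resolution picture), and each next-level video picture is obtained by downsampling the previous level, so the resolution of each level is lower than that of the previous level. As an illustrative example, if the original resolution of the first-level video picture is 8000×4000, the resolution of the second-level video picture can be set to half of that of the previous level, that is, 4000×2000, the resolution of the third-level video picture can be set to 2000×1000, and so on. The lowest-level video picture can be equal to or smaller than the single-video resolution supported by common client devices (for example, 800×600), so that it is compatible with the decoding capability of all kinds of common clients.
The resolutions and downsampling ratios given above are only illustrative. In practice, the downsampling ratio between one level and the previous level need not be 2:1 and can be any other suitable ratio. The ratios between the resolutions of successive levels can also differ, as long as the resolutions decrease from level to level. Preferably, to reduce the number of image quality levels and the load on the server, the ratio between the width and height of each level and those of the previous level is set between 1/4 and 3/4. In this way the first-level to fourth-level video pictures shown in FIG. 9 can be obtained. As shown in FIG. 9, the resolution of the first-level video picture can be 7680×4320, that of the second-level video picture 5120×2880, that of the third-level video picture 3840×2160, and that of the fourth-level video picture 1920×1080.
Returning to FIG. 8, in step S202 each level of the multi-level video pictures is divided into a plurality of grids. After the multi-level video pictures have been obtained, each level can be divided into its own plurality of grids. When the server divides the video pictures of each level into grids, each grid should be much smaller than what common clients can decode; that is, the division result should allow a client to decode the video of multiple grids simultaneously in real time. For example, the grid division can be based on each grid containing no more than 100,000 pixels. Continuing with FIG. 9 as an example: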
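(1) The first-level video picture is divided into grids of 384×216 each, giving 20×20=400 grids.
(2) The second-level video picture is divided into grids of 256×288 each, giving 20×10=200 grids.
(3) The third-level video picture is divided into grids of 384×216 each, giving 10×10=100 grids.
(4) The fourth-level video picture is divided into grids of 384×216 each, giving 5×5=25 grids.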
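The above example divides each video picture with a single grid size; the width and height of the grids may of course differ from level to level, as long as they are close. After the server has divided the video pictures of each level into grids, the gridding information of the complete multi-level video pictures is available. It can include, for example, the number of pictures in the multi-level video pictures (also called the number of picture levels), the resolution of the video picture at each level, the number of grids at each level (for example, the number of grids in the horizontal and vertical directions), the grid size at each level, and the grid coordinates of each grid. As an illustrative example, the server can generate and record the following information while dividing the video pictures of each level into grids: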
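(1) Number of picture levels (number of video pictures): 4.
(2) Total resolution of the video picture at each level: 7680×4320, 5120×2880, 3840×2160, 1920×1080.
(3) Number of grids in the horizontal and vertical directions: 20×20, 20×10, 10×10, 5×5.
(4) Size of each grid after grid division at each level: 384×216, 256×288, 384×216, 384×216.
The gridding information obtained by dividing the video pictures of each level into grids can be described in several formats, for example XML or JSON. As an illustrative example, when the gridding information is described in JSON, the gridding information of the multi-level video pictures can be expressed as follows: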
[Figures PCTCN2022128146-appb-000001 and PCTCN2022128146-appb-000002: JSON listing of the gridding information, published as images and not reproduced here]
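Since the JSON listing itself is only available as images, the following sketch shows one way such gridding information might be assembled and serialized. The field names (`level_count`, `levels`, `resolution`, `grid_size`, `grid_count`) are assumptions chosen here and are not taken from the published figures; the numeric values are those of the four-level example above.

```python
import json
from typing import Dict, List, Tuple

# (level, (width, height), (grid_width, grid_height)) for the FIG. 9 example.
LEVELS: List[Tuple[int, Tuple[int, int], Tuple[int, int]]] = [
    (1, (7680, 4320), (384, 216)),
    (2, (5120, 2880), (256, 288)),
    (3, (3840, 2160), (384, 216)),
    (4, (1920, 1080), (384, 216)),
]


def build_gridding_info(levels) -> Dict:
    """Collect per-level gridding information; field names are hypothetical."""
    info: Dict = {"level_count": len(levels), "levels": []}
    for level, (width, height), (grid_w, grid_h) in levels:
        if width % grid_w or height % grid_h:
            raise ValueError(f"level {level} cannot be divided uniformly")
        info["levels"].append({
            "level": level,
            "resolution": [width, height],
            "grid_size": [grid_w, grid_h],
            "grid_count": [width // grid_w, height // grid_h],  # horizontal, vertical
        })
    return info


gridding_info = build_gridding_info(LEVELS)
gridding_json = json.dumps(gridding_info, indent=2)
```

The derived grid counts (20×20, 20×10, 10×10 and 5×5) agree with the values listed in the text, which also shows why a client that receives only resolutions and grid sizes can work out the remaining items itself.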
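Although the above example assumes that all grids within each level have the same size after grid division, this is only illustrative. The grids of a given level need not all have the same size. For example, when the picture cannot be divided evenly, the grids at the edge of the picture may differ in size from those elsewhere; this is called non-uniform grid division. In this case, the gridding information of a level that is divided non-uniformly must contain more detailed information, such as the grid size of a given row, the grid size of a given column, or the grid size at a specified position.
Returning to FIG. 8, in step S203 a video encoder dedicated to each grid of the plurality of grids of each level of video picture is allocated. After the server has divided each level of video picture into grids, it can assign each grid (and its video stream) a number that contains at least the image quality level of the video picture to which the grid belongs and the grid number. Taking the third-level video picture as an example, it is divided into 100 grids. With the upper-left grid as the origin, the grid marked with cross-hatching has coordinates (2,1), and because it belongs to the third-level video picture it can be numbered (3,2,1). Other numbering schemes can also be used, as long as the grid is uniquely identified at the server. Each grid can then be allocated its own video encoder, so that the video data stream of each grid is managed independently, grid by grid.
In step S204, each video encoder encodes the video data stream of its grid to obtain the encoded video data stream of that grid. Once the encoded video data stream of each grid has been obtained, it can be pushed in a suitable way to clients that want to view interactively. For example, for a scenario with a small number of clients, the video streams of the grids each client needs at a specific image quality (that is, the grids corresponding to its region of interest) can be pushed to that client on demand. For a scenario with a very large number of clients, the video streams of all grids of all levels can instead be pushed to edge servers (for example, a CDN), which then push the video streams of different grids at a specific image quality to the clients according to their video playback requests. The present disclosure does not limit how the video data streams are pushed, transmitted over the network or encoded.
Optionally, the video data processing method supporting interactive viewing described above can further include: in response to a video playback request from the client, providing the encoded video data stream of at least one grid of the plurality of grids of a specific video picture for interactive viewing. As noted above, during interactive viewing the user can interact through the client with the server that provides the video content and obtain the region of interest from the complete video picture. Because the server maintains multi-level video pictures, this example also takes the client's actual decoding capability into account when selecting several grids of a specific video picture as the grids corresponding to the region of interest specified by the user. Similarly to what was described with reference to FIG. 5A and FIG. 5B, in the embodiments of the present disclosure the server can receive from the client the specified region of interest and information about the client's decoding capability, and then, based on the gridding information of each level recorded during grid division and without exceeding the client's decoding capability, select several grids of a specific video picture that correspond to the region of interest and return their encoded video data streams to the client for viewing. Alternatively, the client can receive the gridding information of each level from the server in advance, and then, based on that information and without exceeding its own decoding capability, select several grids of a specific video picture that correspond to the region of interest specified by the user and request their video data from the server for viewing.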
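As an example in which the server determines the grids of a specific video picture that correspond to the region of interest, the interactive viewing process mainly includes the following steps:
First, during interactive viewing the user can specify a region of interest in the video picture, and the server receives a video playback request from the client that can include coordinate information about the region of interest specified by the user. The request can also include, for each common grid size, the number of grids the client can decode simultaneously, as a measure of the client's decoding capability for that grid size. The client can proactively send the server its decoding capability for all common grid sizes, so that the server can take it into account when determining the grids corresponding to the region of interest. Alternatively, to reduce data traffic, when the client has obtained in advance the gridding information of each level, it can send the server only its decoding capability for the grid sizes actually used in the grid division, without sending its decoding capability for irrelevant grid sizes.
Then, in response to the client's video playback request, the server selects from the multi-level pictures a video picture that matches the client's decoding capability, and determines among the plurality of grids of the selected video picture at least one grid corresponding to the video content requested by the video playback request. For example, the server can select a suitable image quality level from the multi-level video image qualities according to the percentage of the complete picture occupied by the region of interest specified in the request and the client's decoding capability, and then select, among the plurality of grids of the video picture at that level, the minimum number of grids that cover the region of interest as the grids corresponding to it. As an illustrative example, once the client has sent the coordinate information of the region of interest and its decoding capability, the server can start from the first-level video picture and calculate, level by level, the number of grids the region of interest occupies in each level. If the number of grids exceeds the client's decoding capability, the next level is evaluated, until a level is reached at which the number of grids needed does not exceed the client's decoding capability. In this way the video picture with the highest possible resolution is provided for interactive viewing without exceeding the client's decoding capability. For example, as shown in FIG. 10, the minimum number of grids needed to cover the region of interest can be determined for each level in turn, starting from the first level. It can be determined that the 36 grids in the first-level video picture and the 24 grids in the second-level video picture both exceed the client's decoding capability, whereas the 16 grids in the third-level video picture do not, so the 16 grids shown in gray in the third-level video picture can be taken as the grids corresponding to the region of interest.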
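The level-by-level search described above can be sketched as follows. The function names, the capacity table and the region of interest are illustrative assumptions; in particular, the grid counts produced for this sample region are computed by the sketch and are not the counts shown in FIG. 10.

```python
from typing import Dict, List, Optional, Tuple

Size = Tuple[int, int]
Roi = Tuple[Tuple[float, float], Tuple[float, float]]

# (level, resolution, grid size) from the four-level example, highest quality first.
LEVELS: List[Tuple[int, Size, Size]] = [
    (1, (7680, 4320), (384, 216)),
    (2, (5120, 2880), (256, 288)),
    (3, (3840, 2160), (384, 216)),
    (4, (1920, 1080), (384, 216)),
]


def grids_needed(roi_norm: Roi, resolution: Size, grid_size: Size) -> int:
    """Fewest grids of one level needed to cover a normalized ROI."""
    (nx1, ny1), (nx2, ny2) = roi_norm
    width, height = resolution
    grid_w, grid_h = grid_size
    x1, y1 = round(nx1 * width), round(ny1 * height)
    x2, y2 = round(nx2 * width), round(ny2 * height)
    cols = (x2 - 1) // grid_w - x1 // grid_w + 1
    rows = (y2 - 1) // grid_h - y1 // grid_h + 1
    return cols * rows


def select_level(roi_norm: Roi, levels, capacity: Dict[Size, int]) -> Optional[Tuple[int, int]]:
    """Return (level, grid_count) for the first level the client can decode.

    `capacity` maps a grid size to the number of such grids the client can
    decode simultaneously; unknown grid sizes count as not decodable.
    """
    for level, resolution, grid_size in levels:
        needed = grids_needed(roi_norm, resolution, grid_size)
        if needed <= capacity.get(grid_size, 0):
            return level, needed
    return None


roi = ((0.12, 0.25), (0.38, 0.51))           # illustrative region of interest
capacity = {(384, 216): 20, (256, 288): 22}   # assumed client capability
per_level = [grids_needed(roi, res, grid) for _, res, grid in LEVELS]
choice = select_level(roi, LEVELS, capacity)
```

For this sample region the first two levels need more grids than the assumed client can decode, so the search settles on the third level. The same function could run at the server or at the client, depending on which side holds the gridding information and the capability figures.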
最后,服务端可以向客户端提供所确定的至少一个网格的经编码视频数据流。此后,客户端可以按照与以上结合图6所描述的类似的方法,根据接收到的若干个网格的经编码视频数据流,在对其分别进行解码、拼接以及可选的切割处理之后,在客户端的屏幕上呈现与感兴趣区域对应的视频画面。
作为在客户端处确定特定视频画面下与感兴趣区域对应的若干网格的实现方式的示例,交互式观看过程主要可以包括以下步骤:
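First, to enable interactive viewing, the client may obtain in advance the multi-level gridding information produced when the server partitions the multi-level video pictures into grids, so as to be ready for interactive viewing that the user may start at any time. In this step, the gridding information obtained may include, as described above, the original resolution, number of grids, grid size and grid coordinates of each level of video picture. It can be understood that, to reduce data communication and avoid excessive use of bandwidth, the server may transmit only part of the gridding information, and the client may derive the rest from the part it receives.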
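Then, according to its decoding capability for each grid size produced by the grid partitioning, the client may select, from the multi-level video pictures, a video picture that matches its decoding capability, and determine, among the grids of the selected video picture, at least one grid corresponding to the region of interest. For example, similarly to the example above, the client may select a suitable quality level from the multi-level video pictures according to the fraction of the complete picture occupied by the user-specified region of interest, taking its own decoding capability into account, and then select, from the grids of the video picture at that level, the minimum number of grids that cover the region of interest. For example, similarly to the example above, the 12 grids shown in gray in the third-level video picture may be taken as the grids corresponding to the region of interest.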
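Finally, after the client has selected a suitable video picture from the multi-level video pictures and determined, among the grids of the selected video picture, the grids corresponding to the region of interest, the client may request the video streams of these grids from the server. Thereafter, in a manner similar to that described above with reference to FIG. 6, the client may decode each of the received encoded grid video data streams, stitch the decoded grids together and optionally crop the result, and then present the video picture corresponding to the region of interest on its screen.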
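As noted in the first of these steps, the client may derive the remaining gridding information from a partial description. A minimal sketch, assuming the server sends only the resolution and grid size of a level: the `LevelGridInfo` class and its field names are hypothetical, and the 3840×2160 resolution and 384×216 grid size are assumed values consistent with the third-level example in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelGridInfo:
    """Partial gridding information for one level, as received from the server."""
    level: int
    width: int
    height: int
    grid_w: int
    grid_h: int

    @property
    def cols(self) -> int:
        return -(-self.width // self.grid_w)  # ceiling division

    @property
    def rows(self) -> int:
        return -(-self.height // self.grid_h)

    @property
    def grid_count(self) -> int:
        return self.cols * self.rows

    def grid_rect(self, col: int, row: int) -> tuple:
        """Pixel rectangle (x0, y0, x1, y1) of one grid, clipped to the picture."""
        x0, y0 = col * self.grid_w, row * self.grid_h
        x1 = min(x0 + self.grid_w, self.width)
        y1 = min(y0 + self.grid_h, self.height)
        return x0, y0, x1, y1


level3 = LevelGridInfo(level=3, width=3840, height=2160, grid_w=384, grid_h=216)
```

With this approach the number of grids and the coordinates of every grid never need to be transmitted for uniformly partitioned levels.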
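It can be understood that the above describes, with the client's decoding capability as the factor considered, a technique for processing the video data of a video source by combining grid partitioning of the video picture with quality grading. The disclosure is not limited thereto: in embodiments of the present disclosure, the quality of the client's network connection may also be taken into account when selecting the grids of a particular video picture that correspond to the region of interest. For example, when the client is connected via its own mobile data and the data rate is low, grids of a lower-resolution video picture may be selected for the region of interest; when the client is connected via a router or the like and the data rate is high, grids of a higher-resolution video picture may be selected for the region of interest. For the specific method of determining, from the grids, those corresponding to the region of interest, reference may be made to the description above, which is not repeated here.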
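One possible way to combine the two factors is sketched below. The per-grid bitrates and bandwidth figures are invented purely for illustration and do not come from the disclosure; the function name is likewise hypothetical.

```python
def select_level_with_network(grid_counts, capability, kbps_per_grid, available_kbps):
    """Pick the highest-resolution level that fits both the decoder and the network."""
    for level in sorted(grid_counts):  # level 1 has the highest resolution
        count = grid_counts[level]
        fits_decoder = count <= capability
        fits_network = count * kbps_per_grid[level] <= available_kbps
        if fits_decoder and fits_network:
            return level
    return None


grid_counts = {1: 36, 2: 24, 3: 12, 4: 4}           # grids covering the region of interest
kbps_per_grid = {1: 800, 2: 800, 3: 800, 4: 800}    # hypothetical per-grid stream bitrate

on_router = select_level_with_network(grid_counts, 20, kbps_per_grid, available_kbps=20000)
on_mobile = select_level_with_network(grid_counts, 20, kbps_per_grid, available_kbps=5000)
```

Under these assumed numbers the fast connection keeps the third level, while the slow connection falls back to the fourth.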
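With the video data processing method supporting interactive viewing according to embodiments of the present disclosure, the video data of the video source is processed by combining grid partitioning of the video picture with quality grading, so that a picture quality matching the client's decoding capability, and the video data of a number of grids at that quality, can be provided. This avoids providing the client with grid video data of unsuitable quality, and avoids problems such as stuttering or incomplete display at the client caused by insufficient decoding capability, thereby effectively improving the user's interactive viewing experience.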
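Embodiment 3
The above describes an example of an implementation in which the grids of a particular video picture that correspond to the region of interest are determined at the client. The specific interaction of this example is described below with reference to FIG. 11, which is a schematic diagram of an example of the interaction between the client and the server in the video data processing method supporting interactive viewing according to an embodiment of the present disclosure. The interaction mainly includes the following steps:
Step 1: The server may send the gridding information of each level of video picture to the client.
It can be understood that the client may obtain in advance the gridding information of each level of video picture produced by the grid partitioning at the server, so as to be ready for interactive viewing that the user may start at any time. For example, the client may request the gridding information from the server and obtain it in the server's response. As another example, after partitioning each level of video picture into grids, the server may proactively push the gridding information to the clients it serves.
Step 2: The client determines its decoding capability for each grid size used in the levels of video picture.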
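In general, a client's decoding capability for a common grid size can be characterized by the number of grids of that size the client can decode simultaneously. As an illustrative example, the video decoding capability of common client devices (for example, mobile phones and set-top boxes) is currently generally no lower than 1920×1080@30fps. Taking this as the basis, the maximum number of grids that can be handled is obtained by dividing the number of video pixels the client can decode per second by the number of pixels each grid produces per second. For example, suppose a client can decode 1920×1080×30 = 62,208,000 video pixels per second and each grid produces 384×216×30 = 2,488,320 pixels per second; in theory the client can then decode at most 62,208,000/2,488,320 = 25 grids simultaneously. Considering that decoding several videos at once is less efficient than decoding a single video, the number of grids it can decode simultaneously may be estimated as 25×0.8 = 20. The client's decoding capability may be obtained in various ways; for example, during software development this estimate may be used as the starting value for actual testing, and the measured value then used as a more accurate characterization of the client's decoding capability. The present disclosure places no limitation on how the client's decoding capability is determined.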
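The estimate above can be reproduced with a few lines of arithmetic. The function name and parameters are hypothetical; the 80% derating factor is the one used in the example, expressed as an integer percentage to keep the calculation exact.

```python
def estimate_grid_capacity(decode_w, decode_h, decode_fps,
                           grid_w, grid_h, grid_fps, derating_percent=80):
    """Estimate how many grids of a given size a client can decode simultaneously."""
    client_pixels_per_second = decode_w * decode_h * decode_fps
    grid_pixels_per_second = grid_w * grid_h * grid_fps
    theoretical = client_pixels_per_second // grid_pixels_per_second
    # Decoding several streams at once is less efficient than decoding one,
    # so only a fraction of the theoretical maximum is assumed to be usable.
    return theoretical * derating_percent // 100


capacity = estimate_grid_capacity(1920, 1080, 30, 384, 216, 30)
```

As the text notes, such a value is only a starting point; a measured value obtained by testing on the actual device would be a more accurate characterization.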
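Step 3: The client determines the region of interest.
As discussed above, the user may specify a region of interest in the video picture during interactive viewing, and the client may accordingly determine the coordinate information of the region of interest. For example, after the user specifies a region of interest in the video picture by a drag operation on the client screen, the client may determine the normalized coordinates of the region in the complete picture, so that it can later be mapped into each level of video picture. In this example, the normalized coordinates of the top-left and bottom-right corners of the region of interest in the complete picture are taken to be (0.12,0.25) and (0.38,0.51), respectively. To prevent the user from selecting a picture with an unreasonable aspect ratio, a fixed default aspect ratio may likewise be set.
Step 4: Compute the minimum number of grids that can cover the region of interest in each level of video picture.
As discussed above, after determining the coordinate information of the region of interest, the client may map the normalized coordinates of the region of interest onto the gridding information of each level obtained from the server, to obtain the pixel-level coordinates of the region of interest in each level of video picture. Taking the mapping of the region of interest into each level of video picture shown in FIG. 10 as an example, the results are as follows: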
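(1) The pixel-level coordinates of the top-left and bottom-right corners of the region of interest in the first-level video picture are (922,1080) and (2918,2203).
(2) The pixel-level coordinates of the top-left and bottom-right corners of the region of interest in the second-level video picture are (615,720) and (1946,1469).
(3) The pixel-level coordinates of the top-left and bottom-right corners of the region of interest in the third-level video picture are (461,540) and (1459,1102).
(4) The pixel-level coordinates of the top-left and bottom-right corners of the region of interest in the fourth-level video picture are (230,270) and (730,551).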
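Accordingly, from the pixel-level coordinates of the region of interest in each level of video picture and the grid coordinates in each level, the minimum number of grids that can cover the region of interest in each level can be determined, as shown by the gray grids in each level of video picture in FIG. 10, namely: 36 grids for the first-level video picture, 24 grids for the second-level video picture, 12 grids for the third-level video picture, and 4 grids for the fourth-level video picture.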
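The mapping and counting in this step can be sketched as follows. The level resolutions are not stated in this passage and are assumed here (7680×4320, 3840×2160 and 1920×1080, which reproduce the pixel coordinates listed above), as is a uniform 384×216 grid size; the second level is omitted because its grid size cannot be inferred from this passage.

```python
import math

GRID_W, GRID_H = 384, 216  # assumed uniform grid size
RESOLUTIONS = {1: (7680, 4320), 3: (3840, 2160), 4: (1920, 1080)}  # assumed
ROI = (0.12, 0.25, 0.38, 0.51)  # normalized (x0, y0, x1, y1) from the example


def to_pixels(roi, width, height):
    """Map normalized corner coordinates to pixel coordinates of one level."""
    x0, y0, x1, y1 = roi
    return round(x0 * width), round(y0 * height), round(x1 * width), round(y1 * height)


def min_covering_grids(pixel_roi):
    """Count the grids touched by the pixel rectangle (columns times rows)."""
    px0, py0, px1, py1 = pixel_roi
    cols = math.ceil(px1 / GRID_W) - px0 // GRID_W
    rows = math.ceil(py1 / GRID_H) - py0 // GRID_H
    return cols * rows


pixel_rois = {lvl: to_pixels(ROI, w, h) for lvl, (w, h) in RESOLUTIONS.items()}
counts = {lvl: min_covering_grids(p) for lvl, p in pixel_rois.items()}
```

For the third level this yields columns 1 to 3 and rows 2 to 5, that is, the 3×4 block of grids requested later in this embodiment.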
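Step 5: According to the client's decoding capability, select the video picture with the highest possible resolution and determine the grids in it that cover the region of interest.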
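According to its decoding capability for each grid size produced by the grid partitioning, the client may select, from the multi-level video pictures, a video picture that matches its decoding capability, and determine, among the grids of the selected video picture, the grids corresponding to the region of interest. For example, continuing with the example of FIG. 10, it can be determined that the 36 grids of the first-level video picture and the 24 grids of the second-level video picture both exceed the client's decoding capability, whereas the 12 grids of the third-level video picture and the 4 grids of the fourth-level video picture do not; the 12 grids shown in gray in the higher-resolution third-level video picture may therefore be taken as the grids corresponding to the region of interest. Optionally, the client may also compute the relative coordinates of the region of interest within the gray area formed by the 12 grids so determined, for later cropping away the parts of the picture outside the user's region of interest.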
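The optional relative coordinates mentioned above amount to subtracting the pixel origin of the covering area. A minimal sketch, assuming a 384×216 grid size and a covering area whose top-left grid is column 1, row 2 of the third level; the function name is hypothetical.

```python
def relative_roi(pixel_roi, first_col, first_row, grid_w, grid_h):
    """Express the region of interest relative to the top-left of the covering grids."""
    origin_x = first_col * grid_w
    origin_y = first_row * grid_h
    x0, y0, x1, y1 = pixel_roi
    return x0 - origin_x, y0 - origin_y, x1 - origin_x, y1 - origin_y


# Third-level pixel coordinates of the region of interest from the example.
crop_rect = relative_roi((461, 540, 1459, 1102), first_col=1, first_row=2,
                         grid_w=384, grid_h=216)
```

The resulting rectangle can be applied directly to the stitched picture when cropping.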
Step 6: The client requests the video streams from the server.
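After the client has selected a suitable video picture from the multi-level video pictures and determined, among the grids of the selected video picture, the grids corresponding to the region of interest, the client may request the video streams of these grids from the server. For example, continuing with the example of FIG. 10, the client requests the video data of 12 grids of the third-level video picture from the server, for example by providing the numbers of these grids, in order: (3,1,2), (3,2,2), (3,3,2), (3,1,3), (3,2,3), (3,3,3), (3,1,4), (3,2,4), (3,3,4), (3,1,5), (3,2,5), (3,3,5). Preferably, to reduce data communication, only the numbers of the top-left and bottom-right grids may be passed to the server, which then derives the numbers of the other grids to be transmitted. Accordingly, the server may send these video streams to the client over a communication channel such as a wired or wireless network, in a standard format (such as MPEG-TS or RTP) or in a custom format. Each video stream sent to the client must identify its grid number in some way, so that the client can stitch and reassemble the grids.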
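The derivation of the full list from the two corner numbers can be sketched as below. The function name is hypothetical; identifiers follow the (level, column, row) convention used in the text and are emitted row by row, which reproduces the order listed above.

```python
def expand_grid_ids(top_left, bottom_right):
    """Expand two corner identifiers into every grid identifier of the rectangle."""
    level, col0, row0 = top_left
    level_b, col1, row1 = bottom_right
    if level != level_b:
        raise ValueError("both corners must belong to the same level")
    return [
        (level, col, row)
        for row in range(row0, row1 + 1)
        for col in range(col0, col1 + 1)
    ]


requested = expand_grid_ids((3, 1, 2), (3, 3, 5))
```

Sending only the two corners keeps the request size constant regardless of how many grids the region of interest covers.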
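Step 7: The client decodes and presents the video streams after receiving them.
Thereafter, in a manner similar to that described above with reference to FIG. 6, the client may decode each of the received encoded grid video data streams, stitch the decoded grids together and optionally crop the result, and then present the video picture corresponding to the region of interest on its screen.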
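The stitching and optional cropping can be illustrated with a toy example in which each decoded grid is a small two-dimensional list of pixel values. This is a sketch only: real decoded frames would be image buffers, and the function names and the 2×2 tile layout are assumptions made for illustration.

```python
def stitch(tiles, cols, rows):
    """Assemble decoded tiles, keyed by (col, row), into one frame (list of lines)."""
    frame = []
    for row in range(rows):
        tile_height = len(tiles[(0, row)])
        for y in range(tile_height):
            line = []
            for col in range(cols):
                line.extend(tiles[(col, row)][y])
            frame.append(line)
    return frame


def crop(frame, x0, y0, x1, y1):
    """Keep only the rectangle [x0, x1) x [y0, y1) of the stitched frame."""
    return [line[x0:x1] for line in frame[y0:y1]]


# Four 2x2-pixel tiles; each pixel is labeled with its tile's column and row.
tiles = {
    (col, row): [[f"{col}{row}"] * 2 for _ in range(2)]
    for col in range(2) for row in range(2)
}
stitched = stitch(tiles, cols=2, rows=2)
roi_view = crop(stitched, 1, 1, 3, 3)
```

The crop rectangle here plays the role of the relative coordinates of the region of interest inside the area formed by the received grids.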
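Embodiment 4
Embodiment 2 above describes an example of an implementation in which the grids of a particular video picture that correspond to the region of interest are determined at the server. The specific interaction of this example is described below with reference to FIG. 12, which is a schematic diagram of another example of the interaction between the client and the server in the video data processing method supporting interactive viewing according to an embodiment of the present disclosure. Embodiment 4 differs from Embodiment 3 in that the client does not need to know the gridding information of the server's multi-level video pictures; it merely sends the server a playback request containing the coordinate information of the region of interest and informs the server of its decoding capability, and the server pushes the video streams of the corresponding grids of a particular video picture to the client according to that capability. The specific process is as follows: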
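Step 1: The client provides its decoding capability to the server.
Similarly to Embodiment 3, a client's decoding capability for a common grid size can be characterized by the number of grids of that size the client can decode simultaneously. For example, the client may provide its decoding capability to the server after receiving a query from the server about it. As another example, the client may proactively provide its decoding capability to the server, which bases its subsequent decisions on it.
Step 2: The client sends the information of the region of interest.
Similarly to Embodiment 3, the user may specify a region of interest in the video picture during interactive viewing, and the client may accordingly determine the coordinate information of the region of interest. For example, after the user specifies a region of interest in the video picture by a drag operation on the client screen, the client may determine the normalized coordinates of the region in the complete picture, so that it can later be mapped into each level of video picture. The client may then provide the information of the region of interest to the server.
Step 3: Compute the minimum number of grids that can cover the region of interest in each level of video picture.
Similarly to Embodiment 3, after receiving the coordinate information of the region of interest, the server may map the normalized coordinates of the region of interest onto the gridding information of each level that it recorded when partitioning the multi-level video pictures into grids, to obtain the pixel-level coordinates of the region of interest in each level of video picture. Accordingly, from those pixel-level coordinates and the grid coordinates in each level of video picture, the server can determine the minimum number of grids that can cover the region of interest in each level, as reflected by the gray grids in each level of video picture in FIG. 10.
Step 4: According to the client's decoding capability, select the video picture with the highest possible resolution and determine the grids in it that cover the region of interest.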
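According to the decoding capability received from the client, the server may select, from the multi-level video pictures, a video picture that matches the client's decoding capability, and determine, among the grids of the selected video picture, the grids corresponding to the region of interest. For example, continuing with the example of FIG. 10, the 12 grids shown in gray in the third-level video picture may be taken as the grids corresponding to the region of interest. Optionally, the server may also compute the relative coordinates of the region of interest within the gray area formed by the 12 grids so determined, for later cropping away the parts of the picture outside the user's region of interest.
Step 5: Push the video streams to the client.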
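After the server has selected a suitable video picture from the multi-level video pictures and determined, among the grids of the selected video picture, the grids corresponding to the region of interest, it may push the video streams of these grids to the client. It can be understood that each video stream sent to the client must identify its grid number in some way, so that the client can stitch and reassemble the grids. Continuing with the example of FIG. 10, the server sends 12 grids to the client in total, and the information provided includes: 4 grid rows, 3 grid columns, and a size of 384×216 for each grid. In addition, when the server sends the video stream of each grid, each stream is required to carry its own grid coordinates, namely (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2), (3,0), (3,1) and (3,2). Optionally, the relative coordinates, determined as above, of the region of interest within the gray area formed by these 12 grids may also be included, for later cropping away the parts of the picture outside the user's region of interest.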
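The layout information and per-stream coordinates pushed by the server can be sketched as follows. The function and field names are hypothetical; the local coordinates are (row, column) pairs relative to the top-left grid of the pushed block, and the block is assumed to span columns 1 to 3 and rows 2 to 5 of the third level.

```python
def push_metadata(first_col, first_row, last_col, last_row, grid_w, grid_h):
    """Describe the block of grids being pushed and the label carried by each stream."""
    rows = last_row - first_row + 1
    cols = last_col - first_col + 1
    labels = [(row, col) for row in range(rows) for col in range(cols)]
    return {"rows": rows, "cols": cols, "grid_size": (grid_w, grid_h), "labels": labels}


meta = push_metadata(first_col=1, first_row=2, last_col=3, last_row=5,
                     grid_w=384, grid_h=216)
```

Because the labels are local to the pushed block, the client can stitch the grids without knowing the gridding information of the whole level.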
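Step 6: The client decodes and presents the video streams after receiving them.
Similarly to Embodiment 3, the client may decode each of the received encoded grid video data streams, stitch the decoded grids together and optionally crop the result, and then present the video picture corresponding to the region of interest on its screen.
Embodiment 5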
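As discussed above, the region of interest specified at the client may be characterized in various ways. For example, the above describes characterizing the region of interest by its coordinate information and specifying it by a drag gesture of the user. It can be understood that this way of characterizing the region of interest and the drag gesture are merely illustrative, and the present disclosure is not limited thereto. For example, when a user watches live or on-demand video on a mobile phone, tablet computer, PDA or the like and wishes to watch interactively, the user may frame the region of interest on the screen by a drag operation with a finger or another operating body (such as a stylus). In response, the client may itself request the video content of the corresponding grids from the server according to the gridding information and the coordinate information of the region of interest; or the client may receive the video content of the grids that the server determines and pushes according to the region of interest, and interactive viewing then proceeds. As another example, when the user watches interactively on a notebook computer, desktop computer, workstation or similar device, the user may select part of the screen as the region of interest with an input device such as a mouse or touchpad, and view the video content corresponding to the region of interest in a manner similar to the above. As another example, when the user watches live or on-demand video on a television, projector or the like, the user may select the region of interest with a remote control or the like and thereby view the details of the region of interest. As a further example, for any of the devices mentioned above, information on a region of interest entered in other ways may be determined, for instance by analyzing a voice command from the user (for example, the user saying "I want to see the details of the top-left part of the picture") or by motion capture of the user's body movements. As yet another example, the user may indicate, by text input, voice input or the like, the name of an object of interest (for example, the name or number of an athlete in a live sports event, or the name of a particular building in a high-definition street-view video), and the object of interest together with a predetermined range around it may then be taken as the region of interest for interactive viewing.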
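It can be understood that, as described above, in actual viewing the region of interest differs from user to user and changes continually during viewing, so each client operates independently and sends the server playback requests specific to that client. For a relatively static region of interest, when the user wants to change the region to watch, the user only needs to select a new region of interest on the video picture in order to see the video picture of the new region. As an illustrative example, when the content of the video source is an ultra-high-definition surveillance picture of a street scene, a user (for example, a security officer) may initially be concerned only with the entrance area of a particular building, and this may remain unchanged for minutes or hours; for such a relatively static region of interest, the server need only push the video content of the same few grids to the client throughout this period. If the user later wants to look at another region, the user simply selects a new region of interest, whereupon the client requests the video content of a new set of grids from the server, or receives the video content of a new set of grids pushed by the server.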
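However, the inventors have noted that some regions of interest in the video source content may be relatively dynamic; for example, they may contain an object of interest moving at a certain speed. For a live sports event, for instance, a particular athlete may be the user's object of interest, and the user may want to concentrate on the details of that athlete's performance in the event. In this case, given that the athlete is constantly moving, having the user repeatedly reselect a new region of interest over time is impractical and imposes a heavy operating burden on the user. In view of this, for regions of interest that may be dynamic, the present disclosure proposes a grid determination method based on target tracking and a corresponding interactive viewing mode.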
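For example, for an object of interest that is specified at the client for viewing (including a person or object that may be in motion), successive pictures of the video may be analyzed using an optical flow analysis algorithm, a mean-shift algorithm, a Kalman filter algorithm, a particle filter algorithm or the like, so as to track the motion of the object of interest between successive pictures of the video. Additionally or alternatively, a machine learning model may be used to track the motion of the object of interest between successive pictures, for example a convolutional neural network, a recurrent neural network, logistic regression, linear regression, a random forest, a support vector machine model, a deep learning model, or any other form of machine learning model or algorithm. It can be understood that the present disclosure may use other suitable means of automatically determining the position of the object or region of interest by analyzing the video pictures, as the basis for subsequently determining the grids corresponding to the region of interest.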
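As a rough illustration of one of the listed options, the sketch below smooths a tracked coordinate with a scalar Kalman filter under a random-walk motion model. It is a deliberately simplified stand-in rather than the tracking method of the disclosure: the noise variances are arbitrary assumptions, and a practical tracker would also model velocity and obtain its measurements from an object detector.

```python
class ScalarKalman:
    """One-dimensional Kalman filter with a random-walk state model."""

    def __init__(self, initial, process_var=1e-3, measurement_var=1e-2):
        self.estimate = initial
        self.variance = 1.0          # initial uncertainty of the estimate
        self.process_var = process_var
        self.measurement_var = measurement_var

    def update(self, measurement):
        # Predict: the state is assumed unchanged, but its uncertainty grows.
        self.variance += self.process_var
        # Correct: blend prediction and measurement according to the Kalman gain.
        gain = self.variance / (self.variance + self.measurement_var)
        self.estimate += gain * (measurement - self.estimate)
        self.variance *= 1.0 - gain
        return self.estimate


# Track the normalized x coordinate of an object's center across 20 frames
# in which a (hypothetical) detector keeps reporting the position 0.5.
tracker_x = ScalarKalman(initial=0.2)
for _ in range(20):
    x = tracker_x.update(0.5)
```

One filter per coordinate (center x, center y) would give a smoothed position from which the region of interest can be rebuilt for each frame.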
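It can be understood that the tracking of the object of interest may be determined or predicted locally by the client through analysis of successive video pictures, or by the server through such analysis. Where the server tracks the object of interest by analyzing successive video pictures, the server may determine, from the gridding information it has recorded, the grids corresponding to the object of interest or to a region of interest containing it (optionally selecting grids at a suitable picture quality), and then push the video data of those grids to the client. To reduce the computational burden of tracking on the server, the tracking task may instead be offloaded to the client and performed locally; once the client has tracked the object of interest, it may determine, from the gridding information obtained from the server, the grids corresponding to the object of interest or to a region of interest containing it (optionally selecting grids at a suitable picture quality), and request the video data of those grids from the server for viewing.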
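It should be noted that, for an object of interest that may move relatively fast, the object together with a predetermined range around it may be taken as the region of interest; that is, the determined object of interest is extended by a certain picture range in every direction as a margin. This avoids having to redetermine the grids corresponding to the region of interest too often because the object moves too frequently, so that the grids requested from the server, or pushed by the server, change relatively smoothly from one moment to the next, which reduces the load placed on the server.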
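The margin described above can be sketched as follows, using normalized coordinates. The function names and the margin value are hypothetical; the region of interest is kept unchanged while the tracked object stays inside it and is rebuilt only when the object leaves it.

```python
def expand_with_margin(box, margin):
    """Grow a normalized box (x0, y0, x1, y1) by a margin, clamped to the picture."""
    x0, y0, x1, y1 = box
    return (max(0.0, x0 - margin), max(0.0, y0 - margin),
            min(1.0, x1 + margin), min(1.0, y1 + margin))


def contains(outer, inner):
    """Return True if the inner box lies entirely within the outer box."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def update_roi(current_roi, object_box, margin):
    """Keep the current region while the object stays inside it; otherwise rebuild it."""
    if current_roi is not None and contains(current_roi, object_box):
        return current_roi
    return expand_with_margin(object_box, margin)


roi = update_roi(None, (0.25, 0.25, 0.5, 0.5), margin=0.125)
roi_after_small_move = update_roi(roi, (0.3, 0.3, 0.55, 0.55), margin=0.125)
roi_after_large_move = update_roi(roi, (0.6, 0.6, 0.85, 0.85), margin=0.125)
```

Since the grids are derived from the region of interest, an unchanged region means an unchanged set of requested grids, which is what keeps the requests stable between frames.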
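With the grid determination method based on target tracking and the corresponding interactive viewing mode according to embodiments of the present disclosure, whichever way the tracking is performed, the video streams of the grids corresponding to a moving object of interest can be obtained, decoded and stitched at the client, and presented. This removes the need for the user to select the region of interest manually and repeatedly, and reduces the user's operating burden.
Embodiment 6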
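As discussed above, each level of the multi-level video pictures may be partitioned using uniform grid partitioning, in which case all grids within the same video picture have the same grid size and resolution. Non-uniform grid partitioning may of course also be used, so that the grids within any one level of video picture need not all have the same grid size and/or resolution. In embodiments of the present disclosure, various factors may be considered in deciding whether to use non-uniform grid partitioning. For example, in the live picture of a sports event, the top of the panoramic picture may correspond to the sky (outdoors) or the roof of the venue (indoors), the bottom may correspond to the spectator stands, and only the middle may correspond to the field of play and the athletes. Accordingly, for the many users watching the live event, the content in the middle of the video picture is likely to interest most viewers and has a high probability of being selected as a region of interest (for example, viewers wish to see details within it), whereas the top and bottom of the picture are likely to interest only a few viewers and have a low probability of being selected as a region of interest. Non-uniform grid partitioning may therefore be performed on the basis of how interested users are in each region of the whole picture. For example, for a given sports event, the high-interest and low-interest regions of the whole picture of the video source may be determined from the number of times and the frequency with which each region is selected by viewers as a region of interest in the current viewing records. Additionally or alternatively, they may be determined from the number of times and the frequency with which each region was selected by viewers as a region of interest in historical viewing records (for example, earlier events at the same venue).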
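In view of this, in the present disclosure the video picture may be partitioned non-uniformly on the basis of the high-interest and low-interest regions determined from the users' level of interest. An example of non-uniform grid partitioning of a video picture in the video data processing method supporting interactive viewing according to an embodiment of the present disclosure is described below with reference to FIG. 13 and FIG. 14, where FIG. 13 is a schematic diagram of determining the high-interest and low-interest regions of the video source in that method, and FIG. 14 is a schematic diagram of non-uniform grid partitioning of a video picture in that method.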
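As shown in FIG. 13, the whole picture of the video source may be divided into a high-interest region in the middle of the picture and two low-interest regions at the top and the bottom, according to the number of times users select each region of the video source as a region of interest during the current viewing and/or during historical viewing of the same kind of event. In embodiments of the present disclosure, because the video picture within a low-interest region is unlikely to be selected as a region of interest, the low-interest regions may be partitioned at a lower picture quality, while the high-interest region is partitioned at a relatively high picture quality. In this way, at the cost of the viewing needs of a very small proportion of viewers, the video encoders are applied as far as possible to the regions that attract more attention, so as to maximize the utilization of the video encoders.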
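As an illustrative example, as shown in FIG. 14, the high-interest region in the middle of the original video picture may still be partitioned in the manner described above for the first-level video picture with reference to FIG. 9, giving the grid partitioning result for the high-interest region shown in part ② of FIG. 14. For the two low-interest regions, the top and bottom of the original video picture may first be downsampled to obtain downsampled versions of the two low-interest regions, and the downsampled low-interest regions (rather than the two low-interest regions of the original video picture) are then partitioned into grids, giving the grid partitioning results for the two low-interest regions shown in parts ① and ③ of FIG. 14. After the regions of different interest levels have been partitioned separately, the separate grid partitioning results may be assembled into a new picture; for example, the partitioning result for the high-interest region at original quality (part ② of FIG. 14) and the partitioning results for the two downsampled low-interest regions (parts ① and ③ of FIG. 14) may together form the new video picture. It can be seen that, because the low-interest regions are partitioned on the basis of the downsampled video picture, the resolution of the grids in part ② of the newly assembled video picture differs from that of the grids in parts ① and ③. Additionally or alternatively, the grid size in part ② of the newly assembled video picture may also differ from the grid size in parts ① and ③. Non-uniform grid partitioning of this kind effectively reduces the number of grids obtained from the two low-interest regions, and therefore the number of dedicated video encoders allocated to those grids.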
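It can of course be understood that a new, non-uniform video picture of the next level may be assembled in a similar way, and so on. In this case, for a level of video picture partitioned non-uniformly, more detailed information needs to be included in the gridding information of that level, for example the number of grids in a given row, the number of grids in a given column, the grid size of a given row, the grid size of a given column, or the grid size at a specified position, so that the details of each grid can be identified accurately and reliably.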
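The saving in grids, and hence in dedicated encoders, can be estimated with a small calculation. All numbers below are assumptions chosen for illustration (a 7680×4320 picture, 384×216 grids, top and bottom bands of 864 pixels each downsampled by a factor of 2); they are not taken from FIG. 14. The per-band records show the kind of extra gridding information a non-uniform level would need to carry.

```python
def ceil_div(a, b):
    return -(-a // b)


def band_grid_count(width, band_height, grid_w, grid_h, downsample=1):
    """Number of grids obtained from one horizontal band after optional downsampling."""
    scaled_w, scaled_h = width // downsample, band_height // downsample
    return ceil_div(scaled_w, grid_w) * ceil_div(scaled_h, grid_h)


WIDTH, HEIGHT, GRID_W, GRID_H = 7680, 4320, 384, 216  # assumed values

# (name, band height in source pixels, downsampling factor) -- all hypothetical.
bands = [("top", 864, 2), ("middle", 2592, 1), ("bottom", 864, 2)]

gridding_info = [
    {"band": name, "downsample": factor,
     "grids": band_grid_count(WIDTH, height, GRID_W, GRID_H, factor)}
    for name, height, factor in bands
]

uniform_total = band_grid_count(WIDTH, HEIGHT, GRID_W, GRID_H)
non_uniform_total = sum(entry["grids"] for entry in gridding_info)
```

Under these assumptions the non-uniform layout needs 280 encoders instead of 400, with the whole saving coming from the two low-interest bands.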
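With the technique of non-uniform grid partitioning of video pictures according to embodiments of the present disclosure, non-uniform grid partitioning can be performed on the basis of users' differing levels of interest in different regions of the whole picture, so that a limited number of video encoders can be allocated more sensibly and the utilization of video encoder resources is improved.
Embodiment 7
According to another aspect of the present disclosure, a video data processing device supporting interactive viewing is provided; the device 1500 is described in detail below with reference to FIG. 15. FIG. 15 is a hardware block diagram of a device according to an embodiment of the present disclosure. As shown in FIG. 15, the device 1500 includes a processor U1501 and a memory U1502.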
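The processor U1501 may be any apparatus with processing capability that can implement the functions of the embodiments of the present disclosure; for example, it may be a general-purpose processor, digital signal processor (DSP), ASIC, field-programmable gate array (FPGA) or other programmable logic device (PLD), discrete gate or transistor logic, discrete hardware component, or any combination thereof, designed to perform the functions described herein.
The memory U1502 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory, and may also include other removable/non-removable, volatile/non-volatile computer system memory, such as a hard disk drive, floppy disk, CD-ROM, DVD-ROM or other optical storage media.
In this embodiment, computer program instructions are stored in the memory U1502, and the processor U1501 can execute the instructions stored in the memory U1502. When executed by the processor, the computer program instructions cause the processor to perform the video data processing method supporting interactive viewing of the embodiments of the present disclosure. The method is substantially the same as that described above with reference to FIGS. 1 to 14 and, to avoid repetition, is not described again. Examples of the device include a computer, a server and a workstation.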
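According to another aspect of the present disclosure, a video data processing device supporting interactive viewing is provided; the device 1600 is described in detail below with reference to FIG. 16. FIG. 16 is a structural block diagram of a video data processing device supporting interactive viewing according to an embodiment of the present disclosure. As shown in FIG. 16, the device 1600 includes a video picture construction unit U1601, a grid partitioning unit U1602 and a video encoding unit U1603. These components may respectively perform the steps/functions of the video data processing method supporting interactive viewing described above with reference to FIGS. 1 to 14; to avoid repetition, only a brief description of the device is given below, and detailed description of the same details is omitted.
The video picture construction unit U1601 may obtain multi-level video pictures having the same video content but different resolutions. In embodiments of the present disclosure, the video picture construction unit U1601 may construct, in various ways, multi-level video pictures that have the same video content (that is, depict the same video picture, for example the same sports event) but different resolutions. For example, the video picture construction unit U1601 may downsample the video picture to obtain multi-level video pictures with different resolutions, as discussed above with reference to FIG. 9, for subsequent grid partitioning of each level.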
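The grid partitioning unit U1602 may partition each level of the multi-level video pictures into a plurality of grids, as discussed above with reference to FIG. 9. It should be noted that, when the grid partitioning unit U1602 partitions each level of video picture, the size of each grid should be far smaller than what a common client can decode; that is, the partitioning should allow a client to decode the video of several grids simultaneously in real time. After partitioning all levels of video picture, the grid partitioning unit U1602 obtains the complete gridding information of the multi-level video pictures, which may include, for example, the number of video pictures (also called the number of picture levels), the resolution of each level of video picture, the number of grids of each level (for example, the number of grids in the horizontal and vertical directions), the grid size of each level, and the grid coordinates of each grid.
The video encoding unit U1603 may include, for each grid of each level of video picture, a video encoder dedicated to that grid. It can be understood that, after the server partitions each level of video picture into grids, each grid (and its video stream) may be assigned a number. Accordingly, each grid may be allocated its own dedicated video encoder within the video encoding unit U1603, so that the video data stream of each grid is managed independently on a per-grid basis. Each video encoder in the video encoding unit U1603 may encode the video data stream of its grid to obtain the encoded video data stream of that grid.
Optionally, the device 1600 may further include a video stream providing unit (not shown), which may be configured to: in response to a video playback request from a client, select from the multi-level video pictures a video picture that matches the client's decoding capability; determine, among the grids of the selected video picture, at least one grid corresponding to the video content requested by the video playback request; and provide the encoded video data stream of the at least one grid to the client.
The video data processing technique supporting interactive viewing according to the present disclosure may also be implemented by providing a computer program product containing program code that implements the method or device, or by any storage medium storing such a computer program product.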
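The basic principles of the present disclosure have been described above in connection with specific embodiments. It should be noted, however, that the merits, advantages, effects and the like mentioned in the present disclosure are merely examples and not limitations, and they are not to be regarded as required of every embodiment of the present disclosure. Furthermore, the specific details disclosed above are given only for the purposes of illustration and ease of understanding, and are not limiting; the present disclosure need not be implemented using those specific details. In addition, features of one embodiment may be combined with features of one or more other embodiments to obtain further embodiments.
The block diagrams of components, apparatuses, devices and systems in the present disclosure are merely illustrative examples and are not intended to require or imply that connection, arrangement or configuration must be as shown in the block diagrams. As those skilled in the art will recognize, these components, apparatuses, devices and systems may be connected, arranged or configured in any manner. Words such as "including", "comprising" and "having" are open-ended terms meaning "including but not limited to", and may be used interchangeably with that phrase. The words "or" and "and" as used herein mean "and/or" and may be used interchangeably with it, unless the context clearly indicates otherwise. The term "such as" as used herein means the phrase "such as but not limited to" and may be used interchangeably with it.
In addition, as used herein, "or" used in a list of items beginning with "at least one of" indicates a disjunctive list, so that, for example, a list of "at least one of A, B or C" means A or B or C, or AB or AC or BC, or ABC (that is, A and B and C). Furthermore, the word "exemplary" does not mean that the example described is preferred or better than other examples.
It should also be noted that, in the apparatus and methods of the present disclosure, the components or steps may be decomposed and/or recombined. Such decompositions and/or recombinations are to be regarded as equivalents of the present disclosure.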
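Those of ordinary skill in the art will understand that all or any part of the methods and apparatus of the present disclosure may be implemented in hardware, firmware, software or a combination thereof, in any computing apparatus (including processors, storage media and the like) or network of computing apparatuses. The hardware may be a general-purpose processor, digital signal processor (DSP), ASIC, field-programmable gate array (FPGA) or other programmable logic device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof, designed to perform the functions described herein. The general-purpose processor may be a microprocessor but, alternatively, the processor may be any commercially available processor, controller, microcontroller or state machine. A processor may also be implemented as a combination of computing devices, for example a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors cooperating with a DSP core, or any other such configuration. The software may reside in any form of computer-readable tangible storage medium. By way of example and not limitation, such computer-readable tangible storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. As used herein, disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc.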
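Various changes, substitutions and alterations may be made to the techniques described herein without departing from the teachings defined by the appended claims. Furthermore, the scope of the claims of the present disclosure is not limited to the specific aspects of the processes, machines, manufacture, compositions of matter, means, methods and acts described above. It is possible to use, in so far as they perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein, presently existing or later-developed processes, machines, manufacture, compositions of matter, means, methods or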

Claims (8)

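  1. A video data processing method supporting interactive viewing, comprising:
    partitioning a video picture into a plurality of grids;
    for each grid of the plurality of grids, allocating, within a single video data processing device, a video encoder dedicated to that grid to encode the video data stream of that grid; and
    in response to a video playback request from a client, providing the encoded video data stream of at least one grid of the plurality of grids,
    wherein partitioning the video picture into a plurality of grids comprises:
    downsampling the video picture to obtain multi-level video pictures having different resolutions; and
    partitioning each level of the multi-level video pictures into a plurality of grids, wherein, within the single video data processing device, each grid is allocated a video encoder dedicated to that grid,
    wherein providing, in response to the video playback request from the client, the encoded video data stream of at least one grid of the plurality of grids comprises:
    in response to the video playback request from the client, starting from the video picture having the highest resolution, successively determining, among the plurality of grids of the video picture, at least one grid corresponding to the video content requested by the video playback request, and determining whether the number of the at least one grid exceeds the decoding capability of the client, so as to select, from the multi-level video pictures, a video picture that has the highest possible resolution and matches the decoding capability of the client, together with the at least one grid in the selected video picture, wherein the decoding capability of the client indicates the number of grids that the client can decode simultaneously for each of various grid sizes; and
    providing the encoded video data stream of the at least one grid to the client.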
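  2. The method according to claim 1, wherein providing, in response to the video playback request from the client, the encoded video data stream of at least one grid of the plurality of grids comprises:
    determining, according to a region of interest in the video picture specified by the client, the at least one grid corresponding to the region of interest from the plurality of grids; and
    providing the encoded video data stream of the at least one grid.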
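  3. The method according to claim 2, wherein determining, according to the region of interest in the video picture specified by the client, the at least one grid corresponding to the region of interest from the plurality of grids comprises:
    obtaining the decoding capability of the client from the client;
    determining the minimum number of grids required to cover the region of interest in each level of video picture;
    determining the levels of video picture for which the minimum number of grids does not exceed the decoding capability of the client, and selecting from them the video picture with the highest resolution; and
    determining the at least one grid covering the region of interest in the selected video picture.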
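  4. The method according to claim 2, wherein determining, according to the region of interest in the video picture specified by the client, the at least one grid corresponding to the region of interest from the plurality of grids comprises:
    at the client, obtaining the gridding information of each level of video picture from a server;
    at the client, determining, according to the gridding information, the minimum number of grids required to cover the region of interest in each level of video picture;
    determining the levels of video picture for which the minimum number of grids does not exceed the decoding capability of the client, and selecting from them the video picture with the highest resolution; and
    determining the at least one grid covering the region of interest in the selected video picture.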
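  5. The method according to claim 4, wherein the gridding information includes one or more of the following: the number of video pictures in the multi-level video pictures, the resolution of each level of video picture, the number of grids of each level of video picture, the grid size of each level of video picture, and the grid coordinates of each grid.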
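  6. A video data processing method, comprising:
    obtaining multi-level video pictures having the same video content and different resolutions;
    partitioning each level of the multi-level video pictures into a plurality of grids;
    for each grid of the plurality of grids of each level of video picture, allocating, within a single video data processing device, a video encoder dedicated to that grid; and
    encoding, with each video encoder, the video data stream of the corresponding grid to obtain the encoded video data stream of that grid,
    the method further comprising:
    in response to a video playback request from a client, starting from the video picture having the highest resolution, successively determining, among the plurality of grids of the video picture, at least one grid corresponding to the video content requested by the video playback request, and determining whether the number of the at least one grid exceeds the decoding capability of the client, so as to select, from the multi-level video pictures, a video picture that has the highest possible resolution and matches the decoding capability of the client, together with the at least one grid in the selected video picture, wherein the decoding capability of the client indicates the number of grids that the client can decode simultaneously for each of various grid sizes; and
    providing the encoded video data stream of the at least one grid to the client.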
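  7. A video data processing device supporting interactive viewing, comprising:
    a processor; and
    a memory storing computer program instructions,
    wherein the computer program instructions, when executed by the processor, cause the processor to perform the following steps:
    obtaining multi-level video pictures having the same video content and different resolutions;
    partitioning each level of the multi-level video pictures into a plurality of grids;
    for each grid of the plurality of grids of each level of video picture, allocating, within a single video data processing device, a video encoder dedicated to that grid;
    encoding, with each video encoder, the video data stream of the corresponding grid to obtain the encoded video data stream of that grid;
    in response to a video playback request from a client, starting from the video picture having the highest resolution, successively determining, among the plurality of grids of the video picture, at least one grid corresponding to the video content requested by the video playback request, and determining whether the number of the at least one grid exceeds the decoding capability of the client, so as to select, from the multi-level video pictures, a video picture that has the highest possible resolution and matches the decoding capability of the client, together with the at least one grid in the selected video picture, wherein the decoding capability of the client indicates the number of grids that the client can decode simultaneously for each of various grid sizes; and
    providing the encoded video data stream of the at least one grid to the client.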
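  8. A system supporting interactive viewing, comprising:
    a server configured to:
    obtain multi-level video pictures having the same video content and different resolutions;
    partition each level of the multi-level video pictures into a plurality of grids;
    for each grid of the plurality of grids of each level of video picture, allocate, within a single video data processing device, a video encoder dedicated to that grid; and
    encode, with each video encoder, the video data stream of the corresponding grid to obtain the encoded video data stream of that grid; and
    a client configured to send a video playback request to the server,
    wherein the server is further configured to:
    in response to the video playback request from the client, starting from the video picture having the highest resolution, successively determine, among the plurality of grids of the video picture, at least one grid corresponding to the video content requested by the video playback request, and determine whether the number of the at least one grid exceeds the decoding capability of the client, so as to select, from the multi-level video pictures, a video picture that has the highest possible resolution and matches the decoding capability of the client, together with the at least one grid in the selected video picture, wherein the decoding capability of the client indicates the number of grids that the client can decode simultaneously for each of various grid sizes; and
    provide the encoded video data stream of the at least one grid to the client.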
PCT/CN2022/128146 2021-12-10 2022-10-28 Video data processing method, device and system supporting interactive viewing WO2023103641A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111505299.2A CN113905256B (zh) 2021-12-10 2021-12-10 Video data processing method, device and system supporting interactive viewing
CN202111505299.2 2021-12-10

Publications (1)

Publication Number Publication Date
WO2023103641A1 true WO2023103641A1 (zh) 2023-06-15

Family

ID=79025598

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/128146 WO2023103641A1 (zh) 2021-12-10 2022-10-28 Video data processing method, device and system supporting interactive viewing

Country Status (2)

Country Link
CN (1) CN113905256B (zh)
WO (1) WO2023103641A1 (zh)

Also Published As

Publication number Publication date
CN113905256A (zh) 2022-01-07
CN113905256B (zh) 2022-04-12

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22903065

Country of ref document: EP

Kind code of ref document: A1