WO2021072694A1 - Adaptive resolution coding based on machine learning model - Google Patents

Adaptive resolution coding based on machine learning model

Info

Publication number
WO2021072694A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
bit rate
frame
video frame
resizing
Prior art date
Application number
PCT/CN2019/111598
Other languages
French (fr)
Inventor
Ran Wang
Yuchen SUN
Tsuishan CHANG
Changguo CHEN
Jian Lou
Original Assignee
Alibaba Group Holding Limited
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Limited
Priority to PCT/CN2019/111598
Publication of WO2021072694A1


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146Data rate or code amount at the encoder output
    • H04N19/15Data rate or code amount at the encoder output by monitoring actual compressed data size at the memory before deciding storage at the transmission buffer
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/59Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution

Definitions

  • Video streaming and downloading/uploading are very common in people’s daily lives nowadays.
  • a user may send or upload a video file from one device to another device such as a server or a computing device of another user.
  • videos of a higher quality or resolution such as high definition videos, ultra-high definition videos, etc.
  • These high-quality or high-resolution videos usually have large file sizes, which may be of several hundred megabytes to several gigabytes, etc.
  • These high-quality or high-resolution videos not only require a long period of time for uploading and transmitting over a communication network, but also incur a huge amount of traffic on the network, thus having a high transmission cost in terms of time and network bandwidth.
  • At least one video frame and a corresponding bit rate of the at least one video frame of a video may be determined.
  • the video may be a streaming video or a subset (such as a segment) of a stored video.
  • the at least one video frame and the corresponding bit rate may be inputted into a machine learning model to obtain a recommended resolution.
  • the at least one video frame and one or more other video frames associated with the at least one video frame may then be resized or resampled (e.g., downsampled) according to the recommended resolution.
  • the at least one video frame and the one or more other video frames may be encoded to obtain an encoded video according to a target bit rate.
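The five steps above can be sketched as a minimal pipeline. This is a hypothetical skeleton, not the patent's implementation: `model`, `resize`, and `encode` are placeholder callables standing in for a trained machine learning model, a resampling filter, and a rate-controlled encoder.

```python
def adaptive_resolution_encode(frames, target_bitrate, model, resize, encode):
    """Sketch of the described flow: pick at least one video frame,
    ask the model for a recommended resolution given the frame and
    bit rate, resize all frames, then encode at the target bit rate."""
    key_frame = frames[0]                       # at least one video frame
    rec_res = model(key_frame, target_bitrate)  # recommended resolution
    resized = [resize(f, rec_res) for f in frames]
    return encode(resized, target_bitrate)
```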
  • FIG. 1 illustrates an example relationship between bit rates and qualities of an example video.
  • FIG. 2 illustrates an example environment in which an adaptive resolution coding system may be used.
  • FIG. 3 illustrates an example adaptive resolution coding system in more detail.
  • FIG. 4 illustrates an example method of adaptive resolution coding.
  • FIG. 5 illustrates another example method of adaptive resolution coding.
  • the adaptive resolution coding system may first downsample the given video by a certain sampling ratio, and then encode the downsampled video at the target bit rate.
  • the adaptive resolution coding system may then transmit the encoded video to another device over a communication network, so that the other device can restore the given video (e.g., restore an original resolution of the given video) by decoding and upsampling the encoded video.
  • the quality of the video that is restored at a receiving end depends on an amount of downsampling (or a downsampling ratio) that is performed at a sending end.
  • FIG. 1 shows an example relationship 100 between bit rates and qualities of an example video that is restored after successive operations of downsampling, encoding, decoding, and upsampling.
  • the adaptive resolution coding system may employ a machine learning model to determine an optimal resolution or downsampling ratio for resizing an input video of an input resolution before encoding and transmitting the video at or around a specific bit rate to another device over a communication network.
  • the machine learning model may be trained using a training sample set of different videos having a particular resolution or different resolutions and respective known values of optimal downsampling ratios that produce the best qualities for the different videos. After values of parameters (such as weights) of the machine learning model are determined, the adaptive resolution coding system may apply the machine learning model to determine a recommended downsampling ratio or resolution for an input video.
  • the machine learning model may include, but is not limited to, a neural network model such as a convolutional neural network (CNN) , a Bayesian network, a decision tree, etc.
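As a rough illustration of what the trained model's input/output contract looks like, the toy scorer below maps a frame and a bit rate to one of a hypothetical candidate set of resolutions. The candidate list, the bits-per-pixel heuristic, and the complexity feature are all invented for this sketch; the patent instead trains a model (e.g., a CNN) on videos with known optimal downsampling ratios.

```python
import numpy as np

# Hypothetical candidate set; the text does not enumerate resolutions.
CANDIDATES = [(1920, 1080), (1280, 720), (960, 540), (640, 360)]

def recommend_resolution(frame, bitrate, min_bpp=0.05):
    """Toy stand-in for the trained model: choose the largest candidate
    resolution that still gets enough bits per pixel at the given bit
    rate, demanding more bits for busier (higher-gradient) frames."""
    complexity = float(np.mean(np.abs(np.diff(frame.astype(np.float64), axis=1))))
    needed_bpp = min_bpp * (1.0 + complexity / 255.0)
    for (w, h) in CANDIDATES:                 # largest first
        if bitrate / (w * h) >= needed_bpp:
            return (w, h)
    return CANDIDATES[-1]                     # fall back to the smallest
```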
  • the described adaptive resolution coding system may receive an input video having an input resolution and an instruction to transmit the input video at a certain bit rate (or a target bit rate) .
  • the adaptive resolution coding system may obtain one or more frames (such as intra frames) and respective one or more bit rates from the input video.
  • the adaptive resolution coding system may attempt to encode the input video at the target bit rate, and obtain one or more intra frames and respective one or more bit rates after encoding.
  • the adaptive resolution coding system may then input the one or more frames and the respective one or more bit rates into a trained machine learning model to obtain a recommended resolution or sampling ratio for resizing (e.g., downsampling) the input video.
  • the adaptive resolution coding system may resize the input video from the input resolution to the recommended resolution, and encode the resized input video according to a target bit rate for transmission over a communication network, thus reducing the transmission cost of the video while ensuring a high quality of the video after restoration (i.e., decoding and upsampling, for example) .
  • the input video may include, but is not limited to, some or all of a stored video, or some or all of a streaming video, etc.
  • functions described herein to be performed by the adaptive resolution coding system may be performed by multiple separate units or services.
  • a receiving service may receive an input video and an instruction including a target bit rate, while an acquisition service may obtain one or more frames and respective one or more bit rates from the input video.
  • a determination service may obtain a recommended resolution or sampling ratio for resizing (e.g., downsampling) the input video based on a machine learning model.
  • an encoding service may encode the resized input video according to a target bit rate, while a transmission service may transmit the encoded video to another device over a communication network.
  • the adaptive resolution coding system may be implemented as software and/or hardware installed in a single device, in other examples, the adaptive resolution coding system may be implemented and distributed in multiple devices or as services provided in one or more servers over a network and/or in a cloud computing architecture.
  • the application describes multiple and varied embodiments and implementations.
  • the following section describes an example framework that is suitable for practicing various implementations.
  • the application describes example systems, devices, and processes for implementing an adaptive resolution coding system.
  • FIG. 2 illustrates an example environment 200 usable to implement an adaptive resolution coding system.
  • the environment 200 may include an adaptive resolution coding system 202.
  • the adaptive resolution coding system 202 is described to be included in a client device 204.
  • the adaptive resolution coding system 202 may exist as an individual entity or device.
  • the environment 200 may further include another client device 206 and a server 208.
  • the adaptive resolution coding system 202 or the client device 204 may communicate data with the other client device 206 and the server 208 over a network 210.
  • the server 208 may be a server of a plurality of servers in a cloud or a data center.
  • functions of the adaptive resolution coding system 202 may be included in or provided by the client device 204. In implementations, some or all of the functions of the adaptive resolution coding system 202 may be included in a cloud computing system or architecture, and may be provided as services to the client device 204.
  • the client device 204 or the client device 206 may be implemented as any of a variety of computing devices including, but not limited to, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet or slate computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc. ) , a server computer, etc., or a combination thereof.
  • the network 210 may be a wireless or a wired network, or a combination thereof.
  • the network 210 may be a collection of individual networks interconnected with each other and functioning as a single large network (e.g., the Internet or an intranet) . Examples of such individual networks include, but are not limited to, telephone networks, cable networks, Local Area Networks (LANs) , Wide Area Networks (WANs) , and Metropolitan Area Networks (MANs) . Further, the individual networks may be wireless or wired networks, or a combination thereof.
  • Wired networks may include an electrical carrier connection (such as a communication cable, etc.) and/or an optical carrier connection (such as an optical fiber connection, etc.).
  • Wireless networks may include, for example, a WiFi network, other radio frequency networks (e.g., Zigbee, etc. ) , etc.
  • the adaptive resolution coding system 202 may receive an instruction to transmit an input video at a target bit rate.
  • the adaptive resolution coding system 202 may determine a recommended resolution or downsampling ratio for the input video based on a machine learning model, downsample the input video according to the recommended resolution or downsampling ratio, and encode the input video to obtain an encoded video for storage or transmission by the client device 204 or the adaptive resolution coding system 202.
  • FIG. 3 illustrates the adaptive resolution coding system 202 in more detail.
  • the adaptive resolution coding system 202 may include, but is not limited to, one or more processors 302, memory 304, and program data 306.
  • the adaptive resolution coding system 202 may further include one or more encoders 308, an input/output (I/O) interface 310, and/or a network interface 312.
  • some or all of the functions of the adaptive resolution coding system 202 may be implemented using hardware, for example, an ASIC (i.e., Application-Specific Integrated Circuit), an FPGA (i.e., Field-Programmable Gate Array), and/or other hardware.
  • the one or more encoders 308 of the adaptive resolution coding system 202 may be implemented using an ASIC, an FPGA, and/or any other hardware.
  • the one or more processors 302 are configured to execute instructions that are stored in the memory 304, and/or received from the input/output interface 310, and/or the network interface 312.
  • the one or more processors 302 may be implemented as one or more hardware processors including, for example, a microprocessor, an application-specific instruction-set processor, a physics processing unit (PPU) , a central processing unit (CPU) , a graphics processing unit, a digital signal processor, etc. Additionally or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components.
  • Illustrative types of hardware logic components that can be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), and complex programmable logic devices (CPLDs).
  • the memory 304 may include processor-readable media in a form of volatile memory, such as Random Access Memory (RAM) and/or non-volatile memory, such as read only memory (ROM) or flash RAM.
  • the processor-readable media may include a volatile or non-volatile type, a removable or non-removable media, which may achieve storage of information using any method or technology.
  • the information may include a processor-readable instruction, a data structure, a program module or other data.
  • Examples of processor-readable media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electronically erasable programmable read-only memory (EEPROM), quick flash memory or other internal storage technology, compact disk read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media, which may be used to store information that may be accessed by a computing device.
  • the processor-readable media does not include any transitory media, such as modulated data signals and carrier waves.
  • the adaptive resolution coding system 202 may further include other hardware components and/or other software components such as program units to execute instructions stored in the memory 304 for performing various operations such as processing, determination, allocation, storage, etc.
  • the adaptive resolution coding system 202 may further include a model database 314 that is configured to store information of one or more trained machine learning models used for determining recommended resolutions for videos of different input resolutions.
  • FIGS. 4 and 5 show schematic diagrams depicting example methods of adaptive resolution coding.
  • the methods of FIGS. 4 and 5 may, but need not, be implemented in the environment of FIG. 2 and using the system of FIG. 3.
  • methods 400 and 500 are described with reference to FIGS. 1-3. However, the methods 400 and 500 may alternatively be implemented in other environments and/or using other systems.
  • the methods 400 and 500 are described in the general context of computer-executable instructions.
  • computer-executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, and the like that perform particular functions or implement particular abstract data types.
  • each of the example methods is illustrated as a collection of blocks in a logical flow graph representing a sequence of operations that can be implemented in hardware, software, firmware, or a combination thereof.
  • the order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method, or alternate methods. Additionally, individual blocks may be omitted from the method without departing from the spirit and scope of the subject matter described herein.
  • the blocks represent computer instructions that, when executed by one or more processors, perform the recited operations.
  • some or all of the blocks may represent application specific integrated circuits (ASICs) or other physical components that perform the recited operations.
  • the adaptive resolution coding system 202 may receive an input video and information of a target bit rate.
  • the adaptive resolution coding system 202 may receive an instruction to encode an input video of a certain resolution (which is referred to as the input resolution) at a target bit rate (or a target bandwidth) from a user of the client device 204.
  • the input video may include, but is not limited to, a stored video or a streaming video.
  • the input video may be a subset (e.g., a segment) of a stored video or streaming video.
  • the adaptive resolution coding system 202 may determine or obtain image information of at least one video frame and a bit rate of the at least one video frame from the input video.
  • the adaptive resolution coding system 202 may determine or obtain image information of at least one video frame and a bit rate of the at least one video frame from the input video, which can be used as an input to a trained machine learning model for determining or obtaining a recommended resolution (or a resampling ratio) .
  • the adaptive resolution coding system 202 may encode the input video of the input resolution at the target bit rate, and extract or obtain image information of at least one video frame and a bit rate of the at least one video frame from the encoded video.
  • encoding the input video of the input resolution at the target bit rate may include compressing a size of the input video, so that an average bit rate of transmitting the compressed video over a communication network (such as the network 210) is at or around the target bit rate.
  • the adaptive resolution coding system 202 may obtain at least one video frame and a bit rate thereof by calculating a prediction residual of the input video and estimating resulting bits for the at least one video frame (e.g., the first video frame of the input video) and the bit rate thereof based on the prediction residual. In this case, the adaptive resolution coding system 202 may not need to encode or compress the input video completely. In implementations, the adaptive resolution coding system 202 may randomly select a portion of the input video, encode or compress the selected portion of the input video, and extract or obtain image information of at least one video frame and a bit rate of the at least one video frame from the encoded or compressed portion of the input video.
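The residual-based shortcut above (estimating resulting bits without fully encoding) could look like the toy estimator below. The log term loosely mimics entropy coding, where larger residual magnitudes cost more bits; the `bits_scale` constant is a hypothetical calibration factor, not something the text specifies.

```python
import numpy as np

def estimate_frame_bits(residual, bits_scale=0.1):
    """Very rough toy estimate of a frame's coded size from its
    prediction residual: larger residual energy implies more bits.
    `bits_scale` is a hypothetical calibration constant."""
    r = np.abs(residual.astype(np.float64))
    return float(np.sum(np.log2(1.0 + r)) * bits_scale)
```

Dividing the estimated bits by the frame's display duration would then give an approximate bit rate for that frame.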
  • the at least one video frame may include, but is not limited to, an intra frame that is representative of the input video from among all intra frames of the input video, or an intra frame that is randomly selected from among all the intra frames of the input video.
  • the image information of the at least one video frame may include, but is not limited to, image data of the at least one video frame (such as pixel values at each coordinate in the at least one video frame) . Additionally or alternatively, the image information of the at least one video frame may include feature data of the at least one video frame.
  • the adaptive resolution coding system 202 may perform feature extraction or detection on the at least one video frame after obtaining the at least one video frame from the input video, and additionally or alternatively use feature data that is obtained from the feature extraction or detection as the image information of the at least one video frame.
  • the feature extraction or detection may include, but are not limited to, edge detection, corner detection, blob detection, curvature detection, shape-based detection, Hough transform, etc.
  • one or more types of feature extraction may be performed on the at least one video frame.
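As one concrete instance of the feature extraction listed above, edge strength can be computed with a gradient-magnitude detector. This is a minimal stand-in chosen for illustration; Sobel, Canny, or learned features are common alternatives in practice.

```python
import numpy as np

def edge_feature_map(frame):
    """Gradient-magnitude edge detector via central differences.
    Returns a per-pixel edge strength map for a grayscale frame."""
    f = frame.astype(np.float64)
    gx = np.zeros_like(f)
    gy = np.zeros_like(f)
    gx[:, 1:-1] = f[:, 2:] - f[:, :-2]   # horizontal gradient
    gy[1:-1, :] = f[2:, :] - f[:-2, :]   # vertical gradient
    return np.hypot(gx, gy)
```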
  • the intra frame that is representative of the input video may include, but is not limited to, a first intra frame of the input video, an intra frame having a bit rate that is a median of bit rates associated with intra frames of the input video, an intra frame having a bit rate that is closest to an average of the bit rates associated with the intra frames of the input video, etc.
  • the at least one intra frame may include one or more intra frames that are representative of the input video, and/or one or more intra frames that are randomly selected from among the intra frames of the input video.
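The three selection strategies just listed (first frame, median bit rate, closest to the average bit rate) can be sketched as follows; the function returns the index of the chosen intra frame, and the strategy names are labels invented for this sketch.

```python
def representative_intra_frame(bitrates, strategy="median"):
    """Pick the index of a representative intra frame from a list of
    per-intra-frame bit rates, per the strategies described above."""
    if strategy == "first":
        return 0
    if strategy == "median":
        order = sorted(range(len(bitrates)), key=lambda i: bitrates[i])
        return order[len(order) // 2]          # frame with the median bit rate
    if strategy == "mean":
        avg = sum(bitrates) / len(bitrates)    # frame closest to the average
        return min(range(len(bitrates)), key=lambda i: abs(bitrates[i] - avg))
    raise ValueError(strategy)
```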
  • the adaptive resolution coding system 202 may divide the input video into a plurality of video segments. For example, the adaptive resolution coding system 202 may divide the input video into a plurality of video segments having the same time length such as one second, two seconds, etc. In implementations, the adaptive resolution coding system 202 may divide the input video (e.g., a stored video) into a predetermined number of video segments.
  • the adaptive resolution coding system 202 may divide the input video into a plurality of video segments based on any scene change detection (SCD) method and/or shot transition detection (STD) method.
  • the plurality of video segments that are obtained by the adaptive resolution coding system 202 may have different lengths.
  • an amount of change between two video frames may be used for detecting a presence of a scene change. For example, if scenes between two video frames are different, a residual obtained after performing motion compensation between these two video frames is usually large, or a difference between pixel values of these two video frames is usually large.
  • predetermined thresholds may be set for the residual associated with motion compensation between two video frames and/or the difference between pixel values of the two video frames.
  • if such a threshold is exceeded, the adaptive resolution coding system 202 may determine that a scene change occurs between the two video frames.
  • the adaptive resolution coding system 202 may divide the input video into a plurality of video segments, with boundaries of a video segment of the plurality of video segments corresponding to positions of respective scene changes that are detected.
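The threshold-based segmentation described above can be sketched with a mean-absolute-pixel-difference test; the `threshold` default is a hypothetical value, and a production system would more likely use a motion-compensated residual, as the text notes.

```python
import numpy as np

def segment_boundaries(frames, threshold=20.0):
    """Detect scene changes in a sequence of grayscale frames: a mean
    absolute pixel difference above `threshold` marks a cut. Returns
    the indices at which new video segments start."""
    cuts = [0]
    for i in range(1, len(frames)):
        diff = np.mean(np.abs(frames[i].astype(np.float64)
                              - frames[i - 1].astype(np.float64)))
        if diff > threshold:
            cuts.append(i)          # segment boundary at a scene change
    return cuts
```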
  • the input video may be divided into a plurality of video segments, and the adaptive resolution coding system 202 may determine or obtain at least one video frame and a respective bit rate from each video segment of the plurality of video segments according to a similar approach as described above.
  • the adaptive resolution coding system 202 may input the image information of the at least one video frame and the bit rate into a trained machine learning model to determine or obtain a recommended resolution.
  • the adaptive resolution coding system 202 may be associated with one or more trained machine learning models that are configured to receive image information of one or more video frames and respective one or more bit rates as inputs, and produce a recommended resolution (or resampling ratio) as an output.
  • the one or more trained machine learning models may be able to process video frames of a particular resolution or different resolutions.
  • the adaptive resolution coding system 202 may have one or more trained machine learning models that are stored in the memory 304, e.g., stored in the model database 314. The adaptive resolution coding system 202 may select a trained machine learning model from the model database 314, and input the image information of the at least one video frame and the bit rate into the trained machine learning model to obtain a recommended resolution (or resampling ratio) .
  • one or more trained machine learning models may be stored in a remote device (for example, a server, a cloud, or a data center, etc. ) that is accessible to the adaptive resolution coding system 202 through a network, e.g., the network 210.
  • the adaptive resolution coding system 202 may send the image information of the at least one video frame and the bit rate to the remote device through the network 210 to request the remote device for determining a recommended resolution (or resampling ratio) , and receive the recommended resolution from the remote device after the remote device determines the recommended resolution (or resampling ratio) using a trained machine learning model therein.
  • the image information of the at least one video frame inputted into the machine learning model may vary.
  • the image information of the at least one video frame may include image data (e.g., pixel values) of the at least one video frame, because feature extraction or detection can be performed in the first few layers of the neural network model.
  • the image information of the at least one video frame may include feature data of the at least one video frame, such as information about a presence or absence of certain features (such as an edge, a corner, a shape, etc. ) at different positions or coordinates on the at least one video frame.
  • the adaptive resolution coding system 202 may perform such feature extraction or detection at block 404 as described above.
  • the input video may be divided into a plurality of video segments as described above.
  • the adaptive resolution coding system 202 may use at least one respective video frame and a respective bit rate of each video segment as an input to a trained machine learning model to obtain or determine a respective recommended resolution (or resampling ratio) .
  • the adaptive resolution coding system 202 may determine or calculate a resulting resolution (or resampling ratio) based on the respective recommended resolutions (or resampling ratios) of the plurality of video segments as a recommended resolution (or resampling ratio) for the input video.
  • the adaptive resolution coding system 202 may further determine a resolution (or resampling ratio) that is representative of the respective recommended resolutions (or resampling ratios) of the plurality of video segments as the recommended resolution (or resampling ratio) for the input video.
  • the resolution (or resampling ratio) that is representative of the respective recommended resolutions (or resampling ratios) of the plurality of video segments may include, but is not limited to, an average of the respective recommended resolutions (or resampling ratios) of the plurality of video segments, a median of the respective recommended resolutions (or resampling ratios) of the plurality of video segments, etc.
  • the adaptive resolution coding system 202 may randomly select one of the respective recommended resolutions (or resampling ratios) of the plurality of video segments as the recommended resolution (or resampling ratio) for the input video.
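The average/median aggregation of per-segment recommendations described above can be sketched as follows; aggregating widths and heights independently is a simplifying assumption of this sketch, since the text does not say how a representative resolution is computed componentwise.

```python
import statistics

def aggregate_recommendations(segment_resolutions, method="median"):
    """Combine per-segment recommended resolutions into one value for
    the whole input video, via the median or average described above."""
    widths = [w for (w, _) in segment_resolutions]
    heights = [h for (_, h) in segment_resolutions]
    agg = statistics.median if method == "median" else statistics.mean
    return (int(agg(widths)), int(agg(heights)))
```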
  • the adaptive resolution coding system 202 may resample or resize the at least one video frame and one or more other video frames associated with the at least one video frame based on the recommended resolution (or resampling ratio) .
  • the adaptive resolution coding system 202 may resize or resample (e.g., downsample) the at least one video frame and one or more other video frames associated with the at least one video frame from the input resolution to the recommended resolution (or by the recommended resampling ratio) .
  • the adaptive resolution coding system 202 may downsample the input video from the input resolution to the recommended resolution (or by the recommended resampling ratio) .
  • the at least one video frame may include an intra frame, and the one or more other video frames associated with the at least one video frame may include inter frames depending on the intra frame.
  • the input video may be divided into a plurality of video segments as described above, and the adaptive resolution coding system 202 may resize or resample (e.g., downsample) the input video, including the plurality of video segments, from the input resolution to the same recommended resolution (or by the same recommended resampling ratio).
  • the input video may be divided into a plurality of video segments, and the adaptive resolution coding system 202 may further divide the plurality of video segments into a plurality of video groups that may not overlap with each other.
  • Each video group may include one or more video segments.
  • the adaptive resolution coding system 202 may divide the plurality of video segments into a plurality of video groups based on a predetermined number of video segments and/or a predetermined time period.
  • the adaptive resolution coding system 202 may group video segments whose recommended resolutions (or resampling ratios) have been determined and which have not been resized or resampled as an individual video group.
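One of the grouping options above, dividing by a predetermined number of segments, can be sketched as follows. This is a minimal illustration; `group_segments` is a hypothetical helper and not part of the described system:

```python
def group_segments(segments, group_size):
    """Split a list of video segments into non-overlapping video groups
    of at most `group_size` segments each (hypothetical helper)."""
    return [segments[i:i + group_size]
            for i in range(0, len(segments), group_size)]
```

Grouping by a predetermined time period would work analogously, accumulating segments until their combined duration reaches the period.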
  • the adaptive resolution coding system 202 may resize or resample (e.g., downsample) a video group to a recommended resolution (or by a recommended resampling ratio) associated with that video group, encode the resized video group, and send the encoded video group to another device over the network 210, without waiting for other subsequent video groups, thus further speeding up a process of transmitting the input video from one device to another device.
  • a recommended resolution (or a recommended resampling ratio) associated with a video group may be determined as described above by selecting a resolution (or a resampling ratio) that is representative of resolutions (or resampling ratios) of video frames included in the video group, or by randomly selecting a resolution (or a resampling ratio) from among the resolutions (or the resampling ratios) of the video frames included in the video group.
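The two selection strategies just described (a representative resolution versus a randomly selected one) might be sketched as below. Using the statistical mode as the "representative" statistic is an assumption for illustration; the disclosure does not fix a particular statistic:

```python
from statistics import mode
import random

def recommended_resolution(frame_resolutions, strategy="representative"):
    """Pick one resolution to apply to a whole video group.
    `frame_resolutions` is, e.g., a list of vertical resolutions
    of the video frames included in the group."""
    if strategy == "representative":
        # Most common resolution among the frames (assumed statistic).
        return mode(frame_resolutions)
    # Random selection from among the frames' resolutions, as also described.
    return random.choice(frame_resolutions)
```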
  • the adaptive resolution coding system 202 may resize or resample the input video (or video segment or video group) using a predetermined resizing filter or resampling filter such as a downsampling filter.
  • the predetermined resizing filter or resampling filter may include, but is not limited to, a downsampling filter such as a bi-linear filter, an averaging filter, a Lanczos filter, a convolutional filter, etc.
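As one concrete illustration of the filter types listed above, a 2x2 averaging downsampling filter can be sketched in a few lines of Python. A production encoder would use an optimized implementation; this sketch assumes a grayscale frame with even dimensions:

```python
def downsample_2x_average(frame):
    """Downsample a 2-D list of pixel values by 2 in each dimension
    using a 2x2 averaging filter: each output pixel is the mean of a
    2x2 block of input pixels."""
    h, w = len(frame), len(frame[0])
    return [
        [(frame[y][x] + frame[y][x + 1]
          + frame[y + 1][x] + frame[y + 1][x + 1]) / 4
         for x in range(0, w, 2)]
        for y in range(0, h, 2)
    ]
```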
  • the adaptive resolution coding system 202 may encode or compress the resized video according to the target bit rate.
  • the adaptive resolution coding system 202 may encode or compress the resized video using an encoder (e.g., one of the encoders 308) at the target bit rate.
  • the adaptive resolution coding system 202 or the encoder 308 may encode the resized video into an MPEG-4 format, an H.264 format, or any format that is supported by the encoder 308 and/or agreed upon between the adaptive resolution coding system 202 and the other device (i.e., the client device 206) .
  • the adaptive resolution coding system 202 or the encoder 308 may encode a video group according to the target bit rate to produce an encoded video group without waiting for other subsequent video groups.
  • the adaptive resolution coding system 202 may transmit the encoded video to another device over a network.
  • the adaptive resolution coding system 202 may transmit the encoded video to another device (such as the client device 206) over a network, e.g., the network 210.
  • the adaptive resolution coding system 202 may send an encoded video group to the other device over the network, without waiting for other subsequent encoded video groups. This further improves the speed of video transmission, without needing to wait for encoding of the entire video to complete, which could take tens of seconds or minutes.
  • the adaptive resolution coding system 202 may further send information of the plurality of video groups to the other device, so that the other device can recover the input video from the plurality of video groups.
  • the adaptive resolution coding system 202 may include or insert respective sequence numbers of the plurality of video groups of the input video in corresponding data headers of data packets including the plurality of video groups, and a data header of a data packet including the last video group includes a special label indicating that the video group included in this data packet is the last video group of the input video. The other device can then recover the input video based on the sequence numbers included in the data headers of the data packets that are received.
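The data header described above might be sketched as follows. The exact field widths and layout are assumptions for illustration only; the disclosure does not fix a wire format:

```python
import struct

# Hypothetical 8-byte data header: 4-byte big-endian sequence number,
# 1-byte "last video group" flag, 3 reserved (pad) bytes.
HEADER_FMT = ">IB3x"

def pack_group_header(seq_num, is_last):
    """Build a data header carrying the video group's sequence number
    and a special label marking the last group of the input video."""
    return struct.pack(HEADER_FMT, seq_num, 1 if is_last else 0)

def unpack_group_header(header):
    """Recover the sequence number and last-group flag at the receiver."""
    seq_num, last_flag = struct.unpack(HEADER_FMT, header)
    return seq_num, bool(last_flag)
```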
  • an inclusion of a sequence number in a data header of a data packet including a video group as described above may or may not be used, depending on whether a strict in-order requirement (i.e., a requirement for a correct order of video groups to be displayed) is imposed at the other device.
  • the adaptive resolution coding system 202 may include or insert respective sequence numbers of the plurality of video groups of the input video in corresponding data headers of data packets including the plurality of video groups if the strict in-order requirement is imposed.
  • the adaptive resolution coding system 202 may further send additional information.
  • the additional information may include, but is not limited to, information of an original resolution of the input video (i.e., the input resolution) , information of the resampling filter (such as the downsampling filter) that is used for the resizing or resampling, etc.
  • the other device may restore the video to the original resolution by decoding and resizing (e.g., upsampling) using an opposite or conjugate filter (such as a corresponding upsampling filter) .
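A minimal sketch of the restoration step, assuming pixel replication as the upsampling operation. This is an illustrative stand-in for the "opposite or conjugate" filter mentioned above; a real decoder would use the interpolation filter matched to the encoder's downsampling filter:

```python
def upsample_2x_replicate(frame):
    """Restore spatial size by pixel replication: each input pixel
    becomes a 2x2 block of identical output pixels."""
    out = []
    for row in frame:
        wide = [p for p in row for _ in range(2)]  # repeat pixels horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat the row vertically
    return out
```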
  • the adaptive resolution coding system 202 may encode a certain video group of an input video at a target bit rate, while resizing or resampling one or more video segments that are located after the video group according to a recommended resolution (or resampling ratio) . Additionally or alternatively, the adaptive resolution coding system 202 may determine a recommended resolution (or resampling ratio) for at least one video frame of a video segment, while preliminarily encoding video segments that are located after the video segment to determine respective one or more video frames and bit rates as inputs to a trained machine learning model.
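The overlap described in this bullet, encoding one video group while resizing later ones, can be sketched as a two-stage pipeline. This is a minimal illustration, not the system's actual implementation; `resize` and `encode` are hypothetical stand-ins for the real stages:

```python
import queue
import threading

def pipeline(video_groups, resize, encode):
    """Run resizing and encoding concurrently: one worker resizes
    group n+1 while another encodes group n. Output order is preserved."""
    resized = queue.Queue()
    encoded = []

    def resizer():
        for g in video_groups:
            resized.put(resize(g))
        resized.put(None)                 # sentinel: no more groups

    def encoder():
        while (g := resized.get()) is not None:
            encoded.append(encode(g))

    t1 = threading.Thread(target=resizer)
    t2 = threading.Thread(target=encoder)
    t1.start(); t2.start(); t1.join(); t2.join()
    return encoded
```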
  • although the adaptive resolution coding system 202 is described as obtaining the bit rate of the at least one video frame and using the bit rate as one of the inputs to the machine learning model in the above blocks, in other instances, the adaptive resolution coding system 202 may obtain a size (e.g., an amount of bits) of the at least one video frame, and use the size of the at least one video frame as one of the inputs to the machine learning model instead.
  • the target bit rate is used as one of the inputs to the machine learning model, instead of the bit rate or the size of the at least one video frame.
  • the client device 206 may receive an encoded or compressed video.
  • the client device 206 may receive an encoded video from the adaptive resolution coding system 202 or another device such as the client device 204 over the network 210. In implementations, the client device 206 may further receive additional information, which may include, but is not limited to, information of an original or intended resolution to which the encoded video is to be resized or restored, information of a resizing or resampling filter that has been used in the encoded video, etc. In implementations, if the encoded video is a video group of an input video that is sent from the adaptive resolution coding system 202 or the client device 204, the additional information may further include a sequence number associated with the video group.
  • the client device 206 may decode the encoded or compressed video to obtain a decoded or decompressed video.
  • the client device 206 may decode the encoded video into a video format that is supported by the client device 206.
  • the encoded video may be a compressed video, and decoding the encoded video may include decompressing the compressed video.
  • Examples of the video format include, but are not limited to, an H.264 format, an MPEG-4 format, an AVI format, etc.
  • the client device 206 may resize the decoded video.
  • the client device 206 may resize the decoded video to the original resolution using an upsampling filter that is opposite or conjugate to the downsampling filter used in the encoded video.
  • the client device 206 may play or present the resized video to a user of the client device 206, and/or store the resized video in a memory of the client device 206.
  • the client device 206 may display or store the video group according to a correct order of the plurality of video groups. For example, the client device 206 may place the video group in a buffer of the client device 206, and arrange the video group in the right position among one or more video groups (of the plurality of video groups) that have been received by the client device 206 according to a sequence number associated with the video group. The client device 206 may display the video group after the video group (s) located prior thereto have been displayed.
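The buffering and reordering behavior just described can be sketched with a small reorder buffer keyed on sequence numbers. This is a hypothetical illustration of the client-side logic, not the actual implementation:

```python
import heapq

class ReorderBuffer:
    """Buffers video groups that may arrive out of order and releases
    them strictly by sequence number."""
    def __init__(self):
        self._heap = []
        self._next = 0   # sequence number expected next

    def push(self, seq_num, group):
        """Store an arriving group; return the list of groups that are
        now safe to display, in correct order (possibly empty)."""
        heapq.heappush(self._heap, (seq_num, group))
        ready = []
        while self._heap and self._heap[0][0] == self._next:
            ready.append(heapq.heappop(self._heap)[1])
            self._next += 1
        return ready
```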
  • if the video received by the client device 206 at block 502 is a video group of a plurality of video groups of a video intended to be stored in the client device 206, the client device 206 may place and arrange the video group in the buffer of the client device 206, and wait until all the video groups of the video are received to combine and store the video groups as a single video in the memory of the client device 206.
  • any of the acts of any of the methods described herein may be implemented at least partially by a processor or other electronic device based on instructions stored on one or more computer-readable media.
  • any of the acts of any of the methods described herein may be implemented under control of one or more processors configured with executable instructions that may be stored on one or more computer-readable media.
  • a neural network model is used herein as an example of the trained machine learning model described above. It should be noted that the present disclosure is not limited to this example neural network model, and other types of machine learning models can also be used and applicable to the present disclosure.
  • the trained machine learning model may include a neural network model, such as a convolutional neural network (CNN) model, e.g., Mobilenet v2.
  • image information of at least one video frame as described above may be pixel values of the at least one video frame.
  • training samples with image information of respective video frames of a plurality of videos and corresponding bit rates of the respective video frames of the plurality of videos (or respective target bit rates) as inputs, and respective known optimal resolutions (or resampling ratios) as outputs may be used for training the neural network model.
  • These training samples may be obtained by a brute force approach or from a third-party database.
  • parameters (such as connection weights between nodes of same and different layers, biases, etc. ) of the neural network model are learned and determined using a subset of the training samples based on a particular optimization or training algorithm, such as a gradient descent method, a conjugate gradient method, a Quasi-Newton method, etc.
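As a reminder of how an optimization algorithm of the kind named above operates, a bare-bones gradient descent loop is sketched below. Real training frameworks add batching, momentum, learning-rate schedules, etc.:

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Minimal gradient descent: repeatedly step the parameter
    against the gradient of the training loss."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w
```

For example, minimizing the loss (w - 3)^2, whose gradient is 2(w - 3), drives w toward 3.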
  • the neural network model may be tested and validated using another subset of the training samples. If an accuracy of recognition is less than a predetermined threshold, the neural network model may be retrained until the accuracy of recognition is greater than or equal to the predetermined threshold.
  • the neural network model may have a different number of convolutional layers, and/or a different number of feature maps in each layer, depending on the desired complexity of the neural network model, the desired accuracy of recognition, and/or the desired speed of computation for determining a recommended resolution (or resampling ratio) , etc. For example, if a recommended resolution or resampling ratio is needed for videos of a higher resolution, a higher number of features may exist in video frames of a video of the higher resolution, and so a higher number of feature maps in each layer may be desirable.
  • Clause 1 A method implemented by a computing device, the method comprising: determining at least one video frame and a corresponding bit rate of the at least one video frame; inputting the at least one video frame and the corresponding bit rate into a machine learning model to obtain a recommended resolution; resizing the at least one video frame and one or more other video frames associated with the at least one video frame according to the recommended resolution; and encoding the at least one video frame and the one or more other video frames to obtain an encoded video according to a target bit rate after the resizing.
  • Clause 2 The method of Clause 1, further comprising: sending the encoded video to a receiving computing device via a network.
  • Clause 3 The method of Clause 2, further comprising: sending information of a resizing filter for resizing the at least one video frame and the one or more other video frames to the receiving computing device via the network, the information of the resizing filter enabling the receiving computing device to undo the resizing using a corresponding reversing filter.
  • Clause 4 The method of Clause 1, wherein determining the at least one video frame and the bit rate of the at least one video frame comprises: encoding an input video comprising the at least one video frame according to the target bit rate; and extracting information of the at least one video frame and the corresponding bit rate from the encoded input video.
  • Clause 5 The method of Clause 4, wherein the input video comprises a subset of a streaming video or a stored video.
  • Clause 6 The method of Clause 1, wherein the at least one frame comprises an intra frame, and the one or more other video frames comprise one or more inter frames that are encoded based on the at least one frame.
  • Clause 7 The method of Clause 1, wherein the machine learning model is configured to receive video frames of a particular resolution and determine a corresponding resolution for resizing a video comprising the video frames of the particular resolution for transmission at a designated bit rate.
  • Clause 8 One or more processor-readable media storing executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising: determining at least one video frame and a corresponding bit rate of the at least one video frame; inputting the at least one video frame and the corresponding bit rate into a machine learning model to obtain a recommended resolution; resizing the at least one video frame and one or more other video frames associated with the at least one video frame according to the recommended resolution; and encoding the at least one video frame and the one or more other video frames to obtain an encoded video according to a target bit rate after the resizing.
  • Clause 9 The one or more processor-readable media of Clause 8, the acts further comprising: sending the encoded video to a receiving computing device via a network.
  • Clause 10 The one or more processor-readable media of Clause 9, the acts further comprising: sending information of a resizing filter for resizing the at least one video frame and the one or more other video frames to the receiving computing device via the network, the information of the resizing filter enabling the receiving computing device to undo the resizing using a corresponding reversing filter.
  • Clause 11 The one or more processor-readable media of Clause 8, wherein determining the at least one video frame and the bit rate of the at least one video frame comprises: encoding an input video comprising the at least one video frame according to the target bit rate; and extracting information of the at least one video frame and the corresponding bit rate from the encoded input video.
  • Clause 12 The one or more processor-readable media of Clause 11, wherein the input video comprises a subset of a streaming video or a stored video.
  • Clause 13 The one or more processor-readable media of Clause 8, wherein the at least one frame comprises an intra frame, and the one or more other video frames comprise one or more inter frames that are encoded based on the at least one frame.
  • Clause 14 The one or more processor-readable media of Clause 8, wherein the machine learning model is configured to receive video frames of a particular resolution and determine a corresponding resolution for resizing a video comprising the video frames of the particular resolution for transmission at a designated bit rate.
  • Clause 15 A system comprising: one or more processors; and memory storing executable instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising: determining at least one video frame and a corresponding bit rate of the at least one video frame; inputting the at least one video frame and the corresponding bit rate into a machine learning model to obtain a recommended resolution; resizing the at least one video frame and one or more other video frames associated with the at least one video frame according to the recommended resolution; and encoding the at least one video frame and the one or more other video frames to obtain an encoded video according to a target bit rate after the resizing.
  • Clause 16 The system of Clause 15, the acts further comprising: sending the encoded video to a receiving computing device via a network.
  • Clause 17 The system of Clause 16, the acts further comprising: sending information of a resizing filter for resizing the at least one video frame and the one or more other video frames to the receiving computing device via the network, the information of the resizing filter enabling the receiving computing device to undo the resizing using a corresponding reversing filter.
  • Clause 18 The system of Clause 15, wherein determining the at least one video frame and the bit rate of the at least one video frame comprises: encoding an input video comprising the at least one video frame according to the target bit rate; and extracting information of the at least one video frame and the corresponding bit rate from the encoded input video.
  • Clause 19 The system of Clause 15, wherein the at least one frame comprises an intra frame, and the one or more other video frames comprise one or more inter frames that are encoded based on the at least one frame.
  • Clause 20 The system of Clause 15, wherein the machine learning model is configured to receive video frames of a particular resolution and determine a corresponding resolution for resizing a video comprising the video frames of the particular resolution for transmission at a designated bit rate.

Abstract

At least one video frame and a corresponding bit rate of the at least one video frame of a video may be determined. The video may be a streaming video or a subset of a stored video. The at least one video frame and the corresponding bit rate may be inputted into a machine learning model to obtain a recommended resolution. The at least one video frame and one or more other video frames associated with the at least one video frame may then be resized or resampled (e.g., downsampled) according to the recommended resolution. After resizing, the at least one video frame and the one or more other video frames may be encoded to obtain an encoded video according to a target bit rate.

Description

Adaptive Resolution Coding Based on Machine Learning Model

BACKGROUND
Video streaming and downloading/uploading are very common in people’s daily lives nowadays. A user may send or upload a video file from one device to another device such as a server or a computing device of another user. Along with the development of video technologies, people have an increasing demand for videos of a higher quality or resolution, such as high definition videos, ultra-high definition videos, etc. These high-quality or high-resolution videos usually have large file sizes, which may range from several hundred megabytes to several gigabytes. These high-quality or high-resolution videos not only require a long period of time for uploading and transmitting over a communication network, but also incur a huge amount of traffic on the network, thus having a high transmission cost in terms of time and network bandwidth.
SUMMARY
This summary introduces simplified concepts of adaptive resolution coding, which will be further described below in the Detailed Description. This summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in limiting the scope of the claimed subject matter.
This application describes example implementations of adaptive resolution coding. In implementations, at least one video frame and a corresponding bit rate of the at least one video frame of a video may be determined. The video may be a streaming video or a subset (such as a segment) of a stored video. In implementations, the at least one video frame and the corresponding bit rate may be inputted into a machine learning model to obtain a recommended resolution. The at least one video frame and one or more other video frames associated with the at least one video frame may then be resized or resampled (e.g., downsampled) according to the recommended resolution. After resizing, the at least one video frame and the one or more other video frames may be encoded to obtain an encoded video according to a target bit rate.
BRIEF DESCRIPTION OF THE DRAWINGS
The detailed description is set forth with reference to the accompanying figures. In the figures, the left-most digit (s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
FIG. 1 illustrates an example relationship between bit rates and qualities of an example video.
FIG. 2 illustrates an example environment in which an adaptive resolution coding system may be used.
FIG. 3 illustrates an example adaptive resolution coding system in more detail.
FIG. 4 illustrates an example method of adaptive resolution coding.
FIG. 5 illustrates another example method of adaptive resolution coding.
DETAILED DESCRIPTION
Overview
As noted above, when high-quality videos (such as videos of 4K or 8K resolution) are transmitted over a network, existing technologies suffer a high cost of transmission in terms of both time and network bandwidth. As the qualities of videos that need to be transmitted increase, the cost of transmission of the videos increases sharply. This leads to a reduction in the network bandwidth that is available for other services in a network, and hence affects the network performance for the other services.
This disclosure describes an example adaptive resolution coding system. In order to reduce the transmission cost for transmitting a given video at a certain bit rate (or called a target bit rate) as described above, the adaptive resolution coding system may first downsample the given video by a certain sampling ratio, and then encode the downsampled video at the target bit rate. The adaptive resolution coding system may then transmit the encoded video to another device over a communication network, so that the other device can restore the given video (e.g., restore an original resolution of the given video) by decoding and upsampling the encoded video.
In implementations, given a certain bit rate (or bandwidth) for transmitting a video, the quality of the video that is restored at a receiving end depends on an amount of downsampling (or a downsampling ratio) that is performed at a sending end. For example, FIG. 1 shows an example relationship 100 between bit rates and qualities of an example video that is restored after successive operations of downsampling, encoding, decoding, and upsampling. As can be seen from FIG. 1, for a given bit rate, there exists, from among the different resolutions to which a video can be downsampled, a certain resolution (or a certain downsampling ratio) such that, when the video is downsampled to that resolution (or by that downsampling ratio) and then encoded at the given bit rate at a sending end, the video after restoration (i.e., decoding and upsampling) at a receiving end attains the best quality among the different resolutions.
In implementations, the adaptive resolution coding system may employ a machine learning model to determine an optimal resolution or downsampling ratio for resizing an input video of an input resolution before encoding and transmitting the video at or around a specific bit rate to another device over a communication network. In implementations, the machine learning model may be trained using a training sample set of different videos having a particular resolution or different resolutions and respective known values of optimal downsampling ratios that produce the best qualities for the different videos. After values of parameters (such as weights) of the machine learning model are determined, the adaptive resolution coding system may apply the machine learning model to determine a recommended downsampling ratio or resolution for an input video. In implementations, the machine learning model may include, but is not limited to, a neural network model such as a convolutional neural network (CNN) , a Bayesian network, a decision tree, etc.
By way of example and not limitation, the described adaptive resolution coding system may receive an input video having an input resolution and an instruction to transmit the input video at a certain bit rate (or a target bit rate) . The adaptive resolution coding system may obtain one or more frames (such as intra frames) and respective one or more bit rates from the input video. For example, the adaptive resolution coding system may attempt to encode the input video at the target bit rate, and obtain one or more intra frames and respective one or more bit rates after encoding. The adaptive resolution coding system may then input the one or more frames and the respective one or more bit rates into a trained machine learning model to obtain a recommended resolution or sampling ratio for resizing (e.g., downsampling) the input video. After obtaining the recommended resolution or sampling ratio from the trained machine learning model, the adaptive resolution coding system may resize the input video from the input resolution to the recommended resolution, and encode the resized input video according to the target bit rate for transmission over a communication network, thus reducing the transmission cost of the video while ensuring a high quality of the video after restoration (i.e., decoding and upsampling, for example) . In implementations, the input video may include, but is not limited to, some or all of a stored video, or some or all of a streaming video, etc.
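The steps of this example can be sketched end to end as below. Every stage function is a hypothetical stand-in for the real encoder and trained machine learning model; the disclosure does not prescribe these interfaces:

```python
def adaptive_resolution_encode(input_video, target_bit_rate, model,
                               trial_encode, resize, encode):
    """Sketch of the overall flow: trial-encode, consult the model,
    resize, then encode at the target bit rate."""
    # 1. Trial-encode at the target bit rate to obtain intra frames
    #    and their respective bit rates.
    intra_frames, bit_rates = trial_encode(input_video, target_bit_rate)
    # 2. Ask the trained model for a recommended resolution (or ratio).
    recommended = model(intra_frames, bit_rates)
    # 3. Resize (e.g., downsample) the video to the recommendation.
    resized = resize(input_video, recommended)
    # 4. Encode the resized video at the target bit rate.
    return encode(resized, target_bit_rate)
```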
In implementations, functions described herein to be performed by the adaptive resolution coding system may be performed by multiple separate units or services. For example, a receiving service may receive an input video and an instruction including a target bit rate, while an acquisition service may obtain one or more frames and respective one or more bit rates from the input video. A determination service may obtain a recommended resolution or sampling ratio for  resizing (e.g., downsampling) the input video based on a machine learning model. In implementations, an encoding service may encode the resized input video according to a target bit rate, while a transmission service may transmit the encoded video to another device over a communication network.
Moreover, although in the examples described herein, the adaptive resolution coding system may be implemented as software and/or hardware installed in a single device, in other examples, the adaptive resolution coding system may be implemented and distributed in multiple devices or as services provided in one or more servers over a network and/or in a cloud computing architecture.
The application describes multiple and varied embodiments and implementations. The following section describes an example framework that is suitable for practicing various implementations. Next, the application describes example systems, devices, and processes for implementing an adaptive resolution coding system.
Example Environment
FIG. 2 illustrates an example environment 200 usable to implement an adaptive resolution coding system. The environment 200 may include an adaptive resolution coding system 202. In this example, the adaptive resolution coding system 202 is described to be included in a client device 204. In some instances, the adaptive resolution coding system 202 may exist as an individual entity or device. In implementations, the environment 200 may further include another client device 206 and a server 208. The adaptive resolution coding system 202 or the client  device 204 may communicate data with the other client device 206 and the server 208 over a network 210. In implementations, the server 208 may be a server of a plurality of servers in a cloud or a data center.
In implementations, functions of the adaptive resolution coding system 202 may be included in or provided by the client device 204. In implementations, some or all of the functions of the adaptive resolution coding system 202 may be included in a cloud computing system or architecture, and may be provided as services to the client device 204.
In implementations, the client device 204 or the client device 206 may be implemented as any of a variety of computing devices including, but not limited to, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet or slate computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc. ) , a server computer, etc., or a combination thereof.
The network 210 may be a wireless or a wired network, or a combination thereof. The network 210 may be a collection of individual networks interconnected with each other and functioning as a single large network (e.g., the Internet or an intranet) . Examples of such individual networks include, but are not limited to, telephone networks, cable networks, Local Area Networks (LANs) , Wide Area Networks (WANs) , and Metropolitan Area Networks (MANs) . Further, the individual networks may be wireless or wired networks, or a combination thereof. Wired networks may include an electrical carrier connection (such as a communication cable, etc. ) and/or an optical carrier or connection (such as an optical fiber connection, etc. ) . Wireless networks may include, for example, a WiFi network and other radio frequency networks (e.g., Zigbee, etc. ) .
In implementations, the adaptive resolution coding system 202 may receive an instruction to transmit an input video at a target bit rate. The adaptive resolution coding system 202 may determine a recommended resolution or downsampling ratio for the input video based on a machine learning model, downsample the input video according to the recommended resolution or downsampling ratio, and encode the input video to obtain an encoded video for storage or transmission by the client device 204 or the adaptive resolution coding system 202.
Example Adaptive resolution coding system
FIG. 3 illustrates the adaptive resolution coding system 202 in more detail. In implementations, the adaptive resolution coding system 202 may include, but is not limited to, one or more processors 302, memory 304, and program data 306. In implementations, the adaptive resolution coding system 202 may further include one or more encoders 308, an input/output (I/O) interface 310, and/or a network interface 312. In implementations, some or all of the functions of the adaptive resolution coding system 202 may be implemented using hardware, for example, an ASIC (i.e., Application-Specific Integrated Circuit) , a FPGA (i.e., Field-Programmable Gate Array) , and/or other hardware. By way of example and not limitation, the one or more encoders 308 of the adaptive resolution coding system 202 may be implemented using an ASIC, a FPGA, and/or any other hardware.
In implementations, the one or more processors 302 are configured to execute instructions that are stored in the memory 304, and/or received from the input/output interface 310, and/or the network interface 312. In implementations, the one or more processors 302 may be implemented as one or more hardware processors including, for example, a microprocessor, an application-specific instruction-set processor, a physics processing unit (PPU) , a central processing unit (CPU) , a graphics processing unit, a digital signal processor, etc. Additionally or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include field-programmable gate arrays (FPGAs) , application-specific integrated circuits (ASICs) , application-specific standard products (ASSPs) , system-on-a-chip systems (SOCs) , complex programmable logic devices (CPLDs) , etc.
The memory 304 may include processor-readable media in a form of volatile memory, such as Random Access Memory (RAM) and/or non-volatile memory, such as read only memory (ROM) or flash RAM. The memory 304 is an example of processor-readable media.
The processor-readable media may include a volatile or non-volatile type, a removable or non-removable media, which may achieve storage of information using any method or technology. The information may include a processor-readable instruction, a data structure, a program module or other data. Examples of processor-readable media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electronically erasable programmable read-only memory (EEPROM), flash memory or other internal storage technology, compact disk read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media, which may be used to store information that may be accessed by a computing device. As defined herein, the processor-readable media does not include any transitory media, such as modulated data signals and carrier waves.
Although in this example, only hardware components are described in the adaptive resolution coding system 202, in other instances, the adaptive resolution coding system 202 may further include other hardware components and/or other software components such as program units to execute instructions stored in the memory 304 for performing various operations such as processing, determination, allocation, storage, etc. In some instances, the adaptive resolution coding system 202 may further include a model database 314 that is configured to store information of one or more trained machine learning models used for determining recommended resolutions for videos of different input resolutions.
Example Methods
FIGS. 4 and 5 show schematic diagrams depicting example methods of adaptive resolution coding. The methods of FIGS. 4 and 5 may, but need not, be implemented in the environment of FIG. 2 and using the system of FIG. 3. For ease of explanation, methods 400 and 500 are described with reference to FIGS. 1-3. However, the methods 400 and 500 may alternatively be implemented in other environments and/or using other systems.
The  methods  400 and 500 are described in the general context of computer-executable instructions. Generally, computer-executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, and the like that perform particular functions or implement particular abstract data types. Furthermore, each of the example methods are illustrated as a collection of blocks in a logical flow graph representing a sequence of operations that can be implemented in hardware, software, firmware, or a combination thereof. The order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method, or alternate methods. Additionally, individual blocks may be omitted from the method without departing from the spirit and scope of the subject matter described herein. In the context of software, the blocks represent computer instructions that, when executed by one or more processors, perform the recited operations. In the context of hardware, some or all of the blocks may represent application specific integrated circuits (ASICs) or other physical components that perform the recited operations.
Referring back to FIG. 4, at block 402, the adaptive resolution coding system 202 may receive an input video and information of a target bit rate.
In implementations, the adaptive resolution coding system 202 may receive an instruction to encode an input video of a certain resolution (referred to herein as an input resolution) at a target bit rate (or a target bandwidth) from a user of the client device 204. The input video may include, but is not limited to, a stored video or a streaming video. In implementations, the input video may be a subset (e.g., a segment) of a stored video or streaming video.
At block 404, the adaptive resolution coding system 202 may determine or obtain image information of at least one video frame and a bit rate of the at least one video frame from the input video.
In implementations, in response to receiving the input video, the adaptive resolution coding system 202 may determine or obtain image information of at least one video frame and a bit rate of the at least one video frame from the input video, which can be used as an input to a trained machine learning model for determining or obtaining a recommended resolution (or a resampling ratio). For example, the adaptive resolution coding system 202 may encode the input video of the input resolution at the target bit rate, and extract or obtain image information of at least one video frame and a bit rate of the at least one video frame from the encoded video. In implementations, encoding the input video of the input resolution at the target bit rate may include compressing the input video so that an average bit rate of transmitting the compressed video over a communication network (such as the network 210) is at or around the target bit rate.
In implementations, the adaptive resolution coding system 202 may obtain at least one video frame and a bit rate thereof by calculating a prediction residual of the input video and estimating resulting bits for the at least one video frame (e.g., the first video frame of the input video) and the bit rate thereof based on the prediction residual. In this case, the adaptive resolution coding system 202 may  not need to encode or compress the input video completely. In implementations, the adaptive resolution coding system 202 may randomly select a portion of the input video, encode or compress the selected portion of the input video, and extract or obtain image information of at least one video frame and a bit rate of the at least one video frame from the encoded or compressed portion of the input video.
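By way of example and not limitation, the residual-based estimation above may be sketched as follows. This is a minimal illustration, not the disclosed implementation: the calibration constant `bits_per_unit_residual` is a hypothetical value that a real encoder would fit from rate statistics, and frames are modeled as flat lists of pixel values.

```python
def estimate_frame_bits(current, reference, bits_per_unit_residual=0.4):
    """Estimate the coded size of a frame from its prediction residual.

    `current` and `reference` are flat lists of pixel values; the
    residual is their element-wise difference. `bits_per_unit_residual`
    is a hypothetical calibration constant.
    """
    residual = sum(abs(c - r) for c, r in zip(current, reference))
    return residual * bits_per_unit_residual

def estimate_bit_rate(frame_bits, frame_rate=30):
    """Convert an estimated per-frame size to a bit rate (bits/second)."""
    return frame_bits * frame_rate
```

Identical frames yield a zero residual and thus a zero bit estimate, so the estimate can be computed without encoding or compressing the input video completely, as described above.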
In implementations, the at least one video frame may include, but is not limited to, an intra frame that is representative of the input video from among all intra frames of the input video, or an intra frame that is randomly selected from among all the intra frames of the input video. In implementations, the image information of the at least one video frame may include, but is not limited to, image data of the at least one video frame (such as pixel values at each coordinate in the at least one video frame) . Additionally or alternatively, the image information of the at least one video frame may include feature data of the at least one video frame. For example, the adaptive resolution coding system 202 may perform feature extraction or detection on the at least one video frame after obtaining the at least one video frame from the input video, and additionally or alternatively use feature data that is obtained from the feature extraction or detection as the image information of the at least one video frame. Examples of the feature extraction or detection may include, but are not limited to, edge detection, corner detection, blob detection, curvature detection, shape-based detection, Hough transform, etc. Depending on what type of machine learning model is used at a later stage (i.e., a stage of determination of a recommended resolution or resampling ratio) , one or more types of feature extraction may be performed on the at least one video frame.
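By way of example and not limitation, the edge-detection style of feature extraction mentioned above may be sketched as follows. The horizontal-gradient test and the threshold value are illustrative stand-ins; a practical system might use Sobel or Canny operators instead.

```python
def edge_features(frame, threshold=30):
    """Return a binary map marking positions where the horizontal
    gradient exceeds `threshold` -- a minimal stand-in for the edge
    detection mentioned above.

    `frame` is a 2-D list of luma values; the output marks the
    presence or absence of an edge feature at each coordinate.
    """
    features = []
    for row in frame:
        feature_row = [0] * len(row)
        for x in range(1, len(row)):
            if abs(row[x] - row[x - 1]) > threshold:
                feature_row[x] = 1  # edge present at this coordinate
        features.append(feature_row)
    return features
```

The resulting presence/absence map is the kind of feature data that could serve as image information for a model that does not extract features itself (e.g., a decision tree model).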
In implementations, the intra frame that is representative of the input video may include, but is not limited to, a first intra frame of the input video, an intra frame having a bit rate that is a median of bit rates associated with intra frames of the input video, an intra frame having a bit rate that is closest to an average of the bit rates associated with the intra frames of the input video, etc.
In implementations, the at least one intra frame may include one or more intra frames that are representative of the input video, and/or one or more intra frames that are randomly selected from among the intra frames of the input video.
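By way of example and not limitation, the closest-to-average selection rule described above may be sketched as follows. The pair-based data layout and tie-breaking rule are illustrative choices, not part of the disclosure.

```python
import statistics

def representative_intra_frame(intra_frames):
    """Pick the intra frame whose bit rate is closest to the mean of
    all intra-frame bit rates; ties are broken by frame order.

    `intra_frames` is a list of (frame_index, bit_rate) pairs. The
    median-based variant described above would substitute
    statistics.median for statistics.mean.
    """
    mean_rate = statistics.mean(rate for _, rate in intra_frames)
    return min(intra_frames, key=lambda f: (abs(f[1] - mean_rate), f[0]))
```

For instance, among intra frames with bit rates 100, 300, and 220, the frame with bit rate 220 lies closest to the mean and would be selected as representative.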
In implementations, depending on the number of video frames and/or the size of the input video, the adaptive resolution coding system 202 may divide the input video into a plurality of video segments. For example, the adaptive resolution coding system 202 may divide the input video into a plurality of video segments having the same time length such as one second, two seconds, etc. In implementations, the adaptive resolution coding system 202 may divide the input video (e.g., a stored video) into a predetermined number of video segments.
In implementations, the adaptive resolution coding system 202 may divide the input video into a plurality of video segments based on any scene change detection (SCD) method and/or shot transition detection (STD) method. In an event that a scene change detection (SCD) method and/or a shot transition detection (STD) method is employed, the plurality of video segments that are obtained by the adaptive resolution coding system 202 may have different lengths.
By way of example and not limitation, an amount of change between two video frames may be used for detecting a presence of a scene change. For example, if scenes between two video frames are different, a residual obtained after performing motion compensation between these two video frames is usually large, or a difference between pixel values of these two video frames is usually large. In implementations, predetermined threshold (s) may be set up for a residual associated with motion compensation between two video frames and/or a difference between pixel values of two video frames. In response to detecting that a residual associated with motion compensation between two video frames and/or a difference between pixel values of two video frames is/are greater than respective predetermined threshold (s) , the adaptive resolution coding system 202 may determine that a scene change occurs between the two video frames. The adaptive resolution coding system 202 may divide the input video into a plurality of video segments, with boundaries of a video segment of the plurality of video segments corresponding to positions of respective scene changes that are detected.
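By way of example and not limitation, the pixel-difference variant of the scene change test above may be sketched as follows. The threshold is a hypothetical value; the residual-after-motion-compensation test would plug into the same structure.

```python
def mean_abs_difference(frame_a, frame_b):
    """Mean absolute pixel difference between two equal-size frames
    (flat lists of luma values)."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def split_on_scene_changes(frames, threshold=40):
    """Split a frame sequence into video segments wherever the
    difference between consecutive frames exceeds `threshold`, so
    segment boundaries correspond to detected scene changes."""
    segments, current = [], [frames[0]]
    for prev, frame in zip(frames, frames[1:]):
        if mean_abs_difference(prev, frame) > threshold:
            segments.append(current)  # scene change: close the segment
            current = []
        current.append(frame)
    segments.append(current)
    return segments
```

Because boundaries follow detected scene changes rather than a fixed clock, the resulting segments may have different lengths, as noted above.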
In implementations, the input video may be divided into a plurality of video segments, and the adaptive resolution coding system 202 may determine or obtain at least one video frame and a respective bit rate from each video segment of the plurality of video segments according to a similar approach as described above.
At block 406, after determining at least one video frame and a respective bit rate, the adaptive resolution coding system 202 may input the image information of the at least one video frame and the bit rate into a trained machine learning model to determine or obtain a recommended resolution.
In implementations, the adaptive resolution coding system 202 may be associated with one or more trained machine learning models that are configured to receive image information of one or more video frames and respective one or more bit rates as inputs, and produce a recommended resolution (or resampling ratio) as an output. In implementations, the one or more trained machine learning models may be able to process video frames of a particular resolution or different resolutions. By way of example and not limitation, the adaptive resolution coding system 202 may have one or more trained machine learning models that are stored in the memory 304, e.g., stored in the model database 314. The adaptive resolution coding system 202 may select a trained machine learning model from the model database 314, and input the image information of the at least one video frame and the bit rate into the trained machine learning model to obtain a recommended resolution (or resampling ratio) .
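By way of example and not limitation, the model-query step above may be sketched as follows. Both the `predict` method signature and the threshold-based stand-in model are illustrative assumptions; the actual trained model, its API, and its decision boundary are not specified here.

```python
def recommend_resolution(model, frame_pixels, bit_rate):
    """Query a trained model for a recommended output resolution.

    `model` is assumed to expose a `predict(features) -> (w, h)`
    method; the feature-dict layout is illustrative only.
    """
    return model.predict({"pixels": frame_pixels, "bit_rate": bit_rate})

class ThresholdModel:
    """Hypothetical stand-in for a trained model: recommends a lower
    resolution when the frame bit rate falls below a cutoff."""
    def __init__(self, cutoff_bps, low_res, high_res):
        self.cutoff_bps = cutoff_bps
        self.low_res = low_res
        self.high_res = high_res

    def predict(self, features):
        if features["bit_rate"] < self.cutoff_bps:
            return self.low_res
        return self.high_res
```

A real deployment would load a trained model from the model database 314 (or query a remote device over the network 210) in place of `ThresholdModel`.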
Additionally or alternatively, one or more trained machine learning models may be stored in a remote device (for example, a server, a cloud, or a data center, etc. ) that is accessible to the adaptive resolution coding system 202 through a network, e.g., the network 210. The adaptive resolution coding system 202 may send the image information of the at least one video frame and the bit rate to the remote device through the network 210 to request the remote device for determining a recommended resolution (or resampling ratio) , and receive the recommended resolution from the remote device after the remote device determines the recommended resolution (or resampling ratio) using a trained machine learning model therein.
In implementations, depending on a type of the machine learning model, the image information of the at least one video frame inputted into the machine learning model may vary. By way of example and not limitation, if the machine learning model is a neural network model, the image information of the at least one video frame may include image data (e.g., pixel values) of the at least one video frame, because feature extraction or detection can be performed in the first few layers of the neural network model. In implementations, if the machine learning model is a decision tree model, the image information of the at least one video frame may include feature data of the at least one video frame, such as information about a presence or absence of certain features (such as an edge, a corner, a shape, etc. ) at different positions or coordinates on the at least one video frame. The adaptive resolution coding system 206 may perform such feature extraction or detection at block 404 as described above.
In implementations, the input video may be divided into a plurality of video segments as described above. In this case, the adaptive resolution coding system 202 may use at least one respective video frame and a respective bit rate of each video segment as an input to a trained machine learning model to obtain or determine a respective recommended resolution (or resampling ratio) . After obtaining respective recommended resolutions (or resampling ratios) for the plurality of video segments of the input video, the adaptive resolution coding system 202 may determine or calculate a resulting resolution (or resampling ratio) based on the respective recommended resolutions (or resampling ratios) of the plurality of video segments as a recommended resolution (or resampling ratio) for the input video.
In implementations, the adaptive resolution coding system 202 may further determine a resolution (or resampling ratio) that is representative of the respective recommended resolutions (or resampling ratios) of the plurality of video segments as the recommended resolution (or resampling ratio) for the input video. By way of example and not limitation, the resolution (or resampling ratio) that is representative of the respective recommended resolutions (or resampling ratios) of the plurality of video segments may include, but is not limited to, an average of the respective recommended resolutions (or resampling ratios) of the plurality of video segments, a median of the respective recommended resolutions (or resampling ratios) of the plurality of video segments, etc. In implementations, the adaptive resolution coding system 202 may randomly select one of the respective recommended resolutions (or resampling ratios) of the plurality of video segments as the recommended resolution (or resampling ratio) for the input video.
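By way of example and not limitation, reducing the per-segment recommendations to one recommendation for the whole input video, as described above, may be sketched as follows (strategy names are illustrative):

```python
import statistics

def combine_segment_ratios(segment_ratios, strategy="median"):
    """Reduce per-segment recommended resampling ratios to a single
    resampling ratio for the whole input video, using the median or
    average strategies described above."""
    if strategy == "median":
        return statistics.median(segment_ratios)
    if strategy == "mean":
        return statistics.mean(segment_ratios)
    raise ValueError(f"unknown strategy: {strategy}")
```

The random-selection strategy mentioned above would simply draw one element of `segment_ratios` instead.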
At block 408, in response to determining or obtaining the recommended resolution (or resampling ratio) for the input video, the adaptive resolution coding system 202 may resample or resize the at least one video frame and one or more other video frames associated with the at least one video frame based on the recommended resolution (or resampling ratio) .
In implementations, the adaptive resolution coding system 202 may resize or resample (e.g., downsample) the at least one video frame and one or more other video frames associated with the at least one video frame from the input resolution to the recommended resolution (or by the recommended resampling ratio) . In implementations, the adaptive resolution coding system 202 may  downsample the input video from the input resolution to the recommended resolution (or by the recommended resampling ratio) . In implementations, the at least one video frame may include an intra frame, and the one or more other video frames associated with the at least one video frame may include inter frames depending on the intra frame.
In implementations, the input video may be divided into a plurality of video segments as described above, and the adaptive resolution coding system 202 may resize or resample (e.g., downsample) the input video including the plurality of video segments from the input resolution to the same recommended resolution (or by the same recommended resampling ratio), as described in the above situation in which the input video is divided into the plurality of video segments.
In implementations, the input video may be divided into a plurality of video segments, and the adaptive resolution coding system 202 may further divide the plurality of video segments into a plurality of video groups that may not overlap with each other. Each video group may include one or more video segments. In implementations, the adaptive resolution coding system 202 may divide the plurality of video segments into a plurality of video groups based on a predetermined number of video segments and/or a predetermined time period. By way of example and not limitation, after obtaining or determining recommended resolutions (or resampling ratios) for a predetermined number of video segments and/or after a predetermined period of time has passed, the adaptive resolution coding system 202 may group video segments whose recommended resolutions (or resampling ratios) have been determined and which have not been resized or resampled as an individual video group.
In implementations, the adaptive resolution coding system 202 may resize or resample (e.g., downsample) a video group to a recommended resolution (or by a recommended resampling ratio) associated with that video group, encode the resized video group, and send the encoded video group to another device over the network 210, without waiting for other subsequent video groups, thus further speeding up a process of transmitting the input video from one device to another device. A recommended resolution (or a recommended resampling ratio) associated with a video group may be determined as described above by selecting a resolution (or a resampling ratio) that is representative of resolutions (or resampling ratios) of video frames included in the video group, or by randomly selecting a resolution (or a resampling ratio) from among the resolutions (or the resampling ratios) of the video frames included in the video group.
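By way of example and not limitation, the count-based grouping described above may be sketched as follows (the group size is a hypothetical parameter):

```python
def group_segments(segments, group_size):
    """Collect consecutive, non-overlapping video segments into groups
    of at most `group_size` segments, so that each group can be
    resized, encoded, and sent without waiting for later groups."""
    return [segments[i:i + group_size]
            for i in range(0, len(segments), group_size)]
```

Each returned group can then be processed and transmitted independently, which is what allows the pipeline above to avoid waiting for subsequent video groups.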
In implementations, the adaptive resolution coding system 202 may resize or resample the input video (or video segment or video group) using a predetermined resizing filter or resampling filter such as a downsampling filter. Examples of the predetermined resizing filter or resampling filter may include, but are not limited to, a downsampling filter such as a bi-linear filter, an averaging filter, a Lanczos filter, a convolutional filter, etc.
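By way of example and not limitation, the averaging filter named above may be sketched as a factor-of-two downsampler. This is a minimal illustration assuming even frame dimensions and integer pixel values; production resizers (bilinear, Lanczos, etc.) are considerably more elaborate.

```python
def downsample_2x(frame):
    """Downsample a frame by a factor of two in each dimension using a
    2x2 averaging filter (one of the example resizing filters above).

    `frame` is a 2-D list of pixel values with even width and height.
    """
    out = []
    for y in range(0, len(frame), 2):
        row = []
        for x in range(0, len(frame[0]), 2):
            block = (frame[y][x] + frame[y][x + 1] +
                     frame[y + 1][x] + frame[y + 1][x + 1])
            row.append(block // 4)  # integer average of the 2x2 block
        out.append(row)
    return out
```

A 4x4 frame thus becomes 2x2, halving the resolution in each dimension, i.e., a resampling ratio of one half.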
At block 410, the adaptive resolution coding system 202 may encode or compress the resized video according to the target bit rate.
In implementations, after the input video is resized or resampled, the adaptive resolution coding system 202 may encode or compress the resized video using an encoder (e.g., one of the encoders 308) at the target bit rate. For example, the adaptive resolution coding system 202 or the encoder 308 may encode the resized video into a MPEG-4 format, a H. 264 format, or any format that is supported by the encoder 308 and/or agreed upon between the adaptive resolution coding system 202 and the other device (i.e., the client device 206) . In implementations, if video grouping is employed, the adaptive resolution coding system 202 or the encoder 308 may encode a video group according to the target bit rate to produce an encoded video group without waiting for other subsequent video groups.
At block 412, the adaptive resolution coding system 202 may transmit the encoded video to another device over a network.
In implementations, after an encoded video according to the target bit rate is obtained, the adaptive resolution coding system 202 may transmit the encoded video to another device (such as the client device 206) over a network, e.g., the network 210. In implementations, if video grouping is employed, the adaptive resolution coding system 202 may send an encoded video group to the other device over the network, without waiting for other subsequent encoded video groups. This further improves the speed of video transmission, without needing to wait for the entire video to finish encoding, which could take tens of seconds or minutes.
In implementations, if the input video is sent to the other device (such as the client device 206) for storage in the other device, the adaptive resolution  coding system 202 may further send information of the plurality of video groups to the other device, so that the other device can recover the input video from the plurality of video groups. By way of example and not limitation, the adaptive resolution coding system 202 may include or insert respective sequence numbers of the plurality of video groups of the input video in corresponding data headers of data packets including the plurality of video groups, and a data header of a data packet including the last video group includes a special label indicating that the video group included in this data packet is the last video group of the input video. The other device can then recover the input video based on the sequence numbers included in the data headers of the data packets that are received.
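By way of example and not limitation, the sequence-number headers and last-group label described above may be sketched as follows (the header field names are illustrative, not a wire format from the disclosure):

```python
def make_packet(seq, payload, is_last):
    """Wrap an encoded video group in a data packet whose header
    carries the group's sequence number and a last-group label."""
    return {"header": {"seq": seq, "last": is_last}, "payload": payload}

def reassemble(packets):
    """Recover the original order of video groups from packets that
    may arrive out of order. Returns None until the last group and
    every earlier group have arrived."""
    packets = sorted(packets, key=lambda p: p["header"]["seq"])
    if not packets or not packets[-1]["header"]["last"]:
        return None  # the labeled last group has not arrived yet
    if [p["header"]["seq"] for p in packets] != list(range(len(packets))):
        return None  # an earlier group is still missing
    return [p["payload"] for p in packets]
```

The receiver can thus detect both a missing intermediate group and an absent last group before combining the groups into the input video.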
In implementations, if the input video is sent to the other device as a streaming video (or a video stream) , an inclusion of a sequence number in a data header of a data packet including a video group as described above may or may not be used, depending on whether a strict in-order requirement (i.e., a requirement for a correct order of video groups to be displayed) is imposed at the other device. For example, the adaptive resolution coding system 202 may include or insert respective sequence numbers of the plurality of video groups of the input video in corresponding data headers of data packets including the plurality of video groups if the strict in-order requirement is imposed.
In implementations, the adaptive resolution coding system 202 may further send additional information to the other device. The additional information may include, but is not limited to, information of an original resolution of the input video (i.e., the input resolution), information of the resampling filter (such as the downsampling filter) that is used for resizing or resampling, etc. As such, the other device may restore the video to the original resolution by decoding and resizing (e.g., upsampling) using an opposite or conjugate filter (such as a corresponding upsampling filter).
Although the above method blocks are described to be executed in a particular order, in some implementations, some or all of the method blocks can be executed in other orders, or in parallel. For example, the adaptive resolution coding system 202 may encode a certain video group of an input video at a target bit rate, while resizing or resampling one or more video segments that are located after the video group according to a recommended resolution (or resampling ratio) . Additionally or alternatively, the adaptive resolution coding system 202 may determine a recommended resolution (or resampling ratio) for at least one video frame of a video segment, while preliminarily encoding video segments that are located after the video segment to determine respective one or more video frames and bit rates as inputs to a trained machine learning model.
Furthermore, although the adaptive resolution coding system 202 is described to obtain the bit rate of the at least one video frame, and use the bit rate as one of the inputs to the machine learning model in the above blocks, in other instances, the adaptive resolution coding system 202 may obtain a size (e.g., an amount of bits) of the at least one video frame, and use the size of the at least one video frame as one of the inputs to the machine learning model instead. Moreover, in some implementations, the target bit rate is used as one of the inputs to the machine learning model, instead of the bit rate or the size of the at least one video frame.
Referring back to FIG. 5, at block 502, the client device 206 may receive an encoded or compressed video.
In implementations, the client device 206 may receive an encoded video from the adaptive resolution coding system 202 or another device, such as the client device 204, via the network 210. In implementations, the client device 206 may further receive additional information, which may include, but is not limited to, information of an original or intended resolution to which the encoded video is to be resized or restored, information of a resizing or resampling filter that has been used for the encoded video, etc. In implementations, if the encoded video is a video group of an input video that is sent from the adaptive resolution coding system 202 or the client device 204, the additional information may further include a sequence number associated with the video group.
At block 504, the client device 206 may decode the encoded or compressed video to obtain a decoded or decompressed video.
In implementations, the client device 206 may decode the encoded video into a video format that is supported by the client device 206. In implementations, the encoded video may be a compressed video, and decoding the encoded video may include decompressing the compressed video. Examples of the video format include, but are not limited to, an H.264 format, an MPEG-4 format, an AVI format, etc.
At block 506, the client device 206 may resize the decoded video.
In implementations, based on the additional information received from the adaptive resolution coding system 202 or the client device 204, the client device 206 may resize the decoded video to the original resolution using an upsampling filter that is opposite or conjugate to the downsampling filter used for the encoded video.
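By way of example and not limitation, the client-side restoration step above may be sketched with a pixel-replication upsampler. This is a simple stand-in for the conjugate upsampling filter, assuming the video was downsampled by a factor of two in each dimension; an actual decoder would select the filter named in the additional information.

```python
def upsample_2x(frame):
    """Restore a 2x-downsampled frame to its original dimensions by
    pixel replication: each pixel becomes a 2x2 block.

    `frame` is a 2-D list of pixel values.
    """
    out = []
    for row in frame:
        # Duplicate each pixel horizontally, then the row vertically.
        wide = [p for pixel in row for p in (pixel, pixel)]
        out.append(wide)
        out.append(list(wide))
    return out
```

Applying `upsample_2x` to a decoded 2x2 frame yields a 4x4 frame at the original resolution, after which the video can be played, presented, or stored as described at block 508.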
At block 508, the client device 206 may play or present the resized video to a user of the client device 206, and/or store the resized video in a memory of the client device 206.
In implementations, if the video received by the client device 206 at block 502 is a video group of a plurality of video groups of a streaming video (or a video stream), the client device 206 may display or store the video group according to a correct order of the plurality of video groups. For example, the client device 206 may place the video group in a buffer of the client device 206, and arrange the video group in the right position among one or more video groups (of the plurality of video groups) that have been received by the client device 206 according to a sequence number associated with the video group. The client device 206 may display the video group after the video group (s) located prior thereto is/are displayed. In implementations, if the video received by the client device 206 at block 502 is a video group of a plurality of video groups of a video intended to be stored in the client device 206, the client device 206 may place and arrange the video group in the buffer of the client device 206, and wait until all the video groups of the video are received to combine and store the video groups as a single video in the memory of the client device 206.
Any of the acts of any of the methods described herein may be implemented at least partially by a processor or other electronic device based on instructions stored on one or more computer-readable media. By way of example and not limitation, any of the acts of any of the methods described herein may be implemented under control of one or more processors configured with executable instructions that may be stored on one or more computer-readable media.
Example Machine Learning Model
By way of example and not limitation, a neural network model is used herein as an example of the trained machine learning model described above. It should be noted that the present disclosure is not limited to this example neural network model, and other types of machine learning models can also be used and applicable to the present disclosure.
In implementations, a neural network model, such as a convolutional neural network (CNN) model, may be used as the machine learning model described above. By way of example and for the sake of simplicity, a convolutional neural network (CNN) model, such as Mobilenet v2, may be used as an example and backbone for the machine learning model as described above. In this example, since the convolutional neural network model is capable of extracting image features, the image information of the at least one video frame as described above may be pixel values of the at least one video frame. In implementations, training samples with image information of respective video frames of a plurality of videos and corresponding bit rates of the respective video frames of the plurality of videos (or respective target bit rates) as inputs, and respective known optimal resolutions (or resampling ratios) as outputs, may be used for training the neural network model. These training samples may be obtained by a brute force approach or from a third-party database. In implementations, parameters (such as connection weights between nodes of same and different layers, biases, etc.) of the neural network model are learned and determined using a subset of the training samples based on a particular optimization or training algorithm, such as a gradient descent method, a conjugate gradient method, a Quasi-Newton method, etc. After training, the neural network model may be tested and validated using another subset of the training samples. If an accuracy of recognition is less than a predetermined threshold, the neural network model may be retrained until the accuracy of recognition is greater than or equal to the predetermined threshold.
Furthermore, the neural network model may have a different number of convolutional layers, and/or a different number of feature maps in each layer, depending on the desired complexity of the neural network model, the desired accuracy of recognition, and/or the desired speed of computation for determining a recommended resolution (or resampling ratio) , etc. For example, if a recommended resolution or resampling ratio is needed for videos of a higher resolution, a higher number of features may exist in video frames of a video of the higher resolution, and so a higher number of feature maps in each layer may be desirable.
Conclusion
Although implementations have been described in language specific to structural features and/or methodological acts, it is to be understood that the claims are not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claimed subject matter. Additionally or alternatively, some or all of the operations may be implemented by one or more ASICs, FPGAs, or other hardware.
The present disclosure can be further understood using the following clauses.
Clause 1: A method implemented by a computing device, the method comprising: determining at least one video frame and a corresponding bit rate of the at least one video frame; inputting the at least one video frame and the corresponding bit rate into a machine learning model to obtain a recommended resolution; resizing the at least one video frame and one or more other video frames associated with the at least one video frame according to the recommended resolution; and encoding the at least one video frame and the one or more other video frames to obtain an encoded video according to a target bit rate after the resizing.
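The steps of Clause 1 can be sketched end to end as follows. The functions `recommend_resolution`, `resize`, and `encode` are illustrative stand-ins (a real system would use the trained CNN and a video codec), and the bit-rate thresholds are assumptions chosen only to make the example concrete.

```python
# Hypothetical sketch of the Clause 1 pipeline: determine a frame and bit
# rate, obtain a recommended resolution from a (stubbed) model, resize the
# frame and its associated frames, then encode at the target bit rate.

def recommend_resolution(key_frame, bit_rate_kbps):
    """Stand-in for the machine learning model: frame + bit rate in,
    recommended resolution out. This stub keys on bit rate only."""
    for floor_kbps, resolution in [(4000, (1920, 1080)),
                                   (1500, (1280, 720)),
                                   (0, (640, 360))]:
        if bit_rate_kbps >= floor_kbps:
            return resolution

def resize(frame, resolution):
    # Stub: a real resizer would resample the pixel data.
    return {"pixels": frame["pixels"], "resolution": resolution}

def encode(frames, target_bit_rate_kbps):
    # Stub: a real encoder would produce a compressed bitstream.
    return {"frames": frames, "bit_rate_kbps": target_bit_rate_kbps}

def adaptive_resolution_encode(key_frame, other_frames, target_bit_rate_kbps):
    resolution = recommend_resolution(key_frame, target_bit_rate_kbps)
    resized = [resize(f, resolution) for f in [key_frame] + other_frames]
    return encode(resized, target_bit_rate_kbps), resolution

frames = [{"pixels": b"..."} for _ in range(4)]
encoded, resolution = adaptive_resolution_encode(frames[0], frames[1:], 2000)
# resolution -> (1280, 720) under the toy thresholds
```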
Clause 2: The method of Clause 1, further comprising: sending the encoded video to a receiving computing device via a network.
Clause 3: The method of Clause 2, further comprising: sending information of a resizing filter for resizing the at least one video frame and the one or more other video frames to the receiving computing device via the network, the  information of the resizing filter enabling the receiving computing device to undo the resizing using a corresponding reversing filter.
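The filter-signaling round trip of Clause 3 can be made concrete with a deliberately simple filter pair. The 2x pixel-drop resizer and pixel-replicate reverser below are illustrative assumptions only; a real codec would typically signal an interpolation filter, and the reversal is generally lossy.

```python
# Illustrative sketch of Clause 3: the sender transmits information about
# the resizing filter alongside the resized video, so the receiver can
# apply a corresponding reversing filter to undo the resizing.

def downscale_2x(frame):
    """Keep every other sample in both dimensions (the 'resizing filter')."""
    return [row[::2] for row in frame[::2]]

def upscale_2x(frame):
    """Replicate samples to undo the resize (the 'reversing filter')."""
    out = []
    for row in frame:
        wide = [p for p in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

FILTERS = {"drop2x": (downscale_2x, upscale_2x)}

def send(frame, filter_id="drop2x"):
    resizer, _ = FILTERS[filter_id]
    # The filter identifier travels alongside the resized payload so the
    # receiver knows which reversing filter to apply.
    return {"payload": resizer(frame), "filter": filter_id}

def receive(packet):
    _, reverser = FILTERS[packet["filter"]]
    return reverser(packet["payload"])

frame = [[y * 10 + x for x in range(4)] for y in range(4)]
restored = receive(send(frame))
# restored regains the original 4x4 dimensions (though detail is lost)
```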
Clause 4: The method of Clause 1, wherein determining the at least one video frame and the bit rate of the at least one video frame comprises: encoding an input video comprising the at least one video frame according to the target bit rate; and extracting information of the at least one video frame and the corresponding bit rate from the encoded input video.
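The two-pass idea in Clause 4 can be sketched as follows: first encode the input video at the target bit rate, then read back which frame is the intra frame and how many bits it consumed. Here `first_pass_encode` is a stub standing in for a real encoder's per-frame statistics output; the frame-size ratios are assumptions for illustration.

```python
# Hypothetical sketch of Clause 4: encode the input video at the target
# bit rate, then extract the intra frame and its corresponding bit rate
# from the encoder's statistics.

def first_pass_encode(num_frames, target_bit_rate_kbps, fps=30):
    """Stub: assume the intra frame takes a large share of the bit budget."""
    bits_per_frame = target_bit_rate_kbps * 1000 / fps
    stats = [{"index": 0, "type": "I", "bits": bits_per_frame * 4}]
    for i in range(1, num_frames):
        stats.append({"index": i, "type": "P", "bits": bits_per_frame * 0.5})
    return stats

def extract_intra_frame_rate(stats, fps=30):
    """Find the intra frame and express its cost as an equivalent kbps."""
    intra = next(s for s in stats if s["type"] == "I")
    return intra["index"], intra["bits"] * fps / 1000

stats = first_pass_encode(num_frames=30, target_bit_rate_kbps=3000)
index, intra_kbps = extract_intra_frame_rate(stats)
# index -> 0; intra_kbps -> 12000.0 (4x the per-frame average, as kbps)
```

The extracted frame and bit rate would then be the inputs to the machine learning model of Clause 1.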
Clause 5: The method of Clause 4, wherein the input video comprises a subset of a streaming video or a stored video.
Clause 6: The method of Clause 1, wherein the at least one frame comprises an intra frame, and the one or more other video frames comprise one or more inter frames that are encoded based on the at least one frame.
Clause 7: The method of Clause 1, wherein the machine learning model is configured to receive video frames of a particular resolution and determine a corresponding resolution for resizing a video comprising the video frames of the particular resolution for transmission at a designated bit rate.
Clause 8: One or more processor-readable media storing executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising: determining at least one video frame and a corresponding bit rate of the at least one video frame; inputting the at least one video frame and the corresponding bit rate into a machine learning model to obtain a recommended resolution; resizing the at least one video frame and one or more other video frames associated with the at least one video frame according to the recommended resolution; and encoding the at least one video frame and the one or more other video frames to obtain an encoded video according to a target bit rate after the resizing.
Clause 9: The one or more processor-readable media of Clause 8, the acts further comprising: sending the encoded video to a receiving computing device via a network.
Clause 10: The one or more processor-readable media of Clause 9, the acts further comprising: sending information of a resizing filter for resizing the at least one video frame and the one or more other video frames to the receiving computing device via the network, the information of the resizing filter enabling the receiving computing device to undo the resizing using a corresponding reversing filter.
Clause 11: The one or more processor-readable media of Clause 8, wherein determining the at least one video frame and the bit rate of the at least one video frame comprises: encoding an input video comprising the at least one video frame according to the target bit rate; and extracting information of the at least one video frame and the corresponding bit rate from the encoded input video.
Clause 12: The one or more processor-readable media of Clause 11, wherein the input video comprises a subset of a streaming video or a stored video.
Clause 13: The one or more processor-readable media of Clause 8, wherein the at least one frame comprises an intra frame, and the one or more other video frames comprise one or more inter frames that are encoded based on the at least one frame.
Clause 14: The one or more processor-readable media of Clause 8, wherein the machine learning model is configured to receive video frames of a particular resolution and determine a corresponding resolution for resizing a video comprising the video frames of the particular resolution for transmission at a designated bit rate.
Clause 15: A system comprising: one or more processors; and memory storing executable instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising: determining at least one video frame and a corresponding bit rate of the at least one video frame; inputting the at least one video frame and the corresponding bit rate into a machine learning model to obtain a recommended resolution; resizing the at least one video frame and one or more other video frames associated with the at least one video frame according to the recommended resolution; and encoding the at least one video frame and the one or more other video frames to obtain an encoded video according to a target bit rate after the resizing.
Clause 16: The system of Clause 15, the acts further comprising: sending the encoded video to a receiving computing device via a network.
Clause 17: The system of Clause 16, the acts further comprising: sending information of a resizing filter for resizing the at least one video frame and the one or more other video frames to the receiving computing device via the network, the information of the resizing filter enabling the receiving computing device to undo the resizing using a corresponding reversing filter.
Clause 18: The system of Clause 15, wherein determining the at least one video frame and the bit rate of the at least one video frame comprises: encoding an input video comprising the at least one video frame according to the target bit rate; and extracting information of the at least one video frame and the corresponding bit rate from the encoded input video.
Clause 19: The system of Clause 15, wherein the at least one frame comprises an intra frame, and the one or more other video frames comprise one or more inter frames that are encoded based on the at least one frame.
Clause 20: The system of Clause 15, wherein the machine learning model is configured to receive video frames of a particular resolution and determine a corresponding resolution for resizing a video comprising the video frames of the particular resolution for transmission at a designated bit rate.

Claims (20)

  1. A method implemented by a computing device, the method comprising:
    determining at least one video frame and a corresponding bit rate of the at least one video frame;
    inputting the at least one video frame and the corresponding bit rate into a machine learning model to obtain a recommended resolution;
    resizing the at least one video frame and one or more other video frames associated with the at least one video frame according to the recommended resolution; and
    encoding the at least one video frame and the one or more other video frames to obtain an encoded video according to a target bit rate after the resizing.
  2. The method of claim 1, further comprising sending the encoded video to a receiving computing device via a network.
  3. The method of claim 2, further comprising sending information of a resizing filter for resizing the at least one video frame and the one or more other video frames to the receiving computing device via the network, the information of the resizing filter enabling the receiving computing device to undo the resizing using a corresponding reversing filter.
  4. The method of claim 1, wherein determining the at least one video frame and the bit rate of the at least one video frame comprises:
    encoding an input video comprising the at least one video frame according to the target bit rate; and
    extracting information of the at least one video frame and the corresponding bit rate from the encoded input video.
  5. The method of claim 4, wherein the input video comprises a subset of a streaming video or a stored video.
  6. The method of claim 1, wherein the at least one frame comprises an intra frame, and the one or more other video frames comprise one or more inter frames that are encoded based on the at least one frame.
  7. The method of claim 1, wherein the machine learning model is configured to receive video frames of a particular resolution and determine a corresponding resolution for resizing a video comprising the video frames of the particular resolution for transmission at a designated bit rate.
  8. One or more processor-readable media storing executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising:
    determining at least one video frame and a corresponding bit rate of the at least one video frame;
    inputting the at least one video frame and the corresponding bit rate into a machine learning model to obtain a recommended resolution;
    resizing the at least one video frame and one or more other video frames associated with the at least one video frame according to the recommended resolution; and
    encoding the at least one video frame and the one or more other video frames to obtain an encoded video according to a target bit rate after the resizing.
  9. The one or more processor-readable media of claim 8, the acts further comprising sending the encoded video to a receiving computing device via a network.
  10. The one or more processor-readable media of claim 9, the acts further comprising sending information of a resizing filter for resizing the at least one video frame and the one or more other video frames to the receiving computing device via the network, the information of the resizing filter enabling the receiving computing device to undo the resizing using a corresponding reversing filter.
  11. The one or more processor-readable media of claim 8, wherein determining the at least one video frame and the bit rate of the at least one video frame comprises:
    encoding an input video comprising the at least one video frame according to the target bit rate; and
    extracting information of the at least one video frame and the corresponding bit rate from the encoded input video.
  12. The one or more processor-readable media of claim 11, wherein the input video comprises a subset of a streaming video or a stored video.
  13. The one or more processor-readable media of claim 8, wherein the at least one frame comprises an intra frame, and the one or more other video frames comprise one or more inter frames that are encoded based on the at least one frame.
  14. The one or more processor-readable media of claim 8, wherein the machine learning model is configured to receive video frames of a particular resolution and determine a corresponding resolution for resizing a video comprising the video frames of the particular resolution for transmission at a designated bit rate.
  15. A system comprising:
    one or more processors; and
    memory storing executable instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising:
    determining at least one video frame and a corresponding bit rate of the at least one video frame;
    inputting the at least one video frame and the corresponding bit rate into a machine learning model to obtain a recommended resolution;
    resizing the at least one video frame and one or more other video frames associated with the at least one video frame according to the recommended resolution; and
    encoding the at least one video frame and the one or more other video frames to obtain an encoded video according to a target bit rate after the resizing.
  16. The system of claim 15, the acts further comprising sending the encoded video to a receiving computing device via a network.
  17. The system of claim 16, the acts further comprising sending information of a resizing filter for resizing the at least one video frame and the one or more other video frames to the receiving computing device via the network, the information of the resizing filter enabling the receiving computing device to undo the resizing using a corresponding reversing filter.
  18. The system of claim 15, wherein determining the at least one video frame and the bit rate of the at least one video frame comprises:
    encoding an input video comprising the at least one video frame according to the target bit rate; and
    extracting information of the at least one video frame and the corresponding bit rate from the encoded input video.
  19. The system of claim 15, wherein the at least one frame comprises an intra frame, and the one or more other video frames comprise one or more inter frames that are encoded based on the at least one frame.
  20. The system of claim 15, wherein the machine learning model is configured to receive video frames of a particular resolution and determine a corresponding resolution for resizing a video comprising the video frames of the particular resolution for transmission at a designated bit rate.
PCT/CN2019/111598 2019-10-17 2019-10-17 Adaptive resolution coding based on machine learning model WO2021072694A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/111598 WO2021072694A1 (en) 2019-10-17 2019-10-17 Adaptive resolution coding based on machine learning model


Publications (1)

Publication Number Publication Date
WO2021072694A1 (en) 2021-04-22

Family

ID=75537626


Country Status (1)

Country Link
WO (1) WO2021072694A1 (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2736261A1 (en) * 2012-11-27 2014-05-28 Alcatel Lucent Method For Assessing The Quality Of A Video Stream
US20180063549A1 (en) * 2016-08-24 2018-03-01 Ati Technologies Ulc System and method for dynamically changing resolution based on content
CN109218727A (en) * 2017-06-30 2019-01-15 华为软件技术有限公司 The method and apparatus of video processing
US20190075301A1 (en) * 2017-09-01 2019-03-07 Apple Inc. Machine learning video processing systems and methods
US20190132591A1 (en) * 2017-10-26 2019-05-02 Intel Corporation Deep learning based quantization parameter estimation for video encoding


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113452996A (en) * 2021-06-08 2021-09-28 杭州朗和科技有限公司 Video coding and decoding method and device
CN113452996B (en) * 2021-06-08 2024-04-19 杭州网易智企科技有限公司 Video coding and decoding method and device
CN115190309A (en) * 2022-06-30 2022-10-14 北京百度网讯科技有限公司 Video frame processing method, training method, device, equipment and storage medium


Legal Events

Code Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19949242; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 19949242; Country of ref document: EP; Kind code of ref document: A1)