WO2022036678A1

WO2022036678A1 - Multi-level region-of-interest quality controllable video coding techniques

Info

Publication number: WO2022036678A1
Application number: PCT/CN2020/110473
Authority: WO
Inventors: Guanlin WU; Tae Meon Bae; Yen-Kuang Chen; Minghai Qin; Haoran LI
Original assignee: Alibaba Group Holding Limited
Priority date: 2020-08-21
Filing date: 2020-08-21
Publication date: 2022-02-24

Abstract

Video coding techniques can include determining one or more regions-of-interest in a plurality of levels of interest. Encoding bitrates can be determined for the regions-of-interest in each of the plurality of levels of interest. A compressed bitstream can be generated based on the regions-of-interest in the plurality of levels of interest using the corresponding encoding bitrate of the regions-of-interest.

Description

MULTI-LEVEL REGION-OF-INTEREST QUALITY CONTROLLABLE VIDEO CODING TECHNIQUES

BACKGROUND OF THE INVENTION

Numerous techniques are used for reducing the amount of data consumed by the transmission or storage of video. One common technique is to use variable bitrate encoding of video frame data. Referring to FIG. 1, an exemplary video frame image is illustrated. The portion of the frame that contains a piece of jewelry may be more important or interesting than the rest of the frame that generally contains the background. A region-of-interest (ROI) 110 about the piece of jewelry may be specified by a bounding box 120, with the remainder of the video frame being a non-region-of-interest 130. In one implementation, the bounding box can be specified by a pair of coordinates of points at opposite corners. To reduce the amount of data that needs to be transmitted or stored, a first bitrate can be utilized to encode a region-of-interest (ROI), and a second bitrate can be utilized to encode a non-region-of-interest. The detected region-of-interest (ROI) 110 can be encoded with a higher bitrate, so that the image of the piece of jewelry will have a better image quality than the non-region-of-interest 130 portion of the image that is encoded with a lower bitrate.

The detection of regions-of-interest and variable bitrate encoding of regions-of-interest and non-regions-of-interest can be computationally intensive. In addition, it can be difficult to adjust the variable bitrate encoding. Accordingly, there is a continuing need for improved variable bitrate encoding of video images.

SUMMARY OF THE INVENTION

The present technology may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the present technology directed toward multi-level region-of-interest (ML-ROI) quality controllable video coding techniques.

In one embodiment, a video processing system can include a multi-level region-of-interest detector, a rate controller, and a video encoder. The multi-level region-of-interest detector can be configured to determine multi-level regions-of-interest of a video stream. The rate controller can be configured to determine encoding parameters for the multi-level regions-of-interest. The video encoder can be configured to encode the video stream using variable bitrate encoding based on the determined multi-level regions-of-interest and the encoding parameters of the multi-level regions-of-interest to generate a compressed bitstream. The rate controller can control a quality of regions-of-interest by allocating precise bitrates to each level of interest of the determined multi-level regions-of-interest, including enhancing a quality of regions-of-interest of a first level of interest and degrading a quality of regions-of-interest of a second level of interest, wherein the first level of interest has a higher priority than the second level of interest.

In another embodiment, a computing system can include one or more processors, one or more computing device readable storage media and a video encoder. The one or more computing device readable storage media can store computing executable instructions that when executed by the one or more processors perform a method including determining one or more regions-of-interest in each of a plurality of different levels of interest in each frame or sets of frames of an input video stream. The method can further include determining encoding parameters for each of the determined regions-of-interest in each frame or sets of frames based on the corresponding level of interest of each of the regions-of-interest. The video encoder can be configured to generate a compressed bitstream of the input video stream based on the determined encoding parameters of each of the regions-of-interest.

In yet another embodiment, a method of video processing can include determining a plurality of regions-of-interest, including regions-of-interest of three or more different levels of interest, in each frame or sets of frames of a received video stream. Encoding parameters can be determined for each of the regions-of-interest in each frame or sets of frames based on the corresponding level of interest of each of the regions-of-interest. The received video stream can be encoded based on the determined encoding parameters of each of the regions-of-interest to generate a compressed bitstream.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present technology are illustrated by way of example and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

FIG. 1 illustrates an exemplary video frame image.

FIG. 2 shows a block diagram of a video processing system, in accordance with aspects of the present technology.

FIG. 3 shows a flow diagram of video processing, in accordance with aspects of the present technology.

FIG. 4 illustrates an exemplary image frame of an input video stream, in accordance with aspects of the present technology.

FIG. 5 shows a block diagram of a video processing system, in accordance with aspects of the present technology.

FIG. 6 shows a block diagram of an exemplary processing unit including a video processing system, in accordance with aspects of the present technology.

FIG. 7 shows a block diagram of an exemplary processing core, in accordance with aspects of the present technology.

DETAILED DESCRIPTION OF THE INVENTION

Reference will now be made in detail to the embodiments of the present technology, examples of which are illustrated in the accompanying drawings. While the present technology will be described in conjunction with these embodiments, it will be understood that they are not intended to limit the technology to these embodiments. On the contrary, the invention is intended to cover alternatives, modifications and equivalents, which may be included within the scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present technology, numerous specific details are set forth in order to provide a thorough understanding of the present technology. However, it is understood that the present technology may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the present technology.

Some embodiments of the present technology which follow are presented in terms of routines, modules, logic blocks, and other symbolic representations of operations on data within one or more electronic devices. The descriptions and representations are the means used by those skilled in the art to most effectively convey the substance of their work to others skilled in the art. A routine, module, logic block and/or the like, is herein, and generally, conceived to be a self-consistent sequence of processes or instructions leading to a desired result. The processes are those including physical manipulations of physical quantities. Usually, though not necessarily, these physical manipulations take the form of electric or magnetic signals capable of being stored, transferred, compared and otherwise manipulated in an electronic device. For reasons of convenience, and with reference to common usage, these signals are referred to as data, bits, values, elements, symbols, characters, terms, numbers, strings, and/or the like with reference to embodiments of the present technology.

It should be borne in mind, however, that these terms are to be interpreted as referencing physical manipulations and quantities and are merely convenient labels and are to be interpreted further in view of terms commonly used in the art. Unless specifically stated otherwise as apparent from the following discussion, it is understood that through discussions of the present technology, discussions utilizing the terms such as “receiving,” and/or the like, refer to the actions and processes of an electronic device such as an electronic computing device that manipulates and transforms data. The data is represented as physical (e.g., electronic) quantities within the electronic device’s logic circuits, registers, memories and/or the like, and is transformed into other data similarly represented as physical quantities within the electronic device.

In this application, the use of the disjunctive is intended to include the conjunctive. The use of definite or indefinite articles is not intended to indicate cardinality. In particular, a reference to “the” object or “a” object is intended to denote also one of a possible plurality of such objects. The use of the terms “comprises,” “comprising,” “includes,” “including” and the like specify the presence of stated elements, but do not preclude the presence or addition of one or more other elements and/or groups thereof. It is also to be understood that although the terms first, second, etc. may be used herein to describe various elements, such elements should not be limited by these terms. These terms are used herein to distinguish one element from another. For example, a first element could be termed a second element, and similarly a second element could be termed a first element, without departing from the scope of embodiments. It is also to be understood that when an element is referred to as being “coupled” to another element, it may be directly or indirectly connected to the other element, or an intervening element may be present. In contrast, when an element is referred to as being “directly connected” to another element, there are no intervening elements present. It is also to be understood that the term “and/or” includes any and all combinations of one or more of the associated elements. It is also to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting.

Referring to FIG. 2 a video processing system, in accordance with aspects of the present technology, is shown. The video processing system 200 can include a multi-level region-of-interest (ML-ROI) detector 210, a rate controller 220, a video encoder 230 and memory 240. The ML-ROI detector 210 can be configured to determine multi-level regions-of-interest in an input video stream 250. The rate controller 220 can be configured to determine encoding parameters for the multi-level regions-of-interest. The video encoder 230 can be configured to encode the input video stream 250 using variable bitrate encoding based on the determined multi-level regions-of-interest (ML-ROIs) and the corresponding encoding parameters for the multi-level regions-of-interest (ML-ROIs) to generate a compressed bitstream 260. The memory 240 can be configured to store, cache, or buffer frame data of the input video stream 250, region-of-interest data, levels of interest, encoding data, rate control data, intermediate result data, and the like. The video processing system 200 can be implemented in hardware, firmware, software or combinations thereof. Operation of the video processing system will be further explained with reference to FIG. 3, which shows a video processing method, in accordance with aspects of the present technology.

The multi-level region-of-interest (ML-ROI) detector 210 can receive the input video stream 250, at 310. The input video stream 250 can include a plurality of image data frames. The ML-ROI detector 210 can be configured to determine a plurality of regions-of-interest (ROIs), including regions-of-interest (ROIs) of three or more different interest levels, in each image data frame or a set of image data frames. The term region-of-interest as used herein generally refers to identification of objects within a data set that also includes identification of an associated object type. The term levels of interest as used herein generally refers to interest level, priority, complexity or the like of the corresponding region-of-interest. The levels of interest can be pre-configured, user specified or the like. In one implementation, the ML-ROI detector 210 can determine one or more regions-of-interest including one or more object types of a highest level of interest, one or more regions-of-interest including one or more object types of a next level of interest, and so on. Regions of the image in which no object of the one or more object types is detected are generally referred to as non-regions-of-interest (non-ROIs) and can be associated with a lowest interest level. In one implementation, the ML-ROI detector 210 can be configured to determine between two and four levels of regions-of-interest and a non-region-of-interest level. However, any number of levels of regions-of-interest can be determined by the ML-ROI detector 210.

The number of levels of regions-of-interest can be based upon, but is not limited to, the characteristics of the video frames, the processing resources available for determining the regions-of-interest and the levels thereof, the communication and/or storage bandwidth available for transmitting and/or storing the images, the resolution of the display devices, a specified data compression improvement, a specified quality of the reconstructed video frames, and the ability of the human eye to perceive the difference between the determined levels of interest. The regions-of-interest (ROIs) can be identified by bounding box coordinates, associated object type and associated interest level. In one implementation, the ML-ROI detector 210 can determine regions-of-interest of a predetermined number of levels of interest.
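
As a minimal sketch of the region-of-interest record described above (bounding box coordinates, object type and interest level), one possible representation is shown below. The class name `RegionOfInterest`, its field names, the example object labels and the convention that level 1 is the highest level of interest are all assumptions made for illustration and do not appear in this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionOfInterest:
    # Bounding box given by two opposite corners, in pixel coordinates.
    x0: int
    y0: int
    x1: int
    y1: int
    object_type: str     # hypothetical labels such as "face" or "text"
    interest_level: int  # assumption: 1 is the highest level of interest

    def area(self) -> int:
        return max(0, self.x1 - self.x0) * max(0, self.y1 - self.y0)


def group_by_level(rois):
    """Group ROIs by interest level, highest priority (smallest number) first."""
    levels = {}
    for roi in rois:
        levels.setdefault(roi.interest_level, []).append(roi)
    return dict(sorted(levels.items()))


rois = [
    RegionOfInterest(10, 10, 50, 40, "face", 1),
    RegionOfInterest(60, 20, 100, 80, "text", 2),
    RegionOfInterest(0, 0, 20, 20, "logo", 1),
]
by_level = group_by_level(rois)
```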

Referring now to FIG. 4, an exemplary image frame of an input video stream, in accordance with aspects of the present technology, is shown. The ML-ROI detector 210 can determine a plurality of regions-of-interest 410-470 and a corresponding interest level (QP1-QP7). For example, the ML-ROI detector 210 can determine a different interest level for each of the determined regions-of-interest 410-470. In other frames, some regions-of-interest can be determined to have the same interest level, while other regions-of-interest have different interest levels.

The rate controller 220 can receive the input video stream 250 and the determined plurality of regions-of-interest, including regions-of-interest of three or more different levels. The rate controller 220 can be configured to determine encoding parameters for the plurality of regions-of-interest, at 330. In one implementation, the rate controller 220 can be configured to determine quantization parameters for each of the multi-level regions-of-interest for a video frame or set of video frames. The rate controller 220 can determine quantization parameters such that a sum of distortion is minimized, subject to a rate constraint, and subject to a target quality for each region-of-interest. The rate controller 220 can determine quantization parameters utilizing a reinforcement learning (RL) model. The video encoder 230 can comprise the environment of the RL model and the rate controller 220 can implement an agent 270 of the RL model. The agent 270 can generate quantization parameters and target bits (e.g., the action) for use by the video encoder 230 based on region complexity, target quality, region quality, frame bit budget and region ratio parameters (e.g., the state), and on distortion (e.g., the residual between the input video frame and a reconstructed video frame), so as to maximize a long-term distortion optimization (e.g., the reward).
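
The constrained selection of quantization parameters described above can be written compactly as follows. The symbols are notation chosen here for illustration and are not taken from this disclosure: $N$ is the number of regions-of-interest in the frame, $D_i$, $R_i$ and $Q_i$ are the distortion, rate and quality of region $i$ as functions of its quantization parameter $QP_i$, $R_{\text{frame}}$ is the frame bit budget, and $Q_i^{\text{target}}$ is the target quality of region $i$. Expressing the quality requirement as an inequality is an assumption.

```latex
\min_{\{QP_i\}} \; \sum_{i=1}^{N} D_i(QP_i)
\quad \text{subject to} \quad
\sum_{i=1}^{N} R_i(QP_i) \le R_{\text{frame}},
\qquad
Q_i(QP_i) \ge Q_i^{\text{target}}, \quad i = 1, \dots, N
```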

Referring again to FIG. 4, the rate controller 220 can determine a first quantization parameter (QP1) for the first region-of-interest 410 having a first level of interest, a second quantization parameter (QP2) for the second region-of-interest 420 having a second level of interest, and so on, through a seventh quantization parameter (QP7) for the seventh region-of-interest 470 having a seventh level of interest.
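
One way to picture per-region quantization parameters is as a block-level QP map over the frame. The sketch below is illustrative only: the function name `build_qp_map`, the 16x16 block size, the background QP for the non-region-of-interest, and the rule that the lower QP wins where regions overlap are assumptions, not details from this disclosure.

```python
def build_qp_map(width, height, rois, background_qp, block=16):
    """Return a 2-D list of per-block QPs.

    rois is a list of (x0, y0, x1, y1, qp) tuples in pixel coordinates.
    """
    cols = (width + block - 1) // block
    rows = (height + block - 1) // block
    qp_map = [[background_qp] * cols for _ in range(rows)]
    # Assumption: where regions overlap, the lower QP (higher quality) wins.
    for x0, y0, x1, y1, qp in rois:
        for r in range(y0 // block, min(rows, (y1 + block - 1) // block)):
            for c in range(x0 // block, min(cols, (x1 + block - 1) // block)):
                qp_map[r][c] = min(qp_map[r][c], qp)
    return qp_map


qp_map = build_qp_map(
    64, 48,
    [(0, 0, 32, 32, 24), (16, 16, 48, 48, 30)],
    background_qp=38,
)
```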

The video encoder 230 can receive the input video stream 250, the determined multi-level regions-of-interest (ML-ROIs) and the determined encoding parameters for the multi-level regions-of-interest (ML-ROIs). The video encoder 230 can be configured to generate a compressed bitstream 260 based on the multi-level regions-of-interest (ML-ROIs) and the encoding parameters for the multi-level regions-of-interest (ML-ROIs), at 340. For example, the video encoder 230 can be configured to encode one or more regions-of-interest having a first level of interest using a first bitrate, one or more regions-of-interest having a second level of interest using a second bitrate, one or more regions-of-interest having a third level of interest using a third bitrate, and so on to generate the compressed bitstream 260. In one implementation, the first level of interest can correspond to the highest interest level and can be encoded using the highest bitrate, and the lowest level of interest can correspond to the conventional non-region-of-interest and can be encoded using a lowest bitrate. The video encoder 230 can encode each of the multi-level regions-of-interest, including regions-of-interest of three or more different interest levels, determined by the ML-ROI detector 210 using the quantization parameters for each of the multi-level regions-of-interest determined by the rate controller 220. The compressed bitstream 260 can be output by the video encoder 230, at 350. In one implementation, outputting the compressed bitstream 260 can comprise streaming the compressed bitstream to one or more users on one or more networks as a streaming video service. In another implementation, the compressed bitstream 260 can be stored on one or more computing device-readable media (e.g., computer memory). In addition, the processes at 320-340 can be repeated at 360 for each video frame or set of video frames of the video stream received at 310.

Referring now to FIG. 5, a video processing system, in accordance with aspects of the present technology, is shown. The video processing system 200 can include a multi-level region-of-interest (ML-ROI) detector 210, a rate controller 220 and a video encoder 230. The ML-ROI detector 210 can be configured to determine a plurality of regions-of-interest (ROIs) in each image data frame or a set of image data frames of the input video stream. The ML-ROI detector 210 can also determine an associated interest level of each determined region-of-interest. In one implementation, the ML-ROI detector 210 can determine regions-of-interest (ROIs) in three or more levels of interest. In one implementation, the ML-ROI detector 210 can also adjust the determined regions-of-interest based on feedback from the rate controller 220 and the video encoder 230. The feedback can include frame target bit value, target quality, as-encoded bitrate, reconstructed video, remaining bit budget, as-encoded quality and/or the like, as described further below.

The rate controller 220 can include a group of pictures (GOP) bit allocation unit 505 configured to receive a requested bitrate and an input video stream. The input video stream can include a plurality of video data frames. The group of pictures bit allocation unit 505 can be configured to perform group of pictures (GOP) level bit allocation based on the video data frames and the requested bitrate. A frame bit allocation unit 510, of the rate controller 220, can be configured to perform frame level bit allocation based on the group of pictures bit allocation to generate a frame target bit allocation.
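
The two-stage allocation above (group of pictures level, then frame level) can be sketched as follows. This is an illustrative assumption of how such units might compute their outputs: the function names, the proportional-to-weight frame split and the example weights are hypothetical and are not specified in this disclosure.

```python
def gop_target_bits(bitrate_bps, frame_rate, gop_size):
    """Bits available to one group of pictures at the requested bitrate."""
    return bitrate_bps * gop_size / frame_rate


def frame_target_bits(gop_bits_left, frame_weights_left):
    """Share of the remaining GOP bits for the next frame.

    Assumption: the share is proportional to the next frame's weight
    relative to the weights of all frames still to be coded in the GOP.
    """
    return gop_bits_left * frame_weights_left[0] / sum(frame_weights_left)


gop_bits = gop_target_bits(2_000_000, 25, 50)
weights = [4.0] + [1.0] * 49  # hypothetical: first frame weighted 4x
first_frame_bits = frame_target_bits(gop_bits, weights)
```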

A reinforcement learning (RL) based multi-level region-of-interest bit allocation unit 515, of the rate controller 220, can be configured to receive coordinates of a plurality of regions-of-interest, including regions-of-interest of three or more different interest levels, determined by the ML-ROI detector 210. The reinforcement learning (RL) based multi-level region-of-interest bit allocation unit 515 can also be configured to receive the frame target bit allocation from the frame bit allocation unit 510. The reinforcement learning (RL) based multi-level region-of-interest bit allocation unit 515 can be configured to allocate bits for the plurality of determined regions-of-interest based on the interest level of each region-of-interest and the frame target bit allocation. The reinforcement learning (RL) based multi-level region-of-interest bit allocation unit 515 can also be configured to receive target complexity estimates of the plurality of regions-of-interest estimated by a region-of-interest complexity estimation unit 520, as described further below. The reinforcement learning (RL) based multi-level region-of-interest bit allocation unit 515 can also be configured to receive quality estimations of the plurality of regions-of-interest estimated by a region-of-interest quality estimation unit 525, as described further below. The reinforcement learning (RL) based multi-level region-of-interest bit allocation unit 515 can be further configured to allocate bits for the plurality of determined regions-of-interest based on the estimated target complexity of the regions-of-interest and the estimated target quality of the regions-of-interest.

In one implementation, the reinforcement learning (RL) based multi-level region-of-interest bit allocation unit 515 can use the coordinates from the multi-level region-of-interest (ML-ROI) detector 210 and complexity values of each region-of-interest estimated by the region-of-interest (ROI) complexity estimation unit 520 to allocate bits for each region-of-interest respectively. The reinforcement learning (RL) based multi-level region-of-interest bit allocation unit 515 can be configured to determine target bit allocations for the plurality of regions-of-interest such that a sum of distortion is minimized, subject to a rate constraint, and subject to a target quality for each region-of-interest. In the reinforcement learning (RL) framework, the states can be the complexity of each region, target bits for the current frame, the quantization parameters, and the requested quality. The rewards can be the sum of distortion of each region-of-interest (e.g., the residual between the original region-of-interest and the reconstructed region-of-interest). The actions can be the target bits for each region. In one implementation, training of a reinforcement learning model of the reinforcement learning based ML-ROI bit allocation unit 515 can be based on a table lookup decision making scheme, learning model, deep learning model, neural network model or the like. After training the reinforcement learning model, the model can be used for decision making of the actions input to the video encoder 230.
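
To make the table lookup option concrete, the following is a minimal sketch of a tabular agent using the standard one-step Q-learning update. It is not the training procedure of this disclosure: the class name `TableAgent`, the discretized state labels, the three candidate bit splits in `ACTIONS`, the hyperparameters, and the choice to use the negative of the summed distortion as the reward (so that maximizing reward minimizes distortion) are all assumptions.

```python
import random
from collections import defaultdict

# Hypothetical discrete actions: fraction of the frame target bits
# given to each of three levels of interest.
ACTIONS = [(0.6, 0.3, 0.1), (0.5, 0.3, 0.2), (0.4, 0.4, 0.2)]


class TableAgent:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.q = defaultdict(lambda: [0.0] * len(ACTIONS))
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def act(self, state, explore=True):
        """Epsilon-greedy action index for a (discretized) state."""
        if explore and self.rng.random() < self.epsilon:
            return self.rng.randrange(len(ACTIONS))
        values = self.q[state]
        return values.index(max(values))

    def update(self, state, action, reward, next_state):
        """Standard one-step Q-learning update of the lookup table."""
        td_target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (td_target - self.q[state][action])


agent = TableAgent()
state = ("high_complexity", "low_budget")  # hypothetical discretized state
agent.update(state, 1, reward=-2.0, next_state=state)  # reward = -distortion
agent.update(state, 0, reward=-5.0, next_state=state)
best = agent.act(state, explore=False)
```

After training, calling `act` with `explore=False` corresponds to using the table purely for decision making.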

A region-of-interest quantization model unit 530, of the rate controller 220, can receive the region-of-interest target bit allocation of the plurality of regions-of-interest from the region-of-interest bit allocation unit 515. The region-of-interest quantization model unit 530 can be configured to generate quantization parameters (QP) and/or rate-distortion-optimization (RDO) parameters for the plurality of regions-of-interest, including regions-of-interest of three or more different interest levels, based on the region-of-interest target bit allocation. In one implementation, there can be a region-of-interest quantization model unit 530 for each level of interest. For example, if there are n predetermined levels of interest there can be n region-of-interest quantization model units 530. The first region-of-interest quantization model unit can use the frame target bits from the reinforcement learning based ML-ROI bit allocation unit 515 and the target complexity of the one or more regions-of-interest in the first level of interest from the ROI complexity estimation unit 520 to obtain a quantization parameter for the one or more regions-of-interest of the first level of interest. Similarly, the second region-of-interest quantization model unit can use the frame target bits and the target complexity of the one or more regions-of-interest in the second level of interest to obtain a quantization parameter for the one or more regions-of-interest of the second level of interest. In one implementation, the quantization model can be a rate-lambda quantization model.
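
For readers unfamiliar with rate-lambda quantization models, the sketch below shows the general shape of such a mapping from target bits to a quantization parameter. This disclosure does not give the model's form or constants; the power-law relation between bits per pixel and lambda, the logarithmic lambda-to-QP mapping, and the default constants are borrowed from the R-lambda rate control used in the HEVC reference software and are assumptions here, as is the function name `rate_lambda_qp`.

```python
import math


def rate_lambda_qp(target_bits, num_pixels, alpha=3.2003, beta=-1.367):
    """Map a region's target bits to (lambda, QP) with an R-lambda style model."""
    bpp = target_bits / num_pixels          # bits per pixel
    lam = alpha * bpp ** beta               # lambda = alpha * bpp^beta
    qp = 4.2005 * math.log(lam) + 13.7122   # empirical lambda-to-QP mapping
    return lam, max(0, min(51, round(qp)))  # clip to the 0..51 QP range


# Two regions of equal size: the one given more bits gets the lower QP.
lam_hi, qp_hi = rate_lambda_qp(target_bits=60_000, num_pixels=200 * 200)
lam_lo, qp_lo = rate_lambda_qp(target_bits=20_000, num_pixels=200 * 200)
```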

A region-of-interest limitation unit 535 can receive the quantization parameters (QP) and/or rate-distortion-optimization (RDO) parameters for the regions-of-interest of the respective interest levels. The region-of-interest limitation unit 535 can be configured to constrain changes in the quantization parameters (QP) and/or rate-distortion-optimization (RDO) parameters for the plurality of regions-of-interest, including regions-of-interest of three or more different interest levels, to a predetermined rate of change range for quality stability purposes. In one implementation, the constraint of the region-of-interest limitation unit 535 can be based on a region-of-interest rate-lambda constraint.
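
Constraining the rate of change of a quantization parameter can be as simple as clamping it around the value used for the previous frame. The sketch below assumes a hypothetical maximum step of 3 per frame and a 0 to 51 QP range; neither value, nor the function name `limit_qp`, comes from this disclosure.

```python
def limit_qp(new_qp, prev_qp, max_delta=3, qp_min=0, qp_max=51):
    """Clamp new_qp to within max_delta of prev_qp and to the valid QP range."""
    lo = max(qp_min, prev_qp - max_delta)
    hi = min(qp_max, prev_qp + max_delta)
    return max(lo, min(hi, new_qp))


prev = {1: 24, 2: 30, 3: 36}      # per-level QPs used for the previous frame
proposed = {1: 30, 2: 29, 3: 20}  # per-level QPs from the quantization model
limited = {level: limit_qp(proposed[level], prev[level]) for level in prev}
```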

The video encoder 230 can receive the constrained quantization parameters (QP) and/or rate-distortion-optimization (RDO) parameters for the plurality of regions-of-interest, including regions-of-interest of three or more different interest levels. The video encoder 230 can be configured to generate a compressed bitstream for data frames of the input video stream based on the constrained quantization parameters (QP) and/or rate-distortion-optimization (RDO) parameters. Optionally, the video encoder 230 can be configured to generate the compressed bitstream based on the unconstrained quantization parameters (QP) and/or rate-distortion-optimization (RDO) parameters. The video encoder 230 can also be configured to generate feedback to the region-of-interest complexity estimation unit 520, the region-of-interest quality estimation unit 525, and the multi-level region-of-interest (ML-ROI) detector 210 after encoding a current frame. In one implementation, the video encoder 230 can provide residual encoder bit information to the region-of-interest complexity estimation unit 520. The video encoder 230 can also provide reconstructed video frame data to the region-of-interest (ROI) quality estimation unit 525 and the multi-level region-of-interest (ML-ROI) detector 210. The video encoder 230 can also provide as-encoded bitrate information to the multi-level region-of-interest (ML-ROI) detector 210. In the event of bit-starvation during the encoding of data frames, the video encoder 230 can skip the variable bitrate encoding of a next data frame. In other implementations, the bit allocation by the frame bit allocation unit 510 and/or the reinforcement learning based ML-ROI bit allocation unit 515 can be adjusted according to conventional bit-starvation algorithms.

The region-of-interest complexity estimation unit 520 can receive residual encoder bit information from the video encoder 230. The region-of-interest complexity estimation unit 520 can be configured to estimate the target complexity of the plurality of regions-of-interest, including regions-of-interest of three or more different interest levels, based on the residual encoder bits of the previous frames or the current frame. In one implementation, the residual encoder bits can be a mean absolute difference (MAD), a mean squared error (MSE), or the like. In one implementation, the ROI target bits for each region-of-interest can be calculated by the reinforcement learning based ML-ROI bit allocation unit 515 based on the ratio of the target complexity values of each region-of-interest and the target bits of the current frame.
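
The complexity-ratio allocation above can be illustrated with a small example. The function names, region labels and sample values below are hypothetical; the sketch simply computes a mean absolute difference and splits the frame target bits in proportion to each region's complexity value.

```python
def mean_absolute_difference(original, predicted):
    """MAD between two equally sized sequences of pixel values."""
    return sum(abs(a - b) for a, b in zip(original, predicted)) / len(original)


def allocate_by_complexity(frame_bits, complexities):
    """Split frame_bits across regions in proportion to their complexity."""
    total = sum(complexities.values())
    return {roi: frame_bits * c / total for roi, c in complexities.items()}


mad = mean_absolute_difference([10, 20, 30, 40], [12, 18, 33, 40])
bits = allocate_by_complexity(
    100_000, {"roi_a": 6.0, "roi_b": 3.0, "non_roi": 1.0}
)
```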

In one implementation, the lower bound of bits for the regions-of-interest can be calculated by the region-of-interest bit allocation unit 515 based on the complexity values generated by the region-of-interest complexity estimation unit 520. The frame target bits minus the lower bound of bits for the plurality of regions-of-interest are the remaining bits, which can be used to perform quality control of the plurality of regions-of-interest, reducing the chance that one or more regions-of-interest consume too many bits and cause bit-starvation during generation of the compressed bitstream for the next image data frame.
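
The split between a guaranteed lower bound and the remaining bits can be sketched as below. How the lower bound is derived from complexity is not specified in this disclosure; the linear rule (complexity times a hypothetical `min_bits_per_unit`) and the function name `split_budget` are assumptions.

```python
def split_budget(frame_bits, complexities, min_bits_per_unit):
    """Return (lower_bounds, remaining_bits) for one frame.

    Assumption: each region's lower bound is linear in its complexity.
    """
    lower = {roi: c * min_bits_per_unit for roi, c in complexities.items()}
    remaining = frame_bits - sum(lower.values())
    if remaining < 0:
        raise ValueError("frame target bits are below the sum of lower bounds")
    return lower, remaining


lower, remaining = split_budget(
    100_000, {"roi_a": 6.0, "roi_b": 3.0, "non_roi": 1.0}, 5_000
)
```

The `remaining` value is the pool that quality control can then distribute across levels of interest.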

The region-of-interest quality estimation unit 525 can receive requested quality information. The requested quality information can indicate a requested quality for the plurality of regions-of-interest, including regions-of-interest of three or more different interest levels. In one implementation, the requested quality information can be a residual factor between the quality for the different levels of interest. For example, the requested quality can be expressed as a 0 dB, 1 dB, 2 dB, etc. difference between quality for each of the levels of interest. The region-of-interest quality estimation unit 525 can be configured to estimate a target quality for the different levels of interest based on the requested quality information. The region-of-interest quality estimation unit 525 can also receive the input video stream and the reconstructed video from the video encoder 230. The region-of-interest quality estimation unit 525 can be further configured to estimate the target quality for the levels of interest based on the residual between the input video stream and the reconstructed video. The target quality for the levels of interest of the plurality of regions-of-interest can be output to the region-of-interest bit allocation unit 515, and the multi-level region-of-interest (ML-ROI) detector 210.
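
As an illustration of a requested quality expressed as a decibel difference between levels of interest, the sketch below derives a target for each level from a base value and a fixed step. The use of PSNR as the quality measure, the uniform step, the base value and the function name are assumptions made for the example.

```python
def target_quality_per_level(base_db, step_db, num_levels):
    """Level 1 gets base_db; each lower level is step_db lower than the last."""
    return {
        level: base_db - step_db * (level - 1)
        for level in range(1, num_levels + 1)
    }


targets = target_quality_per_level(base_db=42.0, step_db=2.0, num_levels=4)
```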

In one implementation, the region-of-interest quality estimation unit 525 can be configured to use the feedback information from the video encoder 230 to adjust a weighting of a target bit allocation for the different levels of interest of the plurality of regions-of-interest. In one implementation, if the quality of one or more regions-of-interest for a given level of interest is too low for the current (t) frame, more bits can be allocated to the regions-of-interest in the given level of interest in the next (t+1) frame to upgrade the quality. In one implementation, the quality of a video data frame can be some measure computed from the original frame and a reconstructed frame, such as the mean absolute difference (MAD), peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), video multimethod assessment fusion (VMAF), or the like. The quality can also be the difference of MAD, PSNR, SSIM, VMAF, or the like.
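
The feedback described above, measuring quality at frame t and shifting bit weights for frame t+1, can be sketched as follows. PSNR is computed with its standard definition; the proportional weight update, the `gain` value and the function names are assumptions for illustration and are not taken from this disclosure.

```python
import math


def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB between two pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)


def update_weights(weights, measured_db, target_db, gain=0.05):
    """Raise the weight of levels below target, lower those above, renormalize."""
    raw = {
        level: max(1e-6, w * (1 + gain * (target_db[level] - measured_db[level])))
        for level, w in weights.items()
    }
    total = sum(raw.values())
    return {level: value / total for level, value in raw.items()}


new_weights = update_weights(
    weights={1: 0.5, 2: 0.3, 3: 0.2},
    measured_db={1: 38.0, 2: 40.0, 3: 39.0},  # quality measured at frame t
    target_db={1: 42.0, 2: 40.0, 3: 38.0},
)
```

Here level 1 fell short of its target, so its share of the next frame's bits grows, while level 3 exceeded its target and gives some bits back.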

The region-of-interest detector 210 can receive the frame target bit allocation, the target quality, the as-encoded bitrate, the reconstructed video and the input video. The region-of-interest detector 210 can be configured to adjust the one or more regions-of-interest of one or more levels of interest based on the frame target bit allocation, the target quality, the as-encoded bitrate and the distortion (e.g., the residual between the input video and the reconstructed video). In one implementation, the size of the one or more regions-of-interest of a given level of interest can be decreased or increased, by adjusting the coordinates of the one or more regions-of-interest, based on the frame target bit allocation, the target quality, the as-encoded bitrate and the distortion. For example, the size of the one or more regions-of-interest of a given level of interest can be decreased if the frame target bit allocation and the as-encoded bitrate indicate that the estimated target quality cannot be satisfied. In another implementation, the number of determined regions-of-interest and/or the number of levels of interest can be decreased or increased based on the frame target bit allocation, the target quality and the as-encoded bitrate.
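
Adjusting a region's size by changing its coordinates can be sketched as scaling the bounding box about its center. The trigger used below (the as-encoded bits exceeding the frame target bits), the 0.9 shrink factor and the function names are assumptions; this disclosure does not specify the adjustment rule.

```python
def scale_box(box, factor, width, height):
    """Scale (x0, y0, x1, y1) about its center by factor, clipped to the frame."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w = (x1 - x0) * factor / 2
    half_h = (y1 - y0) * factor / 2
    return (
        max(0, round(cx - half_w)),
        max(0, round(cy - half_h)),
        min(width, round(cx + half_w)),
        min(height, round(cy + half_h)),
    )


def adjust_roi(box, frame_target_bits, encoded_bits, width, height, shrink=0.9):
    # Assumption: overshooting the frame target indicates that the target
    # quality cannot be met for a region of the current size.
    if encoded_bits > frame_target_bits:
        return scale_box(box, shrink, width, height)
    return box


new_box = adjust_roi((100, 100, 300, 200), 80_000, 95_000, 640, 360)
same_box = adjust_roi((100, 100, 300, 200), 80_000, 70_000, 640, 360)
```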

Referring now to FIG. 6, an exemplary processing system including a video processing unit, in accordance with aspects of the present technology, is shown. The exemplary processing system 600 can include one or more processors 605 and one or more video encoders 230. The one or more processors 605 can include one or more communication interfaces, such as a peripheral component interconnect express (PCIe4) interface 610 and an inter-integrated circuit (I²C) interface 615, an on-chip circuit tester, such as a joint test action group (JTAG) engine 620, a direct memory access engine 625, a command processor (CP) 630, and one or more cores 635-650. The one or more cores 635-650 can be coupled in a direction ring bus configuration. The one or more cores 635-650 can execute one or more sets of computing device executable instructions to perform one or more functions including, but not limited to, a multi-level region-of-interest (ML-ROI) detector 210 and a rate control 220. In other implementations, the video encoder 230 can also be implemented in one or more sets of computing device executable instructions executing in one or more cores 635-650 of the processor 605. The one or more functions can be performed on individual cores 635-650, can be distributed across a plurality of cores 635-650, can be performed along with one or more other functions on one or more cores, and/or the like.

The one or more processors 605 can be a central processing unit (CPU), a graphics processing unit (GPU), a neural processing unit (NPU), a vector processor, a memory processing unit, or the like, or combinations thereof. In one implementation, the one or more processors 605 can be implemented in a computing device such as, but not limited to, a cloud computing platform, an edge computing device, a server, a workstation, a personal computer (PC), or the like.

Referring now to FIG. 7, a block diagram of an exemplary processing core, in accordance with aspects of the present technology, is shown. The exemplary processing core 700 can include a tensor engine (TE) 710, a pooling engine (PE) 715, a memory copy engine (ME) 720, a sequencer (SEQ) 725, an instruction buffer (IB) 730, a local memory (LM) 735, and a constant buffer (CB) 740. The local memory 735 can be pre-installed with model weights and can store in-use activations on-the-fly. The constant buffer 740 can store constants for batch normalization, quantization and the like. The tensor engine 710 can be utilized to accelerate fused convolution and/or matrix multiplication. The pooling engine 715 can support pooling, interpolation, region-of-interest and similar operations. The memory copy engine 720 can be configured for inter- and/or intra-core data copy, matrix transposition and the like. The tensor engine 710, pooling engine 715 and memory copy engine 720 can run in parallel. The sequencer 725 can orchestrate the operation of the tensor engine 710, the pooling engine 715, the memory copy engine 720, the local memory 735, and the constant buffer 740 according to instructions from the instruction buffer 730. The exemplary processing core 700 can provide efficient computation for video coding under the control of operation-fused coarse-grained instructions for functions such as multi-level region-of-interest (ML-ROI) detection, bitrate control, variable bitrate video encoding and/or the like. A detailed description of the exemplary processing core 700 is not necessary to an understanding of aspects of the present technology, and therefore will not be described further herein.

Aspects of the present technology can advantageously enable quality enhancement for high priority regions and/or bit savings from low priority regions. Aspects of the present technology can advantageously provide targeted quality control for different regions when allocating precise bits to the regions-of-interest. Aspects of the present technology can advantageously utilize a reinforcement learning (RL) framework for determining quantization parameters and/or bit allocation for the regions-of-interest of a plurality of different interest levels.

The foregoing descriptions of specific embodiments of the present technology have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the present technology to the precise forms disclosed, and obviously many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the present technology and its practical application, to thereby enable others skilled in the art to best utilize the present technology and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.

Claims

A video processing system comprising:

a multi-level region-of-interest detector (210) configured to determine multi-level regions-of-interest of a video stream;

a rate controller (220) configured to determine encoding parameters for the multi-level regions-of-interest; and

a video encoder (230) configured to encode the video stream using variable bitrate encoding based on the determined multi-level regions-of-interest and the encoding parameters of the rate controller (220) for the multi-level regions-of-interest to generate a compressed bitstream.
The video processing system of Claim 1, wherein the rate controller (220) is further configured to control a quality of regions-of-interest by allocating precise bitrates to each level of interest of the determined multi-level regions-of-interest including enhancing a quality of regions-of-interest of a first level of interest and degrading a quality of regions-of-interest of a second level of interest, wherein the first level of interest has a higher priority than the second level of interest.
The video processing system of Claim 1, wherein the rate controller (220) is further configured to determine respective quantization parameters of each region-of-interest of the determined multi-level regions-of-interest based on coordinates of the regions-of-interest of a current frame of the video stream from the multi-level region-of-interest detector (210), residual encoder bits of each region-of-interest and as encoded bitrate of each region-of-interest of a previous frame of the video stream from the video encoder (230), a requested quality and a requested bitrate.
The video processing system of Claim 1, wherein the multi-level region-of-interest detector (210) is further configured to adjust coordinates of each region-of-interest of the determined multi-level regions-of-interest based on feedback from the rate controller (220) and the video encoder (230).
The video processing system of Claim 1, wherein the multi-level region-of-interest detector (210) is further configured to determine regions-of-interest of three or more different levels of interest.
The video processing system of Claim 1, wherein the rate controller (220) is further configured to determine encoding parameters of the multi-level regions-of-interest as a function of a distortion parameter, a rate constraint parameter, and a target quality parameter for each region-of-interest of the multi-level regions-of-interest.
The video processing system of Claim 1, wherein the rate controller (220) is further configured to control one or more parameters of region-of-interest detection by the multi-level region-of-interest detector (210) and control one or more parameters of bitrate encoding by the video encoder (230) based on a requested quality.
The video processing system of Claim 1, wherein the rate controller (220) is further configured to control one or more parameters of bitrate encoding based on feedback information received from the video encoder (230).
The video processing system of Claim 8, wherein the rate controller (220) is further configured to control one or more parameters of bitrate encoding based on a residual between a frame of the video stream and a reconstructed data frame based on the compressed bitstream.
The video processing system of Claim 8, wherein the feedback information includes residual encoder bit information received from the video encoder (230).
The video processing system of Claim 1, wherein the rate controller (220) is further configured to generate one or more parameters of bitrate encoding including a region-of-interest quantization parameter.
The video processing system of Claim 11, wherein the rate controller (220) is further configured to constrain a rate of change of the region-of-interest quantization parameter.
The video processing system of Claim 1, wherein the rate controller (220) includes:

a region-of-interest bit allocation unit (515) configured to generate a target bit allocation for one or more regions-of-interest based on a frame level bit allocation, coordinates of the one or more regions-of-interest, a target complexity of the one or more regions-of-interest, and a target quality for the one or more regions-of-interest; and

a region-of-interest quantization model unit (530) configured to generate a region-of-interest quantization parameter based on the target bit allocation.
The video processing system of Claim 13, wherein the rate controller (220) further includes:

a region-of-interest quality estimation unit (525) configured to determine a target quality based upon a requested quality and a residual between a frame of the video stream and a reconstructed data frame from the compressed bitstream; and

the region-of-interest bit allocation unit (515) further configured to generate the target bit allocation for the one or more regions-of-interest based on the target quality.
The video processing system of Claim 14, wherein the multi-level region-of-interest detector (210) is configured to determine one or more regions-of-interest in a frame of the video stream based on frame target bit information and the target quality of the one or more regions-of-interest from the rate controller (220).
The video processing system of Claim 15, wherein the multi-level region-of-interest detector (210) is configured to determine one or more regions-of-interest in a frame of the video stream further based on as encoded bitrate information from the video encoder (230).
The video processing system of Claim 13, wherein the rate controller (220) further includes:

a region-of-interest complexity estimation unit (520) configured to estimate a target complexity based on residual encoder bit information from the video encoder; and

the region-of-interest bit allocation unit (515) further configured to allocate bits for the one or more regions-of-interest based on the residual encoder bit information.
The video processing system of Claim 13, wherein the rate controller (220) further includes a region-of-interest quantization limitation unit (535) configured to constrain a rate of change of the region-of-interest quantization parameter.
The video processing system of Claim 13, wherein the rate controller (220) further includes:

a region-of-interest quantization model unit (530) configured to generate a region-of-interest quantization parameter based on the bit allocation and a quantization model.
The video processing system of Claim 13, wherein the video encoder (230) skips encoding a next video stream when bit-starving occurs.
A computing system comprising:

one or more processors (605);

one or more non-transitory computing device readable storage media storing computing executable instructions that when executed by the one or more processors (605) perform a method comprising:

determining one or more regions-of-interest in each of a plurality of different levels of interest in each frame or sets of frames of an input video stream (320); and

determining encoding parameters for each of the determined regions-of-interest in each frame or sets of frames based on a corresponding level of interest of each of the regions-of-interest (330); and

a video encoder (230) configured to generate a compressed bitstream of the input video stream based on the determined encoding parameters of each of the regions-of-interest.
The computing system of Claim 21, wherein determining encoding parameters for each of the determined regions-of-interest in each frame or sets of frames (330) includes analyzing the one or more regions-of-interest in each of the plurality of different levels of interest using one or more artificial intelligence models to determine the regions-of-interest and corresponding levels of interest.
The computing system of Claim 21, wherein determining encoding parameters of each of the regions-of-interest (330) includes generating quantization parameters and target bits for use by the video encoder based on a region complexity, target quality, region quality, frame bit budget region ratio parameter and distortion of a frame or set of frames that reduces long-term distortion in the compressed bitstream.
A method of video processing comprising:

receiving a video stream (310);

determining a plurality of regions-of-interest, including regions-of-interest of three or more different levels of interest, in each frame or sets of frames of the received video stream (320);

determining encoding parameters for each of the regions-of-interest in each frame or sets of frames based on a corresponding level of interest of each of the regions-of-interest (330);

encoding the received video stream based on the determined encoding parameters of each of the regions-of-interest to generate a compressed bitstream (340); and

outputting the compressed bitstream (350).
The method according to Claim 24, wherein outputting the compressed bitstream (350) comprises streaming the compressed bitstream.
The method according to Claim 24, wherein outputting the compressed bitstream (350) comprises storing the compressed bitstream.
The method according to Claim 24, wherein determining the encoding parameters for each of the regions-of-interest in each frame or sets of frames (330) includes:

determining one or more quantization parameters or one or more rate-of-distortion parameters for the determined regions-of-interest.
The method according to Claim 27, further comprising:

determining the one or more quantization parameters or one or more rate-of-distortion parameters based on information including one or more of coordinates of the one or more regions-of-interest, a target complexity, residual encoder bit data, a requested quality, a residual between a current video data frame and a reconstructed video data frame, a target quality, and a requested bitrate.