WO2022061728A1

WO2022061728A1 - System and method for region of interest quality controllable video coding

Info

Publication number: WO2022061728A1
Application number: PCT/CN2020/117792
Authority: WO
Inventors: Guanlin WU; Yen-Kuang Chen; Minghai Qin; Zhenzhen Wang; Haoran LI; Yuanwei Fang
Original assignee: Alibaba Group Holding Limited
Priority date: 2020-09-25
Filing date: 2020-09-25
Publication date: 2022-03-31

Abstract

The present disclosure relates to a system and method for video coding. In some embodiments, an exemplary video coding system includes: a region of interest (ROI) detector having circuitry configured to determine a plurality of regions in a frame of a video; a rate controller communicatively coupled with the ROI detector and having circuitry configured to perform bit allocation for the plurality of regions based on demanded quality information for the plurality of regions and generate region bit allocation information; and a video encoder communicatively coupled with the ROI detector and the rate controller and having circuitry configured to encode the frame based on the region bit allocation information.

Description

SYSTEM AND METHOD FOR REGION OF INTEREST QUALITY CONTROLLABLE VIDEO CODING

BACKGROUND

A video is a set of static pictures (or “frames” ) capturing the visual information. To reduce the storage memory and the transmission bandwidth, a video can be encoded before storage or transmission and decoded before display. In a picture or a frame, there may be a region of interest (ROI) and a non region of interest (non-ROI) . Generally, the ROI may contain contents that should be enhanced and thus need to be encoded with more bit budget than the non-ROI. Machine learning (ML) or deep learning (DL) can utilize neural networks (NN) to assist video coding. But it is still challenging to code (e.g., encode or decode) the ROI and non-ROI differently and controllably enhance the ROI.

SUMMARY

In some embodiments, an exemplary video coding system can include: a region of interest (ROI) detector having circuitry configured to determine a plurality of regions in a frame of a video; a rate controller communicatively coupled with the ROI detector and having circuitry configured to perform bit allocation for the plurality of regions based on demanded quality information for the plurality of regions and generate region bit allocation information; and a video encoder communicatively coupled with the ROI detector and the rate controller and having circuitry configured to encode the frame based on the region bit allocation information.

In some embodiments, an exemplary method for video coding can include: receiving a video comprising a plurality of frames; determining a plurality of regions in a frame of the video; performing bit allocation for the plurality of regions based on demanded quality information for the plurality of regions to generate region bit allocation information; and encoding the frame based on the region bit allocation information to generate an encoded bit stream.

In some embodiments, an exemplary video coding apparatus includes at least one memory for storing instructions and at least one processor. At least one processor can be configured to execute the instructions to cause the apparatus to perform: receiving a video comprising a plurality of frames; determining a plurality of regions in a frame of the video; performing bit allocation for the plurality of regions based on demanded quality information for the plurality of regions to generate region bit allocation information; and encoding the frame based on the region bit allocation information to generate an encoded bit stream.

In some embodiments, an exemplary non-transitory computer readable storage medium storing a set of instructions that are executable by one or more processing devices to cause a computer to perform: receiving a video comprising a plurality of frames; determining a plurality of regions in a frame of the video; performing bit allocation for the plurality of regions based on demanded quality information for the plurality of regions to generate region bit allocation information; and encoding the frame based on the region bit allocation information to generate an encoded bit stream.

Additional features and advantages of the present disclosure will be set forth in part in the following detailed description, and in part will be obvious from the description, or may be learned by practice of the present disclosure. The features and advantages of the present disclosure will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims.

It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and are not restrictive of the disclosed embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which comprise a part of this specification, illustrate several embodiments and, together with the description, serve to explain the principles and features of the disclosed embodiments. In the drawings:

FIG. 1 is a schematic representation of a neural network, according to some embodiments of the present disclosure.

FIG. 2A illustrates an exemplary neural network accelerator architecture, according to some embodiments of the present disclosure.

FIG. 2B illustrates an exemplary neural network accelerator core architecture, according to some embodiments of the present disclosure.

FIG. 2C illustrates a schematic diagram of an exemplary cloud system incorporating a neural network accelerator, according to some embodiments of the present disclosure.

FIG. 3 illustrates an exemplary operation unit configuration, according to some embodiments of the present disclosure.

FIG. 4 is a schematic representation of an exemplary video coding system, according to some embodiments of the present disclosure.

FIG. 5 is a schematic representation of an exemplary rate controller in a video coding system, according to some embodiments of the present disclosure.

FIG. 6A is a schematic representation of an exemplary frame with multi-level regions, according to some embodiments of the present disclosure.

FIG. 6B is a schematic representation of an exemplary original frame, according to some embodiments of the present disclosure.

FIG. 6C is a schematic representation of the frame of FIG. 6B with an ROI, according to some embodiments of the present disclosure.

FIG. 6D is a schematic representation of an exemplary frame with an adjusted ROI, according to some embodiments of the present disclosure.

FIG. 7 is a schematic diagram of an exemplary training process of a bit allocation model, according to some embodiments of the present disclosure.

FIG. 8 is a schematic diagram of exemplary convergence results, according to some embodiments of the present disclosure.

FIG. 9 is a flowchart of an exemplary video coding method, according to some embodiments of the present disclosure.

DETAILED DESCRIPTION

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the invention. Instead, they are merely examples of apparatuses, systems and methods consistent with aspects related to the invention as recited in the appended claims.

During video coding, the ROI may need to be differentiated from the non-ROI. For example, the ROI may consume more bit than the non-ROI during encoding and thus have higher quality than the non-ROI. However, it is still difficult to allocate precise bits for the ROI and non-ROI.

In some embodiments of the present disclosure, a system or a method can perform improved bit allocation for multi-level regions (e.g., ROI and non-ROI) in a picture or frame of a video. Some embodiments can train a neural network to facilitate the bit allocation. The trained neural network can perform an accurate bit allocation and quantization parameters decision for different regions, and increase the quality converge speed.

FIG. 1 is a schematic representation of a neural network (NN) 100. As depicted in FIG. 1, neural network 100 may include an input layer 120 that accepts inputs, e.g., input 110-1, ..., input 110-m. Inputs may include an image, text, or any other structure or unstructured data for processing by neural network 100. In some embodiments, neural network 100 may accept a plurality of inputs simultaneously. For example, in FIG. 1, neural network 100 may accept up to m inputs simultaneously. Additionally or alternatively, input layer 120 may accept up to m inputs in rapid succession, e.g., such that input 110-1 is accepted by input layer 120 in one cycle, a second input is accepted by input layer 120 in a second cycle in which input layer 120 pushes data from input 110-1 to a first hidden layer, and so on. Any number of inputs can be used in simultaneous input, rapid succession input, or the like.

Input layer 120 may comprise one or more nodes, e.g., node 120-1, node 120-2, ..., node 120-a. Each node may apply an activation function to corresponding input (e.g., one or more of input 110-1, ..., input 110-m) and weight the output from the activation function by a particular weight associated with the node. An activation function may comprise a Heaviside step function, a Gaussian function, a multiquadratic function, an inverse multi-quadratic function, a sigmoidal function, or the like. A weight may comprise a positive value between 0.0 and 1.0 or any other numerical value configured to allow some nodes in a layer to have corresponding output scaled more or less than output corresponding to other nodes in the layer.

As further depicted in FIG. 1, neural network 100 may include one or more hidden layers, e.g., hidden layer 130-1, ..., hidden layer 130-n. Each hidden layer may comprise one or more nodes. For example, in FIG. 1, hidden layer 130-1 comprises node 130-1-1, node 130-1-2, node 130-1-3, ..., node 130-1-b, and hidden layer 130-n comprises node 130-n-1, node 130-n-2, node 130-n-3, ..., node 130-n-c. Similar to nodes of input layer 120, nodes of the hidden layers may apply activation functions to output from connected nodes of the previous layer and weight the output from the activation functions by particular weights associated with the nodes.

As further depicted in FIG. 1, neural network 100 may include an output layer 140 that finalizes outputs, e.g., output 150-1, output 150-2, ..., output 150-d. Output layer 140 may comprise one or more nodes, e.g., node 140-1, node 140-2, ..., node 140-d. Similar to nodes of input layer 120 and of the hidden layers, nodes of output layer 140 may apply activation functions to output from connected nodes of the previous layer and weight the output from the activation functions by particular weights associated with the nodes.

Although depicted as fully connected in FIG. 1, the layers of neural network 100 may use any connection scheme. For example, one or more layers (e.g., input layer 120, hidden layer 130-1, ..., hidden layer 130-n, output layer 140, or the like) may be connected using a convolutional scheme, a sparsely connected scheme, or the like. Such embodiments may use fewer connections between one layer and a previous layer than depicted in FIG. 1.

Moreover, although depicted as a feedforward network in FIG. 1, neural network 100 may additionally or alternatively use backpropagation (e.g., by using long short-term memory nodes or the like) . Accordingly, although neural network 100 is depicted similar to a CNN, neural network 100 may comprise a recurrent neural network (RNN) or any other neural network.

FIG. 2A illustrates an exemplary neural network accelerator architecture 200, according to some embodiments of the present disclosure. In the context of this disclosure, a neural network accelerator may also be referred to as a machine learning accelerator or deep learning accelerator. In some embodiments, accelerator architecture 200 may be referred to as a neural network processing unit (NPU) architecture 200. As shown in FIG. 2A, accelerator architecture 200 can include a plurality of cores 202, a command processor 204, a direct memory access (DMA) unit 208, a Joint Test Action Group (JTAG) /Test Access Port (TAP) controller 210, a peripheral interface 212, a bus 214, and the like.

It is appreciated that cores 202 can perform algorithmic operations based on communicated data. Cores 202 can include one or more processing elements that may include single instruction, multiple data (SIMD) architecture including one or more processing units configured to perform one or more operations (e.g., multiplication, addition, multiply-accumulate, etc. ) based on commands received from command processor 204. To perform the operation on the communicated data packets, cores 202 can include one or more processing elements for processing information in the data packets. Each processing element may comprise any number of processing units. According to some embodiments of the present disclosure, accelerator architecture 200 may include a plurality of cores 202, e.g., four cores. In some embodiments, the plurality of cores 202 can be communicatively coupled with each other. For example, the plurality of cores 202 can be connected with a single directional ring bus, which supports efficient pipelining for large neural network models. The architecture of cores 202 will be explained in detail with respect to FIG. 2B.

Command processor 204 can interact with a host unit 220 and pass pertinent commands and data to corresponding core 202. In some embodiments, command processor 204 can interact with host unit under the supervision of kernel mode driver (KMD) . In some embodiments, command processor 204 can modify the pertinent commands to each core 202, so that cores 202 can work in parallel as much as possible. The modified commands can be stored in an instruction buffer (not shown) . In some embodiments, command processor 204 can be configured to coordinate one or more cores 202 for parallel execution.

DMA unit 208 can assist with transferring data between host memory 221 and accelerator architecture 200. For example, DMA unit 208 can assist with loading data or instructions from host memory 221 into local memory of cores 202. DMA unit 208 can also assist with transferring data between multiple accelerators. DMA unit 208 can allow off-chip devices to access both on-chip and off-chip memory without causing a host CPU interrupt. In addition, DMA unit 208 can assist with transferring data between components of accelerator architecture 200. For example, DMA unit 208 can assist with transferring data between multiple cores 202 or within each core. Thus, DMA unit 208 can also generate memory addresses and initiate memory read or write cycles. DMA unit 208 also can contain several hardware registers that can be written and read by the one or more processors, including a memory address register, a byte-count register, one or more control registers, and other types of registers. These registers can specify some combination of the source, the destination, the direction of the transfer (reading from the input/output (I/O) device or writing to the I/O device) , the size of the transfer unit, or the number of bytes to transfer in one burst. It is appreciated that accelerator architecture 200 can include a second DMA unit, which can be used to transfer data between other accelerator architectures to allow multiple accelerator architectures to communicate directly without involving the host CPU.

JTAG/TAP controller 210 can specify a dedicated debug port implementing a serial communications interface (e.g., a JTAG interface) for low-overhead access to the accelerator without requiring direct external access to the system address and data buses. JTAG/TAP controller 210 can also have on-chip test access interface (e.g., a TAP interface) that implements a protocol to access a set of test registers that present chip logic levels and device capabilities of various parts.

Peripheral interface 212 (such as a PCIe interface) , if present, serves as an (and typically the) inter-chip bus, providing communication between the accelerator and other devices.

Bus 214 (such as a I2C bus) includes both intra-chip bus and inter-chip buses. The intra-chip bus connects all internal components to one another as called for by the system architecture. While not all components are connected to every other component, all components do have some connection to other components they need to communicate with. The inter-chip bus connects the accelerator with other devices, such as the off-chip memory or peripherals. For example, bus 214 can provide high speed communication across cores and can also connect cores 202 with other units, such as the off-chip memory or peripherals. Typically, if there is a peripheral interface 212 (e.g., the inter-chip bus) , bus 214 is solely concerned with intra-chip buses, though in some implementations it could still be concerned with specialized inter-bus communications.

Accelerator architecture 200 can also communicate with a host unit 220. Host unit 220 can be one or more processing unit (e.g., an X86 central processing unit) . As shown in FIG. 2A, host unit 220 may be associated with host memory 221. In some embodiments, host memory 221 may be an integral memory or an external memory associated with host unit 220. In some embodiments, host memory 221 may comprise a host disk, which is an external memory configured to provide additional memory for host unit 220. Host memory 221 can be a double data rate synchronous dynamic random-access memory (e.g., DDR SDRAM) or the like. Host memory 221 can be configured to store a large amount of data with slower access speed, compared to the on-chip memory integrated within accelerator chip, acting as a higher-level cache. The data stored in host memory 221 may be transferred to accelerator architecture 200 to be used for executing neural network models.

In some embodiments, a host system having host unit 220 and host memory 221 can comprise a compiler (not shown) . The compiler is a program or computer software that transforms computer codes written in one programming language into instructions for accelerator architecture 200 to create an executable program. In machine learning applications, a compiler can perform a variety of operations, for example, pre-processing, lexical analysis, parsing, semantic analysis, conversion of input programs to an intermediate representation, initialization of a neural network, code optimization, and code generation, or combinations thereof. For example, the compiler can compile a neural network to generate static parameters, e.g., connections among neurons and weights of the neurons.

In some embodiments, host system including the compiler may push one or more commands to accelerator architecture 200. As discussed above, these commands can be further processed by command processor 204 of accelerator architecture 200, temporarily stored in an instruction buffer of accelerator architecture 200, and distributed to corresponding one or more cores (e.g., cores 202 in FIG. 2A) or processing elements. Some of the commands may instruct a DMA unit (e.g., DMA unit 208 of FIG. 2A) to load instructions and data from host memory (e.g., host memory 221 of FIG. 2A) into accelerator architecture 200. The loaded instructions may then be distributed to each core (e.g., core 202 of FIG. 2A) assigned with the corresponding task, and the one or more cores may process these instructions.

It is appreciated that the first few instructions received by the cores 202 may instruct the cores 202 to load/store data from host memory 221 into one or more local memories of the cores (e.g., local memory 2032 of FIG. 2B) . Each core 202 may then initiate the instruction pipeline, which involves fetching the instruction (e.g., via a sequencer) from the instruction buffer, decoding the instruction (e.g., via a DMA unit 208 of FIG. 2A) , generating local memory addresses (e.g., corresponding to an operand) , reading the source data, executing or loading/storing operations, and then writing back results.

According to some embodiments, accelerator architecture 200 can further include a global memory (not shown) having memory blocks (e.g., 4 blocks of 8GB second generation of high bandwidth memory (HBM2) ) to serve as main memory. In some embodiments, the global memory can store instructions and data from host memory 221 via DMA unit 208. The instructions can then be distributed to an instruction buffer of each core assigned with the corresponding task, and the core can process these instructions accordingly.

In some embodiments, accelerator architecture 200 can further include memory controller (not shown) configured to manage reading and writing of data to and from a specific memory block (e.g., HBM2) within global memory. For example, memory controller can manage read/write data coming from core of another accelerator (e.g., from DMA unit 208 or a DMA unit corresponding to the another accelerator) or from core 202 (e.g., from a local memory in core 202) . It is appreciated that more than one memory controller can be provided in accelerator architecture 200. For example, there can be one memory controller for each memory block (e.g., HBM2) within global memory.

Memory controller can generate memory addresses and initiate memory read or write cycles. Memory controller can contain several hardware registers that can be written and read by the one or more processors. The registers can include a memory address register, a byte-count register, one or more control registers, and other types of registers. These registers can specify some combination of the source, the destination, the direction of the transfer (reading from the input/output (I/O) device or writing to the I/O device) , the size of the transfer unit, the number of bytes to transfer in one burst, or other typical features of memory controllers.

While accelerator architecture 200 of FIG. 2A can be used for convolutional neural networks (CNNs) in some embodiments of the present disclosure, it is appreciated that accelerator architecture 200 of FIG. 2A can be utilized in various neural networks, such as deep neural networks (DNNs) , recurrent neural networks (RNNs) , or the like. In addition, some embodiments can be configured for various processing architectures, such as neural network processing units (NPUs) , graphics processing units (GPUs) , tensor processing units (TPUs) , any other types of accelerators, or the like.

FIG. 2B illustrates an exemplary neural network accelerator core architecture, according to some embodiments of the present disclosure. As shown in FIG. 2B, core 202 can include one or more operation units such as first and

second operation units

2020 and 2022, a memory engine 2024, a sequencer 2026, an instruction buffer 2028, a constant buffer 2030, a local memory 2032, or the like.

One or more operation units can include first operation unit 2020 and second operation unit 2022. First operation unit 2020 can be configured to perform operations on received data (e.g., matrices) . In some embodiments, first operation unit 2020 can include one or more processing units configured to perform one or more operations (e.g., multiplication, addition, multiply-accumulate, element-wise operation, etc. ) . In some embodiments, first operation unit 2020 is configured to accelerate execution of convolution operations or matrix multiplication operations.

Second operation unit 2022 can be configured to perform a pooling operation, an interpolation operation, a region-of-interest (ROI) operation, and the like. In some embodiments, second operation unit 2022 can include an interpolation unit, a pooling data path, and the like.

Memory engine 2024 can be configured to perform a data copy within a corresponding core 202 or between two cores. DMA unit 208 can assist with copying data within a corresponding core or between two cores. For example, DMA unit 208 can support memory engine 2024 to perform data copy from a local memory (e.g., local memory 2032 of FIG. 2B) into a corresponding operation unit. Memory engine 2024 can also be configured to perform matrix transposition to make the matrix suitable to be used in the operation unit.

Sequencer 2026 can be coupled with instruction buffer 2028 and configured to retrieve commands and distribute the commands to components of core 202. For example, sequencer 2026 can distribute convolution commands or multiplication commands to first operation unit 2020, distribute pooling commands to second operation unit 2022, or distribute data copy commands to memory engine 2024. Sequencer 2026 can also be configured to monitor execution of a neural network task and parallelize sub-tasks of the neural network task to improve efficiency of the execution. In some embodiments, first operation unit 2020, second operation unit 2022, and memory engine 2024 can run in parallel under control of sequencer 2026 according to instructions stored in instruction buffer 2028.

Instruction buffer 2028 can be configured to store instructions belonging to the corresponding core 202. In some embodiments, instruction buffer 2028 is coupled with sequencer 2026 and provides instructions to the sequencer 2026. In some embodiments, instructions stored in instruction buffer 2028 can be transferred or modified by command processor 204.

Constant buffer 2030 can be configured to store constant values. In some embodiments, constant values stored in constant buffer 2030 can be used by operation units such as first operation unit 2020 or second operation unit 2022 for batch normalization, quantization, de-quantization, or the like.

Local memory 2032 can provide storage space with fast read/write speed. To reduce possible interaction with a global memory, storage space of local memory 2032 can be implemented with large capacity. With the massive storage space, most of data access can be performed within core 202 with reduced latency caused by data access. In some embodiments, to minimize data loading latency and energy consumption, static random access memory (SRAM) integrated on chip can be used as local memory 2032. In some embodiments, local memory 2032 can have a capacity of 192MB or above. According to some embodiments of the present disclosure, local memory 2032 be evenly distributed on chip to relieve dense wiring and heating issues.

FIG. 2C illustrates a schematic diagram of an exemplary cloud system incorporating a neural network accelerator 200, according to some embodiments of the present disclosure. As shown in FIG. 2C, cloud system 230 can provide a cloud service with artificial intelligence (AI) capabilities and can include a plurality of computing servers (e.g., 232 and 234) . In some embodiments, a computing server 232 can, for example, incorporate a neural network accelerator architecture 200 of FIG. 2A. Neural network accelerator architecture 200 is shown in FIG. 2C in a simplified manner for simplicity and clarity.

With the assistance of neural network accelerator architecture 200, cloud system 230 can provide the extended AI capabilities of image recognition, facial recognition, translations, 3D modeling, and the like. It is appreciated that neural network accelerator architecture 200 can be deployed to computing devices in other forms. For example, neural network accelerator architecture 200 can also be integrated in a computing device, such as a smart phone, a tablet, and a wearable device.

Moreover, while a neural network accelerator architecture is shown in FIGS. 2A-2B, it is appreciated that any accelerator that provides the ability to perform parallel computation can be used.

FIG. 3 illustrates an exemplary operation unit configuration 300, according to some embodiments of the present disclosure. According to some embodiments of the present disclosure, operation unit can be first operation unit (e.g., first operation unit 2020 in FIG. 2B) . Operation unit 2020 may include a first buffer 310, a second buffer 320, and a processing array 330.

First buffer 310 may be configured to store input data. In some embodiments, data stored in first buffer 310 can be input data to be used in processing array 330 for execution. In some embodiments, the input data can be fetched from local memory (e.g., local memory 2032 in FIG. 2B) . First buffer 310 may be configured to support reuse or share of data to be used in processing array 330. In some embodiments, input data stored in first buffer 310 may be activation data for a convolution operation.

Second buffer 320 may be configured to store weight data. In some embodiments, weight data stored in second buffer 320 can be used in processing array 330 for execution. In some embodiments, the weight data stored in second buffer 320 can be fetched from local memory (e.g., local memory 2032 in FIG. 1B) . In some embodiments, weight data stored in second buffer 320 may be filter data for a convolution operation.

According to some embodiments of the present disclosure, weight data stored in second buffer 320 can be compressed data. For example, weight data can be pruned data to save memory space on chip. In some embodiments, operation unit 2020 can further include a sparsity engine 390. Sparsity engine 390 can be configured to unzip compressed weight data to be used in processing array 330.

Processing array 330 may have a plurality of layers (e.g., K layers) . According to some embodiments of the present disclosure, each layer of processing array 330 may include a plurality of processing strings, which may perform computations in parallel. For example, first processing string included in the first layer of processing array 330 can comprise a first multiplier (e.g., dot product) 340_1 and a first accumulator (ACC) 350_1 and second processing string can comprise a second multiplier 340_2 and a second accumulator 350_2. Similarly, i-th processing string in the first layer can comprise an i-th multiplier 340_i and an i-th accumulator 350_i.

In some embodiments, processing array 330 can perform computations under SIMD control. For example, when performing a convolution operation, each layer of processing array 330 can execute same instructions with different data.

According to some embodiments of the present disclosure, processing array 330 shown in FIG. 3 can be included in a core (e.g., core 202 in FIG. 2B) . When a number of processing strings (e.g., i number of processing strings) included in one layer of processing array 330 is smaller than a number of work items (e.g., B number of work items) , i number of work items can be executed by processing array 330 and subsequently the rest of work items (B-i number of work items) can be executed by the processing array 330 in some embodiments. In some other embodiments, i number of work items can be executed by processing array 330 and the rest of work items can be executed by another processing array 330 in another core.

According to some embodiments of the present disclosure, processing array 330 may further include an element-wise operation processor (OP) 360. In some embodiments, element-wise operation processor 360 can be positioned at the end of processing strings. In some embodiments, processing strings in each layer of processing array 330 can share element-wise operation processor 360. For example, i number of processing strings in the first layer of processing array 330 can share element-wise operation processor 360. In some embodiments, element-wise operation processor 360 in the first layer of processing array 330 can perform its element-wise operation on each of output values, from accumulators 350_1 to 350_i, sequentially. Similarly, element-wise operation processor 360 in the Kth layer of processing array 330 can perform its element-wise operation on each of output values, from accumulators 350_1 to 350_i, sequentially. In some embodiments, element-wise operation processor 360 can be configured to perform a plurality of element-wise operations. In some embodiments, element-wise operation performed by the element-wise operation processor 360 may include an activation function such as ReLU function, ReLU6 function, Leaky ReLU function, Sigmoid function, Tanh function, or the like.

In some embodiments, multiplier 340 or accumulator 350 may be configured to perform its operation on different data type from what the element-wise operation processor 360 performs its operations on. For example, multiplier 340 or accumulator 350 can be configured to perform its operations on integer type data such as Int 8, Int 16, and the like and element-wise operation processor 360 can perform its operations on floating point type data such as FP24, and the like. Therefore, according to some embodiments of the present disclosure, processing array 330 may further include de-quantizer 370 and quantizer 380 with element-wise operation processor 360 positioned therebetween. In some embodiments, batch normalization operations can be merged to de-quantizer 370 because both de-quantizer 370 and batch normalization operations can be performed by multiplication operations and addition operations with constants, which can be provided from constant buffer 2030. In some embodiments, batch normalization operations and de-quantization operations can be merged into one operation by compiler. As shown in FIG. 3, constant buffer 2030 can provide constants to de-quantizer 370 for de-quantization or batch normalization.

FIG. 4 is a schematic representation of an exemplary video coding system 400, according to some embodiments of the present disclosure. It is appreciated that video coding system 400 can be implemented, at least in part, by neural network accelerator architecture 200 of FIGs. 2A and 2C, core 202 of FIGs. 2A-2B, operation unit configuration 300 of FIG. 3, or cloud system 230 of FIG. 2C.

As shown in FIG. 4, video coding system 400 can include an ROI detector 402, a rate controller 403, a video encoder 404, a memory device 405, or the like. ROI detector 402 can receive a video 401 including a plurality of frames. For example, ROI detector 402 can receive video 401 from outside or read video 401 from memory device 405 that is communicatively coupled with ROI detector 402. ROI detector 402 can have circuitry configured to determine a plurality of regions (e.g., one or more ROIs, non-ROI, multi-level regions, or the like) in a frame of video 401. For example, ROI detector 402 can have circuitry configured to execute an ROI detection neural network to determine ROI coordinates of one or more ROIs in a frame of video 401. ROI detector 402 can also determine priorities of the plurality of regions. FIG. 6B illustrates an exemplary original frame 610 while FIG. 6C illustrates the frame 610 with an ROI 611. As shown in FIG. 6C, ROI detector 402 can detect ROI 611 (shown as a θ region) in original frame 610. The remaining part 613 can be non-ROI (shown as a δ region) . In some embodiments, ROI detector 402 can further detect another ROI in non-ROI 613 to form more ROIs or multi-level regions.

ROI detector 402 can be communicatively coupled with rate controller 403 and video encoder 404. In some embodiments, ROI detector 402 can determine the region coordinates and region priorities of a frame of video 401 based on information from rate controller 403 and video encoder 404. ROI detector can provide region information (e.g., ROI coordinates, ROI priorities, or the like) to rate controller 403 and video encoder 404.

Rate controller 403 can also receive video 401. For example, rate controller 403 can receive video 401 from outside or from ROI detector 402, or read video 401 from memory device 405 that is communicatively coupled with rate controller 403. Rate controller 403 can have circuitry configured to perform bit allocation among the plurality of regions based on the region information from ROI detector 402 and demanded quality information for the plurality of regions and generate region bit allocation information. In some embodiments, rate controller 403 can receive demanded quality information for the plurality of regions from outside (e.g., from outside device, a user, or the like) or read a demanded (e.g., pre-determined) quality information for the plurality of regions from memory device 405, and have circuitry configured to determine quantization parameters (QPs) or rate-distortion-optimization (RDO) parameters (e.g., lambda values) for the plurality of regions according to the demanded quality. For example, rate controller 403 can have circuitry configured to execute a region bit allocation neural network to allocate bits for the plurality of regions. The region bit allocation neural network can be trained (e.g., by training process 700 as discussed with reference to FIG. 7) to perform the bit allocation. In the bit allocation, quality of the ROI can be enhanced if allocating more bits to the ROI when there is a predefined total bit budget for current frame. Bits can be saved from quality degradation of the non-ROI while keeping the same quality of ROI. Therefore, rate controller 403 can control the quality of the plurality of regions (e.g., the ROI or non-ROI) by allocating precise bits for the plurality of regions. In some embodiments, rate controller 403 can provide region bit allocation information to video encoder 404 to encode video 401.

Video encoder 404 can be communicatively coupled with rate controller 403 and receive region bit allocation information from rate controller 403. Video encoder 404 can have circuitry configured to encode video 401 based on the received region bit allocation information and generate encoded bit stream 406 that can be output from video coding system 400. Video encoder 404 can also feedback encoding information to ROI detector 402 and rate controller 403.

In some embodiments, rate controller 403 can receive region information (e.g., ROI coordinates of current frame of video 401) from ROI detector 402, encoding information (e.g., residual data and actual encoded bits of the plurality of regions of a previous frame, reconstructed video, or the like) from video encoder 404, demanded quality information, demanded bitrate, or the like. Then rate controller 403 can use the received information to determine bit allocation (e.g., respective QPs or RDO parameters, such as lambda values) for the plurality of regions. Video encoder 404 can utilize QPs or RDO parameters from rate controller 403 to encode current frame of video 401.

In some embodiments, rate controller 403 or video encoder 404 can feedback information (e.g., remaining bit budget, actual encoded bits of the plurality of regions, quality of the plurality of regions, or the like) to ROI detector 402. ROI detector 402 can use the feedback information to determine if the ROI (e.g., ROI coordinates) can be adjusted or not. For example, if the remaining bit budget is low or the quality of ROI or non-ROI is low, ROI detector 402 can reduce an area ratio of ROI in the frame to save more bits. FIG. 6D is a schematic representation of an exemplary frame 630 with an adjusted ROI 631, according to some embodiments of the present disclosure. ROI detector 402 can adjust ROI 611 in frame 610 to form an adjusted ROI 631 (shown as a θ region) and non-ROI 633 (shown as a δregion) in frame 630. As shown in FIG. 6C and FIG. 6D, adjusted ROI 631 has reduced area than that of ROI 611.

Memory device 405 can be communicatively coupled with ROI detector 402, rate controller 403, and video encoder 404 and can provide storage space for these components. For example, memory device 405 can store video 401 for read and processing by ROI detector 402, rate controller 403 and video encoder 404, or store encoded bit stream 406 for output to another component or device. Memory device 405 can also store temporary or intermediate data for ROI detector 402, rate controller 403 and video encoder 404. Although shown as a separate component in FIG. 4, memory device 405 can include a plurality of memory blocks that are integrated with ROI detector 402, rate controller 403 and video encoder 404, respectively, and in form of their internal memories.

FIG. 5 is a schematic representation of an exemplary rate controller 503 in a video coding system 500, according to some embodiments of the present disclosure. Video coding system 500 can include ROI detector 502, rate controller 503, video encoder 504, or the like. It is appreciated that video coding system 500, ROI detector 502, rate controller 503, and video encoder 504 can be implemented, at least in part, by neural network accelerator architecture 200 of FIGs. 2A and 2C, core 202 of FIGs. 2A-2B, operation unit configuration 300 of FIG. 3, or cloud system 230 of FIG. 2C. In some embodiments, video coding system 500, ROI detector 502, rate controller 503, and video encoder 504 can be implement as video coding system 400, ROI detector 402, rate controller 403, and video encoder 404 of FIG. 4, respectively.

As shown in FIG. 5, rate controller 503 can include a plurality of components, such as a group of picture (GOP) bit allocator 5031, a frame bit allocator 5032, an ROI/non-ROI bit allocator 5033, an ROI/non-ROI complexity estimator 5034, an ROI/non-ROI quality estimator 5035, an ROI rate quantizer 5036, a non-ROI rate quantizer 5037, an ROI/non-ROI limitator 5038, or the like. GOP bit allocator can receive video 501 and demanded bitrate 505. For example, GOP bit allocator can receive video 501 and demanded bitrate 505 from outside of or from ROI detector 402 that receives video 501 and demanded bitrate 505 from outside, or read video 501 and demanded bitrate 505 from a memory device (not shown in FIG. 5, e.g., memory device 405 of FIG. 4) that is communicatively coupled with rate controller 503. GOP bit allocator 5031 can have circuitry configured to perform bit allocation at GOP level for video 501 based on received demanded bitrate 505. GOP bit allocator can be communicatively coupled with frame bit allocator 5032 and ROI/non-ROI quality estimator 5035 and output GOP level bit allocation information (e.g., GOP target bits 551) to them.

Frame bit allocator 5032 can have circuitry configured to perform bit allocation at frame level among a plurality of frames of video 501 to obtain frame level bit allocation information (e.g., frame target bits 552) for current frame based on GOP level bit allocation information from GOP bit allocator 5031. Frame bit allocator 5032 can be communicatively coupled with ROI detector 502 and ROI/non-ROI bit allocator 5033 and output frame target bits 552 to them, as shown in FIG. 5.

ROI/non-ROI bit allocator 5033 can be communicatively coupled with frame bit allocator 5032, ROI/non-ROI complexity estimator 5034, ROI/non-ROI quality estimator 5035, ROI detector 502, or the like. ROI/non-ROI bit allocator 5033 can receive frame level bit allocation information (e.g., frame target bits 552) from frame bit allocator 5032, complexity information (e.g., target complexity value 555 of ROI or non-ROI) from ROI/non-ROI complexity estimator 5034, quality information (e.g., quality 554 of ROI or non-ROI) from ROI/non-ROI quality estimator 5035, and ROI information (e.g., ROI coordinates 553, ROI priorities, or the like) from ROI detector 502. ROI/non-ROI bit allocator 5033 can have circuitry configured to perform ROI/non-ROI bit allocation among ROI and non-ROI based on frame level bit allocation information from frame bit allocator 5032, complexity information from ROI/non-ROI complexity estimator 5034, quality information from ROI/non-ROI quality estimator 5035, and ROI information from ROI detector 502. In some embodiments, ROI/non-ROI bit allocator 5033 can have circuitry configured to execute a bit allocation model (e.g., bit allocation model 705 of FIG. 7) to allocate bits among ROI and non-ROI. The bit allocation model can include a region bit allocation neural network that can be trained (e.g., by training process 700 as discussed with reference to FIG. 7) to perform the bit allocation. The region bit allocation neural network can include any suitable types of neural network, including but not being limited to, CNN, RNN, or the like.

For example, ROI/non-ROI bit allocator 5033 can have circuitry configured to allocate bits among ROI (e.g., ROI 611 of FIG. 6C, or ROI 631 of FIG. 6D, shown as a θregion) and non-ROI (e.g., non-ROI 613 of FIG. 6C, or non-ROI 633 of FIG. 6D, shown as a δ region) based on a linear mean absolute difference (MAD) prediction model as follows:

T ^δ=F-T ^θ (Eq. 2)

where T ^θ and T ^δ represent bits allocated to the ROI and non-ROI, respectively, F represents frame bits (e.g., frame target bits 552) , r represents ratio factor of the ROI, MAD ^θ and MAD ^δrepresent MAD values (e.g., target complexity value 555) of the ROI and non-ROI, respectively. In some embodiments, target complexity value 555 can include mean square absolute error (MSE) values for the ROI and non-ROI. ROI/non-ROI bit allocator 5033 can have circuitry configured to allocate bits among ROI and non-ROI based on MSE prediction model using MSE values from ROI/non-ROI complexity estimator 5034.

As shown in FIG. 5, ROI/non-ROI bit allocator 5033 can be communicatively coupled with and output ROI target bits 556 and non-ROI target bits 557, respectively, to ROI rate quantizer 5036 and non-ROI rate quantizer 5037. ROI rate quantizer 5036 can have circuitry configured to perform an ROI rate quantization based on ROI target bits 556 from ROI/non-ROI bit allocator 5033 to generate QP or RDO parameter of the ROI (e.g., ROI base QP 558) . In some embodiments, ROI rate quantizer 5036 can also be communicatively coupled with ROI/non-ROI complexity estimator 5034 and receive therefrom target complexity value 555. ROI rate quantizer 5036 can have circuitry configured to utilize target complexity value 555 to perform an ROI rate quantization. Similarly, non-ROI rate quantizer 5037 can have circuitry configured to perform a non-ROI rate quantization based on non-ROI target bits 557 from ROI/non-ROI bit allocator 5033 to generate QP or RDO parameter of the non-ROI (e.g., non-ROI base QP 559) . In some embodiments, non-ROI rate quantizer 5037 can also be communicatively coupled with ROI/non-ROI complexity estimator 5034 and receive therefrom target complexity value 555. Non-ROI rate quantizer 5037 can have circuitry configured to utilize target complexity value 555 to perform a non-ROI rate quantization. It is appreciated that, although shown as separate components, ROI rate quantizer 5036 and non-ROI rate quantizer 5037 can be integrated into a single ROI/non-ROI rate quantizer.

For example, ROI rate quantizer 5036 can determine QP for the ROI (e.g., ROI 611 of FIG. 6C, or ROI 631 of FIG. 6D, shown as a θ region) based on MAD model as follows:

where Q ^θ represents the QP for the ROI, c1 and c2 represent two model parameters.

Similarly, non-ROI rate quantizer 5037 can determine QP for the non-ROI (e.g., non-ROI 613 of FIG. 6C, or non-ROI 633 of FIG. 6D, shown as a δ region) based on MAD model as follows:

where Q ^δ represents the QP for the non-ROI.

In some embodiments, ROI rate quantizer 5036 and non-ROI rate quantizer 5037 can be communicatively coupled with ROI/non-ROI limitator 5038 and output ROI base QP 558 and non-ROI base QP 559, respectively, to ROI/non-ROI limitator 5038. ROI/non-ROI limitator 5038 can have circuitry configured to perform limitation to constrain change ranges of ROI base QP 558 and non-ROI base QP 559. The limitation can improve the quality stability.

As shown in FIG. 5, video encoder 504 can be communicatively coupled with rate controller 503 (e.g., ROI/non-ROI limitator 5038) and receive region bit allocation information (e.g., ROI/non-ROI QP 561) from rate controller 503. Video encoder 504 can have circuitry configured to encode video 501 based on the received region bit allocation information and generate encoded bit stream 506 that can be output from video coding system 500. Video encoder 504 can also feedback encoding information to ROI detector 502 and rate controller 503. For example, as shown in FIG. 5, video encoder 504 can be communicatively coupled with ROI/non-ROI complexity estimator 5034 and feedback residuals 564 of a previous frame or current frame to ROI/non-ROI complexity estimator 5034. Based on feedbacked residuals 564, ROI/non-ROI complexity estimator 5034 can have circuitry configured to perform a complexity estimation and determine target complexity values 555 of ROI and non-ROI. Target complexity value 555 can include MAD value of the residuals, MSE value of the residuals, or the like. For example, ROI/non-ROI complexity estimator 5034 can update MAD values for ROI and non-ROI based on feedbacked residuals 564. In some embodiment, ROI/non-ROI bit allocator 5033 can determine ROI target bits 556 and non-ROI target bits 557 based on a ratio of target complexity values 555 of the ROI and non-ROI and frame target bits 552.

Moreover, video encoder 504 can be communicatively coupled with ROI/non-ROI quality estimator 5035 and feedback reconstructed video 562 to ROI/non-ROI quality estimator 5035. ROI/non-ROI quality estimator 5035 can also receive demanded quality information 507 for ROI and non-ROI. For example, ROI/non-ROI quality estimator 5035 can receive demanded quality information 507 for ROI and non-ROI from outside (e.g., from outside device or component, a user, or the like) or read demanded (e.g., pre-determined) quality information 507 for ROI and non-ROI from a memory device (e.g., memory device 405 of FIG. 4) . ROI/non-ROI quality estimator 5035 can have circuitry configured to adjust a weighting of target bit allocation for ROI and non-ROI and generate quality information 554 based on reconstructed video 562 and demanded quality information 507. For example, if quality in ROI is low for the i-th frame, quality information 554 can indicate that more bits can be allocated to ROI of the next frame, (i+1) -th frame, to upgrade the quality of the ROI. The quality can be a measurement from original frame and reconstructed frame, such as MAD, peak signal-to-noise ratio (PSNR) , structural similarity index metric (SSIM) , video multimethod assessment fusion (VMAF) , or the like. The quality can also be the difference of MAD, PSNR, SSIM, VMAF, or the like. In some embodiments, ROI/non-ROI quality estimator 5035 can also be communicatively coupled with ROI detector 502 and output ROI/non-ROI quality 560 to ROI detector 502. In some embodiments, ROI/non-ROI quality estimator 5035 can have circuitry configured to compute qualities for ROI and non-ROI based on reconstructed video 562 and demanded quality 507. ROI/non-ROI bit allocator 5033 can have circuitry configured to update the ratio factor r of the ROI based on qualities for ROI and non-ROI.

As shown in FIG. 5, video encoder 504 can also be communicatively coupled with ROI detector 502 and feedback actual encoded bits 563 of the ROI and non-ROI of a previous frame to video detector 502. ROI detector 502 can have circuitry configured to determine a plurality of regions (e.g., one or more ROIs, non-ROI, multi-level regions, or the like) in a frame of video 501. For example, ROI detector 502 can have circuitry configured to execute an ROI detection neural network to determine ROI coordinates of one or more ROIs (e.g., ROI 611 of FIG. 6C, or ROI 631 of FIG. 6D, shown as a θ region) in a frame of video 501. ROI detector 502 can also have circuitry configured to determine if the ROI (e.g., ROI coordinates) can be adjusted based on actual encoded bits 563 of the ROI and non-ROI of a previous frame and ROI/non-ROI quality 560. For example, if the remaining bit budget is low or the quality of ROI or non-ROI is low, ROI detector 502 can reduce an area ratio of ROI in the frame to save more bits. For example, ROI detector 502 can adjust ROI 611 in frame 610 to form an adjusted ROI 631 (shown as a θ region) and non-ROI 633 (shown as a δ region) in frame 630. As shown in FIG. 6C and FIG. 6D, adjusted ROI 631 has reduced area than that of ROI 611.

In some embodiment, a lower bound of bit requirement for ROI and non-ROI can be determined based on complexity values 555 of ROI and non-ROI (e.g., by ROI/non-ROI bit allocator 5033) . Remaining bits that can be determined from frame target bits 552 minus the lower bound of bit requirement for ROI and non-ROI can be used to perform the quality control of ROI and non-ROI (e.g., by ROI/non-ROI bit allocator 5033) . This scheme can prevent the ROI or non-ROI from consuming too many bits to cause a bit-starving for the following frames.

It is appreciated that connections (e.g., arrows) in FIG. 5 are not exclusive but exemplary, video coding system 500 can include other connections between different components. For example, ROI detector can be coupled with an outside device and receive video 501 from the outside device. Moreover, while the connections in FIG. 5 are uni-directional, it is appreciated that they can be bi-directional.

FIG. 6A is a schematic representation of an exemplary frame 600 with multi-level regions, according to some embodiments of the present disclosure. In some embodiments, exemplary frame 600 can be applied to video coding system 400 of FIG. 4 or video coding system 500 of FIG. 5.

As shown in FIG. 6A, frame 600 can include a plurality of regions (e.g., region 601, region 602, region 603, region 604, region 605, region 606, and region 607) at three or more quality levels. For example, regions 601-607 each has a level of interest different from each other, and thus frame 600 can have seven different quality levels for these regions 601-607. As another example,

regions

601, 604 and 607 can be at the same quality level and

regions

602, 603, 605 and 606 each can have respective quality level. Frame 600 then can have five different quality levels.

With reference to FIG. 4, frame 600 can be input into ROI detector 402. ROI detector 402 can have circuitry configured to determine multi-level regions (e.g., regions 601-607) in frame 600. For example, ROI detector 402 can have circuitry configured to execute a region detection neural network to determine region information (e.g, coordinates, priorities, or the like) of multi-level regions in frame 600.

Rate controller 403 can receive the region information of determined regions in frame 600 from ROI detector 402 and demanded qualities for detected regions. Then, rate controller 403 can perform bit allocation among the determined regions based on the received region information and demanded qualities and generate region bit allocation information. For example, rate controller 403 can have circuitry configured to execute a region bit allocation neural network to determine QPs or RDO parameters for the regions in frame 600. The region bit allocation neural network can be trained (e.g., by training process 700 as discussed with reference to FIG. 7) to perform the bit allocation.

Video encoder 404 can receive the region bit allocation information (e.g., QPs or RDO parameters for the regions in frame 600) from rate controller 403. Video encoder 404 can have circuitry configured to encode frame 600 based on the received region bit allocation information and generate an encoded bit stream. Video encoder 404 can also send feedback encoding information to ROI detector 402 and rate controller 403.

With reference to FIG. 5, frame bit allocator 5032 can have circuitry configured to allocate bits to frame 600 and generate frame level bit allocation information (e.g., frame target bits 552) for frame 600. ROI/non-ROI bit allocator 5033 can receive frame level bit allocation information (e.g., frame target bits 552) from frame bit allocator 5032, complexity information (e.g., target complexity value 555 of multi-level regions) from ROI/non-ROI complexity estimator 5034, quality information (e.g., quality 554 of multi-level regions) from ROI/non-ROI quality estimator 5035, and region information (e.g., coordinates 553, priorities, or the like) from ROI detector 502. ROI/non-ROI bit allocator 5033 can have circuitry configured to perform region bit allocation among multi-level regions based on the frame level bit allocation information, complexity information, quality information, and region information, and generate region target bits for multi-level regions in frame 600. In some embodiments, ROI/non-ROI bit allocator 5033 can have circuitry configured to execute a bit allocation model (e.g., bit allocation model 705 of FIG. 7) to allocate bits among multi-level regions. The bit allocation model can include a region bit allocation neural network that can be trained (e.g., by training process 700 as discussed with reference to FIG. 7) to perform the bit allocation.

ROI rate quantizer 5036, non-ROI rate quantizer 5037, and additional region rate quantizers each can have circuitry configured to perform a region rate quantization based on region target bits for one of multi-level regions in frame 600. These rate quantizers can generate QPs or RDO parameters (e.g., ROI base QP 558 and non-ROI base QP 559) for multi-level regions in frame 600. It is appreciated that, although shown as separate components, these rate quantizers can be integrated into a single region rate quantizer. ROI/non-ROI limitator 5038 can have circuitry configured to perform limitation to constrain change ranges of QPs or RDO parameters (e.g., ROI base QP 558 and non-ROI base QP 559) for multi-level regions in frame 600.

Video encoder 504 can have circuitry configured to encode frame 600 based on the received region bit allocation information (e.g., QPs or RDO parameters for multi-level regions in frame 600) and generate encoded bit stream 506 that can be output from video coding system 500. Video encoder 504 can also feedback encoding information to ROI detector 502 and rate controller 503. For example, as shown in FIG. 5, video encoder 504 can feedback residuals 564 of a previous frame or current frame 600 to ROI/non-ROI complexity estimator 5034. Based on feedbacked residuals 564, ROI/non-ROI complexity estimator 5034 can have circuitry configured to perform a complexity estimation and determine target complexity values 555 of the multi-level regions. Target complexity value 555 can include MAD value of the residuals, MSE value of the residuals, or the like.

Moreover, video encoder 504 can send feedback reconstructed video 562 to ROI/non-ROI quality estimator 5035. ROI/non-ROI quality estimator 5035 can also receive demanded quality information 507 for the multi-level regions. ROI/non-ROI quality estimator 5035 can have circuitry configured to adjust a weighting of target bit allocation for the multi-level regions and generate quality information 554 based on reconstructed video 562 and demanded quality information 507. For example, if quality in ROI is low for the i-th frame, quality information 554 can indicate that more bits can be allocated for ROI of the next frame, (i+1) -th frame, to upgrade the quality of the ROI. In some embodiments, ROI/non-ROI quality estimator 5035 can also output region quality (e.g., ROI/non-ROI quality 560) to ROI detector 502.

As shown in FIG. 5, video encoder 504 can also feedback actual encoded bits 563 of the multi-level regions of a previous frame to video detector 502. ROI detector 502 can have circuitry configured to determine if one or more of the multi-level regions (e.g., region coordinates) can be adjusted based on actual encoded bits 563 and region quality 560. For example, if the remaining bit budget is low or the quality of a region is low, ROI detector 502 can reduce an area ratio of the region in the frame to save more bits.

FIG. 7 is a schematic diagram of an exemplary training process 700 of a bit allocation model 705, according to some embodiments of the present disclosure. It is appreciated that training process 700 can be implemented, at least in part, by neural network accelerator architecture 200 of FIGs. 2A and 2C, core 202 of FIGs. 2A-2B, operation unit configuration 300 of FIG. 3, or cloud system 230 of FIG. 2C. In some embodiments, bit allocation model 705 can be applied to rate controller 403 of FIG. 4 or ROI/non-ROI bit allocator 5033 of rate controller 503 of FIG. 5 to perform bit allocation. The bit allocation model 705 can include a region bit allocation neural network that can be trained by training process 700 to perform the bit allocation.

As shown in FIG. 7, video 701 can be input into video encoder 702. It is appreciated that, in some embodiments, video encoder 702 can be implemented by video encoder 404 of FIG. 4 or video encoder 504 of FIG. 5. Video encoder 702 can have circuitry configured to encode video 701 and obtain side information for a frame (e.g., frame 600 of FIG. 6A) . The obtained side information can include side information of the whole frame, each coded tree unit (CTU) in the frame, each region (e.g., ROI, non-ROI, multi-level regions, or the like) in the frame. The side information can include an area ratio of a region in the frame, a target quality for a region in the frame, an image complexity of a region in the frame, a real quality value of a region in the frame, a target QP or RDO parameter from the frame level rate control, frame target bits from the frame level rate control, or the like. Video encoder 702 can store the obtained side information within side information dataset 703. Side information dataset 703 can be included in a memory device (e.g., memory device 405 of FIG. 4) .

A training device 704 can read the side information from side information dataset 703 and have circuitry configured to train a region bit allocation neural network based on the read side information. The region bit allocation neural network can be a prediction neural network for target bit allocation among regions in a frame. In some embodiments, training device 704 can be implemented by neural network accelerator architecture 200 of FIGs. 2A and 2C. The trained region bit allocation neural network can be stored to bit allocation model 705. In some embodiments, bit allocation model 705 can be incorporated into or used by a coding system (e.g., video coding system 400 of FIG. 4 or video coding system 500 of FIG. 5) to perform region bit allocation. For example, a rate controller (e.g., rate controller 403 of FIG. 4 or rate controller 503 (e.g., ROI/non-ROI bit allocator 5033) of FIG. 5) can have circuitry configured to generate region bit allocation information (e.g., QPs or RDO parameters) for a plurality of regions in a frame based on bit allocation model 705 and side information from side information dataset 703 or from video encoder 702. Video encoder 702 can have circuitry configured to encode video 701 based on the region bit allocation information to generate compressed bit stream.

In some embodiments, side information from video encoding flow can be reused as training dataset (e.g., side information dataset 703) to obtain the region bit allocation neural network. Moreover, side information from the video encoding flow can also be reused as inference information to obtain region bit allocation for regions in a frame of a video. Thus, some embodiments of the present disclosure can improve compatibility and reduce complexity.

In some embodiments, region bit allocation can be performed based on the trained region bit allocation neural network to estimate the region target bits for regions in a frame. The estimation can be precise and target quality can converge fast. FIG. 8 illustrates a schematic diagram of exemplary convergence results 800, according to some embodiments of the present disclosure. As shown in FIG. 8, white circles represent ROI/non-ROI PSNR difference result of an existing normal coding method, while black circles represent ROI/non-ROI PSNR difference result of an exemplary coding method according to some embodiments of the present disclosure. In the exemplary coding method, a five-layer fully connected neural network is used as region bit allocation neural network. A video including 96 frames is applied to the existing normal coding method and the exemplary coding method. Convergence results 800 indicate that the exemplary coding method can bring a 40%improvement for converge speed.

FIG. 9 is a flowchart of an exemplary video coding method 900, according to some embodiments of the present disclosure. Method 900 can be implemented, at least partially, by neural network accelerator 200 of FIGs. 2A and 2C, core 202 of FIGs. 2A-2B, cloud system 230 of FIG. 2C, video coding system 400 of FIG. 4, video coding system 500 of FIG. 5, or training process 700 of FIG. 7. Moreover, method 900 can also be implemented by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers. In some embodiments, a host unit (e.g., host unit 220 of FIG. 2A or 2C) may compile software code for generating instructions for providing to one or more accelerators to perform method 900.

As shown in FIG. 9, at step 901, method 900 can include receiving a video (e.g., video 401 of FIG. 4, video 501 of FIG. 5, or the like) comprising a plurality of frames. For example, referring to FIG. 4, ROI detector can receive video 401 from outside or read video 401 from memory device 405 that is communicatively coupled with ROI detector.

At step 903, method 900 can include determining a plurality of regions in a frame of the video. For example, referring to FIG. 4, ROI detector can have circuitry configured to determine a plurality of regions (e.g., one or more ROIs, non-ROI, multi-level regions, or the like) in a frame of video 401. For example, ROI detector 402 can have circuitry configured to execute an ROI detection neural network to determine ROI coordinates of one or more ROIs (e.g., ROI 611 of FIG. 6C, or ROI 631 of FIG. 6D, shown as a θ region) in a frame of video 401. ROI detector 402 can also determine ROI priorities of the regions.

At step 905, method 900 can include performing a bit allocation for the plurality of regions based on demanded quality information for the plurality of regions to generate region bit allocation information. For example, with reference to FIG. 5, rate controller 503 can receive demanded quality information 507 for the plurality of regions and have circuitry configured to perform a bit allocation for the plurality of regions based on the received demanded quality information to generate region bit allocation information (e.g., ROI/non-ROI QP 561) .

In some embodiments, performing the bit allocation for the plurality of regions can include executing a region bit allocation neural network to perform the bit allocation for the plurality of regions based on complexity information and quality information of the plurality of regions. The complexity information can be generated by performing a complexity estimation based on residuals of a previous frame or current frame. The complexity information comprises MAD or MSE of the residuals for the plurality of regions. The quality information can be generated by adjusting a weighting of the bit allocation for the plurality of regions based on the demanded quality information and reconstructed video. The quality information comprises MAD, PSNR, SSIM or VMAF, or a difference of MAD, PSNR, SSIM or VMAF. For example, with reference to FIG. 5, ROI/non-ROI bit allocator 5033 can have circuitry configured to perform ROI/non-ROI bit allocation among ROI and non-ROI based on complexity information from ROI/non-ROI complexity estimator 5034 and quality information from ROI/non-ROI quality estimator 5035. ROI/non-ROI complexity estimator 5034 can have circuitry configured to perform a complexity estimation and determine target complexity value 555 of ROI and non-ROI based on residuals 564 of a previous frame or current frame. ROI/non-ROI quality estimator 5035 can have circuitry configured to adjust a weighting of target bit allocation for ROI and non-ROI and generate quality information 554 based on reconstructed video 562 and demanded quality information 507.

In some embodiment, method 900 can also include determining QPs or RDO parameters for the plurality of regions to generate region base QPs or RDO parameters for the plurality of regions and performing a limitation to constrain change ranges of the region base QPs or RDO parameters to generate region QPs or RDO parameters for the plurality of regions as the region bit allocation information. For example, referring to FIG. 5,

ROI rate quantizers

5036 and 5037 can have circuitry configured to perform an ROI and non-ROI rate quantizations based on ROI target bits 556 and non-ROI target bits 557 from ROI/non-ROI bit allocator 5033 to generate QPs or RDO parameters of the ROI and non-ROI (e.g., ROI base QP 558 and non-ROI base QP 559) , respectively. ROI/non-ROI limitator 5038 can have circuitry configured to perform limitation to constrain change ranges of ROI base QP 558 and non-ROI base QP 559 (or RDO parameters) .

In some embodiments, method 900 can also include performing bit allocation at GOP level for the video based on a demanded bitrate and performing bit allocation at frame level. For example, with reference to FIG. 5, GOP bit allocator 5031 can have circuitry configured to perform bit allocation at GOP level for video 501 based on received demanded bitrate 505, and frame bit allocator 5032 can have circuitry configured to perform bit allocation at frame level among a plurality of frames of video 501.

At step 907, method 900 can include encoding the frame based on the region bit allocation information to generate an encoded bit stream. For example, referring to FIG. 5, video encoder 504 can have circuitry configured to encode video 501 based on the received region bit allocation information (e.g., ROI/non-ROI QP 561) and generate encoded bit stream 506.

In some embodiments, method 900 can also include obtaining side information during encoding of the video and training the region bit allocation neural network model based on the side information. The side information comprises an area ratio of a region in the frame, a target quality for a region in the frame, an image complexity of a region in the frame, a real quality value of a region in the frame, a target QP or RDO parameter, frame target bits. For example, with reference to FIG. 7, video encoder 702 can have circuitry configured to encode video 701 and obtain side information for a frame (e.g., frame 600 of FIG. 6A) . Training device 704 can have circuitry configured to train a region bit allocation neural network model based on the side information. In some embodiments, the side information can also be used to perform the bit allocation for the plurality of regions.

It is appreciated that the embodiments disclosed herein can be used in various application environments, such as artificial intelligence (AI) training and inference, database and big data analytic acceleration, video compression and decompression, and the like.

Embodiments of the present disclosure can be applied to many products. For example, some embodiments of the present disclosure can be applied to Ali-NPU (e.g., Hanguang NPU) , Ali-Cloud, Ali PIM-AI (Processor-in Memory for AI) , Ali-DPU (Database Acceleration Unit) , Ali-AI platform, Ali-Data Center AI Inference Chip, IoT Edge AI Chip, GPU, TPU, or the like.

The embodiments may further be described using the following clauses:

1. A video coding system, comprising:

a region of interest (ROI) detector having circuitry configured to determine a plurality of regions in a frame of a video;

a rate controller communicatively coupled with the ROI detector and having circuitry configured to perform bit allocation for the plurality of regions based on demanded quality information for the plurality of regions and to generate region bit allocation information; and

a video encoder communicatively coupled with the ROI detector and the rate controller and having circuitry configured to encode the frame based on the region bit allocation information.

2. The video coding system of clause 1, wherein the rate controller comprises:

a region bit allocator having circuitry configured to execute a region bit allocation neural network model to perform the bit allocation for the plurality of regions based on complexity information and quality information of the plurality of regions.

3. The video coding system of clause 2, wherein the rate controller comprises:

a region complexity estimator communicatively coupled with the video encoder and region bit allocator, the region complexity estimator having circuitry configured to perform a complexity estimation based on residuals of a previous frame or current frame from the video encoder and to generate the complexity information.

4. The video coding system of clause 3, wherein the complexity information comprises mean absolute difference (MAD) or mean square absolute error (MSE) of the residuals for the plurality of regions.

5. The video coding system of any one of clauses 2-4, wherein the rate controller comprises:

a region quality estimator communicatively coupled with the video encoder and region bit allocator, the region quality estimator having circuitry configured to adjust a weighting of the bit allocation for the plurality of regions based on reconstructed video from the video encoder and the demanded quality information and to generate the quality information.

6. The video coding system of clause 5, wherein the quality information comprises mean absolute difference (MAD) , peak signal-to-noise ratio (PSNR) , structural similarity index metric (SSIM) , video multimethod assessment fusion (VMAF) , or a difference of MAD, PSNR, SSIM or VMAF.

7. The video coding system of any one of clauses 2-6, wherein the rate controller comprises:

one or more region rate quantizers communicatively coupled with the region bit allocator and having circuitry configured to determine quantization parameters (QPs) or rate- distortion-optimization (RDO) parameters for the plurality of regions and to generate region base QPs or RDO parameters for the plurality of regions; and

a region limitator communicatively coupled with the one or more region rate quantizer and the video encoder, the region limitator having circuitry configured to perform a limitation to constrain change ranges of the region base QPs or RDO parameters and to generate region QPs or RDO parameters for the plurality of regions as the region bit allocation information.

8. The video coding system of any one of clauses 2-7, wherein the rate controller comprises:

a group of picture (GOP) bit allocator having circuitry configured to perform bit allocation at GOP level for the video based on a demanded bitrate; and

a frame bit allocator communicatively coupled with the GOP bit allocator and the region bit allocator, the frame bit allocator having circuitry configured to perform bit allocation at frame level.

9. The video coding system of any one of clauses 2-8, wherein the video encoder has circuitry configured to obtain side information during encoding of the video.

10. The video coding system of clause 9, wherein the side information is used to train the region bit allocation neural network model.

11. The video coding system of any one of clause 9 and clause 10, wherein the rate controller has circuitry configured to perform the bit allocation for the plurality of regions based on the side information.

12. The video coding system of any one of clauses 9-11, wherein the side information comprises an area ratio of a region in the frame, a target quality for a region in the frame, an image complexity of a region in the frame, a real quality value of a region in the frame, a target QP or RDO parameter, frame target bits.

13. The video coding system of any one of clauses 9-12, wherein the plurality of regions have two or more quality levels.

14. A method for video coding comprising:

receiving a video comprising a plurality of frames;

determining a plurality of regions in a frame of the video;

performing bit allocation for the plurality of regions based on demanded quality information for the plurality of regions to generate region bit allocation information; and

encoding the frame based on the region bit allocation information to generate an encoded bit stream.

15. The method of clause 14, wherein performing the bit allocation for the plurality of regions comprises:

executing a region bit allocation neural network model to perform the bit allocation for the plurality of regions based on complexity information and quality information of the plurality of regions.

16. The method of clause 15, wherein performing the bit allocation for the plurality of regions comprises:

performing a complexity estimation based on residuals of a previous frame or current frame to generate the complexity information.

17. The method of any one of clause 15 and clause 16, wherein performing the bit allocation for the plurality of regions comprises:

adjusting a weighting of the bit allocation for the plurality of regions based on the demanded quality information and reconstructed video to generate the quality information.

18. The method of any one of clauses 15-17, wherein performing the bit allocation for the plurality of regions comprises:

determining quantization parameters (QPs) or rate-distortion-optimization (RDO) parameters for the plurality of regions to generate region base QPs or RDO parameters for the plurality of regions; and

performing a limitation to constrain change ranges of the region base QPs or RDO parameters to generate region QPs or RDO parameters for the plurality of regions as the region bit allocation information.

19. The method of any one of clauses 15-18, wherein performing the bit allocation for the plurality of regions comprises:

performing bit allocation at group of picture (GOP) level for the video based on a demanded bitrate; and

performing bit allocation at frame level.

20. The method of any one of clauses 15-19, further comprising:

obtaining side information during encoding of the video.

21. The method of clause 20, further comprising:

training the region bit allocation neural network model based on the side information.

22. The method of any one of clause 20 and clause 21, wherein performing the bit allocation for the plurality of regions comprises:

performing the bit allocation for the plurality of regions based on the side information.

23. The method of any one of clauses 20-22, wherein the side information comprises an area ratio of a region in the frame, a target quality for a region in the frame, an image complexity of a region in the frame, a real quality value of a region in the frame, a target QP or RDO parameter, frame target bits.

24. A video coding apparatus, comprising:

at least one memory for storing instructions; and

at least one processor configured to execute the instructions to cause the apparatus to perform:

receiving a video comprising a plurality of frames;

determining a plurality of regions in a frame of the video;

25. The apparatus of clause 24, wherein the at least one processor is configured to execute the instructions to cause the apparatus to perform:

26. The apparatus of clause 25, wherein the at least one processor is configured to execute the instructions to cause the apparatus to perform:

27. The apparatus of any one of clause 25 and clause 26, wherein the at least one processor is configured to execute the instructions to cause the apparatus to perform:

28. The apparatus of any one of clauses 25-27, wherein the at least one processor is configured to execute the instructions to cause the apparatus to perform:

29. The apparatus of any one of clauses 25-28, wherein the at least one processor is configured to execute the instructions to cause the apparatus to perform:

performing bit allocation at frame level.

30. The apparatus of any one of clauses 25-29, wherein the at least one processor is configured to execute the instructions to cause the apparatus to perform:

obtaining side information during encoding of the video.

31. The apparatus of clause 30, wherein the at least one processor is configured to execute the instructions to cause the apparatus to perform:

32. The apparatus of any one of clause 30 and clause 31, wherein the at least one processor is configured to execute the instructions to cause the apparatus to perform:

perform the bit allocation for the plurality of regions based on the side information.

33. The apparatus of any one of clauses 30-32, wherein the side information comprises an area ratio of a region in the frame, a target quality for a region in the frame, an image complexity of a region in the frame, a real quality value of a region in the frame, a target QP or RDO parameter, frame target bits.

34. A non-transitory computer readable storage medium storing a set of instructions that are executable by one or more processing devices to cause a computer to perform:

receiving a video comprising a plurality of frames;

determining a plurality of regions in a frame of the video;

35. The non-transitory computer readable storage medium of clause 34, wherein the set of instructions are executable by the one or more processing devices to cause the computer to perform:

36. The non-transitory computer readable storage medium of clause 35, wherein the set of instructions are executable by the one or more processing devices to cause the computer to perform:

37. The non-transitory computer readable storage medium of any one of clause 35 and clause 36, wherein the set of instructions are executable by the one or more processing devices to cause the computer to perform:

38. The non-transitory computer readable storage medium of any one of clauses 35-37, wherein the set of instructions are executable by the one or more processing devices to cause the computer to perform:

39. The non-transitory computer readable storage medium of any one of clauses 35-38, wherein the set of instructions are executable by the one or more processing devices to cause the computer to perform:

performing bit allocation at frame level.

40. The non-transitory computer readable storage medium of any one of clauses 35-39, wherein the set of instructions are executable by the one or more processing devices to cause the computer to perform:

obtaining side information during encoding of the video.

41. The non-transitory computer readable storage medium of clause 40, wherein the set of instructions are executable by the one or more processing devices to cause the computer to perform:

42. The non-transitory computer readable storage medium of any one of clause 40 and clause 41, wherein the set of instructions are executable by the one or more processing devices to cause the computer to perform:

43. The non-transitory computer readable storage medium of any one of clauses 40-42, wherein the side information comprises an area ratio of a region in the frame, a target quality for a region in the frame, an image complexity of a region in the frame, a real quality value of a region in the frame, a target QP or RDO parameter, frame target bits.

The various example embodiments described herein are described in the general context of method steps or processes, which may be implemented in one aspect by a computer program product, embodied in a computer readable medium, including computer-executable instructions, such as program code, executed by computers in networked environments. A computer readable medium may include removeable and nonremovable storage devices including, but not limited to, Read Only Memory (ROM) , Random Access Memory (RAM) , compact discs (CDs) , digital versatile discs (DVD) , etc. Generally, program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.

The foregoing description has been presented for purposes of illustration. It is not exhaustive and is not limited to precise forms or embodiments disclosed. Modifications and adaptations of the embodiments will be apparent from consideration of the specification and practice of the disclosed embodiments. For example, the described implementations include hardware, but systems and methods consistent with the present disclosure can be implemented with hardware and software. In addition, while certain components have been described as being coupled to one another, such components may be integrated with one another or distributed in any suitable fashion.

Moreover, while illustrative embodiments have been described herein, the scope includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments) , adaptations or alterations based on the present disclosure. The elements in the claims are to be interpreted broadly based on the language employed in the claims and not limited to examples described in the present specification or during the prosecution of the application, which examples are to be construed as nonexclusive. Further, the steps of the disclosed methods can be modified in any manner, including reordering steps and/or inserting or deleting steps.

The features and advantages of the present disclosure are apparent from the detailed specification, and thus, it is intended that the appended claims cover all systems and methods falling within the true spirit and scope of the present disclosure. As used herein, the indefinite articles “a” and “an” mean “one or more. ” Further, since numerous modifications and variances will readily occur from studying the present disclosure, it is not desired to limit the present disclosure to the exact reconstruction and operation illustrated and described, and accordingly, all suitable modifications and equivalents may be resorted to, falling within the scope of the present disclosure.

As used herein, unless specifically stated otherwise, the term “or” encompasses all possible combinations, except where infeasible. For example, if it is stated that a component may include A or B, then, unless specifically stated otherwise or infeasible, the component may include A, or B, or A and B. As a second example, if it is stated that a component may include A, B, or C, then, unless specifically stated otherwise or infeasible, the component may include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C.

Other embodiments will be apparent from consideration of the specification and practice of the embodiments disclosed herein. It is intended that the specification and examples be considered as example only, with a true scope and spirit of the disclosed embodiments being indicated by the following claims.

Claims

A video coding system, comprising:

a region of interest (ROI) detector having circuitry configured to determine a plurality of regions in a frame of a video;

a rate controller communicatively coupled with the ROI detector and having circuitry configured to perform bit allocation for the plurality of regions based on demanded quality information for the plurality of regions and to generate region bit allocation information; and

a video encoder communicatively coupled with the ROI detector and the rate controller and having circuitry configured to encode the frame based on the region bit allocation information.
The video coding system of claim 1, wherein the rate controller comprises:

a region bit allocator having circuitry configured to execute a region bit allocation neural network model to perform the bit allocation for the plurality of regions based on complexity information and quality information of the plurality of regions.
The video coding system of claim 2, wherein the rate controller comprises:

a region complexity estimator communicatively coupled with the video encoder and region bit allocator, the region complexity estimator having circuitry configured to perform a complexity estimation based on residuals of a previous frame or current frame from the video encoder and to generate the complexity information.
The video coding system of claim 2, wherein the rate controller comprises:

a region quality estimator communicatively coupled with the video encoder and region bit allocator, the region quality estimator having circuitry configured to adjust a weighting of the bit allocation for the plurality of regions based on reconstructed video from the video encoder and the demanded quality information and to generate the quality information.
The video coding system of claim 2, wherein the video encoder has circuitry configured to obtain side information during encoding of the video, and wherein the side information is used to train the region bit allocation neural network model.
The video coding system of claim 5, wherein the rate controller has circuitry configured to perform the bit allocation for the plurality of regions based on the side information.
The video coding system of claim 5, wherein the side information comprises an area ratio of a region in the frame, a target quality for a region in the frame, an image complexity of a region in the frame, a real quality value of a region in the frame, a target QP or RDO parameter, frame target bits.
The video coding system of claim 5, wherein the plurality of regions have two or more quality levels.
A method for video coding comprising:

receiving a video comprising a plurality of frames;

determining a plurality of regions in a frame of the video;

performing bit allocation for the plurality of regions based on demanded quality information for the plurality of regions to generate region bit allocation information; and

encoding the frame based on the region bit allocation information to generate an encoded bit stream.
The method of claim 9, wherein performing the bit allocation for the plurality of regions comprises:

executing a region bit allocation neural network model to perform the bit allocation for the plurality of regions based on complexity information and quality information of the plurality of regions.
The method of claim 10, wherein performing the bit allocation for the plurality of regions comprises:

performing a complexity estimation based on residuals of a previous frame or current frame to generate the complexity information.
The method of claim10, wherein performing the bit allocation for the plurality of regions comprises:

adjusting a weighting of the bit allocation for the plurality of regions based on the demanded quality information and reconstructed video to generate the quality information.
The method of claim 10, further comprising:

obtaining side information during encoding of the video; and

training the region bit allocation neural network model based on the side information.
The method of claim 13, wherein performing the bit allocation for the plurality of regions comprises:

performing the bit allocation for the plurality of regions based on the side information.
The method of claim 13, wherein the side information comprises an area ratio of a region in the frame, a target quality for a region in the frame, an image complexity of a region in the frame, a real quality value of a region in the frame, a target QP or RDO parameter, frame target bits.
A non-transitory computer readable storage medium storing a set of instructions that are executable by one or more processing devices to cause a computer to perform:

receiving a video comprising a plurality of frames;

determining a plurality of regions in a frame of the video;

performing bit allocation for the plurality of regions based on demanded quality information for the plurality of regions to generate region bit allocation information; and

encoding the frame based on the region bit allocation information to generate an encoded bit stream.
The non-transitory computer readable storage medium of claim 16, wherein the set of instructions are executable by the one or more processing devices to cause the computer to perform:

executing a region bit allocation neural network model to perform the bit allocation for the plurality of regions based on complexity information and quality information of the plurality of regions.
The non-transitory computer readable storage medium of claim 17, wherein the set of instructions are executable by the one or more processing devices to cause the computer to perform:

performing a complexity estimation based on residuals of a previous frame or current frame to generate the complexity information.
The non-transitory computer readable storage medium of claim 17, wherein the set of instructions are executable by the one or more processing devices to cause the computer to perform:

adjusting a weighting of the bit allocation for the plurality of regions based on the demanded quality information and reconstructed video to generate the quality information.
The non-transitory computer readable storage medium of claim 17, wherein the set of instructions are executable by the one or more processing devices to cause the computer to perform:

obtaining side information during encoding of the video; and

training the region bit allocation neural network model based on the side information.