WO2024001886A1 - Coding unit division method, electronic device and computer-readable storage medium - Google Patents

Coding unit division method, electronic device and computer-readable storage medium

Info

Publication number
WO2024001886A1
WO2024001886A1 PCT/CN2023/101495 CN2023101495W
Authority
WO
WIPO (PCT)
Prior art keywords
coding
depth
coding unit
image
image blocks
Prior art date
Application number
PCT/CN2023/101495
Other languages
English (en)
French (fr)
Inventor
曹洲
徐科
孔德辉
杨维
任聪
Original Assignee
深圳市中兴微电子技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市中兴微电子技术有限公司
Publication of WO2024001886A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146 Data rate or code amount at the encoder output
    • H04N19/149 Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, the unit being an image region, e.g. an object
    • H04N19/176 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, the unit being an image region, the region being a block, e.g. a macroblock
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/182 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, the unit being a pixel

Definitions

  • The present disclosure relates to the field of communications, and in particular to a coding unit division method, an electronic device, and a computer-readable storage medium.
  • The present disclosure provides a coding unit division method, an electronic device, and a computer-readable storage medium.
  • The present disclosure provides a coding unit division method, which includes: dividing an original image to obtain multiple coding tree units; dividing each coding tree unit, at image-block granularity, into a one-dimensional array of image blocks comprising multiple image blocks; performing a visual attention mechanism calculation on each image block in the one-dimensional array of image blocks to obtain a coding unit division depth corresponding to each image block in the one-dimensional array of image blocks; and dividing the coding tree unit into coding units according to the coding unit division depth corresponding to each image block.
  • The present disclosure provides an electronic device, which includes: one or more processors; and a memory storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the coding unit division method according to the present application.
  • The present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the coding unit division method according to the present application.
  • Figure 1 is a flowchart of a coding unit division method according to the present disclosure.
  • Figure 2 is a flowchart of a traditional encoding method.
  • Figure 3 is a schematic diagram of a ViT-based video encoding and decoding CU division structure according to the present disclosure.
  • Figure 4 is a table of the correspondence between coding unit division size and division depth according to the present disclosure.
  • Figure 5 is a schematic diagram of the division depths of a coding tree unit (64×64) according to the present disclosure.
  • Figure 6 is a schematic diagram of an electronic device according to the present disclosure.
  • Figure 7 is a schematic diagram of a computer-readable storage medium according to the present disclosure.
  • With the introduction of new coding techniques in HEVC, the encoding time complexity has increased several times compared with H264/AVC.
  • The division process based on the coding unit (CU, Coding Unit) quadtree requires traversing all possible CU division results in the current coding tree unit (CTU, Coding Tree Unit) (64×64 pixels), calculating the rate-distortion cost (RDC, Rate Distortion Cost) for each CU division mode, and finally selecting the CU division mode with the smallest RDC for encoding.
  • Aiming at the conflict between coding bitrate and coding complexity, the present disclosure proposes a ViT-based video encoding and decoding CU division method built on the ViT attention mechanism.
  • The present disclosure provides a CU division method, as shown in Figure 1.
  • The method includes the following steps S100 to S400.
  • In step S100, an original image is divided to obtain multiple CTUs.
  • In step S200, each CTU is divided, at image-block granularity, into a one-dimensional array of image blocks comprising multiple image blocks.
  • In step S300, a visual attention mechanism calculation is performed on each image block in the one-dimensional array of image blocks to obtain the CU division depth corresponding to each image block in the one-dimensional array of image blocks.
  • In step S400, the CTU is divided into CUs according to the CU division depth corresponding to each image block.
  • The present disclosure provides a video encoding and decoding CU block division method based on a visual attention mechanism.
  • A 64×64-pixel CTU can be divided into 256 patches of 4×4 pixels, and the 256 patches originally arranged as a matrix can be transformed into a linearly arranged one-dimensional patch array.
  • Here, "one-dimensional" refers to the patches, not to the pixel dimensions; in terms of pixels, the dimension is 256×16, where 256 is the number of patch blocks and 16 is the number of pixels contained in each patch block.
  • The above encoded information is linearly projected, position coding information is added, and the result is input into the Vision Transformer (ViT).
  • After a series of attention mechanism calculations, the CU division depth corresponding to each patch is obtained, and the current CTU is then divided into CUs according to the CU division depth corresponding to each patch.
  • Each CTU in the original image is divided into CUs in this way, thereby realizing CU division of the entire original image.
  • Figure 2 shows the overall flowchart of H265/HEVC encoding.
  • The part inside the dashed box in the figure is where the ViT network of the present disclosure replaces the traditional algorithm.
  • The present disclosure uses the ViT network to replace the traditional computation of looping over candidates to find the optimal CU division.
  • Through proper training and learning, the ViT-based video encoding and decoding CU division method proposed in the present disclosure not only avoids the coding complexity caused by traversing all CU divisions in the traditional method but also, compared with traditional convolutional neural networks (CNN, Convolutional Neural Networks), further reduces the computational complexity of the neural network and improves the encoding speed; through the learning of the attention mechanism, the video encoding quality can be well guaranteed while the real-time performance and reliability of the H265/HEVC protocol are improved.
  • In some embodiments, the CTU is divided into multiple layers according to the depth values of the CTU, and dividing the CTU into CUs according to the CU division depth corresponding to each image block (i.e., step S400) includes: dividing the CTU into CUs layer by layer in depth order according to the CU division depth corresponding to each image block.
  • Dividing the CTU layer by layer in depth order includes: starting from the layer where the current depth i is 0, traversing and counting the number Ni of image blocks in the CTU whose corresponding CU division depth is greater than i, where i denotes the current depth, and i and Ni are natural numbers; if Ni is greater than the CU division threshold αi preset for the current depth i, dividing the current CU in the CTU and continuing CU division at the next depth, where αi is greater than 0; otherwise, ending the CU division of the CTU.
  • According to the CU division method provided by the present disclosure, the current CTU is traversed, counted, and divided at each depth according to the CU division depth corresponding to each patch; this replaces the traditional looping over every CU division mode, thereby avoiding the computational complexity caused by traversing all CU division modes and improving the encoding speed.
  • A CU division threshold is preset for each depth as the criterion for deciding whether CU division is required at the current depth.
  • For example, the CTU depth ranges from 0 to 3, divided into four levels from top to bottom, 64×64 → 32×32 → 16×16 → 8×8, and a CU division threshold αi is preset for each level.
  • Based on the CU division depth corresponding to each patch calculated by the above visual attention mechanism, at each level the number of patch blocks whose CU division depth exceeds the current level is counted as the number of patch depth prediction values at the current depth; if this number is greater than the preset CU division threshold of the current level, the current CU is considered divisible and is divided into four parts.
  • The statistics are recursively traversed layer by layer in depth order from top to bottom, and the CTU is divided into CUs, until at some depth the number of patch depth prediction values is less than or equal to the CU division threshold, at which point the CU division of the CTU is completed.
  • The preset CU division threshold is not limited to comparison with the number of patch depth prediction values; more comparison rules can be added, or comparisons can be made against other indicators, for example, the ratio of the number of patch depth prediction values to the total number of patches at the current level, or a comparison of the patch depth with the actual depth of the currently traversed CU.
  • Performing the visual attention mechanism calculation on each image block in the one-dimensional array of image blocks to obtain the CU division depth corresponding to each image block includes: expanding the dimension of the one-dimensional array of image blocks to obtain an expanded input array; performing the visual attention mechanism calculation on the input array to obtain a first calculation result; performing a fully connected layer calculation on the first calculation result to obtain a second calculation result; and performing a softmax layer calculation on the second calculation result to obtain the CU division depth corresponding to each image block in the one-dimensional array of image blocks.
  • Expanding the dimension of the one-dimensional array of image blocks includes linearly projecting the one-dimensional array of image blocks and adding position coding information to obtain the expanded input array.
  • Figure 3 is a schematic diagram of a ViT-based video encoding and decoding CU division structure according to the present disclosure.
  • The left side shows the CTU (64×64) divided into image blocks of size 4×4.
  • The right side shows the divided image blocks first being flattened into a one-dimensional input and then linearly projected (Linear Projection), after which position coding information is added and the result is input into the Vision Transformer layer.
  • After the visual attention mechanism calculation between the patches, the output corresponding to each patch block becomes the input of the subsequent fully connected layer (Fully Connected Layer); then, after the Softmax layer (Softmax Layer), the CU division depth corresponding to each patch block is obtained and mapped to the depth label of the image block (Patches Depth Label).
  • Performing the visual attention mechanism calculation on each image block in the one-dimensional array of image blocks to obtain the CU division depth corresponding to each image block further includes: performing at least one round of visual attention mechanism training by means of backpropagation to obtain a depth prediction value of each image block; and determining, by means of a loss function, the depth prediction value of the image block with the smallest loss as the CU division depth corresponding to each image block in the one-dimensional array of image blocks.
  • The CU division depth corresponding to each image block in the one-dimensional array of image blocks can be used as the label corresponding to each image block.
  • The image block may be a 4×4 pixel block divided according to the CU minimum division unit, and the CTU may be a 64×64 pixel block.
  • Figure 4 is a correspondence table between CU division size and CU division depth.
  • The CU division size may be any one of 64×64, 32×32, 16×16, 8×8, and 4×4, and the corresponding CU division depths are 0, 1, 2, 3, and 4, respectively.
  • Figure 5 is a schematic diagram of one CU division mode within a CTU (64×64), in which the CU blocks and their corresponding division depths are marked.
  • Basic deep learning operations that can be used in the embodiments of the present disclosure include, but are not limited to: ViT, deconvolution (Deconvolution), the rectified linear unit (ReLU), the sigmoid function (Sigmoid), full connection (Full-Connection), reshaping (Reshape), and the like.
  • In addition to the ViT network structure, network structures such as the residual network (ResNet, Residual Network), CNN, or the squeeze-and-excitation residual network (SE-ResNet) may also be used instead of ViT to implement CU division.
  • The CU division method provided by the embodiments of the present disclosure can be applied wherever video encoding is required, such as video processing units (VPUs, Video Processing Units), built-in algorithms of video codec chips, smart cockpits, video compression, and video transmission.
  • Figure 3 depicts the overall process of the present disclosure.
  • The input is a CTU (64×64) of H265/HEVC video.
  • The output is the division depth corresponding to each patch (4×4) block after the CTU is divided on a patch basis.
  • The CU division can then be completed according to the statistics of the depths corresponding to the patch blocks in the CTU.
  • This example mainly includes steps such as dataset and label preparation, Vision Transformer training, Vision Transformer inference, and CU statistical division; the implementation details of each step are described below.
  • Step 1: Dataset and label preparation
  • The original data can be obtained from video sequence images in public datasets such as Vimeo90K, REDS4, and VID4.
  • The original image is cut into 64×64 blocks to obtain CTUs; then the CU division modes in each CTU block are exhaustively traversed, the rate-distortion cost (RDC) is calculated for each in turn, and the CU division mode with the smallest rate-distortion cost is selected.
  • The CTU block is then divided into 4×4 patch blocks, and each patch block is assigned the depth (depth label) of the CU block to which it belongs.
  • Step 2: Vision Transformer training
  • The loss function of the Softmax layer is the multi-class cross-entropy loss: L = -∑_{i=1}^{C} y(x_i) log p(x_i).
  • Here C = 5 is the number of depth classes; the depth label can take the values 0, 1, 2, 3, and 4, and p(x_i) represents the probability of each possible depth outcome.
  • After the training of the Vision Transformer, the fully connected layer, and the Softmax layer is completed, in the inference phase, as described in the Vision Transformer training step, the one-dimensional data transformed from the patch blocks into which the current CTU block is divided is used as input, and the depth corresponding to each patch block is obtained after passing through the Vision Transformer, the fully connected layer, and the Softmax layer.
  • Step 4: CU statistical division
  • The present disclosure also provides an electronic device, as shown in Figure 6, which includes: one or more processors 501; and a memory 502 storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the CU division method according to the embodiments of the present disclosure.
  • The electronic device may also include one or more I/O interfaces 503, connected between the processor and the memory and configured to implement information exchange between the processor and the memory.
  • The processor 501 is a device with data processing capability, including but not limited to a central processing unit (CPU); the memory 502 is a device with data storage capability, including but not limited to random access memory (RAM, more specifically SDRAM, DDR, etc.), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), and flash memory (FLASH); the I/O interface (read-write interface) 503 is connected between the processor 501 and the memory 502 and can implement information exchange between them, including but not limited to a data bus (Bus) and the like.
  • The processor 501, the memory 502, and the I/O interface 503 are connected to one another through a bus 504 and, in turn, to the other components of the computing device.
  • The present disclosure also provides a computer-readable storage medium. As shown in Figure 7, a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the CU division method according to the embodiments of the present disclosure is implemented.
  • Although H265/HEVC greatly improves the compression rate compared with the previous-generation video codec standard H264/AVC, its encoding complexity also grows several times over.
  • Traversing all CU divisions, calculating the RDC, and selecting the optimal CU division from among them is the main source of the time consumed by H265/HEVC encoding.
  • The present disclosure proposes a ViT-based video encoding and decoding CU division method that takes the minimum basic blocks into which a CTU is divided as input and outputs the division depth of each corresponding minimum basic block; through the calculation of the image-block attention mechanism, video coding efficiency is improved, making real-time high-quality video encoding possible.
  • The division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
  • The term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.


Abstract

The present disclosure provides a coding unit division method, an electronic device, and a computer-readable storage medium. The coding unit division method includes: dividing an original image to obtain multiple coding tree units; dividing each coding tree unit, at image-block granularity, into a one-dimensional array of image blocks comprising multiple image blocks; performing a visual attention mechanism calculation on each image block in the one-dimensional array of image blocks to obtain a coding unit division depth corresponding to each image block in the one-dimensional array of image blocks; and dividing the coding tree unit into coding units according to the coding unit division depth corresponding to each image block.

Description

Coding unit division method, electronic device and computer-readable storage medium
Cross-reference to related applications
This application claims priority to the Chinese patent application CN 202210770312.5, entitled "CU division method, electronic device and computer-readable storage medium" and filed on June 30, 2022, the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to the field of communications, and in particular to a coding unit division method, an electronic device, and a computer-readable storage medium.
Background
With the continuous improvement of people's material and spiritual living standards, the demand for high-definition and even ultra-high-definition video is increasingly urgent. The core problem of how to reduce bandwidth consumption while guaranteeing video quality has gradually become the focus of research. Compared with the previous-generation video coding standard H264/Advanced Video Coding (AVC), H265/High Efficiency Video Coding (HEVC) greatly improves the compression rate by introducing techniques such as the coding unit (CU, Coding Unit) quadtree division structure and multi-angle intra prediction.
Summary
The present disclosure provides a coding unit division method, an electronic device, and a computer-readable storage medium.
The present disclosure provides a coding unit division method, including: dividing an original image to obtain multiple coding tree units; dividing each coding tree unit, at image-block granularity, into a one-dimensional array of image blocks comprising multiple image blocks; performing a visual attention mechanism calculation on each image block in the one-dimensional array of image blocks to obtain a coding unit division depth corresponding to each image block in the one-dimensional array of image blocks; and dividing the coding tree unit into coding units according to the coding unit division depth corresponding to each image block.
The present disclosure provides an electronic device, including: one or more processors; and a memory storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the coding unit division method according to the present application.
The present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the coding unit division method according to the present application.
Brief description of the drawings
Figure 1 is a flowchart of a coding unit division method according to the present disclosure.
Figure 2 is a flowchart of a traditional encoding method.
Figure 3 is a schematic diagram of a ViT-based video encoding and decoding CU division structure according to the present disclosure.
Figure 4 is a table of the correspondence between coding unit division size and division depth according to the present disclosure.
Figure 5 is a schematic diagram of the division depths of a coding tree unit (64×64) according to the present disclosure.
Figure 6 is a schematic diagram of an electronic device according to the present disclosure.
Figure 7 is a schematic diagram of a computer-readable storage medium according to the present disclosure.
Detailed description
It should be understood that the specific embodiments described herein are only intended to explain the present disclosure, not to limit it.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are only intended to facilitate the description of the present disclosure and carry no special meaning of their own. Therefore, "module", "component", and "unit" may be used interchangeably.
With the introduction of new coding techniques in HEVC, the encoding time complexity has increased several times compared with H264/AVC. The coding unit (CU, Coding Unit) quadtree division process needs to traverse all possible CU division results in the current coding tree unit (CTU, Coding Tree Unit) (64×64 pixels), then calculate the rate-distortion cost (RDC, Rate Distortion Cost) for each CU division mode, and finally select the CU division mode with the smallest RDC for encoding. Although this process can reduce the bitrate, it occupies 80% of the encoding time, and the encoding complexity grows several times over. Therefore, how to find the optimal CU division more efficiently within the HEVC standard is crucial to accelerating the execution efficiency of H265/HEVC encoding.
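For orientation, the exhaustive search that the present disclosure seeks to replace can be sketched as a recursive quadtree evaluation. The sketch below is a minimal illustration, not the HEVC reference implementation: `rd_cost` is a hypothetical stand-in for the encoder's rate-distortion measurement, and the 8×8 stopping size is an assumption matching the four-level division described later.

```python
# Minimal sketch of the exhaustive CU quadtree search described above.
# `rd_cost(x, y, size)` is a hypothetical stand-in for the encoder's
# rate-distortion measurement of coding one block as a single CU.

def best_partition(x, y, size, rd_cost, min_cu=8):
    """Return (cost, tree) for the cheapest CU partition of the block at (x, y)."""
    cost_here = rd_cost(x, y, size)        # cost of coding this block as one CU
    if size <= min_cu:                     # smallest CU reached: no further split
        return cost_here, ("leaf", x, y, size)
    half = size // 2
    # Recursively evaluate the four quadrants of the quadtree split.
    sub = [best_partition(x + dx, y + dy, half, rd_cost, min_cu)
           for dy in (0, half) for dx in (0, half)]
    cost_split = sum(cost for cost, _ in sub)
    if cost_split < cost_here:             # keep whichever alternative is cheaper
        return cost_split, ("split", [tree for _, tree in sub])
    return cost_here, ("leaf", x, y, size)
```

Every CTU requires this full recursion in the traditional scheme, which is exactly the cost the attention-based prediction described below is designed to remove.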
Since the Vision Transformer (ViT) mechanism was proposed, the patch-based attention mechanism has received wide attention in image and video applications because of its advantages over traditional convolutional neural networks in terms of required training data, computational complexity, and practical performance.
Aiming at the conflict between coding bitrate and coding complexity, the present disclosure proposes a ViT-based video encoding and decoding CU division method built on the ViT attention mechanism.
The present disclosure provides a CU division method. As shown in Figure 1, the method includes the following steps S100 to S400.
In step S100, an original image is divided to obtain multiple CTUs.
In step S200, each CTU is divided, at image-block granularity, into a one-dimensional array of image blocks comprising multiple image blocks.
In step S300, a visual attention mechanism calculation is performed on each image block in the one-dimensional array of image blocks to obtain the CU division depth corresponding to each image block in the one-dimensional array of image blocks.
In step S400, the CTU is divided into CUs according to the CU division depth corresponding to each image block.
The present disclosure provides a video encoding and decoding CU block division method based on a visual attention mechanism. First, the original image is divided into multiple CTUs. The current CTU pixel block (e.g., 64×64 pixels) is split into multiple image blocks (patches) according to the CU minimum division unit; that is, the size of each patch block is the CU minimum division unit.
With the patch as the basic block, each patch is encoded using the pixel values it contains. All patches in a CTU are arranged into a one-dimensional array with the patch as the unit. For example, a 64×64-pixel CTU can be divided into 256 patches of 4×4 pixels, and the 256 patches originally arranged as a matrix are transformed into a linearly arranged one-dimensional patch array. It should be noted that "one-dimensional" refers to the patches rather than to the pixel dimensions; in terms of pixels, the dimension is 256×16, where 256 is the number of patch blocks and 16 is the number of pixels contained in each patch block.
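As a concrete illustration of this rearrangement, the following NumPy sketch turns a 64×64 pixel matrix into the 256×16 patch array described above; the function name and the row-major patch ordering are assumptions for illustration.

```python
import numpy as np

def ctu_to_patch_array(ctu, patch=4):
    """Rearrange a (64, 64) CTU into a (256, 16) one-dimensional patch array."""
    n = ctu.shape[0] // patch                    # 16 patches per side
    blocks = ctu.reshape(n, patch, n, patch)     # split both axes into (patch index, offset)
    blocks = blocks.transpose(0, 2, 1, 3)        # group the two patch-index axes together
    return blocks.reshape(n * n, patch * patch)  # (256 patches, 16 pixels each)

ctu = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
patches = ctu_to_patch_array(ctu)
print(patches.shape)  # (256, 16)
```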
Afterwards, the above encoded information is linearly projected, position coding information is added, and the result is input into the Vision Transformer (ViT). After a series of calculations by the Vision Transformer attention mechanism, the CU division depth corresponding to each patch is obtained, and the current CTU is then divided into CUs according to the CU division depth corresponding to each patch. Performing CU division on every CTU in the original image in this way realizes CU division of the entire original image.
Figure 2 shows the overall flowchart of H265/HEVC encoding; the part inside the dashed box is where the ViT network of the present disclosure replaces the traditional algorithm. As can be seen from the figure, the present disclosure uses the ViT network to replace the traditional computation of looping over candidates to find the optimal CU division.
Compared with traditional H265/HEVC, the ViT-based video encoding and decoding CU division method proposed in the present disclosure, through proper training and learning, avoids the coding complexity caused by traversing all CU division modes in the traditional method; compared with traditional convolutional neural networks (CNN, Convolutional Neural Networks), it further reduces the computational complexity of the neural network and improves the encoding speed. Through the learning of the attention mechanism, the video encoding quality can be well guaranteed, while the real-time performance and reliability of the H265/HEVC protocol are improved.
In some embodiments, the CTU is divided into multiple layers according to the depth values of the CTU, and dividing the CTU into CUs according to the CU division depth corresponding to each image block (i.e., step S400) includes: dividing the CTU into CUs layer by layer in depth order according to the CU division depth corresponding to each image block.
Further, dividing the CTU into CUs layer by layer in depth order according to the CU division depth corresponding to each image block includes: starting from the layer where the current depth i is 0, traversing and counting the number Ni of image blocks in the CTU whose corresponding CU division depth is greater than i, where i denotes the current depth, and i and Ni are natural numbers; if Ni is greater than the CU division threshold αi preset for the current depth i, dividing the current CU in the CTU and continuing CU division at the next depth, where αi is greater than 0; otherwise, ending the CU division of the CTU.
According to the CU division method provided by the present disclosure, the current CTU is traversed, counted, and divided at each depth according to the CU division depth corresponding to each patch. This replaces the traditional looping over every CU division mode, thereby avoiding the computational complexity caused by traversing all CU division modes and improving the encoding speed.
In the embodiments of the present disclosure, a CU division threshold is preset for each depth as the criterion for deciding whether CU division is required at the current depth.
For example, the CTU depth ranges from 0 to 3, divided into four levels from top to bottom, 64×64 → 32×32 → 16×16 → 8×8, and a CU division threshold αi is preset for each level. Based on the CU division depth corresponding to each patch calculated by the above visual attention mechanism, at each level the number of patch blocks whose CU division depth exceeds the current level is counted as the number of patch depth prediction values at the current depth. If the number of patch depth prediction values at the current level is greater than the preset CU division threshold of the current level, the current CU is considered divisible and is divided into four parts. After one level is divided, the number of patch depth prediction values at the next depth is counted for the next level and compared with the CU division threshold corresponding to the next depth to decide whether to perform the CU division step. By analogy, the statistics are recursively traversed layer by layer in depth order from top to bottom and the CTU is divided into CUs, until at some depth the number of patch depth prediction values is less than or equal to the CU division threshold, at which point the CU division of the CTU ends.
It should be noted that the preset CU division threshold is not limited to comparison with the number of patch depth prediction values; more comparison rules can be added, or comparisons can be made against other indicators, for example, the ratio of the number of patch depth prediction values to the total number of patches at the current level, or a comparison of the patch depth with the actual depth of the currently traversed CU.
In some embodiments, performing the visual attention mechanism calculation on each image block in the one-dimensional array of image blocks to obtain the CU division depth corresponding to each image block (i.e., step S300) includes: expanding the dimension of the one-dimensional array of image blocks to obtain an expanded input array; performing the visual attention mechanism calculation on the input array to obtain a first calculation result; performing a fully connected layer calculation on the first calculation result to obtain a second calculation result; and performing a softmax layer calculation on the second calculation result to obtain the CU division depth corresponding to each image block in the one-dimensional array of image blocks.
In some embodiments, expanding the dimension of the one-dimensional array of image blocks includes: linearly projecting the one-dimensional array of image blocks and adding position coding information to obtain the expanded input array.
Figure 3 is a schematic diagram of a ViT-based video encoding and decoding CU division structure according to the present disclosure. On the left, the figure shows the CTU (64×64) divided into image blocks of size 4×4; on the right, the divided image blocks are first flattened into a one-dimensional input and, after linear projection (Linear Projection), position coding information is added and the result is input into the Vision Transformer layer. After the visual attention mechanism calculation between the patches, the output corresponding to each patch block becomes the input of the subsequent fully connected layer (Fully Connected Layer); then, after the Softmax layer (Softmax Layer), the CU division depth corresponding to each patch block is obtained and mapped to the depth label of the image block (Patches Depth Label).
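A hedged sketch of this pipeline in PyTorch is given below. The 256×16 input and 768-dimensional embedding follow the text; the encoder depth, head count, and class name are illustrative assumptions, and a standard Transformer encoder stands in for the ViT layers of Figure 3.

```python
import torch
import torch.nn as nn

class PatchDepthViT(nn.Module):
    """Linear projection + position encoding + Transformer encoder + FC head."""
    def __init__(self, n_patches=256, patch_dim=16, embed_dim=768,
                 n_classes=5, n_heads=12, n_layers=6):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)    # linear projection
        self.pos = nn.Parameter(torch.zeros(1, n_patches, embed_dim))  # position encoding
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.fc = nn.Linear(embed_dim, n_classes)      # fully connected layer

    def forward(self, x):                 # x: (batch, 256, 16) flattened patches
        h = self.proj(x) + self.pos       # (batch, 256, 768)
        h = self.encoder(h)               # attention between patches
        return self.fc(h)                 # per-patch logits; Softmax is applied in the loss

model = PatchDepthViT()
print(model(torch.randn(2, 256, 16)).shape)  # torch.Size([2, 256, 5])
```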
In some embodiments, performing the visual attention mechanism calculation on each image block in the one-dimensional array of image blocks to obtain the CU division depth corresponding to each image block (i.e., step S300) further includes: performing at least one round of visual attention mechanism training by means of backpropagation to obtain a depth prediction value of each image block; and determining, by means of a loss function, the depth prediction value of the image block with the smallest loss as the CU division depth corresponding to each image block in the one-dimensional array of image blocks.
In the embodiments of the present disclosure, the CU division depth corresponding to each image block in the one-dimensional array of image blocks can be used as the label corresponding to each image block. After the CU division depth corresponding to each image block is obtained by the visual attention mechanism calculation, the Vision Transformer layer, the fully connected layer, and the Softmax layer can be further trained by backpropagation to obtain the depth prediction value of each image block, which is compared with the true label; the depth prediction value of each image block is evaluated by the Softmax-layer loss function, and if the loss has reached its minimum, the result is considered optimal.
In the embodiments of the present disclosure, the image block may be a 4×4 pixel block divided according to the CU minimum division unit, and the CTU may be a 64×64 pixel block.
Figure 4 is a correspondence table between CU division size and CU division depth: the CU division size may be any one of 64×64, 32×32, 16×16, 8×8, and 4×4, and the corresponding CU division depths are 0, 1, 2, 3, and 4, respectively.
Figure 5 is a schematic diagram of one CU division mode within a CTU (64×64), in which the CU blocks and their corresponding division depths are marked.
During the training and learning of the visual attention mechanism, it is also possible, after predicting the CU division depth for each patch based on ViT, to count the patch-block CU depth prediction values and re-divide the CTU statistically from top to bottom.
Basic deep learning operations that can be used in the embodiments of the present disclosure include, but are not limited to: ViT, deconvolution (Deconvolution), the rectified linear unit (ReLU), the sigmoid function (Sigmoid), full connection (Full-Connection), reshaping (Reshape), and the like.
It should be noted that, in addition to the ViT network structure, network structures such as the residual network (ResNet, Residual Network), CNN, or the squeeze-and-excitation residual network (SE-ResNet) may also be used instead of ViT to implement CU division.
The CU division method provided by the embodiments of the present disclosure can be applied wherever video encoding is required, such as video processing units (VPUs, Video Processing Units), built-in algorithms of video codec chips, smart cockpits, video compression, and video transmission.
The specific application of the CU division method of the present disclosure is introduced below with an example.
Figure 3 depicts the overall process of the present disclosure. The input is a CTU (64×64) of H265/HEVC video, and the output is the division depth corresponding to each patch (4×4) block after the CTU is divided on a patch basis. Finally, the CU division can be completed according to the statistics of the depths corresponding to the patch blocks in the CTU.
This example mainly includes steps such as dataset and label preparation, Vision Transformer training, Vision Transformer inference, and CU statistical division; the implementation details of each step are described below.
Step 1: Dataset and label preparation
The original data can be obtained from video sequence images in public datasets such as Vimeo90K, REDS4, and VID4. First, the original image is cut into 64×64 blocks to obtain CTUs; then the CU division modes in each CTU block are exhaustively traversed, the rate-distortion cost (RDC) is calculated for each in turn, and the CU division mode with the smallest rate-distortion cost is selected. Finally, the CTU block is divided into 4×4 patch blocks, and each patch block is assigned the depth (depth label) of the CU block to which it belongs.
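The label assignment at the end of Step 1 can be sketched as follows, assuming the leaf CUs of the selected partition are available as (x, y, size) tuples; the size-to-depth mapping follows Figure 4, and all names are illustrative.

```python
import numpy as np

SIZE_TO_DEPTH = {64: 0, 32: 1, 16: 2, 8: 3, 4: 4}   # per Figure 4

def patch_depth_labels(leaf_cus, ctu_size=64, patch=4):
    """Give every 4x4 patch the division depth of the CU that contains it."""
    n = ctu_size // patch
    labels = np.zeros((n, n), dtype=np.int64)
    for x, y, size in leaf_cus:
        labels[y // patch:(y + size) // patch,
               x // patch:(x + size) // patch] = SIZE_TO_DEPTH[size]
    return labels.reshape(-1)                        # 256 labels, one per patch

# Example: top-left 32x32 quadrant split once more, the rest kept at 32x32.
leaves = [(0, 0, 16), (16, 0, 16), (0, 16, 16), (16, 16, 16),
          (32, 0, 32), (0, 32, 32), (32, 32, 32)]
print(patch_depth_labels(leaves).reshape(16, 16))
```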
Step 2: Vision Transformer training
The patch-block data of each CTU obtained above and the corresponding labels are each transformed into one-dimensional data. The transformed patch-block data of each CTU block can be expressed as input = [patch_0, patch_1, patch_2, ..., patch_255], where patch_0, patch_1, patch_2, ..., patch_255 are the encoded patch-block data and the dimension of input is 256×16; the transformed labels corresponding to the patch blocks can be expressed as label = [label_0, label_1, label_2, ..., label_255], where label_0, label_1, label_2, ..., label_255 denote the division depths of the CUs to which patch_0, patch_1, patch_2, ..., patch_255 belong, respectively. After input is linearly projected and position coding information is added, the dimension of input is 256×768; it is input into the Vision Transformer for the attention mechanism calculation, yielding the intermediate output ViT_output = [ViT_0, ViT_1, ViT_2, ..., ViT_255]. ViT_output then passes through the subsequent fully connected layer and Softmax layer to obtain the depth corresponding to each patch block, depth = [dep_0, dep_1, dep_2, ..., dep_255]. Finally, this is compared with the true labels in label, and the Vision Transformer, the fully connected layer, and the Softmax layer are trained by backpropagation. The loss function of the Softmax layer is the multi-class cross-entropy loss: L = -∑_{i=1}^{C} y(x_i) log p(x_i),
where C = 5 denotes the number of depth classes; in the present disclosure, the depth label can take the values 0, 1, 2, 3, and 4, and p(x_i) denotes the probability of each possible depth outcome.
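In practice the Softmax and the cross-entropy above are usually computed together for numerical stability. The sketch below shows one training step for the illustrative PatchDepthViT model given earlier; `F.cross_entropy` fuses the Softmax with the multi-class cross-entropy loss.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, patches, labels):
    """patches: (batch, 256, 16) float; labels: (batch, 256) ints in [0, 4]."""
    optimizer.zero_grad()
    logits = model(patches)                  # (batch, 256, 5) per-patch logits
    loss = F.cross_entropy(logits.reshape(-1, 5), labels.reshape(-1))
    loss.backward()                          # backpropagation through ViT, FC, Softmax
    optimizer.step()
    return loss.item()
```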
Step 3: Vision Transformer inference
After the training of the Vision Transformer, the fully connected layer, and the Softmax layer is completed, in the inference phase, as described in the Vision Transformer training step, the one-dimensional data transformed from the patch blocks into which the current CTU block is divided is used as input, and the depth corresponding to each patch block is obtained after passing through the Vision Transformer, the fully connected layer, and the Softmax layer.
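Continuing the illustrative sketches above, inference then reduces to a forward pass and an argmax over the five depth classes:

```python
import torch

model.eval()
x = torch.from_numpy(patches).unsqueeze(0)      # (1, 256, 16) patch array from one CTU
with torch.no_grad():
    pred = model(x).argmax(dim=-1).squeeze(0)   # (256,) predicted depth per patch
depth_grid = pred.reshape(16, 16)               # back onto the 16x16 patch grid
```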
Step 4: CU statistical division
Division thresholds α0, α1, α2, and α3 are set, and the CTU is recursively traversed, counted, and divided from top to bottom (64×64 → 32×32 → 16×16 → 8×8).
1) First, within the top-level CTU block (depth 0, size 64×64), count the number of patch depth prediction values greater than depth 0; if the proportion of such patch depth prediction values is greater than α0, quarter the current CTU block; otherwise, end the CU division.
2) Within all 32×32 blocks (depth 1, size 32×32), count the number of patch depth prediction values greater than depth 1; if the proportion of such patch depth prediction values is greater than α1, quarter the current CU block; otherwise, end the division of the current CU.
3) Within all 16×16 blocks (depth 2, size 16×16), count the number of patch depth prediction values greater than depth 2; if the proportion of such patch depth prediction values is greater than α2, quarter the current CU block; otherwise, end the division of the current CU.
4) Within all 8×8 blocks (depth 3, size 8×8), count the number of patch depth prediction values greater than depth 3; if the proportion of such patch depth prediction values is greater than α3, quarter the current CU block; otherwise, end the division of the current CU.
This finally yields the final division of the CUs in the CTU, which is used in the subsequent video encoding and decoding process.
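The four numbered steps above amount to one recursion over the 16×16 grid of per-patch depth predictions. The sketch below is illustrative: the threshold values, the proportion-based test, and all names are assumptions consistent with the description rather than fixed by it.

```python
import numpy as np

def divide(depths, x=0, y=0, size=64, depth=0,
           alphas=(0.5, 0.5, 0.5, 0.5), patch=4):
    """Return the leaf CUs (x, y, size) chosen for one 64x64 CTU."""
    px, py, n = x // patch, y // patch, size // patch
    block = depths[py:py + n, px:px + n]
    deeper = np.count_nonzero(block > depth) / block.size  # proportion predicted deeper
    if depth < len(alphas) and deeper > alphas[depth]:
        half = size // 2
        leaves = []
        for dy in (0, half):
            for dx in (0, half):                           # quarter the current block
                leaves += divide(depths, x + dx, y + dy, half,
                                 depth + 1, alphas, patch)
        return leaves
    return [(x, y, size)]                                  # stop dividing here

depths = np.random.randint(0, 5, size=(16, 16))            # stand-in ViT predictions
print(divide(depths))
```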
The present disclosure further provides an electronic device. As shown in Figure 6, the device includes: one or more processors 501; and a memory 502 storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the CU division method according to the embodiments of the present disclosure.
In addition, the electronic device may further include one or more I/O interfaces 503, connected between the processor and the memory and configured to implement information exchange between the processor and the memory.
The processor 501 is a device with data processing capability, including but not limited to a central processing unit (CPU); the memory 502 is a device with data storage capability, including but not limited to random access memory (RAM, more specifically SDRAM, DDR, etc.), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), and flash memory (FLASH); the I/O interface (read-write interface) 503 is connected between the processor 501 and the memory 502 and can implement information exchange between them, including but not limited to a data bus (Bus) and the like.
In some embodiments, the processor 501, the memory 502, and the I/O interface 503 are connected to one another through a bus 504 and, in turn, to the other components of the computing device.
The present disclosure further provides a computer-readable storage medium. As shown in Figure 7, a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements the CU division method according to the embodiments of the present disclosure.
As described above, although H265/HEVC greatly improves the compression rate compared with the previous-generation video codec standard H264/AVC, its encoding complexity also grows several times over; traversing all CU divisions, calculating the RDC, and selecting the optimal CU division from among them is the main source of the time consumed by H265/HEVC encoding. The present disclosure proposes a ViT-based video encoding and decoding CU division approach that takes the minimum basic blocks into which a CTU is divided as input and outputs the division depth of each corresponding minimum basic block. Through the calculation of the image-block attention mechanism, video coding efficiency is improved, making real-time high-quality video encoding possible.
Those of ordinary skill in the art can understand that all or some of the steps in the methods disclosed above, and the functional modules/units in the systems and devices, can be implemented as software, firmware, hardware, and appropriate combinations thereof.
In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The preferred embodiments of the present disclosure have been described above with reference to the accompanying drawings, but they do not thereby limit the scope of rights of the present disclosure. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and essence of the present disclosure shall fall within the scope of rights of the present disclosure.

Claims (11)

  1. A coding unit division method, comprising:
    dividing an original image to obtain multiple coding tree units;
    dividing each coding tree unit, at image-block granularity, into a one-dimensional array of image blocks comprising multiple image blocks;
    performing a visual attention mechanism calculation on each image block in the one-dimensional array of image blocks to obtain a coding unit division depth corresponding to each image block in the one-dimensional array of image blocks; and
    dividing the coding tree unit into coding units according to the coding unit division depth corresponding to each image block.
  2. The coding unit division method according to claim 1, wherein the coding tree unit is divided into multiple layers according to depth values of the coding tree unit, and dividing the coding tree unit into coding units according to the coding unit division depth corresponding to each image block comprises:
    dividing the coding tree unit into coding units layer by layer in depth order according to the coding unit division depth corresponding to each image block.
  3. The coding unit division method according to claim 2, wherein dividing the coding tree unit into coding units layer by layer in depth order according to the coding unit division depth corresponding to each image block comprises:
    starting from the layer where the current depth i is 0, traversing and counting the number Ni of image blocks, among all image blocks in the coding tree unit, whose corresponding coding unit division depth is greater than i, where i denotes the current depth, and i and Ni are natural numbers;
    if Ni is greater than a coding unit division threshold αi preset for the current depth i, dividing the current coding unit in the coding tree unit and continuing coding unit division at the next depth, where αi is greater than 0; and
    if Ni is less than or equal to the coding unit division threshold αi preset for the current depth i, ending the coding unit division of the coding tree unit.
  4. The coding unit division method according to claim 1, wherein performing the visual attention mechanism calculation on each image block in the one-dimensional array of image blocks to obtain the coding unit division depth corresponding to each image block in the one-dimensional array of image blocks comprises:
    expanding the dimension of the one-dimensional array of image blocks to obtain an expanded input array;
    performing the visual attention mechanism calculation on the input array to obtain a first calculation result;
    performing a fully connected layer calculation on the first calculation result to obtain a second calculation result; and
    performing a softmax layer calculation on the second calculation result to obtain the coding unit division depth corresponding to each image block in the one-dimensional array of image blocks.
  5. The coding unit division method according to claim 4, wherein expanding the dimension of the one-dimensional array of image blocks comprises:
    linearly projecting the one-dimensional array of image blocks and adding position coding information to obtain the expanded input array.
  6. The coding unit division method according to claim 4, wherein performing the visual attention mechanism calculation on each image block in the one-dimensional array of image blocks to obtain the coding unit division depth corresponding to each image block in the one-dimensional array of image blocks further comprises:
    performing at least one round of visual attention mechanism training by means of backpropagation to obtain a depth prediction value of each image block; and
    determining, by means of a loss function, the depth prediction value of the image block with the smallest loss as the coding unit division depth corresponding to each image block in the one-dimensional array of image blocks.
  7. The coding unit division method according to any one of claims 1 to 6, wherein the image block is a 4×4 pixel block divided according to a coding unit minimum division unit.
  8. The coding unit division method according to any one of claims 1 to 6, wherein the coding tree unit is a 64×64 pixel block.
  9. An electronic device, comprising:
    one or more processors; and
    a memory storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the coding unit division method according to any one of claims 1 to 8.
  10. The electronic device according to claim 9, further comprising:
    one or more I/O interfaces, connected between the processor and the memory and configured to implement information exchange between the processor and the memory.
  11. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the coding unit division method according to any one of claims 1 to 8.
PCT/CN2023/101495 2022-06-30 2023-06-20 Coding unit division method, electronic device and computer-readable storage medium WO2024001886A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210770312.5 2022-06-30
CN202210770312.5A CN117376572A (zh) CU division method, electronic device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2024001886A1 (zh)

Family

ID=89383268

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101495 WO2024001886A1 (zh) 2022-06-30 2023-06-20 Coding unit division method, electronic device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN117376572A (zh)
WO (1) WO2024001886A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111510728A (zh) * 2020-04-12 2020-08-07 北京工业大学 HEVC intra-frame fast coding method based on deep feature representation and learning
CN113382245A (zh) * 2021-07-02 2021-09-10 中国科学技术大学 Image division method and apparatus
CN113709455A (zh) * 2021-09-27 2021-11-26 北京交通大学 Multi-level image compression method using Transformer
CN114286093A (zh) * 2021-12-24 2022-04-05 杭州电子科技大学 Fast video coding method based on deep neural network
CN114550033A (zh) * 2022-01-29 2022-05-27 珠海横乐医学科技有限公司 Video sequence guidewire segmentation method and apparatus, electronic device, and readable medium


Also Published As

Publication number Publication date
CN117376572A (zh) 2024-01-09


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23830065

Country of ref document: EP

Kind code of ref document: A1