TW202243476A - A front-end architecture for neural network based video coding - Google Patents

A front-end architecture for neural network based video coding

Info

Publication number
TW202243476A
TW202243476A TW110146050A
Authority
TW
Taiwan
Prior art keywords
frame
channel
layer
output
convolutional layer
Prior art date
Application number
TW110146050A
Other languages
Chinese (zh)
Inventor
希勒米艾內斯 艾吉爾梅茲
安基泰許庫瑪 辛格
莫哈美德塞伊德 克班
瑪塔 卡克基維克茲
Original Assignee
美商高通公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/643,383 external-priority patent/US20220191523A1/en
Application filed by 美商高通公司 filed Critical 美商高通公司
Publication of TW202243476A publication Critical patent/TW202243476A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/186Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/002Image coding using neural networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/189Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding
    • H04N19/192Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding the adaptation method, adaptation tool or adaptation type being iterative or recursive
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/439Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation using cascaded computational arrangements for performing a single operation, e.g. filtering
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H04N19/88Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving rearrangement of data among different coding units, e.g. shuffling, interleaving, scrambling or permutation of pixel data or permutation of transform coefficient data among different blocks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117Filters, e.g. for pre-processing or post-processing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/13Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Techniques are described herein for processing video data using a neural network system. For instance, a process can include generating, by a first convolutional layer of an encoder sub-network of the neural network system, output values associated with a luminance channel of a frame. The process can include generating, by a second convolutional layer of the encoder sub-network, output values associated with at least one chrominance channel of the frame. The process can include generating, by a third convolutional layer based on the output values associated with the luminance channel of the frame and the output values associated with the at least one chrominance channel of the frame, a combined representation of the frame. The process can further include generating encoded video data based on the combined representation of the frame.

Description

Front-end architecture for neural network based video coding

In general, this disclosure relates to image and video coding, including the encoding (or compression) and decoding (decompression) of images and/or video. For example, aspects of this disclosure relate to techniques for processing luminance-chrominance (YUV) input formats (e.g., the 4:2:0 YUV input format, the 4:4:4 YUV input format, the 4:2:2 YUV input format, etc.) and/or other input formats using end-to-end machine-learning-based (e.g., neural-network-based) image and video coding systems.

Many devices and systems allow video data to be processed and output for consumption. Digital video data involves large amounts of data to meet the demands of consumers and video providers. For example, consumers of video data expect high-quality video, including high fidelity, high resolution, high frame rates, and the like. As a result, the large amount of video data required to meet these demands places a burden on the communication networks and the devices that process and store the video data.

Various video coding techniques can be used to compress video data. One goal of video coding is to compress video data into a form that uses a lower bit rate while avoiding or minimizing degradation of video quality. As ever-evolving video services become available, coding techniques with better coding efficiency are needed.

Systems and techniques are described for coding (e.g., encoding and/or decoding) image and/or video content using one or more machine learning systems. For example, an end-to-end machine-learning-based (e.g., neural-network-based) image and video coding (E2E-NNVC) system is provided that can process YUV (digital-domain YCbCr) input formats (and, in some cases, other input formats), in some cases specifically the 4:2:0 YUV input format. The E2E-NNVC system can process standalone frames (also referred to as images or pictures) and/or video data that includes multiple frames. The YUV format includes a luminance channel (Y) and a pair of chrominance channels (U and V). The U and V channels can be subsampled relative to the Y channel without a significant or noticeable impact on visual quality. In the YUV format, the correlation between channels is reduced, which can differ from other color formats (e.g., the red-green-blue (RGB) format). Aspects of the systems and techniques described herein provide a front-end architecture (e.g., new sub-networks) that adapts the YUV 4:2:0 input format (and, in some cases, other input formats) to E2E-NNVC systems designed for RGB input formats (and, in some cases, E2E-NNVC systems designed for other input formats). The front-end architecture is applicable to many E2E-NNVC architectures.
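As a quick illustration of the 4:2:0 subsampling mentioned above (array sizes here are hypothetical, not taken from this disclosure), the chrominance planes carry a quarter as many samples each as the luminance plane:

```python
import numpy as np

# Hypothetical frame size (not from this disclosure).
H, W = 128, 128
y = np.zeros((H, W))            # luminance plane at full resolution
u = np.zeros((H // 2, W // 2))  # chrominance subsampled 2x horizontally and vertically
v = np.zeros((H // 2, W // 2))

# 4:2:0 carries 1.5 samples per pixel versus 3 for RGB or 4:4:4 YUV.
samples_per_pixel = (y.size + u.size + v.size) / (H * W)
```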

In one illustrative example, a method of processing video data is provided. The method includes: generating, by a first convolutional layer of an encoder sub-network of a neural network system, output values associated with a luminance channel of a frame; generating, by a second convolutional layer of the encoder sub-network, output values associated with at least one chrominance channel of the frame; generating, by a third convolutional layer, a combined representation of the frame based on the output values associated with the luminance channel of the frame and the output values associated with the at least one chrominance channel of the frame; and generating encoded video data based on the combined representation of the frame.
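The encoder front-end in this example can be sketched as a toy NumPy model. The layer count matches the example (one luminance branch, one chrominance branch, one combining 1x1 layer), but the kernel sizes, strides, and channel counts are illustrative assumptions, not the disclosed architecture; the luminance branch here downsamples Y to the chrominance resolution so the two branches can be concatenated:

```python
import numpy as np

def conv2d(x, w, stride=1):
    """Valid 2-D convolution. x: (C_in, H, W); w: (C_out, C_in, K, K)."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    oh, ow = (h - k) // stride + 1, (wd - k) // stride + 1
    out = np.zeros((c_out, oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[:, i * stride:i * stride + k, j * stride:j * stride + k]
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    return out

rng = np.random.default_rng(0)
H, W, N = 32, 32, 8                            # illustrative sizes, N channels per branch
y = rng.standard_normal((1, H, W))             # luminance input
uv = rng.standard_normal((2, H // 2, W // 2))  # subsampled chrominance input

# First convolutional layer (luminance branch): stride 2 brings Y down
# to the chrominance resolution.
y_out = conv2d(y, rng.standard_normal((N, 1, 2, 2)), stride=2)

# Second convolutional layer (chrominance branch) keeps the subsampled size.
uv_out = conv2d(uv, rng.standard_normal((N, 2, 1, 1)))

# Third (1x1) convolutional layer combines the two branches into one representation.
combined = conv2d(np.concatenate([y_out, uv_out]),
                  rng.standard_normal((N, 2 * N, 1, 1)))
```

In a real E2E-NNVC system the combined representation would then be fed to the remainder of the encoder sub-network; here it simply demonstrates that both branches arrive at a common spatial resolution before mixing.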

In another example, an apparatus for processing video data is provided that includes a memory and a processor (e.g., implemented in circuitry) coupled to the memory. In some examples, more than one processor can be coupled to the memory and can be used to perform one or more of the operations. The processor is configured to: generate, using a first convolutional layer of an encoder sub-network of a neural network system, output values associated with a luminance channel of a frame; generate, using a second convolutional layer of the encoder sub-network, output values associated with at least one chrominance channel of the frame; generate, using a third convolutional layer, a combined representation of the frame based on the output values associated with the luminance channel of the frame and the output values associated with the at least one chrominance channel of the frame; and generate encoded video data based on the combined representation of the frame.

In another example, a non-transitory computer-readable medium for encoding video data is provided that has instructions stored thereon that, when executed by one or more processors, cause the one or more processors to: generate, using a first convolutional layer of an encoder sub-network of a neural network system, output values associated with a luminance channel of a frame; generate, using a second convolutional layer of the encoder sub-network, output values associated with at least one chrominance channel of the frame; generate, using a third convolutional layer, a combined representation of the frame based on the output values associated with the luminance channel of the frame and the output values associated with the at least one chrominance channel of the frame; and generate encoded video data based on the combined representation of the frame.

In another example, an apparatus for processing video data is provided. The apparatus includes: means for generating, via a first convolutional layer of an encoder sub-network of a neural network system, output values associated with a luminance channel of a frame; means for generating, via a second convolutional layer of the encoder sub-network, output values associated with at least one chrominance channel of the frame; means for generating, via a third convolutional layer, a combined representation of the frame based on the output values associated with the luminance channel of the frame and the output values associated with the at least one chrominance channel of the frame; and means for generating encoded video data based on the combined representation of the frame.

In some aspects, the third convolutional layer includes a 1x1 convolutional layer. The 1x1 convolutional layer includes one or more 1x1 convolutional filters.
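A 1x1 convolution mixes channels independently at each spatial position; equivalently, it applies the same matrix multiply over the channel axis at every pixel. A minimal illustrative sketch (shapes are assumptions):

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x is (C_in, H, W), w is (C_out, C_in).
    Each output pixel is a linear combination of that pixel's input channels."""
    return np.einsum('oc,chw->ohw', w, x)

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 4, 4))   # e.g. concatenated luma + chroma features
w = rng.standard_normal((3, 6))
out = conv1x1(x, w)                  # (3, 4, 4)
```

Because there is no spatial support, a 1x1 layer changes only the channel dimension, which is why it is a natural choice for fusing (or splitting) luminance and chrominance feature maps that already share a spatial resolution.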

In some aspects, the methods, apparatuses, and computer-readable media described above for processing video data further include: processing, using a first nonlinear layer of the encoder sub-network, the output values associated with the luminance channel of the frame; and processing, using a second nonlinear layer of the encoder sub-network, the output values associated with the at least one chrominance channel of the frame. In such aspects, the combined representation is generated based on an output of the first nonlinear layer and an output of the second nonlinear layer.
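The nonlinearity is not named in this passage; generalized divisive normalization (GDN) is a common choice in E2E-NNVC encoders and is shown here only as an illustrative stand-in (parameters and shapes are assumptions):

```python
import numpy as np

def gdn(x, beta, gamma):
    """Generalized divisive normalization across channels.
    x: (C, H, W); beta: (C,); gamma: (C, C). In practice beta and
    gamma are learned; here they are fixed for illustration."""
    norm = np.einsum('ij,jhw->ihw', gamma, x ** 2)
    return x / np.sqrt(beta[:, None, None] + norm)

rng = np.random.default_rng(3)
x = rng.standard_normal((4, 8, 8))
y = gdn(x, beta=np.ones(4), gamma=np.full((4, 4), 0.1))
```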

In some aspects, the combined representation of the frame is generated by the third convolutional layer using the output of the first nonlinear layer and the output of the second nonlinear layer as input.

In some aspects, the methods, apparatuses, and computer-readable media described above for processing video data further include quantizing the encoded video data.
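Quantization of the encoded values can be as simple as uniform scalar quantization, i.e., element-wise rounding to the nearest multiple of a step size. This is a hedged sketch of that common baseline, not the scheme disclosed here:

```python
import numpy as np

def quantize(latent, step=1.0):
    """Uniform scalar quantization: round each value to the nearest multiple of step."""
    return np.round(latent / step) * step

q = quantize(np.array([0.26, -1.7, 3.49]), step=0.5)  # -> [0.5, -1.5, 3.5]
```

A larger step gives a coarser latent and a lower bit rate after entropy coding, at the cost of reconstruction quality.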

In some aspects, the methods, apparatuses, and computer-readable media described above for processing video data further include entropy coding the encoded video data.

In some aspects, the methods, apparatuses, and computer-readable media described above for processing video data further include storing the encoded video data in memory.

In some aspects, the methods, apparatuses, and computer-readable media described above for processing video data further include transmitting the encoded video data over a transmission medium to at least one device.

In some aspects, the methods, apparatuses, and computer-readable media described above for processing video data further include: obtaining an encoded frame; generating, by a first convolutional layer of a decoder sub-network of the neural network system, reconstructed output values associated with a luminance channel of the encoded frame; and generating, by a second convolutional layer of the decoder sub-network, reconstructed output values associated with at least one chrominance channel of the encoded frame.

In some aspects, the methods, apparatuses, and computer-readable media described above for processing video data further include separating, using a third convolutional layer of the decoder sub-network, the luminance channel of the encoded frame from the at least one chrominance channel of the encoded frame.

In some aspects, the third convolutional layer of the decoder sub-network includes a 1x1 convolutional layer. The 1x1 convolutional layer includes one or more 1x1 convolutional filters.

In some aspects, the frame includes a video frame. In some aspects, the at least one chrominance channel includes a chrominance-blue channel and a chrominance-red channel. In some aspects, the frame has a luminance-chrominance (YUV) format.

In one illustrative example, a method of processing video data is provided. The method includes: obtaining an encoded frame; separating, by a first convolutional layer of a decoder sub-network, a luminance channel of the encoded frame from at least one chrominance channel of the encoded frame; generating, by a second convolutional layer of the decoder sub-network of a neural network system, reconstructed output values associated with the luminance channel of the encoded frame; generating, by a third convolutional layer of the decoder sub-network, reconstructed output values associated with the at least one chrominance channel of the encoded frame; and generating an output frame that includes the reconstructed output values associated with the luminance channel and the reconstructed output values associated with the at least one chrominance channel.
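Mirroring the encoder, the decoder front-end in this example can be sketched in NumPy: a 1x1 layer splits the shared latent into luminance and chrominance feature branches, and the luminance branch returns to full resolution. All sizes are illustrative assumptions, and nearest-neighbor upsampling stands in for a transposed convolution:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution as per-pixel channel mixing: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

rng = np.random.default_rng(2)
N = 8
latent = rng.standard_normal((N, 16, 16))   # combined representation from the encoder

# First (1x1) layer of the decoder splits the shared latent into two branches.
feats = conv1x1(latent, rng.standard_normal((2 * N, N)))
y_feats, uv_feats = feats[:N], feats[N:]

# Luminance branch is upsampled back to full resolution (nearest-neighbor
# here, standing in for a transposed convolution); chrominance stays subsampled.
y_rec = conv1x1(y_feats, rng.standard_normal((1, N))).repeat(2, axis=1).repeat(2, axis=2)
uv_rec = conv1x1(uv_feats, rng.standard_normal((2, N)))
```

The output frame then pairs the full-resolution luminance plane with the half-resolution chrominance planes, matching the 4:2:0 layout of the input.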

In another example, an apparatus for processing video data is provided that includes a memory and a processor (e.g., implemented in circuitry) coupled to the memory. In some examples, more than one processor can be coupled to the memory and can be used to perform one or more of the operations. The processor is configured to: obtain an encoded frame; separate, using a first convolutional layer of a decoder sub-network, a luminance channel of the encoded frame from at least one chrominance channel of the encoded frame; generate, using a second convolutional layer of the decoder sub-network of a neural network system, reconstructed output values associated with the luminance channel of the encoded frame; generate, using a third convolutional layer of the decoder sub-network, reconstructed output values associated with the at least one chrominance channel of the encoded frame; and generate an output frame that includes the reconstructed output values associated with the luminance channel and the reconstructed output values associated with the at least one chrominance channel.

In another example, a non-transitory computer-readable medium for encoding video data is provided that has instructions stored thereon that, when executed by one or more processors, cause the one or more processors to: obtain an encoded frame; separate, using a first convolutional layer of a decoder sub-network, a luminance channel of the encoded frame from at least one chrominance channel of the encoded frame; generate, using a second convolutional layer of the decoder sub-network of a neural network system, reconstructed output values associated with the luminance channel of the encoded frame; generate, using a third convolutional layer of the decoder sub-network, reconstructed output values associated with the at least one chrominance channel of the encoded frame; and generate an output frame that includes the reconstructed output values associated with the luminance channel and the reconstructed output values associated with the at least one chrominance channel.

In another example, an apparatus for processing video data is provided. The apparatus includes: means for obtaining an encoded frame; means for separating, via a first convolutional layer of a decoder sub-network, a luminance channel of the encoded frame from at least one chrominance channel of the encoded frame; means for generating, via a second convolutional layer of the decoder sub-network of a neural network system, reconstructed output values associated with the luminance channel of the encoded frame; means for generating, via a third convolutional layer of the decoder sub-network, reconstructed output values associated with the at least one chrominance channel of the encoded frame; and means for generating an output frame that includes the reconstructed output values associated with the luminance channel and the reconstructed output values associated with the at least one chrominance channel.

In some aspects, the first convolutional layer of the decoder sub-network includes a 1x1 convolutional layer. The 1x1 convolutional layer includes one or more 1x1 convolutional filters.

In some aspects, the methods, apparatuses, and computer-readable media described above for processing video data further include: processing, using a first nonlinear layer of the decoder sub-network, values associated with the luminance channel of the encoded frame, wherein the reconstructed output values associated with the luminance channel are generated based on an output of the first nonlinear layer; and processing, using a second nonlinear layer of the decoder sub-network, values associated with the at least one chrominance channel of the encoded frame, wherein the reconstructed output values associated with the at least one chrominance channel are generated based on an output of the second nonlinear layer.

In some aspects, the methods, apparatuses, and computer-readable media described above for processing video data further include dequantizing samples of the encoded frame.

In some aspects, the methods, apparatuses, and computer-readable media described above for processing video data further include entropy decoding samples of the encoded frame.

In some aspects, the methods, apparatuses, and computer-readable media described above for processing video data further include storing the output frame in memory.

In some aspects, the methods, apparatuses, and computer-readable media described above for processing video data further include displaying the output frame.

In some aspects, the methods, apparatuses, and computer-readable media for processing video data described above further include: generating, by a first convolutional layer of an encoder sub-network of the neural network system, output values associated with a luma channel of a frame; generating, by a second convolutional layer of the encoder sub-network, output values associated with at least one chroma channel of the frame; generating, by a third convolutional layer of the encoder sub-network, a combined representation of the frame based on the output values associated with the luma channel of the frame and the output values associated with the at least one chroma channel of the frame; and generating the encoded frame based on the combined representation of the frame.

In some aspects, the third convolutional layer of the encoder sub-network includes a 1x1 convolutional layer. The 1x1 convolutional layer includes one or more 1x1 convolutional filters.

In some aspects, the methods, apparatuses, and computer-readable media for processing video data described above further include: processing, using a first non-linear layer of the encoder sub-network, the output values associated with the luma channel of the frame; and processing, using a second non-linear layer of the encoder sub-network, the output values associated with the at least one chroma channel of the frame; wherein the combined representation is generated based on an output of the first non-linear layer and an output of the second non-linear layer.

In some aspects, the combined representation of the frame is generated by the third convolutional layer of the encoder sub-network using the output of the first non-linear layer and the output of the second non-linear layer as input.

In some aspects, the encoded frame includes an encoded video frame.

In some aspects, the at least one chroma channel includes a chroma-blue channel and a chroma-red channel.

In some aspects, the encoded frame has a luminance-chrominance (YUV) format.

In some aspects, the apparatus may be, or may be part of, a mobile device (e.g., a mobile telephone or so-called "smart phone," a tablet computer, or other type of mobile device), a network-connected wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a server computer (e.g., a video server or other server device), a television, a vehicle (or a computing device or system of a vehicle), a camera (e.g., a digital camera, an Internet Protocol (IP) camera, etc.), a multi-camera system, a robotics device or system, an aviation device or system, or other device. In some aspects, the apparatus also includes at least one camera for capturing one or more images or video frames (or pictures). For example, the apparatus can include a camera (e.g., an RGB camera) or multiple cameras for capturing one or more images and/or one or more videos including video frames. In some aspects, the apparatus includes a display for displaying one or more images, videos, notifications, or other displayable data. In some aspects, the apparatus includes a transmitter configured to transmit one or more video frames and/or syntax data over a transmission medium to at least one device. In some aspects, the apparatus described above can include one or more sensors. In some aspects, the processor includes a neural processing unit (NPU), a central processing unit (CPU), a graphics processing unit (GPU), or other processing device or component.

This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification of this patent, any or all drawings, and each claim.

The foregoing, together with other features and embodiments, will become more apparent upon reference to the following specification, claims, and accompanying drawings.

Certain aspects and embodiments of this disclosure are provided below. Some of these aspects and embodiments may be applied independently, and some of them may be applied in combination, as would be apparent to those of ordinary skill in the art. In the following description, for purposes of explanation, specific details are set forth in order to provide a thorough understanding of embodiments of the application. However, it will be apparent that various embodiments may be practiced without these specific details. The figures and description are not intended to be restrictive.

The ensuing description provides example embodiments only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the ensuing description of the example embodiments will provide those of ordinary skill in the art with an enabling description for implementing the example embodiments. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the application as set forth in the appended claims.

Digital video data can include large amounts of data, particularly as the demand for high-quality video data continues to grow. For example, consumers of video data typically desire video of increasingly higher quality, with high fidelity, high resolution, high frame rates, and the like. However, the large amounts of video data required to meet such demands can place a significant burden on communication networks, as well as on the devices that process and store the video data.

Various techniques can be used to code video data. Video coding can be performed according to a particular video coding standard. Example video coding standards include High Efficiency Video Coding (HEVC), Advanced Video Coding (AVC), Moving Picture Experts Group (MPEG) coding, and Versatile Video Coding (VVC), among others. Video coding often uses prediction methods, such as inter-prediction or intra-prediction, which take advantage of redundancies present in video images or sequences. A common goal of video coding techniques is to compress video data into a form that uses a lower bit rate, while avoiding or minimizing degradation of the video quality. As the demand for video services grows and new video services become available, coding techniques with better coding efficiency, performance, and rate control are needed.

Described herein are systems, apparatuses, processes (also referred to as methods), and computer-readable media (collectively referred to as "systems and techniques") for performing image and/or video coding using one or more machine learning (ML) systems. In general, ML is a subset of artificial intelligence (AI). An ML system can include algorithms and statistical models that computer systems can use to perform a variety of tasks by relying on patterns and inference, without the use of explicit instructions. One example of an ML system is a neural network (also referred to as an artificial neural network), which can include an interconnected group of artificial neurons (e.g., neuron models). Neural networks can be used for a variety of applications and/or devices, such as image and/or video coding, image analysis and/or computer vision applications, Internet Protocol (IP) cameras, Internet of Things (IoT) devices, autonomous vehicles, service robots, and the like.

Individual nodes in a neural network can emulate biological neurons by taking input data and performing simple operations on the data. The results of the simple operations performed on the input data are selectively passed on to other neurons. Weight values are associated with each vector and node in the network, and these values constrain how the input data is related to the output data. For example, the input data of each node can be multiplied by a corresponding weight value, and the products can be summed. The sum of the products can be adjusted by an optional bias, and an activation function can be applied to the result, yielding the node's output signal or "output activation" (sometimes referred to as an activation map or a feature map). The weight values can initially be determined by an iterative flow of training data through the network (e.g., the weight values are established during a training phase in which the network learns how to identify particular classes by their typical input data characteristics).
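As a concrete illustration (not part of the patent disclosure), the per-node computation described above — multiply inputs by weights, sum the products, add an optional bias, then apply an activation function — can be sketched in a few lines of plain Python. The sigmoid activation and the specific input/weight values are arbitrary choices for the example:

```python
import math

def node_activation(inputs, weights, bias=0.0):
    """Weighted sum of inputs plus an optional bias, passed through
    a sigmoid activation function to produce the output activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid

# Example node with three inputs:
act = node_activation([0.5, -1.0, 0.25], [0.4, 0.3, -0.2], bias=0.1)
```

Here `z = 0.2 - 0.3 - 0.05 + 0.1 = -0.05`, so the output activation is sigmoid(-0.05), a value slightly below 0.5.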

There are different types of neural networks, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative adversarial networks (GANs), multilayer perceptron (MLP) neural networks, and the like. For example, a convolutional neural network (CNN) is a type of feed-forward artificial neural network. A CNN can include a collection of artificial neurons that each have a receptive field (e.g., a spatially localized region of an input space) and that collectively tile the input space. RNNs work on the principle of saving the output of a layer and feeding this output back to the input to help in predicting an outcome of the layer. A GAN is a form of generative neural network that can learn patterns in input data so that the neural network model can generate new synthetic outputs that reasonably could have come from the original dataset. A GAN can include two neural networks that operate together, including a generative neural network that generates a synthesized output and a discriminative neural network that evaluates the output for authenticity. In MLP neural networks, data can be fed into an input layer, and one or more hidden layers provide levels of abstraction for the data. Predictions can then be made on an output layer based on the abstracted data.

In layered neural network architectures (referred to as deep neural networks when multiple hidden layers are present), the output of a first layer of artificial neurons becomes an input to a second layer of artificial neurons, the output of the second layer of artificial neurons becomes an input to a third layer of artificial neurons, and so on. For example, CNNs can be trained to recognize a hierarchy of features. Computation in CNN architectures can be distributed over a population of processing nodes, which can be configured in one or more computational chains. These multi-layered architectures can be trained one layer at a time and can be fine-tuned using back propagation.

In some aspects, the systems and techniques described herein include an end-to-end ML-based (e.g., using a neural network architecture) image and video coding (E2E-NNVC) system designed to process input data having a luminance-chrominance (YUV) input format. The YUV format includes a luminance channel (Y) and a pair of chrominance channels (U and V). The U channel can be referred to as the chrominance (or chroma)-blue channel, and the V channel can be referred to as the chrominance (or chroma)-red channel. In some cases, the luminance (Y) channel or component can also be referred to as the luma channel or component. In some cases, the chrominance (U and V) channels or components can also be referred to as chroma channels or components. YUV input formats can include YUV 4:2:0, YUV 4:4:4, YUV 4:2:2, and the like. In some cases, the systems and techniques described herein can be designed to process other input formats, such as data having a Y-chroma blue (Cb)-chroma red (Cr) (YCbCr) format, a red-green-blue (RGB) format, and/or other formats. The E2E-NNVC systems described herein can encode and/or decode standalone frames (also referred to as images or pictures) and/or video data that includes multiple frames.
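For concreteness, the chroma subsampling implied by these format labels can be expressed as per-plane shapes. This is an illustrative sketch, not part of the patent disclosure; the function name and the example frame size are our own:

```python
def yuv_plane_shapes(width, height, fmt="4:2:0"):
    """Return (height, width) shapes for the Y, U, and V planes.
    4:2:0 halves the chroma planes in both dimensions,
    4:2:2 halves them horizontally only, 4:4:4 is not subsampled."""
    sx, sy = {"4:2:0": (2, 2), "4:2:2": (2, 1), "4:4:4": (1, 1)}[fmt]
    chroma = (height // sy, width // sx)
    return {"Y": (height, width), "U": chroma, "V": chroma}

# A 1920x1080 frame in YUV 4:2:0:
shapes = yuv420 = yuv_plane_shapes(1920, 1080, "4:2:0")
```

For 4:2:0, the Y plane keeps the full 1080x1920 resolution while each chroma plane is 540x960 — one quarter of the luma sample count, which is exactly the resolution mismatch the front-end architecture described below has to handle.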

In many cases, E2E-NNVC systems are designed as a combination of an autoencoder sub-network (the encoder sub-network) and a second sub-network (in some cases also referred to as a hyperprior network) responsible for learning a probabilistic model over quantized latents used for entropy coding (the decoder sub-network). In some cases, there can be other sub-networks of the decoder. Such an E2E-NNVC system architecture can be viewed as a combination of a transform plus quantization module (or encoder sub-network) and an entropy modeling sub-network module.

Most E2E-NNVC system architectures are designed to operate with non-subsampled input formats, such as RGB, YUV 4:4:4, or other non-subsampled input formats. However, video coding standards such as HEVC and VVC are designed to support the YUV 4:2:0 color format in their respective main profiles. An E2E-NNVC architecture designed to operate with non-subsampled input formats must be modified in order to support the 4:2:0 YUV format.

The systems and techniques described herein provide a front-end architecture (e.g., sub-networks) for processing one or more particular color formats (e.g., the YUV 4:2:0 color format) that is applicable to existing E2E-NNVC architectures. The systems and techniques account for the different characteristics of the Y and UV channels, as well as the difference in resolution. For example, the Y and UV channels of a frame, or of a portion of a frame, can be input to two separate neural network layers of an encoder sub-network of a neural network system. In some examples, the two neural network layers include convolutional layers. In some aspects, the outputs of the two separate neural network layers are processed by a pair of non-linear layers or operators of the encoder sub-network. The pair of non-linear layers or operators can include generalized divisive normalization (GDN) layers or operators, parametric rectified linear unit (PReLU) layers or operators, and/or other non-linear layers or operators. The outputs of the two separate neural network layers (or the outputs of the non-linear layers or operators) are combined using an additional neural network layer of the encoder sub-network.
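One way such separate front-end branches can reconcile the resolution mismatch of YUV 4:2:0 input: if the luma branch downsamples with stride 2 while the chroma branch uses stride 1, both branch outputs have the same spatial dimensions and can be combined channel-wise. The kernel size, strides, and padding below are illustrative assumptions for the sketch, not values taken from the patent:

```python
def conv_out_size(size, kernel, stride, padding):
    """Standard convolution output-size arithmetic for one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# Hypothetical front end for a 1920x1080 YUV 4:2:0 frame:
# the luma plane is 1920x1080; each chroma plane is 960x540.
y_out = (conv_out_size(1080, 5, 2, 2), conv_out_size(1920, 5, 2, 2))
uv_out = (conv_out_size(540, 5, 1, 2), conv_out_size(960, 5, 1, 2))
```

With these choices both branches produce 540x960 feature maps, so a subsequent layer (e.g., the 1x1 convolutional layer discussed below) can mix them per pixel.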

In some examples, the additional neural network layer is a 1x1 convolutional layer. The 1x1 convolutional layer performs a per-pixel or per-value cross-channel mixing (e.g., by generating linear combinations) of the Y and UV components, resulting in a cross-component (e.g., cross luminance-chrominance) prediction that improves coding performance. For example, the cross-channel mixing of the Y and UV components decorrelates the Y component from the U and V components, which leads to improved coding performance (e.g., increased coding efficiency). In some cases, the 1x1 convolutional layer can include N 1x1 convolutional filters (where N is equal to an integer value corresponding to the number of channels input to the 1x1 convolutional layer). Each 1x1 convolutional filter has a respective scaling factor that is applied to a corresponding n-th channel of the Y component and a corresponding n-th channel of the UV component.
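The per-pixel cross-channel mixing performed by a 1x1 convolution can be illustrated in plain Python (a minimal sketch; a real implementation would use a tensor library): each output channel at each spatial position is a linear combination of all input channels at that same position, with no spatial footprint.

```python
def conv_1x1(channels, weights):
    """channels: list of C input planes (H x W nested lists).
    weights: N x C matrix. Output channel n at pixel (y, x) is
    sum over c of weights[n][c] * channels[c][y][x]."""
    h, w = len(channels[0]), len(channels[0][0])
    return [[[sum(wn[c] * channels[c][y][x] for c in range(len(channels)))
              for x in range(w)]
             for y in range(h)]
            for wn in weights]

# Two 2x2 input channels (e.g., one luma-derived, one chroma-derived),
# mixed into two output channels by a 2x2 weight matrix:
y_feat = [[1.0, 2.0], [3.0, 4.0]]
uv_feat = [[10.0, 20.0], [30.0, 40.0]]
mixed = conv_1x1([y_feat, uv_feat], weights=[[0.5, 0.5], [1.0, -0.5]])
```

The second filter here (weights 1.0 and -0.5) illustrates the decorrelation idea: it subtracts a scaled chroma-derived channel from the luma-derived one. The weight values are arbitrary for the example; in the described system they would be learned.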

The output of the additional neural network layer (e.g., the 1x1 convolutional layer) can be processed by one or more non-linear layers and/or one or more additional neural network layers (e.g., convolutional layers) of the encoder sub-network. A quantization engine can perform quantization of the features output by the final neural network layer of the encoder sub-network to generate a quantized output. An entropy encoding engine can entropy encode the quantized output from the quantization engine to generate a bitstream. The neural network system can output the bitstream for storage, for transmission to another device, to a server device or system, and so on.

A decoder sub-network of the neural network system, or a decoder sub-network of another neural network system (of another device), can decode the bitstream. For example, an entropy decoding engine of the decoder sub-network can entropy decode the bitstream and output the entropy-decoded data to a dequantization engine. The dequantization engine can dequantize the data. The dequantized data can be processed by one or more neural network layers (e.g., convolutional layers) and/or one or more inverse non-linear layers of the decoder sub-network. For example, a 1x1 convolutional layer can process the data after it has been processed by the one or more convolutional layers and the one or more inverse non-linear layers. The 1x1 convolutional layer can divide the data into Y-channel features and combined UV-channel features. The Y-channel features and the combined UV-channel features can be processed by two final neural network layers (e.g., two convolutional layers), and in some cases by two final inverse non-linear layers. For example, a first final neural network layer can process the Y-channel features and output a reconstructed Y channel (e.g., luma samples or pixels) for each pixel or sample of a reconstructed frame. A second final neural network layer can process the combined UV-channel features, and can output a reconstructed U channel (e.g., chroma-blue samples or pixels) and a reconstructed V channel (e.g., chroma-red samples or pixels) for each pixel or sample of the reconstructed frame.

Additional details regarding the systems and techniques will be described with reference to the figures.

FIG. 1 illustrates an example implementation of a system-on-a-chip (SOC) 100, which may include a central processing unit (CPU) 102 or a multi-core CPU configured to perform one or more of the functions described herein. Parameters or variables (e.g., neural signals and synaptic weights), system parameters associated with a computational device (e.g., a neural network with weights), delays, frequency bin information, task information, and other information may be stored in a memory block associated with a neural processing unit (NPU) 108, in a memory block associated with the CPU 102, in a memory block associated with a graphics processing unit (GPU) 104, in a memory block associated with a digital signal processor (DSP) 106, in a memory block 118, and/or may be distributed across multiple blocks. Instructions executed at the CPU 102 may be loaded from a program memory associated with the CPU 102 or may be loaded from the memory block 118.

The SOC 100 may also include additional processing blocks tailored to specific functions, such as the GPU 104, the DSP 106, a connectivity block 110 (which may include fifth generation (5G) connectivity, fourth generation long term evolution (4G LTE) connectivity, Wi-Fi connectivity, USB connectivity, Bluetooth connectivity, and the like), and a multimedia processor 112 that may, for example, detect and recognize gestures. In one implementation, the NPU is implemented in the CPU 102, the DSP 106, and/or the GPU 104. The SOC 100 may also include a sensor processor 114, an image signal processor (ISP) 116, and/or a navigation module 120, which may include a global positioning system.

The SOC 100 may be based on an ARM instruction set. In an aspect of the present disclosure, the instructions loaded into the CPU 102 may include code to search for a stored multiplication result in a lookup table (LUT) corresponding to a multiplication product of an input value and a filter weight. The instructions loaded into the CPU 102 may also include code to disable a multiplier during a multiplication operation of the multiplication product when a lookup table hit of the multiplication product is detected. In addition, the instructions loaded into the CPU 102 may include code to store a computed multiplication product of the input value and the filter weight when a lookup table miss of the multiplication product is detected.

The SOC 100 and/or components thereof may be configured to perform video compression and/or decompression (also referred to as video encoding and/or decoding, collectively referred to as video coding) using machine learning techniques according to aspects of the present disclosure discussed herein. By using deep learning architectures to perform video compression and/or decompression, aspects of the present disclosure can increase the efficiency of video compression and/or decompression on a device. For example, a device using the video coding techniques described herein can compress video more efficiently using the machine learning based techniques, can transmit the compressed video to another device, and the other device can decompress the compressed video more efficiently using the machine learning based techniques described herein.

As noted above, a neural network is an example of a machine learning system, and can include an input layer, one or more hidden layers, and an output layer. Data is provided from input nodes of the input layer, processing is performed by hidden nodes of the one or more hidden layers, and an output is produced through output nodes of the output layer. Deep learning networks typically include multiple hidden layers. Each layer of the neural network can include feature maps or activation maps that can include artificial neurons (or nodes). A feature map can include a filter, a kernel, or the like. The nodes can include one or more weights used to indicate an importance of the nodes of one or more of the layers. In some cases, a deep learning network can have a series of many hidden layers, with early layers being used to determine simple and low-level characteristics of an input, and later layers building up a hierarchy of more complex and abstract characteristics.

A deep learning architecture can learn a hierarchy of features. If presented with visual data, for example, the first layer may learn to recognize relatively simple features, such as edges, in the input stream. In another example, if presented with auditory data, the first layer may learn to recognize spectral power in specific frequencies. A second layer, taking the output of the first layer as input, may learn to recognize combinations of features, such as simple shapes for visual data or combinations of sounds for auditory data. For instance, higher layers may learn to represent complex shapes in visual data or words in auditory data. Still higher layers may learn to recognize common visual objects or spoken phrases.

Deep learning architectures may perform especially well when applied to problems that have a natural hierarchical structure. For example, the classification of motorized vehicles may benefit from first learning to recognize wheels, windshields, and other features. These features may be combined at higher layers in different ways to recognize cars, trucks, and airplanes.

Neural networks may be designed with a variety of connectivity patterns. In feed-forward networks, information is passed from lower to higher layers, with each neuron in a given layer communicating with neurons in higher layers. A hierarchical representation may be built up in successive layers of a feed-forward network, as described above. Neural networks may also have recurrent or feedback (also called top-down) connections. In a recurrent connection, the output from a neuron in a given layer may be communicated to another neuron in the same layer. A recurrent architecture may be helpful in recognizing patterns that span more than one of the input data chunks that are delivered to the neural network in a sequence. A connection from a neuron in a given layer to a neuron in a lower layer is called a feedback (or top-down) connection. A network with many feedback connections may be helpful when the recognition of a high-level concept may aid in discriminating the particular low-level features of an input.

The connections between layers of a neural network may be fully connected or locally connected. FIG. 2A illustrates an example of a fully connected neural network 202. In the fully connected neural network 202, a neuron in a first layer may communicate its output to every neuron in a second layer, so that each neuron in the second layer will receive input from every neuron in the first layer. FIG. 2B illustrates an example of a locally connected neural network 204. In the locally connected neural network 204, a neuron in a first layer may be connected to a limited number of neurons in the second layer. More generally, a locally connected layer of the locally connected neural network 204 may be configured so that each neuron in a layer will have the same or a similar connectivity pattern, but with connection strengths that may have different values (e.g., 210, 212, 214, and 216). The locally connected connectivity pattern may give rise to spatially distinct receptive fields in a higher layer, because the higher layer neurons in a given region may receive inputs that are tuned through training to the properties of a restricted portion of the total input to the network.

One example of a locally connected neural network is a convolutional neural network. FIG. 2C illustrates an example of a convolutional neural network 206. The convolutional neural network 206 may be configured such that the connection strengths associated with the inputs for each neuron in the second layer are shared (e.g., 208). Convolutional neural networks may be well suited to problems in which the spatial location of inputs is meaningful. The convolutional neural network 206 may be used to perform one or more aspects of video compression and/or decompression, according to aspects of the present disclosure.
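The practical benefit of the shared connection strengths mentioned above can be made concrete by counting parameters. This is an illustrative calculation, not taken from the patent; the layer sizes are arbitrary examples:

```python
def fully_connected_params(n_inputs, n_outputs):
    """One weight per input-output pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

def conv_layer_params(kernel_size, in_channels, out_channels):
    """Each output channel shares one kernel across all spatial
    positions, plus one bias per output channel."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

# Connecting a 32x32x3 input to 32 units/feature maps:
fc = fully_connected_params(32 * 32 * 3, 32)
conv = conv_layer_params(3, 3, 32)
```

The fully connected layer needs 98,336 parameters, while the 3x3 convolutional layer needs only 896, because its weights are shared across every spatial position of the input.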

One type of convolutional neural network is a deep convolutional network (DCN). FIG. 2D illustrates a detailed example of a DCN 200 designed to recognize visual features in an image 226 provided by an image capture device 230, such as a vehicle-mounted camera. The DCN 200 of the present example can be trained to identify traffic signs and a number provided on a traffic sign. Of course, the DCN 200 can be trained for other tasks, such as identifying lane markings or identifying traffic lights.

The DCN 200 can be trained with supervised learning. During training, the DCN 200 can be presented with an image, such as the image 226 of a speed limit sign, and a forward pass can then be computed to produce an output 222. The DCN 200 can include a feature-extraction section and a classification section. Upon receiving the image 226, a convolutional layer 232 can apply convolutional kernels (not shown) to the image 226 to generate a first set of feature maps 218. As an example, the convolutional kernels for the convolutional layer 232 can be 5x5 kernels that generate 28x28 feature maps. In the present example, because four different feature maps are generated in the first set of feature maps 218, four different convolutional kernels are applied to the image 226 at the convolutional layer 232. The convolutional kernels can also be referred to as filters or convolutional filters.

The first set of feature maps 218 can be subsampled by a max pooling layer (not shown) to generate a second set of feature maps 220. The max pooling layer reduces the size of the first set of feature maps 218. That is, a size of the second set of feature maps 220, such as 14x14, is less than the size of the first set of feature maps 218, such as 28x28. The reduced size provides similar information to a subsequent layer while reducing memory consumption. The second set of feature maps 220 can be further convolved via one or more subsequent convolutional layers (not shown) to generate one or more subsequent sets of feature maps (not shown).
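The 28x28 and 14x14 sizes above follow from the standard output-size formulas for convolution and pooling. As a sketch (assuming a 32x32 input, which is consistent with a 5x5 kernel producing a 28x28 map, and assuming 2x2 pooling with stride 2; these assumptions are not stated in the text):

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution output: floor((W - K + 2P) / S) + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

def pool_output_size(input_size, window, stride=None):
    """Spatial size after pooling (stride defaults to the window size)."""
    stride = stride or window
    return (input_size - window) // stride + 1

# A 5x5 kernel over an assumed 32x32 input yields the 28x28 feature maps.
assert conv_output_size(32, 5) == 28
# 2x2 max pooling with stride 2 then halves 28x28 to 14x14.
assert pool_output_size(28, 2) == 14
```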

In the example of FIG. 2D, the second set of feature maps 220 is convolved to generate a first feature vector 224. Furthermore, the first feature vector 224 is further convolved to generate a second feature vector 228. Each feature of the second feature vector 228 can include a number that corresponds to a possible feature of the image 226, such as "sign", "60", and "100". A softmax function (not shown) can convert the numbers in the second feature vector 228 to probabilities. As such, the output 222 of the DCN 200 is a probability of the image 226 including one or more features.
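The softmax conversion mentioned above can be sketched as follows. The class scores are hypothetical, chosen only to illustrate how raw numbers become probabilities that sum to one.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the classes "sign", "60", "100".
probs = softmax([4.0, 3.0, 0.5])
assert abs(sum(probs) - 1.0) < 1e-9
assert probs[0] > probs[1] > probs[2]  # the ordering of the scores is preserved
```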

In the present example, the probabilities in the output 222 for "sign" and "60" are higher than the probabilities of the other entries of the output 222, such as "30", "40", "50", "70", "80", "90", and "100". Before training, the output 222 produced by the DCN 200 is likely to be incorrect. Thus, an error can be calculated between the output 222 and a target output. The target output is the ground truth of the image 226 (e.g., "sign" and "60"). The weights of the DCN 200 can then be adjusted so that the output 222 of the DCN 200 is more closely aligned with the target output.

To adjust the weights, a learning algorithm can compute a gradient vector for the weights. The gradient can indicate an amount by which the error would increase or decrease if a weight were adjusted. At the top layer, the gradient can correspond directly to the value of a weight connecting an activated neuron in the penultimate layer and a neuron in the output layer. In lower layers, the gradient can depend on the values of the weights and on the computed error gradients of the higher layers. The weights can then be adjusted to reduce the error. This manner of adjusting the weights can be referred to as "backpropagation", because it involves a "backward pass" through the neural network.

In practice, the error gradient of the weights can be calculated over a small number of examples, so that the calculated gradient approximates the true error gradient. This approximation method can be referred to as stochastic gradient descent. Stochastic gradient descent can be repeated until the achievable error rate of the entire system has stopped decreasing or until the error rate has reached a target level. After learning, the DCN can be presented with new images, and a forward pass through the network can yield an output 222 that can be considered an inference or a prediction of the DCN.
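The gradient-based weight update described in the preceding two paragraphs can be sketched with a toy one-weight model. The learning rate, data, and target function are hypothetical; each step uses a single example, so the computed gradient only approximates the true gradient over the whole set, as in stochastic gradient descent.

```python
import random

def sgd(examples, lr=0.1, epochs=200, seed=0):
    """Fit y = w * x by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        x, y = rng.choice(examples)       # a "mini-batch" of one example
        pred = x * w
        grad = 2 * (pred - y) * x         # d/dw of the error (x*w - y)^2
        w -= lr * grad                    # step against the gradient
    return w

# Data generated from y = 3x; SGD should recover a weight close to 3.
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w = sgd(data)
assert abs(w - 3.0) < 1e-3
```

In a multi-layer network the same per-weight update is applied, with the gradients of the lower layers obtained from those of the higher layers via the backward pass.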

A deep belief network (DBN) is a probabilistic model comprising multiple layers of hidden nodes. DBNs can be used to extract a hierarchical representation of training data sets. A DBN can be obtained by stacking up layers of Restricted Boltzmann Machines (RBMs). An RBM is a type of artificial neural network that can learn a probability distribution over a set of inputs. Because RBMs can learn a probability distribution in the absence of information about the class to which each input should be categorized, RBMs are often used in unsupervised learning. Using a hybrid unsupervised and supervised paradigm, the bottom RBMs of a DBN can be trained in an unsupervised manner and can serve as feature extractors, and the top RBM can be trained in a supervised manner (on the joint distribution of inputs from the previous layer and target classes) and can serve as a classifier.

Deep convolutional networks (DCNs) are networks of convolutional layers, configured with additional pooling and nonlinearity (e.g., normalization) layers. DCNs have achieved state-of-the-art performance on many tasks. DCNs can be trained using supervised learning, in which both the input and output targets are known for many exemplars, and the weights of the network are modified by use of gradient descent methods.

DCNs can be feed-forward networks. In addition, as described above, the connections from a neuron in a first layer of a DCN to a group of neurons in the next higher layer are shared across the neurons in the first layer. The feed-forward and shared connections of DCNs can be exploited for fast processing. The computational burden of a DCN can be much less, for example, than that of a similarly sized neural network that comprises recurrent or feedback connections.

The processing of each layer of a convolutional network can be considered a spatially invariant template or basis projection. If the input is first decomposed into multiple channels, such as the red, green, and blue channels of a color image, then the convolutional network trained on that input can be considered three-dimensional, with two spatial dimensions along the axes of the image and a third dimension capturing color information. The outputs of the convolutional connections can be considered to form a feature map in the subsequent layer, with each element of the feature map (e.g., 220) receiving input from a range of neurons in the previous layer (e.g., feature maps 218) and from each of the multiple channels. The values in the feature map can be further processed with a nonlinearity, such as a rectification, max(0, x). Values from adjacent neurons can be further pooled, which corresponds to downsampling, and can provide additional local invariance and dimensionality reduction.
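The rectification and pooling steps just described can be sketched on a small hypothetical feature map: rectification clips negative values to zero, and a non-overlapping 2x2 max pool downsamples the result.

```python
def relu(feature_map):
    """Apply the rectification nonlinearity max(0, x) element-wise."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Pool adjacent values with a non-overlapping 2x2 window (downsampling)."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        pooled.append([
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]), 2)
        ])
    return pooled

# A hypothetical 4x4 feature map of convolution outputs.
fmap = [[-1.0, 2.0, 0.5, -3.0],
        [ 4.0, -2.0, 1.0,  6.0],
        [ 0.0,  1.0, -1.0, 0.0],
        [-5.0,  3.0,  2.0, 2.5]]
rectified = relu(fmap)            # negative values are clipped to 0
assert max_pool_2x2(rectified) == [[4.0, 6.0], [3.0, 2.5]]
```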

FIG. 3 is a block diagram illustrating an example of a deep convolutional network 350. The deep convolutional network 350 can include multiple different types of layers based on connectivity and weight sharing. As shown in FIG. 3, the deep convolutional network 350 includes convolution blocks 354A and 354B. Each of the convolution blocks 354A and 354B can be configured with a convolution layer (CONV) 356, a normalization layer (LNorm) 358, and a max pooling layer (MAX POOL) 360.

The convolution layer 356 can include one or more convolutional filters, which can be applied to the input data 352 to generate a feature map. Although only two convolution blocks 354A and 354B are shown, the present disclosure is not so limited; instead, any number of convolution blocks (e.g., blocks 354A, 354B) can be included in the deep convolutional network 350 according to design preference. The normalization layer 358 can normalize the output of the convolutional filters. For example, the normalization layer 358 can provide whitening or lateral inhibition. The max pooling layer 360 can provide downsampling aggregation over space for local invariance and dimensionality reduction.

For example, the parallel filter banks of a deep convolutional network can be loaded on the CPU 102 or GPU 104 of the SOC 100 to achieve high performance and low power consumption. In alternative embodiments, the parallel filter banks can be loaded on the DSP 106 or the ISP 116 of the SOC 100. In addition, the deep convolutional network 350 can access other processing blocks that may be present on the SOC 100, such as the sensor processor 114 and the navigation module 120, dedicated to sensors and navigation, respectively.

The deep convolutional network 350 can also include one or more fully connected layers, such as layer 362A (labeled "FC1") and layer 362B (labeled "FC2"). The deep convolutional network 350 can further include a logistic regression (LR) layer 364. Between each layer 356, 358, 360, 362A, 362B, 364 of the deep convolutional network 350 are weights (not shown) that are to be updated. The output of each of the layers (e.g., 356, 358, 360, 362A, 362B, 364) can serve as an input of a succeeding one of the layers (e.g., 356, 358, 360, 362A, 362B, 364) in the deep convolutional network 350 to learn hierarchical feature representations from the input data 352 (e.g., images, audio, video, sensor data, and/or other input data) supplied at the first of the convolution blocks 354A. The output of the deep convolutional network 350 is a classification score 366 for the input data 352. The classification score 366 can be a set of probabilities, where each probability is the probability of the input data including a feature from a set of features.

As noted above, digital video data can include large amounts of data, which can place a significant burden on communication networks as well as on devices that process and store the video data. For instance, recording uncompressed video content generally results in large file sizes that greatly increase as the resolution of the recorded video content increases. In one illustrative example, uncompressed 16-bit per channel video recorded in 1080p/24 (e.g., a resolution of 1920 pixels in width and 1080 pixels in height, with 24 frames per second captured) could occupy 12.4 megabytes per frame, or 297.6 megabytes per second. Uncompressed 16-bit per channel video recorded in 4K resolution at 24 frames per second could occupy 49.8 megabytes per frame, or 1195.2 megabytes per second.
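The figures above follow from simple arithmetic, assuming three color channels at 16 bits (2 bytes) per channel and decimal megabytes. The per-second numbers in the text appear to multiply the already-rounded per-frame figures (12.4 x 24 = 297.6), so the unrounded per-second rate differs slightly.

```python
def raw_video_rate(width, height, channels=3, bits_per_channel=16, fps=24):
    """Uncompressed size of one frame and one second of video, in megabytes."""
    bytes_per_frame = width * height * channels * bits_per_channel // 8
    mb_per_frame = bytes_per_frame / 1_000_000
    return mb_per_frame, mb_per_frame * fps

# 1080p: 1920 * 1080 * 3 channels * 2 bytes ~= 12.4 MB per frame.
frame_mb, second_mb = raw_video_rate(1920, 1080)
assert round(frame_mb, 1) == 12.4
assert round(second_mb, 1) == 298.6   # vs. 297.6 when rounding per-frame first

# 4K: 3840 * 2160 * 3 channels * 2 bytes ~= 49.8 MB per frame.
frame_mb, second_mb = raw_video_rate(3840, 2160)
assert round(frame_mb, 1) == 49.8
```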

Network bandwidth is another constraint for which large video files can become problematic. For example, video content is often delivered over wireless networks (e.g., via LTE, LTE-Advanced, New Radio (NR), WiFi™, Bluetooth™, or other wireless networks), and can make up a large portion of consumer internet traffic. Despite advances in the amount of available bandwidth in wireless networks, it may still be desirable to reduce the amount of bandwidth used to deliver video content in these networks.

Because uncompressed video content can result in large files, which can involve sizable memory for physical storage and considerable bandwidth for transmission, video coding techniques can be utilized to compress and then decompress such video content.

To reduce the size of video content, and thus the amount of storage involved to store video content and the amount of bandwidth involved in delivering video content, video coding techniques can be performed according to a particular video coding standard, such as HEVC, AVC, MPEG, VVC, among others. Video coding often uses prediction methods such as inter-prediction, intra-prediction, or the like, which take advantage of redundancies present in video images or sequences. A common goal of video coding techniques is to compress video data into a form that uses a lower bit rate, while avoiding or minimizing degradation in the video quality. As the demand for video services grows and new video services become available, coding techniques with better coding efficiency, performance, and rate control are needed.

In general, an encoding device encodes video data according to a video coding standard to generate an encoded video bitstream. In some examples, an encoded video bitstream (or "video bitstream" or "bitstream") is a series of one or more coded video sequences. The encoding device can generate coded representations of pictures by partitioning each picture into multiple slices. A slice is independent of other slices so that information in the slice is coded without dependency on data from other slices within the same picture. A slice includes one or more slice segments, including an independent slice segment and, if present, one or more dependent slice segments that depend on previous slice segments. In HEVC, a slice is then partitioned into coding tree blocks (CTBs) of luma samples and chroma samples. A CTB of luma samples and one or more CTBs of chroma samples, together with syntax for the samples, are referred to as a coding tree unit (CTU). A CTU can also be referred to as a "tree block" or a "largest coding unit" (LCU). A CTU is the basic processing unit for HEVC encoding. A CTU can be split into multiple coding units (CUs) of varying sizes. A CU contains luma and chroma sample arrays that are referred to as coding blocks (CBs).

The luma and chroma CBs can be further split into prediction blocks (PBs). A PB is a block of samples of the luma component or a chroma component that uses the same motion parameters for inter-prediction or intra-block copy (IBC) prediction (when available or enabled for use). The luma PB and one or more chroma PBs, together with associated syntax, form a prediction unit (PU). For inter-prediction, a set of motion parameters (e.g., one or more motion vectors, reference indices, or the like) is signaled in the bitstream for each PU and is used for inter-prediction of the luma PB and the one or more chroma PBs. The motion parameters can also be referred to as motion information. A CB can also be partitioned into one or more transform blocks (TBs). A TB represents a square block of samples of a color component on which a residual transform (e.g., the same two-dimensional transform in some cases) is applied for coding a prediction residual signal. A transform unit (TU) represents the TBs of luma and chroma samples, and corresponding syntax elements. Transform coding is described in more detail below.

According to the HEVC standard, transformations can be performed using TUs. The TUs can be sized based on the size of PUs within a given CU. The TUs can be the same size as the PUs or smaller. In some examples, residual samples corresponding to a CU can be subdivided into smaller units using a quadtree structure known as a residual quadtree (RQT). Leaf nodes of the RQT can correspond to TUs. Pixel difference values associated with the TUs can be transformed to produce transform coefficients. The transform coefficients can then be quantized by the encoding device.

Once the pictures of the video data are partitioned into CUs, the encoding device predicts each PU using a prediction mode. The prediction unit or prediction block is then subtracted from the original video data to get residuals (described below). For each CU, a prediction mode can be signaled inside the bitstream using syntax data. A prediction mode can include intra-prediction (or intra-picture prediction) or inter-prediction (or inter-picture prediction). Intra-prediction utilizes the correlation between spatially neighboring samples within a picture. For example, using intra-prediction, each PU is predicted from neighboring image data in the same picture using, for example, DC prediction to find an average value for the PU, planar prediction to fit a planar surface to the PU, direction prediction to extrapolate from neighboring data, or any other suitable types of prediction. Inter-prediction uses the temporal correlation between pictures in order to derive a motion-compensated prediction for a block of image samples. For example, using inter-prediction, each PU is predicted using motion-compensated prediction from image data in one or more reference pictures (before or after the current picture in output order). The decision whether to code a picture area using inter-picture or intra-picture prediction can be made, for example, at the CU level.
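The DC prediction mentioned above can be sketched as follows. This is a simplified toy, not the normative HEVC process, and the neighbor values are hypothetical: every sample of the predicted block is set to the average of the reconstructed samples in the column to the left of and the row above the block.

```python
def dc_intra_prediction(left_neighbors, top_neighbors, block_size):
    """Predict a block as the average of its reconstructed neighbors (a
    simplified DC prediction: one constant value fills the whole block)."""
    neighbors = left_neighbors + top_neighbors
    dc = round(sum(neighbors) / len(neighbors))
    return [[dc] * block_size for _ in range(block_size)]

# Hypothetical reconstructed neighbors of a 4x4 block.
left = [100, 102, 98, 100]    # column to the left of the block
top = [101, 99, 100, 100]     # row above the block
pred = dc_intra_prediction(left, top, 4)
assert pred == [[100] * 4 for _ in range(4)]
```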

After performing prediction using intra- and/or inter-prediction, the encoding device can perform transformation and quantization. For example, following prediction, the encoding device can calculate residual values corresponding to the PU. The residual values can comprise pixel difference values between the current block of pixels being coded (the PU) and the prediction block used to predict the current block (e.g., a predicted version of the current block). For example, after generating a prediction block (e.g., performing inter-prediction or intra-prediction), the encoding device can generate a residual block by subtracting the prediction block produced by a prediction unit from the current block. The residual block includes a set of pixel difference values that quantify differences between pixel values of the current block and pixel values of the prediction block. In some examples, the residual block can be represented in a two-dimensional block format (e.g., a two-dimensional matrix or array of pixel values). In such examples, the residual block is a two-dimensional representation of the pixel values.
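The residual computation described above can be sketched directly; the 2x2 current block and prediction block below are hypothetical.

```python
def residual_block(current, prediction):
    """Pixel-wise difference between the current block and its prediction."""
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(current, prediction)]

# A hypothetical 2x2 current block and the prediction produced for it.
current = [[104, 99], [101, 96]]
prediction = [[100, 100], [100, 100]]   # e.g., a DC prediction of 100
assert residual_block(current, prediction) == [[4, -1], [1, -4]]
```

Because a good prediction leaves residual values close to zero, the residual block compresses far better than the raw pixel values.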

A block transform is used to transform any residual data that may remain after prediction is performed, and can be based on a discrete cosine transform, a discrete sine transform, an integer transform, a wavelet transform, another suitable transform function, or any combination thereof. In some cases, one or more block transforms (e.g., of sizes 32x32, 16x16, 8x8, 4x4, or other suitable size) can be applied to the residual data in each CU. In some embodiments, a TU can be used for the transform and quantization processes implemented by the encoding device. A given CU having one or more PUs can also include one or more TUs. As described in further detail below, the residual values can be transformed into transform coefficients using the block transforms, and can then be quantized and scanned using TUs to produce serialized transform coefficients for entropy coding.

The encoding device can perform quantization of the transform coefficients. Quantization provides further compression by quantizing the transform coefficients to reduce the amount of data used to represent the coefficients. For example, quantization can reduce the bit depth associated with some or all of the coefficients. In one example, a coefficient with an n-bit value can be rounded down to an m-bit value during quantization, where n is greater than m.
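The n-bit-to-m-bit rounding above can be sketched as a bit shift; this is an illustrative simplification (real codecs use quantization parameters and rounding offsets), with hypothetical values for n, m, and the coefficient.

```python
def quantize(coefficient, n_bits, m_bits):
    """Round an n-bit coefficient down to an m-bit value by dropping low bits."""
    assert n_bits > m_bits
    return coefficient >> (n_bits - m_bits)

def dequantize(level, n_bits, m_bits):
    """Approximate reconstruction: scale the m-bit level back up."""
    return level << (n_bits - m_bits)

# A 10-bit coefficient rounded down to an 8-bit value (n = 10, m = 8).
coeff = 1001                       # fits in 10 bits (0..1023)
level = quantize(coeff, 10, 8)
assert level == 250                # 1001 >> 2
assert dequantize(level, 10, 8) == 1000   # lossy: the two low bits are gone
```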

Once quantization is performed, the coded video bitstream includes quantized transform coefficients, prediction information (e.g., prediction modes, motion vectors, block vectors, or the like), partitioning information, and any other suitable data, such as other syntax data. The different elements of the coded video bitstream can then be entropy encoded by the encoding device. In some examples, the encoding device can utilize a predefined scan order to scan the quantized transform coefficients to produce a serialized vector that can be entropy encoded. In some examples, the encoding device can perform an adaptive scan. After scanning the quantized transform coefficients to form a vector (e.g., a one-dimensional vector), the encoding device can entropy encode the vector. For example, the encoding device can use context-adaptive variable length coding, context-adaptive binary arithmetic coding, syntax-based context-adaptive binary arithmetic coding, probability interval partitioning entropy coding, or another suitable entropy coding technique.
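One classic predefined scan order is a zigzag along the anti-diagonals; the sketch below uses a zigzag for illustration (HEVC itself uses related diagonal, horizontal, and vertical scans), with a hypothetical 4x4 block of quantized coefficients. Low-frequency coefficients (top-left) come first, so the many trailing zeros cluster at the end of the serialized vector, which helps the entropy coder.

```python
def zigzag_scan(block):
    """Serialize a square coefficient block along anti-diagonals, alternating
    direction on each diagonal (a zigzag predefined scan order)."""
    n = len(block)
    order = sorted(((r, c) for r in range(n) for c in range(n)),
                   key=lambda rc: (rc[0] + rc[1],
                                   # odd diagonals run top->bottom, even ones bottom->top
                                   rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))
    return [block[r][c] for r, c in order]

# A hypothetical 4x4 block of quantized coefficients.
block = [[9, 5, 1, 0],
         [4, 2, 0, 0],
         [1, 0, 0, 0],
         [0, 0, 0, 0]]
vec = zigzag_scan(block)
assert vec[:6] == [9, 5, 4, 1, 2, 1]   # non-zero values are front-loaded
assert vec[6:] == [0] * 10             # zeros cluster at the end
```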

The encoding device can store the encoded video bitstream and/or can send the encoded video bitstream data over a communication link to a receiving device, which can include a decoding device. The decoding device can decode the encoded video bitstream data by entropy decoding (e.g., using an entropy decoder) and extracting the elements of one or more coded video sequences making up the encoded video data. The decoding device can then rescale the encoded video bitstream data and perform an inverse transform on it. The residual data is then passed to a prediction stage of the decoding device. The decoding device then predicts a block of pixels (e.g., a PU) using intra-prediction, inter-prediction, IBC, and/or other types of prediction. In some examples, the prediction is added to the output of the inverse transform (the residual data). The decoding device can output the decoded video to a video destination device, which can include a display or other output device for displaying the decoded video data to a consumer of the content.

The video coding systems and techniques defined by the various video coding standards (e.g., the HEVC video coding techniques described above) may be able to retain much of the information in raw video content and may be defined a priori based on signal processing and information theory concepts. However, in some cases, a machine learning (ML)-based image and/or video system can provide benefits over non-ML based image and video coding systems, such as an end-to-end neural network-based image and video coding (E2E-NNVC) system. As noted above, many E2E-NNVC systems are designed as a combination of an autoencoder sub-network (the encoder sub-network) and a second sub-network responsible for learning a probabilistic model over quantized latents used for entropy coding. Such an architecture can be viewed as a combination of a transform-plus-quantization module (the encoder sub-network) and an entropy modelling sub-network module.

FIG. 4 illustrates a system 400 that includes a device 402 configured to perform video encoding and decoding using an E2E-NNVC system. The device 402 is coupled to a camera 407 and a storage medium 414 (e.g., a data storage device). In some implementations, the camera 407 is configured to provide image data 408 (e.g., a video data stream) to the processor 404 for encoding by the E2E-NNVC system. In some implementations, the device 402 can be coupled to and/or can include multiple cameras (e.g., a dual-camera system, three cameras, or other number of cameras). In some cases, the device 402 can be coupled to a microphone and/or other input device (e.g., a keyboard, a mouse, a touch input device such as a touchscreen and/or touchpad, and/or other input device). In some examples, the camera 407, the storage medium 414, the microphone, and/or other input devices can be part of the device 402.

The device 402 is also coupled to a second device 490 via a transmission medium 418 (e.g., one or more wireless networks, one or more wired networks, or a combination thereof). For example, the transmission medium 418 can include a channel provided by a wireless network, a wired network, or a combination of a wired and wireless network. The transmission medium 418 can form part of a packet-based network, such as a local area network, a wide-area network, or a global network such as the Internet. The transmission medium 418 can include routers, switches, base stations, or any other equipment that can be useful to facilitate communication from the source device to the receiving device. A wireless network can include any wireless interface or combination of wireless interfaces and can include any suitable wireless network (e.g., the Internet or other wide-area network, a packet-based network, WiFi™, radio frequency (RF), UWB, WiFi Direct, cellular, Long-Term Evolution (LTE), WiMax™, or the like). A wired network can include any wired interface (e.g., fiber, Ethernet, powerline Ethernet, Ethernet over coaxial cable, digital signal line (DSL), or the like). The wired and/or wireless networks can be implemented using various equipment, such as base stations, routers, access points, bridges, gateways, switches, or the like. The encoded video bitstream data can be modulated according to a communication standard, such as a wireless communication protocol, and transmitted to the receiving device.

The device 402 includes one or more processors (referred to herein as "processor") 404 coupled to a memory 406, a first interface ("I/F 1") 412, and a second interface ("I/F 2") 416. The processor 404 is configured to receive image data 408 from the camera 407, from the memory 406, and/or from the storage medium 414. The processor 404 is coupled to the storage medium 414 via the first interface 412 (e.g., via a memory bus) and is coupled to the transmission medium 418 via the second interface 416 (e.g., a network peripheral, a wireless transceiver and antenna, one or more other network peripherals, or a combination thereof).

The processor 404 includes an E2E-NNVC system 410. The E2E-NNVC system 410 includes an encoder portion 462 and a decoder portion 466. In some implementations, the E2E-NNVC system 410 may include one or more auto-encoders. The encoder portion 462 is configured to receive input data 470 and to process the input data 470 to generate output data 474 based at least in part on the input data 470.

In some implementations, the encoder portion 462 of the E2E-NNVC system 410 is configured to perform lossy compression on the input data 470 to generate the output data 474, so that the output data 474 has fewer bits than the input data 470. The encoder portion 462 can be trained to compress the input data 470 (e.g., images or video frames) without using motion compensation based on any previous representations (e.g., one or more previously reconstructed frames). For example, the encoder portion 462 may compress a video frame using only the video data from that video frame, without using any data of previously reconstructed frames. Video frames processed by the encoder portion 462 may be referred to herein as intra-predicted frames (I-frames). In some examples, the I-frames may be generated using conventional video coding techniques (e.g., according to HEVC, VVC, MPEG-4, or other video coding standards). In such examples, the processor 404 may include or be coupled with a video coding device (e.g., an encoding device) configured to perform block-based intra prediction, such as the intra prediction described above with respect to the HEVC standard. In such examples, the processor 404 may not include the E2E-NNVC system 410.

In some implementations, the encoder portion 462 of the E2E-NNVC system 410 can be trained to compress the input data 470 (e.g., video frames) using motion compensation based on previous representations (e.g., one or more previously reconstructed frames). For example, the encoder portion 462 may compress a video frame using video data from that video frame and using data of previously reconstructed frames. Video frames processed by the encoder portion 462 in this manner may be referred to herein as inter-predicted frames (P-frames). Motion compensation can be used to determine the data of a current frame by describing how the pixels from a previously reconstructed frame move to new positions in the current frame, together with residual information.

As depicted, the encoder portion 462 of the E2E-NNVC system 410 includes a neural network 463 and a quantizer 464. The neural network 463, which generates intermediate data 472, may include one or more convolutional neural networks (CNNs), one or more fully connected neural networks, one or more gated recurrent units (GRUs), one or more long short-term memory (LSTM) networks, one or more convolutional RNNs, one or more convolutional GRUs, one or more convolutional LSTMs, one or more GANs, any combination thereof, and/or other types of neural network architectures. The intermediate data 472 is input to the quantizer 464. Examples of components that may be included in the encoder portion 462 are illustrated in FIGS. 6A-6E.

The quantizer 464 is configured to perform quantization of the intermediate data 472 and, in some cases, entropy coding of the intermediate data 472 to produce the output data 474. The output data 474 may include quantized (and, in some cases, entropy-coded) data. The quantization operations performed by the quantizer 464 may result in the generation of quantized codes (or data representing quantized codes generated by the E2E-NNVC system 410) from the intermediate data 472. The quantized codes (or data representing the quantized codes) may also be referred to as latent codes or as a latent variable (denoted as z). The entropy model that is applied to the latent variable may be referred to herein as a "prior". In some examples, the quantization and entropy coding operations may be performed using existing quantization and/or entropy coding operations that are performed when encoding and/or decoding video data according to existing video coding standards. In some examples, the quantization and/or entropy coding operations may be performed by the E2E-NNVC system 410. In one illustrative example, the E2E-NNVC system 410 can be trained using supervised training, in which residual data is used as input and quantized codes and entropy codes are used as known outputs (labels) during training.
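As a rough illustration of the quantization step described above, the sketch below (an illustrative assumption, not the actual quantizer 464) rounds a continuous-valued latent to integers, which is the inference-time behaviour of typical learned-compression quantizers; during training, rounding is commonly replaced by additive uniform noise so that gradients can flow:

```python
import numpy as np

def quantize(latent):
    # Round each latent value to the nearest integer (ties to even, per NumPy),
    # producing the integer symbols that would then be entropy coded.
    return np.round(latent).astype(np.int32)

# A tiny hypothetical latent vector z.
z = np.array([0.2, -1.7, 3.5, 2.49], dtype=np.float32)
q = quantize(z)
```

Note that NumPy rounds ties to the nearest even integer, so 3.5 quantizes to 4; a real codec would pair this with a learned entropy model (the "prior") to code `q`.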

The decoder portion 466 of the E2E-NNVC system 410 is configured to receive the output data 474 (e.g., directly from the quantizer 464 and/or from the storage medium 414). The decoder portion 466 may process the output data 474 to generate a representation 476 of the input data 470 based at least in part on the output data 474. In some examples, the decoder portion 466 of the E2E-NNVC system 410 includes a neural network 468, which may include one or more CNNs, one or more fully connected neural networks, one or more GRUs, one or more long short-term memory (LSTM) networks, one or more convolutional RNNs, one or more convolutional GRUs, one or more convolutional LSTMs, one or more GANs, any combination thereof, and/or other types of neural network architectures. Examples of components that may be included in the decoder portion 466 are shown in FIGS. 6A-6E.

The processor 404 is configured to send the output data 474 to at least one of the transmission medium 418 or the storage medium 414. For example, the output data 474 may be stored at the storage medium 414 for later retrieval and decoding (or decompression) by the decoder portion 466 to generate the representation 476 of the input data 470 as reconstructed data. The reconstructed data may be used for various purposes, such as for playback of video data that has been encoded/compressed to generate the output data 474. In some implementations, the output data 474 may be decoded at another decoder device that matches the decoder portion 466 (e.g., in the device 402, in the second device 490, or in another device) to generate the representation 476 of the input data 470 as reconstructed data. For example, the second device 490 may include a decoder that matches (or substantially matches) the decoder portion 466, and the output data 474 may be sent to the second device 490 via the transmission medium 418. The second device 490 may process the output data 474 to generate the representation 476 of the input data 470 as reconstructed data.

The components of the system 400 may include and/or may be implemented using electronic circuits or other electronic hardware, which may include one or more programmable electronic circuits (e.g., microprocessors, graphics processing units (GPUs), digital signal processors (DSPs), central processing units (CPUs), and/or other suitable electronic circuits), and/or may include and/or be implemented using computer software, firmware, or any combination thereof, to perform the various operations described herein.

Although the system 400 is shown as including certain components, one of ordinary skill in the art will appreciate that the system 400 may include more or fewer components than those shown in FIG. 4. For example, the system 400 may also include, or may be part of a computing device that includes, an input device and an output device (not shown). In some implementations, the system 400 may also include, or may be part of a computing device that includes, one or more memory devices (e.g., one or more random access memory (RAM) components, read-only memory (ROM) components, cache memory components, buffer components, database components, and/or other memory devices), one or more processing devices (e.g., one or more CPUs, GPUs, and/or other processing devices) in communication with and/or electrically connected to the one or more memory devices, one or more wireless interfaces (e.g., including one or more transceivers and a baseband processor for each wireless interface) for performing wireless communications, one or more wired interfaces (e.g., a serial interface such as a universal serial bus (USB) input, a Lightning connector, and/or other wired interface) for performing communications over one or more hardwired connections, and/or other components that are not shown in FIG. 4.

In some implementations, the system 400 can be implemented locally by and/or included in a computing device. For example, the computing device can include a mobile device, a personal computer, a tablet computer, a virtual reality (VR) device (e.g., a head-mounted display (HMD) or other VR device), an augmented reality (AR) device (e.g., an HMD, AR glasses, or other AR device), a wearable device, a server (e.g., in a software-as-a-service (SaaS) system or other server-based system), a television, and/or any other computing device with the resource capabilities to perform the techniques described herein.

In one example, the E2E-NNVC system 410 may be incorporated into a portable electronic device that includes the memory 406 coupled to the processor 404 and configured to store instructions executable by the processor 404, and a wireless transceiver coupled to an antenna and to the processor 404 and operable to send the output data 474 to a remote device.

E2E-NNVC systems are commonly designed to process RGB input. Examples of image and video coding schemes for RGB input are described in J. Balle, D. Minnen, S. Singh, S. J. Hwang, N. Johnston, "Variational image compression with a scale hyperprior," ICLR, 2018 (referred to as the "J. Balle paper"), and in D. Minnen, J. Balle, G. Toderici, "Joint Autoregressive and Hierarchical Priors for Learned Image Compression," CVPR, 2018 (referred to as the "D. Minnen paper"), which are hereby incorporated by reference in their entirety and for all purposes.

FIG. 5 is a diagram illustrating an example of the E2E-NNVC system described in the J. Balle paper. The g_a and g_s subnetworks in the E2E-NNVC system of FIG. 5 correspond to the encoder subnetwork (e.g., the encoder portion 462) and the decoder subnetwork (e.g., the decoder portion 466), respectively. The g_a and g_s subnetworks of FIG. 5 are designed for a three-channel RGB input, where all three R, G, and B input channels pass through and are processed by the same neural network layers (convolutional layers and generalized divisive normalization (GDN) layers). The neural network layers can include convolutional layers that perform convolution operations and GDN and/or inverse GDN (IGDN) nonlinearity layers that implement local divisive normalization. Local divisive normalization is a type of transformation that has proven particularly suitable for density modeling and compression of images. E2E-NNVC systems such as that shown in FIG. 5 target input channels with similar statistical properties, such as RGB data (where the statistical properties of the different R, G, and B channels are similar).

Although E2E-NNVC systems are typically designed to process RGB input, most image and video coding systems use a YUV input format (e.g., in many cases, the YUV420 input format). The chroma (U and V) channels of data in a YUV format can be subsampled relative to the luma (Y) channel, with minimal impact on visual quality (e.g., no significant or noticeable effect on visual quality). Subsampled formats include the YUV420 format, the YUV422 format, and/or other YUV formats. In a YUV format, the correlation between the channels is reduced, which can differ from other color formats (e.g., the RGB format). Furthermore, the statistics of the luma (Y) and chroma (U and V) channels are different. For example, the U and V channels have smaller variance than the luma channel, whereas in the RGB format, for example, the statistical properties of the different R, G, and B channels are more similar. A video encoder-decoder (or codec) is designed according to the input characteristics of the data (e.g., the codec may encode and/or decode the data according to the input format of the data). For example, if the chroma channels of a frame are subsampled (e.g., the chroma channels have half the resolution of the luma channel), then when the codec predicts a block of the frame for motion compensation, the width and height of the luma block will each be twice those of the chroma block. In another example, the codec can determine how many pixels to encode or decode for chroma and for luma, among other determinations.
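The subsampling arithmetic described above can be made concrete with a short sketch. The helper name `plane_shapes` is hypothetical and for illustration only; it computes the Y, U, and V plane dimensions for the common YUV formats:

```python
def plane_shapes(height, width, fmt):
    """Return the (Y, U, V) plane shapes, as (rows, cols), for a YUV format.

    In 4:2:0 the chroma planes are halved in both dimensions (1/4 the samples);
    in 4:2:2 they are halved horizontally only; in 4:4:4 all planes match.
    """
    if fmt == "420":
        chroma = (height // 2, width // 2)
    elif fmt == "422":
        chroma = (height, width // 2)
    elif fmt == "444":
        chroma = (height, width)
    else:
        raise ValueError(f"unknown format: {fmt}")
    return (height, width), chroma, chroma

# For a 1080p frame in YUV 4:2:0, each chroma plane is 540x960.
y, u, v = plane_shapes(1080, 1920, "420")
```

This is why, as the text notes, a 4:2:0 chroma plane carries one quarter as many samples as the luma plane.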

If the RGB input data that most E2E-NNVC systems are designed to process (as noted above) is replaced with YUV 4:4:4 input data (in which all channels have the same dimensions), the performance of the E2E-NNVC system in processing the input data is reduced due to the different statistical characteristics of the luma (Y) and chroma (U and V) channels. As noted above, the chroma (U and V) channels are subsampled in some YUV formats, such as in the case of YUV420. For example, for content in the YUV 4:2:0 format, the U and V channel resolution is half of the Y channel resolution in each dimension (because the width and height are halved, the U and V channels are one quarter of the size of the Y channel). Such subsampling can make the input data incompatible with the input of an E2E-NNVC system. The input data is the information the E2E-NNVC system attempts to encode and/or decode (e.g., a YUV frame including three channels, including the luma (Y) and chroma (U and V) channels). Many neural-network-based systems assume that all channels of the input data have the same dimensions, and therefore feed all input channels to the same network. In such cases, the outputs of certain operations may be added (e.g., using matrix addition), in which case the channel dimensions must be the same.

In some examples, to address such issues, the Y channel can be subsampled into four half-resolution Y channels. The four half-resolution Y channels can be combined with the two chroma channels to form six input channels. The six input channels can be input or fed to an E2E-NNVC system designed for RGB input. Such an approach can resolve the issue of the resolution difference between the luma (Y) and chroma (U and V) channels. However, the inherent differences between the luma (Y) channel and the chroma (U and V) channels remain, resulting in poor coding (e.g., encoding and/or decoding) performance.
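The workaround described above amounts to a space-to-depth (pixel-unshuffle) step. The NumPy sketch below illustrates it on a tiny 4x4 luma plane; the helper `split_luma` is a hypothetical name, and a real front end would of course operate on full-size planes:

```python
import numpy as np

def split_luma(y):
    """Split a full-resolution luma plane (H, W), with H and W even, into
    four half-resolution planes by taking the four 2x2 phase offsets."""
    return np.stack([y[0::2, 0::2], y[0::2, 1::2],
                     y[1::2, 0::2], y[1::2, 1::2]])

# Toy 4x4 luma plane and matching 2x2 chroma planes (YUV 4:2:0 sizes).
y = np.arange(16, dtype=np.float32).reshape(4, 4)
u = np.zeros((2, 2), dtype=np.float32)
v = np.zeros((2, 2), dtype=np.float32)

# Four half-resolution Y planes plus U and V give six equal-size channels,
# which can be fed to a network designed for same-size (e.g., RGB) inputs.
six = np.concatenate([split_luma(y), u[None], v[None]])
```

As the text notes, this fixes only the resolution mismatch; the statistical differences between luma and chroma remain.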

As noted above, systems and techniques are described herein for performing image and/or video coding using one or more ML-based systems. The systems and techniques described herein provide a front-end architecture (e.g., a new subnetwork, such as in an end-to-end neural network based image and video coding (E2E-NNVC) system) designed to process input data having a luma-chroma (YUV) input format (e.g., YUV420, YUV444, YUV422, etc.). In some examples, the front-end architecture is configured to accommodate the YUV 4:2:0 input format in E2E-NNVC systems designed for the RGB input format. As noted above, the front-end architecture is applicable to many E2E-NNVC architectures (e.g., including the architectures described in the J. Balle paper and the D. Minnen paper). The systems and techniques described herein account for the different characteristics of the luma (Y) channel and the chroma (U and V) channels, as well as the resolution difference between the luma (Y) channel and the chroma (U and V) channels. The E2E-NNVC system can encode and/or decode standalone frames (or images) and/or video data that includes multiple frames.

In some examples, the systems and techniques described herein can initially input or feed the Y channel and the UV channels into two separate layers. The E2E-NNVC system can then combine the data associated with the Y channel and the UV channels after a certain number of layers (e.g., after a first pair of convolutional and nonlinearity layers or other layers, as shown in FIGS. 6A-6E described below). Because the U and V chroma components are subsampled relative to the luma (Y) channel, the subsampling in the first convolutional layer can be skipped, and a convolutional (e.g., CNN) kernel of a particular size (e.g., with a size of (N/2+1)x(N/2+1)) can be used for the subsampled input chroma (U and V) channels. A CNN kernel having a different size (e.g., an NxN CNN kernel) than the kernel used for the chroma (U and V) channels can then be used for the luma (Y) channel. The two branches of the front-end architecture (carrying the luma and chroma channel or component information, respectively) can be combined using a convolutional layer (e.g., a 1x1 convolutional layer) that combines values across the channels. The use of a 1x1 convolutional layer can provide various benefits as described herein, including improved coding efficiency.

FIGS. 6A-6F illustrate examples of front-end architectures for a neural network system. In some examples, the front-end architectures of FIGS. 6A-6F can be part of an E2E-NNVC system designed to process (encode and/or decode) data having the YUV 4:2:0 format. For example, the front-end architectures can be configured to process input data in the YUV 4:2:0 format. The front-end architectures of FIG. 6A, FIG. 6C, FIG. 6D, and FIG. 6E apply two different nonlinear operators after the 1x1 convolutional layer. For example, a generalized divisive normalization (GDN) operator is used in the architecture of FIG. 6A, whereas a parametric rectified linear unit (PReLU) nonlinear operator is applied in the architectures of FIGS. 6C-6E. In some examples, neural network architectures similar to those shown in FIG. 6A and FIGS. 6C-6F can be used to encode and/or decode other types of YUV content (e.g., content having the YUV 4:4:4 format, the YUV 4:2:2 format, etc.) and/or content having other input formats.

For example, FIG. 6A is a diagram illustrating an example of a front-end neural network system or architecture that can be configured to work directly with 4:2:0 input (Y, U, and V) data. As shown in FIG. 6A, at the encoder subnetwork of the neural network system, a 1x1 convolutional layer 606 is used to combine the branched luma and chroma channels (the luma (Y) channel 602 and the chroma (U and V) channels 604), and a GDN nonlinear operator 608 is then applied. Similar operations are performed at the decoder subnetwork of the neural network system, but in reverse order. For example, as shown in FIG. 6A, an inverse GDN (IGDN) operator 609 is applied, a 1x1 convolutional layer 613 is used to separate the Y channel from the U and V channels, and corresponding IGDN 615 and IGDN 616 layers and convolutional layers 617 and 618 are used to process the separated Y channel and U and V channels.

For example, the first two neural network layers in the encoder subnetwork of the neural network system of FIG. 6A include a first convolutional layer 611 (denoted Nconv | 3x3 | ↓1), a second convolutional layer 610 (denoted Nconv | 5x5 | ↓2), a first GDN layer 614, and a second GDN layer 612. The last two neural network layers in the decoder subnetwork of the front-end neural network architecture of FIG. 6A include a first inverse GDN (IGDN) layer 616, a second inverse GDN (IGDN) layer 615, a first convolutional layer 618 (denoted 2conv | 3x3 | ↑1) for generating the reconstructed chroma (U and V) components of a frame, and a second convolutional layer 617 (denoted 1conv | 5x5 | ↑2) for generating the reconstructed luma (Y) component of the frame. The "Nconv" notation refers to the number of output channels (N) of a given convolutional layer (corresponding to the number of output features), where the value of N defines the number of output channels. The 3x3 and 5x5 notations indicate the sizes of the corresponding convolution kernels (e.g., a 3x3 kernel and a 5x5 kernel). The "↓1" and "↓2" notations refer to stride values for downsampling (as indicated by "↓"), where ↓1 refers to a stride of 1 and ↓2 refers to a stride of 2. The "↑1" and "↑2" notations refer to stride values for upsampling (as indicated by "↑"), where ↑1 refers to a stride of 1 and ↑2 refers to a stride of 2.
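The stride notation above can be checked with a short output-size calculation. The sketch below assumes "same"-style padding of kernel // 2, a common choice that the text does not specify, so treat the padding as an assumption for illustration:

```python
def conv_out_size(size, kernel, stride):
    """Spatial output size of a convolution with padding = kernel // 2."""
    pad = kernel // 2
    return (size + 2 * pad - kernel) // stride + 1

# Luma branch "Nconv | 5x5 | ↓2": a 5x5 kernel with stride 2 halves each
# spatial dimension of the full-resolution luma input.
luma_out = conv_out_size(128, kernel=5, stride=2)

# Chroma branch "Nconv | 3x3 | ↓1": a 3x3 kernel with stride 1 keeps the
# (already half-resolution) chroma input at the same spatial size, so the
# two branches end up with matching feature-map dimensions.
chroma_out = conv_out_size(64, kernel=3, stride=1)
```

With a 128x128 luma plane and 64x64 chroma planes (4:2:0), both branches produce 64x64 feature maps, which is what allows the 1x1 layer 606 to combine them.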

For example, the convolutional layer 610 downsamples the input luma channel 602 by a factor of four (a factor of two in each of the horizontal and vertical dimensions) by applying 5x5 convolutional filters with a stride value of 2. The resulting output of the convolutional layer 610 is a set of N arrays of feature values (corresponding to N channels). The convolutional layer 611 processes the input chroma (U and V) channels 604 by applying 3x3 convolutional filters with a stride value of 1 in the horizontal and vertical dimensions. The resulting output of the convolutional layer 611 is a set of N arrays of feature values (corresponding to N channels). The arrays of feature values output by the convolutional layer 610 have the same dimensions as the arrays of feature values output by the convolutional layer 611. The GDN layer 612 can then process the feature values output by the convolutional layer 610, and the GDN layer 614 can process the feature values output by the convolutional layer 611.

The 1x1 convolutional layer 606 can then process the feature values output by the GDN layer 612 and the GDN layer 614. The 1x1 convolutional layer 606 can generate linear combinations of the features associated with the luma channel 602 and the chroma channels 604. The linear combination operation operates as a per-value cross-channel mixing of the Y and UV components, producing cross-component (e.g., cross luma and chroma) prediction that enhances coding performance. Each 1x1 convolutional filter of the 1x1 convolutional layer 606 can include a respective scaling factor that is applied to a corresponding Nth channel of the luma channel 602 and a corresponding Nth channel of the chroma channels 604.

FIG. 6B is a diagram illustrating an example operation of a 1x1 convolutional layer 638. As noted above, N denotes the number of output channels. As shown in FIG. 6B, 2N channels are provided as input to the 1x1 convolutional layer 638, including an N-channel chroma (combined U and V) output 632 and an N-channel luma (Y) output 634. In the example of FIG. 6B, the value of N is equal to 2, indicating two channels of values for the N-channel chroma output 632 and two channels of values for the N-channel luma output 634. Referring to FIG. 6A, the N-channel chroma output 632 can be the output from the GDN layer 614, and the N-channel luma output 634 can be the output from the GDN layer 612. However, in other examples, the N-channel chroma output 632 and the N-channel luma output 634 can be output from other nonlinearity layers (e.g., from the PReLU layer 652 and the PReLU layer 654 of FIG. 6D, respectively, or from the PReLU layer 662 and the PReLU layer 664 of FIG. 6E, respectively) or directly from convolutional layers (e.g., from the convolutional layer 670 and the convolutional layer 671 of FIG. 6F, respectively).

The 1x1 convolutional layer 638 processes the 2N channels, performs linear combinations of the features of the 2N channels, and then outputs a set of N channels of features or coefficients. The 1x1 convolutional layer 638 includes two 1x1 convolutional filters (based on N=2). A first 1x1 convolutional filter is shown with a value s1, and a second 1x1 convolutional filter is shown with a value s2. The value s1 represents a first scaling factor, and the value s2 represents a second scaling factor. In one illustrative example, the value s1 is equal to 3 and the value s2 is equal to 4. Each of the 1x1 convolutional filters of the 1x1 convolutional layer 638 has a stride value of 1, indicating that the scaling factor s1 and the scaling factor s2 are applied to every value of the UV output 632 and the Y output 634.

For example, the scaling factor s1 of the first 1x1 convolutional filter is applied to each value of the first channel (C1) of the UV output 632 and to each value of the first channel (C1) of the Y output 634. Once each value of the first channel (C1) of the UV output 632 and each value of the first channel (C1) of the Y output 634 has been scaled by the scaling factor s1 of the first 1x1 convolutional filter, the scaled values are combined into a first channel (C1) of output values 639. The scaling factor s2 of the second 1x1 convolutional filter is applied to each value of the second channel (C2) of the UV output 632 and to each value of the second channel (C2) of the Y output 634. After each value of the second channel (C2) of the UV output 632 and each value of the second channel (C2) of the Y output 634 has been scaled by the scaling factor s2 of the second 1x1 convolutional filter, the scaled values are combined into a second channel (C2) of the output values 639. The four Y and UV channels (two Y channels and two combined UV channels) are thus mixed or combined into two output channels C1 and C2.
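The per-channel mixing just described for FIG. 6B can be sketched as follows. This follows the simplified illustration in the text, where filter i applies a single scale s_i to channel i of both branches and sums the results; a general learned 1x1 convolution would instead have an independent weight for every one of the 2N input channels:

```python
import numpy as np

def mix_1x1(y_feat, uv_feat, scales):
    """Per-channel 1x1 mixing as illustrated in FIG. 6B.

    y_feat, uv_feat: arrays of shape (N, H, W); scales: N scaling factors.
    Output channel i is s_i * (Y channel i + UV channel i), i.e. the same
    scale is applied to both branches before they are combined.
    """
    return np.stack([s * (y_feat[i] + uv_feat[i])
                     for i, s in enumerate(scales)])

# Toy N=2 example with the illustrative scales s1=3 and s2=4 from the text.
y_feat = np.ones((2, 4, 4), dtype=np.float32)
uv_feat = 2 * np.ones((2, 4, 4), dtype=np.float32)
out = mix_1x1(y_feat, uv_feat, scales=[3.0, 4.0])
```

Here 2N = 4 input channels are mixed down to N = 2 output channels, matching the figure.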

Returning to FIG. 6A, the output of the 1x1 convolutional layer 606 is processed by the additional GDN layers and additional convolutional layers of the encoder subnetwork. A quantization engine 620 can perform quantization on the features output by the final neural network layer 619 of the encoder subnetwork to generate a quantized output. An entropy encoding engine 621 can entropy encode the quantized output from the quantization engine 620 to generate a bitstream. As shown in FIG. 6A, the entropy encoding engine 621 can perform the entropy encoding using a prior generated by a hyperprior network. The neural network system can output the bitstream for storage, for transmission to another device, a server device, or a system, and/or can otherwise output the bitstream.

The decoder sub-network of the neural network system, or the decoder sub-network of another neural network system (of another device), may decode the bitstream. For example, as shown in FIG. 6A, the entropy decoding engine 622 of the decoder sub-network may entropy decode the bitstream and output the entropy-decoded data to the dequantization engine 623. The entropy decoding engine 622 may perform the entropy decoding using the prior generated by the hyperprior network, as shown in FIG. 6A. The dequantization engine 623 may dequantize the data. The dequantized data may then be processed by multiple convolutional layers and multiple inverse-GDN (IGDN) layers of the decoder sub-network.

After the data is processed by the IGDN layer 609, the 1x1 convolutional layer 613 may process the data. The 1x1 convolutional layer 613 may include 2N convolutional filters, which may divide the data into Y-channel features and combined UV-channel features. For example, each of the N channels output by the IGDN layer 609 may be processed using the 2N 1x1 convolutions (resulting in scaling) of the 1x1 convolutional layer 613. For each scaling factor n_i corresponding to one output channel and applied to the N input channels (2N output channels in total), the decoder sub-network may perform a summation over the N input channels, producing 2N outputs. In one illustrative example, for the scaling factor n1, the decoder sub-network may apply the scaling factor n1 to the N input channels and may sum the results, which yields one output channel. The decoder sub-network may perform this operation for the 2N different scaling factors (e.g., scaling factor n1, scaling factor n2, through scaling factor n2N).
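The per-factor scale-and-sum described above can be sketched as follows; the channel count N and the values of the scaling factors are assumptions chosen only to make the shapes concrete:

```python
import numpy as np

# N (the channel count out of IGDN layer 609) and the scaling-factor values
# are assumed; each of the 2N filters reduces to a single scale applied to
# all N input channels followed by a sum over those channels.
N = 4
rng = np.random.default_rng(1)
x = rng.standard_normal((N, 8, 8))  # N channels output by IGDN layer 609
n = rng.standard_normal(2 * N)      # scaling factors n_1 ... n_2N

# For each n_i: scale all N input channels by n_i and sum over them, giving
# one output channel; doing this for all 2N factors yields 2N outputs.
outputs = np.stack([(n_i * x).sum(axis=0) for n_i in n])
```

The expansion from N channels to 2N channels is what lets the layer hand separate Y-channel features and combined UV-channel features to the two reconstruction branches.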

The Y-channel features output by the 1x1 convolutional layer 613 may be processed by the IGDN 615. The combined UV-channel features output by the 1x1 convolutional layer 613 may be processed by the IGDN 616. The convolutional layer 617 may process the Y-channel features and output a reconstructed Y channel (e.g., luma samples or pixels) for each pixel or sample of the reconstructed frame, shown as reconstructed Y component 624. The convolutional layer 618 may process the combined UV-channel features and may output a reconstructed U channel (e.g., chroma-blue samples or pixels) and a reconstructed V channel (e.g., chroma-red samples or pixels) for each pixel or sample of the reconstructed frame, shown as reconstructed U and V components 625.

FIG. 6C is a diagram illustrating another example of a front-end neural network system or architecture that can be configured to operate directly on 4:2:0 (Y, U, and V) input data. As shown in FIG. 6C, at the encoder sub-network of the neural network system, a 1x1 convolutional layer 648 is used to combine the branched luma and chroma channels (luma channel 642 and chroma channels 644), similar to the 1x1 convolutional layer 606 of FIG. 6A described above, after which a pReLU nonlinear operator 649 is applied. In other examples, nonlinear operators other than the pReLU nonlinear operator may be applied. Similar operations are performed by the decoder sub-network of the neural network system of FIG. 6C (similar to those described above with respect to FIG. 6A), but in reverse order (e.g., applying the pReLU operator, using a 1x1 convolutional layer to separate the Y and U, V channels, and processing the separated Y and U, V channels using the corresponding IGDN and convolutional layers).

Compared with the E2E-NNVC system (neural-network-based codec) described in FIG. 5, the input processing of the front-end architectures of FIG. 6A and FIG. 6C is modified by processing the Y channel and the UV channels separately in the first two network layers of g_a (encoder side) and the corresponding layers of g_s (decoder side). The first convolutional layer used to process the Y component (e.g., convolutional layer 610 in FIG. 6A and convolutional layer 646 in FIG. 6C), denoted Nconv|5x5|↓2, can be the same as or similar to the first convolutional layer 510 in FIG. 5. Similarly, the convolutional layer of the decoder sub-networks of FIG. 6A and FIG. 6C used to produce the reconstructed luma (Y) component, denoted 1conv|5x5|↑2, can be the same as or similar to the last convolutional layer of the decoder sub-network g_s in the system of FIG. 5. Unlike the system of FIG. 5, the U and V chroma channels are processed by the architectures of FIG. 6A and FIG. 6C using a separate convolutional layer (e.g., a separate CNN, such as convolutional layer 611 in FIG. 6A or convolutional layer 647 in FIG. 6C), denoted Nconv|3x3|↓1, whose kernel size is half that of the Y kernel of the Nconv|5x5|↓2 convolutional layer 610 in FIG. 6A or the Nconv|5x5|↓2 convolutional layer 646 in FIG. 6C (and with no downsampling, corresponding to a stride equal to 1), followed by dedicated GDN layers (one GDN layer for the luma Y and one GDN layer for the chroma U and V).

After the convolutional layers of FIG. 6A and FIG. 6C (the first pair of CNN layers, Nconv|5x5|↓2 and Nconv|3x3|↓1) and the GDN layers, the representations or features of the luma (Y) channel and the chroma (U and V) channels (e.g., transformed or filtered versions of the input channels) have the same dimensions, and are then combined using the 1x1 convolutional layer 606 of FIG. 6A or the 1x1 convolutional layer 648 of FIG. 6C. For example, in the YUV 4:2:0 format, the luma (Y) channel is twice the size of the chroma (U and V) channels in each dimension. Because the chroma (U and V) channels are already subsampled by a factor of two in the 4:2:0 input, the output produced by processing these channels (with a stride of 1) has the same dimensions as the conv2d output for the luma channel (whose input is not subsampled but is halved by the stride-2 convolution). Normalizing the channels separately accounts for the difference in variance between the luma and chroma channels. As noted above, a nonlinear operator can then be applied (e.g., using the GDN 608 or the pReLU 649) before the three additional convolutional layers are used, until the quantization step is reached.
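The shape alignment between the two branches follows from standard convolution output-size arithmetic; the 128x128 luma resolution and the padding values in this sketch are assumptions, not values stated by the architecture:

```python
# Standard convolution output-size arithmetic shows why the two branches
# align; the 128x128 luma resolution and the padding values are assumed.
def conv_out(size, kernel, stride, pad):
    return (size + 2 * pad - kernel) // stride + 1

h = 128                 # luma dimension; 4:2:0 chroma is half of this
ch = h // 2             # chroma dimension

y_out = conv_out(h, kernel=5, stride=2, pad=2)    # Nconv|5x5|stride 2 (luma)
uv_out = conv_out(ch, kernel=3, stride=1, pad=1)  # Nconv|3x3|stride 1 (chroma)
```

With these assumed paddings, both branches emit 64x64 feature maps: the stride-2 luma convolution halves the full-resolution plane, while the stride-1 chroma convolution preserves the already-halved 4:2:0 planes.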

In the decoder sub-networks of the architectures in FIG. 6A and FIG. 6C, separate IGDN and convolutional layers are used to separately generate the reconstructed luma (Y) component and the reconstructed chroma (U and V) components. For example, the kernel size of the convolutional layer 618 of FIG. 6A used to generate the reconstructed chroma (U and V) components 625 (the 2conv|3x3|↑1 layer of the decoder sub-network) is approximately half the kernel size used in the convolutional layer 617 used to generate the reconstructed luma (Y) component 624 (the 1conv|5x5|↑2 layer of the decoder sub-network), and it performs no upsampling, corresponding to a stride equal to 1.

FIG. 6D is a diagram illustrating another example of a front-end neural network architecture that can be configured to operate directly on 4:2:0 (Y, U, and V) input data. As shown in FIG. 6D, on the encoder side, a 1x1 convolutional layer is used to combine the branched luma and chroma channels, after which a pReLU nonlinear operator is applied. Compared with the architectures shown in FIG. 6A and FIG. 6C, the GDN layers in the luma and chroma branches are replaced with PReLU operators.

FIG. 6E is a diagram illustrating another example of a front-end neural network architecture that can be configured to operate directly on 4:2:0 (Y, U, and V) input data. As shown in FIG. 6E, on the encoder side, a 1x1 convolutional layer is used to combine the branched luma and chroma channels, after which a pReLU nonlinear operator is applied. Compared with the architectures shown in FIG. 6A, FIG. 6C, and FIG. 6D, all of the GDN layers of the architecture in FIG. 6E are replaced with PReLU operators.

FIG. 6F is a diagram illustrating another example of a front-end neural network architecture that can be configured to operate directly on 4:2:0 (Y, U, and V) input data. As shown in FIG. 6F, on the encoder side, a 1x1 convolutional layer is used to combine the branched luma and chroma channels. Compared with the architectures shown in FIGS. 6A-6E, all of the GDN layers are removed entirely, and no nonlinear activation operations are used between the convolutional layers.

The neural network architecture designs shown in FIGS. 6C-6F can be used to reduce the number of GDN layers (e.g., as in the architecture of FIG. 6C) or to remove the GDN layers entirely (e.g., as in the architectures of FIG. 6E and FIG. 6F).

In some examples, the systems and techniques described herein can be used with other encoder-decoder sub-networks that use a combination of convolutional (e.g., CNN) and normalization stages at the input of a neural-network-based coding system.

FIG. 7 is a flowchart illustrating an example of a process 700 for processing video using one or more of the machine learning techniques described herein. At block 702, the process 700 includes generating, by a first convolutional layer of an encoder sub-network of a neural network system, output values associated with a luma channel of a frame.

At block 704, the process 700 includes generating, by a second convolutional layer of the encoder sub-network, output values associated with at least one chroma channel of the frame. At block 706, the process 700 includes generating, by a third convolutional layer, a combined representation of the frame based on the output values associated with the luma channel of the frame and the output values associated with the at least one chroma channel of the frame. In some cases, the third convolutional layer includes a 1x1 convolutional layer (e.g., the 1x1 convolutional layer of the encoder sub-networks of FIGS. 6A-6F) that includes one or more 1x1 convolutional filters. At block 708, the process 700 includes generating encoded video data based on the combined representation of the frame.
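The data flow of blocks 702-708 can be sketched with stand-in layers; the functions below mimic only the strides and channel counts of the real convolutional layers (downsampling in place of the learned luma filter, stacking in place of the chroma filter, averaging in place of the learned 1x1 mixing weights), so the names and weights are illustrative assumptions:

```python
import numpy as np

# Stand-in layers that mimic only the strides and channel counts of blocks
# 702-706; the function names and the averaging weights are illustrative,
# not the learned filters of the actual encoder sub-network.
def luma_branch(y):              # block 702: first conv layer, stride 2
    return y[::2, ::2][None]     # one channel at half resolution

def chroma_branch(u, v):         # block 704: second conv layer, stride 1
    return np.stack([u, v])      # two channels at native 4:2:0 resolution

def combine(y_feats, uv_feats):  # block 706: 1x1 conv over stacked channels
    stacked = np.concatenate([y_feats, uv_feats])
    weights = np.full(stacked.shape[0], 1.0 / stacked.shape[0])
    return np.tensordot(weights, stacked, axes=1)[None]

y = np.ones((64, 64))            # assumed 4:2:0 input planes
u = np.zeros((32, 32))
v = np.zeros((32, 32))
combined = combine(luma_branch(y), chroma_branch(u, v))
```

After the stride-2 luma branch, all three feature maps are 32x32, so the 1x1 mixing at block 706 is a well-defined pointwise combination.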

In some examples, the process 700 includes processing the output values associated with the luma channel of the frame using a first nonlinear layer of the encoder sub-network. The process 700 may include processing the output values associated with the at least one chroma channel of the frame using a second nonlinear layer of the encoder sub-network. In such examples, the combined representation is generated based on the output of the first nonlinear layer and the output of the second nonlinear layer. In some cases, the combined representation of the frame is generated by the third convolutional layer (e.g., the 1x1 convolutional layer of the encoder sub-networks of FIGS. 6A-6F) using the output of the first nonlinear layer and the output of the second nonlinear layer as input.

In some examples, the process 700 includes quantizing the encoded video data (e.g., using the quantization engine 620). In some examples, the process 700 includes entropy coding the encoded video data (e.g., using the entropy encoding engine 621). In some examples, the process 700 includes storing the encoded video data in memory. In some examples, the process 700 includes transmitting the encoded video data over a transmission medium to at least one device.

In some examples, the process 700 includes obtaining an encoded frame. The process 700 may include generating, by a first convolutional layer of a decoder sub-network of the neural network system, reconstructed output values associated with a luma channel of the encoded frame. The process 700 may further include generating, by a second convolutional layer of the decoder sub-network, reconstructed output values associated with at least one chroma channel of the encoded frame. In some examples, the process 700 includes separating the luma channel of the encoded frame from the at least one chroma channel of the encoded frame using a third convolutional layer of the decoder sub-network. In some cases, the third convolutional layer of the decoder sub-network includes a 1x1 convolutional layer (e.g., the 1x1 convolutional layer of the decoder sub-networks of FIGS. 6A-6F) that includes one or more 1x1 convolutional filters.

In some examples, the frame includes a video frame. In some examples, the at least one chroma channel includes a chroma-blue channel and a chroma-red channel. In some examples, the frame has a luma-chroma (YUV) format.

FIG. 8 is a flowchart illustrating an example of a process 800 for processing video using one or more of the machine learning techniques described herein. At block 802, the process 800 includes obtaining an encoded frame. At block 804, the process 800 includes separating, by a first convolutional layer of a decoder sub-network, a luma channel of the encoded frame from at least one chroma channel of the encoded frame. In some cases, the first convolutional layer of the decoder sub-network includes a 1x1 convolutional layer (e.g., the 1x1 convolutional layer of the decoder sub-networks of FIGS. 6A-6F) that includes one or more 1x1 convolutional filters. At block 806, the process 800 includes generating, by a second convolutional layer of the decoder sub-network of the neural network system, reconstructed output values associated with the luma channel of the encoded frame. At block 808, the process 800 includes generating, by a third convolutional layer of the decoder sub-network, reconstructed output values associated with the at least one chroma channel of the encoded frame. At block 810, the process 800 includes generating an output frame that includes the reconstructed output values associated with the luma channel and the reconstructed output values associated with the at least one chroma channel.
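The decoder-side data flow of blocks 804-810 can be sketched in the same stand-in style: the 1x1 split is modeled as channel selection, luma reconstruction as 2x nearest-neighbor upsampling, and chroma reconstruction as an identity, so only the shapes and the branch structure are meaningful, not the learned filtering:

```python
import numpy as np

# Stand-in layers for blocks 804-810: the 1x1 split is modeled as channel
# selection, luma reconstruction as 2x nearest-neighbor upsampling, and
# chroma reconstruction as identity; only the data flow is meaningful.
def split(features):               # block 804: 1x1 conv separates branches
    return features[:1], features[1:]

def reconstruct_luma(y_feats):     # block 806: 5x5 conv with 2x upsampling
    return np.kron(y_feats[0], np.ones((2, 2)))

def reconstruct_chroma(uv_feats):  # block 808: 3x3 conv, stride 1
    return uv_feats

feats = np.ones((3, 32, 32))       # assumed decoder features: 1 Y + 2 UV
y_feats, uv_feats = split(feats)
y_rec = reconstruct_luma(y_feats)           # 64x64 luma plane
u_rec, v_rec = reconstruct_chroma(uv_feats) # two 32x32 chroma planes
```

The mismatched strides of the two reconstruction branches restore the 4:2:0 geometry: a full-resolution luma plane alongside half-resolution U and V planes (block 810 assembles them into the output frame).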

In some examples, the process 800 includes processing values associated with the luma channel of the encoded frame using a first nonlinear layer of the decoder sub-network. The reconstructed output values associated with the luma channel are generated based on the output of the first nonlinear layer. The process 800 may include processing values associated with the at least one chroma channel of the encoded frame using a second nonlinear layer of the decoder sub-network. The reconstructed output values associated with the at least one chroma channel are generated based on the output of the second nonlinear layer.

In some examples, the process 800 includes dequantizing samples of the encoded frame (e.g., via the dequantization engine 623). In some examples, the process 800 includes entropy decoding samples of the encoded frame (e.g., via the entropy decoding engine 622). In some examples, the process 800 includes storing the output frame in memory. In some examples, the process 800 includes displaying the output frame.

In some examples, the process 800 includes generating, by a first convolutional layer of an encoder sub-network of the neural network system, output values associated with a luma channel of a frame. The process 800 may include generating, by a second convolutional layer of the encoder sub-network, output values associated with at least one chroma channel of the frame. The process 800 may further include generating, by a third convolutional layer of the encoder sub-network, a combined representation of the frame based on the output values associated with the luma channel of the frame and the output values associated with the at least one chroma channel of the frame. The process 800 may include generating the encoded frame based on the combined representation of the frame. In some cases, the third convolutional layer of the encoder sub-network includes a 1x1 convolutional layer (e.g., the 1x1 convolutional layer of the encoder sub-networks of FIGS. 6A-6F) that includes one or more 1x1 convolutional filters.

In some examples, the process 800 includes processing the output values associated with the luma channel of the frame using a first nonlinear layer of the encoder sub-network, and processing the output values associated with the at least one chroma channel of the frame using a second nonlinear layer of the encoder sub-network. In such examples, the combined representation is generated based on the output of the first nonlinear layer and the output of the second nonlinear layer. In some examples, the combined representation of the frame is generated by the third convolutional layer of the encoder sub-network using the output of the first nonlinear layer and the output of the second nonlinear layer as input.

In some examples, the encoded frame includes an encoded video frame. In some examples, the at least one chroma channel includes a chroma-blue channel and a chroma-red channel. In some examples, the encoded frame has a luma-chroma (YUV) format.

In some examples, the processes described herein (e.g., the process 700, the process 800, and/or other processes described herein) may be performed by a computing device or apparatus (such as a computing device having the computing device architecture 900 shown in FIG. 9). In one example, the process 700 and/or the process 800 can be performed by a computing device with the computing device architecture 900 implementing one of the neural network architectures shown in FIGS. 6A-6F. In some examples, the computing device can include a mobile device (e.g., a mobile phone, a tablet computing device), a wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a video server, a television, a vehicle (or a computing device of a vehicle), a robotic device, and/or any other computing device with the resource capabilities to perform the processes described herein, including the process 700 and/or the process 800.

In some cases, the computing device or apparatus may include various components, such as one or more input devices, one or more output devices, one or more processors, one or more microprocessors, one or more microcomputers, one or more transmitters, receivers, or combined transmitter-receivers (e.g., referred to as transceivers), one or more cameras, one or more sensors, and/or other components configured to carry out the steps of the processes described herein. In some examples, the computing device may include a display, a network interface configured to communicate and/or receive data, any combination thereof, and/or other components. The network interface may be configured to communicate and/or receive Internet Protocol (IP)-based data or other types of data.

The components of the computing device can be implemented in circuitry. For example, the components can include and/or can be implemented using electronic circuits or other electronic hardware, which can include one or more programmable electronic circuits (e.g., microprocessors, graphics processing units (GPUs), digital signal processors (DSPs), central processing units (CPUs), neural processing units (NPUs), and/or other suitable electronic circuits), and/or can include and/or be implemented using computer software, firmware, or any combination thereof, to perform the various operations described herein.

The process 700 and the process 800 are illustrated as logical flow diagrams, the operations of which represent a sequence of operations that can be implemented in hardware, computer instructions, or a combination thereof. In the context of computer instructions, the operations represent computer-executable instructions stored on one or more computer-readable storage media that, when executed by one or more processors, perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the processes.

Additionally, the processes described herein (including the process 700, the process 800, and/or other processes described herein) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or a combination thereof. As noted above, the code may be stored on a computer-readable or machine-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. The computer-readable or machine-readable storage medium may be non-transitory.

FIG. 9 illustrates an example computing device architecture 900 of an example computing device that can implement the various techniques described herein. In some examples, the computing device can include a mobile device, a wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a video server, a vehicle (or a computing device of a vehicle), or another device. For example, the computing device architecture 900 can implement the system of FIG. 6. The components of the computing device architecture 900 are shown in electrical communication with each other using a connection 905 (e.g., a bus). The example computing device architecture 900 includes a processing unit (CPU or processor) 910 and a computing device connection 905 that couples various computing device components, including the computing device memory 915 (such as read-only memory (ROM) 920 and random access memory (RAM) 925), to the processor 910.

The computing device architecture 900 can include a cache of high-speed memory connected directly with, in close proximity to, or integrated as part of the processor 910. The computing device architecture 900 can copy data from the memory 915 and/or the storage device 930 to the cache 912 for quick access by the processor 910. In this way, the cache can provide a performance boost that avoids delays while the processor 910 waits for data. These and other modules can control or be configured to control the processor 910 to perform various actions. Other computing device memory 915 may be available for use as well. The memory 915 can include multiple different types of memory with different performance characteristics. The processor 910 can include any general-purpose processor and a hardware or software service (such as service 1 932, service 2 934, and service 3 936 stored in the storage device 930) configured to control the processor 910, as well as a special-purpose processor where software instructions are incorporated into the processor design. The processor 910 can be a self-contained system, containing multiple cores or processors, a bus, a memory controller, a cache, and so forth. A multi-core processor can be symmetric or asymmetric.

To enable user interaction with the computing device architecture 900, an input device 945 can represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, a keyboard, a mouse, motion input, speech, and so on. An output device 935 can also be one or more of a number of output mechanisms known to those of ordinary skill in the art, such as a display, a projector, a television, a speaker device, and so on. In some instances, a multimodal computing device can enable a user to provide multiple types of input to communicate with the computing device architecture 900. A communication interface 940 can generally govern and manage the user input and computing device output. There is no restriction on operating on any particular hardware arrangement, and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.

The storage device 930 is a non-volatile memory and can be a hard disk or another type of computer-readable medium that can store data accessible by a computer, such as magnetic cassettes, flash memory cards, solid-state memory devices, digital versatile disks, cartridges, random access memory (RAM) 925, read-only memory (ROM) 920, and hybrids thereof. The storage device 930 can include services 932, 934, and 936 for controlling the processor 910. Other hardware or software modules are contemplated. The storage device 930 can be connected to the computing device connection 905. In one aspect, a hardware module that performs a particular function can include a software component stored in a computer-readable medium in connection with the hardware components necessary to carry out the function, such as the processor 910, the connection 905, the output device 935, and so forth.

Aspects of the present disclosure are applicable to any suitable electronic device (such as security systems, smartphones, tablets, laptops, vehicles, drones, or other devices) including or coupled to one or more active depth sensing systems. While described below with respect to a device having or coupled to one light projector, aspects of the present disclosure are applicable to devices having any number of light projectors and are therefore not limited to specific devices.

The term "device" is not limited to one or a specific number of physical objects (such as one smartphone, one controller, one processing system, and so on). As used herein, a device may be any electronic device with one or more parts that may implement at least some portions of this disclosure. While the below description and examples use the term "device" to describe various aspects of this disclosure, the term "device" is not limited to a specific configuration, type, or number of objects. Additionally, the term "system" is not limited to multiple components or a specific embodiment. For example, a system may be implemented on one or more printed circuit boards or other substrates, and may have movable or static components. While the below description and examples use the term "system" to describe various aspects of this disclosure, the term "system" is not limited to a specific configuration, type, or number of objects.

Specific details are provided in the description above to provide a thorough understanding of the embodiments and examples provided herein. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For clarity of explanation, in some instances the present technology may be presented as including individual functional blocks comprising devices, device components, steps or routines in a method embodied in software, or combinations of hardware and software. Additional components may be used other than those shown in the figures and/or described herein. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments.

Individual embodiments may be described above as a process or method which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed, but it could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, and so on. When a process corresponds to a function, its termination can correspond to a return of the function to the calling function or the main function.

Processes and methods according to the above-described examples can be implemented using computer-executable instructions that are stored in or otherwise available from computer-readable media. Such instructions can include, for example, instructions and data which cause or otherwise configure a general purpose computer, special purpose computer, or processing device to perform a certain function or group of functions. Portions of the computer resources used can be accessible over a network. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, source code, and so on.

The term "computer-readable medium" includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other media capable of storing, containing, or carrying instruction(s) and/or data. A computer-readable medium may include a non-transitory medium in which data can be stored, and it does not include carrier waves and/or transitory electronic signals propagating wirelessly or over wired connections. Examples of a non-transitory medium may include, but are not limited to, a magnetic disk or tape, optical storage media, flash memory, memory or memory devices, magnetic or optical disks, USB devices provided with non-volatile memory, networked storage devices, compact disks (CDs) or digital versatile disks (DVDs), any suitable combination thereof, among others. A computer-readable medium may have stored thereon code and/or machine-executable instructions that may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a software component, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, and so forth may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, or the like.

In some embodiments, the computer-readable storage devices, media, and memories can include a cable or wireless signal containing a bitstream and the like. However, when mentioned, non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.

Devices implementing processes and methods according to these disclosures can include hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof, and can take any of a variety of form factors. When implemented in software, firmware, middleware, or microcode, the program code or code segments to perform the necessary tasks (e.g., a computer-program product) may be stored in a computer-readable or machine-readable medium. A processor may perform the necessary tasks. Typical examples of form factors include laptops, smartphones, mobile phones, tablet devices or other small form factor personal computers, personal digital assistants, rackmount devices, and standalone devices, among others. Functionality described herein can also be embodied in peripherals or add-in cards. By way of further example, such functionality can also be implemented on a circuit board among different chips, or among different processes executing in a single device.

The instructions, media for conveying such instructions, computing resources for executing them, and other structures for supporting such computing resources are example means for providing the functions described in the disclosure.

In the foregoing description, aspects of the application are described with reference to specific embodiments thereof, but those skilled in the art will recognize that the application is not limited thereto. Thus, while illustrative embodiments of the application have been described in detail herein, it is to be understood that the inventive concepts may be otherwise variously embodied and employed, and that the appended claims are intended to be construed to include such variations, except as limited by the prior art. Various features and aspects of the above-described application may be used individually or jointly. Further, embodiments can be utilized in any number of environments and applications beyond those described herein without departing from the broader spirit and scope of the specification. The specification and drawings are, accordingly, to be regarded as illustrative rather than restrictive. For the purposes of illustration, the methods were described in a particular order. It should be appreciated that in alternate embodiments, the methods may be performed in an order different from that described.

One of ordinary skill will appreciate that the less than ("<") and greater than (">") symbols or terminology used herein can be replaced with less than or equal to ("≦") and greater than or equal to ("≧") symbols, respectively, without departing from the scope of this description.

Where components are described as being "configured to" perform certain operations, such configuration can be accomplished, for example, by designing electronic circuits or other hardware to perform the operation, by programming programmable electronic circuits (e.g., microprocessors, or other suitable circuits) to perform the operation, or any combination thereof.

The phrase "coupled to" refers to any component that is physically connected to another component either directly or indirectly, and/or any component that is in communication with another component (e.g., connected to the other component over a wired or wireless connection, and/or other suitable communication interface) either directly or indirectly.

Claim language or other language reciting "at least one of" a set and/or "one or more" of a set indicates that one member of the set or multiple members of the set satisfy the claim. For example, claim language reciting "at least one of A and B" or "at least one of A or B" means A, B, or A and B. In another example, claim language reciting "at least one of A, B, and C" or "at least one of A, B, or C" means A, B, C, or A and B, or A and C, or B and C, or A and B and C. The language "at least one of" a set and/or "one or more" of a set does not limit the set to the items listed in the set. For example, claim language reciting "at least one of A and B" or "at least one of A or B" can mean A, B, or A and B, and can additionally include items not listed in the set of A and B.

The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, firmware, or combinations thereof. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The techniques described herein may also be implemented in electronic hardware, computer software, firmware, or any combination thereof. Such techniques may be implemented in any of a variety of devices such as general purpose computers, wireless communication device handsets, or integrated circuit devices having multiple uses, including application in wireless communication device handsets and other devices. Any features described as modules or components may be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized at least in part by a computer-readable data storage medium comprising program code including instructions that, when executed, perform one or more of the methods described above. The computer-readable data storage medium may form part of a computer program product, which may include packaging materials. The computer-readable medium may comprise memory or data storage media, such as random access memory (RAM) such as synchronous dynamic random access memory (SDRAM), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic or optical data storage media, and the like. The techniques additionally, or alternatively, may be realized at least in part by a computer-readable communication medium that carries or communicates program code in the form of instructions or data structures and that can be accessed, read, and/or executed by a computer, such as propagated signals or waves.

The program code may be executed by a processor, which may include one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Such a processor may be configured to perform any of the techniques described in this disclosure. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Accordingly, the term "processor," as used herein, may refer to any of the foregoing structure, any combination of the foregoing structure, or any other structure or apparatus suitable for implementation of the techniques described herein.

Illustrative examples of the disclosure include:

Aspect 1: A method of processing video data, the method comprising: generating, by a first convolutional layer of an encoder sub-network of a neural network system, output values associated with a luma channel of a frame; generating, by a second convolutional layer of the encoder sub-network, output values associated with at least one chroma channel of the frame; generating, by a third convolutional layer, a combined representation of the frame based on the output values associated with the luma channel of the frame and the output values associated with the at least one chroma channel of the frame; and generating encoded video data based on the combined representation of the frame.

Aspect 2: The method of Aspect 1, wherein the third convolutional layer includes a 1x1 convolutional layer including one or more 1x1 convolutional filters.

Aspect 3: The method of any one of Aspects 1 or 2, further comprising: processing, using a first non-linear layer of the encoder sub-network, the output values associated with the luma channel of the frame; and processing, using a second non-linear layer of the encoder sub-network, the output values associated with the at least one chroma channel of the frame; wherein the combined representation is generated based on an output of the first non-linear layer and an output of the second non-linear layer.

Aspect 4: The method of Aspect 3, wherein the combined representation of the frame is generated by the third convolutional layer using the output of the first non-linear layer and the output of the second non-linear layer as input.
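Aspects 1 to 4 describe a two-branch encoder front-end: one convolutional layer for the luma channel, another for the chroma channels, a non-linear layer per branch, and a third (1x1) convolutional layer that fuses the two branch outputs into a combined representation. The NumPy sketch below illustrates that data flow only; the frame size, kernel sizes, channel counts, the stride-2 luma convolution (assumed so that the luma branch matches the half-resolution 4:2:0 chroma branch), the ReLU non-linearity, and the random weights are all illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def conv2d(x, w, stride=1, pad=0):
    """Naive 2-D convolution. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    if pad:
        x = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    c_out, c_in, k, _ = w.shape
    oh = (x.shape[1] - k) // stride + 1
    ow = (x.shape[2] - k) // stride + 1
    y = np.empty((c_out, oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[:, i * stride:i * stride + k, j * stride:j * stride + k]
            y[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return y

def relu(x):
    # Stand-in for the encoder's non-linear layers.
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
H, W = 16, 16                                         # illustrative frame size
y_plane = rng.standard_normal((1, H, W))              # luma (Y) channel
uv_planes = rng.standard_normal((2, H // 2, W // 2))  # chroma (U, V), 4:2:0

w_luma = rng.standard_normal((8, 1, 3, 3)) * 0.1    # first conv layer (luma branch)
w_chroma = rng.standard_normal((8, 2, 3, 3)) * 0.1  # second conv layer (chroma branch)
w_fuse = rng.standard_normal((16, 16, 1, 1)) * 0.1  # third layer: 1x1 conv filters

# Stride 2 on luma so both branches produce the same spatial size.
luma_out = relu(conv2d(y_plane, w_luma, stride=2, pad=1))        # (8, 8, 8)
chroma_out = relu(conv2d(uv_planes, w_chroma, stride=1, pad=1))  # (8, 8, 8)

# The 1x1 convolution mixes the concatenated branch outputs per spatial position.
combined = conv2d(np.concatenate([luma_out, chroma_out]), w_fuse)
print(combined.shape)  # (16, 8, 8)
```

The 1x1 fusion layer acts purely across channels, which is why the two branches must first be brought to a common spatial resolution.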

Aspect 5: The method of any one of Aspects 1 to 4, further comprising: quantizing the encoded video data.

Aspect 6: The method of any one of Aspects 1 to 5, further comprising: entropy coding the encoded video data.

Aspect 7: The method of any one of Aspects 1 to 6, further comprising: storing the encoded video data in memory.

Aspect 8: The method of any one of Aspects 1 to 7, further comprising: transmitting the encoded video data over a transmission medium to at least one device.

Aspect 9: The method of any one of Aspects 1 to 8, further comprising: obtaining an encoded frame; generating, by a first convolutional layer of a decoder sub-network of the neural network system, reconstructed output values associated with a luma channel of the encoded frame; and generating, by a second convolutional layer of the decoder sub-network, reconstructed output values associated with at least one chroma channel of the encoded frame.

Aspect 10: The method of Aspect 9, further comprising: separating, using a third convolutional layer of the decoder sub-network, the luma channel of the encoded frame from the at least one chroma channel of the encoded frame.

Aspect 11: The method of Aspect 10, wherein the third convolutional layer of the decoder sub-network includes a 1x1 convolutional layer including one or more 1x1 convolutional filters.

Aspect 12: The method of any one of Aspects 1 to 11, wherein the frame includes a video frame.

Aspect 13: The method of any one of Aspects 1 to 12, wherein the at least one chroma channel includes a chroma-blue channel and a chroma-red channel.

Aspect 14: The method of any one of Aspects 1 to 13, wherein the frame has a luma-chroma (YUV) format.

Aspect 15: An apparatus for processing video data, the apparatus comprising: a memory; and a processor coupled to the memory and configured to: generate, using a first convolutional layer of an encoder sub-network of a neural network system, output values associated with a luma channel of a frame; generate, using a second convolutional layer of the encoder sub-network, output values associated with at least one chroma channel of the frame; generate, using a third convolutional layer, a combined representation of the frame based on the output values associated with the luma channel of the frame and the output values associated with the at least one chroma channel of the frame; and generate encoded video data based on the combined representation of the frame.

Aspect 16: The apparatus of Aspect 15, wherein the third convolutional layer includes a 1x1 convolutional layer including one or more 1x1 convolutional filters.

Aspect 17: The apparatus of any one of Aspects 15 or 16, wherein the processor is configured to: process, using a first non-linear layer of the encoder sub-network, the output values associated with the luma channel of the frame; and process, using a second non-linear layer of the encoder sub-network, the output values associated with the at least one chroma channel of the frame; wherein the combined representation is generated based on an output of the first non-linear layer and an output of the second non-linear layer.

Aspect 18: The apparatus of Aspect 17, wherein the combined representation of the frame is generated by the third convolutional layer using the output of the first non-linear layer and the output of the second non-linear layer as input.

Aspect 19: The apparatus of any one of Aspects 15 to 18, wherein the processor is configured to: quantize the encoded video data.

Aspect 20: The apparatus of any one of Aspects 15 to 19, wherein the processor is configured to: entropy code the encoded video data.

Aspect 21: The apparatus of any one of Aspects 15 to 20, wherein the processor is configured to: store the encoded video data in memory.

Aspect 22: The apparatus of any one of Aspects 15 to 21, wherein the processor is configured to: transmit the encoded video data over a transmission medium to at least one device.

Aspect 23: The apparatus of any one of Aspects 15 to 22, wherein the processor is configured to: obtain an encoded frame; generate, using a first convolutional layer of a decoder sub-network of the neural network system, reconstructed output values associated with a luma channel of the encoded frame; and generate, using a second convolutional layer of the decoder sub-network, reconstructed output values associated with at least one chroma channel of the encoded frame.

Aspect 24: The apparatus of Aspect 23, wherein the processor is configured to: separate, using a third convolutional layer of the decoder sub-network, the luma channel of the encoded frame from the at least one chroma channel of the encoded frame.

Aspect 25: The apparatus of Aspect 24, wherein the third convolutional layer of the decoder sub-network includes a 1x1 convolutional layer including one or more 1x1 convolutional filters.

Aspect 26: The apparatus of any one of Aspects 15 to 25, wherein the frame includes a video frame.

Aspect 27: The apparatus of any one of Aspects 15 to 26, wherein the at least one chroma channel includes a chroma-blue channel and a chroma-red channel.

Aspect 28: The apparatus of any one of Aspects 15 to 27, wherein the frame has a luma-chroma (YUV) format.

Aspect 29: The apparatus of any one of Aspects 15 to 28, wherein the processor includes a neural processing unit (NPU).

Aspect 30: The apparatus of any one of Aspects 15 to 29, wherein the apparatus includes a mobile device.

Aspect 31: The apparatus of any one of Aspects 15 to 30, wherein the apparatus includes an extended reality device.

Aspect 32: The apparatus of any one of Aspects 15 to 31, further comprising a display.

Aspect 33: The apparatus of any one of Aspects 15 to 29, wherein the apparatus includes a television.

Aspect 34: The apparatus of any one of Aspects 15 to 33, wherein the apparatus includes a camera configured to capture one or more video frames.

Aspect 35: A computer-readable storage medium storing instructions that, when executed, cause one or more processors to perform any of the operations of Aspects 1 to 14.

Aspect 36: An apparatus comprising means for performing any of the operations of Aspects 1 to 14.

Aspect 37: A method of processing video data, the method comprising: obtaining an encoded frame; separating, by a first convolutional layer of a decoder sub-network of a neural network system, a luma channel of the encoded frame from at least one chroma channel of the encoded frame; generating, by a second convolutional layer of the decoder sub-network, reconstructed output values associated with the luma channel of the encoded frame; generating, by a third convolutional layer of the decoder sub-network, reconstructed output values associated with the at least one chroma channel of the encoded frame; and generating an output frame including the reconstructed output values associated with the luma channel and the reconstructed output values associated with the at least one chroma channel.
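Aspect 37's decoder front-end mirrors the encoder: a first (1x1) convolutional layer splits the combined representation into a luma path and a chroma path, and separate layers then reconstruct each set of channels. The NumPy sketch below shows only that split-and-reconstruct flow; the 8/8 channel split, the per-branch 1x1 reconstruction mixes, the upsampling factors, and the nearest-neighbour upsampling (a stand-in for whatever learned upsampling the decoder branches would use) are illustrative assumptions, not details from the patent.

```python
import numpy as np

rng = np.random.default_rng(1)
latent = rng.standard_normal((16, 8, 8))  # combined representation to decode

# First decoder layer: a 1x1 convolution (a per-pixel channel mix) whose
# output channels are routed to a luma path and a chroma path.
w_split = rng.standard_normal((16, 16)) * 0.1
mixed = np.einsum('oc,chw->ohw', w_split, latent)
luma_feat, chroma_feat = mixed[:8], mixed[8:]  # assumed 8/8 channel split

def upsample2x(x):
    # Nearest-neighbour upsampling as a stand-in for a learned transposed conv.
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Second and third decoder layers: per-branch reconstruction (1x1 channel
# mixes here for brevity). Luma is upsampled one extra time relative to
# chroma so the output planes land at 4:2:0 sizes.
w_y = rng.standard_normal((1, 8)) * 0.1
w_uv = rng.standard_normal((2, 8)) * 0.1
y_rec = np.einsum('oc,chw->ohw', w_y, upsample2x(upsample2x(luma_feat)))
uv_rec = np.einsum('oc,chw->ohw', w_uv, upsample2x(chroma_feat))
print(y_rec.shape, uv_rec.shape)  # (1, 32, 32) (2, 16, 16)
```

The output frame of Aspect 37 would then simply be the pair of reconstructed planes (`y_rec` for luma, `uv_rec` for the two chroma channels).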

Aspect 38: The method of Aspect 37, wherein the first convolutional layer of the decoder sub-network includes a 1x1 convolutional layer including one or more 1x1 convolutional filters.

Aspect 39: The method of any one of Aspects 37 or 38, further comprising: processing, using a first non-linear layer of the decoder sub-network, values associated with the luma channel of the encoded frame, wherein the reconstructed output values associated with the luma channel are generated based on an output of the first non-linear layer; and processing, using a second non-linear layer of the decoder sub-network, values associated with the at least one chroma channel of the encoded frame, wherein the reconstructed output values associated with the at least one chroma channel are generated based on an output of the second non-linear layer.

Aspect 40: The method of any one of Aspects 37 to 39, further comprising: dequantizing samples of the encoded frame.

Aspect 41: The method of any one of Aspects 37 to 40, further comprising: entropy decoding samples of the encoded frame.

Aspect 42: The method of any one of Aspects 37 to 41, further comprising: storing the output frame in memory.

Aspect 43: The method of any one of Aspects 37 to 42, further comprising: displaying the output frame.

Aspect 44: The method of any one of Aspects 37 to 43, further comprising: generating, by a first convolutional layer of an encoder sub-network of the neural network system, output values associated with a luma channel of a frame; generating, by a second convolutional layer of the encoder sub-network, output values associated with at least one chroma channel of the frame; generating, by a third convolutional layer of the encoder sub-network, a combined representation of the frame based on the output values associated with the luma channel of the frame and the output values associated with the at least one chroma channel of the frame; and generating the encoded frame based on the combined representation of the frame.

Aspect 45: The method of Aspect 44, wherein the third convolutional layer of the encoder sub-network includes a 1x1 convolutional layer including one or more 1x1 convolutional filters.

Aspect 46: The method of any one of Aspects 44 or 45, further comprising: processing, using a first non-linear layer of the encoder sub-network, the output values associated with the luma channel of the frame; and processing, using a second non-linear layer of the encoder sub-network, the output values associated with the at least one chroma channel of the frame; wherein the combined representation is generated based on an output of the first non-linear layer and an output of the second non-linear layer.

Aspect 47: The method of Aspect 46, wherein the combined representation of the frame is generated by the third convolutional layer of the encoder sub-network using the output of the first non-linear layer and the output of the second non-linear layer as input.

態樣48:根據態樣37至47中任一項所述的方法,其中該經編碼的訊框包括經編碼的視訊訊框。Aspect 48: The method of any of Aspects 37-47, wherein the encoded frame comprises an encoded video frame.

態樣49:根據態樣37至48中任一項所述的方法,其中該至少一個色度通道包括色度藍色通道和色度紅色通道。Aspect 49: The method of any one of Aspects 37 to 48, wherein the at least one chroma channel includes a chroma blue channel and a chroma red channel.

態樣50:根據態樣37至49中任一項所述的方法,其中該經編碼的訊框具有亮度-色度(YUV)格式。Aspect 50: The method of any of Aspects 37-49, wherein the encoded frame has a luma-chrominance (YUV) format.
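For context on the luma-chroma (YUV) format these aspects assume, one common variant is full-range BT.601 RGB-to-YCbCr conversion, sketched below; the conversion coefficients are the standard's, not something mandated by the patent:

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Full-range BT.601 RGB -> YCbCr (one common YUV variant).

    rgb: float array in [0, 1], shape (H, W, 3). Returns (y, cb, cr) planes,
    with the chroma planes centred at 0.5.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 0.5 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

gray = np.full((2, 2, 3), 0.5)  # a neutral gray image: chroma should stay at 0.5
y, cb, cr = rgb_to_ycbcr(gray)
```

Separating luma from chroma like this is what makes the patent's two-branch front end natural: most structural detail lives in Y, while Cb/Cr carry lower-energy colour information.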

Aspect 49: An apparatus for processing video data. The apparatus includes: a memory; and a processor coupled to the memory and configured to: obtain an encoded frame; separate, using a first convolutional layer of a decoder sub-network of a neural network system, a luma channel of the encoded frame from at least one chroma channel of the encoded frame; generate, using a second convolutional layer of the decoder sub-network, reconstructed output values associated with the luma channel of the encoded frame; generate, using a third convolutional layer of the decoder sub-network, reconstructed output values associated with the at least one chroma channel of the encoded frame; and generate an output frame including the reconstructed output values associated with the luma channel and the reconstructed output values associated with the at least one chroma channel.
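On the decoder side described in this aspect, the first 1x1 convolutional layer turns the combined latent back into separate luma-directed and chroma-directed channels. A sketch of one way to realize that split; the channel counts and the convention that the first N output channels feed the luma branch are assumptions for illustration:

```python
import numpy as np

def conv1x1(x, w):
    # 1x1 convolution: per-pixel channel mixing, (C_in, H, W) -> (C_out, H, W)
    return np.einsum("oc,chw->ohw", w, x)

rng = np.random.default_rng(2)
combined = rng.standard_normal((6, 8, 8))  # combined latent recovered from the bitstream

# One 1x1 layer produces N + M channels; here the first N = 4 feed the luma
# branch and the remaining M = 2 feed the chroma branch (counts illustrative).
w_split = rng.standard_normal((6, 6))
split = conv1x1(combined, w_split)
luma_branch, chroma_branch = split[:4], split[4:]
```

Each branch is then processed by its own convolutional and nonlinear layers, mirroring the encoder's two-branch structure in reverse.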

Aspect 50: The apparatus of Aspect 49, wherein the first convolutional layer of the decoder sub-network comprises a 1x1 convolutional layer, the 1x1 convolutional layer including one or more 1x1 convolutional filters.

Aspect 51: The apparatus of Aspect 49 or 50, wherein the processor is configured to: process values associated with the luma channel of the encoded frame using a first nonlinear layer of the decoder sub-network, wherein the reconstructed output values associated with the luma channel are generated based on an output of the first nonlinear layer; and process values associated with the at least one chroma channel of the encoded frame using a second nonlinear layer of the decoder sub-network, wherein the reconstructed output values associated with the at least one chroma channel are generated based on an output of the second nonlinear layer.
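The decoder-side nonlinear layers in this aspect correspond to the IGDN (inverse GDN) operators in the reference list. A common one-step form multiplies by the same square-root term that GDN divides by; note this is the usual practical approximation (evaluated at the decoder's input), not the exact functional inverse:

```python
import numpy as np

def igdn(y, beta, gamma):
    """One-step inverse GDN: multiply by the square-root term GDN divides by.

    y:     feature map, shape (C, H, W)
    beta:  per-channel offsets, shape (C,)
    gamma: channel-coupling weights, shape (C, C)
    """
    return y * np.sqrt(beta[:, None, None] + np.einsum("ij,jhw->ihw", gamma, y ** 2))

rng = np.random.default_rng(3)
y = rng.standard_normal((2, 4, 4))
beta = np.full(2, 4.0)
gamma = np.zeros((2, 2))
x_hat = igdn(y, beta, gamma)  # with gamma = 0 this is exactly y * sqrt(beta)
```

With the channel coupling switched off (gamma = 0) the pair GDN/IGDN is an exact inverse, which makes the zero-gamma case a convenient sanity check when implementing both layers.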

Aspect 52: The apparatus of any one of Aspects 49 to 51, wherein the processor is configured to dequantize samples of the encoded frame.

Aspect 53: The apparatus of any one of Aspects 49 to 52, wherein the processor is configured to entropy decode samples of the encoded frame.

Aspect 54: The apparatus of any one of Aspects 49 to 53, wherein the processor is configured to store the output frame in memory.

Aspect 55: The apparatus of any one of Aspects 49 to 54, wherein the processor is configured to display the output frame.

Aspect 56: The apparatus of any one of Aspects 49 to 55, wherein the processor is configured to: generate, by a first convolutional layer of an encoder sub-network of the neural network system, output values associated with a luma channel of a frame; generate, by a second convolutional layer of the encoder sub-network, output values associated with at least one chroma channel of the frame; generate, by a third convolutional layer of the encoder sub-network, a combined representation of the frame based on the output values associated with the luma channel of the frame and the output values associated with the at least one chroma channel of the frame; and generate the encoded frame based on the combined representation of the frame.
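One practical detail the two-branch encoder front end must handle: in a YUV 4:2:0 frame the Y plane has twice the spatial resolution of the chroma planes, so the luma branch typically downsamples by an extra factor of two before the branches can be concatenated for the combining 1x1 layer. A sketch with 2x2 average pooling standing in for a stride-2 convolution (the pooling stand-in and plane sizes are illustrative assumptions):

```python
import numpy as np

def downsample2(x):
    """Stride-2 2x2 average pooling; a stand-in for a stride-2 convolution."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

# YUV 4:2:0 planes: Y is (H, W); U and V are (H/2, W/2).
H, W = 8, 8
rng = np.random.default_rng(4)
y_plane = rng.standard_normal((1, H, W))
uv_planes = rng.standard_normal((2, H // 2, W // 2))

luma_out = downsample2(y_plane)  # (1, H/2, W/2) -- now matches chroma resolution
aligned = np.concatenate([luma_out, uv_planes], axis=0)  # (3, H/2, W/2)
```

Once both branches share a spatial resolution, the combined tensor can be fed to a 1x1 layer exactly as in the encoder aspects above.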

Aspect 57: The apparatus of Aspect 56, wherein the third convolutional layer of the encoder sub-network comprises a 1x1 convolutional layer, the 1x1 convolutional layer including one or more 1x1 convolutional filters.

Aspect 58: The apparatus of Aspect 56 or 57, wherein the processor is configured to: process the output values associated with the luma channel of the frame using a first nonlinear layer of the encoder sub-network; and process the output values associated with the at least one chroma channel of the frame using a second nonlinear layer of the encoder sub-network; wherein the combined representation is generated based on an output of the first nonlinear layer and an output of the second nonlinear layer.

Aspect 59: The apparatus of Aspect 58, wherein the combined representation of the frame is generated by the third convolutional layer of the encoder sub-network using the output of the first nonlinear layer and the output of the second nonlinear layer as input.

Aspect 60: The apparatus of any one of Aspects 49 to 59, wherein the encoded frame comprises an encoded video frame.

Aspect 61: The apparatus of any one of Aspects 49 to 60, wherein the at least one chroma channel includes a chroma-blue channel and a chroma-red channel.

Aspect 62: The apparatus of any one of Aspects 49 to 61, wherein the encoded frame has a luma-chroma (YUV) format.

Aspect 63: The apparatus of any one of Aspects 49 to 62, wherein the processor comprises a neural processing unit (NPU).

Aspect 64: The apparatus of any one of Aspects 49 to 63, wherein the apparatus comprises a mobile device.

Aspect 65: The apparatus of any one of Aspects 49 to 64, wherein the apparatus comprises an extended reality device.

Aspect 66: The apparatus of any one of Aspects 49 to 65, further comprising a display.

Aspect 67: The apparatus of any one of Aspects 49 to 63, wherein the apparatus comprises a television.

Aspect 68: The apparatus of any one of Aspects 49 to 67, wherein the apparatus comprises a camera configured to capture one or more video frames.

Aspect 69: A computer-readable storage medium storing instructions that, when executed, cause one or more processors to perform any of the operations of Aspects 37 to 48.

Aspect 70: An apparatus comprising means for performing any of the operations of Aspects 37 to 48.

100: SOC  102: CPU  104: GPU  106: DSP  108: neural processing unit (NPU)  110: connectivity block  112: multimedia processor  114: sensor processor  116: image signal processor (ISP)  118: memory block  120: navigation module
202: fully connected neural network  204: fully connected neural network  206: convolutional neural network  210: value  212: value  214: value  216: value  218: first set  220: second set  222: output  224: first feature vector  226: image  228: second feature vector  230: image capture device  232: convolutional layer
350: deep convolutional network  352: input data  354A: convolutional block  354B: convolutional block  356: layer  358: layer  360: layer  362A: layer  362B: layer  364: layer  366: classification score
400: system  402: device  404: processor  406: memory  407: camera  408: image data  410: E2E-NNVC system  412: first interface ("I/F 1")  414: storage medium  416: second interface ("I/F 2")  418: transmission medium
462: encoder portion  463: neural network  464: quantizer  466: decoder portion  468: neural network  470: input data  472: intermediate data  474: output data  476: representation  490: second device  510: first convolutional layer
602: luma (Y) channel  604: chroma (U and V) channels  606: 1x1 convolutional layer  608: GDN nonlinear operator  609: inverse GDN (IGDN) operator  610: second convolutional layer  611: convolutional layer  612: second GDN layer  613: 1x1 convolutional layer  614: first GDN layer  615: IGDN  616: IGDN  617: convolutional layer  618: convolutional layer  619: final neural network layer  620: quantization engine  621: entropy encoding engine  622: entropy decoding engine  623: dequantization engine  624: Y component  625: U and V components
632: N-channel chroma output  634: N-channel luma output  638: 1x1 convolutional layer  639: output values  642: luma channel  644: chroma channels  646: convolutional layer  647: convolutional layer  648: 1x1 convolutional layer  649: pReLU nonlinear operator  652: pReLU layer  654: pReLU layer  662: pReLU layer  664: pReLU layer  670: convolutional layer  671: convolutional layer
700: process  702: block  704: block  706: block  708: block  800: process  802: block  804: block  806: block  808: block  810: block
900: computing device architecture  905: connection  910: processor  912: cache  915: computing device memory  920: read-only memory (ROM)  925: random access memory (RAM)  930: storage device  932: service  934: service  935: output device  936: service  940: communication interface  945: input device
C1: first channel  C2: second channel  g_a: sub-network  g_s: sub-network  GDN: generalized divisive normalization  IGDN: inverse GDN  pReLU: parametric rectified linear unit  S1: value  S2: value  U: chroma  V: chroma  Y: luma

Illustrative embodiments of the present disclosure are described in detail below with reference to the following drawings:

Figure 1 illustrates an example implementation of a system-on-a-chip (SOC);

Figure 2A illustrates an example of a fully connected neural network;

Figure 2B illustrates an example of a locally connected neural network;

Figure 2C illustrates an example of a convolutional neural network;

Figure 2D illustrates a detailed example of a deep convolutional network (DCN) designed to recognize visual features from an image;

Figure 3 is a block diagram illustrating a deep convolutional network (DCN);

Figure 4 is a diagram illustrating an example of a system including a device operable to perform image and/or video coding (encoding and decoding) using a neural-network-based system, in accordance with some examples;

Figure 5 is a diagram illustrating an example of an end-to-end neural-network-based image and video coding system for input having a red-green-blue (RGB) format, in accordance with some examples;

Figure 6A is a diagram illustrating an example of a front-end neural network architecture that can be part of an end-to-end neural-network-based image and video decoding system, in accordance with some examples;

Figure 6B is a diagram illustrating example operations of a 1x1 convolutional layer, in accordance with some examples;

Figure 6C is a diagram illustrating another example of a front-end neural network architecture that can be part of an end-to-end neural-network-based image and video coding system, in accordance with some examples;

Figure 6D is a diagram illustrating another example of a front-end neural network architecture that can be part of an end-to-end neural-network-based image and video coding system, in accordance with some examples;

Figure 6E is a diagram illustrating another example of a front-end neural network architecture that can be part of an end-to-end neural-network-based image and video coding system, in accordance with some examples;

Figure 6F is a diagram illustrating another example of a front-end neural network architecture that can be part of an end-to-end neural-network-based image and video coding system, in accordance with some examples;

Figure 7 is a flowchart illustrating an example of a process for processing video data, in accordance with some examples;

Figure 8 is a flowchart illustrating another example of a process for processing video data, in accordance with some examples; and

Figure 9 illustrates an example computing device architecture of an example computing device that can implement the various techniques described herein.

Domestic deposit information (by depository institution, date, and number): none
Foreign deposit information (by depository country, institution, date, and number): none

632: N-channel chroma output

634: N-channel luma output

638: 1x1 convolutional layer

639: output values

Claims (60)

1. A method of processing video data, the method comprising: generating, by a first convolutional layer of an encoder sub-network of a neural network system, output values associated with a luma channel of a frame; generating, by a second convolutional layer of the encoder sub-network, output values associated with at least one chroma channel of the frame; generating, by a third convolutional layer, a combined representation of the frame based on the output values associated with the luma channel of the frame and the output values associated with the at least one chroma channel of the frame; and generating encoded video data based on the combined representation of the frame.

2. The method of claim 1, wherein the third convolutional layer comprises a 1x1 convolutional layer, the 1x1 convolutional layer including one or more 1x1 convolutional filters.

3. The method of claim 1, further comprising: processing the output values associated with the luma channel of the frame using a first nonlinear layer of the encoder sub-network; and processing the output values associated with the at least one chroma channel of the frame using a second nonlinear layer of the encoder sub-network; wherein the combined representation is generated based on an output of the first nonlinear layer and an output of the second nonlinear layer.
4. The method of claim 3, wherein the combined representation of the frame is generated by the third convolutional layer using the output of the first nonlinear layer and an output of the second nonlinear layer as input.

5. The method of claim 1, further comprising: quantizing the encoded video data.

6. The method of claim 1, further comprising: entropy coding the encoded video data.

7. The method of claim 1, further comprising: storing the encoded video data in memory.

8. The method of claim 1, further comprising: transmitting the encoded video data to at least one device over a transmission medium.

9. The method of claim 1, further comprising: obtaining an encoded frame; generating, by a first convolutional layer of a decoder sub-network of the neural network system, reconstructed output values associated with a luma channel of the encoded frame; and generating, by a second convolutional layer of the decoder sub-network, reconstructed output values associated with at least one chroma channel of the encoded frame.

10. The method of claim 9, further comprising: separating the luma channel of the encoded frame from the at least one chroma channel of the encoded frame using a third convolutional layer of the decoder sub-network.
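Claims 5 and 6 add quantization and entropy coding of the encoded data. A hedged sketch of the simplest versions of both steps: uniform scalar quantization, plus the empirical-entropy lower bound on the bits an entropy coder could spend (a stand-in for a real arithmetic coder; the latent values and step size are illustrative):

```python
import numpy as np

def quantize(latent, step):
    """Uniform scalar quantization: round to the nearest multiple of `step`."""
    return np.round(latent / step).astype(np.int64)

def ideal_code_length_bits(symbols):
    """Empirical entropy of the symbol stream, in bits -- a lower bound on
    what an entropy coder (e.g. arithmetic coding) could achieve."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(symbols.size * -(p * np.log2(p)).sum())

latent = np.array([0.1, 0.12, -0.9, 1.4, 0.11, 0.09])
q = quantize(latent, step=0.5)        # -> integer symbols to be entropy coded
bits = ideal_code_length_bits(q)
```

A coarser step collapses more latent values onto the same symbol, lowering the entropy (fewer bits) at the cost of larger reconstruction error -- the basic rate-distortion trade-off these claims operate within.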
11. The method of claim 10, wherein the third convolutional layer of the decoder sub-network comprises a 1x1 convolutional layer, the 1x1 convolutional layer including one or more 1x1 convolutional filters.

12. The method of claim 1, wherein the frame comprises a video frame.

13. The method of claim 1, wherein the at least one chroma channel includes a chroma-blue channel and a chroma-red channel.

14. The method of claim 1, wherein the frame has a luma-chroma (YUV) format.

15. An apparatus for processing video data, comprising: a memory; and a processor coupled to the memory and configured to: generate, using a first convolutional layer of an encoder sub-network of a neural network system, output values associated with a luma channel of a frame; generate, using a second convolutional layer of the encoder sub-network, output values associated with at least one chroma channel of the frame; generate, using a third convolutional layer, a combined representation of the frame based on the output values associated with the luma channel of the frame and the output values associated with the at least one chroma channel of the frame; and generate encoded video data based on the combined representation of the frame.

16. The apparatus of claim 15, wherein the third convolutional layer comprises a 1x1 convolutional layer, the 1x1 convolutional layer including one or more 1x1 convolutional filters.
17. The apparatus of claim 15, wherein the processor is configured to: process the output values associated with the luma channel of the frame using a first nonlinear layer of the encoder sub-network; and process the output values associated with the at least one chroma channel of the frame using a second nonlinear layer of the encoder sub-network; wherein the combined representation is generated based on an output of the first nonlinear layer and an output of the second nonlinear layer.

18. The apparatus of claim 17, wherein the combined representation of the frame is generated by the third convolutional layer using the output of the first nonlinear layer and an output of the second nonlinear layer as input.

19. The apparatus of claim 15, wherein the processor is configured to quantize the encoded video data.

20. The apparatus of claim 15, wherein the processor is configured to entropy encode the encoded video data.

21. The apparatus of claim 15, wherein the processor is configured to store the encoded video data in memory.

22. The apparatus of claim 15, wherein the processor is configured to transmit the encoded video data to at least one device over a transmission medium.
23. The apparatus of claim 15, wherein the processor is configured to: obtain an encoded frame; generate, using a first convolutional layer of a decoder sub-network of the neural network system, reconstructed output values associated with a luma channel of the encoded frame; and generate, using a second convolutional layer of the decoder sub-network, reconstructed output values associated with at least one chroma channel of the encoded frame.

24. The apparatus of claim 23, wherein the processor is configured to separate the luma channel of the encoded frame from the at least one chroma channel of the encoded frame using a third convolutional layer of the decoder sub-network.

25. The apparatus of claim 24, wherein the third convolutional layer of the decoder sub-network comprises a 1x1 convolutional layer, the 1x1 convolutional layer including one or more 1x1 convolutional filters.

26. The apparatus of claim 15, wherein the frame comprises a video frame.

27. The apparatus of claim 15, wherein the at least one chroma channel includes a chroma-blue channel and a chroma-red channel.

28. The apparatus of claim 15, wherein the frame has a luma-chroma (YUV) format.

29. The apparatus of claim 15, wherein the processor comprises a neural processing unit (NPU).

30. The apparatus of claim 15, wherein the apparatus comprises a mobile device.
31. The apparatus of claim 15, further comprising at least one of a display and a camera configured to capture one or more frames.

32. A method of processing video data, the method comprising: obtaining an encoded frame; separating, by a first convolutional layer of a decoder sub-network, a luma channel of the encoded frame from at least one chroma channel of the encoded frame; generating, by a second convolutional layer of the decoder sub-network of a neural network system, reconstructed output values associated with the luma channel of the encoded frame; generating, by a third convolutional layer of the decoder sub-network, reconstructed output values associated with the at least one chroma channel of the encoded frame; and generating an output frame including the reconstructed output values associated with the luma channel and the reconstructed output values associated with the at least one chroma channel.

33. The method of claim 32, wherein the first convolutional layer of the decoder sub-network comprises a 1x1 convolutional layer, the 1x1 convolutional layer including one or more 1x1 convolutional filters.
34. The method of claim 32, further comprising: processing values associated with the luma channel of the encoded frame using a first nonlinear layer of the decoder sub-network, wherein the reconstructed output values associated with the luma channel are generated based on an output of the first nonlinear layer; and processing values associated with the at least one chroma channel of the encoded frame using a second nonlinear layer of the decoder sub-network, wherein the reconstructed output values associated with the at least one chroma channel are generated based on an output of the second nonlinear layer.

35. The method of claim 32, further comprising: dequantizing samples of the encoded frame.

36. The method of claim 32, further comprising: entropy decoding samples of the encoded frame.

37. The method of claim 32, further comprising: storing the output frame in memory.

38. The method of claim 32, further comprising: displaying the output frame.
39. The method of claim 32, further comprising: generating, by a first convolutional layer of an encoder sub-network of the neural network system, output values associated with a luma channel of a frame; generating, by a second convolutional layer of the encoder sub-network, output values associated with at least one chroma channel of the frame; generating, by a third convolutional layer of the encoder sub-network, a combined representation of the frame based on the output values associated with the luma channel of the frame and the output values associated with the at least one chroma channel of the frame; and generating the encoded frame based on the combined representation of the frame.

40. The method of claim 39, wherein the third convolutional layer of the encoder sub-network comprises a 1x1 convolutional layer, the 1x1 convolutional layer including one or more 1x1 convolutional filters.

41. The method of claim 39, further comprising: processing the output values associated with the luma channel of the frame using a first nonlinear layer of the encoder sub-network; and processing the output values associated with the at least one chroma channel of the frame using a second nonlinear layer of the encoder sub-network; wherein the combined representation is generated based on an output of the first nonlinear layer and an output of the second nonlinear layer.
42. The method of claim 41, wherein the combined representation of the frame is generated by the third convolutional layer of the encoder sub-network using the output of the first nonlinear layer and an output of the second nonlinear layer as input.

43. The method of claim 32, wherein the encoded frame comprises an encoded video frame.

44. The method of claim 32, wherein the at least one chroma channel includes a chroma-blue channel and a chroma-red channel.

45. The method of claim 32, wherein the encoded frame has a luma-chroma (YUV) format.

46. An apparatus for processing video data, comprising: a memory; and a processor coupled to the memory and configured to: obtain an encoded frame; separate, using a first convolutional layer of a decoder sub-network, a luma channel of the encoded frame from at least one chroma channel of the encoded frame; generate, using a second convolutional layer of the decoder sub-network of a neural network system, reconstructed output values associated with the luma channel of the encoded frame; generate, using a third convolutional layer of the decoder sub-network, reconstructed output values associated with the at least one chroma channel of the encoded frame; and generate an output frame including the reconstructed output values associated with the luma channel and the reconstructed output values associated with the at least one chroma channel.
47. The apparatus of claim 46, wherein the first convolutional layer of the decoder sub-network comprises a 1x1 convolutional layer, the 1x1 convolutional layer including one or more 1x1 convolutional filters.

48. The apparatus of claim 46, wherein the processor is configured to: process values associated with the luma channel of the encoded frame using a first nonlinear layer of the decoder sub-network, wherein the reconstructed output values associated with the luma channel are generated based on an output of the first nonlinear layer; and process values associated with the at least one chroma channel of the encoded frame using a second nonlinear layer of the decoder sub-network, wherein the reconstructed output values associated with the at least one chroma channel are generated based on an output of the second nonlinear layer.

49. The apparatus of claim 46, wherein the processor is configured to dequantize samples of the encoded frame.

50. The apparatus of claim 46, wherein the processor is configured to entropy decode samples of the encoded frame.

51. The apparatus of claim 46, wherein the processor is configured to store the output frame in memory.

52. The apparatus of claim 46, wherein the processor is configured to display the output frame.
The apparatus of claim 46, wherein the processor is configured to:
generate, using a first convolutional layer of an encoder sub-network of the neural network system, output values associated with a luma channel of a frame;
generate, using a second convolutional layer of the encoder sub-network, output values associated with at least one chroma channel of the frame;
generate, using a third convolutional layer of the encoder sub-network, a combined representation of the frame based on the output values associated with the luma channel of the frame and the output values associated with the at least one chroma channel of the frame; and
generate the encoded frame based on the combined representation of the frame.

The apparatus of claim 53, wherein the third convolutional layer of the encoder sub-network comprises a 1x1 convolutional layer including one or more 1x1 convolutional filters.

The apparatus of claim 53, wherein the processor is configured to:
process the output values associated with the luma channel of the frame using a first non-linear layer of the encoder sub-network; and
process the output values associated with the at least one chroma channel of the frame using a second non-linear layer of the encoder sub-network;
wherein the combined representation is generated based on an output of the first non-linear layer and an output of the second non-linear layer.
The apparatus of claim 55, wherein the combined representation of the frame is generated by the third convolutional layer of the encoder sub-network using an output of the first non-linear layer and an output of the second non-linear layer as inputs.

The apparatus of claim 46, wherein the encoded frame comprises an encoded video frame.

The apparatus of claim 57, wherein the at least one chroma channel comprises a chroma-blue channel and a chroma-red channel.

The apparatus of claim 46, wherein the encoded frame has a luminance-chrominance (YUV) format.

The apparatus of claim 46, further comprising at least one of a display and a camera configured to capture one or more video frames.
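The claims above center on 1x1 convolutional layers that combine separate luma and chroma branch outputs on the encoder side, and split them apart again on the decoder side. A 1x1 convolution mixes channels independently at each pixel position, which is what makes it suitable for this role. The following is a minimal pure-Python sketch of that channel-mixing operation on a toy YUV frame; it is only an illustration of how a 1x1 convolution combines or separates channels, not the patented implementation, and the example weights are arbitrary.

```python
def conv1x1(frame, weights):
    """Apply a 1x1 convolution to `frame`.

    frame:   list of C_in channel planes, each an H x W list of lists.
    weights: C_out x C_in matrix; output channel o at pixel (y, x) is
             sum_i weights[o][i] * frame[i][y][x].
    """
    h, w = len(frame[0]), len(frame[0][0])
    out = []
    for row_w in weights:
        plane = [[sum(wgt * frame[c][y][x] for c, wgt in enumerate(row_w))
                  for x in range(w)]
                 for y in range(h)]
        out.append(plane)
    return out

# Toy 2x2 frame with three channels (Y, U, V).
y = [[0.5, 0.6], [0.7, 0.8]]
u = [[0.1, 0.2], [0.3, 0.4]]
v = [[0.9, 0.8], [0.7, 0.6]]

# Encoder-style "combine": mix the three channels into two feature channels
# (arbitrary illustrative weights).
combine_w = [[1.0, 0.5, 0.5],
             [0.0, 1.0, -1.0]]
combined = conv1x1([y, u, v], combine_w)

# Decoder-style "separate": identity weights route each input channel to its
# own branch, splitting luma from the two chroma channels.
split_w = [[1.0, 0.0, 0.0],   # luma branch
           [0.0, 1.0, 0.0],   # chroma-blue branch
           [0.0, 0.0, 1.0]]   # chroma-red branch
separated = conv1x1([y, u, v], split_w)
```

In a trained network the mixing weights would be learned rather than hand-picked, and each branch would then pass through its own convolutional and non-linear layers as the claims describe.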
TW110146050A 2020-12-10 2021-12-09 A front-end architecture for neural network based video coding TW202243476A (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US202063124016P 2020-12-10 2020-12-10
US63/124,016 2020-12-10
US202063131802P 2020-12-30 2020-12-30
US63/131,802 2020-12-30
US17/643,383 2021-12-08
US17/643,383 US20220191523A1 (en) 2020-12-10 2021-12-08 Front-end architecture for neural network based video coding

Publications (1)

Publication Number Publication Date
TW202243476A 2022-11-01

Family

ID=79283114

Family Applications (1)

Application Number Title Priority Date Filing Date
TW110146050A TW202243476A (en) 2020-12-10 2021-12-09 A front-end architecture for neural network based video coding

Country Status (5)

Country Link
EP (1) EP4260561A1 (en)
JP (1) JP2023553369A (en)
KR (1) KR20230117346A (en)
TW (1) TW202243476A (en)
WO (1) WO2022126120A1 (en)

Also Published As

Publication number Publication date
EP4260561A1 (en) 2023-10-18
WO2022126120A1 (en) 2022-06-16
JP2023553369A (en) 2023-12-21
KR20230117346A (en) 2023-08-08

Similar Documents

Publication Publication Date Title
US11405626B2 (en) Video compression using recurrent-based machine learning systems
US11477464B2 (en) End-to-end neural network based video coding
US20220191523A1 (en) Front-end architecture for neural network based video coding
JP7478911B2 (en) Compressing bitstream indices for parallel entropy coding
US20240064318A1 (en) Apparatus and method for coding pictures using a convolutional neural network
TW202236849A (en) Learned b-frame coding using p-frame coding system
TW202318878A (en) Transformer-based architecture for transform coding of media
US11399198B1 (en) Learned B-frame compression
TW202243476A (en) A front-end architecture for neural network based video coding
US12003734B2 (en) Machine learning based flow determination for video coding
US20220272355A1 (en) Machine learning based flow determination for video coding
CN116547965A (en) Front-end architecture for neural network-based video coding
CN116965032A (en) Machine learning based stream determination for video coding
JP2024508772A (en) Machine learning-based flow decisions for video coding
US20240015318A1 (en) Video coding using optical flow and residual predictors
US11825090B1 (en) Bit-rate estimation for video coding with machine learning enhancement
WO2024015665A1 (en) Bit-rate estimation for video coding with machine learning enhancement