JP2024076475A

JP2024076475A - Video encoding device, video decoding device

Info

Publication number: JP2024076475A
Application number: JP2022188016A
Authority: JP
Inventors: 健中條; 知宏猪飼; 拓矢鈴木; 哲銘范; 裕渡辺
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2022-11-25
Filing date: 2022-11-25
Publication date: 2024-06-06

Abstract

[Problem] To provide a video encoding device and a video decoding device that improve image recognition accuracy even at low bit rates by encoding and decoding additional auxiliary information, without significantly changing the framework of the video encoding and decoding method. [Solution] In a video transmission system 1 comprising a video encoding device 10, a network 21, a video decoding device 30, an image display device 41, and an image recognition device 51, the video decoding device 30 has an image decoding device 31 that decodes images from encoded data Te to generate a decoded video Td, and an auxiliary information decoding device 91 that decodes, as auxiliary information, attention information having multiple values for each region of the decoded video Td decoded by the image decoding device 31. [Selected Figure] Figure 1

Description

Embodiments of the present invention relate to a video encoding device and a video decoding device.

To transmit or record video efficiently, a video encoding device that generates encoded data by encoding video, and a video decoding device that generates decoded images by decoding that encoded data, are used.

Specific video coding schemes include, for example, H.264/AVC and H.265/HEVC (High-Efficiency Video Coding).

In such video coding schemes, the images (pictures) that make up a video are managed in a hierarchical structure consisting of slices obtained by dividing a picture, coding tree units (CTUs) obtained by dividing a slice, coding units (CUs) obtained by dividing a coding tree unit, and transform units (TUs) obtained by dividing a coding unit, and are encoded and decoded CU by CU.

In such video coding schemes, a predicted image is usually generated based on a locally decoded image obtained by encoding and decoding the input image, and the prediction error (sometimes called a "difference image" or "residual image") obtained by subtracting the predicted image from the input image (original image) is encoded. Methods for generating the predicted image include inter-picture prediction (inter prediction) and intra-picture prediction (intra prediction).

Non-Patent Document 1 is an example of recent video encoding and decoding technology.

Non-Patent Document 1 describes a video encoding and decoding method with very high coding efficiency.

Non-Patent Document 2 discusses a method for integrating the description of video analysis results with video coding.

Non-Patent Document 3 provides a means of performing neural network post-filter processing for the method of Non-Patent Document 1.

Non-Patent Document 1: ITU-T Recommendation H.266
Non-Patent Document 2: L.-Y. Duan, J. Liu, W. Yang, T. Huang and W. Gao, "Video Coding for Machines: A Paradigm of Collaborative Compression and Intelligent Analytics," IEEE Trans. Image Processing, vol. 29, pp. 8680-8695, Aug. 2020.
Non-Patent Document 3: Text of ISO/IEC 23002-7:202x (2nd Ed.) DAM1 Additional SEI messages, Nov. 2022.

However, although Non-Patent Document 1 describes a video encoding and decoding method with high coding efficiency, when image recognition is performed on the decoded video at a low transmission rate, coding distortion reduces the image recognition accuracy.

Non-Patent Document 2 discloses a method for integrating the description of video analysis results with video coding, but its coding efficiency is insufficient and it cannot achieve a low transmission bit rate.

Non-Patent Document 3 provides a means of performing image processing with a neural network for the method of Non-Patent Document 1, but it does not address image recognition.

A video decoding device according to one aspect of the present invention comprises an image decoding device that decodes an image from encoded data, and an auxiliary information decoding device that decodes, as auxiliary information, attention information having multiple values for each region of the image decoded by the image decoding device.

The auxiliary information decoding device also decodes information that specifies the neural network to be applied to the attention information having multiple values for each region.

A video encoding device according to one aspect of the present invention comprises an image encoding device that encodes an input image, and an auxiliary information encoding device that encodes, as auxiliary information, attention information having multiple values for each region of the input image.

The auxiliary information encoding device also encodes, as auxiliary information, information that specifies the neural network to be applied to the attention information having multiple values for each region.

With this configuration, additional auxiliary information is encoded and decoded without significantly changing the framework of the video encoding and decoding method, which solves the problem of improving image recognition accuracy even at low bit rates.

A schematic diagram showing the configuration of the video transmission system according to this embodiment.
A diagram showing the hierarchical structure of the encoded data.
A schematic diagram showing the configuration of the image decoding device.
A flowchart illustrating the general operation of the image decoding device.
A block diagram showing the configuration of the image encoding device.
A diagram showing an overview of the syntax of the neural network post-filter characteristics (NNPFC) SEI.
A diagram showing the syntax of the neural network post-filter activation (NNPFA) SEI.
A diagram showing an example of the syntax table of the activation information SEI that specifies the auxiliary information in one embodiment.
A diagram showing the process of creating an activation map from the syntax elements of the activation information SEI in one embodiment.
A diagram showing the syntax of the SEI payload, which is the container for SEI messages.
A flowchart showing the processing of the post-image processing device 1002.
A diagram showing the NNC encoding device and decoding device.

(First Embodiment)
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.

Figure 1 is a schematic diagram showing the configuration of another video transmission system according to this embodiment.

The video transmission system 1 transmits encoded data obtained by encoding images, decodes and displays the transmitted encoded data, and performs image recognition. The video transmission system 1 consists of a video encoding device 10, a network 21, a video decoding device 30, an image display device 41, and an image recognition device 51.

The video encoding device 10 consists of an image encoding device (image encoding unit) 11, an image analysis device (image analysis unit) 61, an auxiliary information creation device (auxiliary information creation unit) 71, an auxiliary information encoding device (auxiliary information encoding unit) 81, and a pre-image processing device (pre-image processing unit) 1001.

The video decoding device 30 consists of an image decoding device (image decoding unit) 31, an auxiliary information decoding device (auxiliary information decoding unit) 91, and a post-image processing device (post-image processing unit) 1002.

The pre-image processing device 1001 performs pre-image processing on the input video T and sends the pre-processed image Tp to the image encoding device 11 and the auxiliary information creation device 71.

As one specific example, the activation information output by the auxiliary information creation device 71 may be input to the pre-image processing device 1001, which applies a low-pass filter to regions other than the recognition-target candidates. This lowers the encoding difficulty of those regions and relatively improves the image quality of the recognition-target candidate regions.
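The low-pass pre-processing described above can be sketched as follows. This is a minimal illustration only, assuming a binary mask derived from the activation information (1 for recognition-target candidates, 0 elsewhere) and a 3x3 box filter as the low-pass filter; the function name `box_blur_outside_mask`, the mask representation, and the filter choice are assumptions and are not specified in the text.

```python
def box_blur_outside_mask(image, mask):
    """Apply a 3x3 box blur where mask is 0; keep pixels where mask is 1.

    image: list of rows of integer luma samples.
    mask:  same shape as image; 1 = recognition-target candidate (assumed
           to come from the activation information), 0 = other regions.
    Returns a new image; the input is not modified.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                continue  # candidate region: leave the sample untouched
            total = 0
            count = 0
            # Average over the 3x3 neighbourhood, clipped at picture borders.
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        total += image[yy][xx]
                        count += 1
            out[y][x] = (total + count // 2) // count  # rounded mean
    return out
```

Smoothing only the non-candidate regions removes high-frequency detail that would otherwise consume bits, which is one plausible way to realize the relative quality gain for the candidate regions.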

The image encoding device 11 compresses and encodes the output Tp of the pre-image processing device 1001.

The image analysis device 61 analyzes the input video T to determine which regions in a picture the image recognition device 51 should attend to, and sends the analysis results to the auxiliary information creation device 71.

The auxiliary information creation device 71 generates, for each picture, attention information that is useful for operating the image recognition device 51, based on the analysis results from the image analysis device 61 and the pre-processed image Tp from the pre-image processing device 1001, and sends it to the auxiliary information encoding device 81.

The auxiliary information encoding device 81 encodes the auxiliary information created by the auxiliary information creation device 71 according to a predetermined syntax. The output of the image encoding device 11 and the output of the auxiliary information encoding device 81 are sent to the network 21 as encoded data Te.

The video encoding device 10 takes the input image T as input, compresses and encodes the image, analyzes the image to generate auxiliary information to be input to the image recognition device 51, encodes that auxiliary information, generates the encoded data Te, and sends it to the network 21.

In Figure 1, the auxiliary information encoding device 81 is not connected to the image encoding device 11, but the two devices may exchange necessary information as appropriate.

The network 21 transmits the encoded auxiliary information and the encoded data Te to the image decoding device 31. Part or all of the encoded auxiliary information may be included in the encoded data Te as supplemental enhancement information (SEI). The network 21 is the Internet, a wide area network (WAN), a local area network (LAN), or a combination of these. The network 21 is not necessarily limited to a bidirectional communication network, and may be a unidirectional communication network that transmits broadcast waves, such as terrestrial digital broadcasting or satellite broadcasting. The network 21 may also be replaced by a storage medium on which the encoded data Te is recorded, such as a DVD (Digital Versatile Disc: registered trademark) or a BD (Blu-ray Disc: registered trademark).

The video decoding device 30 receives the encoded data Te sent from the network 21, decodes the image and the auxiliary information, sends the decoded image to the image display device 41 and the image recognition device 51, and outputs the decoded auxiliary information to the image recognition device 51.

The image decoding device 31 decodes each piece of encoded data Te transmitted by the network 21, generates a decoded video Td, and supplies it to the post-image processing device 1002.

The auxiliary information decoding device 91 decodes the encoded auxiliary information transmitted by the network 21 to generate the auxiliary information, and sends it to the image recognition device 51.

In Figure 1, the auxiliary information decoding device 91 is illustrated separately from the image decoding device 31, but it may be included in the image decoding device 31. For example, the auxiliary information decoding device 91 may be included in the image decoding device 31 separately from the functional units of the image decoding device 31. Also, although the auxiliary information decoding device 91 is not connected to the image decoding device 31 in Figure 1, the two devices may exchange necessary information as appropriate.

The post-image processing device 1002 performs post-image processing on the decoded video Td output by the image decoding device 31, and outputs the post-processed image To.

As one specific example, post-image processing using a neural network may be performed to improve the image quality of the decoded video Td. In this case, network parameters for improving the image quality are input as auxiliary information from the auxiliary information decoding device 91 and used for the post-image processing.

The image display device 41 displays all or part of the post-processed image To output from the post-image processing device 1002. The image display device 41 includes a display device such as a liquid crystal display or an organic EL (electro-luminescence) display. Display forms include stationary displays, mobile displays, and head-mounted displays (HMDs). When the image decoding device 31 has high processing capability, a high-quality image is displayed; when it has only lower processing capability, an image that does not require high processing or display capability is displayed.

The image recognition device 51 uses the post-processed image To output from the post-image processing device 1002 and the auxiliary information decoded by the auxiliary information decoding device 91 to perform object detection, object region segmentation, object tracking, action recognition, human motion evaluation, and the like.

This configuration provides a framework that can maintain image recognition accuracy even at low bit rates by encoding and decoding additional auxiliary information, without significantly changing the framework of the video encoding and decoding method.

<Operators>
The operators used in this specification are listed below.

>> is a right bit shift, << is a left bit shift, & is a bitwise AND, | is a bitwise OR, |= is the OR assignment operator, and || is a logical OR.

x ? y : z is a ternary operator that takes the value y if x is true (non-zero) and z if x is false (0).

Clip3(a,b,c) is a function that clips c to a value from a to b inclusive: it returns a if c<a, returns b if c>b, and returns c otherwise (where a<=b).

abs(a) is a function that returns the absolute value of a.

Int(a) is a function that returns the integer value of a.

floor(a) is a function that returns the largest integer less than or equal to a.

ceil(a) is a function that returns the smallest integer greater than or equal to a.

a/d represents the division of a by d (with the fractional part discarded).
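For readers more used to a general-purpose language, a few of the operators above can be sketched in Python. This is an illustrative sketch, not part of the specification: the helper names `clip3` and `div_trunc` are hypothetical, and `div_trunc` assumes that "fractional part discarded" means truncation toward zero, which differs from Python's floor-based `//` for negative operands.

```python
import math


def clip3(a, b, c):
    """Clip c to the closed range [a, b]; the caller must ensure a <= b."""
    if c < a:
        return a
    if c > b:
        return b
    return c


def div_trunc(a, d):
    """Integer division of a by d, truncating toward zero.

    Assumption: this is the intended meaning of a/d in the text.
    Python's own // operator rounds toward negative infinity instead.
    """
    q = abs(a) // abs(d)
    return q if (a >= 0) == (d > 0) else -q


# The ternary operator x ? y : z corresponds to Python's conditional expression.
x, y, z = 0, 10, 20
selected = y if x else z  # x is 0 (false), so z is selected

# Shifts and bitwise operators have the same spelling in Python.
flags = 0
flags |= 1 << 3           # OR assignment with a left shift
low_bits = (flags >> 1) & 0x7

# floor and ceil are available from the standard library.
floor_value = math.floor(-3.5)
ceil_value = math.ceil(-3.5)
```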

<Structure of the encoded data Te>
Before describing the image encoding device 11 and the image decoding device 31 according to this embodiment in detail, the data structure of the encoded data Te generated by the image encoding device 11 and decoded by the image decoding device 31 will be described.

The encoded data Te consists of multiple CVSs (Coded Video Sequences) and an EoB (End of Bitstream NAL unit). A CVS consists of multiple AUs (Access Units) and an EoS (End of Sequence NAL unit). The AU at the start of a CVS is called a CVSS (Coded Video Sequence Start) AU. The unit obtained by dividing a CVS by layer is called a CLVS (Coded Layer Video Sequence). An AU consists of the PUs (Picture Units) of one or more layers that share the same output time. If multilayer coding is not used, an AU consists of a single PU. A PU is the unit of encoded data for one decoded picture and consists of multiple NAL units. A CLVS consists of PUs of the same layer, and the PU at the start of a CLVS is called a CLVSS (Coded Layer Video Sequence Start) PU. A CLVSS PU is restricted to a PU that is a randomly accessible IRAP (Intra Random Access Pictures) or GDR (Gradual Decoder Refresh Picture). A NAL unit consists of a NAL unit header and RBSP (Raw Byte Sequence Payload) data. The NAL unit header consists of 2 bits of zero data, followed by the 6-bit nuh_layer_id indicating the layer value, the 5-bit nuh_unit_type indicating the NAL unit type, and the 3-bit nuh_temporal_id_plus1, whose value is the Temporal ID plus 1.
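The 16-bit NAL unit header layout described above (2 zero bits, 6 bits, 5 bits, 3 bits) can be unpacked as in the following sketch. The class and function names are hypothetical, the field names follow the text, and the big-endian bit order (most significant bit first) is an assumption consistent with H.26x bitstreams. Rejecting a nuh_temporal_id_plus1 of 0 is also an assumption, made because the field is described as the Temporal ID plus 1.

```python
from typing import NamedTuple


class NalUnitHeader(NamedTuple):
    nuh_layer_id: int   # 6-bit layer value
    nuh_unit_type: int  # 5-bit NAL unit type
    temporal_id: int    # nuh_temporal_id_plus1 - 1


def parse_nal_unit_header(data: bytes) -> NalUnitHeader:
    """Parse the first two bytes of a NAL unit into its header fields."""
    if len(data) < 2:
        raise ValueError("a NAL unit header needs 2 bytes")
    value = (data[0] << 8) | data[1]  # 16 bits, most significant bit first
    if value >> 14:
        raise ValueError("the two leading bits must be 0")
    nuh_layer_id = (value >> 8) & 0x3F
    nuh_unit_type = (value >> 3) & 0x1F
    nuh_temporal_id_plus1 = value & 0x07
    if nuh_temporal_id_plus1 == 0:
        # Assumption: 0 is invalid because the field stores Temporal ID + 1.
        raise ValueError("nuh_temporal_id_plus1 must not be 0")
    return NalUnitHeader(nuh_layer_id, nuh_unit_type, nuh_temporal_id_plus1 - 1)
```

For example, the bytes 0x05 0xA3 carry layer 5, unit type 20, and a Temporal ID of 2.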

Figure 2 is a diagram showing the hierarchical structure of the encoded data Te in units of PUs. The encoded data Te illustratively includes a sequence and the multiple pictures that make up the sequence. Figure 2 shows a coded video sequence that defines the sequence SEQ, a coded picture that defines the picture PICT, a coded slice that defines the slice S, coded slice data that defines the slice data, the coding tree units included in the coded slice data, and the coding units included in a coding tree unit.

(Coded Video Sequence)
The coded video sequence defines the set of data that the image decoding device 31 refers to in order to decode the sequence SEQ to be processed. As shown in Figure 2, the sequence SEQ includes a video parameter set VPS (Video Parameter Set), a sequence parameter set SPS (Sequence Parameter Set), a picture parameter set PPS (Picture Parameter Set), an adaptation parameter set APS (Adaptation Parameter Set), pictures PICT, and supplemental enhancement information SEI (Supplemental Enhancement Information).

For a video composed of multiple layers, the video parameter set VPS defines a set of coding parameters common to multiple videos, and a set of coding parameters related to the multiple layers and the individual layers contained in the video.

The sequence parameter set SPS defines the set of coding parameters that the image decoding device 31 refers to in order to decode the target sequence. For example, the width and height of the pictures are defined. Multiple SPSs may exist. In that case, one of the multiple SPSs is selected from the PPS.

Here, the sequence parameter set SPS includes the following syntax elements.
・ref_pic_resampling_enabled_flag: A flag that specifies whether the function for varying the resolution (resampling) is used when decoding each image included in a single sequence that refers to the target SPS. Put another way, the flag indicates that the size of the reference pictures referred to when generating predicted images changes between the images of a single sequence. If the value of the flag is 1, resampling is applied; if it is 0, resampling is not applied.
・pic_width_max_in_luma_samples: A syntax element that specifies, in units of luma blocks, the width of the widest image among the images in a single sequence. The value of this syntax element must not be 0 and must be an integer multiple of Max(8, MinCbSizeY), where MinCbSizeY is a value determined by the minimum size of a luma block.
・pic_height_max_in_luma_samples: A syntax element that specifies, in units of luma blocks, the height of the tallest image among the images in a single sequence. The value of this syntax element must not be 0 and must be an integer multiple of Max(8, MinCbSizeY).

The picture parameter set PPS defines the set of coding parameters that the image decoding device 31 refers to in order to decode each picture in the target sequence. Multiple PPSs may exist. In that case, one of the multiple PPSs is selected from each picture in the target sequence.

Here, the picture parameter set PPS includes the following syntax elements.
・pps_pic_width_in_luma_samples: A syntax element that specifies the width of the target picture. The value of this syntax element must not be 0, must be an integer multiple of Max(8, MinCbSizeY), and must be less than or equal to sps_pic_width_max_in_luma_samples.
・pps_pic_height_in_luma_samples: A syntax element that specifies the height of the target picture. The value of this syntax element must not be 0, must be an integer multiple of Max(8, MinCbSizeY), and must be less than or equal to sps_pic_height_max_in_luma_samples.
・pps_conformance_window_flag: A flag that indicates whether the conformance (cropping) window offset parameters are signaled next, and thereby indicates where the conformance window is displayed. If this flag is 1, the parameters are signaled; if it is 0, the conformance window offset parameters are not present.
・pps_conf_win_left_offset, pps_conf_win_right_offset, pps_conf_win_top_offset, pps_conf_win_bottom_offset: Offset values that specify the left, right, top, and bottom positions of the picture output by the decoding process, for the rectangular region specified in output picture coordinates. When the value of pps_conformance_window_flag is 0, the values of pps_conf_win_left_offset, pps_conf_win_right_offset, pps_conf_win_top_offset, and pps_conf_win_bottom_offset are inferred to be 0.
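The picture size constraints listed above (non-zero, an integer multiple of Max(8, MinCbSizeY), and no larger than the SPS maximum) can be checked as in this sketch. The function name `check_pps_picture_size` and its return format are hypothetical; only the three constraints themselves come from the text.

```python
def check_pps_picture_size(pps_width, pps_height, sps_max_width,
                           sps_max_height, min_cb_size_y):
    """Return a list of constraint violations; an empty list means valid.

    Checks pps_pic_width_in_luma_samples and pps_pic_height_in_luma_samples
    against the constraints stated in the text.
    """
    unit = max(8, min_cb_size_y)  # Max(8, MinCbSizeY)
    errors = []
    for name, value, limit in (("width", pps_width, sps_max_width),
                               ("height", pps_height, sps_max_height)):
        if value == 0:
            errors.append(f"{name} must not be 0")
        elif value % unit:
            errors.append(f"{name} must be a multiple of {unit}")
        if value > limit:
            errors.append(f"{name} exceeds the SPS maximum {limit}")
    return errors
```

A decoder or bitstream checker could run such a check once per PPS before allocating picture buffers.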

Here, the chroma format variable ChromaFormatIdc is the value of sps_chroma_format_id, and the variables SubWidthC and SubHeightC are determined by ChromaFormatIdc. For the monochrome format, SubWidthC and SubHeightC are both 1; for the 4:2:0 format, SubWidthC and SubHeightC are both 2; for the 4:2:2 format, SubWidthC is 2 and SubHeightC is 1; and for the 4:4:4 format, SubWidthC and SubHeightC are both 1.
・pps_init_qp_minus26: Information for deriving the quantization parameter SliceQpY of the slices that refer to the PPS.
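The relationship between ChromaFormatIdc and SubWidthC/SubHeightC described above is a simple table lookup, sketched below. Two points are assumptions not stated in the text: the numeric ChromaFormatIdc values (0 for monochrome, 1 for 4:2:0, 2 for 4:2:2, 3 for 4:4:4), which follow the H.266 convention, and the use of the conformance window offsets in chroma sample units to compute the cropped output size. The names `CHROMA_SUBSAMPLING` and `cropped_size` are hypothetical.

```python
# ChromaFormatIdc -> (SubWidthC, SubHeightC).
# Assumption: numeric values follow the H.266 convention.
CHROMA_SUBSAMPLING = {
    0: (1, 1),  # monochrome
    1: (2, 2),  # 4:2:0
    2: (2, 1),  # 4:2:2
    3: (1, 1),  # 4:4:4
}


def cropped_size(pic_width, pic_height, chroma_format_idc,
                 left=0, right=0, top=0, bottom=0):
    """Return (width, height) in luma samples after conformance cropping.

    Assumption: the offsets are expressed in chroma sample units, so they
    are scaled by SubWidthC / SubHeightC. Offsets default to 0, matching
    the inferred values when pps_conformance_window_flag is 0.
    """
    sub_width_c, sub_height_c = CHROMA_SUBSAMPLING[chroma_format_idc]
    width = pic_width - sub_width_c * (left + right)
    height = pic_height - sub_height_c * (top + bottom)
    return width, height
```

With this scaling, a 1920x1088 coded picture in 4:2:0 with a bottom offset of 4 yields a 1920x1080 output picture.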

(Subpicture)
A picture may be further divided into rectangular subpictures. The size of a subpicture may be a multiple of the CTU size. A subpicture is defined as a set of an integer number of contiguous tiles in the vertical and horizontal directions. That is, a picture is divided into rectangular tiles, and a subpicture is defined as a set of rectangular tiles. A subpicture may be defined using the ID of its top-left tile and the ID of its bottom-right tile.
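Defining a subpicture by its top-left and bottom-right tile IDs can be illustrated as follows. This sketch assumes that tile IDs are assigned in raster order over a regular grid with `num_tile_cols` columns; the text does not state the ID assignment, and the function name `tiles_in_subpicture` is hypothetical.

```python
def tiles_in_subpicture(top_left_id, bottom_right_id, num_tile_cols):
    """List the tile IDs covered by a rectangular subpicture.

    Assumption: tile IDs increase in raster order (left to right, then
    top to bottom) across a grid that is num_tile_cols tiles wide.
    """
    row0, col0 = divmod(top_left_id, num_tile_cols)
    row1, col1 = divmod(bottom_right_id, num_tile_cols)
    if row1 < row0 or col1 < col0:
        raise ValueError("bottom-right tile must not precede top-left tile")
    return [row * num_tile_cols + col
            for row in range(row0, row1 + 1)
            for col in range(col0, col1 + 1)]
```

Two IDs are enough to describe the rectangle because every tile between the two corners, row by row, belongs to the subpicture.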

Figure 5 is a conceptual diagram of the images processed in the video transmission system 1, showing how the resolution of the images changes over time. Figure 5 does not distinguish whether the images are encoded. Figure 5 shows an example in which, during processing in the video transmission system 1, images are transmitted to the image decoding device 31 while the resolution is changed adaptively using the picture parameter set PPS.

(Coded Picture)
The coded picture defines the set of data that the image decoding device 31 refers to in order to decode the picture PICT to be processed. As shown in Figure 2, the picture PICT includes a picture header PH and slice 0 to slice NS-1 (NS is the total number of slices included in the picture PICT).

In the following, when there is no need to distinguish slice 0 to slice NS-1 from one another, the subscripts may be omitted. The same applies to other subscripted data included in the encoded data Te described below.

The picture header includes the following syntax element.

pic_temporal_mvp_enabled_flag is a flag that specifies whether temporal motion vector prediction is used for inter prediction of the slices associated with the picture header. If the value of the flag is 0, the syntax elements of the slices associated with the picture header are restricted so that temporal motion vector prediction is not used in decoding those slices. If the value of the flag is 1, temporal motion vector prediction is used in decoding the slices associated with the picture header. If the flag is not present, its value is inferred to be 0.

（符号化スライス）
符号化スライスでは、処理対象のスライスSを復号するために画像復号装置31が参照するデータの集合が規定されている。スライスは、図2に示すように、スライスヘッダ、および、スライスデータを含んでいる。 (Coded Slice)
A coded slice defines a set of data to be referenced by the image decoding device 31 in order to decode a current slice S. As shown in Fig. 2, a slice includes a slice header and slice data.

スライスヘッダには、対象スライスの復号方法を決定するために画像復号装置31が参照する符号化パラメータ群が含まれる。スライスタイプを指定するスライスタイプ指定情報（slice_type）は、スライスヘッダに含まれる符号化パラメータの一例である。 The slice header includes a set of coding parameters that the image decoding device 31 refers to in order to determine the decoding method for the target slice. Slice type specification information (slice_type) that specifies the slice type is an example of a coding parameter included in the slice header.

スライスタイプ指定情報により指定可能なスライスタイプとしては、（１）符号化の際にイントラ予測のみを用いるＩスライス、（２）符号化の際に単予測(L0予測)、または、イントラ予測を用いるＰスライス、（３）符号化の際に単予測(L0予測或いはL1予測)、双予測、または、イントラ予測を用いるＢスライスなどが挙げられる。なお、インター予測は、単予測、双予測に限定されず、より多くの参照ピクチャを用いて予測画像を生成してもよい。以下、P、Bスライスと呼ぶ場合には、インター予測を用いることができるブロックを含むスライスを指す。 Slice types that can be specified by the slice type specification information include (1) an I slice that uses only intra prediction during encoding, (2) a P slice that uses uni-prediction (L0 prediction) or intra prediction during encoding, and (3) a B slice that uses uni-prediction (L0 prediction or L1 prediction), bi-prediction, or intra prediction during encoding. Note that inter prediction is not limited to uni-prediction or bi-prediction, and a predicted image may be generated using more reference pictures. Hereinafter, when referring to P or B slice, it refers to a slice that includes a block for which inter prediction can be used.

なお、スライスヘッダは、ピクチャパラメータセットPPSへの参照（pic_parameter_set_id）を含んでいても良い。 Note that the slice header may also contain a reference to the picture parameter set PPS (pic_parameter_set_id).

（符号化スライスデータ）
符号化スライスデータでは、処理対象のスライスデータを復号するために画像復号装置31が参照するデータの集合が規定されている。スライスデータは、図2の符号化スライスヘッダに示すように、CTUを含んでいる。CTUは、スライスを構成する固定サイズ（例えば64x64）のブロックであり、最大符号化単位（LCU:Largest Coding Unit）と呼ぶこともある。 (Encoded slice data)
The coded slice data specifies a set of data to be referenced by the image decoding device 31 in order to decode the slice data to be processed. The slice data includes a CTU, as shown in the coded slice header in Fig. 2. A CTU is a block of a fixed size (e.g., 64x64) that constitutes a slice, and is also called a Largest Coding Unit (LCU).

（符号化ツリーユニット）
図2には、処理対象のCTUを復号するために画像復号装置31が参照するデータの集合が規定されている。CTUは、再帰的な４分木分割（QT（Quad Tree）分割）、２分木分割（BT（Binary Tree）分割）あるいは３分木分割（TT（Ternary Tree）分割）により、符号化処理の基本的な単位である符号化ユニットCUに分割される。BT分割とTT分割を合わせてマルチツリー分割（MT（Multi Tree）分割）と呼ぶ。再帰的な４分木分割により得られる木構造のノードのことを符号化ノード（Coding Node）と称する。４分木、２分木、及び３分木の中間ノードは、符号化ノードであり、CTU自身も最上位の符号化ノードとして規定される。 (coding tree unit)
FIG. 2 specifies a set of data that the image decoding device 31 refers to in order to decode a CTU to be processed. The CTU is divided into coding units (CUs), which are the basic units of the coding process, by recursive quad tree division (QT (Quad Tree) division), binary tree division (BT (Binary Tree) division), or ternary tree division (TT (Ternary Tree) division). BT division and TT division are collectively called multi tree division (MT (Multi Tree) division). A node of the tree structure obtained by recursive quad tree division is called a coding node. Intermediate nodes of the quad tree, binary tree, and ternary tree are coding nodes, and the CTU itself is defined as the top-level coding node.

CTは、CT情報として、CT分割を行うか否かを示すCU分割フラグ(split_cu_flag)、QT分割を行うか否かを示すQT分割フラグ（qt_split_cu_flag）、MT分割の分割方向を示すMT分割方向（mtt_split_cu_vertical_flag）、MT分割の分割タイプを示すMT分割タイプ（mtt_split_cu_binary_flag）を含む。split_cu_flag、qt_split_cu_flag、mtt_split_cu_vertical_flag、mtt_split_cu_binary_flagは符号化ノード毎に伝送される。 CT includes, as CT information, a CU split flag (split_cu_flag) indicating whether CT splitting is performed, a QT split flag (qt_split_cu_flag) indicating whether QT splitting is performed, an MT split direction (mtt_split_cu_vertical_flag) indicating the split direction of MT splitting, and an MT split type (mtt_split_cu_binary_flag) indicating the split type of MT splitting. split_cu_flag, qt_split_cu_flag, mtt_split_cu_vertical_flag, and mtt_split_cu_binary_flag are transmitted for each encoding node.
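
As a non-limiting illustration, the way in which the four flags select a division of a coding node may be sketched in Python as follows. The mode names, the function names, and the 1:2:1 size ratio used for the TT division are assumptions made for this sketch; the text states only what each flag indicates.

```python
def split_mode(split_cu_flag, qt_split_cu_flag=0,
               mtt_split_cu_vertical_flag=0, mtt_split_cu_binary_flag=0):
    """Map the CT information flags of one coding node to a division mode."""
    if not split_cu_flag:
        return "NO_SPLIT"
    if qt_split_cu_flag:
        return "QT"
    kind = "BT" if mtt_split_cu_binary_flag else "TT"
    direction = "VER" if mtt_split_cu_vertical_flag else "HOR"
    return kind + "_" + direction


def child_sizes(mode, width, height):
    """Return the (width, height) of each child node for a division mode.

    Assumption: TT division uses a 1:2:1 ratio along the division direction.
    """
    if mode == "NO_SPLIT":
        return [(width, height)]
    if mode == "QT":
        return [(width // 2, height // 2)] * 4
    if mode == "BT_VER":
        return [(width // 2, height)] * 2
    if mode == "BT_HOR":
        return [(width, height // 2)] * 2
    if mode == "TT_VER":
        return [(width // 4, height), (width // 2, height), (width // 4, height)]
    if mode == "TT_HOR":
        return [(width, height // 4), (width, height // 2), (width, height // 4)]
    raise ValueError("unknown division mode: " + mode)


mode = split_mode(1, 0, mtt_split_cu_vertical_flag=1, mtt_split_cu_binary_flag=0)
print(mode, child_sizes(mode, 32, 32))
```
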

輝度と色差で異なるツリーを用いても良い。ツリーの種別をtreeTypeで示す。例えば、輝度(Y, cIdx=0)と色差(Cb/Cr, cIdx=1,2)で共通のツリーを用いる場合、共通単一ツリーをtreeType=SINGLE_TREEで示す。輝度と色差で異なる２つのツリー（DUALツリー）を用いる場合、輝度のツリーをtreeType=DUAL_TREE_LUMA、色差のツリーをtreeType=DUAL_TREE_CHROMAで示す。 Different trees may be used for luminance and chrominance. The type of tree is indicated by treeType. For example, when using a common tree for luminance (Y, cIdx=0) and chrominance (Cb/Cr, cIdx=1,2), the common single tree is indicated by treeType=SINGLE_TREE. When using two different trees (DUAL trees) for luminance and chrominance, the luminance tree is indicated by treeType=DUAL_TREE_LUMA and the chrominance tree is indicated by treeType=DUAL_TREE_CHROMA.

（符号化ユニット）
図2は、処理対象の符号化ユニットを復号するために画像復号装置31が参照するデータの集合が規定されている。具体的には、CUは、CUヘッダCUH、予測パラメータ、変換パラメータ、量子化変換係数等から構成される。CUヘッダでは予測モード等が規定される。 (Encoding Unit)
FIG. 2 specifies a set of data to be referenced by the image decoding device 31 in order to decode a coding unit to be processed. Specifically, a CU is composed of a CU header CUH, prediction parameters, transform parameters, quantized transform coefficients, and the like. The CU header specifies the prediction mode and the like.

予測処理は、CU単位で行われる場合と、CUをさらに分割したサブCU単位で行われる場合がある。CUとサブCUのサイズが等しい場合には、CU中のサブCUは１つである。CUがサブCUのサイズよりも大きい場合、CUはサブCUに分割される。たとえばCUが8x8、サブCUが4x4の場合、CUは水平２分割、垂直２分割からなる、４つのサブCUに分割される。 Prediction processing may be performed on a CU basis, or on a sub-CU basis, which is a further division of a CU. If the size of the CU and sub-CU are the same, there is one sub-CU in the CU. If the size of the CU is larger than the size of the sub-CU, the CU is divided into sub-CUs. For example, if the CU is 8x8 and the sub-CU is 4x4, the CU is divided into 2 parts horizontally and 2 parts vertically, into 4 sub-CUs.
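
As a non-limiting illustration, the division of a CU into sub-CUs may be sketched in Python as follows. The function name and the raster-scan ordering of the returned positions are assumptions made for this sketch.

```python
def sub_cu_origins(cu_width, cu_height, sub_width, sub_height):
    """Return the top left (x, y) offset of every sub-CU inside a CU.

    Assumption: sub-CUs are listed in raster-scan order.
    """
    return [
        (x, y)
        for y in range(0, cu_height, sub_height)
        for x in range(0, cu_width, sub_width)
    ]


# An 8x8 CU with 4x4 sub-CUs is divided in two horizontally and in two vertically.
print(sub_cu_origins(8, 8, 4, 4))
# A CU whose size equals the sub-CU size contains a single sub-CU.
print(sub_cu_origins(4, 4, 4, 4))
```
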

予測の種類（予測モード）は、イントラ予測と、インター予測の２つがある。イントラ予測は、同一ピクチャ内の予測であり、インター予測は、互いに異なるピクチャ間（例えば、表示時刻間、レイヤ画像間）で行われる予測処理を指す。 There are two types of prediction (prediction modes): intra prediction and inter prediction. Intra prediction is a prediction within the same picture, while inter prediction refers to a prediction process performed between different pictures (e.g., between display times or between layer images).

変換・量子化処理はCU単位で行われるが、量子化変換係数は4x4等のサブブロック単位でエントロピー符号化してもよい。 The transformation and quantization processes are performed on a CU basis, but the quantized transformation coefficients may be entropy coded on a subblock basis, such as 4x4.

（予測パラメータ）
予測画像は、ブロックに付随する予測パラメータによって導出される。予測パラメータには、イントラ予測とインター予測の予測パラメータがある。 (Prediction parameters)
The predicted image is derived from prediction parameters associated with the block, which include intra-prediction and inter-prediction parameters.

以下、インター予測の予測パラメータについて説明する。インター予測パラメータは、予測リスト利用フラグpredFlagL0とpredFlagL1、参照ピクチャインデックスrefIdxL0とrefIdxL1、動きベクトルmvL0とmvL1から構成される。predFlagL0、predFlagL1は、参照ピクチャリスト（L0リスト、L1リスト）が用いられるか否かを示すフラグであり、値が１の場合に対応する参照ピクチャリストが用いられる。なお、本明細書中「ＸＸであるか否かを示すフラグ」と記す場合、フラグが０以外（たとえば１）をＸＸである場合、０をＸＸではない場合とし、論理否定、論理積などでは１を真、０を偽と扱う（以下同様）。但し、実際の装置や方法では真値、偽値として他の値を用いることもできる。 The prediction parameters of inter prediction are explained below. The inter prediction parameters are composed of prediction list usage flags predFlagL0 and predFlagL1, reference picture indexes refIdxL0 and refIdxL1, and motion vectors mvL0 and mvL1. predFlagL0 and predFlagL1 are flags indicating whether or not a reference picture list (L0 list, L1 list) is used, and when the value is 1, the corresponding reference picture list is used. Note that in this specification, when the term "flag indicating whether or not XX is true" is used, a flag other than 0 (for example, 1) is XX, and 0 is not XX, and in logical negation, logical product, etc., 1 is treated as true and 0 is treated as false (same below). However, in actual devices and methods, other values can be used as true and false values.

インター予測パラメータを導出するためのシンタックス要素には、例えば、マージモードで用いるアフィンフラグaffine_flag、マージフラグmerge_flag、マージインデックスmerge_idx、MMVDフラグmmvd_flag、AMVPモードで用いる参照ピクチャを選択するためのインター予測識別子inter_pred_idc、参照ピクチャインデックスrefIdxLX、動きベクトルを導出するための予測ベクトルインデックスmvp_LX_idx、差分ベクトルmvdLX、動きベクトル精度モードamvr_modeがある。 Syntax elements for deriving inter prediction parameters include, for example, the affine flag affine_flag used in merge mode, the merge flag merge_flag, the merge index merge_idx, the MMVD flag mmvd_flag, the inter prediction identifier inter_pred_idc for selecting a reference picture to be used in AMVP mode, the reference picture index refIdxLX, the prediction vector index mvp_LX_idx for deriving a motion vector, the difference vector mvdLX, and the motion vector precision mode amvr_mode.

（画像復号装置の構成）
本実施形態に係る画像復号装置31（図3）の構成について説明する。 (Configuration of the image decoding device)
The configuration of an image decoding device 31 (FIG. 3) according to this embodiment will be described.

画像復号装置31は、エントロピー復号部301、パラメータ復号部（予測画像復号装置）302、ループフィルタ305、参照ピクチャメモリ306、予測パラメータメモリ307、予測画像生成部（予測画像生成装置）308、逆量子化・逆変換部311、及び加算部312、予測パラメータ導出部320を含んで構成される。なお、後述の画像符号化装置11に合わせ、画像復号装置31にループフィルタ305が含まれない構成もある。 The image decoding device 31 includes an entropy decoding unit 301, a parameter decoding unit (prediction image decoding device) 302, a loop filter 305, a reference picture memory 306, a prediction parameter memory 307, a prediction image generating unit (prediction image generating device) 308, an inverse quantization and inverse transform unit 311, an addition unit 312, and a prediction parameter derivation unit 320. Note that, in accordance with the image encoding device 11 described below, the image decoding device 31 may also be configured not to include the loop filter 305.

パラメータ復号部302は、さらに、ヘッダ復号部3020、CT情報復号部3021、及びCU復号部3022（予測モード復号部）を備えており、CU復号部3022はさらにTU復号部3024を備えている。これらを総称して復号モジュールと呼んでもよい。ヘッダ復号部3020は、符号化データからVPS、SPS、PPS、APSなどのパラメータセット情報、スライスヘッダ（スライス情報）を復号する。CT情報復号部3021は、符号化データからCTを復号する。CU復号部3022は符号化データからCUを復号する。TU復号部3024は、TUに予測誤差が含まれている場合に、符号化データからQP更新情報（量子化補正値）と量子化予測誤差（residual_coding）を復号する。 The parameter decoding unit 302 further includes a header decoding unit 3020, a CT information decoding unit 3021, and a CU decoding unit 3022 (prediction mode decoding unit), and the CU decoding unit 3022 further includes a TU decoding unit 3024. These may be collectively referred to as a decoding module. The header decoding unit 3020 decodes parameter set information such as VPS, SPS, PPS, and APS, and slice headers (slice information) from the encoded data. The CT information decoding unit 3021 decodes the CT from the encoded data. The CU decoding unit 3022 decodes the CU from the encoded data. The TU decoding unit 3024 decodes QP update information (quantization correction value) and quantization prediction error (residual_coding) from the encoded data when a prediction error is included in the TU.

TU復号部3024は、スキップモード以外(skip_mode==0)の場合に、符号化データからQP更新情報と量子化予測誤差を復号する。より具体的には、TU復号部3024は、skip_mode==0の場合に、対象ブロックに量子化予測誤差が含まれているか否かを示すフラグcu_cbpを復号し、cu_cbpが1の場合に量子化予測誤差を復号する。cu_cbpが符号化データに存在しない場合は0と導出する。 When in a mode other than skip mode (skip_mode==0), the TU decoding unit 3024 decodes the QP update information and the quantized prediction error from the encoded data. More specifically, when skip_mode==0, the TU decoding unit 3024 decodes the flag cu_cbp indicating whether or not the current block contains a quantized prediction error, and decodes the quantized prediction error when cu_cbp is 1. When cu_cbp does not exist in the encoded data, it derives it as 0.
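
As a non-limiting illustration, the conditions under which the TU decoding unit 3024 decodes the QP update information and the quantized prediction error may be sketched in Python as follows. The dictionary named syntax is a hypothetical stand-in for syntax elements already parsed from the coded data, and the key names qp_delta and residual are assumptions made for this sketch.

```python
def decode_tu(syntax, skip_mode):
    """Decide which TU-level data is decoded for the current block.

    syntax: dict of syntax elements present in the coded data
            (hypothetical stand-in for an entropy decoder).
    """
    nothing = {"cu_cbp": 0, "qp_delta": None, "residual": None}
    if skip_mode != 0:
        # Skip mode: neither QP update information nor residual is decoded.
        return nothing
    # cu_cbp is derived as 0 when it is not present in the coded data.
    cu_cbp = syntax.get("cu_cbp", 0)
    if cu_cbp != 1:
        return nothing
    return {
        "cu_cbp": 1,
        "qp_delta": syntax["qp_delta"],
        "residual": syntax["residual"],
    }


print(decode_tu({"cu_cbp": 1, "qp_delta": -2, "residual": [3, 0, -1, 0]}, skip_mode=0))
print(decode_tu({}, skip_mode=0))
```
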

TU復号部3024は、符号化データから変換基底を示すインデックスmts_idxを復号する。また、TU復号部3024は、符号化データからセカンダリ変換の利用及び変換基底を示すインデックスstIdxを復号する。stIdxは0の場合にセカンダリ変換の非適用を示し、1の場合にセカンダリ変換基底のセット（ペア）のうち一方の変換を示し、2の場合に上記ペアのうち他方の変換を示す。 The TU decoding unit 3024 decodes an index mts_idx indicating the transformation base from the encoded data. The TU decoding unit 3024 also decodes an index stIdx indicating the use of a secondary transformation and the transformation base from the encoded data. When stIdx is 0, it indicates no application of a secondary transformation, when it is 1, it indicates one transformation of a set (pair) of secondary transformation bases, and when it is 2, it indicates the other transformation of the pair.

また、TU復号部3024はサブブロック変換フラグcu_sbt_flagを復号してもよい。cu_sbt_flagが１の場合には、CUを複数のサブブロックに分割し、特定の１つのサブブロックのみ残差を復号する。さらにTU復号部3024は、サブブロックの数が４であるか２であるかを示すフラグcu_sbt_quad_flag、分割方向を示すcu_sbt_horizontal_flag、非ゼロの変換係数が含まれるサブブロックを示すcu_sbt_pos_flagを復号してもよい。 The TU decoding unit 3024 may also decode the sub-block transform flag cu_sbt_flag. When cu_sbt_flag is 1, the CU is divided into multiple sub-blocks, and the residual of only one specific sub-block is decoded. The TU decoding unit 3024 may further decode a flag cu_sbt_quad_flag indicating whether the number of sub-blocks is 4 or 2, cu_sbt_horizontal_flag indicating the division direction, and cu_sbt_pos_flag indicating a sub-block that includes a non-zero transform coefficient.

予測画像生成部308は、インター予測画像生成部309及びイントラ予測画像生成部310を含んで構成される。 The predicted image generation unit 308 includes an inter predicted image generation unit 309 and an intra predicted image generation unit 310.

また、以降では処理の単位としてCTU、CUを使用した例を記載するが、この例に限らず、サブCU単位で処理をしてもよい。あるいはCTU、CUをブロック、サブCUをサブブロックと読み替え、ブロックあるいはサブブロック単位の処理としてもよい。 In the following, an example will be described in which CTU and CU are used as processing units, but this is not limiting and processing may be performed in sub-CU units. Alternatively, CTU and CU may be interpreted as blocks and sub-CU as sub-blocks, and processing may be performed in block or sub-block units.

エントロピー復号部301は、外部から入力された符号化データTeに対してエントロピー復号を行って、個々の符号（シンタックス要素）を復号する。エントロピー符号化には、シンタックス要素の種類や周囲の状況に応じて適応的に選択したコンテキスト（確率モデル）を用いてシンタックス要素を可変長符号化する方式と、あらかじめ定められた表、あるいは計算式を用いてシンタックス要素を可変長符号化する方式がある。前者のCABAC（Context Adaptive Binary Arithmetic Coding）は、コンテキストのCABAC状態（優勢シンボルの種別(0 or 1)と確率を指定する確率状態インデックスpStateIdx）をメモリに格納する。エントロピー復号部301は、セグメント（タイル、CTU行、スライス）の先頭で全てのCABAC状態を初期化する。エントロピー復号部301は、シンタックス要素をバイナリ列（Bin String）に変換し、Bin Stringの各ビットを復号する。コンテキストを用いる場合には、シンタックス要素の各ビットに対してコンテキストインデックスctxIncを導出し、コンテキストを用いてビットを復号し、用いたコンテキストのCABAC状態を更新する。コンテキストを用いないビットは、等確率(EP, bypass)で復号され、ctxInc導出やCABAC状態は省略される。復号されたシンタックス要素には、予測画像を生成するための予測情報および、差分画像を生成するための予測誤差などがある。 The entropy decoding unit 301 performs entropy decoding on the encoded data Te input from outside, and decodes each code (syntax element). There are two types of entropy coding: one is to perform variable-length coding of syntax elements using a context (probability model) adaptively selected according to the type of syntax element and the surrounding circumstances, and the other is to perform variable-length coding of syntax elements using a predefined table or formula. The former CABAC (Context Adaptive Binary Arithmetic Coding) stores the CABAC state of the context (probability state index pStateIdx that specifies the type (0 or 1) and probability of the most probable symbol) in memory. The entropy decoding unit 301 initializes all CABAC states at the beginning of a segment (tile, CTU row, slice). The entropy decoding unit 301 converts the syntax elements into a binary string (Bin String) and decodes each bit of the Bin String. When a context is used, a context index ctxInc is derived for each bit of the syntax element, the bit is decoded using the context, and the CABAC state of the used context is updated. Bits that do not use a context are decoded with equal probability (EP, bypass), and the ctxInc derivation and CABAC state are omitted. 
The decoded syntax elements include prediction information for generating a predicted image and prediction error for generating a difference image.
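
As a non-limiting illustration, the bookkeeping of CABAC states described above (initialization at the beginning of a segment, an update after each context-coded bit, and no update for bypass bits) may be sketched in Python as follows. This is a toy model only: the single-step state transition and the upper bound of 62 are assumptions made for this sketch, the arithmetic decoding itself is omitted, and the class and function names are hypothetical.

```python
class ContextState:
    """Toy CABAC state: most probable symbol (MPS) and probability state index."""

    MAX_STATE = 62  # assumed upper bound for pStateIdx in this sketch

    def __init__(self):
        self.mps = 0
        self.p_state_idx = 0

    def update(self, bin_value):
        """Adapt the state after one context-coded bit has been decoded."""
        if bin_value == self.mps:
            self.p_state_idx = min(self.p_state_idx + 1, self.MAX_STATE)
        elif self.p_state_idx == 0:
            # The less probable symbol at the weakest state flips the MPS.
            self.mps = 1 - self.mps
        else:
            self.p_state_idx -= 1


def init_contexts(num_contexts):
    """Initialize all CABAC states, as is done at the beginning of a segment."""
    return [ContextState() for _ in range(num_contexts)]


def process_bins(contexts, bins):
    """bins: list of (bin_value, ctx_inc); ctx_inc None marks a bypass bit."""
    for bin_value, ctx_inc in bins:
        if ctx_inc is not None:
            contexts[ctx_inc].update(bin_value)
        # Bypass bits use equal probability and leave every state untouched.


contexts = init_contexts(2)
process_bins(contexts, [(0, 0), (0, 0), (1, None), (1, 1)])
print(contexts[0].mps, contexts[0].p_state_idx, contexts[1].mps, contexts[1].p_state_idx)
```
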

エントロピー復号部301は、復号した符号をパラメータ復号部302に出力する。復号した符号とは、例えば、予測モードpredMode、merge_flag、merge_idx、inter_pred_idc、refIdxLX、mvp_LX_idx、mvdLX、amvr_mode等である。どの符号を復号するかの制御は、パラメータ復号部302の指示に基づいて行われる。 The entropy decoding unit 301 outputs the decoded code to the parameter decoding unit 302. The decoded code is, for example, the prediction mode predMode, merge_flag, merge_idx, inter_pred_idc, refIdxLX, mvp_LX_idx, mvdLX, amvr_mode, etc. Control of which code to decode is performed based on an instruction from the parameter decoding unit 302.

（基本フロー）
図4は、画像復号装置31の概略的動作を説明するフローチャートである。 (Basic flow)
FIG. 4 is a flowchart illustrating a schematic operation of the image decoding device 31.

（S1100：パラメータセット情報復号）ヘッダ復号部3020は、符号化データからVPS、SPS、PPSなどのパラメータセット情報を復号する。 (S1100: Decoding parameter set information) The header decoding unit 3020 decodes parameter set information such as VPS, SPS, and PPS from the encoded data.

（S1200：スライス情報復号）ヘッダ復号部3020は、符号化データからスライスヘッダ（スライス情報）を復号する。 (S1200: Slice information decoding) The header decoding unit 3020 decodes the slice header (slice information) from the encoded data.

以下、画像復号装置31は、対象ピクチャに含まれる各CTUについて、S1300からS5000の処理を繰り返すことにより各CTUの復号画像を導出する。 The image decoding device 31 then repeats the processes from S1300 to S5000 for each CTU included in the target picture to derive a decoded image for each CTU.

（S1300：CTU情報復号）CT情報復号部3021は、符号化データからCTUを復号する。 (S1300: Decoding CTU information) The CT information decoding unit 3021 decodes the CTU from the encoded data.

（S1400：CT情報復号）CT情報復号部3021は、符号化データからCTを復号する。 (S1400: CT information decoding) The CT information decoding unit 3021 decodes the CT from the encoded data.

（S1500：CU復号）CU復号部3022はS1510、S1520を実施して、符号化データからCUを復号する。 (S1500: CU decoding) The CU decoding unit 3022 performs S1510 and S1520 to decode the CU from the encoded data.

（S1510：CU情報復号）CU復号部3022は、符号化データからCU情報、予測情報、TU分割フラグsplit_transform_flag、CU残差フラグcbf_cb、cbf_cr、cbf_luma等を復号する。 (S1510: CU information decoding) The CU decoding unit 3022 decodes CU information, prediction information, the TU split flag split_transform_flag, the CU residual flags cbf_cb, cbf_cr, cbf_luma, etc. from the encoded data.

（S1520：TU情報復号）TU復号部3024は、TUに予測誤差が含まれている場合に、符号化データからQP更新情報と量子化予測誤差、変換インデックスmts_idxを復号する。なお、QP更新情報は、量子化パラメータQPの予測値である量子化パラメータ予測値qPpredからの差分値である。 (S1520: TU information decoding) When a prediction error is included in a TU, the TU decoding unit 3024 decodes the QP update information, the quantization prediction error, and the transform index mts_idx from the encoded data. Note that the QP update information is a difference value from the quantization parameter predicted value qPpred, which is a predicted value of the quantization parameter QP.

（S2000：予測画像生成）予測画像生成部308は、対象CUに含まれる各ブロックについて、予測情報に基づいて予測画像を生成する。 (S2000: Generation of predicted image) The predicted image generation unit 308 generates a predicted image for each block included in the target CU based on the prediction information.

（S3000：逆量子化・逆変換）逆量子化・逆変換部311は、対象CUに含まれる各TUについて、逆量子化・逆変換処理を実行する。 (S3000: Inverse quantization and inverse transform) The inverse quantization and inverse transform unit 311 performs inverse quantization and inverse transform processing on each TU included in the target CU.

（S4000：復号画像生成）加算部312は、予測画像生成部308より供給される予測画像と、逆量子化・逆変換部311より供給される予測誤差とを加算することによって、対象CUの復号画像を生成する。 (S4000: Decoded image generation) The adder 312 generates a decoded image of the target CU by adding the predicted image supplied from the predicted image generation unit 308 and the prediction error supplied from the inverse quantization and inverse transform unit 311.

（S5000：ループフィルタ）ループフィルタ305は、復号画像にデブロッキングフィルタ、SAO、ALFなどのループフィルタをかけ、復号画像を生成する。 (S5000: Loop filter) The loop filter 305 applies a loop filter such as a deblocking filter, SAO, or ALF to the decoded image to generate a decoded image.
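
As a non-limiting illustration, two arithmetic steps of the basic flow, namely the use of the QP update information in S1520 and the addition performed in S4000, may be sketched in Python as follows. The function names and the clipping of the sum to the sample range given by bit_depth are assumptions made for this sketch; the text states only that the predicted image and the prediction error are added.

```python
def derive_qp(qp_pred, qp_delta):
    """S1520: the QP update information is a difference from qPpred."""
    return qp_pred + qp_delta


def reconstruct_block(pred, resid, bit_depth=8):
    """S4000: add the predicted image and the prediction error sample by sample.

    Assumption: the sum is clipped to the range [0, 2^bit_depth - 1].
    """
    max_val = (1 << bit_depth) - 1
    return [
        [min(max(p + r, 0), max_val) for p, r in zip(pred_row, resid_row)]
        for pred_row, resid_row in zip(pred, resid)
    ]


print(derive_qp(30, -2))
print(reconstruct_block([[250, 10], [128, 64]], [[10, -20], [3, -4]]))
```
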

非特許文献１は、非常に符号化効率の高い動画像符号化、復号方式であるが、圧縮された動画像の復号画像で画像認識を行うと、伝送レートが低い場合、符号化歪によって、画像認識精度が低減するという問題があった。 Non-Patent Document 1 describes a video encoding and decoding method with extremely high coding efficiency, but when performing image recognition on the decoded image of a compressed video, there is a problem in that the accuracy of image recognition decreases due to coding distortion when the transmission rate is low.

また、非特許物件2では、動画像の分析結果の記述と動画像符号化を統合する方法について議論しているが、符号化効率の面で十分ではなく、低い伝送ビットレートを実現できないという課題があった。 In addition, Non-Patent Document 2 discusses a method for integrating the description of video analysis results with video coding, but the method is insufficient in terms of coding efficiency and cannot achieve a low transmission bit rate.

また、非特許文献3は、非特許文献１の方式に対して、ニューラルネットワークポストフィルタ処理を行う手段を提供しているが画像認識技術には対応していないという課題がある。 In addition, Non-Patent Document 3 provides a means for performing neural network post-filter processing for the method in Non-Patent Document 1, but has the problem that it does not support image recognition technology.

本実施の形態では、動画像符号化、復号方式の枠組みを大きく変更せずに、付加的な補助情報を符号化、復号することで、低レートにおいても、画像認識精度を維持することができる枠組みを提供する。 In this embodiment, a framework is provided that can maintain image recognition accuracy even at low rates by encoding and decoding additional auxiliary information without significantly changing the framework of the video encoding and decoding method.

（ニューラルネットワークポストフィルタ特性(NNPFC)SEI）
図6は、非特許文献3のニューラルネットワークポストフィルタ特性(NNPFC)SEIメッセージのシンタクスの概略を示している。NNPFC SEIメッセージは、ポストフィルタ処理として適用するニューラルネットワークを指定する。特定のピクチャに対する特定のポストフィルタ処理の適用は、ニューラルネットワークポストフィルタアクティベーションSEIメッセージ（図7）によって示される。 (Neural Network Post-Filter Characteristics (NNPFC) SEI)
Fig. 6 shows an outline of the syntax of the Neural Network Post-Filter Characteristics (NNPFC) SEI message of Non-Patent Document 3. The NNPFC SEI message specifies the neural network to be applied as a post-filter process. The application of a specific post-filter process to a specific picture is indicated by the Neural Network Post-Filter Activation SEI message (Fig. 7).

このSEIメッセージを適用するには、次の変数の定義が必要である。
・画像復号装置31が復号したピクチャの輝度画素単位の幅と高さ
・輝度画素配列CroppedYPic[idx]及び色差画素配列CroppedCbPic[idx]及びCroppedCrPic[idx]、ポストフィルタ処理の入力として使用される0からnumInputPics-1までの範囲のidxを持つピクチャ
・輝度画素配列の画素ビット深度BitDepthY
・色差画素配列の画素ビット長BitDepthC
・画像復号装置31が復号したピクチャの色差フォーマットChromaFormatIdcで示される色差フォーマットを示す変数SubWidthCとSubHeightC。4:2:0の時は、変数SubWidthCとSubHeightCは、ともに2であり、4:2:2の時は、変数SubWidthCが2で変数SubHeightCが1であり、4:4:4の時は、変数SubWidthCとSubHeightCは、ともに1である。
・nnpfc_auxiliary_inp_idcは、ニューラルネットワークポストフィルタの入力テンソルに補助情報が存在することを示し、もし値が1に等しい場合、0から1までの範囲の実数であるデブロッキングフィルタリング強度制御値StrengthControlValを補助情報として入力する。 To apply this SEI message, the following variables must be defined:
・The width and height, in units of luminance pixels, of the picture decoded by the image decoding device 31
・The luminance pixel array CroppedYPic[idx] and the chrominance pixel arrays CroppedCbPic[idx] and CroppedCrPic[idx] of the pictures, with idx in the range from 0 to numInputPics-1, that are used as inputs to the post-filter process
・The pixel bit depth BitDepthY of the luminance pixel array
・The pixel bit depth BitDepthC of the chrominance pixel array
・The variables SubWidthC and SubHeightC, which correspond to the chrominance format indicated by ChromaFormatIdc of the picture decoded by the image decoding device 31. For 4:2:0, SubWidthC and SubHeightC are both 2; for 4:2:2, SubWidthC is 2 and SubHeightC is 1; for 4:4:4, SubWidthC and SubHeightC are both 1.
・nnpfc_auxiliary_inp_idc indicates that auxiliary information is present in the input tensor of the neural network post-filter. When its value is equal to 1, the deblocking filtering strength control value StrengthControlVal, which is a real number in the range from 0 to 1, is input as auxiliary information.
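
As a non-limiting illustration, the preparation of post-filter inputs from the variables listed above may be sketched in Python as follows. The normalization of luminance samples by 2^BitDepthY - 1, the representation of the auxiliary information as a constant plane, and the function name are assumptions made for this sketch; the actual input tensor layout is determined by the NNPFC SEI message and is not reproduced here.

```python
def build_input_planes(cropped_y_pic, bit_depth_y,
                       nnpfc_auxiliary_inp_idc, strength_control_val=None):
    """Return a list of 2-D planes: normalized luminance and, optionally, auxiliary data.

    Assumptions: samples are scaled to [0, 1] and StrengthControlVal
    is broadcast over a plane of the same size as the luminance plane.
    """
    scale = (1 << bit_depth_y) - 1
    planes = [[[sample / scale for sample in row] for row in cropped_y_pic]]
    if nnpfc_auxiliary_inp_idc == 1:
        if strength_control_val is None or not 0.0 <= strength_control_val <= 1.0:
            raise ValueError("StrengthControlVal must be a real number in [0, 1]")
        height = len(cropped_y_pic)
        width = len(cropped_y_pic[0])
        planes.append([[strength_control_val] * width for _ in range(height)])
    return planes


planes = build_input_planes([[0, 255], [51, 102]], 8, 1, strength_control_val=0.5)
print(len(planes), planes[0][0], planes[1][0])
```
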

シンタクス要素nnpfc_idは、ポストフィルタ処理を識別するために使用できる識別番号を示す。 The syntax element nnpfc_id indicates an identification number that can be used to identify the post-filter process.

NNPFC SEIメッセージが、現在のCLVS内で特定のnnpfc_id値を持つ、復号順で最初のNNPFC SEIメッセージである場合、以下が適用される。
・このSEIメッセージが、基本のポストフィルタ処理であることを示す。
・このSEIメッセージが、現在のCLVSの最後まで、現在の復号されたピクチャと、現在のレイヤのすべての後続の復号されたピクチャに関係する。NNPFC SEIメッセージが、現在のCLVSで、復号順で前のNNPFC SEIメッセージの繰り返しである場合、後続のセマンティクスは、このSEIメッセージが現在のCLVS内で同じ内容を持つ唯一のNNPFC SEIメッセージであるかのように適用される。 If the NNPFC SEI message is the first NNPFC SEI message, in decoding order, with a particular nnpfc_id value within the current CLVS, the following applies:
- This SEI message indicates a base post-filter process.
- This SEI message pertains to the current decoded picture and all subsequent decoded pictures of the current layer until the end of the current CLVS.
If an NNPFC SEI message is a repetition of a previous NNPFC SEI message, in decoding order, in the current CLVS, the subsequent semantics apply as if this SEI message were the only NNPFC SEI message with the same content in the current CLVS.

NNPFC SEIメッセージが、現在のCLVS内で特定のnnpfc_id値を持つ復号順で最初のNNPFC SEIメッセージでない場合、以下が適用される。
・このSEIメッセージは、同じnnpfc_id値を使用して、復号順で以前の基本のポストフィルタに関連する更新であることを示す
・このSEIメッセージは、現在のCLVSの終わりまで、または現在のレイヤ内の特定のnnpfc_id値を持つ次のNNPFC SEIメッセージまで、現在の復号されたピクチャと、現在のレイヤのすべての後続の復号されたピクチャに関係する
nnpfc_mode_idcが0の場合は、このSEIメッセージに、ポストフィルタ処理を指定するISO/IEC 15938-17ビットストリームが含まれているか、同じnnpfc_id値を持つ基本のポスト処理フィルターに関連する更新であることを示す。 If the NNPFC SEI message is not the first NNPFC SEI message, in decoding order, with a particular nnpfc_id value within the current CLVS, the following applies:
- This SEI message indicates an update relative to the preceding base post-filter, in decoding order, with the same nnpfc_id value.
- This SEI message pertains to the current decoded picture and all subsequent decoded pictures of the current layer, until the end of the current CLVS or until the next NNPFC SEI message with the particular nnpfc_id value in the current layer.
nnpfc_mode_idc equal to 0 indicates that this SEI message contains an ISO/IEC 15938-17 bitstream that specifies the post-filter process, or that it is an update relative to the base post-filter with the same nnpfc_id value.

nnpfc_mode_idcが1の場合は、nnpfc_id値に関連付けられたポストフィルタ処理は、タグURIのnnpfc_tag_uriによって識別される形式のnnpfc_uriによって示されるURIによって識別されるニューラルネットワークであることを示す。 nnpfc_mode_idc equal to 1 indicates that the post-filter process associated with the nnpfc_id value is a neural network identified by the URI indicated by nnpfc_uri, in the format identified by the tag URI nnpfc_tag_uri.

nnpfc_reserved_zero_bit_aは、0を示す。 nnpfc_reserved_zero_bit_a indicates 0.

nnpfc_tag_uriは、IETF RFC4151で指定されているシンタクスとセマンティクスを持つタグURIが含まれている。基本のポストフィルタ処理として使用されるニューラルネットワークに関する形式と関連情報、または同じnnpfc_id値が指定されたポストフィルタ処理に関連する更新のために使用される。なお、nnpfc_tag_uriは、登録機関を必要とせずに、nnpfc_uriによって指定されたニューラルネットワークデータの形式を一意に識別することを可能とする。nnpfc_tag_uriが「tag:iso.org,2023:15938-17」の場合は、nnpfc_uriによって識別されるニューラルネットワークのデータがISO/IEC 15938-17に準拠していてNNC(Neural Network Coding)で符号化されていることを示す。 nnpfc_tag_uri contains a tag URI with the syntax and semantics specified in IETF RFC4151. It identifies the format of, and related information about, the neural network used as the base post-filter process, or used as an update relative to the post-filter process with the same nnpfc_id value. Note that nnpfc_tag_uri allows the format of the neural network data specified by nnpfc_uri to be uniquely identified without the need for a registration authority. If nnpfc_tag_uri is "tag:iso.org,2023:15938-17", it indicates that the neural network data identified by nnpfc_uri conforms to ISO/IEC 15938-17 and is encoded with NNC (Neural Network Coding).

nnpfc_uriは、IETF Internet Standard 66で指定されているシンタクスとセマンティクスを持つURIが含まれており、ポストフィルタ処理として使用されるニューラルネットワーク、または同じnnpfc_id値を持つポストフィルタ処理に関連する更新として使用される。 nnpfc_uri contains a URI with syntax and semantics specified in IETF Internet Standard 66 to identify the neural network to be used as a postfilter or update associated with a postfilter with the same nnpfc_id value.
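
As a non-limiting illustration, the way in which nnpfc_mode_idc, nnpfc_tag_uri, and nnpfc_uri together indicate where the neural network data is found may be sketched in Python as follows. The function name, the returned tuple, and the example URI are hypothetical; no retrieval or parsing of neural network data is performed.

```python
NNC_TAG_URI = "tag:iso.org,2023:15938-17"


def describe_filter_source(nnpfc_mode_idc, payload=b"", nnpfc_tag_uri=None, nnpfc_uri=None):
    """Return (location, data format, reference) for the post-filter network."""
    if nnpfc_mode_idc == 0:
        # The network (or its update) is carried in the SEI message itself.
        return ("sei_payload", "ISO/IEC 15938-17", bytes(payload))
    if nnpfc_mode_idc == 1:
        if nnpfc_tag_uri == NNC_TAG_URI:
            data_format = "ISO/IEC 15938-17"
        else:
            # Any other tag URI names a format this sketch does not interpret.
            data_format = "format identified by " + str(nnpfc_tag_uri)
        return ("external_uri", data_format, nnpfc_uri)
    raise ValueError("nnpfc_mode_idc value not covered by this sketch")


print(describe_filter_source(0, payload=[0x01, 0x02]))
print(describe_filter_source(1, nnpfc_tag_uri=NNC_TAG_URI,
                             nnpfc_uri="https://example.com/filter.nnc"))
```
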

nnpfc_formatting_and_purpose_flagが1の場合、フィルタの目的、入力フォーマット、出力フォーマット、および複雑さに関連する構文要素が存在することを示す。nnpfc_formatting_and_purpose_flagが0の場合は、フィルタの目的、入力フォーマット、出力フォーマット、および複雑さに関連するシンタクス要素が存在しないことを示す。 When nnpfc_formatting_and_purpose_flag is 1, it indicates that syntax elements related to the filter's purpose, input format, output format, and complexity are present. When nnpfc_formatting_and_purpose_flag is 0, it indicates that syntax elements related to the filter's purpose, input format, output format, and complexity are not present.

このSEIメッセージが、現在のCLVS内で特定のnnpfc_id値を持つ、復号順で最初のNNPFC SEIメッセージである場合、nnpfc_formatting_and_purpose_flagは1に等しいものとする。このSEIメッセージが復号順で最初のNNPFC SEIメッセージでない場合、現在のCLVS内に特定のnnpfc_id値がある場合、nnpfc_formatting_and_purpose_flagの値は0である必要がある。 If this SEI message is the first NNPFC SEI message, in decoding order, with a particular nnpfc_id value within the current CLVS, nnpfc_formatting_and_purpose_flag shall be equal to 1. If this SEI message is not the first NNPFC SEI message, in decoding order, with the particular nnpfc_id value within the current CLVS, the value of nnpfc_formatting_and_purpose_flag shall be 0.
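
As a non-limiting illustration, the distinction between the first NNPFC SEI message with a given nnpfc_id in a CLVS and later messages with the same nnpfc_id, together with the constraint on nnpfc_formatting_and_purpose_flag, may be sketched in Python as follows. The class and method names are hypothetical, and the handling of exact repetitions of an earlier message is deliberately omitted from this sketch.

```python
class NnpfcTracker:
    """Track which nnpfc_id values have appeared in the current CLVS."""

    def __init__(self):
        self.seen_ids = set()

    def start_clvs(self):
        """Forget all identifiers when a new CLVS begins."""
        self.seen_ids.clear()

    def classify(self, nnpfc_id, nnpfc_formatting_and_purpose_flag):
        """Return 'base' for the first message with this id, else 'update'."""
        is_first = nnpfc_id not in self.seen_ids
        expected_flag = 1 if is_first else 0
        if nnpfc_formatting_and_purpose_flag != expected_flag:
            raise ValueError(
                "nnpfc_formatting_and_purpose_flag shall be %d here" % expected_flag
            )
        self.seen_ids.add(nnpfc_id)
        return "base" if is_first else "update"


tracker = NnpfcTracker()
print(tracker.classify(7, 1))
print(tracker.classify(7, 0))
tracker.start_clvs()
print(tracker.classify(7, 1))
```
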

nnpfc_purposeはポストフィルタ処理の目的を示す。nnpfc_purposeの値は、1の場合、画質改善であり、2の場合、4:2:0色差フォーマットから4:2:2または4:4:4への色差のアップサンプリング、または4:2:2色差フォーマットから4:4:4の色差アップサンプリングであり、3の場合は、色差フォーマットを変更せずにトリミングされた復号出力画像の幅または高さを増やす。4の場合は、復号出力画像の幅または高さを増やし、色差フォーマットをアップサンプリングし、5の場合は、ピクチャーレートアップサンプリングを示す。 nnpfc_purpose indicates the purpose of the postfilter processing. The value of nnpfc_purpose is 1 for image quality improvement, 2 for chrominance upsampling from 4:2:0 chrominance format to 4:2:2 or 4:4:4, or chrominance upsampling from 4:2:2 chrominance format to 4:4:4, 3 for increasing the width or height of the cropped decoded output image without changing the chrominance format, 4 for increasing the width or height of the decoded output image and upsampling the chrominance format, and 5 for picture rate upsampling.
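
As a non-limiting illustration, the values of nnpfc_purpose and the permitted chrominance upsampling conversions may be sketched in Python as follows. The table and function names are hypothetical, and the short descriptions paraphrase the text.

```python
NNPFC_PURPOSE = {
    1: "image quality improvement",
    2: "chrominance upsampling",
    3: "increase of width or height, chrominance format unchanged",
    4: "increase of width or height with chrominance upsampling",
    5: "picture rate upsampling",
}

# Conversions named in the text for chrominance upsampling.
CHROMA_UPSAMPLING_TARGETS = {
    "4:2:0": {"4:2:2", "4:4:4"},
    "4:2:2": {"4:4:4"},
}


def changes_picture_size(nnpfc_purpose):
    """True when the purpose increases the width or height of the output image."""
    return nnpfc_purpose in (3, 4)


def is_valid_chroma_upsampling(source_format, target_format):
    """True when the conversion is one of those listed for chrominance upsampling."""
    return target_format in CHROMA_UPSAMPLING_TARGETS.get(source_format, set())


print(NNPFC_PURPOSE[2], is_valid_chroma_upsampling("4:2:0", "4:4:4"))
print(changes_picture_size(5))
```
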

nnpfc_reserved_zero_bit_bは、0とする。 nnpfc_reserved_zero_bit_b is set to 0.

nnpfc_payload_byte[i]には、ISO/IEC 15938-17に準拠して、NNCで符号化されているビットストリームのi番目のバイトとする。 nnpfc_payload_byte[i] is the i-th byte of the bitstream encoded in NNC in accordance with ISO/IEC 15938-17.

（ニューラルネットワークポストフィルタアクティベーション(NNPFA)SEI）
図7は、非特許文献3のニューラルネットワークポストフィルタアクティベーション(NNPFA)SEIメッセージのシンタクスを示している。ニューラルネットワークポストフィルタアクティベーションNNPFA SEIメッセージは、一連のピクチャのポストフィルタ処理のために、nnpfa_target_idによって識別される対象ニューラルネットワークポストフィルタ処理の適用をアクティブ化または非アクティブ化する。 (Neural Network Post Filter Activation (NNPFA) SEI)
FIG. 7 shows the syntax of the Neural Network Post-Filter Activation (NNPFA) SEI message of Non-Patent Document 3. The NNPFA SEI message activates or deactivates the application of the target neural network post-filter process identified by nnpfa_target_id for the post-filter processing of a sequence of pictures.

nnpfa_target_idは、対象とするピクチャのニューラルネットワークポストフィルタ処理を示す。これは、現在のピクチャに対して、nnpfa_target_idと等しいnnpfc_idを持つ1つ以上のNNPFC SEIメッセージを特定する。 nnpfa_target_id indicates the target neural network post-filter process for the picture. It identifies one or more NNPFC SEI messages that pertain to the current picture and have nnpfc_id equal to nnpfa_target_id.

An NNPFA SEI message with a particular value of nnpfa_target_id shall not be present in the current PU unless one or both of the following conditions are true:
- Within the current CLVS, an NNPFC SEI message with nnpfc_id equal to that value of nnpfa_target_id is present in a PU that precedes the current PU in decoding order.
- An NNPFC SEI message with nnpfc_id equal to that value of nnpfa_target_id is present in the current PU.

PUにnnpfc_idの特定の値を持つNNPFC SEIメッセージと、nnpfc_idの特定の値に等しいnnpfa_target_idを持つNNPFA SEIメッセージの両方が含まれる場合、NNPFC SEIメッセージは復号順でNNPFA SEIメッセージに先行するものとする。 If a PU contains both an NNPFC SEI message with a particular value of nnpfc_id and an NNPFA SEI message with nnpfa_target_id equal to a particular value of nnpfc_id, the NNPFC SEI message shall precede the NNPFA SEI message in decoding order.

nnpfa_cancel_flagが1の場合は、現在のSEIメッセージと同じnnpfa_target_idを持つ以前のNNPFA SEIメッセージによって設定された対象ニューラルネットワークポストフィルタ処理の継続性がキャンセルされることを示す。つまり、対象とするニューラルネットワークポストフィルタ処理は実行されない。 nnpfa_cancel_flag, when set to 1, indicates that the continuity of the target neural network postfilter process set by the previous NNPFA SEI message with the same nnpfa_target_id as the current SEI message is cancelled. That is, the target neural network postfilter process is not executed.

A cancelled post-filter is not used again unless it is activated by another NNPFA SEI message that has the same nnpfa_target_id as the current SEI message and nnpfa_cancel_flag equal to 0. When nnpfa_cancel_flag is 0, nnpfa_persistence_flag follows in the syntax.

nnpfa_persistence_flag indicates the persistence of the target neural network post-filter processing for the current layer.

nnpfa_persistence_flagが0の場合は、対象とするニューラルネットワークポストフィルタ処理が、現在の画像のみのポストフィルタ処理に適用されることを示す。 When nnpfa_persistence_flag is 0, it indicates that the target neural network postfiltering applies to postfiltering of the current image only.

When nnpfa_persistence_flag is 1, the target neural network post-filter processing is applied to the current picture and to all subsequent pictures of the current layer until one or more of the following conditions become true:
- A new CLVS of the current layer begins.
- The bitstream ends.
- A picture of the current layer that is associated with an NNPFA SEI message having the same nnpfa_target_id as the current SEI message and nnpfa_cancel_flag equal to 1 is output after the current picture in output order.

Note that the neural network post-filter processing is not applied to subsequent pictures of the current layer that are associated with an NNPFA SEI message having the same nnpfa_target_id as the current SEI message and nnpfa_cancel_flag equal to 1.
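The activation, persistence, and cancellation rules above can be sketched as a small state tracker. This is a simplified illustration under stated assumptions, not the normative behaviour: pictures are assumed to be handled in a single order (decoding order and output order are not distinguished), a message with nnpfa_persistence_flag equal to 0 is assumed to leave earlier persistent activations untouched, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NnpfaMessage:
    target_id: int             # nnpfa_target_id
    cancel_flag: int           # nnpfa_cancel_flag
    persistence_flag: int = 0  # nnpfa_persistence_flag, present only when cancel_flag is 0


@dataclass
class ActivationTracker:
    # Target ids whose activation persists beyond the current picture.
    persistent: set = field(default_factory=set)

    def start_new_clvs(self):
        """A new CLVS of the current layer ends every persistent activation."""
        self.persistent.clear()

    def filters_for_picture(self, messages):
        """Return the target ids to apply to the current picture and update the state."""
        active = set(self.persistent)
        for msg in messages:
            if msg.cancel_flag == 1:
                # Cancellation: the filter is applied neither now nor later.
                self.persistent.discard(msg.target_id)
                active.discard(msg.target_id)
            else:
                active.add(msg.target_id)
                if msg.persistence_flag == 1:
                    self.persistent.add(msg.target_id)
        return active
```

For example, a message with persistence_flag equal to 1 keeps its target id in the returned set for later pictures that carry no NNPFA SEI message, until a cancelling message arrives or start_new_clvs() is called.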

(Attention Information SEI)
Figure 8 shows one form of the syntax of the auxiliary information that is encoded by the auxiliary information encoding device 81 and decoded by the auxiliary information decoding device 91 of this embodiment. The example shows an SEI called attention_info. This SEI is transmitted per PU and is intended to improve recognition accuracy and to reduce the amount of processing when the image recognition device processes the picture. To that end, it is an SEI message that is encoded and decoded as auxiliary information for generating the attention information used in recognition processing of the picture, and it occupies the number of bytes given by payloadSize.

以下、本実施の形態におけるアテンション情報SEIのシンタックス及びシンタックス要素とセマンティクスについて説明する。 The syntax, syntax elements, and semantics of the attention information SEI in this embodiment are explained below.

To apply this SEI message, the following variables must be defined:
- The variables SubWidthC and SubHeightC, which describe the chrominance format indicated by ChromaFormatIdc of the picture decoded by the image decoding device 31. For 4:2:0, SubWidthC and SubHeightC are both 2; for 4:2:2, SubWidthC is 2 and SubHeightC is 1; for 4:4:4, SubWidthC and SubHeightC are both 1.
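The relationship between the chrominance format and the two variables can be written, purely as an illustrative sketch, as a table keyed by the format name. The function name chroma_subsampling is hypothetical, and the mapping from ChromaFormatIdc values to format names is deliberately left out because it is not stated here.

```python
# (SubWidthC, SubHeightC) for each chrominance format named in the text.
_SUBSAMPLING = {
    "4:2:0": (2, 2),
    "4:2:2": (2, 1),
    "4:4:4": (1, 1),
}


def chroma_subsampling(chroma_format):
    """Return (SubWidthC, SubHeightC) for '4:2:0', '4:2:2' or '4:4:4'."""
    return _SUBSAMPLING[chroma_format]
```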

The syntax element attention_target_id identifies the neural network targeted for the picture. If one or more neural network post-filter characteristics (NNPFC) SEI messages with nnpfc_id equal to nnpfa_target_id exist, the attention information is supplied as an input to the neural network post-filter processing that serves the nnpfc_purpose of that NNPFC SEI, and the post-filter processing is performed. In this case, the attention information is input to the post-image processing device 1002 and used there as attention information.

As a specific example of an embodiment, when the value of nnpfc_auxiliary_inp_idc is 2, the attention information is supplied as an input tensor of the neural network post-filter, and post-filter processing is performed on the regions to which the attention information assigns a high degree of attention. The post-filter processing in this case may be optimized for improving the image recognition rate rather than simply for improving image quality.

別の実施の形の例としては、NNPFA SEIによってアクティブ化されたピクチャのポストフィルタ処理を実行する時に、アテンション情報を用いて、注目度の高い領域のポストフィルタ処理の強度を調整してもよい。 As another example of implementation, when performing post-filtering of a picture activated by the NNPFA SEI, attention information may be used to adjust the strength of post-filtering of regions of high interest.

このような構成にすることで、画像認識精度の向上と、画像認識処理の処理量の削減が実現できる。 This configuration improves image recognition accuracy and reduces the amount of processing required for image recognition.

If the value of attention_target_id is not an nnpfc_id value but equals the id value of a neural network of the image recognition device 51, the attention information is input to the image recognition device 51 and used there as attention information for image recognition. This configuration improves image recognition accuracy and reduces the amount of image recognition processing.

attention_width_in_luma_samplesと、attention_height_in_luma_samplesは、アテンション情報の輝度の水平方向と垂直方向の画素数を表している。なお、attention_width_in_luma_samplesの値は、色差画素数が輝度画素数と違う4:2:2や4:2:0色差フォーマットに対応するために、変数SubWidthCの倍数の値とし、attention_height_in_luma_samplesの値は、変数SubHeightCの倍数の値とする。 attention_width_in_luma_samples and attention_height_in_luma_samples represent the number of pixels in the horizontal and vertical directions for the luminance of the attention information. Note that the value of attention_width_in_luma_samples is a multiple of the variable SubWidthC to accommodate 4:2:2 and 4:2:0 chrominance formats, where the number of chrominance pixels is different from the number of luminance pixels, and the value of attention_height_in_luma_samples is a multiple of the variable SubHeightC.

attention_bit_depth_minus8 is a syntax element that indicates the bit depth of the attention value minus 8. The attention value takes values from 0 to 2 to the power of (attention_bit_depth_minus8 + 8), minus 1.
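As a short worked example of this range, the upper bound can be computed with a shift. The function name max_attention_value is hypothetical.

```python
def max_attention_value(attention_bit_depth_minus8):
    """Largest attention value: 2 ** (attention_bit_depth_minus8 + 8) - 1."""
    return (1 << (attention_bit_depth_minus8 + 8)) - 1
```

With attention_bit_depth_minus8 equal to 0 the values span 0 to 255, and with 2 they span 0 to 1023.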

attention_number_of_region_minus1 is a syntax element that represents the number of regions used to describe the attention information minus 1. That is, attention_number_of_region_minus1 + 1 regions are described.

attention_region_x and attention_region_y are syntax elements that indicate the top left position of the rectangle used to generate the attention. attention_region_x is the x coordinate (horizontal direction), in luminance samples, of the top left of the rectangular region, and its value is a multiple of SubWidthC. attention_region_y is the y coordinate (vertical direction), in luminance samples, of the top left of the rectangular region, and its value is a multiple of SubHeightC. attention_region_x and attention_region_y may also be relative positions within the picture.

attention_region_width and attention_region_height are syntax elements that indicate the size of the rectangle used to generate the attention. attention_region_width is the number of luminance pixels of the rectangular region in the horizontal direction and is a multiple of SubWidthC. The value of attention_region_x + attention_region_width shall not exceed the number of pixels of the picture in the horizontal direction. attention_region_height is the number of luminance pixels of the rectangular region in the vertical direction and is a multiple of SubHeightC. The value of attention_region_y + attention_region_height shall not exceed the number of pixels of the picture in the vertical direction.

This embodiment describes each region used to generate the attention information as a rectangle given by the coordinates of its top left corner and its numbers of pixels in the horizontal and vertical directions, but other representations are possible. For example, the position information (attention_region_x, attention_region_y) may refer to the top right, bottom left, or bottom right corner, or to the center of gravity, instead of the top left corner. The region may also be restricted to a square, in which case only the number of pixels on one side (region_size) is specified instead of attention_region_width and attention_region_height. Alternatively, the position and size may be specified not in pixel units but in 4x4 or 16x16 units, or by a CTU address and a number of CTUs, the CTU being the unit of encoding.

attention_region_weight_yは、アテンションを生成するための輝度の値を示している。attention_region_weight_cbは、アテンションを生成するための色差Cbの値を示している。attention_region_weight_crは、アテンションを生成するための色差Crの値を示している。attention_region_weight_yとattention_region_weight_cbとattention_region_weight_crは、0から、2の（attention_bit_depth_minus8+8）乗マイナス１の値をとるものとする。 attention_region_weight_y indicates the luminance value for generating attention. attention_region_weight_cb indicates the chrominance Cb value for generating attention. attention_region_weight_cr indicates the chrominance Cr value for generating attention. attention_region_weight_y, attention_region_weight_cb, and attention_region_weight_cr shall take values from 0 to 2 to the power of (attention_bit_depth_minus8+8) minus 1.
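The constraints stated for the region position, size, and weights can be collected into one check. The following is a minimal sketch under the assumption that the picture size is given in luminance pixels; the AttentionRegion class and the region_violations function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class AttentionRegion:
    x: int          # attention_region_x
    y: int          # attention_region_y
    width: int      # attention_region_width
    height: int     # attention_region_height
    weight_y: int   # attention_region_weight_y
    weight_cb: int  # attention_region_weight_cb
    weight_cr: int  # attention_region_weight_cr


def region_violations(region, pic_width, pic_height,
                      sub_width_c, sub_height_c, bit_depth_minus8):
    """Return a list of violated constraints (empty when the region is valid)."""
    problems = []
    if region.x % sub_width_c or region.width % sub_width_c:
        problems.append("x and width must be multiples of SubWidthC")
    if region.y % sub_height_c or region.height % sub_height_c:
        problems.append("y and height must be multiples of SubHeightC")
    if region.x + region.width > pic_width:
        problems.append("region exceeds the picture width")
    if region.y + region.height > pic_height:
        problems.append("region exceeds the picture height")
    max_value = (1 << (bit_depth_minus8 + 8)) - 1
    for name in ("weight_y", "weight_cb", "weight_cr"):
        if not 0 <= getattr(region, name) <= max_value:
            problems.append(f"{name} is outside 0..{max_value}")
    return problems
```

An encoder-side implementation might run such a check before writing the SEI, and a decoder-side one might use it to discard malformed regions; either use is a design choice outside the text.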

図9は、図8のアテンション情報SEIのシンタクス要素の値からアテンションを作成する手順の一例を示している。 Figure 9 shows an example of a procedure for creating attention from the values of syntax elements of the attention information SEI in Figure 8.

配列AttentionY[][]、AttentionCb[][]、AttentionCrは[][]は、アテンション情報を表す2次元配列構造で、attention_width_in_luma_samplesとattention_height_in_luma_samplesが、水平方向と垂直方法の輝度の画素数を表している。アテンション情報の色差の画素数は、水平方向がattention_width_in_luma_samples/SubWidthC、垂直方向がattention_height_in_luma_samples/SubHeightCとなる。 The arrays AttentionY[][], AttentionCb[][], and AttentionCr[][] are two-dimensional array structures that represent attention information, and attention_width_in_luma_samples and attention_height_in_luma_samples represent the number of luminance pixels in the horizontal and vertical directions. The number of chrominance pixels of the attention information is attention_width_in_luma_samples/SubWidthC in the horizontal direction and attention_height_in_luma_samples/SubHeightC in the vertical direction.

初めに、配列AttentionY[][]、AttentionCb[][]、AttentionCrは[][]の初期化を行う。輝度のアテンション配列AttentionY[][]は、0で初期化し、色差のアテンション配列AttentionCb[][]、AttentionCrは[][]は、2の(attention_bit_depth_minus8 + 7)乗である、1 << (attention_bit_depth_minus8 + 7)の値で初期化される。 First, the arrays AttentionY[][], AttentionCb[][], and AttentionCr[][] are initialized. The luminance attention array AttentionY[][] is initialized to 0, and the chrominance attention arrays AttentionCb[][] and AttentionCr[][] are initialized to the value 1 << (attention_bit_depth_minus8 + 7), which is 2 to the power of (attention_bit_depth_minus8 + 7).

Next, a loop is executed attention_number_of_region_minus1 + 1 times. In each iteration, attention_region_weight_y is written to the elements AttentionY[y][x] inside the rectangle of size attention_region_width by attention_region_height that starts at the point given by attention_region_x and attention_region_y. attention_number_of_region_minus1 is the syntax element that represents the number of rectangular regions describing the attention information set in the attention information SEI, minus 1.

attention_region_xは矩形領域の左上の輝度のx座標値（水平方向）である。attention_region_yは矩形領域の左上の輝度のy座標値（垂直方向）である。attention_region_widthは水平方向の画素数、attention_region_heightは垂直方向の画素数である。AttentionY[y][x]は矩形領域の輝度のアテンション配列である。attention_region_weight_yはアテンションの値である。 attention_region_x is the x coordinate value (horizontal) of the luminance of the top left of the rectangular region. attention_region_y is the y coordinate value (vertical) of the luminance of the top left of the rectangular region. attention_region_width is the number of pixels in the horizontal direction, and attention_region_height is the number of pixels in the vertical direction. AttentionY[y][x] is the attention array of the luminance of the rectangular region. attention_region_weight_y is the attention value.

また、AttentionCb[y][x]、AttentionCr[y][x]に対しては、attention_region_x/SubWidthCとattention_region_y/SuabHeightCで示されるポイントから、attention_region_width/SubWidthC、attention_region_height/SubHeightCのが画素サイズを持つ矩形領域に、それぞれ、valCbとvalCrを入力する。 For AttentionCb[y][x] and AttentionCr[y][x], valCb and valCr are input, respectively, from the points indicated by attention_region_x/SubWidthC and attention_region_y/SubHeightC to rectangular regions with pixel sizes of attention_region_width/SubWidthC and attention_region_height/SubHeightC.

AttentionCb[y][x] and AttentionCr[y][x] are the attention arrays for chrominance. attention_region_x/SubWidthC is the x coordinate (horizontal direction) of the top left of the rectangular region in chrominance samples. attention_region_y/SubHeightC is the y coordinate (vertical direction) of the top left of the rectangular region in chrominance samples. attention_region_width/SubWidthC is the number of chrominance pixels in the horizontal direction, and attention_region_height/SubHeightC is the number of chrominance pixels in the vertical direction.

valCbはattention_region_weight_cbに2の(attention_bit_depth_minus8 + 7)乗の値分を足した値、valCrはattention_region_weight_crに2の(attention_bit_depth_minus8 + 7)乗の値分を足した値である。 valCb is the sum of attention_region_weight_cb and 2 to the power of (attention_bit_depth_minus8 + 7), and valCr is the sum of attention_region_weight_cr and 2 to the power of (attention_bit_depth_minus8 + 7).

なお、本実施の形態でのアテンション情報を作成方法では、アテンション情報を記述するための矩形領域が重なった場合には、あとから定義された領域の値で書き換えることになるが、元に入っていた値と平均をとってもよい。 Note that in the method for creating attention information in this embodiment, if the rectangular areas for describing attention information overlap, the value will be overwritten with the value of the area defined later, but the average of the original value and this value may also be taken.
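The procedure described above for Figure 9 can be sketched as follows. This is an illustration written from the prose, not a reproduction of Figure 9: regions are passed as dictionaries with hypothetical keys, later regions overwrite earlier ones where they overlap (the averaging variant is not shown), and the function name build_attention_maps is an assumption.

```python
def build_attention_maps(width, height, sub_width_c, sub_height_c,
                         bit_depth_minus8, regions):
    """Build AttentionY, AttentionCb and AttentionCr as nested lists [y][x]."""
    offset = 1 << (bit_depth_minus8 + 7)
    chroma_w = width // sub_width_c
    chroma_h = height // sub_height_c

    # Initialisation: luminance to 0, chrominance to 1 << (bit_depth_minus8 + 7).
    attention_y = [[0] * width for _ in range(height)]
    attention_cb = [[offset] * chroma_w for _ in range(chroma_h)]
    attention_cr = [[offset] * chroma_w for _ in range(chroma_h)]

    # One iteration per region, i.e. attention_number_of_region_minus1 + 1 times.
    for r in regions:
        for y in range(r["y"], r["y"] + r["height"]):
            for x in range(r["x"], r["x"] + r["width"]):
                attention_y[y][x] = r["weight_y"]

        val_cb = r["weight_cb"] + offset
        val_cr = r["weight_cr"] + offset
        cx0 = r["x"] // sub_width_c
        cy0 = r["y"] // sub_height_c
        for y in range(cy0, cy0 + r["height"] // sub_height_c):
            for x in range(cx0, cx0 + r["width"] // sub_width_c):
                attention_cb[y][x] = val_cb
                attention_cr[y][x] = val_cr

    return attention_y, attention_cb, attention_cr
```

Here width and height stand for attention_width_in_luma_samples and attention_height_in_luma_samples, and each region is assumed to already satisfy the alignment and bounds constraints on its position and size.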

このようなシンタクスとセマンティクスによることで、簡潔にかつ少ない符号量でアテンション情報を記述することができる。 Using this syntax and semantics, attention information can be described concisely and with a small amount of code.

The image analysis device 61 analyzes the input video T and detects recognition target candidates. To keep the amount of processing low, the detection only needs to be accurate enough to identify candidates. When the position of the recognition target in the picture can be anticipated, as with images from a fixed camera, the image analysis device 61 may set the detection target region in advance as the recognition candidate region.

補助情報作成装置71では、画像解析装置61で検出したアテンション情報を、ピクチャ内の位置、矩形の大きさの情報に変換して、補助情報符号化装置81に送る。 The auxiliary information creation device 71 converts the attention information detected by the image analysis device 61 into information on the position within the picture and the size of the rectangle, and sends it to the auxiliary information encoding device 81.
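One possible way to perform this conversion is to expand a detected bounding box outward so that its position and size become multiples of SubWidthC and SubHeightC. The sketch below is an assumption about how the auxiliary information creation device 71 could do this, not a description of it: the box is assumed to be given in luminance pixels with exclusive right and bottom edges, the picture dimensions are assumed to be multiples of SubWidthC and SubHeightC, and the function name bbox_to_region is hypothetical.

```python
def bbox_to_region(left, top, right, bottom,
                   pic_width, pic_height, sub_width_c, sub_height_c):
    """Expand a bounding box to an aligned region inside the picture."""
    # Round the top left corner down to the alignment grid.
    x = (left // sub_width_c) * sub_width_c
    y = (top // sub_height_c) * sub_height_c
    # Round the bottom right corner up, then clip it to the picture.
    x_end = min(-(-right // sub_width_c) * sub_width_c, pic_width)
    y_end = min(-(-bottom // sub_height_c) * sub_height_c, pic_height)
    return {"x": x, "y": y, "width": x_end - x, "height": y_end - y}
```

Expanding rather than shrinking keeps the whole detected candidate inside the region, at the cost of covering a few extra pixels.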

Alternatively, the output of the auxiliary information creation device 71 may be input to the image encoding device 11. In this case, the image encoding device 11 may control image quality using the attention information created by the auxiliary information creation device 71. For example, the regions of the recognition target candidates may be encoded at higher quality by using a smaller quantization parameter for them than for other regions in the picture. Doing so makes it possible to improve recognition accuracy.

In addition to the decoded video Td, the attention information is input to the image recognition device 51 as auxiliary information. As a result, instead of treating all the information in the picture equally, the device can process the regions of the recognition target candidates with priority, which greatly reduces the amount of processing and can also improve recognition accuracy. Furthermore, if the image quality of the decoded image in the recognition target regions is improved, recognition accuracy improves as well.

本実施の形態によれば、低レートで符号化された復号画像を用いても画像認識装置51での画像認識精度の向上と、画像認識処理の処理量の削減が実現できる。 According to this embodiment, even when using decoded images encoded at a low rate, it is possible to improve the image recognition accuracy of the image recognition device 51 and reduce the amount of processing required for image recognition.

また、補助情報作成装置71、補助情報符号化装置81及び補助情報復号装置91で汎用ネットワークパラメータを共通に保持してもよい。補助情報作成装置71では、ニューラルネットワークポストフィルタ特性SEIなどの枠組みを用いて、共通に保持している汎用ネットワークを部分的に更新するネットワークパラメータを補助情報として作成する。そして、補助情報符号化装置81で符号化し、補助情報復号装置91で復号してもよい。このような構成にすることで、補助情報の符号量を削減し、入力画像Tに応じた補助情報を作成、符号化、復号できる。 In addition, the auxiliary information creation device 71, the auxiliary information encoding device 81, and the auxiliary information decoding device 91 may commonly hold generic network parameters. The auxiliary information creation device 71 uses a framework such as the neural network postfilter characteristic SEI to create network parameters as auxiliary information that partially update the commonly held generic network. The auxiliary information may then be encoded by the auxiliary information encoding device 81 and decoded by the auxiliary information decoding device 91. With this configuration, the amount of code for the auxiliary information can be reduced, and auxiliary information corresponding to the input image T can be created, encoded, and decoded.

また、ネットワークパラメータの伝送フォーマットとして、複数のフォーマットに対応するために、フォーマットを示すパラメータ（識別子）を送付してもよい。また、識別子に続く実際の補助情報については、バイト列で伝送してもよい。 In order to support multiple formats as the transmission format of the network parameters, a parameter (identifier) indicating the format may be sent. Furthermore, the actual auxiliary information following the identifier may be transmitted as a byte string.

補助情報復号装置91で復号したネットワークパラメータの補助情報は、ポスト画像処理装置1002に入力する。 The auxiliary information of the network parameters decoded by the auxiliary information decoding device 91 is input to the post-image processing device 1002.

ポスト画像処理装置1002では、復号した補助情報（ニューラルネットワークポストフィルタ特性SEI、ニューラルネットワークポストフィルタアクティベーションSEI、アテンション情報SEI）を用いて、ニューラルネットワークを用いたポスト画像処理を行い、復号動画像Tdを復元する。 The post-image processing device 1002 uses the decoded auxiliary information (neural network post-filter characteristics SEI, neural network post-filter activation SEI, and attention information SEI) to perform post-image processing using a neural network to restore the decoded video image Td.

例えば、図8で示したアテンション情報SEIのアテンション情報を用いて特定の領域のみに、ポスト画像処理を行っても良い。この時のポストフィルタ処理は、単に画質向上ではなく、画像認識率が向上することを基準に最適化してもよい。 For example, post-image processing may be performed only on specific regions using the attention information in the attention information SEI shown in Figure 8. The post-filter processing in this case may be optimized based on improving the image recognition rate, rather than simply improving image quality.
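Restricting post-image processing to attended regions can be illustrated with a simple mask over one sample plane. This sketch makes several assumptions: the plane and the attention array have the same dimensions, any attention value at or above a threshold marks a pixel as attended, and post_filter is a placeholder callable standing in for the neural network post-filter. For brevity it filters the whole plane and then selects pixels, so it shows the selection logic only and does not reduce computation.

```python
def filter_attended_regions(plane, attention, post_filter, threshold=1):
    """Keep filtered samples where attention >= threshold, original samples elsewhere."""
    filtered = post_filter(plane)
    return [
        [
            filtered[y][x] if attention[y][x] >= threshold else plane[y][x]
            for x in range(len(row))
        ]
        for y, row in enumerate(plane)
    ]
```

An implementation aiming to reduce the amount of processing would instead run the filter only on the attended rectangles.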

これにより、復号画像側で復号動画像Tdの画質改善を行うとともに、画像認識装置での認識精度の向上を図る。 This improves the image quality of the decoded video image Td on the decoded image side, and also improves the recognition accuracy of the image recognition device.

なお、アクティベーション情報の符号化、復号は、SEI限定されず、SPS、PPS、APS、スライスヘッダなどのシンタックスを用いてもよい。 The encoding and decoding of activation information is not limited to SEI, and syntax such as SPS, PPS, APS, and slice header may also be used.

The auxiliary information encoding device 81 encodes the auxiliary information based on the syntax tables in Figures 6, 7, 8, and 10. The auxiliary information is encoded as supplemental enhancement information (SEI), multiplexed into the encoded data Te output by the image encoding device 11, and output to the network 21.

The auxiliary information decoding device 91 decodes the auxiliary information, which is encoded as supplemental enhancement information (SEI), from the encoded data Te based on the syntax tables in Figures 6, 7, 8, and 10, and sends the decoded result to the post-image processing device 1002 and the image recognition device 51.

The post-image processing device 1002 performs post-image processing on the decoded video Td using the decoded video Td and the auxiliary information, and generates a post-processed image To.

なお、本実施の形態の一例では、SEIでのシンタックスを示したが、SEIに限定されず、SPS、PPS、APS、スライスヘッダなどのシンタックスを用いてもよい。 Note that in the example of this embodiment, the syntax is shown as SEI, but it is not limited to SEI, and syntax such as SPS, PPS, APS, slice header, etc. may also be used.

このような構成にすることで、動画像符号化、復号方式の枠組みを大きく変更せずに、付加的な補助情報を符号化、復号することで、低レートにおいても、画像認識精度を維持するという課題が解決できる。 By configuring in this way, the problem of maintaining image recognition accuracy even at low rates can be solved by encoding and decoding additional auxiliary information without making major changes to the framework of the video encoding and decoding method.

なお、上述した実施形態における画像符号化装置11、画像復号装置31の一部、例えば、エントロピー復号部301、パラメータ復号部302、ループフィルタ305、予測画像生成部308、逆量子化・逆変換部311、加算部312、予測パラメータ導出部320、予測画像生成部101、減算部102、変換・量子化部103、エントロピー符号化部104、逆量子化・逆変換部105、ループフィルタ107、符号化パラメータ決定部110、パラメータ符号化部111、予測パラメータ導出部120をコンピュータで実現するようにしても良い。その場合、この制御機能を実現するためのプログラムをコンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行することによって実現しても良い。なお、ここでいう「コンピュータシステム」とは、画像符号化装置11、画像復号装置31のいずれかに内蔵されたコンピュータシステムであって、OSや周辺機器等のハードウェアを含むものとする。また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ROM、CD-ROM等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置のことをいう。さらに「コンピュータ読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムを送信する場合の通信線のように、短時間、動的にプログラムを保持するもの、その場合のサーバやクライアントとなるコンピュータシステム内部の揮発性メモリのように、一定時間プログラムを保持しているものも含んでも良い。また上記プログラムは、前述した機能の一部を実現するためのものであっても良く、さらに前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるものであっても良い。 In addition, a part of the image encoding device 11 and the image decoding device 31 in the above-mentioned embodiment, for example, the entropy decoding unit 301, the parameter decoding unit 302, the loop filter 305, the predicted image generating unit 308, the inverse quantization and inverse transform unit 311, the addition unit 312, the prediction parameter derivation unit 320, the predicted image generating unit 101, the subtraction unit 102, the transform and quantization unit 103, the entropy coding unit 104, the inverse quantization and inverse transform unit 105, the loop filter 107, the coding parameter determination unit 110, the parameter coding unit 111, and the prediction parameter derivation unit 120 may be realized by a computer. In that case, a program for realizing this control function may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read into a computer system and executed to realize the control function. 
In addition, the "computer system" referred to here is a computer system built into either the image encoding device 11 or the image decoding device 31, and includes hardware such as an OS and peripheral devices. Additionally, "computer-readable recording media" refers to portable media such as flexible disks, optical magnetic disks, ROMs, and CD-ROMs, as well as storage devices such as hard disks built into computer systems. Furthermore, "computer-readable recording media" may also include devices that dynamically store a program for a short period of time, such as a communication line when transmitting a program via a network such as the Internet or a communication line such as a telephone line, or devices that store a program for a certain period of time, such as volatile memory within a computer system that serves as a server or client in such cases. Furthermore, the above-mentioned program may be one that realizes part of the functions described above, or may be one that can realize the functions described above in combination with a program already recorded in the computer system.

また、上述した実施形態における画像符号化装置11、画像復号装置31の一部、または全部を、LSI（Large Scale Integration）等の集積回路として実現しても良い。画像符号化装置11、画像復号装置31の各機能ブロックは個別にプロセッサ化しても良いし、一部、または全部を集積してプロセッサ化しても良い。また、集積回路化の手法はＬＳＩに限らず専用回路、または汎用プロセッサで実現しても良い。また、半導体技術の進歩によりLSIに代替する集積回路化の技術が出現した場合、当該技術による集積回路を用いても良い。 In addition, part or all of the image encoding device 11 and image decoding device 31 in the above-mentioned embodiments may be realized as an integrated circuit such as an LSI (Large Scale Integration). Each functional block of the image encoding device 11 and image decoding device 31 may be individually made into a processor, or part or all of them may be integrated into a processor. The integrated circuit method is not limited to LSI, and may be realized by a dedicated circuit or a general-purpose processor. Furthermore, if an integrated circuit technology that can replace LSI appears due to advances in semiconductor technology, an integrated circuit based on that technology may be used.

以上、図面を参照してこの発明の一実施形態について詳しく説明してきたが、具体的な構成は上述のものに限られることはなく、この発明の要旨を逸脱しない範囲内において様々な設計変更等をすることが可能である。 One embodiment of the present invention has been described in detail above with reference to the drawings, but the specific configuration is not limited to the above, and various design changes can be made without departing from the spirit of the present invention.

[Application example]
The video encoding device 10 and the video decoding device 30 described above can be installed and used in various devices that transmit, receive, record, and play back video. The video may be natural video captured by a camera or the like, or artificial video (including CG and GUI) generated by a computer or the like.

（SEIペイロード） (SEI payload)
図10は、SEIメッセージのコンテナであるSEIペイロードのシンタクスを示す図である。 FIG. 10 is a diagram showing the syntax of the SEI payload, which is a container of the SEI message.

nal_unit_typeがPREFIX_SEI_NUTの時に呼び出される。PREFIX_SEI_NUTは、スライスデータよりも前に位置するSEIであることを示している。 This syntax is invoked when nal_unit_type is PREFIX_SEI_NUT. PREFIX_SEI_NUT indicates that the SEI is located before the slice data.

payloadTypeが210の時、ニューラルネットワークポストフィルタ特性SEIが呼び出される。 When payloadType is 210, the neural network post-filter characteristics SEI is called.

payloadTypeが211の時、ニューラルネットワークポストフィルタアクティベーションSEIが呼び出される。 When payloadType is 211, the neural network post-filter activation SEI is called.

payloadTypeが212の時、アテンション情報SEIが呼び出される。 When payloadType is 212, the attention information SEI is called.
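The payloadType dispatch described above can be sketched as follows. This is only an illustrative sketch: the payloadType values 210, 211, and 212 and the PREFIX_SEI_NUT condition come from the text, while the handler functions, their names, and their string return values are hypothetical placeholders rather than the actual syntax parsers.

```python
# payloadType values as stated in the description.
NN_POST_FILTER_CHARACTERISTICS = 210
NN_POST_FILTER_ACTIVATION = 211
ATTENTION_INFORMATION = 212


# Hypothetical stand-ins for the real SEI syntax parsers.
def parse_nn_post_filter_characteristics(payload: bytes) -> str:
    return "nn_post_filter_characteristics"


def parse_nn_post_filter_activation(payload: bytes) -> str:
    return "nn_post_filter_activation"


def parse_attention_information(payload: bytes) -> str:
    return "attention_information"


SEI_HANDLERS = {
    NN_POST_FILTER_CHARACTERISTICS: parse_nn_post_filter_characteristics,
    NN_POST_FILTER_ACTIVATION: parse_nn_post_filter_activation,
    ATTENTION_INFORMATION: parse_attention_information,
}


def dispatch_sei_payload(nal_unit_type: str, payload_type: int, payload: bytes):
    """Select the SEI message parser for a payload, or return None if skipped."""
    # The SEI payload syntax is only invoked for prefix SEI NAL units.
    if nal_unit_type != "PREFIX_SEI_NUT":
        return None
    handler = SEI_HANDLERS.get(payload_type)
    if handler is None:
        # Assumption of this sketch: unknown payload types are ignored.
        return None
    return handler(payload)
```

In this sketch, a payload with payloadType 212 in a PREFIX_SEI_NUT unit is routed to the attention information parser, and any other NAL unit type is skipped.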

（SEIの復号とポストフィルタ処理） (SEI decoding and post-filtering)
ヘッダ復号部3020は、SEIメッセージのコンテナであるSEIペイロードを読み込み、ニューラルネットワークポストフィルタ特性SEIメッセージを復号する。例えば、ヘッダ復号部3020は、nnpfc_id、nnpfc_mode_idc、nnpfc_formatting_and_purpose_flag、nnpfc_purpose、nnpfc_reserved_zero_bit_a、nnpfc_uri_tag[i]、nnpfc_uri[i]、nnpfc_reserved_zero_bit_b、nnpfc_payload_byte[i]を復号する。 The header decoding unit 3020 reads the SEI payload, which is a container of the SEI message, and decodes the neural network post-filter characteristics SEI message. For example, the header decoding unit 3020 decodes nnpfc_id, nnpfc_mode_idc, nnpfc_formatting_and_purpose_flag, nnpfc_purpose, nnpfc_reserved_zero_bit_a, nnpfc_uri_tag[i], nnpfc_uri[i], nnpfc_reserved_zero_bit_b, and nnpfc_payload_byte[i].

図11は、ポスト画像処理装置1002の処理のフローチャートを示す図である。ポスト画像処理装置1002は、上記SEIメッセージのパラメータに従って以下の処理を行う。 Figure 11 is a diagram showing a flowchart of the processing of the post-image processing device 1002. The post-image processing device 1002 performs the following processing according to the parameters of the above SEI message.

S6001：SEIから処理量と精度を読み込む。 S6001: Read processing volume and accuracy from SEI.

S6002：ポスト画像処理装置1002が処理可能な複雑度を超える場合には終了する。超えない場合にはS6003へ進む。 S6002: If the complexity exceeds the level that the post-image processing device 1002 can process, the process ends. If not, the process proceeds to S6003.

S6003：ポスト画像処理装置1002が処理可能な精度を超える場合には終了する。超えない場合にはS6004へ進む。 S6003: If the accuracy exceeds the processing capability of the post-image processing device 1002, the process ends. If not, the process proceeds to S6004.

S6004：SEIからネットワークモデルを特定し、ポスト画像処理装置1002のトポロジーを設定する。 S6004: Identify the network model from the SEI and set the topology of the post image processing device 1002.

S6005：SEIの更新情報からネットワークモデルのパラメータを導出する。 S6005: Derive network model parameters from SEI update information.

S6006：導出されたネットワークモデルのパラメータをポスト画像処理装置1002に読み込む。 S6006: The derived network model parameters are loaded into the post-image processing device 1002.

S6007：ポスト画像処理装置1002のフィルタ処理を実行し、外部に出力する。 S6007: Executes filter processing in the post-image processing device 1002 and outputs to the outside.

ただし、復号処理における輝度サンプルや色差サンプルの構築にSEIは必ずしも必要とされない。 However, the SEI is not necessarily required to construct luma and chroma samples in the decoding process.
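Steps S6001 to S6007 can be sketched as the following control flow. This is a sketch under stated assumptions, not the device's actual implementation: the class and field names (PostFilterSei, PostProcessor, complexity, accuracy, model_id, update) are hypothetical, the additive parameter update in S6005 is an assumption, and the "filter" in S6007 is a trivial placeholder that scales each sample.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PostFilterSei:
    # Hypothetical container for parameters decoded from the SEI message.
    complexity: int   # processing amount read in S6001
    accuracy: int     # accuracy read in S6001
    model_id: str     # identifies the network model (S6004)
    update: List[float] = field(default_factory=list)  # update information (S6005)


@dataclass
class PostProcessor:
    # Hypothetical model of the post-image processing device 1002.
    max_complexity: int
    max_accuracy: int
    topology: Optional[str] = None
    params: Optional[List[float]] = None


def run_post_filter(sei, device, picture, base_params):
    """Return the filtered picture, or None if the filter is not applied."""
    # S6001: read the processing amount and accuracy from the SEI.
    complexity, accuracy = sei.complexity, sei.accuracy
    # S6002: stop if the complexity exceeds what the device can process.
    if complexity > device.max_complexity:
        return None
    # S6003: stop if the accuracy exceeds what the device can process.
    if accuracy > device.max_accuracy:
        return None
    # S6004: identify the network model and set the topology.
    device.topology = sei.model_id
    # S6005: derive model parameters from the update information
    # (assumed here to be an additive update of base parameters).
    params = [b + u for b, u in zip(base_params, sei.update)]
    # S6006: load the derived parameters into the device.
    device.params = params
    # S6007: run the filter (placeholder: scale by the first parameter).
    return [sample * device.params[0] for sample in picture]
```

Returning None models the early exits in S6002 and S6003. Because the SEI is not needed to construct the decoded samples, the unfiltered decoded picture remains usable in that case.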

(ポスト画像処理装置1002の詳細) (Details of the post-image processing device 1002)
NNフィルタ部は入力画像inputTensorと入力パラメータ(例えば、QP、bSなど)を用いて、ニューラルネットワークモデルによるフィルタ処理を行う。入力画像は、コンポーネントごとの画像であってもよいし、複数コンポーネントをそれぞれチャネルとして持つ画像であってもよい。また、入力パラメータは画像と異なるチャネルに割り当ててもよい。 The NN filter unit performs filtering using a neural network model, using the input image inputTensor and input parameters (e.g., QP, bS, etc.). The input image may be an image for each component, or an image with multiple components as channels. In addition, the input parameters may be assigned to a channel different from the image.

NNフィルタ部は、以下の処理を繰り返し適用してもよい。 The NN filter section may repeatedly apply the following process:

NNフィルタ部は、inputTensorにカーネルk[m][i][j]を畳み込み演算(conv,convolution)し、biasを加算した出力画像outputTensorを導出する。ここで、nn=0..n-1、xx=0..width-1、yy=0..height-1であり、Σは各々mm、i、jに対する総和を表す。 The NN filter section performs a convolution operation (conv, convolution) on the inputTensor with the kernel k[m][i][j], and derives the output image outputTensor by adding bias. Here, nn=0..n-1, xx=0..width-1, yy=0..height-1, and Σ represents the sum over mm, i, and j, respectively.

outputTensor[nn][xx][yy]=ΣΣΣ(k[mm][i][j]*inputTensor[mm][xx+i-of][yy+j-of]+bias[nn])

1x1 Convの場合、Σは、各々mm=0..m-1、i=0、j=0の総和を表す。このとき、of=0を設定する。3x3 Convの場合、Σは各々mm=0..m-1、i=0..2、j=0..2の総和を表す。このとき、of=1を設定する。nはoutSamplesのチャネル数、mはinputTensorのチャネル数、widthはinputTensorとoutputTensorの幅、heightはinputTensorとoutputTensorの高さである。ofは、inputTensorとoutputTensorのサイズを同一にするために、inputTensorの周囲に設けるパディング領域のサイズである。以下、NNフィルタ部の出力が画像ではなく値（補正値）の場合には、outputTensorの代わりにcorrNNで出力を表わす。 For 1x1 Conv, Σ represents the sum of mm=0..m-1, i=0, j=0. In this case, of=0 is set. For 3x3 Conv, Σ represents the sum of mm=0..m-1, i=0..2, j=0..2. In this case, of=1 is set. n is the number of channels of outSamples, m is the number of channels of inputTensor, width is the width of inputTensor and outputTensor, and height is the height of inputTensor and outputTensor. of is the size of the padding area around inputTensor to make the sizes of inputTensor and outputTensor the same. In the following, when the output of the NN filter part is a value (correction value) rather than an image, the output is represented as corrNN instead of outputTensor.
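The convolution above can be illustrated with a small pure-Python sketch in the CWH layout (inputTensor[mm][xx][yy]). Several points are assumptions of this sketch rather than statements of the description: the kernel carries an extra output-channel index (kernel[nn][mm][i][j]), the bias is added once per output sample rather than inside the summation, and the padding area of size of is filled with zeros.

```python
def conv2d(input_tensor, kernel, bias, of):
    """Convolution in CWH layout.

    input_tensor[mm][xx][yy], kernel[nn][mm][i][j], bias[nn].
    `of` is the padding size: 0 for 1x1 Conv, 1 for 3x3 Conv.
    """
    m = len(input_tensor)            # number of input channels
    width = len(input_tensor[0])
    height = len(input_tensor[0][0])
    n = len(kernel)                  # number of output channels
    ksize = len(kernel[0][0])        # 1 for 1x1 Conv, 3 for 3x3 Conv

    def sample(mm, x, y):
        # Zero padding outside the picture (an assumption of this sketch).
        if 0 <= x < width and 0 <= y < height:
            return input_tensor[mm][x][y]
        return 0

    output = [[[0] * height for _ in range(width)] for _ in range(n)]
    for nn in range(n):
        for xx in range(width):
            for yy in range(height):
                acc = bias[nn]  # bias added once per output sample
                for mm in range(m):
                    for i in range(ksize):
                        for j in range(ksize):
                            acc += kernel[nn][mm][i][j] * sample(
                                mm, xx + i - of, yy + j - of
                            )
                output[nn][xx][yy] = acc
    return output
```

With of=1 and a 3x3 kernel, the output has the same width and height as the input, which is the purpose of the padding described above.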

なお、CWH形式のinputTensor、outputTensorではなくCHW形式のinputTensor、outputTensorで記述すると以下の処理と等価である。 Note that if you write inputTensor and outputTensor in CHW format instead of inputTensor and outputTensor in CWH format, it is equivalent to the following process.

outputTensor[nn][yy][xx]=ΣΣΣ(k[mm][i][j]*inputTensor[mm][yy+j-of][xx+i-of]+bias[nn])

また、Depth wise Convと呼ばれる以下の式で示す処理を行ってもよい。ここで、nn=0..n-1、xx=0..width-1、yy=0..height-1であり、Σは各々i、jに対する総和を表す。nはoutputTensorとinputTensorのチャネル数、widthはinputTensorとoutputTensorの幅、heightはinputTensorとoutputTensorの高さである。 In addition, a process called Depth-wise Conv, shown in the following formula, may be performed. Here, nn=0..n-1, xx=0..width-1, yy=0..height-1, and Σ represents the summation for i and j, respectively. n is the number of channels of outputTensor and inputTensor, width is the width of inputTensor and outputTensor, and height is the height of inputTensor and outputTensor.

outputTensor[nn][xx][yy]=ΣΣ(k[nn][i][j]*inputTensor[nn][xx+i-of][yy+j-of]+bias[nn])

またActivateと呼ばれる非線形処理、たとえばReLUを用いてもよい。 Also, a nonlinear process called Activate, for example, ReLU, may be used.

ReLU(x) = x >= 0 ? x : 0

また以下の式に示すleakyReLUを用いてもよい。 Alternatively, leakyReLU shown in the following formula may be used.

leakyReLU(x) = x >= 0 ? x : a * x

ここでaは所定の値、例えば0.1や0.125である。また整数演算を行うために上記の全てのk、bias、aの値を整数として、convの後に右シフトを行ってもよい。 Here, a is a predetermined value, for example, 0.1 or 0.125. In order to perform integer arithmetic, all the values of k, bias, and a above may be integers, and a right shift may be performed after conv.

ReLUでは0未満の値に対しては常に0、それ以上の値に対しては入力値がそのまま出力される。一方、leakyReLUでは、0未満の値に対して、aで設定された勾配で線形処理が行われる。ReLUでは0未満の値に対する勾配が消失するため、学習が進みにくくなる場合がある。leakyReLUでは0未満の値に対する勾配が残され、上記問題が起こりにくくなる。また、上記leakyReLU(x)のうち、aの値をパラメータ化して用いるPReLUを用いてもよい。 With ReLU, 0 is always output for values less than 0, and the input value is output as is for values greater than or equal to 0. On the other hand, with leakyReLU, linear processing is performed for values less than 0 with the gradient set by a. With ReLU, the gradient for values less than 0 disappears, which can make it difficult for learning to progress. With leakyReLU, the gradient for values less than 0 remains, making the above problem less likely to occur. Also, of the above leakyReLU(x), PReLU, which uses a parameterized value of a, can be used.
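The depth-wise convolution and the activation functions described above can be sketched as follows. The function names are illustrative, and some details are assumptions of this sketch: zero padding, a bias added once per output sample, and a fixed-point leakyReLU in which a is represented as a_int / 2^shift (1/8 = 0.125 by default) and applied with a right shift.

```python
def relu(x):
    # ReLU(x) = x >= 0 ? x : 0
    return x if x >= 0 else 0


def leaky_relu(x, a=0.125):
    # leakyReLU(x) = x >= 0 ? x : a * x
    return x if x >= 0 else a * x


def leaky_relu_int(x, a_int=1, shift=3):
    # Integer variant (assumption): a = a_int / 2**shift, applied by right shift.
    # Python's >> on negative integers rounds toward negative infinity.
    return x if x >= 0 else (a_int * x) >> shift


def depthwise_conv2d(input_tensor, kernel, bias, of):
    """Depth-wise Conv in CWH layout: each channel nn uses only kernel[nn]."""
    n = len(input_tensor)
    width = len(input_tensor[0])
    height = len(input_tensor[0][0])
    ksize = len(kernel[0])

    def sample(nn, x, y):
        # Zero padding outside the picture (an assumption of this sketch).
        if 0 <= x < width and 0 <= y < height:
            return input_tensor[nn][x][y]
        return 0

    output = [[[0] * height for _ in range(width)] for _ in range(n)]
    for nn in range(n):
        for xx in range(width):
            for yy in range(height):
                acc = bias[nn]
                for i in range(ksize):
                    for j in range(ksize):
                        acc += kernel[nn][i][j] * sample(nn, xx + i - of, yy + j - of)
                output[nn][xx][yy] = acc
    return output
```

Unlike the ordinary convolution, the depth-wise form has no summation over input channels, so the number of output channels equals the number of input channels.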

(NNC)
Neural Network Coding(NNC)は、ニューラルネットワーク(NN)を効率的に圧縮するための国際標準規格ISO/IEC15938-17である。学習済みのNNの圧縮を行うことで、NNを保存や伝送を行う際の効率化が可能となる。 Neural Network Coding (NNC) is the international standard ISO/IEC 15938-17 for efficiently compressing neural networks (NNs). Compressing trained NNs makes it possible to store and transmit NNs more efficiently.

以下にNNCの符号化・復号処理の概要について説明する。 The following provides an overview of NNC's encoding and decoding processes.

図12は、NNCの符号化装置・復号装置について示す図である。 Figure 12 shows the NNC encoding and decoding devices.

NN符号化装置801は、前処理部8011、量子化部8012、エントロピー符号化部8013を有する。NN符号化装置801は、圧縮前のNNモデルOを入力し、量子化部8012にてNNモデルOの量子化を行い、量子化モデルQを求める。NN符号化装置801は、量子化前に、前処理部8011にて枝刈り（プルーニング）やスパース化などのパラメータ削減手法を繰り返し適用してもよい。その後、エントロピー符号化部8013にて、量子化モデルQにエントロピー符号化を適用し、NNモデルの保存、伝送のためのビットストリームSを求める。 The NN coding device 801 has a pre-processing unit 8011, a quantization unit 8012, and an entropy coding unit 8013. The NN coding device 801 inputs an uncompressed NN model O, and the quantization unit 8012 quantizes the NN model O to obtain a quantized model Q. The NN coding device 801 may repeatedly apply parameter reduction techniques such as pruning and sparsification in the pre-processing unit 8011 before quantization. Then, the entropy coding unit 8013 applies entropy coding to the quantized model Q to obtain a bit stream S for storing and transmitting the NN model.

NN復号装置802は、エントロピー復号部8021、パラメータ復元部8022、後処理部8023を有する。NN復号装置802は、始めに伝送されたビットストリームSを入力し、エントロピー復号部8021にて、Sのエントロピー復号を行い、中間モデルRQを求める。NNモデルの動作環境がRQで使用された量子化表現を用いた推論をサポートしている場合、RQを出力し、推論に使用してもよい。そうでない場合、パラメータ復元部8022にてRQのパラメータを元の表現に復元し、中間モデルRPを求める。使用する疎なテンソル表現がNNモデルの動作環境で処理できる場合、RPを出力し、推論に使用してもよい。そうでない場合、NNモデルOと異なるテンソル、または構造表現を含まない再構成NNモデルRを求め、出力する。 The NN decoding device 802 has an entropy decoding unit 8021, a parameter restoration unit 8022, and a post-processing unit 8023. The NN decoding device 802 first inputs the transmitted bit stream S, and the entropy decoding unit 8021 performs entropy decoding of S to obtain an intermediate model RQ. If the operating environment of the NN model supports inference using the quantized representation used in RQ, the RQ may be output and used for inference. If not, the parameter restoration unit 8022 restores the parameters of RQ to their original representation to obtain an intermediate model RP. If the sparse tensor representation used can be processed in the operating environment of the NN model, the RP may be output and used for inference. If not, a reconstructed NN model R that does not include a tensor or structural representation different from the NN model O is obtained and output.
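The choice among the outputs RQ, RP, and R described above can be sketched as a simple control flow. The three stage functions below are toy placeholders (hypothetical, not the NNC algorithms), and the two capability flags are assumptions standing in for what the operating environment of the NN model supports.

```python
def toy_entropy_decode(bitstream: bytes):
    # Placeholder for the entropy decoding unit 8021: bytes -> integer levels.
    return list(bitstream)


def toy_restore_parameters(rq):
    # Placeholder for the parameter restoration unit 8022: levels -> floats.
    return [0.5 * q for q in rq]


def toy_post_process(rp):
    # Placeholder for the post-processing unit 8023: produce a plain dense list.
    return list(rp)


def nn_decode(bitstream, supports_quantized, supports_sparse):
    """Return (label, model), where label is "RQ", "RP", or "R"."""
    rq = toy_entropy_decode(bitstream)
    if supports_quantized:
        # The environment can run inference on the quantized representation.
        return "RQ", rq
    rp = toy_restore_parameters(rq)
    if supports_sparse:
        # The environment can process the sparse tensor representation.
        return "RP", rp
    # Otherwise output the fully reconstructed model R.
    return "R", toy_post_process(rp)
```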

NNC規格には、整数、浮動小数点など、特定のNNパラメータの数値表現に対する復号手法が存在する。 The NNC standard includes decoding methods for the numerical representation of specific NN parameters, including integers and floating point.

復号手法NNR_PT_INTは、整数値のパラメータからなるモデルを復号する。復号手法NNR_PT_FLOATは、NNR_PT_INTを拡張し、量子化ステップサイズdeltaを追加する。このdeltaに上記整数値を乗算し、スケーリングされた整数を生成する。deltaは、整数の量子化パラメータqpとdeltaの粒度パラメータqp_densityから、以下のように導き出される。 The NNR_PT_INT decoding method decodes a model with integer-valued parameters. The NNR_PT_FLOAT decoding method extends NNR_PT_INT by adding a quantization step size delta. This delta is multiplied by the integer value above to produce a scaled integer. Delta is derived from the integer quantization parameter qp and the granularity parameter qp_density of delta as follows:

mul = 2^(qp_density) + (qp& (2^(qp_density)-1))
delta = mul * 2^((qp >> qp_density)-qp_density)
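
The derivation of delta can be checked with a few lines of Python. This is a direct transcription of the two formulas above; the function names are illustrative, and the handling of negative qp relies on Python's bitwise semantics (arithmetic shift and two's-complement masking), which is an assumption of this sketch.

```python
def nnr_delta(qp: int, qp_density: int) -> float:
    """Quantization step size delta derived from qp and qp_density."""
    mul = (1 << qp_density) + (qp & ((1 << qp_density) - 1))
    return mul * 2.0 ** ((qp >> qp_density) - qp_density)


def dequantize(level: int, qp: int, qp_density: int) -> float:
    # NNR_PT_FLOAT: the decoded integer value is multiplied by delta.
    return level * nnr_delta(qp, qp_density)
```

For example, with qp_density = 2, qp = 0 gives delta = 1.0 and qp = 4 gives delta = 2.0, so every increase of qp by 2^qp_density doubles the step size, with the intermediate qp values interpolating linearly through mul.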

(学習済みNNのフォーマット) (Format of trained NN)
学習済みNNの表現は、層のサイズや層間の接続などのトポロジー表現と、重みやバイアスなどのパラメータ表現の2つの要素からなる。 The representation of a trained NN consists of two elements: a topological representation, such as the size of layers and the connections between layers, and a parameter representation, such as weights and biases.

トポロジー表現は、TensorflowやPyTorchなどのネイティブフォーマットでカバーされているが、相互運用性向上のため、Open Neural Network Exchange Format(ONNX)、Neural Network Exchange Format(NNEF)などの交換フォーマットが存在する。 Topology representation is covered by native formats such as Tensorflow and PyTorch, but to improve interoperability, exchange formats such as Open Neural Network Exchange Format (ONNX) and Neural Network Exchange Format (NNEF) exist.

また、NNC規格では、圧縮されたパラメータテンソルを含むNNCビットストリームの一部として、トポロジー情報nnr_topology_unit_payloadを伝送する。これにより、交換フォーマットだけでなく、ネイティブフォーマットで表現されたトポロジー情報との相互運用を実現する。 The NNC standard also transmits topology information nnr_topology_unit_payload as part of the NNC bitstream that contains the compressed parameter tensors. This allows interoperability with topology information expressed in native formats as well as exchange formats.

（画像符号化装置の構成） (Configuration of the Image Encoding Device)
次に、本実施形態に係る画像符号化装置11の構成について説明する。図5は、本実施形態に係る画像符号化装置11の構成を示すブロック図である。画像符号化装置11は、予測画像生成部101、減算部102、変換・量子化部103、逆量子化・逆変換部105、加算部106、ループフィルタ107、予測パラメータメモリ（予測パラメータ記憶部、フレームメモリ）108、参照ピクチャメモリ（参照画像記憶部、フレームメモリ）109、符号化パラメータ決定部110、パラメータ符号化部111、予測パラメータ導出部120、エントロピー符号化部104を含んで構成される。 Next, the configuration of the image encoding device 11 according to this embodiment will be described. Fig. 5 is a block diagram showing the configuration of the image encoding device 11 according to this embodiment. The image encoding device 11 includes a predicted image generating unit 101, a subtraction unit 102, a transformation/quantization unit 103, an inverse quantization/inverse transformation unit 105, an addition unit 106, a loop filter 107, a prediction parameter memory (prediction parameter storage unit, frame memory) 108, a reference picture memory (reference image storage unit, frame memory) 109, an encoding parameter determination unit 110, a parameter encoding unit 111, a prediction parameter derivation unit 120, and an entropy encoding unit 104.

予測画像生成部101はCU毎に予測画像を生成する。 The predicted image generation unit 101 generates a predicted image for each CU.

減算部102は、予測画像生成部101から入力されたブロックの予測画像の画素値を、画像Ｔの画素値から減算して予測誤差を生成する。減算部102は予測誤差を変換・量子化部103に出力する。 The subtraction unit 102 subtracts the pixel values of the predicted image of the block input from the predicted image generation unit 101 from the pixel values of image T to generate a prediction error. The subtraction unit 102 outputs the prediction error to the transformation and quantization unit 103.

変換・量子化部103は、減算部102から入力された予測誤差に対し、周波数変換によって変換係数を算出し、量子化によって量子化変換係数を導出する。変換・量子化部103は、量子化変換係数をパラメータ符号化部111及び逆量子化・逆変換部105に出力する。 The transform/quantization unit 103 calculates transform coefficients by applying a frequency transform to the prediction error input from the subtraction unit 102, and derives quantized transform coefficients by quantizing those transform coefficients. The transform/quantization unit 103 outputs the quantized transform coefficients to the parameter coding unit 111 and the inverse quantization/inverse transform unit 105.

逆量子化・逆変換部105は、画像復号装置31における逆量子化・逆変換部311（図5）と同じであり、説明を省略する。算出した予測誤差は加算部106に出力される。 The inverse quantization and inverse transform unit 105 is the same as the inverse quantization and inverse transform unit 311 (Figure 5) in the image decoding device 31, and a description thereof will be omitted. The calculated prediction error is output to the addition unit 106.

パラメータ符号化部111は、ヘッダ符号化部1110、CT情報符号化部1111、CU符号化部1112（予測モード符号化部）を備えている。CU符号化部1112はさらにTU符号化部1114を備えている。以下、各モジュールの概略動作を説明する。 The parameter coding unit 111 includes a header coding unit 1110, a CT information coding unit 1111, and a CU coding unit 1112 (prediction mode coding unit). The CU coding unit 1112 further includes a TU coding unit 1114. The general operation of each module is explained below.

ヘッダ符号化部1110はヘッダ情報、分割情報、予測情報、量子化変換係数等のパラメータの符号化処理を行う。 The header encoding unit 1110 performs encoding processing of parameters such as header information, splitting information, prediction information, and quantization transformation coefficients.

CT情報符号化部1111は、QT、MT（BT、TT）分割情報等を符号化する。 The CT information encoding unit 1111 encodes QT, MT (BT, TT) division information, etc.

CU符号化部1112はCU情報、予測情報、分割情報等を符号化する。 The CU encoding unit 1112 encodes CU information, prediction information, split information, etc.

TU符号化部1114は、TUに予測誤差が含まれている場合に、QP更新情報と量子化予測誤差を符号化する。 When a TU contains a prediction error, the TU encoding unit 1114 encodes the QP update information and the quantized prediction error.

CT情報符号化部1111、CU符号化部1112は、インター予測パラメータ、量子化変換係数等のシンタックス要素をパラメータ符号化部111に供給する。 The CT information encoding unit 1111 and the CU encoding unit 1112 supply syntax elements such as inter prediction parameters and quantized transform coefficients to the parameter encoding unit 111.

エントロピー符号化部104には、パラメータ符号化部111から量子化変換係数と符号化パラメータが入力される。エントロピー符号化部104はこれらをエントロピー符号化して符号化データTeを生成し、出力する。 The entropy coding unit 104 receives the quantized transform coefficients and coding parameters from the parameter coding unit 111. The entropy coding unit 104 entropy codes these to generate and output coded data Te.

予測パラメータ導出部120は、符号化パラメータ決定部110から入力されたパラメータからインター予測パラメータ及びイントラ予測パラメータを導出する。導出されたインター予測パラメータ及びイントラ予測パラメータは、パラメータ符号化部111に出力される。 The prediction parameter derivation unit 120 derives inter-prediction parameters and intra-prediction parameters from the parameters input from the encoding parameter determination unit 110. The derived inter-prediction parameters and intra-prediction parameters are output to the parameter encoding unit 111.

加算部106は、予測画像生成部101から入力された予測ブロックの画素値と逆量子化・逆変換部105から入力された予測誤差を画素毎に加算して復号画像を生成する。加算部106は生成した復号画像を参照ピクチャメモリ109に記憶する。 The adder 106 generates a decoded image by adding, for each pixel, the pixel value of the predicted block input from the predicted image generation unit 101 and the prediction error input from the inverse quantization and inverse transform unit 105. The adder 106 stores the generated decoded image in the reference picture memory 109.

ループフィルタ107は加算部106が生成した復号画像に対し、デブロッキングフィルタ、SAO、ALFを施す。なお、ループフィルタ107は、必ずしも上記３種類のフィルタを含まなくてもよく、例えばデブロッキングフィルタのみの構成であってもよい。 The loop filter 107 applies a deblocking filter, SAO, and ALF to the decoded image generated by the adder 106. Note that the loop filter 107 does not necessarily have to include the above three types of filters, and may be configured, for example, as only a deblocking filter.

予測パラメータメモリ108は、符号化パラメータ決定部110が生成した予測パラメータを、対象ピクチャ及びCU毎に予め定めた位置に記憶する。 The prediction parameter memory 108 stores the prediction parameters generated by the encoding parameter determination unit 110 in a predetermined location for each target picture and CU.

参照ピクチャメモリ109は、ループフィルタ107が生成した復号画像を対象ピクチャ及びCU毎に予め定めた位置に記憶する。 The reference picture memory 109 stores the decoded image generated by the loop filter 107 in a predetermined location for each target picture and CU.

符号化パラメータ決定部110は、符号化パラメータの複数のセットのうち、１つのセットを選択する。符号化パラメータとは、上述したQT、BTあるいはTT分割情報、予測パラメータ、あるいはこれらに関連して生成される符号化の対象となるパラメータである。予測画像生成部101は、これらの符号化パラメータを用いて予測画像を生成する。 The coding parameter determination unit 110 selects one set from among multiple sets of coding parameters. The coding parameters are the above-mentioned QT, BT or TT division information, prediction parameters, or parameters to be coded that are generated in relation to these. The predicted image generation unit 101 generates a predicted image using these coding parameters.
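The data flow through the subtraction unit 102, the transform/quantization unit 103, the inverse quantization/inverse transform unit 105, and the addition unit 106 can be sketched as follows. This is a deliberately simplified illustration: the frequency transform is omitted and a plain scalar quantizer with step size step stands in for unit 103, both of which are assumptions of this sketch and not the actual processing of the image encoding device 11.

```python
def encode_block(source, prediction, step):
    """Toy local decoding loop for one block of samples.

    Returns (levels, reconstructed): the quantized values passed on for
    entropy coding and the decoded samples stored for later reference.
    """
    # Subtraction unit 102: prediction error = source - prediction.
    residual = [s - p for s, p in zip(source, prediction)]
    # Transform/quantization unit 103 (transform omitted in this sketch).
    levels = [round(r / step) for r in residual]
    # Inverse quantization/inverse transform unit 105.
    recon_residual = [level * step for level in levels]
    # Addition unit 106: decoded sample = prediction + reconstructed error.
    reconstructed = [p + r for p, r in zip(prediction, recon_residual)]
    return levels, reconstructed
```

The reconstructed samples differ from the source by the quantization error, and they match what a decoder would obtain from the same levels and prediction. This is why the encoder keeps its own decoded picture in the reference picture memory 109 rather than the original input.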

本実施の形態を図1に基づいて説明すると、動画像符号化装置10は、入力画像を符号化する画像符号化装置11と、入力画像に対して、領域毎に複数の値を持つアテンション情報を補助情報として符号化する補助情報符号化装置81を有することを特徴とする。また、補助情報符号化装置81は、領域毎に複数の値を持つアテンション情報に関して、適用するニューラルネットワークを特定する情報を補助情報として符号化することを特徴とする。 This embodiment will be described with reference to FIG. 1. A video coding device 10 is characterized by having an image coding device 11 that codes an input image, and an auxiliary information coding device 81 that codes, as auxiliary information, attention information having multiple values for each region of the input image. The auxiliary information coding device 81 is also characterized by encoding, as auxiliary information, information that specifies a neural network to be applied with respect to the attention information having multiple values for each region.

動画像復号装置30は、符号化データから画像を復号する画像復号装置31と、画像復号装置31で復号した画像に対して、領域毎に複数の値を持つアテンション情報を補助情報として復号する補助情報復号装置91を有することを特徴とする。また、補助情報復号装置91は、領域毎に複数の値を持つアテンション情報に関して、適用するニューラルネットワークを特定する情報を復号することを特徴とする。 The video decoding device 30 is characterized by having an image decoding device 31 that decodes an image from encoded data, and an auxiliary information decoding device 91 that decodes attention information having multiple values for each region as auxiliary information for the image decoded by the image decoding device 31. The auxiliary information decoding device 91 is also characterized by decoding information specifying a neural network to be applied with respect to the attention information having multiple values for each region.

本発明の実施形態は上述した実施形態に限定されるものではなく、請求項に示した範囲で種々の変更が可能である。すなわち、請求項に示した範囲で適宜変更した技術的手段を組み合わせて得られる実施形態についても本発明の技術的範囲に含まれる。 The embodiments of the present invention are not limited to the above-mentioned embodiments, and various modifications are possible within the scope of the claims. In other words, embodiments obtained by combining technical means that are appropriately modified within the scope of the claims are also included in the technical scope of the present invention.

本発明の実施形態は、画像データが符号化された符号化データを復号する動画像復号装置、および、画像データが符号化された符号化データを生成する動画像符号化装置に好適に適用することができる。また、動画像符号化装置によって生成され、動画像復号装置によって参照される符号化データのデータ構造に好適に適用することができる。 Embodiments of the present invention can be suitably applied to a video decoding device that decodes coded data in which image data is coded, and a video coding device that generates coded data in which image data is coded. They can also be suitably applied to the data structure of coded data that is generated by a video coding device and referenced by the video decoding device.

1 動画像伝送システム
30 動画像復号装置
31 画像復号装置
301 エントロピー復号部
302 パラメータ復号部
305、107 ループフィルタ
306、109 参照ピクチャメモリ
307、108 予測パラメータメモリ
308、101 予測画像生成部
311、105 逆量子化・逆変換部
312、106 加算部
320 予測パラメータ導出部
10 動画像符号化装置
11 画像符号化装置
102 減算部
103 変換・量子化部
104 エントロピー符号化部
110 符号化パラメータ決定部
111 パラメータ符号化部
120 予測パラメータ導出部
71 補助情報作成装置
81 補助情報符号化装置
91 補助情報復号装置
1001 プレ画像処理装置
1002 ポスト画像処理装置
1 Video transmission system
30 Video decoding device
31 Image decoding device
301 Entropy decoding unit
302 Parameter decoding unit
305, 107 Loop filter
306, 109 Reference picture memory
307, 108 Prediction parameter memory
308, 101 Predicted image generation unit
311, 105 Inverse quantization/inverse transform unit
312, 106 Addition unit
320 Prediction parameter derivation unit
10 Video encoding device
11 Image encoding device
102 Subtraction unit
103 Transform/quantization unit
104 Entropy encoding unit
110 Encoding parameter determination unit
111 Parameter encoding unit
120 Prediction parameter derivation unit
71 Auxiliary information creation device
81 Auxiliary information encoding device
91 Auxiliary information decoding device
1001 Pre-image processing device
1002 Post-image processing device

Claims

1. A video decoding device comprising:
an image decoding device that decodes an image from encoded data; and
an auxiliary information decoding device that decodes, as auxiliary information, attention information having a plurality of values for each region, for an image decoded by the image decoding device.

2. The video decoding device according to claim 1, wherein the auxiliary information decoding device decodes information for specifying a neural network to be applied with respect to the attention information having a plurality of values for each region.

3. A video encoding device comprising:
an image encoding device that encodes an input image; and
an auxiliary information encoding device that encodes, as auxiliary information, attention information having a plurality of values for each region of the input image.

4. The video encoding device according to claim 3, wherein the auxiliary information encoding device encodes, as auxiliary information, information for specifying a neural network to be applied with respect to the attention information having a plurality of values for each region.