WO2023199172A1 - Apparatus and method for optimizing the overfitting of neural network filters - Google Patents
Apparatus and method for optimizing the overfitting of neural network filters
- Publication number
- WO2023199172A1 (PCT/IB2023/053425)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- neural network
- version
- overfitted
- network version
- candidate
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/096—Transfer learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/117—Filters, e.g. for pre-processing or post-processing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/80—Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
- H04N19/82—Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation involving filtering within a prediction loop
Definitions
- the examples and non-limiting embodiments relate generally to multimedia transport and neural networks, and more particularly, to a method, an apparatus, and a computer program product for optimizing the overfitting of neural network filters.
- An example apparatus includes at least one processor; and at least one non-transitory memory comprising computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: run one or more candidate neural network versions by using at least data from an evaluation set; evaluate performance of the one or more candidate neural network versions based on the evaluation set; select a candidate neural network version based on one or more predetermined performance criteria; overfit the selected neural network version based at least on an overfitting set; and run the overfitted neural network version on an inference set.
- the example apparatus may further include, wherein: the evaluation set comprises data for evaluating the one or more candidate neural network versions; the overfitting set comprises data for overfitting the selected neural network version; and the inference set comprises data for running the overfitted neural network version.
- the example apparatus may further include, wherein the evaluation set, overfitting set, and the inference set partially or fully overlap.
- the example apparatus may further include, wherein the inference set comprises a video, the evaluation set comprises a first random access (RA) segment of the video, and the overfitting set comprises the video or the first RA segment of the video.
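The relationship between the three sets in this example can be sketched as follows. This is a minimal illustration, assuming that a video is a plain list of frames and that RA segments have a fixed length; the names `split_into_ra_segments`, `build_sets`, and `segment_length` are hypothetical and do not appear in the source.

```python
def split_into_ra_segments(frames, segment_length):
    """Split a list of frames into consecutive segments.

    Assumption: every RA segment has the same fixed length. In a real codec,
    segment boundaries are set by the random access points of the bitstream.
    """
    return [frames[i:i + segment_length]
            for i in range(0, len(frames), segment_length)]


def build_sets(frames, segment_length, overfit_on_full_video=True):
    """Return (evaluation_set, overfitting_set, inference_set)."""
    segments = split_into_ra_segments(frames, segment_length)
    inference_set = frames                    # the whole video
    evaluation_set = segments[0]              # the first RA segment
    # The overfitting set is either the whole video or the first RA segment.
    overfitting_set = frames if overfit_on_full_video else segments[0]
    return evaluation_set, overfitting_set, inference_set


# Toy "video" of 10 frames, represented here by their indices.
video = list(range(10))
evaluation_set, overfitting_set, inference_set = build_sets(video, segment_length=4)
```

In this sketch the evaluation set is fully contained in the other two sets, which is one instance of the partial or full overlap mentioned above.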
- the example apparatus may further include, wherein the performance criteria comprise a distortion-based performance criterion.
- the example apparatus may further include, wherein the selected neural network version performs best according to the one or more performance criteria.
- the example apparatus may further include, wherein: the one or more candidate neural network versions comprise two candidate neural network versions; each candidate neural network version comprises a post-filter; the evaluation set comprises a first RA segment of a video; the overfitting set comprises the video; the inference set comprises a decoded video; output of each candidate neural network version comprises a post-processed first RA segment; and wherein the apparatus is further caused to: compute a first performance metric based on input to each candidate neural network version and a second performance metric based on output of each candidate neural network version; compute a third performance metric comprising performance of each candidate neural network version based on the first performance metric and the second performance metric; and select the candidate neural network version with a value of the third performance metric greater than or equal to a predetermined value as the selected neural network version.
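The three performance metrics above can be sketched as follows. This is a minimal illustration, assuming that the first and second metrics are PSNR values measured against the uncompressed data and that the third metric is their difference (a PSNR gain); the source only states that the criteria may be distortion-based. The candidate filters `v1` and `v2`, the function names, and the threshold `min_gain_db` are hypothetical. When several candidates meet the threshold, this sketch picks the one with the largest gain.

```python
import math


def mse(a, b):
    """Mean squared error between two equally long sample sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(reference, test, peak=255.0):
    """PSNR in dB (assumed metric; the source does not name one)."""
    err = mse(reference, test)
    return float("inf") if err == 0 else 10.0 * math.log10(peak ** 2 / err)


def select_candidate(candidates, decoded, original, min_gain_db=0.0):
    """Return (name, gain_db) of the best qualifying candidate, or None."""
    psnr_in = psnr(original, decoded)              # first metric: filter input
    best = None
    for name, post_filter in candidates.items():
        psnr_out = psnr(original, post_filter(decoded))   # second metric: filter output
        gain = psnr_out - psnr_in                          # third metric: gain
        if gain >= min_gain_db and (best is None or gain > best[1]):
            best = (name, gain)
    return best


# Toy first RA segment: the decoded samples carry a constant offset of +2.
original = [100.0, 120.0, 140.0, 160.0]
decoded = [102.0, 122.0, 142.0, 162.0]

# Two hypothetical candidate post-filters standing in for neural network versions.
candidates = {
    "v1": lambda xs: list(xs),                # leaves the input unchanged
    "v2": lambda xs: [x - 1.5 for x in xs],   # removes most of the offset
}

selected = select_candidate(candidates, decoded, original)
```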
- the example apparatus may further include, wherein to overfit the selected neural network version, the apparatus is further caused to perform one or more iterations of the following: input the decoded video to the selected neural network version; obtain a post-processed output video from the selected neural network version; compute a training loss between the decoded video and the post-processed output video; compute gradients for one or more parameters of the selected neural network version; and use the gradients for updating the one or more parameters of the selected neural network version.
- the example apparatus may further include, wherein the apparatus is caused to perform the one or more iterations until a stopping criterion is met.
- the example apparatus may further include, wherein the apparatus is further caused to: compute a weight-update based at least on weights of the overfitted neural network version and weights of the same neural network version before overfitting; compress the weight-update; and signal or provide a bitstream representing the compressed weight-update to a decoder side in or along the bitstream representing encoded data.
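The weight-update step can be sketched as follows. This is a minimal illustration, assuming that the weight-update is the element-wise difference between the weights after and before overfitting, and that compression is a uniform scalar quantization with step size `step`; the source does not fix either choice, and an actual system would also entropy-code the quantized levels. All function names are hypothetical.

```python
def compute_weight_update(base_weights, overfitted_weights):
    """Weight-update as the difference: overfitted minus base (assumed form)."""
    return [o - b for b, o in zip(base_weights, overfitted_weights)]


def quantize(update, step):
    """Encoder side: map each update value to an integer level (assumed compression)."""
    return [round(u / step) for u in update]


def dequantize(levels, step):
    """Decoder side: reconstruct approximate update values from the levels."""
    return [q * step for q in levels]


def apply_update(base_weights, update):
    """Decoder side: add the reconstructed update to its copy of the base weights."""
    return [b + u for b, u in zip(base_weights, update)]


# Encoder side: weights before and after overfitting (toy values).
base = [0.50, -0.20, 0.10]
overfitted = [0.53, -0.20, 0.04]
step = 0.01

levels = quantize(compute_weight_update(base, overfitted), step)

# Decoder side: the decoder already holds `base`, so only `levels` is signaled.
reconstructed = apply_update(base, dequantize(levels, step))
```

Signaling only the quantized difference relies on the base weights being available at both the encoder side and the decoder side.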
- the example apparatus may further include, wherein the one or more neural network versions comprise one or more decoder-side neural network versions, wherein the one or more decoder-side neural network versions are available at a decoder side and an encoder side.
- Another example apparatus includes at least one processor; and at least one non-transitory memory comprising computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: overfit one or more candidate neural network versions on an evaluation set to obtain a first set of overfitted neural network versions; evaluate performance of the first set of overfitted neural network versions on the evaluation set; select a first overfitted neural network version based on one or more predetermined performance criteria, to obtain a selected first overfitted neural network version; when the evaluation set is different from an overfitting set: overfit a neural network version, used to obtain the selected first overfitted neural network version, on the overfitting set to obtain a second overfitted neural network version; and run the second overfitted neural network version on an inference set; and run the selected first overfitted neural network version on the inference set when the evaluation set is the same or substantially the same as the overfitting set.
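The control flow of this second example (overfit every candidate first, then select, then refit only if needed) can be sketched as follows. This is a minimal illustration under several assumptions: each candidate is represented by a hypothetical `fit` function that returns an already overfitted filter, the two candidates are closed-form least-squares stand-ins for neural network versions rather than iteratively trained networks, the performance criterion is MSE against the uncompressed data, and "the same" is tested by exact equality.

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fit_offset(decoded, original):
    """Stand-in candidate 1: overfit a single additive offset."""
    b = sum(o - d for d, o in zip(decoded, original)) / len(decoded)
    return lambda xs: [x + b for x in xs]


def fit_gain(decoded, original):
    """Stand-in candidate 2: overfit a single multiplicative gain."""
    w = sum(d * o for d, o in zip(decoded, original)) / sum(d * d for d in decoded)
    return lambda xs: [w * x for x in xs]


def select_then_refit(candidates, eval_set, overfit_set, inference_input):
    """Overfit all candidates on eval_set, select the best, refit if the sets differ."""
    eval_decoded, eval_original = eval_set
    first = {name: fit(eval_decoded, eval_original) for name, fit in candidates.items()}
    scores = {name: mse(f(eval_decoded), eval_original) for name, f in first.items()}
    best = min(scores, key=scores.get)            # lowest distortion wins
    if overfit_set == eval_set:
        final_filter = first[best]                # reuse the first overfitted version
    else:
        final_filter = candidates[best](*overfit_set)   # second overfitted version
    return best, final_filter(inference_input)


# Toy video: the decoded samples are the original samples scaled by 1.1.
original_video = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
decoded_video = [1.1 * o for o in original_video]

eval_set = (decoded_video[:3], original_video[:3])   # first segment only
overfit_set = (decoded_video, original_video)        # whole video

candidates = {"offset": fit_offset, "gain": fit_gain}
best_name, post_processed = select_then_refit(
    candidates, eval_set, overfit_set, decoded_video)
```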
- the example apparatus may further include, wherein: the evaluation set comprises data for overfitting one or more candidate neural network versions and for evaluating the first set of overfitted neural network versions; the overfitting set comprises data for overfitting the neural network version used to obtain the selected first overfitted neural network version; and the inference set comprises data for running the selected first overfitted neural network version or the second overfitted neural network version.
- the example apparatus may further include, wherein the evaluation set, overfitting set, and the inference set partially or fully overlap.
- the example apparatus may further include, wherein the performance criteria comprise a distortion-based performance criterion.
- the example apparatus may further include, wherein the selected first overfitted neural network version performs best according to the one or more performance criteria.
- the example apparatus may further include, wherein: the one or more candidate neural network versions comprise two candidate neural network versions; each candidate neural network version comprises a post-processing filter; the two candidate neural network versions are overfitted on a first RA segment of a video, to obtain two overfitted candidate neural network versions; and wherein the apparatus is further caused to: compute a fourth performance metric comprising performance of each overfitted candidate neural network version based on a fifth performance metric and a sixth performance metric, wherein the fifth performance metric is based on a post-processed first RA segment and the sixth performance metric is based on a decoded first RA segment; select an overfitted candidate neural network version with a value of the fourth performance metric greater than or equal to a predetermined value as an optimal neural network version, to obtain a selected overfitted candidate neural network version; overfit the candidate neural network version used to obtain the selected overfitted candidate neural network version on the video, to obtain an overfitted selected neural network version; and post-process a decoded video
- the example apparatus may further include, wherein to overfit each candidate neural network version, the apparatus is caused to perform one or more iterations of the following: provide a decoded first RA segment as an input to each candidate neural network version; obtain a post-processed first RA segment from each candidate neural network version; compute a training loss based at least on the post-processed first RA segment and respective uncompressed data; compute gradients for one or more parameters of each candidate neural network version; and use the gradients for updating the one or more parameters of each candidate decoder-side neural network (DSNN) version.
- the example apparatus may further include, wherein the apparatus is caused to perform the one or more iterations until a stopping criterion is met.
- the example apparatus may further include, wherein to overfit the selected neural network version, the apparatus is further caused to perform one or more iterations of the following: provide the decoded video as an input to the selected neural network version; obtain a post-processed output video from the selected neural network version; compute a training loss based at least on the post-processed output video and respective uncompressed data; compute gradients for one or more parameters of the selected neural network version; and use the gradients for updating the one or more parameters of the selected neural network version.
- the example apparatus may further include, wherein the apparatus is caused to perform the one or more iterations until a stopping criterion is met.
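The iterations above, together with a stopping criterion, can be sketched as follows. This is a minimal illustration, assuming a toy post-filter with two parameters (`w * x + b`) in place of a neural network, an MSE training loss against the uncompressed data, plain gradient descent, and a stopping criterion that combines an iteration cap with a minimum loss improvement. The learning rate `lr`, the tolerance `tol`, and the function name `overfit` are hypothetical.

```python
def overfit(params, decoded, original, lr=0.5, max_iters=5000, tol=1e-12):
    """Overfit a toy two-parameter post-filter out = w * x + b by gradient descent.

    Returns the updated (w, b) and the last computed training loss.
    """
    w, b = params
    n = len(decoded)
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_iters):                       # stopping criterion 1: iteration cap
        # Forward pass: post-process the decoded samples.
        out = [w * x + b for x in decoded]
        # Training loss: MSE between post-processed output and uncompressed data.
        residual = [o - t for o, t in zip(out, original)]
        loss = sum(r * r for r in residual) / n
        if prev_loss - loss < tol:                   # stopping criterion 2: no progress
            break
        prev_loss = loss
        # Gradients of the loss with respect to w and b.
        grad_w = 2.0 / n * sum(r * x for r, x in zip(residual, decoded))
        grad_b = 2.0 / n * sum(residual)
        # Parameter update.
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), loss


# Toy data in [0, 1]: decoding compresses the range and adds an offset.
original = [0.1, 0.4, 0.6, 0.9]
decoded = [0.9 * o + 0.05 for o in original]

initial = (1.0, 0.0)                                 # filter starts as the identity
initial_loss = sum((d - o) ** 2 for d, o in zip(decoded, original)) / len(original)
(w, b), final_loss = overfit(initial, decoded, original)
```

Because the filter is fitted on the same content it is later applied to, a low training loss is the goal here rather than a warning sign.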
- the example apparatus may further include, wherein the one or more neural networks comprise one or more decoder-side neural networks, and wherein the one or more decoder-side neural networks are available at a decoder side and an encoder side.
- Yet another apparatus includes at least one processor; and at least one non-transitory memory comprising computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: process an output of a neural network version by using one or more processing operations; and optimize one or more parameters of the one or more processing operations at an encoder side.
- the example apparatus may further include, wherein the apparatus is further caused to signal the optimized one or more parameters, or information derived from the optimized one or more parameters, to a decoder side.
- the example apparatus may further include, wherein the one or more processing operations comprise a refinement operation, and wherein the apparatus is further caused to apply the refinement operation on an output of the neural network based at least on the optimized one or more parameters.
- the example apparatus may further include, wherein the apparatus is further caused to train or to overfit the neural network version based on the one or more processing operations.
- the example apparatus may further include, wherein to train or to overfit the neural network, the apparatus is further caused to: provide input data to the neural network; obtain output data from the neural network; compute a refined output data based at least on the output data from the neural network and a refinement function; compute a loss based at least on the refined output data and respective ground-truth data, wherein the respective ground-truth data comprises an uncompressed version of the input data to the neural network; compute gradients of the loss (for example, a mean squared error (MSE) loss) with respect to one or more parameters of the neural network; and use the gradients to update the one or more parameters of the neural network.
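The training step above can be written out as follows. This is a sketch under stated assumptions: the refinement function is taken to be an affine scale-and-shift with parameters $s$ and $t$, the loss is taken to be the MSE, and the update is plain gradient descent with learning rate $\eta$. The symbols $f_\theta$, $s$, $t$, $\eta$, and $N$ are notation introduced here and do not appear in the source.

```latex
% x: input data (decoded), y: ground truth (uncompressed), f_\theta: neural network
\begin{align}
  \hat{y}_i &= s \, f_\theta(x)_i + t
    && \text{(refined output)} \\
  \mathcal{L}(\theta) &= \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2
    && \text{(MSE loss on the refined output)} \\
  \frac{\partial \mathcal{L}}{\partial \theta}
    &= \frac{2}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right) s \,
       \frac{\partial f_\theta(x)_i}{\partial \theta}
    && \text{(chain rule through the refinement)} \\
  \theta &\leftarrow \theta - \eta \, \frac{\partial \mathcal{L}}{\partial \theta}
    && \text{(parameter update)}
\end{align}
```

The factor $s$ in the gradient shows why the loss is computed on the refined output: the network is trained with knowledge of the processing that is applied after it.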
- the example apparatus may further include, wherein the neural network comprises a post-processing filter, and wherein an input data to the post-processing filter is a decoded frame, and an output data from the post-processing filter is a post-processed frame.
- the example apparatus may further include, wherein the one or more processing operations comprise at least one of a scaling operation or a shifting operation.
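Encoder-side optimization of a scaling and a shifting parameter can be sketched as follows. This is a minimal illustration, assuming that the refined output is `s * output + t` and that the encoder picks `s` and `t` by ordinary least squares against the uncompressed data; the source does not prescribe how the parameters are optimized. The function names are hypothetical.

```python
def fit_scale_shift(nn_output, ground_truth):
    """Least-squares scale s and shift t so that s * nn_output + t approximates ground_truth."""
    n = len(nn_output)
    mean_x = sum(nn_output) / n
    mean_y = sum(ground_truth) / n
    var_x = sum((x - mean_x) ** 2 for x in nn_output)
    if var_x == 0:
        # Constant output: scaling cannot help, so only a shift is fitted.
        return 1.0, mean_y - mean_x
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(nn_output, ground_truth))
    s = cov_xy / var_x
    t = mean_y - s * mean_x
    return s, t


def refine(nn_output, s, t):
    """Refinement applied to the neural network output, at the encoder or the decoder."""
    return [s * x + t for x in nn_output]


# Toy data: the ground truth is exactly 2 * output + 3.
nn_output = [10.0, 20.0, 30.0, 40.0]
ground_truth = [23.0, 43.0, 63.0, 83.0]

s, t = fit_scale_shift(nn_output, ground_truth)   # computed at the encoder side
refined = refine(nn_output, s, t)                 # same operation at the decoder side
```

Only the two values `s` and `t`, or information derived from them, would need to be signaled to the decoder side in this sketch.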
- the example apparatus may further include, wherein the neural network comprises a decoder-side neural network, and wherein the decoder-side neural network is available at the decoder side and the encoder side.
- An example method includes: running one or more candidate neural network versions by using at least data from an evaluation set; evaluating performance of the one or more candidate neural network versions based on the evaluation set; selecting a candidate neural network version based on one or more predetermined performance criteria; overfitting the selected neural network version based at least on an overfitting set; and running the overfitted neural network version on an inference set.
- the example method may further include, wherein: the evaluation set comprises data for evaluating the one or more candidate neural network versions; the overfitting set comprises data for overfitting the selected neural network version; and the inference set comprises data for running the overfitted neural network version.
- the example method may further include, wherein the evaluation set, overfitting set, and the inference set partially or fully overlap.
- the example method may further include, wherein the inference set comprises a video, the evaluation set comprises a first random access (RA) segment of the video, and the overfitting set comprises the video or the first RA segment of the video.
- the example method may further include, wherein the performance criteria comprise a distortion-based performance criterion.
- the example method may further include, wherein the selected neural network version performs best according to the one or more performance criteria.
- the example method may further include, wherein: the one or more candidate neural network versions comprise two candidate neural network versions; each candidate neural network version comprises a post-filter; the evaluation set comprises a first RA segment of a video; the overfitting set comprises the video; the inference set comprises a decoded video; output of each candidate neural network version comprises a post-processed first RA segment; and wherein the method further comprises: computing a first performance metric based on input to each candidate neural network version and a second performance metric based on output of each candidate neural network version; computing a third performance metric comprising performance of each candidate neural network version based on the first performance metric and the second performance metric; and selecting the candidate neural network version with a value of the third performance metric greater than or equal to a predetermined value as the selected neural network version.
- the example method may further include, wherein to overfit the selected neural network version, the method comprises performing one or more iterations of the following: inputting the decoded video to the selected neural network version; obtaining a post-processed output video from the selected neural network version; computing a training loss between the decoded video and the post-processed output video; computing gradients for one or more parameters of the selected neural network version; and using the gradients for updating the one or more parameters of the selected neural network version.
- the example method may further include, wherein the one or more iterations are performed until a stopping criterion is met.
- the example method may further include: computing a weight-update based at least on weights of the overfitted neural network version and weights of the same neural network version before overfitting; compressing the weight-update; and signaling or providing a bitstream representing the compressed weight-update to a decoder side in or along the bitstream representing encoded data.
- the example method may further include, wherein the one or more neural network versions comprise one or more decoder-side neural network versions, wherein the one or more decoder-side neural network versions are available at a decoder side and an encoder side.
- Another example method includes: overfitting one or more candidate neural network versions on an evaluation set to obtain a first set of overfitted neural network versions; evaluating performance of the first set of overfitted neural network versions on the evaluation set; selecting a first overfitted neural network version based on one or more predetermined performance criteria, to obtain a selected first overfitted neural network version; when the evaluation set is different from an overfitting set: overfitting a neural network version, used to obtain the selected first overfitted neural network version, on the overfitting set to obtain a second overfitted neural network version; and running the second overfitted neural network version on an inference set; and running the selected first overfitted neural network version on the inference set when the evaluation set is the same or substantially the same as the overfitting set.
- the example method may further include, wherein: the evaluation set comprises data for overfitting one or more candidate neural network versions and for evaluating the first set of overfitted neural network versions; the overfitting set comprises data for overfitting the neural network version used to obtain the selected first overfitted neural network version; and the inference set comprises data for running the selected first overfitted neural network version or the second overfitted neural network version.
- the example method may further include, wherein the evaluation set, overfitting set, and the inference set partially or fully overlap.
- the example method may further include, wherein the performance criteria comprise a distortion-based performance criterion.
- the example method may further include, wherein the selected first overfitted neural network version performs best according to the one or more performance criteria.
- the example method may further include, wherein: the one or more candidate neural network versions comprise two candidate neural network versions; each candidate neural network version comprises a post-processing filter; the two candidate neural network versions are overfitted on a first RA segment of a video, to obtain two overfitted candidate neural network versions; and wherein the method further comprises: computing a fourth performance metric comprising performance of each overfitted candidate neural network version based on a fifth performance metric and a sixth performance metric, wherein the fifth performance metric is based on a post-processed first RA segment and the sixth performance metric is based on a decoded first RA segment; selecting an overfitted candidate neural network version with a value of the fourth performance metric greater than or equal to a predetermined value as an optimal neural network version, to obtain a selected overfitted candidate neural network version; overfitting the candidate neural network version used to obtain the selected overfitted candidate neural network version on the video, to obtain an overfitted selected neural network version; and post-processing a de
- the example method may further include, wherein to overfit each candidate neural network version, the method further comprises performing one or more iterations of the following: providing a decoded first RA segment as an input to each candidate neural network version; obtaining a post-processed first RA segment from each candidate neural network version; computing a training loss based at least on the post-processed first RA segment and respective uncompressed data; computing gradients for one or more parameters of each candidate neural network version; and using the gradients for updating the one or more parameters of each candidate decoder-side neural network (DSNN) version.
- the example method may further include, wherein the one or more iterations are performed until a stopping criterion is met.
- the example method may further include, wherein to overfit the selected neural network version, the method further comprises performing one or more iterations of the following: providing the decoded video as an input to the selected neural network version; obtaining a post-processed output video from the selected neural network version; computing a training loss based at least on the post-processed output video and respective uncompressed data; computing gradients for one or more parameters of the selected neural network version; and using the gradients for updating the one or more parameters of the selected neural network version.
- the example method may further include, wherein the one or more iterations are performed until a stopping criterion is met.
- the example method may further include, wherein the one or more neural networks comprise one or more decoder-side neural networks, and wherein the one or more decoder-side neural networks are available at a decoder side and an encoder side.
- Yet another example method includes: processing an output of a neural network version by using one or more processing operations; and optimizing one or more parameters of the one or more processing operations at an encoder side.
- the example method may further include signaling the optimized one or more parameters, or information derived from the optimized one or more parameters, to a decoder side.
- the example method may further include, wherein the one or more processing operations comprise a refinement operation, and wherein the method further comprises applying the refinement operation on an output of the neural network based at least on the optimized one or more parameters.
- the example method may further include training or overfitting the neural network version based on the one or more processing operations.
- the example method may further include, wherein to train or to overfit the neural network, the method further comprises: providing input data to the neural network; obtaining output data from the neural network; computing a refined output data based at least on the output data from the neural network and a refinement function; computing a loss based at least on the refined output data and respective ground-truth data, wherein the respective ground-truth data comprises an uncompressed version of the input data to the neural network; computing gradients of the loss (for example, a mean squared error (MSE) loss) with respect to one or more parameters of the neural network; and using the gradients to update the one or more parameters of the neural network.
- the example method may further include, wherein the neural network comprises a post-processing filter, and wherein an input data to the post-processing filter is a decoded frame, and an output data from the post-processing filter is a post-processed frame.
- the example method may further include, wherein the one or more processing operations comprise at least one of a scaling operation or a shifting operation.
- the example method may further include, wherein the neural network comprises a decoder-side neural network, and wherein the decoder-side neural network is available at the decoder side and the encoder side.
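The encoder-side optimization of scaling and shifting parameters described above can be illustrated with a short sketch. It assumes, purely for illustration, that the parameters are chosen by a closed-form least-squares fit against the uncompressed data; the text does not specify the optimization method, and the function names are hypothetical.

```python
def fit_scale_shift(nn_output, ground_truth):
    """Encoder side: find scale and shift minimising the squared error of
    scale * y + shift against the ground truth (least squares, assumed)."""
    n = len(nn_output)
    mean_y = sum(nn_output) / n
    mean_t = sum(ground_truth) / n
    var_y = sum((y - mean_y) ** 2 for y in nn_output)
    if var_y == 0.0:
        # Constant output: only a shift can be estimated.
        return 1.0, mean_t - mean_y
    cov = sum((y - mean_y) * (t - mean_t)
              for y, t in zip(nn_output, ground_truth))
    scale = cov / var_y
    shift = mean_t - scale * mean_y
    return scale, shift


def refine(nn_output, scale, shift):
    """Decoder side: apply the signaled refinement to the network output."""
    return [scale * y + shift for y in nn_output]
```

Only the two numbers returned by `fit_scale_shift` (or information derived from them) would need to be signaled, which is far cheaper than signaling a weight update for the network itself.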
- An example computer readable medium includes program instructions for causing an apparatus to perform at least the following: run one or more candidate neural network versions by using at least data from an evaluation set; evaluate performance of the one or more candidate neural network versions based on the evaluation set; select a candidate neural network version based on one or more predetermined performance criteria; overfit the selected neural network version based at least on an overfitting set; and run the overfitted neural network version on an inference set.
- the example computer readable medium may further include, wherein the computer readable medium comprises a non-transitory computer readable medium.
- the example computer readable medium may further include, wherein the computer readable medium further causes the apparatus to perform the methods as described in any of the previous paragraphs.
- Another example computer readable medium comprising program instructions for causing an apparatus to perform at least the following: overfit one or more candidate neural network versions on an evaluation set to obtain a first set of overfitted neural network versions; evaluate performance of the first set of overfitted neural network versions on the evaluation set; select a first overfitted neural network version based on one or more predetermined performance criteria, to obtain a selected first overfitted neural network; when the evaluation set is different from an overfitting set: overfit a neural network version, used to obtain the selected first overfitted neural network version, on an overfitting set to obtain a second overfitted neural network version; and run the second overfitted neural network version on an inference set; and run the selected first overfitted neural network version on the inference set when the evaluation set is same or substantially same as the overfitting set.
- the example computer readable medium may further include, wherein the computer readable medium comprises a non-transitory computer readable medium.
- the example computer readable medium may further include, wherein the computer readable medium further causes the apparatus to perform the methods as described in any of the previous paragraphs.
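The first of the two flows above (evaluate candidates on an evaluation set, select one, overfit it on an overfitting set, run it on an inference set) might be sketched as below. The candidates are toy gain-only filters and "overfitting" is reduced to fitting a single bias term; these simplifications, the MSE criterion, and every name in the code are assumptions made for illustration.

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def make_filter(gain, bias=0.0):
    """Toy stand-in for a neural network version (hypothetical)."""
    return lambda samples: [gain * s + bias for s in samples]


def overfit_bias(gain, overfitting_set):
    """Toy overfitting step: fit only a bias term on the overfitting set."""
    decoded, original = overfitting_set
    bias = sum(t - gain * d for d, t in zip(decoded, original)) / len(decoded)
    return make_filter(gain, bias)


def select_overfit_run(candidate_gains, evaluation_set, overfitting_set,
                       inference_input):
    eval_decoded, eval_original = evaluation_set
    # Run and evaluate every candidate version on the evaluation set.
    losses = {g: mse(make_filter(g)(eval_decoded), eval_original)
              for g in candidate_gains}
    # Select according to the performance criterion (lowest MSE, assumed).
    selected = min(losses, key=losses.get)
    # Overfit only the selected version, then run it on the inference set.
    overfitted = overfit_bias(selected, overfitting_set)
    return selected, overfitted(inference_input)
```

The point of the ordering is that the costly overfitting step is spent on a single candidate rather than on all of them; the second flow described above instead overfits every candidate before selecting.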
- the example computer readable medium may further include, wherein the computer readable medium comprises a non-transitory computer readable medium.
- the example computer readable medium may further include, wherein the computer readable medium further causes the apparatus to perform the methods as described in any of the previous paragraphs.
- Still another example apparatus includes means for performing the methods as described in any of the previous paragraphs.
BRIEF DESCRIPTION OF THE DRAWINGS
- FIG. 1 shows schematically an electronic device employing embodiments of the examples described herein.
- FIG. 2 shows schematically a user equipment suitable for employing embodiments of the examples described herein.
- FIG. 3 further shows schematically electronic devices employing embodiments of the examples described herein connected using wireless and wired network connections.
- FIG. 4 shows schematically a block diagram of an encoder on a general level.
- FIG. 5 is a block diagram showing an interface between an encoder and a decoder in accordance with the examples described herein.
- FIG. 6 illustrates a system configured to support streaming of media data from a source to a client device.
- FIG. 7 is a block diagram of an apparatus that may be specifically configured in accordance with an example embodiment.
- FIG. 8 illustrates examples of functioning of neural networks (NNs) as components of a pipeline of a traditional codec, in accordance with an example embodiment.
- FIG. 9 illustrates an example of modified video coding pipeline based on neural networks, in accordance with an example embodiment.
- FIG. 10 is an example neural network-based end-to-end learned video coding system, in accordance with an example embodiment.
- FIG. 11 illustrates a pipeline of video coding for machines (VCM), in accordance with an embodiment.
- FIG. 12 illustrates an example of an end-to-end learned approach for the use case of video coding for machines, in accordance with an embodiment.
- FIG. 13 illustrates an example of how the end-to-end learned system may be trained for the use case of video coding for machines, in accordance with an embodiment.
- FIG. 14 illustrates an example for overfitting a decoder-side neural network based on refinement operations, in accordance with an embodiment.
- FIG. 15 is an example apparatus, which may be implemented in hardware, and is caused to implement mechanisms for optimizing overfitting of neural network filters or optimizing one or more parameters of one or more processing operations, based on the examples described herein.
- FIG. 16 illustrates an example method for optimizing the overfitting of neural network filters, in accordance with an embodiment.
- FIG. 17 illustrates an example method for optimizing the overfitting of a neural network, in accordance with another embodiment.
- FIG. 18 illustrates an example method for optimizing one or more parameters of one or more processing operations at an encoder side, in accordance with an embodiment.
- FIG. 19 illustrates an example method for optimizing the overfitting of a neural network, in accordance with yet another embodiment.
- FIG. 20 is a block diagram of one possible and non-limiting system in which the example embodiments may be practiced.
- ALF adaptive loop filtering
- a.k.a. also known as
- AVC advanced video coding
- bpp bits-per-pixel
- DU distributed unit
- eNB or eNodeB evolved Node B (for example, an LTE base station)
- EN-DC E-UTRA-NR dual connectivity
- en-gNB or En-gNB node providing NR user plane and control plane protocol terminations towards the UE, and acting as secondary node in EN-DC
- E-UTRA evolved universal terrestrial radio access, for example, the LTE radio access technology
- FDMA frequency division multiple access
- f(n) fixed-pattern bit string using n bits written (from left to right) with the left bit first
- FDC finetuning-driving content
- gNB (or gNodeB) base station for 5G/NR, for example, a node providing NR user plane and control plane protocol terminations towards the UE, and connected via the NG interface to the 5GC
- GSM Global System for Mobile communications
- H.222.0 MPEG-2 Systems is formally known as ISO/IEC 13818-1 and as ITU-T Rec. H.222.0
- H.26x family of video coding standards in the domain of the ITU-T
- LZMA2 simple container format that can include both uncompressed data and LZMA data
- UE user equipment
- ue(v) unsigned integer Exp-Golomb-coded syntax element with the left bit first
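As an illustration of the ue(v) descriptor listed above, the following sketch encodes and decodes unsigned order-0 Exp-Golomb codewords. Representing the bitstream as a string of '0' and '1' characters is a simplification for readability, and the function names are hypothetical.

```python
def encode_ue(value):
    """Encode a non-negative integer as an order-0 Exp-Golomb codeword."""
    code = bin(value + 1)[2:]            # binary of value + 1, no '0b' prefix
    return "0" * (len(code) - 1) + code  # prefix of leading zeros, then code


def decode_ue(bits, pos=0):
    """Decode one ue(v) value starting at bits[pos], left bit first.

    Returns the decoded value and the position of the next unread bit.
    """
    leading_zeros = 0
    while bits[pos] == "0":
        leading_zeros += 1
        pos += 1
    pos += 1  # skip the '1' that terminates the zero prefix
    suffix = bits[pos:pos + leading_zeros]
    pos += leading_zeros
    value = (1 << leading_zeros) - 1 + (int(suffix, 2) if suffix else 0)
    return value, pos
```

Small values get short codewords: 0 is coded as `1`, 1 as `010`, and 3 as `00100`.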
- circuitry refers to (a) hardware-only circuit implementations (e.g., implementations in analog circuitry and/or digital circuitry); (b) combinations of circuits and computer program product(s) comprising software and/or firmware instructions stored on one or more computer readable memories that work together to cause an apparatus to perform one or more functions described herein; and (c) circuits, such as, for example, a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation even if the software or firmware is not physically present.
- This definition of ‘circuitry’ applies to all uses of this term herein, including in any claims.
- circuitry also includes an implementation comprising one or more processors and/or portion(s) thereof and accompanying software and/or firmware.
- circuitry as used herein also includes, for example, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, other network device, and/or other computing device.
- a method, apparatus and computer program product are provided in accordance with example embodiments for optimizing the overfitting of neural network filters or optimizing one or more parameters of one or more processing operations.
- FIG. 1 shows an example block diagram of an apparatus 50.
- the apparatus may be an internet of things (IoT) apparatus configured to perform various functions, for example, gathering information by one or more sensors, receiving or transmitting information, analyzing information gathered or received by the apparatus, or the like.
- the apparatus may comprise a video coding system, which may incorporate a codec.
- FIG. 2 shows a layout of an apparatus according to an example embodiment. The elements of FIG. 1 and FIG. 2 will be explained next.
- the apparatus 50 may for example be a mobile terminal or user equipment of a wireless communication system, a sensor device, a tag, or a lower power device.
- embodiments of the examples described herein may be implemented within any electronic device or apparatus which may process data by neural networks.
- the apparatus 50 may comprise a housing 30 for incorporating and protecting the device.
- the apparatus 50 may further comprise a display 32, for example, in the form of a liquid crystal display, light emitting diode display, organic light emitting diode display, and the like.
- the display may be any suitable display technology suitable to display media or multimedia content, for example, an image or a video.
- the apparatus 50 may further comprise a keypad 34.
- any suitable data or user interface mechanism may be employed.
- the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
- the apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input.
- the apparatus 50 may further comprise an audio output device which in embodiments of the examples described herein may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection.
- the apparatus 50 may also comprise a battery (or in other embodiments of the examples described herein the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator).
- the apparatus may further comprise a camera 42 capable of recording or capturing images and/or video.
- the apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices. In other embodiments the apparatus 50 may further comprise any suitable short range communication solution such as for example a Bluetooth wireless connection or a USB/firewire wired connection.
- the apparatus 50 may comprise a controller 56, a processor or a processor circuitry for controlling the apparatus 50.
- the controller 56 may be connected to a memory 58 which in embodiments of the examples described herein may store both data in the form of an image, audio data and video data, and/or may also store instructions for implementation on the controller 56.
- the controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and/or decoding of audio, image and/or video data or assisting in coding and/or decoding carried out by the controller.
- the apparatus 50 may further comprise a card reader 48 and a smart card 46, for example, a UICC and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
- the apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals, for example, for communication with a cellular communications network, a wireless communications system or a wireless local area network.
- the apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and/or for receiving radio frequency signals from other apparatus(es).
- the apparatus 50 may comprise a camera 42 capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing.
- the apparatus may receive the video image data for processing from another device prior to transmission and/or storage.
- the apparatus 50 may also receive either wirelessly or by a wired connection the image for coding/decoding.
- the structural elements of apparatus 50 described above represent examples of means for performing a corresponding function.
- the system 10 comprises multiple communication devices which can communicate through one or more networks.
- the system 10 may comprise any combination of wired or wireless networks including, but not limited to, a wireless cellular telephone network (such as a GSM, UMTS, CDMA, LTE, 4G, 5G network, and the like), a wireless local area network (WLAN) such as defined by any of the IEEE 802.x standards, a Bluetooth® personal area network, an Ethernet local area network, a token ring local area network, a wide area network, and the Internet.
- the system 10 may include both wired and wireless communication devices and/or apparatus 50 suitable for implementing embodiments of the examples described herein.
- the system shown in FIG. 3 shows a mobile telephone network 11 and a representation of the Internet 28.
- Connectivity to the Internet 28 may include, but is not limited to, long range wireless connections, short range wireless connections, and various wired connections including, but not limited to, telephone lines, cable lines, power lines, and similar communication pathways.
- the example communication devices shown in the system 10 may include, but are not limited to, an electronic device or apparatus 50, a combination of a personal digital assistant (PDA) and a mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, a notebook computer 22.
- the apparatus 50 may be stationary or mobile when carried by an individual who is moving.
- the apparatus 50 may also be located in a mode of transport including, but not limited to, a car, a truck, a taxi, a bus, a train, a boat, an airplane, a bicycle, a motorcycle or any similar suitable mode of transport.
- the embodiments may also be implemented in a set-top box; for example, a digital TV receiver, which may/may not have a display or wireless capabilities, in tablets or (laptop) personal computers (PC), which have hardware and/or software to process neural network data, in various operating systems, and in chipsets, processors, DSPs and/or embedded systems offering hardware/software based coding.
- Some or further apparatus may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24.
- the base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network 11 and the Internet 28.
- the system may include additional communication devices and communication devices of various types.
- the communication devices may communicate using various transmission technologies including, but not limited to, code division multiple access (CDMA), global systems for mobile communications (GSM), universal mobile telecommunications system (UMTS), time divisional multiple access (TDMA), frequency division multiple access (FDMA), transmission control protocol-internet protocol (TCP-IP), short messaging service (SMS), multimedia messaging service (MMS), email, instant messaging service (IMS), Bluetooth, IEEE 802.11, 3GPP Narrowband IoT and any similar wireless communication technology.
- a communications device involved in implementing various embodiments of the examples described herein may communicate using various media including, but not limited to
- a channel may refer either to a physical channel or to a logical channel.
- a physical channel may refer to a physical transmission medium such as a wire
- a logical channel may refer to a logical connection over a multiplexed medium, capable of conveying several logical channels.
- a channel may be used for conveying an information signal, for example a bitstream, from one or several senders (or transmitters) to one or several receivers.
- the embodiments may also be implemented in internet of things (IoT) devices.
- the IoT may be defined, for example, as an interconnection of uniquely identifiable embedded computing devices within the existing Internet infrastructure.
- the convergence of various technologies has enabled and may enable many fields of embedded systems, such as wireless sensor networks, control systems, home/building automation, and the like, to be included in the IoT.
- In order to utilize the Internet, the IoT devices are provided with an IP address as a unique identifier.
- the IoT devices may be provided with a radio transmitter, such as a WLAN or Bluetooth transmitter or an RFID tag.
- the IoT devices may have access to an IP-based network via a wired network, such as an Ethernet-based network or a powerline connection (PLC).
- the devices/systems described in FIGs. 1 to 3 enable encoding, decoding, and/or transportation of, for example, a neural network representation and/or a media bitstream.
- An MPEG-2 transport stream (TS), specified in ISO/IEC 13818-1 or equivalently in ITU-T Recommendation H.222.0, is a format for carrying audio, video, and other media as well as program metadata or other metadata, in a multiplexed stream.
- a packet identifier (PID) is used to identify an elementary stream (a.k.a. packetized elementary stream) within the TS.
- a logical channel within an MPEG-2 TS may be considered to correspond to a specific PID value.
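To make the correspondence between a logical channel and a PID value concrete, the sketch below reads the 13-bit PID from the header of each 188-byte transport stream packet and groups packets by PID. The function names are hypothetical, and the sketch ignores adaptation fields, continuity counters, and resynchronization after errors.

```python
TS_PACKET_SIZE = 188  # fixed MPEG-2 TS packet length in bytes
SYNC_BYTE = 0x47      # first byte of every TS packet


def packet_pid(packet):
    """Return the 13-bit PID carried in bytes 1 and 2 of a TS packet."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid MPEG-2 TS packet")
    # Low 5 bits of byte 1 are the high bits of the PID; byte 2 is the rest.
    return ((packet[1] & 0x1F) << 8) | packet[2]


def split_by_pid(stream):
    """Group the packets of a byte stream by PID (one list per channel)."""
    channels = {}
    for offset in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = stream[offset:offset + TS_PACKET_SIZE]
        channels.setdefault(packet_pid(packet), []).append(packet)
    return channels
```

Each key of the returned dictionary then plays the role of one logical channel multiplexed into the stream.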
- Available media file format standards include ISO base media file format (ISO/IEC 14496-12, which may be abbreviated ISOBMFF) and file format for NAL unit structured video (ISO/IEC 14496-15), which derives from the ISOBMFF.
- A video codec includes an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can decompress the compressed video representation back into a viewable form, or into a form that is suitable as an input to one or more algorithms for analysis or processing.
- a video encoder and/or a video decoder may also be separate from each other, for example, need not form a codec.
- the encoder discards some information in the original video sequence in order to represent the video in a more compact form (e.g., at lower bitrate).
- Typical hybrid video encoders, for example, many encoder implementations of ITU-T H.263 and H.264, encode the video information in two phases. Firstly, pixel values in a certain picture area (or ‘block’) are predicted, for example, by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner). Secondly, the prediction error, for example, the difference between the predicted block of pixels and the original block of pixels, is coded.
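The two phases described above can be sketched on a one-dimensional toy signal: a motion search finds the best-matching area in a previously coded frame, and only the offset and the prediction error are kept. The sum-of-absolute-differences cost, the exhaustive search, and all names are assumptions for illustration; real encoders work on two-dimensional blocks and also transform and quantize the error.

```python
def sad(a, b):
    """Sum of absolute differences between two equally long sample lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


def motion_search(reference, block, start, search_range):
    """Return the offset within +/- search_range that minimises the SAD."""
    best_mv, best_cost = 0, None
    n = len(block)
    for mv in range(-search_range, search_range + 1):
        pos = start + mv
        if pos < 0 or pos + n > len(reference):
            continue  # candidate area falls outside the reference frame
        cost = sad(reference[pos:pos + n], block)
        if best_cost is None or cost < best_cost:
            best_mv, best_cost = mv, cost
    return best_mv


def encode_block(reference, block, start, search_range=4):
    """Phase 1: predict by motion compensation. Phase 2: keep the error."""
    mv = motion_search(reference, block, start, search_range)
    prediction = reference[start + mv:start + mv + len(block)]
    residual = [o - p for o, p in zip(block, prediction)]
    return mv, residual


def decode_block(reference, mv, residual, start):
    """Rebuild the block from the same prediction plus the coded error."""
    prediction = reference[start + mv:start + mv + len(residual)]
    return [p + r for p, r in zip(prediction, residual)]
```

Because the decoder forms the same prediction from the same reference, the small residual is all that has to be coded in addition to the offset.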
- the prediction error may be transformed using a specified transform, for example, a Discrete Cosine Transform (DCT) or a variant of it.
- the encoder can control the balance between the accuracy of the pixel representation (picture quality) and the size of the resulting coded video representation (file size or transmission bitrate).
- In temporal prediction, the sources of prediction are previously decoded pictures (a.k.a. reference pictures). In intra block copy (IBC; a.k.a. intra-block-copy prediction and current picture referencing), prediction is applied similarly to temporal prediction, but the reference picture is the current picture and only previously decoded samples can be referred to in the prediction process. Inter-layer or inter-view prediction may be applied similarly to temporal prediction, but the reference picture is a decoded picture from another scalable layer or from another view, respectively. In some cases, inter prediction may refer to temporal prediction only, while in other cases inter prediction may refer collectively to temporal prediction and any of intra block copy, inter-layer prediction, and inter-view prediction provided that they are performed with the same or a similar process as temporal prediction. Inter prediction or temporal prediction may sometimes be referred to as motion compensation or motion-compensated prediction.
- Inter prediction which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, reduces temporal redundancy.
- In inter prediction, the sources of prediction are previously decoded pictures.
- Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated.
- Intra prediction can be performed in spatial or transform domain, for example, either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra-coding, where no inter prediction is applied.
- One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently when they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
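The motion vector example above can be sketched as follows. The text only says that the predictor comes from spatially adjacent motion vectors; using the component-wise median of three neighbors is an assumption borrowed from common codec practice, and the function names are hypothetical.

```python
def median_predictor(neighbors):
    """Component-wise median of the neighboring motion vectors (assumed)."""
    xs = sorted(mv[0] for mv in neighbors)
    ys = sorted(mv[1] for mv in neighbors)
    mid = len(neighbors) // 2
    return xs[mid], ys[mid]


def encode_mv(mv, neighbors):
    """Return only the difference between the vector and its predictor."""
    px, py = median_predictor(neighbors)
    return mv[0] - px, mv[1] - py


def decode_mv(mvd, neighbors):
    """Add the coded difference back to the same predictor."""
    px, py = median_predictor(neighbors)
    return mvd[0] + px, mvd[1] + py
```

When neighboring blocks move similarly, the differences cluster around zero, which is what makes them cheaper to entropy-code than the raw vectors.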
- FIG. 4 shows a block diagram of a general structure of a video encoder.
- FIG. 4 presents an encoder for two layers, but it would be appreciated that the presented encoder could be similarly extended to encode more than two layers.
- FIG. 4 illustrates a video encoder comprising a first encoder section 500 for a base layer and a second encoder section 502 for an enhancement layer.
- Each of the first encoder section 500 and the second encoder section 502 may comprise similar elements for encoding incoming pictures.
- the encoder sections 500, 502 may comprise a pixel predictor 302, 402, prediction error encoder 303, 403 and prediction error decoder 304, 404.
- FIG. 4 also shows an embodiment of the pixel predictor 302, 402 as comprising an inter-predictor 306, 406, an intra-predictor 308, 408, a mode selector 310, 410, a filter 316, 416, and a reference frame memory 318, 418.
- the pixel predictor 302 of the first encoder section 500 receives base layer picture(s)/image(s) 300 of a video stream to be encoded at both the inter-predictor 306 (which determines the difference between the image and a motion compensated reference frame) and the intra-predictor 308 (which determines a prediction for an image block based only on the already processed parts of current frame or picture).
- the output of both the inter-predictor and the intra-predictor are passed to the mode selector 310.
- the intra-predictor 308 may have more than one intra-prediction mode. Hence, each mode may perform the intra-prediction and provide the predicted signal to the mode selector 310.
- the mode selector 310 also receives a copy of the base layer image(s) 300.
- the pixel predictor 402 of the second encoder section 502 receives enhancement layer picture(s)/image(s) 400 of a video stream to be encoded at both the inter-predictor 406 (which determines the difference between the image and a motion compensated reference frame) and the intra-predictor 408 (which determines a prediction for an image block based only on the already processed parts of current frame or picture).
- the output of both the inter-predictor and the intra-predictor are passed to the mode selector 410.
- the intra-predictor 408 may have more than one intra-prediction mode. Hence, each mode may perform the intra-prediction and provide the predicted signal to the mode selector 410.
- the mode selector 410 also receives a copy of the enhancement layer image(s) 400.
- the output of the inter-predictor 306, 406 or the output of one of the optional intra-predictor modes or the output of a surface encoder within the mode selector is passed to the output of the mode selector 310, 410.
- the output of the mode selector 310, 410 is passed to a first summing device 321, 421.
- the first summing device may subtract the output of the pixel predictor 302, 402 from the base layer image(s) 300/enhancement layer image(s) 400 to produce a first prediction error signal 320, 420 which is input to the prediction error encoder 303, 403.
- the pixel predictor 302, 402 further receives from a preliminary reconstructor 339, 439 the combination of the prediction representation of the image block 312, 412 and the output 338, 438 of the prediction error decoder 304, 404.
- the preliminary reconstructed image 314, 414 may be passed to the intra-predictor 308, 408 and to the filter 316, 416.
- the filter 316, 416 receiving the preliminary representation may filter the preliminary representation and output a final reconstructed image 340, 440 which may be saved in the reference frame memory 318, 418.
- the reference frame memory 318 may be connected to the inter-predictor 306 to be used as the reference image against which a future base layer image 300 is compared in inter-prediction operations.
- the reference frame memory 318 may also be connected to the inter-predictor 406 to be used as the reference image against which a future enhancement layer image(s) 400 is compared in inter-prediction operations. Moreover, the reference frame memory 418 may be connected to the inter-predictor 406 to be used as the reference image against which the future enhancement layer image(s) 400 is compared in inter-prediction operations.
- Filtering parameters from the filter 316 of the first encoder section 500 may be provided to the second encoder section 502 subject to the base layer being selected and indicated to be source for predicting the filtering parameters of the enhancement layer according to some embodiments.
- the prediction error encoder 303, 403 comprises a transform unit 342, 442 and a quantizer 344, 444.
- the transform unit 342, 442 transforms the first prediction error signal 320, 420 to a transform domain.
- the transform is, for example, the DCT transform.
- the quantizer 344, 444 quantizes the transform domain signal, for example, the DCT coefficients, to form quantized coefficients.
- the prediction error decoder 304, 404 receives the output from the prediction error encoder 303, 403 and performs the opposite processes of the prediction error encoder 303, 403 to produce a decoded prediction error signal 338, 438 which, when combined with the prediction representation of the image block 312, 412 at the second summing device 339, 439, produces the preliminary reconstructed image 314, 414.
- the prediction error decoder may be considered to comprise a dequantizer 346, 446, which dequantizes the quantized coefficient values, for example, DCT coefficients, to reconstruct the transform signal and an inverse transformation unit 348, 448, which performs the inverse transformation to the reconstructed transform signal wherein the output of the inverse transformation unit 348, 448 contains reconstructed block(s).
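The transform, quantization, dequantization, and inverse transform chain described above can be sketched on a one-dimensional prediction error. The orthonormal DCT-II, the uniform quantizer with a single step size, and the function names are assumptions for illustration; actual codecs use integer approximations of two-dimensional transforms and more elaborate quantizers.

```python
import math


def _scale(k, n):
    # Orthonormal DCT-II scaling factor for coefficient index k.
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(signal):
    """Transform unit: 1-D orthonormal DCT-II of the prediction error."""
    n = len(signal)
    return [_scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, x in enumerate(signal))
            for k in range(n)]


def idct(coeffs):
    """Inverse transformation unit: inverse of the orthonormal DCT-II."""
    n = len(coeffs)
    return [sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for i in range(n)]


def quantize(coeffs, step):
    """Quantizer: map each coefficient to an integer level."""
    return [round(c / step) for c in coeffs]


def dequantize(levels, step):
    """Dequantizer: reconstruct approximate coefficient values."""
    return [level * step for level in levels]
```

A larger `step` yields smaller integer levels, and thus fewer bits after entropy coding, at the price of a larger reconstruction error; this is the quality versus bitrate balance that the encoder controls.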
- the prediction error decoder may also comprise a block filter which may filter the reconstructed block(s) according to further decoded information and filter parameters.
- the entropy encoder 330, 430 receives the output of the prediction error encoder 303, 403 and may perform a suitable entropy encoding/variable length encoding on the signal to provide a compressed signal.
- the outputs of the entropy encoders 330, 430 may be inserted into a bitstream, for example, by a multiplexer 508.
- FIG. 5 is a block diagram showing the interface between an encoder 501 implementing neural network based encoding 503, and a decoder 504 implementing neural network based decoding 505 in accordance with the examples described herein.
- the encoder 501 may embody a device, a software method or a hardware circuit.
- the encoder 501 has the goal of compressing an input data 511 (for example, an input video) to a compressed data 512 (for example, a bitstream) such that the bitrate measuring the size of compressed data 512 is minimized, and the accuracy of an analysis or processing algorithm is maximized.
- the encoder 501 uses an encoder or compression algorithm, for example to perform neural network based encoding 503, e.g., encoding the input data by using one or more neural networks.
- the general analysis or processing algorithm may be part of the decoder 504.
- the decoder 504 uses a decoder or decompression algorithm, for example, to perform the neural network based decoding 505 (e.g., decoding by using one or more neural networks) to decode the compressed data 512 (for example, compressed video) which was encoded by the encoder 501.
- the decoder 504 produces decompressed data 513 (for example, reconstructed data).
- the encoder 501 and decoder 504 may be entities implementing an abstraction, may be separate entities or the same entities, or may be part of the same physical device.
- the analysis/processing algorithm may be any algorithm, traditional or learned from data. In the case of an algorithm which is learned from data, in some embodiments it is assumed that this algorithm can be modified or updated, for example, by using optimization via gradient descent.
- An example of the learned algorithm is a neural network.
- An out-of-band transmission, signaling, or storage may refer to the capability of transmitting, signaling, or storing information in a manner that associates the information with a video bitstream.
- the out-of-band transmission may use a more reliable transmission mechanism compared to the protocols used for carrying coded video data, such as slices.
- the out-of-band transmission, signaling or storage can additionally or alternatively be used e.g. for ease of access or session negotiation.
- a sample entry of a track in a file conforming to the ISO Base Media File Format may comprise parameter sets, while the coded data in the bitstream is stored elsewhere in the file or in another file.
- Another example of out-of-band transmission, signaling, or storage comprises including information, such as NN and/or NN updates in a file format track that is separate from track(s) containing coded video data.
- the phrase along the bitstream (e.g. indicating along the bitstream) or along a coded unit of a bitstream (e.g. indicating along a coded tile) may be used in claims and described embodiments to refer to transmission, signaling, or storage in a manner that the ‘out-of-band’ data is associated with, but not included within, the bitstream or the coded unit, respectively.
- the phrase decoding along the bitstream or along a coded unit of a bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream or the coded unit, respectively.
- the phrase along the bitstream may be used when the bitstream is contained in a container file, such as a file conforming to the ISO Base Media File Format, and certain file metadata is stored in the file in a manner that associates the metadata to the bitstream, such as boxes in the sample entry for a track containing the bitstream, a sample group for the track containing the bitstream, or a timed metadata track associated with the track containing the bitstream.
- the phrase along the bitstream may be used when the bitstream is made available as a stream over a communication protocol and a media description, such as a streaming manifest, is provided to describe the stream.
- An elementary unit for the output of a video encoder and the input of a video decoder, respectively, may be a network abstraction layer (NAL) unit.
- For transport over packet-oriented networks or storage into structured files, NAL units may be encapsulated into packets or similar structures.
- a bytestream format encapsulating NAL units may be used for transmission or storage environments that do not provide framing structures.
- the bytestream format may separate NAL units from each other by attaching a start code in front of each NAL unit.
- encoders may run a byte-oriented start code emulation prevention algorithm, which may add an emulation prevention byte to the NAL unit payload if a start code would have occurred otherwise.
- a NAL unit may be defined as a syntax structure containing an indication of the type of data to follow and bytes containing that data in the form of a raw byte sequence payload interspersed as necessary with emulation prevention bytes.
- a raw byte sequence payload (RBSP) may be defined as a syntax structure containing an integer number of bytes that is encapsulated in a NAL unit.
- An RBSP is either empty or has the form of a string of data bits containing syntax elements followed by an RBSP stop bit and followed by zero or more subsequent bits equal to 0.
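- The bytestream framing and start code emulation prevention described above may be sketched as follows. The sketch assumes the three-byte start code 0x000001 and the emulation prevention byte 0x03 used in H.264/AVC and H.265/HEVC, a fixed two-byte NAL unit header whose last byte is non-zero, and payloads that do not end in a zero byte. The function names are hypothetical.

```python
START_CODE = b"\x00\x00\x01"

def add_emulation_prevention(rbsp: bytes) -> bytes:
    """Insert 0x03 after two zero bytes when the next byte is 0x00..0x03."""
    out = bytearray()
    zeros = 0
    for b in rbsp:
        if zeros >= 2 and b <= 3:
            out.append(3)          # emulation prevention byte
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)

def remove_emulation_prevention(payload: bytes) -> bytes:
    """Drop each 0x03 that directly follows two zero bytes."""
    out = bytearray()
    zeros = 0
    for b in payload:
        if zeros >= 2 and b == 3:
            zeros = 0              # skip the emulation prevention byte
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)

def to_bytestream(nal_units):
    """Attach a start code in front of each (header, rbsp) pair."""
    return b"".join(START_CODE + header + add_emulation_prevention(rbsp)
                    for header, rbsp in nal_units)

def from_bytestream(stream: bytes, header_len: int = 2):
    """Split on start codes and recover the (header, rbsp) pairs."""
    units = []
    for chunk in stream.split(START_CODE)[1:]:
        units.append((chunk[:header_len],
                      remove_emulation_prevention(chunk[header_len:])))
    return units

# Hypothetical NAL units whose payloads would otherwise emulate a start code.
units = [(b"\x40\x01", b"\x00\x00\x01\x80"),
         (b"\x42\x01", b"\x11\x00\x00\x00\x02\x80")]
stream = to_bytestream(units)
```

- Because no payload contains a start code after emulation prevention, splitting the stream on the start code recovers the original NAL units.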
- NAL units include a header and payload.
- the NAL unit header indicates the type of the NAL unit.
- the NAL unit header indicates a scalability layer identifier (e.g. called nuh_layer_id in H.265/HEVC and H.266/VVC), which could be used e.g. for indicating spatial or quality layers, views of a multiview video, or auxiliary layers (such as depth maps or alpha planes).
- The NAL unit header includes a temporal sublayer identifier, which may be used for indicating temporal subsets of the bitstream, such as a 30-frames-per-second subset of a 60-frames-per-second bitstream.
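- As an illustration of the header fields discussed above, the following sketch parses a two-byte NAL unit header using the H.265/HEVC bit layout (1-bit forbidden_zero_bit, 6-bit nal_unit_type, 6-bit nuh_layer_id, 3-bit nuh_temporal_id_plus1). The layout is taken from the H.265/HEVC specification rather than from this description, H.266/VVC orders the fields differently, and the helper names are hypothetical.

```python
from typing import List, NamedTuple

class NalHeader(NamedTuple):
    nal_unit_type: int
    nuh_layer_id: int
    temporal_id: int

def parse_hevc_nal_header(header: bytes) -> NalHeader:
    """Unpack the 16-bit H.265/HEVC NAL unit header."""
    if len(header) < 2:
        raise ValueError("an H.265/HEVC NAL unit header has two bytes")
    value = (header[0] << 8) | header[1]
    if value >> 15:
        raise ValueError("forbidden_zero_bit must be 0")
    nal_unit_type = (value >> 9) & 0x3F
    nuh_layer_id = (value >> 3) & 0x3F
    temporal_id_plus1 = value & 0x7
    if temporal_id_plus1 == 0:
        raise ValueError("nuh_temporal_id_plus1 must not be 0")
    return NalHeader(nal_unit_type, nuh_layer_id, temporal_id_plus1 - 1)

def keep_temporal_subset(headers: List[NalHeader], max_temporal_id: int) -> List[NalHeader]:
    """Keep only NAL units at or below the requested temporal sublayer."""
    return [h for h in headers if h.temporal_id <= max_temporal_id]

# 0x40 0x01 is the header of a video parameter set (type 32) in H.265/HEVC.
vps = parse_hevc_nal_header(b"\x40\x01")
# Hypothetical coded slice header on temporal sublayer 2.
slice_header = parse_hevc_nal_header(b"\x02\x03")
subset = keep_temporal_subset([vps, slice_header], max_temporal_id=0)
```

- Dropping NAL units above a chosen temporal sublayer is one way to obtain a lower frame rate subset of the bitstream.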
- NAL units may be categorized into Video Coding Layer (VCL) NAL units and non-VCL NAL units.
- VCL NAL units are typically coded slice NAL units.
- a non-VCL NAL unit may be, for example, one of the following types: a video parameter set (VPS), a sequence parameter set (SPS), a picture parameter set (PPS), an adaptation parameter set (APS), a supplemental enhancement information (SEI) NAL unit, an access unit delimiter, an end of sequence NAL unit, an end of bitstream NAL unit, or a filler data NAL unit.
- Some coding formats specify parameter sets that may carry parameter values needed for the decoding or reconstruction of decoded pictures.
- a parameter may be defined as a syntax element of a parameter set.
- a parameter set may be defined as a syntax structure that contains parameters and that can be referred to from or activated by another syntax structure, for example, using an identifier.
- Parameters that remain unchanged through a coded video sequence may be included in a sequence parameter set.
- an SPS may be limited to apply to a layer that references the SPS, e.g. an SPS may remain valid for a coded layer video sequence.
- the sequence parameter set may optionally contain video usability information (VUI), which includes parameters that may be important for buffering, picture output timing, rendering, and resource reservation.
- a picture parameter set contains such parameters that are likely to be unchanged in several coded pictures.
- a picture parameter set may include parameters that can be referred to by the VCL NAL units of one or more coded pictures.
- a video parameter set may be defined as a syntax structure containing syntax elements that apply to zero or more entire coded video sequences and may contain parameters applying to multiple layers.
- The VPS may provide information about the dependency relationships of the layers in a bitstream, as well as other information that is applicable to all slices across all layers in the entire coded video sequence.
- a video parameter set RBSP may include parameters that can be referred to by one or more sequence parameter set RBSPs.
- An adaptation parameter set may be specified in some coding formats, such as H.266/VVC.
- An APS may be applied to one or more image segments, such as slices.
- an APS may be defined as a syntax structure containing syntax elements that apply to zero or more slices as determined by zero or more syntax elements found in slice headers or in a picture header.
- An APS may comprise a type (aps_params_type in H.266/VVC) and an identifier (aps_adaptation_parameter_set_id in H.266/VVC). The combination of an APS type and an APS identifier may be used to identify a particular APS.
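- The identification of an APS by the combination of its type and identifier may be sketched as a lookup table keyed by that pair. The class and method names below are hypothetical, and the enumeration members do not reproduce the numeric aps_params_type values of H.266/VVC.

```python
from enum import Enum, auto

class ApsType(Enum):
    """Illustrative APS types; numeric values are not those of H.266/VVC."""
    ALF = auto()
    LMCS = auto()
    SCALING_LIST = auto()

class ApsStore:
    """Holds received adaptation parameter sets, keyed by (type, identifier)."""

    def __init__(self):
        self._sets = {}

    def store(self, aps_type: ApsType, aps_id: int, params: dict) -> None:
        # A later APS with the same type and identifier replaces the earlier one.
        self._sets[(aps_type, aps_id)] = params

    def lookup(self, aps_type: ApsType, aps_id: int) -> dict:
        return self._sets[(aps_type, aps_id)]

store = ApsStore()
store.store(ApsType.ALF, 0, {"note": "hypothetical ALF parameters"})
store.store(ApsType.LMCS, 0, {"note": "hypothetical LMCS parameters"})
```

- The same identifier value can therefore be used for two APSs of different types without ambiguity.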
- H.266/VVC comprises three APS types: an adaptive loop filtering (ALF) APS type, a luma mapping with chroma scaling (LMCS) APS type, and a scaling list APS type.
- the ALF APS(s) are referenced from a slice header (thus, the referenced ALF APSs can change slice by slice)
- the LMCS and scaling list APS(s) are referenced from a picture header (thus, the referenced LMCS and scaling list APSs can change picture by picture).
- the APS RBSP has the following syntax:
- Video coding specifications may enable the use of supplemental enhancement information (SEI) messages or alike.
- Some video coding specifications include SEI NAL units, and some video coding specifications contain both prefix SEI NAL units and suffix SEI NAL units.
- a prefix SEI NAL unit can start a picture unit or alike; and a suffix SEI NAL unit can end a picture unit or alike.
- an SEI NAL unit may equivalently refer to a prefix SEI NAL unit or a suffix SEI NAL unit.
- An SEI NAL unit includes one or more SEI messages, which are not required for the decoding of output pictures but may assist in related processes, such as picture output timing, post-processing of decoded pictures, rendering, error detection, error concealment, and resource reservation.
- SEI messages are specified in H.264/AVC, H.265/HEVC, H.266/VVC, and H.274/VSEI standards, and the user data SEI messages enable organizations and companies to specify SEI messages for specific use.
- the standards may contain the syntax and semantics for the specified SEI messages but a process for handling the messages in the recipient might not be defined. Consequently, encoders may be required to follow the standard specifying a SEI message when they create SEI message(s), and decoders might not be required to process SEI messages for output order conformance.
- One of the reasons to include the syntax and semantics of SEI messages in standards is to allow different system specifications to interpret the supplemental information identically and hence interoperate. It is intended that system specifications can require the use of particular SEI messages both in the encoding end and in the decoding end, and additionally the process for handling particular SEI messages in the recipient can be specified.
- FIG. 6 depicts an example of such a system 600 that includes a source 602 of media data and associated metadata.
- the source 602 may be, in an embodiment, a server. However, the source may be embodied in other manners when desired.
- the source 602 is configured to stream the media data and associated metadata to a client device 604.
- the client device may be embodied by a media player, a multimedia system, a video system, a smart phone, a mobile telephone or other user equipment, a personal computer, a tablet computer or any other computing device configured to receive and decompress the media data and process associated metadata.
- media data and metadata are streamed via a network 606, such as any of a wide variety of types of wireless networks and/or wireline networks.
- the client device is configured to receive structured information containing media, metadata and any other relevant representation of information containing the media and the metadata and to decompress the media data and process the associated metadata (e.g. for proper playback timing of decompressed media data).
- An apparatus 700 is provided in accordance with an example embodiment as shown in FIG. 7.
- the apparatus of FIG. 7 may be embodied by the source 602, such as a file writer which, in turn, may be embodied by a server, that is configured to stream a compressed representation of the media data and associated metadata.
- the apparatus may be embodied by the client device 604, such as a file reader which may be embodied, for example, by any of the various computing devices described above.
- the apparatus of an example embodiment includes, is associated with or is in communication with a processing circuitry 702, one or more memory devices 704, a communication interface 706 and optionally a user interface.
- the processing circuitry 702 may be in communication with the memory device 704 via a bus for passing information among components of the apparatus 700.
- The memory device may be non-transitory and may include, for example, one or more volatile and/or non-volatile memories.
- the memory device may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processing circuitry).
- the memory device may be configured to store information, data, content, applications, instructions, or the like for enabling the apparatus to carry out various functions in accordance with an example embodiment.
- the memory device could be configured to buffer input data for processing by the processing circuitry. Additionally or alternatively, the memory device could be configured to store instructions for execution by the processing circuitry.
- the apparatus 700 may, in some embodiments, be embodied in various computing devices as described above. However, in some embodiments, the apparatus may be embodied as a chip or chip set. In other words, the apparatus may comprise one or more physical packages (e.g., chips) including materials, components and/or wires on a structural assembly (e.g., a baseboard). The structural assembly may provide physical strength, conservation of size, and/or limitation of electrical interaction for component circuitry included thereon. The apparatus may therefore, in some cases, be configured to implement an embodiment on a single chip or as a single ‘system on a chip.’ As such, in some cases, a chip or chipset may constitute means for performing one or more operations for providing the functionalities described herein.
- the processing circuitry 702 may be embodied in a number of different ways.
- the processing circuitry may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing element with or without an accompanying DSP, or various other circuitry including integrated circuits such as, for example, an ASIC (application specific integrated circuit), an FPGA (field programmable gate array), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like.
- the processing circuitry may include one or more processing cores configured to perform independently.
- a multi-core processing circuitry may enable multiprocessing within a single physical package.
- the processing circuitry may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading.
- the processing circuitry 702 may be configured to execute instructions stored in the memory device 704 or otherwise accessible to the processing circuitry. Alternatively or additionally, the processing circuitry may be configured to execute hard coded functionality. As such, whether configured by hardware or software methods, or by a combination thereof, the processing circuitry may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment while configured accordingly. Thus, for example, when the processing circuitry is embodied as an ASIC, FPGA or the like, the processing circuitry may be specifically configured hardware for conducting the operations described herein.
- the instructions may specifically configure the processing circuitry to perform the algorithms and/or operations described herein when the instructions are executed.
- the processing circuitry may be a processor of a specific device (e.g., an image or video processing system) configured to employ an embodiment by further configuration of the processing circuitry by instructions for performing the algorithms and/or operations described herein.
- the processing circuitry may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processing circuitry.
- the communication interface 706 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data, including video bitstreams.
- the communication interface may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network. Additionally or alternatively, the communication interface may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s).
- the communication interface may alternatively or also support wired communication.
- the communication interface may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms.
- the apparatus 700 may optionally include a user interface that may, in turn, be in communication with the processing circuitry 702 to provide output to a user, such as by outputting an encoded video bitstream and, in some embodiments, to receive an indication of a user input.
- the user interface may include a display and, in some embodiments, may also include a keyboard, a mouse, a joystick, a touch screen, touch areas, soft keys, a microphone, a speaker, or other input/output mechanisms.
- the processing circuitry may comprise user interface circuitry configured to control at least some functions of one or more user interface elements such as a display and, in some embodiments, a speaker, ringer, microphone and/or the like.
- the processing circuitry and/or user interface circuitry comprising the processing circuitry may be configured to control one or more functions of one or more user interface elements through computer program instructions (e.g., software and/or firmware) stored on a memory accessible to the processing circuitry (e.g., memory device, and/or the like).
- a neural network is a computation graph including several layers of computation. Each layer includes one or more units, where each unit performs a computation. A unit is connected to one or more other units, and a connection may be associated with a weight. The weight may be used for scaling the signal passing through an associated connection. Weights are learnable parameters, for example, values which can be learned from training data. There may be other learnable parameters, such as those of batch-normalization layers.
- Feed-forward neural networks are such that there is no feedback loop, each layer takes input from one or more of the previous layers, and provides its output as the input for one or more of the subsequent layers. Also, units inside a certain layer take input from units in one or more of preceding layers and provide output to one or more of following layers.
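- A feed-forward computation of the kind described above may be sketched as follows, with each layer scaling its inputs by connection weights and passing its output to the next layer. The layer sizes, the weight values and the use of a ReLU non-linearity are illustrative assumptions.

```python
def relu(values):
    """Element-wise non-linearity applied by each hidden unit."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One layer: every unit forms a weighted sum of its inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x, layers):
    """Pass the input through the layers in order, with no feedback loop."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return dense(x, weights, biases)

# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 2.0]], [0.5]),
]
output = forward([2.0, 1.0], layers)
```

- The weights and biases in `layers` are the learnable parameters that a training process would adjust.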
- Initial layers, those close to the input data, extract semantically low-level features, for example, edges and textures in images, and intermediate and final layers extract more high-level features.
- After the feature extraction layers, there may be one or more layers performing a certain task, for example, classification, semantic segmentation, object detection, denoising, style transfer, super-resolution, and the like.
- In recurrent neural networks, there is a feedback loop, so that the neural network becomes stateful, for example, it is able to memorize information or a state.
- Neural networks are being utilized in an ever-increasing number of applications for many different types of devices, for example, mobile phones, chat bots, IoT devices, smart cars, voice assistants, and the like. Some of these applications include, but are not limited to, image and video analysis and processing, social media data analysis, device usage data analysis, and the like.
- One of the properties of neural networks, and other machine learning tools, is that they are able to learn properties from input data, either in a supervised way or in an unsupervised way. Such learning is a result of a training algorithm, or of a meta-level neural network providing the training signal.
- the training algorithm includes changing some properties of the neural network so that its output is as close as possible to a desired output.
- For example, in the case of classifying objects in images, the output of the neural network can be used to derive a class or category index which indicates the class or category that the object in the input image belongs to.
- Training usually happens by minimizing or decreasing the output error, also referred to as the loss. Examples of losses are mean squared error, cross-entropy, and the like.
- training is an iterative process, where at each iteration the algorithm modifies the weights of the neural network to make a gradual improvement in the network’s output, for example, gradually decrease the loss.
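- The iterative character of training may be illustrated with plain gradient descent on a mean squared error loss. The model here is a single linear unit rather than a full neural network, and the learning rate, iteration count and data are illustrative assumptions.

```python
def mse(w, b, data):
    """Mean squared error of the model y = w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(data, learning_rate=0.05, iterations=500):
    """Gradient descent: each iteration nudges the weights to reduce the loss."""
    w, b = 0.0, 0.0
    history = []
    n = len(data)
    for _ in range(iterations):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        history.append(mse(w, b, data))
    return w, b, history

# Hypothetical training data drawn from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b, history = train(data)
```

- The recorded `history` shows the loss decreasing gradually over the iterations as the weights approach the values that generated the data.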
- Training a neural network is an optimization process, but the final goal is different from the typical goal of optimization. In optimization, the only goal is to minimize a function.
- In machine learning, by contrast, the goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset. In other words, the goal is to use a limited training dataset in order to learn to generalize to previously unseen data, for example, data which was not used for training the model. This is usually referred to as generalization.
- data is usually split into at least two sets, the training set and the validation set.
- the training set is used for training the network, for example, to modify its learnable parameters in order to minimize the loss.
- the validation set is used for checking the performance of the network on data, which was not used to minimize the loss, as an indication of the final performance of the model.
- the errors on the training set and on the validation set are monitored during the training process to understand the following:
- The training set error should decrease; otherwise, the model is in the regime of underfitting.
- The validation set error needs to decrease and should not be much higher than the training set error. For example, the validation set error should be less than 20% higher than the training set error.
- If the training set error is low, for example 10% of its value at the beginning of training, or with respect to a threshold that may have been determined based on an evaluation metric, but the validation set error is much higher than the training set error, or it does not decrease, or it even increases, the model is in the regime of overfitting. This means that the model has just memorized properties of the training set and performs well only on that set, but performs poorly on a set not used for training or tuning of its parameters.
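- The monitoring rules above may be condensed into a simple check. The function below is a rough heuristic sketch that reuses the example figures from the text (training error at 10% of its initial value, validation error less than 20% above the training error) as default thresholds; the function name and the reduction of the rules to two comparisons are assumptions.

```python
def fitting_regime(initial_train_err, train_err, val_err,
                   low_ratio=0.1, gap_ratio=0.2):
    """Classify the training state from the monitored error values."""
    # The training error has not yet fallen to a low level.
    if train_err > low_ratio * initial_train_err:
        return "underfitting"
    # The training error is low but the validation error is much higher.
    if val_err > (1.0 + gap_ratio) * train_err:
        return "overfitting"
    return "generalizing"

# Hypothetical error readings taken at three points during training.
early = fitting_regime(1.0, 0.5, 0.6)
memorized = fitting_regime(1.0, 0.05, 0.2)
healthy = fitting_regime(1.0, 0.05, 0.055)
```

- A fuller check would also track whether the validation error is still decreasing from one evaluation to the next.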
- neural networks have been used for compressing and de-compressing data such as images.
- the most widely used architecture for such task is the auto-encoder, which is a neural network including two parts: a neural encoder and a neural decoder.
- In this description, the neural encoder and the neural decoder are referred to as the encoder and the decoder, even though these refer to algorithms which are learned from data instead of being tuned manually.
- The encoder takes an image as an input and produces a code that represents the input image and requires fewer bits than the input image. This code may have been obtained by a binarization or quantization process after the encoder.
- the decoder takes in this code and reconstructs the image which was input to the encoder.
- Such encoder and decoder are usually trained to minimize a combination of bitrate and distortion, where the distortion may be based on one or more of the following metrics: mean squared error (MSE), peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), or the like.
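- The distortion metrics and the combined training objective mentioned above may be sketched as follows. The weighting of rate against distortion by a factor `lam`, the function names and the 8-bit peak value are illustrative assumptions.

```python
import math

def mse(original, reconstructed):
    """Mean squared error between two equally sized sample lists."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical signals."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)

def rate_distortion_loss(original, reconstructed, rate_bits, lam):
    """Combined objective: distortion plus weighted bitrate."""
    return mse(original, reconstructed) + lam * rate_bits

# Hypothetical pixel rows before and after a lossy round trip.
original = [52.0, 55.0, 61.0, 66.0]
reconstructed = [50.0, 56.0, 60.0, 68.0]
loss = rate_distortion_loss(original, reconstructed, rate_bits=12, lam=0.1)
```

- A lower MSE corresponds to a higher PSNR, so minimizing the first term of the loss favors reconstruction quality while the second term favors a shorter code.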
- The terms ‘model’, ‘neural network’, ‘neural net’ and ‘network’ may be used interchangeably, and the weights of neural networks may sometimes be referred to as learnable parameters or as parameters.
- A video codec includes an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can decompress the compressed video representation back into a viewable form.
- an encoder discards some information in the original video sequence in order to represent the video in a more compact form, for example, at lower bitrate.
- Typical hybrid video codecs encode the video information in two phases. Firstly, pixel values in a certain picture area (or ‘block’) are predicted, for example, by motion compensation means or circuits (by finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means or circuit (by using the pixel values around the block to be coded in a specified manner). Secondly the prediction error, e.g. the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (e.g.
- the encoder may control the balance between the accuracy of the pixel representation (e.g., picture quality) and size of the resulting coded video representation (e.g., file size or transmission bitrate).
- the pixel values may be predicted by using spatial prediction techniques. This prediction technique uses the pixel values around the block to be coded in a specified manner. Secondly, the prediction error, for example, the difference between the predicted block of pixels and the original block of pixels is coded.
- encoder can control the balance between the accuracy of the pixel representation, for example, picture quality and size of the resulting coded video representation, for example, file size or transmission bitrate.
- Inter prediction, which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, exploits temporal redundancy.
- In inter prediction, the sources of prediction are previously decoded pictures.
- Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in the spatial or transform domain, for example, either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra coding, where no inter prediction is applied.
- One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently when they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
- The decoder reconstructs the output video by applying prediction techniques similar to those of the encoder to form a predicted representation of the pixel blocks, for example, using the motion or spatial information created by the encoder and stored in the compressed representation, and by applying prediction error decoding, which is the inverse operation of the prediction error coding and recovers the quantized prediction error signal in the spatial pixel domain. After applying the prediction and prediction error decoding techniques, the decoder sums up the prediction and prediction error signals, for example, pixel values, to form the output video frame.
- the decoder and encoder can also apply additional filtering techniques to improve the quality of the output video before passing it for display and/or storing it as prediction reference for the forthcoming frames in the video sequence.
- the motion information is indicated with motion vectors associated with each motion compensated image block.
- Each of these motion vectors represents the displacement between the image block in the picture to be coded on the encoder side, or decoded on the decoder side, and the prediction source block in one of the previously coded or decoded pictures.
- the motion vectors are typically coded differentially with respect to block specific predicted motion vectors.
- the predicted motion vectors are created in a predefined way, for example, calculating the median of the encoded or decoded motion vectors of the adjacent blocks.
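- Median-based motion vector prediction with differential coding may be sketched as follows. The choice of three neighboring blocks, the component-wise median and the function names are illustrative assumptions.

```python
from statistics import median

def predict_mv(neighbor_mvs):
    """Component-wise median of the motion vectors of adjacent blocks."""
    return (median(mv[0] for mv in neighbor_mvs),
            median(mv[1] for mv in neighbor_mvs))

def encode_mv(mv, neighbor_mvs):
    """Code only the difference relative to the predicted motion vector."""
    px, py = predict_mv(neighbor_mvs)
    return (mv[0] - px, mv[1] - py)

def decode_mv(mvd, neighbor_mvs):
    """Rebuild the motion vector from the same predictor and the difference."""
    px, py = predict_mv(neighbor_mvs)
    return (mvd[0] + px, mvd[1] + py)

# Hypothetical motion vectors of three already coded adjacent blocks.
neighbors = [(4, -2), (6, 0), (5, 3)]
predictor = predict_mv(neighbors)
mvd = encode_mv((7, 1), neighbors)
restored = decode_mv(mvd, neighbors)
```

- Because the encoder and the decoder derive the predictor from the same previously coded blocks, only the typically small difference needs to be transmitted.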
- Another way to create motion vector predictions is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and signaling the chosen candidate as the motion vector predictor.
- In addition to predicting the motion vector values, the reference index of a previously coded/decoded picture can be predicted.
- The reference index is typically predicted from adjacent blocks and/or co-located blocks in a temporal reference picture.
- typical high efficiency video codecs employ an additional motion information coding/decoding mechanism, often called merging/merge mode, where all the motion field information, which includes motion vector and corresponding reference picture index for each available reference picture list, is predicted and used without any modification/correction.
- Predicting the motion field information is carried out using the motion field information of adjacent blocks and/or co-located blocks in temporal reference pictures, and the used motion field information is signaled among a motion field candidate list filled with the motion field information of available adjacent/co-located blocks.
- the prediction residual after motion compensation is first transformed with a transform kernel, for example, DCT and then coded.
- Typical video encoders utilize Lagrangian cost functions to find optimal coding modes, for example, the desired macroblock mode and associated motion vectors.
- This kind of cost function uses a weighting factor λ to tie together the exact or estimated image distortion due to lossy coding methods and the exact or estimated amount of information that is required to represent the pixel values in an image area:
- $C = D + \lambda R$ (Equation 1)
- where C is the Lagrangian cost to be minimized, D is the image distortion, for example, mean squared error, with the mode and motion vectors considered, and R is the number of bits needed to represent the required data to reconstruct the image block in the decoder, including the amount of data to represent the candidate motion vectors.
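- Mode selection with a Lagrangian cost of the form C = D + λR may be sketched as follows. The candidate modes, their distortion and rate figures and the function names are hypothetical.

```python
def lagrangian_cost(distortion, rate_bits, lam):
    """Cost C = D + lam * R for one candidate coding mode."""
    return distortion + lam * rate_bits

def choose_mode(candidates, lam):
    """Return the candidate with the lowest Lagrangian cost."""
    return min(candidates,
               key=lambda c: lagrangian_cost(c["distortion"], c["rate_bits"], lam))

# Hypothetical candidates for one block: distortion as MSE, rate in bits.
candidates = [
    {"mode": "skip", "distortion": 120.0, "rate_bits": 2},
    {"mode": "inter", "distortion": 40.0, "rate_bits": 30},
    {"mode": "intra", "distortion": 25.0, "rate_bits": 90},
]
low_lambda_choice = choose_mode(candidates, lam=0.1)["mode"]
mid_lambda_choice = choose_mode(candidates, lam=1.0)["mode"]
high_lambda_choice = choose_mode(candidates, lam=10.0)["mode"]
```

- A small weighting factor favors the low-distortion candidate, while a large one favors the candidate that costs the fewest bits.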
- In SEI message specifications, the SEI messages are generally not extended in future amendments or versions of the standard.
- Conventional image and video codecs may use a set of filters to enhance the visual quality of the predicted and error-compensated visual content and can be applied either in-loop or out-of-loop, or both.
- For in-loop filters, a filter applied on one block in the currently encoded or currently decoded frame will affect the encoding or decoding of another block in the same frame and/or in another frame which is predicted or processed based at least on the current frame.
- An in-loop filter can affect the bitrate and/or the visual quality.
- An enhanced block may cause a smaller residual, e.g., a smaller difference between the original block and the filtered block, thus using fewer bits in the bitstream output by the encoder.
- An out-of-loop filter, or post-processing filter may be applied on a frame or part of a frame after it has been reconstructed; the filtered visual content may not be used for decoding other content.
- NNs neural networks
- NNs are used to replace or are used as an addition to one or more of the components of a traditional codec such as VVC/H.266.
- traditional means those codecs whose components and parameters are typically not learned from data by means of a training process, for example, those codecs whose components are not neural networks.
- Some examples of uses of neural networks within a traditional codec include but are not limited to:
- Additional in-loop filter for example, by having the NN as an additional in-loop filter with respect to the traditional loop filters;
- Intra-frame prediction for example, as an additional intra-frame prediction mode, or replacing the traditional intra-frame prediction
- Inter-frame prediction for example, as an additional inter-frame prediction mode, or replacing the traditional inter-frame prediction
- Transform and/or inverse transform for example, as an additional transform and/or inverse transform, or replacing the traditional transform and/or inverse transform;
- Probability model for the arithmetic codec for example, as an additional probability model, or replacing the traditional probability model.
- FIG. 8 illustrates examples of functioning of NNs as components of a pipeline of traditional codec, in accordance with an embodiment.
- FIG. 8 illustrates an encoder, which also includes a decoding loop.
- FIG. 8 is shown to include components described below:
- a luma intra pred block or circuit 801. This block or circuit performs intra prediction in the luma domain, for example, by using already reconstructed data from the same frame.
- the operation of the luma intra pred block or circuit 801 may be performed by a deep neural network such as a convolutional auto-encoder.
- a chroma intra pred block or circuit 802. This block or circuit performs intra prediction in the chroma domain, for example, by using already reconstructed data from the same frame.
- the chroma intra pred block or circuit 802 may perform cross-component prediction, for example, predicting chroma from luma.
- the operation of the chroma intra pred block or circuit 802 may be performed by a deep neural network such as a convolutional autoencoder.
- the intra pred block or circuit 803 and the inter-pred block or circuit 804 may perform the prediction on all components, for example, luma and chroma.
- the operations of the intra pred block or circuit 803 and the inter-pred block or circuit 804 may be performed by two or more deep neural networks such as convolutional auto-encoders.
- a probability estimation block or circuit 805 for entropy coding. This block or circuit performs prediction of probability for the next symbol to encode or decode, which is then provided to the entropy coding module 812, such as an arithmetic coding module, to encode or decode the next symbol.
- the operation of the probability estimation block or circuit 805 may be performed by a neural network.
- a transform and quantization (T/Q) block or circuit 806. These are actually two blocks or circuits.
- the transform and quantization block or circuit 806 may perform a transform of input data to a different domain, for example, the FFT transform would transform the data to frequency domain.
- the transform and quantization block or circuit 806 may quantize its input values to a smaller set of possible values.
- there may be an inverse quantization block or circuit and an inverse transform block or circuit Q⁻¹/T⁻¹ 806a.
- One or both of the transform block or circuit and quantization block or circuit may be replaced by one or two or more neural networks.
- One or both of the inverse transform block or circuit and inverse quantization block or circuit 813 may be replaced by one or two or more neural networks.
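The transform and quantization steps, together with their inverses, can be sketched on a one-dimensional block of residual samples. This is only an illustration: it uses an orthonormal DCT (mentioned earlier as an example transform kernel) and a uniform quantizer with a hypothetical step size, and it does not represent the blocks or circuits of FIG. 8 themselves.

```python
import math

def _scale(k, n):
    """Normalization factor that makes the DCT orthonormal."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(x):
    """Forward transform T: orthonormal 1-D DCT-II."""
    n = len(x)
    return [_scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)) for k in range(n)]

def idct(c):
    """Inverse transform: exact inverse of dct()."""
    n = len(c)
    return [sum(_scale(k, n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
                for k in range(n)) for i in range(n)]

def quantize(coeffs, step):
    """Quantization Q: map each coefficient to an integer level."""
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    """Inverse quantization: map integer levels back to coefficient values."""
    return [level * step for level in levels]

residual = [4.0, 3.0, 5.0, 4.0, -1.0, 0.0, 1.0, 0.0]  # hypothetical residual block
step = 2.0                                            # hypothetical quantization step
levels = quantize(dct(residual), step)
reconstructed = idct(dequantize(levels, step))
print(levels)
print([round(v, 2) for v in reconstructed])
```

Only the quantization step loses information; the transform alone is invertible up to floating-point error.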
- An in-loop filter block or circuit 807. The operation of the in-loop filter block or circuit 807 is performed in the decoding loop, and it performs filtering on the output of the inverse transform block or circuit, or on the reconstructed data, in order to enhance the reconstructed data with respect to one or more predetermined quality metrics. This filter may affect both the quality of the decoded data and the bitrate of the bitstream output by the encoder.
- the operation of the in-loop filter block or circuit 807 may be performed by a neural network, such as a convolutional auto-encoder. In examples, the operation of the in-loop filter may be performed by multiple steps or filters, where one or more of the steps may be performed by neural networks.
- a post-processing filter block or circuit 808 may be performed only at decoder side, as it may not affect the encoding process.
- the post-processing filter block or circuit 808 filters the reconstructed data output by the in-loop filter block or circuit 807, in order to enhance the reconstructed data.
- the post-processing filter block or circuit 808 may be replaced by a neural network, such as a convolutional auto-encoder.
- a resolution adaptation block or circuit 809. This block or circuit may downsample the input video frames, prior to encoding. Then, in the decoding loop, the reconstructed data may be upsampled, by the upsampling block or circuit 810, to the original resolution.
- the operation of the resolution adaptation block or circuit 809 may be performed by a neural network such as a convolutional auto-encoder.
- An encoder control block or circuit 811. This block or circuit performs optimization of the encoder's parameters, such as what transform to use, what quantization parameters (QP) to use, what intra-prediction mode (out of N intra-prediction modes) to use, and the like.
- the operation of the encoder control block or circuit 811 may be performed by a neural network, such as a classifier convolutional network, or such as a regression convolutional network.
- An ME/MC block or circuit 814 performs motion estimation and/or motion compensation, which are two key operations to be performed when performing inter-frame prediction.
- ME/MC stands for motion estimation / motion compensation
- In end-to-end learned compression, NNs are used as the main components of the image/video codecs.
- Option 1: re-use the video coding pipeline but replace most or all of the components with NNs.
- FIG. 9 illustrates an example of a modified video coding pipeline based on neural networks, in accordance with an embodiment.
- An example of a neural network may include, but is not limited to, a compressed representation of a neural network.
- FIG. 9 is shown to include following components:
- a neural transform block or circuit 902. This block or circuit transforms the output of a summation/subtraction operation 903 to a new representation of that data, which may have lower entropy and thus be more compressible.
- a quantization block or circuit 904. This block or circuit quantizes an input data 901 to a smaller set of possible values.
- This block or circuit may control and optimize some or all the parameters of the encoding process, such as parameters of one or more of the encoding blocks or circuits.
- An entropy coding block or circuit 910 This block or circuit may perform lossless coding, for example, based on entropy.
- One popular entropy coding technique is arithmetic coding.
- a neural intra-codec block or circuit 912. This block or circuit may be an image compression and decompression block or circuit, which may be used to encode and decode an intra frame.
- An encoder 914 may be an encoder block or circuit, such as the neural encoder part of an auto-encoder neural network.
- a decoder 916 may be a decoder block or circuit, such as the neural decoder part of an auto-encoder neural network.
- An intra-coding block or circuit 918 may be a block or circuit performing some intermediate steps between encoder and decoder, such as quantization, entropy encoding, entropy decoding, and/or inverse quantization.
- a deep loop filter block or circuit 920 This block or circuit performs filtering of reconstructed data, in order to enhance it.
- a decode picture buffer block or circuit 922 This block or circuit is a memory buffer, keeping the decoded frame, for example, reconstructed frames 924 and enhanced reference frames 926 to be used for inter prediction.
- An inter-prediction block or circuit 928 This block or circuit performs inter-frame prediction, for example, predicts from frames, for example, frames 932, which are temporally nearby.
- An ME/MC 930 performs motion estimation and/or motion compensation, which are two key operations to be performed when performing inter-frame prediction.
- ME/MC stands for motion estimation / motion compensation.
- In order to train the neural networks of this system, a training objective function, referred to as ‘training loss’, is typically utilized, which usually comprises one or more terms, or loss terms, or simply losses. Although Option 2 and FIG. 10 are considered here as an example for describing the training objective function, a similar training objective function may also be used for training the neural networks for the systems in FIG. 6 and FIG. 7.
- the training loss comprises a reconstruction loss term and a rate loss term. The reconstruction loss encourages the system to decode data that is similar to the input data, according to some similarity metric.
- Examples of reconstruction losses are: a loss derived from mean squared error (MSE); a loss derived from multi-scale structural similarity (MS-SSIM), such as 1 minus MS-SSIM, or 1 - MS-SSIM; losses derived from the use of a pretrained neural network.
- MSE mean squared error
- MS-SSIM multi-scale structural similarity
- error(f1, f2), where f1 and f2 are the features extracted by a pretrained neural network for the input (uncompressed) data and the decoded (reconstructed) data, respectively, and error() is an error or distance function, such as L1 norm or L2 norm
- losses derived from the use of a neural network that is trained simultaneously with the end-to-end learned codec. For example, an adversarial loss can be used, which is the loss provided by a discriminator neural network that is trained adversarially with respect to the codec, following the settings proposed in the context of generative adversarial networks (GANs) and their variants.
- GANs
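Two of the reconstruction losses listed above can be sketched in a few lines: a pixel-wise MSE, and a feature-based error(f1, f2) using an L1 or L2 distance. The function `toy_features` is a hypothetical stand-in for a pretrained feature extractor, and the sample values are invented for illustration.

```python
def mse(original, decoded):
    """Mean squared error between two equally sized signals."""
    return sum((x - y) ** 2 for x, y in zip(original, decoded)) / len(original)

def l1_distance(f1, f2):
    """L1 norm of the difference between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(f1, f2))

def l2_distance(f1, f2):
    """L2 norm of the difference between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(f1, f2)) ** 0.5

def toy_features(signal):
    """Hypothetical feature extractor: mean level and mean absolute gradient."""
    mean = sum(signal) / len(signal)
    edges = [abs(signal[i + 1] - signal[i]) for i in range(len(signal) - 1)]
    return [mean, sum(edges) / len(edges)]

original = [1.0, 2.0, 3.0, 4.0]
decoded = [1.0, 2.0, 2.0, 4.0]

print(mse(original, decoded))
print(l1_distance(toy_features(original), toy_features(decoded)))
print(l2_distance(toy_features(original), toy_features(decoded)))
```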
- the rate loss encourages the system to compress the output of the encoding stage, such as the output of the arithmetic encoder. ‘Compressing’ for example, means reducing the number of bits output by the encoding stage.
- rate loss typically encourages the output of the Encoder NN to have low entropy.
- the rate loss may be computed on the output of the Encoder NN, or on the output of the quantization operation, or on the output of the probability model. Following are some examples of rate losses:
- a differentiable estimate of the entropy; a sparsification loss, for example, a loss that encourages the output of the Encoder NN or the output of the quantization to have many zeros. Examples are L0 norm, L1 norm, L1 norm divided by L2 norm; and
- one or more of reconstruction losses may be used, and one or more of rate losses may be used.
- the loss terms may then be combined for example as a weighted sum to obtain the training objective function.
- the different loss terms are weighted using different weights, and these weights determine how the final system performs in terms of rate-distortion loss. For example, when more weight is given to one or more of the reconstruction losses with respect to the rate losses, the system may learn to compress less but to reconstruct with higher accuracy as measured by a metric that correlates with the reconstruction losses.
- These weights are usually considered to be hyperparameters of the training session and may be set manually by the operator designing the training session, or automatically for example by grid search or by using additional neural networks.
- video is considered as data type in various embodiments. However, it would be understood that the embodiments are also applicable to other media items, for example, images and audio data.
- Option 2 is illustrated in FIG. 10, and it includes a different type of codec architecture.
- FIG. 10 illustrates an example neural network-based end-to-end learned video coding system, in accordance with an example embodiment.
- a neural network-based end-to-end learned video coding system 1000 includes an encoder 1001, a quantizer 1002, a probability model 1003, an entropy codec 1004, for example, an arithmetic encoder 1005 and an arithmetic decoder 1006, a dequantizer 1007, and a decoder 1008.
- the encoder 1001 and the decoder 1008 are typically two neural networks, or mainly comprise neural network components.
- the probability model 1003 may also mainly comprise neural network components.
- the quantizer 1002, the dequantizer 1007, and the entropy codec 1004 are typically not based on neural network components, but they may also potentially comprise neural network components.
- the encoder, quantizer, probability model, entropy codec, arithmetic encoder, arithmetic decoder, dequantizer, and decoder may also be referred to as an encoder component, quantizer component, probability model component, entropy codec component, arithmetic encoder component, arithmetic decoder component, dequantizer component, and decoder component respectively.
- the encoder 1001 takes a video/image as an input 1009 and converts the video/image in original signal space into a latent representation that may comprise a more compressible representation of the input.
- the latent representation is normally a 3-dimensional tensor for image compression, where 2 dimensions represent spatial information, and the third dimension contains information at that specific location.
- the latent representation is a tensor of dimensions (or ‘shape’) 64x64x32 (e.g., with horizontal size of 64 elements, vertical size of 64 elements, and 32 channels).
- the channel dimension may be the first dimension, so for the above example, the shape of the input tensor may be represented as 3x128x128, instead of 128x128x3.
- the quantizer 1002 quantizes the latent representation into discrete values given a predefined set of quantization levels.
- the probability model 1003 and the arithmetic encoder 1005 work together to perform lossless compression for the quantized latent representation and generate bitstreams to be sent to the decoder side. Given a symbol to be encoded to the bitstream, the probability model 1003 estimates the probability distribution of all possible values for that symbol based on a context that is constructed from available information at the current encoding/decoding state, such as the data that has already been encoded/decoded.
- the arithmetic encoder 1005 encodes the input symbols to bitstream using the estimated probability distributions.
- the arithmetic decoder 1006 and the probability model 1003 first decode symbols from the bitstream to recover the quantized latent representation. Then, the dequantizer 1007 reconstructs the latent representation in continuous values and passes it to the decoder 1008 to recover the input video/image. The recovered input video/image is provided as an output 1010.
- the probability model 1003, in this system 1000 is shared between the arithmetic encoder 1005 and the arithmetic decoder 1006. In practice, this means that a copy of the probability model 1003 is used at the arithmetic encoder 1005 side, and another exact copy is used at the arithmetic decoder 1006 side.
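The interplay between the quantizer and a shared probability model can be illustrated with the ideal code length of an arithmetic coder, which is approximately the sum of -log2 p(symbol) over the coded symbols. The sketch below makes simplifying assumptions: the quantization levels, latent values, and probabilities are hypothetical, and the probability model is a fixed table rather than the context-dependent model described above.

```python
import math

def quantize(latent, levels):
    """Map each continuous latent value to the nearest predefined level."""
    return [min(levels, key=lambda q: abs(q - v)) for v in latent]

def ideal_code_length(symbols, prob):
    """Bits an ideal entropy coder would spend: sum of -log2 p(symbol)."""
    return sum(-math.log2(prob[s]) for s in symbols)

levels = [-1.0, 0.0, 1.0]                            # predefined quantization levels
latent = [0.1, -0.2, 0.9, 0.05, -1.2, 0.3, 0.0, 0.4]
symbols = quantize(latent, levels)

# The same table must be available at the encoder and at the decoder.
shared_prob = {0.0: 0.75, 1.0: 0.125, -1.0: 0.125}

bits = ideal_code_length(symbols, shared_prob)
print(symbols)
print(round(bits, 2), "bits, versus", 2 * len(symbols), "bits at 2 bits per symbol")
```

Because the decoder derives the same probabilities from its own copy of the model, it can invert the entropy coding exactly; any mismatch between the two copies would break decoding.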
- the encoder 1001, the probability model 1003, and the decoder 1008 are normally based on deep neural networks.
- the system 1000 is trained in an end-to-end manner by minimizing the following rate-distortion loss function, which may be referred to simply as training loss, or loss: $L = D + \lambda R$
- D is the distortion loss term
- R is the rate loss term
- λ is the weight that controls the balance between the two losses.
- the distortion loss term may be referred to also as reconstruction loss. It encourages the system to decode data that is similar to the input data, according to some similarity metric. Following are some examples of reconstruction losses: a loss derived from mean squared error (MSE); a loss derived from multi-scale structural similarity (MS-SSIM), such as 1 minus MS-SSIM, or 1 - MS-SSIM; losses derived from the use of a pretrained neural network.
- MSE mean squared error
- MS-SSIM multi-scale structural similarity
- error(f1, f2), where f1 and f2 are the features extracted by a pretrained neural network for the input (uncompressed) data and the decoded (reconstructed) data, respectively, and error() is an error or distance function, such as L1 norm or L2 norm; and losses derived from the use of a neural network that is trained simultaneously with the end-to-end learned codec.
- adversarial loss can be used, which is the loss provided by a discriminator neural network that is trained adversarially with respect to the codec, following the settings proposed in the context of generative adversarial networks (GANs) and their variants.
- the rate loss may be computed on the output of the encoder NN, or on the output of the quantization operation, or on the output of the probability model.
- the rate loss may comprise multiple rate losses. Following are some examples of rate losses: a differentiable estimate of the entropy of the quantized latent representation, which indicates the number of bits necessary to represent the encoded symbols, for example, bits- per-pixel (bpp); a sparsification loss, for example, a loss that encourages the output of the Encoder NN or the output of the quantization to have many zeros.
- Examples are L0 norm, L1 norm, L1 norm divided by L2 norm; and a cross-entropy loss applied to the output of a probability model, where the probability model may be a NN used to estimate the probability of the next symbol to be encoded by the arithmetic encoder 1005.
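The sparsification measures named above can be compared on two toy latent vectors, one dense and one sparse. This is a minimal sketch with invented values; note that the L0 norm and the empirical entropy computed here are not differentiable, so they serve only as reference quantities, whereas a training setup would use differentiable estimates.

```python
import math
from collections import Counter

def l0_norm(v):
    """Number of non-zero elements."""
    return sum(1 for x in v if x != 0)

def l1_norm(v):
    return sum(abs(x) for x in v)

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def l1_over_l2(v):
    """Scale-invariant sparsity measure: smaller means sparser."""
    return l1_norm(v) / l2_norm(v)

def empirical_entropy_bits(symbols):
    """Empirical entropy in bits per symbol (reference value only)."""
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())

dense = [1, -1, 1, -1]
sparse = [2, 0, 0, 0]   # same L2 norm as `dense`, but mostly zeros

for name, v in (("dense", dense), ("sparse", sparse)):
    print(name, l0_norm(v), l1_norm(v), l1_over_l2(v),
          round(empirical_entropy_bits(v), 3))
```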
- a similar training loss may be used for training the systems illustrated in FIG. 8 and FIG. 9.
- one or more of reconstruction losses may be used, and one or more of the rate losses may be used. All the loss terms may then be combined for example as a weighted sum to obtain the training objective function.
- the different loss terms are weighted using different weights, and these weights determine how the final system performs in terms of rate-distortion loss. For example, when more weight is given to one or more of the reconstruction losses with respect to the rate losses, the system may learn to compress less but to reconstruct with higher accuracy as measured by a metric that correlates with the reconstruction losses.
- These weights are usually considered to be hyper-parameters of the training session and may be set manually by the operator designing the training session, or automatically for example by grid search or by using additional neural networks.
- the rate loss and the reconstruction loss may be minimized jointly at each iteration.
- the rate loss and the reconstruction loss may be minimized alternately, e.g., in one iteration the rate loss is minimized and in the next iteration the reconstruction loss is minimized, and so on.
- the rate loss and the reconstruction loss may be minimized sequentially, e.g., first one of the two losses is minimized for a certain number of iterations, and then the other loss is minimized for another number of iterations.
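The three minimization schedules (joint, alternate, sequential) can be contrasted on a toy problem with a single trainable parameter. The loss functions, learning rate, and iteration counts below are hypothetical choices made only so that the behaviour is easy to follow; plain gradient descent is used in place of a real training pipeline.

```python
def grad_distortion(x):
    return 2.0 * (x - 3.0)   # gradient of the toy reconstruction loss D(x) = (x - 3)^2

def grad_rate(x):
    return 2.0 * x           # gradient of the toy rate loss R(x) = x^2

def train(schedule, lam=0.5, lr=0.01, iterations=2000, switch_at=1000):
    """Minimize D and lam * R under the given schedule with gradient descent."""
    x = 0.0
    for it in range(iterations):
        if schedule == "joint":          # both losses at each iteration
            g = grad_distortion(x) + lam * grad_rate(x)
        elif schedule == "alternate":    # one loss per iteration, in turn
            g = grad_distortion(x) if it % 2 == 0 else lam * grad_rate(x)
        elif schedule == "sequential":   # one loss first, then the other
            g = grad_distortion(x) if it < switch_at else lam * grad_rate(x)
        else:
            raise ValueError(schedule)
        x -= lr * g
    return x

print(train("joint"))                                       # close to 3 / (1 + lam) = 2
print(train("alternate"))                                   # stays near the joint optimum
print(train("sequential", iterations=1050, switch_at=1000))
```

In this toy setting the sequential schedule has no fixed balance point: the number of iterations spent on the second loss acts as an implicit trade-off between rate and reconstruction.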
- the system 1000 contains the probability model 1003, the arithmetic encoder 1005, and the arithmetic decoder 1006.
- when the compression is lossless, the system loss function contains only the rate loss, since the distortion loss is always zero, in other words, there is no loss of information.
- Video Coding for Machines (VCM)
- a quality metric for the decoded data may be defined, which may be different from a quality metric for human perceptual quality.
- dedicated algorithms for compressing and decompressing data for machine consumption may be different than those for compressing and decompressing data for human consumption.
- the set of tools and concepts for compressing and decompressing data for machine consumption is referred to here as Video Coding for Machines.
- the receiver or decoder-side device may have multiple ‘machines’ or neural networks (NNs) for analyzing or processing decoded data. These multiple machines may be used in a certain combination which is for example determined by an orchestrator sub-system. The multiple machines may be used for example in temporal succession, based on the output of the previously used machine, and/or in parallel. For example, a video which was compressed and then decompressed may be analyzed by one machine (NN) for detecting pedestrians, by another machine (another NN) for detecting cars, and by another machine (another NN) for estimating the depth of objects in the frames.
- NN neural network
- An ‘encoder-side device’ may encode input data, such as a video, into a bitstream which represents compressed data.
- the bitstream is provided to a ‘decoder-side device’.
- the term ‘receiver-side’ or ‘decoder-side’ refers to a physical or abstract entity or device which performs decoding of compressed data, and the decoded data may be input to one or more machines, circuits or algorithms.
- the one or more machines may not be part of the decoder.
- the one or more machines may be run by the same device running the decoder or by another device which receives the decoded data from the device running the decoder. Different machines may be run by different devices.
- the encoded video data may be stored into a memory device, for example, as a file. The stored file may later be provided to another device.
- the encoded video data may be streamed from one device to another.
- machine and neural network may be used interchangeably, and may mean any process or algorithm (e.g., learned from data or not) which analyzes or processes data for a certain task.
- the term ‘receiver-side’ or ‘decoder-side’ refers to a physical or abstract entity or device which contains one or more machines, and runs these one or more machines on some encoded and eventually decoded video representation which is encoded by another physical or abstract entity or device, e.g., ‘encoder-side device’.
- the encoder-side and decoder-side may be present in the same physical or abstract entity or device.
- FIG. 11 illustrates a pipeline of video coding for machines (VCM), in accordance with an embodiment.
- a VCM encoder 1102 encodes the input video into a bitstream 1104.
- a bitrate 1106 may be computed 1108 from the bitstream 1104 in order to evaluate the size of the bitstream 1104.
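Computing a bitrate from a bitstream amounts to dividing its size in bits by the duration or by the pixel count of the coded video. A minimal sketch follows; the bitstream size, frame count, frame rate, and resolution are hypothetical.

```python
def bitrate_kbps(bitstream: bytes, num_frames: int, fps: float) -> float:
    """Average bitrate in kilobits per second."""
    duration_s = num_frames / fps
    return len(bitstream) * 8 / duration_s / 1000.0

def bits_per_pixel(bitstream: bytes, width: int, height: int, num_frames: int) -> float:
    """Average number of bits spent per pixel."""
    return len(bitstream) * 8 / (width * height * num_frames)

bitstream = bytes(125_000)  # placeholder payload of 125,000 bytes
print(bitrate_kbps(bitstream, num_frames=50, fps=25.0))   # 500.0
print(round(bits_per_pixel(bitstream, 640, 360, 50), 4))  # 0.0868
```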
- a VCM decoder 1110 decodes the bitstream 1104 output by the VCM encoder 1102.
- An output of the VCM decoder 1110 may be referred to, for example, as decoded data for machines 1112. This data may be considered as the decoded or reconstructed video.
- the decoded data for machines 1112 may not have same or similar characteristics as the original video which was input to the VCM encoder 1102.
- this data may not be easily understandable by a human, if the human watches the decoded video from a suitable output device such as a display.
- the output of the VCM decoder 1110 is then input to one or more task neural networks (task-NNs).
- task-NN task neural network
- FIG. 11 is shown to include three example task-NNs, a task-NN 1114 for object detection, a task-NN 1116 for image segmentation, a task-NN 1118 for object tracking, and a non-specified one, a task-NN 1120 for performing task X.
- the goal of VCM is to obtain a low bitrate while guaranteeing that the task-NNs still perform well in terms of the evaluation metric associated with each task.
- FIG. 12 illustrates an example of an end-to-end learned approach, in accordance with an embodiment.
- a VCM encoder 1202 and a VCM decoder 1204 mainly include neural networks.
- the video is input to a neural network encoder 1206.
- the output of the neural network encoder 1206 is input to a lossless encoder 1208, such as an arithmetic encoder, which outputs a bitstream 1210.
- the lossless codec may take an additional input from a probability model 1212, both in the lossless encoder 1208 and in a lossless decoder 1214, which predicts the probability of the next symbol to be encoded and decoded.
- the probability model 1212 may also be learned, for example it may be a neural network.
- the bitstream 1210 is input to the lossless decoder 1214, such as an arithmetic decoder, whose output is input to a neural network decoder 1216.
- the output of the neural network decoder 1216 is a decoded data for machines 1218, that may be input to one or more task-NNs, a task-NN 1220 for object detection, a task-NN 1222 for object segmentation, a task-NN 1224 for object tracking, and a non-specified one, a task-NN 1226 for performing task X.
- FIG. 13 illustrates an example of how the end-to-end learned system may be trained, in accordance with an embodiment. For the sake of simplicity, this embodiment is explained with help of one task-NN. However, it may be understood that multiple task-NNs may be similarly used in the training process.
- a rate loss 1302 may be computed 1304 from the output of a probability model 1306. The rate loss 1302 provides an approximation of the bitrate required to encode the input video data, for example, by a neural network encoder 1308.
- a task loss 1310 may be computed 1312 from a task output 1314 of a task-NN 1316.
- the rate loss 1302 and the task loss 1310 may then be used to train 1318 the neural networks used in the system, such as the neural network encoder 1308, the probability model 1306, and a neural network decoder 1320.
- Training may be performed by first computing gradients of each loss with respect to the trainable parameters of the neural networks that are contributing or affecting the computation of that loss.
- the gradients are then used by an optimization method, such as Adam, for updating the trainable parameters of the neural networks.
- Adam an optimization method
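The training step described above, in which gradients of each loss are computed only with respect to the parameters that affect it and are then passed to Adam, can be sketched as follows. The two toy losses and the two-parameter model are hypothetical; the Adam update itself follows the standard formulation with bias-corrected first and second moment estimates.

```python
import math

def losses(w):
    rate = w[0] ** 2                    # toy rate loss: affected only by w[0]
    task = (w[0] + w[1] - 1.0) ** 2     # toy task loss: affected by w[0] and w[1]
    return rate, task

def gradients(w):
    """Sum, per parameter, the gradients of the losses that depend on it."""
    g_rate = [2.0 * w[0], 0.0]
    r = 2.0 * (w[0] + w[1] - 1.0)
    g_task = [r, r]
    return [a + b for a, b in zip(g_rate, g_task)]

def adam_step(params, grads, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds the step count and moment estimates."""
    state["t"] += 1
    t = state["t"]
    updated = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        updated.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return updated

w = [1.0, -1.0]
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
initial_loss = sum(losses(w))
for _ in range(3000):
    w = adam_step(w, gradients(w), state)
final_loss = sum(losses(w))
print(initial_loss, final_loss)
```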
- the training process may use additional losses which may not be directly related to one or more specific tasks, such as losses derived from pixel-wise distortion metrics (for example, MSE, MS-SSIM).
- the machine tasks may be performed at decoder side (instead of at encoder side) for multiple reasons, for example, the encoder-side device may not have the capabilities (e.g. computational, power, or memory) for running the neural networks that perform these tasks, or some aspects or the performance of the task neural networks may have changed or improved by the time that the decoder-side device needs the tasks results (e.g., different or additional semantic classes, better neural network architecture). Also, there may be a need for customization, where different clients may run different neural networks for performing these machine learning tasks.
- a neural network may be used as filter in the decoding loop, and it may be referred to as neural network loop filter, or neural network in-loop filter.
- the NN loop filter may replace other loop filters of an existing video codec or may represent an additional loop filter with respect to the already present loop filters in an existing video codec.
- a neural network may be used as postprocessing filter, for example, applied to the output of an image or video decoder in order to remove or reduce coding artifacts.
- Content adaptation may be performed by having the encoder-side device compute an adaptation signal for one or more NNs used at decoder side (e.g., decoder-side NNs), and signaling the adaptation signal or a signal derived from the adaptation signal to the decoder side.
- the adaptation signal is a weight-update.
- Since the encoder includes the decoding operations and, in some cases, any post-processing operations, the decoder-side NNs that are content-adapted are assumed to be available also at encoder side. In practice, this may mean that two copies of one or more decoder-side NNs are available, one at encoder side and one at decoder side.
- a decoder-side NN may be a NN in-loop filter.
- a decoder-side NN may be a NN that is part of an end-to-end trained decoder.
- a decoder-side NN may be a post-processing NN.
- the decoder side may use the adaptation signal or a signal derived from the adaptation signal to update or adapt the one or more NN.
- the updated or adapted one or more NNs are then used for their purpose, e.g., for filtering a reconstructed image block or patch.
- the adaptation signal may be compressed in a lossy and/or lossless way by the encoder-side device.
- the decoder side may first need to decompress the compressed adaptation signal before using it for updating or adapting the NNs.
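A weight-update used as adaptation signal can be sketched as the difference between the adapted weights and the base weights, lossily compressed by uniform quantization before signaling. The weights, the quantization step, and the function names below are hypothetical; entropy coding of the integer levels is omitted.

```python
def make_adaptation_signal(base_weights, adapted_weights, step):
    """Encoder side: quantize the weight-update to integer levels (lossy)."""
    return [round((a - b) / step) for b, a in zip(base_weights, adapted_weights)]

def apply_adaptation_signal(base_weights, levels, step):
    """Decoder side: dequantize the levels and add them to the local copy."""
    return [b + level * step for b, level in zip(base_weights, levels)]

base = [0.50, -0.20, 0.10]        # copy held at both encoder side and decoder side
adapted = [0.537, -0.204, 0.018]  # weights after adaptation at encoder side
step = 0.02

levels = make_adaptation_signal(base, adapted, step)
decoder_weights = apply_adaptation_signal(base, levels, step)
print(levels)  # [2, 0, -4]
print([round(w, 3) for w in decoder_weights])
```

The encoder can apply the same dequantized update to its own copy so that both sides continue from identical weights.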
- the encoder may decide to optimize some part of the codec or some signal produced by the codec, with respect to the specific input content.
- the terms ‘optimize’, ‘adapt’, ‘finetune’, ‘update’, and ‘overfit’ may refer to the same operation, e.g., making a part of the codec (such as the parameters of a NN) or a signal produced by the codec more specific to the input content, in order to improve the rate-distortion performance.
- the parameters or the signal to be adapted may belong to one or more of the following categories of parameters:
- the encoder's trainable parameters or weights
- a subset of the encoder's trainable parameters or weights
- the output of the encoder e.g., the latent tensor
- a subset of the output of the encoder e.g., the latent tensor
- the decoder's trainable parameters or weights
- a subset of the decoder's trainable parameters or weights.
- the parameters may be a subset of trainable parameters or weights of a decoder, such as the bias parameters of a neural network that is part of the decoder.
- the optimization may be performed at encoder-side, and may comprise computing a loss function using the output of the decoder and eventually the output of the encoder, and differentiating the computed loss function with respect to the parameters or signal to be optimized.
- the parameters to be optimized are at least some of the parameters of a decoder-side NN
- an update to those parameters may need to be encoded and signaled to the decoder side.
- the bitrate of the bitstream representing such signaling is an additional bitrate with respect to the bitrate of the bitstream representing the encoded image or video without any content adaptation.
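Overfitting a subset of decoder-side parameters at encoder side can be illustrated with a one-weight, one-bias filter in which only the bias is adapted. The sketch assumes an MSE loss, plain gradient descent, and invented sample values; the resulting bias change is what would have to be encoded and signaled as additional bitrate.

```python
def filter_nn(x, w, b):
    """Toy decoder-side filter: y = w * x + b."""
    return [w * v + b for v in x]

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def overfit_bias(recon, original, w, b, lr=0.1, steps=200):
    """Adapt only the bias by differentiating the MSE with respect to b."""
    n = len(recon)
    for _ in range(steps):
        out = filter_nn(recon, w, b)
        grad_b = sum(2.0 * (o - t) for o, t in zip(out, original)) / n
        b -= lr * grad_b
    return b

original = [1.0, 2.0, 3.0, 4.0]   # uncompressed content, known only at encoder side
recon = [0.8, 1.9, 2.7, 3.8]      # reconstructed content before filtering
w, b0 = 1.0, 0.0

b1 = overfit_bias(recon, original, w, b0)
print(round(b1, 4))                                   # bias after overfitting
print(round(mse(filter_nn(recon, w, b0), original), 4),
      round(mse(filter_nn(recon, w, b1), original), 4))
bias_update = b1 - b0  # the update to be encoded and signaled to the decoder side
```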
- There may be more than one NN available at encoder side and decoder side.
- one problem is represented by how to select one or more optimal NNs for the overfitting process out of all the available NNs, in terms of rate-distortion performance.
- the one or more NNs that, after overfitting, perform best in terms of rate-distortion performance should be selected.
- Various embodiments propose apparatus and methods for optimizing the overfitting of one or more decoder-side NNs (DSNNs) or optimizing one or more parameters of one or more processing operations.
- at least some of these embodiments may be applied for training one or more neural networks present at encoder side and/or at decoder side, on a training dataset.
- a decoder-side NN is a NN that is used as part of the decoding and/or post-processing operations.
- An example of DSNN is an in-loop NN filter.
- Another example of DSNN is a postprocessing NN filter.
- for each DSNN (e.g., for the DSNN used as in-loop filter, or for the DSNN used as post-filter) there may be more than one version available.
- Some embodiments address the problem of selecting one or more optimal versions to be overfitted among two or more available versions of DSNN.
- the two or more available versions of DSNN that are considered are referred to as candidate DSNN versions.
- For simplicity, the case of selecting a single optimal version is considered in some of the embodiments.
- a set of data on which the NN will be run after being overfitted may be referred to as an inference set.
- a set of data on which the NN will be evaluated may be referred to as an evaluation set.
- a set of data on which the NN will be overfitted may be referred to as an overfitting set.
- the inference set, the evaluation set, and the overfitting set may partially or fully overlap with each other.
- the performance may comprise a rate-distortion performance or simply a distortion-based performance.
- the following embodiments address the problem of overfitting for achieving a better performance of the overfitted model.
- the output of a DSNN may be processed by one or more processing operations, such as scaling and shifting, where at least some of the parameters of these processing operations may be optimized at encoder side and signaled to the decoder side.
- An embodiment proposes to take these processing operations into account during the overfitting process and/or during the training process.
- FIG. 53 Various embodiments consider the case of compressing and decompressing data.
- the embodiments consider video as the data type.
- ‘video’ may refer to one or more video frames, unless specified otherwise.
- the proposed embodiments can be extended to other types of data such as images, audio, speech, text, and the like.
- an encoder-side device performs a compression or encoding operation by using an encoder.
- the output of the video encoder is a bitstream representing the compressed video.
- a decoder-side device performs decompression or decoding operation by using a decoder.
- the output of the video decoder may be referred to as decoded video.
- the decoded video may be post-processed by one or more post-processing operations, such as a post-processing filter.
- the output of the one or more post-processing operations may be referred to as post-processed video.
- the encoder-side device may also include some decoding operations, for example, in a coding loop, and/or at least some post-processing operations. In an example, the encoder may include all the decoding operations and any post-processing operations.
- the encoder-side device and the decoder-side device may be the same physical device, or different physical devices.
- the decoder or the decoder-side device may contain one or more neural networks, referred to here as decoder-side neural networks (DSNNs).
- when the encoder-side device includes at least some decoding and/or post-processing operations, at least some of the DSNNs may also be available at the encoder side. In practice, this means that the encoder-side device may include copies of at least some of the DSNNs.
- Some examples of such DSNNs may include but are not limited to the following:
- a post-processing NN filter (here also referred to as post-filter, or NN post-filter, or post-filter NN), which takes as input at least one of the outputs of an end-to-end learned decoder or of a conventional decoder (i.e., a decoder not comprising neural networks or other components learned from data) or of a hybrid decoder (e.g., a decoder comprising one or more neural networks or other components learned from data);
- a NN in-loop filter also referred to here as in-loop NN filter, or NN loop filter, or loop NN filter, used within an end-to-end learned decoder, or within a hybrid decoder;
- a learned probability model (e.g., a NN) that is used for providing estimates of probabilities of symbols to be encoded and/or decoded by a lossless coding module, within an end-to-end learned codec or within a hybrid codec;
- a decoder neural network for an end-to-end learned codec.
- a single DSNN is used when describing some of the embodiments.
- a NN post-filter as an example of a DSNN is used for describing some of the embodiments.
- the embodiments may be extended to the cases of multiple DSNNs and to the case where a DSNN is used for other purposes than post-processing.
- Two copies of the DSNN (e.g., the NN post-filter) considered in the embodiments are assumed to be available at encoder side and decoder side.
- for each DSNN (e.g., for the DSNN used as in-loop filter, or for the DSNN used as post-filter) there may be more than one version available.
- the following embodiments address at least the problem of selecting one or more optimal versions to be overfitted among two or more available versions of DSNN.
- the two or more available versions of DSNN that are considered are referred to as candidate DSNN versions.
- at least some embodiments consider the case of selecting a single optimal version to be overfitted.
- the DSNN is a post-filter, and two candidate DSNN versions have the same architecture but different values for at least some of their parameters.
- a set of data on which the NN is run after being overfitted is referred to as an inference set.
- a set of data on which the NN is evaluated is referred to as an evaluation set.
- a set of data on which the NN is overfitted is referred to as the overfitting set.
- the inference set, the evaluation set and the overfitting set may partially or fully overlap with each other.
- the inference set is a video
- the evaluation set is a first random access (RA) segment of the video
- the overfitting set is the first RA segment of the video.
- an RA segment may be specified to start with a picture that enables random access, e.g. enables starting a decoding process from that picture.
- an RA segment may start from an intra-coded picture, such as an IRAP picture in some video coding standards, or a gradual decoding refresh picture.
- the RA segment may, in some cases, be specified to pertain up to (but excluding) the next picture, in decoding order, that can start an RA segment.
- the inference set is a video
- the evaluation set is the first RA segment of the video
- the overfitting set is the video
- the encoder-side device may perform one or more of the following operations:
- the overfitted DSNN may be applied on the inference set; and/or
- a weight-update may be computed based at least on the weights of the overfitted DSNN and the weights of the DSNN before overfitting.
- the weight-update may be compressed by using lossless or lossy compression.
- the bitstream representing the compressed weight-update may be signaled or provided to the decoder side, in or along the bitstream representing the encoded video.
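A minimal sketch of the weight-update path is given here. It assumes the weight-update is the element-wise difference between the overfitted weights and the weights before overfitting, and it uses uniform scalar quantization as a stand-in for lossy compression; the quantization step, the helper names, and the weight values are illustrative assumptions, and a real system would additionally entropy-code the quantized symbols.

```python
def compute_weight_update(base_weights, overfitted_weights):
    """Weight-update: overfitted weights minus the weights before overfitting."""
    return [o - b for b, o in zip(base_weights, overfitted_weights)]


def quantize(update, step):
    """Lossy step (assumed): map each update value to an integer symbol."""
    return [round(u / step) for u in update]


def dequantize(symbols, step):
    """Decoder side: recover an approximate weight-update from the symbols."""
    return [q * step for q in symbols]


def apply_update(base_weights, update):
    """Decoder side: add the decompressed weight-update to the local copy."""
    return [b + u for b, u in zip(base_weights, update)]


# Hypothetical weights; both sides hold the same base copy.
base = [0.5, -0.25, 1.0]
overfitted = [0.53, -0.25, 0.96]
step = 0.01

symbols = quantize(compute_weight_update(base, overfitted), step)   # encoder
reconstructed = apply_update(base, dequantize(symbols, step))       # decoder
```

Because both sides start from the same base weights, only the quantized symbols need to travel in or along the video bitstream.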
- the performance may comprise a rate-distortion performance or simply a distortion-based performance.
- the DSNN is a post-filter, and two candidate DSNN versions are considered.
- the two candidate DSNN versions are run on the first RA segment, e.g., the input to each candidate DSNN version comprises the decoded first RA segment.
- the output of each candidate DSNN version comprises the post-processed first RA segment.
- a first PSNR is computed based at least on the input to the candidate DSNN version and respective uncompressed data.
- a second PSNR is computed based at least on the output of the candidate DSNN and respective uncompressed data.
- the performance of the candidate DSNN versions comprises a PSNR gain, which may be computed as the second PSNR (of the output of the candidate DSNN version) minus the first PSNR (of its input).
- the candidate DSNN version yielding highest PSNR gain or a predetermined PSNR gain is selected as the optimal DSNN version.
- the selected DSNN version is overfitted on the whole video.
- the overfitting may comprise one or more iterations, where each iteration comprises inputting the decoded video to the selected DSNN version, obtaining a post-processed output video from the selected DSNN version, computing a training loss based at least on the post-processed output video and respective uncompressed data, computing gradients for one or more parameters of the selected DSNN version, using the gradients for updating the one or more parameters of the selected DSNN version.
- the iterations are performed until a stopping criterion is satisfied.
- the overfitted DSNN may be used for post-processing the decoded video.
- the encoder may derive a weight-update as a difference between the weights of the overfitted DSNN and the weights of the DSNN before overfitting.
- the derived weight-update may be compressed using a lossy and/or a lossless encoder.
- the bitstream representing the compressed weight-update may be signaled to the decoder in or along the bitstream representing the encoded video.
- the decoder may decompress the compressed weight-update, use the decompressed weight-update to update the post-filter, and use the updated post-filter for post-processing one or more frames of a decoded video.
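The candidate selection step of this embodiment can be sketched as follows. The sketch assumes 8-bit sample values (peak 255), represents each candidate DSNN version by a hypothetical Python callable, and treats a frame as a flat list of samples; the candidate names `v1` and `v2` and the sample values are made up for illustration.

```python
import math


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(reference, test, peak=255.0):
    err = mse(reference, test)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


def psnr_gain(candidate, decoded, original):
    first_psnr = psnr(original, decoded)               # input of the candidate
    second_psnr = psnr(original, candidate(decoded))   # output of the candidate
    return second_psnr - first_psnr


def select_candidate(candidates, decoded, original):
    """Return the name of the candidate with the highest PSNR gain."""
    gains = {name: psnr_gain(fn, decoded, original)
             for name, fn in candidates.items()}
    return max(gains, key=gains.get), gains


# Hypothetical first RA segment: uncompressed samples and decoded samples.
original = [100, 120, 140, 160]
decoded = [98, 118, 138, 158]

# Two stand-in candidate versions (simple offsets instead of real networks).
candidates = {
    "v1": lambda frame: [p + 1.0 for p in frame],
    "v2": lambda frame: [p + 1.5 for p in frame],
}

best, gains = select_candidate(candidates, decoded, original)
```

The selected version would then be the one overfitted on the whole video, as described above.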
- the DSNN is a post-filter, and two candidate DSNN versions are considered.
- the two candidate DSNN versions are overfitted on the first RA segment.
- the overfitting of each candidate DSNN version may comprise one or more iterations, where each iteration comprises inputting the decoded first RA segment to the candidate DSNN version, obtaining a post-processed first RA segment from the candidate DSNN version, computing a training loss based at least on the post-processed first RA segment and respective uncompressed data, computing gradients for one or more parameters of the candidate DSNN version, using the gradients for updating the one or more parameters of the candidate DSNN version.
- a first PSNR is computed based at least on the post-processed first RA segment and respective uncompressed data
- a second PSNR is computed based at least on the decoded first RA segment and respective uncompressed data.
- a PSNR gain for each of the two overfitted candidate DSNN versions is computed as a difference between the first PSNR and the second PSNR of the respective overfitted candidate DSNN versions.
- the PSNR gain of each overfitted candidate DSNN version represents the performance of each candidate DSNN version.
- the candidate DSNN version yielding the highest PSNR gain or a predetermined PSNR gain is selected as the optimal DSNN version.
- the selected DSNN version is overfitted on the whole video.
- the overfitting may comprise one or more iterations, where each iteration comprises inputting the decoded video to the selected DSNN version, obtaining a post-processed output video from the selected DSNN version, computing a training loss based at least on the post-processed output video and respective uncompressed data, computing gradients for one or more parameters of the selected DSNN version, using the gradients for updating the one or more parameters of the selected DSNN version.
- the iterations are performed until a stopping criterion is satisfied.
- the overfitted DSNN may be used for post-processing the decoded video.
- the encoder may derive a weight-update as a difference between the weights of the overfitted DSNN and the weights of the DSNN before overfitting.
- the derived weight-update may be compressed using a lossy and/or a lossless encoder.
- the bitstream representing the compressed weight-update may be signaled to the decoder in or along the bitstream representing the encoded video.
- the decoder may decompress the compressed weight-update, use the decompressed weight-update to update the post-filter, and use the updated post-filter for post-processing one or more frames of a decoded video.
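The variant in which every candidate is first overfitted on the evaluation set and only then compared can be sketched as below. It assumes toy candidates of the form `out = gain * x + bias` with a fixed gain and a trainable bias, samples normalized to [0, 1], and the identity PSNR gain = 10·log10(MSE_in / MSE_out), which follows from subtracting two PSNR values that share the same peak; all names and values are illustrative assumptions.

```python
import math


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def run(params, frame):
    gain, bias = params
    return [gain * p + bias for p in frame]


def overfit(params, decoded, original, lr=0.2, iterations=100):
    """Overfit the bias by gradient descent on the MSE; the gain is frozen."""
    gain, bias = params
    n = len(decoded)
    for _ in range(iterations):
        out = run((gain, bias), decoded)
        grad_bias = 2.0 * sum(o - t for o, t in zip(out, original)) / n
        bias -= lr * grad_bias
    return (gain, bias)


def psnr_gain_db(params, decoded, original):
    # PSNR(post-processed) - PSNR(decoded); the peak value cancels out.
    return 10.0 * math.log10(mse(decoded, original)
                             / mse(run(params, decoded), original))


# Hypothetical first RA segment.
decoded = [0.2, 0.4, 0.6, 0.8]
original = [0.22, 0.45, 0.66, 0.89]

candidates = {"A": (1.0, 0.0), "B": (1.1, 0.0)}

overfitted = {name: overfit(p, decoded, original)
              for name, p in candidates.items()}
gains = {name: psnr_gain_db(p, decoded, original)
         for name, p in overfitted.items()}
best = max(gains, key=gains.get)

# The selected version (before overfitting) is then overfitted on the
# whole video, represented here by a slightly longer hypothetical sequence.
final = overfit(candidates[best], decoded + [1.0], original + [1.11])
```

Comparing candidates after a short overfitting run costs more encoder time than comparing them as-is, but it ranks them by what they can reach on the content rather than by their starting point.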
- any suitable performance metric measuring the improvement in one or more of the following may be used: visual quality, machine vision quality, rate-distortion performance, complexity.
- the performance may be measured based on the gain in machine vision task accuracy, such as gain in mean accuracy.
- the performance may be measured based on a rate-distortion Lagrangian function, where the rate may comprise the rate of the bitstream representing the compressed video and the rate of the bitstream representing the compressed weight-update, and the distortion may comprise the mean-squared error (MSE) computed based on the output of the DSNN (or a signal derived therefrom) and corresponding ground-truth data.
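A rate-distortion Lagrangian of this kind is commonly written as J = D + λ·R. The sketch below assumes that form, with D as the MSE against ground truth and R as the sum of the video bits and the weight-update bits; the value of λ, the bit counts, and the sample values are hypothetical.

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def rd_cost(dsnn_output, ground_truth, video_bits, update_bits, lam):
    """Lagrangian cost J = D + lambda * R (assumed form)."""
    distortion = mse(dsnn_output, ground_truth)
    rate = video_bits + update_bits
    return distortion + lam * rate


ground_truth = [10.0, 20.0, 30.0, 40.0]

# Without overfitting: higher distortion, no weight-update to signal.
cost_base = rd_cost([12.0, 22.0, 32.0, 42.0], ground_truth,
                    video_bits=1000, update_bits=0, lam=0.01)

# With overfitting: lower distortion, but 200 extra bits for the update.
cost_overfitted = rd_cost([11.0, 20.0, 30.0, 41.0], ground_truth,
                          video_bits=1000, update_bits=200, lam=0.01)

use_overfitted = cost_overfitted < cost_base
```

Including the weight-update bits in the rate term means overfitting is only judged worthwhile when the distortion reduction outweighs the cost of signaling the update.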
- the following embodiments address at least the problem of overfitting for achieving a better performance of the overfitted model. It is to be understood that at least some of the following embodiments may be applied to the case of training one or more neural networks for achieving a better performance of the trained models.
- the output of a DSNN may be processed by one or more processing operations, such as scaling and shifting, where at least some of the parameters of these processing operations may be optimized at encoder side and signaled to the decoder side.
- the one or more processing operations applied to the output of a DSNN are referred to as refinement operations.
- the DSNN (e.g., the overfitted DSNN) is a post-filter that post-processes a decoded video frame-by-frame, e.g., the DSNN gets as input a decoded frame and the output is a post-processed frame.
- the post-processed frame is denoted as NN_out
- the decoded frame is denoted as NN_in
- ‘s’ is a parameter or variable that multiplies the difference between NN_out and NN_in.
- the value of the parameter ‘s’ is optimized at encoder side by rate-distortion optimization (RDO), by the method of least squares, or by other suitable optimization methods.
- refinement operations are performed by a function refine(NN_out, other_arguments), where ‘other_arguments’ may comprise other data necessary to compute the value of the function.
- refined_NN_out = refine(NN_out, NN_in, s)
- refine(NN_out, NN_in, s) = (NN_out - NN_in)*s + NN_in.
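The refinement function can be written directly from its definition. The sketch below treats a frame as a flat list of sample values, which is a simplifying assumption; the sample values are illustrative.

```python
def refine(nn_out, nn_in, s):
    """refine(NN_out, NN_in, s) = (NN_out - NN_in) * s + NN_in, per sample."""
    return [(o - i) * s + i for o, i in zip(nn_out, nn_in)]


nn_in = [10.0, 20.0]    # decoded frame
nn_out = [12.0, 18.0]   # post-processed frame

half = refine(nn_out, nn_in, 0.5)         # halfway between input and output
unchanged = refine(nn_out, nn_in, 1.0)    # s = 1 keeps the DSNN output
bypassed = refine(nn_out, nn_in, 0.0)     # s = 0 falls back to the decoded frame
```

The parameter `s` therefore controls how much of the residual added by the DSNN is kept.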
- the refinement operations are taken into account during the overfitting process.
- the overfitting process is performed by computing the training loss based at least on the output of the refinement operations.
- the post-filter DSNN may be overfitted by performing one or more overfitting iterations until a stopping criterion is satisfied, where each iteration may comprise (a) inputting a decoded frame NN_in to the post-filter DSNN, (b) obtaining a post-processed frame NN_out as an output of the post-filter DSNN, (c) computing a refinement refined_NN_out of the post-processed frame based at least on NN_out, NN_in, a value of the parameter ‘s’, and a refinement function refine(), where the value of the parameter ‘s’ may be determined based on least squares, (d) computing an MSE loss
- FIG. 14 illustrates an example for overfitting a decoder-side neural network (NN) filter based on refinement operations, in accordance with an embodiment.
- x_i denotes the i-th input frame
- a VVC codec 1402 represents a video encoder that is conformant with the specification of the VVC/H.266 standard, where the encoder also includes decoding operations; x̂_i denotes the decoded frame (previously denoted as NN_in) that is output by the VVC codec; x̂_i is input to a post-filter DSNN that is denoted as a NN filter 1404.
- the NN filter 1404 outputs the post-processed frame y_i (previously denoted as NN_out), then a difference 1406 between y_i and x̂_i (denoted as r_i) is computed.
- the difference r_i is multiplied 1408 with a scaling parameter s_i, which is estimated by the least squares method (denoted as LST-SQ 1410).
- the result of the multiplication is then added 1412 to x̂_i to obtain a refined post-processed frame x̃_i (previously denoted as refined_NN_out).
- a loss 1414 is computed based at least on x̃_i and x_i.
- the loss 1414 is used for overfitting 1416 the NN filter 1404.
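The FIG. 14 data flow for one frame can be sketched as follows. The sketch assumes that the least squares estimate of the scaling parameter minimizes the squared error between the refined frame and the input frame, which gives the closed form s = ⟨r, x − NN_in⟩ / ⟨r, r⟩ with r = NN_out − NN_in; the function names and sample values are illustrative, and the gradient step that would update the NN filter from the loss is omitted.

```python
def least_squares_scale(nn_out, nn_in, target):
    """Scale s minimizing sum((nn_in + s * (nn_out - nn_in) - target)^2)."""
    residual = [o - i for o, i in zip(nn_out, nn_in)]   # difference 1406
    wanted = [t - i for t, i in zip(target, nn_in)]
    denom = sum(r * r for r in residual)
    if denom == 0.0:
        return 0.0  # the filter changed nothing, so any s gives the same result
    return sum(r * w for r, w in zip(residual, wanted)) / denom


def refine(nn_out, nn_in, s):
    return [(o - i) * s + i for o, i in zip(nn_out, nn_in)]


def refined_loss(nn_out, nn_in, target):
    """Return (s, refined frame, MSE loss computed on the refined frame)."""
    s = least_squares_scale(nn_out, nn_in, target)       # LST-SQ 1410
    refined = refine(nn_out, nn_in, s)                   # multiply 1408, add 1412
    loss = sum((p - t) ** 2 for p, t in zip(refined, target)) / len(target)
    return s, refined, loss                              # loss 1414


target = [10.0, 20.0, 30.0]   # input frame
nn_in = [8.0, 18.0, 28.0]     # decoded frame from the codec
nn_out = [9.0, 19.0, 29.0]    # output of the NN filter

s, refined, loss = refined_loss(nn_out, nn_in, target)
```

Computing the loss on the refined frame rather than on the raw filter output lets the overfitting account for the scaling that the decoder will apply anyway.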
- FIG. 15 is an example apparatus 1500, which may be implemented in hardware, caused to implement mechanisms for optimizing overfitting of neural networks or optimizing one or more parameters of one or more processing operations, based on the examples described herein.
- the apparatus 1500 comprises at least one processor 1502, at least one non-transitory memory 1504 including computer program code 1505, wherein the at least one memory 1504 and the computer program code 1505 are configured to, with the at least one processor 1502, cause the apparatus 1500 to implement mechanisms for optimizing the overfitting of neural network or optimizing the one or more parameters of the one or more processing operations 1506, based on the examples described herein.
- the at least one neural network or the portion of the at least one neural network may be used at a decoder-side for decoding or reconstructing one or more media items.
- the apparatus 1500 optionally includes a display 1508 that may be used to display content during rendering.
- the apparatus 1500 optionally includes one or more network (NW) interfaces (I/F(s)) 1510.
- the NW I/F(s) 1510 may be wired and/or wireless and communicate over the Internet/other network(s) via any communication technique.
- the NW I/F(s) 1510 may comprise one or more transmitters and one or more receivers.
- the NW I/F(s) 1510 may comprise standard well-known components such as an amplifier, filter, frequency-converter, (de)modulator, and encoder/decoder circuitry(ies) and one or more antennas.
- the apparatus 1500 may be a remote, virtual or cloud apparatus.
- the apparatus 1500 may be either a coder or a decoder, or both a coder and a decoder.
- the at least one memory 1504 may be implemented using any suitable data storage technology, such as semiconductor based memory devices, flash memory, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
- the at least one memory 1504 may comprise a database for storing data.
- the apparatus 1500 need not comprise each of the features mentioned, or may comprise other features as well.
- the apparatus 1500 may correspond to or be another embodiment of the apparatus 50 shown in FIG. 1 and FIG. 2, any of the apparatuses shown in FIG. 3, or apparatus 700 of FIG. 7.
- the apparatus 1500 may correspond to or be another embodiment of the apparatuses shown in FIG. 20, including UE 110, RAN node 170, or network element(s) 190.
- FIG. 16 illustrates an example method 1600 for optimizing overfitting of neural networks, in accordance with an embodiment.
- the apparatus 1500 includes means, such as the processing circuitry 1502 or the like, for optimizing overfitting of neural network filters.
- the method 1600 includes running one or more candidate neural network versions by using at least data from an evaluation set.
- the method 1600 includes evaluating performance of the one or more candidate neural network versions based on the evaluation set.
- the method 1600 includes selecting a candidate neural network version based on one or more predetermined performance criteria.
- the method 1600 includes overfitting the selected neural network version based at least on an overfitting set.
- the method 1600 includes running the overfitted neural network version on an inference set.
- the one or more neural network versions include one or more of decoder-side neural network versions, where the one or more of decoder-side neural network versions are available at a decoder side and an encoder side.
- the evaluation set includes data for evaluating the one or more candidate neural network versions; the overfitting set includes data for overfitting the selected neural network version; and the inference set includes data for running the overfitted neural network version.
- the evaluation set, overfitting set, and the inference set partially or fully overlap.
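The sequence of operations in method 1600 can be sketched end to end with a toy model. The sketch assumes each candidate version is a single trainable offset added to every sample, each data set is a list of (decoded frame, original frame) pairs, the performance criterion is the lowest MSE on the evaluation set, and overfitting is a fixed number of gradient steps; all of these are illustrative stand-ins.

```python
def run(offset, decoded):
    return [p + offset for p in decoded]


def mean_squared_error(offset, dataset):
    errors = [(o - t) ** 2
              for decoded, original in dataset
              for o, t in zip(run(offset, decoded), original)]
    return sum(errors) / len(errors)


def overfit(offset, dataset, lr=0.25, steps=3):
    """A few gradient steps on the MSE, starting from the given version."""
    for _ in range(steps):
        diffs = [o - t
                 for decoded, original in dataset
                 for o, t in zip(run(offset, decoded), original)]
        offset -= lr * 2.0 * sum(diffs) / len(diffs)
    return offset


def method_1600(candidates, evaluation_set, overfitting_set, inference_set):
    # Run and evaluate every candidate version on the evaluation set.
    scores = {name: -mean_squared_error(model, evaluation_set)
              for name, model in candidates.items()}
    # Select the best candidate under the performance criterion.
    selected = max(scores, key=scores.get)
    # Overfit the selected version on the overfitting set.
    overfitted = overfit(candidates[selected], overfitting_set)
    # Run the overfitted version on the inference set.
    outputs = [run(overfitted, decoded) for decoded, _ in inference_set]
    return selected, overfitted, outputs


frame_1 = ([1.0, 2.0], [2.0, 3.0])
frame_2 = ([3.0, 4.0], [4.0, 5.0])
video = [frame_1, frame_2]

# Evaluation set: first frame only; overfitting and inference sets: whole video.
selected, overfitted, outputs = method_1600({"v0": 0.0, "v1": 0.5},
                                            [frame_1], video, video)
```

With a limited overfitting budget, starting from the better candidate ends closer to the optimum, which is the motivation for selecting before overfitting.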
- FIG. 17 illustrates an example method 1700 for optimizing the overfitting of neural networks, in accordance with another embodiment.
- the apparatus 1500 includes means, such as the processing circuitry 1502 or the like, for optimizing overfitting of neural network filters.
- the method 1700 includes overfitting one or more candidate neural network versions on an evaluation set to obtain a first set of overfitted neural network versions.
- the method 1700 includes evaluating performance of the first set of overfitted neural network versions on the evaluation set.
- the method 1700 includes selecting a first overfitted neural network version based on one or more predetermined performance criteria, to obtain a selected first overfitted neural network.
- when the evaluation set is different from an overfitting set, the method 1700 includes: overfitting a neural network version, used to obtain the selected first overfitted neural network version, on the overfitting set to obtain a second overfitted neural network version; and running the second overfitted neural network version on an inference set.
- the method 1700 includes running the selected first overfitted neural network version on the inference set when the evaluation set is same or substantially same as the overfitting set.
- the one or more neural network versions include one or more of decoder-side neural network versions, where the one or more of decoder-side neural network versions are available at a decoder side and an encoder side.
- the evaluation set comprises data for evaluating the one or more candidate neural network versions; the overfitting set comprises data for overfitting the selected neural network version; and the inference set comprises data for running the overfitted neural network version.
- the evaluation set, overfitting set, and the inference set partially or fully overlap.
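The branch in method 1700 between reusing the first overfitted version and overfitting again can be sketched as below. The sketch assumes toy models `(gain, offset)` with a fixed gain, data sets given as lists of (decoded sample, original sample) pairs, a closed-form offset fit standing in for iterative overfitting, and plain list equality standing in for "same or substantially same"; every name and value is hypothetical.

```python
def overfit(model, dataset):
    """Stand-in for overfitting: fit the offset in closed form, keep the gain."""
    gain, _ = model
    offset = sum(t - gain * x for x, t in dataset) / len(dataset)
    return (gain, offset)


def score(model, dataset):
    """Higher is better: negative MSE of the model output on the data set."""
    gain, offset = model
    return -sum((gain * x + offset - t) ** 2 for x, t in dataset) / len(dataset)


def method_1700(candidates, evaluation_set, overfitting_set):
    # Overfit every candidate on the evaluation set, then evaluate there.
    first = {name: overfit(m, evaluation_set) for name, m in candidates.items()}
    scores = {name: score(m, evaluation_set) for name, m in first.items()}
    selected = max(scores, key=scores.get)
    if evaluation_set == overfitting_set:
        # Same data: the selected first overfitted version can be used as is.
        return selected, first[selected]
    # Different data: overfit the original version behind the winner again.
    return selected, overfit(candidates[selected], overfitting_set)


candidates = {"a": (1.0, 0.0), "b": (2.0, 0.0)}
evaluation_set = [(1.0, 2.2), (2.0, 4.4)]
overfitting_set = evaluation_set + [(3.0, 6.6)]

name_diff, model_diff = method_1700(candidates, evaluation_set, overfitting_set)
name_same, model_same = method_1700(candidates, evaluation_set, evaluation_set)
```

When the two sets coincide, the second overfitting pass would repeat work already done, so the selected first overfitted version is run on the inference set directly.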
- FIG. 18 illustrates an example method 1800 for optimizing one or more parameters of one or more processing operations at an encoder side, in accordance with an embodiment.
- the apparatus 1500 includes means, such as the processing circuitry 1502 or the like, for optimizing one or more parameters of one or more processing operations.
- the method 1800 includes processing an output of a neural network version by using one or more processing operations.
- the method 1800 includes optimizing one or more parameters of the one or more processing operations at an encoder side.
- the method 1800 may further include signalling the optimized one or more parameters to a decoder side.
- the one or more processing operations include at least one of a scaling operation or a shifting operation.
- the neural network includes a decoder-side neural network, where the decoder-side neural network is available at the decoder side and the encoder side.
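Method 1800 can be illustrated with a scaling and a shifting operation applied to the neural network output. The sketch assumes the two parameters are fitted at the encoder side by ordinary least squares against the uncompressed samples, and that they are passed to the decoder as a small dictionary standing in for the signaled syntax; the field names `scale` and `shift` and the sample values are hypothetical.

```python
def fit_scale_shift(nn_out, target):
    """Encoder side: least-squares fit of target ~ scale * nn_out + shift."""
    n = len(nn_out)
    mean_out = sum(nn_out) / n
    mean_tgt = sum(target) / n
    var = sum((o - mean_out) ** 2 for o in nn_out)
    if var == 0.0:
        scale = 1.0  # constant output: only the shift can be fitted
    else:
        cov = sum((o - mean_out) * (t - mean_tgt)
                  for o, t in zip(nn_out, target))
        scale = cov / var
    shift = mean_tgt - scale * mean_out
    return {"scale": scale, "shift": shift}


def apply_scale_shift(nn_out, params):
    """Decoder side: apply the signaled parameters to the local NN output."""
    return [params["scale"] * o + params["shift"] for o in nn_out]


nn_out = [1.0, 2.0, 3.0, 4.0]   # output of the decoder-side neural network
target = [3.0, 5.0, 7.0, 9.0]   # uncompressed samples, encoder side only

signaled = fit_scale_shift(nn_out, target)       # optimized at the encoder
processed = apply_scale_shift(nn_out, signaled)  # reproduced at the decoder
```

Since the same network is available on both sides, two scalars are enough for the decoder to reproduce the processed output.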
- FIG. 19 illustrates an example method 1900 for optimizing the overfitting of neural networks, in accordance with yet another embodiment.
- the apparatus 1500 includes means, such as the processing circuitry 1502 or the like, for optimizing overfitting of neural network filters.
- the method 1900 includes overfitting one or more candidate neural network versions on an evaluation set to obtain a first set of overfitted neural network versions.
- the method 1900 includes evaluating performance of the first set of overfitted neural network versions on the evaluation set.
- the method 1900 includes selecting a first overfitted neural network version based on one or more predetermined performance criteria, to obtain a selected first overfitted neural network.
- the method 1900 includes determining whether the evaluation set is same or substantially same as an overfitting set.
- when the evaluation set is different from the overfitting set, the method 1900 includes overfitting a neural network version, used to obtain the selected first overfitted neural network version, on an overfitting set to obtain a second overfitted neural network version, and running the second overfitted neural network version on an inference set.
- when the evaluation set is the same or substantially the same as the overfitting set, the method 1900 includes running the selected first overfitted neural network version on the inference set.
- the one or more neural network versions include one or more of decoder-side neural network versions, where the one or more of decoder-side neural network versions are available at a decoder side and an encoder side.
- the evaluation set includes data for overfitting one or more candidate neural network versions and for evaluating the first set of overfitted neural network versions; the overfitting set includes data for overfitting the neural network version used to obtain the selected first overfitted neural network version; and the inference set includes data for running the selected first overfitted neural network version or the second overfitted neural network version.
- the evaluation set, overfitting set, and the inference set partially or fully overlap.
- FIG. 20 shows a block diagram of one possible and non-limiting example in which the examples may be practiced.
- a user equipment (UE) 110, radio access network (RAN) node 170, and network element(s) 190 are illustrated.
- the user equipment (UE) 110 is in wireless communication with a wireless network 100.
- a UE is a wireless device that can access the wireless network 100.
- the UE 110 includes one or more processors 120, one or more memories 125, and one or more transceivers 130 interconnected through one or more buses 127.
- Each of the one or more transceivers 130 includes a receiver, Rx, 132 and a transmitter, Tx, 133.
- the one or more buses 127 may be address, data, or control buses, and may include any interconnection mechanism, such as a series of lines on a motherboard or integrated circuit, fiber optics or other optical communication equipment, and the like.
- the one or more transceivers 130 are connected to one or more antennas 128.
- the one or more memories 125 include computer program code 123.
- the UE 110 includes a module 140, comprising one of or both parts 140-1 and/or 140-2, which may be implemented in a number of ways.
- the module 140 may be implemented in hardware as module 140-1, such as being implemented as part of the one or more processors 120.
- the module 140-1 may be implemented also as an integrated circuit or through other hardware such as a programmable gate array.
- the module 140 may be implemented as module 140-2, which is implemented as computer program code 123 and is executed by the one or more processors 120.
- the one or more memories 125 and the computer program code 123 may be configured to, with the one or more processors 120, cause the user equipment 110 to perform one or more of the operations as described herein.
- the UE 110 communicates with RAN node 170 via a wireless link 111.
- the RAN node 170 in this example is a base station that provides access by wireless devices such as the UE 110 to the wireless network 100.
- the RAN node 170 may be, for example, a base station for 5G, also called New Radio (NR).
- the RAN node 170 may be a NG-RAN node, which is defined as either a gNB or an ng-eNB.
- a gNB is a node providing NR user plane and control plane protocol terminations towards the UE, and connected via the NG interface to a 5GC (such as, for example, the network element(s) 190).
- the ng-eNB is a node providing E-UTRA user plane and control plane protocol terminations towards the UE, and connected via the NG interface to the 5GC.
- the NG-RAN node may include multiple gNBs, which may also include a central unit (CU) (gNB-CU) 196 and distributed unit(s) (DUs) (gNB-DUs), of which DU 195 is shown.
- the DU may include or be coupled to and control a radio unit (RU).
- the gNB-CU is a logical node hosting radio resource control (RRC), SDAP and PDCP protocols of the gNB or RRC and PDCP protocols of the en-gNB that controls the operation of one or more gNB-DUs.
- the gNB-CU terminates the F1 interface connected with the gNB-DU.
- the F1 interface is illustrated as reference 198, although reference 198 also illustrates a link between remote elements of the RAN node 170 and centralized elements of the RAN node 170, such as between the gNB-CU 196 and the gNB-DU 195.
- the gNB-DU is a logical node hosting RLC, MAC and PHY layers of the gNB or en-gNB, and its operation is partly controlled by gNB-CU.
- One gNB-CU supports one or multiple cells. One cell is supported by only one gNB-DU.
- the gNB-DU terminates the F1 interface 198 connected with the gNB-CU.
- the DU 195 is considered to include the transceiver 160, for example, as part of a RU, but some examples of this may have the transceiver 160 as part of a separate RU, for example, under control of and connected to the DU 195.
- the RAN node 170 may also be an eNB (evolved NodeB) base station, for LTE (long term evolution), or any other suitable base station or node.
- the RAN node 170 includes one or more processors 152, one or more memories 155, one or more network interfaces (N/W I/F(s)) 161, and one or more transceivers 160 interconnected through one or more buses 157.
- Each of the one or more transceivers 160 includes a receiver, Rx, 162 and a transmitter, Tx, 163.
- the one or more transceivers 160 are connected to one or more antennas 158.
- the one or more memories 155 include computer program code 153.
- the CU 196 may include the processor(s) 152, memories 155, and network interfaces 161. Note that the DU 195 may also contain its own memory/memories and processor(s), and/or other hardware, but these are not shown.
- the RAN node 170 includes a module 150, comprising one of or both parts 150-1 and/or 150-2, which may be implemented in a number of ways.
- the module 150 may be implemented in hardware as module 150-1, such as being implemented as part of the one or more processors 152.
- the module 150-1 may be implemented also as an integrated circuit or through other hardware such as a programmable gate array.
- the module 150 may be implemented as module 150-2, which is implemented as computer program code 153 and is executed by the one or more processors 152.
- the one or more memories 155 and the computer program code 153 are configured to, with the one or more processors 152, cause the RAN node 170 to perform one or more of the operations as described herein.
- the one or more network interfaces 161 communicate over a network such as via the links 176 and 131.
- Two or more gNBs 170 may communicate using, for example, link 176.
- the link 176 may be wired or wireless or both and may implement, for example, an Xn interface for 5G, an X2 interface for LTE, or other suitable interface for other standards.
- the one or more buses 157 may be address, data, or control buses, and may include any interconnection mechanism, such as a series of lines on a motherboard or integrated circuit, fiber optics or other optical communication equipment, wireless channels, and the like.
- the one or more transceivers 160 may be implemented as a remote radio head (RRH) 195 for LTE or a distributed unit (DU) 195 for gNB implementation for 5G, with the other elements of the RAN node 170 possibly being physically in a different location from the RRH/DU, and the one or more buses 157 could be implemented in part as, for example, fiber optic cable or other suitable network connection to connect the other elements (for example, a central unit (CU), gNB-CU) of the RAN node 170 to the RRH/DU 195.
- Reference 198 also indicates those suitable network link(s).
- the cell makes up part of a base station. That is, there can be multiple cells per base station. For example, there could be three cells for a single carrier frequency and associated bandwidth, each cell covering one-third of a 360 degree area so that the single base station’s coverage area covers an approximate oval or circle. Furthermore, each cell can correspond to a single carrier and a base station may use multiple carriers. So if there are three 120 degree cells per carrier and two carriers, then the base station has a total of 6 cells.
- the wireless network 100 may include a network element or elements 190 that may include core network functionality, and which provides connectivity via a link or links 181 with a further network, such as a telephone network and/or a data communications network (for example, the Internet).
- core network functionality for 5G may include access and mobility management function(s) (AMF(s)) and/or user plane functions (UPF(s)) and/or session management function(s) (SMF(s)).
- Such core network functionality for LTE may include MME (Mobility Management Entity)/SGW (Serving Gateway) functionality. These are merely example functions that may be supported by the network element(s) 190, and note that both 5G and LTE functions might be supported.
- the RAN node 170 is coupled via a link 131 to the network element 190.
- the link 131 may be implemented as, for example, an NG interface for 5G, or an S1 interface for LTE, or other suitable interface for other standards.
- the network element 190 includes one or more processors 175, one or more memories 171, and one or more network interfaces (N/W I/F(s)) 180, interconnected through one or more buses 185.
- the one or more memories 171 include computer program code 173.
- the one or more memories 171 and the computer program code 173 are configured to, with the one or more processors 175, cause the network element 190 to perform one or more operations.
- the wireless network 100 may implement network virtualization, which is the process of combining hardware and software network resources and network functionality into a single, software-based administrative entity, a virtual network.
- Network virtualization involves platform virtualization, often combined with resource virtualization.
- Network virtualization is categorized as either external, combining many networks, or parts of networks, into a virtual unit, or internal, providing network-like functionality to software containers on a single system. Note that the virtualized entities that result from the network virtualization are still implemented, at some level, using hardware such as processors 152 or 175 and memories 155 and 171, and also such virtualized entities create technical effects.
- the computer readable memories 125, 155, and 171 may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor based memory devices, flash memory, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
- the computer readable memories 125, 155, and 171 may be means for performing storage functions.
- the processors 120, 152, and 175 may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs) and processors based on a multi-core processor architecture, as non-limiting examples.
- the processors 120, 152, and 175 may be means for performing functions, such as controlling the UE 110, RAN node 170, network element(s) 190, and other functions as described herein.
- the various embodiments of the user equipment 110 can include, but are not limited to, cellular telephones such as smart phones, tablets, personal digital assistants (PDAs) having wireless communication capabilities, portable computers having wireless communication capabilities, image capture devices such as digital cameras having wireless communication capabilities, gaming devices having wireless communication capabilities, music storage and playback appliances having wireless communication capabilities, Internet appliances permitting wireless Internet access and browsing, tablets with wireless communication capabilities, as well as portable units or terminals that incorporate combinations of such functions.
- modules 140-1, 140-2, 150-1, and 150-2 may be caused to implement mechanisms for optimizing overfitting of neural network filters of the decoder-side neural network or optimizing one or more parameters of one or more processing operations.
- Computer program code 173 may also be configured to implement mechanisms for optimizing overfitting of neural network filters of the decoder-side neural network or optimizing one or more parameters of one or more processing operations.
- FIGs. 16 to 19 include a flowchart of an apparatus (e.g. 50, 100, 602, 604, 700, or 1500), method, and computer program product according to certain example embodiments.
- each block of the flowcharts, and combinations of blocks in the flowcharts may be implemented by various means, such as hardware, firmware, processor, circuitry, and/or other devices associated with execution of software including one or more computer program instructions.
- one or more of the procedures described above may be embodied by computer program instructions.
- the computer program instructions which embody the procedures described above may be stored by a memory (e.g. 58, 125, 704, or 1504) of an apparatus employing an embodiment and executed by processing circuitry (e.g.
- any such computer program instructions may be loaded onto a computer or other programmable apparatus (e.g., hardware) to produce a machine, such that the resulting computer or other programmable apparatus implements the functions specified in the flowchart blocks.
- These computer program instructions may also be stored in a computer-readable memory that may direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture, the execution of which implements the function specified in the flowchart blocks.
- the computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operations to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide operations for implementing the functions specified in the flowchart blocks.
- a computer program product is therefore defined in those instances in which the computer program instructions, such as computer-readable program code portions, are stored by at least one non-transitory computer-readable storage medium with the computer program instructions, such as the computer-readable program code portions, being configured, upon execution, to perform the functions described above, such as in conjunction with the flowchart(s) of FIGs. 16 to 19.
- the computer program instructions, such as the computer-readable program code portions, need not be stored or otherwise embodied by a non-transitory computer-readable storage medium, but may, instead, be embodied by a transitory medium with the computer program instructions, such as the computer-readable program code portions, still being configured, upon execution, to perform the functions described above.
- blocks of the flowcharts support combinations of means for performing the specified functions and combinations of operations for performing the specified functions. It will also be understood that one or more blocks of the flowcharts, and combinations of blocks in the flowcharts, may be implemented by special purpose hardware-based computer systems which perform the specified functions, or combinations of special purpose hardware and computer instructions.
- certain ones of the operations above may be modified or further amplified. Furthermore, in some embodiments, additional optional operations may be included. Modifications, additions, or amplifications to the operations above may be performed in any order and in any combination.
- references to a ‘computer’, ‘processor’, etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry.
- References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device such as instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device, and the like.
- circuitry may refer to any of the following: (a) hardware circuit implementations, such as implementations in analog and/or digital circuitry, and (b) combinations of circuits and software (and/or firmware), such as (as applicable): (i) a combination of processor(s) or (ii) portions of processor(s)/software including digital signal processor(s), software, and memory(ies) that work together to cause an apparatus to perform various functions, and (c) circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.
- This description of ‘circuitry’ applies to uses of this term in this application.
- circuitry would also cover an implementation of merely a processor (or multiple processors) or a portion of a processor and its (or their) accompanying software and/or firmware.
- circuitry would also cover, for example and if applicable to the particular element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, or another network device.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
Various embodiments provide an apparatus, a method, and a computer program product. The apparatus includes at least one processor; and at least one non-transitory memory comprising computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: run one or more candidate neural network versions by using at least data from an evaluation set; evaluate performance of the one or more candidate neural network versions based on the evaluation set; select a candidate neural network version based on one or more predetermined performance criteria; overfit the selected neural network version based at least on an overfitting set; and run the overfitted neural network version on an inference set.
Description
APPARATUS AND METHOD FOR OPTIMIZING THE OVERFITTING OF NEURAL NETWORK FILTERS
SUPPORT STATEMENT
[0001] The project leading to this application has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 876019. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and Germany, Netherlands, Austria, Romania, France, Sweden, Cyprus, Greece, Lithuania, Portugal, Italy, Finland, Turkey.
TECHNICAL FIELD
[0002] The examples and non-limiting embodiments relate generally to multimedia transport and neural networks, and more particularly, to a method, an apparatus, and a computer program product for optimizing the overfitting of neural network filters.
BACKGROUND
[0003] It is known to provide standardized formats for exchange of neural networks.
SUMMARY
[0004] An example apparatus includes at least one processor; and at least one non-transitory memory comprising computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: run one or more candidate neural network versions by using at least data from an evaluation set; evaluate performance of the one or more candidate neural network versions based on the evaluation set; select a candidate neural network version based on one or more predetermined performance criteria; overfit the selected neural network version based at least on an overfitting set; and run the overfitted neural network version on an inference set.
[0005] The example apparatus may further include, wherein: the evaluation set comprises data for evaluating the one or more candidate neural network versions; the overfitting set comprises data for overfitting the selected neural network version; and the inference set comprises data for running the overfitted neural network version.
[0006] The example apparatus may further include, wherein the evaluation set, overfitting set, and the inference set partially or fully overlap.
[0007] The example apparatus may further include, wherein the inference set comprises a video, the evaluation set comprises a first random access (RA) segment of the video, and the overfitting set comprises the video or the first RA segment of the video.
[0008] The example apparatus may further include, wherein the performance criteria comprise a distortion-based performance criterion.
[0009] The example apparatus may further include, wherein the selected neural network version performs best according to the one or more performance criteria.
[0010] The example apparatus may further include, wherein: the one or more candidate neural network versions comprise two candidate neural network versions; each candidate neural network version comprises a post-filter; the evaluation set comprises a first RA segment of a video; the overfitting set comprises the video; the inference set comprises a decoded video; output of the each candidate neural network version comprises a post-processed first RA segment; and wherein the apparatus is further caused to: compute a first performance metric based on input to the each candidate neural network version and a second performance metric based on output of the each candidate neural network version; compute a third performance metric comprising performance of the each candidate neural network version based on the first performance metric and the second performance metric; and select the candidate neural network version with a value of the third performance metric greater than or equal to a predetermined value as the selected neural network version.
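The metric-based selection described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes PSNR against the uncompressed reference as the first and second performance metrics and their difference (a PSNR gain in dB) as the third. The names `psnr`, `select_candidate`, and `min_gain_db` are hypothetical, and the toy "filters" are plain Python callables standing in for candidate post-filters.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length sample sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(ref, test, peak=255.0):
    """PSNR in dB (assumed metric); infinite when the signals are identical."""
    err = mse(ref, test)
    return math.inf if err == 0 else 10.0 * math.log10(peak ** 2 / err)

def select_candidate(candidates, decoded, original, min_gain_db=0.0):
    """Return (name, gain) of the best candidate whose gain meets the
    threshold, or None when no candidate qualifies."""
    first = psnr(original, decoded)  # first metric: quality of the filter input
    best = None
    for name, post_filter in candidates.items():
        second = psnr(original, post_filter(decoded))  # second metric: filter output
        gain = second - first  # third metric: improvement brought by the filter
        if gain >= min_gain_db and (best is None or gain > best[1]):
            best = (name, gain)
    return best

# Toy evaluation set: the decoded samples carry a constant +2 error.
original = [10.0, 20.0, 30.0, 40.0]
decoded = [12.0, 22.0, 32.0, 42.0]
candidates = {
    "minus_one": lambda x: [v - 1.0 for v in x],
    "plus_one": lambda x: [v + 1.0 for v in x],
}
print(select_candidate(candidates, decoded, original))
```

Only the selected candidate would then be overfitted on the overfitting set; candidates whose gain falls below the threshold are never trained in this sketch.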
[0011] The example apparatus may further include, wherein to overfit the selected neural network version, the apparatus is further caused to perform one or more iterations of the following: input the decoded video to the selected neural network version; obtain a post-processed output video from the selected neural network version; compute a training loss between the decoded video and the post-processed output video; compute gradients for one or more parameters of the selected neural network version; and use the gradients for updating the one or more parameters of the selected neural network version.
[0012] The example apparatus may further include, wherein the apparatus is caused to perform the one or more iterations until a stopping criterion is met.
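The iterative overfitting with a stopping criterion might look like the sketch below. Several things here are assumptions made for illustration: the "neural network" is a two-parameter toy filter `out = w * x + b`, the training loss is the MSE against the uncompressed samples (the target used in the later example that refers to "respective uncompressed data"), the optimizer is plain gradient descent, and the stopping criterion is a loss improvement smaller than `tol`. The names `overfit` and `mse_loss` are hypothetical.

```python
def mse_loss(w, b, decoded, original):
    """MSE between the toy filter output w * x + b and the target samples."""
    return sum((w * x + b - t) ** 2 for x, t in zip(decoded, original)) / len(decoded)

def overfit(params, decoded, original, lr=0.5, max_iters=1000, tol=1e-12):
    """Gradient descent on the toy filter until the loss stops improving
    by at least tol (assumed stopping criterion) or max_iters is reached."""
    w, b = params
    n = len(decoded)
    prev_loss = None
    for _ in range(max_iters):
        # Forward pass and residual against the uncompressed samples.
        residual = [w * x + b - t for x, t in zip(decoded, original)]
        loss = sum(r * r for r in residual) / n
        if prev_loss is not None and prev_loss - loss < tol:
            break  # stopping criterion met
        # Analytic gradients of the MSE with respect to w and b.
        grad_w = 2.0 * sum(r * x for r, x in zip(residual, decoded)) / n
        grad_b = 2.0 * sum(residual) / n
        w -= lr * grad_w
        b -= lr * grad_b
        prev_loss = loss
    return (w, b), mse_loss(w, b, decoded, original)

# Toy content with samples normalized to [0, 1].
decoded = [0.1, 0.3, 0.5, 0.7, 0.9]
original = [0.9 * x + 0.02 for x in decoded]
initial = mse_loss(1.0, 0.0, decoded, original)
params, final = overfit((1.0, 0.0), decoded, original)
print(initial, final, params)
```

With a real post-filter the gradients would come from automatic differentiation rather than the hand-written expressions used here.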
[0013] The example apparatus may further include, wherein the apparatus is further caused to: compute a weight-update based at least on weights of the overfitted neural network version and weights of the overfitted neural network version before overfitting; compress the weight-update; and signal or provide a bitstream representing the compressed weight-update to a decoder side in or along the bitstream representing an encoded data.
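A weight-update round trip can be sketched as below. The text does not fix a compression method, so uniform scalar quantization with a step size `step` is used here purely as an assumed stand-in; a real system would typically add entropy coding. All function names are hypothetical.

```python
def compute_weight_update(base, overfitted):
    """Difference between the weights after and before overfitting."""
    return [o - b for b, o in zip(base, overfitted)]

def quantize(update, step):
    """Assumed compression step: map each delta to an integer level."""
    return [round(u / step) for u in update]

def dequantize(levels, step):
    """Inverse mapping performed at the decoder side."""
    return [q * step for q in levels]

def apply_weight_update(base, levels, step):
    """Decoder side: rebuild approximate overfitted weights from the
    shared base weights and the signalled integer levels."""
    return [b + d for b, d in zip(base, dequantize(levels, step))]

base = [0.5, -0.25, 1.0]        # weights available at both encoder and decoder
overfitted = [0.53, -0.26, 1.0]  # encoder-side weights after overfitting
step = 0.01

levels = quantize(compute_weight_update(base, overfitted), step)
reconstructed = apply_weight_update(base, levels, step)
print(levels, reconstructed)
```

Only the integer levels (and the step size) would need to be signalled, since the base weights are assumed to be present at the decoder side already.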
[0014] The example apparatus may further include, wherein the one or more neural network versions comprise one or more of decoder-side neural network versions, wherein the one or more of decoder-side neural network versions are available at a decoder side and an encoder side.
[0015] Another example apparatus includes at least one processor; and at least one non-transitory memory comprising computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: overfit one or more candidate neural network versions on an evaluation set to obtain a first set of overfitted neural network versions; evaluate performance of the first set of overfitted neural network versions on the evaluation set; select a first overfitted neural network version based on one or more predetermined performance criteria, to obtain a selected first overfitted neural network; when the evaluation set is different from an overfitting set: overfit a neural network version, used to obtain the selected first overfitted neural network version, on an overfitting set to obtain a second overfitted neural network version; and run the second overfitted neural network version on an inference set; and run the selected first overfitted neural network version on the inference set when the evaluation set is same or substantially same as the overfitting set.
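The control flow of this second variant, where every candidate is overfitted before evaluation, can be sketched as follows. The sketch is an assumption-laden toy: a model is a `(scale, offset)` pair with only the offset trainable, "overfitting" sets the offset to the mean residual in closed form, the score is the negative MSE, and "same or substantially same" is reduced to plain equality of the two sets. The names `select_and_overfit`, `overfit_offset`, and `neg_mse` are hypothetical.

```python
def overfit_offset(model, data):
    """Toy overfitting: keep the scale, fit the offset to the mean residual."""
    scale, _ = model
    offset = sum(t - scale * x for x, t in data) / len(data)
    return (scale, offset)

def neg_mse(model, data):
    """Score where higher is better (assumed distortion-based criterion)."""
    scale, offset = model
    return -sum((scale * x + offset - t) ** 2 for x, t in data) / len(data)

def select_and_overfit(candidates, evaluation_set, overfitting_set, overfit, score):
    """Overfit all candidates on the evaluation set, pick the best, and
    re-overfit the original candidate only if the overfitting set differs."""
    first_set = {name: overfit(m, evaluation_set) for name, m in candidates.items()}
    best = max(first_set, key=lambda name: score(first_set[name], evaluation_set))
    if evaluation_set == overfitting_set:
        return best, first_set[best]  # reuse the already overfitted version
    # Start again from the non-overfitted candidate behind the winner.
    return best, overfit(candidates[best], overfitting_set)

# Pairs of (decoded sample, uncompressed sample).
evaluation_set = [(10.0, 12.0), (20.0, 22.0)]
overfitting_set = evaluation_set + [(30.0, 35.0)]
candidates = {"unit": (1.0, 0.0), "half": (0.5, 0.0)}

print(select_and_overfit(candidates, evaluation_set, overfitting_set,
                         overfit_offset, neg_mse))
```

The returned model is the one that would be run on the inference set in either branch.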
[0016] The example apparatus may further include, wherein: the evaluation set comprises data for overfitting one or more candidate neural network versions and for evaluating the first set of overfitted neural network versions; the overfitting set comprises data for overfitting the neural network version used to obtain the selected first overfitted neural network version; and the inference set comprises data for running the selected first overfitted neural network version or the second overfitted neural network version.
[0017] The example apparatus may further include, wherein the evaluation set, overfitting set, and the inference set partially or fully overlap.
[0018] The example apparatus may further include, wherein the performance criteria comprise a distortion-based performance criterion.
[0019] The example apparatus may further include, wherein the selected first overfitted neural network version performs best according to the one or more performance criteria.
[0020] The example apparatus may further include: the one or more candidate neural network versions comprise two candidate neural network versions; each candidate neural network version comprises a post-processing filter; the two candidate neural network versions are overfitted on a first RA segment of a video, to obtain two overfitted candidate neural network versions; and wherein the apparatus is further caused to: compute a fourth performance metric comprising performance of the each overfitted candidate neural network version based on a fifth performance metric and a sixth performance metric, wherein the fifth performance metric is based on a post-processed first RA segment and the sixth performance metric is based on a decoded first RA segment; select an overfitted candidate neural network version with a value of the fourth performance metric greater than or equal to a predetermined value as an optimal neural network version, to obtain a selected overfitted candidate neural network version; overfit the candidate neural network version used to obtain the selected overfitted candidate neural network version on the video, to obtain an overfitted selected neural network version; and post-process a decoded video by using the overfitted selected neural network.
[0021] The example apparatus may further include, wherein to overfit the each candidate neural network version, the apparatus is caused to perform one or more iterations of the following: provide a decoded first RA segment as an input to the each candidate neural network version; obtain a post-processed first RA segment from the each candidate neural network version; compute a training loss based at least on the post-processed first RA segment and respective uncompressed data; compute gradients for one or more parameters of the each candidate neural network version; and use the gradients for updating the one or more parameters of the each candidate DSNN version.
[0022] The example apparatus may further include, wherein the apparatus is caused to perform the one or more iterations until a stopping criterion is met.
[0023] The example apparatus may further include, wherein to overfit the selected neural network version, the apparatus is further caused to perform one or more iterations of the following: provide the decoded video as an input to the selected neural network version; obtain a post-processed output video from the selected neural network version; compute a training loss based at least on the post-processed output video and respective uncompressed data; compute gradients for one or more parameters of the selected neural network version; and use the gradients for updating the one or more parameters of the selected neural network version.
[0024] The example apparatus may further include, wherein the apparatus is caused to perform the one or more iterations until a stopping criterion is met.
[0025] The example apparatus may further include, wherein the one or more neural networks comprise one or more decoder-side neural networks, and wherein the one or more decoder-side neural networks are available at a decoder side and an encoder side.
[0026] Yet another apparatus includes at least one processor; and at least one non-transitory memory comprising computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: process an output of a neural network version by using one or more processing operations; and optimize one or more parameters of the one or more processing operations at an encoder side.
[0027] The example apparatus may further include, wherein the apparatus is further caused to signal the optimized one or more parameters, or information derived from the optimized one or more parameters, to a decoder side.
[0028] The example apparatus may further include, wherein the one or more processing operations comprise a refinement operation, and wherein the apparatus is further caused to apply the refinement operation on an output of the neural network based at least on the optimized one or more parameters.
[0029] The example apparatus may further include, wherein the refinement operation is defined as follows: refined_NN_out = (NN_out - NN_in)*s + NN_in; wherein the NN_out comprises an output of the neural network; wherein the NN_in comprises an input to the neural network; wherein s comprises a parameter that multiplies a difference between NN_out and NN_in; and wherein refined_NN_out is a result of the refinement operation.
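The refinement operation above is simple enough to write out directly. The sketch below also shows one possible way an encoder could optimize `s`: a closed-form least-squares fit against the uncompressed samples. That optimization criterion is an assumption for illustration, since the text only states that the parameter is optimized at the encoder side; `optimal_scale` is a hypothetical helper name.

```python
def refine(nn_in, nn_out, s):
    """refined_NN_out = (NN_out - NN_in) * s + NN_in, applied per sample."""
    return [(o - i) * s + i for i, o in zip(nn_in, nn_out)]

def optimal_scale(nn_in, nn_out, ground_truth):
    """Assumed encoder-side choice of s: minimize the squared error between
    the refined output and the uncompressed samples (closed form)."""
    num = sum((g - i) * (o - i) for i, o, g in zip(nn_in, nn_out, ground_truth))
    den = sum((o - i) ** 2 for i, o in zip(nn_in, nn_out))
    return 1.0 if den == 0 else num / den  # s = 1 leaves NN_out unchanged

nn_in = [10.0, 20.0, 30.0]         # decoded samples fed to the filter
nn_out = [12.0, 24.0, 36.0]        # filter output overshoots the correction
ground_truth = [11.0, 22.0, 33.0]  # uncompressed samples

s = optimal_scale(nn_in, nn_out, ground_truth)
print(s, refine(nn_in, nn_out, s))
```

Setting `s = 1` returns the network output unchanged and `s = 0` returns the input, so the parameter acts as a gain on the residual that the network adds to its input.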
[0030] The example apparatus may further include, wherein the apparatus is further caused to train or to overfit the neural network version based on the one or more processing operations.
[0031] The example apparatus may further include, wherein to train or to overfit the neural network, the apparatus is further caused to: provide input data to the neural network; obtain output data from the neural network; compute a refined output data based at least on the output data from the neural network and a refinement function; compute a loss based at least on the refined output data and respective ground-truth data, wherein the respective ground-truth data comprises an uncompressed version of the input data to the neural network; compute gradients of the MSE loss with respect to one or more parameters of the neural network; and use the gradients for updating the one or more parameters of the neural network.
[0032] The example apparatus may further include, wherein the neural network comprises a post-processing filter, and wherein an input data to the post-processing filter is a decoded frame, and an output data from the post-processing filter is a post-processed frame.
[0033] The example apparatus may further include, wherein the one or more processing operations comprise at least one of a scaling operation or a shifting operation.
[0034] The example apparatus may further include, wherein the neural network comprises a decoder-side neural network, and wherein the decoder-side neural network is available at the decoder side and the encoder side.
[0035] An example method includes: running one or more candidate neural network versions by using at least data from an evaluation set; evaluating performance of the one or more candidate neural network versions based on the evaluation set; selecting a candidate neural network version based on one or more predetermined performance criteria; overfitting the selected neural network version based at least on an overfitting set; and running the overfitted neural network version on an inference set.
[0036] The example method may further include, wherein: the evaluation set comprises data for evaluating the one or more candidate neural network versions; the overfitting set comprises data for overfitting the selected neural network version; and the inference set comprises data for running the overfitted neural network version.
[0037] The example method may further include, wherein the evaluation set, overfitting set, and the inference set partially or fully overlap.
[0038] The example method may further include, wherein the inference set comprises a video, the evaluation set comprises a first random access (RA) segment of the video, and the overfitting set comprises the video or the first RA segment of the video.
[0039] The example method may further include, wherein the performance criteria comprise a distortion-based performance criterion.
[0040] The example method may further include, wherein the selected neural network version performs best according to the one or more performance criteria.
[0041] The example method may further include, wherein: the one or more candidate neural network versions comprise two candidate neural network versions; each candidate neural network version comprises a post-filter; the evaluation set comprises a first RA segment of a video; the overfitting set comprises the video; the inference set comprises a decoded video; output of the each candidate neural network version comprises a post-processed first RA segment; and wherein the method further comprises: computing a first performance metric based on input to the each candidate neural network version and a second performance metric based on output of the each candidate neural network version; computing a third performance metric comprising performance of the each candidate neural network version based on the first performance metric and the second performance metric; and selecting the candidate neural network version with a value of the third performance metric greater than or equal to a predetermined value as the selected neural network version.
[0042] The example method may further include, wherein to overfit the selected neural network version, the method comprises performing one or more iterations of the following:
[0043] input the decoded video to the selected neural network version; obtain a post-processed output video from the selected neural network version; compute a training loss between the decoded video and the post-processed output video; compute gradients for one or more parameters of the selected neural network version; and use the gradients for updating the one or more parameters of the selected neural network version.
[0044] The example method may further include, wherein the one or more iterations are performed until a stopping criterion is met.
[0045] The example method may further include: computing a weight-update based at least on weights of the overfitted neural network version and weights of the overfitted neural network version before overfitting; compressing the weight-update; and signaling or providing a bitstream representing the compressed weight-update to a decoder side in or along the bitstream representing an encoded data.
[0046] The example method may further include, wherein the one or more neural network versions comprise one or more of decoder-side neural network versions, wherein the one or more of decoder-side neural network versions are available at a decoder side and an encoder side.
[0047] Another method includes: overfitting one or more candidate neural network versions on an evaluation set to obtain a first set of overfitted neural network versions; evaluating performance of the first set of overfitted neural network versions on the evaluation set; selecting a first overfitted neural network version based on one or more predetermined performance criteria, to obtain a selected first overfitted neural network; when the evaluation set is different from an overfitting set: overfitting a neural network version, used to obtain the selected first overfitted neural network version, on an overfitting set to obtain a second overfitted neural network version; and running the second overfitted neural network version on an inference set; and running the selected first overfitted neural network version on the inference set when the evaluation set is same or substantially same as the overfitting set.
[0048] The example method may further include, wherein: the evaluation set comprises data for overfitting one or more candidate neural network versions and for evaluating the first set of overfitted neural network versions; the overfitting set comprises data for overfitting the neural network version used to obtain the selected first overfitted neural network version; and the inference set comprises data for running the selected first overfitted neural network version or the second overfitted neural network version.
[0049] The example method may further include, wherein the evaluation set, overfitting set, and the inference set partially or fully overlap.
[0050] The example method may further include, wherein the performance criteria comprise a distortion-based performance criterion.
[0051] The example method may further include, wherein the selected first overfitted neural network version performs best according to the one or more performance criteria.
[0052] The example method may further include, wherein: the one or more candidate neural network versions comprise two candidate neural network versions; each candidate neural network version comprises a post-processing filter; the two candidate neural network versions are overfitted on a first RA segment of a video, to obtain two overfitted candidate neural network versions; and wherein the method further comprises: computing a fourth performance metric comprising performance of each overfitted candidate neural network version based on a fifth performance metric and a sixth performance metric, wherein the fifth performance metric is based on a post-processed first RA segment and the sixth performance metric is based on a decoded first RA segment; selecting an overfitted candidate neural network version with a value of the fourth performance metric greater than or equal to a predetermined value as an optimal neural network version, to obtain a selected overfitted candidate neural network version; overfitting the candidate neural network version used to obtain the selected overfitted candidate neural network version on the video, to obtain an overfitted selected neural network version; and post-processing a decoded video by using the overfitted selected neural network.
[0053] The example method may further include, wherein to overfit each candidate neural network version, the method further comprises performing one or more iterations of the following: providing a decoded first RA segment as an input to the candidate neural network version; obtaining a post-processed first RA segment from the candidate neural network version; computing a training loss based at least on the post-processed first RA segment and respective uncompressed data; computing gradients for one or more parameters of the candidate neural network version; and using the gradients for updating the one or more parameters of the candidate neural network version.
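One possible reading of the metrics in paragraph [0052] is sketched below, assuming (this is not stated in the paragraph) that the fifth and sixth metrics are PSNR values of the post-processed and decoded segment against the uncompressed segment, and that the fourth metric is their difference. All function names are hypothetical.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio between two equally long sample lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak * peak / mse)


def psnr_gain(uncompressed, decoded, post_processed):
    """Assumed 'fourth metric': PSNR after post-processing minus PSNR before it."""
    return psnr(uncompressed, post_processed) - psnr(uncompressed, decoded)


def select_candidates(gains, threshold):
    """Names of candidates whose gain is greater than or equal to the threshold."""
    return [name for name, gain in gains.items() if gain >= threshold]


uncompressed = [100, 100, 100, 100]
decoded = [104, 96, 104, 96]            # MSE 16 against the uncompressed samples
post_processed = [102, 98, 102, 98]     # MSE 4 against the uncompressed samples
gain = psnr_gain(uncompressed, decoded, post_processed)
```

A positive gain indicates that the overfitted filter brought the segment closer to the uncompressed data than the plain decoded segment.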
[0054] The example method may further include, wherein the one or more iterations are performed until a stopping criterion is met.
[0055] The example method may further include, wherein to overfit the selected neural network version, the method further comprises performing one or more iterations of the following: providing the decoded video as an input to the selected neural network version; obtaining a post-processed output video from the selected neural network version; computing a training loss based at least on the post-processed output video and respective uncompressed data; computing gradients for one or more parameters of the selected neural network version; and using the gradients for updating the one or more parameters of the selected neural network version.
[0056] The example method may further include, wherein the one or more iterations are performed until a stopping criterion is met.
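The iterations of paragraphs [0053] to [0056] may be illustrated with a toy example. In this sketch the "neural network" is replaced by a two-parameter filter `out = gain * x + bias`, the training loss is assumed to be the mean squared error, and the stopping criterion is assumed to be a loss improvement below `tol`; none of these choices is taken from the paragraphs above.

```python
def overfit_filter(decoded, uncompressed, gain=1.0, bias=0.0,
                   lr=0.1, max_iters=1000, tol=1e-12):
    """Toy overfitting loop for a stand-in filter out = gain * x + bias."""
    n = len(decoded)
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_iters):
        # Forward pass: post-process the decoded samples.
        out = [gain * x + bias for x in decoded]
        # Training loss: mean squared error against the uncompressed samples.
        err = [o - t for o, t in zip(out, uncompressed)]
        loss = sum(e * e for e in err) / n
        # Assumed stopping criterion: the loss no longer improves by at least tol.
        if prev_loss - loss < tol:
            break
        prev_loss = loss
        # Gradients of the loss with respect to the two parameters.
        grad_gain = 2.0 * sum(e * x for e, x in zip(err, decoded)) / n
        grad_bias = 2.0 * sum(err) / n
        # Plain gradient-descent update.
        gain -= lr * grad_gain
        bias -= lr * grad_bias
    return gain, bias, loss


fitted_gain, fitted_bias, final_loss = overfit_filter(
    decoded=[0.0, 0.5, 1.0], uncompressed=[0.1, 0.6, 1.1])
```

A real post-processing filter would have many more parameters and would typically rely on automatic differentiation, but the order of steps (forward pass, loss, gradients, update, stopping check) is the same.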
[0057] The example method may further include, wherein the one or more neural networks comprise one or more decoder-side neural networks, and wherein the one or more decoder-side neural networks are available at a decoder side and an encoder side.
[0058] Yet another example method includes: processing an output of a neural network version by using one or more processing operations; and optimizing one or more parameters of the one or more processing operations at an encoder side.
[0059] The example method may further include signaling the optimized one or more parameters, or information derived from the optimized one or more parameters, to a decoder side.
[0060] The example method may further include, wherein the one or more processing operations comprise a refinement operation, and wherein the method further comprises applying the refinement operation on an output of the neural network based at least on the optimized one or more parameters.
[0061] The example method may further include, wherein the refinement operation is defined as follows: refined_NN_out = (NN_out - NN_in)*s + NN_in; wherein the NN_out comprises an output of the neural network; wherein the NN_in comprises an input to the neural network; wherein s comprises a parameter that multiplies a difference between NN_out and NN_in; and wherein refined_NN_out is a result of the refinement operation.
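The refinement operation of paragraph [0061] can be written directly in code. The function `optimize_scale` is an added assumption: it shows one way an encoder that has access to ground-truth data might choose s, namely the closed-form least-squares minimizer of the squared error between the refined output and the ground truth. The paragraphs above do not state how s is optimized.

```python
def refine(nn_in, nn_out, s):
    """refined_NN_out = (NN_out - NN_in) * s + NN_in, applied per sample."""
    return [(o - i) * s + i for i, o in zip(nn_in, nn_out)]


def optimize_scale(nn_in, nn_out, ground_truth):
    """Hypothetical encoder-side choice of s (least-squares fit, not from the source)."""
    num = sum((g - i) * (o - i) for i, o, g in zip(nn_in, nn_out, ground_truth))
    den = sum((o - i) ** 2 for i, o in zip(nn_in, nn_out))
    # If the network leaves its input unchanged, s has no effect; fall back to 1.
    return 1.0 if den == 0 else num / den


nn_in = [10.0, 20.0]           # e.g. decoded samples
nn_out = [14.0, 16.0]          # filter output, overshooting the target
ground_truth = [12.0, 18.0]    # uncompressed samples
s = optimize_scale(nn_in, nn_out, ground_truth)
refined = refine(nn_in, nn_out, s)
```

With s = 1 the refinement returns the network output unchanged, and with s = 0 it returns the network input, so s acts as a strength control on the correction the network applies.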
[0062] The example method may further include training or overfitting the neural network version based on the one or more processing operations.
[0063] The example method may further include, wherein to train or to overfit the neural network, the method further comprises: providing input data to the neural network; obtaining output data from the neural network; computing a refined output data based at least on the output data from the neural network and a refinement function; computing a loss based at least on the refined output data and respective ground-truth data, wherein the respective ground-truth data comprises an uncompressed version of the input data to the neural network; computing gradients of the loss with respect to one or more parameters of the neural network; and using the gradients to update the one or more parameters of the neural network.
[0064] The example method may further include, wherein the neural network comprises a post-processing filter, and wherein input data to the post-processing filter is a decoded frame, and output data from the post-processing filter is a post-processed frame.
[0065] The example method may further include, wherein the one or more processing operations comprise at least one of a scaling operation or a shifting operation.
[0066] The example method may further include, wherein the neural network comprises a decoder-side neural network, and wherein the decoder-side neural network is available at the decoder side and the encoder side.
[0067] An example computer readable medium includes program instructions for causing an apparatus to perform at least the following: run one or more candidate neural network versions by using at least data from an evaluation set; evaluate performance of the one or more candidate neural network versions based on the evaluation set; select a candidate neural network version based on one or more predetermined performance criteria; overfit the selected neural network version based at least on an overfitting set; and run the overfitted neural network version on an inference set.
[0068] The example computer readable medium may further include, wherein the computer readable medium comprises a non-transitory computer readable medium.
[0069] The example computer readable medium may further include, wherein the computer readable medium further causes the apparatus to perform the methods as described in any of the previous paragraphs.
[0070] Another example computer readable medium includes program instructions for causing an apparatus to perform at least the following: overfit one or more candidate neural network versions on an evaluation set to obtain a first set of overfitted neural network versions; evaluate performance of the first set of overfitted neural network versions on the evaluation set; select a first overfitted neural network version based on one or more predetermined performance criteria, to obtain a selected first overfitted neural network; when the evaluation set is different from an overfitting set: overfit a neural network version, used to obtain the selected first overfitted neural network version, on an overfitting set to obtain a second overfitted neural network version; and run the second overfitted neural network version on an inference set; and run the selected first overfitted neural network version on the inference set when the evaluation set is same or substantially same as the overfitting set.
[0071] The example computer readable medium may further include, wherein the computer readable medium comprises a non-transitory computer readable medium.
[0072] The example computer readable medium may further include, wherein the computer readable medium further causes the apparatus to perform the methods as described in any of the previous paragraphs.
[0073] Yet another example computer readable medium includes program instructions for causing an apparatus to perform at least the following: overfit one or more candidate neural network versions on an evaluation set to obtain a first set of overfitted neural network versions; evaluate performance of the first set of overfitted neural network versions on the evaluation set; select a first overfitted neural network version based on one or more predetermined performance criteria, to obtain a selected first overfitted neural network; when the evaluation set is different from an overfitting set: overfit a neural network version, used to obtain the selected first overfitted neural network version, on an overfitting set to obtain a second overfitted neural network version; and run the second overfitted neural network version on an inference set; and run the selected first overfitted neural network version on the inference set when the evaluation set is same or substantially same as the overfitting set.
[0074] The example computer readable medium may further include, wherein the computer readable medium comprises a non-transitory computer readable medium.
[0075] The example computer readable medium may further include, wherein the computer readable medium further causes the apparatus to perform the methods as described in any of the previous paragraphs.
[0076] Still another example apparatus includes means for performing the methods as described in any of the previous paragraphs.
BRIEF DESCRIPTION OF THE DRAWINGS
[0077] The foregoing aspects and other features are explained in the following description, taken in connection with the accompanying drawings, wherein:
[0078] FIG. 1 shows schematically an electronic device employing embodiments of the examples described herein.
[0079] FIG. 2 shows schematically a user equipment suitable for employing embodiments of the examples described herein.
[0080] FIG. 3 further shows schematically electronic devices employing embodiments of the examples described herein connected using wireless and wired network connections.
[0081] FIG. 4 shows schematically a block diagram of an encoder on a general level.
[0082] FIG. 5 is a block diagram showing an interface between an encoder and a decoder in accordance with the examples described herein.
[0083] FIG. 6 illustrates a system configured to support streaming of media data from a source to a client device.
[0084] FIG. 7 is a block diagram of an apparatus that may be specifically configured in accordance with an example embodiment.
[0085] FIG. 8 illustrates examples of functioning of neural networks (NNs) as components of a pipeline of a traditional codec, in accordance with an example embodiment.
[0086] FIG. 9 illustrates an example of modified video coding pipeline based on neural networks, in accordance with an example embodiment.
[0087] FIG. 10 is an example neural network-based end-to-end learned video coding system, in accordance with an example embodiment.
[0088] FIG. 11 illustrates a pipeline of video coding for machines (VCM), in accordance with an embodiment.
[0089] FIG. 12 illustrates an example of an end-to-end learned approach for the use case of video coding for machines, in accordance with an embodiment.
[0090] FIG. 13 illustrates an example of how the end-to-end learned system may be trained for the use case of video coding for machines, in accordance with an embodiment.
[0091] FIG. 14 illustrates an example for overfitting a decoder-side neural network based on refinement operations, in accordance with an embodiment.
[0092] FIG. 15 is an example apparatus, which may be implemented in hardware, and is caused to implement mechanisms for optimizing overfitting of neural network filters or optimizing one or more parameters of one or more processing operations, based on the examples described herein.
[0093] FIG. 16 illustrates an example method for optimizing the overfitting of neural network filters, in accordance with an embodiment.
[0094] FIG. 17 illustrates an example method for optimizing the overfitting of neural network, in accordance with another embodiment.
[0095] FIG. 18 illustrates an example method for optimizing one or more parameters of one or more processing operations at an encoder side, in accordance with an embodiment.
[0096] FIG. 19 illustrates an example method for optimizing the overfitting of neural network, in accordance with yet another embodiment.
[0097] FIG. 20 is a block diagram of one possible and non-limiting system in which the example embodiments may be practiced.
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
[0098] The following acronyms and abbreviations that may be found in the specification and/or the drawing figures are defined as follows:
3GP 3GPP file format
3GPP 3rd Generation Partnership Project
3GPP TS 3GPP technical specification
4CC four character code
4G fourth generation of broadband cellular network technology
5G fifth generation cellular network technology
5GC 5G core network
ACC accuracy
AGT approximated ground truth data
AI artificial intelligence
AIoT AI-enabled IoT
ALF adaptive loop filtering
a.k.a. also known as
AMF access and mobility management function
APS adaptation parameter set
AVC advanced video coding
bpp bits-per-pixel
CABAC context-adaptive binary arithmetic coding
CDMA code-division multiple access
CE core experiment
ctu coding tree unit
CU central unit
DASH dynamic adaptive streaming over HTTP
DCT discrete cosine transform
DSP digital signal processor
DSNN decoder-side NN
DU distributed unit
eNB (or eNodeB) evolved Node B (for example, an LTE base station)
EN-DC E-UTRA-NR dual connectivity
en-gNB or En-gNB node providing NR user plane and control plane protocol terminations towards the UE, and acting as secondary node in EN-DC
E-UTRA evolved universal terrestrial radio access, for example, the LTE radio access technology
FDMA frequency division multiple access
f(n) fixed-pattern bit string using n bits written (from left to right) with the left bit first
F1 or F1-C interface between CU and DU control interface
FDC finetuning-driving content
gNB (or gNodeB) base station for 5G/NR, for example, a node providing NR user plane and control plane protocol terminations towards the UE, and connected via the NG interface to the 5GC
GSM Global System for Mobile communications
H.222.0 MPEG-2 Systems is formally known as ISO/IEC 13818-1 and as ITU-T Rec. H.222.0
H.26x family of video coding standards in the domain of the ITU-T
HLS high level syntax
HQ high-quality
IBC intra block copy
ID identifier
IEC International Electrotechnical Commission
IEEE Institute of Electrical and Electronics Engineers
I/F interface
IMD integrated messaging device
IMS instant messaging service
IoT internet of things
IP internet protocol
IRAP intra random access point
ISO International Organization for Standardization
ISOBMFF ISO base media file format
ITU International Telecommunication Union
ITU-T ITU Telecommunication Standardization Sector
JPEG joint photographic experts group
LMCS luma mapping with chroma scaling
LPNN loss proxy NN
LQ low-quality
LTE long-term evolution
LZMA Lempel-Ziv-Markov chain compression
LZMA2 simple container format that can include both uncompressed data and LZMA data
LZO Lempel-Ziv-Oberhumer compression
LZW Lempel-Ziv-Welch compression
MAC medium access control
mdat MediaDataBox
MME mobility management entity
MMS multimedia messaging service
moov MovieBox
MP4 file format for MPEG-4 Part 14 files
MPEG moving picture experts group
MPEG-2 H.222/H.262 as defined by the ITU
MPEG-4 audio and video coding standard for ISO/IEC 14496
MSB most significant bit
NAL network abstraction layer
NDU NN compressed data unit
ng or NG new generation
ng-eNB or NG-eNB new generation eNB
NN neural network
NNEF neural network exchange format
NNR neural network representation
NR new radio (5G radio)
N/W or NW network
ONNX Open Neural Network eXchange
PB protocol buffers
PC personal computer
PDA personal digital assistant
PDCP packet data convergence protocol
PHY physical layer
PID packet identifier
PLC power line communication
PNG portable network graphics
PSNR peak signal-to-noise ratio
RAM random access memory
RAN radio access network
RBSP raw byte sequence payload
RD loss rate distortion loss
RFC request for comments
RFID radio frequency identification
RLC radio link control
RRC radio resource control
RRH remote radio head
RU radio unit
Rx receiver
SDAP service data adaptation protocol
SGD Stochastic Gradient Descent
SGW serving gateway
SMF session management function
SMS short messaging service
SPS sequence parameter set
st(v) null-terminated string encoded as UTF-8 characters as specified in ISO/IEC 10646
SVC scalable video coding
S1 interface between eNodeBs and the EPC
TCP-IP transmission control protocol-internet protocol
TDMA time divisional multiple access
trak TrackBox
TS transport stream
TUC technology under consideration
TV television
Tx transmitter
UE user equipment
ue(v) unsigned integer Exp-Golomb-coded syntax element with the left bit first
UICC Universal Integrated Circuit Card
UMTS Universal Mobile Telecommunications System
u(n) unsigned integer using n bits
UPF user plane function
URI uniform resource identifier
URL uniform resource locator
UTF-8 8-bit Unicode Transformation Format
VPS video parameter set
WLAN wireless local area network
X2 interconnecting interface between two eNodeBs in LTE network
Xn interface between two NG-RAN nodes
[0099] Some embodiments will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments are shown. Indeed, various embodiments may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like reference numerals refer to like elements throughout. As used herein, the terms ‘data,’ ‘content,’ ‘information,’ and similar terms may be used interchangeably to refer to data capable of being transmitted, received and/or stored in accordance with embodiments. Thus, use of any such terms should not be taken to limit the spirit and scope of embodiments.
[0100] Additionally, as used herein, the term ‘circuitry’ refers to (a) hardware-only circuit implementations (e.g., implementations in analog circuitry and/or digital circuitry); (b) combinations of circuits and computer program product(s) comprising software and/or firmware instructions stored on one or more computer readable memories that work together to cause an apparatus to perform one or more functions described herein; and (c) circuits, such as, for example, a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation even if the software or firmware is not physically present. This definition of ‘circuitry’ applies to all uses of this term herein, including in any claims. As a further example, as used herein, the term ‘circuitry’ also includes an implementation comprising one or more processors and/or portion(s) thereof and accompanying software and/or firmware. As another example, the term ‘circuitry’ as used herein also includes, for example, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, other network device, and/or other computing device.
[0101] As defined herein, a ‘computer-readable storage medium,’ which refers to a non-transitory physical storage medium (e.g., volatile or non-volatile memory device), can be differentiated from a ‘computer-readable transmission medium,’ which refers to an electromagnetic signal.
[0102] A method, apparatus and computer program product are provided in accordance with example embodiments for optimizing the overfitting of neural network filters or optimizing one or more parameters of one or more processing operations.
[0103] In an example, the following describes in detail suitable apparatus and possible mechanisms for optimizing the overfitting of neural network filters or optimizing one or more parameters of one or more processing operations. In this regard reference is first made to FIG. 1 and FIG. 2, where FIG. 1 shows an example block diagram of an apparatus 50. The apparatus may be an internet of things (IoT) apparatus configured to perform various functions, for example, gathering information by one or more sensors, receiving or transmitting information, analyzing information gathered or received by the apparatus, or the like. The apparatus may comprise a video coding system, which may incorporate a codec. FIG. 2 shows a layout of an apparatus according to an example embodiment. The elements of FIG. 1 and FIG. 2 will be explained next.
[0104] The apparatus 50 may for example be a mobile terminal or user equipment of a wireless communication system, a sensor device, a tag, or a lower power device. However, it would be appreciated that embodiments of the examples described herein may be implemented within any electronic device or apparatus which may process data by neural networks.
[0105] The apparatus 50 may comprise a housing 30 for incorporating and protecting the device. The apparatus 50 may further comprise a display 32, for example, in the form of a liquid crystal display, light emitting diode display, organic light emitting diode display, and the like. In other embodiments of the examples described herein the display may be any suitable display technology suitable to display media or multimedia content, for example, an image or a video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the examples described herein any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
[0106] The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the examples described herein may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery (or in other embodiments of the examples described herein the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera 42 capable of recording or capturing images and/or video. The apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices. In other embodiments the apparatus 50 may further comprise any suitable short range communication solution such as for example a Bluetooth wireless connection or a USB/firewire wired connection.
[0107] The apparatus 50 may comprise a controller 56, a processor or a processor circuitry for controlling the apparatus 50. The controller 56 may be connected to a memory 58 which in embodiments of the examples described herein may store both data in the form of an image, audio data and video data, and/or may also store instructions for implementation on the controller 56. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and/or decoding of audio, image and/or video data or assisting in coding and/or decoding carried out by the controller.
[0108] The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example, a UICC and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
[0109] The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals, for example, for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and/or for receiving radio frequency signals from other apparatus(es).
[0110] The apparatus 50 may comprise a camera 42 capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video image data for processing from another device prior to transmission and/or storage. The apparatus 50 may also receive either wirelessly or by a wired connection the image for coding/decoding. The structural elements of apparatus 50 described above represent examples of means for performing a corresponding function.
[0111] With respect to FIG. 3, an example of a system within which embodiments of the examples described herein can be utilized is shown. The system 10 comprises multiple communication devices which can communicate through one or more networks. The system 10 may comprise any combination of wired or wireless networks including, but not limited to, a wireless cellular telephone network (such as a GSM, UMTS, CDMA, LTE, 4G, 5G network, and the like), a wireless local area network (WLAN) such as defined by any of the IEEE 802.x standards, a Bluetooth® personal area network, an Ethernet local area network, a token ring local area network, a wide area network, and the Internet.
[0112] The system 10 may include both wired and wireless communication devices and/or apparatus 50 suitable for implementing embodiments of the examples described herein.
[0113] For example, the system shown in FIG. 3 shows a mobile telephone network 11 and a representation of the Internet 28. Connectivity to the Internet 28 may include, but is not limited to, long range wireless connections, short range wireless connections, and various wired connections including, but not limited to, telephone lines, cable lines, power lines, and similar communication pathways.
[0114] The example communication devices shown in the system 10 may include, but are not limited to, an electronic device or apparatus 50, a combination of a personal digital assistant (PDA) and a mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, a notebook computer 22. The apparatus 50 may be stationary or mobile when carried by an individual who is moving. The apparatus 50 may also be located in a mode of transport including, but not limited to, a car, a truck, a taxi, a bus, a train, a boat, an airplane, a bicycle, a motorcycle or any similar suitable mode of transport.
[0115] The embodiments may also be implemented in a set-top box; for example, a digital TV receiver, which may/may not have a display or wireless capabilities, in tablets or (laptop) personal computers (PC), which have hardware and/or software to process neural network data, in various operating systems, and in chipsets, processors, DSPs and/or embedded systems offering hardware/software based coding.
[0116] Some or further apparatus may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24. The base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network 11 and the Internet 28. The system may include additional communication devices and communication devices of various types.
[0117] The communication devices may communicate using various transmission technologies including, but not limited to, code division multiple access (CDMA), global systems for mobile communications (GSM), universal mobile telecommunications system (UMTS), time divisional multiple access (TDMA), frequency division multiple access (FDMA), transmission control protocol-internet protocol (TCP-IP), short messaging service (SMS), multimedia messaging service (MMS), email, instant messaging service (IMS), Bluetooth, IEEE 802.11, 3GPP Narrowband IoT and any similar wireless communication technology. A communications device involved in implementing various embodiments of the examples described herein may communicate using various media including, but not limited to, radio, infrared, laser, cable connections, and any suitable connection.
[0118] In telecommunications and data networks, a channel may refer either to a physical channel or to a logical channel. A physical channel may refer to a physical transmission medium such as a wire, whereas a logical channel may refer to a logical connection over a multiplexed medium, capable of conveying several logical channels. A channel may be used for conveying an information signal, for example a bitstream, from one or several senders (or transmitters) to one or several receivers.
[0119] The embodiments may also be implemented in internet of things (IoT) devices. The IoT may be defined, for example, as an interconnection of uniquely identifiable embedded computing devices within the existing Internet infrastructure. The convergence of various technologies has enabled, and may further enable, many fields of embedded systems, such as wireless sensor networks, control systems, home/building automation, and the like, to be included in the IoT. In order to utilize the IoT, devices are provided with an IP address as a unique identifier. The IoT devices may be provided with a radio transmitter, such as a WLAN or Bluetooth transmitter or an RFID tag. Alternatively, the IoT devices may have access to an IP-based network via a wired network, such as an Ethernet-based network or a power-line connection (PLC).
[0120] The devices/systems described in FIGs. 1 to 3 enable encoding, decoding, and/or transportation of, for example, a neural network representation and/or a media bitstream.
[0121] An MPEG-2 transport stream (TS), specified in ISO/IEC 13818-1 or equivalently in ITU-T Recommendation H.222.0, is a format for carrying audio, video, and other media as well as program metadata or other metadata, in a multiplexed stream. A packet identifier (PID) is used to identify an elementary stream (a.k.a. packetized elementary stream) within the TS. Hence, a logical channel within an MPEG-2 TS may be considered to correspond to a specific PID value.
[0122] Available media file format standards include ISO base media file format (ISO/IEC 14496-12, which may be abbreviated ISOBMFF) and file format for NAL unit structured video (ISO/IEC 14496-15), which derives from the ISOBMFF.
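To make the correspondence between PID values and logical channels concrete, the following sketch groups fixed-size transport stream packets by PID. It relies on the standard 188-byte packet size, the 0x47 sync byte, and the 13-bit PID field spanning the second and third header bytes; the function names are hypothetical, and adaptation fields, continuity counters and error handling beyond a basic check are left out.

```python
from collections import defaultdict

TS_PACKET_SIZE = 188   # bytes per transport stream packet
SYNC_BYTE = 0x47       # first byte of every packet


def packet_pid(packet):
    """Extract the 13-bit PID from the header of a single TS packet."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid transport stream packet")
    return ((packet[1] & 0x1F) << 8) | packet[2]


def group_by_pid(stream):
    """Split a byte string of back-to-back TS packets into per-PID packet lists."""
    channels = defaultdict(list)
    for offset in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = stream[offset:offset + TS_PACKET_SIZE]
        channels[packet_pid(packet)].append(packet)
    return dict(channels)
```

Each key of the returned dictionary then plays the role of one logical channel within the multiplexed stream.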
[0123] A video codec includes an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can decompress the compressed video representation back into a viewable form, or into a form that is suitable as an input to one or more algorithms for analysis or processing. A video encoder and/or a video decoder may also be separate from each other, that is, they need not form a codec. Typically, the encoder discards some information in the original video sequence in order to represent the video in a more compact form (e.g., at lower bitrate).
[0124] Typical hybrid video encoders, for example, many encoder implementations of ITU-T H.263 and H.264, encode the video information in two phases. Firstly, pixel values in a certain picture area (or ‘block’) are predicted, for example, by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner). Secondly, the prediction error, that is, the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (for example, Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (picture quality) and size of the resulting coded video representation (file size or transmission bitrate).
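The trade-off between quantization fidelity and reconstruction accuracy described above can be illustrated with a deliberately simplified sketch. It quantizes the prediction error directly in the pixel domain; the transform and entropy-coding stages are omitted, and the function names and the uniform quantizer are assumptions for illustration rather than any particular codec's design.

```python
def encode_block(original, predicted, step):
    """Quantize the prediction error of a block to integer levels."""
    return [round((o - p) / step) for o, p in zip(original, predicted)]


def decode_block(levels, predicted, step):
    """Reconstruct the block from the prediction and the dequantized error."""
    return [p + q * step for q, p in zip(levels, predicted)]


original = [100, 104, 109, 115]
predicted = [101, 103, 110, 112]     # e.g. from motion compensation

fine = encode_block(original, predicted, step=1)      # small step: exact residual
coarse = encode_block(original, predicted, step=4)    # large step: mostly zeros

recon_fine = decode_block(fine, predicted, step=1)
recon_coarse = decode_block(coarse, predicted, step=4)
```

The coarser step produces fewer non-zero levels, which an entropy coder could represent in fewer bits, at the price of a reconstruction that no longer matches the original block exactly.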
[0125] In temporal prediction, the sources of prediction are previously decoded pictures (a.k.a. reference pictures). In intra block copy (IBC; a.k.a. intra-block-copy prediction and current picture referencing), prediction is applied similarly to temporal prediction, but the reference picture is the current picture and only previously decoded samples can be referred to in the prediction process. Inter-layer or inter-view prediction may be applied similarly to temporal prediction, but the reference picture is a decoded picture from another scalable layer or from another view, respectively. In some cases, inter prediction may refer to temporal prediction only, while in other cases inter prediction may refer collectively to temporal prediction and any of intra block copy, inter-layer prediction, and inter-view prediction provided that they are performed with the same or a similar process as temporal prediction. Inter prediction or temporal prediction may sometimes be referred to as motion compensation or motion-compensated prediction.
[0126] Inter prediction, which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, reduces temporal redundancy. In inter prediction the sources of prediction are previously decoded pictures. Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in spatial or transform domain, for example, either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra-coding, where no inter prediction is applied.
[0127] One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently when they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
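The prediction of a motion vector from spatially adjacent motion vectors may be illustrated with the following sketch. The component-wise median is used here only as one familiar choice of predictor and is an assumption of the example, as are the function names; the description above does not prescribe a particular predictor.

```python
def median_predictor(neighbors):
    """Component-wise median of the neighboring motion vectors (x, y)."""
    xs = sorted(mv[0] for mv in neighbors)
    ys = sorted(mv[1] for mv in neighbors)
    mid = len(neighbors) // 2
    return (xs[mid], ys[mid])


def encode_mv(mv, neighbors):
    """Return the motion vector difference that would be entropy coded."""
    px, py = median_predictor(neighbors)
    return (mv[0] - px, mv[1] - py)


def decode_mv(mvd, neighbors):
    """Rebuild the motion vector from the coded difference and the same predictor."""
    px, py = median_predictor(neighbors)
    return (mvd[0] + px, mvd[1] + py)
```

Because neighboring blocks tend to move together, the coded difference is typically small, which is what makes it cheaper to entropy code than the motion vector itself.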
[0128] FIG. 4 shows a block diagram of a general structure of a video encoder. FIG. 4 presents an encoder for two layers, but it would be appreciated that the presented encoder could be similarly extended to encode more than two layers. FIG. 4 illustrates a video encoder comprising a first encoder section 500 for a base layer and a second encoder section 502 for an enhancement layer. Each of the first encoder section 500 and the second encoder section 502 may comprise similar elements for encoding incoming pictures. The encoder sections 500, 502 may comprise a pixel predictor 302, 402, prediction error encoder 303, 403 and prediction error decoder 304, 404. FIG. 4 also shows an embodiment of the pixel predictor 302, 402 as comprising an inter-predictor 306, 406, an intra-predictor 308, 408, a mode selector 310, 410, a filter 316, 416, and a reference frame memory 318, 418. The pixel predictor 302 of the first encoder section 500 receives base layer picture(s)/image(s) 300 of a video stream to be encoded at both the inter-predictor 306 (which determines the difference between the image and a motion compensated reference frame) and the intra-predictor 308 (which determines a prediction for an image block based only on the already processed parts of the current frame or picture). The outputs of both the inter-predictor and the intra-predictor are passed to the mode selector 310. The intra-predictor 308 may have more than one intra-prediction mode. Hence, each mode may perform the intra-prediction and provide the predicted signal to the mode selector 310. The mode selector 310 also receives a copy of the base layer image(s) 300. Correspondingly, the pixel predictor 402 of the second encoder section 502 receives enhancement layer picture(s)/image(s) 400 of a video stream to be encoded at both the inter-predictor 406 (which determines the difference between the image and a motion compensated reference frame) and the intra-predictor 408 (which determines a prediction for an image block based only on the already processed parts of the current frame or picture). The outputs of both the inter-predictor and the intra-predictor are passed to the mode selector 410. The intra-predictor 408 may have more than one intra-prediction mode. Hence, each mode may perform the intra-prediction and provide the predicted signal to the mode selector 410. The mode selector 410 also receives a copy of the enhancement layer image(s) 400.
[0129] Depending on which encoding mode is selected to encode the current block, the output of the inter-predictor 306, 406 or the output of one of the optional intra-predictor modes or the output of a surface encoder within the mode selector is passed to the output of the mode selector 310, 410. The output of the mode selector 310, 410 is passed to a first summing device 321, 421. The first summing device may subtract the output of the pixel predictor 302, 402 from the base layer image(s) 300/enhancement layer image(s) 400 to produce a first prediction error signal 320, 420 which is input to the prediction error encoder 303, 403.
[0130] The pixel predictor 302, 402 further receives from a preliminary reconstructor 339, 439 the combination of the prediction representation of the image block 312, 412 and the output 338, 438 of the prediction error decoder 304, 404. The preliminary reconstructed image 314, 414 may be passed to the intra-predictor 308, 408 and to the filter 316, 416. The filter 316, 416 receiving the preliminary representation may filter the preliminary representation and output a final reconstructed image 340, 440 which may be saved in the reference frame memory 318, 418. The reference frame memory 318 may be connected to the inter-predictor 306 to be used as the reference image against which a future base layer image 300 is compared in inter-prediction operations. Subject to the base layer being selected and indicated to be the source for inter-layer sample prediction and/or inter-layer motion information prediction of the enhancement layer according to some embodiments, the reference frame memory 318 may also be connected to the inter-predictor 406 to be used as the reference image against which the future enhancement layer image(s) 400 is compared in inter-prediction operations. Moreover, the reference frame memory 418 may be connected to the inter-predictor 406 to be used as the reference image against which the future enhancement layer image(s) 400 is compared in inter-prediction operations.
[0131] Filtering parameters from the filter 316 of the first encoder section 500 may be provided to the second encoder section 502 subject to the base layer being selected and indicated to be source for predicting the filtering parameters of the enhancement layer according to some embodiments.
[0132] The prediction error encoder 303, 403 comprises a transform unit 342, 442 and a quantizer 344, 444. The transform unit 342, 442 transforms the first prediction error signal 320, 420 to a transform domain. The transform is, for example, the DCT transform. The quantizer 344, 444 quantizes the transform domain signal, for example, the DCT coefficients, to form quantized coefficients.
[0133] The prediction error decoder 304, 404 receives the output from the prediction error encoder 303, 403 and performs the opposite processes of the prediction error encoder 303, 403 to produce a decoded prediction error signal 338, 438 which, when combined with the prediction representation of the image block 312, 412 at the second summing device 339, 439, produces the preliminary reconstructed image 314, 414. The prediction error decoder may be considered to comprise a dequantizer 346, 446, which dequantizes the quantized coefficient values, for example, DCT coefficients, to reconstruct the transform signal and an inverse transformation unit 348, 448, which performs the inverse transformation to the reconstructed transform signal wherein the output of the inverse transformation unit 348, 448 contains reconstructed block(s). The prediction error decoder may also comprise a block filter which may filter the reconstructed block(s) according to further decoded information and filter parameters.
[0134] The entropy encoder 330, 430 receives the output of the prediction error encoder 303, 403 and may perform a suitable entropy encoding/variable length encoding on the signal to provide a compressed signal. The outputs of the entropy encoders 330, 430 may be inserted into a bitstream, for example, by a multiplexer 508.
[0135] FIG. 5 is a block diagram showing the interface between an encoder 501 implementing neural network based encoding 503, and a decoder 504 implementing neural network based decoding 505 in accordance with the examples described herein. The encoder 501 may embody a device, a software method or a hardware circuit. The encoder 501 has the goal of compressing an input data 511 (for example, an input video) to a compressed data 512 (for example, a bitstream) such that the bitrate measuring the size of compressed data 512 is minimized, and the accuracy of an analysis or processing algorithm is maximized. To this end, the encoder 501 uses an encoder or compression algorithm, for example to perform neural network based encoding 503, e.g., encoding the input data by using one or more neural networks.
[0136] The general analysis or processing algorithm may be part of the decoder 504. The decoder 504 uses a decoder or decompression algorithm, for example, to perform the neural network based decoding 505 (e.g., decoding by using one or more neural networks) to decode the compressed data 512 (for example, compressed video) which was encoded by the encoder 501. The decoder 504 produces decompressed data 513 (for example, reconstructed data).
[0137] The encoder 501 and decoder 504 may be entities implementing an abstraction, may be separate entities or the same entities, or may be part of the same physical device.
[0138] The analysis/processing algorithm may be any algorithm, traditional or learned from data. In the case of an algorithm which is learned from data, in some embodiments it is assumed that this algorithm can be modified or updated, for example, by using optimization via gradient descent. An example of the learned algorithm is a neural network.
[0139] An out-of-band transmission, signaling, or storage may refer to the capability of transmitting, signaling, or storing information in a manner that associates the information with a video bitstream. The out-of-band transmission may use a more reliable transmission mechanism compared to the protocols used for carrying coded video data, such as slices. The out-of-band transmission, signaling or storage can additionally or alternatively be used e.g. for ease of access or session negotiation. For example, a sample entry of a track in a file conforming to the ISO Base Media File Format may comprise parameter sets, while the coded data in the bitstream is stored elsewhere in the file or in another file. Another example of out-of-band transmission, signaling, or storage comprises including information, such as NN and/or NN updates in a file format track that is separate from track(s) containing coded video data.
[0140] The phrase along the bitstream (e.g. indicating along the bitstream) or along a coded unit of a bitstream (e.g. indicating along a coded tile) may be used in claims and described embodiments to refer to transmission, signaling, or storage in a manner that the ‘out-of-band’ data is associated with, but not included within, the bitstream or the coded unit, respectively. The phrase decoding along the bitstream or along a coded unit of a bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream or the coded unit, respectively. For example, the phrase along the bitstream may be used when the bitstream is contained in a container file, such as a file conforming to the ISO Base Media File Format, and certain file metadata is stored in the file in a manner that associates the metadata to the bitstream, such as boxes in the sample entry for a track containing the bitstream, a sample group for the track containing the bitstream, or a timed metadata track associated with the track containing the bitstream. In another example, the phrase along the bitstream may be used when the bitstream is made available as a stream over a communication protocol and a media description, such as a streaming manifest, is provided to describe the stream.
[0141] An elementary unit for the output of a video encoder and the input of a video decoder, respectively, may be a network abstraction layer (NAL) unit. For transport over packet-oriented networks or storage into structured files, NAL units may be encapsulated into packets or similar structures. A bytestream format encapsulating NAL units may be used for transmission or storage environments that do not provide framing structures. The bytestream format may separate NAL units from each other by attaching a start code in front of each NAL unit. To avoid false detection of NAL unit boundaries, encoders may run a byte-oriented start code emulation prevention algorithm, which may add an emulation prevention byte to the NAL unit payload if a start code would have occurred otherwise. In order to enable straightforward gateway operation between packet and stream-oriented systems, start code emulation prevention may be performed regardless of whether the bytestream format is in use or not. A NAL unit may be defined as a syntax structure containing an indication of the type of data to follow and bytes containing that data in the form of a raw byte sequence payload interspersed as necessary with emulation prevention bytes. A raw byte sequence payload (RBSP) may be defined as a syntax structure containing an integer number of bytes that is encapsulated in a NAL unit. An RBSP is either empty or has the form of a string of data bits containing syntax elements followed by an RBSP stop bit and followed by zero or more subsequent bits equal to 0.
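The start code framing and emulation prevention described above may be sketched as follows. The sketch assumes the three-byte start code 0x000001 and the emulation prevention byte 0x03 used by the H.264/AVC, H.265/HEVC, and H.266/VVC bytestream formats. The function names are hypothetical, and each NAL unit passed to `byte_stream` is assumed to be already escaped and not to end in a zero byte.

```python
START_CODE = b"\x00\x00\x01"


def add_emulation_prevention(payload: bytes) -> bytes:
    """Insert 0x03 wherever two zero bytes would be followed by a byte <= 3."""
    out = bytearray()
    zeros = 0
    for b in payload:
        if zeros >= 2 and b <= 3:
            out.append(0x03)   # emulation prevention byte
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


def remove_emulation_prevention(data: bytes) -> bytes:
    """Drop each 0x03 that directly follows two zero bytes."""
    out = bytearray()
    zeros = 0
    for b in data:
        if zeros >= 2 and b == 3:
            zeros = 0
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


def byte_stream(nal_units) -> bytes:
    """Attach a start code in front of each (already escaped) NAL unit."""
    return b"".join(START_CODE + unit for unit in nal_units)


def split_byte_stream(stream: bytes):
    """Recover the NAL units by splitting at start codes."""
    return stream.split(START_CODE)[1:]
```

Since the escaped payload can no longer contain a start code, a receiver may locate NAL unit boundaries by a plain byte search.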
[0142] In some coding standards, NAL units include a header and payload. The NAL unit header indicates the type of the NAL unit. In some coding standards, the NAL unit header indicates a scalability layer identifier (e.g. called nuh_layer_id in H.265/HEVC and H.266/VVC), which could be used e.g. for indicating spatial or quality layers, views of a multiview video, or auxiliary layers (such as depth maps or alpha planes). In some coding standards, the NAL unit header includes a temporal sublayer identifier, which may be used for indicating temporal subsets of the bitstream, such as a 30-frames-per-second subset of a 60-frames-per-second bitstream.
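As an illustration, the following sketch reads the fields of a two-byte H.265/HEVC NAL unit header, assuming the layout of one forbidden zero bit, a six-bit NAL unit type, a six-bit nuh_layer_id, and a three-bit temporal identifier plus one. It then uses the temporal identifier to keep only a temporal subset of the NAL units. The class and function names are hypothetical, and the H.266/VVC header orders its fields differently.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    nal_unit_type: int
    nuh_layer_id: int
    temporal_id: int


def parse_hevc_nal_header(nal_unit: bytes) -> NalHeader:
    """Parse the two-byte header at the start of an H.265/HEVC NAL unit."""
    if len(nal_unit) < 2:
        raise ValueError("an HEVC NAL unit header is two bytes long")
    word = (nal_unit[0] << 8) | nal_unit[1]
    if word >> 15:
        raise ValueError("forbidden_zero_bit must be 0")
    nal_unit_type = (word >> 9) & 0x3F
    nuh_layer_id = (word >> 3) & 0x3F
    temporal_id_plus1 = word & 0x07
    if temporal_id_plus1 == 0:
        raise ValueError("nuh_temporal_id_plus1 must not be 0")
    return NalHeader(nal_unit_type, nuh_layer_id, temporal_id_plus1 - 1)


def extract_temporal_subset(nal_units, max_temporal_id: int):
    """Keep only NAL units whose temporal sublayer does not exceed the limit."""
    return [unit for unit in nal_units
            if parse_hevc_nal_header(unit).temporal_id <= max_temporal_id]
```

Dropping the NAL units of the highest temporal sublayer is one way in which, for example, a lower frame rate subset of a bitstream may be obtained.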
[0143] NAL units may be categorized into Video Coding Layer (VCL) NAL units and non-VCL NAL units. VCL NAL units are typically coded slice NAL units.
[0144] A non-VCL NAL unit may be, for example, one of the following types: a video parameter set (VPS), a sequence parameter set (SPS), a picture parameter set (PPS), an adaptation parameter set (APS), a supplemental enhancement information (SEI) NAL unit, an access unit delimiter, an end of sequence NAL unit, an end of bitstream NAL unit, or a filler data NAL unit. Parameter sets may be needed for the reconstruction of decoded pictures, whereas many of the other non-VCL NAL units are not necessary for the reconstruction of decoded sample values.
[0145] Some coding formats specify parameter sets that may carry parameter values needed for the decoding or reconstruction of decoded pictures. A parameter may be defined as a syntax element of a parameter set. A parameter set may be defined as a syntax structure that contains parameters and that can be referred to from or activated by another syntax structure, for example, using an identifier.
[0146] Some types of parameter sets are briefly described in the following, but it is to be understood that other types of parameter sets may exist and that embodiments may be applied to, but are not limited to, the described types of parameter sets.
[0147] Parameters that remain unchanged through a coded video sequence may be included in a sequence parameter set. Alternatively, an SPS may be limited to apply to a layer that references the SPS, e.g. an SPS may remain valid for a coded layer video sequence. In addition to the parameters that may be needed by the decoding process, the sequence parameter set may optionally contain video usability information (VUI), which includes parameters that may be important for buffering, picture output timing, rendering, and resource reservation.
[0148] A picture parameter set contains such parameters that are likely to be unchanged in several coded pictures. A picture parameter set may include parameters that can be referred to by the VCL NAL units of one or more coded pictures.
[0149] A video parameter set (VPS) may be defined as a syntax structure containing syntax elements that apply to zero or more entire coded video sequences and may contain parameters applying to multiple layers. The VPS may provide information about the dependency relationships of the layers in a bitstream, as well as other information that is applicable to all slices across all layers in the entire coded video sequence.
[0150] A video parameter set RBSP may include parameters that can be referred to by one or more sequence parameter set RBSPs.
[0151] The relationship and hierarchy between a video parameter set (VPS), a sequence parameter set (SPS), and a picture parameter set (PPS) may be described as follows. A VPS resides one level above an SPS in the parameter set hierarchy and in the context of scalability. The VPS may include parameters that are common for all slices across all layers in the entire coded video sequence. The SPS includes the parameters that are common for all slices in a particular layer in the entire coded video sequence, and may be shared by multiple layers. The PPS includes the parameters that are common for all slices in a particular picture and are likely to be shared by all slices in multiple pictures.
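The reference chain from a slice to its PPS, from the PPS to an SPS, and from the SPS to a VPS may be sketched with the following data structures. All class names, field names, and example parameters (`max_layers`, `width`, `height`, `init_qp`) are hypothetical simplifications chosen for the illustration and do not reproduce the syntax of any coding standard.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class VPS:
    vps_id: int
    max_layers: int          # example of a parameter common to all layers


@dataclass
class SPS:
    sps_id: int
    vps_id: int              # reference one level up the hierarchy
    width: int
    height: int


@dataclass
class PPS:
    pps_id: int
    sps_id: int              # reference one level up the hierarchy
    init_qp: int


class ParameterSetStore:
    """Holds received parameter sets, indexed by their identifiers."""

    def __init__(self) -> None:
        self.vps: Dict[int, VPS] = {}
        self.sps: Dict[int, SPS] = {}
        self.pps: Dict[int, PPS] = {}

    def add(self, ps) -> None:
        # A later parameter set with the same identifier replaces the earlier one.
        if isinstance(ps, VPS):
            self.vps[ps.vps_id] = ps
        elif isinstance(ps, SPS):
            self.sps[ps.sps_id] = ps
        elif isinstance(ps, PPS):
            self.pps[ps.pps_id] = ps
        else:
            raise TypeError("unknown parameter set type")

    def activate(self, pps_id: int) -> Tuple[VPS, SPS, PPS]:
        """Resolve the chain PPS -> SPS -> VPS for a slice referring to pps_id."""
        pps = self.pps[pps_id]
        sps = self.sps[pps.sps_id]
        vps = self.vps[sps.vps_id]
        return vps, sps, pps
```

In this sketch a slice only needs to carry a PPS identifier; the sequence-level and video-level parameters follow from the stored references, which is why several pictures can share the same SPS.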
[0152] An adaptation parameter set (APS) may be specified in some coding formats, such as H.266/VVC. An APS may be applied to one or more image segments, such as slices. In H.266/VVC, an APS may be defined as a syntax structure containing syntax elements that apply to zero or more slices as determined by zero or more syntax elements found in slice headers or in a picture header. An APS may comprise a type (aps_params_type in H.266/VVC) and an identifier (aps_adaptation_parameter_set_id in H.266/VVC). The combination of an APS type and an APS identifier may be used to identify a particular APS. H.266/VVC comprises three APS types: an adaptive loop filtering (ALF) APS type, a luma mapping with chroma scaling (LMCS) APS type, and a scaling list APS type. The ALF APS(s) are referenced from a slice header (thus, the referenced ALF APSs can change slice by slice), and the LMCS and scaling list APS(s) are referenced from a picture header (thus, the referenced LMCS and scaling list APSs can change picture by picture). In H.266/VVC, the APS RBSP has the following syntax:
[0153] Video coding specifications may enable the use of supplemental enhancement information (SEI) messages or alike. Some video coding specifications include SEI NAL units, and some video coding specifications contain both prefix SEI NAL units and suffix SEI NAL units. A prefix SEI NAL unit can start a picture unit or alike; and a suffix SEI NAL unit can end a picture unit or alike. Hereafter, an SEI NAL unit may equivalently refer to a prefix SEI NAL unit or a suffix SEI NAL unit. An SEI NAL unit includes one or more SEI messages, which are not required for the decoding of output pictures but may assist in related processes, such as picture output timing, post-processing of decoded pictures, rendering, error detection, error concealment, and resource reservation.
[0154] Several SEI messages are specified in H.264/AVC, H.265/HEVC, H.266/VVC, and H.274/VSEI standards, and the user data SEI messages enable organizations and companies to specify SEI messages for specific use. The standards may contain the syntax and semantics for the specified SEI messages but a process for handling the messages in the recipient might not be defined. Consequently, encoders may be required to follow the standard specifying a SEI message when they create SEI message(s), and decoders might not be required to process SEI messages for output order conformance. One of the reasons to include the syntax and semantics of SEI messages in standards is to allow different system specifications to interpret the supplemental information identically and hence interoperate. It is intended that system specifications can require the use of particular SEI messages both in the encoding end and in the decoding end, and additionally the process for handling particular SEI messages in the recipient can be specified.
[0155] The method and apparatus of an example embodiment may be utilized in a wide variety of systems, including systems that rely upon the compression and decompression of media data and possibly also the associated metadata. In at least an embodiment, however, the method and apparatus are configured to train or finetune a decoder-side neural network. In this regard, FIG. 6 depicts an example of such a system 600 that includes a source 602 of media data and associated metadata. The source 602 may be, in an embodiment, a server. However, the source may be embodied in other manners when desired. The source 602 is configured to stream the media data and associated metadata to a client device 604. The client device may be embodied by a media player, a multimedia system, a video system, a smart phone, a mobile telephone or other user equipment, a personal computer, a tablet computer or any other computing device configured to receive and decompress the media data and process associated metadata. In the illustrated embodiment, media data and metadata are streamed via a network 606, such as any of a wide variety of types of wireless networks and/or wireline networks. The client device is configured to receive structured information containing media, metadata and any other relevant representation of information containing the media and the metadata and to decompress the media data and process the associated metadata (e.g. for proper playback timing of decompressed media data).
[0156] An apparatus 700 is provided in accordance with an example embodiment as shown in FIG. 7. In an embodiment, the apparatus of FIG. 7 may be embodied by the source 602, such as a file writer which, in turn, may be embodied by a server, that is configured to stream a compressed representation of the media data and associated metadata. In an alternative embodiment, the apparatus may be embodied by the client device 604, such as a file reader which may be embodied, for example, by any of the various computing devices described above. In either of these embodiments and as shown in FIG. 7, the apparatus of an example embodiment includes, is associated with or is in communication with a processing circuitry 702, one or more memory devices 704, a communication interface 706 and optionally a user interface.
[0157] The processing circuitry 702 may be in communication with the memory device 704 via a bus for passing information among components of the apparatus 700. The memory device may be non-transitory and may include, for example, one or more volatile and/or non-volatile memories. In other words, for example, the memory device may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processing circuitry). The memory device may be configured to store information, data, content, applications, instructions, or the like for enabling the apparatus to carry out various functions in accordance with an example embodiment. For example, the memory device could be configured to buffer input data for processing by the processing circuitry. Additionally or alternatively, the memory device could be configured to store instructions for execution by the processing circuitry.
[0158] The apparatus 700 may, in some embodiments, be embodied in various computing devices as described above. However, in some embodiments, the apparatus may be embodied as a chip or chip set. In other words, the apparatus may comprise one or more physical packages (e.g., chips) including materials, components and/or wires on a structural assembly (e.g., a baseboard). The structural assembly may provide physical strength, conservation of size, and/or limitation of electrical interaction for component circuitry included thereon. The apparatus may therefore, in some cases, be configured to implement an embodiment on a single chip or as a single ‘system on a chip.’ As such, in some cases, a chip or chipset may constitute means for performing one or more operations for providing the functionalities described herein.
[0159] The processing circuitry 702 may be embodied in a number of different ways. For example, the processing circuitry may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing element with or without an accompanying DSP, or various other circuitry including integrated circuits such as, for example, an ASIC (application specific integrated circuit), an FPGA (field programmable gate array), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like. As such, in some embodiments, the processing circuitry may include one or more processing cores configured to perform independently. A multi-core processing circuitry may enable multiprocessing within a single physical package. Additionally or alternatively, the processing circuitry may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading.
[0160] In an example embodiment, the processing circuitry 702 may be configured to execute instructions stored in the memory device 704 or otherwise accessible to the processing circuitry. Alternatively or additionally, the processing circuitry may be configured to execute hard coded functionality. As such, whether configured by hardware or software methods, or by a combination thereof, the processing circuitry may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment while configured accordingly. Thus, for example, when the processing circuitry is embodied as an ASIC, FPGA or the like, the processing circuitry may be specifically configured hardware for conducting the operations described herein. Alternatively, as another example, when the processing circuitry is embodied as an executor of instructions, the instructions may specifically configure the processing circuitry to perform the algorithms and/or operations described herein when the instructions are executed. However, in some cases, the processing circuitry may be a processor of a specific device (e.g., an image or video processing system) configured to employ an embodiment by further configuration of the processing circuitry by instructions for performing the algorithms and/or operations described herein. The processing circuitry may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processing circuitry.
[0161] The communication interface 706 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data, including video bitstreams. In this regard, the communication interface may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network. Additionally or alternatively, the communication interface may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s). In some environments, the communication interface may alternatively or also support wired communication. As such, for example, the communication interface may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms.
[0162] In some embodiments, the apparatus 700 may optionally include a user interface that may, in turn, be in communication with the processing circuitry 702 to provide output to a user, such as by outputting an encoded video bitstream and, in some embodiments, to receive an indication of a user input. As such, the user interface may include a display and, in some embodiments, may also include a keyboard, a mouse, a joystick, a touch screen, touch areas, soft keys, a microphone, a speaker, or other input/output mechanisms. Alternatively or additionally, the processing circuitry may comprise user interface circuitry configured to control at least some functions of one or more user interface elements such as a display and, in some embodiments, a speaker, ringer, microphone and/or the like. The processing circuitry and/or user interface circuitry comprising the processing circuitry may be configured to control one or more functions of one or more user interface elements through computer program instructions (e.g., software and/or firmware) stored on a memory accessible to the processing circuitry (e.g., memory device, and/or the like).
[0163] Fundamentals of neural networks
[0164] A neural network (NN) is a computation graph including several layers of computation. Each layer includes one or more units, where each unit performs a computation. A unit is connected to one or more other units, and a connection may be associated with a weight. The weight may be used for scaling the signal passing through an associated connection. Weights are learnable parameters, for example, values which can be learned from training data. There may be other learnable parameters, such as those of batch-normalization layers.
[0165] Two examples of architectures for neural networks are feed-forward and recurrent architectures. Feed-forward neural networks are such that there is no feedback loop: each layer takes input from one or more of the previous layers, and provides its output as the input for one or more of the subsequent layers. Also, units inside a certain layer take input from units in one or more of the preceding layers and provide output to one or more of the following layers.
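A feed-forward computation of this kind may be sketched as follows, assuming fully connected layers, a ReLU non-linearity between layers, and each layer taking input only from the immediately preceding layer. These choices and the function names are assumptions of the example rather than features required by the embodiments.

```python
def dense(inputs, weights, biases):
    """One layer: each unit scales its inputs by connection weights and sums them."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]


def feed_forward(x, layers):
    """Pass x through the layers in order, with no feedback loop.

    `layers` is a list of (weights, biases) pairs; weights[j][i] is the weight
    of the connection from input i to unit j of that layer.
    """
    for index, (weights, biases) in enumerate(layers):
        x = dense(x, weights, biases)
        if index < len(layers) - 1:   # no activation after the final layer
            x = relu(x)
    return x
```

For instance, a network with a two-unit hidden layer and a single output unit is described by two `(weights, biases)` pairs, and the weights and biases are the learnable parameters.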
[0166] Initial layers, those close to the input data, extract semantically low-level features, for example, edges and textures in images, and intermediate and final layers extract more high-level features. After the feature extraction layers there may be one or more layers performing a certain task, for example, classification, semantic segmentation, object detection, denoising, style transfer, super-resolution, and the like. In recurrent neural networks, there is a feedback loop, so that the neural network becomes stateful, for example, it is able to memorize information or a state.
[0167] Neural networks are being utilized in an ever-increasing number of applications for many different types of devices, for example, mobile phones, chat bots, IoT devices, smart cars, voice assistants, and the like. Some of these applications include, but are not limited to, image and video analysis and processing, social media data analysis, device usage data analysis, and the like.
[0168] One of the properties of neural networks, and other machine learning tools, is that they are able to learn properties from input data, either in a supervised way or in an unsupervised way. Such learning is a result of a training algorithm, or of a meta-level neural network providing the training signal.
[0169] In general, the training algorithm includes changing some properties of the neural network so that its output is as close as possible to a desired output. For example, in the case of classification of objects in images, the output of the neural network can be used to derive a class or category index which indicates the class or category that the object in the input image belongs to. Training usually happens by minimizing or decreasing the output error, also referred to as the loss. Examples of losses are mean squared error, cross-entropy, and the like. In recent deep learning techniques, training is an iterative process, where at each iteration the algorithm modifies the weights of the neural network to make a gradual improvement in the network’s output, for example, gradually decrease the loss.
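The iterative process described above can be sketched, under simplifying assumptions, as gradient descent on a mean squared error loss for a model with a single weight. The model form (output = w * x), the learning rate, and the toy data are hypothetical and serve only to show the loss decreasing across iterations.

```python
def mse(w, data):
    """Mean squared error between the model output w * x and the desired output y."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def mse_grad(w, data):
    """Derivative of the loss with respect to the weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def train(data, w=0.0, lr=0.05, steps=100):
    """At each iteration, move the weight a small step against the gradient."""
    losses = []
    for _ in range(steps):
        losses.append(mse(w, data))
        w -= lr * mse_grad(w, data)
    return w, losses


# Toy training data generated by y = 2 * x (an assumption for the example).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, losses = train(data)
```

With this data the learned weight approaches 2 and the recorded losses shrink toward zero, which is the gradual improvement the paragraph refers to.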
[0170] Training a neural network is an optimization process, but the final goal is different from the typical goal of optimization. In optimization, the only goal is to minimize a function. In machine learning, the goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset. In other words, the goal is to use a limited training dataset to learn to generalize to previously unseen data, that is, data which was not used for training the model. This is usually referred to as generalization. In practice, data is usually split into at least two sets, the training set and the validation set. The training set is used for training the network, for example, to modify its learnable parameters in order to minimize the loss. The validation set is used for checking the performance of the network on data which was not used to minimize the loss, as an indication of the final performance of the model. In particular, the errors on the training set and on the validation set are monitored during the training process to understand the following:
- when the network is learning at all - in this case, the training set error should decrease, otherwise the model is in the regime of underfitting.
- when the network is learning to generalize - in this case, also the validation set error needs to decrease and be not too much higher than the training set error. For example, the validation set error should be less than 20% higher than the training set error. When the training set error is low, for example 10% of its value at the beginning of training, or with respect to a threshold that may have been determined based on an evaluation metric, but the validation set error is much higher than the training set error, or it does not decrease, or it even increases, the model is in the regime of overfitting. This means that the model has just memorized properties of the training set and performs well only on that set, but performs poorly on a set not used for training or tuning of its parameters.
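The two checks above can be sketched as a small helper that inspects the recorded training and validation errors. The function name training_regime is hypothetical; the default thresholds reuse the example figures from the text (training error at 10% of its initial value, validation error less than 20% above the training error), which are examples rather than fixed rules.

```python
def training_regime(train_errors, val_errors, low_ratio=0.10, gap_ratio=0.20):
    """Classify a training run from per-epoch training and validation errors."""
    first_train, last_train = train_errors[0], train_errors[-1]
    first_val, last_val = val_errors[0], val_errors[-1]

    # The network is not learning at all: the training error did not decrease.
    if last_train >= first_train:
        return "underfitting"

    # Learning to generalize: the validation error also decreases and stays
    # close to the training error.
    val_decreased = last_val < first_val
    small_gap = last_val <= (1.0 + gap_ratio) * last_train
    if val_decreased and small_gap:
        return "generalizing"

    # Training error is low, but the validation error is much higher,
    # does not decrease, or increases.
    if last_train <= low_ratio * first_train:
        return "overfitting"
    return "undetermined"
```

The "undetermined" branch covers runs where the training error has decreased but is not yet low enough for the overfitting criterion to apply.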
[0171] Lately, neural networks have been used for compressing and de-compressing data such as images. The most widely used architecture for such a task is the auto-encoder, which is a neural network including two parts: a neural encoder and a neural decoder. In various embodiments, the neural encoder and neural decoder are referred to as encoder and decoder, even though these refer to algorithms which are learned from data instead of being tuned manually. The encoder takes an image as an input and produces a code, to represent the input image, which requires fewer bits than the input image. This code may have been obtained by a binarization or quantization process after the encoder. The decoder takes in this code and reconstructs the image which was input to the encoder.
[0172] Such an encoder and decoder are usually trained to minimize a combination of bitrate and distortion, where the distortion may be based on one or more of the following metrics: mean squared error (MSE), peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), or the like. These distortion metrics are meant to be correlated to the human visual perception quality, so that minimizing or maximizing one or more of these distortion metrics results in improving the visual quality of the decoded image as perceived by humans.
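For illustration, two of the metrics named above can be computed as follows for flattened lists of sample values. This is a minimal sketch; the 8-bit peak value of 255 is an assumption. MSE is a quantity to be minimized, whereas PSNR is derived from MSE and is maximized.

```python
import math


def mse(original, decoded):
    """Mean squared error between two equally sized sample lists."""
    return sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)


def psnr(original, decoded, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means less distortion."""
    error = mse(original, decoded)
    if error == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / error)
```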
[0173] In various embodiments, terms ‘model’, ‘neural network’, ‘neural net’ and ‘network’ may be used interchangeably, and also the weights of neural networks may be sometimes referred to as learnable parameters or as parameters.
[0174] Fundamentals of video/image coding
[0175] A video codec includes an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can decompress the compressed video representation back into a viewable form. Typically, an encoder discards some information in the original video sequence in order to represent the video in a more compact form, for example, at a lower bitrate.
[0176] Typical hybrid video codecs, for example ITU-T H.263 and H.264, encode the video information in two phases. Firstly, pixel values in a certain picture area (or ‘block’) are predicted, for example, by motion compensation means or circuits (by finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means or circuit (by using the pixel values around the block to be coded in a specified manner). Secondly the prediction error, e.g. the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (e.g. discrete cosine transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder may control the balance between the accuracy of the pixel representation (e.g., picture quality) and size of the resulting coded video representation (e.g., file size or transmission bitrate).
[0177] In another example, the pixel values may be predicted by using spatial prediction techniques, which use the pixel values around the block to be coded in a specified manner. The prediction error, for example, the difference between the predicted block of pixels and the original block of pixels, is then coded. This is typically done by transforming the difference in pixel values using a specified transform, for example, discrete cosine transform (DCT) or a variant of it; quantizing the coefficients; and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation, for example, picture quality, and the size of the resulting coded video representation, for example, file size or transmission bitrate.
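The predict, transform, and quantize steps described above can be sketched on a one-dimensional block of four samples. This is a simplified illustration and not the procedure of any particular codec: it assumes an orthonormal DCT-II, a uniform quantizer with a single step size, and hypothetical function names, and it omits entropy coding.

```python
import math


def _scale(k, n):
    """Orthonormal DCT-II scaling factor for coefficient index k."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(x):
    """Forward orthonormal DCT-II of a list of samples."""
    n = len(x)
    return [_scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n))
            for k in range(n)]


def idct(c):
    """Inverse of dct()."""
    n = len(c)
    return [sum(_scale(k, n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
                for k in range(n))
            for i in range(n)]


def encode_block(block, prediction, step):
    """Transform the prediction error and quantize the coefficients."""
    residual = [b - p for b, p in zip(block, prediction)]
    return [round(c / step) for c in dct(residual)]


def decode_block(levels, prediction, step):
    """Dequantize, inverse transform, and add the prediction back."""
    residual = idct([q * step for q in levels])
    return [p + r for p, r in zip(prediction, residual)]


block = [52, 55, 61, 66]        # original samples (toy values)
prediction = [50, 54, 60, 64]   # predicted samples (toy values)
fine = encode_block(block, prediction, step=1.0)
coarse = encode_block(block, prediction, step=10.0)
```

In this toy case the finer step keeps two non-zero quantized coefficients and reconstructs the block closely, while the coarser step quantizes every coefficient to zero, so the decoder outputs the prediction unchanged: less data to code, lower accuracy.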
[0178] Inter prediction, which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, exploits temporal redundancy. In inter prediction the sources of prediction are previously decoded pictures.
[0179] Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in the spatial or transform domain, for example, either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra coding, where no inter prediction is applied.
[0180] One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently when they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
[0181] The decoder reconstructs the output video by applying prediction techniques similar to the encoder to form a predicted representation of the pixel blocks, for example, using the motion or spatial information created by the encoder and stored in the compressed representation, and by applying prediction error decoding, which is the inverse operation of the prediction error coding and recovers the quantized prediction error signal in the spatial pixel domain. After applying prediction and prediction error decoding techniques, the decoder sums up the prediction and prediction error signals, for example, pixel values, to form the output video frame. The decoder and encoder can also apply additional filtering techniques to improve the quality of the output video before passing it for display and/or storing it as prediction reference for the forthcoming frames in the video sequence.
[0182] In typical video codecs the motion information is indicated with motion vectors associated with each motion compensated image block. Each of these motion vectors represents the displacement between the image block in the picture to be coded in the encoder side or decoded in the decoder side and the prediction source block in one of the previously coded or decoded pictures.
[0183] In order to represent motion vectors efficiently, the motion vectors are typically coded differentially with respect to block specific predicted motion vectors. In typical video codecs, the predicted motion vectors are created in a predefined way, for example, calculating the median of the encoded or decoded motion vectors of the adjacent blocks.
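A minimal sketch of the median-based predictor and the differential coding mentioned above is given below. The representation of a motion vector as an (x, y) tuple, the component-wise median, and the function names are assumptions made for illustration.

```python
from statistics import median


def predict_mv(neighbor_mvs):
    """Component-wise median of the motion vectors of the adjacent blocks."""
    return (median(mv[0] for mv in neighbor_mvs),
            median(mv[1] for mv in neighbor_mvs))


def encode_mvd(mv, neighbor_mvs):
    """Encoder side: code only the difference to the predicted motion vector."""
    px, py = predict_mv(neighbor_mvs)
    return (mv[0] - px, mv[1] - py)


def decode_mv(mvd, neighbor_mvs):
    """Decoder side: rebuild the motion vector from the same predictor."""
    px, py = predict_mv(neighbor_mvs)
    return (mvd[0] + px, mvd[1] + py)


neighbors = [(4, -2), (6, 0), (5, 3)]   # hypothetical adjacent-block vectors
mvd = encode_mvd((6, 1), neighbors)
```

Because both sides derive the predictor from already coded or decoded neighbors, only the typically small difference needs to be written to the bitstream.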
[0184] Another way to create motion vector predictions is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and to signal the chosen candidate as the motion vector predictor. In addition to predicting the motion vector values, the reference index of a previously coded/decoded picture can be predicted. The reference index is typically predicted from adjacent blocks and/or co-located blocks in a temporal reference picture.
[0185] Moreover, typical high efficiency video codecs employ an additional motion information coding/decoding mechanism, often called merging/merge mode, where all the motion field information, which includes a motion vector and corresponding reference picture index for each available reference picture list, is predicted and used without any modification/correction. Similarly, predicting the motion field information is carried out using the motion field information of adjacent blocks and/or co-located blocks in temporal reference pictures, and the used motion field information is signaled among a motion field candidate list filled with motion field information of available adjacent/co-located blocks.
[0186] In typical video codecs, the prediction residual after motion compensation is first transformed with a transform kernel, for example, DCT, and then coded. The reason for this is that often there still exists some correlation among the residual, and a transform can in many cases help reduce this correlation and provide more efficient coding.
[0187] Typical video encoders utilize Lagrangian cost functions to find optimal coding modes, for example, the desired macroblock mode and associated motion vectors. This kind of cost function uses a weighting factor λ to tie together the exact or estimated image distortion due to lossy coding methods and the exact or estimated amount of information that is required to represent the pixel values in an image area:
C = D + λR (equation 1)
[0188] In equation 1, C is the Lagrangian cost to be minimized, D is the image distortion, for example, mean squared error with the mode and motion vectors considered, and R is the number of bits needed to represent the required data to reconstruct the image block in the decoder including the amount of data to represent the candidate motion vectors.
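As an illustration of how such a cost can drive the mode decision, the sketch below evaluates the cost in its usual form, C = D + λR, where the weighting factor λ multiplies the rate term, for a few candidate modes and selects the cheapest one. The candidate modes and their distortion and rate figures are invented for the example.

```python
def lagrangian_cost(distortion, rate_bits, lam):
    """Cost C = D + lambda * R for one candidate coding mode."""
    return distortion + lam * rate_bits


def choose_mode(candidates, lam):
    """Return the candidate with the smallest Lagrangian cost."""
    return min(candidates,
               key=lambda c: lagrangian_cost(c["distortion"], c["rate_bits"], lam))


# Hypothetical candidates: distortion as MSE, rate in bits.
candidates = [
    {"mode": "intra", "distortion": 40.0, "rate_bits": 120},
    {"mode": "inter", "distortion": 55.0, "rate_bits": 30},
    {"mode": "skip", "distortion": 90.0, "rate_bits": 2},
]
low_lambda_choice = choose_mode(candidates, lam=0.1)["mode"]
high_lambda_choice = choose_mode(candidates, lam=5.0)["mode"]
```

A small λ favors the low-distortion mode even though it costs many bits, while a large λ favors the mode that needs the fewest bits.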
[0189] A design principle has been followed for SEI message specifications: the SEI messages are generally not extended in future amendments or versions of the standard.
[0190] Filters in video codecs
[0191] Conventional image and video codecs may use a set of filters to enhance the visual quality of the predicted and error-compensated visual content. These filters can be applied in-loop, out-of-loop, or both. In the case of in-loop filters, a filter applied on one block in the currently encoded or currently decoded frame will affect the encoding or decoding of another block in the same frame and/or in another frame which is predicted or processed based at least on the current frame. An in-loop filter can affect the bitrate and/or the visual quality. An enhanced block may cause a smaller residual, e.g., a smaller difference between the original block and the filtered block, thus using fewer bits in the bitstream output by the encoder. An out-of-loop filter, or post-processing filter, may be applied on a frame or part of a frame after it has been reconstructed; the filtered visual content may not be used for decoding other content.
[0192] Information on Neural Network based Image/Video Coding
[0193] Recently, neural networks (NNs) have been used in the context of image and video compression, by following mainly two approaches.
[0194] In one approach, NNs are used to replace or are used as an addition to one or more of the components of a traditional codec such as VVC/H.266. Here, ‘traditional’ means those codecs whose components and parameters are typically not learned from data by means of a training process, for example, those codecs whose components are not neural networks. Some examples of uses of neural networks within a traditional codec include but are not limited to:
Additional in-loop filter, for example, by having the NN as an additional in-loop filter with respect to the traditional loop filters;
Single in-loop filter, for example, by having the NN replacing all traditional in-loop filters;
Intra-frame prediction, for example, as an additional intra-frame prediction mode, or replacing the traditional intra-frame prediction;
Inter-frame prediction, for example, as an additional inter-frame prediction mode, or replacing the traditional inter-frame prediction;
Transform and/or inverse transform, for example, as an additional transform and/or inverse transform, or replacing the traditional transform and/or inverse transform; and
Probability model for the arithmetic codec, for example, as an additional probability model, or replacing the traditional probability model.
[0195] FIG. 8 illustrates examples of functioning of NNs as components of a pipeline of traditional codec, in accordance with an embodiment. In particular, FIG. 8 illustrates an encoder, which also includes a decoding loop. FIG. 8 is shown to include components described below:
A luma intra pred block or circuit 801. This block or circuit performs intra prediction in the luma domain, for example, by using already reconstructed data from the same frame. The operation of the luma intra pred block or circuit 801 may be performed by a deep neural network such as a convolutional auto-encoder.
A chroma intra pred block or circuit 802. This block or circuit performs intra prediction in the chroma domain, for example, by using already reconstructed data from the same frame. The chroma intra pred block or circuit 802 may perform cross-component prediction, for example, predicting chroma from luma. The operation of the chroma intra pred block or circuit 802 may be performed by a deep neural network such as a convolutional autoencoder.
An intra pred block or circuit 803 and an inter-pred block or circuit 804. These blocks or circuits perform intra prediction and inter-prediction, respectively. The intra pred block or circuit 803 and the inter-pred block or circuit 804 may perform the prediction on all components, for example, luma and chroma. The operations of the intra pred block or circuit 803 and the inter-pred block or circuit 804 may be performed by two or more deep neural networks such as convolutional auto-encoders.
A probability estimation block or circuit 805 for entropy coding. This block or circuit performs prediction of probability for the next symbol to encode or decode, which is then provided to the entropy coding module 812, such as an arithmetic coding module, to encode or decode the next symbol. The operation of the probability estimation block or circuit 805 may be performed by a neural network.
A transform and quantization (T/Q) block or circuit 806. These are actually two blocks or circuits. The transform and quantization block or circuit 806 may perform a transform of input data to a different domain, for example, the FFT transform would transform the data to the frequency domain. The transform and quantization block or circuit 806 may quantize its input values to a smaller set of possible values. In the decoding loop, there may be an inverse quantization block or circuit and an inverse transform block or circuit Q⁻¹/T⁻¹ 806a. One or both of the transform block or circuit and quantization block or circuit may be replaced by one or two or more neural networks. One or both of the inverse transform block or circuit and inverse quantization block or circuit 813 may be replaced by one or two or more neural networks.
An in-loop filter block or circuit 807. Operations of the in-loop filter block or circuit 807 are performed in the decoding loop, and it performs filtering on the output of the inverse transform block or circuit, or on the reconstructed data, in order to enhance the reconstructed data with respect to one or more predetermined quality metrics. This filter may affect both the quality of the decoded data and the bitrate of the bitstream output by the encoder. The operation of the in-loop filter block or circuit 807 may be performed by a neural network, such as a convolutional auto-encoder. In examples, the operation of the in-loop filter may be performed by multiple steps or filters, where one or more of the steps may be performed by neural networks.
A post-processing filter block or circuit 808. The post-processing filter block or circuit 808 may be performed only at the decoder side, as it may not affect the encoding process. The post-processing filter block or circuit 808 filters the reconstructed data output by the in-loop filter block or circuit 807, in order to enhance the reconstructed data. The post-processing filter block or circuit 808 may be replaced by a neural network, such as a convolutional auto-encoder.
A resolution adaptation block or circuit 809. This block or circuit may downsample the input video frames, prior to encoding. Then, in the decoding loop, the reconstructed data may be upsampled, by the upsampling block or circuit 810, to the original resolution. The operation of the resolution adaptation block or circuit 809 may be performed by a neural network such as a convolutional auto-encoder.
An encoder control block or circuit 811. This block or circuit performs optimization of the encoder’s parameters, such as what transform to use, what quantization parameters (QP) to use, what intra-prediction mode (out of N intra-prediction modes) to use, and the like. The operation of the encoder control block or circuit 811 may be performed by a neural network, such as a classifier convolutional network, or such as a regression convolutional network.
An ME/MC block or circuit 814. This block or circuit performs motion estimation and/or motion compensation, which are two key operations to be performed when performing inter-frame prediction. ME/MC stands for motion estimation / motion compensation.
[0196] In another approach, commonly referred to as ‘end-to-end learned compression’, NNs are used as the main components of the image/video codecs. Some examples of the second approach include, but are not limited to, the following:
[0197] Option 1: re-use the video coding pipeline but replace most or all the components with NNs. FIG. 9 illustrates an example of a modified video coding pipeline based on neural networks, in accordance with an embodiment. An example of a neural network may include, but is not limited to, a compressed representation of a neural network. FIG. 9 is shown to include the following components:
A neural transform block or circuit 902: this block or circuit transforms the output of a summation/subtraction operation 903 to a new representation of that data, which may have lower entropy and thus be more compressible.
A quantization block or circuit 904: this block or circuit quantizes input data 901 to a smaller set of possible values.
An inverse transform and inverse quantization blocks or circuits 906. These blocks or circuits perform the inverse or approximately inverse operation of the transform and the quantization, respectively.
An encoder parameter control block or circuit 908. This block or circuit may control and optimize some or all the parameters of the encoding process, such as parameters of one or more of the encoding blocks or circuits.
An entropy coding block or circuit 910. This block or circuit may perform lossless coding, for example, based on entropy. One popular entropy coding technique is arithmetic coding.
A neural intra-codec block or circuit 912. This block or circuit may be an image compression and decompression block or circuit, which may be used to encode and decode an intra frame. An encoder 914 may be an encoder block or circuit, such as the neural encoder part of an auto-encoder neural network. A decoder 916 may be a decoder block or circuit, such as the neural decoder part of an auto-encoder neural network. An intra-coding block or circuit 918 may be a block or circuit performing some intermediate steps between encoder and decoder, such as quantization, entropy encoding, entropy decoding, and/or inverse quantization.
A deep loop filter block or circuit 920. This block or circuit performs filtering of reconstructed data, in order to enhance it.
A decode picture buffer block or circuit 922. This block or circuit is a memory buffer, keeping the decoded frame, for example, reconstructed frames 924 and enhanced reference frames 926 to be used for inter prediction.
An inter-prediction block or circuit 928. This block or circuit performs inter-frame prediction, for example, predicts from frames, for example, frames 932, which are temporally nearby. An ME/MC 930 performs motion estimation and/or motion compensation, which are two key operations to be performed when performing inter-frame prediction. ME/MC stands for motion estimation / motion compensation.
[0198] In order to train the neural networks of this system, a training objective function, referred to as ‘training loss’, is typically utilized, which usually comprises one or more terms, or loss terms, or simply losses. Although Option 2 and FIG. 10 are considered here as an example for describing the training objective function, a similar training objective function may also be used for training the neural networks for the systems in FIG. 6 and FIG. 7. In an example, the training loss comprises a reconstruction loss term and a rate loss term. The reconstruction loss encourages the system to decode data that is similar to the input data, according to some similarity metric. Following are some examples of reconstruction losses:
a loss derived from mean squared error (MSE);
a loss derived from multi-scale structural similarity (MS-SSIM), such as 1 minus MS-SSIM, or 1 - MS-SSIM;
losses derived from the use of a pretrained neural network, for example, error(f1, f2), where f1 and f2 are the features extracted by a pretrained neural network for the input (uncompressed) data and the decoded (reconstructed) data, respectively, and error() is an error or distance function, such as L1 norm or L2 norm; and
losses derived from the use of a neural network that is trained simultaneously with the end-to-end learned codec, for example, adversarial loss, which is the loss provided by a discriminator neural network that is trained adversarially with respect to the codec, following the settings proposed in the context of generative adversarial networks (GANs) and their variants.
[0199] The rate loss encourages the system to compress the output of the encoding stage, such as the output of the arithmetic encoder. ‘Compressing’, for example, means reducing the number of bits output by the encoding stage.
[0200] When an entropy-based lossless encoder is used, such as the arithmetic encoder, the rate loss typically encourages the output of the Encoder NN to have low entropy. The rate loss may be computed on the output of the Encoder NN, or on the output of the quantization operation, or on the output of the probability model. Following are some examples of rate losses:
A differentiable estimate of the entropy;
A sparsification loss, for example, a loss that encourages the output of the Encoder NN or the output of the quantization to have many zeros. Examples are L0 norm, L1 norm, L1 norm divided by L2 norm; and
A cross-entropy loss applied to the output of a probability model, where the probability model may be a NN used to estimate the probability of the next symbol to be encoded by the arithmetic encoder.
[0201] For training one or more neural networks that are part of a codec, such as one or more neural networks in FIG. 8 and/or FIG. 9, one or more of reconstruction losses may be used, and one or more of rate losses may be used. The loss terms may then be combined for example as a weighted sum to obtain the training objective function. Typically, the different loss terms are weighted using different weights, and these weights determine how the final system performs in terms of rate-distortion loss. For example, when more weight is given to one or more of the reconstruction losses with respect to the rate losses, the system may learn to compress less but to reconstruct with higher accuracy as measured by a metric that correlates with the reconstruction losses. These weights are usually considered to be hyperparameters of the training session and may be set manually by the operator designing the training session, or automatically for example by grid search or by using additional neural networks.
[0202] For the sake of explanation, video is considered as data type in various embodiments. However, it would be understood that the embodiments are also applicable to other media items, for example, images and audio data.
[0203] Option 2 is illustrated in FIG. 10, and it includes a different type of codec architecture. FIG. 10 illustrates an example neural network-based end-to-end learned video coding system, in accordance with an example embodiment. As shown in FIG. 10, a neural network-based end-to-end learned video coding system 1000 includes an encoder 1001, a quantizer 1002, a probability model 1003, an entropy codec 1004, for example, an arithmetic encoder 1005 and an arithmetic decoder 1006, a dequantizer 1007, and a decoder 1008. The encoder 1001 and the decoder 1008 are typically two neural networks, or mainly comprise neural network components. The probability model 1003 may also mainly comprise neural network components. The quantizer 1002, the dequantizer 1007, and the entropy codec 1004 are typically not based on neural network components, but they may also potentially comprise neural network components. In some embodiments, the encoder, quantizer, probability model, entropy codec, arithmetic encoder, arithmetic decoder, dequantizer, and decoder may also be referred to as an encoder component, quantizer component, probability model component, entropy codec component, arithmetic encoder component, arithmetic decoder component, dequantizer component, and decoder component, respectively.
[0204] On the encoding side, the encoder 1001 takes a video/image as an input 1009 and converts the video/image in original signal space into a latent representation that may comprise a more compressible representation of the input. For image compression, the latent representation is normally a 3-dimensional tensor, where 2 dimensions represent spatial information, and the third dimension contains information at that specific location.
[0205] Consider an example in which the input data is an image. When the input image is a 128x128x3 RGB image (with horizontal size of 128 pixels, vertical size of 128 pixels, and 3 channels for the Red, Green, Blue color components), and the encoder downsamples the input tensor by 2 and expands the channel dimension to 32 channels, the latent representation is a tensor of dimensions (or ‘shape’) 64x64x32 (e.g., with horizontal size of 64 elements, vertical size of 64 elements, and 32 channels). Note that the order of the different dimensions may differ depending on the convention which is used. In some embodiments, for the input image, the channel dimension may be the first dimension, so for the above example, the shape of the input tensor may be represented as 3x128x128, instead of 128x128x3.
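The shape arithmetic in the example above can be written out as a short helper. The function name latent_shape and its parameters are hypothetical; the sketch only reproduces the bookkeeping of the example and assumes the spatial sizes are divisible by the downsampling factor.

```python
def latent_shape(height, width, downsample, out_channels, channels_first=False):
    """Shape of the latent tensor for a given input size and encoder settings."""
    h, w = height // downsample, width // downsample
    if channels_first:
        return (out_channels, h, w)   # convention with channels as first dimension
    return (h, w, out_channels)       # convention with channels as last dimension


# 128x128x3 input, downsampled by 2, expanded to 32 channels.
shape = latent_shape(128, 128, downsample=2, out_channels=32)
```

Note that the latent tensor in this example holds more elements than the input (64 x 64 x 32 versus 128 x 128 x 3); the compression comes from the subsequent quantization and entropy coding of the latent values, not from the element count alone.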
[0206] In the case of an input video (instead of just an input image), another dimension in the input tensor may be used to represent temporal information.
[0207] The quantizer 1002 quantizes the latent representation into discrete values given a predefined set of quantization levels. The probability model 1003 and the arithmetic encoder 1005 work together to perform lossless compression for the quantized latent representation and generate bitstreams to be sent to the decoder side. Given a symbol to be encoded to the bitstream, the probability model 1003 estimates the probability distribution of all possible values for that symbol based on a context that is constructed from available information at the current encoding/decoding state, such as the data that has already been encoded/decoded. The arithmetic encoder 1005 encodes the input symbols to the bitstream using the estimated probability distributions.
[0208] On the decoding side, opposite operations are performed. The arithmetic decoder 1006 and the probability model 1003 first decode symbols from the bitstream to recover the quantized latent representation. Then, the dequantizer 1007 reconstructs the latent representation in continuous values and passes it to the decoder 1008 to recover the input video/image. The recovered input video/image is provided as an output 1010. Note that the probability model 1003, in this system 1000, is shared between the arithmetic encoder 1005 and the arithmetic decoder 1006. In practice, this means that a copy of the probability model 1003 is used at the arithmetic encoder 1005 side, and another exact copy is used at the arithmetic decoder 1006 side.
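The quantization, probability modelling, and dequantization steps can be sketched as follows. This is a simplified illustration under stated assumptions: the probability model is a static table with add-one smoothing rather than a context-based neural network, both sides are assumed to hold the same table, and the arithmetic coder itself is not implemented. Instead, the sketch computes the ideal code length, the sum of -log2(p) over the symbols, which an arithmetic coder approaches.

```python
import math
from collections import Counter


def quantize(latent, levels):
    """Map each latent value to the index of the nearest quantization level."""
    return [min(range(len(levels)), key=lambda i: abs(levels[i] - v))
            for v in latent]


def dequantize(indices, levels):
    """Map level indices back to values from the predefined set of levels."""
    return [levels[i] for i in indices]


def estimate_probabilities(indices, num_levels):
    """Static stand-in for the probability model (add-one smoothed counts)."""
    counts = Counter(indices)
    total = len(indices) + num_levels
    return [(counts[i] + 1) / total for i in range(num_levels)]


def ideal_code_length_bits(indices, probs):
    """Bits an ideal entropy coder would spend on the symbols under probs."""
    return sum(-math.log2(probs[i]) for i in indices)


levels = [-1.0, 0.0, 1.0]                  # hypothetical quantization levels
latent = [0.1, -0.9, 0.4, 0.8, 0.05]       # hypothetical latent values
symbols = quantize(latent, levels)
probs = estimate_probabilities(symbols, len(levels))
bits = ideal_code_length_bits(symbols, probs)
reconstructed = dequantize(symbols, levels)
```

Symbols that the model considers likely cost fewer bits, which is why a more accurate probability model yields a shorter bitstream for the same quantized latent representation.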
[0209] In this system 1000, the encoder 1001, the probability model 1003, and the decoder 1008 are normally based on deep neural networks. The system 1000 is trained in an end-to-end manner by minimizing the following rate-distortion loss function, which may be referred to simply as training loss, or loss:
L = D + λR (equation 2)
[0210] In equation 2, D is the distortion loss term, R is the rate loss term, and λ is the weight that controls the balance between the two losses.
[0211] The distortion loss term may be referred to also as reconstruction loss. It encourages the system to decode data that is similar to the input data, according to some similarity metric. Following are some examples of reconstruction losses:
a loss derived from mean squared error (MSE);
a loss derived from multi-scale structural similarity (MS-SSIM), such as 1 minus MS-SSIM, or 1 - MS-SSIM;
losses derived from the use of a pretrained neural network, for example, error(f1, f2), where f1 and f2 are the features extracted by a pretrained neural network for the input (uncompressed) data and the decoded (reconstructed) data, respectively, and error() is an error or distance function, such as L1 norm or L2 norm; and
losses derived from the use of a neural network that is trained simultaneously with the end-to-end learned codec, for example, adversarial loss, which is the loss provided by a discriminator neural network that is trained adversarially with respect to the codec, following the settings proposed in the context of generative adversarial networks (GANs) and their variants.
[0212] Multiple distortion losses may be used and integrated into D.
[0213] Minimizing the rate loss encourages the system to compress the quantized latent representation so that the quantized latent representation can be represented by a smaller number of bits. The rate loss may be computed on the output of the encoder NN, or on the output of the quantization operation, or on the output of the probability model. In an example embodiment, the rate loss may comprise multiple rate losses. Following are some examples of rate losses:
a differentiable estimate of the entropy of the quantized latent representation, which indicates the number of bits necessary to represent the encoded symbols, for example, bits-per-pixel (bpp);
a sparsification loss, for example, a loss that encourages the output of the encoder NN or the output of the quantization to have many zeros. Examples are L0 norm, L1 norm, L1 norm divided by L2 norm; and
a cross-entropy loss applied to the output of a probability model, where the probability model may be a NN used to estimate the probability of the next symbol to be encoded by the arithmetic encoder 1005.
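Two of the rate losses listed above can be illustrated with a short sketch. It assumes a fixed, hypothetical symbol probability table in place of a learned probability model, and it computes plain (non-differentiable) values, so it shows what the losses measure rather than how they would be implemented in a training framework.

```python
import math


def estimated_bits(symbols, probabilities):
    """Ideal code length in bits: sum of -log2 p(s) over the symbols.

    This is the cost an arithmetic coder would approach when driven by a
    model that assigns probability probabilities[s] to symbol s.
    """
    return sum(-math.log2(probabilities[s]) for s in symbols)


def l1_over_l2(values):
    """Sparsification measure: L1 norm divided by L2 norm (lower is sparser)."""
    l1 = sum(abs(v) for v in values)
    l2 = math.sqrt(sum(v * v for v in values))
    return l1 / l2 if l2 > 0 else 0.0


# Hypothetical quantized latent with four symbols and an assumed probability table.
symbols = [0, 0, 1, 2]
probabilities = {0: 0.5, 1: 0.25, 2: 0.25}
total_bits = estimated_bits(symbols, probabilities)

dense_score = l1_over_l2([3.0, 4.0])        # two non-zero entries
sparse_score = l1_over_l2([0.0, 0.0, 5.0])  # a single non-zero entry
```

The sparser vector obtains the lower L1/L2 value, which is why minimizing this ratio encourages outputs with many zeros.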
[0214] A similar training loss may be used for training the systems illustrated in FIG. 8 and FIG. 9.
[0215] For training one or more neural networks that are part of a codec, such as one or more neural networks in FIG. 8, FIG. 9 and/or FIG. 10, one or more of the reconstruction losses may be used, and one or more of the rate losses may be used. All the loss terms may then be combined, for example as a weighted sum, to obtain the training objective function. Typically, the different loss terms are weighted using different weights, and these weights determine how the final system performs in terms of rate-distortion loss. For example, when more weight is given to one or more of the reconstruction losses with respect to the rate losses, the system may learn to compress less but to reconstruct with higher accuracy as measured by a metric that correlates with the reconstruction losses. These weights are usually considered to be hyper-parameters of the training session and may be set manually by the operator designing the training session, or automatically, for example by grid search or by using additional neural networks.
[0216] In an example embodiment, the rate loss and the reconstruction loss may be minimized jointly at each iteration. In another example embodiment, the rate loss and the reconstruction loss may be minimized alternately, e.g., in one iteration the rate loss is minimized and in the next iteration the reconstruction loss is minimized, and so on. In yet another example embodiment, the rate loss and the reconstruction loss may be minimized sequentially, e.g., first one of the two losses is minimized for a certain number of iterations, and then the other loss is minimized for another number of iterations. These different ways of minimizing rate loss and reconstruction loss may also be combined.
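The three minimization schedules described above may be sketched as a function that reports which loss terms are minimized at a given iteration. The schedule names and the `switch_at` parameter are hypothetical labels introduced for illustration; the source does not prescribe iteration counts.

```python
def losses_for_iteration(iteration, schedule, switch_at=100):
    """Return the loss terms minimized at a given training iteration.

    schedule:
      "joint"       - both losses at every iteration
      "alternating" - rate loss on even iterations, reconstruction loss on odd ones
      "sequential"  - rate loss for the first switch_at iterations, then
                      reconstruction loss for the remaining iterations
    """
    if schedule == "joint":
        return ("rate", "reconstruction")
    if schedule == "alternating":
        return ("rate",) if iteration % 2 == 0 else ("reconstruction",)
    if schedule == "sequential":
        return ("rate",) if iteration < switch_at else ("reconstruction",)
    raise ValueError(f"unknown schedule: {schedule}")


# First four iterations of the alternating schedule.
alternating_plan = [losses_for_iteration(i, "alternating") for i in range(4)]
```

Which loss comes first in the alternating and sequential cases is an arbitrary choice here; the order could equally be reversed.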
[0217] It is to be understood that even in end-to-end learned approaches, there may be components which are not learned from data, such as an arithmetic codec.
[0218] For lossless video/image compression, the system 1000 contains the probability model 1003, the arithmetic encoder 1005, and the arithmetic decoder 1006. The system loss function contains only the rate loss, since the distortion loss is always zero; in other words, there is no loss of information.
[0219] Video Coding for Machines (VCM)
[0220] Reducing the distortion in image and video compression is often intended to increase human perceptual quality, as humans are considered to be the end users, e.g. consuming or watching the decoded images or videos. Recently, with the advent of machine learning, especially deep learning, there is a rising number of machines (e.g., autonomous agents) that analyze or process data independently from humans and may even take decisions based on the analysis results without human intervention. Examples of such analysis are object detection, scene classification, semantic segmentation, video event detection, anomaly detection, pedestrian tracking, and the like. Example use cases and applications are self-driving cars, video surveillance cameras and public safety, smart sensor networks, smart TV and smart advertisement, person re-identification, smart traffic monitoring, drones, and the like. Accordingly, when decoded data is consumed by machines, a quality metric for the decoded data may be defined, which may be different from a quality metric for human perceptual quality. Also, dedicated algorithms for compressing and decompressing data for machine consumption may be different than those for compressing and decompressing data for human consumption. The set of tools and concepts for compressing and decompressing data for machine consumption is referred to here as Video Coding for Machines.
[0221] The receiver or decoder-side device may have multiple ‘machines’ or neural networks (NNs) for analyzing or processing decoded data. These multiple machines may be used in a certain combination which is for example determined by an orchestrator sub-system. The multiple machines may be used for example in temporal succession, based on the output of the previously used machine, and/or in parallel. For example, a video which was compressed and then decompressed may be analyzed by one machine (NN) for detecting pedestrians, by another machine (another NN) for detecting cars, and by another machine (another NN) for estimating the depth of objects in the frames.
[0222] An ‘encoder-side device’ may encode input data, such as a video, into a bitstream which represents compressed data. The bitstream is provided to a ‘decoder-side device’. The term ‘receiver-side’ or ‘decoder-side’ refers to a physical or abstract entity or device which performs decoding of compressed data, and the decoded data may be input to one or more machines, circuits or algorithms. The one or more machines may not be part of the decoder. The one or more machines may be run by the same device running the decoder or by another device which receives the decoded data from the device running the decoder. Different machines may be run by different devices.
[0223] The encoded video data may be stored into a memory device, for example, as a file. The stored file may later be provided to another device.
[0224] Alternatively, the encoded video data may be streamed from one device to another.
[0225] In various embodiments, machine and neural network may be used interchangeably, and may mean any process or algorithm (e.g., learned from data or not) which analyzes or processes data for a certain task. Further, the term ‘receiver-side’ or ‘decoder-side’ refers to a physical or abstract entity or device which contains one or more machines, and runs these one or more machines on some encoded and eventually decoded video representation which is encoded by another physical or abstract entity or device, e.g., ‘encoder-side device’. In some embodiments, the encoder-side and decoder-side may be present in the same physical or abstract entity or device.
[0226] FIG. 11 illustrates a pipeline of video coding for machines (VCM), in accordance with an embodiment. A VCM encoder 1102 encodes the input video into a bitstream 1104. A bitrate 1106 may be computed 1108 from the bitstream 1104 in order to evaluate the size of the bitstream 1104. A VCM decoder 1110 decodes the bitstream 1104 output by the VCM encoder 1102. An output of the VCM decoder 1110 may be referred to, for example, as decoded data for machines 1112. This data may be considered as the decoded or reconstructed video. However, in some implementations of the pipeline of VCM, the decoded data for machines 1112 may not have the same or similar characteristics as the original video which was input to the VCM encoder 1102. For example, this data may not be easily understandable by a human, if the human watches the decoded video from a suitable output device such as a display. The output of the VCM decoder 1110 is then input to one or more task neural networks (task-NNs). For the sake of illustration, FIG. 11 is shown to include three example task-NNs, a task-NN 1114 for object detection, a task-NN 1116 for image segmentation, a task-NN 1118 for object tracking, and a non-specified one, a task-NN 1120 for performing task X. The goal of VCM is to obtain a low bitrate while guaranteeing that the task-NNs still perform well in terms of the evaluation metric associated with each task.
[0227] One of the possible approaches to realize video coding for machines is an end-to-end learned approach. FIG. 12 illustrates an example of an end-to-end learned approach, in accordance with an embodiment. In this approach, a VCM encoder 1202 and a VCM decoder 1204 mainly include neural networks. The video is input to a neural network encoder 1206. The output of the neural network encoder 1206 is input to a lossless encoder 1208, such as an arithmetic encoder, which outputs a bitstream 1210. The lossless codec may take an additional input from a probability model 1212, both in the lossless encoder 1208 and in a lossless decoder 1214, which predicts the probability of the next symbol to be encoded and decoded. The probability model 1212 may also be learned, for example it may be a neural network. At a decoder-side, the bitstream 1210 is input to the lossless decoder 1214, such as an arithmetic decoder, whose output is input to a neural network decoder 1216. The output of the neural network decoder 1216 is decoded data for machines 1218, which may be input to one or more task-NNs, a task-NN 1220 for object detection, a task-NN 1222 for object segmentation, a task-NN 1224 for object tracking, and a non-specified one, a task-NN 1226 for performing task X.
[0228] FIG. 13 illustrates an example of how the end-to-end learned system may be trained, in accordance with an embodiment. For the sake of simplicity, this embodiment is explained with help of one task-NN. However, it may be understood that multiple task-NNs may be similarly used in the training process. A rate loss 1302 may be computed 1304 from the output of a probability model 1306. The rate loss 1302 provides an approximation of the bitrate required to encode the input video data, for example, by a neural network encoder 1308. A task loss 1310 may be computed 1312 from a task output 1314 of a task-NN 1316.
[0229] The rate loss 1302 and the task loss 1310 may then be used to train 1318 the neural networks used in the system, such as the neural network encoder 1308, the probability model 1306, and a neural network decoder 1320. Training may be performed by first computing gradients of each loss with respect to the trainable parameters of the neural networks that are contributing to or affecting the computation of that loss. The gradients are then used by an optimization method, such as Adam, for updating the trainable parameters of the neural networks. It is to be understood that, as an alternative or in addition to one or more task losses and/or one or more rate losses, the training process may use additional losses which may not be directly related to one or more specific tasks, such as losses derived from pixel-wise distortion metrics (for example, MSE, MS-SSIM).
[0230] The machine tasks may be performed at decoder side (instead of at encoder side) for multiple reasons, for example, the encoder-side device may not have the capabilities (e.g. computational, power, or memory) for running the neural networks that perform these tasks, or some aspects or the performance of the task neural networks may have changed or improved by the time that the decoder-side device needs the task results (e.g., different or additional semantic classes, better neural network architecture). Also, there may be a need for customization, where different clients may run different neural networks for performing these machine learning tasks.
[0231] Neural Network Based Filtering
[0232] In some video codecs, a neural network may be used as a filter in the decoding loop, and it may be referred to as neural network loop filter, or neural network in-loop filter. The NN loop filter may replace other loop filters of an existing video codec or may represent an additional loop filter with respect to the already present loop filters in an existing video codec.
[0233] In the context of image and video enhancement, a neural network may be used as a post-processing filter, for example, applied to the output of an image or video decoder in order to remove or reduce coding artifacts.
[0234] Content-adaptation for Decoder-side Neural Networks.
[0235] Content adaptation may be performed by having the encoder-side device compute an adaptation signal for one or more NNs used at decoder side (e.g., decoder-side NNs), and signaling the adaptation signal or a signal derived from the adaptation signal to the decoder side. In an example, the adaptation signal is a weight-update. As the encoder includes the decoding operations and, in some cases, any post-processing operations, the decoder-side NNs that are content-adapted are assumed to be available also at encoder side. In practice, this may mean that two copies of one or more decoder-side NNs are available, one at encoder side and one at decoder side. In an example, a decoder-side NN may be a NN in-loop filter. In another example, a decoder-side NN may be a NN that is part of an end-to-end trained decoder. In yet another example, a decoder-side NN may be a post-processing NN. The decoder side may use the adaptation signal or a signal derived from the adaptation signal to update or adapt the one or more NNs. The updated or adapted one or more NNs are then used for their purpose, e.g., for filtering a reconstructed image block or patch.
[0236] The adaptation signal may be compressed in a lossy and/or lossless way by the encoder-side device. When the adaptation signal is compressed, the decoder side may first need to decompress the compressed adaptation signal before using it for updating or adapting the NNs.
[0237] At encoding phase (e.g., ‘inference phase’ for NNs), when a new input content needs to be encoded (such as an input image or video sequence), the encoder may decide to optimize some part of the codec or some signal produced by the codec, with respect to the specific input content. In at least some of the proposed embodiments, the terms ‘optimize’, ‘adapt’, ‘finetune’, ‘update’, and ‘overfit’ may refer to the same operation, e.g., making a part of the codec (such as the parameters of a NN) or a signal produced by the codec more specific to the input content, in order to improve the rate-distortion performance. The parameters or the signal to be adapted may belong to one or more of the following categories of parameters:
The encoder’s trainable parameters or weights;
A subset of the encoder’s trainable parameters or weights;
The output of the encoder, e.g., the latent tensor;
A subset of the output of the encoder, e.g., the latent tensor;
The decoder’s trainable parameters or weights; or
A subset of the decoder’s trainable parameters or weights.
[0238] For example, the parameters may be a subset of trainable parameters or weights of a decoder, such as the bias parameters of a neural network that is part of the decoder.
[0239] The optimization may be performed at encoder-side, and may comprise computing a loss function using the output of the decoder and possibly the output of the encoder, and differentiating the computed loss function with respect to the parameters or signal to be optimized.
[0240] When the parameters to be optimized are at least some of the parameters of a decoder-side NN, an update to those parameters (we may refer to such update as a weight-update) may need to be encoded and signaled to the decoder-side. The bitrate of the bitstream representing such signaling is an additional bitrate with respect to the bitrate of the bitstream representing the encoded image or video without any content adaptation.
[0241] There may be more than one NN available at encoder side and decoder side. Thus, one problem is how to select, out of all the available NNs, one or more optimal NNs for the overfitting process in terms of rate-distortion performance. Ideally, the one or more NNs that, after overfitting, perform best in terms of rate-distortion performance should be selected. However, there are also other factors to consider, such as computational complexity and runtime speed. This is one of the problems addressed by some of the embodiments.
[0242] Overfitting does not usually consider any additional operations which are performed on the output of the NN. This may result in sub-optimal performance of the overfitted NN. This is another problem addressed by some of the proposed embodiments.
[0243] Various embodiments propose apparatus and methods for optimizing the overfitting of one or more decoder-side NNs (DSNNs) or optimizing one or more parameters of one or more processing operations. However, it is to be understood that at least some of these embodiments may be applied for training one or more neural networks present at encoder side and/or at decoder side, on a training dataset.
[0244] A decoder-side NN is a NN that is used as part of the decoding and/or post-processing operations. An example of DSNN is an in-loop NN filter. Another example of DSNN is a post-processing NN filter. There may be more than one DSNN, for example, one in-loop NN filter and one post-processing NN filter. At least some of the DSNNs are available at both the encoder side and decoder side. For simplicity, in some embodiments, a single DSNN is considered and this single DSNN is assumed to be available at both encoder side and decoder side.
[0245] For each DSNN (e.g., for the DSNN used as in-loop filter, or for the DSNN used as post-filter) there may be more than one version available. Some embodiments address the problem of selecting one or more optimal versions to be overfitted among two or more available versions of DSNN. The two or more available versions of DSNN that are considered are referred to as candidate DSNN versions. For simplicity, the case of selecting a single optimal version is considered in some of the embodiments.
[0246] A set of data on which the NN will be run after being overfitted may be referred to as an inference set. A set of data on which the NN will be evaluated may be referred to as an evaluation set. A set of data on which the NN will be overfitted may be referred to as an overfitting set. The inference set, the evaluation set, and the overfitting set may partially or fully overlap with each other.
[0247] An embodiment proposes the following:
Run the candidate DSNN versions on the evaluation set;
Evaluate the performance of the candidate DSNN versions on the evaluation set;
Select the DSNN version that performs best or has predetermined performance;
Overfit the selected DSNN version on the overfitting set; and
Run the overfitted NN on the inference set.
[0248] Another embodiment proposes:
Overfit the candidate DSNN versions on the evaluation set;
Evaluate the performance of the overfitted candidate DSNN versions on the evaluation set;
Select the overfitted DSNN version that performs best or has predetermined performance;
When the evaluation set is different from the overfitting set, overfit the DSNN version used to obtain the selected best overfitted DSNN version on the overfitting set; and
Use the DSNN version that was overfitted on the evaluation set or on the overfitting set and run it on the inference set.
[0249] The performance may comprise a rate-distortion performance or simply a distortion-based performance.
[0250] The following embodiments address the problem of overfitting for achieving a better performance of the overfitted model.
[0251] The output of a DSNN (e.g., overfitted DSNN) may be processed by one or more processing operations, such as scaling and shifting, where at least some of the parameters of these processing operations may be optimized at encoder side and signaled to the decoder side. An embodiment proposes to take these processing operations into account during the overfitting process and/or during the training process.
[0252] Preliminary information and assumptions
[0253] Various embodiments consider the case of compressing and decompressing data. For the sake of simplicity, the embodiments consider video as the data type. In various embodiments, ‘video’ may refer to one or more video frames, unless specified otherwise. However, the proposed embodiments can be extended to other types of data such as images, audio, speech, text, and the like.
[0254] Various embodiments assume that an encoder-side device performs a compression or encoding operation by using an encoder. The output of the video encoder is a bitstream representing the compressed video. A decoder-side device performs decompression or decoding operation by using a decoder. The output of the video decoder may be referred to as decoded video. The decoded video may be post-processed by one or more post-processing operations, such as a post-processing filter. The output of the one or more post-processing operations may be referred to as post-processed video. The encoder-side device may also include some decoding operations, for example, in a coding loop, and/or at least some post-processing operations. In an example, the encoder may include all the decoding operations and any post-processing operations. The encoder-side device and the decoder-side device may be the same physical device, or different physical devices.
[0255] The decoder or the decoder-side device may contain one or more neural networks, referred to here as decoder-side neural networks (DSNNs). As the encoder-side device includes at least some decoding and/or post-processing operations, at least some of the DSNNs may be available also at encoder side. In practice, this means that the encoder-side device may include copies of at least some of the DSNNs. Some examples of such DSNNs may include but are not limited to the following:
A post-processing NN filter (here also referred to as post-filter, or NN post-filter, or post-filter NN), which takes as input at least one of the outputs of an end-to-end learned decoder or of a conventional decoder (i.e., a decoder not comprising neural networks or other components learned from data) or of a hybrid decoder (e.g., a decoder comprising one or more neural networks or other components learned from data);
A NN in-loop filter (also referred to here as in-loop NN filter, or NN loop filter, or loop NN filter), used within an end-to-end learned decoder, or within a hybrid decoder;
A learned probability model (e.g., a NN) that is used for providing estimates of probabilities of symbols to be encoded and/or decoded by a lossless coding module, within an end-to-end learned codec or within a hybrid codec; and
A decoder neural network for an end-to-end learned codec.
[0256] For the sake of simplicity, an example of a single DSNN is used when describing some of the embodiments. Also, for the sake of simplicity, a NN post-filter as an example of a DSNN is used for describing some of the embodiments. However, the embodiments may be extended to the cases of multiple DSNNs and to the case where a DSNN is used for other purposes than post-processing. Two copies of the DSNN (e.g., the NN post-filter) considered in the embodiments are assumed to be available at encoder side and decoder side.
[0257] Example Embodiments for Selecting one or more Optimal Models to be Overfitted
[0258] For each DSNN (e.g., for the DSNN used as in-loop filter, or for the DSNN used as post-filter) there may be more than one version available. The following embodiments address at least the problem of selecting one or more optimal versions to be overfitted among two or more available versions of DSNN. The two or more available versions of DSNN that are considered are referred to as candidate DSNN versions. For the sake of simplicity, at least some embodiments consider the case of selecting a single optimal version to be overfitted.
[0259] In an example, the DSNN is a post-filter, and two candidate DSNN versions have the same architecture but different values for at least some of their parameters.
[0260] A set of data on which the NN is run after being overfitted is referred to as an inference set. A set of data on which the NN is evaluated is referred to as an evaluation set. A set of data on which the NN is overfitted is referred to as the overfitting set. The inference set, the evaluation set and the overfitting set may partially or fully overlap with each other.
[0261] In an example, the inference set is a video, the evaluation set is a first random access (RA) segment of the video, and the overfitting set is the first RA segment of the video. In an example, an RA segment may be specified to start with a picture that enables random access, e.g. enables starting a decoding process from that picture. For example, an RA segment may start from an intra-coded picture, such as an IRAP picture in some video coding standards, or a gradual decoding refresh picture. The RA segment may, in some cases, be specified to pertain up to (but excluding) the next picture, in decoding order, that can start an RA segment.
[0262] In another example, the inference set is a video, the evaluation set is the first RA segment of the video, the overfitting set is the video.
[0263] In an embodiment, the encoder-side device may perform one or more of the following operations:
Run the candidate DSNN versions on the evaluation set;
Evaluate the performance of the candidate DSNN versions on the evaluation set;
Select the DSNN version that performs best or has predetermined performance;
Overfit the selected DSNN version on the overfitting set;
The overfitted DSNN may be applied on the inference set; and/or
A weight-update may be computed based at least on the weights of the overfitted DSNN and the weights of the DSNN before overfitting. The weight-update may be compressed by using lossless or lossy compression. The bitstream representing the compressed weight-update may be signaled or provided to the decoder side, in or along the bitstream representing the encoded video.
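The weight-update step in the last operation above may be sketched as follows, assuming the weight-update is the element-wise difference between the overfitted weights and the weights before overfitting, and assuming uniform scalar quantization as a stand-in for the lossy compression. The parameter name `bias`, the step size, and all function names are hypothetical; a real codec would additionally entropy-code the quantized values.

```python
def compute_weight_update(base, overfitted):
    """Encoder side: update = overfitted weights - weights before overfitting."""
    return {name: [o - b for o, b in zip(overfitted[name], base[name])]
            for name in base}


def quantize(update, step):
    """Assumed lossy compression: uniform scalar quantization to integers."""
    return {name: [round(v / step) for v in values]
            for name, values in update.items()}


def dequantize(quantized, step):
    """Decoder side: map the integer levels back to weight differences."""
    return {name: [q * step for q in levels]
            for name, levels in quantized.items()}


def apply_weight_update(base, update):
    """Decoder side: add the decompressed update to its own copy of the DSNN."""
    return {name: [b + u for b, u in zip(base[name], update[name])]
            for name in base}


# Hypothetical DSNN with a single two-element bias vector.
base_weights = {"bias": [0.0, 1.0]}
overfitted_weights = {"bias": [0.25, 0.5]}

update = compute_weight_update(base_weights, overfitted_weights)
levels = quantize(update, step=0.25)                 # signaled to the decoder
decoded_update = dequantize(levels, step=0.25)
updated_weights = apply_weight_update(base_weights, decoded_update)
```

Because the decoder holds the same weights before overfitting, only the quantized difference needs to be signaled, which is typically much smaller than the full set of weights.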
[0264] The performance may comprise a rate-distortion performance or simply a distortion-based performance.
[0265] In an example of this embodiment, the DSNN is a post-filter, and two candidate DSNN versions are considered. The two candidate DSNN versions are run on the first RA segment, e.g., the input to each candidate DSNN version comprises the decoded first RA segment. The output of each candidate DSNN version comprises the post-processed first RA segment. A first PSNR is computed based at least on the input to the candidate DSNN version and respective uncompressed data. A second PSNR is computed based at least on the output of the candidate DSNN and respective uncompressed data. The performance of the candidate DSNN versions comprises a PSNR gain, which may be computed as a difference between the second PSNR and the first PSNR. The candidate DSNN version yielding the highest PSNR gain or a predetermined PSNR gain is selected as the optimal DSNN version. The selected DSNN version is overfitted on the whole video. The overfitting may comprise one or more iterations, where each iteration comprises inputting the decoded video to the selected DSNN version, obtaining a post-processed output video from the selected DSNN version, computing a training loss based at least on the post-processed output video and respective uncompressed data, computing gradients for one or more parameters of the selected DSNN version, and using the gradients for updating the one or more parameters of the selected DSNN version. The iterations are performed until a stopping criterion is satisfied. After the overfitting of the selected DSNN version on the whole video has completed, the overfitted DSNN may be used for post-processing the decoded video. The encoder may derive a weight-update as a difference between the weights of the overfitted DSNN and the weights of the DSNN before overfitting. The derived weight-update may be compressed using a lossy and/or a lossless encoder. The bitstream representing the compressed weight-update may be signaled to the decoder in or along the bitstream representing the encoded video. At decoder side, the decoder may decompress the compressed weight-update, use the decompressed weight-update to update the post-filter, and use the updated post-filter for post-processing one or more frames of a decoded video.
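The PSNR-gain selection in this example may be sketched as follows. The candidate DSNN versions are replaced by trivial hypothetical functions that subtract a fixed offset from each sample, the frames are short lists of 8-bit sample values, and the gain is taken as the PSNR after filtering minus the PSNR before filtering; all names and data are assumptions for illustration.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak * peak / mse)


def psnr_gain(candidate, decoded, uncompressed):
    """PSNR of the candidate output minus PSNR of the candidate input."""
    before = psnr(uncompressed, decoded)
    after = psnr(uncompressed, candidate(decoded))
    return after - before


def select_candidate(candidates, decoded, uncompressed):
    """Return the name of the candidate version with the highest PSNR gain."""
    return max(candidates,
               key=lambda name: psnr_gain(candidates[name], decoded, uncompressed))


# Hypothetical evaluation set: the decoded samples are offset by +4.
uncompressed = [100.0, 110.0, 120.0, 130.0]
decoded = [104.0, 114.0, 124.0, 134.0]

# Stand-ins for two candidate post-filter versions.
candidates = {
    "v1": lambda frame: [x - 1.0 for x in frame],
    "v2": lambda frame: [x - 3.0 for x in frame],
}

gain_v1 = psnr_gain(candidates["v1"], decoded, uncompressed)
gain_v2 = psnr_gain(candidates["v2"], decoded, uncompressed)
selected = select_candidate(candidates, decoded, uncompressed)
```

In this toy setting the second candidate removes more of the offset and is therefore selected; only the selected version would then be overfitted on the whole video.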
[0266] In another embodiment, one or more of the following operations are performed:
Overfit the candidate DSNN versions on the evaluation set, thus obtaining a set of first overfitted DSNN versions;
Evaluate the performance of each of the first overfitted DSNN versions on the evaluation set;
Select the first overfitted DSNN version that performs best or has predetermined performance;
When the evaluation set is different from the overfitting set, overfit the DSNN version used to obtain the selected first overfitted DSNN version on the overfitting set, to obtain a second overfitted DSNN. Run the second overfitted DSNN version on the inference set; or
When the evaluation set is same or substantially same as the overfitting set, run the selected first overfitted DSNN version on the inference set.
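The operations above may be sketched with a deliberately small model. The sketch assumes a one-parameter post-filter that adds a bias to every sample, overfits it with a fixed number of gradient-descent steps on MSE (the fixed step count stands in for the stopping criterion), and uses MSE on the evaluation set as the distortion-based performance. The candidate names, initial bias values, learning rate, and data are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length sample lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def overfit(bias, decoded, target, steps=2, lr=0.25):
    """Gradient descent on MSE for the one-parameter filter y = x + bias."""
    for _ in range(steps):
        grad = 2.0 * sum((x + bias) - t for x, t in zip(decoded, target)) / len(decoded)
        bias -= lr * grad
    return bias


def select_by_overfitting(candidates, eval_set, overfit_set):
    """Overfit every candidate on the evaluation set, pick the best one, and
    re-overfit the original selected candidate when the overfitting set differs."""
    eval_decoded, eval_target = eval_set
    first = {name: overfit(b, eval_decoded, eval_target)
             for name, b in candidates.items()}
    best = min(first,
               key=lambda n: mse([x + first[n] for x in eval_decoded], eval_target))
    if overfit_set == eval_set:
        return best, first[best]
    decoded, target = overfit_set
    return best, overfit(candidates[best], decoded, target)


# Hypothetical candidates: two initial bias values for the same architecture.
candidates = {"A": 0.0, "B": -3.0}
eval_set = ([104.0, 114.0], [100.0, 110.0])                    # first RA segment
overfit_set = ([104.0, 114.0, 124.0, 134.0],
               [100.0, 110.0, 120.0, 130.0])                   # whole video

best_name, final_bias = select_by_overfitting(candidates, eval_set, overfit_set)
```

With only two update steps allowed, the candidate that starts closer to the ideal bias of -4 ends closer to it, so the choice of starting version affects the overfitted result even though both candidates share one architecture.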
[0267] In an example of this embodiment, the DSNN is a post-filter, and two candidate DSNN versions are considered. The two candidate DSNN versions are overfitted on the first RA segment. The overfitting of each candidate DSNN version may comprise one or more iterations, where each iteration comprises inputting the decoded first RA segment to the candidate DSNN version, obtaining a post-processed first RA segment from the candidate DSNN version, computing a training loss based at least on the post-processed first RA segment and respective uncompressed data, computing gradients for one or more parameters of the candidate DSNN version, and using the gradients for updating the one or more parameters of the candidate DSNN version. The iterations are performed until a stopping criterion is satisfied. After the overfitting of the two candidate DSNN versions has completed, for each of the two overfitted candidate DSNN versions, a first PSNR is computed based at least on the post-processed first RA segment and respective uncompressed data, and a second PSNR is computed based at least on the decoded first RA segment and respective uncompressed data. A PSNR gain for each of the two overfitted candidate DSNN versions is computed as a difference between the first PSNR and the second PSNR of the respective overfitted candidate DSNN versions. The PSNR gain of each overfitted candidate DSNN version represents the performance of each candidate DSNN version. The candidate DSNN version yielding the highest PSNR gain or a predetermined PSNR gain is selected as the optimal DSNN version. The selected DSNN version is overfitted on the whole video. The overfitting may comprise one or more iterations, where each iteration comprises inputting the decoded video to the selected DSNN version, obtaining a post-processed output video from the selected DSNN version, computing a training loss based at least on the post-processed output video and respective uncompressed data, computing gradients for one or more parameters of the selected DSNN version, and using the gradients for updating the one or more parameters of the selected DSNN version. The iterations are performed until a stopping criterion is satisfied. After the overfitting of the selected DSNN version on the whole video has completed, the overfitted DSNN may be used for post-processing the decoded video. The encoder may derive a weight-update as a difference between the weights of the overfitted DSNN and the weights of the DSNN before overfitting. The derived weight-update may be compressed using a lossy and/or a lossless encoder. The bitstream representing the compressed weight-update may be signaled to the decoder in or along the bitstream representing the encoded video. At decoder side, the decoder may decompress the compressed weight-update, use the decompressed weight-update to update the post-filter, and use the updated post-filter for post-processing one or more frames of a decoded video.
[0268] While only PSNR gain was discussed as a performance metric, any suitable performance metric measuring the improvement in one or more of the following may be used: visual quality, machine vision quality, rate-distortion performance, complexity. In an example, where the decoded and/or postprocessed video is consumed by one or more task neural networks (task-NNs) performing a machine vision task (e.g., image classification), the performance may be measured based on the gain in machine vision task accuracy, such as gain in mean accuracy. In another example, the performance may be measured based on a rate-distortion Lagrangian function, where the rate may comprise the rate of the bitstream representing the compressed video and the rate of the bitstream representing the compressed weight-update, and the distortion may comprise the mean-squared error (MSE) computed based on the output of the DSNN (or a signal derived therefrom) and corresponding ground-truth data.
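A rate-distortion Lagrangian of the kind mentioned above is commonly written as J = D + lambda * R. The following sketch assumes that form, with the rate being the sum of the video bits and the weight-update bits and the distortion being an MSE value. The function names, the option names, and the lambda values are assumptions made for illustration only.

```python
def rd_cost(distortion_mse, video_bits, weight_update_bits, lam):
    """Lagrangian cost J = D + lam * R, where R counts both bitstreams."""
    return distortion_mse + lam * (video_bits + weight_update_bits)

def pick_lowest_rd_cost(options, lam):
    """options: dict name -> (mse, video_bits, weight_update_bits)."""
    return min(options, key=lambda name: rd_cost(*options[name], lam))

# Hypothetical numbers: sending a weight-update lowers the MSE but costs bits.
options = {
    "no_update": (6.5, 10000, 0),
    "with_update": (5.0, 10000, 800),
}
choice_small_lambda = pick_lowest_rd_cost(options, lam=0.001)
choice_large_lambda = pick_lowest_rd_cost(options, lam=0.01)
```

With a small lambda the distortion reduction dominates and the weight-update pays off; with a larger lambda the extra bits outweigh the gain.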
[0269] Embodiments for Improving the Performance of Overfitting and/or Training
[0270] The following embodiments address at least the problem of overfitting for achieving a better performance of the overfitted model. It is to be understood that at least some of the following embodiments may be applied to the case of training one or more neural networks for achieving a better performance of the trained models.
[0271] The output of a DSNN may be processed by one or more processing operations, such as scaling and shifting, where at least some of the parameters of these processing operations may be optimized at encoder side and signaled to the decoder side. The one or more processing operations applied to the output of a DSNN are referred to as refinement operations.
[0272] In an example, the DSNN (e.g., the overfitted DSNN) is a post-filter that post-processes a decoded video frame-by-frame, e.g., the DSNN gets as input a decoded frame and outputs a post-processed frame. In this example, the post-processed frame is denoted as NN_out and the decoded frame is denoted as NN_in. Accordingly, the refinement operation applied to the post-processed frame may be defined as follows: refined_NN_out = (NN_out - NN_in)*s + NN_in
[0273] where refined_NN_out is the result of the refinement operation, ‘s’ is a parameter or variable that multiplies the difference between NN_out and NN_in. In this example, the value of the parameter ‘s’ is optimized at encoder side by rate-distortion optimization (RDO), by the method of least squares, or by other suitable optimization methods.
[0274] More generally, the refinement operations are performed by a function refine(NN_out, other_arguments), where ‘other_arguments’ may comprise other data necessary to compute the value of the function. Accordingly, for the example above, refined_NN_out = refine(NN_out, NN_in, s), where refine(NN_out, NN_in, s) = (NN_out - NN_in)*s + NN_in.
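As a minimal sketch of the refine() function and of a least-squares choice of ‘s’: minimizing the sum of squared errors between refined_NN_out and the ground truth over ‘s’ gives s = sum((NN_out - NN_in) * (gt - NN_in)) / sum((NN_out - NN_in)^2). The code below assumes frames are flat lists of samples; the helper name `least_squares_scale` and the fallback value used when NN_out equals NN_in are assumptions, not taken from the description.

```python
def refine(nn_out, nn_in, s):
    """refined_NN_out = (NN_out - NN_in) * s + NN_in, applied per sample."""
    return [(o - i) * s + i for o, i in zip(nn_out, nn_in)]

def least_squares_scale(nn_out, nn_in, ground_truth):
    """Scale s minimizing sum(((NN_out - NN_in) * s + NN_in - gt) ** 2)."""
    num = sum((o - i) * (g - i) for o, i, g in zip(nn_out, nn_in, ground_truth))
    den = sum((o - i) ** 2 for o, i in zip(nn_out, nn_in))
    # If the filter changed nothing, any s gives the same result; 1.0 is an
    # arbitrary fallback chosen for this sketch.
    return num / den if den > 0 else 1.0

nn_in = [10.0, 20.0, 30.0]          # decoded samples
nn_out = [11.0, 22.0, 33.0]         # post-filter output
ground_truth = [12.0, 24.0, 36.0]   # uncompressed samples
s = least_squares_scale(nn_out, nn_in, ground_truth)
refined = refine(nn_out, nn_in, s)
```

In this toy case the filter moves each sample in the right direction but only half as far as needed, so the least-squares scale doubles the residual.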
[0275] In an embodiment, the refinement operations are taken into account during the overfitting process. In particular, the overfitting process is performed by computing the training loss based at least on the output of the refinement operations. Based on the example above, where the output of the post-filter DSNN is refined for obtaining refined_NN_out, the post-filter DSNN may be overfitted by performing one or more overfitting iterations until a stopping criterion is satisfied, where each iteration may comprise (a) inputting a decoded frame NN_in to the post-filter DSNN, (b) obtaining a post-processed frame NN_out as an output of the post-filter DSNN, (c) computing a refinement refined_NN_out of the post-processed frame based at least on NN_out, NN_in, a value of the parameter ‘s’, and a refinement function refine(), where the value of the parameter ‘s’ may be determined based on least squares, (d) computing an MSE loss based at least on refined_NN_out and respective ground-truth data, where the respective ground-truth data may be the uncompressed version of the decoded frame NN_in, (e) computing gradients of the MSE loss with respect to one or more parameters of the post-filter DSNN, and (f) using the gradients for updating the one or more parameters of the post-filter DSNN.
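Steps (a) to (f) can be sketched with a deliberately tiny stand-in for the post-filter. The sketch assumes a two-parameter residual filter NN_out = NN_in + (w * NN_in + b) instead of a real DSNN, treats ‘s’ as a constant when computing gradients (the description does not state whether gradients flow through the least-squares step), and uses a loss-change threshold as the stopping criterion. All names, the learning rate, and the sample values are hypothetical.

```python
def overfit_with_refinement(nn_in, ground_truth, lr=0.1, max_iters=200, tol=1e-12):
    """Overfit a toy residual post-filter using the refined output in the loss."""
    w, b = 0.0, 0.0  # hypothetical stand-in for the DSNN parameters
    n = len(nn_in)
    losses = []
    for _ in range(max_iters):
        # (a), (b): run the post-filter on the decoded frame.
        nn_out = [x + (w * x + b) for x in nn_in]
        # (c): least-squares scale s, then the refinement operation.
        resid = [o - x for o, x in zip(nn_out, nn_in)]
        den = sum(r * r for r in resid)
        num = sum(r * (g - x) for r, g, x in zip(resid, ground_truth, nn_in))
        s = num / den if den > 0 else 1.0
        refined = [r * s + x for r, x in zip(resid, nn_in)]
        # (d): MSE loss between the refined output and the ground truth.
        err = [y - g for y, g in zip(refined, ground_truth)]
        loss = sum(e * e for e in err) / n
        converged = bool(losses) and abs(losses[-1] - loss) < tol
        losses.append(loss)
        if converged:  # stopping criterion (an assumption of this sketch)
            break
        # (e): gradients of the loss w.r.t. w and b, with s held constant.
        grad_w = sum(2.0 * e * s * x for e, x in zip(err, nn_in)) / n
        grad_b = sum(2.0 * e * s for e in err) / n
        # (f): gradient-descent update of the filter parameters.
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), losses

decoded = [0.2, 0.4, 0.6, 0.8]            # normalized decoded samples
uncompressed = [0.25, 0.42, 0.63, 0.86]   # normalized ground truth
params, history = overfit_with_refinement(decoded, uncompressed)
```

Because ‘s’ is re-estimated in every iteration, the loss after the first update can never exceed the loss of the unfiltered decoded frame, which corresponds to s = 0.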
[0276] FIG. 14 illustrates an example for overfitting a decoder-side neural network (NN) filter based on refinement operations, in accordance with an embodiment. In FIG. 14, x_i denotes the i-th input frame; a VVC codec 1402 represents a video encoder that is conformant with the specification of the VVC/H.266 standard, where the encoder also includes decoding operations; x̂_i denotes the decoded frame (previously denoted as NN_in) that is output by the VVC codec; and x̂_i is input to a post-filter DSNN that is denoted as a NN filter 1404. The NN filter 1404 outputs the post-processed frame y_i (previously denoted as NN_out), then a difference 1406 between y_i and x̂_i (denoted as r_i) is computed. The difference r_i is multiplied 1408 with a scaling parameter s_i, which is estimated by the least squares method (denoted as LST-SQ 1410). The result of the multiplication is then added 1412 to x̂_i to obtain a refined post-processed frame x̃_i (previously denoted as refined_NN_out). A loss 1414 is computed based at least on x̃_i and x_i. The loss 1414 is used for overfitting 1416 the NN filter 1404.
[0277] FIG. 15 is an example apparatus 1500, which may be implemented in hardware, caused to implement mechanisms for optimizing overfitting of neural networks or optimizing one or more parameters of one or more processing operations, based on the examples described herein. The apparatus 1500 comprises at least one processor 1502, at least one non-transitory memory 1504 including computer program code 1505, wherein the at least one memory 1504 and the computer program code 1505 are configured to, with the at least one processor 1502, cause the apparatus 1500 to implement mechanisms for optimizing the overfitting of neural network or optimizing the one or more parameters of the one or more processing operations 1506, based on the examples described herein. In an embodiment, the at least one neural network or the portion of the at least one neural network may be used at a decoder-side for decoding or reconstructing one or more media items.
[0278] The apparatus 1500 optionally includes a display 1508 that may be used to display content during rendering. The apparatus 1500 optionally includes one or more network (NW) interfaces (I/F(s)) 1510. The NW I/F(s) 1510 may be wired and/or wireless and communicate over the Internet/other network(s) via any communication technique. The NW I/F(s) 1510 may comprise one or more transmitters and one or more receivers. The NW I/F(s) 1510 may comprise standard well-known components such as an amplifier, filter, frequency-converter, (de)modulator, and encoder/decoder circuitry(ies) and one or more antennas.
[0279] The apparatus 1500 may be a remote, virtual or cloud apparatus. The apparatus 1500 may be either a coder or a decoder, or both a coder and a decoder. The at least one memory 1504 may be implemented using any suitable data storage technology, such as semiconductor based memory devices, flash memory, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The at least one memory 1504 may comprise a database for storing data. The apparatus 1500 need not comprise each of the features mentioned, or may comprise other features as well. The apparatus 1500 may correspond to or be another embodiment of the apparatus 50 shown in FIG. 1 and FIG. 2, any of the apparatuses shown in FIG. 3, or apparatus 700 of FIG. 7. The apparatus 1500 may correspond to or be another embodiment of the apparatuses shown in FIG. 20, including UE 110, RAN node 170, or network element(s) 190.
[0280] FIG. 16 illustrates an example method 1600 for optimizing overfitting of neural networks, in accordance with an embodiment. As shown in block 1506 of FIG. 15, the apparatus 1500 includes means, such as the processing circuitry 1502 or the like, for optimizing overfitting of neural network filters. At 1602, the method 1600 includes running one or more candidate neural network versions by using at least data from an evaluation set. At 1604, the method 1600 includes evaluating performance of the one or more candidate neural network versions based on the evaluation set. At 1606, the method 1600 includes selecting a candidate neural network version based on one or more predetermined performance criteria. At 1608, the method 1600 includes overfitting the selected neural network version based at least on an overfitting set. At 1610, the method 1600 includes running the overfitted neural network version on an inference set. In an embodiment, the one or more neural network versions include one or more decoder-side neural network versions, where the one or more decoder-side neural network versions are available at a decoder side and an encoder side.
[0281] In an embodiment, the evaluation set includes data for evaluating the one or more candidate neural network versions; the overfitting set includes data for overfitting the selected neural network version; and the inference set includes data for running the overfitted neural network version. In an example, the evaluation set, the overfitting set, and the inference set partially or fully overlap.
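The flow of blocks 1602 to 1610 can be sketched as below. The sketch is an assumption-laden illustration: candidate neural network versions are replaced by scalar functions, performance is measured as negative MSE, and "overfitting" is reduced to fitting a single additive offset. The helper names `neg_mse`, `overfit_offset`, and `method_1600` are hypothetical.

```python
def neg_mse(outputs, targets):
    """Performance score where higher is better: negative mean squared error."""
    return -sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)

def overfit_offset(net, overfitting_set):
    """Toy 'overfitting': learn one additive offset on the overfitting set."""
    offset = sum(t - net(x) for x, t in overfitting_set) / len(overfitting_set)
    return lambda x: net(x) + offset

def method_1600(candidates, evaluation_set, overfitting_set, inference_set):
    targets = [t for _, t in evaluation_set]
    # 1602: run every candidate version on data from the evaluation set.
    outputs = {name: [net(x) for x, _ in evaluation_set]
               for name, net in candidates.items()}
    # 1604: evaluate the performance of each candidate.
    scores = {name: neg_mse(out, targets) for name, out in outputs.items()}
    # 1606: select the candidate meeting the criterion (here: best score).
    selected = max(scores, key=scores.get)
    # 1608: overfit only the selected version on the overfitting set.
    overfitted = overfit_offset(candidates[selected], overfitting_set)
    # 1610: run the overfitted version on the inference set.
    return selected, [overfitted(x) for x in inference_set]

candidates = {"gain_0.9": lambda x: 0.9 * x, "gain_1.1": lambda x: 1.1 * x}
pairs = [(10.0, 11.0), (20.0, 22.5)]  # (decoded, uncompressed) toy samples
selected, results = method_1600(candidates, pairs, pairs, [30.0])
```

Here the evaluation set and the overfitting set fully overlap, which the description permits. Note that, unlike method 1700, the candidates are only run, not overfitted, before the selection.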
[0282] FIG. 17 illustrates an example method 1700 for optimizing the overfitting of neural networks, in accordance with another embodiment. As shown in block 1506 of FIG. 15, the apparatus 1500 includes means, such as the processing circuitry 1502 or the like, for optimizing overfitting of neural network filters. At 1702, the method 1700 includes overfitting one or more candidate neural network versions on an evaluation set to obtain a first set of overfitted neural network versions. At 1704, the method 1700 includes evaluating performance of the first set of overfitted neural network versions on the evaluation set. At 1706, the method 1700 includes selecting a first overfitted neural network version based on one or more predetermined performance criteria, to obtain a selected first overfitted neural network version. At 1708, the method 1700 includes, when the evaluation set is different from an overfitting set: overfitting a neural network version, used to obtain the selected first overfitted neural network version, on the overfitting set to obtain a second overfitted neural network version; and running the second overfitted neural network version on an inference set. At 1710, the method 1700 includes running the selected first overfitted neural network version on the inference set when the evaluation set is the same or substantially the same as the overfitting set. In an embodiment, the one or more neural network versions include one or more decoder-side neural network versions, where the one or more decoder-side neural network versions are available at a decoder side and an encoder side.
[0283] In an embodiment, the evaluation set comprises data for evaluating the one or more candidate neural network versions; the overfitting set comprises data for overfitting the selected neural network version; and the inference set comprises data for running the overfitted neural network version. In an example, the evaluation set, the overfitting set, and the inference set partially or fully overlap.
[0284] FIG. 18 illustrates an example method 1800 for optimizing one or more parameters of one or more processing operations at an encoder side, in accordance with an embodiment. As shown in block 1506 of FIG. 15, the apparatus 1500 includes means, such as the processing circuitry 1502 or the like, for optimizing one or more parameters of one or more processing operations. At 1802, the method 1800 includes processing an output of a neural network version by using one or more processing operations. At 1804, the method 1800 includes optimizing one or more parameters of the one or more processing operations at an encoder side. In an embodiment, the method 1800 may further include signalling the optimized one or more parameters to a decoder side.
[0285] In an embodiment, the one or more processing operations include at least one of a scaling operation or a shifting operation.
[0286] In an embodiment, the neural network includes a decoder-side neural network, where the decoder-side neural network is available at the decoder side and the encoder side.
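One way the encoder side could optimize a scaling and a shifting parameter is an ordinary least-squares fit of the neural network output to the uncompressed data. The sketch below assumes the processing operation is output = scale * NN_out + shift on flat sample lists; the function names and the fallback for a constant NN_out are assumptions, and quantization and signalling of the two parameters are not shown.

```python
def fit_scale_and_shift(nn_out, ground_truth):
    """Least-squares fit of scale and shift so scale * nn_out + shift ~ gt."""
    n = len(nn_out)
    mean_o = sum(nn_out) / n
    mean_g = sum(ground_truth) / n
    var = sum((o - mean_o) ** 2 for o in nn_out)
    cov = sum((o - mean_o) * (g - mean_g) for o, g in zip(nn_out, ground_truth))
    # A constant nn_out carries no slope information; keep scale at 1.0 then.
    scale = cov / var if var > 0 else 1.0
    shift = mean_g - scale * mean_o
    return scale, shift

def apply_scale_and_shift(nn_out, scale, shift):
    """Decoder-side processing using the signalled parameters."""
    return [scale * o + shift for o in nn_out]

# Encoder side: optimize the parameters against the uncompressed samples.
scale, shift = fit_scale_and_shift([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
# Decoder side: apply the received parameters to the network output.
processed = apply_scale_and_shift([1.0, 2.0, 3.0, 4.0], scale, shift)
```

Only two numbers per frame (or per sequence) would need to be signalled in this sketch, which is far cheaper than a weight-update.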
[0287] FIG. 19 illustrates an example method 1900 for optimizing the overfitting of neural networks, in accordance with yet another embodiment. As shown in block 1506 of FIG. 15, the apparatus 1500 includes means, such as the processing circuitry 1502 or the like, for optimizing overfitting of neural network filters. At 1902, the method 1900 includes overfitting one or more candidate neural network versions on an evaluation set to obtain a first set of overfitted neural network versions. At 1904, the method 1900 includes evaluating performance of the first set of overfitted neural network versions on the evaluation set. At 1906, the method 1900 includes selecting a first overfitted neural network version based on one or more predetermined performance criteria, to obtain a selected first overfitted neural network version. At 1908, the method 1900 includes determining whether the evaluation set is the same or substantially the same as an overfitting set. When it is determined at 1908 that the evaluation set is not the same or substantially the same as the overfitting set: at 1910, the method 1900 includes overfitting a neural network version, used to obtain the selected first overfitted neural network version, on the overfitting set to obtain a second overfitted neural network version; and at 1912, the method 1900 includes running the second overfitted neural network version on an inference set. When it is determined at 1908 that the evaluation set is the same or substantially the same as the overfitting set, at 1914, the method 1900 includes running the selected first overfitted neural network version on the inference set. In an embodiment, the one or more neural network versions include one or more decoder-side neural network versions, where the one or more decoder-side neural network versions are available at a decoder side and an encoder side.
[0288] In an embodiment, the evaluation set includes data for overfitting one or more candidate neural network versions and for evaluating the first set of overfitted neural network versions; the overfitting set includes data for overfitting the neural network version used to obtain the selected first overfitted neural network version; and the inference set includes data for running the selected first overfitted neural network version or the second overfitted neural network version. In an example, the evaluation set, overfitting set, and the inference set partially or fully overlap.
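The branching of method 1900 (blocks 1902 to 1914) can be sketched as follows. As a loudly stated simplification, candidate neural network versions are scalar functions, overfitting fits one additive offset, performance is negative MSE, and "same or substantially same" is reduced to list equality. All helper names are hypothetical.

```python
def overfit_offset(net, data):
    """Toy 'overfitting': learn one additive offset on (input, target) pairs."""
    offset = sum(t - net(x) for x, t in data) / len(data)
    return lambda x: net(x) + offset

def score(net, data):
    """Higher is better: negative MSE of the network on the given pairs."""
    return -sum((net(x) - t) ** 2 for x, t in data) / len(data)

def method_1900(candidates, evaluation_set, overfitting_set, inference_set):
    # 1902: overfit every candidate version on the evaluation set.
    first = {name: overfit_offset(net, evaluation_set)
             for name, net in candidates.items()}
    # 1904: evaluate the overfitted versions on the evaluation set.
    scores = {name: score(net, evaluation_set) for name, net in first.items()}
    # 1906: select the best first overfitted version.
    selected = max(scores, key=scores.get)
    # 1908: are the evaluation set and the overfitting set the same?
    if evaluation_set == overfitting_set:
        final = first[selected]  # 1914: reuse the first overfitted version
    else:
        # 1910: overfit the original (pre-overfitting) version again.
        final = overfit_offset(candidates[selected], overfitting_set)
    # 1912 / 1914: run the chosen version on the inference set.
    return selected, [final(x) for x in inference_set]

candidates = {"gain_1.0": lambda x: 1.0 * x, "gain_1.1": lambda x: 1.1 * x}
evaluation_set = [(10.0, 11.0), (20.0, 22.0)]
overfitting_set = evaluation_set + [(30.0, 33.6)]
selected, results = method_1900(candidates, evaluation_set,
                                overfitting_set, [40.0])
```

When the two sets are equal, the second overfitting pass is skipped, which is the computational saving the conditional at 1908 provides.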
[0289] Referring to FIG. 20, this figure shows a block diagram of one possible and non-limiting example in which the examples may be practiced. A user equipment (UE) 110, radio access network (RAN) node 170, and network element(s) 190 are illustrated. In the example of FIG. 20, the user equipment (UE) 110 is in wireless communication with a wireless network 100. A UE is a wireless device that can access the wireless network 100. The UE 110 includes one or more processors 120, one or more memories 125, and one or more transceivers 130 interconnected through one or more buses 127. Each of the one or more transceivers 130 includes a receiver, Rx, 132 and a transmitter, Tx, 133. The one or more buses 127 may be address, data, or control buses, and may include any interconnection mechanism, such as a series of lines on a motherboard or integrated circuit, fiber optics or other optical communication equipment, and the like. The one or more transceivers 130 are connected to one or more antennas 128. The one or more memories 125 include computer program code 123. The UE 110 includes a module 140, comprising one of or both parts 140-1 and/or 140-2, which may be implemented in a number of ways. The module 140 may be implemented in hardware as module 140-1, such as being implemented as part of the one or more processors 120. The module 140-1 may be implemented also as an integrated circuit or through other hardware such as a programmable gate array. In another example, the module 140 may be implemented as module 140-2, which is implemented as computer program code 123 and is executed by the one or more processors 120. For instance, the one or more memories 125 and the computer program code 123 may be configured to, with the one or more processors 120, cause the user equipment 110 to perform one or more of the operations as described herein. The UE 110 communicates with RAN node 170 via a wireless link 111.
[0290] The RAN node 170 in this example is a base station that provides access by wireless devices such as the UE 110 to the wireless network 100. The RAN node 170 may be, for example, a base station for 5G, also called New Radio (NR). In 5G, the RAN node 170 may be a NG-RAN node, which is defined as either a gNB or an ng-eNB. A gNB is a node providing NR user plane and control plane protocol terminations towards the UE, and connected via the NG interface to a 5GC (such as, for example, the network element(s) 190). The ng-eNB is a node providing E-UTRA user plane and control plane protocol terminations towards the UE, and connected via the NG interface to the 5GC. The NG-RAN node may include multiple gNBs, which may also include a central unit (CU) (gNB-CU) 196 and distributed unit(s) (DUs) (gNB-DUs), of which DU 195 is shown. Note that the DU may include or be coupled to and control a radio unit (RU). The gNB-CU is a logical node hosting radio resource control (RRC), SDAP and PDCP protocols of the gNB or RRC and PDCP protocols of the en-gNB that controls the operation of one or more gNB-DUs. The gNB-CU terminates the F1 interface connected with the gNB-DU. The F1 interface is illustrated as reference 198, although reference 198 also illustrates a link between remote elements of the RAN node 170 and centralized elements of the RAN node 170, such as between the gNB-CU 196 and the gNB-DU 195. The gNB-DU is a logical node hosting RLC, MAC and PHY layers of the gNB or en-gNB, and its operation is partly controlled by gNB-CU. One gNB-CU supports one or multiple cells. One cell is supported by only one gNB-DU. The gNB-DU terminates the F1 interface 198 connected with the gNB-CU. Note that the DU 195 is considered to include the transceiver 160, for example, as part of a RU, but some examples of this may have the transceiver 160 as part of a separate RU, for example, under control of and connected to the DU 195. The RAN node 170 may also be an eNB (evolved NodeB) base station, for LTE (long term evolution), or any other suitable base station or node.
[0291] The RAN node 170 includes one or more processors 152, one or more memories 155, one or more network interfaces (N/W I/F(s)) 161, and one or more transceivers 160 interconnected through one or more buses 157. Each of the one or more transceivers 160 includes a receiver, Rx, 162 and a transmitter, Tx, 163. The one or more transceivers 160 are connected to one or more antennas 158. The one or more memories 155 include computer program code 153. The CU 196 may include the processor(s) 152, memories 155, and network interfaces 161. Note that the DU 195 may also contain its own memory/memories and processor(s), and/or other hardware, but these are not shown.
[0292] The RAN node 170 includes a module 150, comprising one of or both parts 150-1 and/or 150-2, which may be implemented in a number of ways. The module 150 may be implemented in hardware as module 150-1, such as being implemented as part of the one or more processors 152. The module 150-1 may be implemented also as an integrated circuit or through other hardware such as a programmable gate array. In another example, the module 150 may be implemented as module 150-2, which is implemented as computer program code 153 and is executed by the one or more processors 152. For instance, the one or more memories 155 and the computer program code 153 are configured to, with the one or more processors 152, cause the RAN node 170 to perform one or more of the operations as described herein. Note that the functionality of the module 150 may be distributed, such as being distributed between the DU 195 and the CU 196, or be implemented solely in the DU 195.
[0293] The one or more network interfaces 161 communicate over a network such as via the links 176 and 131. Two or more gNBs 170 may communicate using, for example, link 176. The link 176 may be wired or wireless or both and may implement, for example, an Xn interface for 5G, an X2 interface for LTE, or other suitable interface for other standards.
[0294] The one or more buses 157 may be address, data, or control buses, and may include any interconnection mechanism, such as a series of lines on a motherboard or integrated circuit, fiber optics or other optical communication equipment, wireless channels, and the like. For example, the one or more transceivers 160 may be implemented as a remote radio head (RRH) 195 for LTE or a distributed unit (DU) 195 for gNB implementation for 5G, with the other elements of the RAN node 170 possibly being physically in a different location from the RRH/DU, and the one or more buses 157 could be implemented in part as, for example, fiber optic cable or other suitable network connection to connect the other elements (for example, a central unit (CU), gNB-CU) of the RAN node 170 to the RRH/DU 195. Reference 198 also indicates those suitable network link(s).
[0295] It is noted that description herein indicates that ‘cells’ perform functions, but it should be clear that equipment which forms the cell may perform the functions. The cell makes up part of a base station. That is, there can be multiple cells per base station. For example, there could be three cells for a single carrier frequency and associated bandwidth, each cell covering one-third of a 360 degree area so that the single base station’s coverage area covers an approximate oval or circle. Furthermore, each cell can correspond to a single carrier and a base station may use multiple carriers. So if there are three 120 degree cells per carrier and two carriers, then the base station has a total of 6 cells.
[0296] The wireless network 100 may include a network element or elements 190 that may include core network functionality, and which provides connectivity via a link or links 181 with a further network, such as a telephone network and/or a data communications network (for example, the Internet). Such core network functionality for 5G may include access and mobility management function(s) (AMF(s)) and/or user plane functions (UPF(s)) and/or session management function(s) (SMF(s)). Such core network functionality for LTE may include MME (Mobility Management Entity)/SGW (Serving Gateway) functionality. These are merely example functions that may be supported by the network element(s) 190, and note that both 5G and LTE functions might be supported. The RAN node 170 is coupled via a link 131 to the network element 190. The link 131 may be implemented as, for example, an NG interface for 5G, or an S1 interface for LTE, or other suitable interface for other standards. The network element 190 includes one or more processors 175, one or more memories 171, and one or more network interfaces (N/W I/F(s)) 180, interconnected through one or more buses 185. The one or more memories 171 include computer program code 173. The one or more memories 171 and the computer program code 173 are configured to, with the one or more processors 175, cause the network element 190 to perform one or more operations.
[0297] The wireless network 100 may implement network virtualization, which is the process of combining hardware and software network resources and network functionality into a single, software-based administrative entity, a virtual network. Network virtualization involves platform virtualization, often combined with resource virtualization. Network virtualization is categorized as either external, combining many networks, or parts of networks, into a virtual unit, or internal, providing network-like functionality to software containers on a single system. Note that the virtualized entities that result from the network virtualization are still implemented, at some level, using hardware such as processors 152 or 175 and memories 155 and 171, and also such virtualized entities create technical effects.
[0298] The computer readable memories 125, 155, and 171 may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor based memory devices, flash memory, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The computer readable memories 125, 155, and 171 may be means for performing storage functions. The processors 120, 152, and 175 may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs) and processors based on a multi-core processor architecture, as non-limiting examples. The processors 120, 152, and 175 may be means for performing functions, such as controlling the UE 110, RAN node 170, network element(s) 190, and other functions as described herein.
[0299] In general, the various embodiments of the user equipment 110 can include, but are not limited to, cellular telephones such as smart phones, tablets, personal digital assistants (PDAs) having wireless communication capabilities, portable computers having wireless communication capabilities, image capture devices such as digital cameras having wireless communication capabilities, gaming devices having wireless communication capabilities, music storage and playback appliances having wireless communication capabilities, Internet appliances permitting wireless Internet access and browsing, tablets with wireless communication capabilities, as well as portable units or terminals that incorporate combinations of such functions.
[0300] One or more of modules 140-1, 140-2, 150-1, and 150-2 may be caused to implement mechanisms for optimizing overfitting of neural network filters of the decoder-side neural network or optimizing one or more parameters of one or more processing operations. Computer program code 173 may also be configured to implement mechanisms for optimizing overfitting of neural network filters of the decoder-side neural network or optimizing one or more parameters of one or more processing operations.
[0301] As described above, FIGs. 16 to 19 include a flowchart of an apparatus (e.g. 50, 100, 602, 604, 700, or 1500), method, and computer program product according to certain example embodiments. It will be understood that each block of the flowcharts, and combinations of blocks in the flowcharts, may be implemented by various means, such as hardware, firmware, processor, circuitry, and/or other devices associated with execution of software including one or more computer program instructions. For example, one or more of the procedures described above may be embodied by computer program instructions. In this regard, the computer program instructions which embody the procedures described above may be stored by a memory (e.g. 58, 125, 704, or 1504) of an apparatus employing an embodiment and executed by processing circuitry (e.g. 56, 120, 702, or 1502) of the apparatus. As will be appreciated, any such computer program instructions may be loaded onto a computer or other programmable apparatus (e.g., hardware) to produce a machine, such that the resulting computer or other programmable apparatus implements the functions specified in the flowchart blocks. These computer program instructions may also be stored in a computer-readable memory that may direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture, the execution of which implements the function specified in the flowchart blocks. The computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operations to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide operations for implementing the functions specified in the flowchart blocks.
[0302] A computer program product is therefore defined in those instances in which the computer program instructions, such as computer-readable program code portions, are stored by at least one non-transitory computer-readable storage medium with the computer program instructions, such as the computer-readable program code portions, being configured, upon execution, to perform the functions described above, such as in conjunction with the flowchart(s) of FIGs. 16 to 19. In other embodiments, the computer program instructions, such as the computer-readable program code portions, need not be stored or otherwise embodied by a non-transitory computer-readable storage medium, but may, instead, be embodied by a transitory medium with the computer program instructions, such as the computer-readable program code portions, still being configured, upon execution, to perform the functions described above.
[0303] Accordingly, blocks of the flowcharts support combinations of means for performing the specified functions and combinations of operations for performing the specified functions. It will also be understood that one or more blocks of the flowcharts, and combinations of blocks in the flowcharts, may be implemented by special purpose hardware-based computer systems which perform the specified functions, or combinations of special purpose hardware and computer instructions.
[0304] In some embodiments, certain ones of the operations above may be modified or further amplified. Furthermore, in some embodiments, additional optional operations may be included. Modifications, additions, or amplifications to the operations above may be performed in any order and in any combination.
[0305] In the above, some example embodiments have been described with reference to an SEI message or an SEI NAL unit. It needs to be understood, however, that embodiments can be similarly realized with any similar structures or data units. Where example embodiments have been described with SEI messages contained in a structure, any independently parsable structures could likewise be used in embodiments. Specific SEI NAL unit and SEI message syntax structures have been presented in example embodiments, but it needs to be understood that embodiments generally apply to any syntax structures with a similar intent as SEI NAL units and/or SEI messages.
[0306] In the above, some embodiments have been described in relation to a particular type of a parameter set (namely adaptation parameter set). It needs to be understood, however, that embodiments could be realized with any type of parameter set or other syntax structure in the bitstream.
[0307] In the above, some example embodiments have been described with the help of syntax of the bitstream. It needs to be understood, however, that the corresponding structure and/or computer program may reside at the encoder for generating the bitstream and/or at the decoder for decoding the bitstream.
[0308] In the above, where example embodiments have been described with reference to an encoder, it needs to be understood that the resulting bitstream and the decoder have corresponding elements in them. Likewise, where example embodiments have been described with reference to a decoder, it needs to be understood that the encoder has structure and/or computer program for generating the bitstream to be decoded by the decoder.
[0309] Many modifications and other embodiments set forth herein will come to mind to one skilled in the art to which these inventions pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the inventions are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Moreover, although the foregoing descriptions and the associated drawings describe example embodiments in the context of certain example combinations of elements and/or functions, it should be appreciated that different combinations of elements and/or functions may be provided by alternative embodiments without departing from the scope of the appended claims. In this regard, for example, different combinations of elements and/or functions than those explicitly described above are also contemplated as may be set forth in some of the appended claims. Accordingly, the description is intended to embrace all such alternatives, modifications and variances which fall within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
[0310] It should be understood that the foregoing description is only illustrative. Various alternatives and modifications may be devised by those skilled in the art. For example, features recited in the various dependent claims could be combined with each other in any suitable combination(s). In addition, features from different embodiments described above could be selectively combined into a new embodiment. Accordingly, the description is intended to embrace all such alternatives, modifications and variances which fall within the scope of the appended claims.
[0311] References to a ‘computer’, ‘processor’, etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device such as instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device, and the like.
[0312] As used herein, the term ‘circuitry’ may refer to any of the following: (a) hardware circuit implementations, such as implementations in analog and/or digital circuitry, and (b) combinations of circuits and software (and/or firmware), such as (as applicable): (i) a combination of processor(s) or (ii) portions of processor(s)/software including digital signal processor(s), software, and memory(ies) that work together to cause an apparatus to perform various functions, and (c) circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present. This description of ‘circuitry’ applies to uses of this term in this application. As a further example, as used herein, the term ‘circuitry’ would also cover an implementation of merely a processor (or multiple processors) or a portion of a processor and its (or their) accompanying software and/or firmware. The term ‘circuitry’ would also cover, for example and if applicable to the particular element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, or another network device.
Claims
1. An apparatus comprising at least one processor; and at least one non-transitory memory comprising computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: run one or more candidate neural network versions by using at least data from an evaluation set; evaluate performance of the one or more candidate neural network versions based on the evaluation set; select a candidate neural network version based on one or more predetermined performance criteria; overfit the selected neural network version based at least on an overfitting set; and run the overfitted neural network version on an inference set.
2. The apparatus of claim 1, wherein: the evaluation set comprises data for evaluating the one or more candidate neural network versions; the overfitting set comprises data for overfitting the selected neural network version; and the inference set comprises data for running the overfitted neural network version.
3. The apparatus of any of claims 1 or 2, wherein the evaluation set, overfitting set, and the inference set partially or fully overlap.
4. The apparatus of any of the previous claims, wherein the inference set comprises a video, the evaluation set comprises a first random access (RA) segment of the video, and the overfitting set comprises the video or the first RA segment of the video.
5. The apparatus of any of the previous claims, wherein the performance criteria comprise a distortion-based performance criterion.
6. The apparatus of any of the previous claims, wherein the selected neural network version performs best according to the one or more performance criteria.
7. The apparatus of any of claims 1 to 3, wherein: the one or more candidate neural network versions comprise two candidate neural network versions; each candidate neural network version comprises a post-filter; the evaluation set comprises a first RA segment of a video; the overfitting set comprises the video; the inference set comprises a decoded video; output of the each candidate neural network version comprises a post-processed first RA segment; and wherein the apparatus is further caused to: compute a first performance metric based on input to the each candidate neural network version and a second performance metric based on output of the each candidate neural network version; compute a third performance metric comprising performance of the each candidate neural network version based on the first performance metric and the second performance metric; and select the candidate neural network version with a value of the third performance metric greater than or equal to a predetermined value as the selected neural network version.
8. The apparatus of any of the claims 1 or 7, wherein to overfit the selected neural network version, the apparatus is further caused to perform one or more iterations of the following: input the decoded video to the selected neural network version; obtain a post-processed output video from the selected neural network version; compute a training loss between the decoded video and the post-processed output video; compute gradients for one or more parameters of the selected neural network version; and use the gradients for updating the one or more parameters of the selected neural network version.
9. The apparatus of claim 8, wherein the apparatus is caused to perform the one or more iterations until a stopping criterion is met.
10. The apparatus of any of the previous claims, wherein the apparatus is further caused to: compute a weight-update based at least on weights of the overfitted neural network version and weights of the overfitted neural network version before overfitting; compress the weight-update; and signal or provide a bitstream representing the compressed weight-update to a decoder side in or along the bitstream representing an encoded data.
11. The apparatus of any of the previous claims, wherein the one or more neural network versions comprise one or more of decoder-side neural network versions, wherein the one or more of decoder-side neural network versions are available at a decoder side and an encoder side.
12. An apparatus comprising at least one processor; and at least one non-transitory memory comprising computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: overfit one or more candidate neural network versions on an evaluation set to obtain a first set of overfitted neural network versions; evaluate performance of the first set of overfitted neural network versions on the evaluation set; select a first overfitted neural network version based on one or more predetermined performance criteria, to obtain a selected first overfitted neural network; when the evaluation set is different from an overfitting set: overfit a neural network version, used to obtain the selected first overfitted neural network version, on an overfitting set to obtain a second overfitted neural network version; and run the second overfitted neural network version on an inference set; and run the selected first overfitted neural network version on the inference set when the evaluation set is same or substantially same as the overfitting set.
13. The apparatus of claim 12, wherein: the evaluation set comprises data for overfitting one or more candidate neural network versions and for evaluating the first set of overfitted neural network versions; the overfitting set comprises data for overfitting the neural network version used to obtain the selected first overfitted neural network version; and the inference set comprises data for running the selected first overfitted neural network version or the second overfitted neural network version.
14. The apparatus of any of claims 12 or 13, wherein the evaluation set, overfitting set, and the inference set partially or fully overlap.
15. The apparatus of any of the previous claims, wherein the performance criteria comprise a distortion-based performance criterion.
16. The apparatus of any of the previous claims, wherein the selected first overfitted neural network version performs best according to the one or more performance criteria.
17. The apparatus of any of claims 12 to 14, wherein: the one or more candidate neural network versions comprise two candidate neural network versions; each candidate neural network version comprises a post-processing filter; the two candidate neural network versions are overfitted on a first RA segment of a video, to obtain two overfitted candidate neural network versions; and wherein the apparatus is further caused to: compute a fourth performance metric comprising performance of the each overfitted candidate neural network version based on a fifth performance metric and a sixth performance metric, wherein the fifth performance metric is based on a post-processed first RA segment and the sixth performance metric is based on a decoded first RA segment; select an overfitted candidate neural network version with a value of the fourth performance metric greater than or equal to a predetermined value as an optimal neural network version, to obtain a selected overfitted candidate neural network version; overfit the candidate neural network version used to obtain the selected overfitted candidate neural network version on the video, to obtain an overfitted selected neural network version; and post-process a decoded video by using the overfitted selected neural network.
18. The apparatus of claim 17, wherein to overfit the each candidate neural network version, the apparatus is caused to perform one or more iterations of the following: provide a decoded first RA segment as an input to the each candidate neural network version; obtain a post-processed first RA segment from the each candidate neural network version; compute a training loss based at least on the post-processed first RA segment and respective uncompressed data; compute gradients for one or more parameters of the each candidate neural network version; and use the gradients for updating the one or more parameters of the each candidate DSNN version.
19. The apparatus of claim 18, wherein the apparatus is caused to perform the one or more iterations until a stopping criterion is met.
20. The apparatus of any of the claims 16, 17, or 18, wherein to overfit the selected neural network version, the apparatus is further caused to perform one or more iterations of the following: provide the decoded video as an input to the selected neural network version; obtain a post-processed output video from the selected neural network version; compute a training loss based at least on the post-processed output video and respective uncompressed data; compute gradients for one or more parameters of the selected neural network version; and use the gradients for updating the one or more parameters of the selected neural network version.
21. The apparatus of claim 20, wherein the apparatus is caused to perform the one or more iterations until a stopping criterion is met.
22. The apparatus of any of the previous claims, wherein the one or more neural networks comprise one or more decoder-side neural networks, and wherein the one or more decoder-side neural networks are available at a decoder side and an encoder side.
23. An apparatus comprising at least one processor; and at least one non-transitory memory comprising computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: process an output of a neural network version by using one or more processing operations; and optimize one or more parameters of the one or more processing operations at an encoder side.
24. The apparatus of claim 23, wherein the apparatus is further caused to signal the optimized one or more parameters, or information derived from the optimized one or more parameters, to a decoder side.
25. The apparatus of claim 23 or 24, wherein the one or more processing operations comprise a refinement operation, and wherein the apparatus is further caused to apply the refinement operation on an output of the neural network based at least on the optimized one or more parameters.
26. The apparatus of claim 25, wherein the refinement operation is defined as follows: refined_NN_out = (NN_out - NN_in)*s + NN_in; wherein the NN_out comprises an output of the neural network; wherein the NN_in comprises an input to the neural network; wherein s comprises a parameter that multiplies a difference between NN_out and NN_in; and wherein refined_NN_out is a result of the refinement operation.
27. The apparatus of any of claims 23 or 26, wherein the apparatus is further caused to train or to overfit the neural network version based on the one or more processing operations.
28. The apparatus of claim 25, wherein to train or to overfit the neural network, the apparatus is further caused to: provide input data to the neural network; obtain output data from the neural network; compute a refined output data based at least on the output data from the neural network and a refinement function; compute a loss based at least on the refined output data and respective ground-truth data, wherein the respective ground-truth data comprises uncompressed version of the input data to the neural network; compute gradients of the MSE loss with respect to one or more parameters of the neural network; and use the gradients for updating the one or more parameters of the neural network.
29. The apparatus of claim 28, wherein the neural network comprises a post-processing filter, and wherein an input data to the post-processing filter is a decoded frame, and an output data from the post-processing filter is a post-processed frame.
30. The apparatus of any of the previous claims, wherein the one or more processing operations comprise at least one of a scaling operation or a shifting operation.
31. The apparatus of any of the previous claims, wherein the neural network comprises a decoder-side neural network, and wherein the decoder-side neural network is available at the decoder side and the encoder side.
32. A method comprising: running one or more candidate neural network versions by using at least data from an evaluation set; evaluating performance of the one or more candidate neural network versions based on the evaluation set; selecting a candidate neural network version based on one or more predetermined performance criteria; overfitting the selected neural network version based at least on an overfitting set; and running the overfitted neural network version on an inference set.
33. The method of claim 32, wherein: the evaluation set comprises data for evaluating the one or more candidate neural network versions; the overfitting set comprises data for overfitting the selected neural network version; and the inference set comprises data for running the overfitted neural network version.
34. The method of any of claims 32 or 33, wherein the evaluation set, overfitting set, and the inference set partially or fully overlap.
35. The method of any of the previous claims, wherein the inference set comprises a video, the evaluation set comprises a first random access (RA) segment of the video, and the overfitting set comprises the video or the first RA segment of the video.
36. The method of any of the previous claims, wherein the performance criteria comprise a distortion-based performance criterion.
37. The method of any of the previous claims, wherein the selected neural network version performs best according to the one or more performance criteria.
38. The method of any of claims 32 to 34, wherein: the one or more candidate neural network versions comprise two candidate neural network versions; each candidate neural network version comprises a post-filter; the evaluation set comprises a first RA segment of a video; the overfitting set comprises the video; the inference set comprises a decoded video; output of the each candidate neural network version comprises a post-processed first RA segment; and wherein the method further comprises: computing a first performance metric based on input to the each candidate neural network version and a second performance metric based on output of the each candidate neural network version; computing a third performance metric comprising performance of the each candidate neural network version based on the first performance metric and the second performance metric; and selecting the candidate neural network version with a value of the third performance metric greater than or equal to a predetermined value as the selected neural network version.
39. The method of any of the claims 32 or 38, wherein to overfit the selected neural network version, the method further comprises performing one or more iterations of the following: inputting the decoded video to the selected neural network version; obtaining a post-processed output video from the selected neural network version; computing a training loss between the decoded video and the post-processed output video; computing gradients for one or more parameters of the selected neural network version; and using the gradients for updating the one or more parameters of the selected neural network version.
40. The method of claim 39, wherein the one or more iterations are performed until a stopping criterion is met.
41. The method of any of the previous claims further comprising: computing a weight-update based at least on weights of the overfitted neural network version and weights of the overfitted neural network version before overfitting; compressing the weight-update; and signaling or providing a bitstream representing the compressed weight-update to a decoder side in or along the bitstream representing an encoded data.
42. The method of any of the previous claims, wherein the one or more neural network versions comprise one or more of decoder-side neural network versions, wherein the one or more of decoder-side neural network versions are available at a decoder side and an encoder side.
43. A method comprising: overfitting one or more candidate neural network versions on an evaluation set to obtain a first set of overfitted neural network versions; evaluating performance of the first set of overfitted neural network versions on the evaluation set; selecting a first overfitted neural network version based on one or more predetermined performance criteria, to obtain a selected first overfitted neural network; when the evaluation set is different from an overfitting set: overfitting a neural network version, used to obtain the selected first overfitted neural network version, on an overfitting set to obtain a second overfitted neural network version; and running the second overfitted neural network version on an inference set; and running the selected first overfitted neural network version on the inference set when the evaluation set is same or substantially same as the overfitting set.
44. The method of claim 43, wherein: the evaluation set comprises data for overfitting one or more candidate neural network versions and for evaluating the first set of overfitted neural network versions; the overfitting set comprises data for overfitting the neural network version used to obtain the selected first overfitted neural network version; and the inference set comprises data for running the selected first overfitted neural network version or the second overfitted neural network version.
45. The method of any of claims 43 or 44, wherein the evaluation set, overfitting set, and the inference set partially or fully overlap.
46. The method of any of the previous claims, wherein the performance criteria comprise a distortion-based performance criterion.
47. The method of any of the previous claims, wherein the selected first overfitted neural network version performs best according to the one or more performance criteria.
48. The method of any of claims 43 to 45, wherein: the one or more candidate neural network versions comprise two candidate neural network versions; each candidate neural network version comprises a post-processing filter; the two candidate neural network versions are overfitted on a first RA segment of a video, to obtain two overfitted candidate neural network versions; and wherein the method further comprises: computing a fourth performance metric comprising performance of the each overfitted candidate neural network version based on a fifth performance metric and a sixth performance metric, wherein the fifth performance metric is based on a post-processed first RA segment and the sixth performance metric is based on a decoded first RA segment; selecting an overfitted candidate neural network version with a value of the fourth performance metric greater than or equal to a predetermined value as an optimal neural network version, to obtain a selected overfitted candidate neural network version; overfitting the candidate neural network version used to obtain the selected overfitted candidate neural network version on the video, to obtain an overfitted selected neural network version; and post-processing a decoded video by using the overfitted selected neural network.
49. The method of claim 48, wherein to overfit the each candidate neural network version, the method further comprises performing one or more iterations of the following: providing a decoded first RA segment as an input to the each candidate neural network version; obtaining a post-processed first RA segment from the each candidate neural network version; computing a training loss based at least on the post-processed first RA segment and respective uncompressed data; computing gradients for one or more parameters of the each candidate neural network version; and using the gradients for updating the one or more parameters of the each candidate DSNN version.
50. The method of claim 49, wherein the one or more iterations are performed until a stopping criterion is met.
51. The method of any of the claims 47, 48, or 49, wherein to overfit the selected neural network version, the method further comprises performing one or more iterations of the following: providing the decoded video as an input to the selected neural network version; obtaining a post-processed output video from the selected neural network version; computing a training loss based at least on the post-processed output video and respective uncompressed data; computing gradients for one or more parameters of the selected neural network version; and using the gradients for updating the one or more parameters of the selected neural network version.
52. The method of claim 51, wherein the one or more iterations are performed until a stopping criterion is met.
53. The method of any of the previous claims, wherein the one or more neural networks comprise one or more decoder-side neural networks, and wherein the one or more decoder-side neural networks are available at a decoder side and an encoder side.
54. A method comprising: processing an output of a neural network version by using one or more processing operations; and optimizing one or more parameters of the one or more processing operations at an encoder side.
55. The method of claim 54 further comprising signaling the optimized one or more parameters, or information derived from the optimized one or more parameters, to a decoder side.
56. The method of claim 54 or 55, wherein the one or more processing operations comprise a refinement operation, and wherein the method further comprises applying the refinement operation on an output of the neural network based at least on the optimized one or more parameters.
57. The method of claim 56, wherein the refinement operation is defined as follows: refined_NN_out = (NN_out - NN_in)*s + NN_in; wherein the NN_out comprises an output of the neural network; wherein the NN_in comprises an input to the neural network; wherein s comprises a parameter that multiplies a difference between NN_out and NN_in; and wherein refined_NN_out is a result of the refinement operation.
58. The method of any of claims 54 or 57 further comprising training or overfitting the neural network version based on the one or more processing operations.
59. The method of claim 56, wherein to train or to overfit the neural network, the method further comprises: providing input data to the neural network; obtaining output data from the neural network; computing a refined output data based at least on the output data from the neural network and a refinement function; computing a loss based at least on the refined output data and respective ground-truth data, wherein the respective ground-truth data comprises uncompressed version of the input data to the neural network; computing gradients of the MSE loss with respect to one or more parameters of the neural network; and using the gradients for updating the one or more parameters of the neural network.
60. The method of claim 59, wherein the neural network comprises a post-processing filter, and wherein an input data to the post-processing filter is a decoded frame, and an output data from the post-processing filter is a post-processed frame.
61. The method of any of the previous claims, wherein the one or more processing operations comprise at least one of a scaling operation or a shifting operation.
62. The method of any of the previous claims, wherein the neural network comprises a decoder-side neural network, and wherein the decoder-side neural network is available at the decoder side and the encoder side.
63. A computer readable medium comprising program instructions for causing an apparatus to perform at least the following: run one or more candidate neural network versions by using at least data from an evaluation set; evaluate performance of the one or more candidate neural network versions based on the evaluation set; select a candidate neural network version based on one or more predetermined performance criteria; overfit the selected neural network version based at least on an overfitting set; and run the overfitted neural network version on an inference set.
64. The computer readable medium of claim 63, wherein the computer readable medium comprises a non-transitory computer readable medium.
65. The computer readable medium of any of claims 63 or 64, wherein the computer readable medium further causes the apparatus to perform the methods as claimed in any of the claims 33 to 42.
66. A computer readable medium comprising program instructions for causing an apparatus to perform at least the following: overfit one or more candidate neural network versions on an evaluation set to obtain a first set of overfitted neural network versions; evaluate performance of the first set of overfitted neural network versions on the evaluation set; select a first overfitted neural network version based on one or more predetermined performance criteria, to obtain a selected first overfitted neural network; when the evaluation set is different from an overfitting set: overfit a neural network version, used to obtain the selected first overfitted neural network version, on an overfitting set to obtain a second overfitted neural network version; and run the second overfitted neural network version on an inference set; and run the selected first overfitted neural network version on the inference set when the evaluation set is same or substantially same as the overfitting set.
67. The computer readable medium of claim 66, wherein the computer readable medium comprises a non-transitory computer readable medium.
68. The computer readable medium of any of claims 66 or 67, wherein the computer readable medium further causes the apparatus to perform the methods as claimed in any of the claims 44 to 53.
69. A computer readable medium comprising program instructions for causing an apparatus to perform at least the following: overfit one or more candidate neural network versions on an evaluation set to obtain a first set of overfitted neural network versions; evaluate performance of the first set of overfitted neural network versions on the evaluation set; select a first overfitted neural network version based on one or more predetermined performance criteria, to obtain a selected first overfitted neural network; when the evaluation set is different from an overfitting set: overfit a neural network version, used to obtain the selected first overfitted neural network version, on an overfitting set to obtain a second overfitted neural network version; and run the second overfitted neural network version on an inference set; and run the selected first overfitted neural network version on the inference set when the evaluation set is same or substantially same as the overfitting set.
70. The computer readable medium of claim 69, wherein the computer readable medium comprises a non-transitory computer readable medium.
71. The computer readable medium of any of claims 69 or 70, wherein the computer readable medium further causes the apparatus to perform the methods as claimed in any of the claims 55 to 62.
72. An apparatus comprising means for performing the methods as claimed in any of the claims 32 to 42.
73. An apparatus comprising means for performing the methods as claimed in any of the claims 43 to 53.
74. An apparatus comprising means for performing the methods as claimed in any of the claims 54 to 62.
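As an informal illustration only, and not part of the claims, the candidate selection of claims 7 and 38 and the refinement operation of claims 26 and 57 may be sketched as follows. The sketch assumes that the first and second performance metrics are PSNR values measured against the uncompressed data, that the third metric is their difference, and that the parameter s is optimized at the encoder side by a closed-form least-squares fit. The claims do not mandate any of these choices, and all function names (`psnr`, `select_candidate`, `refine`, `optimal_scale`) are hypothetical.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized lists of samples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a, b, peak=1.0):
    """PSNR in dB (assumed metric); infinite when the signals are identical."""
    err = mse(a, b)
    return math.inf if err == 0 else 10.0 * math.log10(peak ** 2 / err)


def select_candidate(candidates, decoded, original, min_gain_db=0.0):
    """Return (name, gain) of the candidate with the largest PSNR gain.

    First metric:  PSNR of the filter input (decoded data) vs. original.
    Second metric: PSNR of the filter output vs. original.
    Third metric:  second minus first (assumed definition of the gain).
    Returns None when no candidate reaches min_gain_db.
    """
    psnr_in = psnr(decoded, original)
    best = None
    for name, nn in candidates.items():
        gain = psnr(nn(decoded), original) - psnr_in
        if gain >= min_gain_db and (best is None or gain > best[1]):
            best = (name, gain)
    return best


def refine(nn_in, nn_out, s):
    """refined_NN_out = (NN_out - NN_in) * s + NN_in, applied per sample."""
    return [(o - i) * s + i for i, o in zip(nn_in, nn_out)]


def optimal_scale(nn_in, nn_out, original):
    """Least-squares s that minimizes the MSE between the refined output
    and the original (an assumed optimization method for illustration)."""
    num = sum((o - i) * (t - i) for i, o, t in zip(nn_in, nn_out, original))
    den = sum((o - i) ** 2 for i, o in zip(nn_in, nn_out))
    return num / den if den else 1.0  # fall back to s = 1 if NN_out == NN_in
```

In this sketch an encoder would call `select_candidate` on the evaluation set, overfit the returned candidate, and then compute `optimal_scale` on the overfitted output so that s could be signaled to the decoder side, which would apply `refine` with the received value. Setting s = 0 returns the filter input unchanged and s = 1 returns the unrefined filter output.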
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202263362774P | 2022-04-11 | 2022-04-11 | |
US63/362,774 | 2022-04-11 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023199172A1 (en) | 2023-10-19 |
Family
ID=86272316
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/IB2023/053425 WO2023199172A1 (en) | 2022-04-11 | 2023-04-04 | Apparatus and method for optimizing the overfitting of neural network filters |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2023199172A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170347110A1 (en) * | 2015-02-19 | 2017-11-30 | Magic Pony Technology Limited | Online Training of Hierarchical Algorithms |
US20190147361A1 (en) * | 2017-02-03 | 2019-05-16 | Panasonic Intellectual Property Management Co., Ltd. | Learned model provision method and learned model provision device |
WO2021165569A1 (en) * | 2020-02-21 | 2021-08-26 | Nokia Technologies Oy | A method, an apparatus and a computer program product for video encoding and video decoding |
US20220108171A1 (en) * | 2020-10-02 | 2022-04-07 | Google Llc | Training neural networks using transfer learning |
2023-04-04: WO application PCT/IB2023/053425 (WO2023199172A1, en), status unknown
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US12036036B2 (en) | High-level syntax for signaling neural networks within a media bitstream | |
US11375204B2 (en) | Feature-domain residual for video coding for machines | |
US20230217028A1 (en) | Guided probability model for compressed representation of neural networks | |
US12113974B2 (en) | High-level syntax for signaling neural networks within a media bitstream | |
EP4168936A1 (en) | Apparatus, method and computer program product for optimizing parameters of a compressed representation of a neural network | |
US20240265240A1 (en) | Method, apparatus and computer program product for defining importance mask and importance ordering list | |
US20240249514A1 (en) | Method, apparatus and computer program product for providing finetuned neural network | |
US20240289590A1 (en) | Method, apparatus and computer program product for providing an attention block for neural network-based image and video compression | |
US20240202507A1 (en) | Method, apparatus and computer program product for providing finetuned neural network filter | |
WO2023135518A1 (en) | High-level syntax of predictive residual encoding in neural network compression | |
WO2023280558A1 (en) | Performance improvements of machine vision tasks via learned neural network based filter | |
US20230325639A1 (en) | Apparatus and method for joint training of multiple neural networks | |
US20230412806A1 (en) | Apparatus, method and computer program product for quantizing neural networks | |
EP4181511A2 (en) | Decoder-side fine-tuning of neural networks for video coding for machines | |
US20230186054A1 (en) | Task-dependent selection of decoder-side neural network | |
US20230196072A1 (en) | Iterative overfitting and freezing of decoder-side neural networks | |
US20240146938A1 (en) | Method, apparatus and computer program product for end-to-end learned predictive coding of media frames | |
US20240013046A1 (en) | Apparatus, method and computer program product for learned video coding for machine | |
WO2022269469A1 (en) | Method, apparatus and computer program product for federated learning for non independent and non identically distributed data | |
US20230169372A1 (en) | Appratus, method and computer program product for probability model overfitting | |
US20240267543A1 (en) | Transformer based video coding | |
US20240121387A1 (en) | Apparatus and method for blending extra output pixels of a filter and decoder-side selection of filtering modes | |
WO2023199172A1 (en) | Apparatus and method for optimizing the overfitting of neural network filters | |
US20240357104A1 (en) | Determining regions of interest using learned image codec for machines | |
WO2024084353A1 (en) | Apparatus and method for non-linear overfitting of neural network filters and overfitting decomposed weight tensors |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23720385; Country of ref document: EP; Kind code of ref document: A1 |