CN117980958A - Neural network for accurate rendering in display management - Google Patents

Neural network for accurate rendering in display management

Info

Publication number
CN117980958A
CN117980958A (application CN202280064469.4A)
Authority
CN
China
Prior art keywords
image
neural network
input
layer
pyramid
Prior art date
Legal status
Pending
Application number
CN202280064469.4A
Other languages
Chinese (zh)
Inventor
A. K. A. Choudhury
R. Atkins
Current Assignee
Dolby Laboratories Licensing Corp
Original Assignee
Dolby Laboratories Licensing Corp
Priority date
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corp
Priority claimed from PCT/US2022/041199 (WO2023028046A1)
Publication of CN117980958A

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 5/00: Image enhancement or restoration
            • G06T 5/90: Dynamic range modification of images or parts thereof
          • G06T 2207/00: Indexing scheme for image analysis or image enhancement
            • G06T 2207/10: Image acquisition modality
              • G06T 2207/10016: Video; Image sequence
              • G06T 2207/10024: Color image
            • G06T 2207/20: Special algorithmic details
              • G06T 2207/20016: Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
              • G06T 2207/20084: Artificial neural networks [ANN]
              • G06T 2207/20172: Image enhancement details
                • G06T 2207/20208: High dynamic range [HDR] image processing

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

Methods and systems for precision rendering in display mapping using neural networks are described. A series of neural networks, comprising a pyramid halving sub-network, a pyramid downsampling sub-network, a pyramid upsampling sub-network, and a final layer generation sub-network, generates the base layer and detail layer images used in display mapping.

Description

Neural network for accurate rendering in display management
Cross Reference to Related Applications
The present application claims priority from U.S. provisional patent application No. 63/236,476, filed on August 24, 2021, and European patent application No. 21206398.6, filed on November 4, 2021, each of which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates generally to images. More particularly, embodiments of the invention relate to precision rendering in display management.
Background
As used herein, the term "Dynamic Range (DR)" may relate to the ability of the Human Visual System (HVS) to perceive a range of intensities (e.g., luminance, brightness) in an image, such as from darkest gray (black) to brightest white (highlight). In this sense, DR is related to the "scene-referred" intensity of the reference scene. DR may also relate to the ability of a display device to adequately or approximately render an intensity range of a particular breadth (breadth). In this sense, DR is related to the "reference display (display-referred)" intensity. Unless a specific meaning is explicitly specified to have a specific meaning at any point in the description herein, it should be inferred that the terms can be used interchangeably in either sense, for example.
As used herein, the term "High Dynamic Range (HDR)" relates to DR broadness of about 14 to 15 orders of magnitude across the Human Visual System (HVS). Indeed, DR of a broad breadth in the range of intensities that humans can simultaneously perceive may be slightly truncated relative to HDR. As used herein, the term "Enhanced Dynamic Range (EDR) or Visual Dynamic Range (VDR)" may be related to such DR either alone or interchangeably: the DR may be perceived within a scene or image by the Human Visual System (HVS) including eye movement, allowing some light on the scene or image to adapt to changes.
In practice, an image comprises one or more color components (e.g., luma Y and chroma Cb and Cr), where each color component is represented with a precision of n bits per pixel (e.g., n = 8). For example, using gamma luminance coding, images where n ≤ 8 (e.g., color 24-bit JPEG images) are considered standard dynamic range images, while images where n ≥ 10 may be considered enhanced dynamic range images. EDR and HDR images may also be stored and distributed using high-precision (e.g., 16-bit) floating-point formats, such as the OpenEXR file format developed by Industrial Light & Magic.
As used herein, the term "metadata" relates to any auxiliary information that is transmitted as part of the encoded bitstream and that assists the decoder in rendering the decoded image. Such metadata may include, but is not limited to, minimum, average and maximum luminance values in the image, color space or gamut information, reference display parameters, and auxiliary signal parameters as described herein.
Most consumer desktop displays currently support luminance levels of 200 to 300 cd/m² or nits. Most consumer HDTVs range from 300 to 500 nits, with new models reaching 1,000 nits (cd/m²). Such conventional displays thus typify a lower dynamic range (LDR), also referred to as standard dynamic range (SDR), in relation to HDR or EDR. As the availability of HDR content grows due to advances in both capture equipment (e.g., cameras) and HDR displays (e.g., the PRM-4200 professional reference monitor from Dolby Laboratories), HDR content may be color graded and displayed on HDR displays that support higher dynamic ranges (e.g., from 1,000 nits to 5,000 nits or more). In general, and without limitation, the methods of the present disclosure relate to any dynamic range higher than SDR.
As used herein, the term "display management" refers to a process performed on a receiver for rendering a picture for a target display. Such processes may include, for example, but are not limited to, tone mapping, gamut mapping, color management, frame rate conversion, and the like.
As used herein, the term "precision rendering" refers to a downsampling and upsampling/filtering process for dividing an input image into two layers, namely: filtered base layer image and detail layer image (reference [2 ]). By applying the tone mapping curve to the filtered base layer in tone mapping or display mapping, and then adding the detail layer back to the result, the original contrast of the image can be preserved globally as well as locally. This may also be referred to as "detail preservation" or "local tone mapping". A more detailed description of the exact rendering will be provided later.
The creation and playback of high dynamic range (HDR) content is now becoming widespread, as HDR technology offers more realistic and lifelike images than earlier formats. Meanwhile, IC manufacturers have begun to incorporate hardware accelerators for neural networks (NNs). To improve existing display schemes while taking advantage of such neural network accelerators, as appreciated herein by the inventors, improved techniques for precision rendering and display management using neural networks have been developed.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Thus, unless otherwise indicated, any of the approaches described in this section should not be construed as qualifying as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, issues identified with respect to one or more approaches should not be assumed to have been recognized in any prior art on the basis of this section.
Drawings
Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
FIG. 1 depicts an example process of a video transmission pipeline;
FIG. 2 depicts an example process of display management using precision rendering according to an embodiment of the invention;
FIG. 3 depicts an example of a precision rendering pipeline according to an embodiment of the invention;
FIG. 4 depicts an example neural network for a pyramid downsampling sub-network in accordance with an embodiment of the present invention;
FIG. 5A depicts an example neural network for a pyramid upsampling sub-network according to an embodiment of the present invention;
FIG. 5B depicts an example neural network for an edge filter in a pyramid upsampling sub-network according to an embodiment of the present invention;
FIG. 5C depicts an example neural network for an upsampling filter in a pyramid upsampling sub-network according to an embodiment of the present invention; and
FIG. 6 depicts an example neural network for a final layer generation sub-network, according to an embodiment of the invention.
Detailed Description
Methods and systems for precision rendering in display management using neural networks are described herein. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It may be evident, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are not described in exhaustive detail, in order to avoid unnecessarily obscuring or confusing the present invention.
SUMMARY
Example embodiments described herein relate to methods of precision rendering in display management using neural network architecture. In an embodiment, a neural network system receives an input image of a first dynamic range and a first spatial resolution. Next, the system performs the following operations:
generating an input intensity image (I) based on the input image;
generating a second intensity image by sub-sampling the input intensity image using a pyramid halving network until its spatial resolution is less than or equal to a second spatial resolution;
generating a set of downsampled images based on the second intensity image and a pyramid downsampling neural network;
generating two upsampled images of a second spatial resolution based on the set of downsampled images and a pyramid upsampling neural network that includes edge-aware upsampling filtering; and
An output Base Layer (BL) image of a first spatial resolution is generated by combining the two up-sampled images and the input intensity image in a final layer neural network.
Neural network for accurate rendering in display management
Video encoding and decoding pipeline
Fig. 1 depicts an example process of a conventional video transmission pipeline (100) that illustrates various stages from video capture to video content display. A sequence of video frames (102) is captured or generated using an image generation block (105). The video frames (102) may be captured digitally (e.g., by a digital camera) or generated by a computer (e.g., using a computer animation) to provide video data (107). Alternatively, the video frames (102) may be captured on film by a film camera. The film is converted to a digital format to provide video data (107). In a production phase (110), the video data (107) is edited to provide a video production stream (112).
The video data of the production stream (112) is then provided to a processor at block (115) for post-production editing. Post-production editing in block (115) may include adjusting or modifying colors or brightness in particular regions of an image to enhance image quality or to achieve a particular appearance for the image in accordance with the video creator's creative intent. This is sometimes called "color timing" or "color grading." Other editing (e.g., scene selection and sequencing, image cropping, addition of computer-generated visual effects, etc.) may be performed at block (115) to yield a final version (117) of the production for distribution. During post-production editing (115), video images are viewed on a reference display (125).
Following post-production (115), the video data of the final production (117) may be delivered to an encoding block (120) for downstream delivery to decoding and playback devices such as televisions, set-top boxes, movie theaters, and the like. In some embodiments, the encoding block (120) may include audio and video encoders, such as those defined by ATSC, DVB, DVD, Blu-ray, and other delivery formats, to generate the coded bitstream (122). In a receiver, the coded bitstream (122) is decoded by a decoding unit (130) to generate a decoded signal (132) representing an identical or close approximation of the signal (117). The receiver may be attached to a target display (140) which may have completely different characteristics than the reference display (125). In that case, a display management block (135) may be used to map the dynamic range of the decoded signal (132) to the characteristics of the target display (140) by generating a display-mapped signal (137). Examples of display management processes are described in references [1] and [2], without limitation.
Global tone mapping technique and local tone mapping technique
In conventional global display mapping, a mapping algorithm applies a single sigmoid-like function (see, e.g., references [3] and [4]) to map the input dynamic range to the dynamic range of the target display. Such mapping functions may be expressed as piecewise linear or non-linear polynomials characterized by anchor points, pivots, and other polynomial parameters generated using characteristics of the input source and of the target display. For example, in references [3-4], the mapping function uses anchor points based on luminance characteristics (e.g., the minimum, mid (average), and maximum luminance) of the input image and of the display. However, other mapping functions may use different statistics, such as the variance or standard deviation of luminance values at the block level, in picture slices, or in the whole image.
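The exact curves in references [3-4] are specified by those anchor points; purely as an illustration of the general idea, the sketch below applies a generic S-shaped curve in the log-luminance domain. The function name, the smoothstep shape, and the use of only min/max anchors are assumptions for illustration, not the parametrization used in those references.

```python
import numpy as np

def global_tone_curve(L, src_min, src_max, tgt_min, tgt_max):
    """Illustrative sigmoid-like global tone curve (not the exact curve of [3-4],
    which also uses a mid-luminance anchor). L and the anchors are in nits."""
    x = np.log10(np.clip(L, 1e-6, None))
    x0, x1 = np.log10(src_min), np.log10(src_max)
    y0, y1 = np.log10(tgt_min), np.log10(tgt_max)
    t = np.clip((x - x0) / (x1 - x0), 0.0, 1.0)   # normalize the source log-range to [0, 1]
    s = t * t * (3.0 - 2.0 * t)                   # smoothstep: a simple S-shaped stand-in
    return 10.0 ** (y0 + s * (y1 - y0))           # map to the target log-range
```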
As described in more detail in reference [2], the display mapping process (135) can be further improved by taking into consideration local contrast and detail information of the input image. For example, as described later, using downsampling and upsampling/filtering processes, the input image may be split into two layers: a filtered base layer image and a detail layer image. By applying the tone mapping curve to the filtered base layer and then adding the detail layer back to the result, the original contrast of the image can be preserved both globally and locally. This may also be referred to as "detail preservation" or "precision rendering."
Thus, the display mapping may be performed as a multi-stage operation:
a) Generating a Base Layer (BL) image to direct SDR (or HDR) to HDR mapping;
b) Performing tone mapping on the base layer image;
c) The detail layer image is added to the tone mapped base layer image.
In reference [2], the generated Base Layer (BL) represents a spatially blurred, edge-preserved version of the original image. That is, it retains important edges but blurs finer details. More specifically, generating the BL image may include:
Using the intensity of the original image, creating an image pyramid with lower resolution layers, and saving each layer
Starting from the lowest resolution layer, upsampling to a higher layer to generate the base layer. Examples of generating base layer and detail layer images may be found later in this specification.
Neural network architecture
FIG. 2 depicts an example process (200) of display management using precision rendering (225). As depicted in fig. 2, the input video (202) may include video received from a video decoder and/or video received from a graphics processing unit (e.g., from a set-top box), and/or other video input (e.g., from an HDMI port in a camera, television, or set-top box, a Graphics Processing Unit (GPU), etc.). The input video 202 may be characterized as an "SDR" video or an "HDR" video that is to be displayed on an HDR display or SDR display after appropriate dynamic range conversion.
In an embodiment, the process 200 comprises a mapping curve generating unit (215) for generating a tone mapping curve based on a characteristic of the intensity (I) of the input signal. Examples of such processes can be found in references [1-5 ]. The output of the mapping curve generation unit is fed to a display mapping unit (220) together with the output of the exact rendering block (225) and the optional detail layer prediction unit (230) to generate a mapping output 222.
To extract intensity, an input RGB image may be converted to a luma-chroma color format, such as YCbCr, ICtCp, and the like, using color conversion techniques known in the art (such as ITU-R Rec. BT.2100 and the like). In an alternative embodiment, intensity may be characterized as the per-pixel maximum of the R, G, and B components. The intensity extraction step may be bypassed if the source image is already represented as a single-channel intensity image. In some embodiments, the pixel values may also be normalized to [0,1] according to a predefined standard dynamic range (e.g., between 0.005 and 100 nits) in order to compute image statistics.
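A minimal sketch of this intensity-extraction step, assuming linear-light RGB input in nits and the max(R, G, B) option; the log-domain normalization over the 0.005 to 100 nit range is one plausible reading of "normalized to [0,1]" and is flagged as such in the code.

```python
import numpy as np

def extract_intensity(rgb_nits, lo=0.005, hi=100.0):
    """rgb_nits: H x W x 3 array of linear-light RGB values in nits.
    Returns a single-channel intensity image I normalized to [0, 1]."""
    I = rgb_nits.max(axis=2)          # per-pixel maximum of the R, G and B components
    I = np.clip(I, lo, hi)
    # Normalization over the predefined standard range; the log domain is an assumption,
    # the text only requires values in [0, 1].
    return (np.log10(I) - np.log10(lo)) / (np.log10(hi) - np.log10(lo))
```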
As depicted in FIG. 2, process 200 includes a precision rendering block (225) which, given the intensity (I) of the original image, generates a base layer (BL) image (I_BL) and a detail layer (DL) image (I_DL). In an embodiment, the pixel at position (x, y) in the detail layer image is generated as
I_DL(x, y) = I(x, y) - I_BL(x, y) * dg,    (1)
where dg denotes a detail-gain scalar in [0,1].
The detail layer prediction block (230) takes as input two channels: the detail layer (DL) channel of the input image and the intensity (I) channel of the source image. The neural network generates a single-channel predicted detail layer (PDL) image, of the same resolution as the detail layer image, containing residual values to be added to the detail layer image. In an embodiment, the detail layer residual stretches the local contrast of the output image to increase its perceived contrast and dynamic range. By utilizing both the detail layer input and the input image, as discussed in reference [5], a neural network implementation of block 230 can predict the contrast stretching based not only on the content of the detail layer but also on the content of the source image. To some extent, this gives the neural network (NN) the opportunity to correct any issues that may arise from the fixed precision-rendering decomposition into a base image and a detail image.
In some embodiments, the base layer I_BL may be used directly, or in combination with the input intensity image I, e.g.,
I_B = α * I_BL + (1 - α) * I,
where α is a scalar in [0,1]. When α = 0, tone mapping is equivalent to conventional global tone mapping. When α = 1, tone mapping is performed only on the base layer image.
Given I_DL, an optional scalar β in [0,1], applied to the image I_DL, may be used to adjust the sharpening of the tone-mapped output and generate the final tone-mapped image
I′ = I′_BL + I_DL * β,    (2)
where I′_BL denotes a tone-mapped version of I_BL (or I_B). When detail layer prediction (230) is used, then
I′ = I′_BL + (I_DL + PDL) * β.    (3)
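Collecting equations (1) to (3) and the α-blend above into one place, a minimal sketch of how the mapped intensity is assembled (tone_map stands in for the curve produced by block 215; the function and argument names are illustrative):

```python
import numpy as np

def display_map_intensity(I, I_BL, tone_map, PDL=None, dg=1.0, alpha=1.0, beta=1.0):
    """I: input intensity; I_BL: base layer from the precision rendering block (225);
    tone_map: callable applying the mapping curve of block 215;
    PDL: optional predicted detail-layer residual from block 230."""
    I_DL = I - I_BL * dg                     # equation (1)
    I_B = alpha * I_BL + (1.0 - alpha) * I   # alpha = 1: tone-map only the base layer
    I_BL_mapped = tone_map(I_B)              # tone mapping applied to the (blended) base layer
    if PDL is None:
        return I_BL_mapped + I_DL * beta             # equation (2)
    return I_BL_mapped + (I_DL + PDL) * beta         # equation (3)
```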
In alternative implementations, process 200 may be simplified by bypassing (removing) detail layer prediction (230) and by using only the original Detail Layer (DL). Thus, given a pyramid representation of an input image, process 200 may adjust as follows:
in block 225, the intensity of the input image is separated into a base layer and a detail layer
Generating a mapping curve in block 215
Generating an optimized mapping of only the Base Layer (BL) of the input image using the mapping curve
Add the original Detail Layer (DL) to the optimization map to generate the final image (see, e.g., equation (2)).
FIG. 3 depicts an example of a precision rendering pipeline according to an embodiment. As depicted in fig. 3, a Precision Rendering Network (PRN) may be divided into four consecutive sub-networks:
A pyramid halving sub-network (305);
A pyramid downsampling sub-network (310);
A pyramid upsampling sub-network (315); and
A final layer generation sub-network.
The output of each of these sub-networks forms the input of the next sub-network.
Given the sequential nature of the precision rendering process, embodiments may elect to apply a neural network only to selected steps and apply traditional processing to the remaining steps. In other embodiments, two or more consecutive sub-networks may be combined into a larger sub-network. In an embodiment, all four sub-networks may also be combined into a single neural network. It is expected that the partitioning between neural-network processing and traditional processing will depend heavily on the availability of hardware accelerators for neural-network processing.
In an embodiment, pyramid halving sub-network 305 may be considered a preprocessing step for adjusting resolution constraints of the rest of the network. For example, if the rest of the network (e.g., steps 310, 315) can only process images with a resolution of up to 1024 x 576, this step can be iteratively invoked until the width of the output image is below 1024 or the height of the image is below 576. The network may also be used to replicate/fill boundary pixels so that all possible inputs meet the resolution requirements of the sub-network.
For example, for an input image of 4K resolution, the first layer (e.g., 2K resolution) may be skipped. Then, during upsampling (e.g., in step 320), the quarter resolution image would simply be doubled twice. Similarly, for an 8K resolution input image, one-half and one-quarter resolution layers may be skipped. This ensures that the subsequent layers of the pyramid will have the same dimensions, regardless of the input image size.
In the remainder of this description, convolutional networks are defined by their kernel size in pixels (M×N), the number of image channels (C) they operate on, and the number of such kernels (K) in the filter bank. In this sense, each convolution can be described by the size of its filter bank, M×N×C×K (where M×N denotes width × height). For example, a filter bank of size 3×3×1×2 consists of 2 convolution kernels, each operating on one channel and each of size 3 pixels by 3 pixels. Where a convolutional network includes a bias, this will be denoted as bias (B) = True; otherwise, as B = False.
Some filter banks may also have a stride, meaning that some results of the convolution are discarded. A stride (S) of 1 means that every input pixel produces an output pixel. A stride of 2 means that only every other pixel in each dimension produces an output, and so forth. Thus, a filter bank with a stride of 2 produces an output of (M/2) × (N/2) pixels, where M×N is the input image size. All inputs, except inputs to fully connected kernels, are padded, so setting the stride to 1 produces an output with the same number of pixels as the input. The output of each convolution stage is fed as input to the next convolution layer.
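As an illustrative mapping of this notation onto a common framework (PyTorch here; not part of the patent text), an M×N×C×K filter bank with stride S and bias B corresponds to:

```python
import torch.nn as nn

def conv_bank(M, N, C, K, S=1, B=False):
    """M x N kernel (width x height), C input channels, K kernels, stride S, bias B."""
    # PyTorch expects kernel_size as (height, width), hence (N, M).
    return nn.Conv2d(in_channels=C, out_channels=K, kernel_size=(N, M), stride=S, bias=B)

# Example: the 3x3x1x2 bank mentioned above, i.e. two 3x3 kernels on a 1-channel input.
bank = conv_bank(3, 3, 1, 2, S=1, B=False)
```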
In an embodiment, the pyramid halving network (305) has a padding unit (denoted "Fill" in FIGS. 4 and 5), followed by a single convolution operation with bias (B) = False and a stride of 2, which is essentially a downsampling of the image. It can thus be expressed as a 2×2×1×1 convolutional network with stride (S) = 2. For example, given a 1920×1080 input, its output would be 960×540. The padding unit simply adds rows and columns to the input image so that, regardless of the resolution of the input I, inputs whose resolution does not match the requirements of the convolutional network are converted to match the required resolution (e.g., 1024×576).
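A sketch of the halving step under those assumptions; the 2×2 box-average weights and the replicate-padding rule are illustrative (the text fixes only the 2×2×1×1, stride-2 shape), and the 1024×576 iteration limit follows the example above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidHalving(nn.Module):
    """Pad, then downsample by 2 with a single 2x2x1x1 convolution, stride 2, no bias."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=2, stride=2, bias=False)
        with torch.no_grad():
            self.conv.weight.fill_(0.25)   # assumed 2x2 averaging kernel

    def forward(self, x):
        _, _, h, w = x.shape
        # Replicate border pixels so both dimensions are even before halving.
        x = F.pad(x, (0, w % 2, 0, h % 2), mode='replicate')
        return self.conv(x)

def halve_until_fits(I, max_w=1024, max_h=576):
    # Invoke the halving network iteratively until the image fits the rest of the network.
    halve = PyramidHalving()
    while I.shape[-1] > max_w or I.shape[-2] > max_h:
        I = halve(I)
    return I
```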
The pyramid downsampling sub-network (310) generates a pyramid representation of the input, to be used later for achieving improved tone mapping. For example, in an embodiment, given a full-HD input, the pyramid may generate the following layers: 1024×576, 512×288, 256×144, 128×72, 64×36, 32×18, and 16×9.
Although the pyramid is described in terms of sub-sampling with a sub-sampling factor of 2, other sub-sampling factors may be used without loss of generality. Since this stage is used for downsampling, a stride of 2 is used in each convolution filter. Prior to computing the first level of the pyramid (e.g., 1024×576), the input image may be padded by replicating boundary pixels, to account for input images of various sizes or aspect ratios.
Prior to computing the first level of the pyramid (e.g., 1024×576), the input image may be padded so as to:
ensure that, from the smallest to the largest pyramid level, all spatial dimensions are divisible by two;
replicate boundary pixels, thereby taking into account a designated region of interest (ROI); and
replicate boundary pixels, thereby taking into account input images of various sizes or aspect ratios.
FIG. 4 depicts an example neural network for the pyramid downsampling sub-network (310). In an embodiment, sub-network 310 includes a padding network (405) followed by six consecutive convolutional neural network blocks (e.g., 410-2, 410-6, 410-7), each 4×2×1, with B = False and S = 2. Thus, given an input (402) of 960×540, and starting from 1024×576 (layer 1), the network generates the further outputs: 512×288 (layer 2), 256×144 (layer 3), 128×72 (layer 4), 64×36 (layer 5), 32×18 (layer 6), and 16×9 (layer 7). The pyramid downsampling sub-network/neural network 310 may thus generate a set of images forming an N-level (e.g., N = 7) pyramid representation of the input image (402). The pyramid downsampling sub-network 310 may include two or more consecutive convolution blocks, where each convolution block may generate the downsampled image of a respective layer of the pyramid representation. The downsampled image of the i-th pyramid layer is denoted P(i); for i = 2, ..., N, it may have a lower spatial resolution than the spatial resolution of the downsampled image P(i-1) of the (i-1)-th pyramid layer.
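A sketch of this sub-network under stated assumptions: the padding split is symmetric replicate padding up to 1024×576, and the kernel shape and weights of the six stride-2 blocks are placeholders for whatever FIG. 4 actually specifies; only the level resolutions follow the text.

```python
import torch.nn as nn
import torch.nn.functional as F

class PyramidDown(nn.Module):
    """Produce pyramid levels P(1)..P(7): 1024x576, 512x288, ..., 32x18, 16x9."""
    def __init__(self, levels=7, kernel_size=4):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Conv2d(1, 1, kernel_size, stride=2,
                      padding=kernel_size // 2 - 1, bias=False)
            for _ in range(levels - 1)])

    def forward(self, x, level1_hw=(576, 1024)):
        # Pad the (e.g. 960x540) input up to the first pyramid level, 1024x576,
        # by replicating boundary pixels (illustrative symmetric split).
        _, _, h, w = x.shape
        dh, dw = level1_hw[0] - h, level1_hw[1] - w
        x = F.pad(x, (dw // 2, dw - dw // 2, dh // 2, dh - dh // 2), mode='replicate')
        pyramid = [x]                          # P(1)
        for blk in self.blocks:
            x = blk(x)
            pyramid.append(x)                  # P(2) ... P(7)
        return pyramid
```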
FIG. 5A depicts an example neural network for the pyramid upsampling sub-network (315). The network receives the downsampled pyramid data from the pyramid downsampling sub-network (310) and reconstructs the original image at its original resolution using edge-aware upsampling filters at each layer. The lowest-resolution level of the pyramid (e.g., 16×9) is upsampled first, and then the other levels are processed and upsampled, up to the resolution of the highest-resolution pyramid level (e.g., 1024×576).
The pyramid image at layer i is denoted P(i). Starting from the lowest-resolution level (e.g., i = 7), the lowest-resolution pyramid image (e.g., P(7)) is fed to an edge-preserving filter (505), which generates two coefficient "images," denoted al(7) and bl(7) (defined below). Next, both al(7) and bl(7) are upsampled by a factor of two using the upsampling layer NN (510) to generate the upsampled coefficient images a(7) and b(7).
At the next layer, i = 6, the P(6) layer of the pyramid is combined with the upsampled coefficient images a(7) and b(7) to generate the image
F(6) = a(7) * P(6) + b(7),    (4)
which is fed, together with the image P(6), to the edge upsampling filter to generate the coefficient "images" al(6) and bl(6). Next, both al(6) and bl(6) are upsampled by a factor of two to generate the upsampled coefficient images a(6) and b(6). The same process continues for the other pyramid layers. In general, for i = 7, 6, 5, ..., 2,
F(i-1) = a(i) * P(i-1) + b(i),    (5)
where multiplying a coefficient image with an image corresponds to multiplying their corresponding pixels, pixel by pixel. For example, at pixel location (m, n), for a pyramid level i of size W(i) × H(i),
F(i-1)_(m,n) = a(i)_(m,n) * P(i-1)_(m,n) + b(i)_(m,n),    (6)
for m = 1, 2, ..., W(i-1) and n = 1, 2, ..., H(i-1).
As depicted in FIG. 5A, at layer 7, P(7) = F(7), and at layer 1 there is no need to apply the upsampling filter (510). Furthermore, at layer 1, given the 1024×576 outputs of the edge filter, two "strip" blocks crop them to 960×540.
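A sketch of that recursion, written against a list of pyramid levels P(1)..P(7) stored as 2-D arrays. The edge_filter callable is the FIG. 5B filter (a sketch of it follows below); nearest-neighbour doubling stands in for the learned FIG. 5C upsampler, and the top-left crop standing in for the "strip" blocks is likewise an assumption.

```python
import numpy as np

def upsample2x(img):
    # Stand-in for the FIG. 5C upsampler: double rows and columns by repetition.
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def pyramid_upsample(P, edge_filter, crop_hw=(540, 960)):
    """P: list of pyramid levels [P(1), ..., P(7)] (index 0 holds level 1).
    Returns a(1), b(1) at the cropped resolution (e.g. 960x540)."""
    n = len(P)                                    # number of levels, e.g. 7
    Fimg = P[n - 1]                               # F(7) = P(7)
    for level in range(n, 1, -1):                 # levels 7, 6, ..., 2
        al, bl = edge_filter(Fimg, P[level - 1])  # al(level), bl(level)
        a, b = upsample2x(al), upsample2x(bl)     # a(level), b(level), now at level-1 size
        Fimg = a * P[level - 2] + b               # equation (5): F(level-1)
    al1, bl1 = edge_filter(Fimg, P[0])            # layer 1: no upsampling filter (510)
    h, w = crop_hw
    return al1[:h, :w], bl1[:h, :w]               # "strip" blocks: 1024x576 -> 960x540
```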
FIG. 5B depicts an example neural network for the edge filter (505) in the pyramid upsampling sub-network, according to an embodiment. Given the two inputs (F, P), the edge filter generates the corresponding al(i) and bl(i) values using multiple basic arithmetic blocks (e.g., addition, multiplication, and division) and four 3×3×1×1, S = 1 and B = False convolutional neural network blocks (also referred to as convolution blocks), whose outputs are denoted C1, C2, C3, and C4. The other inputs of the edge filter are the weights PW[i,0] and PW[i,1], whose values are in the range [0,1] (reference [2]).
C1 represents a local average of F, C2 represents a local average of (F*P), C3 represents a local average of (P*P), and C4 represents a local average of P. Thus, as can be seen from FIG. 5B:
T1 = C2 - (C1*C4),
T2 = T1/((C3 - C4²) + PW[i,0]),
T3 = C1 - (T2*C4),
bl(i) = T3*PW[i,1],
al(i) = (T2*PW[i,1]) + (1 - PW[i,1]).    (7)
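This is essentially a guided-filter-style local linear fit. A sketch with plain box averages standing in for the four 3×3 convolution blocks (the uniform 3×3 window is an assumption; the learned blocks need not be uniform):

```python
import numpy as np

def box3(img):
    # 3x3 local average with replicated borders, standing in for one 3x3x1x1 conv block.
    p = np.pad(img, 1, mode='edge')
    h, w = img.shape
    return sum(p[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)) / 9.0

def edge_filter(F, P, pw0, pw1):
    """Coefficient images al, bl of FIG. 5B / equation (7); pw0, pw1 are PW[i,0], PW[i,1]."""
    C1, C2, C3, C4 = box3(F), box3(F * P), box3(P * P), box3(P)
    T1 = C2 - C1 * C4
    T2 = T1 / ((C3 - C4 ** 2) + pw0)
    T3 = C1 - T2 * C4
    bl = T3 * pw1
    al = T2 * pw1 + (1.0 - pw1)
    return al, bl
```

In the recursion sketch above, edge_filter would be wrapped so that the per-level weights PW[i,0] and PW[i,1] are supplied for each level; that bookkeeping is omitted here.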
FIG. 5C depicts an example neural network for the upsampling filter (510) in the pyramid upsampling sub-network, according to an embodiment. Given an m×n input (e.g., al(i) or bl(i)), the filter generates a 2m×2n output (e.g., a(i) or b(i)). The upsampling filter includes two processing stages, each of which mimics a conventional separable filter operating on rows (or columns) and columns (or rows). The stage-1 processing includes one padding block and two 3×1×1×1, S = 1 and B = False convolution blocks. The stage-2 processing includes one padding block and two 1×3×1×1, S = 1 and B = False convolution blocks. At each stage, the outputs of the two convolution blocks are spliced using a "Concatenate" block. For "column splicing," if both inputs are m×n, the output will be m×2n. Column splicing does not, however, simply concatenate the two inputs; rather, it creates the output by interleaving, taking one column at a time from each input. Likewise, for the "row splicing" block, since both of its inputs may be m×2n, the block takes one row at a time from each input to interleave and generate a 2m×2n image.
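A sketch of the splice behaviour only; the two per-stage convolutions (which would produce the even and odd phases) are passed in as callables, and the padding blocks are omitted:

```python
import numpy as np

def column_splice(x_even, x_odd):
    # m x n and m x n -> m x 2n, taking one column at a time from each input.
    m, n = x_even.shape
    out = np.empty((m, 2 * n), dtype=x_even.dtype)
    out[:, 0::2], out[:, 1::2] = x_even, x_odd
    return out

def row_splice(x_even, x_odd):
    # m x 2n and m x 2n -> 2m x 2n, taking one row at a time from each input.
    m, n2 = x_even.shape
    out = np.empty((2 * m, n2), dtype=x_even.dtype)
    out[0::2, :], out[1::2, :] = x_even, x_odd
    return out

def upsample_filter(x, col_conv_a, col_conv_b, row_conv_a, row_conv_b):
    """FIG. 5C sketch: two 3x1 convolutions + column splice, then two 1x3 convolutions
    + row splice; the conv callables are placeholders for the learned blocks."""
    stage1 = column_splice(col_conv_a(x), col_conv_b(x))       # m x n  -> m x 2n
    return row_splice(row_conv_a(stage1), row_conv_b(stage1))  # m x 2n -> 2m x 2n
```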
FIG. 6 depicts an example neural network for the final layer generation sub-network, according to an embodiment of the invention. The network takes as inputs the original intensity image (I) and the outputs a(1) and b(1) from the pyramid upsampling sub-network (315), to generate the output base layer (BL):
BL = I_BL = a(1) * I + b(1).    (8)
As depicted in FIG. 6, the network may include optional upsampling and padding blocks so that the resolution of the BL matches the resolution of the input I. For example, if the resolution of a(1) and b(1) is 960×540, the output of the upsampling layer will be 1920×1080. If the resolution of I is 1920×1080, the padding block will also generate a 1920×1080 output. As discussed earlier, the upsampling layer NN may be applied multiple times, to match the number of times the pyramid halving network (305) was applied.
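A sketch of this last step; nearest-neighbour doubling again stands in for the learned upsampling layer, and the trailing crop standing in for the padding/size matching is illustrative:

```python
import numpy as np

def final_layer(I, a1, b1):
    """I: full-resolution intensity (e.g. 1920x1080); a1, b1: the 960x540 outputs of the
    pyramid upsampling sub-network. Returns BL = a(1)*I + b(1), equation (8)."""
    # Upsample the coefficient images until they cover I, mirroring the number of
    # pyramid-halving steps applied on the way down.
    while a1.shape[0] < I.shape[0] or a1.shape[1] < I.shape[1]:
        a1 = np.repeat(np.repeat(a1, 2, axis=0), 2, axis=1)
        b1 = np.repeat(np.repeat(b1, 2, axis=0), 2, axis=1)
    a1 = a1[:I.shape[0], :I.shape[1]]
    b1 = b1[:I.shape[0], :I.shape[1]]
    return a1 * I + b1
```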
In another embodiment, instead of applying the upsampling network multiple times, a dedicated NN may be applied directly to upsample the image by an appropriate factor (e.g., 4x, 8x, etc.). For example, in an embodiment, NN 510 (see FIG. 5C) may be modified for 4x upsampling as follows:
replace the row of two 3×1 convolution blocks with a row of four 5×1 convolution blocks, where all outputs are provided as inputs to a column splice network having four inputs and one output;
using the column splice network, generate an m×4n output by interleaving the columns of its inputs, as discussed earlier;
replace the row of two 1×3 convolution blocks with a row of four 1×5 convolution blocks, where all outputs are provided as inputs to a row splice network having four inputs and one output; and
using the row splice network, generate a 4m×4n output by interleaving the rows of its inputs, as discussed earlier.
In an embodiment, the edge filter weights may be derived outside of the NN implementation. However, the weights may also be derived through an offline training process using a batch of images. The entire network may be trained on pairs of input images and corresponding base layer images. For example, a large number of (HDR) images may be smoothed using the analyzer block described in reference [2], or any edge-preserving smoothing process may be applied. Multiple mini-batches of such pairs may be iteratively taken as inputs, where the error difference between the reference image and the predicted smoothed image is back-propagated through the network until the error converges or performance reaches an acceptable state on a validation set. Upon convergence, the corresponding weights of each convolution filter are stored for use during run time.
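A sketch of that offline training loop, assuming the whole precision rendering network is wrapped in one differentiable PyTorch module and that the reference base layers were produced beforehand with an edge-preserving smoother; the L1 loss, Adam optimizer, and batch size are assumptions.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def train_prn(prn, dataset, epochs=50, lr=1e-4, device="cpu"):
    """prn: module mapping an intensity image to a predicted (smoothed) base layer.
    dataset yields (intensity, reference_base_layer) tensor pairs."""
    prn.to(device)
    opt = torch.optim.Adam(prn.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    for _ in range(epochs):
        for intensity, ref_bl in loader:
            intensity, ref_bl = intensity.to(device), ref_bl.to(device)
            loss = F.l1_loss(prn(intensity), ref_bl)  # error vs. the reference smoothed image
            opt.zero_grad()
            loss.backward()                           # back-propagate through all sub-networks
            opt.step()
    return prn
```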
In conventional image processing, filter weights would be selected to achieve a locally optimal result, but that result would not necessarily translate to a global optimum because of the presence of the other components. A neural network architecture has visibility across the entire network, and the weights of each convolution block can be selected optimally for each sub-network.
References
Each one of the references listed herein is incorporated by reference in its entirety.
[1] R. Atkins, U.S. Patent 9,961,237, "Display management for high dynamic range video."
[2] R. Atkins et al., "Display management for high dynamic range images," PCT Application PCT/US2020/028552, filed on April 16, 2020, published as WO/2020/219341.
[3] A. Ballestad and A. Kostin, U.S. Patent 8,593,480, "Method and apparatus for image data transformation."
[4] J. A. Pytlarz and R. Atkins, U.S. Patent 10,600,166, "Tone curve mapping for high dynamic range images."
[5] Wanat et al., "Neural networks for dynamic range conversion and display management," U.S. Provisional Patent Application Ser. No. 63/226,847, filed on July 29, 2021, also filed as PCT/US2022/037991 on July 22, 2022.
Example computer System embodiment
Embodiments of the invention may be implemented with a computer system, systems configured in electronic circuitry and components, an integrated circuit (IC) device such as a microcontroller, a field programmable gate array (FPGA) or another configurable or programmable logic device (PLD), a discrete-time or digital signal processor (DSP), an application-specific IC (ASIC), and/or an apparatus including one or more of such systems, devices, or components. The computer and/or IC may perform, control, or execute instructions relating to image transformations, such as those described herein. The computer and/or IC may compute any of the various parameters or values described herein in connection with precision rendering in a display mapping process. The image and video embodiments may be implemented in hardware, software, firmware, and various combinations thereof.
Certain embodiments of the invention include a computer processor executing software instructions that cause the processor to perform the method of the invention. For example, one or more processors in a display, encoder, set-top box, transcoder, etc. may implement the methods described above in connection with precision rendering in a display map by executing software instructions in a program memory accessible to the processors. The present invention may also be provided in the form of a program product. The program product may comprise any tangible and non-transitory medium carrying a set of computer readable signals comprising instructions that, when executed by a data processor, cause the data processor to perform the method of the invention. The program product according to the present invention may take any of a variety of tangible forms. The program product may comprise, for example, physical media such as magnetic data storage media including floppy disks, hard disk drives, optical data storage media including CD ROMs, DVDs, electronic data storage media including ROMs, flash RAM, etc. The computer readable signal on the program product may optionally be compressed or encrypted.
Where a component (e.g., a software module, processor, assembly, device, circuit, etc.) is referred to above, unless otherwise indicated, reference to that component (including a reference to "a means") should be interpreted as including as equivalents of that component any component which performs the function of the described component (e.g., that is functionally equivalent), including components which are not structurally equivalent to the disclosed structure which performs the function in the illustrated example embodiments of the invention.
Equivalents, extensions, alternatives and miscellaneous items
Example embodiments that relate to precision rendering in display mapping are thus described. In the foregoing specification, embodiments of the present invention have been described with reference to numerous specific details that may vary from implementation to implementation. Thus, the sole and exclusive indicator of what is the invention, and what is intended by the applicants to be the invention, is the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction. Any definitions expressly set forth herein for terms contained in such claims shall govern the meaning of such terms as used in the claims. Hence, no limitation, element, property, feature, advantage, or attribute that is not expressly recited in a claim should limit the scope of such claim in any way. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Exemplary embodiments of enumeration
The present invention may be embodied in any of the forms described herein, including but not limited to the following Enumerated Example Embodiments (EEEs) that describe the structure, features, and functions of portions of the present invention.
EEE 1. A method for precision rendering in a display map, the method comprising:
accessing an input image of a first dynamic range and a first spatial resolution;
generating an input intensity image (I) based on the input image;
generating a second intensity image by sub-sampling the input intensity image until its spatial resolution is less than or equal to a second spatial resolution;
generating a set of downsampled images based on the second intensity image and a pyramid downsampling neural network;
generating two upsampled images of a second spatial resolution based on the set of downsampled images and a pyramid upsampling neural network that includes edge-aware upsampling filtering; and
An output Base Layer (BL) image of a first spatial resolution is generated by combining the two up-sampled images and the input intensity image in a final layer neural network.
EEE 2. The method as recited in EEE 1, wherein generating the second intensity image comprises processing the input image with a padding block and a subsequent 2×1 convolution block with bias = False and stride = 1.
The method as recited in EEE 1 or 2, wherein the set of downsampled images is generated by the pyramid downsampling neural network and forms a pyramid representation of the second intensity image.
EEE 4. The method as recited in EEE 3, wherein the pyramid downsampling neural network comprises two or more consecutive convolution blocks, wherein each convolution block is configured to generate a downsampled image of a respective layer of the pyramid representation.
EEE 5. The method of any one of EEEs 1 to 4, wherein the pyramid downsampling neural network comprises two or more consecutive 4×2×1 convolution blocks with bias = False and stride = 2.
EEE 6. The method of any one of EEEs 1 to 5, wherein the pyramid upsampling neural network comprises a plurality of processing layers, wherein, given an input layer image P(i) having an i-th spatial resolution, the i-th processing layer computes a(i) and b(i) values based on P(i), F(i), an edge filter neural network, and an upsampling filter neural network, wherein
F(i-1) = a(i) * P(i-1) + b(i),
and wherein the spatial resolutions of a(i) and b(i) are higher than the spatial resolution of P(i-1).
EEE 7. The method as recited in EEE 6, wherein the edge filter neural network (of the i-th processing layer) comprises:
input images F and P;
input weights PW[i,0] and PW[i,1];
four 3×3×1×1 convolution blocks with stride = 1 and outputs C1, C2, C3, and C4, where C1 represents a local average of F, C2 represents a local average of (F*P), C3 represents a local average of (P*P), and C4 represents a local average of P; and
generating outputs al(i) and bl(i), wherein generating the outputs al(i) and bl(i) comprises computing:
T1 = C2 - (C1*C4),
T2 = T1/((C3 - C4²) + PW[i,0]),
T3 = C1 - (T2*C4),
bl(i) = T3*PW[i,1],
al(i) = (T2*PW[i,1]) + (1 - PW[i,1]).
EEE 8. The method as recited in EEE 6 or EEE 7, wherein the upsampling filter neural network (of the i-th processing layer) comprises:
a filter input of m×n spatial resolution;
a first layer having two 3×1 convolution blocks, each convolution block processing the filter input and generating a first filter output and a second filter output;
a column splicer for interleaving the columns of the first filter output and the second filter output and generating a first-layer m×2n filter output;
a second layer having two 1×3×1 convolution blocks, each convolution block processing the first-layer m×2n filter output and generating a third filter output and a fourth filter output; and
a row splicer for interleaving the rows of the third filter output and the fourth filter output and generating an upsampled filter output of 2m×2n spatial resolution.
EEE 9. The method as in EEE 8, wherein, given that the filter input is al (i), the upsampling filter output is a (i), and given that the filter input is bl (i), the upsampling filter output is b (i).
EEE 10 the method of any one of EEE 1 to 9, wherein the final layer neural network computes a Base Layer (BL) image as
BL=a(1)*I+b(1),
Where I represents the input intensity image, and a (1) and b (1) represent the two up-sampled images generated by the pyramid up-sampling sub-network.
EEE 11 the method of any one of EEEs 1 to 10, further comprising calculating a detail layer image (DL) as
DL(x,y)=I(x,y)-BL(x,y)*dg,
where, for a pixel at position (x, y), I(x, y) denotes a pixel in the input intensity image, BL(x, y) denotes the corresponding pixel in the base layer image, and dg denotes a scaling variable in [0,1].
EEE 12. An apparatus comprising a processor and configured to perform any of the methods as described in EEEs 1 through 11.
EEE 13. A non-transitory computer-readable storage medium having stored thereon computer-executable instructions for performing the method according to any one of EEEs 1 to 11 with one or more processors.

Claims (12)

1. A method for precision rendering in a display map, the method comprising:
accessing an input image (202) of a first dynamic range and a first spatial resolution;
generating an input intensity image (I) based on the input image;
generating a second intensity image by sub-sampling the input intensity image until its spatial resolution is less than or equal to a second spatial resolution;
generating a set of downsampled images based on the second intensity image and a pyramid downsampling neural network;
generating two upsampled images of a second spatial resolution based on the set of downsampled images and a pyramid upsampling neural network that includes edge-aware upsampling filtering; and
An output Base Layer (BL) image of the first spatial resolution is generated by combining the two upsampled images and the input intensity image in a final layer neural network.
2. The method of claim 1, wherein generating the second intensity image comprises processing the input image with padding and a subsequent 2×1 convolution block with bias = False and stride = 1.
3. The method of claim 1 or 2, wherein the set of downsampled images is generated by the pyramid downsampling neural network and forms a pyramid representation of the second intensity image.
4. A method as claimed in claim 3, wherein the pyramid downsampling neural network comprises two or more successive convolution blocks, wherein each convolution block is configured to generate a downsampled image of a respective layer of the pyramid representation.
5. The method of any of claims 1 to 4, wherein the pyramid downsampling neural network comprises two or more consecutive 4×2×1 convolution blocks with bias = False and stride = 2.
6. The method of any one of claims 1 to 5, wherein the pyramid upsampling neural network comprises a plurality of processing layers, wherein, given an input layer image P(i) having an i-th spatial resolution, the i-th processing layer computes a(i) and b(i) values based on P(i), F(i), an edge filter neural network, and an upsampling filter neural network, wherein
F(i-1) = a(i) * P(i-1) + b(i),
and wherein the spatial resolutions of a(i) and b(i) are higher than the spatial resolution of P(i).
7. The method of claim 6, wherein the edge filter neural network comprises:
Input images F and P;
inputting weights PW [0] and PW [1];
four 3×3×1×1 convolution blocks with stride = 1 and outputs C1, C2, C3, and C4, where C1 represents a local average of F, C2 represents a local average of (F*P), C3 represents a local average of (P*P), and C4 represents a local average of P; and
generating outputs al and bl, wherein generating the outputs al and bl comprises computing:
T1 = C2 - (C1*C4),
T2 = T1/((C3 - C4²) + PW[0]),
T3 = C1 - (T2*C4),
bl = T3*PW[1],
al = (T2*PW[1]) + (1 - PW[1]).
8. The method of claim 6 or 7, wherein the upsampling filter neural network comprises:
a filter input of m×n spatial resolution;
a first layer having two 3×1 convolution blocks, each convolution block processing the filter input and generating a first filter output and a second filter output;
a column splicer for interleaving the columns of the first filter output and the second filter output and generating a first-layer m×2n filter output;
a second layer having two 1×3×1 convolution blocks, each convolution block processing the first-layer m×2n filter output and generating a third filter output and a fourth filter output; and
a row splicer for interleaving the rows of the third filter output and the fourth filter output and generating an upsampled filter output of 2m×2n spatial resolution,
Wherein, given that the filter input is al (i), the upsampling filter output is a (i), and given that the filter input is bl (i), the upsampling filter output is b (i).
9. The method of any one of claims 1 to 8, wherein the final layer neural network computes a base layer (BL) image as BL = a(1) * I + b(1),
Where I represents the input intensity image, and a (1) and b (1) represent the two up-sampled images generated by the pyramid up-sampling sub-network.
10. The method of any of claims 1 to 9, further comprising computing a detail layer image (DL) as DL(x, y) = I(x, y) - BL(x, y) * dg,
where, for a pixel at position (x, y), I(x, y) denotes a pixel in the input intensity image, BL(x, y) denotes the corresponding pixel in the base layer image, and dg denotes a scaling variable in [0,1].
11. An apparatus comprising a processor and configured to perform any of the methods of claims 1-10.
12. A non-transitory computer-readable storage medium having stored thereon computer-executable instructions for performing the method of any one of claims 1 to 10 with one or more processors.
CN202280064469.4A 2021-08-24 2022-08-23 Neural network for accurate rendering in display management Pending CN117980958A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163236476P 2021-08-24 2021-08-24
US63/236,476 2021-08-24
EP21206398.6 2021-11-04
PCT/US2022/041199 WO2023028046A1 (en) 2021-08-24 2022-08-23 Neural networks for precision rendering in display management

Publications (1)

Publication Number Publication Date
CN117980958A 2024-05-03

Family

ID=78528679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280064469.4A Pending CN117980958A (en) 2021-08-24 2022-08-23 Neural network for accurate rendering in display management

Country Status (1)

Country Link
CN (1) CN117980958A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination