EP4635192A1 - Generation of panoramic videos - Google Patents

Generation of panoramic videos

Info

Publication number
EP4635192A1
Authority
EP
European Patent Office
Prior art keywords
video
panoramic
pixels
canvas
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP24847243.3A
Other languages
English (en)
French (fr)
Inventor
Erika LU
Jingwei Ma
Brian CURLESS
Tali Dekel
Shiran Zada
Aleksander HOŁYŃSKI
Michael Rubinstein
Forrester Cole
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/698 Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046 Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/617 Upgrading or updating of programs or applications for camera control

Definitions

  • Panorama stitching includes sparse feature-based or semi-dense registration; rigid or depth-compensated alignment; and image blending to resolve seams and inconsistencies across observations.
  • Panoramic videos are more complicated than panoramic images, especially where the panoramic video includes dynamic objects, since scene motion may cause failures in both registration and compositing.
  • creating a panoramic video from images using conventional systems is not possible because as the camera captures images from a first location along an axis, other images are not captured from other locations along the axis, resulting in missing portions of a panoramic video.
  • the background description provided herein is for the purpose of generally presenting the context of the disclosure.
  • a computer-implemented method includes receiving an input panning video that includes image frames captured with a camera that pans at least once from a first side to a second side. The method further includes reprojecting the input panning video onto a panoramic canvas, wherein the panoramic canvas includes the image frames and missing regions.
  • the method further includes outputting, with a generative machine-learning model, a panoramic video that includes synthesized pixels for the missing regions of the panoramic canvas by: generating a base video at a coarse temporal scale that includes the synthesized pixels for the missing regions and refining the base video by performing temporal upsampling, merging the temporally upsampled base video with the panoramic canvas, and resynthesizing the synthesized pixels for the missing regions in the base video. (Attorney Docket No. LE-2749-01-WO)
  • the computer-implemented method is implemented as a media application on a computing device.
  • the temporal upsampling includes temporally upsampling the base video and performing spatial alignment and color alignment of the temporally upsampled base video with the panoramic canvas to obtain an aligned temporally upsampled base video.
  • the merging the temporally upsampled base video and the panoramic canvas includes merging the aligned temporally upsampled base video with the panoramic canvas.
  • the resynthesizing includes generating a mask of pixels for the missing regions, regenerating the synthesized pixels for the missing regions based on the mask, and merging the regenerated pixels with the synthesized pixels.
  • a quality of the panoramic video improves as a function of a number of times the image frames captured with the camera include panning from the first side to the second side and panning from the second side to the first side.
  • the input panning video includes one or more moving objects.
  • the method further includes responsive to outputting the panoramic video, applying a super-resolution pass to the panoramic video to increase resolution of the panoramic video.
  • the method further includes selecting a panoramic image from the panoramic video.
  • the operations include receiving an input panning video that includes image frames captured with a camera that pans at least once from a first side to a second side; reprojecting the input panning video onto a panoramic canvas, wherein the panoramic canvas includes the image frames and missing regions; and outputting, with a generative machine-learning model, a panoramic video that includes synthesized pixels for the missing regions of the panoramic canvas by: generating a base video at a coarse temporal scale that includes the pixels for the missing regions and refining the base video by performing temporal upsampling, merging the temporally upsampled base video with the panoramic canvas, and resynthesizing the synthesized pixels for the missing regions in the base video.
  • the temporal upsampling includes temporally upsampling the base video and performing spatial alignment and color alignment of the temporally upsampled base video with the panoramic canvas to obtain an aligned temporally upsampled base video.
  • the merging the temporally upsampled base video and the panoramic canvas includes merging the aligned temporally upsampled base video with the panoramic canvas.
  • the resynthesizing includes generating a mask of pixels for the missing regions, regenerating the synthesized pixels for the missing regions based on the mask, and merging the regenerated pixels with the synthesized pixels.
  • the input panning video includes one or more moving objects.
  • a system comprises one or more processors and a memory coupled to the one or more processors, with instructions stored thereon that, when executed by the processor, cause the processor to perform operations.
  • the operations include receiving an input panning video that includes image frames captured with a camera that pans at least once from a first side to a second side; reprojecting the input panning video onto a panoramic canvas, wherein the panoramic canvas includes the image frames and missing regions; and outputting, with a generative machine-learning model, a panoramic video that includes synthesized pixels for the missing regions of the panoramic canvas by: generating a base video at a coarse temporal scale that includes the pixels for the missing regions and refining the base video by performing temporal upsampling, merging the temporally upsampled base video with the panoramic canvas, and resynthesizing the synthesized pixels for the missing regions in the base video.
  • the temporal upsampling includes temporally upsampling the base video and performing spatial alignment and color alignment of the temporally upsampled base video with the panoramic canvas to obtain an aligned temporally upsampled base video.
  • the merging the temporally upsampled base video and the panoramic canvas includes merging the aligned temporally upsampled base video with the panoramic canvas.
  • the resynthesizing includes generating a mask of pixels for the missing regions, regenerating the synthesized pixels for the missing regions based on the mask, and merging the regenerated pixels with the synthesized pixels.
  • the input panning video includes one or more moving objects.
  • the operations further comprise responsive to outputting the panoramic video, applying a super-resolution pass to the panoramic video to increase resolution of the panoramic video. In some embodiments, the operations further comprise selecting a panoramic image from the panoramic video.
  • Figure 1 is a block diagram of an example network environment, according to some embodiments described herein.
  • Figure 2 is a block diagram of an example computing device, according to some embodiments described herein.
  • Figure 3 illustrates input images from an input panning video, input images projected onto a panoramic canvas, and output images from the resulting output panoramic video, according to some embodiments described herein.
  • Figure 4 illustrates an example token-based model that outputs a panoramic video, according to some embodiments described herein.
  • Figure 5 illustrates an example diffusion model that is trained to output a panoramic video, according to some embodiments described herein.
  • Figures 6A-D illustrate an example process for generating a panoramic video, according to some embodiments described herein.
  • Figure 7 illustrates an example of how a diffusion model performs upsampling and outpainting to generate a panoramic video, according to some embodiments described herein.
  • Figure 8 illustrates an example comparison of how probabilities during outpainting use a token-based model as compared to a diffusion model, according to some embodiments described herein.
  • FIG. 9 illustrates a flowchart of an example method to generate a panoramic video, according to some embodiments described herein.
  • DETAILED DESCRIPTION: Overview
  • the resulting panoramic video may have more synthesized pixels than original pixels. This exacerbates the need for image generation that properly tracks movement of objects in the panoramic video.
  • a panoramic video of people kayaking on a river not only needs image generation for still objects, such as buildings along the horizon, but also tracking and alignment of movement, such as the movement of paddles in the water.
  • the technology described herein advantageously generates a panoramic video by reprojecting an input panning video onto a panoramic canvas, where the panoramic canvas includes the image frames from the input panning video and missing regions.
  • a generative machine-learning model outputs a panoramic video that includes synthesized pixels for the missing regions of the panoramic canvas.
  • the generative machine-learning model may generate a base video at a coarse temporal scale that includes the synthesized pixels for the missing regions.
  • FIG. 2 is a block diagram of an example computing device 200 that may be used to implement one or more features described herein.
  • Computing device 200 can be any suitable computer system, server, or other electronic or hardware device.
  • the computing device 200 is the media server 101.
  • the computing device 200 is a user device 115.
  • computing device 200 includes a processor 235, a memory 237, an input/output (I/O) interface 239, a display 241, a camera 243, and a storage device 245, all coupled via a bus 218.
  • the processor 235 may be coupled to the bus 218 via signal line 222.
  • the memory 237 may be coupled to the bus 218 via signal line 224.
  • the I/O interface 239 may be coupled to the bus 218 via signal line 226.
  • the display 241 may be coupled to the bus 218 via signal line 228.
  • the camera 243 may be coupled to the bus 218 via signal line 230.
  • the storage device 245 may be coupled to the bus 218 via signal line 232.
  • Processor 235 can be one or more processors and/or processing circuits to execute program code and control basic operations of the computing device 200.
  • a “processor” includes any suitable hardware system, mechanism or component that processes data, signals or other information.
  • a processor may include a system with a general-purpose central processing unit (CPU) with one or more cores (e.g., in a single-core, dual-core, or multi-core configuration), multiple processing units (e.g., in a multiprocessor configuration), a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), or other systems.
  • processor 235 may include one or more co-processors that implement neural-network processing.
  • processor 235 may be a processor that processes data to produce probabilistic output, e.g., the output produced by processor 235 may be imprecise or may be accurate within a range from an expected output. Processing need not be limited to a particular geographic location or have temporal limitations.
  • a processor may perform its functions in real-time, offline, in a batch mode, etc. Portions of processing may be performed at different times and at different locations, by different (or the same) processing systems.
  • a computer may be any processor in communication with a memory.
  • Memory 237 is typically provided in computing device 200 for access by the processor 235, and may be any suitable processor-readable storage medium, such as random access memory (RAM), read-only memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Flash memory, etc., suitable for storing instructions for execution by the processor or sets of processors, and located separate from processor 235 and/or integrated therewith.
  • Memory 237 can store software operating on the computing device 200 by the processor 235, including a media application 103.
  • the memory 237 may include an operating system 262, other applications 264, and application data 266.
  • Other applications 264 can include, e.g., an image library application, an image management application, an image gallery application, communication applications, web hosting engines or applications, media sharing applications, etc.
  • One or more methods disclosed herein can operate in several environments and platforms, e.g., as a stand-alone computer program that can run on any type of computing device, as a web application having web pages, as a mobile application ("app") run on a mobile computing device, etc.
  • the application data 266 may be data generated by the other applications 264 or hardware of the computing device 200.
  • I/O interface 239 can provide functions to enable interfacing the computing device 200 with other systems and devices. Interfaced devices can be included as part of the computing device 200 or can be separate and communicate with the computing device 200. For example, network communication devices, storage devices (e.g., memory 237 and/or storage device 245), and input/output devices can communicate via I/O interface 239.
  • the I/O interface 239 can connect to interface devices such as input devices (keyboard, pointing device, touchscreen, microphone, scanner, sensors, etc.) and/or output devices (display devices, speaker devices, printers, monitors, etc.).
  • Some examples of interfaced devices that can connect to I/O interface 239 can include a display 241 that can be used to display content, e.g., images, video, and/or a user interface of an output application as described herein, and to receive touch (or gesture) input from a user.
  • display 241 may be utilized to display a user interface.
  • Display 241 can include any suitable display device such as a liquid crystal display (LCD), light emitting diode (LED), or plasma display screen, cathode ray tube (CRT), television, monitor, touchscreen, three-dimensional display screen, or other visual display device.
  • display 241 can be a flat display screen provided on a mobile device, multiple display screens embedded in a glasses form factor or headset device, or a monitor screen for a computer device.
  • Camera 243 may be any type of image capture device that can capture images and/or video. In some embodiments, the camera 243 captures images or video that the I/O interface 239 transmits to the media application 103.
  • the storage device 245 stores data related to the media application 103.
  • the storage device 245 may store a training data set that includes training data, such as a plurality of training images, a generative machine-learning model, an outpainter model, etc.
  • the media application 103 receives an input panning video that includes image frames captured with the camera 243 that pans at least once from a first side to a second side.
  • the camera 243 captures multiple pans in one capture, such as a pan from the first side to the second side and then back to the first side again.
  • a user may take a user device 115 and move it from side-to-side multiple times to capture the input video.
  • the resolution of the panoramic video is improved by multiple pans to improve tracking of movement in different spatial locations.
  • the resolution of a panoramic video is improved by the number of image frames that are captured at different angles because the increased number of image frames reduces the number of synthesized pixels that are generated for the missing regions in the panoramic video.
  • the input panning video includes one or more moving objects, such as moving people (e.g., a person kayaking on a river, a person scuba diving, people walking around a landmark, people skating), cars, trees, water, etc.
  • the media application 103 reprojects the input panning video onto a panoramic canvas to produce input frames ($x_0$) and corresponding mask frames of valid pixels ($m_0$) in a panorama coordinate system.
  • the mask may be used to identify the valid pixels that are kept and the missing regions where synthesized pixels are to be generated.
  • the panoramic canvas is a canvas with the dimensions of a panorama that includes the image frames and missing regions (of the scene that the camera didn’t capture in the image frame captured at that instant).
  • the panoramic canvas includes both the image frames from the input panning video that are positioned based on their coordinate location within the panoramic canvas and missing regions that may be illustrated with blank pixels, black pixels, or no data.
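The canvas-and-mask idea above can be illustrated with a minimal sketch. It assumes the reprojection has already reduced to a simple integer offset, which is a simplification; the function name `place_on_canvas` and its parameters are hypothetical and do not come from the source.

```python
def place_on_canvas(frame, canvas_h, canvas_w, top, left, fill=0):
    """Paste `frame` (a list of pixel rows) onto an empty canvas.

    Returns (canvas, mask): mask is 1 where a captured pixel landed and
    0 in the missing regions that would later receive synthesized pixels.
    """
    canvas = [[fill] * canvas_w for _ in range(canvas_h)]
    mask = [[0] * canvas_w for _ in range(canvas_h)]
    for r, row in enumerate(frame):
        for c, value in enumerate(row):
            y, x = top + r, left + c
            if 0 <= y < canvas_h and 0 <= x < canvas_w:
                canvas[y][x] = value
                mask[y][x] = 1
    return canvas, mask


# A 2x2 frame placed two columns into a 2x5 canvas.
canvas, mask = place_on_canvas([[5, 6], [7, 8]], canvas_h=2, canvas_w=5, top=0, left=2)
```

Here the zeros in `mask` mark the missing regions; the fill value only stands in for "blank pixels, black pixels, or no data".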
  • Figure 3 illustrates three input images 300 from an input panning video, three input images 300 projected onto a panoramic canvas 310, and three output images 320 from an output panoramic video.
  • the input images 300 depict people kayaking on a river.
  • a first input image 300a includes a person 301 kayaking in the foreground, a person 302 kayaking in the background, and a skyline 303.
  • a second input image 300b includes the first person 301 at a later time in the input panning video where the paddle 304 is at a different position compared to the first input image 300a and where a kayak 305 is also visible.
  • a third input image 300c includes the second person 302 at a different position as compared to the first input image 300a.
  • the panoramic canvases 310 include respective input images 300 reprojected onto the panoramic canvases 310.
  • the media application 103 uses a homography solver to perform the reprojection of the input panning videos with rotation and no translation or scaling.
  • the media application 103 may use Simultaneous Localization And Mapping (SLAM) to perform the reprojection.
  • the media application 103 ignores the translation of the camera 243 and computes an elevation and azimuth of each input pixel’s ray relative to a corresponding camera 243 direction and then projects each ray to an equirectangular canvas.
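As a rough illustration of the azimuth/elevation projection described above, the sketch below maps one pixel to equirectangular canvas coordinates. It assumes a pinhole camera with focal length `f` in pixels, a principal point `(cx, cy)`, and a camera that only rotates about the vertical axis; none of these details are specified in the source, and the function name is hypothetical.

```python
import math


def pixel_to_equirect(u, v, f, cx, cy, cam_azimuth, canvas_w, canvas_h):
    """Map pixel (u, v) to (col, row) on an equirectangular canvas.

    Assumes a pure-yaw camera: the ray's azimuth relative to the optical
    axis is simply added to the camera azimuth (radians).
    """
    dx, dy = u - cx, v - cy
    azimuth = cam_azimuth + math.atan2(dx, f)
    elevation = math.atan2(-dy, math.hypot(dx, f))  # image y grows downward
    col = (azimuth / (2 * math.pi) + 0.5) * canvas_w
    row = (0.5 - elevation / math.pi) * canvas_h
    return col % canvas_w, row


# The image centre of a forward-facing camera lands in the canvas centre.
centre = pixel_to_equirect(50, 50, f=100, cx=50, cy=50,
                           cam_azimuth=0.0, canvas_w=360, canvas_h=180)
```

Camera translation is ignored entirely, matching the description; handling pitch or roll would require a full rotation of the ray before the angles are computed.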
  • the output images 320 include synthesized pixels for the missing regions of the panoramic canvases 310.
  • the second output image 320b includes the second person 302, which was not visible in the second input image 300b.
  • the media application 103 temporally downsamples the input panning video to a context window length used by a generative machine-learning model that is part of the media application 103.
  • the media application 103 prepares temporally-downsampled versions of the input frames of the input panning video ($\{x_i\}_{i=1}^{N}$) such that the coarsest input ($x_N$) fits exactly or approximately in a context window used by the generative machine-learning model (e.g., 80 frames, 11 frames, etc.).
  • the media application 103 applies temporal prefiltering with a box blur before subsampling from $x_0$ to avoid temporal aliasing.
  • the input panning video is wider than the generative machine-learning model’s native aspect ratio.
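The prefilter-then-subsample step can be sketched as follows. Frames are represented as flat lists of pixel values, and the blur radius is tied to the subsampling factor; both choices are assumptions made for illustration, as are the function names.

```python
def box_blur_time(frames, radius):
    """Average each frame with its temporal neighbours (box filter)."""
    n = len(frames)
    blurred = []
    for t in range(n):
        lo, hi = max(0, t - radius), min(n, t + radius + 1)
        window = frames[lo:hi]
        # zip(*window) groups the same pixel across the frames in the window.
        blurred.append([sum(px) / len(window) for px in zip(*window)])
    return blurred


def temporal_downsample(frames, factor):
    """Blur along time, then keep every `factor`-th frame."""
    blurred = box_blur_time(frames, radius=factor // 2)
    return blurred[::factor]


# Four single-pixel frames reduced to two.
coarse = temporal_downsample([[0.0], [2.0], [4.0], [6.0]], factor=2)
```

Blurring before dropping frames keeps fast motion from aliasing into the coarse clip; repeating the call would produce progressively coarser temporal levels.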
  • the media application 103 may downscale the input panning video ($x_N$) to match the generative machine-learning model’s native height.
  • Generative Machine-Learning Model
  • the generative machine-learning model generates a base video at a coarse temporal scale that includes synthesized pixels for the missing regions.
  • the generative machine-learning model outputs multiple overlapping spatial windows to span the panorama width.
  • the generative machine-learning model may average the distributions predicted by the generative machine-learning model in each window and output base image frames for the base video based on the average.
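A minimal sketch of averaging predictions across overlapping spatial windows follows. It assumes each window reports one probability distribution per canvas column, a one-dimensional simplification; the data layout and the name `average_window_predictions` are hypothetical.

```python
def average_window_predictions(windows, canvas_w, num_classes):
    """Average per-column distributions over all windows covering a column.

    `windows` is a list of (start_col, dists), where dists[i] is the
    distribution predicted for canvas column start_col + i.
    """
    sums = [[0.0] * num_classes for _ in range(canvas_w)]
    counts = [0] * canvas_w
    for start, dists in windows:
        for offset, dist in enumerate(dists):
            col = start + offset
            counts[col] += 1
            for k, p in enumerate(dist):
                sums[col][k] += p
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(canvas_w)
    ]


# Two windows of width 2 that overlap on column 1.
averaged = average_window_predictions(
    [(0, [[1.0, 0.0], [0.5, 0.5]]), (1, [[0.0, 1.0], [0.2, 0.8]])],
    canvas_w=3, num_classes=2,
)
```

Columns covered by a single window keep that window's prediction, and overlapped columns receive the mean, which smooths the seams between windows.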
  • the generative machine-learning model refines the base video using temporal upsampling, merging of the base video with the input pixels from the input panning video, and resynthesizing the synthesized pixels for the missing regions.
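Two of the refinement steps named above, temporal upsampling and merging with the input pixels, can be sketched as below. The sketch assumes frame repetition as the upsampling method and skips the spatial and color alignment as well as the resynthesis; frames are flat pixel lists and all names are hypothetical.

```python
def upsample_time(frames, factor):
    """Temporal upsampling by repeating each frame `factor` times (an assumption)."""
    return [list(frame) for frame in frames for _ in range(factor)]


def merge_with_canvas(base, canvas, mask):
    """Keep captured pixels where mask is 1, base-video pixels elsewhere."""
    return [
        [c if m else b for b, c, m in zip(base_f, canvas_f, mask_f)]
        for base_f, canvas_f, mask_f in zip(base, canvas, mask)
    ]


# One coarse frame becomes two fine frames, then real pixels are restored.
fine = upsample_time([[9, 9]], factor=2)
merged = merge_with_canvas(fine, canvas=[[1, 0], [0, 2]], mask=[[1, 0], [0, 1]])
```

After the merge, only the positions where the mask is 0 still hold synthesized values; those are the pixels a resynthesis step would regenerate.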
  • the media application 103 uses training data to train the generative machine-learning model to output a panoramic video from input data, such as application data 266 (e.g., an input panning video captured by the user device 115).
  • Training data may be obtained from any source, e.g., a data repository specifically marked for training, data for which permission is provided for use as training data for machine learning, etc.
  • the training may occur on the media server 101 that provides the training data directly to the user device 115, locally on the user device 115, or as a combination of both.
  • the trained machine-learning model may include one or more model forms or structures.
  • model forms or structures can include any type of neural network, such as a linear network, a deep-learning neural network that implements a plurality of layers (e.g., “hidden layers” between an input layer and an output layer, with each layer being a linear network), a convolutional neural network (e.g., a network that splits or partitions input data into multiple parts or tiles, processes each tile separately using one or more neural-network layers, and aggregates the results from the processing of each tile), a sequence-to-sequence neural network (e.g., a network that receives as input sequential data, such as words in a sentence, frames in a video, etc. and produces as output a result sequence), etc.
  • the model form or structure may specify connectivity between various nodes and organization of nodes into layers.
  • nodes of a first layer (e.g., an input layer) may receive data as input.
  • such data can include, for example, one or more pixels per node, e.g., when the trained model is used for analysis of an image.
  • Subsequent intermediate layers may receive as input, output of nodes of a previous layer per the connectivity specified in the model form or structure. These layers may also be referred to as hidden layers.
  • a final layer (e.g., output layer) produces an output of the generative machine-learning model.
  • the model form or structure also specifies a number and/or type of nodes in each layer.
  • the trained generative machine-learning model can include one or more models.
  • One or more of the models may include a plurality of nodes, arranged into layers per the model structure or form.
  • the nodes may be computational nodes with no memory, e.g., configured to process one unit of input to produce one unit of output. Computation performed by a node may include, for example, multiplying each of a plurality of node inputs by a weight, obtaining a weighted sum, and adjusting the weighted sum with a bias or intercept value to produce the node output.
  • the computation performed by a node may also include applying a step/activation function to the adjusted weighted sum.
  • the step/activation function may be a nonlinear function.
  • such computation may include operations such as matrix multiplication.
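The node computation described above (weighted sum, bias, activation) can be written out directly. This is a generic sketch, not the model in the source; `tanh` is an arbitrary choice of nonlinear activation and the function names are hypothetical.

```python
import math


def node_output(inputs, weights, bias, activation=math.tanh):
    """Multiply inputs by weights, sum, add the bias, apply the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum + bias)


def layer_output(inputs, weight_rows, biases, activation=math.tanh):
    """A layer is one node per weight row: a matrix-vector product plus biases."""
    return [
        node_output(inputs, row, b, activation)
        for row, b in zip(weight_rows, biases)
    ]


# A ReLU node: max(0, 1*1 + 2*1 - 1).
relu = lambda z: max(0.0, z)
value = node_output([1.0, 2.0], [1.0, 1.0], -1.0, activation=relu)
```

Evaluating every node of a layer at once is the matrix multiplication mentioned above, which is why such computations parallelize well.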
  • computations by the plurality of nodes may be performed in parallel, e.g., using multiple processor cores of a multicore processor, using individual processing units of a graphics processing unit (GPU), or special-purpose neural circuitry.
  • nodes may include memory, e.g., may be able to store and use one or more earlier inputs in processing a subsequent input. For example, nodes with memory may include long short-term memory (LSTM) nodes.
  • the trained model may include embeddings or weights for individual nodes.
  • a model may be initiated as a plurality of nodes organized into layers as specified by the model form or structure.
  • a respective weight may be applied to a connection between each pair of nodes that are connected per the model form, e.g., nodes in successive layers of the neural network.
  • the respective weights may be randomly assigned, or initialized to default values.
  • the model may then be trained, e.g., using training data, to produce a result.
  • Training may include applying supervised learning techniques.
  • the training data can include a plurality of inputs (e.g., a plurality of input panning videos) and a corresponding groundtruth output for each input (e.g., a groundtruth panoramic video for each input panning video).
  • based on a comparison of the output of the model (e.g., a panoramic video) with the groundtruth output (e.g., the groundtruth panoramic video), values of the weights are automatically adjusted, e.g., in a manner that increases a probability that the model produces the groundtruth panoramic video.
  • a trained model includes a set of weights, or embeddings, corresponding to the model structure.
  • the trained generative machine-learning model may include an initial set of weights, e.g. downloaded from a server that provides the weights.
  • a trained generative machine-learning model includes a set of weights, or embeddings, corresponding to the model structure.
  • the media application 103 may generate a trained generative machine-learning model that is based on prior training, e.g., by a developer of the media application 103, by a third-party, etc.
  • the training of the generative machine-learning model may include, for each input panning video, obtaining a panoramic video based on the input panning video.
  • the generative machine-learning model may calculate a loss value based on a comparison of the panoramic video and the groundtruth panoramic video (included in the training data).
  • the generative machine-learning model may compute an optical flow EndPoint Error (EPE) to measure the consistency of the generated motion. The flow between consecutive frames may be computed for both the groundtruth and the output video, and the error may be calculated as an L2 difference between the two flows.
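For reference, an endpoint error over two flow fields can be computed as in the sketch below. It assumes the flows are already available as lists of per-pixel `(u, v)` vectors; how the flows are estimated is outside the sketch, and the function name is hypothetical.

```python
import math


def endpoint_error(flow_a, flow_b):
    """Mean L2 distance between corresponding flow vectors."""
    errors = [
        math.hypot(ua - ub, va - vb)
        for (ua, va), (ub, vb) in zip(flow_a, flow_b)
    ]
    return sum(errors) / len(errors)


# One pixel differs by (3, 4), the other matches exactly.
epe = endpoint_error([(1.0, 0.0), (0.0, 0.0)], [(4.0, 4.0), (0.0, 0.0)])
```

Restricting the two lists to pixels inside a static or dynamic region would give the kind of per-region evaluation the description mentions for pixel-level metrics.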
  • small motions at low image resolution may be evaluated based on a grid-based flow to produce more reliable sub-pixel alignments than a network-based flow.
  • pixel-level metrics may be evaluated separately for static and dynamic regions.
  • the generative machine-learning model may update a weight of one or more nodes of the convolutional neural network based on the loss value (e.g., in a way that, after adjustment and running another cycle of the training, the loss value is reduced, until the loss value is below a threshold).
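The update-until-threshold loop can be illustrated on a toy problem. The sketch below fits a single scalar weight with gradient descent on a mean squared error; the loss, learning rate, and threshold are illustrative assumptions and are not taken from the source.

```python
def train_scalar_weight(inputs, targets, lr=0.05, threshold=1e-6, max_steps=1000):
    """Adjust one weight until the loss falls below `threshold`."""
    weight = 0.0
    loss = float("inf")
    for _ in range(max_steps):
        errors = [weight * x - y for x, y in zip(inputs, targets)]
        loss = sum(e * e for e in errors) / len(errors)
        if loss < threshold:
            break  # stop once the loss is below the threshold
        # Gradient of the mean squared error with respect to the weight.
        grad = sum(2 * e * x for e, x in zip(errors, inputs)) / len(errors)
        weight -= lr * grad
    return weight, loss


# The targets are exactly twice the inputs, so the weight should approach 2.
weight, loss = train_scalar_weight([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

A real model repeats the same cycle over many weights at once, with the gradient obtained by backpropagation rather than a closed-form expression.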
  • the generative machine-learning model is a token-based model.
  • Figure 4 illustrates an example token-based model 400 that outputs a panoramic video, according to some embodiments described herein.
  • the token-based model 400 includes an encoder 410, a bidirectional transformer 420, and a decoder 430.
  • the encoder 410 receives a masked video 405 (e.g., the reprojected input panning video) that is downsampled (e.g., to 11 frames per second at 160x96 pixels) and compresses the masked video 405 into masked tokens 415.
  • the mask identifies the input pixels; synthesized pixels are generated for the regions outside of the mask.
  • the masked tokens 415 represent discrete embeddings that describe the content of the masked video 405.
  • the attention layers of the encoder 410 are masked such that later frames attend to earlier frames in the window, but earlier frames are blocked from attending to later frames.
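The frame-level masking described above corresponds to a causal (lower-triangular) attention mask. The sketch below builds such a mask and applies it in a softmax over one query's scores; treating each frame as a single attention position is a simplification, and the function names are hypothetical.

```python
import math


def causal_frame_mask(num_frames):
    """mask[q][k] is True when query frame q may attend to key frame k."""
    return [[k <= q for k in range(num_frames)] for q in range(num_frames)]


def masked_softmax(scores, allowed):
    """Softmax over `scores`, giving zero weight to blocked positions."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Frame 1 may attend to frames 0 and 1, but not to the later frame 2.
mask = causal_frame_mask(3)
weights = masked_softmax([0.0, 0.0, 5.0], mask[1])
```

The high score for frame 2 is ignored because that frame comes later in the window, so the attention weight is split between frames 0 and 1.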
  • the bidirectional transformer 420 iteratively translates the masked tokens 415 into video tokens 425.
  • the bidirectional transformer 420 performs forward passes and backward passes over the coarsest level clip ($x_N$) to reduce or eliminate artifacts in the panoramic video.
  • the bidirectional transformer 420 may take a result of a forward pass in regions that were previously seen by the bidirectional transformer 420 and combine the result with a backward pass in the rest of the regions to obtain the video tokens 425.
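One possible reading of the forward/backward combination is sketched below. It assumes "previously seen" means that a canvas position was covered by a valid input pixel in the current or an earlier frame; that interpretation, the one-dimensional layout, and all names are assumptions rather than details from the source.

```python
def seen_so_far(masks):
    """Cumulative union of per-frame validity masks over time."""
    seen = [0] * len(masks[0])
    history = []
    for mask in masks:
        seen = [1 if (s or m) else 0 for s, m in zip(seen, mask)]
        history.append(list(seen))
    return history


def combine_passes(forward, backward, masks):
    """Use forward-pass results where a region was already seen, else backward."""
    return [
        [f if s else b for f, b, s in zip(fwd_f, bwd_f, seen_f)]
        for fwd_f, bwd_f, seen_f in zip(forward, backward, seen_so_far(masks))
    ]


# Two frames, three positions; the camera sees position 0, then position 1.
combined = combine_passes(
    forward=[["f", "f", "f"], ["f", "f", "f"]],
    backward=[["b", "b", "b"], ["b", "b", "b"]],
    masks=[[1, 0, 0], [0, 1, 0]],
)
```

Position 2 is never observed up to these frames, so it always takes the backward-pass result, which can draw on what the camera sees later in the clip.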
  • the video tokens 425 are decoded by the decoder 430 to output a panoramic video 435 that is similarly downsampled (e.g., 11 frames per second at 160x96 pixels).
  • the panoramic video 435 is upsampled (e.g., 320x192 pixels).
  • the generative machine-learning model is a diffusion model, such as a space-time, pixel diffusion model.
  • the diffusion model is trained to generate panoramic videos by progressively adding noise to panoramic videos (noising) and then training the diffusion model to perform a denoising process to recover the original panoramic videos from the noise.
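The noising half of this training scheme is often written in the closed form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. The sketch below applies that form to a flat list of pixel values; the source does not state which noise schedule or parameterization is used, so this DDPM-style formula and the name `add_noise` are assumptions.

```python
import math
import random


def add_noise(x0, alpha_bar, noise):
    """Blend clean values with Gaussian noise; alpha_bar=1 keeps x0 unchanged."""
    keep = math.sqrt(alpha_bar)
    mix = math.sqrt(1.0 - alpha_bar)
    return [keep * x + mix * n for x, n in zip(x0, noise)]


# Sample noise once, then noise the same clean values at two noise levels.
rng = random.Random(0)
clean = [1.0, 2.0]
noise = [rng.gauss(0.0, 1.0) for _ in clean]
slightly_noisy = add_noise(clean, alpha_bar=0.9, noise=noise)
mostly_noise = add_noise(clean, alpha_bar=0.1, noise=noise)
```

During training, the model would be given the noisy values and asked to recover the clean ones (or, equivalently, the noise), which is the denoising process described above.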
  • the media application 103 trains a diffusion model to receive a reprojected input panning video as input and output a panoramic video that includes synthesized pixels for the missing regions of the panoramic canvas.
  • the media application 103 trains the diffusion model on an image outpainting task where the training data includes image pairs of a ground truth image and a corresponding image with random pixels or groups of pixels that are removed.
  • the diffusion model is trained to receive an incomplete input image and output an image with synthesized pixels that replace any missing regions.
  • the diffusion model is fine-tuned by using training videos where a fixed mask is replaced with a panning mask to simulate the situation where a user captures an input panning video by moving a camera from side to side.
  • the media application 103 performs fine-tuning of the trained diffusion model using video pairs of natural videos (ground truth videos) paired with masked videos with synthetic panoramic masks that are designed to mimic real masks.
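A synthetic panning mask of the kind used for fine-tuning might be generated as in the sketch below. It is a simplified, hypothetical version: the mask is one-dimensional (columns only), the visible window moves linearly back and forth across the canvas, and the function and parameter names are invented for the example.

```python
def panning_mask(num_frames: int, canvas_width: int, view_width: int,
                 sweeps: int = 1) -> list[list[int]]:
    """Return mask[t][x]: 1 where column x is observed at frame t, 0 where it is missing.

    The visible window slides from the left edge to the right edge once per
    sweep, reversing direction on every subsequent sweep.
    """
    max_left = canvas_width - view_width
    masks = []
    for t in range(num_frames):
        # phase runs from 0 to `sweeps` over the clip.
        phase = sweeps * t / max(num_frames - 1, 1)
        frac = phase % 2.0
        # Triangle wave in [0, 1]: rises on odd sweeps, falls on even sweeps.
        tri = frac if frac <= 1.0 else 2.0 - frac
        left = round(tri * max_left)
        masks.append([1 if left <= x < left + view_width else 0
                      for x in range(canvas_width)])
    return masks


one_pan = panning_mask(num_frames=5, canvas_width=10, view_width=4)
there_and_back = panning_mask(num_frames=5, canvas_width=10, view_width=4, sweeps=2)
```

Multiplying a ground truth video by such a mask yields a masked training input whose missing regions resemble those of a reprojected panning video.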
  • Figure 5 illustrates an example diffusion model 500 that is trained to output a panoramic video, according to some embodiments described herein.
  • the diffusion model 500 receives noise 505 and a masked video 510 that are downsampled (e.g., 80 frames that are 128x128 pixels) and outputs a panoramic video 520 that is similarly downsampled.
  • the panoramic video 520 may be a base layer that is upsampled to 1024x1024 pixels.
  • Figures 6A-D illustrate an example process for generating a panoramic video. Although the examples mention particular numbers of frames, the numbers of frames are used as examples and other numbers of frames may also be used.
  • Figure 6A includes a process 600 with an input panning video 605.
  • a panoramic projection 610 projects the input panning video 605 on a panoramic canvas (starting with the temporally finest input frames x_0).
  • Spatial downsampling 615 is performed to form a downsampled reprojected input video 620a (ending with the temporally coarsest input frames x_K).
  • the downsampled reprojected input video 620a may be progressively downsampled to match the native height of the generative machine-learning model 625.
  • each level of reprojected input video 620 may be associated with a 2X downsampling. Although four levels are illustrated, any number of levels (k) may be used, such as from two to five levels or more than five levels.
  • the input video at temporal scale k (x_k) has a corresponding mask at temporal scale k (m_k).
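The multi-level structure can be sketched as a simple pyramid. The example reads the 2X factor as a temporal one (keeping every second frame), which is an assumption; the text does not say how frames are dropped or filtered. The function name `temporal_pyramid` is hypothetical, and the same routine would be applied to the per-frame masks.

```python
def temporal_pyramid(frames: list, num_levels: int) -> list[list]:
    """Return [level_0, ..., level_K], where level_0 is the finest temporal scale.

    Each coarser level keeps every second frame of the level below it.
    """
    levels = [list(frames)]
    for _ in range(num_levels - 1):
        levels.append(levels[-1][::2])
    return levels


# Hypothetical clip of 16 frames (frame indices stand in for image data).
pyramid = temporal_pyramid(list(range(16)), num_levels=4)
frame_counts = [len(level) for level in pyramid]
```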
  • Two different sliding spatial windows 621, 622 are used to provide portions of the base reprojected input video 620a to the generative machine-learning model 625.
  • the sliding spatial windows 621, 622 may be associated with varying dimensions depending on the type of generative machine-learning model 625 that is used to generate the base video 630a.
  • the sliding spatial windows 621, 622 advantageously enable the process 600 to be used with input videos 605 of varying panorama widths.
  • the generative machine-learning model 625 generates the base video 630a by performing spatial outpainting to generate synthesized pixels for the missing regions.
  • the model predictions in the sliding spatial windows 631, 632 are averaged and a new sample is drawn from the average.
  • where the generative machine-learning model 625 is a diffusion model, the projected panoramic canvas is cropped to each input window and missing regions outside the image frames are outpainted.
  • the Gaussian distribution over pixel values (represented by μ and σ) may be averaged, and a new sample may be drawn using diffusion methods, such as a Denoising Diffusion Probabilistic Model (DDPM).
  • spatial aggregation may be performed by averaging the predicted probability distributions over the tokens before sampling.
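The two aggregation strategies for overlapping spatial windows can be sketched as follows. Both functions are illustrative assumptions: the token case averages the two per-token probability vectors and renormalizes before sampling, and the diffusion case takes the arithmetic mean of the Gaussian parameters, which is only one possible reading of "averaged".

```python
import random


def average_token_probs(p_left: list[float], p_right: list[float]) -> list[float]:
    """Average two probability vectors over the same token vocabulary and renormalize."""
    avg = [(a + b) / 2.0 for a, b in zip(p_left, p_right)]
    total = sum(avg)
    return [p / total for p in avg]


def sample_token(probs: list[float], rng: random.Random) -> int:
    """Draw one token index from the aggregated distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def average_gaussians(mu_left: float, sigma_left: float,
                      mu_right: float, sigma_right: float) -> tuple[float, float]:
    """Average the per-pixel Gaussian parameters predicted by two windows."""
    return (mu_left + mu_right) / 2.0, (sigma_left + sigma_right) / 2.0


def sample_pixel(mu: float, sigma: float, rng: random.Random) -> float:
    """Draw a new pixel value from the aggregated Gaussian."""
    return rng.gauss(mu, sigma)
```

Aggregating before sampling means the overlapping region receives a single sample that both windows agree on, rather than two independently drawn samples that would have to be blended afterward.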
  • the generative machine-learning model 625 progressively restores the temporal dimensions of the input video 605 by progressively completing panoramic videos at upsampled temporal resolutions. For each level k ∈ [K − 1, 0], the generative machine-learning model 625 combines a reprojected input video 620 (x_k) with a completed coarser-level panoramic video (y_{k+1}) to produce a completed low resolution panoramic video 630 (y_k).
  • The generative machine-learning model 625 obtains consistent temporal details by performing temporal upsampling, merging with the reprojected input video 620, and resynthesizing pixels outside of the input regions to output low resolution panoramic videos 630.
  • temporal upsampling includes temporally upsampling a panoramic video y_{k+1} to create a frame-rate-matched video ŷ_k and compositing x_k over ŷ_k to form a merged video ỹ_k.
  • x_k is aligned to ŷ_k prior to compositing using a grid-warp-based optical flow and color histogram matching.
  • the panoramic video from the previous level (y_{k+1}) 705 is temporally upsampled (ŷ_k) and merged with a current level reprojected input video (x_k) 715 to form a partially-completed input (ỹ_k).
  • the partially-completed input (ỹ_k) matches the input video within the original mask (m_k) but lacks temporal details outside of the mask.
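The upsample-then-composite step can be sketched in a few lines. This is a simplified illustration: temporal upsampling is done by repeating frames (the text does not specify the interpolation), frames are flat lists of pixel values, a mask value of 1 is assumed to mark an input pixel, and the optical-flow and color alignment are omitted. All names are hypothetical.

```python
def upsample_2x(frames: list[list[float]]) -> list[list[float]]:
    """Nearest-neighbor temporal upsampling: repeat every frame twice."""
    return [list(frame) for frame in frames for _ in range(2)]


def composite(x_k: list[list[float]], m_k: list[list[int]],
              y_up: list[list[float]]) -> list[list[float]]:
    """Keep the input pixel where the mask is 1, otherwise keep the upsampled pixel."""
    merged = []
    for x_frame, m_frame, y_frame in zip(x_k, m_k, y_up):
        merged.append([x if m else y for x, m, y in zip(x_frame, m_frame, y_frame)])
    return merged


# Hypothetical data: a 2-frame coarse panorama and a 4-frame finer input.
y_coarse = [[1, 1, 1], [2, 2, 2]]
x_fine = [[9, 0, 0], [0, 9, 0], [0, 0, 9], [0, 9, 0]]
m_fine = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
merged = composite(x_fine, m_fine, upsample_2x(y_coarse))
```

The merged result agrees with the input wherever the mask is set, while everything else is a repeated coarse frame, which is why the regions outside the mask still lack temporal detail and need resynthesis.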
  • outpainting 720 is used for context.
  • the diffusion model outpaints 720 (i.e., performs resynthesis of) content outside the original mask (m_k) to output the completed panoramic video (y_k) 725.
  • outpainting 720 is controlled by a mask conditioning signal.
  • a full-frame mask is applied to the odd-numbered frames of the partially-completed input (ỹ_k) and the original mask (m_k) is applied to the even-numbered frames.
  • the diffusion sampling is constrained to maintain the temporally-upsampled, synthesized pixels at the odd-numbered frames, while the synthesized pixels are resynthesized at the even-numbered frames.
  • for videos with fast motion (e.g., a skier, a kayaker, a skateboarder, etc.), the temporally-upsampled pixels at odd-numbered frames may also need to be resynthesized.
  • the full-frame masks are maintained at odd frames for a predetermined amount of the sampling schedule (e.g., 1/8), then the original mask (m_k) for the entire window is used so that the diffusion model synthesizes new temporal details for the missing regions outside of the original mask.
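The mask conditioning schedule can be expressed as a small selection function. The sketch below makes several assumptions that are not stated in the text: frames are numbered from 1, a mask value of 1 means "keep this pixel" and 0 means "resynthesize", and the sampling schedule is indexed by an integer step. The function name and the `fast_motion` flag are hypothetical.

```python
def conditioning_mask(frame_number: int, step: int, num_steps: int,
                      original_mask: list[int], fast_motion: bool = False,
                      hold_fraction: float = 1 / 8) -> list[int]:
    """Choose the conditioning mask for one frame at one diffusion sampling step.

    Even-numbered frames always use the original mask. Odd-numbered frames use
    a full-frame mask so their temporally-upsampled pixels are preserved. For
    fast-motion videos, the full-frame mask is held only for the first
    `hold_fraction` of the schedule, after which the original mask is used.
    """
    if frame_number % 2 == 0:
        return list(original_mask)
    if fast_motion and step >= hold_fraction * num_steps:
        return list(original_mask)
    return [1] * len(original_mask)


original = [1, 0, 0]  # hypothetical mask: only the first pixel is an input pixel
```

For an 80-step schedule with the example fraction of 1/8, odd frames of a fast-motion clip would keep the full-frame mask for steps 0 through 9 and switch to the original mask from step 10 onward.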
  • the diffusion model 734 is applied in a sliding-window fashion with half-window overlap (e.g., 40 frames) between two windows 730, 732.
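Sliding-window application with half-window overlap amounts to choosing window start indices with a stride of half the window length. The sketch below uses the example figures from the text (an 80-frame window with a 40-frame overlap); the handling of a clip whose length is not a multiple of the stride, where a final window is placed flush with the end, is an assumption.

```python
def window_starts(num_frames: int, window: int = 80, overlap: int = 40) -> list[int]:
    """Return the start index of each temporal window covering `num_frames` frames."""
    if num_frames <= window:
        return [0]
    stride = window - overlap
    starts = list(range(0, num_frames - window + 1, stride))
    # Add a final window aligned with the end if the regular stride leaves a gap.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return starts
```

For a 160-frame clip this yields windows starting at frames 0, 40, and 80, so every interior frame is covered by two windows.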
  • the diffusion model 734 is fine-tuned by randomly cropping videos to be 128x128 pixels and augmenting a set of synthetic and real panning video masks.
  • the diffusion model 734 is optimized for a diffusion denoising objective with a squared error loss using dynamic masks (i.e., masks that change location in subsequent frames to simulate an input panning video).
  • the token-based model is fine-tuned to reduce fidelity loss with the encoder/decoder architecture to better align the synthesized pixels with the input video and preserve details of the input in the outpainted regions.
  • the decoder may be fine-tuned on patches of valid pixels prior to synthesizing the final result.
  • an example comparison 800 is illustrated of how probabilities during outpainting using a token-based model compare to a diffusion model, according to some embodiments described herein.
  • a reprojected input video 801 includes a missing region 802.
  • the generative machine-learning model generates a sample 803 from two predicted probability distributions associated with two overlapping windows: a left window 804 and a right window 805.
  • a token-based model outputs discrete token probabilities 806, 807 based on the left window 804 and right window 805, respectively, that are aggregated 808.
  • a diffusion model outputs pixel probabilities 809, 810 based on a Gaussian distribution over pixel values (represented by μ and σ) that are aggregated 811.
  • a spatial super resolution 635 is applied to the upsampled panoramic video 630d.
  • Figure 6B illustrates example numbers of frames at each stage where the generative machine-learning model is a token-based model 656 that accepts 11 frames 655.
  • An input video 651 includes 35 frames.
  • the reprojected input video is completed at 11 frames 655.
  • the 2X upsampling of 11 frames is 22 frames 654, the 2X upsampling of 22 frames is 44 frames 653, and the 44-frame version is resampled back to 35 frames 652.
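The frame-count arithmetic in this example (35 input frames, completion at 11 frames, two 2X upsamplings to 44 frames, then resampling back to 35) can be checked with a short sketch. The resampling rule shown, which picks evenly spaced nearest source indices, is an assumption; the text does not state how frames are selected or interpolated.

```python
def resample_indices(src_count: int, dst_count: int) -> list[int]:
    """Map each of `dst_count` output frames to the nearest evenly spaced source frame."""
    if dst_count == 1:
        return [0]
    return [round(i * (src_count - 1) / (dst_count - 1)) for i in range(dst_count)]


base_frames = 11                      # frames accepted by the token-based model
level_counts = [base_frames, base_frames * 2, base_frames * 4]
to_base = resample_indices(35, 11)    # 35-frame input sampled down to 11 frames
to_output = resample_indices(44, 35)  # 44-frame result sampled back to 35 frames
```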
  • Figure 6C illustrates an example process 660 of converting an input video 661 to a panoramic video.
  • the input video is of a person kayaking.
  • the dashed square 662 in the input video highlights a section of the input video 661 where unobserved movement occurs during capture of the panning video. For example, as one person’s kayak moves from left to right and another person’s kayak moves from right to left, the activities in the dashed square 662 are not visible to the camera.
  • One difficulty in generating a panoramic video is ensuring that movement, such as a person using a paddle, is properly aligned.
  • the reprojected input video 663 includes unknown regions 664 where portions of the reprojected input video 663 are to be filled in with synthesized pixels to create a panoramic video. The person’s movement with the paddle in the reprojected input video 663 goes from pausing to paddling to pausing again.
  • the generative machine-learning model 665 outputs a panoramic video 666 where the missing regions are replaced with synthetic pixels and the synthesized images are merged with the reprojected input video 663. For example, the original pixels in the unmasked areas in the reprojected input video 663 are merged with synthesized images.
  • Figure 6D illustrates the process 670 of finalizing the panoramic video using a token-based model 679.
  • a low-resolution panoramic video 671 includes 11 frames with temporal context that are upsampled with the reprojected input video 672, which includes 22 frames and temporal details to result in a 2x upsampled video 673 with 22 frames.
  • the upsampled video is modified by performing spatial alignment 674 (e.g., using temporal upsampling and frame-rate matching), color alignment 675 (e.g., through color histogram matching), and compositing 676 (e.g., using grid-warp-based optical flow or other types of image interpolation and/or blending), for example by using the process described in Figure 6A, to result in a merged video 677 with 22 frames.
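Color histogram matching, mentioned above as one way to perform color alignment, can be sketched for a single 8-bit channel as follows. This is a generic CDF-based version written for illustration, not the specific procedure used in the described embodiments; the function name and the flat-list pixel representation are assumptions.

```python
def match_histogram(source: list[int], reference: list[int], levels: int = 256) -> list[int]:
    """Remap `source` intensities so their distribution follows that of `reference`."""

    def cdf(values: list[int]) -> list[float]:
        hist = [0] * levels
        for v in values:
            hist[v] += 1
        running, out = 0, []
        for count in hist:
            running += count
            out.append(running / len(values))
        return out

    src_cdf, ref_cdf = cdf(source), cdf(reference)
    lookup, j = [], 0
    for i in range(levels):
        # Advance to the first reference level whose CDF reaches the source CDF.
        while j < levels - 1 and ref_cdf[j] < src_cdf[i]:
            j += 1
        lookup.append(j)
    return [lookup[v] for v in source]


# Hypothetical pixel values for one channel.
aligned = match_histogram([0, 0, 1, 1], [10, 10, 20, 20])
```

For color video, the same mapping would be computed per channel, typically using only pixels that are valid in both the input and the upsampled panorama.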
  • a token-based model is used for generating synthesized pixels.
  • where a different type of generative machine-learning model, such as a diffusion model, is used, color alignment may not be performed.
  • the token-based model 679 may receive as input the merged video 677 with a mask 678 where the white portions represent the input and the black portions represent resynthesized pixels.
  • the token-based model 679 may generate a mask of the pixels for the missing regions, resynthesize the pixels for the missing regions of the panoramic canvas, and merge the mask with the resynthesized pixels.
  • the token-based model 679 outputs a high-resolution panoramic video 680 (i.e., with a threshold number of pixels) with 22 frames.
  • the panoramic video may be provided to the user that captured the input panning video.
  • the panoramic video is displayed with an option to select a panoramic image from the panoramic video.
  • the panoramic video may be displayed as a series of panoramic images and a user interface may include an option for isolating one or more of the panoramic images.
  • Example Method
  • Figure 9 illustrates a flowchart of an example method 900 to generate a panoramic video.
  • the method 900 may be performed by the computing device 200 in Figure 2.
  • the method 900 is performed by the user device 115, the media server 101, or in part the user device 115 and in part the media server 101.
  • the method 900 of Figure 9 may begin at block 902.
  • an input panning video is received that includes image frames captured with a camera that pans at least once from a first side to a second side.
  • a quality of the panoramic video improves as a function of a number of times the image frames captured with the camera include panning from the first side to the second side and panning from the second side to the first side.
  • the input panning video includes one or more moving objects.
  • Block 902 may be followed by block 904.
  • the input panning video is reprojected onto a panoramic canvas, where the panoramic canvas includes the image frames and missing regions.
  • Block 904 may be followed by block 906.
  • a generative machine-learning model outputs a panoramic video that includes synthesized pixels for the missing regions of the panoramic canvas by: generating a base video at a coarse temporal scale that includes the pixels for the missing regions; and refining the base video by performing temporal upsampling, merging the temporally upsampled base video with the panoramic canvas, and resynthesizing the synthesized pixels for the missing regions.
  • the temporal upsampling may include temporally upsampling the base video and performing spatial alignment and color alignment of the temporally upsampled base video with the panoramic canvas to obtain an aligned temporally upsampled base video.
  • Merging the temporally upsampled base video with the panoramic canvas may include merging the aligned temporally upsampled base video with the panoramic canvas.
  • the resynthesizing may include generating a mask of pixels for the missing regions, regenerating the synthesized pixels for the missing regions based on the mask, and merging the regenerated pixels with the synthesized pixels.
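The overall coarse-to-fine flow of block 906 can be summarized as a loop over temporal levels. The sketch below is a hypothetical skeleton: `outpaint`, `upsample`, and `merge` are placeholder callables standing in for the generative model, the temporal upsampler, and the align-and-composite step, and none of these names come from the described embodiments.

```python
from typing import Callable, Sequence


def generate_panorama(levels: Sequence, masks: Sequence,
                      outpaint: Callable, upsample: Callable, merge: Callable):
    """Run coarse-to-fine completion.

    `levels[0]` is the finest reprojected input and `levels[-1]` the coarsest;
    `masks[k]` marks the input pixels of `levels[k]`.
    """
    # Base video: outpaint the coarsest level directly.
    completed = outpaint(levels[-1], masks[-1])
    # Refine from the second-coarsest level down to the finest level.
    for k in range(len(levels) - 2, -1, -1):
        upsampled = upsample(completed)
        merged = merge(levels[k], masks[k], upsampled)
        completed = outpaint(merged, masks[k])
    return completed
```

Passing the stages in as callables keeps the loop independent of whether a token-based model or a diffusion model performs the outpainting.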
  • a super-resolution pass is applied to the panoramic video to increase resolution of the panoramic video.
  • a panoramic image is selected from the panoramic video.
  • users may be provided with control over whether the systems discussed herein may enable collection of user information (e.g., information about a user’s social network, social actions, or activities, profession, a user’s preferences, or a user’s current location), and over whether the user is sent content or communications from a server.
  • certain data may be treated in one or more ways before it is stored or used, so that personally identifiable information is removed.
  • a user’s identity may be treated so that no personally identifiable information can be determined for the user, or a user’s geographic location may be generalized where location information is obtained (such as to a city, ZIP code, or state level), so that a particular location of a user cannot be determined.
  • the processor may be a special-purpose processor selectively activated or reconfigured by a computer program stored in the computer.
  • a computer program may be stored in a non-transitory computer-readable storage medium, including, but not limited to, any type of disk including optical disks, ROMs, CD-ROMs, magnetic disks, RAMs, EPROMs, EEPROMs, magnetic or optical cards, flash memories including USB keys with non-volatile memory, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
  • the specification can take the form of some entirely hardware embodiments, some entirely software embodiments or some embodiments containing both hardware and software elements.
  • the specification is implemented in software, which includes, but is not limited to, firmware, resident software, microcode, etc.
  • the description can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.
  • a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • a data processing system suitable for storing or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus.
  • the memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

EP24847243.3A 2023-12-29 2024-12-27 Erzeugung von panoramavideos Pending EP4635192A1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363616037P 2023-12-29 2023-12-29
PCT/US2024/062142 WO2025145075A1 (en) 2023-12-29 2024-12-27 Generating panoramic videos

Publications (1)

Publication Number Publication Date
EP4635192A1 true EP4635192A1 (de) 2025-10-22

Family

ID=94386319

Family Applications (1)

Application Number Title Priority Date Filing Date
EP24847243.3A Pending EP4635192A1 (de) 2023-12-29 2024-12-27 Erzeugung von panoramavideos

Country Status (6)

Country Link
EP (1) EP4635192A1 (de)
JP (1) JP2026507312A (de)
KR (1) KR20250113478A (de)
CN (1) CN120569978A (de)
DE (1) DE112024000287T5 (de)
WO (1) WO2025145075A1 (de)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564527B (zh) * 2018-04-04 2022-09-20 百度在线网络技术(北京)有限公司 基于神经网络的全景图内容补全和修复的方法及装置
US12254589B2 (en) * 2022-04-01 2025-03-18 Adobe Inc. Extrapolating panoramas from images using a generative model

Also Published As

Publication number Publication date
JP2026507312A (ja) 2026-03-02
WO2025145075A1 (en) 2025-07-03
CN120569978A (zh) 2025-08-29
DE112024000287T5 (de) 2025-09-18
KR20250113478A (ko) 2025-07-25

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20250715

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC ME MK MT NL NO PL PT RO RS SE SI SK SM TR