WO2023281250A1 - Image stitching - Google Patents

Image stitching

Info

Publication number
WO2023281250A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
image
stream
captured
background
Prior art date
Application number
PCT/GB2022/051721
Other languages
English (en)
Inventor
Michael Paul Alexander Geissler
Oliver Augustus KINGSHOTT
Original Assignee
Mo-Sys Engineering Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GBGB2109804.1A (GB202109804D0)
Application filed by Mo-Sys Engineering Limited
Publication of WO2023281250A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/272Means for inserting a foreground image in a background image, i.e. inlay, outlay
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B27/034Electronic editing of digitised analogue information signals, e.g. audio or video signals on discs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/61Control of cameras or camera modules based on recognised objects
    • H04N23/611Control of cameras or camera modules based on recognised objects where the recognised objects include parts of the human body
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/2224Studio circuitry; Studio devices; Studio equipment related to virtual studio applications
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/2624Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects for obtaining an image which is composed of whole input images, e.g. splitscreen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/2628Alteration of picture size, shape, position or orientation, e.g. zooming, rotation, rolling, perspective, translation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing

Definitions

  • This invention relates to image stitching, for example in relation to video content, where images for use in a video stream are formed from multiple sources and are joined together.
  • Figure 1 shows an arrangement for recording video.
  • a subject 1 is in front of a display screen 2.
  • Multiple cameras 3, 4 are located so as to view the subject against the screen.
  • a display controller 5 can control the display of images on the screen. These may be still or moving images which serve as a background behind the subject.
  • This setup provides an economical way to generate video content with complex backgrounds. Instead of the background being built as a traditional physical set it can be computer generated and displayed on the screen.
  • the image displayed on the screen should be displayed with a suitable transformation so that it appears realistic from the point of view of the camera.
  • This is usually achieved by a render engine implemented in display controller 5.
  • the render engine has access to a datastore 6 which stores three-dimensional locations of the objects to be represented in the background scene.
  • the render engine then calculates the position and appearance of those objects as they would be seen from the point of view of the active camera, for instance camera 3. The results of that calculation are used to form the output to the display screen 2.
  • the render engine re-calculates the background image as it would be seen from the point of view of that camera.
  • Where the CGI or background image is relatively simple, the processing requirements are sufficiently low for the compositor to be able to keep up with the footage as it is being shot - the result being real-time compositing.
  • As the complexity of the CGI increases, the data processing requirements of the compositor increase.
  • the background can be a complex landscape, including buildings, weather phenomena, moving water and so on, which all add a great deal of complexity to the background image.
  • a method of compositing a video stream comprising the steps of: obtaining first and second video streams, wherein there is overlap between the background of the first and second video streams; identifying a common feature in the background of the first and second video streams; and stitching the first and second video streams together along the identified feature within the background of the first and second video streams.
  • Embodiments of the present invention may provide a hybrid arrangement in which a relatively small screen is used behind the actors/presenters to enable them to have a visible but relatively narrow (either or both in terms of width or height) “scene” with which to interact, and in which the remainder of the scene can be added by way of CGI or the like, but typically without requiring the provision of a “green screen” or similar technology.
  • the invention may allow a relatively low cost solution for providing a set extension.
  • the first video stream may have a relatively narrow background and the second video stream has a relatively wide background.
  • the background of the first and second streams may be generated from the same 3D model and/or from the same perspective.
  • the method may further comprise the step of extracting a portion of the first image stream for stitching into the second image stream.
  • the method may further comprise the step of comparing the first and second image streams to identify areas of difference, thereby identifying potential extraction portions.
  • the method may further comprise the step of identifying a portion of the first image stream for extraction and then increasing the size of the extracted portion by dilating in at least one direction.
  • the common feature may be inboard of an edge of the first video stream.
  • the common feature is preferably outboard of a subject within the first video stream.
  • the background of the first video stream is preferably provided by a display screen, typically an LED screen.
  • the screen may be formed of different parts: the more centrally located sections behind foreground objects may be of higher resolution, as these images are more likely to be retained in any combined stream, while the sections towards the outside may be of lower resolution, as these areas are more likely to be excluded from a combined stream and, in effect, replaced by the higher resolution images from a CGI image stream.
  • the identified feature may change during the first video stream from a first identified feature to a second identified feature.
  • the identified feature may be a common edge of an object in each background.
  • the identified feature may be determined by one or more of: dilation about an identified object, identification of a feature line along which the join can lie, or determination of a smart seam based on best fit criteria.
  • best fit criteria may include setting the generated stitching seam to be a hard edge along edge features and/or a soft edge across smooth areas at a threshold distance from the extracted subject.
  • the generated stitching seam may be dependent on the characteristic of the background around the subject.
  • a video compositing system comprising: one or more inputs for receiving and/or storing first and second video streams to be joined together, the video streams having at least partially overlapping backgrounds; an input for receiving an identification of a feature common to the backgrounds of the first and second video streams, and a processor configured to stitch the first and second video streams together along the identified feature.
  • the processor may be further configured to detect the common feature and provide the necessary input.
  • the processor may be further configured to ensure that the common feature is inboard of the edge of the first video stream.
  • the processor may be further configured to ensure that the common feature is outboard of any subject in the first video stream.
  • the system may further comprise a camera for capturing the first video stream.
  • the system may further comprise a display screen for displaying the background to be used during capture of the first video stream.
  • the display screen may obtain the background image to display from a video store in the video compositing system, or it may be supplied in real time from a render engine, which may be rendering a 3D model into the 2D images for display.
  • the system may further comprise an infra-red camera for determining the location of any subject in the first video stream.
  • the system may further comprise a comparator for determining the location of any subject in the first video stream by comparing the first and second video streams, typically by identifying areas of difference between the first and second video streams.
  • the system may further comprise one or more primary compositors running in parallel with one or more secondary compositors, the primary compositor or compositors being optimised for rendering real-time CGI footage and the secondary compositor or compositors being optimised for rendering high-quality CGI footage.
  • a render engine configured to receive a first captured image stream and a second CGI image stream, compare the image streams to identify areas of difference, thereby defining one or more possible subjects for extraction in the first stream, extract one or more of the areas of difference, and stitch the extracted area or areas of difference into the second CGI image stream.
  • Also disclosed is a method of compositing a video stream comprising the steps of: obtaining a first captured video stream in which the background is displayed on an LED screen and in which the background is based on a 3D model, obtaining a second video stream based on the 3D model, rendering at least a portion of the first captured video stream, stitching at least a portion of the images from the first captured video stream into the second video stream along one or more seams.
  • the one or more seams may be determined by one or more of: dilation about an identified object, identification of a feature line along which the seam can lie, or determination of a smart seam based on best fit criteria.
  • the terms “subject”, “actor”, “presenter” and “people” are generally synonymous and are intended to cover any form of object that is to be videoed, and may include a performer.
  • Such performers may be human or animal or even robots, and may be delivering a fictional portrayal of a scene such as in a film or television programme, a live or pre-recorded report, or any other form of video content.
  • the subject may be or include one or more inanimate objects.
  • the provision of the subjects against a screen also means that the additional CGI does not need to appear behind the subjects in the scene, which is a complex and time-consuming task, especially when the subjects are people that are moving, such that the scene behind them is continually changing.
  • the computing power required by the relevant render engines and the like in that scenario is significant, and such processing can typically only be done in post-production.
  • the first video stream may have, on an edge of the stream, a vertical line indicative of, say, the end of a wall.
  • the second video stream may have an equivalent vertical line on an opposite edge of the stream, such that when the two edges of the streams are aligned, a continuous image is formed.
  • the “overlap” is therefore the common edge which permits alignment. It is, however, more common that the CGI background has significant portions in common with, and preferably contains all of, the display screen background.
  • the background of the captured video stream is preferably only images shown on the screen.
  • the display screen is typically an LED screen.
  • the background for the display screen and for the CGI stream are preferably both generated / rendered from the same 3D model. This allows for more accurate alignment when overlaying the two streams and/or greater accuracy when identifying the areas of difference.
  • the background for the CGI stream may be adapted to match the background of the display screen stream. By this, we mean that as the angle of the camera view (pan, tilt, height etc) that generates the first captured video stream is altered, the equivalent changes are made to the CGI stream background to replicate the effect of the camera motion.
  • the first and second image streams that are combined may be created from the same perspective.
  • one stream may be a captured/filmed image stream of the LED screen and foreground objects such as actors, where the perspective is determined based upon the camera position.
  • the second stream may then be computer generated based on the same perspective from the same virtual camera position.
  • a method in which two image streams are combined, wherein the two image streams are created from the same perspective, where a first image stream is a captured image stream of a display screen and a second image stream is computer generated.
  • the displayed image on the screen and the second image stream are preferably based on the same 3D model.
  • This may be combined with any of the disclosed methods of identifying foreground objects within the captured video stream and extracting them for stitching into the second image stream, for example, by identifying non-common and/or overlapping areas, adding rim dilation, e.g. extracted subject or extracted actor dilation, around those non-common areas / foreground objects and/or expanding the extracted image out to a non-obvious border like lines or fades.
  • the present invention may further include a CGI compositing system comprising one or more primary compositors running in parallel with one or more secondary compositors, the primary compositor or compositors being optimised for rendering real-time CGI footage and the secondary compositor or compositors being optimised for rendering high-quality CGI footage.
  • the present invention may also include a system for compositing a scene comprising a captured video image and a computer-generated background image, the system comprising: at least one video camera for recording and storing a raw video image of a scene and for outputting a working video image of the scene; a selector switch operatively interposed between each of the video cameras and a real-time compositor for feeding a selected one of the working video images at any given time to the real-time compositor; the real-time compositor being adapted to incorporate a first CGI image into the selected working video image and to display a composite video image representative of the final shot on a display screen; the system further comprising: a data storage device adapted to store an archive copy of the raw video images from each of the video cameras and a post-production compositor operatively connected to the data storage device for superimposing a second CGI image onto any one or more of the video images captured by the video cameras, wherein the first CGI image is a lower resolution, or simplified, version of the second CGI image.
  • the invention may include and may operate two CGI rendering engines in parallel that work from a common set of video footage, for example the raw video images.
  • This allows CGI rendering, that is to say rendering of a three-dimensional computer model, to be conducted at two different resolutions (levels of detail) simultaneously, with the lower resolution version of the CGI footage being composited in real time with the actual video footage to allow the effect to be seen in real time, whereas the higher resolution CGI rendering is carried out in near-time, to enable the final composited footage to be reviewed later on.
  • a video post-processing arrangement comprising a computer having one or more processors configured to execute code to: receive a background image signal representing a background image; receive a captured video signal representing video of a subject against a background comprising at least part of the background image; process the captured video to identify regions of the captured video occupied by the subject; define a border region around the identified regions; and form an output video stream by replacing, in the captured video signal, regions of the captured video signal outside the identified regions and the border region with the background image as received in the background image signal.
  • the background image signal may be a computer-generated imagery signal.
  • the arrangement may comprise a display screen, the display screen being arranged to receive the background image and display it.
  • the computer may be arranged to identify the regions of the captured video occupied by the subject at least in part by visual comparison between the background image and the captured video signal.
  • the computer may be arranged to receive depth data captured by a depth sensor indicating distances from the camera to objects in its field of view and to identify the regions of the captured video occupied by the subject at least in part in dependence on the depth data.
  • the computer may be arranged to receive shadow data captured by a second camera indicating regions of the background against which the video of the subject was captured that have been shaded by the subject from an illuminator and to identify the regions of the captured video occupied by the subject at least in part in dependence on the shadow data.
  • the width of the border may be greater than 1% of the shortest side of a frame of the captured video.
  • the computer may be arranged to identify in the captured video a region of a predetermined colour and to form the output signal by replacing, in the captured video signal, regions of the captured video signal of the predetermined colour with the background image as received in the background image signal.
  • the computer may be arranged not to replace regions of the captured video signal in the identified regions and the border region with the background image as received in the background image signal.
  • Figure 1 shows an arrangement for recording video.
  • Figure 2 shows a further arrangement for recording video.
  • Figure 3 shows an example of combined first and second video streams.
  • Figure 4 shows an example of image dilation.
  • Figure 5 shows a further arrangement for recording video.
  • Figure 6 shows a yet further arrangement for recording video.
  • Figure 7 shows perspective differences when a camera moves.
  • Figure 8 shows a correction applied to Figure 7.
  • Figure 9 shows a video processing arrangement.
  • FIG. 2 shows an arrangement for recording video.
  • the arrangement comprises a display screen 10, multiple video cameras 11, 12, a video feed switch 13, a camera selector unit 14, a camera selector user interface 15, multiple display controllers 16, 17 and a scene database 18.
  • the display screen 10 is controllable to display a desired scene. It may be a front- or back-projection screen, or a light emissive screen such as an LED wall. It may be made up of multiple sub-units such as individual frameless displays butted together. It may be planar, curved or of any other suitable shape.
  • a subject 19 is in front of the screen, so that the subject can be viewed against an image displayed on the screen. That is, with the image displayed on the screen as a background to the subject.
  • the subject may be an actor, an inanimate object or any other item that is desired to be videoed. Typically, the subject is free to move in front of the screen.
  • the stream output by the video feed switch 13 is selected in dependence on a signal from the camera selector unit 14, which operates under the control of user interface 15.
  • By operating the user interface 15, an operator can cause a selected one of the incoming video streams captured by the cameras to be output. In this way the operator can cut between the two cameras.
  • Each display controller comprises a processor 21, 22 and a memory 23, 24.
  • Each memory stores in non-transitory form instructions executable by the respective processor to cause the processor to provide the respective display controller with the functions as described herein.
  • the two display controllers may be substantially identical.
  • the scene database stores information from which the display controllers can generate images of a desired scene from a given point of view to allow such images to be displayed on the screen 10.
  • the scene database may store one or more images that can be subject to transformations (e.g. any of affine or projective transformations, trapezoidal transformations and/or scaling transformations) by the display controllers to adapt the stored images to a representation of how the scenes they depict may appear from different points of view.
  • transformations may take into account the distortion induced by the lens currently installed in the camera, the pan/tilt attitude of the camera and any offset of the camera image plane from a datum location of the camera. Transformations to deal with these issues are known in the literature.
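  • As an illustrative sketch only (not taken from the application), the following Python/OpenCV snippet shows how a stored background image could be adapted with a projective transformation before display; the homography values and frame size are invented for the example, and deriving the matrix from the lens model and pan/tilt data is left out.
```python
import cv2
import numpy as np

def warp_background(background: np.ndarray, homography: np.ndarray,
                    out_size: tuple) -> np.ndarray:
    """Warp a stored background image with a 3x3 projective transform.

    The homography is assumed to have been derived elsewhere from the
    camera's pan/tilt, lens model and offset from its datum location.
    """
    width, height = out_size
    return cv2.warpPerspective(background, homography, (width, height),
                               flags=cv2.INTER_LINEAR)

# Example with a synthetic grey background and a mild perspective change
# approximating a small camera pan (values invented for illustration).
background = np.full((1080, 1920, 3), 128, dtype=np.uint8)
H = np.array([[1.0, 0.02, -15.0],
              [0.0, 1.0, 0.0],
              [0.00002, 0.0, 1.0]])
screen_feed = warp_background(background, H, (1920, 1080))
```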
  • the scene database may store data defining the appearance of multiple objects and those objects’ placement in three dimensions in one or more scenes.
  • the display controllers can calculate the appearance of the collection of objects from a given point of view.
  • the transformations may take into account the distortion induced by the lens currently installed in the camera, the pan/tilt attitude of the camera and any offset of the camera image plane from a datum location of the camera.
  • the display controllers may implement a three-dimensional rendering engine.
  • An example of such an engine is Unreal Engine available from Epic Games, Inc.
  • a display controller may be continuously active but may output control data to the screen only when it determines itself to be operational to control the screen. When a display controller is outputting data to the screen, the screen displays that data as an image.
  • each display controller 16, 17 has a processor running code stored in the respective memory. That code causes the respective display controller to retrieve data from the memory 18 and to form an image of a scene using that data and the location of a given point of view. Then, when the controller is operational to control the screen it outputs that image to the screen, which displays the image.
  • the image displayed by the screen may be a still image or a video image.
  • a video image allows the background to vary during the course of a presentation by the subject that is being recorded by the cameras.
  • the video image may, for example, portray falling rain, moving leaves, flying birds or other background motion.
  • each camera may be provided with a location estimating device 25, 26 that estimates the location of the camera in the studio or other environment where filming is taking place. That device may, for example, be a StarTracker sensor/processor system as is commercially available from the applicant. Such a device can allow the location of the camera to be tracked as the camera moves. Location data determined by such a device can be passed to the display controllers for use as point of view locations. These are just examples of mechanisms whereby the display controllers can receive the locations of the cameras.
  • FIG. 2 shows a camera location estimation unit 27, which could form part of a StarTracker system.
  • that unit communicates wirelessly with the devices 25, 26 to learn the cameras’ locations and provides those locations to the display controllers 16, 17, although other forms of communication may be possible.
  • Whilst Figures 1 and 2 show a relatively small screen, such as 3-4 metres wide and 2-3 metres high, in practice these screens could be much larger, as discussed above, in order to provide the necessary scale of the background scenery. This is especially true for outdoor and/or outer space scenes, which deliver dramatic effect by way of the “vastness” of the scene in which the actors are being portrayed.
  • the present invention uses two image streams: firstly, a relatively narrow captured image stream of the actors/presenters in front of the screen, and secondly, a relatively wide stream of CGI to fit around the relatively narrow captured stream.
  • the CGI stream would typically contain the background images shown on the screen, and therefore captured in the narrow stream when actors are present, as this allows for easier further production work. This is illustrated schematically in Figure 5.
  • the camera 11 captures images of the subject 19 in front of display screen 10. Those images are of a relatively narrow frame shot in which the background on the screen 10 is provided by a first render engine 59. Typically, this image will be of relatively low quality as the render engine may be generating the background in real time or near real time. Both the background on the display screen and the second CGI image stream are typically generated (i.e. rendered) from the same 3D model. Thus, rendering in this context is the generation of 2D images from the 3D model.
  • Location information is captured by the tracker 25 and fed via line 52 to a second render engine 60, as well as to the first render engine 59.
  • the feed to the first render engine allows the background on the display screen to be adjusted based on the changing camera position.
  • lens distortion information is also fed to this second render engine 60.
  • the second render engine can then determine how to correct and/or re-render the captured images to compensate for lens distortion issues, lighting discrepancies, colour differences and to improve the overall quality of the image in the captured image stream.
  • Additional render engines and/or computers may be used to carry out one or more of the tasks disclosed herein.
  • the second render engine 60 may communicate with the internet, for example via the cloud 63, or with the first render engine, to obtain the second image stream, or it may already contain details of the second video stream, which is typically the CGI background: the wider background which does not fit on the display screen 10.
  • the second render engine may utilise its own onboard processing, memory and other computing requirements, or may alternatively use cloud based services 63 to carry out one or more of the tasks.
  • the second render engine 60 then stitches the captured image stream and the CGI image stream together.
  • Various steps can be carried out when stitching the image streams together and these are discussed below. Not all steps are necessarily required, and the steps may be carried out in a different order to that described below.
  • a comparison 61 of the first and second streams is carried out.
  • This comparison can also be known as a difference key.
  • This identification 62 recognises absent areas or “areas of difference” which are typically people, props or other objects that are the subject(s) of the image stream.
  • Portions of the first image stream then need to be extracted 64 so that they can be combined by insertion, i.e. stitched into the second image stream. These portions need to include all relevant areas of difference to ensure that the final video stream includes all relevant parts of the captured image.
  • the comparison between the two image streams will allow the extracted image(s) to be placed in the correct location within the second image stream. This may be done by overlaying the second CGI stream over the extracted portion(s) of the first image stream or vice versa.
  • the comparison may also include one or more steps of segmentation of one or both of the image streams, that is the identification 62 of distinct areas via segmentation algorithms or through object recognition within the images, e.g. a chair or a car, such that the identified object can then be recognised at a later point for further processing, such as extraction, or alteration such as changing colour.
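  • By way of a hedged illustration, a difference key of the kind described above could be computed as follows; the function name, threshold value and morphological clean-up are assumptions made for this sketch, and the captured frame and the CGI render are assumed to be already aligned and the same size.
```python
import cv2
import numpy as np

def difference_key(captured: np.ndarray, cgi: np.ndarray,
                   threshold: int = 30) -> np.ndarray:
    """Return a binary mask of 'areas of difference' (the likely subjects)."""
    diff = cv2.absdiff(captured, cgi)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)
    # Clean up sensor noise and small render/capture mismatches.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask
```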
  • image blending techniques may be included to correct colour and/or brightness when merging join lines. Such techniques may include two-band image blending, multi-band image blending and Poisson image editing.
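  • As a sketch of one such blending step, OpenCV's Poisson image editing (seamlessClone) could be used to merge an extracted portion into the CGI stream; placing the patch at the mask centroid is an assumption of this example rather than a requirement of the method.
```python
import cv2
import numpy as np

def poisson_blend(extracted: np.ndarray, cgi_background: np.ndarray,
                  mask: np.ndarray) -> np.ndarray:
    """Blend an extracted foreground patch into the CGI stream.

    `mask` is the binary mask of the extracted region, e.g. the output of a
    difference key; both images are assumed to be the same size as the mask.
    """
    ys, xs = np.nonzero(mask)
    centre = (int(xs.mean()), int(ys.mean()))  # place the patch where it came from
    return cv2.seamlessClone(extracted, cgi_background, mask, centre,
                             cv2.NORMAL_CLONE)
```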
  • When stitching the images together, the second CGI image stream will typically be of a higher quality than the captured image stream (which is captured in real time and therefore will be of lower quality compared to the near-time CGI stream), so to minimise the amount of re-rendering of the captured image stream that is required, it is desirable to reduce the size of the extracted image where possible.
  • one technique is to extract close to, preferably on the edge of, the area of difference, i.e. ensure that the area of difference is as small as possible. It is clear that the stitch line is therefore outboard of the subject, else part of the subject would be lost.
  • the area to be extracted may be dilated, that is expanded, for example to line 71. This may still be insufficient if the subject moves significantly, for example through arm movements, as continually changing the stitching location will increase the likelihood of a viewer noticing it. As such, the extracted image may be dilated further, to line 72 or even beyond. Such dilation may be helpful when looking to use the technique described with reference to Figure 3 below, which demonstrates one way in which two streams can be combined.
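  • A minimal sketch of such dilation, using a morphological dilate with margins corresponding loosely to the lines of Figure 4; the margin values and the dummy circular mask are illustrative only.
```python
import cv2
import numpy as np

def dilate_extraction_mask(mask: np.ndarray, margin_px: int) -> np.ndarray:
    """Expand a subject mask outwards by roughly margin_px pixels."""
    size = 2 * margin_px + 1
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (size, size))
    return cv2.dilate(mask, kernel)

# Example with a dummy mask containing a filled circle as the "subject".
mask = np.zeros((720, 1280), dtype=np.uint8)
cv2.circle(mask, (640, 400), 120, 255, thickness=-1)
tight = dilate_extraction_mask(mask, 5)      # hugs the area of difference (cf. lines 70/71)
generous = dilate_extraction_mask(mask, 60)  # headroom for arm movements (cf. line 72)
```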
  • the determination of the location of the stitching is important when ensuring that the viewer of the final combined image stream cannot discern where, in any given image in the final product, one stream finishes and another stream starts.
  • first 30 and second 40 video streams are shown in Figure 3, together with the combined stream 50 that can be broadcast or sent for further processing, etc.
  • the first stream 30, depicted by the dotted line is wholly within the bounds of the second stream 40. It may be that only part of the first stream 30 overlaps with the second stream 40, or it may be that an edge of first stream 30 has a common edge with second stream 40.
  • the background of the first stream 30 is also included within the background of the second stream 40, such that the edge 32 of house 31, cloud 33, hill 34 and horizon 35 all appear in each background. It is therefore possible to identify one or more of these features as being features along which stitching between the first and second video streams can occur. If, for example, edge 32 of the house 31 extended fully from top to bottom of the first video stream, then such an edge would provide an ideal location to stitch. Such features may be identified by the segmentation step discussed above. In practice, however, the scene will not necessarily contain features which extend fully from one side to the other, but rather the stitching will need to occur at various different features. For example, given the position of the subject 36, the stitching could occur along part of wall edge 32, then horizon 35 and finally along hill 34.
  • the definition of the hill and the horizon, being in the far distance, would also provide an ideal location to stitch the two streams together as these features would naturally include some blurring due to the distance. In this case, a blend or fade using stitching algorithms may be beneficial.
  • the edge 32 of the house 31 would, being in the foreground, require less blending/fading but more of a “clean cut”. Given the likely significant difference between the colour, contrast and/or brightness of the house relative to the background sky, such a clean cut would not be noticeable to the viewer.
  • the hill 34 and the horizon 35 may pass behind the subject 36, such that these would not provide suitable “hard” features to stitch along. In that case, it might be necessary to stitch across the sky in which the cloud 33 is located. Being a relatively blurry object, the cloud and/or the sky itself would provide a suitable feature to use for location of the stitching, as blurring or fading between first and second video streams would be less noticeable by the viewer.
  • This problem can be reduced further by making the “join” between the two streams lie along the edge of an object e.g. the edge 32 of the wall 31 within the combined stream, such that the edge of the object provides a “natural” change in appearance such that any change due to the appearance difference between the two streams is “lost” within the edge of the object.
  • the object may have a linear or substantially linear edge, such as a tree trunk, edge of a building or billboard or the like, or may include one or more curved sections, for example the curve of a cloud formation, a wheel or similar.
  • the natural border between objects may be very clear and well defined, e.g. a wall edge, such that whilst there is a significant change in appearance between the captured stream and the CGI stream, the significant change is already part of the scene and therefore is not distracting to the viewer.
  • Where a cloud is used as the join, this has a natural blurriness and/or gradual change of appearance, such that any smoothing that is generated by a traditional stitching algorithm is not noticeable.
  • Such a region may be known as a smooth area.
  • the join may be made up of any of one or more linear sections, one or more curved sections and/or one or more smooth areas, depending upon the make-up of the scene.
  • Preferably, any join line between the two streams is placed inboard of the edge of the screen, to avoid limitations on the choice of joining locations.
  • Preferably, any join line does not pass behind any actors, presenters and/or props, as to do so greatly increases the post-production work required.
  • the system may be configured to recognise the location of actors or indeed any other object or feature to be extracted, within the scene, for example by detecting a silhouette of a human form in the captured stream and comparing it to the CGI stream containing the same background images. The human form would be missing from the CGI stream. This is discussed above in relation to the extraction of an area of difference.
  • An alternative method for detecting the human form would be to have a further camera, for example an infrared camera, associated with the camera capturing the actors in front of the screen, such that the actor(s) would appear as a shadow within the image captured by the infrared camera.
  • the further camera may be associated with the main camera, i.e. the camera doing the filming.
  • This method may also be useful to detect shadows when using LED floors, if each actual light source has an associated IR light next to it to cast equivalent shadows, which are then detected by a fixed IR camera and synthesised into the scene.
  • the join line or lines may be chosen to be lines of colour and/or contrast and/or brightness changes.
  • the join line or lines may alternatively be in regions of a single colour and/or contrast and/or brightness such that fading between the joined streams may be effected.
  • edges or join lines may be automatically detected by known edge detection algorithms such as “Canny” or may be by way of image segmentation algorithms that can identify for example a building with a straight edge wall.
  • the joins could be manually selected depending upon the user’s desire and skill levels.
  • Segmentation techniques may include instance segmentation (i.e. labels each object) and semantic segmentation (i.e. labels particular objects such as humans) and/or may include algorithms such as “graph cut” or “max flow” to find a good seam based on whatever stitching strategy is used, such as minimum texture regions or strong boundaries.
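  • As an illustrative sketch of one possible "smart seam" strategy (and not necessarily the one used in practice), a cost map can favour strong edges and smooth areas away from the subject, and a minimum-cost seam can then be found by dynamic programming; the weights and the vertical-seam assumption are choices made for this example only.
```python
import cv2
import numpy as np

def seam_cost_map(image: np.ndarray, subject_mask: np.ndarray) -> np.ndarray:
    """Cheap along strong edges and far from the subject; prohibitive on it."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150).astype(np.float32) / 255.0
    dist = cv2.distanceTransform(255 - subject_mask, cv2.DIST_L2, 5)
    dist = dist / max(float(dist.max()), 1e-6)
    cost = 1.0 - 0.8 * edges - 0.2 * dist   # illustrative weighting
    cost[subject_mask > 0] = 1e6            # never cut through the subject
    return cost

def min_vertical_seam(cost: np.ndarray) -> np.ndarray:
    """Column index of the cheapest top-to-bottom seam for each row (DP)."""
    h, w = cost.shape
    acc = cost.copy()
    for y in range(1, h):
        left = np.roll(acc[y - 1], 1)
        left[0] = np.inf
        right = np.roll(acc[y - 1], -1)
        right[-1] = np.inf
        acc[y] += np.minimum(np.minimum(left, acc[y - 1]), right)
    seam = np.zeros(h, dtype=int)
    seam[-1] = int(np.argmin(acc[-1]))
    for y in range(h - 2, -1, -1):
        x = seam[y + 1]
        lo, hi = max(0, x - 1), min(w, x + 2)
        seam[y] = lo + int(np.argmin(acc[y, lo:hi]))
    return seam
```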
  • machine learning (AI) methods may alternatively be used to identify suitable join lines.
  • the join line may alter location during a given scene - for example, whilst an actor is in a first position, the edge of a building may provide the most suitable join line, but later in the scene, the actor may move to a second position in which the actor overlaps with the edge of the building. In that situation, it is not desirable for the join line to pass behind the actor, so the join line may be switched to an alternative location on the captured stream that is no longer behind the actor. It is desirable to minimise the switching of stitching locations, to minimise production difficulties and also because the human eye is adept at recognising changes: the fewer different stitching locations, the less opportunity there is for the stitch to be detected.
  • the camera is recording footage and sending it either to a storage unit or another computer (not shown, but e.g. 60 in Figure 5) to be processed.
  • the camera tracking device 25 is tracking the pose (position and orientation) of the camera. This pose information is sent to a computer 59.
  • the computer 59 receives the pose information and renders a virtual scene 81 that has a perspective aligned to the pose of the tracked camera. This rendering is depicted in frame 81. A subsection 82 of this rendering (denoted by the dashed frame) is then displayed on the screen 80, e.g. an LED wall. A human actor 19 is positioned in front of the LED wall 80.
  • the latency problem is now explained and stems from the change in perspective of the camera 11 from the dotted position A to the solid position B - there is a latency in updating the LED wall to match the new perspective of the camera in position B such that for at least a few frames when the camera is in the new position, the image shown on the LED screen is created based on the perspective of the camera in position A or in the transition from A to B.
  • the camera tracking device 25 sends the new pose information of the camera to the computer 59 that renders the virtual world 81.
  • the computer 59 processes the new camera pose data and re-renders the virtual scene to match the new camera perspective.
  • the camera footage (depicted by frame 83) has captured an LED wall 80 whose lines are no longer parallel with the 2D render 81. This causes a mismatch between the details in the camera footage and the virtual render as shown, for example, by the clouds 85.
  • the first step is to identify a transformation that explains the change in perspective due to the movement of the camera.
  • a transformation can be solved through the known trajectory of the camera (from the tracking information) and the known delay between the camera movement and re-rendering of the virtual scene.
  • the transformation does not have to be applied to the camera footage. We could also apply the transformation to the 2D virtual rendering to instead match the virtual rendering to the camera footage. It may be possible to apply a transformation to each of the camera footage and the 2D virtual rendering. The overall aim, however, is the same, namely to match the perspectives of the two streams.
  • the transformation may be a simple shifting of position of the captured camera footage 83, e.g. one or more of up, down, left or right, or may include a more complex transformation to address the sort of misalignment shown in Figure 7.
  • One example of transformation is an affine transformation (or a projective transformation, which can provide greater generality). Methods for finding such transformations include the Direct Linear Transform, which simply involves solving a system of linear equations via the identification of corresponding points. Since the system can use, for example, StarTracker’s tracking information, it is possible to recover the transformation more easily.
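  • A hedged sketch of such a perspective correction: with four or more corresponding points (for example the LED-wall corners located in the footage and in the render, or points predicted from tracking data), OpenCV's findHomography solves the DLT-style system of equations; the corner coordinates and file path below are invented for the example.
```python
import cv2
import numpy as np

def estimate_perspective_correction(pts_footage: np.ndarray,
                                    pts_render: np.ndarray) -> np.ndarray:
    """Projective transform mapping camera-footage points onto the 2D render.

    pts_footage / pts_render are Nx2 arrays of corresponding points; with
    four or more correspondences and method=0 this is the least squares,
    Direct Linear Transform style solution.
    """
    H, _ = cv2.findHomography(pts_footage.astype(np.float32),
                              pts_render.astype(np.float32), method=0)
    return H

# Example: re-square a slightly skewed view of the LED-wall corners.
footage_corners = np.array([[102, 48], [1818, 62], [1822, 1034], [98, 1020]])
render_corners = np.array([[100, 50], [1820, 50], [1820, 1030], [100, 1030]])
H = estimate_perspective_correction(footage_corners, render_corners)
frame = cv2.imread("footage_frame.png")  # placeholder path
if frame is not None:
    corrected = cv2.warpPerspective(frame, H, (1920, 1080))
```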
  • Another challenge to overcome is how to match the frames to be transformed into each other.
  • One method is to identify the delay between the camera pose change and the LED wall update, which can be obtained empirically by timing when the rendering updates after shifting the camera pose.
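  • One way this empirical timing could be done, sketched here under the assumption that a per-frame camera pan signal and a per-frame measure of the displayed background's shift are both available, is to pick the lag that maximises their cross-correlation; the function and signal names are illustrative, not taken from the application.
```python
import numpy as np

def estimate_render_delay(pose_signal: np.ndarray, render_signal: np.ndarray,
                          fps: float, max_lag_frames: int = 30) -> float:
    """Delay (seconds) between a camera-pose signal and the LED-wall update.

    Both signals are sampled once per frame (e.g. pan angle vs. measured
    horizontal shift of the displayed background); the lag with the highest
    normalised cross-correlation is taken as the render/display latency.
    """
    p = (pose_signal - pose_signal.mean()) / (pose_signal.std() + 1e-9)
    r = (render_signal - render_signal.mean()) / (render_signal.std() + 1e-9)
    best_lag, best_score = 0, -np.inf
    for lag in range(0, max_lag_frames + 1):
        n = len(p) - lag
        score = float(np.dot(p[:n], r[lag:lag + n]) / n)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / fps
```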
  • the next step is to key the actor, or any other foreground object, out of the camera footage which may result in large areas of the LED wall being removed.
  • This is shown schematically in Figure 4 where the various lines 70, 71 and 72 represent various levels of dilation and/or keying out of the subject to be extracted.
  • edges of the Key may be non-trivial to merge into the 2D virtual world. The edges of the Key could be made to align with an actual edge in the picture, e.g. the contours of furniture; by using these edges, the join lines will be less obvious to the audience. The edges of the Key may also be kept some distance away from the actor’s actual body to ensure that finer features such as hair, folds of clothes and/or even motion blur are not cropped away.
  • the method of keying the actor or an object out of a scene can be conducted in a variety of ways.
  • The general name for this task is image segmentation.
  • For image segmentation there are algorithms such as edge detection, k-means clustering and watershed, or network-trained methods such as Mask R-CNN.
  • alignment can be performed through methods such as feature descriptor matching or template matching.
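  • As a sketch of feature descriptor matching (one of several possible approaches, not necessarily the one used in production), ORB features with a ratio test yield corresponding points that can feed the homography estimation sketched above; the detector settings are illustrative.
```python
import cv2
import numpy as np

def match_features(footage: np.ndarray, render: np.ndarray):
    """Corresponding points between the camera footage and the 2D render."""
    orb = cv2.ORB_create(nfeatures=2000)
    g1 = cv2.cvtColor(footage, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(render, cv2.COLOR_BGR2GRAY)
    kp1, des1 = orb.detectAndCompute(g1, None)
    kp2, des2 = orb.detectAndCompute(g2, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    good = []
    for pair in matcher.knnMatch(des1, des2, k=2):
        # Lowe's ratio test to reject ambiguous matches.
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    pts_footage = np.float32([kp1[m.queryIdx].pt for m in good])
    pts_render = np.float32([kp2[m.trainIdx].pt for m in good])
    return pts_footage, pts_render
```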
  • Adjusting these properties to make a seamless final image may include: compensating for colour differences; compensating for different exposure levels; and blending colour and lighting gradients in the images (e.g. Poisson image editing).
  • the camera footage Key can be warped to match features in the 2D virtual render. Warping is typically the localised bending/curving or other transformation that aligns features between the camera footage and the virtual scene. For example, if the clouds 85 still do not perfectly align, localised warping can be applied to ensure that edges or other features correctly align. The key is to only warp local areas, so as to avoid affecting the rest of the Key, which may degrade its quality.
  • the warping operation can be applied to either the Key or the 2D virtual render or to both.
  • the overall aim is simply to join up features to create a seamless product from the two streams.
  • the methods used for image warping include any of: finding corresponding feature descriptors, finding corresponding edges or creating triangulated segments, forward mapping and inverse mapping, and/or two-pass mesh warping, to name a few related techniques.
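  • The following is a simplified sketch of localised warping by inverse mapping: per-feature offsets are spread with Gaussian weights and applied with cv2.remap, so that only the neighbourhood of each mismatched feature moves; the Gaussian falloff is an assumption of this example, not a requirement of the method.
```python
import cv2
import numpy as np

def local_warp(image: np.ndarray, src_pts: np.ndarray, dst_pts: np.ndarray,
               sigma: float = 60.0) -> np.ndarray:
    """Nudge `image` so that each src point lands on its dst point.

    Builds a dense displacement field as a Gaussian-weighted sum of the
    per-point offsets and applies it by inverse mapping (cv2.remap),
    leaving areas far from any feature untouched.
    """
    h, w = image.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w, dtype=np.float32),
                                 np.arange(h, dtype=np.float32))
    dx = np.zeros((h, w), np.float32)
    dy = np.zeros((h, w), np.float32)
    for (sx, sy), (tx, ty) in zip(src_pts, dst_pts):
        weight = np.exp(-((grid_x - tx) ** 2 + (grid_y - ty) ** 2)
                        / (2 * sigma ** 2))
        dx += weight * (sx - tx)   # inverse map: where to sample from
        dy += weight * (sy - ty)
    map_x = grid_x + dx
    map_y = grid_y + dy
    return cv2.remap(image, map_x, map_y, cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REFLECT)
```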
  • the geometric transformation allows the system to align the virtual images and the camera images together.
  • the same transformation can be reused to allow other special effects or details to be added onto the background behind the actor (e.g. explosions).
  • the perspective distortions will remain coherent with the actor in the camera image.
  • the method of reusing the identified transformation for both virtual and camera image alignment may include the following:
  • the transformation for alignment is estimated for a single frame.
  • the saved transformation may be used in post production to add further special effects or additional background details onto the same frame to ensure coherent perspective distortion.
  • the background around the actor may have been dilated and, if this has been carried out, the dilated region will remain the same as the original LED background content, so no additional special effects will appear in the dilated sections around the actor. This may be compensated for, or the dilation may be omitted.
  • the system and/or methods may include an image alignment quality consistency checker.
  • the transformation mentioned above defines the alignment between the virtual image and the camera image. This transformation will be estimated based on a variety of factors.
  • the factors for estimating the transformation could be particularly poor for a single frame and therefore the estimation is poor compared to neighbourhood frames.
  • the frame can be flagged for further human operator assessment during or after the entire video sequence is stitched. This allows the operator to fast track to the problematic frame to be fixed instead of watching the entire video sequence frame by frame.
  • A transformation is an assembly of numeric parameters. Therefore, neighbouring frames, which contain much of the same content and a similar camera perspective, should be expected to have similar numeric parameters estimated for their transformations. This enables a measure of the similarity of transformation between frames to be obtained, and frames whose transformation parameters deviate by more than a certain threshold can be flagged.
  • the alignment procedure searches for a transformation that best aligns the virtual and the camera images based on a numeric metric (e.g. mutual information, cross-correlation, or the sum of squared errors for corresponding points).
  • the tracked motion data from devices such as the StarTracker can be used to constrain the directionality of the transformation.
  • the transformations can be expected to have zero directionality or to be in a direction related to the camera motion. Such constraints help to prevent poor data situations from producing erratic estimated transformations. When the estimated transformation is not coherent with the camera motion, it can be flagged.
  • the method of image alignment quality consistency checking therefore may comprise: 1. Input virtual stream in line with any of the methods described above
  • Such a method will largely automate the checking process and will reduce the human burden by only requiring human intervention for those frames that have been identified as potentially needing human involvement.
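  • A minimal sketch of such a consistency check, assuming a per-frame 3x3 alignment transformation is already available; the window size and threshold are illustrative values, not values taken from the application.
```python
import numpy as np

def flag_inconsistent_frames(homographies, window=5, threshold=0.05):
    """Indices of frames whose alignment transform jumps away from its neighbours.

    `homographies` is a sequence of 3x3 matrices, one per frame. Each is
    normalised and flattened; a frame is flagged when its parameters deviate
    from the median of the surrounding window by more than `threshold`
    (relative), so it can be sent to a human operator for review.
    """
    params = np.stack([(H / H[2, 2]).ravel() for H in homographies])
    flagged = []
    for i in range(len(params)):
        lo, hi = max(0, i - window), min(len(params), i + window + 1)
        neighbours = np.delete(params[lo:hi], i - lo, axis=0)
        if neighbours.size == 0:
            continue
        ref = np.median(neighbours, axis=0)
        deviation = np.linalg.norm(params[i] - ref) / (np.linalg.norm(ref) + 1e-9)
        if deviation > threshold:
            flagged.append(i)
    return flagged
```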
  • imagery may be displayed behind a subject as video is being captured of the subject.
  • the imagery may be displayed on a wall formed of light emitting displays, for example LED displays, or may be projected on to a screen.
  • the imagery may be formed so as to correct for the position of a camera that is capturing video of the subject.
  • After the video of the subject and the CGI background has been captured, it may be advantageous to post-process the captured video to replace some or all of the captured imagery of the CGI background with directly generated CGI imagery. In that way, the resolution of the CGI imagery as seen by an eventual viewer might be improved.
  • the video stream as captured by the camera 100 will be referred to as the captured video stream. This is conveyed at 101.
  • the video stream derived from the captured video stream in which regions of the captured CGI have been replaced by directly generated CGI will be referred to as the enhanced video stream.
  • the post-processing may be performed by a post-processing computer 103 having one or more processors 104 and having access to a memory 105 storing, in non-transient form, instructions executable by the processor(s) to cause them to perform the processing tasks described herein.
  • the same computer, or another computer 106, may generate the CGI. That CGI may be transmitted to the screen 107 or a projector for display, and also made available for the post-processing tasks.
  • the post-processing computer processes them to estimate which spatial regions of them represent imagery captured of the CGI background displayed on the screen and/or which spatial regions of them represent imagery captured of the subject 108. This may be done in any of multiple ways, for example:
  • the post-processing computer may compare the captured video with the images that have been displayed on the wall 107 to detect similarity therebetween. In making the comparison, the post-processing computer may take account of the position of the camera 100 relative to the screen 107 and the resulting distortion of the displayed images and/or of any lighting applied to the screen. This method has been found to be effective for many situations, but can benefit from augmentation in some situations: for example when the subject is dimly or unevenly lit.
  • the camera 100 may carry or be closely associated with a distance sensor 109.
  • the distance sensor can sense the distance from itself to objects in front of it, in the field of view of the camera.
  • the distance sensor may be a time-of-flight sensor such as a LIDAR sensor or a RADAR sensor.
  • the distance sensor may generate a representation of the distance to objects at a range of spatial locations across the field of view of the camera.
  • Data from the distance sensor may be passed to the post processing computer 103. Since the subject 108 is in front of the screen 107, regions of the captured video that represent the subject can be differentiated using the data from the distance sensor.
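  • A hedged sketch of this differentiation, assuming a metric depth map registered to the camera image and a known camera-to-screen distance; the margin value and function name are illustrative.
```python
import cv2
import numpy as np

def subject_mask_from_depth(depth_m: np.ndarray, screen_distance_m: float,
                            margin_m: float = 0.3) -> np.ndarray:
    """Binary mask of pixels significantly closer than the LED screen.

    depth_m is a per-pixel depth map (metres) registered to the camera
    image; screen_distance_m is the known camera-to-screen distance.
    """
    mask = (depth_m < (screen_distance_m - margin_m)).astype(np.uint8) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # remove speckle
```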
  • the camera 100 may carry or be closely associated with an emitter 110 of light of a specific frequency or frequency band, for example an infra-red emitter.
  • the emitted frequency is selected to be one that is not detectable by the camera 100.
  • the emitter is directed at the subject and the screen.
  • the subject forms a shadow in the emitted light field against the screen.
  • a second camera 111 is directed at the screen. Conveniently the second camera is directed obliquely at the screen so that its view of the screen is substantially uninterrupted by the subject. There may be multiple such second cameras to mitigate the effect of obstruction by the subject.
  • the second camera(s) could be integrated into the screen.
  • the second camera is such as to detect light of the frequency emitted by the emitter 110.
  • the second camera can therefore capture an image of the shadow against the screen. That captured data can be passed to the post-processing computer. The post-processing computer can then use that data to help differentiate regions of the captured video that represent the subject.
  • this arrangement may be reversed, with one or more illuminators in the relevant frequency being arranged to obliquely illuminate the screen, or being integrated with the screen, and a detector for that frequency being carried by or closely associated with the camera 100.
  • Object recognition algorithms are known. These are typically implemented by neural networks. Object recognition algorithms take an image or a video as input and identify spatial regions of the image/video that represent individual objects. The algorithms may also estimate what type of object is represented at the identified regions. The post-processing computer may implement such an algorithm on the displayed CGI (as received at 112) and the captured video. Regions of the captured video where the identified objects differ in position and/or type may be taken to represent regions where the displayed CGI is overlain in the captured video by the subject. It will be noted that in this process it is not necessary for the object recognition algorithm to accurately identify the type of the object.
  • the post-processing computer can process the captured video to form the output video. It does this by replacing selected regions of the captured video with corresponding regions of the original CGI. This results in those regions appearing generally the same as in the captured video, but with potentially enhanced image quality.
  • the post-processing computer can replace all the CGI regions of the captured video with original CGI.
  • this can have the disadvantage that considerable processing power is needed to avoid artefacts at the edge of the subject: it is difficult to comprehensively detect small features at the edge of the subject, such as strands of hair, and such details can be lost.
  • One way to address this is to apply a border or halo around the detected subject regions in the captured video.
  • the post-processing computer then replaces CGI in the captured video only outside the halo.
  • Within the halo, the captured video of the subject and of the CGI as displayed on the screen is retained. It has been found that when a halo is implemented in this way, the post-processing computer has greater facility to adjust in real time the transition, across a frame of the output video, between the CGI in the captured video and the original CGI to be inserted.
  • That transition may be managed in multiple ways, for example:
  • Transformations such as any one or more of colour corrections, blur, sharpening, vignetting and affine transformations may be applied to one or both of the captured video and the original CGI. This may be done so that the CGI in the captured video and the original CGI to be inserted match each other in the relevant respect.
  • the amount of such transformations to apply may be estimated by the post-processing computer by comparison of the CGI in the captured video and the original CGI.
  • Other information may be used, such as stored information describing the performance of an image sensor of the camera 100, the behaviour of a lens used by the camera 100 to capture the captured video, the position of camera 100 relative to the screen, and lighting effects applied to the screen.
  • A blending or smoothing algorithm may be used to merge the captured video and the original CGI around the edge of the halo.
  • the width of the border or halo may be greater than 10 pixels, greater than 20 pixels or greater than 50 pixels.
  • the width of the border or halo may be greater than 0.5%, 1%, 2% or 5% of the shortest side length of a frame of the captured video.
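  • Putting the halo idea together, a simplified sketch (with a feathered transition that is one possible implementation of the blending step, not mandated by the application) might look like this; the halo and feather sizes are illustrative.
```python
import cv2
import numpy as np

def composite_with_halo(captured: np.ndarray, original_cgi: np.ndarray,
                        subject_mask: np.ndarray,
                        feather_px: int = 15) -> np.ndarray:
    """Replace captured CGI with original CGI only outside a halo round the subject."""
    h, w = captured.shape[:2]
    halo_px = max(10, int(0.01 * min(h, w)))   # e.g. at least 10 px and 1% of short side
    kernel = cv2.getStructuringElement(
        cv2.MORPH_ELLIPSE, (2 * halo_px + 1, 2 * halo_px + 1))
    keep = cv2.dilate(subject_mask, kernel)    # subject + halo: keep captured video here
    alpha = cv2.GaussianBlur(keep.astype(np.float32) / 255.0,
                             (2 * feather_px + 1, 2 * feather_px + 1), 0)
    alpha = alpha[..., None]                   # soft transition at the halo edge
    out = (alpha * captured.astype(np.float32)
           + (1.0 - alpha) * original_cgi.astype(np.float32))
    return out.astype(np.uint8)
```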
  • Figure 9 shows a floor 113 in front of the screen 107.
  • floor 113 is coloured in a known colour such as green or blue.
  • the post-processing computer can then replace the floor with CGI in the manner generally known for green screen filming. In conventional systems this might be expected to result in poor quality output, because it would be difficult to make the replaced CGI on the floor match well with the CGI as displayed on the screen.
  • Just as the displayed CGI as seen in the captured video can be replaced with parts of the original CGI as described above, so can the image of the floor in the captured video be replaced with other parts of that same original CGI. Therefore, a good match may be possible between the regions represented by the floor and the wall.
  • the original CGI as inserted or stitched over the wall regions and the floor regions may meet each other (where the junction of the floor and the wall is visible to the camera 100) and may be formed from contiguous portions of the original CGI. This can avoid artefacts at the boundary.
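  • As an illustrative sketch of the floor replacement, a simple HSV chroma key can select the coloured floor and substitute the corresponding pixels of the same original CGI; the HSV bounds are assumptions made for the example and would be tuned to the actual floor colour and studio lighting.
```python
import cv2
import numpy as np

def replace_green_floor(captured: np.ndarray, original_cgi: np.ndarray) -> np.ndarray:
    """Swap a green-keyed floor for the corresponding region of the original CGI."""
    hsv = cv2.cvtColor(captured, cv2.COLOR_BGR2HSV)
    floor = cv2.inRange(hsv, (35, 60, 60), (85, 255, 255))   # green-ish pixels
    floor = cv2.morphologyEx(
        floor, cv2.MORPH_CLOSE,
        cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 9)))
    out = captured.copy()
    out[floor > 0] = original_cgi[floor > 0]   # contiguous with the wall replacement
    return out
```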
  • Shadows in the light from that illuminator may be formed on the floor 113. Those shadows may be detected by camera 111 or similar cameras.
  • the data received from camera 111 or similar cameras is passed to post-processing computer 103. It can then process the received background imagery 112 to apply shadows (darkening) thereto in regions corresponding to those where shadows were detected on the floor. This can improve the perception of realism in the output image.

Abstract

A method of compositing a video stream comprises the steps of: obtaining first and second video streams, wherein there is an overlap between the background of the first and second video streams; identifying a common feature in the background of the first and second video streams; and stitching the first and second video streams together along the identified feature within the background of the first and second video streams.
PCT/GB2022/051721 2021-07-07 2022-07-04 Image stitching WO2023281250A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GBGB2109804.1A GB202109804D0 (en) 2021-07-07 2021-07-07 image stitching
GB2109804.1 2021-07-07
GB2114637.8 2021-10-13
GB2114637.8A GB2609996A (en) 2021-07-07 2021-10-13 Image stitching

Publications (1)

Publication Number Publication Date
WO2023281250A1 (fr)

Family

ID=82558083

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2022/051721 WO2023281250A1 (fr) 2021-07-07 2022-07-04 Image stitching

Country Status (1)

Country Link
WO (1) WO2023281250A1 (fr)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013156776A1 (fr) 2012-04-18 2013-10-24 Michael Geissler Computer generated imagery compositors
US20150221066A1 (en) * 2014-01-31 2015-08-06 Morpho, Inc. Image processing device and image processing method
US20170230585A1 (en) * 2016-02-08 2017-08-10 Qualcomm Incorporated Systems and methods for implementing seamless zoom function using multiple cameras
EP3537704A2 (fr) * 2018-03-05 2019-09-11 Samsung Electronics Co., Ltd. Electronic device and image processing method
US20200145644A1 (en) * 2018-11-06 2020-05-07 Lucasfilm Entertainment Company Ltd. LLC Immersive content production system with multiple targets
CN110855905A (zh) * 2019-11-29 2020-02-28 Lenovo (Beijing) Co., Ltd. Video processing method and apparatus, and electronic device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAOWU CHEN ET AL: "Video motion stitching using trajectory and position similarities", SCIENCE CHINA INFORMATION SCIENCES, SP SCIENCE CHINA PRESS, HEIDELBERG, vol. 55, no. 3, 25 February 2012 (2012-02-25), pages 600 - 614, XP035020391, ISSN: 1869-1919, DOI: 10.1007/S11432-011-4534-Y *

Similar Documents

Publication Publication Date Title
US11019283B2 (en) Augmenting detected regions in image or video data
CN111050210B (zh) Method of performing operations, video processing system and non-transitory computer-readable medium
US10600157B2 (en) Motion blur simulation
US6724386B2 (en) System and process for geometry replacement
US9747870B2 (en) Method, apparatus, and computer-readable medium for superimposing a graphic on a first image generated from cut-out of a second image
US8768099B2 (en) Method, apparatus and system for alternate image/video insertion
EP3668093B1 (fr) Method, system and apparatus for capturing image data for free viewpoint video
US8922718B2 (en) Key generation through spatial detection of dynamic objects
US20060165310A1 (en) Method and apparatus for a virtual scene previewing system
US9747714B2 (en) Method, device and computer software
KR102198217B1 (ko) Apparatus and method for generating a stitched image based on a look-up table
US11676252B2 (en) Image processing for reducing artifacts caused by removal of scene elements from images
JP2014178957A (ja) Learning data generation device, learning data creation system, method and program
US20220180475A1 (en) Panoramic image synthesis device, panoramic image synthesis method and panoramic image synthesis program
US11823357B2 (en) Corrective lighting for video inpainting
JP6272071B2 (ja) Image processing apparatus, image processing method and program
Yeh et al. Real-time video stitching
US11128815B2 (en) Device, method and computer program for extracting object from video
CN114399610A (zh) Texture mapping system and method based on guidance priors
JP2017050857A (ja) Image processing apparatus, image processing method and program
WO2023281250A1 (fr) Image stitching
GB2609996A (en) Image stitching
US10078905B2 (en) Processing of digital motion images
KR101718309B1 (ko) Apparatus and method for automatic registration and panoramic image generation using colour information
KR101893142B1 (ko) Object region extraction method and apparatus therefor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22741832

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE