WO2022023142A1 - Virtual window - Google Patents

Virtual window

Info

Publication number
WO2022023142A1
Authority
WO
WIPO (PCT)
Prior art keywords
dimensional
real
virtual
image
viewer
Prior art date
Application number
PCT/EP2021/070397
Other languages
French (fr)
Inventor
John Frederick MOORE
Karl Moore
Original Assignee
Roomality Limited
Application filed by Roomality Limited filed Critical Roomality Limited
Publication of WO2022023142A1 publication Critical patent/WO2022023142A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/03Arrangements for converting the position or the displacement of a member into a coded form
    • G06F3/0304Detection arrangements using opto-electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/10Geometric effects
    • G06T15/20Perspective computation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/204Image signal generators using stereoscopic image cameras
    • H04N13/239Image signal generators using stereoscopic image cameras using two 2D image sensors having a relative position equal to or related to the interocular distance
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N2013/0074Stereoscopic image analysis
    • H04N2013/0081Depth or disparity estimation from stereoscopic image signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N2213/00Details of stereoscopic systems
    • H04N2213/006Pseudo-stereoscopic systems, i.e. systems wherein a stereoscopic effect is obtained without sending different images to the viewer's eyes

Definitions

  • the present invention relates to image processing, including but not limited to image generation and image display.
  • a viewer’s eyes are surrounded by a virtual reality headset, where sensors in the headset determine the orientation and look direction of the viewer and the headset presents a digitally generated image on a display screen in the headset, which depends on the look direction of the viewer, such that the viewer finds themselves visually in a digitally generated three-dimensional coordinate space which changes depending on the orientation of their head in real three dimensional space.
  • There may be other sensory interfaces such as earphones or audio headsets, and body worn position sensors.
  • the known virtual reality headsets have the characteristics that they generally enclose the viewer’s optical senses so that the viewer cannot see the real environment around them.
  • the viewer is fully immersed in a virtual environment in respect of their sight sensory perception, and this limits the application of this type of system to applications where the viewer’s sight senses are fully occupied by a digitally generated virtual scene, with no input from the real environment.
  • the view direction can be varied by the up/ down and left / right cursors, or other assigned keyboard keys.
  • the realism of the screen view is limited because the virtual view position of the pilot user is effectively behind the video monitor or screen, whereas the actual human user is positioned in front of the video monitor.
  • the embodiments and specific methods provide an interactive system that enables scene image information displayed within a monitor of specific dimensions to be adjusted (resized and shifted) according to the position of a viewer with respect to the display monitor.
  • the system responds precisely in real-time for a realistic experience without the need for any wearables such as identification labels, goggles or headsets.
  • the system is also designed so that it can be readily extended to increase coverage and performance by the modular addition of a multiplicity of tracking and/or display systems.
  • the system comprises: a tracking module for tracking the location and position of a human viewer in relation to a real-world three-dimensional space, for example a room; a virtual three-dimensional scene generation engine for generating a virtual three-dimensional scene; a module for adjusting or resizing the digitally generated three-dimensional scene data to account for a viewpoint of a determined position of a viewer in relation to a real three-dimensional space; a scene formatting module for reformatting the digitally generated virtual three-dimensional scene for display on a visual display monitor; and a display monitor for displaying a virtual scene.
  • the tracking system is operable to passively track a human subject within a field of view of the tracking system, to identify human subject viewers by identifying facial features of the human viewer; and to identify a position on a face of a human viewer which is representative of a viewpoint of that viewer. This is achieved by capturing video images of the subject in the field of view of the tracking system in real-time as a stream of video images, analysing the video images using a first machine learning algorithm for facial recognition; and performing recognition of facial landmarks, using a second machine learning algorithm, and calculating the x, y coordinates in two-dimensional image frame space of a mid-eye position of a detected face.
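
A minimal sketch of this per-frame landmark stage, assuming hypothetical `detect_faces` and `detect_landmarks` callables standing in for the first and second machine learning algorithms (the function names and landmark keys are illustrative assumptions, not part of the disclosure):

```python
import numpy as np

def mid_eye_xy(landmarks_2d):
    """Return the mid-eye point (x, y) in two-dimensional image frame space.

    `landmarks_2d` is assumed to map landmark names to (x, y) pixel
    positions produced by the facial-landmark model (second ML stage).
    """
    left = np.asarray(landmarks_2d["left_eye_centre"], dtype=float)
    right = np.asarray(landmarks_2d["right_eye_centre"], dtype=float)
    mid = (left + right) / 2.0            # midpoint of the line joining the eyes
    return float(mid[0]), float(mid[1])

def track_frame(frame, detect_faces, detect_landmarks):
    """Yield a mid-eye (x, y) for every face detected in one video frame."""
    for face_box in detect_faces(frame):                 # ML1: face recognition
        landmarks = detect_landmarks(frame, face_box)    # ML2: facial landmark recognition
        yield mid_eye_xy(landmarks)
```
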
  • a two dimensional digitally generated display represents a scene in three dimensional space, wherein the two dimensional image changes to represent a view of the scene in three dimensional space from different locations in a real three dimensional space on one side of the display device, and the three dimensional space is a virtual space on the other side of the display device from the viewing location.
  • Specific embodiments and methods disclosed herein aim to provide a method and apparatus for providing a view of a virtual scene in a virtual three- dimensional coordinate space in a two-dimensional view on a flat or curved screen in which the two-dimensional view on the screen, as viewed from the viewing position of a user, appears to the user as if the user were in a three dimensional space.
  • the apparatus and methods herein operate to display real-time digitally generated scene images on a display monitor (also referred to herein as a virtual window or as a synthetic window).
  • the relative position of the viewer to the display monitor causes different parts of the scene image to be displayed on the display monitor as the scene would appear to a viewer from different perspectives or viewpoints in real space.
  • the embodiments herein may display real-time scene images on a display monitor also referred to as a virtual window.
  • the position of the viewer causes different parts of the virtual three-dimensional scene image to be displayed as it appears as viewed from different perspectives.
  • the scene image will adjust according to a position and angle of the human viewer in real time as the viewer moves around.
  • the viewer’s position is taken to be the midpoint between the eyes of the viewer and is referred to herein as the mid eye point.
  • a room there may be one or more visual display devices, each placed upon a different wall or face in a room.
  • a single virtual scene may be displayed upon one or more virtual windows, each represented by a separate visual display device. Further, there may be more than one visual display device on a single wall.
  • a plurality of separate visual display devices may be coordinated, so that a viewer present in a room and surrounded by a plurality of visual display devices, from their viewpoint in the room may see a virtual three-dimensional scene displayed as two-dimensional images on the one or plurality of visual display devices, with the overall effect that as a person moves around within a room or space, the one or plurality of virtual windows surrounding that viewer appear as if the viewer is looking out onto a real three-dimensional space through the plurality of visual display devices, but the images presented on the plurality of display devices represent a virtual scene, instead of a scene in real three- dimensional space.
  • Each individual visual display device gives a corresponding view onto the virtual three-dimensional scene as would be seen through an aperture surrounding the perimeter of the visual display device, if that aperture were a transparent aperture looking out onto a real three-dimensional space, except that the virtual scene replaces the real scene.
  • an image processing apparatus comprising: a tracking apparatus for identifying a position of a viewer in relation to said visual display device; a scene generator for generating a virtual three-dimensional scene; a scene adjuster for modifying an output of said three dimensional scene generator in response to an output of said tracking apparatus; and a visual display device for displaying an image of said virtual three- dimensional scene; wherein said scene adjuster receives location information about a position of said viewer from said tracking apparatus, and modifies in real time a virtual three dimensional scene generated by said scene generator such that said modified scene corresponds to a view which the viewer would see from their position through an aperture between said real three dimensional space and said virtual three dimensional space.
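
One way such a scene adjuster can be realised is with an off-axis (asymmetric frustum) projection, in which the display perimeter defines the window onto the virtual space and the tracked viewer position defines the eye point. The sketch below follows the well-known generalized perspective projection formulation; the corner-naming convention and OpenGL-style matrix are assumptions, not taken from the disclosure:

```python
import numpy as np

def off_axis_projection(pa, pb, pc, pe, near, far):
    """Asymmetric-frustum projection matrix for a tracked viewer.

    pa, pb, pc: lower-left, lower-right and upper-left corners of the display
    perimeter in real-space coordinates; pe: tracked mid-eye position.
    Returns a combined 4x4 projection * view matrix (OpenGL conventions).
    """
    pa, pb, pc, pe = (np.asarray(p, dtype=float) for p in (pa, pb, pc, pe))
    vr = pb - pa; vr /= np.linalg.norm(vr)           # screen right axis
    vu = pc - pa; vu /= np.linalg.norm(vu)           # screen up axis
    vn = np.cross(vr, vu); vn /= np.linalg.norm(vn)  # screen normal, towards the viewer

    va, vb, vc = pa - pe, pb - pe, pc - pe           # eye-to-corner vectors
    d = -np.dot(va, vn)                              # perpendicular eye-to-screen distance
    l = np.dot(vr, va) * near / d                    # frustum extents on the near plane
    r = np.dot(vr, vb) * near / d
    b = np.dot(vu, va) * near / d
    t = np.dot(vu, vc) * near / d

    proj = np.array([
        [2 * near / (r - l), 0.0, (r + l) / (r - l), 0.0],
        [0.0, 2 * near / (t - b), (t + b) / (t - b), 0.0],
        [0.0, 0.0, -(far + near) / (far - near), -2 * far * near / (far - near)],
        [0.0, 0.0, -1.0, 0.0],
    ])
    rot = np.eye(4); rot[0, :3], rot[1, :3], rot[2, :3] = vr, vu, vn   # align world to screen axes
    trans = np.eye(4); trans[:3, 3] = -pe                              # move the eye to the origin
    return proj @ rot @ trans
```

As the tracked mid-eye position pe moves, the frustum skews so that the rendered view stays consistent with a straight-line view from the viewer through the display perimeter.
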
  • said tracking apparatus comprises: first and second video cameras, each producing a video stream signal of a real three-dimensional space; a video grabber for converting said video stream signals into a stream of image data frames; and a memory storage device for storing said streams of image data frames.
  • Said screen display may have a viewable surface of a shape selected from the set: a flat planar screen; a curved screen; a part-cylindrical screen.
  • said means for determining a position of a viewer comprises a pair of spaced apart video cameras directed with their respective fields of view overlapping each other in real three-dimensional space.
  • the means for generating a virtual three-dimensional scene comprises a digital three-dimensional scene generator, operable to generate a virtual three-dimensional scene having one or a plurality of moving objects.
  • the view seen by the viewer in three-dimensional coordinate space is modified in real-time depending upon the viewer’s actual real position in real three dimensional space in relation to the physical display screen.
  • said image processing apparatus further comprises a video grabber for receiving a stream of video data; and converting said stream of video data into a stream of image data frames.
  • a means for adding a timestamp to each image data frame is provided.
  • said apparatus comprises a machine learning algorithm for detecting facial images.
  • said machine learning algorithm operates to: identify images in said image frames containing human faces; and determine a position in real three-dimensional space of said human face images.
  • said machine learning algorithm operates to identify facial landmark features from a stream of said image data frames; and determine a mid-eye position in real three-dimensional space from said image data frames.
  • the apparatus comprises a machine learning algorithm applied to detect body images.
  • the machine learning algorithm may operate to:
  • the image processing apparatus further comprises a formatter operable for: receiving coordinates of an outer perimeter of said display device in virtual three-dimensional space; cropping a view of said virtual three-dimensional scene to coincide with a straight line view from a position of a viewer in said virtual three-dimensional space, through said outer perimeter of said display device.
  • an apparatus for generating a screen image which represents a three- dimensional view from the perspective of a human viewer comprising: a visual display screen defining a perimeter of a viewing aperture or window on which can be displayed a real time virtual image; means for determining a physical position of said viewer’s eyes in relation to a physical position of said viewing aperture or window; means for generating a virtual three-dimensional coordinate space populated with a plurality of objects; means for positioning a representation of said viewer in said virtual three- dimensional coordinate space; means for positioning the viewing aperture/window in the virtual 3D space; means for determining a view as seen by said representation of said viewer in said virtual three-dimensional coordinate space; means for generating a two-dimensional image of said three-dimensional view as seen by said representation of said viewer through the aperture or window said image changing in real time, depending on an orientation of said viewer relative to said aperture /window within said virtual three-dimensional coordinate space.
  • a method for generating a video display screen view which represents a three- dimensional view from the perspective of a subject viewer, positioned adjacent a video display screen, said method comprising: determining a physical position of said subject viewer in relation to a physical position of said visual display screen in real three-dimensional space; generating a virtual three-dimensional scene populated with a plurality of objects in virtual three dimensional space; generating a view of said virtual three dimensional scene, as seen through an aperture coinciding with a perimeter of said display screen, and as viewed from said position in real three dimensional space; and displaying said view on said visual display device.
  • said process of generating a view of said virtual three dimensional scene comprises adjusting a view angle of said three-dimensional scene to correspond with a position of said subject viewer in real three- dimensional space.
  • said process of generating a view of said virtual three-dimensional scene comprises adjusting sizes of said plurality of objects in said virtual three-dimensional space, according to a coordinate set of a said position of a viewer in real three-dimensional space.
  • said method further comprises formatting said adjusted virtual three-dimensional scene for display as a two-dimensional image of said virtual three-dimensional scene on said display monitor.
  • said process of determining a physical position of said subject viewer in relation to a physical position of said visual display screen comprises capturing first and second scenes of video data across a field of view extending in said real three-dimensional space.
  • the method preferably further comprises converting said first and second scenes of video data to corresponding first and second streams of video image frames.
  • the method preferably further comprises applying a first machine learning algorithm to detect facial images contained in said first and second streams of video images.
  • the method further comprises applying a third machine learning algorithm to detect body images contained in the first and second streams of video images.
  • the method further comprises applying a fourth machine learning algorithm to detect facial landmark features from the first and second streams of video images.
  • the method preferably further comprises applying a second machine learning algorithm to detect facial landmark features from said first and second streams of video images.
  • the method preferably further comprises determining a mid-eye position of detected facial images contained in said first and second streams of video images.
  • The method preferably further comprises determining a set of three-dimensional coordinates in real three-dimensional space for each said determined mid-eye position.
  • the method preferably further comprises determining said three-dimensional coordinates of said mid-eye position by triangulation between two images of an object, each captured by a different camera.
  • the method further comprises determining a mid-eye position of detected facial images contained in the first and second streams of video images.
  • the method comprises determining a set of three- dimensional coordinates in real three-dimensional space for each of the determined mid-eye positions.
  • the process of determining three-dimensional coordinates of said mid-eye positions comprises triangulation between two images of an object, each image being captured by a different camera.
  • said virtual three-dimensional scene comprises a digitally generated virtual three-dimensional scene.
  • the method preferably further comprises generating a two-dimensional image of said view of said three-dimensional scene as seen from said viewer position in three-dimensional real space; and varying said two-dimensional view depending on a change of said position of said viewer in said real three-dimensional space.
  • the method preferably further comprises: determining coordinates of an outer perimeter of said display device in virtual three-dimensional space; cropping a view of said virtual three-dimensional scene to coincide with a straight line view from a determined position of a viewer in said virtual three- dimensional space, through said outer perimeter of said display device.
  • the method preferably further comprises, for a plurality of visual display devices: determining a physical position of said subject viewer in relation to a physical position of each said visual display device in real three-dimensional space; generating a virtual three-dimensional scene populated with a plurality of objects in virtual three-dimensional space; generating a corresponding respective view of said virtual three-dimensional scene, as seen through an aperture coinciding with a perimeter of each said display device, and as viewed from said position in real three-dimensional space; and displaying a said corresponding respective view on each said visual display device, such that for each physical position of the subject viewer in the real three-dimensional space, the view of the virtual three-dimensional scene displayed on each said visual display device is visually consistent with each other view on each other said visual display device.
  • said plurality of views are coordinated on said plurality of visual display devices, such that said subject viewer at said physical position within a first region of real three-dimensional space views a virtual three-dimensional scene on said plurality of visual display devices, which appears to coincide with a second region of real three-dimensional space surrounding said first region of three-dimensional space; and for each position which said subject viewer occupies within said first region of real three-dimensional space, the views displayed on the plurality of visual display devices change in real time, coordinated with each other, to give the appearance that the subject viewer is viewing said three-dimensional scene through a plurality of apertures surrounding said viewer.
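
To illustrate how such coordination might be orchestrated, the following hedged sketch drives every display from the single tracked viewer position; the `displays` objects and the `make_projection` builder (for example the off-axis sketch shown earlier) are assumptions, not taken from the disclosure:

```python
def render_all(displays, viewer_pos, make_projection, near=0.1, far=1000.0):
    """Render one mutually consistent view per virtual window.

    `displays` is assumed to be a list of objects exposing their perimeter
    corners pa, pb, pc and a render() method; `make_projection` builds a
    per-window projection from those corners and the shared viewer position.
    """
    for disp in displays:
        m = make_projection(disp.pa, disp.pb, disp.pc, viewer_pos, near, far)
        disp.render(projection=m)   # every window shows the same scene from the same eye point
```
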
  • an apparatus for determining a viewpoint position of a person in a real three-dimensional space comprising: a pair of video cameras each capable of capturing a stream of video data; means for converting each said captured stream of video data into a stream of image data frames; means for storing said stream of image data frames; means for detecting faces in said stream of image data frames; means for detecting facial features in said stream of image data frames; means for determining a viewpoint of a said detected face.
  • said pair of video cameras are spaced apart from each other and are arranged such that their respective fields of view overlap each other in real three-dimensional space.
  • said means for converting each said captured stream of video data into a stream of image data frames comprises a video grabber for receiving a plurality of streams of video data and converting each said stream of video data into a stream of image data frames.
  • the image processing apparatus comprises means for adding a timestamp to each image data frame.
  • said apparatus for detecting faces comprises a computer platform and a first machine learning algorithm for detecting facial images
  • said first machine learning algorithm operates to: identify images in said image frames containing human faces; and determine a position in real three-dimensional space of said human face images.
  • said means for detecting facial features in said stream of image data frames comprises a second machine learning algorithm operable to identify facial landmark features from a stream of said image data frames.
  • said means for determining a viewpoint of said detected face comprises a computer platform operating an algorithm for determining a position in real three-dimensional space from detected facial features in said image data frames.
  • said apparatus for detecting bodies comprises a computer platform and a third machine learning algorithm for detecting body images.
  • said third machine learning algorithm operates to: identify images in said image frames containing a human body; and determine a position in real three-dimensional space of said human body images.
  • said means for detecting body features in said stream of image data frames comprises a fourth machine learning algorithm operable to identify body landmark features from a stream of said image data frames.
  • said means for determining a viewpoint of said detected body comprises a computer platform operating an algorithm for determining a position in real three-dimensional space from detected body features in said image data frames.
  • an apparatus for determining a viewpoint position of a person in a real three-dimensional space comprising: a pair of video cameras each capable of capturing a stream of video data; means for converting each said captured stream of video data into a stream of image data frames; means for storing said stream of image data frames; means for detecting one or more body features in said stream of image data frames; and means for determining a viewpoint of a said detected body from an analysis of said body features detected from said stream of image data frames.
  • a method for determining a viewpoint position of a person in a real three-dimensional space comprising: capturing first and second streams of video data from first and second positions, said first and second positions spaced apart from each other and covering a common field of view in real three-dimensional space; converting each said captured stream of video data into a corresponding stream of image data frames; storing said streams of image data frames; detecting faces in said stream of image data frames; detecting facial features in said stream of image data frames; and determining a viewpoint of a said detected face.
  • the method further comprises adding a timestamp to each said image data frame.
  • said process of detecting faces comprises: applying a pre-trained machine learning algorithm to recognise images of human faces in said image data frames; identifying images in said image frames containing human faces; and determining a position in real three-dimensional space of said human face images.
  • said method further comprises generating a two-dimensional area boundary around an identified facial image in a said image frame; and cropping an area within said two-dimensional area boundary, containing said identified facial image.
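
A minimal sketch of that bounding-box crop, assuming the detector returns (x, y, width, height) boxes in pixel coordinates (the box format is an assumption):

```python
def crop_face(frame, box):
    """Crop the area inside a detected face bounding box.

    `frame` is an image array (height x width x channels); `box` is
    assumed to be (x, y, width, height) from the face-detection stage.
    """
    x, y, w, h = box
    h_img, w_img = frame.shape[:2]
    x0, y0 = max(0, x), max(0, y)                    # clamp the box to the image
    x1, y1 = min(w_img, x + w), min(h_img, y + h)
    return frame[y0:y1, x0:x1]
```
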
  • said process of detecting facial features in said stream of image data comprises: applying a pre-trained machine learning algorithm to recognise images of landmark features of a human face, said landmark features selected from the set: eyes; pupils; nose; lips; eyebrow; temples; chin; moles, teeth; cheeks.
  • said process of determining a viewpoint of said detected face comprises: determining a position of each of a pair of eyes of a face in said data frames; determining a mid-point between said positions of said eyes; and assigning said viewpoint to be said mid-point.
  • said process of detecting one or more bodies comprises: applying a pre-trained machine learning algorithm to recognise images of one or more human bodies in said image data frames; identifying images in said image frames containing one or more human bodies; and determining a position in real three-dimensional space of one or more images of a human body.
  • said method comprises generating a two-dimensional area boundary around an identified body image in a said frame; and cropping an area within said two-dimensional area boundary, said area containing said identified body image.
  • said process of detecting body features in said stream of image data comprises: applying a pre-trained machine learning algorithm to recognise images of landmark features of a human body, said landmark features selected from the set: eyes; pupils; nose; lips; eyebrow; temples; chin; moles; teeth; cheeks; ears; neck; shoulder; elbow; wrists; hands; chest; waist; hips; knees; ankles; feet; legs; arms.
  • a frame aggregation module aggregates received data frames from multiple sources and outputs aggregated data.
  • a method of converting a stream of two-dimensional images captured as a stream of two-dimensional image frames of real world observed humans present within a three dimensional space into a set of horizontal and vertical coordinates in said real three- dimensional space comprising: capturing a stream of image frames representing objects within a three- dimensional space; capturing said stream of image frames substantially in real time; inputting said stream of image frames into a plurality of independently operating machine learning algorithms ML1 - MLn; each said machine learning algorithm being pre-trained to identify a corresponding respective set of key points of a human body; generating an output of each of said machine learning algorithms, said output comprising a set of X, Y coordinates in real three-dimensional space wherein each individual said X, Y coordinate represents a key point of a human body; and aggregating a plurality of individual X, Y coordinates to produce a plurality of individual streams of X, Y coordinates each of which represents the movement of
  • facial and body recognition is carried out at the same time in parallel by a plurality of parallel operating machine learning algorithms each configured with and trained upon datasets to identify specific anatomical key points of a viewer’s head and body.
  • the key points are weighted in order of importance to determine a look direction of the viewer, using key points from the main body of the viewer as lower weighted key points compared to key points identified on the viewer’s face, so that if the viewer’s face becomes obscured to a monitoring camera, the lower weighted key points from the viewer’s body compensate for the obscured key points on the viewer’s face to help determine the overall look direction of the viewer.
  • the key points on the viewer’s body may be used to establish the overall position of the viewer within a 3-D room or 3-D real coordinate space, and the key points identified on the viewer’s face give specific information on the look direction of the viewer within the 3-D room or 3-D real coordinate space.
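
A much-simplified sketch of that weighting idea: each key point carries a fixed weight (higher for facial points) and a per-frame confidence, so body points only dominate the estimate when the facial points drop out. The dictionary layout and the weighted-average estimator are illustrative assumptions, not the disclosed method:

```python
import numpy as np

def weighted_position(keypoints):
    """Estimate a viewer position from weighted face and body key points.

    `keypoints` is assumed to map key-point names to tuples of
    (xyz_position, confidence, weight), with facial key points given
    higher weights than body key points.
    """
    num = np.zeros(3)
    den = 0.0
    for pos, conf, weight in keypoints.values():
        w = conf * weight                 # an obscured (low-confidence) key point contributes little
        num += w * np.asarray(pos, dtype=float)
        den += w
    return num / den if den > 0.0 else num
```
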
  • Figure 1 herein illustrates schematically a view of a virtual window aperture looking out onto a virtual three-dimensional scene from a first viewpoint
  • Figure 2 herein illustrates schematically a view of the virtual window aperture looking out onto the virtual three-dimensional scene from a second viewpoint
  • Figure 3 herein illustrates schematically in view from above, a field of view of a human viewer showing areas of three-dimensional view and two-dimensional view
  • Figure 4 herein illustrates schematically an installation of a first embodiment apparatus for generating a virtual three-dimensional scene for viewing through a virtual window in the form of a display screen, in which the position of a human viewer is tracked in real time, and in which the virtual three-dimensional scene changes in real time according to a position of the human viewer;
  • Figure 5 herein illustrates schematically a view from a position in a first region of real three-dimensional space, looking into a virtual window, into a virtual scene in a virtual three-dimensional space
  • Figure 6 herein illustrates schematically, in overview, an apparatus for tracking a viewer, and for generating and displaying a virtual three-dimensional scene, in which the virtual three-dimensional scene changes in response to a positional movement of a viewer in real three-dimensional space;
  • Figure 7 herein illustrates schematically a configuration of a plurality of binocular camera sets located around a room, comprising part of a tracking system, capable of viewing and tracking a person throughout substantially all parts of the room;
  • Figure 8 herein illustrates schematically a first type of display screen for use as a first virtual window;
  • Figure 9 herein illustrates schematically a second type of display screen for use as a second virtual window;
  • Figure 10 herein illustrates schematically a third type of display screen for use as a third virtual window;
  • Figure 11 herein illustrates schematically an overview of part of the tracking apparatus comprising modules for capturing a series of image frames from a binocular camera set, and showing a configuration process for configuring the part of the tracking apparatus;
  • Figure 12 herein illustrates schematically an overview of the part of a tracking apparatus, shown in Figure 11 , and a process for capturing a series of video image frames from a binocular camera set, and storing image frames to a memory buffer in real time;
  • Figure 13 herein illustrates schematically operation of the tracking apparatus of Figure 11 herein, combining meta data and frame data, and a method for transporting the frame data to a frame processing module
  • Figure 14 herein illustrates schematically process stages for frame processing of a stream of digital image frames, including preprocessing of the raw frame data as captured by a camera to result in a stream of RGB images, detecting faces in the RGB images using a first machine learning algorithm, and detecting facial landmarks within RGB images of faces using a second machine learning algorithm in order to detect a mid-eye point of a human viewer;
  • Figure 15 herein illustrates schematically process stages for frame data aggregation using a third machine learning algorithm to produce an output of x, y coordinates in two-dimensional image frame space representing a position of a mid-eye point of a human viewer in real time;
  • Figure 16 illustrates schematically an overall data processing pipeline for observing real-world events in a real-world three-dimensional space, including movement of human viewers within the real three-dimensional space, capturing images of a region within a real three-dimensional space, and processing those images to detect humans within the real three-dimensional space, and to determine a position of a human viewer in the real three-dimensional space;
  • Figure 17 illustrates schematically set up parameters for a neural network machine learning algorithm for performing face recognition and facial feature extraction, and for generating a two dimensional box within an image frame, in which an image of a face is present;
  • Figure 18 herein illustrates schematically set up parameters for a second neural network machine learning algorithm for performing identification and recognition of facial landmarks in an image of a human face, and identifying facial features with two-dimensional x, y coordinates in image frame space for each identified facial feature;
  • Figure 19 herein illustrates schematically some of the body landmarks that the system may identify on a person
  • Figure 20 herein illustrates schematically some of the facial landmarks that the system may identify on a person
  • Figure 21 herein illustrates schematically process stages for frame processing of a stream of digital image frames, including preprocessing of the raw frame data as captured by a camera to result in a stream of RGB images, detecting bodies in the RGB images using a third machine learning algorithm, and detecting body landmarks within RGB images of bodies using a fourth machine learning algorithm in order to detect a mid-body point of a human viewer; and
  • Figure 22 illustrates schematically an overall data processing pipeline for observing real-world events in a real-world three-dimensional space, including movement of human viewers within the real three-dimensional space, capturing images of a region within a real three-dimensional space, and processing those images to detect humans within the real three-dimensional space, and to determine a position of a human viewer in the real three-dimensional space.
  • the term “viewer” refers to a human natural person having one or two functioning eyes. For ease of description of the apparatus and methods disclosed herein, it is assumed that the viewer has two eyes, but the apparatus and methods will also operate without modification for a person having only a single functioning eye.
  • position when referring to a viewer is defined as the mid-point of a line extending between the geometric centres of the two eye balls of a viewer. As a viewer moves around in real three dimensional space, the position of the viewer changes.
  • viewpoint means for a viewer having a pair of functioning eyes, a position midway along a line extending between the centre point of the pupil of each eye.
  • viewpoint means a position at the centre of the pupil of that person’s eye.
  • viewpoint means a position at the centre of the pupil of the viewer’s unobscured functioning eye.
  • view direction refers to a line direction in a horizontal plane in real three-dimensional space which bisects, at its mid-point, a line extending between a pair of eyes of a viewer.
  • aperture means a region of three dimensional space surrounded by a boundary in three-dimensional space.
  • two-dimensional image is used to refer to a digitally generated image for display on a nominally flat planar two- dimensional visual display screen.
  • two-dimensional image is used to mean an image which is displayed on the pixels of a visual display screen, irrespective of the actual three-dimensional shape of that screen, whether it be curved, flat, elliptical cylindrical, a three-dimensional elliptical segment, or any other shape of curve.
  • axes in real three-dimensional space are denoted (X, Z, Y), and axes in virtual three-dimensional space are denoted (X’, Z’, Y’), in which the letters X and Z denote horizontal axes and the letter Y denotes a vertical axis, and the symbol ’ denotes virtual three-dimensional space.
  • Coordinate points in real three- dimensional space are denoted (x, z, y) and coordinate points in virtual three- dimensional space are denoted (x’, z’, y’).
  • FIG. 1 there is illustrated schematically a view of a visual display device in a room, with a digitally generated image representing a view looking out through the see-through window aperture into a digitally generated woodland scene in virtual three-dimensional space, as viewed from a position on one side of the aperture.
  • a human viewer in real three dimensional space moves relative to a frame of the virtual window aperture
  • the view through the virtual window aperture changes as if the virtual scene were in real three dimensional space on the other side of the virtual window aperture.
  • the viewer and the space behind the virtual window are each in real three- dimensional space, with a wall and aperture separating the viewer and the space behind the virtual window.
  • the equivalent real-world situation is represented by a transparent window aperture in an opaque wall, where there is a clear unobstructed line of sight from the viewer through the aperture to the scene in real three-dimensional space, but there is no clear line of sight from the viewer to parts of the scene on the other side of the wall where those parts of the scene are obscured by the opaque wall surrounding the aperture.
  • in the virtual window embodiment described herein there is no clear line of sight through the visual display device comprising the virtual window into the real world behind the visual display device, but due to the digitally generated image displayed on the visual display device, from the viewer’s perspective it appears as if there is a three-dimensional scene visible through the virtual window display device, and as a viewer moves around a room, the viewer views different parts of the three-dimensional scene as if the virtual three-dimensional scene were a real three-dimensional scene being viewed through a window on an opposite side of a wall to the real three-dimensional space occupied by the viewer.
  • FIG. 2 there is illustrated schematically the virtual window aperture of Figure 1 with a human viewer positioned on a first side of the aperture and looking through the aperture from a different position on the first side compared to Figure 1 herein.
  • At the different positions on a first side of the virtual window, the viewer has a different view of the scene on the other side of the aperture and can see different parts of the scene, and the same parts from a different view angle.
  • An object of the methods and embodiments described herein is to automatically generate in real time the virtual scene in virtual three-dimensional space on the second side of the aperture, the virtual three-dimensional scene appearing to coincide with a real three-dimensional space on the opposite side of the aperture to the viewer, so that from the viewpoint of the viewer the virtual three-dimensional scene appears as if it were a real three-dimensional scene in real three dimensional space on the opposite side of the aperture to which the viewer is located.
  • the virtual three-dimensional scene is digitally generated using a three-dimensional scene generator engine, for example the Unreal ® Engine, and in the general case would not be a representation of the actual scene in real three-dimensional space on the second side of the aperture.
  • the virtual scene generated and displayed on the display device would not be chosen to correspond to a scene of the other room on the other side of the display device (although it could do), but rather could be generated as a view of a woodland scene, a mountain scene, a cityscape scene, an ocean scene, a fantasy scene or any other digitally generated virtual three-dimensional scene.
  • the digitally generated virtual three-dimensional scene may be static, that is, a three-dimensional equivalent of a photograph where the features and objects of the image do not change with time.
  • the virtual three- dimensional scene changes in time.
  • a woodland scene may have a stream of flowing water which flows at a same rate as in a real world woodland scene, and may have birds, deer, or other animals which walk-through the scene at the same rate as in real time.
  • a city virtual three- dimensional scene may have pedestrians, vehicles, and other moving objects which move around the scene in real time.
  • the digitally generated image is not limited to having virtual three-dimensional objects which move at the same rate of movement as in real life, for example the object may be generated to move more slowly, or more quickly than in real time, equivalent to a “slow motion” or “fast forward” function, but in most applications envisaged, in order to enhance the sense of realism from the perspective of the viewer, objects within the virtual three-dimensional scene will move at the same rate as they would do in a real three-dimensional space in the real world.
  • FIG. 3 there is illustrated schematically from above a field of view of human viewer having normal eyesight.
  • the human viewer has a lateral field of view of approximately 180° which varies from person-to- person, as measured about a central view direction with both of the viewer’s eyes looking directly ahead relative to the viewer’s skull.
  • the viewer has a lateral angle of peripheral vision of around 120° centred on the central view direction, in which both eyes receive light from the field of view, and within the 120° peripheral vision field of view there is an approximately 60° angle in which the viewer has three-dimensional viewing.
  • the central 60° lateral field of view is three-dimensional, and the outer part of the 120° field of view, further away from the centre line of the view direction, gives two-dimensional viewing.
  • the outermost 30° regions of view between 60° and 90° from the central view direction are viewable only by the left eye, to the left of the viewer, and by the right eye to the right of the viewer.
  • a virtual line is drawn between the centres of the viewer’s eyes, and the midpoint of that line is taken as the viewpoint position, referred to herein as the mid - eye position.
  • FIG. 4 there is illustrated schematically in plan view an installation of the apparatus disclosed herein in a real three -dimensional space 400.
  • the real three-dimensional space comprises a wall 401 in which there is located a virtual window 402, which represents an aperture in the wall 401.
  • the three-dimensional space has first and second horizontal coordinates X, Z and a vertical coordinate Y. Individual positions within the three-dimensional space are represented as (x, y, z) coordinates.
  • the real three-dimensional space is separated by the wall 401 into a first region 403 on the first side of the wall and a second region 404 on a second side of the wall, the first and second regions being visibly separated by the opaque wall 401 .
  • a human viewer can walk around in the first region 403 and look in any direction horizontally, vertically or at an angle.
  • the viewer cannot see the second region 404 of real three-dimensional space on the second side of the wall, because the view of the second region 404 from the first region 403 is obscured by the opaque wall 401.
  • Located on the wall, preferably in a vertical plane, is the virtual window 402 in the form of a visual display device.
  • the virtual window aperture comprises a region of three-dimensional space which is bounded by a rectangular boundary of opaque material, in this case, the wall 401.
  • the window aperture contains the visual display device, the outer perimeter of which forms the virtual aperture.
  • the opaque material of the wall surrounding the aperture blocks a direct view to part of the digitally generated virtual scene on the other side of the wall.
  • the digitally generated virtual three-dimensional space comprises a plurality of objects labelled 1 to 6 in Figure 4 and shown as circles. Since Figure 4 is in plan view, the circles represent circular cylindrical upright pillars in three dimensions, which could be for example tree trunks in a woodland scene.
  • each of dotted construction lines 405, 406 in the example shown represents a vertical plane in real three-dimensional space in the first region 403, and each represents a vertical plane in virtual three- dimensional space in the second region 404.
  • if the aperture were a real transparent window into the real three-dimensional space in the second region 404 and the objects 2, 3, 4 were real objects, a viewer standing at position A would be able to directly see objects 2 - 4 by direct line of sight, but would be unable to see object 1 at horizontal position (x’, z’), or objects 5 or 6, each of which is out of direct line of sight through the aperture as viewed from position A.
  • if the viewer moves to a third position C in the first region, and if the aperture 402 were a transparent real window and the objects 1 - 6 were real objects in real three-dimensional space coinciding with the second region 404, the viewer at position C would have a field-of-view bounded by the construction lines 409, 410, each of which in this example represents a vertical plane, and because the viewer is closer to the aperture the field-of-view is wider than at positions A or B and the viewer is able to see objects 1 - 5, but object 6 is obscured from view by the opaque wall 401.
  • the objects 1 - 6 are digitally generated objects in virtual three-dimensional space displayed on the display screen 402, and which, depending upon the position of the viewer in the first region 403 of real three-dimensional space move around on the two-dimensional display screen to give the impression as described above as if they were real objects in real three- dimensional space.
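
The plan-view geometry of Figure 4 can be captured by a simple line-of-sight test: an object behind the wall is directly visible only if the straight line from the viewer to the object crosses the wall plane inside the window aperture. The sketch below is an illustrative two-dimensional check using assumed coordinates, not an excerpt from the disclosure:

```python
def visible_through_window(viewer, obj, x_left, x_right, z_wall=0.0):
    """Plan-view visibility test for the Figure 4 situation.

    viewer, obj: (x, z) horizontal coordinates, with the wall lying in the
    plane z = z_wall, the viewer on one side and the object on the other;
    x_left, x_right: horizontal extent of the window aperture.
    Returns True when the straight line of sight from viewer to object
    passes through the aperture rather than the opaque wall.
    """
    xv, zv = viewer
    xo, zo = obj
    if (zv - z_wall) * (zo - z_wall) >= 0.0:
        return False                          # both points on the same side of the wall
    t = (z_wall - zv) / (zo - zv)             # where the sight line meets the wall plane
    x_cross = xv + t * (xo - xv)
    return x_left <= x_cross <= x_right
```

Stepping closer to the wall (smaller |zv|) admits more objects through the same aperture, matching the wider field of view described above for position C.
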
  • the scene image will adjust according to the position and angle of the viewer relative to the window.
  • Position is here nominally defined as the midpoint between the eyes and referred to as the ‘mid-eye point’.
  • if the viewer moves laterally (-x), the scene displayed on the display will appear to shift to the right by a corresponding distance (-x’); if the viewer lowers their view (-y), the scene image shifts up (+y’); and if the viewer moves towards the ‘window’ (-z), the image zooms out.
  • Objects in the image virtual scene will be adjusted according to the change in the field of view and perspective of the 3D scene corresponding to each new position of the viewer.
  • Detection of the mid-eye point to display of the newly adjusted scene image happens at 60 frames per second so that there is no lag in response detected by the viewer (which can happen at the lower standard video frame rates of 25/30 frames per second).
  • the higher frame rate of 60 Hz helps accommodate small jerky movements but doubles the data load.
  • little data is stored since the high volume of image data is reduced to a set of three coordinates x, y and z for each pair of synched camera frames.
  • the tracking system is a passive system in that no artificial markers or structured lighting are required.
  • the hardware consists of two synchronized cameras in stereoscopic configuration, a frame grabber and a host computer. Processing can be subdivided into three parts: (i) stereoscopic image acquisition and pre-processing, (ii) AI extraction of the mid-eye point and other key points, as detailed below, and (iii) calculation of x, y and z of the viewer position.
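
An illustrative top-level loop tying those three parts together; the callables (`grab_stereo_pair`, `extract_mid_eye`, `triangulate`, `publish`) are placeholders for the stages described in this document, and their names are assumptions:

```python
def tracking_loop(grab_stereo_pair, extract_mid_eye, triangulate, publish):
    """Run the passive tracking pipeline frame by frame."""
    while True:
        left, right = grab_stereo_pair()          # (i) synchronized stereo frames
        uv_left = extract_mid_eye(left)           # (ii) AI mid-eye point in the left image
        uv_right = extract_mid_eye(right)         # (ii) AI mid-eye point in the right image
        if uv_left is not None and uv_right is not None:
            publish(triangulate(uv_left, uv_right))   # (iii) viewer position (x, y, z)
```
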
  • the tracking system utilizes a solid-state solution (no moving parts) but uses a state- of-the-art imaging system to locate the target point. Although there are many methods/technologies to also obtain 3D image information such as time-of-flight cameras, they lack the resolution, speed and tracking performance required.
  • the current version of the tracking system is built around a dual camera stereoscopic configuration using high performance, high resolution cameras.
  • This design route also allows AI image recognition technologies to be easily introduced at the individual frame level. This allows for the development of improved learning algorithms for detection and enhanced accuracy in tracking the target point (mid-eye point).
  • once image processing and determination of the target point has been completed, it becomes a matter of triangulation to determine the distance (z) of its corresponding pixel.
  • the large volume of image data generated by the two cameras is reduced to just three numbers x, y and z at the frame level.
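
For a rectified, synchronized stereo pair the triangulation reduces to the standard disparity relation z = f·B/d, and the remaining coordinates follow from pinhole back-projection. A hedged sketch; the rectified-camera assumption and parameter names are illustrative, not taken from the disclosure:

```python
def triangulate_depth(u_left, u_right, focal_px, baseline_m):
    """Depth of a matched target point (e.g. the mid-eye point) in a
    rectified stereo pair: z = focal length * baseline / disparity."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("non-positive disparity: point cannot be triangulated")
    return focal_px * baseline_m / disparity

def pixel_to_xyz(u, v, z, focal_px, cx, cy):
    """Back-project the left-image pixel (u, v) at depth z to coordinates
    relative to the left camera, using a simple pinhole model."""
    return ((u - cx) * z / focal_px, (v - cy) * z / focal_px, z)
```

This is also why so little data needs to be retained: each synchronized frame pair collapses to the three numbers (x, y, z).
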
  • the system is intended to work anonymously in that it does not require the recognition of a particular individual viewer, nor does the system require the processing or storage of such information.
  • face recognition could augment performance (as the AI system learns the particular user) so it might be offered as an option, even for a limited number of prescribed users.
  • the display system hardware consists of a computer, GPUs and graphics cards for rendering and displaying a 3D image onto a large monitor screen. Processing can be subdivided into three parts: (i) rendering of the 3D scene image, (ii) adjusting the scene image according to updated viewer tracking coordinates x, y and z, and (iii) displaying the newly adjusted 3D image scene.
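
A display-side counterpart of the earlier tracking loop, again with assumed placeholder callables wrapping the scene engine, the tracker output and the monitor driver:

```python
def display_loop(latest_viewer_xyz, adjust_view, render_scene, present):
    """Continuously re-render the virtual window for the tracked viewer."""
    while True:
        xyz = latest_viewer_xyz()        # most recent tracked viewer coordinates
        view = adjust_view(xyz)          # (ii) shift/resize the virtual view for that position
        frame = render_scene(view)       # (i) render the 3D scene for the adjusted view
        present(frame)                   # (iii) push the newly adjusted image to the monitor
```
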
  • the display system comprises a large, high-resolution monitor/display screen such as a 4K monitor which renders the appropriate scene content in terms of the corresponding field of view that would be seen out through a real window.
  • 3D image scenes are computer generated using Unreal ® Engine but real 3D images could also be used if available (i.e., 3D computer generated scenes are preferred but not essential).
  • the following criteria are for a Mideye Tracking Prototype:
  • Tracking accuracy (requires recognition of the same pixel in the scene, such as the mid-eye point, but otherwise the target could be another facial feature such as a freckle, a mole, the left or right eye/pupil, etc.).
  • the target point could jump to another facial target once the head angle (i.e., when the viewer turns their head away from the cameras) and the distance between facial target points were accurately known. That is, the midpoint could be interpolated from other facial features of a known subject.
  • FIG. 5 there is illustrated three separate two- dimensional images of virtual three-dimensional scenes as generated and displayed on the display monitor at respective viewer positions A, B and C as shown in Figure 4.
  • the tracking system may also track the position of the viewer vertically in the first region 403, and adjust the two-dimensional view into the virtual three-dimensional scene accordingly; so, for example, if the viewer sits down at horizontal position A, reducing the vertical coordinate of their viewing position without changing the horizontal coordinates, the corresponding lower boundary of the field of view of the viewer into the virtual three-dimensional scene is a plane defined by a lower straight line edge of the aperture frame of the display device and the viewpoint of the viewer (represented in plan view by the construction lines 405, 406).
  • a corresponding upper plane bounded by the upper horizontal boundary of the display device also limits the upper extent of the view into the virtual three-dimensional scene.
  • FIG. 4 and 5 herein there is illustrated schematically a single flat visual display screen on a single wall of a room.
  • the apparatus and methods described herein may apply to a plurality of visual display devices arranged on a same surface (for example a same wall) or on different walls, and may be arranged around or surrounding a region of real three-dimensional space (region 403 in Figure 4 herein), so that a room in which a viewer is positioned may be fitted with a plurality of visual display devices, all receiving a virtual scene, wherein the virtual scene as viewed through any virtual window is consistent with the scene as viewed through the other virtual windows surrounding the same region of real three-dimensional space, in the same way that if the viewer were in a room having apertures or windows looking out onto a real three-dimensional scene, the viewer’s perception of the real three-dimensional scene would change through all windows simultaneously as the viewer moves around the room and their viewpoint with respect to each individual window changes.
  • the basic operational system consists of two main parts: a tracking system or apparatus 600 which determines the position of the viewer in the first region of real three-dimensional space, and a display system 601 which generates, renders and displays an appropriately shifted and adjusted image scene representing a scene in virtual three-dimensional space in the second region.
  • Components of the tracking system comprise known stereo or binocular cameras 602, one or more computer platforms 603 having one or more digital processors, one or more communications ports, data storage, memory, and interfaces for interfacing with other components; communications network components, electronic data storage devices and user interfaces.
  • the data storage, memory and/or electronic data storage devices may comprise solid state drive, hard disk drive, random access memory, optical storage device, non-local storage, such as cloud storage or other such suitable storage means, or any suitable combination of the aforementioned.
  • the computer platform may be provided at a single location near to the display device, or alternatively, some data processing tasks may be carried out at a remote location accessible over a communications network, for example at a remote data processing and/or data storage centre.
  • the computing platform comprises a frame grabber 604 which receives one or more streams of input video signal and generates frames of image data from the input video streams; a three-dimensional scene generator 605, such as the Unreal ® Engine; and a graphics card 606 for driving the display monitor 607.
  • a frame grabber 604 which receives one or more streams of input video signal and generates frames of image data from the input video streams
  • a three dimensional scene generator 605, such as the Unreal ® Engine
  • a graphics card 606 for driving the display monitor 607.
  • the output of graphics card 611 drives the display monitor 607.
  • the coordinate conversion module 608 converts the position coordinates in real three-dimensional space (x, y, z) from the tracking system into coordinates in virtual three-dimensional space as used by the three-dimensional scene generation engine 605.
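  • By way of illustration only, the conversion performed by module 608 may be as simple as a scale and offset between the room coordinate system and the engine coordinate system. The sketch below assumes such a rigid mapping; the constant names and values are hypothetical and are not taken from this disclosure.

        import numpy as np

        REAL_TO_VIRTUAL_SCALE = 100.0                        # e.g. metres to engine units (assumption)
        VIRTUAL_ORIGIN_OFFSET = np.array([0.0, 0.0, 0.0])    # engine-space position of the room origin

        def real_to_virtual(p_real_xyz):
            """Convert a viewer position (x, y, z) in room coordinates, in metres,
            into the coordinate system used by the three-dimensional scene generator."""
            p = np.asarray(p_real_xyz, dtype=float)
            return p * REAL_TO_VIRTUAL_SCALE + VIRTUAL_ORIGIN_OFFSET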
  • where the system comprises more than one visual display device, with different visual display devices at different positions with respect to the viewer, then for each individual visual display device a corresponding respective view of the virtual three-dimensional scene is selected by the scene reformat module 610, and sent to a graphics card.
  • the graphics card 611 may comprise one graphics card per visual display device, or a single graphics card capable of handling multiple views on multiple display monitors, with the effect that all of the plurality of monitors are synchronised so that the respective two-dimensional images presented on the respective display monitors give the impression that the viewer, who is in a real three-dimensional space, is in a room which is surrounded by the virtual scene, which can be viewed through any one or more of the virtual windows; as the viewer moves around the room, the view through each virtual window presents a consistent picture of the virtual three-dimensional scene, as if the room in which the viewer is positioned (region 403 in Figure 4) were in the same real three-dimensional coordinate space as the virtual three-dimensional coordinate system of the virtual scene.
  • Information concerning the viewer’s position relative to the aperture is captured by at least one pair of spaced apart stereo or binocular cameras.
  • Cameras may be mounted at a position immediately adjacent the virtual window display monitor, and may, but need not, be mounted symmetrically either side of a vertical centre line which bisects the aperture.
  • Each camera of each pair of cameras has its own field of view and the fields of view of the two cameras of each pair overlap each other, so that both cameras can each capture a separate image of a viewer within the combined field of view of the two cameras.
  • the cameras are video cameras which operate in the visible light range, but infra-red cameras which capture infra-red images may be used, or a combination of one or more visible range and one or more infra-red cameras may be used.
  • in FIG. 7 there is illustrated schematically a real three-dimensional space in plan view representing for example a hotel room or conference room, where three stereo vision camera sets are mounted at three separate wall locations within the room.
  • each stereo camera set has its own field of view, and the fields of view of the three camera sets are arranged to overlap with each other to obtain the maximum amount of overlap coverage within the room, so that a viewer located at any of positions 1 to 4 as shown can have their face and/or body detected from images produced by at least one of the camera sets.
  • the cameras are co-located with the visual display device, and in other implementations the cameras may be located at different locations distributed in the first region 403, and provided that the cameras scan across a field-of-view in which they can capture images of a human viewer in the first region, they will still be able to operate for the purpose of facial and/or body recognition and determining a viewpoint of the viewer.
  • the scene adjustment and resizing module 609 receives a set of converted coordinates from the coordinate conversion module 608 and receives scene data from the three-dimensional scene generator 605.
  • the scene adjustment and resizing module operates to incorporate the position of the viewer into the virtual three-dimensional space coordinates of the virtual three- dimensional scene so that the three-dimensional virtual scene is seen from the viewpoint of the viewer in the virtual three-dimensional coordinate space.
  • the output of the scene adjustment and resizing module 609 comprises a view of the virtual three-dimensional scene as viewed from a point of view position of the human viewer, as if the human viewer were in the same virtual three-dimensional coordinate system as the scene, where the virtual three- dimensional coordinate system matches and is mapped to the real three- dimensional coordinate system of the room in which the viewer is standing (the first region 403).
  • the coordinate converter 608 receives an input of the coordinates of the outer perimeter / outer frame of the display monitor 607 in real three- dimensional coordinates. This can be done on initial system setup / configuration. Alternatively the coordinates of the outer perimeter of the display monitor can be obtained by examining the video stream from one or a plurality of stereo cameras and applying a module to recognise the geometric features of the rectangular or square frame of the display monitor in real three-dimensional space.
  • the scene reformatting module 610 applies, in virtual three-dimensional space, an aperture between the view position of the viewer and the view of the virtual three-dimensional scene.
  • the view of the virtual three-dimensional scene which is inside the aperture frame is selected for display on the display monitor.
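  • The disclosure does not prescribe the projection mathematics; one common way to realise a view through a planar aperture is a generalised off-axis perspective projection computed from the real-space corners of the display and the tracked eye point. The following sketch assumes three display corners and the viewer position are known in a common coordinate system; it is illustrative only.

        import numpy as np

        def window_frustum(pa, pb, pc, pe, near):
            """Off-axis frustum extents for a flat 'virtual window'.
            pa, pb, pc: lower-left, lower-right and upper-left corners of the display
            surface; pe: tracked mid-eye position; all in the same real 3D coordinates.
            Returns (left, right, bottom, top) at the near clipping plane, suitable for
            building a standard perspective projection matrix."""
            pa, pb, pc, pe = (np.asarray(p, dtype=float) for p in (pa, pb, pc, pe))
            vr = pb - pa; vr /= np.linalg.norm(vr)            # screen right axis
            vu = pc - pa; vu /= np.linalg.norm(vu)            # screen up axis
            vn = np.cross(vr, vu); vn /= np.linalg.norm(vn)   # screen normal, towards viewer
            va, vb, vc = pa - pe, pb - pe, pc - pe            # eye-to-corner vectors
            d = -np.dot(va, vn)                               # perpendicular eye-to-screen distance
            left = np.dot(vr, va) * near / d
            right = np.dot(vr, vb) * near / d
            bottom = np.dot(vu, va) * near / d
            top = np.dot(vu, vc) * near / d
            return left, right, bottom, top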
  • a first display screen comprising a flat planar display surface.
  • Such screens are conventionally known in the art and are in widespread use on tablet computers, mobile phone screens, laptop computer screens, iPad ® screens, large format home cinema screens, and large HDTV screens and the like.
  • the display surface may comprise a light emitting diode (LED) display, liquid crystal display, a high definition television (HDTV) display or the like.
  • the screen comprises a substantially rectangular planar surface bounded by a substantially rectangular outer perimeter.
  • in FIG. 9 there is illustrated schematically a second type of known curved display screen in which the display surface wraps around a viewer, the display surface following a part circular cylindrical surface having a focal line f, curving in the nominal X and Z orthogonal directions and following a straight line in a nominal Y (height) direction, orthogonal to each of the X and Z axes.
  • the viewer may locate themselves along the focal line of the screen, but not necessarily.
  • a third type of display screen, not shown herein, comprises a screen similar to that shown in Figure 6 herein, but instead of the screen curving in a circular cylindrical path in the X / Z plane, it follows an elliptical curve about a focal line f1 that extends in the Z direction, orthogonal to the x/y plane. The viewer may locate themselves along the focal line of the screen, but not necessarily.
  • the visual display screen may comprise a part elliptical screen or part spherical screen as shown in Figure 10 herein.
  • Other curves which are non-circular, non-spherical or non-elliptical may also be used.
  • the visual display device may comprise a flexible sheet display, which can adapt to a variety of curves.
  • in the best mode, and in most applications, the screen is likely to be a flat planar screen.
  • the example embodiments herein describe a flat screen display, but it will be understood by the skilled person that in the general case, the display screen need not be flat or planar.
  • the display monitor comprises a display surface surrounded by an outer perimeter which delineates the viewable area of the display device, and which forms the outer boundary of the virtual window.
  • the screen surface may be of various three dimensional shapes but in the present embodiment is a rectangular flat planar surface.
  • the tracking apparatus 800 comprises a plurality of at least two video cameras 1101 , 1102 arranged to cover a common camera field of view coincident with the first region 403, so that the cameras can detect a person located in the first region; a hardware grabber 1103 in communication with the plurality of cameras; a software frame grabber 1104 in communication with the hardware grabber 1105; and a memory buffer 1106 which may be located in the GPU’s RAM or in any other suitable location.
  • the cameras, hardware grabber and software grabber may comprise known video capture products from Kaya Instruments or any other suitable cameras, hardware grabber and software grabber.
  • the cameras are located in real three dimensional space, and the physical locations of each camera in relation to each other camera and in relation to the three dimensional space are recorded in an initial set up procedure. If the cameras are Lidar cameras on a hand held portable device having a visual display, for example an Apple ® iPad ® device, the positions of the Lidar cameras in relation to the screen will be fixed, and the orientation of the whole hand held device in real three dimensional space may be determined from internal inertial gyroscopes or other orientation sensors within the handheld device.
  • Each of the cameras 1101 , 1102 captures a video image which is converted to digitised format by the hardware grabber and software grabber 1104, 1105 and which outputs digitised frames of video data which are stored in the buffer 1106.
  • the buffer 1106 stores in real time a first sequence of image frames from first hardware camera 1101 and a second sequence of image frames captured by second hardware camera 1102, where each of the sequences of captured video frames represents an image over a common field-of-view of the two cameras in the first region of real three-dimensional space, in which a human viewer is located. All processes operate as continuously operating real-time processes.
  • the information stored in the buffer may be stored in temporary files which can be automatically deleted after use.
  • the frame grabber software may use the Kaya DLL (dynamic link library) or any suitable software to discover available hardware grabbers and connect to the hardware cameras.
  • the frame grabber sets up the hardware grabber and buffer memory to collect and store image frames of data, as described above.
  • the software frame grabber allocates memory to the buffers, in which the raw data frames captured by the hardware Grabber will be saved, as described above.
  • the software frame grabber updates the configuration of the cameras, and notifies the hardware grabber about the area of memory buffer allocated to the captured frames of video data.
  • the software frame grabber sets up a callback function to be triggered by the hardware grabber each time a new frame of video data becomes available.
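  • A minimal sketch of this callback pattern is given below. The grabber interface shown is hypothetical and does not reproduce the Kaya Instruments API; it merely illustrates registering a per-frame callback that records where each captured frame was written.

        import queue

        frame_queue = queue.Queue(maxsize=16)   # hands frame references from capture to processing

        def on_new_frame(camera_id, buffer_index, timestamp):
            """Callback triggered by the hardware grabber for each captured frame;
            it records which buffer slot holds the frame rather than copying the data."""
            frame_queue.put((camera_id, buffer_index, timestamp))

        # Hypothetical set-up calls, shown only to illustrate the sequence described above:
        # grabber = discover_hardware_grabbers()[0]      # find available hardware grabbers
        # grabber.allocate_buffers(count=16)             # memory area for raw captured frames
        # grabber.register_frame_callback(on_new_frame)  # fired each time a frame becomes available
        # grabber.start_acquisition()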
  • the image frame is input into the artificial intelligence analysis program to determine key points from the image.
  • the key points are determined using artificial intelligence which has been trained to recognize keypoints from a database of models showing images of human faces with keypoints identified. These key points are used to determine the mid-eye point. Determination of the mid-eye point can be done using any appropriate method of analysis. One such method includes triangulation of the face using the identified key points.
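  • In the simplest case, and assuming the landmark stage returns pixel coordinates for the two eye centres, the mid-eye point may be taken as their midpoint, for example:

        def mid_eye_point(left_eye_xy, right_eye_xy):
            """Mid-eye point: the midpoint between the two detected eye centres,
            expressed in two-dimensional image-frame coordinates."""
            return ((left_eye_xy[0] + right_eye_xy[0]) / 2.0,
                    (left_eye_xy[1] + right_eye_xy[1]) / 2.0)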
  • the orientation of the user or subject viewer’s head may be inferred or partially inferred from determining the user’s body position. This may provide additional data to determine the viewer’s look direction for example if one or both cameras cannot see all key points. Body recognition and location determination is as described hereinafter.
  • the method may also use body recognition and tracking.
  • the system mainly tracks the head of the user; however, if there is any uncertainty about the viewer’s look direction determined from capturing face key points, the body tracking information may be used to increase the accuracy of the system.
  • Some key points which may be used in the tracking system are depicted in Figure 19, including the mid-eye point 1901; nose 1911; right eye 1912; left eye 1902; right ear 1913; left ear 1903; neck 1910; right shoulder 1914; right elbow 1915; right wrist 1916; left shoulder 1904; left elbow 1905; left wrist 1906; right hip 1917; right knee 1918; right ankle 1919; left hip 1907; left knee 1908; and left ankle 1909.
  • the key points detailed above may be separated into subcategories or levels, such as:
  • Level 1: the mid-eye point;
  • Level 2: nose, right eye, left eye, right ear and left ear;
  • Level 3: neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee and left ankle.
  • These subcategories may be prioritized in a hierarchy within the system such that precedence is given to data collected from the mid-eye point, then data collected from the subcategory two points and finally the subcategory three points.
  • the additional subcategory two and three key points allow the accuracy of the position tracking and localization from the data to be improved.
  • the system may also rely upon the additional body tracking key points if the system is unable to identify facial key points, for example if the head is turned such that only one eye is visible by the cameras.
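  • The prioritisation described above might be implemented along the following lines; the level groupings follow the list above, while the fallback function itself is an illustrative assumption rather than the method of this disclosure.

        LEVEL_1 = ("mid_eye",)
        LEVEL_2 = ("nose", "right_eye", "left_eye", "right_ear", "left_ear")
        LEVEL_3 = ("neck", "right_shoulder", "right_elbow", "right_wrist",
                   "left_shoulder", "left_elbow", "left_wrist",
                   "right_hip", "right_knee", "right_ankle",
                   "left_hip", "left_knee", "left_ankle")

        def best_available_keypoints(detected):
            """Return the highest-priority group of key points present in `detected`,
            a dict mapping key point names to (x, y) image-frame coordinates."""
            for level in (LEVEL_1, LEVEL_2, LEVEL_3):
                found = {name: detected[name] for name in level if name in detected}
                if found:
                    return found
            return {}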
  • AI: artificial intelligence
  • Each of the cameras 1101 , 1102 captures a video image which is converted to digitized format by the hardware grabber and software grabber 1104, 1105 and which outputs digitized frames of video data which are stored in the buffer 1106.
  • the buffer 1106 stores in real time a first sequence of image frames from first hardware camera 1101 and a second sequence of image frames captured by second hardware camera 1102, where each of the sequences of captured video frames represents an image over a common field-of-view of the two cameras in the first region of real three-dimensional space, in which a human viewer is located. All processes operate as continuously operating real-time processes.
  • the information stored in the buffer may be stored in temporary files which can be automatically deleted after use.
  • the frame grabber software may use the Kaya DLL (dynamic link library) or any suitable software to discover available hardware grabbers and connect to the cameras. The frame grabber sets up the hardware grabber and buffer memory to collect and store image frames of data, as described above.
  • the software frame grabber allocates memory to the buffers, in which the raw data frames captured by the hardware grabber will be saved, as described above.
  • the software frame grabber updates the configuration of the cameras, and notifies the hardware grabber about the area of memory buffer allocated to the captured frames of video data.
  • the software frame grabber sets up a callback function to be triggered by the hardware grabber each time a new frame of video data becomes available.
  • the image frame is input into the artificial intelligence analysis program to determine keypoints from the image.
  • the key points are determined using artificial intelligence which has been trained to recognize keypoints from a database of models showing images of human bodies with keypoints identified. These keypoints are used to determine the mid-eye point and/or the location of the human. These tasks can be done using any appropriate method of analysis. One such method includes triangulation using the identified key points.
  • the video cameras jointly observe real-world events in the first region 403 of real three-dimensional space.
  • Each camera captures the respective stream of video data, which updates the data in the memory buffer and saves the two parallel streams of video frames to the memory buffer.
  • the frame grabber receives a notification about the data being saved to the memory buffer from the hardware grabber.
  • on each frame captured by each camera, the hardware grabber notifies the software frame grabber about the capture event and the location in memory where the frame is saved.
  • the video frame data is not transferred to the frame grabber, but is saved to the memory buffer instead. Saving the video frames directly to the memory buffer enables a fast rate of high-speed frame capture.
  • the information saved to the buffer may be advantageously stored in temporary files which can be automatically deleted after use, reducing the need to manually delete files regularly.
  • the software frame grabber reads data from the buffer and transfers it to a frame processing pipeline.
  • the software frame grabber converts the frame data 1300 to a digitized format ready to be transported over a network or locally.
  • the frame processing pipeline is preferably implemented in C++, but could also be implemented in any other suitable language, for example Python.
  • the frame data is combined with meta data, comprising source data and frame capture time stamp data, so that information describing the source of the frame, i.e. which camera the frame was captured by, and the time of capture of the frame in real time is added to each frame.
  • the combined meta data and frame data generated by the software frame grabber is sent as a message via ZMQ messaging 1301 to the frame processing module 1302.
  • ZMQ messaging is a known asynchronous messaging library application.
  • combined meta data and frame data packets may be transferred from the software frame grabber to the frame processing module 1302 via a shared memory buffer.
  • any embeddable networking libraries compatible with C++, Python or any other programming language suitable for processing video material may be used.
  • frame data can be distributed to other computers in a same network.
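  • A minimal sketch of the ZMQ transport described above is given below, assuming the pyzmq binding; the endpoint address and message layout (a JSON metadata part followed by the raw frame bytes) are illustrative choices, not requirements of the disclosure.

        import json, time
        import zmq

        ctx = zmq.Context.instance()
        sock = ctx.socket(zmq.PUB)
        sock.bind("tcp://*:5555")          # illustrative endpoint

        def publish_frame(camera_id, frame_bytes):
            """Send one captured frame together with its source and timestamp metadata
            as a two-part ZeroMQ message, in the spirit of the description above."""
            meta = {"camera": camera_id, "timestamp": time.time()}
            sock.send_multipart([json.dumps(meta).encode("utf-8"), frame_bytes])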
  • the frame processing pipeline comprises three consecutive stages of (a) a pre-processing stage 1400 for preprocessing the frame data; (b) face and/or body detection stage 1401 for detecting images of faces and/or bodies in the image frames; and (c) a facial and/or body landmark detection stage 1402.
  • a pre-processing stage 1400 for preprocessing the frame data
  • face and/or body detection stage 1401 for detecting images of faces and/or bodies in the image frames
  • Each of the stages may be constructed as separate modules.
  • the face and/or body detection stage 1401 may be implemented as a first machine learning module.
  • Each frame is converted into a standard image and is analysed by two machine learning computer vision modules.
  • an RGB image is described; however, the type of image is not limited to a standard RGB image. It will be apparent to a person skilled in the art that a Grayscale image, an RGBa image, a palette image such that each pixel is coded using one number, or another such image type may also be used.
  • a first computer machine learning module detects faces in the image, and a second machine learning module searches for and finds facial landmarks.
  • a third computer machine learning module detects bodies in the image, and a fourth machine learning module searches for and finds body landmarks. All data found, relating to facial recognition, facial landmarks, body recognition and body landmarks is passed to a frame aggregator 1403.
  • raw frame data 1404 in Bayer format as captured by a camera is processed and transformed into standard RGB image data by de-mosaicing 1405, to obtain an RGB image 1406 (in the current implementation, this is partly done by the hardware grabber).
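  • Assuming OpenCV is available, the de-mosaicing step might be sketched as follows; the Bayer pattern constant depends on the camera sensor and is an assumption here.

        import cv2

        def demosaic(raw_bayer):
            """Convert a single-channel Bayer mosaic frame (uint8, H x W) into a
            three-channel RGB image. The Bayer pattern is camera dependent and is
            assumed here, for illustration, to be RGGB."""
            return cv2.cvtColor(raw_bayer, cv2.COLOR_BayerRG2RGB)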
  • each RGB image is processed using a trained face detection machine learning module 1407 to extract the locations of faces on the RGB image, which correspond to locations in the field of view of the camera in real three-dimensional space in the first region.
  • the face is located as being within a face box, comprising a two dimensional rectangle on the RGB image.
  • for each face discovered in the RGB image, the coordinates of the face box are returned, and a separate face picture is extracted from the main RGB image.
  • in a case where there is a single person in the first region and a single face detected, this results in a single face box location and a single face picture being extracted.
  • a facial image is cropped to produce an RGB image 1409 of a detected face.
  • each viewer identified in the image frame has their face identified in a corresponding respective face box 1408, and the location coordinates of each face box are recorded.
  • a separate face picture 1409 is extracted from each face box where a face is detected.
  • Facial landmarks 1411 can include facial features such as eyes, nose, mouth, lips, nostrils, ears or other physical features of a human face.
  • a midpoint equidistant between the centres of the eyes of the face (the mid-eye position) is determined.
  • the horizontal and vertical (x, y) coordinates within the two dimensional RGB image frame are identified for the mid-eye position of the identified face.
  • the mid-eye position for the identified face is used in other parts of the apparatus to determine the three-dimensional position of the human viewer within the real three-dimensional space in the first region.
  • the mid-eye coordinates for the frame are passed to the frame aggregator.
  • the process shown with reference to Figure 14 herein is applied for each video frame, and each video frame timestamp so that the position of the viewer in each two-dimensional video frame is known at the time that the video frame was captured.
  • Video frames from each camera are processed in sequence to provide the coordinates of the mid-eye position of faces within the first region 403 over a sequence of timed image frames.
  • a plurality of frames of RGB images are processed in parallel, one stream per camera, each stream giving a slightly different view into the first region, and each stream identifying the same human face within the first region 403, whereby at each time there are produced at least two different RGB images taken from different angles by different cameras, each identifying a respective mid-eye position within the RGB image taken by that camera.
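  • Given calibrated cameras, the pair of per-camera mid-eye coordinates can be combined into a single position in real three-dimensional space by triangulation. The sketch below assumes 3x4 projection matrices obtained from a prior stereo calibration and uses OpenCV's triangulation routine; it is illustrative, not the specific method of this disclosure.

        import cv2
        import numpy as np

        def triangulate_mid_eye(P1, P2, xy1, xy2):
            """Recover the 3D position of the mid-eye point from its pixel coordinates
            in two cameras. P1 and P2 are the 3x4 projection matrices of the calibrated
            cameras; xy1 and xy2 are the (x, y) mid-eye coordinates in the corresponding
            image frames."""
            pts1 = np.array(xy1, dtype=float).reshape(2, 1)
            pts2 = np.array(xy2, dtype=float).reshape(2, 1)
            X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)   # homogeneous 4x1 result
            return (X_h[:3] / X_h[3]).ravel()                 # (x, y, z) in the calibration frame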
  • the image may be reprocessed using a trained body detection machine learning module 2107 to extract the locations of bodies on the RGB image, which correspond to locations in the field of view of the camera in real three-dimensional space in the first region.
  • the body may be located as being within a body box, comprising a two dimensional rectangle on the RGB image.
  • for each body discovered in the RGB image, the coordinates of the body box are returned, and a separate body picture is extracted from the main RGB image.
  • the use of an RGB image is described; however, the type of image is not limited to a standard RGB image. It will be apparent to a person skilled in the art that a Grayscale image, an RGBa image, a palette image such that each pixel is coded using one number, or another such image type may also be used. In a case where there is a single person in the first region and a single body detected, this results in a single body box location and a single body picture being extracted. A body image is cropped to produce an RGB image 1409 of a detected body.
  • each viewer identified in the image frame has their body identified in a corresponding respective body box 2108, and the location coordinates of each body box are recorded.
  • a separate body picture 2109 is extracted from each body box where a body is detected.
  • each body image or picture is analysed to find and identify body landmarks or key points.
  • a trained machine learning algorithm is used to detect the body landmarks on the RGB image.
  • the machine learning algorithm may be trained using Darknet Neural Network, MobileNets, or the like.
  • based upon the coordinates within the RGB image of the individual body landmarks, there may be determined or calculated for each body picture an area between the neck, right shoulder, left shoulder, right hip and left hip (the mid-body area).
  • the horizontal and vertical (x, y) coordinates within the two dimensional RGB image frame can be identified for the mid-body area of the identified body.
  • the mid-body area for the identified body may be used in other parts of the apparatus to determine the three-dimensional position of the human viewer within the real three-dimensional space in the first region.
  • the mid-body area coordinates for the frame and/ or the key-points coordinates are passed to the frame aggregator.
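  • One plausible reduction of the mid-body area to a single (x, y) coordinate is the centroid of the five torso key points; this is an assumption for illustration, since the disclosure leaves the exact calculation open.

        import numpy as np

        TORSO_KEY_POINTS = ("neck", "right_shoulder", "left_shoulder", "right_hip", "left_hip")

        def mid_body_point(keypoints):
            """Approximate the mid-body area by the centroid of the detected torso key
            points; `keypoints` maps key point names to (x, y) image-frame coordinates."""
            pts = np.array([keypoints[k] for k in TORSO_KEY_POINTS if k in keypoints], dtype=float)
            if pts.size == 0:
                return None
            return tuple(pts.mean(axis=0))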
  • the process shown with reference to Figure 21 herein is applied for each video frame, and each video frame timestamp so that the position of the viewer in each two-dimensional video frame is known at the time that the video frame was captured.
  • video frames 2100 from each camera are processed in sequence to provide the coordinates of the body key-points and/or mid-body area of bodies within the first region 403 over a sequence of timed image frames.
  • a plurality of frames of RGB images are processed in parallel, one stream per each camera, each stream giving a slightly different view into the first region, and each stream identifying the same human body within the first region 403, whereby at each time there are produced at least two different RGB images taken from different angles from different cameras each identifying a respective location of the body key-points and/or mid-body area within an RGB image taken by that camera.
  • in process 2101, raw frame data is captured by a camera.
  • in process 2102, the raw frame data undergoes Bayer de-mosaicing to obtain an RGB image 2103.
  • body detection from the RGB images comprises inputting the RGB image 2103 into a pre-trained machine learning model to detect parts of the image which represent a human body in process 2104, to obtain a set of body box coordinates 2105, and cropping the image in process 2106 to create an RGB image of the detected body.
  • Body landmark detection comprises inputting the RGB image with a detected body and detected body features 2107 into a machine learning algorithm
  • the frame is then passed on to a frame aggregation process.
  • the frame aggregation module aggregates data from multiple sources and serves it in aggregated format to the output.
  • Output formats may include: Unreal ® Engine UDP socket, terminal / console, Text/CSV data; or other standard data formats.
  • Frame coordinates and meta data 1500 are aggregated 1501 to form aggregation coordinates from multiple sources.
  • a process 1502 of aggregating and saving debug data operates continuously.
  • Processed coordinates 1503 are served to an output port 1504.
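  • The UDP output path might look like the following sketch; the host, port and CSV payload layout are placeholders chosen for illustration rather than values defined by this disclosure.

        import socket

        UE_HOST, UE_PORT = "127.0.0.1", 7777     # illustrative endpoint, not from the disclosure
        udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

        def serve_coordinates(timestamp, x, y, z):
            """Send one aggregated viewer-position record as a CSV line over UDP,
            in the spirit of the Unreal Engine UDP socket output listed above."""
            payload = f"{timestamp:.6f},{x:.3f},{y:.3f},{z:.3f}\n".encode("ascii")
            udp.sendto(payload, (UE_HOST, UE_PORT))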
  • in FIG. 16 there is illustrated schematically the main stages of the full pipeline process for converting images of real-world observed humans into a set of horizontal and vertical coordinates in real three-dimensional space corresponding to the mid-eye positions of one or more human viewers, from a stream of image frames from a single camera.
  • a video camera captures 1600 a stream of image frames representing observations of real world events within a real three-dimensional space (the first region 403), and the hardware grabber 1104 stores those images to the buffer 1106 in real time as hereinbefore described. In this context, real time equates to a frame capture rate of between 30 and 120 frames per second.
  • the software frame grabber feeds the captured frames into a first frame processing stage comprising a first machine learning algorithm 1601 and a second frame processing stage comprising a second machine learning algorithm 1601.
  • the first and second frame processing stages operate in parallel to each other.
  • the outputs of the first and second frame processing stages are fed into an aggregator stage 1603, which is implemented as a machine learning processor.
  • the aggregator stage produces an output 1604 of a set of coordinates within the captured image frames which correspond to the coordinates of the mid-eye position of the human viewer.
  • the frame processing rate of the software frame grabber 1106 through to the output of x, y coordinates of the mid-eye positions is in the range 28 to 32 frames per second.
  • the frame processing rate of the frame processors and the aggregator to produce the output of x, y coordinates of the mid-eye positions, in the current version, is in the range 30 to 45 frames per second.
  • the frame aggregation module aggregates data from multiple sources and serves it in aggregated format to the output.
  • Output formats may include: Unreal ® Engine UDP socket, terminal / console, Text/CSV data; or other standard data formats.
  • Frame coordinates and meta data 1500 are aggregated 1501 to form aggregation coordinates from multiple sources.
  • a process 1502 of aggregating and saving debug data operates continuously.
  • Processed coordinates 1503 are served to an output port 1504.
  • in FIG. 22 there is illustrated schematically the main stages of the full pipeline process for converting images of real-world observed humans into a set of horizontal and vertical coordinates in real three-dimensional space corresponding to the mid-eye positions and/or body key-points and/or mid-body areas of one or more human viewers, from a stream of image frames from a single camera.
  • a video camera captures 2200 a stream of image frames representing observations of real world events within a real three-dimensional space (the first region 403), and the hardware grabber 1104 stores those images to the buffer 1106 in real time as hereinbefore described. In this context, real time equates to a frame capture rate of between 30 and 120 frames per second.
  • the software frame grabber feeds the captured frames into a first frame processing stage comprising a first machine learning algorithm 2204, a second frame processing stage comprising a second machine learning algorithm 2205, a third frame processing stage comprising a third machine learning algorithm 2206 and a fourth frame processing stage comprising a fourth machine learning algorithm 2207.
  • the first, second, third and fourth frame processing stages operate in parallel to each other.
  • the outputs of the first, second, third and fourth frame processing stages are fed into an aggregator stage 2208, which is implemented as a machine learning processor.
  • the aggregator stage produces an output 2209 of a set of coordinates within the captured image frames which correspond to the coordinates of the mid-eye and/or body key-points and/or mid-body area of the human viewer.
  • the frame processing rate of the software frame grabber 1106 through to the output of x, y coordinates of the mid-eye and/or mid-body positions is in the range 28 to 32 frames per second.
  • the frame processing rate of the frame processors, and the aggregator to produce the output of x, y coordinates of the mid-eye and/or body key-points and/or mid-body area (stages 2204 to 2209) in the current version is in the range 30 to 45 frames per second.
  • the first machine learning model, which is trained to detect faces and generate face box data and face box coordinates, comprises a neural network, as is known in the art.
  • the neural network is trained on image frame examples each containing a representation of a human face.
  • the neural network analyses each new frame for a set of features resembling a human face.
  • the neural network predicts four coordinates for each face box, the face box being a rectangle in two-dimensional image space (image frame space) which contains or is likely to contain a human face.
  • the network used for performing the feature extraction is based on Darknet-53 and contains 53 convolutional layers as shown in Figure 17 herein.
  • the output of the face box detection module is, for each image frame, the two dimensional coordinates in frame image space of the face box which contains an image of a human face.
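  • The disclosure identifies a Darknet-53 based detector; one way to run such a Darknet-style network, assuming a trained configuration and weights file are available, is through OpenCV's DNN module, as sketched below with placeholder file names.

        import cv2

        # Placeholder file names; a Darknet-style face detector trained as described is assumed.
        net = cv2.dnn.readNetFromDarknet("face_detector.cfg", "face_detector.weights")
        output_layers = net.getUnconnectedOutLayersNames()

        def detect_face_boxes(rgb_image, conf_threshold=0.5):
            """Return candidate face boxes as (x, y, w, h) rectangles in pixel coordinates."""
            h, w = rgb_image.shape[:2]
            # scale to [0, 1], resize to the network input size, keep RGB channel order
            blob = cv2.dnn.blobFromImage(rgb_image, 1 / 255.0, (416, 416), (0, 0, 0), False, False)
            net.setInput(blob)
            boxes = []
            for output in net.forward(output_layers):
                for det in output:          # det = [cx, cy, bw, bh, objectness, class scores...]
                    if float(det[4]) < conf_threshold:
                        continue
                    cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                    boxes.append((int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)))
            return boxes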
  • the third machine learning model, which is trained to detect bodies and generate body box data and body box coordinates, comprises a neural network, as is known in the art.
  • the neural network is trained on image frame examples each containing a representation of a human body.
  • the neural network analyses each new frame for a set of features resembling a human body.
  • the neural network predicts four coordinates for each body box, the body box being a rectangle in two-dimensional image space (image frame space) which contains or is likely to contain a human body.
  • the network used for performing the feature extraction is based on Darknet-53 and contains 53 convolutional layers as shown in Figure 17 herein.
  • the output of the body box detection module is, for each image frame, the two dimensional coordinates in frame image space of the body box which contains an image of a human body.
  • Detection and recognition of facial landmarks is achieved using a known neural network.
  • the RGB images are cropped from the complete frame to extract just the part of the RGB image 1409 which contains a representation of a human face.
  • facial landmark detection algorithms are used to calculate the position of the eyes of the face.
  • detecting facial landmarks becomes a less computationally intensive task, and as a result smaller and faster neural network architectures can be used in the facial landmark processing module compared to the facial detection processing module.
  • the neural network is based on the MobileNet and is optimised for high speed.
  • the network architecture of the second machine learning algorithm is as set out in Figure 18 herein.
  • the output of the facial landmark detection is a list of individual facial features, together with the coordinates of each facial feature within the two dimensional image frame space.
  • machine learning modules comprise a known computer platform having one or more data processors, memories, data storage devices, input and output interfaces, graphical user interfaces and other user interfaces such as voice command; input/output ports and communications ports.
  • Detection and recognition of body landmarks is achieved using a known neural network.
  • the RGB images are cropped from the complete frame to extract just the part of the RGB image 1409 which contains a representation of a human body.
  • body landmark detection algorithms are used to calculate the position of the body key-points and/or mid-body area.
  • detecting body landmarks becomes a less computationally intensive task, and as a result smaller and faster neural network architectures can be used in the body landmark processing module compared to the body detection processing module.
  • the neural network is based on the Darknet Neural Network and is optimised for high speed.
  • the network architecture of the fourth machine learning algorithm is as set out in Figure 18 herein.
  • the output of the body landmark detection is a list of individual body features, together with the coordinates of each body feature within the two dimensional image frame space.
  • the method described herein converts a stream of two-dimensional images captured as a stream of two-dimensional image frames of real world observed humans present within a three dimensional space into a set of horizontal and vertical coordinates in said real three-dimensional space, said method comprising: capturing a stream of image frames representing objects within a three-dimensional space; capturing said stream of image frames substantially in real time; inputting said stream of image frames into a plurality of independently operating machine learning algorithms ML1 - MLn; each said machine learning algorithm being pre-trained to identify a corresponding respective set of key points of a human body; generating an output of each of said machine learning algorithms, said output comprising a set of X, Y coordinates in real three-dimensional space wherein each individual said X, Y coordinate represents a key point of a human body; and aggregating a plurality of individual X, Y coordinates to produce a plurality of individual streams of X, Y coordinates each of which represents the movement of a key point of a said human body.
  • machine learning modules comprise a known computer platform having one or more data processors, memories, data storage devices, input and output interfaces, graphical user interfaces and other user interfaces such as voice command; input/output ports and communications ports.
  • a large screen high definition digital TV monitor may be positioned on a wall of a room, for example a basement room having no real windows, to provide a virtual window out onto a digitally generated virtual scene such as a landscape, seascape, forest or city scene, wherein, as the user moves their position around the room, the digitally generated scene changes as if the viewer were in a room positioned within the scene and looking out onto the scene via the visual display monitor, which acts as a transparent window.
  • Use in such an application may enable better usage of rooms which have limited possibility for natural outside views, for example in hotel rooms which have limited or undesirable natural outside views, underground conference rooms, underground cellar conversions.
  • the virtual window disclosed herein may be used in underground railway transport, for example extended tunnels under rivers or seaways, to give passengers the impression of travelling through scenery, where there is no natural above ground outside view other than the dark sidewalls of the tunnel itself.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

An interactive system that enables scene image information displayed within a visual display monitor of specific dimensions to be adjusted, resized and /or shifted according to the position of a viewer with respect to the display monitor. The system responds precisely in real-time to movement of a user to provide a realistic 3D visual experience without the need for any wearables such as identification labels, goggles or headsets. The system is also designed so that it can be readily extended to increase coverage and performance by the modular addition of a multiplicity of tracking and/or display systems

Description

VIRTUAL WINDOW
Field of the Invention
[0001] The present invention relates to image processing, including but not limited to image generation and image display.
Background of the Invention
[0002] In many image processing applications, there is a requirement for a virtual reality environment. These include applications such as video games, computer implemented training systems, for example flight simulators for commercial or military use, battlefield simulations, deep sea engineering simulation, fire-fighting simulations, entering hostile environments or buildings, or the like, and virtual tours of monuments, buildings, real estate and the like.
[0003] In a first known virtual reality system, a viewer’s eyes are surrounded by a virtual reality headset, where sensors in the headset determine the orientation and look direction of the viewer and the headset presents a digitally generated image on a display screen in the headset, which depends on the look direction of the viewer, such that the viewer finds themselves visually in a digitally generated three-dimensional coordinate space which changes depending on the orientation of their head in real three dimensional space. There may be other sensory interfaces such as earphones or audio headsets, and body worn position sensors.
[0004] The known virtual reality headsets have the characteristics that they generally enclose the viewer’s optical senses so that the viewer cannot see the real environment around them. The viewer is fully immersed in a virtual environment in respect of sight sensory perception, and this limits the application of this type of system to applications where the viewer’s sight senses are fully occupied by a digitally generated virtual scene, with no input from the real environment. [0005] In other prior art applications which use a flat or curved video monitor screen where the user does not wear any goggles or viewing headset, it is known to provide a side to side scan facility of a three-dimensional scene, for example in a flight simulator where a user can look left, right, up or down from a virtual cockpit, and a video screen shows a view as if looking left, right, up or down. The look direction is adjusted either using cursor controls on a computer keyboard, or a mouse or joystick control. An example of such an application is Microsoft Flight Simulator X in which the head position of a pilot is assumed fixed in real space, i.e. looking at a flat screen display monitor, but the view direction can be varied by the up/down and left/right cursors, or other assigned keyboard keys. In this prior art system, the realism of the screen view is limited because the virtual view position of the pilot user is effectively behind the video monitor or screen, whereas the actual human user is positioned in front of the video monitor.
Summary of the Invention
[0006] The embodiments and specific methods provide an interactive system that enables scene image information displayed within a monitor of specific dimensions to be adjusted (resized and shifted) according to the position of a viewer with respect to the display monitor. The system responds precisely in real-time for a realistic experience without the need for any wearables such as identification labels, goggles or headsets. The system is also designed so that it can be readily extended to increase coverage and performance by the modular addition of a multiplicity of tracking and/or display systems.
[0007] The system comprises: a tracking module for tracking the location and position of a human viewer in relation to a real-world three-dimensional space, for example a room; a virtual three-dimensional scene generation engine for generating a virtual three-dimensional scene; a module for adjusting or resizing the digitally generated three-dimensional scene data to account for a viewpoint of a determined position of a viewer in relation to a real three-dimensional space; a scene formatting module for reformatting the digitally generated virtual three-dimensional scene for display on a visual display monitor; and a display monitor for displaying a virtual scene.
[0008] The tracking system is operable to passively track a human subject within a field of view of the tracking system, to identify human subject viewers by identifying facial features of the human viewer; and to identify a position on a face of a human viewer which is representative of a viewpoint of that viewer. This is achieved by capturing video images of the subject in the field of view of the tracking system in real-time as a stream of video images, analysing the video images using a first machine learning algorithm for facial recognition; and performing recognition of facial landmarks, using a second machine learning algorithm, and calculating the x, y coordinates in two-dimensional image frame space of a mid-eye position of a detected face.
[0009] Specific embodiments and methods presented herein aim to provide a visual display device which, from the point of view of the human user, mimics a real aperture or window looking out over a real scene, but where the scene is a digitally generated virtual scene displayed on a visual display device, such as a liquid crystal display screen, or other like pixelated display screen.
[0010] In a preferred method, a two dimensional digitally generated display represents a scene in three dimensional space, wherein the two dimensional image changes to represent a view of the scene in three dimensional space from different locations in a real three dimensional space on one side of the display device, and the three dimensional space is a virtual space on the other side of the display device from the viewing location.
[0011] Specific embodiments and methods disclosed herein aim to provide a method and apparatus for providing a view of a virtual scene in a virtual three- dimensional coordinate space in a two-dimensional view on a flat or curved screen in which the two-dimensional view on the screen, as viewed from the viewing position of a user, appears to the user as if the user were in a three dimensional space.
[0012] The apparatus and methods herein operate to display real-time digitally generated scene images on a display monitor (also referred to herein as a virtual window or as a synthetic window). The relative position of the viewer to the display monitor causes different parts of the scene image to be displayed on the display monitor as the scene would appear to a viewer from different perspectives of viewpoints in real space.
[0013] The embodiments herein may display real-time scene images on a display monitor also referred to as a virtual window. The position of the viewer causes different parts of the virtual three-dimensional scene image to be displayed as it appears as viewed from different perspectives. The scene image will adjust according to a position and angle of the human viewer in real time as the viewer moves around. The viewer’s position is taken to be the midpoint between the eyes of the viewer and is referred to herein as the mid-eye point.
[0014] In a room, there may be one or more visual display devices, each placed upon a different wall or face in a room. A single virtual scene may be displayed upon one or more virtual windows, each represented by a separate visual display device. Further, there may be more than one visual display device on a single wall. A plurality of separate visual display devices may be coordinated, so that a viewer present in a room and surrounded by a plurality of visual display devices, from their viewpoint in the room may see a virtual three-dimensional scene displayed as two-dimensional images on the one or plurality of visual display devices, with the overall effect that as a person moves around within a room or space, the one or plurality of virtual windows surrounding that viewer appear as if the viewer is looking out onto a real three-dimensional space through the plurality of visual display devices, but the images presented on the plurality of display devices represent a virtual scene, instead of a scene in real three- dimensional space. Each individual visual display device gives a corresponding view onto the virtual three-dimensional scene as would be seen through an aperture surrounding the perimeter of the visual display device, if that aperture were a transparent aperture looking out onto a real three-dimensional space, except that the virtual scene replaces the real scene.
[0015] According to a first aspect of the present invention, there is provided an image processing apparatus comprising: a tracking apparatus for identifying a position of a viewer in relation to said visual display device; a scene generator for generating a virtual three-dimensional scene; a scene adjuster for modifying an output of said three dimensional scene generator in response to an output of said tracking apparatus; and a visual display device for displaying an image of said virtual three- dimensional scene; wherein said scene adjuster receives location information about a position of said viewer from said tracking apparatus, and modifies in real time a virtual three dimensional scene generated by said scene generator such that said modified scene corresponds to a view which the viewer would see from their position through an aperture between said real three dimensional space and said virtual three dimensional space.
[0016] Preferably said tracking apparatus comprises: first and second video cameras, each producing a video stream signal of a real three-dimensional space; a video grabber for converting said video stream signals into a stream of image data frames; and a memory storage device for storing said streams of image data frames.
[0017] Said screen display may have a viewable surface of a shape selected from the set: a flat planar screen; a curved screen; a part-cylindrical screen.
[0018] Preferably said means for determining a position of a viewer comprises a pair of spaced apart video cameras directed with their respective fields of view overlapping each other in real three-dimensional space.
[0019] Preferably the means for generating a virtual three-dimensional scene comprises a digital three-dimensional scene generator, operable to generate a virtual three-dimensional scene having one or a plurality of moving objects.
[0020] Preferably the view seen by the viewer in three-dimensional coordinate space is modified in real-time depending upon the viewer’s actual real position in real three dimensional space in relation to the physical display screen.
[0021] Preferably said image processing apparatus further comprises a video grabber for receiving a stream of video data; and converting said stream of video data into a stream of image data frames. [0022] Preferably there is provided a means for adding a timestamp to each image data frame.
[0023] Preferably said apparatus comprises a machine learning algorithm for detecting facial images.
[0024] Preferably said machine learning algorithm operates to: identify images in said image frames containing human faces; and determine a position in real three-dimensional space of said human face images.
[0025] Preferably said machine learning algorithm operates to identify facial landmark features from a stream of said image data frames; and determine a mid-eye position in real three-dimensional space from said image data frames.
[0026] Preferably the apparatus comprises a machine learning algorithm applied to detect body images.
[0027] The machine learning algorithm may operate to:
Identify images in said image frames containing a human body; and determine a position in real three-dimensional space of the human body images.
[0028] Preferably the image processing apparatus further comprises a formatter operable for: receiving coordinates of an outer perimeter of said display device in virtual three-dimensional space; cropping a view of said virtual three-dimensional scene to coincide with a straight line view from a position of a viewer in said virtual three-dimensional space, through said outer perimeter of said display device.
[0029] According to a second aspect of the present invention there is provided an apparatus for generating a screen image which represents a three- dimensional view from the perspective of a human viewer, said apparatus comprising: a visual display screen defining a perimeter of a viewing aperture or window on which can be displayed a real time virtual image; means for determining a physical position of said viewer’s eyes in relation to a physical position of said viewing aperture or window; means for generating a virtual three-dimensional coordinate space populated with a plurality of objects; means for positioning a representation of said viewer in said virtual three- dimensional coordinate space; means for positioning the viewing aperture/window in the virtual 3D space; means for determining a view as seen by said representation of said viewer in said virtual three-dimensional coordinate space; means for generating a two-dimensional image of said three-dimensional view as seen by said representation of said viewer through the aperture or window said image changing in real time, depending on an orientation of said viewer relative to said aperture /window within said virtual three-dimensional coordinate space.
[0030] According to a third aspect of the present invention there is provided a method for generating a video display screen view which represents a three- dimensional view from the perspective of a subject viewer, positioned adjacent a video display screen, said method comprising: determining a physical position of said subject viewer in relation to a physical position of said visual display screen in real three-dimensional space; generating a virtual three-dimensional scene populated with a plurality of objects in virtual three dimensional space; generating a view of said virtual three dimensional scene, as seen through an aperture coinciding with a perimeter of said display screen, and as viewed from said position in real three dimensional space; and displaying said view on said visual display device.
[0031] Preferably said process of generating a view of said virtual three dimensional scene comprises adjusting a view angle of said three-dimensional scene to correspond with a position of said subject viewer in real three- dimensional space.
[0032] Preferably said process of generating a view of said virtual three- dimensional scene comprises adjusting sizes of said variety of objects in said virtual three-dimensional space, according to a coordinate set of a said position of a viewer in real three-dimensional space. [0033] Preferably said method further comprises formatting said adjusted virtual three-dimensional scene for display as a two-dimensional image of said virtual three-dimensional scene on said display monitor
[0034] Preferably said process of determining a physical position of said subject viewer in relation to a physical position of said visual display screen comprises capturing first and second scenes of video data across a field of view extending in said real three-dimensional space.
[0035] The method preferably further comprises converting said first and second scenes of video data to corresponding first and second streams of video image frames.
[0036] The method preferably further comprises applying a first machine learning algorithm to detect facial images contained in said first and second streams of video images.
[0037] Preferably the method further comprises applying a third machine learning algorithm to detect body images contained in the first and second streams of video images.
[0038] Preferably the method further comprises applying a fourth machine learning algorithm to detect facial landmark features from the first and second streams of video images.
[0039] The method preferably further comprises applying a second machine learning algorithm to detect facial landmark features from said first and second streams of video images.
[0040] The method preferably further comprises determining a mid-eye position of detected facial images contained in said first and second streams of video images. [0041] The method preferably further comprises determining a set of three-dimensional coordinates in real three-dimensional space for each said determined mid-eye position.
[0042] The method preferably further comprises determining said three-dimensional coordinates of said mid-eye position by triangulation between two images of an object, each captured by a different camera.
[0043] Preferably the method further comprises determining a mid-eye position of detected facial images contained in the first and second streams of video images.
[0044] Preferably the method comprises determining a set of three- dimensional coordinates in real three-dimensional space for each of the determined mid-eye positions.
[0045] Preferably the process of determining three-dimensional coordinates of said mid-eye positions comprises triangulation between two images of an object, each image being captured by a different camera.
[0046] Preferably said virtual three-dimensional scene comprises a digitally generated virtual three-dimensional scene.
[0047] The method preferably further comprises generating a two-dimensional image of said view of said three-dimensional scene as seen from said viewer position in three-dimensional real space; and varying said two-dimensional view depending on a change of said position of said viewer in said real three-dimensional space.
[0048] The method preferably further comprises: determining coordinates of an outer perimeter of said display device in virtual three-dimensional space; cropping a view of said virtual three-dimensional scene to coincide with a straight line view from a determined position of a viewer in said virtual three- dimensional space, through said outer perimeter of said display device.
The method preferably further comprises, for a plurality of visual display devices: determining a physical position of said subject viewer in relation to a physical position of each said visual display device in real three-dimensional space; generating a virtual three-dimensional scene populated with a plurality of objects in virtual three dimensional space; generating a corresponding respective view of said virtual three dimensional scene, as seen through an aperture coinciding with a perimeter of each said display device, and as viewed from said position in real three dimensional space; and displaying a said corresponding respective view on each said visual display device, such that for each physical position of the subject viewer in the real three-dimensional space, the view of the virtual three-dimensional scene displayed on each said visual display device is visually consistent with each other view on each other said visual display device.
[0049] Preferably said plurality of views are coordinated on said plurality of visual display devices, such that said subject viewer at said physical position within a first region of real three-dimensional space views a virtual three-dimensional scene on said plurality of visual display devices, which appears to coincide with a second region of real three-dimensional space surrounding said first region of three-dimensional space; and for each position which said subject viewer occupies within said first region of real three-dimensional space, the views displayed on the plurality of visual display devices change in real time, and are coordinated with each other, to give the appearance that the subject viewer is viewing said three-dimensional scene through a plurality of apertures surrounding said viewer.
[0050] According to a third aspect there is provided an apparatus for determining a viewpoint position of a person in a real three-dimensional space, said apparatus comprising: a pair of video cameras each capable of capturing a stream of video data; means for converting each said captured stream of video data into a stream of image data frames; means for storing said stream of image data frames; means for detecting faces in said stream of image data frames; means for detecting facial features in said stream of image data frames; means for determining a viewpoint of a said detected face.
[0051] Preferably said pair of video cameras are spaced apart from each other and are arranged such that their respective fields of view overlap each other in real three-dimensional space. [0052] Preferably said means for converting each said captured stream of video data into a stream of image data frames comprises a video grabber for receiving a plurality of streams of video data and converting each said stream of video data into a stream of image data frames.
[0053] Preferably the image processing apparatus comprises means for adding a timestamp to each image data frame.
[0054] Preferably said apparatus for detecting faces comprises a computer platform and a first machine learning algorithm for detecting facial images;
[0055] Preferably said first machine learning algorithm operates to: identify images in said image frames containing human faces; and determine a position in real three-dimensional space of said human face images.
[0056] Preferably said means for detecting facial features in said stream of image data frames comprises a second machine learning algorithm operable to identify facial landmark features from a stream of said image data frames.
[0057] Preferably said means for determining a viewpoint of said detected face comprises a computer platform operating an algorithm for determining a position in real three-dimensional space from detected facial features in said image data frames.
[0058] Preferably said apparatus for detecting bodies comprises a computer platform and a third machine learning algorithm for detecting body images.
[0059] Preferably said third machine learning algorithm operates to: identify images in said image frames containing a human body; and determine a position in real three-dimensional space of said human body images.
[0060] Preferably said means for detecting body features in said stream of image data frames comprises a fourth machine learning algorithm operable to identify body landmark features from a stream of said image data frames.
[0061] Preferably said means for determining a viewpoint of said detected body comprises a computer platform operating an algorithm for determining a position in real three-dimensional space from detected body features in said image data frames.
[0062] According to a fourth aspect, there is provided an apparatus for determining a viewpoint position of a person in a real three-dimensional space, said apparatus comprising: a pair of video cameras each capable of capturing a stream of video data; means for converting each said captured stream of video data into a stream of image data frames; means for storing said stream of image data frames; means for detecting one or more body features in said stream of image data frames; and means for determining a viewpoint of a said detected body from an analysis of said body features detected from said stream of image data frames. [0063] According to a fifth aspect, there is provided a method for determining a viewpoint position of a person in a real three-dimensional space, said method comprising: capturing first and second streams of video data from first and second positions, said first and second positions spaced apart from each other and covering a common field of view in real three-dimensional space; converting each said captured stream of video data into a corresponding stream of image data frames; storing said streams of image data frames; detecting faces in said stream of image data frames; detecting facial features in said stream of image data frames; and determining a viewpoint of a said detected face.
[0064] Preferably the method further comprises adding a timestamp to each said image data frame.
[0065] Preferably said process of detecting faces comprises: applying a pre-trained machine learning algorithm to recognise images of human faces in said image data frames; identifying images in said image frames containing human faces; and determining a position in real three-dimensional space of said human face images. [0066] Preferably said method further comprises generating a two-dimensional area boundary around an identified facial image in a said image frame; and cropping an area within said two-dimensional area boundary, containing said identified facial image.
[0067] Preferably said process of detecting facial features in said stream of image data comprises: applying a pre-trained machine learning algorithm to recognise images of landmark features of a human face, said landmark features selected from the set: eyes; pupils; nose; lips; eyebrow; temples; chin; moles; teeth; cheeks. [0068] Preferably said process of determining a viewpoint of said detected face comprises: determining a position of each of a pair of eyes of a face in said data frames; determining a mid-point between said positions of said eyes; and assigning said viewpoint to be said mid-point. [0069] Preferably said process of detecting one or more bodies comprises: applying a pre-trained machine learning algorithm to recognise images of one or more human bodies in said image data frames; identifying images in said image frames containing one or more human bodies; and determining a position in real three-dimensional space of one or more images of a human body.
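As a minimal illustrative sketch of the viewpoint determination of paragraph [0068], the mid-point of the two detected pupil positions may be computed as below; the landmark dictionary and its key names are hypothetical and are used only for illustration.

def mid_eye_point(landmarks):
    """Return the viewpoint as the mid-point between the two detected pupil positions.

    landmarks : dict mapping landmark names to (u, v) pixel coordinates,
                for example {"left_pupil": (912, 537), "right_pupil": (968, 541)}
    """
    (ul, vl) = landmarks["left_pupil"]
    (ur, vr) = landmarks["right_pupil"]
    return ((ul + ur) / 2.0, (vl + vr) / 2.0)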
[0070] Preferably said method comprises generating a two-dimensional area boundary around an identified body image in a said frame; and cropping an area within said two-dimensional area boundary, said area containing said identified body image.
[0071] Preferably said process of detecting body features in said stream of image data comprises: applying a pre-trained machine learning algorithm to recognise images of landmark features of a human body, said landmark features selected from the set: eyes; pupils; nose; lips; eyebrow; temples; chin; moles; teeth; cheeks; ears; neck; shoulder; elbow; wrists; hands; chest; waist; hips; knees; ankles; feet; legs; arms.
[0072] In a specific implementation, a frame aggregation module aggregates received data frames from multiple sources and outputs aggregated data.
[0073] In a further aspect there is provided a method of converting a stream of two-dimensional images captured as a stream of two-dimensional image frames of real world observed humans present within a three dimensional space into a set of horizontal and vertical coordinates in said real three- dimensional space, said method comprising: capturing a stream of image frames representing objects within a three- dimensional space; capturing said stream of image frames substantially in real time; inputting said stream of image frames into a plurality of independently operating machine learning algorithms ML1 - MLn; each said machine learning algorithm being pre-trained to identify a corresponding respective set of key points of a human body; generating an output of each of said machine learning algorithms, said output comprising a set of X, Y coordinates in real three-dimensional space wherein each individual said X, Y coordinate represents a key point of a human body; and aggregating a plurality of individual X, Y coordinates to produce a plurality of individual streams of X, Y coordinates each of which represents the movement of a key point of a said human body within said real three-dimensional space.
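The aggregation of the outputs of the plurality of independently operating machine learning algorithms ML1 - MLn into per-key-point coordinate streams might, purely by way of example, be organised as in the following sketch; the detector objects and their predict method are hypothetical stand-ins for the pre-trained algorithms.

from collections import defaultdict

def aggregate_keypoint_streams(frames, detectors):
    """Run each pre-trained detector on every frame and collect, for each
    key point, the time-ordered stream of (x, y) coordinates it produces.

    frames    : iterable of image frames captured substantially in real time
    detectors : list of objects, each with a predict(frame) method returning
                a dict of {key_point_name: (x, y)} (hypothetical interface)
    """
    streams = defaultdict(list)                 # key point name -> [(x, y), ...]
    for frame in frames:
        for detector in detectors:
            for name, xy in detector.predict(frame).items():
                streams[name].append(xy)        # one coordinate stream per key point
    return streams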
[0074] In the preferred embodiments, facial and body recognition is carried out at the same time in parallel by a plurality of parallel operating machine learning algorithms, each configured with and trained upon datasets to identify specific anatomical key points of a viewer’s head and body. The key points are weighted in order of importance in order to determine a look direction of the viewer, using key points from the main body of the viewer as lower weighted key points compared to key points identified on the viewer’s face, so that if the viewer’s face becomes obscured to a monitoring camera, the lower weighted key points from the viewer’s body compensate for the obscured key points on the viewer’s face to help determine the overall look direction of the viewer. The key points on the viewer’s body may be used to establish the overall position of the viewer within a 3-D room or 3-D real coordinate space, and the key points identified on the viewer’s face give specific information on the look direction of the viewer within the 3-D room or 3-D real coordinate space. [0075] Other aspects are as set out in the claims herein, the content of which is incorporated into this summary of invention by reference.
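One possible way of combining weighted key points into a single estimate, consistent with the weighting scheme described above, is sketched below; the particular weights, key-point names and default value are illustrative assumptions rather than values taken from the description.

import numpy as np

# Illustrative weights: facial key points are weighted more highly than body key points.
KEY_POINT_WEIGHTS = {
    "mid_eye": 1.0,
    "nose": 0.8, "left_eye": 0.8, "right_eye": 0.8,
    "neck": 0.4, "left_shoulder": 0.3, "right_shoulder": 0.3,
}

def estimate_viewpoint(detected):
    """Weighted average of whichever key points were detected in the current frame.

    detected : dict of {key_point_name: (x, y, z)} in real 3-D space; facial
               key points may be missing when the face is obscured, in which
               case the lower-weighted body key points dominate the estimate.
    """
    total_w, acc = 0.0, np.zeros(3)
    for name, xyz in detected.items():
        w = KEY_POINT_WEIGHTS.get(name, 0.2)   # default low weight for other body points
        acc += w * np.array(xyz, dtype=float)
        total_w += w
    return acc / total_w if total_w > 0 else None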
Brief Description of the Drawings [0076] For a better understanding of the invention and to show how the same may be carried into effect, there will now be described by way of example only, specific embodiments, methods and processes according to the present invention with reference to the accompanying drawings in which: Figure 1 herein illustrates schematically a view of a virtual window aperture looking out onto a virtual three-dimensional scene from a first viewpoint;
Figure 2 herein illustrates schematically a view of the virtual window aperture looking out onto the virtual three-dimensional scene from a second viewpoint;
Figure 3 herein illustrates schematically in view from above, a field of view of a human viewer showing areas of three-dimensional view and two-dimensional view; Figure 4 herein illustrates schematically an installation of a first embodiment apparatus for generating a virtual three-dimensional scene for viewing through a virtual window in the form of a display screen, in which the position of a human viewer is tracked in real time, and in which the virtual three-dimensional scene changes in real time according to a position of the human viewer;
Figure 5 herein illustrates schematically a view from a position in a first region of real three-dimensional space, looking into a virtual window, into a virtual scene in a virtual three-dimensional space; Figure 6 herein illustrates schematically in overview, an apparatus for tracking a viewer, and for generating and displaying a virtual three-dimensional scene, in which the virtual three-dimensional scene changes in response to a positional movement of a viewer in real three-dimensional space;
Figure 7 herein illustrates schematically a configuration of a plurality of binocular camera sets located around a room, comprising part of a tracking system, capable of viewing and tracking a person throughout substantially all parts of the room;
Figure 8 herein illustrates schematically a first type of display screen for use as a first virtual window;
Figure 9 herein illustrates schematically a second type of display screen for use as a second virtual window;
Figure 10 herein illustrates schematically a third type of display screen for use as a third virtual window;
Figure 11 herein illustrates schematically an overview of part of the tracking apparatus comprising modules for capturing a series of image frames from a binocular camera set, and showing a configuration process for configuring the part of the tracking apparatus;
Figure 12 herein illustrates schematically an overview of the part of a tracking apparatus, shown in Figure 11 , and a process for capturing a series of video image frames from a binocular camera set, and storing image frames to a memory buffer in real time;
Figure 13 herein illustrates schematically operation of the tracking apparatus of Figure 11 herein, combining meta data and frame data, and a method for transporting the frame data to a frame processing module; Figure 14 herein illustrates schematically process stages for frame processing of a stream of digital image frames, including preprocessing of the raw frame data as captured by a camera to result in a stream of RGB images, detecting faces in the RGB images using a first machine learning algorithm, and detecting facial landmarks within RGB images of faces using a second machine learning algorithm in order to detect a mid -eye point of a human viewer;
Figure 15 herein illustrates schematically process stages for frame data aggregation using a third machine learning algorithm to produce an output of x, y coordinates in two-dimensional image frame space representing a position of a mid-eye point of a human viewer in real time;
Figure 16 illustrates schematically an overall data processing pipeline for observing real-world events in a real-world three-dimensional space, including movement of human viewers within the real three-dimensional space, capturing images of a region within a real three-dimensional space, and processing those images to detect humans within the real three-dimensional space, and to determine a position of a human viewer in the real three-dimensional space;
Figure 17 illustrates schematically set up parameters for a neural network machine learning algorithm for performing face recognition and facial feature extraction, and for generating a two dimensional box within an image frame, in which an image of a face is present;
Figure 18 herein illustrates schematically set up parameters for a second neural network machine learning algorithm for performing identification and recognition of facial landmarks in an image of a human face, and for identifying two-dimensional x, y coordinates in image frame space for each identified facial feature;
Figure 19 herein illustrates schematically some of the body landmarks that the system may identify on a person; Figure 20 herein illustrates schematically some of the facial landmarks that the system may identify on a person;
Figure 21 herein illustrates schematically process stages for frame processing of a stream of digital image frames, including preprocessing of the raw frame data as captured by a camera to result in a stream of RGB images, detecting bodies in the RGB images using a third machine learning algorithm, and detecting body landmarks within RGB images of bodies using a fourth machine learning algorithm in order to detect a mid-body point of a human viewer; and
Figure 22 illustrates schematically an overall data processing pipeline for observing real-world events in a real-world three-dimensional space, including movement of human viewers within the real three-dimensional space, capturing images of a region within a real three-dimensional space, and processing those images to detect humans within the real three-dimensional space, and to determine a position of a human viewer in the real three-dimensional space.
Detailed Description of the Embodiments
[0077] There will now be described by way of example a specific mode contemplated. In the following description numerous specific details are set forth in order to provide a thorough understanding. It will be apparent however, to one skilled in the art, that the present invention may be practiced without limitation to these specific details. In other instances, well known methods and structures have not been described in detail so as not to unnecessarily obscure the description.
[0078] In this specification the term “viewer” refers to a human natural person having one or two functioning eyes. For ease of description of the apparatus and methods disclosed herein, it is assumed that the viewer has two eyes, but the apparatus and methods will also operate without modification for a person having only a single functioning eye. [0079] In this specification the term “position”, when referring to a viewer, is defined as the mid-point of a line extending between the geometric centres of the two eye balls of a viewer. As a viewer moves around in real three dimensional space, the position of the viewer changes.
[0080] In this specification, the term “viewpoint” means for a viewer having a pair of functioning eyes, a position midway along a line extending between the centre point of the pupil of each eye. For a viewer having a single eye only the term “viewpoint” means a position at the centre of the pupil of that person’s eye. For a person having two functioning eyes, but who chooses to close or obscure one of their eyes temporarily, the term “viewpoint” means a position at the centre of the pupil of the viewer’s unobscured functioning eye.
[0081] In this specification the term “view direction” refers to a line direction in a horizontal plane in real three dimensional space which bisects, at its mid-point, a line extending between a pair of eyes of a viewer.
[0082] In this specification, the term “aperture” means a region of three dimensional space surrounded by a boundary in three-dimensional space.
[0083] In this specification, the term “two-dimensional image” is used to refer to a digitally generated image for display on a nominally flat planar two-dimensional visual display screen. For ease of description, where a non-planar visual display screen is used, the term “two-dimensional image” is used to mean an image which is displayed on the pixels of a visual display screen, irrespective of the actual three-dimensional shape of that screen, whether it be curved, flat, elliptical cylindrical, a three-dimensional elliptical segment, or any other shape of curve.
[0084] In this specification, when referring to three-dimensional space, axes in real three-dimensional space are denoted (X, Z, Y), and axes in virtual three-dimensional space are denoted (X’, Z’, Y’), in which the letters X and Z denote horizontal axes and the letter Y denotes a vertical axis, and the symbol ‘ denotes virtual three-dimensional space. Coordinate points in real three-dimensional space are denoted (x, z, y) and coordinate points in virtual three-dimensional space are denoted (x’, z’, y’).
[0085] Referring to Figure 1 herein there is illustrated schematically a view of a visual display device in a room, with a digitally generated image representing a view looking out through the see-through window aperture into a digitally generated woodland scene in virtual three dimensional space, as viewed from a position on one side of the aperture. As the position of a human viewer in real three dimensional space moves relative to a frame of the virtual window aperture, the view through the virtual window aperture changes as if the virtual scene were in real three dimensional space on the other side of the virtual window aperture. The viewer and the space behind the virtual window are each in real three-dimensional space, with a wall and aperture separating the viewer and the space behind the virtual window.
[0086] The equivalent real-world situation is represented by a transparent window aperture in an opaque wall, where there is a clear unobstructed line of sight from the viewer through the aperture to the scene in real three-dimensional space, but there is no clear line of sight from the viewer to parts of the scene on the other side of the wall where those parts of the scene are obscured by the opaque wall surrounding the aperture.
[0087] In the virtual window embodiment described herein, there is no clear line of sight through the visual display device comprising the virtual window into the real world behind the visual display device, but due to the digitally generated image displayed on the visual display device, from the viewer’s perspective it appears as if there is a three-dimensional scene visible through the virtual window display device, and as a viewer moves around a room, the viewer views different parts of the three-dimensional scene as if the virtual three-dimensional scene were a real three-dimensional scene being viewed through a window on an opposite side of a wall to the real three-dimensional scene occupied by the viewer.
[0088] Referring to Figure 2 herein, there is illustrated schematically the virtual window aperture of Figure 1 with a human viewer positioned on a first side of the aperture and looking through the aperture from a different position on the first side compared to Figure 1 herein. At the different positions on a first side of the virtual window, the viewer has a different view of the scene on the other side of the aperture and can see different parts of the scene, and the same parts from a different view angle.
[0089] Specific embodiments and methods disclosed herein aim to replicate the situation in which both the viewer and the scene are in real three-dimensional space, with the virtual scene being viewed through a virtual window or aperture, such that the viewer on a first side of the virtual window is in real three-dimensional space, and the aperture is filled with a digital visual display device on which is displayed a two-dimensional image which corresponds with a view into a virtual three-dimensional space on the other side of the aperture to that where the viewer is located.
[0090] An object of the methods and embodiments described herein is to automatically generate in real time the virtual scene in virtual three-dimensional space on the second side of the aperture, the virtual three-dimensional scene appearing to coincide with a real three-dimensional space on the opposite side of the aperture to the viewer, so that from the viewpoint of the viewer the virtual three-dimensional scene appears as if it were a real three-dimensional scene in real three dimensional space on the opposite side of the aperture to which the viewer is located.
[0091] In a best mode, the virtual three-dimensional scene is digitally generated using a three-dimensional scene generator engine, for example the Unreal® Engine and in the general case, would not be a representation of the actual scene in real three-dimensional space on the second side of aperture. For example, for an application in a hotel room, where the display device is mounted on a wall, and the viewer is located in a hotel room in real three-dimensional space on one side of the wall, and on the other side of the wall there is another hotel room, the virtual scene generated and displayed on the display device would not be chosen to correspond to a scene of the other room on the other side of the display device (although it could do), but rather could be generated as a view of a woodland scene, a mountain scene, a cityscape scene, an ocean scene, a fantasy scene or any other digitally generated virtual three-dimensional scene.
[0092] The digitally generated virtual three-dimensional scene may be static, that is, a three-dimensional equivalent of a photograph where the features and objects of the image do not change with time. Alternatively, the virtual three- dimensional scene changes in time. For example, a woodland scene may have a stream of flowing water which flows at a same rate as in a real world woodland scene, and may have birds, deer, or other animals which walk-through the scene at the same rate as in real time. In another example, a city virtual three- dimensional scene may have pedestrians, vehicles, and other moving objects which move around the scene in real time. The digitally generated image is not limited to having virtual three-dimensional objects which move at the same rate of movement as in real life, for example the object may be generated to move more slowly, or more quickly than in real time, equivalent to a “slow motion” or “fast forward” function, but in most applications envisaged, in order to enhance the sense of realism from the perspective of the viewer, objects within the virtual three-dimensional scene will move at the same rate as they would do in a real three-dimensional space in the real world.
[0093] Referring to Figure 3 herein, there is illustrated schematically from above a field of view of a human viewer having normal eyesight. The human viewer has a lateral field of view of approximately 180°, which varies from person-to-person, as measured about a central view direction with both of the viewer’s eyes looking directly ahead relative to the viewer’s skull. The viewer has a lateral angle of peripheral vision of around 120° centred on the central view direction, in which both eyes receive light from the field of view, and within the 120° peripheral vision field of view there is an approximately 60° angle in which the viewer has three-dimensional viewing. Within the 120° range of peripheral vision, in which both eyes receive light from the field of view, the central 60° lateral field of view is three-dimensional, and the outer part of the 120° field of view, further away from the central view direction, gives two-dimensional viewing. Within the 180° lateral field of view the outermost 30° regions of view, between 60° and 90° from the central view direction, are viewable only by the left eye, to the left of the viewer, and by the right eye, to the right of the viewer. For the purposes of determining the viewpoint / position of the human viewer, a virtual line is drawn between the centres of the viewer’s eyes, and the midpoint of that line is taken as the viewpoint position, referred to herein as the mid-eye position.
[0094] Referring to Figure 4 herein, there is illustrated schematically in plan view an installation of the apparatus disclosed herein in a real three-dimensional space 400. The real three-dimensional space comprises a wall 401 in which there is located a virtual window 402, which represents an aperture in the wall 401. The three-dimensional space has first and second horizontal coordinates X, Z and a vertical coordinate Y. Individual positions within the three-dimensional space are represented as (x, y, z) coordinates.
[0095] The real three-dimensional space is separated by the wall 401 into a first region 403 on the first side of the wall and a second region 404 on a second side of the wall, the first and second regions being visibly separated by the opaque wall 401. On the first side of the wall a human viewer can walk around in the first region 403 and look in any direction horizontally, vertically or at an angle. The viewer cannot see the second region 404 of real three-dimensional space on the second side of the wall, because the view of the second region 404 from the first region 403 is obscured by the opaque wall 401. [0096] Located on the wall, preferably in a vertical plane, is the virtual window 402 in the form of a visual display device.
[0097] Referring again to Figure 4 herein, in the example shown, the virtual window aperture comprises a region of three-dimensional space which is bounded by a rectangular boundary of opaque material, in this case, the wall 401. The window aperture contains the visual display device, the outer perimeter of which forms the virtual aperture.
[0098] As a viewer moves to a position on one side of the aperture, the opaque material of the wall surrounding the aperture blocks a direct view to part of the digitally generated virtual scene on the other side of the wall.
[0099] On the second side of the wall in the second region 404, the digitally generated virtual three-dimensional space comprises a plurality of objects labelled 1 to 6 in Figure 4 and shown as circles. Since Figure 4 is in plan view, the circles represent circular cylindrical upright pillars in three dimensions, which could be for example tree trunks in a woodland scene.
[0100] With the human viewer standing at a first position A in the first region 403, denoted by three-dimensional coordinates (x1, z1, y1) in real three-dimensional space, the viewer has a field-of-view into the virtual three-dimensional space on the opposite side of the wall in the second region 404 bounded by the dotted construction lines 405, 406. As the aperture around the visual display device in this example is rectangular, each of dotted construction lines 405, 406 in the example shown represents a vertical plane in real three-dimensional space in the first region 403, and each represents a vertical plane in virtual three-dimensional space in the second region 404. The viewer cannot actually see into the real three-dimensional space coinciding with second region 404, but can only see a virtual three-dimensional space in the second region 404 by looking at the (in this example) vertically mounted planar display screen 402. [0101] If the aperture were a real transparent window into the real three-dimensional space in the second region 404, and the objects 2, 3, 4 were real objects, a viewer standing at position A would be able to directly see objects 2 - 4 by direct line of sight, but would be unable to see object 1 at horizontal position (x’, z’), or objects 5 or 6, each of which are out of direct line of sight through the aperture as viewed from position A.
[0102] If the viewer moves to position B in real three-dimensional space in first region 403, the field of view from position B as shown in plan is bounded by the construction lines 407, 408 in this example, each of which represents a vertical plane, and if the objects 1 - 6 were real objects in real three-dimensional space, the viewer at position B would be able to see objects 3, 4, 5 and 6 but not objects 1 or 2 which would be obscured by the opaque wall 401.
[0103] Similarly, if the viewer moves to a third position C in the first region, and if the aperture 402 were a real transparent window and the objects 1 - 6 were real objects in real three-dimensional space coinciding with the second region 404, the viewer at position C would have a field-of-view bounded by the construction lines 409, 410, each of which in this example represents a vertical plane, and because the viewer is closer to the aperture the field-of-view is wider than at positions A or B and the viewer is able to see objects 1 - 5, but object 6 is obscured from view by the opaque wall 401.
[0104] In the present method, the objects 1 - 6 are digitally generated objects in virtual three-dimensional space displayed on the display screen 402, and which, depending upon the position of the viewer in the first region 403 of real three-dimensional space, move around on the two-dimensional display screen to give the impression, as described above, as if they were real objects in real three-dimensional space.
Overview of Operation [0105] The scene image will adjust according to the position and angle of the viewer relative to the window. (Position is here nominally defined as the midpoint between the eyes and referred to as the ‘mid-eye point’.) For example, if the viewer moves to the left by a distance (+x->), the scene displayed on the display will appear to shift to the right by a distance (-x’->); if the viewer lowers their view (-y->), the scene image shifts up (+y’->); if the viewer moves towards the ‘window’ (-z->) the image zooms out. Objects in the virtual scene image will be adjusted according to the change in the field of view and perspective of the 3D scene corresponding to each new position of the viewer. Detection of the mid-eye point to display of the newly adjusted scene image happens at 60 frames per second so that there is no lag in response detected by the viewer (which can happen at the lower standard video frame rates of 25/30 frames per second). The higher frame rate of 60 Hz helps accommodate small jerky movements but doubles the data load. However, little data is stored since the high volume of image data is reduced to a set of three coordinates x, y and z for each pair of synched camera frames.
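The per-frame update cycle implied by the above description might, in outline, look like the following sketch; the tracker, scene and display objects and their methods are hypothetical placeholders for the tracking system, the three-dimensional scene engine and the display driver respectively.

import time

FRAME_PERIOD_S = 1.0 / 60.0          # 60 Hz target, about 16.7 ms per frame

def run_display_loop(tracker, scene, display):
    """Per-frame loop: track the mid-eye point, adjust the virtual viewpoint,
    render and display, staying within the 60 Hz frame budget."""
    while True:
        start = time.monotonic()
        x, y, z = tracker.mid_eye_position()   # viewer position in real space
        scene.set_viewpoint(x, y, z)           # moving left (+x) makes the scene appear to shift right (-x')
        display.show(scene.render())           # newly adjusted two-dimensional view
        elapsed = time.monotonic() - start     # stay within the ~16 ms budget so no lag is perceptible
        if elapsed < FRAME_PERIOD_S:
            time.sleep(FRAME_PERIOD_S - elapsed)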
[0106] An artist’s conception of a three dimensional scene is adjusted in response to a unique viewer position defined by x, y, and z. Three viewer positions A, B, and C are shown for variation in the horizontal plane (i.e., along the x-z axes or left/right and near/far). The monitor displays for these viewer positions are shown in Figure 5 herein. (Circles in the upper part of the schematic are in fact ‘cylinders’, for example tree trunks, and viewed as such in the x-y plane. The x’, y’ and z’ coordinates relate to the corresponding coordinates within the 3D scene.)
[0107] The tracking system is a passive system in that no artificial markers or structured lighting are required. The hardware consists of two synchronized cameras in stereoscopic configuration, a frame grabber and a host computer. Processing can be subdivided into three parts: (i) stereoscopic image acquisition and pre-processing, (ii) Al extraction of the mid-eye point and other key points, as detailed below, and (iii) calculation of x, y and z of the viewer position. The tracking system utilizes a solid-state solution (no moving parts) but uses a state-of-the-art imaging system to locate the target point. Although there are many methods/technologies to also obtain 3D image information, such as time-of-flight cameras, they lack the resolution, speed and tracking performance required. As a result, the current version of the tracking system is built around a dual camera stereoscopic configuration using high performance, high resolution cameras. This design route also provides for the easy implementation of Al image recognition technologies to be introduced at the individual frame level. This allows for the development of improved learning algorithms for detection and enhanced accuracy in tracking the target point (mid-eye point). Once image processing and determination of the target point has been completed it becomes a matter of triangulation to determine the distance (z) of its corresponding pixel. The large volume of image data generated by the two cameras is reduced to just three numbers x, y and z at the frame level.
[0108] The system is intended to work anonymously in that it does not require the recognition of a particular individual viewer, nor does the system require the processing or storage of such information. However, face recognition could augment performance (as the Al system learns the person user) so it might be offered as an option, even for a limited number of prescribed users.
[0109] The display system hardware consists of a computer, GPUs and graphics cards for rendering and displaying a 3D image onto a large monitor screen. Processing can be subdivided into three parts: (i) rendering of the 3D scene image, (ii) adjusting the scene image according to updated viewer tracking coordinates x, y and z, and (iii) displaying the newly adjusted 3D image scene. The display system comprises a large, high-resolution monitor/display screen such as a 4K monitor which renders the appropriate scene content in terms of the corresponding field of view that would be seen out through a real window. Currently, 3D image scenes are computer generated using Unreal® Engine, but real 3D images could also be used if available (i.e., 3D computer generated scenes are preferred but not essential). [0110] The following criteria are for a Mideye Tracking Prototype:
• Detection of mid-eye point to within 0.02°. This angular tolerance may subsequently be relaxed in later versions.
• Tracking accuracy ° (this requires recognition of the same pixel in the scene, such as the mid-eye point, but otherwise another facial target could be used, such as a freckle, a mole, the left or right eye/pupil, etc.). In fact, once the stereoscopic system has captured the facial measurements (ideally from a suitably positioned frontal shot of the face), the target point could jump to another facial target once the head angle (i.e., when the head turns away from the cameras) and the distance between facial target points were accurately known. That is, the midpoint could be interpolated from other facial features of a known subject.
• Determine the x, y, and z coordinates within a fraction of a frame period, i.e., for 60 Hz, less than or equal to 16 milliseconds.
[0111] Referring to Figure 5 herein, there is illustrated three separate two- dimensional images of virtual three-dimensional scenes as generated and displayed on the display monitor at respective viewer positions A, B and C as shown in Figure 4.
[0112] In addition to modifying the two dimensional view of the virtual three-dimensional scene in response to a movement of viewer location A, B or C horizontally, the tracking system may also track the position of the viewer vertically in the first region 403, and adjust the two dimensional view into the virtual three-dimensional scene accordingly. For example, if the viewer sits down at horizontal position A, reducing the vertical coordinate of their viewing position without changing the horizontal coordinates, the corresponding lower boundary of the field of view of the viewer into the virtual three dimensional scene changes; this lower boundary is represented by a plane defined by a lower straight line edge of the aperture frame of the display device and the viewpoint of the viewer. A corresponding upper plane bounded by the upper horizontal boundary of the display device also limits the upper extent of the view into the virtual three-dimensional scene.
[0113] For convenience of explanation and description, in Figures 4 and 5 herein, there is illustrated schematically a single flat visual display screen on a single wall of a room. However, the apparatus and methods described herein may apply to a plurality of visual display devices arranged on a same surface (for example a same wall) or on different walls, and may be arranged around or surrounding a region of real three-dimensional space (region 403 in Figure 4 herein), so that a room in which a viewer is positioned may be fitted with a plurality of visual display devices, all receiving a virtual scene, wherein the virtual scene as viewed through any virtual window is consistent with the scene as viewed through the other virtual windows surrounding the same region of real three-dimensional space, in the same way that, if the viewer were in a room having apertures or windows looking out onto a real three-dimensional scene, the viewer’s perception of the real three-dimensional scene would change through all windows simultaneously as the viewer moves around the room and their viewpoint with respect to each individual window changes.
Hardware & Firmware
[0114] Referring to Figure 6 herein, there is illustrated schematically in overview the hardware and firmware components of the system. The basic operational system consists of two main parts: a tracking system or apparatus 600 which determines the position of the viewer in the first region of real three-dimensional space, and a display system 601 which generates, renders and displays an appropriately shifted and adjusted image scene representing a scene in virtual three-dimensional space in the second region.
[0115] Components of the tracking system comprise known stereo or binocular cameras 602, one or more computer platforms 603 having one or more digital processors, one or more communications ports, data storage, memory, and interfaces for interfacing with other components; communications network components, electronic data storage devices and user interfaces. The data storage, memory and/or electronic data storage devices may comprise solid state drive, hard disk drive, random access memory, optical storage device, non-local storage, such as cloud storage or other such suitable storage means, or any suitable combination of the aforementioned. As will be understood by the person skilled in the art, the computer platform may be provided at a single location adjacent to the display device, or alternatively, some data processing tasks may be carried out at a remote location accessible over a communications network, for example at a remote data processing and/or data storage centre.
[0116] The computing platform comprises a frame grabber 604 which receives one or more streams of input video signal and generates frames of image data from the input video streams; a three dimensional scene generator 605, such as the Unreal® Engine; and a graphics card 606 for driving the display monitor 607.
[0117] From the digital image frames generated by the frame grabber, there are produced real time moving three dimensional coordinate positions of a human viewer in the first region 403, using machine learning algorithms. The three dimensional positions of the viewer are converted by coordinate converter 608 to a format which is readable by a size adjustment / re-sizing module 609, which receives a digitally generated virtual three-dimensional scene from the three-dimensional scene generating engine 605, and which adjusts the view selection of the virtual three-dimensional scene depending upon the coordinates of the viewer’s position in the first region 403 of real three-dimensional space; and a scene reformatter 610 which formats the re-sized virtual three-dimensional scene data into a two-dimensional format of real-time moving two-dimensional image frames in a format which can be read by graphics card 611. The output of graphics card 611 drives the display monitor 607. [0118] The coordinate conversion module 608 converts the position coordinates in real three-dimensional space (x, y, z) from the tracking system into coordinates in virtual three-dimensional space as used by the three-dimensional scene generation engine 605.
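Under the simplifying assumption that the virtual coordinate system is a scaled and translated copy of the real coordinate system, the conversion performed by the coordinate conversion module 608 might be sketched as below; the scale and offset parameters are illustrative configuration values only.

def real_to_virtual(xyz, scale=1.0, offset=(0.0, 0.0, 0.0)):
    """Convert a viewer position (x, y, z) in real three-dimensional space into
    the (x', y', z') coordinates used by the scene generation engine, assuming
    the two coordinate systems differ only by a scale and an offset."""
    x, y, z = xyz
    ox, oy, oz = offset
    return (x * scale + ox, y * scale + oy, z * scale + oz)

A fuller implementation might apply a complete rigid-body transform (rotation as well as translation and scale) where the virtual axes are not aligned with the real axes.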
[0119] Where the system comprises more than one visual display device, with different visual display devices at different positions with respect to the viewer, for each individual visual display device a corresponding respective view of the virtual three-dimensional scene is selected by the scene reformat module 610, and sent to a graphics card. The graphics card 611 may comprise one graphics card per visual display device, or a single graphics card capable of handling multiple views on multiple display monitors, with the effect that all of the plurality of monitors are synchronised so that their respective two-dimensional images presented on the respective display monitors give the impression that the viewer, who is in a real three-dimensional space, is in a room which is surrounded by the virtual scene, which can be viewed through any one or more of the virtual windows, and as the viewer moves around the room, the view through each virtual window presents a consistent picture of the virtual three-dimensional scene, as if the room in which the viewer is positioned (region 403 in Figure 4) is in the same real three-dimensional coordinate space as the virtual three-dimensional coordinate system of the virtual scene.
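Where several virtual windows share the same virtual scene, the per-display view selection described above might, in outline, be coordinated as in the following sketch; the window objects, their aperture_corners attribute and the render_view call are hypothetical placeholders.

def render_all_windows(viewer_xyz, scene, windows):
    """Render a mutually consistent view of the same virtual scene on every
    virtual window for the current viewer position.

    windows : list of display objects, each knowing the real-space corners of
              its own aperture and owning its own output device (hypothetical)
    """
    for window in windows:
        view = scene.render_view(viewer_xyz, window.aperture_corners)
        window.display(view)            # every window is updated for the same frame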
Determining the Viewpoint of the Viewer
[0120] Information concerning the viewer’s position relative to the aperture is captured by at least one pair of spaced apart stereo or binocular cameras. Cameras can be actively mounted at a position immediately adjacent the virtual window display monitor, and may, but not essentially be mounted symmetrically either side of a vertical centre line which bisects the aperture. Each camera of each pair of cameras has its own field of view and the fields of view of the two cameras of each pair overlap each other, so that both cameras can each capture a separate image of a viewer within the combined field of view of the two cameras. [0121] Preferably the cameras are video cameras which operate in the visible light range, but infra-red cameras which capture infra-red images may be used, or a combination of one or more visible range and one or more infra-red cameras may be used.
[0122] Referring to Figure 7 herein, there is illustrated schematically a real three-dimensional space in plan view representing for example a hotel room or conference room, where three stereo vision camera sets are mounted at three separate wall locations within the room. Each stereo camera set has its own field of view, and the fields of view of the three camera sets are arranged to overlap with each other to obtain the maximum amount of overlap coverage within the room, so that a viewer located at any of positions 1 to 4 as shown can have their face and/or body detected from images produced by at least one of the camera sets.
[0123] It is not essential that the cameras are co-located with the visual display device, and in other implementations the cameras may be located at different locations distributed in the first region 403, and provided that the cameras scan across a field-of-view in which they can capture images of a human viewer in the first region, they will still be able to operate for the purpose of facial and/or body recognition and determining a viewpoint of the viewer.
Scene Adjustment and Re-Sizing
[0124] The scene adjustment and resizing module 609 receives a set of converted coordinates from the coordinate conversion module 608 and receives scene data from the three-dimensional scene generator 605. The scene adjustment and resizing module operates to incorporate the position of the viewer into the virtual three-dimensional space coordinates of the virtual three- dimensional scene so that the three-dimensional virtual scene is seen from the viewpoint of the viewer in the virtual three-dimensional coordinate space.
View Selection and Aperture Positioning [0125] The output of the scene adjustment and resizing module 609 comprises a view of the virtual three-dimensional scene as viewed from a point of view position of the human viewer, as if the human viewer were in the same virtual three-dimensional coordinate system as the scene, where the virtual three- dimensional coordinate system matches and is mapped to the real three- dimensional coordinate system of the room in which the viewer is standing (the first region 403).
[0126] The coordinate converter 608 receives an input of the coordinates of the outer perimeter / outer frame of the display monitor 607 in real three- dimensional coordinates. This can be done on initial system setup / configuration. Alternatively the coordinates of the outer perimeter of the display monitor can be obtained by examining the video stream from one or a plurality of stereo cameras and applying a module to recognise the geometric features of the rectangular or square frame of the display monitor in real three-dimensional space.
[0127] Selecting the final view as seen on the display monitor is carried out by the scene reformatting module 610, which applies in virtual three-dimensional space, an aperture between the view position of the viewer and the view of the virtual three-dimensional scene. The view of the virtual three-dimensional scene which is inside the aperture frame is selected for display on the display monitor.
[0128] In real-time as the coordinate positions of the viewer change, as the viewer walks around the room and changes their position with respect to the display monitor in real three-dimensional space, the position of the aperture of the display monitor relative to the position of the viewer in virtual three-dimensional space changes, and so the view angle through the aperture changes, thereby changing the view onto the scene, as seen and displayed on the display monitor.
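One well-known way of realising the aperture-based view selection described in paragraphs [0126] to [0128] is a generalised ("off-axis") perspective projection constructed from the viewer position and three corners of the display perimeter; the sketch below follows that approach purely by way of illustration, and the description itself does not prescribe any particular projection method.

import numpy as np

def off_axis_projection(eye, pa, pb, pc, near=0.1, far=1000.0):
    """Build an asymmetric view frustum from the viewer's eye position and the
    lower-left (pa), lower-right (pb) and upper-left (pc) corners of the window,
    all expressed in the same three-dimensional coordinate system.

    Returns a 4x4 projection matrix and the screen basis vectors; applying the
    matrix makes the window behave as an aperture onto the scene."""
    pa, pb, pc, eye = (np.asarray(p, dtype=float) for p in (pa, pb, pc, eye))
    vr = pb - pa; vr /= np.linalg.norm(vr)            # screen right direction
    vu = pc - pa; vu /= np.linalg.norm(vu)            # screen up direction
    vn = np.cross(vr, vu); vn /= np.linalg.norm(vn)   # screen normal, towards the viewer

    va, vb, vc = pa - eye, pb - eye, pc - eye         # eye-to-corner vectors
    d = -np.dot(va, vn)                               # distance from the eye to the screen plane
    l = np.dot(vr, va) * near / d                     # frustum extents at the near plane
    r = np.dot(vr, vb) * near / d
    b = np.dot(vu, va) * near / d
    t = np.dot(vu, vc) * near / d

    proj = np.array([
        [2 * near / (r - l), 0, (r + l) / (r - l), 0],
        [0, 2 * near / (t - b), (t + b) / (t - b), 0],
        [0, 0, -(far + near) / (far - near), -2 * far * near / (far - near)],
        [0, 0, -1, 0],
    ])
    return proj, (vr, vu, vn)

In use, the rendering engine would also translate the scene by the eye position and rotate it into the screen basis before applying this matrix, and would recompute the matrix every frame as the tracked viewer position changes, so that the view angle through the aperture changes as described above.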
Display Monitor
[0129] Referring to Figure 8 herein, there is illustrated schematically a first display screen comprising a flat planar display surface. Such screens are conventionally known in the art and are in widespread use on tablet computers, mobile phone screens, laptop computer screens, iPad® screens, large format home cinema screens, and large HDTV screens and the like. The display surface may comprise a light emitting diode (LED) display, liquid crystal display, a high definition television (HDTV) display or the like. The screen comprises a substantially rectangular planar surface bounded by a substantially rectangular outer perimeter.
[0130] Referring to Figure 9 herein, there is illustrated schematically a second type of known curved display screen in which the display surface wraps around a viewer, and the display surface follows a part circular cylindrical surface having a focal line f in a nominal X and Z orthogonal directions, and following a straight line in a nominal Y (height) direction, orthogonal to each of the X and Z axes. The viewer may locate themselves along the focal line of the screen, but not necessarily.
[0131] A third type of display screen, not shown herein, comprises a screen similar to that shown in Figure 9 herein, but instead of the screen curving in a circular cylindrical path in the X / Z plane, the screen follows an elliptical curve about a focal line f1 that extends in the Z direction, orthogonal to the x/y plane. The viewer may locate themselves along the focal line of the screen, but not necessarily.
[0132] In other embodiments, the visual display screen may comprise a part elliptical screen or part spherical screen as shown in Figure 10 herein. Other curves which are non-circular, non-spherical or non-elliptical may also be used.
[0133] The visual display device may comprise a flexible sheet display, which can adapt to a variety of curves. In the best mode, and in most applications, the screen is likely to be a flat planar screen. For ease of description, the example embodiments herein describe a flat screen display, but it will be understood by the skilled person that in the general case, the display screen need not be flat or planar. [0134] In the general case the display monitor comprises a display surface surrounded by an outer perimeter which delineates the viewable area of the display device, and which forms the outer boundary of the virtual window. The screen surface may be of various three dimensional shapes but in the present embodiment is a rectangular flat planar surface.
[0135] Referring to Figure 11 herein, the tracking apparatus 800 comprises a plurality of at least two video cameras 1101 , 1102 arranged to cover a common camera field of view coincident with the first region 403, so that the cameras can detect a person located in the first region; a hardware grabber 1103 in communication with the plurality of cameras; a software frame grabber 1104 in communication with the hardware grabber 1105; and a memory buffer 1106 which may be located in the GPU’s RAM or in any other suitable location. The cameras, hardware grabber and software grabber may comprise known video capture products from Kaya Instruments or any other suitable cameras, hardware grabber and software grabber.
[0136] The cameras are located in real three dimensional space, and the physical locations of each camera in relation to each other camera and in relation to the three dimensional space is recorded in an initial set up procedure. If the cameras are Lidar cameras on a hand held portable device having a visual display, for example an Apple® iPad® device, the positions of the Lidar cameras in relation to the screen will be fixed, and the orientation of the whole hand held device in real three dimensional space may be determined from internal inertial gyroscopes or other orientation sensors within the handheld device.
Face Recognition and Viewpoint Determination
[0137] There will now be described methods of operation of the tracking apparatus for recognising a human face, and determining a viewpoint of the human viewer. [0138] Each of the cameras 1101 , 1102 captures a video image which is converted to digitised format by the hardware grabber and software grabber 1104, 1105 and which outputs digitised frames of video data which are stored in the buffer 1106. The buffer 1106 stores in real time a first sequence of image frames from first hardware camera 1101 and a second sequence of image frames captured by second hardware camera 1102, where each of the sequences of captured video frames represents an image over a common field-of-view of the two cameras in the first region of real three-dimensional space, in which a human viewer is located. All processes operate as continuously operating real-time processes. The information stored in the buffer may be stored in temporary files which can be automatically deleted after use.
[0139] The frame grabber software may use the Kaya DLL (dynamic-link library) or any suitable software to discover available hardware grabbers and connect to the hardware cameras. The frame grabber sets up the hardware grabber and buffer memory to collect and store image frames of data, as described above.
[0140] The software frame grabber allocates memory to the buffers, in which the raw data frames captured by the hardware Grabber will be saved, as described above.
[0141] The software frame grabber updates the configuration of the cameras, and notifies the hardware grabber about the area of memory buffer allocated to the captured frames of video data.
[0142] The software frame grabber sets up a callback function to be triggered by the hardware grabber each time a new frame of video data becomes available.
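A library-agnostic sketch of such a callback arrangement is given below; the grabber object and its register_callback method (shown commented out) are hypothetical stand-ins for the vendor frame grabber SDK rather than a description of any particular product's API, and the frame size is an assumption.

import numpy as np
from collections import deque

FRAME_SHAPE = (1080, 1920, 3)        # illustrative frame dimensions only

class FrameBuffer:
    """Fixed-size memory buffer that the callback fills with captured frames."""

    def __init__(self, depth=8):
        self.slots = deque(maxlen=depth)     # the oldest frames are discarded first

    def on_new_frame(self, camera_id, raw_bytes, timestamp):
        """Callback triggered each time a new frame of video data becomes available."""
        frame = np.frombuffer(raw_bytes, dtype=np.uint8).reshape(FRAME_SHAPE)
        self.slots.append((camera_id, timestamp, frame.copy()))

# buffer = FrameBuffer()
# grabber.register_callback(buffer.on_new_frame)   # hypothetical SDK call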
[0143] The image frame is input into the artificial intelligence analysis program to determine key points from the image. The key points are determined using artificial intelligence which has been trained to recognize keypoints from a database of models showing images of human faces with keypoints identified. These key points are used to determine the mid-eye point. Determination of the mid-eye point can be done using any appropriate method of analysis. One such method includes triangulation of the face using the identified key points.
[0144] The orientation of the user or subject viewer’s head may be inferred or partially inferred from determining the user’s body position. This may provide additional data to determine the viewer’s look direction for example if one or both cameras cannot see all key points. Body recognition and location determination is as described hereinafter.
Body Recognition and Location Determination
[0145] As a second measure, the method may also use body recognition and tracking. The system mainly tracks the head of the user; however, if there is any uncertainty about the viewer’s look direction determined from capturing face key points, the body tracking information may be used to increase the accuracy of the system.
[0146] Some key points which may be used in the tracking system are depicted in Figure 19, including the mid-eye point 1901; nose 1911; right eye 1912; left eye 1902; right ear 1913; left ear 1903; neck 1910; right shoulder 1914; right elbow 1915; right wrist 1916; left shoulder 1904; left elbow 1905; left wrist 1906; right hip 1917; right knee 1918; right ankle 1919; left hip 1907; left knee 1908; and left ankle 1909. These are not the only key points which may be used; any keypoint which is capable of being consistently recognized by the body recognition software, process and/or Al package may be used. The key points detailed above may be separated into subcategories or levels, such as:
Level 1: mid-eye point;
Level 2: nose, right eye, left eye, right ear and left ear;
Level 3: neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee and left ankle. These subcategories may be prioritized in a hierarchy within the system such that precedence is given to data collected from the mid-eye point, then to data collected from the subcategory two points, and finally to the subcategory three points, as sketched below. In effect, the additional subcategory two and three key points allow the accuracy of the position tracking and localization from the data to be improved. The system may also rely upon the additional body tracking key points if the system is unable to identify facial key points, for example if the head is turned such that only one eye is visible by the cameras.
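The sketch below shows one possible interpretation of this hierarchy, in which the highest-priority level for which key points have been detected in the current frame is used; the level groupings follow the list above and the selection rule is an illustrative assumption.

LEVEL_1 = ["mid_eye"]
LEVEL_2 = ["nose", "right_eye", "left_eye", "right_ear", "left_ear"]
LEVEL_3 = ["neck", "right_shoulder", "right_elbow", "right_wrist",
           "left_shoulder", "left_elbow", "left_wrist",
           "right_hip", "right_knee", "right_ankle",
           "left_hip", "left_knee", "left_ankle"]

def select_tracking_points(detected):
    """Return the detected key points from the highest-priority level that is
    available, falling back to body key points when the face is obscured.

    detected : dict of {key_point_name: (x, y)} for the current frame
    """
    for level in (LEVEL_1, LEVEL_2, LEVEL_3):
        points = {name: detected[name] for name in level if name in detected}
        if points:
            return points
    return {}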
[0147] An artificial Intelligence (Al) package is used to recognize the key points on the body of the person. The Al package is trained using models which are trained through deep learning libraries such as PyTorch, TensorFlow and OpenCV. This list of libraries is non-exhaustive.
[0148] There will now be described methods of operation of the tracking apparatus for recognizing a human body, and determining a viewpoint of the human viewer.
[0149] Each of the cameras 1101, 1102 captures a video image which is converted to digitized format by the hardware grabber and software grabber 1104, 1105, which outputs digitized frames of video data that are stored in the buffer 1106. The buffer 1106 stores in real time a first sequence of image frames from the first hardware camera 1101 and a second sequence of image frames captured by the second hardware camera 1102, where each of the sequences of captured video frames represents an image over a common field-of-view of the two cameras in the first region of real three-dimensional space, in which a human viewer is located. All processes operate as continuously operating real-time processes. The information stored in the buffer may be stored in temporary files which can be automatically deleted after use.

[0150] The frame grabber software may use the Kaya DLL (dynamic link library) or any suitable software to discover available hardware grabbers and connect to the cameras. The frame grabber sets up the hardware grabber and buffer memory to collect and store image frames of data, as described above.
[0151] The software frame grabber allocates memory to the buffers, in which the raw data frames captured by the hardware grabber will be saved, as described above.
[0152] The software frame grabber updates the configuration of the cameras, and notifies the hardware grabber about the area of the memory buffer allocated to the captured frames of video data.
[0153] The software frame grabber sets up a callback function to be triggered by the hardware grabber each time a new frame of video data becomes available.
[0154] The image frame is input into the artificial intelligence analysis program to determine keypoints from the image. The key points are determined using artificial intelligence which has been trained to recognize keypoints from a database of models showing images of human bodies with keypoints identified. These keypoints are used to determine the mid-eye point and/or the location of the human. These tasks can be done using any appropriate method of analysis. One such method includes triangulation using the identified key points.
Frame Capture
[0155] Referring to Figure 12 herein, the video cameras jointly observe real-world events in the first region 403 of real three-dimensional space. Each camera captures its respective stream of video data, which updates the data in the memory buffer and saves the two parallel streams of video frames to the memory buffer. The frame grabber receives a notification from the hardware grabber about the data being saved to the memory buffer.

[0156] On each frame captured by each camera, the hardware grabber notifies the software frame grabber about the capture event and the location in memory where the frame is saved.
[0157] The video frame data is not transferred to the frame grabber, but is saved to the memory buffer instead. Saving the video frames directly to the memory buffer enables a high rate of frame capture. The information saved to the buffer may advantageously be stored in temporary files which can be automatically deleted after use, reducing the need to delete files manually on a regular basis.
Frame Extraction
[0158] Referring to Figure 13 herein, the software frame grabber reads data from the buffer and transfers it to a frame processing pipeline. The software frame grabber converts the frame data 1300 to a digitized format ready to be transported over a network or locally. The frame processing pipeline is preferably implemented in C++, but could also be implemented in any other suitable language, for example Python.
[0159] The frame data is combined with meta data, comprising source data and frame capture time stamp data, so that information describing the source of the frame, i.e. which camera the frame was captured by, and the time of capture of the frame in real time is added to each frame.
[0160] In one embodiment, the combined metadata and frame data generated by the software frame grabber is sent as a message via ZMQ messaging 1301 to the frame processing module 1302. ZMQ messaging is a known asynchronous messaging library. As an alternative to using ZMQ messaging, combined metadata and frame data packets may be transferred from the software frame grabber to the frame processing module 1302 via a shared memory buffer. As another alternative, instead of ZMQ messaging, any embeddable networking library compatible with C++, Python or any other programming language suitable for processing video material may be used.
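A minimal sender-side sketch of the ZMQ transfer described in paragraph [0160] is given below, assuming the pyzmq binding for Python. The PUSH/PULL socket pattern, the endpoint address and the JSON header layout are illustrative choices only.

```python
import json
import time
import numpy as np
import zmq   # pyzmq

ctx = zmq.Context.instance()
sender = ctx.socket(zmq.PUSH)
sender.connect("tcp://127.0.0.1:5555")        # frame processing module assumed to listen here

def send_frame(camera_id, frame):
    meta = {
        "source": camera_id,                  # which camera captured the frame
        "timestamp": time.time(),             # frame capture time stamp
        "shape": frame.shape,
        "dtype": str(frame.dtype),
    }
    # Two-part message: JSON metadata followed by the raw frame bytes.
    sender.send_multipart([json.dumps(meta).encode("utf-8"), frame.tobytes()])

send_frame(camera_id=0, frame=np.zeros((1080, 1920, 3), dtype=np.uint8))
```

The receiving frame processing module would open a matching PULL socket, decode the JSON header and reconstruct the frame array from the second message part.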
[0161] Multiple instances of the frame processing pipeline may run in parallel to enable the best optimisation of hardware allocation. Further, in various embodiments, frame data can be distributed to other computers on the same network.
Facial Frame Processing
[0162] Referring to Figure 14 herein, the frame processing pipeline comprises three consecutive stages: (a) a pre-processing stage 1400 for pre-processing the frame data; (b) a face and/or body detection stage 1401 for detecting images of faces and/or bodies in the image frames; and (c) a facial and/or body landmark detection stage 1402. Each of the stages may be constructed as a separate module. The face and/or body detection stage 1401 may be implemented as a first machine learning module.
[0163] Each frame is converted into a standard image and is analysed by machine learning computer vision modules. In the embodiment presented herein, the use of an RGB image is described; however, the type of image is not limited to a standard RGB image. It will be apparent to a person skilled in the art that a grayscale image, an RGBA image, a palette image in which each pixel is coded using one number, or another such image type may also be used. A first machine learning module detects faces in the image, and a second machine learning module searches for and finds facial landmarks. A third machine learning module detects bodies in the image, and a fourth machine learning module searches for and finds body landmarks. All data found, relating to facial recognition, facial landmarks, body recognition and body landmarks, is passed to a frame aggregator 1403.
[0164] In the pre-processing stage 1400, raw frame data 1404 in Bayer format, as captured by a camera, is processed and transformed into standard RGB image data by de-mosaicing 1405, to obtain an RGB image 1406. (In the current implementation, this is partly done by the hardware grabber.)
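As a brief illustration of the de-mosaicing step 1405, OpenCV's colour-conversion routine can perform the Bayer-to-RGB transformation in a single call. The Bayer-pattern constant and the 8-bit depth used below are assumptions; the correct constant depends on the particular camera sensor.

```python
import cv2
import numpy as np

raw_bayer = np.zeros((1080, 1920), dtype=np.uint8)      # single-channel raw Bayer frame (placeholder data)
rgb = cv2.cvtColor(raw_bayer, cv2.COLOR_BayerRG2RGB)    # de-mosaic to a (1080, 1920, 3) RGB image
```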
[0165] In the face detection stage, each RGB image is processed using a trained face detection machine learning module 1407 to extract the locations of faces on the RGB image, which correspond to locations in the field of view of the camera in real three-dimensional space in the first region. The face is located as being within a face box, comprising a two dimensional rectangle on the RGB image.
[0166] For each face discovered in the RGB image, the coordinates of the face box are returned, and a separate face picture is extracted from the main RGB image. In a case where there is a single person in the first region and a single face detected, this results in a single face box location and a single face picture being extracted. The facial image is cropped to produce an RGB image 1409 of the detected face. In the case of more than one viewer in the first region 403, each viewer identified in the image frame has their face identified in a corresponding respective face box 1408, and the location coordinates of each face box are recorded. A separate face picture 1409 is extracted from each face box where a face is detected.
[0167] In the facial landmarks processing stage 1402, each face image or picture is analysed to find and identify facial landmarks. A trained machine learning algorithm is used to detect the facial landmarks on the RGB image. Facial landmarks 1411 can include facial features such as eyes, nose, mouth, lips, nostrils, ears or other physical features of a human face.
[0168] Based upon the coordinates within the RGB image of the individual facial landmarks, there is determined or calculated, for each face picture, a midpoint equidistant between the centres of the eyes of the face (the mid-eye position). The horizontal and vertical (x, y) coordinates within the two-dimensional RGB image frame are identified for the mid-eye position of the identified face. The mid-eye position for the identified face is used in other parts of the apparatus to determine the three-dimensional position of the human viewer within the real three-dimensional space in the first region.
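For illustration, the mid-eye calculation of paragraph [0168] reduces to taking the midpoint of the two eye centres in image coordinates. The landmark dictionary keys below are illustrative names, not those of any particular landmark model.

```python
def mid_eye_point(landmarks):
    """Return the (x, y) midpoint of the two eye centres, in pixels within the face image."""
    (lx, ly) = landmarks["left_eye_centre"]
    (rx, ry) = landmarks["right_eye_centre"]
    return ((lx + rx) / 2.0, (ly + ry) / 2.0)

print(mid_eye_point({"left_eye_centre": (412, 268), "right_eye_centre": (478, 265)}))
```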
[0169] The mid-eye coordinates for the frame are passed to the frame aggregator. The process shown with reference to Figure 14 herein is applied for each video frame and each video frame timestamp, so that the position of the viewer in each two-dimensional video frame is known at the time that the video frame was captured.
[0170] Video frames from each camera are processed in sequence to provide the coordinates of the mid-eye position of faces within the first region 403 over a sequence of timed image frames. As there is more than one camera, a plurality of frames of RGB images are processed in parallel, one stream per camera, each stream giving a slightly different view into the first region, and each stream identifying the same human face within the first region 403, whereby at each time there are produced at least two different RGB images taken from different angles by different cameras, each identifying a respective position of the mid-eye position within an RGB image taken by that camera.
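The pair of per-camera mid-eye coordinates described above can be converted into a three-dimensional position by triangulation. The sketch below assumes two horizontally spaced, rectified cameras; the focal length, baseline and principal point are placeholder values, not calibration data from the apparatus.

```python
FOCAL_PX = 1400.0        # focal length in pixels (assumed)
BASELINE_M = 0.60        # spacing between the two cameras in metres (assumed)
CX, CY = 960.0, 540.0    # principal point of a 1920x1080 image (assumed)

def triangulate_mid_eye(left_xy, right_xy):
    """Return (X, Y, Z) in metres of the mid-eye point from its pixel position in each camera."""
    (xl, yl), (xr, yr) = left_xy, right_xy
    disparity = xl - xr                        # horizontal shift of the point between the two views
    if abs(disparity) < 1e-6:
        return None                            # point at (effectively) infinite distance
    z = FOCAL_PX * BASELINE_M / disparity      # depth from the camera baseline
    x = (xl - CX) * z / FOCAL_PX
    y = (yl - CY) * z / FOCAL_PX
    return (x, y, z)

print(triangulate_mid_eye((1010.0, 536.0), (905.0, 534.0)))
```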
Body Frame Processing
[0171] If, in the face detection stage, no faces are detected and/or if, in the facial landmarks processing stage, no facial landmarks are processed and/or detected, the image may be reprocessed using a trained body detection machine learning module 2107 to extract the locations of bodies on the RGB image, which correspond to locations in the field of view of the camera in real three-dimensional space in the first region. The body may be located as being within a body box, comprising a two dimensional rectangle on the RGB image.
[0172] For each body discovered in the RGB image, the coordinates of the body box are returned, and a separate body picture is extracted from the main RGB image. In the embodiment presented herein, the use of an RGB image is described; however, the type of image is not limited to a standard RGB image. It will be apparent to a person skilled in the art that a grayscale image, an RGBA image, a palette image in which each pixel is coded using one number, or another such image type may also be used. In a case where there is a single person in the first region and a single body detected, this results in a single body box location and a single body picture being extracted. The body image is cropped to produce an RGB image 1409 of the detected body. In the case of more than one viewer in the first region 403, each viewer identified in the image frame has their body identified in a corresponding respective body box 2108, and the location coordinates of each body box are recorded. A separate body picture 2109 is extracted from each body box where a body is detected.
[0173] In the body landmarks processing stage 2102, each body image or picture is analysed to find and identify body landmarks or key points. A trained machine learning algorithm is used to detect the body landmarks on the RGB image. The machine learning algorithm may be trained using Darknet Neural Network, MobileNets, or the like.
[0174] Based upon the coordinates within the RGB image of the individual body landmarks, there may be determined or calculated, for each body picture, an area between the neck, right shoulder, left shoulder, right hip and left hip (the mid-body area). The horizontal and vertical (x, y) coordinates within the two-dimensional RGB image frame can be identified for the mid-body area of the identified body. The mid-body area for the identified body may be used in other parts of the apparatus to determine the three-dimensional position of the human viewer within the real three-dimensional space in the first region.
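By way of example only, one simple way to reduce the mid-body area to a single tracking coordinate is to take the centroid of whichever of the neck, shoulder and hip key points were detected, as sketched below; the key-point names are illustrative.

```python
MID_BODY_POINTS = ["neck", "right_shoulder", "left_shoulder", "right_hip", "left_hip"]

def mid_body_point(keypoints):
    """Return the (x, y) centroid, in image pixels, of the detected mid-body key points."""
    pts = [keypoints[name] for name in MID_BODY_POINTS if name in keypoints]
    if not pts:
        return None
    xs, ys = zip(*pts)
    return (sum(xs) / len(pts), sum(ys) / len(pts))

print(mid_body_point({"neck": (500, 400), "right_shoulder": (450, 430), "left_shoulder": (550, 430)}))
```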
[0175] The mid-body area coordinates for the frame and/or the key-point coordinates are passed to the frame aggregator. The process shown with reference to Figure 21 herein is applied for each video frame and each video frame timestamp, so that the position of the viewer in each two-dimensional video frame is known at the time that the video frame was captured.

[0176] Referring to Figure 21 herein, video frames 2100 from each camera are processed in sequence to provide the coordinates of the body key-points and/or mid-body area of bodies within the first region 403 over a sequence of timed image frames. As there is more than one camera, a plurality of frames of RGB images are processed in parallel, one stream per camera, each stream giving a slightly different view into the first region, and each stream identifying the same human body within the first region 403, whereby at each time there are produced at least two different RGB images taken from different angles by different cameras, each identifying a respective location of the body key-points and/or mid-body area within an RGB image taken by that camera.
[0177] In process 2101, raw frame data is captured by a camera. In process 2102, the raw frame data undergoes Bayer de-mosaicing to obtain an RGB image 2103.

[0178] Body detection from the RGB images comprises inputting the RGB image 2103 into a pre-trained machine learning model, in process 2104, to detect parts of the image which represent a human body, yielding a set of body box coordinates 2105; and cropping the image, in process 2106, to create an RGB image 2107 containing a detected human body.

[0179] Body landmark detection comprises inputting the RGB image 2107, containing a detected body and detected body features, into a machine learning algorithm 2108 which is pre-trained to detect body landmarks within the body box, resulting in detected body landmarks 2109; and determining the XY coordinates 2110 of the body landmarks for that frame. The frame is then passed on to a frame aggregation process.
Frame Aggregation
[0180] Referring to Figure 15 herein, the frame aggregation module aggregates data from multiple sources and serves it in aggregated format to the output. Output formats may include: an Unreal® Engine UDP socket; terminal/console output; text/CSV data; or other standard data formats.
[0181] Frame coordinates and meta data 1500 are aggregated 1501 to form aggregation coordinates from multiple sources. A process 1502 of aggregating and saving debug data operates continuously. Processed coordinates 1503 are served to an output port 1504.
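As an illustrative sketch of the output stage described in paragraphs [0180] and [0181], the aggregated, processed coordinates could be served to a game-engine client over a UDP socket as shown below. The port, address and JSON payload layout are assumptions; the Unreal® Engine side would need a matching UDP listener.

```python
import json
import socket

out = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
ENGINE_ADDR = ("127.0.0.1", 9999)              # assumed address of the scene-generator host

def serve_coordinates(timestamp, mid_eye_xyz):
    """Send one aggregated coordinate sample to the output port as a JSON datagram."""
    payload = {"t": timestamp, "mid_eye": mid_eye_xyz}
    out.sendto(json.dumps(payload).encode("utf-8"), ENGINE_ADDR)

serve_coordinates(0.0, (0.12, 1.55, 2.80))
```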
[0182] Referring to Figure 16 herein, there is illustrated schematically the main stages of the full pipeline process for converting images of real-world observed humans into a set of horizontal and vertical coordinates in real three-dimensional space corresponding to the mid-eye positions of one or more human viewers, from a stream of image frames from a single camera.
[0183] A video camera captures 1600 a stream of image frames representing observations of real-world events within a real three-dimensional space (the first region 403), and the hardware grabber 1104 stores those images to the buffer 1106 in real time as hereinbefore described. In this context, real time equates to a frame capture rate of between 30 and 120 frames per second.
[0184] The software frame grabber feeds the captured frames into a first frame processing stage comprising a first machine learning algorithm 1601 and a second frame processing stage comprising a second machine learning algorithm 1602. The first and second frame processing stages operate in parallel to each other.
[0185] The outputs of the frame processing stages are fed into an aggregator stage 1603, which is implemented as a machine learning processor. The aggregator stage produces an output 1604 of a set of coordinates within the captured image frames which correspond to the coordinates of the mid-eye position of the human viewer.

[0186] In Figure 16, the frame processing rate from the software frame grabber 1106 through to the output of x, y coordinates of the mid-eye positions is in the range 28 to 32 frames per second. The frame processing rate of the frame processors and the aggregator, to produce the output of x, y coordinates of the mid-eye positions, is in the current version in the range 30 to 45 frames per second.
Body Frame Aggregation
[0187] Referring to Figure 22 herein, the frame aggregation module aggregates data from multiple sources and serves it in aggregated format to the output. Output formats may include: an Unreal® Engine UDP socket; terminal/console output; text/CSV data; or other standard data formats.
[0188] Frame coordinates and meta data 1500 are aggregated 1501 to form aggregation coordinates from multiple sources. A process 1502 of aggregating and saving debug data operates continuously. Processed coordinates 1503 are served to an output port 1504.
[0189] Referring to Figure 22 herein, there is illustrated schematically the main stages of the full pipeline process for converting images of real-world observed humans into a set of horizontal and vertical coordinates in real three-dimensional space corresponding to the mid-eye positions and/or body key-points and/or mid-body areas of one or more human viewers, from a stream of image frames from a single camera.
[0190] A video camera captures 2200 a stream of image frames representing observations of real-world events within a real three-dimensional space (the first region 403), and the hardware grabber 1104 stores those images to the buffer 1106 in real time as hereinbefore described. In this context, real time equates to a frame capture rate of between 30 and 120 frames per second.

[0191] The software frame grabber feeds the captured frames into a first frame processing stage comprising a first machine learning algorithm 2204, a second frame processing stage comprising a second machine learning algorithm 2205, a third frame processing stage comprising a third machine learning algorithm 2206 and a fourth frame processing stage comprising a fourth machine learning algorithm 2207. The first, second, third and fourth frame processing stages operate in parallel to each other.
[0192] The outputs of the frame processing stages are fed into an aggregator stage 2208, which is implemented as a machine learning processor. The aggregator stage produces an output 2209 of a set of coordinates within the captured image frames which correspond to the coordinates of the mid-eye position and/or body key-points and/or mid-body area of the human viewer.
[0193] In Figure 22, the frame processing rate from the software frame grabber 1106 through to the output of x, y coordinates of the mid-eye and/or mid-body positions (stages 2203 to 2209) is in the range 28 to 32 frames per second. The frame processing rate of the frame processors and the aggregator, to produce the output of x, y coordinates of the mid-eye and/or body key-points and/or mid-body area (stages 2204 to 2209), is in the current version in the range 30 to 45 frames per second.
Facebox Detection by Machine Learning Implemented Module
[0194] Referring to Figure 17 herein, the first machine learning model, which is trained to detect faces and generate face box data and face box coordinates, comprises a neural network, as is known in the art. The neural network is trained on image frame examples each containing a representation of a human face. The neural network analyses each new frame for a set of features resembling a human face.
[0195] Based on the extracted facial features, the neural network predicts the coordinates of each face box, the face box being a rectangle in two-dimensional image space (image frame space) which contains or is likely to contain a human face.
[0196] In the current implementation, the network used for performing the feature extraction is based on Darknet-53 and contains 53 convolutional layers, as shown in Figure 17 herein.
[0197] Currently the network is trained on data based on the publicly available "WIDER FACE" dataset, containing over 32,000 image examples, each with annotated faces.
[0198] The output of the face box detection module is, for each image frame, the two-dimensional coordinates in frame image space of the region which contains an image of a human face.
Bodybox Detection by Machine Learning Implemented Module
[0199] The third machine learning model, which is trained to detect bodies and generate body box data and body box coordinates, comprises a neural network, as is known in the art. The neural network is trained on image frame examples each containing a representation of a human body. The neural network analyses each new frame for a set of features resembling a human body.
[0200] Based on the extracted body features, the neural network predicts the coordinates of each body box, the body box being a rectangle in two-dimensional image space (image frame space) which contains or is likely to contain a human body.
[0201] In the current implementation, the network used for performing the feature extraction is based on Darknet-53 and contains 53 convolutional layers, as shown in Figure 17 herein.

[0202] The output of the body box detection module is, for each image frame, the two-dimensional coordinates in frame image space of the region which contains an image of a human body.
Facial Landmarks Detection using Machine Learning Implemented Module
[0203] Detection and recognition of facial landmarks is achieved using a known neural network. Using the bounding boxes from the face box detection as previously described, the RGB images are cropped from the complete frame to extract just the part of the RGB image 1409 which contains a representation of a human face. On the cropped image of the face, facial landmark detection algorithms are used to calculate the position of the eyes of the face.
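An illustrative sketch of the cropping step follows: the face-box rectangle returned by the detector is cut out of the full RGB frame, optionally with a small margin, before being passed to the landmark network. The box format and margin value are assumptions made for the example.

```python
import numpy as np

def crop_face(rgb_frame, face_box, margin=0.1):
    """face_box = (x1, y1, x2, y2) in pixels; a small margin helps keep chin/forehead landmarks."""
    h, w = rgb_frame.shape[:2]
    x1, y1, x2, y2 = face_box
    mx, my = int((x2 - x1) * margin), int((y2 - y1) * margin)
    x1, y1 = max(0, x1 - mx), max(0, y1 - my)
    x2, y2 = min(w, x2 + mx), min(h, y2 + my)
    return rgb_frame[y1:y2, x1:x2]             # cropped region passed to the landmark network

face_crop = crop_face(np.zeros((1080, 1920, 3), dtype=np.uint8), (800, 300, 1000, 540))
```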
[0204] After detecting the face box, detecting facial landmarks becomes a less computationally intensive task, and as a result smaller and faster neural network architectures can be used in the facial landmark processing module compared to the face detection processing module.
[0205] In the current version, the neural network is based on MobileNet and is optimised for high speed. The network architecture of the second machine learning algorithm is as set out in Figure 18 herein.
[0206] Currently the neural network is trained on data based on the publicly available 300-W facial dataset ("300 Faces In-The-Wild").
[0207] The output of the facial landmark detection is a list of individual facial features, together with the coordinates of each facial feature within the two dimensional image frame space.
[0208] It will be understood by the person skilled in the art that the machine learning modules comprise a known computer platform having one or more data processors, memories, data storage devices, input and output interfaces, graphical user interfaces and other user interfaces such as voice command, input/output ports and communications ports.
Body Landmarks Detection using Machine Learning Implemented Module
[0209] Detection and recognition of body landmarks is achieved using a known neural network. Using the bounding boxes from the body box detection as previously described, the RGB images are cropped from the complete frame to extract just the part of the RGB image 1409 which contains a representation of a human body. On the cropped image of the body, body landmark detection algorithms are used to calculate the position of the body key-points and/or mid-body area.
[0210] After detecting the body box, detecting body landmarks becomes a less computationally intensive task, and as a result smaller and faster neural network architectures can be used in the body landmark processing module compared to the body detection processing module.
[0211] In the present embodiment, the neural network is based on the Darknet Neural Network and is optimised for high speed. The network architecture of the fourth machine learning algorithm is as set out in Figure 18 herein.
[0212] The output of the body landmark detection is a list of individual body features, together with the coordinates of each body feature within the two dimensional image frame space.
[0213] The method described herein converts a stream of two-dimensional images, captured as a stream of two-dimensional image frames of real-world observed humans present within a three-dimensional space, into a set of horizontal and vertical coordinates in said real three-dimensional space, said method comprising: capturing a stream of image frames representing objects within a three-dimensional space; capturing said stream of image frames substantially in real time; inputting said stream of image frames into a plurality of independently operating machine learning algorithms ML1-MLn, each said machine learning algorithm being pre-trained to identify a corresponding respective set of key points of a human body; generating an output of each of said machine learning algorithms, said output comprising a set of X, Y coordinates in real three-dimensional space wherein each individual said X, Y coordinate represents a key point of a human body; and aggregating a plurality of individual X, Y coordinates to produce a plurality of individual streams of X, Y coordinates, each of which represents the movement of a key point of a said human body within said real three-dimensional space.
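A compact sketch of the overall method of paragraph [0213] is given below: several independently pre-trained key-point models are run on each captured frame and their (x, y) outputs are aggregated into one time-ordered stream per key point. The model callables and key-point names are placeholders only.

```python
from collections import defaultdict

def run_pipeline(frames, models):
    """frames: iterable of images; models: callables returning {key_point_name: (x, y)}."""
    streams = defaultdict(list)                 # key-point name -> time-ordered list of (x, y)
    for frame in frames:
        for model in models:
            for name, xy in model(frame).items():
                streams[name].append(xy)        # each stream traces one key point across frames
    return streams

fake_model = lambda frame: {"mid_eye": (960.0, 520.0)}   # stand-in for a trained ML algorithm
print(dict(run_pipeline(frames=[None, None], models=[fake_model])))
```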
[0214] It will be understood by the person skilled in the art that the machine learning modules comprise a known computer platform having one or more data processors, memories, data storage devices, input and output interfaces, graphical user interfaces and other user interfaces such as voice command, input/output ports and communications ports.
Industrial Applicability & General Application Examples
[0215] In one application of the embodiments, a large screen high definition digital TV monitor may be positioned on a wall of a room, for example a basement room having no real windows, to provide a virtual window onto a digitally generated virtual scene such as a landscape, seascape, forest or city scene. As a user moves their position around the room, the digitally generated scene changes as if the viewer were in a room positioned within the scene and looking out onto the scene via the visual display monitor, which acts as a transparent window. Use in such an application may enable better usage of rooms which have limited possibility for natural outside views, for example hotel rooms with limited or undesirable natural outside views, underground conference rooms and underground cellar conversions.
[0216] In another example application, the virtual window disclosed herein may be used in underground railway transport, for example in extended tunnels under rivers or seaways, to give passengers the impression of travelling through scenery, where there is no natural above-ground outside view other than the dark side walls of the tunnel itself.
[0217] Applications and uses of the embodiments and methods herein are not restricted to the foregoing examples. For the avoidance of doubt, any technical feature described in relation to one embodiment herein may be substituted for an equivalent technical feature of any other embodiment described herein, and the features of any one embodiment described herein may be added to any other embodiment described herein, except in as far as such addition would be mutually exclusive to an existing technical feature of said embodiment.

Claims
1. An image processing apparatus comprising: a tracking apparatus for identifying a position of a viewer in relation to said visual display device; a scene generator for generating a virtual three-dimensional scene; a scene adjuster for modifying an output of said three dimensional scene generator in response to an output of said tracking apparatus; and at least one visual display device for displaying an image of said virtual three-dimensional scene; wherein said scene adjuster receives location information about a position of said viewer from said tracking apparatus, and modifies in real time a virtual three dimensional scene generated by said scene generator such that said modified scene corresponds to a view which the viewer would see from their position through an aperture between said real three dimensional space and said virtual three dimensional space.
2. The image processing apparatus as claimed in claim 1, wherein said tracking apparatus comprises: first and second video cameras, each producing a video stream signal of a real three-dimensional space; a video grabber for converting said video stream signals into a stream of image data frames; and a memory storage device for storing said streams of image data frames.
3. The image processing apparatus as claimed in claim 1, wherein said at least one visual display has a viewable surface of a shape selected from the set: a flat planar screen; a curved screen; a part-cylindrical screen.
4. The image processing apparatus as claimed in any one of the preceding claims, wherein said means for determining a position of a viewer comprises a pair of spaced apart video cameras directed with their respective fields of view overlapping each other in real three-dimensional space.
5. The image processing apparatus as claimed in claim 4, wherein the means for generating a virtual three-dimensional scene comprises a digital three- dimensional scene generator, operable to generate a virtual three-dimensional scene having one or a plurality of moving objects.
6. The image processing apparatus as claimed in any one of the preceding claims, wherein the view seen by the viewer in three-dimensional coordinate space is modified in real-time depending upon the viewer’s actual real position in real three dimensional space in relation to said at least one visual display.
7. The image processing apparatus as claimed in any one of the preceding claims, further comprising a video grabber for receiving a stream of video data; and converting said stream of video data into a stream of image data frames.
8. The image processing apparatus as claimed in claim 7, comprising means for adding a timestamp to each image data frame.
9. The image processing apparatus as claimed in any one of the preceding claims, wherein said apparatus comprises a machine learning algorithm for detecting facial images.
10. The image processing apparatus as claimed in claim 9, wherein said machine learning algorithm operates to: identify images in said image frames containing human faces; and determine a position in real three-dimensional space of said human face images.
11. The image processing apparatus as claimed in claim 9 or 10, wherein said machine learning algorithm operates to identify facial landmark features from a stream of said image data frames; and determine a mid-eye position in real three-dimensional space from said image data frames.
12. The image processing apparatus as claimed in any one of claims 1 to 8, wherein said apparatus comprises a machine learning algorithm for detecting body images.
13. The image processing apparatus as claimed in claim 12, wherein said machine learning algorithm operates to: identify images in said image frames containing human bodies; and determine a position in real three-dimensional space of said human body images.
14. The image processing apparatus as claimed in claim 12 or 13, wherein said machine learning algorithm operates to identify body landmark features from a stream of said image data frames.
15. The image processing apparatus as claimed in any one of the preceding claims, comprising a formatter operable for: receiving coordinates of an outer perimeter of said at least one display device in virtual three-dimensional space; cropping a view of said virtual three-dimensional scene to coincide with a straight line view from a position of a viewer in said virtual three-dimensional space, through said outer perimeter of said at least one display device.
16. An apparatus for generating a screen image which represents a three-dimensional view from the perspective of a human viewer, said apparatus comprising: at least one visual display device defining a corresponding respective perimeter of a viewing aperture or window on which can be displayed a real time virtual image; means for determining a physical position of said viewer's eyes in relation to a physical position of said viewing aperture or window; means for generating a virtual three-dimensional coordinate space populated with a plurality of objects; means for positioning a representation of said viewer in said virtual three-dimensional coordinate space; means for positioning the viewing aperture/window in the virtual 3D space; means for determining a view as seen by said representation of said viewer in said virtual three-dimensional coordinate space through said at least one visual display screen; means for generating a two-dimensional image of said three-dimensional view as seen by said representation of said viewer through the aperture or window, said image changing in real time, depending on an orientation of said viewer relative to said aperture/window within said virtual three-dimensional coordinate space.
17. A method for generating a video view which represents a three-dimensional view from the perspective of a subject viewer, positioned adjacent a video display device, said method comprising: determining a physical position of said subject viewer in relation to a physical position of said visual display device in real three-dimensional space; generating a virtual three-dimensional scene populated with a plurality of objects in virtual three-dimensional space; generating a view of said virtual three-dimensional scene, as seen through an aperture coinciding with a perimeter of said display device, and as viewed from said position in real three-dimensional space; and displaying said view on said visual display device.
18. The method as claimed in claim 17, wherein said process of generating a view of said virtual three dimensional scene comprises adjusting a view angle of said three-dimensional scene to correspond with a position of said subject viewer in real three-dimensional space.
19. The method as claimed in claim 17 or 18, wherein said process of generating a view of said virtual three-dimensional scene comprises adjusting sizes of said plurality of objects in said virtual three-dimensional space, according to a coordinate set of a said position of a viewer in real three-dimensional space.
20. The method as claimed in claim 17 or 18, further comprising formatting said adjusted virtual three-dimensional scene for display as a two-dimensional image of said virtual three-dimensional scene on said display monitor.
21. The method as claimed in any one of claims 17 to 20, wherein said process of determining a physical position of said subject viewer in relation to a physical position of said visual display screen comprises capturing first and second scenes of video data across a field of view extending in said real three-dimensional space.
22. The method as claimed in claim 21 , further comprising converting said first and second scenes of video data to corresponding first and second streams of video image frames.
23. The method as claimed in claim 22, further comprising applying a first machine learning algorithm to detect facial images contained in said first and second streams of video images.
24. The method as claimed in claim 22 or 23 further comprising applying a second machine learning algorithm to detect facial landmark features from said first and second streams of video images.
25. The method as claimed in any one of claims 22 to 24, further comprising applying a third machine learning algorithm to detect body images contained in said first and second streams of video images.
26. The method as claimed in any one of claims 22 to 25, further comprising applying a fourth machine learning algorithm to detect body landmark features from said first and second streams of video images.
27. The method as claimed in claim 24, further comprising determining a mid-eye position of detected facial images contained in said first and second streams of video images.
28. The method as claimed in claim 24 or 27, further comprising determining a set of three-dimensional coordinates in real three-dimensional space for each said determined mid-eye position.
29. The method as claimed in claim 28, wherein determining said three-dimensional coordinates of said mid-eye position comprises triangulation between two images of an object each captured by a different camera.
30. The method as claimed in claim 26, further comprising determining a mid-eye position of detected facial images contained in said first and second streams of video images.
31. The method as claimed in claim 26 or 30, further comprising determining a set of three-dimensional coordinates in real three-dimensional space for each said determined mid-eye position.
32. The method as claimed in claim 31, wherein determining said three-dimensional coordinates of said mid-eye position comprises triangulation between two images of an object each captured by a different camera.
33. The method as claimed in any one of claims 17 to 31 wherein said virtual three-dimensional scene comprises a digitally generated virtual three- dimensional scene.
34. The method as claimed in any one of claims 17 to 33, comprising generating a two-dimensional image of said view of said three-dimensional scene as seen from said viewer position in three-dimensional real space; and varying said two-dimensional view depending on a change of said position of said viewer in said real three-dimensional space.
35. The method as claimed in any one of claims 17 to 34, further comprising: determining coordinates of an outer perimeter of said display device in virtual three-dimensional space; cropping a view of said virtual three-dimensional scene to coincide with a straight line view from a determined position of a viewer in said virtual three-dimensional space, through said outer perimeter of said display device.
36. The method as claimed in any one of claims 17 to 35, comprising a plurality of visual display devices, wherein said method comprises: determining a physical position of said subject viewer in relation to a physical position of each said visual display device in real three-dimensional space; generating a virtual three-dimensional scene populated with a plurality of objects in virtual three-dimensional space; generating a corresponding respective view of said virtual three-dimensional scene, as seen through an aperture coinciding with a perimeter of each said display device, and as viewed from said position in real three-dimensional space; and displaying a said corresponding respective view on each said visual display device, such that for each physical position of the subject viewer in the real three-dimensional space, the view of the virtual three-dimensional scene displayed on each said visual display device is visually consistent with each other view displayed on each other said visual display device.
37. The method as claimed in claim 36, wherein said plurality of views are coordinated on said plurality of visual display devices, such that said subject viewer at said physical position within a first region of real three-dimensional space views a virtual three-dimensional scene on said plurality of visual display devices which appears to coincide with a second region of real three-dimensional space surrounding said first region of three-dimensional space; and for each position which said subject viewer occupies within said first region of real three-dimensional space, the views displayed on the plurality of visual display devices change in real time, co-ordinated with each other, to give the appearance that the subject viewer is viewing said three-dimensional scene through a plurality of apertures surrounding said viewer.
38. An apparatus for determining a viewpoint position of a person in a real three-dimensional space, said apparatus comprising: a pair of video cameras each capable of capturing a stream of video data; means for converting each said captured stream of video data into a stream of image data frames; means for storing said stream of image data frames; means for detecting faces in said stream of image data frames; means for detecting facial features in said stream of image data frames; means for determining a viewpoint of a said detected face.
39. An apparatus for determining a viewpoint position of a person in a real three-dimensional space, said apparatus comprising: a pair of video cameras each capable of capturing a stream of video data; means for converting each said captured stream of video data into a stream of image data frames; means for storing said stream of image data frames; means for detecting bodies in said stream of image data frames; means for detecting body features in said stream of image data frames; means for determining a viewpoint of a said detected body.
40. The image processing apparatus as claimed in claim 38 or 39, wherein said pair of video cameras are spaced apart from each other and are arranged such that their respective fields of view overlap each other in real three-dimensional space.
41. The image processing apparatus as claimed in claim 38, 39 or 40, wherein said means for converting each said captured stream of video data into a stream of image data frames comprises a video grabber for receiving a plurality of streams of video data and converting each said stream of video data into a stream of image data frames.
42. The image processing apparatus as claimed in any one of claims 38 to 41 , further comprising means for adding a timestamp to each image data frame.
43. The image processing apparatus as claimed in any one of claims 38, 40, 41 or 42, wherein said apparatus for detecting faces comprises a computer platform and a first machine learning algorithm for detecting facial images.
44. The image processing apparatus as claimed in claim 43, wherein said first machine learning algorithm operates to: identify images in said image frames containing human faces; and determine a position in real three-dimensional space of said human face images.
45. The image processing apparatus as claimed in any one of claims 38 or 40 to 44, wherein said means for detecting facial features in said stream of image data frames comprises a second machine learning algorithm operable to identify facial landmark features from a stream of said image data frames.
46. The image processing apparatus as claimed in any one of claims 38 or 40 to 45, wherein said means for determining a viewpoint of said detected face comprises a computer platform operating an algorithm for determining a position in real three-dimensional space from detected facial features in said image data frames.
47. The image processing apparatus as claimed in any one of claims 39 to 42, wherein said apparatus for detecting bodies comprises a computer platform and a third machine learning algorithm for detecting body images.
48. The image processing apparatus as claimed in claim 47, wherein said third machine learning algorithm operates to: identify images in said image frames containing human bodies; and determine a position in real three-dimensional space of said human body images.
49. The image processing apparatus as claimed in any one of claims 39 to 48, wherein said means for detecting body features in said stream of image data frames comprises a fourth machine learning algorithm operable to identify body landmark features from a stream of said image data frames.
50. The image processing apparatus as claimed in any one of claims 38 to 49, wherein said means for determining a viewpoint of said detected body comprises a computer platform operating an algorithm for determining a position in real three-dimensional space from detected body features in said image data frames.
51. A method for determining a viewpoint position of a person in a real three-dimensional space, said method comprising: capturing first and second streams of video data from first and second positions, said first and second positions spaced apart from each other and covering a common field of view in real three-dimensional space; converting each said captured stream of video data into a corresponding stream of image data frames; storing said streams of image data frames; detecting faces in said stream of image data frames; detecting facial features in said stream of image data frames; and determining a viewpoint of a said detected face.
52. A method for determining a viewpoint position of a person in a real three-dimensional space, said method comprising: capturing first and second streams of video data from first and second positions, said first and second positions spaced apart from each other and covering a common field of view in real three-dimensional space; converting each said captured stream of video data into a corresponding stream of image data frames; storing said streams of image data frames; detecting bodies in said stream of image data frames; detecting body features in said stream of image data frames; and determining a viewpoint of a said detected body.
53. The method as claimed in claim 51 or 52, further comprising adding a timestamp to each said image data frame.
54. The method as claimed in claim 51 or 53, wherein said process of detecting faces comprises: applying a pre-trained machine learning algorithm to recognise images of human faces in said image data frames; identifying images in said image frames containing human faces; and determining a position in real three-dimensional space of said human face images.
55. The method as claimed in claim 54, further comprising generating a two-dimensional area boundary around an identified facial image in a said image frame; and cropping an area within said two-dimensional area boundary, containing said identified facial image.
56. The method as claimed in any one of claims 51 to 55, wherein said process of detecting facial features in said stream of image data comprises: applying a pre-trained machine learning algorithm to recognise images of landmark features of a human face, said landmark features selected from the set: eyes; pupils; nose; lips; eyebrows; temples; chin; moles; teeth; cheeks.
57. The method as claimed in any one of claims 51 to 56, wherein said process of determining a viewpoint of said detected face comprises: determining a position of each of a pair of eyes of a face in said data frames; determining a midpoint between said positions of said eyes; and assigning said viewpoint to be said midpoint.
58. The method as claimed in claim 52 or 53, wherein said process of detecting bodies comprises: applying a pre-trained machine learning algorithm to recognise images of human bodies in said image data frames; identifying images in said image frames containing human bodies; and determining a position in real three-dimensional space of said human body images.
59. The method as claimed in claim 58, further comprising generating a two-dimensional area boundary around an identified body image in a said image frame; and cropping an area within said two-dimensional area boundary, containing said identified body image.
60. The method as claimed in any one of claims 52 to 59, wherein said process of detecting body features in said stream of image data comprises: applying a pre-trained machine learning algorithm to recognise images of landmark features of a human body, said landmark features selected from the set: eyes; pupils; nose; lips; eyebrows; temples; chin; moles; teeth; cheeks; ears; neck; shoulders; elbows; wrists; hands; chest; waist; hips; knees; ankles; feet; legs; arms.
PCT/EP2021/070397 2020-07-27 2021-07-21 Virtual window WO2022023142A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB2011623.2A GB202011623D0 (en) 2020-07-27 2020-07-27 Roomality
GB2011623.2 2020-07-27

Publications (1)

Publication Number Publication Date
WO2022023142A1 true WO2022023142A1 (en) 2022-02-03

Family

ID=72339226

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/070397 WO2022023142A1 (en) 2020-07-27 2021-07-21 Virtual window

Country Status (2)

Country Link
GB (1) GB202011623D0 (en)
WO (1) WO2022023142A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023005606A1 (en) * 2021-07-28 2023-02-02 京东方科技集团股份有限公司 Display device and display method therefor

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114399949B (en) * 2022-01-28 2023-11-28 苏州华星光电技术有限公司 Adjustable display device and adjusting method thereof

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120060177A1 (en) * 2010-09-02 2012-03-08 Verizon Patent And Licensing, Inc. Perspective display systems and methods
EP3506149A1 (en) * 2017-12-27 2019-07-03 Fundacion Centro De Tecnologias De Interaccion Visual Y Comunicaciones Vicomtech Method, system and computer program product for eye gaze direction estimation
US20190320163A1 (en) * 2007-08-24 2019-10-17 Videa Llc Multi-screen perspective altering display system

Also Published As

Publication number Publication date
GB202011623D0 (en) 2020-09-09

Similar Documents

Publication Publication Date Title
US10181222B2 (en) Method and device for augmented reality display of real physical model
US11315287B2 (en) Generating pose information for a person in a physical environment
CN104699247B (en) A kind of virtual reality interactive system and method based on machine vision
CN109298629B (en) System and method for guiding mobile platform in non-mapped region
CN102959616B (en) Interactive reality augmentation for natural interaction
CN106484115B (en) For enhancing and the system and method for virtual reality
US6411266B1 (en) Apparatus and method for providing images of real and virtual objects in a head mounted display
US11417069B1 (en) Object and camera localization system and localization method for mapping of the real world
US20140368539A1 (en) Head wearable electronic device for augmented reality and method for generating augmented reality using the same
US20160343166A1 (en) Image-capturing system for combining subject and three-dimensional virtual space in real time
US6690338B1 (en) Apparatus and method for providing images of real and virtual objects in a head mounted display
US20110102460A1 (en) Platform for widespread augmented reality and 3d mapping
CN107004279A (en) Natural user interface camera calibrated
JP7073481B2 (en) Image display system
WO2012142202A1 (en) Apparatus, systems and methods for providing motion tracking using a personal viewing device
Oskiper et al. Augmented reality binoculars
WO2022023142A1 (en) Virtual window
KR20180120456A (en) Apparatus for providing virtual reality contents based on panoramic image and method for the same
CN106843790A (en) A kind of information display system and method
CN110969706B (en) Augmented reality device, image processing method, system and storage medium thereof
US11200741B1 (en) Generating high fidelity spatial maps and pose evolutions
US20220230357A1 (en) Data processing
EP3547081B1 (en) Data processing
US20230071828A1 (en) Information processing apparatus, information processing system, and information processing method
WO2023277020A1 (en) Image display system and image display method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21748596

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21748596

Country of ref document: EP

Kind code of ref document: A1