WO2022180630A1 - Camera capture and communications system - Google Patents


Info

Publication number
WO2022180630A1
Authority
WO
WIPO (PCT)
Prior art keywords
camera
cameras
user
video output
participants
Prior art date
Application number
PCT/IL2022/050209
Other languages
French (fr)
Inventor
Zeev Abrams
Original Assignee
3Chron Ltd.
Priority date
Filing date
Publication date
Application filed by 3Chron Ltd. filed Critical 3Chron Ltd.
Publication of WO2022180630A1 publication Critical patent/WO2022180630A1/en


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/147Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/012Head tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • G06T19/20Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/62Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757Matching configurations of points or features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/803Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/165Detection; Localisation; Normalisation using facial parts and geometric relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18Eye characteristics, e.g. of the iris
    • G06V40/19Sensors therefor
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/204Image signal generators using stereoscopic image cameras
    • H04N13/243Image signal generators using stereoscopic image cameras using three or more 2D image sensors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2210/00Indexing scheme for image generation or computer graphics
    • G06T2210/08Bandwidth reduction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20Indexing scheme for editing of 3D models
    • G06T2219/2004Aligning objects, relative positioning of parts
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20Indexing scheme for editing of 3D models
    • G06T2219/2016Rotation, translation, scaling
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N2013/0074Stereoscopic image analysis
    • H04N2013/0081Depth or disparity estimation from stereoscopic image signals

Definitions

  • the present disclosure relates generally to the field of telecommunications, and in particular to a 3D teleconference system.
  • a camera capture and communications system comprising: at least one camera, each comprising a video output; and one or more processors configured to: receive the video output of each of the at least one camera; for each of a plurality of predetermined temporal points, map a position and orientation of a user imaged by the at least one camera to a virtual position and orientation; and output data associated with the received video output and the mapped positions and orientations.
  • the mapping comprises adjusting a yaw angle between a direction where the user is facing and a direction where the at least one camera is facing.
  • a predetermined positioning arrangement is defined for a plurality of meeting participants, the mapped virtual position and orientation being in relation to the predetermined positioning configurations.
  • the predetermined positioning configuration is adjustable.
  • the mapping comprises outputting a positioning arrangement shown on a display of the user.
  • the one or more processors are further configured to: identify facial landmarks within the received video output; and match the identified facial landmarks with a 3-dimensional (3D) model of the user, wherein the data associated with the received video output is responsive to an outcome of the matching.
  • the one or more processors are further configured to: track the outcome of the matching over time; and adjust the data associated with the received video output responsive to an outcome of the tracking.
  • the 3D model is unique to the user.
  • the system further comprises a plurality of terminals, each associated with a respective participant, wherein each of the plurality of terminals receives data associated with the received video output and the mapped positions and orientations.
  • the one or more processors are further configured to estimate the pose of the user within the video output, the mapping being responsive to the estimated pose.
  • the system further comprises at least one depth sensor, wherein the mapping is responsive to an output of the at least one depth sensor.
  • the at least one camera comprises a plurality of cameras
  • the one or more processors are further arranged to: fuse an image of the video output of a first of the plurality of cameras with an image of the video output of a second of the plurality of cameras; match features of the image of the first camera to features of the image of the second camera to generate a depth map (DM) or point cloud (PCL); fuse the generated DM or PCL with the output of the at least one depth sensor, and wherein the output data associated with the received video is responsive to the fused images and the fused DM or PCL.
  • the at least one camera comprises a plurality of cameras, displaced from each other by respective predetermined distances, and wherein the outputs of the cameras are synchronized with each other.
  • the one or more processors are further configured to: determine, for each of the plurality of cameras, an angle between the user and the respective camera; and responsive to the determined angles, select one of the plurality of cameras, wherein the mapping is responsive to the selected one of the plurality of cameras.
  • the plurality of cameras comprises at least 3 cameras.
  • the at least three cameras comprise three lower cameras and one upper camera, the three lower cameras being displaced horizontally from each other and the upper camera being displaced vertically above the three lower cameras.
  • the one or more processors are further arranged to define a predetermined range in relation to the at least one camera within the received images, the output data associated with the received video generated from image data within the predetermined range and not from image data outside the predetermined range.
  • a camera capture and communications method comprising: receiving a video output of each of at least one camera; for each of a plurality of predetermined temporal points, mapping a position and orientation of a user imaged by the at least one camera to a virtual position and orientation; and outputting data associated with the received video output and the mapped positions and orientations.
  • the mapping comprises adjusting a yaw angle between a direction where the user is facing and a direction where the at least one camera is facing.
  • a predetermined positioning arrangement is defined for a plurality of meeting participants, the mapped virtual position and orientation being in relation to the predetermined positioning configurations.
  • the predetermined positioning configuration is adjustable.
  • the mapping comprises outputting a positioning arrangement shown on a display of the user.
  • the method further comprises: identifying facial landmarks within the received video output; and matching the identified facial landmarks with a 3-dimensional (3D) model of the user, wherein the data associated with the received video output is responsive to an outcome of the matching.
  • the method further comprises: tracking the outcome of the matching over time; and adjusting the data associated with the received video output responsive to an outcome of the tracking.
  • the 3D model is unique to the user.
  • the method further comprises, for each of a plurality of terminals, each associated with a respective participant, receiving at the respective terminal the data associated with the received video output and the mapped positions and orientations.
  • the method further comprises estimating the pose of the user within the video output, the mapping being responsive to the estimated pose.
  • the method further comprises receiving an output of at least one depth sensor, wherein the mapping is responsive to the received output of the at least one depth sensor.
  • the at least one camera comprises a plurality of cameras
  • the method further comprises: fusing an image of the video output of a first of the plurality of cameras with an image of the video output of a second of the plurality of cameras; matching features of the image of the first camera to features of the image of the second camera to generate a depth map (DM) or point cloud (PCL); fusing the generated DM or PCL with the output of the at least one depth sensor, and wherein the output data associated with the received video is responsive to the fused images and the fused DM or PCL.
  • the at least one camera comprises a plurality of cameras, displaced from each other by respective predetermined distances, and wherein the outputs of the cameras are synchronized with each other.
  • the method further comprises: determining, for each of the plurality of cameras, an angle between the user and the respective camera; and responsive to the determined angles, selecting one of the plurality of cameras, wherein the mapping is responsive to the selected one of the plurality of cameras.
  • the plurality of cameras comprises at least 3 cameras.
  • the at least three cameras comprise three lower cameras and one upper camera, the three lower cameras being displaced horizontally from each other and the upper camera being displaced vertically above the three lower cameras.
  • the method further comprises defining a predetermined range in relation to the at least one camera within the received images, the output data associated with the received video generated from image data within the predetermined range and not from image data outside the predetermined range.
  • x, y, and/or z means any element of the seven-element set {(x), (y), (z), (x, y), (x, z), (y, z), (x, y, z)}.
  • FIG. 1A illustrates a prior art method of conversing with multiple people;
  • FIGs. 1B - 1C illustrate various configurations for displaying 3D representations of participants of a conversation, in accordance with some embodiments;
  • FIGs. 2A - 2B illustrate the variation of configurations and orientations of four participants in a virtual conversation from the point of view of a single user, in accordance with some embodiments;
  • FIG. 3 is a top down view of a user situated in front of a single camera, looking at the depiction of another participant appearing off-axis from the camera, in accordance with some embodiments;
  • FIG. 4 illustrates an overhead schematic of a communications system utilizing three (or more) cameras, consisting of a user being captured, a capture system, and a representation of other users, in accordance with some embodiments;
  • FIGs. 5A - 5C demonstrate the conceptual advantage of utilizing more than a single imaging sensor in the described camera system, in accordance with some embodiments;
  • FIG. 6 illustrates the relation between the components of the 3D camera and the other participants as a function of fields of view of the user and camera sensors, in accordance with some embodiments;
  • FIGs. 7A - 7H illustrate the generalized mapping from symmetric distributions of participants to a tighter placement of participants, from the point of view of the user, for a number of configurations;
  • FIG. 8 illustrates a flow chart of an algorithm for positioning a user in a 3D environment given data from a single camera, in accordance with some embodiments
  • FIG. 9 describes the relation between a user, the cameras, and intermediary points between cameras as a function of the user’s position, in accordance with some embodiments.
  • FIG. 10 illustrates a block diagram of one embodiment of a communications system employing a discrete selection of camera angles
  • FIG. 11 is a depiction of a set of algorithms used to match 3D data with an existing model, in accordance with some embodiments;
  • FIG. 12 displays a relation between a 3D camera of this disclosure and the algorithms of FIG. 11;
  • FIG. 13 illustrates a general block diagram of the different components in the processing pipeline of the system, including possible segregation of components into different computational systems along the pipeline, in accordance with some embodiments;
  • FIG. 14 illustrates a generalized schematic of 3D webcam comprising multiple sensors, in accordance with some embodiments;
  • FIG. 15 illustrates a more detailed setup of a 3D webcam comprising multiple sensors, in accordance with some embodiments
  • FIG. 16 illustrates a schematic description of an algorithm for finding stereo correspondences between two imaging sensors, as well as a method of improving the algorithm based on other data, in accordance with some embodiments;
  • FIG. 17 illustrates a block diagram of a method of finding depth point, in accordance with some embodiments.
  • FIG. 18 illustrates a block diagram of a communication system connecting multiple participants, in relation to a central server, in accordance with some embodiments.
  • Adding a third dimension can alleviate many of the issues in a VC session: by adding spatial awareness to a VC, one can solve many of the inadequacies directly, as well as removing the issue of distracting backgrounds and adding more realism to a conversation. These solutions can be generalized to non-VC modalities, such as in online environments (known as the "metaverse").
  • the solutions described here enable users to have communications in a 3D environment using 3 tiers of hardware: using an existing webcam, a trio (or more) of webcams, or an embedded camera system.
  • Each of these solutions has progressively better capabilities, which thereby enable a better emulation of a real-life conversation.
  • the focus of these solutions is on the capture and presentation of people in a 3D environment; this can then be presented on either a 2D screen, 3D screen, or 3D headset.
  • the 3D headset can include either virtual- or augmented-reality, or any mixture of them, without any limitations on the technology described here.
  • the use of the word “display” will be used to connote all these modalities.
  • Part of the disclosure refers to specific methods of spatially distributing other users within a 3D environment. There is a spatial relation between the location and position of the user and the other participants [the use of “user” will refer to the “main” user of the system, described from their point of view, with all other users being described as “participants”, despite them also being symmetrical users in the system].
  • This set of disclosures is targeted at a use case where a user is sitting or standing in front of a capture device (camera) and monitor (or headset), and looking at the other participants, as is typical in the case of most communications done online today.
  • capture device as used herein is meant to include any type of device that captures images.
  • the general configuration to be emulated is that of multiple users seated around a round table, with the ability to turn towards one another. It can be noted that in this simple configuration, one can limit the degrees-of-freedom (DOF) to primarily the ability to rotate one’s head/body to the left/right (“yaw”) and some translational movement. If limiting the scope of the discussion (but not the disclosure) to people seated around a table, then one can neglect the translation and focus primarily on the ability to look to the sides only. This naturally lends itself to more simplified cylindrical coordinate system descriptions of each participant.
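  • As a rough illustration of this simplified cylindrical description (not part of the original disclosure; names and values are hypothetical), the sketch below assigns evenly spaced seats around a virtual round table and computes the yaw a participant would need, relative to facing the table center, in order to look at another seat:

```python
import math

def seat_positions(num_participants: int, radius: float = 1.0):
    """Evenly distribute participants around a virtual round table.
    Returns (x, y) positions and the heading of each seat when facing the
    table centre (translation is neglected, as in the text)."""
    seats = []
    for i in range(num_participants):
        a = 2 * math.pi * i / num_participants
        x, y = radius * math.cos(a), radius * math.sin(a)
        facing_centre = math.atan2(-y, -x)          # heading toward (0, 0)
        seats.append((x, y, facing_centre))
    return seats

def yaw_to_face(seats, me: int, target: int) -> float:
    """Yaw (radians, relative to facing the centre) that seat `me` must
    adopt to look at seat `target`; positive = turn to the left."""
    x0, y0, heading = seats[me]
    x1, y1, _ = seats[target]
    to_target = math.atan2(y1 - y0, x1 - x0)
    # Wrap to (-pi, pi] so a turn is reported as the smaller rotation.
    return (to_target - heading + math.pi) % (2 * math.pi) - math.pi
```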
  • the first embodiment describes the emulation of a conversation utilizing only a single webcam.
  • the solution describes ways of capturing and displaying the user to the other participants in a way that enables the user to turn and address other participants.
  • the user is allowed to move their head to turn to other users, however, in order to always have the best “view” of the user, they are forced to keep their head primarily facing the camera. This is facilitated by employing head tracking technologies to rotate the viewpoint of the user on a display, thereby ensuring that the user’s head is always centered on the camera.
  • This solution can work regardless of the physical size of the display, but has many drawbacks to be described.
  • the second embodiment improves upon the first by increasing the number of virtual viewpoints by physically increasing the number of cameras used.
  • a trio of cameras could be used to capture the user from directly in front, and to the two sides, at some given, described, angle. This enables a digital or discrete transition of the views to other participants to be emulated, and also improves the user interface by allowing the user to physically turn their head/body towards another participant, without the requirement of always being centered on the central camera.
  • the third embodiment includes methods of creating 3D representations of the user (here, and in all other descriptions, limited to 1, for simplicity of description, but without a loss of generality for multiple users using the same system simultaneously).
  • This third embodiment includes descriptions of the algorithms and systems required to enable such a solution.
  • system requirements for such an embodiment are much higher than the first two, a description is given in this disclosure of an embedded camera system that would have preferable features for such an embodiment.
  • the primary focus is on the exemplary use case of a user sitting at a desk and interacting with others in a virtual environment that emulates a physical meeting place. Since the physical Field Of View (FOV) of the user may be different from the virtual one, as a function of the display being used, the distribution of the other participants in the virtual environment may be preferred to be different, and customizable by the user. However, as the spatial relation between the users should be maintained in order to enable the correct position and orientation of each user vis-a-vis each other, a form of mapping is described to enable this, such that physical rotations by the user are correctly mapped to their virtual rotations as seen by the other users, at each of a plurality of temporal points (i.e. the mapping is performed repeatedly during the conversation).
  • the general disclosure refers to a complete system of creating and transmitting 3D representations of a user to one or more other participants in an online conversation, known as a VC.
  • the system consists of software to control and interface with the system, hardware to capture the user, algorithms used to create and process the 3D data, and a transmission pipeline to convey the data to other users.
  • additional software is included on a server level, used to arrange and manage the different users into a combined environment viewable by each user.
  • Figure 1A begins by depicting a common arrangement in a multi-person call today. It is simplified for description purposes, and it generalizes for any conversation with more than 2 participants in total.
  • a display 100 has a webcam on top 101 that can view the user.
  • the webcam is simplified here to a single camera device attached to the display, but can also be embedded within the frame of the device, without a loss of generality. Furthermore, it can be placed at the bottom of the display or on an edge, as is found in many variations on devices today; however they are all similar as consisting of a single webcam.
  • the software program 102 will place each speaker in some form of 2D mosaic. With 2 participants and a user, this can be with the user on top 103, and the other two participants to the left 104 and right 105.
  • Some software systems have different configurations, such as highlighting the speaker, meaning that the view is constantly changing. Having the user as one of the sub-screens 103 is problematic as it is known to distract the user and potentially induce fatigue. A thumbnail at most is all that is really needed to ensure that the viewer’s camera is on and working correctly.
  • In FIG. 1A there is no spatial relation between the participants, and each participant may have a different configuration of screens on their display. Even if the order of sub-screens is the same on each, there is still no way to "look" at a specific participant, and have the others know that you are "looking" at them, as it is in a 2D environment.
  • FIG. 1B illustrates an initial solution to this, within a system 10, with 3 participants as well as the user (not shown).
  • In system 10 there is still a 2D display 106 (though it can be a 3D display or headset as well), with a single webcam 107.
  • the participant on the left 109 is angled slightly, signifying that they are looking at the user (this is shown from the viewpoint of the user), and the participant on “top” 110 is furthest away from the user. So long as this same relative positioning is maintained between the participants, then everyone can know where they are in relation to the others. If the participants are able to rotate in 3D space, then it is clear who each participant is looking at. In this case (though hard to depict in a 2D schematic), it appears that both other participants are looking at the user.
  • the FOV of the user is essentially defined by the first-person view they have of the 3D environment as appearing on the 2D display 106.
  • the user can see 2 other participants simultaneously with a large FOV; however, if there were more participants, not all of them would appear onscreen at the same time, with some of them perhaps appearing offscreen, beyond the periphery of the FOV defined by the display's FOV setting.
  • This FOV setting can be defined in 3D rendering software, and can even include FOVs that are beyond the natural abilities of a person.
  • FIG. 1C illustrates another embodiment of a system 11, system 11 consisting of an ultrawide display 111.
  • the software 112 can recognize the larger viewing area, and enable a higher effective FOV, showing most of the virtual table 113.
  • the multiple camera angles emulate the required viewpoints for the other participants 115.
  • the higher effective FOV allows the user to see the spatial relationship between all the other participants, with each one portrayed as looking in a different direction. What this allows is the user to be able to physically turn their head at any portion of the screen and see and focus on different users, with a camera always giving some form of coverage of them.
  • In FIG. 2A there is a top-down depiction of a generalized case of 4 participants meeting, with the user depicted at the bottom 200.
  • the participants are situated around a round table 201 such that an even distribution of the participants around the virtual table will be described as a square (or rhombus) 202.
  • each participant is shown looking towards the center of the table, such that the user’s FOV 203 is directed at the participant directly across from them 204.
  • the other participants to the right 205 and left 206 are not seen by the viewer.
  • the FOV of the user 203 is most easily described in real-world terms as being the FOV of a person (which can be in the range of 90 degrees horizontally), or in terms of the FOV created by a headset (which is typically lower). However, it can also be described in terms of a virtual camera positioned at the user, and depicting a 3D scene, as shown in FIG. IB. In this case, the FOV can be changed in software to expand out the view, and it can be displayed on a 2D screen. However, larger FOVs can result in distortions of the image in 2D on a smaller display.
  • the user’s 207 FOV 208 includes the other three participants, since they are no longer evenly distributed around the virtual table 209. Instead, they are now positioned in a “kite” or diamond geometry 210, with respect to the user 207. This can be done in a virtual environment, so long as these other users aren’t physically in the same space, where their spatial relation would be defined and immutable. Although the other participants are virtually positioned, they still may move from their defined position, such that they move closer (e.g. 213) or further (e.g. 211) from the user 207.
  • For a single webcam system, which is the first embodiment of the disclosure, the general problem can be described in FIG. 3.
  • a user 300 is positioned in front of a camera device 301, and is communicating with at least one other participant - here only a single one is shown 302.
  • the other participant may appear on a 2D display, or appear in a virtual environment in a headset (a “hologram”).
  • the other participant appears at an offset distance, x 303, from the webcam 301. If the viewer then looks at that participant - here signified by the triangular nose section 304, then the angle between the optical axis and the participant can be seen as a 2D “yaw” angle 305.
  • the problem is that the user is looking directly at the other participant 302, but the view that the other participant will receive will be that of the webcam 301, which displays the user looking to the right. As such, there will be a discrepancy between what the participant sees (an off-axis view) and what they should be seeing (the user looking right at them).
  • This disclosure presents a solution for this issue by using head-tracking to re-position the participant 302 towards the center of the display such that the user 300 is then looking directly at the webcam, and thereby realigning the expected viewpoint.
  • the goal of this first embodiment is to allow the participants that are being addressed to always have the user looking at them. To do this, the participants are shifted towards the webcam 303 by simultaneously zeroing the yaw angle 305. This will allow the user to use their neck/head to address participants, but will then force them to move their head back into the central position to continue. This can be considered as an “always centered” configuration of head tracking.
  • the angle that is to be re-adjusted is also a function of the distance to the camera, d 306, as well as the potential translation of the user, t 307, to the sides. For simplicity, these can be ignored in later descriptions, however, they are to be included in the calculations of the corrected angles in subsequent descriptions of the disclosure.
  • the translational displacement (d and t as well as potentially a third axis up/down, neglected here for simplicity of description) can be deduced from head/body tracking techniques, such that the corrections can occur every frame.
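  • A minimal sketch of such a per-frame correction, assuming the offset x, distance d and translation t are already available from head/body tracking (the function name and units are illustrative only):

```python
import math

def yaw_correction_deg(x_offset_m: float, distance_m: float, lateral_m: float = 0.0) -> float:
    """Yaw angle (degrees) between the camera's optical axis and the direction
    from the user to a participant rendered at lateral offset x_offset_m, for a
    user at distance_m from the webcam and translated sideways by lateral_m.
    This is the angle that the "always centered" scheme zeroes by re-positioning
    the participant toward the camera."""
    return math.degrees(math.atan2(x_offset_m - lateral_m, distance_m))

# Example: participant drawn 0.30 m to the right, user 0.8 m from the camera
# -> the user would have to turn ~20.6 degrees; shifting the participant back
# toward the camera axis removes the discrepancy seen by the remote side.
print(round(yaw_correction_deg(0.30, 0.8), 1))
```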
  • the second and third form of embodiment solve the problem of requiring the user to recenter their head/body towards the central webcam by increasing the number of cameras.
  • the cameras can be either part of a single device (described below), or as disparate webcams connected to a single computational unit.
  • the general configuration can be described as follows in FIG. 4.
  • the user 400 is positioned in front of a device 401, which is primarily the 3D webcam device described in the 3rd embodiment of this disclosure, but may also include other computational devices either attached or not, such as personal computers, mobile phones, or display terminals.
  • the 3D webcam 401 comprises at least three color sensors depicted here. The limitation to 3 is for simplification purposes in order to show a top-down view, and other configurations will be described below.
  • the 3D webcam 401 will have a central camera 402 positioned directly in front of the user 400, as well as a right (top) 403 and left (bottom) 404 camera. These cameras are depicted as being on, or near the same optical axis as the central camera 402 (meaning, on the same z-axis plane, with the z-axis pointing from the camera to the user), however other embodiments can be used to offset towards the user, without a loss of generalization. Both side cameras (403 and 404) are depicted with a rotation towards the central axis of the central camera 402, to overlap the fields of view of the three cameras.
  • An additional sensor 405 is schematically shown here to be adjacent to the central camera 402.
  • This sensor 405 can be a range or depth sensor, such as a Time-of-Flight (ToF) sensor.
  • a ToF sensor can be a single pixel, or even a full matrix of sensors providing a direct depthmap (DM) within its Field Of View (FOV).
  • DM direct depthmap
  • FOV Field Of View
  • Placing this sensor 405 adjacent to the central camera 402 allows the calibrated matching of the depth data with the color data.
  • Alternative configurations could include depth sensors adjacent to the off-axis cameras, as well as different technologies to detect range, such as ultrasound and structured light sensors.
  • ToF is beneficial in that it is simpler to align and co-locate with the central sensor 402, as well as potentially being of lesser cost than other technologies.
  • the user 400 will typically sit at a distance 406 of 0.5 - 1.5 m from the camera device 401.
  • This distance 406 is up to each user, however, these distances are standard for most VC applications.
  • the background of the capture image may sometimes be obtrusive. Correcting this can employ methods such as foreground detection and segmentation.
  • it is a known advantage of 3D cameras that they can automatically remove the background by simply discarding all information outside of a certain range. Therefore, in a 3D webcam 401 such as the one disclosed here (which shares those features with a regular 3D camera), a range can be defined (a priori or by the user) below which no information will be taken, e.g. range 407, or beyond which no information will be taken, e.g. range 408.
  • the pixels from the RGB camera can be correlated to match those within the desired range from the depth camera.
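  • The range-based background removal described for ranges 407 and 408 can be sketched as a simple mask over a depth map registered to the color image; this is an illustrative example only, with assumed near/far values:

```python
import numpy as np

def mask_by_range(rgb: np.ndarray, depth_m: np.ndarray,
                  near_m: float = 0.3, far_m: float = 1.5) -> np.ndarray:
    """Keep only RGB pixels whose registered depth lies inside [near_m, far_m].

    rgb     : HxWx3 colour image
    depth_m : HxW depth map in metres, registered pixel-to-pixel to the colour image
    Pixels outside the range (or with no depth reading) are blanked, discarding
    the background as described for ranges 407/408.
    """
    keep = (depth_m >= near_m) & (depth_m <= far_m)
    out = rgb.copy()
    out[~keep] = 0
    return out
```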
  • the other participants 409 will appear to the user as appearing “behind” the device 401.
  • In FIGs. 5A - 5C the advantages of the off-axis cameras are schematically depicted.
  • the user 500 is depicted having an exaggerated “nose” 501 used here to signify the direction that the user is looking.
  • the camera system has been simplified to include a central camera 502, top (right) camera 503, bottom (left) camera 504, and an optional 3D sensor 505.
  • the user 500 is looking straight ahead, and therefore the user’s face is well captured within the FOV of the central camera 506.
  • when the user turns to the right, as in FIG. 5B, the ideal capture of the facial region is with the right (top) camera 508, such that the FOV 509 of camera 508 contains the entire front of the face of the user.
  • the other cameras will have an unreliable view of the user’s face - for example, the left camera (bottom) 510 will be unable to see the right eye of the user at all under most angles.
  • Another situation that the user may find themselves in is when they are looking between cameras, as in FIG. 5C.
  • the user 511 is looking between the central 512 and right (top) 513 cameras, such that their FOVs (514 and 515, respectively) will both capture the user's face. This can occur if the other participants are placed in a virtual environment between the FOVs of the cameras, or if the user is simply "staring into space". Regardless, this is a standard situation where the information from the two (or more) sensors is in one embodiment fused to obtain a better (intermediary) result, such as to correct for gaze.
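  • One simple way to realize such a fusion (purely illustrative, and not necessarily the method intended here) is to weight each camera's contribution by how closely the user's estimated yaw points toward that camera:

```python
def fusion_weights(user_yaw_deg: float, camera_yaws_deg: list[float]) -> list[float]:
    """Blend weights for per-camera data (images, landmarks, depth points)
    when the user looks between cameras.  Cameras whose direction is closer
    to the user's yaw get more weight; weights sum to 1."""
    inv = [1.0 / (abs(user_yaw_deg - c) + 1e-3) for c in camera_yaws_deg]
    total = sum(inv)
    return [w / total for w in inv]

# User looking half-way between the central (0 deg) and right (18.5 deg) cameras:
print(fusion_weights(9.25, [-18.5, 0.0, 18.5]))   # the central and right cameras dominate
```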
  • the relation between the position of the other participants and user will be a function of the camera device, or individual cameras themselves: their size and rotation angles, the distance of the viewer to the cameras, the number of participants, and the preferred positioning of the participants by the user (software controlled). This will be described in general fashion in FIG. 6 in relation to the cameras themselves.
  • FIG. 6 depicts a user 600 who is located in front of a simplified 3D camera setup, with a central camera 601 and a left 602 and right 603 camera (simplified to only these three components, for clarity purposes), and at a distance 604, denoted D, from the central camera 601.
  • the cameras are here shown along the same axis (for simplicity), with a distance 605, denoted L, between the side cameras (602 and 603) and the central one 601.
  • There are two sets of FOVs that are relevant to this discussion: the FOV 606 of the user 600 and the FOV 607 of the camera(s).
  • the camera FOVs are fixed by the device, whereas the user can rotate their FOV (rotations along one axis are the only ones of interest for this discussion).
  • the side cameras (602 and 603) are better at capturing most of the user’s face even when the user 600 turns their face. However, after a certain angle, half of the face will be occluded.
  • This angle can also be defined by a certain distance 608, denoted S, to the right and left (assuming symmetry) of the side cameras (602 and 603), and related to D and L by the tan() function. The further back a user sits/stands from the camera, the wider the effective span (S+L) over which they can look, however at the expense of increasing the distance to the other users (lowered effective resolutions of the 3D representations).
  • the angle that the user is looking at the rightmost camera 603 may or may not be equal to the angle to the right participant 610, since it is a function of D and L (closer to the camera means a higher angle).
  • the 3D representation with in-filling of any occluded regions would account for this, and the added camera in the general direction that the user is looking will provide a far superior result than if there was only a single camera 601 in the center, regardless of how much the user 600 may be modeled.
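  • The tan() relation mentioned above can be written out explicitly; the following sketch (illustrative only, with an assumed occlusion limit) shows how sitting further back widens the lateral span S+L that remains well covered by a side camera:

```python
import math

def lateral_span_m(D_m: float, max_yaw_deg: float) -> float:
    """Lateral distance (S + L) from the optical axis that the user can look
    toward while a side camera still captures the front of the face, for a
    fixed maximum yaw angle before occlusion: S + L = D * tan(theta)."""
    return D_m * math.tan(math.radians(max_yaw_deg))

# For an assumed 30-degree occlusion limit, sitting further back widens the
# usable span (at the cost of lower effective resolution of the user):
for D in (0.6, 0.9, 1.2):
    print(D, "m ->", round(lateral_span_m(D, 30.0), 2), "m of usable span")
```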
  • the disclosure here includes the ability to redefine the virtual positioning of the other participants. This can be done per-user (meaning everyone has their own preferred positioning), or with the positioning defined by one of the users (such as the meeting host) being inherited by all other participants. Adjusting the virtual positions of the participants will then require further adjustments to the perceived angle that each participant is seen by others, as will be described below, since each participant will be rotated in different directions.
  • In FIGs. 7A - 7H multiple examples are qualitatively shown regarding the different configurations possible for 3-6 participants.
  • the case of 2 participants is trivial, and more than 6 becomes cumbersome to draw, but is not a limitation of the system.
  • 3 participants would be sitting around a virtual equilateral triangle 700, with an equal distance between them, as defined by a circle 701 encompassing triangle 700.
  • Each would have equivalent views, without angular or rotation distortions. While each user is equivalent, to describe the changes that could be made per-user, in this discussion regarding FIG. 7A, the “user” 702 will be located at the bottom of the triangle 700.
  • the cameras can be located at a distance of 5-30 cm from each other (another variable).
  • a distance of 20 cm between the cameras will be considered, as well as a distance to the user of 60 cm or 90 cm from the camera.
  • Table 1 includes the case of 3 participants, showing an example of the location of the virtual participants (P1 and P2) as a function of their distance from the central camera.
  • the angle is set at 30 degrees (from center) for the symmetric positions, giving a distance of ~35 cm to either side for the locations of the virtual avatars when seated at a distance of 60 cm from the central camera.
  • the angle for the tight configuration need not be fixed, and can be changed.
  • In Table 1 the angle is defined to be 18.5 degrees, which places the virtual participants at a distance of 20 cm off center when seated at a 60 cm distance.
  • the reason to choose this angle is therefore to ideally place the avatars in the direct line of sight to the side cameras, which in this embodiment example are 20 cm from the center too. Note though, that this distance would change when the user moves their head forward or backwards. Nevertheless, the goal is to obtain as ideal a FOV coverage as possible of the front of the face, making this a near ideal angle.
  • the angle could therefore be set by the user/software, or it can be deduced by first measuring the distance to the user, for example in a setup stage prior to the conversation, and then adjusted to ensure that the camera angle best fits the user’s distance.
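  • The numbers quoted for Table 1 follow from the same tan() relation; a short check (illustrative only) reproduces the ~35 cm symmetric offset and the 20 cm tight offset at a 60 cm seating distance:

```python
import math

def avatar_offset_cm(distance_cm: float, angle_deg: float) -> float:
    """Lateral placement of a virtual participant, off the central axis, for a
    user seated distance_cm from the central camera and an avatar placed
    angle_deg off centre."""
    return distance_cm * math.tan(math.radians(angle_deg))

print(round(avatar_offset_cm(60, 30.0), 1))   # symmetric: ~34.6 cm (~35 cm)
print(round(avatar_offset_cm(60, 18.5), 1))   # tight: ~20.1 cm, on the side cameras
print(round(avatar_offset_cm(90, 18.5), 1))   # same angle at 90 cm: ~30.1 cm
```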
  • other numbers of participants can be described in the same way. As discussed in relation to FIG. 2, the case of 4 participants would ideally be in the orientation of a square virtual table 706 (as illustrated in FIG. 7C), with the tighter form being a “kite” shape 707 (with respect to the user placed at the bottom), as illustrated in FIG. 7D.
  • In Table 2 the angles, distances and locations of examples of both the symmetric and tight configurations are shown.
  • angles for the tight configuration were once again chosen such that the virtual participants appear up to 10 cm away from the direct line of sight to one of the 3 cameras when seated at a distance of 60 cm from the 3D webcam.
  • the outer participants (P1 and P4) would be 10 cm away from the outer cameras, whereas the inner participants would be between the cameras, and close to the side cameras (2 cm away, approximately). If the user moves backwards to 90 cm away from the camera, the outer participants, located ~15 cm away from the outer cameras, will begin to be far from ideal, and therefore the angles would be adjusted in the tight configuration to be closer.
  • participants located 120 cm away from the center would force the user to turn their head to see the outer participants, as they would begin to be outside the central FOV of the user (roughly 120 degrees total, whereas the half angle to the outer participants is ~54 degrees).
  • Some of these can be remedied algorithmically by taking into account the mapping of the angles of the tight configuration with the movements of the participants (such as to rotate them artificially more or less, depending upon their mapped distortion), or by simply changing their size with respect to the difference in virtual distance between the participants.
  • the algorithm for the first embodiment can be described.
  • In the first embodiment only a single camera is used, and the participants are rotated as a function of where they are looking, with the user being in an "always centered" face tracking configuration.
  • In FIG. 8 a block diagram of an embodiment of this algorithm is depicted.
  • the process begins in step 800 with a query or setting of the camera parameters, such as the resolution and frame rate.
  • To make the virtual participants appear more consistent with their virtual environment, it is beneficial to remove the background of the user, as described in step 801. This can be done using a neural network (NN) and/or using depth data if using a camera with depth sensing capabilities. There are multiple known methods of removing backgrounds in addition to these two, however these are the most robust.
  • This step can be optional in the case that the processing power of the user's device is not enough to handle this stage. For example, if the user is using a mobile device with limited capabilities to run a NN or similar algorithm, then they will appear in the virtual environment with their background as-is.
  • the next stage of the algorithm is to perform face detection and finding facial landmarks, as described in step 802.
  • Facial landmark detection can be performed using a variety of different solutions that are known in the art. While there are solutions providing hundreds and thousands of facial landmarks, the bare minimum required for this algorithm is at least 3 points that are not co-linear. Co-linear points could be the line between the eyes alone; a third point is needed to obtain 3D triangulation, for example the eyes and nose tip. Typically, more points are found, with more points enabling higher precision and robustness in pose estimation, which is implemented next in step 803. In this stage, the translation and rotation vectors of the person's head (and potentially eyes) are calculated by an algorithm that finds these vectors as a function of the location of the user's camera.
  • the algorithms to find these vectors are also known in the art, and can be found in the OpenCV open-source computer vision library, including variations of "pose and project" algorithms. These algorithms typically utilize the camera parameters from step 800, but can work on idealized parameters as well. Additionally, to use such algorithms, the 2D data points found in the landmark detection phase of step 802 are matched with a 3D model of the person. For the case of matching the landmarks to the head (or body), this can be done using a generic 3D model of the human head. This generic model can be customized or personalized to the user to obtain better accuracy. However, even a generic model with 3D points not well fit to the user will still obtain relatively good results.
  • the output of this stage is a set of vectors (or matrices) with the translational and rotation vectors (6DOF), given in the coordinate system of the camera.
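  • A minimal sketch of steps 802-803 using OpenCV's solvePnP, assuming 2D facial landmarks have already been detected and matched to a generic 3D head model (the model point values below are placeholders, not taken from this disclosure):

```python
import numpy as np
import cv2

# Generic 3D head model points (mm): nose tip, chin, eye corners, mouth corners.
# Placeholder values; any set of >= 3 non-colinear points matched to the same
# number of detected 2D landmarks works.
MODEL_POINTS = np.array([
    [0.0,    0.0,    0.0],     # nose tip
    [0.0,  -63.6,  -12.5],     # chin
    [-43.3,  32.7,  -26.0],    # left eye outer corner
    [ 43.3,  32.7,  -26.0],    # right eye outer corner
    [-28.9, -28.9,  -24.1],    # left mouth corner
    [ 28.9, -28.9,  -24.1],    # right mouth corner
], dtype=np.float64)

def estimate_pose(image_points_2d, frame_width, frame_height):
    """Translation/rotation of the head in the camera coordinate system
    (step 803), from 2D landmarks matched to the generic 3D model.
    Uses idealized camera intrinsics if no calibration is available."""
    focal = frame_width  # rough idealized focal length, as noted in the text
    camera_matrix = np.array([[focal, 0, frame_width / 2],
                              [0, focal, frame_height / 2],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))  # assume no lens distortion
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS,
                                  np.asarray(image_points_2d, dtype=np.float64),
                                  camera_matrix, dist_coeffs)
    return ok, rvec, tvec   # rotation (Rodrigues vector) and translation, 6DOF
```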
  • the next stage is to rotate the viewpoint of the user in the case of the first embodiment of a single camera, as described in step 804.
  • This stage may be optional if the user prefers to rotate their screen manually (such as is done in video games using a user interface).
  • the rotation of the screen can be simplified for this description to the yaw angle around the y-axis, although full 6DOF is also possible.
  • the goal is to have the user always looking “forward” at the camera in order to always have their full face in view of the camera.
  • Methods to rotate the viewpoint are highly dependent upon the software used to view the other participants, however they are known in the art of both head tracking applications as well as gaming applications.
  • The rotation and translation vectors calculated in the previous stages were in the coordinate system of the camera, but what needs to be sent to the other participants in step 805 is the translation and rotation in the coordinate system of the joint environment that they are all in.
  • the physical camera in front of the user acts as an anchor point to their virtual (2D/3D) representation: any rotation/translation of the user will be done relative to it, and the initial orientation of the user can be defined as looking at the center of the virtual table in the examples above. There need not be an actual table in this environment, so long as there is some form of predefined "center". Then, all rotations and translations are expressed relative to this anchor point.
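  • A sketch of this step-805 conversion, under the assumption that the camera is the anchor and the user's seat position and centre-facing heading in the shared environment are known (names are illustrative):

```python
import numpy as np

def to_shared_frame(yaw_cam_rad: float, t_cam_m: np.ndarray,
                    seat_xy_m: np.ndarray, seat_heading_rad: float):
    """Map a head pose given in the camera coordinate system to the joint
    environment: the camera is the anchor, and yaw = 0 / t = 0 in camera
    coordinates corresponds to the user sitting at their seat and looking
    at the predefined virtual centre."""
    world_yaw = seat_heading_rad + yaw_cam_rad
    # Rotate the (x, z) camera-frame translation into the seat's heading
    # and add it to the seat position (height is ignored for simplicity).
    c, s = np.cos(seat_heading_rad), np.sin(seat_heading_rad)
    dx, dz = t_cam_m[0], t_cam_m[2]
    world_xy = seat_xy_m + np.array([c * dx - s * dz, s * dx + c * dz])
    return world_yaw, world_xy
```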
  • the user can also navigate using an external interface such as a mouse, touchpad, keyboard, gamepad or joystick, as is done in the gaming world.
  • the user will always be looking forward, which is advantageous, but also requires that the user free up a hand or two to move, which may provide a less immersive experience. It is also less desired in the case of a headset, where there may not be any peripheral interface available.
  • the image data is also sent in step 805, as well as any other data such as 3D meshes, as described in later embodiments, and other metadata that is typically sent in such VC systems (e.g. name, location, etc.).
  • the landmark features can also be sent at this point, to reconstruct the user using various graphical and computational techniques.
  • the system comprises a user display (not shown), and each user display can have its own positioning arrangement of the participants.
  • This data can be sent as an analog angular compression value, or as a Boolean if the compression angles are predetermined.
  • the one or more processors of the system output data associated with the received video output (i.e. the processed images) and the mapped positions and orientations.
  • the combined data is sent to the other participants, with each other user in one embodiment receiving the exact same data, which is preferable for scaling issues.
  • the other participants will then locally render the user (and other participants) on their display and software, as described in step 806.
  • This rendering for the case of a single camera will primarily include the 2D video sent, however may also include other 3D data information.
  • the exact means of the rendering is dependent upon the software used, and in no way limits this disclosure.
  • the render can be on a 2D plane situated in 3D space, or on a 3D object or template.
  • In step 807 the correction is made for any angular compression, as described above.
  • the correct orientation of the rendered user on the participant's end should preferably match the viewpoint of the user. This is true whether or not head tracking is implemented. If the angles are compressed, as described in the Tables above, and limiting for simplicity the discussion to yaw angle compression, then a yaw rotation is in one embodiment compressed by a factor of a/b, with a being the maximal angle to the rightmost viewer without compression, and b being the maximal angle to the rightmost viewer after compression.
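  • One reading of this correction, as an illustrative sketch with hypothetical function names: the yaw measured against the tight (compressed) layout is scaled back up by a/b before the user is rendered for the other participants:

```python
def corrected_yaw(physical_yaw_deg: float, a_deg: float, b_deg: float) -> float:
    """Step-807 correction: the sender's display uses a tight layout whose
    maximal angle is b, while the shared environment uses the uncompressed
    maximal angle a, so the physically measured yaw is scaled by a / b
    before the user is rendered for the other participants."""
    return physical_yaw_deg * (a_deg / b_deg)

# Example: the tight layout places the rightmost participant at 18.5 deg (b),
# while the uncompressed layout would place them at 30 deg (a).  A user who
# physically turns 18.5 deg is rendered as turning the intended 30 deg.
print(corrected_yaw(18.5, a_deg=30.0, b_deg=18.5))   # -> 30.0
```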
  • the second embodiment of this disclosure includes a simpler, more “digital” variation of the 3D representation.
  • the 3D representation of the person can always be replaced with a 2D image of the person, and situated in a 3D environment.
  • the 3D aspect of the conversation is somewhat maintained in the spatial awareness of the viewers with respect to one another; however, this will be disrupted every time the user rotates their head a little to control the camera.
  • In FIG. 9 a simplistic top-down schematic depicting a 3-camera configuration 900 is shown, here forgoing any 3D sensor aspect for simplicity. Shown also is a user 901 who is situated at a distance 902, denoted d, from the central camera 903, presumably located - as in a standard configuration - before a static camera system and display.
  • FIG. 10 illustrates a block diagram of a system 1000 used for this second embodiment.
  • system 1000 comprises a central camera 1001 and the side cameras 1002, all of which may or may not include additional 3D sensors.
  • the background may be removed in a background removal functionality 1003 using the methods described above.
  • While pose estimation can be employed on all of the camera streams, it is primarily performed by a pose estimation functionality 1004 on the central camera 1001, since the 3D orientation of the user will be with respect to it. Removing the background and estimating the pose can also be done directly within a combined deep learning network.
  • the camera system 1000 further comprises an image selection functionality 1005 which will then select which images to send, via a communications system 1006, and to which other participants.
  • the central camera would generally be chosen, regardless of the orientation of the user’s head.
  • the participant on the right will see the stream of video coming from the camera to the right of the first user, whereas the participant on the left would (perhaps) see the video emitted from the camera on the left. This would give the participants a feel for whether the user is looking or addressing them, or whether they are looking in a different direction.
  • the number of outgoing streams will be defined by the number of overall participants in the system. Therefore, the image selection process of image selection functionality 1005, will be a function of the 3D orientation of the participants with respect to each other, as well as the number of users. It could be that many users are sent identical streams, as in this case, there are only 3 viewpoints to choose from.
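  • The selection performed by image selection functionality 1005 can be sketched as choosing, for each receiving participant, the camera whose direction best matches the direction from the user toward that participant's virtual position (illustrative only; a practical system would add hysteresis and use the calibration described below):

```python
def select_camera(participant_yaw_deg: float, camera_yaws_deg: list[float]) -> int:
    """Index of the camera whose direction best matches the direction from
    the user toward a given participant's virtual position.  With three
    cameras, many participants may be served the identical stream."""
    diffs = [abs(participant_yaw_deg - c) for c in camera_yaws_deg]
    return diffs.index(min(diffs))

# Central camera at 0 deg, side cameras at +/-18.5 deg (as in the example above):
cams = [-18.5, 0.0, 18.5]
print(select_camera(-25.0, cams))  # 0 -> left camera stream
print(select_camera(3.0, cams))    # 1 -> central camera stream
print(select_camera(15.0, cams))   # 2 -> right camera stream
```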
  • the solution is a digital or discrete form of the general solution depicted in this embodiment, where the creation of a 3D representation is such that it can be viewed from any, nearly arbitrary, angle.
  • any of background removal functionality 1003, pose estimation functionality 1004 and image selection functionality 1005 are implemented as respective portions of software executed by one or more processors. Alternatively, or additionally, any of background removal functionality 1003, pose estimation functionality 1004 and image selection functionality 1005 are implemented on dedicated circuitry, such as a microprocessor, a field-programmable gate array, or other type of integrated circuit.
  • communications system 1006 comprises any appropriate communication hardware, such as a connection to the Internet, a connection to an Ethernet network and/or a cellular connection.
  • a block diagram such as that appearing in FIG. 10 primarily refers to the operation of the system during run-time.
  • an initial alignment or calibration stage can occur that runs on a different set of system blocks.
  • a calibration stage can occur before a session whereby the orientation of the user's face can be found using all 3 cameras (in this example), and the threshold angle to switch between cameras can be found by averaging the estimated poses between the three cameras. This processing can also occur on a separate system, without the need for real-time computational speeds.
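  • A sketch of such a calibration pass, with hypothetical names, where yaw estimates collected per camera are averaged and switch-over thresholds are placed halfway between adjacent cameras:

```python
import statistics

def switching_thresholds(yaw_per_camera_deg: dict[str, list[float]]) -> dict[str, float]:
    """Given yaw estimates of the user's face collected by each camera during
    calibration, average them per camera and place the switch-over threshold
    halfway between adjacent cameras' mean yaws."""
    means = {name: statistics.mean(vals) for name, vals in yaw_per_camera_deg.items()}
    ordered = sorted(means.items(), key=lambda kv: kv[1])
    thresholds = {}
    for (name_a, yaw_a), (name_b, yaw_b) in zip(ordered, ordered[1:]):
        thresholds[f"{name_a}->{name_b}"] = 0.5 * (yaw_a + yaw_b)
    return thresholds

# Example calibration samples (degrees, hypothetical):
samples = {"left": [-19.0, -18.2], "centre": [0.4, -0.3], "right": [18.1, 18.9]}
print(switching_thresholds(samples))   # thresholds roughly at +/- 9 degrees
```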
  • the third main embodiment of this disclosure involves improving the representation of each user from a 2D video stream into a 3D representation (“hologram” or 3D mesh).
  • a combination of 3D data and machine learning algorithms can be utilized, as has been done in the prior art.
  • a specific set of algorithms can be utilized to minimize the amount of processing done.
  • processing 3D or NN solutions to this problem requires high-end hardware, and is difficult to perform in real-time.
  • simplifications can be made.
  • FIG. 11 illustrates a generalized description of the usage of 3D data and models.
  • In FIG. 11 a depiction of the head only appears for simplicity, though the goal of this disclosure is to capture the upper torso and arms/hands as well.
  • a model of a human head 1100 can be completely generic, or be selected from a library of models varying by gender, race and age, as well as including hair and skin tone.
  • An advantage of using a model is that there are methods of describing the model using a smaller number of parameters, which can be less than the full 3D data required to create this model.
  • the 3D data 1101 will only cover a portion of the object.
  • some of the points appear “behind” the head, but this is just for simplicity in describing the method, since the drawing here is in 2D, whereas the points are in 3D.
  • data from behind the user won't be obtained unless the user uses a mirror, or the user pre-captures themselves from all directions.
  • this 3D data can be obtained from both depth sensors and stereo calculations, as described above, and therefore can include data from more than one angle, as well as having a weight per data point regarding the accuracy of that point.
  • the data can then be combined at a step 1102: using the 3D data 1101 as well as the model 1100 to obtain a match between the two, indicated as object 1103.
  • if the model is generic, then there will be mismatches between the data and the model; yet even if the model is based on the user (or a previous frame), there will be mismatches. For example, if the user smiles, then the locations of the mouth will be different, and modifications to the model should be made.
  • the advantages of using a model are that it can in-fill regions that are not captured, as well as potentially reduce the amount of data transmitted.
  • a step 1104 can be processed where the data points found are mapped to the set vertices of the model 1100.
  • This allows the model to be tracked through time, with the same number of vertices in existence.
  • the tracking of data between frames is also important for 3D data compression. Likewise, it can remove superfluous data points, as well as interpolate missing data points needed.
  • the resulting 3D representation 1105 can then be described as one that can match the user more accurately, can be represented by a fixed number (and more evenly spaced) of points, and that covers the entire region, including potentially occluded regions of the images.
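One simple way to realize the vertex-mapping step 1104 is a nearest-neighbor assignment of captured 3D points to the fixed vertex set of the model, with matched vertices nudged toward the measured data. The sketch below is only a minimal illustration of that idea; the blending weight and distance cutoff are arbitrary assumptions, and a production system would more likely use a non-rigid registration method.

```python
import numpy as np
from scipy.spatial import cKDTree

def update_model_vertices(model_vertices: np.ndarray,
                          captured_points: np.ndarray,
                          max_dist: float = 0.02,
                          blend: float = 0.5) -> np.ndarray:
    """Map captured 3D points (N, 3) onto model vertices (V, 3) and blend.

    Vertices with no nearby measurement keep their model position, which is
    what in-fills occluded regions such as the back of the head.
    """
    tree = cKDTree(captured_points)
    dist, idx = tree.query(model_vertices)   # nearest data point per model vertex
    updated = model_vertices.copy()
    close = dist < max_dist                  # ignore far-away (unreliable) matches
    updated[close] = ((1.0 - blend) * model_vertices[close]
                      + blend * captured_points[idx[close]])
    return updated                           # same vertex count as the model, tracked frame to frame
```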
  • a final benefit of using a model is in filling in regions of the face that may be occluded by headsets. If the user is wearing an AR/VR headset, then parts of their face may be occluded from the camera around the eyes. For AR glasses with semi-transparent views, this may be negligible, however, for opaque coverings of the eyes, there is a benefit to filling in this region “artificially” by using a model.
  • This model can be pre-trained on the user (before the session), i.e. the model is unique to this particular user, and can include eye gaze tracking and expression tracking to have the model mimic the user, as has been shown in the art for VR headsets.
  • The utility of these occluded points is described in relation to a device 1200 illustrated in FIG. 12. Only a portion of device 1200 is illustrated, for simplicity.
  • the simplified notation of 3 cameras 1201 (as in FIG. 9) is displayed, with a user 1202 looking to the right.
  • the rightmost (top) camera’s FOV 1203 will best capture the user’s face and upper body; however, the data points obtained via the depth sensor and/or stereo matching, denoted 1203, will be limited to primarily the front section of the user 1202. Therefore, not only will the back of the user 1202 not be present, but potentially the right side of the face as well.
  • the 3D representation can be modeled at step 1204 to include the regions occluded from the camera’s FOV, as well as be distributed in a more ideal manner in step 1205, such as to minimize the amount of data and track the data from frame to frame. Due to the specific coverage of the cameras in this device disclosure, and the specific orientation of a person with regards to the device (a webcam), there is a strong relation between the algorithms, pipeline and orientation calculations with the device itself.
  • the third embodiment of this disclosure focuses on the improvements that would occur if the 3 cameras (or more) of the previous embodiment were combined into a single embedded device.
  • the algorithms of the previous paragraphs can be implemented per-user, and ideally at the edge, or in the cloud.
  • in FIG. 13, a disconnected block diagram is shown for many of the discrete blocks of processing that occur in the pipeline of this latter embodiment. Variations on this embodiment can include all or some of the blocks, and the exact sequence of the blocks may change, without a loss of generality.
  • each of the blocks of FIG. 13 can be implemented as respective portions of software executed by one or more processors.
  • any of the blocks of FIG. 13 can be implemented on dedicated circuitry, such as a microprocessor, a field-programmable gate array, or other type of integrated circuit.
  • the general flow of the blocks is to first obtain the images, to extract information from the images, to create 3D representations of the users, to transmit the images to a central server, to process the locations of the different participants, and to send data back to the participants.
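The flow just described can be pictured as a chain of stages. The skeleton below is only an organizational sketch of that flow under the assumption of per-frame processing; the function names, signatures and the mapping of helpers to step numbers are invented for illustration and do not correspond to an actual implementation in this disclosure.

```python
# Placeholder stage implementations: each stands in for a processing block of
# FIG. 13 and simply passes data through in this sketch.
def extract_features(frames): return frames            # e.g. facial detection (step 1305)
def segment_user(frames): return frames                # segmentation / background removal (1306)
def stereo_match(frames): return None                  # stereo DM/PCL calculation (1303)
def fuse_depth(stereo_depth, sensor_depth): return sensor_depth      # depth fusion (1304)
def build_representation(frames, depth): return {"frames": frames, "depth": depth}  # 1310/1311
def apply_orientation(rep, layout): return rep         # user orientation vs. participants (1312)
def compress(rep): return rep                          # compression before transmission (1314)

def process_frame(color_frames, depth_frame, layout):
    """One pass through the capture-to-transmission flow for a single user."""
    frames = segment_user(extract_features(color_frames))
    fused = fuse_depth(stereo_match(frames), depth_frame)
    rep = apply_orientation(build_representation(frames, fused), layout)
    return compress(rep)                                # payload sent onward to the central server
```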
  • the general form of this pipeline is somewhat similar to a standard VC pipeline, with the modifications that the data being sent is not merely a 2D video stream and includes 3D meta-data, as well as the processing steps of creating a 3D representation, which may or may not be processed on the central server.
  • the audio portion (or even 3D audio) is not depicted for reasons of simplification only.
  • the data streams that initiate the pipeline are primarily color video and depth data (assuming a dedicated depth sensor is used), obtained in step 1300 by color sensors and in step 1301 by one or more depth sensors.
  • the device can utilize both stereo data 1303 and depth sensor data 1301, which can then be combined at step 1304. This will be discussed in detail below.
  • the device disclosed as part of this disclosure is described as using embedded processing units to perform these functions. Therefore, the delineation of where these processes occur (appearing as dashed lines 1302 in FIG. 13) is a function of the exact hardware architecture, as well as of the embedded software on these devices.
  • the device can perform next to none of the processing steps appearing in this pipeline schematic, with the first delineation line signifying the end of the device component and the transfer of data to a secondary device or cloud server. These embodiments are described briefly below, with an overview of the processing pipeline stages described here first.
  • further direct processing on the individual data streams can include a feature extraction step 1305, such as facial detection, as well as a segmentation step 1306, which can allow a reduction of the amount of data processed, as well as ultimately a reduction in the amount of noise in the system. Segmentation also enables background subtraction.
  • a processing unit, embedded or not, can perform these calculations in parallel with, or subsequent to, the prior processing steps.
  • the next stage is to combine the textures at step 1307, as well as potentially reduce the amount of data at step 1308.
  • This data reduction can be performed by down-sampling, decimating, or utilizing only data from segmented image frames, for example.
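As one concrete, purely illustrative form of the data reduction in step 1308, a point cloud can be decimated with a voxel-grid filter that keeps a single representative point per occupied voxel. The voxel size below is an arbitrary assumption.

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel_size: float = 0.01) -> np.ndarray:
    """Reduce an (N, 3) point cloud by keeping one point per occupied voxel."""
    if len(points) == 0:
        return points
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)
    # Keep the first point encountered in each voxel.
    _, keep = np.unique(voxel_idx, axis=0, return_index=True)
    return points[np.sort(keep)]
```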
  • Another step is that of compression, in step 1309.
  • the video streams are almost always compressed to lower the bandwidth requirements; this is typically performed using formats such as H.264 or VP8.
  • the next stage (regardless of a potential compression and decompression stage 1309) is to combine the data into a 3D representation. There are numerous known algorithms for creating 3D meshes from data, in step 1310. If only color images are used, then photogrammetry methods are used.
  • if 3D depth data is included, then first a 3D mesh is created and the texture is added, or each 3D data point is attributed a color vertex, with other approaches also possible.
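For the per-point coloring variant mentioned above, each 3D point can be projected into a color image using the camera intrinsics and sampled there. The sketch below is a minimal pinhole-projection illustration under the assumptions that the intrinsics are known and that lens distortion and occlusion handling can be ignored.

```python
import numpy as np

def colorize_points(points_cam: np.ndarray, image: np.ndarray,
                    fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Attach an RGB color to each 3D point given in the color camera's frame.

    points_cam: (N, 3) points in the camera coordinate system (z forward).
    image:      (H, W, 3) color frame from the same camera.
    Returns an (N, 6) array of [x, y, z, r, g, b]; points projecting outside
    the image, or with non-positive depth, get a zero color.
    """
    h, w = image.shape[:2]
    z = np.where(points_cam[:, 2] > 0, points_cam[:, 2], np.inf)  # guard against z <= 0
    u = np.round(fx * points_cam[:, 0] / z + cx).astype(int)
    v = np.round(fy * points_cam[:, 1] / z + cy).astype(int)
    colors = np.zeros((len(points_cam), 3), dtype=image.dtype)
    valid = (points_cam[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    colors[valid] = image[v[valid], u[valid]]
    return np.hstack([points_cam, colors.astype(np.float64)])
```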
  • An alternative approach is to utilize a pre-existing template - for the example of a person, this could be the form of a generic mannequin - and then place the textures and/or 3D data onto this template.
  • this modeling process, in step 1311, can use the 3D data to correct the model, and can also use data from a parallel meshing process 1310 to improve the model.
  • Yet another approach is to use neural network variations to create the 3D data based on existing data and an estimate of the 3D model based on the input data, or employ methods of neural radiance fields to estimate the viewpoints of virtual intermediary capture angles (not shown in the figure for simplicity).
  • the model used in step 1311 can be generic or can be pre-learnt on the user in a process stage occurring beforehand, such as when installing the camera device, or when initiating a conversation.
  • An advantage of utilizing some form of modeling is to infill data that is lacking. For the device in this disclosure, the capture of the back of the head and torso will not exist, and therefore utilizing some form of closure method to the 3D representation is beneficial, whether by utilizing a generic, pretrained, or estimated model.
  • the orientation of the user vis-a-vis the other participants in the conversation is to be considered, in step 1312.
  • This stage is in one embodiment identical to that of the previously described embodiments.
  • the orientation of the users can also be utilized if the data transmitted to the participants (clients) is also pre-rendered, in step 1313.
  • This allows a simple video stream to be transmitted, as opposed to sending the full 3D representation to each user.
  • this method also has drawbacks, as lags associated with the transmission may cause discomfort for the user if the images are not rendered at the same rate as the movement of the user and the other participants (as is known in the art regarding VR headsets as well, for server-side rendering).
  • the next step 1314 could be a simple video compression of the stream. If, in the primary embodiment sets, the 3D representations are transmitted as-is, then in one embodiment they also undergo some form of compression. This compression can unfold the textures from the 3D mesh, and impart the 3D data information as well (once again, not including audio in this description). There are methods of creating this form of compression, including methods that transform the 3D representation into a video stream, which can then be further compressed utilizing standard video compression techniques, including methods of mesh-tracking and 3D binary trees.
  • the final stage, 1315 is to transmit the data to the participants, including the user. Due to compression techniques, this is generally some form of compressed video and/or including 3D meta-data and sound. The participants then need to decipher the 3D data in a controlled manner.
  • the orientation is a function of the pre-arranged orientation processed as described above in step 1312, as well as particular settings the specific user may have regarding their preferred method of viewing the other participants.
  • the bandwidth of each participant can also define the amount of compression and data sent to it from the server, as is known in the field of VCs.
  • the data the user sees can also include the initial data stream 1300, such as a preview of the user in 2D (or 3D, from the server).
  • One embodiment of the disclosure will involve a camera device (with or without depth sensors) that only transmits the data, either raw, de-Bayered or compressed to the cloud. The rest of the processes will then occur on the cloud.
  • An alternative embodiment will include an intermediate processing unit, such as a personal computer to which the device is attached.
  • on this intermediate unit, any - if not all - of the processes up until the positioning stage can occur.
  • the device would need a powerful computational source, and therefore would limit the independence of the pipeline from each participant.
  • Another embodiment of the device will include embedding some of the processing steps onto the device itself (disregarding the transmission component, for this description). This can include the calculation of 3D depth and stereo, as well as fusing the 3D depth data.
  • FIG. 14 illustrates a generalized and simplified schematic of a 3D webcam device 1400 as described in this disclosure.
  • the device 1400 consists of multiple sensors and configurations that are all variations of the embodiments of this disclosure, and the most basic form is shown here.
  • In order to view a person’s head, with the capability to capture that person when they move their head left and right (but not necessarily up and down, since our heads are usually relatively level during conversation), the device 1400 preferably has at least 3 cameras: a first camera 1401 directly in front of the user, a right camera 1402 and a left camera 1403.
  • these cameras can be “simple” RGB color cameras, but could also include variations on standard sensors such as RGB-IR sensors, or cameras with diffractive elements or phase elements to split colors or deduce depth, without any loss of generality.
  • An additional feature which may be beneficial is if the camera sensors employ global shutters instead of rolling shutters, for improved stereoscopic imaging. They may also employ internal computational capabilities, such as are available on many camera sensors today.
  • the offset distance 1404 of the cameras from the central camera 1401 is in one embodiment symmetric, with the baseline distance affecting the ability to increase the angular coverage of a user.
  • the side cameras 1402 and 1403 can also be rotated towards the central camera 1401, and they may be raised/lowered or brought forward/backwards (not shown), without a loss of generality.
  • the three cameras (1401-1403) may therefore lie on the same plane, or be on a different plane from each other.
  • a depth sensor, such as a direct or indirect ToF (time-of-flight) sensor, is also included: shown here is a single ToF sensor comprising an imaging sensor 1405 and a light source 1406, as is standard for most ToF sensors.
  • the other side cameras 1402 and 1403 can also have a co-located depth sensor, however the single sensor in the center is the primary embodiment variation of this disclosure.
  • this sensor may be replaced with a different technology, such as structured light, assisted stereo illumination, ultrasound or radar, without any loss of generality.
  • the general paradigm for most embodiments is the use of 3 cameras, with a depth sensor co-located with the central camera, at the bare minimum.
  • a camera “above” to capture the face from a higher angle is also beneficial.
  • This is shown as an optional embodiment 1407, with the additional camera sensor 1408 offset by a distance 1409 above the central camera 1401.
  • this camera 1408 too can lie in a different plane from the other cameras, with some embodiments having the top camera 1408 and side cameras 1402, 1403 creating an arc towards the face of the user, and can also include rotations of the cameras, or potential additional cameras.
  • a stereoscopic calculation can be employed between the central and satellite sensors in the first order, and between satellite sensors in the second order of stereoscopic calculations.
  • An additional feature that can be included in this device is the capture of directional sound (“stereo sound”). This can be done by including a microphone within the device, and preferably near each sensor, such that there is one microphone 1410 near the central camera 1401 as well as microphones 1411 near the side cameras 1402 and 1403. This allows the 3D representation to include 3D sound, allowing directional and spatial information of the sound to be included as well.
  • FIG. 15 displays a design for one of the embodiments of this disclosure.
  • the device 1500 is centered on a central camera 1501, with the co-located depth sensor 1502 also adjoined mechanically to the hub 1501A of the central camera 1501.
  • the left camera 1503 and right camera 1504 are shown here to be offset by rods or arms 1505 and 1506, at a distance from the central hub 1501A.
  • the baseline distances can be anywhere from 5 cm to 40 cm; however, as the 3D webcam would typically sit atop a laptop or monitor, the baseline would usually be on the order of 10-30 cm.
  • a long arm is also disadvantageous, as the limited mechanical stability of the arm (even if held from below) may cause the calibration of the cameras to each other to be affected even by thermal fluctuations.
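The trade-off behind these baseline choices can be made concrete with the standard rectified-stereo relation Z = f·B/d (depth from focal length f in pixels, baseline B and disparity d in pixels), whose depth sensitivity is approximately ΔZ ≈ Z²·Δd/(f·B). The short sketch below evaluates this for illustrative values; the focal length, disparity error and working distance are assumptions, not specifications of this device.

```python
def depth_error(z_m: float, baseline_m: float,
                focal_px: float = 1000.0, disparity_err_px: float = 0.25) -> float:
    """Approximate stereo depth uncertainty: dZ = Z^2 * dd / (f * B)."""
    return (z_m ** 2) * disparity_err_px / (focal_px * baseline_m)

# Example: user seated roughly 0.7 m from the device.
for b in (0.10, 0.20, 0.30):  # candidate baselines in meters
    print(f"baseline {b:.2f} m -> depth error ~{1000 * depth_error(0.7, b):.2f} mm")
```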
  • a rigid arm structure is preferred, with the arms also being used to electrically connect the cameras to the central hub. If the arms are intentionally bent, even to ensure better coverage, care must be taken to maintain mechanical stability. In the case where one of the arms is bent or vibrates, such as shown by arrow 1507, the calibration and alignment between the cameras will be affected, thereby deleteriously affecting the stereoscopic calculations. There are methods to correct for such changes to the alignment, employing both imaging and 3D data techniques.
  • the central hub 1501A of the device 1500 includes not only the central camera 1501 and depth sensor 1502, but also internal processing units, input and output connectivity and mechanical attachments 1508, such as a tripod mount or clasp (as examples).
  • This hub 1501A can include the electronics used to create 3D data, to calculate stereo pairs, to fuse DMs and/or point clouds (PCLs), to perform color space transforms, to compress the images and/or 3D data, and to fuse the color images.
  • it can contain computational power used to combine the 3D and color images into a 3D representation, as well as compress or decimate that representation for transmission purposes. It can also include methods of performing AI algorithms on the device itself, such as background subtraction, as described above.
  • the computational unit embedded within may comprise a single unit, or multiple adjoined units. For example, it can contain a board designed for fusing the 3D data, as well as a board designed to compress and transmit the data.
  • Transmitting the data out of the unit can be done either wirelessly or wired.
  • the transmission modem can be either embedded within, or if wired, connected externally to the hub via a cable 1509.
  • This modem can be a free-standing unit, or one embedded in a different device such as, but not limited to, a personal computer, laptop or mobile device.
  • the cable connecting the device to an external device 1510 can provide both input and output, such as is standard with most peripheral devices today.
  • the use of 3 cameras in some embodiments of this disclosure enables what is known as the trilinear or trifocal tensor to be employed.
  • This mathematical description of 3 cameras enables the calibration and alignment of the three cameras in a single set of equations (tensor) that can be solved using multiple methods. Specifically, with 3 separate images from cameras (calibrated), the trilinear tensor for this system can be found from a given number of matched points between the images. This framework would then allow the cameras to be realigned from a single set of images, thereby enabling the correction of any misalignment of the cameras due to shifts or rotations and vibrations of the cameras. It can also recreate the 3D representation based on the knowledge of the tensor and the images, and can be considered one of the embodiments of the 3D camera component.
  • the methods of producing DMs or PCLs from two stereo images rely on algorithms, each with its own pros and cons.
  • if the calculations are to be implemented in real-time (e.g. at 30 fps), for large images (e.g. with 1 M pixels each), and using lower cost hardware, an appropriate algorithm should be selected.
  • One of the most efficient methods is based on a “patch-match” technique, where areas of distinct features are compared to one another in each image. Other methods, such as those based on feature detection and matching, also exist.
  • A simplified graphic describing the use of stereo imaging calculations and the benefits of the device described here in performing faster and more efficient stereo imaging is shown in FIG. 16. Shown are two simplified “images” (drawings are intentionally simplistic for description purposes only) with a left image 1600 and right image 1601 taken from two different cameras. This could be the same camera taken at two different times as well, but the description will discuss only the former case for clarity. In these images, there is a “person” 1602 appearing (for descriptive purposes), with there being a shift in their location from image 1600 to image 1601, as is standard in describing two stereo images.
  • the general approach is typically to run along a single row of pixels (assuming here that the two cameras are mounted horizontally, or have been rectified already), and to search for features that match along that row.
  • the difference in pixels provides the disparity, which then correlates to the depth of the point in space using triangulation formulas.
  • this is shown as matching the left row of pixels (or group of pixels for a patch) 1605 with the right row of pixels 1606.
  • the features are located in distinct “columns” 1607 and 1608, on the left and right respectively, such that if the feature on the right is mapped into the pixels 1609 on the left, then the disparity of pixels 1610 will be the distance in pixels between them.
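For reference, the triangulation mentioned above reduces, for rectified cameras, to Z = f·B/d (depth from focal length f in pixels, baseline B and disparity d in pixels). The sketch below is a hedged illustration of computing a disparity map with OpenCV's semi-global matcher and converting it to depth; the parameter values are arbitrary, and this is not the specific patch-match algorithm referenced in this disclosure.

```python
import cv2
import numpy as np

def disparity_and_depth(left_gray: np.ndarray, right_gray: np.ndarray,
                        focal_px: float, baseline_m: float) -> np.ndarray:
    """Compute a depth map (meters) from a rectified grayscale stereo pair."""
    matcher = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=128,   # must be a multiple of 16
        blockSize=7,
    )
    # StereoSGBM returns fixed-point disparities scaled by 16.
    disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0
    depth = np.zeros_like(disparity)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]   # Z = f * B / d
    return depth
```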
  • A block diagram for the given method of finding the depth points is shown in FIG. 17, for a pair of color cameras (RGB or other similar sensors) 1700 and 1701, and a depth sensor 1702.
  • the diagram is for a single pair, and the block diagram can be easily generalized to any number of pairs.
  • the data from both color cameras will be utilized both for stereo depth calculations and for producing color textures to be used for the 3D texturing.
  • the first set of outputs is simply the fusion 1703 of the color images. This can include steps to lower the data amounts in those images, such as reducing the RGB color format to a YUV422 color format, or similar. It can also include taking raw image data and combining it (before de-Bayering).
  • the images can be fused as simply adjacent matrices or fused using other methods that may make use of other processes occurring in a subsequent step 1704 such as feature extraction or segmentation. They can also be output as individual frames if so desired.
  • One method of combining the images is to make a composite image. If, for example, 3 color cameras are used at a resolution of 1 M pixels, then a combined image of 4 M pixels can be created by placing the images in a square matrix, with one of the 2x2 squares being empty (or filled with other information, such as depth information). This form of combination would make further compression easier for algorithms designed to work on set aspect ratio frame sizes.
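A minimal sketch of the composite-image idea follows: three equally sized frames are tiled into a 2x2 mosaic, with the remaining quadrant left empty (or available for auxiliary data such as a depth map). This is only an illustration of the layout described above; the frame ordering within the mosaic is an assumption.

```python
import numpy as np

def make_composite(frames: list[np.ndarray]) -> np.ndarray:
    """Tile three (H, W, 3) frames into a single (2H, 2W, 3) mosaic.

    The bottom-right quadrant is left as zeros; it could instead carry
    auxiliary data such as an encoded depth map.
    """
    assert len(frames) == 3
    h, w, c = frames[0].shape
    mosaic = np.zeros((2 * h, 2 * w, c), dtype=frames[0].dtype)
    mosaic[:h, :w] = frames[0]   # top-left: central camera
    mosaic[:h, w:] = frames[1]   # top-right: right camera
    mosaic[h:, :w] = frames[2]   # bottom-left: left camera
    return mosaic
```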
  • the second use of the stereo images 1700 and 1701 is to calculate distance using a stereo algorithm.
  • in step 1705, the images are converted to grayscale before the calculation.
  • the algorithm described in general in FIG. 16 occurs, where the input 1707 from the depth sensor 1702 can be applied as well to improve the efficiency of the algorithm.
  • the output 1708 of the algorithm will be either a DM or PCL (depending upon certain details of the algorithm), and there would then be 2 depth data images for the given frames (although the points may be displaced due to the different FOVs of the color and depth cameras).
  • These depth data points can be combined in step 1709 in a fusion algorithm that can be weighted to account for the inaccuracies inherent to the depth sensor and stereo matching pair.
  • the output 1710 of this step would be either two different depth frames, or, using the fusion method described, a single DM or PCL, which could then also be further compressed or reduced.
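A simple, illustrative way to realize the weighted fusion of step 1709 is a per-pixel confidence-weighted average of the two aligned depth maps. The confidence maps and prior alignment are assumed to come from earlier steps, and this particular weighting scheme is an assumption rather than the specific fusion method of this disclosure.

```python
import numpy as np

def fuse_depth_maps(stereo_dm: np.ndarray, tof_dm: np.ndarray,
                    stereo_conf: np.ndarray, tof_conf: np.ndarray) -> np.ndarray:
    """Per-pixel weighted fusion of two aligned depth maps (same shape, meters).

    Pixels where both confidences are zero (no measurement) remain zero.
    """
    weight_sum = stereo_conf + tof_conf
    fused = np.zeros_like(stereo_dm)
    valid = weight_sum > 0
    fused[valid] = (stereo_conf[valid] * stereo_dm[valid]
                    + tof_conf[valid] * tof_dm[valid]) / weight_sum[valid]
    return fused
```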
  • the described block diagram involves steps that require processing on a computational device. While there are camera sensors with embedded processing, such as de-Bayering and compression, as well as even segmentation using NNs, the device described in this disclosure is different from existing architectures. As such, some form of computational device must be connected to the cameras, either within the device itself (embedded), or connected via a transmission cable (wired). The above-mentioned processes, such as stereo matching and depth fusion, could therefore occur on the device itself, given the requirements of the algorithm described, and using the appropriate processing amount. However, creating a full 3D representation may require computational assets that are not tenable for being embedded within a single device, either for monetary or physical size considerations. Therefore, it is preferable to allow some of the algorithms to be processed in other locations, either on a separate computational device, or completely distributed, such as on an externally located server (“cloud”), as was described above in relation to FIG. 13.
  • the overall architecture of the system 1800 involves multiple participants connecting with each other, as shown in FIG. 18.
  • the host may not be the first member to join into a meeting, but they will be defined as the host for two distinct purposes in software, and also potentially in the architecture:
  • the host may also take the role of the server 1802 for the connected network of users, or, as will more typically be the case, the server is located separately, on the cloud or the internet, in a location that may be near or even far from the host (with preference to closer distances due to shorter lag times).
  • the host is important for the software, as it can define the beginning of a linked-list or array of users (or other data structure), which essentially describes “user #1”. It can also be relevant in reference to FIG. 7, where if the host decides to position the other users in a different orientation, such as to emulate a manager sitting in front of their employees at a desk, then the host can be defined as the person whose rotation and orientation is at the “bottom” of the table (as described in relation to FIG. 7).
  • Each participant 1803 in the system 1800 will have an interface 1804 with the overall system that may consist of the camera devices described in this disclosure, or other forms of cameras (both 2D and 3D), as well as even without a camera (or the camera being off), and only including audio. Therefore, the data sent from each user to the server 1802 may be different.
  • the host can send data, via a dedicated interface 1805 including their own image information (e.g. a 3D representation, as in this disclosure, or even just a 2D image from a webcam), as well as information regarding the entire scene that the others will need in order to understand their orientation to each other.
  • This data can include the coordinates of the participants with respect to the host, in a global coordinate system.
  • each participant can be given a unique coordinate that can be rendered in the same location for all users.
  • the coordinates per user can include location and orientation, and additional metadata information can be included for each user, such as (but not limited to) geographical location, local time at the user’s time zone, hyperlinks to other information, as well as information regarding the user during the conversation, such as if they are muted or have their camera off.
  • Participants that are added or removed from a conversation in this system will then be remapped to the data structure at the host or server, and can be retransmitted to each individual participant. Likewise, each participant can send similar information about themselves.
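As a purely illustrative data structure for the per-participant coordinates and metadata described above, a small record like the following could be exchanged between host, server and clients; the field names, units and JSON serialization are assumptions, not a protocol defined by this disclosure.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ParticipantState:
    user_id: str
    position: tuple[float, float, float]       # global coordinates (meters)
    yaw_deg: float                              # orientation about the vertical axis
    time_zone: str = "UTC"
    muted: bool = False
    camera_on: bool = True
    extra: dict = field(default_factory=dict)   # e.g. hyperlinks, geographical location

# Example payload the host/server could broadcast when the layout changes.
state = ParticipantState("user-2", (0.8, 0.0, -0.6), yaw_deg=-35.0, muted=True)
print(json.dumps(asdict(state)))
```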
  • Each user of the system 1800 will therefore be transmitting their own representation (3D, 2D, or other, as described above), as well as receiving from the other users their representations and local transformations including position and rotation.
  • each user then renders the other users on their own device.
  • An alternate embodiment will have the rendering done on the server for each user, who then only receives a server-side-rendered image (or set of images), instead of the full information from each user. This latter embodiment allows for less data transmission to each user, but also requires extremely low latencies to be achieved on the server.
  • the servers may also be performing other algorithmic tasks (per user), as described in earlier figures.

Abstract

A camera capture and communications system constituted of: at least one camera, each comprising a video output; and one or more processors configured to: receive the video output of each of the at least one camera; for each of a plurality of predetermined temporal points, map a position and orientation of a user imaged by the at least one camera to a virtual position and orientation; and output data associated with the received video output and the mapped positions and orientations.

Description

CAMERA CAPTURE AND COMMUNICATIONS SYSTEM
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The present application claims priority from U.S. provisional patent application S/N 63/152,568, filed February 23, 2021, and entitled "Multicamera Capture and Communications System", the entire contents of which being incorporated herein by reference.
TECHNICAL FIELD
[0002] The present disclosure relates substantially to the field of telecommunications, and in particular to a 3D teleconference system.
BACKGROUND
[0003] Interactions online can occur using several types of programs, typically involving 2D video streams, but more recently, using 3D environments as well. When speaking with multiple people using a system that involves 2D video, there is no way to distinguish between who is talking to whom, as there is no spatial correspondence between the speakers. In a standard Video Conferencing (VC) solution, each speaker is given a “square” on the screen, and there is no relation between the squares on one user’s screen and their distribution on any other user’s screen in the conversational network. This lack of spatial awareness has detrimental implications towards the utility of such communications software, causing increased fatigue and lack of engagement due to the unnaturalness of the paradigm.
[0004] Placing people in a 3D environment can solve some of these issues, as it more closely emulates a real-life conversation, which humans are more accustomed to. In a normal conversation around a table, for example, we have developed natural capabilities to distinguish between who is talking to whom and recognize the importance of being able to address someone by turning one’s body towards the intended person. Likewise, the pose and gaze of every user can be towards a single person, or even to random directions, as opposed to all being focused on everyone at the same time.
[0005] The flaw in current communications systems is not just in how they are displayed, but also in how they are captured. Currently, systems rely on cameras attached to devices such as “webcams” (generic nomenclature that includes built in cameras on devices such as mobile handsets, tablets, laptops and monitors), which are generally positioned in the center-top of a device’s screen. What every person sees can be approximated by placing a virtual camera on the face of the user (emulating the eyes), however, how each person is seen is a function of the other users. Thus, for example, if there are two other people in a conversation, their viewpoint of the initial user would be most adequately emulated by placing a virtual camera in the place of their own faces. These virtual cameras located in space would thus emulate the viewpoints of a real-world conversation. The problem, which this disclosure is designed to solve, is that users typically only have a single camera situated directly in front of them.
SUMMARY
[0006] Accordingly, it is a principal object of the present invention to overcome at least some of the disadvantages of prior art telecommunications systems. This is provided in one embodiment by a camera capture and communications system comprising: at least one camera, each comprising a video output; and one or more processors configured to: receive the video output of each of the at least one camera; for each of a plurality of predetermined temporal points, map a position and orientation of a user imaged by the at least one camera to a virtual position and orientation; and output data associated with the received video output and the mapped positions and orientations.
[0007] In one embodiment, the mapping comprises adjusting a yaw angle between a direction where the user is facing and a direction where the at least one camera is facing. In another embodiment, a predetermined positioning arrangement is defined for a plurality of meeting participants, the mapped virtual position and orientation being in relation to the predetermined positioning configurations. In one embodiment, the predetermined positioning configuration is adjustable.
[0008] In one embodiment, the mapping comprises outputting a positioning arrangement shown on a display of the user. In another embodiment, the one or more processors are further configured to: identify facial landmarks within the received video output; and match the identified facial landmarks with a 3-dimensional (3D) model of the user, wherein the data associated with the received video output is responsive to an outcome of the matching.
[0009] In one embodiment, the one or more processors are further configured to: track the outcome of the matching over time; and adjust the data associated with the received video output responsive to an outcome of the tracking. In another embodiment, the 3D model is unique to the user.
[0010] In one embodiment, the system further comprises a plurality of terminals, each associated with a respective participant, wherein each of the plurality of terminals receives data associated with the received video output and the mapped positions and orientations.
[0011] In one embodiment, the one or more processors are further configured to estimate the pose of the user within the video output, the mapping being responsive to the estimated pose. In another embodiment, the system further comprises at least one depth sensor, wherein the mapping is responsive to an output of the at least one depth sensor.
[0012] In one embodiment, the at least one camera comprises a plurality of cameras, wherein the one or more processors are further arranged to: fuse an image of the video output of a first of the plurality of cameras with an image of the video output of a second of the plurality of cameras; match features of the image of the first camera to features of the image of the second camera to generate a depth map (DM) or point cloud (PCL); fuse the generated DM or PCL with the output of the at least one depth sensor, and wherein the output data associated with the received video is responsive to the fused images and the fused DM or PCL.
[0013] In one embodiment, the at least one camera comprises a plurality of cameras, displaced from each other by respective predetermined distances, and wherein the outputs of the cameras are synchronized with each other. In another embodiment, the one or more processors are further configured to: determine, for each of the plurality of cameras, an angle between the user and the respective camera; and responsive to the determined angles, select one of the plurality of cameras, wherein the mapping is responsive to the selected one of the plurality of cameras.
[0014] In one embodiment, the plurality of cameras comprises at least 3 cameras. In another embodiment, the at least three cameras comprise three lower cameras and one upper camera, the three lower cameras being displaced horizontally from each other and the upper camera being displaced vertically above the three lower cameras. [0015] In one embodiment, the one or more processors are further arranged to define a predetermined range in relation to the at least one within the received images, the output data associated with the received video generated from image data within the predetermined range and not from image data outside the predetermined range. [0016] In one independent embodiment, a camera capture and communications method is provided, the method comprising: receiving a video output of each of at least one camera; for each of a plurality of predetermined temporal points, mapping a position and orientation of a user imaged by the at least one camera to a virtual position and orientation; and outputting data associated with the received video output and the mapped positions and orientations. [0017] In one embodiment, the mapping comprises adjusting a yaw angle between a direction where the user is facing and a direction where the at least one camera is facing. In another embodiment, a predetermined positioning arrangement is defined for a plurality of meeting participants, the mapped virtual position and orientation being in relation to the predetermined positioning configurations. [0018] In one embodiment, the predetermined positioning configuration is adjustable. In another embodiment, the mapping comprises outputting a positioning arrangement shown on a display of the user. In another embodiment, the method further comprises: identifying facial landmarks within the received video output; and matching the identified facial landmarks with a 3-dimensional (3D) model of the user, wherein the data associated with the received video output is responsive to an outcome of the matching.
[0019] In another embodiment, the method further comprises: tracking the outcome of the matching over time; and adjusting the data associated with the received video output responsive to an outcome of the tracking. In another embodiment, the 3D model is unique to the user.
[0020] In one embodiment, the method further comprises, for each of a plurality of terminals associated with a respective participant, receiving at the respective terminal the data associated with the received video output and the mapped positions and orientations.
[0021] In another embodiment, the method further comprises estimating the pose of the user within the video output, the mapping being responsive to the estimated pose. [0022] In one embodiment, the method further comprises receiving an output of at least one depth sensor, wherein the mapping is responsive to the received output of the at least one depth sensor.
[0023] In another embodiment, the at least one camera comprises a plurality of cameras, wherein the method further comprises: fusing an image of the video output of a first of the plurality of cameras with an image of the video output of a second of the plurality of cameras; matching features of the image of the first camera to features of the image of the second camera to generate a depth map (DM) or point cloud (PCL); fusing the generated DM or PCL with the output of the at least one depth sensor, and wherein the output data associated with the received video is responsive to the fused images and the fused DM or PCL.
[0024] In one embodiment, the at least one camera comprises a plurality of cameras, displaced from each other by respective predetermined distances, and wherein the outputs of the cameras are synchronized with each other.
[0025] In another embodiment, the method further comprises: determining, for each of the plurality of cameras, an angle between the user and the respective camera; and responsive to the determined angles, selecting one of the plurality of cameras, wherein the mapping is responsive to the selected one of the plurality of cameras.
[0026] In one embodiment, the plurality of cameras comprises at least 3 cameras. In another embodiment, the at least three cameras comprise three lower cameras and one upper camera, the three lower cameras being displaced horizontally from each other and the upper camera being displaced vertically above the three lower cameras.
[0027] In one embodiment, the method further comprises defining a predetermined range in relation to the at least one within the received images, the output data associated with the received video generated from image data within the predetermined range and not from image data outside the predetermined range.
[0028] Additional features and advantages of the invention will become apparent from the following drawings and description.
[0029] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. In case of conflict, the patent specification, including definitions, governs. As used herein, the articles "a" and "an" mean "at least one" or "one or more" unless the context clearly dictates otherwise. As utilized herein, “and/or” means any one or more of the items in the list joined by “and/or”. As an example, “x and/or y” means any element of the three-element set {(x), (y), (x, y)}. In other words, “x and/or y” means “x, y or both of x and y”. As another example, “x, y, and/or z” means any element of the seven-element set {(x), (y), (z), (x, y), (x, z), (y, z), (x, y, z)}.
[0030] Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
[0031] In addition, use of the “a” or “an” are employed to describe elements and components of embodiments of the instant inventive concepts. This is done merely for convenience and to give a general sense of the inventive concepts, and “a” and “an” are intended to include one or at least one and the singular also includes the plural unless it is obvious that it is meant otherwise.
[0032] As used herein, the term "about", when referring to a measurable value such as an amount, a temporal duration, and the like, is meant to encompass variations of +/-10%, more preferably +/-5%, even more preferably +/-1%, and still more preferably +/-0.1% from the specified value, as such variations are appropriate to perform the disclosed devices and/or methods.
[0033] The following embodiments and aspects thereof are described and illustrated in conjunction with systems, tools and methods which are meant to be exemplary and illustrative, but not limiting in scope. In various embodiments, one or more of the above-described problems have been reduced or eliminated, while other embodiments are directed to other advantages or improvements.
BRIEF DESCRIPTION OF DRAWINGS
[0034] For a better understanding of the invention and to show how the same may be carried into effect, reference will now be made, purely by way of example, to the accompanying drawings in which like numerals designate corresponding sections or elements throughout. [0035] With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how several forms of the invention may be embodied in practice. In the accompanying drawings:
[0036] FIG. 1A illustrates a prior art method of conversing with multiple people; [0037] FIGs. 1B - 1C illustrate various configurations for displaying 3D representations of participants of a conversation, in accordance with some embodiments;
[0038] FIGs. 2A - 2B illustrate the variation of configurations and orientations of four participants in a virtual conversation from the point of view of a single user, in accordance with some embodiments; [0039] FIG. 3 is a top down view of a user situated in front of a single camera, looking at the depiction of another participant appearing off-axis from the camera, in accordance with some embodiments;
[0040] FIG. 4 illustrates an overhead schematic of a communications system utilizing three (or more) cameras, consisting of a user being captured, a capture system, and a representation of other users, in accordance with some embodiments;
[0041] FIGs. 5A - 5C demonstrate the conceptual advantage of utilizing more than a single imaging sensor in the described camera system, in accordance with some embodiments; [0042] FIG. 6 illustrates the relation between the components of the 3D camera and the other participants as a function of fields of view of the user and camera sensors, in accordance with some embodiments;
[0043] FIGs. 7 A - 7H illustrate the generalized mapping from symmetric distributions of participants to a tighter placement of participants, from the point for view of the user, for a number of configurations;
[0044] FIG. 8 illustrates a flow chart of an algorithm for positioning a user in a 3D environment given data from a single camera, in accordance with some embodiments;
[0045] FIG. 9 describes the relation between a user, the cameras, and intermediary points between cameras as a function of the user’s position, in accordance with some embodiments;
[0046] FIG. 10 illustrates a block diagram of one embodiment of a communications system employing a discrete selection of camera angles;
[0047] FIG. 11 is a depiction of a set of algorithms used to match 3D data with an existing model, in accordance with some embodiments; [0048] FIG. 12 displays a relation between a 3D camera of this disclosure and the algorithms of FIG. 11 ;
[0049] FIG. 13 illustrates a general block diagram of the different components in the processing pipeline of the system, including possible segregation of components into different computational systems along the pipeline, in accordance with some embodiments; [0050] FIG. 14 illustrates a generalized schematic of 3D webcam comprising multiple sensors, in accordance with some embodiments;
[0051] FIG. 15 illustrates a more detailed setup of a 3D webcam comprising multiple sensors, in accordance with some embodiments;
[0052] FIG. 16 illustrates a schematic description of an algorithm for finding stereo correspondences between two imaging sensors, as well as a method of improving the algorithm based on other data, in accordance with some embodiments; [0053] FIG. 17 illustrates a block diagram of a method of finding depth point, in accordance with some embodiments; and
[0054] FIG. 18 illustrates a block diagram of a communication system connecting multiple participants, in relation to a central server, in accordance with some embodiments.
DETAILED DESCRIPTION OF CERTAIN EMBODIMENTS
[0055] In the following description, various aspects of the disclosure will be described. For the purpose of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the different aspects of the disclosure. However, it will also be apparent to one skilled in the art that the disclosure may be practiced without specific details being presented herein. Furthermore, well-known features may be omitted or simplified in order not to obscure the disclosure. In the figures, like reference numerals refer to like parts throughout. In order to avoid undue clutter from having too many reference numbers and lead lines on a particular drawing, some components will be introduced via one or more drawings and not explicitly identified in every subsequent drawing that contains that component.
[0056] Adding a third dimension can alleviate many of the issues in a VC session: by adding spatial awareness to a VC, one can solve many of the inadequacies directly, as well as removing the issue of distracting backgrounds and adding more realism to a conversation. These solutions can be generalized to non- VC modalities, such as in online environments (known as the “metaverse”).
[0057] The solutions described here enable users to have communications in a 3D environment using 3 tiers of hardware: using an existing webcam, a trio (or more) of webcams, or an embedded camera system. Each of these solutions has progressively better capabilities, which thereby enable a better emulation of areal-life conversation. The focus of these solutions is on the capture and presentation of people in a 3D environment; this can then be presented on either a 2D screen, 3D screen, or 3D headset. In one embodiment, the 3D headset can include either virtual- or augmented-reality, or any mixture of them, without any limitations on the technology described here. The use of the word “display” will be used to connote all these modalities. [0058] Part of the disclosure refers to specific methods of spatially distributing other users within a 3D environment. There is a spatial relation between the location and position of the user and the other participants [the use of “user” will refer to the “main” user of the system, described from their point of view, with all other users being described as “participants”, despite them also being symmetrical users in the system]. This set of disclosures is targeted at a use case where a user is sitting or standing in front of a capture device (camera) and monitor (or headset), and looking at the other participants, as is typical in the case of most communications done online today. The term "capture device", as used herein is meant to include any type of device that captures images. What is important is the relationship between the user and the camera, even if the user moves the camera itself, such as with a mobile device, and thus the spatial relationship between the user-camera and the other participants. The general configuration to be emulated is that of multiple users seated around a round table, with the ability to turn towards one another. It can be noted that in this simple configuration, one can limit the degrees-of-freedom (DOF) to primarily the ability to rotate one’s head/body to the left/right (“yaw”) and some translational movement. If limiting the scope of the discussion (but not the disclosure) to people seated around a table, then one can neglect the translation and focus primarily on the ability to look to the sides only. This naturally lends itself to more simplified cylindrical coordinate system descriptions of each participant.
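To ground the cylindrical-coordinate description above, the short sketch below places N participants at evenly spaced azimuths around a virtual round table and reports each seat as (radius, azimuth). It is only an illustrative convention: the radius, angular origin and even spacing are assumptions rather than parameters of this disclosure.

```python
import math

def round_table_seats(num_participants: int, radius_m: float = 1.2):
    """Return (radius, azimuth_deg) per participant, evenly spaced around a table.

    Seat 0 (the user) is placed at azimuth 0; azimuths grow counter-clockwise.
    """
    step = 360.0 / num_participants
    return [(radius_m, math.fmod(i * step, 360.0)) for i in range(num_participants)]

# Example: four participants -> seats at 0, 90, 180 and 270 degrees.
print(round_table_seats(4))
```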
[0059] The first embodiment describes the emulation of a conversation utilizing only a single webcam. The solution describes ways of capturing and displaying the user to the other participants in a way that enables the user to turn and address other participants. In order to do this, the user is allowed to move their head to turn to other users, however, in order to always have the best “view” of the user, they are forced to keep their head primarily facing the camera. This is facilitated by employing head tracking technologies to rotate the viewpoint of the user on a display, thereby ensuring that the user’s head is always centered on the camera. This solution can work regardless of the physical size of the display, but has many drawbacks to be described.
[0060] The second embodiment improves upon the first by increasing the number of virtual viewpoints by physically increasing the number of cameras used. At the bare minimum, a trio of cameras could be used to capture the user from directly in front, and to the two sides, at some given, described, angle. This enables a digital or discrete transition of the views to other participants to be emulated, and also improves the user interface by allowing the user to physically turn their head/body towards another participant, without the requirement of always being centered on the central camera.
[0061] The third embodiment includes methods of creating 3D representations of the user (here, and in all other descriptions, limited to 1, for simplicity of description, but without a loss of generality for multiple users using the same system simultaneously). This third embodiment includes descriptions of the algorithms and systems required to enable such a solution. In addition, as the system requirements for such an embodiment are much higher than the first two, a description is given in this disclosure of an embedded camera system that would have preferable features for such an embodiment.
[0062] In all the embodiments, the primary focus is on the exemplary use case of a user sitting at a desk and interacting with others in a virtual environment that emulates a physical meeting place. Since the physical Field Of View (FOV) of the user may be different from the virtual one, as a function of the display being used, the distribution of the other participants in the virtual environment may be preferred to be different, and customizable by the user. However, as the spatial relation between the users should be maintained in order to enable the correct position and orientation of each user vis-a-vis each other, a form of mapping is described to enable this, such that physical rotations by the user are correctly mapped to their virtual rotations as seen by the other users, at each of a plurality of temporal points (i.e. the mapping is performed repeatedly during the conversation).
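One way to picture the repeated mapping described in paragraph [0062] is as a scaling between the physical yaw range the camera can usefully capture and the angular span the participants occupy in the virtual layout. The sketch below is a deliberately simplified, hypothetical mapping evaluated at each temporal point (frame); the ranges and the linear form are assumptions, not the specific mapping of this disclosure.

```python
def map_physical_to_virtual_yaw(physical_yaw_deg: float,
                                physical_range_deg: float = 30.0,
                                virtual_range_deg: float = 90.0) -> float:
    """Linearly map the user's measured head yaw to a virtual yaw.

    physical_range_deg: half-range of head rotation comfortably captured by the
                        camera(s) while the user still faces the display.
    virtual_range_deg:  half-range spanned by the other participants in the
                        virtual environment, as seen from the user's seat.
    """
    clamped = max(-physical_range_deg, min(physical_range_deg, physical_yaw_deg))
    return clamped * (virtual_range_deg / physical_range_deg)

# Example: turning 15 deg toward the screen edge maps to facing a participant
# seated 45 deg away in the virtual layout.
print(map_physical_to_virtual_yaw(15.0))
```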
[0063] The general disclosure refers to a complete system of creating and transmitting 3D representations of a user to one or more other participants in an online conversation, known as a VC. The system consists of software to control and interface with the system, hardware to capture the user, algorithms used to create and process the 3D data, and a transmission pipeline to convey the data to other users. Furthermore, there is additional software included on a server level, used to arrange and manage the different users into a combined environment viewable by each user.
[0064] Figure 1A begins by depicting a common arrangement in a multi-person call today. It is simplified for description purposes, and it generalizes for any conversation with more than 2 participants in total. In the figure, a display 100 has a webcam on top 101 that can view the user. The webcam is simplified here to a single camera device attached to the display, but can also be embedded within the frame of the device, without a loss of generality. Furthermore, it can be placed at the bottom of the display or on an edge, as is found in many variations on devices today; however they are all similar as consisting of a single webcam.
[0065] In a typical arrangement of participants, the software program 102 will place each speaker in some form of 2D mosaic. With 2 participants and a user, this can be with the user on top 103, and the other two participants to the left 104 and right 105. Some software systems have different configurations, such as highlighting the speaker, meaning that the view is constantly changing. Having the user as one of the sub-screens 103 is problematic as it is known to distract the user and potentially induce fatigue. A thumbnail at most is all that is really needed to ensure that the viewer’s camera is on and working correctly. As can be seen in FIG. 1A, there is no spatial relation between the participants, and each participant may have a different configuration of screens on their display. Even if the order of sub-screens is the same on each, there is still no way to “look” at a specific participant, and have the others know that you are “looking” at them, as it is in a 2D environment.
[0066] FIG. IB illustrates an initial solution to this, within a system 10, with 3 participants as well as the user (not shown). In system 10 there is still a 2D display 106 (though it can be a 3D display or headset as well), with a single webcam 107. There is a virtual table 108 depicted with the participants sitting around it, with only a portion of the table 108 within the FOV of the viewer (as defined by what they can see onscreen). The participant on the left 109 is angled slightly, signifying that they are looking at the user (this is shown from the viewpoint of the user), and the participant on “top” 110 is furthest away from the user. So long as this same relative positioning is maintained between the participants, then everyone can know where they are in relation to the others. If the participants are able to rotate in 3D space, then it is clear who each participant is looking at. In this case (though hard to depict in a 2D schematic), it appears that both other participants are looking at the user.
[0067] In this example, the FOV of the user is essentially defined by the first-person view they have of the 3D environment as it appears on the 2D display 106. Here, the user can see two other participants simultaneously thanks to a large FOV; however, if there were more participants, not all of them would appear onscreen at the same time, with some appearing offscreen, beyond the periphery defined by the display's FOV setting. This FOV setting can be defined in 3D rendering software, and can even include FOVs beyond the natural abilities of a person.

[0068] FIG. 1C illustrates another embodiment of a system 11, consisting of an ultrawide display 111. In system 11, the software 112 can recognize the larger viewing area and enable a higher effective FOV, showing most of the virtual table 113. For this variation on the embodiment, it would be preferable to have multiple capture devices 114 in order to enable better coverage, as described in this disclosure. The multiple camera angles emulate the required viewpoints for the other participants 115. In this depiction of the embodiment, the higher effective FOV allows the user to see the spatial relationship between all the other participants, with each one portrayed as looking in a different direction. This allows the user to physically turn their head toward any portion of the screen and focus on different participants, with some camera always providing coverage of the user.
[0069] The orientation of the participants and user with respect to each other is one of the aspects of this disclosure. The correspondence between them may also be something that changes per participant. In FIG. 2A, there is a top-down depiction of a generalized case of 4 participants meeting, with the user depicted at the bottom 200. The participants are situated around a round table 201 such that an even distribution of the participants around the virtual table will be described as a square (or rhombus) 202. In this depiction, each participant is shown looking towards the center of the table, such that the user’s FOV 203 is directed at the participant directly across from them 204. The other participants to the right 205 and left 206 are not seen by the viewer. The FOV of the user 203 is most easily described in real-world terms as being the FOV of a person (which can be in the range of 90 degrees horizontally), or in terms of the FOV created by a headset (which is typically lower). However, it can also be described in terms of a virtual camera positioned at the user, and depicting a 3D scene, as shown in FIG. IB. In this case, the FOV can be changed in software to expand out the view, and it can be displayed on a 2D screen. However, larger FOVs can result in distortions of the image in 2D on a smaller display.
[0070] In order to see all the other participants simultaneously, it may be preferable to have the other participants appear within the FOV of the user, as portrayed in FIG. 2B. Here, the user’s 207 FOV 208 includes the other three participants, since they are no longer evenly distributed around the virtual table 209. Instead, they are now positioned in a “kite” or diamond geometry 210, with respect to the user 207. This can be done in a virtual environment, so long as these other users aren’t physically in the same space, where their spatial relation would be defined and immutable. Although the other participants are virtually positioned, they still may move from their defined position, such that they move closer (e.g. 213) or further (e.g. 211) from the user 207.
[0071] For a single webcam system, which is the first embodiment of the disclosure, the general problem can be described with reference to FIG. 3. In this simple case, a user 300 is positioned in front of a camera device 301, and is communicating with at least one other participant - here only a single one is shown 302. The other participant may appear on a 2D display, or appear in a virtual environment in a headset (a "hologram"). In this generalized case, the other participant appears at an offset distance, x 303, from the webcam 301. If the viewer then looks at that participant (here signified by the triangular nose section 304), the angle between the optical axis and the participant can be seen as a 2D "yaw" angle 305. In this case, it can be understood that the problem is that the user is looking directly at the other participant 302, but the view that the other participant will receive will be that of the webcam 301, which shows the user looking to the right. As such, there will be a discrepancy between what the participant sees (an off-axis view) and what they should be seeing (the user looking right at them). This disclosure presents a solution to this issue by using head-tracking to re-position the participant 302 towards the center of the display such that the user 300 is then looking directly at the webcam, thereby realigning the expected viewpoint.
[0072] The goal of this first embodiment is to allow the participants that are being addressed to always have the user looking at them. To do this, the participants are shifted towards the webcam 301, simultaneously zeroing the yaw angle 305. This will allow the user to use their neck/head to address participants, but will then force them to move their head back to the central position to continue. This can be considered an "always centered" configuration of head tracking.
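By way of illustration only, a minimal sketch of this "always centered" re-positioning is given below; the function names, units, and the gain parameter are assumptions for the example, and the yaw angle and distance d are taken to be already available from a head tracker.

```python
import math

def yaw_to_participant(x_offset_m: float, d_m: float) -> float:
    """Yaw angle (radians) the user must turn through to look at a participant
    drawn x_offset_m to the side of the webcam, when seated d_m away (FIG. 3)."""
    return math.atan2(x_offset_m, d_m)

def recentered_offset(current_offset_m: float, measured_yaw_rad: float,
                      d_m: float, gain: float = 1.0) -> float:
    """Shift the addressed participant towards the webcam by the amount that
    corresponds to the measured head yaw, so that keeping the gaze on that
    participant drives the yaw back to zero.  The gain (an assumption, not in
    the disclosure) lets the shift be applied gradually over several frames."""
    return current_offset_m - gain * d_m * math.tan(measured_yaw_rad)
```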
[0073] The angle that is to be re-adjusted is also a function of the distance to the camera, d 306, as well as the potential translation of the user, t 307, to the sides. For simplicity, these can be ignored in later descriptions; however, they are to be included in the calculations of the corrected angles in subsequent descriptions of the disclosure. In general, the translational displacement (d and t, as well as potentially a third axis up/down, neglected here for simplicity of description) can be deduced from head/body tracking techniques, such that the corrections can occur every frame.

[0074] The second and third embodiments solve the problem of requiring the user to recenter their head/body towards the central webcam by increasing the number of cameras. These cameras can be either part of a single device (described below), or disparate webcams connected to a single computational unit. The general configuration can be described as follows in FIG. 4. The user 400 is positioned in front of a device 401, which is primarily the 3D webcam device described in the 3rd embodiment of this disclosure, but may also include other computational devices, either attached or not, such as personal computers, mobile phones, or display terminals. The 3D webcam 401 comprises at least three color sensors; three are depicted here for simplification purposes, in order to show a top-down view, and other configurations will be described below. The 3D webcam 401 will have a central camera 402 positioned directly in front of the user 400, as well as a right (top) 403 and left (bottom) 404 camera. These cameras are depicted as being on, or near, the same optical axis as the central camera 402 (meaning, on the same z-axis plane, with the z-axis pointing from the camera to the user); however, other embodiments can offset them towards the user, without a loss of generality. Both side cameras (403 and 404) are depicted with a rotation towards the central axis of the central camera 402, to overlap the fields of view of the three cameras. An additional sensor 405 is schematically shown here to be adjacent to the central camera 402. This sensor 405 can be a range or depth sensor, such as a Time-of-Flight (ToF) sensor. A ToF sensor can be a single pixel, or even a full matrix of sensors providing a direct depthmap (DM) within its Field Of View (FOV). Placing this sensor 405 adjacent to the central camera 402 allows the calibrated matching of the depth data with the color data. Alternative configurations could include depth sensors adjacent to the off-axis cameras, as well as different technologies to detect range, such as ultrasound and structured light sensors. Specifically, the use of ToF is beneficial in that it is simpler to align and co-locate with the central sensor 402, as well as potentially being of lower cost than other technologies.
[0075] The user 400 will typically sit at a distance 406 of 0.5 - 1.5 m from the camera device 401. This distance 406 is up to each user; however, these distances are standard for most VC applications. For standard 2D VC, the background of the captured image may sometimes be obtrusive. Correcting this can employ methods such as foreground detection and segmentation. In particular, it is a known advantage of 3D cameras that they can automatically remove the background by simply discarding all information outside of a certain range. Therefore, in a 3D webcam 401 such as the one disclosed here (which shares those features with a regular 3D camera), a range can be defined (a priori or by the user) below which no information will be taken, e.g. range 407, or beyond which no information will be taken, e.g. range 408. For a 3D camera or the disclosed webcam 401, if the 3D data is calibrated to the RGB data, then the pixels from the RGB camera can be correlated to match those within the desired range from the depth camera. Regardless of the type of display used, the other participants 409 will appear to the user to be "behind" the device 401.
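As a simple illustration of this range-based background removal (ranges 407 and 408), the following sketch assumes a depth map already registered pixel-for-pixel to the color image; the numeric range limits are examples only.

```python
import numpy as np

def remove_background_by_range(color, depth_m, near_m=0.3, far_m=1.5):
    """Keep only color pixels whose calibrated depth lies inside [near_m, far_m];
    everything closer than range 407 or farther than range 408 is discarded as
    background.  `color` is an (H, W, 3) image and `depth_m` an (H, W) depth map
    in metres, registered to the color image."""
    mask = (depth_m >= near_m) & (depth_m <= far_m)
    foreground = np.zeros_like(color)
    foreground[mask] = color[mask]
    return foreground, mask
```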
[0076] In the depiction of FIG. 4, there are two virtual participants, and therefore when the user 400 speaks to their virtual depictions (avatars), the user 400 will naturally move their head to the left or right so as to look at the participant while speaking, as is natural for conversations. A single camera, however, will have difficulty capturing the full facial features (which are the most critical part of any VC), due to the off-axis viewing of the camera when the user's head is turned towards a virtual participant to the left or right. Furthermore, even on a 2D display, the mismatch between the location of the participant's face on the screen and the 2D webcam typically causes an unnatural loss of eye contact, making the user appear to be looking at the other participant's chest. Therefore, while it is advantageous to be able to correct for gaze artificially using 3D modeling (as will also be done within this disclosure), a more complete solution will also account for the changes in angle by utilizing side cameras, as in this disclosure.
[0077] In FIGs. 5A - 5C, the advantages of the off-axis cameras are schematically depicted. In FIG. 5A, the user 500 is depicted having an exaggerated "nose" 501 used here to signify the direction that the user is looking. The camera system has been simplified to include a central camera 502, right (top) camera 503, left (bottom) camera 504, and an optional 3D sensor 505. In this depiction, the user 500 is looking straight ahead, and therefore the user's face is well captured within the FOV of the central camera 506. However, as seen in FIG. 5B, when the user 507 is looking to the right, the ideal capture of the facial region is with the right (top) camera 508, such that the FOV 509 of camera 508 contains the entire front of the face of the user. In this case, the other cameras will have an unreliable view of the user's face - for example, the left (bottom) camera 510 will be unable to see the right eye of the user at all under most angles.
[0078] Another situation that the user may find themselves in is when they are looking between cameras, as in FIG. 5C. Here, the user 511 is looking between the central 512 and right (top) 513 cameras, such that their FOVs (514 and 515, respectively) will both capture the user's face. This can occur if the other participants are placed in a virtual environment between the FOVs of the cameras, or if the user is simply "staring into space". Regardless, this is a standard situation in which the information from the two (or more) sensors is, in one embodiment, fused to obtain a better (intermediate) result, such as correcting for gaze.
[0079] The relation between the position of the other participants and the user will be a function of the camera device, or individual cameras themselves: their size and rotation angles, the distance of the viewer to the cameras, the number of participants, and the preferred positioning of the participants by the user (software controlled). This will be described in a general fashion in FIG. 6 in relation to the cameras themselves. Consider a user 600 who is located in front of a simplified 3D camera setup, with a central camera 601 and a left 602 and right 603 camera (simplified to only these three components, for clarity purposes), and at a distance 604, denoted D, from the central camera 601. The cameras are here shown along the same axis (for simplicity), with a distance 605, denoted L, between the side cameras (602 and 603) and the central one 601. There are two sets of FOVs that are relevant to this discussion: the FOV 606 of the user 600 and the FOV 607 of the camera(s). The camera FOVs are fixed by the device, whereas the user can rotate their FOV (rotations along one axis are the only ones of interest for this discussion). As described above, the side cameras (602 and 603) are better at capturing most of the user's face even when the user 600 turns their face. However, after a certain angle, half of the face will be occluded. This angle can also be defined by a certain distance 608, denoted S, to the right and left (assuming symmetry) of the side cameras (602 and 603), and related to D and L by the tan() function. The further back a user sits/stands from the camera, the wider the effective region (S+L) that can be covered, however at the expense of increasing the distance to the other users (lowering the effective resolution of the 3D representations).
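The geometry of FIG. 6 can be sketched as follows; the 45-degree face-occlusion limit used here is an assumed example (the disclosure only states that occlusion sets in after a certain angle), and units are arbitrary but consistent.

```python
import math

def side_camera_coverage(D: float, L: float, occlusion_half_angle_deg: float = 45.0):
    """Return (angle_to_side_camera_deg, S), where S is the extra lateral
    distance 608 beyond a side camera that still keeps most of the face visible,
    so the effective horizontal coverage to one side is L + S = D * tan(limit).

    D -- distance 604 from the user to the central camera
    L -- baseline 605 between a side camera and the central camera
    """
    angle_to_side_camera = math.degrees(math.atan(L / D))
    S = D * math.tan(math.radians(occlusion_half_angle_deg)) - L
    return angle_to_side_camera, max(S, 0.0)

# Example: at D = 0.9 m and L = 0.2 m, the side cameras sit ~12.5 degrees off
# center, and the covered half-width (S + L) grows with D, as noted in the text.
```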
[0080] To describe this relationship further, on a qualitative level, an example with 2 other participants will here be described. In the case of 2 other participants, the ideal spatial positioning is to place the participants at the corners of an equilateral triangle. The angle between them is 60 degrees, such that the user 600 would need to rotate their FOV 30 degrees to the right to see the right participant 609. In this case, the left participant 610 would be 60 degrees to the left of the user's FOV center, and thus nearly out of range (the image here is drawn with exaggeration, but also with consideration of the smaller FOV of headsets). The angle at which the user is looking at the rightmost camera 603 may or may not be equal to the angle to the right participant 609, since it is a function of D and L (closer to the camera means a higher angle). As the goal is to have the users feel that they are looking at one another directly, the 3D representation with in-filling of any occluded regions would account for this, and the added camera in the general direction that the user is looking will provide a far superior result than if there were only a single camera 601 in the center, regardless of how extensively the user 600 is modeled.
[0081] As described in relation to the 4 participants in FIG. 2, there is also a configuration where the user may want to see the participants closer together. In this case, the right and left virtual participants (611, 612 respectively) will be shifted towards the center 613 by a given amount. This amount will be discussed below, but the figure demonstrates two points (even though the figure is simplified): the first is that the locations of the shifted participants may no longer be in the direct line of sight of the cameras (it depends upon the number of users as well); the second is that the relative size of (and distance to) the virtual participants will dictate how "tight", or how closely packed, they can be placed in relation to one another. The issues of rotation angle between the different participants still hold for this configuration, as described above.
[0082] Irrespective of the number of cameras used in each embodiment, the disclosure here includes the ability to redefine the virtual positioning of the other participants. This can be done per-user (meaning everyone has their own preferred positioning), or with the positioning defined by one of the users (such as the meeting host) being inherited by all other participants. Adjusting the virtual positions of the participants will then require further adjustments to the perceived angle at which each participant is seen by the others, as will be described below, since each participant will be rotated in different directions.
[0083] In FIGs. 7 A - 7H, multiple examples are qualitatively shown regarding the different configurations possible for 3-6 participants. The case of 2 participants is trivial, and more than 6 becomes cumbersome to draw, but is not a limitation of the system. In the most basic case, illustrated in FIG. 7A, 3 participants would be sitting around a virtual equilateral triangle 700, with an equal distance between them, as defined by a circle 701 encompassing triangle 700. Each would have equivalent views, without angular or rotation distortions. While each user is equivalent, to describe the changes that could be made per-user, in this discussion regarding FIG. 7A, the “user” 702 will be located at the bottom of the triangle 700.
[0084] If the user prefers a tighter grouping of the other participants (pre-defined, or changed at will), then the other two users would be shifted towards each other, as described in relation to FIG. 6. For the case of a triangle (3 participants), as illustrated in FIG. 7B, the triangle 703 would become more acute, in terms of the viewpoint of the user 704. Ideally, this triangle would be isosceles, such that the other two participants, located at the vertices of the triangle, will be at the same distance 705 to the user 704. However, as described above, this need not be the case, although changing the distance will further increase the potential distortions to the rotations.
[0085] Assuming an embodiment consisting of at least 3 cameras, which are symmetrically distributed along a horizontal line (for simplicity), and placed in front of a user (e.g. on top of a laptop monitor), the user can still move closer to or farther from the camera, in an analog fashion. Since the standard seating distance between a person and their camera is between 0.5 - 1 m, one can find the locations of the virtual participants (assuming AR/VR headsets) with respect to the center of the camera. For the equilateral triangle 700, these locations are set; however, for the "tighter" configuration triangle 703, the angle can be adjusted. The ideal angles to adjust to would take into account the locations of the cameras. In one embodiment, the cameras can be located at a distance of 5-30 cm from each other (another variable). For the following discussion, in order to list some of the possibilities described within this disclosure, yet not to limit the general solution, a distance of 20 cm between the cameras will be considered, as well as a distance of 60 cm or 90 cm between the user and the camera.
[0086] Table 1 includes the case of 3 participants, showing an example of the location of the virtual participants (P1 and P2) as a function of their distance from the central camera (physical distance L+S in FIG. 6). The angle is set at 30 degrees (from center) for the symmetric positions, giving a distance of ~35 cm to either side for the locations of the virtual avatars when seated at a distance of 60 cm from the central camera.
Table 1
[Table 1 appears as an image in the original publication; it lists, for participants P1 and P2, the angles from center and the resulting lateral distances from the central camera in the symmetric and tight configurations, at seating distances of 60 cm and 90 cm.]
[0087] The angle for the tight configuration need not be unchangeable. In Table 1, the angle is defined to be 18.5 degrees, which places the virtual participants at a distance of 20 cm off center when seated at a 60 cm distance. The reason to choose this angle is therefore to ideally place the avatars in the direct line of sight to the side cameras, which in this example embodiment are likewise 20 cm from the center. Note, though, that this distance would change when the user moves their head forwards or backwards. Nevertheless, the goal is to obtain as ideal a FOV coverage as possible of the front of the face, making this a near ideal angle. If the user moves farther away, to 90 cm, then the avatars will be about 10 cm away from the nearest camera, which would still provide relatively good coverage (in particular when compared to the case of a single camera, which would be completely off-axis to the viewer). The angle could therefore be set by the user/software, or it can be deduced by first measuring the distance to the user, for example in a setup stage prior to the conversation, and then adjusted to ensure that the camera angle best fits the user's distance.

[0088] In addition to 3 participants, other numbers of participants can be described in the same way. As discussed in relation to FIG. 2, the case of 4 participants would ideally be in the orientation of a square virtual table 706 (as illustrated in FIG. 7C), with the tighter form being a "kite" shape 707 (with respect to the user placed at the bottom), as illustrated in FIG. 7D. In Table 2, the angles, distances and locations of examples of both the symmetric and tight configurations are shown.
Table 2
Symmetric positions              P1       P2      P3
Angle from center                -45      0       45
Distance from center @ 60 cm     -60 cm   0 cm    60 cm
Distance from center @ 90 cm     -90 cm   0 cm    90 cm

Tight positions                  P1       P2      P3
[The remaining rows of Table 2 (angles and distances for the tight configuration) appear only as an image in the original publication.]

[0089] The case of 4 participants in Table 2 is nearly identical to that of the triangular arrangement, since the third participant (P2 in Table 2) is simply placed directly across from the user, at an angle of "0", which is directly in the line of sight of the user and the central camera, regardless of the symmetric or tight configuration.
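The values in Tables 1-4 follow directly from the stated geometry; the short sketch below reproduces them (the even angular spread around the virtual table and the tight angle chosen to line up with a side camera are taken from the text; function names are illustrative).

```python
import math

def symmetric_angles(n_total: int):
    """Angles (degrees, from the user's center line) to the other participants
    when all n_total participants sit evenly around the virtual round table.
    For a regular polygon this reduces to 90 - 180*k/n_total, matching the
    30/45/54/60-degree values used in the text."""
    return [90.0 - 180.0 * k / n_total for k in range(1, n_total)]

def lateral_offset_cm(angle_deg: float, seat_distance_cm: float) -> float:
    """Lateral distance from the central camera at which a participant placed
    at angle_deg appears, for a user seated seat_distance_cm away."""
    return seat_distance_cm * math.tan(math.radians(angle_deg))

def tight_angle_deg(camera_offset_cm: float, seat_distance_cm: float) -> float:
    """Tight-configuration angle that places an avatar on the line of sight to a
    side camera mounted camera_offset_cm off center (about 18.5 degrees for
    20 cm and 60 cm, as in Table 1)."""
    return math.degrees(math.atan(camera_offset_cm / seat_distance_cm))

# Example, the 3-participant case of Table 1:
#   symmetric_angles(3)        -> [30.0, -30.0]
#   lateral_offset_cm(30, 60)  -> ~34.6 cm (the ~35 cm quoted in the text)
#   tight_angle_deg(20, 60)    -> ~18.4 degrees
```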
[0090] The case of 5 participants would place them at the vertices of a pentagon 708, as illustrated in FIG. 7E, with the tight configuration from the perspective of the user as a “gem” 709, as illustrated in FIG. 7F. Table 3 displays the approximate parameters for an example of this configuration.
Table 3
[Table 3 appears as an image in the original publication; it lists the angles and lateral distances for the symmetric (pentagon) and tight ("gem") configurations of 5 participants, analogous to Tables 1 and 2.]
[0091] It should be noted that the angles for the tight configuration here were once again chosen such that the virtual participants appear up to 10 cm away from the direct line of sight to one of the 3 cameras when seated at a distance of 60 cm from the 3D webcam. The outer participants (P1 and P4) would be 10 cm away from the outer cameras, whereas the inner participants would be between the cameras, and close to the side cameras (approximately 2 cm away). If the user moves backwards to 90 cm away from the camera, the outer participants, located ~15 cm away from the outer cameras, will begin to be far from ideal, and therefore the angles would be adjusted in the tight configuration to be closer. For the symmetric configuration, 120 cm away from the center would force the user to turn their head to see the outer participants, as they would begin to be outside the central FOV of the user (roughly 120 degrees total, whereas the half angle to the outer participants is ~54 degrees).
[0092] The case of 6 participants would symmetrically be in the form of a hexagon 710, as illustrated in FIG. 7G, and the tighter configuration also in the form of some gem-like polygon 711, as illustrated in FIG. 7H. In Table 4, the positions and angles of such an example are displayed.
Table 4
[Table 4 appears as an image in the original publication; it lists the angles and lateral distances for the symmetric (hexagon) and tight configurations of 6 participants, analogous to the previous tables.]
[0093] When there are 6 participants, one of the participants will always be directly in front of the user (as with any even number of participants). The angles of the tight configuration are here once again chosen to be similar to the triangular and square configurations.
[0094] The limitation on how many users in total can be shown using this paradigm, without beginning to place users behind users, is that at a certain number (or angle) the holograms will overlap. The solution would then be to shrink their size, which is equivalent to increasing the size of the table. This is entirely within the scope of this disclosure; however, it will not be drawn here beyond 6 participants, as the extension of the principles described here should be clear.
[0095] In addition to the configurations shown above where the participants are all placed on the edges of a circle encompassing the polygon of the virtual table (for 3 or more participants), it was described above that the participants could also be placed along a line. In the case of 4 or more participants, this would mean that the spatial locations of the participants would be along the perimeter of a triangle only, however, not on the outer encompassing circle (except the user and the outer participants on the vertices of the triangle). As described above, while this situation is achievable using this disclosure, it will also lead to further distortions in the sizes and angles of the participants with respect to each other. Some of these can be remedied algorithmically by taking into account the mapping of the angles of the tight configuration with the movements of the participants (such as to rotate them artificially more or less, depending upon their mapped distortion), or by simply changing their size with respect to the difference in virtual distance between the participants.
[0096] Having described the positioning of the cameras in either a single- or multi-camera configuration, as well as the positioning of the participants in a virtual setting, the algorithm for the first embodiment can be described. In the first embodiment, only a single camera is used, and the participants are rotated as a function of where they are looking, with the user being in an “always centered” face tracking configuration. In FIG. 8, a block diagram of an embodiment of this algorithm is depicted. For the case of a single camera, the process begins in step 800 with a query or setting of the camera parameters, such as the resolution and frame rate.
[0097] To make the virtual participants appear more consistent with their virtual environment, it is beneficial to remove the background of the user, as described in step 801. This can be done using a neural network (NN) and/or using depth data if using a camera with depth sensing capabilities. There are multiple known methods of removing backgrounds in addition to these two; however, these two are the most robust. This step can be optional in the case that the processing power of the user's device is not enough to handle this stage. For example, if the user is using a mobile device with limited capabilities to run a NN or similar algorithm, then they will appear in the virtual environment with their background as-is.

[0098] The next stage of the algorithm is to perform face detection and find facial landmarks, as described in step 802. There are once again a number of off-the-shelf solutions for this, some relying on computer vision algorithms, and others utilizing NNs. The reason to detect the head is that people will rotate their heads to address people. This stage can also include eye gaze detection to see exactly where the person is looking. However, while it may seem that the eye gaze is the most important factor for seeing who a person is addressing, it is actually the least telling signature of this. People will rotate their head and then body to address people, particularly if they are addressing them for longer than a few seconds.
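One off-the-shelf, NN-based option for the background removal of step 801 is sketched below; the use of the MediaPipe selfie-segmentation model is an illustrative assumption, and any person-segmentation network could be substituted.

```python
import cv2
import mediapipe as mp
import numpy as np

def remove_background_nn(frame_bgr, threshold=0.5):
    """Segment the person in a single frame and blank out everything else.
    The segmentation model returns a per-pixel foreground probability, which is
    thresholded into a binary mask (creating the model once per frame is kept
    here only for brevity; in practice it would be created once per session)."""
    with mp.solutions.selfie_segmentation.SelfieSegmentation(model_selection=1) as seg:
        result = seg.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    mask = result.segmentation_mask > threshold
    output = np.zeros_like(frame_bgr)
    output[mask] = frame_bgr[mask]
    return output, mask
```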
[0099] Facial landmark detection can be performed with a variety of solutions known in the art. While there are solutions providing hundreds or thousands of facial landmarks, the bare minimum required for this algorithm is at least 3 points that are not co-linear. Co-linear points could be the line between the eyes alone; a third point is needed to obtain 3D triangulation, for example the eyes and the nose tip. Typically, more points are found, with more points enabling higher precision and robustness in pose estimation, which is implemented next in step 803. In this stage, the translation and rotation vectors of the person's head (and potentially eyes) are calculated by an algorithm that finds these vectors as a function of the location of the user's camera. The algorithms to find these vectors are also known in the art, and can be found in the OpenCV open-source computer vision library, including variations of "pose and project" algorithms. These algorithms typically utilize the camera parameters from step 800, but can work with idealized parameters as well. Additionally, to use such algorithms, the 2D data points found in the landmark detection phase of step 802 are matched with a 3D model of the person. For the case of matching the landmarks to the head (or body), this can be done using a generic 3D model of the human head. This generic model can be customized or personalized to the user to obtain better accuracy. However, even a generic model with 3D points not well fit to the user will still obtain relatively good results. The output of this stage is a set of translation and rotation vectors (or matrices) (6DOF), given in the coordinate system of the camera.
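A minimal sketch of this pose-estimation step is given below, assuming OpenCV's solvePnP as the "pose and project" solver; the six landmark positions of the generic head model and the idealized camera parameters are illustrative values, not taken from the disclosure.

```python
import numpy as np
import cv2

# Generic 3D head model (model coordinates, arbitrary units); any non-collinear
# set of points matched to the detected 2D landmarks will do.
MODEL_POINTS = np.array([
    [0.0,    0.0,    0.0],   # nose tip
    [0.0,  -63.6,  -12.5],   # chin
    [-43.3,  32.7,  -26.0],  # left eye outer corner
    [43.3,   32.7,  -26.0],  # right eye outer corner
    [-28.9, -28.9,  -24.1],  # left mouth corner
    [28.9,  -28.9,  -24.1],  # right mouth corner
], dtype=np.float64)

def estimate_head_pose(image_points, frame_width, frame_height):
    """Recover the 6DOF head pose (rvec, tvec) in the camera coordinate system
    (step 803) from six 2D landmarks found in step 802, using an idealized
    pinhole camera when no calibration from step 800 is available."""
    focal = frame_width  # common approximation for an uncalibrated webcam
    camera_matrix = np.array([[focal, 0, frame_width / 2],
                              [0, focal, frame_height / 2],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))  # assume negligible lens distortion
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS,
                                  np.asarray(image_points, dtype=np.float64),
                                  camera_matrix, dist_coeffs,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    return rvec, tvec  # Rodrigues rotation vector and translation vector
```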
[00100] The next stage is to rotate the viewpoint of the user in the case of the first embodiment of a single camera, as described in step 804. This stage may be optional if the user prefers to rotate their screen manually (such as is done in video games using a user interface). The rotation of the screen can be simplified for this description to the yaw angle around the y-axis, although full 6DOF is also possible. In this case, the goal is to have the user always looking "forward" at the camera in order to always have their full face in view of the camera. Methods to rotate the viewpoint are highly dependent upon the software used to view the other participants; however, they are known in the art of both head tracking and gaming applications.
[00101] The rotation and translation vectors calculated in the previous stages were in the coordinate system of the camera, but what needs to be sent to the other participants in step 805 is the translation and rotation in the coordinate system of the joint environment that they all share. To do this, one can assume that the physical camera in front of the user acts as an anchor point for their virtual (2D/3D) representation: any rotation/translation of the user will be done relative to it, and the initial orientation of the user can be defined as looking at the center of the virtual table in the examples above. There need not be an actual table in this environment, so long as there is some form of predefined "center". Then, all rotations and translations are expressed relative to that anchor point. If the head tracking modality is not used, then the user can also navigate using an external interface such as a mouse, touchpad, keyboard, gamepad or joystick, as is done in the gaming world. In this case, the user will always be looking forward, which is advantageous, but this also requires that the user free up a hand or two to move, which may provide a less immersive experience. It is also less desired in the case of a headset, where there may not be any peripheral interface available.
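This change of coordinate system can be sketched as a simple rigid-body composition, with the user's physical camera serving as the anchor; the anchor pose arguments below (where that camera sits and faces in the shared world) are assumptions supplied by the server or meeting setup.

```python
import numpy as np
import cv2

def to_shared_environment(rvec_cam, tvec_cam, anchor_rotation, anchor_position):
    """Express a head pose estimated in the user's camera coordinate system in
    the shared environment (step 805).  anchor_rotation (3x3) and
    anchor_position (3,) give the pose of the user's physical camera in the
    shared world, initially oriented towards the virtual table's center."""
    R_cam, _ = cv2.Rodrigues(np.asarray(rvec_cam, dtype=np.float64))
    R_world = anchor_rotation @ R_cam                       # rotation relative to the anchor
    t_world = anchor_rotation @ np.asarray(tvec_cam).reshape(3) + anchor_position
    return R_world, t_world
```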
[00102] In addition to the rotation and translation vectors transmitted (in the shared environment world), the image data is also sent in step 805, as well as any other data such as 3D meshes, as described in later embodiments, and other metadata that is typically sent in such VC systems (e.g. name, location, etc.). The landmark features can also be sent at this point, to reconstruct the user using various graphical and computational techniques. In addition, and more importantly considering the discussion of FIGs. 7A - 7H, it is desirable to know in what format each participant is viewing the virtual world - an evenly distributed environment, or one with compressed ("tight") viewing angles, as described above. In other words, the system comprises a user display (not shown), and each user display can have its own positioning arrangement of the participants. This data can be sent as an analog angular compression value, or as a Boolean if the compression angles are predetermined.
[00103] Thus, the one or more processors of the system output data associated with the received video output (i.e. the processed images) and the mapped positions and orientations. The combined data is sent to the other participants, with each other user in one embodiment receiving the exact same data, which is preferable for scaling reasons. The other participants will then locally render the user (and other participants) on their display and software, as described in step 806. This rendering, for the case of a single camera, will primarily include the 2D video sent, but may also include other 3D data information. The exact means of rendering depends upon the software used, and does not limit this disclosure in any way. For example, the render can be on a 2D plane situated in 3D space, or on a 3D object or template.
[00104] In addition to the rendering of the participants/user, in step 807 the correction is made for any angular compression, as described above. In order to maintain the illusion that the user is rotating in space, even though they are only imaged via a 2D camera, the correct orientation of the rendered user on the participant's end should preferably match the viewpoint of the user. This is true whether or not head tracking is implemented. If the angles are compressed, as described in the Tables above, and limiting the discussion for simplicity to yaw angle compression, then a yaw rotation of r_sent is in one embodiment corrected by a factor of a/b, with a being the maximal angle to the rightmost viewer without compression, and b being the maximal angle to the rightmost viewer after compression. Then, given (e.g.) a Boolean, P_tight, indicating whether the particular participant sending the data is viewing things in a "tight" state or not, the rotation that the local version of the participant will undergo will be:

r_local = r_sent * [1 + P_tight * (a/b - 1)].
[00105] Essentially, this means that if the user is viewing things in a tight configuration, and turns (e.g.) 10 degrees to the right to look at another participant, then the others will have that user rotate by the correct angle of (e.g.) 30 degrees locally to have them oriented towards the person that they’re speaking to in a symmetric distribution. Note that translation will affect this angle as well, as a lateral shift of the user away from the optical axis can also be inferred as a shift in viewing angle. However, this effect can be neglected for simpler systems as it may also make the experience more difficult for the user.
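A minimal sketch of this decompression step on the receiving side follows; the function name is illustrative, and the 10-to-30-degree numbers in the usage comment simply mirror the example in the preceding paragraph (implying a/b = 3).

```python
def decompress_yaw(r_sent_deg: float, sender_is_tight: bool,
                   a_deg: float, b_deg: float) -> float:
    """Apply r_local = r_sent * [1 + P_tight * (a/b - 1)] from step 807.

    r_sent_deg      -- yaw rotation reported by the sending participant
    sender_is_tight -- the P_tight flag of that participant
    a_deg           -- maximal angle to the outermost participant, uncompressed
    b_deg           -- the same angle after ("tight") compression
    """
    p_tight = 1.0 if sender_is_tight else 0.0
    return r_sent_deg * (1.0 + p_tight * (a_deg / b_deg - 1.0))

# Example: a sender in a tight layout turning 10 degrees, with a/b = 30/10,
# is rendered locally with the full 30-degree turn:
# decompress_yaw(10.0, True, a_deg=30.0, b_deg=10.0) -> 30.0
```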
[00106] As described above, the second embodiment of this disclosure includes a simpler, more "digital" variation of the 3D representation. Considering the description above of a 2D screen and employing head tracking to follow the user's visual focus, the 3D representation of the person can always be replaced with a 2D image of the person, situated in a 3D environment. The 3D aspect of the conversation is somewhat maintained in the spatial awareness of the viewers with respect to one another; however, this will be disrupted every time the user rotates their head a little to control the camera. Furthermore, when using a wider screen, it is advantageous to allow the user to physically turn their head and body to address someone, and then maintain that pose, without requiring them to re-center themselves towards the central camera.
[00107] In FIG. 9, a simplistic top-down schematic depicting a 3-camera configuration 900 is shown, here forgoing any 3D sensor aspect for simplicity. Shown also is a user 901 who is situated at a distance 902, denoted d, from the central camera 903, assumed to be located - as in a standard configuration - before a static camera system and display. Since most users are seated in a chair before their system, they are typically limited in motion to a limited amount 904 to the sides, denoted t; neglecting all motion in the axis perpendicular to the page (the "z-axis"), there is a small region in front of the camera that they can generally move within, and for the purposes of the use cases of this disclosure, it would be acceptable for them to move within it. The system could generally also begin with an alignment process whereby the user is told to position themselves directly in front of, and at a given range of distances d from, the central camera 903. When positioned in front of the camera, the direct line of sight 905 will be to the central camera 903 (neglecting the z-axis orientation of the eyes vis-a-vis the display), whereas the side camera 906 [shown in this schematic as the top camera] will require the user to change their head (or body) position 907 to look at it. Since the camera arrangement of this disclosure comprises cameras offset by a baseline 908, denoted b, from each other, the angle 909, denoted here θ, between the two cameras (903 and 906) from the point of view of the user is simply given by tan(θ) = b/d. This angle will vary if the user moves forwards/backwards from the initial distance d, which can be measured using an additional 3D camera, using a stereo or trifocal tensor approach, or using machine learning estimates (as will be discussed below). If the user rotates their viewpoint to an arbitrary point 910 (within the plane shown, for simplicity), between the two cameras (903 and 906), then there will be a point midway between the two cameras at which the ideal camera chosen to portray the user as "looking in that direction" will change. This description does not take into account the rotations of the cameras within the camera system, for simplicity of description. That midway point can be given by an angle 911, denoted here α, which is given by tan(α) = (b/2)/d. Therefore, given the estimated distance to the viewer, and their rotational perspective with respect to the central camera, one can deduce which camera will have the best perspective of the user's face. This description can be extended to more cameras within this plane, as well as to cameras extending outside this plane. Note that the angles of the cameras on the device are used to offset the rotational angle of the user, and a trade-off should be made for the use case to ensure the best choice of camera rotation vs. distance to the cameras. Lateral translations t can likewise be taken into account when finding the threshold angle for switching between cameras, as a function of the distance d to the central camera 903, by increasing/reducing the calculated threshold angle α.
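A minimal sketch of this camera-selection rule follows; the sign conventions (positive yaw and positive t towards the top/right camera) and the function name are assumptions for the example.

```python
import math

def best_camera(yaw_deg: float, d: float, b: float, t: float = 0.0) -> str:
    """Pick the camera with the best view of the user's face in the 3-camera
    layout of FIG. 9, by comparing the head yaw against the threshold angle
    alpha, where tan(alpha) = (b/2)/d, corrected for the lateral shift t.

    yaw_deg -- head yaw relative to the central camera's axis (0 = facing it)
    d       -- distance 902 from the user to the central camera
    b       -- baseline 908 between the central camera and each side camera
    t       -- lateral translation 904 of the user off the optical axis
    """
    alpha_right = math.degrees(math.atan((b / 2.0 - t) / d))
    alpha_left = math.degrees(math.atan((b / 2.0 + t) / d))
    if yaw_deg > alpha_right:
        return "right"
    if yaw_deg < -alpha_left:
        return "left"
    return "central"

# Example: with b = 0.2 m and d = 0.6 m, alpha is ~9.5 degrees, so a 15-degree
# head turn to the right switches the output to the right camera.
```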
[00108] FIG. 10 illustrates a block diagram of a system 1000 used for this second embodiment. Here, a distinction is made between a central camera 1001 and the side cameras 1002, which all may or may not include additional 3D sensors. For each of the images collected from each camera branch - in addition or instead of the algorithmic pipeline described in the first embodiment - the background may be removed in a background removal functionality 1003 using the methods described above. While pose estimation can be employed on all of the camera streams, it is primarily performed by a pose estimation functionality 1004 on the central camera 1001, since the 3D orientation of the user will be with respect to it. Removing the background and estimating the pose can also be done directly within a combined deep learning network. Regardless of the method used, the camera system will then be given a set of vectors describing the user’s translation and rotation with respect to the central camera. The central camera 1001 is used as a general point of reference in these forms of systems, without a loss of generality, and the description here of only a single user is for simplicity purposes, but can be extended to multiple simultaneous users within the frame of the camera. However, the case of multiple users may confound the system, and will therefore not be described here. The camera system 1000 further comprises an image selection functionality 1005 which will then select which images to send, via a communications system 1006, and to which other participants. For example, in the case of only 2 overall participants of the system, with the participants positioned directly across from one another, the central camera would generally be chosen, regardless of the orientation of the user’s head. In contrast, with 3 participants, if the user is to look at a user on their right, then the participant on the right will see the stream of video coming from the camera to the right of the first user, whereas the participant on the left would (perhaps) see the video emitted from the camera on the left. This would give the participants a feel for whether the user is looking or addressing them, or whether they are looking in a different direction.
[00109] In the example system of FIG. 10, consisting of only 3 camera views, the number of outgoing streams will be defined by the number of overall participants in the system. Therefore, the image selection process of image selection functionality 1005 will be a function of the 3D orientation of the participants with respect to each other, as well as the number of users. It could be that many users are sent identical streams since, in this case, there are only 3 viewpoints to choose from. In this respect, the solution is a digital or discrete form of the general solution depicted in this embodiment, where the creation of a 3D representation is such that it can be viewed from any, nearly arbitrary, angle.
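One way such a per-participant selection could be realized is sketched below; the rule of matching each participant's virtual angular position to the nearest physical camera is an illustrative interpretation of the example given above, and the angle values are assumptions.

```python
def select_streams(participant_angles_deg, camera_angles_deg):
    """Possible realization of image selection functionality 1005: each remote
    participant receives the stream from the physical camera whose direction
    (angle from the user's center line) lies closest to that participant's own
    virtual position, so the user's head turns appear from roughly the correct
    perspective.  Many participants may therefore receive the same stream.

    participant_angles_deg -- {participant_id: virtual angle as seen by the user}
    camera_angles_deg      -- {camera_id: angle of that camera from the center line}
    """
    selection = {}
    for pid, p_angle in participant_angles_deg.items():
        best = min(camera_angles_deg,
                   key=lambda cam: abs(camera_angles_deg[cam] - p_angle))
        selection[pid] = best
    return selection

# Example with cameras ~18.5 degrees off center and two participants at +/-30:
# select_streams({"left": -30, "right": 30},
#                {"left_cam": -18.5, "central_cam": 0.0, "right_cam": 18.5})
# -> {"left": "left_cam", "right": "right_cam"}
```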
[00110] In one embodiment, any of background removal functionalities 1003, pose estimation functionality 1004 and image selection functionality 1005 are implemented as respective portions of software executed by one or more processors. Alternatively, or additionally, any of background removal functionalities 1003, pose estimation functionality 1004 and image selection functionality 1005 can be implemented on dedicated circuitry, such as a microprocessor, a field-programmable gate array, or another type of integrated circuit. In another embodiment, communications system 1006 comprises any appropriate communication hardware, such as a connection to the Internet, a connection to an Ethernet network and/or a cellular connection.
[00111] Other variations of this embodiment can include the use of other sections of the embodiments described within this disclosure. For example, since an important aspect of video communications may be the head-on image of the user, in some configurations only the camera with the best angle of the person's face can be transmitted, whereas the rest of the users may only see an artificially created 2D or 3D representation of the user. This may be beneficial, as it can reduce the bandwidth by cutting the video streams being sent down to a single one, with the rest of the data being sent as a more minimalist model representation.
[00112] It should be noted that for this embodiment, as well as all other embodiments, a block diagram such as that appearing in FIG. 10 primarily refers to the operation of the system during run-time. However, an initial alignment or calibration stage can occur that runs on a different set of system blocks. For example, as described above, a calibration stage can occur before a session, whereby the orientation of the user's face is found using all 3 cameras (in this example), and the threshold angle for switching between cameras is calculated by averaging the poses estimated from the three cameras. This processing can also occur on a separate system, without the need for real-time computational speeds.
[00113] The third main embodiment of this disclosure involves improving the representation of each user from a 2D video stream into a 3D representation ("hologram" or 3D mesh). In order to obtain such a representation, a combination of 3D data and machine learning algorithms can be utilized, as has been done in the prior art. However, in this embodiment, since the use case (emulating a VC) and the configuration of cameras (a person in front of a display and camera set) are known, a specific set of algorithms can be utilized to minimize the amount of processing done. Typically, 3D or NN solutions to this problem require high-end hardware, and are difficult to perform in real-time. However, given a set of 3 cameras (or more) in a set configuration that fits the use case (i.e., neglecting things such as the back of the head, which are not important in a VC setting), simplifications can be made.
[00114] Turning to a discussion of some of the 3D algorithms used, and their relation to the architecture of the camera device described in this disclosure, FIG. 11 illustrates a generalized description of the usage of 3D data and models. There are multiple methods of creating 3D representations, including a completely volumetric approach (relying solely on the input data), and an "AI" approach relying on matching sparse data to existing models. In FIG. 11, only the head is depicted, for simplicity, though the goal of this disclosure is to capture the upper torso and arms/hands as well. A model of a human head 1100 can be completely generic, or be drawn from a library of models covering, e.g., gender, race and age, as well as the inclusion of hair and skin tone. An advantage of using a model is that there are methods of describing the model using a smaller number of parameters, which can be less than the full 3D data required to create the model directly.
[00115] In the embodiment set where 3D data is also used, one can assume that the 3D data will only cover a portion of the object 1101. In the figure, some of the points appear “behind” the head, but this is just for simplicity in describing the method, since the drawing here is in 2D, whereas the points are in 3D. However, data from behind the user won't be obtained unless the user uses a mirror, or the user pre-captures themselves from all directions. Thus, for most typical embodiments, there will be regions of the head (in this example) that will not be captured by the 3D data in real-time. It should be noted that this 3D data can be obtained from both depth sensors and stereo calculations, as described above, and therefore can include data from more than one angle, as well as having a weight per data point regarding the accuracy of that point.
[00116] The data can then be combined at a step 1102: using the 3D data 1101 as well as the model 1100 to obtain a match between the two, indicated as object 1103. If the model is generic, then there will be mismatches between the data and the model; yet even if the model is based on the user (or on a previous frame), there will be mismatches. For example, if the user smiles, then the locations of the mouth will be different, and modifications to the model should be made. The advantages of using a model are that it can in-fill regions that are not captured, as well as potentially reduce the amount of data transmitted. For example, given a model with a set number of vertices, a step 1104 can be performed where the data points found are mapped to the set vertices of the model 1100. This allows the model to be tracked through time, with the same number of vertices in existence. The tracking of data between frames is also important for 3D data compression. Likewise, it can remove superfluous data points, as well as interpolate missing data points where needed. The resulting 3D representation 1105 can then be described as one that can match the user more accurately, can be represented by a fixed number of (more evenly spaced) points, and that covers the entire region, including potentially occluded regions of the images.
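A minimal sketch of the data-to-model matching of steps 1102 and 1104 is given below; the brute-force nearest-neighbour search and the blending factor are simplifying assumptions chosen for clarity rather than the disclosure's own algorithm.

```python
import numpy as np

def fit_points_to_model(model_vertices, captured_points, blend=0.5):
    """Map each captured 3D point to its nearest model vertex and pull that
    vertex towards the data.  Vertices with no nearby data keep the model's
    value, which in-fills occluded regions, and the vertex count stays fixed so
    the representation can be tracked from frame to frame.

    model_vertices  -- (V, 3) float array of the model 1100
    captured_points -- (N, 3) float array of the 3D data 1101
    """
    updated = model_vertices.astype(np.float64).copy()
    sums = np.zeros_like(updated)
    counts = np.zeros(len(updated))
    for p in captured_points:
        i = int(np.argmin(np.linalg.norm(model_vertices - p, axis=1)))
        sums[i] += p
        counts[i] += 1
    seen = counts > 0
    averaged = sums[seen] / counts[seen][:, None]
    updated[seen] = (1.0 - blend) * updated[seen] + blend * averaged
    return updated
```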
[00117] A final benefit of using a model is in filling in regions of the face that may be occluded by headsets. If the user is wearing an AR/VR headset, then parts of their face may be occluded from the camera around the eyes. For AR glasses with semi-transparent views, this may be negligible, however, for opaque coverings of the eyes, there is a benefit to filling in this region “artificially” by using a model. This model can be pre-trained on the user (before the session), i.e. the model is unique to this particular user, and can include eye gaze tracking and expression tracking to have the model mimic the user, as has been shown in the art for VR headsets.
[00118] The utility of these occluded points is described in relation to a device 1200 illustrated in FIG. 12. Only a portion of device 1200 is illustrated, for simplicity. Here, the simplified notation of 3 cameras 1201 (as in FIG. 9) is displayed, with a user 1202 looking to the right. In this situation, the rightmost (top) camera's FOV 1203 will be the best for capturing the user's face and upper body; however, the data points obtained via the depth sensor and/or stereo matching, denoted 1203, will be limited primarily to the front section of the user 1202. Therefore, not only will the back of the user 1202 not be present, but potentially the right side of the face as well. Using the method described above, the 3D representation can be modeled at step 1204 to include the regions occluded from the camera's FOV, as well as be distributed in a more ideal manner in step 1205, such as to minimize the amount of data and track the data from frame to frame. Due to the specific coverage of the cameras in this device disclosure, and the specific orientation of a person with regard to the device (a webcam), there is a strong relation between the algorithms, pipeline and orientation calculations and the device itself.
[00119] The third embodiment of this disclosure focuses on the improvements that would occur if the 3 cameras (or more) of the previous embodiment were combined into a single embedded device. In this case the algorithms of the previous paragraphs can be implemented per-user, and ideally at the edge, or in the cloud.
[00120] In FIG. 13, a disconnected block diagram is shown for many of the discrete blocks of processing that occur in the pipeline of this latter embodiment. Variations on this embodiment can include all or some of the blocks, and the exact sequence of the blocks may change, without a loss of generality. In one embodiment, each of the blocks of FIG. 13 can be implemented as respective portions of software executed by one or more processors. Alternatively, or additionally, any of the blocks of FIG. 13 can be implemented on dedicated circuitry, such as a microprocessor, a field-programmable gate array, or another type of integrated circuit.
[00121] The general flow of the blocks is to first obtain the images, to extract information from the images, to create 3D representations of the users, to transmit the images to a central server, to process the locations of the different participants, and to send data back to the participants. The general form of this pipeline is somewhat similar to a standard VC pipeline, with the modifications that the data being sent is not merely a 2D video stream but also includes 3D meta-data, as well as the processing steps of creating a 3D representation, which may or may not be processed on the central server. In FIG. 13, the audio portion (or even 3D audio) is not depicted, for reasons of simplification only.
[00122] The data streams that initiate the pipeline are primarily color video, obtained in step 1300 by color sensors, and depth data, obtained in step 1301 by one or more depth sensors (assuming a dedicated depth sensor is used). For the basic embodiment of the disclosure described in examples such as FIG. 12, there are 3 color sensors and a single depth sensor. There are advantages to having the cameras all on the same device, being fully interconnected, which will be further expounded upon below. These advantages include being able to synchronize the cameras to take images at exactly the same time, which allows for creating a 3D reconstruction from multiple images of a moving target (viz. a moving person).

[00123] In order to obtain 3D data, the device can utilize both stereo data 1303 and depth sensor data 1301, which can then be combined at step 1304. This will be discussed in detail below. The device disclosed as part of this disclosure describes the use of embedded processing units to perform these functions. Therefore, the delineation of where these processes occur (appearing as dashed lines 1302 in FIG. 13) is a function of the exact hardware architecture, as well as the embedded software on these devices. In a simple embodiment, the device can perform next to none of the processing steps appearing in this pipeline schematic, with the first delineation line signifying the end of the device component and the transfer of data to a secondary device or cloud server. Brief descriptions of these embodiments will be given below, with an overview of the processing pipeline stages described here first.
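The combination of stereo-derived and sensor-derived depth at step 1304 could, for example, take the form of a simple confidence-weighted fusion as sketched below; the per-pixel confidence maps and the weighting rule are assumptions for illustration.

```python
import numpy as np

def fuse_depth(stereo_depth_m, stereo_conf, sensor_depth_m, sensor_conf):
    """Fuse per-pixel depth from stereo matching (1303) with the depth sensor
    stream (1301) into a single depth map (step 1304) using a confidence-
    weighted average.  All inputs are (H, W) arrays, confidences in [0, 1];
    pixels where both confidences are zero are returned as 0 (no data)."""
    numerator = stereo_depth_m * stereo_conf + sensor_depth_m * sensor_conf
    denominator = stereo_conf + sensor_conf
    return np.where(denominator > 0,
                    numerator / np.maximum(denominator, 1e-9),
                    0.0)
```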
[00124] In addition to calculating depth data from stereo and/or depth sensors, further direct processing on the individual data streams can include a feature extraction step 1305, such as facial detection, as well as a segmentation step 1306, which can allow a reduction of the amount of data processed, as well as ultimately a reduction in the amount of noise in the system. Segmentation also enables background subtraction. There are currently system-on-chip camera systems that can implement these features directly on the camera itself, and in this case, these stages could potentially occur prior to the previous calculations. Alternatively, a processing unit (embedded or not) can perform these calculations in parallel or subsequent to the prior processing steps.
[00125] The next stage is to combine the textures at step 1307, as well as potentially reduce the amount of data at step 1308. This data reduction can be performed by down-sampling, decimating, or utilizing only data from segmented image frames, for example. Another step is that of compression, in step 1309. When dealing with standard VC, the video streams are almost always compressed to lower the bandwidth requirements, typically using formats such as H.264 or VP8. These encoders and decoders have been implemented in various embedded hardware systems, and can also be implemented via software on a variety of different platforms. If the only input blocks were the color inputs at step 1300, and the processing is to be implemented entirely on the cloud, then the ideal situation is for the device to compress the images and transmit them to the cloud as quickly as possible. If depth data is also to be used as an input at step 1301, then a compression format that also manages to either individually compress this data, or compress it in tandem with the color images, can be utilized.

[00126] The next stage (regardless of a potential compression and decompression stage 1309) is to combine the data into a 3D representation. There are numerous known algorithms for creating 3D meshes from data, in step 1310. If only color images are used, then photogrammetry methods are used. If 3D depth data is included, then first a 3D mesh is created and the texture is added, or each 3D data point is attributed a color vertex, with other approaches also possible. An alternative approach is to utilize a pre-existing template - for the example of a person, this could take the form of a generic mannequin - and then place the textures and/or 3D data onto this template. This modeling process, in step 1311, can consider the 3D data to correct the model, as well as potentially also use data from a parallel meshing process 1310 that can be used to improve the model. Yet another approach is to use neural network variations to create the 3D data based on existing data and an estimate of the 3D model based on the input data, or to employ methods of neural radiance fields to estimate the viewpoints of virtual intermediary capture angles (not shown in the figure for simplicity). The model used in step 1311 can be generic or can be pre-learnt on the user in a process stage occurring beforehand, such as when installing the camera device, or when initiating a conversation. An advantage of utilizing some form of modeling is to in-fill data that is lacking. For the device in this disclosure, the capture of the back of the head and torso will not exist, and therefore utilizing some form of closure method for the 3D representation is beneficial, whether by utilizing a generic, pre-trained, or estimated model.
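As a small illustration of the input to the meshing of step 1310, the sketch below back-projects a calibrated depth map and its registered color image into a colored point cloud, reusing the range-based background removal described earlier; the intrinsics (fx, fy, cx, cy) are assumed to come from the camera calibration.

```python
import numpy as np

def depthmap_to_pointcloud(depth_m, color, fx, fy, cx, cy, max_range_m=1.5):
    """Convert an (H, W) depth map in metres plus its registered (H, W, 3)
    color image into (N, 3) points and (N, 3) colors, discarding anything
    beyond max_range_m (which also drops the background)."""
    h, w = depth_m.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    valid = (depth_m > 0) & (depth_m < max_range_m)
    z = depth_m[valid]
    x = (us[valid] - cx) * z / fx
    y = (vs[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)
    return points, color[valid]
```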
[00127] Once the 3D representation is created, the orientation of the user vis-a-vis the other participants in the conversation is to be considered, in step 1312. This stage is, in one embodiment, identical to that of the previously described embodiments. The orientation of the users can also be utilized if the data transmitted to the participants (clients) is pre-rendered, in step 1313. This allows a simple video stream to be transmitted, as opposed to sending the full 3D representation to each user. However, this method also has drawbacks, as lags associated with the transmission may cause discomfort for the user if the images are not rendered at the same rate as the movement of the user and the other participants (as is known in the art regarding VR headsets as well, for server-side rendering). If the objects are pre-rendered, according to the FOV of each participant, then the next step 1314 could be a simple video compression of the stream. If, in the primary embodiment sets, the 3D representations are transmitted as-is, then in one embodiment they also undergo some form of compression. This compression can unfold the textures from the 3D mesh, and impart the 3D data information as well (once again, not including audio in this description). There are methods of creating this form of compression, including methods that transform the 3D representation into a video stream, which can then be further compressed utilizing standard video compression techniques, including methods of mesh-tracking and 3D binary trees.
[00128] The final stage, 1315, is to transmit the data to the participants, including the user. Due to compression techniques, this is generally some form of compressed video and/or including 3D meta-data and sound. The participants then need to decipher the 3D data in a controlled manner. The orientation is a function of the pre-arranged orientation processed as described above in step 1312, as well as particular settings the specific user may have regarding their preferred method of viewing the other participants. The bandwidth of each participant can also define the amount of compression and data sent to itself from the server, as is known in the field of VCs. The data the user sees can also include the initial data stream 1300, such as a preview of the user in 2D (or 3D, from the server).
[00129] A brief description of certain embodiments of the disclosure, with relation to where the demarcation between the components appearing in FIG. 13 lies, follows:
[00130] One embodiment of the disclosure will involve a camera device (with or without depth sensors) that only transmits the data - either raw, de-Bayered, or compressed - to the cloud. The rest of the processes will then occur on the cloud.
[00131] An alternative embodiment will include an intermediate processing unit, such as a personal computer to which the device is attached. In this embodiment, any - if not all - of the processes up to the positioning stage can occur on that unit. For this to occur, the intermediate unit would need to be a powerful computational source, which would limit the independence of the pipeline from each participant.
[00132] Another embodiment of the device will include embedding some of the processing steps onto the device itself (disregarding the transmission component, for this description). This can include the calculation of 3D depth and stereo, as well as fusing the 3D depth data.
[00133] In order to fully realize the potential of the last embodiment, a dedicated camera device would preferably be created that can implement the algorithms described above. FIG. 14 illustrates a generalized and simplified schematic of a 3D webcam device 1400 as described in this disclosure. The device 1400 consists of multiple sensors and configurations that are all variations of the embodiments of this disclosure, and the most basic form is shown here. In order to view a person’s head, with the capability to capture that person when they move their head left and right (but not necessarily up and down, since our heads are usually relatively level during conversation), the device 1400 preferably has at least 3 cameras: a first camera 1401 directly in front of the user, a right camera 1402 and a left camera 1403. These can be “simple” RGB color cameras, but could also include variations on standard sensors such as RGB-IR sensors, or cameras with diffractive elements or phase elements to split colors or deduce depth, without any loss of generality. An additional feature which may be beneficial is if the camera sensors employ global shutters instead of rolling shutters, for improved stereoscopic imaging. They may also employ internal computational capabilities, such as are available on many camera sensors today. The offset distance 1404 of the cameras from the central camera 1401 is in one embodiment symmetric, with the baseline distance affecting the ability to increase the angular coverage of a user. The side cameras 1402 and 1403 can also be rotated towards the central camera 1401, and they may be raised/lowered or brought forward/backward (not shown), without a loss of generality. The three cameras (1401-1403) may therefore lie on the same plane, or be on different planes from each other.
[00134] Having 3 cameras (with “camera” being the generic term for the color sensor) allows stereoscopic calculations of depth to be processed between pairs. This allows 3 pairs of depth calculations to be implemented, with various embodiments performing all, some, or none of these paired calculations. Standard corrections to the images, such as rectification of the side images if the cameras are tilted, are included in this description. There is an advantage to performing this calculation when the cameras are on the same horizontal axis and employ a global shutter (for video). However, dedicated depth sensors are also employed in some of the embodiments of this disclosure. In particular, in FIG. 14, the use of a single ToF (direct or indirect) sensor is shown, comprising an imaging sensor 1405 and a light source 1406, as is standard for most ToF sensors. This is the simplest depiction of the ToF sensor, which can also include other features and multiple light sources, but the imaging sensor itself 1405 is in one embodiment co-located as near as possible to the central RGB sensor 1401 to correlate the two obtained images. The other side cameras 1402 and 1403 (and any other camera described within this disclosure) can also have a co-located depth sensor; however, the single sensor in the center is the primary embodiment variation of this disclosure. Likewise, as described above, this sensor may be replaced with a different technology, such as structured light, assisted stereo illumination, ultrasound or radar, without any loss of generality.

[00135] The general paradigm for most embodiments is the use of 3 cameras, with a depth sensor co-located with the central camera, at the bare minimum. However, for more optimal coverage of a human torso and head, a camera “above” to capture the face from a higher angle is also beneficial. This is shown as an optional embodiment 1407, with the additional camera sensor 1408 offset by a distance 1409 above the central camera 1401. This camera 1408, too, can lie in a different plane from the other cameras, with some embodiments having the top camera 1408 and side cameras 1402, 1403 creating an arc towards the face of the user, and can also include rotations of the cameras, or potential additional cameras. In all these embodiments, a stereoscopic calculation can be employed between the central and satellite sensors in the first order, and between satellite sensors in the second order of stereoscopic calculations.
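A minimal sketch of the pairwise stereo calculations described above, assuming rectified frames and using OpenCV's semi-global matcher as a stand-in for whatever stereo engine the device actually embeds (camera names and parameters are illustrative):

```python
import cv2
import itertools

# Sketch: with three cameras (center 1401, right 1402, left 1403), up to
# three rectified pairs can each yield a disparity map. Frames are
# assumed rectified so that each pair shares a horizontal epipolar line.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=7)

def pairwise_disparities(frames):           # frames: {"center": img, "left": img, "right": img}
    gray = {k: cv2.cvtColor(v, cv2.COLOR_BGR2GRAY) for k, v in frames.items()}
    disparities = {}
    for a, b in itertools.combinations(sorted(gray), 2):
        # Note: per pair, the geometrically left frame is assumed to be
        # passed first (handled by the rectification step, not shown).
        disp = matcher.compute(gray[a], gray[b]).astype("float32") / 16.0
        disparities[(a, b)] = disp           # disparity in pixels for this pair
    return disparities
```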
[00136] An additional feature that can be included in this device is the capture of directional sound (“stereo sound”). This can be done by including a microphone within the device, and preferably near each sensor, such that there is one microphone 1410 near the central camera 1401 as well as microphones 1411 near the side cameras 1402 and 1403. This allows the 3D representation to include 3D sound, allowing directional and spatial information of the sound to be included as well.
[00137] FIG. 15 displays a design for one of the embodiments of this disclosure. Here, the device 1500 is centered on a central camera 1501, with the co-located depth sensor 1502 also adjoined mechanically to the hub 1501A of the central camera 1501. The left camera 1503 and right camera 1504 are shown here to be offset by rods or arms 1505 and 1506, at a distance from the central hub 1501A. The baseline distances can be anywhere from 5cm to 40cm; however, as the 3D webcam would typically sit atop a laptop or monitor, they would usually be on the order of 10-30cm. There are limitations to placing the cameras too far from each other for stereo imaging, even if one or more of the cameras is rotated, as it limits the minimal working distance for the stereo imaging. Furthermore, a long arm is also disadvantageous, as the limited mechanical stability of the arm (even if held from below) may allow even thermal fluctuations to affect the calibration of the cameras to each other. A rigid arm structure is preferred, with the arms also being used to electrically connect the cameras to the central hub. If the arms are intentionally bent, even to ensure better coverage, care must be taken to maintain mechanical stability. In the case where one of the arms is bent or vibrates, such as shown by arrow 1507, the calibration and alignment between the cameras will be affected, thereby deleteriously affecting the stereoscopic calculations. There are methods to correct for such changes to the alignment, employing both imaging and 3D data techniques. For example, if an initial alignment is done, and then the camera arm vibrates, the calculated 3D data points between the central camera 1501 and side cameras 1503 and 1504 will be shifted. A correction can be done where the points are brought back into alignment with the original calibration, thereby correcting for the error due to the bending or vibration of the arm. There are also methods employing “structure from motion” techniques to align images to each other per-frame. Nevertheless, a preferred embodiment will allow for as minimal changes to the permanent locations of the cameras as possible.
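A hedged sketch of the re-alignment correction mentioned above, assuming the drifted and reference 3D point sets are already matched; the standard SVD-based (Kabsch) rigid-alignment solution is used here as one possible realization, not as this disclosure's specific method:

```python
import numpy as np

# Given 3D points triangulated before (reference) and after (drifted) a
# small bend or vibration of a camera arm, recover the rigid transform
# that maps the drifted points back onto the reference frame.
def rigid_correction(drifted, reference):            # both: (N, 3) arrays of matched points
    mu_d, mu_r = drifted.mean(axis=0), reference.mean(axis=0)
    H = (drifted - mu_d).T @ (reference - mu_r)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))            # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_r - R @ mu_d
    return R, t

# Usage: re-map every newly triangulated point into the original frame.
# R, t = rigid_correction(drifted, reference)
# corrected = (R @ drifted.T).T + t
```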
[00138] The central hub 1501A of the device 1500 includes the cameras 1501 and 1502, but also internal processing units, input and output connectivity, and mechanical attachments 1508, such as a tripod mount or clasp (as examples). This hub 1501A can include the electronics used to create 3D data, to calculate stereo pairs, to fuse DMs and/or point clouds (PCLs), to perform color space transforms, to compress the images and/or 3D data, and to fuse the color images. Furthermore, it can contain computational power used to combine the 3D and color images into a 3D representation, as well as compress or decimate that representation for transmission purposes. It can also include methods of performing AI algorithms on the device itself, such as background subtraction, as described above. The computational unit embedded within may comprise a single unit, or multiple adjoined units. For example, it can contain a board designed for fusing the 3D data, as well as a board designed to compress and transmit the data.
[00139] Transmitting the data out of the unit - with the “data” being the collection of information gathered by the potential color, depth and sound sensors described to be within the unit (some not shown here) - can be done either wirelessly or wired. If a wireless connectivity paradigm is used, the transmission modem can be embedded within the hub; if wired, the hub is connected externally via a cable 1509. This modem can be a free-standing unit, or one embedded in a different device such as, but not limited to, a personal computer, laptop or mobile device. The cable connecting the device to an external device 1510 can provide both input and output, such as is standard with most peripheral devices today.
[00140] It should be noted that the use of 3 cameras in some embodiments of this disclosure allows the use of what is known as the trilinear or trifocal tensor. This mathematical description of 3 cameras enables the calibration and alignment of the three cameras in a single set of equations (tensor) that can be solved using multiple methods. Specifically, with 3 separate images from the (calibrated) cameras, the trilinear tensor for this system can be found from a given number of matched points between the images. This framework would then allow the cameras to be realigned from a single set of images, thereby enabling the correction of any misalignment of the cameras due to shifts, rotations or vibrations of the cameras. It can also recreate the 3D representation based on the knowledge of the tensor and the images, and can be considered one of the embodiments of the 3D camera component.
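By way of a non-limiting illustration (added for clarity and not taken from the original disclosure; the notation follows the standard multiple-view-geometry formulation), the constraint encoded by the trifocal tensor for a triple of corresponding image points $\mathbf{x}$, $\mathbf{x}'$, $\mathbf{x}''$ in the central and two satellite views can be written as

$$[\mathbf{x}']_{\times}\Bigl(\textstyle\sum_{i=1}^{3} x^{i}\,\mathbf{T}_{i}\Bigr)[\mathbf{x}'']_{\times}=\mathbf{0}_{3\times 3},$$

where $\mathbf{T}_{1},\mathbf{T}_{2},\mathbf{T}_{3}$ are the three $3\times 3$ slices of the tensor and $[\cdot]_{\times}$ denotes the skew-symmetric cross-product matrix. Each matched point triple contributes four linearly independent equations, so the 27 tensor entries (26 up to scale) can be estimated linearly from roughly seven or more matched triples, which is what allows the three cameras to be re-aligned from a single set of images as described above.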
[00141] The methods of producing DMs or PCLs from two stereo images require algorithms that each have their own pros and cons. In particular, if the calculations are to be implemented in real-time (e.g. at 30 fps), for large images (e.g. with 1 M pixels each), and using lower cost hardware, an appropriate algorithm should be selected. One of the most efficient methods is based on a “patch-match” technique, where areas of distinct features are compared to one another in each image. Other methods, such as those based on feature detection and matching, also exist.
[00142] A simplified graphic describing the use of stereo imaging calculations, and the benefits of the device described here in performing faster and more efficient stereo imaging, is shown in FIG. 16. Shown are two simplified “images” (drawings are intentionally simplistic for description purposes only), with a left image 1600 and right image 1601 taken from two different cameras. This could also be the same camera at two different times, but the description will discuss only the former case for clarity. In these images, there is a “person” 1602 appearing (for descriptive purposes), with a shift in their location from image 1600 to image 1601, as is standard in describing two stereo images. For the sake of discussion, assume that the algorithm is attempting to match a feature 1603 of the left image 1600 and a feature 1604 of the right image 1601, under the assumption that both of these features are the same physical location, but appear in different camera locations due to the difference in triangulation between the cameras. The simplification of this description and schematic of FIG. 16 are relevant to specific instances of the embodiment whereby pairs of cameras lie on the same imaging plane (without rotations), and therefore the concept of epipolar lines (as is known in the art) can be represented as parallel lines on the imaging system. Generalizing to non-planar geometries can be included in some embodiments of the disclosed technology.
[00143] When matching the features, the general approach is typically to run along a single row of pixels (assuming here that the two cameras are mounted horizontally, or have been rectified already), and to search for features that match along that row. The difference in pixels provides the disparity, which then correlates to the depth of the point in space using triangulation formulas. In the graphical depiction of FIG. 16, this is shown as matching the left row of pixels (or group of pixels for a patch) 1605 with the right row of pixels 1606. The features are located in distinct “columns” 1607 and 1608, on the left and right respectively, such that if the feature on the right is mapped into the pixels 1609 on the left, then the disparity of pixels 1610 will be the distance in pixels between them. For a standard algorithm, since there is no knowledge of where the mapped feature 1609 would appear, the search would have to run from the point of the first feature 1611 until the end, while searching to the right (in this simplified example). Smaller disparities correlate with farther points, and vice versa. The disparity for two cameras placed ~10-30 cm apart, viewing an object at a distance of 0.5-1m, can therefore be quite large, resulting in a required search pattern that is long and less efficient. Methods of improving upon this approach would require further information, either from the previous frame, in which the features may not have existed, or using an initial estimate based on a depth sensor. Both of these methods are utilized in the current disclosure, whereby (for example) using the knowledge of the depth (even if not completely accurate) provided by the co-located depth sensor, a better initial guess, or seed, for the search pattern is chosen, as step 1612. This allows systems to have a limited disparity range over which to search, and is especially useful for embedded systems that have built-in fast stereo algorithms designed for small disparities.
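As a worked illustration of why the seed matters: for the geometry mentioned above (a baseline on the order of 0.15 m and an object at roughly 0.75 m) and an assumed focal length of about 800 pixels, the parallel-camera relation d = f·B/Z gives an expected disparity of roughly 160 pixels, so an unseeded search must scan a long range. Below is a minimal sketch of the seeding step 1612 under these assumptions; all numeric values and function names are placeholders, not values from this disclosure:

```python
# Convert the (approximate) depth from the co-located ToF sensor into an
# expected disparity via d = f * B / Z, then search only a small window
# around that seed instead of the full disparity range.
FOCAL_PX = 800.0        # focal length in pixels (placeholder)
BASELINE_M = 0.15       # camera baseline in meters (placeholder)

def disparity_window(tof_depth_m, margin_px=8):
    seed = FOCAL_PX * BASELINE_M / tof_depth_m          # expected disparity in pixels
    return max(int(seed) - margin_px, 0), int(seed) + margin_px

def match_along_row(cost_fn, col_left, row, tof_depth_m):
    lo, hi = disparity_window(tof_depth_m)
    # Evaluate the matching cost only inside the seeded window and keep
    # the disparity with the lowest cost (cost_fn stands in for whatever
    # patch or feature comparison the stereo algorithm uses).
    return min(range(lo, hi + 1), key=lambda d: cost_fn(row, col_left, d))
```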
[00144] A block diagram for the given method of finding the depth points is shown in FIG. 17 for a pair of color cameras (RGB or other similar sensors) 1700 and 1701, and a depth sensor 1702. The diagram is for a single pair, and the block diagram can be easily generalized to any number of pairs. In this diagram, the data from both color cameras will be utilized for the stereo depth calculations, as well as for producing color textures to be used for the 3D texturing. The first set of outputs is simply the fusion 1703 of the color images. This can include steps to lower the data amounts in those images, such as reducing the RGB color format to a YUV422 color format, or similar. It can also include taking raw image data and combining it (before de-Bayering). The images can be fused simply as adjacent matrices, or fused using other methods that may make use of other processes occurring in a subsequent step 1704, such as feature extraction or segmentation. They can also be output as individual frames if so desired. One method of combining the images is to make a composite image. If, for example, 3 color cameras are used at a resolution of 1 M pixels, then a combined image of 4 M pixels can be created by placing the images in a square matrix, with one of the 2x2 squares being empty (or filled with other information, such as depth information). This form of combination would make further compression easier for algorithms designed to work on set aspect ratio frame sizes.

[00145] The second use of the stereo images 1700 and 1701 is to calculate distance using a stereo algorithm. For this, in step 1705, the images are converted to grayscale before the calculation. In the stereo processing block 1706, the algorithm described in general in FIG. 16 occurs, where the input 1707 from the depth sensor 1702 can be applied as well to improve the efficiency of the algorithm. The output 1708 of the algorithm will be either a DM or PCL (depending upon certain details of the algorithm), and there would then be 2 depth data images for the given frames (although the points may be displaced due to the different FOVs of the color and depth cameras). These depth data points can be combined in step 1709 in a fusion algorithm that can be weighted to account for the inaccuracies inherent to the depth sensor and stereo matching pair. The output 1710 of this step would be either two different depth frames, or, using the fusion method described, a single DM or PCL, which could then also be further compressed or reduced.
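A minimal sketch of the weighted fusion in step 1709, assuming the stereo and ToF depth maps have already been registered to a common pixel grid; the weights and the zero-means-invalid convention are assumptions, not values from this disclosure:

```python
import numpy as np

# Combine the stereo-derived depth map with the ToF depth map using
# per-source weights that reflect their (assumed) relative accuracy.
def fuse_depth_maps(stereo_dm, tof_dm, w_stereo=0.6, w_tof=0.4):
    stereo_valid = stereo_dm > 0                 # 0 marks pixels with no stereo match
    tof_valid = tof_dm > 0
    fused = np.zeros_like(stereo_dm, dtype=np.float32)
    both = stereo_valid & tof_valid
    fused[both] = w_stereo * stereo_dm[both] + w_tof * tof_dm[both]
    only_s = stereo_valid & ~tof_valid
    only_t = tof_valid & ~stereo_valid
    fused[only_s] = stereo_dm[only_s]            # fall back to whichever source exists
    fused[only_t] = tof_dm[only_t]
    return fused
```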
[00146] The described block diagram involves steps that require processing on a computational device. While there are camera sensors with embedded processing, such as de-Bayering and compression, and even segmentation using NNs, the device described in this disclosure is different from existing architectures. As such, some form of computational device must be connected to the cameras, either within the device itself (embedded), or connected via a transmission cable (wired). The above-mentioned processes, such as stereo matching and depth fusion, could therefore occur on the device itself, given the requirements of the algorithms described and an appropriate amount of processing power. However, creating a full 3D representation may require computational assets that are not tenable for embedding within a single device, either for monetary or physical size considerations. Therefore, it is preferable to allow some of the algorithms to be processed in other locations, either on a separate computational device, or completely distributed, such as on an externally located server (“cloud”), as was described above in relation to FIG. 13.
[00147] The overall architecture of the system 1800 involves multiple participants connecting with each other, as shown in FIG. 18. There is typically a user of the system who sends out the request for a meeting, and who can be considered the “host” 1801. The host may not be the first member to join a meeting, but they will be defined as the host for two distinct purposes: in the software, and also potentially in the architecture. For the architecture, this means that the host may also take the role of the server 1802 for the connected network of users, or, as will more typically be the case, the server is located separately, on the cloud or on the internet - in general, on a server in a location that may be near or far from the host (with preference to closer distances due to shorter lag times). The host is important for the software, as it can define the beginning of a linked-list or array of users (or other data structure), which essentially describes “user #1”. It can also be relevant in reference to FIG. 7, where, if the host decides to position the other users in a different orientation, such as to emulate a manager sitting in front of their employees at a desk, then the host can be defined as the person whose rotation and orientation is at the “bottom” of the table (as described in relation to FIG. 7).
[00148] Each participant 1803 in the system 1800 will have an interface 1804 with the overall system that may consist of the camera devices described in this disclosure, other forms of cameras (both 2D and 3D), or even no camera at all (or the camera turned off), with only audio included. Therefore, the data sent from each user to the server 1802 may be different. The host can send data, via a dedicated interface 1805, including their own image information (e.g. a 3D representation, as in this disclosure, or even just a 2D image from a webcam), as well as information regarding the entire scene that the others will need in order to understand their orientation to each other. This data can include the coordinates of the participants with respect to the host, in a global coordinate system. As such, each participant can be given a unique coordinate that can be rendered in the same location for all users. The coordinates per user can include location and orientation, and additional metadata information can be included for each user, such as (but not limited to) geographical location, local time at the user’s time zone, hyperlinks to other information, as well as information regarding the user during the conversation, such as whether they are muted or have their camera off. Participants that are added to or removed from a conversation in this system will then be remapped to the data structure at the host or server, and can be retransmitted to each individual participant. Likewise, each participant can send similar information about themselves.
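A hedged sketch of the per-participant record and host-side list implied by the description above (the field names and the choice of a simple Python list are illustrative assumptions, not a structure defined by this disclosure):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# The host is element 0 ("user #1"); each entry carries the shared-scene
# coordinate, orientation, and some of the optional metadata mentioned
# in the text.
@dataclass
class Participant:
    user_id: str
    position: Tuple[float, float, float]      # global scene coordinate
    yaw_deg: float                            # orientation about the vertical axis
    time_zone: str = "UTC"
    muted: bool = False
    camera_off: bool = False

@dataclass
class Meeting:
    participants: List[Participant] = field(default_factory=list)

    def add(self, p: Participant):
        self.participants.append(p)           # host stays at index 0
        return self.broadcast_layout()

    def broadcast_layout(self):
        # Re-transmit the (possibly remapped) layout to every client.
        return [(p.user_id, p.position, p.yaw_deg) for p in self.participants]
```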
[00149] Each user of the system 1800 will therefore be transmitting their own representation (3D, 2D, or other, as described above), as well as receiving from the other users their representations and local transformations including position and rotation. In this variation of the embodiment, each user then renders the other users on their own device. An alternate embodiment will have the rendering done on the server for each user, who then only receives a server-side-rendered image (or set of images), instead of the full information from each user. This latter embodiment allows for less data transmission to each user, but also requires extremely low latencies to be achieved on the server. Note that the servers may also be performing other algorithmic tasks (per user), as described in earlier figures.
[00150] These previous variations of the embodiments can be taken into account with the other embodiments described in this disclosure to address the issue of varying bandwidths of receiving and transmitting data. Using the main embodiments of the device described here, one can create with the same device either a 3D representation, which is inherently a complex data structure, or even just a 2D image taken from discrete vantage points (as described in relation to the second primary embodiment). When there is less bandwidth to transmit data, the system (based at the server, or even per-participant) may decide to reduce the amount of data sent and revert to only sending 2D (compressed) images.
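A minimal sketch of such a bandwidth-dependent fallback follows; the threshold values and labels are placeholders chosen for illustration only:

```python
# The server (or a client) picks which representation to send based on
# the measured link rate, reverting to a compressed 2D view when the
# available bandwidth is low.
def choose_stream(bandwidth_mbps):
    if bandwidth_mbps >= 20:
        return "full_3d_representation"       # mesh / point cloud + textures
    if bandwidth_mbps >= 5:
        return "reduced_3d_representation"    # decimated or more heavily compressed 3D
    return "compressed_2d_view"               # single rendered viewpoint as video
```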
[00151] It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination.
[00152] Unless otherwise defined, all technical and scientific terms used herein have the same meanings as are commonly understood by one of ordinary skill in the art to which this invention belongs. Although methods similar or equivalent to those described herein can be used in the practice or testing of the present invention, suitable methods are described herein.
[00153] All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety. In case of conflict, the patent specification, including definitions, will prevail. In addition, the materials, methods, and examples are illustrative only and not intended to be limiting.
[00154] It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather the scope of the present invention is defined by the appended claims and includes both combinations and subcombinations of the various features described hereinabove as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description.

Claims

1. A camera capture and communications system comprising: at least one camera, each comprising a video output; and one or more processors configured to: receive the video output of each of the at least one camera; for each of a plurality of predetermined temporal points, map a position and orientation of a user imaged by the at least one camera to a virtual position and orientation; and output data associated with the received video output and the mapped positions and orientations.
2. The system of claim 1, wherein the mapping comprises adjusting a yaw angle between a direction where the user is facing and a direction where the at least one camera is facing.
3. The system of claim 1 or claim 2, wherein a predetermined positioning arrangement is defined for a plurality of meeting participants, the mapped virtual position and orientation being in relation to the predetermined positioning configurations.
4. The system of claim 3, wherein the predetermined positioning configuration is adjustable.
5. The system of any one of claims 1 - 4, wherein the mapping comprises outputting a positioning arrangement shown on a display of the user.
6. The system of any one of claims 1 - 5, wherein the one or more processors are further configured to: identify facial landmarks within the received video output; and match the identified facial landmarks with a 3-dimensional (3D) model of the user, wherein the data associated with the received video output is responsive to an outcome of the matching.
7. The system of claim 6, wherein the one or more processors are further configured to: track the outcome of the matching over time; and adjust the data associated with the received video output responsive to an outcome of the tracking.
8. The system of claim 6 or 7, wherein the 3D model is unique to the user.
9. The system of any one of claims 1 - 8, further comprising a plurality of terminals, each associated with a respective participant, wherein each of the plurality of terminals receives data associated with the received video output and the mapped positions and orientations.
10. The system of any one of claims 1 - 9, wherein the one or more processors are further configured to estimate the pose of the user within the video output, the mapping being responsive to the estimated pose.
11. The system of any one of claims 1 - 10, further comprising at least one depth sensor, wherein the mapping is responsive to an output of the at least one depth sensor.
12. The system of claim 11, wherein the at least one camera comprises a plurality of cameras, and wherein the one or more processors are further arranged to: fuse an image of the video output of a first of the plurality of cameras with an image of the video output of a second of the plurality of cameras; match features of the image of the first camera to features of the image of the second camera to generate a depth map (DM) or point cloud (PCL); fuse the generated DM or PCL with the output of the at least one depth sensor, and wherein the output data associated with the received video is responsive to the fused images and the fused DM or PCL.
13. The system of any one of claims 1 - 11, wherein the at least one camera comprises a plurality of cameras, displaced from each other by respective predetermined distances, and wherein the outputs of the cameras are synchronized with each other.
14. The system of any one of claims 12 or 13, wherein the one or more processors are further configured to: determine, for each of the plurality of cameras, an angle between the user and the respective camera; and responsive to the determined angles, select one of the plurality of cameras, wherein the mapping is responsive to the selected one of the plurality of cameras.
15. The system of any one of claims 12 - 14, wherein the plurality of cameras comprises at least 3 cameras.
16. The system of claim 15, wherein the at least three cameras comprise three lower cameras and one upper camera, the three lower cameras being displaced horizontally from each other and the upper camera being displaced vertically above the three lower cameras.
17. The system of any one of claims 1 - 16, wherein the one or more processors are further arranged to define a predetermined range in relation to the at least one camera within the received images, the output data associated with the received video generated from image data within the predetermined range and not from image data outside the predetermined range.
18. A camera capture and communications method, the method comprising: receiving a video output of each of at least one camera; for each of a plurality of predetermined temporal points, mapping a position and orientation of a user imaged by the at least one camera to a virtual position and orientation; and outputting data associated with the received video output and the mapped positions and orientations.
19. The method of claim 18, wherein the mapping comprises adjusting a yaw angle between a direction where the user is facing and a direction where the at least one camera is facing.
20. The method of claim 18 or claim 19, wherein a predetermined positioning arrangement is defined for a plurality of meeting participants, the mapped virtual position and orientation being in relation to the predetermined positioning configurations.
21. The method of claim 20, wherein the predetermined positioning configuration is adjustable.
22. The method of any one of claims 18 - 21, wherein the mapping comprises outputting a positioning arrangement shown on a display of the user.
23. The method of any one of claims 18 - 22, further comprising: identifying facial landmarks within the received video output; and matching the identified facial landmarks with a 3-dimensional (3D) model of the user, wherein the data associated with the received video output is responsive to an outcome of the matching.
24. The method of claim 23, further comprising: tracking the outcome of the matching over time; and adjusting the data associated with the received video output responsive to an outcome of the tracking.
25. The method of claim 23 or 24, wherein the 3D model is unique to the user.
26. The method of any one of claims 18 - 25, further comprising, for each of a plurality of terminals associated with a respective participant, receiving at the respective terminal the data associated with the received video output and the mapped positions and orientations.
27. The method of any one of claims 18 - 26, further comprising estimating the pose of the user within the video output, the mapping being responsive to the estimated pose.
28. The method of any one of claims 18 - 27, further comprising receiving an output of at least one depth sensor, wherein the mapping is responsive to the received output of the at least one depth sensor.
29. The method of claim 28, wherein the at least one camera comprises a plurality of cameras, and wherein the method further comprises: fusing an image of the video output of a first of the plurality of cameras with an image of the video output of a second of the plurality of cameras; matching features of the image of the first camera to features of the image of the second camera to generate a depth map (DM) or point cloud (PCL); fusing the generated DM or PCL with the output of the at least one depth sensor, and wherein the output data associated with the received video is responsive to the fused images and the fused DM or PCL.
30. The method of any one of claims 18 - 28, wherein the at least one camera comprises a plurality of cameras, displaced from each other by respective predetermined distances, and wherein the outputs of the cameras are synchronized with each other.
31. The method of claim 29 or 30, further comprising: determining, for each of the plurality of cameras, an angle between the user and the respective camera; and responsive to the determined angles, selecting one of the plurality of cameras, wherein the mapping is responsive to the selected one of the plurality of cameras.
32. The method of any one of claims 29 - 31, wherein the plurality of cameras comprises at least 3 cameras.
33. The method of claim 32, wherein the at least three cameras comprise three lower cameras and one upper camera, the three lower cameras being displaced horizontally from each other and the upper camera being displaced vertically above the three lower cameras.
34. The method of any one of claims 18 - 33, further comprising defining a predetermined range in relation to the at least one camera within the received images, the output data associated with the received video generated from image data within the predetermined range and not from image data outside the predetermined range.
PCT/IL2022/050209 2021-02-23 2022-02-22 Camera capture and communications system WO2022180630A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163152568P 2021-02-23 2021-02-23
US63/152,568 2021-02-23

Publications (1)

Publication Number Publication Date
WO2022180630A1 true WO2022180630A1 (en) 2022-09-01

Family

ID=83047817

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2022/050209 WO2022180630A1 (en) 2021-02-23 2022-02-22 Camera capture and communications system

Country Status (1)

Country Link
WO (1) WO2022180630A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2616850A (en) * 2022-03-21 2023-09-27 Michael Baumberg Adam Pseudo 3D virtual meeting

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7532230B2 (en) * 2004-01-29 2009-05-12 Hewlett-Packard Development Company, L.P. Method and system for communicating gaze in an immersive virtual environment
US20120169838A1 (en) * 2011-01-05 2012-07-05 Hitoshi Sekine Three-dimensional video conferencing system with eye contact
US8319819B2 (en) * 2008-03-26 2012-11-27 Cisco Technology, Inc. Virtual round-table videoconference
US20130271560A1 (en) * 2012-04-11 2013-10-17 Jie Diao Conveying gaze information in virtual conference
US9538133B2 (en) * 2011-09-23 2017-01-03 Jie Diao Conveying gaze information in virtual conference
US20170308734A1 (en) * 2016-04-22 2017-10-26 Intel Corporation Eye contact correction in real time using neural network based machine learning
US20190246066A1 (en) * 2011-03-14 2019-08-08 Polycom, Inc. Methods and System for Simulated 3D Videoconferencing
US20190244413A1 (en) * 2012-05-31 2019-08-08 Microsoft Technology Licensing, Llc Virtual viewpoint for a participant in an online communication

Similar Documents

Publication Publication Date Title
US11455032B2 (en) Immersive displays
RU2665872C2 (en) Stereo image viewing
US8928659B2 (en) Telepresence systems with viewer perspective adjustment
JP6353214B2 (en) Image generating apparatus and image generating method
JP6441231B2 (en) Apparatus, system and method for imaging and displaying appearance
EP3954111A1 (en) Multiuser asymmetric immersive teleconferencing
WO2017086263A1 (en) Image processing device and image generation method
US10681276B2 (en) Virtual reality video processing to compensate for movement of a camera during capture
WO2010119852A1 (en) Arbitrary viewpoint image synthesizing device
WO2014154839A1 (en) High-definition 3d camera device
KR20150053730A (en) Method and system for image processing in video conferencing for gaze correction
WO2018056155A1 (en) Information processing device, image generation method and head-mounted display
US11960086B2 (en) Image generation device, head-mounted display, and image generation method
JPWO2012147363A1 (en) Image generation device
US20220113543A1 (en) Head-mounted display and image display method
WO2021207747A2 (en) System and method for 3d depth perception enhancement for interactive video conferencing
WO2022180630A1 (en) Camera capture and communications system
US11099392B2 (en) Stabilized and tracked enhanced reality images
CN111712859A (en) Apparatus and method for generating view image
CN110060349B (en) Method for expanding field angle of augmented reality head-mounted display equipment
US20220113794A1 (en) Display device and image display method
CN113366825A (en) Image signal representing a scene
US11734875B2 (en) Image representation of a scene
US20200252585A1 (en) Systems, Algorithms, and Designs for See-through Experiences With Wide-Angle Cameras
JP7471307B2 (en) Image representation of the scene

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22759080

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22759080

Country of ref document: EP

Kind code of ref document: A1