FI20215726A1 - A transmission terminal for 3D telepresence - Google Patents

A transmission terminal for 3D telepresence

Info

Publication number
FI20215726A1
Authority
FI
Finland
Prior art keywords
local
participant
reconstruction
static
realtime
Prior art date
Application number
FI20215726A
Other languages
Finnish (fi)
Swedish (sv)
Inventor
Seppo Valli
Original Assignee
Teknologian Tutkimuskeskus Vtt Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Teknologian Tutkimuskeskus Vtt Oy filed Critical Teknologian Tutkimuskeskus Vtt Oy
Priority to FI20215726A priority Critical patent/FI20215726A1/en
Priority to PCT/FI2022/050442 priority patent/WO2022269132A1/en
Publication of FI20215726A1 publication Critical patent/FI20215726A1/en


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems
    • H04N7/157 Conference systems defining a virtual conference space and using avatars or agents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10 Geometric effects
    • G06T15/20 Perspective computation
    • G06T15/205 Image-based rendering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/006 Mixed reality
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/111 Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • H04N13/117 Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation, the virtual viewpoint locations being selected by the viewers or determined by viewer tracking
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/147 Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

A transmission terminal (200) for 3D telepresence session is caused to perform: forming a static 3D reconstruction (211) of a local meeting space; forming a realtime 3D person model (232) of a local participant; aligning the static 3D reconstruction of the local meeting space and 2D image data captured by a fixed camera sensor during the telepresence session to obtain an aligned static 3D reconstruction (238); tracking realtime position data of the local participant by wearable capture device (213) with respect to the fixed camera sensor (261) during the telepresence session; forming a combined 3D model (236); receiving (251), from a server, a first viewpoint of a remote participant with respect to the local participant; forming a first projection to the combined 3D model from the first viewpoint of the remote participant; and coding and streaming (254) the first projection to the remote participant or to the server for transmission to the remote participant.

Description

A transmission terminal for 3D telepresence
FIELD
[0001] Various example embodiments relate to telepresence systems, in particular to a transmission terminal for 3D telepresence.
BACKGROUND
[0002] Videoconference is an online meeting, where people may communicate with each other using videotelephony technologies. These technologies comprise reception and transmission of audio-video signals by users, e.g. meeting participants, at different locations. Telepresence videoconferencing refers to a higher level of videotelephony, which aims to give the users the appearance of being present at a real world location remote from one's own physical location.
SUMMARY
[0003] According to some aspects, there is provided the subject-matter of the independent claims. Some example embodiments are defined in the dependent claims. The scope of protection sought for various example embodiments is set out by the independent claims. The example embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various example embodiments.
[0004] According to a first aspect, there is provided a transmission terminal for 3D telepresence session comprising at least one processor; and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the terminal at least to perform: forming a static 3D reconstruction of a local meeting space wherein the static 3D reconstruction of the local meeting space is formed based on: image and depth data captured of the local meeting space before a telepresence session by a wearable capture device; and pose data of the wearable capture device tracked by the wearable capture device; forming a realtime 3D person model of a local participant based on 2D image data or image and depth data captured by a fixed camera sensor during the telepresence session; aligning the static 3D reconstruction of the local meeting space and the 2D image data captured by the fixed camera sensor during the telepresence session to obtain an aligned static 3D reconstruction; tracking realtime position data of the local participant by the wearable capture device with respect to the fixed camera sensor during the telepresence session; forming a combined 3D model based on the aligned static 3D reconstruction of the local meeting space, the realtime 3D person model, and the realtime position information of the local participant; receiving, from a server, a first viewpoint of a remote participant with respect to the local participant, wherein positions of the remote participant and the local participant are mapped into a unified virtual meeting geometry; forming a first projection to the combined 3D model from the first viewpoint of the remote participant; and coding and streaming the first projection to the remote participant or to the server for transmission to the remote participant.
[0005] According to a second aspect, there is provided a method comprising forming, by a transmission terminal, a static 3D reconstruction of a local meeting space wherein the static 3D reconstruction of the local meeting space is formed based on: image and depth data captured of the local meeting space before a telepresence session by a wearable capture device; and pose data of the wearable capture device tracked by the wearable capture device; forming a realtime 3D person model of a local participant based on 2D image data captured by a fixed camera sensor during the telepresence session; aligning the static 3D reconstruction of the local meeting space and the 2D image data captured by the fixed camera sensor during the telepresence session to obtain an aligned static 3D reconstruction; tracking realtime position data of the local participant by the wearable capture device with respect to the fixed camera sensor during the telepresence session; forming a combined 3D model based on the aligned static 3D reconstruction of the local meeting space, the realtime 3D person model, and the realtime position information of the local participant; receiving, from a server, a first viewpoint of a remote participant with respect to the local participant, wherein positions of the remote participant and the local participant are mapped into a unified virtual meeting geometry; forming a first projection to the combined 3D model from the first viewpoint of the remote participant; and coding and streaming the first projection to the remote participant or to the server for transmission to the remote participant.
[0006] According to an embodiment, the fixed camera sensor is in a known position with respect to a display used in telepresence session and/or is embedded to a user device used for telepresence session.
[0007] According to an embodiment, the method comprises detecting holes and/or other errors in the static 3D reconstruction of the local meeting space or receiving an indication on holes and/or other errors in the static 3D reconstruction of the local meeting space from the remote participant; requesting, via a user interface, the local participant to direct the wearable capture device to a direction with detected holes and/or other errors to capture image and depth data from that direction; updating the static 3D reconstruction based on image and depth data captured from the direction with detected holes and/or other errors.
[0008] According to an embodiment, the method comprises animating the realtime 3D person model of the local participant before rendering the realtime 3D person model to the tracked position into the static 3D reconstruction of the local meeting space, wherein the animating is performed based on features extracted from the 2D image data captured of the local participant by the fixed camera sensor during the telepresence session.
[0009] According to an embodiment, the method comprises storing the static 3D reconstruction of the local meeting space into a local memory of the wearable capture device; and copying the static 3D reconstruction of the local meeting space to the at least one memory of the transmission terminal.
[0010] According to an embodiment, the method comprises transmitting the realtime position data of the local participant to the server to be used for forming a virtual geometry between participants and for informing viewpoint of the local participant to the remote participant.
[0011] According to an embodiment, the wearable capture device is a headphone configured to capture image and depth data, and configured to track position.
[0012] According to an embodiment, the headphone is wirelessly connected to a user device used for telepresence session.
[0013] According to an embodiment, the wearable capture device is arranged to, when worn by the local participant, leave face of the local participant unmasked such that eye contact is enabled with the remote participant.
[0014] According to an embodiment, the method comprises detecting that a position of the fixed camera sensor is changed during the telepresence session; and realigning the static 3D reconstruction and the realtime image data captured by the fixed camera sensor.
[0015] According to an embodiment, the method comprises segmenting the 2D image data of the local participant from background.
[0016] According to an embodiment, the realtime 3D person model is formed based on 2D image data of the local participant representing different angles of view of the local participant and optionally the pose data tracked by the wearable device worn by the local participant.
[0017] According to an embodiment, the method comprises correcting pose of the realtime 3D person model in the combined 3D model to compensate for a position difference between the fixed camera sensor and a display.
[0018] According to an embodiment, the method comprises receiving, from the server, a second viewpoint of the remote participant with respect to the local participant, wherein the second viewpoint is different than the first viewpoint; forming a second projection to the combined 3D model from the second viewpoint of the remote participant; and coding and streaming the second projection to the remote participant or to the server for transmission to the remote participant.
[0019] According to a third aspect, there is provided a computer program configured to cause at least a method of the second aspect and any of the embodiments thereof to be performed.
[0020] According to a further aspect, there is provided a computer readable medium comprising program instructions that, when executed by at least one processor, cause an apparatus to at least perform a method of the second aspect and any of the embodiments thereof.
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] Fig. 1 shows, by way of example, a method for 3D telepresence;
[0022] Fig. 2 shows, by way of example, a block schema of operations at a local site;
[0023] Fig. 3 shows, by way of example, a block schema of a wearable capture device; and
[0024] Fig. 4 shows, by way of example, a block diagram of an apparatus.
DETAILED DESCRIPTION
[0025] Videoconferencing and telepresence support remote work, reduce emissions and may even prevent pandemics. In current 3D telepresence systems, real-time capture of meeting spaces and participants, and transmission of 3D data require multiple sensor setups, high computation power, and high bitrates. Current solutions for 3D telepresence are not compatible with existing videoconferencing solutions. Stereoscopic 3D does not support natural eye-focus or accommodation and causes discomfort and nausea. In addition, viewing fixed-viewpoint scenes on screen displays, albeit in stereo, severely restricts a participant's ability to change his/her viewpoint, in a natural way, to a remote participant.
[0026] There is provided a 3D terminal and a method for 3D telepresence, which enables low-cost enhancement of existing videoconferencing setups and provision of 3D cues. The 3D terminal disclosed herein enables more flexibility in choosing one's viewpoint to remote sites and participants.
[0027] A typical teleconference at home or work occurs in a frequently used space. Over time, such a space may be efficiently 3D scanned by a person carrying a suitable sensor for the purpose. For example, the sensor may be embedded to a headphone type of wearable capture device, favorably wireless. A wireless wearable capture device enables a person to stroll around and make a 3D capture of the meeting space before the teleconference, during his/her daily tasks and routines, e.g. cleaning, playing with children or pets, occasional phone calls, or even during an ongoing telepresence session.
[0028] After a 3D room model is formed based on a 3D scan of the meeting space, a laptop computer or other existing videoconferencing system enables supporting improved 3D perception. This is possible by real-time capturing a participant by a fixed camera, e.g. the laptop camera, and embedding or superimposing him/her as a 3D model into the essentially static room model, made in advance. For this purpose, the room model and the laptop view are aligned. After combining the 3D model of the person capture and the room model, the resulting 3D information can be processed for supporting changed viewpoints. Then the resulting information may be coded and sent using any feasible method.
[0029] As a result, a remote participant can be provided with new viewpoints based on his/her manual input, or based on his/her movements at the remote site. User motions may be tracked in realtime by the same wearable device which is used for capturing the meeting space for 3D reconstruction, for example. The wearable capture device may be a wireless headphone, and supports normal headphone functionalities for audio (a microphone, controls for muting and adjusting volume) in addition to comprising means for image and depth data capture and position tracking. The wearable capture device enables eye contact between participants of the 3D telepresence session.
[0030] Fig. 1 shows, by way of example, a method 100 for 3D telepresence. The method 100 may be performed e.g. by a transmission terminal 200 of Fig. 2, or in a control device configured to control the functioning thereof, when installed therein. The method 100 comprises forming 110, by a transmission terminal, a static 3D reconstruction of a local meeting space wherein the static 3D reconstruction of the local meeting space is formed based on: image and depth data captured of the local meeting space before a telepresence session by a wearable capture device; and pose data of the wearable capture device tracked by the wearable capture device. The method 100 comprises forming 120 a realtime 3D person model of a local participant based on 2D image data captured by a fixed camera sensor during the telepresence session. The method 100 comprises aligning 130 the static 3D reconstruction of the local meeting space and the 2D image data captured by the fixed camera sensor during the telepresence session to obtain an aligned static 3D reconstruction. The method 100 comprises tracking 140 realtime position data of the local participant by the wearable capture device with respect to the fixed camera sensor during the telepresence session. The method 100 comprises forming 150 a combined 3D model based on the aligned static 3D reconstruction of the local meeting space, the realtime 3D person model, and the realtime position information of the local participant. The method 100 comprises receiving 160, from a server, a first viewpoint of a remote participant with respect to the local participant, wherein positions of the remote participant and the local participant are mapped into a unified virtual meeting geometry. The method 100 comprises forming 170 a first projection to the combined 3D model from the first viewpoint of the remote participant. The method 100 comprises coding and streaming 180 the first projection to the remote participant or to the server for transmission to the remote participant.
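The numbered steps 110-180 can be read as a single per-frame loop running in the transmission terminal once the offline reconstruction of step 110 exists. The sketch below is only an orchestration outline under that reading; every helper (person_model_from, align, combine, project, encode_and_stream) is a hypothetical stand-in for the corresponding module of Fig. 2, not an implementation given by the patent.

```python
import numpy as np

# Hypothetical stand-ins for the modules of Fig. 2; real implementations would
# hold far more state (camera intrinsics, meshes, codecs, network sessions).
def person_model_from(frame_2d):                      # step 120
    return {"depth": np.full(frame_2d.shape[:2], 1.5), "color": frame_2d}

def align(static_recon, frame_2d):                    # step 130
    return static_recon                               # alignment omitted in this sketch

def combine(room, person, position):                  # step 150
    return {"room": room, "person": person, "position": position}

def project(combined, viewpoint):                     # step 170
    return np.zeros((480, 640, 3), dtype=np.uint8)    # rendered view placeholder

def encode_and_stream(view, send):                    # step 180
    send(view.tobytes())                              # codec omitted in this sketch

def telepresence_frame(static_recon, frame_2d, tracked_pos, remote_viewpoint, send):
    """One realtime iteration of method 100 after the offline step 110."""
    person = person_model_from(frame_2d)              # step 120
    aligned = align(static_recon, frame_2d)           # step 130
    combined = combine(aligned, person, tracked_pos)  # steps 140-150
    view = project(combined, remote_viewpoint)        # steps 160-170
    encode_and_stream(view, send)                     # step 180
```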
[0031] The method as disclosed herein enables backward compatibility with existing 2D videoconferencing solutions, which lowers a threshold for take-up of a system. Capturing the space beforehand protects privacy of the user and/or family members, for example, since they are not visible in the stream during the telepresence session even if they pass by the fixed camera. The method as disclosed herein enables support for motion parallax and spatial faithfulness in 3D remote interaction and telepresence. A fixed 3D capture setup is not required by the method as disclosed herein, which makes the system easier to take into use. Costs for the new system remain low, as 3D reconstruction of the local meeting space may be made using one wearable device, and the rest of the devices may be existing devices, but configured to function as disclosed herein. By using a viewpoint-on-demand approach for encoding, the realtime transmission of complete 3D reconstructions is avoided. Required bitrates and processing power are reduced by the method as disclosed herein. Combining the 3D person model produced by the transmission terminal with the reconstruction of the space enables correcting the viewpoint (parallax) difference between the fixed camera and the display, which commonly hinders eye contact.
[0032] Fig. 2 shows, by way of example, a block schema of operations at a local site. A transmission terminal 200, or a 3D terminal, comprises a 3D capturer 210, a 3D processor 230 and a 3D coder 250. The 3D capturer 210 and the 3D processor 230 may be considered to form together a 3D composer 205.
[0033] The transmission terminal 200 is shown in Fig. 2, but in case of a symmetrical application, a similar 3D capturer is a part of a 3D receiver terminal, as well.
[0034] The transmission terminal 200 comprises communication interface 206 for communicating with remote participant(s) residing in remote meeting space(s) and/or a server that manages telepresence sessions, forms the meeting geometry, performs mapping between user positions and delivers data streams to participants, for example. The server may be a computer communicating with user devices, e.g. laptops or alike, of the participants of the telepresence session over a communication network. The server may have cloud based computing functionalities.
[0035] Functions of the 3D coder 250 and the 3D processor 230 may be executed by a user device 260 used for telepresence session. The user device 260 may be e.g. a personal computer or laptop. A computer may be equipped with an external display and a camera. A laptop may comprise an integrated or embedded display and camera. Functions of the 3D capturer 210 may be executed by a wearable 3D capturer, an example of which is shown in Fig. 3.
[0036] A static 3D reconstruction of a local meeting space, e.g. a 3D room model 211, is formed by a room reconstructor 212. A local participant 201 resides in the local meeting space. The 3D space, where the local participant resides, is captured and modelled using a wearable capture device 213 worn by the local participant 201. The wearable capture device 213 comprises a camera and a depth sensor. The camera may be a stereoscopic camera. The capture device 213 may be carried by a person, the local participant, moving around in the local meeting space, and the space may be captured while the person performs daily activities, for example. Alternatively or in addition, the space may be captured during a telepresence session when the person is moving in the local meeting space. Capturing the space with the wearable capture device 213, instead of a fixed sensor setup, enables formation of the reconstruction of the local meeting space without several sensors and cameras installed in fixed positions in the local meeting space.
The 3D reconstruction, or the 3D room model 211, of the local meeting space may be produced offline and saved into a memory of the 3D terminal before a real-time telepresence session. The 3D room model 231 represents the model saved in the memory of the 3D terminal. Capturing the space as disclosed herein enables backward compatibility with existing 2D videoconferencing solutions, which lowers a threshold for take-up of a novel system. Capturing the space beforehand protects privacy of the user and/or family members, for example, since they are not visible in the stream during the telepresence session even if they pass by the fixed camera.
[0037] The 3D reconstruction may be formed based on image and depth data 214 captured of the local meeting space, by the wearable capture device 213, and pose data of the wearable capture device 213 tracked by the wearable capture device 213 worn by the local participant 201 during capture of the local meeting space. These captures are performed before a telepresence session. The block 214 represents the capture performed before a telepresence session, e.g. offline.
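A minimal sketch of how the captured image-and-depth frames 214 and the tracked device poses can be fused into the static reconstruction is given below. It assumes pinhole intrinsics K and a 4x4 camera-to-world pose per frame; these are assumptions of the illustration rather than details specified in the patent.

```python
import numpy as np

def backproject_depth(depth, K, T_world_from_cam):
    """Lift one depth frame (metres) captured by the wearable device into world
    coordinates using the pose tracked for that frame (pinhole model assumed)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    world = (T_world_from_cam @ pts.T).T[:, :3]
    ok = np.isfinite(z).reshape(-1) & (z.reshape(-1) > 0)
    return world[ok]

# Frames gathered while the participant walks around accumulate into one cloud:
# cloud = np.vstack([backproject_depth(d, K, T) for d, T in zip(depths, poses)])
```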
[0038] The local participants 201, 202 shown in Fig. 2 represent 203 the same participant wearing the same wearable capture device 213 at separate time instants. However, 3D capture of the meeting space before the meeting may be performed by another person than the one who participates in the meeting. In order to perceive a meeting participant's gestures and reactions, the local participant 202 is captured and 3D modelled in real-time during the telepresence session. The local participant 202 may be captured using a traditional embedded computer camera, such as a laptop camera, or using a camera which is in a known position with respect to a display used in the telepresence session.
This kind of camera is referred to as a fixed camera 261 herein. The 2D image data captured by the fixed camera sensor 261 during the telepresence session shows the local participant and at least part of the local meeting space. The wearable device enables eye contact between the participants of a 3D telepresence session.
[0039] The fixed camera sensor may be or comprise a depth sensor using e.g. a video-plus-depth format. The depth sensor may be attached or embedded to the user device used for telepresence session. A depth sensor may produce a better 3D shape of a local participant's upper body and face, e.g. in the sense of speed and accuracy.
[0040] A realtime 3D person model 232 of the local participant 202 is formed, by a 3D person reconstructor 233, based on 2D image data captured by a fixed camera sensor during the telepresence session. The fixed camera sensor 261 is in a known position with respect to a display used in telepresence session and/or embedded to a user device used for telepresence session. The user device may be e.g. a laptop.
[0041] Before forming the 3D person model, the 2D image data of the local participant may be segmented from the background. The background represents the local meeting space.
[0042] The static 3D reconstruction 231 of the local meeting space and the 2D image data captured by the fixed camera sensor 261 during the telepresence session are aligned by a 3D aligner 234. The aligned static 3D reconstruction, or the aligned room model 238, may be stored in the memory of the 3D processor 230. The alignment may be repeated in case the position of the fixed camera sensor 261 is changed during the telepresence session. Change of position may be detected, for example, by comparing 2D image data of different time points, wherein the 2D image data is captured by the fixed camera sensor 261. Based on the comparison, movement may be detected by applying motion estimation. If the detected movement is large scale, initialization is performed by finding a projection to the 3D reconstruction that best corresponds to the 2D image data by using exhaustive search. Alternatively, the change of position of the fixed camera sensor 261 may be detected by sensors of the wearable capture device 213.
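One way to realize the exhaustive-search initialization mentioned above is to render the (coloured) static reconstruction from a set of candidate camera poses and keep the pose whose rendering best matches the current camera frame. The sketch below does this with simple point splatting; the point-cloud representation, the intrinsics K and the pose parameterization are assumptions of the illustration, not details from the patent.

```python
import numpy as np

def splat(points, colors, K, T_cam_from_world, shape):
    """Project a coloured point cloud into an image; nearest point wins per pixel."""
    h, w = shape
    p = (T_cam_from_world @ np.c_[points, np.ones(len(points))].T).T
    z = p[:, 2]
    keep = z > 0.1
    u = np.round(K[0, 0] * p[keep, 0] / z[keep] + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * p[keep, 1] / z[keep] + K[1, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    img = np.zeros((h, w, 3))
    zbuf = np.full((h, w), np.inf)
    for ui, vi, zi, ci in zip(u[inside], v[inside], z[keep][inside], colors[keep][inside]):
        if zi < zbuf[vi, ui]:
            zbuf[vi, ui] = zi
            img[vi, ui] = ci
    return img, np.isfinite(zbuf)

def realign(points, colors, K, frame, candidate_poses):
    """Exhaustive search: return the candidate pose whose projection of the static
    reconstruction best matches the captured 2D frame (mean colour error)."""
    def score(T):
        img, mask = splat(points, colors, K, T, frame.shape[:2])
        return np.abs(img[mask] - frame[mask]).mean() if mask.any() else np.inf
    return min(candidate_poses, key=score)
```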
[0043] After the alignment, the 3D reconstruction may be textured with the more up-to-date, real-time view captured by the fixed camera sensor 261.
[0044] The position of the local participant 202 in the local meeting space with respect to the fixed camera sensor 261 is tracked 216 in realtime during the telepresence session. The tracking may be performed by the wearable capture device 213. The position information may be transmitted 217 to the 3D processor 230, e.g. to the 3D combiner 237, e.g. via a wireless link. The 3D processor 230 may map the position to computer coordinates.
[0045] A combined 3D model 236 is formed, by a 3D combiner 237, based on the aligned static 3D reconstruction of the local meeting space, the realtime 3D person model, and the realtime position information of the local participant. The position information of the local participant may be mapped to computer coordinates. The 3D person model may be added to the 3D reconstruction of the local meeting space by z-buffering and z-ordering of their depth maps. A depth map is an image comprising pixel distances from a viewpoint. Z-buffering compares the depth maps of the meeting space and person model, pixel by pixel, and shows the occluding, i.e. closer, pixels as a result of the process.
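A minimal sketch of this per-pixel z-ordering, assuming both layers have already been rendered into colour and depth maps of the same resolution:

```python
import numpy as np

def compose_by_depth(room_rgb, room_depth, person_rgb, person_depth):
    """Z-buffer style merge: where the person layer is present and closer than the
    room surface, its pixels occlude the room; elsewhere the room model shows."""
    person_wins = np.isfinite(person_depth) & (person_depth < room_depth)
    out_rgb = np.where(person_wins[..., None], person_rgb, room_rgb)
    out_depth = np.where(person_wins, person_depth, room_depth)
    return out_rgb, out_depth
```

Pixels outside the segmented person can simply be marked with an infinite depth so that they never occlude the room model.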
[0046] The combined 3D model 236 of a local meeting space with its occupant may be used to support 3D perception cues in various applications. For example, virtual viewpoints may be formed for a remote viewer based on movements of the remote viewer. The combined 3D model may be used to support a remote viewer with small viewpoint changes for perceiving motion parallax or with large viewpoint changes supporting viewer's mobility, for example. When a remote viewer chooses a viewpoint differing from the local fixed camera view, the combined 3D model may be oriented and scaled according to the new viewpoint, and a projection corresponding to the new viewpoint may be streamed to the receiver for viewing.
N
[0047] A first viewpoint of the remote participant with respect to the local participant is received 251 from a server. Position of the remote participant X is received in transmitter coordinates, that is, in local meeting space coordinates. Correspondingly, the realtime position data of the local participant is transmitted 240 to the server to be used for forming a virtual geometry between participants and for informing viewpoint of the local participant to the remote participants(s). The position data may be transmitted in transmitter coordinates. Positions of the remote participant(s) and the local participant are mapped into a unified virtual meeting geometry in the server. The unified virtual meeting geometry is mutual and consistent across all meeting sites. This enables supporting viewpoint changes. The virtual meeting geometry may be formed, for example, based on the positions of the wearable capture devices and the fixed camera sensors. Optionally, the captured 3D data of the meeting spaces may be used in forming the meeting geometry, if more knowledge on the meeting space volumes and dimensions are seen beneficial.
[0048] A first projection to the combined 3D model is formed, by a perspective generator 252, from the first viewpoint of the remote participant. The perspective generator has the knowledge of the viewpoints of the remote participants with respect to the local participant, since they are received 251 from the server. The perspective generator also has the knowledge of how the positions of the remote participants and the local participant are mapped into the unified virtual meeting geometry.
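The pose of such a virtual camera can be built directly from the unified meeting geometry: place the camera at the remote participant's mapped position and aim it at the local participant. The look-at construction below is a standard formulation offered for illustration, not text from the patent; the axis conventions (z forward) are an assumption and may need flipping to match a given renderer.

```python
import numpy as np

def look_at(eye, target, world_up=np.array([0.0, 1.0, 0.0])):
    """Camera-from-world pose for a virtual camera at `eye` looking at `target`."""
    z = target - eye
    z = z / np.linalg.norm(z)                 # viewing direction
    x = np.cross(world_up, z)
    x = x / np.linalg.norm(x)                 # camera right
    y = np.cross(z, x)                        # completes the right-handed frame
    R = np.stack([x, y, z])                   # world-to-camera rotation
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = -R @ eye
    return T

# T = look_at(remote_pos_in_local_coords, local_participant_pos)
# The first projection is then a rendering of the combined 3D model with this pose.
```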
[0049] The first projection is encoded by the encoder 253 and streamed 254 to the remote participant or to the server for transmission to the remote participant. Transmitting the first projection which corresponds to the first viewpoint (viewpoint-on-demand), instead of a full combined 3D model, reduces bitrates.
[0050] Fig. 3 shows, by way of example, a block schema of a wearable capture device 300, e.g. the wearable capture device 213 of Fig. 2. The wearable capture device may be a wireless capture device used for capturing data for a 3D model of the meeting space and for positioning a participant at each meeting site so that the participants may be served with realistic viewpoints even when changing their positions.
[0051] The 3D capture device may be a headphone device embedded with an RGB-D sensor 350 or camera for capturing realtime RGB and depth data, and with sensors 370 for tracking a participant's viewpoint. The tracking algorithm may use data from motion sensors, such as an inertial measurement unit attached to the wearable device. The capture device may comprise cameras for capturing stereoscopic views. The capture device may be configured to form the 3D reconstruction of the meeting space and transmit the 3D reconstruction to the 3D processor. In other words, the capture device may comprise a 3D reconstructor for forming a 3D model of a room. Alternatively, the 3D capture is transmitted to the 3D processor for reconstruction of the model based on the 3D capture.
[0052] The capture device comprises normal headphone functions 360 such as speakers, microphone, processing 310 and memory 320 unit, and communication interface 330 for communicating with the user device 260 and/or with other modules of the 3D transmission terminal, for example. Wireless technologies such as Bluetooth, WLAN, etc. may be used for data transfer between the capture device 213 and the 3D processor 230. Wired communication may be used e.g. for offline download of the firmware and/or software. In at least some embodiments, the wearable capture device leaves face of the user unmasked such that eye contact is enabled between the participants of a 3D telepresence session. In other words, the wearable capture device is configured to enable eye contact between the participants of a 3D telepresence session.
[0053] The 3D capture data captured with the wearable capture device moving around the meeting space may be used for 3D reconstruction by using e.g. structure from motion (SfM) and/or simultaneous localization and mapping (SLAM) approaches. SLAM constructs and updates a map of an unknown environment while simultaneously keeping track of camera location. For example, a single moving capture sensor, e.g. an RGB-D sensor, may be used for 3D capture by the wearable capture device.
[0054] In 3D reconstruction, Truncated Signed Distance Function (TSDF) may be used for compacting and storing representations of 3D surfaces. TSDFs represent the distance to the nearest surface for every cell or voxel in a volume. Fast algorithms exist for real-time update of TSDF. In addition to 3D shape, color TSDF may also be used to reconstruct color or texture of the surface.
[0055] Alternatively, 3D modelling may be made using a stereoscopic camera and 2 views, by deriving the distances of matching features (disparity) in them. In addition to an option for 3D sensing, a stereoscopic camera comprised in the wearable capture device may be used for supporting delivery of stereoscopic content from the captured space. With stereoscopic (S3D) displays for viewing, this option supports 3D perception even without delivering and processing a depth stream for stereoscopic synthesis.
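For reference, a common (KinectFusion-style) form of the TSDF update mentioned in paragraph [0054] is sketched below: each new depth frame refines a running, weighted signed-distance average per voxel. The grid layout, truncation distance and intrinsics are illustrative assumptions, not parameters taken from the patent.

```python
import numpy as np

def integrate_depth(tsdf, weight, depth, K, T_cam_from_world,
                    voxel=0.02, origin=np.zeros(3), trunc=0.08):
    """One TSDF integration step over a dense voxel grid (weighted running average)."""
    dims = tsdf.shape
    ii, jj, kk = np.indices(dims)
    centers = origin + (np.stack([ii, jj, kk], axis=-1) + 0.5) * voxel
    p = centers.reshape(-1, 3)
    cam = (T_cam_from_world[:3, :3] @ p.T).T + T_cam_from_world[:3, 3]
    z = cam[:, 2]
    h, w = depth.shape
    with np.errstate(divide="ignore", invalid="ignore"):
        u = np.round(K[0, 0] * cam[:, 0] / z + K[0, 2])
        v = np.round(K[1, 1] * cam[:, 1] / z + K[1, 2])
    valid = (z > 1e-6) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.full(z.shape, np.nan)
    d[valid] = depth[v[valid].astype(int), u[valid].astype(int)]
    d[d <= 0] = np.nan                            # pixels without a depth measurement
    sdf = d - z                                   # signed distance along the viewing ray
    upd = np.isfinite(sdf) & (sdf > -trunc)       # skip voxels far behind the surface
    t, wgt = tsdf.reshape(-1).copy(), weight.reshape(-1).copy()
    new = np.clip(sdf[upd] / trunc, -1.0, 1.0)
    t[upd] = (t[upd] * wgt[upd] + new) / (wgt[upd] + 1.0)
    wgt[upd] += 1.0
    return t.reshape(dims), wgt.reshape(dims)
```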
[0056] A static 3D room model, or 3D reconstruction of the local meeting space, may be stored locally into the memory 320 of the wearable capture device 300. This supports independent 3D reconstruction outside telepresence sessions, without a connection to the user device 260 used for telepresence session. Before or during telepresence sessions, the 3D room model, or corresponding data for forming the model, is copied to the 3D processor 230 for speeding up computations and saving battery power in the wearable device 300.
[0057] 3D tracking and reconstruction derives and tracks the position of a reconstructing camera. As the camera is embedded in a wearable device carried by a meeting participant, essentially the same tracking solution can be used also for user positioning for combining 3D room and person captures, and to support viewpoints for remote users. The offline 3D room reconstruction and realtime user positioning may be separate processes performed at different times, i.e. they are not simultaneous process steps, which also makes it easier to use the same tracking module for both purposes.
[0058] The 3D room reconstructor 212 may have too little information for making a good quality 3D room model. For example, inadequate viewpoints while capturing the meeting space may result in holes or other errors in the 3D reconstruction. The holes and/or other errors may be detected by the reconstructor, or an indication of holes and/or other errors may be received from the remote participant. The local participant may then be requested, via a user interface, to direct the wearable capture device to a direction with detected holes and/or other errors to capture image and depth data from that direction. Then, the static 3D reconstruction may be updated based on image and depth data captured from the direction with detected holes and/or other errors. Instructions for a participant to make additional captures, e.g. by moving or viewing in certain directions, may be presented via a user interface of the transmission terminal. The user interface may be comprised in the user device used for telepresence session, and may comprise e.g. a display of the laptop. Alternatively, the wearable capture device may comprise a user interface. The user interface of the wearable device may comprise a display and virtual or physical buttons via which the user may manage the 3D capturing process, e.g. by instructing, initializing, performing, updating and resetting a 3D reconstruction process. For example, audible instructions may be presented to the user via speakers of the headphones.
[0059] Updating of the 3D reconstruction with new image data may be performed using 3D reconstruction algorithms such as the TSDF algorithm. The updating of the 3D reconstruction may be performed using new 2D image data taken from a known viewpoint or direction. A corresponding projection from the current 3D reconstruction may be determined, since the viewpoint is known. The known viewpoint is the fixed camera's viewpoint. The 2D image data may be compared with the projection. If differences in pixel values and/or depth values are detected based on the comparison, the voxel value of the 3D reconstruction may be updated to correspond to the pixel value of the reconstructed first frame. Position of the voxel may be updated accordingly based on the depth value of the corresponding pixel.
[0060] Alternatively or additionally, the holes and/or other errors in the 3D reconstruction may be corrected by means of image processing such as inpainting and/or filtering.
[0061] Referring back to Fig. 2, user tracking 215 is performed for several purposes. User position is needed for forming the unified virtual meeting geometry between the participants, serving a participant with viewpoints complying with movements of the participant in the defined meeting geometry, and rendering a 3D person model in a correct 3D position to the 3D room model. The block 215 represents the capture performed during the telepresence session. Image and depth data and pose data of the wearable device are transmitted to the real-time user positioning module 216.
[0062] Realtime position data of the local participant is transmitted 240 to the server to be used for formation of the virtual meeting geometry. Bitrates for delivering 3D position data to the server may be reduced e.g. by differential, run-length (RL), or variable length coding (VLC).
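For illustration, one possible combination of these codings is to quantize each position sample, transmit only the change from the previous sample (differential coding), and pack each signed difference with a zigzag/varint scheme (variable length coding). The byte layout below is an assumption of this sketch, not a format defined by the patent.

```python
def _zigzag(n: int) -> int:
    return (n << 1) ^ (n >> 63)                # signed delta -> non-negative integer

def _varint(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def encode_positions(samples, quantum=0.001):
    """Differential + variable-length coding of (x, y, z) samples given in metres."""
    prev = (0, 0, 0)
    out = bytearray()
    for x, y, z in samples:
        q = (round(x / quantum), round(y / quantum), round(z / quantum))
        for delta in (q[0] - prev[0], q[1] - prev[1], q[2] - prev[2]):
            out += _varint(_zigzag(delta))
        prev = q
    return bytes(out)

# A slowly moving participant produces tiny deltas, so most coordinates fit in one byte.
```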
[0063] A tracker, or sensors 370 for position tracking, embedded in the wearable capture device 300 derives the position of the participant. An RGB-D sensor, e.g. time of flight (ToF), gives distances in real-world scale. If a regular camera is used in the wearable capture device, it is possible to assist positioning by a graphical marker or alike, indicating the scale by its set dimensions. Instead of a graphical marker, even a computer screen and/or camera embedded to it may be detected by the regular camera, and known dimensions of the screen and/or embedded camera may be used for determining their position and scale.
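With the pinhole model, the known physical width of the detected screen directly gives the metric distance, which fixes the missing scale for a regular camera. A one-line sketch, with an invented numeric example:

```python
def distance_from_known_width(real_width_m, pixel_width, focal_px):
    """Distance to an object of known physical width seen at a given pixel width
    by a camera with focal length expressed in pixels (pinhole approximation)."""
    return focal_px * real_width_m / pixel_width

# e.g. a 0.345 m wide laptop display imaged 520 px wide by a camera with f = 900 px
# lies roughly 900 * 0.345 / 520 ≈ 0.60 m from the camera.
```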
[0064] Referring to the 3D person reconstructor 233 of Fig. 2, the 3D person model is formed based on 2D image data of the local participant 202 representing different angles of view of the local participant and the pose data tracked by the wearable device 213 worn by the local participant 202. During a video conference, a person often turns one’s head and sways into various directions when using one’s computer, looking at the desk or room, seeing other persons and while listening, thinking, and ideating. While the person is naturally browsing into various directions, images or video frames may be captured and used to form a 3D person model. The person’s face and upper body may be modelled using camera captures by the fixed camera 261, e.g. embedded into the user device used for telepresence session. Swaying and/or moving of the participant reveals background features, which are helpful in aligning the computer view (2D image data) with the 3D reconstruction of the local meeting space. Aligning is also supported if the person leaves from the fixed camera’s field of view for a while.
[0065] Referring to a 3D person animator 239 of Fig. 2, a 3D person model may be animated by a 3D animator before rendering the realtime 3D person model to the tracked position into the static 3D reconstruction of the local meeting space. The rendering may be performed by the real-time 3D combiner 237. The animating is performed based on features extracted from the 2D image data captured of the local participant by the fixed camera sensor during the telepresence session. Features may be action parameters, e.g. landmarks.
[0066] After animation, or as a part of the animation step, the 3D pose of a 3D person model in the combined 3D model may be corrected to compensate for a position difference between a fixed camera and display of the user device used for telepresence session. The position difference (parallax) disturbs correct perception for eye contact. In other words, eye contact may be incorrectly established between meeting sites due to the participant being captured from a point of view of the fixed camera which is different from the position of faces and eyes of another participant on the display. This deviation may be referred to as camera-display parallax. The compensation corrects the eye-gaze. Correction may be performed by reorienting the 3D model.
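One simple way to realize such a reorientation is to rotate the person model about the tracked head position so that the gaze direction toward the display maps onto the direction toward the fixed camera; the transmitted view then conveys eye contact. The construction below (rotation between two vectors) is a generic formulation offered as an example, not the specific correction defined by the patent.

```python
import numpy as np

def rotation_between(a, b):
    """Rotation matrix taking unit vector a onto unit vector b (Rodrigues form)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, -1.0):                    # opposite directions: 180-degree turn
        axis = np.eye(3)[np.argmin(np.abs(a))]
        v = np.cross(a, axis)
        v = v / np.linalg.norm(v)
        return 2.0 * np.outer(v, v) - np.eye(3)
    V = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + V + V @ V / (1.0 + c)

def gaze_correction(head_pos, camera_pos, display_center):
    """Reorientation compensating camera-display parallax: the gaze that was aimed
    at the display is rotated to appear aimed at the fixed camera."""
    return rotation_between(display_center - head_pos, camera_pos - head_pos)
```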
[0067] The telepresence system may use peer-to-peer (P2P), i.e. full mesh, connections between meeting spaces, or the telepresence system may be a server-based system.
[0068] Fig. 4 shows, by way of example, a block diagram of an apparatus 400 capable of performing the method as disclosed herein. The apparatus may be or comprise a transmission terminal at the local site, e.g. the transmission terminal 200 of Fig. 2, or a receiver terminal at a remote site. Comprised in apparatus 400 is processor 410, which may comprise, for example, a single- or multi-core processor wherein a single-core processor comprises one processing core and a multi-core processor comprises more than one processing core. Processor 410 may comprise, in general, a control device. Processor 410 may comprise more than one processor. Processor 410 may be a control device. Processor 410 may be means for performing method steps in apparatus 400. Processor 410 may be configured, at least in part by computer instructions, to perform actions.
[0069] Apparatus 400 may comprise memory 420. Memory 420 may comprise random-access memory and/or permanent memory. Memory 420 may comprise at least one RAM chip. Memory 420 may comprise solid-state, magnetic, optical and/or holographic memory, for example. Memory 420 may be at least in part accessible to — processor 410. Memory 420 may be at least in part comprised in processor 410. Memory 420 may be means for storing information. Memory 420 may comprise computer instructions that processor 410 is configured to execute. When computer instructions configured to cause processor 410 to perform certain actions are stored in memory 420, and apparatus 400 overall is configured to run under the direction of processor 410 using — computer instructions from memory 420, processor 410 and/or its at least one processing core may be considered to be configured to perform said certain actions. Memory 420 may be at least in part external to apparatus 400 but accessible to apparatus 400.
[0070] Apparatus 400 may comprise a transmitter 430. Apparatus 400 may comprise a receiver 440. Transmitter 430 and receiver 440 may be configured to transmit and receive, respectively, information in accordance with at least one wireless or cellular or non-cellular standard. Transmitter 430 may comprise more than one transmitter. Receiver 440 may comprise more than one receiver. Transmitter 430 and/or receiver 440 may be configured to operate in accordance with global system for mobile communication, GSM, wideband code division multiple access, WCDMA, 5G, long term evolution, LTE, IS-95, wireless local area network, WLAN, Ethernet and/or worldwide interoperability for microwave access, WiMAX, standards, for example.
[0071] Apparatus 400 may comprise user interface, UI, 450. UI 450 may comprise at least one of a display, a keyboard, a touchscreen, a mouse. A user may be able to operate apparatus 400 via UI 450.
[0072] Apparatus 400 may comprise an embedded camera, or an external camera connected to it.
[0073] The transmission terminal disclosed herein enables support for motion parallax and spatial faithfulness in 3D remote interaction and telepresence. A fixed 3D capture setup is not required by the disclosed transmission terminal, which eases up taking a system into use. Costs for the new system remain low, as 3D reconstruction of the local meeting space may be made using one wearable device, and the rest of the devices may be existing devices, but configured to function as disclosed herein. By using viewpoint-on- demand approach for encoding, the realtime transmission of complete 3D reconstructions is avoided. Required bitrates and processing power are reduced by the transmission terminal as disclosed herein. Combining the 3D person model produced by the transmission terminal to the reconstruction of the space enables correcting the viewpoint (parallax) difference between the fixed camera and the display, commonly hindering eye contact. The wearable capture device of the transmission terminal enables eye contact between participants of the 3D telepresence session.
[0074] Referring back to Fig. 2, the 3D reconstruction of the local meeting space may be replaced with a 3rd party room model, which may be received 290 from a database such as a room model library. For example, if it is detected by the 3D processor that the formed 3D reconstruction of the local meeting space has holes or other errors or the formation has otherwise failed, a 3rd party room model may be used instead. Other reasons for using a 3rd party room model may be a wish to use a reconstruction of a more luxurious meeting space. Usage of a 3rd party room model protects privacy of the participant.
I [0075] In at least some embodiments, the 3D person model may be replaced with a * 3" party person model, which may be received 295 from a database such as a person
N model library. For example, if it is detected by the 3D processor that the formed 3D person 5 model has holes or other errors or the formation of the model has otherwise been failed, a
N 30 — 3" party person model may be used instead. Other reasons for using 3™ party person model may be role play, for example. Usage of a 3™ party person model protects privacy of the participant.
[0076] The transmission terminal 200 of Fig. 2 for 3D telepresence session functions as a reception terminal, as well, when participating to a 3D telepresence session. The local participant position in local coordinates is transmitted 240 to the meeting server, the meeting server further transmits this information to a remote terminal, wherein a — perspective generator forms projections to a combined 3D model from the viewpoint of the local participant. Thus, the terminal 200 receives projections to the combined 3D model from other sites from one’s own viewpoint. The combined 3D model is formed in the remote terminal(s) in a similar manner as disclosed herein. The received projection is shown on the display used for telepresence session. The display may be a screen of a — laptop or an external display. In at least some embodiments, the display is not a head- mounted display. In at least some embodiments, the display is arranged to leave face of the local participant unmasked such that eye contact is enabled with participants of the telepresence session.
O
N

Claims (15)

CLAIMS:
1. A transmission terminal for 3D telepresence session comprising at least one processor; and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the terminal at least to perform:
- forming a static 3D reconstruction of a local meeting space wherein the static 3D reconstruction of the local meeting space is formed based on:
- image and depth data captured of the local meeting space before a telepresence session by a wearable capture device; and
- pose data of the wearable capture device tracked by the wearable capture device;
- forming a realtime 3D person model of a local participant based on 2D image data or image and depth data captured by a fixed camera sensor during the telepresence session;
- aligning the static 3D reconstruction of the local meeting space and the 2D image data captured by the fixed camera sensor during the telepresence session to obtain an aligned static 3D reconstruction;
- tracking realtime position data of the local participant by the wearable capture device with respect to the fixed camera sensor during the telepresence session;
- forming a combined 3D model based on the aligned static 3D reconstruction of the local meeting space, the realtime 3D person model, and the realtime position information of the local participant;
- receiving, from a server, a first viewpoint of a remote participant with respect to the local participant, wherein positions of the remote participant and the local participant are mapped into a unified virtual meeting geometry;
- forming a first projection to the combined 3D model from the first viewpoint of the remote participant; and
- coding and streaming the first projection to the remote participant or to the server for transmission to the remote participant.
2. The transmission terminal of claim 1, wherein the fixed camera sensor is in a known position with respect to a display used in telepresence session and/or is embedded to a user device used for telepresence session.
3. The transmission terminal of claim 1 or 2, further caused to perform: detecting holes and/or other errors in the static 3D reconstruction of the local meeting space or receiving an indication on holes and/or other errors in the static 3D reconstruction of the local meeting space from the remote participant; requesting, via a user interface, the local participant to direct the wearable capture device to a direction with detected holes and/or other errors to capture image and depth data from that direction; updating the static 3D reconstruction based on image and depth data captured from the direction with detected holes and/or other errors.
4. The transmission terminal of any preceding claim, further caused to perform: animating the realtime 3D person model of the local participant before rendering the realtime 3D person model to the tracked position into the static 3D reconstruction of the local meeting space, wherein the animating is performed based on features extracted from the 2D image data captured of the local participant by the fixed camera sensor during the telepresence session.
5. The transmission terminal of any preceding claim, further caused to perform: storing the static 3D reconstruction of the local meeting space into a local memory of the wearable capture device; and copying the static 3D reconstruction of the local meeting space to the at least one memory of the transmission terminal.
6. The transmission terminal of any preceding claim, further caused to perform: transmitting the realtime position data of the local participant to the server to be used for forming a virtual geometry between participants and for informing viewpoint of the local participant to the remote participant.
7. The transmission terminal of any preceding claim, wherein the wearable capture device is a headphone configured to:
- capture image and depth data; and
- track position.
8. The transmission terminal of any preceding claim, wherein the wearable capture device is arranged to, when worn by the local participant, leave face of the local participant unmasked such that eye contact is enabled with the remote participant.
9. The transmission terminal of any preceding claim, further caused to perform: detecting that a position of the fixed camera sensor is changed during the telepresence session; and realigning the static 3D reconstruction and the realtime image data captured by the fixed camera sensor.
10. The transmission terminal of any preceding claim, further caused to perform: segmenting the 2D image data of the local participant from background.
11. The transmission terminal of any preceding claim, wherein the realtime 3D person model is formed based on 2D image data of the local participant representing different angles of view of the local participant and optionally the pose data tracked by the wearable device worn by the local participant.
12. The transmission terminal of any preceding claim, further caused to perform: correcting pose of the realtime 3D person model in the combined 3D model to compensate for a position difference between the fixed camera sensor and a display.
13. The transmission terminal of any preceding claim, further caused to perform: receiving, from the server, a second viewpoint of the remote participant with respect to the local participant, wherein the second viewpoint is different than the first viewpoint; forming a second projection to the combined 3D model from the second viewpoint of the remote participant; and coding and streaming the second projection to the remote participant or to the server for transmission to the remote participant.
14. A method comprising:
forming, by a transmission terminal, a static 3D reconstruction of a local meeting space wherein the static 3D reconstruction of the local meeting space is formed based on:
- image and depth data captured of the local meeting space before a telepresence session by a wearable capture device; and
- pose data of the wearable capture device tracked by the wearable capture device;
- forming a realtime 3D person model of a local participant based on 2D image data or image and depth data captured by a fixed camera sensor during the telepresence session;
- aligning the static 3D reconstruction of the local meeting space and the 2D image data captured by the fixed camera sensor during the telepresence session to obtain an aligned static 3D reconstruction;
- tracking realtime position data of the local participant by the wearable capture device with respect to the fixed camera sensor during the telepresence session;
- forming a combined 3D model based on the aligned static 3D reconstruction of the local meeting space, the realtime 3D person model, and the realtime position information of the local participant;
- receiving, from a server, a first viewpoint of a remote participant with respect to the local participant, wherein positions of the remote participant and the local participant are mapped into a unified virtual meeting geometry;
- forming a first projection to the combined 3D model from the first viewpoint of the remote participant; and
- coding and streaming the first projection to the remote participant or to the server for transmission to the remote participant.
15. A computer program configured to cause a method of claim 14 to be performed.
FI20215726A 2021-06-21 2021-06-21 A transmission terminal for 3D telepresence FI20215726A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
FI20215726A FI20215726A1 (en) 2021-06-21 2021-06-21 A transmission terminal for 3D telepresence
PCT/FI2022/050442 WO2022269132A1 (en) 2021-06-21 2022-06-21 A transmission terminal for 3d telepresence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
FI20215726A FI20215726A1 (en) 2021-06-21 2021-06-21 A transmission terminal for 3D telepresence

Publications (1)

Publication Number Publication Date
FI20215726A1 true FI20215726A1 (en) 2022-12-22

Family

ID=82482810

Family Applications (1)

Application Number Title Priority Date Filing Date
FI20215726A FI20215726A1 (en) 2021-06-21 2021-06-21 A transmission terminal for 3D telepresence

Country Status (2)

Country Link
FI (1) FI20215726A1 (en)
WO (1) WO2022269132A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7106358B2 (en) * 2002-12-30 2006-09-12 Motorola, Inc. Method, system and apparatus for telepresence communications
US8976224B2 (en) * 2012-10-10 2015-03-10 Microsoft Technology Licensing, Llc Controlled three-dimensional communication endpoint
WO2017132636A1 (en) * 2016-01-29 2017-08-03 Pointivo, Inc. Systems and methods for extracting information about objects from scene information
US20200349751A1 (en) * 2019-04-30 2020-11-05 Number 9, LLC Presentation interface and immersion platform

Also Published As

Publication number Publication date
WO2022269132A1 (en) 2022-12-29

Similar Documents

Publication Publication Date Title
US11363240B2 (en) System and method for augmented reality multi-view telepresence
US11711504B2 (en) Enabling motion parallax with multilayer 360-degree video
WO2018005235A1 (en) System and method for spatial interaction using automatically positioned cameras
US9684953B2 (en) Method and system for image processing in video conferencing
US9030486B2 (en) System and method for low bandwidth image transmission
CN101610421B (en) Video communication method, video communication device and video communication system
CN113099204B (en) Remote live-action augmented reality method based on VR head-mounted display equipment
JP6932796B2 (en) Layered Extended Entertainment Experience
US20100103244A1 (en) device for and method of processing image data representative of an object
You et al. Internet of Things (IoT) for seamless virtual reality space: Challenges and perspectives
US20140092218A1 (en) Apparatus and method for stereoscopic video with motion sensors
CN112335264B (en) Apparatus and method for presenting audio signals for playback to a user
CN114219878A (en) Animation generation method and device for virtual character, storage medium and terminal
US11887249B2 (en) Systems and methods for displaying stereoscopic rendered image data captured from multiple perspectives
WO2022269132A1 (en) A transmission terminal for 3d telepresence
CN116012509A (en) Virtual image driving method, system, equipment and storage medium
US20230115563A1 (en) Method for a telepresence system
US12015758B1 (en) Holographic video sessions