WO2017177019A1 - System and method for supporting synchronous and asynchronous augmented reality functionalities
- Publication number: WO2017177019A1 (PCT/US2017/026378)
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
- H04N7/157—Conference systems defining a virtual conference space and using avatars or agents
- G06—COMPUTING; CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/006—Mixed reality
SYSTEM AND METHOD FOR SUPPORTING SYNCHRONOUS AND ASYNCHRONOUS AUGMENTED REALITY FUNCTIONALITIES
CROSS-REFERENCE TO RELATED APPLICATIONS
 This application is a non-provisional filing of, and claims benefit under 35 U.S.C. § 119(e) from, U.S. Provisional Patent Application Serial No. 62/320,098, entitled "Apparatus and Method for Supporting Synchronous and Asynchronous Augmented Reality Functionalities," filed April 8, 2016, the entirety of which is incorporated herein by reference.
 Augmented Reality (AR) is a concept and a set of technologies for merging of real and virtual elements to produce new visualizations - typically a video - where physical and digital objects co-exist and interact in real time. Three dimensional (3D) models and animations are some examples of virtual elements to be visualized in AR. However, AR objects can be any digital information for which spatiality (3D position and orientation in space) gives added value, for example pictures, videos, graphics, text, and audio.
 AR visualizations make use of a means to display augmented virtual elements as a part of the physical view. AR visualizations may be implemented using for example a tablet with an embedded camera, which captures video from the user's environment and shows it together with virtual elements on its display. AR glasses, either video-see-through or optical-see-through, either monocular or stereoscopic, can also be used for viewing. A user may be viewing the AR remotely over a network or view the same augmented view as a local user.
 In AR, graphical tags, fiducials or markers have been commonly used for defining position, orientation and scale for AR objects. Graphical markers have certain advantages over the use of natural features. For example, graphical markers help to make the offline process for mixed reality content production and use more independent of the actual target environment. This allows content to be positioned more reliably in the target embodiment based on the position of graphical markers, whereas changes in the environment (e.g. changes in lighting or in the position of miscellaneous objects) can otherwise make it more difficult for an augmented reality system to consistently identify position and orientation information based only on the environment.
SUMMARY
 Users have become accustomed to conducting telepresence and remote collaboration sessions using two-dimensional video applications. However, two-dimensional video does not lend itself well to collaborative augmented reality (AR) sessions because the limited ability of 2D video to convey perception of a third dimension can make it difficult to properly position AR objects in a remote environment. At the same time, communicating full 3D data in real time would impose a high bitrate requirement. In addition, providing full 3D data may endanger a user's privacy, as it may not be easy for a user to see exactly what 3D data is being captured by 3D sensors, particularly when those sensors operate using infrared (IR) beams or other invisible forms of data collection.
 In exemplary embodiments, the challenge of providing a sufficient 3D perception to a remote user is addressed as follows. 3D sensors in the environment of a local user collect data and generate a 3D model of the local environment. From the 3D model, a perspective view is generated. This perspective view represents a view from a viewpoint that may be selected by the remote user, as if the perspective view were the view of a virtual camera positioned within the local environment. The perspective view may be, for example, a side view. The perspective view may be provided as a conventional 2D video stream to the remote user. By providing the perspective view as a 2D video stream, a substantially lower bitrate is required as compared to a stream of 3D data. In some embodiments, the local user is also equipped with a conventional video camera, e.g. a video camera of user equipment such as a laptop, tablet computer or smartphone. 2D video captured by the conventional video camera may also be provided to the remote user. The combined use of the generated perspective view and the conventional video camera view allows a remote user to more accurately position AR objects within the environment of the local user. In some embodiments, the remote user may change the viewpoint from which the perspective view is generated. The availability of the perspective view in addition to a conventional video view allows for more accurate placement of AR objects within the local user's environment.
 In exemplary embodiments, the challenge of protecting the user's privacy is addressed as follows. The local user is equipped with a conventional video camera, e.g. a video camera of user equipment such as a laptop, tablet computer or smartphone. 2D video captured by the local video camera is displayed to the local user. 3D data (e.g. a point cloud) is also collected by sensors in the local user's environment. However, the 3D data that falls outside the field of view of the local video camera is excluded from data sent to the remote user. (The field of view of the local user's video camera may be described as a viewing pyramid, as the volume within the field of view may have a generally pyramidal shape, with an apex at the video camera.) The local user can thus be confident that the data being sent to the remote user does not include regions that have been hidden from the local video camera. In some embodiments, the 3D data is not sent directly to the remote user but rather is used to generate a 2D perspective video that is sent to the remote user.
 In some embodiments, asynchronous communication is supported. 3D data collected as described above may be stored for later use by a user (including a local user or a remote user) to insert AR objects in the stored model of the local user's environment. For privacy protection, the stored 3D data may exclude data that falls outside a field of view of the local user's video camera. A remote user can thus add AR objects to a local user's environment even when the local user is not currently online. When the local user resumes use of the AR system, the AR objects that were added in the local user's absence may be visible to the local user in his or her local environment (e.g. using an optical or video see-through AR headset).
 The exemplary embodiments described above may be performed with the use of a self-calibrating system of cameras and/or other sensors that establish a local coordinate system used to describe positions in a local environment. The cameras and/or other sensors may iteratively or otherwise operate to determine their collective positions and orientations within the local coordinate system. In some embodiments, at least a portion of the configuration is performed in advance using, e.g. floor plans. Such an embodiment may be employed by, for example, a hotel wishing to enable its guests (and their friends and families) to decorate the hotel room with personalized augmented reality decorations or other virtual objects.
 This disclosure provides systems and methods for remote Augmented Reality (AR). The systems and methods disclosed herein provide for remotely augmenting environments that do not have graphical markers attached to their surfaces, where the augmentation is performed independent of a local user's assistance. Additionally, the interaction may be both synchronous and asynchronous, live video from the local site is used, and the local user's privacy is supported.
 In accordance with an embodiment, the AR framework enables remote AR functionalities as add-on features to more conventional videoconferencing systems. Locally-captured 3D data can be combined with real-time video to support remote AR interaction. The 3D data is captured via a fixed local infrastructure that is configured to capture and deliver a 3D model of the environment. Portions of the local 3D data are then transmitted in addition to the live video. In an exemplary embodiment, the portion of the 3D data that is sent for enabling remote AR is limited to the intersection of the 3D reconstructed local space and the view captured in the real-time video.
 In accordance with an embodiment, spatiality is supported by providing users individual video-based viewpoints and perspectives, utilizing a spatial augmented reality system. Remote 3D AR is enabled by a spatial augmented reality system that includes a 3D capture setup that is auto-calibrated with the user video terminal. The AR may be synchronous or non-synchronous (or off-line). The perspective videos reduce the bandwidth required to transmit the AR and video data. The spatial AR system is downward compatible with non-AR video conferencing systems. Tracking of real and virtual spatial position and orientation is provided for AR objects as well as other users. Tracking of real and virtual spatial position and orientation may also be supported for audio, as well as video and 3D data. The location of the source of audio may be determined, and the audio transmitted only if the source is within an intersection of a viewing pyramid. The transmitted sound may include transmitted data regarding the directionality of the sounds for directional, stereo, or surround transmission at a remote end.
 In the following, for simplicity, the invention is described in a two-point setting. The disclosed system can, however, also be applied straightforwardly in multi-point settings.
BRIEF DESCRIPTION OF THE DRAWINGS
 FIG. 1 depicts an example camera marker based 3D capturing system, in accordance with an embodiment.
 FIG. 2 depicts an example method, in accordance with an embodiment.
 FIG. 3 depicts the infrastructure of a P2P AR system, in accordance with an embodiment.
 FIGs. 4A-B depict a sequence diagram, in accordance with an embodiment.
 FIG. 5 depicts the infrastructure of an AR system with an application server, in accordance with an embodiment.
 FIGs. 6A-B depict a sequence diagram, in accordance with an embodiment.
 FIG. 7A depicts an overhead view of a physical location, in accordance with an embodiment.
 FIG. 7B depicts a perspective view from a user terminal, in accordance with an embodiment.
 FIG. 7C depicts a perspective view from a virtual camera position, in accordance with an embodiment.
 FIGs. 7D-F depict an example intersection of a viewing pyramid and 3D information, in accordance with an embodiment.
 FIG. 8 is a functional block diagram of components of a camera marker device.
 FIG. 9A illustrates an exemplary wireless transmit/receive unit (WTRU) that may be employed as a user terminal, a camera marker, or a server in some embodiments.
 FIG. 9B illustrates an exemplary network entity that may be employed as a user terminal, a camera marker, a server, or back-end service in some embodiments.
 This disclosure provides a framework for remote AR. The framework provides for remotely augmenting environments independent of a local user's assistance. In some embodiments, the environments being augmented do not have graphical markers attached to their surfaces. Additionally, the interaction may be both synchronous and asynchronous, live video from the local site is used, and the local user's privacy is supported.
 In accordance with at least one embodiment, the AR framework enables remote AR functionalities as add-on features to more conventional videoconferencing systems. Locally-captured 3D data is combined with real-time video to support remote AR interaction. The 3D data is captured via a fixed local infrastructure that is configured to capture and deliver a 3D model of the environment. Portions of the local 3D data are then transmitted in addition to the live video. The portion of the 3D data that is sent for enabling remote AR is limited to the intersection of the 3D reconstructed local space and the outgoing video view.
 AR visualizations can be seen correctly from different virtual viewpoints, such that when the user changes his/her viewpoint, virtual elements stay or act as if they were part of the physical scene. AR tracking technologies are used to derive the 3D properties of the environment for AR content production, and when viewing the content, for tracking the viewer's (camera) position with respect to the environment.
 In some embodiments, printed graphical markers are used in the environment, to be detected from a video as a reference both for augmenting virtual information in the right orientation and scale and for tracking the viewer's (camera) position. In other embodiments, markerless AR can be used to avoid the potential disruption of physical markers. Markerless AR relies on detecting distinctive features of the environment and using those features for augmenting virtual information and tracking the user's position.
 Some AR applications are meant for local viewing of the AR content, where the user is also in the space which has been augmented. However, as the result is typically shown as a video on a display, it can also be seen remotely over a network, if desired.
 Producing AR content remotely - e.g. augmenting virtual objects and animations over a network - is a useful feature in many applications, for example: remote guidance, maintenance, and consultancy. One area addressed herein is delivery of virtual objects in telepresence and social media applications. Telepresence applications make use of synchronous interaction between two or more users, both content producer(s) and consumer(s).
 In embodiments with synchronous interaction, remote and local users share a common video conference and see the virtual objects that are added to the video stream in real time. Synchronous interaction may have two or more users interact in real time, or close to real time ("on-line"), for example using audio and video. For many applications, including those supporting real-time AR interaction, synchronous interaction is quite demanding because of its bandwidth, processing-time, and latency requirements.
 In embodiments with asynchronous communication the participants have 3D models of the environments available at a later time, and can add virtual objects there, and other participants can see them when accessing the model. Asynchronous interactions deliver and share information, for example messages, audio, and images, without hard real-time constraints. In many cases, asynchronous interaction is preferred as it does not require simultaneous presence from the interacting parties.
 In many applications, supporting synchronous and asynchronous functionalities in parallel is beneficial. Synchronous and asynchronous functionalities can also be mixed in a more integral way in order to create new ways of interaction.
 If graphical markers are attached to the local environment, remote augmentation can be performed by detecting the markers' position, orientation and scale (pose) from the received local video and aligning virtual objects with respect to the markers. This method may be partly automated and is suitable for unplanned synchronous interactions.
 In embodiments that are unassisted, the interaction either does not need or does not allow assistance by a local user. In embodiments that are assisted, the interaction involves assistance by a local user, but can afterwards be used for both asynchronous (off-line) and synchronous (real-time) interactions.
 Markerless 3D-feature-based methods can be used in cases when visible markers would be too disruptive or do not work at all, as in large-scale augmentations outdoors. They can generally be made more accurate, robust and wide-base than marker-based methods. Feature-based methods, like those based on point clouds of features, may require more advance preparation than marker-based methods, may require more complex data capture, may involve complex processing, and may utilize more complex tools for AR content production compared to a marker-based approach. In addition, they may not provide a scale reference for the augmentations, as markers do.
 Although feature-based methods may require advance preparations, they can also be used for augmenting spaces remotely, where users can perform the required preparations, and where the local environment stays stable enough so that the results of those preparations can be used repeatedly, in several synchronous sessions. In these solutions, 3D scanning of the local space can be made by using a moving camera or a depth sensor - with the latter also to some extent in a fixed setup.
 Marker-based methods can be applied even if there are no predefined markers in the local environment. In this approach, the application offers a user interface for selecting a known feature set (e.g. a poster on the wall or a logo on a machine) from the local environment. This set of features used for tracking is in practice an image that can be used in lieu of a formal marker to define 3D location and 3D orientation.
 With restrictions, even unknown planar features (those recognized and defined objectively by the remote user) can be used for augmentation. In these embodiments, however, the depth and scale may not be able to be derived accurately from the remote video, and the augmentation is restricted to replacing planar feature sets with other subjectively scaled planar objects (e.g. a poster with another poster).
 Generic and precise 3D tracking of features may be used in an embodiment of synchronous remote AR. For example, in a local environment that has no features that are known in advance, simultaneous localization and mapping (SLAM) may be used. These methods simultaneously estimate the 3D pose of the camera and 3D features of the scene from a live video stream. SLAM results in a set of 3D points, which can be used by a remote user to align virtual objects to a desired 3D position.
 Local 3D features can also be captured with a set of fixed video cameras, each filming the environment from different angles. These streams can be used to calculate a set of 3D points that can be used by the remote user.
 Optionally, the 3D point set described above can be created by using a depth camera. For making the point cloud, related camera and/or depth sensor based solutions described for 3D telepresence are also applicable.
 In accordance with an embodiment, local assistance is not needed when using fixed instrumentation for 3D data capture. Current solutions for feature-based AR may require local assistance, and new solutions without local assistance would be beneficial.
 Capturing local space in real-time without preparation or assistance may be performed by a fixed setup of 3D cameras and/or sensors, and this information may be provided to a remote user to make accurate 3D augmentations. Note that this choice may preclude the use of most common methods for 3D feature capture, namely those based on a single moving camera or depth sensor. Examples include SLAM and Kinect Fusion algorithms. Examples of techniques that can be used to capture a local environment using point cloud data include, for example, the algorithms available through the Point Cloud Library maintained by Open Perception.
 In some embodiments, content generation is supported in local AR applications. The AR content may be generated off-line in a remote place after receiving either a set of images from a local environment, or a locally generated point cloud.
 In accordance with some embodiments, local assistance in 3D feature capture is not used, and thus methods based on moving a single camera or depth sensor in space may not be used to meet the real-time constraints. One solution for real-time unassisted 3D capture for use in real-time 3D telepresence may be accomplished with multi-sensor capture that is typically used for deriving a 3D representation of the captured scene. In accordance with at least one embodiment, the multi-camera setup is calibrated using markers. The calibration method includes: (i) printing a pattern and attaching it to a planar surface, (ii) capturing multiple images of the model plane under different orientations by moving either the plane or the camera, (iii) detecting the feature points in the images, (iv) estimating five intrinsic parameters and all extrinsic parameters using a closed-form solution, (v) estimating the coefficients of the radial distortion by solving a linear least-squares problem, and (vi) refining the parameters via a minimizing equation.
 A distributed multi-camera or multi-sensor system may be calibrated to ensure a common understanding of the 3D features they are capturing. In determining an intersection of a viewing pyramid as captured by a camera on a terminal device and 3D data of a space, the terminal device is calibrated with the multi-camera system. The calibration may be based on electronic markers due to the simplicity of marker-based calibration.
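 Step (v) of the calibration method can be illustrated with a small sketch. The text does not name a distortion model, so the example assumes a common two-coefficient polynomial model in normalized image coordinates, x_d = x(1 + k1 r^2 + k2 r^4) with r^2 = x^2 + y^2, under which the unknowns k1 and k2 appear linearly. The function name and the sample points are hypothetical.

```python
from typing import List, Tuple

Point2 = Tuple[float, float]


def estimate_radial_distortion(ideal: List[Point2],
                               observed: List[Point2]) -> Tuple[float, float]:
    """Solve for (k1, k2) by linear least squares (2x2 normal equations).

    Assumed model (not specified in the source text):
        x_d = x * (1 + k1*r^2 + k2*r^4), and likewise for y.
    Each point contributes two linear equations in k1 and k2.
    """
    s11 = s12 = s22 = t1 = t2 = 0.0
    for (x, y), (xd, yd) in zip(ideal, observed):
        r2 = x * x + y * y
        r4 = r2 * r2
        for c, cd in ((x, xd), (y, yd)):
            a1, a2 = c * r2, c * r4   # row of the design matrix
            b = cd - c                # observed displacement
            s11 += a1 * a1
            s12 += a1 * a2
            s22 += a2 * a2
            t1 += a1 * b
            t2 += a2 * b
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-18:
        raise ValueError("points do not constrain both coefficients")
    k1 = (s22 * t1 - s12 * t2) / det
    k2 = (s11 * t2 - s12 * t1) / det
    return k1, k2


def distort(p: Point2, k1: float, k2: float) -> Point2:
    """Apply the assumed radial model to an ideal point."""
    x, y = p
    r2 = x * x + y * y
    f = 1.0 + k1 * r2 + k2 * r2 * r2
    return (x * f, y * f)


# Synthetic check: distort known points, then recover the coefficients.
ideal_pts = [(0.1, 0.2), (-0.3, 0.4), (0.5, -0.5), (-0.6, -0.1), (0.7, 0.3)]
observed_pts = [distort(p, -0.2, 0.05) for p in ideal_pts]
k1_est, k2_est = estimate_radial_distortion(ideal_pts, observed_pts)
```

 In a full calibration these estimates would only seed the refinement of step (vi); the sketch stops at the linear solve.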
 The coding and transmission of real-time captured 3D data requires more bandwidth than real-time video. For example, the raw data bitrate of a Kinect 1 sensor is almost 300 MB/s (9.83 MB per frame), making efficient compression methods desirable. Compression methods for Kinect-type depth data (either RGB-D or ToF) are, however, still in their infancy.
 In one embodiment, the medium between participants is remote AR interaction using real-time video, either as such or augmented.
 A distributed multi-camera or multi-sensor system is first calibrated to provide a common understanding (e.g. a common coordinate system) of the 3D features the sensors are capturing. This is a demanding process and is prone to different kinds of errors, depending on the type, number, and positions of the sensors.
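 The raw bitrate figure quoted above for the Kinect 1 sensor can be checked with simple arithmetic. The sketch below multiplies the per-frame size given in the text by a frame rate of 30 frames per second; the frame rate is an assumption, as the text does not state it, and the units are kept exactly as the text gives them.

```python
# Per-frame size as quoted in the text (units kept as written there).
SIZE_PER_FRAME = 9.83
# Assumption: 30 frames per second; the text does not state the frame rate.
FRAMES_PER_SECOND = 30


def raw_rate(size_per_frame: float, fps: float) -> float:
    """Uncompressed data rate: frame size multiplied by frame rate."""
    return size_per_frame * fps


rate = raw_rate(SIZE_PER_FRAME, FRAMES_PER_SECOND)
print(f"raw rate: {rate:.1f} per second")
```

 Under that assumption the product is just under 300 per second, which matches the "almost 300" figure in the text.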
 The disclosed principle of forming the intersection of 3D capture and the video view does not make specific assumptions for the sensor system or its calibration scheme. A special feature of the disclosed system is that the camera of the user's interaction device (laptop, tablet, or the like) is calibrated with the sensor system.
 Some feature-based AR solutions are not well suited to supporting remote AR in unassisted synchronous settings. In many cases for remote AR, a local user can assist by scanning the environment with a moving sensor. Such advance preparations are, however, not always possible or desirable.
 In some embodiments that permit remote augmentation of a local space, graphical markers are not attached and no advance preparations are required. This is possible even in unassisted synchronous interaction based on a real-time video connection, if enough image data and/or 3D information about the space is captured in real time and provided to the remote site.
 In embodiments with remote AR interaction over a network, both synchronous (real-time) and asynchronous (off-line) situations are supported. In real-time AR interactions, different users simultaneously interact with each other and with AR content. In off-line AR interactions, at least one user is not simultaneously available as another, but the different users are able to share and interact with AR content over a network. An off-line AR interaction may occur before, during, or after a real-time AR interaction.
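 The off-line case can be pictured as a shared store of AR objects that one user writes to while another is absent. The sketch below is illustrative only: the class names, fields, and timestamps are hypothetical and are not taken from the disclosure. It shows a local user retrieving the objects that were added since that user last accessed the scene.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ARObject:
    """Hypothetical record for one AR object placed in the shared model."""
    name: str
    position: Tuple[float, float, float]  # in the local coordinate system
    added_at: float                       # timestamp in seconds


@dataclass
class ARSceneStore:
    """Hypothetical store shared by synchronous and asynchronous sessions."""
    objects: List[ARObject] = field(default_factory=list)

    def add(self, obj: ARObject) -> None:
        self.objects.append(obj)

    def added_since(self, last_seen: float) -> List[ARObject]:
        """Objects added after the given time, e.g. while a user was offline."""
        return [o for o in self.objects if o.added_at > last_seen]


store = ARSceneStore()
store.add(ARObject("vase", (1.0, 0.5, 0.8), added_at=10.0))
# The local user goes offline at t = 15; a remote user adds an object later.
store.add(ARObject("poster", (0.0, 2.0, 1.5), added_at=20.0))
new_for_local_user = store.added_since(15.0)
```

 Because positions are expressed in the same local coordinate system in both cases, the same records can serve a live session and a later off-line one.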
 In some virtual worlds, remote AR interaction is supported by local users. The local users may attach graphical markers to assist bringing in virtual objects from virtual worlds as part of the physical environment. In such an embodiment, remote augmentation may not be supported.
 Many users are accustomed to using video-based tools for communication and interaction. Users understand how to control what is visible to remote parties and often check the background of their own local scene before joining a video chat session or transmitting the video to a remote party. In some embodiments, video-based tools for communication and interaction are preserved in AR interactions.
 Supporting user privacy is advantageous for social networking services, which reach into people's homes, workplaces or other private premises. Some privacy controls permit the local user control over what data a remote user receives, whether visual data seen by or displayed to the remote user or 3D data transmitted to the remote user. Privacy is desired when fixed instrumentation of cameras is used to capture 3D data in private places, such as users' homes.
 Trust in privacy is an important factor in user acceptance of a service or system. However, using 3D capture for interaction also benefits from user acceptance in a broader sense. The system set-up should be easy and unobtrusive enough, and the service should fit in with existing trusted ways of communication and interaction.
 In order to enable remote 3D augmentation, enough 3D information is captured and sent from the local environment. The amount of information transmitted is a tradeoff between bitrate, accuracy, and ease-of-use in AR content production. Bitrate is naturally also affected by the coding and transmission scheme used for the outgoing 3D data.
 The 3D information may be captured by a fixed local infrastructure and may contain information regarding local 3D properties for delivery to remote users. The 3D information obtained from the fixed infrastructure is combined with information from a user's video terminal, such as a laptop, tablet, smart glasses, or the like, for remote communication.
 A remote AR system benefits from support for producing AR content. Both in marker based and markerless (feature based) methods, viewing the marker or captured scene from different viewpoints is helpful when deciding on the 3D position for the augmentation. Especially when using 3D features - e.g. in the form of a 3D point-cloud - clarity, speed, and ease-of-use are not easy to achieve in AR content production.
 In at least one embodiment of a remote AR system, (i) support is provided for remotely augmenting environments, which do not have graphical markers attached to their surfaces, (ii) a local user is not required to assist the augmentation process, (iii) the AR interactions are able to be synchronous or asynchronous, (iv) live video from the local site is transmitted, and (v) the local user's privacy is preserved.
 In at least one embodiment, an intersection of 3D data and real-time video is determined. In such an embodiment, the additional 3D information sent for enabling remote AR is limited to the intersection of (i) the 3D reconstructed local space, and (ii) the outgoing video view. The intersection is defined geometrically by a viewing pyramid (which may be a substantially rectangular viewing pyramid) opening towards the local space, along the camera's viewing direction, with the apex of the pyramid behind the camera lens. The pyramid of vision may be truncated by e.g. parallel planes limiting 3D shapes assumed to be too near to or too far from the camera. A natural truncation boundary is formed by the far end of the volume of the 3D reconstructed local space. The term viewing pyramid refers to a pyramid with a rectangular or any other cross-sectional shape.
 In an exemplary embodiment, the video connection is the primary means for real-time communication in the system. People are already very much accustomed to using video conferencing systems, and users are now accustomed to showing part of their surroundings - even at home - to a number of their friends and contacts. When using video, users have a good understanding and control of what they show to other users. Typically, they pay attention to the video content before joining a video meeting, when choosing their position and outgoing view. The real-time video is used for communication, and at the same time defines the part of the user's space available both for producing (binding) and receiving 3D augmentations.
 An exemplary embodiment of the disclosure operates to restrict the outgoing 3D information to the intersection of the 3D reconstruction and the real-time video view. In addition to privacy needs, this principle also serves to limit the number of bits required when transmitting 3D information for remote 3D augmentation. The number of transmitted bits is smaller for the intersection than for the complete 3D reconstruction.
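 The intersection described above can be sketched as a filter over a point cloud. The example below is a minimal illustration under stated assumptions, not the disclosed implementation: points are already expressed in the video camera's coordinate frame with +z along the viewing direction, the pyramid is rectangular, and its apex is placed at the camera origin rather than behind the lens. The function names, field-of-view angles, and truncation distances are hypothetical.

```python
import math
from typing import Iterable, List, Tuple

Point3 = Tuple[float, float, float]


def in_viewing_pyramid(p: Point3, h_fov_deg: float, v_fov_deg: float,
                       near: float, far: float) -> bool:
    """True if p lies inside the truncated rectangular viewing pyramid.

    Assumes camera coordinates: x to the right, y up, z along the view.
    """
    x, y, z = p
    if not (near <= z <= far):          # truncation by parallel planes
        return False
    half_w = z * math.tan(math.radians(h_fov_deg) / 2.0)
    half_h = z * math.tan(math.radians(v_fov_deg) / 2.0)
    return abs(x) <= half_w and abs(y) <= half_h


def clip_to_pyramid(points: Iterable[Point3], h_fov_deg: float,
                    v_fov_deg: float, near: float, far: float) -> List[Point3]:
    """Keep only the 3D points that the outgoing video view also covers."""
    return [p for p in points
            if in_viewing_pyramid(p, h_fov_deg, v_fov_deg, near, far)]


cloud = [
    (0.0, 0.0, 1.0),    # straight ahead, inside
    (5.0, 0.0, 1.0),    # far to the side, outside the horizontal field
    (0.0, 0.0, 10.0),   # beyond the far plane
    (0.5, 0.2, 2.0),    # inside
]
visible = clip_to_pyramid(cloud, h_fov_deg=60.0, v_fov_deg=40.0,
                          near=0.3, far=5.0)
```

 Only the points in `visible` would be candidates for transmission or storage; everything else stays local.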
 In one embodiment, the 3D capture setup produces a 3D reconstruction of the local user and the physical surroundings. The 3D capture may be used for spatial reference for adding AR objects while editing an AR scene, i.e. a compilation of virtual elements (AR objects), each with precise pose (position, orientation and scale). The added AR objects may be 3D models, 3D scanned real objects, any other visual information, or audio sources. The resulting AR scene may be viewed by users in a modified real-time video in a physical environment. The resulting AR scene may also bring content from a virtual world, such as Second Life or other computer games, to the user's physical environment.
 A local user may experience the direction and distance of the augmented audio source in the local environment. The augmented audio source may be reproduced with a spatial audio system that includes multiple channels and speakers. A surround sound system may also be used as a spatial audio system. The spatial audio system is configured to reproduce sounds and respond to a user's distance from the augmented audio sources. The augmented audio sources may be associated with either physical or augmented 3D objects.
 In some embodiments, generating a 3D model includes the 3D capture system performing a calibration procedure. Portions of the 3D capture system may be stationary after the calibration procedure. A coordinate system remains unchanged with the fixed 3D capture system. The coordinate system may then be used in both synchronous and asynchronous AR sessions.
 In some embodiments, the AR scene is anchored into the physical location based on the coordinates. Anchoring the AR scene to the physical location comprises aligning its coordinates with the physical location. The AR objects are depicted in the AR scene in relation to the physical location. This includes position and orientation information to ensure the AR objects are displayed correctly and at the right scale.
 The 3D model of the environment is transmitted either during a live video session, such as during a synchronous AR session, or between real-time sessions, such as during asynchronous AR sessions. As an alternative to transmitting a 3D setting, a perspective video of the 3D setting may be generated and transmitted to the other users in the AR session. The viewpoint of the perspective video may be chosen from various positions.
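 Generating a perspective view from a chosen viewpoint amounts to projecting the 3D model through a virtual camera. The sketch below shows that projection for a single point using a simple pinhole model. It is an illustration under assumptions: the look-at construction, the z-up world convention, the focal length in pixels, and the image size are all hypothetical choices, not parameters from the disclosure.

```python
import math
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v: Vec3) -> Vec3:
    n = math.sqrt(dot(v, v))
    return (v[0] / n, v[1] / n, v[2] / n)


def project(point: Vec3, eye: Vec3, target: Vec3,
            up: Vec3 = (0.0, 0.0, 1.0), focal_px: float = 500.0,
            width: int = 640, height: int = 480
            ) -> Optional[Tuple[float, float]]:
    """Project a world point into a virtual camera placed at `eye`.

    Returns pixel coordinates (u, v), or None if the point is behind
    the camera. The view direction must not be parallel to `up`.
    """
    forward = normalize(sub(target, eye))
    right = normalize(cross(forward, up))
    cam_up = cross(right, forward)
    rel = sub(point, eye)
    depth = dot(rel, forward)
    if depth <= 0.0:
        return None
    u = width / 2.0 + focal_px * dot(rel, right) / depth
    v = height / 2.0 - focal_px * dot(rel, cam_up) / depth
    return (u, v)


# Virtual camera at the origin looking along +y.
center = project((0.0, 2.0, 0.0), eye=(0.0, 0.0, 0.0), target=(0.0, 1.0, 0.0))
right_of_center = project((1.0, 2.0, 0.0), eye=(0.0, 0.0, 0.0),
                          target=(0.0, 1.0, 0.0))
behind = project((0.0, -1.0, 0.0), eye=(0.0, 0.0, 0.0),
                 target=(0.0, 1.0, 0.0))
```

 Changing `eye` and `target` corresponds to the remote user selecting a different viewpoint, for example a side view; rendering a full frame would repeat the projection for every point or surface in the model.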
 FIG. 1 depicts an example camera marker based 3D capturing system setup 100. In the example system setup, a plurality of 3D depth sensors (shown as cameras 102a/102b/102c) is configured in an array to collect 3D information of the scene used for generating a 3D model of the local environment. Each of the cameras is communicatively coupled with local computers 104a/104b/.../104n and transmits data to a back-end server 106 to combine information from each 3D depth sensor in the plurality of 3D cameras.
 A user interacts with a user video terminal (e.g., a laptop terminal) 110 to capture video from the field of view 108. The combined video data from the video terminal 110 and 3D data from the 3D depth sensors 102a/102b/102c are transmitted to a remote user 112. An AR object 116 may be within the field of view 108. The remote user 112 views the AR scene on a laptop, tablet, or other similar device, or views a rendering of the AR scene at the remote environment.
 In one embodiment, one of the cameras is a front-end device. As an example, the laptop terminal device 110 may be the front-end device and is equipped with a visible-light camera for capturing video of the field of view 108. The remote user 112 receives video data representative of the field of view and a truncated 3D model from the intersection of the complete 3D model and the field of view 108 of the laptop terminal 110.
 The 3D data model associated with the local environment is produced in real-time during a synchronous session. The model is used by local or remote users as a spatial reference for producing an accurate AR scene, where the scene may include a compilation of virtual elements (such as the AR object 116) having position and orientation information. In synchronous interaction, the 3D data model is provided to the remote user 112 together with real-time video view of the local environment. The video view is generated by the local user's terminal, such as the video terminal 110. The AR scene can be produced using both 3D data and video view. In asynchronous AR sessions, the 3D data model generated during the synchronous AR session is stored and may be accessed by the remote user 112 for AR scene creation. While FIG. 1 depicts a single local scene and a single remote scene, some embodiments may include any number of different scenes in the AR session.
 To support both synchronous and asynchronous AR sessions, AR objects are aligned into the 3D physical scene. Multiple views may be generated, with each view representing a view of the scene from a selected viewpoint. Each view may be presented as a perspective video. It may be preferred to transmit perspective videos in synchronous AR sessions or when bandwidth is limited. In embodiments without significant bandwidth limitations, such as some asynchronous AR sessions, complete 3D data may be transmitted with the video instead of a perspective video scene.
 In embodiments with limited bandwidth, the data to be transmitted may be reduced by omitting less useful information. Such less useful information may include surfaces pointing away from the local user's video camera, as they are not visible to the remote user and are therefore not likely of interest to remote users.
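 The omission of surfaces pointing away from the camera can be sketched as a simple facing test. The example below is illustrative only and assumes that each surface element is represented by a centroid and an outward normal; the function name `visible_faces` and this representation are hypothetical.

```python
def visible_faces(faces, camera_pos):
    """Keep only the faces whose outward normal points toward the camera.

    `faces` is a list of (centroid, normal) pairs of 3-tuples (an assumed
    representation). A face is kept when the dot product of its normal with
    the vector from the face to the camera is positive.
    """
    kept = []
    for centroid, normal in faces:
        to_camera = tuple(c - p for c, p in zip(camera_pos, centroid))
        facing = sum(n * t for n, t in zip(normal, to_camera))
        if facing > 0:
            kept.append((centroid, normal))
    return kept
```

 This test does not account for occlusion by other surfaces; it only removes geometry that faces away from the local user's video camera.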
 In some embodiments, 3D information is not sent real-time, but on-demand according to a remote user's request. In such an embodiment, the 3D data can be stored at the local terminal or at a server.
 FIG. 2 depicts an example method 200, in accordance with an embodiment. In the method 200, a synchronous AR session is started with both local and remote users. The method 200 may be performed in the context of the 3D AR capture system 100 of FIG. 1. Initially, video and a 3D model of the local environment are obtained. The synchronous AR session may be anchored in the real-world setting of the local user. A snapshot of the 3D data representing the local space obtained during a synchronous AR session is stored. The 3D snapshot can be used for producing augmentations asynchronously, after the real-time synchronous AR session. The 3D model of the local environment is produced during the synchronous AR session. The model is used by local or remote users as a spatial reference for producing an AR scene. The AR scene may include virtual elements, AR objects, with position, orientation, and scale information.
 In synchronous AR sessions, the 3D data is provided to a remote user with real-time video view of the local scene. The video view is generated by the local user's terminal having a video camera and display, for example a laptop or a tablet. The 3D data of the local user's environment may be restricted to an intersection of the 3D reconstructed local space and the outgoing video view, such as a viewing pyramid. The AR scene may be produced using both 3D data and video. The AR object's pose may be set using the video view and perspective (side) views generated from the 3D data.
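 Restricting the 3D data to the intersection with the viewing pyramid can be sketched as a point-in-pyramid test. The sketch below is a simplified illustration, assuming that the 3D points have already been expressed in the camera's coordinate frame (camera at the origin, looking along +z) and that the pyramid is described by horizontal and vertical field-of-view angles; the names `in_viewing_pyramid` and `truncate_model` are hypothetical.

```python
import math


def in_viewing_pyramid(point, h_fov_deg, v_fov_deg, max_depth=float("inf")):
    """Return True if a camera-frame point lies inside the viewing pyramid."""
    x, y, z = point
    if z <= 0 or z > max_depth:
        return False  # behind the camera or beyond the assumed depth limit
    half_width = z * math.tan(math.radians(h_fov_deg) / 2)
    half_height = z * math.tan(math.radians(v_fov_deg) / 2)
    return abs(x) <= half_width and abs(y) <= half_height


def truncate_model(points, h_fov_deg, v_fov_deg, max_depth=float("inf")):
    """Keep only the 3D points that fall within the viewing pyramid."""
    return [
        p for p in points
        if in_viewing_pyramid(p, h_fov_deg, v_fov_deg, max_depth)
    ]
```

 Under these assumptions, only the points returned by `truncate_model` would be candidates for transmission to the remote user.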
 In asynchronous AR sessions, the 3D data generated during the synchronous session is stored. The data can be accessed by a user for AR scene creation. The pose of AR objects can be set using different perspective views to the 3D data. The AR object scene generated during an asynchronous session can then be augmented to the local user's video view in a synchronous AR session.
 As shown in FIG. 2, the method 200 includes, at 202, the local and remote users starting a synchronous AR session over a network. During the synchronous session, video and a 3D model of the local environment (e.g., a first real-world space) are captured from the local site, at 204, and the video is transmitted to the remote user. At 206, the remote user adds a new AR object to the local user's environment by selecting a pose for the object in the video view and any on-demand generated perspective videos. The synchronous AR session is terminated by either the local or the remote user at 208.
 The 3D model of the local environment and the added AR objects are stored into a server at 210. A representation of the first real-world space is rendered at a location remote from the local environment in an asynchronous AR session. A remote user edits the AR scene associated with the synchronous AR session by adding new or editing existing AR objects asynchronously at 212. The AR system receives information regarding user input of the placement of a virtual object relative to the 3D model information, allowing the user to set a position and orientation (the pose) of an AR object in the video view and/or perspective views of the stored 3D model of the local environment at 214. The user ends the asynchronous session and the positions and orientations of the edited AR objects are stored in the server at 216. When a new synchronous session anchored in the first real-world space is started, at 218, the AR objects stored in the server (that were edited at 212) are augmented (e.g., rendered) at the position corresponding to the placement of the virtual object in the local user's video stream.
 In a local process, the local user starts, or joins, an interactive session with remote participants. Before video is transmitted from the local user to the remote participants, the user can see what is visible through a video camera of a local user's terminal device, which may be any device suitable for use in AR systems, such as smart phones, tablet computers, laptop computers, camera accessories, and the like. The user is able to reposition the terminal device, ensuring that only non-sensitive or non-private information is visible in the viewing pyramid of the video camera of the terminal device. The AR system and terminal then initialize, which may include performing a calibration, locating the video terminal, making a 3D capture, and determining an intersection of the 3D capture and the viewing pyramid. The initialization process may be repeated if the terminal device is moved or repositioned. The user may then participate in the AR session with the remote participants. User participation may include viewing augmentations in the local space produced by the local user or the remote participants, creating remote AR content for the other peers, and the like, until the AR session is terminated.
 In a remote process, the remote participant starts, or joins, an interactive session with the local participants. The remote participant receives live video from the local site. The remote participant can select an area, or region, of interest from the received live video and receives 3D data regarding the features associated with the region of interest. A 3D editor may be used to edit 3D objects into the 3D data. The 3D objects are aligned with respect to the 3D data, or 3D feature sets, and a mapping between the 3D objects and 3D data is created. Using the alignment mapping, the received video is augmented with the 3D objects, displayed in the desired position. The augmented video is transmitted to a far end, along with the mapping between the 3D object location and the 3D feature points.
 In one embodiment, there is no need for graphical markers. The AR system enables 3D feature based AR from the real-time video connections.
 In one embodiment, local preparation and local user assistance are not required for AR sessions. The AR system is based on using a distributed real-time 3D capture setup. The AR session may determine the intersection of the live-video and the 3D data live or off-line with 3D reconstruction calibrated with the camera view.
 In one embodiment, a local user can protect his or her privacy by adjusting the real-time video view of the terminal device (e.g. zooming in or out, or pointing the video camera of the terminal device in a different direction). This changes the viewing pyramid of the terminal device and in turn limits the region in which 3D information is being transmitted.
 In one embodiment, non-symmetrical use cases are supported with remote participants not required to have the 3D capture setup installed in order to make augmentations in the local scene.
 In one embodiment, bitrates of transmitted data are reduced by using perspective videos, as compared to sending real-time 3D information.
 In one embodiment, the ease-of-use of AR content production is increased by providing 3D intersection data to the remote user to make 3D augmentations to correct areas in the received video.
 FIG. 3 depicts the infrastructure of a peer-to-peer (P2P) AR system 300, in accordance with an embodiment. The infrastructure includes local user 306 that has a 3D capture setup 302 and a main video terminal 304. The infrastructure also includes a remote user 310 that has a remote user terminal 312. A storage manager 320 transmits data between the AR system at the local user 306 and a server 308. An asynchronous remote interaction application 322 transmits data between the server 308 and the system at the remote user 310. A synchronous remote interaction application 314, an augmenting application 316 and a streaming application 318 transmit data between the AR systems at the local user 306 and remote user 310.
 At the local user 306, the 3D capture system 302 captures 3D information regarding the local environment, similar to the cameras 102 and computers 104 of FIG. 1. The main video terminal 304 captures video data within a viewing pyramid, similar to the video terminal 110 of FIG. 1. The remote user 310 has a remote user terminal 312 with a synchronous remote interaction application 314 that is capable of receiving the video stream from the local user 306 and selecting position and scale of new virtual objects (AR objects) in the local user's 3D environment model.
 During a synchronous AR session, the AR session includes a capture setup 302 (which may be a multi-camera sensor or a sensor system capable of creating a 3D model of the local user's environment), a main video terminal 304 (which may be a laptop with an embedded video camera), a spatial audio system, and a remote user terminal 312 capable of receiving a video stream from the local user and running a synchronous remote interaction application 314. If the remote user 310 also has a capture setup, it may be similar to the main video terminal 304 and capture setup 302 of the local user 306. A video streaming application 318 is capable of streaming the local user's video to the remote user, and a video augmenting application 316 is capable of augmenting AR objects to the video. A synchronous remote interaction application 314 is capable of receiving a video stream from the local user's main video terminal and receiving side views generated from 3D data by the local user's capture setup 302. The application is further capable of adding new AR objects to the local user's 3D environment by selecting a position and orientation of such objects from the video stream and the side views.
 During a subsequent asynchronous AR session, the AR session includes an asynchronous remote interaction application 322 (e.g. a scene editor) for setting the position and orientation (pose) of AR objects in the 3D model captured during the synchronous AR session. The 3D model is stored in a backend server 308. The 3D model includes both 3D information regarding the physical location at the local scene and data associated with AR objects. An application server may deliver video streams and augment the AR objects on the application server. A session manager is capable of coordinating different AR sessions and a storage manager is capable of transitioning between a synchronous and an asynchronous AR session.
 In some embodiments, the AR session is a pure peer-to-peer session, with all of the messaging, augmentation, and video streaming implemented at and between the clients. In other embodiments, the AR session uses an application server that acts as an intermediary between the users (clients), routes the messages and video, and implements AR augmentations. Various combinations of peer-to-peer and application server AR sessions may be implemented as well.
 In an exemplary embodiment, the capture setup 302 may include a multi-camera or sensor system capable of creating a 3D model of the local user's environment, including the environment's shape and appearance, such as color. The local environment may include a room, its occupants, and furniture. The 3D model may be updated in real-time as the environment changes. In some embodiments, an intersection between the 3D data and a viewing pyramid associated with the video terminal is determined, and only data falling within the intersection is transmitted to a remote party.
 The capture setup 302 sends the video from the main video terminal to the remote user, as a main collaboration channel. The capture setup 302 may also be capable of creating images of the 3D model from different viewpoints. These images of the 3D model may provide perspective (or side view) videos that may be delivered to the remote user 310. The capture setup 302 may remain stationary to permit the reuse of a coordinate system determined in subsequent synchronous and asynchronous AR sessions. The video terminal may move without altering the coordinate system.
 The main video terminal 304 used by the local user 306 may be a standard laptop-camera combination, a tablet with a camera, or the like. The camera associated with the main video terminal 304 captures video within a viewing pyramid, that is, the portions of the local environment that are visible to the camera. An intersection between the 3D data captured by the capture setup and the viewing pyramid may be determined. The capability to select the visible parts without user involvement is based on the capture setup's ability to automatically calibrate.
 The spatial audio system may be used to augment audio in the user's environment. The spatial audio system may include multiple channels and speakers, such as an audio surround sound system. Augmented audio sources may be associated with either physical or augmented 3D objects. Audio is reproduced in selected 3D positions similarly to rendering AR objects.
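 As a purely illustrative sketch of how a reproduced sound might depend on the direction and distance of an augmented audio source, the example below computes a gain and an azimuth for a listener. The inverse-distance gain law, the `ref_distance` parameter, the assumption that the listener faces the +y axis, and the function name `audio_cue` are all assumptions introduced here and are not specified in the disclosure.

```python
import math


def audio_cue(listener_pos, source_pos, ref_distance=1.0):
    """Return (gain, azimuth_deg) for an augmented audio source.

    Assumptions: gain falls off as ref_distance / distance and is clamped
    to 1.0 inside the reference distance; the listener faces the +y axis,
    and a positive azimuth is toward +x (the listener's right).
    """
    distance = math.dist(listener_pos, source_pos)
    gain = 1.0 if distance <= ref_distance else ref_distance / distance
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    azimuth_deg = math.degrees(math.atan2(dx, dy))
    return gain, azimuth_deg
```

 A multi-channel system could then distribute the signal across its speakers according to the azimuth, although the manner of doing so depends on the speaker layout in use.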
 The remote user terminal 312 enables remote users to participate without an AR system. In some embodiments, the remote user only displays a video stream from the local user via the synchronous remote interaction application. If the remote user is also sending 3D environment information to other users and has a capture setup, the remote user terminal may be similar to the main video terminal of the local user.
 The video streaming application 318 is software capable of receiving the video stream from the camera of the main video terminal or remote user terminal, and transmitting the video to the other users.
 The video augmenting application 316 receives the video streamed by the video streaming application and a set of AR objects that include pose information. The application edits the video so that the AR objects are rendered to the output video in the correct pose, or position, scale, and orientation. The same application may also control spatial audio augmentation if being used.
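 Rendering an AR object at the correct position in the output video involves mapping its 3D position to pixel coordinates. A minimal sketch follows, assuming an ideal pinhole camera without lens distortion and a position already expressed in the camera frame; the function name `project_to_pixel` and the intrinsic parameters `fx`, `fy`, `cx`, and `cy` are illustrative assumptions.

```python
def project_to_pixel(point_cam, fx, fy, cx, cy):
    """Project a camera-frame 3D point to pixel coordinates.

    Assumes an ideal pinhole camera: fx and fy are focal lengths in pixels
    and (cx, cy) is the principal point. Returns None when the point lies
    behind the camera and therefore cannot appear in the video frame.
    """
    x, y, z = point_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)
```

 For example, with `fx = fy = 800` and a principal point of `(320, 240)`, a point at `(1, 0, 2)` maps to pixel `(720.0, 240.0)`. The apparent scale of the object also decreases in proportion to its depth `z`.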
 The synchronous remote interaction application 314 enables users to interact with each other by adding new AR objects to the videos from other users' environments and editing an existing scene (AR content) consisting of AR objects added earlier. The application 314 is capable of receiving a video stream from the local user's main video terminal 304 and receiving perspective videos (e.g. side views) generated by the local user's capture setup 302. Using these video streams, the remote user 310 can look at the local user's 306 environment from different angles and add new AR objects to the local user's 3D environment by selecting position coordinates from the video stream and the side views. The remote user 310 can also change position, scale, and orientation of the AR objects. The video augmenting application 316 can augment the AR objects into the local user's 306 video in real time, allowing all the users (306 and 310) to see the scene editing process while it happens. The synchronous remote interaction application 314 may also be used by the local user 306 for adding new AR objects to the local video, allowing other users to see the augmented objects placed by the local user 306.
 While the users have a live video connection, they interact synchronously, seeing how the other users edit the AR content in the augmented video. When the synchronous session ends, the storage manager 320 directs the storage of the 3D model of the local user's environment, as well as the AR objects added to the environment, to a server 308. A user can use the asynchronous remote interaction application 322 to add new AR objects or edit existing AR objects with respect to the 3D model captured from the local environment between synchronous sessions. The asynchronous remote interaction application 322 uses the same editing logic as the synchronous remote interaction application 314 and may offer the user side views of the 3D model of the environment and added AR objects. The asynchronous remote interaction application 322 may allow users to edit position, scale, and orientation of the AR objects. The AR objects are stored into the server, and the video augmenting application 316 uses them to augment the objects into video when the synchronous session starts again.
 The session manager is an application that controls the user information, such as access rights, and coordinates the sessions. The storage manager 320 is an application that controls the transition between synchronous and asynchronous phases by storing the status of the 3D model of the local environment and the AR objects added to the model. The environment may be saved when a synchronous AR session ends, periodically, continually, or at other times.
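 The storing of the 3D model status together with the added AR objects can be sketched as a serialization step. The example below is a hypothetical illustration only: the `ARObject` fields, the JSON layout, and the function names `save_snapshot` and `load_snapshot` are assumptions, and the disclosure does not prescribe any particular storage format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ARObject:
    """An AR object with pose information (field names are assumed)."""
    object_id: str
    position: tuple
    orientation_deg: tuple
    scale: float = 1.0


def save_snapshot(model_points, ar_objects):
    """Serialize the 3D model and its AR objects to a JSON string."""
    return json.dumps({
        "model_points": model_points,
        "ar_objects": [asdict(obj) for obj in ar_objects],
    })


def load_snapshot(payload):
    """Restore the 3D model and AR objects from a JSON string."""
    data = json.loads(payload)
    objects = [
        ARObject(o["object_id"], tuple(o["position"]),
                 tuple(o["orientation_deg"]), o["scale"])
        for o in data["ar_objects"]
    ]
    return data["model_points"], objects
```

 In such a sketch, the payload saved at the end of a synchronous session could be edited during an asynchronous session and loaded again when the next synchronous session starts.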
 The server 308 is a server computer, or server cloud, that contains data storage configured to store 3D capture results and AR objects for asynchronous interaction. The data storage may also be used to store other information related to users, user sessions, and the like.
 In some embodiments, the system of FIG. 3 may be modified to include additional remote users. The additional remote user may have similar AR equipment as the remote user 310 and be in communication with an additional backend server and an additional application server while having an AR session with the local user 306. The additional remote user may visualize the local user with the same perspective as the remote user 310, or from a different perspective than the remote user 310.
 FIGs. 4A and 4B depict a sequence diagram, in accordance with an embodiment. The sequence diagram depicts an order of communications between the components in an AR infrastructure, such as the one depicted in FIG. 3. In the sequence diagram depicted in FIGs. 4A and 4B, the remote user 402 and the local user 406 are communicating via a peer-to-peer network during a synchronous interaction. The video streams are transported between clients (remote and local users), without an intervening server. The AR objects are augmented to the video stream in the sender's system (local user). Asynchronous interactions may be facilitated by a backend server 404.
 In FIGs. 4A and 4B, a 3D model of the environment of a local user is continually captured at 408. The local user 406 receives a request for a video stream from the remote user 402. At 410, the video stream is retrieved by the main video terminal and provided to the remote user. At 411, the pose of an AR object is set. Setting the pose comprises, at 412, selecting a point from the video using the synchronous remote interaction application. The remote user 402 requests a side-view from the local user 406, and at 414, the local user 406 retrieves the side-view of the environment from the capture system and provides the side-view to the remote user 402. At 416, the remote user 402 selects a pose for the AR object from the side view and sends the AR object, and coordinates, to the local user 406. At 420, the AR object is augmented into the video via the augmenting application. The augmented video is then provided to the remote user 402. The steps of 411 may be looped until the pose is finalized. The synchronous interaction may then terminate.
 Continuing with FIG. 4B, at 422, a snapshot of the 3D environment model is retrieved and provided to the storage manager for storage. In an asynchronous session 423, the remote user 402 requests the 3D snapshot, and at 424, the server 404 retrieves the 3D snapshot received from the local user 406. The remote user 402 receives the 3D snapshot from the server 404, and at 426 the remote user 402 positions a new AR object into the 3D model, or revises the pose of an existing AR object. The AR object's position and location are then sent to the server 404. At 428, the server stores the location and identity of the AR object. At the start of a new synchronous session between the local user and the remote user, the local user 406 queries the server 404 for new AR objects, and receives the new AR objects and position information from the server. The remote user 402 requests an augmented stream, and the local user 406 generates an augmented view that includes the new AR object at 430, and provides the augmented view to the remote user 402.
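 One way to picture the combination of a point selected from the video (412) and a selection from a side view (416) is that the video point fixes a viewing ray while the side view fixes the depth along that ray. The sketch below illustrates this under stated assumptions: an ideal pinhole camera, a depth value already read off the side view, and the hypothetical function name `pixel_to_point`. It is not presented as the method of the disclosure.

```python
def pixel_to_point(u, v, depth, fx, fy, cx, cy):
    """Turn a selected pixel and a chosen depth into a camera-frame 3D point.

    Assumptions: ideal pinhole camera with focal lengths fx, fy (pixels)
    and principal point (cx, cy); `depth` is the distance along the optical
    axis, for example as chosen by the user from a side view.
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)
```

 In a loop such as the one at 411, the remote user could repeat the depth selection and inspect the resulting augmented video until the pose is satisfactory.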
 The application server is a server computer, or server cloud, that is configured to deliver video streams and to augment the AR objects in the application server variation. It is configured to run video augmenting applications and communicates with the storage server to retrieve data associated with AR objects added during asynchronous AR sessions.
 FIG. 5 depicts the infrastructure of an AR system with an application server, in accordance with an embodiment. In particular, the AR system architecture 500 of FIG. 5 is similar to the architecture 300 depicted in FIG. 3, but also includes an application server 534 having an augmenting application 532. In system 500, the application server 534 receives streaming video from the main video terminal 304 via the streaming application 530. The augmenting application augments the video stream and provides the augmented video to the remote user via the streaming application 536. The application server 534 also receives AR object information via communication link 538 from the backend server 308.
 In some embodiments, the system of FIG. 5 may be modified to include additional remote users. The additional remote user may have similar AR equipment as the remote user 310 and be in communication with an additional backend server and an additional application server while having an AR session with the local user 306. The additional remote user may visualize the local user with the same perspective as the remote user 310, or from a different perspective than the remote user 310.
 FIGs. 6A and 6B depict a sequence diagram, in accordance with an embodiment. The sequence diagram of FIG. 6A, continued on FIG. 6B, depicts an order of communications between the components in an AR system architecture, such as the AR system architecture depicted in FIG. 5. The AR system components include a remote user 602, an application server 604, a local user 606, and a backend server 630. The application server 604 is similar to application server 534 of FIG. 5 and routes the videos, augmentation data, and other messages between the local user 606 and the remote user 602. In some embodiments, the video streams are transmitted between clients via the application server 604 and the AR objects are augmented to the video stream on the application server 604.
 In the sequence diagram of FIGs. 6A and 6B, the local user 606 continually captures a 3D model of the environment using the capture setup at 608. The remote user 602 requests a video stream, and the application server 604 forwards the request to the local user 606. At 610, the local user 606 retrieves the video stream from the main terminal and provides the video stream to the application server 604 for forwarding to the remote user 602.
 The loop 611 is repeated until a pose is set for an AR object. In the loop 611, the remote user 602 selects a point from the video at 612. A request for a side-view is sent to the application server 604 and forwarded to the local user 606. At 614, the local user 606 retrieves the requested side view of the environment using the capture setup and returns the side view to the application server 604 for forwarding to the remote user 602. At 616, a pose is set from the side view, and the AR object and coordinates are sent to the application server 604. At 620, the application server 604 augments the AR object to the video using an augmenting application and returns the augmented video to the remote user.
 At 622, an asynchronous interaction occurs to update the pose of AR objects in the scene that was captured during the previous synchronous session. The asynchronous interaction 622 may be similar to the asynchronous session 423 of FIG. 4B, with the backend server 630 storing the asynchronously updated AR object information.
 As continued on FIG. 6B, the remote user 602 requests augmented video, and the application server forwards the request to the local user 606. At 632, the local user 606 retrieves the video stream from the main video terminal and returns the video to the application server 604. The application server 604 requests information of the AR objects and the respective coordinates from the backend server 630. At 634, the backend server 630 retrieves the AR object and the respective coordinate information and forwards the information to the application server 604. At 636, the application server 604 augments the AR objects into the video and provides the augmented video to the remote user 602.
 In some embodiments, storage of the 3D data and AR scene may occur periodically, at the beginning of an AR session, at the conclusion of an AR session, after changes to the environment are processed, at the permission of the users, or at other times. Also, the 3D scene may be stored in portions throughout an AR session. The storage of the AR scene may occur at the local terminal, the remote terminal, or a server.
 In accordance with one embodiment, the 3D model used in positioning the AR objects is captured on the fly. A typical local user's environment (e.g. a living room at home or an office) is constantly changing: the remote control in the living room is always missing, and a pile of papers in the office keeps growing. However, in some cases the environment is intentionally kept as stable as possible. In these cases, a variation is beneficial in which the 3D model of the local environment is created in advance instead of being captured in real time.
 An additional embodiment for use in, for example, a hotel environment may be implemented using the systems and methods disclosed herein. A hotel often has a large number of rooms that are furnished similarly. The rooms may be cleaned daily, and each piece of furniture is moved to its original position when possible. When new customers arrive, each room often looks exactly the same as compared to a default initial state.
 In an exemplary embodiment, operators of a hotel have created 3D models of each room. Since the rooms are mostly identical, it is quite cost efficient to create the models. The 3D models may be generated via a capture system, from floor plans, from a catalog of standard objects (such as the same beds and dressers furnishing each hotel room), and the like.
 The operators of the hotel may offer their guests a collaboration tool, such as software for conducting synchronous and asynchronous AR sessions as described above. The guest can view the room in advance in virtual reality and decorate the room in advance (asynchronously) for his/her own comfort (e.g. by adding pictures of his/her family members and other familiar items.)
 While the guest is travelling to the hotel, his/her family can (asynchronously) add notes, virtual decorations or other AR objects to the 3D model of the room. The objects will be visible to the guest via AR glasses, or on the computer screen when the guest starts the application on his/her laptop. The collaboration system could also be installed, e.g., in a television set equipped with a camera.
 While the guest is staying in the hotel, he/she can have a synchronous session with his/her family members. During the collaboration session, AR objects can also be moved or added in real time (synchronously).
 Such embodiments allow a hotel guest to virtually decorate the room for his or her own comfort and to use real-time interaction to interact with friends, family members, or other contacts. The friends or family of the hotel guest can interact by decorating the room for the comfort of the traveling family member, leaving notes and wishes for the family member, and using real-time interaction to interact with the family member.
 The components used to implement this variation may be the same as those used in other embodiments disclosed herein, except that the capture setup may not be required. This is especially relevant for the hotel environment, because guests may not accept hotel rooms equipped with a camera setup. Since the hotel room environment can be modelled in real scale, system calibration for obtaining the correct scale for AR objects is not needed. In other embodiments, system calibration is also needed for defining the user terminal position with respect to the 3D model. Here, since the environment is known in advance, the system can use known feature sets, such as paintings on the wall, to calculate its pose within the environment. In some embodiments, augmentation may be performed even before a particular room is selected for a hotel guest because different rooms may have the same configuration (or, e.g., a mirror-image configuration with respect to one another). The pose of different AR objects may thus be determined in advance based on a generic model room and then displayed to the user at the corresponding position (which may be a mirror-image position) in the room to which the guest is ultimately assigned.
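 The mapping of a pose from a generic model room to a mirror-image room can be sketched as a reflection. The example below is illustrative only and assumes that the two rooms are mirror images across a vertical plane at `x = mirror_x` and that orientation is reduced to a single yaw angle measured from the +x axis; the function name `mirror_pose` and these conventions are hypothetical.

```python
def mirror_pose(position, yaw_deg, mirror_x):
    """Reflect an AR object's pose across the vertical plane x = mirror_x.

    The x coordinate is reflected about the plane. A heading of yaw_deg
    (measured from the +x axis) becomes 180 - yaw_deg, because reflection
    negates the x component of the heading direction.
    """
    x, y, z = position
    mirrored_position = (2 * mirror_x - x, y, z)
    mirrored_yaw = (180.0 - yaw_deg) % 360.0
    return mirrored_position, mirrored_yaw
```

 For example, an object at `(1.0, 2.0, 0.0)` with a yaw of 30 degrees, mirrored across `x = 3.0`, would be placed at `(5.0, 2.0, 0.0)` with a yaw of 150 degrees. An asymmetric object may additionally need mirrored geometry to look correct, which this sketch does not address.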
 In an additional use case scenario, Pekka and Seppo are having an extended video conference using an AR system. They both have installed the system in their apartments. Pekka has a pile of laundry in the corner of his room and he has pointed the video camera of his user terminal so that Seppo cannot see the laundry in the video stream.
 Seppo has a 3D model of a piece of furniture he thinks would look good in Pekka's environment. Seppo selects a position where he wants to add the furniture model by pointing at the video view coming from Pekka's apartment. Next, he selects one or more side views of Pekka's environment in order to place the model more precisely in the correct position. Even though the system creates a 3D model of the whole room, Seppo cannot see the pile of dirty clothes in any of the side views, because the system shows only those objects that are included in the main video view.
 Both users can see the furniture model augmented to the video stream from Pekka's environment. In addition, Pekka can see the furniture model from different viewpoints using his AR glasses. Because the system has calibrated itself, the furniture is automatically scaled to the right size and looks natural in its environment.
 When the conference has ended, a 3D model of Pekka's environment is stored on the server. Seppo goes to a furniture store and sees even more interesting furniture. He gets the 3D model of the new furniture and, using his mobile terminal, replaces the earlier one in Pekka's environment stored on the server.
 Pekka can see the additions with his mobile terminal, using a 3D browser while on the move. Finally, when Pekka returns home, he can see the new furniture augmented into the video view, and it is available for discussion when the parties have their next video conference.
 In another use case scenario, local and remote AR systems capture 3D data of their respective spaces. The capture setups may be automatically calibrated and are configured to detect the user terminal position and camera direction. Each user directs the terminal camera so that parts of the room the user considers private are left outside of the video view, or viewing pyramid. The local and remote users start the video conference using the remote AR interaction system. The users see each other's rooms in normal video views. Data is transmitted in the spatial/video intersection of the 3D model and the live video stream. The remote user adds a new AR object to the local user's room. This may also be done by adding the object into the video stream of the room. The local or remote user can select a position of the added AR object from the video terminal's video stream.
 The remote user may also move the virtual camera view to provide a different perspective view of the 3D spatial geometry of the local user's room. The perspective view is limited to the intersection of the viewing pyramid and the 3D data. The remote user can see the perspective view as a virtual video stream and can further reposition the AR object in the local user's space. The AR object is augmented into the video stream generated by the AR interaction system of the local user. The AR object may also be seen in the video stream generated by the local user's video terminal. If either of the users is utilizing an AR system, such as AR glasses, the augmentation may be seen from any direction. The AR system stores the AR object, the selected position, and a snapshot of the captured 3D model of the environment to be used offline, or in a subsequent AR session. The AR session can then be ended.
 After the real-time (synchronous) AR session is completed, either user may edit the AR scene offline, or asynchronously. For example, the remote user may add a second AR object to the AR scene by retrieving the snapshot of the previous AR scene, positioning the new AR object in the scene, and storing the offline edited AR scene. The object may be positioned similarly to the real-time editing, for example, by changing the perspective view and further repositioning the AR object. A new 3D model is generated from the saved sessions. When the AR session associated with the same physical location is resumed, the updated AR scene may be rendered during the new AR session.
 In various embodiments, each of the AR sessions is associated with a unique identifier. The unique identifier may be associated with a physical location, a user, a set of users, and the like. In one example use case, a user creates a first AR scene that has a first unique identifier for use with AR sessions with the user's spouse from the user's office. The user also creates a second AR scene that has a second unique identifier for use in AR sessions with the user's employer, that is remotely located. Both of the AR scenes may be edited with different AR objects in real-time or offline. In the first AR scene, the user's desk has personal AR objects, such as framed family photos and a personal task list, on the user's desk. In the second AR scene, the user's desk has piles of work related AR objects, such as work files and projects, on the user's desk. The user's spouse may perform off-line editing of the first AR scene to update a shopping list on the user's desk. After the user concludes an AR session with the user's employer in the second AR scene, the user returns to the first AR scene and sees the updated shopping list that was edited by the user's spouse offline.
 In some embodiments, the 3D capture system is used to capture 3D data of a local physical location. The 3D data of the local physical location is used to determine the true scale for real size augmentations, such as adding in virtual objects. A remote user is able to use a scene editor application to position 3D AR objects in the local physical location based on the 3D data. Side views for positioning can be generated as perspectives to the 3D intersection. With the 3D AR objects placed in the local physical location, they appear to the remote user in the video stream, even if the local camera is moving because the AR objects are anchored to a position of the 3D local physical location. The local users can see the AR objects in the outgoing video and may also see the objects represented within the local physical location either through AR viewing glasses or hologram technology.
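 Anchoring an AR object to a fixed position in the room is what keeps it stable in the video while the camera moves. A minimal sketch of that idea follows, assuming an ideal pinhole camera whose axes stay aligned with the room axes (rotation is omitted for brevity). The focal length, principal point, and the function name `project_anchor` are illustrative assumptions, not values or names from the source.

```python
def project_anchor(anchor_world, cam_pos, focal_px, principal_point):
    """Project a room-anchored 3D point into pixel coordinates.

    Assumptions (hypothetical): pinhole model, camera looking along +z,
    world y pointing up, image v pointing down, no lens distortion.
    Returns None when the anchor is behind the camera.
    """
    x, y, z = (p - c for p, c in zip(anchor_world, cam_pos))
    if z <= 0.0:
        return None
    u = principal_point[0] + focal_px * x / z
    v = principal_point[1] - focal_px * y / z
    return (u, v)


# The anchor stays fixed in room coordinates; only the camera moves.
anchor = (0.0, 0.0, 4.0)
before = project_anchor(anchor, (0.0, 0.0, 0.0), 800.0, (320.0, 240.0))
after = project_anchor(anchor, (1.0, 0.0, 0.0), 800.0, (320.0, 240.0))
```

 The pixel position changes between the two frames, while the stored anchor does not, which is why the object appears fixed in the physical location.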
 In some asynchronous remote AR embodiments, remote users are able to alter the AR scene for use in a later synchronous AR session. In these embodiments, a snapshot of the 3D space is captured during a synchronous AR session and stored in a server. The snapshot may be limited to an intersection of a viewing pyramid and the local physical space. An offline AR scene editing application permits editing of the AR scene offline. The editing may occur in a 3D rendering or in perspective (side) views generated from the 3D data associated with the local physical location. User access to edit the AR scenes may be limited to a set of users with authorization.
 The AR scene editing application stores the edits to the AR scene for use in a later synchronous AR session. The AR objects positioned asynchronously appear in the correct position based on an unchanged coordinate system. The AR session may further include spatial audio, as disclosed herein.
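 The asynchronous workflow above (a stored snapshot, editing restricted to authorized users, and objects kept in an unchanged coordinate system) can be sketched as a small in-memory store. This is an illustration under stated assumptions, not the disclosed implementation: the class names, the use of a random UUID as the scene identifier, and the per-scene editor set are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ARObject:
    name: str
    position: tuple  # (x, y, z) in the stored room coordinate system


@dataclass
class ARScene:
    editors: set  # users authorized to edit this scene (assumed policy)
    objects: dict = field(default_factory=dict)
    scene_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class SceneStore:
    """In-memory stand-in for the server that keeps scene snapshots."""

    def __init__(self):
        self._scenes = {}

    def save(self, scene: ARScene) -> str:
        self._scenes[scene.scene_id] = scene
        return scene.scene_id

    def edit_offline(self, scene_id: str, user: str, obj: ARObject) -> None:
        scene = self._scenes[scene_id]
        if user not in scene.editors:
            raise PermissionError(f"{user} may not edit scene {scene_id}")
        # Adds the object, or replaces an earlier object with the same name.
        scene.objects[obj.name] = obj

    def load(self, scene_id: str) -> ARScene:
        return self._scenes[scene_id]


store = SceneStore()
sid = store.save(ARScene(editors={"seppo"}))
store.edit_offline(sid, "seppo", ARObject("sofa", (1.0, 0.0, 2.0)))
```

 Because positions are stored in room coordinates rather than image coordinates, a later synchronous session can render the edited objects without re-deriving their placement.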
 In some embodiments, the offline AR scene editing application permits editing of a 3D space that was created offline, or without an AR capture system. In such an embodiment, the 3D space is created for a physical location offline. This may be accomplished by accessing room dimensions, furniture dimensions, and furniture layout of a typical room. Alternatively, a 3D capture may be taken of a first room, and other similarly constructed and furnished rooms may copy the 3D capture. This may be applicable in a setting with a condominium building or a hotel building. A user is then associated with the hotel room that has an offline-generated 3D space, and AR content is rendered in the hotel room. The AR content may include AR objects and may also be utilized in synchronous AR sessions.
 FIG. 7A depicts an overhead view of a physical location, in accordance with an embodiment. The overhead view 700 includes a user terminal 706, a desk 708, a user 704, an AR object 710 (such as the AR plant illustrated in FIG. 7A), a lamp 712, and a position of a virtual camera 714. The user terminal 706 is depicted as a camera. The volume within the physical location that falls within the field of view of the user terminal camera may be described as a viewing pyramid 716. A video camera of the user terminal 706 is configured to capture video images of areas within the viewing pyramid 716. Inside the viewing pyramid 716 is the desk 708, the user 704, and the AR object plant 710. Outside of the viewing pyramid 716, to the left side of the drawing, is the lamp 712. The area depicted in FIG. 7A may be used in an AR session. 3D data may be obtained from a 3D capture system of the complete area (including the lamp). In some embodiments, the 3D data transmitted to a remote user is limited to the intersection of the viewing pyramid and the 3D data.
 FIG. 7B depicts a perspective view from a user terminal 706, in accordance with an embodiment. The AR scene may be rendered in a perspective view to a remote user. The perspective view depicted in FIG. 7B comprises the video stream captured from the user terminal 706, 3D data from within the intersection of the viewing pyramid 716 and the full 3D model, and AR objects 710 placed within the AR scene. As shown in FIG. 7B, the view only includes the desk 708, the user 704, and the AR object plant 710, and does not include the lamp 712, as the lamp 712 is outside of the viewing pyramid 716 and not in the intersection.
 The orientation of the objects is taken from the perspective view of the user terminal 706, with the desk 708 in front of the user 704, and the plant 710 visually to the left of the user 704, and partially behind the desk 708.
 FIG. 7C depicts a perspective view from a virtual camera position 714 of FIG. 7A, in accordance with an embodiment. In some embodiments, the remote user may be displayed the AR scene from the vantage point of the virtual camera 714. As shown in FIG. 7A, the virtual camera 714 is placed to the side of the overhead view, and thus provides a different perspective from the physical video camera of the user terminal 706. The perspective view from the virtual camera similarly includes the desk 708, the user 704, and the AR virtual object plant 710. While the lamp 712 might in theory be visible to a physical camera at the location of the virtual camera 714, the lamp 712 is not included in the perspective view of the virtual camera 714 because the lamp 712 is outside of the viewing pyramid 716 of the video camera of the user terminal 706, with the video camera of the user terminal 706 operating as a model-extent-setting camera. In some embodiments, only the perspective view from the virtual camera 714 is sent to the remote user (in, for example, any one of several available formats for transmission of live video), thereby requiring a lower data rate than sending the entirety of the 3D data to the remote user. The remote user may send to the local user information representing the coordinates (e.g. location, direction, and any roll/tilt/zoom parameters) of the virtual camera 714 within the local physical location, and the local user terminal 706 may generate the appropriate perspective view to send to the remote user. The remote user may be able to change the coordinates of the virtual camera 714 in real time.
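 The exchange in which the remote user sends virtual camera coordinates and the local terminal renders the matching view can be sketched as a small message format. The field names, the JSON encoding, and the defaults below are assumptions made for illustration; the source does not specify a wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VirtualCameraRequest:
    # Hypothetical fields covering location, direction, and roll/zoom.
    position: tuple   # (x, y, z) in the local room coordinate system
    direction: tuple  # viewing direction as a unit vector
    roll_deg: float = 0.0
    zoom: float = 1.0


def encode_request(req: VirtualCameraRequest) -> str:
    """Serialize the request sent from the remote user to the local terminal."""
    return json.dumps(asdict(req), sort_keys=True)


def decode_request(payload: str) -> VirtualCameraRequest:
    """Rebuild the request on the local terminal before rendering the view."""
    data = json.loads(payload)
    return VirtualCameraRequest(
        position=tuple(data["position"]),
        direction=tuple(data["direction"]),
        roll_deg=data["roll_deg"],
        zoom=data["zoom"],
    )


request = VirtualCameraRequest(position=(2.0, 1.5, 0.0), direction=(-1.0, 0.0, 0.0))
payload = encode_request(request)
```

 A message of this kind is a few dozen bytes, so the remote user can update the virtual camera in real time while only the rendered perspective view travels back as video.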
 The orientation of the objects is rendered from the perspective of the virtual camera 714, and thus, the user 704 is behind the virtual object plant 710, and the desk 708 is visually to the right of the virtual object plant 710 and user 704. Since the user 704 is behind the virtual object plant 710, the plant 710 obscures portions of the user 704.
 FIGs. 7D-7F illustrate various steps of obtaining an intersection of a field of view of a user terminal camera and a full 3D model, in accordance with some embodiments. FIG. 7D illustrates the full 3D model of a room 720. FIG. 7E illustrates a field of view 730 of a user terminal camera in the room (not shown). In FIG. 7E, the field of view is shown as a viewing pyramid; however, alternative shapes of a field of view may also be utilized. FIG. 7F illustrates the intersection 740 of the field of view of the user terminal camera and the full 3D model. In the example intersection, a 3D space is the intersection of a complete room model and a field of view of a camera, which may take the form of a 3D pyramid specified by the real-time camera position and properties. The intersection is thus a truncated 3D reconstruction (3D model) of the space appearing in the remote video view and thus is a part of the more complete 3D reconstruction made by the infrastructure.
 While the above embodiments transmit only the truncated 3D model to reduce bandwidth, it should be noted that further location information may be provided to remote users. For example, even though a remote user only receives the truncated 3D model, the remote user may also receive dimensions of the room, and in further embodiments, information illustrating to the remote user which area of the room corresponds to the truncated 3D model. In such embodiments, the remote user may augment objects to the local user in areas that are outside of the truncated 3D model, even though the remote user did not receive the full 3D model. In some embodiments, the remote user provides coordinates for the augmented object according to the received location information.
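 The intersection of the viewing pyramid and the full 3D model can be sketched as a point-in-pyramid test applied to the model's points. The sketch assumes the model is a plain list of 3D points and that the camera is described by a position, an orthonormal basis (right, up, forward), two field-of-view angles, and a far limit. These parameters and the function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def in_viewing_pyramid(point, cam_pos, right, up, forward,
                       hfov_deg, vfov_deg, far):
    """Return True if the point lies inside the camera's viewing pyramid."""
    rel = tuple(p - c for p, c in zip(point, cam_pos))
    depth = dot(rel, forward)
    if depth <= 0.0 or depth > far:
        return False  # behind the camera or beyond the assumed far limit
    half_width = depth * math.tan(math.radians(hfov_deg) / 2.0)
    half_height = depth * math.tan(math.radians(vfov_deg) / 2.0)
    return abs(dot(rel, right)) <= half_width and abs(dot(rel, up)) <= half_height


def truncate_model(points, **camera):
    """Keep only the model points that fall inside the viewing pyramid."""
    return [p for p in points if in_viewing_pyramid(p, **camera)]


# Illustrative coordinates loosely following FIG. 7A: the lamp is off to the side.
camera = dict(cam_pos=(0.0, 0.0, 0.0), right=(1.0, 0.0, 0.0), up=(0.0, 1.0, 0.0),
              forward=(0.0, 0.0, 1.0), hfov_deg=90.0, vfov_deg=60.0, far=10.0)
desk, plant, lamp = (0.0, 0.0, 3.0), (-1.0, 0.0, 4.0), (-5.0, 0.0, 2.0)
visible = truncate_model([desk, plant, lamp], **camera)
```

 A mesh-based model would clip triangles against the pyramid planes rather than filter points, but the membership test is the same.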
 FIG. 8 is a functional block diagram of components of a camera marker device. In particular, FIG. 8 depicts the camera marker device 800 that includes a processor 802, a camera 804, a display 806, a keypad 808, a non-volatile memory 810, a volatile memory 812, an IP network I/O 814, and a wireless receiver 816 and a wireless transmitter 818 that communicate via the wireless I/O 820 via Wi-Fi, Bluetooth, Infrared, or the like. In some embodiments, the camera marker 800 is provided with audio capture and playback features. Audio may be used to increase the attractiveness and effectiveness of the videos used for announcing/advertising the available AR content. Audio may also be used as a component of the augmented AR content. A microphone can be used to capture user responses or commands.
 When building up a multi-marker setup, various combinations of electronic and paper markers are feasible. In such a setup, for example, a paper marker on the floor could specify the floor level without the risk of an electronic device being stepped on. Paper markers may also be used as a way to balance the trade-off between calibration accuracy and system cost. In addition to graphical markers, natural print-out pictures can also be used as part of a hybrid marker setup. Even natural planar or 3D feature sets can be detected by multiple camera markers and used for augmenting 3D objects.
 In some embodiments, at least some local processing is performed in each marker device in order to reduce the amount of information to be transmitted to the common server. Marker detection is one such local operation. Note that a camera marker setup is relatively stable, so tracking is not needed in camera markers to the same extent as in the user's viewing device (AR glasses or tablet), which moves along with the user. Another example is the control of the wide-angle camera in order to capture, for example, cropped views of other markers (for marker detection and identification), or the user's visual parameters. A third example of local processing is to use the camera view for deriving the actual lighting conditions in the environment in order to adapt the respective properties of the virtual content for improved photorealism.
 Instead of only visual cameras, camera markers can be equipped with 3D cameras, such as RGB-D or ToF sensors, for capturing depth information. As the success of, e.g., the Kinect camera has shown, depth capture can increase the versatility and performance of related functionalities and services. The use of camera markers may encourage the acceptance of 3D cameras as a ubiquitous part of users' environment.
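 The third local-processing example, deriving lighting conditions from the camera view, can be sketched in a very reduced form. The sketch assumes the frame is a nested list of 8-bit RGB tuples and uses the mean Rec. 709 luma as a single ambient intensity. Both function names are hypothetical, and a real system would likely estimate direction and colour of light as well.

```python
def estimate_ambient_intensity(frame):
    """Mean luma of an RGB frame, scaled to the range 0..1.

    Assumption: pixel values are 8-bit and gamma is ignored for simplicity.
    """
    total = 0.0
    count = 0
    for row in frame:
        for r, g, b in row:
            total += 0.2126 * r + 0.7152 * g + 0.0722 * b  # Rec. 709 weights
            count += 1
    if count == 0:
        raise ValueError("empty frame")
    return total / (count * 255.0)


def shade_virtual_color(base_rgb, intensity):
    """Scale a virtual object's base colour by the estimated intensity."""
    return tuple(round(c * intensity) for c in base_rgb)


# A uniformly dark frame leads to a darker rendering of the virtual object.
dark_frame = [[(0, 0, 0)] * 2] * 2
shaded = shade_virtual_color((200, 100, 50), estimate_ambient_intensity(dark_frame))
```

 Running this in the marker device means only one number per frame, rather than the frame itself, needs to reach the common server.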
 In some embodiments, a real-time 3D reconstruction of room-sized spaces may be obtained with a system that uses Kinect Fusion modified to a set of fixed sensors, which might be used also in a system of 3D camera markers.
 Together with the knowledge of the user's real view-point (the information obtained e.g. by analyzing the captured 3D scene, or obtained from virtual glasses), the 3D captured scene can be used to implement the accurate user-perspective AR rendering. A more traditional way of capturing 3D information is to use two (e.g. stereo) or more cameras.
 As described above, multiple markers can be used in AR to give both more and better 3D data of the environment. To provide this benefit, multiple markers are calibrated with respect to each other and the scene. Typically, calibration is performed by capturing the multi-marker scene by a moving external camera and making geometrical calculations from its views.
 Providing the markers with wide-angle cameras enables self-calibration in a multiple camera-marker system. The views of the marker cameras themselves can be used for the mutual calibration of all devices, and the calibration can be updated when necessary, e.g. to adapt into any possible changes in the setup.
 Auto-calibration can also be applied to a setup of multiple camera markers. The calibration is a real-time process and does not need a separate calibration phase. The user may lay markers randomly in suitable places and start tracking immediately. The accuracy of the system improves on the run as the transformation matrices are updated dynamically. Calibration can also be done as a separate stage, and the results can be saved and used later with another application. The above calibration techniques may be applied to various types of markers.
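 The role of the transformation matrices in mutual calibration can be sketched by chaining pairwise observations into a pose relative to one reference marker. The sketch assumes each camera marker reports the pose of a marker it sees as a 4x4 homogeneous matrix. The dictionary layout, the function names, and the translation-only example are illustrative assumptions.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(tx, ty, tz):
    """Homogeneous transform with identity rotation (rotation omitted for brevity)."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def pose_in_reference(path, pairwise):
    """Compose pairwise poses along a chain of markers.

    path: marker names starting at the reference, e.g. ["A", "B", "C"].
    pairwise: {(observer, observed): 4x4 pose of observed in observer's frame}.
    """
    pose = translation(0.0, 0.0, 0.0)  # identity
    for observer, observed in zip(path, path[1:]):
        pose = matmul4(pose, pairwise[(observer, observed)])
    return pose


# Marker A sees B two metres along x; B sees C three metres along y.
pairwise = {("A", "B"): translation(2.0, 0.0, 0.0),
            ("B", "C"): translation(0.0, 3.0, 0.0)}
pose_c_in_a = pose_in_reference(["A", "B", "C"], pairwise)
```

 Updating an entry of `pairwise` whenever a marker re-detects a neighbour and recomputing the chain is one simple way to picture the dynamic refinement described above.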
 In some embodiments, the functions of the described camera marker are performed using a general-purpose consumer tablet computer. A tablet computer may be similar to the camera marker device 800 of FIG. 8, and is generally provided with components such as a display, camera (though typically not with wide-angle optics), and wired and wireless network connections. In some embodiments, a camera marker is implemented using dedicated software running on the tablet device. In some embodiments, the camera marker is implemented using a special-purpose version of a tablet computer. The special-purpose version of the tablet computer may, for example, have reduced memory, lower screen resolution (possibly greyscale only), wide-angle optics, and may be pre-loaded with appropriate software to enable camera marker functionality. In some embodiments, inessential functionality such as GPS, magnetometer, and audio functions may be omitted from the special-purpose tablet computer.
 Exemplary embodiments disclosed herein are implemented using one or more wired and/or wireless network nodes, such as a wireless transmit/receive unit (WTRU) or other network entity.
 FIG. 9A is a system diagram of an exemplary WTRU 902, which may be employed as a user device in embodiments described herein. As shown in FIG. 9A, the WTRU 902 may include a processor 918, a communication interface 919 including a transceiver 920, a transmit/receive element 922, a speaker/microphone 924, a keypad 926, a display/touchpad 928, a non-removable memory 930, a removable memory 932, a power source 934, a global positioning system (GPS) chipset 936, and sensors 938. It will be appreciated that the WTRU 902 may include any subcombination of the foregoing elements while remaining consistent with an embodiment.
 The processor 918 may be a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGAs) circuits, any other type of integrated circuit (IC), a state machine, and the like. The processor 918 may perform signal coding, data processing, power control, input/output processing, and/or any other functionality that enables the WTRU 902 to operate in a wireless environment. The processor 918 may be coupled to the transceiver 920, which may be coupled to the transmit/receive element 922. While FIG. 9A depicts the processor 918 and the transceiver 920 as separate components, it will be appreciated that the processor 918 and the transceiver 920 may be integrated together in an electronic package or chip.
 The transmit/receive element 922 may be configured to transmit signals to, or receive signals from, a base station over the air interface 915/916/917. For example, in one embodiment, the transmit/receive element 922 may be an antenna configured to transmit and/or receive RF signals. In another embodiment, the transmit/receive element 922 may be an emitter/detector configured to transmit and/or receive IR, UV, or visible light signals, as examples. In yet another embodiment, the transmit/receive element 922 may be configured to transmit and receive both RF and light signals. It will be appreciated that the transmit/receive element 922 may be configured to transmit and/or receive any combination of wireless signals.
 In addition, although the transmit/receive element 922 is depicted in FIG. 9A as a single element, the WTRU 902 may include any number of transmit/receive elements 922. More specifically, the WTRU 902 may employ MIMO technology. Thus, in one embodiment, the WTRU 902 may include two or more transmit/receive elements 922 (e.g., multiple antennas) for transmitting and receiving wireless signals over the air interface 915/916/917.
 The transceiver 920 may be configured to modulate the signals that are to be transmitted by the transmit/receive element 922 and to demodulate the signals that are received by the transmit/receive element 922. As noted above, the WTRU 902 may have multi-mode capabilities. Thus, the transceiver 920 may include multiple transceivers for enabling the WTRU 902 to communicate via multiple RATs, such as UTRA and IEEE 802.11, as examples.
 The processor 918 of the WTRU 902 may be coupled to, and may receive user input data from, the speaker/microphone 924, the keypad 926, and/or the display/touchpad 928 (e.g., a liquid crystal display (LCD) display unit or organic light-emitting diode (OLED) display unit). The processor 918 may also output user data to the speaker/microphone 924, the keypad 926, and/or the display/touchpad 928. In addition, the processor 918 may access information from, and store data in, any type of suitable memory, such as the non-removable memory 930 and/or the removable memory 932. The non-removable memory 930 may include random-access memory (RAM), read-only memory (ROM), a hard disk, or any other type of memory storage device. The removable memory 932 may include a subscriber identity module (SIM) card, a memory stick, a secure digital (SD) memory card, and the like. In other embodiments, the processor 918 may access information from, and store data in, memory that is not physically located on the WTRU 902, such as on a server or a home computer (not shown).
 The processor 918 may receive power from the power source 934, and may be configured to distribute and/or control the power to the other components in the WTRU 902. The power source 934 may be any suitable device for powering the WTRU 902. As examples, the power source 934 may include one or more dry cell batteries (e.g., nickel-cadmium (NiCd), nickel-zinc (NiZn), nickel metal hydride (NiMH), lithium-ion (Li-ion), and the like), solar cells, fuel cells, and the like.
 The processor 918 may also be coupled to the GPS chipset 936, which may be configured to provide location information (e.g., longitude and latitude) regarding the current location of the WTRU 902. In addition to, or in lieu of, the information from the GPS chipset 936, the WTRU 902 may receive location information over the air interface 915/916/917 from a base station and/or determine its location based on the timing of the signals being received from two or more nearby base stations. It will be appreciated that the WTRU 902 may acquire location information by way of any suitable location-determination method while remaining consistent with an embodiment.
 The processor 918 may further be coupled to other peripherals 938, which may include one or more software and/or hardware modules that provide additional features, functionality and/or wired or wireless connectivity. For example, the peripherals 938 may include sensors such as an accelerometer, an e-compass, a satellite transceiver, a digital camera (for photographs or video), a universal serial bus (USB) port, a vibration device, a television transceiver, a hands free headset, a Bluetooth® module, a frequency modulated (FM) radio unit, a digital music player, a media player, a video game player module, an Internet browser, and the like.
 FIG. 9B depicts an exemplary network entity 990 that may be used in embodiments of the present disclosure, for example as a common server used for the setup of one or more camera markers. As depicted in FIG. 9B, network entity 990 includes a communication interface 992, a processor 994, and non-transitory data storage 996, all of which are communicatively linked by a bus, network, or other communication path 998.
 Communication interface 992 may include one or more wired communication interfaces and/or one or more wireless-communication interfaces. With respect to wired communication, communication interface 992 may include one or more interfaces such as Ethernet interfaces, as an example. With respect to wireless communication, communication interface 992 may include components such as one or more antennae, one or more transceivers/chipsets designed and configured for one or more types of wireless (e.g., LTE) communication, and/or any other components deemed suitable by those of skill in the relevant art. And further with respect to wireless communication, communication interface 992 may be equipped at a scale and with a configuration appropriate for acting on the network side, as opposed to the client side, of wireless communications (e.g., LTE communications, Wi-Fi communications, and the like). Thus, communication interface 992 may include the appropriate equipment and circuitry (perhaps including multiple transceivers) for serving multiple mobile stations, UEs, or other access terminals in a coverage area.
 Processor 994 may include one or more processors of any type deemed suitable by those of skill in the relevant art, some examples including a general-purpose microprocessor and a dedicated DSP.
 Data storage 996 may take the form of any non-transitory computer-readable medium or combination of such media, some examples including flash memory, read-only memory (ROM), and random-access memory (RAM) to name but a few, as any one or more types of non-transitory data storage deemed suitable by those of skill in the relevant art could be used. As depicted in FIG. 9B, data storage 996 contains program instructions 997 executable by processor 994 for carrying out various combinations of the various network-entity functions described herein.
 Although features and elements are described above in particular combinations, one of ordinary skill in the art will appreciate that each feature or element can be used alone or in any combination with the other features and elements. In addition, the methods described herein may be implemented in a computer program, software, or firmware incorporated in a computer-readable medium for execution by a computer or processor. Examples of computer-readable storage media include, but are not limited to, a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs). A processor in association with software may be used to implement a radio frequency transceiver for use in a WTRU, UE, terminal, base station, RNC, or any host computer.
Priority Applications (2)
|Application Number||Priority Date||Filing Date||Title|
|Publication Number||Publication Date|
|WO2017177019A1 (en)||2017-10-12|
Family Applications (1)
|Application Number||Priority Date||Filing Date||Title|
|PCT/US2017/026378 WO2017177019A1 (en)||2016-04-08||2017-04-06||System and method for supporting synchronous and asynchronous augmented reality functionalities|
Country Status (1)
|WO (1)||WO2017177019A1 (en)|