WO2021190280A1 - System and method for augmented tele-cooperation - Google Patents

System and method for augmented tele-cooperation

Info

Publication number
WO2021190280A1
Authority
WO
WIPO (PCT)
Prior art keywords
virtual object
display
dimensional representation
scene
stream
Prior art date
Application number
PCT/CN2021/079357
Other languages
French (fr)
Inventor
Yuan Tian
Yi Xu
Shuxue Quan
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority to CN202180014796.4A (CN115104078A)
Publication of WO2021190280A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures


Abstract

Techniques for augmented reality systems are described. A computer system receives a stream of images of a scene from a camera of a first device. The computer system constructs a three-dimensional representation of the scene from the stream of images and converts an object in the scene into a virtual object. The computing system renders the three-dimensional representation on a first display. The computing system further transmits the three-dimensional representation to a second device. The computing system renders the three-dimensional representation and the virtual object on a second display of the second device. The computing system receives an annotation to the virtual object in the three-dimensional representation. The computing system updates the virtual object with the annotation on the first display.

Description

SYSTEM AND METHOD FOR AUGMENTED TELE-COOPERATION
BACKGROUND
Augmented Reality (AR) superimposes virtual contents on top of a user’s view of the real world. Using AR, a user can scan the environment using a camera, and a computing system performs visual inertial odometry (VIO) in real time. Once the camera pose is tracked continuously, virtual objects can be placed into the AR scene to create an illusion that real objects and virtual objects are merged together.
SUMMARY
The present disclosure relates generally to methods and systems related to augmented reality applications. More particularly, embodiments of the present disclosure provide methods and systems for tele-cooperation using augmented reality.
Techniques for tele-cooperation using augmented reality are described. A computer system is used for tele-cooperation. The computer system is configured to perform various operations. The operations include receiving, from a camera of a first device, a stream of images of a scene. The operations further include constructing, from the stream of images, a three-dimensional representation of the scene and converting an object in the scene into a virtual object. The operations further include rendering the three-dimensional representation on a first display of the first device. The operations further include transmitting the three-dimensional representation to a second device. The operations further include rendering, on a second display of the second device, the three-dimensional representation and the virtual object. The operations further include receiving, on the second device, an annotation to the virtual object in the three-dimensional representation. The operations further include updating, on the first display, the virtual object with the annotation.
Numerous benefits are achieved by way of the present disclosure over conventional techniques. For example, embodiments of the present disclosure involve methods and systems that provide augmented-reality based tele-cooperation, tele-presence, and tele-immersion. Examples of use cases include manufacturing, medicine, communication, design, and entertainment. These and other embodiments of the disclosure, along with many of its advantages and features, are described in more detail in conjunction with the text below and corresponding figures.
BRIEF DESCRIPTION OF THE DRAWINGS
Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:
FIG. 1 illustrates an example of an augmented reality tele-cooperation environment, according to at least one embodiment of the disclosure.
FIG. 2 illustrates an example of a computer system for augmented reality applications, according to at least one embodiment of the disclosure.
FIG. 3 illustrates an example of a semantic object database, according to at least one embodiment of the disclosure.
FIG. 4 illustrates an example of a process for augmented reality-based tele-cooperation, according to at least one embodiment of the disclosure.
FIG. 5 illustrates an exemplary interaction timeline in an augmented reality-based tele-cooperation environment, according to at least one embodiment of the disclosure.
FIG. 6 illustrates an exemplary computer system, according to embodiments of the present disclosure.
DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS
In the following description, various embodiments will be described. For purposes of explanation, specific configurations and details are set forth to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiment being described.
Embodiments of the present disclosure are directed to, among other things, augmented-reality based tele-cooperation. Tele-cooperation can provide an electronic shared workspace to geographically dispersed users, supporting communication, collaboration, and coordination. Disclosed solutions can use augmented reality (AR) techniques to improve tele-cooperation. For example, disclosed solutions enable a remote user operating a remote computing system to interact virtually with objects in the scene of a local user operating a local computing system.
Disclosed techniques provide benefits relative to existing solutions. For instance, while existing solutions can enable a local user to stream video and audio to a remote expert in real time, such solutions do not facilitate meaningful interactions between a local user and a remote expert. For example, a two-dimensional video stream fails to convey the physical structure of an object in the view of the local user and does not facilitate virtual interactions with the object, instead relying on verbal communications between the local user and the remote expert. In contrast, embodiments of the present disclosure use AR techniques including scene understanding, object detection, semantic labeling of objects, and/or advanced user gestures to improve interaction and collaboration between local and remote users.
The following simplified example is introduced for discussion purposes. A local user and a remote user each wear an AR headset. Each AR headset connects with a respective computing system and a respective user input device. The two computing systems are connected in real-time via a network. The local computing system processes an input video stream from a camera and reconstructs a three-dimensional (3D) scene. The local computing system transmits the reconstructed scene (e.g., including identified objects) to the remote computing system. In turn, the remote computing system displays the scene within a display in the remote user’s AR headset so that the remote user sees what the local user is seeing and doing. The remote user can also interact with the virtual scene. For example, a virtual hand of the remote user can be conveyed to the local user, which enables the local user to better understand the remote user, who may be an expert. The remote user can further add visual or audio instructions or annotations. In some cases, semantic information of virtual objects can be determined and shared.
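The scene hand-off and feedback loop in this example can be pictured as a small message exchange between the two computing systems. The sketch below is illustrative only: the message names (SceneUpdate, Annotation), the JSON wire format, and all field choices are assumptions rather than anything specified by this disclosure.

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class SceneUpdate:
    """Reconstructed scene sent from the local to the remote system (assumed schema)."""
    mesh_vertices: list          # flattened (x, y, z) triples of the scene mesh
    mesh_faces: list             # vertex-index triples describing the mesh
    objects: list = field(default_factory=list)   # detected objects with ids/labels
    timestamp: float = field(default_factory=time.time)

@dataclass
class Annotation:
    """Remote-user feedback attached to a virtual object (assumed schema)."""
    object_id: str               # id of the virtual object being annotated
    kind: str                    # "text", "audio", "gesture", ...
    payload: str                 # text body, file reference, or gesture name
    timestamp: float = field(default_factory=time.time)

def encode(message) -> bytes:
    """Serialize a message for transmission over the network link."""
    return json.dumps(asdict(message)).encode("utf-8")

# Local side: send the reconstructed scene to the remote computing system.
update = SceneUpdate(mesh_vertices=[0, 0, 0, 1, 0, 0, 0, 1, 0],
                     mesh_faces=[0, 1, 2],
                     objects=[{"id": "obj-1", "label": "cylinder"}])
wire_bytes = encode(update)

# Remote side: reply with an annotation attached to the highlighted object.
note = Annotation(object_id="obj-1", kind="text",
                  payload="Loosen this part counter-clockwise first.")
reply_bytes = encode(note)
print(len(wire_bytes), len(reply_bytes))
```

In practice the reconstructed mesh and any audio payloads would be far larger and would typically be streamed incrementally rather than sent as single JSON documents.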
Turning now to the Figures, FIG. 1 illustrates an example of an augmented reality tele-cooperation environment, according to at least one embodiment of the disclosure. FIG. 1 depicts tele-cooperation environment 100, which includes local augmented reality system 110, remote augmented reality system 120, real-world object 130, and network 160.
In the example depicted by FIG. 1, a local user operates local augmented reality system 110 (that is, a first device). Local augmented reality system 110 observes a scene, which includes real-world object 130, and transmits information about the scene and real-world object 130 to remote augmented reality system 120 (that is, a second device), which can visualize the scene and real-world object 130. Further, remote augmented reality system 120 can receive interactions from a remote user and transmit the interactions back via network 160 to local augmented reality system 110, where the interactions can be further visualized by the local user.
Local augmented reality system 110 includes one or more of local computing system 112, display 114 (that is, a first display), input device 116, and camera 118. Remote augmented reality system 120 includes one or more of remote computing system 122, display 124 (that is, a second display), and input device 126. Examples of suitable computing systems are discussed further with respect to FIGS. 2 and 6. For example, as discussed with respect to FIG. 2, a computing system can include AR capabilities including a depth sensor and optical sensor.
Examples of suitable displays include a Liquid Crystal Display (LCD) or a light emitting diode (LED) screen. A display can be standalone, integrated with another device such as a handheld device or smart phone, or disposed within an AR headset. In an embodiment, a display can be divided into a split-screen view, which allows for annotations, documents, or a stream of images to be visualized alongside a virtual environment.
Input devices are capable of receiving inputs from a user, for example, via the user’s hand, arm, or finger. Examples of suitable input devices include cameras or wearable arm or hand sensors that track movement or touch screens or touch surfaces that respond to touch, taps, swipes, or other gestures.
Camera 118 is operable to capture still images or frames of image data, or a stream of image data. Examples of camera 118 include standard smartphone cameras (forward or rear-facing) , depth sensors, infrared cameras, and the like.
Input devices 116 and 126 can each receive interactions from a respective user. For example, a local user operating local augmented reality system 110 can interact with a virtual object by annotating, moving, or otherwise altering the virtual object. In addition, a local user operating local augmented reality system 110 can add annotations to the reconstructed scene as virtual objects. The changes can be sent across network 160 to the remote augmented reality system 120. A remote user who is operating remote augmented reality system 120 can also interact with virtual objects received from local augmented reality system 110. For example, the remote user can insert a hand into the scene to illustrate how to make a repair to an object represented as a virtual object. The hand of the remote user can be visualized as a virtual object and be included in an updated scene, which can be sent to local augmented reality system 110 and rendered on display 114. A virtual hand can be moved to make an object appear to move in the scene.
Network 160 connects local computing system 112 and remote computing system 122. Network 160 can be a wired or a wireless network. Various information can be transmitted across network 160, for example, information about three-dimensional objects, images or video of a scene captured by camera 118, annotations, interactions, and semantic information.
In an embodiment, as discussed further with respect to FIG. 3, one or more of local computing system 112 or remote computing system 122 can perform semantic analysis of detected objects. Such analysis can be performed online (e.g., in real-time) or offline (e.g., prior to use of the augmented reality system). For example, semantic information about real-world object 130, such as a type, size, location, or owner, can be determined.
FIG. 2 illustrates an example of a computer system for AR applications, according to at least one embodiment of the disclosure. Computer system 210 is an example of local computing system 112 and remote computing system 122. Each of local computing system 112 and remote computing system 122 can include AR capabilities.
More specifically, the AR applications can be implemented by an AR module 216 of the computer system 210. Generally, the RGB optical sensor 214 generates an RGB image of a real-world environment that includes, for instance, a real-world object 230. The depth sensor 212 generates depth data about the real-world environment, where this data includes, for instance, a depth map that shows depth(s) of the real-world object 230 (e.g., distance(s) between the depth sensor 212 and the real-world object 230). Following an initialization of an AR session (where this initialization can include calibration and tracking), the AR module 216 renders an AR scene 220 of the real-world environment in the AR session, where this AR scene 220 can be presented at a graphical user interface (GUI) on a display of the computer system 210. The AR scene 220 shows a real-world object representation 222 of the real-world object 230, for example, as a video feed on the display. In addition, the AR scene 220 shows a virtual object 224 not present in the real-world environment. The AR module 216 can generate a red, green, blue, and depth (RGBD) image from the RGB image and the depth map to detect an occlusion of the virtual object 224 by at least a portion of the real-world object representation 222 or vice versa. The AR module 216 can additionally or alternatively generate a 3D model of the real-world environment based on the depth map, where the 3D model includes multi-level voxels. Such voxels are used to detect collision between the virtual object 224 and at least a portion of the real-world object representation 222. The AR scene 220 can be rendered to properly show the occlusion and avoid the rendering of the collision.
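A per-pixel depth comparison is one plausible way to realize the occlusion test described above, once the virtual object has been rasterized into a depth buffer aligned with the sensor's depth map. The minimal sketch below assumes that alignment has already been done; the disclosure does not mandate this particular formulation.

```python
import numpy as np

def occlusion_mask(real_depth: np.ndarray, virtual_depth: np.ndarray) -> np.ndarray:
    """Return a boolean mask of pixels where the real scene is closer than the
    virtual object, i.e. where the virtual object should be hidden.

    real_depth:    depth map from the depth sensor (metres), 0 where invalid.
    virtual_depth: depth buffer of the rendered virtual object (metres),
                   np.inf where the object does not cover the pixel.
    """
    valid = real_depth > 0                       # ignore pixels with no depth reading
    covered = np.isfinite(virtual_depth)         # pixels the virtual object actually covers
    return valid & covered & (real_depth < virtual_depth)

# Toy 4x4 example: the virtual object sits 1.5 m away; a real surface at 1.0 m
# covers the left half of the frame and should occlude the object there.
real = np.full((4, 4), 2.0)
real[:, :2] = 1.0
virt = np.full((4, 4), np.inf)
virt[1:3, 1:3] = 1.5
print(occlusion_mask(real, virt).astype(int))
```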
In an example, the computer system 210 represents a suitable user device that includes, in addition to the depth sensor 212 and the RGB optical sensor 214, one or more graphical processing units (GPUs), one or more general purpose processors (GPPs), and one or more memories storing computer-readable instructions that are executable by at least one of the processors to perform various functionalities of the embodiments of the present disclosure. For instance, the computer system 210 can be any of a smartphone, a tablet, an AR headset, or a wearable AR device.
The depth sensor 212 has a known maximum depth range (e.g., a maximum working distance) and this maximum value may be stored locally and/or accessible to the AR module 216. The depth sensor 212 can be a ToF camera. In this case, the depth map generated by the depth sensor 212 includes a depth image. The RGB optical sensor 214 can be a color camera. The depth image and the RGB image can have different resolutions. Typically, the resolution of the depth image is smaller than that of the RGB image. For instance, the depth image has a 640x180 resolution, whereas the RGB image has a 2920x1280 resolution.
In addition, the depth sensor 212 and the RGB optical sensor 214, as installed in the computer system 210, may be separated by a transformation (e.g., distance offset, field of view angle difference, etc.). This transformation may be known and its value may be stored locally and/or accessible to the AR module 216. When cameras are used, the ToF camera and the color camera can have similar fields of view. But because of the transformation, the fields of view would partially, rather than fully, overlap.
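When the depth-to-RGB transformation is known, it can be applied as an ordinary rigid-body transform so that depth samples are expressed in the color camera's frame before the two images are fused. The rotation and translation values below are placeholders, not calibration data from the disclosure.

```python
import numpy as np

# Hypothetical extrinsics: rotation R and translation t that map points from the
# depth-sensor frame into the RGB-camera frame (a few centimetres of offset).
R = np.eye(3)
t = np.array([0.025, 0.0, 0.0])   # 2.5 cm baseline along x, illustrative only

def depth_point_to_rgb_frame(p_depth: np.ndarray) -> np.ndarray:
    """Express a 3D point measured by the depth sensor in the RGB camera frame."""
    return R @ p_depth + t

p = np.array([0.1, -0.05, 1.2])   # a point 1.2 m in front of the depth sensor
print(depth_point_to_rgb_frame(p))
```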
The AR module 216 can be implemented as specialized hardware and/or a combination of hardware and software (e.g., general purpose processor and computer-readable instructions stored in memory and executable by the general purpose processor) . In addition to initializing an AR session and performing Visual Inertial Odometry (VIO) , the AR module 216 can detect occlusion and collision to properly render the AR scene 220.
In some embodiments, AR module 216 can perform object detection. Image processing techniques may be used on detected image data to identify objects. For example, edge detection may be used to identify a section within the image data that includes an object. Discontinuities in brightness, color, and/or texture may be identified across an image to detect edges of various objects within the image.
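As a rough illustration of this kind of edge-based detection, the sketch below uses OpenCV's Canny detector and contour extraction to propose object regions; it assumes OpenCV 4.x and arbitrary thresholds that would need tuning for real scenes.

```python
import cv2
import numpy as np

def detect_object_regions(bgr_image: np.ndarray, low=50, high=150):
    """Return bounding boxes of candidate objects found via brightness discontinuities."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)                  # edges from intensity changes
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Keep contours large enough to plausibly be objects rather than noise.
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 100]

# Synthetic test image: a bright rectangle on a dark background.
img = np.zeros((240, 320, 3), dtype=np.uint8)
cv2.rectangle(img, (100, 80), (220, 180), (255, 255, 255), thickness=-1)
print(detect_object_regions(img))
```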
In some cases, a depth map is generated. A depth map can be used for embodiments such as object detection. For example, sensor data captured from depth sensor 212 and/or image data captured from RGB optical sensor 214 can be used to determine a depth map. Depth information can include a value that is assigned to each pixel. Each value represents a distance between the user device and a particular point corresponding to the location of that pixel. The depth information may be analyzed to detect sudden variances in depth. For example, sudden changes in distance may indicate an edge or a border of an object.
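The depth-variance check can be sketched as a simple gradient threshold over the depth map; the 0.1 m jump threshold below is an assumed value for illustration.

```python
import numpy as np

def depth_edges(depth_map: np.ndarray, jump_threshold: float = 0.1) -> np.ndarray:
    """Mark pixels where depth changes abruptly between neighbours (likely object borders)."""
    dy, dx = np.gradient(depth_map.astype(float))
    return np.hypot(dx, dy) > jump_threshold

# Toy depth map: a box 0.8 m away sitting in front of a wall 2.0 m away.
depth = np.full((6, 8), 2.0)
depth[2:5, 3:6] = 0.8
print(depth_edges(depth).astype(int))
```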
In some embodiments, both image data and depth information can be used. In an embodiment, objects may first be identified in either the image data or the depth information and various attributes of the objects may be determined from the other information. For example, edge detection techniques may be used to identify a section of the image data that includes an object. The section may then be mapped to a corresponding section in the depth information to determine depth information for the identified object (e.g., a point cloud) . In another example, a section that includes an object may first be identified within the depth information. In this example, the section may then be mapped to a corresponding section in the image data to determine appearance attributes for the identified object (e.g., color or texture values) .
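Mapping an image-detected region to the depth map and recovering a point cloud for the object amounts to back-projecting the region's depth pixels through the camera intrinsics. The intrinsic parameters below are placeholders, not values from the disclosure.

```python
import numpy as np

# Hypothetical pinhole intrinsics of the depth camera (fx, fy, cx, cy).
FX, FY, CX, CY = 500.0, 500.0, 160.0, 120.0

def region_to_point_cloud(depth_map: np.ndarray, box) -> np.ndarray:
    """Back-project the depth pixels inside an (x, y, w, h) box into 3D points."""
    x, y, w, h = box
    us, vs = np.meshgrid(np.arange(x, x + w), np.arange(y, y + h))
    z = depth_map[y:y + h, x:x + w]
    keep = z > 0                               # drop invalid depth readings
    u, v, z = us[keep], vs[keep], z[keep]
    # Standard pinhole model: X = (u - cx) * z / fx, Y = (v - cy) * z / fy.
    return np.stack([(u - CX) * z / FX, (v - CY) * z / FY, z], axis=1)

depth = np.full((240, 320), 1.5)
cloud = region_to_point_cloud(depth, (100, 80, 40, 30))
print(cloud.shape)   # (1200, 3) points for the identified object
```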
In some embodiments, various attributes (e.g., color, texture, point cloud data, object edges) of an object identified in sensor data may be used as input to a machine learning module to identify or generate a 3D model that matches the identified object. In some embodiments, a point cloud for the object may be generated from the depth information and/or image data and compared to point cloud data stored in a database to identify a closest matching 3D model. Alternatively, a 3D model of an object (e.g., a user or a product) may be generated using the sensor data. A mesh may be created from point cloud data obtained from a section of depth information. The system may then map appearance data from a section of image data corresponding to the section to the mesh to generate a basic 3D model. Although particular techniques are described, it should be noted that there are a number of techniques for identifying particular objects from sensor output.
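Closest-model lookup against a database can be approximated, for illustration only, with a crude shape descriptor; a real system would more likely use learned embeddings or point-cloud registration such as ICP. The model names and dimensions below are invented for the example.

```python
import numpy as np

def shape_descriptor(points: np.ndarray) -> np.ndarray:
    """Very rough descriptor: sorted bounding-box extents of a centred point cloud."""
    centred = points - points.mean(axis=0)
    return np.sort(centred.max(axis=0) - centred.min(axis=0))

def closest_model(query: np.ndarray, model_clouds: dict) -> str:
    """Return the database model whose descriptor is nearest to the query's."""
    q = shape_descriptor(query)
    return min(model_clouds,
               key=lambda name: np.linalg.norm(shape_descriptor(model_clouds[name]) - q))

rng = np.random.default_rng(0)
database = {
    "cylinder": rng.uniform([-0.05, -0.05, 0.0], [0.05, 0.05, 0.2], (500, 3)),
    "table_top": rng.uniform([-0.5, -0.5, 0.0], [0.5, 0.5, 0.03], (500, 3)),
}
scan = rng.uniform([-0.05, -0.05, 0.0], [0.05, 0.05, 0.2], (300, 3))  # cylinder-like scan
print(closest_model(scan, database))
```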
FIG. 3 illustrates an example of a semantic object database, according to at least one embodiment of the disclosure. In the example depicted in FIG. 3, object database 302 includes entries 304a-n. Object database 302 can be used to determine semantic information about objects detected in an image or video stream. Each entry 304a-n includes semantic information that represents an object. In some cases, the objects are objects that appear in a scene. In other cases, the objects may be known and information about the objects stored in a database.
Object database 302 can be domain-specific. For example, for augmented reality applications related to repairing automobiles, object database 302 may contain semantic and other information about car parts. In another example, for augmented reality applications about home improvement, object database 302 may contain semantic and other information about common tools, standard building materials, and so forth.
In the example depicted in FIG. 3, entries 304a-n each refer to a different object within real-world object 130. For example, entry 304a refers to the table depicted in real-world object 130, entry 304b to the first leg, entry 304c to the second leg, entry 304d to the table top, entry 304n to the cylinder, and so forth.
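A minimal stand-in for object database 302 might keep entries 304a-n as keyed records; the field names (label, part_name, notes) are an assumed schema, since the disclosure leaves the exact contents of each entry open.

```python
# A minimal in-memory stand-in for object database 302 with entries 304a-n.
object_database = {
    "304a": {"label": "table",     "part_name": "assembled table", "notes": "parent object"},
    "304b": {"label": "table leg", "part_name": "leg #1",          "notes": "first leg"},
    "304c": {"label": "table leg", "part_name": "leg #2",          "notes": "second leg"},
    "304d": {"label": "table top", "part_name": "top panel",       "notes": "load-bearing surface"},
    "304n": {"label": "cylinder",  "part_name": "spacer cylinder", "notes": "sits on the table top"},
}

def lookup_semantic_info(label: str):
    """Return every entry whose label matches a detected object's label."""
    return [entry for entry in object_database.values() if entry["label"] == label]

print(lookup_semantic_info("table leg"))
```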
FIG. 4 illustrates an example flow for augmented reality-based tele-cooperation, according to embodiments of the present disclosure. The flow is described in connection with a computer system that is an example of the computer systems described herein. Some or all of the operations of the flows can be implemented via specific hardware on the computer system and/or can be implemented as computer-readable instructions stored on a non-transitory computer-readable medium of the computer system. As stored, the computer-readable instructions represent programmable modules that include code executable by a processor of the computer system. The execution of such instructions configures the computer system to perform the respective operations. Each programmable module in combination with the processor represents a means for performing a respective operation(s). While the operations are illustrated in a particular order, it should be understood that no particular order is necessary and that one or more operations may be omitted, skipped, and/or reordered.
Flow 400 involves augmented reality systems. For example, a first device, e.g., local augmented reality system 110, and a second device, e.g., remote augmented reality system 120, communicate with each other.
In some cases, before or during execution of flow 400, authentication is performed. An authentication request can be transmitted from the first device to the second device. Operations performed in flow 400, e.g., receiving the stream of images or sending information across a network, can therefore be conditioned on the second device accepting the authentication request.
At block 402, the computer system receives, from a camera of a first device, a stream of images of a scene. For example, referring back to FIG. 1, local computing system 112 of local augmented reality system 110 receives a stream of images of a scene from camera 118. The stream of images can include different information such as pixels in a color space (e.g., RGB), infrared spectrum information, or depth information. The scene can include a pixel representation of one or more real-world objects, e.g., real-world object 130, which are present but not yet detected.
At block 404, the computer system constructs, from the stream of images, a three-dimensional representation of the scene and converts an object in the scene into a virtual object. Local computing system 112 constructs a three-dimensional representation of the scene. The three-dimensional representation includes a representation of objects in the scene.
In an embodiment, the computer system can identify one or more objects and then convert the object (s) into a virtual object. Once the object (s) are detected, a 3D reconstruction of the environment is generated. The reconstruction can be a polygonal mesh model of the scene with textures.
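A minimal sketch of one such reconstruction is given below using the open-source Open3D library, assuming RGB-D frames and default PrimeSense intrinsics. The file paths are placeholders, and this is only one possible pipeline for obtaining a polygonal mesh of the environment; transferring full textures onto the mesh would be a separate step.

```python
import open3d as o3d  # assumed dependency; any reconstruction library could be used

def reconstruct_scene_mesh(color_path: str, depth_path: str) -> o3d.geometry.TriangleMesh:
    """Build a polygonal mesh of the scene from a single RGB-D frame."""
    color_img = o3d.io.read_image(color_path)   # placeholder file paths
    depth_img = o3d.io.read_image(depth_path)
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color_img, depth_img, convert_rgb_to_intensity=False)
    intrinsic = o3d.camera.PinholeCameraIntrinsic(
        o3d.camera.PinholeCameraIntrinsicParameters.PrimeSenseDefault)
    cloud = o3d.geometry.PointCloud.create_from_rgbd_image(rgbd, intrinsic)
    cloud.estimate_normals()
    # Poisson surface reconstruction turns the point cloud into a triangle mesh.
    mesh, _densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        cloud, depth=8)
    return mesh
```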
Continuing the example, the three-dimensional representation includes a three-dimensional representation of real-world object 130. Once the real-world object 130 is detected, local computing system 112 converts the real-world object 130 into a virtual object. Operations can then be performed on the virtual object, for example rotation, flipping, resizing, and so forth.
In a further embodiment, as discussed with respect to FIG. 3, objects can be identified and matched in a semantic database. Identification of objects and matching of labels can be performed on the local computing system 112, the remote computing system 122, or both systems. Matching can be performed based on determined characteristics of the object, e.g., shape, size, contours, edges, etc. For example, when an object is detected in the stream of images, local computing system 112 can create a visual signature for the object and search the object database. If a match is found, the corresponding semantic information can be transmitted via network 160 to remote computing system 122 and a label displayed on display 124. In this manner, the remote user can benefit from the identification of the object. Semantic information can include a description, part name, serial number, and so on. In some cases, optical character recognition (OCR) can be performed on an image of an object to determine the semantic information.
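The signature-and-lookup step could be realized as sketched below. The color-histogram signature, the `signature_db` layout, and the similarity threshold are illustrative assumptions; a real system would likely use a learned embedding, and an OCR pass could serve as a fallback when no match clears the threshold.

```python
import numpy as np

def visual_signature(rgb_patch: np.ndarray, bins: int = 8) -> np.ndarray:
    """Compute a simple color-histogram signature for a cropped object image."""
    hist, _ = np.histogramdd(
        rgb_patch.reshape(-1, 3), bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)))
    vec = hist.ravel().astype(np.float64)
    return vec / (np.linalg.norm(vec) + 1e-9)

def match_semantic_label(rgb_patch, signature_db, threshold=0.85):
    """Return (label, score) of the best database match, or (None, score).

    signature_db: dict mapping semantic label -> reference signature vector.
    """
    sig = visual_signature(rgb_patch)
    label, score = max(((lbl, float(sig @ ref)) for lbl, ref in signature_db.items()),
                       key=lambda item: item[1])
    return (label, score) if score >= threshold else (None, score)
```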
At block 406, the computer system renders the three-dimensional representation on a first display of the first device. Continuing the example, local computing system 112 renders real-world object 130 on display 114.
In an embodiment, display 114 can be sub-divided into two or more sections. For example, a first section can display the rendered scene including the rendered real-world object 130, and a second section can display notes, comments, or annotations.
At block 408, the computer system transmits the three-dimensional representation to a second device. Local computing system 112 transmits the three-dimensional representation across network 160 to the remote augmented reality system 120. Remote computing system 122 receives the three-dimensional representation. If encrypted and/or encoded, remote computing system 122 can decrypt and/or decode the three-dimensional representation as necessary.
In an embodiment, the stream of images is transmitted separately from or together with the three-dimensional representation. For example, local computing system 112 can transmit the stream of images from camera 118, optionally encoded and/or encrypted, across network 160 to the remote augmented reality system 120.
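One possible encoding path for the transmitted representation is sketched below: serialize, compress, and encrypt before sending, then invert the steps on the remote computing system. The JSON mesh layout and the choice of the Fernet cipher from the `cryptography` package are assumptions made for illustration; the disclosure does not mandate any particular format or cipher.

```python
import json
import zlib
from cryptography.fernet import Fernet  # one possible symmetric cipher

def encode_representation(mesh_dict: dict, key: bytes) -> bytes:
    """Serialize, compress, and encrypt a 3D representation for transmission."""
    raw = json.dumps(mesh_dict).encode("utf-8")
    return Fernet(key).encrypt(zlib.compress(raw))

def decode_representation(payload: bytes, key: bytes) -> dict:
    """Inverse of encode_representation, run on the remote computing system."""
    return json.loads(zlib.decompress(Fernet(key).decrypt(payload)))

# Round trip with an illustrative single-triangle mesh.
key = Fernet.generate_key()
mesh = {"vertices": [[0, 0, 0], [1, 0, 0], [0, 1, 0]], "faces": [[0, 1, 2]]}
assert decode_representation(encode_representation(mesh, key), key) == mesh
```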
At block410, the computer system renders, on a second display of the second device, the three-dimensional representation and the virtual object. Remote computing system 122 receives the three-dimensional representation and renders the three-dimensional representation on display 124. For example, the real-world object 130 is rendered on display 124. In this manner, a user operating remote augmented reality system 120 can interact with the real-world object 130in a virtual manner.
As discussed, in an embodiment, the stream of images can be sent over network 160. If the stream of images is transmitted, then the remote computing system 122 decrypts and/or decodes the stream. The stream can be visualized at the remote augmented reality system 120 on display 124. In a further embodiment, when an object is detected in the stream of images at the remote augmented reality system 120, remote computing system 122 can create a visual signature for the object and search the object database. If a match is found, the corresponding semantic information can be displayed on display 124. After an initial 3D reconstruction, incremental updates can be sent to the remote system in a streaming fashion.
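Incremental updates could be expressed as vertex deltas between successive reconstructions, as in the sketch below. The delta format and the assumption that consecutive snapshots share vertex indices are illustrative choices, not requirements of the disclosure.

```python
import numpy as np

def vertex_delta(prev_vertices, new_vertices, tol: float = 1e-4) -> dict:
    """Return only the vertices that moved (or were appended) since the last snapshot."""
    prev_v = np.asarray(prev_vertices, dtype=float)
    new_v = np.asarray(new_vertices, dtype=float)
    n = min(len(prev_v), len(new_v))
    moved = np.where(np.linalg.norm(new_v[:n] - prev_v[:n], axis=1) > tol)[0]
    return {
        "moved_indices": moved.tolist(),
        "moved_vertices": new_v[moved].tolist(),
        "appended_vertices": new_v[n:].tolist(),   # mesh growth since last update
    }
```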
At block 412, the computer system receives, on the second device, an annotation to the virtual object in the three-dimensional representation. Continuing the example, a user annotates one or more portions of real-world object 130. For example, a remote user expert operating remote augmented reality system 120 can provide notes or comments for the benefit of a local user of the local augmented reality system 110. Annotations can also include movements such as rotations, which may help illustrate a concept better than text or audio alone. For example, by rotating a virtual object, the remote user can illustrate a point about a portion of the object that may have been obscured.
Because the position and orientation of the camera on an AR-enabled device are tracked continuously, the visual instructions can be rendered on the screen of the local user as if they are in the same location and orientation as they are in the 3D virtual workspace of the remote user, and vice versa. For example, the local user can also use hand gestures and other tools to place markups in the scene, and the remote expert can view the markups as well to confirm potential questions from the local user.
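Keeping a markup anchored to the scene amounts to projecting its world-space position through the tracked camera pose and intrinsics each frame, as in the sketch below. The variable names and the pinhole model are illustrative assumptions; the disclosure only requires that markups stay attached to the scene as the camera moves.

```python
import numpy as np

def project_annotation(point_world, camera_from_world, intrinsics):
    """Project a world-anchored annotation point into the current camera image.

    point_world: (3,) XYZ of the markup in the shared workspace frame.
    camera_from_world: 4x4 rigid transform from the tracked camera pose.
    intrinsics: 3x3 pinhole matrix K.
    """
    p = camera_from_world @ np.append(np.asarray(point_world, dtype=float), 1.0)
    if p[2] <= 0:                       # behind the camera: not visible this frame
        return None
    u, v, w = intrinsics @ p[:3]
    return np.array([u / w, v / w])     # pixel coordinates on the local display

# Illustrative usage: the world origin sits 2 units in front of the camera.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[2, 3] = 2.0
print(project_annotation([0.0, 0.0, 0.0], pose, K))   # -> [320. 240.]
```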
Users operating local computing system 112 or remote computing system 122 can interact with virtual objects in a variety of ways. For example, a computing system receives, from an input device, an interaction with a virtual object. An interaction can be triggered by any manner of user interface gestures, such as a tap, touch, drag, or hand gesture. The computing system then responds according to a predefined meaning of the interaction. Interactions on one system, e.g., local augmented reality system 110, may be transmitted to the other system, e.g., remote augmented reality system 120, and vice versa. In this manner, each user can see what the other user is doing.
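A predefined mapping from recognized gestures to object actions could be as simple as a dispatch table, sketched below. The gesture names and the object methods are hypothetical; the resulting action can also be mirrored to the other device so that both users see the same change.

```python
class DemoObject:
    """Tiny stand-in virtual object used only to exercise the dispatch table."""
    def toggle_highlight(self): print("highlight toggled")
    def translate(self, dx, dy, dz): print(f"moved by ({dx}, {dy}, {dz})")
    def rotate(self, axis, degrees): print(f"rotated {degrees} deg about {axis}")
    def scale(self, factor): print(f"scaled by {factor}")

def apply_interaction(virtual_object, gesture: str) -> bool:
    """Map a recognized user-interface gesture to an action on a virtual object."""
    actions = {
        "tap": lambda obj: obj.toggle_highlight(),
        "drag": lambda obj: obj.translate(0.05, 0.0, 0.0),
        "rotate": lambda obj: obj.rotate(axis="y", degrees=15),
        "pinch": lambda obj: obj.scale(factor=0.9),
    }
    action = actions.get(gesture)
    if action is None:
        return False
    action(virtual_object)
    return True

apply_interaction(DemoObject(), "rotate")
```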
Further examples of annotations include text, audio, and video. In an embodiment, remote computing system 122 can record an audio file of the remote user explaining a concept and transmit the audio file across network 160 as an annotation. The local user can then play the audio file on the local augmented reality device.
At block 414, the computer system updates, on the first display, the virtual object with the annotation. Continuing the example, local computing system 112 causes the virtual object to be displayed on display 114 with the annotations from the remote augmented reality system 120.
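Applying a received annotation to the local rendering could follow the sketch below. The plain-dict message layout and the object methods are assumptions introduced for illustration; audio playback and text display would be delegated to the local device.

```python
def apply_annotation(virtual_object, annotation: dict) -> None:
    """Apply a received annotation message to the locally rendered virtual object."""
    kind = annotation.get("type")
    if kind == "text":
        virtual_object.attach_note(annotation["text"])
    elif kind == "transform":
        virtual_object.rotate(axis=annotation["axis"], degrees=annotation["degrees"])
    elif kind == "audio":
        virtual_object.queue_audio(annotation["audio_bytes"])
    else:
        raise ValueError(f"unknown annotation type: {kind!r}")
```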
It should be appreciated that the specific steps illustrated in FIG. 4 provide a particular method of performing augmented reality-based tele-cooperation, according to one embodiment. Other sequences of steps may also be performed according to alternative embodiments. For example, alternative embodiments of the present disclosure may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in FIG. 4 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
In an embodiment, the position and orientation of camera 118 are tracked continuously. Virtual object (s) are rendered using the camera pose and appear as if they are attached to the physical environment of the local user. In this manner, both local and remote users can continue to interact with the virtual environment, and the displays of both devices are updated. FIG. 5 depicts one such example.
FIG. 5 illustrates an exemplary interaction timeline in an augmented reality-based tele-cooperation environment, according to at least one embodiment of the disclosure. FIG. 5 depicts views 501-504. Each of views 501-504 includes contents of display 114 (on local augmented reality system 110) and display 124 (on remote augmented reality system 120). For example purposes, views 501-504 occur sequentially in time, but other orderings are possible. Display 114 includes two sub-displays 114a-b. Display 124 includes two sub-displays 124a-b. Sub-displays can be virtual (e.g., one split screen) or physical (e.g., two physical displays).
In view 501, sub-display 114a of display 114 is empty. Sub-display 114b displays a real-world object. As depicted in sub-display 114b, a local user is pointing to a cylinder object on top of a table. Local augmented reality system 110 transmits a 3D representation of the real-world object and an indication that the local user is pointing to the real-world object to remote augmented reality system 120.
In turn, as depicted in view 502, sub-display 124b of remote augmented reality system 120 shows a highlighted cylinder object and the local user's hand pointing to the object.
In view 503, the remote user sees the local user's hand and/or the highlighted object and locates a relevant document (e.g., an instruction manual). The remote user causes the document to be displayed in sub-display 124a. Further, the remote user demonstrates a pinch technique with two hands on the object.
In view 504, the pinch technique demonstrated by the remote user is shown on sub-display 114b. Additionally, the document selected by the remote user is shown in sub-display 114a.
FIG. 6 illustrates an exemplary computer system 600, according to embodiments of the present disclosure. The computer system 600 is an example of the computer system described herein above. Although these components are illustrated as belonging to a same computer system 600, the computer system 600 can also be distributed.
The computer system 600 includes at least a processor 602, a memory 604, a storage device 606, input/output peripherals (I/O) 608, communication peripherals 610, and an interface bus 612. The interface bus 612 is configured to communicate, transmit, and transfer data, controls, and commands among the various components of the computer system 600. The memory 604 and the storage device 606 include computer-readable storage media, such as RAM, ROM, electrically erasable programmable read-only memory (EEPROM), hard drives, CD-ROMs, optical storage devices, magnetic storage devices, electronic non-volatile computer storage, for example flash memory, and other tangible storage media. Any of such computer-readable storage media can be configured to store instructions or program codes embodying embodiments of the disclosure. The memory 604 and the storage device 606 also include computer-readable signal media. A computer-readable signal medium includes a propagated data signal with computer-readable program code embodied therein. Such a propagated signal takes any of a variety of forms including, but not limited to, electromagnetic, optical, or any combination thereof. A computer-readable signal medium includes any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use in connection with the computer system 600.
Further, the memory 604 includes an operating system, programs, and applications. The processor 602 is configured to execute the stored instructions and includes, for example, a logical processing unit, a microprocessor, a digital signal processor, and other processors. The memory 604 and/or the processor 602 can be virtualized and can be hosted within another computer system of, for example, a cloud network or a data center. The I/O peripherals 608 include user interfaces, such as a keyboard, screen (e.g., a touch screen) , microphone, speaker, other input/output devices, and computing components, such as graphical processing units, serial ports, parallel ports, universal serial buses, and other input/output peripherals. The I/O peripherals 608 are connected to the processor 602 through any of the ports coupled to the interface bus 612. The communication peripherals 610 are configured to facilitate communication between the computer system 600 and other computing devices over a communications network and include, for example, a network interface controller, modem, wireless and wired interface cards, antenna, and other communication peripherals.
While the present subject matter has been described in detail with respect to specific embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing may readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, it should be understood that the present disclosure has been presented for purposes of example rather than limitation, and does not preclude inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. Indeed, the methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the present disclosure. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the present disclosure.
Unless specifically stated otherwise, it is appreciated that throughout this specification, discussions utilizing terms such as “processing, ” “computing, ” “calculating, ” “determining, ” and “identifying” or the like refer to actions or processes of a computing device, such as one or more computers or a similar electronic computing device or devices, that manipulate or transform data represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the computing platform.
The system or systems discussed herein are not limited to any particular hardware architecture or configuration. A computing device can include any suitable arrangement of components that provide a result conditioned on one or more inputs. Suitable computing devices include multipurpose  microprocessor-based computer systems accessing stored software that programs or configures the computer system from a general-purpose computing apparatus to a specialized computing apparatus implementing one or more embodiments of the present disclosure. Any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
Embodiments of the methods disclosed herein may be performed in the operation of such computing devices. The order of the blocks presented in the examples above can be varied; for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
Conditional language used herein, such as, among others, “can, ” “could, ” “might, ” “may, ” “e.g., ” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain examples include, while other examples do not include, certain features, elements, and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more examples or that one or more examples necessarily include logic for deciding, with or without author input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular example.
The terms “including, ” “comprising, ” “having, ” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list. The use of “adapted to” or “configured to” herein is meant as open and inclusive language that does not foreclose devices adapted to or configured to perform additional tasks or steps. Additionally, the use of “based on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited. Similarly, the use of “based at least in part on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based at least in part on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited. Headings, lists, and numbering included herein are for ease of explanation only and are not meant to be limiting.
The various features and processes described above may be used independently of one another, or may be combined in various ways. All possible combinations and sub-combinations are  intended to fall within the scope of the present disclosure. In addition, certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state. The example blocks or states may be performed in serial, in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed examples. Similarly, the example systems and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged compared to the disclosed examples.

Claims (20)

  1. A method for augmented tele-cooperation, comprising:
    receiving, from a camera of a first device, a stream of images of a scene;
    constructing, from the stream of images, a three-dimensional representation of the scene and converting an object in the scene into a virtual object;
    rendering the three-dimensional representation on a first display of the first device;
    transmitting the three-dimensional representation to a second device;
    rendering, on a second display of the second device, the three-dimensional representation and the virtual object;
    receiving, on the second device, an annotation to the virtual object in the three-dimensional representation; and
    updating, on the first display, the virtual object with the annotation.
  2. The method of claim 1, further comprising:
    transmitting the stream of images to the second device; and
    visualizing, on the second display, the stream of images.
  3. The method of claim 1, further comprising transmitting, from the first device to the second device, an authentication request, wherein receiving the stream of images occurs subsequent to the second device accepting the authentication request.
  4. The method of claim 1, further comprising:
    receiving a user input gesture from a user input device on the second device;
    identifying an action corresponding to the user input gesture; and
    performing the action on the virtual object.
  5. The method of claim 1, further comprising:
    identifying a semantic label associated with the virtual object;
    transmitting the semantic label to the second device; and
    rendering the semantic label on the second display.
  6. The method of claim 1, wherein the annotation comprises audio, and the method further comprises playing the audio on the first device.
  7. The method of claim 1, further comprising:
    receiving, on the first device, an annotation to the virtual object in the three-dimensional representation; and
    transmitting the annotation to the second device.
  8. An augmented reality system comprising:
    one or more processors;
    a first device comprising a camera and a first display;
    a second device comprising a second display; and
    one or more memories storing computer-readable instructions that, upon execution by the one or more processors, configure the processors to:
    receive, from the camera, a stream of images of a scene;
    construct, from the stream of images, a three-dimensional representation of the scene and convert an object in the scene into a virtual object;
    render the three-dimensional representation on the first display;
    render, on the second display, the three-dimensional representation and the virtual object;
    receive, on the second device, an annotation to the virtual object in the three-dimensional representation; and
    update, on the first display, the virtual object with the annotation.
  9. The augmented reality system of claim 8, wherein the one or more memories store computer-readable instructions that, upon execution by the one or more processors, further configure the processors to:
    transmit the stream of images to the second device; and
    visualize the stream of images on the second display.
  10. The augmented reality system of claim 8, wherein the one or more memories store computer-readable instructions that, upon execution by the one or more processors, further configure the processors to transmit, from the first device to the second device, an authentication request, wherein receiving the stream of images occurs subsequent to an acceptance of the authentication request.
  11. The augmented reality system of claim 8, wherein the one or more memories store computer-readable instructions that, upon execution by the one or more processors, further configure the processors to:
    receive a user input gesture from a user input device on the first device;
    identify an action corresponding to the user input gesture; and
    perform the action on the virtual object.
  12. The augmented reality system of claim 8, wherein the one or more memories store computer-readable instructions that, upon execution by the one or more processors, further configure the processors to:
    identify, at the second device, a semantic label associated with the virtual object;
    transmit the semantic label to the first device; and
    render the semantic label on the first display.
  13. The augmented reality system of claim 8, wherein the annotation comprises audio, and the one or more memories store computer-readable instructions that, upon execution by the one or more processors, further configure the processors to play the audio.
  14. The augmented reality system of claim 8, wherein the one or more memories store computer-readable instructions that, upon execution by the one or more processors, further configure the processors to:
    receive, on the first device, an annotation to the virtual object in the three-dimensional representation; and
    transmit the annotation to the second device.
  15. One or more non-transitory computer-storage media storing instructions that, upon execution on a computer system, cause the computer system to perform operations comprising:
    receiving, from a camera of a first device, a stream of images of a scene;
    constructing, from the stream of images, a three-dimensional representation of the scene and converting an object in the scene into a virtual object;
    rendering the three-dimensional representation on a first display of the first device;
    transmitting the three-dimensional representation to a second device;
    rendering, on a second display of the second device, the three-dimensional representation and the virtual object;
    receiving, on the second device, an annotation to the virtual object in the three-dimensional representation; and
    updating, on the first display, the virtual object with the annotation.
  16. The non-transitory computer-storage media of claim 15, wherein the operations further comprise:
    receiving a user input gesture from a user input device on the second device;
    identifying an action corresponding to the user input gesture; and
    performing the action on the virtual object.
  17. The non-transitory computer-storage media of claim 15, wherein the operations further comprise:
    identifying a semantic label associated with the virtual object;
    transmitting the semantic label to the second device; and
    rendering the semantic label on the second display.
  18. A method for augmented tele-cooperation, comprising:
    receiving, from a camera of a first device, a stream of images of a scene;
    constructing, from the stream of images, a three-dimensional representation of the scene and converting an object in the scene into a virtual object;
    rendering the three-dimensional representation on a first display of the first device;
    transmitting the three-dimensional representation to a second device, wherein the three-dimensional representation transmitted is configured for the second device to render on a second display of the second device the three-dimensional representation and the virtual object;
    updating, on the first display, the virtual object with an annotation to the virtual object in the three-dimensional representation, wherein the annotation to the virtual object in the three-dimensional representation is received on the second device.
  19. The method of claim 18, further comprising:
    identifying a semantic label associated with the virtual object;
    transmitting the semantic label to the second device; and
    rendering the semantic label on the second display.
  20. The method of claim 18, further comprising:
    receiving, on the first device, an annotation to the virtual object in the three-dimensional representation; and
    transmitting the annotation to the second device.
PCT/CN2021/079357 2020-03-24 2021-03-05 System and method for augmented tele-cooperation WO2021190280A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202180014796.4A CN115104078A (en) 2020-03-24 2021-03-05 System and method for enhanced remote collaboration

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202062993704P 2020-03-24 2020-03-24
US62/993,704 2020-03-24

Publications (1)

Publication Number Publication Date
WO2021190280A1 true WO2021190280A1 (en) 2021-09-30

Family ID=77890954

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/079357 WO2021190280A1 (en) 2020-03-24 2021-03-05 System and method for augmented tele-cooperation

Country Status (2)

Country Link
CN (1) CN115104078A (en)
WO (1) WO2021190280A1 (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013020174A1 (en) * 2011-08-08 2013-02-14 The University Of Sydney A method of processing information that is indicative of a shape
CN109689173A (en) * 2016-04-26 2019-04-26 Magic Leap, Inc. Electromagnetic tracking using an augmented reality system
CN109923500A (en) * 2016-08-22 2019-06-21 Magic Leap, Inc. Augmented reality display device with deep learning sensor
CN109791435A (en) * 2016-09-26 2019-05-21 Magic Leap, Inc. Calibration of magnetic sensors and optical sensors in a virtual reality or augmented reality display system
CN110088710A (en) * 2016-11-16 2019-08-02 Magic Leap, Inc. Heat management system for wearable component

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024045854A1 (en) * 2022-08-31 2024-03-07 Huawei Cloud Computing Technologies Co., Ltd. System and method for displaying virtual digital content, and electronic device

Also Published As

Publication number Publication date
CN115104078A (en) 2022-09-23


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21776089

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21776089

Country of ref document: EP

Kind code of ref document: A1