WO2021197016A1 - System and method for enhancing subjects in videos - Google Patents

System and method for enhancing subjects in videos

Info

Publication number
WO2021197016A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
user interface
video content
pose
images
Prior art date
Application number
PCT/CN2021/080211
Other languages
French (fr)
Inventor
Yuxin MA
Yi Xu
Shuxue Quan
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Publication of WO2021197016A1

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
          • G06T19/00: Manipulating 3D models or images for computer graphics
            • G06T19/006: Mixed reality
          • G06T2200/00: Indexing scheme for image data processing or generation, in general
            • G06T2200/24: involving graphical user interfaces [GUIs]
          • G06T2210/00: Indexing scheme for image generation or computer graphics
            • G06T2210/04: Architectural design, interior design

Definitions

  • The present disclosure relates generally to methods and systems related to augmented reality (AR) applications. More particularly, embodiments of the present disclosure provide methods and systems for enhancing ordinary video with augmented reality content using consumer technology. Embodiments of the present disclosure are applicable to a variety of applications in augmented reality and computer-based display systems.
  • AR augmented reality
  • Augmented Reality superimposes virtual content over a user’s view of the real world.
  • SDK AR software development kits
  • An AR SDK typically provides six degrees-of-freedom (6DoF) tracking capability.
  • a user can scan the environment using a camera included in an electronic device (e.g., a smartphone or an AR system) and the electronic device performs visual inertial odometry (VIO) in real time. Once the camera pose is tracked continuously, virtual objects can be placed into the AR scene to create an illusion that real objects and virtual objects are merged together.
  • VIO visual inertial odometry
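A 6DoF pose from such tracking is commonly stored as a 4x4 homogeneous transform, and placing a virtual object into the scene amounts to composing transforms. A minimal numpy sketch (the function name and the example values are illustrative, not taken from the disclosure):

```python
import numpy as np

def pose_matrix(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Tracked camera pose (world_from_camera) from VIO, here the identity for brevity,
# and a virtual object placed 2 m in front of the camera along -z.
world_from_camera = pose_matrix(np.eye(3), [0.0, 0.0, 0.0])
camera_from_object = pose_matrix(np.eye(3), [0.0, 0.0, -2.0])

# Composing the two transforms anchors the object in world coordinates, which is
# what keeps it fixed in place as the tracked camera pose changes.
world_from_object = world_from_camera @ camera_from_object
```

As the camera moves, only `world_from_camera` is updated each frame; `camera_from_object` at placement time determines where the object appears to sit.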
  • a method is provided.
  • in embodiments, the method is performed by a computing system and provides enhanced augmented reality content.
  • the method comprises obtaining video content using an optical sensor in communication with the computing system, wherein the video content comprises a plurality of images in a sequence associated with a frame rate, obtaining a three-dimensional (3D) model at least in part associated with an object depicted in the video content, determining a first pose of the object describing a first three-dimensional condition of the object, and associating, by the computing system, the 3D model and the first pose with a first image of the plurality of images at a first timestamp.
  • 3D three-dimensional
  • a system comprising a processor, and a memory including instructions that, when executed with the processor, cause the system to: obtain video content using an optical sensor in communication with the system, wherein the video content comprises a plurality of images in a sequence associated with a frame rate, obtain a three-dimensional (3D) model at least in part associated with an object depicted in the video content, determine a first pose of the object describing a first three-dimensional condition of the object, and associate the 3D model and the first pose with a first image of the plurality of images at a first timestamp.
  • a non-transitory computer readable medium is provided.
  • the non-transitory computer readable medium is for storing specific computer-executable instructions that, when executed by a processor, cause a computer system to at least: obtain video content using an optical sensor in communication with the computing system, wherein the video content comprises a plurality of images in a sequence associated with a frame rate, obtain a three-dimensional (3D) model at least in part associated with an object depicted in the video content, determine a first pose of the object describing a first three-dimensional condition of the object, and associate the 3D model and the first pose with a first image of the plurality of images at a first timestamp.
  • embodiments of the present disclosure involve methods and systems that provide three-dimensional (3D) models for incorporation into video content automatically on a mobile device.
  • FIG. 1 illustrates an example of a computer system that includes a depth sensor and a red, green, and blue (RGB) optical sensor for AR applications, according to an embodiment of the present disclosure
  • FIG. 2 is a simplified flowchart illustrating an example of a method for providing 3D models of real objects in AR-enhanced video content, according to an embodiment of the present disclosure
  • FIG. 3 is another simplified flowchart illustrating another example of a method for providing 3D models of real objects in AR-enhanced video content, according to an embodiment of the present disclosure
  • FIG. 4 is another simplified flowchart illustrating a method of providing 3D models of real objects in AR-enhanced video content according to an embodiment of the present disclosure
  • FIG. 5 depicts an illustrative example of a technique for obtaining 3D models using sensor data, according to an embodiment of the present disclosure
  • FIG. 6 is a simplified flowchart illustrating a method of providing 3D models of real objects in AR-enhanced video content according to an embodiment of the present disclosure
  • FIG. 7A is an illustrative example of AR content including a 3D model, according to an embodiment of the present disclosure
  • FIG. 7B is another illustrative example of AR content including another 3D model, according to an embodiment of the present disclosure.
  • FIG. 8 illustrates an example computer system according to an embodiment of the present disclosure.
  • the present disclosure relates generally to methods and systems related to virtual reality and augmented reality applications. More particularly, embodiments of the present disclosure provide methods and systems for enhancing video content with 3D models. Embodiments of the present disclosure are applicable to a variety of applications in virtual reality and computer-based AR systems.
  • FIG. 1 illustrates an example of a computer system 110 that includes a depth sensor 112 and an RGB optical sensor 114 for AR applications, according to an embodiment of the present disclosure.
  • the AR applications can be implemented by an AR module 116 of the computer system 110.
  • the RGB optical sensor 114 generates an RGB image of a real-world environment that includes, for instance, a real-world object 130.
  • the RGB optical sensor 114 may include additional or alternative spectral sensitivity, for example, infrared, ultraviolet, etc.
  • the RGB optical sensor 114 may also generate a video including multiple images in a sequence.
  • the depth sensor 112 generates depth data about the real-world environment 132, where this data includes, for instance, a depth map that shows depth(s) of the real-world object 130 (e.g., distance(s) between the depth sensor 112 and the real-world object 130).
  • a user is provided with the ability to generate and view a 3D model of an object with which a video may be enhanced.
  • the AR module 116 creates a 3D model 122 of the real-world object 130 to be rendered either on top of the live feed of an AR scene 120 of the real-world environment 132 in the AR session or within a 3D model view mode, where the AR scene 120 or 3D model view mode can be presented via a graphical user interface (GUI) on a display of the computer system 110.
  • GUI graphical user interface
  • the computer system 110 may obtain the 3D model 122 directly by processing sensor data collected from the real world object 130, as described in more detail in reference to FIG. 5.
  • the AR scene 120 shows one or more AR features 124 not present in the real-world environment.
  • the AR module 116 can generate a red, green, blue, and depth (RGBD) image from the RGB image and the depth map to associate the captured 3D model 122 with the pose of the real world object 130 in the real world environment 132, as depicted in the video.
  • RGBD red, green, blue, and depth
  • a “pose” of an object includes both the object’s location and orientation.
  • the AR module 116 can also link the generated 3D model 122 to one or more timestamps in the video by associating the obtained 3D model 122 with one or more poses of the real world object 130 in one or more RGB images included in the video.
  • the computer system 110 represents a suitable user device that includes, in addition to the depth sensor 112 and the RGB optical sensor 114, one or more graphical processing units (GPUs), one or more general purpose processors (GPPs), and one or more memories storing computer-readable instructions that are executable by at least one of the processors to perform various functionalities of the embodiments of the present disclosure.
  • the computer system 110 can be any of a smartphone, a tablet, an AR headset, or a wearable AR device.
  • the depth sensor 112 and the RGB optical sensor 114 may be separated by a transformation (e.g., a distance offset, a field-of-view angle difference, etc.).
  • This transformation may be known and its value may be stored locally and/or accessible to the AR module 116.
  • a time-of-flight (ToF) camera and a color camera can have similar fields of view. Because of the transformation, however, the fields of view would partially, rather than fully, overlap.
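The stored transformation between the two sensors can be applied as a rigid change of coordinates: a 3D point measured in the depth sensor's frame is mapped into the color camera's frame before the RGB and depth data are fused. A minimal sketch, assuming known extrinsics (the 5 cm baseline is an invented example value):

```python
import numpy as np

def depth_to_color_frame(points, R, t):
    """Map (N, 3) points from the depth sensor's coordinate frame into the color
    camera's frame using the extrinsic rotation R (3x3) and translation t (3,)."""
    return points @ R.T + np.asarray(t)

# Illustrative extrinsics: sensors aligned in orientation, offset by a 5 cm baseline.
R = np.eye(3)
t = [0.05, 0.0, 0.0]

point_in_depth_frame = np.array([[0.0, 0.0, 1.0]])  # 1 m straight ahead of the depth sensor
point_in_color_frame = depth_to_color_frame(point_in_depth_frame, R, t)
```

With a real device pair, R and t would come from factory calibration data rather than being constructed by hand.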
  • the AR module 116 can be implemented as specialized hardware and/or a combination of hardware and software (e.g., general purpose processor and computer-readable instructions stored in memory and executable by the general purpose processor) .
  • the AR module 116 can obtain the 3D model 122 of the real world object 130 to properly render the AR scene 120 as part of a video.
  • FIG. 2 is a simplified flowchart illustrating an example of a method 200 for providing 3D models of real objects in AR-enhanced video, according to an embodiment of the present disclosure.
  • the method 200 includes obtaining video (210) by using an image sensor (e.g., RGB optical sensor of FIG. 1) in communication with a computer system (e.g., computer system 110 of FIG. 1) , wherein the video depicts an object 220.
  • the object 220 may be a real object in the environment of the video and/or it may be a real object which the user of the computer system intends to place in the video.
  • the computer system may obtain a 3D model (e.g., 3D model 122 of FIG. 1) of the object.
  • Obtaining the 3D model of the object may include generating a 3D model by processing multiple images captured using the image sensor to generate a texture or skin to fit onto a depth map of the object, for example, by feature detection and mapping.
  • the computer system may define contours 224 of the 3D model in the video by determining a pose (214) of the object 220.
  • Pose refers to a reference location and orientation of the object 220, to be applied as an adjustment to a 3D model to make it appear to be a part of an image or scene (e.g., AR scene 120 of FIG. 1) when overlaying the 3D model on the object in the image or video.
  • the computer system provides enhanced video content by associating the 3D model with the video at a specific timestamp.
  • This timestamp may be determined automatically or manually.
  • an algorithm analyzing video data may determine a segment of the video that features the object 220 at a particular time, and may associate the 3D model with a pose of that object at that timestamp (216) .
  • associating the 3D model with a pose of that object may include providing an AR scene to be accessible by the user of the user device when viewing enhanced video content.
  • the AR scene may be accessible via one or more interface elements (e.g., interface elements 128 of FIG. 1).
  • a user of the computer system may manually indicate, via a user interface (e.g., interface elements 128 of FIG. 1) , one or more timestamps at which AR scenes are to be included in the enhanced video content.
  • One or more timestamps may be used as a means of differentiating multiple 3D models associated with the video content.
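One way to realize such an association is a mapping from timestamps to (model, pose) pairs, with each timestamp derived from a frame index and the frame rate. A sketch with invented names (`Pose` and `EnhancedVideo` are illustrative, not from the disclosure):

```python
from dataclasses import dataclass, field

@dataclass
class Pose:
    rotation: tuple      # e.g., a quaternion (w, x, y, z)
    translation: tuple   # (x, y, z) in the scene's coordinate map

@dataclass
class EnhancedVideo:
    frame_rate: float
    # timestamp (seconds) -> (model identifier, pose of the object at that time);
    # distinct timestamps differentiate multiple 3D models in one video.
    anchors: dict = field(default_factory=dict)

    def timestamp_of_frame(self, frame_index):
        return frame_index / self.frame_rate

    def associate(self, timestamp, model_id, pose):
        self.anchors[timestamp] = (model_id, pose)

video = EnhancedVideo(frame_rate=30.0)
video.associate(video.timestamp_of_frame(60), "chair_model",
                Pose(rotation=(1, 0, 0, 0), translation=(0.5, 0.0, 2.0)))
```

A player can then check, at each playback position, whether an anchor exists and offer the corresponding AR scene to the viewer.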
  • FIG. 3 is another simplified flowchart illustrating another example of a method 300 for providing 3D models of real objects in AR-enhanced video, according to an embodiment of the present disclosure.
  • the process of obtaining 3D models includes a special user interface (e.g., interface elements 128 of FIG. 1) configured to receive a result indicating whether an association of a 3D model with a pose of a real world object is acceptable.
  • To generate enhanced video content some embodiments include identifying an object in video (310) , where the video is obtained as part of generating enhanced video content (e.g., obtaining video 210 of FIG. 2) as described in more detail in reference to FIG. 2.
  • this includes capturing images and feature data to generate a texture mapped to a set of coordinates that can be adjusted to match a pose of a real world object within an image. As illustrated in the example flowchart in FIG. 3, this includes obtaining a 3D model of a real world object in a video (320) , determining a pose of the object (322) , and associating the pose of the object with the video at a timestamp (324) , as described above in reference to FIG. 2.
  • this process may also include generating a user interface to receive the acceptance result, by generating user interface elements (326) .
  • the user interface elements 128 may form a part of a user interface for generating AR-enhanced video content.
  • the user interface elements 128 may permit a user of the computer system (e.g., computer system 110 of FIG. 1) to indicate acceptance or rejection of an association of a 3D model with a real object in the video.
  • when the computer system receives a rejection command (328), the computer system repeats at least part of the method until it receives an acceptance command. For example, the computer system may repeat the entire process, beginning with obtaining a new 3D model of the object.
  • the computer system may repeat only the association of the pose of the object with the video at the timestamp. That is to say that the 3D model may be acceptable, but the mapping of the 3D model to the real world object may be the reason that the computer system received the rejection command.
  • the computer system may determine improper mapping automatically (e.g., without user interaction) as part of the step of associating pose of the real world object, for example, by generating a measure of error between one or more features detected from the real world object and the corresponding edges and/or coordinates of the 3D model.
  • the computer system may automatically repeat the process described above (e.g., via implementation of an auto-fitting algorithm as part of determining the pose of the object) .
  • the computer system receives an accept command and/or does not receive a rejection command, in which case the computer system may encode and store the enhanced video content 330 in a data store, as described in more detail in reference to FIG. 4.
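The automatic accept/reject decision described above can be sketched as a threshold on an error measure between detected object features and the projected model; the function names and the 3-pixel tolerance are invented for illustration:

```python
import numpy as np

def mean_reprojection_error(observed_px, projected_px):
    """Mean pixel distance between detected object features (N, 2) and the
    corresponding projected 3D-model points (N, 2)."""
    return float(np.linalg.norm(observed_px - projected_px, axis=1).mean())

def mapping_acceptable(observed_px, projected_px, tol_px=3.0):
    """Hypothetical acceptance criterion: reject the association automatically
    when the mean reprojection error exceeds tol_px pixels."""
    return mean_reprojection_error(observed_px, projected_px) <= tol_px

observed = np.array([[100.0, 100.0], [200.0, 120.0]])   # detected feature pixels
well_fit = np.array([[101.0, 100.0], [199.0, 121.0]])    # model projects close by
poor_fit = np.array([[140.0, 100.0], [200.0, 160.0]])    # model is badly placed
```

When `mapping_acceptable` returns false, the system would repeat pose determination (or the whole model acquisition), mirroring the rejection-command path above.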
  • FIG. 4 is another simplified flowchart illustrating a method 400 of providing 3D models of real objects as AR content to a video, according to an embodiment of the present disclosure.
  • a camera obtains video content 410 using a recording application or other software on a computer system (e.g., computer system 110 of FIG. 1) .
  • the computer system may form a part of a mobile device, including, but not limited to, a smart phone, a tablet, an AR headset device, a dedicated device for creating AR content, and the like.
  • the computer system identifies at least one object 420 in the video content 410 from which a 3D model is to be created and incorporated into the video as AR content. Until an object 420 is identified and/or while an object is not identified, the computer system may continue to obtain video content 410. In some embodiments, when an object 420 has been identified, the computer system may suspend obtaining video 422, such that the image sensor used to obtain video content 410 may be used to obtain a 3D model 424 of the object 420. In some embodiments, obtaining a 3D model includes processing data from both an image sensor (e.g., RGB optical sensor 114 of FIG. 1) and a depth sensor (e.g., depth sensor 112 of FIG. 1), as described in more detail in reference to FIG. 5, below.
  • the computer system may place the partially or fully reconstructed 3D model 424 into the video content 410 by determining a pose 426 of the 3D model in each frame of the video content 410.
  • the pose 426 of the 3D model is determined using a computer vision based object pose estimation algorithm.
  • the pose 426 of the 3D model is determined in association with a coordinate map determined, for example, based at least in part on a simultaneous localization and mapping (SLAM) process executed by the computer system.
  • SLAM simultaneous localization and mapping
  • Such processes define an output pose of the image sensor and a coordinate map defining a plurality of features in the environment around the computer system.
  • determining the plurality of features may include determining a shape of the object 420 and/or of the environment where a 3D model of the object is to be placed in the video content 410.
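The disclosure does not name a specific pose estimation algorithm. When 3D correspondences between model points and observed points are available (e.g., from the depth sensor), one common choice is rigid alignment via SVD, the Kabsch algorithm, sketched here as an assumption rather than the disclosed method:

```python
import numpy as np

def estimate_rigid_pose(model_pts, observed_pts):
    """Kabsch algorithm: find rotation R and translation t minimizing
    ||(model_pts @ R.T + t) - observed_pts||, for (N, 3) corresponding points."""
    mp, op = model_pts.mean(0), observed_pts.mean(0)
    H = (model_pts - mp).T @ (observed_pts - op)      # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))            # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = op - R @ mp
    return R, t

# Recover a known 30-degree rotation about z plus a translation.
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([1.0, 2.0, 3.0])
model = np.random.default_rng(0).normal(size=(20, 3))
observed = model @ R_true.T + t_true
R_est, t_est = estimate_rigid_pose(model, observed)
```

In a full pipeline, the correspondences would come from feature matching between the reconstructed model and the SLAM coordinate map rather than being given exactly.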
  • the computer system associates the 3D model 424 with enhanced video content 428 at least in part by generating an overlay of the 3D model 424 in the video content 410.
  • the enhanced video content 428 may include AR features (e.g., AR features 124 of FIG. 1) placed in the video content 410 automatically (e.g., based on feature detection algorithms) and/or manually by a user of the computer system.
  • the computer system or a user of the computer system may then determine whether the enhanced video content 428 satisfies one or more metrics of quality (e.g., via interface elements 128 of FIG. 1) . If the result 430 of the determination is satisfactory, the computer system may resume obtaining video 432, such that additional video content is appended to the enhanced video content 428.
  • additional objects 440 are detected and identified, for which 3D models are obtained and associated with subsequent, additional AR scenes in the enhanced video content 428. While additional objects 440 remain, the computer system may continue obtaining video content 410, whereas when a user and/or the computer system determines that no additional objects remain, the computer system may encode and store the enhanced video content. Encoding may include, but is not limited to, generating a video content file for storage, for example, a compressed video file format or an AR file format (e.g., .obj, .fbx, .usdz, etc.).
  • the computer system stores 450 the video content 410, once generated, locally using a data store in communication with the computer system that is incorporated into the device (e.g., flash memory, a hard drive, etc.).
  • the computer system stores the video content in a distributed storage system in communication with the computer system via a network (e.g., a cloud storage system) .
  • FIG. 4 provides a particular method of providing 3D models of real objects as AR content to a video according to an embodiment of the present disclosure.
  • other sequences of steps may also be performed according to alternative embodiments.
  • alternative embodiments of the present disclosure may perform the steps outlined above in a different order.
  • the individual steps illustrated in FIG. 4 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step.
  • additional steps may be added or removed depending on the particular applications.
  • One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
  • FIG. 5 depicts an illustrative example of a technique for obtaining 3D models using sensor data, according to an embodiment of the present disclosure.
  • sensor data 502 may be obtained from one or more input sensors installed upon a user device.
  • the captured sensor data 502 includes image information 504 captured by a camera device (e.g., RGB optical sensor 114 of FIG. 1) as well as depth map information 506 captured by a depth sensor (e.g., sensor 112 of FIG. 1) .
  • the sensor data 502 may include image information 504.
  • One or more image processing techniques may be used on image information 504 in order to identify one or more objects within that image information 504.
  • edge detection may be used to identify a section 508 within the image information 504 that includes an object.
  • discontinuities in brightness, color, and/or texture may be identified across an image in order to detect edges of various objects within the image.
  • Section 508 depicts an illustrative example image of a chair in which such discontinuities have been emphasized.
  • the sensor data 502 may include depth information 506.
  • in the depth information 506, a value may be assigned to each pixel that represents a distance between the user device and a particular point corresponding to the location of that pixel.
  • the depth information 506 may be analyzed to detect sudden variances in depth within the depth information 506. For example, sudden changes in distance may indicate an edge or a border of an object within the depth information 506.
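A minimal sketch of this depth-discontinuity test, using a gradient-magnitude threshold (the 0.1 m threshold is an invented example value):

```python
import numpy as np

def depth_discontinuities(depth, jump_threshold=0.1):
    """Mark pixels where the local depth gradient exceeds jump_threshold
    (in the same units as the depth map, e.g., meters); such sudden changes
    in distance suggest an edge or border of an object."""
    grad_y, grad_x = np.gradient(depth.astype(float))
    return np.hypot(grad_x, grad_y) > jump_threshold

# A flat background at 2.0 m with a square object at 1.0 m in front of it.
depth = np.full((8, 8), 2.0)
depth[2:6, 2:6] = 1.0
edges = depth_discontinuities(depth)
```

The boolean mask is true only along the object's silhouette, where depth jumps by a meter, and false both on the flat background and in the object's interior.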
  • the sensor data 502 may include both image information 504 and depth information 506.
  • objects may first be identified in either the image information 504 or the depth information 506 and various attributes of the objects may be determined from the other information.
  • edge detection techniques may be used to identify a section of the image information 504 that includes an object 508.
  • the section 508 may then be mapped to a corresponding section 510 in the depth information to determine depth information for the identified object (e.g., a point cloud) .
  • a section 510 that includes an object may first be identified within the depth information 506.
  • the section 510 may then be mapped to a corresponding section 508 in the image information to determine appearance attributes for the identified object (e.g., color or texture values) .
  • a point cloud for the object may be generated from the depth information and/or image information and compared to point cloud data stored in a database to identify a closest matching 3D model.
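The database lookup described above can be sketched with a symmetric Chamfer distance as the point-cloud similarity measure, a common choice, though the disclosure does not name one; the model names and clouds are invented for the example:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric nearest-neighbour distance between point clouds a (N, 3) and b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def closest_model(query_cloud, model_database):
    """Return the key of the stored model cloud nearest to the query cloud."""
    return min(model_database,
               key=lambda name: chamfer_distance(query_cloud, model_database[name]))

rng = np.random.default_rng(1)
chair = rng.normal(size=(50, 3))
table = chair + 5.0                                   # a clearly different cloud
database = {"chair": chair, "table": table}
query = chair + rng.normal(scale=0.01, size=chair.shape)  # noisy scan of the chair
```

The brute-force pairwise distance matrix is fine for small clouds; a real system would use a spatial index (e.g., a k-d tree) for large scans.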
  • a mesh may be created from point cloud data obtained from a section 510 of depth information 506. The system may then map appearance data from a section of image information 504 corresponding to section 510 to the mesh to generate a basic 3D model.
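Recovering a point cloud from a depth map, the first step toward such a mesh, can be sketched by back-projecting each pixel through pinhole intrinsics (the intrinsic values here are invented for the example):

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (H, W) into an (H, W, 3) array of 3D points in the
    camera frame using pinhole intrinsics (fx, fy, cx, cy)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Toy 4x4 depth map at 1.5 m, principal point at the image center.
depth = np.full((4, 4), 1.5)
points = backproject(depth, fx=500.0, fy=500.0, cx=1.5, cy=1.5)
# Appearance data could then be attached per point from the corresponding RGB
# pixel, e.g., colored = np.concatenate([points, rgb_image / 255.0], axis=-1).
```

Pixels near the principal point map to points near the optical axis, and the x/y extent of each point grows with its depth, as expected of a pinhole model.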
  • sensor data captured by a user device may be used to generate a 3D model of a user using the techniques described above.
  • This 3D model of a user may then be provided to a mobile application server as user data.
  • sensor data may be used to generate a 3D model of a product, which may then be stored in an object model database 238.
  • a user wishing to sell a product may capture sensor data related to the product from his or her user device.
  • the user’s user device may then generate a 3D model in the manner outlined above and may provide that 3D model to the mobile application server.
  • FIG. 6 is a simplified flowchart illustrating a method of providing 3D models of real objects as AR content to a video according to an embodiment of the present disclosure.
  • the flow is described in connection with a computer system that is an example of the computer systems described herein.
  • Some or all of the operations of the flows can be implemented via specific hardware on the computer system and/or can be implemented as computer-readable instructions stored on a non-transitory computer-readable medium of the computer system.
  • the computer-readable instructions represent programmable modules that include code executable by a processor of the computer system. The execution of such instructions configures the computer system to perform the respective operations.
  • Each programmable module in combination with the processor represents a means for performing a respective operation (s) . While the operations are illustrated in a particular order, it should be understood that no particular order is necessary and that one or more operations may be omitted, skipped, and/or reordered.
  • the method includes obtaining video content using an optical sensor in communication with the computing system (602) .
  • the video content includes multiple images in a sequence associated with a frame rate.
  • the video content is obtained by a computer system (e.g., computer system 110 of FIG. 1) using an image sensor (e.g., RGB optical sensor 114 of FIG. 1) .
  • the video depicts a real world object (e.g., real world object 130 of FIG. 1) in an environment (e.g., real-world environment 132 of FIG. 1) .
  • the method further includes obtaining, by the computing system, a three-dimensional (3D) model at least in part associated with an object depicted in the video content (604) .
  • obtaining a 3D model includes generating, by the computing system, a 3D model of the object using one or more images obtained by a sensor in communication with the computing system.
  • the computer system may obtain a 3D model of the real world object by determining one or more features of the real world object, such as edges, and may map a texture and/or skin to those features, the texture and/or skin obtained by processing multiple images of the real world object.
  • obtaining a 3D model may include moving the computer system (e.g., a tablet, smart phone, or AR device) around the real world object, such that sensors in communication with the computer system may obtain depth and image data, from which an AR module (e.g., AR module 116 of FIG. 1) may obtain the 3D model.
  • obtaining a 3D model includes suspending, by the computing system, obtaining video content.
  • the method further includes resuming, by the computing system, obtaining the video content after associating the 3D model and the first pose with the first image of the plurality of images at a first timestamp.
  • the method further includes determining, by the computing system, a first pose of the object describing a first three-dimensional condition of the object (606) .
  • associating the pose with the three-dimensional condition of the object includes determining a coordinate map and a camera and/or sensor pose, based at least in part on images and position data collected by the computer system.
  • the contours and/or features of the object may be described using feature detection and tracking implemented by the computer system, such that the edges of the real world object are described in terms of the coordinate map and adjusted for the perspective of the image sensor.
  • the method further includes associating, by the computing system, the 3D model and the first pose with a first image of the plurality of images at a first timestamp (608).
  • this includes adjusting the 3D model of the real world object to overlay the model on the image of the object in the video.
  • this includes applying one or more adjustments to the 3D model to fit the pose of the real world object in the video, and placing the model in a pose in the video frame corresponding to the pose of the real world object.
  • the 3D model may then be adjusted to fit the image of the real world object, and included in an AR scene corresponding to a timestamp in the video.
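Overlaying the posed model on a frame reduces to transforming the model vertices by the object's pose and projecting them with the camera intrinsics. A sketch with invented example values (identity orientation, 2 m depth, VGA-style intrinsics):

```python
import numpy as np

def overlay_pixels(model_vertices, R, t, fx, fy, cx, cy):
    """Apply the object's pose (R, t) to (N, 3) model vertices and project them
    onto the image plane with pinhole intrinsics, yielding (N, 2) pixel positions
    at which the model overlays the real object in the frame."""
    cam = model_vertices @ R.T + np.asarray(t)   # model -> camera coordinates
    u = fx * cam[:, 0] / cam[:, 2] + cx          # perspective divide by depth
    v = fy * cam[:, 1] / cam[:, 2] + cy
    return np.stack([u, v], axis=1)

# Two vertices of a model placed 2 m ahead of the camera, identity orientation.
verts = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
pixels = overlay_pixels(verts, np.eye(3), [0.0, 0.0, 2.0],
                        fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Repeating this per frame with the pose tracked for that frame keeps the rendered model registered to the real object throughout the AR scene.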
  • the method further includes providing for presentation, by the computing system, user interface data, wherein a user interface generated with the user interface data is configured to present an interactive user interface element overlaid on an image of the plurality of images corresponding to the first timestamp, and wherein the interactive user interface element is configured to receive a user command.
  • the method optionally includes, in accordance with receiving the user command, presenting the partially or fully obtained 3D model overlaid on the video content.
  • the computer system provides an AR scene (e.g., AR scene 120 of FIG. 1) in the video, thereby providing enhanced video content.
  • the method further includes repeating the above mentioned steps for additional real world objects in the video, as described in more detail in reference to FIG. 4.
  • the method further includes generating a media file encoding video and the 3D model in a computer readable format and storing the media file in a data store.
  • in some cases, the computer system may generate and/or present the partially or fully obtained 3D model overlaid on the video at the timestamp, and the resulting AR scene may not satisfy acceptance criteria.
  • the pose and location of the real world object may have been determined incorrectly.
  • the mapping of skin to features in the 3D model may be incomplete or of insufficient quality.
  • the method further includes providing for presentation, by the computing system, user interface data, wherein a user interface generated with the user interface data is configured to present the 3D model and an interactive user interface element overlaid on the video content, and wherein the interactive user interface element is configured to receive a user reject command.
  • the method further includes, in accordance with receiving a user reject command, at least one of: re-obtaining an updated 3D model of the object, re-determining an updated first pose of the object describing the first three-dimensional condition of the object, or re-associating the 3D model and the updated first pose with the first image of the plurality of images at the first timestamp.
  • FIG. 6 provides a particular method of providing 3D models of real objects as AR content to a video according to an embodiment of the present disclosure.
  • other sequences of steps may also be performed according to alternative embodiments.
  • alternative embodiments of the present disclosure may perform the steps outlined above in a different order.
  • the individual steps illustrated in FIG. 6 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step.
  • additional steps may be added or removed depending on the particular applications.
  • One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
  • FIG. 7A is an illustrative example of AR content including a 3D model 710, according to an embodiment of the present disclosure.
  • a user of a mobile device 702 viewing the AR-enhanced video content (e.g., AR-enhanced video content 428 of FIG. 4) at a first timestamp may be presented with the option to enter an AR scene associated with the first timestamp 720.
  • the AR scene 720 may include the 3D model 710, one or more interface elements 722, and video content controls 724.
  • the 3D model 710 may be a first 3D model associated with a real object, as described in more detail in reference to FIG. 4.
  • the 3D model 710 may represent, but is not limited to, a piece of furniture in a physical condition (e.g., incomplete assembly/disassembly), a home repair component (e.g., a light switch or replacement part), or the like.
  • the interface elements 722 may include placement and/or orientation controls for placing the 3D model 710 in an environment 726 of the user device 702 (e.g., place, rotate, etc. ) .
  • the interface elements 722 may permit the user of the user device 702 to customize the AR scene associated with the first timestamp 720 by controlling the placement and/or orientation of the 3D model 710.
  • the video content controls 724 may include a user interface including one or more controls including, but not limited to, an indicator of the position of the timestamp in the video content, tracking controls, and/or display controls.
  • the environment 726 may be a domestic interior, an office interior, a public space inside or outside a building, a street environment, a park environment, etc.
  • the video content may be hidden, closed, or displayed in a smaller picture by the user device 702 while the user of the user device 702 is viewing the first AR scene 720, to be resurfaced in response to the user of the user device 702 ending the first AR scene 720 (e.g., via the one or more interface elements 722) .
  • FIG. 7B is another illustrative example of AR content including another 3D model 712, according to an embodiment of the present disclosure.
  • the AR-enhanced video content (e.g., AR-enhanced video content 428 of FIG. 4) includes another 3D model 712 presented in a second AR scene associated with a second timestamp 740 on the user device 702.
  • the second AR scene 740 may include one or more interface elements 742, which may be the same as or different from the one or more interface elements 722 of the AR scene associated with the first timestamp 720.
  • an environment 746 of the second AR scene 740 may be the same or different from the environment 726 of the first AR scene 720.
  • the video content may be minimized and/or hidden or displayed in a smaller picture during interaction with the other 3D model 712, such that the video content may resume, via the video content controls 744, after the user of the user device 702 ends the second AR scene 740.
  • a smartphone is used for a video session that shows the augmented real-world environment at a specific time in a video (e.g., a timestamp).
  • the 3D model (710 and 712) may be overlaid on the real world object in the video at a specific timestamp, such that a user of the computer system may interrupt the video to enter the AR session.
  • the AR session renders the 3D model (710 and 712) on top of the live video feed in an AR scene, using one or more interface elements.
  • a user of the smartphone can interact with the 3D model to move the 3D model (e.g., by rotation in place, translation, etc.).
  • playing the enhanced video may include generating a menu of interface elements 722 or 742 at the timestamp associated with the AR scene, such that a viewer of the enhanced video may pause the enhanced video to present the AR scene, after which the viewer may return to watching the video as normal.
  • Presenting the AR scene may include presenting the 3D model associated with a specific timestamp in a real-world environment of the user, such that the object appears in front of the user as a real object, viewed through the user device from multiple angles by maneuvering the user device around the virtual position of the 3D model.
  • FIG. 8 illustrates examples of components of a computer system 800 according to certain embodiments.
  • the computer system 800 is an example of the computer system described herein above. Although these components are illustrated as belonging to a same computer system 800, the computer system 800 can also be distributed.
  • the computer system 800 includes at least a processor 802, a memory 804, a storage device 806, input/output peripherals (I/O) 808, communication peripherals 810, and an interface bus 812.
  • the interface bus 812 is configured to communicate, transmit, and transfer data, controls, and commands among the various components of the computer system 800.
  • the memory 804 and the storage device 806 include computer-readable storage media, such as RAM, ROM, electrically erasable programmable read-only memory (EEPROM) , hard drives, CD-ROMs, optical storage devices, magnetic storage devices, electronic non-volatile computer storage, for example memory, and other tangible storage media. Any of such computer readable storage media can be configured to store instructions or program codes embodying aspects of the disclosure.
  • the memory 804 and the storage device 806 also include computer readable signal media.
  • a computer readable signal medium includes a propagated data signal with computer readable program code embodied therein. Such a propagated signal takes any of a variety of forms including, but not limited to, electromagnetic, optical, or any combination thereof.
  • a computer readable signal medium includes any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use in connection with the computer system 800.
  • the memory 804 includes an operating system, programs, and applications.
  • the processor 802 is configured to execute the stored instructions and includes, for example, a logical processing unit, a microprocessor, a digital signal processor, and other processors.
  • the memory 804 and/or the processor 802 can be virtualized and can be hosted within another computer system of, for example, a cloud network or a data center.
  • the I/O peripherals 808 include user interfaces, such as a keyboard, screen (e.g., a touch screen) , microphone, speaker, other input/output devices, and computing components, such as graphical processing units, serial ports, parallel ports, universal serial buses, and other input/output peripherals.
  • the I/O peripherals 808 are connected to the processor 802 through any of the ports coupled to the interface bus 812.
  • the communication peripherals 810 are configured to facilitate communication between the computer system 800 and other computing devices over a communications network and include, for example, a network interface controller, modem, wireless and wired interface cards, antenna, and other communication peripherals.
  • a computing device can include any suitable arrangement of components that provide a result conditioned on one or more inputs.
  • Suitable computing devices include multipurpose microprocessor-based computer systems accessing stored software that programs or configures the computer system from a general-purpose computing apparatus to a specialized computing apparatus implementing one or more embodiments of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
  • Embodiments of the methods disclosed herein may be performed in the operation of such computing devices.
  • the order of the blocks presented in the examples above can be varied; for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
  • use of “based on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited.
  • use of “based at least in part on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based at least in part on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited. Headings, lists, and numbering included herein are for ease of explanation only and are not meant to be limiting.
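The timestamp-keyed association running through the embodiments above can be illustrated with a short sketch. All class and method names here are illustrative assumptions, not identifiers from the disclosure:

```python
class EnhancedVideo:
    """Hypothetical sketch: link 3D models to video timestamps so that
    playback can offer an AR scene when it reaches an associated frame."""

    def __init__(self, frame_rate):
        self.frame_rate = frame_rate  # frames per second
        self.ar_scenes = {}           # frame index -> model identifier

    def associate(self, timestamp, model_id):
        # Link a 3D model to the frame nearest the given timestamp (seconds).
        frame_index = round(timestamp * self.frame_rate)
        self.ar_scenes[frame_index] = model_id

    def on_frame(self, frame_index):
        # Return the model to offer as an AR scene at this frame, if any.
        return self.ar_scenes.get(frame_index)

video = EnhancedVideo(frame_rate=30)
video.associate(timestamp=2.0, model_id="chair_model")
offered = video.on_frame(60)  # 2.0 s * 30 fps = frame 60
```

In a full implementation, a non-empty result at a frame would surface the interactive user interface element, and the video would be paused or minimized while the AR session runs, then resumed afterwards.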


Abstract

Described herein are a system and methods for enhancing ordinary video with augmented reality content using consumer technology. Video content including a plurality of images in a sequence associated with a frame rate is obtained by the system. The system then obtains a three-dimensional (3D) model at least in part associated with an object depicted in the video content. This may involve generating the 3D model from depth and image data collected via the user device. The system then determines a first pose of the object describing a first three-dimensional condition of the object, and associates the 3D model and the first pose with a first image of the plurality of images at a first timestamp. A second user device is presented with an option to view the 3D model using augmented reality (AR) upon reaching the first timestamp.

Description

SYSTEM AND METHOD FOR ENHANCING SUBJECTS IN VIDEOS
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to US Provisional Application No. 63/003,541, filed on April 01, 2020, the content of which is herein incorporated by reference in its entirety.
TECHNICAL FIELD
The present disclosure relates generally to methods and systems related to augmented reality (AR) applications. More particularly, embodiments of the present disclosure provide methods and systems for enhancing ordinary video with augmented reality content using consumer technology. Embodiments of the present disclosure are applicable to a variety of applications in augmented reality and computer-based display systems.
BACKGROUND
Augmented Reality (AR) superimposes virtual content over a user’s view of the real world. With the development of AR software development kits (SDK) , the mobile industry has brought mobile device AR platforms to the mainstream. An AR SDK typically provides six degrees-of-freedom (6DoF) tracking capability. A user can scan the environment using a camera included in an electronic device (e.g., a smartphone or an AR system) and the electronic device performs visual inertial odometry (VIO) in real time. Once the camera pose is tracked continuously, virtual objects can be placed into the AR scene to create an illusion that real objects and virtual objects are merged together.
Despite the progress made in the field of AR, there is a need in the art for improved methods of enhancing subjects in videos.
SUMMARY OF THE DISCLOSURE
According to one aspect of the disclosure, a method is provided. The method, performed by a computing system, provides enhanced augmented reality content. The method comprises obtaining video content using an optical sensor in communication with the computing system, wherein the video content comprises a plurality of images in a sequence associated with a frame rate, obtaining a three-dimensional (3D) model at least in part associated with an object depicted in the video content, determining a first pose of the object describing a first three-dimensional condition of the object, and associating, by the computing system, the 3D model and the first pose with a first image of the plurality of images at a first timestamp.
According to another aspect of the disclosure, a system is provided. The system comprises a processor, and a memory including instructions that, when executed with the processor, cause the system to: obtain video content using an optical sensor in communication with the system, wherein the video content comprises a plurality of images in a sequence associated with a frame rate, obtain a three-dimensional (3D) model at least in part associated with an object depicted in the video content, determine a first pose of the object describing a first three-dimensional condition of the object, and associate the 3D model and the first pose with a first image of the plurality of images at a first timestamp.
According to yet another aspect of the disclosure, a non-transitory computer readable medium is provided. The non-transitory computer readable medium is for storing specific computer-executable instructions that, when executed by a processor, cause a computer system to at least: obtain video content using an optical sensor in communication with the computing system, wherein the video content comprises a plurality of images in a sequence associated with a frame rate, obtain a three-dimensional (3D) model at least in part associated with an object depicted in the video content, determine a first pose of the object describing a first three-dimensional condition of the object, and associate the 3D model and the first pose with a first image of the plurality of images at a first timestamp.
Numerous benefits are achieved by way of the present disclosure over conventional techniques. For example, embodiments of the present disclosure involve methods and systems that provide three-dimensional (3D) models for incorporation into video content automatically on a mobile device. These and other embodiments of the disclosure along with many of its advantages and features are described in more detail in conjunction with the text below and attached figures.
BRIEF DESCRIPTION OF THE DRAWINGS
In order to make the technical solutions described in the embodiments of the present disclosure clearer, the drawings used for the description of the embodiments will be briefly described. Apparently, the drawings described below are only for illustration and not for limitation. It should be understood that one skilled in the art may derive other drawings from these drawings without any inventive effort.
FIG. 1 illustrates an example of a computer system that includes a depth sensor and a red, green, and blue (RGB) optical sensor for AR applications, according to an embodiment of the present disclosure;
FIG. 2 is a simplified flowchart illustrating an example of a method for providing 3D models of real objects in AR-enhanced video content, according to an embodiment of the present disclosure;
FIG. 3 is another simplified flowchart illustrating another example of a method for providing 3D models of real objects in AR-enhanced video content, according to an embodiment of the present disclosure;
FIG. 4 is another simplified flowchart illustrating a method of providing 3D models of real objects in AR-enhanced video content according to an embodiment of the present disclosure;
FIG. 5 depicts an illustrative example of a technique for obtaining 3D models using sensor data, according to an embodiment of the present disclosure;
FIG. 6 is a simplified flowchart illustrating a method of providing 3D models of real objects in AR-enhanced video content according to an embodiment of the present disclosure;
FIG. 7A is an illustrative example of AR content including a 3D model, according to an embodiment of the present disclosure;
FIG. 7B is another illustrative example of AR content including another 3D model, according to an embodiment of the present disclosure; and
FIG. 8 illustrates an example computer system according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
In the following description, various embodiments will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiment being described.
The present disclosure relates generally to methods and systems related to virtual reality and augmented reality applications. More particularly, embodiments of the present disclosure provide methods and systems for enhancing video content with 3D models. Embodiments of the present disclosure are applicable to a variety of applications in virtual reality and computer-based AR systems.
FIG. 1 illustrates an example of a computer system 110 that includes a depth sensor 112 and an RGB optical sensor 114 for AR applications, according to an embodiment of the present disclosure. The AR applications can be implemented by an AR module 116 of the computer system 110. Generally, the RGB optical sensor 114 generates an RGB image of a real-world environment that includes, for instance, a real-world object 130. In some embodiments, the RGB optical sensor 114 may include additional or alternative spectral sensitivity, for example, infrared, ultraviolet, etc. The RGB optical sensor 114 may also generate a video including multiple images in a sequence. The depth sensor 112 generates depth data about the real-world environment 132, where this data includes, for instance, a depth map that shows depth(s) of the real-world object 130 (e.g., distance(s) between the depth sensor 112 and the real-world object 130).
In some embodiments, a user is provided with the ability to generate and view a 3D model of an object with which a video may be enhanced. Following an initialization of an AR session 126 (where this initialization can include calibration and tracking), the AR module 116 creates a 3D model 122 of the real-world object 130 to be rendered either on top of the live feed of an AR scene 120 of the real-world environment 132 in the AR session or within a 3D model view mode, where the AR scene 120 or 3D model view mode can be presented via a graphical user interface (GUI) on a display of the computer system 110. The computer system 110 may obtain the 3D model 122 directly by processing sensor data collected from the real world object 130, as described in more detail in reference to FIG. 5. In addition, the AR scene 120 shows one or more AR features 124 not present in the real-world environment. The AR module 116 can generate a red, green, blue, and depth (RGBD) image from the RGB image and the depth map to associate the captured 3D model 122 with the pose of the real world object 130 in the real world environment 132, as depicted in the video. Note that a “pose” of an object includes both the object’s location and orientation. The AR module 116 can also link the generated 3D model 122 to one or more timestamps in the video by associating the obtained 3D model 122 with one or more poses of the real world object 130 in one or more RGB images included in the video.
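Because a pose combines location and orientation, overlaying the 3D model 122 on the tracked object amounts to transforming the model's vertices by that pose before rendering. A minimal sketch follows, assuming a single rotation about the vertical axis plus a translation in place of a full 6DoF pose:

```python
import math

def pose_vertices(vertices, yaw_radians, translation):
    """Rotate model vertices about the y (vertical) axis, then translate.
    A stand-in for applying the full tracked 6DoF pose; names illustrative."""
    c, s = math.cos(yaw_radians), math.sin(yaw_radians)
    tx, ty, tz = translation
    posed = []
    for x, y, z in vertices:
        rx = c * x + s * z   # rotation about the y axis
        rz = -s * x + c * z
        posed.append((rx + tx, y + ty, rz + tz))
    return posed

# One model vertex posed with a 90-degree yaw and a 1 m offset along z:
corners = [(1.0, 0.0, 0.0)]
posed = pose_vertices(corners, math.pi / 2, (0.0, 0.0, 1.0))
```

A renderer would then project the posed vertices through the camera model tracked by the AR session so the model appears anchored to the real world object.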
In an example, the computer system 110 represents a suitable user device that includes, in addition to the depth sensor 112 and the RGB optical sensor 114, one or more graphics processing units (GPUs), one or more general purpose processors (GPPs), and one or more memories storing computer-readable instructions that are executable by at least one of the processors to perform various functionalities of the embodiments of the present disclosure. For instance, the computer system 110 can be any of a smartphone, a tablet, an AR headset, or a wearable AR device.
In addition, the depth sensor 112 and the RGB optical sensor 114, as installed in the computer system 110, may be separated by a transformation (e.g., distance offset, field of view angle difference, etc.). This transformation may be known and its value may be stored locally and/or accessible to the AR module 116. When cameras are used, a time of flight (ToF) camera and a color camera can have similar fields of view, but because of the transformation, the fields of view would partially, rather than fully, overlap.
The AR module 116 can be implemented as specialized hardware and/or a combination of hardware and software (e.g., general purpose processor and computer-readable instructions stored in memory and executable by the general purpose processor) . The AR module 116 can obtain the 3D model 122 of the real world object 130 to properly render the AR scene 120 as part of a video.
FIG. 2 is a simplified flowchart illustrating an example of a method 200 for providing 3D models of real objects in AR-enhanced video, according to an embodiment of the present disclosure. In some embodiments, the method 200 includes obtaining video (210) by using an image sensor (e.g., RGB optical sensor of FIG. 1) in communication with a computer system (e.g., computer system 110 of FIG. 1) , wherein the video depicts an object 220. The object 220 may be a real object in the environment of the video and/or it may be a real object which the user of the computer system intends to place in the video. To that end, the computer system may obtain a 3D model (e.g., 3D model 122 of FIG. 1) of the object in the video (212) , as described in more detail in reference to FIG. 5. Obtaining the 3D model of the object may include generating a 3D model by processing multiple images captured using the image sensor to generate a texture or skin to fit onto a depth map of the object, for example, by feature detection and mapping. The computer system may define contours 224 of the 3D model in the video by determining a pose (214) of the object 220. Pose refers to a reference location and orientation of the object 220, to be applied as an adjustment to a 3D model to make it appear to be a part of an image or scene (e.g., AR scene 120 of FIG. 1) when overlaying the 3D model on the object in the image or video. In some embodiments, the computer system provides enhanced video content by associating the 3D model with the video at a specific timestamp. This timestamp may be determined automatically or manually. For example, in an application where a 3D model is to be added to a video that has been distributed on a streaming platform, an algorithm analyzing video data may determine a segment of the video that features the object 220 at a particular time, and may associate the 3D model with a pose of that object at that timestamp (216) . 
In some embodiments, associating the 3D model with a pose of that object may include providing an AR scene to be accessible by the user of the user device when viewing enhanced video content. For example, the AR scene may be accessible via one or more interface elements (e.g., interface elements 128 of FIG. 1) that may hide, close, minimize, or otherwise obscure the video content, while surfacing the AR scene, presenting the 3D model in the AR scene as a real object in the environment of the user. Similarly, a user of the computer system may manually indicate, via a user interface (e.g., interface elements 128 of FIG. 1) , one or more timestamps at which AR  scenes are to be included in the enhanced video content. One or more timestamps may be used as a means of differentiating multiple 3D models associated with the video content.
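The association described above, one record per 3D model, pose, and timestamp, might be stored as follows. The data layout is an assumption for illustration, not a format specified by the disclosure:

```python
from dataclasses import dataclass

@dataclass
class ARSceneRecord:
    """Hypothetical record: which 3D model, with which pose, belongs to
    which timestamp of the video content."""
    model_id: str
    position: tuple      # (x, y, z) reference location of the object
    orientation: tuple   # e.g., (yaw, pitch, roll) in radians
    timestamp: float     # seconds into the video content

def associate_model(records, model_id, position, orientation, timestamp):
    # Append one association; multiple records differentiate multiple
    # 3D models by their timestamps.
    records.append(ARSceneRecord(model_id, position, orientation, timestamp))
    return records

records = associate_model([], "chair", (0.5, 0.0, 2.0), (0.0, 0.0, 0.0), 12.4)
records = associate_model(records, "lamp", (1.0, 0.0, 1.5), (0.0, 0.0, 0.0), 47.0)
```

Keeping one record per timestamp is what lets a player offer distinct AR scenes (with distinct models) at different points in the same video.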
FIG. 3 is another simplified flowchart illustrating another example of a method 300 for providing 3D models of real objects in AR-enhanced video, according to an embodiment of the present disclosure. In some embodiments, the process of obtaining 3D models includes a special user interface (e.g., interface elements 128 of FIG. 1) configured to receive a result indicating whether an association of a 3D model with a pose of a real world object is acceptable. To generate enhanced video content, some embodiments include identifying an object in video (310) , where the video is obtained as part of generating enhanced video content (e.g., obtaining video 210 of FIG. 2) as described in more detail in reference to FIG. 2. Obtaining a 3D model, as described in more detail in reference to FIG. 5, includes capturing images and feature data to generate a texture mapped to a set of coordinates that can be adjusted to match a pose of a real world object within an image. As illustrated in the example flowchart in FIG. 3, this includes obtaining a 3D model of a real world object in a video (320) , determining a pose of the object (322) , and associating the pose of the object with the video at a timestamp (324) , as described above in reference to FIG. 2.
In some embodiments, this process may also include generating a user interface to receive the acceptance result, by generating user interface elements (326) . The user interface elements 128 may form a part of a user interface for generating AR-enhanced video content. For example, the user interface elements 128 may permit a user of the computer system (e.g., computer system 110 of FIG. 1) to indicate acceptance or rejection of an association of a 3D model with a real object in the video. In some embodiments, when the computer system receives a rejection command (328) , the computer system repeats at least part of the method until it receives an acceptance command. For example, the computer system may repeat the entire process, beginning with obtaining a new 3D model of the object. Alternatively, the computer system may repeat only the association of the pose of the object with the video at the timestamp. That is to say that the 3D model may be acceptable, but the mapping of the 3D model to the real world object may be the reason that the computer system received the rejection command. In some embodiments, the computer system may determine improper mapping automatically (e.g., without user interaction) as part of the step of associating pose of the real world object, for example, by generating a measure of error between one or more features detected from the real world object and the corresponding edges and/or coordinates of the 3D model. In some cases, when the error is outside an allowable range, the computer system may automatically repeat the process described above (e.g., via implementation of an auto-fitting algorithm as part of determining the pose of the object) . In some embodiments, the computer system receives an accept command and/or does not receive a rejection command, in which case the computer system may encode and store the enhanced video content 330 in a data store, as described in more detail in reference to FIG. 4.
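The automatic path described above, generating a measure of error between detected features and the posed model and retrying while the error is outside the allowable range, can be sketched as follows. The refinement step here is a toy stand-in for re-running pose estimation:

```python
def mean_error(features, model_points):
    """Mean Euclidean distance between matched feature/model point pairs."""
    total = 0.0
    for (fx, fy), (mx, my) in zip(features, model_points):
        total += ((fx - mx) ** 2 + (fy - my) ** 2) ** 0.5
    return total / len(features)

def fit_pose(features, model_points, refine, tolerance=1.0, max_iters=10):
    """Accept the fit once the error is within tolerance, else refine."""
    for _ in range(max_iters):
        if mean_error(features, model_points) <= tolerance:
            return model_points, True        # accepted automatically
        model_points = refine(model_points)  # stand-in for re-estimating pose
    return model_points, False               # rejected: surface to the user

features = [(10.0, 10.0), (20.0, 10.0)]   # detected object features
points = [(14.0, 10.0), (24.0, 10.0)]     # projected 3D model points
nudge = lambda pts: [(x - 1.0, y) for x, y in pts]  # toy refinement step
points, accepted = fit_pose(features, points, nudge)
```

When the loop exhausts its iterations without converging, the rejection would be handled as above: re-obtaining the model, re-determining the pose, or prompting the user via the interface elements.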
FIG. 4 is another simplified flowchart illustrating a method 400 of providing 3D models of real objects as AR content in a video, according to an embodiment of the present disclosure. In some embodiments, a camera obtains video content 410 using a recording application or other software on a computer system (e.g., computer system 110 of FIG. 1). In some embodiments, the computer system may form a part of a mobile device, including, but not limited to, a smart phone, a tablet, an AR headset device, a dedicated device for creating AR content, and the like.
In some embodiments, the computer system identifies at least one object 420 in the video content 410 for creation of a 3D model based thereupon to be incorporated as AR content to the video. Until an object 420 is identified and/or while an object is not identified, the computer system may continue to obtain video content 410. In some embodiments, when an object 420 has been identified, the computer system may suspend obtaining video 422, such that the image sensor used to obtain video content 410 may be used to obtain a 3D model 424 of the object 420. In some embodiments, obtaining a 3D model includes processing data from both an image sensor (e.g., RGB optical sensor 114 of FIG. 1) and a depth sensor (e.g., sensor 112 of FIG. 1) , as described in more detail in reference to FIG. 5, below.
As described in more detail in reference to FIGS. 2-3, the computer system may place the partially or fully reconstructed 3D model 424 into the video content 410 by determining a pose 426 of the 3D model in each frame of the video content 410. In some embodiments, the pose 426 of the 3D model is determined using a computer vision based object pose estimation algorithm. In some embodiments, the pose 426 of the 3D model is determined in association with a coordinate map determined, for example, based at least in part on a simultaneous localization and mapping (SLAM) process executed by the computer system. Such processes define an output pose of the image sensor and a coordinate map defining a plurality of features in the environment around the computer system. In some embodiments, the plurality of features includes determining a shape of the object 420 and/or the environment where a 3D model of the object is to be placed in the video content 410.
In some embodiments, the computer system associates the 3D model 424 with enhanced video content 428 at least in part by generating an overlay of the 3D model 424 in the video content 410. Similarly, the enhanced video content 428 may include AR features (e.g., AR features 124 of FIG. 1) placed in the video content 410 automatically (e.g., based on feature detection algorithms) and/or manually by a user of the computer system. As described in more detail in reference to FIG. 3, the computer system or a user of the computer system may then determine whether the enhanced video content 428 satisfies one or more metrics of quality (e.g., via interface elements 128 of FIG. 1) . If the result 430 of the determination is satisfactory, the computer system may resume obtaining video 432, such that additional video content is appended to the enhanced video content 428.
In some embodiments, additional objects 440 are detected and identified, for which 3D models are obtained and associated with subsequent, additional AR scenes in the enhanced video content 428. While additional objects 440 remain, the computer system may continue obtaining video content 410, whereas when a user and/or the computer system determines that no additional objects remain, the computer system may encode and store the enhanced video content. Encoding may include, but is not limited to, generating a video content file for storage, for example, a compressed video file format or an AR file format (e.g., .obj, .fbx, .usdv, etc.).
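One possible way to encode the association between the video, its 3D model assets, and the AR scene timestamps is a JSON manifest stored alongside the video file. The schema below is purely illustrative and is not one of the formats named above:

```python
import json

def build_manifest(video_path, frame_rate, scenes):
    """Hypothetical sketch: serialize AR scene associations for storage.
    scenes: list of (timestamp_seconds, model_file) pairs."""
    return json.dumps({
        "video": video_path,
        "frame_rate": frame_rate,
        "ar_scenes": [
            {"timestamp": t, "model": model_file} for t, model_file in scenes
        ],
    }, indent=2)

manifest = build_manifest(
    "repair_tutorial.mp4", 30,
    [(12.4, "light_switch.obj"), (47.0, "cover_plate.obj")],
)
```

A player reading such a manifest could reconstruct, for each timestamp, which model to load when the viewer chooses to enter the AR scene.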
In some embodiments, the computer system stores 450 the video content 410, once generated, locally using a data store in communication with the computer system that is incorporated into the device (e.g., flash memory, a hard drive, etc. ) . In some embodiments, the  computer system stores the video content in a distributed storage system in communication with the computer system via a network (e.g., a cloud storage system) .
It should be appreciated that the specific steps illustrated in FIG. 4 provide a particular method of providing 3D models of real objects as AR content to a video according to an embodiment of the present disclosure. As noted above, other sequences of steps may also be performed according to alternative embodiments. For example, alternative embodiments of the present disclosure may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in FIG. 4 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
FIG. 5 depicts an illustrative example of a technique for obtaining 3D models using sensor data, according to an embodiment of the present disclosure. In accordance with at least some embodiments, sensor data 502 may be obtained from one or more input sensors installed upon a user device. The captured sensor data 502 includes image information 504 captured by a camera device (e.g., RGB optical sensor 114 of FIG. 1) as well as depth map information 506 captured by a depth sensor (e.g., sensor 112 of FIG. 1) .
As stated above, the sensor data 502 may include image information 504. One or more image processing techniques may be used on image information 504 in order to identify one or more objects within that image information 504. For example, edge detection may be used to identify a section 508 within the image information 504 that includes an object. To do this, discontinuities in brightness, color, and/or texture may be identified across an image in order to detect edges of various objects within the image. Section 508 depicts an illustrative example image of a chair in which such discontinuities have been emphasized.
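The brightness-discontinuity idea above can be sketched with a simple gradient-magnitude test. This is an illustrative stand-in for the edge detection described in the disclosure (a production system would more likely use Sobel or Canny operators), and the function name and threshold are assumptions:

```python
import numpy as np

def edge_map(gray, threshold=30.0):
    """Mark pixels where brightness changes sharply (candidate object edges).

    gray: (H, W) single-channel image, uint8 or float.
    Returns a boolean (H, W) mask, True at strong discontinuities.
    """
    g = gray.astype(np.float32)
    gx = np.zeros_like(g)
    gy = np.zeros_like(g)
    gx[:, 1:-1] = g[:, 2:] - g[:, :-2]   # horizontal central difference
    gy[1:-1, :] = g[2:, :] - g[:-2, :]   # vertical central difference
    return np.hypot(gx, gy) > threshold  # gradient magnitude vs. threshold
```

The same idea extends to color and texture discontinuities by computing gradients per channel or over local texture statistics.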
As also stated above, the sensor data 502 may include depth information 506. In the depth information 506, a value may be assigned to each pixel that represents a distance between the user device and a particular point corresponding to the location of that pixel. The depth information 506 may be analyzed to detect sudden variances in depth within the depth information 506. For example, sudden changes in distance may indicate an edge or a border of an object within the depth information 506.
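The "sudden variance in depth" test above can be sketched by flagging pixels whose distance jumps relative to a neighbor. Again this is a hedged illustration, not the disclosed algorithm; the jump threshold is an assumed parameter:

```python
import numpy as np

def depth_edges(depth, jump=0.1):
    """Flag pixels whose depth differs sharply from a neighbour,
    suggesting an edge or border of an object in the depth map.

    depth: (H, W) array of per-pixel distances (e.g., metres).
    jump:  minimum change in distance treated as "sudden".
    """
    dx = np.abs(np.diff(depth, axis=1))   # (H, W-1) horizontal jumps
    dy = np.abs(np.diff(depth, axis=0))   # (H-1, W) vertical jumps
    edges = np.zeros(depth.shape, dtype=bool)
    edges[:, 1:] |= dx > jump             # mark both pixels of each jump
    edges[:, :-1] |= dx > jump
    edges[1:, :] |= dy > jump
    edges[:-1, :] |= dy > jump
    return edges
```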
In some embodiments, the sensor data 502 may include both image information 504 and depth information 506. In at least some of these embodiments, objects may first be identified in either the image information 504 or the depth information 506 and various attributes of the objects may be determined from the other information. For example, edge detection techniques may be used to identify a section of the image information 504 that includes an object 508. The section 508 may then be mapped to a corresponding section 510 in the depth information to determine depth information for the identified object (e.g., a point cloud) . In another example, a section 510 that includes an object may first be identified within the depth information 506. In this example, the section 510 may then be mapped to a corresponding section 508 in the image information to determine appearance attributes for the identified object (e.g., color or texture values) .
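A minimal sketch of the cross-modal mapping described above, assuming the depth map is pixel-registered to the image (real devices need extrinsic calibration to achieve this alignment); the function and return values are illustrative only:

```python
import numpy as np

def object_attributes(image, depth, section_mask):
    """Given a section identified in one modality, gather the object's
    attributes from both: mean colour from the image and per-pixel
    distances (a crude point set) from the aligned depth map.

    image:        (H, W, 3) colour frame
    depth:        (H, W) depth map registered to the image
    section_mask: (H, W) boolean mask of the identified section
    """
    colors = image[section_mask]      # (N, 3) pixel colours inside the section
    distances = depth[section_mask]   # (N,) depths for the same pixels
    return colors.mean(axis=0), distances
```

Because the same mask indexes both arrays, an object found by edge detection in the image immediately yields depth samples, and vice versa.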
In some embodiments, various attributes (e.g., color, texture, point cloud data, object edges) of an object identified in sensor data 502 may be used as input to a machine learning module in order to identify or generate a 3D model 512 that matches the identified object. In some embodiments, a point cloud for the object may be generated from the depth information and/or image information and compared to point cloud data stored in a database to identify a closest matching 3D model. Alternatively, a 3D model of an object (e.g., a user or a product) may be generated using the sensor data 502. To do this, a mesh may be created from point cloud data obtained from a section 510 of depth information 506. The system may then map appearance data from a section of image information 504 corresponding to section 510 to the mesh to generate a basic 3D model. Although particular techniques are described, it should be noted that there are a number of techniques for identifying particular objects from sensor output.
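A common way to turn a section of depth information into the point cloud mentioned above is pinhole back-projection. The intrinsics (fx, fy, cx, cy) are assumed to be known from camera calibration, and this sketch deliberately omits the subsequent meshing and texture-mapping steps:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into a 3D point cloud using a pinhole
    camera model (fx, fy: focal lengths in pixels; cx, cy: principal point).
    Returns an (N, 3) array of [X, Y, Z] points for pixels with valid depth.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]   # drop pixels with no depth reading
```

The resulting cloud could then be compared against stored point clouds to find the closest matching 3D model, or triangulated into a mesh onto which image colors are mapped.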
As described elsewhere, sensor data captured by a user device (e.g., user device 102 of FIG. 1) may be used to generate a 3D model of a user using the techniques described above. This 3D model of a user may then be provided to a mobile application server as user data. In some embodiments, sensor data may be used to generate a 3D model of a product, which may then be stored in an object model database 238. For example, a user wishing to sell a product may capture sensor data related to the product from his or her user device. The user’s user device may then generate a 3D model in the manner outlined above and may provide that 3D model to the mobile application server.
FIG. 6 is a simplified flowchart illustrating a method of providing 3D models of real objects as AR content to a video according to an embodiment of the present disclosure. The flow is described in connection with a computer system that is an example of the computer systems described herein. Some or all of the operations of the flows can be implemented via specific hardware on the computer system and/or can be implemented as computer-readable instructions stored on a non-transitory computer-readable medium of the computer system. As stored, the computer-readable instructions represent programmable modules that include code executable by a processor of the computer system. The execution of such instructions configures the computer system to perform the respective operations. Each programmable module in combination with the processor represents a means for performing a respective operation (s) . While the operations are illustrated in a particular order, it should be understood that no particular order is necessary and that one or more operations may be omitted, skipped, and/or reordered.
The method includes obtaining video content using an optical sensor in communication with the computing system (602) . Furthermore, the video content includes multiple images in a sequence associated with a frame rate. As described in more detail in reference to FIG. 1, the video content is obtained by a computer system (e.g., computer system 110 of FIG. 1) using an image sensor (e.g., RGB optical sensor 114 of FIG. 1) . In some embodiments, the video depicts a real world object (e.g., real world object 130 of FIG. 1) in an environment (e.g., real-world environment 132 of FIG. 1) .
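As a minimal illustration of the relationship between the frame rate and the per-image timestamps used later to attach AR content (the function name is an assumption, not part of the disclosure):

```python
def frame_timestamps(num_frames, frame_rate):
    """Timestamps (seconds) for a sequence of images captured at a fixed
    frame rate; frame i is displayed at i / frame_rate seconds."""
    return [i / frame_rate for i in range(num_frames)]
```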
The method further includes obtaining, by the computing system, a three-dimensional (3D) model at least in part associated with an object depicted in the video content (604) . Optionally, obtaining a 3D model includes generating, by the computing system, a 3D model of the object using one or more images obtained by a sensor in communication with the computing system. As described in more detail in reference to FIG. 5, the computer system may obtain a 3D model of the real world object by determining one or more features of the real world object, such as edges, and may map a texture and/or skin to those features, the texture and/or skin obtained by processing multiple images of the real world object. In some cases, obtaining a 3D model may include moving the computer system (e.g., a tablet, smart phone, or AR device) around the real world object, such that sensors in communication with the computer system may obtain depth and image data, from which an AR module (e.g., AR module 116 of FIG. 1) may obtain the 3D model. Optionally, obtaining a 3D model includes suspending, by the computing system, obtaining video content. Optionally, the method further includes resuming, by the computing system, obtaining the video content after associating the 3D model and the first pose with the first image of the plurality of images at the first timestamp.
The method further includes determining, by the computing system, a first pose of the object describing a first three-dimensional condition of the object (606) . As described in more detail in reference to FIGS. 1-4, associating the pose with the three-dimensional condition of the object includes determining a coordinate map and a camera and/or sensor pose, based at least in part on images and position data collected by the computer system. Furthermore, the contours and/or features of the object may be described using feature detection and tracking implemented by the computer system, such that the edges of the real world object are described in terms of the coordinate map and adjusted for the perspective of the image sensor.
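One common way to represent such a pose, though not necessarily the representation used in the disclosure, is a 4x4 rigid transform in the scene's coordinate map (rotation for orientation, translation for position). The names below are illustrative:

```python
import numpy as np

def pose_matrix(rotation, translation):
    """Assemble a 4x4 rigid transform describing an object's pose
    (orientation + position) in the scene coordinate map."""
    T = np.eye(4)
    T[:3, :3] = rotation      # (3, 3) rotation matrix
    T[:3, 3] = translation    # (3,) position in the coordinate map
    return T

def transform_points(pose, points):
    """Apply a pose to (N, 3) model points, yielding scene coordinates."""
    homo = np.hstack([points, np.ones((len(points), 1))])
    return (pose @ homo.T).T[:, :3]
```

Expressing both the camera and the object as such transforms makes the perspective adjustment mentioned above a single matrix composition.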
The method further includes associating, by the computing system, the 3D model and the first pose with a first image of the plurality of images at a first timestamp (608) . In some embodiments, this includes adjusting the 3D model of the real world object to overlay the model on the image of the object in the video. In some embodiments, this includes applying one or more adjustments to the 3D model to fit the pose of the real world object in the video, and placing the model in a pose in the video frame corresponding to the pose of the real world object. The 3D model may then be adjusted to fit the image of the real world object, and included in an AR scene corresponding to a timestamp in the video.
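The association of a model and pose with a timestamped frame can be sketched as a small data structure. This is a hedged illustration of the bookkeeping only; the class and field names are assumptions, and a real system would serialize this into the media file described later:

```python
from dataclasses import dataclass, field

@dataclass
class ARAnnotation:
    """Association of one 3D model and its pose with one frame timestamp."""
    model_id: str
    pose: list          # e.g., a flattened 4x4 rigid transform
    timestamp: float    # seconds into the video

@dataclass
class EnhancedVideo:
    """Video content plus the AR annotations attached to it."""
    annotations: list = field(default_factory=list)

    def associate(self, model_id, pose, timestamp):
        self.annotations.append(ARAnnotation(model_id, pose, timestamp))

    def annotations_at(self, timestamp, tolerance=1e-3):
        """AR content that should be offered at (or near) a timestamp."""
        return [a for a in self.annotations
                if abs(a.timestamp - timestamp) <= tolerance]
```

During playback, a lookup by the current timestamp decides whether to surface an interactive AR scene to the viewer.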
Optionally, the method further includes providing for presentation, by the computing system, user interface data, a user interface generated with the user interface data being configured to present an interactive user interface element overlaid on an image of the plurality of images corresponding to the first timestamp, wherein the interactive user interface element is configured to receive a user command. The method optionally includes, in accordance with receiving the user command, presenting the partially or fully obtained 3D model overlaid on the video content. As described in more detail in reference to FIG. 1, the computer system provides an AR scene (e.g., AR scene 120 of FIG. 1) in the video, thereby providing enhanced video content.
Optionally, the method further includes repeating the above mentioned steps for additional real world objects in the video, as described in more detail in reference to FIG. 4. Optionally, the method further includes generating a media file encoding video and the 3D model in a computer readable format and storing the media file in a data store.
As described in more detail in reference to FIGS. 2-4, the computer system may generate and/or present the overlaid partially or fully obtained 3D model on the video at the timestamp, such that the resulting AR scene does not satisfy acceptance criteria. For example, the pose and location of the real world object may have been determined incorrectly. In another example, the mapping of skin to features in the 3D model may be incomplete or of insufficient quality. Optionally, the method further includes providing for presentation, by the computing system, user interface data, a user interface generated with the user interface data being configured to present the 3D model and an interactive user interface element overlaid on the video content, wherein the interactive user interface element is configured to receive a user reject command. As part of the optional inclusion, the method further includes, in accordance with receiving a user reject command, at least one of re-obtaining an updated 3D model of the object, re-determining an updated first pose of the object describing the first three-dimensional condition of the object, or re-associating the 3D model and the updated first pose with the first image of the plurality of images at the first timestamp.
It should be appreciated that the specific steps illustrated in FIG. 6 provide a particular method of providing 3D models of real objects as AR content to a video according to an embodiment of the present disclosure. As noted above, other sequences of steps may also be performed according to alternative embodiments. For example, alternative embodiments of the present disclosure may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in FIG. 6 may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.
FIG. 7A is an illustrative example of AR content including a 3D model 710, according to an embodiment of the present disclosure. In some embodiments, a user of a mobile device 702 viewing the AR-enhanced video content (e.g., AR-enhanced video content 428 of FIG. 4) at a first timestamp may be presented with the option to enter an AR scene associated with the first timestamp 720. The AR scene 720 may include the 3D model 710, one or more interface elements 722, and video content controls 724. In some embodiments, the 3D model 710 may be a first 3D model associated with a real object, as described in more detail in reference to FIG. 4. For example, the 3D model 710 may include, but is not limited to, a piece of furniture in a physical condition (e.g., incomplete assembly/disassembly) , a home repair component (e.g., a light switch or replacement part) , or the like. In some embodiments, the interface elements 722 may include placement and/or orientation controls for placing the 3D model 710 in an environment 726 of the user device 702 (e.g., place, rotate, etc. ) . The interface elements 722 may permit the user of the user device 702 to customize the AR scene associated with the first timestamp 720 by controlling the placement and/or orientation of the 3D model 710. In some embodiments, the video content controls 724 may include a user interface including one or more controls including, but not limited to, an indicator of the position of the timestamp in the video content, tracking controls, and/or display controls. In some embodiments, the environment 726 may be a domestic interior, an office interior, a public space inside or outside a building, a street environment, a park environment, etc. 
In some embodiments, the video content may be hidden, closed, or displayed in a smaller picture by the user device 702 while the user of the user device 702 is viewing the first AR scene 720, to be resurfaced in response to the user of the user device 702 ending the first AR scene 720 (e.g., via the one or more interface elements 722) .
FIG. 7B is another illustrative example of AR content including another 3D model 712, according to an embodiment of the present disclosure. In some embodiments, the AR-enhanced video content (e.g., AR-enhanced video content 428 of FIG. 4) includes another 3D model 712 presented in a second AR scene associated with a second timestamp 740 on the user device 702. The second AR scene 740 may include one or more interface elements 742, which may be the same as or different from the one or more interface elements 722 of the AR scene associated with the first timestamp 720. Similarly, an environment 746 of the second AR scene 740 may be the same as or different from the environment 726 of the first AR scene 720. As with the first AR scene 720, the video content may be minimized and/or hidden or displayed in a smaller picture during interaction with the other 3D model 712, such that the video content may resume, via the video content controls 744, after the user of the user device 702 ends the second AR scene 740.
In an illustrative example of FIG. 7A and/or FIG. 7B, a smartphone is used for a video session that shows the augmented real-world environment at a specific time in a video (e.g., a timestamp). In particular, the 3D model (710 and 712) may be overlaid on the real world object in the video at a specific timestamp, such that a user of the computer system may interrupt the video to enter the AR session. More particularly, the AR session renders the 3D model (710 and 712) on top of the live video feed in an AR scene, using one or more interface elements. In addition, a user of the smartphone can interact with the 3D model to move the 3D model (e.g., by rotation in place, translation, etc.) within the AR scene being presented. During the AR-enhanced video, multiple AR scenes may be included, allowing the viewer to interact with objects as 3D models in an enhanced video. For example, playing the enhanced video may include generating a menu of interface elements 722 or 742 at the timestamp associated with the AR scene, such that a viewer of the enhanced video may pause the enhanced video to present the AR scene. The viewer may then return to watching the video as normal. Presenting the AR scene may include presenting the 3D model associated with a specific timestamp in a real-world environment of the user, such that the object appears in front of the user as a real object, viewed through the user device from multiple angles by maneuvering the user device around the virtual position of the 3D model.
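The "rotate in place" and "place" controls described for the interface elements can be sketched as operations on the model's 4x4 pose. This is an illustrative sketch under the assumption that the pose is stored as a rigid transform with Y as the vertical axis; it is not the disclosed interface implementation:

```python
import numpy as np

def rotate_in_place(pose, angle_rad):
    """Rotate an object's 4x4 pose about the vertical (Y) axis without
    moving its position, as a 'rotate' interface control might."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    ry = np.array([[c, 0.0, s],
                   [0.0, 1.0, 0.0],
                   [-s, 0.0, c]])
    out = pose.copy()
    out[:3, :3] = ry @ pose[:3, :3]   # change orientation only
    return out

def translate(pose, offset):
    """Move an object's pose by a 3D offset (a 'place'/drag control)."""
    out = pose.copy()
    out[:3, 3] += offset
    return out
```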
FIG. 8 illustrates examples of components of a computer system 800 according to certain embodiments. The computer system 800 is an example of the computer system described herein above. Although these components are illustrated as belonging to a same computer system 800, the computer system 800 can also be distributed.
The computer system 800 includes at least a processor 802, a memory 804, a storage device 806, input/output peripherals (I/O) 808, communication peripherals 810, and an interface bus 812. The interface bus 812 is configured to communicate, transmit, and transfer data, controls, and commands among the various components of the computer system 800. The memory 804 and the storage device 806 include computer-readable storage media, such as RAM, ROM, electrically erasable programmable read-only memory (EEPROM), hard drives, CD-ROMs, optical storage devices, magnetic storage devices, electronic non-volatile computer storage, for example flash memory, and other tangible storage media. Any of such computer readable storage media can be configured to store instructions or program codes embodying aspects of the disclosure. The memory 804 and the storage device 806 also include computer readable signal media. A computer readable signal medium includes a propagated data signal with computer readable program code embodied therein. Such a propagated signal takes any of a variety of forms including, but not limited to, electromagnetic, optical, or any combination thereof. A computer readable signal medium includes any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use in connection with the computer system 800.
Further, the memory 804 includes an operating system, programs, and applications. The processor 802 is configured to execute the stored instructions and includes, for example, a logical processing unit, a microprocessor, a digital signal processor, and other processors. The memory 804 and/or the processor 802 can be virtualized and can be hosted within another computer system of, for example, a cloud network or a data center. The I/O peripherals 808 include user interfaces, such as a keyboard, screen (e.g., a touch screen) , microphone, speaker, other input/output devices, and computing components, such as graphical processing units, serial ports, parallel ports, universal serial buses, and other input/output peripherals. The I/O peripherals 808 are connected to the processor 802 through any of the ports coupled to the interface bus 812. The communication peripherals 810 are configured to facilitate communication between the computer system 800 and other computing devices over a communications network and include, for example, a network interface controller, modem, wireless and wired interface cards, antenna, and other communication peripherals.
While the present subject matter has been described in detail with respect to specific embodiments thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing may readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, it should be understood that the present disclosure has been presented for purposes of example rather than limitation, and does not preclude inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. Indeed, the methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the present disclosure. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the present disclosure.
Unless specifically stated otherwise, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” and “identifying” or the like refer to actions or processes of a computing device, such as one or more computers or a similar electronic computing device or devices, that manipulate or transform data represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the computing platform.
The system or systems discussed herein are not limited to any particular hardware architecture or configuration. A computing device can include any suitable arrangement of components that provide a result conditioned on one or more inputs. Suitable computing devices include multipurpose microprocessor-based computer systems accessing stored software that programs or configures the computer system from a general-purpose computing apparatus to a specialized computing apparatus implementing one or more embodiments of the present subject matter. Any suitable programming, scripting, or other type of language or combinations of languages may be used to implement the teachings contained herein in software to be used in programming or configuring a computing device.
Embodiments of the methods disclosed herein may be performed in the operation of such computing devices. The order of the blocks presented in the examples above can be varied; for example, blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
Conditional language used herein, such as, among others, “can,” “could,” “might,” “may,” “e.g.,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain examples include, while other examples do not include, certain features, elements, and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more examples or that one or more examples necessarily include logic for deciding, with or without author input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular example.
The terms “comprising,” “including,” “having,” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list. The use of “adapted to” or “configured to” herein is meant as open and inclusive language that does not foreclose devices adapted to or configured to perform additional tasks or steps. Additionally, the use of “based on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited. Similarly, the use of “based at least in part on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based at least in part on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited. Headings, lists, and numbering included herein are for ease of explanation only and are not meant to be limiting.
The various features and processes described above may be used independently of one another, or may be combined in various ways. All possible combinations and sub-combinations are intended to fall within the scope of the present disclosure. In addition, certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state. The example blocks or states may be performed in serial, in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed examples. Similarly, the example systems and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged compared to the disclosed examples.

Claims (20)

  1. A method, by a computing system, of providing enhanced augmented reality content, the method comprising:
    obtaining, by the computing system, video content by using an optical sensor in communication with the computing system, wherein the video content comprises a plurality of images in a sequence associated with a frame rate;
    obtaining, by the computing system, a three-dimensional (3D) model at least in part associated with an object depicted in the video content;
    determining, by the computing system, a first pose of the object describing a first three-dimensional condition of the object; and
    associating, by the computing system, the 3D model and the first pose with a first image of the plurality of images at a first timestamp.
  2. The method of claim 1, wherein obtaining a 3D model comprises:
    generating, by the computing system, the 3D model of the object by using one or more images obtained by a sensor in communication with the computing system.
  3. The method of claim 1, further comprising:
    providing for presentation, by the computing system, user interface data, a user interface generated with the user interface data being configured to present an interactive user interface element overlaid on an image of the plurality of images corresponding to the first timestamp, wherein the interactive user interface element is configured to receive a user command; and
    in accordance with receiving the user command, presenting, by the computing system, the partially or fully obtained 3D model overlaid on the video content.
  4. The method of claim 1, further comprising:
    obtaining, by the computing system, a second 3D model at least in part associated with a second object portrayed in the video content;
    determining, by the computing system, a second pose of the second object describing a second three-dimensional condition of the second object; and
    associating, by the computing system, the second 3D model, and the second pose with a second image of the plurality of images at a second timestamp.
  5. The method of claim 1, wherein:
    obtaining a three-dimensional (3D) model comprises: suspending, by the computing system, obtaining video content; and
    the method further comprises: resuming, by the computing system, obtaining the video content after associating the 3D model and the first pose with the first image of the plurality of images at the first timestamp.
  6. The method of claim 1, further comprising:
    providing for presentation, by the computing system, user interface data, a user interface generated with the user interface data being configured to present the 3D model and an  interactive user interface element overlaid on the video content, wherein the interactive user interface element is configured to receive a user reject command; and
    in accordance with receiving a user reject command, at least one of:
    re-obtaining, by the computing system, an updated 3D model of the object;
    re-determining, by the computing system, an updated first pose of the object describing the first three-dimensional condition of the object; or
    re-associating, by the computing system, the 3D model, and the updated first pose with the first image of the plurality of images at the first timestamp.
  7. The method of claim 1, further comprising:
    generating a media file encoding video and the 3D model in a computer readable format; and
    storing the media file in a data store.
  8. A system, comprising:
    a processor; and
    a memory including instructions that, when executed with the processor, cause the system to, at least:
    obtain video content using an optical sensor in communication with the system, wherein the video content comprises a plurality of images in a sequence associated with a frame rate;
    obtain a three-dimensional (3D) model at least in part associated with an object depicted in the video content;
    determine a first pose of the object describing a first three-dimensional condition of the object; and
    associate the 3D model and the first pose with a first image of the plurality of images at a first timestamp.
  9. The system of claim 8, wherein obtaining a 3D model comprises:
    generating the 3D model from the object, by using one or more images obtained by a sensor in communication with the system.
  10. The system of claim 8, wherein the instructions further cause the system to:
    provide for presentation user interface data, a user interface generated with the user interface data being configured to present an interactive user interface element overlaid on an image of the plurality of images corresponding to the first timestamp, wherein the interactive user interface element is configured to receive a user command; and
    in accordance with receiving the user command, presenting the partially or fully obtained 3D model overlaid on the video content.
  11. The system of claim 8, wherein the instructions further cause the system to:
    obtain a second 3D model at least in part associated with a second object portrayed in the video content;
    determine a second pose of the second object describing a second three-dimensional condition of the second object; and
    associate the second 3D model and the second pose with a second image of the plurality of images at a second timestamp.
  12. The system of claim 8, wherein:
    obtaining a 3D model comprises: suspending obtaining video content; and
    the instructions further cause the system to resume obtaining the video content after associating the 3D model and the first pose with the first image of the plurality of images at a first timestamp.
  13. The system of claim 8, wherein the instructions further cause the system to:
    provide for presentation user interface data, a user interface generated with the user interface data being configured to present the 3D model and an interactive user interface element overlaid on the video content, wherein the interactive user interface element is configured to receive a user reject command; and
    in accordance with receiving a user reject command, at least one of:
    re-obtain an updated 3D model of the object;
    re-determine an updated first pose of the object describing the first three-dimensional condition of the object; or
    re-associate the 3D model and the updated first pose with the first image of the plurality of images at the first timestamp.
  14. The system of claim 8, wherein the instructions further cause the system to:
    generate a media file encoding video and the 3D model in a computer readable format; and
    store the media file in a data store.
  15. A non-transitory computer readable medium storing specific computer-executable instructions that, when executed by a processor, cause a computer system to at least:
    obtain video content by using an optical sensor in communication with the computing system, wherein the video content comprises a plurality of images in a sequence associated with a frame rate;
    obtain a three-dimensional (3D) model at least in part associated with an object depicted in the video content;
    determine a first pose of the object describing a first three-dimensional condition of the object; and
    associate the 3D model and the first pose with a first image of the plurality of images at a first timestamp.
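As a hypothetical, non-limiting sketch (not part of the claimed subject matter), the core steps of claim 15 — a frame sequence tied to a frame rate, and a 3D model plus pose associated with the image at a given timestamp — could be modeled as below; every class and field name is an assumption for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Pose:
    # Hypothetical 6-DoF pose: translation (x, y, z) and rotation quaternion
    translation: tuple
    rotation: tuple

@dataclass
class Frame:
    timestamp: float            # seconds from start of capture
    image_id: int               # index into the image sequence
    annotations: list = field(default_factory=list)

@dataclass
class VideoContent:
    frame_rate: float
    frames: list = field(default_factory=list)

    def frame_at(self, timestamp):
        # Map a timestamp to the nearest frame in the sequence
        index = round(timestamp * self.frame_rate)
        return self.frames[min(index, len(self.frames) - 1)]

    def associate(self, timestamp, model_id, pose):
        # Attach the (3D model, pose) pair to the frame at the timestamp
        frame = self.frame_at(timestamp)
        frame.annotations.append((model_id, pose))
        return frame

# Usage: 30 fps video; associate a model at t = 0.5 s (frame 15)
video = VideoContent(
    frame_rate=30.0,
    frames=[Frame(timestamp=i / 30.0, image_id=i) for i in range(90)],
)
frame = video.associate(0.5, "chair_model", Pose((0.0, 0.0, 1.2), (0, 0, 0, 1)))
```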
  16. The non-transitory computer readable medium of claim 15, wherein obtaining a 3D model comprises:
    generating the 3D model from the object, by using one or more images obtained by a sensor in communication with the computing system.
  17. The non-transitory computer readable medium of claim 15, wherein the specific computer-executable instructions further cause the system to:
    provide for presentation user interface data, a user interface generated with the user interface data being configured to present an interactive user interface element overlaid on an image of the plurality of images corresponding to the first timestamp, wherein the interactive user interface element is configured to receive a user command; and
    in accordance with receiving the user command, presenting the partially or fully obtained 3D model overlaid on the video content.
  18. The non-transitory computer readable medium of claim 15, wherein the specific computer-executable instructions further cause the system to:
    obtain a second 3D model at least in part associated with a second object portrayed in the video content;
    determine a second pose of the second object describing a second three-dimensional condition of the second object; and
    associate the second 3D model and the second pose with a second image of the plurality of images at a second timestamp.
  19. The non-transitory computer readable medium of claim 15, wherein obtaining a three-dimensional (3D) model comprises: suspending obtaining video content; and
    the specific computer-executable instructions further cause the system to resume obtaining the video content after associating the 3D model and the first pose with the first image of the plurality of images at a first timestamp.
  20. The non-transitory computer readable medium of claim 15, wherein the specific computer-executable instructions further cause the system to:
    provide for presentation user interface data, a user interface generated with the user interface data being configured to present the 3D model and an interactive user interface element overlaid on the video content, wherein the interactive user interface element is configured to receive a user reject command; and
    in accordance with receiving a user reject command, at least one of:
    re-obtain an updated 3D model of the object;
    re-determine an updated first pose of the object describing the first three-dimensional condition of the object; or
    re-associate the 3D model and the updated first pose with the first image of the plurality of images at the first timestamp.
PCT/CN2021/080211 2020-04-01 2021-03-11 System and method for enhancing subjects in videos WO2021197016A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063003541P 2020-04-01 2020-04-01
US63/003,541 2020-04-01

Publications (1)

Publication Number Publication Date
WO2021197016A1 (en) 2021-10-07

Family

ID=77927700

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/080211 WO2021197016A1 (en) 2020-04-01 2021-03-11 System and method for enhancing subjects in videos

Country Status (1)

Country Link
WO (1) WO2021197016A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024116059A1 (en) * 2022-11-30 2024-06-06 Ricoh Company, Ltd. Display system, display method, and recording medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103366610A (en) * 2013-07-03 2013-10-23 熊剑明 Augmented-reality-based three-dimensional interactive learning system and method
CN106251405A (en) * 2016-07-26 2016-12-21 北京奇虎科技有限公司 Augmented reality method and terminal
US20170212585A1 (en) * 2016-01-25 2017-07-27 Samsung Electronics Co., Ltd Ar output method and electronic device for supporting the same
CN108537889A (en) * 2018-03-26 2018-09-14 广东欧珀移动通信有限公司 Augmented reality model adjustment method and apparatus, storage medium, and electronic device
US20200051304A1 (en) * 2018-08-08 2020-02-13 Samsung Electronics Co., Ltd Electronic device for displaying avatar corresponding to external object according to change in position of external object

Similar Documents

Publication Publication Date Title
CN111656407B (en) Fusing, texturing and rendering views of a dynamic three-dimensional model
US10223834B2 (en) System and method for immersive and interactive multimedia generation
CN109690617B (en) System and method for digital cosmetic mirror
US10417829B2 (en) Method and apparatus for providing realistic 2D/3D AR experience service based on video image
US10002463B2 (en) Information processing apparatus, information processing method, and storage medium, for enabling accurate detection of a color
US20200402248A1 (en) Volumetric depth video recording and playback
US10692288B1 (en) Compositing images for augmented reality
US20150116502A1 (en) Apparatus and method for dynamically selecting multiple cameras to track target object
US10171785B2 (en) Color balancing based on reference points
KR20170031733A (en) Technologies for adjusting a perspective of a captured image for display
US20140152660A1 (en) Method for creating 3-d models by stitching multiple partial 3-d models
KR102067823B1 (en) Method and apparatus for operating 2d/3d augument reality technology
AU2013273722A1 (en) Method, system and apparatus for removing a marker projected in a scene
JP7162750B2 (en) Image processing device, image processing method, and program
US11900552B2 (en) System and method for generating virtual pseudo 3D outputs from images
KR101652594B1 (en) Apparatus and method for providingaugmented reality contentents
US10296080B2 (en) Systems and methods to simulate user presence in a real-world three-dimensional space
WO2021197016A1 (en) System and method for enhancing subjects in videos
KR102131923B1 (en) Method and system for real-time generation of 3D avatar for virtual fitting
JP6799468B2 (en) Image processing equipment, image processing methods and computer programs
US11127218B2 (en) Method and apparatus for creating augmented reality content
CN113552942A (en) Method and equipment for displaying virtual object based on illumination intensity
CN115104078A (en) System and method for enhanced remote collaboration
US20210366204A1 (en) Method and system for color grading a virtual reality video content
US20240020901A1 (en) Method and application for animating computer generated images

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21782347; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21782347; Country of ref document: EP; Kind code of ref document: A1)