WO2016095057A1

WO2016095057A1 - Peripheral tracking for an augmented reality head mounted device

Info

Publication number: WO2016095057A1
Application number: PCT/CA2015/051353
Authority: WO
Inventors: Dhanushan Balachandreswaran; Kibaya Mungai Njenga; Jian Yao; Mehdi MAZAHERI TEHRANI
Original assignee: Sulon Technologies Inc.
Priority date: 2014-12-19
Filing date: 2015-12-18
Publication date: 2016-06-23

Abstract

An augmented reality head mounted device (HMD) comprises at least one downward facing camera system for capturing a downward image stream of a region of the physical environment below the HMD. A processor of the HMD is configured to obtain depth information from the downward image stream and map the captured region to a virtual map. The processor may render virtual elements of CGI reflective of mapped features of the physical environment, including parts of the user's body lying within the captured region. An augmented reality image stream representing an augmented reality environment reflecting the user's engagement with the physical environment is presented to the user and other users.

Description

PERIPHERAL TRACKING FOR AN AUGMENTED REALITY HEAD MOUNTED DEVICE TECHNICAL FIELD

[0001] The following relates generally to systems and methods for augmented reality or virtual reality environments, and more specifically to systems and methods for detecting the physical environment for use in rendering an augmented reality or virtual reality environment.

BACKGROUND

[0002] The range of applications for augmented reality and virtual reality visualization has increased with the advent of wearable technologies and 3-dimensional rendering techniques. Augmented reality and virtual reality exist on a continuum of mixed reality visualization.

SUMMARY

[0003] In one aspect, an augmented reality (AR) head mounted device (HMD) wearable by a user in a physical environment is provided, the HMD comprising: at least one downward facing camera system having a field of view facing downward relative to the user's natural field of view to capture a downward image stream of a downward region of the physical environment; and a processor communicatively coupled to the at least one downward facing camera system, configured to obtain the downward image stream to obtain or derive depth information for the downward region of the physical environment.

[0004] In another aspect, an augmented reality (AR) head mounted device (HMD) wearable by a user in a physical environment is provided, the HMD comprising a processor communicatively coupled to at least one downward facing camera system having a field of view facing downward relative to the user's natural field of view to capture a downward image stream of a downward region of the physical environment, the processor configured to obtain the downward image stream to obtain or derive depth information for the downward region of the physical environment.

[0005] In a further aspect, a method of obtaining depth information for a downward region of a physical environment using an augmented reality (AR) head mounted device (HMD) wearable by a user in a physical environment is provided, the method comprising: receiving, from at least one downward facing camera system having a field of view facing downward relative to the user's natural field of view, a downward image stream of a downward region of the physical environment; and obtaining, by a processor communicatively coupled to the HMD, the downward image stream to obtain or derive depth information for the downward region of the physical environment.

[0006] These and other aspects are contemplated and described herein. It will be appreciated that the foregoing summary sets out representative aspects of systems and methods for peripheral tracking for a head mounted device for augmented reality and virtual reality applications to assist skilled readers in understanding the following detailed description.

DESCRIPTION OF THE DRAWINGS

[0007] A greater understanding of the embodiments will be had with reference to the Figures, in which:

[0008] Fig. 1 illustrates an embodiment of a head mounted device;

[0009] Fig. 2 illustrates fields of view for a downward facing camera system of a head mounted device;

[0010] Fig. 3 illustrates a user equipped wearing a head mounted device comprising a downward facing camera system;

[001 1] Fig. 4 illustrates a method of tracking a user's pose within a physical environment from an image stream of a stereo camera in a downward facing camera system; and

[0012] Fig. 5 shows a side elevational view of a user wearing a head mounted device having a forward facing camera system and a downward facing camera system.

DETAILED DESCRIPTION

[0013] For simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the Figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein may be practised without these specific details. In other instances, well known methods, procedures and components have not been described in detail so as not to obscure the embodiments described herein. Also, the description is not to be considered as limiting the scope of the embodiments described herein.

[0014] Various terms used throughout the present description may be read and understood as follows, unless the context indicates otherwise: "or" as used throughout is inclusive, as though written "and/or"; singular articles and pronouns as used throughout include their plural forms, and vice versa; similarly, gendered pronouns include their counterpart pronouns so that pronouns should not be understood as limiting anything described herein to use, implementation, performance, etc. by a single gender; "exemplary" should be understood as "illustrative" or "exemplifying" and not necessarily as "preferred" over other embodiments. Further definitions for terms may be set out herein; these may apply to prior and subsequent instances of those terms, as will be understood from a reading of the present description.

[0015] Any module, unit, component, server, computer, terminal, engine or device exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, data libraries, or data storage devices (removable and/or non-removable) such as, for example, magnetic discs, optical discs, or tape. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD- ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disc storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any such computer storage media may be part of the device or accessible or connectable thereto. Further, unless the context clearly indicates otherwise, any processor or controller set out herein may be implemented as a singular processor or as a plurality of processors. The plurality of processors may be arrayed or distributed, and any processing function referred to herein may be carried out by one or by a plurality of processors, even though a single processor may be exemplified. Any method, application or module herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media and executed by the one or more processors.

[0016] The present disclosure is directed to systems and methods for augmented reality ("AR") or virtual reality ("VR"). The term "AR" as used herein may refer to both VR and AR. In the present disclosure, AR includes: visualization or interaction by a user with real physical objects and structures along with virtual objects and structures overlaid thereon; and viewing or interaction by a user with a fully virtual set of objects and structures that are generated to include renderings of physical objects and structures and that may comply with scaled versions of physical environments to which virtual objects and structures are applied, which may alternatively be referred to as an "enhanced virtual reality". Further, the virtual objects and structures could be dispensed with altogether, and the AR system may display to the user a version of the physical environment which solely comprises an image stream of the physical environment. Finally, a skilled reader will also appreciate that by discarding aspects of the physical environment, the systems and methods presented herein are also applicable to VR applications, which may be understood as "pure" VR. For the reader's convenience, the following may refer to "AR" but is understood to include all of the foregoing and other variations recognized by the skilled reader.

[0017] A user may wear a head mounted device ("HMD") to view an AR environment presented on a display of the HMD. As described above, the AR environment comprises rendered virtual elements which may or may not be combined with real images of the physical environment ("physical images"). The user may move throughout the surrounding physical environment. The user's sense of immersion within the presented AR environment may be enhanced by tracking the user's movements within the physical environment into the AR environment. The user's interaction with the AR environment may be further enhanced through gesture recognition or other recognition of the user's motions within the physical environment as inputs to modify the parameters of the AR environment. The user's sense of immersion may be further enhanced by reflecting other features from the physical environment in the AR environment presented to the user.

[0018] The HMD itself may be equipped and configured to perform such tracking and/or recognition through one or more of pose tracking, mapping of the physical environment, gesture detection and graphics rendering. An HMD may comprise a forward facing camera system disposed substantially in front of the user's face when worn. The forward facing camera system has at least one camera to capture a physical image stream comprising depth and/or visual information from a region of the physical environment before the user. An onboard or external processor communicatively coupled to the forward facing camera system may generate a map of the physical environment using depth and visual information from the physical image stream and may base the AR environment on the map. The forward facing direction of the one or more cameras results in a general alignment of the user's natural field of view (FOV) with the effective FOV of the forward facing camera system. At a given moment, the AR environment presented to the user may then reflect substantially real-time attributes of the captured region of the physical environment. For example, the processor may provide physical images to a display of the HMD as the processor receives them from an image camera of the forward facing camera system, so that the user's view of the physical environment is a substantially real-time representation of the state of the physical environment. For clarity, the term "real-time" may be used without the qualifier "substantially". However, it will be appreciated that the term "realtime", whether qualified or not, encompasses exact simultaneity, as well as approximate simultaneity, in which a degree of lag is permissible between events in the physical environment and their portrayal in the AR image stream. It is desirable to minimize such lag, preferably so as to be imperceptible to a typical user.

[0019] Accurate and efficient pose tracking, particularly of the user's head, is desirable to determine the user's substantially real-time gaze so that virtual elements within the AR environment reflect the user's gaze. An AR HMD may be equipped and configured to perform pose tracking based on depth or visual information obtained from an image stream. The image stream may be provided by a camera system disposed on the HMD. A potential source for the image stream is one of the above-described forward facing camera systems; in addition to using the physical image stream for displaying substantially real-time representations of the physical environment to the user, the processor may also use the same physical image stream for pose tracking. In image or depth-based pose tracking, an onboard or external processor communicatively coupled to the one or more cameras identifies salient features common to a series of frames in an image stream from the cameras; those features may be depth or image features. The processor discerns changes to the user's pose across frames based on changes to the identified features between the captured frames.

[0020] The efficiency and accuracy of pose tracking may be increased by capturing a physical image stream with a relatively greater number and distinctiveness of features. For example, an image stream showing a flat, white wall with a prominent, dark pattern applied to the wall may be preferable for pose tracking relative to a physical image stream showing the same flat, white wall without the pattern applied thereto. Using the physical image stream from a forward facing camera system of the HMD for pose tracking may reduce the need for multiple camera systems. However, in many AR applications, applicant has determined that the resulting physical image stream is typically likely to contain fewer and less distinct salient features. For example, the ground and many floor surfaces typically exhibit greater feature richness than walls and ceilings, or distant landscapes and skyscapes. If the user is predominantly standing while viewing an AR environment by an HMD, then a forward facing of the HMD typically captures predominantly the walls or landscape situated before the user.

[0021] A further use for the physical image stream is to recognize any captured gestures. The processor may be configured to identify such gestures by comparing the physical image stream to a gesture library accessible by the processor. Again, a potential source for that image stream is the forward facing camera system. However, applicant has determined that users required to perform gestures within the field of view of a forward facing camera system are prone to fatigue. For example, a user may be required to hold his hands or arms in front of his chest when using gestures as inputs. An outcome of this is that users tend to drop their hands or arms over time, leading to user discomfort and reduced gesture recognition accuracy.

[0022] Further, when rendering the AR environment, it may be preferable for the processor to render at least the entire AR environment within the user's field of view while wearing the HMD. Typically that may comprise incorporating information from those cameras which are generally aligned with the user's natural field of view, namely, the cameras of the forward facing camera system. However, in various AR applications it may be desirable to map features of the physical environment even though they do not lie within the FOV of the forward facing camera system of the HMD.

[0023] In a multi-user AR environment, for example, it may be desirable for a first user to "see" a second user, even though that second user may be situated behind a wall obscuring the first user's line of sight of the second user. In that case, the AR environment displayed to the first user may comprise an avatar, or virtual rendition, of the second user. Rendering of an avatar at least partially reflecting the second user's real-time appearance requires depth information for the second user's body which may not be captured by the second user's forward facing camera system, nor by the first user's forward facing camera system (due to obstruction by the wall). By capturing events which are peripheral to the FOV of the forward facing camera system, the second user's real-time appearance may be at least partly reflected in a virtual rendition provided to the first user,

[0024] Similarly, it may desirable to reduce any lag between events taking place in the physical environment and their rendering into and display within the AR environment. As a user moves throughout the physical environment, different events or features from the physical environment are perceivable within the FOV of the forward facing camera system. From one moment to the next, an event or feature which is perceivable to the forward facing camera system may become peripheral as the pose of the system's FOV changes; similarly, events which are peripheral become perceivable to the system. As the HMD moves along with its user's head, the processor renders a frame of the AR image stream based on the pose of the HMD, and the colour and depth information at a point in time. However, during the rendering, the HMD's pose may undergo a change that is sufficiently large for the user to perceive that the displayed frame of the AR image stream does not reflect the user's actual pose. By capturing events which are peripheral to the FOV of the forward facing camera system, the processor may also incorporate those peripheral events into its rendering of the AR environment. Immediately prior to issuing the subject frame of the AR image stream to the display, the processor may re-ascertain the instantaneous pose of the HMD and select the corresponding region of the AR image stream for display. The apparent lag may be sufficiently reduced so as to be unperceivable to the user. This is referred to herein as "peripheral rendering".

[0025] In at least one embodiment, then, an HMD comprises at least one downward facing camera system to capture a downward image stream of the physical environment. In further embodiments, the at least one downward facing camera system is wearable by the user separately from the HMD, such as, for example, on a strap attachable to the user's chest, head or shoulders, while being communicatively coupled to a processor of the HMD. In either case, the at least one downward facing camera system is communicatively coupled to a processor situated onboard or remotely from the HMD, the at least one processor being configured to generate an AR image stream to generate a virtual map which models the physical environment using the depth or visual information in the downward image stream. The processor generates the map as a depth map, such as, for example, a point cloud (i.e., in which the points correspond to the obtained depth information for the physical environment), which may further comprise visual information for each point.

[0026] The singular term "processor" is used herein, but it will be appreciated that the processor may be distributed amongst the components occupying the physical environment, within the physical environment or in a server in network communication with a network accessible from the physical environment. For example, the processor may be distributed between one or more HMDs and a console located within the physical environment, or over the Internet via a network accessible from the physical environment.

[0027] Referring now to Fig. 1 , an embodiment of an HMD 12 wearable by a user is shown. The HMD is shown shaped as a helmet; however, other shapes and configurations are contemplated. The HMD 12 comprises: (i) a substantially forward facing camera system 123 disposed to the front of the HMD 12 to capture a forward image stream of the physical environment; (ii) a substantially downward facing camera system 128 disposed to the front and bottom of the HMD 12 to capture a downward image stream of the physical environment; (iii) a processor 130 communicatively coupled to the forward facing camera system 123 and the downward facing camera system 128 and disposed upon the HMD 12, the processor 130 being configured to render an AR image stream based on information from the forward image stream and the downward image stream; and (iv) a display system 121 disposed in front of the user's line of sight when worn and communicatively coupled to the processor 130 to display the AR image stream from the processor 130. When wearing and using the HMD 12, its user experiences the displayed AR image stream as an AR environment.

[0028] The HMD 12 further comprises a power management system 1 13 for distributing power to the components of the HMD 12. The power management system 1 13 may itself comprise a power source, such as, for example, a battery, or it may be electrically coupled to a power source located elsewhere onboard or remotely from the HMD 12, such as, for example, a battery pack disposed upon the user or located within the physical environment, through a wired connection to the HMD 12. The power management system 1 13 may be embodied as a module distinct from the processor 130, or it may be integral to, or contiguous with, the processor 130.

[0029] If haptic feedback and audio output are desired, the HMD 12 may further comprise a haptic feedback system and an audio system, respectively. The haptic feedback system comprises a haptic feedback device 120 communicatively coupled to the processor 130 to provide haptic feedback to the user when actuated, and the audio system comprises one or more speakers 124 communicatively coupled to the processor to provide audio feedback when actuated. The HMD 12 may further comprise a wireless communication system 126 having, for example, antennae, to communicate with other components in an AR and/or VR system, such as, for example, other HMDs, a gaming console, a router, or at least one peripheral device 13 to enhance user engagement with the AR and/or VR. Alternatively, the HMD 12 may comprise a wired connection to the other components.

[0030] The downward facing camera system 128 captures regions of the physical environment which are peripheral to the forward facing camera system 123. For example, the downward facing camera system 128 may face completely downward or it may face downward at 45 degrees to horizontal relative to the physical space, or any other generally downward angle that can permit the field of view of the downward camera system 128 to capture a region of the physical environment lying substantially below the frontal portion of the HMD 12. The downward facing camera system 128 may capture numerous features generally lying below the front of the HMD 12, such as, for example, features of the floor or ground of the physical environment, and the body and limbs of the user.

[0031] The preferred orientation for the downward facing camera system 128 depends on the desired use or uses for the resulting downward image stream, as well as the FOV of the downward facing camera system 128. Typically, at least a portion of the FOV should intersect a location on the ground or floor lying directly below the downward facing camera system 128 when the HMD 12 is worn in a standing position by a large range of users; if the downward image stream is desired for mapping of the user's body, then the field of view should capture any regions of the user's body which are desired to be mapped and, preferably, across a range of likely orientations resulting from the user's head movements relative to the rest of the user's body; if the downward image stream is desired for peripheral rendering, then the FOV should at least partially intersect the FOV of the forward facing camera system 123 on the ground or floor when the HMD 12 is worn by a user, as further described below with reference to Fig. 5; and, if the downward image stream is desired for pose tracking, then any downward orientation which primarily captures the ground below the user is suitable. It will be appreciated that the FOV and angle of the downward facing camera system may be selected to enable multiple uses for the downward image stream therefrom. Further, the angle and FOV of the forward facing camera system 123 may also be selected in conjunction with the FOV and angle of the downward facing camera system 128 to enable one or more uses for the downward image stream, as further described below with reference to Fig. 5.

[0032] The embodiments of the HMD 12 shown in Figs. 1 and 2A each comprise a single downward facing camera system 128 that is disposed to the front of the HMD 12 in front of the user's face when worn. In at least another embodiment, the HMD 12 comprises a plurality of spaced apart downward facing camera systems 128 in order to capture a larger combined field of view of the physical environment surrounding the user, as shown in Fig. 2B. The four downward facing camera systems 128 shown in Fig. 2B are disposed about the HMD 12 at substantially 90° increments and preferably arranged so that each of the downward facing camera systems 128 is disposed substantially orthogonally to its neighbouring downward facing camera systems 128. The illustrated downward facing camera systems 128 are selected and positioned to capture a region of the physical environment below the HMD 12 that extends 360 degrees around the HMD 12, as well as its user when standing upright. Other suitable configurations depend on the desired use for the downward image streams from the downward facing camera systems 128. For example, a 360 degree view of the user's body lying below the HMD 12 is preferable if the downward image streams are desired for rendering an avatar of the user. It will be appreciated that a 360 view may be provided by four, more than four or fewer than four downward facing camera systems 128 if both edges of the respective field of view of each downward facing camera system 128 intersects a field of view of a neighbouring downward facing camera system 128. For example, as shown, each of the four downward facing camera systems 128 has a field of view that is sufficiently wide to intersect with the field of view of each of its two neighbouring downward facing camera systems 128. The field of view of each of the downward facing camera systems 128 is illustrated in Figs. 2A and 2B by shaded regions coaxial with each downward facing camera system 128. Each shaded region extends pyramid-wise and generally downward from each downward facing camera system 128 and captures a region of the user's body. In embodiments, at least one downward facing camera system 128 is mounted to the HMD 12 by an elongate armature or other suitable member to retain the downward facing camera system separated from the HMD 12. The separation may be preferred when the downward facing camera system 128 is to be used for mapping the user's body, since the captured region grows larger as the separation increases.

[0033] In embodiments of the HMD 12 where the combined field of view of the at least one downward facing camera system 128 is less than 360 degrees about the user or HMD 12, a 360 degree view of the physical space may still be achieved. In use, the user of the HMD 12 may be urged or prompted to pivot, such as, for example, by the processor 130 displaying a prompt on the display of the HMD 12. As the user pivots, the downward facing camera system 128 captures a plurality of frames of a downward image stream. The processor 130 acquires the downward image stream and implements a stitching technique to align the depth information along the rotation. For example, the processor may identify features common to subsequent frames in the downward image stream and allocate coordinates in the virtual map based on the already allocated coordinates for the same feature in the previous frame. In various applications, however, a 360 degree field of view about a user or HMD may not be required to achieve the desired use for the downward image stream.

[0034] Each of the one or more downward facing camera systems 128 of the HMD 12 comprises at least one camera to capture a downward image stream. The selection of the at least one camera in the downward facing camera system 128 depends on the desired use for the resulting downward image stream. For example, a depth camera may be preferable to an image camera where pose tracking, height detection and modelling are desired. A depth camera of the downward facing camera system 128 may implement any suitable technology for providing depth information, such as, for example, structured light, time of flight (TOF), visible light, or infrared (IR) depth sensing. More specifically, a structured light camera is typically preferable to a time of flight (TOF) camera for pose tracking because a structured light camera is typically more robust than an equivalent TOF camera across a wider variety of surface types. In contrast, a TOF camera is typically preferable when a higher resolution depth map is required, such as, for example, for peripheral rendering or modelling of the user's anatomy, since a TOF camera typically provides clearer feature mapping but lower depth resolution than an SL camera.

[0035] When the downward image stream is desired to be used for displaying physical images of the captured region of the physical environment in the AR image stream, then an image camera is preferred. The image camera may be any suitable image capture device operable to capture visual information from the physical environment in digital format, such as, for example, a colour camera or video camera. If the physical images are to be displayed alongside virtual elements as a 3D AR image stream, then a stereo image camera is further preferred. Further, if depth information is to be derived from the image stream of the image camera, then the image camera is preferably a stereo or multi camera. The image stream from a single mono image camera may be virtualized to simulate a stereo camera, but any depth information derived therefrom is in relative terms unless another observation of a physical environment dimension is provided to the processor. In contrast, the depth information derivable from a stereo or multi camera image stream is sufficient to derive the aforementioned observation. It will be appreciated that the downward facing camera system 128 may comprise a combination of depth and image cameras to exploit preferred properties of each. For example, while a stereo camera may provide high quality depth information for captured features which exhibit relatively high visual contrast, a TOF camera may provide higher quality depth information for lower contrast features.

[0036] Fig. 3 illustrates the field of view of an exemplary embodiment of the downward facing camera system 128 taken from an elevational perspective of a user equipped with an HMD 12. The illustrated embodiment comprises a stereo camera. The dashed lines emanating from the downward facing camera system 128 denote its FOV. The stereo camera comprises at least two spaced apart image sensors 128a to measure distances from the downward facing camera system 128 to an obstacle, such as, for example, the ground 301 and obstacle 303, as shown.

[0037] In use, a processor (whether the processor 130 of the HMD 12, or a processor of the downward camera system 128) communicatively coupled to the sensors 128a identifies features, such as, for example, the illustrated obstacle 303, that are common to the image streams from the sensors 128. Since the sensors 128a are spaced apart, each captures a different perspective of any object or feature within both sensors' fields of view, such as, for example, the illustrated obstacle 303. The processor derives depth information from the image streams based on the disparity in perspectives, as well as the parameters (also referred to as "specifications") of the sensors 128a. The processor may retrieve the sensor parameters from any available source, whether as preconfigured parameters from a memory accessible by the processor, or as variable parameters provided as an output from the downward facing camera system 128. The processor preferably periodically or continuously retrieves the parameters if they are variable during use of the HMD 12.

[0038] Referring again to Figs. 1 and 2, each of the forward facing camera system 123 and the downward facing camera system 128 provides an image stream to the processor 130 of the HMD 12. Each image stream may comprise visual information from the captured region of the physical environment, such as, for example, colour (or "RGB") information or greyscale information. The processor 130 may further derive depth information from the visual information, as further described herein; however, the image stream may further or alternatively comprise depth information for the captured region of the physical environment so that the depth information is already derived prior to reaching the processor 130.

[0039] If the downward facing camera system 128 comprises an infrared ("IR") depth camera, it may provide depth information for features of the physical environment based on the time of flight ("TOF") of an I R beam from the depth camera 128 to points of the features and back to the depth camera 128. In one scenario, for example, if a user moves from a standing position to a crouching position, the TOF of the IR beam will decrease between both positions. The processor 130 may map the changes in height of the user or HMD 12 based on changes in the TOF of the IR beams of the at least one downward facing camera 128.

[0040] Much like the downward facing camera system 128, the forward facing camera system 123 may comprise one camera or more than one camera and/or more than one type of camera, such as, for example, a mono or stereo image camera and a depth camera. The processor 130 may base mapping of the physical environment primarily on the forward image stream, while additionally incorporating depth and/or visual information available from the downward image stream. If visual information from the physical environment is to be incorporated into the rendered AR image stream, then the forward facing camera system 123 must comprise an image camera to capture the visual information; otherwise, the forward facing camera system 123 may comprise a depth camera with or without an image camera. In alternate embodiments of the HMD, the HMD comprises a LIDAR or other scanner to capture the depth information. In further alternate embodiments, the HMD may comprise additional camera systems facing outwardly from the sides and/or rear of the HMD to increase the effective FOV of the forward facing camera system 123. [0041] In use, the HMD 12 is worn by a user situated in a physical environment, such as, for example, a room of a building, or an outdoor area. Within its FOV, the forward facing camera system 123 captures a forward image stream of the physical environment before the user. The processor 130 obtains the forward image stream to generate a virtual map which models the physical environment. The virtual map may comprise a depth map, such as, for example, a point cloud, and the depth map may further comprise visual information for the points. Visual information may comprise RGB or greyscale values, or other suitable visual information.

[0042] The processor 130 uses depth information from the image stream to generate the depth map. If the image stream comprises depth information, the processor 130 may directly generate the depth map from the image stream. Alternatively, if the image stream solely comprises visual information, then the processor 130 may derive depth information from the visual information according to a suitable technique, such as, for example, by the method 400 illustrated in Fig. 4 and further described below. If configured to do so, the processor 130 generates the visual map from any visual information available in the image stream from the forward facing camera system 123. Substantially concurrently, the processor 130 supplements the virtual map with depth and/or visual information from the image stream captured by the at least one downward facing camera system. For example, the processor 130 may supplement the map with any captured regions of the user's body lying below the HMD.

[0043] In addition to mapping the captured regions of the physical environment to the virtual map, the processor 130 renders virtual elements which it situates within the virtual map for an AR image stream. The virtual elements may at least partially conform to features within the physical environment, or they may be entirely independent of such features.

[0044] In embodiments, the processor 130 renders the AR image stream from the point of view of a notional or virtual camera system situated in the virtual map. The notional camera system "captures" within its FOV a region of the virtual map, including any visual information from any visual map and any virtual elements. The AR image stream includes selected or all visible elements captured with the FOV of the notional camera. The processor may further render the AR image stream to add shading, textures or other details. The processor then provides the AR image stream to the display system 121 of the HMD 12 for display to the user.

[0045] As previously described, immersion is enhanced when the AR image stream substantially tracks the user's real time pose. Therefore, the processor 130 preferably tracks the real time pose of the HMD 12 relative to the physical environment, and applies substantially the equivalent pose to the notional camera relative to the virtual map. The resulting AR image stream thereby substantially reflects the user's actual pose within the physical environment. It is therefore desirable for the processor to accurately and efficiently track the pose of the HMD 12. In embodiments, pose tracking of the HMD 12 comprises camera based pose tracking instead of, or in addition to, other types of pose tracking, such as, for example, magnetic pose tracking, inertial measurement (IMU) based or GPS pose tracking. While the processor 130 may perform image based pose tracking solely from the forward image stream, pose tracking from the downward image stream is preferred in many instances.

[0046] As described above, image based pose tracking quality improves along with an increase in feature richness of the available image stream. Lower lying surfaces often exhibit greater feature richness than background and overhead surfaces. Further, in large indoor and outdoor spaces, the ground or floor is seldom likely located more than 1.5-2 metres from the highest point on an HMD 12, while walls, trees or other features before the user are frequently likely located more than 1.5-2 metres from the nearest point on the same HMD 12. Still further, human motion throughout a physical environment typically exhibits greater range and fluctuation across the ground or floor than toward and away from the ground or floor. Correspondingly, image based pose tracking may be more robust when based on images of the ground than on images of the physical environment before a standing user.

[0047] Referring now to Fig. 4, embodiments of a method 400 are shown for tracking the pose of an HMD from a downward image stream captured by a stereo camera of the downward facing camera system 128 of the HMD 12. The stereo camera may be any suitable stereo camera, such as, for example, the stereo camera having left and right sensors 128a in the downward camera system 128 illustrated in Fig. 3. Although the singular term "image stream" is used herein, it will be appreciated that a camera system, including a stereo camera, may capture more than one image stream. With reference to Fig. 4, the image stream will be referred to as a "stereo" image stream.

[0048] The stereo camera is pre-calibrated to correct lens-related distortions in each sensor, as well as the rigid body transformation (i.e., translation and rotation) between the sensors. The stereo camera is calibrated by a suitable non-linear optimization applied to a stereo image stream captured by the stereo camera from a test field. The calibration procedure results in extrinsic and intrinsic camera parameters which may be stored in a library accessible to the processor of an HMD.

[0049] The stereo image stream 431 illustrated in Fig. 4 is captured by right and left sensors of a stereo camera; however, other sensor configurations are contemplated, such as, for example, top and bottom sensor or front and back or other configurations. The stereo image stream 431 comprises two parallel sequences of frames captured by the left and right sensors over a plurality of epochs (specific moment when each frame is captured). During use, at block 401 , the processor 130 of an HMD receives, and detects salient features within, the stereo image stream 431. The processor may detect all features in the stereo image stream, or the processor may reduce processing by detecting only relatively distinctive features, or only those features lying within a region of interest in the stereo image stream.

[0050] At block 403, the processor accesses a descriptor library 433 and describes the detected features by suitable terms from the descriptor library. Since speed and precision are typically desired in pose detection, the processor preferably employs an efficient descriptor library and salient feature identification method, such as, for example, an ORB descriptor library and FAST feature detection, respectively.

[0051] At block 405, the processor matches the identified features between frames captured at the same capture epochs and across two or more capture epochs, thereby generating a list of matched features. For a given frame in either the right or left sequence of the stereo image stream, the processor searches for matching identified features in the other sequence at the same epoch, as well as in the left and right sequences at the previous epoch. If the same feature is present in both sequences at the instant or previous epoch, and at least one sequence at the other epoch, the same feature is considered a "common" feature.

[0052] At block 407, the processor generates a plurality of feature tracks by using the common features to link the matched features. A feature track is a projection of an object point to the frames of the stereo image stream using the coordinate system by which the processor maps the physical environment. For example, the processor may use the origin of the stereo camera at the initial capture epoch as the origin of the physical environment coordinate system. For subsequently captured frames (i.e., frames captured during subsequent capture epochs) the processor determines the pose of the HMD at block 409. For example, to estimate the pose of the HMD between consecutive epochs f, and t_i+1, the processor first assumes that the pose pose, at I, equals the pose pose_i+1 at t_i+1. This assumption may apply whenever the frame rate of the stereo camera is sufficiently high relative to the velocity of the HMD; however, if the frame rate is not sufficiently high, the assumption may be invalid. In those cases, the processor may alternatively implement Perspective-Three-Point ("P3P") pose estimation conjugated with random sample consensus ("RANSAC") to estimate the pose pose_i+1 at t_i+1. The processor then derives a refined pose pose_i+1 at t_i+1 by applying stereo bundle adjustment. [0053] The processor applies the stereo bundle adjustment by intersecting the feature tracks to generate 3D points using the rigid body transformation from the stereo camera calibration stage. The bundle adjustment minimizes the projection error of the 3D points. The processor may use the resulting refined poses to perform odometry, environment mapping, and rendering of the AR image stream, including any virtual elements therein. For example, at block 411 the processor may further generate a dense depth map of the physical environment based on the stereo image stream 431. At each capture epoch, an epoch-specific dense depth map may be derived independently using the rigid body transformation between the sensors. The processor registers the pose of each dense depth map from the pose determination performed at block 407. The processor combines the plurality of epoch-specific dense depth maps across all captured epochs to generate a combined dense depth map.

[0054] Although Fig. 4 illustrates the method 400 using a stereo camera, the method may be applied to a multi camera, i.e., a camera having more than two spaced apart sensors. Subject to suitable modification, the method 400 may further use a mono camera, i.e., a camera having a single image sensor. During motion of the HMD, an image stream from a mono camera comprises a plurality of frames captured at different epochs. The processor may virtualize a stereo camera by determining the rigid body transformation of the mono camera between any two suitable epochs during the motion. The resulting transformation will be defined in relative terms, so that any pose tracking using the rigid body transformation will result in relative outputs. However, by obtaining an observation of the captured region of the physical environment, the processor may use the observation to solve the resulting relative values. For example, the user may be prompted to enter an absolute dimension between captured features.

[0055] The method 400 illustrated in Fig. 4 uses visual information from the stereo image stream. However, as previously described, the downward facing camera system 128 may comprise a depth camera. The processor 130 may perform pose tracking using depth information from the downward image stream. In one embodiment of a pose tracking method, the processor 130 receives a downward image stream from the depth camera and identifies salient structural features in a plurality of frames of the downward image stream. The processor calculates the change in pose between subsequent frames of the downward image stream by determining the transformation that is required to align the identified salient features back to their original pose in a previous frame.

[0056] Referring now to Fig. 5, a side elevational view is shown of a user 501 wearing an HMD 12 comprising a forward facing camera system 123 and a downward facing camera system 128. The dashed rays emanating from each of the camera systems indicates their respective fields of view. The fields of view are positioned to at least partially overlap at the ground 503 in front of the user. In one aspect, the overlap enables the processor 130 to derive scaled depth information for the camera systems by treating a mono image camera in each camera system as one sensor of a stereo or multi camera. If each of the frontward facing camera system 123 and the downward facing camera system 123 comprises a mono camera, the processor 130 may implement the pose tracking and 3D mapping described above with reference to Fig. 4 by using salient features from the region of overlap, such as, for example, the obstacle 505.

[0057] Regardless of the camera configuration in each camera system, the overlap shown in Fog. 5 may enable the previously described peripheral rendering. Due to the overlap, the processor 130 may stitch the downward and forward image streams into an extended image stream spanning both by using salient features in both image streams at each epoch to align the depth and/or visual information in each image stream to the other.

[0058] In further aspects, the downward facing camera system 128 enables the processor 130 to determine the real-time height h of the HMD 12 by measuring the vertical distance from the downward facing camera system 128 to the ground 503. Rather than relying on features situated above or before a user within the physical environment to derive the user's height during pose tracking, the height tracking enabled by the downward facing camera system may be more direct and robust, as well as less susceptible to cumulative errors during use.

[0059] In still further aspects, the downward facing camera system 128 illustrated in Fig. 5 provides a downward image stream from which the processor 130 may track the user's body. For gesture tracking of the user's hands 502, the fields of view of both the forward facing camera system 123 and the downward facing camera system 128 preferably at least partially overlap within the region of the physical environment where the user is most likely to make hand 502 gestures. In order to track the user's gestures, the processor 130 analyzes the image streams from both camera systems to identify the user's body and its parts, such as, for example, the hands 502. Upon segmenting skin from the image streams, the processor may 130 compare identified skin regions with human body parts expressed in a library. For example, the processor may identify a user's finger within the image streams by segmenting any skin elements and comparing the elements against parameters for a finger, which parameters may be stored in a memory. Alternatively, if the HMD's downward facing camera system 128 comprises a depth camera, the processor 130 may identify body parts with reference to probabilistic expressions defined in a library accessible by the processor 130, without first segmenting the user's skin. The identification of body parts may be based upon dimensions and/or structures of the parts, rather than identification of pixels corresponding to skin pigmentation.

[0060] Upon identifying the user's body and body parts, the processor 130 may apply inverse kinematics to the body parts to determine whether the body part's state or change of state corresponds to a gesture command. Suitable inverse kinematics procedures may comprise OpenNI, OpenCV and other techniques used with alternative vision systems (e.g., Microsoft™ Kinect™). The processor 130 may map the user's body parts in real-time and/or register the identified gesture as a user input to the AR environment. Since the user's body parts are tracked by the downward facing camera system 128, the user may more comfortably perform detected gestures. Further, camera based body and gesture tracking enables the processor 130 to track the user's body parts without requiring the user to wear tracking devices (except for the HMD 12, which may track the user's head), such as magnetic tracking devices or IMUs, on the tracked body parts.

[0061] In yet further aspects, the downward facing camera system enables the processor 130 to map interactions between a relatively large region of the user's body and the physical environment. As shown in Figs. 2 and 5, for example, the FOV of the downward facing camera system 128 may capture at least some of the user's torso and lower body. From time to time, the processor 130 may display the interaction within the AR image stream displayed on the display 121. The processor 130 may further actuate the haptic feedback device 120 in response to detected interactions, such as, for example, the user's foot hitting an obstacle. Still further, the downward facing camera system 128 may capture another user's body interacting with the body of the user wearing the HMD 12. For example, in a combat-type AR environment, any hits by the other user may be captured by the downward facing camera system 128 for detection and identification by the processor 130. The processor 130 register any hit to actuate the haptic feedback device 120 or to effect another outcome.

[0062] In still further aspects, the downward facing camera system 128 may enable the processor 130 to map at least some of the user's body for rendering thereof as an avatar. For example, the processor 130 may render the user's body as a combination of physical and virtual features by using the depth and, optionally, visual information from the downward image stream. The rendering may be displayed to the user within the AR image stream. Alternatively or further, if the HMD 12 and other users' HMDs are equipped for communication between each other, the depth information and/or rendering of the user's body may be shared with other users' HMDs for incorporation into their respective AR image streams. In an exemplary scenario, if a first user is standing behind a wall in a room so that a second user's forward facing camera system cannot capture depth information for the first user's body, then the second user may view an avatar representing the first user by using the shared depth information captured by the downward facing camera system 128 of the first user's HMD 12. In another exemplary scenario, the second user may view virtual features representing the first user's footsteps across the ground based on the capture of that interaction by the downward facing camera system 128 of the first user's HMD 12.

[0063] Although the foregoing has been described with reference to certain specific embodiments, various modifications thereto will be apparent to those skilled in the art without departing from the spirit and scope of the invention as outlined in the appended claims. The entire disclosures of all references recited above are incorporated herein by reference.

Claims

1. An augmented reality (AR) head mounted device (HMD) wearable by a user in a

physical environment, the HMD comprising:

at least one downward facing camera system having a field of view facing downward relative to the user's natural field of view to capture a downward image stream of a downward region of the physical environment; and

a processor communicatively coupled to the at least one downward facing camera

system, configured to obtain the downward image stream to obtain or derive depth information for the downward region of the physical environment.

2. The HMD of claim 1 , wherein the processor is further configured to perform pose

tracking of the HMD using the downward image stream.

3. The HMD of claim 2, wherein at least one of the downward facing camera systems

comprises a stereo camera having a first image sensor and a second image sensor spaced apart from the first image sensor to capture two parallel sequences of frames in the downward image stream.

4. The HMD of claim 3, wherein the processor is configured to perform the pose tracking by accessing camera parameters for the stereo camera and performing stereo bundle adjustment using the camera parameters.

5. The HMD of claim 2, wherein each downward facing camera system comprises a multi camera having a first lens, a second lens spaced apart from the first lens, and at least a third lens spaced apart from the first lens and the second lens to capture at least three parallel sequences of frames in the downward image stream.

6. The HMD of claim 2, wherein at least one of the downward facing camera systems

comprises a depth camera.

7. The HMD of claim 1 , wherein the processor is configured to use the depth information to track the height of the downward facing camera system relative to a region of the ground within the downward region of the physical environment.

8. The HMD of claim 7, wherein at least one of the downward facing camera systems

comprises a depth camera.

9. The HMD of claim 1 , wherein at least one of the downward facing camera systems is disposed to capture a region of the user's body for gesture recognition, and the processor is further configured to recognize at least one gesture from the downward image stream and register the at least one gesture as an input to modify parameters of an AR environment.

10. The HMD of claim 9, wherein the downward image stream comprises visual information and the processor is configured to recognize the at least one gesture by: detecting at least one region of the user's skin in each frame of the downward image stream using the visual information; associating each region of the user's skin to a body part of the user; determining at least one change of state of each body part between frames captured at two or more epochs; and applying inverse kinematics to the at least one change of state to identify each gesture.

1 1. The HMD of claim 9, wherein the processor is configured to recognize the at least one gesture by: identifying at least one body part of the user within each frame of the downward image stream by comparison of the depth information with a library of probabilistic expressions of human body parts; determining changes of state of the at least one body past between frames captured at two or more epochs; and applying inverse kinematics to the at least one change of state to identify each gesture.

12. The HMD of claim 1 , wherein the processor is further configured to map at least one region of the user's body from the depth information using the downward image stream.

13. The HMD of claim 12, wherein the processor is further configured to communicate the map of the at least one region to another HMD.

14. The HMD of any of claims 1 to 13, further comprising a forward facing camera system having a field of view facing forward relative to and aligned with the user's natural field of view to capture a forward image stream of a forward region of the physical environment, wherein the processor is further configured to obtain the forward image stream to obtain or derive depth information for the forward region of the physical environment.

15. The HMD of claim 14, wherein a field of view of the forward facing camera system is at least in part peripheral to a field of view of the at least one downward facing camera system, the processor further configured to: generate a virtual map of the physical environment from the forward image stream; generate an extended image stream from the forward image stream and the downward image stream; supplement the virtual map with the extended image stream; and render features of the physical environment peripheral to the field of view of the forward facing camera system.

16. An augmented reality (AR) head mounted device (HMD) wearable by a user in a

physical environment, the HMD comprising a processor communicatively coupled to at least one downward facing camera system having a field of view facing downward relative to the user's natural field of view to capture a downward image stream of a downward region of the physical environment, the processor configured to obtain the downward image stream to obtain or derive depth information for the downward region of the physical environment.

17. The HMD of claim 16, wherein the processor is further configured to perform pose

tracking of the HMD using the downward image stream.

18. The HMD of claim 17, wherein at least one of the downward facing camera systems comprises a stereo camera having a first image sensor and a second image sensor spaced apart from the first image sensor to capture two parallel sequences of frames in the downward image stream.

19. The HMD of claim 18, wherein the processor is configured to perform the pose tracking by accessing camera parameters for the stereo camera and performing stereo bundle adjustment using the camera parameters.

20. The HMD of claim 17, wherein each downward facing camera system comprises a multi camera having a first lens, a second lens spaced apart from the first lens, and at least a third lens spaced apart from the first lens and the second lens to capture at least three parallel sequences of frames in the downward image stream.

21. The HMD of claim 17, wherein at least one of the downward facing camera systems comprises a depth camera.

22. The HMD of claim 16, wherein the processor is configured to use the depth information to track the height of the downward facing camera system relative to a region of the ground within the downward region of the physical environment.

23. The HMD of claim 22, wherein at least one of the downward facing camera systems comprises a depth camera.

24. The HMD of claim 16, wherein at least one of the downward facing camera systems is disposed to capture a region of the user's body for gesture recognition, and the processor is further configured to recognize at least one gesture from the downward image stream and register the at least one gesture as an input to modify parameters of an AR environment.

25. The HMD of claim 24, wherein the downward image stream comprises visual information and the processor is configured to recognize the at least one gesture by: detecting at least one region of the user's skin in each frame of the downward image stream using the visual information; associating each region of the user's skin to a body part of the user; determining at least one change of state of each body part between frames captured at two or more epochs; and applying inverse kinematics to the at least one change of state to identify each gesture.

26. The HMD of claim 24, wherein the processor is configured to recognize the at least one gesture by: identifying at least one body part of the user within each frame of the downward image stream by comparison of the depth information with a library of probabilistic expressions of human body parts; determining changes of state of the at least one body past between frames captured at two or more epochs; and applying inverse kinematics to the at least one change of state to identify each gesture.

27. The HMD of claim 16, wherein the processor is further configured to map at least one region of the user's body from the depth information using the downward image stream.

28. The HMD of claim 27, wherein the processor is further configured to communicate the map of the at least one region to another HMD.

29. The HMD of any of claims 16 to 28, further comprising a forward facing camera system having a field of view facing forward relative to and aligned with the user's natural field of view to capture a forward image stream of a forward region of the physical environment, wherein the processor is further configured to obtain the forward image stream to obtain or derive depth information for the forward region of the physical environment.

30. The HMD of claim 29, wherein a field of view of the forward facing camera system is at least in part peripheral to a field of view of the at least one downward facing camera system, the processor further configured to: generate a virtual map of the physical environment from the forward image stream; generate an extended image stream from the forward image stream and the downward image stream; supplement the virtual map with the extended image stream; and render features of the physical environment peripheral to the field of view of the forward facing camera system.

31. A method of obtaining depth information for a downward region of a physical

environment using an augmented reality (AR) head mounted device (HMD) wearable by a user in a physical environment, the method comprising: receiving, from at least one downward facing camera system having a field of view facing downward relative to the user's natural field of view, a downward image stream of a downward region of the physical environment; and obtaining, by a processor communicatively coupled to the HMD, the downward image stream to obtain or derive depth information for the downward region of the physical environment.

32. The method of claim 31 , wherein the processor is further configured to perform pose tracking of the HMD using the downward image stream.

33. The method of claim 32, wherein at least one of the downward facing camera systems comprises a stereo camera having a first image sensor and a second image sensor spaced apart from the first image sensor to capture two parallel sequences of frames in the downward image stream.

34. The method of claim 33, wherein the processor is configured to perform the pose

tracking by accessing camera parameters for the stereo camera and performing stereo bundle adjustment using the camera parameters.

35. The method of claim 32, wherein each downward facing camera system comprises a multi camera having a first lens, a second lens spaced apart from the first lens, and at least a third lens spaced apart from the first lens and the second lens to capture at least three parallel sequences of frames in the downward image stream.

36. The method of claim 32, wherein at least one of the downward facing camera systems comprises a depth camera.

37. The method of claim 31 , wherein the processor is configured to use the depth

information to track the height of the downward facing camera system relative to a region of the ground within the downward region of the physical environment.

38. The method of claim 37, wherein at least one of the downward facing camera systems comprises a depth camera.

39. The method of claim 31 , wherein at least one of the downward facing camera systems is disposed to capture a region of the user's body for gesture recognition, and the processor is further configured to recognize at least one gesture from the downward image stream and register the at least one gesture as an input to modify parameters of an AR environment.

40. The method of claim 39, wherein the downward image stream comprises visual information and the processor is configured to recognize the at least one gesture by: detecting at least one region of the user's skin in each frame of the downward image stream using the visual information; associating each region of the user's skin to a body part of the user; determining at least one change of state of each body part between frames captured at two or more epochs; and applying inverse kinematics to the at least one change of state to identify each gesture.

41. The method of claim 39, wherein the processor is configured to recognize the at least one gesture by: identifying at least one body part of the user within each frame of the downward image stream by comparison of the depth information with a library of probabilistic expressions of human body parts; determining changes of state of the at least one body past between frames captured at two or more epochs; and applying inverse kinematics to the at least one change of state to identify each gesture.

42. The method of claim 31 , wherein the processor is further configured to map at least one region of the user's body from the depth information using the downward image stream.

43. The method of claim 42, wherein the processor is further configured to communicate the map of the at least one region to another HMD.

44. The method of any of claims 31 to 43, wherein the processor is further configured to obtain from a forward facing camera system having a field of view facing forward relative to and aligned with the user's natural field of view, a forward image stream of a forward region of the physical environment, wherein the processor is further configured to obtain the forward image stream to obtain or derive depth information for the forward region of the physical environment.

45. The method of claim 44, wherein a field of view of the forward facing camera system is at least in part peripheral to a field of view of the at least one downward facing camera system, the processor further configured to: generate a virtual map of the physical environment from the forward image stream; generate an extended image stream from the forward image stream and the

downward image stream; supplement the virtual map with the extended image stream; and render features of the physical environment peripheral to the field of view of the forward facing camera system.