CN113397526A - Human body height estimation - Google Patents

Human body height estimation

Info

Publication number
CN113397526A
Authority
CN
China
Prior art keywords
location
determining
data
head
floor
Prior art date
Legal status
Pending
Application number
CN202110282036.3A
Other languages
Chinese (zh)
Inventor
A·达维格
G·J·希托
T·R·皮斯
Current Assignee
Apple Inc
Original Assignee
Apple Inc
Priority date
Filing date
Publication date
Priority claimed from US17/182,344 (US11763477B2)
Application filed by Apple Inc
Publication of CN113397526A

Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/103 Detecting, measuring or recording devices for testing the shape, pattern, colour, size or movement of the body or parts thereof, for diagnostic purposes
    • A61B5/107 Measuring physical dimensions, e.g. size of the entire body or parts thereof
    • A61B5/1072 Measuring physical dimensions, e.g. size of the entire body or parts thereof measuring distances on the body, e.g. measuring length, height or thickness
    • A61B5/1079 Measuring physical dimensions, e.g. size of the entire body or parts thereof using optical or photographic means

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Engineering & Computer Science (AREA)
  • Dentistry (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Veterinary Medicine (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to human body height estimation. Various implementations disclosed herein include apparatuses, systems, and methods that determine a distance between a portion of a human head (e.g., the top) and the underlying floor as an estimate of the height of the human body. For example, an exemplary process may include determining a first location on a head of a human body in a three-dimensional (3D) coordinate system, the first location determined based on detecting features on the head from a two-dimensional (2D) image of the human body in a physical environment; determining a second location on the floor below the first location in the 3D coordinate system; and estimating a height based on determining a distance between the first location and the second location.

Description

Human body height estimation
Cross Reference to Related Applications
This application claims the benefit of U.S. provisional application serial No. 62/990,589, filed on March 17, 2020, which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates generally to the field of mobile devices, and in particular to systems, methods, and devices for determining height measurements of a person or object based on information detected in a physical environment.
Background
Detecting a human body in a physical environment and accurately measuring its height can play an important role in generating accurate reconstructions and in providing height information to a user on a mobile device. There are many obstacles to providing a computer-based system that automatically and accurately generates body height measurements based on sensor data. The sensor data (e.g., light intensity images) acquired about the physical environment may be incomplete or insufficient to provide accurate measurements. As another example, images often lack depth and semantic information, and measurements generated without such data may lack accuracy. The prior art does not allow for automatic, accurate, and efficient generation of body height measurements using mobile devices, e.g., based on a user capturing a photograph or video or other sensor data. Furthermore, the prior art may not provide sufficiently accurate and efficient measurement results in a real-time environment (e.g., immediate measurement during scanning).
Disclosure of Invention
Various implementations disclosed herein include apparatuses, systems, and methods that determine a distance between a portion of a human head (e.g., the top) and the underlying floor as an estimate of the height of the human body. The portion of the human head (e.g., the top) is determined based on detecting features on the head (e.g., the face) from a two-dimensional (2D) image. For example, this may involve identifying a portion of a mask (e.g., a semantic segmentation mask) based on the location of the human face. The depth data may be used to identify a head boundary in the mask even if the mask does not indicate a boundary, for example, if another person stands behind the user's head. The depth data may be used to determine a three-dimensional (3D) position of the floor even if the portion of the floor located below the top of the head is not in the image data. It may be desirable to display the height of the human body in a live camera view. It may be desirable to determine the height of the human body and store it, for example, in metadata associated with the image. This information can then be used to track the height of the person as the person grows (e.g., to track and record the growth of a child over time).
In some implementations, segmentation is used to facilitate excluding objects from the model. For example, segmentation techniques (e.g., machine learning models) may be used to generate segmentation masks that identify particular types of objects (e.g., humans, animals, etc.). The segmentation mask may indicate pixel locations in the corresponding RGB image using a value of 0 or a value of 1 to indicate whether the pixel locations correspond to objects to be included in the data or excluded from the data. Segmentation techniques may be used to semantically label objects within a physical environment using a confidence level. The measurement data may then be determined based on a mask of the image (e.g., a segmentation mask that identifies an object, such as a human body, determined to be included in the data).
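As a rough illustration of how such a 0/1 mask might be consumed downstream, the following Python-style sketch (not taken from the patent; it assumes NumPy arrays with a value of 1 marking pixels that belong to a person) collects the pixel coordinates that later measurement steps could operate on.

    # Minimal sketch, assuming a NumPy 0/1 segmentation mask where 1 marks
    # pixels that belong to a detected person (an assumption, not the
    # patent's data format).
    import numpy as np

    def person_pixels(mask: np.ndarray) -> np.ndarray:
        """Return the (row, col) coordinates of all pixels labeled 1."""
        return np.argwhere(mask == 1)

    # Toy 4x4 mask with a 2x2 "person" region in the lower-left corner.
    mask = np.zeros((4, 4), dtype=np.uint8)
    mask[2:4, 0:2] = 1
    print(person_pixels(mask))  # [[2 0] [2 1] [3 0] [3 1]]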
Some implementations of the present disclosure relate to an exemplary method of determining a distance between a portion of a head of a human body and the underlying floor as an estimate of the height of the human body. The example method involves, at a device having a processor, determining a first location (e.g., the top) on a head of a human body in a 3D coordinate system, the first location determined based on detecting features (e.g., a face) on the head from a 2D image of the human body in a physical environment. In some implementations, this may involve acquiring a mask that identifies 2D coordinates corresponding to one or more persons. For example, semantic segmentation masks based on 2D RGB and/or depth images may be utilized. This may involve detecting a feature on the head of the human body and identifying the first location as a location within the mask based on the feature. For example, the center of the face may be found and the search moved upward to find the top of the head within the mask.
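A minimal sketch of the "find the center of the face and move up" idea follows, assuming a NumPy 0/1 person mask indexed as (row, column) with row 0 at the top of the image; the function and its name are illustrative and are not taken from the patent.

    import numpy as np

    def top_of_head(mask: np.ndarray, face_center: tuple) -> tuple:
        """Scan upward from the face center within the person mask and return
        the highest person pixel (smallest row index) in that column."""
        row, col = face_center
        top_row = row
        while top_row - 1 >= 0 and mask[top_row - 1, col] == 1:
            top_row -= 1
        return (top_row, col)

The resulting 2D pixel can then be lifted into the 3D coordinate system using the corresponding depth value, as discussed below.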
The example method also involves determining a second location on the floor below the first location in the 3D coordinate system. In some implementations, this may involve determining floor level using RGB data and/or depth sensor data. Additionally or alternatively, the floor plane may be determined using machine learning to classify the plane as a floor (e.g., classifying each data point based on detected height). In some implementations, the floor plane can be determined over time using a 3D model that is updated using data from the individual instances.
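One possible way to approximate the floor-plane step is sketched below, assuming the depth data has already been converted to 3D points with a vertical z axis; the height-band heuristic and least-squares fit are illustrative stand-ins for whatever plane classification (e.g., a machine learning classifier) an implementation actually uses.

    import numpy as np

    def fit_floor_plane(points: np.ndarray, band: float = 0.05):
        """points: (N, 3) array of 3D points, z up (assumed convention).
        Points within `band` meters of the lowest observed z are treated as
        floor candidates, and a plane z = a*x + b*y + c is least-squares fit
        to them."""
        z_min = points[:, 2].min()
        floor = points[points[:, 2] <= z_min + band]
        A = np.c_[floor[:, 0], floor[:, 1], np.ones(len(floor))]
        (a, b, c), *_ = np.linalg.lstsq(A, floor[:, 2], rcond=None)
        return a, b, c  # floor plane: z = a*x + b*y + c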
The example method also involves estimating a height based on determining a distance between the first location and the second location. For example, a first location on the top of the head and a second location on the floor may each have corresponding 3D coordinates, which the system may use to determine the height. In some implementations, estimating the height of the human body may involve using multiple height estimates acquired over time, discarding outliers, or other means of averaging several data points of measurement data acquired over time.
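A minimal sketch of that distance computation, assuming both locations are already expressed as 3D coordinates in the same coordinate system (the names and example values are illustrative):

    import numpy as np

    def estimate_height(head_xyz, floor_xyz) -> float:
        """Euclidean distance between the top-of-head point and the floor
        point below it; with a vertical axis this reduces to the difference
        of the vertical components."""
        return float(np.linalg.norm(np.asarray(head_xyz) - np.asarray(floor_xyz)))

    print(estimate_height((0.1, 0.2, 1.78), (0.1, 0.2, 0.0)))  # ~1.78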
In some implementations, determining the first location includes acquiring a mask that identifies portions of the image that correspond to the one or more persons; detecting features on the head of the human body (or each human body if multiple persons); and identifying a first location within a portion of the mask corresponding to the one or more persons based on the detected one or more features on the head of each respective person identified. For example, semantic segmentation masks based on 2D RGB images, depth images, or a combination of both RGB and depth images may be utilized. In some implementations, identifying a first location within a portion of a mask can include finding a center of a face and moving upward to find a top of a head within the mask.
In some implementations, the feature detected on the head is a face. For example, a "face" may be defined as a portion of the head that includes one or more (but not necessarily all) of the eyes, nose, mouth, cheeks, ears, chin, and the like. In some implementations, identifying the first location within the portion of the mask includes determining a center of the face and an orientation of the face; and identifying a portion of the mask that is in a direction from the center of the face (e.g., in a direction from the center of the nose up to the center of the forehead), wherein the direction is based on the orientation. In some implementations, identifying the first location within the portion of the mask also includes generating a bounding box corresponding to the face (e.g., a 2D box for the face, which 2D box can be projected to various points on the head in 3D space). For example, the bounding box may provide the position, pose (e.g., orientation and position), and shape of a detected feature of the human body (e.g., the face of the human body). In some implementations, the bounding box is a 3D bounding box. Alternatively, the bounding box is a 2D box, which can be projected to various points on the head in 3D space as a 3D bounding box.
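Projecting a 2D face or head point into 3D space, as described above, can be done with standard pinhole-camera back-projection when per-pixel depth and camera intrinsics are available. The sketch below shows that conventional formula; the intrinsics values in the example are illustrative and are not taken from the patent.

    import numpy as np

    def unproject(u: float, v: float, depth: float,
                  fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
        """Standard pinhole back-projection of pixel (u, v) with depth in
        meters into camera-space 3D coordinates."""
        x = (u - cx) / fx * depth
        y = (v - cy) / fy * depth
        return np.array([x, y, depth])

    # Illustrative intrinsics only.
    print(unproject(640, 120, 2.5, fx=1500.0, fy=1500.0, cx=960.0, cy=540.0))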
In some implementations, identifying the first location within the portion of the mask includes identifying a boundary of the head based on data from the depth sensor. For example, identifying the boundaries of a human head may include determining the sides (e.g., the ears), bottom (e.g., the chin), and top (e.g., the top of the forehead or the topmost position on the head) of the head. In some implementations, the body may be non-upright or the head may be tilted, and the system will use the topmost point of the body for height estimation. For example, if the head of a human body is tilted sideways, such as resting on a shoulder, an ear may be the topmost point used for height estimation. In some implementations, the system can determine that the person is non-upright or that his or her head is tilted, and account for this by making adjustments based on that determination when calculating the overall height of the person. In some implementations, the height estimation is based on a measurement from the top of the head of the human body to the floor beneath it. Alternatively, if a person is lifting their hands above their head, the system may determine an estimate of the height from the top of their hands to the floor plane or to another determined plane. In some implementations, the height estimation may automatically determine whether a human body in the image is extending both arms outward and desires an automatic calculation of its arm span. The height estimation system may then detect features of each hand to determine an estimated arm span of the identified human body.
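For the non-upright or tilted-head case described above, a simple stand-in is to take the topmost pixel of the person mask as the upper measurement point, under the same NumPy mask convention assumed in the earlier sketches:

    import numpy as np

    def topmost_person_point(mask: np.ndarray) -> tuple:
        """Return the (row, col) of the highest person pixel (smallest row
        index), used as the upper point when the body is non-upright or the
        head is tilted."""
        rows, cols = np.nonzero(mask == 1)
        i = int(np.argmin(rows))
        return int(rows[i]), int(cols[i])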
In some implementations, determining the second location on the floor includes determining a floor plane based on image data acquired from a camera on the device. For example, this may involve determining floor level using RGB data and/or depth sensor data. In some implementations, the image data includes depth data, the depth data includes a plurality of depth points, and determining the floor plane based on the image data includes classifying a portion of the depth points as the floor plane based on the determined height of each of the depth points; and determining a second position on the floor based on the determined height of each of the depth points corresponding to the floor plane. For example, floor planes may be determined using machine learning to classify the determined planes as floors.
In some implementations, the image data includes a plurality of image frames acquired over a period of time, and determining the floor plane based on the image data includes generating a 3D model based on the image data including the floor plane, and iteratively updating the 3D model for each acquired frame of the plurality of image frames. For example, during a live video transmission, multiple images are acquired over time, and the human height estimation system described herein may iteratively update the height estimates for one or more frames of the video transmission to determine an accurate estimate of the height of the human body. Each iteration should provide a more accurate height determination.
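The per-frame refinement could be as simple as a running (exponential moving) average of frame-by-frame height estimates; the class below is a hedged sketch of that idea and is not the patented update scheme.

    class RunningHeightEstimate:
        """Exponential moving average of per-frame height estimates; a simple
        stand-in for iterative refinement over a live camera transmission."""

        def __init__(self, alpha: float = 0.2):
            self.alpha = alpha   # weight given to the newest frame
            self.value = None

        def update(self, frame_estimate: float) -> float:
            if self.value is None:
                self.value = frame_estimate
            else:
                self.value = self.alpha * frame_estimate + (1 - self.alpha) * self.value
            return self.value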
In some implementations, the image data includes a picture of the entirety of the human body from head to foot, where the feet are in contact with the floor in the image data. For example, a full body shot of the entire body is shown. Alternatively, the height estimation system may determine the height estimate by using the top of the head and determining and extending the floor plane, without having to see the entire human body. For example, from certain perspectives (e.g., from an overhead view, such as from a drone, from a person looking down from a second floor, or from a person standing on a ladder, etc.), the person whose height is to be calculated may not be entirely in the image. The human body may be shown only from the waist up even though it is standing, while the floor is visible behind the human body in the image data. The height estimation system described herein may still determine the z-height position of the 3D coordinates of the floor as the second location; determine the z-height position of the 3D coordinates of the top of the head; and determine an estimated height of the human body based on the z-height difference between the two locations.
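When the floor point directly below the head is not visible, the extended floor plane can still be evaluated under the head's (x, y) position. The sketch below computes the z-height difference described above, reusing the z = a*x + b*y + c plane parameterization assumed in the earlier floor-fitting sketch.

    def height_above_plane(head_xyz, plane_abc) -> float:
        """Vertical distance from the head point to the extended floor plane
        z = a*x + b*y + c (same parameterization as the floor-fitting sketch)."""
        a, b, c = plane_abc
        x, y, z = head_xyz
        floor_z = a * x + b * y + c
        return z - floor_z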
In some implementations, estimating the height based on determining the distance between the first location and the second location includes determining a plurality of distance estimates between the first location and the second location; identifying a subset of the plurality of distance estimates using an averaging technique (e.g., calculating an average, removing outliers); and determining the distance between the first location and the second location based on the subset of distance estimates.
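One concrete averaging technique of this kind is a median-absolute-deviation outlier filter followed by a mean; the specific statistic below is an assumption, since the text only calls for removing outliers and averaging.

    import numpy as np

    def robust_height(estimates, k: float = 2.0) -> float:
        """Discard estimates more than k median-absolute-deviations from the
        median, then average the remainder."""
        e = np.asarray(estimates, dtype=float)
        med = np.median(e)
        mad = np.median(np.abs(e - med)) or 1e-6  # guard against zero spread
        kept = e[np.abs(e - med) <= k * mad]
        return float(kept.mean())

    print(robust_height([1.78, 1.79, 1.77, 2.40, 1.78]))  # the 2.40 outlier is dropped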
In some implementations, the person is a first person and the 2D image includes the first person and a second person, wherein determining the first location on the head of the first person is based on determining that the first person is closer to a center of the 2D image than the second person is to the center of the 2D image. For example, multiple persons may be included in the image data, and the body height estimation system described herein may infer which body's height to calculate based on the center of the camera's reticle or the center of the image on which the user focuses the device's camera. In some implementations, the body height estimation system described herein can estimate the height of each detected body in the image data.
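A hedged sketch of the "closest to the image center" selection, assuming each detected person is represented by a 2D face-center pixel; the helper name and inputs are illustrative.

    import numpy as np

    def pick_primary_person(face_centers, image_size) -> int:
        """Given 2D face centers (u, v) for each detected person and the
        image (width, height), return the index of the person whose face
        center is closest to the image center."""
        w, h = image_size
        center = np.array([w / 2.0, h / 2.0])
        dists = [np.linalg.norm(np.asarray(c) - center) for c in face_centers]
        return int(np.argmin(dists))

    print(pick_primary_person([(300, 500), (960, 540)], (1920, 1080)))  # 1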
According to some implementations, an apparatus includes one or more processors, non-transitory memory, and one or more programs; the one or more programs are stored in a non-transitory memory and configured to be executed by one or more processors, and the one or more programs include instructions for performing, or causing the performance of, any of the methods described herein. According to some implementations, a non-transitory computer-readable storage medium has stored therein instructions that, when executed by one or more processors of a device, cause the device to perform, or cause to be performed any of the methods described herein. According to some implementations, an apparatus includes: one or more processors, non-transitory memory, and means for performing or causing performance of any of the methods described herein.
Drawings
Accordingly, the present disclosure may be understood by those of ordinary skill in the art and a more particular description may be had by reference to certain illustrative embodiments, some of which are illustrated in the accompanying drawings.
FIG. 1 is a block diagram of an exemplary operating environment in accordance with some implementations.
Fig. 2 is a block diagram of an example server, according to some implementations.
Fig. 3 is a block diagram of an example device according to some implementations.
Fig. 4 is a flow diagram representation of an exemplary method for determining a distance between a portion of a head of a human body and an underlying floor as an estimate of a height of the human body, according to some implementations.
Fig. 5A-5F are block diagrams illustrating an exemplary system flow for determining a distance between a portion of a head of a human body and an underlying floor as an estimate of the height of the human body, according to some implementations.
In accordance with common practice, the various features shown in the drawings may not be drawn to scale. Accordingly, the dimensions of the various features may be arbitrarily expanded or reduced for clarity. Additionally, some of the figures may not depict all of the components of a given system, method, or apparatus. Finally, throughout the specification and drawings, like reference numerals may be used to refer to like features.
Detailed Description
Numerous details are described in order to provide a thorough understanding of example implementations shown in the drawings. The drawings, however, illustrate only some example aspects of the disclosure and therefore should not be considered limiting. It will be understood by those of ordinary skill in the art that other effective aspects and/or variations do not include all of the specific details described herein. Moreover, well-known systems, methods, components, devices, and circuits have not been described in detail so as not to obscure more pertinent aspects of the example implementations described herein.
FIG. 1 is a block diagram of an exemplary operating environment 100 according to some implementations. In this example, exemplary operating environment 100 illustrates an exemplary physical environment 105 including a chair 140, a table 142, and a door 150. Additionally, the exemplary operating environment 100 includes a human body (e.g., user 102) holding a device (e.g., device 120), a human body (e.g., object/human body 104) sitting on a chair 140, and a human body (e.g., object/human body 106) standing in the physical environment 105. While relevant features are shown, those of ordinary skill in the art will recognize from the present disclosure that various other features are not shown for the sake of brevity and so as not to obscure more pertinent aspects of the exemplary implementations disclosed herein. To this end, as a non-limiting example, operating environment 100 includes server 110 and device 120. In an exemplary implementation, operating environment 100 does not include server 110, and the methods described herein are performed on device 120.
In some implementations, the server 110 is configured to manage and coordinate user experiences. In some implementations, the server 110 includes a suitable combination of software, firmware, and/or hardware. The server 110 is described in more detail below with reference to fig. 2. In some implementations, the server 110 is a computing device located locally or remotely with respect to the physical environment 105. In one example, server 110 is a local server located within physical environment 105. In another example, the server 110 is a remote server (e.g., a cloud server, a central server, etc.) located outside of the physical environment 105. In some implementations, server 110 is communicatively coupled with device 120 via one or more wired or wireless communication channels (e.g., bluetooth, IEEE802.11x, IEEE 802.16x, IEEE 802.3x, etc.).
In some implementations, the device 120 is configured to present the environment to the user. In some implementations, the device 120 includes a suitable combination of software, firmware, and/or hardware. The device 120 is described in more detail below with reference to fig. 3. In some implementations, the functionality of the server 110 is provided by and/or integrated with the device 120.
In some implementations, the device 120 is a handheld electronic device (e.g., a smartphone or tablet) configured to present content to a user. In some implementations, the user 102 wears the device 120 on his or her head. Device 120 may include one or more displays provided for displaying content. For example, the device 120 may encompass the field of view of the user 102. In some implementations, the device 120 is replaced with a chamber, enclosure, or room configured to present content, where the user 102 does not wear or hold the device 120.
Fig. 2 is a block diagram of an example of a server 110 according to some implementations. While some specific features are shown, those skilled in the art will appreciate from the present disclosure that various other features are not shown for the sake of brevity and so as not to obscure more pertinent aspects of the particular implementations disclosed herein. To this end, and by way of non-limiting example, in some implementations, the server 110 includes one or more processing units 202 (e.g., microprocessors, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), Central Processing Units (CPUs), processing cores, etc.), one or more input/output (I/O) devices 206, one or more communication interfaces 208 (e.g., Universal Serial Bus (USB), FIREWIRE, THUNDERBOLT, IEEE 802.3x, IEEE802.11x, IEEE 802.16x, global system for mobile communications (GSM), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Global Positioning System (GPS), Infrared (IR), bluetooth, ZIGBEE, and/or similar types of interfaces), one or more programming (e.g., I/O) interfaces 210, memory 220, and one or more communication buses 204 for interconnecting these components and various other components.
In some implementations, the one or more communication buses 204 include circuitry to interconnect system components and control communications between system components. In some implementations, the one or more I/O devices 206 include at least one of a keyboard, a mouse, a trackpad, a joystick, one or more microphones, one or more speakers, one or more image sensors, one or more displays, and the like.
The memory 220 includes high speed random access memory such as Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), double data rate random access memory (DDR RAM), or other random access solid state memory devices. In some implementations, the memory 220 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. Memory 220 optionally includes one or more storage devices located remotely from the one or more processing units 202. Memory 220 includes a non-transitory computer-readable storage medium. In some implementations, the memory 220 or a non-transitory computer-readable storage medium of the memory 220 stores the following programs, modules, and data structures, or a subset thereof, including an optional operating system 230 and one or more application programs 240.
Operating system 230 includes processes for handling various basic system services and for performing hardware related tasks. In some implementations, the application 240 is configured to manage and coordinate one or more experiences of one or more users (e.g., a single experience of one or more users, or multiple experiences of a respective group of one or more users).
The application 240 includes a mask data unit 242, a floor unit 244, a feature detection unit 246, and a measurement unit 248. The mask data unit 242, the floor unit 244, the feature detection unit 246, and the measurement unit 248 may be combined into a single application or unit or divided into one or more additional applications or units.
The mask data unit 242 is configured with instructions executable by the processor to acquire image data (e.g., light intensity data, depth data, camera position information, etc.) and generate a mask (e.g., a segmentation mask) that identifies a portion of an image associated with an object (e.g., a human body) based on the image data using one or more of the techniques disclosed herein. For example, the mask data unit 242 collects light intensity data (e.g., RGB data 503) from the light intensity camera 502 and depth data 505 from the depth camera 504, and generates segmentation masks for all identified persons (e.g., identifies all locations of identified persons for three-dimensional (3D) data to provide to the feature detection unit 246 and the measurement unit 248). In some implementations, the mask data unit 242 collects segmentation data (e.g., RGB-S) from the segmentation unit and generates a segmentation mask based on the segmentation data. Alternatively, the mask data unit 242 generates the segmentation mask directly based on the acquired depth data (e.g., a depth camera on the device 120).
The floor unit 244 is configured with instructions executable by the processor to acquire image data (e.g., light intensity data, depth data, camera position information, etc.) and generate floor data of the physical environment using one or more of the techniques disclosed herein. For example, the floor unit 244 captures a series of light intensity images (e.g., RGB data) from a light intensity camera (e.g., live camera transmission) and depth data from a depth camera and generates floor data (e.g., identifies floor planes or nadirs on the floor with respect to one or more persons identified in the image data) to provide to the measurement unit 248. In some implementations, the floor unit 244 determines the floor plane based on the segmentation data identifying the floor. Alternatively, the floor unit 244 determines the floor plane based on a machine learning unit trained to identify floor points in the depth data; and classifying the floor points based on the identified lowest point in the 3D coordinate system. In some implementations, the determined floor plane is iteratively updated during live camera transfers using one or more of the techniques disclosed herein to improve accuracy.
The feature detection unit 246 is configured with instructions executable by the processor to acquire image data (e.g., light intensity data, depth data, camera position information, etc.) and mask data (e.g., a segmentation mask) that identifies a portion of an image associated with an object (e.g., a human body), and to determine a location (e.g., the top) on the head of the human body based on the image data and the mask data using one or more of the techniques disclosed herein. For example, the feature detection unit 246 acquires a series of light intensity images (e.g., RGB data) from a light intensity camera (e.g., live camera transmission), depth data from a depth camera, and mask data from the mask data unit 242, and determines a location (e.g., 3D coordinates) on the head of the human body based on the image data and the mask data using one or more of the techniques disclosed herein to provide to the measurement unit 248. In some implementations, the feature detection unit 246 determines the top of the head of the human by detecting one or more features of the human head (such as a face), finding the center of the face, and moving upward to find the top of the head within the mask. For example, a "face" may be defined as a portion of the head that includes one or more (but not necessarily all) of the eyes, nose, mouth, cheeks, ears, chin, and the like. In some implementations, identifying the first location within the portion of the mask includes determining a center of the face and an orientation of the face; and identifying a portion of the mask that is in a direction from the center of the face (e.g., in a direction from the center of the nose up to the center of the forehead), wherein the direction is based on the orientation. In some implementations, identifying the first location within the portion of the mask further includes generating a bounding box corresponding to the face. For example, the bounding box may provide the position, pose (e.g., orientation and position), and shape of a detected feature of the human body (e.g., the face of the human body). In some implementations, the bounding box is a 3D bounding box. Alternatively, the bounding box is a 2D box, which can be projected to various points on the head in 3D space as a 3D bounding box. In some implementations, the feature detection unit 246 determines the top of the head based on viewing at least some portion of the front of the face. Alternatively, the feature detection unit 246 determines the top of the head based on viewing only the back of the head and based on features of the head (e.g., ears) that are viewable from the back of the head.
The measurement unit 248 is configured with instructions executable by the processor to collect image data (e.g., light intensity data, depth data, camera position information, etc.), mask data, floor data, and feature detection data, and determine a distance between a portion of the head of a human body and the underlying floor as an estimate of the height of the human body using one or more of the techniques disclosed herein. For example, the measurement unit 248 acquires a series of light intensity images (e.g., RGB data) from a light intensity camera (e.g., live camera transmission) and depth data from a depth camera, acquires mask data from the mask data unit 242, acquires floor data from the floor unit 244, acquires feature detection data (e.g., head top data) from the feature detection unit 246, and determines an estimated height of the human body identified by the mask data. For example, a first location on the top of the head and a second location on the floor may each have corresponding 3D coordinates, and the system may determine the height or distance between these two coordinates in a particular direction and axis of the coordinate system. In some implementations, estimating the height of the human body may involve using multiple distance estimates acquired over time, discarding outliers, or other means for averaging several data points of measurement data acquired over time.
Although these elements are shown as residing on a single device (e.g., server 110), it should be understood that in other implementations, any combination of elements may reside in separate computing devices. Moreover, FIG. 2 serves more as a functional description of the various features present in a particular implementation, as opposed to the structural schematic of the implementations described herein. As one of ordinary skill in the art will recognize, the items displayed separately may be combined, and some items may be separated. For example, some of the functional blocks shown separately in fig. 2 may be implemented in a single module, and various functions of a single functional block may be implemented in various implementations by one or more functional blocks. The actual number of modules and the division of particular functions and how features are allocated therein will vary depending upon the particular implementation and, in some implementations, will depend in part on the particular combination of hardware, software, and/or firmware selected for a particular implementation.
Fig. 3 is a block diagram of an example of a device 120 according to some implementations. While some specific features are shown, those skilled in the art will appreciate from the present disclosure that various other features are not shown for the sake of brevity and so as not to obscure more pertinent aspects of the particular implementations disclosed herein. To this end, and by way of non-limiting example, in some implementations device 120 includes one or more processing units 302 (e.g., microprocessors, ASICs, FPGAs, GPUs, CPUs, processing cores, etc.), one or more input/output (I/O) devices and sensors 306, one or more communication interfaces 308 (e.g., USB, FIREWIRE, THUNDERBOLT, IEEE 802.3x, IEEE802.11x, IEEE 802.16x, GSM, CDMA, TDMA, GPS, IR, BLUETOOTH, ZIGBEE, SPI, I2C, and/or similar types of interfaces), one or more programming (e.g., I/O) interfaces 310, one or more AR/VR displays 312, one or more internally and/or externally facing image sensors 314, memory 320, and one or more communication buses 304 for interconnecting these and various other components.
In some implementations, the one or more communication buses 304 include circuitry to interconnect and control communications between system components. In some implementations, the one or more I/O devices and sensors 306 include at least one of: an Inertial Measurement Unit (IMU), an accelerometer, a magnetometer, a gyroscope, a thermometer, one or more physiological sensors (e.g., a blood pressure monitor, a heart rate monitor, a blood oxygen sensor, a blood glucose sensor, etc.), one or more microphones, one or more speakers, a haptic engine, one or more depth sensors (e.g., structured light, time of flight, etc.), and the like.
In some implementations, the one or more displays 312 are configured to present an experience to the user. In some implementations, the one or more displays 312 correspond to holographic, Digital Light Processing (DLP), Liquid Crystal Displays (LCD), liquid crystal on silicon (LCoS), organic light emitting field effect transistors (OLET), Organic Light Emitting Diodes (OLED), surface-conduction electron emitter displays (SED), Field Emission Displays (FED), quantum dot light emitting diodes (QD-LED), micro-electro-mechanical systems (MEMS), and/or similar display types. In some implementations, the one or more displays 312 correspond to diffractive, reflective, polarizing, holographic, etc. waveguide displays. For example, the device 120 includes a single display. As another example, device 120 includes a display for each eye of the user.
In some implementations, the one or more image sensor systems 314 are configured to acquire image data corresponding to at least a portion of the physical environment 105. For example, the one or more image sensor systems 314 include one or more RGB cameras (e.g., with Complementary Metal Oxide Semiconductor (CMOS) image sensors or Charge Coupled Device (CCD) image sensors), monochrome cameras, infrared cameras, event-based cameras, and so forth. In various implementations, the one or more image sensor systems 314 also include an illumination source, such as a flash, that emits light. In various implementations, the one or more image sensor systems 314 also include an on-camera Image Signal Processor (ISP) configured to perform a plurality of processing operations on the image data, the plurality of processing operations including at least a portion of the processes and techniques described herein.
The memory 320 comprises high speed random access memory such as DRAM, SRAM, DDR RAM or other random access solid state memory devices. In some implementations, the memory 320 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. Memory 320 optionally includes one or more storage devices located remotely from the one or more processing units 302. The memory 320 includes a non-transitory computer-readable storage medium. In some implementations, the memory 320 or a non-transitory computer-readable storage medium of the memory 320 stores the following programs, modules, and data structures, or a subset thereof, including an optional operating system 330 and one or more application programs 340.
Operating system 330 includes processes for handling various basic system services and for performing hardware related tasks. In some implementations, the application 340 is configured to manage and coordinate one or more experiences of one or more users (e.g., a single experience of one or more users, or multiple experiences of a respective group of one or more users).
The application 340 includes a mask data unit 342, a floor unit 344, a feature detection unit 346, and a measurement unit 348. The mask data unit 342, the floor unit 344, the feature detection unit 346, and the measurement unit 348 may be combined into a single application or unit or divided into one or more additional applications or units.
The mask data unit 342 is configured with instructions executable by the processor to acquire image data (e.g., light intensity data, depth data, camera position information, etc.) and generate a mask (e.g., a segmentation mask) that identifies a portion of an image associated with an object (e.g., a human body) based on the image data using one or more of the techniques disclosed herein. For example, the mask data unit 342 collects light intensity data (e.g., RGB data) from a light intensity camera and depth data from a depth camera and generates segmentation masks for all identified persons (e.g., identifies all locations of identified persons for 3D data to provide to the feature detection unit 346 and the measurement unit 348). In some implementations, the mask data unit 342 collects segmentation data (e.g., RGB-S) from the segmentation unit and generates a segmentation mask based on the segmentation data. Alternatively, the mask data unit 342 generates the segmentation mask directly based on the acquired depth data (e.g., a depth camera on the device 120).
The floor unit 344 is configured with instructions executable by the processor to collect image data (e.g., light intensity data, depth data, camera position information, etc.) and generate floor data of the physical environment using one or more of the techniques disclosed herein. For example, the floor unit 344 acquires a series of light intensity images (e.g., RGB data) from a light intensity camera (e.g., live camera transmission) and depth data from a depth camera and generates floor data (e.g., identifies floor planes or nadirs on the floor with respect to one or more persons identified in the image data) to provide to the measurement unit 348. In some implementations, the floor unit 344 determines the floor plane based on segmentation data that identifies the floor. Alternatively, the floor unit 344 determines the floor plane based on a machine learning unit trained to identify floor points in the depth data; and classifying the floor points based on the identified lowest point in the 3D coordinate system. In some implementations, the determined floor plane is iteratively updated during live camera transfers using one or more of the techniques disclosed herein to improve accuracy.
The feature detection unit 346 is configured with instructions executable by the processor to acquire image data (e.g., light intensity data, depth data, camera position information, etc.) and mask data (e.g., a segmentation mask) that identifies a portion of an image associated with an object (e.g., a human body), and to determine a location (e.g., the top) on the head of the human body based on the image data and the mask data using one or more of the techniques disclosed herein. For example, the feature detection unit 346 acquires a series of light intensity images (e.g., RGB data) from a light intensity camera (e.g., live camera transmission), depth data from a depth camera, and mask data from the mask data unit 342, and determines a location (e.g., 3D coordinates) on the head of the human body based on the image data and the mask data using one or more of the techniques disclosed herein to provide to the measurement unit 348. In some implementations, the feature detection unit 346 determines the top of the head of the human by detecting one or more features of the human head (such as a face), finding the center of the face, and moving upward to find the top of the head within the mask. For example, a "face" may be defined as a portion of the head that includes one or more (but not necessarily all) of the eyes, nose, mouth, cheeks, ears, chin, and the like. In some implementations, identifying the first location within the portion of the mask includes determining a center of the face and an orientation of the face; and identifying a portion of the mask that is in a direction from the center of the face (e.g., in a direction from the center of the nose up to the center of the forehead), wherein the direction is based on the orientation. In some implementations, identifying the first location within the portion of the mask further includes generating a 3D bounding box corresponding to the face. For example, the 3D bounding box may provide the position, pose (e.g., orientation and position), and shape of the detected feature of the human body (e.g., the face of the human body). In some implementations, the feature detection unit 346 determines the top of the head based on viewing at least some portion of the front of the face. Alternatively, the feature detection unit 346 determines the top of the head based on viewing only the back of the head and based on features of the head (e.g., ears) that are viewable from the back of the head.
The measurement unit 348 is configured with instructions executable by the processor to collect image data (e.g., light intensity data, depth data, camera position information, etc.), mask data, floor data, and feature detection data, and determine a distance between a portion of the head of a human body and the underlying floor as an estimate of the height of the human body using one or more of the techniques disclosed herein. For example, the measurement unit 348 acquires a series of light intensity images (e.g., RGB data) from a light intensity camera (e.g., live camera transmission) and depth data from a depth camera, acquires mask data from the mask data unit 342, acquires floor data from the floor unit 344, acquires feature detection data (e.g., head top data) from the feature detection unit 346, and determines an estimated height of a human body identified by the mask data. For example, a first location on the top of the head and a second location on the floor may each have corresponding 3D coordinates, and the system may determine the height or distance between these two coordinates in a particular direction and axis of the coordinate system. In some implementations, estimating the height of the human body may involve using multiple distance estimates acquired over time, discarding outliers, or other means for averaging several data points of measurement data acquired over time.
Although these elements are shown as residing on a single device (e.g., device 120), it should be understood that in other implementations, any combination of elements may reside in separate computing devices. Moreover, FIG. 3 serves more as a functional description of the various features present in a particular implementation, as opposed to the structural schematic of the implementations described herein. As one of ordinary skill in the art will recognize, the items displayed separately may be combined, and some items may be separated. For example, some of the functional blocks shown separately in fig. 3 (e.g., applications 340) may be implemented in a single module, and various functions of a single functional block may be implemented in various implementations by one or more functional blocks. The actual number of modules and the division of particular functions and how features are allocated therein will vary depending upon the particular implementation and, in some implementations, will depend in part on the particular combination of hardware, software, and/or firmware selected for a particular implementation.
Fig. 4 is a flow diagram representation of an example method 400 that determines a distance between a portion of a head (e.g., a top) of a human body and an underlying floor as an estimate of the height of the human body, according to some implementations. In some implementations, the method 400 is performed by a device (e.g., the server 110 or the device 120 of fig. 1-3), such as a mobile device, a desktop computer, a laptop computer, or a server device. Method 400 may be performed on a device (e.g., device 120 of fig. 1 and 3) having a screen for displaying images and/or a screen for viewing stereoscopic images, such as a Head Mounted Display (HMD). In some implementations, the method 400 is performed by processing logic (including hardware, firmware, software, or a combination thereof). In some implementations, the method 400 is performed by a processor executing code stored in a non-transitory computer-readable medium (e.g., memory). The human body height estimation process of the method 400 is illustrated with reference to fig. 5A to 5F.
At block 402, the method 400 determines a first location (e.g., the top) on a head of a human body in a 3D coordinate system based on detecting features (e.g., a face) on the head from a two-dimensional (2D) image of the human body in a physical environment. In some implementations, this may involve acquiring a mask that identifies 2D coordinates corresponding to one or more persons. For example, semantic segmentation masks based on 2D RGB and/or depth images may be utilized. A feature on the head of the human body is detected, and the first location is identified as a location within the mask based on the feature. For example, the center of the face may be found and the search moved upward to find the top of the head within the mask.
In some implementations, determining the first location includes acquiring a mask that identifies a portion of the image corresponding to one or more persons; detecting a feature on the head; and identifying a first location within a portion of the mask corresponding to the one or more persons based on the features on the head. For example, semantic segmentation masks based on 2D RGB images, depth images, or a combination of both RGB and depth images may be utilized. In some implementations, identifying a first location within a portion of a mask can include finding a center of a face and moving upward to find a top of a head within the mask.
In some implementations, the feature detected on the head is a face. For example, a "face" may be defined as a portion of the head that includes one or more (but not necessarily all) of the eyes, nose, mouth, cheeks, ears, chin, and the like. In some implementations, identifying the first location within the portion of the mask includes determining a center of the face and an orientation of the face; and identifying a portion of the mask that is in a direction from the center of the face (e.g., in a direction from the center of the nose up to the center of the forehead), wherein the direction is based on the orientation. In some implementations, identifying the first location within the portion of the mask further includes generating a bounding box corresponding to the face. For example, the bounding box may provide the position, pose (e.g., orientation and position), and shape of a detected feature of the human body (e.g., the face of the human body). In some implementations, the bounding box is a 3D bounding box. Alternatively, the bounding box is a 2D box, which can be projected to various points on the head in 3D space as a 3D bounding box. In some implementations, the process 400 determines the top of the head based on viewing at least some portion of the front of the face. Additionally, in some implementations, the process 400 determines the top of the head based on viewing only the back of the head and based on features of the head (e.g., ears) that are viewable from the back of the head.
In some implementations, identifying the first location within the portion of the mask includes identifying a boundary of the head based on data from the depth sensor. For example, identifying the boundaries of a human head may include determining the sides (e.g., the ears), bottom (e.g., the chin), and top (e.g., the top of the forehead or the topmost position on the head) of the head. In some implementations, the body may be non-upright or the head may be tilted, and the system will use the topmost point of the body for height estimation. For example, if the head of a human body is tilted sideways, such as resting on a shoulder, an ear may be the topmost point used for height estimation. In some implementations, the system can determine that the person is non-upright or that his or her head is tilted, and account for this by making adjustments based on that determination when calculating the overall height of the person. In some implementations, the height estimation is based on a measurement from the top of the head of the human body to the floor beneath it. Alternatively, if a person is lifting their hands above their head, the system may determine an estimate of the height from the top of their hands to the floor plane or to another determined plane. In some implementations, the height estimation may automatically determine whether a human body in the image is extending both arms outward and desires an automatic calculation of its arm span. The height estimation system may then detect features of each hand to determine an estimated arm span of the identified human body.
At block 404, the method 400 determines a second location on the floor below the first location in the 3D coordinate system. In some implementations, this may involve determining floor level using RGB data and/or depth sensor data. Additionally or alternatively, the floor plane may be determined using machine learning to classify the plane as a floor (e.g., classifying each data point based on detected height). In some implementations, the floor plane can be determined over time using a 3D model that is updated using data from the instances.
In some implementations, determining the second location on the floor includes determining a floor plane based on image data acquired from a camera on the device. For example, the RGB data and/or the depth sensor data are used to determine the floor level. In some implementations, the image data includes depth data including a plurality of depth points, wherein determining the floor plane based on the image data includes classifying a portion of the plurality of depth points as the floor plane based on the determined height of each of the depth points; and determining a second position on the floor based on the determined height of each of the depth points corresponding to the floor plane. For example, floor planes may be determined using machine learning to classify planes as floors.
At block 406, the method 400 estimates a height based on determining a distance between the first location and the second location. For example, a first location on the top of the head and a second location on the floor may each have corresponding 3D coordinates, and the system may determine the height or distance between these two coordinates along a particular direction and/or axis (e.g., the x-axis, y-axis, or z-axis) of the coordinate system. In some implementations, estimating the height of the human body may involve using multiple distance estimates acquired over time, discarding outliers, or other means of averaging several data points of measurement data acquired over time.
In some implementations, the image data includes a plurality of image frames acquired over a period of time, and determining the floor plane based on the image data includes generating a 3D model based on the image data including the floor plane, and iteratively updating the 3D model for each acquired frame of the plurality of image frames. For example, during a live video transmission, multiple images are acquired over time, and the human height estimation system described herein may iteratively update the height estimates for one or more frames of the video transmission to determine an accurate estimate of the height of the human body. Each iteration should provide a more accurate height determination.
In some implementations, estimating the height based on determining the distance between the first location and the second location includes determining a plurality of distance estimates between the first location and the second location; identifying a subset of the plurality of distance estimates using an averaging technique (e.g., calculating an average, removing outliers, etc.); and determining the distance between the first location and the second location based on the subset of distance estimates. For example, as image data (e.g., RGB and depth data) is acquired, a running average of the height estimates may be computed over time. The running average may be based on continuously updating the determination of the top of the head, continuously updating the determination of the floor plane, or both.
In some implementations, the person is a first person and the 2D image includes the first person and a second person, wherein determining the first location on the head of the first person is based on determining that the first person is closer to a center of the 2D image than the second person is to the center of the 2D image. For example, multiple persons may be included in the image data, and the body height estimation system described herein may infer which body's height to calculate based on the center of the camera's reticle or the center of the image on which the user focuses the device's camera. In some implementations, the body height estimation system described herein can estimate the height of each detected body in the image data. In some implementations, the body height estimation system described herein may wait for some user interaction with a body in the image on the screen of the user device. For example, a user may be viewing live video of a group of people and tap on a person of interest to determine a height estimate.
In some implementations, the human height estimation system described herein may acquire a single frame of a 2D light intensity image (e.g., a single RGB image, such as a photograph), extrapolate depth data from the 2D image, and determine a height estimate for the human based on the 2D image and the depth data. For example, if it is desired to track the height of a child over time, the system may capture several photographs of the child (e.g., scan or take photographs from the child's photo album) and cause the body height estimation system described herein to determine an estimated height for each photograph. Additionally, if the photograph is digital, the body height estimation system described herein may store the height data within the metadata of the digitally stored image.
In use, with process 400, a user (e.g., user 102 in fig. 1) may acquire an image of a human body (or multiple people) within a room with a device (e.g., a smartphone, such as device 120), and the processes described herein will acquire image data, identify people within the environment, and determine a height estimate for the human body (or each of the multiple people) as the human body is being scanned by a camera on the device. In some implementations, the height estimate for the human body may be automatically displayed and updated on the user device, superimposed on the live camera feed. In some implementations, the height estimate of the human body may be provided after some type of user interaction following the scan of the physical environment. For example, the user may be shown options for the identified persons, and the user may select or tap the person whose measurement information is desired, and the measurement information will then be displayed. Thus, as shown and discussed below with reference to fig. 5A to 5F, the height estimation unit 520 determines the distance between a portion of the head of the human body and the underlying floor as an estimated height value of the human body by: determining segmentation mask data (e.g., a 2D mask of a human body) using a mask data unit 530 (e.g., the mask data unit 242 of fig. 2 and/or the mask data unit 342 of fig. 3); determining floor level data using a floor unit 540 (e.g., floor unit 244 of fig. 2 and/or floor unit 344 of fig. 3); detecting a feature on the head of the human body using a feature detection unit 550 (e.g., feature detection unit 246 of fig. 2 and/or feature detection unit 346 of fig. 3); and determining a distance between the top of the head of the person and the floor plane by a measurement unit 560 (e.g., measurement unit 248 of fig. 2 and/or measurement unit 348 of fig. 3).
Fig. 5A-5F are system flow diagrams of exemplary environments 500A-500F, respectively, in which a system may determine a distance between a portion of a human head and an underlying floor as an estimate of the height of the human using depth and light intensity image information detected in the physical environment. In some implementations, the system flows of exemplary environments 500A-500F are executed on a device (e.g., server 110 or device 120 of fig. 1-3), such as a mobile device, desktop computer, laptop computer, or server device. The system flows of exemplary environments 500A-500F may be displayed on a device (e.g., device 120 of fig. 1 and 3) having a screen for displaying images and/or a screen for viewing stereoscopic images, such as a Head Mounted Display (HMD). In some implementations, the system flows of exemplary environments 500A-500F are executed on processing logic (including hardware, firmware, software, or combinations thereof). In some implementations, the system flows of exemplary environments 500A-500F execute on a processor executing code stored in a non-transitory computer-readable medium (e.g., memory).
Fig. 5A is a block diagram of an exemplary environment 500A in which a height estimation unit 520 may determine a distance between a portion of a head of a human body and an underlying floor as an estimated height value of the human body using light intensity image information and depth information detected in the physical environment. The system flow of exemplary environment 500A acquires light intensity image data 503 (e.g., live camera transmission from light intensity camera 502), depth image data 505 (e.g., depth image data from depth camera 504), and other physical environment information sources (e.g., camera positioning information 507 from position sensor 506) of a physical environment (e.g., physical environment 105 of fig. 1), and determines a distance between a portion of a human's head and an underlying floor as an estimate of the human's height by: determining segmentation mask data (e.g., a 2D mask of a human body) using a mask data unit 530 (e.g., the mask data unit 242 of fig. 2 and/or the mask data unit 342 of fig. 3); determining floor level data using a floor unit 540 (e.g., floor unit 244 of fig. 2 and/or floor unit 344 of fig. 3); detecting a feature on the head of the human body using a feature detection unit 550 (e.g., feature detection unit 246 of fig. 2 and/or feature detection unit 346 of fig. 3); and determining a distance between the top of the head of the person and the floor plane by a measurement unit 560 (e.g., measurement unit 248 of fig. 2 and/or measurement unit 348 of fig. 3).
The measurement unit 560 is configured with instructions executable by the processor to collect image data (e.g., light intensity data, depth data, camera position information, etc.), mask data, floor data, and feature detection data, and determine a distance between a portion of the head of a human body and the underlying floor as an estimate of the height of the human body using one or more of the techniques disclosed herein. For example, the measurement unit 560 (e.g., the measurement unit 248 of fig. 2 and/or the measurement unit 348 of fig. 3) collects a series of light intensity images (e.g., RGB data 503) from the light intensity camera 502 (e.g., live camera transmission) and depth data 505 from the depth camera 504, acquires mask data from the mask data unit 530, acquires floor data from the floor unit 540, acquires feature detection data (e.g., head top data) from the feature detection unit 550, and determines an estimated height of a human body identified by the mask data. For example, a first location on the top of the head and a second location on the floor may each have corresponding 3D coordinates, and the system may determine the height or distance between these two coordinates in a particular direction and axis of the coordinate system. In some implementations, estimating the height of the human body may involve using multiple distance estimates acquired over time, discarding outliers, or other means for averaging several data points of measurement data acquired over time.
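One way a measurement step like this could combine the head-top location and the floor plane is sketched below as a point-to-plane distance; the data structure and names are hypothetical stand-ins rather than the units described in the figures.

```python
# Minimal sketch of combining a head-top 3D point with a floor plane to get a height.
# The FrameInputs fields and measure_height name are hypothetical stand-ins.
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameInputs:
    head_top_3d: np.ndarray   # head-top point supplied by a feature detection step
    floor_plane: np.ndarray   # plane as (a, b, c, d) with ax + by + cz + d = 0

def measure_height(inputs: FrameInputs) -> float:
    """Distance from the head-top point to the floor plane."""
    a, b, c, d = inputs.floor_plane
    normal = np.array([a, b, c], dtype=float)
    return float(abs(np.dot(normal, inputs.head_top_3d) + d) / np.linalg.norm(normal))

# Head top 1.70 m above a horizontal floor plane y = 0.
frame = FrameInputs(head_top_3d=np.array([0.2, 1.70, 2.0]),
                    floor_plane=np.array([0.0, 1.0, 0.0, 0.0]))
print(round(measure_height(frame), 2))
```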
In one exemplary implementation, the environment 500A includes an image synthesis pipeline that acquires or obtains data of a physical environment (e.g., image data from one or more image sources). Exemplary environment 500A is an example of acquiring image data (e.g., light intensity data and depth data) for a plurality of image frames. The image sources may include a depth camera 504 that acquires depth data 505 of the physical environment, and a light intensity camera 502 (e.g., an RGB camera) that acquires light intensity image data 503 (e.g., a series of RGB image frames). For positioning information, some implementations include a visual inertial odometry (VIO) system to estimate the distance traveled by determining equivalent odometry information from sequential camera images (e.g., light intensity data 503). Alternatively, some implementations of the present disclosure may include a SLAM system (e.g., position sensor 506). The SLAM system may include a multi-dimensional (e.g., 3D) laser scanning and range measurement system that is independent of GPS and provides real-time simultaneous localization and mapping. The SLAM system can generate and manage very accurate point cloud data resulting from reflections from laser scans of objects in the environment. The movement of any point in the point cloud is accurately tracked over time, so that the SLAM system can use the points in the point cloud as reference points for localization, maintaining an accurate understanding of its location and orientation as it travels through the environment.
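For context, the sketch below shows the standard pinhole back-projection that lifts a depth pixel into a shared 3D coordinate system using camera intrinsics and a camera pose such as one a VIO or SLAM system could supply; the variable names and example values are assumptions.

```python
# Hedged sketch of back-projecting a depth pixel into world coordinates using camera
# intrinsics and a 4x4 camera-to-world pose (e.g., from VIO/SLAM). Names are assumptions.
import numpy as np

def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float,
                cam_to_world: np.ndarray) -> np.ndarray:
    """Return the world-space 3D point for pixel (u, v) with metric depth."""
    # Pinhole back-projection into camera coordinates, then transform by the pose.
    p_cam = np.array([(u - cx) * depth_m / fx,
                      (v - cy) * depth_m / fy,
                      depth_m,
                      1.0])
    return (cam_to_world @ p_cam)[:3]

# Identity pose: camera frame coincides with the world frame.
pose = np.eye(4)
print(backproject(960, 540, 2.0, fx=1500, fy=1500, cx=960, cy=540, cam_to_world=pose))
```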
Fig. 5B is a block diagram of an exemplary environment 500B in which a mask data unit 530 (e.g., mask data unit 242 of fig. 2 and/or mask data unit 342 of fig. 3) may generate masked data 531 (e.g., a segmentation mask) using acquired image data 512 (e.g., depth data, such as 3D point clouds, RGB data, etc.) and/or acquired semantic segmentation data (e.g., RGB-S data that semantically labels objects and/or people within the image data). In particular, environment 500B is acquiring image 512 (e.g., image data from physical environment 105 of fig. 1, where user 102 is acquiring image data including object/body 104 and object/body 106). The masked data 531 represents an object (e.g., a person) to be masked, meaning that only the identified object (e.g., a person) is shown and the remaining information is masked out. For example, masked image data 513 illustrates the determined mask data, including mask 533a (e.g., object/body 104) and mask 533b (e.g., object/body 106). The mask data unit 530 may send the masked image information to a feature detection unit (e.g., the feature detection unit 246 of fig. 2 and/or the feature detection unit 346 of fig. 3). Alternatively, the mask data unit 530 may send the identified pixel locations of the identified objects/bodies to the feature detection unit, which may then determine whether to include or exclude the identified objects/bodies (e.g., based on a confidence value).
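A minimal sketch of the masking idea, assuming a per-pixel label map in which a particular integer value marks person pixels, is shown below; the label value and function name are illustrative assumptions.

```python
# Illustrative sketch of masking: keep only pixels labeled as a person and zero out the
# rest. The "person" label value of 1 is an assumption for the example.
import numpy as np

def apply_person_mask(image: np.ndarray, labels: np.ndarray, person_label: int = 1) -> np.ndarray:
    """Return a copy of the image where non-person pixels are blacked out."""
    mask = (labels == person_label)
    masked = np.zeros_like(image)
    masked[mask] = image[mask]
    return masked

rgb = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)
seg = np.zeros((4, 4), dtype=np.int32)
seg[1:3, 1:3] = 1                         # a small "person" region
print(apply_person_mask(rgb, seg)[0, 0])  # background pixel -> [0 0 0]
```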
In some implementations, the mask data unit 530 uses a machine learning model, where the segmentation mask model can be configured to identify semantic labels for pixels or voxels of the image data. The segmentation mask machine learning model may generate a segmentation mask that identifies a particular type of object (e.g., a human, a cat) associated with motion (e.g., humans often move, while furniture does not). In some implementations, the mask can indicate an object using a value of 0 or a value of 1 to indicate whether the information should be included in the 3D model (e.g., a moving object). In some implementations, the segmentation machine learning model may be a neural network executed by a neural engine/circuit on a processor chip tuned to accelerate AI software. In some implementations, the segmentation mask can include pixel-level confidence values. For example, a pixel location may be labeled "chair" with a value of 0.8, meaning the system is 80% confident that the x, y, z coordinates of that pixel location belong to a chair. The confidence level may be adjusted as additional data is acquired. In some implementations, the machine learning model is a neural network (e.g., an artificial neural network), a decision tree, a support vector machine, a bayesian network, or the like.
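The pixel-level confidence values described above could be converted into a binary person mask with a simple threshold, as in the hedged sketch below; the 0.5 threshold is an assumption chosen for illustration.

```python
# Minimal sketch of turning a per-pixel confidence map for the person class into a
# binary mask. The 0.5 threshold is an illustrative assumption.
import numpy as np

def confidence_to_mask(person_confidence: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binary mask of pixels whose person-class confidence meets the threshold."""
    return person_confidence >= threshold

conf = np.array([[0.1, 0.2, 0.1],
                 [0.3, 0.9, 0.8],
                 [0.2, 0.7, 0.6]])
print(confidence_to_mask(conf).astype(int))
```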
Fig. 5C is a block diagram of an exemplary environment 500C in which a mask data unit 530 (e.g., the mask data unit 242 of fig. 2 and/or the mask data unit 342 of fig. 3) may generate masked data 531 using acquired image data 514 (e.g., depth data, such as 3D point clouds, RGB data, etc.) and/or acquired semantic segmentation data (e.g., RGB-S data that semantically labels objects and/or people within the image data). In particular, environment 500C is an example of an acquired image 514 (e.g., image data from physical environment 105 of fig. 1). However, image 514 differs from image 512 in environment 500B in that the user 102 is acquiring image data including object/body 104 and object/body 106, but object/body 106 stands behind object/body 104 and is partially occluded by object/body 104. The masked data 531 represents an object (e.g., a person) to be masked, meaning that only the identified object (e.g., a person) is shown and the remaining information is masked out. For example, masked image data 515 shows the determined mask data, which includes mask 534a (e.g., object/body 104) and mask 534b (e.g., object/body 106). In some implementations, using the techniques described herein, even if masks 534a and 534b overlap, the feature detection unit can disambiguate between masks 534a and 534b based on detecting a feature on the head of the human body (such as a feature on the face) for mask 534a. The mask data unit 530 may send the masked image information to a feature detection unit (e.g., the feature detection unit 246 of fig. 2 and/or the feature detection unit 346 of fig. 3). Alternatively, the mask data unit 530 may send the identified pixel locations of the identified objects/bodies to the feature detection unit, which may then determine whether to include or exclude the identified objects/bodies (e.g., based on a confidence value).
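One simple way to disambiguate overlapping masks, assuming the face detector reports a pixel location for the detected face, is sketched below; the inputs and names are illustrative assumptions rather than the disclosed implementation.

```python
# Hedged sketch of disambiguating overlapping person masks: pick the mask that contains
# the pixel where a face was detected. Inputs and names are illustrative assumptions.
import numpy as np
from typing import Optional

def mask_for_face(masks: list[np.ndarray], face_xy: tuple[int, int]) -> Optional[int]:
    """Return the index of the first mask containing the face pixel, or None."""
    x, y = face_xy
    for i, mask in enumerate(masks):
        if mask[y, x]:
            return i
    return None

front = np.zeros((6, 6), dtype=bool); front[1:5, 1:4] = True   # person in front
back = np.zeros((6, 6), dtype=bool);  back[0:5, 2:5] = True    # partially occluded person
print(mask_for_face([front, back], face_xy=(2, 2)))            # face at (2, 2) -> mask 0
```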
Fig. 5D is a block diagram of an exemplary environment 500D in which a floor unit 540 (e.g., the floor unit 244 of fig. 2 and/or the floor unit 344 of fig. 3) may use acquired image data 512 (e.g., depth data, such as 3D point clouds, RGB data, etc.) and/or acquired semantic segmentation data (e.g., RGB-S data that semantically labels features within the image data, such as a floor) to generate floor data 542 (e.g., 3D floor planes, etc.). In particular, environment 500D is acquiring image 512 (e.g., image data from physical environment 105 of fig. 1, where user 102 is acquiring image data including object/body 104 and object/body 106). The floor data 542 represents the floor plane identified using the techniques described herein. When estimating the height of the human body based on the top of the head of the human body and the determined floor or floor plane, the floor unit 540 may send the floor plane information to the measurement unit 560 to be used as the second location in the 3D coordinate system.
Fig. 5E is a block diagram of an exemplary environment 500E in which a floor unit 540 (e.g., the floor unit 244 of fig. 2 and/or the floor unit 344 of fig. 3) may use acquired image data 512 (e.g., depth data, such as the 3D point cloud 505, RGB data, etc.) and/or acquired semantic segmentation data (e.g., RGB-S data that semantically labels features within the image data, such as a floor) to generate floor data 572 (e.g., the 3D floor plane 573, etc.). In particular, environment 500E is acquiring an image 512 (e.g., image data from physical environment 105 of fig. 1, where user 102 is acquiring image data including object/body 104 and object/body 106) and a 3D point cloud 505 for depth data. The floor data 572 represents the floor plane identified using the techniques described herein. For example, the floor unit 540 uses the depth classification unit 570 to determine a floor plane 573 based on machine learning trained to identify floor points in the depth data, as shown in the 3D point cloud 571. The depth classification unit classifies the floor points based on the identified lowest points in the 3D coordinate system. When estimating the height of the human body based on the top of the head of the human body and the determined floor or floor plane, the floor unit 540 may send the floor plane information 573 to the measurement unit 560 to be used as the second location in the 3D coordinate system.
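As an illustration of classifying floor points from the lowest points in the 3D coordinate system, the sketch below selects the lowest band of a point cloud along an assumed +Y up axis and fits a plane with a plain least-squares fit; the percentile cutoff and the least-squares fit stand in for whatever trained classifier the disclosure contemplates.

```python
# Illustrative sketch of floor classification: treat the lowest points of the cloud
# (along an assumed +Y up axis) as floor candidates and fit a plane to them.
# The 5th-percentile cutoff and 5 cm band are assumptions.
import numpy as np

def fit_floor_plane(points: np.ndarray, percentile: float = 5.0) -> np.ndarray:
    """Return plane coefficients (a, b, c, d) with ax + by + cz + d = 0, assuming +Y is up."""
    cutoff = np.percentile(points[:, 1], percentile)
    floor_pts = points[points[:, 1] <= cutoff + 0.05]            # lowest points + 5 cm band
    # Least-squares fit of y = px*x + pz*z + p0 over the floor candidates.
    A = np.column_stack([floor_pts[:, 0], floor_pts[:, 2], np.ones(len(floor_pts))])
    px, pz, p0 = np.linalg.lstsq(A, floor_pts[:, 1], rcond=None)[0]
    return np.array([-px, 1.0, -pz, -p0])                        # y - px*x - pz*z - p0 = 0

rng = np.random.default_rng(0)
floor = np.column_stack([rng.uniform(-2, 2, 500), rng.normal(0.0, 0.01, 500), rng.uniform(0, 4, 500)])
body = np.column_stack([rng.uniform(-0.3, 0.3, 200), rng.uniform(0.1, 1.7, 200), rng.uniform(1, 2, 200)])
print(np.round(fit_floor_plane(np.vstack([floor, body])), 2))    # ~[0, 1, 0, 0]
```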
Fig. 5F is a block diagram of an exemplary environment 500F in which a feature detection unit 550 (e.g., the feature detection unit 246 of fig. 2 and/or the feature detection unit 346 of fig. 3) may send information about the top of the head of the human body to a measurement unit using captured image data (e.g., light intensity data 503, depth data 505) and captured mask data 531 from a mask data unit 530. In particular, environment 500F is acquiring image 512 (e.g., image data from physical environment 105 of fig. 1, where user 102 is acquiring the image data) and mask image data 513 including mask 533a for standing object/person 106. In one exemplary implementation and for purposes of illustration, the feature detection unit 550 includes a bounding box unit 552, a center determination unit 554, and a head top determination unit 556. The bounding box unit 552 generates bounding box data 553 surrounding the detected feature of the head. Specifically, the bounding box unit 552 detects a feature portion on the head of the human body from the masked data (e.g., the human body) of the mask image data, and identifies the face based on the detected feature portion. The center determining unit 554 utilizes the bounding box data 553 and determines sides (e.g., a first side 555a and a second side 555b) of the head. After determining the side of the head of the human body, the center determining unit 554 determines a center line 555c corresponding to the human body. The head top determining unit 556 uses the center line 555c and then determines head top data (e.g., 3D coordinates), which is transmitted to the measuring unit to estimate the height of the human body.
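A minimal sketch of the head-top search, assuming a binary person mask and a face bounding box in pixel coordinates, is shown below; walking straight up the centerline column is an illustrative simplification of the centerline-based determination described above.

```python
# Hedged sketch of the head-top search: take the horizontal center of the face bounding
# box as the centerline and walk up that mask column to its topmost "on" pixel.
import numpy as np
from typing import Optional

def head_top_pixel(mask: np.ndarray, face_box: tuple[int, int, int, int]) -> Optional[tuple[int, int]]:
    """Return (row, col) of the topmost mask pixel on the face centerline, or None."""
    x0, y0, x1, y1 = face_box
    center_col = (x0 + x1) // 2
    rows = np.flatnonzero(mask[:, center_col])
    return (int(rows[0]), center_col) if rows.size else None

person = np.zeros((10, 8), dtype=bool)
person[2:10, 2:6] = True                              # person occupies rows 2..9
print(head_top_pixel(person, face_box=(2, 4, 6, 8)))  # -> (2, 4): head top on the centerline
```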
In some implementations, the exemplary environment 500 includes a 3D representation unit configured with instructions executable by a processor to acquire sensor data (e.g., RGB data 503, depth data 505, etc.) and measurement data 562, and to generate 3D representation data with measurements using one or more techniques. For example, the 3D representation unit analyzes the RGB images from the light intensity camera 502 using a sparse depth map from the depth camera 504 (e.g., a time-of-flight sensor) and other physical environment information sources (e.g., camera positioning information such as position/motion data 507 from the sensor 506 or a camera SLAM system, etc.) to generate 3D representation data (e.g., a 3D model representing the physical environment of fig. 1). In some implementations, the 3D representation data can be a 3D representation that represents surfaces in the 3D environment using a 3D point cloud with associated semantic labels. The 3D representation may be a 3D bounding box for each detected object of interest, such as the human body 104/106, the table 142, and the chair 140. In some implementations, the 3D representation data is a 3D reconstructed mesh generated using a meshing algorithm based on depth information detected in the physical environment, the depth information being integrated (e.g., fused) to recreate the physical environment. A meshing algorithm (e.g., a dual marching cubes meshing algorithm, a Poisson meshing algorithm, a tetrahedral meshing algorithm, etc.) may be used to generate a mesh representing a room (e.g., physical environment 105) and/or objects within the room (e.g., human body 104/106, table 142, chair 140, etc.). In some implementations, for 3D reconstruction using meshes, a voxel hashing approach is used to effectively reduce the amount of memory used in the reconstruction process, in which 3D space is divided into voxel blocks referenced by a hash table using their 3D positions as keys. The voxel blocks are constructed only around the object surface, freeing up memory that would otherwise be used to store free space. The voxel hashing approach is also faster than competing methods of the time, such as octree-based methods. Furthermore, it supports data streaming between the GPU, where memory is typically limited, and the CPU, where memory is more plentiful.
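The voxel hashing idea mentioned above can be illustrated with the following sketch, in which voxel blocks are stored in a hash map keyed by integer block coordinates so that memory is only allocated near observed surface points; the block size and contents are assumptions, and a real system would fuse signed-distance values rather than allocate empty blocks.

```python
# Minimal sketch of voxel hashing: blocks of voxels live in a hash map keyed by their
# integer 3D block coordinates, so storage is only allocated near observed points.
# Block size, block dimension, and the empty-block contents are assumptions.
import numpy as np

class VoxelHashGrid:
    def __init__(self, voxel_size: float = 0.04, block_dim: int = 8):
        self.voxel_size = voxel_size
        self.block_dim = block_dim
        self.blocks: dict[tuple[int, int, int], np.ndarray] = {}

    def _block_key(self, point: np.ndarray) -> tuple[int, int, int]:
        block_extent = self.voxel_size * self.block_dim
        return tuple(np.floor(point / block_extent).astype(int))

    def integrate_point(self, point: np.ndarray) -> None:
        """Allocate (if needed) the block around a surface point; a real system would fuse TSDF values here."""
        key = self._block_key(point)
        if key not in self.blocks:
            self.blocks[key] = np.zeros((self.block_dim,) * 3, dtype=np.float32)

grid = VoxelHashGrid()
for p in np.random.default_rng(1).uniform(0, 1, size=(1000, 3)):
    grid.integrate_point(p)
print(len(grid.blocks))  # only blocks near observed points are allocated
```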
In some implementations, the exemplary environment 500A also includes an integration unit configured with instructions executable by the processor to acquire a subset of the image data (e.g., light intensity data 503, depth data 505, etc.) and positioning information (e.g., camera pose information from the position/motion sensor 506) and integrate (e.g., fuse) the subset of the image data using one or more known techniques. For example, the image integration unit receives a subset of depth image data 505 (e.g., sparse depth data) and a subset of intensity image data 503 (e.g., RGB) from the image sources (e.g., light intensity camera 502 and depth camera 504), and integrates the subsets of image data to generate 3D data. The 3D data may include a dense 3D point cloud (e.g., imperfect depth maps and camera poses for a plurality of image frames around an object) that is sent to a 3D representation unit (e.g., a unit that generates a 3D model from the image data). The 3D data may also be voxelized.
In some implementations, the exemplary environment 500A also includes a semantic segmentation unit configured with instructions executable by the processor to obtain a subset of the light intensity image data (e.g., light intensity data 503) and identify and segment wall structures (walls, doors, windows, etc.) and objects (e.g., human bodies, tables, teapots, chairs, vases, etc.) using one or more known techniques. For example, the segmentation unit receives a subset of intensity image data 503 from an image source (e.g., light intensity camera 502) and generates segmentation data (e.g., semantic segmentation data, such as RGB-S data). In some implementations, the segmentation unit uses a machine learning model, where the semantic segmentation model may be configured to identify semantic labels of pixels or voxels of the image data. In some implementations, the machine learning model is a neural network (e.g., an artificial neural network), a decision tree, a support vector machine, a bayesian network, or the like.
Numerous specific details are set forth herein to provide a thorough understanding of claimed subject matter. However, it will be understood by those skilled in the art that claimed subject matter may be practiced without these specific details. In other instances, methods, apparatuses, or systems that are known to one of ordinary skill have not been described in detail so as not to obscure claimed subject matter.
Unless specifically stated otherwise, it is appreciated that throughout the description, discussions utilizing terms such as "processing," "computing," "calculating," "determining," and "identifying" or the like, refer to the action and processes of a computing device, such as one or more computers or similar electronic computing devices, that manipulates and transforms data represented as physical electronic or magnetic quantities within the computing platform's memories, registers or other information storage devices, transmission devices or display devices.
The one or more systems discussed herein are not limited to any particular hardware architecture or configuration. The computing device may include any suitable arrangement of components that provides a result conditioned on one or more inputs. Suitable computing devices include a multi-purpose microprocessor-based computer system that accesses stored software that programs or configures the computing system from a general-purpose computing device to a specific purpose computing device that implements one or more implementations of the inventive subject matter. The teachings contained herein may be implemented in software for programming or configuring a computing device using any suitable programming, scripting, or other type of language or combination of languages.
Implementations of the methods disclosed herein may be performed in the operation of such computing devices. The order of the blocks presented in the above examples may be varied, e.g., the blocks may be reordered, combined, and/or divided into sub-blocks. Some blocks or processes may be performed in parallel.
The use of "adapted to" or "configured to" herein is meant to be an open and inclusive language that does not exclude devices adapted to or configured to perform additional tasks or steps. Additionally, the use of "based on" means open and inclusive, as a process, step, calculation, or other action that is "based on" one or more stated conditions or values may in practice be based on additional conditions or values beyond those stated. The headings, lists, and numbers included herein are for ease of explanation only and are not intended to be limiting.
It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first node may be referred to as a second node, and similarly, a second node may be referred to as a first node, without changing the meaning of the description, so long as all occurrences of the "first node" are renamed consistently and all occurrences of the "second node" are renamed consistently. The first node and the second node are both nodes, but they are not the same node.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the claims. As used in the description of this particular implementation and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As used herein, the term "if" may be interpreted to mean "when the prerequisite is true" or "in response to a determination" or "according to a determination" or "in response to a detection" that the prerequisite is true, depending on the context. Similarly, the phrase "if it is determined that [ the prerequisite is true ]" or "if [ the prerequisite is true ]" or "when [ the prerequisite is true ]" is interpreted to mean "upon determining that the prerequisite is true" or "in response to determining" or "according to determining that the prerequisite is true" or "upon detecting that the prerequisite is true" or "in response to detecting" that the prerequisite is true, depending on context.
The foregoing description and summary of the invention is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined solely by the detailed description of the exemplary implementations, but rather according to the full breadth permitted by the patent laws. It will be understood that the specific embodiments shown and described herein are merely illustrative of the principles of the invention and that various modifications can be implemented by those skilled in the art without departing from the scope and spirit of the invention.

Claims (20)

1. A method, comprising:
at a device having a processor:
determining a first location on a head of a human body in a three-dimensional (3D) coordinate system, the first location being determined based on detecting features on the head from a two-dimensional (2D) image of the human body in a physical environment;
determining a second location on the floor below the first location in the 3D coordinate system; and
estimating a height based on determining a distance between the first location and the second location.
2. The method of claim 1, wherein determining the first location comprises:
obtaining a mask that identifies portions of an image that correspond to one or more persons;
detecting a feature on the head; and
identifying the first location within a portion of the mask corresponding to one or more persons based on features on the head.
3. The method of claim 1, wherein the feature is a face.
4. The method of claim 3, wherein identifying the first location within the portion of the mask comprises:
determining a center of the face and an orientation of the face; and
identifying a portion of the mask in a direction from the center of the face, wherein the direction is based on the orientation.
5. The method of claim 4, wherein identifying the first location within the portion of the mask further comprises generating a bounding box corresponding to the face.
6. The method of claim 1, wherein identifying the first location within the portion of the mask comprises identifying a boundary of the head based on data from a depth sensor.
7. The method of claim 1, wherein determining the second location on the floor comprises determining a floor plane based on image data acquired from a camera on the device.
8. The method of claim 7, wherein the image data comprises depth data comprising a plurality of depth points, wherein determining the floor plane based on the image data comprises:
classifying a portion of the plurality of depth points as the floor plane based on the determined height of each of the depth points; and
determining the second location on the floor based on the determined heights of each of the depth points corresponding to the floor plane.
9. The method of claim 7, wherein the image data comprises a plurality of image frames acquired over a period of time, wherein determining the floor plane based on the image data comprises:
generating a 3D model based on the image data including the floor plane; and
iteratively updating the 3D model for each acquired frame of the plurality of image frames.
10. The method of claim 1, wherein estimating the height based on determining the distance between the first location and the second location comprises:
determining a plurality of distance estimates between the first location and the second location;
identifying a subset of the plurality of distance estimates using an averaging technique; and
determining the distance between the first location and the second location based on the subset of data.
11. The method of claim 1, wherein the person is a first person and the 2D image includes the first person and a second person, wherein determining the first location on the head of the first person is based on determining that the first person is closer to a center of the 2D image than the second person is to the center of the 2D image.
12. An apparatus, comprising:
a non-transitory computer-readable storage medium; and
one or more processors coupled to the non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium comprises program instructions that, when executed on the one or more processors, cause the apparatus to perform operations comprising:
determining a first location on a head of a human body in a three-dimensional (3D) coordinate system, the first location being determined based on detecting features on the head from a two-dimensional (2D) image of the human body in a physical environment;
determining a second location on the floor below the first location in the 3D coordinate system; and
estimating a height based on determining a distance between the first location and the second location.
13. The apparatus of claim 12, wherein determining the first location comprises:
obtaining a mask that identifies portions of an image that correspond to one or more persons;
detecting a feature on the head; and
identifying the first location within a portion of the mask corresponding to one or more persons based on features on the head.
14. The apparatus of claim 12, wherein the feature is a face.
15. The apparatus of claim 14, wherein identifying the first location within the portion of the mask comprises:
determining a center of the face and an orientation of the face; and
identifying a portion of the mask in a direction from the center of the face, wherein the direction is based on the orientation.
16. The apparatus of claim 4, wherein identifying the first location within the portion of the mask further comprises generating a bounding box corresponding to the face.
17. The apparatus of claim 12, wherein identifying the first location within the portion of the mask comprises identifying a boundary of the head based on data from a depth sensor.
18. The device of claim 12, wherein determining the second location on the floor comprises determining a floor plane based on image data acquired from a camera on the device.
19. The apparatus of claim 18, wherein the image data comprises depth data comprising a plurality of depth points, wherein determining the floor plane based on the image data comprises:
classifying a portion of the plurality of depth points as the floor plane based on the determined height of each of the depth points; and
determining the second location on the floor based on the determined heights of each of the depth points corresponding to the floor plane.
20. A non-transitory computer-readable storage medium storing program instructions that are computer-executable on a computer to perform operations comprising:
determining a first location on a head of a human body in a three-dimensional (3D) coordinate system, the first location being determined based on detecting features on the head from a two-dimensional (2D) image of the human body in a physical environment;
determining a second location on the floor below the first location in the 3D coordinate system; and
estimating a height based on determining a distance between the first location and the second location.
CN202110282036.3A 2020-03-17 2021-03-16 Human body height estimation Pending CN113397526A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202062990589P 2020-03-17 2020-03-17
US62/990,589 2020-03-17
US17/182,344 2021-02-23
US17/182,344 US11763477B2 (en) 2020-03-17 2021-02-23 Person height estimation

Publications (1)

Publication Number Publication Date
CN113397526A (en) 2021-09-17

Family

ID=77677634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110282036.3A Pending CN113397526A (en) 2020-03-17 2021-03-16 Human body height estimation

Country Status (1)

Country Link
CN (1) CN113397526A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007078355A (en) * 2005-09-09 2007-03-29 Toa Corp Height measuring instrument
US20130184592A1 (en) * 2012-01-17 2013-07-18 Objectvideo, Inc. System and method for home health care monitoring
US20170273639A1 (en) * 2014-12-05 2017-09-28 Myfiziq Limited Imaging a Body
CN106683070A (en) * 2015-11-04 2017-05-17 杭州海康威视数字技术股份有限公司 Body height measurement method and body height measurement device based on depth camera
CN106419923A (en) * 2016-10-27 2017-02-22 南京阿凡达机器人科技有限公司 Height measurement method based on monocular machine vision
CN110715647A (en) * 2018-07-13 2020-01-21 苹果公司 Object detection using multiple three-dimensional scans

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yingying Liu et al., "Single camera multi-view anthropometric measurement of human height and mid-upper arm circumference using linear regression", PLOS ONE, vol. 13, no. 4, 18 April 2018 (2018-04-18), pages 1-22 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination