WO2023214093A1 - Accurate 3d body shape regression using metric and/or semantic attributes - Google Patents


Info

Publication number
WO2023214093A1
Authority
WIPO (PCT)
Prior art keywords
shape
attribute
training
machine learning
sensor data
Application number
PCT/EP2023/062148
Other languages
French (fr)
Inventor
Michael J. Black
Lea MÜLLER
Vassilis CHOUTAS
Dimitrios TZIONAS
Chun-Hao Paul HUANG
Original Assignee
MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V.
Application filed by MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V.
Publication of WO2023214093A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition

Definitions

  • CAESAR Meshes Test Set (CMTS): We use CAESAR to measure the accuracy of SMPL-X body shapes and linguistic shape attributes for the models of Sec. 4. Specifically, we compute: (1) errors for SMPL-X meshes estimated from linguistic shape attributes and/or anthropometric measurements by A2S and its variations, and (2) errors for linguistic shape attributes estimated from SMPL-X meshes by S2A. To create an unseen mesh test set, we withhold 339 male and 410 female CAESAR meshes from the crowd-sourced CAESAR linguistic shape attributes, described in Sec. 3.3.
  • Human Bodies in the Wild (HBW): The field is missing a dataset with varied bodies, varied clothing, in-the-wild images, and accurate 3D shape ground truth. We fill this gap by collecting a novel dataset, called “Human Bodies in the Wild” (HBW), with three steps: (1) We collect accurate 3D body scans for 35 subjects (20 female, 15 male), and register a “gendered” SMPL-X model to these to recover 3D SMPL-X ground-truth bodies [51]. (2) Subjects upload full-body photos of themselves taken in the wild. HBW has 2,543 photos: 1,318 in the lab setting and 1,225 in the wild. Figure 7 shows a few HBW subjects, photos and their SMPL-X ground-truth shapes. All subjects gave prior written informed consent to participate in this study and to release the data. The study was reviewed by the ethics board of the University of Tübingen, without objections.
  • Mean point-to-point error (P2P20K): SMPL-X has a highly non-uniform vertex distribution across the body, which negatively biases the mean vertex-to-vertex (V2V) error when comparing estimated and ground-truth SMPL-X meshes. To account for this, we evenly sample 20K points on SMPL-X’s surface and report the mean point-to-point (P2P20K) error. For details, see Sup. Mat.
  • 7.3. SHAPE-REPRESENTATION MAPPINGS: We evaluate the models A2S and S2A, which map between the various body shape representations (Sec. 4).
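The following numpy sketch illustrates one plausible implementation of the P2P20K metric described above. It assumes the (face, barycentric-coordinate) pairs are sampled once, area-weighted on a template mesh, and reused on both meshes; since SMPL-X topology is fixed, this yields corresponding surface points. The exact sampling procedure used in the evaluation may differ.

```python
import numpy as np

def sample_surface_points(template_vertices, faces, n=20_000, seed=0):
    """Sample fixed (face, barycentric) pairs, area-weighted on a template mesh.

    Because SMPL-X topology is fixed, the same pairs yield corresponding
    points on any estimated or ground-truth mesh. (Assumed procedure.)
    """
    rng = np.random.default_rng(seed)
    tri = template_vertices[faces]                                  # (F, 3, 3)
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    face_idx = rng.choice(len(faces), size=n, p=areas / areas.sum())
    r1, r2 = rng.random(n), rng.random(n)
    s = np.sqrt(r1)                                                 # uniform inside each triangle
    bary = np.stack([1.0 - s, s * (1.0 - r2), s * r2], axis=1)      # (n, 3)
    return face_idx, bary

def p2p20k(pred_vertices, gt_vertices, faces, face_idx, bary):
    """Mean point-to-point error over 20K corresponding surface points."""
    p_pred = (bary[:, :, None] * pred_vertices[faces[face_idx]]).sum(axis=1)
    p_gt = (bary[:, :, None] * gt_vertices[faces[face_idx]]).sum(axis=1)
    return np.linalg.norm(p_pred - p_gt, axis=1).mean()
```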
  • A2S and its variations: How well can we infer 3D body shape from just linguistic shape attributes, anthropometric measurements, or both together? Attributes improve the overall shape prediction across the board: e.g., adding attributes to height (AH2S vs. H2S) lowers the error. The full AHWC2S model (attributes + height + weight + circumferences) achieves P2P20K errors of 5.8 ± 2.0 mm (males) and 6.2 ± 2.4 mm (females).
  • S2A: How well can we infer linguistic shape attributes from 3D shape?
  • CONCLUSION: SHAPY is trained to regress more accurate human body shape from images than previous methods, without explicit 3D shape supervision. To achieve this, we present two different ways to collect proxy annotations for 3D body shape for in-the-wild images. First, we collect sparse anthropometric measurements from online model-agency data. Second, we annotate images with linguistic shape attributes using crowd-sourcing. We learn mappings between body shape, measurements, and attributes, enabling us to supervise a regressor using any combination of these. To evaluate SHAPY, we introduce a new shape estimation benchmark, the “Human Bodies in the Wild” (HBW) dataset. HBW has images of people in natural clothing and natural settings together with ground-truth 3D shape from a body scanner.
  • HBW is more challenging than existing shape benchmarks like SSP-3D, and SHAPY significantly outperforms existing methods on this benchmark.
  • SUPPLEMENTAL MATERIAL
  • A. DATA COLLECTION
  • A.1. MODEL-AGENCY IDENTITY FILTERING: We collect internet data, consisting of images and height/chest/waist/hips measurements, from model-agency websites. A “fashion model” can work for many agencies and their pictures can appear on multiple websites. To create non-overlapping training, validation and test sets, we match model identities across websites.
  • We use RetinaFace [12] for face detection and ArcFace [11] to compute an identity embedding φ ∈ ℝ^512 for each image.
  • Let Q and T be the number of query and target model images, and Φ_Q ∈ ℝ^{Q×512} and Φ_T ∈ ℝ^{T×512} the query and target embedding feature matrices.
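A minimal sketch of this identity matching, assuming models are matched by cosine similarity of their per-image embeddings; the threshold value is a hypothetical placeholder that would need tuning.

```python
import numpy as np

def match_identities(query_emb, target_emb, threshold=0.5):
    """Match query model images to target model images by embedding similarity.

    query_emb:  (Q, 512) identity embeddings (e.g., from ArcFace)
    target_emb: (T, 512) identity embeddings
    Returns, for each query image, the best-matching target index, or -1 if
    the best similarity falls below the (hypothetical) threshold.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    sim = q @ t.T                                   # (Q, T) cosine similarities
    best = sim.argmax(axis=1)
    return np.where(sim[np.arange(len(q)), best] >= threshold, best, -1)
```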
  • The height h(β) is computed as the difference in the vertical-axis “Y” coordinates between the top of the head and the left heel. To obtain the weight w(β), we multiply the mesh volume by 985 kg/m³, the average human body density.
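A sketch of these two virtual measurements on a T-posed SMPL-X mesh. The vertex indices are hypothetical placeholders, and the volume is computed with the standard divergence-theorem formula for watertight meshes.

```python
import numpy as np

HEAD_TOP_IDX, LEFT_HEEL_IDX = 8976, 8846   # hypothetical SMPL-X vertex indices

def virtual_height(vertices):
    """h(beta): Y-difference between top of head and left heel (T-pose mesh)."""
    return vertices[HEAD_TOP_IDX, 1] - vertices[LEFT_HEEL_IDX, 1]

def virtual_weight(vertices, faces, density=985.0):
    """w(beta): watertight mesh volume times average body density (kg/m^3)."""
    tri = vertices[faces]                                           # (F, 3, 3)
    # signed volume of tetrahedra formed with the origin (divergence theorem)
    volume = np.abs(np.einsum('ij,ij->i', tri[:, 0],
                              np.cross(tri[:, 1], tri[:, 2])).sum()) / 6.0
    return volume * density
```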
  • B.3. IMAGES TO ATTRIBUTES (I2A): This attribute predictor is implemented using a ResNet50 for feature extraction from the input images, followed by one MLP per gender for attribute score prediction. To evaluate attribute score prediction, we use the attribute classification metric described in the main part above. I2A achieves 60.7%/69.3% (female/male) correctly predicted attributes, while our S2A achieves 68.8%/76.0% on CAESAR. Our explanation for this result is that it is hard for the I2A model to learn to predict attributes correctly independent of subject pose. Our approach works better because it decomposes 3D human estimation into predicting pose and shape.
  • Table 8. Leave-one-out evaluation on MMTS.
  • D.2. SHAPE ESTIMATION: A2S and its variations: For completeness, Table 7 shows the results of the female A2S models in addition to the male ones. The male results are also presented in the main part above. Note that attributes improve shape reconstruction across the board. For example, in terms of P2P20K, AH2S is better than just H2S, and AHW2S is better than just HW2S. It should be emphasized that even when many measurements are used as input features, i.e. height, weight, and chest/waist/hip circumference, adding attributes still improves the shape estimate, e.g. HWC2S vs. AHWC2S.
  • We train several SHAPY variants: SHAPY-H uses only height and SHAPY-C only hip/waist/chest circumference; SHAPY-AH and SHAPY-AC use attributes in addition to height and circumference measurements, respectively; SHAPY-HC and SHAPY-AHC use all measurements, the latter also attributes. The results are reported in Tab. 8 (MMTS) and Tab. 9 (HBW).
  • The tables show that attributes are an adequate replacement for measurements. For example, in Tab. 8, the height error (SHAPY-C vs. SHAPY-AC) and the circumference errors (SHAPY-H vs. SHAPY-AH) are reduced significantly when attributes are taken into account. P2P20K errors are equal or lower when attribute information is used; see Tab. 9.
  • S2A: Table 10 shows the results of S2A in detail. All attributes are classified correctly with an accuracy of at least 58.05% (females) and 68.14% (males). The probability of randomly guessing the correct class is 20%.
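One plausible reading of this classification metric, assuming predicted and ground-truth attribute scores are snapped to the nearest of the five Likert classes before comparison (so chance level is 20%); the exact metric definition is an assumption.

```python
import numpy as np

def attribute_class_accuracy(pred_scores, gt_scores):
    """Percentage of correctly predicted attribute classes.

    pred_scores, gt_scores: (N, A) continuous scores in [1, 5].
    Scores are rounded to the nearest of the 5 Likert classes (assumed).
    Returns per-attribute accuracy in percent.
    """
    pred_cls = np.clip(np.rint(pred_scores), 1, 5)
    gt_cls = np.clip(np.rint(gt_scores), 1, 5)
    return 100.0 * (pred_cls == gt_cls).mean(axis=0)
```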
  • Robustness of AHWC2S to input noise: To evaluate this, we fit AHWC2S using the per-rater scores instead of the average score. The P2P20K error only increases by 1.0 mm, to 6.8 mm, when using the per-rater scores.
  • D.3. POSE EVALUATION: 3D Poses in the Wild (3DPW) [68]: This dataset is mainly useful for evaluating body pose accuracy, since it contains few subjects and limited body shape variation. The test set contains a limited set of 5 subjects in indoor/outdoor videos with everyday clothing. All subjects were scanned to obtain their ground-truth body shape. The body poses are pseudo ground-truth SMPL fits, recovered from images and IMUs. We convert pose and shape to SMPL-X for evaluation.
  • We compare SHAPY’s pose estimation accuracy with prior methods such as HMR and STRAPS. SHAPY does not outperform recent pose estimation methods, e.g. HybrIK [39]. SHAPY’s pose estimation accuracy on 3DPW can be improved by (1) adding data from the 3DPW training set (similar to Sengupta et al. [59], who sample poses from the 3DPW training set) and (2) creating pseudo ground-truth fits for the model data.
  • D.4. QUALITATIVE RESULTS: We show additional qualitative results in Fig. 13 and Fig. 15.
  • Table 9. Leave-one-out evaluation on the HBW test set.
  • Table 10. S2A evaluation. We report mean, standard deviation and percentage of correctly predicted classes per attribute on the CMTS test set.

Abstract

The present invention provides a method for training a machine learning model for estimating shapes of objects based on sensor data, the method comprising: - obtaining a training dataset comprising training sensor data and a corresponding ground truth attribute, - estimating, by the machine learning model, a shape for the training sensor data, - determining an attribute corresponding to the estimated shape, and - optimizing the machine learning model using a loss function that is based on a difference of the determined attribute compared to the ground truth attribute.

Description

ACCURATE 3D BODY SHAPE REGRESSION USING METRIC AND/OR SEMANTIC ATTRIBUTES

CROSS-REFERENCE AND PRIORITY CLAIM
This application claims priority to European Application EP 22172168.1, filed on May 6, 2022, which is hereby incorporated by reference.

TECHNICAL FIELD
The present invention relates to a method for training a machine learning model for estimating shapes of objects based on sensor data.

BACKGROUND
The field of 3D human pose and shape (HPS) estimation is progressing rapidly, and methods now regress accurate 3D pose from a single image [7, 29, 31, 34–37, 49, 72, 74]. Unfortunately, less attention has been paid to body shape, and many methods produce body shapes that clearly do not represent the person in the image (Fig. 1, top right). There are several reasons behind this. Current evaluation datasets focus on pose and not shape. Training datasets of images with 3D ground-truth shape are lacking. Additionally, humans appear in images wearing clothing that obscures the body, making the problem challenging. Finally, the fundamental scale ambiguity in 2D images makes 3D shape difficult to estimate. For many applications, however, realistic body shape is critical. These include AR/VR, apparel design, virtual try-on, and fitness. Thus, it is important to represent and estimate all possible 3D body shapes.

SUMMARY
It is an object of the present invention to overcome one or more of the problems of the prior art. In this regard, the present invention is defined by the appended independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

A first aspect of the present invention provides a method for training a machine learning model for estimating shapes of objects based on sensor data, the method comprising: - obtaining a training dataset comprising training sensor data and a corresponding ground truth attribute, - estimating, by the machine learning model, a shape for the training sensor data, - determining an attribute corresponding to the estimated shape, and - optimizing the machine learning model using a loss function that is based on a difference of the determined attribute compared to the ground truth attribute.

Preferably, the sensor data comprises an image. Optionally, the object comprises a human. In an implementation, the machine learning model comprises a neural network. Preferably, the attribute comprises a metric attribute, in particular a measurement, preferably a circumference and/or a height of the object. Preferably, the attribute comprises a semantic attribute, and preferably the determining of the attribute corresponding to the estimated shape comprises using a polynomial regression model, preferably a second-degree polynomial regression model. Preferably, the attribute is a human-annotated attribute, and the method preferably comprises a further step of obtaining a plurality of human-annotated attributes. Further preferably, the estimated shape comprises a parametric representation of the shape, wherein in particular the parametric representation comprises SMPL-X shape coefficients. Preferably, the parametric representation comprises a higher number of parameters than the number of attribute values of the attribute. Preferably, the shape only comprises pose-independent information, or the shape also comprises pose information.
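By way of a non-limiting illustration of the first aspect, the following Python sketch shows one possible form of a single training iteration. The module names (`model`, `shape_to_attribute`) and the mean-squared-error choice are assumptions made for illustration only; the claims do not prescribe a particular architecture or loss.

```python
import torch
import torch.nn.functional as F

def training_step(model, shape_to_attribute, sensor_data, gt_attribute, optimizer):
    """One iteration of the training method of the first aspect (sketch).

    model:              maps sensor data (e.g., images) to an estimated shape,
                        e.g., SMPL-X shape coefficients   (hypothetical module)
    shape_to_attribute: differentiable mapping from the estimated shape to the
                        attribute, e.g., a metric measurement or semantic
                        attribute scores                  (hypothetical module)
    """
    estimated_shape = model(sensor_data)                   # estimate shape
    determined_attr = shape_to_attribute(estimated_shape)  # attribute of that shape
    # loss based on the difference to the ground-truth attribute (L2 assumed)
    loss = F.mse_loss(determined_attr, gt_attribute)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                       # optimize the model
    return loss.item()
```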
A second aspect of the present invention provides a method for training a machine learning model to estimate shapes of objects based on sensor data, the method comprising: - obtaining a training dataset comprising training sensor data and a corresponding ground truth attribute, - estimating, by the machine learning model, a shape for the training sensor data, - determining a shape for a ground truth attribute corresponding to the training sensor data, and - optimizing the machine learning model using a loss function that is based on a difference between the shape estimated by the machine learning model and the shape determined for the ground truth attribute.

It is understood that the methods of the first and second aspects can be carried out by a computer; in particular, all steps can be carried out by a computer. Preferably, the sensor data comprises an image. Optionally, the object comprises a human. In a possible implementation, the machine learning model comprises a neural network. Preferably, the attribute comprises a metric attribute, in particular a measurement, preferably a circumference and/or a height of the object. Preferably, the attribute comprises a semantic attribute, and preferably the determining of the attribute corresponding to the estimated shape comprises using a polynomial regression model, preferably a second-degree polynomial regression model. Preferably, the attribute is a human-annotated attribute, and the method preferably comprises a further step of obtaining a plurality of human-annotated attributes. Preferably, the estimated shape comprises a parametric representation of the shape, in particular a parametric representation comprising SMPL-X shape coefficients. Preferably, the parametric representation comprises a higher number of parameters than the number of attribute values of the attribute. Preferably, the shape only comprises pose-independent information, or the shape also comprises pose information.

A further aspect of the present invention provides a method for estimating shapes of objects based on sensor data, wherein the method is based on a machine learning model that has been trained using the method of one of the previous aspects. A further aspect of the present invention provides a training device for training a machine learning model to estimate shapes of objects based on sensor data, wherein the training device is configured to carry out a method according to the first or second aspect. A further aspect of the present invention provides a machine learning model for estimating shapes of objects based on sensor data, wherein the machine learning model has been trained with a method according to the first or second aspect. A further aspect of the present invention provides a computer-readable storage medium storing program code, the program code comprising instructions that, when executed by a processor, carry out the method of the first or second aspect.

BRIEF DESCRIPTION OF THE DRAWINGS
To illustrate the technical features of embodiments of the present disclosure more clearly, the accompanying drawings provided for describing the embodiments are introduced briefly in the following. The accompanying drawings in the following description are merely some embodiments of the present disclosure; modifications of these embodiments are possible without departing from the scope of the present disclosure as defined in the claims.

FIG. 1 shows that existing work on 3D human reconstruction from a color image focuses mainly on pose.
We present SHAPY, a model that focuses on body shape and learns to predict dense 3D shape from a color image, using crowd-sourced linguistic shape attributes. Even with this weak supervision, SHAPY outperforms the state of the art (SOTA) [59] on in-the-wild images with varied clothing.

FIG. 2 shows model-agency websites that contain multiple images of models together with anthropometric measurements. A wide range of body shapes is represented; example from pexels.com.

FIG. 3 illustrates crowd-sourced scores for linguistic body-shape attributes [64] and computed anthropometric measurements for CAESAR [54] body meshes. We also crowd-source linguistic shape attribute scores for model images, like those in Fig. 2.

FIG. 4 illustrates shape representations and data collection.

FIG. 5 shows a histogram of height and chest/waist/hip circumference for data from model-agency websites (Sec. 3.2) and CAESAR. Model-agency data is diverse, yet not as much as CAESAR data.

FIG. 6 illustrates that SHAPY first estimates shape, β, and pose, θ. Shape is used by: (1) our virtual anthropometric measurement (VM) module to compute height and circumferences, and (2) our S2A module to infer linguistic attribute scores. There are several SHAPY variations, e.g., SHAPY-H uses only VM to infer height, while SHAPY-HA uses VM to infer measurements and S2A to infer attribute scores.

FIG. 7 shows “Human Bodies in the Wild” (HBW) color images, taken in the lab and in the wild, and the SMPL-X ground-truth shape.

FIG. 8 shows qualitative results from HBW. From left to right: RGB, ground-truth shape, SHAPY and Sengupta et al. [59]. For example, in the upper- and lower-right images, SHAPY is less affected by pose variation and loose clothing.

FIG. 9 shows the layout of the Amazon Mechanical Turk task for a male subject. Left: the 3D body mesh in A-pose. Right: the attributes and rating buttons.

FIG. 10 shows automatic anatomical measurements on a 3D mesh. The red points lie on the intersection of planes at chest/waist/hip height with the mesh, while their convex hull is shown with black lines.

FIG. 11 shows the 20K body mesh surface points (in black) used to evaluate body shape estimation accuracy.

FIG. 12 shows qualitative results of SHAPY predictions for female bodies.

FIG. 13 shows further qualitative results of SHAPY predictions for female bodies.

FIG. 14 shows qualitative results of SHAPY predictions for male bodies.

FIG. 15 shows further qualitative results of SHAPY predictions for male bodies.

FIG. 16 shows several failure cases. In the first example (upper left) the weight is underestimated. Other failure cases of SHAPY are muscular bodies (upper right) and body shapes with high BMI (second row).

DETAILED DESCRIPTION
While methods that regress 3D human meshes from images have progressed rapidly, the estimated body shapes often do not capture the true human shape. This is problematic since, for many applications, accurate body shape is as important as pose. The key reason that body shape accuracy lags pose accuracy is the lack of data. While humans can label 2D joints, and these constrain 3D pose, it is not so easy to “label” 3D body shape. Since paired data with images and 3D body shape are rare, we exploit two sources of information: (1) we collect internet images of diverse “fashion” models together with a small set of anthropometric measurements; (2) we collect linguistic shape attributes for a wide range of 3D body meshes and the model images. Taken together, these datasets provide sufficient constraints to infer dense 3D shape.
We exploit the anthropometric measurements and linguistic shape attributes in several novel ways to train a neural network, called SHAPY, that regresses 3D human pose and shape from an RGB image. Herein, SHAPY refers to a preferred implementation of the present invention. We evaluate SHAPY on public benchmarks but note that they either lack significant body shape variation, ground-truth shape, or clothing variation. Thus, we collect a new dataset for evaluating 3D human shape estimation, called HBW, containing photos of “Human Bodies in the Wild” for which we have ground-truth 3D body scans. On this new benchmark, SHAPY significantly outperforms state-of-the-art methods on the task of 3D body shape estimation. This is the first demonstration that 3D body shape regression from images can be trained from easy-to-obtain anthropometric measurements and linguistic shape attributes.

1. INTRODUCTION
The field of 3D human pose and shape (HPS) estimation is progressing rapidly, and methods now regress accurate 3D pose from a single image [7, 29, 31, 34–37, 49, 72, 74]. Unfortunately, less attention has been paid to body shape, and many methods produce body shapes that clearly do not represent the person in the image (Fig. 1, top right). There are several reasons behind this. Current evaluation datasets focus on pose and not shape. Training datasets of images with 3D ground-truth shape are lacking. Additionally, humans appear in images wearing clothing that obscures the body, making the problem challenging. Finally, the fundamental scale ambiguity in 2D images makes 3D shape difficult to estimate. For many applications, however, realistic body shape is critical. These include AR/VR, apparel design, virtual try-on, and fitness. To democratize avatars, it is important to represent and estimate all possible 3D body shapes; we make a step in that direction. Note that commercial solutions to this problem require users to wear tight-fitting clothing and capture multiple images or a video sequence using constrained poses. In contrast, we tackle the unconstrained problem of 3D body shape estimation in the wild from a single RGB image of a person in an arbitrary pose and standard clothing.

Most current approaches to HPS estimation learn to regress a parametric 3D body model like SMPL [42] from images using 2D joint locations as training data. Such joint locations are easy for human annotators to label in images. Supervising the training with joints, however, is not sufficient to learn shape, since an infinite number of body shapes can share the same joints. For example, consider someone who puts on weight. Their body shape changes but their joints stay the same. Several recent methods employ additional 2D cues, such as the silhouette, to provide additional shape cues [58,59]. Silhouettes, however, are influenced by clothing and do not provide explicit 3D supervision. Synthetic approaches [40], on the other hand, drape SMPL 3D bodies in virtual clothing and render them in images. While this provides ground-truth 3D shape, realistic synthesis of clothed humans is challenging, resulting in a domain gap. To address these issues, we present SHAPY, a new deep neural network that accurately regresses 3D body shape and pose from a single RGB image. To train SHAPY, in a preferred embodiment, we first need to address the lack of paired training data with real images and ground-truth shape. Without access to such data, we need alternatives that are easier to acquire, analogous to 2D joints used in pose estimation.
To do so, we introduce two novel datasets and corresponding training methods. First, in lieu of full 3D body scans, we preferably use images of people with diverse body shapes for which we have anthropometric measurements such as height as well as chest, waist, and hip circumference. Note that we also refer to such measurements as “metric attributes.” While many 3D human shapes can share the same measurements, they do constrain the space of possible shapes. Additionally, these are important measurements for applications in clothing and health. Accurate anthropometric measurements like these are difficult for individuals to take themselves, but they are often captured for different applications. Specifically, modeling agencies provide such information about their models; accuracy is a requirement for modeling clothing. Thus, we collect a diverse set of such model images (with varied ethnicity, clothing, and body shape) with associated measurements; see Fig. 2.

Since sparse anthropometric measurements do not fully constrain body shape, we exploit a novel approach and also use linguistic shape attributes. Note that we also refer to these as “semantic attributes.” Prior work has shown that people can rate images of others according to shape attributes such as “short/tall”, “long legs” or “pear shaped” [64]; see Fig. 3. Using the average scores from several raters, Streuber et al. [64] (BodyTalk) regress metrically accurate 3D body shape. This approach gives us a way to easily label images of people and use these labels to constrain 3D shape. To our knowledge, this sort of linguistic shape attribute data has not previously been exploited to train a neural network to infer 3D body shape from images.

We exploit these new datasets to train SHAPY, preferably with up to three novel losses, which can be exploited by any 3D human body reconstruction method: (1) We define functions of the SMPL body mesh that return a sparse set of anthropometric measurements. When measurements are available for an image, we use a loss that penalizes mesh measurements that differ from the ground truth (GT). (2) We learn a “Shape to Attribute” (S2A) function that maps 3D bodies to linguistic attribute scores. During training, we map meshes to attribute scores and penalize differences from the GT scores. (3) We similarly learn a function that maps “Attributes to Shape” (A2S). We then penalize body shape parameters that deviate from the prediction. We study each term in detail to arrive at a preferred embodiment of the final method.

Evaluation is challenging because existing benchmarks with GT shape either contain too few subjects [68] or have limited clothing complexity and only pseudo-GT shape [58]. We fill this gap with a new dataset, named “Human Bodies in the Wild” (HBW), that contains a ground-truth 3D body scan and several in-the-wild photos of 35 subjects, for a total of 2,543 photos. Evaluation on this shows that SHAPY estimates much more accurate 3D shape.

2. RELATED WORK
3D human pose and shape (HPS): Methods that reconstruct 3D human bodies from one or more RGB images can be split into two broad categories: (1) parametric methods that predict parameters of a statistical 3D body model, such as SCAPE [3], SMPL [42], SMPL-X [49], Adam [29], GHUM [72], and (2) non-parametric methods that predict a free-form representation of the human body [26, 57, 66, 71]. Parametric approaches lack details w.r.t. non-parametric ones, e.g., clothing or hair.
However, parametric models disentangle the effects of identity and pose on the overall shape. Therefore, their parameters provide control for re-shaping and re-posing. Moreover, pose can be factored out to bring meshes into a canonical pose; this is important for evaluating estimates of an individual’s shape. Finally, since topology is fixed, meshes can be compared easily. For these reasons, we use the SMPL-X body model.

Parametric methods follow two main paradigms and are based on optimization or regression. Optimization-based methods [5, 7, 18, 49] search for model configurations that best explain image evidence, usually 2D landmarks [8], subject to model priors that usually encourage parameters to be close to the mean of the model space. Numerous methods penalize the discrepancy between the projected and ground-truth silhouettes [24, 38] to estimate shape. However, this needs special care to handle clothing [4]; without this, erroneous solutions emerge that “inflate” body shape to explain the “clothed” silhouette. Regression-based methods [9, 16, 27, 30, 34, 37, 40, 45, 73] are currently based on deep neural networks that directly regress model parameters from image pixels. Their training sets are a mixture of data captured in laboratory settings [25, 63], with model parameters estimated from MoCap markers [44], and in-the-wild image collections, such as COCO [41], that contain 2D keypoint annotations. Optimization and regression can be combined, for example via in-the-network model fitting [37,45].

Estimating 3D body shape: State-of-the-art methods are effective for estimating 3D pose but struggle with estimating body shape under clothing. There are several reasons for this. First, 2D keypoints alone are not sufficient to fully constrain 3D body shape. Second, shape priors address the lack of constraints, but bias solutions towards “average” shapes [7,37,45,49]. Third, datasets with in-the-wild images have noisy 3D bodies, recovered by fitting a model to 2D keypoints [7, 49]. Fourth, datasets captured in laboratory settings have a small number of subjects, who do not represent the full spectrum of body shapes. Thus, there is a scarcity of images with known, accurate, 3D body shape. Existing methods deal with this in two ways. First, rendering synthetic images is attractive since it gives automatic and precise ground-truth annotation. This involves shaping, posing, dressing and texturing a 3D body model [22,58,60,67,69], then lighting it and rendering it in a scene. Doing this realistically and with natural clothing is expensive; hence, current datasets suffer from a domain gap. Alternative methods use artist-curated 3D scans [48,56,57], which are realistic but limited in variety. Second, 2D shape cues for in-the-wild images (body-part segmentation masks [14, 46, 55], silhouettes [1, 24, 50]) are attractive, as these can be manually annotated or automatically detected [17, 20]. However, fitting to such cues often gives unrealistic body shapes, by inflating the body to “explain” the clothing “baked” into silhouettes and masks.

A related work is Sengupta et al. [58–60], who estimate body shape using a probabilistic learning approach, trained on edge-filtered synthetic images. They evaluate on the SSP-3D dataset of real images with pseudo-GT 3D bodies, estimated by fitting SMPL to multiple video frames. SSP-3D is biased to people with tight-fitting clothing.
Their silhouette-based method works well on SSP-3D but does not generalize to people in normal clothing, tending to over-estimate body shape; see Fig. 1. In contrast to previous work, a preferred embodiment is trained with in-the-wild images paired with linguistic shape attributes, which are annotations that can be easily crowd-sourced for weak shape supervision. We also go beyond SSP-3D to provide HBW, a new dataset with in-the-wild images, varied clothing, and precise GT from 3D scans.

Shape, measurements and attributes: Body shapes can be generated from anthropometric measurements [2, 61, 62]. Tsoli et al. [65] register a body model to multiple high-resolution body scans to extract body measurements. The “Virtual Caliper” [52] allows users to build metrically accurate avatars of themselves using measurements or VR game controllers. ViBE [23] collects images, measurements (bust, waist, hip circumference, height) and the dress size of models from clothing websites to train a clothing recommendation network. We draw inspiration from these approaches for data collection and supervision. Streuber et al. [64] learn BodyTalk, a model that generates 3D body shapes from linguistic attributes. For this, they select attributes that describe human shape and ask annotators to rate how much each attribute applies to a body. They fit a linear model that maps attribute ratings to SMPL shape parameters. Inspired by this, we preferably collect attribute ratings for CAESAR meshes [54] and in-the-wild data as proxy shape supervision to train an HPS regressor. Unlike BodyTalk, SHAPY automatically infers shape from images.

Anthropometry from images: Single-view metrology [10] estimates the height of a person in an image, using horizontal and vertical vanishing points and the height of a reference object. Günel et al. [19] introduce the IMDB23K dataset by gathering publicly available celebrity images and their height information. Zhu et al. [75] use this dataset to learn to predict the height of people in images. Dey et al. [13] estimate the height of users in a photo collection by computing height differences between people in an image, creating a graph that links people across photos, and solving a maximum likelihood estimation problem. Bieler et al. [6] use gravity as a prior to convert pixel measurements extracted from a video to metric height. These methods do not address body shape.

3. REPRESENTATIONS & DATA FOR BODY SHAPE
We preferably use linguistic shape attributes and anthropometric measurements as a connecting component between in-the-wild images and ground-truth body shapes; see Fig. 4. To that end, we annotate linguistic shape attributes for 3D meshes and in-the-wild images, the latter from fashion-model agencies, labeled via Amazon Mechanical Turk. Fig. 4 illustrates shape representations and data collection. Our goal is 3D body shape estimation from in-the-wild images. Collecting data for direct supervision is difficult and does not scale. We explore two alternatives. Linguistic Shape Attributes: We annotate attributes (“A”) for CAESAR meshes, for which we have accurate shape (“S”) parameters, and learn the “A2S” and “S2A” models to map between these representations. Attribute annotations for images can be easily crowd-sourced, making these scalable. Anthropometric Measurements: We collect images with sparse body measurements from model-agency websites. A virtual measurement module [52] computes the measurements from 3D meshes.
Training: We preferably combine these sources to learn a regressor with weak supervision that infers 3D shape from an image.

3.1. SMPL-X BODY MODEL
We preferably use SMPL-X [49], a differentiable model that maps shape, β, pose, θ, and expression, ψ, parameters to a 3D mesh, M, with N = 10,475 vertices, V. The shape vector β ∈ ℝ^B (B ≤ 300) has coefficients of a low-dimensional PCA space. The vertices are posed with linear blend skinning with a learned rigged skeleton, J ∈ ℝ^{55×3}. In other embodiments, other models can be used, e.g. with different parameters or other properties.

3.2. MODEL-AGENCY IMAGES
Model agencies typically provide multiple color images of each model, in various poses, outfits, hairstyles, scenes, and with varying camera framing, together with anthropometric measurements and clothing size. We preferably collect training data from multiple model-agency websites, focusing on under-represented body types, namely: curve-models.com, cocainemodels.com, nemesismodels.com, jayjay-models.de, kultmodels.com, modelwerk.de, models1.co.uk, showcast.de, the-models.de, and ullamodels.com. In addition to photos, we store gender and four anthropometric measurements, i.e. height, chest, waist and hip circumference, when available. To avoid having the same subject in both the training and test set, we preferably match model identities across websites to identify models that work for several agencies. For details, see Sup. Mat. After identity filtering, for the preferred embodiment, we have 94,620 images of 4,419 models along with their anthropometric measurements. However, the distributions of these measurements, shown in Fig. 5, reveal a bias for “fashion model” body shapes, while other body types are under-represented in comparison to CAESAR [54]. To enhance diversity in body shapes and avoid strong biases and long tails, we compute the quantized 2D distribution for height and weight and sample up to 3 models per bin. This results in N = 1,185 models (714 females, 471 males) and 20,635 images.

3.3. LINGUISTIC SHAPE ATTRIBUTES
Human body shape can be described by linguistic shape attributes [21]. We draw inspiration from Streuber et al. [64], who collect scores for 30 linguistic attributes for 256 3D body meshes, generated by sampling SMPL’s shape space, to train a linear “attribute to shape” regressor. In contrast, we train a model that takes as input an image, instead of attributes, and outputs an accurate 3D shape (and pose).
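A sketch of the height/weight subsampling described in Sec. 3.2 above; the number of bins is a hypothetical choice, as the text only specifies the cap of 3 models per bin.

```python
import numpy as np
from collections import defaultdict

def subsample_per_bin(heights, weights, max_per_bin=3, n_bins=20, seed=0):
    """Quantize the 2D height/weight distribution and keep <= 3 models per bin."""
    rng = np.random.default_rng(seed)
    h_bins = np.digitize(heights, np.linspace(min(heights), max(heights), n_bins))
    w_bins = np.digitize(weights, np.linspace(min(weights), max(weights), n_bins))
    buckets = defaultdict(list)
    for i, key in enumerate(zip(h_bins, w_bins)):
        buckets[key].append(i)                     # group model indices per 2D bin
    keep = []
    for idx in buckets.values():
        keep.extend(rng.permutation(idx)[:max_per_bin].tolist())
    return sorted(keep)                            # indices of retained models
```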
Table 1. Linguistic shape attributes for human bodies. Some attributes apply to both genders, but others are gender-specific.

We crowd-source linguistic attribute scores for a variety of body shapes, using images from the following sources. Rendered CAESAR images: We use CAESAR [54] bodies to learn mappings between linguistic shape attributes, anthropometric measurements, and SMPL-X shape parameters, β. Specifically, we register a “gendered” SMPL-X model with 100 shape components to 1,700 male and 2,102 female 3D scans, pose all meshes in an A-pose, and render synthetic images with the same virtual camera. Model-agency photos: Each annotator is shown 3 body images per subject, sampled from the image pool of Sec. 3.2. Annotation: To keep annotation tractable, we use A = 15 linguistic shape attributes per gender (a subset of BodyTalk’s [64] attributes); see Tab. 1. Each image is annotated by K = 15 annotators on Amazon Mechanical Turk. Their task is to “indicate how strongly [they] agree or disagree that the [listed] words describe the shape of the [depicted] person’s body”; for an example, see Sup. Mat. Annotations range on a discrete 5-level Likert scale from 1 (strongly disagree) to 5 (strongly agree). We get a rating matrix A ∈ {1, 2, 3, 4, 5}^{N×A×K}, where N is the number of subjects. In the following, a_{nak} denotes an element of A.

4. MAPPING SHAPE REPRESENTATIONS
In Sec. 3 we introduce three body-shape representations: (1) SMPL-X’s PCA shape space (Sec. 3.1), (2) anthropometric measurements (Sec. 3.2), and (3) linguistic shape attribute scores (Sec. 3.3). Here we learn mappings between these, so that in Sec. 5 we can define new losses for training body shape regressors using multiple data sources.

4.1. VIRTUAL MEASUREMENTS (VM)
We obtain anthropometric measurements from a 3D body mesh in a T-pose, namely height h(β), weight w(β), and chest, waist and hip circumferences c_c(β), c_w(β), and c_h(β), respectively, by following Wuhrer et al. [70] and the “Virtual Caliper” [52]. For details on how we compute these measurements, see Sup. Mat.

4.2. ATTRIBUTES AND 3D SHAPE
Attributes to Shape (A2S): We predict SMPL-X shape coefficients from linguistic attribute scores with a second-degree polynomial regression model. For each shape β_i, i = 1...N, we create a feature vector by averaging, for each of the A attributes, the corresponding K scores:
x_i = [x_{i1}, ..., x_{iA}]ᵀ, with x_{ia} = (1/K) Σ_{k=1}^{K} a_{iak},

where i is the shape index (over the list of “fashion” or CAESAR bodies), a is the attribute index, and k the annotation index. A feature map φ(·) maps x_i to second-order polynomial features, so that the predicted shape is β̂_i = W φ(x_i). The target matrix Y = [β_1, ..., β_N] contains the shape parameters. We compute the polynomial model’s coefficients W via least-squares fitting:

W = argmin_W Σ_{i=1}^{N} ‖β_i − W φ(x_i)‖²₂.

Empirically, the polynomial model performs better than several models that we evaluated; for details, see Sup. Mat.

Shape to Attributes (S2A): We predict linguistic attribute scores, A, from SMPL-X shape parameters, β. Again, we fit a second-degree polynomial regression model. S2A has “swapped” inputs and outputs w.r.t. A2S:

V = argmin_V Σ_{i=1}^{N} ‖x_i − V φ(β_i)‖²₂.
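Both A2S and S2A are second-degree polynomial least-squares regressions and can be sketched with scikit-learn as follows; the function names and array layouts are illustrative only.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def fit_a2s(attribute_scores, shape_params):
    """A2S: rater-averaged attribute scores (N, A) -> SMPL-X betas (N, B)."""
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(attribute_scores, shape_params)   # least-squares fit of W
    return model

def fit_s2a(shape_params, attribute_scores):
    """S2A: swapped inputs/outputs w.r.t. A2S."""
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(shape_params, attribute_scores)   # least-squares fit of V
    return model

# Usage (hypothetical arrays): betas_hat = fit_a2s(X_train, Y_train).predict(x_new)
```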
Attributes & Measurements to Shape (AHWC2S): Given a sparse set of anthropometric measurements, we predict SMPL-X shape parameters, β. The input vector is:

x_i^{HWC2S} = [h, w, c_c, c_w, c_h]ᵀ,

where c_c, c_w and c_h are the chest, waist, and hip circumference, respectively, h and w are the height and weight, and HWC2S means Height + Weight + Circumference to Shape. The regression target is the SMPL-X shape parameters, β_i. When both attributes and measurements are available, we combine them for the AHWC2S model with input:

x_i^{AHWC2S} = [x_iᵀ, h, w, c_c, c_w, c_h]ᵀ.
In practice, depending on which measurements are available, we train and use different regressors. Following the naming convention of AHWC2S, these models are: AH2S, AHW2S, AC2S, and AHC2S, as well as their equivalents without attribute input: H2S, HW2S, C2S, and HC2S. For an evaluation of the contribution of linguistic shape attributes on top of each anthropometric measurement, see Sup. Mat.

Training Data: To train the A2S and S2A mappings we use CAESAR data, for which we have SMPL-X shape parameters, anthropometric measurements, and linguistic attribute scores. We train separate gender-specific models.

5. 3D SHAPE REGRESSION FROM AN IMAGE
We present SHAPY, a network that predicts SMPL-X parameters from an RGB image with more accurate body shape than existing methods. To improve the realism and accuracy of shape, we explore training losses based on all shape representations discussed above, i.e., SMPL-X meshes (Sec. 3.1), linguistic attribute scores (Sec. 3.3) and anthropometric measurements (Sec. 4.1). In the following, symbols with/without a hat denote regressed/ground-truth values. We convert the regressed shape β̂ to height and circumference values, ĥ = h(β̂) and ĉ_m = c_m(β̂) for m ∈ {c, w, h}, by applying our virtual measurement tool (Sec. 4.1) to the mesh M(β̂) in the canonical T-pose. We also convert the shape β̂ to linguistic attribute scores, with â = S2A(β̂). We train various SHAPY versions with the following “SHAPY losses”, using either linguistic shape attributes, or anthropometric measurements, or both:

L_height = ‖ĥ − h‖²₂,   (8)
L_circ = Σ_{m ∈ {c,w,h}} ‖ĉ_m − c_m‖²₂,   (9)
L_attr = ‖â − a‖²₂.   (10)
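A PyTorch-style sketch of these losses, assuming differentiable `vm` (virtual measurement) and `s2a` modules as in Fig. 6; the dictionary interface is an illustrative assumption, and only the terms with available ground truth are active.

```python
def shapy_losses(beta_hat, gt, vm, s2a):
    """Sketch of the attribute/measurement losses of Eqs. (8) to (10).

    beta_hat: regressed SMPL-X shape; gt: dict of available ground truth;
    vm: differentiable virtual-measurement module (height/circumferences);
    s2a: differentiable shape-to-attributes regressor. All names hypothetical.
    """
    losses = {}
    if 'height' in gt:                                             # Eq. (8)
        losses['height'] = (vm.height(beta_hat) - gt['height']).pow(2).mean()
    if 'circumferences' in gt:                                     # Eq. (9): chest, waist, hips
        losses['circ'] = (vm.circumferences(beta_hat)
                          - gt['circumferences']).pow(2).mean()
    if 'attributes' in gt:                                         # Eq. (10)
        losses['attr'] = (s2a(beta_hat) - gt['attributes']).pow(2).mean()
    return sum(losses.values()), losses
```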
These are optionally added to a base loss, L_base, defined below in “Training Details”. The architecture of SHAPY, with all optional components, is shown in Fig. 6. A suffix of color-coded letters describes which of the above losses are used when training a model. For example, SHAPY-AH denotes a model trained with the attribute and height losses, i.e., L_SHAPY-AH = L_base + L_attr + L_height.

Training Details: We initialize SHAPY with the ExPose [9] network weights and use curated fits [9], H3.6M [25], the SPIN [37] training data, and our model-agency dataset (Sec. 3.2) for training. In each batch, 50% of the images are sampled from the model-agency images, for which we ensure a gender balance. The “SHAPY losses” of Eqs. (8) to (10) are applied only on the model-agency images. We use these on top of a standard base loss:
$$\mathcal{L}_{\text{base}} = \mathcal{L}_{J2D} + \mathcal{L}_{J3D} + \mathcal{L}_{\theta} + \mathcal{L}_{\beta} + \mathcal{L}_{\text{prior}},$$

where $\mathcal{L}_{J2D}$ and $\mathcal{L}_{J3D}$ are 2D and 3D joint losses:

$$\mathcal{L}_{J2D} = \| \hat{J}_{2D} - J_{2D} \|_2^2, \qquad \mathcal{L}_{J3D} = \| \hat{J}_{3D} - J_{3D} \|_2^2.$$

$\mathcal{L}_{\theta}$ and $\mathcal{L}_{\beta}$ are losses on pose and shape parameters, and $\mathcal{L}_{\text{prior}}$
is PIXIE's [15] "gendered" shape prior. All losses are L2, unless otherwise explicitly specified. Losses on SMPL-X parameters are applied only on the pose data [9, 25, 37]. For more implementation details, see Sup. Mat.

6. ADDITIONAL EMBODIMENTS

An alternative embodiment adds an additional neural network (or a head on the existing network) that predicts attributes, metric (I2HWC) and/or semantic (I2A), from the image; see Sec. B.3 of the Sup. Mat. for an instantiation of I2A. This regressor is directly supervised using the image attribute labels. From these estimated attributes we apply the A2S regressor to predict body shape. We then add a loss that penalizes the difference between this body shape and the one regressed directly from the image.

While we have described the method as taking a single image as input, it should be obvious that we could take in any other form of sensor measurement, such as depth or range images. Additionally, the method could take multiple images, e.g. from a photo collection supplied by the user, or from a video sequence. The goal is to estimate a single body that is consistent with all the images. The simplest embodiment of this is to estimate the body using our approach in each image separately and then take the mean or median of the body shape parameters.

Finally, while neural networks are the de facto tool for 3D body pose and shape estimation, all of the above-described terms can also be easily added to iterative optimization methods, such as SMPLify-X [49] or SMPLify-XMC [45], or regression-optimization hybrids [28]. In such an optimization formulation, we take image measurements like 2D keypoints, foreground silhouettes, part segmentations, etc., and fit the body to them. Here, however, we can take attributes that are either provided by a user or estimated using a neural network. We then estimate a 3D body shape and pose that matches both the 2D evidence and the attributes. Such an approach makes it easy to include multiple images. Specifically, we estimate a single body shape that best agrees with the evidence in all images simultaneously.

Herein, we focus on semantic attributes that are directly related to body shape, for example, words like curvy or tall. But recent work has shown that many other words are related to how humans perceive bodies, even though the words are not obviously related to shape. For example, many people have a mental image of a "powerful southern republican man" and this may differ from their image of a "compassionate northern democrat man". While these are just stereotypes, if they are commonly shared, then people will describe images consistently using such words. This enables the use of non-shape attributes to estimate body shape, as described in [53].

7. EXPERIMENTS

7.1. EVALUATION DATASETS

3D Poses in the Wild (3DPW) [68]: We use this to evaluate pose accuracy. This is widely used, but has only 5 test subjects, i.e., limited shape variation. For results, see Sup. Mat.

Sports Shape and Pose 3D (SSP-3D) [58]: We use this to evaluate 3D body shape accuracy from images. It has 62 tightly-clothed subjects in 311 in-the-wild images from Sports-1M [32], with pseudo ground-truth SMPL meshes that we convert to SMPL-X for evaluation.

Model Measurements Test Set (MMTS): We use this to evaluate anthropometric measurement accuracy, as a proxy for body shape accuracy.
To create MMTS, we withhold 2699/1514 images of 143/95 female/male identities from our model-agency data, described in Sec. 3.2.

CAESAR Meshes Test Set (CMTS): We use CAESAR to measure the accuracy of SMPL-X body shapes and linguistic shape attributes for the models of Sec. 4. Specifically, we compute: (1) errors for SMPL-X meshes estimated from linguistic shape attributes and/or anthropometric measurements by A2S and its variations, and (2) errors for linguistic shape attributes estimated from SMPL-X meshes by S2A. To create an unseen mesh test set, we withhold 339 male and 410 female CAESAR meshes from the crowd-sourced CAESAR linguistic shape attributes, described in Sec. 3.3.

Human Bodies in the Wild (HBW): The field is missing a dataset with varied bodies, varied clothing, in-the-wild images, and accurate 3D shape ground truth. We fill this gap by collecting a novel dataset, called "Human Bodies in the Wild" (HBW), in three steps: (1) We collect accurate 3D body scans for 35 subjects (20 female, 15 male), and register a "gendered" SMPL-X model to these to recover 3D SMPL-X ground-truth bodies [51]. (2) We take photos of each subject in "photo-lab" settings, i.e., in front of a white background with controlled lighting, and in various everyday outfits and "fashion" poses. (3) Subjects upload full-body photos of themselves taken in the wild. For each subject we take up to 111 photos in lab settings, and collect up to 126 in-the-wild photos. In total, HBW has 2,543 photos, 1,318 in the lab setting and 1,225 in the wild. We split the data into a validation and a test set (val/test) with 10/25 subjects (6/14 female, 4/11 male) and 781/1,762 images (432/983 female, 349/779 male), respectively. Figure 7 shows a few HBW subjects, photos and their SMPL-X ground-truth shapes. All subjects gave prior written informed consent to participate in this study and to release the data. The study was reviewed by the ethics board of the University of Tübingen, without objections.
Table 2. Results of A2S variants on CMTS for male subjects, using the male SMPL-X model. For females, see Sup. Mat.

7.2. EVALUATION METRICS

We use standard accuracy metrics for 3D body pose, but also introduce metrics specific to 3D body shape.

Anthropometric Measurements: We report the mean absolute error in mm between ground-truth and estimated measurements, computed as described in Sec. 4.1. When weight is available, we report the mean absolute error in kg.

MPJPE and V2V metrics: We report in Sup. Mat. the mean per-joint point error (MPJPE) and mean vertex-to-vertex error (V2V), when SMPL-X meshes are available. The prefix "PA" denotes metrics after Procrustes alignment.

Mean point-to-point error (P2P20K): SMPL-X has a highly non-uniform vertex distribution across the body, which negatively biases the mean vertex-to-vertex (V2V) error when comparing estimated and ground-truth SMPL-X meshes. To account for this, we evenly sample 20K points on SMPL-X's surface, and report the mean point-to-point (P2P20K) error. For details, see Sup. Mat.

7.3. SHAPE-REPRESENTATION MAPPINGS

We evaluate the models A2S and S2A, which map between the various body shape representations (Sec. 4).

A2S and its variations: How well can we infer 3D body shape from just linguistic shape attributes, anthropometric measurements, or both of these together? In Tab. 2, we report reconstruction and measurement errors using many combinations of attributes (A), height (H), weight (W), and circumferences (C). Evaluation on CMTS data shows that attributes improve the overall shape prediction across the board. For example, height + attributes (AH2S) has a lower point-to-point error than height alone. The best-performing model, AHWC2S, uses everything, with P2P20K errors of 5.8 ± 2.0 mm (males) and 6.2 ± 2.4 mm (females).

S2A: How well can we infer linguistic shape attributes from 3D shape? S2A's accuracy on inferring the attribute Likert score is 75%/69% for males/females; details in Sup. Mat.
Table 3. Evaluation on the HBW test set in mm. We compute the measurement and point-to-point (P2P20K) error between predicted and ground-truth SMPL-X meshes.

Table 4. Evaluation on MMTS. We report the mean absolute error between ground-truth and estimated measurements.

7.4. 3D SHAPE FROM AN IMAGE

We evaluate all of our model's variations (see Sec. 5) on the HBW validation set and find, perhaps surprisingly, that SHAPY-A outperforms the other variants. We refer to this below (and in Fig. 1) simply as "SHAPY" and report its performance in Tab. 3 for HBW, Tab. 4 for MMTS, and Tab. 5 for SSP-3D. For images with natural and varied clothing (HBW, MMTS), SHAPY significantly outperforms all other methods (Tabs. 3 and 4) using only weak 3D shape supervision (attributes). On these images, Sengupta et al.'s method [59] struggles with the natural clothing. In contrast, their method is more accurate than SHAPY on SSP-3D (Tab. 5), which has tight "sports" clothing, in terms of PVE-T-SC, a scale-normalized metric used on this dataset. These results show that silhouettes are good for tight/minimal clothing, and that SHAPY struggles with high-BMI shapes due to the lack of such shapes in our training data; see Fig. 5. Note that, as HBW has true ground-truth 3D shape, it does not need SSP-3D's scaling for evaluation. A key observation is that training with linguistic shape attributes alone is sufficient, i.e., without anthropometric measurements. Importantly, this opens up the possibility for significantly larger data collections. For a study of how different measurements or attributes impact accuracy, see Sup. Mat. Figure 8 shows SHAPY's qualitative results.
Table 5. Evaluation on the SSP-3D test set [58]. We report the scaled mean vertex-to-vertex error in T-pose [58], and mIoU.

8. CONCLUSION

SHAPY is trained to regress more accurate human body shape from images than previous methods, without explicit 3D shape supervision. To achieve this, we present two different ways to collect proxy annotations of 3D body shape for in-the-wild images. First, we collect sparse anthropometric measurements from online model-agency data. Second, we annotate images with linguistic shape attributes using crowd-sourcing. We learn mappings between body shape, measurements, and attributes, enabling us to supervise a regressor using any combination of these. To evaluate SHAPY, we introduce a new shape estimation benchmark, the "Human Bodies in the Wild" (HBW) dataset. HBW has images of people in natural clothing and natural settings, together with ground-truth 3D shape from a body scanner. HBW is more challenging than existing shape benchmarks like SSP-3D, and SHAPY significantly outperforms existing methods on this benchmark. We believe this work will open new directions, since the idea of leveraging linguistic annotations to improve 3D shape has many applications.

SUPPLEMENTAL MATERIAL

A. DATA COLLECTION

A.1. MODEL-AGENCY IDENTITY FILTERING

We collect internet data consisting of images and height/chest/waist/hips measurements from model-agency websites. A "fashion model" can work for many agencies and their pictures can appear on multiple websites. To create non-overlapping training, validation and test sets, we match model identities across websites. To that end, in a preferred embodiment, we use RetinaFace [12] for face detection and ArcFace [11] to compute identity embeddings $f_i \in \mathbb{R}^{512}$ for each image. For every pair of models $(q, t)$ with the same gender label, let $M, N$ be the number of query and target model images and $F_q \in \mathbb{R}^{M \times 512}$ and $F_t \in \mathbb{R}^{N \times 512}$ the query and target embedding feature matrices. We then compute the pairwise cosine similarity matrix $S \in \mathbb{R}^{M \times N}$ between all images in $F_q$ and $F_t$, and the aggregate and average similarity:
$$s_{\max} = \max_{m,n} S_{mn}, \qquad s_{\text{avg}} = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} S_{mn}.$$
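A minimal sketch of this pairwise identity test, assuming L2-normalized embeddings (so the matrix product gives cosine similarities); the thresholding described in the following sentences is folded in for completeness:

```python
# Sketch of the identity-matching test between two models' image sets
# (illustrative; embeddings are assumed to be L2-normalized ArcFace features).
import numpy as np

def same_identity(F_q, F_t, tau=0.3):
    """F_q: (M, 512) query embeddings, F_t: (N, 512) target embeddings."""
    S = F_q @ F_t.T                  # pairwise cosine similarity (normalized inputs)
    if S.max() <= tau:               # no image pair is similar enough: ignore pair
        return False
    return S.mean() > tau            # aggregate check on the average similarity
```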
Each pair whose similarity matrix $S$ has no element larger than the similarity threshold $\tau = 0.3$ is ignored, as it contains dissimilar models. Finally, we check if $s_{\text{avg}}$ is larger than $\tau$, and we keep a list of all pairs for which this holds true.

A.2. CROWD-SOURCED LINGUISTIC SHAPE-ATTRIBUTES

To collect human ratings of how much a word describes a body shape, we conduct a human intelligence task (HIT) on Amazon Mechanical Turk (AMT). In this task, we show an image of a person along with 15 different gender-specific attributes. We then ask participants to indicate how strongly they agree or disagree that the provided words describe the shape of this person's body. We arrange the rating buttons from strong disagreement to strong agreement with equal distances to create a 5-point Likert scale. The rating choices are "strongly disagree" (score 1), "rather disagree" (score 2), "average" (score 3), "rather agree" (score 4), "strongly agree" (score 5). We ask multiple persons to rate each body and image, to "average out" the subjectivity of individual ratings [64]. Additionally, we compute the Pearson correlation between averaged attribute ratings and ground-truth measurements. Examples of highly correlated pairs are "Big / Weight" and "Short / Height". The layout of our CAESAR annotation task is visualized in Fig. 9. To ensure good rating quality, we have several qualification requirements per participant: submitting a minimum of 5000 tasks on AMT and an AMT acceptance rate of 95%, as well as having a US residency and passing a language qualification test to ensure similar language skills and cultures across raters.

B. MAPPING SHAPE REPRESENTATIONS

B.1. SHAPE TO ANATOMICAL MEASUREMENTS (S2M)

An important part of our project is the computation of body measurements. Following the "Virtual Caliper" [52], we present a method to compute anatomical measurements from a 3D mesh in the canonical T-pose, i.e., after "undoing" the effect of pose. Specifically, we measure the height, $H(\beta)$, weight, $W(\beta)$, and the chest, waist and hip circumferences, $C_c(\beta)$, $C_w(\beta)$, and $C_h(\beta)$, respectively. Let $v_{\text{head}}(\beta)$, $v_{\text{heel}}(\beta)$, $v_{\text{chest}}(\beta)$, $v_{\text{waist}}(\beta)$, $v_{\text{hip}}(\beta)$ be the head, left heel, chest, waist and hip vertices. $H(\beta)$ is computed as the difference in the vertical-axis "Y" coordinates between the top of the head and the left heel:

$$H(\beta) = v_{\text{head}}(\beta)_Y - v_{\text{heel}}(\beta)_Y.$$

To obtain $W(\beta)$ we multiply the mesh volume by 985 kg/m³,
which is the average human body density. We compute circumference measurements using the method of Wuhrer et al. [70]. Here, $T \in \mathbb{R}^{F \times 3 \times 3}$, where F = 20,908 is the number of triangles in the SMPL-X mesh, denotes the "shaped" vertices of all triangles of the mesh $M(\beta, \theta)$; we drop expressions, $\psi$, which are not used in this work. Let us explain this using the chest circumference $C_c(\beta)$ as an example. We form a plane P with normal $n = (0, 1, 0)$ that crosses the point $v_{\text{chest}}(\beta)$. Then, let $I$ be the set of points of P that
intersect the body mesh (red points in Fig. 10). We store their barycentric coordinates $(a_i, b_i, c_i)$ and the corresponding body-triangle index $t_i$. Let H be the convex hull of $I$ (black lines in Fig. 10), and $\mathcal{E}$ the set of edge indices of H. $C_c(\beta)$ is equal to the length of the convex hull:
$$C_c(\beta) = \sum_{(i,j) \in \mathcal{E}} \| P_i - P_j \|_2,$$
where $i, j$ are point indices for line segments of $\mathcal{E}$. The process is the same for the waist and hips, but the intersection plane is computed using $v_{\text{waist}}$, $v_{\text{hip}}$. All of $H(\beta)$, $W(\beta)$, $C_c(\beta)$, $C_w(\beta)$, $C_h(\beta)$ are differentiable functions of the body shape parameters, $\beta$. Note that SMPL-X knows the height distribution of humans and acts as a strong prior in shape estimation. Given the ground-truth height of a person (in meters), $H(\beta)$ can be used to directly supervise height and overcome scale ambiguity.

B.2. MAPPING ATTRIBUTES TO SHAPE (A2S)

We introduce A2S, a model that maps the input attribute ratings to shape components $\beta$ as output. We compare a 2nd-degree polynomial model with a linear regression model and a multi-layer perceptron (MLP), using the vertex-to-vertex (V2V) error metric between predicted and ground-truth SMPL-X meshes, and report results in Tab. 6. When using only attributes as input (A2S), the polynomial model of degree $d = 2$ achieves the best performance. Adding height and weight to the input vector requires a small modification, namely using the cube root of the weight and converting the height from (m) to (cm). With these additions, the 2nd-degree polynomial again achieves the best performance.

Table 6. Comparison of models for A2S and AHW2S regression.

B.3. IMAGES TO ATTRIBUTES (I2A)

We briefly experimented with models that learn to predict attribute scores from images (I2A). This attribute predictor is implemented using a ResNet50 for feature extraction from the input images, followed by one MLP per gender for attribute score prediction. To quantify the model's performance, we use the attribute classification metric described in the main part above. I2A achieves 60.7 / 69.3% (female/male) correctly predicted attributes, while our S2A achieves 68.8 / 76% on CAESAR. Our explanation for this result is that it is hard for the I2A model to learn to correctly predict attributes independent of subject pose. Our approach works better because it decomposes 3D human estimation into predicting pose and shape. Networks are good at estimating pose even without GT shape [39]. The "SHAPY losses" affect only the shape branch. To minimize these losses, the network has to learn to correctly predict shape irrespective of pose variations.

C. SHAPY: 3D SHAPE REGRESSION FROM IMAGES

Implementation details: To train SHAPY, each batch of training images contains 50% images collected from model-agency websites and 50% images from ExPose's [9] training set. Note that the overall number of images of males and females in our collected model data differs significantly; there are many more images of female models. Therefore, we randomly sample a subset of female images so that, eventually, we get an equal number of male and female images. We also use the BMI of each subject, when available, as a sampling weight for images. In this way, subjects with higher BMI are selected more often, due to their smaller number, to avoid biasing the model towards the average BMI of the dataset. Our pipeline is implemented in PyTorch [47] and we use the Adam [33] optimizer with a learning rate of 1e−4. We tune the weights of each loss term with grid search on the MMTS and HBW validation sets. Using a batch size of 48, SHAPY achieves the best performance on the HBW validation set after 80k steps.

D. EXPERIMENTS

D.1. METRICS

P2P20K: SMPL-X has more than half of its vertices on the head. Consequently, computing an error based on vertices overemphasizes the importance of the head.
To remove this bias, we also report the mean distance between $P = 20K$ mesh surface points; see Fig. 11 for a visualization on the ground-truth and estimated meshes. For this, we uniformly sample the SMPL-X template mesh and compute a sparse matrix $H_{\text{SMPL-X}} \in \mathbb{R}^{P \times 10{,}475}$ that regresses the mesh surface points from SMPL-X vertices $V$, as $P = H_{\text{SMPL-X}} V$. To use this metric with a mesh of different topology, e.g. SMPL, we simply need to compute the corresponding $H_{\text{SMPL}}$. For this, we align the SMPL model to the SMPL-X template mesh. For each point sampled from the SMPL-X mesh surface, we find the closest point on the aligned SMPL mesh surface. To obtain the SMPL mesh surface points from SMPL vertices, we again compute a sparse matrix, $H_{\text{SMPL}} \in \mathbb{R}^{P \times 6{,}890}$. The distance between the SMPL-X and SMPL mesh surface points on the template meshes is 0.073 mm, which is negligible. Given two meshes $M_1$ and $M_2$ of topology $T_1$ and $T_2$, we obtain the mesh surface points $P_1 = H_{T_1} U_1$ and $P_2 = H_{T_2} U_2$, where $U_1$ and $U_2$ denote the vertices of the shaped, zero-posed (T-pose) meshes. To compute the P2P20K error we correct for translation and define
$$\text{P2P}_{20K}(M_1, M_2) = \frac{1}{P} \sum_{i=1}^{P} \left\| \left(P_{1,i} - \bar{P}_1\right) - \left(P_{2,i} - \bar{P}_2\right) \right\|_2,$$

where $\bar{P}_1$ and $\bar{P}_2$ are the respective point-set centroids.
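A minimal sketch of this metric, assuming the precomputed point-regression matrices and the T-posed vertices are given as arrays:

```python
# Sketch of the P2P20K metric (illustrative; H1, H2 are the precomputed
# point-regression matrices per mesh topology, U1, U2 the T-posed vertices).
import numpy as np

def p2p_20k(H1, U1, H2, U2):
    P1, P2 = H1 @ U1, H2 @ U2            # (20000, 3) surface points per mesh
    P1 = P1 - P1.mean(axis=0)            # correct for translation
    P2 = P2 - P2.mean(axis=0)
    return np.linalg.norm(P1 - P2, axis=1).mean()
```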
Table 7. Results of A2S and its variations on the CMTS test set, in mm or kg. Trained with gender-specific SMPL-X models.
Table 8. Leave-one-out evaluation on MMTS.

D.2. SHAPE ESTIMATION

A2S and its variations: For completeness, Table 7 shows the results of the female A2S models in addition to the male ones. The male results are also presented in the main part above. Note that attributes improve shape reconstruction across the board. For example, in terms of P2P20K, AH2S is better than just H2S, and AHW2S is better than just HW2S. It should be emphasized that even when many measurements are used as input features, i.e. height, weight, and chest/waist/hip circumference, adding attributes still improves the shape estimate, e.g. HWC2S vs. AHWC2S.

Attribute/Measurement ablation: To investigate the extent to which attributes can replace ground-truth measurements in network training, we train SHAPY's variations in a leave-one-out manner: SHAPY-H uses only height and SHAPY-C only hip/waist/chest circumference. We compare these models with SHAPY-AH and SHAPY-AC, which use attributes in addition to height and circumference measurements, respectively. For completeness, we also evaluate SHAPY-HC and SHAPY-AHC, which use all measurements; the latter also uses attributes. The results are reported in Tab. 8 (MMTS) and Tab. 9 (HBW). The tables show that attributes are an adequate replacement for measurements. For example, in Tab. 8, the height (SHAPY-C vs. SHAPY-AC) and circumference errors (SHAPY-H vs. SHAPY-AH) are reduced significantly when attributes are taken into account. On HBW, the P2P20K errors are equal or lower when attribute information is used; see Tab. 9. Surprisingly, adding attributes improves the height error in all three variations. This suggests that training on model images introduces a bias that A2S counteracts.

S2A: Table 10 shows the results of S2A in detail. All attributes are classified correctly with an accuracy of at least 58.05% (females) and 68.14% (males). The probability of randomly guessing the correct class is 20%.

AHWC2S noise: To evaluate AHWC2S's robustness to noise in the input, we fit AHWC2S using the per-rater scores instead of the average score. The P2P20K error increases by only 1.0 mm, to 6.8 mm, when using the per-rater scores.

D.3. POSE EVALUATION

3D Poses in the Wild (3DPW) [68]: This dataset is mainly useful for evaluating body pose accuracy, since it contains few subjects and limited body shape variation. The test set contains a limited set of 5 subjects in indoor/outdoor videos with everyday clothing. All subjects were scanned to obtain their ground-truth body shape. The body poses are pseudo ground-truth SMPL fits, recovered from images and IMUs. We convert pose and shape to SMPL-X for evaluation. We evaluate SHAPY on 3DPW to report pose estimation accuracy (Tab. 11). SHAPY's pose accuracy is slightly behind ExPose, which also uses SMPL-X. SHAPY's performance is better than HMR [30] and STRAPS [58]. However, SHAPY does not outperform recent pose estimation methods, e.g. HybrIK [39]. We assume that SHAPY's pose estimation accuracy on 3DPW can be improved by (1) adding data from the 3DPW training set (similar to Sengupta et al. [59], who sample poses from the 3DPW training set) and (2) creating pseudo ground-truth fits for the model data.

D.4. QUALITATIVE RESULTS

We show additional qualitative results in Fig. 13 and Fig. 15. Failure cases are shown in Fig. 16. To deal with high-BMI bodies, we need to expand the set of training images and add additional shape attributes that are descriptive for high-BMI shapes.
Muscle definition on highly muscular bodies is not well represented by SMPL-X, nor do our attributes capture this. The SHAPY approach, however, could be used to capture this with a suitable body model and more appropriate attributes.
Table 9. Leave-one-out evaluation on the HBW test set.
Table 10. S2A evaluation. We report the mean, standard deviation and percentage of correctly predicted classes per attribute on the CMTS test set.
Table 11. Evaluation on 3DPW [68]. * uses body poses sampled from the 3DPW training set for training.
References
(1) Ankur Agarwal and Bill Triggs. Recovering 3D human pose from monocular images. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 28(1):44–58, 2006.
(2) Brett Allen, Brian Curless, and Zoran Popović. The space of human body shapes: Reconstruction and parameterization from range scans. Transactions on Graphics (TOG), 22(3):587–594, 2003.
(3) Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. SCAPE: Shape completion and animation of people. Transactions on Graphics (TOG), 24(3):408–416, 2005.
(4) Alexandru Balan and Michael J. Black. The naked truth: Estimating body shape under clothing. In European Conference on Computer Vision (ECCV), volume 5304, pages 15–29, 2008.
(5) Alexandru O. Balan, Leonid Sigal, Michael J. Black, James E. Davis, and Horst W. Haussecker. Detailed human shape and pose from images. In Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007.
(6) Didier Bieler, Semih Gunel, Pascal Fua, and Helge Rhodin. Gravity as a reference for estimating a person's height from video. In International Conference on Computer Vision (ICCV), pages 8568–8576, 2019.
(7) Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In European Conference on Computer Vision (ECCV), volume 9909, pages 561–578, 2016.
(8) Zhe Cao, Gines Hidalgo Martinez, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43(1):172–186, 2019.
(9) Vasileios Choutas, Georgios Pavlakos, Timo Bolkart, Dimitrios Tzionas, and Michael J. Black. Monocular expressive body regression through body-driven attention. In European Conference on Computer Vision (ECCV), volume 12355, pages 20–40, 2020.
(10) Antonio Criminisi, Ian Reid, and Andrew Zisserman. Single view metrology. International Journal of Computer Vision (IJCV), 40(2):123–148, 2000.
(11) Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Computer Vision and Pattern Recognition (CVPR), pages 4690–4699, 2019.
(12) Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. RetinaFace: Single-shot multi-level face localisation in the wild. In Computer Vision and Pattern Recognition (CVPR), pages 5202–5211, 2020.
(13) Ratan Dey, Madhurya Nangia, Keith W. Ross, and Yong Liu. Estimating heights from photo collections: A data-driven approach. In Conference on Online Social Networks (COSN), pages 227–238, 2014.
(14) Sai Kumar Dwivedi, Nikos Athanasiou, Muhammed Kocabas, and Michael J. Black. Learning to regress bodies from images using differentiable semantic rendering. In International Conference on Computer Vision (ICCV), pages 11250–11259, 2021.
(15) Yao Feng, Vasileios Choutas, Timo Bolkart, Dimitrios Tzionas, and Michael J. Black. Collaborative regression of expressive bodies using moderation. In International Conference on 3D Vision (3DV), pages 792–804, 2021.
(16) Georgios Georgakis, Ren Li, Srikrishna Karanam, Terrence Chen, Jana Košecká, and Ziyan Wu. Hierarchical kinematic human mesh recovery. In European Conference on Computer Vision (ECCV), volume 12362, pages 768–784, 2020.
(17) Ke Gong, Yiming Gao, Xiaodan Liang, Xiaohui Shen, Meng Wang, and Liang Lin. Graphonomy: Universal human parsing via graph transfer learning. In Computer Vision and Pattern Recognition (CVPR), pages 7450–7459, 2019.
(18) Peng Guan, Alexander Weiss, Alexandru Balan, and Michael J. Black. Estimating human shape and pose from a single image. In International Conference on Computer Vision (ICCV), pages 1381–1388, 2009.
(19) Semih Gunel, Helge Rhodin, and Pascal Fua. What face and body shapes can tell us about height. In International Conference on Computer Vision Workshops (ICCVw), pages 1819–1827, 2019.
(20) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 42(2):386–397, 2020.
(21) Matthew Hill, Stephan Streuber, Carina Hahn, Michael Black, and Alice O'Toole. Exploring the relationship between body shapes and descriptions by linking similarity spaces. Journal of Vision (JOV), 15(12):931–931, 2015.
(22) David T. Hoffmann, Dimitrios Tzionas, Michael J. Black, and Siyu Tang. Learning to train with synthetic humans. In German Conference on Pattern Recognition (GCPR), pages 609–623, 2019.
(23) Wei-Lin Hsiao and Kristen Grauman. ViBE: Dressing for diverse body shapes. In Computer Vision and Pattern Recognition (CVPR), pages 11056–11066, 2020.
(24) Yinghao Huang, Federica Bogo, Christoph Lassner, Angjoo Kanazawa, Peter V. Gehler, Javier Romero, Ijaz Akhter, and Michael J. Black. Towards accurate marker-less human shape and pose estimation over time. In International Conference on 3D Vision (3DV), pages 421–430, 2017.
(25) Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(7):1325–1339, 2013.
(26) Yasamin Jafarian and Hyun Soo Park. Learning high fidelity depths of dressed humans by watching social media dance videos. In Computer Vision and Pattern Recognition (CVPR), pages 12753–12762, 2021.
(27) Wen Jiang, Nikos Kolotouros, Georgios Pavlakos, Xiaowei Zhou, and Kostas Daniilidis. Coherent reconstruction of multiple humans from a single image. In Computer Vision and Pattern Recognition (CVPR), pages 5578–5587, 2020.
(28) Hanbyul Joo, Natalia Neverova, and Andrea Vedaldi. Exemplar fine-tuning for 3D human pose fitting towards in-the-wild 3D human pose estimation. In International Conference on 3D Vision (3DV), pages 42–52, 2020.
(29) Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total capture: A 3D deformation model for tracking faces, hands, and bodies. In Computer Vision and Pattern Recognition (CVPR), pages 8320–8329, 2018.
(30) Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Computer Vision and Pattern Recognition (CVPR), pages 7122–7131, 2018.
(31) Angjoo Kanazawa, Jason Y. Zhang, Panna Felsen, and Jitendra Malik. Learning 3D human dynamics from video. In Computer Vision and Pattern Recognition (CVPR), pages 5614–5623, 2019.
(32) Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), pages 1725–1732, 2014.
(33) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
(34) Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. VIBE: Video inference for human body pose and shape estimation. In Computer Vision and Pattern Recognition (CVPR), pages 5252–5262, 2020.
(35) Muhammed Kocabas, Chun-Hao P. Huang, Otmar Hilliges, and Michael J. Black. PARE: Part attention regressor for 3D human body estimation. In International Conference on Computer Vision (ICCV), pages 11127–11137, 2021.
(36) Muhammed Kocabas, Chun-Hao P. Huang, Joachim Tesch, Lea Müller, Otmar Hilliges, and Michael J. Black. SPEC: Seeing people in the wild with an estimated camera. In International Conference on Computer Vision (ICCV), pages 11035–11045, 2021.
(37) Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In International Conference on Computer Vision (ICCV), pages 2252–2261, 2019.
(38) Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J. Black, and Peter V. Gehler. Unite the people: Closing the loop between 3D and 2D human representations. In Computer Vision and Pattern Recognition (CVPR), pages 6050–6059, 2017.
(39) Jiefeng Li, Chao Xu, Zhicun Chen, Siyuan Bian, Lixin Yang, and Cewu Lu. HybrIK: A hybrid analytical-neural inverse kinematics solution for 3D human pose and shape estimation. In Computer Vision and Pattern Recognition (CVPR), pages 3383–3393, 2021.
(40) Junbang Liang and Ming C. Lin. Shape-aware human pose and shape reconstruction using multi-view images. In International Conference on Computer Vision (ICCV), pages 4351–4361, 2019.
(41) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), volume 8693, pages 740–755, 2014.
(42) Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. Transactions on Graphics (TOG), 34(6):248:1–248:16, 2015.
(43) Meysam Madadi, Hugo Bertiche, and Sergio Escalera. SMPLR: Deep learning based SMPL reverse for 3D human pose and shape recovery. Pattern Recognition (PR), 106:107472, 2020.
(44) Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. In International Conference on Computer Vision (ICCV), pages 5442–5451, 2019.
(45) Lea Müller, Ahmed A. A. Osman, Siyu Tang, Chun-Hao P. Huang, and Michael J. Black. On self-contact and human pose. In Computer Vision and Pattern Recognition (CVPR), pages 9990–9999, 2021.
(46) Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Peter V. Gehler, and Bernt Schiele. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In International Conference on 3D Vision (3DV), pages 484–494, 2018.
(47) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Conference on Neural Information Processing Systems (NeurIPS), pages 8024–8035, 2019.
(48) Priyanka Patel, Chun-Hao Paul Huang, Joachim Tesch, David Hoffmann, Shashank Tripathi, and Michael J. Black. AGORA: Avatars in geography optimized for regression analysis. In Computer Vision and Pattern Recognition (CVPR), pages 13468–13478, 2021.
(49) Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019.
(50) Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3D human pose and shape from a single color image. In Computer Vision and Pattern Recognition (CVPR), pages 459–468, 2018.
(51) Gerard Pons-Moll, Javier Romero, Naureen Mahmood, and Michael J. Black. Dyna: A model of dynamic human shape in motion. Transactions on Graphics (TOG), 34(4):120:1–120:14, 2015.
(52) Sergi Pujades, Betty Mohler, Anne Thaler, Joachim Tesch, Naureen Mahmood, Nikolas Hesse, Heinrich H. Bülthoff, and Michael J. Black. The virtual caliper: Rapid creation of metrically accurate avatars from 3D measurements. Transactions on Visualization and Computer Graphics (TVCG), 25(5):1887–1897, 2019.
(53) Maria Alejandra Quiros-Ramirez, Stephan Streuber, and Michael J. Black. Red shape, blue shape: Political ideology influences the social perception of body shape. Humanities and Social Sciences Communications, 8(148), June 2021.
(54) Kathleen M. Robinette, Sherri Blackwell, Hein Daanen, Mark Boehmer, Scott Fleming, Tina Brill, David Hoeferlin, and Dennis Burnsides. Civilian American and European Surface Anthropometry Resource (CAESAR) final report. Technical Report AFRL-HE-WP-TR-2002-0169, US Air Force Research Laboratory, 2002.
(55) Nadine Rueegg, Christoph Lassner, Michael J. Black, and Konrad Schindler. Chained representation cycling: Learning to estimate 3D human pose and shape by cycling between representations. In Conference on Artificial Intelligence (AAAI), pages 5561–5569, 2020.
(56) Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Hao Li, and Angjoo Kanazawa. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In International Conference on Computer Vision (ICCV), pages 2304–2314, 2019.
(57) Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. PIFuHD: Multi-level pixel-aligned implicit function for high-resolution 3D human digitization. In Computer Vision and Pattern Recognition (CVPR), pages 81–90, 2020.
(58) Akash Sengupta, Ignas Budvytis, and Roberto Cipolla. Synthetic training for accurate 3D human pose and shape estimation in the wild. In British Machine Vision Conference (BMVC), 2020.
(59) Akash Sengupta, Ignas Budvytis, and Roberto Cipolla. Hierarchical kinematic probability distributions for 3D human shape and pose estimation from images in the wild. In International Conference on Computer Vision (ICCV), pages 11219–11229, 2021.
(60) Akash Sengupta, Ignas Budvytis, and Roberto Cipolla. Probabilistic 3D human shape and pose estimation from multiple unconstrained images in the wild. In Computer Vision and Pattern Recognition (CVPR), pages 16094–16104, 2021.
(61) Hyewon Seo, Frederic Cordier, and Nadia Magnenat-Thalmann. Synthesizing animatable body models with parameterized shape modifications. In Symposium on Computer Animation (SCA), pages 120–125, 2003.
(62) Hyewon Seo and Nadia Magnenat-Thalmann. An automatic modeling of human bodies from sizing parameters. In Symposium on Interactive 3D Graphics (SI3D), pages 19–26, 2003.
(63) Leonid Sigal, Alexandru Balan, and Michael J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision (IJCV), 87(1):4–27, 2010.
(64) Stephan Streuber, M. Alejandra Quiros-Ramirez, Matthew Q. Hill, Carina A. Hahn, Silvia Zuffi, Alice O'Toole, and Michael J. Black. Body Talk: Crowdshaping realistic 3D avatars with words. Transactions on Graphics (TOG), 35(4):54:1–54:14, 2016.
(65) Aggeliki Tsoli, Matthew Loper, and Michael J. Black. Model-based anthropometry: Predicting measurements from 3D human scans in multiple poses. In Winter Conference on Applications of Computer Vision (WACV), pages 83–90, 2014.
(66) Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. BodyNet: Volumetric inference of 3D human body shapes. In European Conference on Computer Vision (ECCV), volume 11211, pages 20–38, 2018.
(67) Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In Computer Vision and Pattern Recognition (CVPR), pages 4627–4635, 2017.
(68) Timo von Marcard, Roberto Henschel, Michael Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In European Conference on Computer Vision (ECCV), volume 11214, pages 614–631, 2018.
(69) Andrew Weitz, Lina Colucci, Sidney Primas, and Brinnae Bent. InfiniteForm: A synthetic, minimal bias dataset for fitness applications. arXiv:2110.01330, 2021.
(70) Stefanie Wuhrer and Chang Shu. Estimating 3D human shapes from measurements. Machine Vision and Applications (MVA), 24(6):1133–1147, 2013.
(71) Yuliang Xiu, Jinlong Yang, Dimitrios Tzionas, and Michael J. Black. ICON: Implicit Clothed humans Obtained from Normals. In Computer Vision and Pattern Recognition (CVPR), 2022.
(72) Hongyi Xu, Eduard Gabriel Bazavan, Andrei Zanfir, William T. Freeman, Rahul Sukthankar, and Cristian Sminchisescu. GHUM & GHUML: Generative 3D human shape and articulated pose models. In Computer Vision and Pattern Recognition (CVPR), pages 6183–6192, 2020.
(73) Andrei Zanfir, Eduard Gabriel Bazavan, Hongyi Xu, William T. Freeman, Rahul Sukthankar, and Cristian Sminchisescu. Weakly supervised 3D human pose and shape reconstruction with normalizing flows. In European Conference on Computer Vision (ECCV), volume 12351, pages 465–481, 2020.
(74) Hongwen Zhang, Yating Tian, Xinchi Zhou, Wanli Ouyang, Yebin Liu, Limin Wang, and Zhenan Sun. PyMAF: 3D human pose and shape regression with pyramidal mesh alignment feedback loop. In International Conference on Computer Vision (ICCV), pages 11446–11456, 2021.
(75) Rui Zhu, Xingyi Yang, Yannick Hold-Geoffroy, Federico Perazzi, Jonathan Eisenmann, Kalyan Sunkavalli, and Manmohan Chandraker. Single view metrology in the wild. In European Conference on Computer Vision (ECCV), volume 12356, pages 316–333, 2020.

Claims

CLAIMS 1. A method for training a machine learning model for estimating shapes of objects based on sensor data, the method comprising: - obtaining a training dataset comprising training sensor data and a corresponding ground truth attribute, - estimating, by the machine learning model, a shape for the training sensor data, - determining an attribute corresponding to the estimated shape, and - optimizing the machine learning model using a loss function that is based on a difference of the determined attribute compared to the ground truth attribute.
2. The method of claim 1, wherein the sensor data comprises an image.
3. The method of claim 1 or 2, wherein the object comprises a human.
4. The method of one of the previous claims, wherein the machine learning model comprises a neural network.
5. The method of one of the previous claims, wherein the attribute comprises a metric attribute, in particular a measurement, preferably a circumference and/or a height of the object.
6. The method of one of the previous claims, wherein the attribute comprises a semantic attribute and wherein preferably the determining the attribute corresponding to the estimated shape comprises using a polynomial regression model, preferably a second-degree polynomial regression model.
7. The method of one of the previous claims, wherein the attribute is a human-annotated attribute and the method preferably comprises a further step of obtaining a plurality of human-annotated attributes.
8. The method of one of the previous claims, wherein the estimated shape comprises a parametric representation of the shape, wherein in particular the parametric representation comprises SMPL-X shape coefficients.
9. The method of claim 8, wherein the parametric representation comprises a higher number of parameters than a number of attribute values of the attribute.
10. The method of one of the previous claims, wherein - the shape only comprises pose-independent information, or - the shape also comprises pose information.
11. A method for training a machine learning model to estimate shapes of objects based on sensor data, the method comprising: - obtaining a training dataset comprising training sensor data and a corresponding ground truth attribute, - estimating, by the machine learning model, a shape for the training sensor data, - determining a shape for a ground truth attribute corresponding to the training sensor data, and - optimizing the machine learning model using a loss function that is based on a difference between the shape estimated by the machine learning model and the shape determined for the ground truth attribute.
12. The method of claim 11, wherein the sensor data comprises an image.
13. The method of claim 11 or 12, wherein the object comprises a human.
14. The method of one of claims 11 to 13, wherein the machine learning model comprises a neural network.
15. The method of one of claims 11 to 14, wherein the attribute comprises a metric attribute, in particular a measurement, preferably a circumference and/or a height of the object.
16. The method of one of claims 11 to 15, wherein the attribute comprises a semantic attribute and wherein preferably the determining the attribute corresponding to the estimated shape comprises using a polynomial regression model, preferably a second-degree polynomial regression model.
17. The method of one of claims 11 to 16, wherein the attribute is a human-annotated attribute and the method preferably comprises a further step of obtaining a plurality of human-annotated attributes.
18. The method of one of claims 11 to 17, wherein the estimated shape comprises a parametric representation of the shape, in particular a parametric representation comprising SMPL-X shape coefficients.
19. The method of one of claims 11 to 18, wherein the parametric representation comprises a higher number of parameters than a number of attribute values of the attribute.
20. The method of one of claims 11 to 19, wherein - the shape only comprises pose-independent information, or - the shape also comprises pose information.
21. A method for estimating shapes of objects based on sensor data, wherein the method is based on a machine learning model that has been trained using the method of one of the previous claims.
22. A training device for training a machine learning model to estimate shapes of objects based on sensor data, wherein the training device is configured to carry out a method according to one of claims 1 to 20.
23. A machine learning model for estimating shapes of objects based on sensor data, wherein the machine learning model has been trained with a method according to one of claims 1 to 20.
24. A computer-readable storage medium storing program code, the program code comprising instructions that when executed by a processor carry out the method of one of claims 1 to 21.
PCT/EP2023/062148 2022-05-06 2023-05-08 Accurate 3d body shape regression using metric and/or semantic attributes WO2023214093A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP22172168 2022-05-06
EP22172168.1 2022-05-06

Publications (1)

Publication Number Publication Date
WO2023214093A1 true WO2023214093A1 (en) 2023-11-09

Family

ID=81851027

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2023/062148 WO2023214093A1 (en) 2022-05-06 2023-05-08 Accurate 3d body shape regression using metric and/or semantic attributes

Country Status (1)

Country Link
WO (1) WO2023214093A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117745978A (en) * 2024-02-20 2024-03-22 四川大学华西医院 Simulation quality control method, equipment and medium based on human body three-dimensional reconstruction algorithm

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020156627A1 (en) * 2019-01-30 2020-08-06 MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V. The virtual caliper: rapid creation of metrically accurate avatars from 3d measurements
US10818062B2 (en) * 2016-01-29 2020-10-27 Max-Planck-Gesellschaft Zur Förderung D. Wissenschaften E.V. Crowdshaping realistic 3D avatars with words
US20210232924A1 (en) * 2019-02-01 2021-07-29 Tencent Technology (Shenzhen) Company Limited Method for training smpl parameter prediction model, computer device, and storage medium

NIKOS KOLOTOUROSGEORGIOS PAVLAKOSMICHAEL J. BLACKKOSTAS DANIILIDIS: "Learning to reconstruct D human pose and shape via model-fitting in the loop", IN INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV, 2019, pages 2252 - 2261, XP033724046, DOI: 10.1109/ICCV.2019.00234
PENG GUANALEXANDER WEISSALEXANDRU BALANMICHAEL J. BLACK: "Estimating human shape and pose from a single image.", IN INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV, March 2009 (2009-03-01), pages 1381 - 1388
PRIYANKA PATELCHUN-HAO PAUL HUANGJOACHIM TESCHDAVID HOFFMANNSHASHANK TRIPATHIMICHAEL J. BLACK: "AGORA: Avatars in geography optimized for regression analysis", IN COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2021, pages 13468 - 13478
RATAN DEYMADHURYA NANGIAKEITH W. ROSSYONG LIU: "Estimating heights from photo collections: A data-driven approach", IN CONFERENCE ON ONLINE SOCIAL NETWORKS (COSN, 2014, pages 227 - 238
RUI ZHUXINGYI YANGYANNICK HOLD-GEOFFROYFEDERICO PERAZZIJONATHAN EISENMANNKALYAN SUNKAVALLIMANMOHAN CHANDRAKER: "Single view metrology in the wild", IN EUROPEAN CONFERENCE ON COMPUTER VISION (ECCV, vol. 12356, 2020, pages 316 - 333
SAI KUMAR DWIVEDINIKOS ATHANASIOUMUHAMMED KOCABASMICHAEL J. BLACK: "Learning to regress bodies from images using differentiable semantic rendering", IN INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV, 2021, pages 11250 - 11259
SEMIH GUNELHELGE RHODINPASCAL FUA: "What face and body shapes can tell us about height", IN INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW, April 2019 (2019-04-01), pages 1819 - 1827
SERGI PUJADESBETTY MOHLERANNE THALERJOACHIM TESCHNAUREEN MAHMOODNIKOLAS HESSEHEINRICH H BIILTHOFFMICHAEL J. BLACK: "The virtual caliper: Rapid creation of metrically accurate avatars from D measurements", TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS (TVCG, vol. 25, no. 51, 2019, pages 1887 - 1897, XP011716727, DOI: 10.1109/TVCG.2019.2898748
SHUNSUKE SAITOTOMAS SIMONJASON SARAGIHD HANBYUL JOO: "PIFuHD: Multi-level pixel-aligned implicit function for high-resolution D human digitization", IN COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2020, pages 81 - 90, XP033803379, DOI: 10.1109/CVPR42600.2020.00016
SHUNSUKE SAITOZENG HUANGRYOTA NATSUMESHIGEO MORISHIMAHAO LANGJOO KANAZAWA: "PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization", IN INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV, 2019, pages 2304 - 2314, XP033723888, DOI: 10.1109/ICCV.2019.00239
STEFANIE WUHRERCHANG SHU: "Estimating D human shapes from measurements", MACHINE VISION AND APPLICATIONS (MVA, vol. 24, no. 6, 2013, pages 1133 - 1147, XP055586576, DOI: 10.1007/s00138-012-0472-y
STEPHAN STREUBERM. ALEJANDRA QUIROS-RAMIREZMATTHEW Q. HILLCARINA A. HAHNSILVIA ZUFFIALICE O'TOOLEMICHAEL J. BLACK: "Body Talk: Crowdshaping realistic D avatars with words", TRANSACTIONS ON GRAPHICS (TOG, vol. 35, no. 4, 2016, pages 54
TIMO VON MARCARDROBERTO HENSCHELMICHAEL BLACKBODO ROSENHAHNGERARD PONS-MOLL: "Recovering accurate D human pose in the wild using IMUs and a moving camera", IN EUROPEAN CONFERENCE ON COMPUTER VISION (ECCV, vol. 11214, 2018, pages 614 - 631, XP047488357, DOI: 10.1007/978-3-030-01249-6_37
TSUNG-YI LINMICHAEL MAIRESERGE BELONGIEJAMES HAYSPIETRO PERONADEVA RAMANANPIOTR DOLLARC LAWRENCE ZITNICK: "Microsoft COCO: common objects in context", IN EUROPEAN CONFERENCE ON COMPUTER VISION (ECCV, vol. 8693, 2014, pages 740 - 755, XP055594044, DOI: 10.1007/978-3-319-10602-1_48
VASILEIOS CHOUTASGEORGIOS PAVLAKOSTIMO BOLKARTDIMITRIOS TZIONASMICHAEL J. BLACK: "Monocular expressive body regression through body-driven attention", IN EUROPEAN CONFERENCE ON COMPUTER VISION (ECCV, vol. 12355, 2020, pages 20 - 40
WEI-LIN HSIAOKRISTEN GRAUMAN: "ViBE: Dressing for diverse body shapes", IN COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2020, pages 11056 - 11066, XP033805065, DOI: 10.1109/CVPR42600.2020.01107
WEN JIANGNIKOS KOLOTOUROSGEORGIOS PAVLAKOSXIAOWEI ZHOUKOSTAS DANIILIDIS: "Coherent reconstruction of multiple humans from a single image", IN COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2020, pages 5578 - 5587, XP033804805, DOI: 10.1109/CVPR42600.2020.00562
YAO FENGVASILEIOS CHOUTASTIMO BOLKARTDIMITRIOS TZIONASMICHAEL J. BLACK: "Collaborative regression of expressive bodies using moderation", IN INTERNATIONAL CONFERENCE ON 3D VISION (3DV, 2021, pages 792 - 804, XP033999092, DOI: 10.1109/3DV53792.2021.00088
YASAMIN JAFARIANHYUN SOO PARK: "Learning high fidelity depths of dressed humans by watching social media dance videos", IN COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2021, pages 12753 - 12762
YINGHAO HUANGFEDERICA BOGOCHRISTOPH LASSNERANGJOO KANAZAWAPETER V. GEHLERJAVIER ROMEROIJAZ AKHTERMICHAEL J. BLACK: "Towards accurate marker-less human shape and pose estimation over time", IN INTERNATIONAL CONFERENCE ON 3D VISION (3DV, 2017, pages 421 - 430, XP033353213, DOI: 10.1109/3DV.2017.00055
YULIANG XIUJINLONG YANGDIMITRIOS TZIONASMICHAEL J. BLACK: "ICON: Implicit Clothed humans Obtained from Normals", IN COMPUTER VISION AND PATTERN RECOGNITION (CVPR, vol. 3, 2022
ZHE CAO, GINES HIDALGO MARTINEZ, TOMAS SIMON, SHIHEN WEI, AND YASER SHEIKH: "OpenPose: Realtime multiperson 2D pose estimation using part affinity fields.", TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (TPAMI, vol. 43, no. 1, 2019, pages 172 - 186, XP011824613, DOI: 10.1109/TPAMI.2019.2929257

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117745978A (en) * 2024-02-20 2024-03-22 四川大学华西医院 (West China Hospital, Sichuan University) Simulation quality control method, equipment and medium based on human body three-dimensional reconstruction algorithm
CN117745978B (en) * 2024-02-20 2024-04-30 四川大学华西医院 (West China Hospital, Sichuan University) Simulation quality control method, equipment and medium based on human body three-dimensional reconstruction algorithm

Similar Documents

Publication Publication Date Title
Choutas et al. Accurate 3D body shape regression using metric and semantic attributes
Zhen et al. SMAP: Single-shot multi-person absolute 3D pose estimation
Fieraru et al. Three-dimensional reconstruction of human interactions
Tu et al. RGBT salient object detection: A large-scale dataset and benchmark
US10679046B1 (en) Machine learning systems and methods of estimating body shape from images
You et al. Relative CNN-RNN: Learning relative atmospheric visibility from images
Rogez et al. Image-based synthesis for deep 3D human pose estimation
Tian et al. Densely connected attentional pyramid residual network for human pose estimation
Hu et al. 3DBodyNet: fast reconstruction of 3D animatable human body shape from a single commodity depth camera
Li et al. Lidarcap: Long-range marker-less 3d human motion capture with lidar point clouds
CN114998934B (en) Clothes-changing pedestrian re-identification and retrieval method based on multi-mode intelligent perception and fusion
CN108898269A (en) Electric power image-context impact evaluation method based on measurement
Shen et al. Exemplar-based human action pose correction
Zhu et al. Detailed avatar recovery from single image
CN108537887A (en) Sketch based on 3D printing and model library 3-D view matching process
Yan et al. Learning anthropometry from rendered humans
WO2023214093A1 (en) Accurate 3d body shape regression using metric and/or semantic attributes
Huang et al. A review of 3D human body pose estimation and mesh recovery
Li et al. Image-guided human reconstruction via multi-scale graph transformation networks
Wang et al. Digital twin: Acquiring high-fidelity 3D avatar from a single image
Chen et al. Prior-knowledge-based self-attention network for 3D human pose estimation
Mallis et al. From keypoints to object landmarks via self-training correspondence: A novel approach to unsupervised landmark discovery
Li et al. Learning to infer inner-body under clothing from monocular video
Yu et al. Humbi 1.0: Human multiview behavioral imaging dataset
Khan et al. A review of benchmark datasets and training loss functions in neural depth estimation

Legal Events

Date Code Title Description
121 Ep: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23723947

Country of ref document: EP

Kind code of ref document: A1