RU2776825C1

RU2776825C1 - Modelling people's clothing based on multiple points

Info

Publication number: RU2776825C1
Application number: RU2021122743A
Authority: RU
Inventors: Артур Андреевич ГРИГОРЬЕВ; Виктор Сергеевич Лемпицкий; Илья Дмитривич ЗАХАРКИН; Кирилл Евгеньевич МАЗУР
Original assignee: Samsung Electronics Co., Ltd.
Filing date: 2021-07-30
Publication date: 2022-07-27

Abstract

FIELD: computing technology.

SUBSTANCE: invention relates to modelling realistic clothing worn by people and to realistic 3D modelling of people. In the method for training an overlay network for modelling clothing on a person, wherein the clothing adapts to the pose and body shape of any person, a set of frames of people is provided, wherein each person is dressed in clothing and the frames constitute video sequences in which each person performs a series of motions; for each frame, a Skinned Multi-Person Linear (SMPL) mesh is calculated for the pose and body shape of the person in the frame; for each frame, a clothing mesh is calculated for the pose and body shape of the person in the frame; a source point cloud in the form of the set of vertices of said SMPL mesh is created for each frame; a randomly initialised d-dimensional code vector is set to code the style of clothing for each person; the source point clouds and the code vectors of clothing are supplied to the overlay network, namely: the source point clouds are supplied to the input of the neural network of the cloud converter of the overlay network, and the code vectors of clothing are supplied to the input of the neural network of the MLP (multilayer perceptron) encoder; the code vector of clothing is processed by the neural network of the MLP encoder, and its output is transferred to the neural network of the cloud converter, which deforms the input point cloud taking into account the output of the neural network of the MLP encoder and outputs the predicted point cloud of clothing for each frame; after all the frames from the set of frames of people have been processed, a pre-trained overlay network is obtained, namely: weights of the trained neural network of the MLP encoder, weights of the trained neural network of the cloud converter, and code vectors of clothing of the coded styles for all people; by means of the pre-trained overlay network, a style of clothing corresponding to one of the code vectors and one of the point clouds is overlaid on any body shape and any pose selected by the user.

EFFECT: possibility of generating images of a person wearing clothing selected from another person.

6 cl, 8 dwg, 2 tbl

Description

Field of the invention

The invention relates to virtual try-on applications and telepresence applications; in particular, it relates to modeling realistic clothing worn by people and to realistic 3D modeling of people.

Description of the prior art

Modeling realistic clothing worn by people is an important part of the broader task of realistic 3D modeling of people. Its immediate practical applications include virtual clothing try-on and enhancing the realism of human avatars for telepresence systems. Clothing is difficult to model because garments vary widely in geometry (including topological changes) and in appearance (including a wide variety of textile patterns and prints, as well as the complex reflectance of fabrics). A particularly difficult aspect is modeling the interaction between clothing and the human body.

Modeling the geometry of clothing. Many existing methods model clothing geometry using one or more predefined clothing templates of fixed topology. One of the earlier works, DRAPE [13], learns from physics-based simulation (PBS) and allows the pose and shape to be varied for each learned clothing mesh. In more recent works, clothing templates are usually represented as offsets relative to the SMPL mesh [34]. ClothCap [46] uses this technique and captures finer details learned from a new dataset of 4D scans. DeepWrinkles [29] also addresses the modeling of fine-grained wrinkles, using normal maps generated by a conditional GAN. GarNet [15] uses a two-stream architecture and simulates the clothing mesh at a level of realism that almost matches PBS while being two orders of magnitude faster. TailorNet [44] follows the same SMPL-based template approach as [46, 7] but models clothing deformations as a function of pose, shape, and style simultaneously (unlike previous work). It also shows a higher prediction speed than [15].

The CAPE system [36] uses a generative shape model based on a graph ConvNet, which supports conditioning, sampling, and preserving fine shape details in 3D meshes.

Several other works recover clothing geometry simultaneously with the full-body mesh from image data. BodyNet [55] and DeepHuman [64] are voxel-based methods that directly infer the volumetric shape of a clothed body from a single image. SiCloPe [42] uses a similar approach, but synthesizes silhouettes of the subjects to recover more detail. HMR [25] uses the SMPL body model to estimate pose and shape from an input image. Some methods, such as PIFu [51] and ARCH [19], use end-to-end implicit functions for 3D reconstruction of a clothed person and can generalize to complex clothing and hair topology, while PIFuHD [52] reconstructs a higher-resolution 3D surface using a two-level architecture. MouldingHumans [11] predicts the final surface from estimated "visible" and "hidden" depth maps. MonoClothCap [59] shows promising results in video-based, temporally coherent modeling of dynamic clothing deformation. The recent work of Yoon et al. [62] develops a relatively simple yet effective pipeline for template-based retargeting of a clothing mesh.

Modeling the appearance of clothing. A large body of work focuses on direct image-to-image transfer of clothing, bypassing 3D modeling. Thus, the works [23, 16, 56, 60, 21] address the problem of transferring a desired garment onto the corresponding region of a person in an image. CAGAN [23] is one of the first works to propose using a conditional GAN for image-to-image translation to solve this problem. VITON [16] follows the image generation idea and uses a non-parametric geometric transformation, which makes the whole procedure two-stage, as in SwapNet [48], though with a different problem formulation and different training data. CP-VTON [56] further improves on [16] by incorporating a fully learnable thin-plate spline transformation, and is followed by CP-VTON+ [40], LAVITON [22], Ayush et al. [5], and ACGPN [60]. While the above works rely on pre-trained human parsers and pose estimators, the work of Issenhuth et al. [21] achieves competitive image quality and a significant speedup by using a teacher-student setup to distill the standard virtual try-on pipeline. The resulting student network does not invoke an expensive human parsing network at inference time. Very recently, VOGUE [31] trains a pose-conditioned StyleGAN2 [27] and finds the optimal combination of latent codes to produce high-quality try-on images.

Some methods use both 2D and 3D information for model training and inference. Cloth-VTON [39] uses 3D-based warping to realistically retarget a 2D clothing template. Pix2Surf [41] digitally maps the texture of online retail store clothing images onto the 3D surface of virtual garments, enabling real-time 3D virtual try-on. Other relevant studies extend this single-template garment retargeting scenario to multi-garment outfits with unpaired data [43], to generating high-resolution images of fashion models in custom outfits [61], or to editing the style of a person in an input image [17].

Joint modeling of geometry and appearance. Octopus [2] and Multi-Garment Net (MGN) [7] reconstruct a textured mesh of the clothed body based on the SMPL+D model. In the latter method, the clothing mesh is treated separately from the body mesh, which makes it possible to transfer the clothing to another subject. Tex2Shape [4] proposes an interesting framework that turns the shape regression task into an image-to-image translation problem. [53] presents a learning-based parametric generative model that can support any type of clothing material, body shape, and most clothing topologies. More recently, StylePeople [20] combined polygonal body mesh modeling with neural rendering, so that both clothing geometry and texture are encoded in a neural texture [54]. Like [20], the proposed approach to appearance modeling is based on neural rendering; however, the proposed handling of geometry is more explicit.

Compared with [20], the proposed invention benefits from more explicit geometric modeling, especially for loose clothing.

Summary of the invention

A new approach to modeling human clothing based on point clouds (sets of points) is proposed, together with hardware containing software products that implement a method for geometric modeling of clothing on a person, in which the clothing adapts to the pose and shape of the person's body. A deep model is trained that can predict point clouds for various kinds of clothing, various poses, and various human body shapes. Notably, the same model can handle clothing of different types and topologies. Using the trained model, the geometry of new clothing can be inferred from as little as a single image, and the clothing can be retargeted to new bodies in new poses. The proposed geometric model is complemented by appearance modeling, which uses the point cloud geometry as a geometric scaffold and applies point-based neural graphics to capture the appearance of clothing from video and to re-render the captured clothing. The geometric modeling and appearance modeling aspects of the proposed method were evaluated against recently proposed methods, and the effectiveness of point-based clothing modeling was established.

The proposed geometric modeling differs from previous work in that it uses a different representation (point clouds), which gives the proposed method topological flexibility and the ability to model clothing separately from the body, and also provides a geometric scaffold for appearance modeling with neural rendering.

In contrast to the above-mentioned approaches to retargeting clothing appearance, the proposed method uses explicit 3D geometric models without relying on individual templates of fixed topology.

On the other hand, the appearance modeling part of the proposed method requires a video sequence, whereas some of the known works mentioned above use one or a few images.

Discussion of the point cloud overlay model. The purpose of this model is to capture, using point clouds, the geometry of various kinds of human clothing draped over human bodies of various shapes and in various poses. A latent model for such point clouds is proposed, which can be fitted to a single image or to more complete data. A combination of point cloud overlay with neural rendering is then described, which makes it possible to capture the appearance of clothing from video.

Hardware is proposed containing software products that implement a method for displaying clothing on a person, adaptable to body pose and body shape, based on a point cloud overlay model, the method comprising the steps of: using a point cloud, and a neural network that synthesizes such point clouds, to capture/model the geometry of garments; and using differentiable point-based neural rendering to capture the appearance of garments.

A method is proposed for training an overlay network for modeling clothing on a person, in which the clothing adapts to the body pose and body shape of any person, the method comprising providing (capturing or receiving) a set of frames of people.

The overlay network is trained using a neural network training infrastructure and clothing code vectors. The infrastructure means a set of servers (machines containing a central processor, a graphics accelerator, a motherboard, and other components of a modern computer) combined into one cluster, or at least one such server with at least 32 GB of RAM and at least 18 GB of video memory.

Here, the set of frames means a set of signals recorded on a hard disk, in which each person is dressed in clothing and the frames constitute video sequences in which each person performs a sequence of motions; for each frame, a Skinned Multi-Person Linear (SMPL) mesh is calculated for the pose and body shape of the person in the frame; for each frame, a clothing mesh is calculated in the pose and body shape of the person in the frame; a source point cloud is created for each frame in the form of the set of vertices of said SMPL mesh; a randomly initialized d-dimensional code vector is assigned to each person to encode the clothing style; the source point clouds and clothing code vectors are fed to the overlay network, specifically, the source point clouds are fed to the input of the cloud converter neural network of the overlay network, and the clothing code vectors are fed to the input of the MLP (multilayer perceptron) encoder neural network; the clothing code vector is processed by the MLP encoder, and its output is passed to the cloud converter, which deforms the input source point cloud taking into account the output of the MLP encoder and outputs a predicted clothing point cloud for each frame; after all frames in the set have been processed, a pre-trained overlay network is obtained, namely the weights of the trained MLP encoder, the weights of the trained cloud converter, and the clothing code vectors encoding the styles of all people; using the pre-trained overlay network, a clothing style corresponding to one of the code vectors and one of the point clouds is overlaid on any body shape and any pose selected by the user.
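The data flow of the training method above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the layer sizes, the code dimension D = 8, the 50 random stand-in vertices (a real SMPL mesh has 6890), and all function names are hypothetical, and cloud_converter is a trivial placeholder that only shows how the MLP encoder output conditions the deformation of the input points. No loss or weight update is shown, since this section does not specify them.

```python
import math
import random

def init_layer(n_out, n_in, rng):
    """Random weights and zero biases for one dense layer (hypothetical sizes)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mlp_encoder(code, layers):
    """MLP encoder: maps a d-dimensional clothing code to a conditioning vector."""
    x = code
    for i, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
        if i < len(layers) - 1:  # tanh on hidden layers only
            x = [math.tanh(v) for v in x]
    return x

def cloud_converter(points, cond):
    """Placeholder for the cloud converter (assumption): scales each vertex
    about the centroid and shifts it, both controlled by the conditioning vector."""
    n = len(points)
    centroid = [sum(p[k] for p in points) / n for k in range(3)]
    scale = 1.0 + 0.1 * math.tanh(cond[0])
    shift = cond[1:4]
    return [
        tuple(centroid[k] + scale * (p[k] - centroid[k]) + 0.01 * shift[k] for k in range(3))
        for p in points
    ]

def overlay_network(smpl_vertices, clothing_code, layers):
    """Source point cloud + clothing code -> predicted clothing point cloud."""
    return cloud_converter(smpl_vertices, mlp_encoder(clothing_code, layers))

rng = random.Random(0)
D = 8  # hypothetical code dimension d
layers = [init_layer(16, D, rng), init_layer(4, 16, rng)]
# Stand-in for the SMPL vertices of one frame.
smpl_vertices = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(50)]
# One randomly initialized code vector per person, as in the method.
codes = {person: [rng.gauss(0, 1) for _ in range(D)] for person in ("person_a", "person_b")}
clouds = {person: overlay_network(smpl_vertices, z, layers) for person, z in codes.items()}
```

The same body point cloud yields a different predicted cloud for each person's code, which is the property that later lets a stored code be applied to a new body shape and pose.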

A method is also proposed for obtaining a predicted clothing point cloud and a clothing code vector from an image of a real person in clothing, for modeling this clothing on a person, in which the clothing adapts to the body pose and body shape of any person, the method comprising: capturing, with a detection device, an image of a real person in clothing; predicting from this image, using the SMPLify method, an SMPL mesh with the desired pose and body shape; creating a source point cloud in the form of the vertices of said Skinned Multi-Person Linear (SMPL) mesh for the image; predicting a binary clothing mask corresponding to the pixels of the clothing in the image using a segmentation network; initializing with random values a d-dimensional clothing code vector to encode the clothing style for the image; feeding the source point cloud and the clothing code vector to the pre-trained overlay network, trained in accordance with claim 1; obtaining a predicted clothing point cloud from the output of the pre-trained overlay network; projecting the clothing point cloud onto a black-and-white image using the given camera parameters of the image of the person; comparing, by computing a loss function, the projection of the predicted point cloud onto the image with the ground-truth binary clothing mask corresponding to the clothing pixels in the image, via the Chamfer distance between the 2D point clouds, which are projections of 3D point clouds; optimizing the clothing code vector based on the computed loss function;

overlaying, in accordance with the obtained clothing code vector, the predicted clothing point cloud of the image on any body shape and any body pose selected by the user.
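The projection and comparison steps of this method can be illustrated with a short sketch. It assumes a simple pinhole camera (focal length and principal point) and one common form of the Chamfer distance, the sum of the mean squared nearest-neighbour distances in both directions; the text does not fix either choice, and the function names are hypothetical. The optimization of the code vector itself is not shown.

```python
def project(points_3d, focal, cx, cy):
    """Pinhole projection of camera-space 3D points (z > 0) to pixel coordinates."""
    return [(focal * x / z + cx, focal * y / z + cy) for x, y, z in points_3d]

def mask_pixels(mask):
    """(x, y) coordinates of the foreground pixels of a binary mask (rows of 0/1)."""
    return [(float(x), float(y)) for y, row in enumerate(mask) for x, v in enumerate(row) if v]

def chamfer_2d(a, b):
    """Symmetric Chamfer distance between two non-empty 2D point sets."""
    def one_way(src, dst):
        return sum(
            min((sx - dx) ** 2 + (sy - dy) ** 2 for dx, dy in dst) for sx, sy in src
        ) / len(src)
    return one_way(a, b) + one_way(b, a)

# Toy ground-truth clothing mask: a 2x2 block of clothing pixels.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
target = mask_pixels(mask)

# Predicted clothing points that project exactly onto the mask pixels.
aligned = [(1.0, 1.0, 2.0), (2.0, 1.0, 2.0), (1.0, 2.0, 2.0), (2.0, 2.0, 2.0)]
# The same points shifted by one unit along x.
shifted = [(x + 1.0, y, z) for x, y, z in aligned]

aligned_loss = chamfer_2d(project(aligned, focal=2.0, cx=0.0, cy=0.0), target)
shifted_loss = chamfer_2d(project(shifted, focal=2.0, cx=0.0, cy=0.0), target)
```

Here aligned_loss is zero and shifted_loss is larger, so a loss of this kind decreases as the projected cloud moves onto the mask, which is the signal used to optimize the clothing code vector.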

A method is also proposed for modeling clothing on a person, in which the clothing adapts to the body pose and body shape of any person, the method comprising: providing a color video stream of a first person; selecting, by a user, any clothing from any video of a second person in clothing; by means of an operating unit of a computer system: obtaining a predicted clothing point cloud and a clothing code vector according to the method of claim 2 for any frame of the video; initializing with random values, for each point of the point cloud, an n-dimensional appearance descriptor vector responsible for color; generating, by a rasterization unit, a 16-channel image tensor using the 3D coordinates and the neural descriptor of each point, and a binary black-and-white mask corresponding to the pixels of the image covered by these points; processing the 16-channel image tensor together with the binary black-and-white mask with a rendering network to obtain a color RGB image and a clothing mask; optimizing the weights of the rendering network and the values of the appearance descriptors against the ground-truth video sequence of the person to obtain the desired clothing appearance; displaying to the user on a screen the video of the first person in the clothing of the second person as a predicted rendered image of the clothing with the given pose and body shape, wherein the user can input a video of any person and see the learned colored clothing model, retargeted to new body shapes and poses, rendered over this new video, that is, dress the person from any video in any clothing selected by the user. Here the real person is the user, and the colored clothing model is displayed to the user on that user.
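The rasterization step can be sketched as follows. This is an illustrative sketch only: it assumes a pinhole camera, nearest-pixel splatting with a depth buffer, and a descriptor length equal to the 16 channels of the image tensor (the text states an n-dimensional descriptor and a 16-channel tensor without relating the two). The rendering network that consumes the tensor and mask is not shown, and the function name is hypothetical.

```python
def rasterize(points_3d, descriptors, width, height, focal, cx, cy):
    """Splat per-point descriptors into an H x W x C tensor and a binary mask.

    When several points fall on the same pixel, the one nearest to the
    camera (smallest z) is kept.
    """
    channels = len(descriptors[0])
    image = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for (x, y, z), desc in zip(points_3d, descriptors):
        if z <= 0:  # behind the camera
            continue
        u = int(round(focal * x / z + cx))
        v = int(round(focal * y / z + cy))
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            image[v][u] = list(desc)
            mask[v][u] = 1
    return image, mask

points = [
    (0.0, 0.0, 2.0),  # centre pixel, far
    (0.0, 0.0, 1.0),  # centre pixel, near: overwrites the far point
    (1.0, 0.0, 1.0),  # one pixel to the right of the centre
    (5.0, 0.0, 1.0),  # projects outside the image and is skipped
]
descriptors = [[1.0] * 16, [2.0] * 16, [3.0] * 16, [4.0] * 16]
image, mask = rasterize(points, descriptors, width=3, height=3, focal=1.0, cx=1.0, cy=1.0)
```

In the method, the descriptors start from random values and are optimized together with the rendering network weights, so a differentiable variant of this step would be needed in practice; the hard per-pixel assignment above is only meant to show the data layout.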

Also proposed is a system for modeling a person's clothing using the proposed method, comprising: a detection device connected to a computer system that contains an operating unit connected to a display screen and to a selection unit; wherein the detection device is configured to capture a color video stream of a first real person in real time; the selection unit is configured to let the user select any clothing from any video of a second person wearing that clothing; and the display screen is configured to display said first person in real time in the clothing selected by the user from said videos, in accordance with data received from the operating unit. In this case, the user is the first person.

At least one of the plurality of modules (units) may be implemented through an AI model. A function associated with AI may be performed through non-volatile memory, volatile memory, and a processor.

The processor may include one or more processors. The one or more processors may be a general-purpose processor, such as a central processing unit (CPU) or an application processor (AP); a graphics-only processing unit, such as a graphics processing unit (GPU) or a visual processing unit (VPU); and/or a dedicated AI processor, such as a neural processing unit (NPU).

The one or more processors control the processing of input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or AI model is provided through training or learning.

In this context, being provided through training means that a predefined operating rule or AI model with a desired characteristic is created by applying a learning algorithm to a plurality of training data. The training may be performed on the same device on which the AI according to an embodiment runs, and/or may be implemented through a separate server/system.

An AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values and performs its layer operation by combining the result computed by the previous layer with those weights. Examples of neural networks include, but are not limited to, a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
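As a minimal sketch of what "combining the result of the previous layer with the weights" can look like, the following code applies a stack of fully connected layers. The ReLU activation, the bias terms, and the function names `dense_layer` and `forward` are assumptions chosen for illustration; the text does not prescribe a particular layer type.

```python
def dense_layer(inputs, weights, biases):
    """One fully connected layer: y_j = relu(sum_i inputs[i] * weights[i][j] + biases[j])."""
    outputs = []
    for j in range(len(biases)):
        total = biases[j]
        for i, x in enumerate(inputs):
            total += x * weights[i][j]
        outputs.append(max(0.0, total))  # ReLU non-linearity (an assumption)
    return outputs

def forward(inputs, layers):
    """Feed the result of each layer into the next one."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations

# Two layers: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [2.0, 0.5]], [0.0, 0.0]),
    ([[1.0], [1.0]], [0.5]),
]
result = forward([1.0, 2.0], layers)
```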

A learning algorithm is a method of training a predetermined target device (for example, a robot) using a plurality of training data so as to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

According to the present invention, in a method of an electronic device, a recognition method may obtain output data by recognizing an image, or a feature in the image, using image data as input to an artificial intelligence model. The artificial intelligence model may be obtained through training. In this context, "obtained through training" means that a predefined operating rule or artificial intelligence model configured to perform a desired function (or purpose) is obtained by training a base artificial intelligence model on multiple pieces of training data using a learning algorithm. The artificial intelligence model may include a plurality of neural network layers. Each of these layers includes a plurality of weight values and performs its computation by combining the result computed by the previous layer with those weight values.

Visual understanding is a technique for recognizing and processing things as human vision does, and includes, for example, object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, or image enhancement.

According to the present invention, in a method of an electronic device, a reasoning or prediction method may use an artificial intelligence model to make recommendations or carry out actions based on data. The processor may perform a preprocessing operation on the data to convert it into a form suitable for use as input to the artificial intelligence model. The artificial intelligence model may be obtained through training. In this context, "obtained through training" means that a predefined operating rule or artificial intelligence model configured to perform a desired function (or purpose) is obtained by training a base artificial intelligence model on multiple pieces of training data using a learning algorithm. The artificial intelligence model may include several neural network layers. Each of these layers includes a plurality of weight values and performs its computation by combining the result computed by the previous layer with those weight values.

Reasoning prediction is a technique of logical reasoning and prediction based on determined information, and includes, for example, knowledge-based reasoning, optimization prediction, preference-based planning, or recommendation.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and/or other aspects will become more apparent from the description of exemplary embodiments with reference to the accompanying drawings, in which:

Fig. 1 illustrates the proposed method for modeling the geometry of various garments using point clouds (the bottom row shows arbitrary point colors).

Fig. 2 shows color-coded results of the overlay networks.

Fig. 3 shows the overlay network, which converts a body point cloud (left) and a clothing code (top) into a clothing point cloud adapted to the body pose and body shape.

Fig. 4 shows the process of estimating (optimizing) the clothing code given a single image of a person.

Fig. 5 shows the use of point-based neural graphics to model the appearance of clothing.

Fig. 6 shows predicted geometries in validation poses, fitted to a single frame (left).

Fig. 7 illustrates that the proposed method can also retarget geometry and appearance to new body shapes.

Fig. 8 compares appearance retargeting to new poses unseen during fitting between the proposed method and the StylePeople system (multi-frame variant), which uses the SMPL mesh as the underlying geometry and relies solely on neural rendering to "grow" loose clothing in the renders.

DETAILED DESCRIPTION

The proposed invention makes it possible to dress a person captured in (or selected or taken from) one image or video in the clothing of a person captured in (or selected or taken from) another image or video. The invention also renders a video of a real person on a screen, possibly in real time; the person can select the clothing of any other person captured in (or selected or taken from) any video or image and see themselves in the selected clothing, possibly in real time.

Thus, the present invention can be used to replace, where necessary, the user's physical presence when trying on clothes with a virtual presence. This may be especially relevant during a pandemic, for people with disabilities, or simply for anyone's convenience, since it lets a person try on any clothing at any time. The person may even be in a store without having to walk between the racks and change clothes.

In addition, a person can use a computer, laptop, or smartphone to try on clothes, and such devices can be used to overlay clothing models on a video of a person standing and/or moving in front of the camera.

The invention also makes it possible to dress a person from any image or video in the clothing of any other person.

This solution can also be used in telepresence systems to render clothing on people's bodies more accurately, to visualize their hair more realistically, and to "re-dress" human avatars.

The proposed geometric modeling differs from known solutions in that it uses a different representation (point clouds), which gives the proposed method topological flexibility and the ability to model clothing separately from the body, and also provides a geometric scaffold for appearance modeling with neural rendering.

The invention addresses the task of modifying and processing an image or a set of images (a sequence of video frames). These images are material objects: an image is a set of signals stored in the memory of any suitable device (for example, a computer, a machine-readable medium, or a smartphone); it is acquired by any suitable device (for example, a camera, a smartphone, or a detection device) from a captured material object (a person) and is then processed in accordance with the present invention.

Unlike the mentioned approaches to clothing appearance retargeting, the proposed invention uses explicit 3D geometric models, without relying on individual templates of fixed topology.

On the other hand, the appearance modeling part of the invention requires a video sequence, whereas some of the mentioned works use one or a few images.

Joint modeling of geometry and appearance. Octopus [2] and Multi-Garment Net (MGN) [7] recover a textured mesh of the clothed body based on the SMPL+D model.

Fig. 1 shows the proposed method for modeling the geometry of various types of clothing using point clouds (the bottom row shows arbitrary point colors).

The point clouds are obtained by passing two inputs through a pre-trained deep network: (i) Skinned Multi-Person Linear (SMPL) meshes of the human body, each consisting of 6890 vertices and 13776 faces and estimated from a given image by some method (for example, SMPLify [8]), and (ii) latent clothing code vectors, each consisting of 8 real numbers determined during training. In addition, the proposed method models the appearance of clothing with point-based neural graphics (an example of appearance modeling is shown in the top row of Fig. 1). The appearance of clothing can be captured from a video sequence, whereas a single frame is sufficient for point-based geometric modeling. The video sequence is a video in which a person in their clothing performs arbitrary body movements, showing the clothing from different sides (for example, turning 360 degrees on the spot). Such a video can be obtained, for example, by filming the person with a mobile device camera or a separate professional camera.

"Point-based modeling" here means "point cloud modeling", and this technique is the subject of the present invention. Point cloud modeling yields a reconstruction of the clothing geometry as a 3D point cloud, and this reconstruction can be obtained from a single image (a photograph of a clothed person). One image suffices because one part of the proposed method optimizes the latent code (an 8-dimensional vector) of the clothing for a single picture. The proposed invention is a system that (a) models clothing as 3D point clouds, which allows its structure (geometry) to be reconstructed from a single image of a clothed person and this reconstruction to be adapted to new poses and body shapes; and (b) given not a single photograph but a whole sequence of frames (a video) of a clothed person, reconstructs not only the structure (geometry) of the clothing but also its texture (appearance).

The proposed solution can be used for telepresence applications in augmented reality, virtual reality, on 3D displays, and the like. It can also be used on ordinary 2D screens in programs/environments that need to show images of clothed people.

In accordance with the present invention, realistic models of people's clothing are created that can be animated on top of body mesh models for arbitrary body pose and body shape parameters; the deformable clothing geometry is captured by a point cloud; a geometric model of a new garment can be created from just one image; and a realistic clothing model suitable for re-rendering can be created from a video.

The main novelty of the proposed invention lies in:

- the use of point clouds, and of a neural network that synthesizes such point clouds, to capture and model clothing geometry;

- the use of differentiable point-based neural rendering to capture clothing appearance.

This solution can be implemented on any device with a sufficiently powerful computing unit (for example, a GPU) and a screen, such as a computer or a smartphone. In addition, the solution can be stored on a machine-readable medium as instructions for execution on a computer. Photographs or videos are needed to create the clothing models.

Modeling the realistic clothing that people wear is an important part of the overall task of realistic 3D modeling of people. Clothing is hard to model because it exhibits great variety in geometry (including topological changes) and in appearance (including a wide range of textile textures, prints, and complex fabric reflectance). Physically modeling the static and dynamic interaction of fabric with the body is a further challenge.

A new approach to clothing modeling is proposed. In this approach, clothing geometry is modeled as a relatively sparse point cloud. Using a recently introduced synthetic dataset of simulated clothing, a joint geometric model of diverse human clothing is trained. The model describes a particular garment with a latent code vector (the clothing code), that is, an ordered set of d real numbers, where d is a positive integer (d = 8 in the experiments; d is a hyperparameter tuned during training). Given a clothing code and a human body geometry (represented in SMPL, the most popular format), a deep neural network (the overlay network) then predicts a point cloud that approximates the geometry of the clothing draped over the body.
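The data flow just described (clothing code plus body points in, clothing points out) can be sketched as follows. This is only a toy stand-in under stated assumptions: a single linear layer with a tanh plays the role of the encoder, another linear layer predicts a per-point offset in place of the actual cloud converter, and the weights are random rather than trained. The names `linear`, `init_linear`, `drape`, and `style_dim` are hypothetical.

```python
import math
import random

def linear(x, weights, bias):
    """Affine map; weights[j] holds the input weights of output unit j."""
    return [sum(xi * w for xi, w in zip(x, row)) + b for row, b in zip(weights, bias)]

def init_linear(n_in, n_out, rng):
    """Random (untrained) weights and zero biases for an n_in -> n_out layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def drape(body_points, clothing_code, encoder, converter):
    """Deform a body point cloud into a clothing point cloud, conditioned on the code."""
    enc_w, enc_b = encoder
    conv_w, conv_b = converter
    # Encode the clothing code once; the same style vector conditions every point.
    style = [math.tanh(v) for v in linear(clothing_code, enc_w, enc_b)]
    clothing_points = []
    for point in body_points:
        offset = linear(list(point) + style, conv_w, conv_b)
        clothing_points.append(tuple(p + o for p, o in zip(point, offset)))
    return clothing_points

rng = random.Random(0)
d = 8            # clothing code size reported for the experiments
style_dim = 16   # hypothetical size of the encoder output
encoder = init_linear(d, style_dim, rng)
converter = init_linear(3 + style_dim, 3, rng)
# Random points stand in for the 6890 SMPL vertices.
body = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(6890)]
code = [rng.gauss(0.0, 1.0) for _ in range(d)]
cloth = drape(body, code, encoder, converter)
```

The point of the sketch is the interface: one shared network, with the garment identity carried entirely by the d-dimensional code.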

The main advantage of the proposed model is its ability to cover diverse garments of differing topology with a single latent space of clothing codes and a single overlay network. This is made possible by the choice of the point cloud representation and by the use of point-cloud-specific losses when training the joint model. Once trained, the model generalizes to new clothing, recovers its geometry from data, and drapes the resulting clothing over bodies of various shapes and in new poses. With the proposed model, clothing geometry can be obtained from as little as one image.
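The text does not name the point-cloud-specific loss at this point. One common choice for comparing unordered point sets is the symmetric Chamfer distance, shown below purely as an assumed example; the brute-force nearest-neighbor search is meant for illustration, not for clouds of realistic size.

```python
def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two point clouds (lists of coordinate tuples)."""
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_sided(src, dst):
        # Mean squared distance from each source point to its nearest destination point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(cloud_a, cloud_b) + one_sided(cloud_b, cloud_a)

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0)]
value = chamfer_distance(a, b)  # 0.5: only the second point of `a` is unmatched
```

Because the distance depends only on nearest neighbors, it is unaffected by point ordering and does not require the two clouds to have the same number of points, which is what makes such losses suitable for garments of differing topology.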

The proposed method extends beyond geometry acquisition to appearance modeling, drawing on the ideas of differentiable rendering and point-based neural graphics. Given a video sequence of clothing worn by a person, the photometric properties of the clothing are captured by neural descriptors attached to the points of the point cloud and by the parameters of the rendering network (decoder). The neural point descriptors and the rendering network (which capture the photometric properties) are fitted jointly with the estimation of the clothing code (which captures the clothing geometry) in a single optimization process. After fitting, the clothing can be realistically transferred to and re-rendered on new bodies and in new poses.

The experiments evaluated the ability of the proposed geometric model to capture the deformable geometry of new clothing using point clouds. The ability of the full method to capture both the geometry and the texture of clothing from video, and to re-render the learned clothing for new targets, was then tested.

First, consider the point cloud overlay model. The goal of this model is to capture, using point clouds, the geometry of diverse human clothing draped over human bodies of various shapes and in various poses. A latent model of such point clouds is proposed that can be fitted to a single image or to more complete data. A combination of point cloud overlay with neural rendering, which makes it possible to capture the appearance of clothing from video, is described afterwards.

Point cloud overlay

A model is trained using generative latent optimization (GLO) [9]. The training dataset contains N outfits, each associated with a d-dimensional vector z (the clothing code). Accordingly, the codes $\{z_1, \dots, z_N\}$ are randomly initialized, where $z_i \in \mathcal{Z} \subseteq \mathbb{R}^d$ for all i = 1, …, N. Here $\mathcal{Z}$ is the space of clothing code vectors and $\mathbb{R}^d$ is the d-dimensional space of real numbers (the space of real vectors of length d).

During training, the shape of each outfit is considered for a diverse set of human poses. The target shapes are given as a set of geometries. In the present case the synthetic CLOTH3D dataset [6] was used, which provides shapes as meshes of varying topology. In this dataset each subject wears an outfit and performs a sequence of movements. For each outfit i and each frame j of the corresponding sequence, points were sampled from the mesh of that outfit to obtain the point cloud $x_i^j \in \mathcal{X}$, where $\mathcal{X}$ denotes the space of point clouds of a fixed size (8192 points in the experiments). The length of the training sequence of the i-th outfit is denoted $P_i$. It is also assumed that the body mesh $s_i^j \in \mathcal{S}$ is given; the experiments use the SMPL mesh format [28], so $\mathcal{S}$ denotes the space of SMPL meshes for varying body shape and body pose parameters. Putting everything together yields the dataset $\{(z_i, s_i^j, x_i^j)\}_{i=1,\dots,N;\ j=1,\dots,P_i}$ of clothing codes, SMPL meshes, and clothing point clouds.
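The text says only that points were sampled from each garment mesh to obtain a fixed-size cloud. One common way to do this, shown here as a minimal sketch and not as the authors' procedure, is area-weighted uniform surface sampling. The function name `sample_point_cloud` and the toy square mesh are illustrative assumptions.

```python
import numpy as np

def sample_point_cloud(vertices, faces, n_points=8192, seed=0):
    """Sample n_points uniformly from the surface of a triangle mesh.

    Assumption: area-weighted sampling; the source does not specify the scheme.
    """
    rng = np.random.default_rng(seed)
    tri = vertices[faces]                                   # (F, 3, 3)
    # Triangle areas from the cross product of two edge vectors.
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    # Pick a triangle for every sample with probability proportional to its area.
    face_idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Uniform barycentric coordinates via the square-root trick.
    r1 = np.sqrt(rng.random(n_points))
    r2 = rng.random(n_points)
    w = np.stack([1.0 - r1, r1 * (1.0 - r2), r1 * r2], axis=1)  # (n, 3)
    return np.einsum("nk,nkd->nd", w, tri[face_idx])

# Toy mesh: a unit square in the z = 0 plane, split into two triangles.
vertices = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
faces = np.array([[0, 1, 2], [0, 2, 3]])
cloud = sample_point_cloud(vertices, faces)
```

Because the number of samples is fixed, garments whose meshes differ in vertex count and connectivity all end up in the same space of 8192-point clouds.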

Since the objective of the invention is to learn to predict geometry in new poses and for new body shapes, the overlay function $G_\theta : \mathcal{Z} \times \mathcal{S} \to \mathcal{X}$ is introduced, which maps a latent code and an SMPL mesh (representing the naked body) to a clothing point cloud. Here θ denotes the learnable parameters of the function. Training is then performed by optimizing the following objective:

$$\min_{\theta \in \Theta,\; z_1, \dots, z_N} \; \frac{1}{N} \sum_{i=1}^{N} \frac{1}{P_i} \sum_{j=1}^{P_i} L_{3D}\big(G_\theta(z_i, s_i^j),\, x_i^j\big) \qquad (1)$$

where

θ - the overlay network parameters to be optimized;

Θ - the tensor space of overlay network parameters;

$z_1, \dots, z_N$ - the clothing code vectors to be optimized;

N - the number of training subjects;

$P_i$ - the number of body poses of training subject i;

$L_{3D}$ - the loss function that measures the quality of the 3D reconstruction (a distance between two point clouds);

$G_\theta$ - the overlay network;

$s_i^j$ - the 3D point cloud sampled from the surface of the cropped SMPL mesh (hand, foot, and head vertices removed) of the body of subject i in body pose j;

$x_i^j$ - the 3D point cloud of the ground-truth clothing of subject i in body pose j.
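The nested averaging in the objective (first over the $P_i$ poses of each outfit, then over the N outfits) can be spelled out in a few lines. This is only an illustrative sketch: the toy overlay function `toy_G` and the squared-distance loss `toy_loss` are placeholders chosen to keep the arithmetic checkable by hand, not the network or the loss used in the source.

```python
import numpy as np

def objective(G, theta, codes, bodies, clothes, loss_3d):
    """Mean over outfits i of the mean over poses j of loss_3d(G(z_i, s_i^j), x_i^j)."""
    per_outfit = []
    for z_i, s_i, x_i in zip(codes, bodies, clothes):
        losses = [loss_3d(G(theta, z_i, s), x) for s, x in zip(s_i, x_i)]
        per_outfit.append(np.mean(losses))       # (1 / P_i) * sum over j
    return float(np.mean(per_outfit))            # (1 / N) * sum over i

# Placeholder overlay function: shift every body point by theta * z (hypothetical).
def toy_G(theta, z, s):
    return s + theta * z

# Placeholder loss: mean squared distance between corresponding points (hypothetical).
def toy_loss(a, b):
    return float(np.mean(np.sum((a - b) ** 2, axis=1)))

codes = [np.zeros(3), np.ones(3)]                              # N = 2 outfits
bodies = [[np.zeros((4, 3))], [np.zeros((4, 3)), np.zeros((4, 3))]]   # P_1 = 1, P_2 = 2
clothes = [[np.zeros((4, 3))], [np.ones((4, 3)), np.zeros((4, 3))]]
value = objective(toy_G, 1.0, codes, bodies, clothes, toy_loss)
```

Averaging per outfit before averaging over outfits means that a long sequence does not outweigh a short one.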

The objective in (1) is the mean reconstruction loss over the training point clouds in the training dataset; $L_{3D}$ is thus the 3D reconstruction loss. In the experiments, an approximate algorithm for computing the Earth Mover's Distance [32] was used as the 3D reconstruction loss. Note that because this loss measures the distance between point clouds and ignores all topological properties, the proposed training formulation is naturally suited to learning garments with different topologies.
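To make the role of the Earth Mover's Distance concrete, the sketch below computes it exactly for two tiny, equal-size point clouds by trying every one-to-one matching. This brute-force version is an illustration only: the source uses an approximate algorithm [32], and normalizing by the number of points is an assumption made here.

```python
from itertools import permutations
import numpy as np

def emd_exact(a, b):
    """Exact Earth Mover's Distance between two equal-size point clouds.

    Brute force over all bijections, so it is usable only for a handful of points.
    """
    assert a.shape == b.shape
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)   # (n, n) pairwise distances
    rows = np.arange(len(a))
    best = min(dist[rows, list(perm)].sum() for perm in permutations(range(len(b))))
    return best / len(a)                                           # mean matched distance

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
same_set = emd_exact(a, a[::-1])                    # same points, different order
shifted = emd_exact(a, a + np.array([0.0, 0.0, 2.0]))
```

The distance depends only on the two sets of points, not on their ordering or on any mesh connectivity, which is why the same loss can be applied to garments of different topology.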

The optimization is performed jointly over the parameters of the overlay function $G_\theta$ and the latent clothing codes $z_i$ for all i = 1, …, N. Following [9], to regularize the process, the clothing codes are clipped to the unit ball during optimization.
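Clipping the codes to the unit ball can be sketched as a projection applied after every gradient step. The helper names and the learning rate below are assumptions for illustration; the source does not give the update rule.

```python
import numpy as np

def project_to_unit_ball(codes):
    """Rescale every code whose L2 norm exceeds 1 so that its norm becomes 1."""
    norms = np.linalg.norm(codes, axis=1, keepdims=True)
    return codes / np.maximum(norms, 1.0)

def glo_code_step(codes, grads, lr=0.1):
    """One gradient step on the codes followed by the projection (hypothetical update)."""
    return project_to_unit_ball(codes - lr * grads)

codes = np.array([[3.0, 4.0], [0.3, 0.4]])          # norms 5.0 and 0.5
projected = project_to_unit_ball(codes)
stepped = glo_code_step(codes, grads=np.ones_like(codes))
```

Codes that are already inside the ball are left untouched, so the projection constrains only the scale of the latent space.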

The optimization process thus establishes both the latent space of clothing codes and the parameters of the overlay function.

Overlay network. The overlay function $G_\theta(z, s)$ is implemented as a neural network that takes the SMPL mesh s, treats it as a point cloud, and transforms that point cloud into a clothing point cloud. Point clouds have recently become (almost) first-class citizens in deep learning, as a number of architectures that can take point clouds as input and/or produce them as output have been proposed. This work uses the recently introduced Cloud Transformer architecture [37] (referred to below as the cloud converter) because of its strong results on a range of diverse tasks.

The cloud converter consists of blocks, each of which sequentially rasterizes, convolves, and de-rasterizes the point cloud at learned, data-dependent positions. The cloud converter thus deforms the input point cloud (derived from the SMPL mesh, as discussed below) into the output point cloud x over several blocks. To reduce computational complexity and memory requirements, a simplified version of the cloud converter with single-headed blocks is used.

In all other respects the authors followed the generator architecture proposed in [37] for image-based shape reconstruction, which in that case took a point cloud (sampled from a unit sphere) and a vector (computed by an image encoder network) as input and output the point cloud of the shape shown in the image. This part of the neural network architecture, which consists of the cloud converter, is identical to the one proposed in [37] for reconstructing a point cloud from an image, except for the single-headed-block modification, i.e. the number of parallel "heads" of the converter is reduced in the proposed invention. "That case" refers to the original architecture of the cloud converter for image-based point cloud reconstruction, which takes as input an image and a point cloud in the form of a single 3D sphere of points.

In the present case the input point cloud and the vector are different: they correspond to the SMPL mesh and the clothing code, respectively. The clothing code is an ordered set of 8 real numbers that is initialized with random values and then updated during training so that each vector corresponds to one particular clothing style. "SMPL mesh" here is synonymous with "3D model", i.e. it should be read as "SMPL 3D model". The SMPL 3D model is a set of 6890 vertices, each with 3D coordinates, and 13776 triangles (faces), each defined by three vertices. This 3D model was proposed in 2016 in [34], listed among the citations, and was created to accurately model human bodies of various shapes and in various poses. The point cloud encoding the pose and shape of the human body and the vector encoding the desired clothing style are separate objects that are fed into the overlay network independently and at different input points.

More specifically, to feed the SMPL mesh into the cloud converter architecture, the authors first remove the parts of the mesh that correspond to the head, feet, and hands, and then treat the remaining vertices as a point cloud. The vertices here are the points defined in the SMPL model, 6890 in total, which are connected by edges. Each vertex has an index in the SMPL model and its own 3D coordinates, and the set of all 6890 vertices is simply a 3D point cloud in the shape of a human body. The authors thus remove the parts of the SMPL 3D body model associated with the hands, feet, and head and take the remaining vertices as the input point cloud (the 3D point cloud taken from the surface of the cropped SMPL mesh of the body of subject i in body pose j). To make the point cloud denser, the midpoints of the SMPL mesh edges that connect these vertices are added to it as well. The resulting point cloud, which is shaped by the SMPL mesh and therefore reflects changes in pose and body shape, is fed into the cloud converter.
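The construction of the input cloud (crop, keep the surviving vertices, add edge midpoints) can be sketched on a toy mesh. The function `body_point_cloud` and the index list `removed` are hypothetical; the actual SMPL vertex indices of the head, hands, and feet are not given in the text, and the sketch assumes midpoints are taken only on edges whose two endpoints survive the crop.

```python
import numpy as np

def body_point_cloud(vertices, faces, removed):
    """Drop the `removed` vertices, then add midpoints of the surviving edges."""
    keep = np.ones(len(vertices), dtype=bool)
    keep[removed] = False
    # Keep only triangles whose three vertices all survive the crop.
    faces = faces[keep[faces].all(axis=1)]
    # Collect every undirected edge exactly once.
    edges = np.concatenate([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
    edges = np.unique(np.sort(edges, axis=1), axis=0)
    midpoints = vertices[edges].mean(axis=1)
    return np.concatenate([vertices[keep], midpoints])

# Toy mesh: a unit square plus one extra vertex (index 4) standing in for a removed part.
vertices = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0], [0.5, 2, 0]], dtype=float)
faces = np.array([[0, 1, 2], [0, 2, 3], [2, 3, 4]])
cloud = body_point_cloud(vertices, faces, removed=[4])   # 4 vertices + 5 edge midpoints
```

Applied to a real SMPL mesh, the same steps would yield the cropped vertices plus one extra point per remaining edge.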

Following [37], the latent clothing code z is injected into the cloud converter through AdaIN connections [18], which modulate the convolutional feature maps inside the rasterization-de-rasterization blocks. The specific weights and biases for each AdaIN connection are predicted from the latent code z by a perceptron, as is common for style-based generators [26]. The authors note that although good results were obtained with the (simplified) cloud converter architecture, other deep learning architectures that operate on point clouds (for example, PointNet [47]) can be used.
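The AdaIN mechanism itself is simple to write down: each feature channel is normalized to zero mean and unit variance and then rescaled and shifted by values predicted from the code. The sketch below assumes a 2D feature map and replaces the perceptron with a single random linear layer (`style_params`), both of which are illustrative simplifications.

```python
import numpy as np

def adain(features, scale, bias, eps=1e-5):
    """Normalize each channel over its spatial axes, then apply a per-channel affine map."""
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normed = (features - mean) / (std + eps)
    return scale[:, None, None] * normed + bias[:, None, None]

def style_params(z, weight, offset):
    """Single linear layer standing in for the perceptron: code -> (scale, bias)."""
    out = weight @ z + offset                      # (2 * C,)
    scale, bias = np.split(out, 2)
    return scale, bias

rng = np.random.default_rng(0)
C, d = 4, 8                                        # channels (assumed) and code dimension
z = rng.normal(size=d)
feature_map = rng.normal(size=(C, 16, 16))
scale, bias = style_params(z, rng.normal(size=(2 * C, d)), rng.normal(size=2 * C))
modulated = adain(feature_map, scale, bias)
```

After modulation, the per-channel statistics of the feature map are dictated by the clothing code, which is how one network can produce different garments from the same body cloud.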

It should also be noted that the morphing implemented by the overlay network is strongly non-local (i.e. the proposed model does not simply compute local vertex displacements) and is consistent across outfits and poses (FIG. 2). FIG. 2 shows color-coded results of the overlay network. Each row corresponds to one pose. The leftmost image shows the input to the overlay network, and the remaining columns correspond to three clothing codes. The color coding corresponds to spectral coordinates on the surface of the SMPL mesh. It shows that the overlay transformation is clearly non-local and also reveals correspondences between similar parts of the clothing point clouds at the outputs of the overlay network.

The overlay network does not merely predict a local displacement for each point of the input SMPL point cloud; it changes the positions of the points substantially, so that the final predicted cloud resembles the geometry of the clothing. For example, points that were originally on the hand of the SMPL 3D body model become points at the bottom of the garment. They "flow" from top to bottom, which is not a local change of position and indicates the ability of the network to transform the input cloud substantially in order to reduce the loss function.

Overlay network details

FIG. 3 shows the overlay network, which converts a body point cloud (left) and a clothing code (top) into a clothing point cloud adapted to the body pose and body shape.

On the left and at the top are the two inputs of the overlay network: a 3D point cloud (M×3 in FIG. 3) sampled from the cropped SMPL mesh, and a d-dimensional clothing code vector (d = 8 in the experiments), respectively. GLO is the generative latent optimization method [9] used to learn the clothing code vectors. The clothing code vector (Z0 in FIG. 3) is processed by a multilayer perceptron (MLP) neural network (the encoder), and its output is then passed to the cloud converter neural network. The cloud converter deforms the input point cloud, taking into account the output of the MLP encoder, and outputs the predicted 3D clothing point cloud.

To build a strong geometric prior on clothing, the overlay function $G_\theta$ is pre-trained on the synthetic CLOTH3D dataset.

The dataset is split into training and validation parts, which yields N = 6475 training video sequences. Because most consecutive frames share a similar pose and clothing geometry, only every 10th frame is used for training. As described in the "Point cloud overlay" section, the authors randomly initialize $\{z_1, \dots, z_N\}$, where $z_i \in \mathcal{Z} \subseteq \mathbb{R}^d$, for each identity i in the dataset. In these experiments the latent code dimension was set relatively low, d = 8, to avoid overfitting during the subsequent single-image shape fitting (as described in the "Point cloud overlay" section).

The clothing codes $z_i$ are passed to an MLP encoder consisting of 5 fully connected layers to obtain a 512-dimensional latent representation, which is then passed to the AdaIN branch of the cloud converter network. Pose and body information is supplied as the SMPL point cloud with the hand, foot, and head vertices removed (see FIG. 1). The overlay network outputs 3D point clouds with 8192 points in all experiments. The approximate Earth Mover's Distance [32] was chosen as the loss function, and the GLO vectors and the overlay network were optimized simultaneously using Adam [28].
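The shape of the encoder (an 8-dimensional code in, a 512-dimensional representation out, 5 fully connected layers) can be illustrated as follows. The hidden widths, the ReLU activations, and the initialization are assumptions; the text states only the layer count and the input and output sizes.

```python
import numpy as np

def init_mlp(sizes, seed=0):
    """Random weights and zero biases for consecutive fully connected layers."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(scale=np.sqrt(2.0 / m), size=(n, m)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_encoder(z, layers):
    """Apply the layers in order, with a ReLU between them (activation is assumed)."""
    h = z
    for k, (w, b) in enumerate(layers):
        h = w @ h + b
        if k < len(layers) - 1:
            h = np.maximum(h, 0.0)
    return h

# 8 -> 64 -> 128 -> 256 -> 512 -> 512: five layers, hidden widths are hypothetical.
layers = init_mlp([8, 64, 128, 256, 512, 512])
style = mlp_encoder(np.random.default_rng(1).normal(size=8), layers)
zero_style = mlp_encoder(np.zeros(8), layers)
```

The 512-dimensional output plays the role of the style vector from which the per-connection AdaIN parameters are derived.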

Although pre-training yields highly expressive priors for dresses and skirts, the diversity of tighter-fitting outfits is somewhat limited. The authors hypothesize that this effect is mainly due to a strong bias towards jumpsuits in the tight-fitting clothing categories of CLOTH3D.

Clothing code estimation

After the overlay network has been pre-trained on a large synthetic dataset [6], the geometry of previously unseen clothing can be modeled. Fitting can be performed on a single image or on several images. For a single image, the authors optimize the clothing code z* so that it matches the clothing segmentation mask in the image.

This process is illustrated in FIG. 4. In more detail, a binary clothing mask is predicted by passing the given RGB image through the Graphonomy network [12] and taking the union of all semantic masks that correspond to clothing (right side of FIG. 4). The authors also fit an SMPL mesh to the person in the image using the SMPLify method [8]. The pre-trained overlay network then generates a point cloud from some initial clothing code (top left of FIG. 4) and from an initial point cloud obtained from the cropped SMPL model, as described above, for this image (left side of FIG. 4).

The clothing point cloud predicted by the overlay network from this vector and this point cloud is then projected onto a new black-and-white image using the same camera parameters with which the ground-truth black-and-white clothing mask was obtained (middle of FIG. 4). The two black-and-white masks are compared by computing a 2D chamfer loss, and the error is back-propagated to the input clothing code, whose values are updated so that the loss decreases. During this optimization only the values of the clothing code fed into the overlay network change; the remaining parameters of the system stay fixed. "Freeze" in FIG. 4 means that the parameters of the overlay network remain unchanged throughout the optimization.
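The comparison step can be sketched as a projection followed by a 2D chamfer distance. The pinhole camera model, the parameters `focal` and `center`, and the choice to compare projected point coordinates directly with the coordinates of the mask's foreground pixels are assumptions made for illustration; the source does not specify how the projected mask is formed.

```python
import numpy as np

def project(points, focal, center):
    """Pinhole projection of camera-space points (z > 0) to pixel coordinates (x, y)."""
    return focal * points[:, :2] / points[:, 2:3] + center

def chamfer_2d(p, q):
    """Symmetric 2D chamfer distance between two sets of pixel coordinates."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=2)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# Toy ground-truth clothing mask: a 4 x 2 block of foreground pixels.
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 3:5] = True
ys, xs = np.nonzero(mask)
mask_pixels = np.stack([xs, ys], axis=1).astype(float)

# Toy "predicted" cloud built so that it projects exactly onto the mask pixels.
focal, center, depth = 4.0, np.array([4.0, 4.0]), 2.0
points = np.concatenate([(mask_pixels - center) * depth / focal,
                         np.full((len(mask_pixels), 1), depth)], axis=1)

loss_match = chamfer_2d(project(points, focal, center), mask_pixels)
# Moving the cloud by 0.5 units in x shifts its projection by one pixel.
loss_shifted = chamfer_2d(project(points + [0.5, 0.0, 0.0], focal, center), mask_pixels)
```

In the method described above, a loss of this kind would be differentiated with respect to the clothing code only, with the overlay network weights frozen.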

For complex outfits the authors observed instability in the optimization process, which often ends in undesirable local minima. To stabilize the fitting, the optimization is started independently from several random initial values $\{z^*_1, \dots, z^*_T\}$ (T = 4 in the experiments). After several optimization steps the average clothing vector

$$\bar{z}^* = \frac{1}{T} \sum_{t=1}^{T} z^*_t$$

is computed, and the optimization then continues from $\bar{z}^*$ until convergence. This simple technique was observed to consistently yield accurate clothing codes.

Typically, 100 training steps are performed while optimizing the T hypotheses. After averaging, the optimization takes a further 50-400 steps, depending on the complexity of the clothing geometry.
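The multi-start schedule (T independent hypotheses, a warm-up phase, averaging, then refinement from the average) is sketched below on a toy quadratic loss. The function `fit_code`, plain gradient descent, the step size, and the toy gradient are stand-ins; in the method described above the gradient would come from back-propagating the mask loss through the frozen overlay network.

```python
import numpy as np

def fit_code(grad, d=8, T=4, warmup_steps=100, final_steps=200, lr=0.05, seed=0):
    """Multi-start fitting: optimize T random codes, average them, then refine the mean."""
    rng = np.random.default_rng(seed)
    hypotheses = rng.normal(scale=0.1, size=(T, d))        # T random initial codes
    for _ in range(warmup_steps):                          # independent warm-up
        hypotheses -= lr * np.stack([grad(z) for z in hypotheses])
    z = hypotheses.mean(axis=0)                            # average clothing vector
    for _ in range(final_steps):                           # continue from the average
        z = z - lr * grad(z)
    return z

# Toy loss ||z - target||^2 with gradient 2 * (z - target); purely illustrative.
target = np.linspace(-0.5, 0.5, 8)
z_fit = fit_code(lambda z: 2.0 * (z - target))
```

On a loss with several local minima, the averaging step is what lets the refinement start from a point informed by all T runs.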

Appearance modeling

Point-based rendering. Most applications of clothing modeling go beyond geometric modeling and also require appearance modeling. It has recently been shown that point clouds provide a good geometric basis for neural rendering [1, 58, 38]. The authors follow the neural point-based graphics (NPBG) approach [1] to add appearance modeling to the proposed system.

FIG. 5 shows the use of neural point-based graphics to model the appearance of clothing. A set of neural appearance descriptors and a render network are learned, which together make it possible to turn a rasterization of the clothing point cloud (left) into a realistic masked image of the clothing (right).

Starting from the left of Fig. 5, there is a fixed 3D point cloud and a set of trainable neural descriptors, with one 16-dimensional (in general, n-dimensional) appearance descriptor vector attached to each point of the point cloud. This point cloud with the attached trainable appearance descriptors is passed to the rasterization block, where a differentiable rasterizer takes into account the occlusion of the point cloud by the SMPL mesh of the human body and generates a 16-channel image tensor using the 3D coordinates and the neural descriptor of each point (the "pseudo-color image" in Fig. 5). The rasterizer also generates a binary black-and-white rasterization mask marking the image pixels covered by points. The 16-channel image tensor is then passed to the render network together with the binary rasterization mask, and the network predicts the final 3-channel RGB image and the final binary mask of the clothing silhouette.

Thus, when modeling the appearance of a particular garment with code z, p-dimensional latent appearance vectors T = {t[1], ..., t[M]} are attached to the M points of the point cloud that models its geometry. A render network with learnable parameters is also introduced. To obtain a realistic rendering of the clothing for a given body pose s and camera position C, the point cloud is first computed and then rasterized onto an image grid of the chosen resolution, using the camera parameters and the neural descriptor t[m] as the pseudo-color of the m-th point. The rasterization result, which is a p-channel image, is concatenated with the rasterization mask, which indicates the non-zero pixels, and the two are then converted by the render network into an RGB color image of the clothing and a clothing mask (i.e., a four-channel image).

During rasterization, the SMPL mesh of the body is also taken into account, and points occluded by the body are not rasterized. A simplified U-net [50] is used as the render network.
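The rasterization step described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the differentiable rasterizer of the proposed system: the points are assumed to be already projected into pixel coordinates with a depth value, each point covers exactly one pixel, and the body is represented by a hypothetical per-pixel depth map `body_depth`. The function name `rasterize` and all parameter names are illustrative.

```python
def rasterize(points, descriptors, width, height, body_depth=None):
    """Rasterize projected points into a p-channel pseudo-color image.

    points      -- list of (x, y, depth) tuples in pixel coordinates
    descriptors -- list of p-dimensional descriptor vectors, one per point
    body_depth  -- optional height x width grid with the depth of the body
                   surface at each pixel (float('inf') where there is no body)
    Returns (image, mask): image[y][x] is a p-vector, mask[y][x] is 0 or 1.
    """
    p = len(descriptors[0])
    image = [[[0.0] * p for _ in range(width)] for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    zbuf = [[float("inf")] * width for _ in range(height)]
    for (x, y, depth), desc in zip(points, descriptors):
        px, py = int(round(x)), int(round(y))
        if not (0 <= px < width and 0 <= py < height):
            continue  # the point projects outside the image grid
        if body_depth is not None and depth > body_depth[py][px]:
            continue  # the point is hidden behind the body surface
        if depth < zbuf[py][px]:
            # Keep only the point closest to the camera at each pixel.
            zbuf[py][px] = depth
            image[py][px] = list(desc)
            mask[py][px] = 1
    return image, mask


# Tiny example: a 4 x 3 grid, 2-dimensional descriptors.
inf = float("inf")
body_depth = [[inf] * 4 for _ in range(3)]
body_depth[0][2] = 3.0  # the body surface covers pixel (x=2, y=0) at depth 3
points = [(1.0, 1.0, 2.0), (1.0, 1.0, 1.0), (2.0, 0.0, 5.0), (9.0, 9.0, 1.0)]
descriptors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [9.0, 9.0]]
image, mask = rasterize(points, descriptors, 4, 3, body_depth)
```

In the example, the second point overwrites the first because it is closer to the camera, the third point is culled by the body, and the fourth falls outside the grid. The resulting p-channel image and mask are what a render network would receive as input.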

Capturing appearance from video

The proposed method makes it possible to capture the appearance of clothing from video. This is done with a two-stage optimization. In the first stage, the clothing code is optimized by minimizing the chamfer loss between the projections of the point cloud and the segmentation masks, as described in the previous section. Then the latent appearance vectors T and the parameters of the render network are optimized jointly. The second stage uses (1) the perceptual loss [24] between the masked video frame and the RGB image rendered by the proposed model, and (2) the Dice loss between the segmentation mask and the rendering mask predicted by the render network.
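Two of the losses named above can be illustrated with a short sketch. The exact formulations used in the proposed system are not given in this section, so the code assumes a common soft Dice loss with a small smoothing constant and a symmetric chamfer distance based on mean squared nearest-neighbor distances. The perceptual loss is omitted because it depends on a pretrained network. The function names are illustrative.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between two masks given as flat lists of values in [0, 1]."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def chamfer_2d(a, b):
    """Symmetric chamfer distance between two non-empty sets of 2D points.

    Uses the mean squared distance to the nearest neighbor in each direction
    (an assumption; other variants use unsquared distances).
    """
    def one_way(src, dst):
        return sum(
            min((sx - dx) ** 2 + (sy - dy) ** 2 for dx, dy in dst)
            for sx, sy in src
        ) / len(src)

    return one_way(a, b) + one_way(b, a)


# Identical masks give zero Dice loss; disjoint masks give a loss close to 1.
same = dice_loss([1, 1, 0, 0], [1, 1, 0, 0])
disjoint = dice_loss([1, 1, 0, 0], [0, 0, 1, 1])

# Two single-point sets at distance 5: 25 in each direction, 50 in total.
d = chamfer_2d([(0.0, 0.0)], [(3.0, 4.0)])
```

For the first stage, `a` would hold the projected clothing points and `b` the pixel coordinates of the segmentation mask. For the second stage, `pred` would be the mask predicted by the render network and `target` the segmentation mask.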

Appearance optimization requires a video of a person whose entire body surface is visible in at least one frame. In the experiments, the training sequences consist of 600-2800 frames per person. On an NVIDIA Tesla P40 GPU, the whole process takes about 10 hours.

After optimization, the resulting clothing model can be rendered for SMPL body shapes in arbitrary poses, producing RGB images and segmentation masks.

Experiments

Geometric modeling and appearance modeling according to the proposed method were evaluated and compared on known datasets. Two datasets of videos of people were used to evaluate these two stages. The PeopleSnapshot dataset, presented in [3], contains 24 videos of people in various clothes rotating in an A-pose. As far as clothing is concerned, it contains no examples of people in skirts and therefore does not reveal all the advantages of the proposed method. A part of the AzurePeople dataset, presented in [20], was also used for evaluation. This part contains videos of eight people in clothes of varying complexity, captured with five RGBD Kinect cameras.

Images of skirts were not the only ones used, but their presence demonstrates the versatility of the proposed method in modeling different clothing topologies. For example, a 3D model of trousers and a 3D model of a skirt have different vertex connectivity, and an advantage of the proposed system is that it can predict at least both of these topologies, which many other modern methods cannot do (most of them do not reconstruct skirts in any acceptable form).

RGBD Kinect cameras. For both datasets, clothing segmentation masks were generated with the Graphonomy method [12] and SMPL meshes with SMPLify [8]. To run all the methods in the comparison, OpenPose keypoints [10], DensePose UVI renders [14], and SMPL-X meshes [45] were also predicted.

In addition to the evaluation datasets described above, the Cloth3D dataset [6] was used to train the proposed geometric metamodel.

The Cloth3D dataset contains 11,300 garments of varying geometry, modeled as meshes draped over 8,500 SMPL bodies undergoing pose changes. The fitting uses physics-based simulation.

Reconstruction of clothing geometry

This series of experiments evaluated the ability of the proposed method to reconstruct the geometry of clothing from a single photograph.

The proposed method makes it possible to reconstruct the 3D geometry of clothing from a single photograph. Taking one specific image as an example, the clothing in that image is reconstructed by the proposed system and by the systems with which it is compared. The reconstructed 3D clothing models are then placed in a pose that these systems presumably did not see during training. The pose itself is chosen at random from a separate set of poses previously extracted from photographs of people in different poses. The images from which the poses are taken are not shown anywhere; only the poses are used.

When assessing quality, the evaluator sees three things in one picture: in the center, an image of a person wearing the clothing to be reconstructed; on one side of it (left or right), the 3D geometry of this clothing reconstructed by the proposed method in a new pose; and on the other side, the 3D geometry of this clothing reconstructed by another method in the same new pose.

In each case, the 3D geometry of the clothing is displayed either as a point cloud rendered over the SMPL model of the human body (for the proposed method) or as a rendered 3D mesh of the whole body together with the clothing (for the other methods). The lighting used for rendering is the same and constant, and the color of all vertices of all 3D models is the same and constant (gray).

The results of the following three methods were compared:

1. The Tex2Shape method [4], which predicts displacements of the SMPL mesh vertices in texture space. It is ideally suited to the PeopleSnapshot dataset, but less suitable for the AzurePeople sequences with skirts and dresses.

2. The Multi-outfit Net (MGN) method [7], which predicts clothing layered on top of SMPL body models. It offers a virtual wardrobe of pre-fitted outfits and can also fit new outfits from a single image.

3. The proposed point-based method, which predicts the geometry of clothing as a point cloud.

The compared systems use different formats for reconstructing clothing (point clouds, vertex offsets, meshes). Moreover, they actually solve slightly different problems: for example, the proposed method and Multi-outfit Net reconstruct clothing, while Tex2Shape reconstructs meshes that include clothing, body and hair. However, all three systems support retargeting to new poses. The authors therefore decided to evaluate the relative performance of these three methods with a user study assessing the realism of clothing retargeting.

Users were presented with triplets of images, in which the middle image shows the source photograph and the side images show the results of the two methods being compared (as shaded mesh renders for the same new pose). The results of these pairwise comparisons (user preferences) over 1,500 user comparisons are shown in Table 1.

The first row of the table presents the results of the user study on the PeopleSnapshot dataset. The second row presents the results on the AzurePeople dataset. The first column compares the clothing geometry predicted by the proposed method with that of the Tex2Shape method, the second column compares it with that of the MGN method, and the third column compares it with that of the Octopus method. In each cell of the table, the first number (before "vs") is the proportion of users who preferred the clothing geometry predicted by the proposed method, and the second number (after "vs") is the proportion of users who preferred the clothing geometry predicted by the method named in the corresponding column.

Table 1: Results of a user study in which users compared the quality of the reconstructed 3D clothing geometry (fitted to a single image). The proposed method was preferred on the AzurePeople dataset, which contains looser clothing, while the previously proposed methods are better suited to tighter clothing with a fixed topology.
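The way a cell of such a table is derived from raw pairwise votes can be sketched as follows. This is only an illustration of the tallying: the vote labels `"ours"` and `"baseline"`, the function names and the sample votes are hypothetical and are not data from the user study.

```python
from collections import Counter


def preference_shares(votes):
    """Return the percentage of votes for 'ours' and for 'baseline'."""
    counts = Counter(votes)
    total = counts["ours"] + counts["baseline"]
    if total == 0:
        raise ValueError("no votes to tally")
    return 100.0 * counts["ours"] / total, 100.0 * counts["baseline"] / total


def format_cell(votes):
    """Format one table cell in the 'X% vs Y%' style described in the text."""
    ours, baseline = preference_shares(votes)
    return f"{ours:.1f}% vs {baseline:.1f}%"


# Hypothetical votes for one dataset and one baseline method.
sample_votes = ["ours", "ours", "ours", "baseline"]
cell = format_cell(sample_votes)
```

Each cell is computed from the votes for one dataset (row) and one competing method (column), so the two numbers in a cell always sum to 100%.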

Users showed a strong preference for the proposed method on the AzurePeople dataset, which contains skirts and dresses, whereas Tex2Shape and MGN were preferred on the PeopleSnapshot dataset, which contains tighter clothing with a fixed topology. Fig. 6 shows typical cases, and more extensive qualitative comparisons are presented in the supplementary materials. It should be noted that in the user study the points of the proposed method are shown in gray, to exclude the influence of color on the users' choice.

Fig. 6 shows the predicted geometries in validation poses, fitted to a single frame (training image on the left). In more detail, the first column (training image) shows the image (one image per row) on which the proposed system and the works being compared were trained. The task was to reconstruct the 3D geometry of the clothing from the image as accurately as possible. Each of the following columns shows the result of adapting the learned clothing geometry to a new, previously unseen pose by the respective method: Tex2Shape [4] in the second column, Multi-outfit Net (MGN) [7] in the third column, and Octopus [2] in the fourth column. The last column shows the result of adapting the clothing geometry predicted by the proposed system to the new pose. For the proposed method (right), the geometry is defined by a point cloud, while for Tex2Shape and Multi-outfit Net (MGN) the outputs are mesh-based. The proposed method was able to reconstruct the dress, while the other methods failed at this task (bottom row). It should be noted that the proposed method can also reconstruct tighter clothing (top row), although in this case Tex2Shape, with its displacement-based approach, gives the best result.

For completeness, the supplementary materials present additional comparisons of the proposed method with the Octopus system [2], which is not ideally suited to reconstruction from a single photograph.

Appearance Modeling

The proposed appearance modeling pipeline was evaluated against the StylePeople system [20] (multi-frame variant), which in many respects is the closest analogue of the proposed system. StylePeople fits a neural texture of the SMPL-X mesh together with a render network, using a video of a person and backpropagation.

For the comparison, the StylePeople system was modified to generate clothing masks along with RGB images and foreground segmentations. Both methods were trained separately on each person from the AzurePeople and PeopleSnapshot datasets.

The clothing images generated for held-out (control) views were then compared using three metrics that measure visual similarity to the ground-truth images: the learned perceptual image patch similarity distance (LPIPS) [63], structural similarity (SSIM) [57], and its multiscale version (MS-SSIM).
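As an illustration of one of these metrics, the sketch below computes a simplified, global variant of SSIM between two grayscale images given as flat lists of pixel values. It is an assumption-laden simplification: standard SSIM implementations average the index over local (typically Gaussian-weighted) windows, whereas this sketch uses statistics of the whole image, and the function name `global_ssim` is illustrative. The constants follow the commonly used values K1 = 0.01 and K2 = 0.03.

```python
def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM computed from whole-image statistics.

    x, y       -- flat lists of pixel intensities of equal length
    data_range -- dynamic range of the pixel values (1.0 for [0, 1] images)
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Stabilizing constants with the usual K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2.0 * mean_x * mean_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


reference = [0.0, 0.5, 1.0]
identical_score = global_ssim(reference, reference)
inverted_score = global_ssim(reference, [1.0, 0.5, 0.0])
```

An image compared with itself scores 1, and the score drops as structure diverges. Higher SSIM and MS-SSIM values indicate greater similarity, while for LPIPS, a distance, lower values are better.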

The results of this comparison are presented in Table 2, and a qualitative comparison is shown in Fig. 8.

Table 2: Quantitative comparison with the StylePeople system on two test datasets using common image metrics. The proposed method outperforms StylePeople on most metrics owing to more accurate geometric modeling.

The upper part of the table shows the evaluation on the PeopleSnapshot dataset, and the lower part shows the results on the AzurePeople dataset. The first row in each part corresponds to the proposed method and the second row to the StylePeople method. The first column gives the values of the learned perceptual image patch similarity distance (LPIPS) [63], the second column the values of the structural similarity index (SSIM) [57], and the third column the values of the multiscale version of SSIM (MS-SSIM). All these metrics were computed from the predictions of the proposed method and of the StylePeople method and averaged over the corresponding dataset.

The results in this table show that the proposed method outperforms the StylePeople method on both datasets on all metrics, with the exception of MS-SSIM on the AzurePeople dataset. Presumably, this exception is due to StylePeople using a more complex render network than the simplified one used in the proposed method. The advantage of the proposed method is confirmed by visual inspection of the qualitative results (Fig. 8).

Fig. 1 illustrates additional results of the proposed method. It shows a number of outfits of various topologies and types, retargeted to new poses from both test datasets. Finally, Fig. 7 illustrates examples of retargeting the geometry and appearance of clothing to new body shapes according to the proposed method. More specifically, in the first row, the first column shows the clothing geometry learned from a single photograph of the first subject from the PeopleSnapshot dataset, and the second column shows the appearance learned from a video of that subject. The next two columns in the first row show an example of adapting the learned geometry and appearance to a new, previously unseen body shape. The second row shows the same information for a different subject from the PeopleSnapshot dataset.

Fig. 7 illustrates that the proposed method also makes it possible to retarget the geometry and appearance to new body shapes. Appearance retargeting works well for clothing of uniform color, but details of prints may be distorted (for example, in the chest area in the bottom row).

Fig. 8 compares the results of appearance retargeting to new poses, unseen during fitting, between the proposed method and the StylePeople system (multi-shot variant), which uses the SMPL mesh as the base geometry and relies only on neural rendering to "grow" loose clothing in the renders. More specifically, 6 subjects were selected from the AzurePeople dataset together with their video sequences from 4 cameras. In addition, for each of these subjects, one photograph was selected at random from their video sequences (columns 1, 4 and 7). For each photograph, the clothing geometry was learned using the proposed clothing code estimation method described earlier. The appearance of each subject was then learned, given the geometry of the clothing and the videos from the 4 cameras, and the appearance was predicted for a new held-out camera view and a new held-out body pose. The results of the proposed method are shown in columns 3, 6 and 9. The StylePeople method [20] was given the same video sequences of these subjects, and its predictions for the same new camera view and the same new held-out body pose are shown in columns 2, 5 and 8. As expected, the proposed system gives sharper results for looser clothing owing to the use of a more accurate geometric scaffold.

Conclusions and limitations

A new approach to modeling human clothing based on point clouds is proposed. A generative model for clothing of various shapes and topologies is proposed, which makes it possible to capture the geometry of previously unseen outfits and retarget it to new poses and body shapes. The topology-free nature of the proposed geometric representation (a point cloud) is particularly well suited to modeling clothing, given the wide variety of clothing shapes and compositions in real life. In addition to geometric modeling, ideas from neural point-based graphics are used to capture the appearance of clothing and to re-render full clothing models (geometry + appearance) in new poses on new bodies.
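By way of non-limiting illustration, the point-based appearance model may be pictured as follows: each point of the clothing point cloud carries a learnable descriptor, and the descriptors of the visible points are written into a multi-channel image that a rendering network then turns into an RGB image. The Python sketch below shows only the rasterization step with a simple depth test. It assumes that the points are already projected to pixel coordinates, that each point covers a single pixel, and that the descriptor length equals the number of image channels (for example 16); the function name is hypothetical.

```python
def rasterize_descriptors(points, descriptors, width, height):
    """Write per-point descriptors into a multi-channel image.

    points:      list of (u, v, depth), with u, v in pixel coordinates
    descriptors: list of equal-length lists, one per point
    Returns (image, mask), where image[row][col] is a descriptor-length
    list and mask[row][col] is 1 for pixels covered by a point, else 0.
    """
    channels = len(descriptors[0])
    image = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]

    for (u, v, z), descriptor in zip(points, descriptors):
        col, row = round(u), round(v)
        inside = 0 <= col < width and 0 <= row < height
        # Depth test: keep only the point closest to the camera per pixel.
        if inside and z < depth[row][col]:
            depth[row][col] = z
            image[row][col] = list(descriptor)
            mask[row][col] = 1
    return image, mask
```

The image and the mask returned here correspond to the two inputs that the rendering network receives; the network itself and the optimization of the descriptors are not shown.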

The proposed approach to appearance modeling currently requires a video sequence to capture the appearance of the clothing. This could potentially be addressed by extending the generative modeling to the neural descriptors, in a similar way to the generative neural texture model of [20]. In addition, the proposed model is severely limited in its ability to model cloth dynamics; integrating the proposed method with physics-based (finite element) simulation may help extend it in this direction. Finally, the proposed model is limited to clothing similar to that in the Cloth3D dataset. Garments that are absent from this dataset (for example, hats) cannot be captured by the proposed method.

The proposed system for modeling clothing on a person and fitting clothing using the proposed method may comprise a detection device connected to a computer system. The detection device may be a simple video or photo camera. The computer system may comprise an operating unit connected to a screen and to a selection unit. The operating unit is the part of the computer system that implements the proposed method. The computer system may be, for example, a computer, a laptop, a smartphone or any other suitable electronic device; the screen may be the display of a computer, laptop or smartphone, or any other suitable display. The selection unit is an interface through which the user can select videos and images showing clothing that they would like to try on.

The detection device captures a color video stream of a real person, for example the user, in real time, and this video is displayed on the screen in real time. Using the selection unit, the person can select any desired garment from any video showing a person wearing that garment. The images of the person and the images selected by the person are processed by the operating unit. As a result, the person can see themselves, possibly in real time, in the selected clothing.
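By way of non-limiting illustration, the interaction of the detection device, the selection unit, the operating unit and the screen may be sketched in Python as below. The class name, the callable interfaces and the string placeholders standing in for frames and outfits are assumptions made for this example; the processing step is a stub and does not implement the proposed method.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Frame = str   # placeholder; a real system would use image arrays
Outfit = str  # placeholder for the clothing selected by the user


@dataclass
class TryOnSystem:
    capture: Callable[[], Iterable[Frame]]     # detection device
    select: Callable[[], Outfit]               # selection unit
    process: Callable[[Frame, Outfit], Frame]  # operating unit
    display: Callable[[Frame], None]           # screen

    def run(self) -> int:
        """Dress every captured frame in the selected outfit and show it.

        Returns the number of frames displayed.
        """
        outfit = self.select()
        shown = 0
        for frame in self.capture():
            self.display(self.process(frame, outfit))
            shown += 1
        return shown
```

Passing the four components in as callables keeps the sketch independent of any particular camera, user interface or display hardware.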

The exemplary embodiments presented above are examples and should not be construed as limiting. Furthermore, the description of the exemplary embodiments is intended to be illustrative and not to limit the scope of the claims, and many alternatives, modifications and variations will be apparent to those skilled in the art.

References

[1] K. Aliev, A. Sevastopolsky, M. Kolos, D. Ulyanov, and V. S. Lempitsky. Neural point-based graphics. In A. Vedaldi, H. Bischof, T. Brox, and J. Frahm, editors, Proc. ECCV, volume 12367 of Lecture Notes in Computer Science, pages 696-712. Springer, 2020.

[2] T. Alldieck, M. Magnor, B. L. Bhatnagar, C. Theobalt, and G. Pons-Moll. Learning to reconstruct people in clothing from a single rgb camera. In Proc. CVPR, 2019.

[3] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Video based reconstruction of 3d people models. In Proc. CVPR, pages 8387-8397, 2018.

[4] T. Alldieck, G. Pons-Moll, C. Theobalt, and M. Magnor. Tex2shape: Detailed full human body geometry from a single image. In Proc. 3DV, 2019.

[5] K. Ayush, S. Jandial, A. Chopra, and B. Krishnamurthy. Powering virtual try-on via auxiliary human segmentation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct. 2019.

[6] H. Bertiche, M. Madadi, and S. Escalera. Cloth3d: Clothed 3d humans. In Proc. ECCV, 2020.

[7] B. L. Bhatnagar, G. Tiwari, C. Theobalt, and G. Pons-Moll. Multi-garment net: Learning to dress 3d people from images. In Proc. ICCV. IEEE, Oct. 2019.

[8] F. Bogo, A. Kanazawa, C. Lassner, P. V. Gehler, J. Romero, and M. J. Black. Keep it SMPL: automatic estimation of 3d human pose and shape from a single image. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Proc. ECCV, volume 9909 of Lecture Notes in Computer Science, pages 561-578. Springer, 2016.

[9] P. Bojanowski, A. Joulin, D. Lopez-Paz, and A. Szlam. Optimizing the latent space of generative networks. In Proc. ICML, 2019.

[10] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh. Openpose: realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):172-186, 2019.

[11] V. Gabeur, J.-S. Franco, X. Martin, C. Schmid, and G. Rogez. Moulding humans: Non-parametric 3d human shape estimation from single images. 2019.

[12] K. Gong, Y. Gao, X. Liang, X. Shen, M. Wang, and L. Lin. Graphonomy: Universal human parsing via graph transfer learning. In Proc. CVPR, 2019.

[13] P. Guan, L. Reiss, D. Hirshberg, A. Weiss, and M. J. Black. DRAPE: DRessing Any PErson. ACM Trans. on Graphics (Proc. SIGGRAPH), 31(4):35:1-35:10, July 2012.

[14] R. A. Güler, N. Neverova, and I. Kokkinos. Densepose: Dense human pose estimation in the wild. In Proc. CVPR, pages 7297-7306, 2018.

[15] E. Gundogdu, V. Constantin, A. Seifoddini, M. Dang, M. Salzmann, and P. Fua. Garnet: A two-stream network for fast and accurate 3d cloth draping. In Proc. ICCV, 2019.

[16] X. Han, Z. Wu, Z. Wu, R. Yu, and L. S. Davis. Viton: An image-based virtual try-on network. In Proc. CVPR, 2018.

[17] W.-L. Hsiao, I. Katsman, C.-Y. Wu, D. Parikh, and K. Grauman. Fashion++: Minimal edits for outfit improvement. In Proc. ICCV, 2019.

[18] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proc. ICCV, 2017.

[19] Z. Huang, Y. Xu, C. Lassner, H. Li, and T. Tung. Arch: Animatable reconstruction of clothed humans. In Proc. CVPR, 2020.

[20] K. Iskakov, A. Grigorev, A. Ianina, R. Bashirov, I. Zakharkin, A. Vakhitov, and V. Lempitsky. Stylepeople: A generative model of fullbody human avatars. In Proc. CVPR, 2021.

[21] T. Issenhuth, J. Mary, and C. Calauzènes. Do not mask what you do not need to mask: a parser-free virtual try-on. In Proc. ECCV, 2020.

[22] H. Jae Lee, R. Lee, M. Kang, M. Cho, and G. Park. Laviton: A network for looking-attractive virtual try-on. In Proc. ICCV, Oct. 2019.

[23] N. Jetchev and U. Bergmann. The conditional analogy gan: Swapping fashion articles on people images. In Proc. ICCV, 2017.

[24] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proc. ECCV, 2016.

[25] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In Proc. CVPR, 2018.

[26] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proc. CVPR, pages 4401-4410. Computer Vision Foundation / IEEE, 2019.

[27] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of stylegan. In Proc. CVPR, 2020.

[28] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2017.

[29] Z. Laehner, D. Cremers, and T. Tung. Deepwrinkles: Accurate and realistic clothing modeling. In Proc. ECCV, 2018.

[30] S. Laine, J. Hellsten, T. Karras, Y. Seol, J. Lehtinen, and T. Aila. Modular primitives for high-performance differentiable rendering. ACM Transactions on Graphics, 39(6), 2020.

[31] K. M. Lewis, S. Varadharajan, and I. Kemelmacher-Shlizerman. Vogue: Try-on by stylegan interpolation optimization, 2021.

[32] M. Liu, L. Sheng, S. Yang, J. Shao, and S.-M. Hu. Morphing and sampling network for dense point cloud completion. arXiv preprint arXiv:1912.00280, 2019.

[33] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1-248:16, Oct. 2015.

[34] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1-248:16, Oct. 2015.

[35] M. M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In Proc. ECCV, volume 8695 of Lecture Notes in Computer Science, pages 154-169. Springer International Publishing, Sept. 2014.

[36] Q. Ma, J. Yang, A. Ranjan, S. Pujades, G. Pons-Moll, S. Tang, and M. J. Black. Learning to Dress 3D People in Generative Clothing. In Proc. CVPR.

[37] K. Mazur and V. Lempitsky. Cloud transformers, 2020.

[38] M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R. Pandey, N. Snavely, and R. Martin-Brualla. Neural rerendering in the wild. In Proc. CVPR, June 2019.

[39] M. R. Minar and H. Ahn. Cloth-vton: Clothing three-dimensional reconstruction for hybrid image-based virtual try-on. In Proceedings of the Asian Conference on Computer Vision (ACCV), November 2020.

[40] M. R. Minar, T. T. Tuan, H. Ahn, P. Rosin, and Y.-K. Lai. Cpvton+: Clothing shape and texture preserving image-based virtual try-on. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2020.

[41] A. Mir, T. Alldieck, and G. Pons-Moll. Learning to transfer texture from clothing images to 3d humans. In Proc. CVPR. IEEE, June 2020.

[42] R. Natsume, S. Saito, Z. Huang, W. Chen, C. Ma, H. Li, and S. Morishima. Siclope: Silhouette-based clothed people. In Proc. CVPR, 2019.

[43] A. Neuberger, E. Borenstein, B. Hilleli, E. Oks, and S. Alpert. Image based virtual try-on network from unpaired data. In Proc. CVPR, June 2020.

[44] C. Patel, Z. Liao, and G. Pons-Moll. Tailornet: Predicting clothing in 3d as a function of human pose, shape and garment style. In Proc. CVPR, 2020.

[45] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas, and M. J. Black. Expressive body capture: 3d hands, face, and body from a single image. In Proc. CVPR, pages 10975-10985, 2019.

[46] G. Pons-Moll, S. Pujades, S. Hu, and M. Black. Clothcap: Seamless 4d clothing capture and retargeting. ACM Transactions on Graphics (Proc. SIGGRAPH), 36(4), 2017. Two first authors contributed equally.

[47] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proc. CVPR, pages 77-85. IEEE Computer Society, 2017.

[48] A. Raj, P. Sangkloy, H. Chang, J. Hays, D. Ceylan, and J. Lu. Swapnet: Image based garment transfer. In Proc. ECCV, pages 679-695, 2018.

[49] N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W.-Y. Lo, J. Johnson, and G. Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020.

[50] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351 of LNCS, pages 234-241. Springer, 2015. (available on arXiv:1505.04597 [cs.CV]).

[51] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proc. ICCV, 2019.

[52] S. Saito, T. Simon, J. Saragih, and H. Joo. Pifuhd: Multilevel pixel-aligned implicit function for high-resolution 3d human digitization. In Proc. ICCV, 2020.

[53] Y. Shen, J. Liang, and M. C. Lin. Gan-based garment generation using sewing pattern images. In Proc. ECCV, 2020.

[54] J. Thies, M. Zollhöfer, and M. Nießner. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics (TOG), 2019.

[55] G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid. BodyNet: Volumetric inference of 3D human body shapes. In Proc. ECCV, 2018.

[56] B. Wang, H. Zheng, X. Liang, Y. Chen, L. Lin, and M. Yang. Toward characteristic-preserving image-based virtual try-on network. In Proc. ECCV, 2018.

[57] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600-612, 2004.

[58] O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson. SynSin: End-to-end view synthesis from a single image. In Proc. CVPR, 2020.

[59] D. Xiang, F. Prada, C. Wu, and J. Hodgins. Monoclothcap: Towards temporally coherent clothing capture from monocular rgb video. In Proc. 3DV, 2020.

[60] H. Yang, R. Zhang, X. Guo, W. Liu, W. Zuo, and P. Luo. Towards photo-realistic virtual try-on by adaptively generating↔preserving image content. In Proc. CVPR, 2020.

[61] G. Yildirim, N. Jetchev, R. Vollgraf, and U. Bergmann. Generating high-resolution fashion model images wearing custom outfits. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct. 2019.

[62] J. S. Yoon, K. Kim, J. Kautz, and H. S. Park. Neural 3d clothes retargeting from a single image, 2021.

[63] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018.

[64] Z. Zheng, T. Yu, Y. Wei, Q. Dai, and Y. Liu. Deephuman: 3d human reconstruction from a single image. In Proc. ICCV, October 2019.

Claims

1. A method for training an overlay network for modeling clothing on a person, in which the clothing adapts to the body pose and body shape of any person, the method comprising:

providing a set of frames of people, in which each person is wearing clothing and in which the frames form a video sequence in which each person performs a sequence of movements;

calculating, for each frame, a Skinned Multi-Person Linear (SMPL) mesh for the pose and body shape of the person in the frame;

calculating, for each frame, a clothing mesh for the pose and body shape of the person in the frame;

generating, for each frame, an initial point cloud in the form of a set of vertices of said Skinned Multi-Person Linear (SMPL) mesh;

specifying a randomly initialized d-dimensional code vector for encoding the clothing style of each person;

supplying the initial point clouds and the clothing code vectors to the overlay network, namely: supplying the initial point clouds to the input of a cloud transformer neural network of the overlay network, and the clothing code vectors to the input of an MLP (multilayer perceptron) encoder neural network;

processing the clothing code vector with the MLP encoder neural network and passing its output to the cloud transformer neural network, which deforms the input initial point cloud taking into account the output of the MLP encoder neural network and outputs a predicted clothing point cloud for each frame;

obtaining, after all frames from the set of frames of people have been processed, a pre-trained overlay network, namely: the weights of the trained MLP encoder neural network, the weights of the trained cloud transformer neural network, and the clothing code vectors encoding the styles for all people;

overlaying, by means of the pre-trained overlay network, a suitable clothing style corresponding to one of the code vectors and one of the point clouds onto any body shape and any pose selected by the user.

2. A method for obtaining a predicted clothing point cloud and a clothing code vector from an image of a real person wearing clothing, for modeling clothing on a person, in which the clothing adapts to the body pose and body shape of any person, the method comprising:

capturing, with a detection device, an image of a real person wearing clothing;

predicting, from the image, an SMPL mesh with the desired pose and body shape using the SMPLify method;

creating an initial point cloud in the form of the vertices of said Skinned Multi-Person Linear (SMPL) mesh for the image;

predicting, using a segmentation network, a binary clothing mask corresponding to the pixels of the given clothing in the image;

randomly initializing a d-dimensional clothing code vector for encoding the clothing style for the given image;

supplying the initial point cloud and the clothing code vector to the pre-trained overlay network trained in accordance with claim 1;

obtaining a predicted clothing point cloud from the output of the pre-trained overlay network;

projecting the clothing point cloud onto a black-and-white image using the given camera parameters of the image of the person;

comparing, by calculating a loss function, the projection of the predicted point cloud onto the image with the true binary clothing mask corresponding to the clothing pixels in the image, using a Chamfer distance between two-dimensional point clouds that are projections of the 3D point clouds;

optimizing the clothing code vector based on the calculated loss function;

overlaying, in accordance with the obtained clothing code vector, the predicted clothing point cloud of the image onto any body shape and any body pose selected by the user.

3. A method for modeling clothing on a person, in which the clothing adapts to the body pose and body shape of any person, the method comprising:

providing a color video stream of a first person;

selecting, by the user, any clothing from any video of a second person wearing the clothing;

by means of an operating unit of a computer system:

obtaining a predicted clothing point cloud and a clothing code vector according to the method of claim 2 for any video frame;

randomly initializing, for each point of the point cloud, an n-dimensional appearance descriptor vector that is responsible for color;

generating a 16-channel image tensor using the 3D coordinates and the neural descriptor of each point, and a binary black-and-white mask corresponding to the pixels of the image covered by these points;

processing the 16-channel image tensor together with the binary black-and-white mask with a rendering network to obtain an RGB color image and a clothing mask;

optimizing the weights of the rendering network and the values of the appearance descriptors against the true video sequence of the person to obtain the desired appearance of the clothing;

displaying to the user on a screen a video of the first person in the clothing of the second person in the form of a predicted rendered image of the clothing with the given pose and body shape, wherein the user can input a video of any person and see the learned color model of the clothing, retargeted to new body shapes and poses, rendered on top of this new video, that is, dress the person in any video in any outfit chosen by the user.

4. The method of claim 3, wherein the real person is the user, and the color model of the clothing is displayed to the user on that user.

5. A system for modeling clothing on a person using the method according to claim 3, comprising:

a detection device connected to a computer system comprising an operating unit connected to a display screen and a selection unit;

wherein the detection device is configured to capture a color video stream of a first real person in real time;

the selection unit is configured to allow the user to select any clothing from any video of a second person wearing the clothing;

the display screen is configured to display said first person in real time in the clothing selected by the user from said videos, in accordance with data received from the operating unit.

6. The system of claim 5, wherein the user is the first person.