RU2757563C1 - Method for visualizing a 3d portrait of a person with altered lighting and a computing device for it

Info

Publication number: RU2757563C1
Application number: RU2021104328A
Authority: RU
Inventors: Артём Михайлович СЕВАСТОПОЛЬСКИЙ; Виктор Сергеевич Лемпицкий
Original assignee: Самсунг Электроникс Ко., Лтд.
Priority date: 2021-02-19
Filing date: 2021-02-19
Publication date: 2021-10-18

Abstract

FIELD: computer technology.

SUBSTANCE: the invention relates to computer technology for creating a 3D portrait with variable lighting. A method for rendering a 3D portrait of a person with altered lighting is proposed, in which: an input defining a camera pose and lighting conditions is received; latent descriptors of a 3D point cloud are rasterized, the 3D point cloud being generated from a sequence of images captured by a camera with a blinking flash while the camera moves around the upper body of the person, the image sequence comprising a set of images taken with the flash and a set of images taken without the flash; the rasterized images are processed by a deep neural network to predict albedo, normals, environment shadow maps, and a segmentation mask for the received camera pose; and the predicted albedo, normals, environment shadow maps, and segmentation mask are combined into a 3D portrait with lighting altered in accordance with the lighting conditions.

EFFECT: improving the quality characteristics of the created 3D portrait.

21 cl, 5 dwg

Description

Technical field

[0001] The present invention relates generally to image processing based on artificial intelligence (AI) and, in particular, to a method for creating a 3D portrait with variable lighting using a deep neural network, as well as to a computing device implementing this method.

Background

[0002] The growth of mobile photography is closely tied to the ubiquity of two-dimensional displays. As 3D display devices such as virtual reality (VR) headsets, augmented reality (AR) glasses, and 3D monitors become more widespread, extending mobile photography to the capture and processing of 3D content is becoming an increasingly sought-after area of technology development. Most 3D displays project 3D models either into the user's surroundings (AR glasses) or into a virtual environment (VR headsets, 3D monitors). Therefore, to make such models more realistic, it must be possible to change their lighting realistically to match the environment around them (real or synthetic).

[0003] The creation of realistic 3D models whose lighting can be changed is a much less studied area of research. Known solutions focus either on relightable single-view reconstruction (which significantly limits the quality of the resulting model) or on building relightable models using specialized equipment (Light Stage). The works "Deep single-image portrait relighting", H. Zhou, S. Hadap, K. Sunkavalli, D. W. Jacobs (In Proceedings of the IEEE International Conference on Computer Vision, 2019) and "Neural Light Transport for Relighting and View Synthesis" version 1 (v1), X. Zhang, S. Fanello, Y.-T. Tsai, T. Sun, T. Xue, R. Pandey, S. Orts-Escolano, P. Davidson, C. Rhemann, P. Debevec, J. Barron, R. Ramamoorthi, W. Freeman (In Proceedings of ACM Transactions on Graphics, August 2020) can be considered the closest prior art solutions.

Summary of the invention

[0004] This application provides methods, computing devices, and systems for obtaining photorealistic 3D models of the head or upper body of a person (referred to as a "3D portrait of a person") from videos captured by the ordinary cameras of currently available resource-constrained portable devices, for example, from footage taken with a smartphone camera.

[0005] The proposed system (the term "system" is used here interchangeably with the terms "method" and "device" and means the combination of the essential features of the proposed method and all necessary hardware) requires only a video of the person consisting of alternating images (frames) captured by a smartphone camera with and without flash. Given such a video, the proposed system can prepare a relightable model and render a 3D portrait of the person with altered lighting from an arbitrary viewpoint. The proposed system can therefore be used successfully in AR, VR, and 3D display devices.

[0006] The proposed system uses neural point-based graphics and 3D point clouds as intermediate geometric representations. Photometric information is encoded in these clouds as latent descriptors of the individual points. The points and their latent descriptors can then be rasterized for new camera viewpoints (for example, a viewpoint requested by the user through corresponding manipulations in a VR environment). These rasterizations are processed by a deep neural network trained to predict, for each pixel, the albedo, normals (normal directions), environment shadow maps, and segmentation mask as observed from the new camera pose, and the 3D portrait of the person is rendered from this predicted information.

[0007] During scene reconstruction, the model is fitted to the video frames, and auxiliary parameters of the scene lighting and the camera flash are estimated at the same time. Face-specific priors are used to separate scene lighting from albedo. Although these priors are measured only in the facial part of the scene, they facilitate the separation across the entire 3D portrait. After reconstruction, the scene (portrait) can be rendered from new viewpoints and with new lighting at interactive speeds.

[0008] Thus, a first aspect of the present invention is a method for rendering a 3D portrait of a person with altered lighting, comprising: receiving an input defining a camera pose and lighting conditions; rasterizing latent descriptors of a 3D point cloud at different resolutions in accordance with the camera pose to obtain rasterized images, wherein the 3D point cloud is generated from a sequence of images captured by a camera with a blinking flash while the camera moves at least partially around the upper body of the person, the image sequence comprising a set of images captured with the flash and a set of images captured without the flash; processing the rasterized images with a deep neural network to predict albedo, normals, environment shadow maps, and a segmentation mask for the received camera pose; and combining the predicted albedo, normals, environment shadow maps, and segmentation mask into a 3D portrait with lighting altered in accordance with the lighting conditions.

[0009] A second aspect of the present invention is a computing device comprising a processor and a memory storing processor-executable instructions together with the weights of the deep neural network, the latent descriptors, and the auxiliary parameters obtained during the training stage, wherein execution of the processor-executable instructions by the processor causes the computing device to perform the method for rendering a 3D portrait of a person with altered lighting according to the first aspect or any further implementation of the first aspect.

[0010] The technical results of the proposed inventions are: (1) a new neural rendering model that supports relighting and is based on 3D point cloud geometry; (2) a new approach to separating albedo from lighting that uses face priors. Combining these results yields a system that can create a realistic 3D portrait of a person from nothing more than a sequence of images captured by an ordinary camera, and that supports real-time rendering. The proposed invention thus makes it possible to obtain, in real time and on computing devices with limited processing resources, 3D portraits with realistically altered lighting whose quality is higher than, or at least comparable to, that achieved by known solutions, without the use of complex and expensive image acquisition equipment (for example, a Light Stage). A computing device (for example, a smartphone, a tablet, a VR/AR/3D display device or glasses, or a remote server) having the functionality of the proposed system provides a significantly better user experience and promotes the adoption of VR and AR technologies.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a complete processing pipeline according to an embodiment of the present invention.

FIG. 2 is a flowchart of a method for rendering a 3D portrait of a person with altered lighting according to an embodiment of the present invention.

FIG. 3 shows details of the step of generating a 3D point cloud from an image sequence according to an embodiment of the present invention.

FIG. 4 is a flowchart of the training stage of the system according to an embodiment of the present invention.

FIG. 5 is a block diagram of a computing device configured to render a 3D portrait of a person according to an embodiment of the present invention.

DESCRIPTION OF EMBODIMENTS

[0011] FIG. 1 shows a complete processing pipeline according to one embodiment of the present invention. The illustrated pipeline can be implemented in whole or in part on existing computing devices such as a smartphone, a tablet, a VR (virtual reality) display device, an AR (augmented reality) display device, or a 3D display device. VR, AR, and 3D display devices may take the form of smart glasses. In a partial implementation, the most resource-intensive operations (for example, the operations of the training stage) can be performed in the cloud, for example, by a server.

[0012] To create a 3D portrait of a person with altered lighting, the proposed system requires a sequence of images of the person. Such a sequence can be captured by the ordinary cameras of existing portable devices (for example, a smartphone camera). Alternatively, the image sequence can be supplied to the pipeline from the user's gallery (provided the user has granted access to the gallery) or downloaded from a web resource.

[0013] It is important to note that the image sequence is captured by a camera with a blinking flash while the camera is moved at least partially around the upper body of the person. The image sequence therefore contains a set of images captured with the flash and a set of images captured without the flash. These essential features make it easier to separate albedo from lighting for subsequent lighting modifications, including complete relighting of the 3D portrait or partial relighting of the 3D portrait in accordance with a new camera pose. Relighting can be performed for entirely new lighting conditions, that is, conditions that replace the lighting present when the image sequence was captured with new arbitrary lighting conditions. Alternatively, relighting can be performed, for example, to change the lighting of certain parts of the 3D portrait by correctly extrapolating the lighting present when the image sequence was captured onto those parts of the 3D portrait.

[0014] The image sequence can be captured with a blinking flash using the standard camera functions of the computing device or any available software or application (for example, the Open Camera application) suitable for capturing images with a blinking flash. The dashed curve in FIG. 1 shows an example trajectory of the camera around the upper body of the person. The upper body may include, without limitation, the head, shoulders, arms, and chest of the person. For example, capture may be performed with controlled white balance, at 30 frames per second, with a shutter speed of 1/200 s and an ISO of approximately 200 (selected for each sequence according to the lighting conditions). During each recording, the camera flash blinks, for example, once per second, turning on each time for, for example, 0.1 s. The blinking may be regular or irregular. It should be understood that the captured image sequence must cover the entire upper body of the person for whom the 3D portrait with altered lighting is being created. Each person is recorded for, for example, 15-25 seconds. The specific values in this paragraph are given by way of example only and are not limiting. Specific values may be chosen from corresponding ranges, for example within +/-10-50% of the values given.
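The example timing in this paragraph (30 frames per second, one 0.1 s flash per second) can be illustrated with a minimal sketch that labels frames as flash or no-flash. It assumes a perfectly regular flash that turns on at the start of every period; the function name `split_flash_frames` and the idea of labelling frames from their index are illustrative assumptions, not part of the described method, which also allows irregular blinking.

```python
def split_flash_frames(n_frames, fps=30, period_ms=1000, on_ms=100):
    """Label frame indices as flash / no-flash for a regular blinking flash.

    Assumption: the flash turns on at the start of every period. Integer
    arithmetic is used so that frame times are compared exactly.
    """
    flash, no_flash = [], []
    for i in range(n_frames):
        # Frame time is i * 1000 / fps milliseconds; both sides of the
        # comparison are scaled by fps so everything stays an integer.
        phase = (i * 1000) % (period_ms * fps)
        if phase < on_ms * fps:
            flash.append(i)
        else:
            no_flash.append(i)
    return flash, no_flash


# A 20-second recording at 30 fps contains 600 frames.
flash, no_flash = split_flash_frames(600)
print(len(flash), len(no_flash))  # 60 540
```

With these example values, three frames out of every thirty fall inside the flash window, so the flash set is much smaller than the no-flash set.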

[0015] The person is assumed to remain essentially motionless during capture. In particular, the person's head is assumed to face forward and to remain essentially still throughout the movement of the camera. The camera can be moved by the person themselves, when taking a selfie, or by a third party. The camera movement may cover 180° or less around the person, for example, moving from the left side of the upper body across the front of the upper body to the right side of the upper body, or in the opposite direction.

[0016] A 3D point cloud 10 is then generated from the captured image sequence. The 3D point cloud 10 can be generated using structure from motion (SfM) or any other known technique that reconstructs the 3D structure of a scene or object from a sequence of two-dimensional images of that scene or object. Each point in the 3D point cloud 10 is augmented with a latent descriptor, a multidimensional latent vector characterizing the properties of that point. The latent descriptors are sampled from a predefined probability distribution, for example, element-wise from a standard normal distribution, and are then fitted during the training stage. Each latent descriptor serves as a memory vector for the deep neural network (also called the rendering network and described in detail below) and can be used by the network to infer the geometric and photometric properties of each point. The 3D point cloud is generated, and the camera poses are estimated, from either the set of images captured with the flash or the set of images captured without the flash.
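The initialization of latent descriptors described in this paragraph can be sketched as follows. This is a minimal illustration only: the descriptor dimension of 8, the function name `init_latent_descriptors`, and the three sample points are assumptions made for the example, and the subsequent fitting during training is not shown.

```python
import random


def init_latent_descriptors(num_points, dim=8, seed=0):
    """Draw one latent vector per 3D point, element-wise from N(0, 1).

    `dim=8` is an assumed descriptor size; in the described system these
    vectors would then be optimized during the training stage.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_points)]


# Three hypothetical SfM points (x, y, z), each paired with a descriptor.
points = [(0.0, 0.1, 2.0), (0.2, -0.1, 2.1), (-0.3, 0.0, 1.9)]
descriptors = init_latent_descriptors(len(points))
print(len(descriptors), len(descriptors[0]))  # 3 8
```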

[0017] The latent descriptors of the generated 3D point cloud are then rasterized at different resolutions in accordance with the requested new camera pose to obtain rasterized images 15. Rasterization can be performed using Z-buffering or any other known technique that allows objects in 3D space to be depicted from a particular camera pose. The different rasterization resolutions may include, for example, 512x512, 256x256, 128x128, 64x64, and 32x32 (during the training stage). At use time, the original resolution can be determined, for example, as specified by the user for the new camera pose, and a pyramid of rasterizations with gradually decreasing resolutions can then be built (for example, 1920x1080, 860x540, 430x270, 215x145, 107x72). Feeding the neural network a pyramid of rasterizations at different resolutions (from high to low) lets the network take into account both image detail (from the higher-resolution rasterizations) and the proximity of points to the camera (from the lower-resolution rasterizations, which do not contain "gaps" between the projected points of the point cloud surface nearest to the camera and the points of the far surface of the cloud).
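The multi-resolution rasterization described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: a pinhole camera at the origin looking along +z, square images, one pixel per projected point, and hypothetical function names `rasterize` and `rasterization_pyramid`. A real implementation would apply the requested camera pose and intrinsics first.

```python
import math


def rasterize(points, descriptors, size, focal=1.0):
    """Z-buffer rasterization of point descriptors onto a size x size grid.

    Assumed camera: at the origin, looking along +z, with `focal` given in
    units of half the image width. Pixels hit by no point stay None.
    """
    depth = [[math.inf] * size for _ in range(size)]
    image = [[None] * size for _ in range(size)]
    for (x, y, z), desc in zip(points, descriptors):
        if z <= 0:
            continue  # point is behind the camera
        # Perspective projection to normalized coordinates in [-1, 1].
        u, v = focal * x / z, focal * y / z
        col = math.floor((u + 1.0) * 0.5 * size)
        row = math.floor((v + 1.0) * 0.5 * size)
        if 0 <= col < size and 0 <= row < size and z < depth[row][col]:
            depth[row][col] = z      # keep only the nearest point per pixel
            image[row][col] = desc
    return image


def rasterization_pyramid(points, descriptors, sizes=(512, 256, 128, 64, 32)):
    """Rasterize the same cloud at each resolution, from high to low."""
    return [rasterize(points, descriptors, s) for s in sizes]


# Two points on the same ray: the nearer one (z = 1) wins the pixel.
pts = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0)]
descs = [["far"], ["near"]]
pyramid = rasterization_pyramid(pts, descs, sizes=(4, 2))
print(pyramid[0][2][2], pyramid[1][1][1])  # ['near'] ['near']
```

At lower resolutions each pixel covers a larger area, so fewer pixels are left empty between the projected points of the near surface.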

[0018] The camera pose can be specified in the input directly, for example, as specific coordinates in the AR/VR environment, the camera direction, the focal length, and other intrinsic camera parameters, or indirectly, for example, by a touch input selecting the corresponding pose in the AR/VR environment, or by an input corresponding to a pose derived from the position and orientation of the computing device (for example, a smartphone or AR/VR smart glasses) used to navigate the AR/VR environment or to display certain information in it. This input can be received in a corresponding manner.

[0019] Neural rendering can then be performed. Neural rendering can be carried out by processing the rasterized images 15 with a deep neural network trained to predict albedo, normals, environmental shadow maps and a segmentation mask for the received camera pose. The architecture of the deep neural network follows the U-Net structure with masked convolutions. The training stage of the proposed system and the architecture of the deep neural network are described in detail below with reference to FIG. 4. The predicted albedo, normals, environmental shadow maps and segmentation mask are collectively shown at 20 in FIG. 1. The predicted albedo describes the spatially varying reflectance properties (albedo $\rho(x)$) of the head surface (skin, hair or other parts). The predicted normals contain the normal vectors $n(x)$ of points on the head surface in world space. The predicted environmental shadow maps are represented by a single-channel group that uses a sigmoid nonlinearity and corresponds to the grayscale shading caused by variations in room lighting and the resulting occlusions. Finally, the predicted segmentation mask is represented by a single-channel group that also uses a sigmoid nonlinearity and defines a segmentation mask in which each pixel contains the predicted probability that the pixel belongs to the head rather than to the background.
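As a non-limiting illustration of the two single-channel outputs, the following minimal sketch applies a sigmoid to raw per-pixel network outputs so that every value lies strictly between 0 and 1. The function names, the sample values and the 0.5 decision threshold are assumptions made for this example and are not taken from the described system.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function mapping a raw output to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def activate_channel(raw):
    """Apply the sigmoid element-wise to a 2D list of raw network outputs."""
    return [[sigmoid(z) for z in row] for row in raw]


# Hypothetical raw outputs of the segmentation channel for a 2x2 image.
raw_mask = [[-4.0, 0.0], [2.0, 6.0]]
mask = activate_channel(raw_mask)

# Assumed threshold: pixels with probability >= 0.5 are treated as head.
is_head = [[p >= 0.5 for p in row] for row in mask]
```

The same activation would apply to the shadow channel, where the result is read as a grayscale shading value rather than a probability.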

[0020] Finally, the predicted albedo, normals, environmental shadow maps and segmentation mask are combined into a 3D portrait 25 relit according to the lighting conditions. Because the originally captured image sequence includes a set of images taken without flash, which convey the lighting conditions under which the sequence was captured, the lighting of the combined 3D portrait can be changed, if desired, by completely replacing the capture-time lighting conditions with new lighting conditions of interest to the user.

[0021] Types of lighting can include, but are not limited to, ambient lighting, directional lighting and spherical harmonic lighting. The lighting conditions input can be received in any suitable manner. Auxiliary lighting-related parameters can also include, but are not limited to, the room lighting color temperature $C^{room}$ and the flash color temperature $C^{flash}$. $C^{room}$ determines the color temperature of the shadows in the lighting of the scene (also called the room), and $C^{flash}$ determines the intensity and color temperature of the flash. Both $C^{room}$ and $C^{flash}$ can be represented by vectors, for example of 3 numbers, shared across the entire scene. They are trained (tuned) together with the entire system. These parameters are described in detail below. The type of new lighting requested by the user determines the expression used to create the relit 3D portrait from the predicted per-pixel albedo, normals, environmental shadow maps, segmentation mask, auxiliary lighting-related parameters and the parameters of the new lighting conditions. This expression is illustrated as "Lighting Model" in FIG. 1 and is described in detail below in the "Method Implementation Details" section.

[0022] FIG. 2 is a flowchart of a method for rendering a 3D portrait of a person with altered lighting according to an embodiment of the present invention. In step S200, a camera with a blinking flash captures a sequence of images of the upper body of a person. During capture (e.g., photo or video shooting), the camera is moved by the person or by a third party at least partially around the person's upper body. As a result, the captured image sequence contains a set of images taken with flash and a set of images taken without flash.

[0023] Based on the captured image sequence, a 3D point cloud is created in step S205. The 3D point cloud can be created using structure from motion (SfM) or any other known technique that reconstructs the 3D structure of a scene or object from a sequence of 2D images of that scene or object. In the training stage, which is described in detail below, the method includes augmenting each point in the 3D point cloud 10 with a latent descriptor, which is a multidimensional latent vector characterizing the properties of that point.

[0024] In step S210, one or more inputs defining a camera pose and/or lighting conditions are received. Then, in step S215, the latent descriptors of the 3D point cloud, obtained earlier in the training stage, are rasterized at different resolutions according to the camera pose to obtain rasterized images. Next, in step S220, neural rendering is performed by processing the rasterized images with the deep neural network to predict the albedo, normals, environmental shadow maps and segmentation mask for the received camera pose. Before the use stage, the deep neural network is trained as described with reference to FIG. 4. Finally, in step S225, the relit 3D portrait is rendered by combining the predicted albedo, normals, environmental shadow maps and segmentation mask into a 3D portrait relit according to the received lighting conditions.
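As a non-limiting illustration of step S215, the following minimal sketch rasterizes per-point latent descriptors with a Z-buffer at several resolutions. It assumes that the points are already expressed in camera coordinates (Z pointing forward), a pinhole camera with the principal point at the image center, and a pyramid in which each level halves the resolution. The function names and these conventions are assumptions made for this example.

```python
import math


def rasterize(points, descriptors, focal, width, height):
    """Z-buffer rasterization: each pixel keeps the descriptor of the nearest point."""
    dim = len(descriptors[0])
    image = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for (x, y, z), desc in zip(points, descriptors):
        if z <= 0:  # point is behind the camera
            continue
        u = math.floor(focal * x / z + width / 2)
        v = math.floor(focal * y / z + height / 2)
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            image[v][u] = list(desc)
    return image


def rasterize_pyramid(points, descriptors, focal, width, height, levels=3):
    """Rasterize at full resolution and at successively halved resolutions."""
    return [
        rasterize(points, descriptors, focal / 2**k, width // 2**k, height // 2**k)
        for k in range(levels)
    ]


# Two points share a pixel; the nearer one (z = 1) wins the depth test.
points = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (0.5, 0.0, 1.0)]
descriptors = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
pyramid = rasterize_pyramid(points, descriptors, focal=4.0, width=8, height=8)
```

Pixels onto which no point projects keep an all-zero descriptor in this sketch; the resulting images would then be passed to the deep neural network.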

[0025] FIG. 3 shows details of step S205 of creating the 3D point cloud from the image sequence according to an embodiment of the present invention. Creating the 3D point cloud in step S205 further includes step S205.1 of estimating the camera viewpoints from which the image sequence was captured. The camera poses estimated in step S205.1 are used to create the 3D point cloud, as well as to train the deep neural network, the latent descriptors of the 3D point cloud and the auxiliary parameters in the training stage. It should be understood that the camera pose and lighting conditions received as input in step S210 (i.e., the new or arbitrary camera pose and lighting conditions requested by the user) differ from the camera poses estimated in step S205.1 and from the lighting conditions under which the image sequence was captured (i.e., the actual camera poses and lighting conditions during capture). In step S205.1, the camera poses are estimated using SfM. The points of the initially created 3D point cloud correspond, at least in part, to the upper body of the person.

[0026] Step S205 of creating the 3D point cloud further includes step S205.2 of actually creating the 3D point cloud, step S205.3 of processing each image of the sequence with a segmentation neural network to segment the foreground, and step S205.4 of filtering the 3D point cloud based on the segmented foreground to obtain a filtered 3D point cloud. The 3D point cloud 10 produced by steps S205.3 and S205.4 includes only points of the person's upper body (i.e., points belonging to the background are excluded). The segmentation neural network is a deep neural network that takes an image and produces a per-pixel segmentation mask with real values between 0 and 1. For a given scalar threshold between 0 and 1, all mask values below the threshold are treated as background and the rest as foreground (the object). Such a network can, for example, be trained on a large collection of images and corresponding masks with a loss function that penalizes the discrepancy between the predicted mask and the ground-truth mask. The segmented foreground is estimated for each image of the captured video, and for each image and its corresponding camera viewpoint a 3D viewing frustum is constructed by casting rays from the camera through the pixels belonging to the segmented foreground. Points of the 3D point cloud that do not lie in the intersection of these frustums are filtered out, which yields the filtered 3D point cloud.
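As a non-limiting illustration of the filtering in step S205.4, the following minimal sketch keeps a point only if it projects onto a foreground pixel in every view, which is equivalent to testing membership in the intersection of the foreground frustums. The view dictionary layout (`R`, `t`, `focal`, `mask`), the pinhole model with the principal point at the image center, and the 0.5 threshold are assumptions made for this example. Points that project outside an image are treated as background.

```python
import math


def to_pixel(point, view):
    """Project a world-space point into a view; return (u, v) or None."""
    R, t = view["R"], view["t"]
    xc, yc, zc = (
        sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)
    )
    if zc <= 0:  # behind the camera
        return None
    height, width = len(view["mask"]), len(view["mask"][0])
    u = math.floor(view["focal"] * xc / zc + width / 2)
    v = math.floor(view["focal"] * yc / zc + height / 2)
    return (u, v) if 0 <= u < width and 0 <= v < height else None


def filter_point_cloud(points, views, threshold=0.5):
    """Keep points whose projection is foreground in all views."""
    kept = []
    for p in points:
        pixels = (to_pixel(p, view) for view in views)
        if all(
            px is not None and view["mask"][px[1]][px[0]] >= threshold
            for px, view in zip(pixels, views)
        ):
            kept.append(p)
    return kept


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
views = [
    {"R": identity, "t": [0.0, 0.0, 0.0], "focal": 2.0, "mask": mask},
    {"R": identity, "t": [0.0, 0.0, 1.0], "focal": 2.0, "mask": mask},
]
cloud = [(0.0, 0.0, 1.0), (0.6, 0.0, 1.0)]
filtered = filter_point_cloud(cloud, views)
```

In this toy setup the first point lands on foreground pixels in both views and is kept, while the second lands on a background pixel in the first view and is removed.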

[0027] The described segmentation and filtering can improve the quality of the created 3D portrait by increasing the accuracy of the predictions 20 of albedo, normals, environmental shadow maps and segmentation mask made by the deep neural network.

[0028] FIG. 4 is a flowchart of the training stage of the system according to an embodiment of the present invention. Training can be performed on the same computing device 50 (see FIG. 5) on which the relit 3D portrait is rendered, or outside that device, for example on a server (not shown). If training is performed on a server, the image sequence captured in step S200 and all other information used herein as training information can be transmitted to the server. The server can then perform the training, as well as any training data transformations disclosed herein, and transmit the trained deep neural network, the latent descriptors of the 3D point cloud and the auxiliary parameters back to the computing device for use in the use stage. For this purpose, the computing device can be equipped with a communication unit. Training is performed iteratively; one iteration is shown in FIG. 4. Before training starts, the weights and other parameters of the deep neural network, the values of the latent descriptors and the values of the auxiliary parameters are initialized randomly. The auxiliary parameters include one or more of the room lighting color temperature, the flash color temperature and the albedo semi-texture.

[0029] In step S100, an image and the camera pose corresponding to that image are randomly selected from the captured image sequence. Then, in step S105, a predicted image (i.e., predicted albedo, normals, environmental shadow maps and segmentation mask) is obtained for that camera pose using the deep neural network. In step S110, a 3D face mesh reconstruction network predicts auxiliary face meshes with a corresponding texture mapping for each image in the captured sequence, including the image randomly selected in step S100. The texture mapping is defined by a set of 2D coordinates in a fixed, predefined texture space for each mesh vertex. The 3D face mesh reconstruction network is a deep neural network that predicts a triangular mesh of the face found in the input image, aligned with the face in that image. Such a network can be trained using various approaches: supervised learning, learning from partially labeled data, or unsupervised learning. Non-limiting examples are the PRNet model and the 3DDFA_v2 model.

[0030] In step S115, a median reference texture is obtained by computing the per-pixel medians of all images from the set of flash images, geometrically warped into texture space using bilinear interpolation. The texture space is a rectangle on a 2D plane in which each point has fixed semantics related to a region of the face (for example, the point (0.3, 0.3) may always correspond to the left eye, and the point (0.5, 0.5) may always correspond to the tip of the nose). The more accurate the prediction of the auxiliary face mesh with texture mapping, the closer the predicted 2D coordinates of each mesh vertex are to their corresponding semantics in texture space. The auxiliary face meshes predicted in step S110 are rasterized in step S120 using color coding of the face mesh vertices with their geometric normals and Z-buffering. Color coding of vertices with their geometric normals is the process of assigning to the R, G, B color channels of each point the X, Y, Z coordinates, respectively, of the normal vector at that point. The assigned RGB colors are then rasterized using Z-buffering.
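As a non-limiting illustration of the median reference texture of step S115, the following minimal sketch computes per-pixel, per-channel medians over several images that are assumed to have already been warped into texture space. Marking unobserved texels with `None` and skipping them is an assumption of this example, as is the function name.

```python
from statistics import median


def median_texture(textures):
    """Per-pixel, per-channel median over a list of HxW RGB textures.

    Each texel is an (r, g, b) tuple, or None where the face region was
    not observed in that image (hypothetical convention for this sketch).
    """
    height, width = len(textures[0]), len(textures[0][0])
    result = [[None] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            samples = [t[i][j] for t in textures if t[i][j] is not None]
            if samples:
                result[i][j] = tuple(
                    median(s[c] for s in samples) for c in range(3)
                )
    return result


# Three 1x2 textures; the second texel is observed in only one of them.
textures = [
    [[(0.1, 0.2, 0.3), None]],
    [[(0.9, 0.2, 0.5), (1.0, 1.0, 1.0)]],
    [[(0.2, 0.8, 0.4), None]],
]
reference = median_texture(textures)
```

The median is robust to outliers such as a specular highlight present in only a few flash images, which is one plausible reason to prefer it over the mean here.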

[0031] In step S125, loss values are computed based on a combination of one or more of the following loss functions: a main loss, a segmentation loss, a room shading loss, a symmetry loss, an albedo color matching loss and a normal loss. The main loss is computed as the discrepancy between the predicted image (from step S105) and the selected image (from step S100) using a combination of non-perceptual and perceptual loss functions. The segmentation loss is computed as the discrepancy between the predicted segmentation mask and the segmented foreground. The room shading loss is computed as a penalty on the sharpness of the predicted environmental shadow maps, the penalty increasing as the sharpness increases. The symmetry loss is computed as the discrepancy between the mirrored albedo semi-texture and the predicted albedo (from S105), geometrically warped into texture space using bilinear interpolation.
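As a non-limiting illustration of a sharpness penalty for the room shading loss, the following minimal sketch uses total variation, i.e., the mean absolute difference between neighboring pixels of the shadow map. Total variation is an assumed choice for this example; the text above only requires a penalty that grows with sharpness.

```python
def total_variation(shadow):
    """Mean absolute difference between horizontally and vertically
    adjacent pixels of a 2D shadow map (list of rows)."""
    height, width = len(shadow), len(shadow[0])
    total, count = 0.0, 0
    for i in range(height):
        for j in range(width):
            if j + 1 < width:
                total += abs(shadow[i][j + 1] - shadow[i][j])
                count += 1
            if i + 1 < height:
                total += abs(shadow[i + 1][j] - shadow[i][j])
                count += 1
    return total / count if count else 0.0


smooth = [[0.5, 0.5], [0.5, 0.5]]  # constant shading, no edges
sharp = [[0.0, 1.0], [0.0, 1.0]]   # hard vertical shadow edge

penalty_smooth = total_variation(smooth)
penalty_sharp = total_variation(sharp)
```

A penalty of this kind favors smoothly varying shadow maps, so that sharp detail tends to be explained by the albedo rather than by the shading.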

[0032] The albedo color matching loss is computed as the discrepancy between the median reference texture (from S115) and the predicted albedo (from S105), geometrically warped into texture space using bilinear interpolation. The normal loss is computed as the discrepancy between the predicted normals (from S105) and the rasterized images (from S120) of the auxiliary face mesh corresponding to the selected image. After the loss values are computed in step S125, they are backpropagated in step S130 to the weights of the deep neural network, the latent descriptors and the auxiliary parameters. In other words, the weights of the deep neural network, the latent descriptors and the auxiliary parameters are updated based on the loss values computed in step S125. The next iteration is then performed, starting from step S100. The deep neural network is trained for a fixed (predetermined) number of iterations (80,000 iterations in a non-limiting example).

Method Implementation Details

Lighting model

[0033] For a given point $x$ in space that belongs to the surface of a volumetric object, the radiance of the light emitted at $x$ in the direction $\omega_o$ is commonly described by the rendering equation:

$$L_o(x, \omega_o) = \int_S f_r(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\, v(x, \omega_i)\, \langle -\omega_i, n(x) \rangle \, d\omega_i, \tag{1}$$

where $L_i(x, \omega_i)$ is the incoming radiance arriving at $x$ in the direction $\omega_i$, and $S$ is the upper hemisphere with respect to the tangent plane of the surface at $x$ with unit normal $n(x)$. Also, $f_r(x, \omega_i, \omega_o)$ is the ratio of the intensity of light scattered in the direction $\omega_o$ to that of light incoming in the direction $\omega_i$, commonly referred to as the bidirectional reflectance distribution function (BRDF). Furthermore, $v(x, \omega_i)$ is the visibility term (equal to 1 if point $x$ is reachable by light from the direction $\omega_i$, and 0 if there is an occlusion in that direction), and $L_o(x, \omega_o)$ is the total radiance leaving $x$ in the direction $\omega_o$. In terms of this model, the BRDF is a material property that describes its spatially varying light scattering characteristics. Since the captured images are RGB images, the values $L_o(x, \omega_o)$, $L_i(x, \omega_i)$ and $f_r(x, \omega_i, \omega_o)$ all lie in $\mathbb{R}^3$, and each component (channel) is computed separately.

[0034] In an embodiment of the present invention, the person (object) is captured as one set of images taken under ambient lighting and another set of images taken in the same environment additionally illuminated by the camera flash. As described above, a blinking flash can be used. To model these two sets of images, the incoming radiance, together with visibility, can be decomposed into two terms:

$$L_i(x, \omega_i)\, v(x, \omega_i) = L_i^{room}(x, \omega_i)\, v^{room}(x, \omega_i) + F \cdot L_i^{flash}(x, \omega_i)\, v^{flash}(x, \omega_i), \tag{2}$$

where $L_i^{room}(x, \omega_i)$ is the ambient (room) lighting, $L_i^{flash}(x, \omega_i)$ is the flash lighting, and $F$ indicates whether the photo was taken with the flash on ($F = 1$) or off ($F = 0$). Here, $v^{room}(x, \omega_i)$ and $v^{flash}(x, \omega_i)$ model the occlusion of light for the ambient lighting and the flash, respectively.

[0035] It is further assumed that the BRDF is Lambertian (i.e., constant at each point), $f_r(x, \omega_i, \omega_o) = \rho(x)$ (with the normalizing constant omitted), i.e., it corresponds to a diffuse surface, where $\rho(x)$ denotes the surface albedo at point $x$. Note that with neural rendering the system can model a certain amount of non-Lambertian effects, since the albedo in the model can effectively be made view-dependent. Applying these modifications to (1) gives:

$$L_o(x, \omega_o) = \rho(x) \int_S L_i^{room}(x, \omega_i)\, v^{room}(x, \omega_i)\, \langle -\omega_i, n(x) \rangle \, d\omega_i + F \rho(x) \int_S L_i^{flash}(x, \omega_i)\, v^{flash}(x, \omega_i)\, \langle -\omega_i, n(x) \rangle \, d\omega_i. \tag{3}$$

[0036] Now let the integral in the first part be treated as a single shading term that characterizes the shading caused both by the room lamps and by occlusions (for example, the shadow cast by the nose onto the cheek when a human head is lit). This shading can be modeled as the product of a color temperature and a grayscale shading. As for the second part (the flash), the incoming radiance is modeled explicitly in terms of d(x), the distance from the flash to x, and a constant vector proportional to the color temperature and intensity of the flash. The flash is assumed to be far enough away that d(x) can be replaced by d, the distance from the camera to the closest point in the 3D point cloud, and that the light rays from the flash are approximately parallel. Since the flash and the camera lens of a smartphone are usually located in close proximity, the flash is assumed to be unoccluded for all x visible in a flash image (an image lit by the flash). On this basis, (3) can be transformed into (4):

[Equation (4) not reproduced in this text.]
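By way of a non-limiting reading aid, the following sketch shows one form that (4) could plausibly take, assembled only from the definitions given in paragraphs [0035] and [0036]. It is a reconstruction under stated assumptions, not the original formula: the symbols \rho (the albedo written p(x) above), c^{room}, c^{flash}, s(x), the flash indicator F \in \{0, 1\}, the outgoing direction \omega_o and the inverse-square distance falloff are all assumed here, and the original notation may differ.

```latex
% Hedged reconstruction of the lighting model; not the original formula.
% Assumed: rho = albedo, s = grayscale shading, F = 1 for flash images, 0 otherwise.
\begin{equation*}
L_o(x, \omega_o) \;\approx\;
  \underbrace{\rho(x)\, c^{\mathrm{room}}\, s(x)}_{\text{room lighting}}
  \;+\;
  F \cdot \underbrace{\rho(x)\, \frac{c^{\mathrm{flash}}}{d^{2}}\,
  \langle n(x),\, -\omega_o \rangle}_{\text{flash}}
\end{equation*}
```

Under this assumed form, multiplying \rho(x) by a constant k while dividing both c^{room} and c^{flash} by k leaves the result unchanged, which illustrates why the decomposition into separate components cannot be unique without additional priors.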

[0037] It should be noted that in (4) the decomposition of the light intensity into separate components is ambiguous, which is common to most albedo and illumination estimation problems. In particular, there is an inversely proportional relationship between the albedo and both color-temperature parameters, that of the room lighting and that of the flash. In the present invention this ambiguity is resolved by means of appropriate priors, as described in detail below in the "Fitting the model" section.

Geometric modeling

[0038] With the lighting model described, the geometric modeling is described next. The input is assumed to be a sequence of images of a person's upper body, taken with a smartphone camera in the same environment and showing the upper body, including the person's head, from different angles. The set of images captured with a flash is the part of this sequence that consists of photographs of the person's upper body additionally lit by the camera flash. Structure-from-motion (SfM) methods can be used to estimate (reconstruct) the camera viewpoint of each image. The 3D point cloud P is then estimated (reconstructed) in a similar way from all images. Any known method or software, for example SfM, can be used to estimate the 3D point cloud.

[0039] Segmentation and filtering. Since the person is modeled without the background, the foreground is segmented at step S205.3, and at step S205.4 the 3D point cloud is filtered, based on the foreground segmented by a segmentation neural network, to obtain a filtered 3D point cloud. By way of example, but without limitation, the U2-Net segmentation network, designed for the task of segmenting foreground objects, can be used as the segmentation neural network. A pretrained U2-Net model can additionally be fine-tuned using an accelerated training mode, for example on the Supervisely dataset of human images and corresponding segmentation masks, to make it better suited to this task. The images are then passed through the fine-tuned segmentation neural network to obtain a sequence of initial "soft" masks. Finally, multi-view consistency of the silhouettes can be achieved through visual hull estimation, using the 3D point cloud as a geometric proxy and the weights of the segmentation neural network as the parameterization. This is implemented by fine-tuning the weights of the segmentation neural network to minimize the inconsistency of the segments between views, which yields refined masks.
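By way of a non-limiting illustration, the filtering of the point cloud by foreground masks could be sketched as follows. This is a simplified sketch, not the procedure of the invention: it assumes a pinhole camera given as a hypothetical tuple (K, R, t), keeps a point when it falls on the foreground in most views in which it projects inside the frame, and does not perform the multi-view fine-tuning of the segmentation network described above. The names `project`, `filter_point_cloud`, `threshold` and `min_ratio` are introduced here for illustration only.

```python
import numpy as np

def project(points, K, R, t):
    """Project Nx3 world points with a pinhole camera; return pixel coordinates and depth."""
    cam = points @ R.T + t                      # world -> camera coordinates
    z = cam[:, 2]
    safe_z = np.where(z > 1e-9, z, 1.0)         # avoid dividing by zero behind the camera
    uv = (cam @ K.T)[:, :2] / safe_z[:, None]   # perspective division
    return uv, z

def filter_point_cloud(points, cameras, masks, threshold=0.5, min_ratio=0.9):
    """Keep points that land on foreground in most views where they fall inside the frame."""
    votes = np.zeros(len(points))
    seen = np.zeros(len(points))
    for (K, R, t), mask in zip(cameras, masks):
        h, w = mask.shape
        uv, z = project(points, K, R, t)
        col = np.round(uv[:, 0]).astype(int)
        row = np.round(uv[:, 1]).astype(int)
        inside = (z > 1e-9) & (col >= 0) & (col < w) & (row >= 0) & (row < h)
        seen += inside
        fg = np.zeros(len(points), dtype=bool)
        fg[inside] = mask[row[inside], col[inside]] > threshold
        votes += fg
    keep = (seen > 0) & (votes >= min_ratio * seen)
    return points[keep]
```

The ratio-based vote is one simple design choice; a stricter variant would require foreground membership in every view.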

[0040] Neural rendering. Neural rendering is based on a deep neural network that predicts albedo, normals and environmental shadow maps from the 3D point cloud rasterized for each camera view. Each point in the 3D point cloud is augmented with a hidden (neural) descriptor, i.e. a multi-dimensional latent vector of dimension L characterizing the properties of the point (L may be equal to 8, but this is not a limitation).

[0041] The neural rendering stage begins with the rasterization of the hidden descriptors onto a canvas associated with the camera C_k. This can be done as follows: a raw image (rasterized image) is formed by Z-buffering the 3D point cloud (for each pixel, the point closest to the camera that projects onto this pixel is found). The hidden descriptor of each such closest point is assigned to the corresponding pixel. If no point projects onto a pixel, that pixel is assigned a zero descriptor. A set of auxiliary raw images of other spatial sizes is built in the same way, and the hidden descriptors are rasterized onto these images by the same algorithm. The resulting pyramid of raw images is introduced to cover the 3D point cloud at several scales (resolutions). The raw image with the highest resolution has the greatest spatial detail, while the lower-resolution images exhibit less point bleed-through.
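By way of a non-limiting illustration, the Z-buffer rasterization of descriptors and the pyramid of raw images could be sketched as follows. The sketch assumes that the points have already been projected to pixel coordinates `uv` with depths `depth`, and that each pyramid level halves the resolution; the halving factor and the function names are assumptions introduced here, since the sizes of the auxiliary images are not reproduced in this text. The explicit loop is written for clarity rather than speed.

```python
import numpy as np

def rasterize_descriptors(uv, depth, descriptors, height, width):
    """Z-buffer rasterization: each pixel receives the descriptor of its closest point."""
    raw = np.zeros((height, width, descriptors.shape[1]), dtype=descriptors.dtype)
    zbuf = np.full((height, width), np.inf)
    col = np.floor(uv[:, 0]).astype(int)
    row = np.floor(uv[:, 1]).astype(int)
    valid = (depth > 0) & (col >= 0) & (col < width) & (row >= 0) & (row < height)
    for i in np.flatnonzero(valid):
        r, c = row[i], col[i]
        if depth[i] < zbuf[r, c]:               # closer than what is stored so far
            zbuf[r, c] = depth[i]
            raw[r, c] = descriptors[i]
    return raw                                   # pixels with no point keep a zero descriptor

def rasterize_pyramid(uv, depth, descriptors, height, width, levels=3):
    """Rasterize at progressively halved resolutions (the factor of 2 is an assumption)."""
    return [
        rasterize_descriptors(uv / 2**t, depth, descriptors, height // 2**t, width // 2**t)
        for t in range(levels)
    ]
```

At coarser levels several points share a pixel, so a surface seen from the front is more likely to hide points behind it, which is the reduced bleed-through mentioned above.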

[0042] The set of raw images is then processed by a deep neural network that closely follows the U-Net architecture with masked convolutions. Each of the raw images can be fed as input to (or concatenated at) the first layer of the network encoder of the corresponding resolution. The output of the deep neural network is a set of dense maps. These maps may contain output RGB values.

[0043] Figure 1 outlines the complete processing pipeline according to an embodiment of the present invention. In the present invention, the last layer of the deep neural network outputs an eight-channel tensor consisting of several groups:

- The first group contains three channels and uses a sigmoid nonlinearity. These channels contain the albedo values A. Each pixel of A describes the spatially varying reflectance properties (albedo p(x)) of the head surface (skin, hair or other parts).

- The second group also has three channels and uses group-wise L2 normalization at the end. This group contains the rasterized normals N, where each pixel contains the normal vector n(x) of a point on the head surface in world space.

- The next, single-channel group, which uses a sigmoid nonlinearity, corresponds to the grayscale shading S caused by variations in the room lighting and the resulting occlusions.

- Finally, the last single-channel group, which also uses a sigmoid nonlinearity, defines the segmentation mask M, where each pixel contains the predicted probability that the pixel belongs to the head rather than to the background.
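By way of a non-limiting illustration, splitting an eight-channel output into the four groups listed above could be sketched as follows. The channel order (albedo, normals, shading, mask) simply follows the order of the list and is an assumption, as are the function names and the small constant `eps` used to avoid division by zero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def split_output(raw, eps=1e-8):
    """Split an (H, W, 8) tensor into albedo A, normals N, shading S and mask M."""
    albedo = sigmoid(raw[..., 0:3])                                   # 3 channels in (0, 1)
    n = raw[..., 3:6]
    normals = n / (np.linalg.norm(n, axis=-1, keepdims=True) + eps)   # unit-length vectors
    shading = sigmoid(raw[..., 6])                                    # grayscale shading
    mask = sigmoid(raw[..., 7])                                       # foreground probability
    return albedo, normals, shading, mask
```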

[0044] Based on this output of the deep neural network, the final rendered image (i.e. the 3D portrait with altered lighting) is determined by fusing the albedo, the normals and the environmental shadow maps, as well as the segmentation mask, as prescribed by the lighting model (4) (i.e. depending on the lighting conditions):

[Equation (5) not reproduced in this text.]

where the dot product is applied to each pixel individually. The color temperature vectors C^room and C^flash, both three-component vectors, are considered part of the model and are shared by all pixels. These vectors can be estimated from the training data, as discussed below. At use time, the feature maps A and N, together with the parameters C^room and C^flash, can be used to render the upper body of the person under altered lighting, such as directional lighting, or under other lighting models, for example spherical harmonics (SH). In the latter case, the albedo A is multiplied by a nonlinear function of the per-pixel normals N and the SH coefficients. Given a 360° panorama of a new environment, the values of the SH coefficients (up to a predetermined order) can be obtained by integrating this panorama. It is known that choosing order 3 or higher, which results in at least 9 coefficients per color channel, can often produce expressive lighting effects. For third-order SH, the image with altered lighting can be defined as a quadratic form:

[Equation not reproduced in this text.]

where the 4 × 4 matrix M(SH_coef) depends linearly on the 27 coefficients SH_coef.
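By way of a non-limiting illustration, relighting with 9 SH coefficients per color channel could be sketched as follows. The sketch uses the well-known irradiance formulation of Ramamoorthi and Hanrahan, in which the irradiance is a quadratic form n^T M n of the homogeneous normal n = (x, y, z, 1) and M is linear in the SH coefficients. Whether the invention uses exactly this matrix, and the coefficient order chosen below, are assumptions; the function names are introduced here for illustration only.

```python
import numpy as np

# Constants of the Ramamoorthi-Hanrahan irradiance environment map formulation.
C1, C2, C3, C4, C5 = 0.429043, 0.511664, 0.743125, 0.886227, 0.247708

def sh_matrix(c):
    """Build the symmetric 4x4 matrix from the 9 SH coefficients of one color channel.

    Assumed coefficient order: L00, L1-1, L10, L11, L2-2, L2-1, L20, L21, L22.
    """
    l00, l1m1, l10, l11, l2m2, l2m1, l20, l21, l22 = c
    return np.array([
        [C1 * l22,  C1 * l2m2, C1 * l21,  C2 * l11],
        [C1 * l2m2, -C1 * l22, C1 * l2m1, C2 * l1m1],
        [C1 * l21,  C1 * l2m1, C3 * l20,  C2 * l10],
        [C2 * l11,  C2 * l1m1, C2 * l10,  C4 * l00 - C5 * l20],
    ])

def relight_sh(albedo, normals, sh_coef):
    """albedo, normals: float arrays (H, W, 3); sh_coef: (3, 9), i.e. 27 coefficients."""
    h, w, _ = normals.shape
    n_h = np.concatenate([normals, np.ones((h, w, 1))], axis=-1)   # homogeneous (x, y, z, 1)
    out = np.empty_like(albedo)
    for ch in range(3):
        m = sh_matrix(sh_coef[ch])
        irradiance = np.einsum('hwi,ij,hwj->hw', n_h, m, n_h)      # n^T M n for every pixel
        out[..., ch] = albedo[..., ch] * irradiance
    return out
```

With only the constant coefficient L00 set, every pixel receives the same irradiance regardless of its normal, which corresponds to uniform ambient lighting.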

Fitting the model

[0045] Overall scene loss. The model contains a large number of parameters that are fitted to the data. During fitting, the images rendered according to (5) are compared with the ground-truth images by means of a loss function built as a combination of one or more of the components described below.

[0046] The main loss function equals the estimated mismatch between the predicted image and the ground-truth image:

[Equation not reproduced in this text.]

where the following mismatch function is used to compare a pair of images:

[Equation not reproduced in this text.]

[0047] Here, the mismatch function combines a perceptual mismatch based on the layers of the VGG-16 network with a mean absolute deviation (L1) term. The images are reduced by average pooling with a K×K kernel (K may be equal to 4, but without limitation), and a balancing coefficient is introduced to equalize the value ranges of the two terms (for example, the coefficient may be equal to 2500). While the VGG term encourages matching of high-frequency details, the mean absolute deviation term encourages matching of colors. Since naive optimization of the mean absolute deviation term can lead to blurring and loss of detail, this term can be evaluated on downsampled images.
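By way of a non-limiting illustration, a mismatch function of this kind could be sketched as follows. The perceptual term is passed in as a callable `perceptual`, because a VGG-16 based distance requires pretrained weights that are outside the scope of this sketch. The assumption that the balancing coefficient (here `beta`) multiplies the mean absolute deviation term, and all function names, are introduced here for illustration and are not confirmed by the text.

```python
import numpy as np

def avg_pool(img, k):
    """Average-pool an (H, W, C) image with a non-overlapping k x k kernel."""
    h, w, c = img.shape
    h2, w2 = h // k, w // k
    return img[:h2 * k, :w2 * k].reshape(h2, k, w2, k, c).mean(axis=(1, 3))

def l1(a, b):
    """Mean absolute deviation between two arrays of the same shape."""
    return float(np.mean(np.abs(a - b)))

def mismatch(pred, target, perceptual, k=4, beta=2500.0):
    """Perceptual term plus beta times L1 on average-pooled images (assumed weighting)."""
    return perceptual(pred, target) + beta * l1(avg_pool(pred, k), avg_pool(target, k))
```

Evaluating the L1 term on pooled images makes it sensitive to overall color while leaving fine detail to the perceptual term.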

[0048] Since the proposed method must segment the rendered head from an arbitrary viewpoint, a segmentation loss function constraining the predicted mask M can be introduced:

[Equation not reproduced in this text.]

where the Dice function is a common choice of loss function for segmentation (computed as a pixel-wise F1 score). In effect, with this loss function the deep neural network learns to extrapolate the precomputed masks to new viewpoints.
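By way of a non-limiting illustration, a soft Dice loss between a predicted mask and a precomputed mask could be sketched as follows. The smoothing constant `eps` and the "one minus Dice coefficient" form are common conventions assumed here, not details taken from the text.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """One minus the soft Dice coefficient between a predicted mask and a reference mask."""
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)
```

For binary masks the Dice coefficient coincides with the F1 score over pixels, so the loss is close to 0 for identical masks and close to 1 for disjoint ones.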

[0049] According to (5), the image rendered for a photograph captured without flash is given by the room-lighting term alone. In practice, this creates a certain ambiguity between the learned maps A and S. Since A participates in both terms of (5), the high-frequency component of these renderings tends to be stored in S by default. The following room shading loss function implicitly requires that A, rather than S, be highly precise:

[Equation not reproduced in this text.]

where TV is the L1-based total variation loss.
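By way of a non-limiting illustration, an L1-based total variation of a single-channel map could be sketched as follows. Applying it to the shading map S is an assumption drawn from the surrounding discussion; the normalization by the mean and the function name are likewise choices made for this sketch.

```python
import numpy as np

def total_variation_l1(img):
    """Mean absolute difference between neighboring pixels of an (H, W) map, H and W >= 2."""
    dy = np.abs(img[1:, :] - img[:-1, :])    # vertical neighbors
    dx = np.abs(img[:, 1:] - img[:, :-1])    # horizontal neighbors
    return float(dy.mean() + dx.mean())
```

Penalizing this quantity on a shading map favors smooth shading, so that sharp detail has to be explained by another map.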

[0050] Losses based on prior knowledge of face structure. To further regularize the training process, a particular property of the scenes can be exploited, namely the presence of face regions. To this end, face alignment is performed for each training image using a pretrained 3D face mesh reconstruction network. The known PRNet system can be used as the 3D face mesh reconstruction network, but without limitation. Given an arbitrary image containing a face, the 3D face mesh reconstruction network can estimate a face alignment map, i.e. a Posmap tensor of size 256×256 (the specific size is given as an example, not a limitation), which maps UV coordinates (in a predefined fixed texture space associated with the human face) to the screen-space coordinates of the image.

[0051] Let Posmap_1, ..., Posmap_P denote the position maps computed for each image in the training sequence. Bilinear sampling (backward warping) of an image onto a Posmap is an operation that maps the visible part of the image into UV texture space. This mapping thus produces a colored (partial) face texture.
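By way of a non-limiting illustration, bilinear sampling of an image at the positions stored in a position map could be sketched as follows. The sketch assumes that the position map stores, for each texel, the (x, y) pixel coordinates in the image, and it marks a texel as valid only when those coordinates fall inside the frame; handling of self-occluded face regions is omitted. The function name and the returned validity mask are introduced here for illustration.

```python
import numpy as np

def sample_texture(image, posmap):
    """Backward-warp an image into texture space.

    image: (H, W, C) with H, W >= 2; posmap: (Th, Tw, 2) holding (x, y) per texel.
    Returns the (Th, Tw, C) texture and a boolean (Th, Tw) validity mask.
    """
    h, w, _ = image.shape
    x, y = posmap[..., 0], posmap[..., 1]
    valid = (x >= 0) & (x <= w - 1) & (y >= 0) & (y <= h - 1)
    xc, yc = np.clip(x, 0, w - 1), np.clip(y, 0, h - 1)
    x0 = np.minimum(np.floor(xc).astype(int), w - 2)
    y0 = np.minimum(np.floor(yc).astype(int), h - 2)
    fx, fy = (xc - x0)[..., None], (yc - y0)[..., None]
    top = image[y0, x0] * (1 - fx) + image[y0, x0 + 1] * fx
    bottom = image[y0 + 1, x0] * (1 - fx) + image[y0 + 1, x0 + 1] * fx
    texture = (top * (1 - fy) + bottom * fy) * valid[..., None]   # zero out invalid texels
    return texture, valid
```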

[0052] The collected face geometry data are used in two ways. First, an albedo half-texture of size 256×128 (for the left or right half of the albedo) is estimated throughout the fitting. The specific size (256×128) is given as an example and is not a limitation. The albedo half-texture is initialized by taking the pixel-wise median of the projected textures over all flash images and averaging the left half with the flipped right half, or the right half with the flipped left half. The symmetry loss function is a loss function based on prior knowledge of face structure that promotes albedo symmetry by comparing the predicted albedo with the learned albedo texture:

[Equation not reproduced in this text.]

(this mismatch function is evaluated only where the sampled texture is defined). The comparison uses the half-texture concatenated with its horizontally flipped version. This loss function makes the albedo more symmetric and brings the colors of the albedo texture into agreement with the colors of the learned albedo. The balance between these two effects is controlled by the learning rate of the albedo half-texture. In the present invention, the symmetry loss function helps to resolve the decomposition of the image into albedo and shadows, since opposite points of the face (for example, the left and right cheeks) can be assumed to have the same albedo, whereas cast shadows are most often asymmetric.
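By way of a non-limiting illustration, the initialization of a half-texture and its expansion into a full symmetric texture could be sketched as follows. The sketch assumes that the per-image textures are stacked into one array with an even width and that every texel is defined in every image; in practice undefined texels would have to be excluded from the median. The function names are introduced here for illustration.

```python
import numpy as np

def init_half_texture(textures):
    """textures: (P, H, W, C) stack of per-image face textures, W even.

    Returns an (H, W // 2, C) half-texture.
    """
    median = np.median(textures, axis=0)            # pixel-wise median over the images
    half = median.shape[1] // 2
    left = median[:, :half]
    right_flipped = median[:, half:][:, ::-1]       # mirror the right half horizontally
    return 0.5 * (left + right_flipped)

def full_texture(half_texture):
    """Concatenate the half-texture with its horizontally flipped copy."""
    return np.concatenate([half_texture, half_texture[:, ::-1]], axis=1)
```

Because the full texture is built from one half and its mirror image, it is symmetric by construction, which is what allows it to act as a symmetry prior for the predicted albedo.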

[0053] In addition, to select the gamma for the albedo, an albedo color matching loss function can be introduced:

[Equation not reproduced in this text.]

(likewise computed only for valid texture elements).

[0054] Another type of data that can be obtained from the 3D face mesh reconstruction network (e.g. PRNet) is the normals of the facial part. Each face alignment map, together with the estimated depth (also produced by PRNet) and a set of triangles, defines a triangular mesh. The meshes estimated for each view are rendered and smoothed, the face normals are computed, and the projected normals are rendered onto the corresponding camera view (technically, all of these operations can be performed, for example, in the Blender rendering system). The rendered normals are then rotated by the camera rotation matrix to transform them into world space. The estimated normal images for the facial part, of size H × W, are denoted N_1, ..., N_P. The normals predicted by the deep neural network are matched to the normals of the 3D face mesh reconstruction network (e.g. PRNet) over the facial region (defined by the mask M_face) by means of a normal loss function:

[Equation not reproduced in this text.]

[0055] The total (combined) loss function can be expressed as follows:

[Equation not reproduced in this text.]

The individual loss functions can be balanced by weighting coefficients, and other balancing schemes are also possible. [The example weight values are not reproduced in this text.]

[0056] Optimization. The trainable part of the system is trained for a single scene by backpropagating the loss to the parameters of the rendering network, the point descriptors D and the auxiliary parameters. Adam can be used with the same learning rate for the rendering network parameters and D, while other learning rates, chosen empirically according to the range of their possible values, can be used for the remaining trainable parameters. At each step, a training image is sampled and a forward pass is performed, followed by a gradient step. Augmentations during training may include random zooming in or out followed by cropping of a small patch.

[0057] Figure 5 is a block diagram of a computing device 50 configured to render a 3D portrait of a person according to an embodiment of the present invention. The computing device 50 comprises a processor 50.1, a camera 50.2 (the camera 50.2 is optional and is therefore outlined with a dashed line in Figure 5) and a memory 50.3, which are interconnected with one another. The illustration of the interconnections between the processor 50.1, the camera 50.2 and the memory 50.3 should not be considered limiting, since it is clear that the processor 50.1, the camera 50.2 and the memory 50.3 can be interconnected in various ways. The camera 50.2 is configured to capture a sequence of images of a person using a blinking flash. The memory 50.3 is configured to store processor-executable instructions that cause the computing device 50 to perform any step or sub-step of the proposed method, as well as the weights of the deep neural network, the hidden descriptors and the auxiliary parameters obtained at the training stage. When executing the processor-executable instructions, the processor 50.1 is configured to cause the computing device 50 to perform the proposed method for rendering a 3D portrait of a person with altered lighting.

[0058] In an alternative embodiment, in which a remote server is used to train the disclosed system and to process the sequence of images of a person, the computing device may further comprise a communication unit (not shown) for exchanging data with the remote server. The data may include a request from the computing device to train the proposed system and/or to process the sequence of images of a person. The data sent to the remote server may also include the sequence of images captured by the computing device using a blinking flash. The data received from the remote server may include the weights and other parameters of the trained system and/or the 3D portrait of the person with altered lighting. The communication unit can use any known communication technology, for example WiFi, WiMax, 4G (LTE), 5G, etc. The communication unit is interconnected with the other components of the computing device or may be part of the processor (for example, as an SoC).

[0059] The computing device 50 may comprise other components that are not shown, such as a screen, a camera, a communication unit, a touch panel, a keyboard, a keypad, one or more buttons, a speaker, a microphone, a Bluetooth module, an NFC module, an RF module, a Wi-Fi module, a power supply, etc., as well as the corresponding connections. The proposed method for rendering a 3D portrait of a person with altered lighting can be implemented on a wide range of computing devices 50, such as, but not limited to, smart virtual reality (VR) glasses, smart augmented reality (AR) glasses, 3D display systems, smartphones, tablets, laptops, mobile robots and navigation systems. The implementation of the proposed method supports all kinds of devices capable of performing computations on a CPU. In addition, the computing device 50 may comprise additional components for accelerating the neural network and the rasterization, such as a GPU (graphics processing unit), an NPU (neural processing unit) or a TPU (tensor processing unit). OpenGL may be used to implement the rasterization. When such additional components are included, the device can perform the proposed method (including the training stage) faster and more efficiently. In non-limiting embodiments, the processor 50.1 may be implemented as a computing means including, but not limited to, a general-purpose processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a system on a chip (SoC).

[0060] At least one of the plurality of modules, units, components, steps and sub-steps may be implemented using an AI model. One or more steps or sub-steps of the disclosed method may be implemented as units/modules (i.e., as hardware components) of the computing device 50. An AI-related function may be performed by means of the memory 50.3, which may include non-volatile memory and volatile memory, and the processor 50.1. The processor 50.1 may include one or more processors. The one or more processors may comprise a general-purpose processor such as a central processing unit (CPU), an application processor (AP), and the like; a graphics-only processor such as a graphics processing unit (GPU) or a visual processing unit (VPU); and/or a dedicated AI processor such as a neural processing unit (NPU). The one or more processors control the processing of input data in accordance with a predefined operating rule or an artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or AI model is provided through training. In this context, being provided through training means that a predefined operating rule or an AI model with a desired characteristic is created by applying a learning algorithm to a plurality of training data. The training may be performed on the device on which the AI according to the embodiment runs, and/or may be implemented through a separate server/system.

[0061] The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values and performs its layer operation by applying those weights to the output of the previous layer. Examples of neural networks include, but are not limited to, a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN) and deep Q-networks. A learning algorithm is a method of training a predetermined target computing device 50 using a plurality of training data in order to cause, allow or control the target computing device 50 to make a determination, estimation or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning.
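
As an illustration of the layer operation mentioned above, the following minimal sketch assumes a fully connected layer followed by a ReLU activation. The function name `dense_layer` and the choice of activation are illustrative assumptions and are not specified in this description.

```python
from typing import List


def dense_layer(inputs: List[float],
                weights: List[List[float]],
                biases: List[float]) -> List[float]:
    """Apply one fully connected layer to the previous layer's output.

    Assumption for illustration only: the layer is dense and uses ReLU.
    Each row of `weights` holds the weights of one output unit.
    """
    outputs = []
    for row, bias in zip(weights, biases):
        # Weighted sum of the previous layer's output plus a bias term.
        pre_activation = sum(w * x for w, x in zip(row, inputs)) + bias
        # ReLU: negative pre-activations are clamped to zero.
        outputs.append(max(0.0, pre_activation))
    return outputs


layer_output = dense_layer([1.0, 2.0], [[1.0, 0.0], [-1.0, -1.0]], [0.5, 0.0])
```

Stacking several such calls, each consuming the output of the previous one, gives the multi-layer structure the paragraph refers to.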

[0062] It should be clearly understood that not all of the technical effects mentioned herein need to be achieved in every embodiment of the present technology. For example, embodiments of the present technology may be implemented without the user benefiting from some of these technical effects, while other embodiments may be implemented with the user benefiting from other technical effects or none at all.

[0063] Modifications and improvements to the above-described implementations of the present technology may be apparent to those skilled in the art. For example, the specific parameter values given in the above description should not be construed as limiting, since parameter values can be selected experimentally from appropriate ranges, for example ±10-50% of the specific values given. The above description is intended to be exemplary rather than limiting. Accordingly, the scope of the present technology is limited solely by the scope of the appended claims.

[0064] Although the above implementations have been described and illustrated with reference to specific steps performed in a specific order, it should be understood that these steps may be combined, subdivided or performed in a different order without departing from the principles of the present invention. Accordingly, the order and grouping of the steps are not a limitation of the present technology. The use of the singular form with respect to any element disclosed herein does not preclude there being two or more such elements in an actual implementation.

Claims

1. A method for visualizing a 3D portrait of a person with altered lighting, the method comprising:

receiving (S210) an input defining a camera pose and lighting conditions;

rasterizing (S215) hidden descriptors of a 3D point cloud at different resolutions in accordance with the camera pose to obtain rasterized images, wherein the 3D point cloud is generated (S205) based on a sequence of images captured (S200) by a camera with a blinking flash while the camera is moved at least partially around the upper body of a person, the sequence of images containing a set of images taken with the flash and a set of images taken without the flash;

processing (S220) the rasterized images with a deep neural network to predict albedo, normals, environmental shadow maps and a segmentation mask for the received camera pose; and

combining (S225) the predicted albedo, normals, environmental shadow maps and segmentation mask into a 3D portrait with altered lighting in accordance with the lighting conditions.
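
The combining step can be illustrated per pixel with the sketch below. It is only a minimal example under stated assumptions: a Lambertian surface, a single directional light, and an ambient term scaled by the environmental shadow value. The claim does not fix this formula, and the name `relight_pixel` and all of its parameters are hypothetical.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def relight_pixel(albedo: Vec3, normal: Vec3, shadow: float, mask: float,
                  light_dir: Vec3, light_color: Vec3,
                  ambient_color: Vec3) -> Vec3:
    """Shade one pixel from predicted albedo, normal, shadow value and mask.

    Assumed model (not taken from the claims):
        color = mask * albedo * (ambient * shadow + light * max(0, n . l))
    `normal` and `light_dir` are expected to be unit vectors.
    """
    # Lambertian cosine term, clamped so back-facing light contributes nothing.
    n_dot_l = max(0.0, sum(n * l for n, l in zip(normal, light_dir)))
    return tuple(
        mask * a * (amb * shadow + lc * n_dot_l)
        for a, lc, amb in zip(albedo, light_color, ambient_color)
    )


# A grey pixel lit head-on by white light, with no ambient contribution.
lit = relight_pixel((0.5, 0.5, 0.5), (0.0, 0.0, 1.0), 1.0, 1.0,
                    (0.0, 0.0, 1.0), (1.0, 1.0, 1.0), (0.0, 0.0, 0.0))
```

Changing `light_dir`, `light_color` or `ambient_color` while keeping the predicted maps fixed is what produces the altered lighting; the mask simply zeroes out background pixels.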

2. The method according to claim 1, wherein the step of creating (S205) the 3D point cloud further comprises estimating (S205.1) the camera viewpoints from which the images of the sequence were captured.

3. The method of claim 2, wherein the camera pose and lighting conditions received through the input differ from the camera viewpoints and lighting conditions under which the sequence of images was captured.

4. The method according to claim 2, wherein the step of creating (S205) the 3D point cloud and the sub-step of estimating (S205.1) the camera viewpoints are performed using structure from motion (SfM), wherein the 3D point cloud comprises points at least partially corresponding to the upper body of the person.

5. The method according to claim 1, wherein the step of creating (S205) the 3D point cloud further comprises processing each image of the sequence by segmenting (S205.3) the foreground and filtering (S205.4) the 3D point cloud based on the foreground segmented by a neural segmentation network to obtain a filtered 3D point cloud.

6. The method of claim 1, wherein the step of rasterizing (S215) the 3D point cloud is performed using Z-buffering.
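
Z-buffered rasterization of point descriptors can be sketched as follows. This is a minimal single-resolution example that assumes a pinhole camera with the principal point at the image centre and points already expressed in camera coordinates; the function name `rasterize_points` and its parameters are hypothetical. Calling it with several image sizes would give rasterized images at different resolutions.

```python
from typing import List, Sequence, Tuple


def rasterize_points(points: Sequence[Tuple[float, float, float]],
                     descriptors: Sequence[Sequence[float]],
                     width: int, height: int, focal: float):
    """Splat each point's descriptor into the pixel it projects to.

    Assumptions for illustration: pinhole projection, z > 0 is in front of
    the camera, and each point covers exactly one pixel.
    Returns (image, depth), where image[v][u] is a descriptor vector.
    """
    dim = len(descriptors[0]) if descriptors else 0
    depth: List[List[float]] = [[float("inf")] * width for _ in range(height)]
    image = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for (x, y, z), desc in zip(points, descriptors):
        if z <= 0:
            continue  # the point is behind the camera
        u = int(round(focal * x / z + (width - 1) / 2))
        v = int(round(focal * y / z + (height - 1) / 2))
        # Z-buffer test: keep only the point closest to the camera per pixel.
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            image[v][u] = list(desc)
    return image, depth


# Two points on the optical axis: the nearer one (z = 1) must win.
raster, zbuf = rasterize_points([(0.0, 0.0, 2.0), (0.0, 0.0, 1.0)],
                                [[1.0], [2.0]], width=3, height=3, focal=1.0)
```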

7. The method of claim 1, wherein, in said sequence of images, the flash images from the set of flash images are interleaved with non-flash images from the set of non-flash images.

8. The method of claim 1, wherein the hidden descriptor is a multidimensional hidden vector characterizing the properties of a corresponding point in a 3D point cloud.

9. The method according to any one of claims 1 to 8, wherein, at a training stage, an image is randomly selected (S100) from the captured sequence of images together with the camera pose corresponding to the selected image, and an image predicted by the deep neural network for that camera pose is obtained (S105), the training stage being performed iteratively.

10. The method of claim 9, wherein the deep neural network is trained by backpropagating (S130) a loss value to the weights of the deep neural network, the hidden descriptors and the auxiliary parameters,

wherein the loss value is calculated (S125) based on one or more of the following loss functions: basic loss function, segmentation loss function, room shading loss function, symmetry loss function, albedo color match loss function, normal loss function.

11. The method of claim 10, wherein the auxiliary parameters include one or more of the following: indoor lighting color temperature, flash color temperature, and albedo semi-texture.

12. The method according to claim 11, further comprising predicting (S110), by a 3D face mesh reconstruction network, auxiliary face meshes with a corresponding texture mapping for each of the images in the captured sequence;

wherein the corresponding texture mapping is defined by a set of two-dimensional coordinates in a fixed, predefined texture space for each mesh vertex.

13. The method of claim 10, wherein the basic loss function is computed as a mismatch between the predicted image and the selected image by a combination of non-perceptual and perceptual loss functions.

14. The method of claim 10, wherein the segmentation loss function is computed as a mismatch between the predicted segmentation mask and the segmented foreground.

15. The method of claim 10, wherein the room shading loss function is computed as a penalty on the sharpness of the predicted environmental shadow maps, the penalty increasing as the sharpness increases.
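
One common way to express a penalty that grows with sharpness is total variation, sketched below. Using total variation here is an assumption made for illustration; the claim does not name a specific formula, and the function name `total_variation` is hypothetical.

```python
from typing import List


def total_variation(shadow_map: List[List[float]]) -> float:
    """Sum of absolute differences between neighbouring pixels.

    Assumption: anisotropic total variation stands in for the penalty.
    A smooth map gives a small value; sharp edges give a large one.
    """
    height, width = len(shadow_map), len(shadow_map[0])
    penalty = 0.0
    for r in range(height):
        for c in range(width):
            if c + 1 < width:   # horizontal neighbour
                penalty += abs(shadow_map[r][c + 1] - shadow_map[r][c])
            if r + 1 < height:  # vertical neighbour
                penalty += abs(shadow_map[r + 1][c] - shadow_map[r][c])
    return penalty


flat_penalty = total_variation([[0.5, 0.5], [0.5, 0.5]])
edge_penalty = total_variation([[0.0, 1.0], [0.0, 1.0]])
```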

16. The method of claim 10, wherein the symmetry loss function is calculated as the mismatch between the mirrored albedo semi-texture and the predicted albedo geometrically translated into texture space by bilinear interpolation.
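
Bilinear interpolation, which the claims use when transferring images into texture space, can be sketched for a single-channel image as follows. The function name `bilinear_sample` and the border clamping are illustrative assumptions.

```python
from typing import List


def bilinear_sample(image: List[List[float]], x: float, y: float) -> float:
    """Sample a single-channel image at a continuous position.

    `x` runs along columns and `y` along rows. Coordinates outside the
    image are clamped to the border (an assumption for this sketch).
    """
    height, width = len(image), len(image[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


centre_value = bilinear_sample([[0.0, 1.0], [2.0, 3.0]], 0.5, 0.5)
```

For a color image the same function would be applied to each channel separately.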

17. The method of claim 10, wherein the albedo color match loss function is calculated as a mismatch between the median reference texture and the predicted albedo geometrically translated into texture space by bilinear interpolation,

wherein the median reference texture is obtained (S115) by calculating the per-pixel median of all images from the set of flash images geometrically translated into texture space by bilinear interpolation.
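
The per-pixel median can be sketched as follows, assuming the flash images have already been transferred into texture space and share one size. The function name `median_texture` is hypothetical, and a single channel is used for brevity; a color texture would be processed channel by channel.

```python
from statistics import median
from typing import List, Sequence


def median_texture(textures: Sequence[List[List[float]]]) -> List[List[float]]:
    """Per-pixel median over a stack of equally sized single-channel textures."""
    height, width = len(textures[0]), len(textures[0][0])
    return [
        [median(t[r][c] for t in textures) for c in range(width)]
        for r in range(height)
    ]


# Three 1x2 textures; the median is taken independently at each pixel.
reference = median_texture([[[0.1, 0.9]], [[0.5, 0.2]], [[0.3, 0.4]]])
```

A median is less affected than a mean by a few frames with outlier values at a given pixel.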

18. The method according to claim 10, wherein the normal loss function is calculated as the discrepancy between the predicted normals and the rasterized images of the auxiliary face mesh corresponding to the selected image,

wherein the rasterization (S120) is performed by color-coding the vertices of the face mesh with their geometric normals and Z-buffering.

19. A computing device (50) comprising a processor (50.1) and a memory (50.3) storing processor-executable instructions as well as the weights of a deep neural network, hidden descriptors and auxiliary parameters obtained at a training stage, wherein, when the processor (50.1) executes the instructions, the processor (50.1) causes the computing device (50) to perform the method for rendering a 3D portrait of a person with altered lighting according to any one of claims 1-18.

20. Computing device (50) according to claim 19, further comprising a camera (50.2) adapted to capture a sequence of images of a person with a blinking flash.

21. Computing device (50) according to claim 19, further comprising a communication unit configured to exchange data with a remote server.