RU2755396C1

RU2755396C1 - Neural network transfer of the facial expression and position of the head using hidden position descriptors

Info

Publication number: RU2755396C1
Application number: RU2020119034A
Authority: RU
Inventors: Егор Андреевич БУРКОВ; Игорь Игоревич ПАСЕЧНИК; Виктор Сергеевич Лемпицкий
Original assignee: Самсунг Электроникс Ко., Лтд.
Priority date: 2020-06-09
Filing date: 2020-06-09
Publication date: 2021-09-15

Abstract

FIELD: image data processing. SUBSTANCE: invention relates to computer technology, namely to computer graphics. A hardware apparatus for transferring an arbitrary facial expression and head pose to an avatar of a person while maintaining the identity of the avatar, containing: a detector for capturing images of the facial expression and head pose of a person B; a tool for selecting images of a person A; an identity encoder block configured to retrieve an identity descriptor from the images of the person A, wherein the output of the pose encoder does not contain information about the identity of the person A; a pose encoder block configured to receive images of the person B at the input; and a generator block configured to output the resulting image of the avatar of the person A with the head pose and facial expression taken from the person B. EFFECT: increased quality of image synthesis. 6 cl, 10 dwg, 2 tbl

Description

Technical field of the invention

The invention relates to computer graphics, telepresence (for example, video communication), head pose estimation in images, video production, face tracking in video, and augmented/virtual reality.

State of the art

In recent years, the quality and robustness of head video transfer have improved significantly. The most advanced current systems [33, 22, 37, 30, 39, 42, 10, 36, 35] demonstrate convincing, nearly photorealistic "talking head" transfers. The most recent of them achieve this with deep generative neural networks even when only a single image of the target person is available [30, 42, 10, 36, 35].

Face/head transfer is an active area of research. Here one can distinguish works in which the changes are localized within the face (face transfer), for example [33, 30], from broader approaches that model larger regions, including a significant part of the clothing, neck and upper garments (head transfer), for example [22, 37, 42].

An important aspect of transfer systems is the pose representation. As noted above, most works drive the transfer with landmarks [33, 22, 37, 42, 10, 36]. Another approach is to use facial action units (AU) [9], as in the face transfer of [30] and the head transfer of [35]. Detecting action units still requires manual annotation and supervised learning. The X2Face system [39] uses latent vectors that are trained to predict warping fields.

A more classical approach is to model face/head pose in the framework of a 3D morphable model (3DMM) [1] or to apply a similar approach in 2D (for example, active appearance models) [6]. However, learning a 3DMM and fitting a learned 3DMM almost always involve detecting landmarks and thus inherit many of the shortcomings of landmarks. In other cases, a dataset of 3D scans is required to build a model that separates pose from identity in the 3DMM framework.

A number of recent works have investigated how landmarks can be learned without supervision [44, 19]. Although unsupervised keypoints are generally very promising, they still contain person-specific information, just like supervised keypoints, and are therefore mostly unsuitable for transfer from one person to another. The same applies to dense, high-dimensional descriptors such as the DensePose body descriptor [14] and dense face-only descriptors [15, 34]. Finally, codec avatars [26] learn person-specific latent pose descriptors and extractors based on reconstruction losses. However, transferring such descriptors from one person to another has not been considered.

A recent concurrent work [32] demonstrated that the relative motion of unsupervised keypoints can be used to transfer animation, at least in the absence of significant head rotation. A comprehensive comparison of the proposed approach with [32] is left for future work.

Beyond head/face transfer, there is a very large body of work on learning disentangled representations. Representative works in this area include [8, 40], which learn latent pose or shape descriptors for arbitrary object classes from video datasets. Some approaches (for example [24]) aim to learn a disentanglement of style and content (which may roughly correspond to a disentanglement of shape and texture) using adversarial losses [12] and cycle-consistency losses [45, 17]. Alternatively, disentanglement can be achieved by directly fitting factorized distributions to the data (for example, [23]).

More importantly, a new pose representation is proposed for neural transfer of facial expression and head pose (here and below, "head pose" means the combination of head orientation, head position and facial expression). The pose representation plays a key role in the quality of the transfer. Most systems, including [33, 22, 37, 42, 10, 36], are based on a keypoint (landmark) representation. The main advantage of this representation is that robust and efficient off-the-shelf landmark detectors are now available [21, 2].

The emergence of large unlabeled datasets of videos of people, such as [29, 4, 5], makes it possible to learn latent pose and expression descriptors without supervision. This approach was first explored in [39, 38], where latent pose descriptors were learned so that a dense flow between different frames could be inferred from them.

However, facial landmarks have a number of drawbacks. First, training a landmark detector requires excessive annotation effort, and sets of annotated landmarks often miss important aspects of the pose. For example, many landmark annotations do not include the pupils, so the transfer cannot fully control gaze direction. Second, many landmarks have no anatomical basis, and their annotations are ambiguous and error-prone, especially when the landmarks are occluded by other objects. In practice, this annotation ambiguity often leads to temporal instability of keypoint detection, which in turn shows up in the transfer results. Finally, landmarks as a representation are person-specific, since they carry a substantial amount of pose-independent information about head geometry.

This can be highly undesirable for head transfer, for example when one wishes to animate a portrait photograph or a painting with a target person who has a different head geometry.

Summary of the invention

The present invention proposes to improve existing one-shot neural head transfer systems in two important ways. First, the previous state-of-the-art transfer system [42] is straightforwardly extended with the ability to predict foreground segmentation. Such prediction is needed in various scenarios, for example telepresence, where carrying the original background over into a new environment may be undesirable.

An alternative to the warping-based approach [39, 38] is proposed. In the proposed approach, low-dimensional, person-independent pose descriptors are learned together with medium-dimensional, person-specific and pose-independent descriptors by imposing a set of reconstruction losses on video frames over a large collection of videos.

Importantly, when the reconstruction losses are evaluated, the background is removed so that its clutter and frame-to-frame changes do not affect the learned descriptors.
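The idea of excluding the background from the reconstruction loss can be illustrated with a minimal sketch. The text does not state the form of the losses, so the plain L1 distance, the grayscale nested-list image layout and the name `masked_l1_loss` are all assumptions made only for illustration.

```python
from typing import Sequence

# Assumed layout: H x W grayscale image as nested sequences of floats in [0, 1].
Image = Sequence[Sequence[float]]


def masked_l1_loss(predicted: Image, target: Image, foreground_mask: Image) -> float:
    """Mean absolute error computed over foreground pixels only.

    foreground_mask holds 1.0 for foreground pixels and 0.0 for background,
    so background clutter cannot contribute to the loss value.
    L1 is an illustrative choice, not the loss used in the source.
    """
    total = 0.0
    weight = 0.0
    for pred_row, tgt_row, mask_row in zip(predicted, target, foreground_mask):
        for p, t, m in zip(pred_row, tgt_row, mask_row):
            total += m * abs(p - t)
            weight += m
    # An empty foreground yields zero loss instead of a division by zero.
    return total / weight if weight > 0 else 0.0
```

With such a masked loss, two target frames that differ only in their background give the same loss value, which is the property the paragraph above relies on.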

The proposed system modifies and extends the transfer model of Zakharov et al. [42]. First, the ability to predict segmentation is added. Second, the system learns to perform transfer based on latent pose vectors rather than keypoints.

A simple learning framework based on sampling multiple random frames from the same video, combined with the large size of the video dataset, makes it possible to learn extractors for both descriptors that work very well for transfer tasks, including transfer from one person to another.

In particular, the proposed transfer based on the new latent pose representation preserves the identity of the target person much better than transfer using the FAb-Net [38] and X2Face [39] pose descriptors. In addition, the quality of the learned latent pose descriptors is analyzed for tasks such as landmark prediction and pose-based retrieval.

A rich representation of human head pose and facial expression is a crucial component of many human-centered computer vision and graphics tasks. Learning such representations from data without relying on human annotation is an effective approach, since it can exploit vast unlabeled datasets of videos of people. This work proposes a new, simple method for such learning. Unlike previous works, the proposed learning yields person-independent descriptors that capture the pose of a person and can be transferred from one person to another, which is especially useful for applications such as face transfer. It is also shown that, beyond face transfer, these descriptors are useful for other downstream tasks, for example face orientation estimation. The method described above, performed by an electronic device, can be carried out using an artificial intelligence model.

Importantly, the proposed system can also be successfully applied to video generation. Even though each video frame is generated independently and without any temporal processing, the proposed system produces temporally smooth facial expressions thanks to the pose augmentations and the latent pose representation. Previous methods that used facial keypoints to drive the transfer would inherit jitter from the keypoint detectors.

A hardware device is proposed that contains a software product performing a method for neural transfer of facial expression and head pose, the device comprising: an identity encoder unit configured to obtain an identity descriptor from an image of person A, wherein the output of the pose encoder unit does not contain information about the identity of person A; a pose encoder unit configured to obtain a descriptor of head pose and facial expression from an image of person B, wherein the output of the pose encoder unit does not contain information about the identity of person B; and a generator unit that receives the outputs of the identity encoder unit and the pose encoder unit and is configured to synthesize an avatar of person A having the head pose and facial expression taken from person B. The pose encoder unit is a convolutional neural network that takes an image of a person as input and outputs a vector that describes the person's head pose and facial expression and does not describe the person's identity. Here the identity of person B means the skin color, face shape, eye color, clothing and jewelry of person B.

A method for synthesizing a photorealistic avatar of a person is proposed, comprising: obtaining, by means of an identity encoder unit, an identity descriptor from an image of person A; obtaining, by means of a pose encoder unit, a descriptor of head pose and facial expression from an image of person B; and synthesizing, by means of a generator unit that takes the outputs of the identity encoder unit and the pose encoder unit as input, an avatar of person A having the head pose and facial expression taken from person B. The pose encoder unit is a convolutional neural network that takes an image of a person as input and outputs a vector that describes the person's head pose and facial expression and does not describe the person's identity. Here identity means skin color, face shape, eye color, clothing and jewelry.
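The data flow of the method can be sketched in a few lines. This is only an illustration under stated assumptions: the three networks are passed in as plain callables, descriptors are lists of floats, and the helper names `mean_vector` and `transfer` are hypothetical. Averaging identity descriptors over several images of person A is also an assumption; the text above only requires an identity descriptor obtained from an image of person A.

```python
from typing import Any, Callable, List, Sequence

Vector = List[float]
Image = Any  # placeholder type; a real system would use an image tensor


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors (assumed aggregation rule)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def transfer(
    identity_encoder: Callable[[Image], Vector],
    pose_encoder: Callable[[Image], Vector],
    generator: Callable[[Vector, Vector], Image],
    images_of_a: Sequence[Image],
    image_of_b: Image,
) -> Image:
    """Synthesize an avatar of person A with the pose and expression of person B."""
    # Identity descriptor comes only from images of person A.
    identity = mean_vector([identity_encoder(img) for img in images_of_a])
    # Pose descriptor comes only from the image of person B.
    pose = pose_encoder(image_of_b)
    # The generator combines both descriptors into the output image.
    return generator(identity, pose)
```

Keeping the two encoders as separate arguments mirrors the separation in the claims: nothing computed from the image of person B reaches the generator except the pose descriptor.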

Description of the drawings

The above and/or other aspects of the invention will become more apparent from the description of exemplary embodiments with reference to the accompanying drawings.

Fig. 1 shows the use of arbitrary people as drivers of facial expression and head pose (top row) to obtain realistic transfers for arbitrary talking heads (such as the Mona Lisa, bottom row).

Fig. 2a shows the proposed training pipeline (the discriminator is omitted for simplicity).

Fig. 2b shows the application of the proposed method.

Fig. 3 shows an evaluation of transfer systems in terms of their ability to represent the pose of the driver and to preserve the source identity (arrows indicate the direction of improvement).

Fig. 4 shows a comparison of transfer from one person to another for several systems on the VoxCeleb2 test set.

Fig. 5 shows transfer by interpolating between two pose vectors along a spherical trajectory in the pose descriptor space.

Fig. 6 shows an additional comparison of transfer from one person to another for several systems on the VoxCeleb2 test set.

Fig. 7 presents a quantitative evaluation of how ablating several important features of the training set affects the proposed system.

Fig. 8 shows a comparison of transfer from one person to another for the proposed best model and its ablated versions.

Fig. 9 shows the effect of pose augmentations on the X2Face+ and FAb-Net+ models. Without augmentations, the identity gap becomes noticeable.
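The interpolation between two pose vectors along a spherical trajectory, mentioned for Fig. 5, can be sketched with the standard spherical linear interpolation (slerp) formula. The text does not say how the trajectory is computed, so this formula, the function name `slerp` and the list-of-floats vector layout are assumptions; both inputs are assumed to be non-zero.

```python
import math
from typing import List, Sequence


def slerp(p0: Sequence[float], p1: Sequence[float], t: float) -> List[float]:
    """Spherical linear interpolation between two non-zero vectors, t in [0, 1]."""
    norm0 = math.sqrt(sum(x * x for x in p0))
    norm1 = math.sqrt(sum(x * x for x in p1))
    # Angle between the two directions; the cosine is clamped for numerical safety.
    cos_omega = sum(a * b for a, b in zip(p0, p1)) / (norm0 * norm1)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        # Nearly parallel or opposite vectors: fall back to linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(p0, p1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(p0, p1)]
```

For two unit vectors the interpolated vectors stay on the unit sphere, whereas plain linear interpolation would shrink the norm in the middle of the path. Each interpolated vector could then be fed to the generator in place of a pose descriptor extracted from a real frame.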

Description of embodiments

A neural system for transferring facial expression and head pose is proposed that is built on a latent pose representation and is able to predict foreground segmentation together with the RGB image. The latent pose representation is learned as part of the whole transfer system, and the learning process is based solely on image reconstruction losses. Despite its simplicity, given a large and sufficiently diverse training dataset, such learning successfully disentangles pose from identity. The resulting system can then reproduce the facial expressions of the driving person and, moreover, perform transfer from one person to another. In addition, the learned descriptors are useful for other pose-related tasks, such as keypoint prediction and pose-based retrieval.

Fig. 1 shows the use of arbitrary people as drivers of facial expression and head pose (top row) to create realistic transfers for arbitrary talking heads (such as the Mona Lisa, bottom row). Although learning is performed without supervision, the method successfully disentangles pose and identity, so that the identity of the person being animated is preserved.

The invention makes it possible to quickly obtain an accurate and flexible representation (descriptor) of a person's head pose and facial expression from a single image of the person's head.

The invention makes it possible to determine head pose (direction/rotation/tilt angles) and facial expression (including gaze direction) from an image of a person. The invention can be applied in the following areas:

telepresence (video communication) systems;

face-based remote control systems (in TVs, smartphones, robots, etc.);

entertainment systems and games (controlling or creating a virtual avatar);

messengers (for example, creating animated stickers/emoticons);

augmented reality (AR) or virtual reality (VR) systems (determining the position of the head or the gaze direction in order to render virtual objects from the correct viewpoint).

The proposed system modifies and extends the transfer model of Zakharov et al. [42]. First, the ability to predict segmentation is added. Second, the system learns to perform transfer based on latent pose vectors rather than keypoints. As in [42], training is performed on the VoxCeleb2 dataset of video sequences [4]. Each sequence contains a talking person and is obtained from a raw sequence by running a face detector, cropping the detected face and resizing it to a fixed size (256x256 in the proposed case). In addition, as in [42], there is a "meta-learning" stage, in which a large model responsible for reproducing all people in the dataset is trained through a sequence of training episodes, and a fine-tuning stage, in which the "meta-model" is fine-tuned on a tuple of images (or a single image) of a specific person.

As shown in FIG. 2a, at each meta-learning step the proposed system samples a set of frames from a video of one person. These frames are processed by two encoders. The larger encoder is applied to several frames of the video, and the smaller encoder is applied to a held-out frame. Held-out data usually means a part of the data that is deliberately not shown to a certain model (i.e. it is held outside of it). Here the held-out frame is simply one more frame of the same person. The term "held-out" emphasizes that this frame is guaranteed never to enter the identity encoder and enters only the pose encoder. The resulting embeddings are passed to the generator network, whose goal is to reconstruct the last (held-out) frame. Having an identity encoder with a higher capacity than the pose encoder is very important for disentangling pose from identity in the latent space of pose embeddings, which is a key component of the present invention. It is equally important to have a tight bottleneck in the pose encoder, which in the proposed case is implemented through a lower-capacity neural network in the pose encoder and a lower dimensionality of the pose embeddings compared with the identity embeddings. This forces the pose encoder to encode only pose information into the pose embeddings and to ignore identity information. Since the capacity of the pose encoder is limited, and since its input does not exactly match the other frames with respect to identity (owing to data augmentation), the system learns to extract all pose-independent information through the identity encoder and uses the smaller encoder to capture only pose-related information, thereby achieving disentanglement of pose and identity.

A method for neural-network transfer of facial expression and head pose is proposed, i.e. an algorithm for synthesizing a photorealistic avatar of person A in which the facial expression and head pose are taken from an image of person B (as shown in FIG. 2a).

FIG. 2b illustrates the inference pipeline of the proposed method, i.e. the algorithm by which the proposed system performs prediction/rendering after it has been trained.

It should be noted that the generator block is trained simultaneously with all the other blocks. During training, the identity encoder block predicts identity embeddings from several images of a person, and the pose encoder block predicts a pose embedding from an additional image of the same person. The generator block takes the averaged identity embeddings and the pose embedding and is trained to output the image that was fed to the pose encoder block, together with the foreground mask for that image. The new elements of this pipeline are: (1) a pose encoder block that is not directly supervised during training, (2) the use of the foreground mask as a target, (3) pose augmentations (identity-preserving distortions of the input to the pose encoder block) as a technique for disentangling pose and identity.

First, the identity embedding is computed as the mean of the identity encoder outputs over all available images of person A. After that, the identity encoder is discarded, and the rest of the system is fine-tuned on the images of person A in exactly the same way as at the meta-learning stage, with the following differences: (1) there is only one person in the training set, (2) the weights of the pose encoder are kept frozen, (3) the identity embedding is converted into a trainable parameter. Finally, an image of any person B is passed through the pose encoder, and the predicted rendering and segmentation are generated as usual (i.e. as during training) from the resulting pose embedding and the fine-tuned identity embedding.

1. A software block called the "identity encoder" is used to obtain an identity descriptor (skin color, face shape, eye color, clothing, jewelry, etc.) from an image of person A (for example, using the system described in the article "Few-Shot Adversarial Learning of Realistic Neural Talking Head Models").

2. A software block called the "pose encoder" is used to obtain a descriptor of head pose and facial expression from an image of person B (dashed line in FIG. 2a).

3. The outputs of these two blocks are fed to the input of a software block called the "generator", which synthesizes the desired avatar.

The "pose encoder" block is a trainable machine learning model (for example, a convolutional neural network). It takes an image of a person as input and outputs a vector that describes the head pose and facial expression but does not describe the person's identity.

The training algorithm for this model is as follows: in the training pipeline of the "Few-Shot Adversarial Learning of Realistic Neural Talking Head Models" system, the input block "RGB & landmarks" is replaced by the "pose encoder" block according to the invention. The whole system is then trained as described in "Few-Shot Adversarial Learning of Realistic Neural Talking Head Models", with the modifications described below. In particular: the generator has no downsampling (it is not an image-to-image architecture), because it starts from a constant learnable tensor rather than from a full-size image; the pose embedding is additionally concatenated to the input of the MLP that predicts the AdaIN parameters; and the identity encoder has a different architecture (ResNeXt-50 32×4d).

Random pose augmentations of the image that drives the facial expression and head pose are proposed; the generator predicts a foreground segmentation mask in an additional output channel, and a Dice loss is applied to match this prediction with the ground truth.

Compared with other existing solutions to this problem, the novelty is as follows:

the output of the "pose encoder" block contains no information about the identity of person B; this output is therefore depersonalized, which is an advantage from the standpoint of information security;

for the same reason, the proposed approach needs no block for adapting the descriptor of head pose and facial expression to the identity of person A (inside or outside the "generator" block), and therefore improves computational efficiency.

The meta-learning stage is described next.

In each meta-learning episode a single video sequence is considered. Then $K+1$ random frames $I_1, \dots, I_{K+1}$ are sampled from this sequence, together with $S_{K+1}$, a foreground segmentation map for $I_{K+1}$ that is precomputed using an off-the-shelf semantic segmentation network.
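The episode construction described above can be illustrated with a minimal Python sketch. It draws K+1 frame indices from one video and splits them into K identity-source frames and one held-out pose-source frame. The helper name `sample_episode` and its arguments are hypothetical and not part of the described system; drawing the frames without repetition is an assumption, and the precomputed segmentation map is assumed to be looked up separately by the held-out index.

```python
import random

def sample_episode(num_frames, k=8, rng=None):
    """Pick K+1 distinct random frame indices from one video sequence.

    Returns (identity_idx, pose_idx): the first K indices are meant for the
    identity encoder, the last (held-out) index only for the pose encoder.
    Hypothetical helper, for illustration only.
    """
    if num_frames < k + 1:
        raise ValueError("video must contain at least K+1 frames")
    rng = rng or random.Random()
    chosen = rng.sample(range(num_frames), k + 1)  # sampling without replacement
    return chosen[:k], chosen[k]

identity_idx, pose_idx = sample_episode(num_frames=120, k=8, rng=random.Random(0))
```

Keeping the held-out index separate from the identity indices mirrors the requirement that the pose-source frame never reaches the identity encoder.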

Then the first $K$ images $I_1, \dots, I_K$ are fed into a relatively high-capacity convolutional network $F$, which is called the identity encoder. It is analogous to the embedder network in [42], except that it does not accept keypoints as input.

For each image $I_i$, the identity encoder outputs a $d_i$-dimensional vector $x_i = F(I_i)$, which is called the identity embedding of $I_i$. The identity embeddings are expected to contain pose-independent information about the person (including lighting, clothing, etc.). Given $K$ frames, a single identity vector $\bar{x}$ is obtained by averaging: $\bar{x} = \frac{1}{K} \sum_{i=1}^{K} x_i$.
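As a small worked example of the averaging step, the following sketch computes the element-wise mean of K identity embeddings represented as plain Python lists. The function name `mean_identity_embedding` is hypothetical; in practice the embeddings would be the outputs of the identity encoder.

```python
def mean_identity_embedding(embeddings):
    """Element-wise mean of K identity embeddings (each a list of d_i floats)."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    k = len(embeddings)
    return [sum(e[j] for e in embeddings) / k for j in range(dim)]

# Two toy 2-dimensional embeddings stand in for encoder outputs.
x_bar = mean_identity_embedding([[1.0, 2.0], [3.0, 6.0]])
```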

The remaining image $I_{K+1}$ (the pose source) first undergoes a random pose augmentation transformation $A$, which is described below. Then $A(I_{K+1})$ is passed through a network of much lower capacity, which is called the pose encoder and is denoted $G$. The pose encoder outputs a $d_p$-dimensional pose embedding $y_{K+1} = G(A(I_{K+1}))$, which is intended to be a person-agnostic pose descriptor.

The transformation $A$ mentioned above is important for disentangling pose and identity. It keeps the person's pose intact but may alter their identity. Specifically, it randomly scales the image independently along the horizontal and vertical axes and randomly applies content-preserving operations such as blur, sharpening, contrast change or JPEG compression. It is called pose augmentation here because it is applied to the pose source, and it can be regarded as a form of data augmentation.
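The geometric part of such an augmentation can be sketched as follows. This is a simplified illustration only: the scale range of 0.8 to 1.2, the nearest-neighbour resampling and the zero fill are assumptions, the function names are hypothetical, and the content-preserving operations (blur, sharpening, contrast change, JPEG compression) are omitted.

```python
import random

def random_axis_scales(rng, low=0.8, high=1.2):
    """Draw independent horizontal and vertical scale factors (assumed range)."""
    return rng.uniform(low, high), rng.uniform(low, high)

def rescale_nearest(image, sx, sy):
    """Stretch a 2D image (list of rows) by sx horizontally and sy vertically
    around its centre, keeping the output size. Nearest-neighbour sampling is
    used, and pixels that fall outside the source are filled with 0."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    out = []
    for y in range(h):
        src_y = round(cy + (y - cy) / sy)
        row = []
        for x in range(w):
            src_x = round(cx + (x - cx) / sx)
            inside = 0 <= src_y < h and 0 <= src_x < w
            row.append(image[src_y][src_x] if inside else 0)
        out.append(row)
    return out

toy = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
sx, sy = random_axis_scales(random.Random(0))
augmented = rescale_nearest(toy, sx, sy)
```

Because the two factors are drawn independently, the face proportions change while the head orientation and expression stay the same, which is the property the text relies on.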

The pose and identity embeddings are passed to the generator network, which tries to reconstruct the image $I_{K+1}$ as accurately as possible. Whereas [42] used rasterized keypoints (stickman images) to pass the pose to its generator networks, the present invention relies entirely on the AdaIN mechanism [16] to pass both the pose and the identity embeddings to the generator. More specifically, the proposed upsampling generator starts from a constant learnable tensor of size 512×4×4 and outputs two tensors: $I'_{K+1}$ of size 3×256×256 and $S'_{K+1}$ of size 1×256×256, which it tries to match with the foreground part of the image $I_{K+1}$ and with its segmentation mask $S_{K+1}$, respectively. This is achieved by simply predicting a 4×256×256 tensor in the final layer. AdaIN blocks are inserted after each convolution. To obtain the AdaIN coefficients, the pose and identity embeddings are concatenated, and the resulting $(d_i + d_p)$-dimensional vector is passed through an MLP with learnable parameters, as in StyleGAN [20].
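The AdaIN conditioning described above can be illustrated with a minimal sketch for a single feature-map channel. The function names are hypothetical, and the `scale` and `shift` values are passed in directly, whereas in the described system they would be predicted by the MLP from the concatenated embeddings.

```python
import math

def adain(channel, scale, shift, eps=1e-5):
    """Adaptive instance normalization of one feature-map channel.

    `channel` is a 2D list of activations. It is normalized to zero mean and
    unit variance over its spatial positions, then scaled and shifted by the
    per-channel coefficients `scale` and `shift`.
    """
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [[scale * (v - mean) * inv_std + shift for v in row] for row in channel]

def concat_embeddings(identity, pose):
    """Concatenate the d_i- and d_p-dimensional embeddings into the MLP input."""
    return list(identity) + list(pose)

styled = adain([[1.0, 2.0], [3.0, 4.0]], scale=2.0, shift=5.0)
mlp_input = concat_embeddings([0.0] * 512, [0.0] * 256)
```

After the call, the channel has a mean equal to `shift` and a standard deviation close to `scale`, so the embeddings control the statistics of every convolution output without being fed in as an image.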

The outputs $I'_{K+1}$ and $S'_{K+1}$ produced by the generator are expected to be as close as possible to $I_{K+1} \odot S_{K+1}$ and $S_{K+1}$, respectively. This is achieved using several loss functions. The segmentation maps are matched using the Dice coefficient loss [27]. The head images with blacked-out background, in turn, are matched using the same combination of losses as in [42]. Specifically, there are content losses based on matching ConvNet activations of a VGG-19 model trained for ImageNet classification and of a VGGFace model trained for face recognition. In addition, $I'_{K+1}$ and $I_{K+1} \odot S_{K+1}$ are passed through a projection discriminator (the difference from [42] here is that, again, no rasterized keypoints are provided to it) to compute the adversarial loss that pushes the images toward realism, the discriminator feature matching loss and the embedding match term.
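For reference, one common soft form of the Dice loss for the segmentation output can be sketched as below. The exact variant used in [27] or in the described system may differ (for example, in how the denominator is computed), and the function name and the `eps` smoothing constant are assumptions.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between a predicted mask and a ground-truth mask.

    Both masks are flat sequences of values in [0, 1]. The loss is
    1 - 2 * sum(p * t) / (sum(p) + sum(t)), with `eps` guarding against
    division by zero when both masks are empty.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)

perfect = dice_loss([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
disjoint = dice_loss([1.0, 0.0], [0.0, 1.0])
```

The loss is near 0 when the predicted mask coincides with the ground truth and near 1 when the two masks do not overlap.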

Transfer and fine-tuning. Once the model has been meta-learned, it can be used to fit new identities unseen during meta-learning. Thus, given one or more images of a new person, the identity vector $\bar{x}$ can be extracted by passing these images through the identity encoder and averaging the results element-wise. Then, by plugging in a pose vector $y$ extracted from an image of the same or of a different person, the transfer can be performed for this person by computing the image $I'(\bar{x}, y)$ and its foreground mask $S'(\bar{x}, y)$.

To further reduce the identity gap, it is proposed to follow [42] and to fine-tune the model (namely, the weights of the MLP, the generator and the discriminator) with the same set of losses as in [42] plus the Dice coefficient loss, treating the provided set of images of the new person and their segmentations as the ground truth. The estimated identity embedding $\bar{x}$ is kept fixed during fine-tuning (including it in the optimization made no difference in the experiments, since the number of parameters in the embedding $\bar{x}$ is much smaller than in the MLP and the generator network). The pose embedding network $G$ is also kept fixed during fine-tuning.

The main finding is that, when applied to a person X, the transfer model trained as described above can successfully reproduce the facial expression of the person in an image $I$ when the pose vector $y = G(I)$ is extracted from an image of the same person X. More surprisingly, the model can also reproduce the facial expression when the pose vector is extracted from an image of a different person Y. In that case the leakage of identity from this other person is kept to a minimum, i.e. the resulting image still looks like an image of person X.

A priori, such disentanglement of pose and identity would not be expected to happen, and some form of adversarial training [12] or cycle consistency [45, 17] would seem to be required to enforce it. It turned out that with (i) a sufficiently low capacity of the pose extractor network $G$, (ii) the use of pose augmentations and (iii) background removal, the disentanglement happens automatically, and experiments with additional loss terms, such as those in [8], brought no further improvement. Apparently, with the three techniques described above, the model prefers to extract all person-specific details from the identity source frames using the higher-capacity identity extractor network.

This disentanglement effect, which came as a "pleasant surprise", is evaluated below, and it is shown to be indeed stronger than in other related methods (i.e. it better supports transfer from one person to another with less identity leakage).

In addition, ablation studies are carried out to determine how the capacity of the pose encoder, the pose augmentations, the segmentation and the dimensionality $d_p$ of the latent pose vector affect the ability of the proposed transfer system to preserve pose and identity.

The training dataset is the collection of YouTube videos from VoxCeleb2 [4]. It contains about 100,000 videos of about 6,000 people. One frame out of every 25 was sampled from each video, leaving about seven million training images in total. In each image the annotated face was re-cropped: first the S3FD detector [43] was run to obtain its bounding box, then the box was made square by enlarging the smaller side, the sides of the box were enlarged by 80% while keeping the center, and finally the cropped image was resized to 256×256.
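The cropping arithmetic described above can be sketched as follows. The function name `face_crop_box` is hypothetical, the box is assumed to be given as corner coordinates, and the squaring step is assumed to keep the box center (the text states this explicitly only for the 80% enlargement). Resizing to 256×256 and handling of crops that extend past the image border are left out.

```python
def face_crop_box(x0, y0, x1, y1, enlarge=0.8):
    """Turn a detector bounding box into the square crop described above.

    The shorter side is grown to match the longer one, then the square is
    enlarged by `enlarge` (80%) about its centre. The result may extend past
    the image borders; handling that (e.g. by padding) is left to the caller.
    """
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    side = max(x1 - x0, y1 - y0) * (1.0 + enlarge)
    half = side / 2.0
    return cx - half, cy - half, cx + half, cy + half

# A 100x200 detection centred at (150, 150) becomes a 360x360 square.
box = face_crop_box(100, 50, 200, 250)
```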

Human segmentation was obtained with the Graphonomy model [11]. As in [42], K = 8 was set, so that the identity vector extracted from eight random frames of a video is used to reconstruct a ninth frame from its pose descriptor.

In the best proposed model the pose encoder has the MobileNetV2 architecture [31] and the identity encoder has the ResNeXt-50 (32×4d) architecture [41]. Neither has been modified, so both include batch normalization [18]. The sizes $d_p$ and $d_i$ of the pose and identity embeddings are 256 and 512, respectively. No normalization or regularization is applied to these embeddings. The module that converts them into AdaIN parameters is a ReLU perceptron with spectral normalization and one hidden layer of 768 neurons.

The proposed generator is based on the generator of [42], but without downsampling blocks, since all inputs are delegated to the AdaIN layers located after each convolution. More precisely, a learnable constant tensor of size 512×4×4 is transformed by two constant-resolution residual blocks followed by six upsampling residual blocks. Starting from the fourth upsampling block, the number of channels is halved at each block, so that the tensor at the final resolution (256×256) has 64 channels. This tensor is passed through an AdaIN layer, a ReLU, a 1×1 convolution and a tanh, becoming a 4-channel image. Unlike [42], no self-attention mechanism is used. Spectral normalization [28] is applied everywhere in the generator, the discriminator and the MLP.
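The channel and resolution figures quoted above can be checked with a short sketch that walks through the residual blocks. The function name and its parameters are hypothetical; it assumes that each upsampling block doubles the spatial resolution and that the halving is applied at the fourth, fifth and sixth upsampling blocks.

```python
def generator_schedule(base_channels=512, base_res=4, const_blocks=2,
                       up_blocks=6, halve_from=4):
    """List (block_kind, channels, resolution) after each residual block."""
    channels, res = base_channels, base_res
    stages = []
    for _ in range(const_blocks):
        stages.append(("const", channels, res))
    for i in range(1, up_blocks + 1):
        res *= 2                      # each upsampling block doubles H and W
        if i >= halve_from:
            channels //= 2            # halving starts at the fourth such block
        stages.append(("up", channels, res))
    return stages

schedule = generator_schedule()
```

Under these assumptions the last entry is 64 channels at 256×256, which agrees with the description: 4 · 2^6 = 256 for the resolution and 512 / 2^3 = 64 for the channels.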

Instead of alternating generator and discriminator updates, a single weight update is performed for all networks after the gradients from all loss terms have been accumulated.

The model was trained for 1,200,000 iterations with a mini-batch of 8 samples spread over two NVIDIA P40 GPUs, which took about two weeks in total.

The quantitative evaluation assesses both the relative performance of the pose descriptors on auxiliary tasks and the quality of transfer from one person to another. Qualitative examples of transfer are shown for the same-person and the cross-person scenarios, as well as interpolation results in the learned pose space. The ablation studies in the supplementary material show the effect of the individual components of the proposed method.

A comparison of the results of the proposed invention with those of known methods and systems is presented below. The following pose descriptors, which rely on different degrees of supervision, are considered.

The proposed invention. 256-dimensional latent pose descriptors learned within the proposed system.

X2Face. 128-dimensional driving vectors learned within the X2Face transfer system [39].

FAb-Net. The 256-dimensional FAb-Net descriptors [38] are evaluated as a pose representation. They are related to the proposed invention in that, although not person-agnostic, they are also learned without supervision from the VoxCeleb2 video collection.

3DMM. A state-of-the-art 3DMM system [3] is considered. This system uses a deep network to extract decomposed rigid pose, facial expression, and shape descriptors. The pose descriptor is obtained by concatenating the rigid pose rotation (represented as a quaternion) and the facial expression parameters (29 coefficients).

The proposed descriptor is learned on the VoxCeleb2 dataset. The X2Face descriptor is trained on the smaller VoxCeleb1 dataset [29], and FAb-Net on both. The 3DMM descriptors are the most heavily supervised, since the 3DMM is learned from 3D scans and requires a landmark detector (which is, in turn, trained with supervision).

In addition, the following head reenactment systems based on these pose descriptors are considered:

X2Face. The X2Face system [39], based on its native descriptors and warping-based reenactment.

X2Face+. In this variant, the proposed pose encoder is replaced by the frozen pretrained X2Face driving network (up to the driving vector), and the rest of the architecture is left unchanged relative to the proposed one. The identity encoder, the generator conditioned on the X2Face latent pose vector and on the identity embedding of the proposed system, and the projection discriminator are trained.

FAb-Net+. Same as X2Face+, but with a frozen FAb-Net in place of the proposed pose encoder.

3DMM+. Same as X2Face+, but with a frozen ExpNet [3] in place of the proposed pose encoder and with pose augmentations disabled. The pose descriptor is built from the ExpNet outputs as described above.

In the proposed setup, these 35-dimensional descriptors are additionally normalized by the element-wise mean and standard deviation computed over the VoxCeleb2 training set.
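The element-wise normalization mentioned above can be illustrated with a minimal NumPy sketch. The function names `fit_normalizer` and `normalize` and the small `eps` guard against zero variance are assumptions made for illustration and are not taken from the described system.

```python
import numpy as np


def fit_normalizer(train_descriptors, eps=1e-8):
    """Estimate the per-element mean and standard deviation on a training set.

    train_descriptors: array of shape (num_samples, dim), e.g. dim = 35.
    eps is a hypothetical guard that prevents division by zero.
    """
    train = np.asarray(train_descriptors, dtype=np.float64)
    mean = train.mean(axis=0)
    std = np.maximum(train.std(axis=0), eps)
    return mean, std


def normalize(descriptors, mean, std):
    """Apply the element-wise normalization fitted on the training set."""
    return (np.asarray(descriptors, dtype=np.float64) - mean) / std
```

The statistics are fitted once on the training descriptors and then reused unchanged for any descriptor passed to the generator.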

FSTH. The original few-shot talking head system [42], driven by rasterized keypoints.

FSTH+. The authors retrain the system of [42] with several changes that make it more comparable to the proposed system and to the other baselines. The raw keypoint coordinates are fed into the generator through the AdaIN mechanism (as in the proposed system). The generator predicts a segmentation together with the image. The authors also use the same crops as in the proposed system, which differ from those of [42].

To assess how well the learned pose descriptors match different people in the same pose, the Multi-PIE dataset [13] is used. It is not used to train any of the descriptors, but it provides six emotion class annotations for people in various poses. The dataset is restricted to near-frontal and half-profile camera orientations (namely 08_0, 13_0, 14_0, 05_1, 05_0, 04_1, 19_0), which leaves 177,280 images. Within each camera orientation group, the authors randomly pick a query image and retrieve the N nearest images from the same group using the cosine similarity of the descriptors. A retrieved image counts as a correct match if it shows a person with the same emotion label as the query.

This procedure is repeated 100 times for each group. Table 1 shows the overall ratio of correct matches within the top-10, top-20, top-50, and top-100 lists. For the 3DMM descriptor, the authors consider only the 29 facial expression coefficients and ignore the rigid pose information as irrelevant to emotions.
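The retrieval protocol described above can be sketched for a single query as follows. This is an illustrative NumPy sketch only: the function name `top_n_accuracy` and the array layout are assumptions, and all descriptors are assumed to have non-zero norm.

```python
import numpy as np


def top_n_accuracy(descriptors, labels, query_idx, n):
    """Share of the n nearest neighbours that carry the query's emotion label.

    descriptors: array (num_images, dim) of pose descriptors from one
                 camera-orientation group (assumed non-zero vectors).
    labels:      array (num_images,) of emotion class labels.
    """
    descriptors = np.asarray(descriptors, dtype=np.float64)
    labels = np.asarray(labels)
    # After L2 normalisation a dot product equals the cosine similarity.
    unit = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf  # the query must not retrieve itself
    nearest = np.argsort(-sims)[:n]
    return float(np.mean(labels[nearest] == labels[query_idx]))
```

Averaging this value over the repeated random queries in every camera group would give a figure comparable in form to the entries of Table 1.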

Table 1. Accuracy of pose (expression)-based retrieval using different pose descriptors on the Multi-PIE dataset. See the text for details.

Accuracy for top-N queries (%)

Descriptor | N=10 | N=20 | N=50 | N=100
FAb-Net | 45.7 | 40.8 | 36.6 | 35.7
3DMM | 47.3 | 45.6 | 41.9 | 41.1
X2Face | 61.0 | 55.7 | 51.8 | 49.4
Ours | 75.7 | 63.8 | 57.8 | 54.1

This comparison shows that the latent space of the proposed pose embeddings is better grouped by emotion class than that of the other facial expression descriptors: the proposed result is much better on the top-10 and top-20 metrics, while on top-50 and top-100 it is close to X2Face and better than the rest. The FAb-Net and X2Face vectors contain identity information, so they are more likely to be close to vectors representing the same or a similar person. As for 3DMM, by construction it requires different latent expression vectors to turn different shapes (people) into the same facial expression; therefore, the expression coefficients can easily coincide for different people showing different facial expressions.

Keypoint regression is not among the proposed target applications, since keypoints contain person-specific information. However, it is a popular task on which unsupervised pose descriptors have been compared in the past, so the proposed method is run on the standard benchmark on the MAFL test set [25]. To predict keypoints, a ReLU MLP with one hidden layer of size 768 is used, and in the proposed case both the pose embedding and the identity embedding are used as input. Using the standard normalized inter-ocular distance, the authors obtain a distance error of 2.63. This is lower than the error of 3.44 obtained by FAb-Net, although it is higher than the 2.54 achieved in [19] for this task.

Quantitative evaluation. The performance of the seven reenactment systems listed above is compared in the cross-person reenactment setting. To this end, 30 people are randomly chosen from the test split of VoxCeleb2, and talking head models $T_1, \dots, T_{30}$ are learned for them. Each model $T_k$ is created from 32 random frames of a video, $I^k_1, \dots, I^k_{32}$. All models except X2Face are fine-tuned on these 32 frames for 600 optimization steps. Using these models, the authors compute two metrics for each system: the identity error $I_T$ and the pose reconstruction error $P_T$.

The identity error $I_T$ estimates how closely the resulting talking heads resemble the original person $k$ for whom the model was learned. For this, the ArcFace face recognition network $R$ [7] is used, which outputs identity descriptors (vectors). The authors compute the averaged reference descriptor $r_k = \frac{1}{32} \sum_{j=1}^{32} R(I^k_j)$ from the fine-tuning dataset $I^k_1, \dots, I^k_{32}$ and use the cosine similarity (csim) to compare it with the descriptors obtained from the cross-person reenactment results. Cross-person reenactment is performed by driving $T_k$ with all the other 29 people. To obtain the final error, (one minus) the similarities are averaged over all 30 people in the test set.

Formally,

$$I_T = \frac{1}{30 \cdot 29 \cdot 32} \sum_{k=1}^{30} \sum_{i \neq k} \sum_{j=1}^{32} \left[ 1 - \mathrm{csim}\left( r_k,\, R\left( T_k(I^i_j) \right) \right) \right],$$

where $I^i_j$ is the $j$-th frame of driver person $i$ and $T_k(I^i_j)$ is the result of driving the model of person $k$ with that frame.
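The identity error described above can be sketched in NumPy under the assumption that the face recognition descriptors have already been extracted. The array layouts and the names `csim` and `identity_error` are hypothetical choices for this illustration, not part of the described system.

```python
import numpy as np


def csim(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def identity_error(reference_descriptors, reenacted_descriptors):
    """Average (1 - csim) between reference and cross-person results.

    reference_descriptors: array (K, F, D), descriptors of the F fine-tuning
        frames of each of the K people (assumed layout).
    reenacted_descriptors: array (K, K, F, D); entry [k, i, j] is the
        descriptor of model k driven by frame j of person i (assumed layout).
        Entries with i == k (self-reenactment) are ignored.
    """
    ref = np.asarray(reference_descriptors, dtype=np.float64)
    out = np.asarray(reenacted_descriptors, dtype=np.float64)
    num_people = ref.shape[0]
    averaged_ref = ref.mean(axis=1)  # one reference descriptor per person
    errors = []
    for k in range(num_people):
        for i in range(num_people):
            if i == k:
                continue  # only cross-person pairs contribute
            for descriptor in out[k, i]:
                errors.append(1.0 - csim(averaged_ref[k], descriptor))
    return float(np.mean(errors))
```

With K = 30 people and F = 32 frames, the mean runs over 30 · 29 · 32 terms, matching the averaging described in the text.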

The pose reconstruction error $P_T$, on the other hand, is designed to quantify how well the system reproduces the driver's pose and facial expression, and it is defined in terms of facial landmarks. Since sets of landmarks can be compared directly only for the same person, the test dataset is restricted to self-reenactment pairs, i.e. $T_k$ is driven only with $I^k$. However, since $T_k$ was learned on $I^k_1, \dots, I^k_{32}$, the authors avoid overfitting by using another 32 hold-out frames from the same video, $\tilde{I}^k_1, \dots, \tilde{I}^k_{32}$. An off-the-shelf 2D facial landmark prediction algorithm $L$ [2] is used to obtain landmarks both in the driver, $L(\tilde{I}^k_j)$, and in the reenactment result, $L(T_k(\tilde{I}^k_j))$.

In the proposed case, the measure $d(l_1, l_2)$ of how closely the landmarks $l_2$ approximate the reference landmarks $l_1$ is the average distance between corresponding landmarks, normalized by the inter-ocular distance. As before, $d$ is computed for all drivers and averaged over all 30 people:

$$P_T = \frac{1}{30 \cdot 32} \sum_{k=1}^{30} \sum_{j=1}^{32} d\left( L(\tilde{I}^k_j),\, L\left( T_k(\tilde{I}^k_j) \right) \right),$$

where $\tilde{I}^k_j$ is the $j$-th hold-out frame of person $k$, $L$ is the landmark predictor, and $T_k$ is the talking head model of person $k$.
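The landmark-based measure can be sketched as follows, assuming the landmarks have already been predicted. The function names and the use of two single landmark indices to define the inter-ocular distance are assumptions for illustration; the actual definition of the eye points depends on the landmark scheme of the predictor used.

```python
import numpy as np


def d_landmarks(reference, predicted, left_eye_idx, right_eye_idx):
    """Mean point-to-point distance, normalised by the inter-ocular distance.

    reference, predicted: arrays (num_points, 2) of 2D landmarks.
    left_eye_idx, right_eye_idx: hypothetical indices of the two eye points
        in the reference landmark set.
    """
    reference = np.asarray(reference, dtype=np.float64)
    predicted = np.asarray(predicted, dtype=np.float64)
    inter_ocular = np.linalg.norm(
        reference[left_eye_idx] - reference[right_eye_idx]
    )
    per_point = np.linalg.norm(reference - predicted, axis=1)
    return float(per_point.mean() / inter_ocular)


def pose_reconstruction_error(driver_landmarks, result_landmarks,
                              left_eye_idx, right_eye_idx):
    """Average d_landmarks over all people and hold-out frames.

    Both inputs have the assumed shape (num_people, num_frames, num_points, 2).
    """
    driver = np.asarray(driver_landmarks, dtype=np.float64)
    result = np.asarray(result_landmarks, dtype=np.float64)
    values = [
        d_landmarks(driver[k, j], result[k, j], left_eye_idx, right_eye_idx)
        for k in range(driver.shape[0])
        for j in range(driver.shape[1])
    ]
    return float(np.mean(values))
```

Normalising by the inter-ocular distance of the reference makes the error independent of the face size in pixels.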

FIG. 3 shows the evaluation of the reenactment systems in terms of their ability to represent the driver's pose and to preserve the original identity (arrows point in the direction of improvement). The horizontal and vertical axes correspond to the identity error $I_T$ and the pose reconstruction error $P_T$, respectively, both of which are computed on a subset of the test split of the VoxCeleb2 dataset. These metrics are described in detail in the "Quantitative evaluation" subsection of the "Detailed description" section. Each point on the plot corresponds to a known head reenactment system that is compared with the proposed system. The plot in FIG. 3 contains these two metrics evaluated for the compared models. An ideal system $T$ would have $I_T = P_T = 0$, i.e. the closer to the bottom left corner, the better. Under these conditions, the proposed full model is clearly better than all systems except FSTH+, which is slightly better on one of the metrics but much worse on the other, and whose advantage comes from the use of an external keypoint detector.

Qualitative comparison. FIG. 4 shows a comparison of cross-person reenactment for several systems on the VoxCeleb2 test set. The top left image is one of the 32 identity source frames, i.e. one of the images defining the target identity, taken from the test split of VoxCeleb2; this is the person to be rendered with a different facial expression and head pose. The remaining images in the top row are also taken from the VoxCeleb2 test split; they are the facial expression and head pose drivers and define the pose to be transferred to the target person. The proposed method better preserves the identity of the target person and successfully transfers the facial expressions of the driver. Each of the remaining rows corresponds to one neural reenactment method for facial expression and head pose: the row labeled "Proposed invention" is the proposed invention, and the other rows correspond to the compared methods shown in FIG. 3. Each row shows the reenactment results obtained by the corresponding method.

FIG. 4 gives a qualitative comparison of the reenactment systems described above. It can be seen that the FSTH system, driven by rasterized landmarks, relies heavily on the facial proportions of the driver and is therefore not person-agnostic. Its modified version, FSTH+, performs better, since it has greater representational power with vectorized keypoints; nevertheless, there is noticeable "identity bleeding" (for example, compare the head width in columns 1 and 2), as well as errors in prominent facial expressions, such as closed eyes. The warping-based X2Face method fails even on small rotations.

Two similar methods, X2Face+ and FAb-Net+, provide strong baselines despite some signs of identity mismatch, for example traces of glasses in column 7 and long hair leaking in from the facial expression and head pose driver in column 5. It is important to note that, although the pose descriptors of these methods are not person-agnostic, pose augmentations are still applied during training. The ablation study described below demonstrates that cross-person reenactment performance drops sharply when pose augmentations are removed from these two methods.

The 3DMM+ method has a very tight bottleneck of interpretable parameters, and its identity mismatch is therefore very small. However, apparently for the same reason, it is not as good at rendering correct subtle facial expressions. The proposed full system is able to represent the driver's facial expression and head pose accurately while preserving the identity of the target person.

In addition, FIG. 5 shows reenactment by interpolation in the pose space for the proposed system, demonstrating smooth pose changes. FIG. 5 shows reenactment by interpolation between two pose vectors along a spherical trajectory in the pose descriptor space. Each row shows reenactment results of the proposed invention for some person from the test split of VoxCeleb2. The first and last images in each row are computed from pose source frames from the VoxCeleb2 test split (not shown in the figure), which the pose encoder maps to embeddings A and B, respectively. The images in between are obtained by running the proposed invention as usual, except that the pose embedding is obtained by spherical interpolation between A and B rather than computed by the pose encoder from an image. Specifically, the pose embeddings in columns 2-5 are slerp(A, B; 0.2), slerp(A, B; 0.4), slerp(A, B; 0.6), slerp(A, B; 0.8).
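Spherical interpolation of the kind denoted slerp(A, B; t) above can be sketched with the standard formula. This is a generic NumPy illustration, not the implementation of the described system; in particular, the fallback to linear interpolation for (anti)parallel vectors is an assumption added here to avoid division by zero.

```python
import numpy as np


def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between vectors a and b, with t in [0, 1]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # The angle between the two vectors is measured on their unit versions.
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < eps:
        # Assumed fallback: plain linear interpolation for (anti)parallel input.
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / sin_omega


# Intermediate pose embeddings for columns 2-5, given embeddings A and B:
# [slerp(A, B, t) for t in (0.2, 0.4, 0.6, 0.8)]
```

When A and B have the same norm, every interpolated embedding keeps that norm, so the trajectory stays on the sphere instead of cutting through its interior as linear interpolation would.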

FIG. 6 shows an additional comparison of cross-person reenactment for several systems on the VoxCeleb2 test set. The top left image is one of the images defining the target identity, taken from the test split of VoxCeleb2, i.e. the person to be rendered with a different facial expression and head pose. The remaining images in this row are also taken from the VoxCeleb2 test split and define the pose to be transferred to the target person.

Each of the remaining rows corresponds to one neural reenactment method for facial expression and head pose: the row labeled "Proposed invention" is the proposed invention, and the other rows correspond to the compared methods shown in FIG. 3. Each row shows the reenactment results obtained by the corresponding method.

Temporal smoothness. The supplementary video demonstrates the ability of the proposed descriptor to produce temporally smooth reenactment without any temporal smoothing of the extracted pose (provided that the bounding box detection results are temporally smooth). By contrast, achieving temporally smooth reenactment with keypoint-driven systems (FSTH, FSTH+) requires heavy smoothing of the keypoints.

Ablation study. The following changes are considered: reducing the dimensionality of the pose vector, increasing the capacity of the pose encoder, keeping the background in the images, and removing pose augmentation.

The proposed best model is retrained with different subsets of these changes. In addition, pose augmentations are removed from X2Face+ and FAb-Net+, and the observed effects are discussed below.

All the resulting models are listed in Table 2. They are compared both quantitatively and qualitatively, as in Section 4.3.

The study examines four ablation dimensions, which correspond to the columns of Table 2.

Table 2. Summary of the systems compared in the ablation study.

Model name | Pose vector dim, dp | Pose encoder | Erased background | Pose augmentation
Ours | 256 | MobileNetV2 | + | +
-PoseDim | 64 | MobileNetV2 | + | +
-Augm | 256 | MobileNetV2 | + | -
-Segm | 256 | MobileNetV2 | - | +
+PoseEnc | 256 | ResNeXt-50 (32×4d) | + | +
+PoseEnc-Augm | 256 | ResNeXt-50 (32×4d) | + | -
+PoseEnc-Segm | 256 | ResNeXt-50 (32×4d) | - | +
X2Face+ | 128 | X2Face (pretrained) | + | +
X2Face+(-Augm) | 128 | X2Face (pretrained) | + | -
FAb-Net+ | 256 | FAb-Net (pretrained) | + | +
FAb-Net+(-Augm) | 256 | FAb-Net (pretrained) | + | -

Pose vector dimensionality dp. First, dp is reduced from 256 to 64 simply by changing the number of channels in the last trainable layer of the pose encoder. The proposed base model (the invention) with pose vectors restricted to 64 dimensions is labeled -PoseDim in FIG. 7. FIG. 7 gives a quantitative evaluation of how ablating several important features of the training setup affects the proposed system. In addition, it illustrates the effect of pose augmentation during training for X2Face+ and FAb-Net+.

Intuitively, a tighter bottleneck of this kind should both limit the ability to represent diverse poses and encourage the generator to take person-specific information from the richer identity embedding. According to the plot, the pose reconstruction error does increase slightly, while the system remains person-agnostic to the same degree. Qualitatively, however, this difference in pose is negligible.

Pose encoder capacity. Second, the pose encoder is replaced by a stronger network, namely ResNeXt-50 (32×4d), which puts it on par with the identity encoder (i.e. it has the same architecture). The proposed best model with this modification is denoted +PoseEnc. As noted above, the pose encoder and the identity encoder are deliberately unbalanced: the pose encoder is weaker, which forces the optimization process to prefer extracting person-specific information from the identity source frames rather than from the driver frame. Both the metrics and the reenactment samples for +PoseEnc suggest that this idea was well founded: a more powerful pose encoder starts to carry person-specific features over from the facial expression and head pose driver. In FIG. 8, the +PoseEnc result is visibly influenced by the clothing of driver #1, the hair of drivers #6-#8, and the face shape of driver #9. This is also confirmed by the very large increase in the identity error in FIG. 7. On the other hand, such a system reconstructs the pose more accurately, which suggests that it may be a better choice for self-reenactment, where "identity bleeding" is less of a concern.

Background erasure. Third, the proposed system is modified so that it does not predict a foreground segmentation and does not use segmentation to compute the loss functions, and thus becomes an unsupervised system. The proposed model with this change is denoted -Segm. The pose encoder in such a system spends its capacity on encoding the background of the driving sample image rather than on estimating subtle details of the facial expression. This happens because perceptual loss functions are far more sensitive to discrepancies between the generated and target backgrounds than to differences in facial expression. More importantly, since the background often changes within a video, reconstructing the background of the target image just by looking at the identity source images is too difficult. The optimization algorithm is therefore tempted to shift the identity encoder's work onto the pose encoder. This is evident from the plot and the samples, where introducing the background contributes substantially to the identity gap, and it is even more noticeable when combined with a stronger pose encoder (model +PoseEnc-Segm).
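The sensitivity to background described above can be illustrated with a toy reconstruction loss. The sketch below is only an illustration, not the loss used by the described system: a plain per-pixel L1 error stands in for the perceptual losses, and the function name `mean_abs_error`, the 2×4 "images", and the binary foreground mask are all hypothetical.

```python
def mean_abs_error(generated, target, mask=None):
    """Weighted mean absolute error between two 2D images (lists of rows).

    If `mask` is given, each pixel's error is weighted by the mask value,
    so background pixels (mask == 0) do not contribute to the loss.
    """
    weighted_error = 0.0
    total_weight = 0.0
    for y, target_row in enumerate(target):
        for x, target_pixel in enumerate(target_row):
            weight = 1.0 if mask is None else float(mask[y][x])
            weighted_error += weight * abs(generated[y][x] - target_pixel)
            total_weight += weight
    return weighted_error / total_weight if total_weight > 0 else 0.0


# Hypothetical toy data: the left half is foreground (face), the right half is background.
target = [[0.5, 0.5, 0.1, 0.1],
          [0.5, 0.5, 0.1, 0.1]]
generated = [[0.5, 0.4, 0.9, 0.9],   # face almost right, background badly wrong
             [0.5, 0.5, 0.9, 0.9]]
foreground_mask = [[1, 1, 0, 0],
                   [1, 1, 0, 0]]

unmasked_loss = mean_abs_error(generated, target)
masked_loss = mean_abs_error(generated, target, foreground_mask)
```

In this toy setting the unmasked loss is dominated by the background mismatch, while the masked loss reflects only the small error on the face. This mirrors the pressure that pushes the -Segm variant to spend pose encoder capacity on the background.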

Pose augmentation. The model is retrained without random pose augmentations, i.e., A is set to the identity transformation. In this setup the system is trained to reconstruct the driving sample image exactly, and is therefore more likely to degrade into an autoencoder (provided the pose encoder is trained jointly with the rest of the system).
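Switching the augmentation A off can be sketched as follows. This is an illustrative sketch only: the text states that A becomes the identity transformation in this ablation, but it does not define A in this section, so the brightness-gain jitter, the function names, and the `seed` parameter below are hypothetical stand-ins.

```python
import random


def identity_transform(image):
    """Return an unchanged copy of the image (A set to the identity)."""
    return [row[:] for row in image]


def make_pose_augmentation(enabled, seed=0):
    """Build the transform applied to the driving image before pose encoding.

    With `enabled=False` the identity transform is returned, which corresponds
    to the -Augm ablation. The enabled branch uses a random brightness gain
    purely as a placeholder for the real augmentation.
    """
    if not enabled:
        return identity_transform

    rng = random.Random(seed)

    def augment(image):
        gain = rng.uniform(0.7, 1.3)  # hypothetical distortion strength
        return [[min(1.0, pixel * gain) for pixel in row] for row in image]

    return augment


driving_image = [[0.2, 0.4], [0.4, 0.2]]
without_augmentation = make_pose_augmentation(enabled=False)(driving_image)
with_augmentation = make_pose_augmentation(enabled=True, seed=1)(driving_image)
```

With augmentation disabled, the pose encoder sees exactly the image the generator must reproduce, which is what makes the autoencoder shortcut available.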

As is clearly seen in FIG. 7 (Proposed invention → -Augm, +PoseEnc → +PoseEnc-Augm), although this further improves the ability to represent poses, it also substantially harms identity preservation. In fact, the system with the powerful ResNeXt-50 pose encoder trained without pose augmentations (+PoseEnc-Augm) turned out to be the worst of the proposed models in terms of PT, but at the same time the best model in terms of pose transfer quality. Again, such a model can be very useful for self-transfer but entirely unsuitable for "puppeteering" (cross-person transfer). Even for self-transfer, however, care is needed, since this model can produce undesirable effects such as transferring image quality (e.g., from sample #8 in FIG. 8). FIG. 8 compares cross-person transfer for the proposed best model and its ablated versions. The top-left image in FIG. 8 is one of the images that define the target identity, taken from the test split of VoxCeleb2, i.e., the person the authors would like to render with a different facial expression and head pose. The remaining images in that row are also taken from the VoxCeleb2 test split and define the pose to be transferred to the target person.

This effect is confirmed once more by removing pose augmentations from X2Face+ and FAb-Net+ (the suffix -Augm is appended to each model name). With random augmentations enabled, the generator still develops robustness to person-specific features of the driving samples, despite the person-specific nature of the X2Face and FAb-Net pose descriptors. Without augmentations, however, the degree of identity leakage is fully explained by how identity-specific these off-the-shelf descriptors are. In addition, the pose reconstruction error should decrease, because the generator no longer has to be as robust to the driving samples, so part of its capacity is freed up and can be used to render more accurate poses. As expected, FIG. 7 shows a significant increase in identity error and a sharp decrease in pose error for these two models. This proves once again that the X2Face and FAb-Net descriptors are not person-agnostic. The identity gap can also be seen directly in FIG. 9, which illustrates the effect of pose augmentations on the X2Face+ and FAb-Net+ models: without augmentations, the identity gap becomes noticeable. The top-left image in FIG. 9 is one of the images that define the target identity, taken from the test split of VoxCeleb2, i.e., the person the authors would like to render with a different facial expression and head pose. The remaining images in that row are also taken from the VoxCeleb2 test split and define the pose to be transferred to the target person.

Each of the remaining rows corresponds to one of the compared neural methods for transferring facial expression and head pose. These methods are listed in the bottom four rows of Table 2, namely X2Face+ and FAb-Net+, each evaluated both with and without pose augmentations. The transfer results of the respective method are shown; for example, the glasses from sample #7 or the face shape from sample #1 are carried over to the result.

In conclusion, there is a trade-off between the identity preservation error and the pose reconstruction error. This trade-off is tuned by applying the changes described above, depending on which scenario matters more: self-transfer or cross-person transfer. In the latter case it is better to use the proposed best model "PROPOSED", whereas for the former +PoseEnc or +PoseEnc-Segm may be a good candidate.
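The recommendation above can be written down as a small lookup. This is a sketch for illustration: the scenario keys `self_transfer` and `cross_person` and the function name `recommend_models` are hypothetical, while the model names are the ones used in the text.

```python
# Model variants suggested in the text for each usage scenario.
RECOMMENDED_MODELS = {
    # Identity preservation matters most: keep the weak pose encoder.
    "cross_person": ["PROPOSED"],
    # Pose accuracy matters most and some identity leakage is acceptable.
    "self_transfer": ["+PoseEnc", "+PoseEnc-Segm"],
}


def recommend_models(scenario):
    """Return the candidate model variants for the given scenario."""
    if scenario not in RECOMMENDED_MODELS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    return list(RECOMMENDED_MODELS[scenario])


cross_person_choice = recommend_models("cross_person")
self_transfer_choice = recommend_models("self_transfer")
```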

Neural transfer of facial expressions and head poses that uses latent pose descriptors and achieves realistic transfer has been described and analyzed above. Unlike the earlier system [42], which used keypoints as the pose descriptor, the proposed system uses pose descriptors learned without explicit supervision, purely from reconstruction losses. The only weak form of supervision comes from the segmentation masks. The proposed learned head pose descriptors outperform previous unsupervised descriptors in pose-based retrieval as well as in cross-person transfer.

The main and perhaps surprising finding is that the limited capacity of the pose extraction network in the proposed scheme is sufficient to disentangle pose from identity. At the same time, appropriate use of cycle and/or adversarial losses may improve the disentanglement even further.

Possibly because of the limited network capacity, the proposed pose descriptors and transfer system have difficulty capturing some subtle facial expressions, especially gaze direction (although they still do better than keypoint descriptors, which have no representation of gaze at all). Another obvious direction for research is training the pose descriptor and the whole system in a semi-supervised manner.

At least one of the plurality of blocks may be implemented through an AI model. A function associated with AI may be performed through a non-volatile memory, a volatile memory, and a processor.

The processor may include one or more processors. The one or more processors may be a general-purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processor, such as a graphics processing unit (GPU) or a vision processing unit (VPU), and/or a dedicated AI processor, such as a neural processing unit (NPU).

The one or more processors control the processing of input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or AI model is provided through training or learning.

Here, being provided through learning means that a predefined operating rule or AI model with the desired characteristics is created by applying a learning algorithm to a plurality of training data. The learning may be performed in the device itself in which the AI according to an embodiment is run, and/or may be implemented through a separate server/system.

The AI model may consist of multiple neural network layers. Each layer has a plurality of weight values and performs its layer operation by computing on the result of the previous layer and its weights. Examples of neural networks include, but are not limited to, a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
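The layer-by-layer computation described above can be sketched as a tiny fully connected network. This is a generic illustration under stated assumptions, not the architecture of the described system: the bias terms, the ReLU activation, and all weight values are hypothetical.

```python
def dense_layer(inputs, weights, biases):
    """One layer: combine the previous layer's outputs with this layer's weights.

    Each output is ReLU(dot(weight_row, inputs) + bias); ReLU and the bias
    are assumptions made for this sketch.
    """
    outputs = []
    for weight_row, bias in zip(weights, biases):
        pre_activation = sum(w * x for w, x in zip(weight_row, inputs)) + bias
        outputs.append(max(0.0, pre_activation))
    return outputs


def forward(inputs, layers):
    """Feed the input through the layers in order, each using the previous result."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


# Hypothetical two-layer network: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [-0.5]),
]
network_output = forward([2.0, 1.0], layers)
```

Here the hidden layer produces two values from the input, and the output layer combines them with its own weights, which is the "result of the previous layer and the weights" pattern named in the text.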

A learning algorithm is a method of training a given target device (for example, a robot) on a plurality of training data so as to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

According to this disclosure, in an electronic device, a method for recognizing facial expression and head pose can obtain output data recognizing an image, or the facial expression and head pose in the image, by using image data as input to an artificial intelligence model. The artificial intelligence model can be obtained through training. Here, "obtained through training" means that a predefined operating rule or artificial intelligence model configured to perform a desired function (or purpose) is obtained by training a base artificial intelligence model on multiple pieces of training data with a training algorithm. The artificial intelligence model may include a plurality of neural network layers. Each of the neural network layers has a plurality of weight values and performs a neural network computation between the result of the previous layer's computation and the plurality of weight values.

Visual understanding is a technique for recognizing and processing objects in a manner similar to human vision, and it includes, for example, object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, and image enhancement.

The described method performed by the electronic device may be carried out using an artificial intelligence model.

The exemplary embodiments of the invention described above are merely examples and should not be construed as limiting. The description of the exemplary embodiments is intended to be illustrative and not to limit the scope of the claims, and many alternatives, modifications, and variations will be apparent to those skilled in the art.

List of sources

[1] V. Blanz, T. Vetter, et al. A morphable model for the synthesis of 3d faces. In Proc. SIGGRAPH, volume 99, pages 187-194, 1999.

[2] A. Bulat and G. Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proc. ICCV, pages 1021-1030, 2017.

[3] F.-J. Chang, A. T. Tran, T. Hassner, I. Masi, R. Nevatia, and G. Medioni. Expnet: Landmark-free, deep, 3d facial expressions. In Proc. FG, pages 122-129. IEEE, 2018.

[4] J. S. Chung, A. Nagrani, and A. Zisserman. Voxceleb2: Deep speaker recognition. In INTERSPEECH, 2018.

[5] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. In Proc. CVPR, pages 3444-3453. IEEE, 2017.

[6] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. T-PAMI, (6):681-685, 2001.

[7] J. Deng, J. Guo, X. Niannan, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In CVPR,

[8] E. L. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In Proc. NeurIPS, pages 4414-4423, 2017.

[9] P. Ekman. Facial action coding system. 1977.

[10] C. Fu, Y. Hu, X. Wu, G. Wang, Q. Zhang, and R. He. High fidelity face manipulation with extreme pose and expression. arXiv preprint arXiv:1903.12003, 2019.

[11] K. Gong, Y. Gao, X. Liang, X. Shen, M. Wang, and L. Lin. Graphonomy: Universal human parsing via graph transfer learning. In CVPR, 2019.

[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc. NIPS, 2014.

[13] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-pie. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition. IEEE Computer Society, September 2008.

[14] R. A. Güler, N. Neverova, and I. Kokkinos. DensePose: Dense human pose estimation in the wild. In Proc. CVPR, June 2018.

[15] R. A. Güler, G. Trigeorgis, E. Antonakos, P. Snape, S. Zafeiriou, and I. Kokkinos. DenseReg: Fully convolutional dense shape regression in-the-wild. In CVPR, volume 2, page 5, 2017.

[16] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proc. ICCV, 2017.

[17] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In Proc. ECCV, 2018.

[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. ICML, ICML'15, pages 448-456, 2015.

[19] Unsupervised learning of object landmarks through conditional image generation. In Proc. NeurIPS, pages 4016-4027, 2018.

[20] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proc. CVPR, June 2019.

[21] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In Proc. CVPR, pages 1867-1874, 2014.

[22] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Nießner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt. Deep video portraits. In Proc. SIGGRAPH, 2018.

[23] H. Kim and A. Mnih. Disentangling by factorising. In Proc. ICML, pages 2654-2663, 2018.

[24] M. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz. Few-shot unsupervised image-to-image translation. In Proc. ICCV, 2019.

[25] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proc. ICCV, December 2015.

[26] S. Lombardi, J. Saragih, T. Simon, and Y. Sheikh. Deep appearance models for face rendering. ACM Transactions on Graphics (TOG), 37(4):68, 2018.

[27] F. Milletari, N. Navab, and S.-A. Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. Pages 565-571, October 2016.

[28] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.

[29] A. Nagrani, J. S. Chung, and A. Zisserman. Voxceleb: a large-scale speaker identification dataset. In INTERSPEECH, 2017.

[30] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer. Ganimation: Anatomically-aware facial animation from a single image. In Proc. ECCV, pages 818-833, 2018.

[31] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proc. CVPR, June 2018.

[32] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe. First order motion model for image animation. In Proc. NeurIPS, pages 7135-7145, 2019.

[33] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4):95, 2017.

[34] J. Thewlis, S. Albanie, H. Bilen, and A. Vedaldi. Unsupervised learning of landmarks by descriptor vector exchange. In Proc. ICCV, 2019.

[1] S. Tripathy, J. Kannala, and E. Rahtu. Icface: Interpretable and controllable face reenactment using gans. CoRR, abs/1904.01909, 2019.[1] S. Tripathy, J. Kannala, and E. Rahtu. Icface: Interpretable and controllable face reenactment using gans. CoRR, abs / 1904.01909, 2019.

[2] T.Wang, M. Liu, A. Tao, G. Liu, J. Kautz, and B. Сatanzazо, Few-shоt videо-tо-videо synthesis.СоRR, abs/1910.12713, 2019.[2] T. Wang, M. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzazo, Few-shót video-to-video synthesis. CoRR, abs / 1910.12713, 2019.

[3] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. In Proc. NeurIPS, 2018.

[4] O. Wiles, A. Koepke, and A. Zisserman. Self-supervised learning of a facial attribute embedding from video. In Proc. BMVC, 2018.

[5] O. Wiles, A. Sophia Koepke, and A. Zisserman. X2face: A network for controlling face generation using images, audio, and pose codes. In Proc. ECCV, September 2018.

[6] F. Xiao, H. Liu, and Y. J. Lee. Identity from here, pose from there: Self-supervised disentanglement and generation of objects using unlabeled videos. In Proc. ICCV, October 2019.

[7] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proc. CVPR, July 2017.

[8] E. Zakharov, A. Shysheya, E. Burkov, and V. Lempitsky. Few-shot adversarial learning of realistic neural talking head models. In Proc. ICCV, October 2019.

[9] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S3fd: Single shot scale-invariant face detector. In Proc. ICCV, October 2017.

[10] Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, and H. Lee. Unsupervised discovery of object landmarks as structural representations. In Proc. CVPR, pages 2694-2703, 2018.

[11] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. ICCV, 2017.

Claims

1. A hardware device for transferring an arbitrary facial expression and head pose to a person's avatar while preserving the identity of the avatar, comprising:

a detector for capturing images of the facial expression and head pose of a person B;

a means for selecting images of a person A;

an identity encoder unit configured to obtain an identity descriptor from the images of person A, wherein the output of the pose encoder unit does not contain information about the identity of person A;

a pose encoder unit configured to receive images of person B as input and to obtain a descriptor of head pose and facial expression from an image of person B, wherein the output of the pose encoder unit does not contain information about the identity of person B;

a generator block receiving the outputs of the identity encoder unit and the pose encoder unit, the generator block being configured to synthesize an avatar image of person A having the head pose and facial expression taken from person B;

wherein the generator block is further configured to output the resulting avatar image of person A having the head pose and facial expression taken from person B.

2. The hardware device of claim 1, wherein the pose encoder unit is a convolutional neural network that takes an image of a person as input and outputs a vector describing that person's head pose and facial expression but not their identity, and which is applicable to images of arbitrary people, including people whose images were not included in the training set.

3. The hardware device of claim 1, wherein the identity of person B means the skin color, face shape, eye color, clothing and jewelry of person B.

4. A method of operating the hardware device of claim 1 for transferring an arbitrary facial expression and head pose to a person's avatar while preserving the identity of the avatar, the method comprising:

capturing, by means of the detector, images of the facial expression and head pose of person B;

selecting images of person A;

obtaining, using the identity encoder unit, an identity descriptor from the images of person A;

obtaining, using the pose encoder unit, a descriptor of head pose and facial expression from the images of person B;

synthesizing, by means of the generator block that receives the outputs of the identity encoder unit and the pose encoder unit, an avatar image of person A with the head pose and facial expression taken from person B; and

outputting the resulting image of the avatar of person A, having a head pose and facial expression taken from person B.

5. The method of claim 4, wherein the pose encoder unit is a convolutional neural network that takes an image of a person as input and outputs a vector describing that person's head pose and facial expression but not their identity, and is applicable to images of arbitrary people, including people whose images were not included in the training set.

6. The method of claim 4, wherein identity means skin color, face shape, eye color, clothing, and jewelry.
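
For illustration only, the data flow recited in claims 1 and 4 (identity descriptor from images of person A, pose descriptor from an image of person B, generator consuming both) can be sketched in a few lines of Python. This is a toy sketch under stated assumptions, not the claimed implementation: the fixed random linear maps stand in for trained convolutional networks, averaging the per-image identity embeddings is an assumption, and every name and dimension (`make_linear_map`, `identity_descriptor`, `generate`, `reenact`, the descriptor sizes) is hypothetical.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
Image = List[float]  # flattened grayscale pixels; a stand-in for a real image


def make_linear_map(in_dim: int, out_dim: int, seed: int) -> Callable[[Vector], Vector]:
    """Return a fixed random linear map (hypothetical placeholder for a trained network)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x: Vector) -> Vector:
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return apply


def identity_descriptor(encoder: Callable[[Image], Vector], images_a: Sequence[Image]) -> Vector:
    """Encode each image of person A and average the embeddings (averaging is an assumption)."""
    embeddings = [encoder(img) for img in images_a]
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]


def generate(identity: Vector, pose: Vector, out_dim: int, seed: int = 2) -> Image:
    """Placeholder generator: maps the concatenated descriptors to an output image."""
    decoder = make_linear_map(len(identity) + len(pose), out_dim, seed)
    return decoder(identity + pose)


def reenact(images_a: Sequence[Image], frame_b: Image,
            identity_encoder: Callable[[Image], Vector],
            pose_encoder: Callable[[Image], Vector], out_dim: int) -> Image:
    """Follow the claimed order: identity from A, pose from B, then synthesis."""
    identity = identity_descriptor(identity_encoder, images_a)
    pose = pose_encoder(frame_b)
    return generate(identity, pose, out_dim)


if __name__ == "__main__":
    IMG_DIM = 16  # hypothetical image size
    id_enc = make_linear_map(IMG_DIM, 8, seed=0)    # hypothetical identity descriptor size
    pose_enc = make_linear_map(IMG_DIM, 4, seed=1)  # hypothetical pose descriptor size
    images_a = [[0.1 * i] * IMG_DIM for i in range(1, 4)]
    frame_b = [0.5] * IMG_DIM
    avatar = reenact(images_a, frame_b, id_enc, pose_enc, IMG_DIM)
    print(len(avatar))
```

The sketch keeps the two descriptors separate until the generator so that the only path from person B to the output is the pose descriptor; whether that descriptor is actually free of identity information depends on how the pose encoder is trained, which this toy code does not model.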