EA039495B1

EA039495B1 - Method and system for creating facial expressions based on text

Info

Publication number: EA039495B1
Application number: EA202090169A
Authority: EA
Inventors: Альберт Рувимович ЕФИМОВ; Алексей Сергеевич ГОННОЧЕНКО; Михаил Александрович ВЛАДИМИРОВ
Original assignee: Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк)
Priority date: 2019-12-27
Filing date: 2020-01-28
Publication date: 2022-02-03
Also published as: RU2723454C1; WO2021133201A1; EA202090169A1

Abstract

This technical solution relates generally to the field of image data processing, and in particular to a method and system for creating facial expressions based on text. The technical result is the ability to create a video stream with an animated image of a 3D head model, with a dynamic texture of the facial mask placed on it, based on voice signal data. This result is achieved by a method for processing a voice signal to generate a video stream, performed by at least one computer and comprising the following steps: obtaining data of at least one voice signal; dividing the sections of the voice signal that contain voice information into time windows; generating a frequency spectrum image for each time window to obtain a sequence of frequency spectrum images; determining, on the basis of the sequence of frequency spectrum images, a sequence of data on a plurality of landmarks that form a facial mask; applying the facial mask to a 3D head model to generate a sequence of frames containing an image of the 3D head model with the facial mask applied to it; generating, on the basis of the sequence of frequency spectrum images, a sequence of frames of the dynamic texture of the facial mask; generating a sequence of frames containing an image of the resulting 3D head model with the dynamic texture of the facial mask applied to it, on the basis of the sequence of frames containing an image of the 3D head model with the facial mask applied and the frames of the dynamic texture of the facial mask; generating a sequence of frames with an image of the resulting 3D head model against a background scene; and combining the sequence of frames into a video stream.

Description

This technical solution relates generally to the field of image data processing, and in particular to a method and system for creating facial expressions based on text. The solution can be used when generating at least one animated model, a digital avatar, from speech and motion data.

Prior art

Various solutions for creating digital avatars are known from the prior art. For example, US 2012130717 (A1), published on 24 May 2012, discloses a method for providing real-time animation for a personalized animated avatar. The known solution involves training one or more animated models to provide a set of probabilistic motions for one or more upper body parts of the avatar, based at least in part on speech and motion data; associating one or more predetermined emotional-state phrases with the one or more animated models; receiving real-time speech input; identifying an emotional state to be expressed, based at least in part on one or more predetermined phrases that at least partly match the real-time speech input; and generating an animated motion sequence of the one or more upper body parts of the avatar by applying the one or more animated models in response to the real-time speech input, the animated motion sequence expressing the identified emotional state.

Also known is a remote customer service system disclosed in US 2017308904 (A1), published on 26 October 2017. The known system includes an interactive display unit at the customer's location that provides two-way audiovisual communication with a remote service/sales agent, with the communication conducted through a virtual digital presenter shown on the display. A virtual Digital Actor system is also integrated into the known system. The Digital Actor is used in e-learning solutions, in film production, and as a presenter on television, on the Internet, or in other broadcast applications.

Also known is the Xinhua AI anchor, described in the article "Watch: China's Xinhua unveils the world's first AI news anchor", published on 9 November 2018 on the Internet at: https://www.hindustantimes.com/tech/china-s-xinhua-unveils-the-world-s-first-ai-news-anchor/storvACUifqUzGSI8IEURbrlmBJ.html.

The solutions described above have a number of drawbacks, such as a static head pose caused by the limitations of their version of the technology (two-dimensional synthesis of facial expressions), which makes the presenter's facial expressions look stiff and less natural.

3D synthesis, in turn, allows the technology to be developed toward full-3D rendering, which significantly expands its range of applications (virtual studios, a free camera, and so on).

Summary of the technical solution

The technical problem addressed by this technical solution is to create a simple and reliable method and system for creating facial expressions based on text.

The technical result achieved in solving the above technical problem is the ability to create a video stream with an animated image of a 3D head model with a dynamic face mask texture placed on it, based on speech signal data. The speech signal data can be arbitrarily generated by speech synthesis tools (text-to-speech systems) that produce a human voice matching the speaker's voice in its parameters.

This technical result is achieved by a method of processing a speech signal to form a video stream, performed by at least one computing device and comprising the steps of:

obtaining data of at least one speech signal;

dividing the sections of the speech signal that contain voice information into time windows;

generating a frequency spectrum image for each time window to obtain a sequence of frequency spectrum images;

determining, based on the sequence of frequency spectrum images, a sequence of data on a set of coordinates of points forming a face mask;

placing the face mask on a 3D head model to form a sequence of frames containing an image of the 3D head model with the face mask placed on it;

generating, based on the sequence of frequency spectrum images, a sequence of frames of a dynamic texture of the face mask;

forming a sequence of frames containing an image of the resulting 3D head model with the dynamic face mask texture placed on it, based on the sequence of frames containing the image of the 3D head model with the face mask placed on it and on the frames of the dynamic texture of the face mask;

forming a sequence of frames with the image of the resulting 3D head model against a background scene;

combining the sequence of frames obtained in the previous step into a video stream.

In one particular embodiment of the method, the following steps are additionally performed: upon receiving the speech signal data, noise and voice are distinguished; the sections of the speech signal containing voice information are selected; and phonetic markup data are taken into account when distinguishing noise from voice.

In another particular embodiment of the method, a step of geometric validation is additionally performed on the data on the set of point coordinates generated for a time window and contained in the sequence of data on the set of coordinates of points forming the face mask. In another particular embodiment of the method, said geometric validation step comprises: determining, based on the phonetic markup data of the voice, the control points of the face mask whose coordinates are to be validated; determining the distance between the control points of the face mask, information about which is contained in the data on the set of point coordinates generated for the time window, and comparing it with the specified distance value for these control points according to the phonetic markup data; wherein, if the determined distance between the control points does not exceed the specified distance value for these control points, the data on the set of point coordinates pass validation.
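The validation rule described above can be illustrated with a minimal Python sketch. The landmark indices, the sound keys and the distance limits in `CONTROL_LIMITS` are hypothetical values chosen for illustration; the source does not specify them, nor the units of the coordinates.

```python
import math

# Hypothetical table (an assumption, not from the source): for each sound in
# the phonetic markup, the pair of control landmark indices to check and the
# maximum allowed distance between them.
CONTROL_LIMITS = {
    "m": ((0, 1), 0.05),  # lips almost closed for a bilabial sound (assumed)
    "a": ((0, 1), 0.60),  # mouth may open wide for an open vowel (assumed)
}


def validate_landmarks(landmarks, sound, limits=CONTROL_LIMITS):
    """Return True if the control-point distance does not exceed the limit.

    landmarks: list of (x, y) tuples produced for one time window.
    sound: the sound active in that window according to the markup.
    """
    (i, j), max_distance = limits[sound]
    distance = math.dist(landmarks[i], landmarks[j])
    return distance <= max_distance
```

For example, `validate_landmarks([(0.0, 0.0), (0.0, 0.3)], "m")` rejects a window in which the lips are far apart while the markup says a closed-lip sound is being pronounced.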

In another particular embodiment of the method, the face mask is placed on the 3D head model by attaching it at known boundary vertices of the 3D head model and the face mask points corresponding to those vertices.

In another particular embodiment of the method, a step is additionally performed in which textures of the teeth and tongue are placed on the images of the 3D head model according to the position of the face mask.

In another particular embodiment of the method, a step of morphing the images of the resulting 3D head model is additionally performed.

In another particular embodiment of the method, a step of color correction and alignment of the dynamic textures is additionally performed to eliminate pulsating changes in the color of the face mask, by color-blending said images of the 3D head model contained in the preceding and subsequent frames.
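The source does not say how the color blending across frames is computed. One possible reading, shown here purely as an assumption, is a temporal blend in which each pixel is mixed with the average of the same pixel in the preceding and following frames. The function name `smooth_colors` and the `weight` parameter are hypothetical.

```python
def smooth_colors(frames, weight=0.5):
    """Blend every pixel with the mean of its temporal neighbours.

    frames: list of frames, each a list of (r, g, b) tuples of equal length.
    weight: share of the neighbour average in the result (assumed value).
    The first and last frames reuse themselves as the missing neighbour.
    """
    smoothed = []
    last = len(frames) - 1
    for t, frame in enumerate(frames):
        previous_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, last)]
        blended = []
        for prev_px, cur_px, next_px in zip(previous_frame, frame, next_frame):
            blended.append(tuple(
                (1 - weight) * cur + weight * (prev + nxt) / 2
                for prev, cur, nxt in zip(prev_px, cur_px, next_px)
            ))
        smoothed.append(blended)
    return smoothed
```

A frame that is suddenly brighter than its neighbours is pulled toward them, which damps frame-to-frame flicker at the cost of slightly softening genuine color changes.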

In another preferred embodiment of the claimed solution, a speech signal processing device is provided, comprising at least one computing device and at least one memory device containing machine-readable instructions which, when executed by the at least one computing device, carry out the above method.

Brief description of the drawings

The features and advantages of the present invention will become apparent from the following detailed description and the accompanying drawings, in which:

Fig. 1 shows the general architecture of the technology;

Fig. 2 shows an example of a system for creating facial expressions based on text;

Fig. 3 shows an example of phonetic markup of the phrase "Добрый день" ("Good afternoon");

Fig. 4 shows an example of a general view of a computing device.

Detailed description of the invention

The concepts and terms needed to understand this technical solution are described below.

In this technical solution, a system means, among other things, a computer system, a computer, a CNC (computer numerical control) system, a PLC (programmable logic controller), computerized control systems, and any other devices capable of performing a given, clearly defined sequence of operations (actions, instructions).

An instruction processing device means an electronic unit or an integrated circuit (microprocessor) that executes machine instructions (programs).

The instruction processing device reads and executes machine instructions (programs) from one or more data storage devices. Data storage devices may include, but are not limited to, hard disk drives (HDD), flash memory, ROM (read-only memory), solid-state drives (SSD), and optical drives.

Program: a sequence of instructions intended for execution by the control unit of a computer or by an instruction processing device.

Video frame: a frame of video, an image frame, a television frame.

Rig: a term in computer animation that describes a set of dependencies between controlling and controlled elements, designed so that there are fewer controlling elements than controlled ones.

The Digital Avatar hardware and software complex is based on two technology complexes (see Fig. 1):

speech synthesis from arbitrary text [1], in which text information (TXT) is fed to the input of a neural network (DNN) to be processed using text-to-speech (TTS) technology, yielding a speech signal (VOICE) generated by the neural network;

image synthesis from the generated speech [2], in which the speech signal (VOICE) is fed to the input of a neural network (DNN) to form the facial expressions (Mimic) and the image of the digital avatar (VIDEO).

This technology complex also uses several technologies: known technologies for post-processing and compositing (combining images within a frame), and a new technology for generating the facial expressions of a digital avatar from voice.

In accordance with the diagram shown in Fig. 2, the system 100 for creating facial expressions based on text, which performs the image synthesis from generated speech described above, comprises a memory device 10 and a speech signal processing device 20.

The memory device 10 may be random-access or non-volatile memory intended for storing speech signals synthesized from text by the method described above, for example as audio tracks. The speech signal may be generated using solutions offered by ЦРТ (Speech Technology Center, STC), АБК and Google. Additionally, phonetic markup data of the voice, generated during synthesis from the text, may be stored together with the speech signal. The phonetic markup of the voice may be produced by methods known from the prior art, for example by means of solutions offered by ЦРТ. The phonetic markup data contain the start and end times of the pronunciation of each sound within the speech signal (the audio track). An example of phonetic markup data is shown in Fig. 3, which depicts the speech signal of the phrase "Добрый день" ("Good afternoon") and, in the lower part of the drawing, the phonetic markup data.
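As a rough illustration of what such phonetic markup might look like in code, the sketch below stores one record per sound with its start and end time. The class name `SoundInterval`, the helper functions and all timings are hypothetical; the values are invented for illustration and are not read from Fig. 3.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SoundInterval:
    """One entry of phonetic markup: a sound and when it is pronounced."""
    sound: str
    start_s: float  # start time within the audio track, in seconds
    end_s: float    # end time within the audio track, in seconds


# Invented timings for the first sounds of a phrase (illustration only).
EXAMPLE_MARKUP: List[SoundInterval] = [
    SoundInterval("d", 0.00, 0.06),
    SoundInterval("o", 0.06, 0.18),
    SoundInterval("b", 0.18, 0.25),
]


def sound_at(markup: List[SoundInterval], t: float) -> Optional[str]:
    """Return the sound pronounced at time t, or None if no sound is marked."""
    for interval in markup:
        if interval.start_s <= t < interval.end_s:
            return interval.sound
    return None


def voiced_intervals(markup: List[SoundInterval]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs that a voice/noise detector could use as hints."""
    return [(interval.start_s, interval.end_s) for interval in markup]
```

Keeping the times in seconds rather than sample indices makes the markup independent of the sampling rate of the stored track.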

The speech signal processing device 20 may be implemented on the basis of at least one computing device and may comprise a signal/noise separation module 21, an articulation synthesis module 22, a face mask positioning module 23, a dynamic texture frame generation module 24, and a frame processing module 25. The signal/noise separation module 21 and the frame processing module 25 may be implemented on a computing device equipped with specialized software for performing the functions assigned to them in this application. The articulation synthesis module 22, the face mask positioning module 23, and the dynamic texture frame generation module 24 may be implemented on the basis of convolutional neural networks trained in advance on a data sample selected by the developer of these modules.

In accordance with its programmed algorithm, the speech signal processing device 20 accesses the memory device 10 to retrieve data of at least one speech signal, which may be stored, for example, at a sampling rate of 22 kHz, mono, 16 bit, and which are passed to the signal/noise separation module 21. Additionally, phonetic markup data of the voice for this speech signal may be retrieved from said device 10 together with the speech signal data.
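The source does not name a storage format. Assuming, for illustration only, that the track is a WAV file with mono 16-bit PCM samples, it could be loaded with the Python standard library as sketched below. The function names are hypothetical, and the 22050 Hz rate in the demo helper is an assumed concrete value for the "22 kHz" mentioned above.

```python
import array
import io
import struct
import sys
import wave


def read_mono16(source):
    """Read a mono 16-bit PCM WAV; return (sample_rate, list of int samples).

    source: a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono, 16-bit PCM audio")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":  # WAV sample data is little-endian
        samples.byteswap()
    return rate, samples.tolist()


def demo_wav_bytes(rate=22050):
    """Build a tiny in-memory WAV file with four samples (for the demo only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<4h", 0, 1000, -1000, 32767))
    return buffer.getvalue()
```

Calling `read_mono16(io.BytesIO(demo_wav_bytes()))` returns the sampling rate together with the four demo samples, which is the form the later windowing step would consume.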

Upon receiving the speech signal data, said module 21 distinguishes noise from voice on the basis of the amplitude-phase frequency characteristic (APFC) of the sound wave (the speech audio stream), selects the sections containing voice information, and marks the remaining fragments of the speech signal, which contain no voice information, with corresponding time codes labeled as silence. If phonetic markup data of the voice are received, in particular the start and end times of the pronunciation of each sound, said module 21 also takes these data into account when distinguishing noise from voice, which improves the quality of the separation. The speech signal data with the selected sections containing voice information are then passed to the articulation synthesis module 22, which divides those sections into time windows (for example, 200 ms each), with one time window corresponding to one video frame. Additionally, the phonetic markup data of the voice may be sent to said module 22.
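The windowing step can be sketched as follows, assuming the voiced sections are already known as (start, end) pairs in seconds, for example from the phonetic markup. The function name and the decision to drop a trailing partial window are assumptions; the source does not say how an incomplete window at the end of a section is handled.

```python
def split_into_windows(samples, rate, voiced, window_s=0.2):
    """Cut voiced sections into fixed-length windows, one per video frame.

    samples: sequence of audio samples.
    rate: sampling rate in Hz.
    voiced: list of (start_s, end_s) sections that contain voice.
    window_s: window length in seconds (200 ms in the example above).
    Returns a list of (window_start_s, window_samples) tuples.
    """
    window_len = int(round(window_s * rate))
    windows = []
    for start_s, end_s in voiced:
        start = int(round(start_s * rate))
        end = min(int(round(end_s * rate)), len(samples))
        # Assumption: a trailing fragment shorter than one window is dropped.
        for pos in range(start, end - window_len + 1, window_len):
            windows.append((pos / rate, samples[pos:pos + window_len]))
    return windows
```

With a 200 ms window, one second of continuous voice yields five windows and therefore five video frames.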

After the sections of the speech signal containing voice information have been divided into time windows, the articulation synthesis module 22 converts the APFC of the sound wave for each time window containing voice information into a frequency spectrum image, for example a two-dimensional image of a plot of the relative energy of the sound vibrations versus frequency. This conversion may be performed by means of the Fourier transform, a family of mathematical methods based on decomposing an original continuous function of time into a set of basis harmonic functions (sinusoids) of various frequencies, amplitudes and phases. As the definition shows, the main idea of the transform is that any function can be represented as an infinite sum of sinusoids, each characterized by its own amplitude, frequency and initial phase. In this way, a sequence of frequency spectrum images is formed in said module 22. Next, the articulation synthesis module 22 determines, on the basis of the sequence of frequency spectrum images, a sequence of data on the set of coordinates of points forming the face mask. To this end, each frequency spectrum image is fed to the input of the convolutional neural networks with which said module 22 may be equipped. There the image is converted into an image matrix containing the pixel values of the image, after which the image matrix is multiplied element-wise by a convolution matrix (kernel), where the convolution matrix is a matrix of coefficients set by the developer for each layer of the convolutional neural network depending on the required preprocessing of the input frequency spectrum image. Methods for selecting the coefficients of the convolution matrix are widely known from the prior art and are not disclosed further in this application (see, for example, the article "Матричные фильтры обработки изображений" ("Matrix filters for image processing"), published on 25 April 2012 on the Internet at: https://habr.com/ru/post/142818/).
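The two operations described above, computing the relative energy per frequency for one time window and multiplying an image matrix element-wise by a kernel, can be sketched in plain Python. This is an illustration under assumptions: the naive discrete Fourier transform stands in for whatever transform implementation is actually used, the kernel slides over the image without being flipped (as is usual in convolutional networks), and both function names are hypothetical.

```python
import cmath
import math


def energy_spectrum(window):
    """Relative energy per frequency bin (0..N/2) of one time window.

    Uses a naive O(N^2) discrete Fourier transform for clarity; for real
    200 ms windows an FFT library would be the practical choice.
    """
    n = len(window)
    energies = []
    for k in range(n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(window)
        )
        energies.append(abs(coefficient) ** 2)
    total = sum(energies) or 1.0  # avoid division by zero for silence
    return [e / total for e in energies]


def convolve2d_valid(image, kernel):
    """Slide the kernel over the image; multiply element-wise and sum.

    image, kernel: lists of rows of numbers. Only positions where the
    kernel fits entirely inside the image are computed.
    """
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw)
            ))
        output.append(row)
    return output
```

For a window holding a pure sinusoid with four cycles over 32 samples, `energy_spectrum` concentrates essentially all of the relative energy in bin 4, which is the kind of peak a spectrum image would show for a single dominant frequency.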

Next, based on the results of multiplying the image matrix by the convolution matrix at each layer of the convolutional neural network, an output frequency spectrum image is formed and fed to the input of a specialized neural network trained in advance on various frequency spectrum images and the corresponding sets of coordinates of points (landmarks) forming the face mask. At the output of this network, said module 22 obtains data on the set of coordinates of points forming the face mask for each time window. Thus, a sequence of data on the sets of coordinates of points forming the face mask is formed in module 22. Methods for determining the set of coordinates of points forming a face mask from a face image are widely known from the prior art (see, for example, the article "Face recognition. Creating and trying on masks", published on the Internet on December 12, 2017 at https://habr.com/ru/company/epam_systems/blog/343514/) and are not disclosed in more detail in this application. The face images from which the sets of point coordinates are determined can be obtained by processing video data of a face, for example that of a speaker, divided into time windows corresponding to the time windows of the speech signal.

Additionally, the articulation synthesis module 22 can be configured to geometrically validate the data on the set of point coordinates generated for a time window and contained in the sequence of data on the sets of coordinates of points forming the face mask. To do this, the articulation synthesis module 22 determines, on the basis of the phonetic markup data of the voice, the control coordinates of the face mask points that must be validated. For example, the phonetic markup data may indicate the presence of the sound "b", "m" or "p" in the time window; on the basis of this information, the articulation synthesis module 22 can determine control points of the face mask characterizing the location of the centers of the surfaces of the upper and lower lips, the corners of the lips, and so on. Said module 22 then determines the distance between the control points of the face mask, information about which is contained in the data on the set of point coordinates generated for the time window, and compares it with the distance value specified for these control points according to the phonetic markup data for that time window. For example, for phonetic markup data indicating the presence of the sound "b", "m" or "p" in the time window, the distance between the control points characterizing the location of the centers of the surfaces of the upper and lower lips may equal a minimum distance close to 0 (zero), so that closure of the lips is visualized on the 3D head model. If the determined distance between the control points of the face mask is greater than the specified distance between these control points, i.e. greater than the minimum distance, the data on the set of point coordinates fail validation, and the articulation synthesis module 22 returns to the previous stage: the frequency spectrum image in said sequence of images from which the failed data were determined is again fed, in the manner described above, to the input of the convolutional neural network until data on the set of point coordinates that pass validation are obtained.

If the determined distance between the control points of the face mask does not exceed the specified distance between these control points, the data on the set of point coordinates pass validation. The information about the control points for the phonetic markup data and the distance values between the control points can be specified in advance by the developer of the articulation synthesis module 22.
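The geometric validation described above can be sketched, purely as an example, in Python. The rule table `CLOSURE_RULES`, the landmark names, the threshold of 0.02 (in normalized image units) and the `max_attempts` limit are all hypothetical values introduced for illustration; the application leaves the control points and distance values to the developer and does not state a retry limit.

```python
import math

# Hypothetical rules: sound -> (pair of control points, maximum allowed distance).
CLOSURE_RULES = {
    "b": (("upper_lip_center", "lower_lip_center"), 0.02),
    "m": (("upper_lip_center", "lower_lip_center"), 0.02),
    "p": (("upper_lip_center", "lower_lip_center"), 0.02),
}


def passes_validation(landmarks, phoneme, rules=CLOSURE_RULES):
    """Return True if the control-point distance does not exceed the specified value."""
    rule = rules.get(phoneme)
    if rule is None:
        # Assumption: sounds without a rule need no geometric check.
        return True
    (name_a, name_b), max_distance = rule
    distance = math.dist(landmarks[name_a], landmarks[name_b])
    return distance <= max_distance


def landmarks_with_retry(predict, spectrum_image, phoneme, max_attempts=3):
    """Feed the spectrum image to the network again until the result passes validation.

    `predict` stands in for the neural network; `max_attempts` is an assumed
    safeguard against an endless loop.
    """
    for _ in range(max_attempts):
        landmarks = predict(spectrum_image)
        if passes_validation(landmarks, phoneme):
            return landmarks
    return None


closed_lips = {"upper_lip_center": (0.50, 0.60), "lower_lip_center": (0.50, 0.61)}
open_lips = {"upper_lip_center": (0.50, 0.60), "lower_lip_center": (0.50, 0.70)}
```

With these sample values, `passes_validation(closed_lips, "b")` holds because the lip centers are 0.01 apart, while `open_lips` fails the same check for the sound "b".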

Next, the sequence of validated data on the sets of point coordinates is sent to the face mask positioning module 23, which retrieves a 3D head model, for example a model of the speaker's head, from a memory with which it can additionally be equipped, places the face mask on the 3D head model, and then forms a frame with an image of the 3D model with the face mask placed on it. The face mask can be positioned on the 3D head model by attaching it at known boundary vertices of the 3D head model and the face mask points corresponding to these vertices, as specified by the developer of said module 23. Thus, a sequence of frames containing an image of the 3D head model with the face mask placed on it is formed at the output of said module 23.
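One possible way to attach the face mask at corresponding points is sketched below in Python. The application states only that known boundary vertices and corresponding mask points are used; the translation-only alignment, the function name `attach_mask` and the sample coordinates are assumptions made for the example.

```python
def attach_mask(mask_points, head_vertices, correspondences):
    """Translate the mask so its boundary points match the head's boundary vertices.

    `correspondences` is a list of (mask point index, head vertex index) pairs.
    Only a translation is estimated here (assumption); a real system may also
    need rotation and scaling.
    """
    n = len(correspondences)
    offset = [0.0, 0.0, 0.0]
    for mask_idx, vertex_idx in correspondences:
        for axis in range(3):
            delta = head_vertices[vertex_idx][axis] - mask_points[mask_idx][axis]
            offset[axis] += delta / n  # mean offset over all pairs
    return [
        tuple(point[axis] + offset[axis] for axis in range(3))
        for point in mask_points
    ]


# Hypothetical data: three mask points, two of which are boundary points.
mask_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 0.5, 0.25)]
head_vertices = [(10.0, 5.0, 1.0), (11.0, 5.0, 1.0)]
correspondences = [(0, 0), (1, 1)]

placed_mask = attach_mask(mask_points, head_vertices, correspondences)
```

Here both boundary points require the same shift of (10, 5, 1), so after placement they coincide with their head vertices and the interior mask point moves with them.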

In an alternative implementation of the claimed solution, the boundary vertices of the 3D head model and the face mask points corresponding to these vertices can be determined by methods known from the prior art, for example using the dlib library, by successively finding a face in the frame, finding fixed points on it and obtaining their coordinates.

Additionally, textures of the teeth and tongue can be stored in the memory of said module 23 for placement on the image of the 3D head model according to the position of the face mask. The position of the teeth and tongue textures can be set by the developer of module 23 on the basis of human anatomy. The teeth and tongue are bound during the design of the 3D model. Such elements have rigs with rigid operating logic, which is standard practice in creating the skeleton of a 3D model.

The sequence of frequency spectrum images is also sent to the dynamic texture frame generation module 24, which, for each time window, forms a frame of the dynamic texture of the face mask from the frequency spectrum image. These frames are then combined into a sequence of frames of the dynamic texture of the face mask, thereby forming a discrete image of the face mask texture. The dynamic texture of the face mask may contain information about areas of the face that should be displayed with reddening, pulsation, changes in facial texture, and so on. To form the frames of the dynamic texture, said module 24 can be equipped with a convolutional neural network trained in advance on various frequency spectrum images and the frames of the dynamic texture of the face mask corresponding to those images. The face images from which the dynamic texture frames are formed can likewise be obtained by processing video data of a face, for example that of a speaker, divided into time windows corresponding to the time windows of the speech signal. Accordingly, the frequency spectrum images are fed to the input of said neural network, and frames of the dynamic texture of the face mask are formed at its output.

The frame sequences formed by the face mask positioning module 23 and the dynamic texture frame generation module 24 are then sent to the frame processing module 25, which extracts the dynamic texture information from a dynamic texture frame, for example the first frame in the sequence of dynamic texture frames, and places the dynamic texture of the face mask on a frame, in particular the first frame in the sequence of frames containing an image of the 3D head model with the face mask placed on it. In the same way, the dynamic texture is placed on all frames in the sequence of frames containing an image of the 3D head model with the face mask placed on it, matching the numbers of these frames to the numbers of the frames containing the dynamic texture information. Thus, a sequence of frames containing an image of the resulting 3D head model with the dynamic texture of the face mask placed on it is formed in the frame processing module 25.
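The frame-by-frame matching described above can be illustrated with the following Python sketch. The pixel representation (RGB tuples for the head frames, RGB plus an opacity value for the texture frames), the alpha blending and the name `apply_dynamic_texture` are assumptions for the example; the application does not specify how the texture is composited.

```python
def blend_pixel(head_pixel, texture_pixel):
    """Blend one texture pixel (r, g, b, alpha) over one head pixel (r, g, b)."""
    *texture_rgb, alpha = texture_pixel
    return tuple(
        round(h * (1.0 - alpha) + t * alpha)
        for h, t in zip(head_pixel, texture_rgb)
    )


def apply_dynamic_texture(head_frames, texture_frames):
    """Place texture frame i on head frame i for every frame number i."""
    if len(head_frames) != len(texture_frames):
        raise ValueError("both sequences must contain the same number of frames")
    result = []
    for head, texture in zip(head_frames, texture_frames):
        result.append([
            [blend_pixel(h, t) for h, t in zip(head_row, texture_row)]
            for head_row, texture_row in zip(head, texture)
        ])
    return result


# Two hypothetical 1x2-pixel frames per sequence.
head_frames = [
    [[(200, 150, 120), (200, 150, 120)]],
    [[(200, 150, 120), (200, 150, 120)]],
]
texture_frames = [
    [[(100, 50, 20, 0.5), (0, 0, 0, 0.0)]],   # frame 1: half-opaque tint on the left pixel
    [[(0, 0, 0, 0.0), (100, 50, 20, 1.0)]],   # frame 2: opaque tint on the right pixel
]

textured_frames = apply_dynamic_texture(head_frames, texture_frames)
```

Pairing the sequences with `zip` keeps the frame numbers aligned, and the length check guards against a texture sequence that does not cover every head frame.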

Additionally, to give the image a seamless appearance, the frame processing module 25 can be equipped with matrix image processing filters widely used in the prior art. Using these filters, said module 25 can analyze, in the generated frames containing the image of the resulting 3D head model, the transition boundaries between the face mask and the 3D head model, fill gaps between the images of the 3D head model and the dynamic texture of the face mask, and blur the color boundaries between textures.

Also, the frame processing module 25 can additionally be equipped with an appropriate 3D model for forming the surface characteristics (relief) of the 3D head model, which can be combined with parameters such as material properties (base tone, texture).

After that, the frame processing module 25 forms a sequence of frames with an image of the resulting 3D head model against the background of a scene, with the texture of the teeth and tongue, on the basis of the sequence of frames containing the image of the resulting 3D head model. To do this, said module 25 retrieves an image of a scene prepared in advance, for example an image of a room, adds to the scene image the image of the resulting 3D head model contained in the first frame of the above sequence of frames, together with the teeth and tongue textures for that image of the 3D head model according to the position of the face mask, and then forms a frame with an image of the resulting 3D head model against the background of the scene, i.e. with the dynamic texture of the face mask and the texture of the teeth and tongue. In the same way, the frame processing module 25 forms frames for the second and subsequent frames containing the image of the resulting 3D head model, so that a sequence of frames with an image of the resulting 3D head model against the background of the scene, with the texture of the teeth and tongue, is formed in the frame processing module 25.

Next, the frame processing module 25 combines the resulting sequence of frames into a video stream. At this stage, said module 25 can be configured to morph the images of the resulting 3D head model with the dynamic texture of the face mask placed on it, in order to ensure smooth transitions between said images of the 3D head model and to eliminate abrupt changes in the shape of the face mask, by geometrically merging said images of the 3D head model contained in preceding and subsequent frames. It can also be configured to perform color correction and matching of dynamic textures, in order to eliminate pulsating changes in the color of the face mask, by merging the colors of said images of the 3D head model contained in preceding and subsequent frames. Image morphing, color correction and dynamic texture matching technologies are widely known from the prior art and are not disclosed in more detail in this application.
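As a simple example of merging values from preceding and subsequent frames, the Python sketch below applies a weighted average over neighbouring frames. It is not the morphing technique of the application, which is left to the prior art; the weights (0.25, 0.5, 0.25), the function name `smooth_sequence` and the flat list representation of a frame are assumptions. The same routine could be applied to landmark coordinates (geometry) or to colour values.

```python
def smooth_sequence(frames, weights=(0.25, 0.5, 0.25)):
    """Blend each frame with the frame before and after it.

    Each frame is a flat list of numbers (coordinates or colour channels).
    The first and last frames reuse themselves as the missing neighbour.
    """
    w_prev, w_cur, w_next = weights
    last = len(frames) - 1
    smoothed = []
    for i, current in enumerate(frames):
        previous = frames[max(i - 1, 0)]
        following = frames[min(i + 1, last)]
        smoothed.append([
            w_prev * p + w_cur * c + w_next * n
            for p, c, n in zip(previous, current, following)
        ])
    return smoothed


# Hypothetical lip-gap value that jumps abruptly in the middle frame.
lip_gap = [[0.0], [1.0], [0.0]]
smoothed_gap = smooth_sequence(lip_gap)
```

For this input the abrupt jump from 0 to 1 and back becomes 0.25, 0.5, 0.25, which illustrates how blending neighbouring frames softens sudden changes in shape or colour.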

The resulting video stream is combined with the speech signal data by the speech signal processing device 20 using methods known from the prior art, after which the video stream is stored in the memory device 10 and can be output to information output devices upon a corresponding request. If necessary, the resulting video stream can be compressed to reduce the amount of data it occupies.

In general (see Fig. 4), the computing device comprises, connected by a common data bus, one or more processors (201), memory such as RAM (202) and ROM (203), input/output interfaces (204), input/output devices (205) and a networking device (206).

The processor (201) (or several processors, a multi-core processor, etc.) can be selected from a range of devices widely used at present, for example from manufacturers such as Intel™, AMD™, Apple™, Samsung Exynos™, MediaTEK™, Qualcomm Snapdragon™ and the like. The processor, or one of the processors used in the device (200), should also be understood to include a graphics processor, for example an NVIDIA GPU or Graphcore, which is likewise suitable for full or partial execution of the method and can be used for training and applying machine learning models in various information systems.

The RAM (202) is random access memory intended for storing machine-readable instructions executable by the processor (201) to perform the necessary logical data processing operations. The RAM (202) typically contains executable instructions of the operating system and the corresponding software components (applications, program modules, etc.). The available memory of a graphics card or graphics processor can also serve as the RAM (202).

The ROM (203) is one or more persistent data storage devices, for example a hard disk drive (HDD), a solid-state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.

Various types of I/O interfaces (204) are used to organize the operation of the components of the device (200) and of externally connected devices. The choice of interfaces depends on the specific implementation of the computing device and may include, without limitation, PCI, AGP, PS/2, IrDA, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, Type-C), TRS/audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, DisplayPort, RJ45, RS232, etc.

Various I/O means (205) are used to enable user interaction with the device (200), for example a keyboard, display (monitor), touch display, touchpad, joystick, mouse, light pen, stylus, touch panel, trackball, speakers, microphone, augmented reality devices, optical sensors, tablet, indicator lights, projector, camera, biometric identification means (retina scanner, fingerprint scanner, voice recognition module), etc. The networking means (206) provides data transmission over an internal or external computer network, for example an intranet, the Internet, a LAN, and the like. The one or more means (206) may be, without limitation, an Ethernet card, GSM modem, GPRS modem, LTE modem, 5G modem, satellite communication module, NFC module, Bluetooth and/or BLE module, Wi-Fi module, etc.

Additionally, satellite navigation means, for example GPS, GLONASS, BeiDou or Galileo, can be used as part of the device (200). The specific choice of elements of the device (200) for implementing various hardware and software architectures may vary, provided the required functionality is retained.

Modifications and improvements to the above-described embodiments of the present technical solution will be apparent to those skilled in the art. The foregoing description is provided by way of example only and is not limiting. Thus, the scope of the present technical solution is limited only by the scope of the appended claims.

CLAIMS

1. A method for processing a speech signal to form a video stream, performed by at least one computing device and comprising the steps of:

receiving data of at least one speech signal;

dividing sections of the speech signal containing voice information into time windows;

forming a frequency spectrum image for each time window to obtain a sequence of frequency spectrum images;

determining, on the basis of the sequence of frequency spectrum images, a sequence of data on the set of coordinates of points forming a face mask;

placing the face mask on a 3D head model to form a sequence of frames containing an image of the 3D head model with the face mask placed on it;

forming, on the basis of the sequence of frequency spectrum images, a sequence of frames of a dynamic texture of the face mask;

forming a sequence of frames containing an image of the resulting 3D head model with the dynamic texture of the face mask placed on it, on the basis of the sequence of frames containing an image of the 3D head model with the face mask placed on it and the frames of the dynamic texture of the face mask;

forming a sequence of frames with an image of the resulting 3D head model against the background of a scene;

combining the sequence of frames obtained at the previous step into a video stream.

2. The method according to claim 1, characterized in that additional steps are performed in which, upon receipt of the speech signal data:

a noise/voice determination is made;

sections of the speech signal containing voice information are extracted;

wherein phonetic markup data are taken into account in the noise/voice determination.

3. The method according to claim 1, characterized in that a step is additionally performed of geometrically validating the data on the set of point coordinates generated for a time window and contained in the sequence of data on the sets of coordinates of points forming the face mask.

4. The method according to claim 3, characterized in that said step of geometrically validating the data on the set of point coordinates comprises the steps of:

determining, on the basis of the phonetic markup data of the voice, control coordinates of the face mask points that must be validated;

determining the distance between the control points of the face mask, information about which is contained in the data on the set of point coordinates generated for the time window, and comparing it with the distance value specified for these control points according to the phonetic markup data of the voice;

wherein, if the determined distance between the control points of the face mask does not exceed the specified distance between these control points, the data on the set of point coordinates pass validation.

5. The method according to claim 1, characterized in that the face mask is placed on the 3D head model by attaching the face mask at known boundary vertices of the 3D head model and the face mask points corresponding to these vertices.

6. The method according to claim 1, characterized in that a step is additionally performed in which textures of the teeth and tongue are placed on the images of the 3D head model according to the position of the face mask.

7. The method according to claim 1, characterized in that a step of morphing the images of the resulting 3D head model is additionally performed.

8. The method according to claim 1, characterized in that a step of color correction and dynamic texture matching is additionally performed to eliminate pulsating changes in the color of the face mask by merging the colors of said images of the 3D head model contained in preceding and subsequent frames.

9. A speech signal processing device comprising at least one computing device and at least one memory device containing machine-readable instructions which, when executed by the at least one computing device, perform the operations of the method according to any one of claims 1-8.