RU2723454C1

RU2723454C1 - Method and system for creating facial expression based on text

Info

Publication number: RU2723454C1
Application number: RU2019144357A
Authority: RU
Inventors: Альберт Рувимович Ефимов; Алексей Сергеевич Гонноченко; Михаил Александрович Владимиров
Original assignee: Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк)
Priority date: 2019-12-27
Filing date: 2019-12-27
Publication date: 2020-06-11
Also published as: WO2021133201A1; EA039495B1; EA202090169A1

Abstract

FIELD: data processing. SUBSTANCE: the invention relates to image data processing. Data of at least one speech signal are obtained. Sections of the speech signal containing voice information are divided into time windows. A frequency spectrum image is formed for each time window to obtain a sequence of frequency spectrum images. Based on this sequence, a sequence of data on the set of coordinates of the points forming a face mask is determined. The face mask is placed on a 3D model of a head to form a sequence of frames containing an image of the 3D head model with the face mask placed on it. A sequence of frames of the dynamic texture of the face mask is generated from the sequence of frequency spectrum images. A sequence of frames containing an image of the resulting 3D head model with the dynamic texture of the face mask is formed. EFFECT: the technical result is the ability to generate a video stream with an animated image of a 3D head model with a dynamic face mask texture based on speech signal data. 9 cl, 4 dwg

Description

FIELD OF TECHNOLOGY

[001] This technical solution relates generally to the field of image data processing, and in particular to a method and system for creating facial expressions based on text. The solution can be used to generate at least one animated model, a digital avatar, based on speech and movement data.

BACKGROUND

[002] Various solutions aimed at creating digital avatars are known in the art. For example, a method of providing real-time animation for a personalized animated avatar is disclosed in application US 2012130717 (A1), publ. 2012-05-24. The known solution comprises: training one or more animated models to provide a set of probabilistic motions for one or more upper body parts of the avatar, based at least in part on speech and motion data; associating one or more predetermined phrases of emotional states with the one or more animated models; receiving real-time speech input; identifying an emotional state to be expressed, based at least in part on one or more predetermined phrases matching at least part of the real-time speech input; and generating an animated sequence of motions of the one or more upper body parts of the avatar by applying the one or more animated models in response to the real-time speech input, wherein the animated sequence of motions expresses the identified emotional state.

[003] Also known is a remote customer service system disclosed in application US 2017308904 (A1), publ. 2017-10-26. The known system includes an interactive display unit at the customer's location that provides two-way audiovisual communication with a remote service/sales agent, the communication being carried out through a virtual digital presenter shown on the display. The virtual “Digital Actor” system is also integrated into the known system. “Digital Actor” is used in an e-learning solution, in film production, and as a presenter on television, on the Internet, or in other broadcast applications.

[004] Also known is the Xinhua AI news anchor, described in the article “Watch: China's Xinhua unveils the world's first AI news anchor”, published on November 9, 2018 on the Internet at: https://www.hindustantimes.com/tech/china-s-xinhua-unveils-the-world-s-first-ai-news-anchor/storv-ACUjfqUzGSI8IEURbrlmBJ.html.

[005] The solutions described above have a number of drawbacks, such as a static head pose caused by the limitations of their version of the technology (two-dimensional synthesis of facial expressions), which makes the presenter's facial expressions look stiff and less natural. 3D synthesis, in turn, allows the technology to be developed toward full-3D rendering, which significantly expands its scope of application (virtual studios, a “free camera”, etc.).

SUMMARY OF THE TECHNICAL SOLUTION

[006] The technical problem addressed by this technical solution is the creation of a simple and reliable method and system for creating facial expressions based on text.

[007] The technical result achieved in solving the above technical problem is the ability to create a video stream with an animated image of a 3D head model with a dynamic face mask texture placed on it, based on speech signal data. The speech signal data can be arbitrarily generated by speech synthesis tools (text-to-speech systems) that imitate a human voice whose parameters match the voice of the announcer.

[008] This technical result is achieved by a method of processing a speech signal to generate a video stream, performed by at least one computing device and comprising the steps of:

- obtaining data of at least one speech signal;

- dividing the sections of the speech signal that contain voice information into time windows;

- forming a frequency spectrum image for each time window to obtain a sequence of frequency spectrum images;

- determining, based on the sequence of frequency spectrum images, a sequence of data on the set of coordinates of the points forming a face mask;

- placing the face mask on a 3D model of a head to form a sequence of frames containing an image of the 3D head model with the face mask placed on it;

- forming, based on the sequence of frequency spectrum images, a sequence of frames of the dynamic texture of the face mask;

- forming a sequence of frames containing an image of the resulting 3D head model with the dynamic texture of the face mask placed on it, based on the sequence of frames containing the image of the 3D head model with the face mask placed on it and on the frames of the dynamic texture of the face mask;

- forming a sequence of frames with an image of the resulting 3D head model against the background of a scene;

- combining the sequence of frames obtained in the previous step into a video stream.

[009] In one particular embodiment of the method, the following steps are additionally performed: when the speech signal data are received, voice is distinguished from noise; sections of the speech signal containing voice information are extracted; and phonetic markup data are taken into account when distinguishing voice from noise.

[010] In another particular embodiment of the method, a step of geometric validation is additionally performed on the data on the set of point coordinates generated for a time window and contained in the sequence of data on the set of coordinates of the points forming the face mask.

[011] In another particular embodiment of the method, said step of geometric validation of the data on the set of point coordinates comprises: determining, based on the phonetic markup data of the voice, the control points of the face mask whose coordinates must be validated; determining the distance between the control points of the face mask, information about which is contained in the data on the set of point coordinates generated for the time window, and comparing it with the distance value specified for these control points according to the phonetic markup data of the voice; wherein, if the determined distance between the control points of the face mask does not exceed the specified distance value for these control points, the data on the set of point coordinates pass validation.
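The distance check in paragraph [011] can be illustrated with a minimal sketch. The function name, the representation of landmarks as coordinate tuples, and the example pair and limits are all hypothetical; the application does not specify which control points are used or how the limits are derived from the phonetic markup.

```python
import math

def validate_landmarks(landmarks, control_pairs, max_distances):
    """Return True if every control-point pair is within its allowed distance.

    landmarks:      list of (x, y, z) coordinates for one time window
    control_pairs:  list of (i, j) index pairs chosen from the phonetic markup (assumed)
    max_distances:  allowed distance for each pair (assumed to come from the markup)
    """
    for (i, j), limit in zip(control_pairs, max_distances):
        if math.dist(landmarks[i], landmarks[j]) > limit:
            return False
    return True

# Hypothetical example: points 0 and 1 stand for the upper and lower lip centres.
landmarks = [(0.0, 0.0, 0.0), (0.0, 3.0, 4.0), (1.0, 1.0, 1.0)]
ok = validate_landmarks(landmarks, [(0, 1)], [5.0])        # distance equals the limit
too_wide = validate_landmarks(landmarks, [(0, 1)], [4.9])  # distance exceeds the limit
```

A window whose landmarks fail the check could, for instance, be regenerated or replaced by interpolation from neighbouring windows; the application does not say which.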

[012] In another particular embodiment of the method, the face mask is placed on the 3D head model by attaching it at known boundary vertices of the 3D head model and the face mask points corresponding to those vertices.

[013] In another particular embodiment of the method, a step is additionally performed in which textures of the teeth and tongue are placed on the images of the 3D head model according to the position of the face mask.

[014] In another particular embodiment of the method, a step of morphing the images of the resulting 3D head model is additionally performed.

[015] In another particular embodiment of the method, a step of color correction and alignment of the dynamic textures is additionally performed to eliminate pulsating changes in the color of the face mask by color-merging said images of the 3D head model contained in the preceding and subsequent frames.
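One way to picture the color merging of paragraph [015] is a temporal blend of each frame with its neighbours. The sketch below is an assumption, not the blend used in the application: the function name, the flat-list frame representation, and the 0.5 weight are illustrative only.

```python
def blend_with_neighbours(frames, weight=0.5):
    """Blend each frame with the average of its previous and next frames.

    frames: list of frames, each a flat list of pixel intensities
    weight: share kept from the current frame (assumed value)
    """
    blended = []
    for t, frame in enumerate(frames):
        # At the sequence boundaries the frame itself stands in for the missing neighbour.
        prev_f = frames[max(t - 1, 0)]
        next_f = frames[min(t + 1, len(frames) - 1)]
        blended.append([
            weight * cur + (1.0 - weight) * 0.5 * (p + n)
            for cur, p, n in zip(frame, prev_f, next_f)
        ])
    return blended

# The middle frame "pulses" away from its neighbours; blending pulls it back.
frames = [[100.0, 100.0], [120.0, 80.0], [100.0, 100.0]]
smoothed = blend_with_neighbours(frames)
```

In this toy input the middle frame moves from `[120.0, 80.0]` to `[110.0, 90.0]`, halving the frame-to-frame color jump.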

[016] In another preferred embodiment of the claimed solution, a speech signal processing device is provided, comprising at least one computing device and at least one memory device containing machine-readable instructions which, when executed by the at least one computing device, carry out the method described above.

BRIEF DESCRIPTION OF THE DRAWINGS

[017] The features and advantages of the present invention will become apparent from the following detailed description of the invention and the accompanying drawings, in which:

[018] FIG. 1 shows the basic architecture of the technology.

[019] FIG. 2 shows an example of a system for creating facial expressions based on text.

[020] FIG. 3 shows an example of phonetic markup of the phrase «Добрый день» (“Good afternoon”).

[021] FIG. 4 shows an example of the general layout of a computing device.

DETAILED DESCRIPTION OF THE INVENTION

[022] The concepts and terms needed to understand this technical solution are described below.

[023] In this technical solution, a system means, among other things, a computer system, a computer (electronic computing machine), a CNC (computer numerical control) system, a PLC (programmable logic controller), computerized control systems, and any other devices capable of performing a specified, clearly defined sequence of operations (actions, instructions).

[024] An instruction processing device means an electronic unit or an integrated circuit (microprocessor) that executes machine instructions (programs).

[025] The instruction processing device reads and executes machine instructions (programs) from one or more data storage devices. Data storage devices may include, without limitation, hard disk drives (HDD), flash memory, ROM (read-only memory), solid-state drives (SSD), and optical drives.

[026] A program is a sequence of instructions intended for execution by the control unit of a computer or by an instruction processing device.

[027] Video frame - a single frame of video: an image frame or a television frame.

[028] A rig is a term in computer animation that describes a set of dependencies between controlling and controlled elements, designed so that there are fewer controlling elements than controlled ones.

[029] The “Digital Avatar” hardware and software complex is based on two technology complexes (see FIG. 1):

- speech synthesis from arbitrary text [1], in which text information (TXT) is fed to the input of a neural network (DNN) to be processed using text-to-speech (TTS) technology, yielding a speech signal (VOICE) generated by the neural network;

- image synthesis from the generated speech [2], in which the speech signal (VOICE) is fed to the input of a neural network (DNN) to form the facial expressions (Mimic) and the image of the digital avatar (VIDEO). This technology complex also uses several technologies: known compositing post-processing technologies (combining images in a frame) and the new technology “Generation of digital avatar facial expressions based on voice”.

[030] In accordance with the diagram shown in FIG. 2, the system 100 for creating facial expressions based on text, which performs the image synthesis from generated speech described above, comprises a memory device 10 and a speech signal processing device 20.

[031] The memory device 10 may be random-access or read-only memory intended for storing speech signals synthesized from text by the method described above, for example in the form of audio tracks. Solutions offered by ЦРТ (Speech Technology Center), АБК, and Google can be used to generate said speech signal. In addition, phonetic markup data of the voice, generated when the text is synthesized, may be stored together with the speech signal. The phonetic markup of the voice can be produced by methods known in the art, for example by means of solutions offered by ЦРТ. The phonetic markup data contain the start and end times of the pronunciation of each sound within the speech signal (the audio track). An example of phonetic markup data is shown in FIG. 3, which depicts the speech signal of the phrase «Добрый день» (“Good afternoon”), with the phonetic markup data in the lower part of the drawing.

[032] The speech signal processing device 20 may be implemented on the basis of at least one computing device and comprise: a signal/noise separation module 21, an articulation synthesis module 22, a face mask positioning module 23, a dynamic texture frame forming module 24, and a frame processing module 25. The signal/noise separation module 21 and the frame processing module 25 may be implemented on a computing device equipped with specialized software for performing the functions assigned to them in this application. The articulation synthesis module 22, the face mask positioning module 23, and the dynamic texture frame forming module 24 may be implemented on the basis of convolutional neural networks trained in advance on a data sample selected by the developer of these modules.

[033] In accordance with its programmed algorithm, the speech signal processing device 20 accesses the memory device 10 to retrieve the data of at least one speech signal, which may be stored, for example, as a 16-bit mono channel at a sampling rate of 22 kHz, and which are passed to the signal/noise separation module 21. In addition, the phonetic markup data of the voice for this speech signal may be retrieved from said device 10 together with the speech signal data.

[034] On receiving the speech signal data, said module 21 distinguishes voice from noise on the basis of the amplitude-phase frequency response of the sound wave (the audio stream of the speech), extracts the sections containing voice information, and marks the remaining fragments of the speech signal, which contain no voice information, with corresponding time codes labeled as silence. If phonetic markup data of the voice are received, in particular data on the start and end times of the pronunciation of a sound, said module 21 also takes these data into account when distinguishing voice from noise, which improves the quality of the separation.
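As a rough illustration of marking voice and silence with time codes, the sketch below uses a simple short-time energy (RMS) threshold. This is a swapped-in technique chosen for brevity, not the amplitude-phase frequency response analysis named in the application; the function name, the 20 ms analysis frame, the threshold, and the toy 1 kHz sample rate are all assumptions.

```python
def mark_voice_and_silence(samples, sample_rate, frame_ms=20, threshold=500.0):
    """Label the signal as 'voice' or 'silence' and return (start_s, end_s, label) segments."""
    frame_len = sample_rate * frame_ms // 1000
    segments = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        # Root-mean-square amplitude of the frame (assumed voice indicator).
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
        label = "voice" if rms >= threshold else "silence"
        t0 = start / sample_rate
        t1 = (start + len(frame)) / sample_rate
        if segments and segments[-1][2] == label:
            # Extend the previous segment instead of starting a new one.
            segments[-1] = (segments[-1][0], t1, label)
        else:
            segments.append((t0, t1, label))
    return segments

# Toy signal: 0.1 s of silence, 0.1 s of a loud square wave, 0.1 s of silence.
rate = 1000
signal = [0] * 100 + [1000, -1000] * 50 + [0] * 100
segments = mark_voice_and_silence(signal, rate)
```

Start and end times from phonetic markup, when available, could be used to override or refine these labels, as paragraph [034] suggests.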

[035] The speech signal data with the extracted sections containing voice information are then passed to the articulation synthesis module 22, which divides the sections containing voice information into time windows (for example, 200 ms each), where one time window corresponds to one video frame. In addition, the phonetic markup data of the voice may be sent to said module 22.
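The windowing step can be sketched as follows, using the 200 ms window and the 22 kHz sampling rate mentioned in the text (taken here as 22050 Hz, an assumption). The function name is hypothetical, and dropping a trailing partial window is also an assumption; the application does not say how the remainder is handled.

```python
def split_into_windows(samples, sample_rate, window_ms=200):
    """Split a voiced section into consecutive, non-overlapping time windows."""
    window_len = sample_rate * window_ms // 1000
    # A trailing partial window is discarded (assumed behaviour).
    return [samples[i:i + window_len]
            for i in range(0, len(samples) - window_len + 1, window_len)]

sample_rate = 22050                    # assumed exact value for "22 kHz"
voiced = list(range(sample_rate))      # one second of placeholder samples
windows = split_into_windows(voiced, sample_rate)
```

With these values each window holds 4410 samples, and one second of voiced audio yields five windows, that is, five video frames.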

[036] After the sections of the speech signal containing voice information have been divided into time windows, the articulation synthesis module 22 converts the amplitude-phase frequency response of the sound wave for each time window containing voice information into a frequency spectrum image, for example a two-dimensional image of a graph of the relative energy of the sound vibrations as a function of frequency. This conversion can be performed by means of the Fourier transform, a family of mathematical methods based on decomposing an original continuous function of time into a set of basis harmonic functions (sinusoids) of different frequency, amplitude, and phase. The main idea of the transform is that a function can be represented as a (generally infinite) sum of sinusoids, each characterized by its own amplitude, frequency, and initial phase. In this way, a sequence of frequency spectrum images is formed in said module 22.
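The relative-energy-versus-frequency curve for one window can be sketched with a direct discrete Fourier transform. This is a minimal illustration under stated assumptions: the function name is hypothetical, the naive O(n²) sum stands in for a fast implementation (for example `numpy.fft.rfft`), and rendering the curve as a two-dimensional image is left out.

```python
import cmath
import math

def energy_spectrum(window):
    """Return the relative energy in each frequency bin of one time window."""
    n = len(window)
    energies = []
    for k in range(n // 2 + 1):        # non-negative frequencies only
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(window))
        energies.append(abs(coeff) ** 2)
    total = sum(energies) or 1.0       # avoid division by zero on an all-zero window
    return [e / total for e in energies]

# A pure tone completing 8 cycles within the window should peak in bin 8.
n = 64
tone = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
spectrum = energy_spectrum(tone)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

Each entry of `spectrum` is one point of the graph described in paragraph [036]; plotting these points per window gives one image of the sequence.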

[037] Next, based on the sequence of frequency spectrum images, the articulation synthesis module 22 determines a sequence of data on the set of coordinates of the points forming the face mask. To this end, each frequency spectrum image is fed to the input of the convolutional neural networks with which module 22 may be equipped. There the image is converted into an image matrix containing the pixel values of the image, after which the image matrix is multiplied element-wise by a convolution matrix (kernel). The convolution matrix is a matrix of coefficients set by the developer for each layer of the convolutional neural network, depending on the preprocessing required for the input frequency spectrum image. Methods for selecting the coefficients of a convolution matrix are widely known in the art and are not described further in this application (see, for example, the article "Matrix Filters for Image Processing", published on 25.04.2012 on the Internet at: https://habr.com/ru/post/142818/).
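
The element-wise multiplication of an image matrix by a convolution kernel mentioned in paragraph [037] can be sketched as follows. This is a minimal illustration only: the kernel values, the image and the function name `convolve2d` are hypothetical, and the kernel is applied without flipping and without padding, as is common in convolutional networks.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image; at each position multiply element-wise and sum."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 3x4 "spectrum image" with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical kernel that responds to left-to-right intensity increases.
edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = convolve2d(image, edge_kernel)  # [[3.0, 3.0]]
```

The output matrix is what one layer would pass on to the next; stacking several such layers with different kernels yields the output image referred to in the following paragraph.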

[038] Next, based on the results of multiplying the image matrix by the convolution matrix obtained at each layer of the convolutional neural network, an output frequency spectrum image is formed. This image is fed to the input of a specialized neural network trained in advance on various frequency spectrum images and on the sets of coordinates of points (landmarks) forming the face mask that correspond to those images. At the output of this network, module 22 obtains data on the set of coordinates of the points forming the face mask for each time window. In this way, a sequence of data on the set of coordinates of the points forming the face mask is formed in module 22. Methods for determining the set of coordinates of the points forming a face mask from a face image are widely known in the art (see, for example, the article "Face Recognition. Creating and Trying On Masks", published on 12.12.2017 on the Internet at: https://habr.com/ru/company/epam systems/bloq/343514/) and are not described in more detail in this application. The face images from which the sets of point coordinates are determined can be obtained by processing video data of a face, for example that of a speaker, divided into time windows corresponding to the time windows of the speech signal.

[039] Additionally, the articulation synthesis module 22 may be configured to geometrically validate the data on the set of point coordinates formed for a time window and contained in the sequence of data on the set of coordinates of the points forming the face mask. To do so, module 22 uses the phonetic markup data of the voice to determine the control points of the face mask whose coordinates must be validated. For example, the phonetic markup data may indicate the presence of the sound "b", "m" or "p" in the time window; based on this information, module 22 can determine the control points of the face mask that characterize the positions of the centers of the surfaces of the upper and lower lips, the corners of the lips, and so on. Module 22 then determines the distance between the control points of the face mask, as given in the data on the set of point coordinates formed for the time window, and compares it with the distance value specified for these control points by the phonetic markup data for that time window.

[040] For example, for phonetic markup data indicating the presence of the sound "b", "m" or "p" in the time window, the specified distance between the control points characterizing the positions of the centers of the surfaces of the upper and lower lips may equal a minimum distance close to "0" (zero), so that lip closure is visualized on the 3D model of the head.

[041] If the determined distance between the control points of the face mask is greater than the specified distance between these control points, i.e. greater than the minimum distance, the data on the set of control points fail validation and the articulation synthesis module 22 returns to the previous step. The frequency spectrum image from the sequence of images from which the failed data were determined is again fed, as described above, to the input of the convolutional neural network, until data on the set of point coordinates that pass validation are obtained.

[042] If the determined distance between the control points of the face mask does not exceed the specified distance between these control points, the data on the set of point coordinates pass validation. The control points associated with the phonetic markup data and the distance values between the control points can be specified in advance by the developer of the articulation synthesis module 22.
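
The validation logic of paragraphs [039] to [042] can be sketched as a distance check followed by a bounded retry. Everything named here is an assumption for illustration: the threshold table `MAX_LIP_GAP`, the landmark keys `upper_lip` and `lower_lip`, the `predict` callable standing in for the neural network, and the `max_attempts` limit, which is added so that the loop terminates even if no prediction ever passes.

```python
import math

# Hypothetical table: phoneme -> maximum allowed gap between lip-centre points.
MAX_LIP_GAP = {"b": 0.5, "m": 0.5, "p": 0.5}


def passes_validation(landmarks, phoneme):
    """Check the lip gap against the threshold for the phoneme, if one is defined."""
    limit = MAX_LIP_GAP.get(phoneme)
    if limit is None:
        return True  # no geometric constraint for this phoneme
    gap = math.dist(landmarks["upper_lip"], landmarks["lower_lip"])
    return gap <= limit


def validated_landmarks(spectrum_image, phoneme, predict, max_attempts=3):
    """Re-run the predictor until the landmarks pass validation or attempts run out."""
    for _ in range(max_attempts):
        landmarks = predict(spectrum_image)
        if passes_validation(landmarks, phoneme):
            return landmarks
    return None  # caller decides how to handle a window that never validates


closed_lips = {"upper_lip": (0.0, 1.0), "lower_lip": (0.0, 1.2)}
open_lips = {"upper_lip": (0.0, 1.0), "lower_lip": (0.0, 3.0)}
```

With these sample values, `closed_lips` passes for the sound "m" while `open_lips` does not, and a phoneme without an entry in the table is accepted unchanged.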

[043] Next, the sequence of validated data on the set of point coordinates is sent to the face mask positioning module 23. This module retrieves a 3D model of a head, for example the head of a speaker, from the memory with which it may additionally be equipped, places the face mask on the 3D model of the head, and then forms a frame containing an image of the 3D model with the face mask placed on it. The face mask can be positioned on the 3D model of the head by attaching it at known boundary vertices of the 3D model of the head and at the points of the face mask corresponding to those vertices, as specified by the developer of module 23. In this way, a sequence of frames containing an image of the 3D model of the head with the face mask placed on it is formed at the output of module 23.
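
The application does not specify the mathematics of attaching the mask at boundary vertices. One simple possibility, shown here purely as an assumption, is to fit a uniform scale and a translation that bring the mask anchor points onto the matching head vertices and then apply the same transform to every mask point. Rotation is deliberately ignored, the function names are hypothetical, and the anchor points are assumed not to coincide (otherwise the scale is undefined).

```python
def fit_scale_translation(mask_anchors, head_vertices):
    """Estimate a uniform scale and the two centroids that align the anchor sets."""
    if len(mask_anchors) != len(head_vertices):
        raise ValueError("each mask anchor needs one matching head vertex")
    n = len(mask_anchors)
    dims = len(mask_anchors[0])
    mask_c = [sum(p[d] for p in mask_anchors) / n for d in range(dims)]
    head_c = [sum(p[d] for p in head_vertices) / n for d in range(dims)]
    mask_spread = sum(
        (p[d] - mask_c[d]) ** 2 for p in mask_anchors for d in range(dims)
    )
    head_spread = sum(
        (p[d] - head_c[d]) ** 2 for p in head_vertices for d in range(dims)
    )
    scale = (head_spread / mask_spread) ** 0.5
    return scale, mask_c, head_c


def place_mask(mask_points, scale, mask_c, head_c):
    """Apply the fitted scale and translation to every point of the mask."""
    return [
        tuple(hc + scale * (x - mc) for x, mc, hc in zip(p, mask_c, head_c))
        for p in mask_points
    ]


# Hypothetical 2D anchors; the same code works for 3D tuples.
mask_anchors = [(0, 0), (2, 0), (2, 2), (0, 2)]
head_vertices = [(10, 10), (14, 10), (14, 14), (10, 14)]
scale, mask_c, head_c = fit_scale_translation(mask_anchors, head_vertices)
placed = place_mask([(1, 1), (0, 0)], scale, mask_c, head_c)
```

In this example the mask is scaled by a factor of 2 and shifted so that its centre coincides with the centre of the head vertices.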

[044] In an alternative embodiment of the claimed solution, the boundary vertices of the 3D model of the head and the corresponding points of the face mask can be determined by methods known in the art, for example by means of the dlib library, by successively locating the face in the frame, locating fixed points on it, and obtaining their coordinates.

[045] Additionally, textures of the teeth and tongue can be stored in the memory of module 23 for placement on the image of the 3D model of the head according to the position of the face mask. The positions of the teeth and tongue textures can be specified by the developer of module 23 on the basis of human anatomy. The teeth and tongue are bound to the model as part of the 3D model design. Such elements have rigs with rigid operating logic, which is standard practice when creating the "skeleton" of a 3D model.

[046] The sequence of frequency spectrum images is also sent to the dynamic texture frame generation module 24. For each time window, this module forms a frame of the dynamic texture of the face mask from the frequency spectrum image. These frames are then combined into a sequence of frames of the dynamic texture of the face mask, thereby forming a discrete image of the face mask texture. The dynamic texture of the face mask may contain information about areas of the face that should be displayed with reddening, pulsation, changes in facial texture, and so on. To form the frames of the dynamic texture of the face mask, module 24 may be equipped with a convolutional neural network trained in advance on various frequency spectrum images and on the frames of the dynamic texture of the face mask corresponding to those images. The face images from which the frames of the dynamic texture of the face mask are formed can likewise be obtained by processing video data of a face, for example that of a speaker, divided into time windows corresponding to the time windows of the speech signal. Accordingly, the frequency spectrum images are fed to the input of this neural network, and frames of the dynamic texture of the face mask are formed at its output.

[047] The frame sequences formed by the face mask positioning module 23 and by the dynamic texture frame generation module 24 are then sent to the frame processing module 25. This module extracts the dynamic texture information from a frame of the dynamic texture of the face mask, for example from the first frame in the sequence of dynamic texture frames, and places the dynamic texture of the face mask on the corresponding frame, in this case the first frame in the sequence of frames containing an image of the 3D model of the head with the face mask placed on it. In the same way, the dynamic texture of the face mask is placed on all frames in the sequence of frames containing an image of the 3D model of the head with the face mask placed on it, matching the number of each such frame with the number of the frame containing the dynamic texture information. In this way, a sequence of frames containing an image of the resulting 3D model of the head with the dynamic texture of the face mask placed on it is formed in the frame processing module 25.
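
The frame-by-frame pairing described in paragraph [047] can be sketched as follows. Frames are modeled here as small nested lists of pixel values, and `None` in a texture frame marks a pixel outside the face mask; both conventions, and the function names, are assumptions chosen for this example.

```python
def overlay_texture(head_frame, texture_frame):
    """Place texture pixels on the head frame; None marks pixels outside the mask."""
    return [
        [tex if tex is not None else base for base, tex in zip(base_row, tex_row)]
        for base_row, tex_row in zip(head_frame, texture_frame)
    ]


def compose_sequence(head_frames, texture_frames):
    """Combine frame i of the head sequence with frame i of the texture sequence."""
    if len(head_frames) != len(texture_frames):
        raise ValueError("sequences must contain the same number of frames")
    return [overlay_texture(h, t) for h, t in zip(head_frames, texture_frames)]


# Two hypothetical 2x2 frames per sequence.
head_frames = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
texture_frames = [[[None, 9], [None, None]], [[None, None], [7, None]]]
result_frames = compose_sequence(head_frames, texture_frames)
```

Matching by index relies on both sequences having been produced from the same time windows, which is why the sketch rejects sequences of different length.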

[048] Additionally, to give the image a monolithic (seamless) appearance, the frame processing module 25 may be equipped with matrix image processing filters, which are widely used in the art. Using these filters, module 25 can analyze, in the formed frames containing an image of the resulting 3D model of the head, the transition boundaries between the face mask and the 3D model of the head, fill gaps between the image of the 3D model of the head and the dynamic texture of the face mask, and blur the color boundaries between textures.

[049] The frame processing module 25 may additionally be equipped with a corresponding 3D model for forming the surface characteristics (relief) of the 3D model of the head, which can be combined with parameters such as "material properties" (base shade, texture).

[050] After this, the frame processing module 25 forms a sequence of frames showing the resulting 3D model of the head against a scene background, with the teeth and tongue textures, based on the sequence of frames containing an image of the resulting 3D model of the head. To do so, module 25 retrieves an image of a scene prepared in advance, for example an image of a room, and adds to it the image of the resulting 3D model of the head contained in the first frame of the above sequence, together with the teeth and tongue textures for that image of the 3D model of the head according to the position of the face mask. It then forms a frame showing the resulting 3D model of the head, i.e. with the dynamic texture of the face mask and the teeth and tongue textures, against the scene background. In the same way, module 25 forms frames for the second and subsequent frames containing an image of the resulting 3D model of the head, so that a sequence of frames showing the resulting 3D model of the head against the scene background with the teeth and tongue textures is formed in module 25.

[051] Next, the frame processing module 25 combines the resulting sequence of frames into a video stream. At this stage, module 25 may be configured to morph the images of the resulting 3D model of the head with the dynamic texture of the face mask placed on it. Morphing ensures smooth transitions between these images and eliminates abrupt changes in the shape of the face mask by merging, by geometry, the images of the 3D model of the head contained in the preceding and following frames. Module 25 may also be configured to perform color correction and alignment of the dynamic textures, which eliminates pulsating changes in the color of the face mask by merging, by color, the images of the 3D model of the head contained in the preceding and following frames. Techniques for image morphing, color correction and alignment of dynamic textures are widely known in the art and are not described in more detail in this application.
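
The application leaves the morphing and color-merging techniques to the prior art. As one simple stand-in, stated here as an assumption rather than the method used, each frame can be blended with the mean of its preceding and following frames. The same routine applies to flattened landmark coordinates (geometry) or to flattened color values; the function name and the `weight` parameter are hypothetical.

```python
def smooth_sequence(frames, weight=0.5):
    """Blend each frame with the mean of its neighbours; edge frames reuse themselves."""
    smoothed = []
    last = len(frames) - 1
    for i, frame in enumerate(frames):
        prev_frame = frames[max(i - 1, 0)]
        next_frame = frames[min(i + 1, last)]
        smoothed.append([
            (1 - weight) * cur + weight * (prev + nxt) / 2
            for cur, prev, nxt in zip(frame, prev_frame, next_frame)
        ])
    return smoothed


# Hypothetical one-value frames with an abrupt jump in the middle.
jumpy = [[0.0], [10.0], [0.0]]
smoothed = smooth_sequence(jumpy)  # [[2.5], [5.0], [2.5]]
```

A larger `weight` gives smoother transitions at the cost of softening genuine fast movements such as lip closure, so in practice the value would need tuning.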

[052] The speech signal processing device 20 combines the resulting video stream with the speech signal data by methods known in the art, after which the video stream is stored in the memory device 10 and can be output to information output devices on request. If necessary, the resulting video stream can be compressed to reduce the amount of data it occupies.

[053] In general (see Fig. 4), the computing device comprises, connected by a common data exchange bus, one or more processors (201), memory means such as RAM (202) and ROM (203), input/output interfaces (204), input/output devices (205), and a network communication device (206).

[054] The processor (201) (or several processors, a multi-core processor, etc.) can be selected from the range of devices widely used at present, for example those of manufacturers such as Intel™, AMD™, Apple™, Samsung Exynos™, MediaTEK™, Qualcomm Snapdragon™, etc. The processor, or one of the processors used in the device (200), should also be understood to include a graphics processor, for example an NVIDIA or Graphcore GPU, a type that is likewise suitable for fully or partially executing the method and can also be used for training and applying machine learning models in various information systems.

[055] RAM (202) is random access memory intended to store machine-readable instructions executed by the processor (201) to perform the required logical data processing operations. RAM (202) typically contains the executable instructions of the operating system and the relevant software components (applications, program modules, etc.). The available memory of a graphics card or graphics processor can also serve as RAM (202).

[056] ROM (203) is one or more persistent data storage devices, for example a hard disk drive (HDD), a solid state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.

[057] Various types of I/O interfaces (204) are used to organize the operation of the components of the device (200) and of externally connected devices. The choice of interfaces depends on the specific implementation of the computing device; they may be, without limitation: PCI, AGP, PS/2, IrDa, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS/Audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, Display Port, RJ45, RS232, etc.

[058] Various input/output means (205) are used to enable user interaction with the device (200), for example a keyboard, display (monitor), touch display, touchpad, joystick, mouse, light pen, stylus, touch panel, trackball, speakers, microphone, augmented reality means, optical sensors, tablet, indicator lights, projector, camera, biometric identification means (retina scanner, fingerprint scanner, voice recognition module), etc.

[059] The network communication means (206) provides data transmission over an internal or external computer network, for example an intranet, the Internet, a LAN, etc. One or more means (206) may be, without limitation: an Ethernet card, GSM modem, GPRS modem, LTE modem, 5G modem, satellite communication module, NFC module, Bluetooth and/or BLE module, Wi-Fi module, etc.

[060] Satellite navigation means, for example GPS, GLONASS, BeiDou or Galileo, may additionally be used as part of the device (200). The specific selection of elements of the device (200) for implementing various software and hardware architectures may vary, provided the required functionality is retained.

[061] Modifications and improvements to the above-described embodiments of the present technical solution will be apparent to those skilled in the art. The foregoing description is provided by way of example only and is not limiting. Accordingly, the scope of the present technical solution is limited only by the scope of the appended claims.

Claims

1. A method of processing a speech signal to form a video stream, performed by at least one computing device, comprising the steps of:

- receiving data of at least one speech signal;

- dividing the sections of the speech signal containing voice information into time windows;

- forming a frequency spectrum image for each time window to obtain a sequence of frequency spectrum images;

- determining, based on the sequence of frequency spectrum images, a sequence of data on the set of coordinates of the points forming a face mask;

- placing the face mask on a 3D model of a head to form a sequence of frames containing an image of the 3D model of the head with the face mask placed on it;

- forming, based on the sequence of frequency spectrum images, a sequence of frames of the dynamic texture of the face mask;

- forming a sequence of frames containing an image of the resulting 3D model of the head with the dynamic texture of the face mask placed on it, based on the sequence of frames containing an image of the 3D model of the head with the face mask placed on it and on the frames of the dynamic texture of the face mask;

- forming a sequence of frames showing the resulting 3D model of the head against a scene background;

- combining the sequence of frames obtained in the previous step into a video stream.

2. The method according to claim 1, characterized in that the following steps are additionally performed:

- upon receipt of the speech signal data, noise/voice discrimination is performed;

- the sections of the speech signal containing voice information are extracted; wherein phonetic markup data is taken into account during the noise/voice discrimination.
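As a non-limiting illustration of extracting the sections that contain voice, the sketch below marks frames whose mean energy exceeds a threshold and merges adjacent voiced frames into sections. The energy criterion, the function name, and the parameters are assumptions of this example; the claim does not state how noise and voice are distinguished, and the use of phonetic markup data is not modeled here.

```python
def voiced_sections(samples, frame_len, threshold):
    """Return (start, end) sample ranges whose mean frame energy exceeds threshold."""
    sections = []
    start = None
    for pos in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[pos:pos + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold:
            if start is None:
                start = pos  # a voiced section begins at this frame
        elif start is not None:
            sections.append((start, pos))  # the section ended before this frame
            start = None
    if start is not None:
        # The signal ended while still voiced; close at the last full frame.
        end = (len(samples) // frame_len) * frame_len
        sections.append((start, end))
    return sections
```

The returned ranges can then be passed to the windowing step, so that silent or noisy stretches do not produce face mask coordinates.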

3. The method according to claim 1, characterized in that it additionally comprises a step of geometric validation of the data on the set of coordinates of the points formed for a time window and contained in the sequence of data on the set of coordinates of the points forming the face mask.

4. The method according to claim 3, characterized in that said step of geometric validation of the data on the set of coordinates of the points comprises the steps of:

- determining, based on the phonetic markup of the voice, the control points of the face mask whose coordinates must be validated;

- determining the distance between the control points of the face mask, information about which is contained in the data on the set of coordinates of the points formed for the time window, and comparing it with the distance specified for these control points according to the phonetic markup of the voice;

wherein, if the determined distance between the control points of the face mask does not exceed the specified distance between these control points, the data on the set of coordinates of the points passes validation.
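As a non-limiting illustration of the comparison in claim 4, the sketch below checks that the distance between each pair of control points does not exceed the distance specified for that pair. The point names, the dictionary layout, and the numeric limit in the usage lines are hypothetical; the claims state only that the control points and the specified distances follow from the phonetic markup of the voice.

```python
import math


def passes_geometric_validation(points, max_distances):
    """Check every control pair against its specified maximum distance.

    points: {name: (x, y, z)} coordinates generated for one time window.
    max_distances: {(name_a, name_b): limit} selected from the phonetic markup.
    """
    for (a, b), limit in max_distances.items():
        if math.dist(points[a], points[b]) > limit:
            return False  # the pair is farther apart than allowed
    return True


# Hypothetical usage: the lips must stay close together for a closed-mouth sound.
mask = {"upper_lip": (0.0, 0.0, 0.0), "lower_lip": (0.0, 3.0, 4.0)}
limits = {("upper_lip", "lower_lip"): 5.0}  # illustrative value only
is_valid = passes_geometric_validation(mask, limits)
```

A distance equal to the limit passes, matching the "does not exceed" wording of the claim.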

5. The method according to claim 1, characterized in that the face mask is placed on the 3D model of the head by attaching the face mask to known boundary vertices of the 3D model of the head at the points of the face mask corresponding to these vertices.

6. The method according to claim 1, characterized in that it additionally comprises a step of placing textures of the teeth and tongue on the 3D model of the head according to the position of the face mask.

7. The method according to claim 1, characterized in that it further comprises a step of morphing the images of the resulting 3D model of the head.

8. The method according to claim 1, characterized in that it further comprises a step of color correction and blending of the dynamic textures to eliminate pulsating color changes of the face mask by merging the colors of said images of the 3D model of the head contained in the previous and subsequent frames.
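As a non-limiting illustration of merging colors across previous and subsequent frames, the sketch below replaces each pixel color with the unweighted mean of the same pixel in the neighboring frames. The frame representation (a list of RGB tuples per frame), the function name, and the equal weighting are assumptions of this example; the claim does not specify the merging rule.

```python
def smooth_colors(frames):
    """Average each pixel color with the same pixel in adjacent frames.

    frames: list of frames, each a list of (r, g, b) tuples of equal length.
    """
    smoothed = []
    for i in range(len(frames)):
        # Previous, current, and next frame; fewer at the sequence boundaries.
        neighbors = frames[max(i - 1, 0):i + 2]
        smoothed.append([
            tuple(sum(channel) / len(neighbors) for channel in zip(*pixels))
            for pixels in zip(*neighbors)
        ])
    return smoothed
```

Because each output frame mixes in its neighbors, a color that jumps in a single frame is pulled toward the surrounding values, which damps frame-to-frame pulsation.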

9. A device for processing a speech signal, comprising at least one computing device and at least one memory device containing machine-readable instructions that, when executed by the at least one computing device, perform the method according to any one of claims 1-8.