RU2793573C1

RU2793573C1 - Bandwidth extension and noise removal for speech audio recordings

Info

Publication number: RU2793573C1
Application number: RU2022121967A
Authority: RU
Inventors: Павел Константинович Андреев; Айбек Арстанбекович Аланов; Олег Юрьевич Иванов; Дмитрий Петрович ВЕТРОВ
Original assignee: Самсунг Электроникс Ко., Лтд.
Filing date: 2022-08-12
Publication date: 2023-04-04

Abstract

FIELD: computer technology.

SUBSTANCE: computer technology for processing and analysis of audio recordings. A waveform of an audio signal is received from a signal source; the waveform is processed by short-time Fourier transform (STFT) operations to obtain a mel-spectrogram; two-dimensional UNet convolutional blocks are applied by SpectralUNet to the mel-spectrogram to denoise it and restore high frequencies; the SpectralUNet output is converted by a HiFi generator into the waveform domain; the HiFi generator output is concatenated with the waveform of the audio signal; the resulting waveform is corrected in the time domain by WaveUNet; the waveform is then corrected in the frequency domain by SpectralMaskNet to remove artifacts and noise; the SpectralMaskNet output is processed by a one-dimensional convolutional layer; the corrected waveform of the audio signal is output.

EFFECT: increasing the accuracy of noise suppression in speech recorded in a noisy environment.

10 cl, 6 dwg, 2 tbl

Description

FIELD OF THE INVENTION

The invention relates to computing, in particular to methods for processing and analyzing audio recordings. It can be used in devices that transmit, receive, and generate speech recordings to improve the user's experience when listening to these recordings.

DESCRIPTION OF THE PRIOR ART

Conditional speech generation is of great practical importance in audio signal processing. Its applications include bandwidth extension (BWE), speech enhancement (SE, also referred to as speech denoising), and many others. One recent success in conditional speech generation comes from the application of generative adversarial networks (Kumar et al., 2019; Kong et al., 2020). The HiFi generator (Kong et al., 2020) was recently proposed as a computationally efficient fully convolutional network that performs neural vocoding of speech with quality comparable to its autoregressive counterpart while being several orders of magnitude faster. A key part of this architecture is the multi-receptive field fusion (MRF) module, which models diverse receptive field patterns. By adjusting the parameters of the HiFi architecture, a good trade-off between computational efficiency and sample quality can be achieved. The present invention adapts the known HiFi model (Kong et al., 2020) to bandwidth extension and speech enhancement by developing a new generator.

Bandwidth extension (Kuleshov et al., 2017; Lin et al., 2021), also known as audio super-resolution, can be viewed as a realistic increase of the signal sampling rate. The bandwidth or sampling rate of a speech signal may be truncated by low-quality recording devices or transmission channels. Super-resolution models are therefore of practical importance in telecommunications.

Several previous works (Birnbaum et al., 2019; Lin et al., 2021; Wang & Wang, 2020) address bandwidth extension either waveform-to-waveform or through joint time-frequency neural architectures trained with various supervised reconstruction losses. Birnbaum et al. (2019) (TFiLM) proposed a temporal linear modulation layer that uses a recurrent neural network to modify the activations of a convolutional model. The authors applied this layer to a convolutional encoder-decoder architecture operating in the waveform domain (Kuleshov et al., 2017) and observed significant gains in bandwidth extension quality. Lin et al. (2021) (2S-BWE) consider a two-stage approach to bandwidth extension. In the first stage, the signal spectrum is predicted by either a temporal convolutional network (TCN) (Bai et al., 2018) or a convolutional recurrent network (CRN) (Tan & Wang, 2018); in the second stage, the raw waveform is refined by the WaveUNet model.

Audio denoising (Fu et al., 2019; Tagliasacchi et al., 2020), in turn, has long been of major interest to the audio processing community because of its importance and difficulty. The task is to clean the original signal (most often speech) of extraneous distortions.

Recent deep learning work on this topic forms two lines of research. The first operates at the waveform level, that is, in the time domain. Stoller et al. (2018) adapted the UNet model (Ronneberger et al., 2015) to one-dimensional time-domain signal processing to solve audio source separation, a generalization of the speech denoising problem. The proposed convolutional encoder-decoder (CED) architecture has become common in neural speech enhancement models. For example, Pascual et al. (2017) follow an adversarial training pipeline and use a CED network as the generator, with a fully convolutional discriminator for training. The SEANet model (Tagliasacchi et al., 2020) also addresses speech denoising and uses fully convolutional generator and discriminator architectures. Defossez et al. (2020) proposed the DEMUCS architecture for speech denoising. DEMUCS (Défossez et al., 2019) is a CED network with gated convolutions and long short-term memory modules in the bottleneck. The model is trained with joint reconstruction losses in the time and frequency domains. Recently, Gulati et al. (2020) effectively combined convolutional neural networks and transformers in the time domain. The resulting model, called Conformer, has demonstrated state-of-the-art performance in a variety of audio processing tasks.

Other works do not rely on time-domain information and instead use a higher-level spectrogram representation of the audio signal. Many approaches in this line use spectral masking: for each point of the spectrogram they predict a real-valued multiplicative coefficient in [0, 1]. For example, MetricGAN (Fu et al., 2019) and MetricGAN+ (Fu et al., 2021) use a bidirectional LSTM combined with spectral masking to directly optimize objective metrics of overall speech quality, achieving state-of-the-art results on these metrics.
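As an illustration of the spectral masking idea described above, the following is a minimal sketch rather than any cited model's implementation. It assumes the spectrogram is given as nested lists of complex STFT coefficients and that the mask values have already been predicted; the function name `apply_spectral_mask` is hypothetical.

```python
import cmath


def apply_spectral_mask(spectrogram, mask):
    """Scale the magnitude of each complex STFT bin by a mask value in [0, 1].

    The phase of every bin is left unchanged. Both arguments are lists of rows
    with identical shapes (an assumption of this sketch).
    """
    if len(spectrogram) != len(mask):
        raise ValueError("spectrogram and mask must have the same number of rows")
    masked = []
    for spec_row, mask_row in zip(spectrogram, mask):
        if len(spec_row) != len(mask_row):
            raise ValueError("spectrogram and mask rows must have the same length")
        new_row = []
        for coeff, m in zip(spec_row, mask_row):
            if not 0.0 <= m <= 1.0:
                raise ValueError("mask values must lie in [0, 1]")
            magnitude, phase = cmath.polar(coeff)
            new_row.append(cmath.rect(m * magnitude, phase))
        masked.append(new_row)
    return masked


# One frame with two bins: halve the first bin, silence the second.
masked = apply_spectral_mask([[3 + 4j, 1j]], [[0.5, 0.0]])
```

Because the mask is bounded by 1, this operation can only attenuate spectrogram points, which is why it suits noise removal.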

The proposed invention can be used to extend the frequency band of speech recordings and to improve the user's experience when listening to them. It can also be used to suppress noise in speech recorded in a noisy environment.

SUMMARY OF THE INVENTION

The present invention addresses speech enhancement in an audio signal and extension of the audio signal's frequency band.

The proposed system takes as input an audio waveform, which is a sequence of real numbers that is noisy or band-limited, and outputs a clean audio waveform (noise-free, high quality). The technical effect is improved audio quality: the signal contains no noise in the speech enhancement task, or its frequency band is widened in the bandwidth extension task. The present disclosure considers speech enhancement and bandwidth extension, but the proposed method can also be trained to attenuate various other artifacts in audio recordings.

The proposed method, performed by an electronic device, may be carried out using an artificial intelligence model. The electronic device may be any suitable electronic device capable of playing back an audio signal. The neural network may be implemented with suitable software and hardware, for example a dedicated computing device.

The artificial intelligence model may be processed by a dedicated artificial intelligence processor designed as a hardware structure adapted to processing such models. The artificial intelligence model may be obtained by training. Here, "obtained by training" means that a predefined operating rule or an artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a base artificial intelligence model on multiple pieces of training data according to a learning algorithm. The artificial intelligence model may include several neural network layers. Each layer has multiple weight values and performs a neural network computation between the computation result of the previous layer and its weight values.

A general formulation of the bandwidth extension and speech enhancement tasks for an audio signal is given below.

Given an audio signal

$$x = \{x_i\}_{i=1}^{N}$$

with a low sampling rate s, a bandwidth extension model aims to restore the high-resolution recording

$$y = \{y_i\}_{i=1}^{N \cdot S/s}$$

with sampling rate S (that is, to extend the effective bandwidth). Here x is the numerical representation of the input audio signal, x_i are its individual time samples, N is the number of samples, s is the source sampling rate, S is the target sampling rate, and y is the numerical representation of the output signal. Training and evaluation data are generated by applying low-pass filters to a signal with a high sampling rate and then downsampling it to the sampling rate s:

$$x = \mathrm{Resample}(\mathrm{lowpass}(y, s/2), s, S)$$

where lowpass(y, s/2) denotes applying a low-pass filter with cutoff frequency s/2 (the Nyquist frequency at sampling rate s), and Resample(·, s, S) denotes downsampling the signal from sampling rate S to sampling rate s. Following recent work (Wang & Wang, 2021; Sulun & Davies, 2020; Liu et al., 2021), the type and order of the low-pass filter are randomized during training to make the model robust.
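The data generation just described (low-pass filtering at s/2 followed by downsampling from S to s) can be sketched as follows. This is a simplified illustration, not the patent's procedure: it assumes a fixed Hamming-windowed sinc FIR filter instead of a randomized filter type and order, and an integer ratio S/s. The function names are hypothetical.

```python
import math


def lowpass_fir(cutoff_hz, sample_rate, num_taps=101):
    """Windowed-sinc low-pass FIR taps (Hamming window), normalized to unit DC gain."""
    fc = cutoff_hz / sample_rate  # normalized cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def convolve_same(signal, taps):
    """Zero-padded FIR filtering whose output is aligned with the input."""
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, t in enumerate(taps):
            idx = i + half - j
            if 0 <= idx < len(signal):
                acc += t * signal[idx]
        out.append(acc)
    return out


def make_low_rate_input(y, high_rate, low_rate, num_taps=101):
    """Low-pass y at low_rate / 2, then keep every (high_rate / low_rate)-th sample."""
    if high_rate % low_rate != 0:
        raise ValueError("this sketch only handles integer decimation factors")
    factor = high_rate // low_rate
    filtered = convolve_same(y, lowpass_fir(low_rate / 2, high_rate, num_taps))
    return filtered[::factor]


# A 500 Hz tone survives the 2 kHz cutoff; a 6 kHz tone is removed.
S, s = 16000, 4000
tone_low = [math.sin(2 * math.pi * 500 * n / S) for n in range(1600)]
tone_high = [math.sin(2 * math.pi * 6000 * n / S) for n in range(1600)]
x_low = make_low_rate_input(tone_low, S, s)
x_high = make_low_rate_input(tone_high, S, s)
```

Filtering before decimation matters: without it, the 6 kHz tone would alias into the band below 2 kHz instead of disappearing.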

To solve the audio denoising problem, the original signal must be cleaned of extraneous distortions. In the present invention, distortion means additive external noise.

Formally, given the noisy signal x = y + n, the denoising algorithm predicts the clean signal y, that is, suppresses the noise n.
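The additive model x = y + n can be illustrated with a small sketch that builds a noisy input from a clean signal and a noise recording. Mixing at a target signal-to-noise ratio is an assumption made here for illustration; the text above does not state how noise levels are chosen. The names `power` and `mix_at_snr` are hypothetical.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Return (x, n) with x = y + n, where n is the noise scaled to the requested SNR."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean, p_noise = power(clean), power(noise)
    if p_clean == 0 or p_noise == 0:
        raise ValueError("both signals must be non-silent")
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    n = [gain * v for v in noise]
    x = [c + d for c, d in zip(clean, n)]
    return x, n


rng = random.Random(0)
y = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
raw_noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
x, n = mix_at_snr(y, raw_noise, snr_db=5.0)
achieved_snr_db = 10 * math.log10(power(y) / power(n))
```

A denoising model would receive `x` and be trained to return something close to `y`.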

A method for processing an audio signal is proposed. The method is carried out on a computing device having an internal memory in which several audio waveforms are stored, and comprises the steps of:

retrieving an audio waveform from the internal memory of the device or from another signal source;

processing the audio waveform with a short-time Fourier transform (STFT) operation to obtain a mel-spectrogram;

processing the mel-spectrogram with a spectral preprocessing module that applies two-dimensional convolutional blocks to the mel-spectrogram, the output of the spectral preprocessing module being a first tensor;

processing the first tensor with a fully convolutional neural network module that increases the temporal resolution of the processed first tensor, the output of the fully convolutional neural network module being a second tensor containing several one-dimensional sequences of real numbers whose length matches the length of said audio waveform;

concatenating the second tensor with the audio waveform, the resulting third tensor containing the concatenated one-dimensional sequences;

processing the third tensor with a time-domain one-dimensional convolutional UNet neural architecture that applies one-dimensional convolutions at several scales of the third tensor along the time dimension, the output of this architecture being a fourth tensor consisting of one-dimensional sequences;

processing the fourth tensor with a trainable spectral masking module that applies a channel-wise short-time Fourier transform (STFT) to the fourth tensor and modifies the magnitudes of the STFT coefficients, the output of the trainable spectral masking module being a fifth tensor;

processing the fifth tensor with a one-dimensional convolutional layer, the output of which is the output audio waveform.
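To make the flow of the five tensors in the steps above easier to follow, the sketch below traces tensor shapes only; it performs no signal processing. All numeric values (hop size 256, 80 mel bands, 32 generator channels), the frame count formula, and the assumption that the spectral preprocessing, UNet, and masking stages preserve shape are illustrative choices, not figures taken from the text. The function name is hypothetical.

```python
def trace_shapes(num_samples, hop=256, n_mels=80, gen_channels=32):
    """Return a dict of (channels, length) shapes for each stage of the method.

    Assumptions of this sketch: one mel frame per `hop` samples, and every
    stage not described as changing resolution keeps its input shape.
    """
    if num_samples % hop != 0:
        raise ValueError("num_samples must be a multiple of hop in this sketch")
    frames = num_samples // hop
    shapes = {}
    shapes["waveform"] = (1, num_samples)
    shapes["mel_spectrogram"] = (n_mels, frames)
    # Spectral preprocessing with 2D convolutional blocks.
    shapes["first_tensor"] = (n_mels, frames)
    # Fully convolutional module raises temporal resolution to waveform length.
    shapes["second_tensor"] = (gen_channels, num_samples)
    # Concatenation with the input waveform adds one channel.
    shapes["third_tensor"] = (gen_channels + 1, num_samples)
    # Time-domain 1D UNet.
    shapes["fourth_tensor"] = (gen_channels + 1, num_samples)
    # Channel-wise STFT, magnitude modification, and return to the time domain.
    shapes["fifth_tensor"] = (gen_channels + 1, num_samples)
    # Final 1D convolution collapses the channels into a single waveform.
    shapes["output_waveform"] = (1, num_samples)
    return shapes


shapes = trace_shapes(8192)
```

The one property the text does fix is that the second tensor's sequences have the same length as the input waveform, which is what makes the concatenation step possible.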

A system for processing an audio waveform based on a GAN generator is proposed, comprising the following modules:

a spectral preprocessing module (SpectralUNet) configured to:

- receive an input audio signal converted into a mel-spectrogram by a short-time Fourier transform (STFT) operation,

- apply two-dimensional UNet convolutional blocks to the mel-spectrogram to denoise it and restore high frequencies;

a fully convolutional neural network module (HiFi generator) configured to convert the output of SpectralUNet into the waveform domain;

a time-domain one-dimensional convolutional UNet neural module (WaveUNet) configured to correct the resulting waveform in the time domain;

a trainable spectral masking module (SpectralMaskNet) configured to correct the output of WaveUNet in the frequency domain to remove artifacts and noise, wherein the output of SpectralMaskNet, processed by a one-dimensional convolutional layer, is the corrected audio waveform.

WaveUNet takes as input the output of the HiFi generator concatenated with the input audio waveform. The system further comprises at least three identical fully convolutional discriminators configured to train the system for bandwidth extension of the audio waveform, for speech denoising of the audio waveform, or for both speech denoising and bandwidth extension. The discriminators are SSD discriminators with a reduced number of weights.

A method for processing an audio waveform using the system described above is proposed. The method is carried out on a computing device and comprises the steps of:

receiving an audio waveform from a signal source;

processing the audio waveform by STFT operations to obtain a mel-spectrogram;

applying, by SpectralUNet, two-dimensional UNet convolutional blocks to the mel-spectrogram to denoise it and restore high frequencies;

converting, by the HiFi generator, the output of SpectralUNet into the waveform domain;

concatenating the output of the HiFi generator with the audio waveform;

correcting, by WaveUNet, the resulting waveform in the time domain;

correcting, by SpectralMaskNet, the waveform in the frequency domain to remove artifacts and noise;

processing the output of SpectralMaskNet with a one-dimensional convolutional layer;

outputting the corrected audio waveform.
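The mel-spectrogram used in the steps above is obtained by pooling STFT magnitudes with a bank of filters spaced on the mel scale. The sketch below shows one common way to build such a bank. The conversion formula (the widely used 2595 · log10(1 + f/700) variant), the triangular filter shape, and the function names are assumptions of this sketch; the text does not specify which mel scale or filter design is used.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters of shape n_mels x (n_fft // 2 + 1), evenly spaced in mels."""
    n_bins = n_fft // 2 + 1
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    mel_max = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bank = []
    for m in range(n_mels):
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (center - lo)
            falling = (hi - f) / (hi - center)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def mel_spectrum(magnitudes, bank):
    """Project one frame of STFT magnitudes onto the mel filters."""
    return [sum(w * a for w, a in zip(row, magnitudes)) for row in bank]


bank = mel_filterbank(n_mels=8, n_fft=64, sample_rate=16000)
frame = mel_spectrum([1.0] * 33, bank)
```

Applying `mel_spectrum` to every STFT frame yields a matrix with one row per mel band and one column per frame.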

The output of SpectralUNet is a first tensor. The conversion step performed by the HiFi generator increases the temporal resolution of the processed first tensor, and the output of the HiFi generator is a second tensor containing several one-dimensional sequences whose length matches the length of said audio waveform. The tensor obtained by concatenation is a third tensor containing the concatenated one-dimensional sequences. The WaveUNet correction step comprises processing the third tensor with a time-domain one-dimensional convolutional UNet neural architecture that applies one-dimensional convolutions at several resolutions of the third tensor along the time dimension; the output of this architecture is a fourth tensor consisting of one-dimensional sequences. The SpectralMaskNet correction step comprises processing the fourth tensor with a trainable spectral masking module that applies a channel-wise short-time Fourier transform (STFT) to the fourth tensor and modifies the magnitudes of the STFT coefficients; the output of the trainable spectral masking module is a fifth tensor.

The method further comprises a step of processing the fifth tensor with a one-dimensional convolutional layer, the output of which is the output audio waveform. The method further comprises a training step using a sum of adversarial losses, feature matching losses and mel-spectrogram losses, wherein the adversarial losses and the feature matching losses are computed by at least three identical fully convolutional discriminators.
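The summed training objective can be sketched as follows. This is a minimal illustration only: the use of a mean absolute difference for the feature matching and mel-spectrogram terms, the weights `w_fm` and `w_mel`, and all numeric values are assumptions that are not taken from the text.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def generator_loss(adv_losses, real_feats, fake_feats, mel_real, mel_fake,
                   w_fm=1.0, w_mel=1.0):
    """Sum adversarial, feature matching and mel-spectrogram terms.

    adv_losses: one adversarial loss value per discriminator.
    real_feats / fake_feats: one flattened feature list per discriminator.
    w_fm, w_mel: hypothetical weights, not taken from the source.
    """
    if len(adv_losses) < 3:
        raise ValueError("the text calls for at least three discriminators")
    adv = sum(adv_losses)
    fm = sum(l1_distance(r, f) for r, f in zip(real_feats, fake_feats))
    mel = l1_distance(mel_real, mel_fake)
    return adv + w_fm * fm + w_mel * mel

# Hypothetical values for three identical discriminators.
total = generator_loss(
    adv_losses=[0.5, 0.25, 0.25],
    real_feats=[[1.0, 2.0], [0.0, 0.0], [1.0, 1.0]],
    fake_feats=[[1.0, 1.0], [0.0, 2.0], [1.0, 1.0]],
    mel_real=[0.0, 1.0],
    mel_fake=[0.5, 1.0],
)
print(total)
```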

DRAWINGS

The above and/or other aspects will be better understood from the following description of illustrative embodiments with reference to the accompanying drawings, in which:

Fig. 1 shows the HiFi++ architecture and the training pipeline.

Fig. 2 shows the architecture of the SpectralUNet module.

Fig. 3 shows the HiFi architecture.

Fig. 4 shows the blocks of the HiFi generator.

Fig. 5 shows the architecture of the WaveUNet module.

Fig. 6 shows the training pipeline of the HiFi++ architecture.

DETAILED DESCRIPTION

The proposed method extends the frequency bandwidth of speech recordings through appropriate training of a neural model and thereby improves the user's experience when listening to these recordings.

In addition, the method can be used to suppress noise in speech recorded in a noisy environment by training the neural model differently. The technical effect compared to similar methods is that the method provides a better trade-off between the quality of the generated speech and the size of the model, and has lower computational complexity. Compared to similar methods, the proposed invention provides higher quality, as measured by the ratings of human annotators for speech enhancement and bandwidth extension, and has fewer parameters, which reduces memory requirements.

The proposed invention takes as input a noisy or bandwidth-reduced audio waveform (an audio waveform is a long vector of real numbers representing the amplitudes (loudness) of the audio signal over a short period of time) and outputs a clean, high-quality audio waveform without noise.

The following terms are used in the application:

1. The short-time Fourier transform (STFT) operation is a sequence of Fourier transforms of a windowed signal. The STFT outputs time-localized frequency information for situations in which the frequency components of a signal change over time. The STFT is widely used in speech processing because speech signals usually have a harmonic structure.

2. A mel spectrogram is a magnitude STFT spectrogram converted to the mel frequency scale, which is defined as a perceptual scale of pitches that listeners judge to be equally spaced from one another. A mel spectrogram typically has a lower dimensionality along the frequency axis than the input spectrogram and has proven useful as an intermediate representation for audio conversion systems.

3. Generative adversarial networks (GANs) are a widely used type of neural generative model. GANs consist of generator and discriminator neural networks that compete with each other. The generator network learns a mapping from the source domain to the target domain, whereas the discriminator learns to distinguish real objects from generated ones in the target domain. In this way, the discriminator pushes the generator to produce samples that are indistinguishable from real ones.
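The mel scale mentioned in term 2 can be illustrated with one widely used conversion formula, m = 2595 log10(1 + f / 700). The text does not state which variant of the mel scale is used, so the formula, the helper names and the parameter values below (frequency range, number of bands) are assumptions chosen for illustration.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (one common convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_mels):
    """Return n_mels + 2 band edges in Hz, equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_mels + 2)]

# Hypothetical settings: 128 bands between 0 Hz and 8 kHz.
edges = mel_band_edges(0.0, 8000.0, 128)
print(len(edges), round(edges[1] - edges[0], 2), round(edges[-1] - edges[-2], 2))
```

Because the edges are equally spaced in mels, the bands are narrow at low frequencies and wide at high frequencies, which is why the mel spectrogram has fewer frequency bins than the STFT spectrogram it is computed from.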

The well-known HiFi generator (Kong et al., 2020) was recently proposed as a computationally efficient, fully convolutional network that performs neural vocoding of speech signals with quality comparable to its autoregressive counterpart while being several orders of magnitude faster. A key part of this architecture is the multi-receptive field fusion (MRF) module (Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33, 17022-17033), which makes it possible to model diverse receptive field patterns. By adjusting the parameters of the HiFi architecture, a good trade-off between computational efficiency and the quality of the model's samples can be achieved.

In the present invention, the HiFi model (Kong et al., 2020) is adapted to the tasks of bandwidth extension and speech enhancement (noise suppression in the speech signal) by designing a new generator and optimizing the neural discriminator networks.

Kong et al. (2020) and You et al. (2021) argue that the success of the HiFi model is largely due to its multi-resolution discrimination framework. The present invention demonstrates that this framework can be simplified considerably, to several identical discriminators that operate at the same resolution while still providing comparable quality. Thus, the success of multi-resolution discriminators is mainly attributable to the effect of generative multi-adversarial networks (Durugkar et al., 2016), i.e. the use of several discriminators during adversarial training. Besides the conceptual simplification of the discrimination framework, the number of discriminator parameters and their computational complexity are reduced, which makes training faster.

The key contribution of the proposed invention is a new generator architecture, HiFi++, which makes it possible to adapt the HiFi-GAN generator efficiently to bandwidth extension and speech enhancement of audio recordings. The proposed architecture is based on the well-known HiFi generator with new modules added. In particular, spectral preprocessing (SpectralUnet), a convolutional encoder-decoder network (WaveUNet) and trainable spectral masking (SpectralMaskNet) modules are introduced into the generator architecture. Equipped with these modifications, the proposed generator can be successfully applied to bandwidth extension and speech enhancement. The model is significantly more lightweight than the tested counterparts while providing higher quality.

The present invention makes the following contributions:

1. A system for audio signal processing based on the HiFi++ generator architecture is proposed, obtained by introducing three additional modules into the HiFi-GAN generator: the SpectralUnet, WaveUNet and SpectralMaskNet subnetworks. This new generator architecture makes it possible to build a unified framework for bandwidth extension and speech enhancement that delivers state-of-the-art results in these areas.

2. The importance of the multi-resolution discrimination framework for conditional waveform generation is described, and new discriminators are proposed that are lightweight, simple and fast while still providing quality comparable to the original HiFi discriminators.

Fig. 1 shows the HiFi++ architecture and, after the “output audio signal” block, the training pipeline. Compared to the HiFi generator, the HiFi++ architecture additionally has the following modules (subnetworks): SpectralUNet, WaveUNet and SpectralMaskNet. The HiFi++ generator is built around the HiFi part, which takes as input a mel-spectrogram representation enriched by SpectralUnet and whose output passes through post-processing modules: WaveUNet corrects the output waveform in the time domain, whereas SpectralMaskNet cleans it in the frequency domain.

A specific neural network architecture is disclosed in which the hyperparameters of the neural modules are set to specific values; however, it will be apparent to those skilled in the art that the principle of the invention is not limited to a particular choice of hyperparameters.

The proposed system is installed, and the proposed method runs, on any suitable electronic computing device having internal memory in which the audio waveform is stored. The audio waveform may be retrieved by the user from the device memory, from the Internet, from an audio recording currently being made by the user, or from another suitable source. The input audio waveform (a vector of real numbers representing the amplitudes of the audio signal over a short period of time) is processed by the short-time Fourier transform operation (“STFT and mel scale” in Figs. 1 and 6) to obtain a mel spectrogram, which is the input mel spectrogram of the HiFi++ generator.
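The STFT stage of this preprocessing can be sketched in a few lines. The example below is a deliberately naive illustration (a Hann window followed by a direct DFT per frame) with toy sizes; the window type, the function name `stft` and the test signal are assumptions, and the subsequent projection onto the mel scale is omitted.

```python
import cmath
import math

def stft(signal, win_size, hop_size):
    """Naive STFT: Hann-windowed frames followed by a direct DFT.

    Returns a list of frames; each frame holds win_size // 2 + 1 complex bins.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_size)
              for n in range(win_size)]
    frames = []
    for start in range(0, len(signal) - win_size + 1, hop_size):
        chunk = [signal[start + n] * window[n] for n in range(win_size)]
        bins = []
        for k in range(win_size // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / win_size)
                      for n in range(win_size))
            bins.append(acc)
        frames.append(bins)
    return frames

# Hypothetical test signal: a sinusoid with exactly 4 cycles per 32-sample window.
sig = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
spec = stft(sig, win_size=32, hop_size=8)
magnitudes = [[abs(c) for c in frame] for frame in spec]
peak_bin = max(range(len(magnitudes[0])), key=lambda k: magnitudes[0][k])
print(len(spec), len(spec[0]), peak_bin)
```

Each frame covers a short stretch of the waveform, so the time axis of the resulting spectrogram is shorter than the waveform by roughly the hop size; a practical implementation would use an FFT instead of the direct DFT shown here.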

The SpectralUNet module is introduced as the initial part of the HiFi++ generator; it takes the input mel spectrogram (see Fig. 1), which is a tensor. The mel spectrogram has a two-dimensional structure, and the two-dimensional convolutional blocks of the SpectralUnet model are intended to make this structure easier to handle at the initial stage of mel-spectrogram conversion. The idea is to simplify the task for the remaining part of the HiFi++ generator, which has to convert this 2d representation into a 1d sequence. The SpectralUNet module is designed as a UNet-like architecture with 2d convolutions.

Fig. 2 shows the architecture of the SpectralUNet module described below.

It has been found experimentally that, for the present invention, the SpectralUNet module can advantageously be included as a preprocessing part that prepares the input mel spectrogram by correcting it and extracting from it the information needed for bandwidth extension and speech enhancement. For example, SpectralUNet can implicitly extract a clean mel spectrogram from a noisy one in the case of speech enhancement, or restore high frequencies in the case of bandwidth extension. SpectralUNet operates in the spectral domain. The output of the spectral preprocessing module is the first tensor, which is fed to the input of a fully convolutional neural network module, the HiFi generator.

The HiFi generator module is a fully convolutional neural network module shown in detail in Fig. 3. It processes and outputs data in the waveform (time) domain. It processes the output of the SpectralUNet module with a sequence of transposed convolutions, each followed by a multi-receptive field fusion (MRF) module (see Fig. 4). Each transposed convolution increases the temporal resolution of the processed tensor by the factors (strides) indicated in Fig. 3.

The mel spectrogram has a lower temporal resolution than the waveform. This resolution is controlled by the mel-spectrogram parameters: for example, a hop size of 256, a window size of 1024 and 128 mel frequency bins can be used, which corresponds to a temporal resolution 256 times lower than that of the waveform. The number of transposed convolutions and their strides should therefore be chosen so that the resolution of the tensor produced by the HiFi generator equals the resolution of the waveform.
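The constraint on the strides can be checked with simple arithmetic: the product of the upsampling factors has to equal the hop size. The sketch below assumes that each transposed convolution multiplies the length of the time axis exactly by its stride; the stride lists are hypothetical examples and are not the values shown in Fig. 3.

```python
from math import prod

def output_length(n_frames, strides):
    """Length of the time axis after applying the upsampling layers in sequence."""
    return n_frames * prod(strides)

def strides_match_hop(strides, hop_size):
    """True if the combined upsampling factor restores the waveform resolution."""
    return prod(strides) == hop_size

hop_size = 256
for strides in ([8, 8, 2, 2], [8, 8, 4], [8, 8, 2]):  # hypothetical choices
    print(strides, strides_match_hop(strides, hop_size))

# 100 mel frames upsampled back to 100 * 256 waveform samples.
print(output_length(100, [8, 8, 2, 2]))
```

Several factorizations of 256 satisfy the constraint, so the choice among them can be driven by the parameter count and computational cost.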

The number of transposed convolutions also affects the number of parameters and the computational complexity of the HiFi++ generator. The proposed invention is not limited to a specific number of transposed convolutions or to specific strides.

In Fig. 3, Conv1d and ConvTranspose1d denote, by way of example, a one-dimensional convolution and a one-dimensional transposed convolution, respectively, with the kernel sizes, strides and dilation rates (standard parameters) indicated in the figure; LeakyReLU is a standard activation function (nonlinearity). It will be apparent to those skilled in the art that the principle of the invention is not limited to a particular choice of kernel size, stride and dilation rates.

Fig. 4 shows an example diagram of the multi-receptive field fusion (MRF) module. The module consists of several convolutional residual blocks (ResBlock) with different kernel sizes and dilation rates in order to model diverse receptive field patterns. The number of ResBlocks affects the model size: the more ResBlocks, the larger the model, the higher the speech quality and the wider the frequency bandwidth of the model. A trade-off between model size and quality is therefore needed. In Fig. 4 the number of ResBlocks is set to 3, with the kernel sizes and dilation rates indicated in Fig. 3 by way of example; the principle of the invention is not limited to a particular kernel size, stride or dilation rates. The structure of a single ResBlock is also shown in Fig. 4.
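How different kernel sizes and dilation rates lead to different receptive fields can be made concrete with the standard formula for a stack of stride-1 dilated convolutions, RF = 1 + sum((k - 1) * d). The kernel sizes and dilation rates below are illustrative assumptions; they are not read from Figs. 3 and 4.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked stride-1 dilated 1d convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Hypothetical configuration: three ResBlocks, one kernel size each.
configs = {3: [1, 3, 5], 7: [1, 3, 5], 11: [1, 3, 5]}
fields = {k: receptive_field(k, d) for k, d in configs.items()}
print(fields)
```

Under these assumed values the three blocks see contexts of clearly different length, which is the sense in which the module covers diverse receptive field patterns.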

A ResBlock consists of a one-dimensional convolution (Conv1d) and a LeakyReLU activation function (the concept of residual blocks was introduced in He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778)). The input of a ResBlock is the intermediate tensors obtained at the preceding processing stages. The transposed convolutions (ConvTranspose1d in Fig. 3) of the HiFi generator increase the temporal resolution of the input tensor, so that the resulting second tensor at the output of the HiFi module as a whole has dimension 8xT, i.e. 8 sequences of real numbers whose length equals the length of the input waveform. These 8 sequences form an intermediate tensor. Note that the number of conv1d blocks does not affect the dimension of this tensor; the dimension is determined by the number of output channels of the last convolution. The purpose of this module is to convert the representations obtained from SpectralUNet into the waveform domain. The structure of each ResBlock shown in Fig. 4 is well known for neural networks (see, for example, He et al., 2016).

All values in Fig. 4 are given by way of example, and it will be apparent to those skilled in the art that the principle of the invention is not limited to the specific examples.

Next, the resulting second tensor is concatenated with the initial audio waveform by the concatenation operator (the dashed arrow from “audio signal” to the arrow between “Hi-Fi generator” and “WaveUNet” in Fig. 1), giving a third tensor, which is then processed by the WaveUNet model, i.e. a one-dimensional convolutional Unet neural architecture in the time domain. This architecture, shown in the lower part of Fig. 1 as a UNet architecture with 1d convolutions, takes as input the output of the HiFi generator submodule concatenated with the initial audio waveform, in order to combine the information extracted from the mel spectrogram with the information contained in the initial audio waveform, since the mel spectrogram does not contain complete information about the initial audio waveform. The input third tensor is a tensor containing several one-dimensional sequences, each of the same length as the initial audio waveform. The WaveUNet model processes the input third tensor at different resolutions, which is made possible by the multiscale structure of the WaveUNet model. Each level of the multiscale structure is built from a downsampling block and a corresponding upsampling block, which process information at a specific resolution; together, all pairs of corresponding downsampling and upsampling blocks form the multiscale structure. The output of the multiscale structure is the processed fourth tensor. Processing a tensor with downsampling and upsampling blocks is a known standard procedure used for various tasks in the prior art.
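The concatenation step can be sketched with plain lists standing in for tensors. The sketch assumes the 8xT second tensor mentioned earlier in the description and a single-channel waveform, so that the third tensor has 9 channels; the channel count of 9 is an inference from those figures, and the function name `concat_channels` is hypothetical.

```python
def concat_channels(second_tensor, waveform):
    """Append the waveform as an extra channel; all channels must share length T."""
    t = len(waveform)
    if any(len(ch) != t for ch in second_tensor):
        raise ValueError("every channel must have the same length as the waveform")
    return [list(ch) for ch in second_tensor] + [list(waveform)]

T = 16
second = [[0.0] * T for _ in range(8)]   # stand-in for the HiFi generator output (8 x T)
wave = [0.1 * n for n in range(T)]       # stand-in for the initial audio waveform (1 x T)
third = concat_channels(second, wave)
print(len(third), len(third[0]))
```

The length check reflects the requirement that the sequences produced by the HiFi generator have the same length as the initial waveform; without it the channels could not be stacked.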

Fig. 5 contains a diagram of the WaveUNet neural network module of the HiFi++ generator.

The WaveUNet module is placed after the HiFi part and operates directly in the time domain. It can be regarded as a time-domain post-processing mechanism that improves the output of the HiFi part by processing it together with the input waveform (both provided in the form of the third tensor). The WaveUNet module is an instance of the well-known WaveUNet architecture (Stoller et al., 2018), which is a fully convolutional 1D-UNet-like neural network. This module outputs the fourth, two-dimensional tensor.

Note that the proposed invention uses standard fully convolutional multiscale encoder-decoder architectures for the WaveUNet and SpectralUNet networks (the UNet concept was introduced in Ronneberger O., Fischer P., Brox T. U-net: Convolutional networks for biomedical image segmentation // International Conference on Medical image computing and computer-assisted intervention. - Springer, Cham, 2015. - pp. 234-241). The architectures of these networks are shown in Fig. 5 and Fig. 2, respectively.

According to Fig. 5, each downsampling block of the WaveUNet model downsamples the third tensor along the time dimension (for example, by a factor of 4). Similarly, according to Fig. 2, each downsampling block in SpectralUNet downsamples the mel-spectrogram along both the time and frequency dimensions (for example, by a factor of 2 in each dimension). The width values W1, W2, W3, W4 and the block depth parameters determine the number of parameters and the computational complexity of the resulting networks.
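As a rough illustration of the downsampling factors mentioned above, the following Python sketch computes the resolutions seen at successive levels of each network. The number of levels (3), the input length (16384 samples) and the mel-spectrogram size (80×64) are assumptions chosen for illustration; they are not specified in this description.

```python
def waveunet_lengths(num_samples, levels, factor=4):
    """Sequence length at each level when every block downsamples time by `factor`."""
    lengths = [num_samples]
    for _ in range(levels):
        lengths.append(lengths[-1] // factor)
    return lengths


def spectralunet_shapes(n_mels, n_frames, levels, factor=2):
    """(frequency, time) size at each level when both axes shrink by `factor`."""
    shapes = [(n_mels, n_frames)]
    for _ in range(levels):
        freq, time = shapes[-1]
        shapes.append((freq // factor, time // factor))
    return shapes


# Illustrative sizes only (assumed, not taken from the description).
print(waveunet_lengths(16384, levels=3))      # [16384, 4096, 1024, 256]
print(spectralunet_shapes(80, 64, levels=3))  # [(80, 64), (40, 32), (20, 16), (10, 8)]
```

Because WaveUNet shrinks only the time axis while SpectralUNet shrinks both axes, the cost per level falls by roughly 4 in both cases under these example factors.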

The input to the WaveUNet module is the third tensor, which contains the one-dimensional sequences obtained by concatenating the second tensor with the audio waveform. As mentioned above, the output (second tensor) of the HiFi generator module (a fully convolutional neural network module) is concatenated with the initial audio waveform in order to combine the information extracted from the mel-spectrogram with the information contained in the audio waveform, since the mel-spectrogram does not contain complete information about the input signal. Creating and processing the mel-spectrogram makes it possible to use spectral-domain information when processing the audio signal. The concatenation, in turn, combines the information extracted in the spectral domain with the raw waveform, which allows the model to process the audio signal jointly in the time and frequency domains within the WaveUNet module.

The third tensor contains 9 one-dimensional sequences, each of the same length as the initial audio waveform. The third tensor is processed by a sequence of convolutional residual blocks. The WaveUNet module processes the input third tensor at different resolutions, which is facilitated by the multiscale structure of the WaveUNet neural network: convolutional downsampling blocks are used together with convolutional upsampling blocks and concatenation skip connections (this is the standard WaveUNet neural architecture). Each convolutional block has a width, expressed as the number of channels in the convolutional layers that make up the block. The block widths W1, W2, W3, W4 shown in Fig. 4 are 10, 20, 40, 80, respectively. These numbers are architecture parameters of the WaveUNet model: they specify the number of channels used in the convolutional layers of the blocks. Those skilled in the art will appreciate that the principle of the invention is not limited to a particular choice of block widths; the widths may be varied, which affects the size of the model. The output of the one-dimensional time-domain convolutional UNet architecture (the WaveUNet neural network) is the fourth tensor, which consists of 1d sequences.
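The channel concatenation that yields the third tensor can be sketched as follows. The sketch assumes that the second tensor has 8 channels, which is consistent with the 9 sequences stated above once the single-channel waveform is appended; the helper name `concat_channels` and the channel order are hypothetical.

```python
def concat_channels(*tensors):
    """Concatenate channel-major tensors (lists of equal-length sequences) along the channel axis."""
    length = len(tensors[0][0])
    out = []
    for tensor in tensors:
        for seq in tensor:
            if len(seq) != length:
                raise ValueError("all sequences must share the waveform length")
            out.append(list(seq))
    return out


T = 6                                            # toy waveform length
waveform = [[0.1 * t for t in range(T)]]         # 1 x T input audio
second_tensor = [[0.0] * T for _ in range(8)]    # assumed 8 x T HiFi generator output
third_tensor = concat_channels(second_tensor, waveform)
print(len(third_tensor), len(third_tensor[0]))   # 9 6
```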

The SpectralMaskNet module (a learnable spectral masking module), shown at the bottom of Fig. 1, takes as input the output of the WaveUNet module, i.e. of the UNet architecture with 1d convolutions. This input is the fourth tensor, a tensor of shape 8×T, where T is the length of the input waveform. SpectralMaskNet applies a channel-wise short-time Fourier transform (STFT and mel scale) to this tensor, i.e. it computes a complex-valued STFT spectrogram (a tensor of shape 512×(T/256)×2) for each of the 8 sequences that make up the tensor, independently, so that the resulting tensor has shape 8×512×(T/256)×2. Each complex-valued spectrogram can be decomposed into an amplitude spectrogram and a phase spectrogram: the amplitude spectrogram (“amplitudes” in Fig. 1) consists of the absolute values of the complex numbers in the complex-valued spectrogram, and the phase spectrogram (“phases” in Fig. 1) consists of their arguments. The SpectralUNet neural network then takes the amplitude spectrograms (a tensor of shape 8×512×(T/256)) and predicts multiplicative factors for the magnitudes. The multiplicative factors form a tensor of positive real numbers with the same shape as the amplitude spectrogram (8×512×(T/256)). Their positivity is guaranteed by the Softplus activation function (a well-known function) applied to the SpectralUNet output. The predicted factors are used to correct the complex-valued spectrogram by multiplying each complex value by the corresponding real-valued factor. The final part (“inverse channel-wise STFT” in Fig. 1) applies the inverse short-time Fourier transform to each of the 8 complex-valued spectrograms, producing 8 one-dimensional sequences, so that the output tensor has shape 8×T, where T is the length of the input (and output) waveform. This processing is equivalent to applying a multiplicative correction to the magnitude of the signal while leaving the phase unchanged. The purpose of this module is to post-process the signal in the frequency domain: it is an efficient, learned mechanism for removing frequency-domain artifacts and noise from the output waveform. The output of SpectralMaskNet, which is the fifth tensor of shape 8×T, is processed by a one-dimensional convolutional layer (not shown in Fig. 1) to form the output audio waveform (a sequence of real numbers of length T).
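The magnitude-only nature of this correction can be checked numerically. The minimal sketch below, in plain Python, multiplies a few complex spectrogram bins by Softplus-activated values and verifies that phases are preserved while magnitudes are scaled. The raw values stand in for SpectralUNet outputs and the helper names are hypothetical; the STFT and its inverse are omitted.

```python
import cmath
import math


def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def apply_mask(spectrogram, raw_outputs):
    """Scale each complex bin by softplus(raw); inputs are nested lists [freq][frame]."""
    return [[softplus(r) * z for z, r in zip(z_row, r_row)]
            for z_row, r_row in zip(spectrogram, raw_outputs)]


spec = [[3 + 4j, -1 + 1j], [0.5 - 2j, -2 - 2j]]   # toy complex spectrogram
raw = [[-1.0, 0.0], [2.0, 0.5]]                   # stand-in for network outputs
masked = apply_mask(spec, raw)

for z_row, m_row, r_row in zip(spec, masked, raw):
    for z, m, r in zip(z_row, m_row, r_row):
        # A positive real factor rescales the magnitude and leaves the argument alone.
        assert math.isclose(cmath.phase(m), cmath.phase(z), abs_tol=1e-12)
        assert math.isclose(abs(m), softplus(r) * abs(z))
```

Since Softplus is positive for these inputs, no bin is sign-flipped, which is why the phase spectrogram can be reused unchanged in the inverse transform.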

Fig. 2 shows a diagram of the SpectralUNet neural network module of the HiFi++ generator. Neural networks with this architecture are used both at the initial stage of converting the mel-spectrogram into a waveform, in the SpectralUNet module, and as part of the SpectralMaskNet module. The structure is the same in both cases, but the parameters of the convolutional layers differ and are learned independently. The SpectralUNet submodule processes the input tensor at different resolutions with two-dimensional convolutional layers, which is facilitated by the multiscale structure of the SpectralUNet neural network: convolutional downsampling blocks are used together with convolutional upsampling blocks and concatenation skip connections (this is the standard UNet neural architecture). Each convolutional block has a width, expressed as the number of channels in the convolutional layers that make up the block. The block widths W1, W2, W3, W4 indicated in the figure are, for example, 8, 12, 24, 32, respectively. These numbers are architecture parameters of the SpectralUNet model: they specify the number of channels used in the convolutional layers of the blocks. Those skilled in the art will appreciate that the principle of the invention is not limited to specific block widths.
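To illustrate how block width drives model size, the sketch below counts the weights and biases of a single 2D convolutional layer for each of the example widths. It assumes 3×3 kernels (the kernel size this description gives for SpectralUNet), a bias term, and equal numbers of input and output channels; the actual layer layout inside each block may differ.

```python
def conv_params(c_in, c_out, kernel_elems):
    """Number of weights plus biases in one convolutional layer."""
    return c_in * c_out * kernel_elems + c_out


spectral_widths = [8, 12, 24, 32]   # example widths W1..W4 from the description
per_layer = {w: conv_params(w, w, 3 * 3) for w in spectral_widths}
print(per_layer)  # {8: 584, 12: 1308, 24: 5208, 32: 9248}
```

The count grows roughly with the square of the width, which is why changing W1..W4 is an effective way to trade model size against quality.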

It should be noted that the WaveUNet and SpectralUNet blocks have the same architectural structure, except that SpectralUNet uses 2d convolutions with a kernel size of 3×3 instead of the 1d convolutions with a kernel size of 5 used in WaveUNet. The block depth (the number of residual blocks) is 4 for WaveUNet. Those skilled in the art will appreciate that the principle of the invention is not limited to these specific sizes. The architectural structure of each of the WaveUNet and SpectralUNet blocks consists of:

- a convolutional residual block with a one-dimensional convolution and the LeakyReLU activation function;

- a convolutional downsampling block, which consists of several convolutional residual blocks and a strided one-dimensional convolution, with a U-Net skip connection;

- a convolutional upsampling block, which consists of several convolutional residual blocks, nearest-neighbor interpolation (UpsampleNearestInterpolation) and a U-Net skip connection. UpsampleNearestInterpolation increases the resolution of a tensor by repeating neighboring values. Conv1d denotes a one-dimensional convolution with kernel size denoted as ker, padding size denoted as pad, and the given numbers of input and output channels; LeakyReLU is a standard activation function (nonlinearity). Concat denotes concatenation along the channel dimension;

- convolutional bottleneck blocks, which consist of several convolutional residual blocks.
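The building blocks listed above can be sketched in plain Python for a single channel. This is a simplified illustration only: the LeakyReLU slope (0.1), the order of activation and convolution inside the residual block, and the helper names are assumptions, and real blocks operate on multi-channel tensors with learned kernels.

```python
def leaky_relu(x, slope=0.1):
    """LeakyReLU nonlinearity; the slope value is an assumption."""
    return x if x >= 0 else slope * x


def conv1d_same(seq, kernel):
    """Single-channel 1d convolution with zero padding pad = ker // 2 (odd ker keeps the length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(seq))]


def residual_block(seq, kernel):
    """y = x + Conv1d(LeakyReLU(x)); the ordering inside the block is assumed."""
    activated = [leaky_relu(v) for v in seq]
    return [x + c for x, c in zip(seq, conv1d_same(activated, kernel))]


def upsample_nearest(seq, factor):
    """Nearest-neighbor upsampling: repeat every value `factor` times."""
    return [v for v in seq for _ in range(factor)]


coarse = [1, 2, 3]
skip = [9, 9, 9, 9, 9, 9]                 # features saved by the matching downsampling block
upsampled = upsample_nearest(coarse, 2)   # [1, 1, 2, 2, 3, 3]
concat = [upsampled, skip]                # Concat: stack along the channel dimension
print(upsampled, len(concat))
```

A kernel of size 5 with padding 2, as used for the 1d convolutions in WaveUNet, keeps the sequence length unchanged, so the residual addition is well defined.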

In short, as mentioned above with reference to Fig. 1, the mel-spectrogram of the original signal is first processed by the SpectralUNet model using 2D convolutional blocks. The signal is then processed by the multi-receptive field fusion blocks of the original HiFi-GAN generator. After that, the processed signal is concatenated with the initial (input) audio waveform by the concatenation operation. The resulting tensor is fed to the WaveUNet block. The output of the WaveUNet model is processed by the SpectralMaskNet module. Finally, the output waveform is formed by a one-dimensional convolutional layer.

HiFi++ is trained with adversarial losses, feature matching losses and mel-spectrogram losses, which drive the HiFi++ generator to produce high-quality audio signals. Note that bandwidth extension and speech enhancement (removal of artifacts and noise from the audio signal) are achieved because HiFi++ has been trained for these tasks, so that its tensor transformations remove noise and artifacts.

Fig. 6 shows the training pipeline of the HiFi++ architecture. The output audio signal produced by the HiFi++ generator during training is processed by the discriminator neural networks together with the reference audio signals (the reference set). The realism scores that the discriminator networks assign to the output audio signals of the HiFi++ architecture and to the reference audio signals are used to compute the least-squares adversarial loss (LS-GAN, Equation 2). The intermediate features are used to compute the feature matching loss given in Equation 3 below. The mel-spectrograms are used to compute the mel-spectrogram loss (Equation 4). All loss values are numbers that are combined by weighted summation with multiplicative coefficients, as shown in Equation 5, to produce the final loss function. Gradients are computed from the resulting loss function and used to iteratively train the HiFi++ generator and the discriminators with the Adam optimizer, which is standard practice for training neural networks.
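The loss terms described above can be sketched as follows. Equations 2 to 5 are not reproduced in this part of the description, so the sketch assumes the standard least-squares GAN, L1 feature matching and L1 mel-spectrogram forms; the weight `lambda_fm = 2` is an assumption, and `lambda_mel = 15` follows the example weight mentioned in this description.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def lsgan_discriminator_loss(real_scores, fake_scores):
    """Push scores of reference audio towards 1 and scores of generated audio towards 0."""
    return (mean((s - 1.0) ** 2 for s in real_scores)
            + mean(s ** 2 for s in fake_scores))


def lsgan_generator_loss(fake_scores):
    """Push discriminator scores of generated audio towards 1."""
    return mean((s - 1.0) ** 2 for s in fake_scores)


def feature_matching_loss(real_layers, fake_layers):
    """Mean absolute difference of intermediate discriminator features, summed over layers."""
    return sum(mean(abs(r - f) for r, f in zip(real, fake))
               for real, fake in zip(real_layers, fake_layers))


def mel_loss(real_mel, fake_mel):
    """Mean absolute difference between flattened mel-spectrograms."""
    return mean(abs(r - f) for r, f in zip(real_mel, fake_mel))


def generator_total_loss(adv, fm, mel, lambda_fm=2.0, lambda_mel=15.0):
    """Weighted sum of the three generator terms (weights are assumptions, see lead-in)."""
    return adv + lambda_fm * fm + lambda_mel * mel


# Toy scores and features, for illustration only.
adv = lsgan_generator_loss([0.5, 1.0])
fm = feature_matching_loss([[1.0, 2.0]], [[1.1, 1.9]])
mel = mel_loss([0.0, 1.0], [0.02, 0.98])
total = generator_total_loss(adv, fm, mel)
```

The discriminators are trained with `lsgan_discriminator_loss`, while the generator minimizes `generator_total_loss`; the two are typically updated in alternation.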

Each output audio signal produced by HiFi++ upon completion of the training procedure should match the reference audio signal (i.e. the desired output signal, free of noise and/or with an extended frequency band).

If a dataset that contains both noise and a reduced frequency band is used for training, then the HiFi++ generator will perform bandwidth extension and noise removal simultaneously during operation. The training process can also be carried out separately for the bandwidth extension task and the speech enhancement task, which yields separate models for bandwidth extension and for speech enhancement. However, those skilled in the art will appreciate that the scope of the invention is not limited to these tasks (i.e. to compensating only these artifacts separately), since the model can be trained to attenuate various artifacts, including a combination of reduced bandwidth and noise in audio signals, by providing different datasets during training.

Thus, the bandwidth extension and speech enhancement tasks are solved because the model learns to solve them during training. The weights of the model are adjusted during training so that the computations performed by the neural network transform the numerical representation of an audio signal containing these artifacts into a numerical representation of an audio signal free of them. The training process is determined by the loss functions described above.

As is known in the art, the HiFi generator is trained adversarially against two types of discriminators: a multi-period discriminator (MPD) and a multi-scale discriminator (MSD). The MPD consists of several sub-discriminators, each of which processes different periodic sub-signals of the input audio signal. The purpose of the MPD discriminators is to identify the various periodic patterns of speech. The MSD also consists of several sub-discriminators, which evaluate the input waveforms at different temporal resolutions; it was proposed in MelGAN (Kumar et al., 2019) to capture consecutive patterns and long-term dependencies. HiFi training uses 5 MPD discriminators and 3 MSD discriminators, which together are almost 5 times larger than the HiFi V1 generator and significantly slow down training. Kong et al. (2020) argue that this complex and costly multi-resolution discrimination framework is one of the key factors behind the high quality of the HiFi model, a claim supported by an ablation study. The discriminator setup of the prior-art HiFi model can be simplified to several identical discriminators that are smaller than the HiFi discriminators and greatly reduce training time while providing comparable quality. First, the ablation study in the HiFi paper (Kong et al., 2020) can be misleading, because it shows that model performance drops sharply without the MPD discriminators. This, however, is a consequence of an unsuitable choice of hyperparameters; the model can achieve comparable quality without MPD discriminators. In addition, it is beneficial to replace the MSD discriminators, which operate at different input resolutions, with identical, much smaller discriminators that process the waveform at a single, initial resolution (SSD discriminators). The architecture of the SSD discriminators is the same as that of the MSD discriminators, except that the number of channels in each layer is reduced by a factor of 4 to lower the computational complexity. The advantage of the multi-resolution HiFi discriminators can therefore be explained to a large extent by the effect, well known in the GAN literature, of generative multi-adversarial networks (Durugkar et al., 2016).
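Two ideas from this paragraph can be illustrated with a short sketch: the periodic sub-signals examined by an MPD sub-discriminator, and the channel reduction by a factor of 4 used for SSD discriminators. Zero padding, the helper names and the channel widths in the example are illustrative assumptions, not values taken from the description.

```python
def periodic_view(waveform, period):
    """Reshape a 1d waveform into rows of `period` samples (zero-padded at the end)."""
    remainder = len(waveform) % period
    padded = list(waveform) + [0.0] * ((period - remainder) % period)
    return [padded[i:i + period] for i in range(0, len(padded), period)]


def periodic_subsignal(waveform, period, phase):
    """Every `period`-th sample starting at offset `phase`, i.e. one column of the view."""
    return [row[phase] for row in periodic_view(waveform, period)]


def ssd_channels(msd_channels, reduction=4):
    """Per-layer channel counts after dividing each by `reduction`."""
    return [c // reduction for c in msd_channels]


wave = list(range(10))
print(periodic_subsignal(wave, period=3, phase=0))  # [0, 3, 6, 9]
print(ssd_channels([64, 128, 256]))                 # [16, 32, 64] (hypothetical widths)
```

Since the cost of a convolutional layer scales with the product of its input and output channels, dividing both by 4 reduces that layer's cost by roughly a factor of 16.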

The essence of the generative multi-adversarial network effect (Durugkar et al., 2016) is that the performance of a generative model can easily be improved by training it against multiple discriminators that share the same architecture but differ in initialization. The more discriminators are used, the higher the sample quality the model can reach, although this effect saturates very quickly as the number of discriminators grows. To reduce training time and computational resources, the proposed invention uses at least 3 identical SSD discriminators to train the HiFi++ model. Fewer discriminators can be used, but experiments have shown that the quality of the output audio signal then decreases. In addition, to improve the perceptual quality of the audio signal, the present invention assigns a lower weight to the mel-spectrogram loss, for example 15 instead of the 45 used in the HiFi-GAN source code. Furthermore, in the present invention spectral normalization is not used in one of the MSD discriminators, and the learning rate of the discriminator is reduced compared with the prior art, which further improves the results. Experiments have shown that the training parameters can be tuned depending on the task at hand. The improvement in results is described in more detail below with reference to Table 1.

It should be noted that, in general, the system according to the invention can be trained with the 5 original MPD discriminators and 3 MSD discriminators mentioned above, without reducing the weights in the neural network. However, the applicant has found that identical discriminators and reduced weights yield a simpler implementation with acceptable output audio quality.

In the present invention, the HiFi++ generator is trained adversarially against at least three discriminators of the same simplified structure. The HiFi++ generator network learns a mapping from a noisy or band-limited input audio signal to high-quality audio, while the discriminators learn to distinguish real high-quality signals from signals produced by the HiFi++ generator. In this way the discriminators push the HiFi++ generator to produce samples that are indistinguishable from high-quality ones. The feedback between the discriminators and the output of the HiFi++ generator is implemented with standard neural network training methods and is not the subject of this patent application.

The term "training loss" is described next. The loss function is the specific criterion by which the model is driven to predict high-quality audio signals: it measures how far the model's predictions are from high-quality audio signals, and the model is trained to minimize it, thereby learning to predict high-quality audio.

a) GAN loss.

Since multi-discriminator training is used, there are k identical discriminators D1, ..., Dk (k=3 in all BWE (bandwidth extension) and SE (speech enhancement, noise removal) experiments). The adversarial training objective is LS-GAN, which provides non-vanishing gradient flows (see, e.g., Mao et al., 2017) compared with the original GAN loss (Goodfellow et al., 2014). The LS-GAN losses for the generator G_θ with parameters θ and the discriminators D_ϕ₁, …, D_ϕk with parameters ϕ₁, …, ϕ_k are given as

where y denotes the reference audio signal (the reference is a clean speech recording, the target that the model should be able to predict from a noisy recording), x=f(y) denotes the input condition, and the transform f can be a mel-spectrogram, a low-pass filter, or the addition of noise.
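The LS-GAN formula itself is not reproduced in this text. As a hedged reconstruction, the standard least-squares GAN objective of Mao et al. (2017) can be written as follows, using θ for the generator parameters and ϕ_i for the parameters of the i-th discriminator. The exact form used by the applicant is an assumption here.

```latex
\begin{aligned}
\mathcal{L}_{GAN}(\phi_i) &= \mathbb{E}_{x,y}\left[ \left(D_{\phi_i}(y) - 1\right)^2 + D_{\phi_i}\left(G_\theta(x)\right)^2 \right], \quad i = 1, \dots, k, \\
\mathcal{L}_{GAN}(\theta) &= \sum_{i=1}^{k} \mathbb{E}_{x}\left[ \left(D_{\phi_i}\left(G_\theta(x)\right) - 1\right)^2 \right].
\end{aligned}
```

In this form each discriminator is pushed to output 1 on reference audio and 0 on generated audio, while the generator is pushed to make every discriminator output 1 on its samples.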

b) Feature matching loss.

The feature matching loss is computed as the L1 distance between the intermediate feature maps (activations) of the discriminators computed for a reference sample and for a conditionally generated one (Larsen et al., 2016; Kumar et al., 2019). It has been successfully applied to speech synthesis (Kumar et al., 2019) to stabilize the adversarial training process. The feature matching loss is computed as

where T denotes the number of layers in a discriminator, and N_j denotes the number of activations in the j-th layer of the i-th discriminator, whose activations are the quantities being compared. E is the mathematical expectation, G is the generator neural network, and D is a discriminator. Activations are the intermediate tensors that arise inside the neural network while it processes an input tensor; N denotes the number of such tensors.
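The feature matching formula is likewise missing from this text. A hedged reconstruction that matches the description (an L1 distance per layer, normalized by N_j and summed over T layers and k discriminators) is given below. The symbol D^j_{ϕ_i} for the layer-j activations of the i-th discriminator is an assumed notation.

```latex
\mathcal{L}_{FM}(\theta) = \mathbb{E}_{x,y}\left[ \sum_{i=1}^{k} \sum_{j=1}^{T} \frac{1}{N_j} \left\| D^{j}_{\phi_i}(y) - D^{j}_{\phi_i}\left(G_\theta(x)\right) \right\|_1 \right]
```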

c) Mel-spectrogram loss.

The mel-spectrogram loss is the L₁ distance between the mel-spectrogram of the waveform synthesized by the generator and the mel-spectrogram of the reference waveform. In other words, the distance between the mel-spectrograms serves as the loss function. It is defined as

where φ is the function that converts a waveform into the corresponding mel-spectrogram.
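The defining formula is not reproduced in this text. Under the description above (an L₁ distance between the mel-spectrograms of the generated and reference waveforms), a hedged reconstruction would be:

```latex
\mathcal{L}_{Mel}(\theta) = \mathbb{E}_{x,y}\left[ \left\| \varphi(y) - \varphi\left(G_\theta(x)\right) \right\|_1 \right]
```

Here G_θ(x) is the waveform produced by the generator from the input condition x and y is the reference waveform; the subscript notation is an assumption.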

d) Final loss. The final losses for the generator and the discriminators are expressed as

where λ_fm = 2 and λ_mel = 45 for the BWE and SE experiments; these values were found to be optimal by grid search.
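The way the loss terms a) to d) could fit together is sketched below in plain Python. This is a minimal illustration rather than the applicant's implementation: it assumes the generator loss is the weighted sum of the GAN, feature matching and mel-spectrogram terms with the weights λ_fm and λ_mel quoted above, and that each discriminator is trained on its own LS-GAN term. All function names are hypothetical, and discriminator outputs and activations are represented as flat lists of floats.

```python
def lsgan_discriminator_loss(real_scores, fake_scores):
    """LS-GAN loss of one discriminator: real scores go to 1, fake scores to 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def lsgan_generator_loss(fake_scores_per_disc):
    """Sum over discriminators of the mean of (D(G(x)) - 1)^2."""
    return sum(
        sum((s - 1.0) ** 2 for s in scores) / len(scores)
        for scores in fake_scores_per_disc
    )


def mean_abs_diff(a, b):
    """L1 distance between two equally long lists, divided by their length."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def feature_matching_loss(real_feats, fake_feats):
    """feats[i][j] holds the flattened activations of layer j in discriminator i."""
    total = 0.0
    for real_layers, fake_layers in zip(real_feats, fake_feats):
        for real_act, fake_act in zip(real_layers, fake_layers):
            total += mean_abs_diff(real_act, fake_act)
    return total


def generator_total_loss(gan, fm, mel, lambda_fm=2.0, lambda_mel=45.0):
    """Weighted sum of the three generator terms (assumed combination)."""
    return gan + lambda_fm * fm + lambda_mel * mel
```

With k = 3 discriminators, `lsgan_generator_loss` would receive three score lists, and `lsgan_discriminator_loss` would be evaluated once per discriminator.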

The above exemplary embodiments are examples and are not to be construed as limiting. In addition, the description of the exemplary embodiments is intended to be illustrative and not to limit the scope of the claims, and those skilled in the art will be able to propose numerous alternatives, modifications, and variations.

Experiments

The public LJ-Speech dataset (Ito & Johnson, 2017) (public domain license), which is standard in the field of speech synthesis, is used. LJ-Speech consists of 13100 audio clips with a total duration of approximately 24 hours. The train-validation split from the HiFi-GAN paper (Kong et al., 2020) is used, with 12950 training clips and 150 validation clips. The audio samples have a sampling rate of 22 kHz.

For the discriminator analysis, a HiFi generator is trained. These experiments motivate the choice of the discriminator set, which is then used to train HiFi++.

Bandwidth extension

The public VCTK dataset (Yamagishi et al., 2019) (CC BY 4.0 license) is used; it includes 44200 speech recordings from 110 speakers. 6 speakers, and 8 recordings from the utterances of each speaker, are excluded from the training set to avoid leakage of text-level and speaker-level data into the training set. For evaluation, 48 utterances from the 6 speakers excluded from the training data are used. Note that the text of the evaluation utterances is not read in any of the recordings that make up the training data.

Splitting the data into training and test sets rules out the possibility that the algorithm merely overfits to (memorizes) the training data. This is a standard procedure needed to test the algorithm on independent data, that is, on data it could not have memorized during training.

Example of bandwidth extension: within the scope of the proposed invention, audio waveforms with bandwidths of 1 kHz, 2 kHz, and 4 kHz are considered as inputs to the model, and the model produces audio waveforms with a bandwidth of 8 kHz, which corresponds to an improvement of up to 3 points in mean opinion score (MOS), as measured through human feedback. However, it will be apparent to those skilled in the art that the principle of the invention is not limited to specific input and output bandwidths.

The model is trained to produce an audio signal with an extended bandwidth. During training, the model is given examples of band-limited input audio signals and the corresponding output audio signals with normal (extended) bandwidth. In this way the model learns to produce audio signals with normal (extended) bandwidth from band-limited audio signals.

Noise suppression in speech signals

The noise suppression experiments use the VCTK-DEMAND dataset (Valentini-Botinhao et al., 2017) (CC BY 4.0 license). The training set (11572 utterances) consists of 28 speakers with 4 signal-to-noise ratio (SNR) values (15, 10, 5, and 0 dB). The test set (824 utterances) consists of 2 speakers with 4 SNR values (17.5, 12.5, 7.5, and 2.5 dB).

Example of speech enhancement: within the scope of the proposed invention, audio waveforms with an average signal-to-noise ratio of 8.4 are considered as inputs to the model, and the model produces audio waveforms with an increased average signal-to-noise ratio of 18.4, which corresponds to an improvement of up to 1 point in mean opinion score (MOS), as measured through human feedback. However, it will be apparent to those skilled in the art that the principle of the invention is not limited to a specific signal-to-noise ratio at the input and output.

The model is trained to produce a denoised audio signal. During training, the model is given examples of noisy input audio signals and the corresponding noise-free output audio signals. In this way the model learns to produce noise-free audio signals from noisy ones.

Evaluation

For objective evaluation of samples in the SE (noise removal) task, the conventional metrics WB-PESQ (Rix et al., 2001), STOI (Taal et al., 2011), and scale-invariant signal-to-distortion ratio (SI-SDR) (Le Roux et al., 2019) are used.
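Of the three metrics, SI-SDR has a short closed-form definition, which the sketch below illustrates in plain Python. It follows the usual definition (project the estimate onto the reference, then compare the energy of the projection with the energy of the residual). The function name and the small constant `eps` are illustrative choices and not part of the source. WB-PESQ and STOI require dedicated implementations and are not sketched here.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB."""
    # Optimal scaling of the reference that best explains the estimate.
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / (ref_energy + eps)
    # Split the estimate into a scaled-reference part and a residual.
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))
```

Because of the projection step, multiplying the estimate by any non-zero gain leaves the value unchanged, which is why the metric does not reward or punish loudness differences.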

For subjective quality evaluation, MOS tests on a 5-point scale are used. All audio clips were normalized so that differences in loudness would not influence the raters. The raters selected were English speakers with proper listening equipment.

The mean opinion score (MOS) of a model is measured with a crowdsourced adaptation of the standard absolute category rating procedure. The following procedure for computing MOS is proposed.

1. Select a subset of 40 random samples from the test set (once for each task, i.e. for bandwidth extension or for speech enhancement).

2. Select the set of models to be evaluated and run inference on the selected subset.

3. Randomly shuffle the predictions and split them into pages of size 20 almost evenly. Almost evenly means that each page contains at least a minimum number of samples from each model.

4. Insert 4 additional trap samples at random positions on each page: 2 reference samples and 2 samples of noise without any speech.

5. Upload the pages to the crowdsourcing platform and set the number of raters for each page to at least 30.

Raters are asked to work with headphones in a quiet environment; they must listen to each recording to the end before rating it.

6. Filter out the results in which the reference samples received any score other than 4 (good) or 5 (excellent), or the samples without speech received any score other than 1 (bad).

7. Randomly split the remaining ratings for each model into 5 groups of almost equal size, and compute their mean and standard deviation.
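Steps 6 and 7 of the procedure above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the applicant's tooling: the function names are hypothetical, a rater's page is represented as a dictionary from sample id to score, and "their mean and standard deviation" is read as the mean and standard deviation of the five group means.

```python
import random
import statistics


def keep_page(ratings, reference_ids, noise_ids):
    """Step 6: keep a rater's page only if every trap sample was rated sensibly."""
    refs_ok = all(ratings[i] in (4, 5) for i in reference_ids)
    noise_ok = all(ratings[i] == 1 for i in noise_ids)
    return refs_ok and noise_ok


def mos_with_spread(model_ratings, n_groups=5, seed=0):
    """Step 7: shuffle, split into near-equal groups, summarise the group means."""
    shuffled = list(model_ratings)
    random.Random(seed).shuffle(shuffled)
    # Strided slicing yields groups whose sizes differ by at most one.
    groups = [shuffled[g::n_groups] for g in range(n_groups)]
    group_means = [statistics.mean(group) for group in groups]
    return statistics.mean(group_means), statistics.stdev(group_means)
```

The function `mos_with_spread` expects at least `n_groups` ratings per model, so that no group is empty.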

Because the models are evenly distributed across the pages, rater bias affects all models equally, so the relative order of the models is preserved. On the one hand, a rater has access to the whole variety of models on a single page and can therefore scale their ratings better. On the other hand, the ratings of the models are not independent of one another in this setting, since raters tend to judge the quality of a sample relative to the average sample on the page: the worse the models included in the comparison, the higher the MOS assigned to the good ones. Using 4 trap samples per page is also a reasonable choice, since it is practically impossible to guess the correct answers to these questions by chance.

A drawback of MOS is that it sometimes requires too many raters per sample to determine reliably which model is better. A possible solution is to use a simplified version of comparison category rating, i.e. a preference test. This test compares two models: the rater is asked to choose the model that produces the better output for the same input. If the rater hears no difference, the option "same" should be chosen.

1. Select a subset of 40 random samples from the test set.

2. Randomly shuffle this set and split it into pages of size 20.

3. Randomly select 10 positions on each page at which the prediction of Model1 will come first.

4. Insert 4 additional trap samples at random positions on each page: each trap sample is a pair consisting of clean speech from the reference and a noticeably distorted version of it. The order within a trap pair is random, but each page contains 2 pairs with one order and 2 pairs with the other.

6. Filter out the results in which the trap samples are classified incorrectly.

7. Use the sign test to reject the hypothesis that the models generate speech of the same [median] perceptual quality.
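The sign test in the last step can be sketched in plain Python as below. It is a minimal two-sided exact version that assumes "same" answers are discarded before counting, which is a common convention and not stated in the source; the function name is hypothetical. Under the null hypothesis each remaining comparison favours either model with probability 0.5, so the test reduces to a binomial test.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact sign test on preference counts (ties already removed)."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

A small p-value means that a preference split this uneven would be unlikely if both models had the same perceptual quality.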

Discriminator toolkit

The reasons for the effectiveness of the HiFi-GAN discriminator toolkit are investigated by revisiting the ablation study of the HiFi-GAN paper. The HiFi V3 generator architecture is trained with various sets of discriminators for one million iterations on the neural vocoding task using the LJ-Speech dataset. The generated samples are evaluated extensively and the complexity of the discriminator toolkit is measured. To compare the empirical training cost, the total training time is measured on a single NVIDIA GeForce GTX 1080 Ti GPU. The results are given in Table 1.

Table 1 shows the results of evaluating the importance of the discriminator toolkit for neural vocoding of speech signals. Reference denotes the desired output signal, without noise or with extended bandwidth. MSD stands for multi-scale discriminator. MPD stands for multi-period discriminator. SSD stands for single-resolution discriminator. k is the number of discriminators. MOS (mean opinion score) is the average audio quality score assigned by listeners; higher is better. D size (M) is the number of discriminator parameters in millions. D MACs (G) is the number of multiply-accumulate operations in billions per second. Training time is measured in days on a single NVIDIA GeForce GTX 1080 Ti GPU.

Table 1

Model | Model quality: MOS | Model complexity: D size (M) | Model complexity: D MACs (G) | Model complexity: training time (days)
Reference | 4.70 ± 0.04 | - | - | -
orig. MSD | 2.39 ± 0.05 | 29.6 | 7.84 | 7.0
tuned MSD | 3.90 ± 0.05 | 29.6 | 7.84 | 7.0
SSD (k=3) | 4.00 ± 0.06 | 29.6 | 13.44 | 10.5
SSD (k=3) (ours) | 3.98 ± 0.06 | 1.86 | 0.86 | 2.9
SSD (k=5) (ours) | 4.10 ± 0.07 | 3.1 | 1.43 | 4.8
orig. MSD+MPD | 4.23 ± 0.07 | 70.7 | 17.28 | 11.9

The authors of HiFi-GAN showed that without the MPD discriminators the system loses almost 2 MOS points, which is a very large drop. This drop is mainly due to an unsuitable choice of hyperparameters. If the MPD discriminators are simply removed from the training pipeline of the HiFi-GAN source code, the quality of the generated audio does indeed degrade (see orig. MSD in Table 1). However, the quality can be improved considerably by tuning the hyperparameters.

One of the key reasons for the degradation is that the relative weight of the adversarial loss becomes much smaller than the weight of the mel-spectrogram loss (the adversarial losses of the individual discriminators are summed, so removing discriminators shrinks their total). As a result, the generator is driven mainly by the mel-spectrogram loss, which deprives the audio of its natural sound. The perceptual quality of the audio can be improved by assigning a lower weight to the mel-spectrogram loss (15 instead of 45 in the HiFi-GAN source code).

In addition, the authors of HiFi-GAN use spectral normalization in one of the MSD discriminators, which is not recommended by the authors of the MelGAN paper. Tuning the learning rate for the discriminator (1*10⁻⁵ instead of 2*10⁻⁴) further improves the results. This change of hyperparameters yields an improvement of about 1.5 MOS points (see tuned MSD, MOS = 3.90 (the result of the proposed invention), versus orig. MSD, MOS = 2.39, in Table 1).

The importance of multi-resolution discrimination is further examined by removing the average pooling from the MSD discriminators, so that all three discriminators operate at the same resolution and differ only in initialization. This single-resolution discriminator (SSD) setup has an additional positive effect on the quality of the generated samples and raises MOS by 0.1 points. In the present invention, the SSD discriminators can be made significantly lighter by reducing the number of channels in each layer by a factor of 4 without a significant loss of quality (SSD (k=3) (proposed)). The quality of the discriminator framework can be improved further by increasing the number of SSD discriminators from 3 to 5, which gives an additional MOS improvement of 0.1 points (SSD (k=5) (ours)). Overall, the results shed light on possible reasons for the success of the HiFi-GAN discriminator framework: with hyperparameter tuning and several identical discriminators, it is possible to reach a quality similar to that of the original HiFi-GAN discriminator framework (original MSD+MPD). Accordingly, 3 identical SSD discriminators are used to train the HiFi++ model on the bandwidth extension and speech enhancement tasks.
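The discriminator setup described above can be summarized as a small configuration sketch. The class name, the field names, and the base channel widths below are hypothetical and serve only as an illustration; the learning rate, the number of discriminators, and the channel reduction factor are the values quoted in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Placeholder per-layer channel widths of one full-size discriminator.
# These numbers are an assumption for illustration, not taken from the text.
BASE_CHANNELS: Tuple[int, ...] = (128, 128, 256, 512, 1024, 1024, 1024)


@dataclass(frozen=True)
class SSDConfig:
    """Hypothetical description of a set of single-resolution discriminators."""

    num_discriminators: int = 3         # k identical discriminators (3 or 5 in the text)
    channel_divisor: int = 4            # channels in each layer reduced by a factor of 4
    learning_rate: float = 1e-5         # tuned value quoted in the text (instead of 2e-4)
    use_average_pooling: bool = False   # False: all discriminators see the same resolution

    def layer_channels(self) -> List[int]:
        """Channel widths of one lightened discriminator."""
        return [c // self.channel_divisor for c in BASE_CHANNELS]

    def input_downsampling(self) -> List[int]:
        """Downsampling factor of the waveform seen by each discriminator."""
        if self.use_average_pooling:
            # Multi-scale behaviour: each next discriminator sees a 2x coarser signal.
            return [2 ** i for i in range(self.num_discriminators)]
        # Single-resolution behaviour: the copies differ only in initialization.
        return [1] * self.num_discriminators


ssd = SSDConfig()
print(ssd.layer_channels())
print(ssd.input_downsampling())
```

Setting `use_average_pooling=True` in this sketch mimics the multi-scale arrangement that the SSD variant removes, which makes the difference between the two setups explicit.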

Bandwidth extension

In the bandwidth extension experiments, recordings with a sampling rate of 16 kHz are used as targets, and three input bandwidths are considered: 1 kHz, 2 kHz, and 4 kHz. Before the signal is decimated to the corresponding sampling rate (2 kHz, 4 kHz, or 8 kHz), a low-pass filter is applied, chosen at random among Butterworth, Chebyshev, Bessel, and elliptic filters of various orders, to avoid aliasing and to make the model more robust (Liu et al., 2021). The decimated signal is then resampled back to a sampling rate of 16 kHz using polyphase filtering.
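The relation between input bandwidth, intermediate sampling rate, and decimation factor in this data preparation step can be sketched as follows. This is a minimal illustration that only samples the degradation parameters and does not filter audio. The names `DegradationConfig` and `sample_degradation`, the filter order range, and the choice of the type I Chebyshev variant are assumptions that do not come from the text.

```python
import random
from dataclasses import dataclass

TARGET_RATE_HZ = 16_000
# Filter families named in the text; "cheby1" (type I) is an assumed choice.
FILTER_FAMILIES = ("butter", "cheby1", "bessel", "ellip")


@dataclass(frozen=True)
class DegradationConfig:
    filter_family: str      # randomly chosen low-pass filter family
    filter_order: int       # randomly chosen filter order
    cutoff_hz: float        # low-pass cutoff, equal to the input bandwidth
    low_rate_hz: int        # sampling rate after decimation
    decimation_factor: int  # TARGET_RATE_HZ / low_rate_hz


def sample_degradation(bandwidth_hz: int, rng: random.Random) -> DegradationConfig:
    """Sample one low-pass/decimation setting for a given input bandwidth."""
    low_rate_hz = 2 * bandwidth_hz  # Nyquist: the rate is twice the bandwidth
    if low_rate_hz <= 0 or TARGET_RATE_HZ % low_rate_hz != 0:
        raise ValueError("target rate must be an integer multiple of the low rate")
    return DegradationConfig(
        filter_family=rng.choice(FILTER_FAMILIES),
        filter_order=rng.randint(2, 10),  # assumed range of filter orders
        cutoff_hz=float(bandwidth_hz),
        low_rate_hz=low_rate_hz,
        decimation_factor=TARGET_RATE_HZ // low_rate_hz,
    )


rng = random.Random(0)
for bandwidth in (1_000, 2_000, 4_000):
    print(sample_degradation(bandwidth, rng))
```

With SciPy, such a configuration could plausibly be passed to `scipy.signal.iirfilter` (whose `ftype` argument accepts these family names) for the low-pass stage and to `scipy.signal.resample_poly` for the polyphase resampling back to 16 kHz; the exact filter parameters used in the experiments are not stated in the text.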

The results and a comparison with other methods are given in Table 2. The proposed HiFi++ model offers a better trade-off between model size and bandwidth extension quality than the other methods. In particular, the proposed model is about 5 times smaller than the closest baseline, SEANet (Li et al., 2021), while outperforming it for all input bandwidths. To confirm the superiority of HiFi++ over SEANet beyond the MOS tests, a pairwise comparison between the two models is conducted, and a statistically significant preference for the proposed model is observed (the p-values of the binomial test are 2.8*10⁻²² for the 1 kHz bandwidth, 0.003 for 2 kHz, and 0.02 for 4 kHz).
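A binomial test for a pairwise preference study can be sketched as follows, under the assumption that each listener vote is an independent choice between the two models and that the null hypothesis is "no preference" (probability 0.5). The function name and the vote counts in the example are hypothetical; the actual counts behind the reported p-values are not given in the text.

```python
from math import comb


def binomial_test_two_sided(wins: int, trials: int) -> float:
    """Exact two-sided binomial test against a success probability of 0.5."""
    if trials <= 0 or not 0 <= wins <= trials:
        raise ValueError("expected 0 <= wins <= trials and trials > 0")
    # For p = 0.5 the distribution is symmetric, so the two-sided p-value is
    # twice the tail probability of the more extreme side, capped at 1.
    k = max(wins, trials - wins)
    tail = sum(comb(trials, i) for i in range(k, trials + 1))
    return min(1.0, 2 * tail / 2 ** trials)


# Hypothetical example: one model preferred in 60 of 100 comparisons.
p_value = binomial_test_two_sided(60, 100)
print(f"p-value: {p_value:.4f}")
```

A small p-value indicates that the observed preference would be unlikely if listeners had no real preference between the two models.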

The results in Table 2 highlight the importance of adversarial objectives for speech bandwidth extension models. Contrary to expectations, the SEANet model (Li et al., 2021) turned out to be the strongest baseline among the methods tested, leaving the others far behind. This model uses an adversarial objective similar to ours. The TFiLM (Birnbaum et al., 2019) and 2S-BWE (Lin et al., 2021; TCN: temporal convolutional network, CRN: convolutional recurrent network) models use supervised reconstruction objectives and perform very poorly, especially for narrow input bandwidths.

Table 2 shows the bandwidth extension (BWE) results on the VCTK dataset (* indicates a re-implementation). #Param (million) is the number of model parameters in millions. The proposed model achieves higher quality than the tested solutions from the literature, delivering higher MOS while having the fewest parameters (1.7 million). The proposed generator reaches a MOS of 4.10 for BWE (1 kHz), 4.44 for BWE (2 kHz), and 4.51 for BWE (4 kHz), exceeding the other generators in every BWE setting.

Table 2

| Model | BWE (1 kHz), MOS | BWE (2 kHz), MOS | BWE (4 kHz), MOS | #Param (million) |
|---|---|---|---|---|
| Reference | 4.62 ± 0.06 | 4.63 ± 0.03 | 4.50 ± 0.04 | - |
| HiFi++ (ours) | 4.10 ± 0.05 | 4.44 ± 0.02 | 4.51 ± 0.02 | 1.7 |
| *SEANet | 3.94 ± 0.09 | 4.43 ± 0.05 | 4.45 ± 0.04 | 9.2 |
| VoiceFixer | 3.04 ± 0.08 | 3.82 ± 0.06 | 4.34 ± 0.03 | 122.1 |
| *2S-BWE (TCN) | 2.01 ± 0.06 | 2.98 ± 0.08 | 4.10 ± 0.04 | 2.7 |
| *2S-BWE (CRN) | 1.97 ± 0.06 | 2.85 ± 0.04 | 4.27 ± 0.05 | 9.2 |
| TFiLM | 1.98 ± 0.02 | 2.67 ± 0.04 | 3.54 ± 0.04 | 68.2 |
| Input | 1.87 ± 0.08 | 2.46 ± 0.04 | 3.36 ± 0.06 | - |

Speech enhancement

Table 3 compares HiFi++ with the baseline models. SI-SDR (scale-invariant signal-to-distortion ratio; Le Roux et al., 2019), STOI (short-time objective intelligibility; Taal et al., 2011), and PESQ (perceptual evaluation of speech quality; Rix et al., 2001) are conventional objective metrics. The proposed model achieves performance comparable to VoiceFixer (Liu et al., 2021) and DEMUCS (Defossez et al., 2020) with a much smaller model (#Param = 1.7 million): MOS=4.33 for HiFi++, MOS=4.32 for VoiceFixer, and MOS=4.22 for DEMUCS. Interestingly, VoiceFixer achieves high subjective quality while being inferior to the other models on objective metrics, especially SI-SDR and STOI. VoiceFixer does not use the waveform directly and takes only the mel-spectrogram as input; it therefore discards part of the input signal and does not aim at an exact reconstruction of the original signal, which results in poor scores on classical relative (reference-based) metrics such as SI-SDR, STOI, and PESQ. The proposed HiFi++ model provides good relative quality metrics (MOS=4.33, SI-SDR=18.4, STOI=0.95, PESQ=2.76). At the same time, the proposed model takes into account the whole signal spectrum, which is highly informative for speech enhancement, as demonstrated by the success of classical spectral methods. Notably, the proposed invention significantly outperforms the SEANet model (Tagliasacchi et al., 2020), which is trained in a similar adversarial manner and has more parameters but does not use spectral information (MOS=4.33 for HiFi++ versus 3.99 for SEANet; SI-SDR=18.4 versus 13.5; STOI=0.95 versus 0.92; PESQ=2.76 versus 2.36).
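For reference, SI-SDR can be computed as in the minimal sketch below. It follows the usual definition, in which the reference signal is rescaled by the optimal factor before the ratio is taken, and it assumes time-aligned signals of equal length. Some implementations also remove the mean of both signals first; that step is omitted here. The function name is hypothetical.

```python
import math
from typing import Sequence


def si_sdr(estimate: Sequence[float], reference: Sequence[float]) -> float:
    """Scale-invariant signal-to-distortion ratio in decibels."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0.0:
        raise ValueError("reference signal must not be all zeros")
    # Optimal scaling of the reference: alpha = <estimate, reference> / ||reference||^2
    alpha = sum(e * r for e, r in zip(estimate, reference)) / ref_energy
    target = [alpha * r for r in reference]
    target_energy = sum(t * t for t in target)
    noise_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))
    if noise_energy == 0.0:
        return math.inf    # perfect reconstruction up to a scale factor
    if target_energy == 0.0:
        return -math.inf   # estimate is orthogonal to the reference
    return 10.0 * math.log10(target_energy / noise_energy)


reference = [math.sin(0.1 * n) for n in range(200)]
estimate = [0.5 * r + 0.01 * math.cos(1.7 * n) for n, r in enumerate(reference)]
print(f"SI-SDR: {si_sdr(estimate, reference):.1f} dB")
```

Because the reference is rescaled before the comparison, multiplying the estimate by any nonzero constant leaves the score unchanged. This illustrates why a model that does not aim at exact waveform reconstruction can score poorly on this metric even when it sounds good.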

Table 3 shows the speech denoising results on the Voicebank-DEMAND dataset (* indicates a re-implementation). The proposed model achieves higher quality than the tested solutions from the literature, delivering higher MOS (the main measure of system quality, since it is computed directly from human feedback) while having the fewest parameters (1.7 million).

Table 3

| Model | MOS | SI-SDR | STOI | PESQ | #Param (million) |
|---|---|---|---|---|---|
| Reference | 4.60 ± 0.03 | - | 1.00 | 4.64 | - |
| HiFi++ (ours) | 4.33 ± 0.05 | 18.4 | 0.95 | 2.76 | 1.7 |
| VoiceFixer | 4.32 ± 0.05 | -18.5 | 0.89 | 2.38 | 122.1 |
| DEMUCS | 4.22 ± 0.05 | 18.5 | 0.95 | 3.03 | 60.8 |
| MetricGAN+ | 4.01 ± 0.09 | 8.5 | 0.93 | 3.13 | 2.7 |
| *SEANet | 3.99 ± 0.09 | 13.5 | 0.92 | 2.36 | 9.2 |
| *SE-Conformer | 3.39 ± 0.09 | 15.8 | 0.91 | 2.16 | 1.8 |
| Input | 3.36 ± 0.06 | 8.4 | 0.92 | 1.97 | - |

Ablation study

Ablation is a study of the importance of the individual components that make up a solution. To validate the proposed modifications, an ablation study of the introduced SpectralUNet, WaveUNet, and SpectralMaskNet modules is performed. For each module, an architecture without that module is considered, with the capacity of the HiFi generator part increased to match the size of the original HiFi++ architecture.

The results of the ablation study are given in Table 4, which shows the contribution of each module to HiFi++ performance. A comparison with a basic HiFi generator model, which takes only the mel-spectrogram as input, is also shown. The structure of the basic HiFi generator is the same as in versions V1 and V2 from the HiFi-GAN paper, except that the "upsample initial channel" parameter is set to 256 (it is 128 for V2 and 512 for V1). WaveUNet and SpectralMaskNet are important components of the architecture, since removing either of them noticeably degrades model performance.

SpectralUNet has no effect on SE quality and has a small positive effect on BWE (the statistical significance of the improvement is confirmed by a pairwise test).

Table 4 shows the contribution of each module of the HiFi++ architecture to BWE (bandwidth extension) and SE (speech enhancement).

It can be seen that the baseline (HiFi++) in its full configuration gives the best results.

Table 4

| Model | BWE (1 kHz), MOS | SE, MOS | #Param (million) |
|---|---|---|---|
| Reference | 4.50 ± 0.06 | 4.48 ± 0.05 | - |
| Baseline (HiFi++) | 3.92 ± 0.04 | 4.27 ± 0.04 | 1.71 |
| without SpectralUNet | 3.83 ± 0.06 | 4.26 ± 0.05 | 1.72 |
| without WaveUNet | 3.46 ± 0.06 | 4.19 ± 0.03 | 1.75 |
| without SpectralMaskNet | 3.51 ± 0.06 | 4.17 ± 0.05 | 1.74 |
| Basic HiFi | 3.42 ± 0.05 | 4.17 ± 0.04 | 3.56 |
| Input | 1.69 ± 0.05 | 3.51 ± 0.06 | - |

Conclusion

The present invention may be implemented in hardware, software, or firmware. Moreover, describing the invention as a neural network does not mean that it cannot be implemented as specialized hardware, in which the modules may be implemented either as separate units or as a single structure (for example, an integrated circuit).

The proposed invention provides a universal HiFi++ framework for bandwidth extension and speech enhancement. A series of extensive experiments shows that the proposed model achieves results on par with state-of-the-art baselines on the BWE (bandwidth extension) and SE (speech enhancement, noise removal) tasks.

Distinctive features of the present invention:

a) it is proposed to use the SpectralUNet module to increase the resolution of the mel-spectrogram. The mel-spectrogram has a two-dimensional structure, and the two-dimensional convolutional blocks of SpectralUNet are designed to facilitate working with this structure at the initial stage of converting the mel-spectrogram into a waveform.

b) it is proposed to apply the WaveUNet module to the signal after the convolutional blocks of the HiFi model. WaveUNet is a fully convolutional neural network that has proven effective in noise removal and bandwidth extension tasks.

c) the SpectralMaskNet module is proposed as the final part of the generator. SpectralMaskNet is a learnable spectral masking. The module predicts multiplicative coefficients for the magnitudes of the short-time Fourier transform and is intended for post-processing the signal in the frequency domain.

d) it is proposed to use 3 identical, computationally efficient, and conceptually simple discriminators instead of the 8 different, computationally inefficient discriminators of the known HiFi-GAN model. The proposed discriminator framework substantially reduces training complexity in terms of both computation and time, while providing performance comparable to the HiFi-GAN model.

e) the effectiveness of the proposed method is demonstrated for bandwidth extension and noise removal in speech audio recordings.
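The spectral masking operation named in item c) can be illustrated with a minimal sketch: each short-time Fourier transform coefficient keeps its phase while its magnitude is multiplied by a coefficient. In the actual module the coefficients are predicted by a trained network and the result is converted back to a waveform; here the mask is simply passed in by the caller, and the function name and the nested-list layout (frames by frequency bins) are assumptions made for illustration.

```python
import cmath
from typing import List, Sequence


def apply_spectral_mask(
    stft: Sequence[Sequence[complex]],
    mask: Sequence[Sequence[float]],
) -> List[List[complex]]:
    """Multiply STFT magnitudes by mask coefficients while keeping the phase."""
    if len(stft) != len(mask):
        raise ValueError("stft and mask must have the same number of frames")
    masked: List[List[complex]] = []
    for frame, frame_mask in zip(stft, mask):
        if len(frame) != len(frame_mask):
            raise ValueError("stft and mask must have the same number of bins")
        row = []
        for value, coeff in zip(frame, frame_mask):
            magnitude, phase = cmath.polar(value)
            # Only the magnitude is rescaled; the phase is left untouched.
            row.append(cmath.rect(coeff * magnitude, phase))
        masked.append(row)
    return masked


# Toy spectrogram with 2 frames and 2 frequency bins, and a placeholder mask.
spectrum = [[1 + 1j, 2j], [-3 + 0j, 0j]]
mask = [[0.5, 2.0], [1.0, 3.0]]
print(apply_spectral_mask(spectrum, mask))
```

Coefficients below 1 attenuate a time-frequency bin (for example, residual noise) and coefficients above 1 amplify it, which is how such a mask can post-process the signal in the frequency domain.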

The proposed HiFi++ model outperforms existing baselines on each task at significantly lower model complexity. The HiFi++ framework is robust and suitable for a variety of speech-related tasks.

References

Baevski, A., Zhou, H., Mohamed, A., and Auli, M. wav2vec 2,0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006,11477, 2020.Baevski, A., Zhou, H., Mohamed, A., and Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006,11477, 2020.

Bai, S., Kolter, J. Z., and Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803,01271, 2018.Bai, S., Kolter, J. Z., and Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803,01271, 2018.

Birnbaum, S., Kuleshov, V., Enam, Z., Koh, P. W., and Ermon, S. Temporal film: Capturing long range sequence dependencies with feature-wise modulations. arXiv preprint arXiv:1909,06628, 2019.Birnbaum, S., Kuleshov, V., Enam, Z., Koh, P. W., and Ermon, S. Temporal film: Capturing long range sequence dependencies with feature-wise modulations. arXiv preprint arXiv:1909.06628, 2019.

Dйfossez, A., Usunier, N., Bottou, L., and Bach, F. Music source separation in the waveform domain. arXiv preprint arXiv:1911,13254, 2019.Défossez, A., Usunier, N., Bottou, L., and Bach, F. Music source separation in the waveform domain. arXiv preprint arXiv:1911,13254, 2019.

Defossez, A., Synnaeve, G., and Adi, Y. Real time speech enhancement in the waveform domain. In Interspeech, 2020.Defossez, A., Synnaeve, G., and Adi, Y. Real time speech enhancement in the waveform domain. Interspeech, 2020.

Durugkar, I., Gemp, I., and Mahadevan, S. Generative multi-adversarial networks. arXiv preprint arXiv:1611,01673, 2016.Durugkar, I., Gemp, I., and Mahadevan, S. Generative multi-adversarial networks. arXiv preprint arXiv:1611,01673, 2016.

Fu, S.-W., Liao, C.-F., Tsao, Y., and Lin, S.-D. Metricgan: Generative adversarial networks based black-box metric scores optimization for speech enhancement. In International Conference on Machine Learning, pp. 2031-2041. PMLR, 2019.Fu, S.-W., Liao, C.-F., Tsao, Y., and Lin, S.-D. Metricgan: Generative adversarial networks based black-box metric scores optimization for speech enhancement. In International Conference on Machine Learning, pp. 2031-2041. PMLR, 2019.

Fu, S.-W., Yu, C., Hsieh, T.-A., Plantinga, P., Ravanelli, M., Lu, X., and Tsao, Y. Metricgan+: An improved version of metricgan for speech enhancement. arXiv preprint arXiv:2104,03538, 2021.Fu, S.-W., Yu, C., Hsieh, T.-A., Plantinga, P., Ravanelli, M., Lu, X., and Tsao, Y. Metricgan+: An improved version of metricgan for speech enhancement. arXiv preprint arXiv:2104,03538, 2021.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.

Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y., and Pang, R. Conformer: Convolution-augmented Transformer for Speech Recognition. In Proc. Interspeech 2020, pp. 5036-5040, 2020. doi: 10,21437/Interspeech.2020-3015.Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y., and Pang, R. Conformer: Convolution-augmented Transformer for Speech Recognition. In Proc. Interspeech 2020, pp. 5036-5040, 2020. doi: 10.21437/Interspeech.2020-3015.

Ito, K. and Johnson, L. The lj speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.Ito, K. and Johnson, L. The lj speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.

Jansson, A., Humphrey, E., Montecchio, N., Bittner, R., Kumar, A., and Weyde, T. Singing voice separation with deep u-net convolutional networks. 2017.Jansson, A., Humphrey, E., Montecchio, N., Bittner, R., Kumar, A., and Weyde, T. Singing voice separation with deep u-net convolutional networks. 2017.

Kim, E. and Seo, H. SE-Conformer: Time-Domain Speech Enhancement Using Conformer. In Proc. Interspeech 2021, pp. 2736-2740, 2021. doi: 10,21437/Interspeech.2021-2207.Kim, E. and Seo, H. SE-Conformer: Time-Domain Speech Enhancement Using Conformer. In Proc. Interspeech 2021, pp. 2736-2740, 2021. doi: 10.21437/Interspeech.2021-2207.

Kim, S. and Sathe, V. Bandwidth extension on raw audio via generative adversarial networks. arXiv preprint arXiv:1903,09027, 2019.Kim, S. and Sathe, V. Bandwidth extension on raw audio via generative adversarial networks. arXiv preprint arXiv:1903,09027, 2019.

Kong, J., Kim, J., and Bae, J. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. arXiv preprint arXiv:2010,05646, 2020.Kong, J., Kim, J., and Bae, J. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. arXiv preprint arXiv:2010,05646, 2020.

Kong, Q., Cao, Y., Liu, H., Choi, K., and Wang, Y. Decoupling magnitude and phase estimation with deep resunet for music source separation. arXiv preprint arXiv:2109,05418, 2021.Kong, Q., Cao, Y., Liu, H., Choi, K., and Wang, Y. Decoupling magnitude and phase estimation with deep resunet for music source separation. arXiv preprint arXiv:2109,05418, 2021.

Kuleshov, V., Enam, S. Z., and Ermon, S. Audio super resolution using neural networks. arXiv preprint arXiv:1708,00853, 2017.Kuleshov, V., Enam, S. Z., and Ermon, S. Audio super resolution using neural networks. arXiv preprint arXiv:1708,00853, 2017.

Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., de Brйbisson, A., Bengio, Y., and Courville, A. Melgan: Generative adversarial networks for conditional waveform synthesis. arXiv preprint arXiv:1910,06711, 2019.Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., de Bribisson, A., Bengio, Y., and Courville, A. Melgan: Generative adversarial networks for conditional waveform synthesis. arXiv preprint arXiv:1910.06711, 2019.

Larsen, A. B. L., Sшnderby, S. K., Larochelle, H., and Winther, O. Autoencoding beyond pixels using a learned similarity metric. In International conference on machine learning, pp. 1558-1566. PMLR, 2016.Larsen, A. B. L., Schnderby, S. K., Larochelle, H., and Winther, O. Autoencoding beyond pixels using a learned similarity metric. In International conference on machine learning, pp. 1558-1566. PMLR, 2016.

Le Roux, J., 404 Wisdom, S., Erdogan, H., and Hershey, J. R. Sdr-half-baked or well done? In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626-630. IEEE, 2019.Le Roux, J., 404 Wisdom, S., Erdogan, H., and Hershey, J. R. Sdr-half-baked or well done? In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626-630. IEEE, 2019.

Li, Y., Tagliasacchi, M., Rybakov, O., Ungureanu, V., and Roblek, D. Real-time speech frequency bandwidth extension. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 691-695. IEEE, 2021.Li, Y., Tagliasacchi, M., Rybakov, O., Ungureanu, V., and Roblek, D. Real-time speech frequency bandwidth extension. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 691-695. IEEE, 2021.

Lin, J., Wang, Y., Kalgaonkar, K., Keren, G., Zhang, D., and Fuegen, C. A two-stage approach to speech bandwidth extension. Proc. Interspeech 2021, pp. 1689-1693, 2021.Lin, J., Wang, Y., Kalgaonkar, K., Keren, G., Zhang, D., and Fuegen, C. A two-stage approach to speech bandwidth extension. Proc. Interspeech 2021, pp. 1689-1693, 2021.

Liu, H., Kong, Q., Tian, Q., Zhao, Y., Wang, D., Huang, C., and Wang, Y. Voicefixer: Toward general speech restoration with neural vocoder. arXiv preprint arXiv:2109,13731, 2021.Liu, H., Kong, Q., Tian, Q., Zhao, Y., Wang, D., Huang, C., and Wang, Y. Voicefixer: Toward general speech restoration with neural vocoder. arXiv preprint arXiv:2109,13731, 2021.

Lo, C.-C., Fu, S.-W., Huang, W.-C., Wang, X., Yamagishi, J., Tsao, Y., and Wang, H.-M. Mosnet: Deep learning based objective assessment for voice conversion. arXiv preprint arXiv:1904,08352, 2019.Lo, C.-C., Fu, S.-W., Huang, W.-C., Wang, X., Yamagishi, J., Tsao, Y., and Wang, H.-M. Mosnet: Deep learning based objective assessment for voice conversion. arXiv preprint arXiv:1904.08352, 2019.

Lorenzo-Trueba, J., Yamagishi, J., Toda, T., Saito, D., Villavicencio, F., Kinnunen, T., and Ling, Z. The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods. arXiv preprint arXiv:1804,04262, 2018.Lorenzo-Trueba, J., Yamagishi, J., Toda, T., Saito, D., Villavicencio, F., Kinnunen, T., and Ling, Z. The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods . arXiv preprint arXiv:1804.04262, 2018.

Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2794-2802, 2017.Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2794-2802, 2017.

Pascual, S., Bonafonte, A., and Serra, J. Segan: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703,09452, 2017.Pascual, S., Bonafonte, A., and Serra, J. Segan: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452, 2017.

Prenger, R., Valle, R., and Catanzaro, B. Waveglow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617-3621. IEEE, 2019.Prenger, R., Valle, R., and Catanzaro, B. Waveglow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617-3621. IEEE, 2019.

Rix, A. W., Beerends, J. G., Hollier, M. P., and Hekstra, A. P. Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In 2001Rix, A. W., Beerends, J. G., Hollier, M. P., and Hekstra, A. P. Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In 2001

IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No.01CH37221), volume 2, pp. 749-752. IEEE, 2001.IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No.01CH37221), volume 2, pp. 749-752. IEEE, 2001.

Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234-241. Springer, 2015.Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234-241. Springer, 2015.

Stoller, D., Ewert, S., and Dixon, S. WaveUNet: A multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806,03185, 2018.Stoller, D., Ewert, S., and Dixon, S. WaveUNet: A multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806,03185, 2018.

Sulun, S. and Davies, M. E. On filter generalization for music bandwidth extension using deep neural networks. IEEE Journal of Selected Topics in Signal Processing, 15(1):132-142, 2020.

Taal, C. H., Hendriks, R. C., Heusdens, R., and Jensen, J. An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2125-2136, 2011.

Tagliasacchi, M., Li, Y., Misiunas, K., and Roblek, D. Seanet: A multi-modal speech enhancement network. arXiv preprint arXiv:2009.02095, 2020.

Tan, K. and Wang, D. A convolutional recurrent neural network for real-time speech enhancement. In Interspeech, pp. 3229-3233, 2018.

Tian, Q., Chen, Y., Zhang, Z., Lu, H., Chen, L., Xie, L., and Liu, S. Tfgan: Time and frequency domain based generative adversarial network for high-fidelity speech synthesis. arXiv preprint arXiv:2011.12206, 2020.

Valentini-Botinhao, C. et al. Noisy speech database for training speech enhancement algorithms and tts models. 2017.

Wang, H. and Wang, D. Time-frequency loss for cnn based speech super-resolution. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 861-865. IEEE, 2020.

Wang, H. and Wang, D. Towards robust speech super-resolution. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.

Wisdom, S., Hershey, J. R., Wilson, K., Thorpe, J., Chinen, M., Patton, B., and Saurous, R. A. Differentiable consistency constraints for improved deep speech enhancement. In ICASSP 2019-IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 900-904. IEEE, 2019.

Yamagishi, J., Veaux, C., MacDonald, K., et al. Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit (version 0.92). 2019.

You, J., Kim, D., Nam, G., Hwang, G., and Chae, G. Gan vocoder: Multi-resolution discriminator is all you need. arXiv preprint arXiv:2103.05236, 2021.

Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586-595, 2018.

Claims

1. A system for processing an audio signal waveform based on a GAN generator, comprising the following modules:

a spectral preprocessing module (SpectralUNet) configured to:

- receive an input audio signal converted into a mel spectrogram by a short-time Fourier transform (STFT) operation,

- apply two-dimensional UNet convolutional blocks to the mel spectrogram to denoise the mel spectrogram and restore high frequencies;

a fully convolutional neural network module (HiFi generator) configured to convert the output signal obtained from the SpectralUNet into a waveform domain;

a one-dimensional convolutional Unet neural module (WaveUNet) operating in the time domain, configured to correct the received waveform in the time domain;

a trainable spectral masking module (SpectralMaskNet) configured to correct the output from WaveUNet in the frequency domain to remove artifacts and noise, wherein the output from SpectralMaskNet, processed by a one-dimensional convolutional layer, is the corrected audio signal waveform.

2. The system of claim 1, wherein WaveUNet receives as input the HiFi generator output concatenated with the audio signal waveform.

3. The system of claim 2, further comprising at least three identical fully convolutional discriminators configured to train the system to extend the bandwidth of the audio signal waveform.

4. The system of claim 2, further comprising at least three identical fully convolutional discriminators configured to train the system to suppress noise in the speech signal of the audio signal waveform.

5. The system of claim 2, further comprising at least three identical fully convolutional discriminators configured to train the system to suppress noise in the speech signal and extend the bandwidth of the audio signal waveform.

6. The system according to any one of claims 3-5, wherein the discriminators are SSD discriminators with a reduced number of weights.

7. A method of processing the waveform of an audio signal using the system according to any one of claims 1-6, wherein the method is carried out on a computing device, the method comprising the steps of:

receiving a waveform of an audio signal from a signal source;

processing, through STFT operations, the waveform of the audio signal to obtain a mel spectrogram;

applying, via SpectralUNet, 2D UNet convolutional blocks to the mel spectrogram to denoise the mel spectrogram and restore high frequencies;

converting, by means of a HiFi generator, an output signal obtained from the SpectralUNet into a waveform domain;

concatenating the output signal of the HiFi generator with the waveform of the audio signal;

correcting, by means of WaveUNet, the output waveform in the time domain;

adjusting the output waveform in the frequency domain to remove artifacts and noise by means of SpectralMaskNet;

processing the SpectralMaskNet output signal with a one-dimensional convolutional layer;

outputting the corrected waveform of the audio signal.
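The order of the steps above can be traced as a sequence of tensor shapes. The following is only an illustrative sketch: the function name `trace_shapes`, the hop size, the number of mel bands, and the channel count are assumptions chosen for the example and are not values stated in the claims.

```python
def trace_shapes(num_samples, hop=256, n_mels=80, channels=8):
    """Return (stage, shape) pairs for the method steps.

    All sizes (hop, n_mels, channels) are hypothetical and serve only
    to show how the data changes shape from step to step.
    """
    if num_samples % hop != 0:
        raise ValueError("num_samples must be a multiple of hop in this sketch")
    frames = num_samples // hop  # number of STFT frames, padding ignored
    return [
        ("input waveform", (1, num_samples)),
        ("mel spectrogram (STFT)", (n_mels, frames)),
        ("SpectralUNet output", (n_mels, frames)),
        # The HiFi generator upsamples back to the waveform length.
        ("HiFi generator output", (channels, num_samples)),
        # The input waveform is appended as one extra channel.
        ("concatenation with input", (channels + 1, num_samples)),
        ("WaveUNet output", (channels, num_samples)),
        ("SpectralMaskNet output", (channels, num_samples)),
        ("1D convolution output", (1, num_samples)),
    ]


if __name__ == "__main__":
    for stage, shape in trace_shapes(16384):
        print(f"{stage:28s} {shape}")
```

In this sketch the first and last shapes are equal, reflecting that the method outputs a corrected waveform of the same length as the input.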

8. The method of claim 7, wherein the SpectralUnet output is the first tensor;

wherein the converting step by means of the HiFi generator increases the temporal resolution of the processed first tensor, wherein the output of the HiFi generator is a second tensor comprising a plurality of one-dimensional sequences whose length matches the length of said audio signal waveform;

wherein the tensor resulting from the concatenation is a third tensor containing the concatenated one-dimensional sequences;

wherein the step of correcting by WaveUNet comprises processing the third tensor by a 1D convolutional UNet neural architecture in the time domain, which applies 1D convolutions at multiple scales of the third tensor in the time domain, wherein the output of the 1D convolutional UNet neural architecture in the time domain is a fourth tensor that consists of one-dimensional sequences;

wherein the step of correcting by SpectralMaskNet comprises processing the fourth tensor by a trainable spectral masking module, which applies a per-channel short-time Fourier transform (STFT) to the fourth tensor and changes the absolute values of the STFT coefficients, wherein the output signal of the trainable spectral masking module is a fifth tensor.
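Changing only the absolute values of the STFT coefficients, as recited above, can be illustrated for a single channel. The sketch below is an assumption-laden simplification: it uses non-overlapping rectangular frames as the STFT, a fixed per-frequency mask instead of one predicted by a trained network, and a hypothetical function name `apply_magnitude_mask`. It shows only that the magnitudes are scaled while the phases are left unchanged.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(coeffs):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(coeffs)
    return [(sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def apply_magnitude_mask(signal, mask, frame_len):
    """Scale STFT magnitudes by `mask` while keeping each coefficient's phase.

    `mask` holds one non-negative factor per frequency bin (hypothetical:
    a trained module would predict a factor per time-frequency bin).
    """
    if len(signal) % frame_len != 0 or len(mask) != frame_len:
        raise ValueError("signal length must be a multiple of frame_len, "
                         "and mask must have frame_len entries")
    out = []
    for start in range(0, len(signal), frame_len):
        coeffs = dft(signal[start:start + frame_len])
        # New magnitude = old magnitude * mask factor; phase is untouched.
        masked = [cmath.rect(abs(c) * m, cmath.phase(c)) for c, m in zip(coeffs, mask)]
        out.extend(idft(masked))
    return out
```

For a multi-channel tensor, the same function would be applied to each channel separately. A mask of all ones reproduces the input up to rounding error, and a mask of all zeros silences it.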

9. The method of claim 8, further comprising the step of processing the fifth tensor with a 1D convolutional layer, wherein the output of the 1D convolutional layer is an output audio waveform.

10. The method according to any one of claims 7-9, further comprising the step of training on a sum of an adversarial loss function, a feature matching loss function, and a mel-spectrogram loss function, wherein the adversarial loss and the feature matching loss are computed by at least three identical fully convolutional discriminators.
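The training objective recited above is a sum of three kinds of terms. A minimal sketch of how such a sum could be assembled is given below; the function name `generator_loss` and the weights `w_fm` and `w_mel` are hypothetical, since the claim specifies only that the terms are summed and that the first two are computed by at least three discriminators.

```python
def generator_loss(adv_losses, fm_losses, mel_loss, w_fm=1.0, w_mel=1.0):
    """Combine per-discriminator losses with the mel-spectrogram loss.

    adv_losses, fm_losses: one value per discriminator (at least three).
    w_fm, w_mel: illustrative weights, not values taken from the source.
    """
    if len(adv_losses) < 3 or len(adv_losses) != len(fm_losses):
        raise ValueError("expected matching loss lists from at least three discriminators")
    return sum(adv_losses) + w_fm * sum(fm_losses) + w_mel * mel_loss
```

With unit weights, this reduces to the plain sum of the adversarial, feature matching, and mel-spectrogram terms.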