RU2795573C1

RU2795573C1 - Method and device for improving speech signal using fast Fourier convolution

Info

Publication number: RU2795573C1
Application number: RU2022121009A
Authority: RU
Inventors: Иван Сергеевич ЩЕКОТОВ; Павел Константинович Андреев; Айбек Арстанбекович Аланов; Олег Юрьевич Иванов; Дмитрий Петрович ВЕТРОВ
Original assignee: Самсунг Электроникс Ко., Лтд.
Filing date: 2022-08-02
Publication date: 2023-05-05

Abstract

FIELD: audio processing.

SUBSTANCE: the accuracy of noise suppression in a speech signal is increased as follows: the channels of the input tensor are split into local and global branches; ordinary convolutional layers are used for local updates of the tensor feature maps on the local branch; a Fourier transform is performed along the frequency dimension of the global branch tensor; the feature map of the global branch is updated in the spectral domain by pointwise convolutional layers; an inverse Fourier transform is applied to the updated global branch feature map; and the activations of the local and global branches are summed.

EFFECT: increasing the accuracy of noise suppression in the speech signal.

10 cl, 4 dwg, 3 tbl

Description

TECHNICAL FIELD OF THE INVENTION

The present invention relates in general to the field of computer technology and, in particular, to methods for processing and analyzing audio recordings. It can be used in various devices for transmitting, receiving, and recording speech to improve the user experience when listening to speech recordings.

BACKGROUND OF THE INVENTION

Speech enhancement (SE) aims to recover a clean speech signal from noisy input signals. SE improves the intelligibility and clarity of speech, which contributes to a better user experience. Speech enhancement is of great interest to audio signal processing specialists because of its fundamental importance in telecommunications.

This problem has many solutions in traditional signal processing, but each of them relies on assumptions about the underlying noise model. Thanks to recent advances in deep learning, data-driven approaches now dominate the field of speech enhancement.

One popular direction among deep learning methods for speech enhancement is based on processing the signal in the time domain. These methods map the noisy waveform directly to a clean one, usually disregarding any information about the signal spectrum, which can reduce their effectiveness. These approaches typically use a convolutional encoder-decoder (CED) structure.

For example, [1] and [2] use a CED network as a generator that is trained with a fully convolutional discriminator. Some of these approaches additionally use neural modules capable of capturing long-range temporal dependencies, for example long short-term memory (LSTM) cells [3] and transformers [4].

Another line of research relies on estimating the short-time (windowed) Fourier transform (STFT) representation, i.e. the complex spectrum. Approaches of this kind either predict the STFT coefficients of the clean signal directly [5] or correct the spectrum of the noisy signal by estimating various masks that modify the amplitudes and phases [5]. In general, with respect to the method claimed here, the STFT is a function that yields a complex spectrum from an input audio signal. An amplitude spectrogram (also referred to here simply as a spectrogram) can be obtained from the STFT amplitudes: the amplitudes are computed from the input signal by means of the STFT, and the spectrogram is built from them. In the context of the present invention, the terms STFT, STFT representation, and complex spectrum are used interchangeably.

For example, MetricGAN [6] and MetricGAN+ [7] use a bidirectional LSTM to predict binary masks for the amplitude spectrogram, directly optimize objective metrics of overall speech quality, and report state-of-the-art results for those metrics. Direct estimation of spectrogram phases is usually difficult, and various techniques have been proposed to simplify this task. These include the use of complex-valued networks, decoupling of the amplitude estimation, and the use of separate vocoder networks for waveform synthesis.

The source Kong, Q., Cao, Y., Liu, H., Choi, K., & Wang, Y. (2021). Decoupling magnitude and phase estimation with deep resunet for music source separation (arXiv preprint arXiv:2109.05418) discloses the use of the UNet architecture to estimate short-time Fourier transform coefficients. This prior art method uses basic convolutions in the UNet architecture. However, this prior art source does not describe the use of fast Fourier convolution in the UNet architecture, which critically affects the quality of the generated speech, since fast Fourier convolution allows the parameters of the neural network to be used more effectively.

The source Chi, L., Jiang, B., & Mu, Y. (2020). Fast Fourier convolution. Advances in Neural Information Processing Systems, 33, 4479-4488 introduces the fast Fourier convolution neural operator and applies it to image recognition, video action recognition, and human keypoint detection. This source can be regarded as the closest prior art to the claimed invention. However, it does not provide for speech enhancement using fast Fourier convolution to reconstruct a spectrogram.

DISCLOSURE OF THE INVENTION

This section, which discloses various aspects of the claimed invention, is intended to provide a brief overview of the claimed subject matter and its embodiments. Detailed characteristics of the technical means and methods that implement combinations of features of the claimed invention are given below. Neither this disclosure nor the detailed description given below in conjunction with the accompanying drawings should be construed as defining the scope of the claimed invention. The scope of legal protection of the claimed invention is determined solely by the claims that follow.

The technical problem solved by the present invention is the enhancement of a speech signal using fast Fourier convolution to reconstruct the STFT representation. In the present invention, the STFT representation is predicted directly (i.e., not only the amplitude spectrogram is predicted). However, within the scope of the present invention, the spectrogram alone can also be predicted. It should be noted that reconstructing (predicting) the spectrogram is a simpler task, since the amplitude spectrogram contains no phase information, whereas the STFT contains both amplitude and phase information.

The object of the present invention is to provide an improved method and device for suppressing noise in a speech signal and/or enhancing an audio signal containing speech.

The technical result achieved by the claimed invention is improved speech quality, noise suppression, and/or enhancement of the speech component in a speech audio signal.

In a first aspect, the object is achieved by a method for suppressing noise in a speech signal using at least one fast Fourier convolution operator, the method comprising: splitting the channels of an input tensor into local and global branches; using ordinary convolutional layers for local transformations of the tensor on the local branch; performing a Fourier transform along the frequency dimension of the global branch tensor; updating the feature map of the global branch in the spectral domain by means of pointwise convolutional layers; applying an inverse Fourier transform to the updated global branch feature map; and summing the activations of the local and global branches. The channels of the input tensor can be derived from an input spectrogram representing an audio signal containing speech.

It should be noted that, according to the principle of the present invention, the input tensor for the FFC can be obtained either from the amplitude spectrogram or from the STFT representation. In the practical examples of the present invention the input tensor is obtained from the STFT representation, but the invention is not limited to this source of the input tensor. This is also supported by phase prediction experiments, which were carried out on amplitude spectrograms rather than on the STFT representation in order to show that fast Fourier convolution is good at predicting phase information and to motivate its use for speech enhancement or noise suppression. This also makes fast Fourier convolutions potentially viable for other applications, for example speech coding, bandwidth extension, etc.

The summed activations of the local and global branches may be reflected in an output spectrogram representing an enhanced audio signal containing speech. The at least one fast Fourier convolution operator may be part of a fast Fourier convolution autoencoder (FFC-AE) neural network architecture. The at least one fast Fourier convolution operator may be part of a fast Fourier convolution U-Net (FFC-UNet) neural network architecture.

Updating the feature map of the global branch may comprise: applying a real one-dimensional fast Fourier transform along the frequency dimension of the input tensor and concatenating the real and imaginary parts of the spectrum along the channel dimension; applying a convolutional block (with a 1×1 kernel) in the frequency domain; and applying the inverse Fourier transform. The at least one fast Fourier convolution operator may be applied along the frequency dimension. The method may comprise using a convolutional neural network that uses one or more machine learning (ML) models trained by means of a multi-discriminator adversarial training framework.
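The global branch update described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `spectral_update`, the `(channels, frequency, time)` tensor layout, and the plain weight matrix standing in for the 1×1 convolutional block are all hypothetical, and any normalization or nonlinearity inside that block is omitted.

```python
import numpy as np

def spectral_update(x, weight):
    """Update a global-branch feature map in the spectral domain.

    x:      real array of shape (C, F, T) = (channels, frequency, time); layout is an assumption.
    weight: real array of shape (2C, 2C), a stand-in for a 1x1 convolution
            acting on the stacked real/imaginary channels.
    """
    n_channels, n_freq, _ = x.shape
    # Real 1-D FFT along the frequency dimension: (C, F//2 + 1, T), complex.
    spec = np.fft.rfft(x, axis=1)
    # Concatenate real and imaginary parts along the channel dimension: (2C, F//2 + 1, T).
    stacked = np.concatenate([spec.real, spec.imag], axis=0)
    # A 1x1 convolution is a linear map over channels applied at every position.
    mixed = np.einsum("oc,cft->oft", weight, stacked)
    # Rebuild the complex spectrum and return to the original domain.
    updated = mixed[:n_channels] + 1j * mixed[n_channels:]
    return np.fft.irfft(updated, n=n_freq, axis=1)
```

Because every output value depends on all frequency bins of the input, a single such update has a receptive field that covers the whole frequency axis. With an identity weight matrix the function returns its input unchanged, which is a convenient sanity check.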

In a second aspect, the object is achieved by a device for suppressing noise in a speech signal using a fast Fourier convolution operator, the device comprising: a memory; and a processor connected to the memory, wherein the processor, when executing instructions stored in the memory, is configured to: split the channels of an input tensor into local and global branches; use ordinary convolutional layers for local transformations of the tensor on the local branch; perform a Fourier transform along the frequency dimension of the global branch tensor; update the feature map of the global branch in the spectral domain by means of pointwise convolutional layers; apply an inverse Fourier transform to the updated global branch feature map; and sum the activations of the local and global branches.

In a third aspect, the object is achieved by a computer-readable medium storing processor-executable instructions that, when executed by at least one processor, cause the at least one processor to perform the method of the first aspect.

Those skilled in the art will appreciate that the principle of the invention is not limited to the aspects set forth above, and that the invention may take the form of other subject matter, for example a device, a computer program, or a computer program product. Additional features that may characterize particular embodiments of the present invention will be apparent to those skilled in the art from the detailed description of the embodiments below.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings are provided to facilitate understanding of the essence of the present invention. The drawings are schematic and not to scale. They serve only for illustration and are not intended to define the scope of the present invention.

Fig. 1 illustrates a diagram of a device for suppressing noise in a speech signal according to the present invention.

Fig. 2 schematically illustrates the architecture of the FFC-AE neural module used in the present invention.

Fig. 3 schematically illustrates the architecture of the FFC-UNet neural module used in the present invention.

Fig. 4 is a flowchart of a method for suppressing noise in a speech signal according to the invention.

DETAILED DESCRIPTION OF EMBODIMENTS

Exemplary embodiments of the present invention are described in detail below. They are illustrated in the accompanying drawings, in which the same or similar reference numerals designate the same or similar elements, or elements that have the same or similar functions. The embodiments described with reference to the accompanying drawings are examples used only to explain the present invention and are not to be construed as limiting it in any way.

The present invention proposes a method for suppressing noise in a speech signal using a fast Fourier convolution operator, which addresses the standard single-channel speech denoising problem. In other words, the invention aims to learn a mapping from a noisy waveform y = x + n, where n is additive noise, to the clean waveform x. The invention uses the fast Fourier convolution neural operator to enhance the speech signal and suppress noise. The method of the invention can be applied in various devices that support floating-point or fixed-point computation.
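The additive model y = x + n can be illustrated by constructing a noisy training input from a clean waveform. The sketch below is only an illustration: the function name `mix_at_snr` and the idea of scaling the noise to a target signal-to-noise ratio are assumptions, since the text does not state how noisy examples are prepared.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Return y = x + n, with the noise scaled to a target SNR in dB.

    The SNR-based scaling is an assumption made for illustration;
    the text only states that the noise is additive.
    """
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the scale so that clean_power / (scale**2 * noise_power) == 10**(snr_db / 10).
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```

A denoising model would then be trained on pairs (y, x), with y as input and x as the target.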

Fast Fourier convolution (FFC) is a recently proposed neural operator that has shown promising performance, by way of non-limiting example, in a number of computer vision tasks. The FFC operator makes it possible to use operations with a large receptive field in the early layers of a neural network. In particular, the FFC operator has proven especially useful for reconstructing periodic structures, which are common in audio signal processing.

The inventors of the present invention have found that fast Fourier convolution is good at predicting phase information, which can then be used to enhance a speech signal or suppress noise in it.

The inventors have also found that neural networks based on fast Fourier convolution outperform comparable convolutional models and achieve results better than or comparable to those of other speech enhancement models.

In the present invention, fast Fourier convolution is used in U-Net and autoencoder architectures. By way of non-limiting example, in accordance with the present invention, fast Fourier convolution is applied through the following main steps:

a) splitting the channels of the input tensor into local and global branches;

b) using ordinary convolutional layers for local transformations of the tensor on the local branch;

c) performing a Fourier transform along the frequency dimension of the global branch tensor, updating it in the spectral domain by means of pointwise convolutional layers, and applying the inverse Fourier transform; and

d) summing the activations of the local and global branches.
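Steps a) to d) can be put together in a compact NumPy sketch. It is a simplified illustration under several assumptions rather than the architecture of the invention: the names `conv3x3_same`, `global_update`, and `ffc` are hypothetical, the tensor layout is `(channels, frequency, time)`, the local branch is a single 3×3 convolution, the pointwise layers are a single weight matrix without normalization or nonlinearity, and both branches are projected to the same number of output channels so that they can be summed (the text does not say how the channel counts are matched).

```python
import numpy as np

def conv3x3_same(x, w):
    """Zero-padded 3x3 convolution. x: (C_in, F, T); w: (C_out, C_in, 3, 3)."""
    _, n_freq, n_time = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], n_freq, n_time))
    for i in range(3):
        for j in range(3):
            patch = padded[:, i:i + n_freq, j:j + n_time]
            out += np.einsum("oc,cft->oft", w[:, :, i, j], patch)
    return out

def global_update(x, w):
    """FFT along frequency, pointwise mixing, inverse FFT. w: (2*C_out, 2*C_in)."""
    n_freq = x.shape[1]
    c_out = w.shape[0] // 2
    spec = np.fft.rfft(x, axis=1)
    stacked = np.concatenate([spec.real, spec.imag], axis=0)
    mixed = np.einsum("oc,cft->oft", w, stacked)
    return np.fft.irfft(mixed[:c_out] + 1j * mixed[c_out:], n=n_freq, axis=1)

def ffc(x, n_global, w_local, w_global):
    """Simplified fast Fourier convolution; n_global must be at least 1."""
    # a) split the channels into local and global branches
    x_local, x_global = x[:-n_global], x[-n_global:]
    # b) ordinary convolution on the local branch
    y_local = conv3x3_same(x_local, w_local)
    # c) spectral-domain update on the global branch
    y_global = global_update(x_global, w_global)
    # d) sum the activations of the two branches
    return y_local + y_global
```

The local branch sees only a 3×3 neighbourhood per layer, while the global branch mixes information across the entire frequency axis, so the sum combines fine local detail with long-range spectral context.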

The method of the invention operates on audio content, for example one or more audio signals comprising one or more channels, in particular, but without limitation, a single channel.

In general, a sound (audio) signal recorded under real-world acoustic conditions may contain unwanted noise produced by the environment and/or the recording equipment. As a result, the digital representation of the audio signal also contains unwanted noise and has to be filtered to remove it.

However, each type of noise requires its own type of filter, which has to be selected manually or searched for among sets of filters. Filtering out noise at frequencies other than those of human speech can advantageously be achieved with deep learning models. Unlike filters that are selected and prepared in advance, neural networks advantageously cover a wider variety of noise types and can be trained further by adding new types of noise.

In general, a recorded sound (audio signal) consists of a plurality of sound waves that reach the microphone sensor simultaneously over a period of time. An audio signal may also be referred to as a waveform. Such signals are continuous in nature and, when recorded on a digital device through a sensor, they generally undergo sampling and quantization. Sampling is characterized by the sampling rate, and quantization by the precision with which the signal is stored. As a result, real-valued amplitudes are stored as floating-point numbers with finite precision (the number of bits used to store a number). The higher the precision, the more accurately the actual amplitude is represented by the stored one.
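The effect of sampling rate and storage precision can be illustrated with a short sketch. The function name `sample_and_quantize`, the uniform rounding to a signed `n_bits` grid, and the assumption that amplitudes lie in [-1, 1] are illustrative choices, not details taken from the text.

```python
import numpy as np

def sample_and_quantize(signal_fn, duration_s, sample_rate, n_bits):
    """Sample a continuous signal and round it to a uniform n_bits grid.

    signal_fn maps an array of times in seconds to amplitudes in [-1, 1].
    Uniform signed quantization is an assumption made for illustration.
    """
    # Sampling: evaluate the signal at evenly spaced time instants.
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    x = signal_fn(t)
    # Quantization: keep only 2**n_bits distinct amplitude levels.
    levels = 2 ** (n_bits - 1)
    q = np.clip(np.round(x * levels), -levels, levels - 1) / levels
    return t, x, q
```

Comparing `x` with `q` for different values of `n_bits` shows that the quantization error shrinks as the precision grows.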

Thus, a waveform is a long vector of floating-point numbers. A noisy waveform contains noise that is generally additive on top of a presumably clean waveform (which, in the context of the present invention, is presumed to represent speech). It is therefore necessary to determine which frequencies carry the more complex input signal.

In the present invention, deep learning models operate on the STFT representation, and the neural network approach of the invention consists in finding a mapping (by training a neural network) that learns to distinguish noise from clean speech and to remove the former. The STFT makes frequency and phase information available for training neural networks to this end.

In this sense, the short-time (windowed) Fourier transform (STFT) is a sequence of Fourier transforms of a windowed signal, each focusing on a shorter segment of the audio signal. The STFT provides time-localized frequency information for situations in which the frequency components of a signal change over time. The STFT is widely used in speech processing because speech signals usually have harmonic structure. The inventive approach to noise suppression is based on identifying frequency changes along the time axis of the signal.

Technically, this is done by dividing the audio signal containing speech into shorter intervals and applying the fast Fourier transform (FFT) to each interval separately; the Fourier transform coefficients are modified by applying a given non-zero window function on each interval.
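The framing, windowing and per-interval transform can be sketched as follows. This is a minimal illustration only: it uses a naive discrete Fourier transform from the standard library in place of an FFT, and the Hann window, frame length and hop size are assumptions that the description does not specify.

```python
import cmath
import math

def hann(n_fft):
    """Periodic Hann window (an assumed choice of window function)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]

def dft(frame):
    """Naive DFT of a real frame, keeping only the non-negative frequency bins."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def stft(signal, n_fft=64, hop=16):
    """Split the signal into overlapping intervals, window each one, transform it."""
    window = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(n_fft)]
        frames.append(dft(frame))
    return frames  # indexed as frames[time_bin][frequency_bin]

sample_rate = 1600
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(256)]
spec = stft(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: abs(spec[0][k]))
print(len(spec), len(spec[0]), peak_bin)
```

With these assumed settings the bin spacing is 25 Hz, so the 200 Hz test tone peaks in bin 8 of every frame.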

It is important to note that, as mentioned above, the noise in a noisy speech signal is additive; therefore, in a noisy spectrogram, some frequencies that would otherwise be “silent” contain noise that masks the clean speech. This characteristic additive structure of the noise is used in the present invention to distinguish speech from noise and, accordingly, to train the neural networks.

As mentioned above, in the first step of applying FFC to a speech audio signal, the channels of the input tensor are split into a local branch and a global branch.

The term "tensor" as used here should generally be understood as a kind of linear multi-component (for example, algebraic) object defined on a finite-dimensional vector space. The input tensor is derived from the input signal by any suitable method well known in the art. In particular, during training the STFT representation is a 4-dimensional tensor of size (Batch_Size, Frequency bins, Time bins, 2). This means that the number of channels must be increased (upsampled) from 2 (the real and imaginary parts in the last dimension) to an arbitrary number. The present invention is not limited to any particular number, provided that a trade-off is observed between the complexity of the model, in terms of the number of parameters and the number of operations, and its quality: on the one hand the best available quality is obtained, and on the other hand the model remains as small as possible.

To this end, neural operators such as convolutions are used, interleaved with non-linearities, which upsample the channels while maintaining the flow of information through the network. The channel counts that define these operations are hyperparameters to be fine-tuned to obtain the best results from the model used in the methods of the invention. The hyperparameters are characterized in the descriptions of the respective operators below. In general, a convolution, whether 1-, 2- or 3-dimensional, is defined as a function with the following parameters: input_channels, output_channels and kernel_size, as well as optional parameters such as "stride" and "padding". A more formal definition for a particular toolkit (e.g., PyTorch) can be found, for example, at https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html.

Given these parameters, a convolution can transform a given input tensor in different ways:

- if input_channels < output_channels, the input tensor is upsampled in the channel dimension;

- if input_channels > output_channels, the input tensor is downsampled in the channel dimension;

- otherwise, the channel dimension remains unchanged.

Other parameters, such as "stride", "padding" and kernel_size, determine how the input tensor is transformed in the other (non-channel) dimensions, in particular whether they remain unchanged or are downsampled.
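The effect of these parameters on tensor shape can be sketched in a few lines. The output-size expression below is the standard one for a convolution with dilation 1; the helper names and the example sizes are hypothetical and serve only as an illustration.

```python
def channel_effect(input_channels, output_channels):
    """Describe what a convolution does to the channel dimension."""
    if input_channels < output_channels:
        return "upsampled"
    if input_channels > output_channels:
        return "downsampled"
    return "unchanged"

def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output length along one non-channel dimension (dilation assumed to be 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

print(channel_effect(2, 32))               # channels grow from 2 to 32
print(conv_output_size(128, 7, 1, 3))      # size preserved
print(conv_output_size(128, 3, 2, 1))      # size halved by stride 2
```

A stride of 1 with suitable padding leaves a dimension unchanged, while a stride of 2 halves it.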

Although somewhat more complex than linear layers, convolutions are essentially linear operations. Therefore, for a network of convolutions to learn non-linear mappings, non-linearities must be introduced.

FFC splits the channels of the input tensor into a local branch and a global branch. It should be noted that the split operation in the present context is a special case of a more general operation, commonly called "slicing" in this field, which can be performed in an arbitrary number of dimensions, is implemented in common deep learning toolkits (e.g., PyTorch, TensorFlow, JAX), and is a built-in feature of, for example, the Python programming language.

A split function defined on a tensor is an operation that divides the tensor along a given dimension in some specified proportion, yielding several smaller subsets of the input tensor that have the same size in every dimension except the one along which the split was performed. In FFC, the split is performed along the channel dimension to obtain two tensors, one of which is sent to the local branch and the other to the global branch, as shown in particular in FIG. 1.
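A channel-wise split by slicing can be sketched with plain Python lists. The function name, the rounding rule and the convention that the leading channels go to the local branch are assumptions made for illustration.

```python
def split_channels(channels, alpha):
    """Split a list of channels; a fraction `alpha` goes to the global branch."""
    n_global = round(len(channels) * alpha)
    n_local = len(channels) - n_global
    # Slicing along the channel dimension leaves every other dimension untouched.
    return channels[:n_local], channels[n_local:]

# A toy feature map with 8 channels, each of 4 frequency bins by 3 time bins.
feature_map = [[[float(c)] * 3 for _ in range(4)] for c in range(8)]
local_part, global_part = split_channels(feature_map, alpha=0.75)
print(len(local_part), len(global_part))
```

Both parts keep the 4 by 3 layout of each channel; only the number of channels differs.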

The local branch uses regular convolutional layers for local transformations of the tensor. In this context, the term "feature maps" (tensors) can be used interchangeably with the term "activations", as explained herein, in the sense of intermediate representations, i.e. tensors obtained by applying neural operators.

The global branch performs a Fourier transform of the feature map and updates it in the spectral domain, thereby affecting the global context. The present invention performs the Fourier transform only along the frequency dimension of the feature maps (which correspond to STFT spectrograms).

In particular, the present invention implements the global branch of the FFC layer in three steps:

1. Applying a real one-dimensional fast Fourier transform along the frequency dimension of the input feature map and concatenating the real and imaginary parts of the spectrum along the channel dimension.

2. Applying a convolutional block (with a 1×1 kernel) in the frequency domain.

3. Applying the inverse Fourier transform.
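The three steps can be sketched for a single time frame as follows. This is an illustration under stated assumptions, not the patented implementation: a naive real DFT stands in for the FFT, the feature map is indexed as x[channel][frequency], and the convolutional block is reduced to its linear 1×1 convolution (any normalization or activation inside the block is omitted). All function names are hypothetical.

```python
import cmath
import math

def rfft(x):
    """Naive real-input DFT returning bins 0 .. n//2."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n // 2 + 1)]

def irfft(spec, n):
    """Inverse of rfft: rebuild the Hermitian spectrum, then apply the inverse DFT."""
    full = list(spec) + [spec[n - k].conjugate() for k in range(n // 2 + 1, n)]
    return [sum(full[k] * cmath.exp(2j * math.pi * k * m / n) for k in range(n)).real / n
            for m in range(n)]

def pointwise_conv(channels, weight):
    """1x1 convolution: mix channels independently at every position."""
    n_pos = len(channels[0])
    return [[sum(w * ch[p] for w, ch in zip(row, channels)) for p in range(n_pos)]
            for row in weight]

def global_branch(x, weight):
    n_freq = len(x[0])
    spectra = [rfft(ch) for ch in x]                    # step 1: transform along frequency
    stacked = ([[v.real for v in s] for s in spectra]   # real parts as channels ...
               + [[v.imag for v in s] for s in spectra])  # ... followed by imaginary parts
    mixed = pointwise_conv(stacked, weight)             # step 2: 1x1 convolution
    half = len(mixed) // 2
    spectra = [[complex(re, im) for re, im in zip(mixed[c], mixed[half + c])]
               for c in range(half)]
    return [irfft(s, n_freq) for s in spectra]          # step 3: inverse transform

x = [[0.0, 1.0, 0.5, -1.0, 2.0, 0.0, -0.5, 1.5],
     [1.0, 1.0, 0.0, 0.0, -1.0, -1.0, 0.5, 0.5]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y = global_branch(x, identity)
```

With identity weights the branch returns its input up to rounding error, which is a convenient sanity check. With any other weights, each output value depends on every frequency position of the input, which is the sense in which the branch is global.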

The global and local branches interact with each other by summing their activations, as shown in FIG. 1.

An activation is the output of an arbitrary neural layer (by analogy with the activation of neurons in a biological brain). In the case of FFC blocks, the local branch contains two separate convolutional layers (module 130 in FIG. 1) whose outputs are summed, and the global branch is a linear combination (or, more simply, the sum) of a convolutional layer (module 130 in FIG. 1) and a spectral transform layer (module 131 in FIG. 1). In other words, in the context of the present invention, activations are the outputs of each individual neural layer (in particular, a convolutional layer (module 130) or a spectral transform layer (module 131)). Activations can also be regarded as intermediate representations of the neural network, obtained by sequential (block-by-block) processing of a given input signal.
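The summation of activations can be sketched structurally. The routing below (one convolution from each branch feeding the local output, and a cross convolution plus the spectral transform feeding the global output) is an assumption based on the general FFC design; the text above states only which layers are summed. The four operators are passed in as hypothetical callables, and the stand-ins used in the example are not real layers.

```python
def add(a, b):
    """Element-wise sum of two activations of equal size."""
    return [x + y for x, y in zip(a, b)]

def ffc_block(x_local, x_global, conv_l2l, conv_g2l, conv_l2g, spectral_g2g):
    # Local output: sum of two convolutional activations.
    y_local = add(conv_l2l(x_local), conv_g2l(x_global))
    # Global output: sum of a convolutional and a spectral-transform activation.
    y_global = add(conv_l2g(x_local), spectral_g2g(x_global))
    return y_local, y_global

# Toy stand-ins for the four operators, for illustration only.
double = lambda v: [2.0 * e for e in v]
keep = lambda v: list(v)
negate = lambda v: [-e for e in v]

y_local, y_global = ffc_block([1.0, 2.0], [10.0, 20.0], double, keep, keep, negate)
print(y_local, y_global)
```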

In one or more non-limiting exemplary embodiments, the present invention uses the same variant of FFC that was investigated in [10] for image retouching (inpainting), except that the present invention uses a one-dimensional Fourier transform along the frequency dimension.

FIG. 1 shows a diagram of the speech noise suppression device according to the invention, which uses a fast Fourier convolution neural module for speech enhancement. The parameter α controls the fraction of channels used in the global branch of the module.

Regular convolutional layers are then used for local transformations of the tensor on the local branch. A convolution can be described as applying a trainable sliding filter that interacts with different parts of the input tensor to produce the output tensor. Specifically, for each pair of input and output channels of the convolution (as explained herein), a filter of some kernel_size is applied, and this filter "slides" over the input tensor, combining information from different parts of it.
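The sliding filter can be sketched for the one-dimensional case. The sketch follows the deep learning convention (cross-correlation, no kernel flip) and uses no padding; the function name and the tensor layout are hypothetical.

```python
def conv1d(x, weight, stride=1):
    """x: [in_ch][length]; weight: [out_ch][in_ch][kernel_size]; no padding."""
    kernel_size = len(weight[0][0])
    out_len = (len(x[0]) - kernel_size) // stride + 1
    out = []
    for filters in weight:                       # one set of filters per output channel
        row = []
        for pos in range(out_len):               # the filter slides along the input
            start = pos * stride
            acc = 0.0
            for c, kernel in enumerate(filters):  # one filter per input channel
                for j in range(kernel_size):
                    acc += kernel[j] * x[c][start + j]
            row.append(acc)
        out.append(row)
    return out

signal = [[1.0, 2.0, 3.0, 4.0, 5.0]]      # one input channel
difference = [[[1.0, 0.0, -1.0]]]         # one output channel, kernel_size = 3
print(conv1d(signal, difference))
print(conv1d(signal, difference, stride=2))
```

Each output value depends only on kernel_size neighbouring inputs, which is what makes the operation local.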

Therefore, in the context of the present invention, convolutions are operations that perform local updates, where "locality" is determined by the filter size: the larger the convolution filter, the better it captures non-local features of the input tensor. However, increasing the filter size is computationally expensive. "Local updates" in this context refers to this local character of the convolution, as opposed to the spectral transform layer (part of the global branch), which interacts with the tensor globally.

The neural network performs the Fourier transform along the frequency dimension of the global-branch feature map; the feature map is then updated in the spectral domain by pointwise convolutional layers, after which the inverse Fourier transform is applied.

Finally, the activations of the local and global branches are summed, yielding an audio signal with enhanced speech and/or with noise reduced to such an extent that the speech in the processed audio signal is clearly intelligible and comprehensible to the listener, regardless of the playback volume of the processed audio signal.

The present invention implements two neural network architectures for speech enhancement. The first (FFC-AE) is inspired by [10] and is depicted in FIG. 2 (left). The second is inspired by the classical work [11] and is shown in FIG. 2 (right). Both architectures are described in more detail below with respect to the relevant aspects of the present invention.

FFC-AE

In the context of the present invention, the first model architecture is also referred to as the fast Fourier convolution autoencoder (FFC-AE). FFC-AE applies the architecture first described in [15] to the specific task of speech enhancement. This architecture consists of a convolutional encoder (a strided convolution) that downsamples the input spectrogram in the time and frequency dimensions by a factor of two. The encoder is followed by a series of residual blocks, each consisting of two sequential fast Fourier convolution modules. The output of the residual blocks is then upsampled by a transposed convolution and used to predict the real and imaginary parts of the denoised spectrogram.

The FFC-AE architecture according to the present invention operates using the following operators.

First, the input speech, in the form of the STFT representation of the audio signal (or, where appropriate, a spectrogram), is processed by a Conv-BN-ReLU operator with parameters (ker=7×7, in=2, out=in_ch) and a Conv-BN-ReLU operator with parameters (ker=3×3, in=in_ch, out=in_ch·2, stride=2). Then FFC(α) is applied N times, where α is a real number ∈ [0, 1] and N is the number of FFC residual blocks. The FFC results are summed and processed by the transposed convolution operator ConvTransposed-BN-ReLU(ker=3×3, in=in_ch·2, out=in_ch, stride=2). The convolution operator Conv(ker=7×7, in=in_ch, out=2) then processes the STFT representation, after which the representation is output. Here, the parameter in_ch defines the overall width of the network, N defines the number of FFC residual blocks, and α is a real number ∈ [0, 1], as mentioned above.
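The way tensor shapes evolve through this operator sequence can be traced with a short sketch. The kernel sizes, strides and channel counts are taken from the description above; the padding values (3 for 7×7, 1 for 3×3) and the output padding of the transposed convolution are assumptions chosen so that the output matches the input size, and the example sizes are arbitrary.

```python
def conv_out(size, kernel, stride, padding):
    """Standard convolution output size (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transposed_out(size, kernel, stride, padding, output_padding):
    """Standard transposed-convolution output size (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

def ffc_ae_shapes(freq_bins, time_bins, in_ch, n_blocks):
    """Return (operator, channels, freq, time) after each stage."""
    f, t = freq_bins, time_bins
    trace = [("input", 2, f, t)]
    f, t = conv_out(f, 7, 1, 3), conv_out(t, 7, 1, 3)      # padding=3 is assumed
    trace.append(("Conv-BN-ReLU 7x7", in_ch, f, t))
    f, t = conv_out(f, 3, 2, 1), conv_out(t, 3, 2, 1)      # padding=1 is assumed
    trace.append(("Conv-BN-ReLU 3x3, stride 2", in_ch * 2, f, t))
    for block in range(1, n_blocks + 1):                   # FFC blocks keep the shape
        trace.append((f"FFC residual block {block}", in_ch * 2, f, t))
    f, t = conv_transposed_out(f, 3, 2, 1, 1), conv_transposed_out(t, 3, 2, 1, 1)
    trace.append(("ConvTransposed-BN-ReLU 3x3, stride 2", in_ch, f, t))
    f, t = conv_out(f, 7, 1, 3), conv_out(t, 7, 1, 3)
    trace.append(("Conv 7x7", 2, f, t))
    return trace

for stage in ffc_ae_shapes(freq_bins=256, time_bins=64, in_ch=32, n_blocks=2):
    print(stage)
```

Under these assumptions the two output channels have the same frequency and time extent as the input, as required for predicting the real and imaginary parts of the spectrogram.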

Transposed convolution is the operation opposite to convolution. A convolution is characterized by the "stride" hyperparameter, which determines how densely or sparsely the trainable filters are applied to a given tensor (the larger the stride, the more the tensor is downsampled in the frequency and time dimensions). A transposed convolution, instead of downsampling, upsamples a given tensor in the frequency and time dimensions. The kernel size (ker) denotes the dimensions of the convolution filter.

In the present invention, the model is trained on batches of samples. Techniques that improve model training and make the procedure more stable include batch normalization (BN), which computes the mean and standard deviation of a given batch of samples and normalizes the input with these statistics.
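The normalization step can be sketched for the values of a single channel collected over a batch. The small constant eps and the omission of the learnable scale and shift that practical implementations usually add are simplifying assumptions.

```python
import math

def batch_norm(values, eps=1e-5):
    """Normalize one channel's values over a batch to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
print(normalized)
```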

ReLU is a non-linear activation function that maps negative values to zero while leaving non-negative values unchanged. Non-linearities are critical for training neural networks.
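A minimal sketch of ReLU follows, together with a toy two-layer scalar network (hypothetical weights) showing that inserting ReLU between two linear maps makes the overall mapping non-linear.

```python
def relu(values):
    """Map negative values to zero and keep non-negative values unchanged."""
    return [max(0.0, v) for v in values]

def tiny_network(x, w1=2.0, w2=3.0):
    """Two scalar linear maps with a ReLU in between (weights are illustrative)."""
    hidden = relu([w1 * x])[0]
    return w2 * hidden

print(relu([-2.0, 0.0, 3.5]))
print(tiny_network(1.0), tiny_network(-1.0))
```

A purely linear map f would satisfy f(-x) = -f(x); here the negative input is mapped to zero instead, so the composition is no longer linear.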

Although increasing the downsampling factor further reduces the number of operations during inference, the inventors have found that it also significantly degrades performance (denoising quality), whereas a factor of 2 provides a good trade-off between performance and complexity.

FFC-UNet

The second neural network architecture used in the present invention is referred to herein as FFC-UNet. Here, FFC layers are incorporated into the U-Net architecture. At each level of the U-Net structure, several FFC residual blocks are used together with convolutional upsampling or downsampling. In particular, the inventors have found it useful to make the parameter α (the fraction of channels routed to the global branch of the fast Fourier convolution) depend on the U-Net level at which the FFC is used.

The higher levels of the U-Net structure operate at higher data resolutions, where periodic structures are present, whereas the lower levels operate at a coarse scale lacking periodic structure. More generally, as noted in [8], the deeper layers of neural networks are presumed to rely mainly on local patterns, whereas the topmost layers strongly require global context when processing information. The global branch of the FFC layers is therefore less useful at coarse scales, and the present invention decreases the parameter α from 0.75 at the topmost level to 0 at the bottom level in steps of 0.25.
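The per-level schedule of α can be written out explicitly. The sketch below assumes that the values are evenly spaced from the top value down to zero, which with the stated numbers would correspond to four levels; the function name is hypothetical.

```python
def alpha_schedule(top=0.75, step=0.25):
    """Values of alpha from the topmost U-Net level down to the bottom level."""
    n_levels = round(top / step) + 1
    return [max(0.0, top - i * step) for i in range(n_levels)]

print(alpha_schedule())
```

At the bottom level α is 0, so no channels are routed to the global branch there.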

The FFC-UNet architecture according to the present invention operates using the following operators.

First, the input spectrogram is processed by the convolution operator Conv(ker=7×7, in=2, out=in_ch). Then, for i=1..K-1, the operator Conv(ker=1×1, in=in_ch·2^(i-1), out=in_ch·2^i, stride=2) is applied, followed by N applications of FFC(α[i]), where N is the number of FFC residual blocks. FFC(α=α[K]) is then applied N times and the results are summed. Next, for i=K-1..1, the transposed convolution operator ConvTransposed(ker=1×1, in=in_ch·2^i, out=in_ch·2^(i-1), stride=2) is applied, followed by N applications of FFC(α[i]), and the results are summed. The operator Conv(ker=1×1, in=in_ch·2^i, out=in_ch·2^(i-1)) processes the result of this summation, which is then passed to the operator Conv(ker=7×7, in=in_ch, out=2), and the processed spectrogram is finally output.
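The channel counts implied by these operators can be tabulated with a short sketch. It reads the description literally, with downsampling convolutions for i = 1..K-1 and mirrored transposed convolutions for i = K-1..1; the function name and the example values of in_ch and K are hypothetical.

```python
def ffc_unet_channel_plan(in_ch, depth):
    """(input, output) channel counts of the down- and up-sampling operators."""
    down = [(in_ch * 2 ** (i - 1), in_ch * 2 ** i) for i in range(1, depth)]
    up = [(in_ch * 2 ** i, in_ch * 2 ** (i - 1)) for i in range(depth - 1, 0, -1)]
    return down, up

down, up = ffc_unet_channel_plan(in_ch=32, depth=4)
print(down)
print(up)
```

The channel count doubles at every downsampling step and halves at every upsampling step, so the decoder ends at in_ch channels again.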

It should be noted that, in addition to using K, which denotes the depth of the FFC-UNet architecture, and K real numbers ∈ [0, 1], the FFC-UNet architecture is also characterized by a skip connection that carries the results of Conv(ker=7×7, in=2, out=in_ch) directly to a concatenation operator, which also concatenates the aforementioned summed results of N×FFC(α[i]).

Through the skip connection, some intermediate representations are sent from an upstream layer of the neural network to a downstream one to facilitate reconstruction. Thus, a tensor from the upstream layer is obtained together with the output of the most recent layer, and the two are concatenated along the channel dimension. Since this doubles the number of channels, the information is merged by a convolutional layer that halves the number of channels.
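The concatenate-then-halve pattern can be shown on a toy feature map. This is a minimal sketch in plain Python, assuming a feature map is stored as a list of channels, each a flat list of values; the function names and the averaging weights are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def concat_channels(skip, current):
    """Concatenate two feature maps along the channel dimension."""
    return skip + current


def pointwise_conv(feature_map, weights):
    """1x1 convolution: weights[o][c] mixes input channel c into output channel o."""
    n_positions = len(feature_map[0])
    return [
        [sum(w * channel[p] for w, channel in zip(row, feature_map))
         for p in range(n_positions)]
        for row in weights
    ]


skip = [[1.0, 2.0], [3.0, 4.0]]          # 2 channels from the upstream layer
current = [[10.0, 20.0], [30.0, 40.0]]   # 2 channels from the most recent layer
merged = concat_channels(skip, current)  # 4 channels after concatenation

# Hypothetical weights that average matching channels, giving 2 output channels.
weights = [[0.5, 0.0, 0.5, 0.0],
           [0.0, 0.5, 0.0, 0.5]]
fused = pointwise_conv(merged, weights)
print(len(merged), len(fused))  # 4 2
```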

Having described the FFC-AE and FFC-UNet neural network architectures used in the present invention to perform the fast Fourier transform on the input spectrogram of a speech signal, we now turn to the device for suppressing noise in a speech signal that implements the principles of the present invention.

Returning to FIG. 1, the inventive device for suppressing noise in a speech signal using the fast Fourier convolution operator is described in more detail. The device 100 for suppressing noise in a speech signal using the fast Fourier convolution operator, hereinafter also referred to as the speech noise suppression device 100, comprises a local branch module 110 and a global branch module 120, each of which comprises at least one 3×3 convolution module 130. In addition, the global branch module comprises a spectral transform module 140. The speech noise suppression device 100 further comprises one or more summation modules 150.

The spectral transform module 140 comprises an array of convolution operator modules 141, 142, 143, 144, a summation module 145, and a 1×1 convolution module 146.

It should be noted that the modules of the speech noise suppression device 100 can be implemented in a variety of ways depending on the implementation scenario of the present invention. In particular, these modules, which are responsible for the various convolution operators or other neural network operators, elements, etc., can be implemented using one or more processors, for example general-purpose central processing units (CPUs), digital signal processors (DSPs), microprocessors, etc., operating under the control of appropriate software elements, as well as integrated circuits, field-programmable gate arrays (FPGAs), or any other similar means well known to those skilled in the art. To provide an input audio signal, for example an audio signal containing speech, to the speech noise suppression device 100, particular embodiments of the invention may also use appropriate input means, for example one or more microphones, analog-to-digital converters (ADCs), etc. To output the enhanced audio signal, for example an enhanced and/or denoised audio signal containing speech, particular embodiments of the invention may likewise use appropriate output means, for example digital-to-analog converters (DACs) and loudspeaker(s) for reproducing the enhanced audio signal. However, it should be understood that the invention is not limited by any details concerning the presence and/or specific nature of these input/output means or the specific processing hardware used to implement the modules of the speech noise suppression device 100 mentioned above.

It should also be clearly understood that at least the modules of the speech noise suppression device 100 can be implemented in the form of software written in one or more programming languages, or in the form of executable code, as is well known to those skilled in the art. Such software may be embodied as a computer program or programs, a computer program product, in particular one embodied on a tangible computer-readable medium of any suitable kind, computer program element(s), blocks or modules. It may be stored locally or distributed over one or more wired or wireless networks, using one or more remote servers, etc. These details do not limit the scope of the present invention.

Thus, the speech noise suppression device 100 may also comprise at least one storage device, for example RAM, ROM, flash memory, EPROM, EEPROM, etc., and/or a removable storage medium, for permanent or temporary storage of the corresponding program instructions, as well as of the signals and/or data involved in suppressing noise in the speech signal and/or enhancing it in accordance with the present invention. Details of such a storage device are provided in various embodiments of the present invention and do not limit its scope.

The one or more neural networks in which the modules of the inventive speech noise suppression device 100 and/or the corresponding method steps are implemented, in particular the FFC-AE and FFC-UNet architectures described above, use one or more machine learning (ML) models, which can advantageously be trained to improve the speech enhancement or noise suppression techniques underlying the present invention. The training of the ML model is described in more detail below.

Training

The predicted short-time Fourier transform (STFT) spectrogram is converted to a waveform by the inverse short-time Fourier transform.

The invention can use the multi-discriminator adversarial training framework proposed in [12] to train the ML models. This framework comprises three losses: the GAN loss, the feature matching loss, and the mel-spectrogram loss.

The training of the neural networks in accordance with the present invention is described in more detail below. Note, however, that the invention is not limited to these specific details. The multi-discriminator adversarial training framework mentioned above can use generative adversarial networks (GANs), a widely used type of neural generative model. In general, a GAN consists of generator and discriminator neural networks that compete with each other. The generator network learns a mapping from the source domain to the target domain, while the discriminator learns to distinguish real objects from generated ones in the target domain. In this way, the discriminator pushes the generator to produce samples that are indistinguishable from real ones.

In the context of the present invention, the generator is trained to remove noise from a signal to obtain a clean one, while several discriminators are trained to distinguish the generator outputs from the reference clean waveforms. The feedback from the discriminators is also propagated through the generator, which allows both to be trained adversarially.

GAN loss

The GAN loss is the loss used for adversarial training. The present invention uses the LS-GAN loss [13] for a generator G_θ with parameters θ and discriminators D with parameters φ_1, …, φ_k (for further details see, e.g., [12]); these parameters comprise all the weights of the respective neural networks.

In this loss, y denotes the reference audio signal and x its noisy version; D is a discriminator with the parameter set {φ_i}, where i=1…k (k=3 is the number of discriminators); G is the generator with parameters θ.
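The formulas for this loss are not reproduced in the text. A reconstruction consistent with the symbols defined here and with the standard LS-GAN formulation of [13] might read as follows; the exact arrangement of the terms is an assumption rather than a quotation of the original formula:

$$
\mathcal{L}_{GAN}(\phi_i) = \mathbb{E}_{(x,y)}\left[\left(D_{\phi_i}(y) - 1\right)^2 + D_{\phi_i}\left(G_\theta(x)\right)^2\right], \quad i = 1, \dots, k,
$$

$$
\mathcal{L}_{GAN}(\theta) = \sum_{i=1}^{k} \mathbb{E}_{x}\left[\left(D_{\phi_i}\left(G_\theta(x)\right) - 1\right)^2\right].
$$

Under this reading, each discriminator is pushed to output 1 on reference waveforms and 0 on generated ones, while the generator is pushed to make every discriminator output 1 on its result.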

In accordance with the invention, the generator operates on noisy waveforms (x -> G(x)), while the discriminators operate on reference waveforms (y -> D(y)) and on the generator outputs (G(x) -> D(G(x))).

E denotes the expectation, taken over the space of both noisy and reference waveforms when training the discriminators (E_{(x, y)}) and over the space of noisy waveforms when training the generator (E_x); ∑ is the sum over all discriminators. In practice, the expectation is estimated by the Monte Carlo method, and the losses are computed on small subsamples (batches) rather than on the entire dataset.
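The batch estimate can be sketched as follows. This is a minimal illustration of a least-squares GAN objective in plain Python, assuming each discriminator returns one scalar score per waveform (a simplification) and that the targets are 1 for reference and 0 for generated samples; the function names are hypothetical.

```python
def mean(values):
    """Monte Carlo estimate of an expectation over a batch."""
    return sum(values) / len(values)


def discriminator_loss(real_scores, fake_scores):
    """Least-squares loss of one discriminator on a batch."""
    real_term = mean([(r - 1.0) ** 2 for r in real_scores])
    fake_term = mean([f ** 2 for f in fake_scores])
    return real_term + fake_term


def generator_loss(fake_scores_per_discriminator):
    """Least-squares generator loss, summed over all discriminators."""
    return sum(mean([(f - 1.0) ** 2 for f in scores])
               for scores in fake_scores_per_discriminator)


# Batch of 2 waveforms, k = 3 discriminators that separate real from fake perfectly.
real = [1.0, 1.0]
fake_per_d = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
d_losses = [discriminator_loss(real, fake) for fake in fake_per_d]
g_loss = generator_loss(fake_per_d)
print(d_losses, g_loss)  # [0.0, 0.0, 0.0] 3.0
```

With perfect discriminators the discriminator losses vanish and the generator loss is at its largest for these targets, which is the signal that drives the generator update.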

||·||₁ denotes the L₁ loss (the absolute difference of values).

Feature matching loss

The feature matching loss is computed as the L₁ distance between the intermediate activations of the discriminators computed for the reference sample and for the conditionally generated one (see, e.g., [14]). In this loss, T denotes the number of layers in a discriminator, and the activations and the size of the activations are taken in the j-th layer of the i-th discriminator.
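The formula itself is not reproduced in the text. A reconstruction consistent with the description might read as follows, where the symbols $D_i^j$ (activations) and $N_i^j$ (size of the activations) in the j-th layer of the i-th discriminator are assumed names and the normalization by $N_i^j$ is likewise an assumption:

$$
\mathcal{L}_{FM}(\theta) = \mathbb{E}_{(x,y)}\left[\sum_{i=1}^{k}\sum_{j=1}^{T}\frac{1}{N_i^j}\left\lVert D_i^j(y) - D_i^j\left(G_\theta(x)\right)\right\rVert_1\right].
$$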

Mel-spectrogram loss

The mel-spectrogram loss is the L₁ distance between the mel-spectrogram of the waveform synthesized by the generator and the mel-spectrogram of the reference waveform, where φ denotes the function that converts a waveform into the corresponding mel-spectrogram.
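The formula is not reproduced in the text. Written with the symbols already defined (x the noisy waveform, y the reference waveform, G the generator with parameters θ), the definition might be reconstructed as below; the name $\mathcal{L}_{Mel}$ is an assumption:

$$
\mathcal{L}_{Mel}(\theta) = \mathbb{E}_{(x,y)}\left[\left\lVert \varphi(y) - \varphi\left(G_\theta(x)\right)\right\rVert_1\right].
$$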

Final loss

The final losses for the generator and the discriminators combine the terms described above, with the feature matching and mel-spectrogram terms weighted by λ_fm and λ_mel, respectively. In all experiments, λ_fm=2 and λ_mel=45 were set.
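A minimal sketch of how these weights could enter the generator objective, assuming the final generator loss is the weighted sum L_GAN + λ_fm·L_FM + λ_mel·L_Mel used in the framework of [12]. The formula itself is not reproduced in the text, so this form is an assumption, and the function name is hypothetical.

```python
LAMBDA_FM = 2.0    # weight of the feature matching loss (from the text)
LAMBDA_MEL = 45.0  # weight of the mel-spectrogram loss (from the text)


def generator_total_loss(gan_loss, fm_loss, mel_loss,
                         lambda_fm=LAMBDA_FM, lambda_mel=LAMBDA_MEL):
    """Assumed weighted sum of the three generator loss terms."""
    return gan_loss + lambda_fm * fm_loss + lambda_mel * mel_loss


# Hypothetical per-batch values of the three terms.
total = generator_total_loss(gan_loss=1.0, fm_loss=0.5, mel_loss=0.1)
print(total)
```

With λ_mel=45, a mel-spectrogram error of 0.1 contributes more to the total than a GAN term of 1.0, so spectral reconstruction dominates the generator update under this reading.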

Referring to FIG. 4, the steps of the method for suppressing noise in a speech signal using at least one fast Fourier convolution operator in accordance with the invention are now considered.

In step S1, an audio signal containing speech is supplied to the speech noise suppression device according to the invention. The audio signal containing speech may be input using one or more means, for example microphone(s), if necessary via at least one analog-to-digital converter (ADC). However, this way of inputting the audio signal does not limit the range of possibilities; alternatively, the audio signal containing speech may be received in the form of a digital signal via one or more communication networks (wired or wireless), etc.

In step S2, an input tensor is derived from the (digitized, if necessary) audio signal containing speech, as described above; however, this may be done in any suitable manner well known in the art.

In step S3, the channels of the input tensor are split into a local branch and a global branch, as described above.

In step S4, feature maps are obtained for the global and local branches, as described above.

In step S5, regular convolutional layers are used for local updates of the tensor on the local branch. Here, the FFC-AE and FFC-UNet neural operators can be applied as described above.

In step S6, a Fourier transform is performed along the frequency dimension of the global branch feature map.

In step S7, the global branch feature map is updated in the spectral domain by pointwise convolutional layers, a pointwise convolutional layer being a convolution with a filter size of one.

In step S8, the inverse Fourier transform is applied to the updated global branch feature map.

In step S9, the activations of the local and global branches are summed.
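Steps S3 to S9 can be illustrated on a single spectrogram frame. The following is a toy sketch in plain Python, not the patented implementation: it uses a naive DFT instead of an FFT, applies one real-valued pointwise mixing matrix to the complex spectrum, and assumes α=0.5 so that both branches have the same number of channels and can be summed directly. The names `dft`, `pointwise_mix`, and `ffc_toy` are hypothetical.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform of a 1-D sequence."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t, v in enumerate(values))
        result.append(acc / n if inverse else acc)
    return result


def pointwise_mix(channels, weights):
    """Filter-size-one convolution: each output channel is a weighted sum of inputs."""
    n_bins = len(channels[0])
    return [[sum(w * ch[b] for w, ch in zip(row, channels)) for b in range(n_bins)]
            for row in weights]


def ffc_toy(frame, alpha, local_fn, spectral_weights):
    """frame: list of channels, each a list of values along the frequency axis."""
    n_global = round(alpha * len(frame))
    split = len(frame) - n_global
    local, global_ = frame[:split], frame[split:]           # S3: split channels
    local_out = [local_fn(ch) for ch in local]              # S5: local update
    spectrum = [dft(ch) for ch in global_]                  # S6: Fourier transform
    spectrum = pointwise_mix(spectrum, spectral_weights)    # S7: spectral update
    global_out = [[v.real for v in dft(ch, inverse=True)]   # S8: inverse transform
                  for ch in spectrum]
    return [[a + b for a, b in zip(l, g)]                   # S9: sum the branches
            for l, g in zip(local_out, global_out)]


frame = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
output = ffc_toy(frame, alpha=0.5, local_fn=lambda ch: list(ch),
                 spectral_weights=[[1.0]])
print([round(v, 6) for v in output[0]])
```

With an identity local update and identity spectral weights, the Fourier round trip leaves the global channel unchanged, so the result is simply the element-wise sum of the two channels. Because every spectral bin depends on all frequency positions, any non-trivial spectral weight changes the whole frequency axis at once, which is the global receptive field the operator relies on.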

In step S10, the processed audio signal containing speech, in which the speech has been enhanced and/or denoised by the above operations, is output. This may include converting the processed signal to an analog signal for subsequent playback, for example through one or more loudspeakers via one or more digital-to-analog converters (DACs), but such conversion is optional and does not limit the scope of the present invention. The effect of noise suppression in a speech signal can be demonstrated on the spectrogram of the corresponding audio signal: as mentioned above, noise is known to have an additive structure; therefore, in the noisy spectrogram, some amount of noise is present at some frequencies instead of "silence", while the clean speech is muffled. By contrast, as a result of the method according to the invention, the speech is loud and clear at all or most frequencies in the spectrogram of the resulting (processed) audio signal, while the noise is substantially reduced in both the non-speech and the speech parts of the processed signal.

Examples of implementation of the present invention are described below, together with experimental data from evaluating the performance of the inventive methods on illustrative datasets.

Datasets

In practical implementations of the present invention, two datasets were used to evaluate the effectiveness of the proposed speech noise suppression methods. The first was the VCTK-DEMAND dataset (see, e.g., [15]), a standard dataset for speech noise suppression systems. The training set consisted of 28 speakers with 4 signal-to-noise ratios (SNRs) (15, 10, 5, and 0 dB) and contained 11572 utterances. The test set (824 utterances) consisted of 2 speakers unseen by the model during training, with 4 SNRs (17.5, 12.5, 7.5, and 2.5 dB).
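For orientation, the sketch below shows how an additive mixture at a prescribed SNR can be formed in general. It is an illustration of the SNR definition only, not the procedure used to build VCTK-DEMAND; the function name and the toy signals are hypothetical.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power equals the target SNR."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixture = mix_at_snr(speech, noise, snr_db=10.0)
```

Lower SNR values such as 0 dB correspond to noise with the same power as the speech, which is the hardest of the four training conditions listed above.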

The Deep Noise Suppression (DNS) Challenge dataset was chosen as the second dataset (see, e.g., [16]). 100 hours of training data were synthesized using the provided code and the default configuration. The only change was that artificial reverberation was not used during synthesis. The models were tested on two test sets. The first (DNS-CUSTOM) consisted of hold-out data randomly selected and excluded from the synthesized 100 hours of training data. The second (DNS-BLIND) was the standard blind test set from the DNS repository; these data were recorded in the presence of natural noise in real-world scenarios.

Metrics

For objective evaluation of the samples in the tasks considered, the conventional metrics WB-PESQ (see [17]), STOI (see [18]), and scale-invariant signal-to-distortion ratio (SI-SDR) (see [19]) were used. In addition to the conventional speech quality metrics, the present inventors consider an absolute objective speech quality measure based on direct prediction of the MOS score by a fine-tuned wav2vec 2.0 model (WV-MOS).
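Of these metrics, SI-SDR has a simple closed form. The sketch below follows the usual definition: the estimate is projected onto the reference, and the energy of the projection is compared with the energy of the residual. It is a plain-Python illustration with a hypothetical function name, assuming equal-length signals, a non-zero reference, and a non-zero residual.

```python
import math


def si_sdr(estimate, target):
    """Scale-invariant signal-to-distortion ratio in dB."""
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    scale = dot / target_energy                  # optimal scaling of the target
    projection = [scale * t for t in target]     # part of the estimate explained by the target
    residual = [e - p for e, p in zip(estimate, projection)]
    signal_energy = sum(p * p for p in projection)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10(signal_energy / residual_energy)


target = [1.0, 0.0]
score = si_sdr([1.0, 1.0], target)         # equal signal and distortion energy
score_scaled = si_sdr([2.0, 2.0], target)  # rescaling the estimate leaves the score unchanged
print(score, score_scaled)  # 0.0 0.0
```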

Examples

The present invention is illustrated below by practical examples. In all examples, the signals are transformed into the spectral domain using an STFT with a Hann window of size 1024 and a hop size of 256. For the FFC-AE model, the present inventors set α=0.75, in_ch=32, N=9. For FFC-UNet, K=4, N=4, in_ch=32, with α=0.75 gradually decreased when moving from the upper layers to the lower ones according to the schedule described in section 2.3. The models were trained for 500 epochs with a batch size of 16, using the Adam optimizer with a learning rate of 0.0002.
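The STFT settings fix the shape of the spectrogram that the network sees. A minimal sketch, assuming a periodic Hann window, no padding, and a hypothetical 16000-sample input; the function names are illustrative and no real STFT library is used.

```python
import math

N_FFT = 1024  # Hann window size from the text
HOP = 256     # hop size from the text


def hann_window(size):
    """Periodic Hann window: w[n] = 0.5 - 0.5 * cos(2*pi*n / size)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / size) for n in range(size)]


def stft_shape(num_samples, n_fft=N_FFT, hop=HOP):
    """(frequency bins, frames) of a one-sided STFT without padding."""
    n_frames = 1 + (num_samples - n_fft) // hop
    return n_fft // 2 + 1, n_frames


window = hann_window(N_FFT)
shape = stft_shape(16000)
print(shape)  # (513, 59)
```

A window of 1024 samples yields 513 one-sided frequency bins, and a hop of 256 means that consecutive frames overlap by 75%.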

Phase estimation

The results of phase estimation on the LJ-Speech dataset are given in Table 1 below.

Table 1. Phase estimation on the LJ-Speech dataset

| Model | WV-MOS | # Params (million) |
|---|---|---|
| FFC-AE (ours) | 4.31 | 0.4 |
| basic U-Net | 4.19 | 20.7 |
| FFC-AE (abl.) | 4.03 | 0.7 |

The ability of the FFC-AE model to estimate phases from magnitude spectrograms was tested on the LJ-Speech dataset (see [23]) and compared with similar architectures using regular convolutions. The comparison included a U-Net model (basic U-Net) and a model identical to FFC-AE except that all Fourier blocks in the global branch are replaced with regular convolutions (FFC-AE (abl.)).

The models were trained to predict the sine and cosine of the phases from the magnitude spectrogram and were given full magnitude spectrograms as input. Note that, in general, learning the sine and cosine of the phases amounts to learning the complex spectrum from the magnitude spectrogram (which contains no complex information and therefore no phase information). Complex numbers can be represented by sines and cosines via Euler's formula, and this correspondence allows conversion between the two representations. For speech noise suppression, phase information was used in the experiments by training on the STFT. The results are given in Table 1 above. FFC-AE considerably outperforms the other models in phase prediction while having fewer parameters.
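The correspondence given by Euler's formula can be made concrete for a single spectrogram bin. In this sketch a magnitude and a predicted (cosine, sine) pair are combined into a complex value; renormalizing the pair to unit length is an added assumption for predictions that do not lie exactly on the unit circle, and the function name is hypothetical.

```python
import cmath
import math


def complex_bin(magnitude, cos_phase, sin_phase):
    """Build magnitude * (cos + i*sin); the (cos, sin) pair must not be (0, 0)."""
    norm = math.hypot(cos_phase, sin_phase)
    return magnitude * complex(cos_phase / norm, sin_phase / norm)


phase = math.pi / 3
value = complex_bin(2.0, math.cos(phase), math.sin(phase))
print(abs(value), cmath.phase(value))
```

The modulus of the result recovers the magnitude and its argument recovers the phase, so a full complex STFT can be assembled bin by bin before the inverse transform.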

Speech enhancement

The quality of the proposed models was compared with models from the literature on both datasets described above. On Voicebank, as can be seen from Table 2, FFC-AE considerably outperformed most of the other models in WV-MOS and gave competitive results on the other metrics given its compact size. FFC-UNet was comparable in quality to DEMUCS [3] in terms of WV-MOS while being about 8 times smaller (7.67 million versus 60.8 million parameters). On the DNS benchmark (Table 3), both models used in the inventive methods outperformed FullSubNet (see [22]) (one of the best models in the DNS Challenge 2021) in WV-MOS on both the DNS-CUSTOM and DNS-BLIND test sets.

The results of noise suppression in the speech signal are presented in Table 2 below.

Table 2: Noise suppression results on the Voicebank-DEMAND dataset

| Model | WV-MOS | SI-SDR | STOI | PESQ | #Params (million) |
|---|---|---|---|---|---|
| Reference | 4.50 | - | 1.00 | 4.64 | - |
| FFC-AE (ours) | 4.34 | 17.88 | 0.945 | 2.88 | 0.422 |
| FFC-UNet (ours) | 4.37 | 17.628 | 0.868 | 2.925 | 7.67 |
| VoiceFixer [20] | 4.14 | -18.5 | 0.89 | 2.38 | 122.1 |
| DEMUCS [3] | 4.37 | 18.5 | 0.95 | 3.03 | 60.8 |
| MetricGAN+ [7] | 3.90 | 8.5 | 0.93 | 3.13 | 2.7 |
| ResUNet-Decouple+ [21] | 4.127 | 18.40 | 0.84 | 2.45 | 102.57 |
| SE-Conformer [4] | 3.88 | 15.8 | 0.91 | 2.16 | 1.8 |
| Input | 2.99 | 8.4 | 0.92 | 1.97 | - |
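The size comparison can be checked directly from the parameter counts in Table 2. The short sketch below only restates that arithmetic; the dictionary `params_m` and the helper `size_ratio` are hypothetical names introduced here.

```python
# Parameter counts in millions, copied from the "#Params" column of Table 2.
params_m = {
    "FFC-AE": 0.422,
    "FFC-UNet": 7.67,
    "DEMUCS": 60.8,
    "ResUNet-Decouple+": 102.57,
}


def size_ratio(larger, smaller):
    """How many times more parameters `larger` has than `smaller`."""
    return params_m[larger] / params_m[smaller]


demucs_vs_unet = size_ratio("DEMUCS", "FFC-UNet")
resunet_vs_ae = size_ratio("ResUNet-Decouple+", "FFC-AE")
print(f"DEMUCS / FFC-UNet: {demucs_vs_unet:.1f}x")
print(f"ResUNet-Decouple+ / FFC-AE: {resunet_vs_ae:.0f}x")
```

The first ratio comes out just under 8, which matches the statement that FFC-UNet is roughly 8 times smaller than DEMUCS.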

Table 3 below presents experimental data illustrating a practical example of an implementation of the present invention; it shows the results of noise suppression in the speech signal on the DNS dataset. An asterisk (*) in Table 3 marks results on DNS-BLIND.

Table 3: Noise suppression results on the DNS dataset

| Model | WV-MOS | WV-MOS* | SI-SDR | STOI | PESQ | #Params (million) |
|---|---|---|---|---|---|---|
| Reference | 3.845 | - | 1.00 | 4.64 | - | - |
| FFC-AE (ours) | 3.075 | 2.529 | 13.398 | 0.832 | 2.419 | 0.422 |
| FFC-UNet (ours) | 3.243 | 2.623 | 14.68 | 0.851 | 2.565 | 7.67 |
| FullSubNet [22] | 2.90 | 2.41 | 14.96 | 0.82 | 2.43 | 5.6 |
| ResUNet-Decouple+ [21] | 2.942 | 1.124 | 14.78 | 0.810 | 2.083 | 102.57 |
| Input | 1.195 | - | 0.69 | 1.49 | - | - |

Embodiments of the present invention, as described above, provide a new architecture for noise suppression in a speech signal. The architecture is built on the recently proposed fast Fourier convolution neural operator [8]. The results demonstrate that the proposed models perform better than or on par with conventional systems.

In particular, the proposed architecture significantly outperforms the closest prior art [21], which uses basic convolutions within a U-Net architecture. In contrast, the proposed architecture uses fast Fourier convolution and achieves better performance with far fewer parameters. The present method is novel and inventive at least in that applying the fast Fourier convolution neural operator to speech enhancement and noise suppression produced a non-trivial result: fast Fourier convolution is highly effective for spectrogram reconstruction.
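The fast Fourier convolution operator on which the architecture is built (channel split into local and global branches, a regular convolution on the local branch, a Fourier transform along the frequency dimension of the global branch, a pointwise convolution in the spectral domain, an inverse transform, and summation of the branch activations) can be illustrated with a simplified NumPy sketch. Everything specific in it is an assumption made for illustration: the function names, the `(channels, frequency, time)` tensor layout, the 3x3 local kernel, the random weights, and the requirement that both branches produce the same number of channels so that they can be summed element-wise. Normalization layers, activations, and any cross-branch paths that a full implementation may contain are omitted.

```python
import numpy as np


def conv2d_same(x, w):
    """Regular 2D convolution with zero padding.

    x: (C_in, F, T) feature map; w: (C_out, C_in, k, k) kernel with odd k.
    """
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    n_freq, n_time = x.shape[1:]
    out = np.zeros((w.shape[0], n_freq, n_time))
    for i in range(k):
        for j in range(k):
            # Kernel tap (i, j): mix the channels of the shifted input.
            out += np.einsum("oc,cft->oft", w[:, :, i, j],
                             xp[:, i:i + n_freq, j:j + n_time])
    return out


def spectral_transform(x, w):
    """Global-branch update in the spectral domain.

    x: (C, F, T) feature map; w: (2C, 2C) weights of a 1x1 convolution.
    """
    n_ch, n_freq, _ = x.shape
    # Real 1D FFT along the frequency dimension.
    spec = np.fft.rfft(x, axis=1)
    # Concatenate real and imaginary parts along the channel dimension.
    stacked = np.concatenate([spec.real, spec.imag])
    # A 1x1 convolution is a per-position linear mix of channels.
    mixed = np.einsum("oc,cft->oft", w, stacked)
    # Back to a complex spectrum, then inverse FFT along frequency.
    spec_out = mixed[:n_ch] + 1j * mixed[n_ch:]
    return np.fft.irfft(spec_out, n=n_freq, axis=1)


def ffc(x, c_local, w_local, w_global):
    """Simplified fast Fourier convolution: split, update each branch, sum."""
    x_local, x_global = x[:c_local], x[c_local:]
    y_local = conv2d_same(x_local, w_local)
    y_global = spectral_transform(x_global, w_global)
    return y_local + y_global


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 10))  # 4 channels, 16 frequency bins, 10 frames
w_local = 0.1 * rng.standard_normal((2, 2, 3, 3))
w_global = 0.1 * rng.standard_normal((4, 4))
y = ffc(x, 2, w_local, w_global)
print(y.shape)
```

Because the pointwise convolution acts on every frequency coefficient of the global branch, each output value depends on the whole frequency axis of the input, whereas the local branch only sees a 3x3 neighborhood. With identity weights the spectral branch returns its input unchanged, which is a convenient sanity check for the transform pair.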

A method and a device for speech enhancement and noise suppression in an audio signal containing speech have been described above. Those skilled in the art will appreciate that the invention may be implemented in various combinations of hardware and software, and no such specific combination limits the scope of the present invention. The modules described above that make up the device of the invention may be implemented as separate hardware units, or two or more modules may be implemented as a single hardware unit, or the system of the invention may be implemented as one or more computers or processors (CPUs), for example general-purpose or specialized processors such as digital signal processors (DSPs), or as one or more ASICs, FPGAs, logic gates, etc. Alternatively, one or more modules may be implemented as software, for example a program or programs, or element(s) or module(s) of a computer program, that control one or more computers, CPUs, etc., to carry out the method steps and/or operations detailed above. This software may be embodied on one or more computer-readable media, which are well known to those skilled in the art; it may be stored in one or more memory units, such as ROM, RAM, flash memory, EEPROM, etc., or may be supplied, for example, from remote servers via one or more wired and/or wireless network connections, the Internet, an Ethernet connection, a LAN, or other local or wide area computer networks as needed.

Those skilled in the art will appreciate that only some of the possible examples of methods, materials, and technical means for implementing embodiments of the present invention are described above and shown in the drawings. The detailed description of the above embodiments is not intended to limit the present invention or to define the scope of its legal protection.

Other embodiments that may fall within the scope of the present invention may be devised by those skilled in the art upon studying the foregoing description with reference to the accompanying drawings, and all such obvious modifications, alterations, and/or equivalent substitutions are to be included within the scope of the present invention. All prior art references cited and discussed herein are hereby incorporated by reference where applicable.

While the present invention has been described and illustrated with reference to various embodiments, those skilled in the art will understand that various changes in form and specific details may be made without departing from the scope of the present invention, which is defined only by the following claims and their equivalents.

Bibliography

[1] M. Tagliasacchi, Y. Li, K. Misiunas, and D. Roblek, “Seanet: A multi-modal speech enhancement network,” arXiv preprint arXiv:2009.02095, 2020.

[2] S. Pascual, A. Bonafonte, and J. Serra, “Segan: Speech enhancement generative adversarial network,” arXiv preprint arXiv:1703.09452, 2017.

[3] A. Defossez, G. Synnaeve, and Y. Adi, “Real time speech enhancement in the waveform domain,” in Interspeech, 2020.

[4] E. Kim and H. Seo, “SE-Conformer: Time-Domain Speech Enhancement Using Conformer,” in Proc. Interspeech 2021, 2021, pp. 2736-2740.

[5] H.-S. Choi, J.-H. Kim, J. Huh, A. Kim, J.-W. Ha, and K. Lee, “Phase-aware speech enhancement with deep complex u-net,” in International Conference on Learning Representations, 2018.

[6] S.-W. Fu, C.-F. Liao, Y. Tsao, and S.-D. Lin, “Metricgan: Generative adversarial networks based black-box metric scores optimization for speech enhancement,” in International Conference on Machine Learning. PMLR, 2019, pp. 2031-2041.

[7] S.-W. Fu, C. Yu, T.-A. Hsieh, P. Plantinga, M. Ravanelli, X. Lu, and Y. Tsao, “Metricgan+: An improved version of metricgan for speech enhancement,” arXiv preprint arXiv:2104.03538, 2021.

[8] L. Chi, B. Jiang, and Y. Mu, “Fast fourier convolution,” Advances in Neural Information Processing Systems, vol. 33, pp. 4479-4488, 2020.

[9] H. J. Nussbaumer, “The fast fourier transform,” in Fast Fourier Transform and Convolution Algorithms. Springer, 1981, pp. 80-111.

[10] R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova, A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and V. Lempitsky, “Resolution-robust large mask inpainting with fourier convolutions,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 2149-2159.

[11] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234-241.

[12] J. Kong, J. Kim, and J. Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” arXiv preprint arXiv:2010.05646, 2020.

[13] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least squares generative adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2794-2802.

[14] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. Courville, “Melgan: Generative adversarial networks for conditional waveform synthesis,” arXiv preprint arXiv:1910.06711, 2019.

[15] C. Valentini-Botinhao et al., “Noisy speech database for training speech enhancement algorithms and tts models,” 2017.

[16] H. Dubey, V. Gopal, R. Cutler, A. Aazami, S. Matusevych, S. Braun, S. E. Eskimez, M. Thakker, T. Yoshioka, H. Gamper et al., “Icassp 2022 deep noise suppression challenge,” arXiv preprint arXiv:2202.13288, 2022.

[17] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,” in 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221), vol. 2. IEEE, 2001, pp. 749-752.

[18] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125-2136, 2011.

[19] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “Sdr-half-baked or well done?” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626-630.

[20] H. Liu, Q. Kong, Q. Tian, Y. Zhao, D. Wang, C. Huang, and Y. Wang, “Voicefixer: Toward general speech restoration with neural vocoder,” arXiv preprint arXiv:2109.13731, 2021.

[21] Q. Kong, Y. Cao, H. Liu, K. Choi, and Y. Wang, “Decoupling magnitude and phase estimation with deep resunet for music source separation,” arXiv preprint arXiv:2109.05418, 2021.

[22] X. Hao, X. Su, R. Horaud, and X. Li, “Fullsubnet: a full-band and sub-band fusion model for real-time single-channel speech enhancement,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6633-6637.

[23] K. Ito and L. Johnson, “The lj speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.

Claims

1. A method for suppressing noise in a speech signal using at least one fast Fourier convolution operator, the method comprising the steps of:

dividing the channels of an input tensor into local and global branches;

using regular convolutional layers for local updates of tensor feature maps on the local branch;

performing a Fourier transform in the frequency dimension of the global branch tensor;

updating the feature map of the global branch in the spectral domain through pointwise convolutional layers;

applying an inverse Fourier transform to the updated global branch feature map; and

summing the activations of the local and global branches.

2. The method of claim 1, wherein the input tensor channels are derived from an input spectrogram representing an audio signal containing speech.

3. The method of claim 1, wherein the summed activations of the local and global branches are reflected in an output spectrogram representing an enhanced audio signal containing speech.

4. The method of claim 1, wherein the at least one fast Fourier convolution operator is part of a fast Fourier convolution autoencoder (FFC-AE) neural network architecture.

5. The method of claim 1, wherein the at least one fast Fourier convolution operator is part of a fast Fourier convolution U-Net (FFC-UNet) neural network architecture.

6. The method of claim 1, wherein updating the global branch feature map comprises the steps of:

applying a real one-dimensional fast Fourier transform along the frequency dimension of the input feature map and concatenating the real and imaginary parts of the spectrum along the channel dimension;

applying a convolutional block (with a 1x1 kernel) in the frequency domain; and

applying the inverse Fourier transform.

7. The method of claim 1, wherein the at least one fast Fourier convolution operator is applied to the frequency domain.

8. The method of claim 1, further comprising using a convolutional neural network with one or more machine learning (ML) models trained by a multi-discriminator adversarial learning toolkit.

9. A device for suppressing noise in a speech signal using a fast Fourier convolution operator, the device comprising:

a memory; and

a processor coupled to the memory, wherein the processor, when executing instructions stored in the memory, is configured to:

divide the channels of an input tensor into local and global branches;

use regular convolutional layers for local updates of tensor feature maps on the local branch;

perform a Fourier transform in the frequency dimension of the global branch tensor;

apply an inverse Fourier transform to the updated global branch feature map; and

sum the activations of the local and global branches.

10. A computer-readable medium storing processor-executable instructions that, when executed by at least one processor, cause the at least one processor to carry out the method of any one of claims 1-8.