RU2782364C1 - Apparatus and method for isolating sources using sound quality assessment and control

Info

Publication number: RU2782364C1
Application number: RU2021121442A
Authority: RU
Inventors: Christian Uhle; Matteo Torcoli; Sascha Disch; Jouni Paulus; Jürgen Herre; Oliver Hellmuth; Harald Fuchs
Original assignee: Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Priority date: 2018-12-21
Filing date: 2019-12-20
Publication date: 2022-10-26

Abstract

FIELD: computing technology.

SUBSTANCE: invention relates to the field of computing technology for processing audio data. The technical result is achieved by determining the estimated target signal depending on the input audio signal; determining the resulting values, depending on the estimated sound quality of the estimated target signal, in order to obtain one or multiple parameter values; and forming an isolated audio signal, depending on the one or multiple parameter values and depending on one of the estimated target signal, the input audio signal, and the estimated difference signal; wherein the estimated difference signal is an estimate of a signal containing only a section of the difference audio signal, wherein the isolated audio signal is formed depending on the parameter values and depending on the linear combination of the estimated target signal and the input audio signal; or wherein the isolated audio signal is formed depending on the parameter values and depending on the linear combination of the estimated target signal and the estimated difference signal.

EFFECT: maximum reduction of noise level on the condition of absence of artifacts.

16 cl, 6 dwg

Description

The present invention relates to audio source separation, in particular to signal-adaptive control of the sound quality of separated output signals, and more particularly to an apparatus and a method for source separation using sound quality estimation and control.

In source separation, the quality of the output signals degrades, and this degradation increases monotonically with the attenuation of the interfering signals.

Audio source separation has been performed in the past.

Audio source separation aims at obtaining a target signal $s(n)$ given a mixture signal $x(n)$,

$x(n) = s(n) + b(n)$ (1)

where $b(n)$ comprises all interfering signals and is referred to in the following as the "interfering signal". The output of the separation $h(\cdot)$ is an estimate $\hat{s}(n)$ of the target signal,

$\hat{s}(n) = h(x(n))$ (2)

and possibly, in addition, an estimate $\hat{b}(n)$ of the interfering signal,

$[\hat{s}(n), \hat{b}(n)] = h(x(n))$ (3)

Such processing typically introduces artifacts into the output signal that degrade the sound quality. This degradation of sound quality increases monotonically with the amount of separation, that is, with the attenuation of the interfering signals. Many applications do not require total separation but only a partial enhancement: the interfering sounds are attenuated but still present in the output signal.

This has the additional benefit that the sound quality is higher than in fully separated signals, because fewer artifacts are introduced and the leakage of the interfering signals partially masks the perceived artifacts.
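As a rough illustration of partial enhancement, the sketch below remixes a target and an interferer with a residual gain instead of removing the interferer completely. It assumes idealised, artifact-free estimates of both signals; the function name `partial_enhancement` and the parameter `attenuation_db` are hypothetical and not taken from the source.

```python
import numpy as np

def partial_enhancement(target_est, interferer_est, attenuation_db):
    """Remix the target estimate with an attenuated interferer estimate.

    attenuation_db is the desired attenuation of the interferer in dB
    (0 dB leaves the mixture unchanged; large values approach full separation).
    """
    gain = 10.0 ** (-attenuation_db / 20.0)  # dB -> linear amplitude gain
    return target_est + gain * interferer_est

# Toy signals: a sine "target" and a fixed pseudo-random "interferer".
n = np.arange(1000)
s = np.sin(2 * np.pi * 5 * n / 1000)
b = np.random.default_rng(0).standard_normal(1000)

y = partial_enhancement(s, b, attenuation_db=12.0)
residual = y - s  # what is left of the interferer in the output
measured_db = 10 * np.log10(np.sum(b**2) / np.sum(residual**2))
```

With ideal estimates, `measured_db` equals the requested attenuation. With real separator outputs, the estimates themselves carry artifacts, which is why the amount of attenuation has to be traded against sound quality.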

Partial masking of an audio signal means that its loudness (e.g., its perceived intensity) is partially reduced. Furthermore, it can be desired and required that, rather than achieving a large attenuation, the sound quality of the output does not fall below a defined sound quality level.

An example of such an application is dialogue enhancement. The audio signals in TV and radio broadcasts and in movie sound are often mixtures of speech signals and background signals, e.g., environmental sounds and music. When these signals are mixed such that the level of the speech is too low compared to the level of the background, the listener may have difficulty understanding what has been said, or understanding may require very high listening effort, which results in listener fatigue. In such scenarios, methods for automatically reducing the level of the background can be applied, but the result should be of high sound quality.

Various methods for source separation exist in the prior art. Separating a target signal from a mixture of signals has been discussed in the prior art. These methods can be categorized into two approaches. The first category of methods is based on formulated assumptions about the signal model and/or the mixing model. The signal model describes characteristics of the input signals, here $s(n)$ and $b(n)$. The mixing model describes how the input signals are combined to yield the mixture signal $x(n)$, here by means of addition.

Based on these assumptions, a method is designed analytically or heuristically. For example, the method of Independent Component Analysis can be derived by assuming that the mixture contains two source signals that are statistically independent, that the mixture was captured by two microphones, and that the mixing was obtained by adding both signals (producing an instantaneous mixture). The inverse of the mixing process is then derived mathematically as the inverse of the mixing matrix, and the elements of this unmixing matrix are computed according to a specified method. Most analytically derived methods are obtained by formulating the separation problem as the numerical optimization of a criterion, e.g., the mean squared error between the true target and the estimated target.
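The unmixing step for an instantaneous two-microphone mixture can be sketched as follows. For illustration only, the mixing matrix `A` is assumed to be known; an actual Independent Component Analysis would have to estimate the unmixing matrix from the microphone signals alone, using the statistical independence of the sources. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
sources = rng.laplace(size=(2, 1000))  # two independent source signals (rows)

# Instantaneous mixing: each microphone records a weighted sum of the sources.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])             # hypothetical mixing matrix
mics = A @ sources

# The unmixing matrix is the inverse of the mixing matrix.
W = np.linalg.inv(A)
recovered = W @ mics
```

Because the mixing here is a plain matrix product, applying `W` restores the sources up to numerical precision; the hard part in practice is obtaining `W` without knowing `A`.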

The second category is data driven. Here, a representation of the target signals is estimated, or a set of parameters for retrieving the target signals from the input mixture is estimated. The estimation is based on a model that has been trained on a set of training data, hence the name "data driven". The estimate is obtained by optimizing a criterion, e.g., by minimizing the mean squared error between the true target and the estimated target, given the training data. An example of this category is Artificial Neural Networks (ANNs) that have been trained to output an estimate of a speech signal given a mixture of a speech signal and an interfering signal. During training, the adjustable parameters of the artificial neural network are determined such that a performance criterion computed for the set of training data is optimized on average over the entire data set.

Regarding source separation, a solution that is optimal in the mean squared error sense, or optimal with respect to any other numerical criterion, is not necessarily the solution with the highest sound quality preferred by human listeners.

A second problem stems from the fact that source separation always results in two effects: first, the desired attenuation of the interfering sounds and, second, the undesired degradation of the sound quality. Both effects are correlated, e.g., increasing the desired effect increases the undesired effect. The ultimate aim is to control the trade-off between the two.

Sound quality can be estimated, e.g., quantified by means of a listening test or by means of computational models of sound quality. Sound quality has multiple aspects, referred to in the following as Sound Quality Components (SQCs).

For example, the sound quality is determined by the perceived intensity of artifacts (signal components that have been introduced by signal processing, e.g., source separation, and that decrease the sound quality).

Or, for example, the sound quality is determined by the perceived intensity of the interfering signals, or by speech intelligibility (when the target signal is speech), or by the overall sound quality.

Various computational models of sound quality exist that compute (estimate) the sound quality components $q_m$, $1 \le m \le M$, where $M$ denotes the number of sound quality components.

Such methods typically estimate a sound quality component given the target signal and an estimate of the target signal,

$q_m = f(s(n), \hat{s}(n))$ (4)

or given also the interfering signal,

$q_m = f(s(n), b(n), \hat{s}(n))$ (5)

In practical applications, the target signals $s(n)$ (and the interfering signals $b(n)$) are not available; otherwise, separation would not be required. When only the input signal $x(n)$ and estimates $\hat{s}(n)$ of the target signal are available, the sound quality components cannot be computed with these methods.

Various computational models for estimating aspects of sound quality, including intelligibility, have been described in the prior art.

Blind Source Separation Evaluation (BSSEval) (see [1]) is a multi-criteria performance evaluation toolbox. The estimated signal is decomposed by orthogonal projection into a target signal component, interference from other sources, and artifacts. The metrics are computed as energy ratios of these components and expressed in dB. They are the Source to Distortion Ratio (SDR), the Source to Interference Ratio (SIR), and the Source to Artifact Ratio (SAR).
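The decomposition and the three energy ratios can be sketched as below. This is a simplified, gain-only variant written for illustration, assuming one target and one interferer and no time-invariant distortion filters; it is not the BSSEval toolbox itself, and the function names are hypothetical.

```python
import numpy as np

def energy_ratio_db(num, den):
    """Energy ratio of two signals in dB."""
    return 10.0 * np.log10(np.sum(num**2) / np.sum(den**2))

def bss_eval_sketch(estimate, target, interferer):
    """Simplified BSSEval-style decomposition (gain-only, no filters)."""
    # Component of the estimate explained by the true target.
    s_target = (estimate @ target) / (target @ target) * target
    # Orthogonal projection onto the span of target and interferer.
    basis = np.stack([target, interferer], axis=1)
    coeffs, *_ = np.linalg.lstsq(basis, estimate, rcond=None)
    projection = basis @ coeffs
    e_interf = projection - s_target  # interference from the other source
    e_artif = estimate - projection   # everything else counts as artifacts
    return {
        "SDR": energy_ratio_db(s_target, e_interf + e_artif),
        "SIR": energy_ratio_db(s_target, e_interf),
        "SAR": energy_ratio_db(s_target + e_interf, e_artif),
    }

# Mutually orthogonal toy signals with equal energy.
n = np.arange(1000)
s = np.sin(2 * np.pi * 5 * n / 1000)
b = np.cos(2 * np.pi * 5 * n / 1000)
artifact = np.sin(2 * np.pi * 50 * n / 1000)
metrics = bss_eval_sketch(s + 0.1 * b + 0.01 * artifact, s, b)
```

In this toy case the interferer leaks at one tenth of its amplitude, so the SIR comes out at 20 dB. Note that the function needs the true `target` and `interferer`, which are exactly the signals that are unavailable in a practical application.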

Perceptual Evaluation methods for Audio Source Separation (PEASS) (see [2]) was designed as a perceptually motivated successor of BSSEval. The signal projection is carried out on time segments and with a gammatone filterbank.

PEMO-Q (see [3]) is used to provide multiple features. Four perceptual scores are obtained from these features using a neural network trained with subjective ratings. The scores are the Overall Perceptual Score (OPS), the Interference-related Perceptual Score (IPS), the Artifact-related Perceptual Score (APS), and the Target-related Perceptual Score (TPS).

Perceptual Evaluation of Audio Quality (PEAQ) (see [4]) is a metric designed for audio coding. It employs a peripheral ear model to calculate the basilar membrane representations of the reference signal and the test signal. Aspects of the difference between these representations are quantified by several output variables. By means of a neural network trained with subjective data, these variables are combined to give the main output, e.g., the Overall Difference Grade (ODG).

Perceptual Evaluation of Speech Quality (PESQ) (see [5]) is a metric designed for speech transmitted over telecommunication networks. Hence, the method includes a pre-processing stage that mimics a telephone handset. Measures for audible disturbances are computed from the specific loudness of the signals and combined into PESQ scores. From these, a Mean Opinion Score (MOS) is predicted by means of a polynomial mapping function (see [6]).

ViSQOLAudio (see [7]) is a metric designed for music encoded at low bit rates, developed from the Virtual Speech Quality Objective Listener (ViSQOL). Both metrics are based on a model of the peripheral auditory system to create internal representations of the signals, called neurograms. These are compared via an adaptation of the structural similarity index, originally developed for evaluating the quality of compressed images.

The Hearing-Aid Audio Quality Index (HAAQI) (see [8]) is an index designed to predict music quality for individuals listening through hearing aids. The index is based on a model of the auditory periphery, extended to include the effects of hearing loss. It is fitted to a database of quality ratings made by listeners having normal or impaired hearing. The hearing loss simulation can be bypassed, and the index then also becomes valid for normal-hearing listeners. Based on the same auditory model, the authors of HAAQI also proposed an index for speech quality, the Hearing-Aid Speech Quality Index (HASQI) (see [9]), and an index for speech intelligibility, the Hearing-Aid Speech Perception Index (HASPI) (see [10]).

Short-Time Objective Intelligibility (STOI) (see [11]) is a measure that is expected to have a monotonic relation with the average speech intelligibility. It especially addresses speech processed by some type of time-frequency weighting.

In [12], an artificial neural network is trained to estimate the Source to Distortion Ratio given only the input signal and the output estimated target signal, whereas the calculation of the Source to Distortion Ratio would normally also take the true target and the interfering signal as inputs. A pool of separation algorithms is run in parallel on the same input signal. The Source to Distortion Ratio estimates are used to select, for each time frame, the output of the algorithm with the best Source to Distortion Ratio. Hence, no control over the trade-off between sound quality and separation is formulated, and no control of the parameters of a separation algorithm is proposed. Moreover, the Source to Distortion Ratio is used, which is not perceptually motivated and has been shown to correlate poorly with perceived quality, e.g., in [13].
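The per-frame selection step described for [12] might look like the following sketch. The SDR estimates are passed in as plain numbers here, whereas in [12] they come from a trained network; the function name and array layout are assumptions made for this example.

```python
import numpy as np

def select_best_per_frame(outputs, estimated_sdr):
    """Pick, for each time frame, the output of the algorithm with the
    highest estimated SDR.

    outputs:       array of shape (num_algorithms, num_frames, frame_length)
    estimated_sdr: array of shape (num_algorithms, num_frames), in dB
    """
    best = np.argmax(estimated_sdr, axis=0)  # winning algorithm per frame
    frames = np.arange(outputs.shape[1])
    return outputs[best, frames], best

# Two hypothetical algorithms, three frames of four samples each.
outputs = np.stack([np.zeros((3, 4)), np.ones((3, 4))])
estimated_sdr = np.array([[9.0, 2.0, 5.0],
                          [4.0, 7.0, 5.5]])
selected, best = select_best_per_frame(outputs, estimated_sdr)
```

The selection only switches between complete algorithm outputs; it exposes no parameter with which the balance between attenuation and sound quality could be adjusted.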

Moreover, there are recent works on speech enhancement by supervised learning in which estimates of sound quality components are integrated into the cost functions, whereas speech enhancement models are traditionally optimized based on the mean squared error (MSE) between the estimated and the clean speech. For example, in [14], [15], [16], cost functions based on STOI are used instead of the MSE. In [17], reinforcement learning based on PESQ or PEASS is used. Yet, no control over the trade-off between sound quality and separation is available.

In [18], an audio processing device is proposed in which an audibility measure is used together with an artifact identification measure to control the time-frequency gains applied by the processing. This is done to ensure that, e.g., the amount of noise reduction is at a maximum level subject to the constraint that no artifacts are introduced; the trade-off between sound quality and separation is fixed. Moreover, the system does not involve supervised learning. To identify artifacts, the kurtosis ratio is used, a measure that directly compares output and input signals (possibly in segments where speech is not present), without the need for the true target and the interfering signal. This simple measure is enriched by an audibility measure.
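A kurtosis ratio of the kind mentioned for [18] could be computed along the following lines. The exact definition used in [18] is not given in this text, so the form below (kurtosis of the output divided by kurtosis of the input) is an assumption, as are the function names and the crude thresholding used to imitate an artifact-prone processor.

```python
import numpy as np

def kurtosis(x):
    """Non-excess kurtosis: fourth central moment over squared variance."""
    centered = x - np.mean(x)
    return np.mean(centered**4) / np.mean(centered**2) ** 2

def kurtosis_ratio(output, input_):
    """Assumed form: kurtosis of the output relative to that of the input."""
    return kurtosis(output) / kurtosis(input_)

rng = np.random.default_rng(0)
noise_in = rng.standard_normal(20000)  # stand-in for a noise-only segment
# Crude stand-in for aggressive processing: keep only isolated peaks.
processed = np.where(np.abs(noise_in) > 1.5, noise_in, 0.0)
ratio = kurtosis_ratio(processed, noise_in)
```

A ratio close to 1 indicates that the processing left the amplitude statistics of the segment unchanged, while a clearly larger ratio indicates isolated peaks of the kind perceived as artifacts. Only the input and the output are needed, not the true target or interferer.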

The object of the present invention is to provide improved concepts for source separation. This object is achieved by an apparatus according to claim 1, by a method according to claim 16, and by a computer program according to claim 17.

An apparatus for generating a separated audio signal from an input audio signal is provided. The input audio signal comprises a target audio signal portion and a difference audio signal portion. The difference audio signal portion indicates the difference between the input audio signal and the target audio signal portion. The apparatus comprises a source separator, a determining module, and a signal processor. The source separator is configured to determine an estimated target signal which depends on the input audio signal, the estimated target signal being an estimate of a signal that comprises only the target audio signal portion. The determining module is configured to determine one or more result values depending on an estimated sound quality of the estimated target signal to obtain one or more parameter values, wherein the one or more parameter values are the one or more result values or depend on the one or more result values. The signal processor is configured to generate the separated audio signal depending on the one or more parameter values and depending on at least one of the estimated target signal, the input audio signal, and an estimated difference signal, the estimated difference signal being an estimate of a signal that comprises only the difference audio signal portion.

Moreover, a method for generating a separated audio signal from an input audio signal is provided. The input audio signal comprises a target audio signal portion and a difference audio signal portion. The difference audio signal portion indicates the difference between the input audio signal and the target audio signal portion. The method comprises:

- determining an estimated target signal which depends on the input audio signal, the estimated target signal being an estimate of a signal that comprises only the target audio signal portion;

- determining one or more result values depending on an estimated sound quality of the estimated target signal to obtain one or more parameter values, wherein the one or more parameter values are the one or more result values or depend on the one or more result values; and

- generating the separated audio signal depending on the one or more parameter values and depending on at least one of the estimated target signal, the input audio signal, and an estimated difference signal, the estimated difference signal being an estimate of a signal that comprises only the difference audio signal portion.
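The three steps of the method can be sketched as a small pipeline. Everything below is a placeholder chosen for illustration: `separate` and `estimate_quality` are hypothetical stand-ins for a real source separator and a real reference-free quality estimator, and using the quality estimate directly as the mixing weight `p` is only one conceivable mapping from result value to parameter value.

```python
import numpy as np

def separate(x):
    """Hypothetical source separator returning an estimated target signal."""
    return 0.8 * x  # placeholder only; a real separator would go here

def estimate_quality(s_hat, x):
    """Hypothetical reference-free quality estimator with output in [0, 1]."""
    return 0.4  # placeholder only; e.g., a trained network in practice

def form_output(x, s_hat, p):
    """Linear combination of estimated target and input, controlled by p."""
    return p * s_hat + (1.0 - p) * x

x = np.sin(2 * np.pi * np.arange(480) / 48.0)  # toy input audio signal
s_hat = separate(x)                            # step 1: estimated target signal
quality = estimate_quality(s_hat, x)           # step 2: result value
p = float(np.clip(quality, 0.0, 1.0))          # parameter value from result value
y = form_output(x, s_hat, p)                   # step 3: separated audio signal
```

With `p = 1` the output is the estimated target signal alone, and with `p = 0` it is the unprocessed input, so a low quality estimate mixes more of the input back in and masks artifacts at the cost of less attenuation.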

Furthermore, a computer program is provided for implementing the above-described method when executed on a computer or signal processor.

In the following, embodiments of the present invention are described in more detail with reference to the figures, in which:

Fig. 1a illustrates an apparatus for generating a separated audio signal from an input audio signal according to an embodiment,

Fig. 1b illustrates an apparatus for generating a separated audio signal according to another embodiment, further comprising an artificial neural network,

Fig. 2 illustrates an apparatus according to an embodiment which is configured to use an estimation of the sound quality and which is configured to conduct post-processing,

Fig. 3 illustrates an apparatus according to another embodiment, wherein a direct estimation of post-processing parameters is conducted,

Fig. 4 illustrates an apparatus according to a further embodiment, wherein an estimation of the sound quality and a secondary separation are conducted, and

Fig. 5 illustrates an apparatus according to another embodiment, wherein a direct estimation of separation parameters is conducted.

Fig. 1a illustrates an apparatus for generating a separated audio signal from an input audio signal according to an embodiment. The input audio signal comprises a target audio signal portion and a difference audio signal portion. The difference audio signal portion indicates the difference between the input audio signal and the target audio signal portion.

The apparatus comprises a source separator 110, a determining module 120 and a signal processor 130.

The source separator 110 is configured to determine an estimated target signal that depends on the input audio signal, the estimated target signal being an estimate of a signal that comprises only the target audio signal portion.

The determining module 120 is configured to determine one or more result values depending on an estimated sound quality of the estimated target signal to obtain one or more parameter values, wherein the one or more parameter values are the one or more result values or depend on the one or more result values.

The signal processor 130 is configured to generate the separated audio signal depending on the one or more parameter values and depending on at least one of the estimated target signal and the input audio signal and an estimated difference signal. The estimated difference signal is an estimate of a signal that comprises only the difference audio signal portion.

Optionally, in an embodiment, the determining module 120 may, for example, be configured to determine the one or more result values depending on the estimated target signal and depending on at least one of the input audio signal and the estimated difference signal.

Embodiments provide a perceptually motivated and signal-adaptive control of the trade-off between sound quality and separation using supervised learning. This can be achieved in two ways. In the first, the sound quality of the output signal is estimated, and this estimate is used to adapt the parameters of the separation or of a post-processing of the separated signals. In the second, a regression method directly outputs the control parameters such that the sound quality of the output signal meets predefined requirements.

According to embodiments, the input signal and the output signal of the separation are analyzed to obtain an estimate $\hat{q}$ of the sound quality and to determine processing parameters based on $\hat{q}$ such that the sound quality of the output (when the determined processing parameters are used) is not lower than a defined quality value.

In some embodiments, the analysis outputs a quality measure $\hat{q}$ as in (9). From the quality measure, the control parameter in formula (13) below (e.g., a scaling factor) is computed, and the final output is obtained by mixing the initial output and the input as in formula (13) below. The control parameter can be computed from the quality measure iteratively or by regression, where the regression parameters are learned from a set of training signals; see Fig. 2. In embodiments, instead of a scaling factor, the control parameter may be, for example, a smoothing parameter or the like.
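The iterative computation of a control parameter from a quality measure can be illustrated with a bisection search. The sketch below makes several assumptions that are not taken from the text: the control parameter is a scalar `p` in [0, 1], the callable `quality_of(p)` returns the quality obtained with that parameter, and quality does not increase as `p` grows (more separation, lower quality). The quality curve used at the end is hypothetical.

```python
from typing import Callable


def largest_parameter_meeting_quality(
    quality_of: Callable[[float], float],
    q_target: float,
    iterations: int = 30,
) -> float:
    """Bisection for the largest p in [0, 1] with quality_of(p) >= q_target.

    Assumes quality_of is non-increasing in p.
    """
    if quality_of(1.0) >= q_target:
        return 1.0          # full separation already meets the target
    if quality_of(0.0) < q_target:
        return 0.0          # even the unprocessed input misses the target
    lo, hi = 0.0, 1.0       # invariant: quality_of(lo) >= q_target > quality_of(hi)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if quality_of(mid) >= q_target:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical quality curve: 1.0 for the unprocessed input, 0.2 at full separation.
def toy_quality(p: float) -> float:
    return 1.0 - 0.8 * p


p_opt = largest_parameter_meeting_quality(toy_quality, q_target=0.6)
```

A regression-based variant would replace the search with a learned mapping from the quality measure to the parameter; the search shown here needs one quality evaluation per iteration.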

In some embodiments, the analysis yields the control parameter in (13) directly; see Fig. 3.

Fig. 4 and Fig. 5 illustrate further embodiments.

Some embodiments achieve control of the sound quality in a post-processing step, as described below.

A subset of the embodiments described herein can be applied independently of the separation method. Some embodiments described herein control parameters of the separation process.

Source separation using spectral weighting processes signals in the time-frequency domain or a short-time spectral domain. The input signal $x(n)$ is transformed by means of the short-time Fourier transform (STFT) or processed by a filter bank, yielding complex-valued STFT coefficients or subband signals $X(m,k)$, where $m$ denotes the time frame index and $k$ denotes the frequency bin index or the subband index. The complex-valued STFT coefficients or subband signals of the desired signal are $S(m,k)$, and those of the interfering signal are $B(m,k)$.
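A minimal sketch of the transform step follows, assuming a periodic Hann window, a direct DFT and illustrative frame and hop sizes; none of these choices is specified in the text, and a practical implementation would use an FFT. The function name `stft` and its parameters are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def stft(x: Sequence[float], frame_len: int = 8, hop: int = 4) -> List[List[complex]]:
    """Return X[m][k]: time frame index m, frequency bin index k (0..frame_len//2)."""
    # Periodic Hann window (an assumed choice).
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(frame_len)]
        # Direct DFT of the windowed frame, non-negative frequencies only.
        spectrum = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


# A cosine with exactly 2 cycles per 8-sample frame concentrates in bin k = 2.
x = [math.cos(2.0 * math.pi * 2 * n / 8) for n in range(16)]
X = stft(x)
```

With 16 samples, a frame length of 8 and a hop of 4, the sketch yields three frames of five bins each.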

The separated output signals are computed by means of spectral weighting as

$$\hat{S}(m,k) = G(m,k)\,X(m,k), \tag{6}$$

where the spectral weights $G(m,k)$ are multiplied element-wise with the input signal. The aim is to attenuate elements of $X(m,k)$ where the interferer $B(m,k)$ is large. To this end, the spectral weights can be computed based on an estimate of the target $\hat{S}(m,k)$, an estimate of the interferer $\hat{B}(m,k)$, or an estimate of the signal-to-interferer ratio, for example,

(7)

or

(8),

where (7) and (8) each contain parameters that control the separation. For example, increasing one of these parameters can lead to a larger attenuation of the interferer, but also to a larger degradation of the sound quality. The spectral weights can be further modified, e.g., by thresholding such that $G(m,k)$ is larger than a threshold $v$. The modified gains $G_{\mathrm{mod}}(m,k)$ are computed as

$$G_{\mathrm{mod}}(m,k) = \begin{cases} G(m,k), & \text{if } G(m,k) > v, \\ v, & \text{otherwise.} \end{cases}$$

Increasing the threshold v reduces the attenuation of the interferer and reduces the potential degradation of the sound quality.
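The weighting and thresholding described above can be sketched as follows. The gain rule used here is an assumption chosen for illustration: a Wiener-like ratio of target and interferer magnitudes with hypothetical exponents `a` and `c`, which is a common choice but is not confirmed as the form of (7) or (8). Only the element-wise multiplication and the lower bound `v` on the gains follow directly from the text.

```python
from typing import List

Spec = List[List[complex]]   # Spec[m][k]: frame m, bin k


def wiener_like_gain(s_mag: float, b_mag: float, a: float = 2.0, c: float = 1.0) -> float:
    """Assumed gain rule: (|S|^a / (|S|^a + |B|^a))^c, a value in [0, 1]."""
    num = s_mag ** a
    den = num + b_mag ** a
    return (num / den) ** c if den > 0.0 else 0.0


def apply_weights(X: Spec, S_hat: Spec, B_hat: Spec, v: float = 0.1,
                  a: float = 2.0, c: float = 1.0) -> Spec:
    """Weight X element-wise; gains are kept at or above the threshold v."""
    out = []
    for x_row, s_row, b_row in zip(X, S_hat, B_hat):
        row = []
        for x, s, b in zip(x_row, s_row, b_row):
            g = wiener_like_gain(abs(s), abs(b), a, c)
            g_mod = max(g, v)          # thresholding of the spectral weight
            row.append(g_mod * x)      # element-wise spectral weighting
        out.append(row)
    return out


# One frame, two bins: equal target and interferer, then interferer only.
X = [[1 + 1j, 2 + 0j]]
S_hat = [[1 + 0j, 0 + 0j]]
B_hat = [[1 + 0j, 2 + 0j]]
Y = apply_weights(X, S_hat, B_hat, v=0.1)
```

In the second bin the unmodified gain would be zero; the threshold keeps 10% of the input there, which trades interferer attenuation for fewer processing artifacts.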

The estimation of the required quantities (the target $\hat{S}(m,k)$, the interferer $\hat{B}(m,k)$, or the signal-to-interferer ratio) is at the core of these methods, and various estimation methods have been developed in the past. They follow one of the two approaches described above.

The output signal $\hat{s}(n)$ is then computed using the inverse processing of the STFT or of the filter bank.

In the following, source separation using an estimation of the target signal according to embodiments is described.

A representation of the target signal can also be estimated directly from the input signal, e.g., by means of an artificial neural network. Various methods have been proposed recently in which an artificial neural network is trained to estimate the target time signal, or its STFT coefficients, or the magnitudes of the STFT coefficients.

With respect to sound quality, a sound quality component (SQC) is obtained by applying a supervised learning model $g(\cdot)$ to estimate the outputs of these computational models,

$$\hat{q} = g(x, \hat{s}). \tag{9}$$

The supervised learning method $g(\cdot)$ is realized as follows.

1. Configuring the supervised learning model $g(\cdot)$ with trainable parameters, input variables and output variables.

2. Generating a data set with example signals for the target $s$ and the mixture $x$.

3. Computing estimates of the target signals by means of source separation, $\hat{s} = h(x)$.

4. Computing sound quality components $q$ from the obtained signals by means of computational models of sound quality according to (9) or (10).

5. Training the supervised learning model $g(\cdot)$ such that it outputs estimates $\hat{q}$ given the corresponding example signals for the estimated target $\hat{s}$ (the output of the source separation) and the mixture $x$. Alternatively, training the supervised learning model $g(\cdot)$ such that it outputs estimates $\hat{q}$ given $\hat{s}$ and $\hat{b} = x - \hat{s}$ (if $x = s + b$).

6. In the application, the trained model is fed with the estimated target $\hat{s}$ (the output of the source separation), obtained from the mixture $x$ using the source separation method, together with the mixture $x$.
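Steps 2 to 4 above can be sketched as a small data-set generator. Everything concrete in this sketch is a hypothetical stand-in: `toy_separator` replaces the actual source separation, and `toy_quality` (a signal-to-distortion ratio in dB) replaces a perceptual computational model of sound quality, which would normally be far more elaborate.

```python
import math
import random
from typing import List, Tuple

Signal = List[float]


def toy_separator(x: Signal) -> Signal:
    """Hypothetical stand-in for the source separation: simple attenuation."""
    return [0.8 * v for v in x]


def toy_quality(s: Signal, s_hat: Signal) -> float:
    """Stand-in for a computational quality model: signal-to-distortion ratio in dB."""
    signal = sum(v * v for v in s)
    error = sum((a - b) ** 2 for a, b in zip(s, s_hat))
    return 10.0 * math.log10(signal / error) if error > 0.0 else float("inf")


def build_dataset(num_items: int, length: int = 64,
                  seed: int = 0) -> List[Tuple[Signal, Signal, float]]:
    """Return items (x, s_hat, q): model inputs x and s_hat, training label q."""
    rng = random.Random(seed)
    items = []
    for _ in range(num_items):
        s = [rng.uniform(-1.0, 1.0) for _ in range(length)]        # step 2: target
        b = [0.3 * rng.uniform(-1.0, 1.0) for _ in range(length)]  # interferer
        x = [si + bi for si, bi in zip(s, b)]                      # step 2: mixture
        s_hat = toy_separator(x)                                   # step 3
        q = toy_quality(s, s_hat)                                  # step 4
        items.append((x, s_hat, q))
    return items


dataset = build_dataset(num_items=4)
```

The true target `s` is needed only to compute the label `q`; it is not stored with the item, which mirrors the fact that the trained model later operates without access to the true target.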

An application of supervised learning methods for controlling the quality of the separated output signal is provided.

In the following, the estimation of sound quality using supervised learning according to embodiments is described.

Fig. 1b illustrates an embodiment in which the determining module 120 comprises an artificial neural network 125. The artificial neural network 125 may, for example, be configured to determine the one or more result values depending on the estimated target signal. The artificial neural network 125 may, for example, be configured to receive a plurality of input values, each of the plurality of input values depending on at least one of the estimated target signal and the estimated difference signal and the input audio signal. The artificial neural network 125 may, for example, be configured to determine the one or more result values as one or more output values of the artificial neural network 125.

Optionally, in an embodiment, the artificial neural network 125 may, for example, be configured to determine the one or more result values depending on the estimated target signal and on at least one of the input audio signal and the estimated difference signal.

In an embodiment, each of the plurality of input values may, for example, depend on at least one of the estimated target signal and the estimated difference signal and the input audio signal. The one or more result values may, for example, indicate the estimated sound quality of the estimated target signal.

According to an embodiment, each of the plurality of input values may, for example, depend on at least one of the estimated target signal and the estimated difference signal and the input audio signal. The one or more result values may, for example, be the one or more parameter values.

In an embodiment, the artificial neural network 125 may, for example, be configured to be trained by receiving a plurality of training sets, wherein each of the plurality of training sets comprises a plurality of input training values of the artificial neural network 125 and one or more output training values of the artificial neural network 125, wherein each of the plurality of input training values may, for example, depend on at least one of a training target signal and a training difference signal and a training input signal, and wherein each of the one or more output training values may, for example, depend on an estimation of the sound quality of the training target signal.

In embodiments, an estimate of the sound quality component is obtained by means of supervised learning using a supervised learning model (SLM), e.g., an artificial neural network (ANN) 125. The artificial neural network 125 may, for example, be a fully connected artificial neural network 125 that comprises an input layer with A units, at least one hidden layer with at least two units each, and an output layer with one or more units.

The supervised learning model can be implemented as a regression model or as a classification model. The regression model estimates one target value at the output of a single unit in the output layer. Alternatively, the regression problem can be formulated as a classification problem by quantizing the output value into at least 3 steps and using an output layer with $C$ units, where $C$ equals the number of quantization steps.

For each quantization step, one output unit is used.
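The reformulation of the regression problem as a classification problem can be sketched as follows. Uniform quantization over an assumed value range `[q_min, q_max]` and a one-hot target vector are illustrative choices; the text only requires at least 3 quantization steps and one output unit per step. The function names are hypothetical.

```python
from typing import List


def quantize_to_class(q: float, q_min: float, q_max: float, num_steps: int) -> int:
    """Map q in [q_min, q_max] to a class index 0..num_steps-1 (uniform steps)."""
    if num_steps < 3:
        raise ValueError("at least 3 quantization steps are expected")
    clipped = min(max(q, q_min), q_max)
    index = int((clipped - q_min) / (q_max - q_min) * num_steps)
    return min(index, num_steps - 1)   # q == q_max falls into the last step


def one_hot(index: int, num_steps: int) -> List[float]:
    """Target vector for an output layer with one unit per quantization step."""
    return [1.0 if i == index else 0.0 for i in range(num_steps)]


target = one_hot(quantize_to_class(0.55, q_min=0.0, q_max=1.0, num_steps=5), 5)
```

At inference time, the index of the most active output unit can be mapped back to the center of the corresponding quantization step to recover an approximate quality value.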

The supervised learning model is first trained with a data set that contains multiple examples of the mixture signal $x$, the estimated target $\hat{s}$, and the sound quality component $q$, where the sound quality component has been computed, for example, from the estimated target $\hat{s}$ and the true target $s$. One item of the data set is denoted as $\{x_i, \hat{s}_i, q_i\}$. The output of the supervised learning model is denoted here as $\hat{q}_i$.

The number of units in the input layer, $A$, corresponds to the number of input values. The inputs to the model are computed from the input signals. Each signal can optionally be processed by a time-frequency transform or filter bank, e.g., the short-time Fourier transform (STFT). For example, the input can be constructed by concatenating the STFT coefficients computed from $L$ adjacent frames of $x$ and $\hat{s}$. If $F$ is the total number of spectral coefficients per frame, then the total number of input coefficients is $D = 2FL$.
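The construction of the model input by concatenating adjacent frames can be sketched as below. The centering of the context window on one frame, the repetition of edge frames at the signal borders, and the name `stack_context` are assumptions for illustration. The sketch keeps the coefficients complex; a real-valued network would additionally need them split into real and imaginary parts or reduced to magnitudes.

```python
from typing import List

Spec = List[List[complex]]   # Spec[m][k]: frame m, bin k


def stack_context(X: Spec, S_hat: Spec, center: int, num_frames: int) -> List[complex]:
    """Concatenate the coefficients of num_frames frames around `center`.

    Frames are taken first from X and then from S_hat; num_frames is assumed odd.
    """
    half = num_frames // 2
    last = len(X) - 1
    features: List[complex] = []
    for spec in (X, S_hat):
        for m in range(center - half, center + half + 1):
            m_clamped = min(max(m, 0), last)   # repeat edge frames at the borders
            features.extend(spec[m_clamped])
    return features


# 4 frames with 3 bins each; a context of 3 frames gives 2 * 3 * 3 = 18 values.
X = [[complex(m, k) for k in range(3)] for m in range(4)]
S_hat = [[0.5 * value for value in row] for row in X]
features = stack_context(X, S_hat, center=0, num_frames=3)
```

The length of the resulting vector is two signals times the number of frames times the number of coefficients per frame, which is the count the input layer has to accommodate.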

Each unit of the artificial neural network 125 computes its output value as a linear combination of its input values, which is then optionally processed with a non-linear compressive function $f(\cdot)$,

$$y = f\!\left(\sum_{i=1}^{K} \left(w_i u_i + o_i\right)\right), \tag{10}$$

where $y$ denotes the output of a single neuron, $u_i$ denote the $K$ input values, $w_i$ denote the $K$ weights of the linear combination, and $o_i$ denote $K$ additional bias terms. For the units in the first hidden layer, the number of input values $K$ equals the number of input coefficients $D$. All $w_i$ and $o_i$ are parameters of the artificial neural network 125 that are determined in the training procedure.

The units of one layer are connected to the units of the following layer; the outputs of the units of a preceding layer are the inputs to the units of the following layer.
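A forward pass through such a fully connected network can be sketched as follows. Two simplifications are assumptions of this sketch: `tanh` is used as the non-linear compressive function, and each unit has a single bias value (several additive bias terms inside one unit sum to one effective bias). The layer sizes and weights in the usage lines are arbitrary.

```python
import math
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]   # (weight rows, biases), one row per unit


def unit_output(inputs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """One unit: linear combination of the inputs followed by a tanh nonlinearity."""
    return math.tanh(sum(w * u for w, u in zip(weights, inputs)) + bias)


def layer_output(inputs: Sequence[float], weight_rows: List[List[float]],
                 biases: List[float]) -> List[float]:
    """All units of one layer see the same inputs."""
    return [unit_output(inputs, w, b) for w, b in zip(weight_rows, biases)]


def forward(inputs: Sequence[float], layers: List[Layer]) -> List[float]:
    """Outputs of one layer are the inputs of the following layer."""
    values = list(inputs)
    for weight_rows, biases in layers:
        values = layer_output(values, weight_rows, biases)
    return values


# 2 inputs -> hidden layer with 2 units -> output layer with 1 unit.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.0]),
]
out = forward([0.5, -0.5], layers)
```

For a regression model the output layer would typically omit the compressive function or rescale its range; the sketch applies `tanh` everywhere only to stay short.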

Training is carried out by minimizing the prediction error using a numerical optimization method, e.g., gradient descent. The prediction error for a single item is a function of the difference $q_i - \hat{q}_i$. The prediction error over the full data set, or over a subset of the data set, that is used as the optimization criterion is, for example, the mean squared error (MSE) or the mean absolute error (MAE), where $N$ denotes the number of items in the data set:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(q_i - \hat{q}_i\right)^2, \tag{11}$$

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|q_i - \hat{q}_i\right|. \tag{12}$$
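The two error measures and a gradient-descent training loop can be sketched on a deliberately tiny model. The model here is a single linear unit with one input feature, weight `w` and bias `o`, which is an assumption made to keep the gradients short; the learning rate, step count and data are arbitrary illustrative values.

```python
from typing import Sequence, Tuple


def mse(q: Sequence[float], q_hat: Sequence[float]) -> float:
    """Mean squared error over all items."""
    return sum((a - b) ** 2 for a, b in zip(q, q_hat)) / len(q)


def mae(q: Sequence[float], q_hat: Sequence[float]) -> float:
    """Mean absolute error over all items."""
    return sum(abs(a - b) for a, b in zip(q, q_hat)) / len(q)


def train_linear(features: Sequence[float], q: Sequence[float],
                 learning_rate: float = 0.1, steps: int = 2000) -> Tuple[float, float]:
    """Fit q_hat = w * feature + o by gradient descent on the MSE."""
    w, o = 0.0, 0.0
    n = len(q)
    for _ in range(steps):
        errors = [w * f + o - t for f, t in zip(features, q)]
        grad_w = 2.0 * sum(e * f for e, f in zip(errors, features)) / n
        grad_o = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w
        o -= learning_rate * grad_o
    return w, o


features = [0.0, 1.0, 2.0, 3.0]
q = [1.0, 3.0, 5.0, 7.0]            # generated by q = 2 * feature + 1
w, o = train_linear(features, q)
q_hat = [w * f + o for f in features]
```

For a multi-layer network the same loop applies, with the gradients obtained by backpropagation instead of the closed-form expressions used here.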

Other error metrics are feasible for training purposes if they are monotonic functions of $|q_i - \hat{q}_i|$ and differentiable. In addition, other structures and elements exist for building artificial neural networks, e.g., convolutional neural network layers or recurrent neural network layers.

They all have in common that they implement a mapping from a multidimensional input to a one-dimensional or multidimensional output, where the mapping function is controlled by a set of parameters (e.g., $w_i$ and $o_i$) that are determined in a training procedure by optimizing a scalar criterion.

After training, the supervised learning model can be used to estimate the sound quality of an unknown estimated target $\hat{s}$ given the mixture, without the need for the true target $s$.

With respect to computational models of sound quality, various computational models for estimating aspects of sound quality (including intelligibility) have been used successfully in experiments according to embodiments, such as the computational models described in [1]-[11], in particular Blind Source Separation Evaluation (BSSEval) (see [1]), Perceptual Evaluation methods for Audio Source Separation (PEASS) (see [2]), PEMO-Q (see [3]), Perceptual Evaluation of Audio Quality (PEAQ) (see [4]), Perceptual Evaluation of Speech Quality (PESQ) (see [5] and [6]), ViSQOLAudio (see [7]), Hearing-Aid Audio Quality Index (HAAQI) (see [8]), Hearing-Aid Speech Quality Index (HASQI) (see [9]), Hearing-Aid Speech Perception Index (HASPI) (see [10]), and Short-Time Objective Intelligibility (STOI) (see [11]).

Thus, according to an embodiment, the estimation of the sound quality of the training target signal may, for example, depend on one or more computational models of sound quality.

For example, in an embodiment, the estimation of the sound quality of the training target signal may depend on one or more of the following computational models of sound quality:

Blind Source Separation Evaluation,

Perceptual Evaluation methods for Audio Source Separation,

Perceptual Evaluation of Audio Quality,

Perceptual Evaluation of Speech Quality,

Virtual Speech Quality Objective Listener Audio,

Hearing-Aid Audio Quality Index,

Hearing-Aid Speech Quality Index,

Hearing-Aid Speech Perception Index, and

Short-Time Objective Intelligibility.

Other computational models of sound quality may, for example, also be used in other embodiments.

Sound quality control is described next.

Sound quality control can be implemented by estimating the sound quality component and computing processing parameters based on that estimate, or by directly estimating the optimal processing parameters such that the sound quality component meets a target value (or does not fall below that target value).

The estimation of the sound quality component has been described above. Similarly, the optimal processing parameters can be estimated by training a regression method with the desired values of the optimal processing parameters. The optimal processing parameters are computed as described below. This processing is hereinafter referred to as the Parameter Estimation Module (PEM).

The target value for sound quality determines the trade-off between separation and sound quality. This parameter can be controlled by the user or specified depending on the sound reproduction scenario. Sound reproduction at home in a quiet environment on high-quality equipment may benefit from higher sound quality and less separation. Sound reproduction in vehicles in a noisy environment through loudspeakers built into a smartphone may benefit from lower sound quality but higher separation and speech intelligibility.

In addition, the estimated quantities (either the sound quality component or the processing parameters) can further be applied either to control post-processing or to control a secondary separation.

Consequently, four different concepts can be used to implement the proposed method. These concepts are illustrated in Fig. 2, Fig. 3, Fig. 4 and Fig. 5 and are described below.

Fig. 2 illustrates an apparatus according to an embodiment that is configured to use an estimation of the sound quality and to perform post-processing.

According to such an embodiment, the determining module 120 may, for example, be configured to estimate, depending on at least one of the estimated target signal, the input audio signal and the estimated difference signal, a sound quality value as the one or more result values, wherein the sound quality value indicates the estimated sound quality of the estimated target signal. The determining module 120 may, for example, be configured to determine the one or more parameter values depending on the sound quality value.

Thus, according to an embodiment, the determining module 120 may, for example, be configured to determine, depending on the estimated sound quality of the estimated target signal, a control parameter as the one or more parameter values. The signal processor 130 may, for example, be configured to determine the separated audio signal depending on the control parameter and depending on at least one of the estimated target signal, the input audio signal and the estimated difference signal.

Specific embodiments are described below.

In a first step, separation is applied. The separated signal and the unprocessed signal are input to a Quality Estimation Module (QEM). The QEM computes an estimate of the sound quality components.

The estimated sound quality components are used to compute a set of parameters for controlling the post-processing.

These variables may vary over time, but the time dependence is omitted in the following for clarity of notation.

Such post-processing, for example, adds a scaled or filtered copy of the input signal to a scaled or filtered copy of the output signal and thereby reduces the attenuation of the interfering signals (i.e., the effect of the separation), for example according to equation (13), in which a control parameter sets the amount of separation.

In other embodiments, a formula that includes the estimated difference signal may, for example, be used.

Reducing the separation leads to

1) a reduced amount of artifacts and

2) increased leakage of the interfering sounds, which masks the separation artifacts.
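As a rough illustration of the post-processing described above, the following Python sketch mixes the input signal back into the separated signal. The formula of equation (13) is not reproduced in this text, so the convex combination used here and the names `remix`, `x`, `s_hat` and `p1` are assumptions chosen for illustration only.

```python
def remix(x, s_hat, p1):
    """Blend the input signal x with the estimated target signal s_hat.

    Assumed form (not taken from the source): y = p1 * x + (1 - p1) * s_hat,
    so p1 = 0 keeps the full separation and p1 = 1 returns the input.
    """
    if not 0.0 <= p1 <= 1.0:
        raise ValueError("p1 must lie in [0, 1]")
    if len(x) != len(s_hat):
        raise ValueError("signals must have equal length")
    return [p1 * xi + (1.0 - p1) * si for xi, si in zip(x, s_hat)]


x = [1.0, 0.5, -0.5]       # input mixture (target plus interferer)
s_hat = [0.8, 0.0, -0.4]   # estimated target signal
y = remix(x, s_hat, 0.25)  # mostly separated, some input leaked back
```

Raising `p1` in this sketch lets more of the interferer through, which corresponds to the two effects listed above: fewer artifacts and more masking of the artifacts that remain.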

Thus, in an embodiment, the signal processor 130 may, for example, be configured to determine the separated audio signal according to formula (13), whose terms are the separated audio signal, the estimated target signal, the input audio signal, the control parameter, and an index.

The parameter is computed from the sound quality estimate and the target quality value, as given by equation (14).

The function in equation (14) may, for example, be an iterative extensive search.
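The source illustrates this search with pseudocode that is not reproduced in this text. The following Python sketch is a stand-in under stated assumptions: candidate parameter values are tried in order of decreasing separation, the remix has the hypothetical form y = p * x + (1 - p) * s_hat, and `toy_quality` is a toy substitute for the QEM, not one of the models cited in [1]-[11].

```python
def search_parameter(x, s_hat, q_target, estimate_quality, candidates):
    """Return the first candidate p whose remixed output reaches q_target.

    Candidates are tried in the given order (assumed: increasing p, i.e.
    decreasing separation). If none reaches the target, the last one is kept.
    """
    p, q, y = None, None, None
    for p in candidates:
        # Hypothetical remix: leak a fraction p of the input back in.
        y = [p * xi + (1.0 - p) * si for xi, si in zip(x, s_hat)]
        q = estimate_quality(y, x)
        if q >= q_target:
            break  # quality target met, stop reducing the separation
    return p, q, y


def toy_quality(y, x):
    """Toy QEM stand-in: quality rises as y gets closer to the input x."""
    err = sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    ref = sum(xi ** 2 for xi in x) or 1.0
    return 1.0 - err / ref


x = [1.0, 0.5, -0.5]
s_hat = [0.8, 0.0, -0.4]
p, q, y = search_parameter(x, s_hat, 0.9, toy_quality,
                           [0.0, 0.25, 0.5, 0.75, 1.0])
```

The loop re-evaluates the quality estimate once per candidate, which is why a search of this kind costs more at run time than a direct parameter estimate.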

Alternatively, the relationship between the processing parameter and the sound quality can be computed as follows.

1. Compute the sound quality estimate for a set of parameter values.

2. Compute the remaining values by interpolation and extrapolation.

For example, when the processing parameter controls the post-processing as in equation (13), the sound quality estimate is computed for a fixed number of parameter values, for example those corresponding to relative gains of 18, 12 and 6 dB.

In this way the mapping is approximated, and the parameter value can be selected.
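The two-step alternative above can be sketched as follows. The sketch assumes that quality was evaluated on a small ascending grid of parameter values, that quality does not decrease along that grid, and that piecewise-linear interpolation is adequate; the function name and the sample values are hypothetical.

```python
def select_parameter(p_grid, q_grid, q_target):
    """Pick p so that the interpolated quality equals q_target.

    p_grid: parameter values at which quality was evaluated (ascending).
    q_grid: corresponding quality estimates (assumed non-decreasing).
    Outside the sampled range the result is clamped to the grid ends,
    a simplification of the extrapolation mentioned in the text.
    """
    if q_target <= q_grid[0]:
        return p_grid[0]
    samples = list(zip(p_grid, q_grid))
    for (p_lo, q_lo), (p_hi, q_hi) in zip(samples, samples[1:]):
        if q_lo <= q_target <= q_hi and q_hi > q_lo:
            # Linear interpolation between the two neighbouring samples.
            return p_lo + (q_target - q_lo) * (p_hi - p_lo) / (q_hi - q_lo)
    return p_grid[-1]


p_grid = [0.0, 0.5, 1.0]   # hypothetical parameter values
q_grid = [0.6, 0.8, 1.0]   # hypothetical quality estimates at those values
p_sel = select_parameter(p_grid, q_grid, 0.9)
```

Only three quality evaluations are needed here, regardless of how finely the parameter is finally resolved.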

To summarize, in an embodiment, the signal processor 130 may, for example, be configured to generate the separated audio signal by determining a first version of the separated audio signal and by modifying the separated audio signal one or more times to obtain one or more intermediate versions of the separated audio signal. The determining module 120 may, for example, be configured to modify the sound quality value depending on one of the one or more intermediate values of the separated audio signal. The signal processor 130 may, for example, be configured to stop modifying the separated audio signal if the sound quality value is greater than or equal to a predefined quality value.

Fig. 3 illustrates an apparatus according to another embodiment, in which the post-processing parameters are estimated directly.

First, separation is applied. The separated signals are input to a Parameter Estimation Module (PEM). The estimated parameters are applied to control the post-processing. The PEM has been trained to estimate p(n) directly from the separated signal and the input signal. This means that the operation in equation (14) is moved to the training phase, and the regression method is trained to estimate the parameters instead of the sound quality. Consequently, the function that is learned is the one given by equation (15).

Evidently, this procedure has the advantage of requiring fewer computations than the procedure described above. This comes at the cost of less flexibility, since the model is trained for a fixed setting of the target quality value. However, several models can be trained for different target quality values. In this way, the final flexibility in the choice of the target quality value can be retained.
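To illustrate what moving the parameter computation into the training phase might look like, the sketch below fits a regressor that maps a signal feature directly to the parameter. Everything in it is hypothetical: the source does not specify the features or the regression method, so a single energy-ratio feature, a one-dimensional least-squares line, and made-up training pairs (feature, optimal parameter for one fixed target quality) are used purely for illustration.

```python
def energy_ratio(s_hat, x):
    """Hypothetical feature: energy of the separated signal relative to the input."""
    num = sum(v * v for v in s_hat)
    den = sum(v * v for v in x) or 1.0
    return num / den


def fit_line(features, targets):
    """Closed-form least squares for p ~ w * feature + b (1-D case).

    Assumes the features are not all identical (non-zero variance).
    """
    n = len(features)
    mf = sum(features) / n
    mt = sum(targets) / n
    var = sum((f - mf) ** 2 for f in features)
    cov = sum((f - mf) * (t - mt) for f, t in zip(features, targets))
    w = cov / var
    return w, mt - w * mf


def predict_parameter(model, s_hat, x):
    """Estimate the parameter directly from the signals, clipped to [0, 1]."""
    w, b = model
    return min(1.0, max(0.0, w * energy_ratio(s_hat, x) + b))


# Made-up training set: optimal parameters found offline for one target quality.
model = fit_line([0.2, 0.5, 0.8], [0.8, 0.5, 0.2])
p_est = predict_parameter(model, [0.8, 0.0, -0.4], [1.0, 0.5, -0.5])
```

Supporting several target quality values would, in this sketch, mean fitting one such model per target value and selecting among them at run time.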

In an embodiment, the signal processor 130 may, for example, be configured to generate the separated audio signal depending on the one or more parameter values and depending on post-processing of the estimated target signal.

Fig. 4 illustrates an apparatus according to a further embodiment, in which sound quality estimation and a secondary separation are performed.

First, separation is applied. The separated signals are input to the QEM. The estimated sound quality components are used to compute a set of parameters for controlling the secondary separation. The input z(n) to the secondary separation is either the input signal x(n), the output of the first separation, a linear combination of both weighted by two weighting coefficients, or an intermediate result from the first separation.

Thus, in such an embodiment, the signal processor 130 may, for example, be configured to generate the separated audio signal depending on the one or more parameter values and depending on a linear combination of the estimated target signal and the input audio signal, or the signal processor 130 may, for example, be configured to generate the separated audio signal depending on the one or more parameter values and depending on a linear combination of the estimated target signal and the estimated difference signal.

Suitable parameters for controlling the secondary separation are, for example, parameters that modify the spectral weights.

Fig. 5 shows an apparatus according to another embodiment, in which the separation parameters are estimated directly.

First, separation is applied. The separated signals are input to the PEM. The estimated parameters control the secondary separation.

The input z(n) to the secondary separation is either the input signal x(n), the output of the first separation, a linear combination of both weighted by two weighting coefficients, or an intermediate result from the first separation.

For example, the controlled parameters include the parameters from equations (5) and (6) and a further parameter, as described above.

With regard to iterative processing according to embodiments, Figs. 4 and 5 depict iterative processing with a single iteration. In general, it can be repeated multiple times and implemented in a loop.
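Such a loop could be organised as in the following sketch. The callables `separate`, `estimate_quality` and `update_params` are placeholders for a separation stage, the QEM and the parameter computation; the toy implementations and the fixed iteration count are assumptions made only so that the example runs.

```python
def iterate_separation(x, separate, estimate_quality, update_params, n_iters=2):
    """Run separation repeatedly, re-deriving control parameters each round."""
    z, params, history = x, None, []
    s_hat = x
    for _ in range(n_iters):
        s_hat = separate(z, params)        # first or secondary separation
        q = estimate_quality(s_hat, x)     # quality of the current result
        params = update_params(q)          # parameters for the next round
        history.append((q, params))
        z = s_hat                          # feed the result into the next stage
    return s_hat, history


def toy_separate(z, gain):
    """Toy stage: scale the signal; the default gain is used on the first pass."""
    g = 0.5 if gain is None else gain
    return [g * v for v in z]


def toy_quality(s_hat, x):
    """Toy quality estimate: output energy relative to input energy."""
    return sum(v * v for v in s_hat) / (sum(v * v for v in x) or 1.0)


def toy_update(q):
    """Toy rule: use a gentler stage next time if quality is low."""
    return 0.9 if q < 0.5 else 0.5


s_out, hist = iterate_separation([1.0, -1.0], toy_separate, toy_quality, toy_update)
```

With `n_iters=1` the loop reduces to the single-iteration arrangement: one separation, one quality estimate, one parameter update.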

Iterative processing (without intermediate quality estimation) is very similar to other prior methods that concatenate several separations.

Such an approach may, for example, be suitable for combining several different methods (which is better than repeating a single method).

Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a module or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding module, item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus such as, for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.

Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software, or at least partially in hardware or at least partially in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a digital versatile disc (DVD), a Blu-ray disc, a compact disc (CD), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM) or a flash memory, having electronically readable signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer-readable.

Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system such that one of the methods described herein is performed.

Generally, embodiments of the present invention can be implemented as a computer program product with program code, the program code being operative to perform one of the methods when the computer program product runs on a computer. The program code may, for example, be stored on a machine-readable carrier.

Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine-readable carrier.

In other words, an embodiment of the inventive method is, therefore, a computer program having program code for performing one of the methods described herein when the computer program runs on a computer.

A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.

A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may, for example, be configured to be transferred via a data communication connection, for example via the Internet.

A further embodiment comprises a processing means, for example a computer or a programmable logic device, configured or adapted to perform one of the methods described herein.

A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.

A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.

In some embodiments, a programmable logic device (for example, a field-programmable gate array) may be used to perform some or all of the functionality of the methods described herein. In some embodiments, a field-programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are preferably performed by any hardware apparatus.

The apparatus described herein may be implemented using a hardware apparatus, using a computer, or using a combination of a hardware apparatus and a computer.

The methods described herein may be performed using a hardware apparatus, using a computer, or using a combination of a hardware apparatus and a computer.

The embodiments described above are merely illustrative of the principles of the present invention. It is understood that modifications and variations of the arrangements and details described herein will be apparent to others skilled in the art. It is therefore intended that the invention be limited only by the scope of the following patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.

ЛитератураLiterature

[1] E. Vincent, R. Gribonval, and C. Fйvotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462-1469, 2006.[1] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462-1469, 2006.

[2] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, “Subjective and objective quality assessment of audio source separation,” IEEE Trans. Audio, Speech and Language Process., vol. 19, no. 7, 2011.[2] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, “Subjective and objective quality assessment of audio source separation,” IEEE Trans. Audio, Speech and Language Process., vol. 19, no. 7, 2011.

[3] R. Huber and B. Kollmeier, “PEMO-Q - a new method for objective audio quality assessment using a model of audatory perception,” IEEE Trans. Audio, Speech and Language Process., vol. 14, 2006.[3] R. Huber and B. Kollmeier, “PEMO-Q - a new method for objective audio quality assessment using a model of audatory perception,” IEEE Trans. Audio, Speech and Language Process., vol. 14, 2006.

[4] ITU-R Rec. BS.1387-1, “Method for objective measurements of perceived audio quality,” 2001.[4] ITU-R Rec. BS.1387-1, “Method for objective measurements of perceived audio quality,” 2001.

[5] ITU-T Rec. P.862, “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” 2001.[5] ITU-T Rec. P.862, “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” 2001.

[6] ITU-T Rec. P.862.1, “Mapping function for transforming P.862 raw results scores to MOS-LQO,” 2003.[6] ITU-T Rec. P.862.1, “Mapping function for transforming P.862 raw results scores to MOS-LQO,” 2003.

[7] A. Hines, E. Gillen et al., “ViSQOLAudio: An Objective Audio Quality Metric for Low Bitrate Codecs,” J. Acoust. Soc. Am., vol. 137, no. 6, 2015.[7] A. Hines, E. Gillen et al., “ViSQOLAudio: An Objective Audio Quality Metric for Low Bitrate Codecs,” J. Acoust. soc. Am., vol. 137, no. 6, 2015.

[8] J. M. Kates and K. H. Arehart, “The Hearing-Aid Audio Quality Index (HAAQI),” IEEE Trans. Audio, Speech and Language Process., vol. 24, no. 2, 2016, evaluation code kindly provided by Prof. J.M. Kates.

[9] J. M. Kates and K. H. Arehart, “The Hearing-Aid Speech Quality Index (HASQI) version 2,” Journal of the Audio Engineering Society, vol. 62, no. 3, pp. 99-117, 2014.

[10] J. M. Kates and K. H. Arehart, “The Hearing-Aid Speech Perception Index (HASPI),” Speech Communication, vol. 65, pp. 75-93, 2014.

[11] C. Taal, R. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Trans. Audio, Speech and Language Process., vol. 19, no. 7, 2011.

[12] E. Manilow, P. Seetharaman, F. Pishdadian, and B. Pardo, “Predicting algorithm efficacy for adaptive multi-cue source separation,” in Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017 IEEE Workshop on, 2017, pp. 274-278.

[13] M. Cartwright, B. Pardo, G. J. Mysore, and M. Hoffman, “Fast and easy crowdsourced perceptual audio evaluation,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, 2016.

[14] S.-W. Fu, T.-W. Wang, Y. Tsao, X. Lu, and H. Kawai, “End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 9, 2018.

[15] Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi, and Y. Haneda, “Dnn-based source enhancement to increase objective sound quality assessment score,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.

[16] Y. Zhao, B. Xu, R. Giri, and T. Zhang, “Perceptually guided speech enhancement using deep neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on, 2018.

[17] Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi, and Y. Haneda, “Dnn-based source enhancement self-optimized by reinforcement learning using sound quality measurements,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, 2017.

[18] J. Jensen and M. S. Pedersen, “Audio processing device comprising artifact reduction,” US Patent US 9,432,766 B2, Aug. 30, 2016.

Claims

1. A device for generating a separated audio signal from an input audio signal, the input audio signal comprising a target audio signal portion and a difference audio signal portion, the difference audio signal portion indicating a difference between the input audio signal and the target audio signal portion, the device comprising:

a source separator (110) for determining an estimated target signal that depends on the input audio signal, wherein the estimated target signal is an estimate of a signal that comprises only the target audio signal portion,

a determination module (120) configured to determine one or more result values depending on an estimated sound quality of the estimated target signal, to obtain one or more parameter values, wherein the one or more parameter values are the one or more result values or depend on the one or more result values, and

a signal processor (130) for generating the separated audio signal depending on the one or more parameter values and depending on at least one of the estimated target signal, the input audio signal, and an estimated difference signal, wherein the estimated difference signal is an estimate of a signal that comprises only the difference audio signal portion,

wherein the signal processor (130) is configured to generate the separated audio signal depending on the one or more parameter values and depending on a linear combination of the estimated target signal and the input audio signal, or

wherein the signal processor (130) is configured to generate the separated audio signal depending on the one or more parameter values and depending on a linear combination of the estimated target signal and the estimated difference signal.

2. The device according to claim 1,

wherein the determination module (120) is configured to determine, depending on the estimated sound quality of the estimated target signal, a control parameter as the one or more parameter values, and

wherein the signal processor (130) is configured to determine the separated audio signal depending on the control parameter and depending on at least one of the estimated target signal, the input audio signal, and the estimated difference signal.

3. The device according to claim 2,

wherein the signal processor (130) is configured to determine the separated audio signal depending on:

[formula not reproduced in this text],

or depending on:

[formula not reproduced in this text],

where the symbols of the formulas (likewise not reproduced in this text) denote, in the order given in the claim:

- the separated audio signal,

- the estimated target signal,

- the input audio signal,

- the estimated difference signal,

- the control parameter, and

- an index.
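The formulas of claim 3 are not legible in this text. As an illustration only, the following minimal sketch shows one pair of linear combinations that is consistent with the wording of claims 1 to 3. The symbol names (`x`, `s_hat`, `b_hat`, `p1`), the exact weights, and the assumption that the estimated difference signal equals the input minus the estimated target are all hypothetical and are not taken from the claim.

```python
# Hypothetical sketch: two linear combinations controlled by one parameter p1.
# Assumed forms (NOT confirmed by the claim text, whose formulas are missing):
#   y[n] = p1 * s_hat[n] + (1 - p1) * x[n]
#   y[n] = s_hat[n] + (1 - p1) * b_hat[n]


def remix_with_input(s_hat, x, p1):
    """Blend the estimated target signal with the input audio signal."""
    if len(s_hat) != len(x):
        raise ValueError("signals must have the same length")
    return [p1 * s + (1.0 - p1) * xi for s, xi in zip(s_hat, x)]


def remix_with_difference(s_hat, b_hat, p1):
    """Add a scaled estimated difference signal back to the estimated target."""
    if len(s_hat) != len(b_hat):
        raise ValueError("signals must have the same length")
    return [s + (1.0 - p1) * b for s, b in zip(s_hat, b_hat)]


# Toy data: x is the input, s_hat an imperfect target estimate.
x = [1.0, -0.5, 0.25, 0.0]
s_hat = [0.8, -0.6, 0.2, 0.1]
# Assumption: the estimated difference signal is the input minus the estimate.
b_hat = [xi - s for xi, s in zip(x, s_hat)]

p1 = 0.75  # hypothetical control parameter in [0, 1]
y_a = remix_with_input(s_hat, x, p1)
y_b = remix_with_difference(s_hat, b_hat, p1)
```

Under the stated assumption `b_hat = x - s_hat`, both functions yield the same output up to rounding: `p1 = 1` returns the estimated target signal unchanged and `p1 = 0` returns the input audio signal.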

4. The device according to claim 2,

wherein the determination module (120) is configured to estimate, depending on at least one of the estimated target signal, the input audio signal, and the estimated difference signal, a sound quality value as the one or more result values, wherein the sound quality value indicates the estimated sound quality of the estimated target signal, and

wherein the determination module (120) is configured to determine the one or more parameter values depending on the sound quality value.

5. The device according to claim 4,

wherein the signal processor (130) is configured to generate the separated audio signal by determining a first version of the separated audio signal and by modifying the separated audio signal one or more times to obtain one or more intermediate versions of the separated audio signal,

wherein the determination module (120) is configured to modify the sound quality value depending on one of the one or more intermediate versions of the separated audio signal, and

wherein the signal processor (130) is configured to stop modifying the separated audio signal if the sound quality value is greater than or equal to a predetermined quality value.
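The stopping rule of claim 5 can be illustrated with a short loop. This is a minimal sketch under stated assumptions, not the claimed implementation: the quality estimator `toy_quality`, the step size, the iteration limit, and the way each intermediate version is produced (mixing more of the input back in) are all hypothetical.

```python
# Hypothetical sketch of "modify until the quality value reaches a threshold".


def refine_until_quality(s_hat, x, estimate_quality, q_target,
                         step=0.1, max_iter=20):
    """Return (y, p1, q) once the estimated quality reaches q_target."""
    p1 = 1.0
    y = list(s_hat)              # first version: the estimated target signal
    q = estimate_quality(y)      # initial sound quality value
    iterations = 0
    while q < q_target and p1 > 0.0 and iterations < max_iter:
        p1 = max(0.0, p1 - step)             # assumed modification rule
        y = [p1 * s + (1.0 - p1) * xi        # intermediate version
             for s, xi in zip(s_hat, x)]
        q = estimate_quality(y)              # updated sound quality value
        iterations += 1
    return y, p1, q


x = [1.0, -0.5, 0.25, 0.0]
s_hat = [0.8, -0.6, 0.2, 0.1]


def toy_quality(y):
    """Toy stand-in for a quality model: closer to the input scores higher."""
    err = sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    return 1.0 / (1.0 + err)


y, p1, q = refine_until_quality(s_hat, x, toy_quality, q_target=0.97)
```

In a real system the toy estimator would be replaced by whatever produces the sound quality value of claim 4; the loop structure stays the same.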

6. The device according to claim 1,

wherein the determination module (120) is configured to determine the one or more result values depending on the estimated target signal and depending on at least one of the input audio signal and the estimated difference signal.

7. The device according to claim 1,

wherein the determination module (120) comprises an artificial neural network (125) for determining the one or more result values depending on the estimated target signal, wherein the artificial neural network (125) is configured to receive a plurality of input values, each of the plurality of input values depending on at least one of the estimated target signal, the estimated difference signal, and the input audio signal, and wherein the artificial neural network (125) is configured to determine the one or more result values as one or more output values of the artificial neural network (125).

8. The device according to claim 7,

wherein each of the plurality of input values depends on at least one of the estimated target signal, the estimated difference signal, and the input audio signal, and

wherein the one or more result values indicate the estimated sound quality of the estimated target signal.

9. The device according to claim 7,

wherein each of the plurality of input values depends on at least one of the estimated target signal, the estimated difference signal, and the input audio signal, and

wherein the one or more result values are the one or more parameter values.

10. The device according to claim 7,

wherein the artificial neural network (125) is configured to be trained by receiving a plurality of training data sets, wherein each of the plurality of training data sets comprises a plurality of input training values of the artificial neural network (125) and one or more output training values of the artificial neural network (125), wherein each of the plurality of output training values depends on at least one of a training target signal, a training difference signal, and a training input signal, and wherein each of the one or more output training values depends on an estimate of the sound quality of the training target signal.

11. The device according to claim 10,

wherein the estimate of the sound quality of the training target signal depends on one or more computational sound quality models.

12. The device according to claim 11,

wherein the one or more computational sound quality models are at least one of the following models:

Blind Source Separation Evaluation (BSS Eval),

Perceptual Evaluation methods for Audio Source Separation (PEASS),

Perceptual Evaluation of Audio Quality (PEAQ),

Perceptual Evaluation of Speech Quality (PESQ),

Virtual Speech Quality Objective Listener Audio (ViSQOLAudio),

Hearing-Aid Audio Quality Index (HAAQI),

Hearing-Aid Speech Quality Index (HASQI),

Hearing-Aid Speech Perception Index (HASPI), and

Short-Time Objective Intelligibility (STOI).
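To illustrate how a computational sound quality model could supply the output training values of claims 10 and 11, the sketch below builds one training data set. It is an assumption-laden example: `simple_sdr_db` is a plain energy ratio used as a stand-in and is not the BSS Eval procedure or any other model listed in claim 12, and the layout of the input training values is hypothetical.

```python
import math

# Hypothetical sketch: label a training example with a computed quality score.


def simple_sdr_db(reference, estimate):
    """Simplified signal-to-distortion ratio in dB (stand-in quality model)."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        return float("inf")      # perfect estimate
    if signal_energy == 0.0:
        return float("-inf")     # silent reference, non-zero error
    return 10.0 * math.log10(signal_energy / error_energy)


def make_training_set(train_input, train_target_estimate, reference_target):
    """Return (input training values, output training values) for one example."""
    # Assumed input layout: estimated target samples followed by input samples.
    input_values = list(train_target_estimate) + list(train_input)
    # Output training value: quality score of the estimated training target.
    output_values = [simple_sdr_db(reference_target, train_target_estimate)]
    return input_values, output_values


reference_target = [1.0, 0.0, -1.0, 0.0]
train_input = [1.2, 0.1, -0.9, -0.1]
train_target_estimate = [0.9, 0.0, -0.9, 0.0]

inputs, outputs = make_training_set(
    train_input, train_target_estimate, reference_target
)
```

A network trained on many such pairs would learn to predict the score from the signals alone, which is useful because the reference target is not available at run time.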

13. The device according to claim 7,

wherein the artificial neural network (125) is configured to determine the one or more result values depending on the estimated target signal and depending on at least one of the input audio signal and the estimated difference signal.

14. The device according to claim 1,

wherein the signal processor (130) is configured to generate the separated audio signal depending on the one or more parameter values and depending on a post-processing of the estimated target signal.

15. A method for generating a separated audio signal from an input audio signal, the input audio signal comprising a target audio signal portion and a difference audio signal portion, the difference audio signal portion indicating a difference between the input audio signal and the target audio signal portion, the method comprising:

determining an estimated target signal that depends on the input audio signal, wherein the estimated target signal is an estimate of a signal that comprises only the target audio signal portion,

determining one or more result values depending on an estimated sound quality of the estimated target signal to obtain one or more parameter values, wherein the one or more parameter values are the one or more result values or depend on the one or more result values, and

generating the separated audio signal depending on the one or more parameter values and depending on at least one of the estimated target signal, the input audio signal, and an estimated difference signal, wherein the estimated difference signal is an estimate of a signal that comprises only the difference audio signal portion,

wherein the separated audio signal is generated depending on the one or more parameter values and depending on a linear combination of the estimated target signal and the input audio signal, or

wherein the separated audio signal is generated depending on the one or more parameter values and depending on a linear combination of the estimated target signal and the estimated difference signal.
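The three steps of the method of claim 15 can be sketched end to end as follows. Every component is a hypothetical stand-in chosen only to keep the example runnable: the amplitude gate is not a real source separator, the energy ratio is not a real sound quality estimate, and the identity mapping from result value to parameter value is an assumption.

```python
# Hypothetical end-to-end sketch: separate, estimate quality, recombine.


def separate_source(x, gate=0.3):
    """Toy separator: keep only samples whose magnitude reaches the gate."""
    return [xi if abs(xi) >= gate else 0.0 for xi in x]


def estimate_quality(s_hat, x):
    """Toy result value in [0, 1]: fraction of input energy that is retained."""
    input_energy = sum(v * v for v in x)
    target_energy = sum(v * v for v in s_hat)
    if input_energy == 0.0:
        return 1.0
    return min(1.0, target_energy / input_energy)


def quality_to_parameter(q):
    """Assumed mapping: use the clipped result value as the parameter value."""
    return max(0.0, min(1.0, q))


def generate_separated_signal(x):
    s_hat = separate_source(x)               # step 1: estimated target signal
    q = estimate_quality(s_hat, x)           # step 2: result value
    p1 = quality_to_parameter(q)             # step 2: parameter value
    y = [p1 * s + (1.0 - p1) * xi            # step 3: linear combination
         for s, xi in zip(s_hat, x)]
    return y, p1


x = [1.0, 0.1, -0.5, 0.2]
y, p1 = generate_separated_signal(x)
```

With this mapping, a lower estimated quality mixes more of the input back into the output, so the separation is weakened exactly when the estimate is judged less reliable.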

16. A computer-readable medium containing program code for performing the method of claim 15 when executed on a computer processor or signal processor.