CN116964666A - Dereverberation based on media type - Google Patents

Dereverberation based on media type

Info

Publication number
CN116964666A
Authority
CN
China
Prior art keywords
audio signal
input audio
determining
media type
speech
Prior art date
Legal status
Pending
Application number
CN202280019905.6A
Other languages
Chinese (zh)
Inventor
李凯
杨少凡
马远星
Current Assignee
Dolby Laboratories Licensing Corp
Original Assignee
Dolby Laboratories Licensing Corp
Priority date
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corp
Priority claimed from PCT/US2022/019816 (published as WO2022192580A1)
Publication of CN116964666A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L2021/02082 - Noise filtering the noise being echo, reverberation of the speech

Abstract

A method for suppressing reverberation may involve receiving an input audio signal. The method may involve classifying a media type of the input audio signal as one of a group comprising at least: 1) Speech; 2) Music; or 3) speech under music. The method may involve determining whether to perform dereverberation on the input audio signal based at least on determining that a media type of the input audio signal has been classified as speech. The method may involve generating an output audio signal by performing dereverberation on the input audio signal in response to determining to perform dereverberation on the input audio signal.

Description

Dereverberation based on media type
Cross Reference to Related Applications
The present application claims priority from the following priority applications: International patent application No. PCT/CN2021/080314, filed on 11 March 2021; U.S. provisional patent application Ser. No. 63/180,710, filed on 28 April 2021; and European patent application No. 21174289.5, filed on 18 May 2021.
Technical Field
The present disclosure relates to systems, methods, and media for dereverberation. The present disclosure further relates to systems, methods, and media for classifying an input audio signal.
Background
Audio devices, such as headphones, speakers, etc., are widely deployed. People often listen to audio content (e.g., podcasts, broadcast programs, television programs, music videos, etc.) that may include mixed types of media content (e.g., speech, music, speech under music, etc.). Such audio content may include reverberation. Performing reverberation suppression on audio content can be difficult, particularly for user-generated audio content comprising mixed types of media content.
Symbols and terms
Throughout this disclosure, including in the claims, the terms "speaker (speaker)", "loudspeaker (loudspecker)" and "audio reproduction transducer" are synonymously used to refer to any sound producing transducer (or set of transducers) driven by a single speaker feed. A typical set of headphones includes two speakers. The speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter) that may be driven by a single common speaker feed or multiple speaker feeds. In some examples, the speaker feed(s) may undergo different processing in different circuit branches coupled to different transducers.
Throughout this disclosure, including in the claims, the expression "performing an operation on" a signal or data (e.g., filtering, scaling, transforming, or applying gain to a signal or data) is used in a broad sense to mean performing an operation directly on a signal or data or on a processed version of a signal or data (e.g., a version of a signal that has undergone preliminary filtering or preprocessing prior to performing an operation thereon).
Throughout this disclosure, including in the claims, the expression "system" is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem implementing a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, where the subsystem generates M inputs and the other X-M inputs are received from external sources) may also be referred to as a decoder system.
Throughout this disclosure, including in the claims, the term "processor" is used in a broad sense to mean a system or device that is programmable or otherwise configurable (e.g., in software or firmware) to perform operations on data (e.g., audio or video or other image data). Examples of processors include field programmable gate arrays (or other configurable integrated circuits or chip sets), digital signal processors programmed and/or otherwise configured to perform pipelined processing of audio or other sound data, programmable general purpose processors or computers, and programmable microprocessor chips or chip sets.
Throughout this disclosure, including in the claims, the term "coupled" or "coupled" is used to mean a direct or indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection or through an indirect connection via other devices and connections.
Throughout this disclosure, including in the claims, the term "classifier" is generally used to refer to an algorithm that predicts classes of input. For example, as used herein, an audio signal may be classified as being associated with a particular media type (e.g., speech, music, speech under music, etc.). It should be appreciated that the techniques described herein may be implemented using various types of classifiers, such as decision trees, ada-boost, XG-boost, random forests, generalized Moment Methods (GMMs), hidden Markov Models (HMMs), naive bayes, and/or various types of neural networks (e.g., convolutional Neural Networks (CNNs), deep Neural Networks (DNNs), recurrent Neural Networks (RNNs), long-short term memories (LSTM), gated loop units (GRUs), etc.).
Disclosure of Invention
At least some aspects of the present disclosure may be implemented via a method. Some methods may involve receiving an input audio signal. Some such methods may involve classifying a media type of the input audio signal as one of a group comprising at least: 1) Speech; 2) Music; or 3) speech under music. Some such methods may involve determining whether to perform dereverberation on the input audio signal based at least on a determination that a media type of the input audio signal has been classified as speech. Some such methods may involve generating an output audio signal by performing dereverberation on an input audio signal in response to determining to perform dereverberation on the input audio signal.
In some examples, the method may involve determining a degree of reverberation in the input audio signal, wherein determining whether to perform dereverberation on the input audio signal may be based on the degree of reverberation. In some examples, the degree of reverberation may be based on at least one of: 1) A reverberation time (RT), e.g., RT60; 2) a direct-to-reverberant ratio (DRR); or 3) a diffuseness estimate. In some examples, determining the degree of reverberation may involve calculating a two-dimensional acoustic modulation spectrum of the input audio signal, wherein the degree of reverberation may be based on an amount of energy in a high modulation frequency portion of the two-dimensional acoustic modulation spectrum. In some examples, determining the degree of reverberation may involve calculating at least one of: 1) A ratio of the energy in the high modulation frequency portion of the two-dimensional acoustic modulation spectrum to the energy at all modulation frequencies of the two-dimensional acoustic modulation spectrum; or 2) a ratio of the energy in the high modulation frequency portion of the two-dimensional acoustic modulation spectrum to the energy in a low modulation frequency portion of the two-dimensional acoustic modulation spectrum.
In some examples, the method may involve determining whether to perform dereverberation of the input audio signal based on a determination that the degree of reverberation exceeds a threshold.
In some examples, the method may involve classifying a media type of the input audio signal by separating the input audio signal into two or more spatial components. According to some embodiments, the two or more spatial components may include a middle channel and a side channel. In some examples, the method may further involve calculating a power of the side channel, and classifying the side channel in response to determining that the power of the side channel exceeds a threshold. According to other embodiments, the two or more spatial components include a diffuse component and a direct component. In some examples, classifying the media type of the input audio signal may involve classifying each of the two or more spatial components as one of: 1) Speech; 2) Music; or 3) speech under music, wherein the media type of the input audio signal may be classified by combining the classifications for each of the two or more spatial components. In some examples, in response to determining that the input audio signal includes stereo audio, the input audio signal may be separated into two or more spatial components.
In some examples, the method may involve classifying a media type of the input audio signal by separating the input audio signal into a human sound component and a non-human sound component. In some examples, in response to determining that the input audio signal includes a single audio channel, the input audio signal may be separated into the human sound component and the non-human sound component. In some examples, the method may further involve classifying the human sound component as one of: 1) Speech; or 2) non-speech. The method may further involve classifying the non-human sound component as one of: 1) Music; or 2) non-music. In some examples, the media type of the input audio signal may be classified by combining the classification of the human sound component and the classification of the non-human sound component.
In some examples, determining whether to perform dereverberation on the input audio signal may be based on a classification of a second input audio signal preceding the input audio signal.
In some examples, the method may involve receiving a third input audio signal. The method may further involve determining not to perform dereverberation on the third input audio signal. The method may further involve disabling the performing of the dereverberation algorithm on the third input audio signal in response to determining not to perform the dereverberation on the third input audio signal. In some examples, determining not to perform dereverberation on the third input audio signal may be based at least in part on a classification of a media type of the third input audio signal. In some examples, the classification of the third input audio signal may be one of: 1) Music; or 2) speech under music. In some examples, determining not to perform dereverberation on the third input audio signal may be based at least in part on a determination that a degree of reverberation in the third input audio signal is below a threshold.
According to another aspect of the present disclosure, there is provided a method for classifying an input audio signal into one of at least two media types, the method comprising: receiving an input audio signal; separating the input audio signal into two or more spatial components; and classifying each of the two or more spatial components as one of at least two media types, wherein the media types of the input audio signal are classified by combining the classifications of each of the two or more spatial components.
In some examples, the two or more spatial components include a middle channel and a side channel, and the method further comprises: calculating the power of the side channel; and classifying the side channel in response to determining that the power of the side channel exceeds a threshold.
In some examples, the two or more spatial components include a diffuse component and a direct component.
In some examples, the input audio signal is separated into two or more spatial components in response to determining that the input audio signal includes stereo audio.
In some examples, classifying the media type of the input audio signal includes separating the input audio signal into a human sound component and a non-human sound component. In some examples, in response to determining that the input audio signal includes a single audio channel, the input audio signal is separated into the human sound component and the non-human sound component. In some examples, classifying the media type of the input audio signal includes: classifying the human sound component as one of: 1) Speech; or 2) non-speech; and classifying the non-human sound component as one of: 1) Music; or 2) non-music, wherein the media type of the input audio signal is classified by combining the classification of the human sound component and the classification of the non-human sound component.
Some or all of the operations, functions, and/or methods described herein may be performed by one or more devices in accordance with instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as the memory devices described herein, including but not limited to Random Access Memory (RAM) devices, Read Only Memory (ROM) devices, and the like. Thus, some innovative aspects of the subject matter described in this disclosure can be implemented via one or more non-transitory media having software stored thereon.
At least some aspects of the present disclosure may be implemented via an apparatus. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some embodiments, the apparatus is or includes an audio processing system having an interface system and a control system. The control system may include one or more general purpose single- or multi-chip processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or a combination thereof.
The present disclosure provides various technical advantages. For example, speech intelligibility may be improved by selectively performing dereverberation on a particular type of input audio signal (e.g., an input audio signal classified as speech). Furthermore, by prohibiting the performance of dereverberation on other types of input audio signals (e.g., input audio signals classified as music, speech under music, etc.), adverse consequences of dereverberation, such as reduced audio quality, may be avoided for audio signals that do not require an increase in speech intelligibility. Technical advantages of the present disclosure may be particularly useful for user-generated content (e.g., podcasts).
The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
Drawings
Fig. 1A and 1B illustrate representations of example audio signals including reverberation.
Fig. 2 illustrates a block diagram of an example system for performing dereverberation based on media types, according to some embodiments.
Fig. 3 illustrates an example of a process for performing dereverberation based on media types, according to some embodiments.
Fig. 4 illustrates an example of a process for spatial separation of input audio signals according to some embodiments.
Fig. 5 illustrates an example of a process for source separation of an input audio signal according to some embodiments.
Fig. 6 illustrates an example of a process for determining a degree of reverberation according to some embodiments.
Fig. 7A, 7B, 7C, and 7D illustrate exemplary plots of two-dimensional acoustic modulation spectra of exemplary audio signals.
Fig. 8 shows a block diagram illustrating an example of components of an apparatus capable of implementing various aspects of the disclosure.
Like reference numbers and designations in the various drawings indicate like elements.
Detailed Description
Reverberation occurs when an audio signal is distorted by various reflections from various surfaces (e.g., walls, ceilings, floors, furniture, etc.). Reverberation can have a significant impact on sound quality and speech intelligibility. Thus, dereverberation may be performed on an audio signal that includes speech to improve speech intelligibility.
Sound arriving at a receiver (e.g., a human listener, microphone, etc.) consists of direct sound, which includes sound directly from a sound source without any reflections, and reverberant sound, which includes sound reflected from various surfaces in the environment. Reverberant sound includes early reflections and late reflections. Early reflections may arrive at the receiver shortly after or simultaneously with the direct sound and may thus be partly integrated into the direct sound. The integration of early reflections with the direct sound creates a spectral coloring effect that helps to improve perceived sound quality. Late reflections reach the receiver after early reflections (e.g., more than 50-80 milliseconds after the direct sound). Late reflections may adversely affect speech intelligibility. Thus, dereverberation may be performed on the audio signal to reduce the effects of late reflections present in the audio signal, thereby improving speech intelligibility.
Fig. 1A shows an example of an acoustic impulse response in a reverberant environment. As illustrated, early reflections 102 may arrive at the receiver simultaneously with or shortly after the direct sound. In contrast, late reflection 104 may reach the receiver after early reflection 102.
Fig. 1B shows an example of a time-domain input audio signal 152 and a corresponding spectrogram 154. As illustrated in the spectrogram 154, early reflections may produce a change in the spectrogram 154 as depicted by the spectral coloring 156.
Dereverberation may reduce audio quality, for example, by reducing perceived loudness, changing spectral coloring effects, and the like. The reduced audio quality may be particularly disadvantageous when performing dereverberation on an audio signal that mainly comprises music or speech under music. For example, the audio quality of an audio signal that mainly comprises music or speech under music may be reduced without any improvement in speech intelligibility. As a more specific example, dereverberation may be adapted to process low-quality speech content, such as user-generated content captured in far-field use cases. Continuing with this particular example, user-generated content, such as a podcast, may include low-quality speech content and professionally generated music content. In some cases, professionally generated music content may include artificial reverberation. In such cases, applying dereverberation to mixed media content (e.g., content including low-quality speech content and professionally generated music content with artificial reverberation) may result in excessive suppression of reverberation, thereby reducing audio quality.
In some implementations, dereverberation may be performed on the input audio signal based on an identification of the media type(s) associated with the input audio signal. For example, the input audio signal may be analyzed to determine whether the input audio signal is: 1) Speech; 2) Music; 3) Speech under music; or 4) other items. Examples of speech under music may include podcast intros or outros, television program intros or outros, etc.
In some implementations, dereverberation may be performed on an input audio signal that is identified as speech or predominantly speech. Conversely, dereverberation of an input audio signal identified as music, predominantly music, speech under music, or predominantly speech under music may be inhibited. By disabling the dereverberation for non-speech or predominantly non-speech media types, dereverberation may be performed on an input audio signal that would significantly benefit from the dereverberation (e.g., because the input audio signal comprises predominantly speech), while preventing degradation of sound quality due to the dereverberation when such dereverberation is not required to improve speech intelligibility.
In some implementations, the input audio signal may be classified, using various techniques, as one of: 1) Speech; 2) Music; 3) Speech under music; or 4) other items. As used herein, "other items" may refer to noise, sound effects, speech under sound effects, and the like. For example, in some implementations, an input audio signal may be classified by dividing the input audio signal into two or more spatial components and classifying each spatial component as one of: 1) Speech; 2) Music; 3) Speech under music; or 4) other items. Continuing with the example, in some implementations, the classifications for each spatial component may then be combined to generate an aggregate classification for the input audio signal. As another example, in some implementations, the input audio signal may be classified by separating the input audio signal into a human voice component and a non-human voice component. The human voice component may be classified as one of: 1) Speech; or 2) non-speech, and the non-human voice component may be classified as one of: 1) Music; or 2) non-music. Continuing with the example, in some implementations, the classifications of the human voice component and the non-human voice component may then be combined to generate an aggregate classification of the input audio signal. Although the present disclosure describes several methods for classification in the context of methods for suppressing reverberation, the classification methods of the present invention may be used in other contexts. In particular, the present disclosure relates to a method for classifying an input audio signal into one of at least two media types, comprising: receiving an input audio signal; separating the input audio signal into two or more spatial components; and classifying each of the two or more spatial components as one of at least two media types, wherein the media type of the input audio signal is classified by combining the classifications of each of the two or more spatial components.
In some implementations, an input audio signal that has been classified as speech may additionally be analyzed to determine the amount of reverberation present in the input audio signal. In some such implementations, dereverberation may be performed on an input audio signal that has been identified as having an amount of reverberation exceeding a threshold amount. The amount of reverberation may be identified using a direct-to-reverberant ratio (DRR), and/or using a reverberation time (RT), e.g., the time required for the sound level to decay by 60 dB (RT60), and/or using a diffuseness measurement, and/or other suitable reverberation metrics. Note that the amount of reverberation may be a function of the DRR, wherein the amount of reverberation increases with decreasing values of the DRR, and wherein the amount of reverberation decreases with increasing values of the DRR.
Additionally or alternatively, in some implementations, dereverberation may be performed on the input audio signal based on a classification of the media type of a previous audio signal. In some implementations, the previous audio signal may be a previous frame or portion of audio content preceding the input audio signal. In some implementations, the classification of the input audio signal may be adjusted based on the classification of the previous audio signal such that classifications of neighboring audio signals are effectively smoothed. The adjustment may be performed based on the confidence level of each classification. Determining whether to perform dereverberation on the input audio signal based at least in part on the classification of the previous audio signal may prevent dereverberation from being applied in an incoherent manner, thereby improving overall audio quality.
In some implementations, dereverberation may be performed on the input audio signal using various techniques. For example, in some implementations, dereverberation may be performed based on amplitude modulation of the input audio signal at various frequency bands. As a more specific example, in some embodiments, the time-domain audio signal may be transformed into a frequency-domain signal. Continuing with this more specific example, the frequency domain signal may be divided into a plurality of sub-bands, for example, by applying a filter bank to the frequency domain signal. Continuing with this more specific example further, an amplitude modulation value may be determined for each subband, and a band pass filter may be applied to the amplitude modulation value. In some implementations, the band pass filter value can be selected based on the cadence of the human speech, e.g., such that the center frequency of the band pass filter exceeds the cadence of the human speech (e.g., in the range of 10-20Hz, approximately 15Hz, etc.). Continuing this particular example still further, the gain of each sub-band may be determined based on a function of the amplitude modulation signal value and the band pass filtered amplitude modulation value. Gain may then be applied to each subband. In some embodiments, dereverberation may be performed using techniques described in U.S. patent No. 9,520,140, which is incorporated herein by reference in its entirety.
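By way of illustration only, the following Python sketch outlines one possible way to structure such sub-band amplitude modulation processing. The 512-sample frame, the 10-20 Hz modulation band, and the gain rule are assumptions made for this example; the sketch is not a reproduction of the method of U.S. patent No. 9,520,140.

    import numpy as np
    from scipy.signal import stft, istft, butter, sosfiltfilt

    def suppress_reverb_modulation(x, fs, mod_band=(10.0, 20.0), gain_floor=0.3):
        # Transform the time-domain signal into sub-bands (frame and hop sizes are assumptions)
        f, t, X = stft(x, fs=fs, nperseg=512, noverlap=384)
        env = np.abs(X)                        # amplitude modulation value per sub-band
        frame_rate = fs / (512 - 384)          # sample rate of the sub-band envelopes
        # Band pass filter the amplitude modulation values above the cadence of speech
        sos = butter(2, mod_band, btype='bandpass', fs=frame_rate, output='sos')
        high_mod = np.abs(sosfiltfilt(sos, env, axis=1))
        # Illustrative gain rule: attenuate sub-bands dominated by high modulation frequencies
        gain = np.clip(1.0 - high_mod / (env + 1e-12), gain_floor, 1.0)
        _, y = istft(X * gain, fs=fs, nperseg=512, noverlap=384)
        return y

In practice, the filter order, the modulation band, and the gain function would be tuned to the syllabic rate of speech and to the desired trade-off between reverberation suppression and audio quality.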
As another example, in some implementations, dereverberation may be performed by estimating the dereverberated signal using a deep neural network, a weighted prediction error method, a variance-normalized delayed linear prediction method, a single-channel linear filter, a multi-channel linear filter, or the like. As yet another example, in some implementations, dereverberation may be performed by estimating a room response and performing a deconvolution operation on the input audio signal based on the room response.
It should be noted that the techniques described herein for dereverberation based on media types may be performed on various types or forms of audio content, including, but not limited to: podcasts, broadcast programs, audio content associated with video conferences, audio content associated with television programs or movies, and the like. The audio content may be live or pre-recorded.
Fig. 2 illustrates a block diagram of an example system 200 that can be used to perform dereverberation based on an identified media type associated with an input audio signal, in accordance with some embodiments.
As illustrated, the system 200 may include a media type classifier 202. The media type classifier 202 may receive an input audio signal. In some implementations, the media type classifier 202 may classify the input audio signal as: 1) Speech; 2) Music; 3) Speech under music; or 4) other items.
In some implementations, in response to determining that the input audio signal is not speech or is not predominantly speech (e.g., determining that the input audio signal is music, speech under music, or other items), the media type classifier 202 may pass the input audio signal without directing the input audio signal to the reverberation analyzer 204. Conversely, in response to determining that the input audio signal is speech or predominantly speech, the media type classifier 202 may pass the input audio signal to the reverberation analyzer 204.
In some implementations, the reverberation analyzer 204 may determine the degree of reverberation present in the input audio signal. In some implementations, the reverberation analyzer 204 may determine to perform dereverberation on the input audio signal in response to determining that the degree of reverberation exceeds a threshold. That is, in some implementations, the reverberation analyzer 204 may direct the input audio signal to the dereverberation component 206 in response to further determining that the input audio signal is sufficiently reverberant. Conversely, in response to determining that the input audio signal is insufficiently reverberant (e.g., the input audio signal includes relatively "dry" speech), the reverberation analyzer 204 may pass the input audio signal through without directing it to the dereverberation component 206, effectively disabling dereverberation for the input audio signal.
The dereverberation component 206 may take as input an input audio signal that has been determined to have reverberation exceeding a threshold and may generate a dereverberated audio signal. It should be appreciated that dereverberation component 206 may perform any suitable reverberation suppression technique(s).
In some implementations, the media type classifier 202 classifies the media type of the input audio signal based on one or both of spatial separation of components of the input audio signal or musical source separation of components of the input audio signal.
For example, in some implementations, the media type classifier 202 may include a spatial information separator 208. The spatial information separator 208 may separate the input audio signal into two or more spatial components. Examples of two or more spatial components may include direct and diffuse components, side and middle channels, and the like. In some implementations, the spatial information separator 208 may classify the media type of the input audio signal by classifying each of the two or more spatial components separately. In some implementations, the spatial information separator 208 may then generate a classification of the input audio signal by combining the classifications for each of the two or more components (e.g., by using a decision fusion algorithm). Examples of decision fusion algorithms that may be used to combine the classifications of each of the two or more components include Bayesian analysis, the Dempster-Shafer algorithm, fuzzy logic algorithms, and the like. Note that a technique for classifying media types based on spatial source separation is shown in fig. 4 and described below in connection with this figure.
As another example, in some implementations, the media type classifier 202 may include a music source separator 210. The music source separator 210 may separate the input audio signal into a human sound component and a non-human sound component. In some implementations, the music source separator 210 may then classify the human sound component as one of: 1) Speech; or 2) non-speech. In some implementations, the music source separator 210 may classify the non-human sound component as one of: 1) Music; or 2) non-music. In some implementations, the music source separator 210 may generate a classification of the input audio signal, based on the classifications of the human sound component and the non-human sound component, as one of: 1) Speech; 2) Music; 3) Speech under music; or 4) other items. For example, in some implementations, the music source separator 210 may combine the classifications of the human sound and non-human sound components (e.g., by using a decision fusion algorithm). Examples of decision fusion algorithms that may be used to combine the classifications of each of two or more components include Bayesian analysis, the Dempster-Shafer algorithm, fuzzy logic algorithms, and the like.
In some implementations, the media type classifier 202 may determine whether to classify the media type of the input audio signal using the spatial information separator 208 or the music source separator 210. For example, the media type classifier 202 may determine to classify the media type using the spatial information separator 208 in response to determining that the input audio signal is a stereo audio signal. As another example, the media type classifier 202 may determine to classify the media type using the music source separator 210 in response to determining that the input audio signal is a single-channel audio signal.
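A minimal sketch of this routing decision is shown below; the function and classifier names are hypothetical placeholders standing in for the spatial information separator 208 and the music source separator 210.

    import numpy as np

    def classify_media_type(audio, classify_spatial, classify_source_separated):
        # `audio` is assumed to be a NumPy array shaped (channels, samples)
        if audio.ndim == 2 and audio.shape[0] >= 2:
            # Stereo or multi-channel input: classify via spatial separation (e.g., mid/side)
            return classify_spatial(audio)
        # Single-channel input: classify via vocal / non-vocal music source separation
        return classify_source_separated(np.atleast_2d(audio)[0])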
In the example of fig. 2, the media type classifier 202 is used in the context of a system 200 for performing dereverberation. It is emphasized that the media type classifier 202 may be used as a stand-alone system or may be used in other audio processing systems.
Fig. 3 illustrates an example of a process 300 for performing dereverberation on an input audio signal based on media type classification, according to some embodiments. In some implementations, the blocks of process 300 may be performed by a device or apparatus (e.g., apparatus 200 of fig. 2). It should be noted that in some implementations, the blocks of process 300 may be performed in an order not shown in fig. 3, and/or one or more blocks of process 300 may be performed substantially in parallel. Additionally, it should be noted that in some implementations, one or more blocks of process 300 may be omitted.
At 302, the process 300 may receive an input audio signal. The input audio signal may be recorded or may be live content. The input audio signal may include various types of audio content, such as speech, music, speech under music, and the like. Example types of audio content may include podcasts, broadcast programs, audio content associated with television programs or movies, and the like.
At 304, the process 300 may classify the media type of the input audio signal. For example, in some implementations, the process 300 may classify the input audio signal as one of: 1) Speech; 2) Music; 3) Speech under music; or 4) other items.
In some implementations, the process 300 may classify media types of the input audio signal based on separation of spatial components of the input audio signal. For example, in some implementations, the process 300 may separate the input audio signal into two or more spatial components, such as direct and diffuse components, side and middle channels, and the like. In some implementations, the process 300 can then classify the media type of the audio content in each spatial component. In some implementations, the process 300 may then classify the input audio signal by combining the classifications for each spatial component. Note that a more detailed technique for classifying media types of an input audio signal based on spatial separation is shown in fig. 4 and described below in connection with this figure.
Additionally or alternatively, in some implementations, the process 300 may classify media types of the input audio signal based on musical source separation of the input audio signal. For example, in some implementations, the process 300 may separate the input audio signal into a human sound component and a non-human sound component. In some implementations, the process 300 can then classify the media type of the audio content in each of the human voice component and the non-human voice component. In some implementations, the process 300 may then classify the input audio signal by combining the classifications of each of the human voice component and the non-human voice component. Note that a more detailed technique for classifying media types of an input audio signal based on music source separation is shown in fig. 5 and described below in connection with this figure.
At 306, the process 300 may determine whether to analyze the reverberation characteristics of the input audio signal. In some implementations, the process 300 may determine whether to analyze the reverberation characteristics based on the media type classification of the input audio signal determined at block 304. For example, in some implementations, the process 300 may determine to analyze the reverberation characteristics in response to determining that the media type classification of the input audio signal is speech (yes at 306). In contrast, in some implementations, in response to determining that the media type classification is not speech (e.g., the media type classification is music, speech under music, or other items), the process 300 may determine not to analyze the reverberation characteristics (no at 306).
If at 306, the process 300 determines that the reverberation characteristics are not to be analyzed (no at 306), the process 300 may end at 314.
Conversely, if at 306, the process 300 determines that the reverberation characteristics are to be analyzed (yes at 306), the process 300 may determine the degree of reverberation in the input audio signal at 308.
In some implementations, the reverberation level may be calculated using RT60 metrics and/or DRR metrics associated with the input audio signal.
Additionally or alternatively, in some implementations, the process 300 may determine a degree of reverberation in the input audio signal based on the spectrogram information. For example, in some implementations, the process 300 may determine the degree of reverberation based on energy at various modulation frequencies of the input audio signal. In particular, because non-reverberant speech may tend to have modulation frequency peaks at relatively low modulation frequencies (e.g., 3Hz, 4Hz, etc.), and because reverberant speech may tend to have a significant amount of energy at higher modulation frequencies (e.g., 10Hz, 20Hz, 50Hz, etc.), process 300 may determine the degree of reverberation in the input audio signal based on the energy of the input audio signal at relatively high modulation frequencies (e.g., above 10Hz, above 20Hz, etc.).
Note that a more detailed technique for determining the degree of reverberation based on the spectrogram information is shown in fig. 7 and described below in connection with this figure.
At 310, process 300 may determine whether to perform dereverberation on the input audio signal. In some implementations, the process 300 may determine whether to perform dereverberation based on the degree of reverberation determined at block 308. For example, in some implementations, the process 300 may determine to perform dereverberation in response to determining that the degree of reverberation exceeds a threshold (yes at 310). As another example, in some implementations, the process 300 may determine not to perform dereverberation in response to determining that the degree of reverberation is below a threshold (no at 310).
In some implementations, the process 300 may additionally or alternatively determine whether to perform dereverberation on the input audio signal based on a media type classification of a previous audio signal. The previous audio signal may correspond to a frame or portion of audio content preceding the input audio signal. It should be noted that the frames or portions of audio content may have any suitable duration, such as 10 milliseconds, 20 milliseconds, etc.
In some implementations, the process 300 may determine whether to perform dereverberation on the input audio signal based on the media type classification of the previous audio signal by adjusting the media type classification based on the classification of the previous audio signal (e.g., as determined at block 304). For example, in some implementations, the media type classification of the input audio signal may be adjusted based on a confidence level of the media type classification of the input audio signal and/or based on a confidence level of the media type classification of a previous audio signal. As a more specific example, where the media type classification of the previous audio signal is associated with a relatively high confidence level (e.g., greater than 70%, greater than 80%, etc.) and the media type classification of the input audio signal is associated with a relatively low confidence level (e.g., less than 30%, less than 20%, etc.), the media type classification of the input audio signal may be adjusted or modified to the media type classification of the previous audio signal. It should be noted that the adjustment of the media type classification of the input audio signal may be performed one or more times. For example, the media type classification may be adjusted prior to analyzing the reverberation characteristics at block 306. As another example, the media type classification may be adjusted after determining the reverberation level at block 308.
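The following sketch illustrates one hypothetical way to implement such confidence-based smoothing; the 0.8 and 0.3 thresholds are example values chosen from the ranges mentioned above, not values prescribed by this disclosure.

    def smooth_media_type(current_label, current_conf, previous_label, previous_conf,
                          high_conf=0.8, low_conf=0.3):
        # Carry the previous frame's label forward only when it is much more certain
        if previous_conf >= high_conf and current_conf <= low_conf:
            return previous_label, previous_conf
        return current_label, current_conf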
If, at 310, process 300 determines that dereverberation is not performed ("no" at 310), process 300 may end at 314.
Conversely, if at 310, process 300 determines that dereverberation is to be performed ("yes" at 310), process 300 may generate an output audio signal by performing dereverberation on the input audio signal. For example, in some implementations, dereverberation may be performed based on amplitude modulation of the input audio signal at various frequency bands. As a more specific example, dereverberation may be performed using the techniques described in U.S. patent No. 9,520,140, which is incorporated herein by reference in its entirety. As another example, in some implementations, dereverberation may be performed by estimating the dereverberated signal using a deep neural network, a multi-channel linear filter, or the like. As yet another example, in some implementations, the dereverberation may be performed by estimating a room response and performing a deconvolution operation on the input audio signal based on the room response.
Process 300 may then end at 314.
It should be noted that after ending at 314, the output audio signal may be presented, for example, via speakers, headphones, or the like. In some implementations, the output audio signal may be the original input audio signal without performing the dereverberation of block 312 (e.g., because the input audio signal is classified as music, speech under music, or other non-speech content). In some implementations, different dereverberation techniques other than those applied at 312 may be applied to the original input audio signal without performing the dereverberation of block 312.
In some implementations, where dereverberation is performed at block 312, the output audio signal may correspond to the dereverberated input audio signal.
In some implementations, the media types of the input audio signals may be classified based on spatial separation of components of the input audio signals. Example components include direct and diffuse components, mid and side channels, and the like. In some implementations, each spatial component may be classified as one of: 1) Speech; 2) Music; 3) Speech under music; or 4) other items. In some implementations, the input audio signal may be classified based on a combination of classifications for each spatial component. In some implementations, two or more spatial components may be identified based on an upmix of the input audio signal. In some implementations, media type classification of the input audio signal based on spatial separation of components of the input audio signal may be performed in response to determining that the input audio signal is a multi-channel audio signal (e.g., stereo audio signal, 5.1 audio signal, 7.1 audio signal, etc.).
Fig. 4 illustrates an example of a process 400 for classifying media types of an input audio signal based on spatial separation of components of the input audio signal, according to some embodiments. It should be noted that the blocks of process 400 may be performed in various orders not shown in fig. 4, and/or in some implementations, two or more blocks of process 400 may be performed substantially in parallel. Additionally or alternatively, it should be noted that in some implementations, one or more blocks of process 400 may be omitted.
Process 400 may begin at 402 by receiving an input audio signal. In some implementations, the input audio signal may include two or more audio channels.
At 404, the process 400 may upmix the input audio signal to increase the number of audio channels associated with the input audio signal. Process 400 may use various types of upmixing. For example, in some implementations, the process 400 may perform an upmixing technique such as left/right to mid/side shuffling. As another example, in some implementations, the process 400 may perform an upmixing technique that transforms a stereo audio input into multi-channel content (e.g., 5.1, 7.1, etc.).
In some implementations, the input audio signal may be split into a direct component and a diffuse component. For example, in some implementations, the direct and diffuse components may be identified based on inter-channel coherence. As a more specific example, in some implementations, the direct and diffuse components may be identified based on a coherence matrix analysis.
At 406, the process 400 may obtain a side channel and a middle channel from the upmixed input audio signal. For example, in the case where the upmixed input audio signal corresponds to shuffled mid/side channels, the side channel may correspond to the shuffled side channel, and the middle channel may correspond to the shuffled mid channel. As another example, in the case where the upmixed input audio signal corresponds to a multi-channel upmix (e.g., 5.1, 7.1, etc.), the middle channel may be obtained directly from the upmixed audio signal, and the side channel may be obtained by downmixing left/right pairs (e.g., left/right, left surround/right surround, etc.).
In the case where the input audio signal is divided into a direct component and a diffuse component, the middle channel may correspond to the direct component and the side channel may correspond to the diffuse component.
At 408, the process 400 may determine whether the power in the side channel exceeds a threshold. Examples of the threshold may be -65 dB relative to full scale (dBFS), -68 dBFS, -70 dBFS, -72 dBFS, etc.
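As an illustration of blocks 404-408, the sketch below derives mid/side channels from a left/right pair and gates the side-channel classifier on side-channel power; samples normalized to [-1, 1] and the -70 dBFS threshold are assumptions made for this example.

    import numpy as np

    def shuffle_mid_side(left, right):
        # Left/right to mid/side shuffling (blocks 404/406)
        mid = 0.5 * (left + right)
        side = 0.5 * (left - right)
        return mid, side

    def side_channel_active(side, threshold_dbfs=-70.0):
        # Compare side-channel power against the threshold (block 408)
        power = np.mean(side ** 2) + 1e-12
        return 10.0 * np.log10(power) > threshold_dbfs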
If it is determined at 408 that the power in the side channel does not exceed the threshold (NO at 408), the process 400 may proceed to block 412.
Conversely, if it is determined at 408 that the power in the side channel exceeds the threshold ("yes" at 408), the process 400 may classify the side channel as one of: 1) Speech; 2) Music; 3) Speech under music; or 4) other items. In some implementations, the classification of the side channel may be associated with a confidence level. Examples of classifiers that may be used to classify the side channel include k-nearest neighbors, case-based reasoning, decision trees, naive Bayes, and/or various types of neural networks (e.g., Convolutional Neural Networks (CNNs), etc.).
At 412, the process 400 may classify the middle channel as one of: 1) Speech; 2) Music; 3) Speech under music; or 4) other items. In some implementations, the classification of the middle channel may be associated with a confidence level. Examples of classifiers that may be used to classify the middle channel include k-nearest neighbors, case-based reasoning, decision trees, naive Bayes, and/or various types of neural networks (e.g., Convolutional Neural Networks (CNNs), etc.).
At 414, the process 400 may classify the input audio signal as one of the following by combining the side channel classification (if present) with the middle channel classification: 1) Speech; 2) Music; 3) Speech under music; or 4) other items.
For example, in some embodiments, a decision fusion algorithm may be used to combine the side channel classification and the middle channel classification. Examples of decision fusion algorithms that may be used to combine the classifications of each of two or more components include Bayesian analysis, the Dempster-Shafer algorithm, fuzzy logic algorithms, and the like.
As another example, in some implementations, in response to a side channel being classified as music, speech under music, or other items, the input audio signal may be classified as "nonspeech" regardless of the classification of the middle channel. As a more specific example, in the case where the middle channel is classified as "speech" and the side channels are classified as "music", the input audio signal may be classified as speech under music.
As yet another example, in some implementations, the side channel classification and the middle channel classification may be combined based on confidence levels associated with the side channel classification and the middle channel classification, respectively. As a more specific example, in some embodiments, the side channel classification and the middle channel classification may be combined such that the classification of spatial components associated with higher confidence levels is given more weight in the combination. As a specific example, where the middle channel is classified as "speech" with a relatively high confidence level (e.g., greater than 70%, greater than 80%, etc.) and the side channels are classified as "music", "speech under music", or "other items" with a relatively low confidence level (e.g., less than 30%, less than 20%, etc.), the input audio signal may be classified as speech. As another specific example, where the middle channel is classified as "speech" with a relatively low confidence level (e.g., less than 30%, less than 20%, etc.) and the side channels are classified as "music", "speech under music", or "other items" with a relatively high confidence level (e.g., greater than 70%, greater than 80%, etc.), the input audio signal may be classified as "speech under music" or "other items".
It should be noted that in the event that the side channels are not classified (e.g., because the power in the side channels is below a threshold as determined at block 408), the classification of the input audio signal may correspond to the classification of the middle channel.
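One hypothetical fusion rule reflecting the examples above is sketched below; the labels, confidence thresholds, and fallback behavior are illustrative assumptions rather than a definitive decision fusion algorithm.

    def fuse_channel_classifications(middle, side, high_conf=0.7, low_conf=0.3):
        # `middle` and `side` are (label, confidence) pairs; `side` is None when
        # the side channel power was below the threshold at block 408
        if side is None:
            return middle[0]
        mid_label, mid_conf = middle
        side_label, side_conf = side
        if mid_label == "speech" and side_label == "music":
            return "speech under music"
        if mid_conf >= high_conf and side_conf <= low_conf:
            return mid_label
        if side_conf >= high_conf and mid_conf <= low_conf:
            return "speech under music" if side_label == "music" else side_label
        # Otherwise defer to the classification with the higher confidence
        return mid_label if mid_conf >= side_conf else side_label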
In some implementations, the input audio signal may be classified based on music source separation of the input audio signal into a human sound component and a non-human sound component. The human sound component may then be classified as speech or non-speech, and the non-human sound component may be classified as music or non-music. In some implementations, based on a combination of the classifications of the human sound component and the non-human sound component, the input audio signal may then be classified as one of: 1) Speech; 2) Music; 3) Speech under music; or 4) other items. In some implementations, music source separation may be used to classify the input audio signal in response to determining that the input audio signal is a single-channel audio signal. Additionally or alternatively, in some implementations, music source separation may be used to classify the input audio signal in addition to classifying the input audio signal based on spatial separation of components.
Fig. 5 illustrates an example of a process 500 for classifying an input audio signal based on music source separation, according to some embodiments. It should be noted that the blocks of process 500 may be performed in various orders not shown in fig. 5, and/or in some implementations, two or more blocks of process 500 may be performed substantially in parallel. Additionally or alternatively, it should be noted that in some implementations, one or more blocks of process 500 may be omitted.
Process 500 may begin at 502 by receiving an input audio signal. In some implementations, the input audio signal may be a single channel audio signal.
At 504, the process 500 may separate the input audio signal into a human voice component and a non-human voice component. In some implementations, one or more trained machine learning models may be used to identify the human voice component and the non-human voice component. Example types of machine learning models that may be used to separate the input audio signal into human voice and non-human voice components include Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, Convolutional Recurrent Neural Networks (CRNNs), Gated Recurrent Units (GRUs), Convolutional Gated Recurrent Units (CGRUs), and so forth.
At 506, the process 500 may classify the human voice component as one of: 1) Speech; or 2) non-speech. In some implementations, the classification of the human voice component can be associated with a confidence level. Examples of classifiers that may be used to classify the human voice component include k-nearest neighbors, case-based reasoning, decision trees, naive Bayes, and/or various types of neural networks (e.g., Convolutional Neural Networks (CNNs), etc.).
At 508, the process 500 may classify the non-human voice component as one of: 1) Music; or 2) non-music. In some implementations, the classification of the non-human voice component may be associated with a confidence level. Examples of classifiers that may be used to classify the non-human voice component include k-nearest neighbors, case-based reasoning, decision trees, naive Bayes, and/or various types of neural networks (e.g., Convolutional Neural Networks (CNNs), etc.).
At 510, process 500 may classify the input audio signal as one of the following by combining the classification of the human voice component and the classification of the non-human voice component: 1) Speech; 2) Music; 3) Speech under music; or 4) other items. For example, in some embodiments, the classification of the human voice component may be combined with the classification of the non-human voice component using any suitable decision fusion algorithm(s) that combines the classifications from the two classifiers to generate an aggregate classification of the input audio signal. Examples of decision fusion algorithms that may be used to combine the classifications of each of two or more components include Bayesian analysis, the Dempster-Shafer algorithm, fuzzy logic algorithms, and the like.
As another example, in some implementations, the classification of the human voice component and the classification of the non-human voice component may be combined based on confidence levels of the classification of the human voice component and the classification of the non-human voice component, respectively. As a more specific example, in some embodiments, the classification of the human voice component and the classification of the non-human voice component may be combined such that components associated with higher confidence levels are given more weight in the combination.
In some implementations, an amount of reverberation present in the input audio signal can be determined. In some implementations, the DRR may be used to calculate the amount of reverberation. For example, in some embodiments, the amount of reverberation may be inversely related to the DRR, such that the amount of reverberation increases with decreasing values of the DRR and decreases with increasing values of the DRR. In some implementations, the amount of reverberation can be calculated using the duration required for the sound pressure level to decrease by a fixed amount (e.g., 60 dB). For example, the amount of reverberation can be calculated using RT60, which indicates the time required for the sound pressure level to decrease by 60 dB. In some implementations, the DRR or RT60 associated with the input audio signal may be estimated using various algorithms or techniques, which may be based on signal processing and/or on machine learning models.
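For reference, the sketch below computes DRR and RT60 from a known room impulse response; the 2.5-millisecond direct-path window and the -5 dB to -25 dB fit range are common but assumed choices, and blind estimators operating on the audio signal itself would only approximate these quantities.

    import numpy as np

    def drr_db(h, fs, direct_window_ms=2.5):
        # Direct-to-reverberant ratio; `h` is an impulse response aligned so that
        # the direct path arrives at index 0 (the window length is an assumption)
        split = int(fs * direct_window_ms / 1000.0)
        direct = np.sum(h[:split] ** 2)
        reverberant = np.sum(h[split:] ** 2) + 1e-12
        return 10.0 * np.log10(direct / reverberant + 1e-12)

    def rt60_seconds(h, fs):
        # RT60 via Schroeder backward integration, extrapolated from the decay
        # between -5 dB and -25 dB (a common T20-style estimate)
        edc = np.cumsum((h ** 2)[::-1])[::-1]
        edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
        t = np.arange(len(h)) / fs
        fit = (edc_db <= -5.0) & (edc_db >= -25.0)
        slope, _ = np.polyfit(t[fit], edc_db[fit], 1)   # decay rate in dB per second
        return -60.0 / slope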
In some implementations, the amount of reverberation in the input audio signal can be calculated by estimating the diffuseness of the input audio signal. Fig. 6 illustrates an example of a process 600 for estimating a diffuseness of an input audio signal according to some embodiments. It should be noted that the blocks of process 600 may be performed in various orders not shown in fig. 6, and/or in some implementations, two or more blocks of process 600 may be performed substantially in parallel. Additionally or alternatively, it should be noted that in some implementations, one or more blocks of process 600 may be omitted.
It should be noted that in some embodiments, the amount of reverberation may be determined based on a combination of multiple metrics. The plurality of metrics may include, for example, DRR, RT60, diffuseness estimate, and the like. In some implementations, various techniques, such as weighted averaging, may be used to combine the multiple metrics. In some implementations, one or more metrics may be scaled or normalized.
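A minimal sketch of one such combination is shown below; the normalization ranges and the weights are assumptions chosen purely for illustration.

# Sketch: combine several reverberation metrics into a single score.
def reverberation_score(rt60, drr_db, diffuseness, weights=(0.4, 0.3, 0.3)):
    """Normalize each metric to [0, 1] (larger = more reverberant) and average."""
    rt60_n = min(max(rt60 / 1.5, 0.0), 1.0)             # assumed range 0 s .. 1.5 s
    drr_n = min(max((10.0 - drr_db) / 20.0, 0.0), 1.0)   # assumed +10 dB .. -10 dB, inverted
    diff_n = min(max(diffuseness, 0.0), 1.0)
    w1, w2, w3 = weights
    return w1 * rt60_n + w2 * drr_n + w3 * diff_n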
Process 600 may begin at 602 by receiving an input audio signal.
At 604, the process 600 may calculate a two-dimensional acoustic modulation spectrum of the input audio signal. The two-dimensional acoustic modulation spectrum may be indicative of the energy present in the input audio signal as a function of acoustic frequency and modulation frequency.
At 606, process 600 may determine the diffuseness of the input audio signal based on energy in a high modulation frequency portion of the two-dimensional acoustic modulation spectrum (e.g., for modulation frequencies greater than 6 Hz, greater than 10 Hz, etc.). For example, in some implementations, the process 600 may calculate the ratio of energy in the high modulation frequency portion to energy across all modulation frequencies. As another example, in some implementations, the process 600 may calculate a ratio of energy in the high modulation frequency portion to energy in the low modulation frequency portion (e.g., for modulation frequencies below 10 Hz, below 20 Hz, etc.).
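The sketch below illustrates one way such a diffuseness estimate could be computed: a short-time Fourier transform provides the acoustic-frequency axis, a second transform over each band's magnitude envelope provides the modulation-frequency axis, and the ratio of high-modulation-frequency energy to total energy is returned. The 10 Hz boundary and the transform parameters are assumed example values.

# Sketch: diffuseness from a two-dimensional acoustic modulation spectrum.
import numpy as np
from scipy.signal import stft

def modulation_spectrum(x, fs, n_fft=512, hop=128):
    _, _, spec = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    envelopes = np.abs(spec)                       # (acoustic_freq, time)
    mod = np.abs(np.fft.rfft(envelopes, axis=1))   # (acoustic_freq, mod_freq)
    mod_freqs = np.fft.rfftfreq(envelopes.shape[1], d=hop / fs)
    return mod, mod_freqs

def diffuseness(x, fs, boundary_hz=10.0):
    mod, mod_freqs = modulation_spectrum(x, fs)
    high = mod[:, mod_freqs >= boundary_hz].sum()
    total = mod.sum() + 1e-12
    return high / total   # ratio of high-modulation-frequency energy to total energy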
Fig. 7A, 7B, 7C, and 7D illustrate examples of two-dimensional acoustic modulation spectra for various types of input speech signals. As illustrated, each two-dimensional acoustic modulation spectrum indicates the energy present in the input signal as a function of acoustic frequency (as shown by the y-axis of each spectrum shown in fig. 7A, 7B, 7C, and 7D) and modulation frequency (as shown by the x-axis of each spectrum shown in fig. 7A, 7B, 7C, and 7D).
As shown in fig. 7A, "clean" speech with little or no reverberation may have a two-dimensional acoustic modulation spectrum with most of the energy concentrated at relatively low modulation frequencies (e.g., less than 5 Hz, less than 10 Hz, etc.).
As shown in fig. 7B, an input signal comprising clean speech and early and late reverberation reflections may have a two-dimensional acoustic modulation spectrum with energy distributed across all modulation frequencies.
As shown in fig. 7C, an input signal comprising clean speech and early reverberation reflections may have a two-dimensional acoustic modulation spectrum in which energy is typically concentrated at relatively low modulation frequencies (e.g., less than 5 Hz, less than 10 Hz). In other words, the two-dimensional acoustic modulation spectrum for an input signal that includes clean speech and early reverberation reflections (but no late reverberation reflections) may be substantially similar to the two-dimensional acoustic modulation spectrum of clean speech.
As shown in fig. 7D, an input signal that includes late reverberation reflections without clean speech or early reverberation reflections may have a two-dimensional acoustic modulation spectrum with energy distributed across all modulation frequencies.
Thus, as shown in fig. 7A, 7B, 7C, and 7D, the diffuseness estimate may be calculated based on the ratio between the amount of energy at relatively high modulation frequencies and the total energy, or based on the ratio between the amount of energy at relatively high modulation frequencies and the energy at relatively low modulation frequencies.
Fig. 8 is a block diagram illustrating an example of components of an apparatus capable of implementing various aspects of the disclosure. As with the other figures provided herein, the types and numbers of elements shown in fig. 8 are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements. According to some examples, the apparatus 800 may be configured to perform at least some of the methods disclosed herein. In some implementations, the apparatus 800 may be or include one or more components of a television, an audio system, a mobile device (e.g., a cellular telephone), a notebook computer, a tablet computer, a smart speaker, or another type of device.
According to some alternative embodiments, the apparatus 800 may be or may include a server. In some such examples, apparatus 800 may be or may include an encoder. Thus, in some cases, the apparatus 800 may be a device configured for use within an audio environment, such as a home audio environment, while in other cases, the apparatus 800 may be a device configured for use in a "cloud", e.g., a server.
In this example, apparatus 800 includes an interface system 805 and a control system 810. In some implementations, the interface system 805 may be configured to communicate with one or more other devices of an audio environment. In some examples, the audio environment may be a home audio environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, and so forth. In some implementations, the interface system 805 may be configured to exchange control information and associated data with audio devices of an audio environment. In some examples, the control information and associated data may relate to one or more software applications being executed by the apparatus 800.
In some implementations, the interface system 805 may be configured to receive a content stream or to provide a content stream. The content stream may include audio data. The audio data may include, but may not be limited to, audio signals. In some cases, the audio data may include spatial data such as channel data and/or spatial metadata. In some examples, the content stream may include video data and audio data corresponding to the video data.
The interface system 805 may include one or more network interfaces and/or one or more external device interfaces (e.g., one or more Universal Serial Bus (USB) interfaces). According to some implementations, the interface system 805 may include one or more wireless interfaces. The interface system 805 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system, and/or a gesture sensor system. In some examples, interface system 805 may include one or more interfaces between control system 810 and a memory system (such as optional memory system 815 shown in fig. 8). However, in some cases, control system 810 may include a memory system. In some implementations, the interface system 805 may be configured to receive input from one or more microphones in an environment.
For example, control system 810 may include a general purpose single or multi-chip processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.
In some embodiments, control system 810 may be located in more than one device. For example, in some implementations, a portion of the control system 810 may be located in a device within one of the environments depicted herein, and another portion of the control system 810 may be located in a device outside of the environment, such as a server, mobile device (e.g., smart phone or tablet), or the like. In other examples, a portion of control system 810 may be located in a device within an environment, and another portion of control system 810 may be located in one or more other devices within the environment. For example, the functionality of the control system may be distributed across multiple intelligent audio devices of the environment, or may be shared by an orchestrating device (which may be referred to herein as an intelligent home hub) and one or more other devices of the environment. In other examples, a portion of control system 810 may be located in a device (e.g., a server) that implements a cloud-based service, and another portion of control system 810 may be located in another device (e.g., another server, a memory device, etc.) that implements the cloud-based service. In some examples, the interface system 805 may also be located in more than one device.
In some implementations, the control system 810 may be configured to at least partially perform the methods disclosed herein. According to some examples, control system 810 may be configured to implement a method of dereverberation based on media type classification.
Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as the memory devices described herein, including but not limited to Random Access Memory (RAM) devices, read Only Memory (ROM) devices, and the like. For example, one or more non-transitory media may be located in the optional memory system 815 and/or the control system 810 shown in fig. 8. Accordingly, various innovative aspects of the subject matter described in this disclosure can be implemented in one or more non-transitory media having software stored thereon. For example, the software may include instructions for controlling at least one device to classify a media type of the audio content, determine a degree of reverberation, determine whether to perform dereverberation, perform dereverberation on the audio signal, and so forth. The software may be executable by one or more components of a control system, such as control system 810 of fig. 8, for example.
In some examples, apparatus 800 may include an optional microphone system 820 shown in fig. 8. Optional microphone system 820 may include one or more microphones. In some implementations, one or more microphones may be part of or associated with another device (e.g., a speaker of a speaker system, a smart audio device, etc.). In some examples, the apparatus 800 may not include the microphone system 820. However, in some such implementations, the apparatus 800 may still be configured to receive microphone data for one or more microphones in an audio environment via the interface system 805. In some such implementations, a cloud-based implementation of apparatus 800 may be configured to receive microphone data, or noise indicia corresponding at least in part to microphone data, from one or more microphones in an audio environment via the interface system 805.
According to some embodiments, the apparatus 800 may comprise an optional loudspeaker system 825 shown in fig. 8. Optional loudspeaker system 825 may include one or more loudspeakers, which may also be referred to herein as "speakers" or, more generally, as "audio reproduction transducers." In some examples (e.g., cloud-based implementations), the apparatus 800 may not include the loudspeaker system 825. In some embodiments, the apparatus 800 may comprise headphones. Headphones may be connected or coupled to the apparatus 800 via a headphone jack or via a wireless connection (e.g., Bluetooth).
In some embodiments, the apparatus 800 may include an optional sensor system 830 shown in fig. 8. Optional sensor system 830 may include one or more touch sensors, gesture sensors, motion detectors, and the like. According to some embodiments, optional sensor system 830 may include one or more cameras. In some implementations, the camera may be a standalone camera. In some examples, one or more cameras of optional sensor system 830 may be located in an audio device, which may be a single-use audio device or a virtual assistant. In some such examples, one or more cameras of optional sensor system 830 may be located in a television, mobile phone, or smart speaker. In some examples, apparatus 800 may not include sensor system 830. However, in some such embodiments, the apparatus 800 may still be configured to receive sensor data for one or more sensors in an audio environment via the interface system 805.
In some implementations, the apparatus 800 may include an optional display system 835 shown in fig. 8. Optional display system 835 may include one or more displays, such as one or more Light Emitting Diode (LED) displays. In some examples, optional display system 835 may include one or more Organic Light Emitting Diode (OLED) displays. In some examples, optional display system 835 may include one or more displays of a television. In other examples, optional display system 835 may include a notebook computer display, a mobile device display, or another type of display. In some examples where the apparatus 800 includes a display system 835, the sensor system 830 may include a touch sensor system and/or a gesture sensor system proximate to one or more displays of the display system 835. According to some such embodiments, the control system 810 may be configured to control the display system 835 to present one or more Graphical User Interfaces (GUIs).
According to some such examples, apparatus 800 may be or may include a smart audio device. In some such embodiments, the apparatus 800 may be or may include a wake word detector. For example, the apparatus 800 may be or may include a virtual assistant.
Aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer-readable medium (e.g., a disk) storing code for implementing one or more examples of the disclosed methods or steps thereof. For example, some disclosed systems may be or include a programmable general purpose processor, digital signal processor, or microprocessor programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including embodiments of the disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, memory, and a processing subsystem programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.
Some embodiments may be implemented as a configurable (e.g., programmable) Digital Signal Processor (DSP) that is configured (e.g., programmed or otherwise configured) to perform the required processing on the audio signal(s), including the execution of one or more examples of the disclosed methods. Alternatively, embodiments of the disclosed systems (or elements thereof) may be implemented as a general-purpose processor (e.g., a Personal Computer (PC) or other computer system or microprocessor, which may include an input device and memory) programmed and/or otherwise configured with software or firmware to perform any of a variety of operations, including one or more examples of the disclosed methods. Alternatively, elements of some embodiments of the inventive system are implemented as a general-purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system also includes other elements (e.g., one or more microphones and/or one or more loudspeakers). A general purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or keyboard), memory, and a display device.
Another aspect of the disclosure is a computer-readable medium (e.g., a disk or other tangible storage medium) storing code (e.g., an encoder executable to perform one or more examples of the disclosed methods or steps thereof) for performing one or more examples of the disclosed methods or steps thereof.
While specific embodiments of, and applications for, the present disclosure have been described herein, it will be apparent to those of ordinary skill in the art that many more modifications than mentioned herein are possible without departing from the scope of the disclosure described and claimed herein. It is to be understood that while certain forms of the disclosure have been illustrated and described, the disclosure is not to be limited to the specific embodiments described and illustrated or to the specific methods described.
Aspects of the invention may be understood from the enumerated example embodiments (EEEs) below:
EEE 1. A method for suppressing reverberation, comprising:
receiving an input audio signal;
classifying a media type of the input audio signal as one of a group comprising at least: 1) Speech; 2) Music; or 3) speech under music;
determining whether to perform dereverberation on the input audio signal based at least on a determination that the media type of the input audio signal has been classified as speech; and
in response to determining to perform dereverberation on the input audio signal, an output audio signal is generated by performing dereverberation on the input audio signal.
EEE 2. The method of EEE 1, further comprising determining a degree of reverberation in the input audio signal, wherein determining whether to perform dereverberation of the input audio signal is based on the degree of reverberation.
EEE 3. The method of EEE 2, wherein the degree of reverberation is based on a reverberation time (RT60), a direct-to-reverberant ratio (DRR), a diffuseness estimate, or any combination thereof.
EEE 4. The method of EEE 3, wherein determining the degree of reverberation comprises:
calculating a two-dimensional acoustic modulation spectrum of the input audio signal, wherein the degree of reverberation is based on an amount of energy in a high modulation frequency portion of the two-dimensional acoustic modulation spectrum.
EEE 5. The method of EEE 4, wherein determining the degree of reverberation comprises calculating at least one of: 1) A ratio of energy in a high modulation frequency portion of the two-dimensional acoustic modulation spectrum to energy of all modulation frequencies in the two-dimensional acoustic modulation spectrum; or 2) a ratio of energy in the high-modulation frequency portion of the two-dimensional acoustic modulation spectrum to energy in the low-modulation frequency portion of the two-dimensional acoustic modulation spectrum.
EEE 6. The method of EEE 4 or 5, wherein determining whether to perform dereverberation of the input audio signal is based on a determination that the degree of reverberation exceeds a threshold.
EEE 7. The method of any of EEEs 1-6, wherein classifying the media type of the input audio signal comprises separating the input audio signal into two or more spatial components.
EEE 8. The method of EEE 7, wherein said two or more spatial components comprise a middle channel and a side channel.
EEE 9. The method of EEE 8, further comprising:
calculating the power of the side channel; and
the side channels are classified in response to determining that the power of the side channels exceeds a threshold.
EEE 10. The method of EEE 7, wherein the two or more spatial components comprise a diffuse component and a direct component.
EEE 11. The method of any of EEEs 7-10, wherein classifying the media type of the input audio signal comprises classifying each of the two or more spatial components as one of: 1) Speech; 2) Music; or 3) speech under music, wherein the media type of the input audio signal is classified by combining classifications of each of the two or more spatial components.
EEE 12. The method of any of EEEs 7-11, wherein the input audio signal is separated into the two or more spatial components in response to determining that the input audio signal comprises stereo audio.
EEE 13. The method of any of EEEs 1-6, wherein classifying the media type of the input audio signal comprises separating the input audio signal into a human voice component and a non-human voice component.
EEE 14. The method of EEE 13, wherein the input audio signal is separated into the human voice component and the non-human voice component in response to determining that the input audio signal comprises a single audio channel.
EEE 15. The method of EEE 13 or 14, wherein classifying the media type of the input audio signal comprises:
classifying the human voice component as one of: 1) Speech; or 2) non-speech;
classifying the non-human voice component as one of: 1) Music; or 2) non-music,
wherein the media type of the input audio signal is classified by combining the classification of the human voice component and the classification of the non-human voice component.
EEE 16. The method of any of EEEs 1-15, wherein determining whether to perform dereverberation of the input audio signal is based on a classification of a second input audio signal preceding the input audio signal.
EEE 17. The method of any one of EEEs 1 through 16, further comprising:
receiving a third input audio signal;
determining that dereverberation is not performed on the third input audio signal; and
in response to determining not to perform dereverberation on the third input audio signal, performing a dereverberation algorithm on the third input audio signal is disabled.
EEE 18. The method of EEE 17, wherein determining not to perform dereverberation of the third input audio signal is based at least in part on a classification of a media type of the third input audio signal.
EEE 19. The method of EEE 18, wherein the classification of the media type of the third input audio signal is one of: 1) Music; or 2) speech under music.
EEE 20. The method of any of EEEs 17-19, wherein determining not to perform dereverberation of the third input audio signal is based at least in part on a determination that a degree of reverberation in the third input audio signal is below a threshold.
EEE 21. An apparatus configured for implementing the method of any one of EEEs 1-20.
EEE 22. A system configured for implementing the method of any one of EEEs 1-20.
EEE 23. One or more non-transitory media having software stored thereon that includes instructions for controlling one or more devices to perform the method of any of EEEs 1-20.
EEE 24. A method for classifying an input audio signal into one of at least two media types, comprising:
receiving an input audio signal;
separating the input audio signal into two or more spatial components; and
classifying each of the two or more spatial components as one of the at least two media types,
wherein the media type of the input audio signal is classified by combining the classifications of each of the two or more spatial components.
EEE 25. The method of EEE 24, wherein the two or more spatial components comprise a middle channel and a side channel, the method further comprising:
calculating the power of the side channel; and
The side channels are classified in response to determining that the power of the side channels exceeds a threshold.
EEE 26. The method of EEE 24, wherein the two or more spatial components include a diffuse component and a direct component.
EEE 27. The method of any of EEEs 24-26, wherein the input audio signal is separated into the two or more spatial components in response to determining that the input audio signal comprises stereo audio.
EEE 28. The method of any of EEEs 24-26, wherein classifying the media type of the input audio signal comprises separating the input audio signal into a human voice component and a non-human voice component.
EEE 29. The method of EEE 28, wherein the input audio signal is separated into the human voice component and the non-human voice component in response to determining that the input audio signal comprises a single audio channel.
EEE 30. The method of EEE 28 or 29, wherein classifying the media type of the input audio signal comprises:
classifying the human voice component as one of: 1) Speech; or 2) non-speech;
classifying the non-human voice component as one of: 1) Music; or 2) non-music,
Wherein the media type of the input audio signal is classified by combining the classification of the human voice component and the classification of the non-human voice component.
EEE 31. A system configured for implementing the method of any one of EEEs 24 to 30.
EEE 32. One or more non-transitory media having stored thereon software comprising instructions for controlling one or more devices to perform the method of any of EEEs 24-30.

Claims (22)

1. A method for suppressing reverberation, comprising:
receiving an input audio signal;
classifying a media type of the input audio signal as one of a group comprising at least: 1) Speech; 2) Music; or 3) speech under music;
determining whether to perform dereverberation on the input audio signal based at least on determining that the media type of the input audio signal has been classified as speech; and
in response to determining to perform dereverberation on the input audio signal, an output audio signal is generated by performing dereverberation on the input audio signal.
2. The method of claim 1, further comprising determining a degree of reverberation in the input audio signal, wherein determining whether to perform dereverberation on the input audio signal is based on the degree of reverberation, and optionally wherein the degree of reverberation is based on a reverberation time (RT60), a direct-to-reverberant ratio (DRR), a diffuseness estimate, or any combination thereof.
3. The method of claim 2, wherein determining the degree of reverberation comprises:
calculating a two-dimensional acoustic modulation spectrum of the input audio signal, wherein the degree of reverberation is based on an amount of energy in a high modulation frequency part of the two-dimensional acoustic modulation spectrum, and optionally,
wherein determining the degree of reverberation comprises calculating at least one of: 1) A ratio of energy in a high modulation frequency portion of the two-dimensional acoustic modulation spectrum to energy of all modulation frequencies in the two-dimensional acoustic modulation spectrum; or 2) a ratio of energy in the high-modulation frequency portion of the two-dimensional acoustic modulation spectrum to energy in the low-modulation frequency portion of the two-dimensional acoustic modulation spectrum.
4. A method according to claim 2 or 3, wherein determining whether to perform dereverberation of the input audio signal is based on determining that the degree of reverberation exceeds a threshold.
5. The method of any one of claims 1 to 4, wherein classifying the media type of the input audio signal comprises separating the input audio signal into two or more spatial components, and optionally wherein the input audio signal is separated into the two or more spatial components in response to determining that the input audio signal comprises stereo audio.
6. The method of claim 5, wherein the two or more spatial components comprise a middle channel and a side channel, and optionally wherein the method further comprises:
calculating the power of the side channel; and
the side channels are classified in response to determining that the power of the side channels exceeds a threshold.
7. The method of claim 5, wherein the two or more spatial components comprise a diffuse component and a direct component.
8. The method of any of claims 5-7, wherein classifying the media type of the input audio signal comprises classifying each of the two or more spatial components as one of: 1) Speech; 2) Music; or 3) speech under music, wherein the media type of the input audio signal is classified by combining classifications of each of the two or more spatial components.
9. The method of any of claims 1-4, wherein classifying the media type of the input audio signal comprises separating the input audio signal into a human sound component and a non-human sound component, and optionally wherein, in response to determining that the input audio signal comprises a single audio channel, separating the input audio signal into the human sound component and the non-human sound component.
10. The method of claim 9, wherein classifying the media type of the input audio signal comprises:
classifying the human voice component as one of: 1) Speech; or 2) non-speech;
classifying the non-human voice component as one of: 1) Music; or 2) non-music,
wherein the media type of the input audio signal is classified by combining the classification of the human voice component and the classification of the non-human voice component.
11. The method of any of claims 1 to 10, wherein determining whether to perform dereverberation on the input audio signal is based on a classification of a second input audio signal preceding the input audio signal.
12. The method of any one of claims 1 to 11, further comprising:
receiving a third input audio signal;
determining that dereverberation is not performed on the third input audio signal; and
in response to determining not to perform dereverberation of the third input audio signal, disabling the performing of a dereverberation algorithm on the third input audio signal, and optionally wherein determining not to perform dereverberation of the third input audio signal is based at least in part on: (a) A classification of a media type of the third input audio signal, or (b) determining that a degree of reverberation in the third input audio signal is below a threshold, wherein the classification of the media type of the third input audio signal is one of: 1) Music; or 2) speech under music.
13. A method for classifying an input audio signal into one of at least two media types, comprising:
receiving an input audio signal;
separating the input audio signal into two or more spatial components; and
classifying each of the two or more spatial components as one of the at least two media types,
wherein the media type of the input audio signal is classified by combining the classifications of each of the two or more spatial components.
14. The method of claim 13, wherein the two or more spatial components comprise a middle channel and a side channel, the method further comprising:
calculating the power of the side channel; and
the side channels are classified in response to determining that the power of the side channels exceeds a threshold.
15. The method of claim 13, wherein the two or more spatial components comprise a diffuse component and a direct component.
16. The method of any of claims 13 to 15, wherein the input audio signal is separated into the two or more spatial components in response to determining that the input audio signal comprises stereo audio.
17. The method of any of claims 13-15, wherein classifying the media type of the input audio signal comprises separating the input audio signal into a human-sound component and a non-human-sound component.
18. The method of claim 17, wherein the input audio signal is separated into the human voice component and the non-human voice component in response to determining that the input audio signal comprises a single audio channel.
19. The method of claim 17 or 18, wherein classifying the media type of the input audio signal comprises:
classifying the human voice component as one of: 1) Speech; or 2) non-speech;
classifying the non-human voice component as one of: 1) Music; or 2) non-music,
wherein the media type of the input audio signal is classified by combining the classification of the human voice component and the classification of the non-human voice component.
20. An apparatus configured to implement the method of any one of claims 1 to 19.
21. A system configured to implement the method of any one of claims 1 to 19.
22. One or more non-transitory media having software stored thereon, the software comprising instructions for controlling one or more devices to perform the method of any of claims 1-19.
CN202280019905.6A 2021-03-11 2022-03-10 Dereverberation based on media type Pending CN116964666A (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CNPCT/CN2021/080314 2021-03-11
CN2021080314 2021-03-11
US202163180710P 2021-04-28 2021-04-28
US63/180,710 2021-04-28
EP21174289.5 2021-05-18
PCT/US2022/019816 WO2022192580A1 (en) 2021-03-11 2022-03-10 Dereverberation based on media type

Publications (1)

Publication Number Publication Date
CN116964666A true CN116964666A (en) 2023-10-27

Family

ID=75977634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280019905.6A Pending CN116964666A (en) 2021-03-11 2022-03-10 Dereverberation based on media type

Country Status (1)

Country Link
CN (1) CN116964666A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination