WO2014063754A1 - Audio enhancement system - Google Patents

Audio enhancement system

Info

Publication number
WO2014063754A1
Authority
WO
WIPO (PCT)
Prior art keywords
terminal
audio
terminals
audio source
information
Prior art date
Application number
PCT/EP2012/071305
Other languages
French (fr)
Inventor
David Virette
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2012/071305
Priority to EP12781070.3 (EP2888736B1)
Publication of WO2014063754A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Definitions

  • The present invention relates to an audio enhancement system comprising at least two wirelessly coupled terminals, in particular smartphones, tablet PCs or audio conferencing systems, and to a method for enhanced audio processing.
  • A wireless acoustic sensor network is a technology using several spatially distributed microphones that communicate through a wireless connection over short distances. This technology can be used to enhance the audio quality based on the different monophonic representations of the same sound field. For instance, noise reduction methods can be used to cancel the background noise. This is usually efficient for stationary background noise, but the quality is quite limited for non-stationary noise. Indeed, the noise representation is obtained from distant microphones which are not perfectly synchronized. Hence, the noise estimation performed on a distant microphone is efficient for stationary noise, but not for highly changing background noise.
  • The noise reduction algorithm is usually based on spectral subtraction. Such methods can for instance be derived from Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean square error log-spectral amplitude estimator," IEEE Trans. on Acoust., Speech, Signal Processing, vol. ASSP-33, pp. 443-445, Apr. 1985.
  • Microphone array methods can also be used, such as described by J. Benesty, J. Chen, and Y. Huang, "Microphone Array Signal Processing", Springer 2008, but a perfect synchronization of the acoustic sensors is then required in order to take into account the exact delays and energy differences between various monophonic recordings.
  • Microphone array signal processing is usually based on a known microphone arrangement, which is not the case for a wireless acoustic sensor network where the sensors are arbitrarily placed in space and can move. On the other hand, blind source separation methods based on Non-negative Matrix Factorization (NMF) have been used for speech enhancement based on a monophonic signal, also called single-channel source separation, with relatively good results.
  • The first factor W describes the component spectra of the source model 109; the second factor H describes the activations 107 of the signal spectrogram 103 of the audio signal 101.
  • The first factor W and the second factor H are matched with the short-time magnitude spectrogram V 103 of the audio signal 101 by an optimization procedure.
  • The source model 109 is pre-defined when applying supervised NMF, whereas a joint estimation is applied for the source model 109 when using unsupervised NMF.
  • The source signal or signals 113 can be derived from the source spectrogram 111.
  • Non-negative Matrix Factorization and its extensions have been successfully used in areas related to speech recognition, including speech de-noising and speaker separation.
  • NMF has usually been used for offline audio processing for noise reduction or source separation based on pre-defined source models. It has recently been extended to on-line processing, where the processing is done with a sliding window in order to process the audio signal frame by frame and achieve real-time processing, as presented in Cyril Joder, Felix Weninger, Florian Eyben, David Virette, Björn Schuller: "Real-time Speech Separation by Semi-Supervised Nonnegative Matrix Factorization", Proc. International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA) 2012, Tel Aviv, Israel, Springer LNCS, Vol. 7191, pp. 323-329, March 12-15, 2012.
  • With the same audio mix, semi-supervised NMF provides lower performance in terms of noise reduction or source separation, as only one source is known a priori, i.e. pre-defined.
  • the other source models are estimated and refined over time. The quality improves with the refinement of the noise model.
  • the invention is based on the finding that the audio quality is improved when NMF audio processing with a distributed source model adaptation is applied.
  • the audio quality improvement is achieved when using a wireless acoustic sensor network and a distributed source model adaptation.
  • a wireless acoustic sensor network comprises microphones with short distance wireless connection.
  • the source models are trained from each acoustic sensor of the network and shared between all the terminals of the local network.
  • the speech and/or audio enhancement is performed based on a distributed source model learning unit.
  • the wireless acoustic sensor network can advantageously be based on mobile devices and/or audio conference terminals.
  • Source models, e.g. background noise and/or unknown speakers, can be merged in the active terminal, i.e. the terminal being used for the monophonic sound recording. Speech enhancement and/or audio enhancement is advantageously based on Non-negative Matrix Factorization (NMF).
  • the audio quality is significantly improved and the audio rendering is enhanced without requiring an exact source model of the audio signal, as will be presented in the following.
  • audio rendering: a reproduction technique capable of creating spatial sound fields in an extended area by means of loudspeakers or loudspeaker arrays
  • audio enhancement: a technique for enhancing the quality of an audio signal
  • NMF: non-negative matrix factorization
  • WASN: wireless acoustic sensor network
  • the invention relates to an audio enhancement system, comprising at least two terminals, each one comprising at least one acoustic sensor and processing means, wherein the acoustic sensors of the at least two terminals are wirelessly coupled with respect to each other forming an acoustic sensor network, each terminal being configured to provide information of at least one audio source model of the terminal to at least one of the other terminals, wherein information of an audio source model of a terminal describes an audio characteristic of at least one audio source impacting on the at least one acoustic sensor of the terminal, and wherein the processing means of the at least two terminals are configured to perform audio enhancement processing based on the information of the audio source models of the at least two terminals.
  • the audio quality of audio rendering is improved without exact knowledge of a source model of the audio signal.
  • each terminal is configured to adapt the at least one audio source model of the terminal based on an audio characteristic of a main speaker impacting on the at least one acoustic sensor of the terminal.
  • each terminal is configured to adapt the at least one audio source model of the terminal based on audio characteristics of different users and/or different noise environments impacting on the at least one acoustic sensor of the terminal.
  • each terminal is configured to use an output signal of the at least one acoustic sensor of the terminal as a training signal for adjusting the at least one audio source model of the terminal.
  • speech and/or audio enhancement can be performed as a learning process.
  • the learning can be advantageously based on a learning unit or on mobile devices and/or audio conference terminals.
  • Source models e.g. background noise and/or unknown speakers can be merged in the active terminal, thereby improving the audio rendering.
  • each terminal is configured to refine the information of the at least one audio source model of the terminal based on the provided information of the audio source models of the other terminals.
  • Such a distributed source model adaptation facilitates a distributive learning of the audio enhancement system. Audio quality is enhanced from step to step by refining the information base.
  • each terminal is configured to provide the refined information of the at least one audio source model of the terminal to at least one of the other terminals.
  • the audio enhancement system forms a distributive learning system where the different terminals are the multiple nodes of the system. A drop-out of one of the terminals does not result in a drop-out of the whole system.
  • each terminal is configured to merge the information of the at least one audio source model of the terminal with the provided information of the audio source models of the other terminals to provide the refined information.
  • information processed by the audio enhancement system can be improved thereby optimizing audio quality.
  • each terminal is configured to perform the merging based on a similarity measure describing a similarity of the audio source models.
  • Efficiency of information processing is improved when similar information is processed similarly and different information is processed in a different way. This reduces complexity of the audio enhancement system.
  • the similarity measure is based on a distance between component spectra of the audio source models.
  • a distance between component spectra of the audio source models is easy to process as the component spectra of the audio source models are parameters of the non-negative matrix factorization model.
  • the at least two terminals comprise at least one of a mobile device, in particular a smartphone or a tablet PC, a dedicated wireless microphone and an audio conference terminal. That is, the audio enhancement system can be easily formed by any conventional mobile devices such as smartphones, tablet PCs, wireless microphones or audio conference terminals.
  • the processing means of the at least two terminals are configured to perform the audio enhancement processing based on non-negative matrix factorization.
  • Non-negative matrix factorization is a mathematical algorithm that is well suited for real-time processing and can be easily implemented in conventional mobile devices.
  • the processing means of a terminal are configured to store the information of the at least one audio source model provided by another terminal in a volatile memory of the terminal.
  • the information of other terminals is only needed when a terminal is active. Therefore, a volatile memory is sufficient for storing the information of audio source models provided by other terminals.
  • the invention relates to a method for enhancing audio processing of a system comprising at least two terminals, each one comprising at least one acoustic sensor, the method comprising: coupling the acoustic sensors of the at least two terminals wirelessly with respect to each other to form an acoustic sensor network; providing by each terminal information of at least one audio source model of the terminal to each other terminal, wherein information of an audio source model of a terminal describes an audio characteristic of at least one audio source impacting on the at least one acoustic sensor of the terminal; and enhancing audio processing based on the information of the audio source models of the at least two terminals.
  • the method comprises selecting one of the at least two terminals as active terminal, wherein the selection is based on a distance of the at least two terminals to a speaker or based on an energy of a signal received by the at least one acoustic sensor of the at least two terminals.
  • When the selection is based on a distance of a terminal to a speaker or on an energy of a received signal, distant acoustic sources contribute less to the model than nearby acoustic sources. This improves the efficiency of the audio enhancement method.
  • the method comprises: identifying an audio source model from the at least one audio source models of the active terminal as being associated to an owner of the active terminal and classified as desired audio source model; and classifying the other audio source models from the at least one audio source models of the active terminal as non-desired audio source models.
  • noise can be effectively separated which improves the audio quality of the method.
  • DSP: Digital Signal Processor
  • ASIC: application-specific integrated circuit
  • the invention can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations thereof, e.g. in available hardware of conventional mobile devices or in new hardware dedicated for processing the audio enhancement system.
  • Fig. 1 shows a schematic diagram of a conventional non-negative Matrix Factorization (NMF) technique used in audio processing;
  • Fig. 2 shows a schematic diagram of an audio enhancement system according to an implementation form;
  • Fig. 3 shows a schematic diagram of a Tablet PC comprising four acoustic sensors according to an implementation form; and
  • Fig. 4 shows a schematic diagram of a method for enhancing audio processing according to an implementation form.
  • Fig. 2 shows a schematic diagram of an audio enhancement system 200 according to an implementation form.
  • The audio enhancement system 200 comprises five terminals 201a, 201b, 201c, 201d, 201e and a conferencing unit, i.e. a main terminal 203, each one comprising at least one acoustic sensor and processing means.
  • The acoustic sensors of the terminals 201a, 201b, 201c, 201d, 201e and 203 are wirelessly coupled with respect to each other, forming an acoustic sensor network.
  • Each terminal provides information of at least one audio source model of the terminal to each other terminal.
  • Information of an audio source model of a terminal 201a describes an audio characteristic of at least one audio source, e.g. a distant speaker 207 or environment noise 209, impacting on the at least one acoustic sensor of the terminal.
  • The processing means of the terminals 201a, 201b, 201c, 201d and 203 perform audio enhancement processing based on the information of the audio source models of the other terminals.
  • the processing means may be implemented on a digital signal processing (DSP) unit of the respective terminal, e.g. as an embedded DSP unit in software or the processing means may be implemented as a hardware circuit of the terminal.
  • Fig. 2 describes the typical conference scenario where several people gather in the same room for an audio conference.
  • The main terminal 203 is e.g. an audio conferencing terminal; the several mobile devices 201a, 201b, 201c, 201d, 201e, i.e. terminals or mobile terminals, are wirelessly connected to the main terminal 203, thereby forming a wireless acoustic sensor network.
  • Each of the mobile terminals 201a, 201b, 201c, 201d and 201e comprises at least one acoustic sensor, e.g. a microphone.
  • each mobile terminal comprises one microphone.
  • In an implementation form, the mobile terminals comprise several microphones 307, e.g. one, two, three or four microphones 307 as described below with respect to Fig. 3, or even more than four. In an implementation form, all microphones are part of the acoustic sensor network. In an implementation form, only one microphone of each mobile device is part of the acoustic sensor network. In an implementation form, more than one microphone of each mobile device is part of the acoustic sensor network. In an implementation form, the acoustic sensor network is formed by the mobile devices 201a, 201b, 201c, 201d and 201e in a decentralized manner, without the need of a central or dedicated main terminal 203.
  • the audio enhancement system 200 establishes a wireless acoustic sensor network based on the communication terminals, e.g. mobile terminals, audio conference terminals, dedicated wireless microphones, etc. All the terminals of the network exchange the known source models. For instance, the speaker models which are known by each terminal of the local network, i.e., the acoustic sensor network are distributed to all other terminals of this network.
  • The source models are determined as described above with respect to Fig. 1. Every mobile device comprises a sufficiently accurate model of the main user.
  • A training sequence is performed by each mobile owner or mobile user. In an implementation form, the mobile terminal comprises a particularly well-adapted source model for the main user which has been trained in quiet conditions and/or on a large database, i.e. continuous model refinement is applied.
  • each terminal comprises several source models, e.g. source models from different users or different noise environments, e.g. office, home, etc.
  • all the models or a subset of the known models are distributed to the other terminals.
  • a priority is set to the most probable models, for instance, it is expected that the owner of a mobile phone takes part in the audio conferencing.
  • The active terminal, i.e. the terminal which is identified as being closer to the talker, uses the received source models to perform NMF-based audio processing, i.e. audio enhancement, in order to reduce or cancel all audio interferences arising from background noise 209, e.g. printers, projectors, etc., or from interfering talkers 207.
  • the active terminal is the dedicated audio conferencing terminal 203.
  • The active terminal is one of the mobile devices 201a, 201b, 201c, 201d, 201e of the local network, i.e. the acoustic sensor network.
  • the active terminal is manually selected.
  • the user of a terminal asks to get the floor and the audio conference moderator gives the floor to the user, which means that the terminal which asks to become the active terminal is then used as main sound recording device.
  • the active terminal is automatically selected by a wireless acoustic sensor network control unit.
  • this wireless acoustic sensor network control unit is arranged in the main audio conference terminal 203.
  • the microphone or microphones with the highest energy are selected as active terminal or terminals. All other terminals are identified as non-active terminals.
  • the non-active terminals are used to adapt the unknown source models, i.e. the speakers without dedicated source model and/or background noise 209 which are not modelled.
  • the non-active terminals continuously update the source models which are regularly synchronized among the terminals.
  • the unknown source models are initialized randomly and updated with each new frame.
  • the active terminal is selected. This active terminal then determines the desired source models, e.g., the source model which is identified as the model of the owner of the terminal.
  • the other source models are used in the NMF processing as non-desired sources, i.e. for noise reduction or source separation.
  • the sources are classified as desired or non-desired sources in the audio mix. In an implementation form, this classification is based on simple information, e.g. describing the owner or user of the terminal. In an implementation form, this classification is based on information describing a proximity to the active terminal. The sources which are close to the active terminal are identified as the desired sources.
  • status information is associated to the source model.
  • this status information comprises known or unknown status of the source model for the terminal.
  • further information about initial/refined status is associated.
  • a single identification code is given to each model, which allows replacing a source model by its refined version without having to increase the number of models.
  • the source model adaptation comprises a merging of several source models, that is, at least two, to improve the source modelling.
  • Each non-active terminal continuously learns new source models, e.g. unknown speakers and/or background noise, in order to refine the complete definition of the source models which are part of the audio mix.
  • new source models are shared among all terminals of the local network and one of the terminals, e.g., the active terminal merges them into a new refined source model.
  • the merging operation is based on a similarity measure which is applied to the received source models for unknown sources.
  • this similarity measure is defined according to the following procedure: The distances between the component spectra W are calculated; clusters of source models are defined, e.g. by using a k-means algorithm; and the source models are combined in a new refined source model based on similar component spectra or cluster.
  • the new source models which are received from each terminal taking part in the network are stored in volatile memory to reduce memory constraints and/or memory space.
  • Implementation forms of the invention provide improved audio processing based on non-negative matrix factorization with a distributed source model learning unit, which allows obtaining a better definition, i.e. source model, of all the audio sources which are part of the audio mix, and a faster adaptation to the introduction of new sound sources.
  • Fig. 3 shows a schematic diagram of a Tablet PC 300 comprising four acoustic sensors according to an implementation form.
  • The Tablet PC 300 comprises four microphones 307, one arranged in the middle of each edge of the tablet, in order to better discriminate the direction of the sounds.
  • the tablet PC 300 may correspond to the mobile device described above with respect to Figure 2.
  • the tablet PC 300 is adapted to perform the audio enhancement processing analogously to the procedure described with respect to Fig. 2.
  • Fig. 4 shows a schematic diagram of a method 400 for enhancing audio processing according to an implementation form.
  • the method 400 is for enhancing the audio processing of a system comprising at least two terminals, each one comprising at least one acoustic sensor.
  • the method 400 comprises coupling 401 the acoustic sensors of the at least two terminals wirelessly with respect to each other to form an acoustic sensor network.
  • the method 400 comprises providing 403 by each terminal information of at least one audio source model of the terminal to each other terminal, wherein information of an audio source model of a terminal describes an audio characteristic of at least one audio source impacting on the at least one acoustic sensor of the terminal.
  • the method 400 comprises enhancing 405 the audio processing based on the information of the audio source models of the at least two terminals.
  • the method 400 includes the following elements and steps which are performed in order to determine enhanced source models:
  • a wireless acoustic sensor network is established based on the communication terminals (mobile terminals, audio conference terminals, dedicated wireless microphones, etc). All the terminals of the network exchange the known source models (for instance the speaker models which are known by each terminal of the local network are distributed to all other terminals of this network).
  • Each mobile device uses a sufficiently accurate model of the main user.
  • each mobile device performs a training procedure.
  • the mobile terminal uses a particularly well-adapted source model for the main user which has been trained in quiet conditions and/or on a large database (continuous model refinement).
  • each terminal includes several source models (from different users or different noise environments: office, home, etc). All the models or a subset of the known models are distributed to other terminals.
  • A priority is set to the most probable models (for instance, it is expected that the owner of a mobile phone takes part in the audio conferencing).
  • the active terminal uses the received source models to perform NMF-based audio processing (audio enhancement) in order to reduce or cancel all audio interferences (background noise or interfering talkers).
  • the active terminal is the dedicated audio conferencing terminal or one of the mobile devices of the local network.
  • the active terminal is manually selected.
  • the user of a terminal asks to get the floor and the audio conference moderator gives the floor to the user, which means that the terminal which asks to become the active terminal is then used as main sound recording device.
  • In an alternative implementation form, the active terminal is automatically selected by the wireless acoustic sensor network control unit, e.g. the main audio conference terminal.
  • The microphone(s) with the highest energy is/are selected as active terminal(s).
  • All other terminals are identified as non-active terminals.
  • the non-active terminals are used to adapt the unknown source models (speakers without dedicated source model and/or background noise which are not modelled).
  • the non-active terminals continuously update the source models which are regularly synchronized among the terminals. The unknown source models are initialized randomly and updated with each new frame.
  • The method includes a step of selecting the active terminal; this active terminal then determines the desired source models (e.g. the source model which is identified as the model of the owner of the terminal).
  • the other source models are used in the NMF processing as non-desired source (noise reduction or source separation).
  • This embodiment includes a further step to classify the sources as desired/non-desired sources in the audio mix. This classification is based on simple information (owner or user of the terminal) or on proximity to the active terminal. The sources which are close to the active terminal are identified as desired sources.
  • status information is associated to the source model.
  • This status information is for instance known/unknown status of the source model for the terminal, and for unknown source models, further information about initial/refined status is also associated.
  • a single identification code is given to each model, which allows replacing a source model by its refined version without having to increase the number of models.
  • the source model adaptation includes a step of merging several source models (at least two) to improve the source modelling.
  • Each non-active terminal continuously learns new source models (unknown speakers and/or background noise) in order to refine the complete definition of the source models being part of the audio mix.
  • New source models are shared among all terminals of the local network and one of the terminals (e.g. the active terminal) merges them into a new refined source model.
  • the merging operation is based on a similarity measure which is applied to the received source models for unknown sources.
  • This similarity measure is defined using the following procedure: Calculating the distance between the component spectra (W); defining a cluster of source models (using for instance a k-means algorithm); and combining the source models based on similar component spectra (cluster) in a new refined source model.
  • the new source models which are received from each terminal taking part in the network are stored in volatile memory to reduce memory constraints and/or space.
  • The method is an NMF-based audio processing for audio enhancement with distributed source model adaptation. One of the possible scenarios for implementing the method is described above with respect to Fig. 2.
  • the present disclosure also supports a computer program product including computer executable code or computer executable instructions that, when executed, causes at least one computer to execute the performing and computing steps described herein.
  • the present disclosure also supports a system configured to execute the performing and computing steps described herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to an audio enhancement system (200), comprising at least two terminals (201a, 201b), each one comprising at least one acoustic sensor and processing means, wherein the acoustic sensors of the at least two terminals (201a, 201b) are wirelessly coupled with respect to each other forming an acoustic sensor network, each terminal (201a) being configured to provide information of at least one audio source model of the terminal (201a) to at least one of the other terminals (201b), wherein information of an audio source model of a terminal (201a) describes an audio characteristic of at least one audio source (205) impacting on the at least one acoustic sensor of the terminal (201a), and wherein the processing means of the at least two terminals (201a, 201b) are configured to perform audio enhancement processing based on the information of the audio source models of the at least two terminals (201a, 201b).

Description

DESCRIPTION
Audio enhancement system
BACKGROUND OF THE INVENTION
The present invention relates to an audio enhancement system comprising at least two wirelessly coupled terminals, in particular smartphones, tablet PCs or audio conferencing systems, and to a method for enhanced audio processing.
A wireless acoustic sensor network is a technology using several spatially distributed microphones that communicate through a wireless connection over short distances. This technology can be used to enhance the audio quality based on the different monophonic representations of the same sound field. For instance, noise reduction methods can be used to cancel the background noise. This is usually efficient for stationary background noise, but the quality is quite limited for non-stationary noise. Indeed, the noise representation is obtained from distant microphones which are not perfectly synchronized. Hence, the noise estimation performed on a distant microphone is efficient for stationary noise, but not for highly changing background noise. The noise reduction algorithm is usually based on spectral subtraction. Such methods can for instance be derived from Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean square error log-spectral amplitude estimator," IEEE Trans. on Acoust., Speech, Signal Processing, vol. ASSP-33, pp. 443-445, Apr. 1985, or Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean square error short-time spectral amplitude estimator," IEEE Trans. on Acoust., Speech, Signal Processing, vol. ASSP-32, pp. 1109-1121, Dec. 1984.
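As an illustration of this class of methods, the sketch below implements plain magnitude spectral subtraction, not the MMSE estimators of the cited papers: the noise spectrum is estimated from a few leading frames that are assumed to be speech-free. All names and parameter values are illustrative.

```python
import numpy as np

def spectral_subtraction(noisy, n_fft=512, hop=256, noise_frames=10, floor=0.05):
    """Minimal spectral-subtraction sketch: estimate a stationary noise
    magnitude spectrum from the first frames (assumed speech-free) and
    subtract it from every frame, resynthesizing with the noisy phase."""
    window = np.hanning(n_fft)
    frames = [noisy[i:i + n_fft] * window
              for i in range(0, len(noisy) - n_fft, hop)]
    spectra = np.array([np.fft.rfft(f) for f in frames])
    mags, phases = np.abs(spectra), np.angle(spectra)

    noise_mag = mags[:noise_frames].mean(axis=0)            # noise estimate
    clean_mag = np.maximum(mags - noise_mag, floor * mags)  # spectral floor

    out = np.zeros(len(noisy))                              # overlap-add
    for k, frame in enumerate(np.fft.irfft(clean_mag * np.exp(1j * phases))):
        out[k * hop:k * hop + n_fft] += frame
    return out
```

As the passage notes, such a stationary noise estimate degrades for highly changing background noise, which is the limitation the invention addresses.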
Microphone array methods can also be used, such as described by J. Benesty, J. Chen, and Y. Huang, "Microphone Array Signal Processing", Springer 2008, but a perfect synchronization of the acoustic sensors is then required in order to take into account the exact delays and energy differences between various monophonic recordings.
Microphone array signal processing is usually based on a known microphone arrangement, which is not the case for a wireless acoustic sensor network where the sensors are arbitrarily placed in space and can move. On the other hand, blind source separation methods have been used for speech enhancement based on a monophonic signal, also called single-channel source separation. Non-negative Matrix Factorization (NMF) methods have been used in that context with relatively good results. The basic principle of NMF-based audio processing 100, as schematically illustrated in Fig. 1, is to find a locally optimal factorization of a short-time magnitude spectrogram V 103 of an audio signal 101 into two factors W and H, of which the first one, W, represents the spectra of the events occurring in the signal 101 and the second one, H, their activation over time. The first factor W describes the component spectra of the source model 109. The second factor H describes the activations 107 of the signal spectrogram 103 of the audio signal 101. The first factor W and the second factor H are matched with the short-time magnitude spectrogram V 103 of the audio signal 101 by an optimization procedure. The source model 109 is pre-defined when applying supervised NMF, whereas a joint estimation is applied for the source model 109 when using unsupervised NMF. The source signal or signals 113 can be derived from the source spectrogram 111.
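For concreteness, the factorization V ≈ WH can be computed with the classical multiplicative update rules (shown here for the Euclidean cost); this is a generic NMF sketch, not necessarily the optimization procedure used in the patent.

```python
import numpy as np

def nmf(V, n_components, n_iter=200, eps=1e-9):
    """Factorize a non-negative magnitude spectrogram V (freq x time) into
    component spectra W (freq x K) and activations H (K x time) using
    multiplicative updates that keep both factors non-negative."""
    rng = np.random.default_rng(0)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, n_components)) + eps
    H = rng.random((n_components, n_time)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update activations H
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update component spectra W
    return W, H
```

A supervised variant keeps W (or part of it) fixed to pre-trained source models and only updates H.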
Non-negative Matrix Factorization (NMF) and its extensions have been successfully used in areas related to speech recognition, including speech de-noising and speaker separation. NMF has usually been used for offline audio processing for noise reduction or source separation based on pre-defined source models. It has recently been extended to on-line processing, where the processing is done with a sliding window in order to process the audio signal frame by frame and achieve real-time processing, as presented in Cyril Joder, Felix Weninger, Florian Eyben, David Virette, Björn Schuller: "Real-time Speech Separation by Semi-Supervised Nonnegative Matrix Factorization", Proc. International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA) 2012, Tel Aviv, Israel, Springer LNCS, Vol. 7191, pp. 323-329, March 12-15, 2012. In supervised NMF, the source models are known in advance; for semi-supervised NMF, only one source model is known, either noise or speaker. Both approaches have been extended to on-line NMF in order to be applied in real-time speech enhancement implementations. It has been shown that supervised NMF performs better than semi-supervised NMF, which is expected, as supervised NMF uses the exact model for all the sources of an audio signal. Supervised NMF usually offers better performance as it is based on known source models which perfectly match the sources which are part of the audio mix. With the same audio mix, semi-supervised NMF provides lower performance in terms of noise reduction or source separation, as only one source is known a priori, i.e. pre-defined. The other source models are estimated and refined over time. The quality improves with the refinement of the noise model.
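A minimal frame-wise sketch of the semi-supervised case follows, assuming a pre-trained speech dictionary and a noise dictionary that is adapted on-line; the update rules and the Wiener-like reconstruction are common choices, not the exact algorithm of the cited paper.

```python
import numpy as np

def semi_supervised_nmf_frame(v, W_speech, W_noise, n_iter=30, eps=1e-9):
    """One on-line step: the speech dictionary W_speech is fixed (the
    known source model), while W_noise and the activations are adapted.
    Returns the enhanced magnitude frame and the updated noise model."""
    W = np.hstack([W_speech, W_noise])
    V = v[:, None]                            # column vector (freq x 1)
    H = np.full((W.shape[1], 1), 0.5)
    k_s = W_speech.shape[1]
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # adapt all activations
        W[:, k_s:] *= (V @ H[k_s:].T) / (W @ H @ H[k_s:].T + eps)  # noise only
    speech = W[:, :k_s] @ H[:k_s]
    mask = speech / (W @ H + eps)             # Wiener-like mask
    return mask[:, 0] * v, W[:, k_s:]
```

Carrying the returned noise dictionary into the next frame gives the sliding-window, frame-by-frame behaviour described above.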
SUMMARY OF THE INVENTION
It is the object of the invention to improve the audio quality of an audio rendering system without the knowledge of an exact source model of the audio signal.
This object is achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
The invention is based on the finding that the audio quality is improved when NMF audio processing with a distributed source model adaptation is applied. The audio quality improvement is achieved when using a wireless acoustic sensor network and a distributed source model adaptation. A wireless acoustic sensor network comprises microphones with a short-distance wireless connection. The source models are trained from each acoustic sensor of the network and shared between all the terminals of the local network. The speech and/or audio enhancement is performed based on a distributed source model learning unit. The wireless acoustic sensor network can advantageously be based on mobile devices and/or audio conference terminals. Source models, e.g. background noise and/or unknown speakers, can be merged in the active terminal, i.e. the terminal being used for the monophonic sound recording. Speech enhancement and/or audio enhancement is advantageously based on Non-negative Matrix Factorization (NMF).
By applying such an audio enhancement system, the audio quality is significantly improved and the audio rendering is enhanced without requiring an exact source model of the audio signal, as will be presented in the following.
In order to describe the invention in detail, the following terms, abbreviations and notations will be used:
audio rendering: a reproduction technique capable of creating spatial sound fields in an extended area by means of loudspeakers or loudspeaker arrays
audio enhancement: a technique for enhancing the quality of an audio signal
NMF: non-negative matrix factorization
WASN: wireless acoustic sensor network
According to a first aspect, the invention relates to an audio enhancement system, comprising at least two terminals, each one comprising at least one acoustic sensor and processing means, wherein the acoustic sensors of the at least two terminals are wirelessly coupled with respect to each other forming an acoustic sensor network, each terminal being configured to provide information of at least one audio source model of the terminal to at least one of the other terminals, wherein information of an audio source model of a terminal describes an audio characteristic of at least one audio source impacting on the at least one acoustic sensor of the terminal, and wherein the processing means of the at least two terminals are configured to perform audio enhancement processing based on the information of the audio source models of the at least two terminals.
By using the acoustic sensor network so designed, the audio quality of audio rendering is improved without exact knowledge of a source model of the audio signal.
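The "information of an audio source model" exchanged between terminals is not restricted to a particular format by the claim; one hypothetical representation, with all field names invented for illustration, could be:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SourceModelInfo:
    """Hypothetical record a terminal shares over the wireless
    acoustic sensor network for one of its audio source models."""
    model_id: str                  # stable id, kept across refinements
    terminal_id: str               # terminal that trained the model
    kind: str                      # e.g. "speaker" or "noise"
    known: bool                    # known vs. unknown source status
    refined: bool                  # initial vs. refined status
    component_spectra: np.ndarray  # NMF factor W (freq x components)
```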
In a first possible implementation form of the audio enhancement system according to the first aspect, each terminal is configured to adapt the at least one audio source model of the terminal based on an audio characteristic of a main speaker impacting on the at least one acoustic sensor of the terminal.
By that distributed source model adaptation, taking into account the audio characteristics of the main speaker, audio quality is enhanced. In a second possible implementation form of the audio enhancement system according to the first aspect, each terminal is configured to adapt the at least one audio source model of the terminal based on audio characteristics of different users and/or different noise environments impacting on the at least one acoustic sensor of the terminal.
By that distributed source model adaptation, taking into account the audio characteristics of different users and/or different noise environments, audio quality is enhanced.
In a third possible implementation form of the audio enhancement system according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, each terminal is configured to use an output signal of the at least one acoustic sensor of the terminal as a training signal for adjusting the at least one audio source model of the terminal. By using an output signal of the at least one acoustic sensor of the terminal as a training signal, speech and/or audio enhancement can be performed as a learning process. The learning can advantageously be based on a learning unit or on mobile devices and/or audio conference terminals. Source models, e.g. background noise and/or unknown speakers, can be merged in the active terminal, thereby improving the audio rendering.
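As one possible reading of this implementation form, the terminal could record its own microphone output, e.g. in quiet conditions, and learn the owner's component spectra with the nmf() helper sketched earlier; the window and rank values are illustrative.

```python
import numpy as np

def train_user_model(training_signal, n_fft=512, hop=256, n_components=16):
    """Learn a speaker source model from the terminal's own acoustic
    sensor output used as training signal (relies on the nmf() sketch
    given further above)."""
    window = np.hanning(n_fft)
    V = np.abs(np.array([np.fft.rfft(training_signal[i:i + n_fft] * window)
                         for i in range(0, len(training_signal) - n_fft, hop)])).T
    W_user, _ = nmf(V, n_components)  # keep only the component spectra
    return W_user
```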
In a fourth possible implementation form of the audio enhancement system according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, each terminal is configured to refine the information of the at least one audio source model of the terminal based on the provided information of the audio source models of the other terminals.
Such a distributed source model adaptation facilitates a distributive learning of the audio enhancement system. Audio quality is enhanced from step to step by refining the information base.
In a fifth possible implementation form of the audio enhancement system according to the fourth implementation form of the first aspect, each terminal is configured to provide the refined information of the at least one audio source model of the terminal to at least one of the other terminals. The audio enhancement system forms a distributive learning system where the different terminals are the multiple nodes of the system. A drop-out of one of the terminals does not result in a drop-out of the whole system.
In a sixth possible implementation form of the audio enhancement system according to the fourth implementation form or according to the fifth implementation form of the first aspect, each terminal is configured to merge the information of the at least one audio source model of the terminal with the provided information of the audio source models of the other terminals to provide the refined information.
By merging information of some or all of the terminals, information processed by the audio enhancement system can be improved thereby optimizing audio quality.
In a seventh possible implementation form of the audio enhancement system according to the sixth implementation form of the first aspect, each terminal is configured to perform the merging based on a similarity measure describing a similarity of the audio source models.
Efficiency of information processing is improved when similar information is processed similarly and different information is processed in a different way. This reduces complexity of the audio enhancement system.
In an eighth possible implementation form of the audio enhancement system according to the seventh implementation form of the first aspect, the similarity measure is based on a distance between component spectra of the audio source models.
A distance between component spectra of the audio source models is easy to process as the component spectra of the audio source models are parameters of the non-negative matrix factorization model.
In a ninth possible implementation form of the audio enhancement system according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the at least two terminals comprise at least one of a mobile device, in particular a smartphone or a tablet PC, a dedicated wireless microphone and an audio conference terminal. That is, the audio enhancement system can be easily formed by any conventional mobile devices such as smartphones, tablet PCs, wireless microphones or audio conference terminals.
In a tenth possible implementation form of the audio enhancement system according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the processing means of the at least two terminals are configured to perform the audio enhancement processing based on non-negative matrix factorization.
Non-negative matrix factorization is a mathematical algorithm that is well suited for real-time processing and can be easily implemented in conventional mobile devices.
In an eleventh possible implementation form of the audio enhancement system according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the processing means of a terminal are configured to store the information of the at least one audio source model provided by another terminal in a volatile memory of the terminal. The information of other terminals is only needed when a terminal is active. Therefore, a volatile memory is sufficient for storing the information of audio source models provided by other terminals. Thus, a low-cost audio enhancement system can be efficiently implemented.
According to a second aspect, the invention relates to a method for enhancing audio processing of a system comprising at least two terminals, each one comprising at least one acoustic sensor, the method comprising: coupling the acoustic sensors of the at least two terminals wirelessly with respect to each other to form an acoustic sensor network; providing by each terminal information of at least one audio source model of the terminal to each other terminal, wherein information of an audio source model of a terminal describes an audio characteristic of at least one audio source impacting on the at least one acoustic sensor of the terminal; and enhancing audio processing based on the information of the audio source models of the at least two terminals. By using the acoustic sensor network so designed, the audio quality of audio rendering is improved without exact knowledge of a source model of the audio signal.
In a first possible implementation form of the method according to the second aspect, the method comprises selecting one of the at least two terminals as active terminal, wherein the selection is based on a distance of the at least two terminals to a speaker or based on an energy of a signal received by the at least one acoustic sensor of the at least two terminals. When the selection is based on a distance of a terminal to a speaker or on an energy of a received signal, distant acoustic sources contribute less to the model than nearby acoustic sources. This improves the efficiency of the audio enhancement method.
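The energy-based variant of this selection is straightforward; the sketch below assumes each terminal reports one current microphone frame and picks the terminal with the highest short-term energy as active.

```python
import numpy as np

def select_active_terminal(frames_by_terminal):
    """Return the id of the terminal whose current frame has the highest
    short-term energy; all other terminals are treated as non-active."""
    energies = {tid: float(np.mean(np.square(frame)))
                for tid, frame in frames_by_terminal.items()}
    return max(energies, key=energies.get)
```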
In a second possible implementation form of the method according to the first implementation form of the second aspect, the method comprises: identifying an audio source model from the at least one audio source model of the active terminal as being associated with an owner of the active terminal and classifying it as the desired audio source model; and classifying the other audio source models from the at least one audio source model of the active terminal as non-desired audio source models.
When an audio source model is identified as that of the owner of the active terminal, noise can be effectively separated, which improves the audio quality of the method.
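Using the hypothetical SourceModelInfo record from above, this classification step could look as follows; matching on a model id known to belong to the owner is an assumption.

```python
def classify_models(models, owner_model_id):
    """Split the active terminal's source models into the desired model,
    i.e. the one associated with the owner, and the non-desired models
    that the NMF processing will treat as interference."""
    desired = [m for m in models if m.model_id == owner_model_id]
    non_desired = [m for m in models if m.model_id != owner_model_id]
    return desired, non_desired
```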
The methods described herein may be implemented as software in a Digital Signal Processor (DSP), in a micro-controller or in any other side-processor or as hardware circuit within an application specific integrated circuit (ASIC).
The invention can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations thereof, e.g. in available hardware of conventional mobile devices or in new hardware dedicated for processing the audio enhancement system.
BRIEF DESCRIPTION OF THE DRAWINGS
Further embodiments of the invention will be described with respect to the following figures, in which:
Fig. 1 shows a schematic diagram of a conventional non-negative Matrix Factorization (NMF) technique used in audio processing;
Fig. 2 shows a schematic diagram of an audio enhancement system according to an implementation form;
Fig. 3 shows a schematic diagram of a Tablet PC comprising four acoustic sensors according to an implementation form; and
Fig. 4 shows a schematic diagram of a method for enhancing audio processing according to an implementation form.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
Fig. 2 shows a schematic diagram of an audio enhancement system 200 according to an implementation form. The audio enhancement system 200 comprises five terminals 201a, 201b, 201c, 201d, 201e and a conferencing unit, i.e. a main terminal 203, each one comprising at least one acoustic sensor and processing means. The acoustic sensors of the terminals 201a, 201b, 201c, 201d, 201e and 203 are wirelessly coupled with respect to each other, forming an acoustic sensor network. Each terminal provides information of at least one audio source model of the terminal to each other terminal. Information of an audio source model of a terminal 201a describes an audio characteristic of at least one audio source, e.g. a distant speaker 207 or environment noise 209, impacting on the at least one acoustic sensor of the terminal. The processing means of the terminals 201a, 201b, 201c, 201d and 203 perform audio enhancement processing based on the information of the audio source models of the other terminals.
For each terminal, the processing means may be implemented on a digital signal processing (DSP) unit of the respective terminal, e.g. as an embedded DSP unit in software, or the processing means may be implemented as a hardware circuit of the terminal. Fig. 2 describes the typical conference scenario where several people gather in the same room for an audio conference. The main terminal 203 is e.g. an audio conferencing terminal; the several mobile devices 201a, 201b, 201c, 201d, 201e, i.e. terminals or mobile terminals, are wirelessly connected to the main terminal 203, thereby forming a wireless acoustic sensor network. Each of the mobile terminals 201a, 201b, 201c, 201d and 201e comprises at least one acoustic sensor, e.g. a microphone. In an implementation form, each mobile terminal comprises one microphone. In an implementation form, the mobile terminals comprise several microphones 307, e.g. one, two, three or four microphones 307 as described below with respect to Fig. 3, or even more than four. In an implementation form, all microphones are part of the acoustic sensor network. In an implementation form, only one microphone of each mobile device is part of the acoustic sensor network. In an implementation form, more than one microphone of each mobile device is part of the acoustic sensor network. In an implementation form, the acoustic sensor network is formed by the mobile devices 201a, 201b, 201c, 201d and 201e in a decentralized manner, without the need of a central or dedicated main terminal 203.
In an implementation form, the audio enhancement system 200 establishes a wireless acoustic sensor network based on the communication terminals, e.g. mobile terminals, audio conference terminals, dedicated wireless microphones, etc. All the terminals of the network exchange the known source models. For instance, the speaker models which are known by each terminal of the local network, i.e. the acoustic sensor network, are distributed to all other terminals of this network. In an implementation form, the source models are determined as described above with respect to Fig. 1. Every mobile device comprises a sufficiently accurate model of the main user. In an implementation form, a training sequence is performed by each mobile owner or mobile user. In an implementation form, the mobile terminal comprises a particularly well-adapted source model for the main user which has been trained in quiet conditions and/or on a large database, i.e. continuous model refinement is applied.
In an alternative implementation form, each terminal comprises several source models, e.g. source models from different users or different noise environments, e.g. office, home, etc. In an implementation form, all the models or a subset of the known models are distributed to the other terminals. In an implementation form, a priority is set to the most probable models, for instance, it is expected that the owner of a mobile phone takes part in the audio conferencing.
The active terminal, i.e. the terminal which is identified as being closer to the talker, uses the received source models to perform NMF-based audio processing, i.e. audio enhancement, in order to reduce or cancel all audio interferences arising from background noise 209, e.g. printers, projectors, etc., or from interfering talkers 207. In the illustration of Fig. 2, the main speaker 205 is closer to its mobile device 201a than an interfering talker 207.
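A frame-wise sketch of this enhancement step follows, with all received dictionaries held fixed (supervised NMF) and the desired source extracted by a Wiener-like mask; as before, this is one plausible realization rather than the patented algorithm itself.

```python
import numpy as np

def enhance_frame(v, W_desired, W_undesired, n_iter=30, eps=1e-9):
    """Supervised NMF enhancement of one magnitude frame v (freq,):
    the dictionaries are the received source models and stay fixed,
    only the activations are estimated; interfering sources are
    suppressed by masking."""
    W = np.hstack([W_desired, W_undesired])
    V = v[:, None]
    H = np.full((W.shape[1], 1), 0.5)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # estimate activations only
    k_d = W_desired.shape[1]
    desired = W[:, :k_d] @ H[:k_d]
    mask = desired / (W @ H + eps)            # Wiener-like mask
    return mask[:, 0] * v
```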
In an implementation form, the active terminal is the dedicated audio conferencing terminal 203. In an alternative implementation form, the active terminal is one of the mobile devices 201a, 201b, 201c, 201d, 201e of the local network, i.e. the acoustic sensor network.
In an implementation form, the active terminal is manually selected. The user of a terminal asks to get the floor and the audio conference moderator gives the floor to the user, which means that the terminal which asked to become the active terminal is then used as the main sound recording device. In an alternative implementation form, the active terminal is automatically selected by a wireless acoustic sensor network control unit. In an implementation form, this wireless acoustic sensor network control unit is arranged in the main audio conference terminal 203. In that case, the microphone or microphones with the highest energy are selected as active terminal or terminals. All other terminals are identified as non-active terminals.
In an alternative implementation form, the non-active terminals are used to adapt the unknown source models, i.e. the speakers without dedicated source model and/or background noise 209 which are not modelled. In this implementation form, the non-active terminals continuously update the source models which are regularly synchronized among the terminals. The unknown source models are initialized randomly and updated with each new frame.
In a further alternative implementation form, the active terminal is selected. This active terminal then determines the desired source models, e.g., the source model which is identified as the model of the owner of the terminal. The other source models are used in the NMF processing as non-desired sources, i.e. for noise reduction or source separation. In this implementation form, the sources are classified as desired or non-desired sources in the audio mix. In an implementation form, this classification is based on simple information, e.g. describing the owner or user of the terminal. In an implementation form, this classification is based on information describing a proximity to the active terminal. The sources which are close to the active terminal are identified as the desired sources.
In order to provide the status of the source models which are shared among the terminals, status information is associated with each source model. In an implementation form, this status information comprises the known or unknown status of the source model for the terminal. In the case of unknown source models, further information about the initial/refined status is also associated. A single identification code is given to each model, which allows replacing a source model by its refined version without having to increase the number of models.
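The stable identification code makes model refinement an in-place replacement; a minimal sketch, assuming the hypothetical SourceModelInfo record above:

```python
def update_model_registry(registry, incoming):
    """Insert or replace a shared source model: because a refined model
    keeps the identification code of its earlier version, it overwrites
    that entry instead of growing the number of models."""
    registry[incoming.model_id] = incoming
    return registry
```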
In a further alternative implementation form, the source model adaptation comprises a merging of several source models, that is, at least two, to improve the source modelling. Each non-active terminal continuously learns new source models, e.g. unknown speakers and/or background noise, in order to refine the complete definition of the source models which are part of the audio mix. In an implementation form, new source models are shared among all terminals of the local network, and one of the terminals, e.g. the active terminal, merges them into a new refined source model.
In an implementation form, the merging operation is based on a similarity measure which is applied to the received source models for unknown sources. In an implementation form, this similarity measure is defined according to the following procedure: The distances between the component spectra W are calculated; clusters of source models are defined, e.g. by using a k-means algorithm; and the source models are combined in a new refined source model based on similar component spectra or cluster.
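One way to realize this merging procedure, assuming similarity between length-normalized component spectra and a small hand-rolled k-means; the cluster count and iteration numbers are illustrative.

```python
import numpy as np

def merge_source_models(W_list, n_clusters=8, n_iter=50, seed=0):
    """Merge several received models of the same unknown source: pool the
    component spectra (columns of each W), cluster similar spectra with
    k-means on length-normalized vectors, and keep one centroid per
    cluster as the refined model. Requires more pooled spectra than
    clusters."""
    spectra = np.hstack(W_list).T                        # one spectrum per row
    spectra /= np.linalg.norm(spectra, axis=1, keepdims=True) + 1e-9
    rng = np.random.default_rng(seed)
    centroids = spectra[rng.choice(len(spectra), n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(spectra[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                    # nearest centroid
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = spectra[labels == k].mean(axis=0)
    return centroids.T                                   # refined W (freq x clusters)
```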
In an implementation form, the new source models which are received from each terminal taking part in the network are stored in volatile memory to reduce memory constraints and/or save memory space. Implementation forms of the invention provide improved audio processing based on non-negative matrix factorization with a distributed source model learning unit, which allows a better definition of the source models of all audio sources that are part of the audio mix and a faster adaptation to the introduction of new sound sources.
Fig. 3 shows a schematic diagram of a Tablet PC 300 comprising four acoustic sensors according to an implementation form. The Tablet PC 300 comprises four microphones 307, one arranged in the middle of each edge of the tablet, in order to better discriminate the direction of incoming sounds. The Tablet PC 300 may correspond to the mobile device described above with respect to Fig. 2. The Tablet PC 300 is adapted to perform the audio enhancement processing analogously to the procedure described with respect to Fig. 2.
Fig. 4 shows a schematic diagram of a method 400 for enhancing audio processing according to an implementation form.
The method 400 is for enhancing the audio processing of a system comprising at least two terminals, each one comprising at least one acoustic sensor. The method 400 comprises coupling 401 the acoustic sensors of the at least two terminals wirelessly with respect to each other to form an acoustic sensor network. The method 400 comprises providing 403 by each terminal information of at least one audio source model of the terminal to each other terminal, wherein information of an audio source model of a terminal describes an audio characteristic of at least one audio source impacting on the at least one acoustic sensor of the terminal. The method 400 comprises enhancing 405 the audio processing based on the information of the audio source models of the at least two terminals.
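A compact sketch of steps 403 and 405 follows; the wireless coupling of step 401 is outside what a few lines of code can show, and the data layout (one model matrix and one current spectrum per terminal) is an assumption.

```python
import numpy as np

def method_400_sketch(models, frames):
    """models: dict terminal id -> W matrix (freq x components)
    frames: dict terminal id -> current magnitude spectrum (freq,)
    Returns the active terminal and the pooled model information it
    would feed into the NMF-based enhancement of step 405."""
    # Step 403: after the exchange, every terminal holds all models.
    shared = dict(models)
    # Step 405 (selection part): pick the active terminal by energy.
    active = max(frames, key=lambda t: float(np.sum(frames[t] ** 2)))
    W_pool = np.concatenate([shared[t] for t in sorted(shared)], axis=1)
    return active, W_pool

rng = np.random.default_rng(3)
models = {t: np.abs(rng.standard_normal((257, 8))) for t in ("a", "b")}
frames = {"a": np.abs(rng.standard_normal(257)),
          "b": 0.1 * np.abs(rng.standard_normal(257))}
active, W_pool = method_400_sketch(models, frames)  # active == 'a'
```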
In one embodiment, the method 400 includes the following elements and steps, which are performed in order to determine enhanced source models: A wireless acoustic sensor network is established based on the communication terminals (mobile terminals, audio conference terminals, dedicated wireless microphones, etc.). All terminals of the network exchange their known source models; for instance, the speaker models known by each terminal of the local network are distributed to all other terminals of this network. Each mobile device uses a sufficiently accurate model of its main user. In an implementation form, each mobile device performs a training procedure. In an implementation form, the mobile terminal uses a particularly well-adapted source model of the main user which has been trained in quiet conditions and/or on a large database (continuous model refinement). Alternatively, each terminal includes several source models, e.g. from different users or different noise environments such as office or home. All models, or a subset of the known models, are distributed to the other terminals. In an implementation form, a priority is set for the most probable models; for instance, the owner of a mobile phone is expected to take part in the audio conference. The active terminal, i.e. the terminal identified as being closest to the talker, uses the received source models to perform NMF-based audio processing (audio enhancement) in order to reduce or cancel all audio interferences, such as background noise or interfering talkers. In an implementation form, the active terminal is the dedicated audio conferencing terminal or one of the mobile devices of the local network.
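The NMF-based enhancement on the active terminal could be realised, for instance, as a semi-supervised decomposition with fixed dictionaries followed by a Wiener-style mask, sketched below; this particular reconstruction is one common realisation and an assumption here, not a requirement of the patent text.

```python
import numpy as np

def nmf_enhance(V, W_desired, W_other, n_iter=50, eps=1e-12):
    """Decompose the magnitude spectrogram V (freq x frames) against
    the fixed, concatenated source models, then keep the desired
    components and suppress the rest with a Wiener-style mask."""
    W = np.concatenate([W_desired, W_other], axis=1)  # fixed dictionaries
    rng = np.random.default_rng(0)
    H = np.abs(rng.standard_normal((W.shape[1], V.shape[1]))) + eps
    for _ in range(n_iter):                           # KL updates on H only
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
    Kd = W_desired.shape[1]
    V_desired = W[:, :Kd] @ H[:Kd]                    # desired-source part
    return V * (V_desired / (W @ H + eps))            # masked magnitudes

rng = np.random.default_rng(5)
V = np.abs(rng.standard_normal((257, 40)))
Wd = np.abs(rng.standard_normal((257, 8)))
Wo = np.abs(rng.standard_normal((257, 16)))
V_hat = nmf_enhance(V, Wd, Wo)   # enhanced magnitude spectrogram
```

In such a realisation, the enhanced magnitudes would then be recombined with the phase of the noisy input and transformed back to the time domain.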
As an optional feature of the method 400, the active terminal is selected manually: the user of a terminal asks for the floor and the audio conference moderator grants it, so that the requesting terminal is then used as the main sound recording device. Alternatively, the active terminal is selected automatically by the wireless acoustic sensor network control unit, e.g. the main audio conference terminal. In that case, the microphone or microphones with the highest energy are selected as active terminal or terminals; all other terminals are identified as non-active terminals. In an alternative embodiment, the non-active terminals are used to adapt the unknown source models, i.e. the models of speakers without a dedicated source model and/or of background noise that is not yet modelled. In this embodiment, the non-active terminals continuously update the source models, which are regularly synchronized among the terminals. The unknown source models are initialized randomly and updated with each new frame.
In a further alternative embodiment, the following steps are also included:
As in the previous embodiment, the method includes a step of selecting the active terminal. This active terminal then determines the desired source models, e.g. the source model which is identified as the model of the owner of the terminal. The other source models are used in the NMF processing as non-desired sources (noise reduction or source separation). This embodiment includes a further step of classifying the sources in the audio mix as desired or non-desired. This classification is based on simple information (owner or user of the terminal) or on proximity to the active terminal, the sources close to the active terminal being identified as desired sources.
In order to provide the status of the source models which are shared among the terminals, status information is associated with each source model. This status information comprises, for instance, the known/unknown status of the source model for the terminal; for unknown source models, further information about the initial/refined status is also associated. A unique identification code is given to each model, which allows a source model to be replaced by its refined version without increasing the number of models.
In a further alternative embodiment, the source model adaptation includes a step of merging several source models (at least two) to improve the source modelling. Each non-active terminal continuously learns new source models (unknown speakers and/or background noise) in order to refine the complete definition of the source models that are part of the audio mix. New source models are shared among all terminals of the local network and one of the terminals (e.g. the active terminal) merges them into a new refined source model.
In an implementation form, the merging operation is based on a similarity measure which is applied to the received source models for unknown sources. This similarity measure is defined using the following procedure: calculating the distances between the component spectra W; defining clusters of source models, for instance using a k-means algorithm; and combining the source models with similar component spectra, i.e. within a cluster, into a new refined source model.
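As a complement to the clustering sketch given earlier, the first step, computing the distances between the pooled component spectra, might look as follows; cosine distance is used here as one plausible metric, the patent leaving the choice open.

```python
import numpy as np

def spectra_distance_matrix(models):
    """Pairwise cosine distances between all component spectra pooled
    from the received models (columns of each W matrix)."""
    S = np.concatenate(models, axis=1).T              # (n_spectra, freq)
    S /= np.linalg.norm(S, axis=1, keepdims=True) + 1e-12
    return 1.0 - S @ S.T                              # zero on the diagonal

rng = np.random.default_rng(4)
D = spectra_distance_matrix([np.abs(rng.standard_normal((257, 8)))
                             for _ in range(2)])      # (16, 16) matrix
```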
In an implementation form, the new source models which are received from each terminal taking part in the network are stored in volatile memory to reduce memory constraints and/or save memory space. In an implementation form, the method is an NMF-based audio processing for audio enhancement with distributed source model adaptation. One possible scenario for implementing the method is described above with respect to Fig. 2.
From the foregoing, it will be apparent to those skilled in the art that a variety of methods, systems, computer programs on recording media, and the like are provided. The present disclosure also supports a computer program product including computer-executable code or computer-executable instructions that, when executed, cause at least one computer to execute the performing and computing steps described herein.
The present disclosure also supports a system configured to execute the performing and computing steps described herein.
Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art readily recognize that there are numerous applications of the invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art recognize that many changes may be made thereto without departing from the spirit and scope of the present invention. It is therefore to be understood that, within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.

Claims

1. Audio enhancement system (200), comprising at least two terminals (201a, 201b), each one comprising at least one acoustic sensor and processing means, wherein the acoustic sensors of the at least two terminals (201a, 201b) are wirelessly coupled with respect to each other forming an acoustic sensor network, each terminal (201a) being configured to provide information of at least one audio source model of the terminal (201a) to at least one of the other terminals (201b), wherein information of an audio source model of a terminal (201a) describes an audio characteristic of at least one audio source (205) impacting on the at least one acoustic sensor of the terminal (201a), and wherein the processing means of the at least two terminals (201a, 201b) are configured to perform audio enhancement processing based on the information of the audio source models of the at least two terminals (201a, 201b).
2. The audio enhancement system (200) of claim 1, wherein each terminal (201a) is configured to adapt the at least one audio source model of the terminal (201a) based on an audio characteristic of a main speaker (205) impacting on the at least one acoustic sensor of the terminal (201a).
3. The audio enhancement system (200) of claim 1, wherein each terminal (201a) is configured to adapt the at least one audio source model of the terminal (201a) based on audio characteristics of different users (205, 207) and/or different noise environments (209) impacting on the at least one acoustic sensor of the terminal (201a).
4. The audio enhancement system (200) of one of the preceding claims, wherein each terminal (201a) is configured to use an output signal of the at least one acoustic sensor of the terminal (201a) as training signal for adjusting the at least one audio source model of the terminal (201a).
5. The audio enhancement system (200) of one of the preceding claims, wherein each terminal (201a) is configured to refine the information of the at least one audio source model of the terminal (201a) based on the provided information of the audio source models of the other terminals (201b).
6. The audio enhancement system (200) of claim 5, wherein each terminal (201a) is configured to provide the refined information of the at least one audio source model of the terminal (201a) to at least one of the other terminals (201b).
7. The audio enhancement system (200) of claim 5 or claim 6, wherein each terminal (201a) is configured to merge the information of the at least one audio source model of the terminal (201a) with the provided information of the audio source models of the other terminals (201b) to provide the refined information.
8. The audio enhancement system (200) of claim 7, wherein each terminal (201a) is configured to perform the merging based on a similarity measure describing a similarity of the audio source models.
9. The audio enhancement system (200) of claim 8, wherein the similarity measure is based on a distance between component spectra of the audio source models.
10. The audio enhancement system (200) of one of the preceding claims, wherein the at least two terminals (201a, 201b) comprise at least one of a mobile device, in particular a smartphone or a tablet PC, a dedicated wireless microphone and an audio conference terminal.
11. The audio enhancement system (200) of one of the preceding claims, wherein the processing means of the at least two terminals (201a, 201b) are configured to perform the audio enhancement processing based on non-negative matrix factorization.
12. The audio enhancement system (200) of one of the preceding claims, wherein the processing means of a terminal (201a) are configured to store the information of the at least one audio source model provided by another terminal (201b) in a volatile memory of the terminal (201a).
13. Method (400) for enhancing audio processing of a system comprising at least two terminals, each one comprising at least one acoustic sensor, the method comprising: coupling (401) the acoustic sensors of the at least two terminals wirelessly with respect to each other to form an acoustic sensor network; providing (403) by each terminal information of at least one audio source model of the terminal to each other terminal, wherein information of an audio source model of a terminal describes an audio characteristic of at least one audio source impacting on the at least one acoustic sensor of the terminal; and enhancing (405) audio processing based on the information of the audio source models of the at least two terminals.
14. The method (400) of claim 13, comprising: selecting one of the at least two terminals as active terminal, wherein the selection is based on a distance of the at least two terminals to a speaker or based on an energy of a signal received by the at least one acoustic sensor of the at least two terminals.
15. The method (400) of claim 14, comprising: identifying an audio source model from the at least one audio source models of the active terminal as being associated to an owner of the active terminal and classified as desired audio source model; and classifying the other audio source models from the at least one audio source models of the active terminal as non-desired audio source models.
PCT/EP2012/071305 2012-10-26 2012-10-26 Audio enhancement system WO2014063754A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2012/071305 WO2014063754A1 (en) 2012-10-26 2012-10-26 Audio enhancement system
EP12781070.3A EP2888736B1 (en) 2012-10-26 2012-10-26 Audio enhancement system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2012/071305 WO2014063754A1 (en) 2012-10-26 2012-10-26 Audio enhancement system

Publications (1)

Publication Number Publication Date
WO2014063754A1 (en) 2014-05-01

Family

ID=47137691

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2012/071305 WO2014063754A1 (en) 2012-10-26 2012-10-26 Audio enhancement system

Country Status (2)

Country Link
EP (1) EP2888736B1 (en)
WO (1) WO2014063754A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10983320B2 (en) 2013-11-25 2021-04-20 European Molecular Biology Laboratory Optical arrangement for imaging a sample

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8295762B1 (en) * 2011-05-20 2012-10-23 Google Inc. Distributed blind source separation

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
ALEXANDER BERTRAND ET AL: "Distributed Adaptive Estimation of Node-Specific Signals in Wireless Sensor Networks With a Tree Topology", IEEE TRANSACTIONS ON SIGNAL PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 59, no. 5, 1 May 2011 (2011-05-01), pages 2196 - 2210, XP011353096, ISSN: 1053-587X, DOI: 10.1109/TSP.2011.2108290 *
CYRIL JODER; FELIX WENINGER; FLORIAN EYBEN; DAVID VIRETTE; BJORN SCHULLER: "Proc. International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA) 2012, Tel Aviv, Israel", vol. 7191, 12 March 2012, SPRINGER LNCS, article "Real-time Speech Separation by Semi-Supervised Nonnegative Matrix Factorization", pages: 323 - 329
DOCLO S ET AL: "Reduced-Bandwidth and Distributed MWF-Based Noise Reduction Algorithms for Binaural Hearing Aids", IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, USA, vol. 17, no. 1, 1 January 2009 (2009-01-01), pages 38 - 51, XP011241217, ISSN: 1558-7916, DOI: 10.1109/TASL.2008.2004291 *
HEUSDENS R. ET AL.: "DISTRIBUTED MVDR BEAMFORMING FOR (WIRELESS) MICROPHONE NETWORKS USING MESSAGE PASSING", INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT 2012, AACHEN, GERMANY, 4 September 2012 (2012-09-04), XP002711598 *
J. BENESTY; J. CHEN; Y. HUANG: "Microphone Array Signal Processing", 2008, SPRINGER
OZEROV A ET AL: "Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation", IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, USA, vol. 18, no. 3, 1 March 2010 (2010-03-01), pages 550 - 563, XP011282775, ISSN: 1558-7916, DOI: 10.1109/TASL.2009.2031510 *
Y. EPHRAIM; D. MALAH: "Speech enhancement using a minimum mean square error log-spectral amplitude estimator", IEEE TRANS. ON ACOUST., SPEECH, SIGNAL PROCESSING, vol. ASSP-33, April 1985 (1985-04-01), pages 443 - 445, XP000931203, DOI: doi:10.1109/TASSP.1985.1164550
Y. EPHRAIM; D. MALAH: "Speech enhancement using a minimum mean square error short-time spectral amplitude estimator", IEEE TRANS. ON ACOUST., SPEECH, SIGNAL PROCESSING, vol. ASSP-32, December 1984 (1984-12-01), pages 1109 - 1121, XP002435684, DOI: doi:10.1109/TASSP.1984.1164453
YUSUKE HIOKA ET AL: "Distributed blind source separation with an application to audio signals", ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2011 IEEE INTERNATIONAL CONFERENCE ON, IEEE, 22 May 2011 (2011-05-22), pages 233 - 236, XP032000717, ISBN: 978-1-4577-0538-0, DOI: 10.1109/ICASSP.2011.5946383 *

Also Published As

Publication number Publication date
EP2888736A1 (en) 2015-07-01
EP2888736B1 (en) 2019-06-26

Similar Documents

Publication Publication Date Title
US10522167B1 (en) Multichannel noise cancellation using deep neural network masking
CN107135443B (en) Signal processing method and electronic equipment
US11694710B2 (en) Multi-stream target-speech detection and channel fusion
US9978388B2 (en) Systems and methods for restoration of speech components
US10123113B2 (en) Selective audio source enhancement
CN111418012B (en) Method for processing an audio signal and audio processing device
EP4004906A1 (en) Per-epoch data augmentation for training acoustic models
Chen et al. Visual acoustic matching
JP7498560B2 (en) Systems and methods
US12003673B2 (en) Acoustic echo cancellation control for distributed audio devices
US11264017B2 (en) Robust speaker localization in presence of strong noise interference systems and methods
WO2022206602A1 (en) Speech wakeup method and apparatus, and storage medium and system
WO2022253003A1 (en) Speech enhancement method and related device
CN111863020A (en) Voice signal processing method, device, equipment and storage medium
CN115482830A (en) Speech enhancement method and related equipment
US20220335937A1 (en) Acoustic zoning with distributed microphones
US20170206898A1 (en) Systems and methods for assisting automatic speech recognition
WO2014063754A1 (en) Audio enhancement system
Motlicek et al. Real‐Time Audio‐Visual Analysis for Multiperson Videoconferencing
Tran et al. Automatic adaptive speech separation using beamformer-output-ratio for voice activity classification
Zaken et al. Neural-Network-Based Direction-of-Arrival Estimation for Reverberant Speech-the Importance of Energetic, Temporal and Spatial Information
Nguyen et al. A two-step system for sound event localization and detection
Kim et al. DNN-based Parameter Estimation for MVDR Beamforming and Post-filtering
CN115910047B (en) Data processing method, model training method, keyword detection method and equipment
Giacobello An online expectation-maximization algorithm for tracking acoustic sources in multi-microphone devices during music playback

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12781070

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2012781070

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE