WO2014063754A1 - Audio enhancement system - Google Patents

Audio enhancement system

Info

Publication number
WO2014063754A1
Authority
WO
WIPO (PCT)
Prior art keywords
terminal
audio
terminals
audio source
information
Prior art date
Application number
PCT/EP2012/071305
Other languages
French (fr)
Inventor
David Virette
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2012/071305
Priority to EP12781070.3 (EP2888736B1)
Publication of WO2014063754A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Definitions

  • The present invention relates to an audio enhancement system comprising at least two wirelessly coupled terminals, in particular smartphones, tablet PCs or audio conferencing systems, and to a method for enhanced audio processing.
  • A wireless acoustic sensor network is a technology using several spatially distributed microphones that communicate through a wireless connection over short distances. This technology can be used to enhance the audio quality based on the different monophonic representations of the same sound field. For instance, noise reduction methods can be used to cancel the background noise. This is usually efficient for stationary background noise, but the quality is quite limited for non-stationary noise. Indeed, the noise representation is obtained from distant microphones which are not perfectly synchronized. Hence, the noise estimation performed on a distant microphone is efficient for stationary noise, but not for highly changing background noise.
  • The noise reduction algorithm is usually based on spectral subtraction. Such methods can for instance be derived from Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean square error log-spectral amplitude estimator," IEEE Trans. on Acoust., Speech, Signal Processing, vol. ASSP-33, pp. 443-445, Apr. 1985.
  • Microphone array methods can also be used, such as described by J. Benesty, J. Chen, and Y. Huang, "Microphone Array Signal Processing", Springer 2008, but a perfect synchronization of the acoustic sensors is then required in order to take into account the exact delays and energy differences between various monophonic recordings.
  • Microphone array signal processing is usually based on a known microphone arrangement, which is not the case for a wireless acoustic sensor network where the sensors are arbitrarily placed in space and can move. On the other hand, blind source separation methods based on Non-negative Matrix Factorization (NMF) have been used for speech enhancement based on a monophonic signal, also called single-channel source separation, with relatively good results.
  • The first factor W describes the component spectra of the source model 109; the second factor H describes the activations 107 of the signal spectrogram 103 of the audio signal 101.
  • The first factor W and the second factor H are matched with the short-time magnitude spectrogram V 103 of the audio signal 101 by an optimization procedure.
  • The source model 109 is pre-defined when applying supervised NMF, whereas a joint estimation is applied for the source model 109 when using unsupervised NMF.
  • The source signal or signals 113 can be derived from the source spectrogram 111.
  • Non-negative Matrix Factorization and its extensions have been successfully used in areas related to speech recognition, including speech de-noising and speaker separation.
  • NMF has usually been used for offline audio processing for noise reduction or source separation based on pre-defined source models. It has recently been extended to on-line processing, where the processing is done with a sliding window in order to process the audio signal frame by frame and achieve real-time processing, as presented in Cyril Joder, Felix Weninger, Florian Eyben, David Virette, Björn Schuller: "Real-time Speech Separation by Semi-Supervised Nonnegative Matrix Factorization", Proc. International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA) 2012, Tel Aviv, Israel, Springer LNCS, Vol. 7191, pp. 323-329, March 12-15, 2012.
  • With the same audio mix, semi-supervised NMF provides lower performance in terms of noise reduction or source separation, as only one source is known a priori, i.e. pre-defined.
  • the other source models are estimated and refined over time. The quality improves with the refinement of the noise model.
  • the invention is based on the finding that the audio quality is improved when NMF audio processing with a distributed source model adaptation is applied.
  • the audio quality improvement is achieved when using a wireless acoustic sensor network and a distributed source model adaptation.
  • a wireless acoustic sensor network comprises microphones with short distance wireless connection.
  • the source models are trained from each acoustic sensor of the network and shared between all the terminals of the local network.
  • the speech and/or audio enhancement is performed based on a distributed source model learning unit.
  • the wireless acoustic sensor network can advantageously be based on mobile devices and/or audio conference terminals.
  • Source models, e.g. background noise and/or unknown speakers, can be merged in the active terminal, i.e. the terminal being used for the monophonic sound recording. Speech enhancement and/or audio enhancement is advantageously based on Non-negative Matrix Factorization (NMF).
  • the audio quality is significantly improved and the audio rendering is enhanced without requiring an exact source model of the audio signal, as will be presented in the following.
  • audio rendering: a reproduction technique capable of creating spatial sound fields in an extended area by means of loudspeakers or loudspeaker arrays
  • audio enhancement: a technique for enhancing the quality of an audio signal
  • NMF: non-negative matrix factorization
  • WASN: wireless acoustic sensor network
  • the invention relates to an audio enhancement system, comprising at least two terminals, each one comprising at least one acoustic sensor and processing means, wherein the acoustic sensors of the at least two terminals are wirelessly coupled with respect to each other forming an acoustic sensor network, each terminal being configured to provide information of at least one audio source model of the terminal to at least one of the other terminals, wherein information of an audio source model of a terminal describes an audio characteristic of at least one audio source impacting on the at least one acoustic sensor of the terminal, and wherein the processing means of the at least two terminals are configured to perform audio enhancement processing based on the information of the audio source models of the at least two terminals.
  • the audio quality of audio rendering is improved without exact knowledge of a source model of the audio signal.
  • each terminal is configured to adapt the at least one audio source model of the terminal based on an audio characteristic of a main speaker impacting on the at least one acoustic sensor of the terminal.
  • each terminal is configured to adapt the at least one audio source model of the terminal based on audio characteristics of different users and/or different noise environments impacting on the at least one acoustic sensor of the terminal.
  • each terminal is configured to use an output signal of the at least one acoustic sensor of the terminal as a training signal for adjusting the at least one audio source model of the terminal.
  • speech and/or audio enhancement can be performed as a learning process.
  • the learning can be advantageously based on a learning unit or on mobile devices and/or audio conference terminals.
  • Source models e.g. background noise and/or unknown speakers can be merged in the active terminal, thereby improving the audio rendering.
  • each terminal is configured to refine the information of the at least one audio source model of the terminal based on the provided information of the audio source models of the other terminals.
  • Such a distributed source model adaptation facilitates a distributive learning of the audio enhancement system. Audio quality is enhanced from step to step by refining the information base.
  • each terminal is configured to provide the refined information of the at least one audio source model of the terminal to at least one of the other terminals.
  • the audio enhancement system forms a distributive learning system where the different terminals are the multiple nodes of the system. A drop-out of one of the terminals does not result in a drop-out of the whole system.
  • each terminal is configured to merge the information of the at least one audio source model of the terminal with the provided information of the audio source models of the other terminals to provide the refined information.
  • information processed by the audio enhancement system can be improved thereby optimizing audio quality.
  • each terminal is configured to perform the merging based on a similarity measure describing a similarity of the audio source models.
  • Efficiency of information processing is improved when similar information is processed similarly and different information is processed in a different way. This reduces complexity of the audio enhancement system.
  • the similarity measure is based on a distance between component spectra of the audio source models.
  • a distance between component spectra of the audio source models is easy to process as the component spectra of the audio source models are parameters of the non-negative matrix factorization model.
  • the at least two terminals comprise at least one of a mobile device, in particular a smartphone or a tablet PC, a dedicated wireless microphone and an audio conference terminal. That is, the audio enhancement system can be easily formed by any conventional mobile devices such as smartphones, tablet PCs, wireless microphones or audio conference terminals.
  • the processing means of the at least two terminals are configured to perform the audio enhancement processing based on non-negative matrix factorization.
  • Non-negative matrix factorization is a mathematical algorithm that is well suited for real-time processing and can be easily implemented in conventional mobile devices.
  • the processing means of a terminal are configured to store the information of the at least one audio source model provided by another terminal in a volatile memory of the terminal.
  • the information of other terminals is only needed when a terminal is active. Therefore, a volatile memory is sufficient for storing the information of audio source models provided by other terminals.
  • the invention relates to a method for enhancing audio processing of a system comprising at least two terminals, each one comprising at least one acoustic sensor, the method comprising: coupling the acoustic sensors of the at least two terminals wirelessly with respect to each other to form an acoustic sensor network; providing by each terminal information of at least one audio source model of the terminal to each other terminal, wherein information of an audio source model of a terminal describes an audio characteristic of at least one audio source impacting on the at least one acoustic sensor of the terminal; and enhancing audio processing based on the information of the audio source models of the at least two terminals.
  • the method comprises selecting one of the at least two terminals as active terminal, wherein the selection is based on a distance of the at least two terminals to a speaker or based on an energy of a signal received by the at least one acoustic sensor of the at least two terminals.
  • When the selection is based on a distance of a terminal to a speaker or on an energy of a received signal, distant acoustic sources contribute less to the model than nearby acoustic sources. This improves the efficiency of the audio enhancement method.
  • the method comprises: identifying an audio source model from the at least one audio source models of the active terminal as being associated to an owner of the active terminal and classified as desired audio source model; and classifying the other audio source models from the at least one audio source models of the active terminal as non-desired audio source models.
  • noise can be effectively separated which improves the audio quality of the method.
  • DSP: Digital Signal Processor
  • ASIC: application-specific integrated circuit
  • the invention can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations thereof, e.g. in available hardware of conventional mobile devices or in new hardware dedicated for processing the audio enhancement system.
  • Fig. 1 shows a schematic diagram of a conventional non-negative Matrix Factorization (NMF) technique used in audio processing;
  • Fig. 2 shows a schematic diagram of an audio enhancement system according to an implementation form;
  • Fig. 3 shows a schematic diagram of a Tablet PC comprising four acoustic sensors according to an implementation form; and
  • Fig. 4 shows a schematic diagram of a method for enhancing audio processing according to an implementation form.
  • Fig. 2 shows a schematic diagram of an audio enhancement system 200 according to an implementation form.
  • The audio enhancement system 200 comprises five terminals 201a, 201b, 201c, 201d, 201e and a conferencing unit, i.e. a main terminal 203, each one comprising at least one acoustic sensor and processing means.
  • The acoustic sensors of the terminals 201a, 201b, 201c, 201d, 201e and 203 are wirelessly coupled with respect to each other, forming an acoustic sensor network.
  • Each terminal provides information of at least one audio source model of the terminal to each other terminal.
  • Information of an audio source model of a terminal 201a describes an audio characteristic of at least one audio source, e.g. a distant speaker 207 or environment noise 209, impacting on the at least one acoustic sensor of the terminal.
  • The processing means of the terminals 201a, 201b, 201c, 201d and 203 perform audio enhancement processing based on the information of the audio source models of the other terminals.
  • the processing means may be implemented on a digital signal processing (DSP) unit of the respective terminal, e.g. as an embedded DSP unit in software or the processing means may be implemented as a hardware circuit of the terminal.
  • Fig. 2 describes the typical conference scenario where several people gather in the same room for an audio conference.
  • The main terminal 203 is e.g. an audio conferencing terminal; the several mobile devices 201a, 201b, 201c, 201d, 201e, i.e. terminals or mobile terminals, are wirelessly connected to the main terminal 203, thereby forming a wireless acoustic sensor network.
  • Each of the mobile terminals 201a, 201b, 201c, 201d and 201e comprises at least one acoustic sensor, e.g. a microphone.
  • each mobile terminal comprises one microphone.
  • In an implementation form, the mobile terminals comprise several microphones 307, e.g. one, two, three or four microphones 307 as described below with respect to Fig. 3, or even more than four. In an implementation form, all microphones are part of the acoustic sensor network. In an implementation form, only one microphone of each mobile device is part of the acoustic sensor network. In an implementation form, more than one microphone of each mobile device is part of the acoustic sensor network. In an implementation form, the acoustic sensor network is formed by the mobile devices 201a, 201b, 201c, 201d and 201e in a decentralized manner, without the need of a central or dedicated main terminal 203.
  • the audio enhancement system 200 establishes a wireless acoustic sensor network based on the communication terminals, e.g. mobile terminals, audio conference terminals, dedicated wireless microphones, etc. All the terminals of the network exchange the known source models. For instance, the speaker models which are known by each terminal of the local network, i.e., the acoustic sensor network are distributed to all other terminals of this network.
  • The source models are determined as described above with respect to Fig. 1. Every mobile device comprises a sufficiently accurate model of the main user.
  • A training sequence is performed by each mobile owner or mobile user. In an implementation form, the mobile terminal comprises a particularly well-adapted source model for the main user which has been trained in quiet conditions and/or on a large database, i.e. continuous model refinement is applied.
  • each terminal comprises several source models, e.g. source models from different users or different noise environments, e.g. office, home, etc.
  • all the models or a subset of the known models are distributed to the other terminals.
  • a priority is set to the most probable models, for instance, it is expected that the owner of a mobile phone takes part in the audio conferencing.
  • The active terminal, i.e. the terminal which is identified as being closer to the talker, uses the received source models to perform NMF-based audio processing, i.e. audio enhancement, in order to reduce or cancel all audio interferences arising from background noise 209, e.g. printers, projectors, etc., or from interfering talkers 207.
  • the active terminal is the dedicated audio conferencing terminal 203.
  • The active terminal is one of the mobile devices 201a, 201b, 201c, 201d, 201e of the local network, i.e. the acoustic sensor network.
  • the active terminal is manually selected.
  • the user of a terminal asks to get the floor and the audio conference moderator gives the floor to the user, which means that the terminal which asks to become the active terminal is then used as main sound recording device.
  • the active terminal is automatically selected by a wireless acoustic sensor network control unit.
  • this wireless acoustic sensor network control unit is arranged in the main audio conference terminal 203.
  • the microphone or microphones with the highest energy are selected as active terminal or terminals. All other terminals are identified as non-active terminals.
  • the non-active terminals are used to adapt the unknown source models, i.e. the speakers without dedicated source model and/or background noise 209 which are not modelled.
  • the non-active terminals continuously update the source models which are regularly synchronized among the terminals.
  • the unknown source models are initialized randomly and updated with each new frame.
  • the active terminal is selected. This active terminal then determines the desired source models, e.g., the source model which is identified as the model of the owner of the terminal.
  • the other source models are used in the NMF processing as non-desired sources, i.e. for noise reduction or source separation.
  • the sources are classified as desired or non-desired sources in the audio mix. In an implementation form, this classification is based on simple information, e.g. describing the owner or user of the terminal. In an implementation form, this classification is based on information describing a proximity to the active terminal. The sources which are close to the active terminal are identified as the desired sources.
  • status information is associated to the source model.
  • this status information comprises known or unknown status of the source model for the terminal.
  • further information about initial/refined status is associated.
  • a single identification code is given to each model, which allows replacing a source model by its refined version without having to increase the number of models.
  • the source model adaptation comprises a merging of several source models, that is, at least two, to improve the source modelling.
  • Each non-active terminal continuously learns new source models, e.g. unknown speakers and/or background noise, in order to refine the complete definition of the source models which are part of the audio mix.
  • new source models are shared among all terminals of the local network and one of the terminals, e.g., the active terminal merges them into a new refined source model.
  • the merging operation is based on a similarity measure which is applied to the received source models for unknown sources.
  • this similarity measure is defined according to the following procedure: The distances between the component spectra W are calculated; clusters of source models are defined, e.g. by using a k-means algorithm; and the source models are combined in a new refined source model based on similar component spectra or cluster.
  • the new source models which are received from each terminal taking part in the network are stored in volatile memory to reduce memory constraints and/or memory space.
  • Implementation forms of the invention provide improved audio processing based on non-negative matrix factorization with a distributed source model learning unit, which allows obtaining a better definition, i.e. source model, of all the audio sources which are part of the audio mix, and a faster adaptation to the introduction of new sound sources.
  • Fig. 3 shows a schematic diagram of a Tablet PC 300 comprising four acoustic sensors according to an implementation form.
  • The Tablet PC 300 comprises four microphones 307, one arranged in the middle of each edge of the tablet, in order to better discriminate the direction of the sounds.
  • the tablet PC 300 may correspond to the mobile device described above with respect to Figure 2.
  • the tablet PC 300 is adapted to perform the audio enhancement processing analogously to the procedure described with respect to Fig. 2.
  • Fig. 4 shows a schematic diagram of a method 400 for enhancing audio processing according to an implementation form.
  • the method 400 is for enhancing the audio processing of a system comprising at least two terminals, each one comprising at least one acoustic sensor.
  • the method 400 comprises coupling 401 the acoustic sensors of the at least two terminals wirelessly with respect to each other to form an acoustic sensor network.
  • the method 400 comprises providing 403 by each terminal information of at least one audio source model of the terminal to each other terminal, wherein information of an audio source model of a terminal describes an audio characteristic of at least one audio source impacting on the at least one acoustic sensor of the terminal.
  • the method 400 comprises enhancing 405 the audio processing based on the information of the audio source models of the at least two terminals.
  • the method 400 includes the following elements and steps which are performed in order to determine enhanced source models:
  • a wireless acoustic sensor network is established based on the communication terminals (mobile terminals, audio conference terminals, dedicated wireless microphones, etc). All the terminals of the network exchange the known source models (for instance the speaker models which are known by each terminal of the local network are distributed to all other terminals of this network).
  • Each mobile device uses a sufficiently accurate model of the main user.
  • each mobile device performs a training procedure.
  • the mobile terminal uses a particularly well-adapted source model for the main user which has been trained in quiet conditions and/or on a large database (continuous model refinement).
  • each terminal includes several source models (from different users or different noise environments: office, home, etc). All the models or a subset of the known models are distributed to other terminals.
  • A priority is set to the most probable models (for instance, it is expected that the owner of a mobile phone takes part in the audio conferencing).
  • the active terminal uses the received source models to perform NMF-based audio processing (audio enhancement) in order to reduce or cancel all audio interferences (background noise or interfering talkers).
  • the active terminal is the dedicated audio conferencing terminal or one of the mobile devices of the local network.
  • the active terminal is manually selected.
  • the user of a terminal asks to get the floor and the audio conference moderator gives the floor to the user, which means that the terminal which asks to become the active terminal is then used as main sound recording device.
  • In an alternative implementation form, the active terminal is automatically selected by the wireless acoustic sensor network control unit, e.g. the main audio conference terminal.
  • The microphone(s) with the highest energy is/are selected as active terminal(s).
  • All other terminals are identified as non-active terminals.
  • the non-active terminals are used to adapt the unknown source models (speakers without dedicated source model and/or background noise which are not modelled).
  • the non-active terminals continuously update the source models which are regularly synchronized among the terminals. The unknown source models are initialized randomly and updated with each new frame.
  • The method includes a step of selecting the active terminal; this active terminal then determines the desired source models (e.g. the source model which is identified as the model of the owner of the terminal).
  • the other source models are used in the NMF processing as non-desired source (noise reduction or source separation).
  • This embodiment includes a further step to classify the sources as desired/non-desired sources in the audio mix. This classification is based on simple information (owner or user of the terminal) or on proximity to the active terminal. The sources which are close to the active terminal are identified as desired sources.
  • status information is associated to the source model.
  • This status information is for instance known/unknown status of the source model for the terminal, and for unknown source models, further information about initial/refined status is also associated.
  • a single identification code is given to each model, which allows replacing a source model by its refined version without having to increase the number of models.
  • the source model adaptation includes a step of merging several source models (at least two) to improve the source modelling.
  • Each non-active terminal continuously learns new source models (unknown speakers and/or background noise) in order to refine the complete definition of the source models being part of the audio mix.
  • New source models are shared among all terminals of the local network and one of the terminals (e.g. the active terminal) merges them into a new refined source model.
  • the merging operation is based on a similarity measure which is applied to the received source models for unknown sources.
  • This similarity measure is defined using the following procedure: Calculating the distance between the component spectra (W); defining a cluster of source models (using for instance a k-means algorithm); and combining the source models based on similar component spectra (cluster) in a new refined source model.
  • the new source models which are received from each terminal taking part in the network are stored in volatile memory to reduce memory constraints and/or space.
  • The method is an NMF-based audio processing for audio enhancement with distributed source model adaptation. One of the possible scenarios for implementing the method is described above with respect to Fig. 2.
  • the present disclosure also supports a computer program product including computer executable code or computer executable instructions that, when executed, causes at least one computer to execute the performing and computing steps described herein.
  • the present disclosure also supports a system configured to execute the performing and computing steps described herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to an audio enhancement system (200), comprising at least two terminals (201a, 201b), each one comprising at least one acoustic sensor and processing means, wherein the acoustic sensors of the at least two terminals (201a, 201b) are wirelessly coupled with respect to each other forming an acoustic sensor network, each terminal (201a) being configured to provide information of at least one audio source model of the terminal (201a) to at least one of the other terminals (201b), wherein information of an audio source model of a terminal (201a) describes an audio characteristic of at least one audio source (205) impacting on the at least one acoustic sensor of the terminal (201a), and wherein the processing means of the at least two terminals (201a, 201b) are configured to perform audio enhancement processing based on the information of the audio source models of the at least two terminals (201a, 201b).

Description

DESCRIPTION
Audio enhancement system
BACKGROUND OF THE INVENTION
The present invention relates to an audio enhancement system comprising at least two wirelessly coupled terminals, in particular smartphones, tablet PCs or audio conferencing systems, and to a method for enhanced audio processing.
A wireless acoustic sensor network is a technology using several spatially distributed microphones that communicate through a wireless connection over short distances. This technology can be used to enhance the audio quality based on the different monophonic representations of the same sound field. For instance, noise reduction methods can be used to cancel the background noise. This is usually efficient for stationary background noise, but the quality is quite limited for non-stationary noise. Indeed, the noise representation is obtained from distant microphones which are not perfectly synchronized. Hence, the noise estimation performed on a distant microphone is efficient for stationary noise, but not for highly changing background noise. The noise reduction algorithm is usually based on spectral subtraction. Such methods can for instance be derived from Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean square error log-spectral amplitude estimator," IEEE Trans. on Acoust., Speech, Signal Processing, vol. ASSP-33, pp. 443-445, Apr. 1985, or Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean square error short-time spectral amplitude estimator," IEEE Trans. on Acoust., Speech, Signal Processing, vol. ASSP-32, pp. 1109-1121, Dec. 1984.
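As an illustration of this class of methods, the sketch below implements plain magnitude spectral subtraction, not the MMSE estimators of the cited papers: the noise spectrum is estimated from a few leading frames that are assumed to be speech-free. All names and parameter values are illustrative.

```python
import numpy as np

def spectral_subtraction(noisy, n_fft=512, hop=256, noise_frames=10, floor=0.05):
    """Minimal spectral-subtraction sketch: estimate a stationary noise
    magnitude spectrum from the first frames (assumed speech-free) and
    subtract it from every frame, resynthesizing with the noisy phase."""
    window = np.hanning(n_fft)
    frames = [noisy[i:i + n_fft] * window
              for i in range(0, len(noisy) - n_fft, hop)]
    spectra = np.array([np.fft.rfft(f) for f in frames])
    mags, phases = np.abs(spectra), np.angle(spectra)

    noise_mag = mags[:noise_frames].mean(axis=0)            # noise estimate
    clean_mag = np.maximum(mags - noise_mag, floor * mags)  # spectral floor

    out = np.zeros(len(noisy))                              # overlap-add
    for k, frame in enumerate(np.fft.irfft(clean_mag * np.exp(1j * phases))):
        out[k * hop:k * hop + n_fft] += frame
    return out
```

As the passage notes, such a stationary noise estimate degrades for highly changing background noise, which is the limitation the invention addresses.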
Microphone array methods can also be used, such as described by J. Benesty, J. Chen, and Y. Huang, "Microphone Array Signal Processing", Springer 2008, but a perfect synchronization of the acoustic sensors is then required in order to take into account the exact delays and energy differences between various monophonic recordings.
Microphone array signal processing is usually based on a known microphone arrangement, which is not the case for a wireless acoustic sensor network where the sensors are arbitrarily placed in space and can move. On the other hand, blind source separation methods have been used for speech enhancement based on a monophonic signal, also called single-channel source separation. Non-negative Matrix Factorization (NMF) methods have been used in that context with relatively good results. The basic principle of NMF-based audio processing 100, as schematically illustrated in Fig. 1, is to find a locally optimal factorization of a short-time magnitude spectrogram V 103 of an audio signal 101 into two factors W and H, of which the first one, W, represents the spectra of the events occurring in the signal 101 and the second one, H, their activation over time. The first factor W describes the component spectra of the source model 109. The second factor H describes the activations 107 of the signal spectrogram 103 of the audio signal 101. The first factor W and the second factor H are matched with the short-time magnitude spectrogram V 103 of the audio signal 101 by an optimization procedure. The source model 109 is pre-defined when applying supervised NMF, whereas a joint estimation is applied for the source model 109 when using unsupervised NMF. The source signal or signals 113 can be derived from the source spectrogram 111.
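For concreteness, the factorization V ≈ WH can be computed with the classical multiplicative update rules (shown here for the Euclidean cost); this is a generic NMF sketch, not necessarily the optimization procedure used in the patent.

```python
import numpy as np

def nmf(V, n_components, n_iter=200, eps=1e-9):
    """Factorize a non-negative magnitude spectrogram V (freq x time) into
    component spectra W (freq x K) and activations H (K x time) using
    multiplicative updates that keep both factors non-negative."""
    rng = np.random.default_rng(0)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, n_components)) + eps
    H = rng.random((n_components, n_time)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update activations H
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update component spectra W
    return W, H
```

A supervised variant keeps W (or part of it) fixed to pre-trained source models and only updates H.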
Non-negative Matrix Factorization (NMF) and its extensions have been successfully used in areas related to speech recognition, including speech de-noising and speaker separation. NMF has usually been used for offline audio processing for noise reduction or source separation based on pre-defined source models. It has recently been extended to on-line processing, where the processing is done with a sliding window in order to process the audio signal frame by frame and achieve real-time processing, as presented in Cyril Joder, Felix Weninger, Florian Eyben, David Virette, Björn Schuller: "Real-time Speech Separation by Semi-Supervised Nonnegative Matrix Factorization", Proc. International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA) 2012, Tel Aviv, Israel, Springer LNCS, Vol. 7191, pp. 323-329, March 12-15, 2012. In supervised NMF, the source models are known in advance; for semi-supervised NMF, only one source model is known, either noise or speaker. Both approaches have been extended to on-line NMF in order to be applied in real-time speech enhancement implementations. It has been shown that supervised NMF performs better than semi-supervised NMF, which is expected, as supervised NMF uses the exact model for all the sources of an audio signal. Supervised NMF usually offers better performance as it is based on known source models which perfectly match the sources which are part of the audio mix. With the same audio mix, semi-supervised NMF provides lower performance in terms of noise reduction or source separation, as only one source is known a priori, i.e. pre-defined. The other source models are estimated and refined over time. The quality improves with the refinement of the noise model.
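A minimal frame-wise sketch of the semi-supervised case follows, assuming a pre-trained speech dictionary and a noise dictionary that is adapted on-line; the update rules and the Wiener-like reconstruction are common choices, not the exact algorithm of the cited paper.

```python
import numpy as np

def semi_supervised_nmf_frame(v, W_speech, W_noise, n_iter=30, eps=1e-9):
    """One on-line step: the speech dictionary W_speech is fixed (the
    known source model), while W_noise and the activations are adapted.
    Returns the enhanced magnitude frame and the updated noise model."""
    W = np.hstack([W_speech, W_noise])
    V = v[:, None]                            # column vector (freq x 1)
    H = np.full((W.shape[1], 1), 0.5)
    k_s = W_speech.shape[1]
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # adapt all activations
        W[:, k_s:] *= (V @ H[k_s:].T) / (W @ H @ H[k_s:].T + eps)  # noise only
    speech = W[:, :k_s] @ H[:k_s]
    mask = speech / (W @ H + eps)             # Wiener-like mask
    return mask[:, 0] * v, W[:, k_s:]
```

Carrying the returned noise dictionary into the next frame gives the sliding-window, frame-by-frame behaviour described above.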
SUMMARY OF THE INVENTION
It is the object of the invention to improve the audio quality of an audio rendering system without the knowledge of an exact source model of the audio signal.
This object is achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
The invention is based on the finding that the audio quality is improved when NMF audio processing with a distributed source model adaptation is applied. The audio quality improvement is achieved when using a wireless acoustic sensor network and a distributed source model adaptation. A wireless acoustic sensor network comprises microphones with a short-distance wireless connection. The source models are trained from each acoustic sensor of the network and shared between all the terminals of the local network. The speech and/or audio enhancement is performed based on a distributed source model learning unit. The wireless acoustic sensor network can advantageously be based on mobile devices and/or audio conference terminals. Source models, e.g. background noise and/or unknown speakers, can be merged in the active terminal, i.e. the terminal being used for the monophonic sound recording. Speech enhancement and/or audio enhancement is advantageously based on Non-negative Matrix Factorization (NMF).
By applying such an audio enhancement system, the audio quality is significantly improved and the audio rendering is enhanced without requiring an exact source model of the audio signal, as will be presented in the following.
In order to describe the invention in detail, the following terms, abbreviations and notations will be used:
audio rendering: a reproduction technique capable of creating spatial sound fields in an extended area by means of loudspeakers or loudspeaker arrays
audio enhancement: a technique for enhancing the quality of an audio signal
NMF: non-negative matrix factorization
WASN: wireless acoustic sensor network
According to a first aspect, the invention relates to an audio enhancement system, comprising at least two terminals, each one comprising at least one acoustic sensor and processing means, wherein the acoustic sensors of the at least two terminals are wirelessly coupled with respect to each other forming an acoustic sensor network, each terminal being configured to provide information of at least one audio source model of the terminal to at least one of the other terminals, wherein information of an audio source model of a terminal describes an audio characteristic of at least one audio source impacting on the at least one acoustic sensor of the terminal, and wherein the processing means of the at least two terminals are configured to perform audio enhancement processing based on the information of the audio source models of the at least two terminals.
By using the acoustic sensor network so designed, the audio quality of audio rendering is improved without exact knowledge of a source model of the audio signal.
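The "information of an audio source model" exchanged between terminals is not restricted to a particular format by the claim; one hypothetical representation, with all field names invented for illustration, could be:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SourceModelInfo:
    """Hypothetical record a terminal shares over the wireless
    acoustic sensor network for one of its audio source models."""
    model_id: str                  # stable id, kept across refinements
    terminal_id: str               # terminal that trained the model
    kind: str                      # e.g. "speaker" or "noise"
    known: bool                    # known vs. unknown source status
    refined: bool                  # initial vs. refined status
    component_spectra: np.ndarray  # NMF factor W (freq x components)
```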
In a first possible implementation form of the audio enhancement system according to the first aspect, each terminal is configured to adapt the at least one audio source model of the terminal based on an audio characteristic of a main speaker impacting on the at least one acoustic sensor of the terminal.
By that distributed source model adaptation, taking into account the audio characteristics of the main speaker, audio quality is enhanced. In a second possible implementation form of the audio enhancement system according to the first aspect, each terminal is configured to adapt the at least one audio source model of the terminal based on audio characteristics of different users and/or different noise environments impacting on the at least one acoustic sensor of the terminal.
By that distributed source model adaptation, taking into account the audio characteristics of different users and/or different noise environments, audio quality is enhanced.
In a third possible implementation form of the audio enhancement system according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, each terminal is configured to use an output signal of the at least one acoustic sensor of the terminal as a training signal for adjusting the at least one audio source model of the terminal. By using an output signal of the at least one acoustic sensor of the terminal as a training signal, speech and/or audio enhancement can be performed as a learning process. The learning can advantageously be based on a learning unit or on mobile devices and/or audio conference terminals. Source models, e.g. background noise and/or unknown speakers, can be merged in the active terminal, thereby improving the audio rendering.
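As one possible reading of this implementation form, the terminal could record its own microphone output, e.g. in quiet conditions, and learn the owner's component spectra with the nmf() helper sketched earlier; the window and rank values are illustrative.

```python
import numpy as np

def train_user_model(training_signal, n_fft=512, hop=256, n_components=16):
    """Learn a speaker source model from the terminal's own acoustic
    sensor output used as training signal (relies on the nmf() sketch
    given further above)."""
    window = np.hanning(n_fft)
    V = np.abs(np.array([np.fft.rfft(training_signal[i:i + n_fft] * window)
                         for i in range(0, len(training_signal) - n_fft, hop)])).T
    W_user, _ = nmf(V, n_components)  # keep only the component spectra
    return W_user
```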
In a fourth possible implementation form of the audio enhancement system according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, each terminal is configured to refine the information of the at least one audio source model of the terminal based on the provided information of the audio source models of the other terminals.
Such a distributed source model adaptation facilitates a distributive learning of the audio enhancement system. Audio quality is enhanced from step to step by refining the information base.
In a fifth possible implementation form of the audio enhancement system according to the fourth implementation form of the first aspect, each terminal is configured to provide the refined information of the at least one audio source model of the terminal to at least one of the other terminals. The audio enhancement system forms a distributive learning system where the different terminals are the multiple nodes of the system. A drop-out of one of the terminals does not result in a drop-out of the whole system.
In a sixth possible implementation form of the audio enhancement system according to the fourth implementation form or according to the fifth implementation form of the first aspect, each terminal is configured to merge the information of the at least one audio source model of the terminal with the provided information of the audio source models of the other terminals to provide the refined information.
By merging information of some or all of the terminals, information processed by the audio enhancement system can be improved thereby optimizing audio quality.
In a seventh possible implementation form of the audio enhancement system according to the sixth implementation form of the first aspect, each terminal is configured to perform the merging based on a similarity measure describing a similarity of the audio source models.
Efficiency of information processing is improved when similar information is processed similarly and different information is processed in a different way. This reduces complexity of the audio enhancement system.
In an eighth possible implementation form of the audio enhancement system according to the seventh implementation form of the first aspect, the similarity measure is based on a distance between component spectra of the audio source models.
A distance between component spectra of the audio source models is easy to process as the component spectra of the audio source models are parameters of the non-negative matrix factorization model.
In a ninth possible implementation form of the audio enhancement system according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the at least two terminals comprise at least one of a mobile device, in particular a smartphone or a tablet PC, a dedicated wireless microphone and an audio conference terminal. That is, the audio enhancement system can be easily formed by any conventional mobile devices such as smartphones, tablet PCs, wireless microphones or audio conference terminals.
In a tenth possible implementation form of the audio enhancement system according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the processing means of the at least two terminals are configured to perform the audio enhancement processing based on non-negative matrix factorization.
Non-negative matrix factorization is a mathematical algorithm that is well suited for real-time processing and can be easily implemented in conventional mobile devices.
In an eleventh possible implementation form of the audio enhancement system according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the processing means of a terminal are configured to store the information of the at least one audio source model provided by another terminal in a volatile memory of the terminal. The information of other terminals is only needed when a terminal is active. Therefore, a volatile memory is sufficient for storing the information of audio source models provided by other terminals. Thus, a low-cost audio enhancement system can be efficiently implemented.
According to a second aspect, the invention relates to a method for enhancing audio processing of a system comprising at least two terminals, each one comprising at least one acoustic sensor, the method comprising: coupling the acoustic sensors of the at least two terminals wirelessly with respect to each other to form an acoustic sensor network; providing by each terminal information of at least one audio source model of the terminal to each other terminal, wherein information of an audio source model of a terminal describes an audio characteristic of at least one audio source impacting on the at least one acoustic sensor of the terminal; and enhancing audio processing based on the information of the audio source models of the at least two terminals. By using the acoustic sensor network so designed, the audio quality of audio rendering is improved without exact knowledge of a source model of the audio signal.
In a first possible implementation form of the method according to the second aspect, the method comprises selecting one of the at least two terminals as active terminal, wherein the selection is based on a distance of the at least two terminals to a speaker or based on an energy of a signal received by the at least one acoustic sensor of the at least two terminals. When the selection is based on a distance of a terminal to a speaker or on an energy of a received signal, distant acoustic sources contribute less to the model than nearby acoustic sources. This improves the efficiency of the audio enhancement method.
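The energy-based variant of this selection is straightforward; the sketch below assumes each terminal reports one current microphone frame and picks the terminal with the highest short-term energy as active.

```python
import numpy as np

def select_active_terminal(frames_by_terminal):
    """Return the id of the terminal whose current frame has the highest
    short-term energy; all other terminals are treated as non-active."""
    energies = {tid: float(np.mean(np.square(frame)))
                for tid, frame in frames_by_terminal.items()}
    return max(energies, key=energies.get)
```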
In a second possible implementation form of the method according to the first implementation form of the second aspect, the method comprises: identifying an audio source model from the at least one audio source model of the active terminal as being associated with an owner of the active terminal and classifying it as the desired audio source model; and classifying the other audio source models from the at least one audio source model of the active terminal as non-desired audio source models.
When an audio source model is identified as that of the owner of the active terminal, noise can be effectively separated, which improves the audio quality of the method.
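Using the hypothetical SourceModelInfo record from above, this classification step could look as follows; matching on a model id known to belong to the owner is an assumption.

```python
def classify_models(models, owner_model_id):
    """Split the active terminal's source models into the desired model,
    i.e. the one associated with the owner, and the non-desired models
    that the NMF processing will treat as interference."""
    desired = [m for m in models if m.model_id == owner_model_id]
    non_desired = [m for m in models if m.model_id != owner_model_id]
    return desired, non_desired
```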
The methods described herein may be implemented as software in a Digital Signal Processor (DSP), in a micro-controller or in any other side-processor or as hardware circuit within an application specific integrated circuit (ASIC).
The invention can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations thereof, e.g. in available hardware of conventional mobile devices or in new hardware dedicated for processing the audio enhancement system.
BRIEF DESCRIPTION OF THE DRAWINGS
Further embodiments of the invention will be described with respect to the following figures, in which:
Fig. 1 shows a schematic diagram of a conventional non-negative Matrix Factorization (NMF) technique used in audio processing;
Fig. 2 shows a schematic diagram of an audio enhancement system according to an implementation form;
Fig. 3 shows a schematic diagram of a Tablet PC comprising four acoustic sensors according to an implementation form; and
Fig. 4 shows a schematic diagram of a method for enhancing audio processing according to an implementation form.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
Fig. 2 shows a schematic diagram of an audio enhancement system 200 according to an implementation form. The audio enhancement system 200 comprises five terminals 201a, 201b, 201c, 201d, 201e and a conferencing unit, i.e. a main terminal 203, each one comprising at least one acoustic sensor and processing means. The acoustic sensors of the terminals 201a, 201b, 201c, 201d, 201e and 203 are wirelessly coupled with respect to each other, forming an acoustic sensor network. Each terminal provides information of at least one audio source model of the terminal to each other terminal. Information of an audio source model of a terminal 201a describes an audio characteristic of at least one audio source, e.g. a distant speaker 207 or environment noise 209, impacting on the at least one acoustic sensor of the terminal. The processing means of the terminals 201a, 201b, 201c, 201d and 203 perform audio enhancement processing based on the information of the audio source models of the other terminals.
For each terminal, the processing means may be implemented on a digital signal processing (DSP) unit of the respective terminal, e.g. as an embedded DSP unit in software, or the processing means may be implemented as a hardware circuit of the terminal. Fig. 2 describes the typical conference scenario where several people gather in the same room for an audio conference. The main terminal 203 is e.g. an audio conferencing terminal; the several mobile devices 201a, 201b, 201c, 201d, 201e, i.e. terminals or mobile terminals, are wirelessly connected to the main terminal 203, thereby forming a wireless acoustic sensor network. Each of the mobile terminals 201a, 201b, 201c, 201d and 201e comprises at least one acoustic sensor, e.g. a microphone. In an implementation form, each mobile terminal comprises one microphone. In an implementation form, the mobile terminals comprise several microphones 307, e.g. one, two, three or four microphones 307 as described below with respect to Fig. 3, or even more than four. In an implementation form, all microphones are part of the acoustic sensor network. In an implementation form, only one microphone of each mobile device is part of the acoustic sensor network. In an implementation form, more than one microphone of each mobile device is part of the acoustic sensor network. In an implementation form, the acoustic sensor network is formed by the mobile devices 201a, 201b, 201c, 201d and 201e in a decentralized manner, without the need of a central or dedicated main terminal 203.
In an implementation form, the audio enhancement system 200 establishes a wireless acoustic sensor network based on the communication terminals, e.g. mobile terminals, audio conference terminals, dedicated wireless microphones, etc. All the terminals of the network exchange the known source models. For instance, the speaker models which are known by each terminal of the local network, i.e. the acoustic sensor network, are distributed to all other terminals of this network. In an implementation form, the source models are determined as described above with respect to Fig. 1. Every mobile device comprises a sufficiently accurate model of the main user. In an implementation form, a training sequence is performed by each mobile owner or mobile user. In an implementation form, the mobile terminal comprises a particularly well-adapted source model for the main user which has been trained in quiet conditions and/or on a large database, i.e. continuous model refinement is applied.
In an alternative implementation form, each terminal comprises several source models, e.g. source models from different users or different noise environments, e.g. office, home, etc. In an implementation form, all the models or a subset of the known models are distributed to the other terminals. In an implementation form, a priority is set to the most probable models, for instance, it is expected that the owner of a mobile phone takes part in the audio conferencing.
The active terminal, i.e. the terminal which is identified as being closer to the talker, uses the received source models to perform NMF-based audio processing, i.e. audio enhancement, in order to reduce or cancel all audio interferences arising from background noise 209, e.g. printers, projectors, etc., or from interfering talkers 207. In the illustration of Fig. 2, the main speaker 205 is closer to its mobile device 201a than an interfering talker 207.
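A frame-wise sketch of this enhancement step follows, with all received dictionaries held fixed (supervised NMF) and the desired source extracted by a Wiener-like mask; as before, this is one plausible realization rather than the patented algorithm itself.

```python
import numpy as np

def enhance_frame(v, W_desired, W_undesired, n_iter=30, eps=1e-9):
    """Supervised NMF enhancement of one magnitude frame v (freq,):
    the dictionaries are the received source models and stay fixed,
    only the activations are estimated; interfering sources are
    suppressed by masking."""
    W = np.hstack([W_desired, W_undesired])
    V = v[:, None]
    H = np.full((W.shape[1], 1), 0.5)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # estimate activations only
    k_d = W_desired.shape[1]
    desired = W[:, :k_d] @ H[:k_d]
    mask = desired / (W @ H + eps)            # Wiener-like mask
    return mask[:, 0] * v
```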
In an implementation form, the active terminal is the dedicated audio conferencing terminal 203. In an alternative implementation form, the active terminal is one of the mobile devices 201a, 201b, 201c, 201d, 201e of the local network, i.e. the acoustic sensor network.
In an implementation form, the active terminal is manually selected. The user of a terminal asks to get the floor and the audio conference moderator gives the floor to the user, which means that the terminal which asked to become the active terminal is then used as the main sound recording device. In an alternative implementation form, the active terminal is automatically selected by a wireless acoustic sensor network control unit. In an implementation form, this wireless acoustic sensor network control unit is arranged in the main audio conference terminal 203. In that case, the microphone or microphones with the highest energy are selected as active terminal or terminals. All other terminals are identified as non-active terminals.
In an alternative implementation form, the non-active terminals are used to adapt the unknown source models, i.e. the speakers without dedicated source model and/or background noise 209 which are not modelled. In this implementation form, the non-active terminals continuously update the source models which are regularly synchronized among the terminals. The unknown source models are initialized randomly and updated with each new frame.
In a further alternative implementation form, the active terminal is selected. This active terminal then determines the desired source models, e.g., the source model which is identified as the model of the owner of the terminal. The other source models are used in the NMF processing as non-desired sources, i.e. for noise reduction or source separation. In this implementation form, the sources are classified as desired or non-desired sources in the audio mix. In an implementation form, this classification is based on simple information, e.g. describing the owner or user of the terminal. In an implementation form, this classification is based on information describing a proximity to the active terminal. The sources which are close to the active terminal are identified as the desired sources.
In order to provide the status of the source models which are shared among the terminals, status information is associated with each source model. In an implementation form, this status information comprises the known or unknown status of the source model for the terminal. In the case of unknown source models, further information about the initial/refined status is also associated. A single identification code is given to each model, which allows replacing a source model by its refined version without having to increase the number of models.
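The stable identification code makes model refinement an in-place replacement; a minimal sketch, assuming the hypothetical SourceModelInfo record above:

```python
def update_model_registry(registry, incoming):
    """Insert or replace a shared source model: because a refined model
    keeps the identification code of its earlier version, it overwrites
    that entry instead of growing the number of models."""
    registry[incoming.model_id] = incoming
    return registry
```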
In a further alternative implementation form, the source model adaptation comprises a merging of several source models, that is, at least two, to improve the source modelling. Each non-active terminal continuously learns new source models, e.g. unknown speakers and/or background noise, in order to refine the complete definition of the source models which are part of the audio mix. In an implementation form, new source models are shared among all terminals of the local network, and one of the terminals, e.g. the active terminal, merges them into a new refined source model.
In an implementation form, the merging operation is based on a similarity measure which is applied to the received source models for unknown sources. In an implementation form, this similarity measure is defined according to the following procedure: The distances between the component spectra W are calculated; clusters of source models are defined, e.g. by using a k-means algorithm; and the source models are combined in a new refined source model based on similar component spectra or cluster.
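One way to realize this merging procedure, assuming similarity between length-normalized component spectra and a small hand-rolled k-means; the cluster count and iteration numbers are illustrative.

```python
import numpy as np

def merge_source_models(W_list, n_clusters=8, n_iter=50, seed=0):
    """Merge several received models of the same unknown source: pool the
    component spectra (columns of each W), cluster similar spectra with
    k-means on length-normalized vectors, and keep one centroid per
    cluster as the refined model. Requires more pooled spectra than
    clusters."""
    spectra = np.hstack(W_list).T                        # one spectrum per row
    spectra /= np.linalg.norm(spectra, axis=1, keepdims=True) + 1e-9
    rng = np.random.default_rng(seed)
    centroids = spectra[rng.choice(len(spectra), n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(spectra[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                    # nearest centroid
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = spectra[labels == k].mean(axis=0)
    return centroids.T                                   # refined W (freq x clusters)
```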
In an implementation form, the new source models which are received from each terminal taking part in the network are stored in volatile memory to reduce memory constraints and/or save memory space. Implementation forms of the invention provide improved audio processing based on non-negative matrix factorization with a distributed source model learning unit, which allows a better definition of the source models of all audio sources that are part of the audio mix and a faster adaptation to the introduction of new sound sources.
Fig. 3 shows a schematic diagram of a Tablet PC 300 comprising four acoustic sensors according to an implementation form. The Tablet PC 300 comprises four microphones 307, one arranged in the middle of each edge of the tablet, in order to better discriminate the direction of incoming sounds. The Tablet PC 300 may correspond to the mobile device described above with respect to Fig. 2. The Tablet PC 300 is adapted to perform the audio enhancement processing analogously to the procedure described with respect to Fig. 2.
Fig. 4 shows a schematic diagram of a method 400 for enhancing audio processing according to an implementation form.
The method 400 is for enhancing the audio processing of a system comprising at least two terminals, each one comprising at least one acoustic sensor. The method 400 comprises coupling 401 the acoustic sensors of the at least two terminals wirelessly with respect to each other to form an acoustic sensor network. The method 400 comprises providing 403 by each terminal information of at least one audio source model of the terminal to each other terminal, wherein information of an audio source model of a terminal describes an audio characteristic of at least one audio source impacting on the at least one acoustic sensor of the terminal. The method 400 comprises enhancing 405 the audio processing based on the information of the audio source models of the at least two terminals.
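A compact sketch of steps 403 and 405 follows; the wireless coupling of step 401 is outside what a few lines of code can show, and the data layout (one model matrix and one current spectrum per terminal) is an assumption.

```python
import numpy as np

def method_400_sketch(models, frames):
    """models: dict terminal id -> W matrix (freq x components)
    frames: dict terminal id -> current magnitude spectrum (freq,)
    Returns the active terminal and the pooled model information it
    would feed into the NMF-based enhancement of step 405."""
    # Step 403: after the exchange, every terminal holds all models.
    shared = dict(models)
    # Step 405 (selection part): pick the active terminal by energy.
    active = max(frames, key=lambda t: float(np.sum(frames[t] ** 2)))
    W_pool = np.concatenate([shared[t] for t in sorted(shared)], axis=1)
    return active, W_pool

rng = np.random.default_rng(3)
models = {t: np.abs(rng.standard_normal((257, 8))) for t in ("a", "b")}
frames = {"a": np.abs(rng.standard_normal(257)),
          "b": 0.1 * np.abs(rng.standard_normal(257))}
active, W_pool = method_400_sketch(models, frames)  # active == 'a'
```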
In one embodiment, the method 400 includes the following elements and steps, which are performed in order to determine enhanced source models: A wireless acoustic sensor network is established based on the communication terminals (mobile terminals, audio conference terminals, dedicated wireless microphones, etc.). All terminals of the network exchange their known source models; for instance, the speaker models known by each terminal of the local network are distributed to all other terminals of this network. Each mobile device uses a sufficiently accurate model of its main user. In an implementation form, each mobile device performs a training procedure. In an implementation form, the mobile terminal uses a particularly well-adapted source model of the main user which has been trained in quiet conditions and/or on a large database (continuous model refinement). Alternatively, each terminal includes several source models, e.g. from different users or different noise environments such as office or home. All models, or a subset of the known models, are distributed to the other terminals. In an implementation form, a priority is set for the most probable models; for instance, the owner of a mobile phone is expected to take part in the audio conference. The active terminal, i.e. the terminal identified as being closest to the talker, uses the received source models to perform NMF-based audio processing (audio enhancement) in order to reduce or cancel all audio interferences, such as background noise or interfering talkers. In an implementation form, the active terminal is the dedicated audio conferencing terminal or one of the mobile devices of the local network.
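The NMF-based enhancement on the active terminal could be realised, for instance, as a semi-supervised decomposition with fixed dictionaries followed by a Wiener-style mask, sketched below; this particular reconstruction is one common realisation and an assumption here, not a requirement of the patent text.

```python
import numpy as np

def nmf_enhance(V, W_desired, W_other, n_iter=50, eps=1e-12):
    """Decompose the magnitude spectrogram V (freq x frames) against
    the fixed, concatenated source models, then keep the desired
    components and suppress the rest with a Wiener-style mask."""
    W = np.concatenate([W_desired, W_other], axis=1)  # fixed dictionaries
    rng = np.random.default_rng(0)
    H = np.abs(rng.standard_normal((W.shape[1], V.shape[1]))) + eps
    for _ in range(n_iter):                           # KL updates on H only
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
    Kd = W_desired.shape[1]
    V_desired = W[:, :Kd] @ H[:Kd]                    # desired-source part
    return V * (V_desired / (W @ H + eps))            # masked magnitudes

rng = np.random.default_rng(5)
V = np.abs(rng.standard_normal((257, 40)))
Wd = np.abs(rng.standard_normal((257, 8)))
Wo = np.abs(rng.standard_normal((257, 16)))
V_hat = nmf_enhance(V, Wd, Wo)   # enhanced magnitude spectrogram
```

In such a realisation, the enhanced magnitudes would then be recombined with the phase of the noisy input and transformed back to the time domain.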
As an optional feature of the method 400, the active terminal is selected manually: the user of a terminal asks for the floor and the audio conference moderator grants it, so that the requesting terminal is then used as the main sound recording device. Alternatively, the active terminal is selected automatically by the wireless acoustic sensor network control unit, e.g. the main audio conference terminal. In that case, the microphone or microphones with the highest energy are selected as active terminal or terminals; all other terminals are identified as non-active terminals. In an alternative embodiment, the non-active terminals are used to adapt the unknown source models, i.e. the models of speakers without a dedicated source model and/or of background noise that is not yet modelled. In this embodiment, the non-active terminals continuously update the source models, which are regularly synchronized among the terminals. The unknown source models are initialized randomly and updated with each new frame.
In a further alternative embodiment, the following steps are also included:
As in the previous embodiment, the method includes a step of selecting the active terminal. This active terminal then determines the desired source models, e.g. the source model which is identified as the model of the owner of the terminal. The other source models are used in the NMF processing as non-desired sources (noise reduction or source separation). This embodiment includes a further step of classifying the sources in the audio mix as desired or non-desired. This classification is based on simple information (owner or user of the terminal) or on proximity to the active terminal, the sources close to the active terminal being identified as desired sources.
In order to provide the status of the source models which are shared among the terminals, status information is associated with each source model. This status information comprises, for instance, the known/unknown status of the source model for the terminal; for unknown source models, further information about the initial/refined status is also associated. A unique identification code is given to each model, which allows a source model to be replaced by its refined version without increasing the number of models.
In a further alternative embodiment, the source model adaptation includes a step of merging several source models (at least two) to improve the source modelling. Each non-active terminal continuously learns new source models (unknown speakers and/or background noise) in order to refine the complete definition of the source models that are part of the audio mix. New source models are shared among all terminals of the local network and one of the terminals (e.g. the active terminal) merges them into a new refined source model.
In an implementation form, the merging operation is based on a similarity measure which is applied to the received source models for unknown sources. This similarity measure is defined using the following procedure: calculating the distances between the component spectra W; defining clusters of source models, for instance using a k-means algorithm; and combining the source models with similar component spectra, i.e. within a cluster, into a new refined source model.
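As a complement to the clustering sketch given earlier, the first step, computing the distances between the pooled component spectra, might look as follows; cosine distance is used here as one plausible metric, the patent leaving the choice open.

```python
import numpy as np

def spectra_distance_matrix(models):
    """Pairwise cosine distances between all component spectra pooled
    from the received models (columns of each W matrix)."""
    S = np.concatenate(models, axis=1).T              # (n_spectra, freq)
    S /= np.linalg.norm(S, axis=1, keepdims=True) + 1e-12
    return 1.0 - S @ S.T                              # zero on the diagonal

rng = np.random.default_rng(4)
D = spectra_distance_matrix([np.abs(rng.standard_normal((257, 8)))
                             for _ in range(2)])      # (16, 16) matrix
```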
In an implementation form, the new source models which are received from each terminal taking part in the network are stored in volatile memory to reduce memory constraints and/or save memory space. In an implementation form, the method is an NMF-based audio processing for audio enhancement with distributed source model adaptation. One possible scenario for implementing the method is described above with respect to Fig. 2.
From the foregoing, it will be apparent to those skilled in the art that a variety of methods, systems, computer programs on recording media, and the like are provided. The present disclosure also supports a computer program product including computer-executable code or computer-executable instructions that, when executed, cause at least one computer to execute the performing and computing steps described herein.
The present disclosure also supports a system configured to execute the performing and computing steps described herein.
Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art readily recognize that there are numerous applications of the invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art recognize that many changes may be made thereto without departing from the spirit and scope of the present invention. It is therefore to be understood that, within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.

Claims

1. Audio enhancement system (200), comprising at least two terminals (201a, 201b), each one comprising at least one acoustic sensor and processing means, wherein the acoustic sensors of the at least two terminals (201a, 201b) are wirelessly coupled with respect to each other forming an acoustic sensor network, each terminal (201a) being configured to provide information of at least one audio source model of the terminal (201a) to at least one of the other terminals (201b), wherein information of an audio source model of a terminal (201a) describes an audio characteristic of at least one audio source (205) impacting on the at least one acoustic sensor of the terminal (201a), and wherein the processing means of the at least two terminals (201a, 201b) are configured to perform audio enhancement processing based on the information of the audio source models of the at least two terminals (201a, 201b).
2. The audio enhancement system (200) of claim 1, wherein each terminal (201a) is configured to adapt the at least one audio source model of the terminal (201a) based on an audio characteristic of a main speaker (205) impacting on the at least one acoustic sensor of the terminal (201a).
3. The audio enhancement system (200) of claim 1, wherein each terminal (201a) is configured to adapt the at least one audio source model of the terminal (201a) based on audio characteristics of different users (205, 207) and/or different noise environments (209) impacting on the at least one acoustic sensor of the terminal (201a).
4. The audio enhancement system (200) of one of the preceding claims, wherein each terminal (201a) is configured to use an output signal of the at least one acoustic sensor of the terminal (201a) as training signal for adjusting the at least one audio source model of the terminal (201a).
5. The audio enhancement system (200) of one of the preceding claims, wherein each terminal (201a) is configured to refine the information of the at least one audio source model of the terminal (201a) based on the provided information of the audio source models of the other terminals (201b).
6. The audio enhancement system (200) of claim 5, wherein each terminal (201a) is configured to provide the refined information of the at least one audio source model of the terminal (201a) to at least one of the other terminals (201b).
7. The audio enhancement system (200) of claim 5 or claim 6, wherein each terminal (201a) is configured to merge the information of the at least one audio source model of the terminal (201a) with the provided information of the audio source models of the other terminals (201b) to provide the refined information.
8. The audio enhancement system (200) of claim 7, wherein each terminal (201a) is configured to perform the merging based on a similarity measure describing a similarity of the audio source models.
9. The audio enhancement system (200) of claim 8, wherein the similarity measure is based on a distance between component spectra of the audio source models.
10. The audio enhancement system (200) of one of the preceding claims, wherein the at least two terminals (201a, 201b) comprise at least one of a mobile device, in particular a smartphone or a tablet PC, a dedicated wireless microphone and an audio conference terminal.
11. The audio enhancement system (200) of one of the preceding claims, wherein the processing means of the at least two terminals (201a, 201b) are configured to perform the audio enhancement processing based on non-negative matrix factorization.
12. The audio enhancement system (200) of one of the preceding claims, wherein the processing means of a terminal (201a) are configured to store the information of the at least one audio source model provided by another terminal (201b) in a volatile memory of the terminal (201a).
13. Method (400) for enhancing audio processing of a system comprising at least two terminals, each one comprising at least one acoustic sensor, the method comprising: coupling (401) the acoustic sensors of the at least two terminals wirelessly with respect to each other to form an acoustic sensor network; providing (403) by each terminal information of at least one audio source model of the terminal to each other terminal, wherein information of an audio source model of a terminal describes an audio characteristic of at least one audio source impacting on the at least one acoustic sensor of the terminal; and enhancing (405) audio processing based on the information of the audio source models of the at least two terminals.
14. The method (400) of claim 13, comprising: selecting one of the at least two terminals as active terminal, wherein the selection is based on a distance of the at least two terminals to a speaker or based on an energy of a signal received by the at least one acoustic sensor of the at least two terminals.
15. The method (400) of claim 14, comprising: identifying an audio source model from the at least one audio source models of the active terminal as being associated to an owner of the active terminal and classified as desired audio source model; and classifying the other audio source models from the at least one audio source models of the active terminal as non-desired audio source models.
PCT/EP2012/071305 2012-10-26 2012-10-26 Audio enhancement system WO2014063754A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2012/071305 WO2014063754A1 (en) 2012-10-26 2012-10-26 Audio enhancement system
EP12781070.3A EP2888736B1 (en) 2012-10-26 2012-10-26 Audio enhancement system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2012/071305 WO2014063754A1 (en) 2012-10-26 2012-10-26 Audio enhancement system

Publications (1)

Publication Number Publication Date
WO2014063754A1 (en) 2014-05-01

Family

ID=47137691

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2012/071305 WO2014063754A1 (en) 2012-10-26 2012-10-26 Audio enhancement system

Country Status (2)

Country Link
EP (1) EP2888736B1 (en)
WO (1) WO2014063754A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10983320B2 (en) 2013-11-25 2021-04-20 European Molecular Biology Laboratory Optical arrangement for imaging a sample

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8295762B1 (en) * 2011-05-20 2012-10-23 Google Inc. Distributed blind source separation

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
ALEXANDER BERTRAND ET AL: "Distributed Adaptive Estimation of Node-Specific Signals in Wireless Sensor Networks With a Tree Topology", IEEE TRANSACTIONS ON SIGNAL PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 59, no. 5, 1 May 2011 (2011-05-01), pages 2196 - 2210, XP011353096, ISSN: 1053-587X, DOI: 10.1109/TSP.2011.2108290 *
CYRIL JODER; FELIX WENINGER; FLORIAN EYBEN; DAVID VIRETTE; BJORN SCHULLER: "Proc. International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA) 2012, Tel Aviv, Israel", vol. 7191, 12 March 2012, SPRINGER LNCS, article "Real-time Speech Separation by Semi-Supervised Nonnegative Matrix Factorization", pages: 323 - 329
DOCLO S ET AL: "Reduced-Bandwidth and Distributed MWF-Based Noise Reduction Algorithms for Binaural Hearing Aids", IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, USA, vol. 17, no. 1, 1 January 2009 (2009-01-01), pages 38 - 51, XP011241217, ISSN: 1558-7916, DOI: 10.1109/TASL.2008.2004291 *
HEUSDENS R. ET AL.: "DISTRIBUTED MVDR BEAMFORMING FOR (WIRELESS) MICROPHONE NETWORKS USING MESSAGE PASSING", INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT 2012, AACHEN, GERMANY, 4 September 2012 (2012-09-04), XP002711598 *
J. BENESTY; J. CHEN; Y. HUANG: "Microphone Array Signal Processing", 2008, SPRINGER
OZEROV A ET AL: "Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation", IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, USA, vol. 18, no. 3, 1 March 2010 (2010-03-01), pages 550 - 563, XP011282775, ISSN: 1558-7916, DOI: 10.1109/TASL.2009.2031510 *
Y. EPHRAIM; D. MALAH: "Speech enhancement using a minimum mean square error log-spectral amplitude estimator", IEEE TRANS. ON ACOUST., SPEECH, SIGNAL PROCESSING, vol. ASSP-33, April 1985 (1985-04-01), pages 443 - 445, XP000931203, DOI: doi:10.1109/TASSP.1985.1164550
Y. EPHRAIM; D. MALAH: "Speech enhancement using a minimum mean square error short-time spectral amplitude estimator", IEEE TRANS. ON ACOUST., SPEECH, SIGNAL PROCESSING, vol. ASSP-32, December 1984 (1984-12-01), pages 1109 - 1121, XP002435684, DOI: doi:10.1109/TASSP.1984.1164453
YUSUKE HIOKA ET AL: "Distributed blind source separation with an application to audio signals", ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2011 IEEE INTERNATIONAL CONFERENCE ON, IEEE, 22 May 2011 (2011-05-22), pages 233 - 236, XP032000717, ISBN: 978-1-4577-0538-0, DOI: 10.1109/ICASSP.2011.5946383 *

Also Published As

Publication number Publication date
EP2888736A1 (en) 2015-07-01
EP2888736B1 (en) 2019-06-26

Similar Documents

Publication Publication Date Title
US10522167B1 (en) Multichannel noise cancellation using deep neural network masking
CN107135443B (en) Signal processing method and electronic equipment
US11694710B2 (en) Multi-stream target-speech detection and channel fusion
US9978388B2 (en) Systems and methods for restoration of speech components
US10123113B2 (en) Selective audio source enhancement
CN111418012B (en) Method for processing an audio signal and audio processing device
EP4004906A1 (en) Per-epoch data augmentation for training acoustic models
Chen et al. Visual acoustic matching
JP7498560B2 (en) Systems and methods
US12003673B2 (en) Acoustic echo cancellation control for distributed audio devices
US11264017B2 (en) Robust speaker localization in presence of strong noise interference systems and methods
WO2022206602A1 (en) Speech wakeup method and apparatus, and storage medium and system
WO2022253003A1 (en) Speech enhancement method and related device
CN111863020A (en) Voice signal processing method, device, equipment and storage medium
CN115482830A (en) Speech enhancement method and related equipment
US20220335937A1 (en) Acoustic zoning with distributed microphones
US20170206898A1 (en) Systems and methods for assisting automatic speech recognition
WO2014063754A1 (en) Audio enhancement system
Motlicek et al. Real‐Time Audio‐Visual Analysis for Multiperson Videoconferencing
Tran et al. Automatic adaptive speech separation using beamformer-output-ratio for voice activity classification
Zaken et al. Neural-Network-Based Direction-of-Arrival Estimation for Reverberant Speech-the Importance of Energetic, Temporal and Spatial Information
Nguyen et al. A two-step system for sound event localization and detection
Kim et al. DNN-based Parameter Estimation for MVDR Beamforming and Post-filtering
CN115910047B (en) Data processing method, model training method, keyword detection method and equipment
Giacobello An online expectation-maximization algorithm for tracking acoustic sources in multi-microphone devices during music playback

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12781070

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2012781070

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE