EP4315328A1 - Estimation d'un masque optimise pour le traitement de donnees sonores acquises - Google Patents

Estimation d'un masque optimise pour le traitement de donnees sonores acquises

Info

Publication number
EP4315328A1
Authority
EP
European Patent Office
Prior art keywords
sound data
time
sound
mask
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP22714494.6A
Other languages
German (de)
English (en)
French (fr)
Inventor
Alexandre Guerin
Henrique TOMAZ-AMORIM
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Orange SA
Original Assignee
Orange SA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Orange SA filed Critical Orange SA
Publication of EP4315328A1 publication Critical patent/EP4315328A1/fr
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166Microphone arrays; Beamforming

Definitions

  • This description relates to the processing of sound data, in particular in the context of distant sound recording.
  • Far-field sound recording occurs, for example, when a speaker is far from sound recording equipment.
  • it offers advantages in terms of real ergonomic comfort: the user can interact "hands-free" with a service in use, for example to make a phone call or to issue voice commands via "smartspeaker"-type equipment (Google Home®, Amazon Echo®, etc.).
  • however, this distant sound recording induces certain artefacts: reverberation and surrounding noises appear amplified due to the remoteness of the user. These artefacts degrade the intelligibility of the speaker's voice and, consequently, the functioning of the services: communication becomes more difficult, whether with a human or with a voice recognition engine.
  • hands-free terminals such as smartspeakers or teleconferencing "octopuses" are generally equipped with a microphone antenna which makes it possible to enhance the useful signal by reducing these disturbances.
  • Antenna-based enhancement exploits the spatial information encoded during the multi-channel recording and specific to each source to discriminate the signal of interest from other noise sources.
  • a mask, denoted M_s(t,f) (respectively M_n(t,f)), is defined as a real number, usually in the range [0; 1], such that an estimate of the signal of interest s(t,f) (respectively of the noise n(t,f)) is obtained by simple multiplication of this mask with the observations x(t,f), i.e.: ŝ(t,f) = M_s(t,f)·x(t,f) (respectively n̂(t,f) = M_n(t,f)·x(t,f)).
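  • As a purely illustrative sketch (not taken from the patent text), this masking operation can be written directly on STFT coefficients; the array names and shapes below are hypothetical:

    import numpy as np

    def apply_mask(M_s: np.ndarray, X: np.ndarray) -> np.ndarray:
        # Pointwise masking of the observations: s_hat(t,f) = M_s(t,f) * x(t,f),
        # with M_s real-valued in [0, 1] and the same shape as the STFT X.
        return M_s * X

    # The noise estimate uses the complementary mask: n_hat = (1 - M_s) * X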
  • a method for processing sound data acquired by a plurality of microphones is proposed, in which:
  • a direction of arrival of a sound coming from at least one acoustic source of interest is determined,
  • a first spatial filtering, as a function of this direction of arrival, is applied to the acquired sound data,
  • at each point of a time-frequency domain, ratios are estimated between quantities representative of the amplitude of the signal represented by the filtered sound data and of the signal represented by the acquired sound data, and
  • from these ratios, a weighting mask is developed to be applied in the time-frequency domain to the sound data acquired, to construct an acoustic signal representing the sound coming from the source of interest and enhanced with respect to ambient noise.
  • the term "representative quantity" of a signal amplitude is understood here to mean the amplitude of the signal but also its energy or else its power, etc.
  • the aforementioned ratios can be estimated by dividing the amplitude (or energy, or power, etc.) of the signal represented by the filtered sound data by the amplitude (or energy, or power, etc.) of the signal represented by the acquired (thus raw) sound data.
  • the weighting mask thus obtained is then representative, at each time-frequency point of the time-frequency domain, of a degree of preponderance of the acoustic source of interest, with respect to ambient noise.
  • the weighting mask can be estimated to directly build an acoustic signal representing the sound coming from the source of interest and enhanced with respect to ambient noise, or else to calculate second spatial filters, which can be more effective and reduce the noise more strongly than in the aforementioned case of direct construction.
  • the aforementioned first spatial filtering (applied to the data acquired before estimating the ratios) can be of the “Delay and Sum” type.
  • the complex amplitude of the signals reflects the phase shifts inherent in the distances between microphones.
  • this first spatial filtering can be of the MPDR type (for “Minimum Power Distortionless Response”). It has the advantage of better reducing the surrounding noise, while keeping the useful signal intact, and does not require any information other than the direction of arrival.
  • the spatial filtering of the MPDR type, denoted w_MPDR, is given by a relation of the kind:

    w_MPDR(t,f) = R_x^{-1}(t,f) a_s(f) / (a_s^H(f) R_x^{-1}(t,f) a_s(f))

    where:
    - a_s represents a vector defining the direction of arrival of the sound (or “steering vector”),
    - R_x is a spatial covariance matrix estimated at each time-frequency point (t,f) by a relation of the kind:

    R_x(t,f) = (1 / card(Φ(t,f))) Σ_{(t1,f1) ∈ Φ(t,f)} x(t1,f1) x(t1,f1)^H

    where:
    - Φ(t,f) is a neighborhood of the time-frequency point (t,f),
    - card is the “cardinal” operator,
    - x(t1,f1) is a vector representing the sound data acquired in the time-frequency domain, and
    - x(t1,f1)^H is its Hermitian conjugate.
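  • As a minimal numerical sketch of such an MPDR filter (per-bin processing and a small diagonal loading are assumptions, not details taken from the text):

    import numpy as np

    def mpdr_weights(X: np.ndarray, a_s: np.ndarray, diag_load: float = 1e-6) -> np.ndarray:
        # X: (n_mics, n_frames) STFT frames at one frequency bin, i.e. the
        # neighborhood Phi(t,f); a_s: steering vector (n_mics,) for that bin.
        n_mics, n_frames = X.shape
        R_x = (X @ X.conj().T) / n_frames          # spatial covariance R_x(t,f)
        R_x += diag_load * np.eye(n_mics)          # regularization (assumption)
        R_inv_a = np.linalg.solve(R_x, a_s)
        return R_inv_a / (a_s.conj() @ R_inv_a)    # w = R^{-1} a / (a^H R^{-1} a)

    # Enhanced frames at this bin: y = mpdr_weights(X, a_s).conj() @ X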
  • the method may optionally include a subsequent step of refining the weighting mask to denoise its estimate.
  • the estimate can be denoised by smoothing, for example by applying heuristically defined local averages.
  • this estimate can be denoised by defining an a priori mask distribution model.
  • the first approach makes it possible to keep complexity low, while the second approach, based on a model, obtains better performance, at the cost of increased complexity.
  • the elaborated weighting mask can be further refined by smoothing at each time-frequency point, by applying a local statistical operator calculated on a time-frequency neighborhood of the time-frequency point (t,f) considered.
  • This operator can take the form of an average, a Gaussian filter, a median filter, or other.
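  • A possible sketch of this refinement with off-the-shelf local operators (the (3, 3) neighborhood size and the Gaussian sigma are assumptions):

    import numpy as np
    from scipy.ndimage import gaussian_filter, median_filter, uniform_filter

    def refine_mask(M_s: np.ndarray, operator: str = "mean") -> np.ndarray:
        # M_s: real-valued mask of shape (n_freqs, n_frames).
        if operator == "median":
            return median_filter(M_s, size=(3, 3))
        if operator == "gaussian":
            return gaussian_filter(M_s, sigma=1.0)
        return uniform_filter(M_s, size=(3, 3))  # local average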
  • the elaborated weighting mask can also be refined by smoothing at each time-frequency point, by applying a probabilistic approach comprising:
  • the mask can be considered as a uniform random variable in an interval [0,1].
  • the probabilistic estimator of the mask M_s(t,f) can for example be representative of a maximum likelihood, over a plurality of observations of a pair of variables, representing respectively: - an acoustic signal resulting from the application of the weighting mask to the acquired sound data, and - a signal represented by a particular channel of the acquired sound data.
  • the purpose of these two embodiments is thus to refine the mask after its estimation.
  • the mask obtained (optionally refined) can be applied directly to the acquired data (raw, picked up by the microphones) or used to construct a second spatial filter to be applied to these acquired data.
  • the construction of the acoustic signal representing the sound coming from the source of interest and enhanced with respect to ambient noise may involve the application of a second spatial filtering, obtained from the weighting mask.
  • the second spatial filtering can be of the MWF type (for “Multichannel Wiener Filter”). In this case, spatial covariance matrices R_s and R_n are estimated, respectively of the acoustic signal representing the sound coming from the source of interest and of the ambient noise, for example by a relation of the kind:

    R_s(t,f) = (1 / card(Φ(t,f))) Σ_{(t1,f1) ∈ Φ(t,f)} M_s(t1,f1) x(t1,f1) x(t1,f1)^H

    where:
    - Φ(t,f) is a neighborhood of a time-frequency point (t,f),
    - card is the “cardinal” operator,
    - x(t1,f1) is a vector representing the sound data acquired in the time-frequency domain, and x(t1,f1)^H its Hermitian conjugate, and
    - M_s(t1,f1) is the expression of the weighting mask in the time-frequency domain.
  • the spatial covariance matrix R n above represents the “ambient noise”.
  • the latter may in reality comprise emissions from sound sources which have not, however, been retained as being the sound source of interest. Separate processing operations can be carried out for each source for which a direction of arrival has been detected (for example dynamically) and, in the processing for a given source, the emissions from the other sources are considered to form part of the noise.
  • the spatial filtering carried out, of the MWF type for example, can thus be derived from the mask estimated at the most advantageous time-frequency points, namely those where the acoustic source of interest is preponderant. It should also be noted that two joint optimizations can be carried out, one for the covariance R_s of the acoustic signal involving the sought-after time-frequency mask M_s, and the other for the covariance R_n of the ambient noise involving a mask M_n linked to the noise (by then selecting time-frequency points at which the noise alone is preponderant).
  • the solution described above thus makes it possible, in general, to estimate in a time-frequency domain an optimal mask in the time-frequency points where the source of interest is preponderant, from the sole information of direction of arrival of the source of interest, without neural network input (either to apply the mask directly to the acquired data, or to construct a second spatial filtering to be applied to the acquired data).
  • This description also proposes a computer program comprising instructions for the implementation of all or part of a method as defined herein when this program is executed by a processor.
  • a non-transitory, computer-readable recording medium on which such a program is recorded is provided.
  • the present description also proposes a device comprising (as illustrated in FIG. 3) at least one interface for receiving (IN) sound data acquired by a plurality of microphones (MIC) and a processing circuit (PROC, MEM) configured to:
    - from the sound data acquired by the plurality of microphones, determine a direction of arrival of a sound from at least one acoustic source of interest,
    - apply to the acquired sound data a first spatial filtering, as a function of this direction of arrival,
    - estimate, at each point of a time-frequency domain, ratios between quantities representative of the amplitudes of the filtered and of the acquired sound data, and
    - develop from these ratios a weighting mask to be applied in the time-frequency domain to the acquired sound data, in order to construct an acoustic signal representing the sound coming from the source of interest and enhanced with respect to ambient noise.
  • the device may also comprise an output interface (reference OUT in FIG. 3) to deliver this acoustic signal.
  • this interface OUT can be connected, for example, to a voice recognition module in order to correctly interpret commands from a user despite ambient noise, the acoustic signal delivered having then been processed according to the method presented above.
  • FIG. 1 schematically shows a possible context for implementing the method presented above.
  • FIG. 2 illustrates a succession of steps that a method within the meaning of the present description may comprise, according to a particular embodiment.
  • FIG. 3 schematically shows an example of a sound data processing device according to one embodiment.
  • the processing circuit of the device DIS presented previously may typically comprise a memory MEM able to store in particular the instructions of the aforementioned computer program, as well as a processor PROC able to cooperate with the memory MEM to execute the computer program.
  • the output interface OUT can supply a voice recognition module MOD of a personal assistant capable of identifying, in the aforementioned acoustic signal, a voice command from a user UT who, as illustrated in FIG. 1, can utter a voice command picked up by a microphone antenna MIC, in particular in the presence of ambient noise and/or sound reverberation REV generated, for example, by the walls and/or partitions of the room in which the user UT is located.
  • the processing of the acquired sound data within the meaning of the present description, detailed below, nevertheless makes it possible to overcome such difficulties.
  • an example of an overall method within the meaning of the present description is illustrated in FIG. 2.
  • the method begins with a first step S1 of acquiring the sound data picked up by the microphones. Then, a time-frequency transform of the acquired signals is performed in step S3, after an apodization carried out in step S2.
  • the direction of arrival of the sound from the source of interest (DoA) can then be estimated in step S4, giving in particular the vector a_s(f) of this direction of arrival (or “steering vector”).
  • in step S5, a first spatial filtering is applied to the sound data acquired by the microphones, for example in the time-frequency space, according to the direction of arrival DoA.
  • the first spatial filtering can be of the Delay and Sum or MPDR type and it is “centered” on the DoA.
  • if the filter is of the MPDR type, the acquired data expressed in the time-frequency domain are used, in addition to the DoA, to build the filter (arrow illustrated in dotted lines in FIG. 2 for this purpose).
  • in step S6, amplitude (or energy, or power) ratios are estimated between the filtered acquired data and the raw acquired data (denoted by x(t,f) in the time-frequency domain).
  • this estimation of the ratios in the time-frequency domain makes it possible to construct, at step S7, a first approximate form of the weighting mask which already favors the DoA, because the aforementioned ratios reach high levels mainly in the direction of arrival DoA.
  • it is then possible to provide a subsequent, optional step S8 consisting in smoothing this first mask in order to refine it.
  • in step S9, it is also possible to generate a second spatial filtering from this refined mask. This second filtering can then be applied in the time-frequency domain to the acquired sound data in order to generate, in step S10, an acoustic signal substantially devoid of noise, which can then be interpreted properly by a voice recognition module or the like.
  • the signals picked up by the microphones can be gathered into a vector; this vector is called the “observation” or the “mixture”.
  • the signals may be the signals picked up directly by the microphones of the antenna, or a combination of these microphone signals as in the case of an antenna collecting the signals according to a representation in ambiophonic format (also called “ambisonic”).
  • in step S3, the different quantities (signals, covariance matrices, masks, filters) are expressed in a time-frequency domain as follows: x(t,f) = F{x}(t,f), where F{·} is, for example, the short-term Fourier transform of size L.
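  • For illustration, step S3 can be realized with an off-the-shelf STFT; the sampling rate, the Hann apodization window and the size L = 1024 below are assumptions, not values taken from the text:

    import numpy as np
    from scipy.signal import stft

    fs = 16000
    x = np.random.randn(8, fs)  # hypothetical 8-microphone, 1-second capture
    f, t, X = stft(x, fs=fs, window="hann", nperseg=1024)
    # X has shape (8, n_freqs, n_frames): the observations x(t,f), one STFT per channel.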
  • Enhancement filters can be defined according to the information available. They can then be used for the deduction of the mask in the time-frequency domain.
  • for a source s at a given position, we denote a_s the column vector which points in the direction of this source (the direction of arrival of the sound), a vector called the “steering vector”.
  • the steering vector of a plane wave of incidence θ with respect to the antenna is defined in step S4 in the frequency domain; for a linear antenna of N sensors with inter-microphone spacing d, it takes the form:

    a_s(f) = [1, e^{-2iπ f d sin(θ)/c}, ..., e^{-2iπ f (N-1) d sin(θ)/c}]^T

    where c is the speed of sound in air. The first channel here corresponds to the last sensor encountered by the sound wave. This steering vector then gives the direction of arrival of the sound, or “DOA”.
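  • A sketch of this steering vector for a uniform linear antenna (the spacing d, the speed of sound and the phase-sign convention are assumptions):

    import numpy as np

    def steering_vector(f_hz: float, theta_rad: float, n_mics: int,
                        d: float = 0.05, c: float = 343.0) -> np.ndarray:
        # Per-sensor delay of the plane wave along the antenna, first sensor as reference.
        tau = np.arange(n_mics) * d * np.sin(theta_rad) / c
        return np.exp(-2j * np.pi * f_hz * tau)  # a_s(f)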
  • the steering vector can also be given by another relationship, for example in the case of a spherical antenna with an ambisonic representation.
  • from the mere knowledge of the direction of arrival of a sound source (or DOA), it is possible in step S5 to define a filter of the delay-and-sum (DS) type which points in the direction of this source, as follows:

    w_DS(f) = a_s(f) / (a_s^H(f) a_s(f))

    where (·)^H is the transpose-conjugate operator of a matrix or a vector.
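  • A corresponding delay-and-sum sketch (per-bin processing is assumed):

    import numpy as np

    def delay_and_sum(X_f: np.ndarray, a_s: np.ndarray) -> np.ndarray:
        # X_f: (n_mics, n_frames) at one frequency bin; w_DS = a_s / (a_s^H a_s).
        w_ds = a_s / (a_s.conj() @ a_s)
        return w_ds.conj() @ X_f  # phase-aligned average of the channels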
  • ⁇ (t, f) is a more or less wide neighborhood around the time-frequency point (t,f), and card is the “cardinal” operator.
  • the noise spatial covariance matrix can be calculated in the same way as that of the useful signal, and more particularly in the form:

    R_n(t,f) = (1 / card(Φ(t,f))) Σ_{(t1,f1) ∈ Φ(t,f)} M_n(t1,f1) x(t1,f1) x(t1,f1)^H

    where M_n is the time-frequency mask linked to the noise.
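  • A sketch of this mask-weighted covariance estimation (the array conventions are assumptions):

    import numpy as np

    def masked_covariance(X_f: np.ndarray, m: np.ndarray) -> np.ndarray:
        # X_f: (n_mics, n_frames) at one bin; m: mask values (n_frames,) over the
        # neighborhood Phi(t,f). R = (1 / card) * sum_t m(t) x(t) x(t)^H.
        return (X_f * m) @ X_f.conj().T / m.size

    # R_s uses the source mask M_s; R_n uses the noise mask M_n.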
  • the aim here is to estimate these time-frequency masks M_s(t,f) and M_n(t,f).
  • the direction of arrival of the sound (or “DOA”, obtained in step S4), coming from the useful source s at time t, denoted doa_s(t), is considered to be known.
  • This DOA can be estimated by a localization algorithm such as “SRP-phat” ([@diBiaseSRPPhat]), and tracked by a tracking algorithm such as a Kalman filter for example. It can be composed of a single component as in the case of a linear antenna, or of azimuth and elevation components ( ⁇ , ⁇ ) in the case of a spherical antenna of the ambisonic type for example.
  • in step S5, an enhancement filter w_s which points in the direction of the useful source is applied. This filter can be of the Delay and Sum type or, as hereinafter, of the MPDR type w_MPDR presented above.
  • this enhanced signal ŝ makes it possible to calculate a preliminary mask at step S7, given by the ratios of step S6:

    M_s(t,f) = |ŝ(t,f)|^γ / |x_ref(t,f)|^γ

    where x_ref is a reference channel resulting from the capture, and γ a positive real; γ typically takes integer values (for example 1 for the amplitude or 2 for the energy). It should be noted that when γ → ∞, the mask tends towards the binary mask indicating the preponderance of the source with respect to the noise.
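  • A sketch of this preliminary mask (γ = 2, i.e. an energy ratio, and the eps guard are assumptions):

    import numpy as np

    def preliminary_mask(S_hat: np.ndarray, X_ref: np.ndarray,
                         gamma: float = 2.0, eps: float = 1e-12) -> np.ndarray:
        # Ratio of the enhanced signal to a reference channel, raised to the power gamma.
        M = np.abs(S_hat) ** gamma / (np.abs(X_ref) ** gamma + eps)
        return np.clip(M, 0.0, 1.0)  # keep the mask in the range [0, 1]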
  • x_ref can for example be the first channel, which is the omnidirectional channel in an ambisonic representation; more generally, this may be the signal corresponding to any sensor.
  • ideally, this mask corresponds to an expression which has the desired behavior, namely close to 1 when the signal s is preponderant, and close to 0 when the noise is preponderant.
  • the enhanced signal, although already in a better condition than the raw acquired signals, may still contain noise and can be improved by a mask estimation refinement processing (step S8).
  • the mask refinement step S8 is described below. Although this step is advantageous, it is in no way essential, and can be carried out optionally, for example if the mask estimated for the filtering in step S7 turns out to be noisy beyond a chosen threshold. To limit the noise of the mask, a soft(.) smoothing function is applied in step S8.
  • the application of this smoothing function can amount to estimating a local average, at each time-frequency point, for example as follows:

    soft(M_s)(t,f) = (1 / card(V(t,f))) Σ_{(t1,f1) ∈ V(t,f)} M_s(t1,f1)

    where V(t,f) defines a neighborhood of the considered time-frequency point (t,f).
  • u_th is a threshold to be adjusted according to the desired level.
  • R is a random variable defined from the ratio between:
    - ŝ, the enhanced signal (i.e. filtered by an MPDR or DS enhancement filter), and
    - x, a particular channel of the mix.
  • M̂_s follows a normal distribution, with a zero mean and a variance which depends on M_s.
  • the mask can be calculated using probabilistic estimators.
  • we denote M̂_s(t,f) the estimator of the mask M_s(t,f) in the sense of maximum likelihood.
  • - card is the “cardinal” operator, x(t1,f1) is a vector representing the sound data acquired in the time-frequency domain, x(t1,f1)^H its Hermitian conjugate, and M_s(t1,f1) is the expression of the weighting mask in the time-frequency domain.
  • the MWF-type spatial filtering is then given, in its usual form, by:

    w_MWF(t,f) = (R_s(t,f) + R_n(t,f))^{-1} R_s(t,f) u_ref

    where u_ref selects a reference channel.
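  • A sketch of this MWF construction, under the usual-form assumption above:

    import numpy as np

    def mwf_weights(R_s: np.ndarray, R_n: np.ndarray, ref: int = 0) -> np.ndarray:
        # W = (R_s + R_n)^{-1} R_s; column `ref` is the filter for a reference channel.
        W = np.linalg.solve(R_s + R_n, R_s)
        return W[:, ref]

    # At each time-frequency point: s_hat(t,f) = w^H x(t,f)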
  • step S10 then delivers an acoustic signal representing the sound coming from the source of interest and enhanced with respect to the ambient noise (typically via the output interface OUT of the device illustrated in FIG. 3).
  • the present technical solutions can find application in particular in the enhancement of speech by complex filters, for example of the MWF type ([@laurelineLSTM], [@amelieUnet]), ensuring good listening quality and a high automatic speech recognition rate, without the need for a neural network.
  • the approach can be used for the detection of keywords or "wake-up words" or even the transcription of a speech signal.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
EP22714494.6A 2021-04-01 2022-03-18 Estimation d'un masque optimise pour le traitement de donnees sonores acquises Pending EP4315328A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR2103400A FR3121542A1 (fr) 2021-04-01 2021-04-01 Estimation d’un masque optimisé pour le traitement de données sonores acquises
PCT/FR2022/050495 WO2022207994A1 (fr) 2021-04-01 2022-03-18 Estimation d'un masque optimise pour le traitement de donnees sonores acquises

Publications (1)

Publication Number Publication Date
EP4315328A1 true EP4315328A1 (fr) 2024-02-07

Family

ID=75850368

Family Applications (1)

Application Number Title Priority Date Filing Date
EP22714494.6A Pending EP4315328A1 (fr) 2021-04-01 2022-03-18 Estimation d'un masque optimise pour le traitement de donnees sonores acquises

Country Status (4)

Country Link
EP (1) EP4315328A1 (zh)
CN (1) CN117121104A (zh)
FR (1) FR3121542A1 (zh)
WO (1) WO2022207994A1 (zh)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9747922B2 (en) * 2014-09-19 2017-08-29 Hyundai Motor Company Sound signal processing method, and sound signal processing apparatus and vehicle equipped with the apparatus
US10522167B1 (en) * 2018-02-13 2019-12-31 Amazon Technologies, Inc. Multichannel noise cancellation using deep neural network masking
CN110503972B (zh) * 2019-08-26 2022-04-19 北京大学深圳研究生院 语音增强方法、系统、计算机设备及存储介质
US11373668B2 (en) * 2019-09-17 2022-06-28 Bose Corporation Enhancement of audio from remote audio sources

Also Published As

Publication number Publication date
CN117121104A (zh) 2023-11-24
WO2022207994A1 (fr) 2022-10-06
FR3121542A1 (fr) 2022-10-07

Similar Documents

Publication Publication Date Title
JP6480644B1 (ja) マルチチャネル音声認識のための適応的オーディオ強化
EP1356461B1 (fr) Procede et dispositif de reduction de bruit
WO2020108614A1 (zh) 音频识别方法、定位目标音频的方法、装置和设备
EP2680262B1 (fr) Procédé de débruitage d'un signal acoustique pour un dispositif audio multi-microphone opérant dans un milieu bruité
EP1154405B1 (fr) Procédé et dispositif de reconnaissance vocale dans des environnements a niveau de bruit fluctuant
EP3807669B1 (fr) Localisation de sources sonores dans un environnement acoustique donné
Xiao et al. Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation
EP2772916B1 (fr) Procédé de débruitage d'un signal audio par un algorithme à gain spectral variable à dureté modulable dynamiquement
JP2015524076A (ja) 源信号分離のためのシステム及び方法
EP2774147A1 (en) Audio signal noise attenuation
EP4046390A1 (fr) Localisation perfectionnee d'une source acoustique
WO2015011078A1 (fr) Procédé de suppression de la réverbération tardive d'un signal sonore
EP4315328A1 (fr) Estimation d'un masque optimise pour le traitement de donnees sonores acquises
CN116403594A (zh) 基于噪声更新因子的语音增强方法和装置
EP3627510A1 (fr) Filtrage d'un signal sonore acquis par un systeme de reconnaissance vocale
US20230116052A1 (en) Array geometry agnostic multi-channel personalized speech enhancement
EP4248231A1 (fr) Localisation perfectionnée d'une source acoustique
US20240212701A1 (en) Estimating an optimized mask for processing acquired sound data
WO2020049263A1 (fr) Dispositif de rehaussement de la parole par implementation d'un reseau de neurones dans le domaine temporel
US20230368766A1 (en) Temporal alignment of signals using attention
Sharma et al. Development of a speech separation system using frequency domain blind source separation technique
Chen et al. Early Reflections Based Speech Enhancement
WO2024126242A1 (fr) Obtention d'une réponse impulsionnelle d'une salle
WO2023219751A1 (en) Temporal alignment of signals using attention
Bai et al. Deep Learning Applied to Dereverberation and Sound Event Classification in Reverberant Environments

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20231027

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR