WO2019003131A1 - Audio signal digital processing method and system thereof - Google Patents

Audio signal digital processing method and system thereof Download PDF

Info

Publication number
WO2019003131A1
WO2019003131A1 (PCT/IB2018/054744)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
punctual
environment
source
speech
Prior art date
Application number
PCT/IB2018/054744
Other languages
French (fr)
Inventor
Ahmadi MEHRNOOSH
Vincenzo Randazzo
Vito PIRRONE
Stefano CORNETTO
Martina STRAZZACAPA
Maria Sole TEBERINO
Maryia ZARETSKAYA
Original Assignee
Politecnico Di Torino
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Politecnico Di Torino filed Critical Politecnico Di Torino
Publication of WO2019003131A1 publication Critical patent/WO2019003131A1/en

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/06Decision making techniques; Pattern matching strategies
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L2021/02087Noise filtering the noise being separate speech, e.g. cocktail party
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00Details of transducers, loudspeakers or microphones
    • H04R1/10Earpieces; Attachments therefor ; Earphones; Monophonic headphones
    • H04R1/1083Reduction of ambient noise

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An audio digital processing method comprising the steps of providing a list of allowed punctual audio sources, including a speech or human punctual audio source; acquiring in an environment a mixed simultaneous speech audio signal from one or more environment punctual audio sources, including a speech or human environment punctual audio source, via a microphone unit; processing the mixed simultaneous speech audio signal in order to: separate audio tracks of each environment punctual audio source via a Blind Source Separation algorithm; and recognize tracks via Source Recognition algorithm; reproducing via a wearable (in-ear, on-ear or around-the-ear) ear-speaker, each speech audio track of the human environment punctual audio source matching those of the list of allowed punctual audio sources.

Description

Audio signal digital processing method and system thereof
DESCRIPTION
TECHNICAL FIELD
The present invention relates to an audio signal processing method and system thereof, in particular for adoption in a working environment such as an open space or a working environment for white collars or the like.
STATE OF THE ART
It is known to divide an open space by cubicles to define a working station for a white collar. A cubicle does not provide an effective audio absorption within the open space. The same applies to other audio absorption techniques, including those providing audio absorbing panels or the like within the open space. The current solutions cause a certain level of noise pollution due to people interaction (chatting, phone calls, etc.) or to the background. This causes loss of concentration, increased stress and consequently a decreased productivity for the workers. Furthermore, such noise affects the quality of vocal communication with other people because in a noisy environment it is often more difficult and more stressful to talk with colleagues.
It is also known to provide passive wearable noise-cancellation devices such as ear plugs.
As well, it is known to provide active noise cancellation devices to reduce unwanted audio (noise) by generating a second audio specifically designed to interfere and cancel the unwanted audio.
SCOPES AND BRIEF DESCRIPTION OF THE INVENTION
The scope of the present invention is to process environment audio, in particular within an open space or another working or delimited area, to eliminate unwanted audio in order to increase concentration of workers within the open space or the working area.
The scope of the present invention is achieved by an audio processing method and system able to separate the audio signals of each punctual audio source from a mixed audio signal of an open space or a working area, to separate a number of punctual audio sources, to recognize speakers, and to reproduce, via a wearable in-ear, on-ear, or around-the-ear speaker, only those audio tracks belonging to the environment audio and matching a list of allowed punctual audio sources, including human punctual audio sources.
In this manner, a group of workers can include each other in the list of allowed punctual audio sources and hold an in person meeting in a crowded working environment and the relevant voices will be emphasized via the ear speaker with respect to voices from non-allowed audio sources. Indeed, during such a meeting, the wearable speaker of each participant will reproduce the voices of other participants and will not reproduce other audios or noises.
Other features and advantages of the present invention are discussed in the description and cited in the dependent claims.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention will be described below on the basis of non-limiting examples shown for explanation purposes in the annexed drawing, which refers to a sketch of an audio signal digital processing system according to the present invention.
DETAILED DESCRIPTION OF THE INVENTION
The Figure shows, as a whole, an environment, in particular a working environment, where users, in particular co-workers, are talking simultaneously.
A user A and a user B are in a meeting and a user C is talking via a mobile telephone to a non-illustrated speaker and is not interacting with users A, B. In such a condition, speeches by users A, B are disturbed by the speech of user C and vice-versa. The environment also comprises an alarm box AB including an alarm audio signal emitter (not shown), for example a fire alarm box, and a network of antennas N for wireless exchange of data at a suitable bandwidth, i.e. the higher the number of antennas per unit of area of the environment, the higher the bandwidth for data exchange. Antennas N are, for example, Wi-Fi® antennas or Bluetooth® antennas. It is important to note that increasing the bandwidth for data exchange also involves gain, transmit/receive power and computing power.
Each audio source cited above generates during normal activity in the environment a relative audio track or channel which mixes with other audio tracks in an environment mixed audio signal including simultaneous speech of, e.g., users A, B, C.
A method according to the invention comprises the step of providing a list of one or more allowed audio sources, including a human or speech punctual audio source. Such a step may be implemented by the users via a user interface, such as a smartphone or a desktop, in order to input the users admitted to the meeting.
Preferably, the list of allowed audio sources is generated or completed based on data retrieved from a calendar and time management system. In particular, participants in a meeting are invited and confirmed via the calendar and time management system and the list of allowed sources is generated, preferably automatically, including the invited and/or confirmed participants stored in the calendar and time management system. It is possible to generate a new list of allowed sound sources for each scheduled meeting or to modify the previous list via the user interface.
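By way of a purely illustrative example, the following sketch shows how such a list could be assembled from the confirmed participants of a calendar entry; the record structure, the helper function and the default inclusion of a fire alarm source are assumptions, not features prescribed by any particular calendar and time management system.

```python
# Illustrative sketch (not the patent's procedure): build the list of allowed
# punctual audio sources from confirmed meeting participants. The calendar
# record layout and the helper name are hypothetical.
from datetime import datetime

meeting = {
    "title": "Sprint review",
    "start": datetime(2018, 6, 27, 10, 0),
    "participants": [
        {"name": "user_A", "status": "confirmed"},
        {"name": "user_B", "status": "confirmed"},
        {"name": "user_C", "status": "declined"},
    ],
}

def allowed_sources(meeting, default_sources=("fire_alarm",)):
    """Return the identifiers of allowed punctual audio sources for a meeting."""
    allowed = {p["name"] for p in meeting["participants"] if p["status"] == "confirmed"}
    allowed.update(default_sources)   # alarms are allowed by default
    return allowed

print(allowed_sources(meeting))       # {'user_A', 'user_B', 'fire_alarm'}
```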
Participants in the meeting are located in an open space office or similar environment comprising a room shared with other co-workers not participating in the meeting. Antenna network N covers the space of the environment in order to provide suitable bandwidth for the exchange of data.
In a further step, a mixed simultaneous speech audio signal from the environment where users A, B, C are talking is acquired via a microphone unit M. The microphone unit comprises one or more microphones located within the environment. The microphones may have fixed locations within the environment or may be portable microphones, such as those provided in headphones and/or in portable intelligent devices such as smartphones, tablets (normally defined as personal portable intelligent devices) and laptop PCs. The position and coverage of the microphone unit delimits the area, such as the open space, to which the method is applied.
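A minimal acquisition sketch is given below, assuming a two-channel microphone unit is accessible from a computer; the python-sounddevice package and the frame parameters are illustrative choices, not requirements of the method.

```python
# Capture a block of mixed environment audio and compute its complex STFT.
# Sampling rate, duration and frame/hop sizes are assumptions for illustration.
import numpy as np
import sounddevice as sd
from scipy.signal import stft

fs = 16_000            # sampling rate in Hz
duration = 5.0         # seconds of mixed environment audio to capture
n_channels = 2

mix = sd.rec(int(duration * fs), samplerate=fs, channels=n_channels)
sd.wait()              # block until the recording is finished

# Complex STFT of each channel: shape (channels, freq_bins, time_frames)
f, t, X = stft(mix.T, fs=fs, nperseg=512, noverlap=384)
print(X.shape)
```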
The mixed simultaneous speech audio signal is processed by a digital processing unit D in order to isolate the audio track from each audio source, via a Blind Source Separation (BSS) algorithm, such as the Parra-Spence algorithm. For instance, a BSS algorithm is based on multichannel techniques and on Time Difference Of Arrival (TDOA) techniques.
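The following sketch illustrates a TDOA estimate between two microphone channels using the GCC-PHAT method, a common building block for multichannel separation; it is given purely as an illustrative assumption and is not the Parra-Spence algorithm itself.

```python
# GCC-PHAT sketch (an assumption, not the patent's specific BSS algorithm):
# estimate the time difference of arrival of a signal between two microphones.
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, interp=1):
    """Estimate the delay (in seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=interp * n)
    max_shift = interp * n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)

# Example: recover a 1 ms delay between two otherwise identical noise signals
fs = 16_000
x = np.random.randn(fs)
delay = 16                                  # samples, i.e. 1 ms at 16 kHz
y = np.concatenate((np.zeros(delay), x[:-delay]))
print(gcc_phat_tdoa(y, x, fs))              # ~0.001 s
```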
Digital processing unit D is either a single processor unit or a multi-processor architecture, including a cloud computing multi-processor unit in order to reduce data processing time.
Digital processing unit D also recognizes the punctual audio sources and the tracks by such punctual audio sources, including the audio tracks of users A, B, C. This is performed for example via a source recognition algorithm.
It is possible to perform the separation of each track belonging to a single punctual audio source and, afterwards, the recognition of each track in order to assign each track to the pertinent punctual audio source. For example, a mixed track comprising voices of users A and B is first processed to isolate two tracks and, afterwards, digital processing unit D recognizes which track belongs to user A and which track belongs to user B.
In order to save time and reduce latency of the method, it is preferable to include a step of providing a library of training audio data of punctual audio sources. This is for example implemented by processing a single speech track of each co-worker and assigning the resulting data to that co-worker. Such data in the library work as respective dictionaries during separation and recognition.
According to a non-limiting embodiment of the present invention, the library also comprises data from non-human or non-speech punctual audio sources, such as alarms, door bells, etc., that are either fixedly located within the delimited environment, such as alarm box AB, or that may enter and exit the delimited environment, as human users do. Ring tones of mobile phones can also be considered.
An example of processing of audio tracks in order to extract suitable dictionary data for the library is the provision of a biometric audio fingerprint by vector/matrix quantization or factorization of the training audio tracks, wherein each punctual audio source is processed in order to generate an identification vector/matrix, i.e. the training data, included in the library. During separation and/or recognition, digital processing unit D then processes the mixed environment audio signal to identify the identification vectors/matrices of the library. In order to increase reliability of the method, it is preferable that the library includes training data from a high number of different punctual audio sources related to the delimited environment, such as workers and fixed sound sources of the open space. In particular, when the method is implemented in a working environment, training data of all co-workers and other non-human punctual sound sources present in the working environment, i.e. the open space, are processed and added to the library, including data processed from audio tracks of alarms, e.g. the fire alarm from alarm box AB. When the method is applied in a delimited environment, such as an open space, it is preferable that tracks of the majority or all fixed punctual sound sources within the environment are processed in order to extract training data to be added to the library.
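A possible realization of such a library, given here only as a sketch, learns one non-negative dictionary per enrolled source from a clean training track using Itakura-Saito NMF; the frame sizes, the number of basis vectors and the use of scikit-learn are assumptions, not requirements of the method.

```python
# Library-building sketch: each enrolled punctual audio source gets an
# "identification matrix", here an NMF dictionary learned from its training track.
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import NMF

def learn_dictionary(track, fs=16_000, n_basis=20):
    """Return a (freq_bins x n_basis) non-negative dictionary for one source."""
    _, _, Z = stft(track, fs=fs, nperseg=512, noverlap=384)
    X = np.abs(Z) ** 2                       # power spectrogram, F x N
    model = NMF(n_components=n_basis, solver="mu",
                beta_loss="itakura-saito", init="random",
                max_iter=400, random_state=0)
    # sklearn factorizes samples x features, so work on X.T: X.T ~ W @ H
    model.fit(X.T + 1e-10)                   # small offset: IS is undefined at 0
    H = model.components_                    # K x F
    return H.T                               # F x K, columns are basis vectors

library = {}
for name in ("user_A", "user_B", "fire_alarm"):
    training_track = np.random.rand(5 * 16_000)   # placeholder for a real recording
    library[name] = learn_dictionary(training_track)
```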
In order to reduce processing time and latency, it is preferable to base both the separation via the BSS algorithm and the recognition via the Source Recognition algorithm on the training data of the library; in particular, punctual audio sources are characterized during separation of the tracks via the dictionaries collected in the library.
The above is preferably implemented by a Non-negative Matrix Factorization (NMF) algorithm. Details are explained below and, additionally, in other paragraphs from J. Zegers, H. Van Hamme, 'Joint Audio Source Separation and Speaker Recognition', April 29th, 2016, arXiv:1604.08852v1:
Non-negative matrix factorization is a factorization method that approximates a non-negative matrix $X \in \mathbb{R}_{\geq 0}^{F \times N}$ using a non-negative dictionary matrix $T \in \mathbb{R}_{\geq 0}^{F \times K}$ and a non-negative activation matrix $V \in \mathbb{R}_{\geq 0}^{K \times N}$ such that $X \approx \hat{X} = TV$. In our application $X = |x|^2$ is a speech power spectrogram, with $x$ the complex valued short time Fourier transform (STFT) of the audio signal, $|\cdot|$ the absolute value and $\cdot^2$ the element-wise square. X is a matrix with F frequency bins and N time frames. NMF tries to capture the most frequent patterns of the speech in K F-dimensional basis vectors that form a dictionary T for the speech. The matrix V contains the coefficients of the linear combination and thus indicates how the kth basis vector is activated in the nth time frame. Usually $K < \min(F, N)$, such that NMF is a rank reduction operation. A discrepancy measure is chosen between the original $X$ and the reconstruction $\hat{X}$ and can be minimized by finding optimal dictionaries and activations. The Euclidian (EU) distance, the Kullback-Leibler (KL) divergence and the Itakura-Saito (IS) divergence are well known measures. In this paper the IS divergence will be used:
$$D_{IS}(X \mid \hat{X}) = \sum_{f,n} \left( \frac{X_{fn}}{\hat{X}_{fn}} - \log\frac{X_{fn}}{\hat{X}_{fn}} - 1 \right) \qquad (1)$$
To minimize this divergence, multiplicative iterative update formulas with convergence guarantees have been derived [14]:
$$T_{fk} \leftarrow T_{fk}\,\frac{\sum_{n} V_{kn}\, X_{fn}\, \hat{X}_{fn}^{-2}}{\sum_{n} V_{kn}\, \hat{X}_{fn}^{-1}} \qquad (2)$$
$$V_{kn} \leftarrow V_{kn}\,\frac{\sum_{f} T_{fk}\, X_{fn}\, \hat{X}_{fn}^{-2}}{\sum_{f} T_{fk}\, \hat{X}_{fn}^{-1}} \qquad (3)$$
where the sub-indices refer to the corresponding element in the matrix. To avoid scaling ambiguities the columns of T are to be normalized. The use of NMF in SR applications for single speech is straight-forward. In the training phase, training data $X_j$ of the jth target speaker is factorized using equations 2 and 3. The obtained dictionaries $T_j$, one for each of the J target speakers, are assumed to be speaker dependent and are collected in the library $T_{lib} = [T_1, \ldots, T_J]$.
During testing the identity of a speaker s has to be found in a previously unseen test utterance. NMF is applied with a fixed library $T_{lib}$ and the activations $V_{test}$ are found iteratively using equation 3. The activation matrix quantitatively indicates the activation of each basis vector for each target speaker in each time frame. The combined activity of all basis vectors in a target speaker's dictionary is a measure of the activity of the target speaker in the test segment. It is possible to include Group Sparsity (GS-NMF) constraints on the activations $V_{test}$ to enforce solutions where it is unlikely that basis vectors from different target speakers are active at the same time frame [15], [16].
A simple way of estimating the speaker identity is to determine the target speaker for which the sum of the activations, over all its basis vectors and over all the time frames, is maximal. This way of classification can be seen as a per frame speaker probability estimation where the final estimation is a weighted average over all frames, giving more weight to frames with higher activation or more energy.
$$\hat{\jmath} = \arg\max_{j} \sum_{k \in \mathcal{K}_j} \sum_{n} V_{kn} \qquad (4)$$
where $\mathcal{K}_j$ are the indices of the basis vectors belonging to the dictionary of the jth target speaker. It is possible to perform a more advanced classification of the activations to a speaker identity by using, for example, support vector machines.
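The recognition step described above can be sketched as follows, under the same assumptions as the library example: the per-source dictionaries are stacked and kept fixed, only the activations are updated with the Itakura-Saito multiplicative rule of equation 3, and the source whose basis vectors accumulate the largest total activation is selected as in equation 4.

```python
# Recognition sketch: IS-NMF activation updates with a fixed, stacked dictionary,
# followed by the per-source activation sum of equation 4.
import numpy as np

def recognize(X, library, n_iter=200, eps=1e-10):
    """X: power spectrogram (F x N). library: dict name -> dictionary (F x K_j)."""
    names = list(library)
    T = np.hstack([library[n] for n in names])          # F x K_total, kept fixed
    blocks = np.cumsum([library[n].shape[1] for n in names])[:-1]
    V = np.random.rand(T.shape[1], X.shape[1]) + eps    # K_total x N
    X = X + eps
    for _ in range(n_iter):                              # IS-NMF update of V only
        Xhat = T @ V + eps
        V *= (T.T @ (X * Xhat ** -2)) / (T.T @ Xhat ** -1 + eps)
    scores = [v.sum() for v in np.split(V, blocks, axis=0)]
    return names[int(np.argmax(scores))], dict(zip(names, scores))

# e.g. name, scores = recognize(np.abs(Z_mix) ** 2, library)
# where Z_mix is a (hypothetical) complex STFT of the mixed environment signal.
```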
Aside from SR applications, NMF has also shown good results in source separation problems. When the different speakers are learned on single speech training data, the procedure is very similar to that of SR. However, in SS, the test data contains speech of multiple sources that speak simultaneously. The task is not to determine the speaker identity, but the original source signal of each speaker.
After learning $T_{lib}$ in the training phase, the activations $V_{test}$ are calculated in the same way as in the previous paragraphs. Using Wiener filtering and the phase of the observations, the original source signal can be estimated [8]:
$$\hat{y}_{s,fn} = \frac{\sum_{k \in \mathcal{K}_s} T_{fk} V_{kn}}{\sum_{k} T_{fk} V_{kn}}\, |x_{fn}|\, e^{i\angle(x_{fn})} \qquad (5)$$
where $\mathcal{K}_s$ are the indices of the basis vectors belonging to the dictionary of the sth speaker and $\angle(x_{fn})$ denotes the phase of $x_{fn}$.
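A corresponding separation sketch, under the same assumptions, reconstructs one source from the complex mixture STFT with a Wiener-like soft mask built from the NMF factors, as in equation 5.

```python
# Separation sketch: soft-mask (Wiener-like) reconstruction of one source,
# keeping the phase of the mixture observation.
import numpy as np
from scipy.signal import istft

def separate_source(Z_mix, T, V, idx, fs=16_000):
    """Z_mix: complex STFT (F x N); T, V: full NMF factors; idx: basis columns of one source."""
    Xhat_all = T @ V + 1e-10                 # model of the full power spectrogram
    Xhat_src = T[:, idx] @ V[idx, :]         # contribution of the selected source
    mask = Xhat_src / Xhat_all               # Wiener-like gain in [0, 1]
    Z_src = mask * Z_mix                     # apply mask, keep the mixture phase
    _, y = istft(Z_src, fs=fs, nperseg=512, noverlap=384)
    return y
```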
In many situations, however, there is no possibility for supervised source separation. In blind source separation (BSS) no training data is available to learn the library $T_{lib}$. Instead the library will be created during the separation itself. Usually one resorts to multichannel techniques in such cases, where Time Difference Of Arrival (TDOA) techniques can be used to assist the source separation. The mixing matrix $M$ is assumed static and thus independent of n.
$$x_{ifn} = \sum_{s=1}^{S} m_{isf}\, y_{sfn}, \qquad i = 1, \ldots, I \qquad (6)$$
where I is the number of microphones and $m_{isf}$ indicates the frequency domain representation of the room impulse response (RIR) between the sth speaker and the ith microphone for the fth frequency bin. $x_{ifn}$ is the received STFT spectrogram in the ith microphone and $y_{sfn}$ is the STFT spectrogram of the original signal of the sth speaker. Because of the scaling ambiguity in equation 6 only the relative RIRs between the microphones can be estimated. The notation for the combined microphone signals is as follows:
$$\mathbf{x}_{fn} = [x_{1fn}, \ldots, x_{Ifn}]^{T} \qquad (7)$$
Sawada et al. proposed a multichannel IS divergence [9].
$$D_{IS}\left(\mathbf{X} \mid \{T, V, H, Z\}\right) = \sum_{f,n} \left( \operatorname{tr}\left(\hat{\mathbf{X}}_{fn}^{-1}\mathbf{X}_{fn}\right) - \log\det\left(\hat{\mathbf{X}}_{fn}^{-1}\mathbf{X}_{fn}\right) - I \right), \qquad \hat{\mathbf{X}}_{fn} = \sum_{s} H_{fs} \sum_{k} z_{ks}\, T_{fk} V_{kn} \qquad (8)$$
where tr(.) is the trace of a matrix, logdet(.) is the natural logarithm of the determinant of a matrix, $\mathbf{X}_{fn} = \mathbf{x}_{fn}\mathbf{x}_{fn}^{H}$ with $.^{H}$ the Hermitian transpose of a matrix. The same interpretations are given to $T_{fk}$ and $V_{kn}$ as in single channel NMF. $z_{ks}$ is a latent speaker-indicator that indicates the certainty that the kth basis vector belongs to the dictionary of the sth speaker, under the constraints $z_{ks} \geq 0$ and $\sum_{s} z_{ks} = 1$. $H_{fs}$ is a $I \times I$ Hermitian positive semi-definite matrix with on its diagonal the power gain of the sth speaker at the fth frequency bin to each microphone. The off-diagonal elements include the phase differences between microphones and thus contain spatial information of the speaker. Multiplicative update formulas have been found in [9, eq. 42-47] that minimize the divergence in equation 8. The separated signals are then obtained through Wiener filtering:
$$\hat{\mathbf{y}}_{s,fn} = \left( H_{fs} \sum_{k} z_{ks}\, T_{fk} V_{kn} \right) \hat{\mathbf{X}}_{fn}^{-1}\, \mathbf{x}_{fn} \qquad (9)$$
When performing speaker recognition in simultaneous speech scenarios, one could opt for a sequential approach: first apply blind source separation to obtain multiple, supposedly single speech, segments from simultaneous speech, then proceed as if those segments do not contain any crosstalk and apply single speaker recognition as explained in the previous paragraphs. However, in this paper a joint approach is chosen, where speakers are characterized through dictionaries while separating the sources. During training, source separation is performed as explained in the previous paragraphs. The kth basis vector is then assigned to the sth speaker for which $z_{ks}$ is maximal. The dictionaries are collected in the library $T_{lib}$. While testing, a similar source separation algorithm is used but $T_{lib}$ remains fixed. Since every basis vector is contained in a dictionary, the meaning of the Z variable is changed: it now maps a complete dictionary j, and its corresponding target speaker identity, to a test speaker s. A new indicator variable, denoted $b_{kj}$ in the following, is introduced that assigns a basis vector k to a dictionary j if $b_{kj} = 1$, under the constraints $b_{kj} \in \{0, 1\}$ and $\sum_{j} b_{kj} = 1$.
The model $\hat{\mathbf{X}}_{fn}$ is then reformulated as follows:
$$\hat{\mathbf{X}}_{fn} = \sum_{s} H_{fs} \sum_{j} z_{js} \sum_{k} b_{kj}\, T_{fk} V_{kn} \qquad (10)$$
It can be easily shown that the corresponding multiplicative update formulas (equations 11-14 of the cited paper) extend [9, eq. 42-47].
To update $H_{fs}$, an algebraic Riccati equation is solved (equation 15 of the cited paper).
In this equation, $H_{fs}$ on the right-hand side takes the value from the previous update. To avoid scale ambiguity, the dictionaries, activations and spatial matrices are normalized after each update. In the test phase a basis vector is kept fixed to a dictionary: $b_{kj} = 1$ only if the kth basis vector belongs to the jth dictionary, otherwise $b_{kj} = 0$. Using equation 14 and the normalization, one can see that the corresponding values are then fixed for the whole iterative process.
The estimated ID for speaker s is the j for which $z_{js}$ is maximal:
$$\widehat{ID}_s = \arg\max_{j} z_{js} \qquad (16)$$
Through $b_{kj}$ and $z_{js}$ the speaker recognition can be interpreted as assigning the jth dictionary, and thus the target speaker identity of the jth dictionary, to the position of the sth speaker in the test utterance. Notice that S, the number of speakers in the test mixture, can be equal to or smaller than J, the number of speakers in the library: not all known speakers must appear in the test mixture.
The discussion above is limited to speaker recognition, i.e. human or speech punctual audio sources, but it is possible to extend the example to any punctual audio source, including non-human audio sources and/or non-speech audio sources, in order to monitor and control a delimited environment inside a building, such as an open space.
Preferably, alarms are included by default in the list of allowed punctual audio sources.
According to the present invention, after the mixed speech audio signal is processed, the method comprises the step of reproducing, via a wearable (in-ear, on-ear or around-the-ear) ear-speaker S used by the speech users listed in the list of allowed punctual audio sources, each speech audio track of the human environment punctual audio source matching those of the list of allowed punctual audio sources. In particular, provided that users A, B, C have their respective audio speech data processed and stored in the library as training/dictionary data, the ear speakers of users A, B, who are in a meeting and are in the list of allowed punctual sound sources for that meeting, will reproduce the audio tracks of users A, B and not the audio track of user C.
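Purely as an illustration of this reproducing step, the sketch below mixes only the separated tracks whose recognized source appears in the list of allowed punctual audio sources and sends the result to the ear-speaker; the use of python-sounddevice for playback and the simple peak normalization are assumptions.

```python
# Playback sketch: keep only tracks of allowed sources and mix them for the ear-speaker.
import numpy as np
import sounddevice as sd

def reproduce_allowed(separated, allowed, fs=16_000):
    """separated: dict source name -> mono track (np.ndarray); allowed: set of names."""
    kept = [track for name, track in separated.items() if name in allowed]
    if not kept:
        return
    n = min(len(t) for t in kept)
    out = np.sum([t[:n] for t in kept], axis=0)
    out /= max(1.0, np.max(np.abs(out)))     # simple peak normalization
    sd.play(out, samplerate=fs)
    sd.wait()
```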
According to a preferred embodiment, the mixed simultaneous speech audio signal is pre-processed before separation and recognition in order to implement active noise cancelling of background noise or the like. This is for example implemented via filtering.
Advantages of the invention are as follows.
All the existing technologies let users decrease the unwanted noise, but they do not allow them to select between specific desired or allowed sounds and the other sound signals, identified as "noise". Furthermore, they do not provide a speaker selection feature, i.e. the ability to recognize voices among the noise and select only some specific voices to be heard by the user. In fact, with current solutions it is not possible to hear only some selected human voices and discard unwanted voices and background noise. The proposed solution aims to protect the worker from noise pollution without completely isolating him, acoustically speaking, from the environment and from the people the user might want to interact with. Indeed, the ear speaker provides a degree of passive noise damping, but such damping is not absolute. The proposed solution will increase the concentration and the performance of the user when working, without compromising his ability to interact, improving the quality of speech communication in noise-polluted working environments.
According to the invention, it is possible to limit the latency between the step of acquiring and the step of reproducing to 100 milliseconds or less. Such a threshold is considered suitable for the user to neglect the time shift between the spoken language and the reproduction via ear speakers S, ensuring a fair experience.
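The following back-of-the-envelope sketch illustrates one way such a budget could be met; only the 100 millisecond target comes from the description, the individual contributions are assumptions.

```python
# Rough latency budget (illustrative numbers, not prescribed by the method).
fs = 16_000
frame = 512 / fs * 1000        # 32 ms analysis frame
hop = 128 / fs * 1000          # 8 ms hop (buffering delay per block)
processing = 40                # ms, separation + recognition on the processing unit
network = 15                   # ms, wireless round trip over the antenna network

total = frame + hop + processing + network
print(f"estimated end-to-end latency: {total:.0f} ms")   # 95 ms <= 100 ms target
```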
The present invention is best suited for working environments where co-workers hold meetings and are often located in an open space. However, the present invention is also applicable in similar environments, for example clubs, student common areas or the like, where people gather and talk in simultaneous groups within the same large room and it is preferable to prevent voices from the different groups from mixing with each other.
References:
T. Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria", IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1066-1074, 2007.
H. Sawada, H. Kameoka, S. Araki, and N. Ueda, "Multichannel extensions of non-negative matrix factorization with complex-valued data", IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 5, pp. 971-982, 2013.
M. Nakano, H. Kameoka, J. Le Roux, Y. Kitano, N. Ono, and S. Sagayama, "Convergence-guaranteed multiplicative algorithms for nonnegative matrix factorization with β-divergence", in Proceedings of the 2010 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Kittilä, Finland, 29 August - 1 September 2010, pp. 283-288.
A. Lefèvre, F. Bach, and C. Févotte, "Itakura-Saito nonnegative matrix factorization with group sparsity", in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 21-24.
A. Hurmalainen, R. Saeidi, and T. Virtanen, "Group sparsity for speaker identity discrimination in factorisation-based speech recognition", in INTERSPEECH, 2012, pp. 2138-2141.

Claims

1. Audio digital processing method comprising the steps of:
- Providing a list of allowed punctual audio sources, including a speech or human punctual audio source;
- Acquiring in an environment a mixed simultaneous speech audio signal from one or more environment punctual audio sources, including a speech or human environment punctual audio source, via a microphone unit;
- Processing the mixed simultaneous speech audio signal in order to: separate audio tracks of each environment punctual audio source via a Blind Source Separation algorithm; and
recognize tracks via Source Recognition algorithm;
- Reproducing via a wearable (in-ear, on-ear or around-the-ear) ear-speaker, each speech audio track of the human environment punctual audio source matching those of the list of allowed punctual audio sources.
2. Method according to claim 1, further comprising the following step prior to the step of processing:
Providing a library of training data to identify punctual audio sources;
Wherein the list of allowed punctual audio sources is selected from the library; and
Wherein both the separation and the recognition in the step of processing are based on data coded in the library.
3. Method according to claim 2, wherein both the separation and the recognition in the step of processing are based on a Non-negative Matrix Factorization algorithm as a joint source separation and a source recognition algorithm.
4. Method according to claim 3, wherein the latency between the step of acquiring and the step of reproducing is less than or equal to 100 milliseconds.
5. Method according to any of the preceding claims, wherein an Active Noise Cancellation algorithm is applied and the step of reproducing comprises reproduction of a cancellation audio signal to interfere with the tracks of one or more punctual audio sources excluded from the list of allowed punctual audio sources.
6. Method according to any of the preceding claims, wherein the step of providing the list is performed via a user interface.
7. Method according to claim 6, wherein the step of providing the list is based on data about human punctual audio sources processed by a calendar management system for participation to a meeting.
8. Method according to any of the preceding claims, wherein an alarm sound source of the environment is by default included in the list of allowed punctual audio sources.
9. Audio processing system comprising:
- a storage device to memorize a list of allowed punctual audio sources, including a speech or human punctual audio source;
- a microphone unit to acquire in an environment a mixed simultaneous speech audio signal from one or more environment punctual audio sources, including a speech or human environment punctual audio source;
- a processing device to process the mixed simultaneous speech audio signal in order to:
separate audio tracks of each environment punctual audio source via a Blind Source Separation algorithm; and
recognize tracks via Source Recognition algorithm;
- a wearable (in-ear, on-ear or around-the-ear) ear-speaker to reproduce each speech audio track of the human environment punctual audio source matching those of the list of allowed punctual audio sources.
10. System according to claim 9, characterized by comprising a plurality of antennas to interconnect the microphone and/or the ear speaker to the processing device and wirelessly exchange data about the list and/or the mixed simultaneous speech audio signal and/or the speech audio tracks.
PCT/IB2018/054744 2017-06-30 2018-06-27 Audio signal digital processing method and system thereof WO2019003131A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IT102017000073663A IT201700073663A1 (en) 2017-06-30 2017-06-30 Audio signal digital processing method and system thereof
IT102017000073663 2017-06-30

Publications (1)

Publication Number Publication Date
WO2019003131A1 true WO2019003131A1 (en) 2019-01-03

Family

ID=60183014

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2018/054744 WO2019003131A1 (en) 2017-06-30 2018-06-27 Audio signal digital processing method and system thereof

Country Status (2)

Country Link
IT (1) IT201700073663A1 (en)
WO (1) WO2019003131A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8194900B2 (en) * 2006-10-10 2012-06-05 Siemens Audiologische Technik Gmbh Method for operating a hearing aid, and hearing aid
US20120215519A1 (en) * 2011-02-23 2012-08-23 Qualcomm Incorporated Systems, methods, apparatus, and computer-readable media for spatially selective audio augmentation
US20150003652A1 (en) * 2013-06-27 2015-01-01 Gn Resound A/S Hearing aid operating in dependence of position
US20160057526A1 (en) * 2014-04-08 2016-02-25 Doppler Labs, Inc. Time heuristic audio control
US20160071526A1 (en) * 2014-09-09 2016-03-10 Analog Devices, Inc. Acoustic source tracking and selection


Also Published As

Publication number Publication date
IT201700073663A1 (en) 2018-12-30

Similar Documents

Publication Publication Date Title
EP3776535B1 (en) Multi-microphone speech separation
Erdogan et al. Improved MVDR beamforming using single-channel mask prediction networks.
Yoshioka et al. Multi-microphone neural speech separation for far-field multi-talker speech recognition
Yoshioka et al. Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks
Yoshioka et al. Advances in online audio-visual meeting transcription
CN111489760B (en) Speech signal dereverberation processing method, device, computer equipment and storage medium
US11694710B2 (en) Multi-stream target-speech detection and channel fusion
Tan et al. Neural spectrospatial filtering
Bub et al. Knowing who to listen to in speech recognition: Visually guided beamforming
US11496830B2 (en) Methods and systems for recording mixed audio signal and reproducing directional audio
Wang et al. Continuous speech separation with ad hoc microphone arrays
EP3847645B1 (en) Determining a room response of a desired source in a reverberant environment
Bohlender et al. Neural networks using full-band and subband spatial features for mask based source separation
EP4004905B1 (en) Normalizing features extracted from audio data for signal recognition or modification
Bando et al. Weakly-Supervised Neural Full-Rank Spatial Covariance Analysis for a Front-End System of Distant Speech Recognition.
CN116312570A (en) Voice noise reduction method, device, equipment and medium based on voiceprint recognition
WO2019003131A1 (en) Audio signal digital processing method and system thereof
Pfeifenberger et al. Eigenvector-Based Speech Mask Estimation Using Logistic Regression.
CN118435278A (en) Apparatus, method and computer program for providing spatial audio
Liu et al. A unified network for multi-speaker speech recognition with multi-channel recordings
Yoshioka et al. Picknet: Real-time channel selection for ad hoc microphone arrays
Ravenscroft et al. Combining Conformer and Dual-Path-Transformer Networks for Single Channel Noisy Reverberant Speech Separation
Ichikawa et al. Effective speech suppression using a two-channel microphone array for privacy protection in face-to-face sales monitoring
Ideli Audio-visual speech processing using deep learning techniques
Fukumori et al. CENSREC-4: An evaluation framework for distant-talking speech recognition in reverberant environments

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18739951

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18739951

Country of ref document: EP

Kind code of ref document: A1