EP3828886A1 - Method and system for separating the voice component and the noise component in an audio flow - Google Patents

Method and system for separating the voice component and the noise component in an audio flow

Info

Publication number
EP3828886A1
Authority
EP
European Patent Office
Prior art keywords
signal
module
voice
generate
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP20209511.3A
Other languages
German (de)
French (fr)
Inventor
Félix MATHIEU
Thomas COURTAT
François CAPMAN
François SAUSSET
Shaheen ACHECHE
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Thales SA
Original Assignee
Thales SA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Thales SA filed Critical Thales SA
Publication of EP3828886A1 publication Critical patent/EP3828886A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating

Definitions

  • the invention relates to a method and a system for separating, in real time in an audio stream, the part of the stream associated with a voice or speech, from another part of the stream containing the noise.
  • the invention finds its application in a context where one or more people are talking in a noisy environment (hubbub, engine noise, ventilation, etc.).
  • the speech signal superimposed on the noisy signals is digitized into an audio stream by a sound sensor.
  • the invention also relates to a method and a system for enhancing a voice signal in real time in an audio stream, based on a method for separating audio sources in deferred time.
  • the patent application US 20190066713 discloses a method of obtaining, by a device, a combined sound signal for combined signals from multiple sound sources in an area in which a person is located.
  • the processing implemented uses deep neural networks.
  • An example of a method for separating several voices in an audio signal according to the prior art comprises the steps described below (not shown, for simplicity).
  • the incoming audio signal is denoted X; it has length L.
  • the signal is transmitted to an encoder M 1 which transforms X into a tensor X^(1) of dimensions F × T, where T is a divisor of L and F a number of filters given by the designer.
  • the encoder M 1 consists of a 1D convolution with F filters. The coefficients of the convolution kernels are adjusted during a learning phase.
  • the tensor is transmitted on the one hand to a multiplier for future use and on the other hand to a separation module.
  • the separation module is divided into two submodules M 2 and M 4 .
  • the first submodule M 2 transforms the tensor X^(1) into a tensor X^(2) of dimensions F × T.
  • the first submodule M 2 consists of a normalization layer, a 1x1 convolution and a stack of 1D-Conv modules known from the prior art and whose parameters are set during a learning phase.
  • the second submodule M 4 transforms X^(2) into a tensor X^(4) of dimensions 2F × T.
  • the second submodule M 4 chains a non-linearity, a 1x1 convolution and a sigmoid function.
  • the coefficients of the 1x1 convolution are set during a learning phase.
  • X^(1) is concatenated with itself to form a tensor of dimensions 2F × T, which is multiplied by X^(4) to form X^(5).
  • the module M 5 takes X^(5) as input and outputs two signals of length L by means of a 1D deconvolution, the parameters of which are adjusted during a learning phase.
  • the digital parameters defining the processing of the different modules are obtained in a prior learning phase on a database.
  • the figure 1 illustrates an application to the separation of signals of different types, by separating the voice channel and the noise channel.
  • data is represented in the form of tensors.
  • the data is modified by a succession of modules.
  • the data is projected into an abstract space generally defined by its dimensions.
  • the present invention implements the following processing operations:
  • the signal (input stream X) is split into N frames of length L, with X_N the nth frame.
  • the method carries out the following processing operations:
  • the frame X_N is encoded by a network of 1D convolutions.
  • the result is a tensor X_N^(1) of dimensions F × T, with F the number of filters given by the designer,
  • the result X_N^(1) is then transformed by a module M 2, 101.
  • the result X_N^(2) is a tensor of dimensions F × T.
  • the module M 4 estimates, 103, from X_N^(2), a tensor X_N^(4) of dimensions 2F × T.
  • X_N^(1) is concatenated with itself, 104, to form a tensor of dimensions 2F × T, which is multiplied by X_N^(4) to form X_N^(5).
  • from X_N^(5), the module M 5 produces a tensor of dimension 2 × T, 105, from which two outputs of dimensions 1 × T are obtained, X_{N,0} and X_{N,1}, which are respectively the voice channel and the noise channel.
  • One of the objectives of the present invention is to provide a method and a device making it possible to separate, in real time, voices from the background noise in an audio stream, or to denoise the voice in an audio stream, in particular by taking into account information from previous frames. This improves performance and reduces processing latency.
  • the method thus enables the propagation of “global information” on the signal, its updating and its use from frame to frame.
  • For frame 0, I_0 is set arbitrarily; for example, it is identically zero.
  • the figure 2 illustrates an example of a device allowing the implementation of the method according to the invention.
  • the signal from which it is necessary to extract (separate) the voice(s) from the noise contained in the audio stream is received on an audio sensor 10.
  • the audio sensor is connected to a set of equipment or hardware modules 20, configured to separate the voice from the noise, which are detailed in figure 3.
  • the figure 3 illustrates a first variant embodiment for separating a voice from the noise in an audio signal, the processing being carried out at the level of the assembly 20. This separation is carried out in real time. Modules similar to those in the diagram of the figure 1 bear the same references.
  • the assembly also includes a module M 3 , the function of which is detailed below.
  • the audio signal received on the sensor is, during a first step, separated into N frames X_1, ..., X_N.
  • Each frame X_N is associated with a tensor I_N which is of constant dimension, independent of the index of the frame.
  • the method updates the value of the tensor I_N from frame to frame and jointly uses X_N and I_N to estimate X_{N,0} and X_{N,1}.
  • the frame X_N is transmitted to a first module M 1 , 100, which generates a signal X_N^(1).
  • the tensor I_{N-1} obtained during the previous step, for the processing of the frame X_{N-1}, is transmitted to a module M 3 , 201.
  • M 3 generates a tensor I_N, 202, which will be used during the processing of the frame X_{N+1}.
  • the encoder M 3 takes as input a signal X_N^(2), 203, the result of the transformation of the signal X_N^(1) by a module M 2 , and performs the concatenation of X_N^(2) and I_N in order to generate a signal X_N^(3) of dimension 2F × T: M 3 : (X_N^(2), I_{N-1}) → (X_N^(3), I_N).
  • the signal X_N^(3), 204, is transmitted to a module M 4 in order to generate a signal X_N^(4) which is combined, 104, with the signal X_N^(1); the signal resulting from the combination is decoded by a decoder M 5 , 105, in order to generate a first voice signal X_{N,0} and a second noise signal X_{N,1}.
  • the steps implemented by the method according to the invention are as follows:
  • for all N, I_N has dimension F × F
  • the method and the device according to the invention allow real-time separation of the voice from the noise in an audio signal received on a sensor, without degrading the parameters specific to the voice.
  • the digital parameters defining the processing of the different modules are set in a prior learning phase on a database.
  • the invention allows real-time operation with a controllable latency/quality trade-off, avoids degrading an audio signal which does not contain noise, and makes it possible to enhance the noise in a signal which does not contain speech (voice).
  • the method makes it possible in particular to preprocess the speech audio signal to improve the quality of downstream voice processing/enhancement blocks (compression, analysis).

Abstract

The invention relates to a method and a system for separating, in real time in an audio stream, the voice component and the noise component.

Description

The invention relates to a method and a system for separating, in real time in an audio stream, the part of the stream associated with a voice or speech from another part of the stream containing the noise.

The invention finds its application in a context where one or more people are talking in a noisy environment (hubbub, engine noise, ventilation, etc.). The speech signal, superimposed on the noisy signals, is digitized into an audio stream by a sound sensor.

The invention also relates to a method and a system for enhancing a voice signal in real time in an audio stream, based on a method for separating audio sources in deferred time.

The state of the art known to the applicant falls into two categories: the so-called conventional approaches and the approaches made possible by artificial intelligence, known as “deep learning”.

Among the “deep learning” approaches, some directly address the problem of voice/background-noise separation, while others concern signal/signal or voice/voice separation.

The patent application US 20190066713 discloses a method of obtaining, by a device, a combined sound signal for combined signals from multiple sound sources in an area in which a person is located. The processing implemented uses deep neural networks.

An example of a method for separating several voices in an audio signal according to the prior art comprises the steps described below (not shown, for simplicity). The incoming audio signal is denoted X and has length L. The signal is transmitted to an encoder M 1 which transforms X into a tensor X^(1) of dimensions F × T, where T is a divisor of L and F a number of filters given by the designer. The encoder M 1 consists of a 1D convolution with F filters. The coefficients of the convolution kernels are adjusted during a learning phase. The tensor is transmitted on the one hand to a multiplier, for later use, and on the other hand to a separation module. The separation module is divided into two submodules M 2 and M 4 . The first submodule M 2 transforms the tensor X^(1) into a tensor X^(2) of dimensions F × T; it consists of a normalization layer, a 1x1 convolution and a stack of 1D-Conv modules known from the prior art, whose parameters are set during a learning phase.

The second submodule M 4 transforms X^(2) into a tensor X^(4) of dimensions 2F × T. To do so, it chains a non-linearity, a 1x1 convolution and a sigmoid function. The coefficients of the 1x1 convolution are set during a learning phase.

X^(1) is concatenated with itself to form a tensor of dimensions 2F × T, which is multiplied by X^(4) to form X^(5).

The module M 5 takes X^(5) as input and outputs two signals of length L by means of a 1D deconvolution, the parameters of which are adjusted during a learning phase.

The digital parameters defining the processing of the different modules are obtained in a prior learning phase on a database.
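For concreteness, the prior-art pipeline just described can be sketched in PyTorch. This is a minimal sketch under stated assumptions, not the patented implementation: the values of F, the kernel size and the stride are illustrative (the sketch assumes kernel = 2 × stride so that T = L / stride), and the stack of 1D-Conv modules inside M 2 is stubbed with a single dilated convolution.

```python
import torch
import torch.nn as nn

class PriorArtSeparator(nn.Module):
    """Minimal sketch of the prior-art pipeline: encoder M1, separation
    submodules M2 and M4, masking, and decoder M5 (values illustrative)."""

    def __init__(self, F=128, kernel=16, stride=8):
        super().__init__()
        # M1: 1D convolution with F filters, X (1 x L) -> X^(1) (F x T)
        self.m1 = nn.Conv1d(1, F, kernel, stride=stride, padding=stride // 2)
        # M2: normalization, 1x1 convolution, stack of 1D-Conv modules
        # (stubbed here with one dilated convolution)
        self.m2 = nn.Sequential(
            nn.GroupNorm(1, F),
            nn.Conv1d(F, F, 1),
            nn.Conv1d(F, F, 3, padding=2, dilation=2),
        )
        # M4: non-linearity, 1x1 convolution, sigmoid -> X^(4) (2F x T)
        self.m4 = nn.Sequential(nn.PReLU(), nn.Conv1d(F, 2 * F, 1), nn.Sigmoid())
        # M5: 1D deconvolution shared by the two channels, (F x T) -> (1 x L)
        self.m5 = nn.ConvTranspose1d(F, 1, kernel, stride=stride, padding=stride // 2)

    def forward(self, x):                      # x: (batch, 1, L)
        x1 = self.m1(x)                        # X^(1): (batch, F, T)
        x2 = self.m2(x1)                       # X^(2): (batch, F, T)
        x4 = self.m4(x2)                       # X^(4): (batch, 2F, T), sigmoid masks
        x5 = torch.cat([x1, x1], dim=1) * x4   # X^(5): (batch, 2F, T)
        b, _, t = x5.shape
        out = self.m5(x5.view(b * 2, -1, t))   # decode each F x T half separately
        out = out.view(b, 2, -1)               # two signals of length L
        return out[:, 0], out[:, 1]            # voice channel, noise channel
```

A forward pass such as `PriorArtSeparator()(torch.randn(1, 1, 4096))` returns two estimated signals of the same length as the input.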

By replacing one of the voices with noise, it is immediate to use the methods described in the state of the art to separate the voice from the background noise in an audio signal and, by keeping only the output containing the voice signal, to enhance the voice in a noisy signal.

Figure 1 illustrates an application to the separation of signals of different types, by separating the voice channel and the noise channel.

As described, the state of the art does not directly allow real-time processing of an audio stream.

The document by Mimilakis Stylianos Ioannis et al., entitled “A recurrent encoder-decoder approach with skip-filtering connections for monaural singing voice separation”, 25 September 2017, pages 1-6, XP033263882, discloses a method for separating the voice from a musical background.

In the technical field of “deep learning”, data are represented in the form of tensors. The data are modified by a succession of modules. At the output of each module, the data are projected into an abstract space generally defined by its dimensions.

To do this, the present invention implements the following processing operations:

The signal (input stream X) is split into N frames of length L, with X_N the nth frame. The method carries out the following processing operations:

The frame X_N is encoded by a network of 1D convolutions, 100. The result is a tensor X_N^(1) of dimensions F × T, with F the number of filters given by the designer and T a divisor of L that depends on the size of the F filters. The result X_N^(1) is then transformed by a module M 2, 101; the result X_N^(2) is a tensor of dimensions F × T. The module M 4 estimates, 103, from X_N^(2), a tensor X_N^(4) of dimensions 2F × T. X_N^(1) is concatenated with itself, 104, to form a tensor of dimensions 2F × T, which is multiplied by X_N^(4) to form X_N^(5). From X_N^(5), the module M 5 produces a tensor of dimension 2 × T, 105, from which two outputs of dimensions 1 × T are obtained, X_{N,0} and X_{N,1}, which are respectively the voice channel and the noise channel.
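As a sketch of this framing step (reusing the imports and the hypothetical PriorArtSeparator above), the baseline loop processes each frame independently, with no information carried from one frame to the next:

```python
# Hypothetical framing of the stream into N frames of length L each.
n_frames, frame_len = 10, 4096
x = torch.randn(n_frames * frame_len)              # incoming stream X
frames = x.view(n_frames, 1, 1, frame_len)         # X_1 ... X_N

sep = PriorArtSeparator()
voice, noise = zip(*(sep(x_n) for x_n in frames))  # no inter-frame memory
```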

These steps are repeated for each new frame. The parameters are learned on a database of sounds. The disadvantage of this method is that it does not use information from the previous frames to process the current frame. This leads in particular to degraded quality and high processing latency, due to the duration of the frames.

One of the objectives of the present invention is to provide a method and a device making it possible to separate, in real time, voices from the background noise in an audio stream, or to denoise the voice in an audio stream, in particular by taking into account information from previous frames. This improves performance and reduces processing latency. The method thus enables the propagation of “global information” on the signal, its updating and its use from frame to frame.

The invention relates to a method for separating voice, in real time, from noise in an audio signal received on a receiver equipped with an audio sensor, characterized in that it comprises at least the following steps:

  • The received audio stream is separated into N frames X_N,
  • Each frame X_N is associated with a tensor containing information on the whole audio stream,
  • The frame X_N is transmitted to a first module M 1 which generates a signal X_N^(1),
  • The tensor I_{N-1} obtained during the previous step, for the processing of the frame X_{N-1}, is transmitted to a module M 3 ,
  • The module M 3 takes as input a signal X_N^(2), the result of the transformation of the signal X_N^(1) by a module M 2 , and performs the concatenation of X_N^(2) and I_N in order to generate a signal X_N^(3) of dimension 2F × T,
  • The signal X_N^(3) is transmitted to a module M 4 in order to generate a signal X_N^(4) which is combined with the signal X_N^(1),
  • The signal resulting from the combination is decoded by a decoder M 5 in order to generate a first voice signal X_{N,0} and a second signal X_{N,1}.

To process a frame N, it is assumed that the frame N-1 has been processed previously and that the quantities resulting from this processing have been stored. For frame 0, I_0 is set arbitrarily; for example, it is identically zero.
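The corresponding streaming loop can be sketched as follows; SeparatorWithMemory is a hypothetical wrapper around modules M 1 to M 5 plus M 3 , outlined with the device description just below:

```python
sep = SeparatorWithMemory()        # hypothetical wrapper, sketched below
I = torch.zeros(128, 128)          # I_0 fixed arbitrarily: identically zero (F = 128)
for x_n in frames:                 # frames X_1 ... X_N of the stream
    (v_n, b_n), I = sep(x_n, I)    # consumes I_{N-1}, produces I_N
```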

The invention also relates to a device for separating voice from noise in an audio signal received on a receiver equipped with an audio sensor, characterized in that it comprises at least the following elements:

  • A first module M 1 receiving frames of a signal containing voice and noise,
  • The first module has an output connected to a second module M 2 configured to generate a signal transmitted to a third module M 3 , which receives a tensor value associated with a previous frame X_{N-1} in order to generate a tensor I_N associated with the current frame and a signal X_N^(3) of dimension 2F × T.

The module M 3 , inserted between the module M 2 and the module M 4 , takes as input a tensor homogeneous in dimensions with the one supplied at the output of the module M 2 , and supplies as output a tensor homogeneous in dimensions with the one taken as input by the module M 4 . An additional input I_{N-1} is supplied to the module M 3 for the processing of the frame number N, and the module M 3 supplies the tensor I_N as an additional output.

  • A module M 4 which combines the signal X_N^(3) and the signal X_N^(1) in order to generate a signal X_N^(4),
  • A decoder M 5 configured to generate a first voice signal X_{N,0} and a second noise signal X_{N,1} from the signal X_N^(4).
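Under the same assumptions as the earlier sketches, the device can be wired as follows, following the figure 3 flow described below; this reuses the PriorArtSeparator layers, and the m3_step update it calls is detailed after the figure 3 equations:

```python
class SeparatorWithMemory(nn.Module):
    """Sketch of the claimed device: M1, M2, M4, M5 as in the baseline, plus
    M3 carrying the memory tensor I from frame to frame."""

    def __init__(self, F=128, kernel=16, stride=8):
        super().__init__()
        base = PriorArtSeparator(F, kernel, stride)
        self.m1, self.m2, self.m5 = base.m1, base.m2, base.m5
        # M4 now reads the 2F x T tensor X_N^(3) produced by M3
        self.m4 = nn.Sequential(nn.PReLU(), nn.Conv1d(2 * F, 2 * F, 1), nn.Sigmoid())

    def forward(self, x, I_prev):              # x: (1, 1, L); I_prev: (F, F)
        x1 = self.m1(x)                        # X_N^(1): (1, F, T)
        x2 = self.m2(x1)                       # X_N^(2): (1, F, T)
        x3, I_new = m3_step(x2[0], I_prev)     # X_N^(3): (2F, T), plus I_N
        x4 = self.m4(x3.unsqueeze(0))          # X_N^(4): (1, 2F, T)
        x5 = torch.cat([x1, x1], dim=1) * x4   # X_N^(5): (1, 2F, T)
        t = x5.shape[-1]
        out = self.m5(x5.view(2, -1, t)).view(1, 2, -1)
        return (out[:, 0], out[:, 1]), I_new   # (voice X_N,0, noise X_N,1), I_N
```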

Other characteristics, details and advantages of the invention will emerge on reading the description given with reference to the appended drawings, provided by way of non-limiting example, which represent, respectively:

  • [Fig. 1], an illustration of the prior art,
  • [Fig. 2], an example of a system allowing the implementation of the method according to the invention,
  • [Fig. 3], an illustration of the steps implemented by the method according to the invention.

Figure 2 illustrates an example of a device allowing the implementation of the method according to the invention.

The signal from which it is necessary to extract (separate) the voice(s) from the noise contained in the audio stream is received on an audio sensor 10. The audio sensor is connected to a set of equipment or hardware modules 20, configured to separate the voice from the noise, which are detailed in figure 3.

Figure 3 illustrates a first variant embodiment for separating a voice from the noise in an audio signal, the processing being carried out at the level of the assembly 20. This separation is carried out in real time. Modules similar to those in the diagram of figure 1 bear the same references. The assembly also includes a module M 3 , the function of which is detailed below.

The audio signal received on the sensor is, during a first step, separated into N frames X_1, ..., X_N. Each frame X_N is associated with a tensor I_N which is of constant dimension, independent of the index of the frame. The method updates the value of the tensor I_N from frame to frame and jointly uses X_N and I_N to estimate X_{N,0} and X_{N,1}.

The frame X_N is transmitted to a first module M 1 , 100, which generates a signal X_N^(1). The tensor I_{N-1} obtained during the previous step, for the processing of the frame X_{N-1}, is transmitted to a module M 3 , 201.

M 3 generates a tensor I_N, 202, which will be used during the processing of the frame X_{N+1}.

The encoder M 3 takes as input a signal X_N^(2), 203, the result of the transformation of the signal X_N^(1) by a module M 2 , and performs the concatenation of X_N^(2) and I_N in order to generate a signal X_N^(3) of dimension 2F × T: M 3 : (X_N^(2), I_{N-1}) → (X_N^(3), I_N).

The signal X_N^(3), 204, is transmitted to a module M 4 in order to generate a signal X_N^(4), which is combined, 104, with the signal X_N^(1); the signal resulting from the combination is decoded by a decoder M 5 , 105, in order to generate a first voice signal X_{N,0} and a second noise signal X_{N,1}.

In one embodiment, the steps implemented by the method according to the invention are as follows:

For all N, I_N has dimension F × F.

A_N is an F × F tensor defined by A_N = X_N^(2) (X_N^(2))^t / T,

  • a. X_N^(2) (X_N^(2))^t is the matrix product of X_N^(2) and its transpose.

I_N = I_{N-1} + λ(A_N − I_{N-1}), with λ a gain factor between 0 and 1 given by the user.

B_N = Softmax(I_{N-1})

  • a. The softmax function is classic in machine learning; to a vector of K numbers (v_1, ..., v_K) it associates a vector of K numbers (w_1, ..., w_K) with, for all k, w_k = exp(v_k) / Σ_{l=1}^{K} exp(v_l),
  • b. To compute B_N, the softmax function is applied independently to each row,

C_N = B_N X_N^(2) is the matrix product between B_N and X_N^(2); its dimensions are F × T.

X_N^(3) is of dimension 2F × T; it is the concatenation of X_N^(2) and C_N.
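These equations translate directly into a few tensor operations. The sketch below assumes PyTorch and the shapes given above (X_N^(2) of shape F × T, I of shape F × F); m3_step is a hypothetical name and the default value of the gain factor lam is illustrative:

```python
import torch

def m3_step(x2: torch.Tensor, I_prev: torch.Tensor, lam: float = 0.5):
    """One M3 update for a frame, following the figure 3 equations.
    x2:     X_N^(2), shape (F, T)
    I_prev: I_{N-1}, shape (F, F)
    lam:    gain factor between 0 and 1, chosen by the user
    Returns X_N^(3) of shape (2F, T) and the updated memory I_N."""
    T = x2.shape[1]
    A = x2 @ x2.t() / T                     # A_N = X_N^(2) (X_N^(2))^t / T
    I_new = I_prev + lam * (A - I_prev)     # I_N = I_{N-1} + lam (A_N - I_{N-1})
    B = torch.softmax(I_prev, dim=1)        # softmax applied row by row
    C = B @ x2                              # C_N = B_N X_N^(2), shape (F, T)
    x3 = torch.cat([x2, C], dim=0)          # X_N^(3) = concat(X_N^(2), C_N)
    return x3, I_new
```

Because I_N keeps a fixed F × F size whatever the frame index, the memory cost of the recurrence is constant, which fits the real-time, frame-by-frame strategy described above.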

The method and the device according to the invention allow real-time separation of the voice from the noise in an audio signal received on a sensor, without degrading the parameters specific to the voice.

The digital parameters defining the processing of the different modules are set in a prior learning phase on a database.

The invention allows real-time operation with a controllable latency/quality trade-off, avoids degrading an audio signal which does not contain noise, and makes it possible to enhance the noise in a signal which does not contain speech (voice).

The method makes it possible in particular to preprocess the speech audio signal to improve the quality of downstream voice processing/enhancement blocks (compression, analysis).

The addition of a module M 3 in the processing chain improves the quality of the frame-by-frame strategy used to make the processing real-time.

Claims (2)

1. Method for separating, in real time, voice from noise in an audio signal received on a receiver equipped with an audio sensor, characterized in that it comprises at least the following steps:
- The received audio stream is separated into N frames X_N,
- Each frame X_N is associated with a tensor containing information on the whole audio stream,
- The frame X_N is transmitted to a first module M 1 , (100), which generates a signal X_N^(1),
- The tensor I_{N-1} obtained during the previous step, for the processing of the frame X_{N-1}, is transmitted to a module M 3 , (201),
- The module M 3 takes as input a signal X_N^(2), (203), the result of the transformation of the signal X_N^(1) by a module M 2 , and performs the concatenation of X_N^(2) and I_N in order to generate a signal X_N^(3) of dimension 2F × T,
- The signal X_N^(3), (204), is transmitted to a module M 4 in order to generate a signal X_N^(4) which is combined, (104), with the signal X_N^(1),
- The signal resulting from the combination is decoded by a decoder M 5 , (105), in order to generate a first voice signal X_{N,0} and a second signal X_{N,1}.
2. Device for separating, in real time, voice from noise in an audio signal received on a receiver equipped with an audio sensor, characterized in that it comprises at least the following elements:
- A first module M 1 receiving frames of a signal containing voice and noise,
- The first module has an output connected to a second module M 2 configured to generate a signal transmitted to a third module M 3 which receives a tensor value associated with a previous frame X_{N-1} in order to generate a tensor I_N associated with the current frame and a signal X_N^(3) of dimension 2F × T,
- A module M 4 which combines the signal X_N^(3) and the signal X_N^(1) in order to generate a signal X_N^(4),
- A module (104) which combines the signal X_N^(4) with the signal X_N^(1) in order to generate a signal X_N^(5),
- A decoder M 5 , (105), configured to generate a first voice signal X_{N,0} and a second signal X_{N,1} from the signal X_N^(5).
EP20209511.3A 2019-11-27 2020-11-24 Method and system for separating the voice component and the noise component in an audio flow Pending EP3828886A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
FR1913283A FR3103619B1 (en) 2019-11-27 2019-11-27 METHOD AND SYSTEM FOR SEPARATING, IN AN AUDIO STREAM, THE VOICE COMPONENT AND THE NOISE COMPONENT

Publications (1)

Publication Number Publication Date
EP3828886A1 true EP3828886A1 (en) 2021-06-02

Family

ID=70918486

Family Applications (1)

Application Number Title Priority Date Filing Date
EP20209511.3A Pending EP3828886A1 (en) 2019-11-27 2020-11-24 Method and system for separating the voice component and the noise component in an audio flow

Country Status (3)

Country Link
EP (1) EP3828886A1 (en)
FR (1) FR3103619B1 (en)
SG (1) SG10202011769TA (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190066713A1 (en) 2016-06-14 2019-02-28 The Trustees Of Columbia University In The City Of New York Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190066713A1 (en) 2016-06-14 2019-02-28 The Trustees Of Columbia University In The City Of New York Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MIMILAKIS STYLIANOS IOANNIS ET AL: "A recurrent encoder-decoder approach with skip-filtering connections for monaural singing voice separation", 2017 IEEE 27TH INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), IEEE, 25 September 2017 (2017-09-25), pages 1 - 6, XP033263882, DOI: 10.1109/MLSP.2017.8168117 *
MIMILAKIS STYLIANOS IOANNIS ET AL., A RECURRENT ENCODER-DECODER APPROACH WITH SKIP-FILTERING CONNECTIONS FOR MONAURAL SINGING VOICE SEPARATION, 25 September 2017 (2017-09-25), pages 1 - 6
STEPHENSON CORY ET AL: "Monaural speaker separation using source-contrastive estimation", 2017 IEEE INTERNATIONAL WORKSHOP ON SIGNAL PROCESSING SYSTEMS (SIPS), IEEE, 3 October 2017 (2017-10-03), pages 1 - 6, XP033257056, DOI: 10.1109/SIPS.2017.8110005 *

Also Published As

Publication number Publication date
FR3103619B1 (en) 2022-06-24
FR3103619A1 (en) 2021-05-28
SG10202011769TA (en) 2021-06-29

Similar Documents

Publication Publication Date Title
EP0919096B1 (en) Method for cancelling multichannel acoustic echo and multichannel acoustic echo canceller
EP2005420B1 (en) Device and method for encoding by principal component analysis a multichannel audio signal
CA2436318C (en) Noise reduction method and device
EP2691952B1 (en) Allocation, by sub-bands, of bits for quantifying spatial information parameters for parametric encoding
EP2772916B1 (en) Method for suppressing noise in an audio signal by an algorithm with variable spectral gain with dynamically adaptive strength
FR2808917A1 (en) METHOD AND DEVICE FOR VOICE RECOGNITION IN FLUCTUATING NOISE LEVEL ENVIRONMENTS
EP0998166A1 (en) Device for audio processing,receiver and method for filtering the wanted signal and reproducing it in presence of ambient noise
EP1395981B1 (en) Device and method for processing an audio signal
WO2017103418A1 (en) Adaptive channel-reduction processing for encoding a multi-channel audio signal
CA3053032A1 (en) Method and apparatus for dynamic modifying of the timbre of the voice by frequency shift of the formants of a spectral envelope
FR2690551A1 (en) Quantization method of a predictor filter for a very low rate vocoder.
EP3025514B1 (en) Sound spatialization with room effect
EP3025342B1 (en) Method for suppressing the late reverberation of an audible signal
EP3828886A1 (en) Method and system for separating the voice component and the noise component in an audio flow
FR3060830A1 (en) SUB-BAND PROCESSING OF REAL AMBISONIC CONTENT FOR IMPROVED DECODING
WO2023165946A1 (en) Optimised encoding and decoding of an audio signal using a neural network-based autoencoder
FR2817694A1 (en) SPATIAL SMOOTHING METHOD AND DEVICE FOR DARK AREAS OF AN IMAGE
EP2515300A1 (en) Method and System for noise reduction
FR3085784A1 (en) DEVICE FOR ENHANCING SPEECH BY IMPLEMENTING A NEURAL NETWORK IN THE TIME DOMAIN
EP4315328A1 (en) Estimating an optimized mask for processing acquired sound data
US20220405547A1 (en) Residual normalization for improved neural network classifications
EP0812070B1 (en) Method and device for compression encoding of a digital signal
WO2013053631A1 (en) Method and device for separating signals by iterative spatial filtering
EP1605440A1 (en) Method for signal source separation from a mixture signal
FR2667472A1 (en) REAL-TIME BINARY SEGMENTATION DEVICE OF DIGITAL IMAGES.

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20211109

RBV Designated contracting states (corrected)

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20230306

P01 Opt-out of the competence of the unified patent court (upc) registered

Effective date: 20230517