WO2023077237A1 - System and method for improving an audio signal - Google Patents

System and method for improving an audio signal

Info

Publication number
WO2023077237A1
Authority
WO
WIPO (PCT)
Prior art keywords
microphone
audio
microphones
neural network
waveform
Prior art date
Application number
PCT/CA2022/051637
Other languages
English (en)
Inventor
Duncan MACCONNELL
Ladan GOLSHANARA
Jesus TURRUBIATES
Original Assignee
Tandemlaunch Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tandemlaunch Inc. filed Critical Tandemlaunch Inc.
Publication of WO2023077237A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R29/00Monitoring arrangements; Testing arrangements
    • H04R29/004Monitoring arrangements; Testing arrangements for microphones
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00Circuits for transducers, loudspeakers or microphones
    • H04R3/005Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00Details of transducers, loudspeakers or microphones
    • H04R1/08Mouthpieces; Microphones; Attachments therefor
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S2400/00Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/15Aspects of sound capture and related signal processing for recording or reproduction

Definitions

  • the following relates to audio signals generated from audio devices such as microphones and speakers. The following more specifically relates to the improvement and processing of these audio signals.
  • FIG. 1 provides a schematic diagram of a conventional condenser microphone.
  • Condenser microphones work on the principle of capacitance. Capacitors consist of parallel conducting plates that store charge and are used to smooth out signals like voltage variations in a power supply. In a condenser microphone, the incoming sound 1 vibrates the diaphragm 2 of a capacitor. This varies the capacitance between the diaphragm 2 and the back plate 3. The varying capacitance is converted into a corresponding electrical signal 4.
  • FIG. 2 provides a schematic diagram of a conventional dynamic microphone.
  • a dynamic microphone converts sound into a small electrical current. Sound waves 1 hit a diaphragm 2 that vibrates, moving a magnet 8 near a coil 7. This produces an electric current 9.
  • Both the condenser and dynamic microphones are transducers; they transform sound pressure waves into voltage, through the movement of the microphone diaphragm 2.
  • the selection of diaphragm 2 and accompanying electrical circuit determine the voltage that represents the sound, and thus determine the perceptual quality of the sound. It is very expensive to design and build high quality, professional microphones. When product designers need microphones in products, the size and expense of the microphone is weighed against the perceptual benefits of having a high-quality microphone in the system.
  • Diaphragm material, design, thickness, and diameter can help to determine a microphone’s frequency, transient and polar responsiveness.
  • the microphone quality is limited by, for example, the material, design, thickness, and diameter of the diaphragm 2.
  • diaphragms 2 can be categorized into three sizes — large, medium, and small. Larger diaphragm microphones are typically more sensitive due to their increased surface area, but also have a more limited frequency response since sound waves have to move more mass.
  • Small diaphragm microphones are capable of handling higher sound pressure levels due to their stiffer diaphragms. They also have an increased frequency response, particularly in the higher end of the frequency spectrum. Their decreased sensitivity relative to large diaphragm microphones makes them less susceptible to proximity effect and ambient noise due to their directional characteristics.
  • condenser microphones are better suited for high frequency applications such as recording a vocalist in an isolation booth, recording an acoustic guitar to capture definition, recording a group of singers, recording an acoustic piano, recording sound effects, or recording a podcast voice in a quiet or acoustically treated room.
  • dynamic microphones are not as sensitive, which makes them better suited for low frequency applications such as recording drums, recording guitar amplifiers, recording multiple individuals’ voices sitting around a table, or recording one or more speakers on a stage when you need to avoid picking up other sounds.
  • a method of improving an audio signal comprising: outputting an audio waveform from a sound source; capturing the audio waveform from a first microphone and capturing the audio waveform from a second microphone capsule aligned beside the first microphone; and sending the captured audio waveforms to a digital audio processing system having a neural network.
  • the neural network is configured to learn differences between the first audio waveform and the second audio waveform.
  • the audio signals processed from the first microphone differ from the audio signals processed from the second microphone.
  • the sound source may comprise a curated data set consisting of test signals and representative audio signals.
  • the method may further comprise applying the learned differences to a third audio waveform recorded from a third microphone, wherein the third microphone has similar characteristics to the first microphone.
  • the first microphone may have a non-ideal set of characteristics.
  • the second microphone may have an ideal set of characteristics for a specific function.
  • the specific function can be selected from at least one of the following: conversation, lyrical, music, noise cancellation, and instrumental.
  • the third microphone can be located on a mobile device.
  • the third microphone records audio from a telephone conversation such that the digital audio processing system processes the conversation in real-time.
  • the digital audio processing system can be located on an application on the mobile device.
  • the method can also be used for polar pattern translation, wherein the first microphone comprises a first polar pattern from one of unidirectional, bidirectional, and omnidirectional patterns, and the second microphone comprises a second polar pattern, from one of unidirectional, bidirectional, and omnidirectional patterns.
  • the third microphone may comprise the first polar pattern and wherein the learned differences between the first and second microphone can be applied to a third audio waveform recorded from the third microphone to match the second polar pattern.
  • FIG. 1 depicts a schematic diagram of a condenser microphone
  • FIG. 2 depicts a schematic diagram of a dynamic microphone
  • FIG. 3 depicts a schematic diagram of recording a training model dataset
  • FIG. 4 depicts a schematic diagram of the model training inputs
  • FIG. 5 depicts a schematic diagram of the digital audio processing system algorithm
  • FIG. 6 depicts an embodiment of the application of the invention
  • FIG. 7 depicts a further embodiment of the application of the invention.
  • FIG. 8A depicts a schematic diagram of a unidirectional polar pattern
  • FIG. 8B depicts a schematic diagram of a bi-directional polar pattern
  • FIG. 8C depicts a schematic diagram of an omnidirectional polar pattern
  • FIG. 9A depicts a schematic diagram of the digital audio processing system
  • FIG. 9B depicts a schematic diagram of the digital audio processing system
  • FIG. 10 depicts a schematic diagram of the deep neural network algorithm
  • FIG. 11 depicts a schematic diagram of the digital audio processing system algorithm for upgrading a speaker
  • FIG. 12A depicts one example of the training model
  • FIG. 12B lists the equations used by the training model.
  • the invention provides a Digital Audio Processing System 100 having a neural network that performs audio processing translation from one distinct microphone to another.
  • This invention allows small and cheap microphones to take on the spectral characteristics (i.e. the particular sound) of a large high-quality microphone.
  • the Digital Audio Processing System 100 uses a neural network to learn the difference between audio signals captured by model microphones and audio signals captured by lower-quality microphones. The system then applies the trained model differences to the audio signals of other lower-quality microphones to produce high-quality sound from those microphones. Machine learning makes it possible to learn the complex effect that microphones have when capturing sound.
  • FIG. 3 depicts a schematic diagram of recording a training model dataset.
  • An unprocessed audio dataset 101 is played on a speaker 102.
  • the unprocessed audio dataset 101 may be any type of unprocessed audio such as a recording of a group of people having a conversation, a recording of a musical instrument, a recording of a person or group of people singing, musical songs, etc.
  • the unprocessed audio dataset may be any source of sound, such as a group of people having a conversation, a musical instrument, a person, or group of people singing, etc.
  • two microphones 103, 104 are used to capture and record the unprocessed audio signal 101 simultaneously.
  • the two microphones 103, 104 are preferably capsule-aligned such that they are capturing an identical audio wavefront. Capsule-aligned refers to the physical alignment of the diaphragms 2 of two or more microphones.
  • one of the microphones 103 may be of lesser quality than the other microphone 104.
  • the build quality of the microphone will affect the sound quality recorded by the microphone.
  • diaphragm material, design, thickness, and diameter can help to determine a microphone’s frequency, transient and polar responsiveness.
  • the microphone quality is limited by, for example, the material, design, thickness, and diameter of the diaphragm.
  • microphone 103 may be a simple, inexpensive microphone and microphone 104 can be a high-quality recording microphone which is expensive.
  • the resulting audio recordings or datasets 105, 106 will be of varying quality.
  • the first dataset 105 is the audio recorded by the first microphone 103.
  • the second dataset 106 is the audio recorded by the second microphone 104.
  • the second microphone 104 is of higher quality, and thus the recorded audio signal 106 will have a higher sound quality (i.e., less distortion, less static noise, a higher bit rate, etc.) than audio signal 105. Conversely, because the first microphone 103 is of lesser quality, the audio signal 105 it records will have a lower sound quality than the audio signal 106 recorded by the second microphone 104; audio signal 105 is expected to have more distortion, more static noise, and a lower bit rate because it is recorded from a low-quality microphone 103.
  • the first microphone 103 can have a first set of characteristics and the second microphone 104 can have a second set of characteristics. These characteristics may include differences in the microphone’s capturing frequency, quality, type of sound, etc.
  • the first microphone 103 may be a condenser-type of microphone and the second microphone 104 may be a dynamic-type of microphone. As such, the resulting dataset 105 recorded by the first microphone 103 will be different from the dataset 106 recorded by the second microphone 104.
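  • By way of illustration, the capture step of FIG. 3 can be reduced to a few lines of code when the two capsule-aligned microphones 103 and 104 are connected to one audio interface. The sketch below is only a rough illustration under that assumption; the soundfile/sounddevice calls and the file names are illustrative and not part of the described system.

```python
# Rough sketch of the paired-capture step of FIG. 3, assuming microphones
# 103 and 104 are capsule aligned and wired to input channels 1 and 2 of
# a single audio interface. File names are hypothetical.
import soundfile as sf
import sounddevice as sd

source, fs = sf.read("unprocessed_dataset_101.wav", dtype="float32")

# Play the unprocessed dataset 101 through speaker 102 while recording
# two input channels at once, so both microphones see the same wavefront.
captured = sd.playrec(source, samplerate=fs, channels=2)
sd.wait()  # block until playback and capture finish

sf.write("dataset_105_low_quality_mic.wav", captured[:, 0], fs)  # microphone 103
sf.write("dataset_106_reference_mic.wav", captured[:, 1], fs)    # microphone 104
```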
  • the varying recorded audio signals 105 and 106 are then input into a deep neural network (DNN) 107.
  • DNN: deep neural network
  • once the difference between the audio signals 105, 106 of the two different microphones 103, 104 is captured, the neural network 107 is configured to learn the difference between the microphones 103, 104.
  • the Digital Audio Processing System 100 can then apply this difference to new audio signals.
  • FIG. 4 shows a schematic diagram of the model training inputs to be used by the deep neural network 107 and Digital Audio Processing System 100.
  • Many different high quality microphones can be used to train the neural network 107 since the characteristics of different microphones make them more suitable for certain applications.
  • condenser microphones are better suited for high frequency applications such as recording a vocalist in an isolation booth, recording an acoustic guitar to capture definition, recording a group of singers, recording an acoustic piano, recording sound effects, or recording a podcast voice in a quiet or acoustically treated room.
  • dynamic microphones are better suited for low frequency applications such as recording drums, recording guitar amplifiers, recording multiple individuals’ voices sitting around a table, or recording one or more speakers on a stage when you need to avoid picking up other sounds.
  • varying types of microphones can be used to train the data models for varying applications.
  • a plurality of high quality microphones can be used simultaneously to train the model.
  • the neural network 107 can use distinct digital audio inputs 105 and 106 and learn the differences between the wave forms.
  • FIG. 5 depicts a schematic diagram of the digital audio processing system algorithm for upgrading a microphone.
  • a microphone use-case model, or representative model, is first determined (501). This can be done via a computational processor or as a user input.
  • the representative models include (but are not limited to): ameliorating microphone quality, polar pattern translation, microphone position modelling, noise cancellation, singing, conversational audio, lyrical, instrumental audio, etc.
  • Audio data from the use case is then obtained (502).
  • At least a first and a second microphone 103, 104 are then obtained (503). These microphones can differ in some way (such as quality, polar pattern, position, etc.). Any number of microphones can be used to generate the training data.
  • a form of audio is played on a speaker and recorded on microphones 103, 104.
  • the form of audio is ideally related to the use case (i.e., conversation, music, lyrical, noise cancellation, instrumental, etc.). It is also ideal if the diaphragms of the microphones 103, 104 are aligned.
  • the audio signals 105 and 106 can be obtained by simultaneously recording the audio on microphones 103 and 104 (see 504 of FIG. 5). As a result, at least two datasets 105, 106 are obtained from the recording captured by microphones 103, 104 (see 505 and 506 of FIG. 5).
  • the audio datasets 505 and 506 can then be used as inputs for a deep neural network 107 (see 507 of FIG. 5).
  • the neural network outputs a set of learned weights for the deep neural network layers (508).
  • Neural network architecture is implemented to learn processing (509).
  • the learned weights, or learned differences, can then be applied to new audio that has not been seen by the DNN (510).
  • the new audio may be in the form of a pre-recorded digital signal or real-time audio converted to a digital signal.
  • the new audio which may have been recorded on a microphone of lesser quality, or microphone of non-ideal characteristics, may be upgraded to sound like audio recorded on a high quality microphone, or a microphone of ideal characteristics.
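  • A minimal sketch of this upgrade step (510) is shown below. The UpgradeModel class, the weight file, and the audio file names are placeholders standing in for whatever network the training procedure actually produced; only the overall flow (load learned weights, run unseen audio through the model, write the result) follows the description above, and a mono recording is assumed.

```python
# Sketch of applying learned differences to audio the DNN has not seen.
# UpgradeModel is a placeholder for the trained network; file names are
# hypothetical.
import soundfile as sf
import torch

class UpgradeModel(torch.nn.Module):
    """Placeholder standing in for the trained microphone-translation DNN."""
    def __init__(self):
        super().__init__()
        self.filt = torch.nn.Conv1d(1, 1, kernel_size=65, padding="same", bias=False)

    def forward(self, x):
        return self.filt(x)

model = UpgradeModel()
model.load_state_dict(torch.load("mic_upgrade_weights.pt"))  # weights from step 508
model.eval()

audio, fs = sf.read("new_low_quality_recording.wav", dtype="float32")
with torch.no_grad():
    x = torch.from_numpy(audio).reshape(1, 1, -1)   # (batch, channel, time)
    upgraded = model(x).squeeze().numpy()

sf.write("upgraded_recording.wav", upgraded, fs)
```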
  • the deep neural network 107 training model expects to receive representative audio (502), consisting of the types of signals needed to perform the transformation. For instance, a microphone upgrade model for a music application would require singing voice and live instrument training data, while a microphone upgrade for meeting applications would require speech signals.
  • Sonic characteristics include time and frequency domain characteristics, frequencies, and signal amplitudes.
  • Various test signals comprising singular tones and complex combinations, at particular frequencies and for particular durations, can be used as input audio signals, as in the sketch below.
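  • The chosen frequencies, durations, and amplitudes in the following sketch are arbitrary examples rather than values prescribed by the system.

```python
# Illustrative test signals: single tones and one complex combination,
# at particular frequencies and for particular durations.
import numpy as np

fs = 48_000  # samples per second

def tone(freq_hz, duration_s, amplitude=0.5):
    t = np.arange(int(duration_s * fs)) / fs
    return (amplitude * np.sin(2 * np.pi * freq_hz * t)).astype(np.float32)

test_signals = {
    "tone_440Hz_2s": tone(440.0, 2.0),
    "tone_1kHz_2s": tone(1000.0, 2.0),
    # complex combination: a sum of several partials
    "complex_2s": sum(tone(f, 2.0, 0.2) for f in (220.0, 440.0, 880.0, 1320.0)),
}
```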
  • This system can be used for digital emulation of specific microphone models used for recording music.
  • the system may be used to upgrade a budget microphone to a top-of-the-line microphone.
  • the invention includes a neural net trained on different diaphragm-aligned microphones, and the real-time audio processing allowing the translation of a signal recorded with a first, lower-quality microphone to sound like it was recorded with a different, higher-quality microphone.
  • FIG. 6 depicts an embodiment of the application of the invention.
  • sound waves are captured by a new microphone 108.
  • Microphone 108 may be of lesser quality or have a different set of characteristics than an ideal microphone 104.
  • Microphone 108 ideally has similar characteristics as the microphone 103 that was used to train the model.
  • the audio 110 captured by microphone 108 can be imported to the digital audio processing system 100 taught by this invention.
  • the Digital Audio Processing System (DAPS) 100 would have been trained using the model of a high-quality, or ideal, microphone 104.
  • the model trained by the ideal microphone 104 can then be applied to the signal of the new microphone 108. This ameliorates the audio 110 recorded by the new microphone 108 as the model trained by ideal microphone 104 is applied to the audio 110 of new microphone 108.
  • the model is first trained by learning the difference between a less-than-ideal microphone 103 and an ideal microphone 104.
  • the output of the DAPS 100 is audio 111 which sounds like it came from an ideal microphone.
  • FIG. 7 depicts a further embodiment of the application of the invention.
  • sound waves 110 can be recorded with a small (or low quality) microphone 701 located within a mobile device.
  • the mobile device may include an application or program with the Digital Audio Processing System 100 on it.
  • the Digital Audio Processing System comprises a deep neural network which can take the sound waves captured from the low-quality microphone 701 and apply the trained model to those waves.
  • the resulting sound would be sound waves 111 having the characteristics of a higher quality microphone.
  • this process can be used by cell phone manufacturers to process Micro-Electro-Mechanical Systems (MEMS) microphone 701 signals to sound like a signal coming from a higher quality microphone.
  • MEMS: Micro-Electro-Mechanical Systems
  • the invention includes a neural net trained on audio recorded at different microphone positions, and the real time audio processing to allow signals recorded on small MEMS microphones 701 commonly found in mobile devices, to sound like a signal that was recorded with a high-quality large diaphragm microphone.
  • the mobile device may include an application or program with the Digital Audio Processing System 100 on it, which would allow phone conversations to be upgraded in real time, as sketched below.
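  • The sketch below illustrates how such block-wise, real-time processing might look on a device: each incoming block from the MEMS microphone is pushed through the trained network before being handed onward. The single convolution standing in for the trained network, the block size, and the sample rate are assumptions for illustration only.

```python
# Sketch of block-wise, real-time processing with the sounddevice library.
# A randomly initialized convolution stands in for the trained DNN; in a
# real deployment the learned weights would be loaded instead.
import sounddevice as sd
import torch

model = torch.nn.Conv1d(1, 1, kernel_size=65, padding="same", bias=False)
model.eval()
BLOCK = 256  # small frames keep latency low (cf. the frame-size discussion below)

def callback(indata, outdata, frames, time, status):
    with torch.no_grad():
        x = torch.from_numpy(indata[:, 0].copy()).reshape(1, 1, -1)
        outdata[:, 0] = model(x).squeeze().numpy()

with sd.Stream(samplerate=16_000, blocksize=BLOCK, channels=1, callback=callback):
    sd.sleep(10_000)  # run the upgraded audio path for ten seconds
```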
  • the DAPS 100 may be located on an external server such as a database or cloud server, wherein the input audio 110 can be exported, converted, and imported back to the mobile device. It can be understood that this model can be applied to other methods of verbal communication, such as real-time audio/video calls (FaceTime, Zoom, Teams, Skype, etc.), live audio streaming, live video recording/streaming, and the like.
  • FIG. 8A depicts a schematic diagram of a unidirectional polar pattern
  • FIG. 8B depicts a schematic diagram of a bi-directional polar pattern
  • FIG. 8C depicts a schematic diagram of an omnidirectional polar pattern.
  • the method taught herein can be used to translate a microphone with one specific polar pattern to a microphone with a distinct polar pattern (i.e., unidirectional microphone to a bidirectional microphone, etc.).
  • the invention includes a neural net trained on microphones with different polar pattern types, and the real time audio processing allowing the translation of a signal recorded with one polar pattern to a signal which sounds like it was recorded with a microphone having a different polar pattern.
  • this system can be used for microphone position modeling. The model can be conditioned by the microphone's distance from the source, as per common recording use cases.
  • the invention includes a neural net trained on different microphone positions, and the real-time audio processing allowing signals recorded at one distance to be translated into a signal that sounds like it was recorded at a different distance.
  • this system can be used for diaphragm frequency modelling. For instance, if a small diaphragm microphone is used, it may only be suited for high frequency applications. This same microphone may not be useful or ideal for low frequency applications. As such, the model can be trained by a large diaphragm microphone used for low frequency applications. The neural network can then be used to learn the difference between the small diaphragm audio signals and the large diaphragm audio signals. The learned differences can then be applied in the future to audio signals obtained from small diaphragm microphones, converting them to audio that sounds like it was recorded using a large diaphragm microphone.
  • FIGs. 9A and 9B show a schematic diagram of the model training completed by the Digital Audio Processing System 100.
  • the digital audio processing system 100 comprises two or more microphones 103, 104, an audio analog-to-digital converter 112, a digital audio workstation 113, and the recorded audio signals 105, 106.
  • FIG. 9B shows the continuation of the method of the Digital Audio Processing System 100.
  • the audio dataset 105 processed by the first microphone 103 and the audio dataset 106 processed by the second microphone 104, once recorded through an analog-to-digital converter 112 and into a digital audio workstation 113, can be used as digital audio inputs to the deep neural network 107.
  • the neural network 107 can use distinct digital audio inputs 105 and 106 and learn the differences between the wave forms.
  • the learned differences 114 between the waveforms are then saved into a database so they can be accessed later to apply the learned differences 114 to new audio from new microphones.
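  • As a simple illustration, the learned differences 114 can be persisted as a set of network weights and reloaded later; the sketch below uses a plain file and a stand-in module, whereas the system may equally store them in a database as described above.

```python
# Sketch of storing and retrieving the learned differences 114. A single
# convolution stands in for the trained network; the file name is
# hypothetical and a database could be used instead.
import torch

model = torch.nn.Conv1d(1, 1, kernel_size=65, padding="same", bias=False)
# ...training would happen here...
torch.save(model.state_dict(), "learned_differences_114.pt")

# Later, when audio from a new microphone 108 needs to be upgraded:
restored = torch.nn.Conv1d(1, 1, kernel_size=65, padding="same", bias=False)
restored.load_state_dict(torch.load("learned_differences_114.pt"))
```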
  • FIG. 10 depicts a schematic diagram of the deep neural network algorithm, as shown in FIG. 5.
  • a microphone use-case model, or representative model, is first determined (501). This can be done via a computational processor or as a user input.
  • the representative models include (but are not limited to): ameliorating microphone quality, polar pattern translation, microphone position modelling, singing, conversational audio, lyrical, instrumental audio, etc. Audio data from the use case is then obtained (502). At least a first and a second microphone 103, 104 are then obtained (503). These microphones can differ in some way (such as quality, polar pattern, position, etc.). Any number of microphones can be used to generate the training data.
  • a form of audio is played on a speaker and recorded on microphones 103, 104.
  • the form of audio is ideally related to the use case (i.e., conversation, music, lyrical, noise cancellation, instrumental, etc.). It is also ideal if the diaphragms of the microphones 103, 104 are aligned.
  • the audio signals 105 and 106 can be obtained by simultaneously recording the audio on microphones 103 and 104 (see 504 of FIG. 10). As a result, at least two datasets 105, 106 are obtained from the recording captured by microphones 103, 104 (see 505 and 506 of FIG. 10).
  • the audio datasets 505 and 506 can then be used as inputs for a deep neural network 107 (see 507 of FIG. 10).
  • Step 507 is explained as follows.
  • the model is divided into three parts: adaptive frontend, synthesis back-end and latent-space DNN.
  • the architecture is designed to model nonlinear audio effects with short-term memory and is based on a parallel combination of cascade input filters, trainable wave-shaping nonlinearities, and output filters.
  • the audio can be sampled with an acceptable hop size ranging between 2 and 8192; ideally, the hop size is 256, 512, or 1024.
  • the model training sampling rate (samples per second) can range between 8 and 192 kHz.
  • the new input audio may also have a sampling rate that matches that of the sampling rate used for model training.
  • a larger frame size will result in more frequency resolution, but less time resolution.
  • a smaller frame size will result in lower frequency resolution, but higher time resolution.
  • Different applications require varying levels of frequency resolution and time resolution. For instance, if the audio processing needs to run in real time, a smaller frame size should be used. If the audio processing needs high frequency resolution, a larger frame size should be used. As such, the frame size can be chosen based on the application, as illustrated in the sketch below.
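  • The sketch below makes this trade-off concrete by slicing a waveform into overlapping frames; the specific frame and hop values are examples only.

```python
# Slicing audio into overlapping frames: a larger frame gives finer
# frequency resolution, while a smaller frame (and hop) gives finer time
# resolution and lower latency.
import numpy as np

def frame_signal(x, frame_size=1024, hop_size=256):
    """Return overlapping frames of x as a (num_frames, frame_size) array."""
    num_frames = 1 + (len(x) - frame_size) // hop_size
    return np.stack([x[i * hop_size: i * hop_size + frame_size]
                     for i in range(num_frames)])

x = np.random.randn(48_000).astype(np.float32)                        # one second of audio
frames_low_latency = frame_signal(x, frame_size=256, hop_size=64)     # real-time use
frames_high_freq_res = frame_signal(x, frame_size=4096, hop_size=1024)
```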
  • the DNN can be pre-set with all the ideal parameters for one application, such as for OEM (original equipment manufacturer) MEMS microphones.
  • the parameters can be left open to be chosen and set by the user.
  • the adaptive front-end can comprise a convolutional encoder. It can contain two convolutional layers, one pooling layer, and one residual connection. The front-end can be considered adaptive since its convolutional layers learn a filter bank for each modeling task directly from the first microphone input audio dataset 105.
  • the first convolutional layer is followed by the absolute value as the nonlinear activation function, and the second convolutional layer is locally connected (LC). This means the architecture follows a filter bank approach, since each filter is only applied to its corresponding row in the input feature map. The latter layer is followed by the softplus nonlinearity.
  • the max-pooling layer is a moving window layer, where the maximum value within each window corresponds to the output and the positions of the maximum values are stored and used by the back-end. The operation performed by the first layer is shown in FIG. 12B (equations 1.2 and 1.3).
  • W1 represents the kernel matrix from the first layer
  • X1 represents the feature map after the input audio x is convolved with W1.
  • the weights W1 may comprise any number of one-dimensional filters having a size between 2 and 512, ideally 64.
  • the residual connection R is equal to X1, which corresponds to the frequency band decomposition of the input x. This is because the output of each filter of the convolutional layer can be seen as a frequency band.
  • the operation performed by the second layer is described by equation 1.4 shown in FIG. 12B. Equation 1.4 (see FIG. 12B) shows an example where the filter size is 128.
  • X2(i) and W2(i) are the ith row of the feature map X2 and kernel matrix W2, respectively.
  • X2 is obtained after the LC convolution with W2, the weight matrix of the locally connected convolutional layer, which, in this example, has 128 filters of size 128.
  • f2(·) is the softplus function.
  • the adaptive front-end performs time-domain convolutions with the first microphone 103 input audio dataset 105 and is designed to learn a latent representation for each audio effect modeling task. It also generates a residual connection, which is used by the back-end to facilitate the synthesis of the waveform based on the specific audio effect transformation. A rough sketch of this front-end follows.
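  • Below is a rough PyTorch sketch of such an adaptive front-end, under several assumptions: 128 filters, a 65-tap first-layer kernel (the text suggests roughly 64; an odd length is used here so the transposed reconstruction in the back-end preserves the frame length), and a depthwise convolution standing in for the locally connected layer. It illustrates the described structure and is not the patented implementation itself.

```python
# Rough sketch of the adaptive front-end: Conv1d filter bank -> absolute
# value -> locally connected (here depthwise) convolution -> softplus ->
# max pooling that stores its positions for the back-end, plus residual R.
import torch
import torch.nn as nn

class AdaptiveFrontEnd(nn.Module):
    def __init__(self, n_filters=128, kernel1=65, kernel2=128, pool=16):
        super().__init__()
        self.conv1 = nn.Conv1d(1, n_filters, kernel1, padding="same", bias=False)
        # depthwise conv (groups = n_filters): each filter only sees its own
        # row of X1, approximating the locally connected (LC) layer
        self.conv2 = nn.Conv1d(n_filters, n_filters, kernel2,
                               padding="same", groups=n_filters, bias=False)
        self.pool = nn.MaxPool1d(pool, return_indices=True)

    def forward(self, x):                 # x: (batch, 1, time)
        x1 = self.conv1(x)                # X1: input x convolved with kernels W1
        r = x1                            # residual connection R = X1 (frequency bands)
        x2 = nn.functional.softplus(self.conv2(torch.abs(x1)))  # LC layer, eq. 1.4
        z, idx = self.pool(x2)            # latent Z plus stored max positions
        return z, idx, r
```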
  • the latent-space DNN contains two fully-connected (FC) layers.
  • the first layer is based on LC layers and the second layer comprises a FC layer.
  • the DNN modifies the latent representation Z into a new latent representation Z" which is fed into the synthesis back-end.
  • the first layer applies a different FC layer to each row of the matrix Z and the second layer is applied to each row of the output matrix from the first layer.
  • the number of hidden units is calculated as half the filter size (for example, if the filter size is 128, the number of hidden units would be 64). The hidden units are followed by an activation function such as softplus, tanh, ReLU, etc., which can be applied to the complete latent representation rather than to the channel dimension.
  • The operation performed by the latent-space DNN is shown by equations 1.5 and 1.6 (see FIG. 12B), where Zh"(i) is the ith row of the output feature map Zh" of the LC layers.
  • V1(i) is the ith FC layer corresponding to the weight matrix V1 of the LC layer.
  • V2 corresponds to the weights of the FC layer.
  • the output of the max pooling operation Z corresponds to an optimal latent representation of the input audio.
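  • A rough sketch of the latent-space DNN is given below: a locally connected first layer that applies a different dense transform to each row of Z, followed by a shared fully connected layer, each followed by a softplus activation. The row count, row width, and hidden size (half the filter size, e.g. 64 for 128 filters) are assumptions chosen to match the front-end sketch above.

```python
# Rough sketch of the latent-space DNN operating on Z of shape
# (batch, rows, width). Sizes assume 128 filter rows and 1024-sample
# frames pooled by 16 (width 64); hidden units are half the filter size.
import torch
import torch.nn as nn

class LatentDNN(nn.Module):
    def __init__(self, n_rows=128, width=64, hidden=64):
        super().__init__()
        # one weight matrix per row of Z -> locally connected first layer (V1)
        self.local_w = nn.Parameter(0.01 * torch.randn(n_rows, width, hidden))
        self.fc = nn.Linear(hidden, width)  # shared fully connected layer (V2)

    def forward(self, z):
        zh = nn.functional.softplus(torch.einsum("brw,rwh->brh", z, self.local_w))
        return nn.functional.softplus(self.fc(zh))  # modified latent Z"
```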
  • the synthesis back-end accomplishes the nonlinear task by the following steps. First, X2", the discrete approximation of X2, is obtained by unpooling the modified envelopes Z". Then the feature map X1" is the result of the element-wise multiplication of the residual connection R and X2". This can be seen as an input filtering operation, since a different envelope gain is applied to each of the frequency band decompositions obtained in the front-end.
  • the deep neural network smooth adaptive activation functions (DNN-SAAF) step applies various wave-shaping nonlinearities to X1". This is achieved with a processing block containing dense layers and smooth adaptive activation functions. In one embodiment, the DNN-SAAF comprises 4 fully connected layers. However, it can be appreciated that the DNN-SAAF can be any number of layers.
  • each function can be locally connected and composed of between 2 and 100 intervals (ideally 9 to 25) spanning the range -1 to +1.
  • the deconvolution layer corresponds to the deconvolution operation, which can be implemented by transposing the first layer transform.
  • This layer is not trainable since its kernels are transposed versions of W1.
  • the back-end reconstructs the audio waveform in the same manner that the front-end decomposed it.
  • the complete waveform can be synthesized using windowing and constant overlap-add gain.
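  • A rough sketch of such a back-end is shown below; the unpooling, per-band gain, and transposed-kernel reconstruction follow the description above, while a small tanh block merely stands in for the DNN-SAAF wave-shaper and the final windowing/overlap-add stage is omitted.

```python
# Rough sketch of the synthesis back-end: unpool Z" with the stored max
# positions, gate the residual frequency bands, apply a wave-shaping block
# (a tanh MLP standing in for DNN-SAAF), then reconstruct one waveform
# channel with the transposed, weight-tied front-end kernels.
import torch
import torch.nn as nn

class SynthesisBackEnd(nn.Module):
    def __init__(self, frontend, n_filters=128, pool=16):
        super().__init__()
        self.frontend = frontend                   # provides the W1 kernels
        self.unpool = nn.MaxUnpool1d(pool)
        self.shaper = nn.Sequential(               # stand-in for DNN-SAAF
            nn.Conv1d(n_filters, n_filters, 1), nn.Tanh(),
            nn.Conv1d(n_filters, n_filters, 1), nn.Tanh())

    def forward(self, z_mod, idx, r):
        x2_hat = self.unpool(z_mod, idx)           # discrete approximation of X2
        x1_hat = r * x2_hat                        # per-band envelope gain
        y = self.shaper(x1_hat)                    # wave-shaping nonlinearities
        w1 = self.frontend.conv1.weight            # shape (n_filters, 1, kernel1)
        pad = w1.shape[-1] // 2
        return nn.functional.conv_transpose1d(y, w1, padding=pad)
```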
  • the DNN can optimize the loss function to determine the difference between the audio waveform 105 that was processed by the DNN 507 and the second microphone audio dataset 106.
  • the first microphone audio dataset 105 is processed by the neural network.
  • the second microphone audio dataset 106 is the ideal audio dataset, and thus, is not processed by the DNN.
  • the number of iterations can be arbitrary; training stops once the loss function is minimized (see the training sketch below).
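  • Tying the sketches above together, an optimization loop of the kind described might look as follows; the random stand-in frames, the L1 waveform loss, the optimizer, and the fixed iteration count are illustrative assumptions (in practice, paired frames from datasets 105 and 106 are used, and training stops once the loss no longer improves).

```python
# Illustrative optimization loop combining the three sketches above.
# Random tensors stand in for paired 1024-sample frames from dataset 105
# (input) and dataset 106 (target).
import torch

frontend = AdaptiveFrontEnd()
latent = LatentDNN()
backend = SynthesisBackEnd(frontend)          # reuses the front-end kernels

params = (list(frontend.parameters()) + list(latent.parameters())
          + list(backend.shaper.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)
loss_fn = torch.nn.L1Loss()                   # one possible waveform-domain loss

for step in range(1000):
    x_low = torch.randn(8, 1, 1024)           # stand-in frames from dataset 105
    x_ref = torch.randn(8, 1, 1024)           # matching frames from dataset 106
    z, idx, r = frontend(x_low)
    y = backend(latent(z), idx, r)            # DNN-processed microphone-103 audio
    loss = loss_fn(y, x_ref)                  # compared against microphone-104 audio
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```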
  • the neural network then outputs a set of learned weights for the deep neural network layers (508).
  • Neural network architecture is implemented to learn processing (509).
  • the learned weights W, or learned differences, can then be applied to new audio that has not been seen by the DNN (510).
  • the new audio may be in the form of a pre-recorded digital signal or real-time audio converted to a digital signal.
  • the new audio which may have been recorded on a microphone of lesser quality, or microphone of non-ideal characteristics, may be upgraded to sound like audio recorded on a high quality microphone, or a microphone of ideal characteristics.
  • The process taught by FIG. 11 can be used to model the differences between, and upgrade, speakers.
  • a reference microphone can be used to record the same audio dataset through two different loudspeakers. It can be appreciated that it is ideal to keep all other acoustic variables the same.
  • the first speaker audio can be captured with the reference microphone, and its audio signals processed to a first dataset.
  • the second speaker audio can simultaneously be captured with the reference microphone, and its audio signals processed to a second dataset.
  • the neural network can be used to learn the difference between these datasets. The learned differences can then be applied to new speaker sounds in real time or recordings. It can be appreciated that the deep neural network learns transformations, and as such, it is possible to modify old recordings through conventional signal processing.
  • FIG. 11 depicts a schematic diagram of the digital audio processing system algorithm for upgrading a speaker.
  • a speaker use-case model, or representative model, is first determined (601). This can be done via a computational processor or as a user input.
  • the representative models include (but are not limited to): ameliorating speaker quality, interference pattern, speaker position modelling, singing, conversational audio, lyrical, instrumental audio, etc.
  • Audio data from the use case is then obtained 602.
  • At least a first and a second speaker are then obtained (603). These speakers can differ in some way (such as quality, interference pattern, position, etc.). Any number of speakers can be used to generate the training data.
  • a form of audio is played on at least two speakers and recorded with a microphone.
  • the form audio is ideally related to the use case (i.e. conversation, music, lyrical, noise cancellation, instrumental, etc.).
  • the audio signals can be obtained by simultaneously recording the audio played on the two speakers (see 604). As a result, at least two datasets are obtained from the recordings captured from the speakers (see 605 and 606).
  • the differing audio datasets can then be used as an input for a deep neural network 107 (see 607).
  • the neural network outputs a set of learned weights for the deep neural network layers (608).
  • Neural network architecture is implemented to learn processing (609). The learned weights, or learned differences, can then be applied to new audio that has not been seen by the DNN (610).
  • the new audio may be in the form of a pre-recorded digital signal or real-time audio converted to a digital signal.
  • the new audio which may be playing on a lesser quality speaker, or speaker of non-ideal characteristics, may be upgraded to sound like audio being played on a high quality speaker, or a speaker of ideal characteristics.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A method of improving an audio signal is described. The method comprises: outputting an audio waveform from a sound source; capturing the audio waveform from a first microphone and capturing the audio waveform from a second microphone capsule aligned beside the first microphone; and sending the captured audio waveforms to a digital audio processing system having a neural network. The neural network is configured to learn differences between the first audio waveform and the second audio waveform. The audio signals processed from the first microphone differ from the audio signals processed from the second microphone. The sound source may comprise a curated data set consisting of test signals and representative audio signals.
PCT/CA2022/051637 2021-11-04 2022-11-04 Système et procédé d'amélioration d'un signal audio WO2023077237A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163275636P 2021-11-04 2021-11-04
US63/275,636 2021-11-04

Publications (1)

Publication Number Publication Date
WO2023077237A1 true WO2023077237A1 (fr) 2023-05-11

Family

ID=86240459

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2022/051637 WO2023077237A1 (fr) 2021-11-04 2022-11-04 Système et procédé d'amélioration d'un signal audio

Country Status (1)

Country Link
WO (1) WO2023077237A1 (fr)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190261914A1 (en) * 2013-04-18 2019-08-29 Digimarc Corporation Physiologic audio methods and arrangements
US20200367810A1 (en) * 2017-12-22 2020-11-26 Resmed Sensor Technologies Limited Apparatus, system, and method for health and medical sensing
US20200211540A1 (en) * 2018-12-27 2020-07-02 Microsoft Technology Licensing, Llc Context-based speech synthesis


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22888691

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE