WO2019097227A1 - Generation of sound synthesis models - Google Patents

Generation of sound synthesis models

Info

Publication number
WO2019097227A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
audio
generating
synthesised
model
Prior art date
Application number
PCT/GB2018/053299
Other languages
French (fr)
Inventor
David Moffat
Joshua Reiss
Original Assignee
Queen Mary University Of London
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Queen Mary University Of London filed Critical Queen Mary University Of London
Publication of WO2019097227A1 publication Critical patent/WO2019097227A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00

Abstract

Disclosed herein is a method of generating a sound synthesis model of an audio signal, the method comprising: receiving an audio signal; identifying audio events in the received audio signal; separating each of the identified audio events into audio sub-bands; generating a model of each of the audio sub-bands; applying a machine learning technique to each generated model of each of the audio sub-bands so as to determine a plurality of different types of audio event in the received audio signal; generating, for each of the determined plurality of different types of audio event, a model of the type of audio event; and generating a sound synthesis model of the received audio signal in dependence on each of the generated models for types of audio event.

Description

GENERATION OF SOUND SYNTHESIS MODELS
Field
The field of the invention is the synthesis of sounds. Embodiments use machine learning to automatically generate sound effect models for synthesising sounds. Advantageously, sound effect models are easily generated.
Background
Sound effects have a number of applications in creative industries. There are libraries of sound effects that store pre-recorded sound samples. A problem experienced by libraries of sound effects is that each library must store a very large number of pre-recorded sound samples in order for a broad range of sound effects to be generated. In addition, adapting pre-recorded sound effects for use in different scenarios is difficult.
There is therefore a need to improve known techniques for generating sound effects.
Summary
According to a first aspect of the invention, there is provided a method of generating a sound synthesis model of an audio signal, the method comprising: receiving an audio signal; identifying audio events in the received audio signal; separating each of the identified audio events into audio sub-bands; generating a model of each of the audio sub-bands; applying a machine learning technique to each generated model of each of the audio sub-bands so as to determine a plurality of different types of audio event in the received audio signal; generating, for each of the determined plurality of different types of audio event, a model of the type of audio event; and generating a sound synthesis model of the received audio signal in dependence on each of the generated models for types of audio event.
Preferably, the machine learning technique is an unsupervised clustering technique. Preferably, each model for a type of audio event is a probability distribution.
Preferably, the method further comprises: separating the received audio signal into a foreground component and a background component; wherein said identification of audio events in the received audio signal is only performed on the foreground component of the received audio signal; and the sound synthesis model of the received audio signal is generated in dependence on the background component.
Preferably, separating the received audio signal into a foreground component and a background component comprises performing a Short Time Fourier Transform.
Preferably, the model of each of the audio sub-bands is generated by using a gamma distribution model and/or a polynomial regression model.
Preferably, the method further comprises determining the principal components of each of the audio sub-bands; wherein the machine learning technique is applied in dependence on the determined principal components.
Preferably, the received audio signal is a plurality of audio signals.
Preferably, the method is computer-implemented.
Preferably, the method is implemented by software modules of a computing device.
According to a second aspect of the invention, there is provided a computer program comprising instructions that, when executed by a computing device, cause the computing device to perform the method according to the first aspect.
According to a third aspect of the invention, there is provided a computing device arranged to perform the method according to the first aspect.
According to a fourth aspect of the invention, there is provided a method of synthesising an audio signal, the method comprising: receiving a model for generating a synthesised background component of an audio signal; receiving a plurality of models for generating a synthesised foreground component of the audio signal, wherein each of the plurality of models is for a different type of audio event; receiving control parameters of the models; generating a synthesised background component of the audio signal in dependence on the model for generating a synthesised background component of the audio signal and one or more of the control parameters; and generating a synthesised foreground component of the audio signal in dependence on the received plurality of models for generating a synthesised foreground component of the audio signal and one or more of the control parameters; wherein generating the synthesised foreground component of the audio signal comprises: generating vectors by sampling each of the models for generating a synthesised foreground component of an audio signal; generating sub-band envelopes in dependence on the generated vectors; and applying the envelopes to one or more sub-band filtered probability density distributions.
Preferably, the received model for generating a synthesised background component and/or plurality of models for generating a synthesised foreground component are generated according to the method of the first aspect.
Preferably, the received model for generating a synthesised background component is a filtered probability density distribution.
Preferably, the method further comprises expanding the generated vectors using an inverse principal components analysis; wherein the sub-band envelopes are generated in dependence on the expanded vectors.
Preferably, the control parameters for the models for generating a synthesised foreground component comprise one or more of density, density distribution, gain, gain distribution, timbral and tonal control parameters.
Preferably, the audio signal is synthesised substantially in real-time.
Preferably, one or more of the model for generating a synthesised background component of an audio signal, the plurality of models for generating a synthesised foreground component of the audio signal and the control parameters of the models are received via a communications network; and the synthesised background and foreground components of the audio signal are transmitted over the communications network.
Preferably, the received audio signal is a plurality of audio signals.
Preferably, the method is computer-implemented.
Preferably, the method is implemented by software modules of a computing device.
According to a fifth aspect of the invention, there is provided a computer program comprising instructions that, when executed by a computing device, cause the computing device to perform the method according to the fourth aspect.
According to a sixth aspect of the invention, there is provided a computing device arranged to perform the method according to the fourth aspect.
List of Figures
Figure 1 shows processing steps in sound synthesis model generation and audio signal generation techniques according to embodiments.
Description
Embodiments provide a sound synthesis system that uses sound effect models to synthesise sounds. This improves on the use of sound effect libraries since large libraries of pre-recorded sounds are not required. The control of each generated sound effect by a user is also improved. Embodiments also provide techniques for automatically generating sound effect models for use by sound synthesis systems. Advantageously, the generation of sound effect models according to embodiments is quick and accurate.
According to embodiments, machine learning techniques are used to process audio signals in order to generate a framework of sound effect models. The audio signals that the sound effect models are generated in dependence on may be pre-recorded sounds from a sound effect library.
According to embodiments, an audio signal, which may be any type of audio source/sample and/or a combination of audio signals, is first separated into background and foreground components. The background component is modelled as comprising a constant filtered noise signal. The foreground component is modelled as comprising a number of regular sound events. The sound events are then identified, separated and analysed independently. A model representation of each sound event is then created based on analysis of the time and frequency properties of that sound event. These individual models are then grouped together into clusters by an unsupervised machine learning clustering system that identifies a number of sound categories, i.e. sound types. That is to say, each cluster corresponds to a different sound category/type. Each of the clusters is then modelled, with each parameter for recreating a sound being obtained from a probability distribution, such as a Gaussian distribution.
The modelled clusters and background component can be used as sound synthesis models for synthesising the modelled sound. In a sound synthesis system, the overall controls for each cluster are presented to the user who can change, for example, the volume, rate and synchronisation of each of the clusters as well as the background component. A user can also control the timbral and tonal properties of the synthesised sound. A controllable synthesised audio signal is therefore generated that is similar to the original audio signal.
Embodiments are described in more detail below. Embodiments comprise the separate techniques of sound synthesis model generation and audio signal generation.
Sound synthesis model generation comprises the generation of sound synthesis models for modelling one or more audio signals. Sound synthesis model generation is a processing operation performed on one or more audio signals and is a pre-processing step performed prior to sound synthesis.
Audio signal generation, also referred to as sound signal generation or audio/sound synthesis, comprises the generation of one or more audio signals. The audio signal generation is a real-time synthesis approach. The audio signal generation may be a recreation of an audio signal that has been modelled by the sound synthesis model generation.
The sound synthesis model generation techniques according to embodiments can be performed offline whereas the audio signal generation according to embodiments is preferably performed online and substantially in real-time.
Figure 1 shows the processing steps in the sound synthesis model generation and audio signal generation techniques according to embodiments. The sound synthesis model generation processes are shown outside of the dashed box and the audio signal generation techniques are shown inside the dashed box.
The sound synthesis model generation is described in detail below.
The processes for sound synthesis model generation are performed by a model generation system. The model generation system may be any type of computing system. The components of the model generation system may be implemented as software modules.
An audio signal is input into the model generation system. The audio signal may be from any type of sound source and it may be a combination of a plurality of audio signals from a variety of sound sources. The model generation system then analyses the input audio signal. In the analysis process, the background component of the audio signal is separated from the input audio signal.
The separation of the background component may be performed by, for example, median filtering of the Short Time Fourier Transform (STFT). However, embodiments also include other techniques for separating the background component. The median spectrum of the background component is calculated. The modelling technique uses the median spectrum of the background component as a fixed, i.e. constant, background spectrum.
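By way of illustration, the sketch below shows one possible implementation of this separation step using scipy's STFT and a per-bin median filter across time; the frame length, median kernel size and magnitude-only foreground subtraction are assumptions made for the example rather than values specified in the description.

```python
import numpy as np
from scipy.signal import stft, istft, medfilt

def separate_background(x, fs, nperseg=2048, med_frames=31):
    """Split a signal into foreground/background via median filtering of the STFT."""
    _, _, X = stft(x, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(X), np.angle(X)

    # Median-filter each frequency bin over time: only slowly varying
    # (background) energy survives the filter.
    bg_mag = np.stack([medfilt(mag[k], kernel_size=med_frames)
                       for k in range(mag.shape[0])])
    fg_mag = np.maximum(mag - bg_mag, 0.0)

    # Constant background spectrum used by the modelling stage.
    median_spectrum = np.median(bg_mag, axis=1)

    # Resynthesise both components with the original phase.
    _, background = istft(bg_mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    _, foreground = istft(fg_mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return foreground, background, median_spectrum
```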
The foreground component of the input audio signal is the rest of the audio signal after the background component has been removed. The foreground component can be generated by the same process that separates the background component from the input audio signal. A difference between the foreground component and the background component is that the foreground component is not constant and the foreground component can change.
The foreground component of the input audio signal is modelled as comprising individual audio events that are identifiable as peaks in the amplitude spectrum of the foreground component. For example, the start and the end of an audio event can be detected through local minima identification of the amplitude spectrum using the techniques disclosed in 'Sadjad Siddiq. Morphing of granular sounds. In Proceedings of the 18th International Conference on Digital Audio Effects (DAFx-15), pages 4-11, 2015' and 'Sadjad Siddiq. Data-driven granular synthesis. In Audio Engineering Society Convention 142, May 2017', the entire contents of both of which are incorporated herein by reference. Each of the identifiable audio events can be separately processed. As shown in Figure 1, the number of identified audio events is referred to herein as 'n'.
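A minimal sketch of this event segmentation is given below, assuming a Hilbert-magnitude envelope, a short smoothing window and a minimum gap between events; the cited Siddiq techniques may differ in detail.

```python
import numpy as np
from scipy.signal import hilbert, find_peaks

def detect_events(foreground, fs, min_gap_s=0.05, smooth_s=0.01):
    """Segment the foreground into audio events at local minima of its envelope."""
    # Amplitude envelope: Hilbert magnitude, lightly smoothed.
    env = np.abs(hilbert(foreground))
    win = max(1, int(smooth_s * fs))
    env = np.convolve(env, np.ones(win) / win, mode="same")

    # Event boundaries are local minima of the envelope (peaks of -env).
    minima, _ = find_peaks(-env, distance=int(min_gap_s * fs))
    boundaries = np.concatenate(([0], minima, [len(foreground)]))

    # Consecutive boundaries delimit one candidate audio event each.
    return [foreground[a:b] for a, b in zip(boundaries[:-1], boundaries[1:])]
```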
The model generation system then processes each audio event separately. Each audio event is decomposed into a number of individual audio sub-bands. The decomposition of the audio events into sub-bands may be based on the Equivalent Rectangular Bandwidth (ERB) scale; however, embodiments include using other techniques for the decomposition of the audio events into sub-bands. The decomposition may, for example, be performed using the techniques disclosed in 'Brian R Glasberg and Brian CJ Moore. Derivation of auditory filter shapes from notched-noise data. Hearing research, 47(1): 103-138, 1990', the entire contents of which are incorporated herein by reference. As shown in Figure 1, the number of sub-bands for each of the 'n' events is referred to herein as 's'.
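The sketch below illustrates one possible ERB-spaced decomposition using the Glasberg and Moore ERB-rate scale; the number of bands, the filter order and the choice of Butterworth band-pass filters are assumptions of the example, and it assumes each event is at least a few hundred samples long.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def erb_band_edges(f_low, f_high, n_bands):
    """n_bands + 1 band edges equally spaced on the Glasberg-Moore ERB-rate scale."""
    # ERB-rate: E(f) = 21.4 * log10(1 + 0.00437 * f)
    e_low, e_high = (21.4 * np.log10(1 + 0.00437 * f) for f in (f_low, f_high))
    e = np.linspace(e_low, e_high, n_bands + 1)
    return (10.0 ** (e / 21.4) - 1) / 0.00437

def erb_subbands(event, fs, n_bands=20, f_low=50.0):
    """Decompose one audio event into ERB-spaced sub-bands ('s' bands per event)."""
    edges = erb_band_edges(f_low, 0.45 * fs, n_bands)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        bands.append(sosfiltfilt(sos, event))
    return np.stack(bands)          # shape (n_bands, len(event))
```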
Each of the sub-bands has an envelope. The next processing step generates an approximate model of each individual envelope. For example, each envelope may be modelled by a gamma distribution model and/or each envelope may be modelled as a polynomial using a polynomial regression model. The model may provide a vector representation of the envelope of each sub-band for each audio event. This can be considered to be a vector, referred to as a sound event vector, for each of the sound events. The principal components of each of the sound event vectors may then be taken, so as to perform dimensionality reduction.
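The following sketch builds one sound event vector by fitting each sub-band envelope with a polynomial and a gamma distribution. The description does not specify how the gamma model parameterises the envelope, so the gamma fit over the envelope samples and the polynomial order used here are placeholders.

```python
import numpy as np
from scipy.signal import hilbert
from scipy.stats import gamma

def event_vector(subbands, poly_order=5):
    """One fixed-length vector per sound event: envelope model parameters per sub-band."""
    features = []
    for band in subbands:
        env = np.abs(hilbert(band))
        t = np.linspace(0.0, 1.0, len(env))

        # Polynomial regression of the envelope shape over normalised time.
        poly_coeffs = np.polyfit(t, env, poly_order)

        # Placeholder gamma model: fit a gamma distribution to the envelope samples.
        a, _, scale = gamma.fit(env + 1e-12, floc=0.0)

        features.append(np.concatenate([poly_coeffs, [a, scale, env.max()]]))
    return np.concatenate(features)     # the 'sound event vector'
```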
A machine learning technique is then performed. The machine learning technique is preferably an unsupervised clustering technique. The machine learning technique determines a number of different types of sound event. The machine learning technique can determine the optimal number of different types of sound event that were identified in the separated foreground component of the input audio signal. The machine learning technique can be based on any of a number of techniques in machine learning, such as neural network techniques. The output from the machine learning technique is one or more models for each of a plurality of different types of audio event.
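As an example of this step, the sketch below reduces the sound event vectors with PCA and chooses the number of sound types by silhouette score; k-means and the silhouette criterion are assumptions, since the description does not name a particular clustering algorithm or model-selection rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def cluster_events(event_vectors, max_clusters=8, n_components=10):
    """Reduce sound event vectors with PCA, then find the number of sound types."""
    event_vectors = np.asarray(event_vectors)
    pca = PCA(n_components=min(n_components,
                               len(event_vectors) - 1,
                               event_vectors.shape[1]))
    X = pca.fit_transform(event_vectors)

    # Model selection: try several cluster counts and keep the best
    # silhouette score (one possible criterion; the patent names none).
    best = (2, None, -1.0)
    for k in range(2, min(max_clusters, len(X) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best[2]:
            best = (k, labels, score)
    n_types, labels, _ = best
    return n_types, labels, X, pca
```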
As shown in Figure 1, the number of outputs from the unsupervised clustering technique, i.e. clusters of models for a type of sound event, is 'm'. A generalisable model is then generated, in dependence on these clusters of models, where each parameter can be considered to be a value sampled from any of a number of probability distributions. For example, a Gaussian distribution may be used to represent all of the cluster parameters.
Accordingly, a model is generated for each of the 'm' clusters. The models may be considered to be sound synthesis models since, as explained below, the generated model for each of the 'm' clusters, in addition to other data generated in the sound synthesis model generation techniques by the model generation system, can be used as inputs to a real-time sound synthesis system.
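A sketch of the per-cluster modelling, following the Gaussian example above: each cluster of reduced event vectors is summarised by a mean and covariance, assuming every cluster contains at least two events.

```python
import numpy as np

def model_clusters(X, labels):
    """Summarise each cluster ('m' of them) as a Gaussian over its reduced vectors."""
    models = {}
    for k in np.unique(labels):
        members = X[labels == k]
        mean = members.mean(axis=0)
        # Small diagonal regularisation keeps the covariance usable for sampling.
        cov = np.cov(members, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        models[int(k)] = (mean, cov)
    return models
```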
Embodiments also include the generation of a synthesised audio signal by a sound synthesis system. The audio signal may be generated in real-time.
The generated audio signal may be one or more audio signals that have been modelled by the above-described model generation system according to embodiments. However, the sound synthesis system according to embodiments may also be used to generate audio signals that have been obtained by a different sound synthesis model generation technique. The sound synthesis system according to embodiments may also be used to generate audio signals that are not a recreation of a modelled specific audio signal but instead constructed from different modelled components from a plurality of modelled audio signals. For example, the sound synthesis system may generate a broad range of audio signals from modelled components that each correspond to a different class of sound.
The processes performed by the sound synthesis system according to embodiments are shown within the dashed line in Figure 1.
The sound synthesis system according to embodiments generates a synthesised audio signal from the following inputs to the sound synthesis system:
- A spectrum model of the background component of the audio signal being synthesised
- The number of clusters produced by the unsupervised clustering technique, i.e. the 'm' clusters of models, and the model of each of the 'm' clusters
- User-defined inputs from which control parameters of the components of the sound synthesis process can be generated.
If any of the user-defined inputs are not received, the sound synthesis system uses default values for those inputs when generating the synthesised audio signal. The synthesis process uses the received spectrum model of the background component to generate a synthesised background sound. This can be performed by filtering Gaussian White Noise (GWN) so that it has the same spectral properties as the spectrum model of the background component. A gain control is provided for the synthesised background component.
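The following sketch shows one way to synthesise the background component by spectral shaping of GWN; it assumes the stored median spectrum has the length of an rfft of one analysis frame (as in the separation sketch above), and the overlap-add framing is an implementation choice.

```python
import numpy as np

def synthesise_background(median_spectrum, n_samples, nperseg=2048, gain=1.0, seed=None):
    """Shape Gaussian white noise so its spectrum follows the stored median spectrum."""
    rng = np.random.default_rng(seed)
    hop = nperseg // 2
    window = np.hanning(nperseg)
    out = np.zeros(n_samples + nperseg)

    # Frame-by-frame spectral shaping with 50% overlap-add.
    for start in range(0, n_samples, hop):
        noise = rng.standard_normal(nperseg) * window
        shaped = np.fft.irfft(np.fft.rfft(noise) * median_spectrum, n=nperseg)
        out[start:start + nperseg] += shaped * window
    return gain * out[:n_samples]
```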
For each model of the 'm' clusters, the parameter controls preferably include all of density, density distribution, gain and gain distribution. The parameter controls allow the properties of each of the clusters in the sound synthesis process to be changed.
After each parameter has been set, models of each cluster, which are probability distributions, are sampled using a probabilistically triggered Monte Carlo sampling method and cluster parameter vectors are obtained.
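A sketch of one possible probabilistically triggered sampling scheme is given below: event times follow a Poisson process whose rate is the density control, and each trigger draws a parameter vector from a cluster's Gaussian model. The uniform choice of cluster and the log-normal per-event gain are assumptions of the example.

```python
import numpy as np

def sample_events(models, duration_s, density_hz=4.0, gain=1.0, gain_spread=0.25, seed=None):
    """Probabilistically trigger sound events and Monte Carlo sample their parameters."""
    rng = np.random.default_rng(seed)
    triggers = []
    t = 0.0
    while True:
        # Poisson-process trigger times: the 'density' control sets the rate.
        t += rng.exponential(1.0 / density_hz)
        if t >= duration_s:
            break
        k = int(rng.integers(len(models)))              # which sound type to emit
        mean, cov = models[k]
        vector = rng.multivariate_normal(mean, cov)     # sampled cluster parameters
        event_gain = gain * rng.lognormal(0.0, gain_spread)
        triggers.append((t, k, vector, event_gain))
    return triggers
```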
The cluster parameter vectors are expanded using an inverse Principal Component Analysis (PCA) and, in dependence on this, approximations of each of the envelopes of sub-bands of events are constructed. The sub-band envelopes are applied to a sub-band filtered GWN, and combined to generate an entire sound event.
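The sketch below illustrates this final construction step: the sampled vector is expanded with the fitted PCA from the clustering stage, each sub-band envelope is rebuilt from its polynomial coefficients, and the envelope is imposed on sub-band filtered GWN. The fixed event length and the vector layout (matching the earlier event_vector sketch) are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def render_event(vector, pca, fs, band_edges, poly_order=5, length_s=0.25, seed=None):
    """Expand a sampled vector via inverse PCA and rebuild the event from sub-band noise."""
    rng = np.random.default_rng(seed)
    full = pca.inverse_transform(vector.reshape(1, -1)).ravel()

    n = int(length_s * fs)
    t = np.linspace(0.0, 1.0, n)
    per_band = poly_order + 1 + 3          # vector layout used in event_vector() above
    event = np.zeros(n)
    for i, (lo, hi) in enumerate(zip(band_edges[:-1], band_edges[1:])):
        coeffs = full[i * per_band : i * per_band + poly_order + 1]
        envelope = np.clip(np.polyval(coeffs, t), 0.0, None)     # sub-band envelope
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        noise = sosfiltfilt(sos, rng.standard_normal(n))         # sub-band filtered GWN
        event += envelope * noise
    return event
```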
All of the sound events in the foreground of a synthesised signal are constructed and combined with the synthesised background component so as to generate a synthesised audio signal.
The probability, density and gain control parameters for each sound event allow for tonal and timbral control of the sonic texture.
Advantageously, embodiments provide controllable and interactive sound effect models, so that a sound designer may produce and control the type of sound that they desire.
Embodiments allow a designer to take a sound recording, or range of sound recordings, and feed them into a machine learning system in order to generate sound synthesis models. The designer can then use the sound synthesis models to generate a controllable sound synthesis system.
A particularly advantageous aspect of embodiments is that the features used to classify and describe the sounds correspond to perceptually meaningful descriptions, e.g. roughness, strength of an impact, noisiness, boominess, etc. This allows for the automatic generation of user controls that can be refined as appropriate.
Embodiments include a number of modifications and variations of the techniques as described above.
Embodiments include the use of any probability distribution models in the sound synthesis model generation and audio signal generation processes. For example, the synthesised background sound may alternatively be generated using a model other than GWN.
The audio signal generation processes preferably comprise an API so that the inputs to the audio signal generation processes can be received from a remote user and the synthesised audio signal generated by those processes can be provided to that user. For example, the synthesised audio signal may be provided over the Internet to a user who is remote from the sound synthesis system.
The flow charts and descriptions thereof herein should not be understood to prescribe a fixed order of performing the method steps described therein. Rather, the method steps may be performed in any order that is practicable. Although the present invention has been described in connection with specific exemplary embodiments, it should be understood that various changes, substitutions, and alterations apparent to those skilled in the art can be made to the disclosed embodiments without departing from the spirit and scope of the invention as set forth in the appended claims.
Methods and processes described herein can be embodied as code (e.g., software code) and/or data. Such code and data can be stored on one or more computer-readable media, which may include any device or medium that can store code and/or data for use by a computer system. When a computer system reads and executes the code and/or data stored on a computer-readable medium, the computer system performs the methods and processes embodied as data structures and code stored within the computer-readable storage medium. In certain embodiments, one or more of the steps of the methods and processes described herein can be performed by a processor (e.g., a processor of a computer system or data storage system). It should be appreciated by those skilled in the art that computer-readable media include removable and non-removable structures/devices that can be used for storage of information, such as computer-readable instructions, data structures, program modules, and other data used by a computing system/environment. A computer-readable medium includes, but is not limited to, volatile memory such as random access memories (RAM, DRAM, SRAM); and non-volatile memory such as flash memory, various read-only memories (ROM, PROM, EPROM, EEPROM), magnetic and ferromagnetic/ferroelectric memories (MRAM, FeRAM), phase-change memory and magnetic and optical storage devices (hard drives, magnetic tape, CDs, DVDs); network devices; or other media now known or later developed that are capable of storing computer-readable information/data. Computer-readable media should not be construed or interpreted to include any propagating signals.

Claims

1. A method of generating a sound synthesis model of an audio signal, the method comprising:
receiving an audio signal;
identifying audio events in the received audio signal;
separating each of the identified audio events into audio sub-bands;
generating a model of each of the audio sub-bands;
applying a machine learning technique to each generated model of each of the audio sub-bands so as to determine a plurality of different types of audio event in the received audio signal;
generating, for each of the determined plurality of different types of audio event, a model of the type of audio event; and
generating a sound synthesis model of the received audio signal in dependence on each of the generated models for types of audio event.
2. The method according to claim 1, wherein the machine learning technique is an unsupervised clustering technique.
3. The method according to claim 1 or 2, wherein each model for a type of audio event is a probability distribution.
4. The method according to any preceding claim, further comprising:
separating the received audio signal into a foreground component and a background component;
wherein said identification of audio events in the received audio signal is only performed on the foreground component of the received audio signal; and the sound synthesis model of the received audio signal is generated in dependence on the background component.
5. The method according to claim 4, wherein separating the received audio signal into a foreground component and a background component comprises performing a Short Time Fourier Transform.
6. The method according to any preceding claim, wherein the model of each of the audio sub-bands is generated by using a gamma distribution model and/or a polynomial regression model.
7. The method according to any preceding claim, further comprising determining the principal components of each of the audio sub-bands; wherein the machine learning technique is applied in dependence on the determined principal components.
8. The method according to any preceding claim, wherein the received audio signal is a plurality of audio signals.
9. The method according to any preceding claim, wherein the method is computer-implemented.
10. The method according to any preceding claim, wherein the method is implemented by software modules of a computing device.
11. A computer program comprising instructions that, when executed by a computing device, cause the computing device to perform the method according to any of claims 1 to 10.
12. A computing device arranged to perform the method according to any of claims 1 to 10.
13. A method of synthesising an audio signal, the method comprising:
receiving a model for generating a synthesised background component of an audio signal; receiving a plurality of models for generating a synthesised foreground component of the audio signal, wherein each of the plurality of models is for a different type of audio event; receiving control parameters of the models; generating a synthesised background component of the audio signal in dependence on the model for generating a synthesised background component of the audio signal and one or more of the control parameters; and generating a synthesised foreground component of the audio signal in dependence on the received plurality of models for generating a synthesised foreground component of the audio signal and one or more of the control parameters; wherein generating the synthesised foreground component of the audio signal comprises: generating vectors by sampling each of the models for generating a synthesised foreground component of an audio signal; generating sub-band envelopes in dependence on the generated vectors; and applying the envelopes to one or more sub-band filtered probability density distributions.
14. The method according to claim 13, wherein the received model for generating a synthesised background component and/or plurality of models for generating a synthesised foreground component are generated according to the method of any of claims 1 to 10.
15. The method according to claim 13 or 14, wherein the received model for generating a synthesised background component is a filtered probability density distribution.
16. The method according to any of claims 13 to 15, further comprising expanding the generated vectors using an inverse principal components analysis; wherein the sub-band envelopes are generated in dependence on the expanded vectors.
17. The method according to any of claims 13 to 16, wherein the control parameters for the models for generating a synthesised foreground component comprise one or more of density, density distribution, gain, gain distribution, timbral and tonal control parameters.
18. The method according to any of claims 13 to 17, wherein the audio signal is synthesised substantially in real-time.
19. The method according to any of claims 13 to 18, wherein one or more of the model for generating a synthesised background component of an audio signal, plurality of models for generating a synthesised foreground component of the audio signal and control parameters of the models are received via a communications network; and the synthesised background and foreground components of the audio signal are transmitted over the communications network.
20. The method according to any of claims 13 to 19, wherein the received audio signal is a plurality of audio signals.
21. The method according to any of claims 13 to 20, wherein the method is computer-implemented.
22. The method according to any of claims 13 to 21, wherein the method is implemented by software modules of a computing device.
23. A computer program comprising instructions that, when executed by a computing device, cause the computing device to perform the method according to any of claims 13 to 22.
24. A computing device arranged to perform the method according to any of claims 13 to 22.
PCT/GB2018/053299 2017-11-14 2018-11-14 Generation of sound synthesis models WO2019097227A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB1718800.4A GB201718800D0 (en) 2017-11-14 2017-11-14 Sound effects synthesis
GB1718800.4 2017-11-14

Publications (1)

Publication Number Publication Date
WO2019097227A1 true WO2019097227A1 (en) 2019-05-23

Family

ID=60788393

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2018/053299 WO2019097227A1 (en) 2017-11-14 2018-11-14 Generation of sound synthesis models

Country Status (2)

Country Link
GB (1) GB201718800D0 (en)
WO (1) WO2019097227A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010044719A1 (en) * 1999-07-02 2001-11-22 Mitsubishi Electric Research Laboratories, Inc. Method and system for recognizing, indexing, and searching acoustic signals

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JONATHAN DOHERTY ET AL: "Pattern Matching Techniques for Replacing Missing Sections of Audio Streamed across Wireless Networks", ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY (TIST), ASSOCIATION FOR COMPUTING MACHINERY CORPORATION, 2 PENN PLAZA, SUITE 701 NEW YORK NY 10121-0701 USA, vol. 6, no. 2, 31 March 2015 (2015-03-31), pages 1 - 38, XP058067860, ISSN: 2157-6904, DOI: 10.1145/2663358 *
LEVY ET AL: "Extraction of High-Level Musical Structure From Audio Data and Its Application to Thumbnail Generation", ACOUSTICS, SPEECH AND SIGNAL PROCESSING, 2006. ICASSP 2006 PROCEEDINGS . 2006 IEEE INTERNATIONAL CONFERENCE ON TOULOUSE, FRANCE 14-19 MAY 2006, PISCATAWAY, NJ, USA,IEEE, PISCATAWAY, NJ, USA, 14 May 2006 (2006-05-14) - 19 May 2006 (2006-05-19), pages V - V, XP031015952, ISBN: 978-1-4244-0469-8, DOI: 10.1109/ICASSP.2006.1661200 *
RADHAKRISHNAN R ET AL: "Modelling sports highlights using a time series clustering framework & model interpretation", VISUAL COMMUNICATIONS AND IMAGE PROCESSING; 20-1-2004 - 20-1-2004; SAN JOSE,, vol. 5682, 1 January 2005 (2005-01-01), pages 269 - 276, XP009120135, ISBN: 978-1-62841-730-2, DOI: 10.1117/12.588059 *

Also Published As

Publication number Publication date
GB201718800D0 (en) 2017-12-27

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18811894

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18811894

Country of ref document: EP

Kind code of ref document: A1