EP3113180A1 - Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal - Google Patents
- Publication number
- EP3113180A1 (application EP15306085.0A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- speech
- gap
- speech signal
- transcript
- voice
- Prior art date
- 2015-07-02
- Legal status
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/005—Correction of errors induced by the transmission channel, if related to the coding algorithm
Abstract
Description
- The present principles relate to a method for performing audio inpainting on a speech signal, and an apparatus for performing audio inpainting on a speech signal.
- Audio inpainting is the problem of recovering audio samples which are missing or distorted due to, e.g., lost IP packets during a Voice over IP (VoIP) transmission or any other kind of deterioration. Audio inpainting algorithms have various applications, ranging from IP packet loss recovery (especially in VoIP or mobile telephony) and voice censorship cancelling to various types of damaged audio repair, including declipping and declicking. Moreover, inpainting might be used for speech modification, e.g., to replace a word or a sequence of words in a speech sequence by some other words. While signal completion has been thoroughly investigated for image and video inpainting, this is much less the case for audio data in general and speech in particular.
- Adler et al. [1] introduced an audio inpainting algorithm for the specific purpose of audio declipping, i.e., intended to recover missing audio samples in the time domain that were clipped due to, e.g., the limited range of the acquisition device. Techniques for filling missing segments in the time-frequency domain have also been developed [2,3]. However, these methods are not suitable in the case of large spectral holes, especially when all frequency bins are missing in certain time frames. Drori et al. [4] proposed another approach to audio inpainting in the spectral domain, relying on exemplar spectral patches taken from the known part of the spectrogram. Bahat et al. [7] proposed a method for filling moderate gaps, e.g. corresponding to the loss of several successive IP packets, especially in the case of speech signals. These approaches are based on the self-similarity of some speech features within the signal and thus perform poorly if the missing part is actually very different from the rest. The known approaches, including the speech-specific method in [7], are unable to cope with situations where quite large temporal portions of a speech signal are missing. For example, one such gap can cover an entire word or a sequence of words. Indeed, methods based on audio patch similarity or speech feature similarities are unable to recreate entire missing words.
- A novel method to fill gaps in speech data while preserving speech meaning and voice characteristics is disclosed. It has been found that, if a gap occurs in a speech signal, it is very helpful to use any kind of information in order to fill the gap, and that it is possible to fill the gap by using a text transcript of the corresponding utterance. The disclosed speech audio inpainting technique plausibly recovers speech parts that are lost due to, e.g., specific audio editing or lossy transmission, with the help of synthetic speech generated from the text transcript of the missing part. The synthesized speech is modified by conventional voice conversion (e.g., as in [5]) to fit the original speaker's voice.
- A text transcript of the missing speech part is generated or given; e.g., it can be provided by a user, inferred by natural language processing techniques based on the known phrases before and/or after the gap, or available from any other source. The text transcript of the missing speech part is used to complete an obfuscated speech signal. This allows leveraging the recent progress of text-to-speech (TTS) synthesizers in generating very natural, high-quality speech data.
In principle, a method for speech inpainting comprises synthesizing speech for a gap that occurs in a speech signal using a transcript of the speech signal, converting the synthesized speech by voice conversion according to the original speech signal, and blending the converted synthesized speech into the original speech signal to fill the gap.
An apparatus for performing speech inpainting on a speech signal is disclosed in claim 11. The apparatus comprises a speech analyzer adapted for detecting a gap in the speech signal, a speech synthesizer adapted for performing automatic speech synthesis from a text transcript at least for the gap, a voice converter adapted for performing voice conversion to adapt the synthesized speech to the original speaker's voice, and a mixer adapted for blending the converted synthesized speech into the original speech audio track. In one embodiment of the mixer, temporal and/or phase mismatches are removed.
Voice conversion is a process that transforms a speech signal so that the voice of one person, called the source speaker, sounds as if the speech had been uttered by another person, called the target speaker. In a usual voice conversion workflow, two steps have to be considered: a learning step and a conversion step. During the learning step, a mapping function is learned that maps voice parameters of the source speaker to voice parameters of the target speaker. To model the differences between the two speakers, some training data from both speakers are needed. For conversion within the same language, it is more conventional to use parallel training data, i.e. a set of sentences uttered by both the source and the target speaker. In the present case, the target speaker is the one whose data are missing, whereas the "source speaker" is the synthesized speech. For the training, target data can be extracted from the region surrounding the gap or, in the case of a famous speaker, in one embodiment it can be retrieved from a database, e.g. on the Internet. In another embodiment, training data for the target speaker can be recorded, e.g. by asking the target speaker to say some words, utterances or sentences. The source data are then synthesized with a text-to-speech synthesizer from the transcript of the source speech, in one embodiment.
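To make the data flow of this workflow concrete, the following minimal Python sketch chains the four components described above (gap detection, TTS synthesis, voice-conversion training and conversion, blending). All of the callables passed in (detect_gap, synthesize_from_transcript, learn_mapping, convert_voice, blend_into_gap) are hypothetical placeholders used only for illustration; they are not part of the disclosed embodiments.

```python
import numpy as np

def inpaint_speech(signal, fs, transcript,
                   detect_gap, synthesize_from_transcript,
                   learn_mapping, convert_voice, blend_into_gap):
    """Illustrative pipeline only: the five callables are hypothetical
    stand-ins for the speech analyzer, TTS synthesizer, voice-conversion
    training/conversion, and mixer described in the text."""
    # 1. Speech analyzer: locate the gap (start/end sample indices).
    gap_start, gap_end = detect_gap(signal)

    # 2. TTS synthesizer: produce synthetic speech for the missing text
    #    (plus the adjacent phonemes, as suggested above).
    synthetic = synthesize_from_transcript(transcript, fs)

    # 3. Voice conversion: learn a source->target mapping from the
    #    reliable speech around the gap (parallel data), then convert.
    reliable = np.concatenate([signal[:gap_start], signal[gap_end:]])
    mapping = learn_mapping(source=synthetic, target=reliable, fs=fs)
    converted = convert_voice(synthetic, mapping, fs)

    # 4. Mixer: time-scale and blend the converted speech into the gap.
    return blend_into_gap(signal, converted, gap_start, gap_end, fs)
```

Concrete possibilities for each placeholder are sketched in later paragraphs (analysis-synthesis, GMM/NMF mapping learning, and boundary post-processing).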
In one embodiment, text is extracted from the available speech signal by means of automatic speech recognition (ASR). Then, it is determined that one or more words or sounds are missing due to a gap in the speech signal, a context of the remainder of the speech signal is analyzed, and, according to the context and the remainder, one or more words, sounds or syllables are determined that are omitted by the gap. This can be done by estimating or guessing (e.g., in one embodiment by using a dictionary), or by obtaining from any source a complete transcript of the speech signal that covers at least the gap. It is easier to locate the gap if the complete transcript covers some more speech before and/or after the gap.
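As one possible way to obtain such a transcript automatically, the sketch below uses the off-the-shelf SpeechRecognition Python package as a stand-in ASR engine; this choice, and the use of its free web recognizer, are assumptions of the example only, since the embodiment does not prescribe any particular ASR system.

```python
import speech_recognition as sr

def transcribe_available_speech(wav_path):
    """Rough ASR transcript of the undamaged portions of the signal
    (illustrative only; any ASR engine could be used instead)."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)   # read the whole file
    try:
        # Free web API used here purely for illustration.
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return ""                           # nothing intelligible recognized
```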
In one embodiment, a computer readable medium has stored thereon executable instructions that, when executed on a processor, cause the processor to perform a method as disclosed above.
It is clear that in case of fully missing words it may in general simply be impossible to recover the missing speech, because it is not known what was said. At least some embodiments of the present principles provide a solution for this case, for example by generating the transcript based on the undistorted speech signal.
In one embodiment, a method for performing speech inpainting on a speech signal comprises automatically generating a transcript on an input speech signal, determining voice characteristics of the input speech signal, processing the input speech signal, whereby a processed speech signal is obtained, detecting a gap in the processed speech signal, automatically synthesizing from the transcript speech at least for the gap, voice converting the synthesized speech according to the determined voice characteristics, and inpainting the processed speech signal, wherein the voice converted synthesized speech is filled into the gap.
In one embodiment, an apparatus for performing speech inpainting on a speech signal comprises at least one hardware component, such as a hardware processor, and a non-transitory, tangible, computer-readable storage medium tangibly embodying at least one software component, and when executing on the at least one hardware processor, the software component causes the hardware processor to automatically perform the steps of claim 1.
Advantageous embodiments of the invention are disclosed in the dependent claims, the following description and the figures.
Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in
- Fig.1 a general workflow of a speech inpainting system;
- Fig.2 embodiments of a voice conversion system;
- Fig.3 two embodiments for learning voice conversion;
- Fig.4 an overview of post-processing of the converted utterance;
- Fig.5 a flow-chart of a method for performing speech inpainting; and
- Fig.6 a block diagram of an apparatus for performing speech inpainting.
Fig.1 shows a general workflow of a speech inpainting system. An input speech signal has a missing part 10, i.e. a gap. A textual transcript of the missing part is available; for example, it can be generated from the original speech signal. A speech utterance corresponding to the missing part 10 is synthesized 51 from the known text transcript through text-to-speech synthesis in a TTS synthesis block 11. However, such TTS synthesis systems may synthesize speech only phoneme by phoneme. Thus, if a gap occurs in the middle of a phoneme, it is unlikely that only the utterance corresponding to the missing part can be recovered. In one embodiment, it is therefore more appropriate to also synthesize the first and the last phoneme, corresponding to the beginning and the end of the missing part respectively, to reproduce the linguistic information given by the pronunciation context. This is also advantageous because it avoids speech discontinuity issues. After automatic speech synthesis 11, the generated speech is used for the gap filling. However, the synthesized speech generally has few similarities with the original speaker in terms of timbre and prosody. Therefore, its spectral features and fundamental frequency (F0) trajectory are adapted via voice conversion 12 to be similar to those of the target speech. Finally, the gap is filled by the voice-converted synthesized speech signal, which results in an inpainted output signal 13.
In the general pipeline, a conventional speech analysis-synthesis system (e.g. [6]) is used. This system enables flexible modifications of speech signals without loss of naturalness. In one embodiment, three parameters are extracted from the input signal: a STRAIGHT smooth spectrogram representing the evolution of the vocal tract without time and frequency interference, an F0 trajectory with a voiced/unvoiced detector, and an aperiodic component. The first two parameters are manipulated by voice conversion to modify the speech. A STRAIGHT smooth spectrogram is known e.g. from [6]. STRAIGHT is a tool for speech analysis and synthesis. It allows flexible manipulations of speech because it decomposes speech, in the sense of the source-filter model, into three parts: a smooth spectrum representing a spectral envelope, a fundamental frequency F0 measurement, and an aperiodic component. Basically, the fundamental frequency F0 measurement and the aperiodic component correspond to the source of the source-filter model, while the smooth spectrum representing a spectral envelope corresponds to the filter. The smooth STRAIGHT spectrum is a good representation of the envelope, because STRAIGHT reconstructs the envelope as if it were sampled by the source. Manipulating this spectrum allows good modification of the timbre of the voice.
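As an illustration of this analysis-synthesis decomposition: STRAIGHT itself is a MATLAB research tool, so the sketch below substitutes the WORLD vocoder (via the pyworld Python binding), which provides an analogous decomposition into a smooth spectral envelope, an F0 trajectory (zero in unvoiced frames) and an aperiodicity component. Using WORLD instead of STRAIGHT is an assumption of this example, not part of the described embodiment.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def analyze(wav_path):
    """Decompose speech into envelope / F0 / aperiodicity (WORLD, not STRAIGHT)."""
    x, fs = sf.read(wav_path)            # mono signal assumed
    x = x.astype(np.float64)             # WORLD expects double precision
    f0, t = pw.harvest(x, fs)            # F0 trajectory (0 in unvoiced frames)
    sp = pw.cheaptrick(x, f0, t, fs)     # smooth spectral envelope ("filter")
    ap = pw.d4c(x, f0, t, fs)            # aperiodicity ("source" noise part)
    return f0, sp, ap, fs

def resynthesize(f0, sp, ap, fs):
    """Reconstruct a waveform after the envelope/F0 have been manipulated."""
    return pw.synthesize(f0, sp, ap, fs)
```

Manipulating f0 and sp before resynthesis corresponds to the voice-conversion step described next.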
In one embodiment, the voice conversion system 12 comprises two steps. First, a mapping function is learned on training data, and then it is used to convert new utterances. In order to get the mapping function, the parameters to convert are extracted (e.g. with the STRAIGHT system) and aligned with dynamic time warping (DTW [8]). Then the learning phase is performed, e.g. with a Gaussian mixture model (GMM [9]) or non-negative matrix factorization (NMF [10]), to get the mapping function.
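The following sketch illustrates one plausible realisation of this learning step: mel-frequency cepstral coefficients (standing in for STRAIGHT spectral parameters), DTW alignment of parallel source/target utterances, and a joint-density GMM whose conditional mean converts new frames one by one. This is a deliberately simplified version of GMM-based conversion; the method of [9] additionally models dynamic features and uses maximum-likelihood trajectory generation. The feature choice, mixture count and helper names are assumptions of the example.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def learn_gmm_mapping(x_src, x_tgt, fs, n_mix=8, n_mfcc=24):
    """Joint-density GMM on DTW-aligned frames (simplified sketch of [9]).
    MFCCs stand in here for STRAIGHT spectral parameters."""
    S = librosa.feature.mfcc(y=x_src, sr=fs, n_mfcc=n_mfcc)   # (n_mfcc, T_src)
    T = librosa.feature.mfcc(y=x_tgt, sr=fs, n_mfcc=n_mfcc)   # (n_mfcc, T_tgt)
    _, wp = librosa.sequence.dtw(S, T)            # warping path (frame pairs)
    wp = wp[::-1]                                 # chronological order
    joint = np.hstack([S[:, wp[:, 0]].T, T[:, wp[:, 1]].T])   # (N, 2*n_mfcc)
    gmm = GaussianMixture(n_components=n_mix, covariance_type='full').fit(joint)
    return gmm, n_mfcc

def convert_frames(gmm, n_mfcc, src_frames):
    """Convert source frames (N, n_mfcc) with the conditional mean E[y|x]."""
    out = np.zeros_like(src_frames, dtype=float)
    for i, x in enumerate(src_frames):
        post, cond = [], []
        for k in range(gmm.n_components):
            mu, cov = gmm.means_[k], gmm.covariances_[k]
            mu_x, mu_y = mu[:n_mfcc], mu[n_mfcc:]
            cov_xx, cov_yx = cov[:n_mfcc, :n_mfcc], cov[n_mfcc:, :n_mfcc]
            diff = x - mu_x
            # conditional mean of the target part given the source part
            cond.append(mu_y + cov_yx @ np.linalg.solve(cov_xx, diff))
            # unnormalised responsibility of mixture k for this frame
            post.append(gmm.weights_[k]
                        * np.exp(-0.5 * diff @ np.linalg.solve(cov_xx, diff))
                        / np.sqrt(np.linalg.det(cov_xx)))
        post = np.array(post) / np.sum(post)
        out[i] = np.sum(post[:, None] * np.array(cond), axis=0)
    return out
```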
Fig.2 shows different embodiments of a voice conversion system, using a speech database. It is important to note that the original speech samples from the database do not necessarily need to cover the words or context of the current speech signal on which the inpainting is performed. The mapping function used to perform the prediction comprises two kinds of parameters: general parameters that need to be calculated only once, and parameters specific to the utterance that must be calculated for each utterance to be converted. The general parameters may comprise e.g. Gaussian Mixture Model (GMM) parameters for GMM-based voice conversion and/or a phoneme dictionary for Non-negative Matrix Factorization (NMF)-based voice conversion. The specific parameters may comprise posterior probabilities for GMM-based voice conversion and/or temporal activation matrices for NMF-based voice conversion.
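The split between general and utterance-specific parameters can be illustrated with a small NMF sketch: a pair of spectral dictionaries learned once from parallel data (general parameters) and, per utterance, the activations of the source dictionary re-applied to the target dictionary (specific parameters). This is a strongly simplified sketch in the spirit of [10]; it omits the phoneme-categorized dictionaries and uses crude truncation instead of DTW alignment, both being assumptions of the example.

```python
import numpy as np
import librosa
from sklearn.decomposition import NMF

def learn_paired_dictionaries(x_src, x_tgt, fs, n_atoms=64):
    """General parameters (computed once): paired source/target spectral
    dictionaries from roughly aligned parallel data."""
    S = np.abs(librosa.stft(x_src))                # freq x frames (source)
    T = np.abs(librosa.stft(x_tgt))                # freq x frames (target)
    n = min(S.shape[1], T.shape[1])                # crude alignment by truncation
    model = NMF(n_components=n_atoms, init='nndsvda', max_iter=400)
    W_src = model.fit_transform(S[:, :n])          # source dictionary, freq x atoms
    H = model.components_                          # shared activations, atoms x frames
    # Target dictionary explaining the target spectra with the same activations.
    W_tgt = np.maximum(np.linalg.lstsq(H.T, T[:, :n].T, rcond=None)[0].T, 0.0)
    return W_src, W_tgt

def estimate_activations(V, W, n_iter=200, eps=1e-9):
    """Utterance-specific parameters: activations of a *fixed* dictionary,
    via standard multiplicative updates (Euclidean cost)."""
    H = np.random.default_rng(0).random((W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def convert_spectrogram(x_new, fs, W_src, W_tgt):
    """Activations found on the source dictionary are re-applied to the
    target dictionary to predict the converted magnitude spectrogram."""
    V = np.abs(librosa.stft(x_new))
    H_new = estimate_activations(V, W_src)
    return W_tgt @ H_new
```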
In one embodiment, where the speaker is a well-known person for whom many original speech samples can be retrieved from the Internet, the user is asked to enter, for a partly available speech signal 22, the speaker's identity in a query 21. The query results in voice characteristics of the speaker, or in original speech samples of the speaker from which the voice characteristics are extracted. The synthesized or original speech samples 23 obtained from a database or from the Internet 24 may be used to fill the gap. This approach may use standard voice conversion 25.
In another embodiment, where it is not possible to obtain sufficient original speech samples (e.g. because the speaker is not famous), voice characteristics of the speaker are retrieved upon a query 26 or automatically from the remaining part of the speech signal 27, which serve as a small set of training data. The synthesized speech for the gap 28 is voice converted 29 using the retrieved voice characteristics from around the gap.
Thus, two options may be considered to obtain training data, depending on whether the target speaker is a famous person or not. If the target speaker is e.g. well-known, it is generally possible to retrieve characteristic voice data from the Internet via the speaker's identity, or to try guessing the identity with automatic speaker recognition. Otherwise, only local data, i.e. data around the gap or some additional data, are available and the voice conversion system is adapted to the amount of data.
Fig.3 shows two embodiments 30,35 for learning voice conversion. As described above, a mapping function is learned on training data, and then it is used to convert new utterances. In order to get the mapping function 34, speech is generated from the training data by a text-to-speech block 31,38 (e.g. a speech synthesizer) and voice conversion parameters are extracted (e.g. with the STRAIGHT system) and aligned 32 to the synthesized speech with dynamic time warping (DTW). Then the learning phase is performed 33,39, e.g. with a Gaussian mixture model (GMM [9]) or Non-negative Matrix Factorization (NMF [10]), to get the mapping function 34. In one embodiment 30, only a small amount of training data is available, since only the speech surrounding the gap can be used as reliable speech to extract voice parameters. In another embodiment 33, a large amount of training data can be obtained from a database 36 such as the Internet, and automatic speech recognition 37 is used.
After the speech parameters are converted thanks to the mapping function 34, a waveform signal is resynthesized, e.g. by a conventional STRAIGHT synthesizer with the new voice parameters.
In some embodiments, one or more additional steps may need to be performed, since once conversion is performed the resulting speech may still not perfectly fill the gap for the following reasons. First, edge mismatches such as spectral, fundamental frequency and phase discontinuities may need to be counteracted. Indeed, spectral trajectories of the formants are naturally smooth due to the slow variation of the vocal tract shape. Fundamental frequency and temporal phase are not as smooth as the spectral trajectories, but still need continuity to sound natural. Although the speech signal is converted, it is unlikely that the parameters of the spectral envelope trajectory, fundamental frequency and temporal phase are temporally continuous at the border of the gap. Thus, in one embodiment, the parameters of the spectral envelope trajectory, fundamental frequency and temporal phase are adapted to the ones nearby in the non-missing part of the speech, so that any discontinuity at the border is reduced. Besides, duration of the converted utterance may be longer or shorter than the true missing utterance. Therefore, in one embodiment, the speaking rate is converted to match the available part of the speech signal. If the speaking rate cannot be converted, at least a temporal adjustment may be done on the global time scaling of the converted utterance.
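As a minimal illustration of reducing such border discontinuities, the sketch below shifts and tilts the converted F0 contour so that its first and last voiced values meet the reliable F0 values measured just before and after the gap. The linear correction ramp is an assumption of the example; an actual embodiment could adapt spectral-envelope and phase parameters in a similar fashion.

```python
import numpy as np

def match_f0_at_borders(f0_conv, f0_left, f0_right):
    """Adapt the converted F0 contour (frames, 0 = unvoiced) so that its
    voiced endpoints meet the reliable F0 before (f0_left) and after
    (f0_right) the gap. Sketch only: a simple linear correction ramp."""
    f0 = f0_conv.astype(float).copy()
    idx = np.flatnonzero(f0 > 0)            # voiced frames
    if idx.size < 2:
        return f0                           # nothing voiced to adjust
    start_err = f0_left - f0[idx[0]]
    end_err = f0_right - f0[idx[-1]]
    # linear correction ramp spanning the converted segment
    ramp = np.interp(np.arange(len(f0)), [idx[0], idx[-1]], [start_err, end_err])
    f0[idx] += ramp[idx]                    # leave unvoiced frames at zero
    return f0
```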
A method dealing with these issues is briefly outlined in Fig.4, which shows an overview of post-processing of the converted utterance. First, the converted set of frames may not properly fill the gap. This can be seen as spectral discontinuities 4a. According to an embodiment of the present principles, the gap may be properly filled by finding 41, in the spectral domain, the best frames at the end of the converted spectrogram and merging them with the reliable spectrogram of the available portion of the speech signal. This can be done with the known dynamic time warping (DTW) algorithm. Aligning converted and reliable spectra is a way to find which data are used to fill the gap. Then, in a similar adjustment to handle phase discontinuities 4b, the best samples to merge are found 42 on the signal waveform. Such an issue appears, for instance, when speech is voice converted; since the speech waveform is quasi-periodic, this property can be exploited by cross-correlating the edges of the reliable signal with the beginning of the converted signal. Peaks in the cross-correlation indicate the best indices at which to merge both signals. Then, the fundamental frequency F0 trajectory is modified 43 so that F0 and F0-derivative (dF0/dt) discontinuities 4c are minimized, especially at the edges of the converted signal 4d. The F0 trajectory can be computed in the same way as the spectral parameters. The edges of the resulting signal are "allocated" to the gap edges. However, the body of the signal may not be suited to the gap: it may be too long or too short. Therefore, in one embodiment the converted signal with F0 modification is time-scaled 44 (without pitch modification, in one embodiment) according to the indices found in the phase adjustment step. Finally, the length-adjusted (i.e. "stretched" or "compressed") signal is overlap-added 45 at the edges to minimize fuzzy artefacts that could still remain.
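The following sketch illustrates two of these post-processing steps in isolation: choosing the best waveform merge index by normalized cross-correlation (step 42) and cross-fading the signals at the junction (overlap-add, step 45). The raised-cosine fade, the search range and the function names are assumptions of this example; the DTW spectral alignment 41, F0 adjustment 43 and time-scaling 44 described above are omitted here.

```python
import numpy as np

def best_merge_offset(reliable_edge, converted_start, search=400):
    """Step 42 (sketch): pick the lag that best aligns the quasi-periodic
    waveforms, via normalized cross-correlation over a small search range."""
    n = min(len(reliable_edge), len(converted_start)) - search
    ref = reliable_edge[-n:]
    best_lag, best_score = 0, -np.inf
    for lag in range(search):
        seg = converted_start[lag:lag + n]
        score = np.dot(ref, seg) / (np.linalg.norm(ref) * np.linalg.norm(seg) + 1e-12)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def overlap_add_join(reliable_left, converted, fade_len=256):
    """Step 45 (sketch): raised-cosine cross-fade at the junction to avoid clicks."""
    fade = 0.5 * (1.0 - np.cos(np.pi * np.arange(fade_len) / fade_len))   # 0 -> 1
    head = reliable_left[:-fade_len]
    mix = reliable_left[-fade_len:] * (1.0 - fade) + converted[:fade_len] * fade
    return np.concatenate([head, mix, converted[fade_len:]])
```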
One advantage of the disclosed audio inpainting technique is that even long gaps can be inpainted. It is also robust when only a small amount of data is available in voice conversion.
Fig.5 shows a flow-chart of a method for performing speech inpainting on a speech signal, according to one embodiment. The method 50 comprises determining 51 voice characteristics of the speech signal, detecting 52 a gap in the speech signal, automatically synthesizing 53, from a transcript, speech at least for the gap, voice converting 54 the synthesized speech according to the determined voice characteristics, and inpainting 55 the speech signal, wherein the voice converted synthesized speech is filled into the gap.
In one embodiment, the method further comprises a step of automatically generating 56 said transcript on an input speech signal.
In one embodiment, the method further comprises a step of processing 57 the voice signal, wherein the gap is generated during the processing, and wherein the transcript is generated before the processing.
In one embodiment, the step of automatically synthesizing 53 speech at least for the gap comprises retrieving from a database recorded speech data from a natural speaker. This may support, enhance, replace or control the synthesis.
In one embodiment, the method further comprises steps of detecting 581 that the transcript does not cover the gap, determining 582 one or more words or sounds omitted by the gap, and adding 583 the estimated word or sound to the transcript before synthesizing speech from the transcript.
In one embodiment, the determining 582 is done by estimating or guessing the one or more words or sounds (e.g. from a dictionary).
In one embodiment, the determining 582 is done by retrieving a complete transcript of the speech through other channels (e.g. the Internet).
In one embodiment, the determined voice characteristics comprise parameters for a spectral envelope and a fundamental frequency F0 (in other words, the timbre and prosody of the speech).
In one embodiment, the method further comprises adapting parameters for a spectral envelope trajectory, a fundamental frequency and temporal phase at one or both boundaries of the gap to match the corresponding parameters of the available adjacent speech signal before and/or after the gap. This is in order for the parameters to be temporally continuous before and/or after the gap.
In one embodiment, the method further comprises a step of time-scaling the voice-converted speech signal before it is filled into the gap.
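A minimal sketch of such a time-scaling step, assuming librosa's phase-vocoder time stretch (which changes duration without changing pitch) and a single global stretch rate derived from the gap length; both assumptions are illustrative only.

```python
import librosa

def fit_to_gap(converted, fs, gap_duration_s):
    """Stretch or compress the converted speech to the gap length
    without changing its pitch (phase-vocoder time stretch)."""
    rate = (len(converted) / fs) / gap_duration_s   # >1 compresses, <1 stretches
    return librosa.effects.time_stretch(converted, rate=rate)
```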
Fig.6 shows a block diagram of an apparatus 60 for performing speech inpainting on a speech signal, according to one embodiment. The apparatus comprises a speech analyser 61 for detecting a gap G in the speech signal SI, a speech synthesizer 62 for automatically synthesizing from a transcript T speech SS at least for the gap, a voice converter 63 for converting the synthesized speech SS according to the determined voice characteristics VC, and a mixer 64 for inpainting the speech signal, wherein the voice converted synthesized speech VCS is filled into the gap G of the speech signal to obtain an inpainted speech output signal SO.
In one embodiment, the apparatus further comprises a voice analyzer 65 for determining voice characteristics of the speech signal.
In one embodiment, the apparatus further comprises a speech-to-text converter 66 for automatically generating a transcript of the speech signal.
In one embodiment, the apparatus further comprises a database having stored speech data of example phonemes or words of natural speech, and the speech synthesizer 62 retrieves speech data from the database for automatically synthesizing the speech at least for the gap.
In one embodiment, the apparatus further comprises an interface 67 for receiving a complete transcript of the speech signal, the transcript covering at least text that is omitted by the gap.
In one embodiment, the apparatus further comprises a time-scaler for time-scaling the voice-converted speech signal before it is filled into the gap.
In one embodiment, an apparatus for performing speech inpainting on a speech signal comprises a processor and a memory storing instructions that, when executed by the processor, cause the apparatus to perform the method steps of any of the methods disclosed above.
It is noted that the use of the verb "comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. Furthermore, the use of the article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Furthermore, the invention resides in each and every novel feature or combination of features.
It should be noted that although the STRAIGHT system is mentioned, other types of speech analysis and synthesis systems may be used instead, as would be apparent to those of ordinary skill in the art, all of which are contemplated within the spirit and scope of the invention.
While there has been shown, described, and pointed out fundamental novel features of the present invention as applied to preferred embodiments thereof, it will be understood that various omissions, substitutions and changes in the apparatus and method described, in the form and details of the devices disclosed, and in their operation, may be made by those skilled in the art without departing from the spirit of the present invention. It is expressly intended that all combinations of those elements that perform substantially the same function in substantially the same way to achieve the same results are within the scope of the invention. Substitutions of elements from one described embodiment to another are also fully intended and contemplated. Each feature disclosed in the description and (where appropriate) the claims and drawings may be provided independently or in any appropriate combination. Features may, where appropriate, be implemented in hardware, software, or a combination of the two. Reference numerals appearing in the claims are by way of illustration only and shall have no limiting effect on the scope of the claims.
- [1] Amir Adler, Valentin Emiya, Maria Jafari, Michael Elad, Rémi Gribonval, Mark D. Plumbley, "Audio inpainting," IEEE Transactions on Audio, Speech and Language Processing, IEEE, 2012, 20 (3), pp. 922 - 932
- [2] P. Smaragdis et al. "Missing data imputation for spectral audio signal," Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2009
- [3] J. Le Roux et al., "Computational auditory induction as a missing data model-fitting problem with Bregman divergence," Speech Communication, vol. 53, no. 5, pp. 658-676, 2011
- [4] I. Drori et al. "Spectral sound gap filling," Proc. ICPR 2004, pp. 871-874
- [5] Jani Nurminen, Hanna Silén, Victor Popa, Elina Helander and Moncef Gabbouj (2012). "Voice Conversion, Speech Enhancement, Modeling and Recognition - Algorithms and Applications", Dr. S Ramakrishnan (Ed.), ISBN: 978-953-51-0291-5, InTech, DOI: 10.5772/37334.
- [6] Hideki Kawahara. Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP'97, Munich, Germany, April 21-24, 1997, pages 1303-1306, 1997.
- [7] Y. Bahat, Y. Y. Schechner, and M. Elad, "Self-content-based audio inpainting," Signal Processing, vol. 111, pp. 61-72, 2015.
- [8] D. Ellis (2003). Dynamic Time Warp (DTW) in Matlab, Web resource, available at http://www.ee.columbia.edu/-dpwe/resources/matlab/dtw/. Visited 4/29/2015.
- [9] Toda, T.; Black, A.W.; Tokuda, K., "Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222-2235, Nov. 2007
- [10] Aihara, R.; Nakashika, T.; Takiguchi, T.; Ariki, Y., "Voice conversion based on Non-negative matrix factorization using phoneme-categorized dictionary," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7894-7898, 4-9 May 2014
Claims (15)
- A method (50) for performing inpainting on a speech signal, comprising
  - determining (51) voice characteristics of the speech signal;
  - detecting (52) a gap in the speech signal;
  - automatically synthesizing (53), from a transcript, speech at least for the gap;
  - voice converting (54) the synthesized speech according to the determined voice characteristics; and
  - inpainting (55) the speech signal, wherein the voice converted synthesized speech is filled into the gap.
- The method of claim 1, further comprising a step of automatically generating (56) said transcript on the speech signal.
- The method according to claim 1 or 2, further comprising a step of processing (57) the voice signal, wherein the gap is generated during the processing, and wherein the transcript is generated before the processing.
- The method according to any of claims 1-3, wherein the step of automatically synthesizing (53), from a transcript, speech at least for the gap comprises retrieving from a database recorded speech data from a natural speaker.
- The method according to any of claims 1-4, further comprising steps of
  - detecting (581) that the transcript does not cover the gap;
  - determining (582) one or more words or sounds omitted by the gap; and
  - adding (583) the estimated word or sound to the transcript before synthesizing speech from the transcript.
- The method according to claim 5, wherein the determining (582) is done by estimating or guessing the one or more words or sounds.
- The method according to claim 5, wherein the determining (582) is done by retrieving a complete transcript of the speech through other channels.
- The method according to any of claims 1-7, wherein the determined voice characteristics comprise parameters for a spectral envelope and a fundamental frequency.
- The method according to one of the claims 1-8, further comprising adapting parameters for a spectral envelope trajectory, a fundamental frequency and a temporal phase at one or both boundaries of the gap to match the corresponding parameters of the available adjacent speech signal before and/or after the gap.
- The method according to one of the claims 1-9, further comprising a step of time-scaling the voice-converted speech signal before it is filled into the gap.
- An apparatus (60) for performing speech inpainting on a speech signal, comprising
  - a speech analyser (61) for detecting a gap in the speech signal;
  - a speech synthesizer (62) for automatically synthesizing from a transcript speech at least for the gap;
  - a voice converter (63) for converting the synthesized speech according to the determined voice characteristics; and
  - a mixer (64) for inpainting the speech signal, wherein the voice converted synthesized speech is filled into the gap of the speech signal.
- The apparatus of claim 11, further comprising a voice analyzer (65) for determining voice characteristics of the speech signal.
- The apparatus of claim 11 or 12, further comprising a speech-to-text converter (66) for automatically generating a transcript of the speech signal.
- The apparatus of one of the claims 11-13, further comprising a database having stored speech data of example phonemes or words of natural speech, wherein the speech synthesizer (62) retrieves speech data from the database for automatically synthesizing the speech at least for the gap.
- The apparatus of one of the claims 11-14, further comprising an interface (67) for receiving a complete transcript of the speech signal, the transcript covering at least text that is omitted by the gap.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP15306085.0A EP3113180B1 (en) | 2015-07-02 | 2015-07-02 | Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal |
PL15306085T PL3113180T3 (en) | 2015-07-02 | 2015-07-02 | Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP15306085.0A EP3113180B1 (en) | 2015-07-02 | 2015-07-02 | Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3113180A1 true EP3113180A1 (en) | 2017-01-04 |
EP3113180B1 EP3113180B1 (en) | 2020-01-22 |
Family
ID=53610835
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP15306085.0A Active EP3113180B1 (en) | 2015-07-02 | 2015-07-02 | Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP3113180B1 (en) |
PL (1) | PL3113180T3 (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110165912A1 (en) * | 2010-01-05 | 2011-07-07 | Sony Ericsson Mobile Communications Ab | Personalized text-to-speech synthesis and personalized speech feature extraction |
US20150023345A1 (en) * | 2013-07-17 | 2015-01-22 | Technion Research And Development Foundation Ltd. | Example-based audio inpainting |
Non-Patent Citations (11)
Title |
---|
AIHARA, R.; NAKASHIKA, T.; TAKIGUCHI, T.; ARIKI, Y.: "Voice conversion based on Non-negative matrix factorization using phoneme-categorized dictionary", ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014 IEEE INTERNATIONAL CONFERENCE ON, 4 May 2014 (2014-05-04), pages 7894 - 7898 |
AMIR ADLER ET AL: "Audio Inpainting", IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, USA, vol. 20, no. 3, 1 March 2012 (2012-03-01), pages 922 - 932, XP011397627, ISSN: 1558-7916, DOI: 10.1109/TASL.2011.2168211 * |
AMIR ADLER; VALENTIN EMIYA; MARIA JAFARI; MICHAEL ELAD; REMI GRIBONVAL; MARK D. PLUMBLEY: "Audio inpainting", IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, vol. 20, no. 3, 2012, pages 922 - 932, XP011397627, DOI: doi:10.1109/TASL.2011.2168211 |
D. ELLIS, DYNAMIC TIME WARP (DTW) IN MATLAB, 2003, Retrieved from the Internet <URL:http://www.ee.columbia.edu/~dpwe/resources/matlab/dtw> |
DRORI ET AL.: "Spectral sound gap filling", PROC. ICPR, 2004, pages 871 - 874, XP010724530, DOI: doi:10.1109/ICPR.2004.1334397 |
HIDEKI KAWAHARA: "Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited", 1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, ICASSP'97, 21 April 1997 (1997-04-21), pages 1303 - 1306 |
J. LE ROUX ET AL.: "Computational auditory induction as a missing data model-fitting problem with Bregman divergence", SPEECH COMMUNICATION, vol. 53, no. 5, 2011, pages 658 - 676 |
JANI NURMINEN; HANNA SILÉN; VICTOR POPA; ELINA HELANDER; MONCEF GABBOUJ: "Voice Conversion, Speech Enhancement, Modeling and Recognition - Algorithms and Applications", 2012, INTECH |
P. SMARAGDIS ET AL.: "Missing data imputation for spectral audio signal", PROC. IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP, 2009 |
TODA, T.; BLACK, A.W.; TOKUDA, K.: "Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory", AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE TRANSACTIONS ON, vol. 15, no. 8, November 2007 (2007-11-01), pages 2222 - 2235, XP011192987, DOI: doi:10.1109/TASL.2007.907344 |
Y. BAHAT; Y. Y. SCHECHNER; M. ELAD: "Self-content-based audio inpainting", SIGNAL PROCESSING, vol. 111, 2015, pages 61 - 72 |
Also Published As
Publication number | Publication date |
---|---|
PL3113180T3 (en) | 2020-06-01 |
EP3113180B1 (en) | 2020-01-22 |
Similar Documents
Publication | Title |
---|---|
EP3855340B1 (en) | Cross-lingual voice conversion system and method |
US10255903B2 (en) | Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system |
US9911407B2 (en) | System and method for synthesis of speech from provided text |
AU2020227065B2 (en) | Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system |
Tian et al. | Correlation-based frequency warping for voice conversion |
US10706867B1 (en) | Global frequency-warping transformation estimation for voice timbre approximation |
EP4018439B1 (en) | Systems and methods for adapting human speaker embeddings in speech synthesis |
CN108369803B (en) | Method for forming an excitation signal for a parametric speech synthesis system based on a glottal pulse model |
EP3113180B1 (en) | Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal |
CN116994553A (en) | Training method of speech synthesis model, speech synthesis method, device and equipment |
JP7040258B2 (en) | Pronunciation converter, its method, and program |
US10446133B2 (en) | Multi-stream spectral representation for statistical parametric speech synthesis |
Unnibhavi et al. | A survey of speech recognition on south Indian Languages |
JP5245962B2 (en) | Speech synthesis apparatus, speech synthesis method, program, and recording medium |
KR20100111544A (en) | System for proofreading pronunciation using speech recognition and method therefor |
JPWO2009041402A1 (en) | Frequency axis expansion/contraction coefficient estimation apparatus, system method, and program |
KR102051235B1 (en) | System and method for outlier identification to remove poor alignments in speech synthesis |
US11302300B2 (en) | Method and apparatus for forced duration in neural speech synthesis |
Gu et al. | An improved voice conversion method using segmental GMMs and automatic GMM selection |
JP6468518B2 (en) | Basic frequency pattern prediction apparatus, method, and program |
CN116884385A (en) | Speech synthesis method, device and computer readable storage medium |
Chomwihoke et al. | Comparative study of text-to-speech synthesis techniques for mobile linguistic translation process |
Qian et al. | A unified trajectory tiling approach to high quality TTS and cross-lingual voice transformation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
AX | Request for extension of the european patent |
Extension state: BA ME |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
17P | Request for examination filed |
Effective date: 20171005 |
|
RBV | Designated contracting states (corrected) |
Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: GRANT OF PATENT IS INTENDED |
|
INTG | Intention to grant announced |
Effective date: 20180115 |
|
GRAJ | Information related to disapproval of communication of intention to grant by the applicant or resumption of examination proceedings by the epo deleted |
Free format text: ORIGINAL CODE: EPIDOSDIGR1 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
INTC | Intention to grant announced (deleted) | ||
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: EXAMINATION IS IN PROGRESS |
|
17Q | First examination report despatched |
Effective date: 20180618 |
|
RAP1 | Party data changed (applicant data changed or rights of an application transferred) |
Owner name: INTERDIGITAL CE PATENT HOLDINGS |
|
GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: GRANT OF PATENT IS INTENDED |
|
INTG | Intention to grant announced |
Effective date: 20190926 |
|
GRAS | Grant fee paid |
Free format text: ORIGINAL CODE: EPIDOSNIGR3 |
|
GRAA | (expected) grant |
Free format text: ORIGINAL CODE: 0009210 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE PATENT HAS BEEN GRANTED |
|
AK | Designated contracting states |
Kind code of ref document: B1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
REG | Reference to a national code |
Ref country code: GB Ref legal event code: FG4D |
|
REG | Reference to a national code |
Ref country code: CH Ref legal event code: EP |
|
REG | Reference to a national code |
Ref country code: AT Ref legal event code: REF Ref document number: 1227425 Country of ref document: AT Kind code of ref document: T Effective date: 20200215 |
|
REG | Reference to a national code |
Ref country code: IE Ref legal event code: FG4D |
|
REG | Reference to a national code |
Ref country code: DE Ref legal event code: R096 Ref document number: 602015045941 Country of ref document: DE |
|
REG | Reference to a national code |
Ref country code: NL Ref legal event code: FP |
|
REG | Reference to a national code |
Ref country code: NO Ref legal event code: T2 Effective date: 20200122 |
|
REG | Reference to a national code |
Ref country code: LT Ref legal event code: MG4D |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: FI
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: PT
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200614
Ref country code: RS
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SE
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: LV
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: IS
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200522
Ref country code: HR
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: GR
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200423
Ref country code: BG
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200422 |
|
REG | Reference to a national code |
Ref country code: DE Ref legal event code: R097 Ref document number: 602015045941 Country of ref document: DE |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: EE
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: LT
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: SM
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: RO
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: CZ
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: ES
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: SK
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: DK
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122 |
|
REG | Reference to a national code |
Ref country code: AT Ref legal event code: MK05 Ref document number: 1227425 Country of ref document: AT Kind code of ref document: T Effective date: 20200122 |
|
PLBE | No opposition filed within time limit |
Free format text: ORIGINAL CODE: 0009261 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT |
|
26N | No opposition filed |
Effective date: 20201023 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: AT
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: IT
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SI
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: MC
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122 |
|
REG | Reference to a national code |
Ref country code: CH Ref legal event code: PL |
|
REG | Reference to a national code |
Ref country code: BE Ref legal event code: MM Effective date: 20200731 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: CH
Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES
Effective date: 20200731
Ref country code: IE
Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES
Effective date: 20200702
Ref country code: LU
Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES
Effective date: 20200702
Ref country code: LI
Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES
Effective date: 20200731 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: BE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20200731 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: TR
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: MT
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: CY
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: MK
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122
Ref country code: AL
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20200122 |
|
P01 | Opt-out of the competence of the unified patent court (upc) registered |
Effective date: 20230511 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: NO Payment date: 20230719 Year of fee payment: 9 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: PL Payment date: 20240621 Year of fee payment: 10 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: NL Payment date: 20240725 Year of fee payment: 10 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: DE Payment date: 20240730 Year of fee payment: 10 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: GB Payment date: 20240724 Year of fee payment: 10 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: FR Payment date: 20240725 Year of fee payment: 10 |