EP3113180B1 - Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal - Google Patents

Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal

Info

Publication number
EP3113180B1
EP3113180B1 (application EP15306085.0A)
Authority
EP
European Patent Office
Prior art keywords
speech
gap
speech signal
transcript
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
EP15306085.0A
Other languages
German (de)
French (fr)
Other versions
EP3113180A1 (en)
Inventor
Pierre Prablanc
Quang Khanh Ngoc DUONG
Alexey Ozerov
Patrick Perez
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
InterDigital CE Patent Holdings SAS
Original Assignee
InterDigital CE Patent Holdings SAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by InterDigital CE Patent Holdings SAS
Priority to PL15306085T (PL3113180T3)
Priority to EP15306085.0A (EP3113180B1)
Publication of EP3113180A1
Application granted
Publication of EP3113180B1
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/005Correction of errors induced by the transmission channel, if related to the coding algorithm

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Description

    Field of the invention
  • The present principles relate to a method for performing audio inpainting on a speech signal, and an apparatus for performing audio inpainting on a speech signal.
  • Background
  • Audio inpainting is the problem of recovering audio samples which are missing or distorted due to, e.g., lost IP packets during a Voice over IP (VoIP) transmission or any other kind of deterioration. Audio inpainting algorithms have various applications, ranging from IP packet loss recovery (especially in VoIP or mobile telephony) and voice censorship cancellation to various types of damaged-audio repair, including declipping and declicking. Moreover, inpainting may be used for speech modification, e.g., to replace a word or a sequence of words in a speech sequence by other words. While signal completion has been thoroughly investigated for image and video inpainting, it is much less the case in the context of audio data in general and speech in particular.
  • Adler et al. [1] introduced an audio inpainting algorithm for the specific purpose of audio declipping, i.e., intended to recover missing audio samples in the time domain that were clipped due to, e.g., the limited range of the acquisition device. Techniques for filling missing segments in the time-frequency domain have also been developed [2,3]. However, these methods are not suitable in the case of large spectral holes, especially when all frequency bins are missing in certain time frames. Drori et al. [4] proposed another approach to audio inpainting in the spectral domain, relying on exemplar spectral patches taken from the known part of the spectrogram. Bahat et al. [7] proposed a method for filling moderate gaps, e.g. corresponding to the loss of several successive IP packets, especially in the case of speech signals. These approaches are based on the self-similarity of some speech features within the signal and thus perform poorly if the missing part is actually very different from the rest. The known approaches, including the speech-specific method in [7], are unable to cope with situations where quite large temporal segments are missing from a speech signal. For example, one such gap can cover an entire word or a sequence of words. Indeed, methods based on audio patch similarity or speech feature similarities are unable to recreate entire missing words.
  • Summary of the Invention
  • A novel method to fill gaps in speech data while preserving speech meaning and voice characteristics is disclosed in claim 1.
  • It has been found that, if a gap occurs in a speech signal, it is very helpful to use any kind of available information in order to fill the gap, and that it is possible to fill the gap by using a text transcript of the corresponding utterance. The disclosed speech audio inpainting technique plausibly recovers speech parts that are lost due to, e.g., specific audio editing or lossy transmission, with the help of synthetic speech generated from the text transcript of the missing part. The synthesized speech is modified by conventional voice conversion (e.g., as in [5]) to fit the original speaker's voice.
  • A text transcript of the missing speech part is generated or given: e.g., it can be provided by a user, inferred by natural language processing techniques from the known phrases before and/or after the gap, or available from any other source. The text transcript of the missing speech part is used to complete an obfuscated speech signal. This allows leveraging the recent progress of text-to-speech (TTS) synthesizers in generating very natural and high-quality speech data.
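  • As a purely illustrative sketch (not part of the claimed method), the transcript of the missing part can be rendered by any off-the-shelf TTS engine; the snippet below assumes the pyttsx3 Python package and an output file name chosen for this example only:

    # Minimal sketch: synthesize the transcript of the missing part with an
    # off-the-shelf TTS engine (pyttsx3 is only an illustrative choice).
    import pyttsx3

    def synthesize_missing_part(text, out_path="gap_synth.wav"):
        engine = pyttsx3.init()
        engine.save_to_file(text, out_path)   # render the utterance to a wave file
        engine.runAndWait()
        return out_path

    # synthesize_missing_part("the quick brown fox")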
  • In principle, a method for speech inpainting comprises synthesizing speech for a gap that occurs in a speech signal using a transcript of the speech signal, converting the synthesized speech by voice conversion according to the original speech signal, and blending the synthesized converted speech into the original speech signal to fill the gap.
  • An apparatus for performing speech inpainting on a speech signal is disclosed in claim 11. The apparatus comprises a speech analyzer that is adapted for detecting a gap in the speech signal, a speech synthesizer that is adapted for performing automatic speech synthesis from a text transcript at least for the gap, a voice converter that is adapted for performing voice conversion to adapt the synthesized speech to the original speaker's voice, and a mixer that is adapted for blending the converted synthesized speech into the original speech audio track. In one embodiment of the mixer, temporal and/or phase mismatches are removed.
  • Voice conversion is a process that transforms a speech signal uttered in the voice of one person, called the source speaker, so that it sounds as if it had been uttered by another person, called the target speaker. In a usual voice conversion workflow, two steps have to be considered: a learning step and a conversion step. During the learning step, a mapping function is learned that maps voice parameters of the source speaker to voice parameters of the target speaker. To model differences between the two speakers, some training data from both speakers are needed. For conversion within the same language, it is more conventional to use parallel training data, i.e. a set of sentences uttered by both the source and the target speaker. In the present case, the target speaker is the one whose data are missing, whereas the "source speaker" is the synthesized speech. For the training, target data can be extracted from the region surrounding the gap or, in the case of a famous speaker, in one embodiment it can be retrieved from a database, e.g. on the Internet. In another embodiment, training data for the target speaker can be recorded, e.g. by asking the target speaker to say some words, utterances or sentences. Source data are then synthesized with a text-to-speech synthesizer from the transcript of the source speech, in one embodiment.
  • In one embodiment, text is extracted from the available speech signal by means of automatic speech recognition (ASR). Then, it is determined that one or more words or sounds are missing due to a gap in the speech signal, a context of the remainder of the speech signal is analyzed, and, according to the context and the remainder, one or more words, sounds or syllables are determined that are omitted by the gap. This can be done by estimating or guessing (e.g., in one embodiment by using a dictionary), or by obtaining from any source a complete transcript of the speech signal that covers at least the gap. It is easier to locate the gap if the complete transcript covers some more speech before and/or after the gap.
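  • As an illustrative sketch only (the function names and the use of the standard-library sequence matcher are assumptions, not part of the claims), the words omitted by the gap can be located by aligning the ASR output of the damaged signal against such a complete transcript:

    import difflib

    def words_missing_in_gap(asr_words, full_transcript_words):
        """Align ASR output of the damaged signal against a complete transcript and
        return the word spans present in the transcript but absent from the ASR output."""
        sm = difflib.SequenceMatcher(a=asr_words, b=full_transcript_words, autojunk=False)
        missing = []
        for tag, _, _, j1, j2 in sm.get_opcodes():
            if tag in ("insert", "replace"):      # words only in the complete transcript
                missing.append(full_transcript_words[j1:j2])
        return missing

    # e.g. words_missing_in_gap("the quick fox".split(), "the quick brown fox".split())
    # -> [["brown"]]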
  • All following occurrences of the words "embodiment(s)" and "implementation(s)", if referring to feature combinations different from those defined by the independent claims, refer to examples which were originally filed but which do not represent embodiments/implementations of the presently claimed invention; these examples are still shown for illustrative purposes only.
  • In one embodiment, a computer readable medium has stored thereon executable instructions that when executed on a processor cause a processor to perform a method as disclosed above.
  • It is clear that in case of fully missing words it may in general simply be impossible to recover the missing speech, because it is not known what was said. At least some embodiments of the present principles provide a solution for this case, for example by generating the transcript based on the undistorted speech signal.
  • In one embodiment, a method for performing speech inpainting on a speech signal comprises automatically generating a transcript on an input speech signal, determining voice characteristics of the input speech signal, processing the input speech signal, whereby a processed speech signal is obtained, detecting a gap in the processed speech signal, automatically synthesizing from the transcript speech at least for the gap, voice converting the synthesized speech according to the determined voice characteristics, and inpainting the processed speech signal, wherein the voice converted synthesized speech is filled into the gap.
  • In one embodiment, an apparatus for performing speech inpainting on a speech signal comprises at least one hardware component, such as a hardware processor, and a non-transitory, tangible, computer-readable storage medium tangibly embodying at least one software component, and when executing on the at least one hardware processor, the software component causes the hardware processor to automatically perform the steps of claim 1.
  • Advantageous embodiments of the invention are disclosed in the dependent claims, the following description and the figures.
  • Brief description of the drawings
  • Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in
  • Fig.1
    a general workflow of a speech inpainting system;
    Fig.2
    embodiments of a voice conversion system;
    Fig.3
    two embodiments for learning voice conversion;
    Fig.4
    an overview of post-processing of the converted utterance;
    Fig.5
    a flow-chart of a method for performing speech inpainting, and
    Fig.6
    a block diagram of an apparatus for performing speech inpainting.
  • Detailed description of the invention
  • Fig.1 shows a general workflow of a speech inpainting system. An input speech signal has a missing part 10, i.e. a gap. A textual transcript of the missing part is available; for example, it can be generated from the original speech signal. A speech utterance corresponding to the missing part 10 is synthesized 51 from the known text transcription through text-to-speech synthesis in a TTS synthesis block 11. However, such TTS synthesis systems may synthesize speech only phoneme by phoneme. Thus, if a gap occurs in the middle of a phoneme, it is unlikely that only the utterance corresponding to the missing part can be recovered. In one embodiment, it is more appropriate to also synthesize the first and the last phonemes, corresponding to the beginning and the end of the missing part respectively, in order to reproduce the linguistic information given the pronunciation context. It is also advantageous because it avoids speech discontinuity issues. After automatic speech synthesis 11, the generated speech is used for the gap filling. However, the synthesized speech generally has few similarities with the original speaker in terms of timbre and prosody. Therefore, its spectral features and fundamental frequency (F0) trajectory are adapted via voice conversion 12 to be similar to those of the target speech. Finally, the gap is filled by the voice converted synthesized speech signal, which results in an inpainted output signal 13. In the general pipeline, a conventional speech analysis-synthesis system (e.g. [6]) is used. This system enables flexible modifications on speech signals without loss of naturalness. In one embodiment, three parameters are extracted from the input signal: a STRAIGHT smooth spectrogram representing the evolution of the vocal tract without time and frequency interference, an F0 trajectory with a voiced/unvoiced detector, and an aperiodic component. The first two parameters are manipulated by voice conversion to modify the speech. A STRAIGHT smooth spectrogram is known e.g. from [6]. STRAIGHT is a speech tool for speech analysis and synthesis. It allows flexible manipulations on speech because it decomposes speech, according to the source-filter model, into three parts: a smooth spectrum representing a spectral envelope, a fundamental frequency F0 measurement, and an aperiodic component. Basically, the fundamental frequency F0 measurement and the aperiodic component correspond to the source of the source-filter model, while the smooth spectrum representing a spectral envelope corresponds to the filter. The smooth STRAIGHT spectrum is a good representation of the envelope, because STRAIGHT reconstructs the envelope as if it were sampled by the source. Manipulating this spectrum allows a good modification of the timbre of the voice.
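  • For illustration only, a comparable source-filter decomposition and resynthesis can be sketched with the WORLD vocoder (a STRAIGHT-family analysis-synthesis tool) via the pyworld Python package; the package choice and file names are assumptions, not the system actually used here:

    import numpy as np
    import soundfile as sf
    import pyworld as pw

    x, fs = sf.read("speech.wav")          # mono speech signal, assumed file name
    x = np.ascontiguousarray(x, dtype=np.float64)

    f0, t = pw.harvest(x, fs)              # F0 trajectory (0 Hz marks unvoiced frames)
    sp = pw.cheaptrick(x, f0, t, fs)       # smooth spectral envelope (the "filter")
    ap = pw.d4c(x, f0, t, fs)              # aperiodicity (part of the "source")

    # After F0 and the spectral envelope have been modified by voice conversion,
    # a waveform can be resynthesized from the three components:
    y = pw.synthesize(f0, sp, ap, fs)
    sf.write("resynthesized.wav", y, fs)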
  • In one embodiment, the voice conversion system 12 comprises two steps. First a mapping function is learned on training data, and then it is used to convert new utterances. In order to obtain the mapping function, the parameters to convert are extracted (e.g. with the STRAIGHT system) and aligned with dynamic time warping (DTW [8]). Then the learning phase is performed, e.g. with a Gaussian mixture model (GMM [9]) or non-negative matrix factorization (NMF [10]), to obtain the mapping function.
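  • A minimal sketch of such a learning and conversion step is given below, assuming (frames, dims) feature matrices, a DTW alignment and a joint-density GMM mapping; the library choices, component count and function names are illustrative assumptions, not the prescribed implementation:

    import numpy as np
    import librosa
    from scipy.stats import multivariate_normal
    from sklearn.mixture import GaussianMixture

    def dtw_align(src_feats, tgt_feats):
        """DTW-align two (frames, dims) feature matrices and return aligned frame pairs."""
        _, wp = librosa.sequence.dtw(X=src_feats.T, Y=tgt_feats.T, metric="euclidean")
        wp = wp[::-1]                                  # warping path in forward order
        return src_feats[wp[:, 0]], tgt_feats[wp[:, 1]]

    def learn_mapping(X, Y, n_components=8):
        """Learning step: fit a joint-density GMM on stacked [source, target] vectors."""
        return GaussianMixture(n_components=n_components,
                               covariance_type="full").fit(np.hstack([X, Y]))

    def convert(gmm, X, dx):
        """Conversion step: minimum mean-square-error estimate E[y | x] under the joint GMM."""
        mu, S, w = gmm.means_, gmm.covariances_, gmm.weights_
        K = gmm.n_components
        # responsibilities of each mixture component given only the source part x
        logp = np.stack([np.log(w[k]) + multivariate_normal.logpdf(
                             X, mu[k, :dx], S[k, :dx, :dx], allow_singular=True)
                         for k in range(K)], axis=1)
        resp = np.exp(logp - logp.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        Y_hat = np.zeros((X.shape[0], mu.shape[1] - dx))
        for k in range(K):
            A = S[k, dx:, :dx] @ np.linalg.inv(S[k, :dx, :dx])
            Y_hat += resp[:, [k]] * (mu[k, dx:] + (X - mu[k, :dx]) @ A.T)
        return Y_hat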
  • Fig.2 shows different embodiments of a voice conversion system, using a speech database. It is important to note that the original speech samples from the database do not necessarily need to cover the words or the context of the current speech signal on which the inpainting is performed. The mapping function allowing to perform the prediction comprises two kinds of parameters: general parameters that need to be calculated only once, and parameters specific to the utterance that have to be calculated for each utterance to be converted. The general parameters may comprise e.g. Gaussian Mixture Model (GMM) parameters for GMM-based voice conversion and/or a phoneme dictionary for Non-negative Matrix Factorization (NMF)-based voice conversion. The specific parameters may comprise posterior probabilities for GMM-based voice conversion and/or temporal activation matrices for NMF-based voice conversion.
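  • By way of illustration, an exemplar-based NMF conversion can be sketched as follows: the general parameter is a pair of aligned source/target spectral dictionaries, and the utterance-specific parameter is the activation matrix estimated for each new utterance (the names and the plain Euclidean update rule are assumptions of this sketch):

    import numpy as np

    def nmf_activations(V, W, n_iter=200, eps=1e-12):
        """Estimate activations H >= 0 with the dictionary W held fixed, so that
        V is approximately W @ H (multiplicative updates, Euclidean cost)."""
        H = np.random.rand(W.shape[1], V.shape[1])
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)
        return H

    # A_src, A_tgt: DTW-aligned magnitude-spectrum exemplars, shape (freq_bins, n_exemplars)
    # S_src: magnitude spectrogram of the synthesized utterance, shape (freq_bins, n_frames)
    def nmf_convert(S_src, A_src, A_tgt):
        H = nmf_activations(S_src, A_src)   # utterance-specific temporal activations
        return A_tgt @ H                    # converted spectrogram with the target timbre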
  • In one embodiment, where the speaker is a well-known person for whom many original speech samples can be retrieved from the Internet, the user is asked to enter, for a partly available speech signal 22, the speaker's identity in a query 21. The query results in voice characteristics of the speaker, or in original speech samples of the speaker from which the voice characteristics are extracted. The synthesized or original speech samples 23 obtained from a database or from the Internet 24 may be used to fill the gap. This approach may use standard voice conversion 25.
  • In another embodiment, where it is not possible to obtain sufficient original speech samples (e.g. because the speaker is not famous), voice characteristics of the speaker are retrieved upon a query 26 or automatically from the remaining part of the speech signal 27, which serve as a small set of training data. The synthesized speech for the gap 28 is voice converted 29 using the retrieved voice characteristics from around the gap.
  • Thus, two options may be considered to obtain training data, depending on whether the target speaker is a famous person or not. If the target speaker is e.g. well-known, it is generally possible to retrieve characteristic voice data from the Internet via the speaker's identity, or to try guessing the identity with automatic speaker recognition. Otherwise, only local data, i.e. data around the gap or some additional data, are available and the voice conversion system is adapted to the amount of data.
  • Fig.3 shows two embodiments 30, 35 for learning voice conversion. As described above, a mapping function is learned on training data, and then it is used to convert new utterances. In order to obtain the mapping function 34, speech is generated from the training data by a text-to-speech block 31, 38 (e.g. a speech synthesizer), and voice conversion parameters are extracted (e.g. with the STRAIGHT system) and aligned 32 to the synthesized speech with dynamic time warping (DTW). Then the learning phase is performed 33, 39, e.g. with a Gaussian mixture model (GMM [9]) or Non-negative Matrix Factorization (NMF [10]), to obtain the mapping function 34. In one embodiment 30, only a small amount of training data is available, since only the speech surrounding the gap can be used as reliable speech from which to extract voice parameters. In another embodiment 35, a large amount of training data can be obtained from a database 36 such as the Internet, and automatic speech recognition 37 is used.
  • After speech parameters are converted thanks to the mapping function 34, a waveform signal is resynthesized, e.g. by a conventional STRAIGHT synthesizer with the new voice parameters.
  • In some embodiments, one or more additional steps may need to be performed, since once conversion is performed the resulting speech may still not perfectly fill the gap, for the following reasons. First, edge mismatches such as spectral, fundamental frequency and phase discontinuities may need to be counteracted. Indeed, spectral trajectories of the formants are naturally smooth due to the slow variation of the vocal tract shape. Fundamental frequency and temporal phase are not as smooth as the spectral trajectories, but still need continuity to sound natural. Although the speech signal is converted, it is unlikely that the parameters of the spectral envelope trajectory, fundamental frequency and temporal phase are temporally continuous at the borders of the gap. Thus, in one embodiment, the parameters of the spectral envelope trajectory, fundamental frequency and temporal phase are adapted to the ones nearby in the non-missing part of the speech, so that any discontinuity at the border is reduced. Besides, the duration of the converted utterance may be longer or shorter than that of the true missing utterance. Therefore, in one embodiment, the speaking rate is converted to match the available part of the speech signal. If the speaking rate cannot be converted, at least a temporal adjustment may be done on the global time scaling of the converted utterance.
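  • A minimal sketch of one such boundary adaptation is given below, assuming the F0 values of the reliable speech just before and after the gap are known; the linear-ramp strategy and all names are illustrative assumptions:

    import numpy as np

    def match_f0_boundaries(converted_f0, f0_before, f0_after, ramp=20):
        """Offset the converted F0 trajectory near its edges so that it meets the
        reliable F0 values at the gap borders; unvoiced frames (F0 == 0) are left untouched."""
        f0 = converted_f0.copy()
        voiced = f0 > 0
        if not voiced.any():
            return f0
        idx = np.where(voiced)[0]
        start_offset = f0_before - f0[idx[0]]
        end_offset = f0_after - f0[idx[-1]]
        n = len(f0)
        fade_in = np.clip(1.0 - np.arange(n) / ramp, 0.0, 1.0)               # full offset at the left edge
        fade_out = np.clip(1.0 - (n - 1 - np.arange(n)) / ramp, 0.0, 1.0)    # and at the right edge
        f0[voiced] += start_offset * fade_in[voiced] + end_offset * fade_out[voiced]
        return f0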
  • A method dealing with these issues is briefly outlined in Fig.4, which shows an overview of post-processing of the converted utterance. First, the converted set of frames may not properly fill the gap. This can be seen as spectral discontinuities 4a. According to an embodiment of the present principles, the gap may be properly filled by finding 41, in the spectral domain, the best frames at the ends of the converted spectrogram and merging them with the reliable spectrogram of the available portion of the speech signal. This can be done with the known dynamic time warping (DTW) algorithm. Aligning the converted and reliable spectra is a way to determine which data are used to fill the gap. Then, in a similar adjustment to handle phase discontinuities 4b, the best samples to merge are found 42 on the signal waveform. Such an issue appears, for instance, when speech is voice converted; here the waveform signal has the particularity of being periodic. This property is exploited in a cross-correlation between the edges of the reliable signal and the beginning of the converted signal. Peaks in the cross-correlation indicate the best indices at which to merge both signals. Then, the fundamental frequency F0 trajectory is modified 43 so that F0 and F0-derivative (dF0/dt) discontinuities 4c are minimized, especially at the edges of the converted signal 4d. The F0 trajectory can be computed in the same way as the spectral parameters. The edges of the resulting signal are "allocated" to the gap edges. However, the body of the signal may not be suited to the gap: it may be too long or too short. Therefore, in one embodiment, the converted signal with F0 modification is time-scaled 44 (without pitch modification, in one embodiment) according to the indices found in the phase adjustment step. Finally, the length-adjusted (i.e. "stretched" or "compressed") signal is overlap-added 45 at the edges to minimize fuzzy artefacts that could still remain.
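  • The waveform-level part of this post-processing can be sketched as follows, assuming illustrative sample counts, search range and a linear crossfade: a cross-correlation peak selects the merge index, and a short overlap-add crossfade hides any residual discontinuity:

    import numpy as np

    def best_merge_index(reliable_edge, converted, search=400):
        """Return the offset into `converted` where the periodic waveform best matches
        the last samples of the reliable signal (peak of the cross-correlation)."""
        segment = converted[:search + len(reliable_edge)]
        xcorr = np.correlate(segment, reliable_edge, mode="valid")
        return int(np.argmax(xcorr))

    def crossfade_merge(reliable, converted, overlap=256):
        """Overlap-add the converted signal after the reliable one with a linear crossfade."""
        fade = np.linspace(0.0, 1.0, overlap)
        joint = reliable[-overlap:] * (1.0 - fade) + converted[:overlap] * fade
        return np.concatenate([reliable[:-overlap], joint, converted[overlap:]])

    # Usage sketch: drop the first k samples of the converted signal so that its
    # phase lines up with the reliable edge, then crossfade the two signals.
    # k = best_merge_index(reliable[-256:], converted)
    # output = crossfade_merge(reliable, converted[k:])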
  • One advantage of the disclosed audio inpainting technique is that even long gaps can be inpainted. It is also robust when only a small amount of data is available in voice conversion.
  • Fig.5 shows a flow-chart of a method for performing speech inpainting on a speech signal, according to one embodiment. The method 50 comprises determining 51 voice characteristics of the speech signal, detecting 52 a gap in the speech signal, automatically synthesizing 53, from a transcript, speech at least for the gap, voice converting 54 the synthesized speech according to the determined voice characteristics, and inpainting 55 the speech signal, wherein the voice converted synthesized speech is filled into the gap.
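  • Purely as an orchestration sketch of these steps, the flow of Fig.5 could be expressed as follows; every component interface below is a hypothetical placeholder, not an API defined by the present principles:

    def inpaint_speech(signal, fs, transcript, analyzer, tts, voice_converter, mixer):
        """Hypothetical top-level flow following Fig.5; all arguments are assumed objects
        exposing the methods used below."""
        voice_chars = analyzer.voice_characteristics(signal, fs)      # step 51
        gap = analyzer.detect_gap(signal, fs)                         # step 52
        synth = tts.synthesize(transcript.text_for(gap))              # step 53
        converted = voice_converter.convert(synth, voice_chars)       # step 54
        return mixer.fill_gap(signal, gap, converted)                 # step 55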
  • In one embodiment, the method further comprises a step of automatically generating 56 said transcript on an input speech signal.
  • In one embodiment, the method further comprises a step of processing 57 the speech signal, wherein the gap is generated during the processing, and wherein the transcript is generated before the processing.
  • In one embodiment, the step of automatically synthesizing 53 speech at least for the gap comprises retrieving from a database recorded speech data from a natural speaker. This may support, enhance, replace or control the synthesis.
  • In one embodiment, the method further comprises steps of detecting 581 that the transcript does not cover the gap, determining 582 one or more words or sounds omitted by the gap, and adding 583 the estimated word or sound to the transcript before synthesizing speech from the transcript.
  • In one embodiment, the determining 582 is done by estimating or guessing the one or more words or sounds (e.g. from a dictionary).
  • In one embodiment, the determining 582 is done by retrieving a complete transcript of the speech through other channels (e.g. the Internet).
  • In one embodiment, the determined voice characteristics comprise parameters for a spectral envelope and a fundamental frequency F0 (in other words, the timbre and prosody of the speech).
  • In one embodiment, the method further comprises adapting parameters for a spectral envelope trajectory, a fundamental frequency and temporal phase at one or both boundaries of the gap to match the corresponding parameters of the available adjacent speech signal before and/or after the gap. This is in order for the parameters to be temporally continuous before and/or after the gap.
  • In one embodiment, the method further comprises a step of time-scaling the voice-converted speech signal before it is filled into the gap.
  • Fig.6 shows a block diagram of an apparatus 60 for performing speech inpainting on a speech signal, according to one embodiment. The apparatus comprises a speech analyser 61 for detecting a gap G in the speech signal SI, a speech synthesizer 62 for automatically synthesizing from a transcript T speech SS at least for the gap, a voice converter 63 for converting the synthesized speech SS according to the determined voice characteristics VC, and a mixer 64 for inpainting the speech signal, wherein the voice converted synthesized speech VCS is filled into the gap G of the speech signal to obtain an inpainted speech output signal SO.
  • In one embodiment, the apparatus further comprises a voice analyzer 65 for determining voice characteristics of the speech signal.
  • In one embodiment, the apparatus further comprises a speech-to-text converter 66 for automatically generating a transcript of the speech signal.
  • In one embodiment, the apparatus further comprises a database having stored speech data of example phonemes or words of natural speech, and the speech synthesizer 62 retrieves speech data from the database for automatically synthesizing the speech at least for the gap.
  • In one embodiment, the apparatus further comprises an interface 67 for receiving a complete transcript of the speech signal, the transcript covering at least text that is omitted by the gap.
  • In one embodiment, the apparatus further comprises a time-scaler for time-scaling the voice-converted speech signal before it is filled into the gap.
  • In one embodiment, an apparatus for performing speech inpainting on a speech signal comprises a processor and a memory storing instructions that, when executed by the processor, cause the apparatus to perform the method steps of any of the methods disclosed above.
  • It is noted that the use of the verb "comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. Furthermore, the use of the article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Furthermore, the invention resides in each and every novel feature or combination of features.
  • It should be noted that although the STRAIGHT system is mentioned, other types of speech analysis and synthesis systems may be used other than STRAIGHT, as would be apparent to those of ordinary skill in the art, all of which are contemplated within the scope of the invention.
  • While there has been shown, described, and pointed out fundamental novel features of the present invention as applied to preferred embodiments thereof, it will be understood that various omissions and substitutions and changes in the apparatus and method described, in the form and details of the devices disclosed, and in their operation, may be made by those skilled in the art without departing from the scope of the present invention. It is expressly intended that all combinations of those elements that perform substantially the same function in substantially the same way to achieve the same results are within the scope of the invention. Substitutions of elements from one described embodiment to another are also fully intended and contemplated. Each feature disclosed in the description and (where appropriate) the claims and drawings may be provided independently or in any appropriate combination. Features may, where appropriate be implemented in hardware, software, or a combination of the two. Reference numerals appearing in the claims are by way of illustration only and shall have no limiting effect on the scope of the claims.
  • The scope of the present invention is defined in the appended claims.
  • Cited References
    1. [1] Amir Adler, Valentin Emiya, Maria Jafari, Michael Elad, Remi Gribonval, Mark D. Plumbley, "Audio inpainting," IEEE Transactions on Audio, Speech and Language Processing, IEEE, 2012, 20 (3), pp. 922 - 932 , XP011397627
    2. [2] P. Smaragdis et al. "Missing data imputation for spectral audio signal," Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2009
    3. [3] J. Le Roux et al., "Computational auditory induction as a missing data model-fitting problem with Bregman divergence," Speech Communication, vol. 53, no. 5, pp. 658-676, 2011
    4. [4] I. Drori et al. "Spectral sound gap filling," Proc. ICPR 2004, pp. 871-874
    5. [5] Jani Nurminen, Hanna Silen, Victor Popa, Elina Helander and Moncef Gabbouj (2012). "Voice Conversion, Speech Enhancement, Modeling and Recognition - Algorithms and Applications", Dr. S Ramakrishnan (Ed.), ISBN: 978-953-51-0291-5, InTech, DOI: 10.5772/37334.
    6. [6] Hideki Kawahara, "Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited," in 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '97, Munich, Germany, April 21-24, 1997, pages 1303-1306.
    7. [7] Y. Bahat, Y. Y. Schechner, and M. Elad, "Self-content-based audio inpainting," Signal Processing, vol. 111, pp. 61-72, 2015.
    8. [8] D. Ellis (2003). Dynamic Time Warp (DTW) in Matlab, Web resource, available at http://www.ee.columbia.edu/~dpwe/resources/matlab/dtw/. Visited 4/29/2015.
    9. [9] Toda, T.; Black, A.W.; Tokuda, K., "Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222-2235, Nov. 2007.
    10. [10] Aihara, R.; Nakashika, T.; Takiguchi, T.; Ariki, Y., "Voice conversion based on Non-negative matrix factorization using phoneme-categorized dictionary," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7894-7898, 4-9 May 2014.

Claims (15)

  1. A method (50) comprising:
    - obtaining (51) voice characteristics of a speech signal;
    - detecting (52) a gap in the speech signal;
    - automatically synthesizing (53), from a transcript, speech at least for the gap in the speech signal;
    - voice converting (54) the synthesized speech according to the obtained voice characteristics of the speech signal; and
    - inpainting (55) the speech signal, wherein the voice converted synthesized speech is filled into the gap.
  2. The method of claim 1, comprising automatically generating (56) said transcript from the speech signal.
  3. The method according to claim 1 or 2, comprising processing (57) the speech signal, wherein the gap occurs during the processing, and wherein the transcript is generated before the processing.
  4. The method according to any of claims 1-3, wherein automatically synthesizing (53), from a transcript, speech at least for the gap comprises retrieving from a database recorded speech data from a natural speaker.
  5. The method according to any of claims 1-4, comprising
    - detecting (581) that the transcript does not cover the gap;
    - determining (582) one or more words or sounds omitted within the gap; and
    - adding (583) the determined word or sound to the transcript before synthesizing speech from the transcript.
  6. The method according to claim 5, wherein the determining (582) is done by estimating or guessing the one or more words or sounds.
  7. The method according to claim 5, wherein the determining (582) is done by retrieving a complete transcript of the speech through other channels.
  8. The method according to any of claims 1-7, wherein the voice characteristics comprise parameters for a spectral envelope and a fundamental frequency.
  9. The method according to any one of claims 1-8, comprising adapting parameters for a spectral envelope trajectory, a fundamental frequency and a temporal phase at one or both boundaries of the gap to match the corresponding parameters of the available adjacent speech signal before and/or after the gap.
  10. The method according to one of the claims 1-9, comprising time-scaling the voice-converted speech signal before it is filled into the gap.
  11. An apparatus (60) comprising:
    - a speech analyser (61) for detecting a gap in a speech signal;
    - a speech synthesizer (62) for automatically synthesizing, from a transcript, speech at least for a gap in the speech signal;
    - means for obtaining voice characteristics of the speech signal;
    - a voice converter (63) for converting the synthesized speech according to the obtained voice characteristics of the speech signal; and
    - a mixer (64) for inpainting the speech signal, wherein the voice converted synthesized speech is filled into the gap of the speech signal.
  12. The apparatus of claim 11, wherein said means for obtaining comprises a voice analyzer (65) for obtaining the voice characteristics of the speech signal.
  13. The apparatus of claim 11 or 12, comprising a speech-to-text converter (66) for automatically generating a transcript of the speech signal.
  14. The apparatus of one of the claims 11-13, comprising a database having stored speech data of example phonemes or words of natural speech, wherein the speech synthesizer (62) retrieves speech data from the database for automatically synthesizing the speech at least for the gap.
  15. The apparatus of one of the claims 11-14, comprising an interface (67) for receiving a complete transcript of the speech signal, the transcript covering at least text that is omitted by the gap.
EP15306085.0A 2015-07-02 2015-07-02 Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal Active EP3113180B1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PL15306085T PL3113180T3 (en) 2015-07-02 2015-07-02 Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal
EP15306085.0A EP3113180B1 (en) 2015-07-02 2015-07-02 Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP15306085.0A EP3113180B1 (en) 2015-07-02 2015-07-02 Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal

Publications (2)

Publication Number Publication Date
EP3113180A1 EP3113180A1 (en) 2017-01-04
EP3113180B1 true EP3113180B1 (en) 2020-01-22

Family

ID=53610835

Family Applications (1)

Application Number Title Priority Date Filing Date
EP15306085.0A Active EP3113180B1 (en) 2015-07-02 2015-07-02 Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal

Country Status (2)

Country Link
EP (1) EP3113180B1 (en)
PL (1) PL3113180T3 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3281194B1 (en) * 2015-04-10 2019-05-01 Dolby International AB Method for performing audio restauration, and apparatus for performing audio restauration
JP6452061B1 (en) * 2018-08-10 2019-01-16 クリスタルメソッド株式会社 Learning data generation method, learning method, and evaluation apparatus
US11356492B2 (en) * 2020-09-16 2022-06-07 Kyndryl, Inc. Preventing audio dropout

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102117614B (en) * 2010-01-05 2013-01-02 索尼爱立信移动通讯有限公司 Personalized text-to-speech synthesis and personalized speech feature extraction
US9583111B2 (en) * 2013-07-17 2017-02-28 Technion Research & Development Foundation Ltd. Example-based audio inpainting

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
None *

Also Published As

Publication number Publication date
PL3113180T3 (en) 2020-06-01
EP3113180A1 (en) 2017-01-04

Similar Documents

Publication Publication Date Title
EP3855340B1 (en) Cross-lingual voice conversion system and method
US10733974B2 (en) System and method for synthesis of speech from provided text
US10255903B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
AU2020227065B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
EP4018439B1 (en) Systems and methods for adapting human speaker embeddings in speech synthesis
US10706867B1 (en) Global frequency-warping transformation estimation for voice timbre approximation
EP3113180B1 (en) Method for performing audio inpainting on a speech signal and apparatus for performing audio inpainting on a speech signal
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
CA3004700C (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
US10446133B2 (en) Multi-stream spectral representation for statistical parametric speech synthesis
JP7040258B2 (en) Pronunciation converter, its method, and program
KR102051235B1 (en) System and method for outlier identification to remove poor alignments in speech synthesis
JP5245962B2 (en) Speech synthesis apparatus, speech synthesis method, program, and recording medium
KR20100111544A (en) System for proofreading pronunciation using speech recognition and method therefor
US11302300B2 (en) Method and apparatus for forced duration in neural speech synthesis
JP6468518B2 (en) Basic frequency pattern prediction apparatus, method, and program
CN116884385A (en) Speech synthesis method, device and computer readable storage medium
CN114299912A (en) Speech synthesis method and related device, equipment and storage medium
Chomwihoke et al. Comparative study of text-to-speech synthesis techniques for mobile linguistic translation process
Qian et al. A unified trajectory tiling approach to high quality TTS and cross-lingual voice transformation

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20171005

RBV Designated contracting states (corrected)

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

INTG Intention to grant announced

Effective date: 20180115

GRAJ Information related to disapproval of communication of intention to grant by the applicant or resumption of examination proceedings by the epo deleted

Free format text: ORIGINAL CODE: EPIDOSDIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

INTC Intention to grant announced (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20180618

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: INTERDIGITAL CE PATENT HOLDINGS

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

INTG Intention to grant announced

Effective date: 20190926

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE PATENT HAS BEEN GRANTED

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

REG Reference to a national code

Ref country code: AT

Ref legal event code: REF

Ref document number: 1227425

Country of ref document: AT

Kind code of ref document: T

Effective date: 20200215

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: DE

Ref legal event code: R096

Ref document number: 602015045941

Country of ref document: DE

REG Reference to a national code

Ref country code: NL

Ref legal event code: FP

REG Reference to a national code

Ref country code: NO

Ref legal event code: T2

Effective date: 20200122

REG Reference to a national code

Ref country code: LT

Ref legal event code: MG4D

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: FI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: PT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200614

Ref country code: RS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: LV

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: IS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200522

Ref country code: HR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: GR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200423

Ref country code: BG

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200422

REG Reference to a national code

Ref country code: DE

Ref legal event code: R097

Ref document number: 602015045941

Country of ref document: DE

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: EE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: LT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: SM

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: RO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: CZ

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: ES

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: SK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: DK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

REG Reference to a national code

Ref country code: AT

Ref legal event code: MK05

Ref document number: 1227425

Country of ref document: AT

Kind code of ref document: T

Effective date: 20200122

PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

26N No opposition filed

Effective date: 20201023

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: AT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: IT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: MC

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

REG Reference to a national code

Ref country code: CH

Ref legal event code: PL

REG Reference to a national code

Ref country code: BE

Ref legal event code: MM

Effective date: 20200731

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: CH

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200731

Ref country code: IE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200702

Ref country code: LU

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200702

Ref country code: LI

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200731

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: BE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200731

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: TR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: MT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: CY

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: AL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

P01 Opt-out of the competence of the unified patent court (upc) registered

Effective date: 20230511

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: NO

Payment date: 20230719

Year of fee payment: 9

Ref country code: GB

Payment date: 20230725

Year of fee payment: 9

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 20230725

Year of fee payment: 9

Ref country code: DE

Payment date: 20230726

Year of fee payment: 9

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: PL

Payment date: 20240621

Year of fee payment: 10

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: NL

Payment date: 20240725

Year of fee payment: 10